Logistic Regression Model Analysis: Female Role Models And Career Progression

Variables

  • The null hypothesis is that $\beta_1 = 0$, where $\beta_1$ is the coefficient on employee quality, meaning that employee quality has no effect on whether one is promoted or not.
  • Each coefficient is tested against the standard normal distribution, since the sample size is large. A coefficient is significant at the 5% level when its z-statistic falls outside the range ±1.96:
    1. Employee quality: the z-statistic is less than 1.96 in absolute value, hence the coefficient is not significantly different from zero.
    2. Female employee: the z-statistic falls outside the range, hence the coefficient is significantly different from zero at the 95% confidence level.
    3. Female manager: the z-statistic is less than 1.96 in absolute value, hence the coefficient is not significantly different from zero.
    4. Female employee * Female manager: the z-statistic is less than 1.96 in absolute value, and therefore the coefficient is not significantly different from zero.
    5. Constant term: the z-statistic falls outside the range, and the constant is therefore significantly different from zero.
  • The coefficient for employee quality is 0.326, so the odds ratio is exp(0.326) = 1.39: each additional unit of employee quality multiplies the odds of promotion by 1.39, other variables held constant.
  • The female employee coefficient is -1.338, so the odds ratio is exp(-1.338) = 0.26: with a male manager, the odds of promotion for a female employee are 0.26 times those of a male employee.
  • Predicted probabilities are obtained by computing the linear predictor, taking its exponential to get the odds, and converting the odds to a probability, odds / (1 + odds):
    1. There is a 0.83 probability that a female employee with an employee quality of 6 and a male manager is promoted.
    2. There is a 0.95 probability that a male employee with an employee quality of 6 and a male manager is promoted.
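The odds ratios quoted above come from exponentiating the logit coefficients, and the two predicted probabilities can be cross-checked against each other. The following is a minimal sketch that assumes only the coefficients 0.326 and -1.338 and the probability 0.95 reported above; the helper names `logit` and `inv_logit` are introduced here for illustration. Both predictions assume a male manager, so the interaction term drops out and the two linear predictors should differ only by the female employee coefficient.

```python
import math

def logit(p):
    """Log-odds of a probability p."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Probability implied by a linear predictor (log-odds) eta."""
    return 1.0 / (1.0 + math.exp(-eta))

# Coefficients reported in the text.
coef_quality = 0.326
coef_female = -1.338

# Odds ratios are the exponentials of the coefficients.
or_quality = math.exp(coef_quality)
or_female = math.exp(coef_female)

# Start from the reported probability for the male employee (0.95) and
# shift its log-odds by the female employee coefficient.
p_male = 0.95
p_female = inv_logit(logit(p_male) + coef_female)

print(f"odds ratio, employee quality: {or_quality:.2f}")
print(f"odds ratio, female employee:  {or_female:.2f}")
print(f"implied probability, female employee: {p_female:.2f}")
```

The implied probability agrees, to two decimals, with the 0.83 reported for the female employee.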

The independent variables chosen for the life expectancy model are urban population percentage, GDP per capita, average years of education after age 15, the polity of the country, political stability and government effectiveness. The urban population percentage is important because it gives the proportion of people exposed to an urban lifestyle and environment. This affects their mortality, and so the proportion affects overall mortality. In the dataset it appears as a percentage. GDP per capita is also expected to affect mortality.

GDP per capita is an indicator of living standards, with wealthier populations expected to have better life expectancy. It is a continuous variable in the dataset. The average years of education after age 15 shows how well educated the populace is. This influences lifestyle and choices and therefore life expectancy. The polity of a country is also considered, because more democratic countries may have fewer political killings.

A more democratic country may therefore have better life expectancy, and is represented by a higher value in the dataset. Political stability is also important, since a more stable country is less likely to have civil wars and will also have better security. A stable country would be expected to have better life expectancy. This is a continuous variable in the dataset. Government effectiveness is also considered, as it affects many aspects of life such as housing, medical insurance and security. A country with a more effective government may have better life expectancy. In the dataset this is also a continuous variable. When a model of life expectancy on the explanatory variables named above is fitted, the results are as shown in Table 1.

Table 1: Fitted model of life expectancy

Variable                           Estimate        Standard Error   t-value
Intercept                          51.98           0.59             88.68
Urban Population Percentage        0.14            0.01             12.72
GDP per Capita                     -1.24 x 10^-4   5.41 x 10^-5     -2.29
Average Education Years after 15   1.43            9.99 x 10^-2     14.28
Government Effectiveness           1.95            0.36             5.42
Polity                             0.04            2.81 x 10^-2     1.25
Political Stability                0.39            0.25             1.59

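The first hypothesis to test is whether all the slope coefficients are jointly zero, that is, $H_0: \beta_1 = \beta_2 = \dots = \beta_6 = 0$ against the alternative that at least one $\beta_j \neq 0$.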

Using the F-test, the p-value obtained is 2.2 x 10^-16. This is lower than the 5% significance level, showing that at least one coefficient is not zero. The next test considers the significance of each individual variable using its t-value. For each coefficient $\beta_j$ the null hypothesis becomes $H_0: \beta_j = 0$.

Since the sample is large, at 1040 observations, each t-value can be compared to the standard normal critical values of ±1.96. The urban population percentage coefficient is significant at the 5% level, as 12.72 is greater than 1.96, and the null hypothesis is therefore rejected. For the GDP per capita coefficient, the null hypothesis is also rejected, as -2.29 is less than -1.96, so the coefficient is not equal to zero. The coefficient for the average years of education after 15 has a t-value of 14.28, so the null hypothesis is rejected as this is greater than 1.96.

The null hypothesis for the government effectiveness coefficient is also rejected, as its t-value of 5.42 is greater than 1.96, so the coefficient is not zero. The coefficient for polity has a t-value of 1.25. The null hypothesis can therefore not be rejected, and the coefficient could be zero at the 5% significance level. Finally, the coefficient for political stability has a t-value of 1.59. The null hypothesis cannot be rejected, implying that this coefficient could also be zero.
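The individual tests above all apply the same decision rule, which can be written out as a short sketch. It assumes only the t-values reported in Table 1 and the large-sample critical value of 1.96; the variable names are the ones used in the regression output later in the text, and the function name `is_significant` is introduced here for illustration.

```python
CRITICAL_Z = 1.96  # two-sided 5% critical value of the standard normal

# t-values as reported in Table 1.
t_values = {
    "urban_population_pct": 12.72,
    "gdp_per_cap": -2.29,
    "education15": 14.28,
    "government_effectiveness": 5.42,
    "polity": 1.25,
    "political_stability": 1.59,
}

def is_significant(t_value, critical=CRITICAL_Z):
    """Reject H0: beta_j = 0 when |t| exceeds the critical value."""
    return abs(t_value) > critical

significant = [name for name, t in t_values.items() if is_significant(t)]
not_significant = [name for name, t in t_values.items() if not is_significant(t)]

print("significant at 5%:    ", significant)
print("not significant at 5%:", not_significant)
```

Taking the absolute value handles negative t-values such as the one for GDP per capita without a separate comparison against -1.96.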

Model Specification and Estimation

The fit of the model is assessed using the R squared. In this model the R squared value was 0.7218. This shows that 72.18% of the total variability in life expectancy is explained by the independent variables. The model therefore seems sufficient to explain the effect of the independent variables on life expectancy.

The model has an intercept of 51.98. This implies that if all the explanatory variables were zero, that is, an entirely rural population with no GDP, no education after age 15 and a zero score for polity, political stability and government effectiveness, then the life expectancy would be approximately 52 years. This estimate has a standard error of 0.59, which is small relative to the estimate and shows that the intercept is estimated precisely.

Urban population percentage has a coefficient estimate of 0.14. The standard error for this is 0.01, showing little error in this estimate. A one percentage point increase in urban population is associated with a 0.14 year increase in life expectancy, with all other factors kept constant. The estimate for GDP per capita is -1.24 x 10^-4, with a standard error of 5.41 x 10^-5. The standard error is almost half the size of the estimate, so this estimate is less precise than the previous one, although it is still significant at the 5% level. The estimate implies that a unit increase in GDP per capita is associated with a decrease in life expectancy of 0.000124 years, with all other factors kept constant. The impact of GDP may however not be as negligible as it looks, as the difference in GDP per capita between one country and the next may be quite large.
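To make the point about scale concrete, the sketch below multiplies each coefficient by a difference in the corresponding variable. The coefficients are the Table 1 estimates; the gaps of 10,000 units of GDP per capita and 10 percentage points of urban population are hypothetical values chosen for illustration and are not taken from the dataset.

```python
coef_gdp = -1.24e-4   # Table 1 estimate, per unit of GDP per capita
coef_urban = 0.14     # Table 1 estimate, per percentage point

def partial_effect(coefficient, difference):
    """Change in predicted life expectancy for a given difference in one
    explanatory variable, holding the other variables constant."""
    return coefficient * difference

# Hypothetical gaps between two countries (assumed, not from the data).
gdp_gap = 10_000
urban_gap = 10

effect_gdp = partial_effect(coef_gdp, gdp_gap)
effect_urban = partial_effect(coef_urban, urban_gap)

print(f"GDP per capita gap of {gdp_gap}: {effect_gdp:+.2f} years")
print(f"urban population gap of {urban_gap} points: {effect_urban:+.2f} years")
```

Under these assumed gaps the two effects are of similar magnitude, even though the raw coefficients differ by three orders of magnitude.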

The average years of education after 15 has an estimate of 1.43. The standard error for this is 0.1, which is small relative to the estimate. This variable has a large effect on life expectancy, with every extra year of education associated with a 1.43 year increase in life expectancy, with all other factors kept constant.

Government effectiveness has an estimate of 1.95 with a standard error of 0.36. This is a larger standard error than for the other significant variables, although the estimate is still more than five times its standard error. It has the largest positive coefficient in the model. A unit increase on the government effectiveness scale is associated with a 1.95 year increase in the life expectancy of a country, with all other factors remaining constant.

Polity has an estimate of 0.04 with a standard error of 0.03. The standard error is almost as large as the estimate, which is why the coefficient is not significantly different from zero. In this model, polity has the smallest positive coefficient. A unit increase on the polity scale is associated with an increase of only 0.04 years in life expectancy, with all other factors kept constant.

Finally, political stability has an estimate of 0.39. The standard error of 0.25 is large relative to the estimate, so this coefficient is also imprecisely estimated. A unit increase on the political stability scale is associated with an increase of 0.39 years in life expectancy, ceteris paribus.
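One way to compare the precision of these estimates is an approximate 95% confidence interval, the estimate plus or minus 1.96 standard errors. The sketch below assumes the estimates and standard errors reported in Table 1 and the large-sample normal approximation used for the t-tests; the function name `confidence_interval` is introduced here for illustration.

```python
Z_95 = 1.96  # large-sample critical value for a 95% interval

# (estimate, standard error) pairs as reported in Table 1.
table1 = {
    "intercept": (51.98, 0.59),
    "urban_population_pct": (0.14, 0.01),
    "gdp_per_cap": (-1.24e-4, 5.41e-5),
    "education15": (1.43, 9.99e-2),
    "government_effectiveness": (1.95, 0.36),
    "polity": (0.04, 2.81e-2),
    "political_stability": (0.39, 0.25),
}

def confidence_interval(estimate, std_error, z=Z_95):
    """Approximate interval: estimate +/- z * standard error."""
    half_width = z * std_error
    return estimate - half_width, estimate + half_width

intervals = {name: confidence_interval(est, se) for name, (est, se) in table1.items()}
covers_zero = [name for name, (low, high) in intervals.items() if low <= 0.0 <= high]

for name, (low, high) in intervals.items():
    print(f"{name:26s} [{low:.6f}, {high:.6f}]")
print("intervals containing zero:", covers_zero)
```

The intervals that contain zero belong to the same two variables, polity and political stability, whose t-values fell inside ±1.96.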

When a residual plot is made to check the linear model assumptions, the residuals are found to be scattered around zero, in line with their mean being zero. Few points lie far from the rest, which indicates very few outliers. The residuals do not, however, form a completely horizontal band, which may indicate that the variance of the residuals is not constant. The q-q plot shows that the residuals follow a normal distribution, which satisfies the assumption of normality of residuals.

Analysis Results

To check the homoscedasticity of this model, the Breusch-Pagan test is used. The test has the null hypothesis that the errors are homoscedastic. The test gives a p-value of 1.903 x 10^-5, hence the null hypothesis is rejected and heteroskedasticity is present. To correct for this, the following output was obtained.

================================================
                          Model 1      Model 2
------------------------------------------------
(Intercept)                 51.98 ***  51.98 ***
                            (0.59)     (0.59)
urban_population_pct         0.14 ***   0.14 ***
                            (0.01)     (0.01)
gdp_per_cap                 -0.00 *    -0.00 *
                            (0.00)     (0.00)
education15                  1.43 ***   1.43 ***
                            (0.10)     (0.10)
government_effectiveness     1.95 ***   1.95 ***
                            (0.36)     (0.36)
polity                       0.04       0.04
                            (0.03)     (0.03)
political_stability          0.39       0.39
                            (0.25)     (0.25)
------------------------------------------------
R^2                          0.72
Adj. R^2                     0.72
Num. obs.                 1040
RMSE                         5.34
================================================

*** p < 0.001, ** p < 0.01, * p < 0.05
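The idea behind the Breusch-Pagan test used before this output can be sketched in a few lines. The version below is an illustration only: it uses a single regressor, small deterministic synthetic data rather than the life expectancy dataset, and the studentized (Koenker) form of the statistic, n times the R squared from regressing the squared residuals on the regressor. The function names and the synthetic series are assumptions introduced here, so the sketch does not reproduce the reported p-value.

```python
def simple_ols(x, y):
    """Least-squares fit of y on one regressor x.
    Returns the residuals and the R squared of the fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    resid_ss = sum(r * r for r in residuals)
    return residuals, 1.0 - resid_ss / total_ss

def breusch_pagan_lm(x, y):
    """Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary
    regression of the squared residuals on x."""
    residuals, _ = simple_ols(x, y)
    squared = [r * r for r in residuals]
    _, aux_r2 = simple_ols(x, squared)
    return len(x) * aux_r2

CHI2_CRITICAL_1DF = 3.84  # 5% critical value, chi-square with 1 df

# Synthetic data: the errors alternate in sign; their size is either
# constant or grows with x.
x = list(range(1, 51))
signs = [1 if i % 2 == 0 else -1 for i in range(len(x))]
y_constant = [2 + 3 * xi + s * 1.0 for xi, s in zip(x, signs)]
y_growing = [2 + 3 * xi + s * 0.1 * xi for xi, s in zip(x, signs)]

lm_constant = breusch_pagan_lm(x, y_constant)
lm_growing = breusch_pagan_lm(x, y_growing)

print(f"constant error size: LM = {lm_constant:.2f}")
print(f"growing error size:  LM = {lm_growing:.2f}")
print("reject homoscedasticity (growing):", lm_growing > CHI2_CRITICAL_1DF)
```

With several regressors, as in the model above, the auxiliary regression would include all of them and the statistic would be compared to a chi-square distribution with as many degrees of freedom as there are regressors.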

When healthcare is considered alone as a predictor of life expectancy, the following table is obtained.

Table 2: model with one explanatory variable (healthcare)

Variable     Estimate   Standard Error   t-value
Intercept    62.31      0.25             249.25
Healthcare   4.56       0.16             29.11

Since the two models are not nested, the adjusted R squared is used to compare them. The original model has an adjusted R squared of 0.72, while the model with one variable has an adjusted R squared of 0.45. Because the adjusted R squared penalizes additional explanatory variables, the comparison is not driven by the number of variables, and it is clear that the original model is better at explaining life expectancy than the one-variable model. The original model explains 72% of the variability in life expectancy, compared to 45% for the one-variable model.
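The adjusted R squared of the original model can be recovered from figures already reported. The sketch below assumes the usual formula, 1 - (1 - R^2)(n - 1)/(n - p - 1), with the R squared of 0.7218, the 1040 observations and the six explanatory variables of the original model; the function name is introduced here for illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Original life expectancy model: R^2 = 0.7218, n = 1040, six predictors.
adj_original = adjusted_r_squared(0.7218, n_obs=1040, n_predictors=6)

print(f"adjusted R squared: {adj_original:.4f}")
```

With over a thousand observations and only six predictors the penalty is very small, which is why the R squared and the adjusted R squared both round to 0.72 in the output.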

When healthcare is added to the explanatory variables of the original model, the results are as shown in Table 3.

Table 3: model with healthcare and other variables

Variable                      Estimate   Standard Error   t-value
Intercept                     14.21      2.15             89.47
Urban population percentage   0.50       0.15             12.34
GDP per capita                -0.03      0.04             -2.84
Education                     1.29       0.38             12.49
Government Effectiveness      1.76       1.32             4.92
Polity                        0.57       0.38             1.51
Political Stability           0.48       0.94             0.14
Healthcare                    3.13       0.67             4.69

The p-value from the F test comparing the two models is 0.0076. Since this is below 0.05, the null hypothesis that the coefficient on the added healthcare variable is zero is rejected. The new model is therefore better at explaining life expectancy than the first model.

The healthcare variable has an estimate of 3.13. This implies that a unit increase in healthcare is associated with an increase of 3.13 years in life expectancy, with all other factors held constant.

In assessing the factors that determine whether one votes or not, the following factors are considered: age, left-right political position, gender, attention to politics and encouragement. Age is important because each age group has its own ideas and priorities in life, so there may be a trend across ages in the decision to vote or not. Age is a continuous variable giving the age of the person observed. Secondly, left-right position is assessed. This is where one stands in matters of politics.

It is represented on a scale from 0 for the left to 10 for the right. People on the left would be expected to be more enthusiastic to vote. The gender of the people observed is also considered. This is a categorical variable with 1 representing females and 2 representing males. The two genders may have differing political needs, so their voting patterns and willingness to vote may differ. The other variable is attention to politics, which is measured on a scale from 0 for no attention to 10 for full attention.

People who closely follow politics are more likely to vote than those who do not. The last variable considered is encouragement, which represents the number of times one was encouraged by friends and family to vote. The more pressure one is under from those around them, the more likely they are to vote. Such a person will take voting more seriously than one who is not encouraged.

Conclusion

Table 4: Model on factors affecting turnout

 

Variable        Estimate   Standard Error   Z value
Intercept       -1.96      0.29             -6.75
Age             0.02       3.64 x 10^-3     6.49
Left_right      0.07       0.04             1.74
Gender          -0.28      0.13             -2.17
Attention       0.30       0.03             11.79
Encouragement   1.32       0.17             7.76

Table 4 shows the estimates obtained when a logistic regression model is fitted to the data. The first test is for the significance of the individual estimates. For each coefficient $\beta_j$ the null hypothesis is $H_0: \beta_j = 0$.

The intercept has a z value of -6.75. Since this is less than -1.96, the null hypothesis is rejected at the 5% significance level, implying that the intercept is not zero. The z value for age is 6.49, which is greater than 1.96. The null hypothesis is rejected, implying that the estimate is not zero at the 5% significance level. For the next variable, left_right, the z value is 1.74, which is lower than 1.96. There is insufficient evidence to reject the null hypothesis, and the estimate could be zero.

Gender has a z value of -2.17. Since this is less than -1.96, the null hypothesis is rejected and it is concluded that at the 5% significance level the estimate is not zero. For the next variable, attention, the z value is 11.79. This exceeds 1.96, so the null hypothesis is rejected and it is concluded that the estimate is not zero at the 5% significance level. Finally, for encouragement, the z value of 7.76 is greater than 1.96, so the null hypothesis is rejected at the 5% significance level and the estimate is not zero. Four of the five variables are statistically significant, with left_right being the exception.
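The comparisons with ±1.96 can equivalently be expressed as two-sided p-values. The sketch below assumes the z values reported in Table 4 and the standard normal reference distribution; it uses `math.erfc` from the Python standard library, and the function name `two_sided_p` is introduced here for illustration.

```python
import math

def two_sided_p(z):
    """Two-sided p-value of a z statistic under the standard normal,
    computed as erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# z values as reported in Table 4.
z_values = {
    "intercept": -6.75,
    "age": 6.49,
    "left_right": 1.74,
    "gender": -2.17,
    "attention": 11.79,
    "encouragement": 7.76,
}

p_values = {name: two_sided_p(z) for name, z in z_values.items()}

for name, p in p_values.items():
    verdict = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"{name:14s} p = {p:.4g}  ({verdict})")
```

The p-value for left_right lies a little above 0.05 and the one for gender a little below it, so these two conclusions are the most sensitive to the choice of significance level.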

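The fit of the model is assessed through the Hosmer-Lemeshow test and the log-likelihood (likelihood ratio) test. The Hosmer-Lemeshow test has the null hypothesis that the model fits the data, that is, that the observed and predicted turnout do not differ across groups of predicted probability.

The test gives a chi-square statistic of 7.87 with 8 degrees of freedom and a p-value of 0.4462, showing no evidence of poor fit. For the log-likelihood test, the null hypothesis is that all the slope coefficients are zero, $H_0: \beta_1 = \beta_2 = \dots = \beta_5 = 0$, so that the fitted model is no better than a model with only an intercept.

The test gives a p-value of 2.2 x 10^-16. This is below 0.05, so the null hypothesis is rejected.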
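The Hosmer-Lemeshow p-value can be checked from the reported statistic alone. The sketch below assumes the chi-square statistic of 7.87 on 8 degrees of freedom given above and uses the closed form of the chi-square upper tail that holds for an even number of degrees of freedom; the function name `chi2_sf_even_df` is introduced here for illustration.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df,
    using exp(-x/2) * sum over j < df/2 of (x/2)^j / j!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term = 1.0   # the j = 0 term
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j   # builds (x/2)^j / j! incrementally
        total += term
    return math.exp(-half) * total

p_hosmer_lemeshow = chi2_sf_even_df(7.87, 8)

print(f"Hosmer-Lemeshow p-value: {p_hosmer_lemeshow:.4f}")
```

The result agrees with the reported 0.4462 up to the rounding of the test statistic.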

The odds ratios and the probabilities for the explanatory variables are given in Table 5.

Table 5: odds ratios and probabilities of different variables

             Age    Left_right   Gender   Attention   Encouragement
Odds ratio   1.02   1.07         0.75     1.35        3.73

The age variable has an odds ratio of 1.02, which suggests that each additional year of age multiplies the odds of voting by 1.02. The estimate for age is 0.02, meaning that a one-year increase in age raises the log-odds of voting by 0.02. This is not a change in probability, since the change in probability depends on the starting probability. The variable left_right has an estimate of 0.07, so a unit move to the right raises the log-odds of voting by 0.07 and multiplies the odds of voting by 1.07, although this effect is not statistically significant. Gender has an estimate of -0.28, which is a 0.28 decrease in the log-odds of voting if one is male. The odds of a male voting are 0.75 times those of a female.

The attention variable has an estimate of 0.3, hence a unit increase in attention raises the log-odds of voting by 0.3 and multiplies the odds of voting by 1.35. Finally, the variable encouragement has an estimate of 1.32, so the log-odds of voting increase by 1.32 with every extra unit of encouragement, and the odds of voting are multiplied by 3.73.
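The difference between a change in odds and a change in probability can be seen numerically. The sketch below assumes the rounded estimates from Table 4, so the exponentiated values may differ from Table 5 in the last digit. The baseline probabilities of 0.10, 0.50 and 0.90 are hypothetical values chosen for illustration, and the function name `shifted_probability` is introduced here.

```python
import math

# Estimates as reported (rounded) in Table 4.
estimates = {
    "age": 0.02,
    "left_right": 0.07,
    "gender": -0.28,
    "attention": 0.30,
    "encouragement": 1.32,
}

# Odds ratios are the exponentials of the estimates.
odds_ratios = {name: math.exp(b) for name, b in estimates.items()}

def shifted_probability(baseline_p, coefficient):
    """Probability after adding `coefficient` to the log-odds of baseline_p."""
    odds = baseline_p / (1.0 - baseline_p) * math.exp(coefficient)
    return odds / (1.0 + odds)

# Hypothetical baseline probabilities of voting (assumed for illustration).
baselines = [0.10, 0.50, 0.90]
changes = {
    p: shifted_probability(p, estimates["encouragement"]) - p for p in baselines
}

for name, ratio in odds_ratios.items():
    print(f"{name:14s} odds ratio = {ratio:.2f}")
for p, change in changes.items():
    print(f"baseline {p:.2f}: one more unit of encouragement adds {change:.3f}")
```

The same coefficient of 1.32 moves the probability by different amounts at each baseline, and never by more than 1, which is why the estimate is read on the log-odds scale rather than as a probability change.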

When the new variable, last_election, is added, the new model is as shown in Table 6.

Table 6: model with last election variable

Variable        Estimate   Standard Error   Z value
Intercept       -2.05      0.30             -6.85
Age             0.01       3.82 x 10^-3     3.71
Left_right      0.07       0.04             1.83
Gender          -0.27      0.13             -2.04
Attention       0.25       0.03             9.30
Encouragement   1.22       0.17             7.01
Last_election   1.23       0.14             8.66

Comparing the two models using their AIC, the original model has an AIC of 1537.4 and the new model has an AIC of 1465. The reduction indicates a better fit. A likelihood ratio test between the two models gives a p-value of 2.2 x 10^-16, which is less than 0.05, hence the introduction of the last election variable significantly improves the model. The interpretation of the original model changes, since it is clear that last election has an impact on the decision to vote. The original model may therefore have placed too much weight on other factors: with last election included, the estimate for age falls from 0.02 to 0.01, attention from 0.30 to 0.25 and encouragement from 1.32 to 1.22.
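The AIC values and the likelihood ratio test are linked, because AIC = -2 log L + 2k, where k is the number of estimated parameters. The sketch below assumes that both models were fitted to the same observations and that k counts the intercept plus the slopes (6 for the original model, 7 for the new one); the function names are introduced here for illustration.

```python
import math

def lr_statistic_from_aic(aic_small, k_small, aic_large, k_large):
    """Likelihood ratio statistic recovered from AIC = -2 logL + 2k."""
    neg2loglik_small = aic_small - 2 * k_small
    neg2loglik_large = aic_large - 2 * k_large
    return neg2loglik_small - neg2loglik_large

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

# Reported AIC values; parameter counts are an assumption (see above).
lr_stat = lr_statistic_from_aic(1537.4, 6, 1465.0, 7)
p_value = chi2_sf_1df(lr_stat)

print(f"likelihood ratio statistic: {lr_stat:.1f} on 1 degree of freedom")
print(f"p-value: {p_value:.3g}")
```

Under these assumptions the implied p-value lies below 2.2 x 10^-16, in line with the reported result of the likelihood ratio test.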
