Catherine De Vries, Spyros Kosmidis & Andreas Murr


APPLIED STATISTICS FOR POLITICAL SCIENTISTS
WEEK 8: DEPENDENT CATEGORICAL VARIABLES II

Catherine De Vries, Spyros Kosmidis & Andreas Murr

Topic: Logistic regression. Predicted probabilities.
STATA commands and features: logit, logistic, char, test, margins
Data set: bes00.dta, taken from the British Election Study 00.
Readings: Alan Agresti and Barbara Finlay (1997). Statistical Methods for the Social Sciences, 3rd ed. Upper Saddle River, NJ: Prentice Hall. [CHAPTER 5]

N.B.: Please note that Assignment 3 is due on the nd of December at 6 pm.

1. INTRODUCTION

This week, we will continue using the British Election Survey from 00 to investigate social structural and attitudinal causes of vote choice. Because we will be using a binary measure of vote choice (Conservative vs. non-Conservative), we use logistic regression, a form of regression designed for 0/1 dependent variables.

The variable vote is a categorical variable with several categories. Quite often it is possible to focus attention on just one binary distinction within the categorical variable. As last week, we will focus on the decision to vote Conservative or not. Compute a dummy variable for Conservative voters with:

. recode bq_ (- -=.) (= "Conservative voter") ( 3/9=0 "Non Conservative voter"), gen(con)
. label variable con "Conservative vote dummy variable"
. tabulate bq_ con

Note that anyone who did not give a party choice is treated as missing.

For clarity, it will also help to recode the variable aq66_ (whether the respondent is a trade union member) as a binary variable indicating trade union membership.

. recode aq66_ (-=.) ( =) (3=0), gen(union)
. label variable union "Union member"
. tabulate aq66_ union

Finally, recode the variable aq67 into the variable educ in the following way:

. recode aq67 (- -=.) (6 7=5), gen(educ)
. lab var educ "Education"
. tab aq67 educ

2. CATEGORICAL INDEPENDENT VARIABLES

As with linear regression, categorical variables can be added to the model with the xi command.

. xi: logit con union i.educ

STATA uses the first value of educ (left school at the age of 15 years or younger) as the reference category. Although changing the reference category of a categorical independent variable does not change the model, the choice of reference category can make a big difference to how clearly you can see what is going on. We can control which category STATA uses as the reference with the char command:

. char educ[omit] 5
. xi: logit con union i.educ

Now those who left education at 19 years or older are treated as the reference category. Compare the two models: the statistical significance and size of the education coefficients change from one model to the other. This is because each coefficient represents a comparison between a category and the reference category, so changing the reference category changes the point of comparison.

Given that we are comparing categories, what we are actually interested in is whether any one category is different from any other category. For example, are those with the lowest education different from those in the middle category of education in terms of their voting behaviour? One way of checking this is to set the lowest education category as the reference again (i.e. char educ[omit] 1), but then it is difficult to tell whether, say, those in the middle differ from those with the highest education. The coefficients for the lowest and middle categories differ in size, but are they statistically significantly different from each other?

We can use the test command to check whether two or more coefficients are equal to each other. To test whether the lowest and middle categories of education differ, simply type:

. test _Ieduc_1 = _Ieduc_3

chi2( 1) = 6.0
Prob > chi2 = 0.04

We are testing here whether we can distinguish statistically between the size of the two coefficients.
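After logit, the test command carries out a Wald test. For a single equality between two coefficients, the arithmetic can be sketched outside STATA as follows. This is a minimal illustration in Python, not the handout's own method, and the coefficient estimates, variances and covariance are hypothetical values chosen for the example, not results from the BES model.

```python
import math

def wald_equality_test(b1, b2, var1, var2, cov12):
    """Wald chi-squared test (1 degree of freedom) of H0: b1 == b2."""
    diff = b1 - b2
    # Variance of the difference between two correlated estimates.
    var_diff = var1 + var2 - 2.0 * cov12
    chi2 = diff ** 2 / var_diff
    # With 1 df, P(chi2 > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical estimates for two education dummies (NOT from the BES data).
chi2, p = wald_equality_test(b1=-0.60, b2=-0.10,
                             var1=0.04, var2=0.03, cov12=0.015)
print(f"chi2(1) = {chi2:.2f}, Prob > chi2 = {p:.4f}")
```

The covariance term matters: both dummies are compared with the same reference category, so their estimates are usually correlated, and ignoring the covariance would misstate the variance of the difference.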
Our null hypothesis here is that the two coefficients are the same. Given the small p-value, we can reject this hypothesis and conclude that the coefficients are different.

EXERCISE 1

Now run a model with union membership, education and interest in the election (aq) predicting Conservative vote. Remember to account for missing values. Oddly, election interest is coded as 1 = very interested, 2 = somewhat interested, 3 = not very interested, and 4 = not at all interested. Generate a new variable interest, where increasing values signify more interest.

Test whether all coefficients of election interest, except that for category 1 (i.e. those who are not at all interested), could be equal to zero. How could we make the model more parsimonious by recoding election interest?

3. MORE STRAIGHTFORWARD INTERPRETATIONS

Producing tables of logistic regression coefficients is standard practice, but they mean very little to the general reader. Instead of talking in terms of statistically significant coefficients, we should be talking about real quantities that people understand. So instead of saying that the odds of voting Conservative are x times higher for trade unionists than for non-trade unionists after controlling for such and such, we would rather say that the probability of voting Conservative is y greater for trade unionists than for other voters after controlling for such and such. Unfortunately this isn't easy, because the difference in predicted probabilities between union members and others depends on the values of all the other variables in the model.

There are two packages that have been written for STATA to work out predicted probabilities, which we describe later on. In STATA 11 and more recent versions, this can easily be done with margins. One can also produce these probabilities manually, which, at least at first, helps one to understand what's going on.

So first run the logistic regression that you are interested in (remembering to account for missing data). Let us take a hypothetical example of a model that predicts Conservative vote choice (relative to all other parties) using union membership and age.

. logit con union age

Second, examine the table of coefficients. For this example the coefficients are 0.44 for union membership and 0.04 for age. Both are statistically significant. The constant is.3. Because these coefficients are hard to interpret directly, we want to express them in terms of probabilities.
To do this we can calculate the predicted probability for any hypothetical observation using the formula below:

\[ \pi = \frac{e^{a + b_1 X_1 + b_2 X_2}}{1 + e^{a + b_1 X_1 + b_2 X_2}} \]

where in our case X_1 is union membership and X_2 is age, b_1 is the coefficient for union membership, b_2 is the coefficient for age, and a is the constant.

An easy way of calculating these probabilities is to open a new STATA window and generate a new variable from this formula, representing predicted probabilities over a range of the independent variable that you are interested in. To do this, enter some values for the variable, in this case age, over a plausible range in the empty Data Editor, say 18 up to 90:

. set obs 90
. gen age=_n
. drop if age<18

Then generate a new variable that corresponds to the formula above. For non-trade unionists X_1 = 0, so the union term drops out and the formula becomes:

\[ \pi = \frac{e^{a + b_2 \, age}}{1 + e^{a + b_2 \, age}} \]

In the newly opened STATA window, we will call this new variable prob and generate it as follows, replacing a and b2 with the estimated constant and age coefficient:

. generate prob = exp(a + b2*age)/(1 + exp(a + b2*age))

We now have predicted probabilities for each value of age and can plot prob against age to look at how the predicted probability of voting Conservative changes with age.

The same logic applies to a bigger model: again we need to pick values for the other independent variables (just as we picked non-trade unionists for the predicted probabilities above). Generally, use the mean for interval-level variables and the mode for categorical variables. For example, if our model included income and education (5 categories), then we would include the mean of income multiplied by its coefficient, and the coefficient of the modal category of education (i.e. the fifth category in this case).

For interaction terms, we need to include both the main effect and the interaction term of the variable we are interested in. If we interacted age and union membership, we would need to generate two separate sets of probabilities, one for trade unionists and one for non-trade unionists. Below, constant, b_union, b_age and b_interaction stand for the estimated values.

For non-trade unionists (union = 0):

. generate nonunionprob = exp(constant + b_union*0 + b_age*age + b_interaction*0*age)/(1 + exp(constant + b_union*0 + b_age*age + b_interaction*0*age))

which simplifies to

. generate nonunionprob = exp(constant + b_age*age)/(1 + exp(constant + b_age*age))

For trade unionists (union = 1):

. generate unionprob = exp(constant + b_union*1 + b_age*age + b_interaction*1*age)/(1 + exp(constant + b_union*1 + b_age*age + b_interaction*1*age))

which simplifies to

. generate unionprob = exp(constant + b_union + b_age*age + b_interaction*age)/(1 + exp(constant + b_union + b_age*age + b_interaction*age))

We can also calculate predicted probabilities for categorical variables: just pick the categories that you wish to compare and enter the numbers into the formula (if the model is small, this may be easiest with a calculator, pencil and paper).
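The manual calculation can also be checked outside STATA. The sketch below, in Python, applies the logistic formula to a model with union membership, age and their interaction. The coefficient values are hypothetical placeholders rather than estimates from the BES data, and the age range of 18 to 90 is an assumption made for the example.

```python
import math

def inv_logit(z):
    """Logistic function: exp(z) / (1 + exp(z)), written in a stable form."""
    return 1.0 / (1.0 + math.exp(-z))

def predicted_prob(age, union, coef):
    """Predicted probability for a logit model with union, age and union*age."""
    z = (coef["constant"]
         + coef["union"] * union
         + coef["age"] * age
         + coef["interaction"] * union * age)
    return inv_logit(z)

# Hypothetical coefficients, for illustration only (NOT estimated from the data).
coef = {"constant": -2.0, "union": -0.5, "age": 0.03, "interaction": 0.005}

ages = list(range(18, 91))  # assumed plausible age range
nonunionprob = [predicted_prob(a, 0, coef) for a in ages]
unionprob = [predicted_prob(a, 1, coef) for a in ages]

# Print a few rows: age, probability for non-members, probability for members.
for a in (18, 50, 90):
    print(a, round(predicted_prob(a, 0, coef), 3),
          round(predicted_prob(a, 1, coef), 3))
```

Setting union to 0 makes the union and interaction terms vanish, which is the simplification shown for non-trade unionists above. Setting it to 1 adds the union coefficient to the constant and the interaction coefficient to the age slope.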

EXERCISE 2

Run a model with education, trade union membership, and self-placements on two eleven-point scales about attitudes towards taxation (aq35_) and crime (aq38_). Recode the self-placement variables such that higher values refer to cutting taxes and reducing crime, respectively. Open a new STATA window and plot the effect of attitudes towards cutting taxes on vote choice. Don't forget that, as measured here, you can't have a mean level of trade union membership, education, and attitudes towards crime, so you need to pick a particular type of respondent and plot the impact of libertarian-authoritarian values for them.

4. MARGINS

Although it is important to know how to calculate predicted probabilities manually, margins in STATA will do this for you.

. logit con i.union age

margins union will calculate the predicted probabilities of a Conservative vote separately for union members and non-members, together with the associated confidence intervals, holding age at its mean:

. margins union, atmeans

The output table reports, for each value of union, the Margin, the Delta-method Std. Err., z, P>|z| and the [95% Conf. Interval].

To obtain the predicted probabilities at each value of age for union members:

. margins, at(age=(18(1)97) union=1)
. marginsplot

To obtain the predicted probabilities at each value of age for non-members:

. margins, at(age=(18(1)97) union=0)

Unlike in OLS regression, the marginal effect of each predictor depends on the values of all other predictors. To obtain the marginal effect of union membership on the predicted probability of voting Conservative at the mean of age:

. margins, dydx(union) atmeans

The output table reports dy/dx for union with its Delta-method Std. Err., z, P>|z| and [95% Conf. Interval], followed by the note:

Note: dy/dx for factor levels is the discrete change from the base level.

Since union is a binary variable, the marginal effect is the effect of a discrete change from 0 to 1. It is simply the difference in the predicted probabilities of a Conservative vote between trade unionists and non-trade unionists.

To obtain the marginal effect of age at the two values of union:

. margins, dydx(age) at(union=(0 1))

The output table reports dy/dx for age at each _at value of union, with the Delta-method Std. Err., z, P>|z| and [95% Conf. Interval].

The margins command is particularly useful when interactions or non-linear terms are present.
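The two kinds of marginal effect discussed above can be sketched by hand. The Python example below uses hypothetical coefficients for a model of the form logit con i.union age (placeholders, not estimates from the BES data). It evaluates the effects at a few fixed ages, whereas margins without atmeans averages over the sample, so the numbers are not meant to reproduce any STATA output.

```python
import math

def inv_logit(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only.
CONSTANT, B_UNION, B_AGE = -2.0, -0.5, 0.03

def prob(union, age):
    """Predicted probability of a Conservative vote."""
    return inv_logit(CONSTANT + B_UNION * union + B_AGE * age)

def discrete_change_union(age):
    """Marginal effect of a binary predictor: P(union=1) - P(union=0) at a given age."""
    return prob(1, age) - prob(0, age)

def marginal_effect_age(union, age):
    """Marginal effect of a continuous predictor: dP/d(age) = b_age * p * (1 - p)."""
    p = prob(union, age)
    return B_AGE * p * (1.0 - p)

# Columns: age, effect of union, effect of age for non-members, for members.
for age in (30, 50, 70):
    print(age,
          round(discrete_change_union(age), 4),
          round(marginal_effect_age(0, age), 5),
          round(marginal_effect_age(1, age), 5))
```

Even without an interaction term, the effect of age differs between members and non-members and the effect of union membership differs by age, because both depend on where the observation sits on the logistic curve.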

EXERCISE 3

Please estimate a binary logit model where voting for the Conservatives is a function of income (aq70), attitudes towards cutting taxes, and their interaction. Create a plot to show how the marginal effect of attitudes towards cutting taxes depends on income. Interpret the result.
