Determining Probability Estimates From Logistic Regression Results
Vartanian: SW 541
D:\WP60\LECT2.PHD\LOGIST\LOGIST.PROBAB.SAS.WPD

In determining logistic regression results, you will generally be given the odds ratio in the SPSS or SAS output. However, you will not be given the probability estimates for different values of your independent variables. For example, you may be examining the likelihood of being in poverty and would like to know what the probability is for single mothers, for those with 5 children, or for single mothers with 5 children. Through the method shown below, you will be able to determine the probability estimates for any group you're interested in. I will show you how to determine these probability estimates in SAS.

Let's say we're examining the likelihood of having income below the poverty line (inpov) and are using 4 independent variables:

1. Whether a person lives in public housing (pubhouse -- dummy variable)
2. Whether a person lives in a big city (over 750,000 population) (bigcit -- dummy variable)
3. The wages of the head of household (wagehd -- continuous variable)
4. The level of education of the head of household (edhd -- continuous variable)

The first step will be to use SAS to determine the logistic regression results. By default, SAS models the likelihood of the 0 condition instead of the likelihood of the 1 condition (for example, the likelihood of not being in poverty), unless you use the descending option, which we will do.

Remember that the logistic equation takes the following form:

$$\mathrm{Prob}(\mathit{pov} = 1) = \frac{e^{a + xb}}{1 + e^{a + xb}}$$

We will use this form to determine probability estimates for the different variables within the model. While some researchers use mean values of all other variables to determine the probability estimates of a given variable, you will use the actual observed values to determine these probabilities.

You will use SAS commands to determine the probabilities of having income below the poverty line for individuals with specific conditions.

Below, I have presented the first set of SAS commands to determine the logistic regression results and then to determine the probability estimates from these results.

libname in 'P:\pubdata\gssw';   * or wherever the data is located;
data a; set in.psidphd;
a=1;
* You will use this a variable as a merge variable later. SAS needs a BY variable that exists in both data sets in order to merge them;

proc logistic data=a descending outest=dd maxiter=100;
model inpov=pubhouse bigcit wagehd edhd;
output out=cc xbeta=xb;
run;
* proc logistic is the command for determining logistic regression results;

data f; set dd;
a=1;
rename pubhouse=cpubhous bigcit=cbigcit wagehd=cwagehd edhd=cedhd;
drop _type_;
* This creates the same merge variable in the second data set, so the two data sets can eventually be merged together;

What the different commands mean:

outest=dd. This option tells SAS to create a new data set that contains only the coefficient estimates from the logistic regression model. These newly created variables (the b coefficients) have the same names as the original variables (the X variables). We need to change these names because we will eventually be multiplying the Xs and the bs together to determine the probability of an event occurring. Notice that we SET the variables from data set DD into data set F, and we use a rename statement to give these coefficient estimates new names. I've changed the names of the variables from their regular names to names that begin with c (for coefficient estimate).

maxiter=100. Logistic regression uses an iterative process to determine the b coefficients. The process keeps iterating until stable b coefficients are found. The default number of iterations in SAS is 25. I have simply upped this number to 100 iterations.

out=cc. This creates a new data set, CC, that contains all the variables from the data that went into the logistic regression analysis. You will need these variables to determine the probability estimates -- these are your X variables.

xbeta=xb. This gives you, for each observation, the sum of each variable's value times its b coefficient. This XB estimate includes the estimate for the intercept (that is, it is a+xb). You'll be able to use this value for determining probability estimates. However, you will be adding to or subtracting from this value, depending on the probability you're examining.
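To make the xb value concrete, here is a minimal sketch that computes a+xb and the resulting probability by hand for one person. The coefficient values are copied from the Results section later in this handout; the person (lives in public housing, lives in a big city, head earns $5/hour, head has 12 years of education) and the data set name one_person are hypothetical, chosen only for illustration.

```sas
/* Sketch only: one hypothetical person, coefficients typed in by hand. */
data one_person;
  /* Coefficient estimates copied from the Results section */
  intercept = 1.0972;
  cpubhous  = 1.4513;
  cbigcit   = 0.3778;
  cwagehd   = -0.2556;
  cedhd     = -0.0964;

  /* Hypothetical values of the X variables */
  pubhouse = 1;
  bigcit   = 1;
  wagehd   = 5;
  edhd     = 12;

  /* a + xb, the same quantity that xbeta=xb stores for each observation */
  xb = intercept + cpubhous*pubhouse + cbigcit*bigcit
       + cwagehd*wagehd + cedhd*edhd;

  /* Logistic form: e**(a+xb) / (1 + e**(a+xb)) */
  pr = exp(xb) / (1 + exp(xb));
run;

proc print data=one_person noobs;
  var xb pr;
run;
```

For this hypothetical person, a+xb works out to 0.4915, which corresponds to a probability of about 0.62.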
SECOND SET OF SAS COMMANDS:

data g; merge f cc; by a;

Now, we will merge together the two created data sets, which will put the b coefficient estimates and the variable values into one data set. Each observation will have the same values for the b coefficients and will have different values for the variables. For example, each observation in the new merged data set will have the same value for cpubhous (1.4513), while the value for pubhouse will depend on whether or not the person lived in public housing (1=yes, 0=no).

xb_nopub=xb-cpubhous*pubhouse;

This is the derivation of your XB for the likelihood of being in poverty for those who do not live in public housing. What I've done is taken the overall XB (which includes the XB for public housing) and subtracted off the coefficient*variable estimate for living in public housing. This is the same as setting pubhouse=0. In other words, we're asking for the likelihood of being in poverty, given that no individual lives in public housing. This is just what we did with the estimates in class. There, we wanted to determine the probability of being in poverty if someone had 5 kids, so we substituted the number 5 for the X and multiplied it by the b coefficient. What we will do here is determine this probability for all individuals (holding all else equal), and then take a mean for the sample. This will give us the overall likelihood of being in the condition.

xb_pub=xb_nopub+cpubhous;

In this second estimate for public housing, I'm determining the XB for those who live in public housing. We again need to subtract off the coefficient*variable estimate for public housing (as we did above), and then add back in cpubhous*1 (or simply cpubhous). We're again forcing everyone into a particular state: they live in public housing. We'll then see how this affects their likelihood of being in poverty.
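The subtraction used for xb_nopub and xb_pub can be written out term by term. In this sketch, a is the intercept and b_1 through b_4 are shorthand (not names used in the SAS program) for cpubhous, cbigcit, cwagehd, and cedhd:

```latex
\begin{align*}
\mathit{xb} &= a + b_1\,\mathit{pubhouse} + b_2\,\mathit{bigcit}
              + b_3\,\mathit{wagehd} + b_4\,\mathit{edhd} \\
\mathit{xb\_nopub} &= \mathit{xb} - b_1\,\mathit{pubhouse}
   = a + b_1(0) + b_2\,\mathit{bigcit} + b_3\,\mathit{wagehd} + b_4\,\mathit{edhd} \\
\mathit{xb\_pub} &= \mathit{xb\_nopub} + b_1
   = a + b_1(1) + b_2\,\mathit{bigcit} + b_3\,\mathit{wagehd} + b_4\,\mathit{edhd}
\end{align*}
```

Only the public housing term is replaced. Every other term keeps the value that the individual actually has, which is why the estimates differ from person to person before the mean is taken.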
xb_nobig=xb-cbigcit*bigcit;
xb_big=xb_nobig+cbigcit;
xb_wag5=xb-cwagehd*wagehd+cwagehd*5;
xb_wag10=xb-cwagehd*wagehd+cwagehd*10;
xb_wag1=xb-cwagehd*wagehd+cwagehd*1;
xb_ed10=xb-cedhd*edhd+cedhd*10;
xb_ed12=xb-cedhd*edhd+cedhd*12;
xb_ed16=xb-cedhd*edhd+cedhd*16;

With interval/ratio scale variables, we again need to subtract off the coefficient estimate of the variable times the actual value of the variable, since these values are already contained in xb. We want to determine estimates for this variable at specific levels -- not at the particular level of any individual. So we add back the coefficient estimate multiplied by 10 (to get an estimate for those who have a 10th grade education), by 12 (high school graduate), or by 16 (college graduate).

Below, we take the XB estimates from above and put them into logistic form (see the formula above). Each individual within the sample will have a value for each of the variables below. We will then take the mean, which will give us an overall average probability of being in poverty.

pr_nopub=(exp(xb_nopub))/(1+exp(xb_nopub));
pr_pub=(exp(xb_pub))/(1+exp(xb_pub));
pr_nobig=(exp(xb_nobig))/(1+exp(xb_nobig));
pr_big=(exp(xb_big))/(1+exp(xb_big));
pr_wag5=(exp(xb_wag5))/(1+exp(xb_wag5));
pr_wag10=(exp(xb_wag10))/(1+exp(xb_wag10));
pr_wag1=(exp(xb_wag1))/(1+exp(xb_wag1));
pr_ed10=(exp(xb_ed10))/(1+exp(xb_ed10));
pr_ed12=(exp(xb_ed12))/(1+exp(xb_ed12));
pr_ed16=(exp(xb_ed16))/(1+exp(xb_ed16));

proc means data=g;
var pr_nopub pr_pub pr_nobig pr_big pr_wag5 pr_wag10 pr_wag1 pr_ed10 pr_ed12 pr_ed16;
weight weight;
run;

Results

The LOGISTIC Procedure

Data Set: WORK.A
Response Variable: INPOV
Response Levels: 2
Number of Observations: 15406
Link Function: Logit

Response Profile

Ordered
  Value    INPOV    Count
      1        1     3208
      2        0    12198

(In this sample, 3,208 lived below the poverty line, 12,198 did not.)

Model Fitting Information and Testing Global Null Hypothesis BETA=0

             Intercept    Intercept and
Criterion         Only       Covariates    Chi-Square for Covariates
AIC          15765.507        10460.511    .
SC           15773.149        10498.723    .
-2 LOG L     15763.507        10450.511    5312.996 with 4 DF (p=0.0001)
Score                .                .    3249.816 with 4 DF (p=0.0001)

Analysis of Maximum Likelihood Estimates
             Parameter   Standard        Wald        Pr >   Standardized     Odds
Variable  DF  Estimate      Error  Chi-Square  Chi-Square       Estimate    Ratio
INTERCPT   1    1.0972     0.1038    111.6740      0.0001              .        .
PUBHOUSE   1    1.4513     0.0832    304.4061      0.0001       0.192948    4.269
BIGCIT     1    0.3778     0.0568     44.2624      0.0001       0.086084    1.459
WAGEHD     1   -0.2556    0.00592   1864.9660      0.0001      -1.628713    0.774
EDHD       1   -0.0964    0.00908    112.8993      0.0001      -0.150797    0.908

Notice that all coefficient estimates are significantly related to being in poverty. Living in public housing and living in a big city are positively related to having income below the poverty line. Wages and education level of the head have a negative relationship to having income below the poverty line. Also note that the -2 Log L chi-square for the covariates is significant (p=.0001). Thus, the variables taken together are related to the dependent variable.

Below are the probability estimates, or the likelihood of being in poverty, given particular conditions. The likelihood of being in poverty given that you don't live in public housing, controlling for all other variables in the model, is 17.13%. The probability for those living in public housing is 33.94%. For those with wages of $1/hour, the likelihood of being in poverty is 43.04%, while for those earning $10/hour, this likelihood is 7.65%. All of the probability estimates are determined with all other variables in the model left at their observed values.
The SAS System                         15:40 Thursday, January 28, 1999

Variable       N        Mean     Std Dev        Minimum     Maximum
-----------------------------------------------------------------------
PR_NOPUB   15406   0.1712863   0.2055081   4.642831E-12   0.8138157
PR_PUB     15406   0.3393527   0.3135689    1.98185E-11   0.9491308
PR_NOBIG   15406   0.1738347   0.2131065   4.642831E-12   0.9274740
PR_BIG     15406   0.2129280   0.2445817   6.774009E-12   0.9491308
PR_WAG5    15406   0.2209343   0.0850559      0.1394452   0.8386856
PR_WAG10   15406   0.0764884   0.0457277      0.0432017   0.5916215
PR_WAG1    15406   0.4303992   0.0979813      0.3105265   0.9352771
PR_ED10    15406   0.1962643   0.2239156   9.118824E-12   0.8767470
PR_ED12    15406   0.1759773   0.2071500   7.519345E-12   0.8543479
PR_ED16    15406   0.0174886   0.0295942    4.00783E-13   0.2381776
-----------------------------------------------------------------------

Note: the PR_ED16 row appears to come from a run in which xb_ed16 added back cwagehd*16 rather than cedhd*16. The sharp drop from PR_ED12 is larger than the education coefficient alone would produce, so this row should not be read as the estimate for 16 years of education.
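One way to read these means is as differences in average predicted probability between two conditions. The arithmetic below uses only the values in the table above; it is an illustration of how the output can be summarized, not additional SAS output.

```latex
\begin{align*}
\text{public housing vs. not:}\quad & 0.3393527 - 0.1712863 = 0.1680664 \\
\text{big city vs. not:}\quad       & 0.2129280 - 0.1738347 = 0.0390933 \\
\text{\$1/hour vs. \$10/hour:}\quad & 0.4303992 - 0.0764884 = 0.3539108 \\[4pt]
\text{ratio of probabilities, public housing:}\quad
  & \frac{0.3393527}{0.1712863} \approx 1.98
\end{align*}
```

Living in public housing is therefore associated with an average probability of poverty about 16.8 percentage points higher, or roughly twice as high. The odds ratio for PUBHOUSE in the regression output is 4.269, which is a ratio of odds and not of probabilities; this is why the probability estimates have to be computed separately.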
The SAS program:

libname in 'P:\pubdata\gssw';   * or wherever the data is located;

data a; set in.psidphd;
a=1;

proc logistic data=a descending outest=dd maxiter=100;
model inpov=pubhouse bigcit wagehd edhd;
output out=cc xbeta=xb;
run;

data f; set dd;
a=1;
rename pubhouse=cpubhous bigcit=cbigcit wagehd=cwagehd edhd=cedhd;
drop _type_;

data g; merge f cc; by a;
xb_nopub=xb-cpubhous*pubhouse;
xb_pub=xb_nopub+cpubhous;
xb_nobig=xb-cbigcit*bigcit;
xb_big=xb_nobig+cbigcit;
xb_wag5=xb-cwagehd*wagehd+cwagehd*5;
xb_wag10=xb-cwagehd*wagehd+cwagehd*10;
xb_wag1=xb-cwagehd*wagehd+cwagehd*1;
xb_ed10=xb-cedhd*edhd+cedhd*10;
xb_ed12=xb-cedhd*edhd+cedhd*12;
xb_ed16=xb-cedhd*edhd+cedhd*16;
pr_nopub=(exp(xb_nopub))/(1+exp(xb_nopub));
pr_pub=(exp(xb_pub))/(1+exp(xb_pub));
pr_nobig=(exp(xb_nobig))/(1+exp(xb_nobig));
pr_big=(exp(xb_big))/(1+exp(xb_big));
pr_wag5=(exp(xb_wag5))/(1+exp(xb_wag5));
pr_wag10=(exp(xb_wag10))/(1+exp(xb_wag10));
pr_wag1=(exp(xb_wag1))/(1+exp(xb_wag1));
pr_ed10=(exp(xb_ed10))/(1+exp(xb_ed10));
pr_ed12=(exp(xb_ed12))/(1+exp(xb_ed12));
pr_ed16=(exp(xb_ed16))/(1+exp(xb_ed16));

proc means data=g;
var pr_nopub pr_pub pr_nobig pr_big pr_wag5 pr_wag10 pr_wag1 pr_ed10 pr_ed12 pr_ed16;
weight weight;
run;
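As an alternative to merging on a constant BY variable, the single row of coefficients can be attached to every observation with a conditional SET statement. The sketch below is self-contained so that it can be run without the PSID data: the data set names coef, toy, and toy_pr are hypothetical, the four toy observations are invented for illustration, and the coefficients are typed in from the Results section rather than read from an OUTEST= data set.

```sas
/* Hypothetical coefficient data set: one row, values from the Results section */
data coef;
  intercept = 1.0972;
  cpubhous  = 1.4513;
  cbigcit   = 0.3778;
  cwagehd   = -0.2556;
  cedhd     = -0.0964;
run;

/* Invented toy observations, for illustration only */
data toy;
  input pubhouse bigcit wagehd edhd;
  datalines;
1 1 4 10
0 1 12 16
0 0 7 12
1 0 2 9
;
run;

data toy_pr;
  /* Read the coefficient row once. Variables read with SET are retained,
     so the coefficients stay available on every later observation. */
  if _n_ = 1 then set coef;
  set toy;

  /* a + xb at each person's observed values */
  xb = intercept + cpubhous*pubhouse + cbigcit*bigcit
       + cwagehd*wagehd + cedhd*edhd;

  /* Force everyone out of, then into, public housing */
  xb_nopub = xb - cpubhous*pubhouse;
  xb_pub   = xb_nopub + cpubhous;

  pr_nopub = exp(xb_nopub) / (1 + exp(xb_nopub));
  pr_pub   = exp(xb_pub)   / (1 + exp(xb_pub));
run;

/* Average predicted probability under each condition */
proc means data=toy_pr mean;
  var pr_nopub pr_pub;
run;
```

Applied to the program above, the same idiom would read: data g; if _n_ = 1 then set f; set cc; followed by the xb_ and pr_ statements. In that form the merge variable a would no longer be needed.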