Your Name (Please print):

Did you agree to take the optional portion of the final exam?   Yes   No
(Your online answer will be used to verify your response.)

Directions

There are two parts to the final exam. Everyone must take the first part of the exam, which includes 20 questions. Students who elected to take the optional portion of the final exam are required to complete the second part. For those who elected to take the optional portion, your second midterm score will be recalculated as the average of your original score and your score on the optional portion of the exam.

All questions are worth 4 points each unless indicated otherwise. Place all answers in the space provided below or within each question. Round all numerical answers to the nearest hundredth (e.g. 1.23) unless told otherwise. The formula sheet and tables with the standard normal CDF and the F and chi-squared distributions will be provided as a separate document.
Using data from the 2015 Current Population Survey, I estimated a regression of workers' hourly wages as a function of their age, years of schooling, and a dummy variable indicating whether the person is female.

    . reg hrwage age age2 _school female

          Source |       SS           df       MS      Number of obs   =    97,330
    -------------+----------------------------------   F(4, 97325)     =   6357.59
           Model |  1897494.59         4  474373.646   Prob > F        =    0.0000
        Residual |  7261940.53    97,325  74.6153663   R-squared       =    0.2072
    -------------+----------------------------------   Adj R-squared   =    0.2071
           Total |  9159435.11    97,329  94.1079751   Root MSE        =     8.638

    ------------------------------------------------------------------------------
          hrwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .7363723   .0105456    69.83   0.000     .7157031    .7570416
            age2 |  -.0069645   .0001226   -56.83   0.000    -.0072047   -.0067243
         _school |   1.336863   .0119937   111.46   0.000     1.313355     1.36037
          female |  -2.929198   .0557887   -52.51   0.000    -3.038544   -2.819853
           _cons |  -16.33226   .2480879   -65.83   0.000    -16.81851   -15.84601
    ------------------------------------------------------------------------------

    . predict uhat, res
    . gen uhat2=uhat^2
    . reg uhat2 age age2 _school female

          Source |       SS           df       MS      Number of obs   =    97,330
    -------------+----------------------------------   F(4, 97325)     =   1052.66
           Model |   275992027         4  68998006.8   Prob > F        =    0.0000
        Residual |  6.3793e+09    97,325  65546.5201   R-squared       =    0.0415
    -------------+----------------------------------   Adj R-squared   =    0.0414
           Total |  6.6553e+09    97,329   68379.487   Root MSE        =    256.02

    ------------------------------------------------------------------------------
           uhat2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   4.307184   .3125586    13.78   0.000     3.694572    4.919795
            age2 |  -.0314069   .0036323    -8.65   0.000    -.0385261   -.0242877
         _school |   18.99781   .3554801    53.44   0.000     18.30108    19.69455
          female |  -28.98642   1.653511   -17.53   0.000    -32.22729   -25.74556
           _cons |  -274.5837   7.353029   -37.34   0.000    -288.9955   -260.1718
    ------------------------------------------------------------------------------

1. Based on the above regressions, the value of the chi-squared test statistic for the null hypothesis that the residuals in the wage equation are homoskedastic is 4039.195 and the test statistic has a chi-squared distribution with 4 degrees of freedom.

2. The null hypothesis that the residuals in the wage equation are homoskedastic is rejected at the .05 level if the chi-squared statistic is (greater, less) than 9.49.

3. (6 points) If you wanted to perform the simple form of the White test for heteroskedasticity for the above wage equation,

   a. What regression command would you execute in Stata? (Use the variable names defined above to be sure there is no ambiguity in your answer.)

      reg uhat2 yhat yhat2

      where yhat and yhat2 are the predicted hourly wage and its square from the hrwage regression.
   b. What test statement would you issue after the above regression?

      test yhat=yhat2=0

      What would be the distribution of this test statistic (e.g. F-distribution, chi-squared, normal)? How many degrees of freedom?

      F(2, 97327), since the auxiliary regression uses 97,330 observations and estimates 3 coefficients (yhat, yhat2, and the constant).

4. Based on the above regressions, what age maximizes the variance of the residuals (round your answer to the nearest year of age)?

   4.307/(2*.0314) = 68.58, which rounds to 69 years.

5. If the null hypothesis of homoskedasticity is rejected,
   a. The standard OLS estimates of the coefficients remain unbiased
   b. The standard OLS estimates of the standard errors for the coefficients are incorrect
   c. The standard OLS estimates of the coefficients are no longer efficient
   d. All of the above
   e. Only b and c

6. Suppose you have data on total employment by county and you estimate the following regression:

   $$\text{emp}_i = \alpha_0 + \alpha_1\,\text{pop}_i + \alpha_2\,(\text{pop}_i \times \text{age}_i) + e_i$$

   where the subscript $i$ indexes the county, emp represents total employment in the county, pop represents the total population, and age is the average age of people living in the county. The residuals ($e$) are likely to be
   a. Heteroskedastic and their variance will be greater in counties with larger populations.
   b. Homoskedastic and their variance will be greater in counties with larger populations.
   c. Heteroskedastic and their variance will be greater in counties with smaller populations.
   d. Homoskedastic and their variance will be greater in counties with smaller populations.

7. If the fraction of the population that works is lower among older workers, but employment always increases with population, we would expect
   a. $\alpha_1 + \alpha_2 > 0$
   b. $\alpha_2 < 0$
   c. Both of the above
   d. None of the above.
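The arithmetic behind questions 1, 2, and 4 can be checked with a short Python sketch. The numbers are copied from the auxiliary regression of uhat2 and from the 9.49 critical value quoted in question 2; the variable names are illustrative only. Using the full-precision coefficients gives 68.57 rather than the 68.58 obtained from rounded coefficients, and both round to 69 years.

```python
# Values copied from the auxiliary regression of uhat2 on the regressors.
n_obs = 97330          # Number of obs
r_squared_aux = 0.0415 # R-squared of the uhat2 regression

# LM form of the heteroskedasticity test: n * R^2, chi-squared with
# degrees of freedom equal to the number of slope coefficients (4).
lm_stat = n_obs * r_squared_aux
chi2_critical = 9.49   # 5% critical value with 4 df, as quoted in question 2
reject_homoskedasticity = lm_stat > chi2_critical

# Residual variance is quadratic in age: b_age * age + b_age2 * age^2.
# With b_age2 < 0 the turning point -b_age / (2 * b_age2) is a maximum.
b_age = 4.307184
b_age2 = -0.0314069
age_max_variance = -b_age / (2 * b_age2)

print(f"LM statistic: {lm_stat:.3f}")
print(f"Reject homoskedasticity at 5%: {reject_homoskedasticity}")
print(f"Variance-maximizing age: {age_max_variance:.2f}")
```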
To answer the next several questions, consider the following data drawn from an article by Li, Yi, and Zhang (2011).¹ The article was aimed at studying whether the introduction of China's one-child policy caused the male-female sex ratio in China to increase. The one-child policy started in 1980, but it applied to the Han Chinese and not to minorities. The table below provides the percentage of newborns that are male for the Han and minorities. The pre-policy period is 1973-1979; the post-policy period (i.e. after the one-child policy was implemented) is 1980-1990.

    Birth cohort    Han (% male)    Minorities (% male)    Sample size
    1973-1979       51.53           51.49                  1,521,563
    1980-1990       52.34           51.24                  2,334,926

8. (6 points) What is the diff-in-diff estimate of the effect of the one-child policy on the fraction of births that are male? Be sure to indicate whether the one-child policy increased or decreased the fraction of births that are male.

   (52.34 - 51.53) - (51.24 - 51.49) = 0.81 - (-0.25) = 1.06 (i.e. the one-child policy increased the percent male by 1.06 percentage points).

9. Some commentators suggested that an outbreak of hepatitis B in the 1980s may have caused the percentage of male births in China to increase.

   a. (4 points) Assuming that hepatitis B does increase the probability of male births, under what conditions would the outbreak of hepatitis B cause NO bias in the diff-in-diff estimate of the impact of the one-child policy on male births?

      No bias would be created if hepatitis B caused the same increase in the probability of male births for the Han and minorities, since such effects would be removed by the diff-in-diff estimator.

   b. (4 points) Assuming that hepatitis B does increase the probability of male births, under what conditions would the outbreak of hepatitis B cause an UPWARD bias in the diff-in-diff estimate of the impact of the one-child policy on male births?

      It would cause an upward bias in the diff-in-diff estimate if hepatitis B caused a larger increase in the probability of male births among the Han (the treated group).

¹ Li, Hongbin, Junjian Yi, and Junsen Zhang. "Estimating the effect of the one-child policy on the sex ratio imbalance in China: identification based on the difference-in-differences." Demography 48, no. 4 (2011): 1535-1557.
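The diff-in-diff arithmetic in question 8 and the bias logic in question 9 can be illustrated with a minimal Python sketch. The four cell values come from the table; the `shock_han` and `shock_minority` arguments are hypothetical post-period shifts (in percentage points) added purely to show how a common shock cancels and an unequal shock does not.

```python
# Percent of births that are male, by group and period (from the table).
pct_male = {
    ("han", "pre"): 51.53,
    ("han", "post"): 52.34,
    ("minority", "pre"): 51.49,
    ("minority", "post"): 51.24,
}

def diff_in_diff(cells, shock_han=0.0, shock_minority=0.0):
    """Change for the treated group minus change for the control group.

    shock_han and shock_minority are hypothetical extra post-period shifts
    used only to illustrate when a second event biases the estimate.
    """
    han_change = (cells[("han", "post")] + shock_han) - cells[("han", "pre")]
    minority_change = (
        cells[("minority", "post")] + shock_minority
    ) - cells[("minority", "pre")]
    return han_change - minority_change

did = diff_in_diff(pct_male)
did_common_shock = diff_in_diff(pct_male, shock_han=0.3, shock_minority=0.3)
did_larger_han_shock = diff_in_diff(pct_male, shock_han=0.5, shock_minority=0.2)

print(f"Diff-in-diff estimate: {did:.2f}")
print(f"With an equal shock to both groups: {did_common_shock:.2f}")
print(f"With a larger shock to the Han: {did_larger_han_shock:.2f}")
```

An equal shock leaves the estimate unchanged, while a shock that is 0.3 points larger for the Han raises the estimate by the same 0.3 points.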
10. (6 points) The authors used a linear probability model to estimate their diff-in-diff equation. The dependent variable was a dummy that equals one if a birth was male and zero if it was female. They had data on nearly 4 million births spanning 1973-1990. Label the dummy indicating a male child as $M_i$. Write out the regression model that you would estimate to obtain the diff-in-diff estimate. Be sure to define any variables that are included in your regression and be sure to point out which coefficient is the estimate of the diff-in-diff effect of the one-child policy on the probability that a child is male.

    $$M_i = \alpha_0 + \alpha_1\,\text{After}_i + \alpha_2\,\text{Han}_i + \alpha_3\,(\text{Han}_i \times \text{After}_i) + u_i$$

    where $\text{After}_i$ is a dummy that equals one for any birth in 1980 or later (the post-policy period), and $\text{Han}_i$ is a dummy that equals one for any Han birth. $\alpha_3$ is the diff-in-diff estimate of the effect of the one-child policy on the probability that the child is male.

11. (6 points) If you estimate the above linear probability model, why shouldn't you use the standard OLS estimates for the standard errors? What should you do instead?

    You shouldn't use the standard OLS estimates for the standard errors because the LPM has heteroskedasticity built into the error term. The variance of the residual is $p_i(1-p_i)$, where $p_i$ is the probability of a yes (i.e. a male child). The model should be estimated either with robust standard errors or by weighted least squares. WLS should use weights of $1/[p_i(1-p_i)]$, provided all of the predicted probabilities lie strictly within the unit interval.
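The variance formula and WLS weights in question 11 can be sketched in Python as follows. The fitted probabilities in `fitted` are hypothetical values chosen for illustration, and the function names are not from any library.

```python
def lpm_error_variance(p):
    """Conditional variance of the LPM error when P(y = 1) equals p."""
    return p * (1.0 - p)

def wls_weight(p):
    """WLS weight 1 / [p(1 - p)]; undefined unless 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError(f"fitted probability {p} lies outside (0, 1)")
    return 1.0 / lpm_error_variance(p)

# Hypothetical fitted probabilities, for illustration only.
fitted = [0.50, 0.52, 0.90]
variances = [lpm_error_variance(p) for p in fitted]
weights = [wls_weight(p) for p in fitted]

for p, v, w in zip(fitted, variances, weights):
    print(f"p = {p:.2f}  variance = {v:.4f}  weight = {w:.4f}")
```

The error variance changes with the fitted probability, which is why the usual OLS standard errors are not valid. A fitted value of 0 or below, or 1 or above, makes the weight undefined, so robust standard errors are the fallback in that case.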
You must answer 5 of the last 6 questions. Each question is worth 6 points. Write SKIPPED in the one question you do not answer. If you answer all 6, I will grade the first 5.

A financial services company is interested in understanding whether a financial education seminar increases the amount that people save. To address this question, they collect data on the percentage of workers' earnings that they contribute to their 401k pension plan along with information about the workers. The regression they estimate is:

$$S_i = \beta_0 + \beta_1\,\text{sem}_i + \beta_2\,\text{age}_i + \beta_3\,\text{fem}_i + \beta_4\,\text{educ}_i + e_i \qquad (1)$$

where $S_i$ is the percentage of the worker's income that is saved (i.e. put into the 401k plan), sem is a dummy variable indicating whether the worker attended the seminar, fem is a dummy variable indicating whether the person is female, and educ is the worker's number of years of formal schooling. In considering the questions below, assume that each worker in the sample has the opportunity to attend the seminar, but the choice to attend is voluntary.

12. (6 points) Explain why a simple OLS regression would likely suffer from an endogeneity problem that would cause a biased estimate of $\beta_1$. Be sure to discuss whether you believe the bias will be positive or negative and why.

    Since people are allowed to decide whether to attend the seminar, it is likely that $\text{cov}(\text{sem}_i, e_i) \neq 0$. For example, if people who have unobserved characteristics that cause them to save more are more likely to attend the seminar, then $\text{cov}(\text{sem}_i, e_i) > 0$ and the estimated effect of the seminar on saving ($\beta_1$) will be biased upward.

13. (6 points) Suppose that the financial planning seminars were held on weekends and workers would have to travel from home to attend. If you have data for each worker on the distance they would have to travel ($D_i$), would this be an appropriate instrument for attendance at the seminar? Be sure that you precisely define the necessary conditions for distance to be an appropriate instrument and whether these properties are likely to be satisfied in this case.

    The two necessary conditions are that:
    (i) $\text{cov}(D_i, \text{sem}_i) \neq 0$ and
    (ii) $\text{cov}(D_i, e_i) = 0$.
    We expect (i) to be true because those who live farther away will be less likely to attend the seminar. We expect (ii) to be true unless there is some reason to believe that workers' decisions about how far to live from the seminar location are systematically related to their saving preferences.
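The two instrument conditions in question 13, and the upward OLS bias from question 12, can be illustrated with a small simulation. This is a sketch under strong assumptions that are not in the exam: the data-generating process, the true effect of 2.0, and the variable names are all invented, and the covariates age, fem, and educ are left out so that the IV estimator reduces to cov(D, S) / cov(D, sem).

```python
import random

random.seed(0)
TRUE_BETA1 = 2.0   # assumed true effect of the seminar on saving
N = 20000

def cov(x, y):
    """Sample covariance (dividing by n)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

dist, sem, err, saving = [], [], [], []
for _ in range(N):
    taste = random.gauss(0, 1)        # unobserved taste for saving
    d = random.gauss(0, 1)            # standardized distance, independent of taste
    # Savers and workers living nearby are more likely to attend.
    attend = 1 if taste - d + random.gauss(0, 1) > 0 else 0
    e = taste + random.gauss(0, 1)    # structural error contains taste
    dist.append(d)
    sem.append(attend)
    err.append(e)
    saving.append(TRUE_BETA1 * attend + e)

relevance = cov(dist, sem)     # condition (i): should be nonzero (negative here)
exogeneity = cov(dist, err)    # condition (ii): should be close to zero
ols_estimate = cov(sem, saving) / cov(sem, sem)
iv_estimate = cov(dist, saving) / cov(dist, sem)

print(f"cov(D, sem) = {relevance:.3f}")
print(f"cov(D, e)   = {exogeneity:.3f}")
print(f"OLS estimate of beta1: {ols_estimate:.2f}")
print(f"IV estimate of beta1:  {iv_estimate:.2f}")
```

In this setup OLS overstates the effect because attendance is positively correlated with the unobserved taste for saving, while the IV estimate stays close to the assumed true value. In real data condition (ii) cannot be checked directly, because the structural error is unobserved.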
14. (6 points) Assuming distance is an appropriate instrument for seminar attendance, describe the 2 steps of the two stage least squares process that you would use to estimate the regression in (1). Use the variable names already defined in describing the two stage process.

    Step 1: Estimate the following reduced form regression for seminar attendance:

    $$\text{sem}_i = \pi_0 + \pi_1\,\text{age}_i + \pi_2\,\text{fem}_i + \pi_3\,\text{educ}_i + \pi_4\,D_i + v_i$$

    Step 2: Estimate the original structural form model in (1) after replacing the endogenous variable with its predicted value from Step 1:

    $$S_i = \beta_0 + \beta_1\,\widehat{\text{sem}}_i + \beta_2\,\text{age}_i + \beta_3\,\text{fem}_i + \beta_4\,\text{educ}_i + e_i$$

(6 points) Suppose that you have panel data on worker saving that includes years before and after the financial planning seminars are offered. Explain how the use of panel data and a worker-specific fixed effects model could eliminate the endogeneity bias in OLS estimation of the savings equation in (1).

    Reconsider the original regression in (1) as the panel model below, where the subscript $i$ indexes the person and $t$ indexes the time period:

    $$S_{it} = \beta_0 + \beta_1\,\text{sem}_{it} + \beta_2\,\text{age}_{it} + \beta_3\,\text{fem}_{it} + \beta_4\,\text{educ}_{it} + a_i + e_{it} \qquad (1a)$$

    Suppose that the unobserved savings preferences of an individual can be captured in an individual fixed effect ($a_i$) that does not change over time. Panel data allow us to difference out this effect and eliminate any endogeneity bias that was caused by $\text{cov}(\text{sem}_{it}, a_i) \neq 0$. That is, with panel data that include fixed effects for the individual, we can estimate (1a) as

    $$S^*_{it} = \beta_1\,\text{sem}^*_{it} + \beta_2\,\text{age}^*_{it} + \beta_4\,\text{educ}^*_{it} + e^*_{it}$$

    where the * superscript indicates that the variable is measured as a deviation from the $i$-specific (person-specific) mean over the panel. Notice that taking deviations from means causes $a_i$ to be differenced out of the regression, along with the intercept and the time-invariant $\text{fem}_{it}$, and thus any bias caused by $\text{cov}(\text{sem}_{it}, a_i) \neq 0$ in OLS will now be eliminated.
15. If you estimate the savings equation in (1) with panel data and fixed effects for each worker, can you still include the female dummy variable in your regression? Why or why not?

    The female dummy can no longer be included because deviations from individual-specific means would always be zero for time-invariant variables.
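The within (deviation-from-means) transformation used in the fixed effects answers can be sketched in Python on a tiny noiseless panel. Everything in the data is hypothetical: two workers observed for two periods, an assumed true seminar effect of 2.0, and fixed effects of 5 and -3. Only sem, fem, and the fixed effect enter the saving equation, so age and educ are omitted for brevity.

```python
from collections import defaultdict

TRUE_BETA1 = 2.0  # assumed true effect of the seminar

# Hypothetical panel rows: (worker id, sem, fem, fixed effect a_i).
rows = [
    (1, 0, 1, 5.0),
    (1, 1, 1, 5.0),
    (2, 0, 0, -3.0),
    (2, 0, 0, -3.0),
]
panel = [
    {"id": i, "sem": sem, "fem": fem, "S": 1.0 + TRUE_BETA1 * sem + a}
    for i, sem, fem, a in rows
]

def demean_within(data, var):
    """Return var as deviations from each worker's own mean."""
    groups = defaultdict(list)
    for row in data:
        groups[row["id"]].append(row[var])
    means = {i: sum(v) / len(v) for i, v in groups.items()}
    return [row[var] - means[row["id"]] for row in data]

def slope(x, y):
    """Bivariate OLS slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

sem_star = demean_within(panel, "sem")
s_star = demean_within(panel, "S")
fem_star = demean_within(panel, "fem")

pooled_slope = slope([r["sem"] for r in panel], [r["S"] for r in panel])
fe_slope = slope(sem_star, s_star)

print(f"Pooled OLS slope: {pooled_slope:.2f}")
print(f"Fixed effects slope: {fe_slope:.2f}")
print(f"Demeaned fem: {fem_star}")
```

Pooled OLS is pushed well above the assumed effect because the worker who attends also has the larger fixed effect, while the within estimator recovers 2.0. The demeaned female dummy is zero in every row, so it has no variation left to identify a coefficient.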