Monday October 3 10:11:57 2011   Page 1

      name:  <unnamed>
       log:  M:\pc\Dokumenter\ECON4135\Extra_sem\Extraseminar.smcl
  log type:  smcl
 opened on:   3 Oct 2011, 10:06:48

. /*
ECON4135 Applied Statistics and Econometrics
Autumn 2011
STATA course
UiO

Problem set for the extra seminar on week 39

Exercise 1

Consider the Education Box on page 202 of the textbook (Chapter 5). It claims that the earnings of workers with more education are more spread out than those of workers with less education, and that this results in heteroskedastic error terms when we regress hourly earnings on education. In this exercise, we provide more explicit justification for some of the claims in the Education Box.

First, save the Excel dataset and the file that describes the data by following these instructions:
1. Go to the course homepage: Econ4135 Autumn 2011.
2. Click on the Lecture and seminar plan link.
3. When the pdf file opens, click on the Student resource link and allow the connection.
4. On the textbook website that opens, click on Datasets for Replicating Empirical on the left column.
5. Choose Data Description and Excel Data Set under Economic value of a Year of Education Box, and save these files in a local folder.

NOTE: The dataset also comes in Stata format, which is simpler to save and load into Stata directly. To practice loading a dataset in Excel format into Stata, we use the Excel version.
*/

. /*
a) Open the dataset in an Excel sheet. Save the dataset as a tab-delimited text file: name_of_file.txt. The txt ending is important here. Load this txt file into Stata with the command insheet. Remember to include the directory path with the name of the txt file when you use the insheet command, so that Stata can find the file.
*/

. insheet using "M:\pc\Dokumenter\ECON4135\Extra_sem\ch5_cps_box.txt"
(4 vars, 2989 obs)
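On newer Stata versions, insheet has been superseded by import delimited (Stata 13 and later), and import excel (Stata 12 and later) can read the workbook without the intermediate text file. A minimal sketch of both routes for part a), assuming the same folder as above; the .xls file name and the presence of variable names in the first row are assumptions, not taken from the exercise:

```stata
* Sketch only: alternatives to insheet on newer Stata versions.

* Route 1: read the tab-delimited text file saved in part a).
* varnames(1) assumes the first row holds the variable names.
import delimited using "M:\pc\Dokumenter\ECON4135\Extra_sem\ch5_cps_box.txt", ///
    delimiters("\t") varnames(1) clear

* Route 2: read the Excel workbook directly.
* The file name ch5_cps_box.xls is hypothetical; firstrow treats row 1 as names.
import excel using "M:\pc\Dokumenter\ECON4135\Extra_sem\ch5_cps_box.xls", ///
    firstrow clear

* Inspect the storage types afterwards, as in part c).
describe
```

Either route still leaves the variable types to be checked with describe, since a decimal comma in the source file can cause a numeric column to be read as a string.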
. /*
b) Use the information in the file that describes the dataset to label the variables. (HINT: Stata command label var)
*/

. label var a_age "age"

. label var a_sex "1 if male; 2 if female"

. label var ahe "Average Hourly Earnings in 2004"

. label var yrseduc "Years of Education"

. /*
c) What are the types (i.e. string or numeric) of the variables in this dataset? Convert all the variables to numeric format. (HINT: Stata command destring; pay particular attention to the option dpcomma; you may also want to use the replace option)
*/

. describe

Contains data
  obs:         2,989
 vars:             4
 size:        65,758 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
a_age           byte   %8.0g                  age
a_sex           byte   %8.0g                  1 if male; 2 if female
ahe             str11  %11s                   Average Hourly Earnings in 2004
yrseduc         byte   %8.0g                  Years of Education
------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. destring ahe, replace dpcomma
ahe has all characters numeric; replaced as double

. /*
d) Are there any missing values in this dataset? If yes, in which variable(s)?
*/

. gen x=1 if missing(a_age) | missing(a_sex) | missing(ahe) | missing(yrseduc)
(2989 missing values generated)

. tab x, missing

          x |      Freq.     Percent        Cum.
------------+-----------------------------------
          . |      2,989      100.00      100.00
------------+-----------------------------------
      Total |      2,989      100.00

. drop x

. /*
e) Consider the variable a_sex. Create a dummy variable called male that takes the value 1 if a_sex refers to a male and 0 otherwise. Label the values of the variable male by using the Stata command label.
*/
. gen male=1 if a_sex==1
(1331 missing values generated)

. replace male=0 if missing(male)
(1331 real changes made)

. label define male_label 1 "male" 0 "female"

. label values male male_label

. tab male, missing

       male |      Freq.     Percent        Cum.
------------+-----------------------------------
     female |      1,331       44.53       44.53
       male |      1,658       55.47      100.00
------------+-----------------------------------
      Total |      2,989      100.00

. /*
f) Replicate the regression results (5.23) and figure (5.3) in the Education Box on page 202.
*/

. regress ahe yrseduc

      Source |       SS       df       MS              Number of obs =    2989
-------------+------------------------------           F(  1,  2987) =  563.83
       Model |  50838.7955     1  50838.7955           Prob > F      =  0.0000
    Residual |  269329.882  2987  90.1673527           R-squared     =  0.1588
-------------+------------------------------           Adj R-squared =  0.1585
       Total |  320168.678  2988  107.151499           Root MSE      =  9.4956

------------------------------------------------------------------------------
         ahe |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     yrseduc |   1.756384   .0739685    23.75   0.000      1.61135    1.901418
       _cons |  -5.375077   1.032576    -5.21   0.000     -7.39971   -3.350444
------------------------------------------------------------------------------

. twoway (scatter ahe yrseduc) (lfit ahe yrseduc)

. /*
g) Calculate the predicted values from the above regression by using the Stata command predict with the option xb (e.g. predict ahe_hat, xb). Notice that this command works only if you have already run the regression. Then generate the residuals of the regression (e.g. gen resid=ahe-ahe_hat).
*/

. predict ahe_hat, xb

. gen resid=ahe-ahe_hat

. /*
h) To verify the claims in the last paragraph of the Education Box, find the standard deviation of the residuals for the following subgroups:
- those with 10 years of education
- those with a high school diploma (i.e. 12 years of education)
- those with a college degree (i.e. 16 years of education)
*/
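. sum resid if yrseduc==10

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       resid |        39   -.7623601    4.335933  -8.983635   9.664384

. sum resid if yrseduc==12

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       resid |       869   -.5659396    7.302097  -13.65025   35.58052

. sum resid if yrseduc==16

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       resid |       727    .9821413    12.25246  -18.88091   60.44601

. /*
i) Run a regression of earnings on years of education (just like (5.23) in the Education Box) only for men. Does the distribution of earnings for men spread out as education increases?
*/

. regress ahe yrseduc if male==1

      Source |       SS       df       MS              Number of obs =    1658
-------------+------------------------------           F(  1,  1656) =  298.61
       Model |  31845.7695     1  31845.7695           Prob > F      =  0.0000
    Residual |   176609.39  1656  106.648182           R-squared     =  0.1528
-------------+------------------------------           Adj R-squared =  0.1523
       Total |  208455.159  1657  125.802752           Root MSE      =  10.327

------------------------------------------------------------------------------
         ahe |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     yrseduc |   1.896083   .1097257    17.28   0.000     1.680867    2.111298
       _cons |  -5.373537   1.498677    -3.59   0.000    -8.313038   -2.434036
------------------------------------------------------------------------------

. *if condition is important
. predict ahe_hat_men if male==1, xb
(1331 missing values generated)

. *if condition is VERY important
. gen resid_men=ahe-ahe_hat_men if male==1
(1331 missing values generated)

. sum resid_men if yrseduc==10 & male==1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   resid_men |        26   -.4358844    3.853732  -4.452677   8.265856

. sum resid_men if yrseduc==12 & male==1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   resid_men |       530   -.6358775    7.896357  -15.08681    33.9026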
. sum resid_men if yrseduc==16 & male==1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
   resid_men |       371    1.528898    13.82425  -20.84291   58.20929

. /*
Exercise 2

There is a typing mistake in the book. Two exercises are numbered E5.2 on page 215. Do both!

To access the data that these exercises refer to, follow these instructions:
1. Go to the course homepage: Econ4135 Autumn 2011.
2. Click on the Lecture and seminar plan link.
3. When the pdf file opens, click on the Student resource link and allow the connection.
4. On the textbook website that opens, click on Data for Empirical Exercises and Test on the left column.
5. Choose College Distance and Teacher Ratings Data from the list of data. Save the Stata version of the datasets in a local folder so that you can open them with Stata. Use the PDF files that describe the datasets to become acquainted with them.

Here is the command that loads a dataset (of Stata type) into Stata:
use filename.dta, clear
The dta ending of the filename is important. filename should also include the directory path that tells Stata where to find the saved file.
*/

. *E5.2 (the first)
. use "M:\pc\Dokumenter\ECON4135\Extra_sem\TeachingRatings.dta", clear

. reg course_eval beauty

      Source |       SS       df       MS              Number of obs =     463
-------------+------------------------------           F(  1,   461) =   17.08
       Model |  5.08300731     1  5.08300731           Prob > F      =  0.0000
    Residual |  137.155613   461  .297517598           R-squared     =  0.0357
-------------+------------------------------           Adj R-squared =  0.0336
       Total |   142.23862   462  .307875801           Root MSE      =  .54545

------------------------------------------------------------------------------
 course_eval |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      beauty |   .1330014   .0321775     4.13   0.000     .0697687    .1962342
       _cons |   3.998272   .0253493   157.73   0.000     3.948458    4.048087
------------------------------------------------------------------------------

. *E5.2 (the second)
. use "M:\pc\Dokumenter\ECON4135\Extra_sem\CollegeDistance.dta", clear

. *doing the calculations
. regress ed dist

      Source |       SS       df       MS              Number of obs =    3796
-------------+------------------------------           F(  1,  3794) =   28.48
       Model |  93.0256754     1  93.0256754           Prob > F      =  0.0000
    Residual |  12394.3568  3794    3.266831           R-squared     =  0.0074
-------------+------------------------------           Adj R-squared =  0.0072
       Total |  12487.3825  3795  3.29048287           Root MSE      =  1.8074

------------------------------------------------------------------------------
          ed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dist |  -.0733727   .0137498    -5.34   0.000    -.1003304    -.046415
       _cons |   13.95586   .0377241   369.95   0.000     13.88189    14.02982
------------------------------------------------------------------------------
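. tab female, missing

     female |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,726       45.47       45.47
          1 |      2,070       54.53      100.00
------------+-----------------------------------
      Total |      3,796      100.00

. bysort female: regress ed dist

------------------------------------------------------------------------------
-> female = 0

      Source |       SS       df       MS              Number of obs =    1726
-------------+------------------------------           F(  1,  1724) =   17.28
       Model |  56.8824751     1  56.8824751           Prob > F      =  0.0000
    Residual |  5674.39505  1724  3.29141244           R-squared     =  0.0099
-------------+------------------------------           Adj R-squared =  0.0094
       Total |  5731.27752  1725  3.32247972           Root MSE      =  1.8142

------------------------------------------------------------------------------
          ed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dist |   -.083837   .0201668    -4.16   0.000    -.1233911    -.044283
       _cons |   13.97899   .0559295   249.94   0.000     13.86929    14.08869
------------------------------------------------------------------------------

------------------------------------------------------------------------------
-> female = 1

      Source |       SS       df       MS              Number of obs =    2070
-------------+------------------------------           F(  1,  2068) =   11.64
       Model |  37.8250449     1  37.8250449           Prob > F      =  0.0007
    Residual |  6718.21795  2068  3.24865471           R-squared     =  0.0056
-------------+------------------------------           Adj R-squared =  0.0051
       Total |    6756.043  2069  3.26536636           Root MSE      =  1.8024

------------------------------------------------------------------------------
          ed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dist |  -.0641676   .0188052    -3.41   0.001    -.1010466   -.0272885
       _cons |   13.93587   .0511233   272.59   0.000     13.83561    14.03613
------------------------------------------------------------------------------

. *Coefficient of interest
. display -0.084-(-0.064)
-.02

. *Variance of the coefficient
. display 0.0201668*0.0201668+ 0.0188052*0.0188052
.00076034

. *St. error
. display sqrt(0.00076034)
.02757426

. *t statistic
. display -0.02/0.028
-.71428571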
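The hand calculation above is a t statistic for the difference between two slope estimates obtained from separate samples. As a sketch, assuming the male and female subsamples are independent (so the covariance between the two slope estimators is zero) and plugging in the unrounded coefficients and standard errors from the two regressions:

```latex
t = \frac{\hat\beta_{\text{male}} - \hat\beta_{\text{female}}}
         {\sqrt{SE(\hat\beta_{\text{male}})^2 + SE(\hat\beta_{\text{female}})^2}}
  = \frac{-0.083837 - (-0.0641676)}{\sqrt{0.0201668^2 + 0.0188052^2}}
  = \frac{-0.0196694}{0.0275743}
  \approx -0.713
```

The value -.71428571 displayed in the log differs slightly because its inputs were rounded to three decimals. In either case |t| is well below 1.96, so the difference between the two slopes is not statistically significant at the 5% level.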
. *Letting Stata do the calculations
. *females
. regress ed dist if female==1

      Source |       SS       df       MS              Number of obs =    2070
-------------+------------------------------           F(  1,  2068) =   11.64
       Model |  37.8250449     1  37.8250449           Prob > F      =  0.0007
    Residual |  6718.21795  2068  3.24865471           R-squared     =  0.0056
-------------+------------------------------           Adj R-squared =  0.0051
       Total |    6756.043  2069  3.26536636           Root MSE      =  1.8024

------------------------------------------------------------------------------
          ed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dist |  -.0641676   .0188052    -3.41   0.001    -.1010466   -.0272885
       _cons |   13.93587   .0511233   272.59   0.000     13.83561    14.03613
------------------------------------------------------------------------------

. matrix b=e(b)

. matrix V=e(V)

. matrix list b

b[1,2]
          dist       _cons
y1  -.06416757   13.935867

. matrix list V

symmetric V[2,2]
             dist       _cons
 dist   .00035364
_cons  -.00060767    .0026136

. scalar coef_female=b[1,1]

. scalar var_female=V[1,1]

. display coef_female
-.06416757

. display var_female
.00035364

. *males
. regress ed dist if female==0

      Source |       SS       df       MS              Number of obs =    1726
-------------+------------------------------           F(  1,  1724) =   17.28
       Model |  56.8824751     1  56.8824751           Prob > F      =  0.0000
    Residual |  5674.39505  1724  3.29141244           R-squared     =  0.0099
-------------+------------------------------           Adj R-squared =  0.0094
       Total |  5731.27752  1725  3.32247972           Root MSE      =  1.8142

------------------------------------------------------------------------------
          ed |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        dist |   -.083837   .0201668    -4.16   0.000    -.1233911    -.044283
       _cons |   13.97899   .0559295   249.94   0.000     13.86929    14.08869
------------------------------------------------------------------------------
. matrix b=e(b)

. matrix V=e(V)

. scalar coef_male=b[1,1]

. scalar var_male=V[1,1]

. display coef_male
-.08383705

. display var_male
.0004067

. scalar t_stat=(coef_male-coef_female)/sqrt(var_male+var_female)

. display t_stat
-.71332914

. /*
Exercise 3

Repeat the exercise from the last seminar. Use the nlsw88.dta data. (Remember that to load this dataset you need to type: sysuse nlsw88.dta)
*/

. sysuse nlsw88.dta
(NLSW, 1988 extract)

. *1. What is the mean of schooling grade in the sample?
. sum grade, detail

                   current grade completed
-------------------------------------------------------------
      Percentiles      Smallest
 1%            7              0
 5%            9              0
10%           11              4       Obs                2244
25%           12              4       Sum of Wgt.        2244

50%           12                      Mean           13.09893
                        Largest       Std. Dev.      2.521246
75%           15             18
90%           17             18       Variance       6.356682
95%           18             18       Skewness       .0469717
99%           18             18       Kurtosis       3.615168

. *2. Are there any missing values?
. tab grade, missing

    current |
      grade |
  completed |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          2        0.09        0.09
          4 |          3        0.13        0.22
          5 |          1        0.04        0.27
          6 |         14        0.62        0.89
          7 |         19        0.85        1.74
          8 |         33        1.47        3.21
          9 |         55        2.45        5.65
         10 |         84        3.74        9.39
         11 |        123        5.48       14.87
         12 |        943       41.99       56.86
         13 |        176        7.84       64.69
         14 |        187        8.33       73.02
         15 |         92        4.10       77.11
         16 |        252       11.22       88.33
         17 |        106        4.72       93.05
         18 |        154        6.86       99.91
          . |          2        0.09      100.00
------------+-----------------------------------
      Total |      2,246      100.00

. *3. What is the age of those with missing values in education?
. tab age if missing(grade)

     age in |
    current |
       year |      Freq.     Percent        Cum.
------------+-----------------------------------
         37 |          1       50.00       50.00
         41 |          1       50.00      100.00
------------+-----------------------------------
      Total |          2      100.00

. *4. Are they union workers?
. tab union if missing(grade)

union worker |      Freq.     Percent        Cum.
-------------+-----------------------------------
    nonunion |          1       50.00       50.00
       union |          1       50.00      100.00
-------------+-----------------------------------
       Total |          2      100.00

. *5. Do they live in the south?
. sum south if missing(grade)

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       south |         2          .5    .7071068          0          1

. *6. Do they live in smsa?
. sum smsa if missing(grade)

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        smsa |         2          .5    .7071068          0          1

. *7. What is the relationship between wages and grades (education)?
. *Create a scatter and a best linear fit with confidence intervals.
. twoway (scatter wage grade) (lfit wage grade)

. *8. Regress wages on grades.
. reg wage grade

      Source |       SS       df       MS              Number of obs =    2244
-------------+------------------------------           F(  1,  2242) =  265.57
       Model |  7874.79847     1  7874.79847           Prob > F      =  0.0000
    Residual |   66479.532  2242  29.6518876           R-squared     =  0.1059
-------------+------------------------------           Adj R-squared =  0.1055
       Total |  74354.3305  2243  33.1495009           Root MSE      =  5.4454

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .7431729   .0456033    16.30   0.000     .6537438     .832602
       _cons |  -1.965886   .6083143    -3.23   0.001    -3.158804   -.7729677
------------------------------------------------------------------------------
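Question 7 asks for confidence intervals around the fitted line, whereas lfit draws the line only. A minimal sketch using lfitci, which by default shades the 95% confidence band for the fitted mean; drawing the band first so that the scatter points stay visible is a layout choice here, not part of the original solution:

```stata
* Sketch only: linear fit with a confidence band for question 7.
sysuse nlsw88.dta, clear

* lfitci draws the fitted line plus its confidence band;
* the scatter is listed second so the points are drawn on top.
twoway (lfitci wage grade) (scatter wage grade)
```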
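. *9. Test whether the slope is equal to 1.
. test grade=1

 ( 1)  grade = 1

       F(  1,  2242) =   31.72
            Prob > F =    0.0000

. *10. Regress wages on grades separately for union and non-union members.
. bysort union: reg wage grade

------------------------------------------------------------------------------
-> union = nonunion

      Source |       SS       df       MS              Number of obs =    1416
-------------+------------------------------           F(  1,  1414) =  356.62
       Model |  4800.93529     1  4800.93529           Prob > F      =  0.0000
    Residual |  19035.5751  1414  13.4622172           R-squared     =  0.2014
-------------+------------------------------           Adj R-squared =  0.2008
       Total |  23836.5104  1415  16.8455904           Root MSE      =  3.6691

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .7481364   .0396165    18.88   0.000      .670423    .8258499
       _cons |   -2.54272   .5254004    -4.84   0.000    -3.573368   -1.512072
------------------------------------------------------------------------------

------------------------------------------------------------------------------
-> union = union

      Source |       SS       df       MS              Number of obs =     460
-------------+------------------------------           F(  1,   458) =   74.76
       Model |  1124.47451     1  1124.47451           Prob > F      =  0.0000
    Residual |  6889.18181   458  15.0418817           R-squared     =  0.1403
-------------+------------------------------           Adj R-squared =  0.1384
       Total |  8013.65633   459  17.4589462           Root MSE      =  3.8784

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .5637353   .0652006     8.65   0.000     .4356059    .6918647
       _cons |   1.024517   .9034509     1.13   0.257    -.7509063     2.79994
------------------------------------------------------------------------------

------------------------------------------------------------------------------
-> union = .

      Source |       SS       df       MS              Number of obs =     368
-------------+------------------------------           F(  1,   366) =   19.92
       Model |  2131.63724     1  2131.63724           Prob > F      =  0.0000
    Residual |  39157.3898   366  106.987404           R-squared     =  0.0516
-------------+------------------------------           Adj R-squared =  0.0490
       Total |  41289.0271   367  112.504161           Root MSE      =  10.343

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   1.035247   .2319283     4.46   0.000     .5791678    1.491326
       _cons |  -4.415266   3.008321    -1.47   0.143    -10.33103    1.500497
------------------------------------------------------------------------------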
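With a single restriction, the F statistic that test reports in question 9 is the square of the t statistic for the null hypothesis that the slope equals 1. A worked check, using the coefficient and standard error of grade from the full-sample regression of wage on grade in question 8:

```latex
t = \frac{\hat\beta_{\text{grade}} - 1}{SE(\hat\beta_{\text{grade}})}
  = \frac{0.7431729 - 1}{0.0456033}
  \approx -5.632,
\qquad
t^2 \approx 31.72 = F(1,\,2242)
```

Since |t| is far above 1.96, the hypothesis that one more completed grade raises the hourly wage by exactly one unit is rejected, in line with the reported Prob > F of 0.0000.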
. *11. Predict wages from the regression of wages on grades.
. reg wage grade

      Source |       SS       df       MS              Number of obs =    2244
-------------+------------------------------           F(  1,  2242) =  265.57
       Model |  7874.79847     1  7874.79847           Prob > F      =  0.0000
    Residual |   66479.532  2242  29.6518876           R-squared     =  0.1059
-------------+------------------------------           Adj R-squared =  0.1055
       Total |  74354.3305  2243  33.1495009           Root MSE      =  5.4454

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       grade |   .7431729   .0456033    16.30   0.000     .6537438     .832602
       _cons |  -1.965886   .6083143    -3.23   0.001    -3.158804   -.7729677
------------------------------------------------------------------------------

. predict wage_hat, xb
(2 missing values generated)

. *12. Add the predicted wages to the graph above.
. twoway (scatter wage grade) (lfit wage grade) (scatter wage_hat grade)

. log close
      name:  <unnamed>
       log:  M:\pc\Dokumenter\ECON4135\Extra_sem\Extraseminar.smcl
  log type:  smcl
 closed on:   3 Oct 2011, 10:06:57