Dummy variables

Each observation has an outcome $Y_i$ and an indicator $X_i$ that identifies treatment:
$X_i = 1$ if in the treatment group
$X_i = 0$ if in the control group

Are wages different across union/nonunion jobs?
$H_0: \mu_n = \mu_u$
Or alternatively, with $d = \mu_n - \mu_u$:
$H_0: d = 0$ against $H_a: d \neq 0$
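To connect the hypothesis test to regression, the following worked equations sketch what OLS returns when the only regressor is a 0/1 dummy. The notation ($\beta_0$, $\beta_1$, $\bar{Y}_{X=0}$, $\bar{Y}_{X=1}$) is an assumption introduced here for illustration.

```latex
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad X_i \in \{0, 1\} \\
\hat{\beta}_0 &= \bar{Y}_{X=0} \quad \text{(mean of the control group)} \\
\hat{\beta}_1 &= \bar{Y}_{X=1} - \bar{Y}_{X=0} \quad \text{(difference in group means)}
\end{aligned}
```

Testing $\beta_1 = 0$ in this regression is therefore the same as testing that the two group means are equal. With a union dummy, $\hat{\beta}_1$ is the union mean minus the nonunion mean, which is $-\hat{d}$ in the notation above.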
cps.dta
. gen ln_weekly_earn=ln(weekly_earn)
. gen union=union_status==1
. gen nonwhite=((race==2) | (race==3))
. * test whether means are the same across two subsamples
. ttest weekly_earn, by(union)

Two-sample t test with equal variances
Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
0 0 0.3 2.0 2.32..3
1 1.2 2.001.03 0. 20.
combined.2 1.0 23.. 1.2
diff -3.23 3.33-2.1-2.3
diff = mean(0) - mean(1) t = -.1
Ho: diff = 0 degrees of freedom =
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000

Since $|\hat{t}| > 1.96$, reject the null.

. reg weekly_earn union

Source SS df MS
Model 3.22 1 3.22
Residual 1.e+0 02.2
Total 1.e+0 1.
Number of obs =
F( 1, ) =.3
Prob > F = 0.0000
R-squared = 0.003
Adj R-squared = 0.003
Root MSE = 23.01

weekly_earn Coef. Std. Err. t P>|t| [95% Conf. Interval]
union 3.23 3.33. 0.000 2.3 2.1
_cons 0.3 1.03 21.2 0.000. 3.1

Synthetic problem
X impacts Y.
But there are two groups of people in the population: 1 and 2.
The average of X and Y is higher for group 2 than for group 1.
Should you add a dummy for group 2?
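The equivalence between the two-sample t test and the regression on a dummy, shown above for the union variable, can be checked on any dataset. The sketch below is an illustration under assumptions: it uses Stata's bundled auto dataset (not cps.dta), with price as the outcome and foreign as the dummy, and the scalar names are hypothetical.

```stata
* Sketch using Stata's bundled auto dataset (an assumption; not cps.dta)
sysuse auto, clear

* Two-sample t test with equal variances: diff = mean(0) - mean(1)
ttest price, by(foreign)
scalar diff_ttest = r(mu_1) - r(mu_2)
scalar t_ttest    = r(t)

* Regression of the outcome on the dummy
regress price foreign

* The slope is mean(1) - mean(0), so it is the negative of the ttest diff;
* the t statistics agree up to sign
display "ttest diff = " diff_ttest "   regression coef = " _b[foreign]
display "ttest t    = " t_ttest    "   regression t    = " _b[foreign]/_se[foreign]
```

The sign differs only because ttest reports mean(0) - mean(1) while the regression coefficient is mean(1) - mean(0).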
[Figures: scatter plots of X vs. Y. One set of panels shows the Group 1 and Group 2 observations with the Group 1 OLS line and the Group 2 OLS line; the other shows the pooled sample with the pooled-sample OLS line.]
[Figure: X vs. Y, showing the pooled-sample OLS line together with the Group 1 and Group 2 OLS lines.]

Sort the data by groups
Run a regression for each of the separate groups
. sort group
. by group: reg y x

-> group = 1
Source SS df MS
Model 2.2 1 2.2
Residual 21..22033
Total.0.02
Number of obs = 0
F( 1, ) =.
Prob > F = 0.0000
R-squared = 0.2
Adj R-squared = 0.20
Root MSE =.33

y Coef. Std. Err. t P>|t| [95% Conf. Interval]
x.31.00. 0.000.0.3
_cons.0.02 2.22 0.000.303.301

-> group = 2
Source SS df MS
Model 3.02 1 3.02
Residual 20.02.203
Total.003.00
Number of obs = 0
F( 1, ) =.
Prob > F = 0.0000
R-squared = 0.1
Adj R-squared = 0.2
Root MSE =.

y Coef. Std. Err. t P>|t| [95% Conf. Interval]
x.2.002. 0.000.02.22
_cons..122 3. 0.000.233.022

The coefficients on X in both models are pretty similar.

. reg y x

Source SS df MS
Model 33.3031 1 33.3031
Residual 33.10 1 1.2
Total 1.1 1 3.0233
Number of obs = 200
F( 1, 1) =.3
Prob > F = 0.0000
R-squared = 0.
Adj R-squared = 0.
Root MSE = 1.32

y Coef. Std. Err. t P>|t| [95% Conf. Interval]
x.222.03.1 0.000.1.231
_cons.01.333. 0.000 3.33.1

When you ignore the fact that group 2 has both higher outcomes and higher x's, the pooled regression overstates the impact of x on y.
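The pattern above (similar within-group slopes, a steeper pooled slope) can be reproduced with simulated data. This is a minimal sketch, not the data behind the slides: the seed, sample sizes, group means, true slope of 0.3, and group effect of 3 are all illustrative assumptions.

```stata
* Simulated data; every number below is an illustrative assumption
clear
set seed 12345
set obs 200
gen group = 1 + (_n > 100)                       // first 100 obs in group 1, rest in group 2
gen x = 10*group + rnormal(0, 3)                 // group 2 has higher x on average
gen y = 2 + 0.3*x + 3*(group == 2) + rnormal()   // group 2 also has higher y

* Pooled regression ignores the groups
regress y x
scalar b_pooled = _b[x]

* Separate regressions by group
bysort group: regress y x

* Regression with a dummy for group 2
gen group2 = group == 2
regress y x group2
scalar b_dummy = _b[x]

display "pooled slope = " b_pooled "   slope with group dummy = " b_dummy
```

Under these assumptions the pooled slope should come out well above the slope from the model with the group dummy, because the pooled line also picks up the jump in y between groups.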
With $x_1 = x$ and $x_2$ the group 2 dummy:
$E[\hat{\beta}_1] = \beta_1 + \beta_2 \hat{\delta}_1$, where $x_{2i} = \delta_0 + \delta_1 x_{1i} + v_i$
Here $\hat{\delta}_1 > 0$ and $\beta_2 > 0$, so $\beta_2 \hat{\delta}_1 > 0$ and $E[\hat{\beta}_1] > \beta_1$.

Generate a dummy variable for one of the groups using logical operators
. gen group2=group==2
Run a regression with x and the dummy variable
. reg y x group2

Source SS df MS
Model.33 2 33.0
Residual 2.30 1.212
Total 1.1 1 3.0233
Number of obs = 200
F( 2, 1) =.1
Prob > F = 0.0000
R-squared = 0.0
Adj R-squared = 0.3
Root MSE =.

y Coef. Std. Err. t P>|t| [95% Conf. Interval]
x.1.00 1.0 0.000.03.32
group2 2.003.033 3.03 0.000 2.23 3.0
_cons.3. 1. 0.000.2323.03

Return to tobacco model
Regress ln(per capita consumption) on taxes and a time trend.
Concern: who are the lowest taxing states?
Is the model subject to an omitted variables bias?
State rank per capita consumption - 200
State Rank Per capita packs/year
KY 2 1.
VA.
TN.
NC.
SC 2.2
MD 3.
US.2

Two new variables:
A time trend, =1 in 1st year, 2 in second, etc.
A dummy if the state produces tobacco

. * time trend
. gen trend=year-
. label var trend "=1 in 1st year, 2 in second, etc"
. * tobacco producing state
. gen tob_state=(state=="nc" | state=="va" | state=="sc" | state=="ky" | state=="md" | state=="tn")

. * run regression with tax and trend
. reg packs_pc real_tax trend

Source SS df MS
Model 320. 2 12.2
Residual 1.2 1 0.002
Total 1.32 1 00.31
Number of obs = 20
F( 2, 1) = 1.3
Prob > F = 0.0000
R-squared = 0.1
Adj R-squared = 0.03
Root MSE = 20.

packs_pc Coef. Std. Err. t P>|t| [95% Conf. Interval]
real_tax -.3.030-1. 0.000 -.2 -.22
trend -1.32.0 -. 0.000-1.22-1.233
_cons. 2..3 0.000 1. 1.30

The trend coefficient says that tobacco consumption falls by more than one pack per person each year.
The real_tax coefficient says that every one-cent increase in the tax reduces consumption by a fraction of a pack.
. ttest packs_pc, by(tob_state)

Two-sample t test with equal variances
Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
0 00 3.31.32 2.31 1..1
1 0 0.32 2.0 2.3.022.33
combined 20.021.2 2.23.3.30
diff -2.3 2. -32.0-21.0
diff = mean(0) - mean(1) t = -.22
Ho: diff = 0 degrees of freedom = 1
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000

Tobacco producing states have substantially higher consumption than non-tobacco states.

. * correlation between tax and other variables
. reg real_tax trend tob_state

Source SS df MS
Model.32 2 22.1
Residual 230.2 1 22.3022
Total 31.0 1 30.1
Number of obs = 20
F( 2, 1) = 2.0
Prob > F = 0.0000
R-squared = 0.33
Adj R-squared = 0.33
Root MSE = 1.0

real_tax Coef. Std. Err. t P>|t| [95% Conf. Interval]
trend 1.3.0 1. 0.000 1.231 1.3
tob_state -22.330 1.3332-1.2 0.000-2.2023-1.3
_cons.3.20.3 0.000 3.33.

Tobacco producing states have substantially lower taxes than other states.

Know two facts:
Consumption is higher in tobacco producing states.
Taxes are lower in tobacco producing states.
What should happen to the tax coefficient when the tob_state dummy is added to the model?

With $x_1$ = real_tax and $x_2$ = tob_state:
$E[\hat{\beta}_1] = \beta_1 + \beta_2 \hat{\delta}_1$, where $x_{2i} = \delta_0 + \delta_1 x_{1i} + v_i$
Here $\hat{\delta}_1 < 0$ and $\beta_2 > 0$, so $\beta_2 \hat{\delta}_1 < 0$ and $E[\hat{\beta}_1] < \beta_1$.
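The sign prediction for the tax coefficient can be worked through by substituting an auxiliary regression into the full model. This is a sketch of the standard two-regressor omitted-variable algebra. The symbols $\delta_0$, $\delta_1$, and $v_i$ are notation assumed here, and the time trend is left out of the algebra for simplicity, so it approximates the three-variable model estimated in the slides.

```latex
\begin{aligned}
\text{Full model:} \quad & y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i \\
\text{Auxiliary regression:} \quad & x_{2i} = \delta_0 + \delta_1 x_{1i} + v_i \\
\text{Substituting:} \quad & y_i = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1)\, x_{1i} + (\varepsilon_i + \beta_2 v_i)
\end{aligned}
```

A regression of $y$ on $x_1$ alone therefore estimates $\beta_1 + \beta_2 \delta_1$ rather than $\beta_1$. With $x_1$ the tax and $x_2$ the tobacco-state dummy, the two facts above give $\beta_2 > 0$ and $\delta_1 < 0$, so the bias term $\beta_2 \delta_1$ is negative: the tax coefficient is too negative when the dummy is omitted, and it should move toward zero once tob_state is added.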
. * add tobacco producing state dummy
. reg packs_pc real_tax trend tob_state

Source SS df MS
Model 32.1 3.0
Residual 330. 1 2.223
Total 1.32 1 00.31
Number of obs = 20
F( 3, 1) = 2.22
Prob > F = 0.0000
R-squared = 0.1
Adj R-squared = 0.
Root MSE = 20.

packs_pc Coef. Std. Err. t P>|t| [95% Conf. Interval]
real_tax -.22.023-1. 0.000 -.1 -.2
trend -1.. -. 0.000-1.01-1.3
tob_state. 2.223.2 0.000.2 1.3
_cons 1.03 2.3332. 0.000.3.1

variable name, storage type, display format, variable label
male float %.0g dummy variable, =1 if male
business float %.0g dummy variable, =1 if business major
engineer float %.0g dummy variable, =1 if engineer
greek float %.0g dummy variable, =1 if in sorority/fraternity
college_gpa float %.0g college GPA, 4.0 scale
hs_gpa float %.0g high school GPA, 4.0 scale
act float %.0g ACT score, 1-36
pc float %.0g dummy variable, =1 if own a PC
Sorted by:

. * run regression
. reg college_gpa hs_gpa act male greek business engineer pc

Source SS df MS
Model.12.3012
Residual 1. 3.3
Total 1.00 10.1
Number of obs = 11
F(, 3) =.1
Prob > F = 0.0000
R-squared = 0.222
Adj R-squared = 0.2
Root MSE =.33031

college_gpa Coef. Std. Err. t P>|t| [95% Conf. Interval]
hs_gpa.1.0. 0.000.21.1
act.001.02 0.0 0.2 -.03.0302
male.0.0 0.33 0.0 -.0.33
greek.0322331.001 0. 0.3 -.033.
business.01.0 0. 0. -.02.2021
engineer -.21.1-1. 0.02 -.20.03
pc.1.02 3.0 0.003.031.223
_cons 1.1.333 3.2 0.001.02 1.32

cps.dta
. gen ln_weekly_earn=ln(weekly_earn)
. gen union=union_status==1
. gen nonwhite=((race==2) | (race==3))
. * run basic regression
. * ln(weekly earnings) on age, educ, union nonwhite
. reg ln_weekly age years_educ union nonwhite

Source SS df MS
Model 1.03 3.200
Residual 32.3 1.23
Total 23.33.2321
Number of obs =
F(, 1) = 12.0
Prob > F = 0.0000
R-squared = 0.21
Adj R-squared = 0.21
Root MSE =.31

ln_weekly_~n Coef. Std. Err. t P>|t| [95% Conf. Interval]
age.002.0002. 0.000.023.0
years_educ.0022.00133.32 0.000.03.02
union.10.0002 1. 0.000.223.
nonwhite -.2.0000-1.0 0.000 -.12 -.230
_cons.02.012 23. 0.000.022.21

Now change the reference group
. gen non_union=union_status==2
. gen white=race==1
. * now change the reference groups for the
. * dummy variables, adding non_union and white
. * to the model
. * ln(weekly earnings) on age, educ, nonunion white
. reg ln_weekly age years_educ non_union white

Source SS df MS
Model 1.03 3.200
Residual 32.3 1.23
Total 23.33.2321
Number of obs =
F(, 1) = 12.0
Prob > F = 0.0000
R-squared = 0.21
Adj R-squared = 0.21
Root MSE =.31

ln_weekly_~n Coef. Std. Err. t P>|t| [95% Conf. Interval]
age.002.0002. 0.000.023.0
years_educ.0022.00133.32 0.000.03.02
non_union -.10.0002-1. 0.000 -. -.223
white.2.0000 1.0 0.000.230.12
_cons.03.01 230. 0.000.21.0

Notice that changing the reference groups on the dummy variables does not change R2 or the coefficients on the other variables.
Notice that the only thing that has changed on the dummy variables is that their signs have flipped. The constant also changes, because it now refers to the new reference groups.

Generate dummies for each region of the country
. * generate regional dummy variables
. gen region1=region==1
. gen region2=region==2
. gen region3=region==3
. gen region4=region==4
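The sign flip follows from simple algebra. The worked equations below are a sketch with assumed notation ($\alpha$ for the constant, $\beta$ for the union coefficient, and $\dots$ for the other regressors, which are unaffected).

```latex
\begin{aligned}
\ln w_i &= \alpha + \beta\,\text{union}_i + \dots \\
\text{non\_union}_i &= 1 - \text{union}_i \\
\ln w_i &= \alpha + \beta\,(1 - \text{non\_union}_i) + \dots
         = (\alpha + \beta) - \beta\,\text{non\_union}_i + \dots
\end{aligned}
```

The coefficient on the new dummy is $-\beta$, the constant absorbs $\beta$, and the fitted values (hence R2 and the residual sum of squares) are identical. The same logic extends to a variable with several categories: if $\beta_j$ is the coefficient on category $j$ relative to the old reference group, the coefficient relative to a new reference category $r$ is $\beta_j - \beta_r$, and the old reference group gets $-\beta_r$.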
Do something silly: include all four dummy variables in the model
. * do something dumb -- include all dummy variables
. reg ln_weekly age years_educ union nonwhite region1-region4

Source SS df MS
Model 1.22 21.0322
Residual 31. 1.13
Total 23.33.2321
Number of obs =
F(, 1) = 1.3
Prob > F = 0.0000
R-squared = 0.20
Adj R-squared = 0.2
Root MSE =.331

ln_weekly_~n Coef. Std. Err. t P>|t| [95% Conf. Interval]
age.0003.0002. 0.000.02.0
years_educ.032.0012. 0.000.0301.0
union.003.00 1.1 0.000.322.2
nonwhite -.323.003-1.3 0.000 -.123 -.13
region1 -.0021.001-0.0 0.3 -.021.003
region2 -.032.0023 -.3 0.000 -.0 -.033
region3 -.01.000 -. 0.000 -.02 -.021
region4 (dropped)
_cons.33.0203 22. 0.000.3.3

Stata will remind you that you cannot run a model with all the dummies included: it drops one of them.

. * run model with regional dummy variables
. reg ln_weekly age years_educ union nonwhite region2-region4

Source SS df MS
Model 1.22 21.0322
Residual 31. 1.13
Total 23.33.2321
Number of obs =
F(, 1) = 1.3
Prob > F = 0.0000
R-squared = 0.20
Adj R-squared = 0.2
Root MSE =.331

ln_weekly_~n Coef. Std. Err. t P>|t| [95% Conf. Interval]
age.0003.0002. 0.000.02.0
years_educ.032.0012. 0.000.0301.0
union.003.00 1.1 0.000.322.2
nonwhite -.323.003-1.3 0.000 -.123 -.13
region2 -.031.00 -.21 0.000 -.02 -.02
region3 -.0.003 -.1 0.000 -.0 -.03
region4.0021.001 0.0 0.3 -.003.021
_cons.30.020 22.0 0.000.1.01

Difference between region 3 and region 4: -0.0 - 0.002 = -0.0
Difference between region 2 and region 4: -0.0 - 0.002 = -0.03

[Table: 5% critical values of the F distribution, with degrees of freedom in the numerator across the columns and degrees of freedom in the denominator down the rows. For 3 numerator degrees of freedom and a very large number of denominator degrees of freedom, the critical value is about 2.60.]

. * test whether the regional effects are all zero
. test region2 region3 region4
( 1) region2 = 0
( 2) region3 = 0
( 3) region4 = 0
F( 3, 1) = 3.
Prob > F = 0.0000
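For reference, the joint test of the regional dummies can be written as a comparison of the restricted model (no region dummies) and the unrestricted model (with them). This is the standard F statistic for $q$ linear exclusion restrictions; the symbols $SSR_r$, $SSR_u$, $n$, and $k$ are notation assumed here, not taken from the slides.

```latex
F = \frac{(SSR_r - SSR_u)/q}{SSR_u/(n - k - 1)} \;\sim\; F_{q,\; n-k-1} \quad \text{under } H_0
```

Here $q = 3$ (region2, region3, region4), $k$ is the number of regressors in the unrestricted model, and $n$ is the number of observations. The null that all regional effects are zero is rejected at the 5% level when $F$ exceeds the tabulated critical value, which matches the reported Prob > F = 0.0000.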
Change the reference group from region 1 to region 4
All the coefficients are now in relation to the omitted group.
E.g., the coefficient on region 3 is now the difference between region 3 and region 4.

. * change the reference group from region1 to region4
. reg ln_weekly age years_educ union nonwhite region1-region3

Source SS df MS
Model 1.22 21.0322
Residual 31. 1.13
Total 23.33.2321
Number of obs =
F(, 1) = 1.3
Prob > F = 0.0000
R-squared = 0.20
Adj R-squared = 0.2
Root MSE =.331

ln_weekly_~n Coef. Std. Err. t P>|t| [95% Conf. Interval]
age.0003.0002. 0.000.02.0
years_educ.032.0012. 0.000.0301.0
union.003.00 1.1 0.000.322.2
nonwhite -.323.003-1.3 0.000 -.123 -.13
region1 -.0021.001-0.0 0.3 -.021.003
region2 -.032.0023 -.3 0.000 -.0 -.033
region3 -.01.000 -. 0.000 -.02 -.021
_cons.33.0203 22. 0.000.3.3

The coefficients on the other variables stay the same.
Notice that the SSE, SSM, and R2 do not change at all.
The coefficient on region 1 is the negative of the coefficient on region 4 from the previous model.
The coefficients on regions 2 and 3 are exactly as we would expect.

Definitions
Obesity is based on the Body Mass Index (BMI)
BMI = weight (kg) / (height in m)^2
    = 703 x weight (pounds) / (height in inches)^2
BMI < 20: underweight
20 <= BMI < 25: ideal
25 <= BMI < 30: overweight
30 <= BMI: obese

Obesity Rates Over Time
Columns: Obesity (/ and 1/00), Overweight (/ and 1/00)
Group
All 1. 30...
Males.2 2...0
Females 1. 3.0 1.1 2.0
Black F. 2. 0. 0..0
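The factor of 703 in the pounds-and-inches formula comes from unit conversion. The short derivation below uses the standard definitions 1 lb = 0.45359237 kg and 1 in = 0.0254 m; the example person (180 pounds, 70 inches) is hypothetical.

```latex
\begin{aligned}
\text{BMI} &= \frac{\text{weight (kg)}}{\text{height (m)}^2}
            = \frac{0.45359237 \times \text{weight (lb)}}{\left(0.0254 \times \text{height (in)}\right)^2}
            \approx 703 \times \frac{\text{weight (lb)}}{\text{height (in)}^2} \\[4pt]
\text{Example:} \quad & 703 \times \frac{180}{70^2} = \frac{126540}{4900} \approx 25.8
\end{aligned}
```

A BMI of about 25.8 falls in the 25 to 30 range, so this hypothetical person would be classified as overweight but not obese.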
Contains data from bmi1.dta
obs: 1,2
vars:
size: 33,3 (.% of memory free)

variable name, storage type, display format, variable label
age byte %.0g age in years
sex byte %.0g =1 if male, =2 if female
income int %.0g annual family income
educ byte %.0g years of education
srhealth byte %.0g self report health, 1=excel, 2=vgood, 3=good, 4=fair, 5=poor
bmi float %.0g body mass index
totalexp long %.0g total annual expenditures on medical care
smoker byte %.0g dummy variable, =1 if current smoker
race float %.0g =1 if white non-hisp, =2 if black non-hisp, =3 other race, =4 hispanic

. * generate race dummy variables
. gen black=race==2
. gen other_race=race==3
. gen hispanic=race==4
. label var black "=1 if black, non hispanic"
. label var other_race "=1 if other race, non hispanic"
. label var hispanic "=1 if hispanic"
. * generate overweight dummy
. gen overweight=bmi>=25
. label var overweight "dummy, =1 if overweight"

. * get table of overweight
. tab overweight

dummy, =1 if overweight Freq. Percent Cum.
0 3 2. 2.
1 2 0.0 0.00
Total 1,2 0.00

. reg overweight age educ incomel male black hispanic other_race smoker

Source SS df MS
Model 1. 2.3
Residual 2. 0.120
Total 2..200
Number of obs =
F(, 0) =.
Prob > F = 0.0000
R-squared = 0.001
Adj R-squared = 0.01
Root MSE =.32

overweight Coef. Std. Err. t P>|t| [95% Conf. Interval]
age.002.0013.2 0.000.00.003
educ -.021.00333-3.00 0.003 -.0213 -.00
incomel -.03.031-0.0 0.0 -.031.023
male -.1.0222 -.3 0.000 -.13 -.02
black.32.03. 0.000..20
hispanic.12.03301 3.2 0.001.03.1
other_race -.0.030-1.01 0.31 -.1230.022
smoker -.01.03121-0. 0.1 -.03.000
_cons.21.3 1.3 0.03 -.03 1.23
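Because the dependent variable here is itself a 0/1 dummy, the regression can be read as a model of a probability. The equations below sketch that interpretation; the coefficient symbols are notation assumed for illustration, and only two of the regressors are written out.

```latex
\begin{aligned}
E[\text{overweight}_i \mid x_i] &= \Pr(\text{overweight}_i = 1 \mid x_i)
   = \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{black}_i + \dots \\
\beta_2 &= \Pr(\text{overweight} = 1 \mid \text{black} = 1, \cdot) - \Pr(\text{overweight} = 1 \mid \text{black} = 0, \cdot)
\end{aligned}
```

Each coefficient is a change in the probability of being overweight, holding the other regressors fixed. For a race dummy, the comparison is with the omitted category (white non-Hispanic), and for a continuous variable such as age it is the change in probability per additional year. In the same way, the sample mean of the overweight dummy is simply the share of the sample that is overweight, which is what the tab output reports in its Percent column.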