The first step in the process is to select a topic to work on. There are 7 primary topics and 5 secondary dimensions to choose from. Each team may have up to 4 people. All of the project questions can be addressed using data that I have collected and posted from the 2001 and 2014 American Community Survey (ACS). The ACS data is available as a Stata data set in G:\ECO\evenwe\eco311\ACS_data. The same directory contains a codebook that explains the meaning of every variable in the data. The codebook is an XML file and should be opened with an internet browser (e.g. Internet Explorer, Firefox).

To facilitate your analysis, I have created a subsample of the 2001/2014 ACS data. The subsample excludes people in group quarters and any household that contained multiple families. In single-family households, I kept only the head of the household and, if the head is married, the spouse of the head.

There are two data sets available: acs2010_2014_long and acs2010_2014_wide. The long data set has one observation per person. The wide data set has one observation per household. If the dependent variable you choose for your analysis is a household variable (e.g. mortgage payment, electric cost), you should use the wide data set. If it is an individual variable (e.g. number of children, earned income, weeks worked), use the long data set.

In the wide data set, each individual variable listed in the codebook appears twice, with a suffix of 1 for the head and a suffix of 2 for the spouse. For example, age1 is the age of the head; age2 is the age of the spouse. If the household does not have a married couple, all of the spouse variables are missing (shown as a period). A sample of the data is below. The variable hhid uniquely identifies each household.

hhid  marst     age1  age2  sex1    sex2
1     never ma  39    .     male    .
2     married,  49    59    male    female
3     never ma  33    .     male    .
4     married,  51    53    male    female
5     divorced  40    .     female  .

The data without labels for the same observations is

hhid  marst  age1  age2  sex1  sex2
1     6      39    .     1     .
2     1      49    59    1     2
3     6      33    .     1     .
4     1      51    53    1     2
5     4      40    .     2     .
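As a rough illustration of how the two layouts relate, the sketch below rebuilds person-level rows from the unlabeled household rows in the sample above. It is written in Python purely to show the logic; the analysis itself is done in Stata, and both data sets are already provided. The function name wide_to_long and the use of None for a missing value are choices made for this example. The _sp suffix follows the naming used in the long data set. This is not a description of how the posted files were actually built.

```python
# Household rows copied from the unlabeled sample; None stands in for a missing value (.)
WIDE = [
    {"hhid": 1, "marst": 6, "age1": 39, "age2": None, "sex1": 1, "sex2": None},
    {"hhid": 2, "marst": 1, "age1": 49, "age2": 59, "sex1": 1, "sex2": 2},
    {"hhid": 3, "marst": 6, "age1": 33, "age2": None, "sex1": 1, "sex2": None},
    {"hhid": 4, "marst": 1, "age1": 51, "age2": 53, "sex1": 1, "sex2": 2},
    {"hhid": 5, "marst": 4, "age1": 40, "age2": None, "sex1": 2, "sex2": None},
]

PERSON_VARS = ("age", "sex")  # variables recorded once per person


def wide_to_long(households):
    """Return one row per person, with the partner's values under a _sp suffix."""
    people = []
    for hh in households:
        # A spouse exists only if at least one spouse variable is non-missing
        has_spouse = any(hh[v + "2"] is not None for v in PERSON_VARS)
        members = [("1", "2")]  # head: own values from suffix 1, spouse's from suffix 2
        if has_spouse:
            members.append(("2", "1"))  # spouse: own values from suffix 2, head's from suffix 1
        for own, other in members:
            row = {"hhid": hh["hhid"], "marst": hh["marst"]}
            for v in PERSON_VARS:
                row[v] = hh[v + own]
                row[v + "_sp"] = hh[v + other]
            people.append(row)
    return people


LONG = wide_to_long(WIDE)
for row in LONG:
    print(row)
```

The five sample households yield seven person rows: one each for households 1, 3, and 5, and two each for households 2 and 4.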
In the long data set, there is one observation per person. For a married couple, each person appears as a separate line of data. If a person is married, variables with a suffix of _sp hold the spouse's values. For example, age is the person's own age, and age_sp is the age of the person's spouse. A sample of observations is below:

hhid  marst     age_sp  age  sex_sp  sex
1     never ma  .       39   .       male
2     married,  49      59   male    female
2     married,  59      49   female  male
3     never ma  .       33   .       male
4     married,  51      53   male    female
4     married,  53      51   female  male
5     divorced  .       40   .       female

Deadlines

Wednesday 4/12: Select a primary and secondary topic and form a team. Your choices will be posted in a Google Sheet distributed to the class.

Wednesday 4/19 (25 points): Along with the other students who have chosen the same primary topic, submit the following:

A description of the data set you will use for your primary analysis. This should include a description of the primary data source (the American Community Survey), the year of the data, and the restrictions that I described above.

A description of the control variables that you will include in your analysis. Be sure to explain any modifications you made to the variables in the data set (e.g. Did you have to recode variables that were missing? Did you have to convert a categorical variable to a continuous variable? Did you make dummy variables?). Discuss at least two variables that you expect will be important determinants of your primary variable and describe why you believe the effect will be positive or negative.

Any sample restrictions that you are making (e.g. omitted people in certain age ranges, dropped people with missing data, etc.). Make sure that you have cleaned your data so that you have the same number of observations on the dependent and control variables. Also, describe any variables that you create or modify.
For example, study the codebook to learn how missing values are coded for each variable, and drop such observations. You may also want to adjust the units in which certain variables are measured (e.g. housing values might be converted to thousands of dollars instead of dollars).

Summary statistics (mean, standard deviation, min/max). Present your summary statistics in a professional table in which the variable names and units of measurement are easy to read and footnotes indicate the exact sample used to generate the statistics. Guidance will be provided on this.

The do-file and the log-file that show how you created your data and the summary statistics.

Wednesday 4/26 (35 points): Along with the other students who have chosen the same primary topic, submit the results of a regression analysis. The regression analysis should:

Provide at least 2 tests of alternative specifications (e.g. log vs. linear, linear vs. quadratic, dummy variables vs. continuous, etc.). Describe the specifications you compared and the preferred specification based on your analysis.

Present the results of your regression analysis in a professional table. Guidance will be provided on this.

Discuss whether your expected effects in (2a) are confirmed by the regression analysis and whether they are statistically significant at the .05 level.

Discuss the economic significance of the effects in (2a). For example, describe the effect of a one standard deviation change in the control variables on the dependent variable.

Wednesday 5/3 (40 points): Submit a final paper that closely follows the required format and content. Details on the expectations for the final format and content will be distributed separately. Include your individual results for the secondary topic and a discussion of your conclusions. For the secondary topic, provide the following analysis:

1. Provide a table of summary statistics for the dependent and control variables for the two groups created by your secondary variable (e.g. by sex, race, marital status, year). The summary statistics should include a t-statistic that tests the null hypothesis that the means are equal for the two groups, with asterisks indicating whether the difference in means is significantly different from zero at the .10 (*), .05 (**), or .01 (***) level.
2. Provide a regression analysis that allows you to test whether the between-group difference in the dependent variable is explained by differences in the control variables. Identify a couple of key variables in your regression that might help explain why there is a gap in the dependent variable between the two groups.
3. Test whether the effect of at least one key variable differs across your two groups (the null hypothesis being that the effect is the same for both groups). Describe the results of your test and explain what they imply about how the variable affects the dependent variable differently for the two groups you are examining.
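
The difference-in-means test and the asterisk convention described for the secondary topic can be sketched as follows. This is only an illustration in Python, not the required Stata workflow. It assumes an unequal-variance (Welch) standard error and a standard normal approximation for the p-value, which is reasonable only for large samples. The function names and the sample values are made up for the example and are not ACS data.

```python
import math
from statistics import mean, variance


def diff_in_means_t(group_a, group_b):
    """t-statistic and approximate two-sided p-value for H0: equal group means."""
    # Standard error that allows the two groups to have different variances (Welch form)
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    t = (mean(group_a) - mean(group_b)) / se
    # Standard normal approximation to the two-sided p-value (large-sample assumption)
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


def stars(p):
    """Asterisks for the .10 (*), .05 (**), and .01 (***) significance levels."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""


# Made-up values for illustration only; these are not ACS data
group_a = [10.0, 12.0, 14.0, 16.0, 18.0]
group_b = [11.0, 13.0, 15.0, 17.0, 19.0]
t_stat, p_value = diff_in_means_t(group_a, group_b)
print(f"t = {t_stat:.2f}{stars(p_value)}")
```

With these made-up values the group means are 14 and 15 and the t-statistic is -0.5, so no asterisk is attached. With the real data, one such test would be reported for each variable in the summary table.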

Primary Topics

Use data for 2014 only (i.e. drop if year==2010) for the primary and secondary topic unless instructed otherwise.

1. Among homeowners aged 25-55 with a non-zero mortgage payment, what determines the size of their mortgage payment? [MORTGAGE/MORTAMT1]
2. Among people aged 25-55 who own a house, what determines the value of their house? [VALUEH]
3. Among renters aged 25-55, what determines the amount they pay in rent? [RENT]
4. What determines the amount that households with the head spend on electricity? [COSTELECT]
5. What determines the amount that households spend on water? [COSTWATER]
6. What determines the number of children that a woman aged 25-35 has living in the household? [NCHILD]
7. Among people 62 and over who receive some Social Security income, what determines the level of a person's Social Security income? [INCSS]

Secondary Topics

1. (TIME) Has there been a significant change in the relevant dependent variable between 2001 and 2014? If so, can the change be explained by changes in the characteristics of households/people over time? If you are comparing something measured in dollars across time, be sure to use the CPI to convert the values for 2010 to 2014 dollars. The CPI was 82.4 in 2001 and 108.6 in 2014. Use the variable named YEAR to create a dummy that equals one in 2014 and zero in 2010. You will need to keep data from both 2010 and 2014 for this analysis.
2. (RACE) Are there differences between African Americans and others in the relevant dependent variable? If so, can the differences be explained by differences in the characteristics of households/people? Use the variable RACE in the ACS data to create a dummy variable that equals one for African Americans and zero for everyone else. For this analysis, you should restrict the sample to whites and blacks only (race=1 or 2).
3. (SEX) Are there sex differences in the relevant dependent variable?
If so, can the differences be explained by differences in the characteristics of men and women? Create a dummy that equals one for women and zero otherwise based on the SEX variable in the ACS. If your primary dependent variable is a household-level variable (e.g. electricity cost, house value), sex differences should be studied only for single people (married=0).
4. (MARRIED) Are there differences between married and single people in the relevant dependent variable? If so, can the differences be explained by differences in the characteristics of married and single people? Create a dummy that equals one for people who are married (spouse present or absent) and zero otherwise based on the MARST variable in the ACS.
5. (REGION/STATE) Are there differences between residents of different states or regions? If so, can the differences in the dependent variable be explained by different characteristics of the people who live in these states or regions? You may use statefip to choose two large states to compare, or you can define regions based on a collection of states (e.g. Midwest versus South). The states included in each census region are defined here.
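
For the TIME topic, the CPI adjustment and the year dummy amount to the arithmetic sketched below. The sketch is in Python for illustration only, since the same steps would be carried out in Stata. It uses the two CPI values quoted in the TIME topic (82.4 for the earlier survey year and 108.6 for 2014). The function names and the rent figure are hypothetical.

```python
# CPI values quoted in the TIME topic
CPI_EARLIER = 82.4   # earlier survey year
CPI_2014 = 108.6     # 2014


def to_2014_dollars(amount, cpi_from=CPI_EARLIER, cpi_to=CPI_2014):
    """Rescale a dollar amount from the earlier year's prices to 2014 prices."""
    return amount * cpi_to / cpi_from


def year_dummy(year):
    """1 for 2014 observations, 0 for the earlier survey year."""
    return 1 if year == 2014 else 0


# Hypothetical monthly rent reported in the earlier year (not ACS data)
rent_earlier = 824.0
print(round(to_2014_dollars(rent_earlier), 2))
print(year_dummy(2014), year_dummy(2010))
```

Only amounts from the earlier year are rescaled; 2014 amounts are already in 2014 dollars and are left unchanged.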