Applications of Data Analysis (EC969)
Simonetta Longhi and Alita Nandi (ISER)
Contact: slonghi and anandi; @essex.ac.uk

Week 2 Lecture 1: Sampling

(I) Constructing Sampling distributions and estimating their characteristics
Example: estimating mean number of children among women

(II) Computing unbiased estimates with correct standard errors
Example: estimating mean pay/wage in UK

Input datasets: Week2Lecure1.dta
Do file in Week2Lecure1_DoFile.pdf

New commands: ci, test, tabstat, aweight, pweight, ciplot, svyset, svydes, svy: mean, estat

(I) Constructing Sampling distributions and estimating their characteristics
Example: estimating mean number of children among women

Suppose there is a population of 6 women with children. The distribution of children is shown in Table 1. We are interested in estimating the average number of children of these women, so we would like to draw a sample and estimate this number from that sample. In this exercise, we will compare the characteristics of the sampling distribution of the sample mean number of children under two different sampling plans (sample designs). In the first sampling plan we will draw a sample of 2 women from the population of 6 women. In the second one, we will draw a sample of 3 women.

Table 1: Distribution of children across a population of 6 women

Women    Total no. of children
1        4
2        5
3        3
4        3
5        7
6        8
Total    30

Population Mean = 30/6 = 5

Sampling Plan 1

Table 2: All possible size 2 samples and corresponding sample averages

Sample No.    Women in the sample    Mean number of children per woman in the sample (d)
1             1,2                    4.5
2             1,3                    3.5
3             1,4                    3.5
4             1,5                    5.5
5             1,6                    6
6             2,3                    4
7             2,4                    4
8             2,5                    6
9             2,6                    6.5
10            3,4                    3
11            3,5                    5
12            3,6                    5.5
13            4,5                    5
14            4,6                    5.5
15            5,6                    7.5
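
Table 2 can also be reproduced by letting Stata enumerate every possible sample. The following is a minimal sketch, not part of the course do file: it types in the population from Table 1, pairs it with itself using cross, and keeps each pair of women once. The variable names (woman1, woman2, children1, children2) and the tempfile name are illustrative assumptions, and the block is best run as a do file so that the tempfile macro is available throughout.

```stata
* Toy population from Table 1: six women and their number of children
clear
input woman children
1 4
2 5
3 3
4 3
5 7
6 8
end

* Keep a renamed copy so the population can be paired with itself
* (variable and tempfile names are illustrative, not from the handout)
rename (woman children) (woman2 children2)
tempfile second
save `second'
rename (woman2 children2) (woman1 children1)

* cross forms all 6 x 6 ordered pairs; keep each unordered pair once (15 samples)
cross using `second'
keep if woman1 < woman2

* Sample mean d for each of the 15 possible samples
generate d = (children1 + children2) / 2
sort woman1 woman2
list woman1 woman2 d, noobs

* Frequency table of d, i.e. the sampling distribution of d
tabulate d

* Mean and spread of d over all 15 samples.
* summarize divides by N-1, so rescale the variance to divide by N.
quietly summarize d
display "Expected value of d = " r(mean)
display "Standard error of d = " sqrt(r(Var) * (r(N) - 1) / r(N))
```

The same idea, with a third renamed copy and the condition woman1 < woman2 & woman2 < woman3, could be used to check your answers for Sampling Plan 2.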

Table 3: Sampling distribution of d

d        Frequency (f)
3        1
3.5      2
4        2
4.5      1
5        2
5.5      3
6        2
6.5      1
7.5      1
Total    15

Expected value of d:

$$\bar{d} = \frac{\sum_{i=1}^{9} d_i f_i}{\sum_{i=1}^{9} f_i} = \frac{75}{15} = 5$$

Standard error of d:

$$\sqrt{\frac{\sum_{i=1}^{9} (d_i - \bar{d})^2 f_i}{\sum_{i=1}^{9} f_i}} = \sqrt{\frac{22}{15}} = 1.21$$

Bias = Expected value of d - Population Mean = 5 - 5 = 0
MSE = Bias^2 + Sampling Variance = Bias^2 + (Standard Error^2) = 0 + (1.21^2) = 1.47
RMSE = Square root of MSE = 1.21

Sampling Plan 2

Table 4: All possible size 3 samples and corresponding sample averages
Fill in the blanks

Sample No.    Women in the sample    Mean number of children per woman in the sample (d)
1             1,2,3                  4.00
2             1,2,4                  4.00
3             1,2,5                  5.33
4             1,2,6                  5.67
5             1,3,4                  3.33
6             1,3,5                  4.67
7             1,3,6                  5.00
8             1,4,5                  4.67
9             1,4,6                  5.00
10
11
12
13
14
15
16
17
18
19
20

Compute the sampling distribution of d.
Calculate the expected value and standard deviation of the sampling distribution of d.
Calculate the bias, standard error, MSE and RMSE of d.
How does sampling plan 2 compare with sampling plan 1 in terms of bias and standard error of d?

(II) Computing unbiased estimates with correct standard errors
Example: estimating mean pay/wage in UK

Input dataset: Week2Lecure1.dta
Do file in Week2Lecure1_DoFile.pdf

We have provided a dataset, Week2Lecure1.dta, from the 15th wave (corresponding to year 2005) of the British Household Panel Survey (BHPS); see the optional section on how to create it.

Before doing any analysis and estimation using survey data, ask yourself these questions:
What is the population of interest, i.e., the population that you want the results of your analysis based on the survey sample to generalize to?
What is the survey design? Specifically, is it a clustered, stratified sample? And are there unequal selection probabilities?
Is there nonresponse (the answer is almost always yes!)?
Are weights provided which account for unequal selection probabilities and/or non-response?

The BHPS

The original BHPS is a clustered, stratified sample but with an almost equal probability sampling design. It was designed to be representative of Great Britain in 1990 south of the Caledonian Canal. So, the original sample included households in England, Wales and Scotland and they all had (almost) equal selection probabilities.

In later waves over-samples or boosts from Wales, Scotland, and Northern Ireland were added to the original sample, i.e., the proportion of households from Wales, Scotland, and Northern Ireland in the BHPS sample is much higher than in the UK population. In other words, sample units from the four countries had unequal selection probabilities. While the Scottish and Welsh boosts had a similar clustered, stratified design as the original sample, the Northern Ireland boost was a simple random sample.

Variables are provided which help identify the primary sampling unit and strata from which the sample household was drawn.

Weights are provided which account for the unequal selection probability and non-response. In this example we will only look at cross-sectional respondent weights, i.e., weights for respondents that account for unequal selection probability, non-response (at the household and individual level) and post-stratification. In other words, weighted estimates using the cross-sectional respondent weight for wave 15 will provide unbiased estimates of the corresponding parameter for the UK population in 2005.

The BHPS is conducted primarily by face-to-face interviews. Some respondents refuse a face-to-face interview but opt for a telephone interview or a proxy interview (someone from their household answers on their behalf). Note that the respondent cross-sectional weight is zero for proxy and telephone respondents and missing for dead and out-of-scope cases.

Inspecting the data

Open Week2Lecure1.dta

First inspect the different variables; in particular, find out which variables represent wages (wage), response (ivfio), weights (xrwtuk1), stratification (strata) and clustering (psu).

Let's see what these variables look like:
What are the variable names, value labels, their mean, standard deviation, frequency distribution?
Are there any missing values?
What are the weights for proxy and telephone respondents?

Examine weights and the sample identifier, memorig

Look at the distributions of the cross-sectional respondent weight and see how they vary by sample.

    tabstat xrwtuk1, stat(mean min max sd) by(memorig) longstub nototal

Notes:
tabstat displays summary statistics for a series of numeric variables (specified right after the command name) in one table, possibly broken down by a second variable.
stat() requires the list of statistics to be displayed for each of the variables specified.
by() requires the variable by which the table is to be broken down.
longstub specifies that the left stub of the table be made wider so that it can include names of the statistics or variables in addition to the categories of by().
nototal does not report overall statistics; use with by().

Why are the weights, on average, higher for the Original sample than for the country boosts?
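
The inspection questions above could be worked through with a handful of standard Stata commands. This is a sketch only: it assumes Week2Lecure1.dta sits in the current working directory and uses the variable names listed in this handout; the choice of commands is ours rather than taken from the course do file.

```stata
* Assumes Week2Lecure1.dta is in the current working directory
use Week2Lecure1.dta, clear

* Names, storage types and labels of the key variables
describe wage ivfio xrwtuk1 strata psu memorig

* Mean, standard deviation, minimum and maximum
summarize wage xrwtuk1

* Frequency distributions, with missing values shown as their own category
tabulate ivfio, missing
tabulate memorig, missing

* Number of missing values (only variables that have some are reported)
misstable summarize wage xrwtuk1

* Weights by interview outcome, e.g. to check proxy and telephone respondents
tabstat xrwtuk1, stat(n mean min max) by(ivfio)
```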

Examine the variable of interest, wage

Why is this missing for some people? Hint: Use ivfio, employed

Do weights matter: are weighted estimates likely to be different from unweighted estimates?

Yes, if the weights vary across observations. This variation in weights is more prominent with the introduction of the extension samples (Wales, etc.) as there are large differences in selection probabilities.

Stata allows for four types of weights: pweight, aweight, fweight and iweight. pweight and aweight are the ones that we will be using. See the Stata Manual for more explanation.

pweight denotes probability or sampling weights, i.e., the inverse of the probability that the observation is included in the sample. As the BHPS weights are probability weights, the Stata weight type that we should ALWAYS use is pweight. However, Stata does not allow pweight for certain commands such as summarize, for which it allows aweight instead (http://www.stata.com/support/faqs/stat/supweight.html). The estimated mean and standard deviation using pweight and aweight are the same, but not the standard error (and confidence interval). If Stata will not allow pweight and you have to use aweight, be careful about its interpretation.

aweight denotes analytical weights, which are inversely proportional to the variance of the observation. An example of analytical weights: if the observations are averages, then the number of observations used to compute each average would be the analytical weight. (See [U] 20.18 Weighted estimation in the Stata Manual)

To compute the mean and standard deviation use summarize, table, tabstat

    summ wage

To compute the mean, standard error and confidence intervals use ci

    ci wage

Compute unweighted mean, standard errors and confidence interval for wage.
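
To see how weights change a mean, it may help to try a toy example where the arithmetic can be checked by hand. The three observations and the names y and w below are made up for illustration and have nothing to do with the BHPS. The last line uses Stata's mean command, which is not used elsewhere in this handout but does accept pweight.

```stata
* Made-up data: three observations with unequal weights
clear
input y w
10 1
20 1
30 4
end

* Unweighted mean: (10 + 20 + 30) / 3 = 20
summarize y

* Weighted mean: (10*1 + 20*1 + 30*4) / (1 + 1 + 4) = 25
summarize y [aweight = w]

* Same point estimate of 25, but the standard error now treats w as sampling weights
mean y [pweight = w]
```

The weighted mean is pulled towards 30 because that observation stands for four times as many population units as each of the others.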

To compute weighted mean, standard errors, confidence interval and standard deviation for wage but without correcting for clustering and stratification, there are two options.

First, you could use summarize and ci with the option for weights. But for these commands Stata only allows the aweight option, which means the weights will be treated as analytical weights. This produces the same weighted mean estimate as pweight (which treats the weights as probability weights) but different estimates of standard errors. As the BHPS weights are not analytical weights but probability weights, this is not the best choice (see note above and Stata Help).

    summ wage [aweight = xrwtuk1]
    ci wage [aweight = xrwtuk1]

Compute weighted (but without correcting for clustering and stratification) mean, standard errors, confidence interval and standard deviation for wage using summarize and ci.

Stata does not allow pweight with summarize and ci; if you do use it, Stata will give an error message and the program will stop running.

The part in this box is optional:

    To see what happens if you use pweight instead, type
        summ wage [pweight = xrwtuk1]
    To see what happens if you use pweight instead but to prevent Stata from stopping, type
        capture noisily summ wage [pweight = xrwtuk1]
    capture tells Stata not to show the error message and to continue running the program in spite of the error.
    noisily tells Stata to show the output.
    Together, capture noisily asks Stata to show the error message but to continue running the program.

The second option is to use Stata's svy suite of commands. Here you tell Stata what the survey design is and then Stata computes the correct estimates taking the survey design into account. The other advantage of this option is that Stata treats the weights as probability weights. As the BHPS weights are probability weights (as will be the weights in almost all such micro-panel surveys), this will produce the correct estimates of standard errors.

To do this you need to first inform Stata about the survey design. For this part of the exercise, we will ignore the clustering and stratification aspect of the survey and just focus on the weights.

    svyset [pweight = xrwtuk1]

And then to compute the weighted means, standard error and confidence intervals

    svy: mean wage

If you want to produce estimates of the sample standard deviation

    estat sd

Notes:
svyset tells Stata that the dataset comes from a complex survey. The features of the survey design are specified through pweight, strata and psu; not all of them need to be specified. Once we have told Stata what the survey design is, any command typed in the format svy: command will be carried out taking the structure of the dataset into account.
estat displays scalar- and matrix-valued statistics after estimation; it complements predict, which calculates variables after estimation. Exactly what statistics estat can calculate depends on the previous estimation command.

Compute weighted (but without correcting for clustering and stratification) mean, standard errors, confidence interval and standard deviation for wage, treating the BHPS weights as probability weights.

What will happen if you use aweight instead of pweight in the above command?

Next, we want to take into account the complete survey design, i.e., we want to compute estimates of the mean, standard errors and confidence interval of wage that correct for clustering and stratification in addition to unequal selection probability and non-response.

To do this we will again need to use the svy suite of commands, but this time specify the strata and psu variables in addition to the weight variable.

First clear Stata's memory of any previous svy instructions

    svyset, clear

Next inform Stata about the survey design variables

    svyset [pweight = xrwtuk1], psu(psu) strata(strata)

Then compute the weighted means etc. as before

    svy: mean wage

Compute weighted mean, standard errors, confidence interval and standard deviation for wage after correcting for clustering and stratification and treating the BHPS weights as probability weights.

This returns the mean wage, but does not return a standard error or confidence interval. Find out why. Hint: Use svydes, which describes the structure of the survey data.

    svydes

You will find that there is a stratum (-8) with just 1 unit (psu) within it. Which sample is that?

    tab memorig if strata==-8

NB The values of psu and strata for all cases in Northern Ireland are -8 because the NI sample is a simple random sample, i.e., no clustering or stratification.

Stata cannot compute correct standard errors if a part of the sample has a different sampling design. So, exclude the Northern Ireland sample from the analysis

    svy: mean wage if memorig ~= 7
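
Put together, the full-design steps above might look like the sketch below. It assumes Week2Lecure1.dta is in the working directory. Note one deliberate difference from the commands printed above: the svyset line is written in the form documented in current Stata releases, with the PSU variable placed before the weight, which we take to be equivalent to the psu() option used in this handout.

```stata
* Assumes Week2Lecure1.dta is in the current working directory
use Week2Lecure1.dta, clear

* Remove any earlier survey settings
svyset, clear

* Declare the design: PSU variable first, then weights, then strata
* (current documented syntax; assumed equivalent to the psu() option form)
svyset psu [pweight = xrwtuk1], strata(strata)

* Typing svyset on its own displays the settings just declared
svyset

* Describe strata and PSUs; look for strata containing a single PSU
svydescribe

* Identify which sample the single-PSU stratum belongs to
tabulate memorig if strata == -8

* Estimate excluding Northern Ireland (memorig == 7), as in the handout
svy: mean wage if memorig != 7
estat sd
```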

Computing estimates of mean wages in the different countries of UK

Next, we would like to compare the average hourly wage of the four countries of the UK. For that we need to compute the weighted mean wage for the different countries and test the difference.

Look at the distributions of the cross-sectional respondent weights and see how these vary by the four countries.

There are two ways to compute the estimates for sub-samples: use either the subpop or the over option. Using an if statement to estimate weighted means of subpopulations will result in incorrect standard error estimates (see Stata Survey Data Reference Manual Release 11, p. 53).

To use the subpop option:

    svy: mean var, subpop(varname)

This asks Stata to compute estimates for the ONE single subpopulation identified by varname. The subpopulation is defined by the observations for which varname!=0. Typically, varname=1 defines the subpopulation, and varname=0 indicates observations not belonging to the subpopulation. For observations whose subpopulation status is uncertain, varname should be set to a missing value; such observations are dropped from the estimation sample.

Alternatively, an if condition can be used with varname

    svy: mean var, subpop(if varname==x)

To use the over option:

    svy: mean var, over(varname)

This asks Stata to compute the estimates for ALL values of the categorical variable varname. You can use more than one variable in varname, separated by a space.

Now you have the tools to compute estimates of mean wage in each country. Remember, to avoid complications because of Northern Ireland and missing region/country variables, eliminate those cases from the sample.

    drop if memorig == 7
    drop if country==.

Compute the unweighted mean wage for each country.
Estimate mean wage for each country separately using the if statement option. What happens?
Estimate mean wage for each country by using the subpop option.
Estimate mean wage for the four countries by using the over option.
Are the estimates obtained by the two methods (subpop and over) exactly the same?

subpop and over can be used to compute estimates for multiple subpopulations. Here are a couple of tasks to illustrate that:
Estimate mean wage for men and women in the four countries by using the over option.
Estimate mean wage for men and women in the four countries by using the subpop option.

You can also test the differences in these estimated means across the four countries. Note that the test command should follow immediately after the estimation.

    svy: mean wage, over(country)
    test [wage]england = [wage]scotland = [wage]wales

Test if men earn higher wages than women in England, Scotland and Wales.

Design Effect (deff): the ratio of the variance of a statistic under the actual sample design to the variance of that statistic had the sample been a SRS (simple random sample) of the same size. In other words, it indicates by how much the variance is inflated due to the sampling design. deft is the square root of deff.

Compute design effects, design factor and effective sample size:

    quietly svy: mean wage
    estat effects, deff deft

[Optional] Plot the weighted mean and the confidence interval.

ciplot shows means and confidence intervals. Means are shown by point symbols and intervals by capped bars. ci is used for the calculations. If ciplot is not already installed, use findit to find it and then install it.

    ciplot paygu, by(country) saving(graph1, replace)
    ciplot paygu [aw=xrwtuk1], by(country) saving(graph2, replace)
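
As one possible way of organising the country comparisons, the sketch below loops over the values of country for the subpop estimates and then uses over for all countries at once. It assumes Week2Lecure1.dta is in the working directory and, because the numeric codes of country are not given in this handout, it reads them from the data with levelsof rather than assuming particular values. The subpop() option is written in the svy prefix, which is the form documented in current Stata releases; the local macro names are illustrative.

```stata
* Assumes Week2Lecure1.dta is in the current working directory
use Week2Lecure1.dta, clear
drop if memorig == 7
drop if country == .

* Declare the full design (current documented syntax)
svyset psu [pweight = xrwtuk1], strata(strata)

* subpop: one estimate per country, keeping the whole sample for variance estimation
levelsof country, local(codes)
foreach c of local codes {
    svy, subpop(if country == `c'): mean wage
}

* over: all countries in a single estimation
svy: mean wage, over(country)

* Design effects for the over() estimates
estat effects, deff deft
```

Comparing the output of the loop with the over() table is one way to answer the question of whether the two methods give the same estimates.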

[Optional] To estimate mean wage in all four countries of UK, i.e., including Northern Ireland

We asked you to drop Northern Ireland from the dataset because it has a different sample design, and Stata cannot handle data with a mixed sample design in which one part is clustered and stratified and the other is a simple random sample. It has been suggested that Stata can be tricked into believing that the SRS part of the sample is also a clustered, stratified sample simply by using the unique household identifier (whid) as the psu variable (wpsu). In other words, each such psu has just one household observation.

Remember that the current dataset does not include the Northern Ireland sub-sample, as we dropped it earlier. So, open Week2Lecure1.dta again and this time, instead of dropping the Northern Ireland sub-sample, replace the value of wpsu for the Northern Ireland sub-sample with the whid. Then follow the earlier steps to estimate mean wage in all four countries.

[Optional] How to create dataset Week2Lecure1.dta

We have provided this dataset, but if you want to create it yourself, here is a guide. See Week2_dataprep_DoFile.pdf, which contains the corresponding do file.

1. Get information on strata and primary sampling unit from mhhsamp.dta (you can use any wave from wave 12/wave L onwards)

2. Get information about the household that was asked in the household questionnaire from mhhresp.dta: number of children of different ages in the household (use the same wave that you have used for step 1)

3. Get information about the individual that was asked in the individual questionnaire from mindresp.dta: individual level respondent weight, interview outcome, education, marital status, number of own children in the household, wages, work hours, employment status, health problem, region of residence (use the same wave that you have used for step 1)

4. Get some fixed info from xwavedat.dta: race, sex, sample origin and date of birth

In addition to these variables, always remember to include the appropriate unique identifiers in each of the datasets: pid, hid & pno.

As a general rule we have dropped the wave prefix for each of these files. This makes it easier to use the same program for a different wave of data. This is not necessary, just convenient.

5. Merge all these datasets sequentially. Finally, keep only those observations present in all datasets.

Points to remember about merging:
Datasets being merged should be sorted on the variable or variables that are being used to merge them.
Check _merge to see how many cases were available in both datasets and how many in only one.
_merge is created by Stata at every merge, so if you don't drop _merge or rename it after each merge, Stata will produce an error message saying _merge already exists and will not allow you to perform another merge until you have dropped or renamed it.

6. Create the following variables
(i) Usual hourly wage rate (does NOT include overtime pay) using PAYGU (Usual gross pay per month: current job) and JBHRS (hours expected to work in a normal week). Drop cases that have negative values for PAYGU or JBHRS. Also drop cases for which JBHRS=0, which means no expected working hours
(ii) Create a 0-1 dummy variable that takes on 1 if currently employed using JBHAS (did paid work last week) and JBOFF (no work last week but has job)
(iii) Create a variable for number of young children (defined as being < 5 years old) in the household (we are not differentiating between others' and own children) using NCH02 and NCH34. There were some discrepancies in the data: some households had more young children than NCHILD, the number of own children in the household. So, restrict the number of young children to the number of own children in the household
(iv) Create 0-1 dummies for country of residence using REGION. Also create a categorical variable to represent the country of residence.
(v) Create a 0-1 dummy variable that takes on the value of 1 if the person's ethnicity is white and 0 otherwise

7. Sample restriction for Week2Lecure1.dta
(i) Restrict the sample to those who are not self-employed
(ii) Drop those for whom EMPLOYED is missing
(iii) Drop when wage is missing even though the person is employed and interviewed face-to-face
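
Steps 1 to 6(i) above might be sketched as follows. This is not the course do file (see Week2_dataprep_DoFile.pdf for that) and it rests on several assumptions that you should check against your own files: that the four datasets are in the working directory, that the wave prefix has already been removed so the identifiers are called hid and pid and the pay and hours variables are called paygu and jbhrs, that the household files hold one row per hid, and that the individual files hold one row per pid. It uses the merge 1:1 and merge m:1 syntax available from Stata 11, which does not require sorting first. The conversion from monthly pay to an hourly rate (52/12 weeks per month) is also our assumption, as the handout does not state the formula.

```stata
* Start from the individual-level file (assumed: one row per pid)
use mindresp.dta, clear

* Add strata and PSU from the household sample file (assumed: one row per hid)
merge m:1 hid using mhhsamp.dta
tabulate _merge
keep if _merge == 3
drop _merge

* Add household questionnaire information (assumed: one row per hid)
merge m:1 hid using mhhresp.dta
tabulate _merge
keep if _merge == 3
drop _merge

* Add fixed individual information (assumed: one row per pid)
merge 1:1 pid using xwavedat.dta
tabulate _merge
keep if _merge == 3
drop _merge

* Step 6(i): drop negative pay or hours and zero hours, then build the hourly wage
drop if paygu < 0 | jbhrs <= 0
* Assumed conversion: weekly hours x 52/12 gives hours per month
generate wage = paygu / (jbhrs * 52 / 12)
```

Keeping only _merge == 3 after every merge is what implements "keep only those observations present in all datasets"; tabulating _merge first shows how many cases are lost at each step.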