Poststratification with PROC SURVEYMEANS

Overview

When a population can be partitioned into homogeneous groups and there is significant heterogeneity between those groups, stratified sampling can substantially improve the efficiency of estimates of population totals, means, and ratios. However, sometimes group membership is unknown prior to sampling, and sometimes a survey serves multiple purposes, with the result that no single sampling design is optimal for all the purposes of the survey. In such cases, poststratification can yield similar improvements in the efficiency of estimates (Särndal, Swensson, and Wretman 1992, p. 265).

Poststratification is also used by epidemiologists, who frequently analyze health survey data. They often compute statistics by using a process called direct standardization, a form of poststratification. For example, certain diseases, such as cancer, are more common among older populations. Therefore, to compare the prevalence rates among geographic regions whose populations have different age distributions, it is necessary to make adjustments according to such demographic categories and to compute relative prevalence rates of the diseases.

To perform poststratification, you sample directly from the population and then establish group membership for the sampled elements. The groups are then called poststrata. You then combine the sample data with appropriate auxiliary data, such as census poststrata totals or proportions, and adjust the sampling weights so that the marginal distribution of the sampling weights agrees with the known auxiliary information. The adjusted weight is often called the poststratification weight (see Lehtonen and Pahkinen 2004, p. 89). You then obtain parameter estimates in the same way that you would by using ordinary stratification.

However, because you are stratifying after selecting the sample, you cannot assume any specific allocation scheme. That is, although the overall sample size can be fixed, you do not know how the sample is allocated to the different poststrata until you draw the sample. Therefore, you must adjust the estimator of the variance of the estimated parameters to take into account this uncertainty about the sample allocation.

The SURVEYMEANS procedure in SAS/STAT software enables you to perform poststratification by specifying a POSTSTRATA statement. If you use replication methods with PROC SURVEYMEANS, you can save the replicate weights (and the jackknife coefficients for the option VARMETHOD=JK) and use them with other SAS/STAT survey procedures. This, in effect, enables you to perform poststratification analysis by using any of the SAS/STAT survey procedures, even though in SAS/STAT 12.3 PROC SURVEYMEANS is the only procedure that supports the POSTSTRATA statement.

The SAS source code for this example is available as an attachment in a text file. In Adobe Acrobat, right-click the icon and select Save Embedded File to Disk. You can also double-click the icon to open the file immediately.
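As a small numeric illustration of the weight adjustment described in this overview (the figures are invented for this sketch and do not come from the example data), suppose the known population total for one poststratum is 500 and the sampling weights of the sampled units that fall into that poststratum sum to 400. Every sampling weight in that poststratum is multiplied by the same factor, so a unit whose sampling weight is 30 would receive the following poststratification weight:

```latex
\[
  \text{adjustment factor} = \frac{500}{400} = 1.25,
  \qquad
  \text{poststratification weight} = 30 \times 1.25 = 37.5
\]
```

After the adjustment, the weights in that poststratum sum to 400 × 1.25 = 500, which agrees with the known total.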

Analysis

When you perform poststratification, the SURVEYMEANS procedure computes statistics by using the poststratification weights, $\tilde{w}_{hij}$, instead of the original sampling weights, $w_{hij}$. For a unit that belongs to poststratum $p$, the poststratification weight is computed as

$$\tilde{w}_{hij} = w_{hij} \, \frac{Z_p}{\psi_p}$$

where

$h = 1, 2, \ldots, H$ is the stratum index;
$i = 1, 2, \ldots, n_h$ is the cluster index within stratum $h$;
$j = 1, 2, \ldots, m_{hi}$ is the unit index within cluster $i$ of stratum $h$;
$p = 1, 2, \ldots, P$ is the poststratum index;
$Z_1, Z_2, \ldots, Z_P$ are the known population totals for each corresponding poststratum; and
$\psi_p$ is the sum of the original sampling weights ($w_{hij}$) for poststratum $p$.

PROC SURVEYMEANS estimates the variance of computed statistics by using either the Taylor series method or a replication method. For more information about the various poststratified variance estimators, see the section "Poststratification" in the chapter "The SURVEYMEANS Procedure" in SAS/STAT User's Guide.

Example: Unemployment Rate Poststratified by Urban-Rural Classification

This example uses PROC SURVEYMEANS to obtain poststratified totals, means, and ratios. The data are sampled from county-level data sets that are publicly available from the USDA Economic Research Service website, at aspx. The sample consists of county-level information about population size, the number of individuals in the labor force, and the number of unemployed persons in the 48 contiguous states of the United States of America in 2011. The sampling frame is stratified by state, and a simple random sample of two counties per state is selected.

The analysis compares the non-poststratified estimates with the poststratified estimates of the total and average labor force size, number of unemployed, population size, and two ratios: the unemployment rate and the labor force participation rate.

Table 1 describes the contents of the sample data set Unemployment, and Table 2 describes the interpretation of the six levels of the National Center for Health Statistics (NCHS) urban-rural classification for each county.

Table 1 Example Data Set Unemployment

Variable        Description
FIPS            Federal information processing standards (FIPS) code for counties
ST_FIPS         FIPS code for states
State           Abbreviation of state name
County          County name
Code2006        National Center for Health Statistics (NCHS) 2006 urban-rural classification code
Population      Resident total population estimate as of July 1, 2011
LaborForce      Number of individuals in the civilian labor force in 2011
Unemployed      Number of unemployed individuals in 2011
SamplingWeight  Sampling weight generated by the SURVEYSELECT procedure

Table 2 NCHS Urban-Rural Classification Scheme

Code  Urbanization Level    Classification Rules
1     Large metro, central  Counties in a metropolitan statistical area (MSA) with a population of 1 million or more that 1) contain the entire population of the largest principal city of the MSA, or 2) are completely contained within the largest principal city of the MSA, or 3) contain at least 250,000 residents of any principal city in the MSA
2     Large metro, fringe   Counties in an MSA with a population of 1 million or more that do not qualify as large central
3     Medium metro          Counties in an MSA with a population of 250,000 to 999,999
4     Small metro           Counties in an MSA with a population of 50,000 to 249,999
5     Micropolitan          Counties in a micropolitan statistical area
6     Noncore               Counties not in a micropolitan statistical area

The following SAS statements create the SAS data set Unemployment:

    data unemployment;
       input FIPS 1-5 ST_FIPS 7-8 State $ County $ Code2006 Population
             LaborForce Unemployed SamplingWeight 59-64;
       datalines;
    AL Barbour County
    AL Cherokee County
    AZ Pinal County
    AZ Yuma County
    AR Perry County
    ... more lines ...
    WI Taylor County
    WY Natrona County
    WY Sweetwater County
    ;
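Table 1 notes that SamplingWeight was generated by the SURVEYSELECT procedure, and the design is a simple random sample of two counties per state. The selection code itself is not shown in this example, so the following is only a minimal sketch of how such a sample might be drawn. The frame data set CountyFrame, the output data set CountySample, and the seed are hypothetical names and values chosen for illustration.

```sas
/* Hypothetical sampling frame: one row per county, with ST_FIPS identifying the state */
proc sort data=CountyFrame;
   by ST_FIPS;                     /* stratified selection expects the input sorted by stratum */
run;

proc surveyselect data=CountyFrame
                  method=srs       /* simple random sampling without replacement */
                  n=2              /* two counties from each stratum */
                  seed=20130101    /* arbitrary seed, used only for reproducibility */
                  out=CountySample;
   strata ST_FIPS;                 /* one stratum per state */
run;
```

For a stratified selection like this one, the output data set should include a SamplingWeight variable. Under this design, the weight of a sampled county is the number of counties in its state divided by 2.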

You begin the comparative analysis by using PROC SURVEYMEANS, as in the following statements, to estimate the means, totals, and ratios of interest. The MEAN and SUM keywords in the PROC SURVEYMEANS statement request estimates of the population means and totals, respectively. The VAR statement requests estimates for the variables LaborForce, Unemployed, and Population. So, for example, if you specify the keyword MEAN in the PROC SURVEYMEANS statement and the variable Unemployed in the VAR statement, you are requesting an estimate of how many unemployed persons, on average, reside in a county.

The first RATIO statement requests an estimate of the population's unemployment rate, which is the ratio of the number of unemployed to the size of the labor force. The second RATIO statement requests an estimate of the labor force participation rate, which is the ratio of the size of the labor force to the size of the population of the county. The STRATA and WEIGHT statements identify the sampling design: the STRATA statement specifies that the strata are identified by the variable ST_FIPS, and the WEIGHT statement specifies that the sampling weights are contained in the variable SamplingWeight.

    proc surveymeans data=unemployment mean sum;
       strata st_fips;
       weight SamplingWeight;
       var LaborForce Unemployed Population;
       ratio 'Unemployment Rate' Unemployed / LaborForce;
       ratio 'Labor Force Participation Rate' LaborForce / Population;
    run;

Output 1 displays the estimated means, totals, ratios, and their standard errors. For example, on average there are 110,064 individuals in a county and 53,472 individuals in the labor force, and 4,925 individuals are unemployed. The estimated unemployment rate is 9.2%, and the estimated labor force participation rate is 48.58%.
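The two ratio estimates can be cross-checked by hand. A ratio estimate is the ratio of two weighted totals, and dividing both totals by the same sum of weights turns it into a ratio of the two estimated means. The following check uses the rounded means quoted in the text, so agreement is expected only up to rounding:

```latex
\[
  \widehat{R}_{\text{unemployment}}
    = \frac{\widehat{\bar{Y}}_{\text{Unemployed}}}{\widehat{\bar{Y}}_{\text{LaborForce}}}
    \approx \frac{4{,}925}{53{,}472} \approx 0.0921 \;(9.2\%)
\]
\[
  \widehat{R}_{\text{participation}}
    = \frac{\widehat{\bar{Y}}_{\text{LaborForce}}}{\widehat{\bar{Y}}_{\text{Population}}}
    \approx \frac{53{,}472}{110{,}064} \approx 0.4858 \;(48.58\%)
\]
```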
Output 1 Stratified Design

The SURVEYMEANS Procedure

Data Summary
Number of Strata        48
Number of Observations  96
Sum of Weights          3108

Statistics (columns: Variable, Mean, Std Error of Mean, Sum, Std Dev; rows: LaborForce, Unemployed, Population)

Ratio Analysis: Unemployment Rate (numerator Unemployed, denominator LaborForce)

Output 1 continued

Ratio Analysis: Labor Force Participation Rate (numerator LaborForce, denominator Population)

In addition to the sample data, two pieces of auxiliary information are known: the NCHS urban-rural classification code (Ingram and Franco 2012) for each county in the sample, and the total number of counties in the population at each of the six levels of the NCHS classification. If the totals, means, and ratios of the variables of interest are homogeneous for counties that have the same NCHS urban-rural classification, but there is significant heterogeneity between counties whose classifications differ, then poststratifying by the NCHS urban-rural classification can potentially yield more efficient estimates.

The following SAS statements create the poststratum totals data set Poststrata. This data set is to be used in the PSTOTAL= option of the SURVEYMEANS procedure's POSTSTRATA statement. A poststratum total data set must contain all the poststratification variables that are listed in the POSTSTRATA statement, and it must have a variable named _PSTOTAL_ that contains the poststratum totals. In the Poststrata data set, the variable Code2006 contains the poststratum identification code, and the variable _PSTOTAL_ contains the total number of counties in that poststratum.

    data poststrata;
       input Code2006 _PSTOTAL_;
       datalines;
    ;

Figure 1 compares the distributions of Code2006 in the population and in the weighted sample. Based on the weighted sample, counties that have values of 3 and 4 are overrepresented, and counties that have values of 5 and 6 are underrepresented. Poststratifying on Code2006 reweights the data so that the poststratified weighted sample distribution of Code2006 equals the population distribution.
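The reweighting described above can be reproduced by hand, which may help make the weight formula concrete. The following is a minimal sketch, not what PROC SURVEYMEANS runs internally. It assumes that the Poststrata data set has been populated with the six poststratum totals; the data set names WeightSums and ManualPSWeights and the variable names WeightSum, Adjustment, and PSWeight are hypothetical.

```sas
proc sql;
   /* Sum of the original sampling weights within each poststratum */
   create table WeightSums as
   select Code2006, sum(SamplingWeight) as WeightSum
   from Unemployment
   group by Code2006;

   /* Multiply each sampling weight by (known total) / (sum of weights) */
   create table ManualPSWeights as
   select u.*,
          p._PSTOTAL_ / w.WeightSum as Adjustment,
          u.SamplingWeight * calculated Adjustment as PSWeight
   from Unemployment as u
        inner join WeightSums as w on u.Code2006 = w.Code2006
        inner join Poststrata as p on u.Code2006 = p.Code2006;
quit;

/* Check: within each poststratum, PSWeight should now sum to _PSTOTAL_ */
proc means data=ManualPSWeights sum;
   class Code2006;
   var SamplingWeight PSWeight;
run;
```

Comparing the two sums for each level of Code2006 shows which poststrata were overrepresented (the sum of SamplingWeight exceeds the sum of PSWeight) and which were underrepresented.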

Figure 1 Population Distribution versus Weighted Sample Distribution of Code2006

To perform a poststratified analysis, you add a POSTSTRATA statement to the SURVEYMEANS procedure, as in the following statements. Specifically, you designate Code2006 as the poststratification variable, and you specify the SAS data set Poststrata in the PSTOTAL= option. The OUT= option saves the poststratification weights to the SAS data set Pswgt.

    proc surveymeans data=unemployment mean sum;
       strata st_fips;
       weight SamplingWeight;
       var LaborForce Unemployed Population;
       ratio 'Unemployment Rate' Unemployed / LaborForce;
       ratio 'Labor Force Participation Rate' LaborForce / Population;
       poststrata code2006 / pstotal=poststrata out=pswgt;
    run;

Figure 2 shows the ratios of the poststratification weights to the original sampling weights for each category of Code2006. Poststratification reduces the weights for counties that have Code2006 values of 3 and 4 and increases the weights for counties that have Code2006 values of 5 and 6.

Figure 2 Ratio of Poststratification Weights to Sampling Weights
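The weight ratios that Figure 2 plots can also be tabulated from the saved data set Pswgt. The following is a rough sketch under one important assumption: that the poststratification weight is stored in a variable named _PSWt_. The example does not state the variable name, so the first step lists the contents of Pswgt so that you can confirm it. WeightRatio and Ratio are hypothetical names.

```sas
/* List the variables in Pswgt to confirm the name of the poststratification weight */
proc contents data=Pswgt varnum;
run;

/* Assumption: the poststratification weight variable is named _PSWt_ */
data WeightRatio;
   set Pswgt;
   Ratio = _PSWt_ / SamplingWeight;
run;

/* The adjustment factor is constant within a poststratum, so the mean equals the factor */
proc means data=WeightRatio mean;
   class Code2006;
   var Ratio;
run;
```

A ratio below 1 indicates a poststratum whose weights were reduced, and a ratio above 1 indicates one whose weights were increased.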

Figure 3 shows that, as expected, the poststratified weighted sample has the same distribution as the population.

Figure 3 Population Distribution versus Poststratified Weighted Sample Distribution of Code2006

Output 2 displays the poststratified estimates and their standard errors. All the poststratified estimates of the population means and totals are smaller than the non-poststratified estimates, but the two poststratified ratio estimates are larger. For example, the poststratified estimates indicate that on average there are 100,215 individuals in a county and 48,755 individuals in the labor force, and 4,518 individuals are unemployed. The estimated unemployment rate is 9.3%, and the estimated labor force participation rate is 48.65%. Without exception, the variances of the estimates are smaller for the poststratified analysis, indicating that the poststratified estimates are more efficient for this sample.

Output 2 Poststratified Analysis

The SURVEYMEANS Procedure

Data Summary
Number of Strata        48
Number of Poststrata    6
Number of Observations  96
Sum of Weights          3108

Statistics (columns: Variable, Mean, Std Error of Mean, Sum, Std Dev; rows: LaborForce, Unemployed, Population)

Output 2 continued

Ratio Analysis: Unemployment Rate (numerator Unemployed, denominator LaborForce)

Ratio Analysis: Labor Force Participation Rate (numerator LaborForce, denominator Population)

Example: Age-Adjusted Mortality Rates

Suppose you want to compare the mortality rates of Florida and California. If you have samples from the two populations, computing the crude mortality rate for each population is straightforward. However, because many health outcomes vary by age and the two populations have different age distributions, a direct comparison of the crude mortality rates might be inappropriate. To make a relative comparison, you can use age-adjusted mortality rates. A common method of computing age-adjusted rates is called direct standardization; it is mathematically equivalent to poststratification.

The following SAS statements create the data sets Florida and California, which contain samples from a one-stage clustered sampling design that has a sampling rate of 0.5; the clusters consist of counties from the respective states, and the observations are age-specific groups. Each observation records the following variables:

FIPS, which identifies the clusters (counties);
Age, a categorical variable that identifies the age group;
Population, which records the total number of individuals in an age-specific group in 1968;
Deaths, which records the total number of recorded deaths in an age-specific group in 1968; and
SamplingWeight, which is the inverse of the probability of selecting a county in the sample.

The data are sampled from the Compressed Mortality File (CMF), which is publicly available from the Centers for Disease Control and Prevention website, at data_availability.

    data Florida;
       input FIPS Age Population Deaths;
       SamplingWeight= ;
       datalines;
    ... more lines ...
    ;

    data California;
       input FIPS Age Population Deaths;
       SamplingWeight=2;
       datalines;
    ... more lines ...
    ;

Table 3 describes the different levels of the categorical variable Age.

Table 3 Age Categories

Age Category  Description
4             Less than 1 year
...           (twelve further categories, each an age range in years)

The following SAS statements use the SURVEYMEANS procedure to estimate the crude mortality rates for Florida and California. The RATE= option in the PROC SURVEYMEANS statement identifies the sampling rate. The SURVEYMEANS procedure uses the sampling rate to compute a finite population correction for the Taylor series variance estimates. The RATIO and SUM keywords in the PROC SURVEYMEANS statement request estimates of the population ratios and totals, respectively. The VAR statement requests estimates for the variables Deaths and Population. The CLUSTER statement specifies that the variable FIPS identifies the primary sampling units. The WEIGHT statement specifies that the variable SamplingWeight contains the sampling weights. The RATIO statement identifies the ratio of interest as the number of deaths divided by the population size.

    proc surveymeans data=florida ratio sum rate=.5;
       cluster fips;
       weight SamplingWeight;
       var deaths population;
       ratio 'Florida Crude Mortality Rate' deaths/population;
    run;

    proc surveymeans data=california ratio sum rate=.5;
       cluster fips;
       weight SamplingWeight;
       var deaths population;
       ratio 'California Crude Mortality Rate' deaths/population;
    run;
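The finite population correction mentioned above enters the Taylor series variance estimator as a factor of one minus the sampling rate. As a sketch only, for an unstratified one-stage cluster design with n sampled clusters and sampling rate f, the variance estimator of an estimated total has the following general form. The notation is assumed here for illustration and is not taken from this example; see the SAS/STAT User's Guide for the estimators that the procedure actually uses, including the version for ratios.

```latex
\[
  \widehat{V}(\widehat{Y})
    = \frac{n\,(1-f)}{n-1}\sum_{i=1}^{n}\left(y_{i\cdot}-\bar{y}_{\cdot\cdot}\right)^{2},
  \qquad
  y_{i\cdot}=\sum_{j} w_{ij}\,y_{ij},
  \qquad
  \bar{y}_{\cdot\cdot}=\frac{1}{n}\sum_{i=1}^{n} y_{i\cdot}
\]
```

With RATE=.5, the factor 1 − f equals 0.5, so the estimated variance is half of what it would be without the correction, and the standard error is scaled by √0.5 ≈ 0.71.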

Output 3 and Output 4 show the estimation results.

Output 3 Crude Mortality Rate for Florida

The SURVEYMEANS Procedure

Data Summary
Number of Clusters      34
Number of Observations  442
Sum of Weights          871

Ratio Analysis: Florida Crude Mortality Rate (numerator Deaths, denominator Population)

Output 4 Crude Mortality Rate for California

The SURVEYMEANS Procedure

Data Summary
Number of Clusters      29
Number of Observations  377
Sum of Weights          754

Ratio Analysis: California Crude Mortality Rate (numerator Deaths, denominator Population)

The estimated crude mortality rates for Florida and California are 1.08% and 0.77%, respectively. The ratio of the crude mortality rates, computed from these rounded values, is about 1.4. However, before you conclude that the mortality rate is higher in Florida than in California, consider the following two exhibits. Figure 4 shows that the age-specific mortality rates are decidedly a function of age in both states.
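To see why age matters for this comparison, it can help to write a crude rate as a weighted average of age-specific rates. In the sketch below, the symbols are assumed for illustration and do not appear in the source: D_a and N_a are the deaths and population in age group a, r_a = D_a / N_a is the age-specific rate, and s_a is the share of the state's population in age group a.

```latex
\[
  R_{\text{crude}}
    = \frac{\sum_a D_a}{\sum_a N_a}
    = \sum_a s_a\, r_a,
  \qquad
  s_a = \frac{N_a}{\sum_b N_b},
  \qquad
  r_a = \frac{D_a}{N_a}
\]
```

Two populations with identical age-specific rates r_a can therefore have different crude rates purely because their age shares s_a differ. In its textbook form, direct standardization removes this effect by replacing each population's own shares s_a with the shares π_a of a common reference population, giving Σ π_a r_a.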

Figure 4 Age-Specific Crude Rates versus Age in Florida and California

Figure 5 shows that the populations in Florida and California exhibit different age distributions. The percentage of residents in the age groups 13, 14, and 15 is higher in Florida than in California, whereas the percentage of residents in the age groups 5, 6, 7, 8, 9, 10, and 11 is lower in Florida than in California. Together these facts indicate that the crude mortality rates are not an appropriate measure for comparing differences between these two populations (Curtin and Klein 1995).

Figure 5 Estimated Age Distributions in Florida and California

NOTE: The SAS statements that generate Figure 4 and Figure 5 are not shown here but are included in the downloadable SAS program that is available with this web example.

Because the crude rate is not appropriate, and because age-specific mortality rates provide too much detail and require a large number of comparisons, you can use a summary measure that controls for a population's age distribution. A commonly used measure is the age-adjusted mortality rate, which you can compute by performing direct standardization (Curtin and Klein 1995). As mentioned earlier, direct standardization is mathematically equivalent to poststratification. The difference between poststratification for the purpose of performing direct standardization and other forms of poststratification is this: when you perform direct standardization, the poststratum totals or proportions represent a standard or reference population rather than the population from which your sample was drawn.

To compute comparable age-adjusted rates for Florida and California by using poststratification, you need a data set that contains the age distribution proportions from a standard or reference population. The following SAS statements create the data set USbyAge, which contains the age-specific proportions for the US population in 1968:

    data USbyAge;
       input Age _PSPCT_;
       datalines;
    ;

You can then use PROC SURVEYMEANS to compute age-adjusted mortality rates for Florida and California. The procedure specification in the following SAS statements is the same as when you compute the crude rates, except that you add a POSTSTRATA statement, which specifies poststratification on the variable Age, and the PSPCT= option, which specifies that the population proportions are contained in the data set USbyAge.

    proc surveymeans data=florida ratio rate=.5;
       cluster fips;
       weight SamplingWeight;
       var deaths population;
       poststrata age / pspct=usbyage;
       ratio 'Florida Standardized Mortality Rate' deaths/population;
    run;

    proc surveymeans data=california ratio rate=.5;
       cluster fips;
       weight SamplingWeight;
       var deaths population;
       poststrata age / pspct=usbyage;
       ratio 'California Standardized Mortality Rate' deaths/population;
    run;

Output 5 and Output 6 show the estimation results. The age-adjusted mortality rates for Florida and California are 0.70% and 0.48%, respectively. The ratio of the age-adjusted mortality rates, computed from these rounded values, is roughly 1.46. Therefore, on an age-adjusted basis, the mortality rate in Florida in 1968 is almost 1.5 times the mortality rate in California in the same year.

Output 5 Standardized Mortality Rate for Florida

The SURVEYMEANS Procedure

Data Summary
Number of Clusters      34
Number of Poststrata    13
Number of Observations  442
Sum of Weights          871

Ratio Analysis: Florida Standardized Mortality Rate (numerator Deaths, denominator Population)

Output 6 Standardized Mortality Rate for California

The SURVEYMEANS Procedure

Data Summary
Number of Clusters      29
Number of Poststrata    13
Number of Observations  377
Sum of Weights          754

Ratio Analysis: California Standardized Mortality Rate (numerator Deaths, denominator Population)

References

Curtin, L. R. and Klein, R. J. (1995), "Direct Standardization (Age-Adjusted Death Rates)," Healthy People 2000: Statistical Notes, DHHS Publication No. (PHS).

Ingram, D. D. and Franco, S. J. (2012), "NCHS Urban-Rural Classification Scheme for Counties," Vital and Health Statistics, Series 2: Data Evaluation and Methods Research, no. 154, DHHS Publication No. (PHS).

Lehtonen, R. and Pahkinen, E. (2004), Practical Methods for Design and Analysis of Complex Surveys, 2nd Edition, Chichester, UK: John Wiley & Sons.

Lohr, S. L. (2010), Sampling: Design and Analysis, 2nd Edition, Boston: Brooks/Cole.

Särndal, C. E., Swensson, B., and Wretman, J. (1992), Model Assisted Survey Sampling, New York: Springer-Verlag.
