Proc SurveyCorr. Jessica Hampton, CCSU, New Britain, CT


ABSTRACT

This paper provides background information on survey design, using data from the Medical Expenditure Panel Survey (MEPS) as an example. The SAS/STAT survey procedures used to analyze such data sets include PROC SURVEYMEANS, PROC SURVEYFREQ, PROC SURVEYREG, and PROC SURVEYLOGISTIC. There is no PROC SURVEYCORR. When there are many possible predictor variables to examine, one solution is to use a macro to extract r-squared values from multiple iterations of PROC SURVEYREG.

INTRODUCTION

The Medical Expenditure Panel Survey (MEPS) is a household survey that has been administered annually by the U.S. Department of Health and Human Services since 1996, with data released for public use. Although the sample is nationally representative, it is not a simple random sample of the population and must be weighted to produce reliable population estimates. The panel design features several rounds of interviews and overlapping panels, each covering two full calendar years, with multiple components (household, insurance/employer, and medical provider). The household component used in this analysis covers the following topics: demographics, household income, employment, diagnosed health conditions, additional health status issues, medical expenditures and utilization, satisfaction with and access to care, and insurance coverage.

This project is based on the household component 2010 full year consolidated data file released in September 2012 (H138.SSP or H138.DAT). MEPS makes the data available both as a SAS transport data file and in ASCII format, and provides sample SAS code either to read the ASCII file into SAS using a DATA step or to convert the .ssp file into a SAS data set using PROC XCOPY. The latter method was used here to convert the SAS transport data file into a SAS data set.
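The conversion step described above can be sketched as follows. This is a minimal sketch and not the author's actual program: the directory paths are hypothetical, and the name of the converted data set (assumed here to be H138) and the renaming step are assumptions made so that the result matches the PQI.MEPS_2010 data set used in the rest of this paper.

```sas
/* Hypothetical paths: adjust to wherever the MEPS files were downloaded. */
LIBNAME PQI 'C:\MEPS\SASDATA';
FILENAME IN1 'C:\MEPS\DOWNLOAD\H138.SSP';

/* Convert the SAS transport file into a SAS data set in library PQI. */
PROC XCOPY IN=IN1 OUT=PQI IMPORT;
RUN;

/* Assumption: the converted data set is named H138. Rename it so that
   it matches the data set name used in the examples in this paper. */
PROC DATASETS LIBRARY=PQI NOLIST;
   CHANGE H138=MEPS_2010;
QUIT;
```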
MEPS 2010 data consist of data from two overlapping panels: Rounds 3-5 of Panel 14 and Rounds 1-3 of Panel 15 (MEPS, 2012; MEPS-HC Panel Design and Collection Process).

SURVEY DESIGN METHODS

MEPS survey design is complex, using both stratification and multi-stage clustering techniques; that is, as previously noted, it is not a simple random sample of the population.

Stratification is a survey design technique that typically divides the population by demographic variables such as age, race, sex, or income. The goal is to maximize homogeneity within strata and heterogeneity between strata. Stratification is sometimes used to oversample groups that are under-represented in the general population or that have characteristics relevant to what is being studied (for example, MEPS oversamples blacks, Hispanics, and low-income households).

Clustering is typically done by geography in order to reduce survey costs where it is not feasible or cost-effective to take a random sample of an entire population, such as that of the U.S. Because two families in the same neighborhood are more likely to be similar demographically (in regard to income, for instance), observations within a cluster are correlated, and ignoring this within-cluster correlation leads to underestimated variance. Therefore, we want clusters to be spatially compact for cost effectiveness but as heterogeneous within as possible for reasonable variance. Sometimes a multi-stage clustering approach is used, as in MEPS; for example, a sample of counties is taken, then a sample of blocks is taken from the sampled counties, and finally individuals/households are surveyed from the sampled blocks (Carrington et al., 2000).

Information about how the survey was designed is then stored in survey design variables, such as those shown in Table 2. These survey design variables are used to obtain population means and estimates and can also be used in regression analysis.
The original consolidated data set contains 32,846 rows, each unique at the person level (31,228 of them with positive person weights), and 1,911 variables. For the full list of original variables, see the MEPS HC-138 codebook (2012). MEPS includes a unique person-level identifier, DUPERSID, along with the following survey design variables: the estimation weight variable PERWT10F, the sampling strata variable VARSTR, and the primary sampling unit (or cluster) variable VARPSU. These survey design variables are used in SAS survey procedures such as PROC SURVEYMEANS, PROC SURVEYFREQ, and PROC SURVEYREG (SAS Institute, 2008). Survey procedures are designed to take into account survey design methods such as those used by MEPS, where some groups are oversampled for both sample size and policy reasons (MEPS, 2012, B. Background, 1.0 household component). Person weights can also be used in some non-survey procedures such as PROC REG and PROC CORR.
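A quick check of the counts and identifiers described above might look like the following. It is an illustrative sketch that assumes the converted file is stored as PQI.MEPS_2010, as in the later examples; the output column names are made up for this example.

```sas
PROC SQL;
   SELECT COUNT(*)                  AS N_ROWS,           /* rows in the file                  */
          COUNT(DISTINCT DUPERSID)  AS N_PERSONS,        /* should equal N_ROWS if unique     */
          SUM(PERWT10F > 0)         AS N_POSITIVE_WT,    /* rows with a positive weight       */
          SUM(PERWT10F)             AS EST_POPULATION    /* sum of weights = estimated size   */
                                       FORMAT=COMMA15.
   FROM PQI.MEPS_2010;
QUIT;
```

If the data set matches the description in the text, N_ROWS and N_PERSONS should both be 32,846 and N_POSITIVE_WT should be 31,228.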

SAS SURVEY PROCEDURES

Using PROC SURVEYMEANS and PROC SURVEYFREQ for an overview of the data, we can obtain estimated mean expenditures for the population, as well as estimated population sizes. PROC SURVEYFREQ outputs the sample frequency, weighted (estimated population) frequency, percent, and standard error of percent for each variable listed in the TABLES statement. In the TABLES statement below, these variables are PRIEU10 (covered by private employer/union group insurance), PRING10 (covered by private non-group insurance), and INSCOV10 (values 1, 2, and 3 indicate any private insurance coverage in 2010, only public insurance coverage, and uninsured, respectively). Survey design variables are referenced in the STRATA, CLUSTER, and WEIGHT statements as follows:

    PROC SURVEYFREQ DATA=PQI.MEPS_2010;
       STRATA VARSTR;
       CLUSTER VARPSU;
       WEIGHT PERWT10F;
       TABLES PRIEU10 PRING10 INSCOV10;
    RUN;

Figure 1.

We can use PROC SURVEYMEANS to look at mean expenditures for the overall population and for selected segments of the population. Output includes the sample N, estimated mean, standard error of the mean, and 95% confidence limits. For example, the following SAS statements output the mean total (TOTEXP10) and out-of-pocket (TOTSLF10) expenditures for those with private non-group insurance coverage (DOMAIN PRING10) as well as for the total population. Figures 1 and 2 show sample output for PROC SURVEYFREQ and PROC SURVEYMEANS.

    PROC SURVEYMEANS DATA=PQI.MEPS_2010;
       STRATA VARSTR;
       CLUSTER VARPSU;
       WEIGHT PERWT10F;
       DOMAIN PRING10;
       VAR TOTEXP10 TOTSLF10;
    RUN;

Figure 2.

PROC SURVEYREG and PROC SURVEYLOGISTIC have some of the same output and diagnostic options as their non-survey counterparts, PROC REG and PROC LOGISTIC. Default output includes fit statistics (R-squared, AIC, and the Schwarz criterion), chi-squared tests of the global null hypothesis, degrees of freedom, and coefficient estimates for each parameter along with the standard errors of the coefficient estimates and p-values. PROC SURVEYLOGISTIC also includes odds ratio point estimates and 95% Wald confidence intervals for each input parameter, as does PROC LOGISTIC.

The survey procedures are more limited in some ways, though. For example, PROC LOGISTIC can use an option such as stepwise selection to restrict the output to predictors that meet a specified significance level; there is also an option to rank those predictors. These options do not work with PROC SURVEYLOGISTIC, which makes the output more unwieldy with a large number of predictors. Most notably in terms of differences, PROC LOGISTIC automatically outputs a chi-squared test of the residuals for each input variable; however, any analysis of residuals is irrelevant for the survey procedures, since assumptions of normality and equal variance are not applicable due to the survey design. Tabled residuals are not output at all for the survey procedures, although covariance matrices are available for both as a non-default option. Similarly, influential observations/outliers are not analyzed, due to the use of person weights.

As long as we use person weights, we get the same coefficients with a regular PROC REG as we do with PROC SURVEYREG, but the standard error estimates are different and predictor significance can also vary.
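The comparison in the last paragraph can be tried directly by fitting the same model both ways. The sketch below is illustrative only: the choice of AGE10X as the predictor of TOTEXP10 is arbitrary and is not taken from the original analysis.

```sas
/* Weighted least squares: uses the person weights but treats the
   observations as if they came from a simple random sample. */
PROC REG DATA=PQI.MEPS_2010;
   WEIGHT PERWT10F;
   MODEL TOTEXP10 = AGE10X;
RUN;
QUIT;

/* Design-based fit: same weights, plus strata and clusters,
   which are used when estimating the standard errors. */
PROC SURVEYREG DATA=PQI.MEPS_2010;
   STRATA VARSTR;
   CLUSTER VARPSU;
   WEIGHT PERWT10F;
   MODEL TOTEXP10 = AGE10X / SOLUTION;
RUN;
```

The intercept and slope estimates should agree between the two outputs, while the standard errors and p-values would generally differ.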
Limitations arise from the complex nature of the survey design and from having only four survey analysis procedures available in SAS (PROC SURVEYFREQ, PROC SURVEYMEANS, PROC SURVEYREG, and PROC SURVEYLOGISTIC). Other procedures can be used, and in fact several other SAS procedures allow for sampling weights, but because they do not fully account for the survey design, some caveats apply to the results. If person weights are ignored and one tries to generalize sample findings to the entire population, totals, percentages, and means are overstated for the groups that are oversampled and understated for the others. It is therefore highly undesirable to estimate population frequencies or means without using the procedures discussed above. In regression analysis, ignoring person weights leads to biased coefficient estimates. If the sampling strata and cluster variables are ignored, means and coefficient estimates are unaffected, but the standard error (or population variance) may be underestimated; that is, the reliability of an estimate may be overestimated. For example, when comparing one estimated population mean to another, the difference may appear to be statistically significant when it is not (Machlin, Yu, & Zodet, 2005).

PROC SURVEYCORR

For more in-depth exploratory data analysis, we want to look at correlations of potential predictor variables with target variables such as total expenditures, out-of-pocket expenditures, and insurance coverage status variables. However, there is no PROC SURVEYCORR in SAS, and the output from PROC CORR can be unwieldy when there are many variables to analyze. We need a way to tabulate the output for large numbers of variables in a readable format; one solution is to use SPSS for unweighted Pearson correlations. Another solution is to use a macro approach to extract r-squared values from multiple iterations of PROC SURVEYREG.

First, the original continuous numeric age, income, and expenditure-related variables are analyzed in SAS using PROC CORR with person weights to find correlations with total expenditures:

    PROC CORR DATA=PQI.MEPS_2010 PLOTS=MATRIX RANK;
       VAR AGE10X WAGEP10X TTLP10X FAMINC10 POVLEV10 TOTSLF10 ERTEXP10 ERTOT10
           RXEXP10 OPTEXP10 OPTOTV10 OBVEXP10 OBTOTV10 IPTEXP10 IPNGTD10;
       WITH TOTEXP10;
       WEIGHT PERWT10F;
    RUN;

Inpatient expenditures and length of stay have the strongest Pearson correlations with total expenditures. ER expenditures and age are less strongly correlated, falling behind office-based visits, prescription, and outpatient expenditures in importance. Wage income is still significant at the 95% confidence level, with a negative correlation, but the other income variables are not significantly correlated, as shown in the output below:

Figure 3.

An alternate approach, which uses PROC SURVEYREG, is described as follows. First, SQL dictionary tables are used to select the number of predictor variables in the data set and to store it in a macro variable. Library and data set names are stored in dictionary tables in all caps. The number of predictor variables (NVAR) is the number of iterations SAS will use in the %DO loop later in the program.
    PROC SQL;
       SELECT NVAR INTO :NVAR
       FROM DICTIONARY.TABLES
       WHERE LIBNAME='PQI' AND MEMNAME='MEPS_2010';
    QUIT;

PROC CONTENTS is then used to obtain a list of the predictor variable names, which are stored as macro variables using the PROC SQL SELECT INTO clause:

    PROC CONTENTS DATA=PQI.MEPS_2010 OUT=CONTENTS NOPRINT;
    RUN;

    PROC SQL NOPRINT;
       SELECT NAME INTO :VAR1-:VAR76
       FROM WORK.CONTENTS;
    QUIT;
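As a variation on the two steps above (not the approach used in this paper), the variable names and their count can be obtained in a single query against DICTIONARY.COLUMNS, which avoids hard-coding the upper bound of the macro variable range. The sketch below assumes that only numeric variables are wanted as predictors and that the three design variables should be left out at this stage; the upper bound of 9999 is an arbitrary ceiling, since SAS creates only as many macro variables as rows are returned.

```sas
PROC SQL NOPRINT;
   /* One macro variable per numeric column, excluding the design variables. */
   SELECT NAME INTO :VAR1-:VAR9999
   FROM DICTIONARY.COLUMNS
   WHERE LIBNAME='PQI' AND MEMNAME='MEPS_2010'
     AND TYPE='num'
     AND UPCASE(NAME) NOT IN ('VARSTR','VARPSU','PERWT10F');
QUIT;

/* SQLOBS holds the number of rows returned by the last query. */
%LET NVAR=&SQLOBS;
%PUT NOTE: &NVAR predictor names stored in VAR1 through VAR&NVAR..;
```

With this variant, NVAR counts only the variables that will actually be used as predictors, so the later loop does not attempt to fit models on character variables or on the design variables themselves.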

Next, an empty table is created to store the output from PROC SURVEYREG as it is inserted one row at a time.

    PROC SQL;
       CREATE TABLE SURVEYCORR
          (PARAMETER CHAR(15),
           R_SQUARE  CHAR(8),
           R         NUM(8),
           PROBT     NUM(8));
    QUIT;

PROC SURVEYREG uses the survey design variables in its STRATA, CLUSTER, and WEIGHT statements, as do all the other survey procedures. It can also include an ODS OUTPUT statement to store parameter estimates, fit statistics, and other information created when the model runs. The R-square value is extracted from the FitStatistics output using PROC SQL, while the p-value and the sign of the estimated regression coefficient are extracted from the ParameterEstimates output. The square root of R-square gives the magnitude of the correlation coefficient, and the sign of the regression coefficient gives the direction of the correlation (negative or positive) with the target variable. The target variable itself is passed as a parameter when the macro is called, so the process can be re-run for multiple target variables of interest by changing the parameter, re-compiling, and re-running the macro.

    %MACRO CORR(TARGET=);
       PROC SURVEYREG DATA=PQI.MEPS_2010;
          STRATA VARSTR;
          CLUSTER VARPSU;
          WEIGHT PERWT10F;
          MODEL &TARGET=&&VAR&I / SOLUTION;
          ODS OUTPUT PARAMETERESTIMATES=PARAMETER_EST FITSTATISTICS=FIT;
       RUN;
       PROC SQL;
          INSERT INTO SURVEYCORR
          SELECT PARAMETER,
                 CVALUE1 AS R_SQUARE,
                 SIGN(ESTIMATE)*SQRT(INPUT(CVALUE1,8.)) AS R,
                 PROBT AS PVALUE
          FROM FIT, PARAMETER_EST
          /* UPCASE makes the label comparison independent of how the label is cased */
          WHERE UPCASE(LABEL1) = "R-SQUARE" AND PARAMETER = "&&VAR&I";
       QUIT;
    %MEND CORR;

The process above runs for each predictor variable, inserting a new row in the table each time, once it is called within the following loop:

    %MACRO LOOP;
       %DO I=1 %TO &NVAR;
          %CORR(TARGET=PUBAT10X);
       %END;
    %MEND LOOP;
    %LOOP;

After calling the macro, PROC SQL is used to format the results, sort by correlation size, and exclude the survey design variables from the tabulated output.
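The square-root step in the macro rests on a standard identity for a regression with one predictor and an intercept. The derivation below is a sketch that assumes the R-square reported by PROC SURVEYREG is the usual weighted quantity 1 - SSE/SST; the symbols (w_i for the person weight, x_i for the predictor, y_i for the target) are notation introduced here and do not appear in the original paper.

```latex
\begin{align*}
\bar{x}_w &= \frac{\sum_i w_i x_i}{\sum_i w_i}, \qquad
\bar{y}_w = \frac{\sum_i w_i y_i}{\sum_i w_i} \\[4pt]
\hat{\beta}_1 &= \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}
                      {\sum_i w_i (x_i - \bar{x}_w)^2} \\[4pt]
r_w &= \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}
            {\sqrt{\sum_i w_i (x_i - \bar{x}_w)^2 \,\sum_i w_i (y_i - \bar{y}_w)^2}} \\[4pt]
R^2 &= 1 - \frac{\sum_i w_i (y_i - \hat{y}_i)^2}{\sum_i w_i (y_i - \bar{y}_w)^2} = r_w^2
\quad\Longrightarrow\quad
r_w = \operatorname{sign}(\hat{\beta}_1)\,\sqrt{R^2}
\end{align*}
```

Since the slope and the weighted correlation share the same numerator, they always have the same sign. Under these assumptions the value of r recovered by the macro is the weighted Pearson correlation, so it should agree with PROC CORR run with a WEIGHT statement; the strata and cluster variables change only the standard error and the p-value attached to it.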

The formatting query is as follows:

    PROC SQL;
       CREATE TABLE PQI.SURVEYCORR AS
       SELECT PARAMETER,
              R_SQUARE,
              R FORMAT BEST6.4,
              PROBT AS PVALUE FORMAT PVALUE6.4,
              CASE WHEN PROBT <= 0.05 THEN "YES" ELSE "NO" END AS SIGNIFICANT_95
       FROM SURVEYCORR
       WHERE PARAMETER NOT IN ('DUPERSID','VARSTR','VARPSU','PERWT10F')
       ORDER BY ABS(R) DESC;
    QUIT;

The tabulated output for the top ten correlated variables is shown below.

Table 1. Target Variable TOTEXP10 - r Values with PROC SURVEYREG

    parameter      r-square   r   p-value   significance (95% C.L.)
    TOTEXP
    IPTEXP
    TOTEXP_HIGH
    IPNGTD
    OBVEXP
    RXEXP
    OBTOTV
    OPTEXP
    TOTSLF
    ADAPPT

As expected, all expenditure/utilization variables have a significant positive correlation with total expenditures; again, the largest drivers of total expenditures are inpatient expenditures and length of stay, followed by office-based visit and prescription expenditures.

CONCLUSIONS

SAS survey procedures are useful for analyzing data with complex survey design variables, but only four survey analysis procedures exist. PROC CORR can be used with person weights, but when desired, the additional survey design variables can be used with PROC SURVEYREG to obtain correlation values. The iterative approach described above, using the PROC SURVEYCORR macro, provides tabled output for a large number of predictor variables when PROC CORR's output is unsatisfactory. It should be noted that all variables used are numeric or have been reformatted as numeric variables prior to running the program; if categorical variables are included in the data set, then the CLASS statement must be used in PROC SURVEYREG to designate those variables.

REFERENCES

Carrington, W. J., Eltinge, J. L., & McCue, K. (2000). An Economist's Primer on Survey Samples. Working Paper no Suitland, MD: Center for Economic Studies, U.S. Bureau of the Census, October Retrieved from ftp://tigerline.census.gov/ces/wp/2000/ces-wp pdf January 15,

Machlin, S., & Yu, W. (2005). MEPS Sample Persons In-Scope for Part of the Year: Identification and Analytic Considerations. April Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from /survey_comp/hc_survey/hc_sample.shtml

Machlin, S., Yu, W., & Zodet, M. (2005). Computing Standard Errors for MEPS Estimates. January Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from

Medical Expenditure Panel Survey (MEPS). (2012). MEPS HC-138: 2010 Full Year Consolidated Data File. Rockville, MD: Agency for Healthcare Research and Quality (AHRQ), September Retrieved from September 27,

Medical Expenditure Panel Survey (MEPS). (2012). MEPS HC-138: 2010 Full Year Consolidated Data Codebook. Rockville, MD: Agency for Healthcare Research and Quality (AHRQ), August 30, Retrieved from September 27,

Medical Expenditure Panel Survey (MEPS). MEPS-HC Panel Design and Collection Process. Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from

Medical Expenditure Panel Survey (MEPS). Data Use Agreement. Agency for Healthcare Research and Quality, Rockville, MD. Retrieved from

SAS Institute Inc. (2008). SAS/STAT 9.2 User's Guide. Chapter 14: Introduction to Survey Sampling and Analysis Procedures. Pp Cary, NC: SAS Institute Inc. Retrieved from on January 15,

ACKNOWLEDGMENTS

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Jessica Hampton
Work Phone: (860)
Jessica.Hampton@gmail.com
Web:
