Estimating Nonlinear Models With Multiply Imputed Data

Catherine Phillips Montalto¹ and Yoonkyung Yuh²

Repeated-imputation inference (RII) techniques for estimating nonlinear models with multiply imputed data are described. RII techniques are used to estimate a logit model using the 1995 Survey of Consumer Finances. RII techniques use all information available in multiply imputed data and incorporate estimates of imputation error. The advantage of RII techniques for analysis of multiply imputed data is that RII techniques produce more efficient estimates and provide a basis for more valid inference. Researchers who do not use RII techniques when estimating nonlinear models on multiply imputed data may incorrectly conclude that some independent variables have statistically significant effects.

Key Words: Logit, Probit, Repeated-imputation inference (RII), Survey of Consumer Finances, Tobit

Multiple imputation is a technique commonly used to deal with missing information on individual items in survey data. Multiple imputation employs multivariate statistical methods to impute missing data, resulting in multiple complete data sets.ᵃ Since 1989, the Survey of Consumer Finances (SCF) data files have contained five complete data sets, referred to as implicates.ᵇ The benefit to researchers of these multiply imputed data files is that the data files contain no missing values; the cost is that researchers must learn how to analyze data appropriately in the presence of five complete data sets. An appropriate method of analyzing multiply imputed data is to combine the results obtained independently on each of the separate implicates using multiple imputation combining rules. Inferences based on the appropriately combined results are called repeated-imputation inferences (RII) (Rubin, 1987, 1996).
Montalto and Sung (1996) provide a clear discussion of multiple imputation in the SCF from a user's point of view, and use the multiple imputation combining rules to obtain estimates of descriptive statistics and ordinary least squares regression coefficients. The use of these two specific examples has caused some users of the SCF to question whether RII techniques can be applied more broadly, for example to the estimation of nonlinear models. The purpose of this research note is to address the appropriateness of RII techniques in a broad range of applications including nonlinear estimation, to briefly explain the intuition behind RII techniques, and to emphasize the advantages of using RII techniques to obtain efficient estimates and make valid inferences from multiply imputed data. An example using RII techniques in nonlinear regression analysis with Survey of Consumer Finances data is presented. A more technical discussion of repeated-imputation inference techniques is presented in the Appendix.

When Is It Appropriate to Use RII Techniques?

RII techniques are appropriate whenever inferences made from the data analysis are based on point estimates and variances. For descriptive statistics, inferences are based on estimates of the mean and the variance of the mean. (The square root of the variance is the standard error of the mean.) For linear regressions, inferences are based on estimates of regression coefficients and the standard errors of these estimates. Similarly, correlations, factor loadings, population proportions, and nonlinear regressions (including logit, probit and tobit) yield inferences based on estimates and variances.

For descriptive statistics, analysis of each complete data set produces an estimate of a mean and the variance of the mean. When a researcher has an interest in one variable, the estimate of the mean and the variance of the mean are single numbers.
When a researcher has an interest in two or more variables, the estimate is represented by a vector, and the variance is represented by a variance-covariance matrix. For ordinary least squares and nonlinear regression, analysis of each implicate produces a k-dimensional vector of coefficients, and a k x k variance-covariance matrix. In the case of nonlinear regression models, the estimate vector and variance-covariance matrix are based on asymptotic calculations, but RII techniques are still appropriate.

The criteria for determining appropriateness of RII techniques are independent of the functional form of the estimation method. RII techniques are appropriate whenever inferences made from the data analysis are based on point estimates and variances of the point estimates.

¹ Catherine Phillips Montalto, Assistant Professor, Consumer and Textile Sciences Department, The Ohio State University, 1787 Neil Avenue, Columbus, OH 43210-1295. Phone: (614) 292-4571. Fax: (614) 292-7536. E-mail: montalto.2@osu.edu
² Yoonkyung Yuh received her Ph.D. from The Ohio State University in September, 1998. E-mail: yuh@afcpe.org
Financial Counseling and Planning, Volume 9(1), 1998. © 1998, Association for Financial Counseling and Planning Education. All rights of reproduction in any form reserved.

Intuition Behind RII Techniques

The multiple imputation combining rules are straightforward, and require only the calculation of means and variances of the results obtained independently from the separate implicates (Rubin, 1987). Point estimates from the separate implicates are averaged to create a single parameter estimate. The average variance within each implicate and the variance between the implicates are summed to create an estimate of the total variance.

Advantages of Using RII Techniques

Two advantages of RII techniques will be emphasized: (1) RII techniques produce more efficient estimates, and (2) RII techniques provide a basis for more valid inference. With respect to efficiency, since the RII estimates use data from all implicates they are more efficient than estimates that use data from a single implicate. The multiple imputation combining rules average over the variability between the individual implicates to produce the best estimate of what the results would have been if the missing data had been observed.

An extremely important advantage of RII techniques is that they provide a basis for more valid inference since the variability due to missing values (i.e., imputation error) is incorporated into the variance estimates. In general, this will increase the estimate of variance compared to estimates that ignore this variability, resulting in more stringent tests for statistical significance.
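For a single parameter, the combining rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' SAS code: the function name `combine_scalar` and the five estimates and variances are hypothetical, and the (1 + 1/m) weight on the between-implicate variance follows the total variance formula given in the Appendix.

```python
from statistics import mean, variance


def combine_scalar(estimates, variances):
    """Combine one parameter's results from m implicates (hypothetical helper)."""
    m = len(estimates)
    q_bar = mean(estimates)           # RII point estimate: average of the implicate estimates
    u_bar = mean(variances)           # average within-implicate variance
    b = variance(estimates)           # between-implicate variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b   # total variance, including imputation error
    return q_bar, u_bar, b, total


# Hypothetical results for one coefficient across five implicates.
estimates = [0.50, 0.62, 0.41, 0.55, 0.47]
variances = [0.040, 0.042, 0.039, 0.041, 0.038]

q_bar, u_bar, b, total = combine_scalar(estimates, variances)
print(f"RII estimate:              {q_bar:.4f}")
print(f"Average within variance:   {u_bar:.5f}")
print(f"Between variance:          {b:.5f}")
print(f"Total variance:            {total:.5f}")
```

In this made-up case the total variance exceeds every single-implicate variance, which is the sense in which tests based on RII results are more stringent.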
When imputation error is ignored, the variance estimate will be biased and will underestimate the true variance. Inferences based on the biased variance estimate may incorrectly indicate that some relationships are statistically significant.

RII versus Alternative Analytical Approaches

Analytical approaches other than RII have been used to analyze multiply imputed data. Two approaches that have often been used are analysis of data from only a single implicate, and analysis of data averaged across the implicates. Montalto and Sung (1996) cite numerous studies that have analyzed single implicates of the SCF. Kennickell (1997, lines 152-156) describes analysis of averaged data. There are limitations to both approaches.

Analysis of data from only one implicate of a multiply imputed data set implicitly treats the imputed values as if they are known with certainty. Since the variability due to missing values is ignored, the estimates of variance will be too small, and the statistical significance of relationships will be overestimated. Additionally, parameter estimates from only one implicate will be less efficient than parameter estimates that use data from all implicates.

Another approach commonly used to analyze multiply imputed data is to average the variables across the multiple complete data sets, and then analyze the averaged data. Point estimates derived in this method are equivalent to point estimates derived by RII techniques, and therefore are efficient. However, the variance estimates ignore the variability due to missing values (i.e., imputation error), and as a result the statistical significance of relationships will be overestimated.

The risk of not using RII techniques to analyze multiply imputed data is highest when the extent of variability between the imputed values is high. The extent of variability between the separate implicates depends on the proportion of information that has been imputed as well as the variation within the stochastic imputation process.
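The contrast between the three approaches can be illustrated with a toy data set. The sketch below is an assumption-laden illustration, not an analysis of SCF data: it uses a simple mean, for which the equality of the averaged-data and RII point estimates is exact, and six made-up households of which the last two have imputed values that differ across five hypothetical implicates.

```python
from statistics import mean, variance

# Hypothetical toy data: rows are implicates, columns are households.
# The first four values are observed; the last two are imputed.
implicates = [
    [10.0, 12.0, 9.0, 11.0, 14.0, 8.0],
    [10.0, 12.0, 9.0, 11.0, 10.0, 12.0],
    [10.0, 12.0, 9.0, 11.0, 16.0, 7.0],
    [10.0, 12.0, 9.0, 11.0, 9.0, 13.0],
    [10.0, 12.0, 9.0, 11.0, 12.0, 10.0],
]
m = len(implicates)
n = len(implicates[0])

# Approach 1: analyze a single implicate only.
single_mean = mean(implicates[0])
single_var = variance(implicates[0]) / n   # variance of the mean

# Approach 2: average each household's values across implicates, then analyze.
averaged = [mean(col) for col in zip(*implicates)]
avg_mean = mean(averaged)
avg_var = variance(averaged) / n

# RII: analyze each implicate separately, then combine.
means = [mean(row) for row in implicates]
vars_of_mean = [variance(row) / n for row in implicates]
q_bar = mean(means)
u_bar = mean(vars_of_mean)
b = variance(means)
total = u_bar + (1 + 1 / m) * b

print(f"single implicate: mean={single_mean:.3f}, variance={single_var:.3f}")
print(f"averaged data:    mean={avg_mean:.3f}, variance={avg_var:.3f}")
print(f"RII:              mean={q_bar:.3f}, variance={total:.3f}")
```

With these numbers the averaged-data mean matches the RII mean, while its variance is smaller than the RII total variance because averaging smooths away the differences between imputed values.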
Empirical Example: Retirement Wealth Adequacy

The analysis of retirement wealth adequacy by Yuh, Montalto and Hanna (1998) is used to illustrate the risk of ignoring imputation error.ᶜ The dependent variable is an indicator variable for retirement wealth adequacy. Due to the dichotomous nature of the dependent variable, logistic regression is used for the analysis. Independent variables include demographic characteristics of the householder, financial characteristics, saving/investment decision variables, and attitude/expectation variables. There are a total of 28 independent variables in the model (Table 1).

Table 1
Logistic analysis of retirement wealth adequacy

Variables                                                             Sig.
Demographic Characteristics
  Age (reference category: 55 and over)
    35-44
    45-54
  Education (reference category: less than high school grad.)
    high school grad.
    some college
    college and more
  Marital Status (reference category: couple)
    unmarried male
    unmarried female
  Race/Ethnicity (reference category: White non-Hispanic)
    Black non-Hispanic
    Hispanic
    Other (including Asian American)
Financial Characteristics
  Log of normal income                                                 +
  DB ownership                                                         +
  DC ownership                                                         +
  Housing Tenure (reference category: own without mortgage)
    rent                                                               -
    own with mortgage                                                  -
Saving/Investment Decision Variables
  Retirement Age (reference category: retire 61 or earlier)
    retire 62-65                                                       +
    retire 66 or later                                                 +
  Stock Shares (of assets excluding housing asset) (reference: 0%)
    0% < stock < 13.5%                                                 +
    13.5% ≤ stock < 36.5%                                              +
    stock ≥ 36.5%                                                      +
  Retirement as a saving goal
  Spending ≥ income                                                    -
Attitude/Expectation Variables
  Subjective Life Expectancy (reference: expect to live > 42)
    expect to live ≤ 24 years
    24 < expect to live ≤ 32
    32 < expect to live ≤ 42
  High risk taking
  Expect enough pension
  Expect income growth

+ = variable was positive and statistically significant across the five implicates and in the RII results.
- = variable was negative and statistically significant across the five implicates and in the RII results.
[symbol missing] = variable was statistically significant in some but not all of the implicates and the RII results.
[symbol missing] = variable was not statistically significant in any of the implicates and thus was not statistically significant in the RII results.

Eleven of the independent variables are statistically significant and consistent in terms of sign across the five implicates and in the RII results.ᵈ These variables include the variables measuring financial characteristics and saving/investment decisions (with the exception of the variable indicating if retirement is a saving goal).
Ten of the independent variables are NOT statistically significant in any of the implicates and thus are not statistically significant in the RII results.

Seven of the independent variables are statistically significant in some but not all of the five implicates and the RII results. These variables include the variables measuring the demographic characteristics of age and race/ethnicity of the respondent (with the exception of the indicator variable for other race), two of the three attitude/expectation variables measuring subjective life expectancy, and the variable indicating whether the household expects income growth in the future. These variables illustrate clearly how empirical results from individual implicates can differ from one another, as well as from the RII results.

Selected variables are used to illustrate the risks of not using RII techniques to analyze multiply imputed data. Results for Implicate 1 and Implicate 2 indicate that households with a Black non-Hispanic householder are less likely to have adequate retirement wealth than otherwise similar households with a White non-Hispanic householder. However, the effect of having a Black non-Hispanic householder is not statistically significant in the results for Implicates 3, 4, or 5, or in the RII results. Thus, a researcher basing inferences on analysis of only Implicate 1 or 2 would incorrectly conclude that households with a Black non-Hispanic householder are less likely to have adequate retirement wealth.

Since the logit coefficients cannot be directly compared to evaluate the magnitude of differences in the effects of estimated coefficients across the five implicates, odds ratios are calculated for selected variables.ᵉ There is some evidence of more variability in the magnitude of effects for variables that have inconsistent results (in terms of statistical significance) across the implicates, compared to variables that are statistically significant across all implicates.
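The odds ratio comparison can be reproduced with simple arithmetic. The sketch below uses the odds ratios reported in the discussion that follows (.549, .819 and .630 for the Black non-Hispanic indicator; .102, .125 and .114 for the spending indicator). The helper names are hypothetical, and the logit coefficient is back-calculated from the reported RII odds ratio purely for illustration.

```python
import math


def odds_ratio(beta):
    """Odds ratio for a dichotomous variable with logit coefficient beta."""
    return math.exp(beta)


def relative_deviation(single, rii):
    """Deviation of a single-implicate odds ratio from the RII odds ratio."""
    return (single - rii) / rii


# Odds ratios reported in the text (lowest implicate, highest implicate, RII).
black_low, black_high, black_rii = 0.549, 0.819, 0.630
spend_low, spend_high, spend_rii = 0.102, 0.125, 0.114

# Logit coefficient implied by the RII odds ratio (back-calculated, for illustration).
beta_rii = math.log(black_rii)
print(f"implied RII logit coefficient: {beta_rii:.3f}")
print(f"odds ratio recovered from it:  {odds_ratio(beta_rii):.3f}")

for label, low, high, rii in [
    ("Black non-Hispanic", black_low, black_high, black_rii),
    ("Spending >= income", spend_low, spend_high, spend_rii),
]:
    below = relative_deviation(low, rii)
    above = relative_deviation(high, rii)
    print(f"{label}: {below:+.1%} to {above:+.1%} around the RII odds ratio")
```

The spread around the RII odds ratio is roughly twice as wide for the variable with inconsistent significance as for the variable that is significant in every implicate.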
The variability in the magnitude of effects for variables that have inconsistent results (in terms of statistical significance) across the five implicates is illustrated with the indicator variable for households with a Black non-Hispanic householder. This odds ratio ranges from .549 to .819 across the five implicates, indicating that these households are only 55% to 82% as likely as otherwise similar households with a White non-Hispanic householder to be adequately prepared for retirement. The odds ratio based on the RII coefficient is .630. Thus, the odds ratio based on results of single implicates ranges from 13% below to 30% above the odds ratio based on the RII results, a range of 43 percentage points.ᶠ

The variability in the magnitude of effects for variables that are statistically significant across all implicates is illustrated with the variable indicating whether the household spent at least as much as income. This odds ratio ranges from .102 to .125 across the five implicates, compared to an odds ratio of .114 based on the RII coefficient. The odds ratios based on results of single implicates range from 10% below to 10% above the odds ratio based on the RII results, a range of only 20 percentage points.

In most (but not all) cases the range of the odds ratios based on results of single implicates relative to the odds ratio based on the RII coefficient is larger for variables with inconsistent results across the five implicates compared to variables that are statistically significant across all implicates, and therefore, in the RII results. To some extent this finding is expected, but nonetheless, it provides another way of assessing the practical implications of alternative approaches to analyzing multiply imputed data.

Endnotes

a. A complete data set is a data set free of missing data.
b. Kennickell, Starr-McCluer & Sundén (1997) describe the survey, the survey procedures and the statistical measures.
c. Refer to Yuh, Montalto & Hanna (1998) for specific information on the method of analysis and variable measurement, as well as discussion of the results.
d. A table summarizing the logistic results for each of the separate implicates, as well as the RII results, is available at: www.afcpe.org/nonr.htm Additionally, this table illustrates the calculation of the test statistic for the overall significance of the RII logit equation that is described in the following Appendix.
e. The odds ratio for a dichotomous independent variable is calculated as e^β. These odds ratios are presented in the table cited in endnote d.
f. Calculation: (.549-.630)/.630 = -.13; (.819-.630)/.630 = .30

Appendix

An Example Using RII Techniques for Nonlinear Regression Analysis
(Notation closely follows Montalto and Sung (1996), which closely follows Rubin (1987).)

The nonlinear regression analysis (for example, logit, probit, or tobit) is conducted on each of the five implicates separately. The results obtained independently from the five separate implicates are combined to obtain the RII estimates. The best estimate of the nonlinear regression coefficients is the average of the results from the five implicates (m=5)

\[ \bar{Q} = \frac{1}{m} \sum_{i=1}^{m} Q_i \tag{1} \]

where Q_i is a 1xk vector. The within imputation variance is the average of the variance-covariance matrices from the five implicates

\[ \bar{U} = \frac{1}{m} \sum_{i=1}^{m} U_i \tag{2} \]

where U_i is a kxk matrix. The between imputation variance is the sample variance in the estimates of Q_i from the five implicates and is estimated by

\[ B = \frac{1}{m-1} \sum_{i=1}^{m} (Q_i - \bar{Q})^{t} (Q_i - \bar{Q}) \tag{3} \]

The transpose of the vector is indicated by t. The total variance-covariance matrix is given by

\[ T = \bar{U} + (1 + m^{-1}) B \tag{4} \]

A Wald chi-squared statistic is used to test whether each estimated coefficient is significantly different from zero (Maddala, 1992, pp. 120-124). The Wald chi-squared statistic can be computed by dividing the squared parameter estimate by its variance estimate.

\[ \chi^2 = \frac{\bar{Q}^2}{T} \tag{5} \]
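The combining rules in equations (1) through (5) can be sketched in plain Python for a small model. This is a minimal illustration under stated assumptions rather than the authors' SAS implementation: the function `combine_implicates`, the two coefficients, and all numeric inputs are hypothetical, and 3.841 is the standard 5% critical value of the chi-square distribution with one degree of freedom.

```python
def combine_implicates(q_list, u_list):
    """Apply the combining rules to m coefficient vectors and m covariance matrices."""
    m = len(q_list)
    k = len(q_list[0])
    # (1) average of the coefficient vectors
    q_bar = [sum(q[j] for q in q_list) / m for j in range(k)]
    # (2) average of the within-implicate variance-covariance matrices
    u_bar = [[sum(u[a][c] for u in u_list) / m for c in range(k)] for a in range(k)]
    # (3) between-implicate variance-covariance matrix (divisor m - 1)
    b_mat = [[sum((q[a] - q_bar[a]) * (q[c] - q_bar[c]) for q in q_list) / (m - 1)
              for c in range(k)] for a in range(k)]
    # (4) total variance-covariance matrix
    t_mat = [[u_bar[a][c] + (1 + 1 / m) * b_mat[a][c] for c in range(k)] for a in range(k)]
    # (5) Wald chi-squared statistic for each coefficient (uses the diagonal of T)
    wald = [q_bar[j] ** 2 / t_mat[j][j] for j in range(k)]
    return q_bar, u_bar, b_mat, t_mat, wald


# Hypothetical logit results for two coefficients from five implicates.
q_list = [[-0.60, 1.20], [-0.55, 1.25], [-0.20, 1.18], [-0.25, 1.22], [-0.40, 1.15]]
u_list = [[[0.06, 0.0], [0.0, 0.02]] for _ in range(5)]

q_bar, u_bar, b_mat, t_mat, wald = combine_implicates(q_list, u_list)

CRITICAL_5PCT = 3.841  # chi-square, one degree of freedom
single_wald = q_list[0][0] ** 2 / u_list[0][0][0]  # implicate 1 only, first coefficient

print(f"implicate 1 Wald statistic: {single_wald:.2f}")
for j in range(len(q_bar)):
    verdict = "significant" if wald[j] > CRITICAL_5PCT else "not significant"
    print(f"coefficient {j}: RII estimate={q_bar[j]:.3f}, Wald={wald[j]:.2f} ({verdict})")
```

In this made-up example the first coefficient passes the 5% test when only implicate 1 is analyzed but not after the between-implicate variance is added, while the second coefficient remains significant either way.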
The Wald chi-squared statistic for testing an individual coefficient is distributed chi-square with one degree of freedom.

The test statistic for the overall significance of the nonlinear regression can be computed from the χ² statistics (-2 Log Likelihood for the contribution of the explanatory variables only) from the analysis conducted on each of the five implicates separately. The test statistic has an F distribution with k and (k + 1)v/2 degrees of freedom where k is equal to the number of independent variables excluding the intercept in the regression.

\[ \hat{D} = \frac{\dfrac{\bar{d}}{k} - \dfrac{m-1}{m+1}\, r}{1 + r} \tag{6} \]

where \( \bar{d} = \frac{1}{m} \sum_{l=1}^{m} d_l \) is the average of the five χ² statistics,

\[ r = (1 + m^{-1}) \, \mathrm{Tr}(B \bar{U}^{-1}) / k \tag{7} \]

where Tr(A) is the sum of the diagonal elements in the kxk matrix A, and

\[ v = (m - 1)(1 + r^{-1})^{2} \tag{8} \]

SAS code for estimation of selected nonlinear models will be available at www.afcpe.org/nonr.htm

References

Kennickell, A. B. (1997). Codebook for 1995 Survey of Consumer Finances. Washington, D.C.: Board of Governors of the Federal Reserve System.
Kennickell, A. B., Starr-McCluer, M. & Sundén, A. E. (1997). Family finances in the U.S.: Recent evidence from the Survey of Consumer Finances. Federal Reserve Bulletin, 83(1), 1-24.
Maddala, G. S. (1992). Introduction to Econometrics, Second Edition. New York, NY: Macmillan Publishing Company.
Montalto, C. P. & Sung, J. (1996). Multiple imputation in the 1992 Survey of Consumer Finances. Financial Counseling and Planning, 7, 133-46. [also available as a WWW document http://www.hec.ohiostate.edu/hanna/imput.htm]
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473-89.
Yuh, Y., Montalto, C. P. & Hanna, S. (1998). Are Americans prepared for retirement? Financial Counseling and Planning, 9(1), 1-12.