Comparison of design-based sample mean estimate with an estimate under re-sampling-based multiple imputations


Recai Yucel

1 Introduction

This section introduces the general notation used throughout this report. Let Y denote a binary random variable, and let the values of Y in a random sample of size n be denoted y = (y_1, y_2, ..., y_n). We assume this sample is obtained by simple random sampling without replacement (SRSWOR). We further work with the decomposition of y into observed and missing values: y_com = (y_obs, y_mis). The missingness indicator r_i is defined by

    r_i = 1 if y_i is missing, and r_i = 0 if y_i is observed,

with r = (r_1, r_2, ..., r_n). Methods dealing with missing data typically assume one of the following missingness mechanisms:

    MCAR: P(r | y_obs, y_mis) = P(r)
    MAR:  P(r | y_obs, y_mis) = P(r | y_obs)
    MNAR: P(r | y_obs, y_mis) cannot be reduced further; it depends on y_mis

Throughout this report we assume MCAR as the underlying missingness mechanism.

The general idea of multiple imputation is to replace missing values with m sets of plausible values. In parametric multiple imputation, an imputation model (e.g., a normal model) is used to draw these values from what is often called the predictive distribution of the missing values. To make a fair comparison with the design-based estimate of Stanek et al., we will not assume any parametric structure for Y, but will instead sample randomly from y_obs. The details are explained below.

2 Estimation routines

2.1 Stanek et al. estimate

The estimate of the population mean is proposed as a weighted sum of three terms:

    mu_hat_0 = (1/N) [ n Ybar + (N - n) P_hat_1 + N pi P_hat_2 ],    (1)

where

    Ybar = (1/n) Sum_{i=1}^n Y_i is the sample mean with missing values set to 0, i.e. Ybar = (1/n) Sum_{i=1}^n (1 - r_i) Y_i;
    P_hat_1 is the predictor of the response for a subject not selected (here Ybar);
    P_hat_2 is the predictor of the response for the N pi subjects whose responses are expected to be missing;
    pi is the estimate of the probability of responding.

The estimated variance of this estimate is

    V_hat(mu_hat_0) = (n_0 T^2) / (n n_1) + ((N - n)/N) (s^2 / n),    (2)

where

    T^2 = (1/(n - 1)) Sum_{i=1}^n (1 - r_i) Y_i^2,
    n_1 = n_obs, n_0 = n_mis, and
    s^2 is the sample variance based on y_obs, with y_mis set to 0.
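To make Section 2.1 concrete, the estimate (1) and variance (2) can be sketched in a few lines of Python. This is a minimal sketch, not Stanek et al.'s implementation: the defaults P_hat_1 = Ybar, P_hat_2 = mean of y_obs, and pi_hat = observed missingness rate are illustrative assumptions where the text leaves the predictors partly unspecified, and the variance formula follows one plausible reading of (2).

```python
import numpy as np

def stanek_estimate(y, r, N, p1=None, p2=None, pi=None):
    """Sketch of the design-based point estimate (1); illustrative only.

    y : sampled values, with missing entries already set to 0
    r : missingness indicators (1 = missing, 0 = observed)
    N : population size
    The defaults for p1, p2 and pi are assumptions, not definitions
    from the text: p1 = Ybar, p2 = mean(y_obs), pi = proportion missing.
    """
    y, r = np.asarray(y, float), np.asarray(r)
    n = len(y)
    y_obs = y[r == 0]
    ybar = y_obs.sum() / n                       # sample mean with y_mis = 0
    p1 = ybar if p1 is None else p1
    p2 = y_obs.mean() if p2 is None else p2
    pi = (r == 1).mean() if pi is None else pi
    return (n * ybar + (N - n) * p1 + N * pi * p2) / N

def stanek_variance(y, r, N):
    """Sketch of the variance estimate (2), under the same reading."""
    y, r = np.asarray(y, float), np.asarray(r)
    n = len(y)
    n1 = int((r == 0).sum())                     # n_obs
    n0 = n - n1                                  # n_mis
    T2 = ((r == 0) * y**2).sum() / (n - 1)
    s2 = np.where(r == 1, 0.0, y).var(ddof=1)    # sample variance, y_mis = 0
    return n0 * T2 / (n * n1) + (N - n) / N * s2 / n

# Example: n = 5 sampled from N = 10, last unit missing (value set to 0)
y = [1, 0, 1, 0, 0]
r = [0, 0, 0, 0, 1]
print(stanek_estimate(y, r, N=10))   # 0.5 under the default choices above
print(stanek_variance(y, r, N=10))   # ~0.055
```

The worked example can be checked by hand: n*Ybar = 2, (N - n)*P_hat_1 = 2, N*pi*P_hat_2 = 1, so mu_hat_0 = 5/10 = 0.5.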

2.2 Multiple imputation estimate

m sets of imputations are obtained by random draws from y_obs using SRSWOR. After obtaining m imputations of y_mis, we calculate the sample mean and the estimate of its variance for each imputed dataset. These estimates are then combined using the rules for scalar estimates of Rubin (1987). Note that these rules do not depend on the procedure used to create the imputations, nor on the missingness mechanism; they should be seen as a way to reflect the uncertainty due to the imputation method in the estimation. In standard notation, the rules are as follows. Let

    Q_hat = complete-data point estimate,
    U_hat = complete-data variance estimate.

Then

    Qbar = m^{-1} Sum_{t=1}^m Q_hat^(t)                   (combined point estimate),
    B = (m - 1)^{-1} Sum_{t=1}^m (Q_hat^(t) - Qbar)^2     (between-imputation variance),
    Ubar = m^{-1} Sum_{t=1}^m U_hat^(t)                   (within-imputation variance),
    T = Ubar + (1 + m^{-1}) B                             (total variance).

The interval estimate is Qbar +/- t_nu sqrt(T), where

    nu = (m - 1) [ 1 + Ubar / ((1 + m^{-1}) B) ]^2.

The degrees of freedom vary from m - 1 to infinity, depending on the relative sizes of Ubar and (1 + m^{-1})B. The relative increase in variance due to nonresponse is estimated by

    r = (1 + m^{-1}) B / Ubar,

and the fraction of missing information is estimated by (r + 2/(nu + 3)) / (r + 1). It is often noted that this estimate can be noisy for small m.

In our application, the complete-data point estimate is Q_hat = ybar = Sum_{i=1}^n y_i / n, and the complete-data variance estimate is U_hat = V_hat_ar(ybar) = ((N - n)/(N - 1)) (s^2 / n); in Q_hat^(t), t denotes the imputation number. Note that these are estimates under SRSWOR.

Question: Should one correct these estimates to reflect the fact that part of the data was imputed from y_obs?

3 Simulation study

3.1 Simulation conditions

This simulation study compares the performance of the following estimators:

    - the design-based estimator of Stanek et al.;
    - multiple imputation.

The two methods, and the notation used, are explained in detail in Section 2.

The simulation assumes a population of N = 100 binary values, from which samples of size n = 20 are repeatedly drawn via simple random sampling without replacement (SRSWOR). Let y_i denote the value of the i-th sampled unit, and let y = (y_1, ..., y_n). The total number of repetitions is 1000, and in each repetition we perform the following:

1. Sampling. Select n = 20 units from N = 100 using SRSWOR.

2. Imposing missing values. Draw the missingness indicator r_i ~ Bernoulli(0.6), i = 1, 2, ..., n. Note that this indicator will be used to set the values of y_i to missing in

the following sense:

    r_i = 1 if y_i is missing, and r_i = 0 if y_i is observed.

Let y_obs and y_mis denote the partitions of y corresponding to the observed and missing parts of y. Then y_obs = y[r == 0].

3. Drawing (re-sampling) imputations from y_obs. In each cycle of the simulation, form multiple imputations, i.e. repeatedly re-sample n_mis = n - n_obs values from y_obs using SRSWOR. This consists of the following steps:

    (a) Sample n_mis values from the n_obs observed values using SRSWOR.
    (b) Calculate the estimates of the mean (Ybar) and its variance (V_hat_ar(Ybar)) using standard SRSWOR formulas.
    (c) Repeat (a) and (b) 10 times, storing the estimates each time.
    (d) Combine the 10 sets of mean and variance estimates.

3.2 Results and next steps

The results (Tables 1 and 2) show consistency between the two estimates with respect to the evaluation criterion, MSE. Note that the column BD (estimates based on the sample before deletion) represents the gold standard that the two approaches try to capture. There is a gap between the MSEs of the two methods and the MSE of the sample mean before deletion. It would be desirable to understand whether this gap is important, and whether the estimates could be improved to close it. It is also important to further understand the differences in the variance estimates between the design-based and MI methods. Surprisingly, the MI method resulted in estimates that were closer to the estimates under BD.

Table 1: Simulation results: mean estimates followed by the MSE, given in parentheses (BD: before deletion; MI: multiple imputation; Ed: Ed's method; all are averages across the simulations).

    Scenario                                    BD               MI               Ed
    1: mu = 0.19, sigma_Ybar = 0.08816    0.9015 (0.0788)  0.1895 (0.0993)  0.1895 (0.0991)
    2: mu = 0.35, sigma_Ybar = 0.1072     0.3489 (0.0312)  0.3502 (0.0389)  0.3504 (0.0389)
    3: mu = 0.57, sigma_Ybar = 0.1113     0.5692 (0.0708)  0.5726 (0.0747)  0.5719 (0.0745)
    4: mu = 0.66, sigma_Ybar = 0.1065     0.6605 (0.0301)  0.6589 (0.0384)  0.6591 (0.0380)
    5: mu = 0.72, sigma_Ybar = 0.1009     0.7227 (0.0285)  0.7238 (0.0352)  0.7233 (0.0354)
    6: mu = 0.80, sigma_Ybar = 0.0899     0.7973 (0.0254)  0.7961 (0.0325)  0.7968 (0.0324)
    7: mu = 0.91, sigma_Ybar = 0.0643     0.9099 (0.0178)  0.9104 (0.0227)  0.9106 (0.0226)

    (sigma_Ybar = sigma / sqrt(n) in each scenario.)

Table 2: Simulation results: variance estimates (BD: before deletion; MI: multiple imputation; Ed: Ed's method; all are averages across the simulations).

    Scenario                                    BD        MI        Ed
    1: mu = 0.19, sigma_Ybar = 0.08816    0.00774   0.00722   0.01015
    2: mu = 0.35, sigma_Ybar = 0.1072     0.01144   0.01083   0.01685
    3: mu = 0.57, sigma_Ybar = 0.1113     0.01235   0.01169   0.02262
    4: mu = 0.66, sigma_Ybar = 0.1065     0.01133   0.01063   0.02369
    5: mu = 0.72, sigma_Ybar = 0.1009     0.01012   0.00954   0.02462
    6: mu = 0.80, sigma_Ybar = 0.0899     0.00817   0.00767   0.02487
    7: mu = 0.91, sigma_Ybar = 0.0643     0.00415   0.00389   0.02424

A second step will be to look at the combined variance of the estimate under MI (column 2). This estimate is based on the following two quantities. The between-imputation variance assesses the variability across the imputations:

    B = (m - 1)^{-1} Sum_{t=1}^m (Q_hat^(t) - Qbar)^2 = (m - 1)^{-1} Sum_{t=1}^m (ybar^(t) - ybar.)^2,

where ybar. is the average of the sample means across the imputations. The second quantity is the within-imputation variance, Ubar = m^{-1} Sum_{t=1}^m U_hat^(t). The total variance is calculated as T = Ubar + (1 + m^{-1}) B (Rubin, 1987). As discussed by Rubin and Schenker (1986), the factor (1 + m^{-1}) reflects the extra variability due to the imputations being based on a finite number of imputations (small m).

It will be important to derive the estimate of this variance from a purely finite-population sampling point of view, in which several processes need to be taken into account: sampling, the missingness mechanism, and imputation. This step is also important for extending the re-sampling-based multiple imputation inference to other sampling schemes such as clustered or stratified designs.

A final step pertains to extending the design-based and MI approaches to multivariate settings. Creating imputations by resampling from y_obs will be somewhat cumbersome under arbitrary missingness patterns, and developing (or adapting existing) sound algorithmic rules (such as matching on propensity scores) would be a potential contribution.

References

Rubin, D.B. (1987), Multiple Imputation for Nonresponse in Surveys, New York: John Wiley & Sons.

Rubin, D.B. and Schenker, N. (1986), "Multiple imputation for interval estimation from simple random samples with ignorable nonresponse," Journal of the American Statistical Association, Vol. 81, No. 394, 366-374.
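Appendix: one repetition of the simulation in Section 3.1, combined with the rules of Section 2.2, can be sketched as follows. This is a sketch under stated assumptions: the population proportion, the random seed, and the fallback to sampling with replacement when n_mis exceeds n_obs (which can happen with a 0.6 missingness rate) are illustrative choices, not part of the original design.

```python
import numpy as np

def rubin_combine(qhats, uhats):
    """Rubin's (1987) combining rules for m scalar estimates."""
    m = len(qhats)
    qbar = np.mean(qhats)                  # combined point estimate
    ubar = np.mean(uhats)                  # within-imputation variance
    b = np.var(qhats, ddof=1)              # between-imputation variance
    t = ubar + (1 + 1 / m) * b             # total variance
    # degrees of freedom; infinite when the imputations happen to agree
    nu = np.inf if b == 0 else (m - 1) * (1 + ubar / ((1 + 1 / m) * b)) ** 2
    return qbar, t, nu

rng = np.random.default_rng(1)             # illustrative seed
N, n, m = 100, 20, 10
pop = rng.binomial(1, 0.35, size=N)        # binary population (e.g. Scenario 2)

# 1. Sampling: select n from N via SRSWOR
sample = rng.choice(pop, size=n, replace=False)

# 2. Imposing missing values: r_i ~ Bernoulli(0.6), 1 = missing (MCAR)
r = rng.binomial(1, 0.6, size=n)
y_obs = sample[r == 0]
n_mis = n - len(y_obs)

# 3. Re-sampling imputations from y_obs via SRSWOR
#    (with replacement only when n_mis > n_obs makes SRSWOR infeasible)
qhats, uhats = [], []
for _ in range(m):
    imp = rng.choice(y_obs, size=n_mis, replace=n_mis > len(y_obs))
    y_comp = np.concatenate([y_obs, imp])       # completed sample of size n
    s2 = y_comp.var(ddof=1)
    qhats.append(y_comp.mean())
    uhats.append((N - n) / (N - 1) * s2 / n)    # SRSWOR variance, as in Sec. 2.2
qbar, t_var, nu = rubin_combine(qhats, uhats)
print(qbar, t_var, nu)
```

The full study would wrap steps 1-3 in a loop over 1000 repetitions and average the resulting point and variance estimates, alongside the design-based estimator, to reproduce Tables 1 and 2.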