Small Area Estimation of Poverty Indicators using Interval Censored Income Data


Small Area Estimation of Poverty Indicators using Interval Censored Income Data
Paul Walter (1), Marcus Groß (1), Timo Schmid (1), Nikos Tzavidis (2)
(1) Chair of Statistics and Econometrics, Freie Universität Berlin
(2) Department of Social Statistics & Demography, University of Southampton
ITACOSM, Bologna, June 2017
Paul Walter 1 (26)

Motivation
In order to fight poverty, it is essential to know its spatial distribution.
Small area estimation (SAE) methods enable the estimation of poverty indicators at geographical levels where direct estimation is either impossible, because the sample size is too small, or very imprecise (Rao & Molina, 2015).
One commonly used SAE method is the empirical best predictor (EBP), which is based on a mixed model (Molina & Rao, 2010).
Estimation becomes imprecise when, for confidentiality or other reasons, the dependent variable in the underlying mixed model, such as income, is censored to particular intervals.

Motivation
To get more precise estimates, two methodologies are introduced: one based on the expectation maximization (EM) algorithm (Dempster et al., 1977; Stewart, 1983) and one based on the stochastic expectation maximization (SEM) algorithm (Celeux & Diebolt, 1985).
How do the proposed methods help to improve the precision of small area prediction when the dependent variable is censored to particular intervals?

Outline
I
Under Normality
The EBP Approach
The EM- and SEM-Algorithm
MSE Estimation using Interval Censored Data
Simulation Results
II

The EBP Approach (1)
Nested error linear regression model:
$$y_{ij} = x_{ij}^T \beta + u_i + e_{ij}, \quad j = 1, \dots, n_i, \quad i = 1, \dots, D, \qquad (1)$$
$u_i \overset{iid}{\sim} N(0, \sigma_u^2)$, the random area-specific effects,
$e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2)$, the unit-level error terms,
where $y_{ij}$ is unknown and only observed to fall into a certain interval $(A_{k-1}, A_k)$ on a continuous scale, and $k_{ij}$ ($1 \le k_{ij} \le K$) indicates which of the intervals $y_{ij}$ falls into.

The EBP Approach (2)
Nested error linear regression model:
$$y_{ij} = x_{ij}^T \beta + u_i + e_{ij}, \quad j = 1, \dots, n_i, \quad i = 1, \dots, D.$$
Use the sample data to estimate $\beta$, $\sigma_u$, $\sigma_e$ and $u_i$ with the EM- or SEM-algorithm.
Micro-simulate a synthetic population: generate a synthetic population under the model a large number of times, each time estimating the target parameter.
Linear and non-linear poverty indicators can be computed.
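As an illustration of the last point, the indicators reported later in the simulation (Mean, Gini, HCR) can be computed area by area on one synthetic population. The following is a minimal sketch, not the authors' code: the function names are hypothetical, HCR is read as the head count ratio (share of units below a poverty line), and the poverty line itself is an input that the slides do not specify.

```python
import numpy as np

def head_count_ratio(income, poverty_line):
    """Share of units whose income lies strictly below the poverty line."""
    income = np.asarray(income, dtype=float)
    return float(np.mean(income < poverty_line))

def gini(income):
    """Gini coefficient from sorted incomes (assumes a positive total income)."""
    y = np.sort(np.asarray(income, dtype=float))
    n = y.size
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * y) / (n * np.sum(y)) - (n + 1) / n)

def area_indicators(y, area, poverty_line):
    """Mean, HCR and Gini per area for one synthetic population."""
    y = np.asarray(y, dtype=float)
    area = np.asarray(area)
    result = {}
    for i in np.unique(area):
        y_i = y[area == i]
        result[int(i)] = {
            "mean": float(y_i.mean()),
            "hcr": head_count_ratio(y_i, poverty_line),
            "gini": gini(y_i),
        }
    return result
```

In an EBP-style micro-simulation, these functions would be applied to each generated population and the results averaged over the replications. Note that the Gini coefficient is not well behaved when incomes can be negative, which a normal model does not rule out.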

Methodology
Reconstructing the distribution of the unknown $y_{ij}$ is necessary to estimate the parameters of model (1). From Bayes' theorem it follows that
$$f(y_{ij} \mid x_{ij}, k_{ij}, u_i) \propto f(k_{ij} \mid y_{ij}, x_{ij}, u_i) \, f(y_{ij} \mid x_{ij}, u_i)$$
with
$$f(k_{ij} \mid y_{ij}, x_{ij}, u_i) = \begin{cases} 1, & \text{if } A_{k-1} \le y_{ij} \le A_k \\ 0, & \text{else} \end{cases}$$
and, from the assumption implied by model (1),
$$f(y_{ij} \mid x_{ij}, u_i) \sim N(x_{ij}^T \beta + u_i, \sigma_e^2).$$
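Written out with its normalizing constant, the product of the two factors is a normal density truncated to the observed interval. The shorthand $\mu_{ij}$ is introduced here for readability and is not notation from the slides; $Z_k$ is the standardized interval bound.

```latex
\mu_{ij} = x_{ij}^T \beta + u_i, \qquad
Z_k = \frac{A_k - \mu_{ij}}{\sigma_e},
\qquad
f(y_{ij} \mid x_{ij}, k_{ij}, u_i) =
\begin{cases}
\dfrac{\frac{1}{\sigma_e}\,\phi\!\left(\frac{y_{ij} - \mu_{ij}}{\sigma_e}\right)}
      {\Phi(Z_k) - \Phi(Z_{k-1})}, & A_{k-1} \le y_{ij} \le A_k, \\[2ex]
0, & \text{else.}
\end{cases}
```

The denominator is the probability that a draw from $N(\mu_{ij}, \sigma_e^2)$ lands in the interval, which is what makes the truncated density integrate to one.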

Estimation and Computational Details (SEM)
1. Estimate $\hat\theta = (\hat\beta, \hat u_i, \hat\sigma_e^2, \hat\sigma_u^2)$ from model (1) using the midpoints of the intervals as a substitute for the unknown $y_{ij}$.
2. Sample from the conditional distribution $f(y_{ij} \mid x_{ij}, u_i)$ by drawing randomly from $N(x_{ij}^T \hat\beta + \hat u_i, \hat\sigma_e^2)$ within the given interval $A_{k-1} \le y_{ij} \le A_k$. Obtain $(\tilde y_{ij}, x_{ij})$ for $j = 1, \dots, n_i$ and $i = 1, \dots, D$.
3. Re-estimate $\hat\theta$ from model (1) using the pseudo sample $(\tilde y_{ij}, x_{ij})$ obtained in step 2.
4. Iterate steps 2-3 $B + M$ times, with $B$ burn-in iterations and $M$ additional iterations.
5. Discard the burn-in iterations and estimate $\hat\theta$ by averaging the $M$ remaining estimates.
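Step 2 amounts to drawing from a normal distribution truncated to the observed interval. One common way to do this is inverse-CDF sampling, sketched below with the Python standard library only. This is an illustrative technique, not necessarily the sampler used by the authors; the function name and the clamping constant are assumptions, and the mixed-model re-estimation of step 3 is not shown.

```python
import random
from math import inf
from statistics import NormalDist

def draw_truncated_normal(mu, sigma, lower, upper, rng):
    """Draw one value from N(mu, sigma^2) restricted to [lower, upper].

    Inverse-CDF method: draw a probability uniformly between the CDF
    values of the two bounds and map it back through the quantile function.
    """
    dist = NormalDist(mu, sigma)
    p_lower = dist.cdf(lower)   # 0.0 when lower is -inf
    p_upper = dist.cdf(upper)   # 1.0 when upper is +inf
    p = rng.uniform(p_lower, p_upper)
    # inv_cdf requires 0 < p < 1, so keep p away from the exact endpoints.
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return dist.inv_cdf(p)

# Usage: one pseudo value for a unit with linear predictor 4500 and
# sigma_e = 1000 observed in [3000, 4000), and one for the open top interval.
rng = random.Random(42)
y_tilde = draw_truncated_normal(4500.0, 1000.0, 3000.0, 4000.0, rng)
y_top = draw_truncated_normal(4500.0, 1000.0, 7500.0, inf, rng)
```

For intervals far out in the tail of the distribution, the two CDF values become nearly equal and this approach loses precision; a dedicated truncated-normal sampler would be the safer choice there.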

Estimation and Computational Details (EM)
1. Estimate $\hat\theta = (\hat\beta, \hat u_i, \hat\sigma_e^2, \hat\sigma_u^2)$ from model (1) using the midpoints of the intervals as a substitute for the unknown $y_{ij}$.
2. Estimate $E[I(A_{k-1} \le y_{ij} \le A_k) \, f(y_{ij} \mid x_{ij}, u_i)]$, the expected value of a two-sided truncated normally distributed variable, as pseudo $\tilde y_{ij}$:
$$\tilde y_{ij} = E[I(A_{k-1} \le y_{ij} \le A_k) \, f(y_{ij} \mid x_{ij}, u_i)] = (x_{ij}^T \hat\beta + \hat u_i) + \hat\sigma_e \, \frac{\phi(Z_{k-1}) - \phi(Z_k)}{\Phi(Z_k) - \Phi(Z_{k-1})},$$
and obtain $(\tilde y_{ij}, x_{ij})$ for $j = 1, \dots, n_i$ and $i = 1, \dots, D$. The conditional variance is given by the variance of a two-sided truncated normally distributed variable:
$$\operatorname{Var}(\tilde y_{ij} \mid x_{ij}, k_{ij}, u_i) = \hat\sigma_e^2 \underbrace{\left\{ 1 + \left[ \frac{Z_{k-1}\,\phi(Z_{k-1}) - Z_k\,\phi(Z_k)}{\Phi(Z_k) - \Phi(Z_{k-1})} \right] - \left[ \frac{\phi(Z_{k-1}) - \phi(Z_k)}{\Phi(Z_k) - \Phi(Z_{k-1})} \right]^2 \right\}}_{:= s_{ij}}$$
with $Z_k = (A_k - (x_{ij}^T \hat\beta + \hat u_i)) / \hat\sigma_e$.

Estimation and Computational Details (EM)
3. Re-estimate $\hat\theta$ from model (1) using the pseudo sample $(\tilde y_{ij}, x_{ij})$ obtained in step 2. The variance $\hat\sigma_e^2$ is given by:
$$\hat\sigma_e^2 = \frac{\sum_{i=1}^{D} \sum_{j=1}^{n_i} \left( \tilde y_{ij} - (x_{ij}^T \hat\beta + \hat u_i) \right)^2}{\sum_{i=1}^{D} \sum_{j=1}^{n_i} (1 - s_{ij})}$$
4. Iterate steps 2-3 until convergence.
5. Obtain $\hat\theta$ from the last iteration step.
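The EM pseudo values and the variance update can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, `s` is taken to be the factor multiplying $\hat\sigma_e^2$ in the truncated-normal variance, and the re-estimation of $\hat\beta$, $\hat u_i$ and $\hat\sigma_u^2$ by a mixed model is left out.

```python
from math import inf, isinf
from statistics import NormalDist

STD_NORMAL = NormalDist()

def _phi(z):
    """Standard normal density, with value 0 at infinite bounds."""
    return 0.0 if isinf(z) else STD_NORMAL.pdf(z)

def _z_phi(z):
    """z * phi(z), using the limit 0 at infinite bounds."""
    return 0.0 if isinf(z) else z * STD_NORMAL.pdf(z)

def pseudo_value(mu, sigma, lower, upper):
    """Mean and variance factor of N(mu, sigma^2) truncated to [lower, upper].

    Returns (y_tilde, s), where sigma**2 * s is the conditional variance.
    """
    z_lower = (lower - mu) / sigma
    z_upper = (upper - mu) / sigma
    mass = STD_NORMAL.cdf(z_upper) - STD_NORMAL.cdf(z_lower)
    shift = (_phi(z_lower) - _phi(z_upper)) / mass
    spread = (_z_phi(z_lower) - _z_phi(z_upper)) / mass
    y_tilde = mu + sigma * shift
    s = 1.0 + spread - shift ** 2
    return y_tilde, s

def update_sigma2_e(y_tilde, fitted, s):
    """Variance update: squared pseudo residuals over the sum of (1 - s_ij)."""
    numerator = sum((y - f) ** 2 for y, f in zip(y_tilde, fitted))
    denominator = sum(1.0 - s_ij for s_ij in s)
    return numerator / denominator

# Usage: a unit with linear predictor 4500, sigma_e = 1000, observed in [3000, 4000).
y_example, s_example = pseudo_value(4500.0, 1000.0, 3000.0, 4000.0)
```

A quick sanity check on the formulas: for a standard normal truncated to $[0, \infty)$, the mean is $\sqrt{2/\pi}$ and the variance factor is $1 - 2/\pi$. As with the sampling step, intervals with almost no probability mass make the division by `mass` numerically unstable.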

MSE Estimation
1. Use the sample estimates $\hat\theta = (\hat\beta, \hat\sigma_u^2, \hat\sigma_e^2)$ obtained by the EM- or SEM-algorithm to generate $u_i^{(b)} \overset{iid}{\sim} N(0, \hat\sigma_u^2)$ and $e_{ij}^{(b)} \overset{iid}{\sim} N(0, \hat\sigma_e^2)$ and to simulate a bootstrap superpopulation model $y_{ij}^{(b)} = x_{ij}^T \hat\beta + u_i^{(b)} + e_{ij}^{(b)}$.
2. Estimate the population indicator $I_{i,b}$ using $y_{ij}^{(b)}$.
3. Extract a bootstrap sample from $y_{ij}^{(b)}$, group it according to the $K$ intervals $(A_{k-1}, A_k)$ and apply the EBP method using only the interval information, treating $y_{ij}^{(b)}$ as unknown.
4. Obtain $\hat I_{i,b}^{EBP}$.
5. Iterate steps 1-4 for $b = 1, \dots, B$.
The MSE estimate for each area $i$ is given by:
$$\widehat{MSE}(\hat I_i^{EBP}) = B^{-1} \sum_{b=1}^{B} \left( \hat I_{i,b}^{EBP} - I_{i,b} \right)^2. \qquad (2)$$
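The bootstrap loop can be sketched as below. This is only a skeleton under assumptions: the EBP step is passed in as a hypothetical callable `estimate_from_intervals` (the slides do not give code for it), `indicator` is any function of an area's incomes, the same sampled units are reused in every replicate, and `edges` holds the interior interval boundaries.

```python
import numpy as np

def bootstrap_mse(x, area, beta, sigma_u, sigma_e, edges, sample_idx,
                  indicator, estimate_from_intervals, n_boot=50, seed=0):
    """Parametric bootstrap MSE per area, following steps 1-5.

    x: (N, p) population design matrix; area: (N,) integer labels 0..D-1;
    sample_idx: indices of the sampled units (kept fixed across replicates).
    """
    rng = np.random.default_rng(seed)
    n_areas = int(area.max()) + 1
    squared_error = np.zeros(n_areas)
    for _ in range(n_boot):
        # Step 1: bootstrap superpopulation under the fitted model.
        u_b = rng.normal(0.0, sigma_u, n_areas)
        e_b = rng.normal(0.0, sigma_e, x.shape[0])
        y_b = x @ beta + u_b[area] + e_b
        # Step 2: "true" indicator per area in this bootstrap population.
        truth = np.array([indicator(y_b[area == i]) for i in range(n_areas)])
        # Step 3: keep only the interval index of the sampled units.
        k_b = np.digitize(y_b[sample_idx], edges)
        # Step 4: estimate the indicator from interval information only.
        estimate = estimate_from_intervals(k_b, x[sample_idx], area[sample_idx])
        squared_error += (estimate - truth) ** 2
    # Step 5 and equation (2): average squared error over the replicates.
    return squared_error / n_boot
```

Passing the estimator as a callable keeps the loop independent of how the model is fitted, so the same skeleton could be used with an EM-based or an SEM-based EBP.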

Model-based Simulation: Normal Scenarios
Finite population $U$ of size $N = 10000$, partitioned into $D = 50$ regions $U_1, U_2, \dots, U_D$ of sizes $N_i = 200$.
Consider an unbalanced design with sample sizes $8 \le n_i \le 29$, leading to a total sample size of $\sum_{i=1}^{D} n_i = 921$.
The following super-population model is used to simulate $M = 100$ Monte Carlo populations:
$$y_{ij} = 4500 - 400 x_{ij} + u_i + e_{ij}, \quad x_{ij} \sim N(\mu_i, 3), \quad \mu_i \sim U[-3, 3],$$
$$u_i \overset{iid}{\sim} N(0, 500^2), \quad e_{ij} \overset{iid}{\sim} N(0, 1000^2), \quad j = 1, \dots, n_i, \quad i = 1, \dots, D.$$
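One Monte Carlo population under this setup could be generated as in the sketch below. Several details are assumptions rather than facts from the slides: the coefficient on $x_{ij}$ is taken as $-400$ (the sign is lost in the transcription), the 3 in $N(\mu_i, 3)$ is read as a variance, and the area sample sizes are drawn at random between 8 and 29 instead of reproducing the fixed design with 921 units.

```python
import numpy as np

def simulate_population(n_areas=50, area_size=200, seed=1):
    """Generate one population from the super-population model."""
    rng = np.random.default_rng(seed)
    area = np.repeat(np.arange(n_areas), area_size)
    mu = rng.uniform(-3.0, 3.0, n_areas)          # area-specific covariate means
    x = rng.normal(mu[area], np.sqrt(3.0))        # assumption: 3 is a variance
    u = rng.normal(0.0, 500.0, n_areas)           # random area effects
    e = rng.normal(0.0, 1000.0, area.size)        # unit-level errors
    y = 4500.0 - 400.0 * x + u[area] + e          # assumption: negative slope
    return area, x, y

def draw_sample(area, n_areas, rng, n_min=8, n_max=29):
    """Unbalanced sample: n_i units per area, n_i drawn between n_min and n_max."""
    sizes = rng.integers(n_min, n_max + 1, n_areas)
    picks = [rng.choice(np.flatnonzero(area == i), size=sizes[i], replace=False)
             for i in range(n_areas)]
    return np.concatenate(picks)

# Usage: one population and one sample drawn from it.
area, x, y = simulate_population()
sample_idx = draw_sample(area, 50, np.random.default_rng(2))
```

Repeating `simulate_population` with 100 different seeds would give the set of Monte Carlo populations described on the slide.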

Model-based Simulation: Normal Scenarios
The following methods are applied for parameter estimation of model (1):
LME - Estimate the model parameters with the true $y_{ij}$, to evaluate the performance of the estimation methods that rely on the interval censored $y_{ij}$.
EM - Estimation based on the generated pseudo $\tilde y_{ij}$.
SEM - Estimation based on the drawn pseudo $\tilde y_{ij}$, with 40 burn-ins and 200 iterations.

Model-based Simulation: Normal Scenarios
Income distribution for two arbitrarily chosen populations:

Normal scenario 1: 7 intervals
Interval        Frequencies
[1, 2000)       970
[2000, 3000)    1367
[3000, 4000)    2063
[4000, 5000)    2266
[5000, 6000)    1767
[6000, 7500)    1265
[7500, Inf)     302

Normal scenario 2: 4 intervals
Interval        Frequencies
[1, 3000)       2337
[3000, 5000)    4329
[5000, 7500)    3032
[7500, Inf)     302
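Grouping a continuous income into these intervals and counting the frequencies can be done as in the following sketch. The constant and function names are hypothetical, and the lower bound of 1 is not enforced here: any value below the first boundary is assigned to the lowest interval.

```python
import numpy as np

# Interior boundaries of the two scenarios; the last interval is open-ended.
EDGES_SCENARIO_1 = np.array([2000.0, 3000.0, 4000.0, 5000.0, 6000.0, 7500.0])
EDGES_SCENARIO_2 = np.array([3000.0, 5000.0, 7500.0])

def group_income(y, edges):
    """Return the interval index of each income and the interval frequencies.

    Index 0 is the lowest interval; intervals are closed on the left and
    open on the right, matching the [a, b) notation of the tables.
    """
    k = np.digitize(np.asarray(y, dtype=float), edges)
    counts = np.bincount(k, minlength=len(edges) + 1)
    return k, counts

# Usage on a few made-up incomes.
k_example, counts_example = group_income(
    [1500.0, 2000.0, 2999.0, 7500.0, 9000.0], EDGES_SCENARIO_1)
```

After this step only `k` is available to the estimation methods, which is exactly the information loss the EM- and SEM-algorithm try to compensate for.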

Model-based Simulation: Quality Measures
To evaluate the performance of the EBPs, the root mean squared error (RMSE) of any parameter estimate $\hat I^{EBP}$ is estimated in each area $i$:
$$RMSE\left( \hat I_i^{EBP} \right) = \left[ \frac{1}{M} \sum_{m=1}^{M} \left( \hat I_i^{EBP(m)} - I_i^{(m)} \right)^2 \right]^{1/2}, \qquad (3)$$
where $M$ corresponds to the number of Monte Carlo populations.
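Equation (3) translates directly into a few lines of array code. The sketch assumes the estimates and true values are stored with one row per Monte Carlo population and one column per area; the function name is hypothetical.

```python
import numpy as np

def rmse_by_area(estimates, truths):
    """Empirical RMSE per area over M Monte Carlo populations.

    estimates, truths: array-likes of shape (M, D).
    """
    estimates = np.asarray(estimates, dtype=float)
    truths = np.asarray(truths, dtype=float)
    return np.sqrt(np.mean((estimates - truths) ** 2, axis=0))

# Usage with M = 2 populations and D = 2 areas (made-up numbers).
rmse_example = rmse_by_area([[1.0, 2.0], [3.0, 4.0]], [[0.0, 2.0], [0.0, 4.0]])
```

Averaging along `axis=0` averages over populations, so the result has one entry per area.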

Under Normality
Model-based Simulation: Normal Scenario 1
[Figure: three panels (Mean, Gini, HCR), each showing the methods LME, LMEBox, EM, EMBox, SEM and SEMBox]

Method   LME        LMEBox     EM         EMBox      SEM        SEMBox
Mean     206.5263   208.8188   214.1630   211.4700   215.1445   214.6016
Gini     0.0132     0.0133     0.0155     0.0156     0.0141     0.0144
HCR      0.0349     0.0344     0.0373     0.0373     0.0359     0.0362

Model-based Simulation: Normal Scenario 1
[Figure: three panels, Mean (LME), Mean (EM) and Mean (SEM); x-axis: Domain; legend "Type": Empirical, Estimated]

Model-based Simulation: Normal Scenario 1
[Figure: three panels, Gini (LME), Gini (EM) and Gini (SEM); x-axis: Domain; legend "Type": Empirical, Estimated]

Model-based Simulation: Normal Scenario 1
Density plot of $\hat y$ from a particular simulation run:
[Figure: density curves by method (EM_Prediction, LME_Prediction, SEM_Prediction); x-axis: Y; y-axis: Density]

Model-based Simulation: Normal Scenario 2
[Figure: three panels (Mean, Gini, HCR), each showing the methods LME, LMEBox, EM, EMBox, SEM and SEMBox]

Method   LME        LMEBox     EM         EMBox      SEM        SEMBox
Mean     205.0556   206.2724   247.1753   259.9144   256.3309   254.5574
Gini     0.0131     0.0130     0.0166     0.0244     0.0156     0.0169
HCR      0.0355     0.0343     0.0404     0.0488     0.0392     0.0401

Model-based Simulation: Normal Scenario 2
[Figure: three panels, Mean (LME), Mean (EM) and Mean (SEM); x-axis: Domain; legend "Type": Empirical, Estimated]

Model-based Simulation: Normal Scenario 2
[Figure: three panels, Gini (LME), Gini (EM) and Gini (SEM); x-axis: Domain; legend "Type": Empirical, Estimated]

Model-based Simulation: Normal Scenario 2
Density plot of $\hat y$ from a particular simulation run:
[Figure: density curves by method (EM_Prediction, LME_Prediction, SEM_Prediction); x-axis: Y; y-axis: Density]

Previous research has shown that whenever the dependent variable is censored to certain intervals, the EM- and SEM-algorithm outperform naive estimation procedures (regression on the midpoints of the intervals) and direct estimation in terms of the accuracy of the EBPs.
Simulation results show that the accuracy loss in the EBPs when using the SEM- or EM-algorithm, compared to the use of uncensored data, strongly depends on the number of intervals.
The performance of the SEM- and EM-algorithm is quite similar in most scenarios.

Since the EM- and SEM-algorithm rely strongly on the Gaussian assumption of the error terms, which cannot be tested accurately whenever the dependent variable is grouped, two transformations are incorporated into the algorithms to handle departures from normality.
The SEM-algorithm under Box-Cox transformation outperforms the EM-algorithm under transformation in most scenarios and also performs well in the model-based normal scenarios.
The SEM-algorithm under Box-Cox transformation is suggested as the preferred estimation procedure.

Bibliography
[1] Rao, J.N.K. & Molina, I. (2015), Small area estimation. John Wiley & Sons.
[3] Molina, I. & Rao, J.N.K. (2010), Small area estimation of poverty indicators. Canadian Journal of Statistics, 38(3):369-385.
[4] Celeux, G. & Diebolt, J. (1985), The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2:73-82.
[5] Stewart, M. B. (1983), On least squares estimation when the dependent variable is grouped. Review of Economic Studies, 50(4):737-753.
[6] Dempster, A., Laird, N. & Rubin, D. (1977), Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1-38.
[7] González-Manteiga, W., Lombardía, M. J., Molina, I., Morales, D. & Santamaría, L. (2008), Analytic and bootstrap approximations of prediction errors under a multivariate Fay-Herriot model. Computational Statistics & Data Analysis, 52(12):5242-5252.