Small Area Estimation of Poverty Indicators using Interval Censored Income Data

Paul Walter¹, Marcus Groß¹, Timo Schmid¹, Nikos Tzavidis²
¹ Chair of Statistics and Econometrics, Freie Universität Berlin
² Department of Social Statistics & Demography, University of Southampton
ITACOSM, Bologna, June 2017
Paul Walter 1 (26)

Motivation
In order to fight poverty, it is essential to know its spatial distribution.
Small area estimation (SAE) methods enable the estimation of poverty indicators at a geographical level where direct estimation is either not possible, owing to insufficient sample size, or very imprecise (Rao & Molina, 2015).
One commonly used SAE method is the empirical best predictor (EBP), which is based on a mixed model (Molina & Rao, 2010).
Estimation becomes imprecise when, for confidentiality or other reasons, the dependent variable in the underlying mixed model, such as income, is censored to particular intervals.

Motivation
To obtain more precise estimates, two methodologies are introduced: one based on the expectation maximization (EM) algorithm (Dempster et al., 1977; Stewart, 1983) and one based on the stochastic expectation maximization (SEM) algorithm (Celeux & Diebolt, 1985).
How do the proposed methods improve the precision of small area prediction when the dependent variable is censored to particular intervals?

Outline
I Under Normality
    The EBP Approach
    The EM- and SEM-Algorithm
    MSE Estimation using Interval Censored Data
    Simulation Results
II

The EBP Approach (1)
Nested error linear regression model:
$$y_{ij} = x_{ij}^T \beta + u_i + e_{ij}, \quad j = 1, \dots, n_i, \quad i = 1, \dots, D, \tag{1}$$
$$u_i \overset{iid}{\sim} N(0, \sigma_u^2) \ \text{(random area-specific effects)}, \qquad e_{ij} \overset{iid}{\sim} N(0, \sigma_e^2) \ \text{(unit-level error terms)},$$
where $y_{ij}$ is unknown and only observed to fall into a certain interval $(A_{k-1}, A_k)$ on a continuous scale, and $k_{ij}$ ($1 \le k_{ij} \le K$) indicates which of the intervals $y_{ij}$ falls into.

The EBP Approach (2)
Nested error linear regression model:
$$y_{ij} = x_{ij}^T \beta + u_i + e_{ij}, \quad j = 1, \dots, n_i, \quad i = 1, \dots, D.$$
Use the sample data to estimate $\beta$, $\sigma_u$, $\sigma_e$ and $u_i$ with the EM- or SEM-algorithm.
Micro-simulating a synthetic population: generate a synthetic population under the model a large number of times, each time estimating the target parameter.
Linear and non-linear poverty indicators can be computed.
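To make the last point concrete, here is a minimal sketch of how three indicators (the mean, the head count ratio and the Gini coefficient, which appear in the simulation results) could be computed from the simulated incomes of one area. The function names and the `poverty_line` argument are hypothetical; the slides do not state which poverty threshold is used.

```python
from statistics import fmean


def head_count_ratio(incomes, poverty_line):
    """Share of units whose income lies strictly below the poverty line."""
    return sum(1 for y in incomes if y < poverty_line) / len(incomes)


def gini(incomes):
    """Gini coefficient via the sorted-rank formula (non-negative incomes)."""
    ys = sorted(incomes)
    n = len(ys)
    total = sum(ys)
    weighted = sum(rank * y for rank, y in enumerate(ys, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def area_indicators(incomes, poverty_line):
    """Indicators for one area of one synthetic population."""
    return {
        "mean": fmean(incomes),
        "hcr": head_count_ratio(incomes, poverty_line),
        "gini": gini(incomes),
    }
```

In an EBP-style micro-simulation, a function like `area_indicators` would be applied to every generated population and the results averaged per area.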

Methodology
Reconstructing the distribution of the unknown $y_{ij}$ is necessary to estimate the parameters of model (1). From Bayes' theorem it follows that
$$f(y_{ij} \mid x_{ij}, k_{ij}, u_i) \propto f(k_{ij} \mid y_{ij}, x_{ij}, u_i)\, f(y_{ij} \mid x_{ij}, u_i),$$
with
$$f(k_{ij} \mid y_{ij}, x_{ij}, u_i) = \begin{cases} 1, & \text{if } A_{k-1} \le y_{ij} \le A_k, \\ 0, & \text{else,} \end{cases}$$
and, from the assumption implied by model (1),
$$f(y_{ij} \mid x_{ij}, u_i) \sim N(x_{ij}^T \beta + u_i, \sigma_e^2).$$

Estimation and Computational Details (SEM)
1. Estimate $\hat\theta = (\hat\beta, \hat u_i, \hat\sigma_e^2, \hat\sigma_u^2)$ from model (1) using the midpoints of the intervals as a substitute for the unknown $y_{ij}$.
2. Sample from the conditional distribution $f(y_{ij} \mid x_{ij}, k_{ij}, u_i)$ by drawing randomly from $N(x_{ij}^T \hat\beta + \hat u_i, \hat\sigma_e^2)$ within the given interval $A_{k-1} \le y_{ij} \le A_k$. Obtain $(\tilde y_{ij}, x_{ij})$ for $j = 1, \dots, n_i$ and $i = 1, \dots, D$.
3. Re-estimate $\hat\theta$ from model (1) using the pseudo sample $(\tilde y_{ij}, x_{ij})$ obtained in step 2.
4. Iterate steps 2 and 3 $B + M$ times, with $B$ burn-in iterations and $M$ additional iterations.
5. Discard the burn-in iterations and estimate $\hat\theta$ by averaging the $M$ remaining estimates.
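As an illustration of step 2 only, the following sketch draws one pseudo value from a normal distribution restricted to an interval by inverse-CDF sampling. This is one common way to sample a truncated normal and not necessarily what the authors implemented; the function name and arguments are hypothetical, and the mixed-model re-estimation of step 3 is not shown.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def draw_truncated_normal(mean, sd, lower, upper, rng):
    """Draw from N(mean, sd^2) restricted to [lower, upper] via the inverse CDF."""
    # Probability mass to the left of each interval bound.
    p_lo = _STD_NORMAL.cdf((lower - mean) / sd)
    p_hi = _STD_NORMAL.cdf((upper - mean) / sd)
    # A uniform draw between the two CDF values maps back into the interval.
    u = rng.uniform(p_lo, p_hi)
    # inv_cdf needs 0 < u < 1, so keep u strictly inside the unit interval.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    value = mean + sd * _STD_NORMAL.inv_cdf(u)
    # Guard against rounding that could push the draw just outside the bounds.
    return min(max(value, lower), upper)


# Example: linear predictor 4500, residual sd 1000, observed interval [3000, 4000).
rng = random.Random(1)
pseudo_y = draw_truncated_normal(4500.0, 1000.0, 3000.0, 4000.0, rng)
```

In the SEM iteration, `mean` would be the current value of $x_{ij}^T \hat\beta + \hat u_i$ and `sd` the current $\hat\sigma_e$; an open-ended top interval can be passed as `upper=float("inf")`.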

Estimation and Computational Details (EM)
1. Estimate $\hat\theta = (\hat\beta, \hat u_i, \hat\sigma_e^2, \hat\sigma_u^2)$ from model (1) using the midpoints of the intervals as a substitute for the unknown $y_{ij}$.
2. Compute the conditional expectation of $y_{ij}$ given its interval, i.e. the expected value of a two-sided truncated normal variable, as pseudo $\tilde y_{ij}$:
$$\tilde y_{ij} = E[y_{ij} \mid x_{ij}, k_{ij}, u_i] = (x_{ij}^T \hat\beta + \hat u_i) + \hat\sigma_e \, \frac{\phi(Z_{k-1}) - \phi(Z_k)}{\Phi(Z_k) - \Phi(Z_{k-1})},$$
and obtain $(\tilde y_{ij}, x_{ij})$ for $j = 1, \dots, n_i$ and $i = 1, \dots, D$.
The conditional variance is the variance of a two-sided truncated normal variable:
$$\mathrm{Var}(y_{ij} \mid x_{ij}, k_{ij}, u_i) = \hat\sigma_e^2 \underbrace{\left\{ 1 + \left[ \frac{Z_{k-1}\,\phi(Z_{k-1}) - Z_k\,\phi(Z_k)}{\Phi(Z_k) - \Phi(Z_{k-1})} \right] - \left[ \frac{\phi(Z_{k-1}) - \phi(Z_k)}{\Phi(Z_k) - \Phi(Z_{k-1})} \right]^2 \right\}}_{:= s_{ij}},$$
with $Z_k = (A_k - (x_{ij}^T \hat\beta + \hat u_i)) / \hat\sigma_e$.
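The quantities in step 2 can be sketched numerically using the standard moments of a doubly truncated normal distribution. The function name below is hypothetical; `mu` stands for the linear predictor $x_{ij}^T \hat\beta + \hat u_i$, `sigma` for $\hat\sigma_e$, and the returned `s` is the variance ratio $s_{ij}$.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def truncated_normal_moments(mu, sigma, lower, upper):
    """Mean, variance and variance ratio s of N(mu, sigma^2) truncated to [lower, upper]."""
    z_lo = (lower - mu) / sigma
    z_hi = (upper - mu) / sigma
    # phi(z) and z * phi(z) both tend to 0 as z goes to +/- infinity.
    pdf_lo = _STD_NORMAL.pdf(z_lo) if math.isfinite(z_lo) else 0.0
    pdf_hi = _STD_NORMAL.pdf(z_hi) if math.isfinite(z_hi) else 0.0
    zpdf_lo = z_lo * pdf_lo if math.isfinite(z_lo) else 0.0
    zpdf_hi = z_hi * pdf_hi if math.isfinite(z_hi) else 0.0
    # Probability mass of the interval under the untruncated distribution.
    mass = _STD_NORMAL.cdf(z_hi) - _STD_NORMAL.cdf(z_lo)
    ratio = (pdf_lo - pdf_hi) / mass
    s = 1.0 + (zpdf_lo - zpdf_hi) / mass - ratio ** 2
    return mu + sigma * ratio, sigma ** 2 * s, s
```

For an interval that is symmetric around `mu` the pseudo value equals `mu`, and `s` is always below 1 because truncation reduces the variance. Intervals far out in the tails make `mass` very small, so a production implementation would need a numerically safer formulation.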

Estimation and Computational Details (EM)
3. Re-estimate $\hat\theta$ from model (1) using the pseudo sample $(\tilde y_{ij}, x_{ij})$ obtained in step 2. The variance $\hat\sigma_e^2$ is given by:
$$\hat\sigma_e^2 = \frac{\sum_{i=1}^{D} \sum_{j=1}^{n_i} \left( \tilde y_{ij} - (x_{ij}^T \hat\beta + \hat u_i) \right)^2}{\sum_{i=1}^{D} \sum_{j=1}^{n_i} (1 - s_{ij})}$$
4. Iterate steps 2 and 3 until convergence.
5. Obtain $\hat\theta$ from the last iteration step.
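The variance update in step 3 can be written as a small helper. This is only a sketch with hypothetical names: `pseudo_y` holds the $\tilde y_{ij}$, `fitted` the values $x_{ij}^T \hat\beta + \hat u_i$, and `s` the variance ratios $s_{ij}$, all as flat lists over every sampled unit. Dividing by the sum of $(1 - s_{ij})$ instead of the sample size compensates for the within-interval variance that the pseudo values, being conditional means, do not carry.

```python
def update_sigma2_e(pseudo_y, fitted, s):
    """EM update of the unit-level error variance from pseudo observations."""
    # Squared residuals of the pseudo values around the fitted linear predictor.
    numerator = sum((y - f) ** 2 for y, f in zip(pseudo_y, fitted))
    # Each unit contributes (1 - s_ij) instead of 1 to the denominator.
    denominator = sum(1.0 - s_ij for s_ij in s)
    return numerator / denominator
```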

MSE Estimation
1. Use the sample estimates $\hat\theta = (\hat\beta, \hat\sigma_u^2, \hat\sigma_e^2)$ obtained by the EM- or SEM-algorithm to generate $u_i^{(b)} \overset{iid}{\sim} N(0, \hat\sigma_u^2)$ and $e_{ij}^{(b)} \overset{iid}{\sim} N(0, \hat\sigma_e^2)$, and to simulate a bootstrap superpopulation model $y_{ij}^{(b)} = x_{ij}^T \hat\beta + u_i^{(b)} + e_{ij}^{(b)}$.
2. Estimate the population indicator $I_{i,b}$ using $y_{ij}^{(b)}$.
3. Extract a bootstrap sample from $y_{ij}^{(b)}$, group it according to the $K$ intervals $(A_{k-1}, A_k)$ and apply the EBP method using only the interval information, treating $y_{ij}^{(b)}$ as unknown.
4. Obtain $\hat I_{i,b}^{EBP}$.
5. Iterate steps 1 to 4 for $b = 1, \dots, B$. The MSE estimate for each area $i$ is given by:
$$\mathrm{MSE}(\hat I_i^{EBP}) = B^{-1} \sum_{b=1}^{B} \left( \hat I_{i,b}^{EBP} - I_{i,b} \right)^2. \tag{2}$$

Model-based Simulation: Normal Scenarios
Finite population $U$ of size $N = 10000$, partitioned into $D = 50$ regions $U_1, U_2, \dots, U_D$ of sizes $N_i = 200$.
Consider an unbalanced design with sample sizes $8 \le n_i \le 29$, leading to a total sample size of $\sum_{i=1}^{D} n_i = 921$.
The following super-population model is used to simulate $M = 100$ Monte Carlo populations:
$$y_{ij} = 4500 - 400\, x_{ij} + u_i + e_{ij}, \quad x_{ij} \sim N(\mu_i, 3), \quad \mu_i \sim U[-3, 3],$$
$$u_i \overset{iid}{\sim} N(0, 500^2), \quad e_{ij} \overset{iid}{\sim} N(0, 1000^2), \quad j = 1, \dots, n_i, \quad i = 1, \dots, D.$$
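One Monte Carlo population under this design could be generated along the following lines. This is a sketch under stated assumptions: a slope of -400, area means drawn uniformly on [-3, 3], and the second argument of the covariate distribution read as a variance of 3 (matching the variance notation used for the error terms). The function name and the tuple layout are hypothetical.

```python
import math
import random


def simulate_population(n_areas=50, area_size=200, seed=0):
    """Generate one finite population as a list of (area, x, y) tuples."""
    rng = random.Random(seed)
    population = []
    for area in range(n_areas):
        mu_i = rng.uniform(-3.0, 3.0)   # area-specific covariate mean
        u_i = rng.gauss(0.0, 500.0)     # random area effect, sd 500
        for _ in range(area_size):
            # Assumption: N(mu_i, 3) is read as variance 3, so sd = sqrt(3).
            x = rng.gauss(mu_i, math.sqrt(3.0))
            e = rng.gauss(0.0, 1000.0)  # unit-level error, sd 1000
            y = 4500.0 - 400.0 * x + u_i + e
            population.append((area, x, y))
    return population


population = simulate_population()
```

Repeating the call with 100 different seeds would give the Monte Carlo populations; drawing the unbalanced sample of 921 units is not shown here.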

Model-based Simulation: Normal Scenarios
The following methods are applied for parameter estimation of model (1):
LME: estimate the model parameters with the true $y_{ij}$, as a benchmark for the estimation methods that rely on the interval censored $y_{ij}$.
EM: estimation based on the generated pseudo $\tilde y_{ij}$.
SEM: estimation based on the drawn pseudo $\tilde y_{ij}$, with 40 burn-in iterations and 200 additional iterations.

Model-based Simulation: Normal Scenarios
Income distribution for two arbitrarily chosen populations:

Normal scenario 1: 7 intervals
Interval        Frequency
[1, 2000)       970
[2000, 3000)    1367
[3000, 4000)    2063
[4000, 5000)    2266
[5000, 6000)    1767
[6000, 7500)    1265
[7500, Inf)     302

Normal scenario 2: 4 intervals
Interval        Frequency
[1, 3000)       2337
[3000, 5000)    4329
[5000, 7500)    3032
[7500, Inf)     302
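The grouping itself can be sketched with a binary search over the inner interval bounds. The names below are hypothetical, and the sketch assumes that any value below the first inner bound is assigned to interval 1, which glosses over how incomes below 1 are handled in the actual simulation.

```python
from bisect import bisect_right
from collections import Counter

# Inner bounds of the left-closed intervals in the two scenarios.
SCENARIO_1_BOUNDS = [2000, 3000, 4000, 5000, 6000, 7500]
SCENARIO_2_BOUNDS = [3000, 5000, 7500]


def interval_index(y, inner_bounds):
    """Return k (1-based) such that y falls into the k-th interval."""
    return bisect_right(inner_bounds, y) + 1


def interval_frequencies(values, inner_bounds):
    """Count how many values fall into each of the K intervals."""
    counts = Counter(interval_index(y, inner_bounds) for y in values)
    return [counts.get(k, 0) for k in range(1, len(inner_bounds) + 2)]
```

Because the scenario 2 bounds are a subset of the scenario 1 bounds, the scenario 2 frequencies are sums of neighbouring scenario 1 frequencies (for example 970 + 1367 = 2337).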

Model-based Simulation: Quality Measures
To evaluate the performance of the EBPs, the root mean squared error (RMSE) of any parameter estimate $\hat I^{EBP}$ is estimated in each area $i$:
$$\mathrm{RMSE}(\hat I_i^{EBP}) = \left[ \frac{1}{M} \sum_{m=1}^{M} \left( \hat I_i^{EBP(m)} - I_i^{(m)} \right)^2 \right]^{1/2}, \tag{3}$$
where $M$ corresponds to the number of Monte Carlo populations.
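A minimal sketch of this quality measure, assuming the estimates and the true values are stored as nested lists indexed first by Monte Carlo replication and then by area (a hypothetical layout):

```python
import math


def rmse_by_area(estimates, truths):
    """Empirical RMSE per area; estimates[m][i] and truths[m][i] belong to replication m, area i."""
    n_rep = len(estimates)
    n_areas = len(estimates[0])
    return [
        math.sqrt(
            sum((estimates[m][i] - truths[m][i]) ** 2 for m in range(n_rep)) / n_rep
        )
        for i in range(n_areas)
    ]
```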

Model-based Simulation: Normal Scenario 1
[Figure: three panels (Mean, Gini, HCR), each plotted by method: LME, LMEBox, EM, EMBox, SEM, SEMBox]

Method   LME       LMEBox    EM        EMBox     SEM       SEMBox
Mean     206.5263  208.8188  214.1630  211.4700  215.1445  214.6016
Gini     0.0132    0.0133    0.0155    0.0156    0.0141    0.0144
HCR      0.0349    0.0344    0.0373    0.0373    0.0359    0.0362

Model-based Simulation: Normal Scenario 1
[Figure: three panels, Mean (LME), Mean (EM) and Mean (SEM), plotted by domain; series: Empirical, Estimated]

Model-based Simulation: Normal Scenario 1
[Figure: three panels, Gini (LME), Gini (EM) and Gini (SEM), plotted by domain; series: Empirical, Estimated]

Model-based Simulation: Normal Scenario 1
Density plot of ŷ from a particular simulation run:
[Figure: density against Y for the methods EM_Prediction, LME_Prediction and SEM_Prediction]

Model-based Simulation: Normal Scenario 2
[Figure: three panels (Mean, Gini, HCR), each plotted by method: LME, LMEBox, EM, EMBox, SEM, SEMBox]

Method   LME       LMEBox    EM        EMBox     SEM       SEMBox
Mean     205.0556  206.2724  247.1753  259.9144  256.3309  254.5574
Gini     0.0131    0.0130    0.0166    0.0244    0.0156    0.0169
HCR      0.0355    0.0343    0.0404    0.0488    0.0392    0.0401

Model-based Simulation: Normal Scenario 2
[Figure: three panels, Mean (LME), Mean (EM) and Mean (SEM), plotted by domain; series: Empirical, Estimated]

Model-based Simulation: Normal Scenario 2
[Figure: three panels, Gini (LME), Gini (EM) and Gini (SEM), plotted by domain; series: Empirical, Estimated]

Model-based Simulation: Normal Scenario 2
Density plot of ŷ from a particular simulation run:
[Figure: density against Y for the methods EM_Prediction, LME_Prediction and SEM_Prediction]

Previous research has shown that whenever the dependent variable is censored to certain intervals, the EM- and SEM-algorithm outperform naive estimation procedures (regression on the midpoints of the intervals) and direct estimation in terms of the RMSE of the EBPs.
Simulation results show that the accuracy loss in the EBPs from using the SEM- or EM-algorithm, compared to the use of uncensored data, strongly depends on the number of intervals.
In most scenarios, the SEM- and EM-algorithm perform quite similarly.

Since the EM- and SEM-algorithm rely strongly on the Gaussian assumption for the error terms, which cannot be tested accurately whenever the dependent variable is grouped, two transformations are incorporated into the algorithms to handle departures from normality.
The SEM-algorithm under the Box-Cox transformation outperforms the EM-algorithm under transformation in most scenarios and also performs well in the model-based normal scenarios.
The SEM-algorithm under the Box-Cox transformation is suggested as the preferred estimation procedure.

Bibliography
[1] Rao, J.N.K. & Molina, I. (2015). Small area estimation. John Wiley & Sons.
[3] Molina, I. & Rao, J.N.K. (2010). Small area estimation of poverty indicators. Canadian Journal of Statistics, 38(3), 369-385.
[4] Celeux, G. & Diebolt, J. (1985). The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2, 73-82.
[5] Stewart, M. B. (1983). On least squares estimation when the dependent variable is grouped. Review of Economic Studies, 50(4), 737-753.
[6] Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1), 1-38.
[7] González-Manteiga, W., Lombardía, M. J., Molina, I., Morales, D., & Santamaría, L. (2008). Analytic and bootstrap approximations of prediction errors under a multivariate Fay-Herriot model. Computational Statistics & Data Analysis, 52(12), 5242-5252.