Report concerning the estimation of variables at various spatial scales for Luxembourg


Report produced in the framework of the Urban Audit project and presented to EUROSTAT on December 1st 2010.

Luxembourgish partners in the project: GEODE team at CEPS/INSTEAD and STATEC
Corresponding authors: Dr. Hichem Omrani and Dr. Philippe Gerber
Contact: GEODE department, CEPS/INSTEAD, Luxembourg

Contents:
1. Introduction
2. Overview
3. Description of the work
3.1. Problems in the sub-level estimation
3.2. Examining the data
3.3. Fitting the linear probability model
4. Coverage of variables
4.1. Selected variables
4.2. Spatial units
5. Description of the developed estimation methodology
5.1. Logit model
5.2. Neural network model
5.2.1. Results from the learning and test steps
5.2.2. Optimal number of units in the hidden layer
5.3. Comparison of results from the neural network and Logit models
5.3.1. Capacity of prediction and efficiency
5.3.2. Assessment by ROC curve analysis
5.3.3. Monte Carlo simulation: evaluation and sub-level data quality measures (confidence intervals)
Conclusion
Acknowledgments
References

1. Introduction

Urban Audit 4 (UA) started on 1 September 2009 and will finish on 1 September 2011 with a final report presenting our results on estimating variables at multiple spatial units. The GEODE (Geography and Development) department of CEPS/INSTEAD has to estimate various variables at several scales (i.e. CLSN: city, large city, sub-city and national levels). For this assessment, multiple data sources are used: COL (the administrative file of all residents of the City of Luxembourg) and census data. In the framework of Urban Audit, the GEODE department of CEPS/INSTEAD, STATEC and EUROSTAT acknowledge the demand for small area data to support planning, decision making and service delivery at the local level. The aim of this report is to estimate a certain number of spatially scaled variables. Such estimation (from the census dataset and PSELL, the Panel Socio-Economique Liewen zu Lëtzebuerg, a panel survey at the national level) is available only at the national scale, not at sub-level scales (e.g. city level and municipalities). Our concern here is to use the national-level dataset together with sub-level characteristics in order to estimate sub-level variables. These estimates are useful for decision making. We present hereafter the estimation methods developed and the results that they produced. The methods developed here could be applied in several fields and applications whenever estimation is needed at sub-levels, or whenever a nonlinear model must be fitted from a set of explanatory variables.

2. Overview

For effective social and economic planning and for distributing government funds, there is a growing demand to produce reliable estimates for smaller geographic areas and subpopulations, called small areas, for which adequate samples are not available. Several methods have already been developed for sub-level estimation, e.g. small area estimation (SAE).
For sub-level estimation, several methods can be applied, such as regression approaches and small area estimation (SAE); for a review of SAE, see Ghosh and Rao (1994), among others. The different SAE methods can be classified into the following classes: synthetic estimators, regression models, base unit estimators, composite small area estimators and Bayesian estimators (see Omrani and Gerber, 2009). These methods are not sufficient for all kinds of data and are not appropriate for sub-level estimation (the disaggregation process). In fact, the PSELL-EU-SILC dataset only allows estimation at the national level; it is not appropriate for estimation at sub-levels (e.g. at the municipality level or for Luxembourg City), due to several issues such as sample sizes and missing data or responses.

3. Description of the work

We present in this report adequate techniques for sub-level estimation by applying Logit and neural network models. The methods developed here could be applied in several fields and applications when estimation is needed at sub-levels.

3.1. Problems in the sub-level estimation

Some problems arise in sub-level estimation:
- the census and the survey are only approximately contemporaneous;
- the variables are often collected and coded differently across surveys and censuses.

In the framework of the PSELL survey derived from EU-SILC, 746 households were interviewed in Luxembourg City in 2008, out of 3,779 households questioned in the whole country. The problem is that these households were not sampled according to a spatial stratification designed to ensure representativeness of the city: the sample is valid at the national level but not reliable enough at the sub-level (city level). However, the sample size at the city level seems high enough for estimating variables at sub-levels.

Figure 1: Number of individuals in the sample (from PSELL, 2008) by canton (see footnote 1)

For instance, from figure 1, we note that the number of sampled individuals in the canton of Esch-sur-Alzette (1,296 persons) is larger than in the canton of Luxembourg City (746 persons), whereas the population of the canton of Esch-sur-Alzette (146,000) is only slightly larger than that of Luxembourg City (139,000); this makes for biased estimation. This is why we apply more advanced techniques (i.e. Logit and neural network models) for variable estimation.

3.2. Examining the data

Initial examination of the data is a preliminary task aimed at studying the distribution of the variables, their relationships, dependency, causality and linearity. It also consists in treating missing values. Some of this data examination is presented in figure 2, from which we note that the population of Luxembourg City increased from 2003 to 2008, which does not match the evolution of the number of households over the same period according to the PSELL panel. This is due to the quality of the PSELL data sources.
In fact, the COL dataset is administrative data which is exhaustive (it contains all residents of Luxembourg City) but not precise (a problem of over-estimation, since it includes residents who have already left the city), whereas the PSELL dataset is a national survey which may contain some imprecision at the local level, and which therefore provides biased estimates of variables at that level. The variables of interest from the survey and COL datasets are listed below.

Footnote 1 - number of sampled individuals by canton: 1: LUXEMBOURG VILLE (746), 2: CAPELLEN (199), 3: ESCH-SUR-ALZETTE (1296), 4: LUXEMBOURG-CAMPAGNE (355), 5: MERSCH (183), 6: CLERVAUX (155), 7: DIEKIRCH (202), 8: REDANGE (88), 9: VIANDEN (24), 10: WILTZ (109), 11: ECHTERNACH (110), 12: REMICH (142)
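The sampling imbalance described above can be verified with a quick calculation. A minimal sketch in Python (the report's own computations use R), using only the sample counts and rounded population figures quoted in the text:

```python
# Sampling fractions by canton, using the counts quoted in the text:
# 746 sampled for ~139,000 inhabitants in Luxembourg City,
# 1,296 sampled for ~146,000 inhabitants in Esch-sur-Alzette.
cantons = {
    "Luxembourg City":  {"sampled": 746,  "population": 139_000},
    "Esch-sur-Alzette": {"sampled": 1296, "population": 146_000},
}

for name, c in cantons.items():
    fraction = c["sampled"] / c["population"]
    print(f"{name}: sampling fraction {fraction:.2%}"
          f" (1 sampled per {1 / fraction:.0f} inhabitants)")
```

Luxembourg City is sampled at roughly 0.54% against roughly 0.89% for Esch-sur-Alzette, which is exactly the under-representation of the city that motivates the modelling approach below.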

Let X = {x1, x2, x3, x4, x5} and Y = {y1} be the input and output variables, where:
- x1: age of the individual;
- x2: nationality of the individual;
- x3: gender of the individual;
- x4: marital status of the individual;
- x5: activity;
- y1: being household head (a dummy variable (2): 1 (yes), 0 (no)), to be estimated using the Logit and neural network models and the COL datasets.

Figure 2: Population (left) vs. number of households (right) from 2003 to 2008

3.3. Fitting the linear probability model

We proceed to fit a linear probability model (also known as a discrete choice model) to the data (represented in figure 3 and table 1), including fixed effects for the 5 explanatory variables mentioned before. Let data03 be the PSELL database in 2003 with dummy variables (16 inputs (I) and 1 output (O)). The linear model in R is:

    lm(O ~ I)   # I = data03[, 1:16] (inputs), O = data03[, 17] (output)

Table 1: Linear probability model (intercept and slopes), fitted by least squares with lm(O ~ I): the intercept and the dummy coefficients shown are significant at p < 2e-16.

(2) A dummy variable is a binary variable taking the value 1 or 0. It is commonly used to examine group and time effects in regression.

The remaining rows of table 1 follow the same pattern: all of the 16 dummy coefficients except one are significant at p < 2e-16 (significance codes: 0 '***' 0.001 '**' 0.01 '*' 0.05). Multiple R-squared: 0.326; F-statistic: 1.331e+04 on 16 and the residual degrees of freedom, with p-value < 2.2e-16; AIC: 446730.

Figure 3: Results from the linear model: density of the estimated probabilities of being household head, for the learning and test datasets

The number of households (hh) estimated from the test dataset is too low, so the linear model is inefficient for estimating the number of hh. A linear equation for estimating the number of households (as a function of age, marital status, activity, etc.) is not adequate; this is clear from the low values of the multiple R-squared and adjusted R-squared (as shown in table 1). Therefore, we propose to model the household reference person (i.e. being household head) as a function of several variables (e.g. age, gender, marital status, nationality and activity) using more advanced techniques for sub-level estimation, namely Logit and neural network models.
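To make the role of the dummy variables concrete, here is a minimal sketch in Python (the report itself uses R's lm()) with hypothetical toy records, not real PSELL data. In the special case of a full set of dummies for a single categorical variable and no intercept, the least-squares coefficient of each dummy equals the mean of the 0/1 outcome within that category, i.e. the fitted probability of being household head for that category:

```python
# Linear probability model with dummies for one categorical variable:
# the OLS coefficient of each dummy is the within-category mean of y.
from collections import defaultdict

# (marital_status, is_household_head) -- hypothetical toy records
records = [
    ("married", 1), ("married", 1), ("married", 0), ("married", 1),
    ("single",  0), ("single",  1), ("single",  0), ("single",  0),
    ("widowed", 1), ("widowed", 1),
]

sums, counts = defaultdict(float), defaultdict(int)
for status, y in records:
    sums[status] += y
    counts[status] += 1

# Fitted probability of being household head per category
coef = {status: sums[status] / counts[status] for status in sums}
print(coef)   # {'married': 0.75, 'single': 0.25, 'widowed': 1.0}
```

Note that such fitted probabilities can reach 0 or 1 exactly (and, with several regressors, can fall outside [0, 1]), which is one of the reasons the report moves on to the Logit model.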

4. Coverage of variables

4.1. Selected variables

Three sources of information (COL, census and PSELL) are used for the estimation of variables at three spatial units (national, sub-city district and city levels). Estimation at the national level is done from the PSELL panel. For estimation at sub-levels (i.e. core city and sub-city district), specific techniques are used, such as the small area estimation method, which is the most widely applied for this task. All variables to be estimated pertain to the number of households residing in determined spatial units of various sizes. The broad categories of Urban Audit variables to be estimated are the following:
- total number of households (excluding institutional households);
- one-person households;
- households with children aged 0 to under 18;
- total number of households with less than half of the national average disposable annual household income.

4.2. Spatial units

There are three spatial levels of estimation: core city, sub-city district level and national level (see tables 2-3).

Table 2: Variables transmitted to Eurostat (to collect), counted by status (A: collected, B: estimated, C: cannot yet be estimated, CC: collected centrally) and by spatial unit (S: city, S: sub-city, L: large urban zone, N: national), with totals and percentages

Table 3: List of variables (collected), with the same status categories and spatial units as table 2

5. Description of the developed estimation methodology

For sub-level estimation, several methods could be applied, such as a regression approach. As discussed and demonstrated in section 2, these methods are not sufficient for all kinds of data. The PSELL-EU-SILC dataset, being a national survey, only allows estimation at the national level; it is not appropriate for estimation at sub-levels (e.g. at the municipality level or for Luxembourg City), due to issues such as sample sizes and missing data or responses. We therefore present hereafter adequate techniques for sub-level estimation. The methodology discussed below derives, for example, the number of households by sub-level (or small area) by applying Logit and neural network models. Both methods use information derived from a set of independent variables obtained from administrative information sources (in our study, the census and COL datasets) that are symptomatic for regional statistics estimation. In order to estimate the Urban Audit variables, we use the multiple information sources mentioned above. The census dataset has been used to calibrate the models by estimating their coefficients; the calibrated models have then been used to produce the sub-level estimates from the COL datasets. Furthermore, for each estimated indicator, we present a confidence interval which indicates the accuracy of the estimation. The learning dataset used here (from PSELL-EU-SILC) for the Logit and neural network models is composed of 5 independent variables (age, sex, marital status, nationality and activity) and 1 dependent variable (being household head). After data treatment, we have 16 dummy variables as inputs and 1 dummy variable as the target variable (output). The 2003 dataset contains the individuals living in Luxembourg, a subset of whom are household heads.
Indeed, the share of individuals who are household heads is 39%. Let data03 be the PSELL database in 2003 with dummy variables (16 inputs (I) and 1 output (O)).

5.1. Logit model

The Logit model is fitted with R as follows (see table 4):

    logit <- glm(O ~ I, family = binomial(link = "logit"))
    # I = data03[, 1:16] (inputs), O = data03[, 17] (output)
    coef <- logit$coefficients
    r <- coef["(Intercept)"] + as.matrix(I) %*% coef[-1]  # linear predictor
    r <- exp(r) / (1 + exp(r))   # predicted probabilities
    # equivalently: predict(logit, type = "response")

Table 4: Variables in the Logit model: coefficient estimates (B), standard errors, Wald statistics, degrees of freedom, significance and Exp(B) for the 16 dummy inputs and the intercept

From the Logit model fitted on the learning database (PSELL-EU-SILC dataset), the estimated number of households is close to the real value (as shown in figure 4), so the estimate is reliable at the learning step. The reliability of the Logit model on the COL data is shown in figure 5 (the test dataset including all residents of Luxembourg City), which gives the estimated number of households at the level of Luxembourg City in 2007. The Logit model has a markedly lower AIC than the linear model (AIC = 446730) and is therefore preferable. Hereafter we compare the reliability of the Logit model with that of the neural network model.

Figure 4: Density of the probabilities of being household head from the Logit model at the learning step (2003 PSELL-EU-SILC dataset); estimated number of households and its variation with the threshold (sensitivity analysis): with a threshold of 0.5, the estimated number of households is close to the real value.
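The last line of the R code in section 5.1, r <- exp(r)/(1+exp(r)), is the inverse-logit (sigmoid) map from the linear predictor to a probability. A minimal Python sketch with hypothetical coefficients (the actual estimated coefficients are those of table 4):

```python
import math

def inv_logit(r):
    """Map a linear predictor to a probability, as in r <- exp(r)/(1+exp(r))."""
    return 1.0 / (1.0 + math.exp(-r))

# Hypothetical intercept and two dummy coefficients (the report estimates
# an intercept plus 16 dummy coefficients from the PSELL data).
intercept, b = -1.0, [2.0, -0.5]
x = [1, 0]                                   # dummy inputs for one individual
r = intercept + sum(bi * xi for bi, xi in zip(b, x))
p = inv_logit(r)                             # probability of being household head
print(round(p, 3))                           # 0.731, i.e. inv_logit(1.0)
```

The form 1/(1+exp(-r)) is algebraically identical to exp(r)/(1+exp(r)) but avoids overflow of exp(r) for large positive linear predictors.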

Figure 5: Logit results: density of the probabilities of being household head on the test dataset, and estimated number of households in 2007 at the level of Luxembourg City as a function of the threshold (from 0.2 to 0.8)

5.2. Neural network model

The neural network is a more general and widespread theoretical framework than the Logit model: it can handle several dependent variables, each following a given distribution. The neural network (denoted NN below) is applied here in order to estimate the number of households in Luxembourg City. We use the softmax activation function (Bridle, 1990), a useful way of describing the relationship between one or more explanatory variables (i.e. v1 = age, v2 = gender, v3 = marital status and v4 = activity) and an outcome (being household head), expressed as a probability over only two possible values, "household head" or "not household head".

5.2.1. Results from the learning and test steps

Figure 6 below presents the estimated number of households and its variation with the threshold value. The estimate corresponding to the threshold 0.5 seems to be the most adequate, with 4 units in the hidden layer (see fig. 6).

Figure 6: Density of the probabilities of being household head; estimated number of households (at the optimal threshold) from the learning dataset (PSELL-EU-SILC) using the neural network model, compared with the real number of households
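The report does not give the trained network's weights or software, so purely as an illustration of the architecture (16 dummy inputs in the study, reduced to 3 here; one hidden layer of 4 logistic units; one probability output), here is a forward pass in Python with made-up weights:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """One hidden layer of logistic units feeding a logistic output unit."""
    h = [logistic(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W_hidden, b_hidden)]
    return logistic(sum(w * hi for w, hi in zip(w_out, h)) + b_out)

# Hypothetical weights: 3 inputs -> 4 hidden units -> 1 output
W_hidden = [[0.5, -0.2, 0.1],
            [-0.3, 0.8, 0.0],
            [0.2, 0.2, -0.4],
            [0.0, -0.1, 0.6]]
b_hidden = [0.1, -0.2, 0.0, 0.3]
w_out = [0.7, -0.5, 0.3, 0.2]
b_out = -0.1

x = [1, 0, 1]   # dummy-coded inputs for one individual
p = forward(x, W_hidden, b_hidden, w_out, b_out)
print(0.0 < p < 1.0)   # True: the output is a valid probability
```

For the binary outcome used here, a logistic output unit and a two-class softmax output are equivalent parameterizations.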

Figure 7: Density of the probabilities of being household head on the test dataset from the NN model; estimated number of households (48,506, with 4 units in the hidden layer) from the test dataset

Figure 8: Comparison of the probability densities from the NN model on the learning and test datasets

From the neural network model with 4 units in the hidden layer (results in figure 8), the estimated number of households is as shown in figure 7.

5.2.2. Optimal number of units in the hidden layer

We vary the number of units in the hidden layer and observe the resulting estimate of the number of households (see figure 9). By comparison with the real value, it is easy to determine the optimal number of units in the hidden layer, as shown in figure 9: the optimum is 4 units, in line with the literature recommendation that the number of hidden units be about the square root of the number of input variables.
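The selection procedure above amounts to retraining the network for each candidate hidden-layer size and keeping the size whose household estimate is closest to the known total. A sketch of that selection step, with hypothetical estimates and a hypothetical true total (the real curve is the one in figure 9):

```python
import math

# Hypothetical household estimates by hidden-layer size (the report obtains
# these by retraining the network for each size) and a hypothetical truth.
estimates = {1: 52_000, 2: 49_500, 4: 48_200, 8: 46_000, 16: 51_000}
true_total = 48_000

# Keep the size whose estimate is closest to the known total
best = min(estimates, key=lambda k: abs(estimates[k] - true_total))
print(best)                    # 4 for these toy numbers

# The rule of thumb quoted in the text: about sqrt(number of inputs)
print(round(math.sqrt(16)))    # 4, matching the 16 dummy inputs
```

With the report's 16 dummy inputs, the empirical optimum (4 units) coincides with the square-root heuristic.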

Figure 9: Estimated number of households from the test dataset with the NN model, for various numbers of hidden units (from 0 to 50)

5.3. Comparison of results from the neural network and Logit models

5.3.1. Capacity of prediction and efficiency

Figure 10 presents the estimated results for the number of households from the neural network and Logit models. The comparison is done by varying the threshold from 0.2 to 0.8 by a fixed step. The neural network appears more precise and less sensitive to the threshold value than the Logit model (see tables 5-8 below).

Figure 10: Prediction results: comparison of the probability densities from the Logit and NN models on the test dataset

Figure 11: Estimated number of households for various thresholds: comparison of the Logit and neural network models at the test step (COL dataset)

After the learning step, which estimates the parameters of the models, a test step is important for studying the efficiency and the generalization capacity of each model. The test dataset is composed of 5 variables (sex, age, marital status, nationality and activity) for the individuals resident in Luxembourg City. We applied the Logit and the neural network models in order to estimate the number of households from this administrative dataset containing all the residents of Luxembourg City (results shown in figure 12). The efficiency of the applied models is shown in tables 5-8. We underline that the NN model is the most reliable in terms of capacity of prediction: its percentage of correct classification (i.e. 0 or 1, household head or not) is 82.34% (see tables 5-8).

Table 5: Estimation efficiency on the learning dataset with the LPM: classification table of observed vs. predicted outcomes (0/1) and percentage correct, with a cut value of 0.5

Table 6: Estimation efficiency on the learning dataset with the Logit model: classification table with a cut value of 0.5; overall percentage correct: 79.7

Table 7: Estimation efficiency on the learning dataset with the NN: classification table of observed vs. predicted outcomes with a cut value of 0.5

Table 8: Comparison of the percentage of correct classification from the LPM, Logit and NN models on the learning and test datasets

Table 9 summarizes the estimation results from the NN and Logit models from 2003 to 2008.

Table 9: Comparison of results: estimated number of households from the three models (LPM, Logit and neural network) by year, alongside the population of Luxembourg City (3) and the household number from PSELL

(3) The population of the city is taken from the STATEC institute.
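Classification tables like tables 5-8 are built by thresholding each predicted probability at the cut value 0.5 and cross-tabulating against the observed outcome. A minimal sketch in Python with toy probabilities and labels (not the report's data):

```python
# Classification table at a given cut value, as in tables 5-8.
def classification_table(y_true, p_pred, cut=0.5):
    """Cross-tabulate observed (0/1) vs. predicted (p >= cut) outcomes."""
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for y, p in zip(y_true, p_pred):
        table[(y, int(p >= cut))] += 1
    correct = table[(0, 0)] + table[(1, 1)]
    return table, 100.0 * correct / len(y_true)

# Toy observed outcomes and predicted probabilities
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
p_pred = [0.2, 0.4, 0.7, 0.8, 0.6, 0.3, 0.1, 0.9]
table, pct = classification_table(y_true, p_pred)
print(table, pct)   # 6 of 8 correct -> 75.0
```

The diagonal cells (0, 0) and (1, 1) are the correctly classified individuals; their share is the "percentage correct" reported in table 8.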

Figure 12: Results from the estimation methodology at the city level: population and number of households from the PSELL and COL datasets, 2003 to 2008

5.3.2. Assessment by ROC curve analysis

The ROC (Receiver Operating Characteristic) curve is a tool for the assessment and comparison of models. We use ROC curve analysis in order to evaluate the efficiency of the three applied prediction models (LPM, Logit and neural network (NN)). According to figure 13, the NN model is more efficient than the Logit and LPM models.

Figure 13: ROC curve analysis (sensitivity vs. 1 - specificity) for the three prediction models (NN, Logit and LPM)

5.3.3. Monte Carlo simulation: evaluation and sub-level data quality measures (confidence intervals)

We recall that a confidence interval illustrates the degree of variability associated with a number: wide confidence intervals indicate high variability, so such numbers should be interpreted and compared with due caution. Here we apply Monte Carlo simulation in order to determine confidence intervals for the number of households (hh). The confidence

intervals of the number of households are produced here by assuming that the hh number follows a normal distribution. The applied Monte Carlo simulation proceeds in 5 steps:
1. Predict the number of hh from the test dataset with the NN and Logit models: y = f(x1, x2, ..., xq).
2. Generate a set of random inputs (x1, x2, ..., xq) by introducing an error term.
3. Evaluate the model and store the result as yi.
4. Repeat steps 2 and 3 for i = 1 to N iterations.
5. Analyze the results using confidence intervals and check that the first predicted value lies within the confidence interval.

First, we estimate the number of households on the test datasets (COL) using the parameters estimated on the learning datasets. Then we introduce an error term in the input variables, re-evaluate the model, and repeat the process N times, obtaining at each iteration a predicted value of the number of households, as shown in figure 14. We compute the mean and the standard deviation of the predicted hh numbers, and finally determine the confidence interval. For N iterations, let (m, σ) be respectively the mean and the standard deviation of the number of households, which is supposed to follow a normal distribution. Its confidence interval (CI) at 95% is then, for the lower and upper limits:

CI = [m - 1.96 σ / sqrt(N), m + 1.96 σ / sqrt(N)]

The CI for the estimated number of households (from the test dataset) is shown in figure 14; for 2003 it is equal to [33351, 42004]. The predicted number of hh in 2003 belongs to this CI, so we conclude that the prediction is consistent. To conclude this section, we have seen that the neural network is better than the Logit model: the NN model is able to classify correctly about 86% of the individuals.
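The five-step Monte Carlo procedure above can be sketched as follows; the model f, the baseline input and the noise scale are all hypothetical stand-ins (the report perturbs the real COL inputs and re-runs the fitted NN and Logit models):

```python
import random
import statistics

random.seed(0)                        # reproducible run

def predict(x):
    """Hypothetical stand-in for the fitted model y = f(x1, ..., xq)."""
    return 40_000 + 100 * x

x0 = 10.0                             # baseline input
y0 = predict(x0)                      # step 1: first predicted value
N = 1_000
# Steps 2-4: perturb the input, re-evaluate, store, repeat N times
ys = [predict(x0 + random.gauss(0, 1)) for _ in range(N)]

m = statistics.mean(ys)
s = statistics.stdev(ys)
half = 1.96 * s / N ** 0.5            # 95% half-width, normality assumed
ci = (m - half, m + half)
print(ci[0] <= y0 <= ci[1])           # step 5: is the first prediction inside?
```

The 1.96 factor is the 97.5th percentile of the standard normal, giving a 95% interval under the normality assumption stated in the text.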
We may still slightly improve these results by running a parametric study on the number of hidden units. This could be done by minimizing the Akaike information criterion (AIC; Greene, 2000), but it is rather time-consuming and our results are already quite good.

Figure 14: Estimated number of households from the Monte Carlo simulation (N iterations) with the NN model, for various thresholds, from the COL data
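The ROC curve analysis used above to compare the three models can be reproduced from any model's predicted probabilities by sweeping the classification threshold. A self-contained sketch with toy labels and scores (not the report's data; ties between scores are not specially handled):

```python
# ROC points and AUC via the trapezoidal rule.
def roc_auc(y_true, scores):
    """Return ROC points (1 - specificity, sensitivity) and the AUC."""
    pairs = sorted(zip(scores, y_true), reverse=True)
    pos = sum(y_true)
    neg = len(y_true) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in pairs:              # lower the threshold one case at a time
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc

y_true = [0, 0, 1, 1]               # toy observed outcomes
scores = [0.1, 0.4, 0.35, 0.8]      # toy predicted probabilities
_, auc = roc_auc(y_true, scores)
print(auc)                          # 0.75 for this toy example
```

A higher AUC means the model's probabilities rank household heads above non-heads more consistently, which is how figure 13 separates the NN from the Logit and LPM curves.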

Conclusion

In this report, we support the use of neural network techniques that integrate census and survey data to produce sub-level estimates. The use of the neural network model for sub-level estimation is motivated by its universal approximation capabilities. As illustrated by the application described above, neural learning is particularly useful and efficient when complex nonlinear phenomena are involved; the neural network technique seems particularly well adapted to estimating sub-level variables, better than the Logit model. The results of the proposed methods were assessed by ROC curve analysis, and confidence intervals were attached to them via Monte Carlo simulation. In the future, we plan to apply the developed NN-based technique to estimate the full set of Urban Audit variables at sub-level scales (i.e. not only the city level but also the municipality or sub-city district level). In a next step of our research, we may try to include the spatial units (e.g. the municipality level) in the NN model and predict the income and the structure of households (e.g. number of children), comparing the results with a Poisson regression model, which is widely used to predict such variables.

Acknowledgments

We would like to thank the anonymous reviewers for their comments. This work was supported in part by the EUROSTAT institute and in part by the GEODE department of the CEPS/INSTEAD research institute.
Annex: Time schedule, Luxembourg Urban Audit
- Start: 1 September 2009
- Identification of variables: 1 December 2009 (Statec/CEPS)
- Compilation of variables, 2008 annual UA: 1 March 2010 (Statec/CEPS)
- Compilation of variables, exhaustive UA, reference year 2008: 1 September 2010 (Statec/CEPS)
- Compilation of interim operational report: 1 December 2010 (CEPS)
- Compilation of variables, 2009 annual UA: 1 March 2011 (Statec/CEPS)
- Participation in quality control of variables: 1 June 2011 (Statec/CEPS)
- Compilation of maps: 1 June 2011 (CEPS)
- Compilation of final operational report: 1 September 2011 (CEPS)

References

Omrani H., Gerber P., Small Area Estimation, International Conference on Small Area Estimation (SAE'09), Elche, Spain, June 29-July 1, 2009 (a).
Omrani H., Gerber P., Bousch P., Model-Based Small Area Estimation with application to unemployment estimates, International Conference on Mathematics, Statistics and Scientific Computing (ICMSSC), Dubai, UAE, January 28-30, 2009 (b).
Omrani H., Gerber P., Small area estimation: methods and application, final report of the Urban Audit project, EUROSTAT, CEPS/INSTEAD, Luxembourg, December 2008.

Roy G. et Vanheuverzwyn A., « Redressement par la macro CALMAR : applications et pistes d'amélioration », in Traitements des fichiers d'enquêtes, éditions PUG.
Ghosh M. and Rao J. N. K., Small Area Estimation: An Appraisal, Statistical Science, Vol. 9, No. 1, 1994.
Bridle J. S., Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition, in F. Fogelman Soulié and J. Hérault (eds.), Neurocomputing: Algorithms, Architectures and Applications, Berlin: Springer-Verlag, 1990.
Greene W. H., Econometric Analysis, Prentice Hall International, 4th edition, 2000.

Publication in 2010 on the research work conducted in the framework of the Urban Audit project:
Omrani H., Gerber P., Small Area Estimation: sub-level estimation of variables at various spatial scales, submitted to a scientific journal (October 2010).


More information

Stock Trading Following Stock Price Index Movement Classification Using Machine Learning Techniques

Stock Trading Following Stock Price Index Movement Classification Using Machine Learning Techniques Stock Trading Following Stock Price Index Movement Classification Using Machine Learning Techniques 6.1 Introduction Trading in stock market is one of the most popular channels of financial investments.

More information

WesVar uses repeated replication variance estimation methods exclusively and as a result does not offer the Taylor Series Linearization approach.

WesVar uses repeated replication variance estimation methods exclusively and as a result does not offer the Taylor Series Linearization approach. CHAPTER 9 ANALYSIS EXAMPLES REPLICATION WesVar 4.3 GENERAL NOTES ABOUT ANALYSIS EXAMPLES REPLICATION These examples are intended to provide guidance on how to use the commands/procedures for analysis of

More information

The data definition file provided by the authors is reproduced below: Obs: 1500 home sales in Stockton, CA from Oct 1, 1996 to Nov 30, 1998

The data definition file provided by the authors is reproduced below: Obs: 1500 home sales in Stockton, CA from Oct 1, 1996 to Nov 30, 1998 Economics 312 Sample Project Report Jeffrey Parker Introduction This project is based on Exercise 2.12 on page 81 of the Hill, Griffiths, and Lim text. It examines how the sale price of houses in Stockton,

More information

Power of t-test for Simple Linear Regression Model with Non-normal Error Distribution: A Quantile Function Distribution Approach

Power of t-test for Simple Linear Regression Model with Non-normal Error Distribution: A Quantile Function Distribution Approach Available Online Publications J. Sci. Res. 4 (3), 609-622 (2012) JOURNAL OF SCIENTIFIC RESEARCH www.banglajol.info/index.php/jsr of t-test for Simple Linear Regression Model with Non-normal Error Distribution:

More information

A Comparative Study of Various Forecasting Techniques in Predicting. BSE S&P Sensex

A Comparative Study of Various Forecasting Techniques in Predicting. BSE S&P Sensex NavaJyoti, International Journal of Multi-Disciplinary Research Volume 1, Issue 1, August 2016 A Comparative Study of Various Forecasting Techniques in Predicting BSE S&P Sensex Dr. Jahnavi M 1 Assistant

More information

Econometrics is. The estimation of relationships suggested by economic theory

Econometrics is. The estimation of relationships suggested by economic theory Econometrics is Econometrics is The estimation of relationships suggested by economic theory Econometrics is The estimation of relationships suggested by economic theory The application of mathematical

More information

Estimating term structure of interest rates: neural network vs one factor parametric models

Estimating term structure of interest rates: neural network vs one factor parametric models Estimating term structure of interest rates: neural network vs one factor parametric models F. Abid & M. B. Salah Faculty of Economics and Busines, Sfax, Tunisia Abstract The aim of this paper is twofold;

More information

Mortality Rates Estimation Using Whittaker-Henderson Graduation Technique

Mortality Rates Estimation Using Whittaker-Henderson Graduation Technique MATIMYÁS MATEMATIKA Journal of the Mathematical Society of the Philippines ISSN 0115-6926 Vol. 39 Special Issue (2016) pp. 7-16 Mortality Rates Estimation Using Whittaker-Henderson Graduation Technique

More information

Market Variables and Financial Distress. Giovanni Fernandez Stetson University

Market Variables and Financial Distress. Giovanni Fernandez Stetson University Market Variables and Financial Distress Giovanni Fernandez Stetson University In this paper, I investigate the predictive ability of market variables in correctly predicting and distinguishing going concern

More information

GENERATION OF STANDARD NORMAL RANDOM NUMBERS. Naveen Kumar Boiroju and M. Krishna Reddy

GENERATION OF STANDARD NORMAL RANDOM NUMBERS. Naveen Kumar Boiroju and M. Krishna Reddy GENERATION OF STANDARD NORMAL RANDOM NUMBERS Naveen Kumar Boiroju and M. Krishna Reddy Department of Statistics, Osmania University, Hyderabad- 500 007, INDIA Email: nanibyrozu@gmail.com, reddymk54@gmail.com

More information

The Impact of a $15 Minimum Wage on Hunger in America

The Impact of a $15 Minimum Wage on Hunger in America The Impact of a $15 Minimum Wage on Hunger in America Appendix A: Theoretical Model SEPTEMBER 1, 2016 WILLIAM M. RODGERS III Since I only observe the outcome of whether the household nutritional level

More information

The state of the social economy in Luxembourg

The state of the social economy in Luxembourg The state of the social economy in Luxembourg Chiara Peroni Research Division, STATEC Establishing Satellite Accounts for the Social Economy Wednesday 14 th October, 2015 1/17 Introduction The social economy

More information

Optimal Search for Parameters in Monte Carlo Simulation for Derivative Pricing

Optimal Search for Parameters in Monte Carlo Simulation for Derivative Pricing Optimal Search for Parameters in Monte Carlo Simulation for Derivative Pricing Prof. Chuan-Ju Wang Department of Computer Science University of Taipei Joint work with Prof. Ming-Yang Kao March 28, 2014

More information

Let us assume that we are measuring the yield of a crop plant on 5 different plots at 4 different observation times.

Let us assume that we are measuring the yield of a crop plant on 5 different plots at 4 different observation times. Mixed-effects models An introduction by Christoph Scherber Up to now, we have been dealing with linear models of the form where ß0 and ß1 are parameters of fixed value. Example: Let us assume that we are

More information

The Duration Derby: A Comparison of Duration Based Strategies in Asset Liability Management

The Duration Derby: A Comparison of Duration Based Strategies in Asset Liability Management The Duration Derby: A Comparison of Duration Based Strategies in Asset Liability Management H. Zheng Department of Mathematics, Imperial College London SW7 2BZ, UK h.zheng@ic.ac.uk L. C. Thomas School

More information

SUPPLEMENTARY ONLINE APPENDIX FOR: TECHNOLOGY AND COLLECTIVE ACTION: THE EFFECT OF CELL PHONE COVERAGE ON POLITICAL VIOLENCE IN AFRICA

SUPPLEMENTARY ONLINE APPENDIX FOR: TECHNOLOGY AND COLLECTIVE ACTION: THE EFFECT OF CELL PHONE COVERAGE ON POLITICAL VIOLENCE IN AFRICA SUPPLEMENTARY ONLINE APPENDIX FOR: TECHNOLOGY AND COLLECTIVE ACTION: THE EFFECT OF CELL PHONE COVERAGE ON POLITICAL VIOLENCE IN AFRICA 1. CELL PHONES AND PROTEST The Afrobarometer survey asks whether respondents

More information

Lecture 10: Alternatives to OLS with limited dependent variables, part 1. PEA vs APE Logit/Probit

Lecture 10: Alternatives to OLS with limited dependent variables, part 1. PEA vs APE Logit/Probit Lecture 10: Alternatives to OLS with limited dependent variables, part 1 PEA vs APE Logit/Probit PEA vs APE PEA: partial effect at the average The effect of some x on y for a hypothetical case with sample

More information

Improving Returns-Based Style Analysis

Improving Returns-Based Style Analysis Improving Returns-Based Style Analysis Autumn, 2007 Daniel Mostovoy Northfield Information Services Daniel@northinfo.com Main Points For Today Over the past 15 years, Returns-Based Style Analysis become

More information

STOCK PRICE PREDICTION: KOHONEN VERSUS BACKPROPAGATION

STOCK PRICE PREDICTION: KOHONEN VERSUS BACKPROPAGATION STOCK PRICE PREDICTION: KOHONEN VERSUS BACKPROPAGATION Alexey Zorin Technical University of Riga Decision Support Systems Group 1 Kalkyu Street, Riga LV-1658, phone: 371-7089530, LATVIA E-mail: alex@rulv

More information

List of tables List of boxes List of screenshots Preface to the third edition Acknowledgements

List of tables List of boxes List of screenshots Preface to the third edition Acknowledgements Table of List of figures List of tables List of boxes List of screenshots Preface to the third edition Acknowledgements page xii xv xvii xix xxi xxv 1 Introduction 1 1.1 What is econometrics? 2 1.2 Is

More information

CHAPTER 4 DATA ANALYSIS Data Hypothesis

CHAPTER 4 DATA ANALYSIS Data Hypothesis CHAPTER 4 DATA ANALYSIS 4.1. Data Hypothesis The hypothesis for each independent variable to express our expectations about the characteristic of each independent variable and the pay back performance

More information

To be two or not be two, that is a LOGISTIC question

To be two or not be two, that is a LOGISTIC question MWSUG 2016 - Paper AA18 To be two or not be two, that is a LOGISTIC question Robert G. Downer, Grand Valley State University, Allendale, MI ABSTRACT A binary response is very common in logistic regression

More information

WC-5 Just How Credible Is That Employer? Exploring GLMs and Multilevel Modeling for NCCI s Excess Loss Factor Methodology

WC-5 Just How Credible Is That Employer? Exploring GLMs and Multilevel Modeling for NCCI s Excess Loss Factor Methodology Antitrust Notice The Casualty Actuarial Society is committed to adhering strictly to the letter and spirit of the antitrust laws. Seminars conducted under the auspices of the CAS are designed solely to

More information

Multiple regression - a brief introduction

Multiple regression - a brief introduction Multiple regression - a brief introduction Multiple regression is an extension to regular (simple) regression. Instead of one X, we now have several. Suppose, for example, that you are trying to predict

More information

A Comparison of Univariate Probit and Logit. Models Using Simulation

A Comparison of Univariate Probit and Logit. Models Using Simulation Applied Mathematical Sciences, Vol. 12, 2018, no. 4, 185-204 HIKARI Ltd, www.m-hikari.com https://doi.org/10.12988/ams.2018.818 A Comparison of Univariate Probit and Logit Models Using Simulation Abeer

More information

Predicting Abnormal Stock Returns with a. Nonparametric Nonlinear Method

Predicting Abnormal Stock Returns with a. Nonparametric Nonlinear Method Predicting Abnormal Stock Returns with a Nonparametric Nonlinear Method Alan M. Safer California State University, Long Beach Department of Mathematics 1250 Bellflower Boulevard Long Beach, CA 90840-1001

More information

Approximating the Confidence Intervals for Sharpe Style Weights

Approximating the Confidence Intervals for Sharpe Style Weights Approximating the Confidence Intervals for Sharpe Style Weights Angelo Lobosco and Dan DiBartolomeo Style analysis is a form of constrained regression that uses a weighted combination of market indexes

More information

Introduction to Population Modeling

Introduction to Population Modeling Introduction to Population Modeling In addition to estimating the size of a population, it is often beneficial to estimate how the population size changes over time. Ecologists often uses models to create

More information

Business Strategies in Credit Rating and the Control of Misclassification Costs in Neural Network Predictions

Business Strategies in Credit Rating and the Control of Misclassification Costs in Neural Network Predictions Association for Information Systems AIS Electronic Library (AISeL) AMCIS 2001 Proceedings Americas Conference on Information Systems (AMCIS) December 2001 Business Strategies in Credit Rating and the Control

More information

The Use of Artificial Neural Network for Forecasting of FTSE Bursa Malaysia KLCI Stock Price Index

The Use of Artificial Neural Network for Forecasting of FTSE Bursa Malaysia KLCI Stock Price Index The Use of Artificial Neural Network for Forecasting of FTSE Bursa Malaysia KLCI Stock Price Index Soleh Ardiansyah 1, Mazlina Abdul Majid 2, JasniMohamad Zain 2 Faculty of Computer System and Software

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Scott Creel Wednesday, September 10, 2014 This exercise extends the prior material on using the lm() function to fit an OLS regression and test hypotheses about effects on a parameter.

More information

International Journal of Research in Engineering Technology - Volume 2 Issue 5, July - August 2017

International Journal of Research in Engineering Technology - Volume 2 Issue 5, July - August 2017 RESEARCH ARTICLE OPEN ACCESS The technical indicator Z-core as a forecasting input for neural networks in the Dutch stock market Gerardo Alfonso Department of automation and systems engineering, University

More information

Gamma Distribution Fitting

Gamma Distribution Fitting Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

More information

Introduction to the Maximum Likelihood Estimation Technique. September 24, 2015

Introduction to the Maximum Likelihood Estimation Technique. September 24, 2015 Introduction to the Maximum Likelihood Estimation Technique September 24, 2015 So far our Dependent Variable is Continuous That is, our outcome variable Y is assumed to follow a normal distribution having

More information

Iran s Stock Market Prediction By Neural Networks and GA

Iran s Stock Market Prediction By Neural Networks and GA Iran s Stock Market Prediction By Neural Networks and GA Mahmood Khatibi MS. in Control Engineering mahmood.khatibi@gmail.com Habib Rajabi Mashhadi Associate Professor h_mashhadi@ferdowsi.um.ac.ir Electrical

More information

Backpropagation and Recurrent Neural Networks in Financial Analysis of Multiple Stock Market Returns

Backpropagation and Recurrent Neural Networks in Financial Analysis of Multiple Stock Market Returns Backpropagation and Recurrent Neural Networks in Financial Analysis of Multiple Stock Market Returns Jovina Roman and Akhtar Jameel Department of Computer Science Xavier University of Louisiana 7325 Palmetto

More information

A Novel Prediction Method for Stock Index Applying Grey Theory and Neural Networks

A Novel Prediction Method for Stock Index Applying Grey Theory and Neural Networks The 7th International Symposium on Operations Research and Its Applications (ISORA 08) Lijiang, China, October 31 Novemver 3, 2008 Copyright 2008 ORSC & APORC, pp. 104 111 A Novel Prediction Method for

More information

STOCHASTIC COST ESTIMATION AND RISK ANALYSIS IN MANAGING SOFTWARE PROJECTS

STOCHASTIC COST ESTIMATION AND RISK ANALYSIS IN MANAGING SOFTWARE PROJECTS Full citation: Connor, A.M., & MacDonell, S.G. (25) Stochastic cost estimation and risk analysis in managing software projects, in Proceedings of the ISCA 14th International Conference on Intelligent and

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maximum Likelihood Estimation EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #6 EPSY 905: Maximum Likelihood In This Lecture The basics of maximum likelihood estimation Ø The engine that

More information

A NEW POINT ESTIMATOR FOR THE MEDIAN OF GAMMA DISTRIBUTION

A NEW POINT ESTIMATOR FOR THE MEDIAN OF GAMMA DISTRIBUTION Banneheka, B.M.S.G., Ekanayake, G.E.M.U.P.D. Viyodaya Journal of Science, 009. Vol 4. pp. 95-03 A NEW POINT ESTIMATOR FOR THE MEDIAN OF GAMMA DISTRIBUTION B.M.S.G. Banneheka Department of Statistics and

More information

Web Appendix. Are the effects of monetary policy shocks big or small? Olivier Coibion

Web Appendix. Are the effects of monetary policy shocks big or small? Olivier Coibion Web Appendix Are the effects of monetary policy shocks big or small? Olivier Coibion Appendix 1: Description of the Model-Averaging Procedure This section describes the model-averaging procedure used in

More information

The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2010, Mr. Ruey S. Tsay Solutions to Final Exam

The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2010, Mr. Ruey S. Tsay Solutions to Final Exam The University of Chicago, Booth School of Business Business 410, Spring Quarter 010, Mr. Ruey S. Tsay Solutions to Final Exam Problem A: (4 pts) Answer briefly the following questions. 1. Questions 1

More information

Final Exam - section 1. Thursday, December hours, 30 minutes

Final Exam - section 1. Thursday, December hours, 30 minutes Econometrics, ECON312 San Francisco State University Michael Bar Fall 2013 Final Exam - section 1 Thursday, December 19 1 hours, 30 minutes Name: Instructions 1. This is closed book, closed notes exam.

More information

Chapter 2 Uncertainty Analysis and Sampling Techniques

Chapter 2 Uncertainty Analysis and Sampling Techniques Chapter 2 Uncertainty Analysis and Sampling Techniques The probabilistic or stochastic modeling (Fig. 2.) iterative loop in the stochastic optimization procedure (Fig..4 in Chap. ) involves:. Specifying

More information

Econometric Methods for Valuation Analysis

Econometric Methods for Valuation Analysis Econometric Methods for Valuation Analysis Margarita Genius Dept of Economics M. Genius (Univ. of Crete) Econometric Methods for Valuation Analysis Cagliari, 2017 1 / 25 Outline We will consider econometric

More information

Journal of Economics and Financial Analysis, Vol:1, No:1 (2017) 1-13

Journal of Economics and Financial Analysis, Vol:1, No:1 (2017) 1-13 Journal of Economics and Financial Analysis, Vol:1, No:1 (2017) 1-13 Journal of Economics and Financial Analysis Type: Double Blind Peer Reviewed Scientific Journal Printed ISSN: 2521-6627 Online ISSN:

More information

Final Exam Suggested Solutions

Final Exam Suggested Solutions University of Washington Fall 003 Department of Economics Eric Zivot Economics 483 Final Exam Suggested Solutions This is a closed book and closed note exam. However, you are allowed one page of handwritten

More information

Stock Price and Index Forecasting by Arbitrage Pricing Theory-Based Gaussian TFA Learning

Stock Price and Index Forecasting by Arbitrage Pricing Theory-Based Gaussian TFA Learning Stock Price and Index Forecasting by Arbitrage Pricing Theory-Based Gaussian TFA Learning Kai Chun Chiu and Lei Xu Department of Computer Science and Engineering The Chinese University of Hong Kong, Shatin,

More information

International Journal of Computer Engineering and Applications, Volume XII, Issue II, Feb. 18, ISSN

International Journal of Computer Engineering and Applications, Volume XII, Issue II, Feb. 18,   ISSN Volume XII, Issue II, Feb. 18, www.ijcea.com ISSN 31-3469 AN INVESTIGATION OF FINANCIAL TIME SERIES PREDICTION USING BACK PROPAGATION NEURAL NETWORKS K. Jayanthi, Dr. K. Suresh 1 Department of Computer

More information

Subject CS1 Actuarial Statistics 1 Core Principles. Syllabus. for the 2019 exams. 1 June 2018

Subject CS1 Actuarial Statistics 1 Core Principles. Syllabus. for the 2019 exams. 1 June 2018 ` Subject CS1 Actuarial Statistics 1 Core Principles Syllabus for the 2019 exams 1 June 2018 Copyright in this Core Reading is the property of the Institute and Faculty of Actuaries who are the sole distributors.

More information

Dividend Strategies for Insurance risk models

Dividend Strategies for Insurance risk models 1 Introduction Based on different objectives, various insurance risk models with adaptive polices have been proposed, such as dividend model, tax model, model with credibility premium, and so on. In this

More information

Consistent estimators for multilevel generalised linear models using an iterated bootstrap

Consistent estimators for multilevel generalised linear models using an iterated bootstrap Multilevel Models Project Working Paper December, 98 Consistent estimators for multilevel generalised linear models using an iterated bootstrap by Harvey Goldstein hgoldstn@ioe.ac.uk Introduction Several

More information

AN AGENT BASED ESTIMATION METHOD OF HOUSEHOLD MICRO-DATA INCLUDING HOUSING INFORMATION FOR THE BASE YEAR IN LAND-USE MICROSIMULATION

AN AGENT BASED ESTIMATION METHOD OF HOUSEHOLD MICRO-DATA INCLUDING HOUSING INFORMATION FOR THE BASE YEAR IN LAND-USE MICROSIMULATION AN AGENT BASED ESTIMATION METHOD OF HOUSEHOLD MICRO-DATA INCLUDING HOUSING INFORMATION FOR THE BASE YEAR IN LAND-USE MICROSIMULATION Kazuaki Miyamoto, Tokyo City University, Japan Nao Sugiki, Docon Co.,

More information

Monte-Carlo Methods in Financial Engineering

Monte-Carlo Methods in Financial Engineering Monte-Carlo Methods in Financial Engineering Universität zu Köln May 12, 2017 Outline Table of Contents 1 Introduction 2 Repetition Definitions Least-Squares Method 3 Derivation Mathematical Derivation

More information

Homework Assignment Section 3

Homework Assignment Section 3 Homework Assignment Section 3 Tengyuan Liang Business Statistics Booth School of Business Problem 1 A company sets different prices for a particular stereo system in eight different regions of the country.

More information

XLSTAT TIP SHEET FOR BUSINESS STATISTICS CENGAGE LEARNING

XLSTAT TIP SHEET FOR BUSINESS STATISTICS CENGAGE LEARNING XLSTAT TIP SHEET FOR BUSINESS STATISTICS CENGAGE LEARNING INTRODUCTION XLSTAT makes accessible to anyone a powerful, complete and user-friendly data analysis and statistical solution. Accessibility to

More information

Ordinal Multinomial Logistic Regression. Thom M. Suhy Southern Methodist University May14th, 2013

Ordinal Multinomial Logistic Regression. Thom M. Suhy Southern Methodist University May14th, 2013 Ordinal Multinomial Logistic Thom M. Suhy Southern Methodist University May14th, 2013 GLM Generalized Linear Model (GLM) Framework for statistical analysis (Gelman and Hill, 2007, p. 135) Linear Continuous

More information

Chapter 5. Continuous Random Variables and Probability Distributions. 5.1 Continuous Random Variables

Chapter 5. Continuous Random Variables and Probability Distributions. 5.1 Continuous Random Variables Chapter 5 Continuous Random Variables and Probability Distributions 5.1 Continuous Random Variables 1 2CHAPTER 5. CONTINUOUS RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS Probability Distributions Probability

More information

Modelling the potential human capital on the labor market using logistic regression in R

Modelling the potential human capital on the labor market using logistic regression in R Modelling the potential human capital on the labor market using logistic regression in R Ana-Maria Ciuhu (dobre.anamaria@hotmail.com) Institute of National Economy, Romanian Academy; National Institute

More information

An ex-post analysis of Italian fiscal policy on renovation

An ex-post analysis of Italian fiscal policy on renovation An ex-post analysis of Italian fiscal policy on renovation Marco Manzo, Daniela Tellone VERY FIRST DRAFT, PLEASE DO NOT CITE June 9 th 2017 Abstract In June 2012, the share of dwellings renovation costs

More information

Modeling Private Firm Default: PFirm

Modeling Private Firm Default: PFirm Modeling Private Firm Default: PFirm Grigoris Karakoulas Business Analytic Solutions May 30 th, 2002 Outline Problem Statement Modelling Approaches Private Firm Data Mining Model Development Model Evaluation

More information

Models of Patterns. Lecture 3, SMMD 2005 Bob Stine

Models of Patterns. Lecture 3, SMMD 2005 Bob Stine Models of Patterns Lecture 3, SMMD 2005 Bob Stine Review Speculative investing and portfolios Risk and variance Volatility adjusted return Volatility drag Dependence Covariance Review Example Stock and

More information

Copyright 2011 Pearson Education, Inc. Publishing as Addison-Wesley.

Copyright 2011 Pearson Education, Inc. Publishing as Addison-Wesley. Appendix: Statistics in Action Part I Financial Time Series 1. These data show the effects of stock splits. If you investigate further, you ll find that most of these splits (such as in May 1970) are 3-for-1

More information

DATABASE AND RESEARCH METHODOLOGY

DATABASE AND RESEARCH METHODOLOGY CHAPTER III DATABASE AND RESEARCH METHODOLOGY The nature of the present study Direct Tax Reforms in India: A Comparative Study of Pre and Post-liberalization periods is such that it requires secondary

More information

The Impact of Financial Parameters on Agricultural Cooperative and Investor-Owned Firm Performance in Greece

The Impact of Financial Parameters on Agricultural Cooperative and Investor-Owned Firm Performance in Greece The Impact of Financial Parameters on Agricultural Cooperative and Investor-Owned Firm Performance in Greece Panagiota Sergaki and Anastasios Semos Aristotle University of Thessaloniki Abstract. This paper

More information

Module 4 Bivariate Regressions

Module 4 Bivariate Regressions AGRODEP Stata Training April 2013 Module 4 Bivariate Regressions Manuel Barron 1 and Pia Basurto 2 1 University of California, Berkeley, Department of Agricultural and Resource Economics 2 University of

More information

COMPARING NEURAL NETWORK AND REGRESSION MODELS IN ASSET PRICING MODEL WITH HETEROGENEOUS BELIEFS

COMPARING NEURAL NETWORK AND REGRESSION MODELS IN ASSET PRICING MODEL WITH HETEROGENEOUS BELIEFS Akademie ved Leske republiky Ustav teorie informace a automatizace Academy of Sciences of the Czech Republic Institute of Information Theory and Automation RESEARCH REPORT JIRI KRTEK COMPARING NEURAL NETWORK

More information

Modeling customer revolving credit scoring using logistic regression, survival analysis and neural networks

Modeling customer revolving credit scoring using logistic regression, survival analysis and neural networks Modeling customer revolving credit scoring using logistic regression, survival analysis and neural networks NATASA SARLIJA a, MIRTA BENSIC b, MARIJANA ZEKIC-SUSAC c a Faculty of Economics, J.J.Strossmayer

More information

Public Opinion about the Pension Reform in Albania

Public Opinion about the Pension Reform in Albania EUROPEAN ACADEMIC RESEARCH Vol. II, Issue 4/ July 2014 ISSN 2286-4822 www.euacademic.org Impact Factor: 3.1 (UIF) DRJI Value: 5.9 (B+) Public Opinion about the Pension Reform in Albania AIDA GUXHO Faculty

More information

Jaime Frade Dr. Niu Interest rate modeling

Jaime Frade Dr. Niu Interest rate modeling Interest rate modeling Abstract In this paper, three models were used to forecast short term interest rates for the 3 month LIBOR. Each of the models, regression time series, GARCH, and Cox, Ingersoll,

More information

Smooth estimation of yield curves by Laguerre functions

Smooth estimation of yield curves by Laguerre functions Smooth estimation of yield curves by Laguerre functions A.S. Hurn 1, K.A. Lindsay 2 and V. Pavlov 1 1 School of Economics and Finance, Queensland University of Technology 2 Department of Mathematics, University

More information

MCMC Package Example

MCMC Package Example MCMC Package Example Charles J. Geyer April 4, 2005 This is an example of using the mcmc package in R. The problem comes from a take-home question on a (take-home) PhD qualifying exam (School of Statistics,

More information

APPLICATION OF ARTIFICIAL NEURAL NETWORK SUPPORTING THE PROCESS OF PORTFOLIO MANAGEMENT IN TERMS OF TIME INVESTMENT ON THE WARSAW STOCK EXCHANGE

APPLICATION OF ARTIFICIAL NEURAL NETWORK SUPPORTING THE PROCESS OF PORTFOLIO MANAGEMENT IN TERMS OF TIME INVESTMENT ON THE WARSAW STOCK EXCHANGE QUANTITATIVE METHODS IN ECONOMICS Vol. XV, No. 2, 2014, pp. 307 316 APPLICATION OF ARTIFICIAL NEURAL NETWORK SUPPORTING THE PROCESS OF PORTFOLIO MANAGEMENT IN TERMS OF TIME INVESTMENT ON THE WARSAW STOCK

More information

Comparison of OLS and LAD regression techniques for estimating beta

Comparison of OLS and LAD regression techniques for estimating beta Comparison of OLS and LAD regression techniques for estimating beta 26 June 2013 Contents 1. Preparation of this report... 1 2. Executive summary... 2 3. Issue and evaluation approach... 4 4. Data... 6

More information

MODEL SELECTION CRITERIA IN R:

MODEL SELECTION CRITERIA IN R: 1. R 2 statistics We may use MODEL SELECTION CRITERIA IN R R 2 = SS R SS T = 1 SS Res SS T or R 2 Adj = 1 SS Res/(n p) SS T /(n 1) = 1 ( ) n 1 (1 R 2 ). n p where p is the total number of parameters. R

More information

SMALL AREA ESTIMATES OF INCOME: MEANS, MEDIANS

SMALL AREA ESTIMATES OF INCOME: MEANS, MEDIANS SMALL AREA ESTIMATES OF INCOME: MEANS, MEDIANS AND PERCENTILES Alison Whitworth (alison.whitworth@ons.gsi.gov.uk) (1), Kieran Martin (2), Cruddas, Christine Sexton, Alan Taylor Nikos Tzavidis (3), Marie

More information

Presented at the 2012 SCEA/ISPA Joint Annual Conference and Training Workshop -

Presented at the 2012 SCEA/ISPA Joint Annual Conference and Training Workshop - Applying the Pareto Principle to Distribution Assignment in Cost Risk and Uncertainty Analysis James Glenn, Computer Sciences Corporation Christian Smart, Missile Defense Agency Hetal Patel, Missile Defense

More information

The duration derby : a comparison of duration based strategies in asset liability management

The duration derby : a comparison of duration based strategies in asset liability management Edith Cowan University Research Online ECU Publications Pre. 2011 2001 The duration derby : a comparison of duration based strategies in asset liability management Harry Zheng David E. Allen Lyn C. Thomas

More information

Credit Risk Modeling Using Excel and VBA with DVD O. Gunter Loffler Peter N. Posch. WILEY A John Wiley and Sons, Ltd., Publication

Credit Risk Modeling Using Excel and VBA with DVD O. Gunter Loffler Peter N. Posch. WILEY A John Wiley and Sons, Ltd., Publication Credit Risk Modeling Using Excel and VBA with DVD O Gunter Loffler Peter N. Posch WILEY A John Wiley and Sons, Ltd., Publication Preface to the 2nd edition Preface to the 1st edition Some Hints for Troubleshooting

More information

Risk management methodology in Latvian economics

Risk management methodology in Latvian economics Risk management methodology in Latvian economics Dr.sc.ing. Irina Arhipova irina@cs.llu.lv Latvia University of Agriculture Faculty of Information Technologies, Liela street 2, Jelgava, LV-3001 Fax: +

More information

Exchange Rate Regime Classification with Structural Change Methods

Exchange Rate Regime Classification with Structural Change Methods Exchange Rate Regime Classification with Structural Change Methods Achim Zeileis Ajay Shah Ila Patnaik http://statmath.wu-wien.ac.at/ zeileis/ Overview Exchange rate regimes What is the new Chinese exchange

More information

The Vasicek adjustment to beta estimates in the Capital Asset Pricing Model

The Vasicek adjustment to beta estimates in the Capital Asset Pricing Model The Vasicek adjustment to beta estimates in the Capital Asset Pricing Model 17 June 2013 Contents 1. Preparation of this report... 1 2. Executive summary... 2 3. Issue and evaluation approach... 4 3.1.

More information

Some Characteristics of Data

Some Characteristics of Data Some Characteristics of Data Not all data is the same, and depending on some characteristics of a particular dataset, there are some limitations as to what can and cannot be done with that data. Some key

More information