Correcting for non-response bias using socio-economic register data

Liisa Larja & Riku Salonen
liisa.larja@stat.fi / riku.salonen@stat.fi

Introduction

Increasing non-response is a problem for population surveys such as the Labour Force Survey (LFS) in many countries. Non-response may bias survey estimates if the propensity to respond is not equally distributed in the population. This is especially the case when the distribution of non-response is correlated with the distribution of labour market status across population groups. Typically, demographic data (e.g. age, sex, region) are used in the estimation process to correct for this bias. This paper discusses how socio-economic register data can also be used to correct for non-response bias in the estimates.

In the Finnish LFS, register unemployment has been part of the estimation procedure since 1997, when it was noted that the post-stratification design based on sex, age group and region did not capture the overrepresentation of unemployed persons among the non-respondents (Djerf, 1997). We have since noticed some shortcomings in our estimation procedure and are investigating the possibility of adding new socio-economic data to our model.

Research questions

In this paper, we present results from the Finnish LFS, where we already employ data on register unemployment, and investigate the possibility of adding data on education and income. First, we show how non-response varies between groups. Then we present how adding these data to the calibration model affects the level and precision of the estimates of employment and unemployment.

Current estimation procedure

The Finnish LFS has been conducted by Statistics Finland since 1959. Over the years the estimation procedure has changed several times. Until 1997, only demographic data were used in the estimation process (post-stratification). The current estimation procedure was introduced in 1997.
Register data on unemployment were used as auxiliary information at the estimation stage. The use of such socio-economic auxiliary data significantly improved the estimates of unemployment by reducing sampling errors and non-response bias (Djerf, 1997). GREG estimation was used in this procedure (see note 1).

Since the introduction of the current estimation procedure in 1997, non-response has increased considerably, from 8 % in 1996 to 33 % in 2017. The most problematic aspect of non-response is that it may not be evenly distributed in the population and that its distribution may be correlated with the distribution of labour market status. In the case of the Finnish LFS, we see that the non-response rate of the least educated (ISCED 0-2) has grown to as high as 40 percent, whereas the non-response rate for persons with higher education (ISCED 5-8) is only 25 percent (Figure 1).

Note 1: In the Finnish LFS, the non-response adjustment is a two-step re-weighting procedure. In the first step, post-stratification is used to improve the precision of estimation. In total, 194 post-strata are constructed: the stratum of Mainland Finland is divided into 192 post-strata by sex (2 groups), age (6 groups) and region (16 groups), and the stratum of the Autonomous Territory of the Åland Islands is divided into two post-strata by sex. In the second step, to allow the use of more variables, the post-stratified weights are calibrated according to sex (2 groups), age (12 groups), region (20 groups), reference week (4 or 5 groups) and status in the Ministry of Labour's job-seeker register (8 groups). This matches the distributions of all variables included in the calibration process to the distributions in the calibration frame.

Figure 1: Non-response rate in the Finnish Labour Force Survey according to the level of education
[Chart: non-response rate (%, axis 0 to 45) by year, 2008-2018, for ISCED 0-2, ISCED 3-4 and ISCED 5-8.]

Furthermore, education is highly correlated with labour market status: respondents with higher education have an employment rate of 73 %, and those with the least education only 31 % (Figure 2). The same pattern is repeated with status in the job-seeker register and with migration status. Those who respond most frequently, i.e. persons not registered as job seekers and Finland-Swedes, also tend to have a better labour market position. Hence, as the more frequent respondents are overrepresented in our data and are often in a better labour market position, we are likely to overestimate the number of employed persons in the population.

Figure 2: Non-response rate and employment rate in the Finnish Labour Force Survey in 2017

[Chart: non-response rate and employment rate among the respondents (%, axis 0 to 80) for ISCED 0-2, ISCED 3-4, ISCED 5-8, registered job seeker, not registered job seeker, foreign background, Finnish and Finland-Swedish.]

We see the effect of this bias when comparing LFS data on the educational attainment of the population with data from the Finnish register of completed degrees (which has very good coverage, except for foreign degrees). As Figure 3 shows, the tertiary educational attainment of the population aged 30-34 appears very high, and even increasing, in the LFS, whereas the register data show exactly the opposite.

Figure 3: Tertiary educational attainment, % of population aged 30-34
[Chart: LFS and Register of completed degrees, 2008-2016; axis 36 to 47 %.]

Due to these developments in non-response, and based on our good previous experience of using data on status in the job-seeker register, we started studying the possibility of adding more socio-economic data to the calibration model.
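The mechanism described above can be illustrated with a small calculation. This is a minimal sketch, not part of the original analysis: the non-response rates (40 % and 25 %) and the employment rates among respondents (31 % and 73 %) are the figures quoted in the text for ISCED 0-2 and ISCED 5-8, while the equal population shares and the assumption that non-respondents in each group have the same employment rate as respondents are hypothetical simplifications.

```python
# Two education groups; rates are taken from the text, population shares are hypothetical.
groups = {
    "ISCED 0-2": {"pop_share": 0.5, "response_rate": 0.60, "employment_rate": 0.31},
    "ISCED 5-8": {"pop_share": 0.5, "response_rate": 0.75, "employment_rate": 0.73},
}

def naive_estimate(groups):
    """Employment rate computed from respondents only, with no adjustment."""
    responding = sum(g["pop_share"] * g["response_rate"] for g in groups.values())
    employed = sum(
        g["pop_share"] * g["response_rate"] * g["employment_rate"]
        for g in groups.values()
    )
    return employed / responding

def adjusted_estimate(groups):
    """Employment rate after re-weighting each group to its known population share.

    Assumes (hypothetically) that non-respondents within a group have the
    same employment rate as respondents in that group.
    """
    return sum(g["pop_share"] * g["employment_rate"] for g in groups.values())

naive = naive_estimate(groups)
adjusted = adjusted_estimate(groups)
print(f"unadjusted employment rate: {naive:.3f}")
print(f"education-adjusted rate:    {adjusted:.3f}")
print(f"difference:                 {naive - adjusted:.3f}")
```

Under these assumptions the unadjusted rate is about 54.3 % against an adjusted rate of 52.0 %, so the group that responds more often and is more often employed pulls the unadjusted estimate upwards. The size of the gap depends entirely on the assumed population shares.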

Methodology

To study the bias and uncertainty that non-response causes in the estimates of labour market status, we compare estimation results from different calibration models using Finnish Labour Force Survey data from March 2018. As a baseline, we present estimates calibrated using only the demographic variables (sex, age, region). After that, we add socio-economic variables to the model (status in the job-seeker register, educational attainment, native/foreign origin, urban/rural). We look at the effect of each proposed calibration on the level as well as on the precision (standard error) of the estimates. Seven different weighting schemes were constructed as follows:

GREG_base: sex, 5-year age group and region (20 areas based on NUTS3)
GREG_current: GREG_base + status in the unemployment register
GREG1: GREG_base + level of education
GREG2: GREG_base + origin
GREG3: GREG_base + urban/rural
GREG7: GREG_current + level of education
GREG8: GREG_current + level of education + origin + urban/rural

Results

We first present the technical and quality details of the alternative calibrated weights. After that, we present the impact of the different weights on the LFS headline indicators.

Impacts on weights

The calibrated weights are obtained with ETOS, a program developed by Statistics Sweden for calibration and GREG estimation. One goal of a calibration weighting system is to keep the calibrated weights as close as possible to the pre-calibrated weights. We therefore evaluate the correlations between the pre-calibrated weights and the alternative calibrated weights (see Table 1).

Table 1: Correlations between the pre-calibrated weights and the alternative calibrated weights, March 2018

Calibrated weights    Correlation with pre-calibrated weights
GREG_base             0.94303
GREG_current          0.91491
GREG1                 0.86442
GREG2                 0.88131
GREG3                 0.92906
GREG7                 0.83603
GREG8                 0.80812

Table 1 indicates that the closeness of the calibrated weights to the pre-calibrated weights depends on the calibration variables chosen: the correlation is lowest for the schemes that include the level of education (GREG1, GREG7 and GREG8). Another goal of a calibration weighting system is to prevent negative and extreme weights. We compared the distributions of the alternative calibrated weights (see Table 2 below). The results show that the alternative calibration schemes do not lead to undesirable weights, so we can assume that the categories of the calibration variables are quite well specified.
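The two diagnostics discussed above (correlation with the pre-calibrated weights, and the range of the calibrated weights) can be sketched for a simple linear calibration. This is an illustrative sketch only: it does not reproduce ETOS, the linear distance function is an assumption (the paper does not state which one was used), and the data, variable names and population totals below are synthetic and hypothetical.

```python
import numpy as np

def linear_calibration(d, X, totals):
    """Linear (GREG-type) calibration: w = d * (1 + X @ lam), with X.T @ w == totals."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    gap = np.asarray(totals, dtype=float) - X.T @ d   # shortfall of the weighted sample totals
    M = X.T @ (d[:, None] * X)                        # X' D X
    lam = np.linalg.solve(M, gap)
    return d * (1.0 + X @ lam)

# Synthetic respondents (hypothetical): intercept, a sex dummy, a low-education dummy.
rng = np.random.default_rng(0)
n = 200
female = rng.integers(0, 2, n)
low_edu = rng.integers(0, 2, n)
X = np.column_stack([np.ones(n), female, low_edu])

# Hypothetical pre-calibrated weights and known population totals.
d = rng.uniform(300.0, 700.0, n)
totals = (X.T @ d) * np.array([1.00, 1.02, 1.10])  # e.g. low-educated under-represented

w = linear_calibration(d, X, totals)

print("totals reproduced:", np.allclose(X.T @ w, totals))
print("correlation with pre-calibrated weights:", np.corrcoef(d, w)[0, 1])
print("min / max calibrated weight:", w.min(), w.max())
```

The larger the adjustment an auxiliary variable requires, the further the calibrated weights move from the pre-calibrated ones, which is one plausible reading of the lower correlations seen for the schemes that include education. Linear calibration can in principle produce negative weights, which is why the minimum is worth checking.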

Table 2: Distributions of the alternative calibrated weights, March 2018

Weights         Minimum   Maximum   Average   Respondents
GREG_base       151       1240      503.81    8181
GREG_current    152       1424      503.81    8181
GREG1           130       1411      503.81    8181
GREG2           142       1228      503.81    8181
GREG3           146       1301      503.81    8181
GREG7           135       1417      503.81    8181
GREG8           126       1450      503.81    8181

Estimates on employment

Comparing the estimates of employment, we see that with only demographic data (GREG_base) the estimate is as high as 2 506 000 persons (Figure 4a). When either status in the job-seeker register (GREG_current) or the level of education (GREG1) is controlled for in addition to the demographic distribution, the estimate falls by some 27 000-30 000 persons, that is, 0.7-0.9 percentage points of the employment rate. This means that demographic data alone are not sufficient for controlling the bias caused by non-response, and that socio-economic auxiliary information, such as the level of education or status in the unemployment register, can help to correct for this bias. Information on foreign/native background (GREG2) or on the degree of urbanity of the place of residence (GREG3) has a somewhat smaller effect, lowering the estimate of employed persons by 3 000-13 000 persons (or 0.1-0.4 percentage points of the employment rate).

By adding information on both status in the job-seeker register and the level of education (GREG7) to the calibration model, we see a decrease of 50 000 persons in the estimate of the employed population. This means a decrease of 1.4 percentage points in the employment rate. As the effect is larger than with either of the variables alone, we conclude that the variables do not entirely overlap and hence that including both in the calibration model would be beneficial. Adding further data on foreign background and urbanity (GREG8) has a very limited influence on the estimate, but may have important implications for estimates of sub-populations, such as the number of foreign-born employed persons. These analyses remain, however, outside the scope of this paper.

In addition to correcting for the bias caused by non-response, using socio-economic auxiliary information in the calibration model may help to increase the precision of the estimates. Using only demographic auxiliary data, the standard error of the employment estimate is 18 538, but adding appropriate socio-economic data decreases the standard error to 16 500-16 700 (GREG_current, GREG7 and GREG8). However, adding more data to the model may also increase the standard error if the resulting population classes become too small.

Figure 4a: Estimates of the employed population (aged 15-74) using alternative calibrated weights for March 2018

[Chart: employed population, 15-74, for each weighting scheme (GREG_base, GREG_current, GREG1, GREG2, GREG3, GREG7, GREG8); axis 2 360 000 to 2 560 000.]

Figure 4b: Estimates of the employment rate (aged 15-64) using alternative calibrated weights for March 2018
[Chart: employment rate, 15-64, for each weighting scheme; axis 66.0 to 73.0 %.]

Estimates on unemployment

The results for the estimates of (ILO) unemployment are presented in Figures 5a and 5b. Compared with the calibration model with only demographic variables (GREG_base), adding auxiliary data on status in the job-seeker register (GREG_current) increases the number of (ILO) unemployed by 12 000 persons, or 0.5 percentage points in the unemployment rate. Furthermore, the precision of the estimate improves, as the standard error decreases from 10 865 to 9 690.

The current calibration variable is divided into 8 categories (e.g. three indicators for the length of unemployment in the register and four indicators of whether the person in question is included in the unemployed job-seekers register). We have also tested a registered unemployment indicator classified into four categories (male 15-24, female 15-24, male 25-74 and female 25-74), and the estimation results show substantial gains in efficiency in these categories.

In contrast to the results for employment, adding further socio-economic auxiliary data has little effect on the estimates of unemployment. The model with all tested auxiliary information (GREG8) gives almost exactly the same estimates as our current model (GREG_current).

Figure 5a: Estimates of the unemployed population (aged 15-74) using alternative calibrated weights for March 2018
[Chart: unemployed persons, 15-74, for each weighting scheme; axis 0 to 300 000.]

Figure 5b: Estimates of the unemployment rate (aged 15-74) using alternative calibrated weights for March 2018

[Chart: unemployment rate, 15-74, for each weighting scheme (GREG_base, GREG_current, GREG1, GREG2, GREG3, GREG7, GREG8); axis 0 to 12.0 %.]

Estimates on the population not in the labour force

The results for the population not in the labour force reflect the pattern already observed for the estimates of employment. With a calibration model using only the demographic variables (GREG_base, Figure 6), the number of persons not in the labour force is underestimated by some 40 000 persons, or by 2.8 %, when compared to the model using all tested auxiliary information (GREG8, GREG7). As with the results on employment, both status in the job-seeker register and the level of education seem to have an independent effect on the bias, since using both variables in the model (GREG7) has a larger effect on the estimate than either variable alone (GREG_current, GREG1). Again, the data on foreign/native origin and on the degree of urbanity of the area of residence seem to have a smaller effect than the education and job-seeker data. However, the effect of these auxiliary data should be reviewed separately for the estimates of the relevant sub-populations, which is outside the scope of this paper.

Figure 6: Estimates of the population not in the labour force (aged 15-74) using alternative calibrated weights for March 2018

[Chart: population not in the labour force, 15-74, for each weighting scheme (GREG_base, GREG_current, GREG1, GREG2, GREG3, GREG7, GREG8); axis 1 320 000 to 1 440 000.]

Discussion

This paper presents the first results of the work of Statistics Finland on improving estimation in the Labour Force Survey. We demonstrate that using appropriate socio-economic auxiliary data in the estimation process may significantly improve estimation, both by correcting the bias caused by non-response and by improving the precision of the estimates. Comparing models that use only demographic auxiliary data with models that also use socio-economic auxiliary data, our results indicate a bias as large as 1.5 percentage points in the employment rate.

Auxiliary data on status in the job-seeker register were introduced to our calibration model in 1997, as they were shown to improve precision and decrease bias (Djerf, 1997). Since then, we have witnessed a significant increase in non-response. In this paper we show that the current model is no longer up to date and would benefit from the addition of new auxiliary data. Based on our results, adding at least data on the level of education would be important, as the current model seems to overestimate the number of employed persons by some 20 000 (or 0.5 percentage points in the employment rate). Accordingly, the number of persons outside the labour force is underestimated by the corresponding amount. The effects on the estimates of unemployment are small, as our current model, which already uses the job-seeker register data, seems to capture the non-response bias sufficiently.

These are the first results of work aimed at improving the estimation model. Further analyses of the effect of other auxiliary data, such as income, student status or the age of the youngest child, remain to be done. Also, we have presented results only for the headline indicators of the Labour Force Survey. Analyses of the effects on sub-populations (men/women, youth, elderly, foreign-born, highly/least educated, etc.) as well as on other indicators (NEET rate, early leavers from education, working time, etc.) remain to be conducted.

References

Djerf, K. (1997). Effects of Post-Stratification on the Estimates of the Finnish Labour Force Survey. Journal of Official Statistics, Vol. 13(1), pp. 29-39.