Data utility metrics and disclosure risk analysis for public use files

Specific Grant Agreement
Production of Public Use Files for European microdata
Work Package 3 - Deliverable D3.1
October 2015

This document proposes data utility and disclosure risk metrics for estimating the loss of information and the residual disclosure risk after a public use file has been generated. Two methods are tested: the traditional approach (a mix of global recoding, local suppression and PRAM perturbation) and the generation of synthetic data. Two European datasets are considered: the EU-SILC (synthetic data generation) and EU-LFS (traditional approach) panel surveys.

Part I Data utility metrics

1 General comments

To estimate data utility, we compare the proposed public use files (PUFs) with the associated scientific use files (SUFs), treating each PUF as if it were derived from the SUF (even though the generation model for synthetic data is computed using the original microdata).

To estimate data utility for synthetic data, it would be advisable to:

- Generate several (many) datasets
- Compute data utility estimates for each dataset
- Report, as a robust measure, the mean of the measure over all synthesized datasets

We will not use this method because it is too time-consuming. Risk and utility analysis is performed for the final public use files of all member states. The proposed utility metrics do not distinguish between cross-sectional and longitudinal data because we restrict ourselves to cross-sectional data.

The parameters of the anonymization process should be published to give users an idea of the risk/utility trade-off. For instance, the matrices used for PRAM perturbation should be published.

One goal is to provide bounded measures (between 0 and 1), where 1 stands for good utility. We give bounded metrics whenever possible, but this is not always the case. The utility and risk measures provided in this project are not meant to be interpreted as absolute values. However, comparing the metrics obtained for different public use files (produced with different parameters in the anonymization process) makes it possible to compare anonymization methods. Complete results of the risk/utility analysis for the public use files produced in this project will be given in the final deliverables (D2.4 and D3.2).

2 Does the PUF look like the SUF?

In this section, some simple metrics are proposed to check whether the public use file has a structure similar to that of the SUF. In the rest of the report, the notation # stands for "Number of".

2.1 Basic structure

This subsection provides very basic similarity measures to check whether a given PUF looks like its associated SUF.

Definition 1. The rate of produced public use files is given by:

\[ \text{Measure 1} = \frac{\#(\text{PUFs produced for a year of data collection})}{\#(\text{SUFs produced for a year of data collection})} \]

This measure is used to check whether all datasets are provided (yearly and quarterly data / individual and household data).

Definition 2. The rate of variables in a PUF is given by:

\[ \text{Measure 2} = \frac{\#(\text{Real variables PUF})}{\#(\text{Real variables SUF})} \]

This measure can be estimated for each public use file. A variable is not considered real if it contains only missing values. The same kind of measure can be used to check the rate of records.
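As an illustration of Definition 2, the following minimal Python sketch counts "real" variables (those with at least one non-missing value) in a PUF and a SUF and takes their ratio. The data layout (a dict mapping variable names to lists of values, with None for a missing value) and the function names are assumptions made for this example, not part of the deliverable.

```python
def count_real_variables(dataset):
    """Count variables with at least one non-missing value (None = missing)."""
    return sum(
        1 for values in dataset.values()
        if any(v is not None for v in values)
    )


def real_variable_rate(puf, suf):
    """Measure 2: #(real variables in PUF) / #(real variables in SUF)."""
    n_suf = count_real_variables(suf)
    if n_suf == 0:
        raise ValueError("the SUF has no real variables")
    return count_real_variables(puf) / n_suf


# Toy data (hypothetical): the PUF keeps 'age' and 'sex' but blanks out 'income'.
suf = {
    "age": [34, 51, None],
    "sex": ["F", "M", "F"],
    "income": [21000, None, 18500],
}
puf = {
    "age": [34, 51, None],
    "sex": ["F", "M", "F"],
    "income": [None, None, None],
}

rate = real_variable_rate(puf, suf)
print(rate)  # two of the three SUF variables remain real in the PUF
```

The same ratio applied to record counts instead of variable counts gives the rate of records.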

Definition 3. The rate of records in a PUF is given by:

\[ \text{Measure 3} = \frac{\#(\text{Records PUF})}{\#(\text{Records SUF})} \]

This measure is useful if we apply global suppression (e.g. for big households) or data simulation. If household and individual data are separated, Measure 3 must be computed for both datasets.

2.2 Variable-related structure

This subsection gives some variable-related measures.

Definition 4. The score for missing values is given, for a variable X (the notation NA stands for "missing value"), by:

\[
\text{Measure 4} =
\begin{cases}
\mathbf{1}_{\#(\text{NAs PUF}) = 0}, & \text{if } \#(\text{NAs SUF}) = 0 \\[4pt]
\dfrac{\#(\text{NAs SUF})}{\#(\text{NAs PUF})}, & \text{if } \#(\text{NAs PUF}) \geq \#(\text{NAs SUF}) \neq 0 \\[8pt]
\dfrac{\#(\text{NAs PUF})}{\#(\text{NAs SUF})}, & \text{if } \#(\text{NAs PUF}) < \#(\text{NAs SUF})
\end{cases}
\]

This measure can be used when applying local suppression or generating synthetic data. An alternative for estimating the impact of local suppression is the following measure.

Definition 5. The notation A stands for "non-missing value". For variable X, assume that #(As SUF) > 0. The score for non-missing values is given, for a variable X, by:

\[ \text{Measure 4Bis} = \frac{\#(\text{As SUF}) - \left[\#(\text{As SUF}) - \#(\text{As PUF})\right]}{\#(\text{As SUF})} \]

This measure can be used when applying local suppression, which implies that #(As SUF) \(\geq\) #(As PUF). The measure assumes that, for variable X, the best outcome for a PUF is to have as many non-missing values as the SUF, i.e. #(As SUF) = #(As PUF), in which case Measure 4Bis = 1. The worst case is #(As PUF) = 0, in which case Measure 4Bis = 0. Measure 4Bis is therefore bounded between 0 and 1. This measure cannot be used for synthetic data.

When using local suppression, there is also a simple measure of the loss of information: the number of locally suppressed values for one variable.

Definition 6. Let n denote the sample size. For variable X, the score for local suppression is given by:

\[ \text{Measure 4Ter} = \frac{n - \text{Number of locally suppressed values}}{n} \]

Definition 7. The similarity index for a variable X is given by:

\[ \text{Measure 5} = \frac{\#(\text{Categories of } X \text{ PUF})}{\#(\text{Categories of } X \text{ SUF})} \]

This measure can be used when applying global recoding.
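The variable-related measures of Definitions 4 to 7 can be sketched in Python as follows. This is only an illustration under stated assumptions: Measure 4 is read as an indicator when the SUF has no missing values and otherwise as the ratio of the smaller to the larger count of missing values, and all function names and toy values are hypothetical.

```python
def measure_4(na_puf, na_suf):
    """Score for missing values (assumed piecewise reading of Definition 4)."""
    if na_suf == 0:
        # Indicator: 1 if the PUF has no missing values either, else 0.
        return 1.0 if na_puf == 0 else 0.0
    if na_puf >= na_suf:
        return na_suf / na_puf
    return na_puf / na_suf


def measure_4bis(a_puf, a_suf):
    """Score for non-missing values; requires #(As SUF) > 0."""
    if a_suf <= 0:
        raise ValueError("the SUF must contain at least one non-missing value")
    return (a_suf - (a_suf - a_puf)) / a_suf


def measure_4ter(n, n_suppressed):
    """Score for local suppression: share of values that were not suppressed."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return (n - n_suppressed) / n


def measure_5(values_puf, values_suf):
    """Similarity index: distinct categories in the PUF over those in the SUF."""
    cats_suf = {v for v in values_suf if v is not None}
    cats_puf = {v for v in values_puf if v is not None}
    if not cats_suf:
        raise ValueError("the SUF variable has no categories")
    return len(cats_puf) / len(cats_suf)


# Toy example (hypothetical): age recoded from 5-year to 10-year bands.
age_suf = ["15-19", "20-24", "25-29", "30-34"]
age_puf = ["15-24", "15-24", "25-34", "25-34"]

print(measure_4(10, 5))             # PUF has twice as many NAs as the SUF
print(measure_4bis(80, 100))        # 20 of 100 values were suppressed
print(measure_4ter(200, 10))        # 10 of 200 values were suppressed
print(measure_5(age_puf, age_suf))  # 2 categories left out of 4
```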

3 Does the PUF give results similar to the SUF?

This section presents measures to check whether the PUF and the SUF give similar results.

3.1 Basic measures

These measures are not data-related and can be used for both SILC and LFS datasets. We propose to use a kind of relative deviation for some major indicators. Unlike the previous measures, it is unbounded and is really a non-utility measure: the larger the deviation, the lower the utility.

Definition 8. Kind of relative deviation. The relative deviation for one indicator is given by:

\[ \text{Measure 6} = \frac{\text{Value}(\text{Indicator PUF}) - \text{Value}(\text{Indicator SUF})}{\text{Value}(\text{Indicator SUF})} \]

Suggested basic indicators (weights are taken into account):

- Size of population
- Number of individuals in the sample
- Number of households in the sample
- Distribution of individuals by:
  - Gender
  - Age group
  - Education
  - Household size

3.2 Data-related measures

We should also provide some data-related measures. We suggest using the kind of relative deviation given in Definition 8 for the following indicators.

EU-SILC data:

- At-risk-of-poverty rate
- At-risk-of-poverty threshold
- Gini coefficient
- Income quintile share ratio = P80(Income) / P20(Income)
- Income decile share ratio = D9(Income) / D1(Income)
- Relative median at-risk-of-poverty gap
- Capacity to meet unexpected financial expenses (HS060)

LFS data:

- (Un)employment rate by gender (for 15-74 years old)
- (Un)employment rate by age group (15-24, 25-54, 55-74)
- (Un)employment rate by education (for 15-74 years old)
- Labour force (number of employed + number of unemployed) by gender (for 15-74 years old)
- Number of employed persons
- Number of self-employed
- Number of part-time employed persons
- Number of inactive persons
- Average actual hours of work per week for employed persons

We may also use an error measure for continuous variables. The following measure reflects whether the (univariate) distribution of a continuous variable is close in the public use file and the scientific use file.

Definition 9. For one record i of the PUF, let \(X_{p,PUF}\) denote the value taken by percentile p of a continuous variable X, and let \(X_{p,SUF}\) denote the corresponding percentile in the SUF. The error measure for individual i is given by:

\[ \text{Measure 7}_i = \frac{1}{100} \sum_{p=1}^{100} \frac{\left| X_{p,PUF} - X_{p,SUF} \right|}{X_{p,SUF}} \]

It is possible to compute a mean indicator for the whole public use file. Let n denote the number of individuals in the public use file:

\[ \text{Measure 7} = \frac{1}{n} \sum_{i=1}^{n} \text{Measure 7}_i \]

The more the distribution of variable X differs in the PUF, the higher Measure 7 is. It is not necessary to compute this utility measure for LFS public use files because continuous variables are not perturbed. For EU-SILC data, we suggest using Measure 7 for the following indicator:

- Equivalised disposable income
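The relative deviation of Definition 8 and the percentile-based error of Definition 9 can be illustrated with a short Python sketch. Several choices here are assumptions rather than the deliverable's specification: the percentile error is computed once per dataset as the mean over p = 1..100 of the absolute relative difference, percentiles use linear interpolation between order statistics, survey weights are ignored, and the variable is assumed strictly positive. All names and values are hypothetical.

```python
def relative_deviation(value_puf, value_suf):
    """Measure 6: signed relative deviation of a PUF indicator from the SUF."""
    if value_suf == 0:
        raise ValueError("the SUF indicator must be non-zero")
    return (value_puf - value_suf) / value_suf


def percentile(sorted_values, p):
    """p-th percentile (0 <= p <= 100), linear interpolation, unweighted."""
    if not sorted_values:
        raise ValueError("empty data")
    pos = (len(sorted_values) - 1) * p / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def percentile_error(x_puf, x_suf):
    """Mean over p = 1..100 of |Q_p(PUF) - Q_p(SUF)| / |Q_p(SUF)|.

    Assumes every SUF percentile is non-zero (e.g. positive incomes).
    """
    puf_sorted, suf_sorted = sorted(x_puf), sorted(x_suf)
    total = 0.0
    for p in range(1, 101):
        q_suf = percentile(suf_sorted, p)
        q_puf = percentile(puf_sorted, p)
        total += abs(q_puf - q_suf) / abs(q_suf)
    return total / 100


# Toy incomes (hypothetical): the PUF inflates every income by 10 percent.
income_suf = [12000.0, 15000.0, 18000.0, 22000.0, 30000.0]
income_puf = [1.1 * x for x in income_suf]

dev = relative_deviation(0.176, 0.160)   # e.g. a poverty rate of 17.6% vs 16.0%
err = percentile_error(income_puf, income_suf)
print(dev, err)
```

Taking the absolute value of Measure 6 would turn it into a pure distance; the signed version keeps the direction of the bias visible.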

3.3 Model-based measures

Finally, we suggest considering a few models for both datasets and computing a utility indicator for their estimated parameters. We suggest taking as the utility measure the confidence interval overlap proposed by Jörg Drechsler.

Definition 10. Confidence interval overlap. Let \([L_{SUF}, U_{SUF}]\) be the 95% confidence interval for a given parameter in the SUF, and let \([L_{PUF}, U_{PUF}]\) be the corresponding interval in the PUF. We denote the intersection of the two intervals by:

\[ [L_{INTER}, U_{INTER}] = [L_{SUF}, U_{SUF}] \cap [L_{PUF}, U_{PUF}] \]

The utility measure is given by:

\[ \text{Measure 8} = \frac{1}{2} \left( \frac{U_{INTER} - L_{INTER}}{U_{SUF} - L_{SUF}} + \frac{U_{INTER} - L_{INTER}}{U_{PUF} - L_{PUF}} \right) \]

When the intervals are identical in the PUF and the SUF, Measure 8 = 1. When the intervals do not overlap at all, Measure 8 = 0. The second term in the sum is included to avoid giving the maximum utility score when \([L_{PUF}, U_{PUF}] \supset [L_{SUF}, U_{SUF}]\). This correction is particularly important if global suppression is applied (confidence intervals mechanically become wider in the PUF).

For the models, we use normalized weights. We will use the confidence interval overlap with the following models:

EU-SILC data:
  log(equivalised disposable income) ~ age + gender + education + citizenship + hsize
OR (logistic regression):
  1_{is at risk of poverty} ~ age + gender + education + citizenship + hsize

LFS data (logistic regression) [1]:
  Men: 1_{is employed} ~ age + education + citizenship + hsize
  Women: 1_{is employed} ~ age + education + citizenship + hsize

[1] We consider one model for men and another for women because the estimates are expected to be very different.
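A minimal Python sketch of the confidence interval overlap of Definition 10 is given below. It assumes that each interval is passed as a (lower, upper) pair with lower < upper; the function name and the example intervals are hypothetical.

```python
def ci_overlap(ci_suf, ci_puf):
    """Measure 8: average share of each interval covered by their intersection."""
    l_suf, u_suf = ci_suf
    l_puf, u_puf = ci_puf
    if not (l_suf < u_suf and l_puf < u_puf):
        raise ValueError("intervals must have positive width")

    # Intersection of the two intervals.
    l_int = max(l_suf, l_puf)
    u_int = min(u_suf, u_puf)
    if u_int <= l_int:
        return 0.0  # the intervals do not overlap

    width = u_int - l_int
    return 0.5 * (width / (u_suf - l_suf) + width / (u_puf - l_puf))


identical = ci_overlap((0.2, 0.4), (0.2, 0.4))   # same interval in both files
disjoint = ci_overlap((0.2, 0.4), (0.5, 0.7))    # no overlap
nested = ci_overlap((1.0, 2.0), (0.5, 2.5))      # PUF interval contains the SUF one
print(identical, disjoint, nested)
```

In the nested case the first term equals 1 but the second term is only 0.5, so the score stays below the maximum, which is the purpose of the second term.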

Part II Disclosure risk analysis

In this part, we provide some measures to estimate the residual disclosure risk for LFS data (traditional approach). For EU-SILC data, we intend to argue that the disclosure risk is null (or sufficiently low) in synthetic datasets. We provide some comments and references on this point in this report.

4 LFS data

The idea behind the LFS public use files is the application of k-anonymity models. However, the number of key variables is too high to deal with all of them, and two main approaches are considered:

- k-anonymity for a restricted subset of identifying variables.
- The all-m approach, which considers all identifying variables but only tables up to dimension m.

We decide to use complete k-anonymity (considering all identifying variables) to estimate the residual disclosure risk in LFS public use files. k should be set to a very low value. If we consider as risky only sample uniques (according to the combination of all identifying variables), we should take k = 2. We will use the following measure:

Definition 11. The residual disclosure risk measure in a public use file with n records is given by:

\[ \text{Measure 9} = \frac{\#(\text{Records that do not fulfill } k\text{-anonymity})}{n} \]

For one method (k-anonymity for a subset of identifying variables), the other identifying variables, which are not used in the k-anonymity models, are perturbed (PRAM perturbation). We can slightly adjust the measure and compute the following estimate:

Definition 12. The residual disclosure risk measure in a perturbed public use file with n records is given by:

\[ \text{Measure 9Bis} = \frac{\#\left(\text{Records that } \left\{ \begin{array}{l} \text{are not PRAMmed} \\ \text{do not fulfill } k\text{-anonymity} \end{array} \right. \right)}{n} \]

We can also compute the difference Measure 9Bis (computed after perturbation) - Measure 9 (computed before perturbation) in order to estimate the amount of perturbation introduced in the public use file and the resulting reduction in disclosure risk.
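The two risk measures of Definitions 11 and 12 can be sketched in Python as follows. This is an illustrative reading only: records are plain dicts, a record violates k-anonymity when fewer than k records share its combination of key variables, and the boolean field "prammed" marking perturbed records is a hypothetical name introduced for this example.

```python
from collections import Counter


def k_anonymity_violations(records, key_vars, k=2):
    """Return one boolean per record: True if it does not fulfill k-anonymity."""
    keys = [tuple(rec[var] for var in key_vars) for rec in records]
    counts = Counter(keys)
    return [counts[key] < k for key in keys]


def measure_9(records, key_vars, k=2):
    """Share of records that do not fulfill k-anonymity."""
    if not records:
        raise ValueError("empty file")
    return sum(k_anonymity_violations(records, key_vars, k)) / len(records)


def measure_9bis(records, key_vars, k=2, pram_flag="prammed"):
    """Share of records that violate k-anonymity AND were not PRAMmed."""
    if not records:
        raise ValueError("empty file")
    violations = k_anonymity_violations(records, key_vars, k)
    risky = sum(
        1 for rec, violates in zip(records, violations)
        if violates and not rec[pram_flag]
    )
    return risky / len(records)


# Toy file (hypothetical): records 3 and 4 are sample uniques on the key variables.
key_vars = ("sex", "age_group", "region")
records = [
    {"sex": "F", "age_group": "25-54", "region": "A", "prammed": False},
    {"sex": "F", "age_group": "25-54", "region": "A", "prammed": False},
    {"sex": "M", "age_group": "15-24", "region": "B", "prammed": True},
    {"sex": "M", "age_group": "55-74", "region": "A", "prammed": False},
    {"sex": "F", "age_group": "25-54", "region": "A", "prammed": True},
]

m9 = measure_9(records, key_vars, k=2)
m9bis = measure_9bis(records, key_vars, k=2)
print(m9, m9bis)
```

With k = 2 the violating records are exactly the sample uniques, so two of the five records count towards Measure 9 and only the unperturbed one counts towards Measure 9Bis.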

5 EU-SILC synthetic data

In the synthetic data approach, the dataset is fully synthetic. All variables are simulated based on distributions estimated from the original data. Templ and Alfons (2010) give a general discussion of disclosure risk for fully synthetic population data, with an application to EU-SILC data as simulated in the AMELI project. In that paper, five disclosure scenarios are considered. The general conclusion is that, even for a very knowledgeable intruder (one who has information on the data generation process that produced the synthetic data), the disclosure risk is very low. Moreover, even if the intruder is able to identify an individual, the probability that the derived information is close to the original value is extremely low.

In the synthetic datasets produced in this project, we have paid special attention to the possibility that very big households with a (close to) unique structure could be identified. For example, a large household that occurs multiple times in the PUF, but always with the same structure (age and gender distribution), is likely to be a sample unique. However, we have noticed that the other simulated variables (e.g. income variables) differ from the true values because the simulation is not based only on household size: the only variable used as a stratum for the simulation of other variables is the region.

To reduce the disclosure risk of unique households, one might consider removing those households from the PUF. This will lead to a bias in estimates based on that PUF, but considering the intended use of the PUF this does not appear to be a big problem.

References

Alfons A., Kraft S., Templ M. & Filzmoser P. (2011). Simulation of close-to-reality population data for household surveys with application to EU-SILC. Statistical Methods & Applications, 20, 383-407.

Bujnowska A. (2015). Implementation of the EU regulation on access to European microdata, European Data Access Forum, available at: http://dwbproject.org/events/edaf2.html

Drechsler J. & Reiter J.P. (2009). Disclosure Risk and Data Utility for Partially Synthetic Data: An Empirical Study Using the German IAB Establishment Survey, Journal of Official Statistics, 25, 589-603, available at: https://stat.duke.edu/ jerry/papers/jos09.pdf