Data utility metrics and disclosure risk analysis for public use files

Specific Grant Agreement "Production of Public Use Files for European microdata"
Work Package 3 - Deliverable D3.1
October 2015

This document presents data utility and disclosure risk metrics to estimate the loss of information and the residual disclosure risk after generation of a public use file. Two methods will be tested: a traditional approach (a mix of global recoding, local suppression and PRAM perturbation) and the generation of synthetic data. Two European datasets are considered: the EU-SILC (synthetic data generation) and EU-LFS (traditional approach) panel surveys.

Part I
Data utility metrics

1 General comments

To estimate data utility, we will compare the proposed public use files (PUFs) with the associated scientific use files (SUFs), given that we will create the PUFs as if they were derived from the SUFs (even if the generation model for synthetic data is computed on the original microdata).

To estimate data utility for synthetic data, a sound approach would be to:
- generate a large number of synthetic datasets;
- compute the data utility estimates for each dataset;
- report, as a robust measure, the mean of the measures computed over all synthesized datasets.

We will not use this method because it is too time-consuming. Risk and utility analysis is performed on the final public use files of all member states. The proposed utility metrics make no distinction between cross-sectional and longitudinal data because we will restrict ourselves to cross-sectional data.
The parameters of the anonymization process should be published in order to give users an idea of the risk/utility trade-off. For instance, the matrices used for PRAM perturbation should be published.

One goal is to provide bounded measures (between 0 and 1) where 1 stands for good utility. When possible we try to give bounded metrics, but this is not always the case. The utility and risk analyses provided in this project are not meant to be interpreted in absolute terms. However, comparing the metrics obtained for different public use files (produced with different parameters in the anonymization process) makes it possible to compare anonymization methods. Complete results of the risk/utility analysis for the public use files produced in this project will be given in the final deliverables (D2.4 and D3.2).

2 Does the PUF look like the SUF?

In this section some simple metrics are proposed to check whether the public use file has a structure similar to that of the SUF. In the rest of the report, the notation # is used for "Number of".

2.1 Basic structure

This subsection provides very basic measures to check whether a given PUF looks like its associated SUF, i.e. simple similarity measures between a public use file and its associated scientific use file.

Definition 1. The rate of produced public use files is given by:

Measure 1 = #(PUFs produced for a year of data collection) / #(SUFs produced for a year of data collection)

This measure is used to check whether all datasets are provided (yearly and quarterly data / individual and household data).

Definition 2. The rate of variables in a PUF is given by:

Measure 2 = #(real variables in the PUF) / #(real variables in the SUF)

This measure can be estimated for each public use file. A variable is not considered "real" if it contains only missing values. The same kind of measure can be used to check the rate of records.
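As a minimal sketch of how such a structural rate could be computed, assuming the PUF and SUF are available as pandas DataFrames (the variable names below are illustrative only, not those of the actual files):

```python
import pandas as pd

def n_real_variables(df: pd.DataFrame) -> int:
    # A variable counts as "real" only if it has at least one non-missing value.
    return int(df.notna().any(axis=0).sum())

def measure_2(puf: pd.DataFrame, suf: pd.DataFrame) -> float:
    # Rate of real variables in the PUF relative to the SUF (Definition 2).
    return n_real_variables(puf) / n_real_variables(suf)

# Illustrative toy data: the PUF keeps 2 of the SUF's 3 real variables
# ("income" was fully blanked during anonymisation).
suf = pd.DataFrame({"age": [34, 51], "sex": [1, 2], "income": [20000.0, 31000.0]})
puf = pd.DataFrame({"age": [34, 51], "sex": [1, 2], "income": [None, None]})
rate = measure_2(puf, suf)  # 2/3
```

The same pattern (a count in the PUF divided by the corresponding count in the SUF) applies to the record-based rate discussed next.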
Definition 3. The rate of records in a PUF is given by:

Measure 3 = #(records in the PUF) / #(records in the SUF)

This measure is useful if we apply global suppression (e.g. for big households) or data simulation. If household and individual data are separated, Measure 3 must be computed for both datasets.

2.2 Variable-related structure

We give in this subsection some variable-related measures.

Definition 4. The score for missing values for a variable X (the notation NA stands for "missing value") is given by:

Measure 4 = 1, if #(NAs PUF) = 0 and #(NAs SUF) = 0
Measure 4 = #(NAs SUF) / #(NAs PUF), if #(NAs PUF) >= #(NAs SUF) and #(NAs PUF) > 0
Measure 4 = #(NAs PUF) / #(NAs SUF), if #(NAs PUF) < #(NAs SUF)

This measure can be used with local suppression or when generating synthetic data. An alternative way to estimate the impact of local suppression is the following measure.

Definition 5. The notation A stands for "non-missing value". For a variable X, let us assume that #(As SUF) > 0. The score for non-missing values for a variable X is given by:

Measure 4Bis = [#(As SUF) - (#(As SUF) - #(As PUF))] / #(As SUF)

This measure can be used with local suppression, which implies that #(As SUF) >= #(As PUF). The measure assumes that, for variable X, the best outcome for a PUF is to have as many non-missing values as the SUF, i.e. #(As PUF) = #(As SUF), in which case Measure 4Bis = 1. The worst case is #(As PUF) = 0, giving Measure 4Bis = 0. Measure 4Bis is thus bounded between 0 and 1. This measure cannot be used for synthetic data.

When using local suppression there is also a very simple way to estimate the loss of information: the number of locally suppressed values for a variable.

Definition 6. Let n denote the sample size. For a variable X, the score for local suppression is given by:

Measure 4Ter = (n - number of locally suppressed values) / n

Definition 7.
The similarity index for a variable X is given by:

Measure 5 = #(categories of X in the PUF) / #(categories of X in the SUF)

This measure can be used when applying global recoding.
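The variable-related scores above are simple ratios and can be sketched directly. The fragment below is an illustration, assuming the relevant counts and pandas Series are at hand (function and variable names are ours):

```python
import pandas as pd

def measure_4(na_puf: int, na_suf: int) -> float:
    # Score for missing values (Definition 4), bounded between 0 and 1.
    if na_puf == 0 and na_suf == 0:
        return 1.0
    if na_puf >= na_suf:
        return na_suf / na_puf
    return na_puf / na_suf

def measure_4ter(n_suppressed: int, n: int) -> float:
    # Score for local suppression (Definition 6).
    return (n - n_suppressed) / n

def measure_5(x_puf: pd.Series, x_suf: pd.Series) -> float:
    # Similarity index (Definition 7): ratio of the numbers of observed categories.
    return x_puf.nunique(dropna=True) / x_suf.nunique(dropna=True)

# Illustrative example for Measure 5: global recoding collapsed 4 age bands into 2.
x_suf = pd.Series(["0-24", "25-49", "50-74", "75+"])
x_puf = pd.Series(["0-49", "0-49", "50+", "50+"])
sim = measure_5(x_puf, x_suf)  # 2 categories / 4 categories = 0.5
```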
3 Does the PUF give results similar to the SUF?

In this section some measures to check for similar results between the PUF and the SUF are presented.

3.1 Basic measures

These measures are not data-related and can be used for both SILC and LFS datasets. We propose to use a kind of relative deviation for some major indicators. This measure is unbounded and, contrary to the previous ones, is more a non-utility measure.

Definition 8 (relative deviation). The relative deviation for one indicator is given by:

Measure 6 = [Value(indicator in the PUF) - Value(indicator in the SUF)] / Value(indicator in the SUF)

Basic indicators suggested (weights are taken into account):
- size of population;
- number of individuals in the sample;
- number of households in the sample;
- distribution of individuals by gender, age group, education and household size.

3.2 Data-related measures

We should also provide some data-related measures. We suggest using the relative deviation of Definition 8 for the following indicators.

EU-SILC data:
- at-risk-of-poverty rate;
- at-risk-of-poverty threshold;
- Gini coefficient;
- income quintile share ratio = P80(Income) / P20(Income);
- income decile share ratio = D9(Income) / D1(Income);
- relative median at-risk-of-poverty gap;
- capacity to meet unexpected financial expenses (HS060).

LFS data:
- (un)employment rate by gender (for 15-74 year-olds);
- (un)employment rate by age group (15-24, 25-54, 55-74);
- (un)employment rate by education (for 15-74 year-olds);
- labour force (number of employed + number of unemployed) by gender (for 15-74 year-olds);
- number of employed persons;
- number of self-employed persons;
- number of part-time employed persons;
- number of inactive persons;
- average actual hours of work per week for employed persons.

We may also use an error measure for continuous variables. We suggest the following measure, which reflects whether the (univariate) distribution of a continuous variable is close between the public use file and the scientific use file.

Definition 9. For one record i of the PUF, let X_{p,PUF} denote the value taken by percentile p of a continuous variable X, and let X_{p,SUF} denote the corresponding percentile in the SUF. The error measure for individual i is given by:

Measure 7_i = (1/100) * sum_{p=1}^{100} |X_{p,PUF} - X_{p,SUF}| / X_{p,SUF}

It is possible to compute a mean indicator for the whole public use file. Let n denote the number of individuals in the public use file:

Measure 7 = (1/n) * sum_{i=1}^{n} Measure 7_i

The more the distribution of variable X in the PUF differs from that in the SUF, the higher Measure 7 is. It is not necessary to compute this utility measure for LFS public use files, given that continuous variables are not perturbed there. We suggest using Measure 7 for the following indicator:

EU-SILC data: equivalised disposable income.
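Both the relative deviation and the percentile-based error measure can be sketched in a few lines. The sketch below uses NumPy and assumes the continuous variable is strictly positive (as equivalised disposable income normally is), since percentile values appear in the denominator; note that because the percentiles do not depend on the individual record, the per-record error is the same for every i, so the file-level mean equals the per-record value:

```python
import numpy as np

def measure_6(indicator_puf: float, indicator_suf: float) -> float:
    # Relative deviation of an indicator (Definition 8); unbounded, 0 is best.
    return (indicator_puf - indicator_suf) / indicator_suf

def measure_7(x_puf: np.ndarray, x_suf: np.ndarray) -> float:
    # Mean relative deviation over the 100 percentiles of X (Definition 9).
    p = np.arange(1, 101)
    q_puf = np.percentile(x_puf, p)
    q_suf = np.percentile(x_suf, p)
    return float(np.mean(np.abs(q_puf - q_suf) / q_suf))

# Illustrative weighted population sizes (sums of survey weights):
dev = measure_6(41_950.0, 42_000.0)  # small negative deviation
```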
3.3 Model-based measures

Finally, we suggest considering, for both datasets, a few models and computing a utility indicator for their estimated parameters. The suggestion is to take as utility measure the confidence interval overlap proposed by Jörg Drechsler.

Definition 10 (confidence interval overlap). Let [L_SUF, U_SUF] be the 95% confidence interval for a given parameter in the SUF, and [L_PUF, U_PUF] the corresponding interval in the PUF. We denote the intersection of the two intervals by:

[L_INTER, U_INTER] = [L_SUF, U_SUF] ∩ [L_PUF, U_PUF]

The utility measure is given by:

Measure 8 = (1/2) * [ (U_INTER - L_INTER) / (U_SUF - L_SUF) + (U_INTER - L_INTER) / (U_PUF - L_PUF) ]

When the intervals in the PUF and SUF are identical, Measure 8 = 1. When the intervals do not overlap at all, Measure 8 = 0. The second term in the sum is included to avoid giving the maximum utility score when [L_PUF, U_PUF] ⊃ [L_SUF, U_SUF]. This correction is particularly important if global suppression is applied (the size of the confidence intervals will mechanically increase in the PUF).

For the models we will consider normalized weights. We will use the confidence interval overlap with the following models:

EU-SILC data:
log(equivalised disposable income) ~ age + gender + education + citizenship + hsize
or (logistic regression):
1_{is at risk of poverty} ~ age + gender + education + citizenship + hsize

LFS data (logistic regression) (1):
Men: 1_{is employed} ~ age + education + citizenship + hsize
Women: 1_{is employed} ~ age + education + citizenship + hsize

(1) We consider one model for men and another one for women because the estimates may be very different.
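A minimal sketch of the confidence interval overlap, assuming the interval bounds have already been estimated (e.g. from the regression models above):

```python
def measure_8(l_suf: float, u_suf: float, l_puf: float, u_puf: float) -> float:
    # Confidence interval overlap (Definition 10), bounded between 0 and 1.
    l_inter = max(l_suf, l_puf)
    u_inter = min(u_suf, u_puf)
    if u_inter <= l_inter:  # the intervals do not overlap at all
        return 0.0
    width = u_inter - l_inter
    # Averaging over both interval lengths penalises a PUF interval that is
    # much wider than (but still contains) the SUF interval.
    return 0.5 * (width / (u_suf - l_suf) + width / (u_puf - l_puf))
```

For example, identical intervals give 1.0, disjoint intervals give 0.0, and a PUF interval [-1, 2] around a SUF interval [0, 1] is penalised by the second term despite full containment.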
Part II
Disclosure risk analysis

In this part we provide some measures to estimate the residual disclosure risk for LFS data (traditional approach). For EU-SILC data it is planned to argue that disclosure risk is null (or sufficiently low) in synthetic datasets; we provide some comments and references on this point in this report.

4 LFS data

The idea behind LFS public use files is the application of k-anonymity models. However, the number of key variables is too high to deal with all of them at once, and two main approaches are considered:
- k-anonymity for a restricted subset of identifying variables;
- an all-m approach that considers all identifying variables but only tables up to dimension m.

We decide to use complete k-anonymity (considering all identifying variables) to estimate the residual disclosure risk in LFS public use files. k should be set to a very low value: if we consider as risky only sample uniques (according to the combination of all identifying variables), we should take k = 2. We will use the following measure.

Definition 11. The residual disclosure risk measure in a public use file with n records is given by:

Measure 9 = #(records that do not fulfill k-anonymity) / n

For one method (k-anonymity for a subset of identifying variables), the identifying variables not used in the k-anonymity model will be perturbed (PRAM perturbation). We can slightly adjust the measure and compute the following estimate.

Definition 12. The residual disclosure risk measure in a perturbed public use file with n records is given by:

Measure 9Bis = #(records that do not fulfill k-anonymity and are not PRAMmed) / n

We can also compute the difference between Measure 9 (computed before perturbation) and Measure 9Bis (computed after perturbation) in order to estimate the amount of perturbation introduced in the public use file and the resulting reduction in disclosure risk.
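Measure 9 can be sketched as a simple group-by over the identifying variables. The sketch below assumes the file is a pandas DataFrame; the key variable names are illustrative, not the actual LFS identifiers:

```python
import pandas as pd

def measure_9(df: pd.DataFrame, keys: list, k: int = 2) -> float:
    # Share of records whose combination of key variables occurs fewer than
    # k times in the file, i.e. records violating k-anonymity (Definition 11).
    sizes = df.groupby(keys, dropna=False)[keys[0]].transform("size")
    return float((sizes < k).mean())

# Illustrative toy file: one record is a sample unique on (sex, region).
toy = pd.DataFrame({"sex": [1, 1, 2, 2, 2],
                    "region": ["A", "A", "A", "B", "B"]})
risk = measure_9(toy, ["sex", "region"], k=2)  # 1 unique out of 5 -> 0.2
```

Measure 9Bis follows the same pattern, restricted to records whose key variables were not PRAMmed.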
5 EU-SILC synthetic data

In the synthetic data approach, the dataset is fully synthetic: all variables are simulated based on distributions estimated on the original data. Templ and Alfons (2010) give a general discussion of disclosure risk in the case of fully synthetic population data, with an application to EU-SILC data as simulated in the AMELI project. In that paper, five disclosure scenarios are considered. The general conclusion is that even in the case of a very knowledgeable intruder (one who has information on the data generation process that produced the synthetic data), disclosure risk is very low. Moreover, even if the intruder is able to identify an individual, the probability that the derived information is close to the original value is extremely low.

In the synthetic datasets we have produced in this project, we have paid special attention to very big households, whose (close to) unique structure could allow them to be identified. For example, a large household that occurs multiple times in the PUF but always with the same structure (age and gender distribution) is likely to be a sample unique. However, we have noticed that the other simulated variables (e.g. income variables) differ from the true values because the simulation is not only based on household size: the only variable used as a stratum for the simulation of the other variables is the region. To reduce the disclosure risk of unique households, one might consider removing those households from the PUF. This introduces a bias in estimates based on that PUF, but given the intended use of this PUF this does not appear to be a big problem.

References

Alfons A., Kraft S., Templ M. & Filzmoser P. (2011). Simulation of close-to-reality population data for household surveys with application to EU-SILC. Statistical Methods & Applications, 20, 383-407.

Bujnowska A. (2015).
Implementation of the EU regulation on access to European microdata, European Data Access Forum, available at: http://dwbproject.org/events/edaf2.html

Drechsler J. & Reiter J.P. (2009). Disclosure Risk and Data Utility for Partially Synthetic Data: An Empirical Study Using the German IAB Establishment Survey. Journal of Official Statistics, 25, 589-603, available at: https://stat.duke.edu/~jerry/papers/jos09.pdf