Estimating an Earnings Function from Coarsened Data by an Interval Censored Regression Procedure


Reza C. Daniels
School of Economics, University of Cape Town
rdaniels@commerce.uct.ac.za

Sandrine Rospabé
Faculté des Sciences Economique, Université de Rennes I (France)
sandrine.rospabe@univ-rennes1.fr

Development Policy Research Unit
February 2005
Working Paper 05/91
ISBN 1-920055-05-3

Abstract

This paper estimates an earnings function where the dependent variable is a mix of point and interval data, using an interval regression model based on a pseudo-maximum likelihood estimation procedure. The analysis uses the 1999 October Household Survey (OHS) and takes into account point and interval income observations, as well as design features of the survey including stratification, clustering and weights. In developing and applying the methodology, it is shown that researchers interested in analysing the determinants of income in a meaningful way need not be hampered by the presence of both point and interval observations, and can in fact account for these simultaneously using a generalised Tobit model. Incorporating survey design features into the analysis of the variance required some changes to the estimation procedure, and this is where the pseudo-likelihood becomes useful. However, this then affects how the coefficients of the model are interpreted, and researchers are encouraged to focus attention on the confluence of these factors.

JEL Classification: C42, C51

Key words: Generalised Tobit Model, Pseudo Maximum Likelihood Estimation, Complex Survey Data

Acknowledgements

The authors would like to thank participants at the 6th Annual Conference of the African Econometric Society at the University of Pretoria for their useful comments. We would also like to thank the DPRU for their financial assistance in the publication of this paper.

Development Policy Research Unit. Tel: +27 21 650 5705. Fax: +27 21 650 5711. Information about our Working Papers and other published titles is available on our website at: http://www.commerce.uct.ac.za/dpru/

Table of Contents

1. Introduction
2. Methodology
   2.1 Estimation of a Generalised Tobit Model for Interval Regression
   2.2 Pseudo-Likelihood Estimation (PML) of Parameters
   2.3 A Note on the Weights
3. Data and Variables
   3.1 Dependent Variable
   3.2 Independent Variables
4. Results and Discussion
   4.1 Descriptive Statistics
   4.2 Regression Results
   4.3 The Influence of Survey-Design
5. Conclusion
6. References

1. Introduction

In this paper we discuss an approach to estimating earnings functions from complex survey data using both point and interval observations simultaneously. Typically, survey questions that ask respondents to provide information on income, expenditure, assets and liabilities are subject to high levels of item missing data as well as to potential measurement error if point observations are required for these variables. As a consequence, Statistics South Africa provides respondents to its household surveys with two options for the income question, namely actual (point) income and interval income categories (e.g. R10,000-R15,000). The resulting distribution of the income variable contains a mixture of actual value responses, interval censored responses, and missing data. Heitjan and Rubin (1990, 1991) call this mixture of data types coarsened data, and the phrase has become more widely used within the survey statistics literature (see also Heeringa et al, 2002).

Having both point and interval income observations makes estimating earnings functions more complex. Our innovation in this paper is to use a generalised Tobit model for this procedure. A further dimension of complexity is added to this task when survey sampling design features are considered, including stratification, clustering and weights (see Kish, 1965; Lehtonen & Pahkinen, 1995), where conventional maximum likelihood estimation is no longer possible and pseudo-likelihood estimation must be used.

Thus, two analytical questions are addressed here. The first is how to estimate an earnings function using both point and interval observations. The second is how to account for survey design in the estimation method. As will become evident, both of these questions must be addressed in order to obtain accurate coefficients and correct estimates of their precision. The analysis is conducted on the 1999 October Household Survey (OHS) (Statistics South Africa, 1999).

The analysis below proceeds as follows. Firstly, the methodology is discussed. In this section, the model is presented as well as the estimation procedure given the features of complex surveys. Thereafter, the data and variables are described. Section 4 displays the empirical outcomes, where descriptive statistics for the regressed covariates are first provided before the results of the earnings function are discussed. Lastly, the conclusion summarises.

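2. Methodology

2.1 Estimation of a Generalised Tobit Model for Interval Regression

Since income is a censored distribution in this case, the appropriate foundation from which to develop the estimation procedure is a censored regression or Tobit model, where the latent variable $y^*$ is modelled by

\[ y^* = \theta' x + \varepsilon . \]

Here,

\[ y = 0 \ \text{if } y^* \le 0; \qquad y = y^* \ \text{if } y^* > 0; \qquad \varepsilon \sim N(0, \sigma^2 I). \]

Greene (2000: 911) provides the standard log-likelihood for the censored regression model:

\[ \log L = \sum_{y_i > 0} -\frac{1}{2} \left[ \log(2\pi) + \log \sigma^2 + \frac{(y_i - \theta' x_i)^2}{\sigma^2} \right] + \sum_{y_i = 0} \log \left[ 1 - \Phi\left( \frac{\theta' x_i}{\sigma} \right) \right] \]

To generalise the model and adapt the estimation procedure in order to account for the mixture of point, interval and missing observations needed for an accurate treatment of the income variable, we follow the procedure given below. As before, let $y^* = \theta' x + \varepsilon$ be the model. We denote $y$, the observed dependent variable, as:

\[ y = y_L \ \text{if } y^* \le y_L; \qquad y = y_R \ \text{if } y^* \ge y_R; \qquad y = y^* \ \text{if } y_L < y^* < y_R; \qquad y \in [y_1, y_2] \ \text{if } y_1 \le y^* \le y_2 . \]

Given this, the weighted log likelihood for the interval regression procedure is given by the following (adapted from StataCorp, 2003a: 262):

\[ \log L = -\frac{1}{2} \sum_{i \in C} w_i \left[ \left( \frac{y_i - \theta' x_i}{\sigma} \right)^2 + \log 2\pi\sigma^2 \right] + \sum_{i \in L} w_i \log \Phi\left( \frac{y_{Li} - \theta' x_i}{\sigma} \right) + \sum_{i \in R} w_i \log \left[ 1 - \Phi\left( \frac{y_{Ri} - \theta' x_i}{\sigma} \right) \right] + \sum_{i \in I} w_i \log \left[ \Phi\left( \frac{y_{2i} - \theta' x_i}{\sigma} \right) - \Phi\left( \frac{y_{1i} - \theta' x_i}{\sigma} \right) \right] \tag{1} \]

Here, observations C are point data, L are left censored, R are right censored and observations I are intervals. $\Phi(\cdot)$ is the standard cumulative normal. Thus, regardless of the types of observations, the estimation method is able to account for them simultaneously.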
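The weighted log-likelihood in equation (1) can be sketched in Python as follows. This is an illustrative sketch only, not the authors' implementation (the paper relies on Stata). The function name `interval_loglik` and the convention of coding each observation by a lower and an upper bound (equal bounds for point data, an infinite bound for censored data) are assumptions made for the example.

```python
import numpy as np
from scipy.stats import norm


def interval_loglik(theta, sigma, X, y_lo, y_hi, w=None):
    """Weighted log-likelihood for point, censored and interval data.

    Assumed coding (hypothetical, chosen for this sketch):
      point data      : y_lo == y_hi (both finite)
      left censored   : y_lo == -inf, y_hi finite
      right censored  : y_lo finite,  y_hi == +inf
      interval data   : y_lo < y_hi, both finite
    """
    X = np.asarray(X, dtype=float)
    y_lo = np.asarray(y_lo, dtype=float)
    y_hi = np.asarray(y_hi, dtype=float)
    xb = X @ np.asarray(theta, dtype=float)
    w = np.ones_like(xb) if w is None else np.asarray(w, dtype=float)

    point = np.isfinite(y_lo) & (y_lo == y_hi)
    left = np.isneginf(y_lo) & np.isfinite(y_hi)
    right = np.isfinite(y_lo) & np.isposinf(y_hi)
    interval = np.isfinite(y_lo) & np.isfinite(y_hi) & (y_lo < y_hi)

    ll = np.zeros_like(xb)
    # Point observations: normal log-density.
    z = (y_lo[point] - xb[point]) / sigma
    ll[point] = -0.5 * (z ** 2 + np.log(2 * np.pi * sigma ** 2))
    # Left-censored observations: log Phi((y_L - theta'x) / sigma).
    ll[left] = norm.logcdf((y_hi[left] - xb[left]) / sigma)
    # Right-censored observations: log[1 - Phi((y_R - theta'x) / sigma)].
    ll[right] = norm.logsf((y_lo[right] - xb[right]) / sigma)
    # Interval observations: log of the probability mass between the bounds.
    upper = norm.cdf((y_hi[interval] - xb[interval]) / sigma)
    lower = norm.cdf((y_lo[interval] - xb[interval]) / sigma)
    ll[interval] = np.log(upper - lower)

    return float(np.sum(w * ll))
```

Maximising this function over `theta` and `sigma` (for instance with a general-purpose numerical optimiser) would give the weighted estimator; the design-based standard errors discussed in Section 2.2 require a separate variance calculation.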

However, the estimation of this model is complicated when survey design features are incorporated into the estimation. Essentially, this implies that it is no longer possible to use a standard likelihood function, and a pseudo-likelihood has to be developed instead.

2.2 Pseudo-Likelihood Estimation (PML) of Parameters

The estimation technique for interval regression using the generalised Tobit uses a weighted maximum likelihood estimator. For complex survey data, however, this weighted likelihood is not the distribution function for the sample, since (i) when there is clustering, individual observations are no longer independent and the likelihood does not reflect this, and (ii) when there are sampling weights, the likelihood does not fully account for the randomness of the weighted sample. As it is not a true likelihood, it is termed a pseudo-likelihood. One of the consequences of the pseudo-likelihood is that standard likelihood-ratio (LR) tests are no longer valid, and Wald tests need to be used instead (see Eliason, 1993: 34-35 for a good discussion of other convenient features of Wald tests over LR tests in ML estimation).

Binder (1983) provided a rigorous treatment of how the variance of asymptotically normal estimators should be estimated from complex surveys, and it was this theoretical framework that subsequently became synonymous with PML estimation. It should be noted that the estimation of variance for a complex survey statistic is complicated not only by the nature of the survey's design, but also by the form of the statistic. In the event of a Tobit regression coefficient estimated by PML and incorporating survey design components, the variance formulae take on an added dimension of complexity.

Therefore, while equation (1) is an efficient model to use with an interval-censored dependent variable, it will not yield either the correct coefficients or precise standard errors if it is estimated from complex survey data without taking into account the relevant survey design features.
In order to obtain accurate coefficients, appropriate survey weights must be used. In order to obtain precise standard errors, the effects of stratification, multi-stage sampling and weighting should be incorporated into the coefficient and variance estimates. Since n is large in the OHS99, finite population corrections need not be included. These features of complex surveys are standard in all of Statistics South Africa's household surveys, and their omission constitutes an important, though frequently unrecognised, source of error.

Below we show how the coefficients and their variance are estimated using PML (adapted from StataCorp, 2003b: 39-40). Let $(h, i, j)$ index the elements in the population, where $h = 1, \ldots, H$ are the strata, $i = 1, \ldots, A_h$ are the clusters (or primary sampling units, PSUs) in stratum $h$, and $j = 1, \ldots, B_{hi}$ are the elements in PSU $(h, i)$. Suppose that we observed $(Y_{hij}, X_{hij})$ for the entire population, and that $(Y_{hij}, X_{hij})$ arose from a suitable likelihood model as in equation (1). Let $l(\theta; Y_{hij}, X_{hij})$ be the associated log-likelihood under this model. Then, for a finite population, we define the parameter $\theta$ by the vector estimating equation:

\[ G(\theta) = \sum_{h=1}^{H} \sum_{i=1}^{A_h} \sum_{j=1}^{B_{hi}} S(\theta; Y_{hij}, X_{hij}) = 0 \]

where $S = \partial l / \partial \theta$ is the score vector, i.e. the first derivative of $l(\theta; Y_{hij}, X_{hij})$ with respect to $\theta$. Then, the PML estimator $\hat{\theta}$ is the solution to the weighted sample estimating equation:

\[ \hat{G}(\theta) = \sum_{h=1}^{H} \sum_{i=1}^{A_h} \sum_{j=1}^{B_{hi}} w_{hij} \, S(\theta; y_{hij}, x_{hij}) = 0 \tag{2} \]

For the estimated coefficient $\hat{\theta}$ in (2) above, it is then possible to use a first-order matrix Taylor series expansion to produce the variance estimate:

\[ \hat{V}(\hat{\theta}) = \left[ \frac{\partial \hat{G}(\theta)}{\partial \theta} \right]^{-1} \hat{V}\big( \hat{G}(\theta) \big) \left[ \frac{\partial \hat{G}(\theta)}{\partial \theta} \right]^{-T} \Bigg|_{\theta = \hat{\theta}} = H^{-1} \, \hat{V}\big( \hat{G}(\theta) \big) \, H^{-1} \Big|_{\theta = \hat{\theta}} \tag{3} \]

where $H$ here denotes the Hessian for the weighted sample log-likelihood (not the number of strata).

The use of the Taylor series expansion in equation (3) is one (tractable) example of how we can calculate the variance for the regression coefficients, and follows Kish's (1965) general identification of this method for complex surveys and Binder's (1983) specific adaptation to the PML framework. However, it is by no means the only one. Sul Lee et al (1989) provide a simple analysis of replication methods for estimating variance, including Balanced Repeated Replication (BRR) and Jackknife Repeated Replication (JRR). These replication methods are generally more useful than the Bootstrap when the underlying distributional assumptions of key variables are known. However, replication methods are generally more computationally intensive to derive than Taylor series approximations, which have become the standard approach in most current software programs (e.g. Stata, SAS). In the analysis below, the standard errors are estimated using the Taylor series.

Once we estimate the variance, it is then also possible to evaluate the precision of the coefficients estimated from the OHS99 given its complex survey features, relative to a simple random sample (SRS) of the same size. This is known as the design effect (deff) (Kish, 1965), and provides us with additional insight into the effect of survey design on the precision of the estimates. It is computed as:

\[ \mathrm{deff} = \frac{\mathrm{Var}_{\mathrm{complex}}(\hat{\theta})}{\mathrm{Var}_{\mathrm{srs}}(\hat{\theta})} \tag{4} \]

where $\theta$ is the parameter of interest.
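The sandwich form of equation (3) can be illustrated with a short Python sketch. It is not the authors' code: the function name `linearized_variance` is hypothetical, and the between-PSU estimator used for the variance of the weighted score total (deviations of PSU totals from their stratum mean, scaled by n_h / (n_h - 1), with no finite population correction) is an assumption that reflects a common textbook choice rather than something stated in the paper.

```python
import numpy as np


def linearized_variance(scores, weights, strata, psu, hessian):
    """Taylor-linearised variance: H^{-1} V(G) H^{-1}.

    scores  : (n, k) score contributions evaluated at the estimate
    weights : (n,) sampling weights
    strata  : (n,) stratum labels
    psu     : (n,) PSU labels (unique within each stratum)
    hessian : (k, k) Hessian of the weighted log-likelihood
    """
    scores = np.asarray(scores, dtype=float)
    if scores.ndim == 1:
        scores = scores[:, None]
    weights = np.asarray(weights, dtype=float)
    strata = np.asarray(strata)
    psu = np.asarray(psu)
    hessian = np.atleast_2d(np.asarray(hessian, dtype=float))

    weighted = scores * weights[:, None]
    k = scores.shape[1]
    v_g = np.zeros((k, k))
    for h in np.unique(strata):
        in_h = strata == h
        psus = np.unique(psu[in_h])
        n_h = len(psus)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        # Weighted score total for every PSU in this stratum.
        totals = np.array(
            [weighted[in_h & (psu == c)].sum(axis=0) for c in psus]
        )
        dev = totals - totals.mean(axis=0)
        v_g += n_h / (n_h - 1) * dev.T @ dev

    h_inv = np.linalg.inv(hessian)
    return h_inv @ v_g @ h_inv.T


# Illustration with a weighted mean, whose score is (y - theta)
# and whose Hessian is minus the sum of the weights.
y = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0])
w = np.ones_like(y)
theta_hat = np.average(y, weights=w)
var_mean = linearized_variance(
    scores=y - theta_hat,
    weights=w,
    strata=np.zeros(len(y), dtype=int),
    psu=np.arange(len(y)),
    hessian=-w.sum(),
)
```

With one stratum, equal weights and every observation treated as its own PSU, the result reduces to the familiar sample variance divided by n, which is a useful sanity check before applying the function to a clustered design.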

2.3 A Note on the Weights

There are two different weights that are applicable to this analysis. The first is the computation of $w_{hij}$ in equation (2), which is part of the weighted log likelihood in the pseudo-ML procedure. It accounts for heteroskedasticity and the number of replicates in an iterative likelihood procedure. The second weight is distinct from the first, and is developed in order to account for the design features of a complex survey. This weight is, in turn, comprised of three components: (i) compensation for unequal probability of selection (denoted $w_1$), (ii) adjustment for non-response (denoted $w_2$), and (iii) post-stratification adjustments (denoted $w_3$). The three weights are calculated as follows:

\[ w_1 = \frac{1}{p(i)}; \qquad w_2 = \frac{1}{r}; \qquad \text{and} \qquad w_3 = c \cdot \frac{N}{m} . \]

Here, $p(i)$ is the probability that unit $i$ is sampled; $r$ is the response rate; $c$ is a constant chosen so that the weights sum to the number of respondents; and $N$ is the population total (e.g. obtained from the Census) of a given number of respondents $m$. The final weight is then the product of the three individual weights, given by:

\[ w_{hij} = w_1 w_2 w_3 \]

Therefore, it is $w_{hij}$ that must be used as the weight of choice in the survey design adjusted parameter estimates. In the analysis below, we use Statistics South Africa's (SSA) weight in the OHS99 since it is computed in this manner. It is important to be aware of the fact that adjustments to the OHS99 weight (i.e. the weight provided by SSA in the publicly released version of the dataset) in order to compensate for population growth and other demographic changes constitute an adjustment to $w_3$ only (i.e. the post-stratification weight). If this adjustment is made without factoring out $w_1$ and $w_2$ (thereby isolating the post-stratification factor of the product), then the resulting weight would be incorrect (see footnote 1).

Footnote 1: Since the weight is a product function, it would be useful for Statistics South Africa to include all three weights plus the combined weight in the survey released to the public. This would allow researchers to make their own post-stratification adjustments, or, indeed, to create alternative weights based on some other procedure (e.g. imputation).
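The arithmetic of the three-part weight, and the reason the post-stratification factor has to be isolated before it is revised, can be sketched as follows. The function names and all numerical values are hypothetical and chosen only for illustration; they are not taken from the OHS99.

```python
def component_weights(p_select, response_rate, c, population_total, respondents):
    """Return (w1, w2, w3) as defined in Section 2.3."""
    w1 = 1.0 / p_select                       # inverse selection probability
    w2 = 1.0 / response_rate                  # non-response adjustment
    w3 = c * population_total / respondents   # post-stratification factor
    return w1, w2, w3


def final_weight(w1, w2, w3):
    """The combined weight is the product of the three components."""
    return w1 * w2 * w3


def revise_poststratification(w, w3_old, w3_new):
    """Swap the post-stratification factor while leaving w1 and w2 intact."""
    return w / w3_old * w3_new


# Hypothetical unit: selected with probability 0.01, in a group with an
# 80 per cent response rate, where 10 respondents represent 1 000 people.
w1, w2, w3 = component_weights(0.01, 0.8, 1.0, 1000, 10)
w = final_weight(w1, w2, w3)

# Revising the population total to 1 100 changes only w3.
_, _, w3_new = component_weights(0.01, 0.8, 1.0, 1100, 10)
w_revised = revise_poststratification(w, w3, w3_new)
```

The revision step requires `w3_old`, which is why the footnote's request that all three components be released matters: with only the combined weight, the post-stratification factor cannot be separated out.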

3. Data and Variables

The data for this exercise are taken from the 1999 October Household Survey (OHS99), conducted by Statistics South Africa. A two-stage sampling procedure was applied in the OHS, and the sample was stratified, clustered and selected to meet the requirements of probability sampling. The sampling procedure involved primary stage stratification by province and area type (urban/rural). Independent samples of Enumerated Areas (EAs) were systematically selected with probability proportional to size in each stratum; these are the clusters. The measure of size was the estimated number of households in each Enumerated Area. A systematic sample of 10 households was then drawn from each EA, amounting to 30 000 households in 3 000 EAs.

The sub-sample of individuals evaluated in this study is limited to: workers whose ages range from 15 to 65 (i.e. all economically active individuals); those who are employed by someone else (we thus exclude self-employed people, who only report their gross turnover); and those for whom information is available on wages and all other relevant attributes. These restrictions reduce the original sample size to 17 945 individuals.

3.1 Dependent Variable

Information on earnings relates to total salary/pay, including overtime, allowances and bonuses, before tax. The worker is asked to give either the precise amount of their salary or the income interval in which it fits, on a weekly, monthly or annual basis. Thus, the observations for the dependent variable consist of a mixture of point and interval data. Despite the fact that we omit both item and unit missing data from the regression (rather than imputing as per Heeringa et al, 2002), the data are still termed coarsened in the Heitjan and Rubin (1990, 1991) sense. Indeed, their definition of this phrase is flexible enough to be applied even to data that have only been grouped to ensure confidentiality.
All the observations were then converted into monthly data. It is common to use hourly earnings to abstract from the effect of variations in hours worked. However, even though workers report the number of hours they usually work per week, the presence of interval income data prevents us from working with the hourly wage rate. In order to account for this, working hours are introduced as an independent variable. Lastly, the model we use assumes normality, and since the distribution of wages is skewed and non-normal, we more closely approximate normality if we model the log of wages.
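The preparation of the dependent variable described above (conversion to a monthly basis, then logs of either a point value or the bounds of an income bracket) might look as follows in Python. This is a sketch under stated assumptions: the paper does not report its conversion factors, so the factors of 52/12 for weekly and 1/12 for annual amounts are assumptions, as are the function names and the use of infinite bounds for the open-ended bottom and top brackets.

```python
import math

# Assumed conversion factors to a monthly basis (not given in the paper).
TO_MONTHLY = {"weekly": 52.0 / 12.0, "monthly": 1.0, "annual": 1.0 / 12.0}


def to_monthly(amount, period):
    """Convert a reported amount to a monthly figure."""
    return amount * TO_MONTHLY[period]


def log_bounds(point=None, bracket=None):
    """Return (lower, upper) bounds of log monthly income.

    point   : exact monthly income, giving equal lower and upper bounds
    bracket : (low, high) monthly bracket limits; use None for an
              open-ended side, e.g. (None, 200) or (30000, None)
    """
    if point is not None:
        value = math.log(point)
        return value, value
    low, high = bracket
    lower = -math.inf if low is None else math.log(low)
    upper = math.inf if high is None else math.log(high)
    return lower, upper


exact = log_bounds(point=to_monthly(30000, "annual"))  # point observation
middle = log_bounds(bracket=(201, 500))                # interval observation
bottom = log_bounds(bracket=(None, 200))               # left censored
top = log_bounds(bracket=(30000, None))                # right censored
```

Each observation then carries a pair of bounds, which is the form an interval regression likelihood such as equation (1) needs: equal bounds for point data, one infinite bound for censored data, and two finite bounds for bracketed data.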

3.2 Independent Variables

Independent variables include the following:

A set of educational dummies (see footnote 2), a variable for age and one for tenure (which proxies on-the-job learning) are introduced to test human capital theory. Quadratic terms for age and tenure are included to allow for increasing and then decreasing returns to age and experience over the life cycle.

Racial dummies are introduced to assess whether, other things being equal, race plays a role in the determination of earnings. This is a simple, but not comprehensive, way of detecting racial discrimination in the labour market.

Following the same method as for race, a dummy for male is included to test for gender discrimination.

Variables for marriage and headship status are traditionally set as determinants of earnings, as proxies for factors such as stability, motivation and discipline.

We also add a dummy variable for location to test the hypothesis that workers in urban areas earn more than those in rural areas. Dummies for the provinces are also included, to take into account the differences in the cost of living.

A dummy for union membership is introduced to investigate union power over wage setting. We thus also test whether unionised workers earn higher wages than non-unionised workers.

A dummy for the nature of the activity (formal/informal) is also included to test whether working in a registered activity is more lucrative than working in a non-registered activity.

Finally, we introduce a set of 10 sectoral dummies and 10 occupational categories, since earnings are expected to vary substantially among industries and occupations.

4. Results and Discussion

4.1 Descriptive Statistics

In order to discuss the effect of sampling design on the analysis of simple descriptive statistics, Table 1 presents the mean or proportion and the standard errors of the set of variables described above. They are calculated first under simple random sampling (column 2), then integrating weights into the computations (column 3), and then including stratification, clusters and weights (column 4). The last column shows the individual design effect (deff) values for each variable; see equation (4) above.

Footnote 2: No education, primary (grade 1 to grade 7), secondary (grade 8 to grade 12), further education (National Technical Certificate), higher education (diploma with grade 12, degree, postgraduate degree or diploma).

Table 1: Descriptive statistics of earnings with and without survey design features: 1999 OHS

Columns (a) and (b): without weights, clusters and strata. Columns (c) and (d): with weights only. Columns (e) and (f): with weights, clusters and strata.

| Variable | (a) Mean or proportion | (b) Std. Error | (c) Mean or proportion | (d) Std. Error | (e) Mean or proportion | (f) Std. Error | Design Effect |
|---|---|---|---|---|---|---|---|
| Dependent variable | | | | | | | |
| Monthly income (note 1) | 2782.0290 | 269.3999 | 2963.6650 | 304.7622 | 2963.6650 | 317.7013 | 1.34 |
| < 200 (note 2) | 0.0498 | 0.0026 | 0.0438 | 0.0025 | 0.0438 | 0.0032 | 1.83 |
| [201-500] | 0.1137 | 0.0037 | 0.0979 | 0.0038 | 0.0979 | 0.0047 | 1.80 |
| [501-1000] | 0.1344 | 0.0040 | 0.1173 | 0.0041 | 0.1173 | 0.0054 | 2.02 |
| [1001-1500] | 0.1388 | 0.0041 | 0.1349 | 0.0045 | 0.1349 | 0.0055 | 1.90 |
| [1501-2500] | 0.1780 | 0.0045 | 0.1794 | 0.0051 | 0.1794 | 0.0060 | 1.77 |
| [2501-3500] | 0.1180 | 0.0038 | 0.1195 | 0.0043 | 0.1195 | 0.0048 | 1.56 |
| [3501-4500] | 0.0856 | 0.0033 | 0.0925 | 0.0040 | 0.0925 | 0.0048 | 1.96 |
| [4501-6000] | 0.0783 | 0.0032 | 0.0869 | 0.0040 | 0.0869 | 0.0046 | 1.95 |
| [6001-8000] | 0.0470 | 0.0025 | 0.0561 | 0.0033 | 0.0561 | 0.0037 | 1.83 |
| [8001-11000] | 0.0243 | 0.0018 | 0.0301 | 0.0026 | 0.0301 | 0.0029 | 2.05 |
| [11001-16000] | 0.0194 | 0.0016 | 0.0249 | 0.0024 | 0.0249 | 0.0029 | 2.46 |
| [16001-30000] | 0.0095 | 0.0011 | 0.0129 | 0.0017 | 0.0129 | 0.0018 | 1.90 |
| > 30000 | 0.0030 | 0.0006 | 0.0039 | 0.0009 | 0.0039 | 0.0010 | 2.03 |
| Independent variables (note 3) | | | | | | | |
| Schooling | | | | | | | |
| No education | 0.0990 | 0.0022 | 0.0783 | 0.0021 | 0.0783 | 0.0028 | 1.96 |
| Primary | 0.2889 | 0.0034 | 0.2553 | 0.0036 | 0.2553 | 0.0051 | 2.47 |
| Secondary | 0.4746 | 0.0037 | 0.5047 | 0.0043 | 0.5047 | 0.0059 | 2.46 |
| Further education | 0.0206 | 0.0011 | 0.0243 | 0.0014 | 0.0243 | 0.0016 | 1.95 |
| Higher education | 0.1170 | 0.0024 | 0.1374 | 0.0032 | 0.1374 | 0.0051 | 3.89 |
| Age | 36.9282 | 0.0786 | 36.0724 | 0.0882 | 36.0724 | 0.1004 | 1.69 |
| Age square | 1474.6920 | 6.1715 | 1408.1640 | 6.8404 | 1408.1640 | 7.7319 | 1.65 |
| Tenure | 6.9706 | 0.0587 | 6.5930 | 0.0633 | 6.5930 | 0.0790 | 1.95 |
| Tenure square | 110.3507 | 1.8955 | 100.7855 | 2.2087 | 100.7855 | 2.4671 | 1.71 |
| Race | | | | | | | |
| White | 0.1201 | 0.0024 | 0.1686 | 0.0037 | 0.1686 | 0.0082 | 8.63 |
| African | 0.6834 | 0.0035 | 0.6637 | 0.0042 | 0.6637 | 0.0092 | 6.87 |
| Coloured | 0.1710 | 0.0028 | 0.1354 | 0.0027 | 0.1354 | 0.0060 | 5.55 |
| Indian | 0.0245 | 0.0012 | 0.0311 | 0.0016 | 0.0311 | 0.0037 | 8.20 |
| Other race | 0.0010 | 0.0002 | 0.0012 | 0.0003 | 0.0012 | 0.0004 | 2.82 |
| Male | 0.5637 | 0.0037 | 0.5729 | 0.0043 | 0.5729 | 0.0046 | 1.56 |
| Monthly hours | 203.7437 | 0.4688 | 201.6373 | 0.5167 | 201.6373 | 0.6904 | 2.24 |
| Urban | 0.6433 | 0.0036 | 0.7007 | 0.0038 | 0.7007 | 0.0060 | 3.10 |
| Union | 0.3725 | 0.0036 | 0.3693 | 0.0041 | 0.3693 | 0.0064 | 3.17 |
| Marital status | 0.5012 | 0.0037 | 0.4984 | 0.0043 | 0.4984 | 0.0060 | 2.60 |
| Headship status | 0.5807 | 0.0037 | 0.5817 | 0.0043 | 0.5817 | 0.0050 | 1.85 |
| Formal | 0.8068 | 0.0029 | 0.8146 | 0.0033 | 0.8146 | 0.0046 | 2.50 |
| Industries | | | | | | | |
| Manufacturing | 0.1278 | 0.0025 | 0.1423 | 0.0031 | 0.1423 | 0.0042 | 2.55 |
| Agriculture | 0.1555 | 0.0027 | 0.1139 | 0.0025 | 0.1139 | 0.0054 | 5.25 |
| Mining | 0.0691 | 0.0019 | 0.0604 | 0.0019 | 0.0604 | 0.0046 | 6.83 |
| Utilities | 0.0080 | 0.0007 | 0.0085 | 0.0008 | 0.0085 | 0.0010 | 1.94 |
| Construction | 0.0461 | 0.0016 | 0.0485 | 0.0019 | 0.0485 | 0.0022 | 1.85 |
| Trade | 0.1405 | 0.0026 | 0.1531 | 0.0032 | 0.1531 | 0.0039 | 2.15 |
| Transport | 0.0416 | 0.0015 | 0.0486 | 0.0020 | 0.0486 | 0.0023 | 2.01 |
| Finance | 0.0702 | 0.0019 | 0.0875 | 0.0027 | 0.0875 | 0.0033 | 2.40 |
| Services | 0.2035 | 0.0030 | 0.2119 | 0.0035 | 0.2119 | 0.0053 | 3.00 |
| Domestic services | 0.1330 | 0.0025 | 0.1201 | 0.0027 | 0.1201 | 0.0035 | 2.03 |
| Occupations | | | | | | | |
| Managers | 0.1171 | 0.0024 | 0.1194 | 0.0028 | 0.1194 | 0.0034 | 2.02 |
| Professionals | 0.0381 | 0.0014 | 0.0466 | 0.0020 | 0.0466 | 0.0024 | 2.37 |
| Technicians | 0.0461 | 0.0016 | 0.0545 | 0.0021 | 0.0545 | 0.0028 | 2.73 |
| Clerks | 0.0974 | 0.0022 | 0.1074 | 0.0028 | 0.1074 | 0.0035 | 2.27 |
| Salesperson | 0.0991 | 0.0022 | 0.1145 | 0.0029 | 0.1145 | 0.0035 | 2.22 |
| Artisans | 0.1029 | 0.0023 | 0.1109 | 0.0028 | 0.1109 | 0.0034 | 2.06 |
| Skilled agricultural workers | 0.0414 | 0.0015 | 0.0367 | 0.0016 | 0.0367 | 0.0021 | 2.19 |
| Operators | 0.1289 | 0.0025 | 0.1223 | 0.0027 | 0.1223 | 0.0036 | 2.13 |
| Elementary workers | 0.2173 | 0.0031 | 0.1877 | 0.0032 | 0.1877 | 0.0050 | 2.94 |
| Domestic workers | 0.1118 | 0.0024 | 0.1001 | 0.0024 | 0.1001 | 0.0030 | 1.84 |
| Provinces | | | | | | | |
| Western Cape | 0.1752 | 0.0028 | 0.1609 | 0.0031 | 0.1609 | 0.0041 | 2.19 |
| Eastern Cape | 0.0826 | 0.0021 | 0.0876 | 0.0024 | 0.0876 | 0.0040 | 3.60 |
| Northern Cape | 0.0578 | 0.0017 | 0.0263 | 0.0009 | 0.0263 | 0.0017 | 2.09 |
| Free State | 0.0991 | 0.0022 | 0.0848 | 0.0021 | 0.0848 | 0.0031 | 2.22 |
| Kwazulu-Natal | 0.1243 | 0.0025 | 0.1635 | 0.0035 | 0.1635 | 0.0055 | 3.91 |
| North West | 0.0968 | 0.0022 | 0.0785 | 0.0020 | 0.0785 | 0.0027 | 1.81 |
| Gauteng | 0.1853 | 0.0029 | 0.2605 | 0.0041 | 0.2605 | 0.0059 | 3.28 |
| Mpumalanga | 0.0978 | 0.0022 | 0.0707 | 0.0019 | 0.0707 | 0.0030 | 2.47 |
| Northern Province | 0.0810 | 0.0020 | 0.0671 | 0.0020 | 0.0671 | 0.0029 | 2.44 |

Notes: (1) 10 692 observations for monthly income. (2) 7 253 observations for income intervals. (3) 17 945 observations for each independent variable.

This table first highlights the importance of using sampling weights in order to obtain the correct point estimates. Proportions from the weighted analysis differ from the point estimates that ignore the survey design parameters by between -54 per cent (for the Northern Cape) and +40 per cent (for Whites). Put differently, taking the weights into account leads to an increase in the proportion of White workers among the total workforce, which indicates that the proportion of Whites surveyed was lower than the true proportion in the population. Results for the Northern Cape show the opposite case, where the share of workers in the sample was too high compared to the true population proportion in South Africa.

Table 1 also shows that the survey design features of the sample generally reduce the precision of the sampling estimates. The reason is that workers living in the same clusters are usually more similar to one another in behaviour and characteristics than workers living in different clusters (Deaton, 1997). The deff is a useful concept for assessing how the sample design affects precision. For example, we see that its values are particularly high for the race variables, implying that racial groups are highly clustered in South Africa.
The deff is also important for agriculture and mining, where we find that people employed in these sectors are largely grouped. On the other hand, age and gender are expected to cut across clusters uniformly, which explains why their deff values are low. Surprisingly, deff values associated with income are also low. Here we would have expected income to be more clearly associated with the racial groups, and as such to be highly clustered. A possible explanation could be that almost 20 per cent of the Whites interviewed did not give either their exact income or the income interval in which they earned. As there is a high probability that these 20 per cent are not the least wealthy, this can explain why observations for high-income intervals are not largely grouped.
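As a rough check on how the design effects in Table 1 relate to equation (4), the deff for a proportion can be approximated from the published figures. The sketch below is illustrative only and rests on an assumption that the paper does not state: the simple random sampling benchmark is taken to be p(1 - p)/n evaluated at the weighted proportion. The function name is hypothetical, and because the table reports standard errors to four decimals the result only approximates the published value.

```python
def deff_proportion(p_hat, se_design, n):
    """Approximate design effect for a proportion.

    p_hat     : weighted proportion
    se_design : standard error allowing for weights, clusters and strata
    n         : number of observations
    Assumes the SRS benchmark variance is p(1 - p) / n.
    """
    var_design = se_design ** 2
    var_srs = p_hat * (1.0 - p_hat) / n
    return var_design / var_srs


# Figures for the White dummy in Table 1 (reported deff: 8.63).
deff_white = deff_proportion(p_hat=0.1686, se_design=0.0082, n=17945)

# Figures for the African dummy in Table 1 (reported deff: 6.87).
deff_african = deff_proportion(p_hat=0.6637, se_design=0.0092, n=17945)
```

Both values come out close to the reported design effects, which suggests (without confirming) that the table's deff compares the design-based variance with an SRS variance computed at the weighted estimate rather than with the unweighted standard error in the first block of columns.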

All these observations show that design effects from complex survey data do indeed influence the precision of the estimates and thus statistical inference. Consequently, if we ignore them, we increase the probability of making erroneous conclusions. The next section evaluates these issues for the earnings regression.

4.2 Regression Results

This section presents the results of the interval regression procedure in Table 2. The results are presented for the regression coefficients computed in equation (2) and their standard errors computed as the square root of equation (3). These outcomes represent the survey-design adjusted results and are the accurate coefficients and variance estimates described in the methodology. For comparative purposes, the unweighted non-design based coefficients and standard errors are also presented, and these amount to estimation under simple random sampling assumptions, labelled accordingly in Table 2. Lastly, we also present the mean design effects (deff) for similarly grouped variables, computed in equation (4).

Table 2: Earnings interval regression with and without survey design features: 1999 OHS

| Variables | Coefficient (simple random sampling) | Std. Error (simple random sampling) | Coefficient (with survey design) | Std. Error (with survey design) | Deff |
|---|---|---|---|---|---|
| Primary (a) | 0.1199*** | 0.0216 | 0.1210*** | 0.0245 | 1.32 |
| Secondary | 0.3627*** | 0.0230 | 0.3734*** | 0.0272 | |
| Further education | 0.6363*** | 0.0469 | 0.6306*** | 0.0577 | |
| Higher education | 0.8153*** | 0.0325 | 0.8239*** | 0.0403 | |
| Age | 0.0376*** | 0.0038 | 0.0403*** | 0.0047 | 1.38 |
| Age square | -0.0004*** | 0.0000 | -0.0005*** | 0.0001 | |
| Tenure | 0.0233*** | 0.0016 | 0.0230*** | 0.0019 | 1.26 |
| Tenure square | -0.0004*** | 0.0000 | -0.0003*** | 0.0000 | |
| African (b) | -0.6700*** | 0.0210 | -0.6348*** | 0.0333 | 2.10 |
| Coloured | -0.5429*** | 0.0257 | -0.4632*** | 0.0385 | |
| Indian | -0.3283*** | 0.0416 | -0.2966*** | 0.0535 | |
| Other race | -0.1562 | 0.1796 | -0.2287 | 0.1453 | |
| Male | 0.1923*** | 0.0150 | 0.2032*** | 0.0178 | 1.32 |
| Monthly hours | 0.0009*** | 0.0001 | 0.0009*** | 0.0001 | 1.47 |
| Urban | 0.1766*** | 0.0152 | 0.1751*** | 0.0231 | 1.82 |
| Marital status | 0.0959*** | 0.0131 | 0.1092*** | 0.0178 | 1.74 |
| Headship status | 0.1406*** | 0.0141 | 0.1389*** | 0.0175 | 1.43 |
| Formal | 0.2594*** | 0.0191 | 0.2775*** | 0.0255 | 1.50 |
| Union | 0.2381*** | 0.0143 | 0.2133*** | 0.0201 | 1.74 |
| Agriculture (c) | -0.5799*** | 0.0264 | -0.5879*** | 0.0354 | 1.62 |
| Mining | 0.0912** | 0.0286 | 0.0086 | 0.0454 | |
| Utilities | 0.3009*** | 0.0656 | 0.2545 | 0.0703 | |
| Construction | -0.0556* | 0.0328 | -0.0938** | 0.0451 | |
| Trade | -0.1804*** | 0.0235 | -0.1948*** | 0.0300 | |
| Transport | 0.0578* | 0.0322 | 0.0239 | 0.0393 | |
| Finance | 0.0796** | 0.0283 | 0.0819** | 0.0303 | |
| Services | 0.0812** | 0.0233 | 0.0382 | 0.0277 | |
| Domestic services | -0.5063*** | 0.0520 | -0.4948*** | 0.0613 | |
| Managers (d) | 0.5644*** | 0.0362 | 0.5557*** | 0.0526 | 1.51 |
| Professionals | 0.4252*** | 0.0385 | 0.4583*** | 0.0446 | |
| Technicians | 0.2941*** | 0.0300 | 0.3128*** | 0.0408 | |
| Clerks | 0.1536*** | 0.0279 | 0.1320*** | 0.0346 | |
| Salesperson | -0.0591** | 0.0275 | -0.0708** | 0.0346 | |
| Skilled agricultural workers | -0.1545*** | 0.0397 | -0.1650** | 0.0475 | |
| Operators | -0.0632** | 0.0242 | -0.0830** | 0.0296 | |
| Elementary workers | -0.1869*** | 0.0236 | -0.1907*** | 0.0296 | |
| Domestic workers | -0.1951*** | 0.0548 | -0.1801** | 0.0643 | |
| Eastern Cape Province (e) | -0.4884*** | 0.0266 | -0.4570*** | 0.0391 | 2.10 |
| Northern Cape Province | -0.3086*** | 0.0278 | -0.3138*** | 0.0462 | |
| Free State Province | -0.5560*** | 0.0265 | -0.4992*** | 0.0412 | |
| KwaZulu-Natal Province | -0.2025*** | 0.0256 | -0.1792*** | 0.0374 | |
| North West Province | -0.2253*** | 0.0271 | -0.1776*** | 0.0383 | |
| Gauteng Province | -0.0870*** | 0.0232 | -0.0461 | 0.0327 | |
| Mpumalanga Province | -0.2179*** | 0.0267 | -0.1653*** | 0.0439 | |
| Northern Province | -0.2476*** | 0.0285 | -0.2292*** | 0.0389 | |
| Constant | 5.9542*** | 0.0853 | 5.8331*** | 0.1105 | 1.54 |
| Number of observations | 17945 | | 17945 | | |
| Number of strata | | | 18 | | |
| Number of PSUs | | | 2815 | | |
| Population size | | | 7 042 100 | | |
| Model Chi2 (c.1) or F (c.2) | 16 056 | | 336.78 | | |
| Prob > Chi2 or F | 0.00 | | 0.00 | | |

Notes: Associated standard errors are heteroscedastic-consistent. *** Statistically significant at the 1% level, ** the 5% level, * the 10% level. Reference category: (a) No education, (b) White, (c) Manufacturing, (d) Artisans and (e) Western Cape Province.
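As a rough illustration of how the two sets of standard errors in Table 2 relate, the sketch below squares the ratio of the survey-design standard error to the simple random sampling standard error for a few rows. This is an assumption-laden shortcut, not the paper's equation (4): the deff column in Table 2 is a mean over grouped variables, and the simple random sampling column is also unweighted, so the ratios need not match the reported deff values. The dictionary and function names are hypothetical.

```python
# Standard errors copied from Table 2 as (simple random sampling, survey design).
standard_errors = {
    "African": (0.0210, 0.0333),
    "Urban": (0.0152, 0.0231),
    "Union": (0.0143, 0.0201),
    "Other race": (0.1796, 0.1453),
}


def variance_ratio(se_srs, se_design):
    """Ratio of the design-based variance to the simple random sampling variance."""
    return (se_design / se_srs) ** 2


# A ratio above 1 means the simple random sampling column overstates precision.
ratios = {name: variance_ratio(*pair) for name, pair in standard_errors.items()}

for name, ratio in ratios.items():
    print(f"{name}: variance ratio = {ratio:.2f}")
```

A ratio below 1, as for Other race, corresponds to a standard error that falls once the survey design is taken into account.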

It should be noted that the coefficients in an interval regression are estimated by pseudo-maximum-likelihood when survey design features are taken into account. As such, they are not directly interpretable, since the coefficients predict the effects of changes in the exogenous variables on the latent variable $y^*$, as $\partial E(y^*)/\partial x_j = \beta_j$. For $y$, the marginal effect is expected to be smaller (see Maddala, 1983, 160). Despite this, comments can be made concerning the sign and relative size of the coefficients. As most of the variables have a similar influence whether or not survey design features are accounted for, the general results of the two regressions are considered first.

The block of educational dummies shows expected results. Schooling increases earnings, and the more educated workers are, the higher the return to the years of schooling completed. Age and tenure have positive and decreasing returns on wages. We can thus conclude that all of these variables have an influence consistent with human capital theory.

The racial dummies, with the exception of Other race, are significant and display the expected order. Other things being equal, Africans earn the least relative to White workers, followed by Coloureds and Indians, corroborating similar results found by Hofmeyr (2000) on a 1993 sample, and consistent with South Africa's racially divided past. For further investigation of the estimates of racial discrimination, the residual difference methodology employed by Oaxaca (1973) should be utilised (see for instance Allanson et al. (2000) and Rospabé (2002)).

The male dummy has a positive and significant influence on wages. Whereas this result can partly be explained by the fact that males and females do not benefit equally from the same contract of employment, an unknown part of the coefficient also reflects potential gender wage discrimination. As expected, the number of hours worked on average during a month positively influences earnings.

The results for the locational variables were also expected to some extent.
Firstly, living in an urban area increases earnings. Secondly, the outcomes for provincial dummies show that earnings are lower for workers located in any province other than the Western Cape. However, the coefficient for Gauteng is not significant when survey design is considered.

Being married and being the head of a household confers some advantages to workers, which indicates that these two variables could be a motivational signal for employers. Alternatively, it could also be due to confounding marriage with earning potential and age.

Turning to the impact of sectors on earnings, estimates show that workers in the formal sector earn higher wages than in the informal sector. This result is not unexpected as the formal dummy also reflects the effects of firm size and welfare contributions, which are likely to be larger in the formal sector. If we consider the results taking survey design into account, we can also see that there are a few industrial sectors that provide significantly higher wages than manufacturing, exemplified by the utility and finance sectors. However, other industries such as agriculture, trade and domestic work pay less than the manufacturing sector.

Union members earn significantly more than non-union members. This result is common in the literature on the union wage premium and highlights the strong bargaining power of South African unions over wages. Similar results have already been found in previous studies (Butcher and Rouse (2001), Moll (1993), Mwabu and Schultz (1998)), though for African workers only. As far as white workers are concerned, the premium is often found to be insignificant. Therefore, it should be expected that if the results were disaggregated by race, the conclusions would be quite different.

The results for the block of occupational dummies also display the expected wage hierarchy, where artisans were used as the base category. Estimates show that managers, professionals, technicians and clerks earn significantly more than artisans, whereas workers perceived as less skilled receive lower wages. In the following section, we compare the results of the earnings interval regressions estimated under simple random sampling and when survey design was accounted for.

4.3 The Influence of Survey-Design

At first glance, there are no obvious differences between the results of the estimates with or without integrating survey design. As expected, the standard errors increase when clusters are included in the analysis, since the simple random sampling regression overstates precision by ignoring the dependence of observations within the same PSU. Ignoring clustering leads to a rise in the probability of committing a type I error. The design effects (deff) are large for race and province variables, exceeding 2 on average. However, whether or not survey design is accounted for, the probability of committing a type I error remains close to zero in both cases, except for Gauteng where the coefficient becomes insignificant under cluster sampling.
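The effect of the larger standard errors on significance can be checked directly from Table 2. The sketch below computes two-sided p-values for a handful of rows under both sets of estimates. It assumes a standard normal reference distribution, which is a simplification: design-based tests typically use a t distribution with degrees of freedom tied to the number of PSUs and strata. The function and variable names are hypothetical.

```python
import math


def two_sided_p(coefficient, std_error):
    """Two-sided p-value for H0: coefficient = 0 under a normal approximation."""
    z = abs(coefficient / std_error)
    return math.erfc(z / math.sqrt(2.0))


# (coefficient, standard error) pairs copied from Table 2.
rows = {
    "Gauteng": {"srs": (-0.0870, 0.0232), "design": (-0.0461, 0.0327)},
    "Mining": {"srs": (0.0912, 0.0286), "design": (0.0086, 0.0454)},
    "Transport": {"srs": (0.0578, 0.0322), "design": (0.0239, 0.0393)},
    "Services": {"srs": (0.0812, 0.0233), "design": (0.0382, 0.0277)},
}

p_values = {
    name: {label: two_sided_p(*estimate) for label, estimate in estimates.items()}
    for name, estimates in rows.items()
}

for name, p in p_values.items():
    print(f"{name}: p (SRS) = {p['srs']:.4f}, p (survey design) = {p['design']:.4f}")
```

For each of these rows the p-value is below 0.10 under simple random sampling and above 0.10 once the survey design is used, which is the pattern a type I error under the simple random sampling assumption would produce.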
The interpretation of the results does not change much, except for the industries. Coefficients for the mining, transport and services dummies are significantly different from zero, at least at the 10 per cent level, in the case of simple random sampling estimates. However, they become insignificant when survey design is taken into account. A Wald test shows that, given survey-design features, we cannot reject the joint significance of the industrial dummies.

To some extent, the sizes of the coefficients differ when the data are weighted. Differences are small for human capital variables, urban locations and gender, but are larger for some industries, occupations and provinces. As the extent of the impact of each variable on earnings is difficult to interpret in the case of pseudo-likelihood, so are the effects of the variations in the size of the coefficients between simple random sampling and survey design.
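The interpretive difficulty can be made concrete with a textbook case. Section 4.2 notes that the coefficients describe effects on the latent variable and that the marginal effect on the observed variable is expected to be smaller. The sketch below uses the standard Tobit model censored at zero, which is an assumption chosen for simplicity and not the paper's interval model; $\Phi$ denotes the standard normal distribution function.

```latex
\begin{align*}
y^* &= x'\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2), \qquad y = \max(0, y^*) \\
\frac{\partial E(y^* \mid x)}{\partial x_j} &= \beta_j \\
\frac{\partial E(y \mid x)}{\partial x_j} &= \Phi\!\left(\frac{x'\beta}{\sigma}\right) \beta_j
\end{align*}
```

Because $0 < \Phi(\cdot) < 1$, the effect on the observed outcome is smaller in absolute value than $\beta_j$ in this case, so a change in the size of a coefficient between the two sets of estimates does not translate one-for-one into a change in the effect on observed earnings.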

In summary, it is evident that there were no large differences between the weighted and the unweighted coefficients. This indicates that the survey's sampling methodology was sound, capturing information from the sampled population that was not too different from the total population. However, the fact that the coefficients were different themselves, regardless of the magnitude of this difference, indicates that without incorporating the weights the coefficients would be incorrect. As far as the variance is concerned, it was evident that, with the exception of Other race, the standard errors in the regression results increased or, at the reported precision, remained unchanged. Consequently, it is fundamental that survey design features be accurately incorporated into the variance formulae.

5. Conclusion

This paper has estimated an earnings function from coarsened data using the interval regression model based on a pseudo-maximum likelihood estimation procedure. The analysis used the 1999 OHS and took into account both point and interval income observations, as well as the design features of the survey including stratification, multi-stage sampling and weights. In developing and applying the methodology, it was shown that researchers interested in analysing the determinants of income in a meaningful way need not be hampered by the presence of both point and interval observations, and can in fact account for these using a generalised Tobit model. By incorporating survey design features into the analysis of the variance, some changes were needed to the estimation procedure and this is where the pseudo-likelihood became useful. However, this then affected how the coefficients of the model were interpreted. Therefore, careful attention needs to be paid to the confluence of the model and its estimation procedures with survey design features. The analysis of earnings was then undertaken both at the descriptive and analytical levels.
In both instances, a comparison was made of the precision of survey design-based coefficients and variance estimates relative to their non design-based (simple random sampling) counterparts. It was shown that the introduction of weights in the analysis significantly alters the size of the means of variables, and to a smaller extent, the size of the coefficients in an earnings regression. It was also observed that survey design features generally increase standard errors, as would be expected. In some cases, coefficients that were significantly different from zero under random sampling became insignificant when survey design was accounted for. These results point to the fact that adequate attention should be paid to features of complex survey data in order to yield both correct estimates of coefficients and their standard errors.
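The estimation idea summarised in this conclusion can be sketched in a few lines. The code below evaluates a weighted log pseudo-likelihood in which point observations contribute a normal density and interval observations contribute the probability mass between their bounds, with each contribution multiplied by a sampling weight. It is a minimal sketch under stated assumptions, not the paper's equation (2): the errors are assumed normal, the data and weights are invented for illustration, all names are hypothetical, open-ended intervals are not handled, and the design-based variance of equation (3) is not computed.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def weighted_log_likelihood(observations, beta, sigma):
    """Weighted log pseudo-likelihood for mixed point and interval observations.

    Each observation is (x, lower, upper, weight); lower == upper marks a
    point observation, otherwise the outcome is only known to lie in [lower, upper].
    """
    total = 0.0
    for x, lower, upper, weight in observations:
        mean = sum(b * xi for b, xi in zip(beta, x))
        if lower == upper:
            # Point observation: log of the normal density at the observed value.
            contribution = math.log(normal_pdf((lower - mean) / sigma) / sigma)
        else:
            # Interval observation: log of the probability of falling in the interval.
            probability = normal_cdf((upper - mean) / sigma) - normal_cdf((lower - mean) / sigma)
            contribution = math.log(probability)
        total += weight * contribution
    return total


# Hypothetical data: x = (constant, years of schooling), bounds on log earnings, weight.
observations = [
    ((1.0, 8.0), 7.2, 7.2, 350.0),
    ((1.0, 12.0), 7.6, 8.3, 410.0),
    ((1.0, 15.0), 8.3, 9.0, 290.0),
    ((1.0, 10.0), 7.5, 7.5, 380.0),
]

log_likelihood = weighted_log_likelihood(observations, beta=(6.0, 0.15), sigma=0.5)
print(f"weighted log pseudo-likelihood: {log_likelihood:.2f}")
```

In practice this function would be passed to a numerical optimiser to obtain the coefficient estimates; setting every weight to 1 gives the unweighted likelihood that corresponds to the simple random sampling column.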

6. References

Allanson, P., Atkins, J.P., and Hinks, T. (2000): A multilateral decomposition of racial wage differentials in the 1994 South African Labour Market, Journal of Development Studies, 37 (1), 93-120.

Binder, D.A. (1983): On the variances of asymptotically normal estimators from complex surveys, International Statistical Review, 51, 279-292.

Butcher, K. and Rouse, C. (2001): Wage effects of unions and industrial councils in South Africa, Industrial and Labor Relations Review, 54 (2), 349-74.

Deaton, A. (1997): The analysis of household surveys: A microeconometric approach to development policy, Baltimore: Johns Hopkins University Press.

Eliason, S.R. (1993): Maximum likelihood estimation: logic and practice, London: Sage Publications.

Greene, W.H. (2000): Econometric Analysis, Fourth Edition, New Jersey: Prentice Hall.

Heeringa, S.G., Little, R.J.A. and Raghunathan, T.E. (2002): Multivariate Imputation of Coarsened Survey Data on Household Wealth, in Groves, R.M., Dillman, D.A., Eltinge, D.L., and Little, R.J.A. (eds): Survey Nonresponse, New York: John Wiley & Sons Inc.

Heitjan, D.F. and Rubin, D.B. (1990): Inference from coarse data via multiple imputation with application to age heaping, Journal of the American Statistical Association, 85 (410), 304-314.

Heitjan, D.F. and Rubin, D.B. (1991): Ignorability and coarse data, Annals of Statistics, 19, 2244-2253.

Hofmeyr, J. (2000): The changing pattern of segmentation in the South African Labour market, Studies in Economics and Econometrics, 24 (3), 109-128.

Lehtonen, R. and Pahkinen, E.J. (1995): Practical Methods for Design and Analysis of Complex Surveys, New York: John Wiley & Sons Inc.

Maddala, G.S. (1983): Limited-dependent and qualitative variables in econometrics, Cambridge University Press.

Moll, P. (1993): Black South African Unions: relative wage effects in international perspective, Industrial and Labor Relations Review, 46 (2), 245-61.

Mwabu, G. and Schultz, P. (1998): Labor unions and the distribution of wages and employment in South Africa, Industrial and Labor Relations Review, 51 (4), 680-703.

Oaxaca, R.L. (1973): Male-female wage differentials in urban labor markets, International Economic Review, 14 (3), 693-709.

Rospabé, S. (2002): How did labour market racial discrimination evolve after the end of Apartheid?, South African Journal of Economics, 70 (1), 185-217.

StataCorp. (2003a): Stata Statistical Software. Release 8.0 Base Reference Manual, Volume 4: S-Z, College Station, Texas: StataCorp LP.

StataCorp. (2003b): Stata Statistical Software. Release 8.0 Survey Data Reference Manual, College Station, Texas: StataCorp LP.

Statistics South Africa (1999): October Household Survey, Johannesburg: Statistics South Africa.

Sul Lee, E., Forthofer, R.N. and Lorimor, R.J. (1989): Analyzing Complex Survey Data, London: Sage Publications.