Dummy Variables

A dummy variable, or binary variable, is a variable that takes the value 0 or 1 to indicate whether an observation has some characteristic. Common examples:

- Sex (female): FEMALE = 1 if the individual in the observation is female, 0 otherwise
- Race (White): WHITE = 1 if the individual in the observation is white/Caucasian, 0 otherwise
- Urban vs. rural: URBAN = 1 if the individual in the observation lives in an urban area, 0 otherwise
- College graduate: COLGRAD = 1 if the individual in the observation has a four-year college degree, 0 otherwise

It is common to use dummy variables as explanatory variables in regression models when binary categorical variables are likely to influence the outcome variable.

1. Example: Factors Affecting Monthly Earnings

Let us examine a data set that explores the relationship between total monthly earnings (MonthlyEarnings) and a number of variables on an interval scale (i.e. numeric quantities) that may influence monthly earnings, including each person's IQ (IQ), a measure of knowledge of their job (Knowledge), years of education (YearsEdu), years of experience (YearsExperience), and years at current job (Tenure). The data set also includes dummy variables that may explain monthly earnings, including whether or not the person is black / African American (Black), whether or not the person lives in a Southern U.S. state (South), and whether or not the person lives in an urban area (Urban).

The code below downloads a CSV file that includes data on the above variables from 1980 for 935 individuals and assigns it to a data set that we name wages.

    wages <- read.csv("http://murraylax.org/datasets/wage2.csv")

The following call to lm() estimates a multiple regression predicting monthly earnings based on the eight explanatory variables given above, three of which are dummy variables. The next call to summary() displays some summary statistics for the estimated regression.

    lmwages <- lm(MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience + Tenure
                  + Black + South + Urban, data = wages)
    summary(lmwages)

    Call:
    lm(formula = MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience +
        Tenure + Black + South + Urban, data = wages)

    Residuals:
        Min      1Q  Median      3Q     Max
    -874.42 -229.18  -40.25  181.26 2163.02

    Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
    (Intercept)     -451.0098   121.3752  -3.716 0.000215 ***
    IQ                 2.5966     0.9963   2.606 0.009301 **
    Knowledge          6.5545     1.8142   3.613 0.000319 ***
    YearsEdu          47.6530     7.1378   6.676 4.22e-11 ***
    YearsExperience   12.4833     3.1746   3.932 9.04e-05 ***
    Tenure             6.2910     2.4049   2.616 0.009043 **
    Black           -110.6660    39.2222  -2.822 0.004882 **
    South            -50.8222    25.7903  -1.971 0.049068 *
    Urban            155.4316    26.4621   5.874 5.94e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 356.7 on 926 degrees of freedom
    Multiple R-squared:  0.2285,  Adjusted R-squared:  0.2219
    F-statistic: 34.29 on 8 and 926 DF,  p-value: < 2.2e-16

The p-values in the right-most column reveal that all of the coefficients are statistically significantly different from zero at the 5% significance level. We have statistical evidence that all of these variables influence monthly earnings.

The coefficient on Black is equal to -110.67. This means that even after accounting for the effects of all the other explanatory variables in the model (including educational attainment, experience, location, knowledge, and IQ), black / African American people earn on average $110.67 less per month than non-black people.

The coefficient on South is -50.82. Accounting for the impact of all the variables in the model, people who live in the Southern United States earn on average $50.82 less per month than others.

The coefficient on Urban is 155.43. Accounting for the impact of all the variables in the model, people who live in urban areas earn on average $155.43 more per month, which probably reflects a higher cost of living.

We can compute confidence intervals for these effects with the following call to confint():

    confint(lmwages, parm=c("Black", "South", "Urban"), level = 0.95)

              2.5 %      97.5 %
    Black -187.6407 -33.6913263
    South -101.4365  -0.2079364
    Urban  103.4989 207.3642822

2. Dummy Interactions with Numeric Explanatory Variables

We found that black people have lower monthly earnings on average than non-black people. In our regression equation, this implies that the intercept is lower for black people than for non-black people. We can also test whether a dummy variable affects the slope on another variable. For example, are there differences in the returns to education for black versus non-black people? To answer this, we include an interaction effect between Black and YearsEdu:

    summary(lm(MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience + Tenure
               + Black + South + Urban + Black*YearsEdu, data = wages))
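A side note on the formula syntax in the call above: in R, Black*YearsEdu is shorthand for the two main effects plus their product term. The following is a minimal sketch that only parses the formula (no data and no model fit are needed) to show which terms R creates. The object names f and term_labels are illustrative and not part of the original analysis.

```r
# Expand the interaction formula without fitting anything.
# terms() only parses the formula, so the wages data set is not needed here.
f <- MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience + Tenure +
  Black + South + Urban + Black*YearsEdu

term_labels <- attr(terms(f), "term.labels")
print(term_labels)

# Black and YearsEdu are already in the model, so the * operator
# contributes exactly one new term: the product of the two variables.
interaction_label <- term_labels[length(term_labels)]
print(interaction_label)
```

The product term is labeled with the variable that appears first in the formula, which is why the regression output lists the row as YearsEdu:Black rather than Black:YearsEdu.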

    Call:
    lm(formula = MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience +
        Tenure + Black + South + Urban + Black * YearsEdu, data = wages)

    Residuals:
        Min      1Q  Median      3Q     Max
    -871.77 -223.35  -39.15  183.60 2166.96

    Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
    (Intercept)     -484.8569   122.7181  -3.951 8.38e-05 ***
    IQ                 2.5965     0.9951   2.609 0.009224 **
    Knowledge          6.6834     1.8135   3.685 0.000242 ***
    YearsEdu          50.0652     7.2573   6.899 9.73e-12 ***
    YearsExperience   12.0943     3.1784   3.805 0.000151 ***
    Tenure             6.3322     2.4022   2.636 0.008528 **
    Black            328.4032   249.9481   1.314 0.189211
    South            -48.6125    25.7902  -1.885 0.059753 .
    Urban            155.1421    26.4318   5.870 6.09e-09 ***
    YearsEdu:Black   -35.0262    19.6929  -1.779 0.075630 .
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 356.3 on 925 degrees of freedom
    Multiple R-squared:  0.2312,  Adjusted R-squared:  0.2237
    F-statistic:  30.9 on 9 and 925 DF,  p-value: < 2.2e-16

We see here that when accounting for an interaction effect between race and education, the coefficient on the Black dummy variable becomes insignificant, but the coefficient on the interaction term is negative and significant at the 10% level. The coefficient on the interaction term, -35.03, means the slope on education is 35.03 lower when Black = 1.

The coefficient on the interaction term is interpreted as the additional marginal effect of the numeric variable for the group whose dummy variable equals 1. For this example:

- The marginal effect on monthly earnings of an additional year of education for non-black people is $50.07 (i.e. when Black = 0).
- The marginal effect on monthly earnings of an additional year of education for black people is $50.07 - $35.03 = $15.04 (i.e. when Black = 1).

Said another way, the marginal effect on monthly earnings of an additional year of education is $35.03 less for black people than for non-black people.

3. Interacting Dummy Variables with Each Other

Let us interact two of the dummy variables to see how such a model is motivated and interpreted. In the call to lm() below, we use our baseline model and interact South and Urban:

    summary(lm(MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience + Tenure
               + Black + South + Urban + South*Urban, data = wages))

    Call:
    lm(formula = MonthlyEarnings ~ IQ + Knowledge + YearsEdu + YearsExperience +
        Tenure + Black + South + Urban + South * Urban, data = wages)

    Residuals:
        Min      1Q  Median      3Q     Max
    -885.94 -228.09  -36.76  173.16 2153.62

    Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
    (Intercept)     -516.8840   124.9159  -4.138 3.83e-05 ***
    IQ                 2.7472     0.9968   2.756 0.005964 **
    Knowledge          6.6968     1.8118   3.696 0.000232 ***
    YearsEdu          48.1580     7.1275   6.757 2.50e-11 ***
    YearsExperience   12.9375     3.1753   4.074 5.01e-05 ***
    Tenure             6.1817     2.4007   2.575 0.010178 *
    Black           -109.0280    39.1521  -2.785 0.005467 **
    South             30.3594    45.5537   0.666 0.505288
    Urban            200.1871    33.5683   5.964 3.51e-09 ***
    South:Urban     -116.3504    53.8671  -2.160 0.031033 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 356 on 925 degrees of freedom
    Multiple R-squared:  0.2324,  Adjusted R-squared:  0.2249
    F-statistic: 31.12 on 9 and 925 DF,  p-value: < 2.2e-16

To interpret the coefficients on South, Urban, and South*Urban, we will ignore (hold constant) all the terms in the regression equation that do not include one of these variables.

3.1 Difference between Urban and Rural Workers in the North/East/West

Workers in the North, East, and West U.S. have South = 0, so (Urban x South) = 0 as well, and neither the coefficient on the interaction nor the coefficient on South comes into play. The coefficient b_Urban implies that in the non-Southern U.S., urban workers earn on average $200.19 more per month than rural workers.

3.2 Difference between Urban and Rural Workers in the South

When focusing on workers in the South, South = 1 and the interaction term comes into play.

    Impact for urban workers in the South = b_South (1) + b_Urban (1) + b_{Urban x South} (1)
    Impact for rural workers in the South = b_South (1) + b_Urban (0) + b_{Urban x South} (0)
    Difference = b_Urban + b_{Urban x South} = 200.19 - 116.35 = $83.84

In the Southern U.S. states, urban workers earn on average $83.84 more per month than rural workers.
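The arithmetic in Sections 3.1 and 3.2 can be checked with a few lines of R. This is only a sketch: the coefficient values are typed in by hand from the regression output above, and the names b_south, b_urban, b_south_urban, and impact() are illustrative, not objects from the original analysis.

```r
# Coefficients copied by hand from the South * Urban regression output
b_south       <- 30.3594
b_urban       <- 200.1871
b_south_urban <- -116.3504

# Combined contribution of the three terms for given dummy values (0 or 1).
# All other explanatory variables are held constant and therefore drop out.
impact <- function(south, urban) {
  b_south * south + b_urban * urban + b_south_urban * south * urban
}

# Urban minus rural outside the South (South = 0)
gap_nonsouth <- impact(south = 0, urban = 1) - impact(south = 0, urban = 0)

# Urban minus rural in the South (South = 1)
gap_south <- impact(south = 1, urban = 1) - impact(south = 1, urban = 0)

print(round(c(NonSouth = gap_nonsouth, South = gap_south), 2))
```

Calling impact() with urban held fixed and south switched from 0 to 1 gives the South versus non-South comparisons in the same way. With a fitted model object at hand, coef() would supply the coefficients without retyping them.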

3.3 Difference between Southern and North/East/West Monthly Earnings for Urban Workers

    Impact for Southern urban workers = b_South (1) + b_Urban (1) + b_{Urban x South} (1)
    Impact for non-Southern urban workers = b_South (0) + b_Urban (1) + b_{Urban x South} (0)
    Difference = b_South + b_{Urban x South} = 30.36 - 116.35 = -$85.99

Among urban workers, those in the South earn on average $85.99 less per month than those outside the South.

3.4 Difference between Southern and North/East/West Monthly Earnings for Rural Workers

Rural workers have Urban = 0, so the interaction term (Urban x South) = 0 and we can ignore the coefficients on Urban and on the interaction. The coefficient b_South implies that Southern rural workers earn on average $30.36 more per month than non-Southern rural workers, although this coefficient is not statistically significant (p = 0.505).

4. Three-Way Interactions and Higher!

What?! Things aren't complicated enough for you?! Do so at your own peril! I have seen people include higher-order interaction effects like South * Urban * Black * YearsEdu in their regressions. It has never been obvious to me that they understood what their results meant.
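To give a sense of why such models are hard to interpret, the sketch below counts the terms R generates for the four-way product mentioned above. It only parses the formula and fits nothing; the left-hand side is the earnings variable from this example, and the names f4, labels4, and orders4 are illustrative.

```r
# Count the terms R creates for a four-way interaction.
# terms() parses the formula only; no data or model fit is involved.
f4 <- MonthlyEarnings ~ South * Urban * Black * YearsEdu
labels4 <- attr(terms(f4), "term.labels")

# Order of each term: 1 = main effect, 2 = two-way interaction, and so on
orders4 <- attr(terms(f4), "order")

print(length(labels4))
print(table(orders4))
```

The product expands to 4 main effects, 6 two-way, 4 three-way, and 1 four-way interaction, for 15 coefficients in total. Any comparison between two groups then requires summing several of them, as in Section 3 but with many more cases to keep track of.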