Multiple linear regression


Business Statistics 41000, Spring 2017

Topics

1. Including multiple predictors
2. Controlling for confounders
3. Transformations, interactions, dummy variables

OpenIntro 8.1, Super Crunchers excerpt.

Simple linear regression recap

We saw last week how to use the least squares criterion to define the best linear predictor. We also saw how to use R or Excel to compute the best linear predictor on a given data set. The best in-sample linear predictor is probably not the true linear predictor, but with enough data it should be similar. We can use the idea of a confidence interval to help us gauge how much we trust our fit.
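As a reminder, here is a minimal sketch of that workflow in R, assuming the housing data sit in a data frame called housing with columns Price and SqFt (names assumed for illustration):

simplefit <- lm(Price ~ SqFt, data = housing)   # least squares fit
coef(simplefit)      # the intercept a and slope b
sigma(simplefit)     # residual standard error, the "noise level"
confint(simplefit)   # 95% confidence intervals for both coefficients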

Square feet versus sale price

The least squares line of best fit for the housing data has intercept a = 10091 and slope b = 70.23.

[Scatterplot: sale price in dollars against square feet]

The residual standard error (the noise level) is σ̂ = $22,480 in this case. Why is it so big? What can we do to make it smaller?

Bedrooms versus sale price

Perhaps we could find a better predictor. If we use the number of bedrooms instead of square feet, we get this fit.

[Scatterplot: sale price in dollars against number of bedrooms]

Now σ̂ = $22,940, which is not an improvement. Couldn't we use both square feet and number of bedrooms to predict?

The best linear multivariate predictor

We still want to find a prediction ŷ that minimizes our expected squared error E{(ŷ − Y)²}, but now we have

ŷ = b_0 + b_1 X_1 + b_2 X_2 + ... + b_p X_p

for a whole list of predictor variables. Applied to a data set, this becomes the optimization problem: find coefficients b_0, ..., b_p that minimize

Σ_{i=1}^{n} ( b_0 + Σ_j b_j x_{ij} − y_i )².

Why are there two subscripts on x_{ij}?
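To make the connection concrete, here is a hedged sketch (column and data frame names assumed as above) showing that lm() returns exactly the coefficients that solve this least squares problem; the normal-equation solution is shown only for illustration:

X <- cbind(1, housing$SqFt, housing$Bedrooms)      # design matrix with an intercept column
y <- housing$Price
solve(t(X) %*% X, t(X) %*% y)                      # least squares coefficients via the normal equations
coef(lm(Price ~ SqFt + Bedrooms, data = housing))  # lm() reports the same numbers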

Including more predictors can improve prediction

If we include both square feet and the number of bedrooms in our prediction of price, the residual standard error drops to σ̂ = $21,100.

[Scatterplot: actual price in dollars against predicted price]

Plotting in this case is trickier, but to get a sense of our prediction accuracy we can look at a predicted-versus-actual plot.

Including more predictors can improve prediction

If we include SqFt, Bedrooms, Bathrooms and Brick in our prediction of price, the residual standard error drops to σ̂ = $17,630.

[Scatterplot: actual price in dollars against predicted price]

What would this plot look like if we had all the relevant determinants of price?

Multiple linear regression in R

> summary(housefit)

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + Brick)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  -5279.436  14900.448  -0.354 0.723710
SqFt            35.801      9.241   3.874 0.000173 ***
Bedrooms     10873.060   2523.693   4.308 3.33e-05 ***
Bathrooms     9818.430   3699.285   2.654 0.009002 **
BrickYes     21909.107   3371.179   6.499 1.81e-09 ***

Residual standard error: 17630 on 123 degrees of freedom
Multiple R-squared: 0.5828, Adjusted R-squared: 0.5692

R² is a standard measure of goodness-of-fit, but I like σ̂ better.
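For reference, a minimal sketch of how a fit like housefit could be produced and how σ̂ and R² can be pulled out directly; the data frame name housing and the data argument are assumptions (the slides may equally have attached the data):

housefit <- lm(Price ~ SqFt + Bedrooms + Bathrooms + Brick, data = housing)
sigma(housefit)                   # residual standard error, sigma-hat
summary(housefit)$r.squared       # R-squared
summary(housefit)$adj.r.squared   # adjusted R-squared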

Plug-in predictions

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  -5279.436  14900.448  -0.354 0.723710
SqFt            35.801      9.241   3.874 0.000173 ***
Bedrooms     10873.060   2523.693   4.308 3.33e-05 ***
Bathrooms     9818.430   3699.285   2.654 0.009002 **
BrickYes     21909.107   3371.179   6.499 1.81e-09 ***

Just as in the single-predictor case, we can calculate predictions by plugging in particular values for the predictor variables:

ŷ = -5279 + 35.8(2000) + 10873(3) + 9818.43(2) + 21909(1)

would be our prediction for a 2,000 square foot, three-bedroom, two-bath, brick home.
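The same plug-in prediction can be obtained with predict(); a minimal sketch, assuming the housefit object and variable names above:

newhouse <- data.frame(SqFt = 2000, Bedrooms = 3, Bathrooms = 2, Brick = "Yes")
predict(housefit, newdata = newhouse)                           # point prediction
predict(housefit, newdata = newhouse, interval = "prediction")  # with a prediction interval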

Categorical predictors

Can we use information in a linear regression even if it isn't numerical? In the housing data we have three neighborhoods, denoted 1, 2 and 3. Why would we potentially not want to include the Nbhd variable in the regression as-is?

We can use such information via the creation of dummy variables. If we have k categories, we create k extra columns; in each row, exactly one of these columns is a one. What happens if we include an intercept in this model?
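A hedged illustration of how R builds these dummy columns, assuming the neighborhood codes are stored in a column Nbhd of the housing data frame (name assumed):

nbhd <- as.factor(housing$Nbhd)
head(model.matrix(~ nbhd - 1))   # no intercept: one 0/1 column per neighborhood
head(model.matrix(~ nbhd))       # with an intercept: one category is dropped as the baseline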

Dummy variable with no intercept

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd) + Brick - 1)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
SqFt                35.930      6.404   5.610 1.30e-07 ***
Bedrooms          1902.169   1902.270   1.000  0.31933
Bathrooms         6826.925   2562.812   2.664  0.00878 **
as.factor(nbhd)1 17919.446  10474.046   1.711  0.08967 .
as.factor(nbhd)2 22785.141  11037.189   2.064  0.04112 *
as.factor(nbhd)3 52003.166  11518.102   4.515 1.48e-05 ***
BrickYes         18507.779   2396.302   7.723 3.65e-12 ***

Residual standard error: 12150 on 121 degrees of freedom
Multiple R-squared: 0.9921, Adjusted R-squared: 0.9917

Dummy variable with an intercept

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + Brick + as.factor(nbhd))

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      17919.446  10474.046   1.711  0.08967 .
SqFt                35.930      6.404   5.610 1.30e-07 ***
Bedrooms          1902.169   1902.270   1.000  0.31933
Bathrooms         6826.925   2562.812   2.664  0.00878 **
BrickYes         18507.779   2396.302   7.723 3.65e-12 ***
as.factor(nbhd)2  4865.694   2721.805   1.788  0.07633 .
as.factor(nbhd)3 34083.719   3168.987  10.755  < 2e-16 ***

Residual standard error: 12150 on 121 degrees of freedom
Multiple R-squared: 0.805, Adjusted R-squared: 0.7954

In-sample prediction accuracy

[Scatterplot: actual price in dollars against predicted price]

Our predictions are getting progressively better. Or are they?

Over-fitting

As you continue to add more and more predictors, you will notice R² gets closer and closer to 1. As a crazy thought experiment: would this happen even if we kept including garbage variables?

In addition to the variables above, let's include 100 junk variables (drawn from a normal distribution) and see what happens.

garbage <- matrix(rnorm(100*128), 128, 100)
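A hedged sketch of the fit shown on the next slide, with garbage appended to the earlier predictors; this assumes the housing variables are available directly (as the slides' lm calls suggest) and that the data have 128 rows, matching the dimensions of the garbage matrix:

garbagefit <- lm(Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd) + Brick - 1 + garbage)
summary(garbagefit)$r.squared   # creeps toward 1 even though the new columns are pure noise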

Predicting with garbage

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd) + Brick - 1 + garbage)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
SqFt                 28.31      17.76   1.594 0.125915
Bedrooms           6251.82    3768.03   1.659 0.111943
Bathrooms          3018.86    7802.43   0.387 0.702714
as.factor(nbhd)1  27896.39   30135.11   0.926 0.365113
as.factor(nbhd)2  36150.24   30968.28   1.167 0.256161
as.factor(nbhd)3  57673.84   30444.70   1.894 0.072030 .
BrickYes          21221.82    5516.62   3.847 0.000936 ***
...
garbage21         -6389.07    2697.69  -2.368 0.027538 *
...
garbage62         -5543.11    2554.52  -2.170 0.041632 *

Residual standard error: 12020 on 21 degrees of freedom
Multiple R-squared: 0.9987, Adjusted R-squared: 0.9918

Over-fitting

[Scatterplot: actual price in dollars against predicted price]

One simple way to check for over-fitting is to use a hold-out set of data and try to predict those observations without peeking.
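A minimal hold-out check, sketched under the assumption that the housing data frame has 128 rows (as suggested by the garbage matrix above); the split size and column names are illustrative:

set.seed(1)
train <- sample(nrow(housing), 100)   # 100 rows for fitting, the rest held out

fit  <- lm(Price ~ SqFt + Bedrooms + Bathrooms + Brick, data = housing[train, ])
pred <- predict(fit, newdata = housing[-train, ])

sqrt(mean((housing$Price[-train] - pred)^2))   # out-of-sample root mean squared prediction error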

Interactions

What if we think that the price premium associated with brick might be different in different neighborhoods?

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd):Brick - 1)

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
SqFt                         36.172      6.419   5.635 1.20e-07 ***
Bedrooms                   2361.593   1911.665   1.235 0.219130
Bathrooms                  5621.923   2651.164   2.121 0.036036 *
as.factor(nbhd)1:BrickNo  19822.815  10547.949   1.879 0.062649 .
as.factor(nbhd)2:BrickNo  24518.728  11034.988   2.222 0.028180 *
as.factor(nbhd)3:BrickNo  50774.655  11542.347   4.399 2.38e-05 ***
as.factor(nbhd)1:BrickYes 32824.396  10984.131   2.988 0.003408 **
as.factor(nbhd)2:BrickYes 41554.942  11266.610   3.688 0.000342 ***
as.factor(nbhd)3:BrickYes 74923.876  11779.572   6.360 3.88e-09 ***

Residual standard error: 12100 on 119 degrees of freedom
Multiple R-squared: 0.9923, Adjusted R-squared: 0.9917

Adding an interaction term to our regression model explicitly accounts for this possibility.

Interactions

Here is an equivalent way to run this regression.

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd) * Brick - 1)

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
SqFt                         36.172      6.419   5.635 1.20e-07 ***
Bedrooms                   2361.593   1911.665   1.235   0.2191
Bathrooms                  5621.923   2651.164   2.121   0.0360 *
as.factor(nbhd)1          19822.815  10547.949   1.879   0.0626 .
as.factor(nbhd)2          24518.728  11034.988   2.222   0.0282 *
as.factor(nbhd)3          50774.655  11542.347   4.399 2.38e-05 ***
BrickYes                  13001.581   5012.476   2.594   0.0107 *
as.factor(nbhd)2:BrickYes  4034.634   6222.902   0.648   0.5180
as.factor(nbhd)3:BrickYes 11147.641   6559.711   1.699   0.0919 .

Residual standard error: 12100 on 119 degrees of freedom
Multiple R-squared: 0.9923, Adjusted R-squared: 0.9917

How can we see that these are equivalent? Which one do you prefer in terms of interpretation?
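One hedged way to convince yourself the two parameterizations are the same model: fit both and compare their fitted values (note also the identical residual standard errors above). Object names are illustrative:

fit_cells <- lm(Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd):Brick - 1, data = housing)
fit_inter <- lm(Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd) * Brick - 1, data = housing)

all.equal(fitted(fit_cells), fitted(fit_inter))   # TRUE: same predictions, different coefficient labels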

Time series

Consider predicting the temperature based on day of the year. These are Chicago daily highs, in Fahrenheit, 2001-2003.

[Time series plot: daily high temperature against time]

We can often turn nonlinear problems into linear problems by transforming our predictor variables in various ways and using many of them to predict.

Transformations

Since we suspect a seasonal trend, let us create the following two predictor variables: x1 = sin(2πt/365) and x2 = cos(2πt/365). We then use least squares, via lm(), to find a linear prediction rule.

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  60.1498     0.2905   207.1   <2e-16 ***
x1           -8.9526     0.4108   -21.8   <2e-16 ***
x2          -24.0694     0.4108   -58.6   <2e-16 ***

Residual standard error: 9.611 on 1092 degrees of freedom
Multiple R-squared: 0.7816, Adjusted R-squared: 0.7812
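A hedged sketch of how these predictors would be built, assuming the daily highs are stored in a vector y (as in the output above) and t indexes the days:

t  <- seq_along(y)              # day index: 1, 2, ..., length(y)
x1 <- sin(2 * pi * t / 365)     # yearly sine component
x2 <- cos(2 * pi * t / 365)     # yearly cosine component

seasonfit <- lm(y ~ x1 + x2)
summary(seasonfit)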

Nonlinear prediction via linear regression

[Time series plot: daily high temperature against time, with the fitted seasonal prediction]

yhat <- 60.1498 - 8.95*sin(2*pi*t/365) - 24.07*cos(2*pi*t/365)

Nonlinear prediction via linear regression

If we consider a string of 150 consecutive days, the daily high temps are sticky: the temp today looks like the temp in the preceding days.

[Time series plot: daily high temperature over the 150-day window]

This suggests we can use previous days' weather to predict today's weather.

Auto-regression

Plotting today's weather versus tomorrow's weather gives a nice clean correlation.

[Scatterplot: tomorrow's high temperature against today's high temperature]

Running a linear regression will produce a prediction rule. What do you suppose the slope coefficient will be close to?

Auto-regression

Here's how we set this up in R.

> today <- y[1:149]
> tomorrow <- y[2:150]
> temp_auto_reg <- lm(tomorrow ~ today)
> summary(temp_auto_reg)

Call:
lm(formula = tomorrow ~ today)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.50846    2.08862   3.116   0.0022 **
today        0.87387    0.03931  22.229   <2e-16 ***

Residual standard error: 8.894 on 147 degrees of freedom
Multiple R-squared: 0.7707, Adjusted R-squared: 0.7692

We still have nearly 9-degree swings from day to day.

Auto-regression

On a two-day lag the predictability decreases.

> today <- y[1:148]
> tomorrow <- y[2:149]
> dayaftertomorrow <- y[3:150]

Call:
lm(formula = dayaftertomorrow ~ today)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.27918    2.76165   4.446 1.72e-05 ***
today        0.76349    0.05207  14.664  < 2e-16 ***

Residual standard error: 11.75 on 146 degrees of freedom
Multiple R-squared: 0.5956, Adjusted R-squared: 0.5928

The two-day variability is nearly 12 degrees.

Auto-regression

What happens if we include both today and yesterday to predict tomorrow?

> yesterday <- y[1:148]
> today <- y[2:149]
> tomorrow <- y[3:150]

Call:
lm(formula = tomorrow ~ today + yesterday)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.69962    2.16549   3.094  0.00237 **
today        0.86086    0.08283  10.393  < 2e-16 ***
yesterday    0.01035    0.08256   0.125  0.90037

Residual standard error: 8.928 on 145 degrees of freedom
Multiple R-squared: 0.7682, Adjusted R-squared: 0.765

Yesterday's weather is old news!

MBA beer survey

How many beers can you drink before becoming drunk?

[Scatterplot: height against number of beers]

MBA beer survey

Height seems to be a valuable predictor of beer tolerance.

Call:
lm(formula = nbeer ~ height)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -36.9200     8.9560  -4.122 0.000148 ***
height        0.6430     0.1296   4.960 9.23e-06 ***

Residual standard error: 3.109 on 48 degrees of freedom
Multiple R-squared: 0.3389, Adjusted R-squared: 0.3251

MBA beer survey

But weight also seems to be relevant.

[Scatterplot: weight against number of beers]

So weight and height both seem predictive, but is one more important than the other?

MBA beer survey

It appears that weight is the relevant variable.

Call:
lm(formula = nbeer ~ height + weight)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.18709   10.76821  -1.039 0.304167
height        0.07751    0.19598   0.396 0.694254
weight        0.08530    0.02381   3.582 0.000806 ***

Residual standard error: 2.784 on 47 degrees of freedom
Multiple R-squared: 0.4807, Adjusted R-squared: 0.4586

On what basis is this determination made?
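Once weight is in the model, height's coefficient is small with a large p-value, which is the basis for that call. Since tall people tend to be heavy, the two predictors carry overlapping information; a hedged sketch for checking that overlap, assuming vectors nbeer, height, and weight as in the output above:

cor(height, weight)   # how strongly the two predictors overlap

# Compare the model with and without height; a small, insignificant improvement
# in fit suggests height adds little once weight is already included
anova(lm(nbeer ~ weight), lm(nbeer ~ height + weight))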

Prediction versus intervention

We are always safe interpreting our regression models as prediction engines: steps the computer follows for turning data into forecasts. We are on much shakier ground when we try to interpret our regression coefficients as knobs to be adjusted.

As we reminded ourselves last week, correlation does not imply causation. Straight teeth do not cause nice cars, remember? Essentially we have two alternative explanations: either causation runs in the other direction (umbrellas do not lead to rain), or there is a common cause (rich folks have nice cars and nice teeth). For the first, we have to use common sense. For the second problem, lurking confounders, we can possibly adjust or control for them.

Controlling = matching

When we include a variable in a regression, we sometimes say that we are controlling for that variable. The intuition is that if we compare like with like, then our regression parameters make good mechanistic sense. So, presumably, if I looked only at groups of individuals with the same socio-economic status, there would be no remaining relationship between the quality of one's smile and the price of one's car.

What we are aiming for is a rich enough set of predictors that the variation within each slice of the population (observations) is random: there is no hidden structure to trick us.

Sales versus price

Suppose you own a taco truck. The past three years of weekly sales and price data look like this:

[Scatterplot: sales against price]

Apparently we should raise prices, right? Bigger price is better, clearly. Or is it?

Sales versus price

The result is statistically significant.

Call:
lm(formula = sales ~ p1)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.037086   0.069633  -14.89   <2e-16 ***
p1           0.110411   0.009593   11.51   <2e-16 ***

Residual standard error: 0.256 on 154 degrees of freedom
Multiple R-squared: 0.4624, Adjusted R-squared: 0.4589

How should we interpret this result?

Price versus sales

What if we account for our competitor's price?

[Scatterplot: competitor's price against our price]

What do you suppose this tells us? What is this a proxy for?

Sales versus price

The coefficient on our own price is no longer statistically significant, but it changes sign!

Call:
lm(formula = sales ~ p1 + p2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.18067    0.07305 -16.163  < 2e-16 ***
p1          -0.04465    0.03571  -1.250    0.213
p2           0.28374    0.06321   4.489  1.4e-05 ***

Residual standard error: 0.2415 on 153 degrees of freedom
Multiple R-squared: 0.525, Adjusted R-squared: 0.5188

Simpson's paradox, revisited

Within each color, what is the sign of the slope?

[Scatterplot: sales against our price, with points grouped by color]

The kitchen sink regression

In an effort to clear out all unwanted confounding, so that we can interpret our regression coefficients cleanly, we often reach for any and all available predictor variables. But this has its downsides. Specifically, there are both statistical and interpretational reasons not to do this.

We have already seen the statistical argument: we will tend to over-fit, and we become less certain about our estimates because our effective sample size decreases as we add more predictor variables. But there is another reason not to just throw everything into our regression models willy-nilly.

Intermediate outcomes

Suppose we want to learn about how smoking relates to cancer rates by zip code. That is, Y = cancer rate is our response/outcome variable and X = smoking rate is our predictor variable. To avoid confounding, we control for many other attributes, such as average income, racial make-up, average age, crime rates, etc.

Suppose we also included a measure of lung tar in our regression. What do you suppose would happen to the estimated impact of smoking?