Multiple linear regression

Rosalyn Sharleen Kelley
5 years ago

1 Multiple linear regression Business Statistics Spring

2 Topics
1. Including multiple predictors
2. Controlling for confounders
3. Transformations, interactions, dummy variables
OpenIntro 8.1, Super Crunchers excerpt.

3 Simple linear regression recap
We saw last week how to use the least squares criterion to define the best linear predictor. We also saw how to use R or Excel to compute the best linear predictor on a given data set. The best in-sample linear predictor is probably not the true linear predictor, but with enough data it should be similar. We can use the idea of a confidence interval to help us gauge how much we trust our fit.

4 Square feet versus sale price
The least squares line of best fit for the housing data is a = ... and b = ...
[Figure: price in dollars versus square feet.]
The residual standard error (the noise level) is σ̂ = $22,480 in this case. Why is it so big? What can we do to make it smaller?

5 Bedrooms versus sale price
Perhaps we could find a better predictor. If we use the number of bedrooms instead of square feet, we get this fit.
[Figure: price in dollars, with the fitted line.]
Now σ̂ = $22,940, which is not an improvement. Couldn't we use both square feet and number of bedrooms to predict?

6 The best linear multivariate predictor
We still want to find a prediction $\hat{y}$ that minimizes our squared error, $E\{(\hat{y} - Y)^2\}$, but now we have
$$\hat{y} = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p$$
for a whole list of predictor variables. Applied to a data set, this becomes the optimization problem: find coefficients $b_0, \dots, b_p$ that minimize
$$\sum_{i=1}^{n} \Big( b_0 + \sum_{j=1}^{p} b_j x_{ij} - y_i \Big)^2 .$$
Why are there two subscripts on $x_{ij}$?
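
A minimal sketch of this optimization problem, assuming simulated data (the names n, X1, X2 and the generating coefficients are illustrative and not taken from the housing data). It solves the least squares problem directly through the normal equations and compares the answer with lm():

```r
# Simulated data; every number here is made up for illustration.
set.seed(1)
n  <- 200
X1 <- rnorm(n)
X2 <- rnorm(n)
y  <- 1 + 2 * X1 - 3 * X2 + rnorm(n)

# Design matrix: a column of ones for b_0, then one column per predictor.
# Row i is observation i and column j is predictor j, hence x_ij.
X <- cbind(1, X1, X2)

# Solve the normal equations (X'X) b = X'y.
b_normal <- solve(crossprod(X), crossprod(X, y))

# lm() minimizes the same sum of squared errors.
b_lm <- coef(lm(y ~ X1 + X2))

print(cbind(normal_equations = as.vector(b_normal), lm = unname(b_lm)))
```

The two columns should agree up to numerical rounding, since both minimize the same criterion.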

7 Including more predictors can improve prediction
If we include both square feet and the number of bedrooms in our prediction of price, the residual standard error drops to σ̂ = $21,100.
[Figure: price in dollars versus predicted price.]
Plotting in this case is trickier, but to get a sense of our prediction accuracy we can look at a predicted-versus-actual plot.

8 Including more predictors can improve prediction
If we include SqFt, Bedrooms, Bathrooms and Brick in our prediction of price, the residual standard error drops to σ̂ = $17,630.
[Figure: price in dollars versus predicted price.]
What would this plot look like if we had all the relevant determinants of price?

9 Multiple linear regression in R
> summary(housefit)

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + Brick)

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)
SqFt                                                  ***
Bedrooms                                     e-05     ***
Bathrooms                                             **
BrickYes                                     e-09     ***

Residual standard error: on 123 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

R² is a standard measure of goodness of fit, but I like σ̂ better.

10 Plug-in predictions
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)
SqFt                                                  ***
Bedrooms                                     e-05     ***
Bathrooms                                             **
BrickYes                                     e-09     ***

Just as in the single-predictor case, we can calculate predictions by plugging in particular values for the predictor variables:
ŷ = b0 + b1(2000) + b2(3) + b3(2) + b4(1),
with each coefficient replaced by its estimate from the table, would be our prediction for a 2K sq ft, three bed, two bath, brick home.
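
As a hedged illustration of plug-in prediction, the sketch below fits the same formula to simulated data (the data frame houses and every coefficient used to generate it are assumptions, not the course data set) and checks that plugging in by hand matches predict():

```r
# Simulated stand-in for the housing data; all numbers are made up.
set.seed(2)
n <- 128
houses <- data.frame(
  SqFt      = round(runif(n, 1400, 2600)),
  Bedrooms  = sample(2:5, n, replace = TRUE),
  Bathrooms = sample(2:4, n, replace = TRUE),
  Brick     = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
houses$Price <- 20000 + 50 * houses$SqFt + 5000 * houses$Bedrooms +
  8000 * houses$Bathrooms + 15000 * (houses$Brick == "Yes") +
  rnorm(n, sd = 15000)

fit <- lm(Price ~ SqFt + Bedrooms + Bathrooms + Brick, data = houses)
b <- coef(fit)

# Plug in by hand: 2000 sq ft, three bed, two bath, brick (dummy = 1).
by_hand <- b["(Intercept)"] + b["SqFt"] * 2000 + b["Bedrooms"] * 3 +
  b["Bathrooms"] * 2 + b["BrickYes"] * 1

# The same prediction through predict().
new_home <- data.frame(SqFt = 2000, Bedrooms = 3, Bathrooms = 2,
                       Brick = factor("Yes", levels = c("No", "Yes")))
by_predict <- predict(fit, newdata = new_home)

print(c(by_hand = unname(by_hand), by_predict = unname(by_predict)))
```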

11 Categorical predictors
Can we use information in a linear regression even if it isn't numerical? We can, via the creation of dummy variables. In the housing data we have three neighborhoods, denoted 1, 2 and 3. Why would we potentially not want to include the Nbhd variable in the regression as-is? If we have k categories, we create k extra columns; in each row, exactly one of these columns is a one. What happens if we include an intercept in this model?
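
The dummy-variable construction can be sketched with model.matrix(), assuming a toy six-row neighborhood vector (the vector nbhd below is illustrative only):

```r
# Toy neighborhood labels; treated as categories, not as numbers.
nbhd <- factor(c(1, 2, 3, 1, 2, 3))

# k = 3 indicator columns, no intercept: each row has exactly one 1.
X_full <- model.matrix(~ nbhd - 1)
print(X_full)

# With an intercept, R keeps only k - 1 dummies (level 1 is the baseline).
X_int <- model.matrix(~ nbhd)
print(X_int)

# Keeping all k dummies AND a column of ones gives 4 columns of rank 3:
# the dummies already sum to the column of ones.
X_all <- cbind(1, X_full)
print(qr(X_all)$rank)
```

The rank check shows why an intercept and a full set of k dummies cannot both be estimated: one column is redundant.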

12 Dummy variable with no intercept
Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd) + Brick - 1)

Coefficients:
                  Estimate  Std. Error  t value  Pr(>|t|)
SqFt                                              e-07     ***
Bedrooms
Bathrooms                                                  **
as.factor(nbhd)1
as.factor(nbhd)2                                           *
as.factor(nbhd)3                                  e-05     ***
BrickYes                                          e-12     ***

Residual standard error: on 121 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

13 Dummy variable with an intercept
Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + Brick + as.factor(nbhd))

Coefficients:
                  Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)
SqFt                                              e-07     ***
Bedrooms
Bathrooms                                                  **
BrickYes                                          e-12     ***
as.factor(nbhd)2
as.factor(nbhd)3                                  < 2e-16  ***

Residual standard error: on 121 degrees of freedom
Multiple R-squared: 0.805, Adjusted R-squared:

14 In-sample prediction accuracy
[Figure: price in dollars versus predicted price.]
Our predictions are getting progressively better. Or are they?

15 Over-fitting
As you continue to add more and more predictors, you will notice R² gets closer and closer to 1. As a crazy thought experiment, would this happen even if we kept including garbage variables? In addition to the variables above, let's include 100 junk variables (drawn from a normal distribution) and see what happens.
garbage <- matrix(rnorm(100*128),128,100)
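
A small sketch of this thought experiment, assuming a simulated response with one real predictor in place of the housing variables (x, y and the generating coefficients are made up). It records the in-sample R² as more junk columns are added:

```r
# One genuine predictor plus 100 pure-noise columns.
set.seed(3)
n <- 128
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 2)
garbage <- matrix(rnorm(100 * n), n, 100)

# In-sample R-squared with 0, 25, 50 and 100 junk predictors added.
n_junk <- c(0, 25, 50, 100)
r2 <- sapply(n_junk, function(k) {
  fit <- if (k == 0) lm(y ~ x) else lm(y ~ x + garbage[, 1:k, drop = FALSE])
  summary(fit)$r.squared
})

print(data.frame(n_junk = n_junk, r_squared = r2))
```

Because each model nests the previous one, in-sample R² can only go up, even though the added columns carry no information about y.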

16 Predicting with garbage
Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd) + Brick + garbage)

Coefficients:
                  Estimate  Std. Error  t value  Pr(>|t|)
SqFt
Bedrooms
Bathrooms
as.factor(nbhd)1
as.factor(nbhd)2
as.factor(nbhd)3
BrickYes                                                   ***
...
garbage                                                    *
...
garbage                                                    *

Residual standard error: on 21 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

17 Over-fitting
[Figure: price in dollars versus predicted price.]
One simple way to check for over-fitting is to use a hold-out set of data and try to predict them without peeking.
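
One way the hold-out check might look, sketched on simulated data (the 110/18 split, the variable names and the generating model are all assumptions). The small model uses the one real predictor; the big model also uses 100 junk columns:

```r
set.seed(4)
n <- 128
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 2)
garbage <- matrix(rnorm(100 * n), n, 100)
dat <- data.frame(y = y, x = x, garbage)  # junk columns are named X1 ... X100

# Hold out 18 rows; fit on the remaining 110 only.
train     <- sample(n, 110)
dat_train <- dat[train, ]
dat_test  <- dat[-train, ]

fit_small <- lm(y ~ x, data = dat_train)
fit_big   <- lm(y ~ ., data = dat_train)

# Root mean squared prediction error on a given data set.
rmse <- function(fit, newdata) {
  sqrt(mean((newdata$y - predict(fit, newdata))^2))
}

results <- rbind(
  in_sample = c(small = rmse(fit_small, dat_train), big = rmse(fit_big, dat_train)),
  hold_out  = c(small = rmse(fit_small, dat_test),  big = rmse(fit_big, dat_test))
)
print(results)
```

The in-sample row always favors the bigger model; the hold-out row is the honest comparison.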

18 Interactions
What if we think that the price premium associated with brick might be different between different neighborhoods? Adding an interaction term to our regression model explicitly accounts for this possibility.

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd):brick - 1)

Coefficients:
                           Estimate  Std. Error  t value  Pr(>|t|)
SqFt                                                       e-07     ***
Bedrooms
Bathrooms                                                           *
as.factor(nbhd)1:brickno
as.factor(nbhd)2:brickno                                            *
as.factor(nbhd)3:brickno                                   e-05     ***
as.factor(nbhd)1:brickyes                                           **
as.factor(nbhd)2:brickyes                                           ***
as.factor(nbhd)3:brickyes                                  e-09     ***

Residual standard error: on 119 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

19 Interactions
Here is an equivalent way to run this regression.

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd) * Brick - 1)

Coefficients:
                           Estimate  Std. Error  t value  Pr(>|t|)
SqFt                                                       e-07     ***
Bedrooms
Bathrooms                                                           *
as.factor(nbhd)1
as.factor(nbhd)2                                                    *
as.factor(nbhd)3                                           e-05     ***
BrickYes                                                            *
as.factor(nbhd)2:brickyes
as.factor(nbhd)3:brickyes

Residual standard error: on 119 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

How can we see that these are equivalent? Which one do you prefer in terms of interpretation?
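
A hedged sketch of one way to check the equivalence, using simulated data (the data frame d, its column names and the generating effects are assumptions). Both parameterizations give the same fitted values, and the brick premium in a neighborhood can be recovered from either set of coefficients:

```r
set.seed(5)
n <- 300
d <- data.frame(
  SqFt  = runif(n, 1400, 2600),
  Nbhd  = factor(sample(1:3, n, replace = TRUE)),
  Brick = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
# Made-up effects: the brick premium grows with the neighborhood number.
d$Price <- 50 * d$SqFt + 10000 * as.integer(d$Nbhd) +
  ifelse(d$Brick == "Yes", 5000 * as.integer(d$Nbhd), 0) +
  rnorm(n, sd = 10000)

# One coefficient per neighborhood-by-brick cell.
fit_cells <- lm(Price ~ SqFt + Nbhd:Brick - 1, data = d)
# Neighborhood levels, a baseline brick effect, and adjustments to it.
fit_main  <- lm(Price ~ SqFt + Nbhd * Brick - 1, data = d)

# Same fitted values, so the two models are the same predictor.
print(all.equal(fitted(fit_cells), fitted(fit_main)))

# Brick premium in neighborhood 2, computed from each parameterization.
b_cells <- coef(fit_cells)
b_main  <- coef(fit_main)
premium_cells <- b_cells["Nbhd2:BrickYes"] - b_cells["Nbhd2:BrickNo"]
premium_main  <- b_main["BrickYes"] + b_main["Nbhd2:BrickYes"]
print(c(cells = unname(premium_cells), main = unname(premium_main)))
```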

20 Time series
Consider predicting the temperature based on day of the year. These are Chicago daily highs, in Fahrenheit.
[Figure: daily high temperature versus time.]
We can often turn nonlinear problems into linear problems by transforming our predictor variables in various ways and using many of them to predict.

21 Transformations
Since we suspect a seasonal trend, let us create the following two predictor variables: x1 = sin(2πt/365) and x2 = cos(2πt/365). We then use least squares, via lm(), to find a linear prediction rule.

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  <2e-16   ***
x1                                           <2e-16   ***
x2                                           <2e-16   ***

Residual standard error: on 1092 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
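
A minimal sketch of the seasonal fit, assuming simulated temperatures rather than the Chicago data (the mean, amplitudes and noise level below are invented). Using 1095 days, that is three years, reproduces the 1092 residual degrees of freedom shown above:

```r
set.seed(6)
t <- 1:1095  # three years of daily observations

# Invented seasonal signal plus noise, standing in for daily highs.
y <- 59 - 10 * sin(2 * pi * t / 365) - 22 * cos(2 * pi * t / 365) +
  rnorm(length(t), sd = 9)

# Transformed predictors: the model is linear in x1 and x2, not in t.
x1 <- sin(2 * pi * t / 365)
x2 <- cos(2 * pi * t / 365)

fit <- lm(y ~ x1 + x2)
b <- coef(fit)

# The prediction rule written out by hand.
yhat <- b[1] + b[2] * x1 + b[3] * x2

print(df.residual(fit))
print(round(b, 2))
```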

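22 Nonlinear prediction via linear regression
[Figure: daily high temperature versus time, with the fitted curve.]
yhat <- b0 + b1*sin(2*pi*t/365) + b2*cos(2*pi*t/365), where b0, b1 and b2 are the least squares estimates from the fit.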

23 Nonlinear prediction via linear regression
If we consider a string of 50 days, the daily high temps are sticky: the temp today looks like the temp in the preceding days.
[Figure: daily high temperature versus time.]
This suggests we can use previous days' weather to predict today's weather.

24 Auto-regression
Plotting today's weather versus tomorrow's weather gives a nice clean correlation.
[Figure: temperature tomorrow versus temperature today.]
Running a linear regression will produce a prediction rule. What do you suppose the slope coefficient will be close to?

25 Auto-regression
Here's how we set this up in R.
> today <- y[1:149]
> tomorrow <- y[2:150]
> temp_auto_reg <- lm(tomorrow~today)
> summary(temp_auto_reg)

Call:
lm(formula = tomorrow ~ today)

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                           **
today                                        <2e-16   ***

Residual standard error: on 147 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

We still have nearly 9 degree swings from day to day.

26 Auto-regression
On a two-day lag the predictability decreases.
> today <- y[1:148]
> tomorrow <- y[2:149]
> dayaftertomorrow <- y[3:150]

Call:
lm(formula = dayaftertomorrow ~ today)

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  e-05     ***
today                                        < 2e-16  ***

Residual standard error: on 146 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

The two-day variability is nearly 12 degrees.

27 Auto-regression
What happens if we include both today and yesterday to predict tomorrow?
> yesterday <- y[1:148]
> today <- y[2:149]
> tomorrow <- y[3:150]

Call:
lm(formula = tomorrow ~ today + yesterday)

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                           **
today                                        < 2e-16  ***
yesterday

Residual standard error: on 145 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

Yesterday's weather is old news!
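
The lagged setup can be tried on a simulated series, sketched below under the assumption that temperatures follow a simple first-order auto-regression (the level 60, persistence 0.8 and noise level 8 are invented, not estimated from the Chicago data):

```r
set.seed(7)
n <- 150
y <- numeric(n)
y[1] <- 60
# Each day is pulled toward the previous day's value, plus noise.
for (i in 2:n) {
  y[i] <- 60 + 0.8 * (y[i - 1] - 60) + rnorm(1, sd = 8)
}

# Align the three series so that row k holds three consecutive days.
yesterday <- y[1:(n - 2)]
today     <- y[2:(n - 1)]
tomorrow  <- y[3:n]

fit_one_lag <- lm(tomorrow ~ today)
fit_two_lag <- lm(tomorrow ~ today + yesterday)

# Compare the residual standard errors of the two prediction rules.
print(c(one_lag = summary(fit_one_lag)$sigma,
        two_lag = summary(fit_two_lag)$sigma))
print(summary(fit_two_lag)$coefficients)
```

In a series generated this way, yesterday carries no information beyond what today already provides, which is the pattern the slide describes.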

28 MBA beer survey
How many beers can you drink before becoming drunk?
[Figure: number of beers versus height.]

29 MBA beer survey
Height seems to be a valuable predictor of beer tolerance.

Call:
lm(formula = nbeer ~ height)

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                           ***
height                                       e-06     ***

Residual standard error: on 48 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

30 MBA beer survey
But weight also seems to be relevant.
[Figure: number of beers versus weight.]
So weight and height both seem predictive, but is one more important than the other?

31 MBA beer survey
It appears that weight is the relevant variable.

Call:
lm(formula = nbeer ~ height + weight)

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)
height
weight                                                ***

Residual standard error: on 47 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

On what basis is this determination made?
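
A sketch of how this pattern can arise, assuming simulated survey data in which weight drives tolerance and height matters only through its correlation with weight (all numbers below are invented):

```r
set.seed(8)
n <- 50
height <- rnorm(n, mean = 69, sd = 3)                   # inches
weight <- 150 + 5 * (height - 69) + rnorm(n, sd = 15)   # pounds, tied to height
nbeer  <- 1 + 0.03 * weight + rnorm(n, sd = 1)          # depends on weight only

fit_height <- lm(nbeer ~ height)
fit_both   <- lm(nbeer ~ height + weight)

# Height alone picks up the weight effect; compare its estimate and
# t value once weight is also in the model.
print(summary(fit_height)$coefficients)
print(summary(fit_both)$coefficients)
```

Comparing the height row across the two coefficient tables is one basis for judging which variable is doing the work.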

32 Prediction versus intervention
We are always safe interpreting our regression models as prediction engines: steps the computer follows for turning data into forecasts. We are on much shakier ground when we try to interpret our regression coefficients as knobs to be adjusted. As we reminded ourselves last week, correlation does not imply causation. Straight teeth do not cause nice cars, remember? Essentially we have two alternative explanations: either causation in the other direction (umbrellas do not lead to rain), or a common cause (rich folks have nice cars and nice teeth). For the first, we have to use common sense. For the second problem (lurking confounders), we can possibly adjust or control for them.

33 Controlling = matching
When we include a variable in a regression, we sometimes say that we are controlling for that variable. The intuition is that if we compare like with like, then our regression parameters make good mechanistic sense. So, presumably, if I looked only at groups of individuals with the same socio-economic status, there would be no remaining relationship between the quality of one's smile and the price of one's car. What we are aiming for is a rich enough set of predictors that the variation within each slice of the population (observations) is random: there is no hidden structure to trick us.

34 Sales versus price
Suppose you own a taco truck. The past three years of weekly sales and price data look like this:
[Figure: sales versus price.]
Apparently we should raise prices, right? Bigger price is better, clearly. Or is it?

35 Sales versus price
The result is statistically significant.

Call:
lm(formula = sales ~ p1)

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  <2e-16   ***
p1                                           <2e-16   ***

Residual standard error: on 154 degrees of freedom
Multiple R-squared: , Adjusted R-squared:

How should we interpret this result?

36 Price versus sales
What if we account for our competitor's price?
[Figure: competition price versus our price.]
What do you suppose this tells us? What is this a proxy for?

37 Sales versus price
The result is not statistically significant, but the least squares coefficient on our price variable changes sign!

Call:
lm(formula = sales ~ p1 + p2)

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)                                  < 2e-16  ***
p1
p2                                           e-05     ***

Residual standard error: on 153 degrees of freedom
Multiple R-squared: 0.525, Adjusted R-squared:
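
The sign change can be reproduced in a simulation, sketched here under the assumption that the competitor's price p2 tracks overall demand and that our price p1 follows it closely (the data-generating numbers are invented and are not the taco truck data):

```r
set.seed(9)
n <- 156  # three years of weekly observations

# p2 moves with demand conditions; p1 shadows p2 with a little noise.
p2 <- rnorm(n, mean = 5, sd = 1)
p1 <- p2 + rnorm(n, sd = 0.3)

# Holding p2 fixed, raising our own price LOWERS sales.
sales <- 50 - 5 * p1 + 8 * p2 + rnorm(n, sd = 2)

fit_p1   <- lm(sales ~ p1)
fit_both <- lm(sales ~ p1 + p2)

# Coefficient on our price, without and with the control.
print(c(without_p2 = unname(coef(fit_p1)["p1"]),
        with_p2    = unname(coef(fit_both)["p1"])))
```

Without the control, p1 stands in for the demand conditions it is correlated with; once p2 is included, the coefficient reflects the comparison at a fixed competitor price.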

38 Simpson's paradox, revisited
Within each color, what is the sign of the slope?
[Figure: sales versus our price, with points shown in several colors.]

39 The kitchen sink regression
In an effort to clear out all unwanted confounding, so that we can interpret our regression coefficients cleanly, we often reach for any and all available predictor variables. But this has its downsides. Specifically, there are both statistical and interpretational reasons not to do this. We have already seen the statistical argument, which is that we will tend to over-fit, and we become less certain about our estimates because our effective sample size decreases as we add more predictor variables. But there is another reason not to just throw everything into our regression models willy-nilly.

40 Intermediate outcomes
Suppose we want to learn about how smoking relates to cancer rates by zip code. That is, Y = cancer rate is our response/outcome variable and X = smoking rate is our predictor variable. To avoid confounding, we control for many other attributes, such as average income, racial make-up, average age, crime rates, etc. Suppose we also included a measure of lung tar in our regression. What do you suppose would happen to the estimated impact of smoking?
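
One way to explore this question is a simulation in which smoking affects cancer only through tar. The sketch below assumes exactly that causal chain, with invented standardized variables; it is an illustration of the mechanism, not an analysis of real zip code data:

```r
set.seed(10)
n <- 500

# Assumed chain: smoking -> tar -> cancer (all on arbitrary scales).
smoking <- rnorm(n)
tar     <- smoking + rnorm(n)
cancer  <- 2 * tar + rnorm(n)

fit_total   <- lm(cancer ~ smoking)
fit_withtar <- lm(cancer ~ smoking + tar)

# Coefficient on smoking, before and after controlling for tar.
print(c(without_tar = unname(coef(fit_total)["smoking"]),
        with_tar    = unname(coef(fit_withtar)["smoking"])))
```

Under this assumed chain, controlling for the intermediate outcome absorbs the pathway through which smoking acts, so the smoking coefficient shrinks toward zero even though smoking is the root cause.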
