Multiple linear regression
Business Statistics 41000, Spring 2017
Topics
1. Including multiple predictors
2. Controlling for confounders
3. Transformations, interactions, dummy variables
OpenIntro 8.1, Super Crunchers excerpt.
Simple linear regression recap
We saw last week how to use the least squares criterion to define the best linear predictor. We also saw how to use R or Excel to compute the best linear predictor on a given data set. The best in-sample linear predictor is probably not the true linear predictor, but with enough data it should be similar. We can use the idea of a confidence interval to help us gauge how much we trust our fit.
Square feet versus sale price
The least squares line of best fit for the housing data has a = 10091 and b = 70.23.
[Scatterplot: Square Feet (1600-2600) on the x-axis versus Price in dollars (80000-200000) on the y-axis, with the fitted line.]
The residual standard error (the noise level) is σ̂ = $22,480 in this case. Why is it so big? What can we do to make it smaller?
Bedrooms versus sale price
Perhaps we could find a better predictor. If we use the number of bedrooms instead of square feet, we get this fit.
[Scatterplot: Bedrooms (2-5) on the x-axis versus Price in dollars (80000-200000) on the y-axis, with the fitted line.]
Now σ̂ = $22,940, which is not an improvement. Couldn't we use both square feet and number of bedrooms to predict?
The best linear multivariate predictor
We still want to find a prediction ŷ that minimizes our squared error E{(ŷ - Y)^2}, but now

ŷ = b_0 + b_1 X_1 + b_2 X_2 + ... + b_p X_p

for a whole list of predictor variables. Applied to a data set, this becomes the optimization problem: find coefficients b_0, ..., b_p that minimize

Σ_{i=1}^{n} ( b_0 + Σ_j b_j x_{ij} - y_i )^2.

Why are there two subscripts on x_{ij}?
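The slides fit this problem with R's lm(); as an illustration, here is a minimal numpy sketch of the same least-squares minimization on made-up data (the data and true coefficients are assumptions, not the course data set).

```python
import numpy as np

# Simulate a toy data set: n observations of p predictors.
# x_ij carries two subscripts because i indexes the row (observation)
# and j indexes the predictor variable (column).
rng = np.random.default_rng(0)
n, p = 50, 2
x = rng.uniform(0, 10, (n, p))
y = 3 + 2 * x[:, 0] - 1 * x[:, 1] + rng.normal(0, 0.5, n)

# Stack a column of ones so the intercept b_0 is estimated along with b_1..b_p.
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes sum_i (b_0 + sum_j b_j x_ij - y_i)^2

print(b)  # close to the true coefficients (3, 2, -1)
```

lstsq solves exactly the optimization problem on the slide: it picks the coefficient vector that minimizes the sum of squared residuals.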
Including more predictors can improve prediction
If we include both square feet and the number of bedrooms in our prediction of price, the residual standard error drops to σ̂ = $21,100.
[Scatterplot: Predicted price on the x-axis versus Price in dollars on the y-axis.]
Plotting in this case is trickier, but to get a sense of our prediction accuracy we can look at a predicted-versus-actual plot.
Including more predictors can improve prediction
If we include SqFt, Bedrooms, Bathrooms and Brick in our prediction of price, the residual standard error drops to σ̂ = $17,630.
[Scatterplot: Predicted price on the x-axis versus Price in dollars on the y-axis.]
What would this plot look like if we had all the relevant determinants of price?
Multiple linear regression in R

> summary(housefit)

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + Brick)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -5279.436  14900.448  -0.354 0.723710
SqFt           35.801      9.241   3.874 0.000173 ***
Bedrooms    10873.060   2523.693   4.308 3.33e-05 ***
Bathrooms    9818.430   3699.285   2.654 0.009002 **
BrickYes    21909.107   3371.179   6.499 1.81e-09 ***

Residual standard error: 17630 on 123 degrees of freedom
Multiple R-squared: 0.5828, Adjusted R-squared: 0.5692

R^2 is a standard measure of goodness-of-fit, but I like σ̂ better.
Plug-in predictions

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -5279.436  14900.448  -0.354 0.723710
SqFt           35.801      9.241   3.874 0.000173 ***
Bedrooms    10873.060   2523.693   4.308 3.33e-05 ***
Bathrooms    9818.430   3699.285   2.654 0.009002 **
BrickYes    21909.107   3371.179   6.499 1.81e-09 ***

Just as in the single-predictor case, we can calculate predictions by plugging in particular values for the predictor variables:

ŷ = -5279 + 35.8(2000) + 10873(3) + 9818.43(2) + 21909(1)

would be our prediction for a 2000 sq ft, three bed, two bath, brick home.
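As a quick check of the arithmetic, the plug-in prediction can be computed directly from the fitted coefficients above (a small Python sketch; the dictionary layout is just for illustration).

```python
# Fitted coefficients from the regression summary on the slide.
coefs = {"(Intercept)": -5279.436, "SqFt": 35.801, "Bedrooms": 10873.060,
         "Bathrooms": 9818.430, "BrickYes": 21909.107}

# A 2000 sq ft, three bed, two bath, brick home (BrickYes = 1).
house = {"SqFt": 2000, "Bedrooms": 3, "Bathrooms": 2, "BrickYes": 1}

yhat = coefs["(Intercept)"] + sum(coefs[k] * v for k, v in house.items())
print(round(yhat))  # 140488, i.e. a predicted price of about $140,488
```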
Categorical predictors
Can we use information in a linear regression even if it isn't numerical? In the housing data we have three neighborhoods, denoted 1, 2 and 3. Why would we potentially not want to include the Nbhd variable in the regression as-is?
We can, via the creation of dummy variables. If we have k categories, we create k extra columns; in each row, exactly one of these columns is a one. What happens if we include an intercept in this model?
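In R, wrapping the variable in as.factor() builds the dummy columns automatically; a sketch of doing it by hand (with made-up neighborhood labels) makes the construction concrete.

```python
import numpy as np

# Toy neighborhood labels for five houses (values 1, 2 or 3, as in the slides).
nbhd = np.array([1, 3, 2, 1, 3])

levels = np.unique(nbhd)                           # the k = 3 categories
dummies = (nbhd[:, None] == levels).astype(float)  # one column per category

print(dummies)
# Each row has exactly one 1. Note that summing the k dummy columns
# reproduces the all-ones intercept column, which is why including all
# k dummies AND an intercept makes the model collinear; R drops one
# level when an intercept is present.
```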
Dummy variable with no intercept

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd) + Brick - 1)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
SqFt                35.930      6.404   5.610 1.30e-07 ***
Bedrooms          1902.169   1902.270   1.000  0.31933
Bathrooms         6826.925   2562.812   2.664  0.00878 **
as.factor(nbhd)1 17919.446  10474.046   1.711  0.08967 .
as.factor(nbhd)2 22785.141  11037.189   2.064  0.04112 *
as.factor(nbhd)3 52003.166  11518.102   4.515 1.48e-05 ***
BrickYes         18507.779   2396.302   7.723 3.65e-12 ***

Residual standard error: 12150 on 121 degrees of freedom
Multiple R-squared: 0.9921, Adjusted R-squared: 0.9917
Dummy variable with an intercept

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + Brick + as.factor(nbhd))

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      17919.446  10474.046   1.711  0.08967 .
SqFt                35.930      6.404   5.610 1.30e-07 ***
Bedrooms          1902.169   1902.270   1.000  0.31933
Bathrooms         6826.925   2562.812   2.664  0.00878 **
BrickYes         18507.779   2396.302   7.723 3.65e-12 ***
as.factor(nbhd)2  4865.694   2721.805   1.788  0.07633 .
as.factor(nbhd)3 34083.719   3168.987  10.755  < 2e-16 ***

Residual standard error: 12150 on 121 degrees of freedom
Multiple R-squared: 0.805, Adjusted R-squared: 0.7954
In-sample prediction accuracy
[Scatterplot: Predicted price on the x-axis versus Price in dollars on the y-axis.]
Our predictions are getting progressively better. Or are they?
Over-fitting
As you continue to add more and more predictors, you will notice R^2 gets closer and closer to 1. As a crazy thought experiment: would this happen even if we kept including garbage variables?
In addition to the variables above, let's include 100 junk variables (drawn from a normal distribution) and see what happens.

garbage <- matrix(rnorm(100*128), 128, 100)
Predicting with garbage

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd) + Brick - 1 + garbage)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
SqFt                28.31      17.76   1.594 0.125915
Bedrooms          6251.82    3768.03   1.659 0.111943
Bathrooms         3018.86    7802.43   0.387 0.702714
as.factor(nbhd)1 27896.39   30135.11   0.926 0.365113
as.factor(nbhd)2 36150.24   30968.28   1.167 0.256161
as.factor(nbhd)3 57673.84   30444.70   1.894 0.072030 .
BrickYes         21221.82    5516.62   3.847 0.000936 ***
...
garbage21        -6389.07    2697.69  -2.368 0.027538 *
...
garbage62        -5543.11    2554.52  -2.170 0.041632 *

Residual standard error: 12020 on 21 degrees of freedom
Multiple R-squared: 0.9987, Adjusted R-squared: 0.9918
Over-fitting
[Scatterplot: Predicted price on the x-axis versus Price in dollars on the y-axis.]
One simple way to check for over-fitting is to use a hold-out set of data and try to predict them without peeking.
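The hold-out check can be sketched in a few lines: simulate a price-like response, add 100 junk predictors, and compare in-sample error against hold-out error. All the data here are simulated assumptions standing in for the housing data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 128
x = rng.uniform(1000, 3000, n)                   # one genuinely useful predictor
y = 10000 + 70 * x + rng.normal(0, 20000, n)     # price-like response with noise
garbage = rng.normal(size=(n, 100))              # 100 junk predictors

train, test = np.arange(0, 100), np.arange(100, n)   # hold out the last 28 rows

def rmse(X, idx_fit, idx_eval):
    """Fit by least squares on idx_fit, report root-mean-square error on idx_eval."""
    b, *_ = np.linalg.lstsq(X[idx_fit], y[idx_fit], rcond=None)
    resid = y[idx_eval] - X[idx_eval] @ b
    return np.sqrt(np.mean(resid ** 2))

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([X_small, garbage])

print(rmse(X_small, train, train), rmse(X_big, train, train))  # in-sample: garbage "helps"
print(rmse(X_small, train, test), rmse(X_big, train, test))    # held out: garbage hurts
```

Adding columns can never increase the in-sample error, so the big model always looks better on the training rows; only the hold-out rows expose the over-fit.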
Interactions
What if we think that the price premium associated with brick might be different between different neighborhoods?

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd):Brick - 1)

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
SqFt                        36.172      6.419   5.635 1.20e-07 ***
Bedrooms                  2361.593   1911.665   1.235 0.219130
Bathrooms                 5621.923   2651.164   2.121 0.036036 *
as.factor(nbhd)1:brickNo 19822.815  10547.949   1.879 0.062649 .
as.factor(nbhd)2:brickNo 24518.728  11034.988   2.222 0.028180 *
as.factor(nbhd)3:brickNo 50774.655  11542.347   4.399 2.38e-05 ***
as.factor(nbhd)1:brickYes 32824.396 10984.131   2.988 0.003408 **
as.factor(nbhd)2:brickYes 41554.942 11266.610   3.688 0.000342 ***
as.factor(nbhd)3:brickYes 74923.876 11779.572   6.360 3.88e-09 ***

Residual standard error: 12100 on 119 degrees of freedom
Multiple R-squared: 0.9923, Adjusted R-squared: 0.9917

Adding an interaction term to our regression model explicitly accounts for this possibility.
Interactions
Here is an equivalent way to run this regression.

Call:
lm(formula = Price ~ SqFt + Bedrooms + Bathrooms + as.factor(nbhd) * Brick - 1)

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
SqFt                        36.172      6.419   5.635 1.20e-07 ***
Bedrooms                  2361.593   1911.665   1.235   0.2191
Bathrooms                 5621.923   2651.164   2.121   0.0360 *
as.factor(nbhd)1         19822.815  10547.949   1.879   0.0626 .
as.factor(nbhd)2         24518.728  11034.988   2.222   0.0282 *
as.factor(nbhd)3         50774.655  11542.347   4.399 2.38e-05 ***
BrickYes                 13001.581   5012.476   2.594   0.0107 *
as.factor(nbhd)2:brickYes 4034.634   6222.902   0.648   0.5180
as.factor(nbhd)3:brickYes 11147.641  6559.711   1.699   0.0919 .

Residual standard error: 12100 on 119 degrees of freedom
Multiple R-squared: 0.9923, Adjusted R-squared: 0.9917

How can we see that these are equivalent? Which one do you prefer in terms of interpretation?
Time series
Consider predicting the temperature based on day of the year. These are Chicago daily highs, in Fahrenheit, 2001-2003.
[Time series plot: Time (days 0-1000) on the x-axis versus Daily high temp (20-100) on the y-axis.]
We can often turn nonlinear problems into linear ones by transforming our predictor variables in various ways and using many of them to predict.
Transformations
Since we suspect a seasonal trend, let us create the following two predictor variables: x1 = sin(2πt/365) and x2 = cos(2πt/365). We then use least squares, via lm(), to find a linear prediction rule.

Call:
lm(formula = y ~ x1 + x2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  60.1498     0.2905   207.1   <2e-16 ***
x1           -8.9526     0.4108   -21.8   <2e-16 ***
x2          -24.0694     0.4108   -58.6   <2e-16 ***

Residual standard error: 9.611 on 1092 degrees of freedom
Multiple R-squared: 0.7816, Adjusted R-squared: 0.7812
Nonlinear prediction via linear regression
[Time series plot: the daily highs with the fitted seasonal curve overlaid.]

yhat <- 60.1498 - 8.95*sin(2*pi*t/365) - 24.07*cos(2*pi*t/365)
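The same seasonal fit can be sketched in numpy on simulated daily highs; the "true" curve below is an assumption chosen to resemble the lm() fit on the slide, not the real Chicago data.

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(1096)                              # three years of days

# Simulated daily highs: a seasonal curve plus noise with sd close to
# the slide's residual standard error of 9.611.
true = 60 - 9 * np.sin(2 * np.pi * t / 365) - 24 * np.cos(2 * np.pi * t / 365)
y = true + rng.normal(0, 9.6, t.size)

# The transformation: nonlinear in t, but still LINEAR in the coefficients.
x1 = np.sin(2 * np.pi * t / 365)
x2 = np.cos(2 * np.pi * t / 365)
X = np.column_stack([np.ones(t.size), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b)  # close to (60, -9, -24), much like the lm() estimates on the slide
```

The key point is that least squares never notices the sine and cosine are nonlinear functions of t; it just sees two more predictor columns.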
Nonlinear prediction via linear regression
If we consider a string of 50 days, the daily high temps are sticky: the temp today looks like the temp in the preceding days.
[Time series plot: Time on the x-axis versus Daily high temp on the y-axis.]
This suggests we can use previous days' weather to predict today's weather.
Auto-regression
Plotting today's weather versus tomorrow's weather gives a nice clean correlation.
[Scatterplot: Temp today on the x-axis versus Temp tomorrow on the y-axis.]
Running a linear regression will produce a prediction rule. What do you suppose the slope coefficient will be close to?
Auto-regression
Here's how we set this up in R.

> today <- y[1:149]
> tomorrow <- y[2:150]
> temp_auto_reg <- lm(tomorrow ~ today)
> summary(temp_auto_reg)

Call:
lm(formula = tomorrow ~ today)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.50846    2.08862   3.116   0.0022 **
today        0.87387    0.03931  22.229   <2e-16 ***

Residual standard error: 8.894 on 147 degrees of freedom
Multiple R-squared: 0.7707, Adjusted R-squared: 0.7692

We still have nearly 9 degree swings from day to day.
Auto-regression
On a two-day lag the predictability decreases.

> today <- y[1:148]
> tomorrow <- y[2:149]
> dayaftertomorrow <- y[3:150]

Call:
lm(formula = dayaftertomorrow ~ today)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.27918    2.76165   4.446 1.72e-05 ***
today        0.76349    0.05207  14.664  < 2e-16 ***

Residual standard error: 11.75 on 146 degrees of freedom
Multiple R-squared: 0.5956, Adjusted R-squared: 0.5928

The two-day variability is nearly 12 degrees.
Auto-regression
What happens if we include both today and yesterday to predict tomorrow?

> yesterday <- y[1:148]
> today <- y[2:149]
> tomorrow <- y[3:150]

Call:
lm(formula = tomorrow ~ today + yesterday)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.69962    2.16549   3.094  0.00237 **
today        0.86086    0.08283  10.393  < 2e-16 ***
yesterday    0.01035    0.08256   0.125  0.90037

Residual standard error: 8.928 on 145 degrees of freedom
Multiple R-squared: 0.7682, Adjusted R-squared: 0.765

Yesterday's weather is old news!
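The lagged-variable setup above translates directly to numpy slicing; here is a sketch on a simulated AR(1) series whose parameters are assumptions chosen to resemble the temperature fit (roughly intercept 6.5, slope 0.87, noise sd 9).

```python
import numpy as np

# Simulate 150 days where today depends only on yesterday plus noise.
rng = np.random.default_rng(3)
y = np.empty(150)
y[0] = 60.0
for i in range(1, 150):
    y[i] = 6.5 + 0.87 * y[i - 1] + rng.normal(0, 9)

# Aligned lagged copies, mirroring the R slicing y[1:148], y[2:149], y[3:150].
yesterday, today, tomorrow = y[:-2], y[1:-1], y[2:]

X = np.column_stack([np.ones(today.size), today, yesterday])
b, *_ = np.linalg.lstsq(X, tomorrow, rcond=None)

print(b)  # the 'today' coefficient dominates; 'yesterday' adds little
```

Because the series was built so that tomorrow depends on the past only through today, the fitted coefficient on yesterday hovers near zero, just as in the slide's regression.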
MBA beer survey
How many beers can you drink before becoming drunk?
[Scatterplot: number of beers (0-20) on the x-axis versus height (60-75) on the y-axis.]
MBA beer survey
Height seems to be a valuable predictor of beer tolerance.

Call:
lm(formula = nbeer ~ height)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -36.9200     8.9560  -4.122 0.000148 ***
height        0.6430     0.1296   4.960 9.23e-06 ***

Residual standard error: 3.109 on 48 degrees of freedom
Multiple R-squared: 0.3389, Adjusted R-squared: 0.3251
MBA beer survey
But weight seems also to be relevant.
[Scatterplot: number of beers (0-20) on the x-axis versus weight (100-220) on the y-axis.]
So weight and height both seem predictive, but is one more important than the other?
MBA beer survey
It appears that weight is the relevant variable.

Call:
lm(formula = nbeer ~ height + weight)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.18709   10.76821  -1.039 0.304167
height        0.07751    0.19598   0.396 0.694254
weight        0.08530    0.02381   3.582 0.000806 ***

Residual standard error: 2.784 on 47 degrees of freedom
Multiple R-squared: 0.4807, Adjusted R-squared: 0.4586

On what basis is this determination made?
Prediction versus intervention
We are always safe interpreting our regression models as prediction engines: steps the computer follows for turning data into forecasts. We are on much shakier ground when we try to interpret our regression coefficients as knobs to be adjusted.
As we reminded ourselves last week, correlation does not imply causation. Straight teeth do not cause nice cars, remember? Essentially we have two alternate explanations: either causation in the other direction (umbrellas do not lead to rain), or a common cause (rich folks have nice cars and nice teeth).
For the first, we have to use common sense. For the second problem (lurking confounders) we can possibly adjust, or control, for them.
Controlling = matching
When we include a variable in a regression, we sometimes say that we are controlling for that variable. The intuition is that if we compare like with like, then our regression parameters make good mechanistic sense. So, presumably, if I looked only at groups of individuals of the same socio-economic status, there would be no remaining relationship between the quality of one's smile and the price of one's car.
What we are aiming for is a rich enough set of predictors that the variation within each slice of the population (observations) is random: there is no hidden structure to trick us.
Sales versus price
Suppose you own a taco truck. The past three years of weekly sales and price data look like this:
[Scatterplot: Price (2-14) on the x-axis versus Sales on the y-axis.]
Apparently we should raise prices, right? Bigger price is better, clearly. Or is it?
Sales versus price
The result is statistically significant.

Call:
lm(formula = sales ~ p1)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.037086   0.069633  -14.89   <2e-16 ***
p1           0.110411   0.009593   11.51   <2e-16 ***

Residual standard error: 0.256 on 154 degrees of freedom
Multiple R-squared: 0.4624, Adjusted R-squared: 0.4589

How should we interpret this result?
Price versus sales
What if we account for our competitor's price?
[Scatterplot: Our Price (2-14) on the x-axis versus Competition Price (1-7) on the y-axis.]
What do you suppose this tells us? What is this a proxy for?
Sales versus price
The result is not statistically significant, but the least squares coefficient on our price variable changes sign!

Call:
lm(formula = sales ~ p1 + p2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.18067    0.07305 -16.163  < 2e-16 ***
p1          -0.04465    0.03571  -1.250    0.213
p2           0.28374    0.06321   4.489  1.4e-05 ***

Residual standard error: 0.2415 on 153 degrees of freedom
Multiple R-squared: 0.525, Adjusted R-squared: 0.5188
Simpson's paradox, revisited
Within each color, what is the sign of the slope?
[Scatterplot: Our Price on the x-axis versus Sales on the y-axis, with points grouped into several colors.]
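The sign flip above can be reproduced in a small simulation: sales truly fall in our price p1, but p1 tracks the competitor's price p2, so the one-variable fit gets the sign backwards. All parameters here are made-up assumptions, not the taco-truck data.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 156
p2 = rng.uniform(1, 8, n)                      # competitor's price
p1 = 1 + 1.5 * p2 + rng.normal(0, 0.8, n)      # we tend to follow the competitor
sales = -1 - 0.1 * p1 + 0.3 * p2 + rng.normal(0, 0.2, n)  # true p1 effect: NEGATIVE

def coef_on_p1(X):
    """Least-squares fit of sales on X; return the coefficient on p1 (column 1)."""
    b, *_ = np.linalg.lstsq(X, sales, rcond=None)
    return b[1]

one = np.ones(n)
b_alone = coef_on_p1(np.column_stack([one, p1]))       # p1 by itself
b_ctrl = coef_on_p1(np.column_stack([one, p1, p2]))    # controlling for p2

print(b_alone, b_ctrl)  # positive alone; negative once p2 is controlled for
```

Because p2 is a common driver of both p1 and sales, leaving it out loads its positive effect onto p1; including it recovers the true negative own-price relationship.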
The kitchen sink regression
In an effort to clear out all unwanted confounding, so we can interpret our regression coefficients cleanly, we often reach for any and all available predictor variables. But this has its downsides: there are both statistical and interpretational reasons not to do it.
We have already seen the statistical argument: we will tend to over-fit, and we become less certain about our estimates because our effective sample size decreases as we add more predictor variables. But there is another reason not to just throw everything into our regression models willy-nilly.
Intermediate outcomes
Suppose we want to learn about how smoking relates to cancer rates by zip code. That is, Y = cancer rate is our response/outcome variable and X = smoking rate is our predictor variable. To avoid confounding, we control for many other attributes, such as average income, racial make-up, average age, crime rates, etc.
Suppose we also included a measure of lung tar in our regression. What do you suppose would happen to the estimated impact of smoking?