Multiple Regression. Review of Regression with One Predictor

Fall Semester, 2001
Statistics 621, Lecture 4
Robert Stine

Multiple Regression

Preliminaries

Grading on this and other assignments
  The assignment will be placed in the folder of the first member of the Learning Team.
  A solution set will be posted this week.
  Please see me if you have non-routine questions about grading.

Feedback
  To me: class feedback form, e-mail, or cohort academic reps.
  To you: continuous assessment questions on the class web page; practice questions on my web page.

Review of Regression with One Predictor

Regression model
  0. Equation (note that X and Y include any needed transformation)

\[ \mathrm{ave}(Y_i \mid X_i) = \beta_0 + \beta_1 X_i \]

  1. Independent observations (independent errors)
  2. Constant variance observations (equal error variance)
  3. Normally distributed around the regression line, written as an assumption about the unobserved errors:

\[ \varepsilon_i \sim N(0, \sigma^2) \]

  Three unknown parameters identify the model: slope, intercept, and SD of the errors.

Checking the assumptions (order IS important)
  Linearity: scatterplots (data/residuals)
  Independence: plot residuals sequentially, highlighting trend
  Constant variance: plot residuals on predictor values
  Normality: quantile plot of residuals (outliers, skewness)

Confidence intervals vs. prediction intervals
  To what degree does advertising affect sales?
    Interpreting when the interval holds zero
    Fitted slope ± 2 SE(fitted slope)
  What can I conclude about next month's sales? (an individual value)
    Predicted value ± 2 RMSE (a.k.a. prediction interval)

    for in-sample predictions. Extrapolation beyond experience requires more care. (See the cottage profits example below.)

Review Questions

How does the precision of the estimated slope (standard error) change when removing an outlier from a regression?
  It depends on where the outlier is located. If the point is a leveraged outlier (at the extremes of the predictor, as with Center City in the Philadelphia housing data), then the SE of the slope will often increase when the point is removed.
  The explanation lies in the expression for the SE of a slope estimate: the more spread out the predictor, the smaller the SE of the slope.

What does R² tell me about a regression model?
  R² gives the proportion of explained variation, i.e., the proportion of the differences among the response values that can be interpreted in terms of differences among the corresponding values of the predictor.
  Venn diagram view of the regression process.
  Squared correlation between the predictor and the response.

How does R² compare to RMSE? Why have both?
  Both measure the goodness-of-fit of the model.
  RMSE is the estimated SD of the errors. If the RMSE is small relative to the variation in the data, then R² is close to 1 and the data are concentrated near the fitted model.

\[ \mathrm{RMSE}^2 \approx (1 - R^2)\,\mathrm{Var}(Y) \]

  RMSE has the units of the response, whereas R² is an index measuring the proportion of variation explained by the model.
  They give different views of how well the predictor fits/explains the response.
    R² is a relative measure of what the model has accomplished.
    RMSE is an absolute measure of the predictive accuracy of the model.
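The link between RMSE and R² stated above can be checked numerically. The following is only a sketch: the eight data points are made up, the helper name `fit_line` is hypothetical, and it assumes the usual conventions that RMSE divides the residual sum of squares by n − 2 and Var(Y) divides by n − 1. Under those assumptions RMSE² equals (1 − R²)·Var(Y) times the factor (n − 1)/(n − 2), which is close to 1 for moderate n.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Made-up data for illustration only (not from the lecture).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

n = len(x)
b0, b1 = fit_line(x, y)
ybar = sum(y) / n
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # residual SS
sst = sum((yi - ybar) ** 2 for yi in y)                        # total SS

r2 = 1 - sse / sst              # relative measure (no units)
rmse = math.sqrt(sse / (n - 2)) # absolute measure (units of y)
var_y = sst / (n - 1)

approx = (1 - r2) * var_y               # the approximation in the notes
exact = approx * (n - 1) / (n - 2)      # with the degrees-of-freedom factor

print(f"R^2 = {r2:.4f}, RMSE = {rmse:.4f}")
print(f"RMSE^2 = {rmse**2:.5f}, (1-R^2)Var(Y) = {approx:.5f}, exact = {exact:.5f}")
```

With only eight points the factor (n − 1)/(n − 2) is 7/6, so the approximation is visibly loose here; it tightens as n grows.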

When can you compare the R²s of different models?
  Since R² is relative (it's the proportion of the variation), you must have the same response.

What does it mean to say that a slope is not significant?
  An estimated slope is statistically significant when any/all of the following occur:
    Zero is not in the confidence interval for the slope.
    The t-ratio is larger than two in absolute size.
    The p-value is smaller than 0.05.
  When the slope is significant, we know
    (a) The predictor has a non-random effect on the response (i.e., it is unlikely that this effect was produced by sampling variation).
    (b) The direction of the effect of this predictor on the response (i.e., positive or negative effect).
  When zero lies in the CI, the true slope may be zero: no effect. In any case, we cannot tell how, if at all, this predictor affects the response.
  For example, if the slope of sales on advertising is not significant, then you cannot expect a gain in sales when you increase advertising by $100,000, say.

What should you do about outliers?
  See the next example!

Review Example: Outliers and Confidence Intervals

Housing construction (Cottages.jmp, page 89)

How much can a builder expect to profit from building a large home with 3,500 square feet?

The builder has previously constructed one large home, making this observation highly leveraged (extreme in terms of the predictor).

[Figure: scatterplot of Profit versus Sq_Feet]

With the prior large house, R² = 0.78, RMSE ≈ 3500, and large constructions appear quite profitable, about $10 per square foot.

Parameter Estimates
Term         Estimate   Std Error   t Ratio   Prob>|t|
Intercept      -416.9      1437.0     -0.29     0.7755
Sq_Feet           9.8         1.3      7.52

Without the prior large house, R² = 0.08, RMSE ≈ 3600, and we cannot tell if profit grows with construction size (the slope could be zero or negative).

Parameter Estimates
Term         Estimate   Std Error   t Ratio   Prob>|t|
Intercept      2245.4      4237.2      0.53     0.6039
Sq_Feet           6.1         5.6      1.10     0.2868

Should we keep the large cottage, or do we exclude the large cottage?
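The significance rules from the review questions (zero outside the interval, t-ratio beyond two in absolute size) can be applied directly to the two Sq_Feet slopes reported above. This is a small sketch using only the rounded estimates and standard errors from the tables; the function name `slope_summary` is hypothetical, and the interval is the rough "estimate ± 2 SE" form used in these notes, not an exact t-based interval.

```python
# Sq_Feet slope estimate and standard error, copied from the two tables.
fits = {
    "with large cottage": (9.8, 1.3),
    "without large cottage": (6.1, 5.6),
}

def slope_summary(estimate, std_error):
    """Return the t-ratio, the rough 95% interval, and a significance flag."""
    t_ratio = estimate / std_error
    low = estimate - 2 * std_error
    high = estimate + 2 * std_error
    significant = not (low <= 0 <= high)  # zero outside the interval
    return t_ratio, (low, high), significant

summaries = {label: slope_summary(est, se) for label, (est, se) in fits.items()}

for label, (t_ratio, (low, high), significant) in summaries.items():
    print(f"{label}: t = {t_ratio:.2f}, "
          f"interval = [{low:.1f}, {high:.1f}], significant = {significant}")
```

With the large cottage the interval is roughly 7.2 to 12.4 dollars per square foot; without it the interval runs from about −5.1 to 17.3, so zero and negative slopes cannot be ruled out. The t-ratios computed from the rounded table entries differ slightly from the printed ones (7.52 and 1.10) because the table values are themselves rounded.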

Outliers in Regression

Terminology
  Leveraged observations = extremes of the predictor (e.g., Center City Philadelphia).
  Influential observations = the fit changes when the point is removed. Use this term to refer to observations whose presence affects the slope or another feature (like R² or RMSE) of the fitted model.

Three canonical situations (see casebook, page 46)
  Not leveraged, large residual (direct mail example)
  Leveraged, far from fitted line (Philadelphia crime rate example)
  Leveraged, close to fitted line (construction profits in cottage example)

Outliers and assumptions
  Outliers can suggest many ways in which your data do not conform to the assumed form of the fitted model:
    Wrong equation (in the leveraged case)
    Variance is quite different under some conditions
    Errors are simply not normally distributed (think back to 603)

Impact of outlying values
  Fit can improve or weaken when the outlier is removed.
    Fit got better without the outlier in the Philadelphia crime example.
    Fit got worse in the housing construction example.

Keep or set aside?
  The decision must be based on substantive grounds.
  Outliers often show where your model is deficient. (See the discussion of the Philadelphia housing example, p 68-70.)
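The "leveraged, close to fitted line" situation can be imitated with a toy data set. The sketch below uses invented numbers (not the cottage data) and a hypothetical helper `slope_and_se`; it computes the slope and its standard error as RMSE divided by the square root of the predictor's sum of squared deviations, with and without one leveraged point.

```python
import math

def slope_and_se(x, y):
    """Least-squares slope and its standard error, RMSE / sqrt(Sxx)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))
    return slope, rmse / math.sqrt(sxx)

# Made-up data: eight ordinary points scattered around y = 1 + 1.5x,
# plus one leveraged point at x = 30 that lies exactly on that line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 30]
y = [3.0, 3.5, 6.0, 6.5, 9.0, 9.5, 12.0, 12.5, 46.0]

slope_with, se_with = slope_and_se(x, y)
slope_without, se_without = slope_and_se(x[:-1], y[:-1])  # drop the last point

print(f"with leveraged point:    slope = {slope_with:.3f}, SE = {se_with:.3f}")
print(f"without leveraged point: slope = {slope_without:.3f}, SE = {se_without:.3f}")
```

In this toy case the slope barely moves, but the standard error is several times larger once the leveraged point is removed, because the remaining predictor values are far less spread out. Whether to keep such a point is still a substantive decision, as the notes stress.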

New Application for Today

Separating the factors that drive sales
  Which factor is the most important determinant of business growth? Advertising? Product loyalty? Price?
  Complicated because of the relationships among the predictors.

Concepts and Terminology

Multiple regression adds predictors to the equation
  0. Equation adds more factors

\[ \mathrm{ave}(Y_i \mid X_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} \]

  1. Independent observations (independent errors)
  2. Constant variance observations (equal error variance)
  3. Normally distributed around the regression line, written as an assumption about the unobserved errors:

\[ \varepsilon_i \sim N(0, \sigma^2) \]

Interpretation: Implications of model equation
  Interpretation of slopes as the effect of each predictor holding other predictors fixed (as in a partial derivative)
  Same slope for each predictor X_j regardless of the values of other factors
  Factors add together (e.g., why not multiply, as in a production function?)

Interpretation: Marginal slope versus partial slope
  Marginal: simple regression slope
  Partial: multiple regression slope, adjusted for levels of other factors.
  Complications arise from collinearity, the common (almost universal) presence of correlation among the predictors in the model.
  Draw the model graph with variables as nodes.
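The gap between marginal and partial slopes under collinearity can be seen in a tiny constructed example. This is a sketch under stated assumptions: the data are invented and noise-free, with y built as exactly 3 + 1·x1 + 2·x2, and the helper `centered_sum` is hypothetical. The two-predictor fit solves the centered normal equations by hand, so no libraries are required.

```python
def centered_sum(u, v):
    """Sum of cross-products of deviations from the means."""
    ubar = sum(u) / len(u)
    vbar = sum(v) / len(v)
    return sum((a - ubar) * (b - vbar) for a, b in zip(u, v))

# Made-up, positively correlated predictors and an exact linear response.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [3 + 1 * a + 2 * b for a, b in zip(x1, x2)]

s11 = centered_sum(x1, x1)
s22 = centered_sum(x2, x2)
s12 = centered_sum(x1, x2)
s1y = centered_sum(x1, y)
s2y = centered_sum(x2, y)

# Marginal slope: simple regression of y on x1 alone.
marginal_x1 = s1y / s11

# Partial slopes: two-predictor least squares via the 2x2 normal equations.
det = s11 * s22 - s12 ** 2
partial_x1 = (s22 * s1y - s12 * s2y) / det
partial_x2 = (s11 * s2y - s12 * s1y) / det

print(f"marginal slope for x1: {marginal_x1:.3f}")
print(f"partial slopes: x1 = {partial_x1:.3f}, x2 = {partial_x2:.3f}")
```

The partial slope for x1 recovers the coefficient 1 used to build y, while the marginal slope is larger because x1 also stands in for the correlated x2: here it equals 1 + 2·(s12/s11).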

Inference: Determinants of SE for a slope
  Question about one predictor: Does this predictor improve a model containing all of the others?
  Answer: Use the same procedure as with one predictor: use the t-ratio or CI for the slope.
  Relationships/correlation among predictors increase the SE of slopes.

\[ \mathrm{SE}(\text{slope for } X_j) = \frac{\mathrm{RMSE}}{\sqrt{n}} \cdot \frac{1}{\mathrm{SD}(\text{Adjusted } X_j)} \]

Inference: Goodness-of-fit and R²
  Question about the overall model: Does the model (taken collectively) explain significant variation?
  Answer: Requires a procedure that was not needed with one predictor: use the overall F-ratio (related to RMSE and R²).
  R² = proportion of variation in the response captured by the fitted model.

\[ R^2 = \frac{\text{Explained Sums of Squares}}{\text{Total Sums of Squares}} \]

  R² = squared correlation of Y and the predicted values from the fitted model.
  Judge changes in R² by looking at what is left over:
    Easy: 0.50 → 0.51
    Hard: 0.98 → 0.99

Diagnostic: Leverage plots
  Graphical diagnostic for multiple regression. Do I like the shown simple regression model?
  Reduces multiple regression to a sequence of simple regressions.

JMP-IN Methods
  Fit Model command
    Powerful generalization of the Fit Y by X command
    Allows you to save features of the model (predictions, residuals)
    The ability to save the prediction formula is very useful: add a dummy row (or rows) so that you can fill in values for the predictors and get the model prediction under these conditions.
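The advice above to judge changes in R² by what is left over can be made concrete with a line of arithmetic. The sketch below reflects one reading of that advice, namely that the gain in R² should be compared with the unexplained share 1 − R² before the change; the function name `share_of_leftover` is hypothetical.

```python
def share_of_leftover(r2_old, r2_new):
    """Fraction of the previously unexplained variation that the gain accounts for."""
    return (r2_new - r2_old) / (1.0 - r2_old)

easy = share_of_leftover(0.50, 0.51)  # the "easy" case in the notes
hard = share_of_leftover(0.98, 0.99)  # the "hard" case in the notes

print(f"0.50 -> 0.51 explains {easy:.0%} of what was left over")
print(f"0.98 -> 0.99 explains {hard:.0%} of what was left over")
```

Both changes add 0.01 to R², but the first accounts for only 2% of the remaining variation while the second accounts for half of it.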

Example for Today

Automobile design (Car89.jmp, page 109)

What is the predicted mileage for a 4000 lb. design, and which characteristics of the design are crucial?
How much does my 200 pound brother owe me for gas for carrying him 3,000 miles to California? (Oops, it's urban mileage in the example.)

Transform the response to a gallons per 1000 miles scale.

Simple regression using only the weight of the car gives R² = 77%, RMSE = 4.23 (p 111)
  Prediction @ 4000 lbs. = 63.9
  Cost for 200 lbs. for 3000 miles ≈ 8.2 gals
  Skewness in residuals from the regression with Weight (p 112)

Parameter Estimates
Term         Estimate   Std Error   t Ratio   Prob>|t|
Intercept      9.4323      2.0545      4.59
Weight(lb)     0.0136      0.0007     18.94

Add variable for Horsepower (p 117)
  Addition of HP is a significant improvement since its t-ratio = 7.21
  R² increases from 77% to 84% and RMSE drops to 3.50
  Cost for carrying an additional 200 lbs. for 3000 miles ≈ 5.3 gals

Parameter Estimates
Term         Estimate   Std Error   t Ratio   Prob>|t|
Intercept     11.6843      1.7270      6.77
Weight(lb)     0.0089      0.0009     10.11
Horsepower     0.0884      0.0123      7.21

Related predictors: both typically increase together
  The implication is a higher SE for the Weight slope than in the prior fit.
  Picture explains the increase in SE due to restricted range (p 120).

Which predictor is more important?
  Statistical significance (which offers more incremental improvement)
  Substantive role of the coefficient in the model, $ impact on model.

Leverage plots (p 125) show outlying, leveraged cars
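The gasoline figures quoted in this example follow from the Weight slopes by simple arithmetic, since the response is in gallons per 1000 miles. The sketch below uses only the rounded coefficients printed in the two parameter-estimate tables; the dictionary and function names are hypothetical.

```python
# Coefficients copied from the tables (response: gallons per 1000 miles).
simple_fit = {"intercept": 9.4323, "weight": 0.0136}
multiple_fit = {"intercept": 11.6843, "weight": 0.0089, "horsepower": 0.0884}

def extra_gallons(weight_slope, extra_lbs, miles):
    """Extra fuel for carrying extra_lbs over the given distance."""
    return weight_slope * extra_lbs * (miles / 1000.0)

# Predicted consumption for a 4000 lb design from the weight-only model.
pred_4000 = simple_fit["intercept"] + simple_fit["weight"] * 4000

# Fuel owed for a 200 lb passenger over 3000 miles, under each model.
cost_simple = extra_gallons(simple_fit["weight"], 200, 3000)
cost_multiple = extra_gallons(multiple_fit["weight"], 200, 3000)

print(f"prediction at 4000 lbs: {pred_4000:.2f} gallons per 1000 miles")
print(f"weight-only model:      {cost_simple:.2f} gallons")
print(f"weight + horsepower:    {cost_multiple:.2f} gallons")
```

The rounded coefficients give 8.16 and 5.34 gallons, matching the 8.2 and 5.3 in the notes, and a prediction of 63.83 against the reported 63.9; the small gap is presumably due to rounding of the printed coefficients. The drop from 8.2 to 5.3 gallons is the marginal-versus-partial distinction at work: once Horsepower is in the model, Weight no longer carries its share of the effect.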

Next steps
  What other factors are important for the design?
  How small can we make the RMSE?
  How do we avoid false positives by searching many predictors?

Key Take-Away Points

Role of outliers
  Leverage and influence terminology
  Outliers can nominally improve or weaken a fitted model.
  Require substantive insight to choose a course of action (keep, delete)

Multiple regression
  Model simultaneously combines predictors
  Distinguish partial vs. marginal slope: Which is the right one to use?
  New role for the t-ratio as a measure of incremental value.

Collinearity
  Predictors are correlated, making it hard to separate effects.
  Partial effect (multiple) is not the same as marginal (simple)

Graphical diagnostics
  Plot of residuals on fitted (or predicted) values
  Leverage plots

Next Time
  More multiple regression and more collinearity
  Growing the model with the addition of more predictors
  Emphasizing the impact of collinearity on the fitted model