Lecture 13: Identifying unusual observations

In lecture 12, we learned how to investigate variables. Now we learn how to investigate cases.


Goal: Find unusual cases that might be mistakes, or that might strongly influence results.

3 types of unusual cases:
1. Cases with high leverage have one or more extreme explanatory variable values. (Unusual X values)
2. Outliers do not fit the trend of the rest of the data; they are identified by having large residuals. (Unusual Y values)
3. Influential cases have a strong impact on some aspect of the regression: predicted values, $R^2$, test results, etc.

Outliers and high leverage cases might be influential.

How to Identify Unusual Cases

Easy to do visually in simple linear regression, but we need numerical measures to find them in multiple regression.

Identifying high leverage cases:
Definition: For simple linear regression, the leverage or hat value for case i is
$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}$

Notes (for simple linear regression only)
1. $\sum_{i=1}^{n} h_i = 2$. (Show why on board.)
2. From #1, clearly the average is $2/n$. We will identify:
   high leverage cases as those with $h_i > 2 \times$ average, so $h_i > 4/n$
   extremely high leverage cases as those with $h_i > 6/n$
3. Leverage depends on the x values only, not the y values.
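The leverage definition and the 4/n and 6/n cutoffs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the x values are made up (they are not the lecture's data), and the function name `leverages` is hypothetical.

```python
# Hypothetical x values (made up for illustration; not the lecture's data).
# The last case sits far from the others on the x axis.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]


def leverages(x):
    """Hat values h_i = 1/n + (x_i - xbar)^2 / Sxx for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]


h = leverages(x)
n = len(x)

# Cutoffs from the notes: 2 x average = 4/n (high), 6/n (extremely high).
high = [i for i, hi in enumerate(h) if hi > 4 / n]
extreme = [i for i, hi in enumerate(h) if hi > 6 / n]

print("sum of leverages:", round(sum(h), 6))  # should equal 2
print("high leverage cases (0-based index):", high)
print("extremely high leverage cases (0-based index):", extreme)
```

Only the x values enter the calculation, which is note 3 in action: changing y would not change any $h_i$.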

EXAMPLE 1: A high leverage case (simple linear regression)
(This and the next few examples are from the Penn State online regression course.)

n = 21
High leverage is 4/21 = 0.19
Extreme leverage is 6/21 = 0.29
Leverage for red point is 0.36. Extreme leverage! But is it influential?

[Figure: leverage for the x values, with them displayed on the x axis only]

Measure              With high leverage case   Without it
Adjusted R^2 (%)     97.62                     97.17
sqrt(MSE)            2.7                       2.6
Estimated slope      4.93                      5.12
s.e.(slope)          0.172                     0.200

So the case is not influential, even though it has high leverage.
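The with/without comparison in the table above can be reproduced for any data set by fitting the model twice. The sketch below assumes simple linear regression and uses made-up data in which the last case has an extreme x value but follows the trend; the function name `summarize` and the dictionary keys are hypothetical.

```python
import math


def summarize(x, y):
    """Fit y = b0 + b1*x by least squares and return the table's four measures."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    mse = sse / (n - 2)
    return {
        "adj_r2_pct": 100 * (1 - mse / (sst / (n - 1))),  # adjusted R^2 in percent
        "s": math.sqrt(mse),                               # sqrt(MSE)
        "slope": b1,
        "se_slope": math.sqrt(mse / sxx),
    }


# Hypothetical data (made up): last case has high leverage but fits the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 40.1]

with_case = summarize(x, y)
without_case = summarize(x[:-1], y[:-1])

print(f"{'measure':12s} {'with':>10s} {'without':>10s}")
for name in with_case:
    print(f"{name:12s} {with_case[name]:10.3f} {without_case[name]:10.3f}")
```

If the two columns are close, as they are for this made-up example, the case has high leverage without being influential.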

Leverage for Multiple Regression

Now it's the combination of x values for Case i that determines its leverage.
No longer easy to write the formula (unless we use matrices).
Idea remains the same: high values of $h_i$ indicate large distance from other points for the combination of x values for that case.
With k explanatory variables (so k + 1 coefficients), the sum of the $h_i$ values is (k + 1), so the average is (k + 1)/n.
   high leverage cases are those with $h_i > 2 \times$ average, so $h_i > 2(k+1)/n$
   extremely high leverage cases are those with $h_i > 3(k+1)/n$
Leverage still depends only on the x values, not the y values.
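For reference, the matrix form alluded to above is usually written as follows. The symbols $X$, $H$ and $x_i$ are not defined in the lecture and are introduced here as assumptions, following the standard textbook notation.

```latex
% X is the n x (k+1) design matrix: a column of 1s plus the k explanatory
% variables. x_i^T denotes its i-th row. H is the "hat" matrix.
H = X (X^{\top} X)^{-1} X^{\top},
\qquad
h_i = H_{ii} = x_i^{\top} (X^{\top} X)^{-1} x_i

% The sum of the leverages is the trace of H, which gives k + 1:
\sum_{i=1}^{n} h_i
  = \operatorname{tr}(H)
  = \operatorname{tr}\!\left( (X^{\top} X)^{-1} X^{\top} X \right)
  = \operatorname{tr}(I_{k+1})
  = k + 1
```

With k = 1 this reduces to the simple linear regression result that the leverages sum to 2.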

More Notes about Leverage for Simple and Multiple Regression

$0 \le h_i \le 1$, always.
$\mathrm{Var}(e_i) = \sigma^2 (1 - h_i)$ for the residuals.
$\mathrm{Var}(\hat{y}_i) = \sigma^2 h_i$ for the predicted values.
So, large $h_i$ means that case has a small variance on the residual and a large variance on the predicted value.

Interpretation of the above: for the same set of x values, in repeated sampling of new y values, $\hat{y}_i$ at an x combination with high leverage will change a lot, but the residuals will be small. Can picture this for linear regression: the line will come close to the y value at that x, so the residual will be small.

Estimate of $\mathrm{Var}(e_i)$ = $\mathrm{MSE}\,(1 - h_i)$
Estimate of $\mathrm{Var}(\hat{y}_i)$ = $\mathrm{MSE}\, h_i$

OUTLIERS (Unusual Y values)

Identify using standardized and studentized residuals. For Case i, with residual $e_i = y_i - \hat{y}_i$:

Standardized residual for Case i = $stdres_i = \frac{e_i - 0}{\sqrt{\mathrm{MSE}\,(1 - h_i)}}$

Studentized residual for Case i = $stures_i = \frac{e_i - 0}{\sqrt{\mathrm{MSE}_{(i)}\,(1 - h_i)}}$

where $\mathrm{MSE}_{(i)}$ = MSE for the model fit without Case i.

NOTE: Some sources define this using $\hat{y}_{i(i)}$ as the predicted value, i.e. the fit for the model without Case i. Others call that the Studentized deleted residual.

Moderate outliers: Cases with absolute value of either of these > 2
Extreme outliers: Cases with absolute value of either of these > 3

EXAMPLE 2: Outlier
rstandard = 3.68
rstudent = 6.69
So the red point is clearly identified as an extreme outlier. Is it influential?

Measure              With outlier case   Without it
Adjusted R^2 (%)     90.13               97.17
sqrt(MSE)            4.7                 2.6
Estimated slope      5.04                5.12
s.e.(slope)          0.363               0.200

It barely changes the regression equation, but variability is reduced when it is removed, as would be expected!
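The standardized and studentized residuals can be computed directly from their definitions: divide each residual by sqrt(MSE (1 - h_i)), using the MSE from the full fit for the first and the MSE from the fit without Case i for the second. The sketch below assumes simple linear regression; the data are made up (one y value is well off the trend), and the helper names `fit` and `mse` are hypothetical.

```python
import math

# Hypothetical data (made up): the case at index 5 has a y value far off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 20.0, 14.2, 15.8, 18.1, 19.9]


def fit(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def mse(x, y):
    """MSE = SSE / (n - 2) for simple linear regression."""
    b0, b1 = fit(x, y)
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return sse / (len(x) - 2)


n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]  # leverages

b0, b1 = fit(x, y)
e = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]     # residuals from the full fit
mse_all = mse(x, y)

# Standardized residual: uses the MSE from the full fit.
stdres = [e[i] / math.sqrt(mse_all * (1 - h[i])) for i in range(n)]

# Studentized residual: uses MSE(i), the MSE from the fit without Case i.
stures = []
for i in range(n):
    mse_i = mse(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    stures.append(e[i] / math.sqrt(mse_i * (1 - h[i])))

for i in range(n):
    print(f"case {i}: stdres = {stdres[i]:7.2f}   stures = {stures[i]:7.2f}")
```

Refitting n times is wasteful for large data sets, but it keeps the code a literal reading of the definitions. Statistical packages obtain the same numbers without refitting.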

New Measure, Combining Both Ideas

Cook's distance combines leverage and outlier measures.

$D_i = \frac{1}{k+1}\,(stdres_i)^2\,\frac{h_i}{1 - h_i}$

Large Cook's distance implies large stdres or large leverage or both.
Flag (i.e. identify) cases with Cook's distance > 0.5 for moderate, or > 1 for extreme.

EXAMPLE 1: Cook's distance for the high leverage point is 0.702.
EXAMPLE 2: Cook's distance for the outlier is 0.36.

Another version of the formula (not in book), easier to see why it works:

Define $\hat{y}_{j(i)}$ = predicted $y_j$ using model without Case i. In other words:
   Remove case i
   Fit model
   Use it to predict all of the cases, j = 1, ..., n

Then
$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{(k+1)\,\mathrm{MSE}}$

It's the distance (squared and normalized) between the predicted values for all cases, using the model with Case i included, and the model without Case i included.
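A quick numerical check that the two versions of Cook's distance agree: one built from the standardized residual and leverage, (stdres_i^2 / (k+1)) * h_i / (1 - h_i), and one built from refitting without Case i and summing the squared changes in predicted values, divided by (k+1) * MSE. This is a sketch for simple linear regression (k = 1) on made-up data in which the last case has both an extreme x and a y off the trend. The helper name `fit` is hypothetical.

```python
# Hypothetical data (made up): last case has extreme x and does not follow the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 15.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8, 18.1, 20.0]


def fit(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


n, k = len(x), 1
b0, b1 = fit(x, y)
yhat = [b0 + b1 * xi for xi in x]
e = [yi - fi for yi, fi in zip(y, yhat)]
mse = sum(ei ** 2 for ei in e) / (n - k - 1)

xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Version 1: standardized residual squared, scaled by h_i / (1 - h_i).
cooks_a = [
    (e[i] ** 2 / (mse * (1 - h[i]))) / (k + 1) * h[i] / (1 - h[i])
    for i in range(n)
]

# Version 2: refit without Case i, then compare predictions for all n cases.
cooks_b = []
for i in range(n):
    a0, a1 = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    shift = sum((yhat[j] - (a0 + a1 * x[j])) ** 2 for j in range(n))
    cooks_b.append(shift / ((k + 1) * mse))

for i in range(n):
    flag = "extreme" if cooks_a[i] > 1 else "moderate" if cooks_a[i] > 0.5 else ""
    print(f"case {i}: D = {cooks_a[i]:8.4f}  (refit version {cooks_b[i]:8.4f})  {flag}")
```

The first version needs only one fit, which is why it is the one usually computed. The second shows what the number means.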

EXAMPLE 3: This point has:
Leverage = 0.31
Std. residual = -4.23
Cook's D = 4.05
All extreme! Let's see what happens when it's removed.

Measure              With case   Without it
Adjusted R^2 (%)     52.84       97.17
sqrt(MSE)            10.4        2.6
Estimated slope      3.32        5.12
s.e.(slope)          0.686       0.200

NEXT: Diagnostics in R, then Real estate example.