Topic 8: Model Diagnostics


Outline
- Diagnostics to check model assumptions
  - Diagnostics concerning X
  - Diagnostics using the residuals

Diagnostics and remedial measures
- Diagnostics: look at the data to diagnose situations where the assumptions of our model are violated
  - When assumptions are violated, inference cannot be trusted
- Remedies: changes in analytic strategy to fix or adjust for these problems

Look at the data
- Before trying to describe the relationship between a response variable (Y) and an explanatory variable (X), we should look at the distributions of these variables
- We should always look at X
- If Y depends on X, looking at Y alone may not be very informative
  - Looking only at the marginal distribution of Y is a common mistake

Diagnostics for X
- If X has many values, use Proc Univariate to get numerical summaries (e.g., mean, median, quartiles)
- If X has only a few values, use Proc Freq or the Freq option in Proc Univariate to get summaries (e.g., percentages, counts)

Diagnostics for X
- Examine the distribution of X
  - Is it skewed?
  - Are there outliers?
- Do the values of X depend on time (i.e., the order in which they were collected)?
  - If so, this suggests possible confounding factors

What's the concern?
- Model estimates are based on means and sums of squares
- These numerical summaries are not robust to outliers
  - Outliers can inflate the variance or influence the trend
- Confounders may alter the relation between X and Y

Important Statistics
- Mean
- Standard deviation
- Skewness
- Kurtosis
- Range

Example: Toluca lot size data
  data toluca;
    infile '../data/ch01ta01.txt';
    input lotsize hours;
    seq=_n_;   /* observation order, used later to check for a time trend */
  proc univariate data=toluca plot;
    var lotsize;
  run;

Crude Plots
  Stem Leaf   #  Boxplot
    12 0      1
    11 00     2
    10 00     2
     9 0000   4  +-----+
     8 000    3
     7 000    3  *--+--*
     6 0      1
     5 000    3  +-----+
     4 00     2
     3 000    3
     2 0      1
       ----+----+----+----+
  Multiply Stem.Leaf by 10**+1

Moments
  N                25           Sum Weights        25
  Mean             70           Sum Observations   1750
  Std Deviation    28.7228132   Variance           825
  Skewness         -0.1032081   Kurtosis           -1.0794107
  Uncorrected SS   142300       Corrected SS       19800
  Coeff Variation  41.0325903   Std Error Mean     5.74456265
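As a cross-check on the Moments output, here is a minimal Python sketch (not part of the course's SAS programs). It assumes the lot sizes can be read off the stem-and-leaf display as stem.leaf times 10, and it assumes the usual bias-adjusted sample formulas for skewness and excess kurtosis; the function name `moments` is illustrative only.

```python
import math

# Lot sizes read off the stem-and-leaf display (value: count).
# Only the counts matter for the moments, not the original order.
counts = {20: 1, 30: 3, 40: 2, 50: 3, 60: 1, 70: 3,
          80: 3, 90: 4, 100: 2, 110: 2, 120: 1}
lotsize = [value for value, k in counts.items() for _ in range(k)]


def moments(x):
    """Return n, mean, SD, skewness, and excess kurtosis of a sample."""
    n = len(x)
    mean = sum(x) / n
    css = sum((v - mean) ** 2 for v in x)      # corrected sum of squares
    sd = math.sqrt(css / (n - 1))
    z = [(v - mean) / sd for v in x]           # standardized values
    # Bias-adjusted sample skewness and excess kurtosis (assumed formulas).
    skew = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return {"n": n, "mean": mean, "sd": sd, "skewness": skew, "kurtosis": kurt}


m = moments(lotsize)
for name, value in m.items():
    print(f"{name:9s} {value:.7f}")
```

Both skewness and kurtosis come out close to zero or mildly negative here, which is consistent with a roughly symmetric, somewhat flat distribution of lot sizes.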

Location and Spread
  Basic Statistical Measures
  Location               Variability
  Mean     70.00000      Std Deviation         28.72281
  Median   70.00000      Variance             825.00000
  Mode     90.00000      Range                100.00000
                         Interquartile Range   40.00000

Quantiles (Definition 5)
  Quantile      Estimate
  100% Max      120
  99%           120
  95%           110
  90%           110
  75% Q3         90
  50% Median     70
  25% Q1         50
  10%            30
  5%             30
  1%             20
  0% Min         20
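The quantile table can be reproduced with a short Python sketch. It assumes that "Definition 5" is the empirical-distribution-function rule with averaging: write n·p = j + g with integer part j, then take the average of the j-th and (j+1)-th order statistics when g = 0 and the (j+1)-th order statistic otherwise. The function name `quantile_def5` and the integer-percentage interface are illustrative choices, and the data are again read off the stem-and-leaf display.

```python
# Lot sizes read off the stem-and-leaf display (value: count).
counts = {20: 1, 30: 3, 40: 2, 50: 3, 60: 1, 70: 3,
          80: 3, 90: 4, 100: 2, 110: 2, 120: 1}
lotsize = sorted(value for value, k in counts.items() for _ in range(k))


def quantile_def5(sorted_x, pct):
    """Percentile via the EDF-with-averaging rule (assumed Definition 5).

    pct is an integer percentage (0 to 100), so n*pct/100 splits exactly
    into an integer part j and a remainder, with no floating-point error.
    """
    n = len(sorted_x)
    j, rem = divmod(n * pct, 100)
    if j == 0:
        return sorted_x[0]                  # below the first order statistic
    if j >= n:
        return sorted_x[-1]                 # 100th percentile is the maximum
    if rem == 0:
        return (sorted_x[j - 1] + sorted_x[j]) / 2   # average x_(j), x_(j+1)
    return sorted_x[j]                      # x_(j+1) in 1-based notation


for pct in (100, 99, 95, 90, 75, 50, 25, 10, 5, 1, 0):
    print(f"{pct:3d}%  {quantile_def5(lotsize, pct)}")
```

The interquartile range in the Location and Spread table is the difference between the 75% and 25% values from this rule.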

Extreme Observations
  ----Lowest----    ----Highest---
  Value    Obs      Value    Obs
     20     14        100      9
     30     21        100     16
     30     17        110     15
     30      2        110     20
     40     23        120      7

SAS CODE: Trend in order?
  symbol1 v=circle i=sm70;
  proc gplot data=toluca;   /* toluca is the data set that contains seq */
    plot lotsize*seq;
  run;

Normal distributions
- The regression model does not state that X comes from a single Normal distribution
  - X can follow any distribution
- The regression model does not state that Y comes from a single Normal distribution
  - We assume Y | X_i is Normal
- In some cases, however, X and/or Y may be Normal, and it can be useful to know this

Common plots
- Histograms
  - Bell-shaped? Symmetric?
  - Can often fit a Normal curve on top of the histogram to assess fit
- Box plots
- Normal quantile plots

Normal quantile plots
- Consider n = 5 observations iid N(0,1)
- From Table B.1, we find
  - P(z ≤ -.84) = .20
  - P(-.84 < z ≤ -.25) = .20
  - P(-.25 < z ≤ .25) = .20
  - P(.25 < z ≤ .84) = .20
  - P(z > .84) = .20

Normal quantile plots
- So we expect
  - One observation ≤ -.84
  - One observation in (-.84, -.25)
  - One observation in (-.25, .25)
  - One observation in (.25, .84)
  - One observation > .84
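The five equal-probability intervals can be checked without a table. This is a small Python sketch using the standard library's `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)
n = 5

# Cut points that split N(0,1) into n intervals of probability 1/n each.
cutoffs = [std_normal.inv_cdf(k / n) for k in range(1, n)]
print([round(c, 2) for c in cutoffs])   # [-0.84, -0.25, 0.25, 0.84]

# Probability of each interval, recovered from the cut points.
edges = [float("-inf")] + cutoffs + [float("inf")]
probs = [std_normal.cdf(b) - std_normal.cdf(a) for a, b in zip(edges, edges[1:])]
print([round(p, 2) for p in probs])     # [0.2, 0.2, 0.2, 0.2, 0.2]
```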

Normal quantile plots
- Use a similar idea to define expected Normal scores for a given sample size n
  - Z_i = Φ^{-1}((i - .375)/(n + .25)), i = 1 to n
- Plot the order statistics X_(i) vs Z_i
- KNNL plots X_(i) vs s Z_i
  - This rescaling doesn't affect the nature of the plot

Normal quantile plots
- The standardized X variable is z = (X - μ)/σ
- So, X = μ + σ z
- If the data are approximately Normal, the relationship will be approximately linear with slope close to σ and intercept close to μ
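To make the slope and intercept interpretation concrete, the following Python sketch computes the expected Normal scores Z_i = Φ^{-1}((i - .375)/(n + .25)) for the lot sizes and fits a least-squares line of the order statistics on the scores. It is an illustration under the assumption that the lot sizes are those shown in the stem-and-leaf display; it is not the computation Proc Univariate performs for its plot.

```python
from statistics import NormalDist

# Lot sizes read off the stem-and-leaf display (value: count).
counts = {20: 1, 30: 3, 40: 2, 50: 3, 60: 1, 70: 3,
          80: 3, 90: 4, 100: 2, 110: 2, 120: 1}
x_sorted = sorted(value for value, k in counts.items() for _ in range(k))
n = len(x_sorted)

# Expected Normal scores Z_i for i = 1..n.
z = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

# Least-squares line of the order statistics X_(i) on the scores Z_i.
z_bar = sum(z) / n
x_bar = sum(x_sorted) / n
slope = (sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x_sorted))
         / sum((zi - z_bar) ** 2 for zi in z))
intercept = x_bar - slope * z_bar

print(f"slope     = {slope:.2f}  (compare with the sample SD, about 28.72)")
print(f"intercept = {intercept:.2f}  (compare with the sample mean, 70)")
```

Because the scores are symmetric about zero, the fitted intercept equals the sample mean; the slope is only approximately the standard deviation.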

SAS CODE
  proc univariate data=toluca plot;
    var lotsize;
    qqplot lotsize;
  run;

Diagnostics for residuals
- Model: Y_i = β0 + β1 X_i + ε_i
- Predicted values: Ŷ_i = b0 + b1 X_i
- Residuals: e_i = Y_i - Ŷ_i
  - So, Y_i = Ŷ_i + e_i
- The residuals e_i should be similar to the errors ε_i
- The model assumes ε_i iid N(0, σ²)
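The definitions above translate directly into a few lines of code. This Python sketch uses a small hypothetical data set (the five x, y pairs are made up for illustration) and an illustrative helper `fit_line` that computes the least-squares estimates b0 and b1.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]               # predicted values
resid = [yi - fi for yi, fi in zip(y, fitted)]    # residuals e_i

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print("sum of residuals:", round(sum(resid), 10))
print("sum of x * residual:", round(sum(xi * e for xi, e in zip(x, resid)), 10))
```

The residuals sum to zero and are uncorrelated with X by construction, which is why patterns in a residual plot point to a model problem rather than to the fitting procedure.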

Plot Plot Plot PLOT PLOT PLOT Plot

Questions addressed by diagnostics for residuals
- Is the relationship linear?
- Is the variance constant?
- Are the errors Normal?
- Are the errors dependent?
- Are there outliers?

Is the Relationship Linear?
- Plot Y vs X
- Plot e vs X (residual plot)
- The residual plot better emphasizes deviations from a linear pattern
- Recall the Topic 2 scatterplot and residual plot

SAS CODE: Fake #1
  libname xxx '../data';
  Data xxx.a100;
    do x=1 to 30;
      y=x*x-10*x+30+25*normal(0);
      output;
    end;
  run;
- Generates a data set where Y = X² - 10X + 30
- Errors are Normally distributed with standard deviation 25

SAS CODE
  proc reg data=xxx.a100;
    model y=x;
    output out=a2 r=resid;
  run;

OUTPUT
  Analysis of Variance
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              1          1032098       1032098    170.95   <.0001
  Error             28           169048    6037.41596
  Corrected Total   29          1201145

  Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1           -145.37495         29.09684     -5.00     <.0001
  x            1             21.42943          1.63899     13.07     <.0001

A significant positive relationship!!

SAS CODE: Visual Checks
  /* Scatterplot with regression line */
  symbol1 v=circle i=rl;
  proc gplot data=a2;
    plot y*x;
  run;
  /* Scatterplot with smoothed curve */
  symbol1 v=circle i=sm60;
  proc gplot data=a2;
    plot y*x;
  /* Residual plot */
  proc gplot data=a2;
    plot resid*x / vref=0;
  run;

Does not appear to be linear

Nonlinear behavior easier to see here?!
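The same point can be seen numerically. The Python sketch below mimics the Fake #1 simulation (it is not the SAS program, and the seed is an arbitrary choice made only for reproducibility), fits a straight line, and averages the residuals over the low, middle, and high ranges of X. The helper `fit_line` is illustrative.

```python
import random


def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def mean(values):
    return sum(values) / len(values)


random.seed(0)   # arbitrary seed, fixed only so the run is reproducible
x = list(range(1, 31))
y = [xi * xi - 10 * xi + 30 + 25 * random.gauss(0, 1) for xi in x]

b0, b1 = fit_line(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Average residual in the low, middle, and high ranges of X.
low, mid, high = mean(resid[:5]), mean(resid[10:20]), mean(resid[-5:])
print(f"slope = {b1:.2f}")
print(f"mean residual: low X = {low:.1f}, middle X = {mid:.1f}, high X = {high:.1f}")
```

A straight-line fit to curved data leaves residuals that are positive at both ends and negative in the middle, which is the U-shaped pattern the residual plot makes obvious even though the slope test is highly significant.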

Does the variance differ across X?
- Plot Y vs X
- Plot e vs X
- The plot of e vs X will emphasize problems with the variance assumption

SAS CODE: Fake #2
  libname xxx '../data';
  Data xxx.a100a;
    do x=1 to 100;
      y=30+100*x+10*x*normal(0);
      output;
    end;
  run;
- Generates a data set where Y = 30 + 100X
- Errors are Normally distributed with standard deviation 10X

SAS CODE
  proc reg data=xxx.a100a;
    model y=x;
    output out=a2 r=resid;
  run;

OUTPUT
  Analysis of Variance
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              1        856723171     856723171   1682.55   <.0001
  Error             98         49899722        509181
  Corrected Total   99        906622893

  Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1             13.80557        143.79092      0.10     0.9237
  x            1            101.39875          2.47200     41.02     <.0001

A significant positive relationship!! Estimate close to the true value!!

SAS CODE: Visual Checks
  /* Scatterplot with smoothed curve */
  symbol1 v=circle i=sm60;
  proc gplot data=a2;
    plot y*x;
  /* Residual plot */
  proc gplot data=a2;
    plot resid*x / vref=0;
  run;

So what?!
- Why is non-constant variance an issue here?
  - Trend appears linear
  - Estimates close to the truth
- Answer:
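One way to explore the question is to look at how the residual spread changes with X. The Python sketch below mimics the Fake #2 simulation (it is not the SAS program; the seed is arbitrary and the helper names are illustrative) and compares the root mean square residual in the lower and upper halves of X.

```python
import math
import random


def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def rms(values):
    """Root mean square, used here as a simple measure of residual spread."""
    return math.sqrt(sum(v * v for v in values) / len(values))


random.seed(0)   # arbitrary seed, fixed only so the run is reproducible
x = list(range(1, 101))
y = [30 + 100 * xi + 10 * xi * random.gauss(0, 1) for xi in x]

b0, b1 = fit_line(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

spread_low = rms(resid[:50])    # X from 1 to 50
spread_high = rms(resid[50:])   # X from 51 to 100
print(f"slope = {b1:.2f}")
print(f"residual spread: low X = {spread_low:.0f}, high X = {spread_high:.0f}")
```

The regression reports a single error standard deviation (the root MSE), but the spread clearly differs between the two halves. Standard errors, confidence intervals, and prediction intervals built on that single value are therefore questionable even when the fitted line itself looks fine, in line with the earlier point that inference cannot be trusted when assumptions are violated.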

Are the errors Normal?
- The real question is whether the distribution of the errors is far enough away from Normal to invalidate our confidence intervals and significance tests
- Look at the distribution of the residuals
- Use a Normal quantile plot
- Be wary of tests of Normality

SAS CODE
  data a1;
    infile '..\data\ch01ta01.txt';
    input lotsize hours;
  proc reg data=a1;
    model hours=lotsize;
    output out=a2 r=resid;
  proc univariate data=a2 plot normal;
    var resid;
    histogram resid / normal kernel;
    qqplot resid;
  run;

Univariate Output
  Fitted Normal Distribution for resid

  Parameters for Normal Distribution
  Parameter   Symbol   Estimate
  Mean        Mu       0
  Std Dev     Sigma    47.79534

  Goodness-of-Fit Tests for Normal Distribution
  Test                  ----Statistic-----    ------p Value------
  Kolmogorov-Smirnov    D      0.09571960     Pr > D       >0.150
  Cramer-von Mises      W-Sq   0.03326349     Pr > W-Sq    >0.250
  Anderson-Darling      A-Sq   0.20714170     Pr > A-Sq    >0.250

No obvious deviations from Normality, as all P-values are greater than 0.05

Dependent Errors
- Usually we see this in a plot of residuals vs time order (KNNL) or seq (our SAS variable)
- We can have trends and/or cyclical effects in the residuals
  - Observations are then not independent
- If you are interested, read KNNL pgs 108-110

Are there outliers?
- Plot Y vs X
- Plot e vs X
- The plot of e vs X should emphasize an outlier

SAS CODE: Fake #3
  Data xxx.a100b1;
    do x=1 to 100 by 5;
      y=30+50*x+200*normal(0);
      output;
    end;
    /* one added outlier at x=50, 10000 above the true line */
    x=50; y=30+50*50+10000; d='out';
    output;
  run;
- Generates a data set where Y = 30 + 50X
- Errors are Normally distributed with standard deviation 200

SAS CODE
  /* Fit without the outlier */
  proc reg data=xxx.a100b1;
    model y=x;
    where d ne 'out';
  run;
  /* Fit with the outlier */
  proc reg data=xxx.a100b1;
    model y=x;
    output out=a2 r=resid;
  run;

Without Outlier
  Analysis of Variance
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              1         42426770      42426770    894.59   <.0001
  Error             18           853668         47426
  Corrected Total   19         43280438

  Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1             -2.54677         95.29715     -0.03     0.9790
  x            1             50.51719          1.68899     29.91     <.0001

s = 217.8

With Outlier
  Analysis of Variance
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              1         43888843      43888843      8.67   0.0083
  Error             19         96206895       5063521
  Corrected Total   20        140095738

  Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1            432.20263        979.57661      0.44     0.6640
  x            1             51.37694         17.45089      2.94     0.0083

s = 2250.2

SAS CODE: Visual Checks
  symbol1 v=circle i=rl;
  proc gplot data=a2;
    plot y*x;
  proc gplot data=a2;
    plot resid*x / vref=0;
  run;

Different kinds of outliers
- The outlier in the last example influenced the intercept but not the slope
- It inflated all of our standard errors
- Here is an example of an outlier that influences the slope

SAS CODE
  Data xxx.a100c1;
    do x=1 to 100 by 5;
      y=30+50*x+200*normal(0);
      output;
    end;
    /* one added outlier at x=100, 10000 below the true line */
    x=100; y=30+50*100-10000; d='out';
    output;
  run;

SAS CODE
  /* Fit without the outlier */
  proc reg data=xxx.a100c1;
    model y=x;
    where d ne 'out';
  run;
  /* Fit with the outlier */
  proc reg data=xxx.a100c1;
    model y=x;
    output out=a2 r=resid;
  run;

Without Outlier
  Analysis of Variance
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              1         41233447      41233447    901.15   <.0001
  Error             18           823612         45756
  Corrected Total   19         42057060

  Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1             73.28061         93.60451      0.78     0.4439
  x            1             49.80168          1.65899     30.02     <.0001

With Outlier
  Analysis of Variance
  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              1         11151297      11151297      2.53   0.1285
  Error             19         83888277       4415172
  Corrected Total   20         95039574

  Parameter Estimates
  Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
  Intercept    1            903.97793        899.32018      1.01     0.3274
  x            1             24.13057         15.18374      1.59     0.1285

SAS CODE: Visual Checks
  symbol1 v=circle i=rl;
  proc gplot data=a2;
    plot y*x;
  proc gplot data=a2;
    plot resid*x / vref=0;
  run;
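The contrast between the two outlier examples can be isolated by removing the random noise. The Python sketch below is a simplification, not the SAS simulation: it places the 20 regular points exactly on the line Y = 30 + 50X (the noise term is deliberately omitted so the result is deterministic) and then adds each outlier in turn. The helper `fit_line` is illustrative.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


x = list(range(1, 101, 5))        # 1, 6, ..., 96, as in the SAS do-loop
y = [30 + 50 * xi for xi in x]    # noise term omitted on purpose

b0_clean, b1_clean = fit_line(x, y)

# Outlier near the middle of the X range (x = 50, 10000 above the line).
b0_mid, b1_mid = fit_line(x + [50], y + [30 + 50 * 50 + 10000])

# Outlier at the edge of the X range (x = 100, 10000 below the line).
b0_end, b1_end = fit_line(x + [100], y + [30 + 50 * 100 - 10000])

print(f"no outlier:        b0 = {b0_clean:8.2f}, b1 = {b1_clean:6.2f}")
print(f"outlier at x = 50: b0 = {b0_mid:8.2f}, b1 = {b1_mid:6.2f}")
print(f"outlier at x = 100: b0 = {b0_end:8.2f}, b1 = {b1_end:6.2f}")
```

An outlier close to the mean of X (48.5 for the regular points) mostly shifts the intercept, while one far from the mean of X pulls the slope strongly. The sizes of the shifts are similar to the differences between the "Without Outlier" and "With Outlier" estimates in the SAS output.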

Background Reading
- Program topic8.sas has code for the Proc Univariate diagnostics of X
- Program residualchecks.sas has the residual analysis
- The permanent SAS data sets are a100.sas7bdat, a100a.sas7bdat, a100b1.sas7bdat, and a100c1.sas7bdat
- Read Sections 3.8 and 3.9