Topic 8: Model Diagnostics

Outline
Diagnostics to check model assumptions
  Diagnostics concerning X
  Diagnostics using the residuals

Diagnostics and remedial measures
Diagnostics: look at the data to diagnose situations where the assumptions of our model are violated
  When assumptions are violated, inference cannot be trusted
Remedies: changes in analytic strategy to fix or adjust for these problems

Look at the data
Before trying to describe the relationship between a response variable (Y) and an explanatory variable (X), we should look at the distributions of these variables
We should always look at X
If Y depends on X, looking at Y alone may not be very informative
  Looking marginally at Y is a common mistake

Diagnostics for X
If X has many values, use Proc Univariate to get numerical summaries (e.g., mean, median, quartiles)
If X has only a few values, use Proc Freq or the Freq option in Proc Univariate to get summaries (e.g., percentages, counts)

Diagnostics for X
Examine the distribution of X
  Is it skewed?
  Are there outliers?
Do the values of X depend on time (i.e., the order in which they were collected)?
  A time dependence suggests possible confounding factors

What's the concern?
Model estimates are based on means and sums of squares
  These numerical summaries are not robust to outliers
  Outliers can inflate the variance or influence the trend
Confounders may alter the relation between X and Y

Important Statistics
Mean
Standard deviation
Skewness
Kurtosis
Range

Example: Toluca lot size
    data toluca;
      infile '../data/ch01ta01.txt';
      input lotsize hours;
      seq = _n_;   * observation order, used to check for trends;
    run;
    proc univariate data=toluca plot;
      var lotsize;
    run;

Crude Plots
    Stem Leaf    #   Boxplot
      12 0       1
      11 00      2
      10 00      2
       9 0000    4   +-----+
       8 000     3
       7 000     3   *--+--*
       6 0       1
       5 000     3   +-----+
       4 00      2
       3 000     3
       2 0       1
         ----+----+----+----+
    Multiply Stem.Leaf by 10**+1

Moments
    N                       25    Sum Weights              25
    Mean                    70    Sum Observations       1750
    Std Deviation   28.7228132    Variance                825
    Skewness        -0.1032081    Kurtosis         -1.0794107
    Uncorrected SS      142300    Corrected SS          19800
    Coeff Variation 41.0325903    Std Error Mean   5.74456265

Location and Spread
Basic Statistical Measures
    Location                Variability
    Mean     70.00000       Std Deviation          28.72281
    Median   70.00000       Variance              825.00000
    Mode     90.00000       Range                 100.00000
                            Interquartile Range    40.00000

Quantiles (Definition 5)
    Quantile      Estimate
    100% Max           120
    99%                120
    95%                110
    90%                110
    75% Q3              90
    50% Median          70
    25% Q1              50
    10%                 30
    5%                  30
    1%                  20
    0% Min              20

Extreme Observations
    ----Lowest----    ----Highest---
    Value    Obs      Value    Obs
       20     14        100      9
       30     21        100     16
       30     17        110     15
       30      2        110     20
       40     23        120      7
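The moments in the Proc Univariate output can be checked by hand. The following is a minimal Python sketch (not the course's SAS), assuming the 25 lot sizes are exactly the values read off the stem-and-leaf display (stem times 10) and assuming the bias-adjusted skewness and kurtosis formulas that use the n - 1 divisor. The names `counts` and `moments` are illustrative only.

```python
import statistics

# Lot sizes read off the stem-and-leaf display: value -> number of leaves.
counts = {20: 1, 30: 3, 40: 2, 50: 3, 60: 1, 70: 3,
          80: 3, 90: 4, 100: 2, 110: 2, 120: 1}
lotsize = [value for value, k in counts.items() for _ in range(k)]

def moments(x):
    """Return n, mean, sample SD, adjusted skewness, adjusted excess kurtosis."""
    n = len(x)
    mean = statistics.fmean(x)
    sd = statistics.stdev(x)                      # divisor n - 1
    z = [(v - mean) / sd for v in x]              # standardized values
    skew = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return n, mean, sd, skew, kurt

n, mean, sd, skew, kurt = moments(lotsize)
print(n, mean, round(sd, 4), round(skew, 4), round(kurt, 4))
```

Under these assumptions the printed values should agree with the Moments table. The skewness is near zero (roughly symmetric) and the kurtosis is negative (flatter than a Normal).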

SAS CODE FOR TREND IN ORDER?
    symbol1 v=circle i=sm70;
    proc gplot data=toluca;
      plot lotsize*seq;
    run;

Normal distributions
The regression model does not state that X comes from a single Normal distribution
  X can follow any distribution
The regression model does not state that Y comes from a single Normal distribution
  We assume only that $Y \mid X_i$ is Normal
In some cases, however, X and/or Y may be Normal, and it can be useful to know this

Common plots
Histograms
  Bell-shaped? Symmetric?
  Can often fit a Normal curve on top of the histogram to assess fit
Box plots
Normal quantile plots

Normal quantile plots
Consider n = 5 observations iid N(0, 1)
From Table B.1, we find
  $P(z \le -0.84) = 0.20$
  $P(-0.84 < z \le -0.25) = 0.20$
  $P(-0.25 < z \le 0.25) = 0.20$
  $P(0.25 < z \le 0.84) = 0.20$
  $P(z > 0.84) = 0.20$

Normal quantile plots
So we expect
  One observation $\le -0.84$
  One observation in $(-0.84, -0.25)$
  One observation in $(-0.25, 0.25)$
  One observation in $(0.25, 0.84)$
  One observation $> 0.84$

Normal quantile plots
Use a similar idea to define the expected Normal scores for a given sample size n
  $Z_i = \Phi^{-1}\left(\frac{i - 0.375}{n + 0.25}\right), \quad i = 1, \dots, n$
Plot the order statistics $X_{(i)}$ vs $Z_i$
KNNL plots $X_{(i)}$ vs $s Z_i$
  Doesn't affect the nature of the plot

Normal quantile plots
The standardized X variable is $z = (X - \mu)/\sigma$
So, $X = \mu + \sigma z$
If the data are approximately Normal, the relationship between $X_{(i)}$ and $Z_i$ will be approximately linear, with slope close to $\sigma$ and intercept close to $\mu$
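The expected Normal scores and the slope/intercept interpretation can be sketched numerically. This is a minimal Python illustration (not the course's SAS), assuming the lot sizes are the values read off the stem-and-leaf display shown earlier. The helper names `normal_scores` and `qq_line` are hypothetical.

```python
from statistics import NormalDist, fmean

def normal_scores(n):
    """Expected Normal scores Z_i = Phi^{-1}((i - 0.375) / (n + 0.25)), i = 1..n."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def qq_line(x):
    """Least-squares intercept and slope of the order statistics X_(i) on Z_i."""
    xs = sorted(x)
    z = normal_scores(len(xs))
    z_bar, x_bar = fmean(z), fmean(xs)
    slope = (sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, xs))
             / sum((zi - z_bar) ** 2 for zi in z))
    return x_bar - slope * z_bar, slope

# Lot sizes as read off the stem-and-leaf display (assumed, 25 values)
lotsize = [20, 30, 30, 30, 40, 40, 50, 50, 50, 60, 70, 70, 70,
           80, 80, 80, 90, 90, 90, 90, 100, 100, 110, 110, 120]

print([round(z, 2) for z in normal_scores(5)])   # scores for the n = 5 case
intercept, slope = qq_line(lotsize)
print(round(intercept, 1), round(slope, 1))      # compare with mean and SD
```

Because the scores are symmetric about zero, the fitted intercept equals the sample mean. The slope should land in the neighborhood of the sample standard deviation when the data are roughly Normal.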

SAS CODE
    proc univariate data=toluca plot;
      var lotsize;
      qqplot lotsize;
    run;

Diagnostics for residuals
Model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
Predicted values: $\hat{Y}_i = b_0 + b_1 X_i$
Residuals: $e_i = Y_i - \hat{Y}_i$
  So, $Y_i = \hat{Y}_i + e_i$
The residuals $e_i$ should be similar to the errors $\varepsilon_i$
The model assumes $\varepsilon_i$ iid $N(0, \sigma^2)$
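The definitions above can be made concrete with a short sketch. This is a minimal Python illustration (not the course's SAS) of fitting the line by least squares and forming the residuals. The toy data and the function names `fit_line` and `residuals` are hypothetical.

```python
from statistics import fmean

def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line Y = b0 + b1 X."""
    x_bar, y_bar = fmean(x), fmean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def residuals(x, y):
    """Residuals e_i = Y_i - Yhat_i from the fitted line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical toy data, not from the course files
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
e = residuals(x, y)

# Least squares forces both of these sums to zero (up to rounding error)
print(abs(sum(e)) < 1e-9)
print(abs(sum(xi * ei for xi, ei in zip(x, e))) < 1e-9)
```

The two constraints mean the residuals are not exactly independent even when the errors are, which is one reason the residuals are only "similar to" the errors.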

Plot Plot Plot PLOT PLOT PLOT Plot

Questions addressed by diagnostics for residuals
Is the relationship linear?
Is the variance constant?
Are the errors Normal?
Are the errors dependent?
Are there outliers?

Is the Relationship Linear?
Plot Y vs X
Plot e vs X (residual plot)
  The residual plot better emphasizes deviations from a linear pattern
Recall the scatterplots and residual plots from Topic 2

SAS CODE: Fake #1
    libname xxx '../data';
    data xxx.a100;
      do x = 1 to 30;
        y = x*x - 10*x + 30 + 25*normal(0);
        output;
      end;
    run;
Generates a data set where $Y = X^2 - 10X + 30$ plus error
Errors are Normally distributed with $\sigma = 25$

SAS CODE
    proc reg data=xxx.a100;
      model y=x;
      output out=a2 r=resid;
    run;

OUTPUT
Analysis of Variance
    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              1          1032098       1032098    170.95   <.0001
    Error             28           169048    6037.41596
    Corrected Total   29          1201145
Parameter Estimates
    Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept    1           -145.37495         29.09684     -5.00     <.0001
    x            1             21.42943          1.63899     13.07     <.0001
A significant positive relationship!!

SAS CODE: Visual Checks
    * Scatterplot with regression line;
    symbol1 v=circle i=rl;
    proc gplot data=a2;
      plot y*x;
    run;
    * Scatterplot with smoothed curve;
    symbol1 v=circle i=sm60;
    proc gplot data=a2;
      plot y*x;
    run;
    * Residual plot;
    proc gplot data=a2;
      plot resid*x / vref=0;
    run;

Does not appear to be linear

Nonlinear behavior easier to see here?!
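The same pattern can be seen numerically. This is a minimal Python sketch (not the course's SAS) that mimics Fake #1, assuming the same mean function and $\sigma = 25$ but a different random number generator, so the numbers will not match the SAS output. The seed and the function names are arbitrary choices.

```python
import random
from statistics import fmean

def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line Y = b0 + b1 X."""
    x_bar, y_bar = fmean(x), fmean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def fake1(seed=1, sigma=25.0):
    """Y = X^2 - 10X + 30 + Normal(0, sigma) error for X = 1..30."""
    rng = random.Random(seed)
    x = list(range(1, 31))
    y = [xi * xi - 10 * xi + 30 + rng.gauss(0, sigma) for xi in x]
    return x, y

x, y = fake1()
b0, b1 = fit_line(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

ends = fmean(resid[:5] + resid[-5:])   # average residual at both ends of X
middle = fmean(resid[10:20])           # average residual for X = 11..20
print(round(b1, 2), round(ends, 1), round(middle, 1))
```

A straight line fit to curved data leaves residuals that are systematically positive at the ends and negative in the middle. That is the U shape the residual plot makes visible, even though the slope itself is highly significant.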

Does the variance differ across X?
Plot Y vs X
Plot e vs X
  The plot of e vs X will emphasize problems with the variance assumption

SAS CODE: Fake #2
    libname xxx '../data';
    data xxx.a100a;
      do x = 1 to 100;
        y = 30 + 100*x + 10*x*normal(0);
        output;
      end;
    run;
Generates a data set where $Y = 30 + 100X$ plus error
Errors are Normally distributed with $\sigma = 10X$

SAS CODE
    proc reg data=xxx.a100a;
      model y=x;
      output out=a2 r=resid;
    run;

OUTPUT
Analysis of Variance
    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              1        856723171     856723171   1682.55   <.0001
    Error             98         49899722        509181
    Corrected Total   99        906622893
Parameter Estimates
    Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept    1             13.80557        143.79092      0.10     0.9237
    x            1            101.39875          2.47200     41.02     <.0001
A significant positive relationship!!
Estimate close to the true value!!

SAS CODE: Visual Checks
    * Scatterplot with smoothed curve;
    symbol1 v=circle i=sm60;
    proc gplot data=a2;
      plot y*x;
    run;
    * Residual plot;
    proc gplot data=a2;
      plot resid*x / vref=0;
    run;

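So what?!
Why is non-constant variance an issue here?
  Trend appears linear
  Estimates close to the truth
Answer: the standard errors, and with them the confidence intervals and tests, assume one common $\sigma$, so the inference cannot be trusted even when the estimates look fine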
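How far the residual spread departs from a single constant value can be checked directly. This is a minimal Python sketch (not the course's SAS) that mimics Fake #2, assuming the same mean function and error standard deviation $10X$ but a different random number generator. The seed, the split at X = 50, and the function names are arbitrary choices.

```python
import random
from statistics import fmean, stdev

def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line Y = b0 + b1 X."""
    x_bar, y_bar = fmean(x), fmean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def fake2(seed=2):
    """Y = 30 + 100 X + error, with error SD equal to 10 X, for X = 1..100."""
    rng = random.Random(seed)
    x = list(range(1, 101))
    y = [30 + 100 * xi + 10 * xi * rng.gauss(0, 1) for xi in x]
    return x, y

x, y = fake2()
b0, b1 = fit_line(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

low = stdev(resid[:50])    # residual spread for X = 1..50
high = stdev(resid[50:])   # residual spread for X = 51..100
print(round(b1, 2), round(low, 1), round(high, 1))
```

The slope estimate stays near 100, but the residual spread in the upper half of X is much larger than in the lower half. A single root MSE therefore overstates the uncertainty at small X and understates it at large X.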

Are the errors Normal?
The real question is whether the distribution of the errors is far enough away from Normal to invalidate our confidence intervals and significance tests
Look at the distribution of the residuals
Use a Normal quantile plot
Be wary of tests of Normality

SAS CODE
    data a1;
      infile '..\data\ch01ta01.txt';
      input lotsize hours;
    run;
    proc reg data=a1;
      model hours=lotsize;
      output out=a2 r=resid;
    run;
    proc univariate data=a2 plot normal;
      var resid;
      histogram resid / normal kernel;
      qqplot resid;
    run;

Univariate Output
Fitted Normal Distribution for resid
Parameters for Normal Distribution
    Parameter   Symbol   Estimate
    Mean        Mu              0
    Std Dev     Sigma    47.79534
Goodness-of-Fit Tests for Normal Distribution
    Test                  ----Statistic-----   ------p Value------
    Kolmogorov-Smirnov    D      0.09571960    Pr > D       >0.150
    Cramer-von Mises      W-Sq   0.03326349    Pr > W-Sq    >0.250
    Anderson-Darling      A-Sq   0.20714170    Pr > A-Sq    >0.250
No obvious deviations from Normality, as all P-values are greater than 0.05

Dependent Errors
Usually we see this in a plot of residuals vs time order (KNNL) or seq (our SAS variable)
We can have trends and/or cyclical effects in the residuals
  Observations are then not independent
If you are interested, read KNNL pp. 108-110

Are there outliers?
Plot Y vs X
Plot e vs X
  The plot of e vs X should emphasize an outlier

SAS CODE: Fake #3
    data xxx.a100b1;
      do x = 1 to 100 by 5;
        y = 30 + 50*x + 200*normal(0);
        output;
      end;
      * one added outlier near the middle of the X range;
      x = 50; y = 30 + 50*50 + 10000; d = 'out';
      output;
    run;
Generates a data set where $Y = 30 + 50X$ plus error
Errors are Normally distributed with $\sigma = 200$

SAS CODE
    * Fit without the outlier;
    proc reg data=xxx.a100b1;
      model y=x;
      where d ne 'out';
    run;
    * Fit with the outlier;
    proc reg data=xxx.a100b1;
      model y=x;
      output out=a2 r=resid;
    run;

Without Outlier
Analysis of Variance
    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              1         42426770      42426770    894.59   <.0001
    Error             18           853668         47426
    Corrected Total   19         43280438
Parameter Estimates
    Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept    1             -2.54677         95.29715     -0.03     0.9790
    x            1             50.51719          1.68899     29.91     <.0001
s = 217.8 (root MSE)

With Outlier
Analysis of Variance
    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              1         43888843      43888843      8.67   0.0083
    Error             19         96206895       5063521
    Corrected Total   20        140095738
Parameter Estimates
    Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept    1            432.20263        979.57661      0.44     0.6640
    x            1             51.37694         17.45089      2.94     0.0083
s = 2250.2 (root MSE)

SAS CODE: Visual Checks
    symbol1 v=circle i=rl;
    proc gplot data=a2;
      plot y*x;
    run;
    proc gplot data=a2;
      plot resid*x / vref=0;
    run;

Different kinds of outliers
The outlier in the last example influenced the intercept but not the slope
  It inflated all of our standard errors
Here is an example of an outlier that influences the slope

SAS CODE
    data xxx.a100c1;
      do x = 1 to 100 by 5;
        y = 30 + 50*x + 200*normal(0);
        output;
      end;
      * one added outlier at the upper edge of the X range;
      x = 100; y = 30 + 50*100 - 10000; d = 'out';
      output;
    run;

SAS CODE
    * Fit without the outlier;
    proc reg data=xxx.a100c1;
      model y=x;
      where d ne 'out';
    run;
    * Fit with the outlier;
    proc reg data=xxx.a100c1;
      model y=x;
      output out=a2 r=resid;
    run;

Without Outlier
Analysis of Variance
    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              1         41233447      41233447    901.15   <.0001
    Error             18           823612         45756
    Corrected Total   19         42057060
Parameter Estimates
    Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept    1             73.28061         93.60451      0.78     0.4439
    x            1             49.80168          1.65899     30.02     <.0001

With Outlier
Analysis of Variance
    Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
    Model              1         11151297      11151297      2.53   0.1285
    Error             19         83888277       4415172
    Corrected Total   20         95039574
Parameter Estimates
    Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept    1            903.97793        899.32018      1.01     0.3274
    x            1             24.13057         15.18374      1.59     0.1285

SAS CODE: Visual Checks
    symbol1 v=circle i=rl;
    proc gplot data=a2;
      plot y*x;
    run;
    proc gplot data=a2;
      plot resid*x / vref=0;
    run;
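The difference between the two outlier examples comes from where the outlier sits in X. This is a minimal Python sketch (not the course's SAS), assuming a noise-free version of the same line $Y = 30 + 50X$ so that every change in the fit is due to the added point alone. The SAS examples add Normal noise, so their numbers differ slightly. The function names are hypothetical.

```python
from statistics import fmean

def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line Y = b0 + b1 X."""
    x_bar, y_bar = fmean(x), fmean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def line_data():
    """Noise-free Y = 30 + 50 X for X = 1, 6, ..., 96 (assumption: no error term)."""
    x = list(range(1, 100, 5))
    return x, [30 + 50 * xi for xi in x]

def fit_with_point(x0, shift):
    """Fit after adding one point at x0 that lies `shift` units off the true line."""
    x, y = line_data()
    return fit_line(x + [x0], y + [30 + 50 * x0 + shift])

mid_b0, mid_b1 = fit_with_point(50, 10000)     # outlier near the middle of X
end_b0, end_b1 = fit_with_point(100, -10000)   # outlier at the edge of X
print(round(mid_b0, 1), round(mid_b1, 2))
print(round(end_b0, 1), round(end_b1, 2))
```

With the outlier near the mean of X, the slope barely moves from 50 and most of the change goes into the intercept. With the outlier at the edge of X, the slope is cut roughly in half. The size of the slope change is proportional to the outlier's distance from the mean of X.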

Background Reading
Program topic8.sas has the code for the Proc Univariate diagnostics of X
Program residualchecks.sas has the residual analysis
The permanent SAS data sets are a100.sas7bdat, a100a.sas7bdat, a100b1.sas7bdat, and a100c1.sas7bdat
Read Sections 3.8 and 3.9