1. Distinguish three missing data mechanisms:

Rafe Reeves
5 years ago
DATA SCREENING

I. Preliminary inspection of the raw data
   Make sure that there are no obvious coding errors (e.g., all values for the observed variables are in the admissible range) and that all variables have been recoded appropriately if necessary (including missing values);

II. Treatment of missing values
   1. Distinguish three missing data mechanisms:
      Missing completely at random (MCAR): the distribution of missingness depends neither on the observed nor on the missing parts of the data;
      Missing at random (MAR): the distribution of missingness depends on the observed but not on the missing part of the data;
      Missing not at random (MNAR): the distribution of missingness also depends on the missing part of the data;
   2. older methods (not recommended in general)
      a. case deletion (pairwise deletion or available case analysis, listwise deletion or complete case analysis)
      b. averaging the available items in multi-item scales
      c. single imputation (e.g., mean substitution, etc.)
   3. newer methods (recommended)
      a. full information ML estimation: if the data are MAR, the correct observed data likelihood can be obtained by integrating out the missing data; correct standard errors can be obtained from the observed (rather than the expected) information matrix;
      b. Bayesian multiple imputation: each missing value is replaced by a number of imputations, and the results of these multiple imputations are combined in some fashion;
   4. practical implementation: use PROC MI and PROC MIANALYZE in SAS (see the file mi.pdf) or the missing value procedures in LISREL, Mplus, etc.;
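The combination step in multiple imputation is usually done with Rubin's rules, which is also what PROC MIANALYZE applies. The following is a minimal sketch in Python (not part of the SAS programs in these notes). The function name `pool_rubin` and all numbers are hypothetical and serve only to show the arithmetic for one parameter estimated in m = 5 imputed data sets.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine one parameter's results from m imputed data sets (Rubin's rules)."""
    m = len(estimates)
    q_bar = mean(estimates)       # pooled point estimate
    w = mean(variances)           # within-imputation variance
    b = variance(estimates)       # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b       # total variance of the pooled estimate
    return q_bar, t

# Hypothetical estimates and squared standard errors, for illustration only
estimates = [0.52, 0.48, 0.55, 0.50, 0.45]
variances = [0.010, 0.012, 0.011, 0.009, 0.013]

q_bar, total_var = pool_rubin(estimates, variances)
print(f"pooled estimate = {q_bar:.3f}, pooled SE = {total_var ** 0.5:.3f}")
```

The pooled standard error exceeds the average within-imputation standard error because the between-imputation term carries the extra uncertainty due to the missing values.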

III. Outlier detection
   1. graphical methods: histograms, stem-and-leaf plots, boxplots, normal probability or quantile-quantile plots, bivariate scatterplots, etc.
   2. computational methods: Mahalanobis distance and variants;
      a. let $X$ ($N \times q$) be the matrix of observed variables (where $N$ is the number of observations and $q$ the number of variables) and $S$ the covariance matrix of the observations, and let $\bar{X}$ contain the variable means for each observation; then the diagonal values $d_{ii}^2$ of $D^2$, where $D^2$ is given by
         $D^2 = (X - \bar{X}) S^{-1} (X - \bar{X})'$,
         contain the squared Mahalanobis distance of each observation from the centroid of the observations; if the variables are uncorrelated and the variances are the same, $d_{ii}^2$ reduces to the squared Euclidean distance (divided by the common variance);
      b. Bollen (1989) suggests another measure, closely related to $D^2$, which can be obtained as follows; let $Z$ ($N \times q$) be the matrix of observed variables in deviation form and define a matrix $A$ as:
         $A = Z (Z'Z)^{-1} Z'$
         the diagonal values $a_{ii}$ of this matrix give the distance of the $i$th case from the centroid of the observations, where the distances have a range of 0 to 1; the sum of the distances across the $N$ cases equals $q$, so that $q/N$ is the average size of $a_{ii}$ and each $a_{ii}$ can be compared to this average; observations with $a_{ii}$ values much larger than the average may be considered outliers;
      c. more sophisticated techniques that take into account masking and swamping problems are also available (e.g., Hadi 1992, 1994);

IV. Assessment of distributional assumptions (esp. normality)
   1. graphical methods: histograms, stem-and-leaf plots, boxplots, probability or quantile-quantile plots, etc.
   2. computational methods: statistical tests of normality, skewness and kurtosis, etc.;

      a. tests of univariate normality: Shapiro-Wilk, Kolmogorov-Smirnov, etc.
      b. skewness: lack of symmetry of a distribution; positively skewed data have a long tail above the mean (are skewed to the right), negatively skewed data have a long tail below the mean (are skewed to the left); a normal distribution has a skewness of zero; various tests of univariate and multivariate normality based on skewness are available;
      c. kurtosis: distributions with positive kurtosis (leptokurtic distributions) have heavier tails and a higher peak than the normal; distributions with negative kurtosis (platykurtic distributions) have lighter tails and a lower peak than the normal; thus, distributions with excess kurtosis usually cross the normal twice on each side of the mean (see DeCarlo 1997); a normal distribution has a kurtosis of three (or zero when three has been subtracted to measure excess kurtosis); various tests of univariate and multivariate normality based on kurtosis are available;

Appendix: Interpreting normal probability or quantile-quantile plots

Description of Point Pattern                                            Interpretation
----------------------------------------------------------------------  -----------------------------------------------------------
All but a few points fall on a line                                     Outliers in data
Left end of pattern is below line; right end of pattern is above line   Symmetric, with long tails at both ends (positive kurtosis)
Left end of pattern is above line; right end of pattern is below line   Symmetric, with short tails at both ends (negative kurtosis)
Curved pattern with slope increasing from left to right                 Data skewed to the right (positive skew)
Curved pattern with slope increasing from right to left                 Data skewed to the left (negative skew)
Staircase pattern (plateaus and gaps)                                   Data rounded or discrete

Note: It is assumed that the raw data are shown on the y axis and the normal quantiles or percentiles on the x axis.
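The skewness and excess kurtosis described under IV.2 can be computed directly from central moments. The sketch below is in Python rather than SAS and uses the simple moment-based definitions (divisor n). PROC UNIVARIATE reports bias-adjusted versions by default, so its values will differ somewhat in small samples. The function name `moment_shape` and the ratings are hypothetical.

```python
def moment_shape(x):
    """Return (skewness, excess kurtosis) from central moments with divisor n."""
    n = len(x)
    mu = sum(x) / n
    m2 = sum((v - mu) ** 2 for v in x) / n   # second central moment
    m3 = sum((v - mu) ** 3 for v in x) / n   # third central moment
    m4 = sum((v - mu) ** 4 for v in x) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0     # a normal distribution gives 0
    return skewness, excess_kurtosis

# Hypothetical 7-point ratings piled up at the high end (long tail below the mean)
ratings = [1, 3, 4, 5, 5, 5, 6, 6, 7, 7]
skew, ex_kurt = moment_shape(ratings)
print(f"skewness = {skew:.3f}, excess kurtosis = {ex_kurt:.3f}")
```

For these ratings the skewness comes out negative, which matches the "skewed to the left" description: the few low responses pull the tail below the mean.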

%INCLUDE 'd:\m554\programs\jitter.sas';

TITLE 'Attitude toward using coupons -- data screening';

DATA coupon;
  INFILE 'd:\m554\datascreening\cfa.dat';
  INPUT id aa1t1 aa2t1 aa3t1 aa4t1 aa1t2 aa2t2 aa3t2 aa4t2;
RUN;

DATA coupont1;
  SET coupon(keep=id aa1t1 aa2t1 aa3t1 aa4t1);
RUN;

%JITTER(data=coupont1,out=coupont1,var=aa1t1 aa2t1 aa3t1 aa4t1,new=jaa1t1 jaa2t1 jaa3t1 jaa4t1);

TITLE 'proc univariate for coupon data';
PROC UNIVARIATE PLOT NORMAL;
  VAR aa1t1 aa2t1 aa3t1 aa4t1;
  HISTOGRAM aa1t1 aa2t1 aa3t1 aa4t1 / NORMAL (MU=est SIGMA=est COLOR=red W=2.5 ) MIDPOINTS = 1 to 7 by 1;
  PROBPLOT aa1t1 aa2t1 aa3t1 aa4t1 / NORMAL (MU=est SIGMA=est COLOR=red W=2.5 );
  QQPLOT aa1t1 aa2t1 aa3t1 aa4t1 / NORMAL (MU=est SIGMA=est COLOR=red W=2.5 );
RUN;

PROC SGSCATTER DATA=coupont1;
  TITLE 'Scatterplot Matrix for original coupon data';
  MATRIX aa1t1 aa2t1 aa3t1 aa4t1 / DIAGONAL=(HISTOGRAM NORMAL) ELLIPSE=(TYPE=PREDICTED);
RUN;

PROC SGSCATTER DATA=coupont1;
  TITLE 'Scatterplot Matrix for jittered coupon data';
  MATRIX jaa1t1 jaa2t1 jaa3t1 jaa4t1 / DIAGONAL=(HISTOGRAM NORMAL) ELLIPSE=(TYPE=PREDICTED);
RUN;

/*
proc sgplot data=coupont1;
  title 'jittered scatterplot';
  scatter x=aa1t1 y=aa2t1 / jitter;
  ellipse x=aa1t1 y=aa2t1;
run;
*/

QUIT;

The UNIVARIATE Procedure
Variable: aa1t1

Moments
N                  250     Sum Weights        250
Mean                       Sum Observations   1128
Std Deviation              Variance
Skewness                   Kurtosis
Uncorrected SS     5552    Corrected SS
Coeff Variation            Std Error Mean

Basic Statistical Measures
Location           Variability
Mean               Std Deviation
Median             Variance
Mode               Range
                   Interquartile Range

Tests for Location: Mu0=0
Test           Statistic      p Value
Student's t    t              Pr > |t|     <.0001
Sign           M     125      Pr >= |M|    <.0001
Signed Rank    S              Pr >= |S|    <.0001

Tests for Normality
Test                  Statistic    p Value
Shapiro-Wilk          W            Pr < W       <
Kolmogorov-Smirnov    D            Pr > D       <
Cramer-von Mises      W-Sq         Pr > W-Sq    <
Anderson-Darling      A-Sq         Pr > A-Sq    <

Quantiles (Definition 5)
Quantile       Estimate
100% Max       7
99%            7
95%            7
90%            6
75% Q3         5
50% Median     5
25% Q1         4
10%            3
5%             2
1%             1
0% Min         1

[Line-printer histogram, boxplot, and normal probability plot for aa1t1; in the histogram, * may represent up to 2 counts]

Fitted Distribution for aa1t1

Parameters for Normal Distribution
Parameter    Symbol    Estimate
Mean         Mu
Std Dev      Sigma

Goodness-of-Fit Tests for Normal Distribution
Test                  Statistic    p Value
Kolmogorov-Smirnov    D            Pr > D       <0.010
Cramer-von Mises      W-Sq         Pr > W-Sq    <0.005
Anderson-Darling      A-Sq         Pr > A-Sq    <0.005

Quantiles for Normal Distribution
Percent    Observed    Estimated

Histogram:

Normal probability and quantile-quantile plots (raw data):

Scatterplot matrix for original data:

Scatterplot matrix for data with jitter:

%include 'd:\m554\programs\outlier.sas';
%include 'd:\m554\programs\label.sas';
%include 'd:\m554\programs\cqplot.sas';
%let devtyp=screen;

/* To clear the Results window every time you execute your program, use the statement: */
dm "odsresults; clear;";

/* To clear the Log and Output windows as well, use the following statement: */
dm "log; clear; output; clear; odsresults; clear;";

/* to clear the Results Viewer window */
ods html close; /* close previous */
ods html;       /* open new */

/* To clear the graph window, use the following statements: */
proc greplay nofs;
  igout gseg;
  delete _all_;
run;
quit;

TITLE 'Attitude toward using coupons -- data screening';

DATA coupon;
  INFILE 'd:\m554\datascreening\cfa.dat' PAD;
  INPUT id aa1t1 aa2t1 aa3t1 aa4t1 aa1t2 aa2t2 aa3t2 aa4t2;

DATA coupont1;
  SET coupon(keep=id aa1t1 aa2t1 aa3t1 aa4t1);

title 'Multivariate outlier detection - 5 passes';
%outlier(data=coupont1, var=aa1t1 aa2t1 aa3t1 aa4t1, id=id, pvalue=.0002, passes=5);
run;
quit;

Observations trimmed in calculating Mahalanobis distance

_pass_    id    _case_    dsq    prob
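To make the two distance measures from section III concrete, the sketch below computes the squared Mahalanobis distance $d_{ii}^2$ and Bollen's $a_{ii}$ for a small data set. It is written in Python with the standard library only and is not a port of the %outlier macro: it does a single pass with no trimming and no p-values. The helper names, the eight cases, and the "twice the average" flag are all hypothetical choices for illustration.

```python
def transpose(m):
    return [list(r) for r in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(m)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def screen(x):
    """Return (d2, a): squared Mahalanobis distances and Bollen's a_ii per case."""
    n, q = len(x), len(x[0])
    means = [sum(col) / n for col in zip(*x)]
    z = [[v - m for v, m in zip(row, means)] for row in x]         # deviation form Z
    ztz = matmul(transpose(z), z)                                  # Z'Z
    s_inv = inverse([[v / (n - 1) for v in row] for row in ztz])   # S^{-1}
    ztz_inv = inverse(ztz)                                         # (Z'Z)^{-1}

    def quad(row, m_inv):
        # Quadratic form row * m_inv * row', i.e. one diagonal element
        return sum(row[i] * m_inv[i][j] * row[j] for i in range(q) for j in range(q))

    d2 = [quad(row, s_inv) for row in z]
    a = [quad(row, ztz_inv) for row in z]
    return d2, a

# Hypothetical responses of 8 cases on two 7-point items; the last case is off-pattern
data = [[4, 5], [5, 5], [3, 4], [6, 6], [5, 4], [4, 4], [5, 6], [1, 7]]
d2, a = screen(data)
avg = len(data[0]) / len(data)   # q/N, the average size of a_ii

for case, (d, lev) in enumerate(zip(d2, a), start=1):
    flag = "  <- well above q/N" if lev > 2 * avg else ""   # cutoff is an arbitrary choice
    print(f"case {case}: d2 = {d:6.3f}  a_ii = {lev:.3f}{flag}")
```

Because $S = Z'Z/(N-1)$, the two measures differ only by a constant: $d_{ii}^2 = (N-1)\,a_{ii}$. They therefore rank the cases identically, and the $a_{ii}$ simply rescale the distances so that they sum to $q$.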

