Natural Resources Data Analysis Lecture Notes, Brian R. Mitchell
IV. Week 4:
A. Goodness of fit testing

1. We test model goodness of fit to ensure that the assumptions of the model are met closely enough for the model to provide valid inference. Every statistical modeling technique has a set of assumptions that should be checked as well as possible. Goodness of fit is generally evaluated using summary statistics and inspection of residual plots. For certain complex models, goodness of fit can only be evaluated using computer simulations.
2. Note that violating model assumptions is a much bigger problem if you are conducting null hypothesis tests, but that not meeting the assumptions will also affect model predictions.
3. Regression
   a) Assumptions
      (1) Y-values and their error terms are normally distributed for each level of the predictor variables. Regression is generally robust to violations of this assumption.
      (2) Y-values and their error terms have the same variance at each level of the predictor variables (i.e. homogeneity of variance).
      (3) Y-values and their error terms are independent.
      (4) Predictor variables are fixed and known exactly (specifically for Fixed Effects or Model 1 situations). Failing to meet this assumption, however, does not affect hypothesis testing or prediction.
      (5) Predictor variables should not be highly correlated with each other. Severe collinearity can prevent a model from being fit or create highly sensitive results. Correlated predictors also lead to inflated variance of parameter estimates.
      (6) There is a linear relationship between predictors and outcome.
   b) Example
      (1) The example is a simulated data set. The outcome variable is Time To Detection (in seconds, during a bird point count), and the predictor variables are time of day (decimal hours since sunrise), foliage density, and number of birds actually present (the nice thing about simulated data is you don't have to actually count the birds!).
      (2) Test the fit of the regression model: $\mathit{TTD} = b_0 + b_1 \cdot \mathit{Time} + b_2 \cdot \mathit{Foliage} + b_3 \cdot \mathit{Number}$

      (3) In SAS, use the code: PROC GLM data=dataset; MODEL ttd = time foliage number; RUN;
   c) Goodness of fit
      (1) In SAS, a variety of GOF statistics can be saved using the OUTPUT statement: OUTPUT out=outdataset keyword=name; For example, the statement for residuals would be: OUTPUT out=outdataset R=resid;
      (2) In SAS, plots can be made using PROC PLOT: PROC PLOT data=outdataset; PLOT vertical*horizontal; RUN;
      (3) Focus on whether a linear model is appropriate and whether there are outliers (i.e. observations with large residuals) or influence points (i.e. observations far from the mean).
      (4) Useful statistics to calculate:
          (a) The overall $R^2$ is a general measure of fit; it is the proportion of the variation in the data set explained by the model.
          (b) Correlations among predictors. (PROC CORR in SAS: PROC CORR data=dataset; VAR x1 x2 x3; RUN;)
          (c) Predicted values are useful for plots. (P in the SAS OUTPUT statement)
          (d) Residuals are also useful for plots. (R in SAS)
          (e) Leverage measures how each x influences the fitted y-value; values further from the mean of all x's have greater leverage. Any leverage greater than 2K/n (K = number of model parameters, n = sample size) should be checked; leverage is typically incorporated into Cook's D. (Leverage is H in SAS)
          (f) Large studentized residuals indicate outliers from the fitted model, compared to other observations. (STUDENT in SAS)
          (g) Press residuals (usually studentized) are the difference between observed and predicted Y-values when the current observation is excluded. (RSTUDENT in SAS)
          (h) Cook's D or Cook's Distance measures the influence each observation has on the fitted regression line and the estimates of regression parameters. A large value indicates that removal of the observation would considerably change the regression parameters; observations with distances greater than 1 are usually particularly influential. (COOKD in SAS)
      (5) Useful plots to examine
          (a) Scatterplots of each predictor against the other predictors can help detect multicollinearity.
          (b) Scatterplots of the outcome against each predictor: these plots can help you find unequal variances, nonlinearity, and outliers. But this ignores the influence of other predictor variables.

          (c) Partial regression or partial residual plots show the relationship between the outcome and a predictor, adjusting for the effects of the other predictors. Values on the Y axis are the residuals from the regression of Y against all predictors except the predictor of interest; values on the X axis are the residuals of the predictor of interest against the other predictors. (These plots can be produced by PROC REG, with the PARTIAL option in the MODEL statement: MODEL y = x1 x2 x3 / partial;)
          (d) Residuals against predicted Y-values (these include Cook's D and studentized residuals).
          (e) Residuals against predictors can detect outliers specific to that predictor, nonlinearity between Y and that predictor, and temporal autocorrelation if the predictor is time (and this type of plot can be adapted for detecting other sorts of autocorrelation).
          (f) Residuals against predictors or interactions not included in the model; this can help assess the importance of factors not included in the original model.
          (g) Locating outliers can be aided by plotting residuals against the observation number, or by sorting the data set (note that if you want to sort a table in SAS, right-click the table and click Edit Mode before clicking a column and sorting).
4. ANOVA
   a) Assumptions
      (1) Y-values and their error terms are normally distributed for each level of the predictor variables. ANOVA is generally robust to violations of this assumption if sample sizes and variances are similar across levels.
      (2) Y-values and their error terms have the same variance at each level of the predictor variables (i.e. homogeneity of variance). Unequal variances can be a big problem, but can be addressed using robust ANOVA techniques.
      (3) Y-values and their error terms are independent.
   b) Goodness of fit
      (1) ANOVA is essentially linear regression using categorical variables. However, the categorical nature of the data means that some regression diagnostics are not useful.
      (2) Residuals and studentized residuals are still useful. Plot these against the predicted values (i.e. group means). Residuals should show equal spread for each group, indicating variance homogeneity. These plots will also show outliers.
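The ANOVA residual checks described above can be sketched in a few lines of code. The example below is only an illustration, written in Python rather than SAS, and it uses made-up data; the names `data`, `anova_residuals`, and `spread_by_group` are hypothetical and not part of the course materials. It treats the group mean as the predicted value and uses the fact that, in a one-way ANOVA, every observation in a group of size n_g has leverage 1/n_g.

```python
import math

# Hypothetical one-way layout (made-up numbers): responses keyed by group.
data = {
    "A": [4.1, 5.0, 4.6, 5.3],
    "B": [6.2, 5.8, 6.9, 6.5],
    "C": [5.1, 7.9, 3.2, 6.0],
}

def anova_residuals(groups):
    """Return (rows, mse) for a one-way ANOVA.

    Each row is (group, predicted, residual, studentized_residual).
    Assumes every group has at least 2 observations.
    """
    n_total = sum(len(v) for v in groups.values())
    k = len(groups)
    # The predicted value for an observation is its group mean.
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    resid = {g: [y - means[g] for y in v] for g, v in groups.items()}
    sse = sum(e * e for r in resid.values() for e in r)
    mse = sse / (n_total - k)  # pooled within-group variance
    rows = []
    for g, r in resid.items():
        h = 1.0 / len(r)  # leverage of each observation in group g
        for e in r:
            # Internally studentized residual: e / sqrt(MSE * (1 - h)).
            rows.append((g, means[g], e, e / math.sqrt(mse * (1.0 - h))))
    return rows, mse

def spread_by_group(rows):
    """Standard deviation of the residuals within each group."""
    by_group = {}
    for g, _, e, _ in rows:
        by_group.setdefault(g, []).append(e)
    return {g: math.sqrt(sum(e * e for e in r) / (len(r) - 1))
            for g, r in by_group.items()}

rows, mse = anova_residuals(data)
for g, sd in spread_by_group(rows).items():
    print(g, round(sd, 3))
```

In this made-up data set, group C was deliberately given a wider spread than A and B, so its residual standard deviation stands out. That is the numerical counterpart of seeing unequal vertical spread in a plot of residuals against group means.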

5. Discriminant analysis
   a) Assumptions
      (1) There are several requirements for the data set:
          (a) Groups are mutually exclusive.
          (b) Number of samples per group should not be radically different.
          (c) No discriminating variable can be a linear combination of other discriminating variables.
          (d) No highly correlated discriminating variables; maximum correlation suggestions vary, but be concerned if correlations exceed 0.7 (although most published analyses I have seen use thresholds between 0.8 and 0.95).
          (e) At least 2 samples per group.
          (f) At least 2 more samples than the number of variables, and preferably there should be at least 3 times as many samples as variables.
          (g) The prior probability of group membership is known. Most packages assume equal probability of membership in each group, but this can be adjusted (e.g. by using the proportion of samples as the prior probability).
      (2) Equal group dispersions (i.e. equal variance-covariance matrices): Violating this assumption is problematic if you are hoping to use inferential statistics to determine if groups are significantly different. If this assumption does not hold, the discriminant analysis can still be useful for description and prediction.
      (3) Multivariate normality: This analysis assumes that the data for each group follow the multivariate normal distribution. Discriminant analysis is robust to violations of this assumption.
      (4) Independence: Discriminant analysis is sensitive to lack of independence.
      (5) Linearity: Discriminant analysis assumes that a linear combination of variables best predicts group membership.
   b) Example
      (1) The example is a data set that is often used to illustrate discriminant analysis. The goal is to classify 3 species of irises based on sepal and petal width and length.
   c) Goodness of fit
      (1) Use a scatterplot or correlation matrix to explore correlations among predictor variables. If there are any high correlations, you can conduct an ANOVA for both variables against your grouping factor. Keep the variable with the largest among-group differences. (PROC CORR data=discrim; var sepallen sepalwid petallen petalwid; run;) (PROC GLM data=discrim; class species; model var = species; run;)
      (2) Calculate a univariate ANOVA on each discriminating variable with the grouping variable as the main effect, and assess the distribution of the residuals (which should be normally distributed). This doesn't really address multivariate normality, but if univariate normality is not present, then multivariate normality is also not present. (PROC GLM data=discrim; class species; model var = species; output out=discrimout r=resid; run; PROC PLOT data=discrimout; plot resid*species; run;)
      (3) Plot each variable on the Y axis against group membership on the X axis; variance should be similar across groups. If variances are unequal you can transform the variable. It may help to see if the transformed variable improves discrimination; if it does not, it should not be used. (PROC PLOT data=discrim; PLOT var*species; run;)
      (4) Calculate a test of equal group dispersions (there are a variety of these). If the dispersions differ, discriminant analysis can be conducted using the within-group matrices instead of the pooled matrix. This requires quadratic discriminant analysis rather than linear discriminant analysis. Alternatively, if the dispersions do not differ greatly, the differences are unlikely to have a large effect and they can be ignored. (PROC DISCRIM data=discrim pool=test; class species; var vars; run;) (note: pool=no is quadratic discriminant analysis, pool=yes is linear)
      (5) Plot discriminant functions against each other (with different coding for each group). This will help identify outliers, as well as nonlinearity. (PROC DISCRIM data=discrim out=discrimout canonical pool=yes; class species; var vars; run; PROC PLOT data=discrimout; plot can2*can1=species; run;)
      (6) Classification accuracy: How well does discriminant analysis classify the data? Is the classification better than expected by chance? Kappa (when the probability of group membership is proportional to sample size) or tau are useful statistics that quantify the improvement in classification accuracy over what was expected by chance. These statistics are only unbiased with jack-knifed or split-sample data. (PROC DISCRIM data=discrim canonical crossvalidate crosslisterr pool=yes; class species; var vars; run;) (Use Kappa spreadsheet to calculate Kappa)
6. Logistic Regression
   a) Assumptions
      (1) The data set must meet some basic requirements:
          (a) No highly correlated predictor variables.
          (b) No complete separation (i.e. perfect prediction). Nearly complete separation can also be a problem.

          (c) No zero cells (i.e. no zero cells in the contingency table for categorical predictors).
      (2) The probability distribution for the response variable (and the error terms) is binomial (multinomial when the outcome has more than two categories).
      (3) The logistic link function is appropriate (i.e. predictor variables have a linear relationship to the log odds of the outcome).
   b) Example
      (1) Coyote vocal responses to playback
   c) Goodness of fit
      (1) Plot each predictor against the outcome to look for complete separation and zero cells (Use JMP, Analyze > Fit Y by X to generate mosaic plots and contingency tables).
      (2) Use a scatterplot or correlation matrix to check for collinearity. Contingency table analysis can be used to check correlations among categorical predictors. (PROC CATMOD data=logistic; model outcome = predictors / corrb; run;)
      (3) Examine the logistic regression output. Extremely large estimates and standard errors indicate complete separation or zero cells. (PROC LOGISTIC data=logistic; class classvars; model vocresp = vars; run;)
      (4) Area under the ROC (Receiver Operating Characteristic) curve; see Hosmer and Lemeshow (2000).
          (a) The ROC curve is a measure of classification accuracy; it is a plot of sensitivity versus (1-specificity) over all possible classification cut-points.
          (b) Sensitivity is the proportion of cases where the outcome = 1 that were correctly classified.
          (c) Specificity is the proportion of cases where the outcome = 0 that were correctly classified.
          (d) A cut-point is the probability at which the decision is made to classify into one group instead of the other.
          (e) An ROC area of 0.5 suggests no discrimination; 0.7 to 0.8 is considered acceptable, and greater than 0.8 is excellent.
          (f) Note that a poorly-fitting model may still have good discrimination!
          (g) In SAS, the area under the ROC curve is estimated by the statistic c in the table titled Association of Predicted Probabilities and Observed Responses.
      (5) Overall goodness of fit can be assessed using the $G^2$ statistic (the deviance) or the Pearson $\chi^2$ statistic if the predictors are all categorical. These statistics approximately follow a $\chi^2$ distribution with $n - p$ (sample size minus number of parameters) degrees of freedom. (In SAS, add / aggregate scale=none to the model statement)
      (6) If there are continuous predictors in the model, the best overall fit statistic is the Hosmer-Lemeshow test. (In SAS, add / lackfit to the model statement)
      (7) Examine the residuals, either by checking for large values or by plotting them against the predicted logistic probability. Some useful residuals:
          (a) Pearson $\chi^2$ or deviance residuals.
          (b) $\Delta\chi^2$ or $\Delta G^2$ residuals; these are the change in the $\chi^2$ or $G^2$ statistics when the current observation is excluded (this is the logistic version of press residuals).
          (c) Dfbeta is an influence statistic that parallels Cook's D from regression.
          (d) These residuals (and the predicted probabilities to plot them against) can all be calculated by adding the following line of SAS code after the MODEL statement: OUTPUT out=logout predicted=pred reschi=reschi resdev=resdev difchisq=difchi difdev=difdev dfbetas=_all_;
7. What if your model does not fit?
   a) Examine outliers and determine if the data are accurate.
   b) Consider revising your model set or error structure; this includes considering transformations of predictor variables and the outcome variable (i.e. the error distribution).
   c) It is possible to use QAIC or QAICc to select among poorly fitting models (B&A p. 309). But you must report the lack of fit and be aware that this severely hampers inference (also p. 309).
B. Multimodel inference
1. The main goals of multimodel inference are to derive parameter estimates using all models in a model set, and to incorporate model selection uncertainty into precision estimates.
2. Model-averaged parameters
   a) The model-averaged parameter estimate is simply the weighted average of the estimates from each model.
   b) $\hat{\bar{\theta}} = \sum_{i=1}^{R} w_i \hat{\theta}_i$, where $R$ is the number of models in the set, $w_i$ is the Akaike weight of model $i$, and $\hat{\theta}_i$ is the estimate from model $i$.
   c) For models that lack the parameter being averaged, use $\hat{\theta}_i = 0$.

   d) WARNING: do not average a parameter that has different meanings in different models (i.e. different functional forms in a non-nested model set). Instead, calculate the estimated outcome for each model, and model average the outcomes. Note that this value will correspond to specific values of the predictor variables (see below).
3. Unconditional parameter variance (i.e. accounting for model selection uncertainty)
   a) The unconditional variance for a model-averaged parameter is also a weighted average.
   b) $\widehat{\mathrm{var}}(\hat{\bar{\theta}}) = \sum_{i=1}^{R} w_i \left[ \widehat{\mathrm{var}}(\hat{\theta}_i \mid g_i) + (\hat{\theta}_i - \hat{\bar{\theta}})^2 \right]$
   c) The above formula is from Burnham and Anderson (2004), and differs from the formula in Burnham and Anderson (2002).
   d) In words: the unconditional variance is the sum, over models, of the Akaike weight times (the variance calculated for the current model plus the squared difference between the parameter estimate for the current model and the model-averaged parameter estimate).
   e) If the parameter you are calculating variance for is not in the current model, use a variance of zero and a parameter estimate of zero. This will contribute to the unconditional variance an amount equal to the model weight times the square of the model-averaged parameter estimate.
   f) Burnham and Anderson spend some time talking about $\hat{\bar{\theta}}$ versus $\tilde{\bar{\theta}}$, and the variance estimators that go along with these parameters. The debate is essentially over what to do when a parameter is not in the model; Burnham and Anderson initially assert that $\hat{\bar{\theta}}$ and its variance are calculated by only using models where the parameter occurs, while $\tilde{\bar{\theta}}$ is calculated according to the procedures I have outlined above and its variance cannot be calculated. However, their 2004 monograph on multimodel inference describes $\tilde{\bar{\theta}}$ while calling it $\hat{\bar{\theta}}$. I believe that using the procedure I have outlined here is the best match to their apparent intent.
4. Unconditional confidence interval
   a) Once you have the unconditional parameter estimate and its variance, just calculate confidence intervals as you normally would. One typical approach is based on the z distribution:
   b) $\hat{\bar{\theta}} \pm z_{1-\alpha/2} \sqrt{\widehat{\mathrm{var}}(\hat{\bar{\theta}})}$
5. Relative importance of variables
   a) Burnham and Anderson suggest summing the Akaike weights for the models where each variable occurs; the larger the summed weight, the more important the variable.
   b) This approach requires that each variable occur in the same number of candidate models.
   c) I do not agree with this approach, since it ignores the possibility that two parameters could be selected with similar frequency, but that one may be more important than the other (larger effect size, smaller confidence interval). I have not seen a convincing simulation study supporting this approach.
   d) I recommend looking at the model-averaged estimate divided by the model-averaged variance as an estimate of effect size. This approach also does not require equal representation for the parameter across the model set.
6. Confidence Sets for the K-L Best Model
   a) Burnham and Anderson suggest that an n% confidence set can be produced by ranking the models by decreasing Akaike weights, and adding models until the cumulative weight exceeds n% (so a 90% model set would include all models until the cumulative Akaike weights exceeded 0.90).
   b) They no longer seem to recommend this approach, and I agree that it should not be used. It is simple enough to conduct model averaging that the entire set should be used.
C. For the remainder of the class, work on a spreadsheet to calculate model-averaged parameters and unconditional variances.
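As a cross-check on the spreadsheet, the same calculation can be sketched in code. The example below is a minimal illustration in Python, not part of the course materials: the function names and the AIC values, estimates, and variances are all made up. It follows the verbal description in section B: the estimate is a weighted average, the unconditional variance is the sum over models of weight times (model variance plus squared deviation from the averaged estimate), and models lacking the parameter enter with an estimate and variance of zero. The conversion from AIC differences to Akaike weights uses the standard formula, which these notes do not spell out, and the interval assumes z = 1.96 for a 95% interval.

```python
import math

def akaike_weights(aic_values):
    """Akaike weights: exp(-delta/2) normalized to sum to 1."""
    best = min(aic_values)
    rel = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

def model_average(weights, estimates, variances):
    """Model-averaged estimate and unconditional variance.

    For a model that lacks the parameter, pass estimate 0 and variance 0.
    """
    theta_bar = sum(w * t for w, t in zip(weights, estimates))
    var_bar = sum(w * (v + (t - theta_bar) ** 2)
                  for w, t, v in zip(weights, estimates, variances))
    return theta_bar, var_bar

def z_interval(theta_bar, var_bar, z=1.96):
    """z-based interval; z = 1.96 is an assumed 95% level."""
    half = z * math.sqrt(var_bar)
    return theta_bar - half, theta_bar + half

# Made-up three-model set; the third model does not contain the parameter.
aic = [100.0, 101.2, 104.5]
est = [0.80, 0.65, 0.0]
var = [0.04, 0.05, 0.0]

w = akaike_weights(aic)
theta_bar, var_bar = model_average(w, est, var)
lower, upper = z_interval(theta_bar, var_bar)
print([round(x, 3) for x in w])
print(round(theta_bar, 3), round(var_bar, 4), round(lower, 3), round(upper, 3))
```

Because the third model contributes a zero estimate, it pulls the averaged estimate toward zero and adds its weight times the squared averaged estimate to the variance, as described in item 3 e).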