Introduction to General and Generalized Linear Models

Generalized Linear Models - IIIb
Henrik Madsen, Chapman & Hall, March 18, 2012

Examples: Overdispersion and Offset

- Germination of Orobanche (overdispersion)
- Accident rates (offset)
- Some comments

Germination of Orobanche

- Binomial distribution
- Modelling overdispersion
- Diagnostics

Orobanche is a genus of parasitic plants without chlorophyll that grow on the roots of flowering plants. In the experiment, a batch of seeds of the species Orobanche aegyptiaca was brushed onto a plate containing an extract prepared from the roots of either a bean or a cucumber plant, and the number of seeds that germinated was then recorded. Two varieties of Orobanche aegyptiaca, namely O.a. 75 and O.a. 73, were used in the experiment.

Source: Modelling Binary Data, David Collett.

Data

> dat<-read.table('seeds.dat',header=TRUE)
> head(dat)
  variety root  y  n
1       1    1 10 39
2       1    1 23 62
3       1    1 23 81
4       1    1 26 51
5       1    1 17 39
6       1    2  5  6
> str(dat)
'data.frame':   21 obs. of  4 variables:
 $ variety: int  1 1 1 1 1 1 1 1 1 1 ...
 $ root   : int  1 1 1 1 1 2 2 2 2 2 ...
 $ y      : int  10 23 23 26 17 5 53 55 32 46 ...
 $ n      : int  39 62 81 51 39 6 74 72 51 79 ...

The model

We shall assume that the number of seeds that germinated, $y_i$, in each independent experiment follows a binomial distribution:

  $y_i \sim \mathrm{Bin}(n_i, p_i)$,

where

  $\mathrm{logit}(p_i) = \mu + \alpha(\mathrm{root}_i) + \beta(\mathrm{variety}_i) + \gamma(\mathrm{root}_i, \mathrm{variety}_i)$
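As a rough illustration of how this parameterisation maps onto R's default treatment contrasts, the sketch below builds the design matrix for the four variety-by-root cells. It assumes level 1 is the reference level of both factors (R's default for factors created from 1 and 2); the object names `cells` and `X` are illustrative and do not come from the slides.

```r
# Hypothetical 2 x 2 layout: one row per (variety, root) combination.
cells <- expand.grid(variety = factor(1:2), root = factor(1:2))

# Design matrix for main effects plus interaction under treatment contrasts.
X <- model.matrix(~ variety * root, data = cells)
print(cbind(cells, X))

# Column roles relative to the reference cell (variety 1, root 1):
# (Intercept) ~ mu, variety2 ~ beta, root2 ~ alpha, variety2:root2 ~ gamma.
```

Under this coding the four columns correspond to the four coefficients that `glm()` reports for `resp ~ variety * root`.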

Model fitting

> dat$variety<-as.factor(dat$variety)
> dat$root<-as.factor(dat$root)
> dat$resp<-cbind(dat$y,(dat$n-dat$y))
> fit1<-glm(resp~variety*root,
+           family=binomial(link=logit),
+           data=dat)
> fit1

Call:  glm(formula = resp ~ variety * root, family = binomial(link = logit), data = dat)

Coefficients:
   (Intercept)        variety2           root2  variety2:root2
       -0.5582          0.1459          1.3182         -0.7781

Degrees of Freedom: 20 Total (i.e. Null);  17 Residual
Null Deviance:     98.72
Residual Deviance: 33.28    AIC: 117.9

Deviance table

From the output we can make a table:

Source              f    Deviance   Mean deviance
Model H_M           3    65.44      21.81
Residual (Error)    17   33.28       1.96
Corrected total     20   98.72       4.94

The p-value for the test for model sufficiency:

> pval<-1-pchisq(33.28,17)
> pval
[1] 0.01038509
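The arithmetic behind the deviance table can be sketched directly from the deviances printed for fit1. This is only an illustration using the rounded values shown on the slides; the variable names are made up for the example.

```r
# Deviances and degrees of freedom as printed for fit1 on the slides.
null_dev  <- 98.72; null_df  <- 20
resid_dev <- 33.28; resid_df <- 17

model_dev <- null_dev - resid_dev   # deviance explained by the model
model_df  <- null_df - resid_df

dev_table <- data.frame(
  Source   = c("Model", "Residual (Error)", "Corrected total"),
  f        = c(model_df, resid_df, null_df),
  Deviance = c(model_dev, resid_dev, null_dev)
)
dev_table$Mean.deviance <- round(dev_table$Deviance / dev_table$f, 2)
print(dev_table)

# Upper-tail chi-square probability of the residual deviance;
# lower.tail = FALSE avoids computing 1 - pchisq(...).
pval <- pchisq(resid_dev, df = resid_df, lower.tail = FALSE)
pval
```

A small p-value here says the residual deviance is larger than a binomial model with this linear predictor would be expected to leave.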

Overdispersion?

The deviance is too large. Possible reasons are:

- Incorrect linear predictor
- Incorrect link function
- Outliers
- Influential observations
- Incorrect choice of distribution

To check these we need to look at the residuals! If all of the above look fine, the reason might be overdispersion.

Overdispersion

In the case of overdispersion the variance is larger than expected for the given distribution. When data are overdispersed, a dispersion parameter, $\sigma^2$, should be included in the model. We use $\mathrm{Var}[Y_i] = \sigma^2 V(\mu_i)/w_i$ with $\sigma^2$ denoting the overdispersion.

- Including a dispersion parameter does not affect the estimation of the mean value parameters $\beta$.
- Including a dispersion parameter does affect the standard errors of $\hat\beta$.
- The distribution of the test statistics will be influenced.

The dispersion parameter

Approximate moment estimate for the dispersion parameter

It is common practice to use the residual deviance $D(y; \mu(\hat\beta))$ as basis for the estimation of $\sigma^2$ and use the result that $D(y; \mu(\hat\beta))$ is approximately distributed as $\sigma^2 \chi^2(n-k)$. It then follows that

  $\hat\sigma^2_{dev} = \frac{D(y; \mu(\hat\beta))}{n-k}$

is asymptotically unbiased for $\sigma^2$. Alternatively, one could use the corresponding Pearson goodness-of-fit statistic

  $X^2 = \sum_{i=1}^{n} w_i \frac{(y_i - \hat\mu_i)^2}{V(\hat\mu_i)}$

which likewise follows a $\sigma^2 \chi^2(n-k)$-distribution, and use the estimator

  $\hat\sigma^2_{Pears} = \frac{X^2}{n-k}$.
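Both moment estimates are easy to compute from a fitted `glm` object. The sketch below uses synthetic data (not the Orobanche seeds) in which the success probability varies between batches; the data-generating values and object names are assumptions made for illustration. It also compares the Pearson-based estimate with the dispersion that `summary()` reports for a quasibinomial fit, which is documented to be based on the residual chi-squared statistic.

```r
set.seed(1)  # synthetic data only; not the Orobanche seeds

# 21 hypothetical batches whose success probabilities vary between batches.
n <- sample(20:80, 21, replace = TRUE)
g <- factor(rep(1:2, length.out = 21))
p <- rbeta(21, shape1 = ifelse(g == 1, 4, 6), shape2 = ifelse(g == 1, 6, 4))
y <- rbinom(21, size = n, prob = p)

fit <- glm(cbind(y, n - y) ~ g, family = binomial)

# Deviance-based estimate: D / (n - k)
sigma2_dev <- deviance(fit) / df.residual(fit)

# Pearson-based estimate: X^2 / (n - k)
X2 <- sum(residuals(fit, type = "pearson")^2)
sigma2_pears <- X2 / df.residual(fit)

# For comparison: the dispersion reported for a quasibinomial fit.
fit_q <- glm(cbind(y, n - y) ~ g, family = quasibinomial)
c(deviance = sigma2_dev, pearson = sigma2_pears,
  summary_quasi = summary(fit_q)$dispersion)
```

The two estimates usually differ slightly, which is why a hand-computed deviance-based value need not match the dispersion printed by `summary()`.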

> resdev<-residuals(fit1,type='deviance') # Deviance residuals
> plot(resdev, ylab="Deviance residuals")

[Figure: deviance residuals plotted against observation index.]

> plot(predict(fit1),resdev,xlab=(expression(hat(eta))),
+      ylab="Deviance residuals")

[Figure: deviance residuals plotted against the linear predictor $\hat\eta$.]

> par(mfrow=c(1,2))
> plot(jitter(as.numeric(dat$variety),amount=0.1), resdev, xlab='Variety',
+      ylab="Deviance residuals", cex=0.6, axes=FALSE)
> box()
> axis(1,label=c('O.a. 75','O.a. 73'),at=c(1,2))
> axis(2)
> plot(jitter(as.numeric(dat$root),amount=0.1), resdev, xlab='Root',
+      ylab="Deviance residuals", cex=0.6, axes=FALSE)
> box()
> axis(1,label=c('Bean','Cucumber'),at=c(1,2))
> axis(2)

[Figure: two panels of deviance residuals, one by variety (O.a. 75, O.a. 73) and one by root extract (Bean, Cucumber).]

Possible reasons for overdispersion

Nothing in the plots indicates that the model is unreasonable. We conclude that the large residual deviance is due to overdispersion. In binomial models overdispersion can often be explained by variation between the response probabilities or correlation between the binary responses. In this case it might be because:

- The batches of seeds of a particular species germinated in a particular root extract are not homogeneous.
- The batches were not germinated under similar experimental conditions.
- When a seed in a particular batch germinates, a chemical is released that promotes germination in the remaining seeds of the batch.

Overdispersion - some facts

- The residual deviance cannot be used as a goodness-of-fit measure in the case of overdispersion.
- In the case of overdispersion an F-test should be used instead of the $\chi^2$ test. The test is not exact, in contrast to the Gaussian case.
- When fitting a model to overdispersed data in R we use family = quasibinomial for binomial data and family = quasipoisson for Poisson data.
- The quasi families differ from the binomial and poisson families only in that the dispersion parameter is not fixed at one, so they can model overdispersion.

Fit of model with overdispersion

> fit2<-glm(resp~variety*root,family=quasibinomial,data=dat)
> summary(fit2)

Call:
glm(formula = resp ~ variety * root, family = quasibinomial, data = dat)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.01617  -1.24398   0.05995   0.84695   2.12123

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     -0.5582     0.1720  -3.246  0.00475 **
variety2         0.1459     0.3045   0.479  0.63789
root2            1.3182     0.2422   5.444 4.38e-05 ***
variety2:root2  -0.7781     0.4181  -1.861  0.08014 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.861832)

    Null deviance: 98.719  on 20  degrees of freedom
Residual deviance: 33.278  on 17  degrees of freedom
AIC: NA

Compare to summary of standard model (wrong here)

> # Just to compare: this model is considered wrong here
> summary(fit1)

Call:
glm(formula = resp ~ variety * root, family = binomial(link = logit), data = dat)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.01617  -1.24398   0.05995   0.84695   2.12123

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     -0.5582     0.1260  -4.429 9.46e-06 ***
variety2         0.1459     0.2232   0.654   0.5132
root2            1.3182     0.1775   7.428 1.10e-13 ***
variety2:root2  -0.7781     0.3064  -2.539   0.0111 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 98.719  on 20  degrees of freedom
Residual deviance: 33.278  on 17  degrees of freedom
AIC: 117.87
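The two coefficient tables share the same estimates and differ only in their standard errors. A quick check, using only the numbers printed in the two summaries, suggests that the quasibinomial standard errors are the binomial ones scaled by the square root of the estimated dispersion. The vector names below are illustrative.

```r
# Standard errors printed for the binomial fit (fit1) and the
# quasibinomial fit (fit2), in the order of the coefficient tables.
se_binomial      <- c(0.1260, 0.2232, 0.1775, 0.3064)
se_quasibinomial <- c(0.1720, 0.3045, 0.2422, 0.4181)

dispersion <- 1.861832   # as reported by summary(fit2)

# Scale the binomial standard errors by sqrt(dispersion).
se_scaled <- se_binomial * sqrt(dispersion)

round(cbind(se_binomial, se_scaled, se_quasibinomial), 4)
```

The scaled column agrees with the quasibinomial column up to the rounding of the printed values, so the inflation factor is about 1.36 rather than 1.86.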

Model reduction

Note that the standard errors shown in the summary output are larger than without the overdispersion: they are multiplied by $\hat\sigma = \sqrt{1.8618} \approx 1.36$.

> fit2<-glm(resp~variety*root,family=quasibinomial,data=dat)
> drop1(fit2, test="F")
Single term deletions

Model:
resp ~ variety * root
             Df Deviance F value  Pr(>F)
<none>            33.278
variety:root  1   39.686  3.2736 0.08812 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> fit3<-glm(resp~variety+root,family=quasibinomial,data=dat)
> drop1(fit3, test="F")
Single term deletions

Model:
resp ~ variety + root
        Df Deviance F value    Pr(>F)
<none>       39.686
variety  1   42.751  1.3902    0.2537
root     1   96.175 25.6214 8.124e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> fit4<-glm(resp~root,family=quasibinomial,data=dat)
> drop1(fit4, test="F")
Single term deletions

Model:
resp ~ root
       Df Deviance F value    Pr(>F)
<none>      42.751
root    1   98.719  24.874 8.176e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
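The F statistics in the three deletion tables can be reproduced from the printed deviances alone. The sketch below assumes that the scale in the denominator is the deviance-based dispersion estimate of the larger model (its residual deviance divided by its residual degrees of freedom), which is consistent with the printed values; `f_drop` is a hypothetical helper, not a function from R. The residual degrees of freedom 18 and 19 follow from 21 observations with 3 and 2 parameters.

```r
# F statistic for dropping terms from a quasi-likelihood fit:
# change in deviance per dropped df, divided by the deviance-based
# dispersion estimate of the larger model.
f_drop <- function(dev_small, dev_big, df_big, df_drop = 1) {
  f_val <- ((dev_small - dev_big) / df_drop) / (dev_big / df_big)
  p_val <- pf(f_val, df_drop, df_big, lower.tail = FALSE)
  c(F = f_val, p = p_val)
}

# Deviances as printed in the drop1() tables.
f_tests <- rbind(
  "variety:root from fit2" = f_drop(39.686, 33.278, 17),
  "variety from fit3"      = f_drop(42.751, 39.686, 18),
  "root from fit3"         = f_drop(96.175, 39.686, 18),
  "root from fit4"         = f_drop(98.719, 42.751, 19)
)
round(f_tests, 4)
```

Small differences in the last digit relative to the `drop1()` output come from using the rounded deviances.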

Model results

> par<-coef(fit4)
> par
(Intercept)       root2
 -0.5121761   1.0574031
> std<-sqrt(diag(vcov(fit4)))
> std
(Intercept)       root2
  0.1531186   0.2118211
> par+std%o%c(lower=-1,upper=1)*qt(0.975,19)
                 lower      upper
(Intercept) -0.8326570 -0.1916952
root2        0.6140564  1.5007498
> confint.default(fit4) # same as above but with quantile qnorm(0.975)
                 2.5 %     97.5 %
(Intercept) -0.8122830 -0.2120691
root2        0.6422414  1.4725649

Probability of germination is $\frac{e^{-0.512}}{1+e^{-0.512}} \approx 37\%$ on bean roots.

Probability of germination is $\frac{e^{-0.512+1.0574}}{1+e^{-0.512+1.0574}} \approx 63\%$ on cucumber roots.

The odds ratio becomes:

  $\frac{\mathrm{odds}(\mathrm{germination} \mid \mathrm{Cucumber})}{\mathrm{odds}(\mathrm{germination} \mid \mathrm{Bean})} \approx 2.88$

with confidence interval from 1.9 to 4.4.
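These figures can be recomputed from the coefficients and standard error printed for fit4. The sketch below is one way to do so; it assumes the reported interval for the odds ratio is the normal-quantile (Wald) interval on the logit scale, exponentiated, which matches the printed bounds after rounding. Variable names are illustrative.

```r
# Estimates and standard error as printed for fit4 on the slides.
b0 <- -0.5121761
b1 <- 1.0574031
se_b1 <- 0.2118211

p_bean     <- plogis(b0)        # inverse logit: exp(x) / (1 + exp(x))
p_cucumber <- plogis(b0 + b1)

odds_ratio <- exp(b1)
# Wald interval on the logit scale with normal quantiles, then exponentiated.
or_ci <- exp(b1 + c(lower = -1, upper = 1) * qnorm(0.975) * se_b1)

round(c(p_bean = p_bean, p_cucumber = p_cucumber,
        odds_ratio = odds_ratio, or_ci), 3)
```

Using `qt(0.975, 19)` instead of `qnorm(0.975)` would give a slightly wider interval, as in the first interval printed for fit4.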

Consider the model

We will still assume that the number of seeds that germinated, $y_i$, in each independent experiment follows a binomial distribution:

  $y_i \sim \mathrm{Bin}(n_i, p_i)$,

where

  $\mathrm{logit}(p_i) = \mu + \alpha(\mathrm{root}_i) + \beta(\mathrm{variety}_i) + \gamma(\mathrm{root}_i, \mathrm{variety}_i) + B_i$

and $B_i \sim N(0, \sigma^2)$.

- Notice that $B_i$ is unobserved.
- In some sense this model does exactly what we need.
- Can we even handle such a model? Yes! Wait for the next chapter...
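A small simulation can illustrate why an unobserved batch effect $B_i$ produces overdispersion when an ordinary binomial GLM is fitted. Everything in this sketch is hypothetical: the coefficient values, batch sizes, number of batches, and the helper `pearson_dispersion` are chosen only for illustration, and only a root effect is included for brevity.

```r
set.seed(42)  # simulated batches; all parameter values are illustrative

# Simulate binomial counts whose logit contains a batch effect
# B_i ~ N(0, sigma^2), fit an ordinary binomial GLM that ignores B_i,
# and return the Pearson dispersion estimate.
pearson_dispersion <- function(sigma, n_batches = 200, size = 50) {
  root <- factor(rep(1:2, length.out = n_batches))
  B    <- rnorm(n_batches, mean = 0, sd = sigma)   # unobserved batch effect
  eta  <- -0.5 + 1.0 * (root == 2) + B             # linear predictor plus B_i
  y    <- rbinom(n_batches, size = size, prob = plogis(eta))
  fit  <- glm(cbind(y, size - y) ~ root, family = binomial)
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

disp_none <- pearson_dispersion(sigma = 0)     # no batch effect
disp_some <- pearson_dispersion(sigma = 0.8)   # batch-to-batch variation
c(no_batch_effect = disp_none, with_batch_effect = disp_some)
```

With no batch effect the estimate should be close to one, while batch-to-batch variation on the logit scale pushes it well above one.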

Accident rates

- Poisson distribution
- Rate data
- Use of offset

Events that may be assumed to follow a Poisson distribution are sometimes recorded on units of different size. For example, the number of crimes recorded in a number of cities depends on the size of each city. Data of this type are called rate data. If we denote the measure of size by $t$, we can model this type of data as

  $\log\left(\frac{\mu}{t}\right) = X\beta$

and then

  $\log(\mu) = \log(t) + X\beta$

Source: Generalized Linear Models, Ulf Olsson.

The data are accident rates for elderly drivers, subdivided by sex. For each sex, the number of person-years (in thousands) is also given.

                       Females   Males
No. of accidents       175       320
No. of person-years    17.30     21.40

We can model these data using a Poisson distribution with a log link, using the logarithm of the number of person-years as offset.

Fitting the model

> fit1<-glm(y~offset(log(years))+sex,family=poisson,data=dat)
> anova(fit1,test='Chisq')

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                     1     17.852
sex   1   17.852         0  1.155e-14 2.388e-05

We can see from the output that sex is significant.

Parameter estimates - relative accident rate

> summary(fit1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.31408    0.07559  30.612  < 2e-16
sex2         0.39085    0.09402   4.157 3.22e-05

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1.7852e+01  on 1  degrees of freedom
Residual deviance: 1.1546e-14  on 0  degrees of freedom

Using the output we can calculate the rate ratio as

> exp(0.3908)
[1] 1.478163

The conclusion is that the risk of having an accident is 1.478 times higher for males than for females.
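The slides do not show how `dat` was built for this example. A minimal sketch, assuming the column names `y`, `years` and `sex` from the `glm()` call and a coding of 1 = female, 2 = male (an assumption, chosen because it reproduces the sign of the printed `sex2` estimate), is:

```r
# Data from the table on the slides; person-years are in thousands.
# Assumption: sex is coded 1 = female, 2 = male.
dat <- data.frame(
  sex   = factor(c(1, 2)),
  y     = c(175, 320),
  years = c(17.30, 21.40)
)

fit <- glm(y ~ offset(log(years)) + sex, family = poisson, data = dat)
coef(fit)

# With one observation per sex the model is saturated, so the fitted
# rate ratio equals the ratio of the crude accident rates.
rate_ratio_glm   <- unname(exp(coef(fit)["sex2"]))
rate_ratio_crude <- (320 / 21.40) / (175 / 17.30)
c(glm = rate_ratio_glm, crude = rate_ratio_crude)
```

Because the offset enters with a fixed coefficient of one, the intercept is the log accident rate per thousand person-years for the reference group rather than a log count.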

Some comments

Residual deviance as goodness of fit - binomial/binary data

- When $\sum_i n_i$ is reasonably large, the $\chi^2$-approximation of the residual deviance is usually good and the residual deviance can be used as a goodness-of-fit measure.
- The approximation is not particularly good if some of the binomial denominators $n_i$ are very small and the fitted probabilities under the current model are near zero or unity.
- In the special case when $n_i = 1$ for all $i$, that is, when the data are binary, the deviance is not even approximately distributed as $\chi^2$ and cannot be used as a goodness-of-fit measure.

More comments...

- In a binomial setup where all $n_i$ are large, the standardized deviance residuals should be close to Gaussian. The normal probability plot can be used to check this.
- In a Poisson setup where the counts are large, the standardized deviance residuals should be close to Gaussian. The normal probability plot can be used to check this.
- In a binomial setup where $x_i$ (the number of successes) is very small in some of the groups, numerical problems sometimes occur in the estimation. This is often seen as very large standard errors of the parameter estimates.
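The normal probability plot mentioned in the first two comments could be produced along the following lines. The data here are synthetic Poisson counts with fairly large means, generated only to make the sketch runnable; the coefficients and object names are assumptions.

```r
set.seed(7)  # synthetic counts, for illustration only

# Hypothetical Poisson data with fairly large means.
x <- seq(0, 1, length.out = 40)
y <- rpois(40, lambda = exp(3 + 1.5 * x))

fit <- glm(y ~ x, family = poisson)

# Standardized deviance residuals and their normal probability plot.
r_std <- rstandard(fit, type = "deviance")
qqnorm(r_std, main = "Standardized deviance residuals")
qqline(r_std)
```

Points lying close to the reference line are consistent with approximately Gaussian residuals; with small counts or small binomial denominators the plot is much harder to interpret.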