
Mixed-effects models
An introduction by Christoph Scherber

Up to now, we have been dealing with linear models of the form

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where $\beta_0$ and $\beta_1$ are parameters of fixed value.

Example: Let us assume that we are measuring the yield of a crop plant on 5 different plots at 4 different observation times.

    yield=rnorm(20,150)
    plot=gl(5,4)

Let us start off with a wrong model, ignoring the grouping of our data points, and assuming that all 20 plants harvested were independent random samples:

    model1=lm(yield~1)
    summary(model1)

    Call:
    lm(formula = yield ~ 1)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.42295 -0.22550  0.06235  0.57603  1.63489

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept) 149.8674     0.1881   796.8   <2e-16 ***

    Residual standard error: 0.8412 on 19 degrees of freedom

We conclude that $\hat{\beta}_0 = 149.8674$ and $\hat{\sigma} = 0.8412$. If we inspect the residuals of this model separately for each plot, we see that there is high variability between plots (see figure).

[Figure: residuals of model1 shown separately for each plot; x-axis: resid(model1), y-axis: plot (1 to 5)]
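The intercept-only fit can be checked by hand. The following is a minimal sketch, not the author's code: the seed is an assumption added for reproducibility (the original sets none, so the numbers will differ from the output shown above). It illustrates that the intercept is simply the grand mean, that the residual standard error is the ordinary sample standard deviation, and that the mean residual differs from plot to plot.

```r
# Assumed seed, added only so that repeated runs give the same numbers.
set.seed(1)
yield <- rnorm(20, mean = 150)   # 20 simulated yields (mean 150, sd 1)
plot  <- gl(5, 4)                # factor: 5 plots, 4 observations each

model1 <- lm(yield ~ 1)          # intercept-only model, ignores the grouping

# The estimated intercept is the grand mean of yield ...
intercept <- coef(model1)[["(Intercept)"]]
# ... and the residual standard error is the sample standard deviation.
resid_se <- sigma(model1)

# Mean residual per plot: values away from zero are plot-level shifts
# that model1 leaves unexplained.
resid_by_plot <- tapply(resid(model1), plot, mean)
print(resid_by_plot)
```

A call such as boxplot(resid(model1) ~ plot, horizontal = TRUE) should draw a display comparable to the figure described above.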

We can improve our initial model by formulating a fixed-effects model with a different mean estimated for every plot:

    model2=lm(yield~plot-1)
    summary(model2)

    Call:
    lm(formula = yield ~ plot - 1)

    Residuals:
         Min       1Q   Median       3Q      Max
    -2.27885 -0.33863  0.08602  0.46127  1.77899

    Coefficients:
          Estimate Std. Error t value Pr(>|t|)
    plot1 149.7233     0.4524   331.0   <2e-16 ***
    plot2 149.8896     0.4524   331.4   <2e-16 ***
    plot3 149.4947     0.4524   330.5   <2e-16 ***
    plot4 150.0474     0.4524   331.7   <2e-16 ***
    plot5 150.1819     0.4524   332.0   <2e-16 ***

    Residual standard error: 0.9047 on 15 degrees of freedom
    Multiple R-squared: 1, Adjusted R-squared: 1
    F-statistic: 1.098e+05 on 5 and 15 DF, p-value: < 2.2e-16

The residual s.e. for this model is 0.9047, which is similar to the one obtained previously. The residuals of this model are now centered around zero within each plot:

[Figure: residuals of model2 shown separately for each plot; x-axis: resid(model2), y-axis: plot (1 to 5)]
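To see what the model without an intercept estimates, the coefficients can be compared with the raw plot means. This is a hedged sketch under the same assumed seed as before (not part of the original), so the values differ from the output above. The relation it checks is general for this balanced design: each coefficient is a plot mean, and each standard error is the residual standard error divided by the square root of 4, the number of observations per plot (0.9047 / 2 ≈ 0.4524 in the output above).

```r
# Assumed seed; the original tutorial does not set one.
set.seed(1)
yield <- rnorm(20, mean = 150)
plot  <- gl(5, 4)

model2 <- lm(yield ~ plot - 1)   # "- 1" drops the intercept: one mean per plot

# Coefficients are the five plot means.
plot_means <- tapply(yield, plot, mean)
coef_ok <- all.equal(unname(coef(model2)), as.vector(plot_means))

# Every standard error equals residual SE / sqrt(observations per plot).
se <- coef(summary(model2))[, "Std. Error"]
se_ok <- all.equal(unname(se), rep(sigma(model2) / sqrt(4), 5))

# Residuals average to zero within every plot (up to rounding error).
resid_means <- tapply(resid(model2), plot, mean)
print(round(resid_means, 10))
```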

Both models we constructed so far were wrong, because they did not account for the fact that the plots we used were just a random sample from a large number of possible plots that could have been chosen. They also did not account for the pseudoreplication (several samples taken per plot). This is evident if we look at the ANOVA table for model2:

    > anova(model2)
    Analysis of Variance Table

    Response: yield
              Df Sum Sq Mean Sq F value    Pr(>F)
    plot       5 449206   89841  109763 < 2.2e-16 ***
    Residuals 15     12       1

We see that the estimates are based on a sample size of 20 data points, but there were only 5 plots in total. One solution would be to analyse the experiment as a split-plot ANOVA. However, because there are no treatments applied below the plot scale, there are not enough degrees of freedom to test for significant differences, and we only get the corresponding sums of squares and variances:

    model3=aov(yield~1+Error(plot))
    summary(model3)

    Error: plot
              Df  Sum Sq Mean Sq
    Residuals  4 1.16607 0.29152

    Error: Within
              Df  Sum Sq Mean Sq
    Residuals 15 12.2775  0.8185

The only sensible way to analyse these data is to use a mixed-effects model. Suppose, for example, that we had 100 plots instead of 5; the number of parameters in our classical linear models would then increase linearly as more and more plots are added. The plots themselves, however, are uninteresting in the sense that we only want to predict mean plant yield and how much variance there is between plots. We are not interested in specific plot comparisons (for example, whether plot 33 differed significantly from plot 67).

To understand the transition from fixed to mixed-effects models, we first need to come back to our initial model formulation, which was (in this case) model2:

    model.matrix(model2)
      plot1 plot2 plot3 plot4 plot5
    1     1     0     0     0     0
    2     1     0     0     0     0
    3     1     0     0     0     0
    4     1     0     0     0     0
    5     0     1     0     0     0
    6     0     1     0     0     0
    7     0     1     0     0     0
    8     0     1     0     0     0
    (...)

    contrasts(plot)
      2 3 4 5
    1 0 0 0 0
    2 1 0 0 0
    3 0 1 0 0
    4 0 0 1 0
    5 0 0 0 1

You can see that four dummy variables are used for the k-1 treatment contrasts of the factor plot (R's default coding when the model contains an intercept); in model2, which has no intercept, there is one indicator column per plot instead. This is not at all what we want! As we just said, we are not interested in those specific comparisons: had there been 1000 plots, there would be 999 possible comparisons against the first plot alone, and we would be very likely to find (by chance alone) some plots differing significantly in mean yield.

Hence, in mixed-effects models, some or all of the parameters β in a model are not treated as fixed parameters, but as random variables. This has the great advantage that it saves us a lot of degrees of freedom, and it allows an estimation of between-plot and within-plot variability. Expressed as a mixed-effects model, any linear model formula now becomes:

$$y = X\beta + Zb + \varepsilon$$

Thus, there is now a mixture of both fixed effects $\beta$ and random effects $b$. These random effects are assumed to have mean 0 and variance $\sigma_b^2$. Our model1, expressed as a mixed-effects model, could now become

$$y_{ij} = \beta_0 + b_{0i} + \varepsilon_{ij}$$

where $i$ indexes the plots and $j$ the observations within a plot. This means that a fixed intercept term $\beta_0$ is estimated, but the deviations from this fixed effect are assumed to be random deviations between plots ($b_{0i}$), plus random variation within plots ($\varepsilon_{ij}$). Let's try this out in R:

    library(nlme)
    model4=lme(yield~1,random=~1|plot)
    summary(model4)

    Linear mixed-effects model fit by REML
     Data: NULL
           AIC      BIC    logLik
      56.34253 59.17585 -25.17127

    Random effects:
     Formula: ~1 | plot
             (Intercept)  Residual
    StdDev: 1.988913e-05 0.8411627

    Fixed effects: yield ~ 1
                   Value Std.Error DF  t-value p-value
    (Intercept) 149.8674 0.1880897 15 796.7867       0

    Standardized Within-Group Residuals:
           Min         Q1        Med         Q3        Max
    -2.8804785 -0.2680754  0.0741240  0.6848041  1.9436069

    Number of Observations: 20
    Number of Groups: 5
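The variance components reported by lme can be cross-checked against the mean squares of a one-way ANOVA. The sketch below is an illustration rather than the author's code: the seed, the data frame name dat, and the helper variable names are assumptions. For a balanced design, a method-of-moments estimate of the between-plot variance is (MS between - MS within) / n, with n = 4 observations per plot, truncated at zero when negative.

```r
library(nlme)   # recommended package, shipped with standard R installations

# Assumed seed and data frame name; neither appears in the original.
set.seed(1)
yield <- rnorm(20, mean = 150)
plot  <- gl(5, 4)
dat   <- data.frame(yield = yield, plot = plot)

# Random-intercept model: fixed overall mean, random deviation per plot.
model4 <- lme(yield ~ 1, random = ~ 1 | plot, data = dat)

fixed_intercept <- fixef(model4)[["(Intercept)"]]  # estimate of beta_0
print(VarCorr(model4))   # between-plot and residual variance / std. dev.

# Method-of-moments cross-check from the one-way ANOVA mean squares.
ms <- summary(aov(yield ~ plot, data = dat))[[1]][["Mean Sq"]]
ms_between <- ms[1]
ms_within  <- ms[2]
var_between <- max(0, (ms_between - ms_within) / 4)  # truncated at zero
print(c(between = var_between, within = ms_within))
```

Applied to the mean squares reported earlier (0.29152 between plots, 0.8185 within plots), the same formula gives (0.29152 - 0.8185) / 4 ≈ -0.13. A negative value is truncated to zero, which is consistent with the near-zero between-plot standard deviation (1.99e-05) in the lme output and with the residual standard deviation (0.8412) matching that of model1. This is expected here, because the data were simulated without any true plot effect.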