GOV 2001/ 1002/ E-200 Section 3 Inference and Likelihood


GOV 2001/ 1002/ E-200 Section 3 Inference and Likelihood Anton Strezhnev Harvard University February 10, 2016 1 / 44

LOGISTICS Reading Assignment: Unifying Political Methodology ch. 4 and "Eschewing Obfuscation". Problem Set 3: due by 6pm, 2/24 on Canvas. Assessment Question: due by 6pm, 2/24 on Canvas. You must work alone and have only one attempt. 2 / 44

REPLICATION PAPER 1. Read "Publication, Publication". 2. Find a coauthor. See the Canvas discussion board to help with this. 3. Choose a paper based on the criteria in "Publication, Publication". 4. Have a classmate sign off on your paper choice. 3 / 44

OVERVIEW In this section you will... learn how to derive a likelihood function for some data given a data-generating process. learn how to calculate a Bayesian posterior distribution and generate quantities of interest from it. learn about common pitfalls in hypothesis testing and think about how to interpret p-values more critically. learn that Frequentists and Bayesians aren't really that different after all! 4 / 44

OUTLINE Likelihood Inference Bayesian Inference Hypothesis Testing 5 / 44

LIKELIHOOD INFERENCE Last week we talked about probability: given the parameters, what's the probability of the data? This week we're talking about inference: given the data, what can we say about the parameters? Likelihood approaches to inference ask: what parameters make our data most likely? 6 / 44

EXAMPLE: AGE DISTRIBUTION OF ER VISITS DUE TO WALL PUNCHING We have a dataset from the U.S. Consumer Product Safety Commission's National Electronic Injury Surveillance System (NEISS) containing data on ER visits in 2014. Let's take a look at one injury category: wall punching. We're interested in modelling the distribution of the ages of individuals who visit the ER having punched a wall. To do this, we write down a probability model for the data. 7 / 44

EMPIRICAL DISTRIBUTION OF WALL-PUNCHING AGES [Figure: histogram of the ages of ER patients who punched a wall in 2014; x-axis: Age (0-80), y-axis: Share (0-0.06).] 8 / 44

A MODEL FOR THE DATA: LOG-NORMAL DISTRIBUTION We observe $n$ observations of ages, $\mathbf{Y} = \{Y_1, \ldots, Y_n\}$. A normal distribution doesn't seem like a reasonable model since age is strictly positive and the distribution is somewhat right-skewed. But a log-normal might be reasonable! We assume that each $Y_i \sim \text{Log-Normal}(\mu, \sigma^2)$ and that the $Y_i$ are independently and identically distributed. We could extend this model by adding covariates (e.g. $\mu_i = X_i\beta$). 9 / 44

EXAMPLE: AGE DISTRIBUTION OF ER VISITS DUE TO WALL PUNCHING The density of the log-normal distribution is given by
$$f(Y_i \mid \mu, \sigma^2) = \frac{1}{Y_i \sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln(Y_i) - \mu)^2}{2\sigma^2}\right)$$
This is basically the same as saying $\ln(Y_i)$ is normally distributed! 10 / 44
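
A quick way to see the log-normal/normal connection is by simulation. The following is a minimal R sketch (not from the slides); the seed, sample size, and parameter values are arbitrary illustrations.

### Not in the original slides: a quick simulation check of the log-normal/normal link
set.seed(2001)                                 # arbitrary seed for reproducibility
y <- rlnorm(10000, meanlog = 3, sdlog = 0.4)   # hypothetical "ages", mu = 3, sigma = 0.4
mean(log(y))                                   # should be close to 3
sd(log(y))                                     # should be close to 0.4
# qqnorm(log(y)) would show an approximately straight line,
# since log(Y) ~ Normal(mu, sigma^2) when Y ~ Log-Normal(mu, sigma^2)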

WRITING A LIKELIHOOD After writing a probability model for the data, we can write the likelihood of the parameters given the data. By the definition of likelihood,
$$L(\mu, \sigma^2 \mid \mathbf{Y}) \propto f(\mathbf{Y} \mid \mu, \sigma^2)$$
Unfortunately, $f(\mathbf{Y} \mid \mu, \sigma^2)$ is an $n$-dimensional density, and $n$ is huge! How do we simplify this? The i.i.d. assumption lets us factor the density:
$$L(\mu, \sigma^2 \mid \mathbf{Y}) \propto \prod_{i=1}^{n} f(Y_i \mid \mu, \sigma^2)$$
11 / 44

WRITING A LIKELIHOOD Now we can plug in our assumed density for $Y_i$:
$$L(\mu, \sigma^2 \mid \mathbf{Y}) \propto \prod_{i=1}^{n} \frac{1}{Y_i \sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln(Y_i) - \mu)^2}{2\sigma^2}\right)$$
However, if we tried to calculate this in R, the value would be incredibly small! It's the product of a large number of density values, most of them well below 1. Computers have problems with numbers that small and round them to 0 (underflow). It's also often analytically easier to work with sums than with products. This is why we typically work with the log-likelihood (often denoted $\ell$). Because taking the log is a monotonic transformation, it retains the proportionality: multiplicative constants in $L$ become additive constants in $\ell = \ln L$, and the same parameter values maximize $L(\mu, \sigma^2 \mid \mathbf{Y})$ and $\ell(\mu, \sigma^2 \mid \mathbf{Y})$. 12 / 44
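
To see the underflow problem concretely, here is a small R sketch (not from the slides) comparing the raw product of log-normal densities with the sum of their logs; the simulated data and parameter values are made up for illustration.

### Not in the original slides: demonstrating underflow with products of densities
set.seed(2001)
y <- rlnorm(5000, meanlog = 3, sdlog = 0.4)           # hypothetical data
prod(dlnorm(y, meanlog = 3, sdlog = 0.4))             # underflows to 0 on most machines
sum(dlnorm(y, meanlog = 3, sdlog = 0.4, log = TRUE))  # a perfectly usable log-likelihood value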

LOGARITHM REVIEW! Logs turn exponentiation into multiplication and multiplication into summation.
$\log(ab) = \log(a) + \log(b)$
$\log(a/b) = \log(a) - \log(b)$
$\log(a^b) = b\log(a)$
$\log(e) = \ln(e) = 1$
$\log(1) = 0$
Notational note: in math, log is almost always used as shorthand for the natural log (ln), as opposed to the base-10 log. 13 / 44

DERIVING THE LOG-LIKELIHOOD
$$\ell(\mu, \sigma^2 \mid \mathbf{Y}) \propto \ln\left[\prod_{i=1}^{n} f(Y_i \mid \mu, \sigma^2)\right]$$
$$\propto \ln\left[\prod_{i=1}^{n} \frac{1}{Y_i \sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln(Y_i) - \mu)^2}{2\sigma^2}\right)\right]$$
$$\propto \sum_{i=1}^{n} \ln\left[\frac{1}{Y_i \sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln(Y_i) - \mu)^2}{2\sigma^2}\right)\right]$$
$$\propto \sum_{i=1}^{n} \left\{-\ln(Y_i) - \ln(\sigma) - \ln(\sqrt{2\pi}) + \ln\left[\exp\left(-\frac{(\ln(Y_i) - \mu)^2}{2\sigma^2}\right)\right]\right\}$$
$$\propto \sum_{i=1}^{n} \left\{-\ln(Y_i) - \ln(\sigma) - \ln(\sqrt{2\pi}) - \frac{(\ln(Y_i) - \mu)^2}{2\sigma^2}\right\}$$
14 / 44

DERIVING THE LOG-LIKELIHOOD To simplify further, we can drop constants that are not functions of the parameters (multiplicative on the likelihood scale, additive on the log scale), since dropping them retains proportionality.
$$\sum_{i=1}^{n} \left\{-\ln(Y_i) - \ln(\sigma) - \ln(\sqrt{2\pi}) - \frac{(\ln(Y_i) - \mu)^2}{2\sigma^2}\right\}$$
$$\propto \sum_{i=1}^{n} \left\{-\ln(\sigma) - \frac{(\ln(Y_i) - \mu)^2}{2\sigma^2}\right\}$$
15 / 44
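
As a sanity check on the algebra, here is a small R sketch (not from the slides) that codes the simplified log-likelihood by hand and compares it to a dlnorm-based version; the two should differ only by a constant that does not depend on µ and σ. The data and parameter values are made up for illustration.

### Not in the original slides: checking the simplified log-likelihood against dlnorm
set.seed(2001)
y <- rlnorm(1000, meanlog = 3, sdlog = 0.4)   # hypothetical data

loglik.full <- function(mu, sigma, Y) sum(dlnorm(Y, meanlog = mu, sdlog = sigma, log = TRUE))
loglik.simplified <- function(mu, sigma, Y) sum(-log(sigma) - (log(Y) - mu)^2 / (2 * sigma^2))

# The difference should be the same constant for any (mu, sigma),
# because we only dropped terms that do not involve the parameters.
loglik.full(3, 0.4, y) - loglik.simplified(3, 0.4, y)
loglik.full(2, 1.0, y) - loglik.simplified(2, 1.0, y)   # same value as above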

WRITING THE LOG-LIKELIHOOD IN R We can often make use of R's built-in density functions to write a function that takes µ, σ, and the data as inputs. Here, we want dlnorm (the density of the log-normal).

### Log-likelihood function
log.likelihood.func <- function(mu, sigma, Y){
  # Return the sum of the log of dlnorm evaluated at every Y for fixed mu and sigma
  return(sum(dlnorm(Y, meanlog = mu, sdlog = sigma, log = TRUE)))  # log = TRUE returns the log-density
}

16 / 44

PLOTTING THE LOG-LIKELIHOOD [Figure: Contour plot of the log-likelihood for different values of µ and σ.] 17 / 44

PLOTTING THE LIKELIHOOD [Figure: Plot of the log-likelihood surface for different values of µ and σ.] 18 / 44

PLOTTING THE LIKELIHOOD [Figure: Plot of the conditional log-likelihood of µ given σ = 2.] 19 / 44

COMPARING MODELS USING LIKELIHOOD In future problem sets, you'll be directly optimizing (either analytically or using R) to find the parameters that maximize the likelihood. For today, we'll eyeball it and compare the fit to the data for parameters that yield low likelihoods vs. higher likelihoods. Example 1: µ = 4, σ = .2: log-likelihood = -18048.79. Example 2: µ = 3.099, σ = 0.379: log-likelihood = -4461.054 (actually the MLE)! Let's plot the implied distribution of $Y_i$ for each parameter set over the empirical histogram! 20 / 44
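
For reference, maximizing the log-likelihood numerically in R might look like the sketch below. This is not the slides' code: the optim call, starting values, and the vector name `ages` are illustrative assumptions. With the NEISS wall-punching ages it should land near the MLEs quoted above (µ ≈ 3.099, σ ≈ 0.379).

### Not in the original slides: finding the MLE numerically with optim()
### Assumes `ages` is the vector of ER-visit ages (hypothetical name)
neg.loglik <- function(par, Y){
  # par[1] = mu, par[2] = sigma; optim() minimizes, so return the negative log-likelihood
  -sum(dlnorm(Y, meanlog = par[1], sdlog = par[2], log = TRUE))
}
mle.fit <- optim(par = c(3, 1), fn = neg.loglik, Y = ages, method = "L-BFGS-B",
                 lower = c(-Inf, 1e-6))   # keep sigma strictly positive
mle.fit$par                               # should be close to (3.099, 0.379) for these data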

COMPARING MODELS USING LIKELIHOOD [Figure: Empirical distribution of ages of ER patients who punched a wall in 2014 vs. the log-normal density with µ = 4 and σ = .2.] 21 / 44

COMPARING MODELS USING LIKELIHOOD [Figure: Empirical distribution of ages of ER patients who punched a wall in 2014 vs. the log-normal density using the MLEs of the parameters.] 22 / 44
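
A plot like the ones above can be produced with a few lines of R. This is a hedged sketch, not the slides' code; it again assumes a vector named `ages` and reuses the parameter values from the two examples.

### Not in the original slides: overlaying implied log-normal densities on the histogram
### Assumes `ages` is the vector of ER-visit ages (hypothetical name)
hist(ages, breaks = 40, freq = FALSE, xlab = "Age",
     main = "Ages of ER patients who punched a wall in 2014")
curve(dlnorm(x, meanlog = 4, sdlog = 0.2), add = TRUE, lty = 2)        # Example 1: poor fit
curve(dlnorm(x, meanlog = 3.099, sdlog = 0.379), add = TRUE, lwd = 2)  # Example 2: the MLE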

OUTLINE Likelihood Inference Bayesian Inference Hypothesis Testing 23 / 44

LIKELIHOODS VS. BAYESIAN POSTERIORS
Likelihood:
$$L(\lambda \mid y) = k(y)\,p(y \mid \lambda) \propto p(y \mid \lambda)$$
There is a fixed, true value of λ. We use the likelihood to estimate λ with the MLE.
Bayesian posterior density:
$$p(\lambda \mid y) = \frac{p(\lambda)\,p(y \mid \lambda)}{p(y)} = \frac{p(\lambda)\,p(y \mid \lambda)}{\int_\lambda p(\lambda)\,p(y \mid \lambda)\,d\lambda} \propto p(\lambda)\,p(y \mid \lambda)$$
λ is a random variable and therefore has fundamental uncertainty. We use the posterior density to make probability statements about λ.
24 / 44

UNDERSTANDING THE POSTERIOR DENSITY In Bayesian inference, we have a prior subjective belief about λ, which we update with the data to form posterior beliefs about λ.
$$p(\lambda \mid y) \propto p(\lambda)\,p(y \mid \lambda)$$
$p(\lambda \mid y)$ is the posterior density. $p(\lambda)$ is the prior density. $p(y \mid \lambda)$ is proportional to the likelihood. 25 / 44

BAYESIAN INFERENCE The whole point of Bayesian inference is to combine information about the data-generating process with subjective beliefs about our parameters in our inference. Here are the basic steps: 1. Think about your subjective beliefs about the parameters you want to estimate. 2. Find a distribution that you think captures your prior beliefs about the parameters. 3. Think about your data-generating process. 4. Find a distribution that you think explains the data. 5. Derive the posterior distribution. 6. Plot the posterior distribution. 7. Summarize the posterior distribution (posterior mean, posterior standard deviation, posterior probabilities). 26 / 44

EXAMPLE: WAITING TIME FOR A TAXI ON MASS AVE If you randomly show up on Massachusetts Avenue, how long will it take you to hail a taxi? 27 / 44

EXAMPLE: WAITING TIME FOR A TAXI ON MASS AVE Let's assume that waiting times $X_i$ (in minutes) are distributed Exponentially with parameter λ:
$$X_i \sim \text{Expo}(\lambda), \qquad f(X_i \mid \lambda) = \lambda e^{-\lambda X_i}$$
We observe one observation of $X_i = 7$ minutes and want to make inferences about λ. Quiz: using what you know about the mean of the exponential, what would be a good guess for λ without any prior information? Answer: $\frac{1}{7}$, since the mean of the Expo is $\frac{1}{\lambda}$. 28 / 44
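
To make the quiz answer concrete, here is a minimal R sketch (not from the slides) that plots the likelihood of λ for the single observation X = 7 and confirms that it peaks near 1/7. The grid of λ values is an arbitrary choice.

### Not in the original slides: likelihood of lambda for a single Expo observation X = 7
x <- 7
lambda.grid <- seq(0.01, 1, by = 0.001)
lik <- dexp(x, rate = lambda.grid)      # L(lambda | x) = lambda * exp(-lambda * x)
plot(lambda.grid, lik, type = "l", xlab = "lambda", ylab = "Likelihood")
lambda.grid[which.max(lik)]             # approximately 1/7 = 0.143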

DERIVING A POSTERIOR DISTRIBUTION
$$p(\lambda \mid X_i) = \frac{p(X_i \mid \lambda)\,p(\lambda)}{p(X_i)} \propto p(X_i \mid \lambda)\,p(\lambda) \propto \lambda e^{-\lambda X_i}\,p(\lambda)$$
Even when deriving Bayesian posteriors, it's often easier to work without proportionality constants (e.g. $p(X_i)$). You can figure out these normalizing constants at the end by integration, since you know that a valid probability density must integrate to 1. 29 / 44

DERIVING A POSTERIOR DISTRIBUTION How do we choose a distribution for $p(\lambda)$? The difficulty of this question is why Bayesian methods only recently gained wider adoption. Most prior choices give posteriors that are analytically intractable (we can't express them in a neat mathematical form). More advanced computational methods (like MCMC) make this less of an issue. However, for some distributions of the data, there are distributions called conjugate priors. These priors retain the shape of their distribution after being multiplied by the likelihood of the data. Example: the Beta distribution is conjugate to Binomial data. 30 / 44

DERIVING A POSTERIOR DISTRIBUTION The conjugate prior for λ with Exponential data is the Gamma distribution, so we assume a prior of the form $\lambda \sim \text{Gamma}(\alpha, \beta)$. α and β are hyperparameters: we have to assume values for them that capture our prior beliefs. In the case of the Expo-Gamma relationship, α and β have substantive meaning: you can think of the prior as encoding α previously observed taxi waits that sum to a total of β minutes. 31 / 44

DERIVING A POSTERIOR DISTRIBUTION
$$p(\lambda \mid X_i) \propto \lambda e^{-\lambda X_i}\,p(\lambda) \propto \lambda e^{-\lambda X_i}\,\lambda^{\alpha - 1} e^{-\beta\lambda} = \lambda^{\alpha} e^{-\lambda(X_i + \beta)}$$
By inspection, the posterior for λ also has the form of a Gamma; here, it's $\text{Gamma}(\alpha + 1, \beta + X_i)$. We could also integrate the expression above to get the normalizing constant and obtain an explicit density if we didn't recognize it as a known distribution. 32 / 44

PLOTTING THE POSTERIOR [Figure: Prior and posterior densities for λ (red = prior, blue = posterior); the vertical line denotes the MLE. α = 3, β = 10.] 33 / 44
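
A figure like this can be reproduced from the conjugate results above. The following R sketch (not the slides' code) uses the stated prior Gamma(3, 10), the single observation X = 7, and the resulting posterior Gamma(4, 17).

### Not in the original slides: plotting the Gamma(3, 10) prior and Gamma(4, 17) posterior
alpha <- 3; beta <- 10; x <- 7
lambda.grid <- seq(0.001, 1, by = 0.001)
prior <- dgamma(lambda.grid, shape = alpha, rate = beta)
posterior <- dgamma(lambda.grid, shape = alpha + 1, rate = beta + x)
plot(lambda.grid, posterior, type = "l", col = "blue", xlab = "Lambda", ylab = "Density")
lines(lambda.grid, prior, col = "red")
abline(v = 1 / x)                # MLE of lambda from a single observation
(alpha + 1) / (beta + x)         # posterior mean, about 0.235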

OUTLINE Likelihood Inference Bayesian Inference Hypothesis Testing 34 / 44

IS ESP REAL? Bem (2011) conducted 9 experiments purporting to show evidence of precognition. In one experiment, 100 respondents were asked to repeatedly guess which curtain had a picture hidden behind it. Under the null hypothesis, the guess rate by chance would be 50%. But Bem found that explicit images were significantly more likely to be predicted (53.1%), with a p-value of .01! Should we conclude that precognition exists? What makes Bem's p-value different from one that you calculate in your study? Answer: your priors about the effect size will affect how you interpret p-values. 35 / 44

HYPOTHESIS TESTING [Figure: A misleading caricature - everyone uses priors.] 36 / 44

EVERYONE'S A LITTLE BIT BAYESIAN Frequentist inference doesn't mean that prior information is irrelevant (despite popular interpretations). All inferences depend on prior beliefs about the plausibility of a hypothesis.[1] Where Bayesians and Frequentists differ is in how that information is used. Bayesians use a formally defined prior. Advantage: explicitly incorporates prior beliefs into final inferences in a rigorous way. Disadvantages: the prior needs to be elicited explicitly (in the form of a distribution); wrong priors give misleading results; computational issues arise with non-conjugate priors. Frequentists use prior information in the design and interpretation of studies. Advantage: not necessary to formulate prior beliefs in terms of a specific probability distribution. Disadvantage: no clear rules for how prior information should be weighed relative to the data at hand. [1] See Andy Gelman's comments at http://andrewgelman.com/2012/11/10/16808/ 37 / 44

EVERYONE'S A LITTLE BIT BAYESIAN Don't forget what you learned in Intro to Probability! Classic example: a disease has a very low base rate (.1% of the population). A test for the disease has a 5% false positive rate and a 5% false negative rate. Given that you test positive, what's the probability you have the disease? Bayes' rule:
$$P(D \mid +) = \frac{P(+ \mid D)P(D)}{P(+ \mid D)P(D) + P(+ \mid \text{Not } D)P(\text{Not } D)} = \frac{.95 \times .001}{.95 \times .001 + .05 \times .999} = .01866 \approx 1.9\%$$
The same principles apply to hypothesis testing! It's always important to ask: given my decision to reject, how likely is it that my decision is misleading? 38 / 44
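
The arithmetic is easy to verify in R; this short snippet (not from the slides) just re-computes the posterior probability from Bayes' rule.

### Not in the original slides: checking the disease-test calculation
p.disease <- 0.001; sens <- 0.95; false.pos <- 0.05
(sens * p.disease) / (sens * p.disease + false.pos * (1 - p.disease))   # about 0.0187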

THINKING ABOUT P-VALUES We typically calibrate p-values in terms of Type I error, that is, the false positive rate. But the false positive rate can be misleading conditional on a positive result. Determining how informative our result is depends on additional design-related factors: 1) the effect size, 2) the sample size. 39 / 44

TYPE M AND S ERRORS Gelman and Carlin (2014) suggest also considering Type S (Sign) and Type M (Magnitude) error rates that are conditional on rejecting. Type S error: given that you reject the null, what's the probability that your point estimate has the wrong sign? Type M error: given that you reject the null, what's the probability that your estimate is too extreme? Both depend not only on your sampling distribution's variance, but also on the effect size. 40 / 44

CALCULATING TYPE M AND S ERROR RATES Example of low power: effect = .2, population variance = 16, N = 50. [Figure: sampling distribution of the effect estimate; Type 'S' region: reject and conclude the wrong direction; Type 'M' region: reject and conclude the effect is more than 5x larger than the truth.] Pr(Reject) = .0644. Pr(Wrong Sign | Reject) = .16. Pr(Estimate 5x Truth | Reject) = .84. 41 / 44

CALCULATING TYPE M AND S ERROR RATES Example of moderate power: effect = .2, population variance = 16, N = 500. [Figure: sampling distribution of the effect estimate; Type 'S' region: reject and conclude the wrong direction.] Pr(Reject) = .200. Pr(Wrong Sign | Reject) = .005. Low probability of a Type S error, and our positive estimates are a lot more reasonable! 42 / 44
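
Numbers like these can be approximated by simulating the sampling distribution of the estimate. The R sketch below (not the slides' code) assumes the estimate is a sample mean of N draws with true effect 0.2 and population variance 16, tested two-sided at the 5% level; it should give values close to those quoted for N = 50.

### Not in the original slides: Monte Carlo approximation of Type S and Type M error rates
set.seed(2001)
effect <- 0.2; sigma <- 4; N <- 50
se <- sigma / sqrt(N)
est <- rnorm(1e6, mean = effect, sd = se)   # sampling distribution of the estimate
reject <- abs(est / se) > qnorm(0.975)      # two-sided test at the 5% level
mean(reject)                                # Pr(Reject), about .064
mean(est[reject] < 0)                       # Type S: wrong sign given rejection, about .16
mean(est[reject] > 5 * effect)              # rejections exceeding 5x the true effect, about .84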

TAKEAWAYS FOR HYPOTHESIS TESTING General rule: smaller effects require larger samples (more data) to reliably detect. A rule for tiny sample sizes and tiny effects: you're probably getting nothing, and if you get something, it's probably wrong. A rule for reading published p-values: just because it's peer-reviewed and published doesn't mean it's true. 43 / 44

QUESTIONS Questions? 44 / 44