Conjugate priors: Beta and normal
Class 15, 18.05
Jeremy Orloff and Jonathan Bloom


1 Learning Goals

1. Understand the benefits of conjugate priors.

2. Be able to update a beta prior given a Bernoulli, binomial, or geometric likelihood.

3. Understand and be able to use the formula for updating a normal prior given a normal likelihood with known variance.

2 Introduction and definition

In this reading, we will elaborate on the notion of a conjugate prior for a likelihood function. With a conjugate prior the posterior is of the same type, e.g. for a binomial likelihood the beta prior becomes a beta posterior. Conjugate priors are useful because they reduce Bayesian updating to modifying the parameters of the prior distribution (so-called hyperparameters) rather than computing integrals.

Our focus in 18.05 will be on two important examples of conjugate priors: beta and normal. For a far more comprehensive list, see the tables at:
http://en.wikipedia.org/wiki/Conjugate_prior_distribution

We now give a definition of conjugate prior. It is best understood through the examples in the subsequent sections.

Definition. Suppose we have data with likelihood function f(x|θ) depending on a hypothesized parameter θ. Also suppose the prior distribution for θ is one of a family of parametrized distributions. If the posterior distribution for θ is in this family then we say the prior is a conjugate prior for the likelihood.

3 Beta distribution

In this section, we will show that the beta distribution is a conjugate prior for binomial, Bernoulli, and geometric likelihoods.

3.1 Binomial likelihood

We saw last time that the beta distribution is a conjugate prior for the binomial distribution. This means that if the likelihood function is binomial and the prior distribution is beta then the posterior is also beta.

More specifically, suppose that the likelihood follows a binomial(N, θ) distribution where N is known and θ is the (unknown) parameter of interest. We also have that the data x from one trial is an integer between 0 and N. Then for a beta prior we have the following table:

hypothesis   data   prior                     likelihood           posterior
θ            x      beta(a, b)                binomial(N, θ)       beta(a + x, b + N − x)
θ            x      c1 θ^(a−1) (1−θ)^(b−1)    c2 θ^x (1−θ)^(N−x)   c3 θ^(a+x−1) (1−θ)^(b+N−x−1)

The table is simplified by writing the normalizing coefficients as c1, c2 and c3 respectively. If needed, we can recover the values of c1 and c2 by recalling (or looking up) the normalizations of the beta and binomial distributions:

c1 = (a + b − 1)! / ((a − 1)! (b − 1)!),   c2 = (N choose x) = N! / (x! (N − x)!),   c3 = (a + b + N − 1)! / ((a + x − 1)! (b + N − x − 1)!)
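The update rule in the table is easy to sanity-check numerically. Here is a minimal Python sketch (not part of the original reading; the numbers a = 2, b = 3, N = 10, x = 6 are made up for illustration) comparing the conjugate posterior beta(a + x, b + N − x) with a brute-force posterior obtained by normalizing prior × likelihood on a grid.

```python
import numpy as np
from scipy.stats import beta, binom

# Illustrative (made-up) numbers: prior beta(2, 3); data x = 6 heads in N = 10 tosses.
a, b, N, x = 2, 3, 10, 6

theta = np.linspace(0, 1, 10001)
prior = beta(a, b).pdf(theta)
likelihood = binom(N, theta).pmf(x)   # binomial(N, theta) pmf at the observed x

# Brute-force Bayes: posterior is proportional to prior * likelihood; normalize on the grid.
unnorm = prior * likelihood
numeric_post = unnorm / (unnorm.sum() * (theta[1] - theta[0]))

# Conjugacy says the posterior is exactly beta(a + x, b + N - x).
exact_post = beta(a + x, b + N - x).pdf(theta)
print(np.abs(numeric_post - exact_post).max())   # tiny: agreement up to grid error
```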

3.2 Bernoulli likelihood

The beta distribution is a conjugate prior for the Bernoulli distribution. This is actually a special case of the binomial distribution, since Bernoulli(θ) is the same as binomial(1, θ). We do it separately because it is slightly simpler and of special importance. In the table below, we show the updates corresponding to success (x = 1) and failure (x = 0) on separate rows.

prior: beta(a, b);  likelihood: Bernoulli(θ);  posterior: beta(a + 1, b) or beta(a, b + 1)

hypothesis   data    prior                     likelihood   posterior
θ            x = 1   c1 θ^(a−1) (1−θ)^(b−1)    θ            c3 θ^a (1−θ)^(b−1)
θ            x = 0   c1 θ^(a−1) (1−θ)^(b−1)    1−θ          c3 θ^(a−1) (1−θ)^b

The constants c1 and c3 have the same formulas as in the previous (binomial likelihood) case with N = 1.

3.3 Geometric likelihood

Recall that the geometric(θ) distribution describes the probability of x successes before the first failure, where the probability of success on any single independent trial is θ. The corresponding pmf is given by p(x) = θ^x (1−θ).

Now suppose that we have a data point x, and our hypothesis is that x is drawn from a geometric(θ) distribution. From the table we see that the beta distribution is a conjugate prior for a geometric likelihood as well:

prior: beta(a, b);  likelihood: geometric(θ);  posterior: beta(a + x, b + 1)

hypothesis   data   prior                     likelihood   posterior
θ            x      c1 θ^(a−1) (1−θ)^(b−1)    θ^x (1−θ)    c3 θ^(a+x−1) (1−θ)^b
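As with the binomial case, this update can be checked numerically. The following Python sketch (ours, with made-up numbers: prior beta(2, 3) and x = 4 successes before the first failure) compares the table's beta(a + x, b + 1) posterior to a grid-normalized prior × likelihood.

```python
import numpy as np
from scipy.stats import beta

# Illustrative (made-up) numbers: prior beta(2, 3); x = 4 successes before the first failure.
a, b, x = 2, 3, 4

theta = np.linspace(0, 1, 10001)
unnorm = beta(a, b).pdf(theta) * theta**x * (1 - theta)   # prior * geometric likelihood
numeric_post = unnorm / (unnorm.sum() * (theta[1] - theta[0]))

# The table says the posterior is exactly beta(a + x, b + 1).
exact_post = beta(a + x, b + 1).pdf(theta)
print(np.abs(numeric_post - exact_post).max())   # tiny: agreement up to grid error
```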

At first it may seem strange that the beta distribution is a conjugate prior for both the binomial and geometric distributions. The key reason is that the binomial and geometric likelihoods are proportional as functions of θ. Let's illustrate this in a concrete example.

Example 1. While traveling through the Mushroom Kingdom, Mario and Luigi find some rather unusual coins. They agree on a prior of f(θ) ~ beta(5, 5) for the probability of heads, though they disagree on what experiment to run to investigate θ further.

a) Mario decides to flip a coin 5 times. He gets four heads in five flips.

b) Luigi decides to flip a coin until the first tails. He gets four heads before the first tail.

Show that Mario and Luigi will arrive at the same posterior on θ, and calculate this posterior.

answer: We will show that both Mario and Luigi find the posterior pdf for θ is a beta(9, 6) distribution.

Mario's table
hypothesis   data    prior beta(5, 5)    likelihood binomial(5, θ)   posterior
θ            x = 4   c1 θ^4 (1−θ)^4      (5 choose 4) θ^4 (1−θ)      c3 θ^8 (1−θ)^5

Luigi's table
hypothesis   data    prior beta(5, 5)    likelihood geometric(θ)     posterior
θ            x = 4   c1 θ^4 (1−θ)^4      θ^4 (1−θ)                   c3 θ^8 (1−θ)^5

Since both Mario's and Luigi's posteriors have the form of a beta(9, 6) distribution, that's what they both must be. The normalizing factor is the same in both cases because it's determined by requiring the total probability to be 1.

4 Normal begets normal

We now turn to another important example: the normal distribution is its own conjugate prior. In particular, if the likelihood function is normal with known variance, then a normal prior gives a normal posterior. Now both the hypotheses and the data are continuous.

Suppose we have a measurement x ~ N(θ, σ²) where the variance σ² is known. That is, the mean θ is our unknown parameter of interest and we are given that the likelihood comes from a normal distribution with variance σ². If we choose a normal prior pdf

f(θ) ~ N(µ_prior, σ_prior²)

then the posterior pdf is also normal: f(θ|x) ~ N(µ_post, σ_post²), where

µ_post/σ_post² = µ_prior/σ_prior² + x/σ²,   1/σ_post² = 1/σ_prior² + 1/σ².   (1)

The following form of these formulas is easier to read and shows that µ_post is a weighted average between µ_prior and the data x:

a = 1/σ_prior²,   b = 1/σ²,   µ_post = (aµ_prior + bx)/(a + b),   σ_post² = 1/(a + b).   (2)

With these formulas in mind, we can express the update via the table:

hypothesis   data   prior                                likelihood                posterior
θ            x      f(θ) ~ N(µ_prior, σ_prior²)          f(x|θ) ~ N(θ, σ²)         f(θ|x) ~ N(µ_post, σ_post²)
θ            x      c1 exp(−(θ−µ_prior)²/(2σ_prior²))    c2 exp(−(x−θ)²/(2σ²))     c3 exp(−(θ−µ_post)²/(2σ_post²))
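Formulas (2) translate directly into code. Here is a minimal Python sketch (the function name normal_update and the example numbers are ours, not from the reading) performing one normal-normal update.

```python
def normal_update(mu_prior, var_prior, x, var_lik):
    """One normal-normal update via formulas (2):
    a = 1/var_prior, b = 1/var_lik,
    mu_post = (a*mu_prior + b*x)/(a + b), var_post = 1/(a + b)."""
    a = 1.0 / var_prior
    b = 1.0 / var_lik
    return (a * mu_prior + b * x) / (a + b), 1.0 / (a + b)

# Made-up example: prior N(0, 4), likelihood variance 1, one measurement x = 2.
print(normal_update(0.0, 4.0, 2.0, 1.0))   # posterior mean 1.6, variance 0.8
```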

We leave the proof of the general formulas to the problem set. It is an involved algebraic manipulation which is essentially the same as the following numerical example.

Example 2. Suppose we have prior θ ~ N(4, 8) and likelihood function x ~ N(θ, 5). Suppose also that we have one measurement x₁ = 3. Show that the posterior distribution is normal.

answer: We will show this by grinding through the algebra, which involves completing the square.

prior: f(θ) = c1 e^(−(θ−4)²/16);   likelihood: f(x₁|θ) = c2 e^(−(x₁−θ)²/10) = c2 e^(−(3−θ)²/10)

We multiply the prior and likelihood to get the posterior:

f(θ|x₁) = c3 e^(−(θ−4)²/16) e^(−(3−θ)²/10) = c3 exp(−[(θ−4)²/16 + (3−θ)²/10])

We complete the square in the exponent:

(θ−4)²/16 + (3−θ)²/10 = [5(θ−4)² + 8(3−θ)²]/80
                      = [13θ² − 88θ + 152]/80
                      = [θ² − (88/13)θ + 152/13]/(80/13)
                      = [(θ − 44/13)² + 152/13 − (44/13)²]/(80/13).

Therefore the posterior is

f(θ|x₁) = c3 e^(−[(θ−44/13)² + 152/13 − (44/13)²]/(80/13)) = c4 e^(−(θ−44/13)²/(80/13)).

This has the form of the pdf for N(44/13, 40/13). QED

For practice we check this against the formulas (2). We have

µ_prior = 4,   σ_prior² = 8,   σ² = 5,   so   a = 1/8,   b = 1/5.

Therefore

µ_post = (aµ_prior + bx₁)/(a + b) = 44/13 ≈ 3.38
σ_post² = 1/(a + b) = 40/13 ≈ 3.08.
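We can confirm the arithmetic of Example 2 with exact fractions. This small check (ours, not part of the reading) applies formulas (2) to µ_prior = 4, σ_prior² = 8, σ² = 5, x₁ = 3.

```python
from fractions import Fraction

a = Fraction(1, 8)          # a = 1/sigma_prior^2
b = Fraction(1, 5)          # b = 1/sigma^2
mu_post = (a * 4 + b * 3) / (a + b)
var_post = 1 / (a + b)
print(mu_post, var_post)    # 44/13 and 40/13, matching the completed square
```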

Example 3. Suppose that we know the data x ~ N(θ, 1) and we have prior N(0, 1). We get one data value x = 6.5. Describe the changes to the pdf for θ in updating from the prior to the posterior.

answer: Here is a graph of the prior pdf with the data point marked by a red line.

[Graph: prior in blue, posterior in magenta, data in red]

The posterior mean will be a weighted average of the prior mean and the data. So the peak of the posterior pdf will be between the peak of the prior pdf and the red line. A little algebra with the formula for σ_post² shows

σ_post² = 1/(1/σ_prior² + 1/σ²) = σ_prior² σ²/(σ_prior² + σ²) < σ_prior².

That is, the posterior has smaller variance than the prior, i.e. data makes us more certain about where in its range θ lies.

4.1 More than one data point

Example 4. Suppose we have data x₁, x₂, x₃. Use the formulas (1) to update sequentially.

answer: Let's label the prior mean and variance as µ₀ and σ₀². The updated means and variances will be µᵢ and σᵢ². In sequence we have

µ₁/σ₁² = µ₀/σ₀² + x₁/σ²,   1/σ₁² = 1/σ₀² + 1/σ²

µ₂/σ₂² = µ₁/σ₁² + x₂/σ² = µ₀/σ₀² + (x₁ + x₂)/σ²,   1/σ₂² = 1/σ₁² + 1/σ² = 1/σ₀² + 2/σ²

µ₃/σ₃² = µ₂/σ₂² + x₃/σ² = µ₀/σ₀² + (x₁ + x₂ + x₃)/σ²,   1/σ₃² = 1/σ₂² + 1/σ² = 1/σ₀² + 3/σ²
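In code, sequential updating is just a loop that feeds each posterior back in as the next prior. Here is a short Python sketch (ours; the prior N(0, 4), likelihood variance 1, and data values are made up); the one-shot formulas (3) and (4) below give the same answer.

```python
def normal_update(mu, var, x, var_lik):
    # One step of formulas (1)-(2).
    a, b = 1.0 / var, 1.0 / var_lik
    return (a * mu + b * x) / (a + b), 1.0 / (a + b)

# Made-up numbers: prior N(0, 4), known likelihood variance 1, three data points.
mu, var = 0.0, 4.0
for x in [1.2, 0.8, 1.5]:
    mu, var = normal_update(mu, var, x, 1.0)   # posterior becomes the next prior
print(mu, var)   # approximately 1.0769 and 0.3077
```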

The example generalizes to n data values x₁, ..., xₙ.

Normal-normal update formulas for n data points:

µ_post/σ_post² = µ_prior/σ_prior² + nx̄/σ²,   1/σ_post² = 1/σ_prior² + n/σ²,   where x̄ = (x₁ + ... + xₙ)/n.   (3)

Again we give the easier-to-read form, showing that µ_post is a weighted average of µ_prior and the sample average x̄:

a = 1/σ_prior²,   b = n/σ²,   µ_post = (aµ_prior + bx̄)/(a + b),   σ_post² = 1/(a + b).   (4)

Interpretation: µ_post is a weighted average of µ_prior and x̄. If the number of data points is large then the weight b is large and x̄ will have a strong influence on the posterior. If σ_prior² is small then the weight a is large and µ_prior will have a strong influence on the posterior. To summarize:

1. Lots of data has a big influence on the posterior.

2. High certainty (low variance) in the prior has a big influence on the posterior.

The actual posterior is a balance of these two influences.
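As a final check (ours, not from the reading), the one-shot formulas (4) reproduce the sequential result from the sketch above, using the same made-up numbers.

```python
mu0, var0, var_lik = 0.0, 4.0, 1.0
data = [1.2, 0.8, 1.5]

n = len(data)
xbar = sum(data) / n
a, b = 1.0 / var0, n / var_lik        # weights from formulas (4)
mu_post = (a * mu0 + b * xbar) / (a + b)
var_post = 1.0 / (a + b)
print(mu_post, var_post)              # same 1.0769 and 0.3077 as the sequential loop
```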

MIT OpenCourseWare
https://ocw.mit.edu
18.05 Introduction to Probability and Statistics
Spring 2014
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.