Conjugate Models Patrick Lam
Outline

Conjugate Models
  What is Conjugacy?
  The Beta-Binomial Model
  The Normal Model
    Normal Model with Unknown Mean, Known Variance
    Normal Model with Known Mean, Unknown Variance
Conjugacy

Suppose we have a Bayesian model with a likelihood p(y|θ) and a prior p(θ). Multiplying the likelihood and the prior gives the posterior p(θ|y) up to a constant of proportionality. If the posterior is a distribution of the same family as the prior, then we have conjugacy; we say that the prior is conjugate to the likelihood. Conjugate models are great because we know the exact distribution of the posterior, so we can easily simulate from it or derive quantities of interest analytically. In practice, we rarely have conjugacy.
Brief List of Conjugate Models

Likelihood                           Prior           Posterior
Binomial                             Beta            Beta
Negative Binomial                    Beta            Beta
Poisson                              Gamma           Gamma
Geometric                            Beta            Beta
Exponential                          Gamma           Gamma
Normal (mean unknown)                Normal          Normal
Normal (variance unknown)            Inverse Gamma   Inverse Gamma
Normal (mean and variance unknown)   Normal/Gamma    Normal/Gamma
Multinomial                          Dirichlet       Dirichlet
A Binomial Example

Suppose we have a vector of data on voter turnout for a random sample of n voters in the 2004 US Presidential election. We can model the voter turnout with a binomial model.

Y ~ Binomial(n, π)

Quantity of interest: π (voter turnout)

Assumptions:
  Each voter's decision to vote follows a Bernoulli distribution.
  Each voter has the same probability of voting. (unrealistic)
  Each voter's decision to vote is independent. (unrealistic)
The Conjugate Beta Prior

We can use the beta distribution as a prior for π, since the beta distribution is conjugate to the binomial distribution.

p(π|y) ∝ p(y|π) p(π)
       = Binomial(n, π) × Beta(α, β)
       = (n choose y) π^y (1 − π)^(n−y) × [Γ(α + β)/(Γ(α)Γ(β))] π^(α−1) (1 − π)^(β−1)
       ∝ π^y (1 − π)^(n−y) π^(α−1) (1 − π)^(β−1)

p(π|y) ∝ π^(y+α−1) (1 − π)^(n−y+β−1)

The posterior distribution is simply a Beta(y + α, n − y + β) distribution. Effectively, our prior is just adding α successes and β failures to the dataset.
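As a quick check on this algebra, we can multiply a binomial likelihood by a beta prior on a grid and compare the result to the Beta(y + α, n − y + β) density; all numerical values below are hypothetical, chosen only for illustration.

```r
# Sketch: numerical check of Beta-Binomial conjugacy.
# Hypothetical values: y = 7 successes in n = 10 trials, Beta(2, 2) prior.
n <- 10; y <- 7
a <- 2; b <- 2

pi.grid <- seq(0.0005, 0.9995, by = 0.001)

# Unnormalized posterior: likelihood times prior, normalized numerically
unnorm <- dbinom(y, n, pi.grid) * dbeta(pi.grid, a, b)
post.grid <- unnorm / (sum(unnorm) * 0.001)

# Analytic posterior: Beta(y + a, n - y + b)
post.analytic <- dbeta(pi.grid, y + a, n - y + b)

max(abs(post.grid - post.analytic))  # small: the two densities agree
```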
The Uninformative (Flat) Uniform Prior

Suppose we have no strong prior beliefs about the parameter. We can choose a prior that gives equal weight to all possible values of the parameter, essentially an uninformative or flat prior:

p(π) = constant for all values of π.

For the binomial model, one example of a flat prior is the Beta(1, 1) prior:

p(π) = [Γ(2)/(Γ(1)Γ(1))] π^(1−1) (1 − π)^(1−1) = 1

which is the Uniform distribution over the [0, 1] interval.
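We can confirm in R that the Beta(1, 1) density coincides with the Uniform(0, 1) density; a trivial check, but it makes the claim concrete.

```r
# The Beta(1, 1) density equals the Uniform(0, 1) density on (0, 1):
# both are constant at 1
pi.grid <- seq(0.01, 0.99, by = 0.01)
all(abs(dbeta(pi.grid, 1, 1) - dunif(pi.grid)) < 1e-12)  # TRUE
```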
Since we know that a Binomial likelihood and a Beta(1, 1) prior produce a Beta(y + 1, n − y + 1) posterior, we can simulate the posterior in R. Suppose our turnout data had 500 voters, of which 285 voted.

> table(turnout)
turnout
  0   1
215 285

Setting our prior parameters at α = 1 and β = 1,

> a <- 1
> b <- 1

we get the posterior

> posterior.unif.prior <- rbeta(10000, shape1 = 285 + a, shape2 = 500 -
+     285 + b)
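Once we have posterior draws, quantities of interest follow directly. The sketch below uses a hypothetical Beta(286, 216) posterior (i.e., a flat prior with 285 of 500 voting) to compute a posterior mean and a 95% credible interval.

```r
# Sketch: summarizing draws from a hypothetical Beta(286, 216) posterior
set.seed(1)
draws <- rbeta(10000, 286, 216)

mean(draws)                        # posterior mean, near 286/502 = 0.5697
quantile(draws, c(0.025, 0.975))   # 95% credible interval for turnout
```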
Normal Model with Unknown Mean, Known Variance

Suppose we wish to estimate a model where the likelihood of the data is normal with an unknown mean µ and a known variance σ². Our parameter of interest is µ. We can use a conjugate Normal prior on µ, with prior mean µ0 and prior variance τ0².

p(µ|y, σ²) ∝ p(y|µ, σ²) p(µ)
Normal(µ1, τ1²) = Normal(µ, σ²) × Normal(µ0, τ0²)
Let θ represent our parameter of interest, in this case µ.

p(θ|y) ∝ [∏_{i=1}^n (1/√(2πσ²)) exp(−(y_i − θ)²/(2σ²))] × (1/√(2πτ0²)) exp(−(θ − µ0)²/(2τ0²))

∝ exp(−[Σ_{i=1}^n (y_i − θ)²/(2σ²) + (θ − µ0)²/(2τ0²)])

= exp(−(1/(2σ²τ0²)) [τ0² Σ_{i=1}^n (y_i − θ)² + σ²(θ − µ0)²])

= exp(−(1/(2σ²τ0²)) [τ0² Σ_{i=1}^n (y_i² − 2θy_i + θ²) + σ²(θ² − 2θµ0 + µ0²)])
We can multiply the 2θy_i term in the summation by n/n in order to get the equation in terms of the sufficient statistic ȳ.

p(θ|y) ∝ exp(−(1/(2σ²τ0²)) [τ0² Σ_{i=1}^n (y_i² − 2θ(n/n)y_i + θ²) + σ²(θ² − 2θµ0 + µ0²)])

= exp(−(1/(2σ²τ0²)) [τ0² Σ_{i=1}^n y_i² − 2τ0²θnȳ + τ0²nθ² + σ²θ² − 2σ²θµ0 + σ²µ0²])

We can then factor the terms into several parts. Since σ²µ0² and τ0² Σ_{i=1}^n y_i² do not contain θ, we can represent them as some constant k, which we will drop into the normalizing constant.

p(θ|y) ∝ exp(−(1/(2σ²τ0²)) [θ²(σ² + nτ0²) − 2θ(µ0σ² + τ0²nȳ)] + k)

= exp(−(1/2) [θ²(1/τ0² + n/σ²) − 2θ(µ0/τ0² + nȳ/σ²)] + k)
Let's multiply by (1/τ0² + n/σ²)/(1/τ0² + n/σ²) in order to simplify the θ term.

p(θ|y) ∝ exp(−(1/2)(1/τ0² + n/σ²) [θ² − 2θ (µ0/τ0² + nȳ/σ²)/(1/τ0² + n/σ²)] + k)

Completing the square (absorbing the extra constant into k),

= exp(−(1/2)(1/τ0² + n/σ²) [θ − (µ0/τ0² + nȳ/σ²)/(1/τ0² + n/σ²)]² + k)

Finally, we have something that looks like the density function of a Normal distribution!
p(θ|y) ∝ exp(−(1/2)(1/τ0² + n/σ²) [θ − (µ0/τ0² + nȳ/σ²)/(1/τ0² + n/σ²)]²)

Posterior Mean: µ1 = (µ0/τ0² + nȳ/σ²)/(1/τ0² + n/σ²)

Posterior Variance: τ1² = (1/τ0² + n/σ²)^(−1)

Posterior Precision: 1/τ1² = 1/τ0² + n/σ²

The posterior precision is just the sum of the prior precision and the data precision.
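A quick numerical sketch shows the precision identity in action; all the values for n, σ², ȳ, µ0, and τ0² below are hypothetical.

```r
# Sketch: posterior precision = prior precision + data precision
# (all numerical values here are hypothetical)
n <- 50; sigma.sq <- 4      # sample size and known variance
y.bar <- 10                 # sample mean
mu0 <- 8; tau.sq0 <- 9      # prior mean and variance

post.precision <- 1/tau.sq0 + n/sigma.sq
post.var <- 1/post.precision
post.mean <- (mu0/tau.sq0 + n*y.bar/sigma.sq)/post.precision

post.mean   # a precision-weighted average of mu0 and y.bar (about 9.98 here)
post.var    # smaller than both tau.sq0 and sigma.sq/n
```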
We can also look more closely at how the prior mean µ0 and the posterior mean µ1 relate to each other.

µ1 = (µ0/τ0² + nȳ/σ²)/(1/τ0² + n/σ²)
   = [(µ0σ² + τ0²nȳ)/(τ0²σ²)] / [(σ² + nτ0²)/(τ0²σ²)]
   = (µ0σ² + τ0²nȳ)/(σ² + nτ0²)
   = µ0σ²/(σ² + nτ0²) + τ0²nȳ/(σ² + nτ0²)

As n increases, the data mean dominates the prior mean. As τ0² decreases (less prior variance, greater prior precision), our prior mean becomes more important.
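The weight on the prior mean, σ²/(σ² + nτ0²), can be tabulated for increasing n; the σ² and τ0² values below are hypothetical.

```r
# Sketch: how the weight on the prior mean shrinks as n grows
# (sigma.sq and tau.sq0 are hypothetical values)
sigma.sq <- 4; tau.sq0 <- 9
prior.weight <- function(n) sigma.sq / (sigma.sq + n * tau.sq0)
round(sapply(c(1, 10, 100, 1000), prior.weight), 4)
# 0.3077 0.0426 0.0044 0.0004  -- the data take over quickly
```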
A Simple Example

Suppose we have some (fake) data on the heights (in inches) of a random sample of 100 individuals in the U.S. population.

> known.sigma.sq <- 16
> unknown.mean <- 68
> n <- 100
> heights <- rnorm(n, mean = unknown.mean, sd = sqrt(known.sigma.sq))

We believe that the heights are normally distributed with some unknown mean µ and a known variance σ² = 16. Suppose before we see the data, we have a prior belief about the distribution of µ. Let our prior mean µ0 = 72 and our prior variance τ0² = 36.

> mu0 <- 72
> tau.sq0 <- 36
Our posterior is a Normal distribution with mean

(µ0/τ0² + nȳ/σ²)/(1/τ0² + n/σ²)

and variance

(1/τ0² + n/σ²)^(−1).

> post.mean <- (mu0/tau.sq0 + (n * mean(heights)/known.sigma.sq))/(1/tau.sq0 +
+     n/known.sigma.sq)
> post.mean
[1] 68.03969
> post.var <- 1/(1/tau.sq0 + n/known.sigma.sq)
> post.var
[1] 0.159292
Normal Model with Known Mean, Unknown Variance

Now suppose we wish to estimate a model where the likelihood of the data is normal with a known mean µ and an unknown variance σ². Now our parameter of interest is σ². We can use a conjugate inverse gamma prior on σ², with shape parameter α0 and scale parameter β0.

p(σ²|y, µ) ∝ p(y|µ, σ²) p(σ²)
Invgamma(α1, β1) = Normal(µ, σ²) × Invgamma(α0, β0)
Let θ represent our parameter of interest, in this case σ².

p(θ|y, µ) ∝ [∏_{i=1}^n (1/√(2πθ)) exp(−(y_i − µ)²/(2θ))] × [β0^α0/Γ(α0)] θ^(−(α0+1)) exp(−β0/θ)

∝ θ^(−n/2) exp(−Σ_{i=1}^n (y_i − µ)²/(2θ)) × θ^(−(α0+1)) exp(−β0/θ)

= θ^(−(α0 + n/2 + 1)) exp(−[β0/θ + Σ_{i=1}^n (y_i − µ)²/(2θ)])

= θ^(−(α0 + n/2 + 1)) exp(−(1/θ)[β0 + Σ_{i=1}^n (y_i − µ)²/2])

This looks like the density of an inverse gamma distribution!
p(θ|y, µ) ∝ θ^(−(α0 + n/2 + 1)) exp(−(1/θ)[β0 + Σ_{i=1}^n (y_i − µ)²/2])

α1 = α0 + n/2

β1 = β0 + Σ_{i=1}^n (y_i − µ)²/2

Our posterior is an Invgamma(α0 + n/2, β0 + Σ_{i=1}^n (y_i − µ)²/2) distribution.
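The update itself is just two lines of arithmetic; the sketch below applies it to simulated data, with a hypothetical prior, known mean, and true variance.

```r
# Sketch: Normal likelihood (known mean) with inverse gamma prior on sigma^2
# Hypothetical values: mu = 0 known, Invgamma(3, 4) prior, true sigma^2 = 4
set.seed(1)
mu <- 0; n <- 20
y <- rnorm(n, mean = mu, sd = 2)

alpha0 <- 3; beta0 <- 4
alpha1 <- alpha0 + n/2
beta1  <- beta0 + sum((y - mu)^2)/2

# Posterior mean of sigma^2 under Invgamma(alpha1, beta1) is beta1/(alpha1 - 1)
beta1 / (alpha1 - 1)   # in the vicinity of the true sigma^2 = 4
```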
A Simple Example

Again suppose we have some (fake) data on the heights (in inches) of a random sample of 100 individuals in the U.S. population.

> known.mean <- 68
> unknown.sigma.sq <- 16
> n <- 100
> heights <- rnorm(n, mean = known.mean, sd = sqrt(unknown.sigma.sq))

We believe that the heights are normally distributed with a known mean µ = 68 and some unknown variance σ². Suppose before we see the data, we have a prior belief about the distribution of σ². Let our prior shape α0 = 5 and our prior scale β0 = 10.

> alpha0 <- 5
> beta0 <- 10
Our posterior is an inverse gamma distribution with shape α0 + n/2 and scale β0 + Σ_{i=1}^n (y_i − µ)²/2.

> alpha <- alpha0 + n/2
> beta <- beta0 + sum((heights - known.mean)^2)/2
> library(MCMCpack)
> posterior <- rinvgamma(10000, alpha, beta)
> post.mean <- mean(posterior)
> post.mean
[1] 12.8839
> post.var <- var(posterior)
> post.var
[1] 3.36047

Hmm... what if we increased our sample size?
> n <- 1000
> heights <- rnorm(n, mean = known.mean, sd = sqrt(unknown.sigma.sq))
> alpha <- alpha0 + n/2
> beta <- beta0 + sum((heights - known.mean)^2)/2
> posterior <- rinvgamma(10000, alpha, beta)
> post.mean <- mean(posterior)
> post.mean
[1] 15.98
> post.var <- var(posterior)
> post.var
[1] 0.505895
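The analytic moments of the inverse gamma posterior make the pattern explicit: with more data, the posterior mean approaches the true σ² and the posterior variance shrinks. The prior parameters, true variance, and plugged-in sum of squares below are all assumed values for illustration.

```r
# Sketch: analytic inverse gamma posterior moments as n grows
# (assumed values: Invgamma(5, 10) prior, true sigma^2 = 16; we plug in
#  the expected sum of squares E[sum((y_i - mu)^2)] = n * sigma^2)
inv.gamma.mean <- function(a, b) b / (a - 1)
inv.gamma.var  <- function(a, b) b^2 / ((a - 1)^2 * (a - 2))

alpha0 <- 5; beta0 <- 10; sigma.sq <- 16
for (n in c(100, 1000, 10000)) {
  a <- alpha0 + n/2
  b <- beta0 + n * sigma.sq/2
  cat("n =", n, " mean =", inv.gamma.mean(a, b),
      " var =", inv.gamma.var(a, b), "\n")
}
# mean approaches 16 and var shrinks toward 0 as n grows
```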