Part II: Computation for Bayesian Analyses
BIO 233, HSPH, Spring 2015
Conjugacy

- In both birth weight examples the posterior distribution is from the same family as the prior:

      Prior           Likelihood               Posterior
      Beta(a, b)      Y_i ~ Bernoulli(θ)       Beta(y_+ + a, n − y_+ + b)
      Normal(m, V)    Y_i ~ Normal(µ, σ²)      Normal(m*, V*)

- This is referred to as conjugacy; it arises because of the specific choice of prior/likelihood combination
- Having the posterior be of a known distribution is very convenient
  - we have the means to compute moments and quantiles for both the Beta and the Normal distribution
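As a quick illustration of the Beta-Bernoulli row of the table, the following R sketch performs the conjugate update for hypothetical values of a, b, n, and y_+ (none of these are from the birth weight data), and checks the closed-form posterior mean against a simulated one:

```r
## Conjugate Beta-Bernoulli update: a minimal sketch with hypothetical
## prior parameters and data
a <- 2; b <- 8             # prior: Beta(a, b)
n <- 50; y_plus <- 12      # data: y_+ 'successes' out of n trials

## posterior is Beta(y_+ + a, n - y_+ + b)
a_post <- y_plus + a       # 14
b_post <- n - y_plus + b   # 46

## closed-form posterior mean, and the same quantity via simulation
post_mean <- a_post / (a_post + b_post)
set.seed(1)
sim_mean <- mean(rbeta(1e5, a_post, b_post))

round(post_mean, 3)   # 0.233
round(sim_mean, 3)    # 0.233
```

Because the posterior is a known Beta distribution, the analytic and simulation-based summaries agree; the same duality is exploited throughout the Monte Carlo material below.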
- Conjugacy was originally pursued and advocated because of this
  - if you can fix it so that the posterior distribution is from a known family, then life becomes much easier!
  - particularly important in the past
- More generally, when the posterior is of a known form we say that the posterior is analytically tractable
  - summary measures are available analytically
- For many problems, however, the posterior is not a known distribution
Mean birth weight: µ and σ² unknown

- In Homework #1 you assumed that σ² was known
- More realistically, let's take σ² as unknown
  - multiparameter setting, θ = (µ, σ²)
- Likelihood remains the same, Y_i ~ Normal(µ, σ²), i = 1, ..., n:

      L(y | µ, σ²) = (1/√(2πσ²))^n exp{ −(1/(2σ²)) Σ_{i=1}^n (y_i − µ)² }

- How do we specify a bivariate prior distribution for (µ, σ²)?
- One option is the following noninformative prior:

      π(µ, σ²) ∝ 1/σ²

  - uniform for (µ, log σ) on (−∞, ∞) × (−∞, ∞)
  - a priori independence
- Using Bayes' Theorem, the joint posterior distribution is proportional to

      π(µ, σ² | y) ∝ L(y | µ, σ²) π(µ, σ²)
- Unfortunately, this doesn't correspond to any commonly known joint distribution
  - even if we performed the integration to get the normalizing constant, we'd be stuck
- How do we proceed? How do we summarize a distribution if it isn't available analytically?
  - visualize the joint posterior distribution for (µ, σ²)
  - compute summary statistics
- We use simulation-based or Monte Carlo methods
Monte Carlo methods

- Consider the Beta(85, 915) distribution
  - posterior for the low birth weight example
- To summarize this distribution, note that the mean and variance have closed-form expressions but the median does not
- Suppose we could generate random deviates from a Beta distribution
  - e.g., using rbeta() in R
- We could empirically estimate the median:

      > sampPosterior <- rbeta(1000, 85, 915)
      > median(sampPosterior)
      [1] 0.08486423
- More generally, one can estimate any summary measure by exploiting the duality between a distribution and samples generated from that distribution

      ## Mean
      > round(mean(sampPosterior), digits=3)
      [1] 0.085

      ## Standard deviation
      > round(sd(sampPosterior), digits=3)
      [1] 0.009

      ## 95% credible interval
      > round(quantile(sampPosterior, c(0.025, 0.975)), digits=3)
       2.5% 97.5%
      0.069 0.104

      ## P(theta > 0.10)
      > round(mean(sampPosterior > 0.10), digits=3)
      [1] 0.061
- One can also visualize the distribution:

      > hist(sampPosterior, breaks=seq(from=0.05, to=0.12, by=0.005),
      +      main="", xlab=expression(theta * " = P(LBW)"), col="blue",
      +      freq=FALSE)

  [Figure: histogram of the posterior samples for θ = P(LBW)]
- "Monte Carlo" refers to the use of samples or simulation as a means to learn about a distribution
  - the term was coined by physicists working on the Manhattan Project in the 1940s
- These ideas can be applied to any distribution
  - the trick is being able to generate samples from π(θ | y)
- Consider again the posterior for (µ, σ²):

      π(µ, σ² | y) ∝ (1/σ^{n+2}) exp{ −(1/(2σ²)) [ (n−1)s² + n(ȳ − µ)² ] }

- At the outset, it isn't clear how to generate samples from this distribution
- We can, however, decompose the joint posterior distribution as

      π(µ, σ² | y) = π(µ | y, σ²) π(σ² | y)

  where the conditional posterior of µ | y, σ² is a Normal(ȳ, σ²/n) and the marginal posterior of σ² | y is given by

      π(σ² | y) ∝ (σ²)^{−(n+1)/2} exp{ −(1/(2σ²)) (n−1)s² }

  which is the kernel of an inverse-gamma distribution:

      Inv-gamma( (n−1)/2, (n−1)s²/2 )
- The decomposition suggests generating samples using the algorithm:

      (1) generate a random σ^{2(r)} ~ Inv-gamma( (n−1)/2, (n−1)s²/2 )
      (2) generate a random µ^(r) | σ^{2(r)} ~ Normal( ȳ, σ^{2(r)}/n )

  - each cycle generates an independent random deviate from the joint posterior, π(µ, σ² | y):

      (µ^(1), σ^{2(1)})
      (µ^(2), σ^{2(2)})
      ...
- One can also easily augment this algorithm to generate samples from the posterior predictive distribution

      f(ỹ | y) = ∫ f(ỹ | θ) π(θ | y) dθ

- This representation suggests embedding a step where we generate a random

      ỹ^(r) ~ Normal( µ^(r), σ^{2(r)} )

  at the end of the r-th cycle
- Marginally, {ỹ^(1), ..., ỹ^(R)} are a random sample from the target posterior predictive distribution
Run the algorithm...

      > ##
      > load("northcarolina_data.dat")
      > n <- 100
      > sampY <- sample(infants$weight, n)
      > ##
      > library(MCMCpack)
      > ?rinvgamma
      > ##
      > R <- 1000
      > sampPosterior <- matrix(NA, nrow=R, ncol=3)
      > for(r in 1:R)
      + {
      +   ##
      +   sigmasq <- rinvgamma(1, (n-1)/2, ((n-1)*var(sampY))/2)
      +   mu      <- rnorm(1, mean(sampY), sqrt(sigmasq/n))
      +   ytilde  <- rnorm(1, mu, sqrt(sigmasq))
      +   ##
      +   sampPosterior[r,] <- c(mu, sigmasq, ytilde)
      + }
Visualize the results...

      > ##
      > library(ks)
      > fhat <- kde(x=sampPosterior[,1:2], H=Hpi(x=sampPosterior[,1:2]))
      > ##
      > plot(fhat, xlab=expression(mu), ylab=expression(sigma^2),
      +      xlim=c(3,3.5), drawpoints=TRUE, ptcol="red", pch="",
      +      lwd=3, axes=FALSE)
      > axis(1, seq(from=3, to=3.5, by=0.1))
      > axis(2, seq(from=0.5, to=1.2, by=0.1))
      > ##
      > fhatY <- kde(x=sampPosterior[,3], h=hpi(sampPosterior[,3]))
      > plot(fhatY, xlim=c(0, 6), xlab=expression("prediction, " * tilde(y)),
      +      ylab="", axes=FALSE, col="red", lwd=3)
      > axis(1, seq(from=0, to=6, by=1))
      > axis(2, seq(from=0, to=0.5, by=0.1))
Joint posterior distribution, π(µ, σ² | y)

[Figure: contour plot of the kernel density estimate of the joint posterior samples for (µ, σ²)]
Posterior predictive distribution, f(ỹ | y)

- relative weight assigned to possible birth weight values, averaging over the uncertainty in our knowledge about µ and σ²

[Figure: density estimate of the posterior predictive distribution for the prediction ỹ]
Compute numerical summary measures...

      > ##
      > apply(sampPosterior, 2, mean)
      [1] 3.2400460 0.5169768 3.1995524
      > t(apply(sampPosterior, 2, quantile, probs=c(0.5, 0.025, 0.975)))
                 50%      2.5%     97.5%
      [1,] 3.2402101 3.1011615 3.3802954
      [2,] 0.5086291 0.3872956 0.6837703
      [3,] 3.1964943 1.8347178 4.5187996
      > cor(sampPosterior[,1], sampPosterior[,2])
      [1] 0.001173572
Sequential decomposition

- This strategy can be applied to a generic θ = (θ_1, ..., θ_p)
  - decompose the p-dimensional posterior distribution as

      π(θ | y) = π(θ_p | y, θ_1, ..., θ_{p−1}) × π(θ_{p−1} | y, θ_1, ..., θ_{p−2}) × ... × π(θ_1 | y)

  - cycle through and sample from the p distributions sequentially
  - each cycle generates an independent random deviate from the multivariate posterior, π(θ | y)
- Intuitive in that one breaks up the problem into a series of manageable pieces
  - can then empirically summarize the joint distribution
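The decomposition can be sketched in R for p = 2 with a simple illustrative target (not the birth weight posterior): a bivariate Normal with correlation ρ, sampled by drawing θ_1 from its marginal and θ_2 from its conditional.

```r
## Sequential decomposition sketch for p = 2: draw theta1 from pi(theta1)
## and theta2 from pi(theta2 | theta1); here the joint target is a
## standard bivariate Normal with correlation rho (illustrative example)
set.seed(2)
rho <- 0.6
R   <- 1e5
theta1 <- rnorm(R, 0, 1)                           # marginal: N(0, 1)
theta2 <- rnorm(R, rho * theta1, sqrt(1 - rho^2))  # conditional: N(rho*theta1, 1 - rho^2)

## the pairs (theta1, theta2) are i.i.d. draws from the joint target
round(cor(theta1, theta2), 2)   # approximately rho
```

Each pass through the two draws produces one independent deviate from the joint distribution, which is exactly the logic of the (µ, σ²) algorithm above.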
The Gibbs sampling algorithm

- For many problems, sequential decomposition doesn't yield a set of p distributions where each is of a known form
- Consider the set of p full conditionals:

      π(θ_1 | y, θ_{−1})
      π(θ_2 | y, θ_{−2})
      ...
      π(θ_p | y, θ_{−p})

  - it is often the case that these are each of a convenient form
- The Gibbs sampling algorithm generates samples by sequentially sampling from each of these p full conditionals
- For example, for p = 2 the algorithm proceeds as follows:

      generate θ_1^(1) from π(θ_1 | y, θ_2^(0))
      generate θ_2^(1) from π(θ_2 | y, θ_1^(1))      θ^(1) = (θ_1^(1), θ_2^(1))

      generate θ_1^(2) from π(θ_1 | y, θ_2^(1))
      generate θ_2^(2) from π(θ_2 | y, θ_1^(2))      θ^(2) = (θ_1^(2), θ_2^(2))
      ...

  - θ_2^(0) is the starting value
- Helpful to visualize the way the algorithm generates samples:

[Figure: path of the Gibbs updates through the (θ_1, θ_2) parameter space]
Mean birth weight: Normal likelihood

- Suppose Y_i ~ i.i.d. Normal(µ, σ²) for i = 1, ..., n
  - task is to learn about the unknown θ = (µ, σ²)
- Further, suppose we adopt a flat prior for µ and an independent Inv-Gamma(a, b) prior for σ²
  - the prior density is

      π(µ, σ²) = (b^a / Γ(a)) (σ²)^{−a−1} exp{ −b/σ² }

  - setting a = b = 0 corresponds to the noninformative prior we've been using
- Applying Bayes' Theorem, the joint posterior distribution is

      π(µ, σ² | y) ∝ L(µ, σ² | y) π(µ, σ²)
                   = (1/√(2πσ²))^n exp{ −(1/(2σ²)) Σ_{i=1}^n (y_i − µ)² }
                       × (b^a / Γ(a)) (σ²)^{−a−1} exp{ −b/σ² }
                   ∝ (σ²)^{−(n/2 + a + 1)} exp{ −(1/σ²) [ (1/2) Σ_{i=1}^n (y_i − µ)² + b ] }

- It isn't immediately clear how to directly generate samples from this joint distribution
- Rather than directly sampling from π(µ, σ² | y), we can use the Gibbs sampling algorithm and iterate between

      (i)  µ | σ², y ~ Normal( ȳ, σ²/n )
      (ii) σ² | µ, y ~ Inv-gamma( n/2 + a, D/2 + b )

  where ȳ is the sample mean and D = Σ_{i=1}^n (y_i − µ)²
- The result is a sequence of samples from the joint posterior:

      (µ^(1), σ^{2(1)})
      (µ^(2), σ^{2(2)})
      ...
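A minimal R sketch of this two-step Gibbs sampler is below. It uses simulated data in place of the birth weight sample, weak hypothetical values for a and b, and draws the Inv-gamma variate as the reciprocal of a Gamma variate (equivalent to MCMCpack's rinvgamma under the standard shape/scale parameterization), so the code is self-contained:

```r
## Gibbs sampler for (mu, sigma^2): flat prior on mu, Inv-Gamma(a, b)
## prior on sigma^2; data and hyperparameters are hypothetical
set.seed(3)
y <- rnorm(100, mean = 3.2, sd = 0.7)   # stand-in for the birth weight data
n <- length(y)
a <- 0.01; b <- 0.01                    # weak Inv-Gamma prior

R <- 2000
draws <- matrix(NA, nrow = R, ncol = 2,
                dimnames = list(NULL, c("mu", "sigmasq")))
sigmasq <- var(y)                       # starting value
for (r in 1:R) {
  ## (i) mu | sigma^2, y ~ Normal(ybar, sigma^2 / n)
  mu <- rnorm(1, mean(y), sqrt(sigmasq / n))
  ## (ii) sigma^2 | mu, y ~ Inv-gamma(n/2 + a, D/2 + b), D = sum((y - mu)^2)
  ##      drawn as 1 / Gamma(shape, rate = scale)
  sigmasq <- 1 / rgamma(1, n/2 + a, rate = sum((y - mu)^2)/2 + b)
  draws[r, ] <- c(mu, sigmasq)
}
round(colMeans(draws), 2)   # posterior means near mean(y) and var(y)
```

With a conjugate full conditional at each step, every cycle is a direct draw; the dependence between successive cycles is the autocorrelation issue discussed later.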
Markov Chain Monte Carlo

- The Gibbs algorithm is helpful in that a seemingly difficult problem is broken down into a series of manageable pieces
- But there are two problems!
  (1) The set of full conditionals does not jointly fully specify the target posterior distribution
      - in contrast to the components obtained via a sequential decomposition:

          π(θ_1, θ_2 | y) = π(θ_1 | y) π(θ_2 | y, θ_1)

  (2) The resulting samples are dependent
      - dependency between θ^(1) and θ^(2) is generated by the value of θ_2^(1)
      - we say that the samples exhibit autocorrelation
- The result is that we don't have an independent sample from the joint posterior distribution
  - empirical summary measures won't pertain to π(θ | y)
- However, by construction, the sequence of samples constitutes a Markov chain
- Further, it's possible to show that the stationary distribution of the Markov chain is the sought-after posterior distribution
  - we say that once the Markov chain converges, one is generating samples from the posterior distribution
- In practice, we need to
  - run the chain long enough to ensure that it has converged to its stationary distribution
  - make adjustments to remove autocorrelation in the samples
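The effect of autocorrelation, and one common adjustment (thinning), can be illustrated with a stand-in for a slowly mixing chain; here an AR(1) sequence plays that role (an illustrative example, not one of the course samplers):

```r
## Autocorrelation and thinning sketch: an AR(1) sequence with
## coefficient 0.8 stands in for a slowly mixing MCMC chain
set.seed(4)
R <- 10000
chain <- numeric(R)
for (r in 2:R)
  chain[r] <- 0.8 * chain[r - 1] + rnorm(1, 0, sqrt(1 - 0.8^2))

## lag-1 autocorrelation before and after thinning (keep every 10th draw)
lag1 <- function(x) cor(x[-1], x[-length(x)])
round(lag1(chain), 2)                      # high (~0.8)
round(lag1(chain[seq(1, R, by = 10)]), 2)  # much reduced
```

Thinning trades away samples for (near-)independence; the retained draws behave much more like an i.i.d. sample when computing empirical summaries.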
- The Gibbs algorithm is one member of a broader class of algorithms that generate samples from (arbitrary) posterior distributions
- The general technique is referred to as Markov Chain Monte Carlo (MCMC)
  - each algorithm requires consideration of convergence and autocorrelation
- Other algorithms include, among many others:
  - importance sampling
  - adaptive rejection sampling
  - the Metropolis algorithm
  - the Metropolis-Hastings algorithm
- We are going to focus on the Metropolis-Hastings algorithm
  - versatile and, as we'll see, generally a good algorithm in the context of GLMs
The Metropolis-Hastings Algorithm

Mean birth weight: t-distribution likelihood

- So far we've taken the continuous response, Y, to be Normally distributed
- An alternative is to use the t-distribution
  - heavier tails provide an alternative that may be robust to unusual/outlier observations
- Specifically, suppose Y ~ t_ν(µ, σ²)
  - non-central, scaled t-distribution with ν degrees of freedom
  - for y ∈ (−∞, ∞), the density is

      f(y | µ, σ², ν) = [Γ((ν+1)/2) / Γ(ν/2)] (1/(πνσ²))^{1/2} [ 1 + (y − µ)²/(νσ²) ]^{−(ν+1)/2}
- Given an i.i.d. sample of size n, the likelihood is

      L(µ, σ² | y) = Π_{i=1}^n [Γ((ν+1)/2) / Γ(ν/2)] (1/(πνσ²))^{1/2} [ 1 + (y_i − µ)²/(νσ²) ]^{−(ν+1)/2}

  - here we take ν to be fixed and known
- Again adopt a flat prior for µ and an independent Inv-Gamma(a, b) prior for σ²:

      π(µ, σ²) = (b^a / Γ(a)) (σ²)^{−a−1} exp{ −b/σ² }

- Using Bayes' Theorem, the posterior distribution is

      π(µ, σ² | y) ∝ L(µ, σ² | y) π(µ, σ²)
- As with the posterior based on the Normal likelihood, this joint posterior doesn't belong to any known family
- We could again use the Gibbs sampling algorithm and iterate between the two full conditionals:

      π(µ | σ², y)
      π(σ² | µ, y)

- Unfortunately, neither of these is of a convenient form either!
- Q: How can we generate samples from a distribution that does not belong to any known family?
Metropolis-Hastings

- Consider the general task of sampling from π(θ | y)
  - goal is to generate a sequence: θ^(1), θ^(2), ...
  - use these samples to evaluate summaries of the posterior
- Suppose the distribution corresponding to π(θ | y) is unknown
  - the functional form of the density doesn't correspond to a known distribution
  - we only know the kernel of the density:

      π(θ | y) ∝ L(θ | y) π(θ)

  - the integral in the denominator doesn't have a closed-form expression
  - software isn't readily available
- The Metropolis-Hastings algorithm proceeds as follows:
  - let θ^(r) be the current state in the sequence
  (i)  generate a proposal for the next value of θ, denoted by θ*
       - denote the density of the proposal distribution by q(θ* | θ^(r))
  (ii) either reject the proposal, θ^(r+1) = θ^(r), or accept the proposal, θ^(r+1) = θ*
       - the decision to reject/accept the proposal is based on the flip of a coin with probability

           a_r = min( 1, [ π(θ* | y) q(θ^(r) | θ*) ] / [ π(θ^(r) | y) q(θ* | θ^(r)) ] )

       - a_r is referred to as the acceptance ratio
       - accept automatically if the ratio inside the min is ≥ 1
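The two steps can be sketched in R for a target we happen to know, the Beta(85, 915) posterior from the low birth weight example, pretending we only know its kernel. The proposal here is an independence proposal, a Beta(2, 20) chosen purely for illustration, so the q terms do not cancel; everything is computed on the log scale to avoid underflow:

```r
## Independence Metropolis-Hastings sketch; target kernel is
## theta^84 (1-theta)^914 (Beta(85, 915) up to a constant),
## proposal is a hypothetical Beta(2, 20)
set.seed(5)
log_kernel <- function(theta) 84 * log(theta) + 914 * log(1 - theta)
log_q      <- function(theta) dbeta(theta, 2, 20, log = TRUE)

R <- 20000
theta <- numeric(R)
theta[1] <- 0.1                        # starting value
for (r in 2:R) {
  theta_star <- rbeta(1, 2, 20)        # (i) generate a proposal
  ## (ii) log acceptance ratio, including the q correction
  log_ar <- log_kernel(theta_star) - log_kernel(theta[r - 1]) +
            log_q(theta[r - 1])      - log_q(theta_star)
  theta[r] <- if (log(runif(1)) < log_ar) theta_star else theta[r - 1]
}
round(mean(theta[-(1:1000)]), 3)       # compare with 85/1000 = 0.085
```

Because the kernel is all the algorithm needs, the unknown normalizing constant cancels in a_r; this is exactly what makes Metropolis-Hastings usable when the posterior is not a known distribution.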
- The algorithm boils down to being able to perform three tasks:
  (1) choose a proposal distribution
  (2) sample from the proposal distribution
  (3) compute the acceptance ratio
- Ideally the proposal distribution is as close to the target posterior distribution as possible
  - intuitively, why does this make sense?
  - mathematically, why does this make sense?
- But we also have to choose a proposal distribution that we can actually sample from!
- Q: Interpretation of the acceptance ratio?
Mean birth weight: t-distribution likelihood

- For the birth weight example, we couldn't sample directly from the two full conditionals:

      π(µ | σ², y) ∝ Π_{i=1}^n [ 1 + (y_i − µ)²/(νσ²) ]^{−(ν+1)/2}

      π(σ² | µ, y) ∝ (σ²)^{−(n/2 + a + 1)} exp{ −b/σ² } Π_{i=1}^n [ 1 + (y_i − µ)²/(νσ²) ]^{−(ν+1)/2}

- Instead, we sample from each using the Metropolis-Hastings algorithm
  - say: implement a Gibbs sampling algorithm with two Metropolis-Hastings steps or updates
- Need to choose proposal distributions for both updates
- Recall, for the Normal likelihood the two full conditionals were

      µ | σ², y ~ Normal( ȳ, σ²/n )
      σ² | µ, y ~ Inv-gamma( a*, b* )

  where a* = n/2 + a and b* = D/2 + b
- Their densities are, respectively,

      q_1(µ | σ², y) = (1/√(2πσ²/n)) exp{ −n(µ − ȳ)²/(2σ²) }

      q_2(σ² | µ, y) = (b*^{a*} / Γ(a*)) (σ²)^{−a*−1} exp{ −b*/σ² }

- Q: Why might these be good proposal distributions?
- Suppose the current state in the Markov chain is (µ^(r), σ^{2(r)})
- To sample µ^(r+1):
  - generate a proposal, µ*, from a Normal(ȳ, σ^{2(r)}/n)
  - calculate

      a_r = min( 1, [ π(µ* | σ^{2(r)}, y) q_1(µ^(r) | σ^{2(r)}, y) ] / [ π(µ^(r) | σ^{2(r)}, y) q_1(µ* | σ^{2(r)}, y) ] )

  - generate a random U ~ Uniform(0, 1)
  - if U < a_r, accept the move: set µ^(r+1) equal to µ*
  - if U > a_r, reject the move: set µ^(r+1) equal to µ^(r)
- Whatever the decision, we have a value for µ^(r+1)
- To sample σ^{2(r+1)}:
  - generate a proposal, σ^{2*}, from an Inv-gamma(a*, b*)
  - calculate

      a_r = min( 1, [ π(σ^{2*} | µ^(r+1), y) q_2(σ^{2(r)} | µ^(r+1), y) ] / [ π(σ^{2(r)} | µ^(r+1), y) q_2(σ^{2*} | µ^(r+1), y) ] )

  - generate a random U ~ Uniform(0, 1)
  - if U < a_r, accept the move: set σ^{2(r+1)} equal to σ^{2*}
  - if U > a_r, reject the move: set σ^{2(r+1)} equal to σ^{2(r)}
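A compact R sketch of this MH-within-Gibbs scheme for the t-likelihood model is below. The data, ν, a, and b are hypothetical stand-ins, and the Inv-gamma proposal is drawn as the reciprocal of a Gamma variate so no extra package is needed:

```r
## MH-within-Gibbs sketch for the t-likelihood model, using the
## Normal-model full conditionals as proposals; data and (nu, a, b)
## are hypothetical
set.seed(6)
y <- rnorm(100, mean = 3.2, sd = 0.7)
n <- length(y); nu <- 4; a <- 0.01; b <- 0.01

## log kernels of the two full conditionals under the t likelihood
log_post_mu <- function(mu, s2)
  -((nu + 1)/2) * sum(log(1 + (y - mu)^2 / (nu * s2)))
log_post_s2 <- function(s2, mu)
  -(n/2 + a + 1) * log(s2) - b/s2 -
    ((nu + 1)/2) * sum(log(1 + (y - mu)^2 / (nu * s2)))

## log proposal densities (Normal-model full conditionals)
log_q_mu <- function(mu, s2) dnorm(mu, mean(y), sqrt(s2/n), log = TRUE)
log_q_s2 <- function(s2, mu) {
  astar <- n/2 + a; bstar <- sum((y - mu)^2)/2 + b
  astar * log(bstar) - lgamma(astar) - (astar + 1) * log(s2) - bstar/s2
}

R <- 5000
draws <- matrix(NA, nrow = R, ncol = 2)
mu <- mean(y); s2 <- var(y)               # starting values
for (r in 1:R) {
  ## M-H update for mu
  mu_star <- rnorm(1, mean(y), sqrt(s2/n))
  log_ar  <- log_post_mu(mu_star, s2) - log_post_mu(mu, s2) +
             log_q_mu(mu, s2)         - log_q_mu(mu_star, s2)
  if (log(runif(1)) < log_ar) mu <- mu_star
  ## M-H update for sigma^2
  astar <- n/2 + a; bstar <- sum((y - mu)^2)/2 + b
  s2_star <- 1 / rgamma(1, astar, rate = bstar)   # Inv-gamma proposal
  log_ar  <- log_post_s2(s2_star, mu) - log_post_s2(s2, mu) +
             log_q_s2(s2, mu)         - log_q_s2(s2_star, mu)
  if (log(runif(1)) < log_ar) s2 <- s2_star
  draws[r, ] <- c(mu, s2)
}
round(colMeans(draws[-(1:500), ]), 2)
```

Each rejected proposal repeats the current value, so the output is a dependent sample whose stationary distribution is the t-model posterior.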
The Metropolis sampling algorithm

- The Metropolis algorithm is a special case of the Metropolis-Hastings algorithm where the proposal distribution is symmetric
- As such, q(θ^(r) | θ*) = q(θ* | θ^(r)) and the acceptance ratio reduces to

      a_r = min( 1, π(θ* | y) / π(θ^(r) | y) )

- However, if the target distribution is not symmetric then we might expect symmetric proposal distributions to not perform as well
  - fewer proposals will be accepted
- Trade-off in terms of computational ease and efficiency of the sampling algorithm
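A random-walk Metropolis sketch makes the simplification concrete: with a Normal proposal centered at the current state, the q terms cancel and only the ratio of target kernels remains. The target is again the Beta(85, 915) kernel, and the proposal standard deviation of 0.01 is a hypothetical tuning choice:

```r
## Random-walk Metropolis sketch: symmetric Normal(theta^(r), 0.01^2)
## proposal, so a_r = min(1, pi(theta* | y) / pi(theta^(r) | y))
set.seed(7)
log_kernel <- function(theta)
  if (theta <= 0 || theta >= 1) -Inf else 84*log(theta) + 914*log(1 - theta)

R <- 20000
theta <- numeric(R)
theta[1] <- 0.1
for (r in 2:R) {
  theta_star <- theta[r - 1] + rnorm(1, 0, 0.01)   # symmetric proposal
  log_ar     <- log_kernel(theta_star) - log_kernel(theta[r - 1])
  theta[r]   <- if (log(runif(1)) < log_ar) theta_star else theta[r - 1]
}
round(mean(theta[-(1:1000)]), 3)   # compare with 85/1000 = 0.085
```

The proposal scale governs the trade-off noted above: too small and the chain creeps, too large and most proposals are rejected.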
Practical issues for MCMC

- Following this algorithm yields a sequence of samples that form a Markov chain
  - one whose stationary distribution is the desired posterior distribution
- Practical issues include:
  (1) specification of starting values, (µ^(0), σ^{2(0)})
  (2) monitoring convergence of the chain
  (3) deciding how many samples to generate
  (4) accounting for correlation in the samples
- In practice, one often runs M chains simultaneously
  - each with differing starting values
  - pool samples across the chains when summarizing the posterior distribution
- Monitoring convergence is often done via visual inspection of the chains
  - referred to as trace plots
  - goal is to have good coverage of the parameter space
  - examine mixing of the chains if M > 1
Posterior for µ based on a Normal likelihood

- Gibbs sampling algorithm

[Figure: trace plot of µ over 1,000 scans]
Posterior for σ² based on a Normal likelihood

- Gibbs sampling algorithm

[Figure: trace plot of σ² over 1,000 scans]
Posterior for ψ = νσ²/(ν−2) based on a t-distribution likelihood

- Gibbs sampling algorithm with M-H updates

[Figure: trace plot of ψ over 1,000 scans]
- If M > 1, one can calculate the potential scale reduction (PSR) factor
  - suppose each chain is run for R iterations
  - for a given parameter, θ, the m-th chain is denoted θ_m^(1), θ_m^(2), ..., θ_m^(R)
  - let θ̄_m and s²_m denote the sample mean and variance of the m-th chain
  - calculate the PSR for θ as

      PSR = sqrt{ [ B/R + W(R−1)/R ] / W }

  where W is the mean of the within-chain variances of θ:

      W = (1/M) Σ_{m=1}^M s²_m
  and B/R is the between-chain variance of the chain means:

      B/R = (1/(M−1)) Σ_{m=1}^M (θ̄_m − θ̄)²

  - typically set R such that the PSR is less than 1.05
    - should be ensured for each parameter
- The number of samples should ideally be based on the Monte Carlo error
  - see Homework #1
  - may not be clear-cut if the algorithm is computationally expensive
- Autocorrelation arises because of the dependent nature of the sampling as one cycles through the full conditionals
  - will also be a problem if the proposal distribution was poorly chosen
  - typically handled by thinning
  - several graphical and numerical diagnostics are also available
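The PSR formula above is straightforward to compute from M chains stored as the columns of a matrix. In this sketch the "chains" are independent draws from the same Normal (hypothetical values standing in for well-mixed chains), so the PSR should be very close to 1:

```r
## Potential scale reduction sketch: chains are the columns of a matrix;
## here, M = 4 'chains' of independent draws from the same Normal
set.seed(8)
M <- 4; R <- 1000
chains <- matrix(rnorm(M * R, mean = 3.2, sd = 0.07), nrow = R, ncol = M)

psr <- function(chains) {
  R <- nrow(chains); M <- ncol(chains)
  W        <- mean(apply(chains, 2, var))  # mean within-chain variance
  B_over_R <- var(colMeans(chains))        # between-chain variance of the means
  sqrt((B_over_R + W * (R - 1) / R) / W)
}
round(psr(chains), 3)   # near 1 for well-mixed chains
```

Chains started far apart that have not yet converged inflate the between-chain term relative to W, pushing the PSR above the 1.05 threshold.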