STAT 45: Introduction to Bayesian Analysis
Marina Vannucci, Rice University, USA
Fall 2018
Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 1) Fall 2018 1 / 37
Lectures 9-11: Multi-parameter models
The Normal model
Parameterizations of the Normal Distribution

Mean and variance:
f(x | µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)),  x ∈ ℝ, σ² > 0.

Mean and precision:
f(x | µ, τ) = √(τ/(2π)) exp(−τ(x − µ)²/2),  x ∈ ℝ, τ = 1/σ² > 0.

The precision parameterization has advantages in numerical computations when σ² ≈ 0, and it simplifies many formulas.
Summary: pdf/pmf, domain, mean, variance

- Bern(p): P(x) = pˣ(1 − p)^(1−x), x ∈ {0, 1}; mean p; variance p(1 − p).
- Bin(N, p): P(x) = C(N, x) pˣ(1 − p)^(N−x), x ∈ {0, …, N}; mean Np; variance Np(1 − p).
- Poi(λ): P(x) = e^(−λ) λˣ/x!, x ∈ ℕ; mean λ; variance λ.
- NB(r, p): P(x) = C(r + x − 1, x) pʳ(1 − p)ˣ, x ∈ ℕ; mean r(1 − p)/p; variance r(1 − p)/p².
- Multinomial(N, p₁, …, p_K): P(x₁, …, x_K) = (N!/∏ₖ xₖ!) ∏ₖ pₖ^(xₖ), xₖ ∈ {0, …, N}; mean Npₖ; variance Npₖ(1 − pₖ).
- U(a, b): f(x) = 1/(b − a), x ∈ [a, b]; mean (a + b)/2; variance (b − a)²/12.
- Be(a, b): f(x) = (Γ(a + b)/(Γ(a)Γ(b))) x^(a−1)(1 − x)^(b−1), x ∈ [0, 1]; mean a/(a + b); variance ab/((a + b)²(a + b + 1)).
- Ga(a, b): f(x) = (bᵃ/Γ(a)) x^(a−1) e^(−bx), x ∈ ℝ⁺; mean a/b; variance a/b².
- N(µ, σ²): f(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²)), x ∈ ℝ; mean µ; variance σ².
- MN(µ, Σ): f(x) = (2π)^(−p/2) |Σ|^(−1/2) e^(−(x−µ)ᵀΣ⁻¹(x−µ)/2), x ∈ ℝᵖ; mean µ; covariance Σ.
Summary: classical estimators (MOM, MLE, UMVUE)

- Bern(p): MOM X̄; MLE X̄; UMVUE X̄.
- Bin(N, p), N known: MOM X̄/N; MLE X̄/N; UMVUE X̄/N.
- Poi(λ): MOM X̄; MLE X̄; UMVUE X̄.
- NB(r, p), r known: MOM p̂ = r/(r + X̄); MLE p̂ = r/(r + X̄).
- U(a, b): MOM â = X̄ − √3 σ̂ₙ, b̂ = X̄ + √3 σ̂ₙ with σ̂ₙ² = ((n−1)/n)S²; MLE â = X₍₁₎, b̂ = X₍ₙ₎; with a = 0 known, UMVUE b̂ = ((n+1)/n) X₍ₙ₎.
- Ga(a, b), a known: MOM b̂ = a/X̄; MLE b̂ = a/X̄.
- N(µ, σ²): for µ, MOM = MLE = UMVUE = X̄; for σ², MOM = MLE = ((n−1)/n)S², UMVUE = S².
Related Distributions

Normal distribution X ~ N(µ, σ²):
- Truncated normal distribution: f(x | µ, σ², a, b) = f(x | µ, σ²) / [Φ((b − µ)/σ) − Φ((a − µ)/σ)], a ≤ x ≤ b;
- Standardized t-distribution: (X̄ − µ)/(s/√n) ~ t₍ₙ₋₁₎(0, 1), with X̄ = (1/n)∑ᵢ₌₁ⁿ Xᵢ and s² = (1/(n−1))∑ᵢ₌₁ⁿ (Xᵢ − X̄)².

Standard normal distribution X ~ N(0, 1):
- Log-normal distribution: e^(µ+σX) ~ LN(µ, σ²);
- Cauchy distribution: X₁/X₂ ~ Cauchy(0, 1) for independent X₁, X₂ ~ N(0, 1).
Bell-shaped Distributions

Laplace distribution (double exponential distribution):
f(x | µ, b) = (1/(2b)) e^(−|x − µ|/b),  x ∈ ℝ, b > 0.

Cauchy distribution:
f(x | µ, γ) = 1 / (πγ [1 + ((x − µ)/γ)²]),  x ∈ ℝ, γ > 0.

t-distribution:
f(x | ν, µ, σ) = (Γ((ν+1)/2) / (√(νπ) σ Γ(ν/2))) [1 + (1/ν)((x − µ)/σ)²]^(−(ν+1)/2),  x ∈ ℝ, ν > 0, σ > 0.

Logistic distribution:
f(x | µ, s) = e^(−(x−µ)/s) / (s (1 + e^(−(x−µ)/s))²),  x ∈ ℝ, s > 0.
Laplace, Cauchy, Standardized t and Logistic

[Figure: density curves of the Laplace, Cauchy, standardized t and logistic distributions]
The Gamma distribution - a refresher

The Gamma distribution is often used to model parameters that can only take positive values. In turn, this has been motivated by the fact that the Gamma distribution acts as a conjugate prior in many models.

θ ~ Gamma(α, β):  p(θ) = (β^α/Γ(α)) θ^(α−1) e^(−βθ),  α, β > 0.

Special cases: Gamma(1, β) ≡ Exp(β) (exponential density); Gamma(ν/2, 1/2) ≡ χ²_ν (chi-square density).

[Figure: Gamma(5, 1) density]
The Gamma distribution

For θ ~ Gamma(α, β):

E(θ) = α/β,  Mode(θ) = (α − 1)/β for α > 1,  V(θ) = α/β².

[Figure: Gamma(5, 2) density]
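These moment formulas are easy to check by simulation. Below is a minimal Python sketch (Python stands in for the course's R here; the function name and sample size are ours) that draws Gamma(5, 2) variates with the standard library and compares the sample mean and variance to α/β and α/β²:

```python
import random
import statistics

def gamma_moments(alpha, beta, n=200_000, seed=1):
    """Monte Carlo estimates of E(theta) and V(theta) for theta ~ Gamma(alpha, rate=beta)."""
    rng = random.Random(seed)
    # random.gammavariate takes a *scale* parameter, so pass scale = 1/rate
    draws = [rng.gammavariate(alpha, 1.0 / beta) for _ in range(n)]
    return statistics.fmean(draws), statistics.variance(draws)

mean_hat, var_hat = gamma_moments(5, 2)
# Theory: E(theta) = 5/2 = 2.5 and V(theta) = 5/4 = 1.25
```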
Possible models

Data likelihood:
f(x₁, …, xₙ | µ, σ²) = ∏ᵢ₌₁ⁿ f(xᵢ | µ, σ²) = ∏ᵢ₌₁ⁿ (1/√(2πσ²)) e^(−(xᵢ−µ)²/(2σ²)) = (2πσ²)^(−n/2) e^(−∑ᵢ₌₁ⁿ (xᵢ−µ)²/(2σ²)).

Models:
- µ is unknown, σ² is known;
- µ is known, σ² is unknown;
- both µ and σ² are unknown: µ is dependent on σ², or µ and σ² are independent.
Useful facts for derivations

Normal component: if π(θ) ∝ e^(−(aθ² − 2bθ)/2), then θ ~ N(b/a, 1/a) and
√(a/(2π)) e^(−b²/(2a)) ∫ e^(−(aθ² − 2bθ)/2) dθ = 1.

Gamma component: if π(θ) ∝ θ^(a−1) e^(−bθ), then θ ~ Ga(a, b) and
(bᵃ/Γ(a)) ∫ θ^(a−1) e^(−bθ) dθ = 1.
Student component: if π(θ) ∝ (δ + (θ − l)²/S)^(−(δ+1)/2), then θ ~ t_δ(l, S) and
(Γ((δ+1)/2) δ^(δ/2) / (√(πS) Γ(δ/2))) ∫ (δ + (θ − l)²/S)^(−(δ+1)/2) dθ = 1.
The Normal Model

x = (x₁, …, xₙ) ~ N(µ, σ²) i.i.d., with both µ and σ² unknown. The likelihood is:

L(µ, σ²) ∝ ∏ᵢ₌₁ⁿ (1/(σ√(2π))) exp(−(xᵢ − µ)²/(2σ²)) ∝ (1/σ²)^(n/2) exp(−∑ᵢ (xᵢ − µ)²/(2σ²)).

For inference, the focus is on p(µ, σ² | x) = p(µ | σ², x) p(σ² | x). From a Bayesian perspective, it is easier to work with the precision τ = 1/σ². The likelihood becomes:

L(µ, τ) ∝ ∏ᵢ₌₁ⁿ τ^(1/2) exp(−τ(xᵢ − µ)²/2) ∝ τ^(n/2) exp(−(τ/2) ∑ᵢ (xᵢ − µ)²).
Likelihood factorization:

L(µ, τ) ∝ τ^(n/2) exp(−(τ/2) ∑ᵢ (xᵢ − µ)²)
= τ^(n/2) exp(−(τ/2) ∑ᵢ [(xᵢ − x̄) − (µ − x̄)]²)
= τ^(n/2) exp(−(τ/2) [∑ᵢ (xᵢ − x̄)² + n(µ − x̄)²])
= τ^(n/2) exp(−(τ/2)(n − 1)s²) exp(−(τ/2) n(µ − x̄)²)
= τ^(n/2) exp(−(τ/2) SS) exp(−(τ/2) n(µ − x̄)²),

with s² = ∑ᵢ (xᵢ − x̄)²/(n − 1) and SS = ∑ᵢ (xᵢ − x̄)², the sample variance and the sum of squares [SS and x̄ are sufficient statistics].
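The decomposition behind the third line, ∑ᵢ(xᵢ − µ)² = SS + n(x̄ − µ)², can be verified numerically; a small Python sketch with made-up data:

```python
def decompose(xs, mu):
    """Sum of squares about mu, computed directly and via SS + n*(xbar - mu)^2."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)       # SS: squares about the sample mean
    direct = sum((x - mu) ** 2 for x in xs)     # squares about mu
    return direct, ss + n * (xbar - mu) ** 2

direct, split = decompose([1.2, 0.7, 2.9, 1.5], mu=1.0)
# The two expressions agree up to floating-point rounding
```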
Non-informative Prior

Non-informative prior: π(µ, σ²) ∝ 1/σ². This arises by considering µ and σ² a priori independent and taking the product of the standard non-informative priors. This is not a conjugate setting (the posterior does not factor into a product of two independent distributions). The prior is improper, but the posterior is proper. This is also the Jeffreys prior.

The joint posterior distribution of µ and σ² is

p(µ, σ² | x) ∝ (σ²)^(−(n/2+1)) exp(−[(n − 1)s² + n(x̄ − µ)²]/(2σ²)),

where s² = (1/(n − 1)) ∑ᵢ₌₁ⁿ (xᵢ − x̄)².
The conditional posterior distribution p(µ | σ², x) is equivalent to the posterior for µ when σ² is known:

µ | σ², x ~ N(x̄, σ²/n).

The marginal posterior p(σ² | x) is obtained by integrating p(µ, σ² | x) over µ [Hint: integral of a Gaussian function, ∫ exp(−(c/2)(µ + b)²) dµ = √(2π/c)]:

p(σ² | x) ∝ ∫ (σ²)^(−(n/2+1)) exp(−[(n − 1)s² + n(x̄ − µ)²]/(2σ²)) dµ
∝ (σ²)^(−((n−1)/2+1)) exp(−(n − 1)s²/(2σ²)),

which is an inverse-gamma density, i.e.

σ² | x ~ Inv-Gamma((n − 1)/2, (n − 1)s²/2) ≡ Inv-χ²(n − 1, s²),

or, equivalently, τ | x ~ Ga((n − 1)/2, (n − 1)s²/2).
Sampling from the joint posterior distribution

One can simulate a value of (µ, σ²) from the joint posterior density by:
1. simulating σ² from an Inv-Gamma((n − 1)/2, (n − 1)s²/2) distribution [take the inverse of random samples from a Gamma((n − 1)/2, (n − 1)s²/2)];
2. then simulating µ from a N(x̄, σ²/n) distribution.
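A Python sketch of this two-step scheme (standard library only; the data vector is illustrative). The precision τ is drawn from the Gamma((n − 1)/2, (n − 1)s²/2) distribution and inverted to get σ²:

```python
import math
import random

def sample_joint_posterior(xs, ndraw=50_000, seed=2):
    """Draw (mu, sigma2) pairs from the joint posterior under pi(mu, sigma2) ∝ 1/sigma2."""
    rng = random.Random(seed)
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    a, rate = (n - 1) / 2, (n - 1) * s2 / 2
    draws = []
    for _ in range(ndraw):
        tau = rng.gammavariate(a, 1.0 / rate)          # gammavariate takes scale = 1/rate
        sigma2 = 1.0 / tau                             # sigma2 ~ Inv-Gamma(a, rate)
        mu = rng.normalvariate(xbar, math.sqrt(sigma2 / n))
        draws.append((mu, sigma2))
    return draws, xbar

draws, xbar = sample_joint_posterior([2.1, 1.7, 2.5, 1.9, 2.3, 2.0])
mu_mean = sum(m for m, _ in draws) / len(draws)        # should be close to xbar
```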
Marginal posterior distribution p(µ | x) of µ

As µ is typically the parameter of interest (σ² a nuisance parameter), it is useful to calculate its marginal posterior distribution [Hint: integral of a Gamma function, Γ(a)/bᵃ = ∫₀^∞ z^(a−1) exp(−zb) dz]:

p(µ | x) = ∫₀^∞ p(µ, σ² | x) dσ²
∝ ∫₀^∞ (σ²)^(−(n/2+1)) exp(−[(n − 1)s² + n(x̄ − µ)²]/(2σ²)) dσ²
∝ A^(−n/2) ∫₀^∞ z^((n−2)/2) exp(−z) dz,  with A = (n − 1)s² + n(x̄ − µ)² and z = A/(2σ²),
∝ A^(−n/2) ∝ [1 + (1/(n − 1)) ((µ − x̄)/(s/√n))²]^(−[(n−1)+1]/2),

that is, µ | x ~ t(n − 1, x̄, s²/n), or (µ − x̄)/(s/√n) | x ~ t₍ₙ₋₁₎, with t₍ₙ₋₁₎ the standard t-distribution with n − 1 degrees of freedom.
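The t marginal can be checked against the two-step sampler of the previous slide: the Monte Carlo variance of the µ draws should match the variance of a t₍ₙ₋₁₎(x̄, s²/n) distribution, (s²/n)·(n − 1)/(n − 3). A Python sketch with fixed summary statistics (values illustrative):

```python
import math
import random

def mu_marginal_variance(n=8, xbar=0.0, s2=1.0, ndraw=200_000, seed=3):
    """Monte Carlo variance of mu | x versus the t_{n-1}(xbar, s2/n) formula."""
    rng = random.Random(seed)
    nu = n - 1
    a, rate = nu / 2, nu * s2 / 2
    mus = []
    for _ in range(ndraw):
        sigma2 = 1.0 / rng.gammavariate(a, 1.0 / rate)              # sigma2 | x
        mus.append(rng.normalvariate(xbar, math.sqrt(sigma2 / n)))  # mu | sigma2, x
    mbar = sum(mus) / ndraw
    var_mc = sum((m - mbar) ** 2 for m in mus) / (ndraw - 1)
    var_theory = (s2 / n) * nu / (nu - 2)   # variance of the scaled t marginal
    return var_mc, var_theory

var_mc, var_theory = mu_marginal_variance()
```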
Conjugate Prior Model

A conjugate prior must be of the form π(µ, σ²) = π(µ | σ²)π(σ²), e.g.,

µ | σ² ~ N(µ₀, σ²/τ₀),  σ² ~ IG(ν₀/2, SS₀/2)  [or τ ~ Ga(ν₀/2, SS₀/2)],

which corresponds to the joint prior density

p(µ, σ²) ∝ (σ²/τ₀)^(−1/2) exp(−τ₀(µ − µ₀)²/(2σ²)) (σ²)^(−(ν₀/2+1)) exp(−SS₀/(2σ²))
= (σ²)^(−[(ν₀+1)/2+1]) exp(−[SS₀ + τ₀(µ − µ₀)²]/(2σ²)).

We call this a Normal-Inverse-Gamma prior, (µ, σ²) ~ NIG(µ₀, τ₀, ν₀/2, SS₀/2).
Joint Posterior p(µ, σ² | x)

p(µ, σ² | x) ∝ (σ²)^(−[(ν₀+1)/2+1]) exp(−[SS₀ + τ₀(µ − µ₀)²]/(2σ²)) × (σ²)^(−n/2) exp(−∑ᵢ₌₁ⁿ (xᵢ − µ)²/(2σ²))
∝ (σ²)^(−[(ν_n+1)/2+1]) exp(−[SS_n + τ_n(µ − µ_n)²]/(2σ²)),

with

µ | σ², x ~ N(µ_n, σ²/τ_n),  µ_n = (µ₀ τ₀/σ² + x̄ n/σ²)/(τ₀/σ² + n/σ²) = (τ₀µ₀ + nx̄)/τ_n,  τ_n = τ₀ + n,
σ² | x ~ IG(ν_n/2, SS_n/2),  ν_n = ν₀ + n,  SS_n = SS₀ + SS + τ₀n(x̄ − µ₀)²/τ_n.

Thus, µ, σ² | x ~ Normal-Inverse-Gamma(µ_n, τ_n; ν_n/2, SS_n/2).
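These update formulas can be packaged into a small helper; a Python sketch (function name is ours) mapping data plus NIG hyperparameters to the posterior hyperparameters:

```python
def nig_update(xs, mu0, tau0, nu0, ss0):
    """Posterior hyperparameters (mu_n, tau_n, nu_n, SS_n) for the NIG conjugate model."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)            # observed variation SS
    tau_n = tau0 + n
    mu_n = (tau0 * mu0 + n * xbar) / tau_n           # weighted average of prior mean and xbar
    nu_n = nu0 + n
    ss_n = ss0 + ss + tau0 * n * (xbar - mu0) ** 2 / tau_n
    return mu_n, tau_n, nu_n, ss_n

mu_n, tau_n, nu_n, ss_n = nig_update([1.0, 2.0, 3.0], mu0=0.0, tau0=1.0, nu0=1.0, ss0=1.0)
# Here xbar = 2 and SS = 2, so mu_n = 6/4 = 1.5, tau_n = 4, nu_n = 4, SS_n = 1 + 2 + 3 = 6
```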
Also µ | x ~ t₍ν_n₎(µ_n, σ²_n/τ_n), with σ²_n = SS_n/ν_n [Note: again ∫ N(m, σ²/τ) IG(ν/2, SS/2) dσ² = t_ν(m, SS/(ντ))].

Comments:
- µ_n is the expected value for µ after seeing the data: µ_n = (n/τ_n) x̄ + (τ₀/τ_n) µ₀, a weighted average.
- τ_n is the precision for estimating µ after n observations.
- ν_n is the degrees of freedom [τ ~ Ga(α/2, β/2) ⟺ βτ ~ χ²_α, with α degrees of freedom].
- SS_n is the posterior variation: prior variation + observed variation + variation between prior mean and sample mean.
- Limiting case: τ₀ → 0, ν₀ → −1 (and SS₀ → 0) gives µ | x ~ t₍ₙ₋₁₎(x̄, s²/n) (same as the improper prior!).
Example on SPF (from Merlise Clyde)

A Sunlight Protection Factor (SPF) of 5 means that an individual who can tolerate X minutes of sunlight without any sunscreen can tolerate 5X minutes with sunscreen. Data on 13 individuals (tolerance, in minutes, with and without sunscreen). The analysis should take into account the pairing, which induces dependence between observations (take differences and use ratios, or log(ratios) = difference in logs). Ratios make more sense given the goals: how much longer can a person be exposed to the sun relative to their baseline?

Model: Y = log(TRT) − log(CONTROL) ~ N(µ, σ²). Then E(log(TRT/CONTROL)) = µ = log(SPF). Interested in exp(µ) = SPF.

Summary statistics: ȳ = 1.998, s² = 0.525, n = 13 [make boxplots and normal Q-Q plots to check on normality].
Model formulation: Y = log(TRT) − log(CONTROL) ~ N(µ, σ²), n = 13, ȳ = 1.998, s² = 0.525.

Question: π(µ | y₁, …, yₙ) = ?

Bayesian model:
- Data likelihood: f(y₁, …, yₙ | µ, σ²) = ∏ᵢ₌₁ⁿ N(yᵢ; µ, σ²);
- Non-informative prior: π(µ, σ²) ∝ 1/σ²;
- Posterior: (µ, σ² | y₁, …, yₙ) ~ N(ȳ, σ²/n) · IG((n−1)/2, (n−1)s²/2);
- Posterior: µ | y₁, …, yₙ ~ t₍ₙ₋₁₎(ȳ, s²/n);
- Prediction: y_f | y₁, …, yₙ ~ t₍ₙ₋₁₎(ȳ, s²(n+1)/n).

Coding in R: rgamma(), rnorm() and rt().
With non-informative prior.
- Posterior: (µ, σ² | y₁, …, yₙ) ~ N(ȳ, σ²/n) · IG((n−1)/2, (n−1)s²/2)
- Posterior: µ | y₁, …, yₙ ~ t₍ₙ₋₁₎(ȳ, s²/n)

Define: vn = n − 1 = 12, SSn = (n − 1)s² = 6.3, mn = ȳ = 1.998.

Sampling from the posterior:
- Draw τ | Y: tau = rgamma(10000, vn/2, rate = SSn/2)
- Draw µ | τ, Y: mu = rnorm(10000, mn, 1/sqrt(tau*n))
- or draw µ | Y directly: mu = rt(10000, vn)*sqrt(SSn/(n*vn)) + mn
Model formulation: Y = log(TRT) − log(CONTROL) ~ N(µ, σ²), n = 13, ȳ = 1.998, s² = 0.525.

Question: π(µ | y₁, …, yₙ) = ?

Bayesian model:
- Data likelihood: f(y₁, …, yₙ | µ, σ²) = ∏ᵢ₌₁ⁿ N(yᵢ; µ, σ²);
- Conjugate prior: µ | σ² ~ N(µ₀, σ²/τ₀), σ² ~ IG(ν₀/2, SS₀/2);
- Posterior: (µ, σ² | y₁, …, yₙ) ~ NIG(µ_n, τ_n; ν_n/2, SS_n/2);
- Posterior: µ | y₁, …, yₙ ~ t₍ν_n₎(µ_n, SS_n/(τ_n ν_n));
- Prediction: y_f | y₁, …, yₙ ~ t₍ν_n₎(µ_n, (SS_n/ν_n)(τ_n + 1)/τ_n).

Coding in R: rgamma(), rnorm() and rt().
Expert opinions on µ:
- Best guess on the median SPF is 16; P(SPF > 64) = 0.01; the information in the prior is worth 25 observations.
- Possible subjective prior: µ₀ = log(16), τ₀ = 25, ν₀ = τ₀ − 1 = 24; P(µ < log(64)) = 0.99 implies SS₀ = 185.7.

Posterior hyperparameters: τ_n = 38, µ_n = 2.508, ν_n = 37, SS_n = 197.134.

Sampling from the posterior:
- Draw τ | Y: tau = rgamma(10000, vn/2, rate = SSn/2)
- Draw µ | τ, Y: mu = rnorm(10000, mn, 1/sqrt(tau*tn))
- or draw µ | Y directly: mu = rt(10000, vn)*sqrt(SSn/(tn*vn)) + mn
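The posterior hyperparameters quoted here follow from the update formulas on the conjugate-prior slides; a Python check (taking the prior as worth 25 observations, so τ₀ = 25 and ν₀ = 24, consistent with the reported τ_n = 38 and ν_n = 37):

```python
import math

def spf_posterior(n, ybar, mu0, tau0, nu0):
    """Posterior location, precision weight and degrees of freedom for the SPF example."""
    tau_n = tau0 + n
    mu_n = (tau0 * mu0 + n * ybar) / tau_n
    nu_n = nu0 + n
    return mu_n, tau_n, nu_n

mu_n, tau_n, nu_n = spf_posterior(n=13, ybar=1.998, mu0=math.log(16), tau0=25, nu0=24)
# mu_n ≈ 2.508, tau_n = 38, nu_n = 37
# (SS_n additionally needs SS_0 and the data sum of squares)
```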
Transform to exp(µ). Find 95% C.I. of 4.54 to 3.758.
Predictive Distribution of a future z

Posterior predictive distribution (given x = (x₁, …, xₙ)):

p(z | x) = ∫∫ p(z | µ, σ², x) p(µ, σ² | x) dµ dσ²

[Use the assumption that z is independent of x given µ and σ²; then integrate µ using the normal integral and σ² using the Gamma integral.]

- Reference prior: z | x ~ t₍ₙ₋₁₎(x̄, s²(n + 1)/n)
- Conjugate prior: z | x ~ t₍ν_n₎(µ_n, σ²_n(τ_n + 1)/τ_n), σ²_n = SS_n/ν_n

[Can use the normal trick to integrate µ: if z ~ N(µ, σ²) and µ ~ N(µ₀, σ²/τ₀), then y = (z − µ)/σ ~ N(0, 1), that is z =_d σy + µ, and therefore z | σ² ~ N(µ₀, σ²(1 + 1/τ₀)), since a linear combination of (independent) normals is normal with means and variances added.]
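Simulation-wise, the posterior predictive just adds one more step to the joint-posterior sampler: draw (µ, σ²), then z ~ N(µ, σ²). A Python sketch under the reference prior (data illustrative):

```python
import math
import random

def posterior_predictive(xs, ndraw=100_000, seed=4):
    """Draw z from p(z | x) under the reference prior pi(mu, sigma2) ∝ 1/sigma2."""
    rng = random.Random(seed)
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    a, rate = (n - 1) / 2, (n - 1) * s2 / 2
    zs = []
    for _ in range(ndraw):
        sigma2 = 1.0 / rng.gammavariate(a, 1.0 / rate)        # sigma2 | x
        mu = rng.normalvariate(xbar, math.sqrt(sigma2 / n))   # mu | sigma2, x
        zs.append(rng.normalvariate(mu, math.sqrt(sigma2)))   # z | mu, sigma2
    return zs, xbar

zs, xbar = posterior_predictive([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9])
z_mean = sum(zs) / len(zs)   # should be close to xbar, the center of the predictive t
```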
Prior predictive distribution: what we expect the distribution of the data to be before we observe them,

p(z) = ∫∫ p(z | µ, σ²) π(µ, σ²) dµ dσ²,  z ~ t₍ν₀₎(µ₀, (SS₀/ν₀)(1 + 1/τ₀))

[as above: ∫∫ N(µ, σ²) N(µ₀, σ²/τ₀) IG(ν/2, SS/2) dµ dσ² = t_ν(µ₀, (SS/ν)(1 + 1/τ₀))].

Note: this is what we used in the example to specify our subjective prior.
Back to example

Prior predictive distribution: z ~ t₂₄(log(16), (185.7/24)(1 + 1/25)).
Posterior predictive distribution: z ~ t₃₇(2.5, 5.3(1 + 1/38)).

Y = rt(10000, 24)*sqrt((1 + 1/25)*185.7/24) + log(16)
quantile(exp(Y))
  0%      25%    50%    75%      100%
4.57e-06  2.3    16.78  114.98   370966

Sampling from the posterior predictive leads to a 50% C.I. of (0.0003, 1.4): with sunscreen, there is a 50% chance that the next individual can be exposed from 0 to 1 times longer than without sunscreen.
Semi-conjugate prior

A semi-conjugate setting is obtained with independent priors π(µ, σ²) = π(µ)π(σ²):

µ ~ N(µ₀, σ₀²),  σ² ~ IG(ν₀/2, SS₀/2).

Then

µ | σ², x ~ N(µ_n, τ_n²),  µ_n = (µ₀/σ₀² + nx̄/σ²)/(1/σ₀² + n/σ²),  τ_n² = 1/(1/σ₀² + n/σ²),

but σ² | x is not available in closed form. We will solve this with MCMC methods!
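Although the joint posterior has no closed form, both full conditionals do: µ | σ², x is the normal above, and σ² | µ, x ~ IG((ν₀ + n)/2, (SS₀ + ∑ᵢ(xᵢ − µ)²)/2). Alternating the two gives a Gibbs sampler; a Python sketch (data and hyperparameter values illustrative):

```python
import math
import random

def gibbs_semiconjugate(xs, mu0, s02, nu0, ss0, ndraw=20_000, seed=5):
    """Gibbs sampler for mu ~ N(mu0, s02) and sigma2 ~ IG(nu0/2, ss0/2), independent."""
    rng = random.Random(seed)
    n = len(xs)
    xbar = sum(xs) / n
    mu, sigma2 = xbar, 1.0                      # starting values
    out = []
    for _ in range(ndraw):
        # mu | sigma2, x: precision-weighted combination of prior mean and xbar
        prec = 1.0 / s02 + n / sigma2
        mean = (mu0 / s02 + n * xbar / sigma2) / prec
        mu = rng.normalvariate(mean, math.sqrt(1.0 / prec))
        # sigma2 | mu, x: inverse-gamma full conditional
        a = (nu0 + n) / 2
        rate = (ss0 + sum((x - mu) ** 2 for x in xs)) / 2
        sigma2 = 1.0 / rng.gammavariate(a, 1.0 / rate)
        out.append((mu, sigma2))
    return out

draws = gibbs_semiconjugate([2.1, 1.7, 2.5, 1.9, 2.3, 2.0],
                            mu0=0.0, s02=100.0, nu0=0.002, ss0=0.002)
mu_mean = sum(m for m, _ in draws) / len(draws)   # near xbar under these vague priors
```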
Summary of Conjugate Priors for the Normal Model

Conjugate priors for normal data with unknown precision are

τ ~ Gamma(a/2, b/2),  µ | τ ~ N(µ₀, 1/(τ₀τ)).

Here a, b, µ₀, and τ₀ are known hyperparameters chosen to characterize the prior information. The problem with using this prior in practical data analysis is the difficulty of specifying a distribution for µ that is conditional on τ (which is also unknown).
Summary of the Independence Prior

Here we assume that information about µ can be elicited independently of information on τ or σ², so

p(µ, τ) = p(µ) p(τ).

This makes elicitation relatively easy. Although the primary goal is to get a prior that reasonably captures the expert's information, independence priors generally work well. Usually, one considers Gamma priors for τ, since they are conjugate, but there is really no need to, as long as the prior is defined on the positive real line.
Proper (Semi-conjugate) Reference Priors

More recently, priors such as

µ ~ N(0, b),  τ ~ Gamma(c, c)

have been used as proper reference priors. In this case, b and c are chosen so that the prior precision for µ, 1/b, and both hyperparameters c in the Gamma distribution are near zero. Such priors are seen as approximations of the improper default prior p(µ, τ) ∝ 1/τ. Common choices are b = 10⁶ and c = 0.001.
Back to the example

We need to identify a prior distribution that gives information (or no information) about the unknown parameters µ and τ = 1/σ².
- µ ~ N(0, 10⁶) as a proper non-informative prior.
- Expert opinion that µ should be centered at 16: then µ ~ N(16, 10⁶) as a diffuse prior.
- Expert 95% certain that the mean SPF µ should be between 10 and 75, that is, Pr(10 < µ < 75) = 0.95: then µ ~ N(10, 0.0163).
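One generic way to turn an interval statement such as Pr(10 < µ < 75) = 0.95 into a normal prior is to center it at the midpoint with standard deviation (range)/(2 × 1.96). A Python sketch of that recipe (this is one common elicitation rule, not necessarily the one behind the prior reported on the slide):

```python
def normal_prior_from_interval(lo, hi, z=1.96):
    """Mean and sd of a normal prior whose central 95% mass covers (lo, hi)."""
    mean = (lo + hi) / 2          # center the prior at the interval midpoint
    sd = (hi - lo) / (2 * z)      # half-width equals z standard deviations
    return mean, sd

mean, sd = normal_prior_from_interval(10, 75)
precision = 1 / sd ** 2           # BUGS-style N(mean, precision) parameterization
# mean = 42.5, sd ≈ 16.58, precision ≈ 0.00364
```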
Back to the example

We have no good information on σ², the variance of an observation, so we can specify a reference (vague) prior on τ that is independent of µ:

τ ~ Gamma(0.001, 0.001).