STAT 425: Introduction to Bayesian Analysis


1 STAT 425: Introduction to Bayesian Analysis
Marina Vannucci, Rice University, USA. Fall 2018
Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 1) Fall 2018

2 Lectures 9-11: Multi-parameter models. The Normal model

3 Parameterizations of the Normal Distribution
Mean and variance: $f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $x \in \mathbb{R}$, $\sigma > 0$.
Mean and precision: $f(x \mid \mu, \tau) = \sqrt{\frac{\tau}{2\pi}}\, e^{-\frac{\tau (x-\mu)^2}{2}}$, $x \in \mathbb{R}$, $\tau = 1/\sigma^2 > 0$.
The latter has advantages in numerical computations when $\sigma \to 0$ and simplifies formulas.
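As a quick numerical check of the two parameterizations, here is a minimal sketch in R. The values of `mu`, `sigma` and the grid `x` are illustrative assumptions, not taken from the slides; the point is only that the precision form with $\tau = 1/\sigma^2$ reproduces `dnorm()`.

```r
# Illustrative values (assumptions, not from the slides)
mu <- 1.5
sigma <- 0.7
tau <- 1 / sigma^2                      # precision = 1 / variance
x <- seq(-2, 5, by = 0.5)

# Density in the mean/variance parameterization
f_var <- dnorm(x, mean = mu, sd = sigma)

# Density in the mean/precision parameterization, written out by hand
f_prec <- sqrt(tau / (2 * pi)) * exp(-tau * (x - mu)^2 / 2)

# Largest absolute discrepancy between the two forms
max_abs_diff <- max(abs(f_var - f_prec))
print(max_abs_diff)
```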

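4 Summary
Dist | pdf/pmf | Domain | Mean | Variance
Bern | $P(x) = p^x (1-p)^{1-x}$ | $\{0, 1\}$ | $p$ | $p(1-p)$
Bin | $P(x) = \binom{N}{x} p^x (1-p)^{N-x}$ | $\{0, \dots, N\}$ | $Np$ | $Np(1-p)$
Poi | $P(x) = e^{-\lambda} \frac{\lambda^x}{x!}$ | $\mathbb{N}$ | $\lambda$ | $\lambda$
NB | $P(x) = \binom{r+x-1}{x} p^r (1-p)^x$ | $\mathbb{N}$ | $r\frac{1-p}{p}$ | $r\frac{1-p}{p^2}$
M | $P(x_1, \dots, x_K) = \frac{N!}{\prod_k x_k!} \prod_k p_k^{x_k}$ | $\{0, \dots, N\}^K$ | $Np_k$ | $Np_k(1-p_k)$
U | $f(x) = \frac{1}{b-a}$ | $[a, b]$ | $\frac{a+b}{2}$ | $\frac{(b-a)^2}{12}$
Be | $f(x) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1-x)^{b-1}$ | $[0, 1]$ | $\frac{a}{a+b}$ | $\frac{ab}{(a+b)^2 (a+b+1)}$
Ga | $f(x) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-bx}$ | $\mathbb{R}^+$ | $\frac{a}{b}$ | $\frac{a}{b^2}$
N | $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ | $\mathbb{R}$ | $\mu$ | $\sigma^2$
MN | $f(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} e^{-\frac{1}{2}(X-\mu)^T \Sigma^{-1} (X-\mu)}$ | $\mathbb{R}^p$ | $\mu$ | $\Sigma$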

5 Summary Model parameters MOM MLE UMVUE Bern p X X X X n 1 Bin p n S X X X nn nn Poi λ X X X NB r p X X n 1 with known r n S ˆr ˆr+ X r r+ X U a X 3 n 1 n S X (1) with a = 0 Ga b a b X + 3 n 1 n S n+1 X (n) n X (n) X n 1 with known a n S X ā n 1 n S X N µ X X X σ n 1 n 1 S n n S with known σ

6 Related Distributions
Normal distribution $X \sim N(\mu, \sigma^2)$:
  Truncated normal distribution: $f(x \mid \mu, \sigma^2, a, b) = \frac{f(x \mid \mu, \sigma^2)}{\Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right)}$;
  Standardized t-distribution: $\frac{\bar{X} - \mu}{s/\sqrt{n}} \sim t_{n-1}(0, 1)$, $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$, $s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$.
Standard normal distribution $X \sim N(0, 1)$:
  Log-normal distribution: $e^{\mu + \sigma X} \sim LN(\mu, \sigma^2)$;
  Cauchy distribution: $X_1/X_2 \sim \text{Cauchy}(0, 1)$.

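7 Bell-shaped Distributions
Laplace distribution (double exponential distribution): $f(x \mid \mu, b) = \frac{1}{2b} e^{-\frac{|x-\mu|}{b}}$, $x \in \mathbb{R}$, $b > 0$.
Cauchy distribution: $f(x \mid \mu, \gamma) = \frac{1}{\pi\gamma\left[1 + \left(\frac{x-\mu}{\gamma}\right)^2\right]}$, $x \in \mathbb{R}$, $\gamma > 0$.
t-distribution: $f(x \mid \nu, \mu, \sigma) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\sigma\,\Gamma\left(\frac{\nu}{2}\right)} \left[1 + \frac{1}{\nu}\left(\frac{x-\mu}{\sigma}\right)^2\right]^{-\frac{\nu+1}{2}}$, $x \in \mathbb{R}$, $\nu > 0$, $\sigma > 0$.
Logistic distribution: $f(x \mid \mu, s) = \frac{e^{-\frac{x-\mu}{s}}}{s\left(1 + e^{-\frac{x-\mu}{s}}\right)^2}$, $x \in \mathbb{R}$, $s > 0$.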

8 Laplace, Cauchy, Standardized t and logistic
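The four bell-shaped densities can be compared with a short R sketch. Base R has no Laplace density, so `dlaplace()` below is a hypothetical helper written from the formula; the choice of 3 degrees of freedom for the t density and the plotting grid are assumptions made only for illustration.

```r
# Hypothetical helper: Laplace (double exponential) density from its formula
dlaplace <- function(x, mu = 0, b = 1) exp(-abs(x - mu) / b) / (2 * b)

x <- seq(-6, 6, length.out = 401)

# Standard versions of each density evaluated on a common grid
dens <- cbind(
  Laplace  = dlaplace(x),
  Cauchy   = dcauchy(x),
  t3       = dt(x, df = 3),     # df = 3 is an arbitrary illustrative choice
  Logistic = dlogis(x)
)

# One curve per column; colours 1:4 match the legend
matplot(x, dens, type = "l", lty = 1, col = 1:4, ylab = "density")
legend("topright", legend = colnames(dens), col = 1:4, lty = 1)
```

The Cauchy and t curves show visibly heavier tails than the logistic, which is the usual reason for preferring them as robust alternatives to the Normal.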

9 The Gamma distribution - a refresher
The Gamma distribution is often used to model parameters that can only take positive values. In turn, this has been motivated by the fact that the Gamma distribution acts as a conjugate prior in many models.
$\theta \sim \text{Gamma}(\alpha, \beta)$: $p(\theta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha-1} e^{-\beta\theta}$, $\alpha, \beta > 0$
$\text{Gamma}(1, \beta) \equiv \text{Exp}(\beta)$ (exponential density)
$\text{Gamma}\left(\frac{\nu}{2}, \frac{1}{2}\right) \equiv \chi^2_\nu$ (chi-square density)
[Figure: Gamma(5, 1) density, drawn as dgamma(sort(x), shape = 5, rate = 1) against x]

10 The Gamma distribution
$\theta \sim \text{Gamma}(\alpha, \beta)$: $p(\theta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \theta^{\alpha-1} e^{-\beta\theta}$, $\alpha, \beta > 0$
$E(\theta) = \frac{\alpha}{\beta}$
$\text{Mode}(\theta) = \frac{\alpha - 1}{\beta}$, $\alpha > 1$
$V(\theta) = \frac{\alpha}{\beta^2}$
[Figure: a Gamma density with shape = 5, drawn with dgamma() against x]
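The moment formulas and the two special cases from the refresher can be checked numerically. This is a minimal sketch; the shape `alpha = 5` and rate `beta = 2` are illustrative assumptions, as are the seed and the evaluation points.

```r
alpha <- 5; beta <- 2          # illustrative shape and rate (assumptions)

# Monte Carlo check of E(theta) = alpha/beta and V(theta) = alpha/beta^2
set.seed(1)
theta <- rgamma(1e5, shape = alpha, rate = beta)
mc_mean <- mean(theta)
mc_var  <- var(theta)

# Numerical mode of the density, to compare with (alpha - 1)/beta
mode_num <- optimize(function(t) dgamma(t, shape = alpha, rate = beta),
                     interval = c(0, 10), maximum = TRUE)$maximum

# Special cases: Gamma(1, beta) is Exp(beta); Gamma(nu/2, 1/2) is chi-square(nu)
x <- c(0.5, 1, 2, 4)
exp_ok <- all.equal(dgamma(x, shape = 1, rate = beta), dexp(x, rate = beta))
chi_ok <- all.equal(dgamma(x, shape = 3 / 2, rate = 1 / 2), dchisq(x, df = 3))

print(c(mean = mc_mean, var = mc_var, mode = mode_num))
```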

11 Possible models
Data likelihood:
$f(x_1, \dots, x_n \mid \mu, \sigma^2) = \prod_{i=1}^n f(x_i \mid \mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = (2\pi\sigma^2)^{-n/2}\, e^{-\frac{\sum_{i=1}^n (x_i-\mu)^2}{2\sigma^2}}$.
Models:
  $\mu$ is unknown, $\sigma^2$ is known;
  $\mu$ is known, $\sigma^2$ is unknown;
  Both $\mu$ and $\sigma^2$ are unknown:
    $\mu$ is dependent on $\sigma^2$;
    $\mu$ and $\sigma^2$ are independent.

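12 Useful facts for derivations
Normal component: if $\pi(\theta) \propto e^{-\frac{1}{2}(a\theta^2 - 2b\theta)}$, then $\theta \sim N\left(\frac{b}{a}, \frac{1}{a}\right)$ and $\int \frac{1}{\sqrt{2\pi a^{-1}}}\, e^{-\frac{b^2}{2a}}\, e^{-\frac{1}{2}(a\theta^2 - 2b\theta)}\, d\theta = 1$.
Gamma component: if $\pi(\theta) \propto \theta^{a-1} e^{-b\theta}$, then $\theta \sim Ga(a, b)$ and $\int \frac{b^a}{\Gamma(a)}\, \theta^{a-1} e^{-b\theta}\, d\theta = 1$.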

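13 Student component: if $\pi(\theta) \propto \left(\delta + \frac{(\theta - l)^2}{S}\right)^{-\frac{\delta+1}{2}}$, then $\theta \sim t_\delta(l, S)$ and
$\int \frac{1}{\sqrt{\pi S}}\, \frac{\Gamma\left(\frac{\delta+1}{2}\right)}{\Gamma\left(\frac{\delta}{2}\right)}\, \delta^{\delta/2} \left(\delta + \frac{(\theta - l)^2}{S}\right)^{-\frac{\delta+1}{2}} d\theta = 1$.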
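The Normal and Student normalizing constants stated above can be verified by numerical integration. The sketch below uses `integrate()` with illustrative values of `a`, `b`, `delta`, `l` and `S` (all assumptions chosen only for the check).

```r
# Normal component: kernel exp(-(a*theta^2 - 2*b*theta)/2)
a <- 2; b <- 3                                   # illustrative values
norm_kernel <- function(th) exp(-0.5 * (a * th^2 - 2 * b * th))
const_norm  <- sqrt(a / (2 * pi)) * exp(-b^2 / (2 * a))
norm_total  <- const_norm * integrate(norm_kernel, -Inf, Inf)$value

# Student component: kernel (delta + (theta - l)^2 / S)^(-(delta + 1)/2)
delta <- 5; l <- 1; S <- 2                       # illustrative values
t_kernel <- function(th) (delta + (th - l)^2 / S)^(-(delta + 1) / 2)
const_t  <- gamma((delta + 1) / 2) / (gamma(delta / 2) * sqrt(pi * S)) *
  delta^(delta / 2)
t_total  <- const_t * integrate(t_kernel, -Inf, Inf)$value

# Both totals should be numerically close to 1
print(c(normal = norm_total, student = t_total))
```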

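14 The Normal Model
$x = (x_1, \dots, x_n) \sim N(\mu, \sigma^2)$ i.i.d., with both $\mu$ and $\sigma^2$ unknown.
The likelihood is:
$L(\mu, \sigma^2) \propto \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu)^2\right) \propto \left(\frac{1}{\sigma^2}\right)^{n/2} \exp\left(-\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2\right)$
For inference, focus is on $p(\mu, \sigma^2 \mid x) = p(\mu \mid \sigma^2, x)\, p(\sigma^2 \mid x)$.
From a Bayesian perspective, it is easier to work with the precision, $\tau = \frac{1}{\sigma^2}$.
The likelihood becomes:
$L(\mu, \tau) \propto \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\, \tau^{1/2} \exp\left(-\frac{1}{2}\tau (x_i - \mu)^2\right) \propto \tau^{n/2} \exp\left(-\frac{1}{2}\tau \sum_i (x_i - \mu)^2\right)$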

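15 Likelihood factorization:
$L(\mu, \tau) \propto \tau^{n/2} \exp\left(-\frac{1}{2}\tau \sum_i (x_i - \mu)^2\right)$
$\propto \tau^{n/2} \exp\left(-\frac{1}{2}\tau \sum_i \left[(x_i - \bar{x}) - (\mu - \bar{x})\right]^2\right)$
$\propto \tau^{n/2} \exp\left(-\frac{1}{2}\tau \left[\sum_i (x_i - \bar{x})^2 + n(\mu - \bar{x})^2\right]\right)$
$\propto \tau^{n/2} \exp\left(-\frac{1}{2}\tau s^2 (n-1)\right) \exp\left(-\frac{1}{2}\tau n (\mu - \bar{x})^2\right)$
$\propto \tau^{n/2} \exp\left(-\frac{1}{2}\tau SS\right) \exp\left(-\frac{1}{2}\tau n (\mu - \bar{x})^2\right)$
with $s^2 = \sum_i (x_i - \bar{x})^2/(n-1)$ and $SS = \sum_i (x_i - \bar{x})^2$ the sample variance and sum of squares [$SS$ and $\bar{x}$ are sufficient statistics].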
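The factorization says the likelihood depends on the data only through $\bar{x}$ and $SS$. A minimal R sketch of that identity, using simulated data and an arbitrary evaluation point (both assumptions made only for illustration):

```r
set.seed(42)
x <- rnorm(20, mean = 2, sd = 0.5)     # simulated data (assumption)
n <- length(x)
xbar <- mean(x)
SS <- sum((x - xbar)^2)

mu <- 1.8; tau <- 3                    # arbitrary point at which to evaluate

# Sum-of-squares decomposition used in the third line of the derivation
lhs <- sum((x - mu)^2)
rhs <- SS + n * (mu - xbar)^2

# Log-likelihood from the raw data versus from the sufficient statistics
loglik_direct <- sum(dnorm(x, mean = mu, sd = 1 / sqrt(tau), log = TRUE))
loglik_suff   <- (n / 2) * log(tau / (2 * pi)) -
  (tau / 2) * (SS + n * (mu - xbar)^2)

print(c(direct = loglik_direct, sufficient = loglik_suff))
```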

16 Non-informative Prior
Non-informative prior: $\pi(\mu, \sigma^2) \propto \frac{1}{\sigma^2}$.
This arises by considering $\mu$ and $\sigma^2$ a priori independent and taking the product of the standard non-informative priors.
This is not a conjugate setting (the posterior does not factor into a product of two independent distributions).
Prior is improper but posterior is proper.
This is also the Jeffreys prior.
Joint posterior distribution of $\mu$ and $\sigma^2$ is
$p(\mu, \sigma^2 \mid x) \propto (\sigma^2)^{-(n/2+1)} \exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{x} - \mu)^2\right]\right\}$
where $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$

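17 Deriving the conditional posterior distribution, $p(\mu \mid \sigma^2, x)$, is equivalent to deriving the posterior for $\mu$ when $\sigma^2$ is known:
$\mu \mid \sigma^2, x \sim N\left(\bar{x}, \frac{\sigma^2}{n}\right)$
The marginal posterior $p(\sigma^2 \mid x)$ is obtained by integrating $p(\mu, \sigma^2 \mid x)$ over $\mu$
[Hint: integral of a Gaussian function, $c\sqrt{2\pi} = \int \exp\left(-\frac{1}{2c^2}(\mu + b)^2\right) d\mu$]
$p(\sigma^2 \mid x) \propto \int (\sigma^2)^{-(n/2+1)} \exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{x} - \mu)^2\right]\right\} d\mu$
$\propto (\sigma^2)^{-[(n-1)/2+1]} \exp\left\{-\frac{(n-1)s^2}{2\sigma^2}\right\}$
which is an inverse-gamma density, i.e.
$\sigma^2 \mid x \sim \text{Inv-Gamma}\left(\frac{n-1}{2}, \frac{n-1}{2}s^2\right) \equiv \text{Inv-}\chi^2(n-1, s^2)$
or, equivalently, $\tau \mid x \sim Ga\left(\frac{n-1}{2}, \frac{n-1}{2}s^2\right)$.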

18 Sampling from the joint posterior distribution
One can simulate a value of $(\mu, \sigma^2)$ from the joint posterior density by
1. simulating $\sigma^2$ from an inverse-gamma$\left(\frac{n-1}{2}, \frac{n-1}{2}s^2\right)$ distribution [take the inverse of random samples from a Gamma$\left(\frac{n-1}{2}, \frac{n-1}{2}s^2\right)$]
2. then simulating $\mu$ from the $N\left(\bar{x}, \frac{\sigma^2}{n}\right)$ distribution.
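The two-step recipe above can be sketched in R as follows. The data vector `y` is simulated here purely for illustration (an assumption; the SPF measurements themselves are not reproduced in these slides), and `n_draws` is a hypothetical name for the number of posterior draws.

```r
set.seed(123)
y <- rnorm(13, mean = 2, sd = 0.5)     # hypothetical data (assumption)
n <- length(y)
ybar <- mean(y)
s2 <- var(y)                           # sample variance s^2

n_draws <- 10000

# Step 1: sigma^2 | y via the inverse of Gamma((n-1)/2, rate = (n-1)s^2/2) draws
tau <- rgamma(n_draws, shape = (n - 1) / 2, rate = (n - 1) * s2 / 2)
sigma2 <- 1 / tau

# Step 2: mu | sigma^2, y ~ N(ybar, sigma^2 / n), one draw per sigma^2 draw
mu <- rnorm(n_draws, mean = ybar, sd = sqrt(sigma2 / n))

# Posterior summaries from the joint draws
print(quantile(mu, c(0.025, 0.5, 0.975)))
print(mean(sigma2))
```

Because each `mu[i]` is drawn conditionally on `sigma2[i]`, the pairs `(mu[i], sigma2[i])` are draws from the joint posterior, and either column alone is a sample from the corresponding marginal.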

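19 Marginal posterior distribution $p(\mu \mid x)$ of $\mu$
As $\mu$ is typically the parameter of interest ($\sigma^2$ nuisance parameter) it is useful to calculate its marginal posterior distribution
[Hint: integral of a Gamma function, $\frac{\Gamma(a)}{b^a} = \int_0^\infty z^{a-1} \exp(-zb)\, dz$]
$p(\mu \mid x) = \int_0^\infty p(\mu, \sigma^2 \mid x)\, d\sigma^2$
$\propto \int_0^\infty (\sigma^2)^{-(n/2+1)} \exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{x} - \mu)^2\right]\right\} d\sigma^2$
$\propto A^{-n/2} \int_0^\infty z^{(n-2)/2} \exp(-z)\, dz$, with $A = (n-1)s^2 + n(\bar{x} - \mu)^2$, $z = \frac{A}{2\sigma^2}$
$\propto A^{-n/2} \propto \left[1 + \frac{1}{n-1}\left(\frac{\mu - \bar{x}}{s/\sqrt{n}}\right)^2\right]^{-[(n-1)+1]/2}$
that is, $\mu \mid x \sim t(n-1, \bar{x}, s^2/n)$, or
$\frac{\mu - \bar{x}}{s/\sqrt{n}} \mid x \sim t_{n-1}$
with $t_{n-1}$ the standard t-distribution with $n-1$ degrees of freedom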
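Since the marginal posterior of $\mu$ is a location-scale t, a credible interval is available in closed form. The sketch below compares the `qt()` interval with a Monte Carlo interval from `rt()` draws; the data are simulated and are an assumption made only for illustration.

```r
set.seed(7)
y <- rnorm(13, mean = 2, sd = 0.5)     # hypothetical data (assumption)
n <- length(y)
ybar <- mean(y)
s <- sd(y)

# Closed-form 95% credible interval from mu | y ~ t_{n-1}(ybar, s^2/n)
ci_exact <- ybar + qt(c(0.025, 0.975), df = n - 1) * s / sqrt(n)

# The same interval by direct sampling from the shifted and scaled t
mu_draws <- ybar + rt(1e5, df = n - 1) * s / sqrt(n)
ci_mc <- quantile(mu_draws, c(0.025, 0.975))

print(rbind(exact = ci_exact, monte_carlo = ci_mc))
```

Under this prior the interval coincides numerically with the classical t confidence interval for the mean, although its interpretation is a posterior probability statement about $\mu$.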

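20 Conjugate Prior Model
A conjugate prior must be of the form $\pi(\mu, \sigma^2) = \pi(\mu \mid \sigma^2)\,\pi(\sigma^2)$, e.g.,
$\mu \mid \sigma^2 \sim N(\mu_0, \sigma^2/\tau_0)$, $\sigma^2 \sim IG\left(\frac{\nu_0}{2}, \frac{SS_0}{2}\right)$ [or $\tau \sim Ga\left(\frac{\nu_0}{2}, \frac{SS_0}{2}\right)$],
which corresponds to the joint prior density
$p(\mu, \sigma^2) \propto \left(\frac{\sigma^2}{\tau_0}\right)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2/\tau_0}(\mu - \mu_0)^2\right\} (\sigma^2)^{-(\nu_0/2+1)} \exp\left\{-\frac{SS_0}{2\sigma^2}\right\}$
$\propto (\sigma^2)^{-\left(\frac{\nu_0+1}{2}+1\right)} \exp\left\{-\frac{\tau_0}{2\sigma^2}\left(\frac{SS_0}{\tau_0} + (\mu - \mu_0)^2\right)\right\}$
We call this a Normal-Inverse-Gamma prior, $(\mu, \sigma^2) \sim NIG(\mu_0, \tau_0; \nu_0/2, SS_0/2)$.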

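21 Joint Posterior $p(\mu, \sigma^2 \mid x)$
$p(\mu, \sigma^2 \mid x) \propto (\sigma^2)^{-\left(\frac{\nu_0+1}{2}+1\right)} \exp\left\{-\frac{1}{2\sigma^2}\left(SS_0 + \tau_0(\mu - \mu_0)^2\right)\right\} \times (\sigma^2)^{-n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right\}$
$\propto (\sigma^2)^{-\left(\frac{\nu_n+1}{2}+1\right)} \exp\left\{-\frac{\tau_n}{2\sigma^2}\left(\frac{SS_n}{\tau_n} + (\mu - \mu_n)^2\right)\right\}$
with
$\mu \mid \sigma^2, x \sim N(\mu_n, \sigma^2/\tau_n)$, $\mu_n = \frac{\mu_0 \frac{\tau_0}{\sigma^2} + \bar{x}\frac{n}{\sigma^2}}{\frac{\tau_0}{\sigma^2} + \frac{n}{\sigma^2}} = \frac{\tau_0\mu_0 + n\bar{x}}{\tau_n}$, $\tau_n = \tau_0 + n$
$\sigma^2 \mid x \sim IG\left(\frac{\nu_n}{2}, \frac{SS_n}{2}\right)$, $\nu_n = \nu_0 + n$, $SS_n = SS_0 + SS + \frac{\tau_0 n}{\tau_n}(\bar{x} - \mu_0)^2$
Thus, $\mu, \sigma^2 \mid x \sim$ Normal-Inverse-Gamma$(\mu_n, \tau_n; \nu_n/2, SS_n/2)$.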

22 Also $\mu \mid x \sim t_{\nu_n}(\mu_n, \sigma_n^2/\tau_n)$, $\sigma_n^2 = SS_n/\nu_n$
[Note: Again $\int N(m, \sigma^2/\tau)\, IG(\nu/2, SS/2)\, d\sigma^2 = t_\nu(m, SS/(\nu\tau))$]
Comments:
  $\mu_n$: expected value for $\mu$ after seeing the data, $\mu_n = \frac{n}{\tau_n}\bar{x} + \frac{\tau_0}{\tau_n}\mu_0$, a weighted average.
  $\tau_n$: precision for estimating $\mu$ after $n$ observations.
  $\nu_n$: degrees of freedom [$\tau \sim Ga(\alpha/2, \beta/2) \Rightarrow \beta\tau \sim \chi^2_\alpha$, with $\alpha$ degrees of freedom].
  $SS_n$: posterior variation as prior variation + observed variation + variation between prior mean and sample mean.
Limiting case $\tau_0 \to 0$, $\nu_0 \to -1$ (and $SS_0 \to 0$): then $\mu \mid x \sim t_{n-1}(\bar{x}, s^2/n)$ (same as improper prior!)
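The conjugate update formulas can be collected in a small R function. This is a sketch only: `nig_update()` is a hypothetical helper name, the data are simulated, and the hyperparameter values passed in the example call are assumptions chosen for illustration.

```r
# Hypothetical helper: Normal-Inverse-Gamma update from the formulas above
nig_update <- function(y, mu0, tau0, nu0, SS0) {
  n <- length(y)
  ybar <- mean(y)
  SS <- sum((y - ybar)^2)                     # observed sum of squares
  tau_n <- tau0 + n
  mu_n <- (tau0 * mu0 + n * ybar) / tau_n     # weighted average of mu0 and ybar
  nu_n <- nu0 + n
  SS_n <- SS0 + SS + tau0 * n * (ybar - mu0)^2 / tau_n
  list(mu_n = mu_n, tau_n = tau_n, nu_n = nu_n, SS_n = SS_n)
}

set.seed(99)
y <- rnorm(13, mean = 2, sd = 0.5)            # hypothetical data (assumption)

# Informative prior with illustrative hyperparameters (assumptions)
post <- nig_update(y, mu0 = log(16), tau0 = 25, nu0 = 24, SS0 = 187.5)

# Limiting case tau0 -> 0, nu0 -> -1, SS0 -> 0 recovers the reference analysis
ref <- nig_update(y, mu0 = 0, tau0 = 0, nu0 = -1, SS0 = 0)

print(unlist(post))
print(c(scale2 = ref$SS_n / ref$nu_n, s2 = var(y)))
```

In the limiting call, `ref$mu_n` equals the sample mean and `ref$SS_n / ref$nu_n` equals the sample variance, which matches the $t_{n-1}(\bar{x}, s^2/n)$ result.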

23 Example on SPF (from Merlise Clyde)
A Sunlight Protection Factor (SPF) of 5 means an individual who can tolerate X minutes of sunlight without any sunscreen can tolerate 5X minutes with sunscreen.
Data on 13 individuals (tolerance, in min, with and without sunscreen).
Analysis should take into account pairing, which induces dependence between observations (take differences and use ratios or log(ratios) = difference in logs).
Ratios make more sense given the goals: how much longer can a person be exposed to the sun relative to their baseline.
Model: $Y = \log(\text{TRT}) - \log(\text{CONTROL}) \sim N(\mu, \sigma^2)$. Then $E(\log(\text{TRT}/\text{CONTROL})) = \mu = \log(\text{SPF})$. Interested in $\exp(\mu) = \text{SPF}$.
Summary statistics: ȳ = 1.998, s = 0.55, n = 13 [make boxplots and Q-Q normal plots to check on normality]

24 Model formulation: $Y = \log(\text{TRT}) - \log(\text{CONTROL}) \sim N(\mu, \sigma^2)$, n = 13, ȳ = 1.998, SS =
Question: $\pi(\mu \mid y_1, \dots, y_n) = ?$
Bayesian model:
  Data likelihood: $f(y_1, \dots, y_n \mid \mu, \sigma^2) = \prod_{i=1}^n N(y_i; \mu, \sigma^2)$;
  Non-informative prior: $\pi(\mu, \sigma^2) \propto 1/\sigma^2$;
  Posterior: $(\mu, \sigma^2 \mid y_1, \dots, y_n) \sim N(\bar{y}, \sigma^2/n)\, IG\left(\frac{n-1}{2}, \frac{n-1}{2}s^2\right)$
  Posterior: $\mu \mid y_1, \dots, y_n \sim t_{n-1}\left(\bar{y}, \frac{s^2}{n}\right)$;
  Prediction: $y_f \mid y_1, \dots, y_n \sim t_{n-1}\left(\bar{y}, s^2 (n+1)/n\right)$.
Coding in R: rgamma(), rnorm() and rt().

25 With non-informative prior.
Posterior: $(\mu, \sigma^2 \mid y_1, \dots, y_n) \sim N(\bar{y}, \sigma^2/n)\, IG\left(\frac{n-1}{2}, \frac{n-1}{2}s^2\right)$
Posterior: $\mu \mid y_1, \dots, y_n \sim t_{n-1}\left(\bar{y}, \frac{s^2}{n}\right)$
Define: vn = n − 1 = 12, SSn = s²(n − 1) = 0.55, mn =
Sampling from posterior:
Draw τ | Y
    tau = rgamma(10000, vn/2, rate = SSn/2)
Draw µ | τ, Y
    mu = rnorm(10000, mn, 1/sqrt(tau*n))
or draw µ | Y directly
    mu = rt(10000, vn)*sqrt(SSn/(n*vn)) + mn

26 Model formulation: $Y = \log(\text{TRT}) - \log(\text{CONTROL}) \sim N(\mu, \sigma^2)$, n = 13, ȳ = 1.998, SS =
Question: $\pi(\mu \mid y_1, \dots, y_n) = ?$
Bayesian model:
  Data likelihood: $f(y_1, \dots, y_n \mid \mu, \sigma^2) = \prod_{i=1}^n N(y_i; \mu, \sigma^2)$;
  Conjugate prior: $\mu \mid \sigma^2 \sim N\left(\mu_0, \frac{\sigma^2}{\tau_0}\right)$, $\sigma^2 \sim IG\left(\frac{\nu_0}{2}, \frac{SS_0}{2}\right)$;
  Posterior: $(\mu, \sigma^2 \mid y_1, \dots, y_n) \sim NIG(\mu_n, \tau_n; \nu_n/2, SS_n/2)$
  Posterior: $\mu \mid y_1, \dots, y_n \sim t_{\nu_n}\left(\mu_n, \frac{SS_n}{\tau_n \nu_n}\right)$;
  Prediction: $y_f \mid y_1, \dots, y_n \sim t_{\nu_n}\left(\mu_n, \frac{SS_n}{\nu_n}\,\frac{\tau_n + 1}{\tau_n}\right)$.
Coding in R: rgamma(), rnorm() and rt().

27 Expert opinions on µ:
  Best guess on median SPF is 16
  P(SPF > 64) = 0.01
  information in prior is worth 25 observations
Possible subjective prior:
  µ0 = log(16), τ0 = 25, ν0 = τ0 − 1 = 24
  P(µ < log(64)) = .99 implies SS0 =
Posterior hyperparameters: τn = 38, µn = 2.508, νn = 37, SSn =
Sampling from posterior:
Draw τ | Y
    tau = rgamma(10000, vn/2, rate = SSn/2)
Draw µ | τ, Y
    mu = rnorm(10000, mn, 1/sqrt(tau*tn))
or draw µ | Y directly
    mu = rt(10000, vn)*sqrt(SSn/(tn*vn)) + mn

28 Transform to exp(µ). Find 95% C.I. of 4.54 to

29 Predictive Distribution of future z
Posterior predictive distribution (given $x = (x_1, \dots, x_n)$):
$p(z \mid x) = \int\!\!\int p(z \mid \mu, \sigma^2, x)\, p(\mu, \sigma^2 \mid x)\, d\mu\, d\sigma^2$
[Use the assumption that z is independent of x given $\mu$ and $\sigma^2$, then integrate $\mu$ using the normal integral, then integrate $\sigma^2$ using the Gamma integral]
Reference prior: $z \mid x \sim t_{n-1}\left(\bar{x}, s^2 (n+1)/n\right)$
Conjugate prior: $z \mid x \sim t_{\nu_n}\left(\mu_n, \sigma_n^2 (\tau_n + 1)/\tau_n\right)$, $\sigma_n^2 = SS_n/\nu_n$
[Can use the normal trick to integrate $\mu$: if $z \sim N(\mu, \sigma^2)$ and $\mu \sim N(\mu_0, \sigma^2/\tau_0)$ then $y = \frac{z - \mu}{\sigma} \sim N(0, 1)$, that is $z \stackrel{d}{=} \sigma y + \mu$ and therefore $z \mid \sigma^2 \sim N\left(\mu_0, \sigma^2\left(1 + \frac{1}{\tau_0}\right)\right)$, since a linear combination of independent normals is normal with added means and variances.]
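The reference-prior predictive result can be checked by simulation: draw $(\mu, \sigma^2)$ from the joint posterior, then draw $z$ given each pair, and compare with the closed-form t quantiles. The data below are simulated and are an assumption made only for illustration.

```r
set.seed(11)
x <- rnorm(13, mean = 2, sd = 0.5)     # hypothetical data (assumption)
n <- length(x)
xbar <- mean(x)
s2 <- var(x)

n_draws <- 1e5

# Joint posterior draws under the reference prior
sigma2 <- 1 / rgamma(n_draws, shape = (n - 1) / 2, rate = (n - 1) * s2 / 2)
mu <- rnorm(n_draws, mean = xbar, sd = sqrt(sigma2 / n))

# One future observation per posterior draw
z <- rnorm(n_draws, mean = mu, sd = sqrt(sigma2))

# Compare simulated quartiles with t_{n-1}(xbar, s^2 (n + 1) / n)
probs <- c(0.25, 0.5, 0.75)
q_mc <- quantile(z, probs)
q_exact <- xbar + qt(probs, df = n - 1) * sqrt(s2 * (n + 1) / n)

print(rbind(monte_carlo = q_mc, exact = q_exact))
```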

30 Prior predictive distribution: what we expect the distribution to be before we observe the data,
$p(z) = \int\!\!\int p(z \mid \mu, \sigma^2)\, \pi(\mu, \sigma^2)\, d\mu\, d\sigma^2$
$z \sim t_{\nu_0}\left(\mu_0, \frac{SS_0}{\nu_0}\left(1 + \frac{1}{\tau_0}\right)\right)$ [as above]
[$\int\!\!\int N(\mu, \sigma^2)\, N(\mu_0, \sigma^2/\tau_0)\, IG(\nu/2, SS/2)\, d\mu\, d\sigma^2 = t_\nu\left(\mu_0, \frac{SS}{\nu}\left(1 + \frac{1}{\tau_0}\right)\right)$]
Note: This is what we used in the example to specify our subjective prior.

31 Back to example
Prior predictive distribution: $z \sim t_{24}\left(\log(16), (1 + 1/25) \cdot 187.5/24\right)$
Posterior predictive distribution: $z \sim t_{37}\left(2.5, 5.3(1 + 1/38)\right)$
    Y = rt(10000, 24)*sqrt((1 + 1/25)*187.5/24) + log(16)
    quantile(exp(Y))
    0% 25% 50% 75% 100%
    4.57e
Sampling from posterior predictive leads to 50% C.I. (0.0003,1.4) - with sunscreen, 50% chance that next individual can be exposed from 0 to 1 times longer than without sunscreen.

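32 Semi-conjugate prior
A semi-conjugate setting is obtained with independent priors $\pi(\mu, \sigma^2) = \pi(\mu)\,\pi(\sigma^2)$
$\mu \sim N(\mu_0, \sigma_0^2)$, $\sigma^2 \sim IG\left(\frac{\nu_0}{2}, \frac{SS_0}{2}\right)$
then
$\mu \mid \sigma^2, x \sim N(\mu_n, \tau_n^2)$, $\mu_n = \frac{\frac{\mu_0}{\sigma_0^2} + \bar{x}\frac{n}{\sigma^2}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}$, $\tau_n^2 = \frac{1}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}$
$\sigma^2 \mid x$ not in closed form
We will solve this with MCMC methods!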
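One common MCMC approach for this setting is a Gibbs sampler that alternates between the two full conditionals. The sketch below is an illustration, not the course's implementation: the conditional for $\mu$ is the one on the slide, while the conditional $\sigma^2 \mid \mu, x \sim IG\left(\frac{\nu_0 + n}{2}, \frac{SS_0 + \sum_i (x_i - \mu)^2}{2}\right)$ is the standard result for an inverse-gamma prior and is not stated in the slides. The data, hyperparameters, chain length and burn-in are all assumptions.

```r
set.seed(2018)
x <- rnorm(13, mean = 2, sd = 0.5)     # hypothetical data (assumption)
n <- length(x)
xbar <- mean(x)

# Hypothetical hyperparameters for the independent priors (assumptions)
mu0 <- 0; sigma0_2 <- 100              # mu ~ N(mu0, sigma0_2)
nu0 <- 1; SS0 <- 1                     # sigma^2 ~ IG(nu0/2, SS0/2)

n_iter <- 5000
mu <- numeric(n_iter)
sigma2 <- numeric(n_iter)
sigma2_cur <- var(x)                   # starting value

for (i in seq_len(n_iter)) {
  # mu | sigma^2, x ~ N(mu_n, tau_n^2), as on the slide
  prec <- 1 / sigma0_2 + n / sigma2_cur
  mu_n <- (mu0 / sigma0_2 + n * xbar / sigma2_cur) / prec
  mu_cur <- rnorm(1, mean = mu_n, sd = sqrt(1 / prec))

  # sigma^2 | mu, x ~ IG((nu0 + n)/2, (SS0 + sum((x - mu)^2))/2)
  sigma2_cur <- 1 / rgamma(1, shape = (nu0 + n) / 2,
                           rate = (SS0 + sum((x - mu_cur)^2)) / 2)

  mu[i] <- mu_cur
  sigma2[i] <- sigma2_cur
}

keep <- -(1:500)                       # discard the first 500 draws as burn-in
print(c(mu = mean(mu[keep]), sigma2 = mean(sigma2[keep])))
```

With a prior this diffuse on $\mu$, the posterior mean of $\mu$ should sit very close to the sample mean, which gives a simple sanity check on the sampler.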

33 Summary of Conjugate Priors for the Normal Model
Conjugate priors for normal data with unknown precision are
$\tau \sim \text{Gamma}(a, b)$, $\mu \mid \tau \sim N\left(\mu_0, \frac{1}{\tau_0 \tau}\right)$
Here $a$, $b$, $\mu_0$, and $\tau_0$ are known hyper-parameters chosen to characterize the prior information.
The problem with using this prior in practical data analysis is the difficulty of specifying a distribution for $\mu$ that is conditional on $\tau$ (which is also unknown).

34 Summary of Independence prior
Here we assume that information about $\mu$ can be elicited independently of information on $\tau$ or $\sigma$, so
$p(\mu, \tau) = p(\mu)\, p(\tau)$
This makes elicitation relatively easy.
Although the primary goal is to get a prior that reasonably captures the expert's information, independence priors generally work well.
Usually, one considers Gamma priors for $\tau$, since they are conjugate. But there's really no need, as long as the prior is defined on the positive real line.

35 Proper (semi-conjugate) Reference Priors
More recently, priors such as
$\mu \sim N(0, b)$, $\tau \sim \text{Gamma}(c, c)$
have been used as proper reference priors.
In this case, $b$ and $c$ are chosen so that the prior precision for $\mu$, $1/b$, and both hyperparameters $c$ in the Gamma distribution are near zero.
Such priors are seen as approximations of the $p(\mu, \tau) \propto 1/\tau$ improper default prior.
Common choices are $b = 10^6$ and c =

36 Back to the example
We need to identify a prior distribution that gives information/no-information about the unknown parameters $\mu$ and $\tau = 1/\sigma^2$.
$\mu \sim N(0, 10^6)$ as proper non-informative prior.
Expert opinion that $\mu$ should be centered at 16. Then, $\mu \sim N(16, 10^6)$ as diffuse prior.
Expert 95% certain that the mean SPF $\mu$ should be between 10 and 75, that is, $Pr(10 < \mu < 75) = 0.95$. Then µ ~ N(10, )

37 Back to the example
We have no good information on $\sigma^2$, the variance of an observation.
So we can specify a reference (vague) prior on $\tau$, which is independent of $\mu$:
$\tau \sim \text{Gamma}(0.001, 0.001)$
