CS340 Machine Learning: Bayesian statistics 3
Outline

- Conjugate analysis of $\mu$ and $\sigma^2$
- Bayesian model selection
- Summarizing the posterior
Unknown mean and precision

The likelihood function is
$$p(D|\mu,\lambda) = \frac{\lambda^{n/2}}{(2\pi)^{n/2}} \exp\left(-\frac{\lambda}{2}\sum_{i=1}^n (x_i-\mu)^2\right) = \frac{\lambda^{n/2}}{(2\pi)^{n/2}} \exp\left(-\frac{\lambda}{2}\left[n(\mu-\bar{x})^2 + \sum_{i=1}^n (x_i-\bar{x})^2\right]\right)$$
The natural conjugate prior is the normal-gamma:
$$p(\mu,\lambda) = NG(\mu,\lambda|\mu_0,\kappa_0,\alpha_0,\beta_0) \stackrel{\mathrm{def}}{=} N(\mu|\mu_0,(\kappa_0\lambda)^{-1})\, Ga(\lambda|\alpha_0,\mathrm{rate}=\beta_0) = \frac{1}{Z_{NG}}\,\lambda^{\alpha_0-\frac{1}{2}}\exp\left(-\frac{\lambda}{2}\left[\kappa_0(\mu-\mu_0)^2 + 2\beta_0\right]\right)$$
Gamma distribution

Used for positive reals.
$$Ga(x|\mathrm{shape}=a,\mathrm{rate}=b) = \frac{b^a}{\Gamma(a)}\,x^{a-1}e^{-xb}, \quad x,a,b>0$$
$$Ga(x|\mathrm{shape}=\alpha,\mathrm{scale}=\beta) = \frac{1}{\beta^\alpha\Gamma(\alpha)}\,x^{\alpha-1}e^{-x/\beta}$$
[Figures: gamma densities, from Bishop and Matlab]
Posterior is also NG

Just update the hyper-parameters:
$$p(\mu,\lambda|D) = NG(\mu,\lambda|\mu_n,\kappa_n,\alpha_n,\beta_n)$$
$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0+n}, \quad \kappa_n = \kappa_0+n, \quad \alpha_n = \alpha_0 + n/2$$
$$\beta_n = \beta_0 + \frac{1}{2}\sum_{i=1}^n (x_i-\bar{x})^2 + \frac{\kappa_0\, n\,(\bar{x}-\mu_0)^2}{2(\kappa_0+n)}$$
Derivation of this result not on exam.
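The update equations above can be coded directly. A minimal Python sketch (the data and prior values below are made-up illustrations, not from the slides):

```python
def ng_posterior(x, mu0, kappa0, alpha0, beta0):
    """Conjugate Normal-Gamma update: returns (mu_n, kappa_n, alpha_n, beta_n)."""
    n = len(x)
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)  # sum of squared deviations from the mean
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

# illustrative data and a weak prior centred at 0
mu_n, kappa_n, alpha_n, beta_n = ng_posterior(
    [1.0, 2.0, 3.0], mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0)
```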
Posterior marginals

Precision: $p(\lambda|D) = Ga(\lambda|\alpha_n,\beta_n)$
Mean: $p(\mu|D) = t_{2\alpha_n}\!\left(\mu\,\middle|\,\mu_n, \frac{\beta_n}{\alpha_n\kappa_n}\right)$, a Student t distribution.
Derivation of this result not on exam.
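One way to check the t marginal for $\mu$ is by simulation: draw $\lambda$ from its Gamma posterior, then $\mu \sim N(\mu_n,(\kappa_n\lambda)^{-1})$, and compare the sample moments with those of the stated t distribution. A rough stdlib-only sketch (the hyper-parameter values are arbitrary illustrations):

```python
import math
import random

random.seed(0)
mu_n, kappa_n, alpha_n, beta_n = 1.5, 4.0, 2.5, 3.5  # example posterior hyper-parameters

mus = []
for _ in range(200_000):
    # random.gammavariate takes shape and SCALE, so scale = 1/rate = 1/beta_n
    lam = random.gammavariate(alpha_n, 1.0 / beta_n)
    mus.append(random.gauss(mu_n, 1.0 / math.sqrt(kappa_n * lam)))

m = sum(mus) / len(mus)
v = sum((u - m) ** 2 for u in mus) / len(mus)

# theory: mu | D ~ t_{2 alpha_n}(mu_n, beta_n / (alpha_n kappa_n))
df = 2 * alpha_n
scale2 = beta_n / (alpha_n * kappa_n)
t_var = scale2 * df / (df - 2)  # variance of a t distribution, valid for df > 2
```

The sample mean should match $\mu_n$ and the sample variance should match the t variance to within Monte Carlo error.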
Student t distribution
$$t_\nu(x|\mu,\sigma^2) \propto \left[1 + \frac{1}{\nu}\left(\frac{x-\mu}{\sigma}\right)^2\right]^{-\frac{\nu+1}{2}}$$
Approaches a Gaussian as $\nu \to \infty$.
Robustness of t distribution

The Student t is less affected by outliers than the Gaussian. [Figure: Bishop 2.16]
Posterior predictive distribution

Also a t distribution (fatter tails than a Gaussian, due to the uncertainty in $\lambda$):
$$p(x|D) = t_{2\alpha_n}\!\left(x\,\middle|\,\mu_n, \frac{\beta_n(\kappa_n+1)}{\alpha_n\kappa_n}\right)$$
Derivation of this result not on exam.
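The predictive density is easy to evaluate directly; note the second argument is the $\sigma^2$ parameter of the t, not its variance. A sketch with illustrative hyper-parameter values, checking numerically that the density integrates to 1:

```python
import math

def t_pdf(x, df, loc, scale2):
    """Density of t_df(x | loc, scale2) in the location/scale parameterization."""
    c = math.gamma((df + 1) / 2) / (math.gamma(df / 2) * math.sqrt(df * math.pi * scale2))
    return c * (1 + (x - loc) ** 2 / (df * scale2)) ** (-(df + 1) / 2)

# illustrative posterior hyper-parameters
mu_n, kappa_n, alpha_n, beta_n = 1.5, 4.0, 2.5, 3.5
df = 2 * alpha_n
scale2 = beta_n * (kappa_n + 1) / (alpha_n * kappa_n)  # predictive scale from the slide

# crude Riemann-sum check that p(x|D) integrates to (approximately) 1
h = 0.01
total = sum(t_pdf(mu_n + k * h, df, mu_n, scale2) for k in range(-5000, 5001)) * h
```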
Uninformative prior

It can be shown (see handout) that an uninformative prior has the form
$$p(\mu,\lambda) \propto \frac{1}{\lambda}$$
This can be emulated using the following hyper-parameters: $\kappa_0 = 0$, $\alpha_0 = -\frac{1}{2}$, $\beta_0 = 0$. This prior is improper (does not integrate to 1), but the posterior is proper if $n \geq 1$. Derivation of this result not on exam.
Outline

- Conjugate analysis of $\mu$ and $\sigma^2$
- Bayesian model selection
- Summarizing the posterior
Bayesian model selection

Suppose we have K possible models, each with parameters $\theta_i$. The posterior over models is defined using the marginal likelihood ("evidence") $p(D|M=i)$, which is the normalizing constant of the posterior over parameters:
$$p(M=i|D) = \frac{p(M=i)\,p(D|M=i)}{p(D)}$$
$$p(D|M=i) = \int p(D|\theta,M=i)\,p(\theta|M=i)\,d\theta$$
$$p(\theta|D,M=i) = \frac{p(D|\theta,M=i)\,p(\theta|M=i)}{p(D|M=i)}$$
Bayes factors

To compare two models, use the posterior odds:
$$O_{ij} = \frac{p(M_i|D)}{p(M_j|D)} = \underbrace{\frac{p(D|M_i)}{p(D|M_j)}}_{\text{Bayes factor}}\;\underbrace{\frac{p(M_i)}{p(M_j)}}_{\text{prior odds}}$$
The Bayes factor BF(i,j) is a Bayesian version of a likelihood ratio test that can be used to compare models of different complexity.
Marginal likelihood for Beta-Bernoulli

Since we know $p(\theta|D) = \mathrm{Be}(\theta|\alpha_1+N_1,\alpha_0+N_0)$:
$$p(\theta|D) = \frac{p(\theta)\,p(D|\theta)}{p(D)} = \frac{1}{p(D)}\left[\frac{1}{B(\alpha_1,\alpha_0)}\theta^{\alpha_1-1}(1-\theta)^{\alpha_0-1}\right]\left[\theta^{N_1}(1-\theta)^{N_0}\right] = \frac{\theta^{\alpha_1+N_1-1}(1-\theta)^{\alpha_0+N_0-1}}{B(\alpha_1+N_1,\alpha_0+N_0)}$$
Hence the marginal likelihood is a ratio of normalizing constants:
$$p(D) = \int p(D|\theta)\,p(\theta)\,d\theta = \frac{B(\alpha_1+N_1,\alpha_0+N_0)}{B(\alpha_1,\alpha_0)}$$
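This identity can be verified numerically for a small example: integrate the likelihood times the prior on a grid and compare to the Beta-function ratio. A sketch (the toy counts and uniform prior here are made-up illustrations):

```python
import math

def beta_fn(a, b):
    """Beta function B(a,b) = Gamma(a) Gamma(b) / Gamma(a+b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

a1, a0 = 1.0, 1.0      # Beta(1,1) = uniform prior
N1, N0 = 3, 2          # toy data: 3 heads, 2 tails

# closed form from the slide: p(D) = B(a1+N1, a0+N0) / B(a1, a0)
p_d_exact = beta_fn(a1 + N1, a0 + N0) / beta_fn(a1, a0)

# numerical check: midpoint rule over theta in (0,1)
h = 1e-5
prior_norm = beta_fn(a1, a0)
p_d_numeric = h * sum(
    (t ** N1) * ((1 - t) ** N0)
    * (t ** (a1 - 1)) * ((1 - t) ** (a0 - 1)) / prior_norm
    for t in (h * (k + 0.5) for k in range(int(1 / h)))
)
```

For these counts the closed form gives $B(4,3)/B(1,1) = 1/60$, and the grid integral should agree closely.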
Example: is the Euro coin biased?

Suppose we toss a coin $N=250$ times and observe $N_1=141$ heads and $N_0=109$ tails. Consider two hypotheses: $H_0$, that $\theta=0.5$, and $H_1$, that $\theta \neq 0.5$. Actually, we can let $H_1$ be $p(\theta)=U(0,1)$, since $p(\theta=0.5|H_1)=0$ (pdf).
For $H_0$, the marginal likelihood is $p(D|H_0)=0.5^N$.
For $H_1$, the marginal likelihood is
$$p(D|H_1) = \int_0^1 p(D|\theta,H_1)\,p(\theta|H_1)\,d\theta = \frac{B(\alpha_1+N_1,\alpha_0+N_0)}{B(\alpha_1,\alpha_0)}$$
Hence the Bayes factor is
$$BF(1,0) = \frac{p(D|H_1)}{p(D|H_0)} = \frac{B(\alpha_1+N_1,\alpha_0+N_0)}{B(\alpha_1,\alpha_0)}\cdot\frac{1}{0.5^N}$$
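With a uniform Beta(1,1) prior this is a one-liner, using log-Gamma functions to avoid overflow (a sketch; the counts come from the slide, the prior choice is one point on the next slide's curve):

```python
import math

def log_beta(a, b):
    """log B(a,b), computed stably via lgamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

N1, N0 = 141, 109
N = N1 + N0
a1, a0 = 1.0, 1.0  # uniform prior

log_bf = (log_beta(a1 + N1, a0 + N0) - log_beta(a1, a0)) - N * math.log(0.5)
bf = math.exp(log_bf)
```

With this particular prior the Bayes factor comes out a little below 1, i.e. the data do not favor the biased-coin hypothesis; tuning the prior strength helps only marginally, as the next slide shows.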
Bayes factor vs prior strength

Let $\alpha_1=\alpha_0$ range from 0 to 1000. The largest BF in favor of $H_1$ (biased coin) is only 2.0, which is very weak evidence of bias. [Figure: BF(1,0) vs $\alpha$]
Bayesian Occam's razor

The use of the marginal likelihood $p(D|M)$ automatically penalizes overly complex models, since they spread their probability mass very widely (predict that everything is possible), so the probability of the actual data is small. [Figure: Bishop 3.13 — a model that is too simple cannot predict $D$; one that is too complex can predict everything; the "just right" model assigns $D$ the highest probability]
Bayesian Occam's razor for biased coin

Blue line: $p(D|H_0) = 0.5^N$. Red curve: $p(D|H_1) = \int p(D|\theta)\,\mathrm{Beta}(\theta|1,1)\,d\theta$.
If we observe 2 or 3 heads out of 5, the simpler model is more likely. If we have already observed 4 heads, it is much more likely to observe a 5th head than a tail, since $\theta$ gets updated sequentially. [Figure: marginal likelihood of the biased-coin model vs number of heads out of 5]
Bayesian Information Criterion (BIC)

If we make a Gaussian approximation to $p(\theta|D)$ (Laplace approximation), and approximate the determinant of the Hessian as $|H| \approx N^d$, the log marginal likelihood becomes
$$\log p(D) \approx \log p(D|\hat\theta_{ML}) - \frac{d}{2}\log N$$
Here $d$ is the dimensionality (number of free parameters). AIC (Akaike information criterion) is defined as
$$\log p(D) \approx \log p(D|\hat\theta_{ML}) - d$$
Such penalized log-likelihoods can be used for model selection instead of cross-validation.
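For the coin example the quality of the BIC approximation can be checked against the exact log evidence, since the Beta-Bernoulli marginal likelihood is available in closed form (a sketch; $d=1$ free parameter, Beta(1,1) prior assumed):

```python
import math

N1, N0 = 141, 109
N = N1 + N0
theta_ml = N1 / N

# maximized log likelihood log p(D | theta_ML)
loglik = N1 * math.log(theta_ml) + N0 * math.log(1 - theta_ml)

d = 1  # one free parameter (theta)
bic = loglik - 0.5 * d * math.log(N)
aic = loglik - d

# exact log marginal likelihood: log [B(1+N1, 1+N0) / B(1,1)]
log_evidence = (math.lgamma(1 + N1) + math.lgamma(1 + N0) - math.lgamma(2 + N)
                - (math.lgamma(1.0) + math.lgamma(1.0) - math.lgamma(2.0)))
```

The BIC score should land close to the exact log evidence, illustrating that it is an approximation to the log marginal likelihood rather than an arbitrary penalty.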
Outline

- Conjugate analysis of $\mu$ and $\sigma^2$
- Bayesian model selection
- Summarizing the posterior
Summarizing the posterior

If $p(\theta|D)$ is too complex to plot, we can compute various summary statistics, such as the posterior mean, mode and median:
$$\hat\theta_{mean} = E[\theta|D], \quad \hat\theta_{MAP} = \arg\max_\theta p(\theta|D), \quad \hat\theta_{median} = t \text{ such that } p(\theta>t|D)=0.5$$
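For a Beta posterior the mean and mode have simple closed forms, while the median does not; the sketch below approximates it by sampling. The Beta(48,54) posterior corresponds to the coin example on a later slide (47 heads in 100 trials with a Beta(1,1) prior):

```python
import random

random.seed(0)
a, b = 48.0, 54.0  # Beta posterior from 47 heads, 53 tails with a Beta(1,1) prior

post_mean = a / (a + b)            # E[theta | D]
post_mode = (a - 1) / (a + b - 2)  # argmax of the Beta(a,b) density (valid for a,b > 1)

# Monte Carlo approximation to the posterior median
samples = sorted(random.betavariate(a, b) for _ in range(100_000))
post_median = samples[len(samples) // 2]
```

For this posterior all three summaries are close to 0.47, as expected for a nearly symmetric density.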
Bayesian credible intervals

We can represent our uncertainty using a posterior credible interval $(l,u)$ such that $p(l \leq \theta \leq u|D) = 1-\alpha$. For a central interval we set
$$l = F^{-1}(\alpha/2), \quad u = F^{-1}(1-\alpha/2)$$
where $F$ is the cdf of the posterior.
Example

We see 47 heads out of 100 trials. Using a Beta(1,1) prior, what is the 95% credible interval for the probability of heads?

S = 47; N = 100;
a = S+1; b = (N-S)+1;
alpha = 0.05;
l = betainv(alpha/2, a, b);
u = betainv(1-alpha/2, a, b);
CI = [l, u]   % 0.3749  0.5673
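The snippet above is Matlab (betainv is the inverse Beta cdf). A rough stdlib-Python equivalent, approximating the quantiles by Monte Carlo instead of an exact inverse cdf, might look like:

```python
import random

random.seed(0)
S, N = 47, 100
a, b = S + 1, (N - S) + 1  # Beta(1,1) prior -> Beta(48,54) posterior
alpha = 0.05

# sample from the posterior and take empirical quantiles
samples = sorted(random.betavariate(a, b) for _ in range(200_000))
l = samples[int(alpha / 2 * len(samples))]
u = samples[int((1 - alpha / 2) * len(samples))]
```

This should reproduce the Matlab interval [0.3749, 0.5673] to within Monte Carlo error.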
Posterior sampling

If $\theta$ is high-dimensional, it is hard to visualize $p(\theta|D)$. A common strategy is to draw typical values $\theta^s \sim p(\theta|D)$ and analyze the resulting samples. E.g., we can generate fake data $x^s \sim p(x|\theta^s)$ to see if it looks like the real data (a simple kind of posterior predictive check of model adequacy). See handout for some examples.
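For the coin example, such a posterior predictive check might look like this: repeatedly draw $\theta^s$ from the Beta posterior, simulate a fake dataset of $N$ tosses, and see where the observed head count falls among the replicates (a sketch; the variable names are made up):

```python
import random

random.seed(0)
S, N = 47, 100
a, b = S + 1, (N - S) + 1  # Beta posterior from a Beta(1,1) prior

fake_counts = []
for _ in range(1000):
    theta_s = random.betavariate(a, b)                       # theta^s ~ p(theta | D)
    fake = sum(random.random() < theta_s for _ in range(N))  # x^s ~ p(x | theta^s)
    fake_counts.append(fake)

# fraction of replicated datasets with at least as many heads as observed;
# a value near 0 or 1 would signal model misfit
p_value = sum(c >= S for c in fake_counts) / len(fake_counts)
```

Here the observed count sits comfortably in the middle of the replicate distribution, so the model passes this simple check.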