START HERE: Instructions. 1 Exponential Family [Zhou, Manzil]

George Patterson
5 years ago
START HERE: Instructions

Thanks a lot to John A.W.B. Constanzo and Shi Zong for providing the LaTeX source files and allowing their use for quick preparation of the HW solution.

The homework was due at 9:00am on Feb 3, 05. Anything that is received after that time will not be considered.

Answers to every theory question will also be submitted electronically on Autolab (PDF: LaTeX or handwritten and scanned). Make sure you prepare the answers to each question separately.

Collaboration on solving the homework is allowed after you have thought about the problems on your own. However, when you do collaborate, you should list your collaborators! You might also have gotten some inspiration from resources (books, online material, etc.). This might be OK only after you have tried to solve the problem and couldn't. In such a case, you should cite your resources.

If you do collaborate with someone or use a book or website, you are expected to write up your solution independently. That is, close the book and all of your notes before starting to write up your solution.

LaTeX source of this homework: hw5_latex.tar

1 Exponential Family [Zhou, Manzil]

In this problem we will review the exponential family and its significance in Bayesian statistics, and work out a detailed example for the commonly encountered Multivariate Normal distribution and its conjugate prior, the Normal Inverse Wishart distribution.

1.1 Review

The exponential family is a set of probability distributions whose probability density function for $x \in \mathbb{R}^d$ can be expressed in the form

$$p(x \mid \theta) = \exp\left( \langle \phi(x), \theta \rangle - \mathbf{1}^T g(\theta) \right) \tag{1}$$

where $\phi(x)$ is a sufficient statistic of the distribution. For exponential families, the sufficient statistic is a function of the data that fully summarizes the data $x$ within the density function.
The sufficient statistic of a set of independent identically distributed observations is simply the sum of the individual sufficient statistics. It encapsulates all the information needed to describe the posterior distribution of the parameters given the data, and hence to derive any desired estimate of the parameters. We will explore this important property in detail below.

$\theta$ is called the natural parameter. The set of values of $\theta$ for which the function $p(x \mid \theta)$ is finite is called the natural parameter space. It can be shown that the natural parameter space is always convex. (First show that the log-partition function $g(\theta)$ is a convex function; then you can show this from first principles.)

$g(\theta)$ is called the log-partition function because it is the logarithm of a normalization factor, without which $p(x \mid \theta)$ would not be a probability distribution ("partition function" is often used as a synonym of "normalization factor" for historical reasons arising from Statistical Physics).

1.2 Conjugate Priors [0+0+0]

Exponential families are very important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. In the case of a likelihood which belongs to the exponential family there always exists a conjugate prior, which is also in the exponential family. Consider the distribution:

$$p(\theta; m_0, \phi_0) = \exp\left( \langle \phi_0, \theta \rangle - \langle m_0, g(\theta) \rangle - h(m_0, \phi_0) \right) \tag{2}$$

where $m_0 > 0$ and $\phi_0 \in \mathbb{R}^d$. These are called hyperparameters (parameters controlling parameters).

Question 1 Show that this distribution, i.e. (2), is a member of the Exponential Family.

There is not much to show. Note

$$p(\theta; m_0, \phi_0) = \exp\left( \left\langle \begin{bmatrix} \theta \\ -g(\theta) \end{bmatrix}, \begin{bmatrix} \phi_0 \\ m_0 \end{bmatrix} \right\rangle - h(m_0, \phi_0) \right) \tag{3}$$

The sufficient statistic is $\begin{bmatrix} \theta \\ -g(\theta) \end{bmatrix}$.

The natural parameter is $\begin{bmatrix} \phi_0 \\ m_0 \end{bmatrix}$.

The log-partition function is $h(m_0, \phi_0)$. There exist infinitely many ways of splitting $h$ into a vector $\hat{g}$ for which $h = \mathbf{1}^T \hat{g}$.

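Suppose we obtain the data $X = \{x_1, \dots, x_n\}$, where $x_i \sim p(x \mid \theta)$, i.e. each single observation follows some distribution from the exponential family.

Question 2 First of all write out the likelihood $p(X \mid \theta)$. Then use the prior (2), $p(\theta; m_0, \phi_0)$, and derive the posterior $p(\theta \mid X)$ exactly, i.e. with proper normalization constant.

The likelihood turns out to be simply

$$\begin{aligned}
p(X \mid \theta) &= \prod_{i=1}^{n} p(x_i \mid \theta) = \prod_{i=1}^{n} \exp\left( \langle \phi(x_i), \theta \rangle - \mathbf{1}^T g(\theta) \right) \\
&= \exp\left( \sum_{i=1}^{n} \langle \phi(x_i), \theta \rangle - \sum_{i=1}^{n} \mathbf{1}^T g(\theta) \right) \\
&= \exp\left( \left\langle \sum_{i=1}^{n} \phi(x_i), \theta \right\rangle - n \mathbf{1}^T g(\theta) \right)
\end{aligned} \tag{4}$$

Now observe that $h$ is defined so that for all $(x, y)$ in the hyperparameter space,

$$\int \exp\left( \left\langle \begin{bmatrix} x \\ y \end{bmatrix}, \begin{bmatrix} \theta \\ -g(\theta) \end{bmatrix} \right\rangle - h(y, x) \right) d\theta = 1. \tag{5}$$

Keeping this in mind, we proceed to compute the posterior as

$$\begin{aligned}
p(\theta \mid X) &\propto p(X \mid \theta)\, p(\theta; m_0, \phi_0) \\
&\propto \exp\left( \left\langle \sum_{i=1}^{n} \phi(x_i), \theta \right\rangle - n \mathbf{1}^T g(\theta) + \left\langle \begin{bmatrix} \phi_0 \\ m_0 \end{bmatrix}, \begin{bmatrix} \theta \\ -g(\theta) \end{bmatrix} \right\rangle \right) \\
&= \exp\left( \left\langle \begin{bmatrix} \phi_0 + \sum_{i=1}^{n} \phi(x_i) \\ m_0 + n\mathbf{1} \end{bmatrix}, \begin{bmatrix} \theta \\ -g(\theta) \end{bmatrix} \right\rangle \right).
\end{aligned} \tag{6}$$

By (5), normalizing yields

$$p(\theta \mid X) = \exp\left[ \left\langle \begin{bmatrix} \phi_0 + \sum_{i=1}^{n} \phi(x_i) \\ m_0 + n\mathbf{1} \end{bmatrix}, \begin{bmatrix} \theta \\ -g(\theta) \end{bmatrix} \right\rangle - h\left( m_0 + n\mathbf{1},\; \phi_0 + \sum_{i=1}^{n} \phi(x_i) \right) \right]. \tag{7}$$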

If you got Question 2 correct (hopefully you did), observe that the posterior has the same form as the prior; thus (2) is a conjugate prior. The difference between the prior, i.e. (2), and your answer to Question 2 lies only in the parameters.

Question 3 Let $m_n$ and $\phi_n$ be the parameters of the posterior $p(\theta \mid X)$. Show that:

$$m_n = m_0 + n\mathbf{1}, \qquad \phi_n = \phi_0 + \sum_{i=1}^{n} \phi(x_i) \tag{8}$$

We call these the update equations.

This is obvious from (7). Specifically, comparing equation (7) with the prior distribution (2), we can see that they possess the same form and differ only in the parameters:

$$m_n = m_0 + n\mathbf{1}, \qquad \phi_n = \phi_0 + \sum_{i=1}^{n} \phi(x_i) \tag{9}$$

This shows that the update equations can be written simply in terms of the number of data points and the sufficient statistic of the data. It also gives meaning to the hyperparameters. In particular, $m_0$ corresponds to the effective number of "fake" observations that the prior distribution contributes, and $\phi_0$ corresponds to the total amount that these fake observations contribute to the sufficient statistic over all observations and fake observations.
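The update equations can be sketched in a few lines of Python. This is a minimal illustration, not part of the original solution: the function name `posterior_hyperparameters` and the scalar example statistic `gaussian_phi` (returning $(x, x^2)$) are assumptions made here for the demonstration.

```python
from typing import Callable, List, Sequence, Tuple


def posterior_hyperparameters(
    m0: Sequence[float],
    phi0: Sequence[float],
    data: Sequence[float],
    phi: Callable[[float], Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Apply m_n = m_0 + n*1 and phi_n = phi_0 + sum_i phi(x_i)."""
    n = len(data)
    m_n = [m + n for m in m0]      # n is added to every component of m_0
    phi_n = list(phi0)
    for x in data:
        for k, s in enumerate(phi(x)):
            phi_n[k] += s          # accumulate the sufficient statistic
    return m_n, phi_n


def gaussian_phi(x: float) -> Tuple[float, float]:
    """Hypothetical scalar sufficient statistic (x, x^2), for illustration only."""
    return (x, x * x)


m_n, phi_n = posterior_hyperparameters(
    [1.0, 4.0], [0.0, 1.0], [1.0, 2.0, 3.0], gaussian_phi
)
print(m_n, phi_n)
```

Only the count of observations and the summed sufficient statistic enter the update, which is the point made by the update equations.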

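1.3 Multivariate Normal Distribution [0+0+0]

The Multivariate Normal $\mathcal{N}(\mu, \Sigma)$ is a distribution that is encountered very often. The distribution is given by:

$$p(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) \tag{10}$$

where $\mu \in \mathbb{R}^d$ and $\Sigma \succ 0$ is a symmetric positive definite $d \times d$ matrix. We claim that it belongs to the Exponential Family.

Question 4 Identify the natural parameters $\theta$ in terms of $\mu$ and $\Sigma$. Also derive the sufficient statistics $\phi(x)$ and log-partition function $g(\theta)$ in terms of $\mu$ and $\Sigma$. Hint: Design a two dimensional $g(\theta)$, where the first dimension is $\frac{1}{2} \mu^T \Sigma^{-1} \mu$.

We will use the notation $\langle A, B \rangle = \langle \mathrm{vec}(A), \mathrm{vec}(B) \rangle = \mathrm{tr}(A^T B)$ when $A$ and $B$ are matrices. Note that

$$x^T \Sigma^{-1} x = \mathrm{tr}(x^T \Sigma^{-1} x) = \mathrm{tr}(x x^T \Sigma^{-1}) = \langle x x^T, \Sigma^{-1} \rangle. \tag{11}$$

Now

$$\begin{aligned}
p(x \mid \mu, \Sigma) &= \exp\left( -\frac{1}{2} x^T \Sigma^{-1} x + x^T \Sigma^{-1} \mu - \frac{1}{2} \mu^T \Sigma^{-1} \mu - \frac{1}{2} \log\left[ (2\pi)^d |\Sigma| \right] \right) \\
&= \exp\left( -\frac{1}{2} \langle x x^T, \Sigma^{-1} \rangle + \langle x, \Sigma^{-1} \mu \rangle - \mathbf{1}^T \begin{bmatrix} \frac{1}{2} \mu^T \Sigma^{-1} \mu \\ \frac{1}{2} \log\left[ (2\pi)^d |\Sigma| \right] \end{bmatrix} \right) \\
&= \exp\left( \left\langle \begin{bmatrix} x \\ x x^T \end{bmatrix}, \begin{bmatrix} \Sigma^{-1} \mu \\ -\frac{1}{2} \Sigma^{-1} \end{bmatrix} \right\rangle - \mathbf{1}^T \begin{bmatrix} \frac{1}{2} \mu^T \Sigma^{-1} \mu \\ \frac{1}{2} \log\left[ (2\pi)^d |\Sigma| \right] \end{bmatrix} \right).
\end{aligned} \tag{12}$$

The natural parameters, sufficient statistics and log-partition function are obtained by inspection.

The natural parameter is
$$\theta = \begin{bmatrix} \Sigma^{-1} \mu \\ -\frac{1}{2} \Sigma^{-1} \end{bmatrix}.$$

The sufficient statistic is
$$\phi(x) = \begin{bmatrix} x \\ x x^T \end{bmatrix}.$$

The log-partition function is
$$g(\theta) = \begin{bmatrix} \frac{1}{2} \mu^T \Sigma^{-1} \mu \\ \frac{1}{2} \left( d \log(2\pi) + \log |\Sigma| \right) \end{bmatrix}.$$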
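As a sanity check on the answer to Question 4, the following sketch evaluates the Multivariate Normal log-density twice: once from the usual formula and once from the exponential-family form $\langle \phi(x), \theta \rangle - \mathbf{1}^T g(\theta)$. It assumes NumPy is available; the function names and the test values are illustrative choices, not taken from the homework.

```python
import numpy as np


def mvn_logpdf_direct(x, mu, Sigma):
    """log N(x | mu, Sigma) from the standard density formula."""
    d = x.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (quad + d * np.log(2 * np.pi) + logdet)


def mvn_logpdf_expfam(x, mu, Sigma):
    """log N(x | mu, Sigma) as <phi(x), theta> - 1^T g(theta)."""
    d = x.shape[0]
    P = np.linalg.inv(Sigma)
    theta1, theta2 = P @ mu, -0.5 * P      # natural parameters
    phi1, phi2 = x, np.outer(x, x)         # sufficient statistics
    # <A, B> = <vec(A), vec(B)> is the sum of elementwise products
    inner = phi1 @ theta1 + np.sum(phi2 * theta2)
    _, logdet = np.linalg.slogdet(Sigma)
    g = np.array([0.5 * mu @ theta1, 0.5 * (d * np.log(2 * np.pi) + logdet)])
    return inner - g.sum()


mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.25])
print(mvn_logpdf_direct(x, mu, Sigma), mvn_logpdf_expfam(x, mu, Sigma))
```

The two printed values should agree up to floating-point error if the identification of $\theta$, $\phi(x)$ and $g(\theta)$ is correct.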

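The conjugate prior for the Multivariate Normal Distribution can be parametrized as the Normal Inverse Wishart Distribution $\mathcal{NIW}(\mu_0, \kappa_0, \Sigma_0, \nu_0)$. The distribution is given by:

$$\begin{aligned}
p(\mu, \Sigma; \mu_0, \kappa_0, \Sigma_0, \nu_0) &= \mathcal{N}(\mu \mid \mu_0, \Sigma / \kappa_0)\, \mathcal{W}^{-1}(\Sigma \mid \Sigma_0, \nu_0) \\
&= \frac{\kappa_0^{d/2}\, |\Sigma_0|^{\nu_0/2}\, |\Sigma|^{-\frac{\nu_0 + d + 2}{2}}}{2^{\frac{(\nu_0 + 1) d}{2}}\, \pi^{d/2}\, \Gamma_d\!\left(\frac{\nu_0}{2}\right)} \exp\left( -\frac{\kappa_0}{2} (\mu - \mu_0)^T \Sigma^{-1} (\mu - \mu_0) - \frac{1}{2} \mathrm{tr}(\Sigma_0 \Sigma^{-1}) \right)
\end{aligned} \tag{13}$$

where $\kappa_0, \nu_0 > 0$, $\mu_0 \in \mathbb{R}^d$, $\Sigma_0 \succ 0$ is a symmetric positive definite $d \times d$ matrix, and $\Gamma_d$ is the multivariate gamma function.

Question 5 Notice that the Normal Inverse Wishart Distribution fits into the form of (2). Find the mapping between $(\mu_0, \kappa_0, \Sigma_0, \nu_0)$ and $(m_0, \phi_0)$, and the function $h(m_0, \phi_0)$ in terms of $\mu_0, \kappa_0, \Sigma_0, \nu_0$. Hint: $m_0$ and $g(\theta)$ are two dimensional.

A bit of algebra shows that:

$$\begin{aligned}
p(\mu, \Sigma) &= \exp\Big( -\frac{\kappa_0}{2} \mu^T \Sigma^{-1} \mu + \kappa_0 \mu_0^T \Sigma^{-1} \mu - \frac{\kappa_0}{2} \mu_0^T \Sigma^{-1} \mu_0 - \frac{1}{2} \mathrm{tr}(\Sigma_0 \Sigma^{-1}) + \frac{d}{2} \log \kappa_0 + \frac{\nu_0}{2} \log |\Sigma_0| \\
&\qquad\quad - \frac{\nu_0 + d + 2}{2} \log |\Sigma| - \frac{(\nu_0 + 1) d}{2} \log 2 - \frac{d}{2} \log \pi - \log \Gamma_d\!\left(\frac{\nu_0}{2}\right) \Big) \\
&= \exp\Big( \left\langle \begin{bmatrix} \kappa_0 \mu_0 \\ \kappa_0 \mu_0 \mu_0^T + \Sigma_0 \end{bmatrix}, \begin{bmatrix} \Sigma^{-1} \mu \\ -\frac{1}{2} \Sigma^{-1} \end{bmatrix} \right\rangle - \left\langle \begin{bmatrix} \kappa_0 \\ \nu_0 + d + 2 \end{bmatrix}, \begin{bmatrix} \frac{1}{2} \mu^T \Sigma^{-1} \mu \\ \frac{1}{2} \left( d \log(2\pi) + \log |\Sigma| \right) \end{bmatrix} \right\rangle \\
&\qquad\quad + \frac{d (\nu_0 + d + 2)}{2} \log(2\pi) - \frac{(\nu_0 + 1) d}{2} \log 2 - \frac{d}{2} \log \pi + \frac{d}{2} \log \kappa_0 + \frac{\nu_0}{2} \log |\Sigma_0| - \log \Gamma_d\!\left(\frac{\nu_0}{2}\right) \Big)
\end{aligned}$$

Comparing terms with (2), we obtain

$$\begin{aligned}
m_0 &= \begin{bmatrix} \kappa_0 \\ \nu_0 + d + 2 \end{bmatrix}, \qquad
\phi_0 = \begin{bmatrix} \kappa_0 \mu_0 \\ \Sigma_0 + \kappa_0 \mu_0 \mu_0^T \end{bmatrix}, \\
h(m_0, \phi_0) &= -\frac{d (d + 1)}{2} \log(2\pi) - \frac{\nu_0 d}{2} \log(\pi) - \frac{d}{2} \log(\kappa_0) - \frac{\nu_0}{2} \log |\Sigma_0| + \log \Gamma_d\!\left(\frac{\nu_0}{2}\right).
\end{aligned} \tag{14}$$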

Equipped with these results, we move on to tackle the problem of finding the posterior for $\mu, \Sigma$. One can follow a brute force approach and find it by using (10) and (13), but things can get really messy. We will adopt a more elegant and easy approach, exploiting the fact that these distributions belong to the exponential family.

Question 6 Using the update equations described in Question 3 and your answers to Questions 4 and 5, directly write down the posterior $p(\mu, \Sigma \mid X)$. Just providing appropriate update equations would suffice.

Applying the update equations (8) in Question 3, first consider $m_n$:

$$m_n = \begin{bmatrix} \kappa_n \\ \nu_n + d + 2 \end{bmatrix} = \begin{bmatrix} \kappa_0 \\ \nu_0 + d + 2 \end{bmatrix} + n\mathbf{1} \tag{15}$$

Similarly for $\phi_n$ we have

$$\phi_n = \begin{bmatrix} \kappa_n \mu_n \\ \Sigma_n + \kappa_n \mu_n \mu_n^T \end{bmatrix} = \begin{bmatrix} \kappa_0 \mu_0 \\ \Sigma_0 + \kappa_0 \mu_0 \mu_0^T \end{bmatrix} + \begin{bmatrix} \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i x_i^T \end{bmatrix} \tag{16}$$

Thus, we obtain the following update equations in terms of $\mu_0, \kappa_0, \Sigma_0, \nu_0$:

$$\begin{aligned}
\kappa_n &= \kappa_0 + n \\
\mu_n &= \frac{\kappa_0 \mu_0 + n \bar{x}}{\kappa_0 + n} \\
\nu_n &= \nu_0 + n \\
\Sigma_n &= \Sigma_0 + \sum_{i=1}^{n} x_i x_i^T + \kappa_0 \mu_0 \mu_0^T - \kappa_n \mu_n \mu_n^T
\end{aligned} \tag{17}$$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.

This is one of the remarkable cases where working out the general case takes less effort than working out the special case! The algebra can become very complicated, e.g. see murphyk/papers/bayesgauss.pdf, where the complicated math is done explicitly. We hope that after solving this homework, you can take advantage of this neat short-cut.
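The two routes to the posterior can be compared numerically. The sketch below, assuming NumPy and using illustrative function names and made-up data, computes the Normal Inverse Wishart posterior once from the closed-form update equations and once by adding $n$ and the summed sufficient statistics to the hyperparameters $(m_0, \phi_0)$ and mapping back.

```python
import numpy as np


def niw_posterior(mu0, kappa0, Sigma0, nu0, X):
    """Closed-form update; X has shape (n, d)."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    Sigma_n = (Sigma0 + X.T @ X
               + kappa0 * np.outer(mu0, mu0)
               - kappa_n * np.outer(mu_n, mu_n))
    return mu_n, kappa_n, Sigma_n, nu_n


def niw_posterior_via_hyper(mu0, kappa0, Sigma0, nu0, X):
    """Same update, done on (m, phi) and then mapped back."""
    n, d = X.shape
    m = np.array([kappa0, nu0 + d + 2.0]) + n           # m_n = m_0 + n*1
    phi1 = kappa0 * mu0 + X.sum(axis=0)                 # first block of phi_n
    phi2 = Sigma0 + kappa0 * np.outer(mu0, mu0) + X.T @ X
    kappa_n, nu_n = m[0], m[1] - d - 2.0
    mu_n = phi1 / kappa_n
    Sigma_n = phi2 - kappa_n * np.outer(mu_n, mu_n)
    return mu_n, kappa_n, Sigma_n, nu_n


rng = np.random.default_rng(0)          # arbitrary seed, illustrative data
X = rng.normal(size=(5, 2))
prior = (np.array([0.5, -1.0]), 2.0, np.eye(2), 4.0)
post_a = niw_posterior(*prior, X)
post_b = niw_posterior_via_hyper(*prior, X)
print(post_a[1], post_a[3])
```

Both functions should return the same four posterior parameters, which is the content of the short-cut.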

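If we want the same expression as Wikipedia for $\Sigma_n$, we need the following rearrangement:

$$\begin{aligned}
\Sigma_n &= \Sigma_0 + \sum_{i=1}^{n} x_i x_i^T + \kappa_0 \mu_0 \mu_0^T - \kappa_n \mu_n \mu_n^T \\
&= \Sigma_0 + C + \sum_{i=1}^{n} x_i \bar{x}^T + \sum_{i=1}^{n} \bar{x} x_i^T - \sum_{i=1}^{n} \bar{x} \bar{x}^T + \kappa_0 \mu_0 \mu_0^T - \kappa_n \mu_n \mu_n^T \\
&= \Sigma_0 + C + n \bar{x} \bar{x}^T + \kappa_0 \mu_0 \mu_0^T - \kappa_n \mu_n \mu_n^T
\end{aligned} \tag{18}$$

where $C = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$. Now further expanding $\mu_n$, we obtain:

$$\begin{aligned}
\Sigma_n &= \Sigma_0 + C + n \bar{x} \bar{x}^T + \kappa_0 \mu_0 \mu_0^T - \kappa_n \mu_n \mu_n^T \\
&= \Sigma_0 + C + \kappa_0 \mu_0 \mu_0^T + n \bar{x} \bar{x}^T - \frac{(\kappa_0 \mu_0 + n \bar{x})(\kappa_0 \mu_0 + n \bar{x})^T}{\kappa_0 + n} \\
&= \Sigma_0 + C + \frac{n \kappa_0 \mu_0 \mu_0^T - n \kappa_0 \mu_0 \bar{x}^T - n \kappa_0 \bar{x} \mu_0^T + n \kappa_0 \bar{x} \bar{x}^T}{\kappa_0 + n} \\
&= \Sigma_0 + C + \frac{n \kappa_0}{\kappa_0 + n} (\bar{x} - \mu_0)(\bar{x} - \mu_0)^T
\end{aligned} \tag{19}$$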
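The rearrangement can be checked numerically. The sketch below, assuming NumPy and using illustrative names and random data, evaluates $\Sigma_n$ in the raw-moment form and in the centred scatter-matrix form and compares the two.

```python
import numpy as np


def sigma_n_raw(mu0, kappa0, Sigma0, X):
    """Sigma_n written with raw second moments."""
    n = X.shape[0]
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + X.sum(axis=0)) / kappa_n
    return (Sigma0 + X.T @ X
            + kappa0 * np.outer(mu0, mu0)
            - kappa_n * np.outer(mu_n, mu_n))


def sigma_n_scatter(mu0, kappa0, Sigma0, X):
    """Sigma_n written with the centred scatter matrix C."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    centred = X - xbar
    C = centred.T @ centred
    shift = xbar - mu0
    return Sigma0 + C + (n * kappa0 / (kappa0 + n)) * np.outer(shift, shift)


rng = np.random.default_rng(1)          # arbitrary seed, illustrative data
X_demo = rng.normal(size=(8, 3))
mu0_demo = np.array([0.2, -0.4, 1.0])
S_raw = sigma_n_raw(mu0_demo, 1.5, np.eye(3), X_demo)
S_scatter = sigma_n_scatter(mu0_demo, 1.5, np.eye(3), X_demo)
print(np.max(np.abs(S_raw - S_scatter)))
```

The scatter form is usually preferable in floating point, since it avoids subtracting two large raw-moment terms.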

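1.4 Posterior Predictive - Bonus [0+0]

Another quantity, which is often of interest in Bayesian Statistics, is the posterior predictive. The posterior predictive distribution is the distribution of unobserved observations (predictions) conditional on the observed data. Specifically, it is computed by marginalising over the parameters, using the posterior distribution:

$$p(\tilde{x} \mid X) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid X)\, d\theta \tag{20}$$

The posterior predictive distribution for a distribution in the exponential family has a rather nice form.

Question 7 Show that the posterior predictive for the distribution (1) with prior (2) is given by:

$$p(\tilde{x} \mid X) = \exp\left\{ h(m_n + \mathbf{1}, \phi_n + \phi(\tilde{x})) - h(m_n, \phi_n) \right\} \tag{21}$$

We have

$$\begin{aligned}
p(\tilde{x} \mid X) &= \int \exp\left[ \langle \phi(\tilde{x}), \theta \rangle - \mathbf{1}^T g(\theta) + \langle \phi_n, \theta \rangle - \langle m_n, g(\theta) \rangle - h(m_n, \phi_n) \right] d\theta \\
&= \int \exp\left[ \langle \phi(\tilde{x}) + \phi_n, \theta \rangle - \langle m_n + \mathbf{1}, g(\theta) \rangle - h(m_n, \phi_n) \right] d\theta \\
&= \exp\left[ -h(m_n, \phi_n) \right] \int \exp\left[ \langle \phi(\tilde{x}) + \phi_n, \theta \rangle - \langle m_n + \mathbf{1}, g(\theta) \rangle \right] d\theta.
\end{aligned} \tag{22}$$

Since

$$\exp\left[ \langle \phi(\tilde{x}) + \phi_n, \theta \rangle - \langle m_n + \mathbf{1}, g(\theta) \rangle - h(m_n + \mathbf{1}, \phi_n + \phi(\tilde{x})) \right] = p(\theta; m_n + \mathbf{1}, \phi_n + \phi(\tilde{x})) \tag{23}$$

is a probability distribution, we have

$$\begin{aligned}
1 &= \int \exp\left[ \langle \phi(\tilde{x}) + \phi_n, \theta \rangle - \langle m_n + \mathbf{1}, g(\theta) \rangle - h(m_n + \mathbf{1}, \phi_n + \phi(\tilde{x})) \right] d\theta \\
&= \exp\left[ -h(m_n + \mathbf{1}, \phi_n + \phi(\tilde{x})) \right] \int \exp\left[ \langle \phi(\tilde{x}) + \phi_n, \theta \rangle - \langle m_n + \mathbf{1}, g(\theta) \rangle \right] d\theta,
\end{aligned} \tag{24}$$

leading to

$$\int \exp\left[ \langle \phi(\tilde{x}) + \phi_n, \theta \rangle - \langle m_n + \mathbf{1}, g(\theta) \rangle \right] d\theta = \exp\left[ h(m_n + \mathbf{1}, \phi_n + \phi(\tilde{x})) \right]. \tag{25}$$

Combining this result with (22) yields the desired result:

$$p(\tilde{x} \mid X) = \exp\left\{ h(m_n + \mathbf{1}, \phi_n + \phi(\tilde{x})) - h(m_n, \phi_n) \right\} \tag{26}$$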
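A one-dimensional illustration of the predictive formula, using a model that is not part of the homework: the Bernoulli likelihood in natural form, with $\phi(x) = x$ and $g(\theta) = \log(1 + e^{\theta})$. Under that assumption, the substitution $p = 1/(1 + e^{-\theta})$ gives $h(m, \phi) = \log B(\phi, m - \phi)$, where $B$ is the Beta function. The function names below are illustrative.

```python
import math


def h(m: float, phi: float) -> float:
    """Log-normaliser log B(phi, m - phi) of the conjugate prior for the
    Bernoulli likelihood in natural form (illustrative model, see lead-in)."""
    a, b = phi, m - phi
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def predictive(x_new: int, m_n: float, phi_n: float) -> float:
    """p(x_new | X) = exp{ h(m_n + 1, phi_n + phi(x_new)) - h(m_n, phi_n) }."""
    return math.exp(h(m_n + 1.0, phi_n + x_new) - h(m_n, phi_n))


# Posterior hyperparameters m_n = 5, phi_n = 2 (made-up values)
p1 = predictive(1, 5.0, 2.0)
p0 = predictive(0, 5.0, 2.0)
print(p1, p0)
```

Here the predictive probability of a success reduces to $\phi_n / m_n$, the familiar ratio of (pseudo-)successes to (pseudo-)trials, and the two probabilities sum to one.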

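The result of the previous problem can be specialized to the Multivariate Normal case.

Question 8 Find the posterior predictive for the case of the Multivariate Normal Distribution with the Normal Inverse Wishart prior, having parameters as described in 1.3, by using Question 7. Hint: The matrix determinant lemma might come in handy (determinant_lemma).

Adding a new point $\tilde{x}$ leads to the following update:

$$\begin{aligned}
\kappa' &= \kappa_n + 1 \\
\nu' &= \nu_n + 1 \\
\mu' &= \frac{\kappa_n \mu_n + \tilde{x}}{\kappa_n + 1} \\
\Sigma' &= \Sigma_n + \tilde{x} \tilde{x}^T + \kappa_n \mu_n \mu_n^T - \kappa' \mu' \mu'^T = \Sigma_n + \frac{\kappa_n}{\kappa_n + 1} (\tilde{x} - \mu_n)(\tilde{x} - \mu_n)^T
\end{aligned} \tag{27}$$

Now, substituting (14) into (21),

$$\begin{aligned}
\frac{\exp h(m_n + \mathbf{1}, \phi_n + \phi(\tilde{x}))}{\exp h(m_n, \phi_n)}
&= \frac{(2\pi)^{-\frac{d(d+1)}{2}}\, \pi^{-\frac{(\nu_n + 1) d}{2}}\, (\kappa_n + 1)^{-\frac{d}{2}}\, |\Sigma'|^{-\frac{\nu_n + 1}{2}}\, \Gamma_d\!\left(\frac{\nu_n + 1}{2}\right)}{(2\pi)^{-\frac{d(d+1)}{2}}\, \pi^{-\frac{\nu_n d}{2}}\, \kappa_n^{-\frac{d}{2}}\, |\Sigma_n|^{-\frac{\nu_n}{2}}\, \Gamma_d\!\left(\frac{\nu_n}{2}\right)} \\
&= \frac{\Gamma\!\left(\frac{\nu_n + 1}{2}\right)}{\Gamma\!\left(\frac{\nu_n - d + 1}{2}\right) \pi^{\frac{d}{2}} \left(1 + \frac{1}{\kappa_n}\right)^{\frac{d}{2}}} \cdot \frac{|\Sigma_n|^{\frac{\nu_n}{2}}}{|\Sigma'|^{\frac{\nu_n + 1}{2}}}
\end{aligned} \tag{28}$$

Next rewrite $|\Sigma'|$ using the matrix determinant lemma:

$$|\Sigma'| = \left| \Sigma_n + \frac{\kappa_n}{\kappa_n + 1} (\tilde{x} - \mu_n)(\tilde{x} - \mu_n)^T \right| = \left[ 1 + \frac{\kappa_n}{\kappa_n + 1} (\tilde{x} - \mu_n)^T \Sigma_n^{-1} (\tilde{x} - \mu_n) \right] |\Sigma_n| \tag{29}$$

Putting it all together, we get

$$p(\tilde{x} \mid X) = \frac{\Gamma\!\left(\frac{\nu_n + 1}{2}\right)}{\Gamma\!\left(\frac{\nu_n - d + 1}{2}\right) \pi^{d/2} \left| \frac{\kappa_n + 1}{\kappa_n} \Sigma_n \right|^{1/2} \left[ 1 + \frac{\kappa_n}{\kappa_n + 1} (\tilde{x} - \mu_n)^T \Sigma_n^{-1} (\tilde{x} - \mu_n) \right]^{(\nu_n + 1)/2}}. \tag{30}$$

Further, using the Student t distribution formula, one can show that

$$p(\tilde{x} \mid X) = t_{\nu_n - d + 1}\!\left( \tilde{x} \,\middle|\, \mu_n,\; \frac{(\kappa_n + 1)\, \Sigma_n}{\kappa_n (\nu_n - d + 1)} \right).$$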
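The predictive density derived above can be evaluated directly in log space. The sketch below assumes NumPy; the function name `log_predictive` and the parameter values are illustrative. As a rough check it integrates the density on a grid in the case $d = 1$, where the result should be close to 1.

```python
import math

import numpy as np


def log_predictive(x_new, mu_n, kappa_n, Sigma_n, nu_n):
    """Log of the posterior predictive density for the NIW posterior."""
    d = mu_n.shape[0]
    diff = x_new - mu_n
    quad = float(diff @ np.linalg.solve(Sigma_n, diff))
    _, logdet = np.linalg.slogdet(Sigma_n)
    scale = (kappa_n + 1.0) / kappa_n          # (kappa_n + 1) / kappa_n
    return (math.lgamma(0.5 * (nu_n + 1.0))
            - math.lgamma(0.5 * (nu_n - d + 1.0))
            - 0.5 * d * math.log(math.pi)
            - 0.5 * (d * math.log(scale) + logdet)   # log |scale * Sigma_n|^(1/2)
            - 0.5 * (nu_n + 1.0) * math.log1p(quad / scale))


# One-dimensional check with made-up posterior parameters
mu_1d = np.array([0.5])
Sigma_1d = np.array([[2.0]])
step = 0.02
grid = np.arange(-60.0, 60.0 + step, step)
total = step * sum(
    math.exp(log_predictive(np.array([g]), mu_1d, 3.0, Sigma_1d, 6.0))
    for g in grid
)
print(total)
```

Working with log-gamma and log-determinant terms avoids overflow for large $\nu_n$ or high dimension, which is why the sketch returns the log-density rather than the density.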
