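Bayesian Linear Model: Gory Details
Pubh7440 Notes
By Sudipto Banerjee

Let $y = [y_i]_{i=1}^n$ be an $n \times 1$ vector of independent observations on a dependent variable (or response) from $n$ experimental units. Associated with each $y_i$ is a $p \times 1$ vector of regressors, say $x_i$, which leads to the linear regression model

$$y = X\beta + \epsilon, \tag{1}$$

where $X = [x_i^T]_{i=1}^n$ is the $n \times p$ matrix of regressors with $i$-th row $x_i^T$ and is assumed fixed, $\beta$ is the slope vector of regression coefficients and $\epsilon = [\epsilon_i]_{i=1}^n$ is the vector of random variables representing "pure error" or measurement error in the dependent variable. For independent observations, we assume $\epsilon \sim MVN(0, \sigma^2 I_n)$, viz. that each component $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. Furthermore, we will assume that the columns of the matrix $X$ are linearly independent so that the rank of $X$ is $p$.

1 The NIG conjugate prior family

A popular Bayesian model builds upon the linear regression of $y$ using conjugate priors by specifying

$$\begin{aligned}
p(\beta, \sigma^2) &= p(\beta \mid \sigma^2)\, p(\sigma^2) = N(\mu_\beta, \sigma^2 V_\beta) \times IG(a, b) = NIG(\mu_\beta, V_\beta, a, b) \\
&= \frac{b^a}{(2\pi)^{p/2} |V_\beta|^{1/2} \Gamma(a)} \left(\frac{1}{\sigma^2}\right)^{a + p/2 + 1} \exp\left[-\frac{1}{\sigma^2}\left\{b + \frac{1}{2}(\beta - \mu_\beta)^T V_\beta^{-1} (\beta - \mu_\beta)\right\}\right] \\
&\propto \left(\frac{1}{\sigma^2}\right)^{a + p/2 + 1} \exp\left[-\frac{1}{\sigma^2}\left\{b + \frac{1}{2}(\beta - \mu_\beta)^T V_\beta^{-1} (\beta - \mu_\beta)\right\}\right],
\end{aligned} \tag{2}$$

where $\Gamma(\cdot)$ represents the Gamma function and the $IG(a, b)$ prior density for $\sigma^2$ is given by

$$p(\sigma^2) = \frac{b^a}{\Gamma(a)} \left(\frac{1}{\sigma^2}\right)^{a+1} \exp\left(-\frac{b}{\sigma^2}\right), \quad \sigma^2 > 0,$$

where $a, b > 0$. We call this the Normal-Inverse-Gamma (NIG) prior and denote it as $NIG(\mu_\beta, V_\beta, a, b)$.

The NIG probability distribution is a joint probability distribution of a vector $\beta$ and a scalar $\sigma^2$. If $(\beta, \sigma^2) \sim NIG(\mu, V, a, b)$, then an interesting analytic form results from integrating out $\sigma^2$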
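from the joint density:

$$\begin{aligned}
\int NIG(\mu, V, a, b)\, d\sigma^2 &= \frac{b^a}{(2\pi)^{p/2} |V|^{1/2} \Gamma(a)} \int \left(\frac{1}{\sigma^2}\right)^{a + p/2 + 1} \exp\left[-\frac{1}{\sigma^2}\left\{b + \frac{1}{2}(\beta - \mu)^T V^{-1} (\beta - \mu)\right\}\right] d\sigma^2 \\
&= \frac{b^a\, \Gamma\left(a + \frac{p}{2}\right)}{(2\pi)^{p/2} |V|^{1/2} \Gamma(a)} \left[b + \frac{1}{2}(\beta - \mu)^T V^{-1} (\beta - \mu)\right]^{-\left(a + \frac{p}{2}\right)} \\
&= \frac{\Gamma\left(a + \frac{p}{2}\right)}{\pi^{p/2} \left|(2a)\frac{b}{a} V\right|^{1/2} \Gamma(a)} \left[1 + \frac{(\beta - \mu)^T \left[\frac{b}{a} V\right]^{-1} (\beta - \mu)}{2a}\right]^{-\frac{2a + p}{2}}.
\end{aligned}$$

This is a multivariate $t$ density:

$$MVSt_\nu(\mu, \Sigma) = \frac{\Gamma\left(\frac{\nu + p}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) \pi^{p/2} |\nu \Sigma|^{1/2}} \left[1 + \frac{(\beta - \mu)^T \Sigma^{-1} (\beta - \mu)}{\nu}\right]^{-\frac{\nu + p}{2}}, \tag{3}$$

with $\nu = 2a$ and $\Sigma = \left(\frac{b}{a}\right) V$.

2 The likelihood

The likelihood for the model is defined, up to proportionality, as the joint probability of observing the data given the parameters. Since $X$ is fixed, the likelihood is given by

$$p(y \mid \beta, \sigma^2) = N(X\beta, \sigma^2 I) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)^T (y - X\beta)\right\}. \tag{4}$$

3 The posterior distribution from the NIG prior

Inference will proceed from the posterior distribution

$$p(\beta, \sigma^2 \mid y) = \frac{p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)}{p(y)},$$

where $p(y) = \int p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)\, d\beta\, d\sigma^2$ is the marginal distribution of the data. The key to deriving the joint posterior distribution is the following easily verified multivariate completion of squares or ellipsoidal rectification identity:

$$u^T A u - 2\alpha^T u = (u - A^{-1}\alpha)^T A (u - A^{-1}\alpha) - \alpha^T A^{-1} \alpha, \tag{5}$$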
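where $A$ is a symmetric positive definite (hence invertible) matrix. An application of this identity immediately reveals

$$\frac{1}{\sigma^2}\left[b + \frac{1}{2}\left\{(\beta - \mu_\beta)^T V_\beta^{-1} (\beta - \mu_\beta) + (y - X\beta)^T (y - X\beta)\right\}\right] = \frac{1}{\sigma^2}\left[b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1} (\beta - \mu^*)\right],$$

using which we can write the posterior as

$$p(\beta, \sigma^2 \mid y) \propto \left(\frac{1}{\sigma^2}\right)^{a + (n+p)/2 + 1} \exp\left\{-\frac{1}{\sigma^2}\left[b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1} (\beta - \mu^*)\right]\right\}, \tag{6}$$

where

$$\begin{aligned}
\mu^* &= (V_\beta^{-1} + X^T X)^{-1} (V_\beta^{-1} \mu_\beta + X^T y), \\
V^* &= (V_\beta^{-1} + X^T X)^{-1}, \\
a^* &= a + n/2, \\
b^* &= b + \frac{1}{2}\left[\mu_\beta^T V_\beta^{-1} \mu_\beta + y^T y - \mu^{*T} V^{*-1} \mu^*\right].
\end{aligned}$$

This posterior distribution is easily identified as a $NIG(\mu^*, V^*, a^*, b^*)$, proving it to be a conjugate family for the linear regression model.

Note that the marginal posterior distribution of $\sigma^2$ is immediately seen to be an $IG(a^*, b^*)$ whose density is given by:

$$p(\sigma^2 \mid y) = \frac{b^{*a^*}}{\Gamma(a^*)} \left(\frac{1}{\sigma^2}\right)^{a^* + 1} \exp\left(-\frac{b^*}{\sigma^2}\right). \tag{7}$$

The marginal posterior distribution of $\beta$ is obtained by integrating out $\sigma^2$ from the NIG joint posterior as follows:

$$\begin{aligned}
p(\beta \mid y) &= \int p(\beta, \sigma^2 \mid y)\, d\sigma^2 = \int NIG(\mu^*, V^*, a^*, b^*)\, d\sigma^2 \\
&\propto \int \left(\frac{1}{\sigma^2}\right)^{a^* + p/2 + 1} \exp\left\{-\frac{1}{\sigma^2}\left[b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1} (\beta - \mu^*)\right]\right\} d\sigma^2 \\
&\propto \left[1 + \frac{(\beta - \mu^*)^T V^{*-1} (\beta - \mu^*)}{2b^*}\right]^{-(a^* + p/2)}.
\end{aligned}$$

This is a multivariate $t$ density:

$$MVSt_{\nu^*}(\mu^*, \Sigma^*) = \frac{\Gamma\left(\frac{\nu^* + p}{2}\right)}{\Gamma\left(\frac{\nu^*}{2}\right) \pi^{p/2} |\nu^* \Sigma^*|^{1/2}} \left[1 + \frac{(\beta - \mu^*)^T \Sigma^{*-1} (\beta - \mu^*)}{\nu^*}\right]^{-\frac{\nu^* + p}{2}}, \tag{8}$$

with $\nu^* = 2a^*$ and $\Sigma^* = \left(\frac{b^*}{a^*}\right) V^*$.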
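The posterior update above can be sketched numerically. The following is a minimal NumPy illustration, not part of the original notes: the function name `nig_posterior`, the simulated data and the prior values are all hypothetical choices. It computes $(\mu^*, V^*, a^*, b^*)$ and then checks the completion-of-squares step at an arbitrary $\beta$ by evaluating both sides of the bracketed identity.

```python
import numpy as np

def nig_posterior(X, y, mu_beta, V_beta, a, b):
    """Return (mu_star, V_star, a_star, b_star) for a NIG(mu_beta, V_beta, a, b) prior."""
    n = X.shape[0]
    V_beta_inv = np.linalg.inv(V_beta)
    V_star_inv = V_beta_inv + X.T @ X            # V*^{-1} = V_beta^{-1} + X^T X
    V_star = np.linalg.inv(V_star_inv)
    mu_star = V_star @ (V_beta_inv @ mu_beta + X.T @ y)
    a_star = a + n / 2.0
    b_star = b + 0.5 * (mu_beta @ V_beta_inv @ mu_beta + y @ y
                        - mu_star @ V_star_inv @ mu_star)
    return mu_star, V_star, a_star, b_star

# Hypothetical simulated data and prior values (assumptions for illustration only).
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.7, size=n)
mu_beta, V_beta, a, b = np.zeros(p), 10.0 * np.eye(p), 2.0, 1.0

mu_star, V_star, a_star, b_star = nig_posterior(X, y, mu_beta, V_beta, a, b)

# Check the completion of squares at an arbitrary beta:
# b + (1/2){prior quadratic + residual sum of squares} = b* + (1/2) posterior quadratic.
beta = rng.normal(size=p)
lhs = b + 0.5 * ((beta - mu_beta) @ np.linalg.solve(V_beta, beta - mu_beta)
                 + (y - X @ beta) @ (y - X @ beta))
rhs = b_star + 0.5 * (beta - mu_star) @ np.linalg.solve(V_star, beta - mu_star)
print(np.isclose(lhs, rhs))
```

Explicit inverses are used here only to mirror the formulas; for larger $p$ a Cholesky-based solve would usually be preferable.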
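4 A useful expression for the NIG scale parameter

Here we will prove:

$$b^* = b + \frac{1}{2}(y - X\mu_\beta)^T (I + X V_\beta X^T)^{-1} (y - X\mu_\beta). \tag{9}$$

On account of the expression for $b^*$ derived in the preceding section, it suffices to prove that

$$y^T y + \mu_\beta^T V_\beta^{-1} \mu_\beta - \mu^{*T} V^{*-1} \mu^* = (y - X\mu_\beta)^T (I + X V_\beta X^T)^{-1} (y - X\mu_\beta).$$

Substituting $\mu^* = V^*(V_\beta^{-1} \mu_\beta + X^T y)$ in the left hand side above we obtain:

$$\begin{aligned}
y^T y + \mu_\beta^T V_\beta^{-1} \mu_\beta - \mu^{*T} V^{*-1} \mu^* &= y^T y + \mu_\beta^T V_\beta^{-1} \mu_\beta - (V_\beta^{-1} \mu_\beta + X^T y)^T V^* (V_\beta^{-1} \mu_\beta + X^T y) \\
&= y^T (I - X V^* X^T) y - 2 y^T X V^* V_\beta^{-1} \mu_\beta + \mu_\beta^T (V_\beta^{-1} - V_\beta^{-1} V^* V_\beta^{-1}) \mu_\beta.
\end{aligned} \tag{10}$$

Further development of the proof will employ two tricky identities. The first is the well-known Sherman-Woodbury-Morrison identity in matrix algebra:

$$(A + BDC)^{-1} = A^{-1} - A^{-1} B \left(D^{-1} + C A^{-1} B\right)^{-1} C A^{-1}, \tag{11}$$

where $A$ and $D$ are square matrices that are invertible and $B$ and $C$ are rectangular (square if $A$ and $D$ have the same dimensions) matrices such that the multiplications are well-defined. This identity is easily verified by multiplying the right hand side with $A + BDC$ and simplifying to reduce it to the identity matrix.

Applying (11) twice, once with $A = V_\beta$ and $D = (X^T X)^{-1}$ to get the second equality and then with $A = (X^T X)^{-1}$ and $D = V_\beta$ to get the third equality (taking $B = C = I_p$ both times), we have

$$\begin{aligned}
V_\beta^{-1} - V_\beta^{-1} V^* V_\beta^{-1} &= V_\beta^{-1} - V_\beta^{-1} (V_\beta^{-1} + X^T X)^{-1} V_\beta^{-1} \\
&= \left[V_\beta + (X^T X)^{-1}\right]^{-1} \\
&= X^T X - X^T X (X^T X + V_\beta^{-1})^{-1} X^T X \\
&= X^T (I_n - X V^* X^T) X.
\end{aligned} \tag{12}$$

The next identity notes that since $V^*(V_\beta^{-1} + X^T X) = I_p$, we have $V^* V_\beta^{-1} = I_p - V^* X^T X$, so that

$$X V^* V_\beta^{-1} = X - X V^* X^T X = (I_n - X V^* X^T) X. \tag{13}$$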
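As a numerical sanity check on this section (a sketch that is not in the original notes, with randomly generated and purely hypothetical inputs), the following NumPy snippet compares the two expressions for $b^*$, verifies the Sherman-Woodbury-Morrison identity in the special case $A = I_n$, $B = X$, $D = V_\beta$, $C = X^T$, and checks the two auxiliary matrix identities used in the proof.

```python
import numpy as np

# Hypothetical random inputs (assumptions for illustration only).
rng = np.random.default_rng(1)
n, p = 8, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
mu_beta = rng.normal(size=p)
M = rng.normal(size=(p, p))
V_beta = M @ M.T + np.eye(p)                  # symmetric positive definite
b = 1.5

V_beta_inv = np.linalg.inv(V_beta)
V_star = np.linalg.inv(V_beta_inv + X.T @ X)
mu_star = V_star @ (V_beta_inv @ mu_beta + X.T @ y)
I_n = np.eye(n)

# b* as defined with the posterior parameters.
b_star_direct = b + 0.5 * (mu_beta @ V_beta_inv @ mu_beta + y @ y
                           - mu_star @ np.linalg.solve(V_star, mu_star))

# b* written through the prior-predictive quadratic form.
r = y - X @ mu_beta
b_star_alt = b + 0.5 * r @ np.linalg.solve(I_n + X @ V_beta @ X.T, r)

# Sherman-Woodbury-Morrison: (I + X V_beta X^T)^{-1} = I - X V* X^T.
woodbury_lhs = np.linalg.inv(I_n + X @ V_beta @ X.T)
woodbury_rhs = I_n - X @ V_star @ X.T

# V_beta^{-1} - V_beta^{-1} V* V_beta^{-1} = X^T (I - X V* X^T) X.
id12_lhs = V_beta_inv - V_beta_inv @ V_star @ V_beta_inv
id12_rhs = X.T @ woodbury_rhs @ X

# X V* V_beta^{-1} = (I - X V* X^T) X.
id13_lhs = X @ V_star @ V_beta_inv
id13_rhs = woodbury_rhs @ X

print(np.isclose(b_star_direct, b_star_alt),
      np.allclose(woodbury_lhs, woodbury_rhs),
      np.allclose(id12_lhs, id12_rhs),
      np.allclose(id13_lhs, id13_rhs))
```

Agreement on a single random draw is of course not a proof, but it is a cheap way to catch sign or transpose slips when transcribing these identities.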
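Substituting (12) and (13) in (10) we obtain

$$\begin{aligned}
& y^T (I_n - X V^* X^T) y - 2 y^T (I_n - X V^* X^T) X \mu_\beta + \mu_\beta^T X^T (I_n - X V^* X^T) X \mu_\beta \\
&\quad = (y - X\mu_\beta)^T (I_n - X V^* X^T) (y - X\mu_\beta) \\
&\quad = (y - X\mu_\beta)^T (I_n + X V_\beta X^T)^{-1} (y - X\mu_\beta),
\end{aligned} \tag{14}$$

where the last step is again a consequence of (11):

$$(I_n + X V_\beta X^T)^{-1} = I_n - X (V_\beta^{-1} + X^T X)^{-1} X^T = I_n - X V^* X^T.$$

5 Marginal distributions: the hard way

To obtain the marginal distribution of $y$, we first compute the distribution $p(y \mid \sigma^2)$ by integrating out $\beta$ and subsequently integrate out $\sigma^2$ to obtain $p(y)$. To be precise, we use the expression for $b^*$ derived in the preceding section, proceeding as below:

$$\begin{aligned}
p(y \mid \sigma^2) &= \int p(y \mid \beta, \sigma^2)\, p(\beta \mid \sigma^2)\, d\beta = \int N(X\beta, \sigma^2 I_n) \times N(\mu_\beta, \sigma^2 V_\beta)\, d\beta \\
&= \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}} |V_\beta|^{1/2}} \int \exp\left[-\frac{1}{2\sigma^2}\left\{(y - X\beta)^T (y - X\beta) + (\beta - \mu_\beta)^T V_\beta^{-1} (\beta - \mu_\beta)\right\}\right] d\beta \\
&= \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}} |V_\beta|^{1/2}} \int \exp\left[-\frac{1}{2\sigma^2}\left\{(y - X\mu_\beta)^T (I + X V_\beta X^T)^{-1} (y - X\mu_\beta) + (\beta - \mu^*)^T V^{*-1} (\beta - \mu^*)\right\}\right] d\beta \\
&= \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}} |V_\beta|^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\mu_\beta)^T (I + X V_\beta X^T)^{-1} (y - X\mu_\beta)\right\} \\
&\qquad \times \int \exp\left[-\frac{1}{2\sigma^2}(\beta - \mu^*)^T V^{*-1} (\beta - \mu^*)\right] d\beta \\
&= \left(\frac{|V^*|}{|V_\beta|}\right)^{1/2} \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\mu_\beta)^T (I + X V_\beta X^T)^{-1} (y - X\mu_\beta)\right\} \\
&= \frac{1}{(2\pi\sigma^2)^{n/2} |I + X V_\beta X^T|^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\mu_\beta)^T (I + X V_\beta X^T)^{-1} (y - X\mu_\beta)\right\} \\
&= N(X\mu_\beta, \sigma^2 (I + X V_\beta X^T)).
\end{aligned} \tag{15}$$

Here we have applied the matrix identity

$$|A + BDC| = |A|\, |D|\, |D^{-1} + C A^{-1} B| \tag{16}$$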
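to obtain

$$|I_n + X V_\beta X^T| = |V_\beta|\, |V_\beta^{-1} + X^T X| = \frac{|V_\beta|}{|V^*|}.$$

Now, the marginal distribution $p(y)$ is obtained by integrating a NIG density as follows:

$$\begin{aligned}
p(y) &= \int p(y \mid \sigma^2)\, p(\sigma^2)\, d\sigma^2 = \int N(X\mu_\beta, \sigma^2 (I + X V_\beta X^T)) \times IG(a, b)\, d\sigma^2 \\
&= \int NIG(X\mu_\beta, (I + X V_\beta X^T), a, b)\, d\sigma^2 = MVSt_{2a}\left(X\mu_\beta, \frac{b}{a}(I + X V_\beta X^T)\right).
\end{aligned} \tag{17}$$

Rewriting our result slightly differently reveals another useful property of the NIG density:

$$\begin{aligned}
p(y) &= \int p(y \mid \beta, \sigma^2)\, p(\beta, \sigma^2)\, d\beta\, d\sigma^2 = \int N(X\beta, \sigma^2 I_n) \times NIG(\mu_\beta, V_\beta, a, b)\, d\beta\, d\sigma^2 \\
&= MVSt_{2a}\left(X\mu_\beta, \frac{b}{a}(I + X V_\beta X^T)\right).
\end{aligned} \tag{18}$$

Of course, the computation of $p(y)$ could also be carried out in terms of the NIG distribution parameters more directly as

$$\begin{aligned}
p(y) &= \int p(y \mid \beta, \sigma^2)\, p(\beta, \sigma^2)\, d\beta\, d\sigma^2 = \int N(X\beta, \sigma^2 I_n) \times NIG(\mu_\beta, V_\beta, a, b)\, d\beta\, d\sigma^2 \\
&= \frac{b^a}{(2\pi)^{(n+p)/2} |V_\beta|^{1/2} \Gamma(a)} \int \left(\frac{1}{\sigma^2}\right)^{a^* + p/2 + 1} \exp\left\{-\frac{1}{\sigma^2}\left[b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1} (\beta - \mu^*)\right]\right\} d\beta\, d\sigma^2 \\
&= \frac{b^a}{\Gamma(a) (2\pi)^{(n+p)/2} |V_\beta|^{1/2}} \times \frac{\Gamma(a^*) (2\pi)^{p/2} |V^*|^{1/2}}{(b^*)^{a^*}} \\
&= \frac{b^a\, \Gamma\left(a + \frac{n}{2}\right)}{(2\pi)^{n/2} \Gamma(a)} \sqrt{\frac{|V^*|}{|V_\beta|}} \left[b + \frac{1}{2}\left\{\mu_\beta^T V_\beta^{-1} \mu_\beta + y^T y - \mu^{*T} V^{*-1} \mu^*\right\}\right]^{-(a + n/2)}.
\end{aligned} \tag{19}$$

6 Marginal distribution: the easy way

An alternative and much easier way to derive $p(y \mid \sigma^2)$, avoiding any integration at all, is to note that we can write the above model as:

$$\begin{aligned}
y &= X\beta + \epsilon_1, \quad \text{where } \epsilon_1 \sim N(0, \sigma^2 I); \\
\beta &= \mu_\beta + \epsilon_2, \quad \text{where } \epsilon_2 \sim N(0, \sigma^2 V_\beta),
\end{aligned}$$

where $\epsilon_1$ and $\epsilon_2$ are independent of each other. It then follows that

$$y = X\mu_\beta + X\epsilon_2 + \epsilon_1 \sim N(X\mu_\beta, \sigma^2 (I + X V_\beta X^T)).$$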
This gives $p(y \mid \sigma^2)$. Next we integrate out $\sigma^2$ to obtain $p(y)$ as in the preceding section.

In fact, the entire distribution theory for the Bayesian regression with NIG priors could proceed by completely avoiding any integration. To be precise, we obtain this marginal distribution first and derive the posterior distribution:

$$p(\beta, \sigma^2 \mid y) = \frac{p(\beta, \sigma^2) \times p(y \mid \beta, \sigma^2)}{p(y)} = \frac{NIG(\mu_\beta, V_\beta, a, b) \times N(X\beta, \sigma^2 I)}{MVSt_{2a}\left(X\mu_\beta, \frac{b}{a}(I + X V_\beta X^T)\right)},$$

which indeed reduces (after some algebraic manipulation) to the $NIG(\mu^*, V^*, a^*, b^*)$ density.

7 Bayesian Predictions

Next consider Bayesian prediction in the context of the linear regression model. Suppose we now want to apply our regression analysis to a new set of data, where we have observed a new $m \times p$ matrix of regressors $\tilde{X}$, and we wish to predict the corresponding outcome $\tilde{y}$. Observe that if $\beta$ and $\sigma^2$ were known, then the probability law for the predicted outcomes would be described as $\tilde{y} \sim N(\tilde{X}\beta, \sigma^2 I_m)$ and would be independent of $y$. However, these parameters are not known; instead they are summarized through their posterior samples. Therefore, all predictions for the data must follow from the posterior predictive distribution:

$$\begin{aligned}
p(\tilde{y} \mid y) &= \int p(\tilde{y} \mid \beta, \sigma^2)\, p(\beta, \sigma^2 \mid y)\, d\beta\, d\sigma^2 = \int N(\tilde{X}\beta, \sigma^2 I_m) \times NIG(\mu^*, V^*, a^*, b^*)\, d\beta\, d\sigma^2 \\
&= MVSt_{2a^*}\left(\tilde{X}\mu^*, \frac{b^*}{a^*}(I + \tilde{X} V^* \tilde{X}^T)\right),
\end{aligned} \tag{20}$$

where the last step follows from (18). There are two sources of uncertainty in the posterior predictive distribution: (1) the fundamental source of variability in the model due to $\sigma^2$, unaccounted for by $\tilde{X}\beta$, and (2) the posterior uncertainty in $\beta$ and $\sigma^2$ as a result of their estimation from a finite sample $y$. As the sample size $n \to \infty$ the variance due to posterior uncertainty disappears, but the predictive uncertainty remains.
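To make the two sources of predictive uncertainty concrete, here is a small NumPy sketch (not from the original notes) that assembles the degrees of freedom, location and scale matrix of the posterior predictive multivariate $t$ and splits the scale into its two parts. The function name `posterior_predictive_t` and all numerical values are hypothetical; the posterior parameters are simply assumed rather than computed from data.

```python
import numpy as np

def posterior_predictive_t(X_new, mu_star, V_star, a_star, b_star):
    """Degrees of freedom, location and scale matrix of the predictive multivariate t."""
    m = X_new.shape[0]
    df = 2.0 * a_star
    loc = X_new @ mu_star
    scale = (b_star / a_star) * (np.eye(m) + X_new @ V_star @ X_new.T)
    return df, loc, scale

# Assumed (hypothetical) posterior parameters and new design matrix.
mu_star = np.array([1.0, -2.0])
V_star = np.array([[0.05, 0.01],
                   [0.01, 0.04]])
a_star, b_star = 12.0, 6.0
X_new = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])

df, loc, scale = posterior_predictive_t(X_new, mu_star, V_star, a_star, b_star)

# The scale matrix splits into a pure-noise part and a parameter-uncertainty part.
noise_part = (b_star / a_star) * np.eye(X_new.shape[0])
param_part = (b_star / a_star) * X_new @ V_star @ X_new.T

# Covariance of a multivariate t with df > 2 is df / (df - 2) times its scale matrix.
cov = df / (df - 2.0) * scale

print(df)    # 24.0
print(loc)   # [ 1. -1. -3.]
```

As more data arrive, $V^*$ shrinks and `param_part` vanishes, while `noise_part` does not; this is the decomposition described in the paragraph above.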
8 Posterior and posterior predictive sampling

Sampling from the NIG posterior distribution is straightforward: for each $l = 1, \ldots, L$, we sample $\sigma^{2(l)} \sim IG(a + n/2, b^*)$ and $\beta^{(l)} \sim MVN(\mu^*, \sigma^{2(l)} V^*)$. The resulting $\{\beta^{(l)}, \sigma^{2(l)}\}_{l=1}^L$ provide samples from the joint distribution $p(\beta, \sigma^2 \mid y)$, while $\{\beta^{(l)}\}_{l=1}^L$ and $\{\sigma^{2(l)}\}_{l=1}^L$ provide samples from the marginal posterior distributions $p(\beta \mid y)$ and $p(\sigma^2 \mid y)$ respectively.

Predictions are carried out by sampling from the posterior predictive density (20). Sampling from this is easy: for each posterior sample $(\beta^{(l)}, \sigma^{2(l)})$, we draw $\tilde{y}^{(l)} \sim N(\tilde{X}\beta^{(l)}, \sigma^{2(l)} I_m)$. The resulting $\{\tilde{y}^{(l)}\}_{l=1}^L$ are samples from the desired posterior predictive distribution in (20); the mean and variance of this sample provide estimates of the predictive mean and variance respectively.

9 The posterior distribution from improper priors

Taking $V_\beta^{-1} \to 0$ (i.e. the null matrix), $a \to -p/2$ and $b \to 0$ leads to the improper prior $p(\beta, \sigma^2) \propto 1/\sigma^2$. The posterior distribution is $NIG(\mu^*, V^*, a^*, b^*)$ with

$$\begin{aligned}
\mu^* &= \hat{\beta} = (X^T X)^{-1} X^T y, \\
V^* &= (X^T X)^{-1}, \\
a^* &= \frac{n - p}{2}, \\
b^* &= \frac{(n - p) s^2}{2}, \quad \text{where } s^2 = \frac{1}{n - p}(y - X\hat{\beta})^T (y - X\hat{\beta}) = \frac{1}{n - p}\, y^T (I - P_X) y
\end{aligned}$$

and $P_X = X (X^T X)^{-1} X^T$.

Here $\hat{\beta}$ is the classical least squares estimate (also the maximum likelihood estimate) of $\beta$, $s^2$ is the classical unbiased estimate of $\sigma^2$ and $P_X$ is the projection matrix onto the column space of $X$. Plugging the above values implied by the improper priors into the more general $NIG(\mu^*, V^*, a^*, b^*)$ density, we find that the marginal posterior distribution of $\sigma^2$ is an $IG\left(\frac{n - p}{2}, \frac{(n - p) s^2}{2}\right)$ (equivalently, the posterior distribution of $(n - p) s^2 / \sigma^2$ is a $\chi^2_{n-p}$ distribution) and the marginal posterior distribution of $\beta$ is a $MVSt_{n-p}(\hat{\beta}, s^2 (X^T X)^{-1})$ with density:

$$MVSt_{n-p}(\hat{\beta}, s^2 (X^T X)^{-1}) = \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n - p}{2}\right) \pi^{p/2} \left|(n - p) s^2 (X^T X)^{-1}\right|^{1/2}} \left[1 + \frac{(\beta - \hat{\beta})^T X^T X (\beta - \hat{\beta})}{(n - p) s^2}\right]^{-\frac{n}{2}}.$$

Predictions with non-informative priors again follow by sampling from the posterior predictive distribution as earlier, but some additional insight is gained by considering analytical expressions
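for the expectation and variance of the posterior predictive distribution. Again, plugging the parameter values implied by the improper priors into (20), we obtain the posterior predictive density as a $MVSt_{n-p}\left(\tilde{X}\hat{\beta}, s^2 (I + \tilde{X} (X^T X)^{-1} \tilde{X}^T)\right)$. Note that

$$E(\tilde{y} \mid \sigma^2, y) = E\left[E(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y\right] = E[\tilde{X}\beta \mid \sigma^2, y] = \tilde{X}\hat{\beta} = \tilde{X} (X^T X)^{-1} X^T y,$$

where the inner expectation averages over $p(\tilde{y} \mid \beta, \sigma^2)$ and the outer expectation averages with respect to $p(\beta \mid \sigma^2, y)$. Note that given $\sigma^2$, the future observations have a mean which does not depend on $\sigma^2$. In analogous fashion,

$$\begin{aligned}
\mathrm{var}(\tilde{y} \mid \sigma^2, y) &= E\left[\mathrm{var}(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y\right] + \mathrm{var}\left[E(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y\right] \\
&= E[\sigma^2 I_m] + \mathrm{var}[\tilde{X}\beta \mid \sigma^2, y] \\
&= (I_m + \tilde{X} (X^T X)^{-1} \tilde{X}^T)\, \sigma^2.
\end{aligned}$$

Thus, conditional on $\sigma^2$, the posterior predictive variance has two components: $\sigma^2 I_m$, representing sampling variation, and $\tilde{X} (X^T X)^{-1} \tilde{X}^T \sigma^2$, due to uncertainty about $\beta$.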
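The improper-prior case and the composition sampling scheme of Section 8 can be sketched together in NumPy. This example is not part of the original notes: the simulated data, the seed, the number of draws and all variable names are hypothetical. It computes the posterior parameters from least squares quantities, draws $\sigma^2$ as the reciprocal of a Gamma variate (shape $a^*$, rate $b^*$), then draws $\beta$ and $\tilde{y}$ conditionally.

```python
import numpy as np

# Hypothetical simulated data (assumptions for illustration only).
rng = np.random.default_rng(42)
n, p, m, L = 50, 3, 4, 5000
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.7, size=n)
X_new = rng.normal(size=(m, p))

# Posterior parameters implied by the improper prior p(beta, sigma^2) ∝ 1/sigma^2.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)      # least squares estimate
resid = y - X @ beta_hat
s2 = resid @ resid / (n - p)                  # unbiased estimate of sigma^2
V_star = np.linalg.inv(XtX)
a_star = (n - p) / 2.0
b_star = (n - p) * s2 / 2.0

# Composition sampling: sigma^2 ~ IG(a*, b*), beta | sigma^2 ~ N(beta_hat, sigma^2 V*).
sigma2 = 1.0 / rng.gamma(shape=a_star, scale=1.0 / b_star, size=L)
chol = np.linalg.cholesky(V_star)             # V* = chol @ chol.T
z = rng.normal(size=(L, p))
beta = beta_hat + np.sqrt(sigma2)[:, None] * (z @ chol.T)

# Posterior predictive draws: y_new | beta, sigma^2 ~ N(X_new beta, sigma^2 I_m).
y_new = beta @ X_new.T + np.sqrt(sigma2)[:, None] * rng.normal(size=(L, m))

pred_mean = y_new.mean(axis=0)                # Monte Carlo estimate of X_new @ beta_hat
pred_var = y_new.var(axis=0)                  # Monte Carlo predictive variances

# Diagonal of the analytic scale matrix s^2 (I + X_new (X^T X)^{-1} X_new^T).
scale_diag = s2 * (1.0 + np.einsum("ij,jk,ik->i", X_new, V_star, X_new))
```

Comparing `pred_mean` with `X_new @ beta_hat`, and `pred_var` with `scale_diag * (n - p) / (n - p - 2)`, gives a quick Monte Carlo check of the analytic predictive moments; the agreement is only approximate and depends on the number of draws `L`.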