Bayesian Linear Model: Gory Details

Pubh7440 Notes By Sudipto Banerjee

Let $y = [y_i]_{i=1}^{n}$ be an $n \times 1$ vector of independent observations on a dependent variable (or response) from $n$ experimental units. Associated with the $y_i$ is a $p \times 1$ vector of regressors, say $x_i$, leading to the linear regression model

$$ y = X\beta + \epsilon, \qquad (1) $$

where $X = [x_i^T]_{i=1}^{n}$ is the $n \times p$ matrix of regressors with $i$-th row $x_i^T$ and is assumed fixed, $\beta$ is the $p \times 1$ slope vector of regression coefficients, and $\epsilon = [\epsilon_i]_{i=1}^{n}$ is the vector of random variables representing pure error or measurement error in the dependent variable. For independent observations, we assume $\epsilon \sim MVN(0, \sigma^2 I_n)$, viz. that each component $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$. Furthermore, we will assume that the columns of the matrix $X$ are linearly independent, so that the rank of $X$ is $p$.

1 The NIG conjugate prior family

A popular Bayesian model builds upon the linear regression of $y$ using conjugate priors by specifying

$$ p(\beta, \sigma^2) = p(\beta \mid \sigma^2)\, p(\sigma^2) = N(\mu, \sigma^2 V) \times IG(a, b) = NIG(\mu, V, a, b) $$
$$ = \frac{b^a}{(2\pi)^{p/2} |V|^{1/2} \Gamma(a)} \left(\frac{1}{\sigma^2}\right)^{a + p/2 + 1} \exp\left[-\frac{1}{\sigma^2}\left\{b + \frac{1}{2}(\beta - \mu)^T V^{-1}(\beta - \mu)\right\}\right] $$
$$ \propto \left(\frac{1}{\sigma^2}\right)^{a + p/2 + 1} \exp\left[-\frac{1}{\sigma^2}\left\{b + \frac{1}{2}(\beta - \mu)^T V^{-1}(\beta - \mu)\right\}\right], \qquad (2) $$

where $\Gamma(\cdot)$ represents the Gamma function and the $IG(a, b)$ prior density for $\sigma^2$ is given by

$$ p(\sigma^2) = \frac{b^a}{\Gamma(a)}\left(\frac{1}{\sigma^2}\right)^{a + 1} \exp\left(-\frac{b}{\sigma^2}\right), \quad \sigma^2 > 0, $$

where $a, b > 0$. We call this the Normal-Inverse-Gamma (NIG) prior and denote it as $NIG(\mu, V, a, b)$. The NIG probability distribution is a joint probability distribution of a vector $\beta$ and a scalar $\sigma^2$. If $(\beta, \sigma^2) \sim NIG(\mu, V, a, b)$, then an interesting analytic form results from integrating out $\sigma^2$ from the joint density:

$$ \int NIG(\mu, V, a, b)\, d\sigma^2 = \int \frac{b^a}{(2\pi)^{p/2} |V|^{1/2} \Gamma(a)} \left(\frac{1}{\sigma^2}\right)^{a + p/2 + 1} \exp\left[-\frac{1}{\sigma^2}\left\{b + \frac{1}{2}(\beta - \mu)^T V^{-1}(\beta - \mu)\right\}\right] d\sigma^2 $$
$$ = \frac{b^a\, \Gamma\!\left(a + \frac{p}{2}\right)}{(2\pi)^{p/2} |V|^{1/2} \Gamma(a)} \left[b + \frac{1}{2}(\beta - \mu)^T V^{-1}(\beta - \mu)\right]^{-(a + p/2)} $$
$$ = \frac{\Gamma\!\left(a + \frac{p}{2}\right)}{\Gamma(a)\, \pi^{p/2} (2a)^{p/2} \left|\frac{b}{a} V\right|^{1/2}} \left[1 + \frac{(\beta - \mu)^T \left(\frac{b}{a} V\right)^{-1}(\beta - \mu)}{2a}\right]^{-\frac{2a + p}{2}}. $$

This is a multivariate t density:

$$ MVSt_{\nu}(\mu, \Sigma) = \frac{\Gamma\!\left(\frac{\nu + p}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right) \pi^{p/2} |\nu \Sigma|^{1/2}} \left[1 + \frac{(\beta - \mu)^T \Sigma^{-1}(\beta - \mu)}{\nu}\right]^{-\frac{\nu + p}{2}}, \qquad (3) $$

with $\nu = 2a$ and $\Sigma = \left(\frac{b}{a}\right) V$.

2 The likelihood

The likelihood for the model is defined, up to proportionality, as the joint probability of observing the data given the parameters. Since $X$ is fixed, the likelihood is given by

$$ p(y \mid \beta, \sigma^2) = N(X\beta, \sigma^2 I) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}(y - X\beta)^T (y - X\beta)\right\}. \qquad (4) $$

3 The posterior distribution from the NIG prior

Inference will proceed from the posterior distribution

$$ p(\beta, \sigma^2 \mid y) = \frac{p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)}{p(y)}, $$

where $p(y) = \int p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)\, d\beta\, d\sigma^2$ is the marginal distribution of the data. The key to deriving the joint posterior distribution is the following easily verified multivariate completion of squares or ellipsoidal rectification identity:

$$ u^T A u - 2\alpha^T u = (u - A^{-1}\alpha)^T A (u - A^{-1}\alpha) - \alpha^T A^{-1}\alpha, \qquad (5) $$

where $A$ is a symmetric positive definite (hence invertible) matrix. An application of this identity immediately reveals

$$ \frac{1}{\sigma^2}\left[b + \frac{1}{2}\left\{(\beta - \mu)^T V^{-1}(\beta - \mu) + (y - X\beta)^T (y - X\beta)\right\}\right] = \frac{1}{\sigma^2}\left[b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right], $$

using which we can write the posterior as

$$ p(\beta, \sigma^2 \mid y) \propto \left(\frac{1}{\sigma^2}\right)^{a + (n + p)/2 + 1} \exp\left[-\frac{1}{\sigma^2}\left\{b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right\}\right], \qquad (6) $$

where

$$ \mu^* = (V^{-1} + X^T X)^{-1}(V^{-1}\mu + X^T y), \quad V^* = (V^{-1} + X^T X)^{-1}, $$
$$ a^* = a + n/2, \quad b^* = b + \frac{1}{2}\left[\mu^T V^{-1}\mu + y^T y - \mu^{*T} V^{*-1}\mu^*\right]. $$

This posterior distribution is easily identified as a $NIG(\mu^*, V^*, a^*, b^*)$, proving it to be a conjugate family for the linear regression model. Note that the marginal posterior distribution of $\sigma^2$ is immediately seen to be an $IG(a^*, b^*)$, whose density is given by

$$ p(\sigma^2 \mid y) = \frac{{b^*}^{a^*}}{\Gamma(a^*)}\left(\frac{1}{\sigma^2}\right)^{a^* + 1} \exp\left(-\frac{b^*}{\sigma^2}\right). \qquad (7) $$

The marginal posterior distribution of $\beta$ is obtained by integrating out $\sigma^2$ from the NIG joint posterior as follows:

$$ p(\beta \mid y) = \int p(\beta, \sigma^2 \mid y)\, d\sigma^2 = \int NIG(\mu^*, V^*, a^*, b^*)\, d\sigma^2 $$
$$ \propto \int \left(\frac{1}{\sigma^2}\right)^{a^* + p/2 + 1} \exp\left[-\frac{1}{\sigma^2}\left\{b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right\}\right] d\sigma^2 $$
$$ \propto \left[b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right]^{-(a^* + p/2)}. $$

This is a multivariate t density, $MVSt_{\nu^*}(\mu^*, \Sigma^*)$:

$$ \frac{\Gamma\!\left(\frac{\nu^* + p}{2}\right)}{\Gamma\!\left(\frac{\nu^*}{2}\right) \pi^{p/2} |\nu^* \Sigma^*|^{1/2}} \left[1 + \frac{(\beta - \mu^*)^T \Sigma^{*-1}(\beta - \mu^*)}{\nu^*}\right]^{-\frac{\nu^* + p}{2}}, \qquad (8) $$

with $\nu^* = 2a^*$ and $\Sigma^* = \left(\frac{b^*}{a^*}\right) V^*$.
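As a concrete illustration of the update $(\mu, V, a, b) \mapsto (\mu^*, V^*, a^*, b^*)$, here is a minimal NumPy sketch. The function name and the use of explicit matrix inverses are illustrative choices, not part of the notes (a careful implementation would use Cholesky-based solves); it simply transcribes the formulas above.

```python
import numpy as np

def nig_posterior(y, X, mu, V, a, b):
    """Conjugate NIG update for y = X beta + eps, eps ~ N(0, sigma^2 I).

    Returns (mu_star, V_star, a_star, b_star), the parameters of the
    NIG(mu*, V*, a*, b*) posterior derived in Section 3.
    """
    n, p = X.shape
    V_inv = np.linalg.inv(V)
    V_star = np.linalg.inv(V_inv + X.T @ X)          # V* = (V^{-1} + X'X)^{-1}
    mu_star = V_star @ (V_inv @ mu + X.T @ y)        # mu* = V*(V^{-1} mu + X'y)
    a_star = a + n / 2.0                             # a* = a + n/2
    # b* = b + (mu'V^{-1}mu + y'y - mu*'V*^{-1}mu*)/2, using V*^{-1} = V^{-1} + X'X
    quad = mu @ V_inv @ mu + y @ y - mu_star @ (V_inv + X.T @ X) @ mu_star
    b_star = b + 0.5 * quad
    return mu_star, V_star, a_star, b_star
```

The marginal posteriors (7) and (8) then follow by plugging these four quantities into an $IG(a^*, b^*)$ density and a $MVSt_{2a^*}(\mu^*, (b^*/a^*)V^*)$ density.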

4 A useful expression for the NIG scale parameter

Here we will prove:

$$ b^* = b + \frac{1}{2}(y - X\mu)^T (I + X V X^T)^{-1}(y - X\mu). \qquad (9) $$

On account of the expression for $b^*$ derived in the preceding section, it suffices to prove that

$$ y^T y + \mu^T V^{-1}\mu - \mu^{*T} V^{*-1}\mu^* = (y - X\mu)^T (I + X V X^T)^{-1}(y - X\mu). $$

Substituting $\mu^* = V^*(V^{-1}\mu + X^T y)$ in the left hand side above we obtain:

$$ y^T y + \mu^T V^{-1}\mu - \mu^{*T} V^{*-1}\mu^* = y^T y + \mu^T V^{-1}\mu - (V^{-1}\mu + X^T y)^T V^* (V^{-1}\mu + X^T y) $$
$$ = y^T (I - X V^* X^T) y - 2 y^T X V^* V^{-1}\mu + \mu^T (V^{-1} - V^{-1} V^* V^{-1})\mu. \qquad (10) $$

Further development of the proof will employ two tricky identities. The first is the well-known Sherman-Woodbury-Morrison identity in matrix algebra:

$$ (A + BDC)^{-1} = A^{-1} - A^{-1} B \left(D^{-1} + C A^{-1} B\right)^{-1} C A^{-1}, \qquad (11) $$

where $A$ and $D$ are square matrices that are invertible and $B$ and $C$ are rectangular (square if $A$ and $D$ have the same dimensions) matrices such that the multiplications are well-defined. This identity is easily verified by multiplying the right hand side with $A + BDC$ and simplifying to reduce it to the identity matrix.

Applying (11) twice, once with $A = V$ and $D = (X^T X)^{-1}$ to get the second equality and then with $A = (X^T X)^{-1}$ and $D = V$ to get the third equality, we have

$$ V^{-1} - V^{-1} V^* V^{-1} = V^{-1} - V^{-1}\left(V^{-1} + X^T X\right)^{-1} V^{-1} = \left[V + (X^T X)^{-1}\right]^{-1} $$
$$ = X^T X - X^T X (X^T X + V^{-1})^{-1} X^T X = X^T (I_n - X V^* X^T) X. \qquad (12) $$

The next identity notes that since $V^*(V^{-1} + X^T X) = I_p$, we have $V^* V^{-1} = I_p - V^* X^T X$, so that

$$ X V^* V^{-1} = X - X V^* X^T X = (I_n - X V^* X^T) X. \qquad (13) $$

Substituting (12) and (13) in (10) we obtain

$$ y^T (I_n - X V^* X^T) y - 2 y^T (I_n - X V^* X^T) X \mu + \mu^T X^T (I_n - X V^* X^T) X \mu $$
$$ = (y - X\mu)^T (I_n - X V^* X^T)(y - X\mu) = (y - X\mu)^T (I_n + X V X^T)^{-1}(y - X\mu), \qquad (14) $$

where the last step is again a consequence of (11):

$$ (I_n + X V X^T)^{-1} = I_n - X (V^{-1} + X^T X)^{-1} X^T = I_n - X V^* X^T. $$

5 Marginal distributions the hard way

To obtain the marginal distribution of $y$, we first compute the distribution $p(y \mid \sigma^2)$ by integrating out $\beta$ and subsequently integrate out $\sigma^2$ to obtain $p(y)$. To be precise, we use the expression for $b^*$ derived in the preceding section, proceeding as below:

$$ p(y \mid \sigma^2) = \int p(y \mid \beta, \sigma^2)\, p(\beta \mid \sigma^2)\, d\beta = \int N(X\beta, \sigma^2 I_n) \times N(\mu, \sigma^2 V)\, d\beta $$
$$ = \frac{1}{(2\pi\sigma^2)^{(n+p)/2} |V|^{1/2}} \int \exp\left[-\frac{1}{2\sigma^2}\left\{(y - X\beta)^T (y - X\beta) + (\beta - \mu)^T V^{-1}(\beta - \mu)\right\}\right] d\beta $$
$$ = \frac{1}{(2\pi\sigma^2)^{(n+p)/2} |V|^{1/2}} \int \exp\left[-\frac{1}{2\sigma^2}\left\{(y - X\mu)^T (I + X V X^T)^{-1}(y - X\mu) + (\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right\}\right] d\beta $$
$$ = \frac{1}{(2\pi\sigma^2)^{(n+p)/2} |V|^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\mu)^T (I + X V X^T)^{-1}(y - X\mu)\right\} \int \exp\left\{-\frac{1}{2\sigma^2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right\} d\beta $$
$$ = \frac{1}{(2\pi\sigma^2)^{n/2}} \left(\frac{|V^*|}{|V|}\right)^{1/2} \exp\left\{-\frac{1}{2\sigma^2}(y - X\mu)^T (I + X V X^T)^{-1}(y - X\mu)\right\} $$
$$ = \frac{1}{(2\pi\sigma^2)^{n/2} |I + X V X^T|^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(y - X\mu)^T (I + X V X^T)^{-1}(y - X\mu)\right\} $$
$$ = N(X\mu, \sigma^2 (I + X V X^T)). \qquad (15) $$

Here we have applied the matrix identity

$$ |A + BDC| = |A|\, |D|\, \left|D^{-1} + C A^{-1} B\right| \qquad (16) $$

to obtain

$$ |I_n + X V X^T| = |V|\, |V^{-1} + X^T X| = \frac{|V|}{|V^*|}. $$

Now, the marginal distribution $p(y)$ is obtained by integrating out $\sigma^2$ against the $IG(a, b)$ density as follows:

$$ p(y) = \int p(y \mid \sigma^2)\, p(\sigma^2)\, d\sigma^2 = \int N(X\mu, \sigma^2 (I + X V X^T)) \times IG(a, b)\, d\sigma^2 $$
$$ = \int NIG(X\mu, (I + X V X^T), a, b)\, d\sigma^2 = MVSt_{2a}\!\left(X\mu, \frac{b}{a}(I + X V X^T)\right). \qquad (17) $$

Rewriting our result slightly differently reveals another useful property of the NIG density:

$$ p(y) = \int p(y \mid \beta, \sigma^2)\, p(\beta, \sigma^2)\, d\beta\, d\sigma^2 = \int N(X\beta, \sigma^2 I_n) \times NIG(\mu, V, a, b)\, d\beta\, d\sigma^2 $$
$$ = MVSt_{2a}\!\left(X\mu, \frac{b}{a}(I + X V X^T)\right). \qquad (18) $$

Of course, the computation of $p(y)$ could also be carried out in terms of the NIG distribution parameters more directly as

$$ p(y) = \int p(y \mid \beta, \sigma^2)\, p(\beta, \sigma^2)\, d\beta\, d\sigma^2 = \int N(X\beta, \sigma^2 I_n) \times NIG(\mu, V, a, b)\, d\beta\, d\sigma^2 $$
$$ = \frac{b^a}{(2\pi)^{(n+p)/2} |V|^{1/2} \Gamma(a)} \int \left(\frac{1}{\sigma^2}\right)^{a^* + p/2 + 1} \exp\left[-\frac{1}{\sigma^2}\left\{b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)\right\}\right] d\beta\, d\sigma^2 $$
$$ = \frac{b^a\, \Gamma(a^*)\, (2\pi)^{p/2} |V^*|^{1/2}}{\Gamma(a)\, (2\pi)^{(n+p)/2} |V|^{1/2}\, (b^*)^{a^*}} $$
$$ = \frac{b^a\, \Gamma\!\left(a + \frac{n}{2}\right)}{(2\pi)^{n/2}\, \Gamma(a)} \frac{|V^*|^{1/2}}{|V|^{1/2}} \left[b + \frac{1}{2}\left\{\mu^T V^{-1}\mu + y^T y - \mu^{*T} V^{*-1}\mu^*\right\}\right]^{-(a + n/2)}. \qquad (19) $$
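Since (17)-(19) give $p(y)$ in closed form, the log marginal likelihood can be evaluated directly, e.g. for comparing prior settings. The sketch below is one way to do it (illustrative function name, not from the notes); it uses (9) for $b^*$ and $|I + XVX^T| = |V|/|V^*|$, and builds the $n \times n$ matrix explicitly, which is wasteful for large $n$.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(y, X, mu, V, a, b):
    """log p(y) under the NIG(mu, V, a, b) prior; equivalent to (17)/(19)."""
    n = len(y)
    r = y - X @ mu                                   # y - X mu
    M = np.eye(n) + X @ V @ X.T                      # I + X V X'
    b_star = b + 0.5 * r @ np.linalg.solve(M, r)     # scale update, eq. (9)
    a_star = a + n / 2.0
    _, logdet_M = np.linalg.slogdet(M)               # log |I + X V X'| = log(|V|/|V*|)
    return (a * np.log(b) - a_star * np.log(b_star)
            + gammaln(a_star) - gammaln(a)
            - 0.5 * n * np.log(2.0 * np.pi) - 0.5 * logdet_M)
```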

6 Marginal distribution: the easy way

An alternative and much easier way to derive $p(y \mid \sigma^2)$, avoiding any integration at all, is to note that we can write the above model as

$$ y = X\beta + \epsilon_1, \quad \text{where } \epsilon_1 \sim N(0, \sigma^2 I); $$
$$ \beta = \mu + \epsilon_2, \quad \text{where } \epsilon_2 \sim N(0, \sigma^2 V), $$

where $\epsilon_1$ and $\epsilon_2$ are independent of each other. It then follows that

$$ y = X\mu + X\epsilon_2 + \epsilon_1 \sim N(X\mu, \sigma^2 (I + X V X^T)). $$

This gives $p(y \mid \sigma^2)$. Next we integrate out $\sigma^2$ as in the preceding section to obtain

$$ p(y) = MVSt_{2a}\!\left(X\mu, \frac{b}{a}(I + X V X^T)\right). $$

In fact, the entire distribution theory for the Bayesian regression with NIG priors could proceed by completely avoiding any integration. To be precise, we obtain this marginal distribution first and derive the posterior distribution:

$$ p(\beta, \sigma^2 \mid y) = \frac{p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)}{p(y)} = \frac{NIG(\mu, V, a, b) \times N(X\beta, \sigma^2 I)}{MVSt_{2a}\!\left(X\mu, \frac{b}{a}(I + X V X^T)\right)}, $$

which indeed reduces (after some algebraic manipulation) to the $NIG(\mu^*, V^*, a^*, b^*)$ density.

7 Bayesian Predictions

Next consider Bayesian prediction in the context of the linear regression model. Suppose we now want to apply our regression analysis to a new set of data, where we have observed a new $m \times p$ matrix of regressors $\tilde{X}$, and we wish to predict the corresponding outcome $\tilde{y}$. Observe that if $\beta$ and $\sigma^2$ were known, then the probability law for the predicted outcomes would be described as $\tilde{y} \sim N(\tilde{X}\beta, \sigma^2 I_m)$ and would be independent of $y$. However, these parameters are not known; instead they are summarized through their posterior samples. Therefore, all predictions for the data must follow from the posterior predictive distribution:

$$ p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \beta, \sigma^2)\, p(\beta, \sigma^2 \mid y)\, d\beta\, d\sigma^2 = \int N(\tilde{X}\beta, \sigma^2 I_m) \times NIG(\mu^*, V^*, a^*, b^*)\, d\beta\, d\sigma^2 $$
$$ = MVSt_{2a^*}\!\left(\tilde{X}\mu^*, \frac{b^*}{a^*}(I + \tilde{X} V^* \tilde{X}^T)\right), \qquad (20) $$

where the last step follows from (18). There are two sources of uncertainty in the posterior predictive distribution: (1) the fundamental source of variability in the model due to $\sigma^2$, unaccounted for by $\tilde{X}$, and (2) the posterior uncertainty in $\beta$ and $\sigma^2$ as a result of their estimation from a finite sample $y$. As the sample size $n \to \infty$ the variance due to posterior uncertainty disappears, but the predictive uncertainty remains.
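The predictive law (20) is itself a multivariate t, so its degrees of freedom, location and scale matrix can be assembled directly from the posterior parameters. A small helper along the following lines (hypothetical name, reusing the output of the `nig_posterior` sketch above) makes the two sources of uncertainty explicit: the $I$ term carries the residual variance, while the $\tilde{X} V^* \tilde{X}^T$ term carries the posterior uncertainty about $\beta$.

```python
import numpy as np

def posterior_predictive_t(X_new, mu_star, V_star, a_star, b_star):
    """Degrees of freedom, location and scale matrix of the MVSt in (20)."""
    m = X_new.shape[0]
    df = 2.0 * a_star                                        # nu* = 2 a*
    loc = X_new @ mu_star                                    # X~ mu*
    scale = (b_star / a_star) * (np.eye(m) + X_new @ V_star @ X_new.T)
    return df, loc, scale
```

For df > 2 the predictive mean is `loc` and the predictive covariance is `df / (df - 2) * scale`.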

8 Posterior and posterior predictive sampling

Sampling from the NIG posterior distribution is straightforward: for each $l = 1, \ldots, L$, we sample

$$ \sigma^{2(l)} \sim IG(a + n/2,\, b^*) \quad \text{and} \quad \beta^{(l)} \sim MVN(\mu^*, \sigma^{2(l)} V^*). $$

The resulting $\{\beta^{(l)}, \sigma^{2(l)}\}_{l=1}^{L}$ provide $L$ samples from the joint distribution $p(\beta, \sigma^2 \mid y)$, while $\{\beta^{(l)}\}_{l=1}^{L}$ and $\{\sigma^{2(l)}\}_{l=1}^{L}$ provide samples from the marginal posterior distributions $p(\beta \mid y)$ and $p(\sigma^2 \mid y)$ respectively.

Predictions are carried out by sampling from the posterior predictive density (20). Sampling from this is easy: for each posterior sample $(\beta^{(l)}, \sigma^{2(l)})$, we draw $\tilde{y}^{(l)} \sim N(\tilde{X}\beta^{(l)}, \sigma^{2(l)} I_m)$. The resulting $\{\tilde{y}^{(l)}\}_{l=1}^{L}$ are samples from the desired posterior predictive distribution in (20); the mean and variance of this sample provide estimates of the predictive mean and variance respectively.

9 The posterior distribution from improper priors

Taking $V^{-1} = 0$ (i.e. the null matrix), $a = -p/2$ and $b = 0$ leads to the improper prior $p(\beta, \sigma^2) \propto 1/\sigma^2$. The posterior distribution is $NIG(\mu^*, V^*, a^*, b^*)$ with

$$ \mu^* = \hat{\beta} = (X^T X)^{-1} X^T y, \quad V^* = (X^T X)^{-1}, \quad a^* = \frac{n - p}{2}, \quad b^* = \frac{(n - p) s^2}{2}, $$

where

$$ s^2 = \frac{1}{n - p}(y - X\hat{\beta})^T (y - X\hat{\beta}) = \frac{1}{n - p}\, y^T (I - P_X) y, \quad \text{with } P_X = X(X^T X)^{-1} X^T. $$

Here $\hat{\beta}$ is the classical least squares estimate (also the maximum likelihood estimate) of $\beta$, $s^2$ is the classical unbiased estimate of $\sigma^2$, and $P_X$ is the projection matrix onto the column space of $X$.

Plugging the above values implied by the improper priors into the more general $NIG(\mu^*, V^*, a^*, b^*)$ density, we find that the marginal posterior distribution of $\sigma^2$ is an $IG\!\left(\frac{n - p}{2}, \frac{(n - p) s^2}{2}\right)$ (equivalently, the posterior distribution of $(n - p) s^2 / \sigma^2$ is a $\chi^2_{n - p}$ distribution) and the marginal posterior distribution of $\beta$ is a $MVSt_{n - p}(\hat{\beta}, s^2 (X^T X)^{-1})$ with density

$$ MVSt_{n - p}(\mu^*, s^2 (X^T X)^{-1}) = \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n - p}{2}\right) \pi^{p/2} \left|(n - p) s^2 (X^T X)^{-1}\right|^{1/2}} \left[1 + \frac{(\beta - \hat{\beta})^T X^T X (\beta - \hat{\beta})}{(n - p) s^2}\right]^{-\frac{n}{2}}. $$

Predictions with non-informative priors again follow by sampling from the posterior predictive distribution as earlier, but some additional insight is gained by considering analytical expressions

for the expectation and variance of the posterior predictive distribution. Again, plugging the parameter values implied by the improper priors into (20), we obtain the posterior predictive density as a $MVSt_{n - p}\!\left(\tilde{X}\hat{\beta},\, s^2 (I + \tilde{X}(X^T X)^{-1}\tilde{X}^T)\right)$. Note that

$$ E(\tilde{y} \mid \sigma^2, y) = E\left[E(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y\right] = E[\tilde{X}\beta \mid \sigma^2, y] = \tilde{X}\hat{\beta} = \tilde{X}(X^T X)^{-1} X^T y, $$

where the inner expectation averages over $p(\tilde{y} \mid \beta, \sigma^2)$ and the outer expectation averages with respect to $p(\beta \mid \sigma^2, y)$. Note that given $\sigma^2$, the future observations have a mean which does not depend on $\sigma^2$. In analogous fashion,

$$ \text{var}(\tilde{y} \mid \sigma^2, y) = E\left[\text{var}(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y\right] + \text{var}\left[E(\tilde{y} \mid \beta, \sigma^2, y) \mid \sigma^2, y\right] $$
$$ = E[\sigma^2 I_m \mid \sigma^2, y] + \text{var}[\tilde{X}\beta \mid \sigma^2, y] = \left(I_m + \tilde{X}(X^T X)^{-1}\tilde{X}^T\right)\sigma^2. $$

Thus, conditional on $\sigma^2$, the posterior predictive variance has two components: $\sigma^2 I_m$, representing sampling variation, and $\tilde{X}(X^T X)^{-1}\tilde{X}^T \sigma^2$, due to uncertainty about $\beta$.
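To close, here is a minimal sketch of the composition sampler of Section 8, written for the general NIG posterior; for the improper-prior case of Section 9 one would plug in $\mu^* = \hat{\beta}$, $V^* = (X^T X)^{-1}$, $a^* = (n - p)/2$ and $b^* = (n - p)s^2/2$. The function name and the use of `numpy.random.Generator` are illustrative choices; note that an $IG(a^*, b^*)$ draw is obtained as the reciprocal of a Gamma draw with shape $a^*$ and rate $b^*$.

```python
import numpy as np

def sample_posterior_predictive(X_new, mu_star, V_star, a_star, b_star,
                                L=1000, seed=0):
    """Draw L samples from the posterior predictive (20) by composition."""
    rng = np.random.default_rng(seed)
    m = X_new.shape[0]
    draws = np.empty((L, m))
    for l in range(L):
        sigma2 = 1.0 / rng.gamma(a_star, 1.0 / b_star)            # sigma^2(l) ~ IG(a*, b*)
        beta = rng.multivariate_normal(mu_star, sigma2 * V_star)  # beta(l) | sigma^2(l), y
        draws[l] = rng.normal(X_new @ beta, np.sqrt(sigma2))      # y~(l) ~ N(X~ beta(l), sigma^2(l) I_m)
    return draws  # column means and variances estimate the predictive mean and variance
```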