START HERE: Instructions. Thanks a lot to John A.W.B. Constanzo and Shi Zong for providing and allowing us to use the LaTeX source files for quick preparation of the HW solution.

The homework was due at 9:00am on Feb 3, 2015. Anything received after that time will not be considered. Answers to all theory questions must also be submitted electronically on Autolab (PDF: LaTeX, or handwritten and scanned). Make sure you prepare the answer to each question separately. Collaboration on solving the homework is allowed, after you have thought about the problems on your own. However, when you do collaborate, you should list your collaborators! You might also have gotten some inspiration from resources (books, online material, etc.). This is OK only after you have tried to solve the problem on your own and could not. In such a case, you should cite your resources. If you do collaborate with someone or use a book or website, you are expected to write up your solution independently. That is, close the book and all of your notes before starting to write up your solution.

LaTeX source of this homework: http://alex.smola.org/teaching/10-701-15/homework/hw5_latex.tar

1 Exponential Family [Zhou, Manzil]

In this problem we will review the exponential family, its significance in Bayesian statistics, and work out a detailed example for the commonly encountered Multivariate Normal distribution and its conjugate prior, the Normal Inverse Wishart distribution.

1.1 Review

The exponential family is a set of probability distributions whose probability density function for $x \in \mathbb{R}^d$ can be expressed in the form

$$p(x \mid \theta) = \exp\left(\langle \phi(x), \theta \rangle - \mathbf{1}^T g(\theta)\right) \qquad (1)$$

where $\phi(x)$ is a sufficient statistic of the distribution, and $\mathbf{1}$ denotes the all-ones vector of the same dimension as $g(\theta)$ (when $g$ is scalar-valued this is just $1$). For exponential families, the sufficient statistic is a function of the data that fully summarizes the data $x$ within the density function. The sufficient statistic of a set of independent identically distributed observations is simply the sum of the individual sufficient statistics, and it encapsulates all the information needed to describe the posterior distribution of the parameters given the data (and hence to derive any desired estimate of the parameters). We will explore this important property in detail below.

$\theta$ is called the natural parameter. The set of values of $\theta$ for which the function $p(x \mid \theta)$ is finite is called the natural parameter space. It can be shown that the natural parameter space is always convex: first show that the log-partition function $g(\theta)$ is a convex function; the claim then follows from first principles. $g(\theta)$ is called the log-partition function because it is the logarithm of a normalization factor, without which $p(x \mid \theta)$ would not be a probability distribution ("partition function" is often used as a synonym of "normalization factor" for historical reasons arising from statistical physics).
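To make the form (1) concrete, here is a minimal numerical sketch (our addition, not part of the original assignment; it assumes numpy is available) that casts the Bernoulli distribution as an exponential family with $\phi(x) = x$, natural parameter $\theta = \log(p/(1-p))$, and scalar log-partition $g(\theta) = \log(1 + e^\theta)$, and checks it against the usual pmf:

    import numpy as np

    # Bernoulli(p) in exponential-family form: p(x|theta) = exp(phi(x)*theta - g(theta))
    # with phi(x) = x, theta = log(p/(1-p)), g(theta) = log(1 + exp(theta)).
    # Here g is scalar-valued, so 1^T g(theta) in (1) is just g(theta).
    p = 0.3
    theta = np.log(p / (1 - p))          # natural parameter
    g = np.log1p(np.exp(theta))          # log-partition function

    for x in (0, 1):
        ef_form = np.exp(x * theta - g)          # exponential-family density
        direct = p**x * (1 - p)**(1 - x)         # usual pmf
        print(x, ef_form, direct)                # the two columns agree

    # Sufficiency: for i.i.d. data, sum(phi(x_i)) carries all the information
    # about theta that the sample does.
    rng = np.random.default_rng(0)
    X = rng.binomial(1, p, size=1000)
    print("sufficient statistic sum(phi(x_i)) =", X.sum())

The last two lines illustrate the sufficiency remark above: the sum of the per-observation statistics is all the posterior will ever need from the sample.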

1.2 Conjugate Priors [10+10+10]

Exponential families are very important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. In the case of a likelihood which belongs to the exponential family there always exists a conjugate prior, which is also in the exponential family. Consider the distribution:

$$p(\theta; m_0, \phi_0) = \exp\left(\langle \phi_0, \theta \rangle - \langle m_0, g(\theta) \rangle - h(m_0, \phi_0)\right) \qquad (2)$$

where $m_0 > 0$ and $\phi_0 \in \mathbb{R}^d$. These are called hyperparameters (parameters controlling parameters).

Question 1: Show that this distribution, i.e. (2), is a member of the exponential family.

There is not much to show. Note

$$p(\theta; m_0, \phi_0) = \exp\left(\left\langle \begin{pmatrix} \theta \\ -g(\theta) \end{pmatrix}, \begin{pmatrix} \phi_0 \\ m_0 \end{pmatrix} \right\rangle - h(m_0, \phi_0)\right). \qquad (3)$$

The sufficient statistic is $\begin{pmatrix} \theta \\ -g(\theta) \end{pmatrix}$, the natural parameter is $\begin{pmatrix} \phi_0 \\ m_0 \end{pmatrix}$, and the log-partition function is $h(m_0, \phi_0)$. (There exist infinitely many ways of splitting $h$ into a vector-valued $\hat{g}$ for which $h = \mathbf{1}^T \hat{g}$.)

Suppose we obtain the data $X = \{x_1, \ldots, x_n\}$, where $x_i \sim p(\cdot \mid \theta)$, i.e. each single observation follows some distribution from the exponential family.

Question 2: First of all, write out the likelihood $p(X \mid \theta)$. Then use (2) as the prior and derive the posterior $p(\theta \mid X)$ exactly, i.e. with proper normalization constant.

The likelihood turns out to be simply

$$p(X \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta) = \exp\left(\sum_{i=1}^{n} \langle \phi(x_i), \theta \rangle - n\,\mathbf{1}^T g(\theta)\right). \qquad (4)$$

Now observe that $h$ is defined so that for all $x, y$ in the hyperparameter space,

$$\int \exp\left(\left\langle \begin{pmatrix} \theta \\ -g(\theta) \end{pmatrix}, \begin{pmatrix} x \\ y \end{pmatrix} \right\rangle - h(y, x)\right) d\theta = 1. \qquad (5)$$

Keeping this in mind, we proceed to compute the posterior as

$$p(\theta \mid X) \propto p(X \mid \theta)\, p(\theta; m_0, \phi_0) \propto \exp\left(\sum_{i=1}^{n} \langle \phi(x_i), \theta \rangle - n\,\mathbf{1}^T g(\theta) + \langle \phi_0, \theta \rangle - \langle m_0, g(\theta) \rangle\right) = \exp\left(\left\langle \phi_0 + \sum_{i=1}^{n} \phi(x_i),\; \theta \right\rangle - \left\langle m_0 + n\mathbf{1},\; g(\theta) \right\rangle\right). \qquad (6)$$

By (5), normalizing yields

$$p(\theta \mid X) = \exp\left(\left\langle \phi_0 + \sum_{i=1}^{n} \phi(x_i),\; \theta \right\rangle - \left\langle m_0 + n\mathbf{1},\; g(\theta) \right\rangle - h\left(m_0 + n\mathbf{1},\; \phi_0 + \sum_{i=1}^{n} \phi(x_i)\right)\right). \qquad (7)$$

If you got Question 2 correct (hopefully you did), observe that the posterior has the same form as the prior; thus (2) is a conjugate prior. The difference between the prior (2) and your answer to Question 2 lies only in the parameters.

Question 3: Let $m_n$ and $\phi_n$ be the parameters of the posterior $p(\theta \mid X)$; show that

$$m_n = m_0 + n\mathbf{1}, \qquad \phi_n = \phi_0 + \sum_{i=1}^{n} \phi(x_i). \qquad (8)$$

We call these the update equations.

This is obvious from (7). Specifically, comparing equation (7) with the prior distribution (2), we see that the two possess the same form, differing only in the parameters:

$$m_n = m_0 + n\mathbf{1}, \qquad \phi_n = \phi_0 + \sum_{i=1}^{n} \phi(x_i). \qquad (9)$$

This shows that the update equations can be written simply in terms of the number of data points and the sufficient statistic of the data. It also provides meaning to the hyperparameters. In particular, $m_0$ corresponds to the effective number of "fake" observations that the prior distribution contributes, and $\phi_0$ corresponds to the total amount that these fake observations contribute to the sufficient statistic (over all observations and fake observations).
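As an illustration (again our addition, not part of the handout), the update equations (8) for the Bernoulli example above reduce to the familiar Beta-posterior counting rule: $m_n$ grows by one per observation and $\phi_n$ accumulates $\sum_i \phi(x_i) = \sum_i x_i$.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.binomial(1, 0.3, size=50)    # i.i.d. Bernoulli(0.3) observations

    # Prior hyperparameters: m0 "fake observations" contributing phi0 to the
    # sufficient statistic (here phi(x) = x, so phi0 counts fake heads).
    m0, phi0 = 2.0, 1.0

    # Update equations (8): m_n = m0 + n, phi_n = phi0 + sum_i phi(x_i).
    n = len(X)
    m_n = m0 + n
    phi_n = phi0 + X.sum()

    # For the Bernoulli likelihood this is exactly a Beta(phi_n, m_n - phi_n)
    # posterior over p, whose mean interpolates prior mean and sample mean.
    print("posterior mean of p:", phi_n / m_n)
    print("sample mean:        ", X.mean())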

1.3 Multivariate Normal Distribution [10+10+10]

The Multivariate Normal $\mathcal{N}(\mu, \Sigma)$ is a distribution that is encountered very often. The distribution is given by:

$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right) \qquad (10)$$

where $\mu \in \mathbb{R}^d$ and $\Sigma \succ 0$ is a symmetric positive definite $d \times d$ matrix. We claim that it belongs to the exponential family.

Question 4: Identify the natural parameters $\theta$ in terms of $\mu$ and $\Sigma$. Also derive the sufficient statistic $\phi(x)$ and the log-partition function $g(\theta)$ in terms of $\mu$ and $\Sigma$. Hint: design a two-dimensional $g(\theta)$ whose first dimension is $\frac{1}{2}\mu^T \Sigma^{-1} \mu$.

We will use the notation $\langle A, B \rangle = \langle \mathrm{vec}(A), \mathrm{vec}(B) \rangle = \mathrm{tr}(A^T B)$ when $A$ and $B$ are matrices. Note that

$$x^T \Sigma^{-1} x = \mathrm{tr}(x^T \Sigma^{-1} x) = \mathrm{tr}(x x^T \Sigma^{-1}) = \langle x x^T, \Sigma^{-1} \rangle. \qquad (11)$$

Now

$$p(x \mid \mu, \Sigma) = \exp\left(-\frac{1}{2} x^T \Sigma^{-1} x + x^T \Sigma^{-1} \mu - \frac{1}{2}\mu^T \Sigma^{-1} \mu - \frac{1}{2}\log\left[(2\pi)^d |\Sigma|\right]\right)$$
$$= \exp\left(-\frac{1}{2}\langle x x^T, \Sigma^{-1} \rangle + \langle x, \Sigma^{-1}\mu \rangle - \frac{1}{2}\mu^T \Sigma^{-1}\mu - \frac{1}{2}\log\left[(2\pi)^d |\Sigma|\right]\right)$$
$$= \exp\left(\left\langle \begin{pmatrix} x \\ x x^T \end{pmatrix}, \begin{pmatrix} \Sigma^{-1}\mu \\ -\frac{1}{2}\Sigma^{-1} \end{pmatrix} \right\rangle - \mathbf{1}^T \begin{pmatrix} \frac{1}{2}\mu^T \Sigma^{-1} \mu \\ \frac{d}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| \end{pmatrix}\right). \qquad (12)$$

The natural parameter, sufficient statistic and log-partition function are obtained by inspection:

the natural parameter is $\theta = \begin{pmatrix} \Sigma^{-1}\mu \\ -\frac{1}{2}\Sigma^{-1} \end{pmatrix}$, the sufficient statistic is $\phi(x) = \begin{pmatrix} x \\ x x^T \end{pmatrix}$, and the log-partition function is $g(\theta) = \begin{pmatrix} \frac{1}{2}\mu^T \Sigma^{-1}\mu \\ \frac{d}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| \end{pmatrix}$.
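A quick numerical sanity check of Question 4 (our addition, assuming scipy is available): evaluate the density (10) both directly and through the natural parameters and log-partition function identified above.

    import numpy as np
    from scipy.stats import multivariate_normal

    d = 3
    rng = np.random.default_rng(0)
    mu = rng.normal(size=d)
    A = rng.normal(size=(d, d))
    Sigma = A @ A.T + d * np.eye(d)       # symmetric positive definite
    x = rng.normal(size=d)

    Lambda = np.linalg.inv(Sigma)
    theta1 = Lambda @ mu                  # first natural parameter,  Sigma^{-1} mu
    theta2 = -0.5 * Lambda                # second natural parameter, -1/2 Sigma^{-1}

    # <phi(x), theta> with phi(x) = (x, x x^T); the matrix inner product is a trace.
    inner = x @ theta1 + np.sum(np.outer(x, x) * theta2)
    # 1^T g(theta) with the two-dimensional g(theta) from Question 4.
    g_sum = (0.5 * mu @ Lambda @ mu
             + 0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma))))

    print(np.exp(inner - g_sum))                  # exponential-family form
    print(multivariate_normal.pdf(x, mu, Sigma))  # reference density; agrees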

The conjugate prior for the Multivariate Normal distribution can be parametrized as the Normal Inverse Wishart distribution $\mathcal{NIW}(\mu_0, \kappa_0, \Sigma_0, \nu_0)$. The distribution is given by:

$$p(\mu, \Sigma; \mu_0, \kappa_0, \Sigma_0, \nu_0) = \mathcal{N}(\mu \mid \mu_0, \Sigma/\kappa_0)\, \mathcal{W}^{-1}(\Sigma \mid \Sigma_0, \nu_0) = \frac{\kappa_0^{d/2}\, |\Sigma_0|^{\nu_0/2}\, |\Sigma|^{-\frac{\nu_0+d+2}{2}}}{2^{\frac{(\nu_0+1)d}{2}}\, \pi^{d/2}\, \Gamma_d\!\left(\frac{\nu_0}{2}\right)}\; e^{-\frac{\kappa_0}{2}(\mu-\mu_0)^T \Sigma^{-1} (\mu-\mu_0) - \frac{1}{2}\mathrm{tr}(\Sigma_0 \Sigma^{-1})} \qquad (13)$$

where $\kappa_0, \nu_0 > 0$, $\mu_0 \in \mathbb{R}^d$, $\Sigma_0 \succ 0$ is a symmetric positive definite $d \times d$ matrix, and $\Gamma_d$ is the multivariate gamma function.

Question 5: Notice that the Normal Inverse Wishart distribution fits into the form of (2). Find the mapping between $(\mu_0, \kappa_0, \Sigma_0, \nu_0)$ and $(m_0, \phi_0)$, and the function $h(m_0, \phi_0)$ in terms of $(\mu_0, \kappa_0, \Sigma_0, \nu_0)$. Hint: $m_0$ and $g(\theta)$ are two-dimensional.

A bit of algebra shows that:

$$p(\mu, \Sigma) = \exp\Big(-\frac{\kappa_0}{2}\mu^T \Sigma^{-1}\mu + \kappa_0 \mu_0^T \Sigma^{-1}\mu - \frac{\kappa_0}{2}\mu_0^T \Sigma^{-1}\mu_0 - \frac{1}{2}\mathrm{tr}(\Sigma_0 \Sigma^{-1}) - \frac{\nu_0+d+2}{2}\log|\Sigma| + \frac{d}{2}\log\kappa_0 + \frac{\nu_0}{2}\log|\Sigma_0| - \frac{(\nu_0+1)d}{2}\log 2 - \frac{d}{2}\log\pi - \log\Gamma_d\!\left(\frac{\nu_0}{2}\right)\Big)$$

$$= \exp\left(\left\langle \begin{pmatrix} \kappa_0 \mu_0 \\ \Sigma_0 + \kappa_0\mu_0\mu_0^T \end{pmatrix}, \begin{pmatrix} \Sigma^{-1}\mu \\ -\frac{1}{2}\Sigma^{-1} \end{pmatrix} \right\rangle - \left\langle \begin{pmatrix} \kappa_0 \\ \nu_0+d+2 \end{pmatrix}, \begin{pmatrix} \frac{1}{2}\mu^T\Sigma^{-1}\mu \\ \frac{d}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| \end{pmatrix} \right\rangle - h(m_0,\phi_0)\right).$$

Comparing terms with (2), we obtain

$$m_0 = \begin{pmatrix} \kappa_0 \\ \nu_0+d+2 \end{pmatrix}, \qquad \phi_0 = \begin{pmatrix} \kappa_0\mu_0 \\ \Sigma_0 + \kappa_0\mu_0\mu_0^T \end{pmatrix},$$
$$h(m_0, \phi_0) = -\frac{d(d+1)}{2}\log(2\pi) - \frac{\nu_0 d}{2}\log\pi - \frac{d}{2}\log\kappa_0 - \frac{\nu_0}{2}\log|\Sigma_0| + \log\Gamma_d\!\left(\frac{\nu_0}{2}\right). \qquad (14)$$

Equipped with these results, we move on to tackle the problem of finding the posterior for $(\mu, \Sigma)$. One could follow a brute-force approach using (10) and (13), but things can get really messy. We will adopt a more elegant and easy approach, exploiting the fact that these distributions belong to the exponential family.

Question 6: Using the update equations described in Question 3 and your answers to Questions 4 and 5, directly write down the posterior $p(\mu, \Sigma \mid X)$. Just providing the appropriate update equations suffices.

Applying the update equations (8) from Question 3, first considering $m_n$:

$$m_n = \begin{pmatrix} \kappa_n \\ \nu_n+d+2 \end{pmatrix} = \begin{pmatrix} \kappa_0 + n \\ \nu_0+d+2+n \end{pmatrix}. \qquad (15)$$

Similarly, for $\phi_n$ we have

$$\phi_n = \begin{pmatrix} \kappa_n\mu_n \\ \Sigma_n + \kappa_n\mu_n\mu_n^T \end{pmatrix} = \begin{pmatrix} \kappa_0\mu_0 + \sum_{i=1}^n x_i \\ \Sigma_0 + \kappa_0\mu_0\mu_0^T + \sum_{i=1}^n x_i x_i^T \end{pmatrix}. \qquad (16)$$

Thus, we obtain the following update equations in terms of $(\mu_0, \kappa_0, \Sigma_0, \nu_0)$:

$$\kappa_n = \kappa_0 + n, \qquad \mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0+n}, \qquad \nu_n = \nu_0 + n, \qquad \Sigma_n = \Sigma_0 + \sum_{i=1}^n x_i x_i^T + \kappa_0\mu_0\mu_0^T - \kappa_n\mu_n\mu_n^T \qquad (17)$$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$. This is one of the remarkable cases where working out the general case saves you more effort than working out the special case! The algebra can otherwise become very complicated, e.g. see http://www.cs.ubc.ca/~murphyk/papers/bayesGauss.pdf where they have explicitly done the complicated math! We hope that after solving this homework, you can take advantage of this neat short-cut :)
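The update equations (17) translate directly into code. Below is a minimal sketch (ours, not from the handout; the function name niw_posterior is our own) of the Normal Inverse Wishart posterior update:

    import numpy as np

    def niw_posterior(X, mu0, kappa0, Sigma0, nu0):
        """Update equations (17) for the NIW prior given data rows X (n x d)."""
        n = X.shape[0]
        xbar = X.mean(axis=0)
        kappa_n = kappa0 + n
        nu_n = nu0 + n
        mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
        # Sigma_n = Sigma0 + sum_i x_i x_i^T + kappa0 mu0 mu0^T - kappa_n mu_n mu_n^T
        Sigma_n = (Sigma0 + X.T @ X
                   + kappa0 * np.outer(mu0, mu0)
                   - kappa_n * np.outer(mu_n, mu_n))
        return mu_n, kappa_n, Sigma_n, nu_n

    # Example usage on synthetic data.
    rng = np.random.default_rng(0)
    d = 2
    X = rng.multivariate_normal([1.0, -1.0], np.eye(d), size=100)
    mu_n, kappa_n, Sigma_n, nu_n = niw_posterior(X, np.zeros(d), 1.0, np.eye(d), d + 2)
    print(mu_n)        # pulled toward the sample mean as n grows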

If we want the same expression as Wikipedia for $\Sigma_n$, we need the following rearrangement:

$$\Sigma_n = \Sigma_0 + \sum_{i=1}^n x_i x_i^T + \kappa_0\mu_0\mu_0^T - \kappa_n\mu_n\mu_n^T = \Sigma_0 + C + \sum_{i=1}^n x_i\bar{x}^T + \sum_{i=1}^n \bar{x}x_i^T - \sum_{i=1}^n \bar{x}\bar{x}^T + \kappa_0\mu_0\mu_0^T - \kappa_n\mu_n\mu_n^T = \Sigma_0 + C + n\bar{x}\bar{x}^T + \kappa_0\mu_0\mu_0^T - \kappa_n\mu_n\mu_n^T \qquad (18)$$

where $C = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$. Now, further expanding $\mu_n$, we obtain:

$$\Sigma_n = \Sigma_0 + C + n\bar{x}\bar{x}^T + \kappa_0\mu_0\mu_0^T - \frac{(\kappa_0\mu_0 + n\bar{x})(\kappa_0\mu_0 + n\bar{x})^T}{\kappa_0 + n} = \Sigma_0 + C + \frac{n\kappa_0}{\kappa_0+n}\left(\mu_0\mu_0^T - \mu_0\bar{x}^T - \bar{x}\mu_0^T + \bar{x}\bar{x}^T\right) = \Sigma_0 + C + \frac{n\kappa_0}{\kappa_0+n}(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T. \qquad (19)$$
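The rearrangement (19) is easy to verify numerically; the following sketch (our addition) checks that both expressions for $\Sigma_n$ coincide:

    import numpy as np

    rng = np.random.default_rng(1)
    d, n, kappa0 = 3, 20, 2.5
    mu0 = rng.normal(size=d)
    Sigma0 = np.eye(d)
    X = rng.normal(size=(n, d))

    xbar = X.mean(axis=0)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n

    # Form (17): sufficient-statistic form.
    S17 = Sigma0 + X.T @ X + kappa0 * np.outer(mu0, mu0) - kappa_n * np.outer(mu_n, mu_n)

    # Form (19): scatter matrix C around the mean plus a shrinkage term.
    C = (X - xbar).T @ (X - xbar)
    S19 = Sigma0 + C + (n * kappa0 / (kappa0 + n)) * np.outer(xbar - mu0, xbar - mu0)

    print(np.allclose(S17, S19))   # True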

1.4 Posterior Predictive - Bonus [10+10]

Another quantity which is often of interest in Bayesian statistics is the posterior predictive. The posterior predictive distribution is the distribution of unobserved observations (predictions) conditional on the observed data. Specifically, it is computed by marginalising over the parameters, using the posterior distribution:

$$p(\tilde{x} \mid X) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid X)\, d\theta. \qquad (20)$$

The posterior predictive distribution for a distribution in the exponential family has a rather nice form.

Question 7: Show that the posterior predictive for the distribution (1) with prior (2) is given by:

$$p(\tilde{x} \mid X) = \exp\left\{h\left(m_n + \mathbf{1},\; \phi_n + \phi(\tilde{x})\right) - h(m_n, \phi_n)\right\}. \qquad (21)$$

We have

$$p(\tilde{x} \mid X) = \int \exp\left[\langle \phi(\tilde{x}), \theta \rangle - \mathbf{1}^T g(\theta) + \langle \phi_n, \theta \rangle - \langle m_n, g(\theta) \rangle - h(m_n, \phi_n)\right] d\theta = \int \exp\left[\langle \phi(\tilde{x}) + \phi_n, \theta \rangle - \langle m_n + \mathbf{1}, g(\theta) \rangle - h(m_n, \phi_n)\right] d\theta = \exp\left[-h(m_n, \phi_n)\right] \int \exp\left[\langle \phi(\tilde{x}) + \phi_n, \theta \rangle - \langle m_n + \mathbf{1}, g(\theta) \rangle\right] d\theta. \qquad (22)$$

Since

$$\exp\left[\langle \phi(\tilde{x}) + \phi_n, \theta \rangle - \langle m_n + \mathbf{1}, g(\theta) \rangle - h\left(m_n + \mathbf{1},\; \phi_n + \phi(\tilde{x})\right)\right] = p\left(\theta;\; m_n + \mathbf{1},\; \phi_n + \phi(\tilde{x})\right) \qquad (23)$$

is a probability distribution, we have

$$1 = \int \exp\left[\langle \phi(\tilde{x}) + \phi_n, \theta \rangle - \langle m_n + \mathbf{1}, g(\theta) \rangle - h\left(m_n + \mathbf{1},\; \phi_n + \phi(\tilde{x})\right)\right] d\theta = \exp\left[-h\left(m_n + \mathbf{1},\; \phi_n + \phi(\tilde{x})\right)\right] \int \exp\left[\langle \phi(\tilde{x}) + \phi_n, \theta \rangle - \langle m_n + \mathbf{1}, g(\theta) \rangle\right] d\theta, \qquad (24)$$

leading to

$$\int \exp\left[\langle \phi(\tilde{x}) + \phi_n, \theta \rangle - \langle m_n + \mathbf{1}, g(\theta) \rangle\right] d\theta = \exp\left[h\left(m_n + \mathbf{1},\; \phi_n + \phi(\tilde{x})\right)\right]. \qquad (25)$$

Combining this result with (22) yields the desired result:

$$p(\tilde{x} \mid X) = \exp\left\{h\left(m_n + \mathbf{1},\; \phi_n + \phi(\tilde{x})\right) - h(m_n, \phi_n)\right\}. \qquad (26)$$
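For intuition, here is a check of (21) in the Bernoulli case (our addition; the closed form $h(m, \phi) = \log B(\phi, m - \phi)$ is our own derivation, obtained by substituting $p = \sigma(\theta)$ in (5), and $m_n$ is scalar here because $g$ is one-dimensional). Formula (21) then reproduces the familiar rule $P(\tilde{x} = 1 \mid X) = \phi_n / m_n$:

    import numpy as np
    from scipy.special import betaln

    # Bernoulli case: g(theta) = log(1 + e^theta) and, by (5),
    # h(m, phi) = log B(phi, m - phi)   (substitute p = sigmoid(theta)).
    def h(m, phi):
        return betaln(phi, m - phi)

    m_n, phi_n = 12.0, 4.0    # posterior hyperparameters after some data

    # Posterior predictive via (21):
    # p(x~|X) = exp(h(m_n + 1, phi_n + phi(x~)) - h(m_n, phi_n)).
    p1 = np.exp(h(m_n + 1, phi_n + 1) - h(m_n, phi_n))   # phi(1) = 1
    p0 = np.exp(h(m_n + 1, phi_n + 0) - h(m_n, phi_n))   # phi(0) = 0
    print(p1, p0, p1 + p0)    # p1 = phi_n/m_n = 1/3, and p0 + p1 = 1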

The result of the previous problem can be specialized to the Multivariate Normal case.

Question 8: Find the posterior predictive for the case of the Multivariate Normal distribution with the Normal Inverse Wishart prior, with parameters as described in Section 1.3, by using Question 7. Hint: the matrix determinant lemma might come in handy, http://en.wikipedia.org/wiki/Matrix_determinant_lemma.

Adding the new point $\tilde{x}$ leads to the following update:

$$\kappa' = \kappa_n + 1, \qquad \nu' = \nu_n + 1, \qquad \mu' = \frac{\kappa_n\mu_n + \tilde{x}}{\kappa_n + 1}, \qquad \Sigma' = \Sigma_n + \tilde{x}\tilde{x}^T + \kappa_n\mu_n\mu_n^T - \kappa'\mu'\mu'^T = \Sigma_n + \frac{\kappa_n}{\kappa_n+1}(\tilde{x} - \mu_n)(\tilde{x} - \mu_n)^T. \qquad (27)$$

Now, substituting (14) into (21),

$$\frac{\exp h\left(m_n + \mathbf{1},\; \phi_n + \phi(\tilde{x})\right)}{\exp h(m_n, \phi_n)} = \frac{(2\pi)^{-\frac{d(d+1)}{2}}\,\pi^{-\frac{(\nu_n+1)d}{2}}\,(\kappa_n+1)^{-\frac{d}{2}}\,|\Sigma'|^{-\frac{\nu_n+1}{2}}\,\Gamma_d\!\left(\frac{\nu_n+1}{2}\right)}{(2\pi)^{-\frac{d(d+1)}{2}}\,\pi^{-\frac{\nu_n d}{2}}\,\kappa_n^{-\frac{d}{2}}\,|\Sigma_n|^{-\frac{\nu_n}{2}}\,\Gamma_d\!\left(\frac{\nu_n}{2}\right)} = \frac{\Gamma\!\left(\frac{\nu_n+1}{2}\right)}{\Gamma\!\left(\frac{\nu_n-d+1}{2}\right)} \cdot \frac{1}{\pi^{d/2}}\left(\frac{\kappa_n}{1+\kappa_n}\right)^{d/2} \frac{|\Sigma_n|^{\nu_n/2}}{|\Sigma'|^{\frac{\nu_n+1}{2}}}. \qquad (28)$$

Next, rewrite $|\Sigma'|$ using the matrix determinant lemma:

$$|\Sigma'| = \left|\Sigma_n + \frac{\kappa_n}{\kappa_n+1}(\tilde{x}-\mu_n)(\tilde{x}-\mu_n)^T\right| = \left[1 + \frac{\kappa_n}{\kappa_n+1}(\tilde{x}-\mu_n)^T \Sigma_n^{-1} (\tilde{x}-\mu_n)\right] |\Sigma_n|. \qquad (29)$$

Putting it all together, we get

$$p(\tilde{x} \mid X) = \frac{\Gamma\!\left(\frac{\nu_n+1}{2}\right)}{\Gamma\!\left(\frac{\nu_n-d+1}{2}\right)\pi^{d/2}} \left(\frac{\kappa_n}{\kappa_n+1}\right)^{d/2} |\Sigma_n|^{-1/2} \left[1 + \frac{\kappa_n}{\kappa_n+1}(\tilde{x}-\mu_n)^T \Sigma_n^{-1} (\tilde{x}-\mu_n)\right]^{-\frac{\nu_n+1}{2}}. \qquad (30)$$

Further, using the Student-t distribution formula, one can show that

$$p(\tilde{x} \mid X) = t_{\nu_n-d+1}\!\left(\tilde{x} \,\Big|\, \mu_n,\; \frac{(\kappa_n+1)\,\Sigma_n}{\kappa_n(\nu_n-d+1)}\right).$$
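As a closing sanity check (our addition, assuming scipy >= 1.6 for multivariate_t): evaluate (30) directly and compare it against the multivariate Student-t density with $\nu_n - d + 1$ degrees of freedom and scale $(\kappa_n+1)\Sigma_n/(\kappa_n(\nu_n-d+1))$; the hyperparameter values below are arbitrary.

    import numpy as np
    from scipy.special import gammaln
    from scipy.stats import multivariate_t   # available in scipy >= 1.6

    d = 2
    mu_n = np.array([0.5, -0.2])
    Sigma_n = np.array([[2.0, 0.3], [0.3, 1.0]])
    kappa_n, nu_n = 5.0, 7.0
    x = np.array([1.0, 0.0])

    # Equation (30), evaluated in log space for numerical stability.
    quad = (x - mu_n) @ np.linalg.solve(Sigma_n, x - mu_n)
    log_p = (gammaln((nu_n + 1) / 2) - gammaln((nu_n - d + 1) / 2)
             - (d / 2) * np.log(np.pi)
             + (d / 2) * np.log(kappa_n / (kappa_n + 1))
             - 0.5 * np.log(np.linalg.det(Sigma_n))
             - ((nu_n + 1) / 2) * np.log1p(kappa_n / (kappa_n + 1) * quad))
    print(np.exp(log_p))

    # Reference: multivariate Student-t with nu_n - d + 1 degrees of freedom.
    df = nu_n - d + 1
    shape = (kappa_n + 1) * Sigma_n / (kappa_n * df)
    print(multivariate_t.pdf(x, loc=mu_n, shape=shape, df=df))   # agrees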