CS340 Machine learning Bayesian statistics 3

Outline
- Conjugate analysis of µ and σ²
- Bayesian model selection
- Summarizing the posterior

Unknown mean and precision

The likelihood function is

p(D \mid \mu, \lambda) = \frac{\lambda^{n/2}}{(2\pi)^{n/2}} \exp\!\left( -\frac{\lambda}{2} \sum_{i=1}^n (x_i - \mu)^2 \right)
                       = \frac{\lambda^{n/2}}{(2\pi)^{n/2}} \exp\!\left( -\frac{\lambda}{2} \left[ n(\mu - \bar{x})^2 + \sum_{i=1}^n (x_i - \bar{x})^2 \right] \right)

The natural conjugate prior is the normal-gamma:

p(\mu, \lambda) = NG(\mu, \lambda \mid \mu_0, \kappa_0, \alpha_0, \beta_0)
  \stackrel{\mathrm{def}}{=} N(\mu \mid \mu_0, (\kappa_0 \lambda)^{-1}) \, Ga(\lambda \mid \alpha_0, \mathrm{rate} = \beta_0)
  = \frac{1}{Z_{NG}} \, \lambda^{\alpha_0 - 1/2} \exp\!\left( -\frac{\lambda}{2} \left[ \kappa_0 (\mu - \mu_0)^2 + 2\beta_0 \right] \right)
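
The two factors above are standard densities, so the prior is easy to evaluate in code; a minimal Matlab sketch (the helper name ngpdf is ours, not from the lecture):

    function p = ngpdf(mu, lam, mu0, k0, a0, b0)
      % NG(mu, lambda | mu0, kappa0, alpha0, beta0) =
      %   N(mu | mu0, 1/(kappa0*lambda)) * Ga(lambda | alpha0, rate=beta0).
      % Matlab's gampdf is scale-parameterized, so pass scale = 1/beta0.
      p = normpdf(mu, mu0, 1./sqrt(k0.*lam)) .* gampdf(lam, a0, 1/b0);
    end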

Gamma distribution

Used for positive reals:

Ga(x \mid \mathrm{shape} = a, \mathrm{rate} = b) = \frac{b^a}{\Gamma(a)} x^{a-1} e^{-xb}, \quad x, a, b > 0

Ga(x \mid \mathrm{shape} = \alpha, \mathrm{scale} = \beta) = \frac{1}{\beta^\alpha \Gamma(\alpha)} x^{\alpha-1} e^{-x/\beta}

The rate parameterization is the one used by Bishop; Matlab's functions use the scale parameterization.
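
Because Matlab's gampdf expects a scale parameter, a rate-parameterized gamma density can be evaluated by passing the reciprocal of the rate; a minimal sketch:

    a = 2; b = 3;              % shape and rate
    x = linspace(0.01, 5, 200);
    p = gampdf(x, a, 1/b);     % gampdf(x, shape, scale), with scale = 1/rate
    plot(x, p)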

Posterior is also NG

Just update the hyper-parameters:

p(\mu, \lambda \mid D) = NG(\mu, \lambda \mid \mu_n, \kappa_n, \alpha_n, \beta_n)

\mu_n = \frac{\kappa_0 \mu_0 + n \bar{x}}{\kappa_0 + n}, \quad
\kappa_n = \kappa_0 + n, \quad
\alpha_n = \alpha_0 + n/2

\beta_n = \beta_0 + \frac{1}{2} \sum_{i=1}^n (x_i - \bar{x})^2 + \frac{\kappa_0 n (\bar{x} - \mu_0)^2}{2(\kappa_0 + n)}

Derivation of this result not on exam.
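
The update is a few lines of arithmetic; a minimal sketch (the helper name ngUpdate is ours, not from the lecture):

    function [mun, kn, an, bn] = ngUpdate(x, mu0, k0, a0, b0)
      % Posterior normal-gamma hyper-parameters given a data vector x.
      n    = numel(x);
      xbar = mean(x);
      mun  = (k0*mu0 + n*xbar) / (k0 + n);
      kn   = k0 + n;
      an   = a0 + n/2;
      bn   = b0 + 0.5*sum((x - xbar).^2) + k0*n*(xbar - mu0)^2 / (2*(k0 + n));
    end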

Posterior marginals

Precision: p(\lambda \mid D) = Ga(\lambda \mid \alpha_n, \beta_n)

Mean: p(\mu \mid D) = t_{2\alpha_n}\!\left( \mu \mid \mu_n, \frac{\beta_n}{\alpha_n \kappa_n} \right), a Student t distribution.

Derivation of this result not on exam.
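
The marginal over µ can be evaluated with Matlab's standard tpdf via the usual location-scale transformation; a minimal sketch, with example values standing in for the posterior hyper-parameters:

    mun = 0; kn = 10; an = 5; bn = 4;   % example posterior hyper-parameters
    nu = 2*an; s = sqrt(bn/(an*kn));    % degrees of freedom and scale
    mu = linspace(-2, 2, 200);
    pmu = tpdf((mu - mun)/s, nu) / s;   % p(mu | D): a shifted, scaled Student t
    plot(mu, pmu)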

Student t distribution

t_\nu(x \mid \mu, \sigma^2) \propto \left[ 1 + \frac{1}{\nu} \left( \frac{x - \mu}{\sigma} \right)^2 \right]^{-(\nu+1)/2}

Approaches a Gaussian as \nu \to \infty.
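
The Gaussian limit is easy to check numerically; a small sketch comparing t densities to the standard normal:

    x = linspace(-4, 4, 200);
    plot(x, tpdf(x, 1), x, tpdf(x, 100), x, normpdf(x))
    legend('t, \nu = 1', 't, \nu = 100', 'N(0,1)')  % the large-\nu t hugs the Gaussian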

Robustness of t distribution

The Student t is less affected by outliers than the Gaussian.

[Figure: Bishop 2.16, Gaussian vs. Student t fits to data containing outliers]

Posterior predictive distribution

Also a t distribution (fatter tails than a Gaussian, due to the uncertainty in \lambda):

p(x \mid D) = t_{2\alpha_n}\!\left( x \mid \mu_n, \frac{\beta_n (\kappa_n + 1)}{\alpha_n \kappa_n} \right)

Derivation of this result not on exam.
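
One way to see the fat tails is to sample from the predictive by ancestral sampling; a minimal sketch, again with example values for the posterior hyper-parameters:

    mun = 0; kn = 10; an = 5; bn = 4;      % example posterior hyper-parameters
    S = 10000;
    lam = gamrnd(an, 1/bn, S, 1);          % lambda^s ~ Ga(an, rate=bn); gamrnd takes scale
    mu  = mun + randn(S,1)./sqrt(kn*lam);  % mu^s ~ N(mun, 1/(kn*lambda^s))
    x   = mu + randn(S,1)./sqrt(lam);      % x^s ~ N(mu^s, 1/lambda^s), i.e. draws from p(x|D)
    hist(x, 50)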

Uninformative prior

It can be shown (see handout) that an uninformative prior has the form

p(\mu, \lambda) \propto 1/\lambda

This can be emulated using the following hyper-parameters:

\kappa_0 = 0, \quad \alpha_0 = -1/2, \quad \beta_0 = 0

(so that the NG kernel \lambda^{\alpha_0 - 1/2} reduces to \lambda^{-1}). This prior is improper (it does not integrate to 1), but the posterior is proper as long as n \geq 1.

Derivation of this result not on exam.

Outline
- Conjugate analysis of µ and σ²
- Bayesian model selection
- Summarizing the posterior

Bayesian model selection

Suppose we have K possible models, each with parameters \theta_i. The posterior over models is defined using the marginal likelihood ("evidence") p(D \mid M = i), which is the normalizing constant of the posterior over parameters:

p(M = i \mid D) = \frac{p(M = i) \, p(D \mid M = i)}{p(D)}

p(D \mid M = i) = \int p(D \mid \theta, M = i) \, p(\theta \mid M = i) \, d\theta

p(\theta \mid D, M = i) = \frac{p(D \mid \theta, M = i) \, p(\theta \mid M = i)}{p(D \mid M = i)}

Bayes factors

To compare two models, use the posterior odds:

O_{ij} = \frac{p(M_i \mid D)}{p(M_j \mid D)}
       = \underbrace{\frac{p(D \mid M_i)}{p(D \mid M_j)}}_{\text{Bayes factor}} \times \underbrace{\frac{p(M_i)}{p(M_j)}}_{\text{prior odds}}

The Bayes factor BF(i, j) is a Bayesian version of a likelihood ratio test that can be used to compare models of different complexity.

Marginal likelihood for Beta-Bernoulli

Since we know p(\theta \mid D) = Be(\theta \mid \alpha_1', \alpha_0'), with \alpha_1' = \alpha_1 + N_1 and \alpha_0' = \alpha_0 + N_0:

p(\theta \mid D) = \frac{p(\theta) \, p(D \mid \theta)}{p(D)}
                 = \frac{1}{p(D)} \left[ \frac{1}{B(\alpha_1, \alpha_0)} \theta^{\alpha_1 - 1} (1 - \theta)^{\alpha_0 - 1} \right] \left[ \theta^{N_1} (1 - \theta)^{N_0} \right]
                 = \frac{\theta^{\alpha_1' - 1} (1 - \theta)^{\alpha_0' - 1}}{B(\alpha_1', \alpha_0')}

Hence the marginal likelihood is a ratio of normalizing constants:

p(D) = \int p(D \mid \theta) \, p(\theta) \, d\theta = \frac{B(\alpha_1', \alpha_0')}{B(\alpha_1, \alpha_0)}
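
In code the marginal likelihood is one line with the log beta function; a minimal sketch (variable names ours):

    a1 = 1; a0 = 1;   % Beta(1,1) prior pseudo-counts
    N1 = 3; N0 = 7;   % example data: heads and tails
    logML = betaln(a1 + N1, a0 + N0) - betaln(a1, a0)   % log p(D)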

Example: is the Euro coin biased?

Suppose we toss a coin N = 250 times and observe N_1 = 141 heads and N_0 = 109 tails. Consider two hypotheses: H_0, that \theta = 0.5, and H_1, that \theta \neq 0.5. Actually, we can let H_1 use the prior p(\theta) = U(0, 1), since under a pdf p(\theta = 0.5 \mid H_1) = 0, so H_1 effectively excludes \theta = 0.5.

For H_0, the marginal likelihood is

p(D \mid H_0) = 0.5^N

For H_1, the marginal likelihood is

p(D \mid H_1) = \int_0^1 p(D \mid \theta, H_1) \, p(\theta \mid H_1) \, d\theta = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)}

Hence the Bayes factor is

BF(1, 0) = \frac{p(D \mid H_1)}{p(D \mid H_0)} = \frac{B(\alpha_1 + N_1, \alpha_0 + N_0)}{B(\alpha_1, \alpha_0)} \cdot \frac{1}{0.5^N}
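
Putting the pieces together for this data, working in log space to avoid underflow; a minimal sketch using a Beta(1,1) prior:

    N1 = 141; N0 = 109; N = N1 + N0;
    a1 = 1; a0 = 1;   % uniform prior Beta(1,1)
    logBF = betaln(a1 + N1, a0 + N0) - betaln(a1, a0) - N*log(0.5);
    BF = exp(logBF)   % BF(1,0) under this particular prior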

Bayes factor vs prior strength

Let \alpha_1 = \alpha_0 range from 0 to 1000. The largest BF in favor of H_1 (a biased coin) is only 2.0, which is very weak evidence of bias.

[Figure: BF(1, 0) as a function of \alpha]
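
The sweep behind that figure is a one-liner per grid point; a minimal sketch:

    N1 = 141; N0 = 109; N = N1 + N0;
    alphas = linspace(0.01, 1000, 500);   % alpha1 = alpha0 = alpha
    logBF = betaln(alphas + N1, alphas + N0) - betaln(alphas, alphas) - N*log(0.5);
    plot(alphas, exp(logBF)); xlabel('\alpha'); ylabel('BF(1,0)')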

Bayesian Occam's razor

The use of the marginal likelihood p(D \mid M) automatically penalizes overly complex models, since they spread their probability mass very widely (they predict that everything is possible), so the probability of the actual data is small.

[Figure: Bishop 3.13; panels labeled "too simple, cannot predict D", "just right", and "too complex, can predict everything"]

Bayesian Occam's razor for biased coin

Blue line: p(D \mid H_0) = 0.5^N. Red curve: p(D \mid H_1) = \int p(D \mid \theta) \, Beta(\theta \mid 1, 1) \, d\theta.

If we have already observed 4 heads, it is much more likely that we will observe a 5th head than a tail, since \theta gets updated sequentially. If we observe 2 or 3 heads out of 5, the simpler model is more likely.

[Figure: marginal likelihood of the biased coin, plotted against the number of heads in N = 5 tosses]
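
The two curves can be reproduced directly from the formulas above; a minimal sketch for sequences of N = 5 tosses, indexed by their number of heads:

    N = 5; heads = 0:N;
    pH0 = 0.5^N * ones(size(heads));   % p(D|H0) is the same for every sequence
    pH1 = exp(betaln(1 + heads, 1 + N - heads) - betaln(1, 1));   % p(D|H1) per sequence
    plot(heads, pH1, 'r', heads, pH0, 'b'); xlabel('num heads')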

Bayesian Information Criterion (BIC)

If we make a Gaussian approximation to p(\theta \mid D) (the Laplace approximation), and approximate the Hessian term by \log |H| \approx d \log N, the log marginal likelihood becomes

\log p(D) \approx \log p(D \mid \hat{\theta}_{ML}) - \frac{d}{2} \log N

Here d is the dimension, i.e. the number of free parameters. The AIC (Akaike information criterion) is defined as

\log p(D) \approx \log p(D \mid \hat{\theta}_{ML}) - d

Such penalized log-likelihoods can be used for model selection instead of cross-validation.
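
As a toy illustration of the scores (using the slide's sign convention, where higher is better; the common textbook BIC is -2 times this quantity), a minimal sketch for a Gaussian fit by maximum likelihood:

    x = randn(100, 1);                   % synthetic data
    n = numel(x); d = 2;                 % free parameters: mean and variance
    muHat = mean(x); s2Hat = var(x, 1);  % ML estimates (note the 1/n variance)
    logLik = sum(log(normpdf(x, muHat, sqrt(s2Hat))));
    BIC = logLik - (d/2)*log(n)
    AIC = logLik - d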

Outline
- Conjugate analysis of µ and σ²
- Bayesian model selection
- Summarizing the posterior

Summarizing the posterior

If p(\theta \mid D) is too complex to plot, we can compute various summary statistics, such as the posterior mean, mode, and median:

\hat{\theta}_{mean} = E[\theta \mid D]

\hat{\theta}_{MAP} = \arg\max_\theta \, p(\theta \mid D)

\hat{\theta}_{median} = t : p(\theta > t \mid D) = 0.5
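
For a Beta posterior such as the one in the next example, all three summaries are easy to compute; a minimal sketch (the closed-form mode assumes a, b > 1):

    a = 48; b = 54;                       % e.g. Beta(1,1) prior + 47 heads, 53 tails
    thetaMean   = a / (a + b);
    thetaMAP    = (a - 1) / (a + b - 2);  % mode of Beta(a,b), valid for a, b > 1
    thetaMedian = betainv(0.5, a, b)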

Bayesian credible intervals

We can represent our uncertainty using a posterior credible interval (l, u), chosen so that

p(l \leq \theta \leq u \mid D) \geq 1 - \alpha

For a central interval we set

l = F^{-1}(\alpha/2), \quad u = F^{-1}(1 - \alpha/2)

where F is the cdf of the posterior.

Example

We see 47 heads out of 100 trials. Using a Beta(1,1) prior, what is the 95% credible interval for the probability of heads?

    S = 47; N = 100;                 % 47 heads out of 100 trials
    a = S + 1; b = (N - S) + 1;      % Beta(1,1) prior => Beta(a,b) posterior
    alpha = 0.05;
    l = betainv(alpha/2, a, b);      % 2.5% posterior quantile
    u = betainv(1 - alpha/2, a, b);  % 97.5% posterior quantile
    CI = [l, u]                      % 0.3749  0.5673

Posterior sampling

If \theta is high-dimensional, it is hard to visualize p(\theta \mid D). A common strategy is to draw typical values \theta^s \sim p(\theta \mid D) and analyze the resulting samples. E.g., we can generate fake data x^s \sim p(x \mid \theta^s) to see if it looks like the real data (a simple kind of posterior predictive check of model adequacy). See handout for some examples.
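
For the coin model above, this check amounts to two lines of sampling; a minimal sketch (the overlaid comparison with the observed count is our addition):

    S = 1000; N = 100; a = 48; b = 54;  % Beta(48,54) posterior from the example above
    theta = betarnd(a, b, S, 1);        % theta^s ~ p(theta | D)
    fake  = binornd(N, theta);          % fake head-counts x^s ~ p(x | theta^s)
    hist(fake, 30); hold on; plot([47 47], ylim, 'r')  % compare to the observed 47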