CSC 411: Lecture 08: Generative Models for Classification


Richard Zemel, Raquel Urtasun and Sanja Fidler (University of Toronto)

Today
- Classification
- Bayes classifier
- Estimating probability densities from data
- Making decisions: risk

Classification
Given inputs x and classes y, we can do classification in several ways. How?

Discriminative Classifiers
Discriminative classifiers try to either:
- learn mappings directly from the space of inputs X to class labels {0, 1, 2, ..., K}, or
- learn p(y|x) directly

Generative Classifiers
How about this approach: build a model of what the data for a class looks like.
Generative classifiers try to model p(x|y).
Classification is done via Bayes rule (thus they are also called Bayes classifiers).

Generative vs Discriminative
Two approaches to classification:
Discriminative classifiers estimate parameters of the decision boundary/class separator directly from labeled examples:
- learn p(y|x) directly (logistic regression models)
- learn mappings from inputs to classes (least-squares, neural nets)
Generative approach: model the distribution of inputs characteristic of the class (Bayes classifier):
- build a model of p(x|y)
- apply Bayes rule

Bayes Classifier
Aim to diagnose whether a patient has diabetes: classify into one of two classes (yes C=1; no C=0)
Run a battery of tests on the patients, and get x for each patient
Given a patient's results $x = [x_1, x_2, \dots, x_d]^T$, we want to compute class probabilities using Bayes rule:

$$\text{posterior} = p(C|x) = \frac{p(x|C)\,p(C)}{p(x)} = \frac{\text{class likelihood} \times \text{prior}}{\text{evidence}}$$

How can we compute p(x) for the two-class case?

$$p(x) = p(x|C=0)\,p(C=0) + p(x|C=1)\,p(C=1)$$

To compute p(C|x) we need: p(x|C) and p(C)
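As a minimal sketch of this two-class computation (not part of the original slides), assuming the class-conditional densities are available as Python callables, the posterior could be computed like this; the callables and prior are placeholders to be filled in by whatever density model is chosen later.

```python
def posterior_c1(x, p_x_given_c0, p_x_given_c1, prior_c0):
    """Two-class Bayes rule: p(C=1 | x) from class likelihoods and the prior p(C=0)."""
    joint0 = p_x_given_c0(x) * prior_c0            # p(x | C=0) p(C=0)
    joint1 = p_x_given_c1(x) * (1.0 - prior_c0)    # p(x | C=1) p(C=1)
    evidence = joint0 + joint1                     # p(x), summed over the two classes
    return joint1 / evidence                       # p(C=1 | x); p(C=0 | x) = 1 - this
```

Any density model plugged in for the two callables would work; the slides go on to use class-conditional Gaussians.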

Classification: Diabetes Example
Let's start with the simplest case, where the input is only 1-dimensional, for example: white blood cell count (this is our x)
We need to choose a probability distribution p(x|C) that makes sense
Figure: Our example (showing counts of patients for each input value). What distribution should we choose?

Gaussian Discriminant Analysis (Gaussian Bayes Classifier)
Our first generative classifier assumes that p(x|y) is distributed according to a multivariate normal (Gaussian) distribution
This classifier is called Gaussian Discriminant Analysis
Let's first continue our simple case, where inputs are just 1-dim and have a Gaussian distribution:

$$p(x|C) = \frac{1}{\sqrt{2\pi}\,\sigma_C} \exp\left(-\frac{(x-\mu_C)^2}{2\sigma_C^2}\right), \quad \text{with } \mu_C \in \mathbb{R} \text{ and } \sigma_C^2 \in \mathbb{R}^+$$

Notice that we have different parameters for different classes
How can I fit a Gaussian distribution to my data?
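As a quick sketch (not from the slides), the 1-d class-conditional density above can be evaluated directly once the class parameters are known; the parameter values in the usage lines below are made-up placeholders.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Evaluate the 1-d Gaussian density p(x | C) with class-specific mu_C and sigma_C^2."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Hypothetical class parameters (placeholders, not values from the lecture).
print(gaussian_pdf(48.0, mu=50.0, sigma2=25.0))   # p(x=48 | C=0)
print(gaussian_pdf(48.0, mu=65.0, sigma2=64.0))   # p(x=48 | C=1)
```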

MLE for Gaussians
Let's assume that the class-conditional densities are Gaussian:

$$p(x|C) = \frac{1}{\sqrt{2\pi}\,\sigma_C} \exp\left(-\frac{(x-\mu_C)^2}{2\sigma_C^2}\right), \quad \text{with } \mu_C \in \mathbb{R} \text{ and } \sigma_C^2 \in \mathbb{R}^+$$

How can I fit a Gaussian distribution to my data?
We are given a set of training examples $\{x^{(n)}, t^{(n)}\}_{n=1,\dots,N}$ with $t^{(n)} \in \{0, 1\}$, and we want to estimate the model parameters $\{(\mu_0, \sigma_0), (\mu_1, \sigma_1)\}$
First divide the training examples into two classes according to $t^{(n)}$; for each class, take all the examples and fit a Gaussian to model p(x|C)
Let's try maximum likelihood estimation (MLE)

MLE for Gaussians
(note: we drop the subscript C for simplicity of notation)
We assume that the data points we have are independent and identically distributed:

$$p(x^{(1)}, \dots, x^{(N)}|C) = \prod_{n=1}^{N} p(x^{(n)}|C) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x^{(n)}-\mu)^2}{2\sigma^2}\right)$$

Now we want to maximize the likelihood, or minimize its negative (if you think in terms of a loss):

$$\ell_{\text{log-loss}} = -\ln p(x^{(1)}, \dots, x^{(N)}|C) = -\sum_{n=1}^{N} \ln\left(\frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x^{(n)}-\mu)^2}{2\sigma^2}\right)\right)$$

$$= N \ln\left(\sqrt{2\pi}\,\sigma\right) + \sum_{n=1}^{N} \frac{(x^{(n)}-\mu)^2}{2\sigma^2} = \frac{N}{2} \ln\left(2\pi\sigma^2\right) + \sum_{n=1}^{N} \frac{(x^{(n)}-\mu)^2}{2\sigma^2}$$

How do we minimize this function?

Computing the Mean
Let's try to find a closed-form solution: write $\frac{d\ell_{\text{log-loss}}}{d\mu}$ and $\frac{d\ell_{\text{log-loss}}}{d\sigma^2}$ and equate them to 0 to find the parameters $\mu$ and $\sigma^2$

$$\frac{d\ell_{\text{log-loss}}}{d\mu} = \frac{d}{d\mu}\left(\frac{N}{2}\ln\left(2\pi\sigma^2\right) + \sum_{n=1}^{N}\frac{(x^{(n)}-\mu)^2}{2\sigma^2}\right) = -\sum_{n=1}^{N}\frac{x^{(n)}-\mu}{\sigma^2} = \frac{N\mu - \sum_{n=1}^{N} x^{(n)}}{\sigma^2}$$

And equating to zero we have

$$\frac{d\ell_{\text{log-loss}}}{d\mu} = 0 = \frac{N\mu - \sum_{n=1}^{N} x^{(n)}}{\sigma^2} \quad\Rightarrow\quad \mu = \frac{1}{N}\sum_{n=1}^{N} x^{(n)}$$

Computing the Variance
And for $\sigma^2$:

$$\frac{d\ell_{\text{log-loss}}}{d\sigma^2} = \frac{d}{d\sigma^2}\left(\frac{N}{2}\ln\left(2\pi\sigma^2\right) + \sum_{n=1}^{N}\frac{(x^{(n)}-\mu)^2}{2\sigma^2}\right) = \frac{N}{2}\,\frac{2\pi}{2\pi\sigma^2} + \sum_{n=1}^{N}\frac{(x^{(n)}-\mu)^2}{2}\left(-\frac{1}{\sigma^4}\right) = \frac{N}{2\sigma^2} - \frac{\sum_{n=1}^{N}(x^{(n)}-\mu)^2}{2\sigma^4}$$

And equating to zero we have

$$\frac{d\ell_{\text{log-loss}}}{d\sigma^2} = 0 = \frac{N\sigma^2 - \sum_{n=1}^{N}(x^{(n)}-\mu)^2}{2\sigma^4} \quad\Rightarrow\quad \sigma^2 = \frac{1}{N}\sum_{n=1}^{N}(x^{(n)}-\mu)^2$$

MLE of a Gaussian
In summary, we can compute the parameters of a Gaussian distribution in closed form for each class by taking the training points that belong to that class
MLE estimates of the parameters of a Gaussian distribution:

$$\mu = \frac{1}{N}\sum_{n=1}^{N} x^{(n)}, \qquad \sigma^2 = \frac{1}{N}\sum_{n=1}^{N}\left(x^{(n)}-\mu\right)^2$$
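A minimal sketch of these closed-form MLE updates (not part of the slides), fitting one Gaussian per class from labeled 1-d data; the toy arrays are assumptions for illustration only.

```python
import numpy as np

def fit_gaussian_mle(x):
    """Closed-form MLE for a 1-d Gaussian: sample mean and (biased, 1/N) variance."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()   # note 1/N, not 1/(N-1): the MLE estimate
    return mu, sigma2

# Toy labeled data (placeholder values): x are measurements, t are class labels.
x = np.array([45., 50., 52., 48., 62., 68., 70., 65.])
t = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Split by class and fit each class-conditional density separately.
mu0, var0 = fit_gaussian_mle(x[t == 0])
mu1, var1 = fit_gaussian_mle(x[t == 1])
print(mu0, var0, mu1, var1)
```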

Posterior Probability
We now have p(x|C). In order to compute the posterior probability

$$p(C|x) = \frac{p(x|C)\,p(C)}{p(x)} = \frac{p(x|C)\,p(C)}{p(x|C=0)\,p(C=0) + p(x|C=1)\,p(C=1)}$$

given a new observation, we still need to compute the prior.
Prior: in the absence of any observation, what do I know about the problem?

Diabetes Example
The doctor has a prior p(C = 0) = 0.8. How?
A new patient comes in; the doctor measures x = 48
Does the patient have diabetes?

Diabetes Example
Compute p(x = 48|C = 0) and p(x = 48|C = 1) via our estimated Gaussian distributions
Compute the posterior p(C = 0|x = 48) via Bayes rule using the prior (how can we get p(C = 1|x = 48)?)
How can we decide on diabetes/non-diabetes?
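Putting the pieces together, here is a sketch of this computation; the fitted Gaussian parameters below are assumed placeholders (the lecture does not give numeric values), while the prior 0.8 and the measurement x = 48 are from the example.

```python
from scipy.stats import norm

# Assumed fitted class-conditional Gaussians (placeholder parameters).
mu0, sigma0 = 50.0, 5.0    # p(x | C=0), no diabetes
mu1, sigma1 = 65.0, 8.0    # p(x | C=1), diabetes
prior_c0 = 0.8             # doctor's prior p(C=0)

x_new = 48.0
joint0 = norm.pdf(x_new, mu0, sigma0) * prior_c0           # p(x=48 | C=0) p(C=0)
joint1 = norm.pdf(x_new, mu1, sigma1) * (1.0 - prior_c0)   # p(x=48 | C=1) p(C=1)

post_c0 = joint0 / (joint0 + joint1)   # p(C=0 | x=48) via Bayes rule
post_c1 = 1.0 - post_c0                # p(C=1 | x=48)
print(post_c0, post_c1)
print("diagnosis:", "diabetes" if post_c1 > post_c0 else "no diabetes")
```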

Bayes Classifier
Use the Bayes classifier to classify new patients (unseen test examples)
Simple Bayes classifier: estimate the posterior probability of each class
What should the decision criterion be?
The optimal decision is the one that minimizes the expected number of mistakes

Risk of a Classifier
Risk (expected loss) of a C-class classifier y(x):

$$R(y) = \mathbb{E}_{x,t}\left[L(y(x), t)\right] = \int_x \sum_{c=1}^{C} L(y(x), c)\, p(x, t = c)\, dx = \int_x \left[\sum_{c=1}^{C} L(y(x), c)\, p(t = c|x)\right] p(x)\, dx$$

Clearly, it's enough to minimize the conditional risk for any x:

$$R(y|x) = \sum_{c=1}^{C} L(y(x), c)\, p(t = c|x)$$

Conditional Risk of a Classifier
We have assumed a zero-one loss:

$$L(y(x), t) = \begin{cases} 0 & \text{if } y(x) = t \\ 1 & \text{if } y(x) \neq t \end{cases}$$

Conditional risk:

$$R(y|x) = \sum_{c=1}^{C} L(y(x), c)\, p(t = c|x) = 0 \cdot p(t = y(x)|x) + \sum_{c \neq y(x)} p(t = c|x) = 1 - p(t = y(x)|x)$$

To minimize the conditional risk given x, the classifier must decide

$$y(x) = \arg\max_c\, p(t = c|x)$$
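A minimal sketch of the resulting decision rule under 0-1 loss (not from the slides); the posterior vector in the usage lines is a made-up example.

```python
import numpy as np

def bayes_decision(posteriors):
    """Under 0-1 loss the conditional risk is 1 - p(t = y(x) | x),
    so the risk-minimizing decision is the class with the largest posterior."""
    return int(np.argmax(posteriors))

# Example posterior p(t = c | x) over C = 3 classes (placeholder values).
posteriors = np.array([0.2, 0.5, 0.3])
print(bayes_decision(posteriors))   # -> 1
print(1.0 - posteriors.max())       # conditional risk of that decision: 0.5
```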

Log-odds Ratio
The optimal rule $y = \arg\max_c p(t = c|x)$ is equivalent to

$$y = c \;\Longleftrightarrow\; \frac{p(t = c|x)}{p(t = j|x)} \geq 1 \ \ \forall j \neq c \;\Longleftrightarrow\; \log\frac{p(t = c|x)}{p(t = j|x)} \geq 0 \ \ \forall j \neq c$$

For the binary case:

$$y = 1 \;\Longleftrightarrow\; \log\frac{p(t = 1|x)}{p(t = 0|x)} \geq 0$$

Where have we used this rule before?

Gaussian Discriminant Analysis
Consider the 2-class case.
Interesting: when $\sigma_0 = \sigma_1$, the posterior takes the following form:

$$p(t = 1|x) = \frac{1}{1 + e^{-w^\top x}}$$

where w is some appropriate function of $\phi, \mu_0, \mu_1, \sigma_0$, and where we denote the prior by $p(t) = \phi^t (1-\phi)^{1-t}$ (Bernoulli distribution). Prove this!
In this case GDA and logistic regression are equivalent.
When would you choose one over the other?
- GDA makes strong modeling assumptions (the data has a Gaussian distribution)
- If the data really had a Gaussian distribution, then GDA will find a better fit
- Logistic regression is more robust and less sensitive to incorrect modeling assumptions
[Credit: A. Ng]
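A short sketch of the suggested proof (a standard argument, not taken verbatim from the slides), for the 1-d case with shared variance $\sigma_0 = \sigma_1 = \sigma$ and with the bias folded into w, so that $w^\top x$ stands for $w_1 x + w_0$:

```latex
% Two-class GDA with shared variance: the Gaussian normalizing constants cancel in the ratio.
\begin{align*}
p(t=1 \mid x)
  &= \frac{p(x \mid t=1)\,\phi}{p(x \mid t=1)\,\phi + p(x \mid t=0)\,(1-\phi)}
   = \frac{1}{1 + \dfrac{p(x \mid t=0)\,(1-\phi)}{p(x \mid t=1)\,\phi}} \\
  &= \frac{1}{1 + \exp\!\left(\ln\frac{1-\phi}{\phi}
        + \frac{(x-\mu_1)^2 - (x-\mu_0)^2}{2\sigma^2}\right)}
   = \frac{1}{1 + e^{-(w_1 x + w_0)}},
\end{align*}
\[
  \text{with}\quad
  w_1 = \frac{\mu_1 - \mu_0}{\sigma^2}, \qquad
  w_0 = \frac{\mu_0^2 - \mu_1^2}{2\sigma^2} + \ln\frac{\phi}{1-\phi}.
\]
```

Expanding $(x-\mu_1)^2 - (x-\mu_0)^2 = 2(\mu_0-\mu_1)x + \mu_1^2 - \mu_0^2$ gives exactly the linear function in the exponent, which is the logistic regression form.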