CSE 312 Winter 2017 Learning From Data: Maximum Likelihood Estimators (MLE)


Parameter Estimation

Given: independent samples x_1, x_2, ..., x_n from a parametric distribution f(x | θ).
Goal: estimate θ. (Not formally a conditional probability, but the notation is convenient.)
E.g.: given the sample HHTTTTTHTHTTTHH of (possibly biased) coin flips, estimate θ = probability of heads; here f(x | θ) is the Bernoulli probability mass function with parameter θ.

Likelihood (For Discrete Distributions)

P(x | θ): probability of event x given model θ.
Viewed as a function of x (with θ fixed), it is a probability; e.g., Σ_x P(x | θ) = 1.
Viewed as a function of θ (with x fixed), it is called the likelihood; e.g., Σ_θ P(x | θ) can be anything — relative values are the focus.
E.g., if θ = probability of heads in a sequence of coin flips, then P(HHTHH | 0.6) > P(HHTHH | 0.5), i.e., the event HHTHH is more likely when θ = 0.6 than when θ = 0.5.
And what θ makes HHTHH most likely?

Likelihood Function

P(HHTHH | θ), the probability of HHTHH given P(H) = θ, is θ⁴(1 − θ):

  θ      θ⁴(1 − θ)
  0.2    0.0013
  0.5    0.0313
  0.8    0.0819
  0.95   0.0407

[Plot: P(HHTHH | θ) as a function of θ on [0, 1], rising to its maximum near θ = 0.8.]
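
To reproduce the numbers in this table, here is a small Python sketch (my own illustration, not part of the original slides; it assumes numpy is installed) that evaluates L(θ) = θ⁴(1 − θ) at the tabulated values and locates the maximizing θ on a grid.

```python
# Likelihood of the sequence HHTHH (4 heads, 1 tail) as a function of theta.
# A minimal sketch; numpy is assumed to be available.
import numpy as np

def likelihood(theta):
    """P(HHTHH | theta) = theta^4 * (1 - theta)."""
    return theta**4 * (1 - theta)

# Reproduce the table values from the slide.
for theta in [0.2, 0.5, 0.8, 0.95]:
    print(f"theta = {theta:<5} L = {likelihood(theta):.4f}")

# Locate the maximum on a fine grid of theta values in [0, 1].
grid = np.linspace(0, 1, 10001)
print("argmax over grid:", grid[np.argmax(likelihood(grid))])  # ~0.8 = 4/5
```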

Maximum Likelihood Parameter Estimation (For Discrete Distributions)

One (of many) approaches to parameter estimation. The likelihood of (independent) observations x_1, x_2, ..., x_n is

  L(x_1, x_2, ..., x_n | θ) = ∏_{i=1..n} f(x_i | θ)   (*)

As a function of θ, what θ maximizes the likelihood of the data actually observed? Typical approach: solve

  ∂/∂θ L(x_1, ..., x_n | θ) = 0   or   ∂/∂θ log L(x_1, ..., x_n | θ) = 0

(*) In general, the (discrete) likelihood is the joint pmf; the product form follows from independence.

Example 1

n independent coin flips x_1, x_2, ..., x_n; n_0 tails, n_1 heads, n_0 + n_1 = n; θ = probability of heads.

  L(x_1, ..., x_n | θ) = (1 − θ)^{n_0} θ^{n_1}

  (d/dθ) ln L = −n_0/(1 − θ) + n_1/θ = 0  ⇒  θ̂ = n_1/n

(Also verify it's a max, not a min, and not better on the boundary.)

The observed fraction of successes in the sample is the MLE of the success probability in the population.
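
The same estimate can be obtained numerically, mirroring the "set the derivative of the log-likelihood to zero" recipe above. The following sketch is my own illustration (it assumes numpy and scipy are available) and uses the HHTTTTTHTHTTTHH sample from the earlier slide; there n_1 = 6 and n = 15, so θ̂ = 0.4.

```python
# Bernoulli MLE for a coin-flip sample: closed form vs. numerical optimization.
# A minimal sketch; numpy and scipy are assumed to be available.
import numpy as np
from scipy.optimize import minimize_scalar

flips = "HHTTTTTHTHTTTHH"
x = np.array([1 if c == "H" else 0 for c in flips])  # 1 = heads, 0 = tails
n, n1 = len(x), x.sum()

# Closed-form MLE: observed fraction of heads.
theta_closed = n1 / n

# Numerical MLE: minimize the negative log-likelihood over (0, 1).
def neg_log_lik(theta):
    return -(n1 * np.log(theta) + (n - n1) * np.log(1 - theta))

result = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")

print("closed form:", theta_closed)   # 0.4
print("numerical:  ", result.x)       # ~0.4
```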

Parameter Estimation

Given: independent samples x_1, x_2, ..., x_n from a parametric distribution f(x | θ), estimate θ.
E.g.: given n normal samples, estimate the mean and variance:

  f(x) = (1/√(2πσ²)) e^{−(x − µ)²/(2σ²)},   θ = (µ, σ²)

[Plot: the normal density, centered at µ, with µ ± σ marked.]

Ex. 2: I got data; a little birdie tells me it's normal, and promises σ² = 1.

[Plot: the observed data points along the x-axis.]

Which is more likely: (a) this? μ unknown, σ² = 1.

[Plot: a candidate normal curve (μ ± σ marked) overlaid on the observed data.]

Which is more likely: (b) or this? μ unknown, σ² = 1.

[Plot: another candidate normal curve (μ ± σ marked) overlaid on the observed data.]

Which is more likely: (c) or this? μ unknown, σ² = 1.

[Plot: a third candidate normal curve (μ ± σ marked) overlaid on the observed data.]

Which is more likely: (c) or this? μ unknown, σ² = 1. Looks good by eye, but how do I optimize my estimate of μ?

[Plot: the same candidate normal curve (μ ± σ marked) overlaid on the observed data.]

Likelihood (For Continuous Distributions)

The probability of any specific observation x_i is zero, so "likelihood = probability" fails. Instead, as usual, we swap density for pmf: the likelihood of x_1, x_2, ..., x_n is defined to be their joint density, and given independence of the x_i, that is the product of their marginal densities. Why this is sensible:
a) for maximizing likelihood, we really only care about relative likelihoods, and density captures that;
b) it has the desired property that likelihood increases with better fit to the model;
c) if the density at x is f(x), then for any small δ > 0 the probability of a sample within ±δ/2 of x is ≈ δ·f(x), so density really is capturing probability, and δ is constant with respect to θ, so it just drops out of d/dθ log L(·) = 0 (checked numerically below).
Otherwise, the MLE approach is just like the discrete case: form the likelihood and solve ∂/∂θ log L(x_1, ..., x_n | θ) = 0.
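
As a sanity check of point (c) in the slide above, the following sketch (my own, assuming scipy is available) compares the exact probability of landing in a width-δ interval around x with δ·f(x) for a standard normal density.

```python
# Point (c): for small delta, P(x - delta/2 < X < x + delta/2) ≈ delta * f(x).
# A minimal sketch; scipy is assumed to be available.
from scipy.stats import norm

x, delta = 1.3, 1e-4
prob   = norm.cdf(x + delta / 2) - norm.cdf(x - delta / 2)  # exact interval probability
approx = delta * norm.pdf(x)                                # density times interval width

print(prob, approx)  # the two agree to many decimal places
```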

Ex. 2: x_i ~ N(µ, σ²), σ² = 1, µ unknown.

  L(x_1, x_2, ..., x_n | θ) = ∏_{i=1..n} (1/√(2π)) e^{−(x_i − θ)²/2}   (product of densities)

  ln L(x_1, x_2, ..., x_n | θ) = Σ_{i=1..n} [ −(1/2) ln(2π) − (x_i − θ)²/2 ]

  (d/dθ) ln L(x_1, x_2, ..., x_n | θ) = Σ_{i=1..n} (x_i − θ) = (Σ_{i=1..n} x_i) − nθ = 0

  ⇒  θ̂ = (Σ_{i=1..n} x_i)/n = x̄

(And verify it's a max, not a min, and not better on the boundary.)

The sample mean is the MLE of the population mean.
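
A quick numerical check of this result (my own sketch, assuming numpy and scipy are available): maximizing the log-likelihood with σ² = 1 fixed returns the sample mean.

```python
# Ex. 2 check: for x_i ~ N(mu, 1), the MLE of mu is the sample mean.
# A minimal sketch; numpy and scipy are assumed to be available.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # simulated data, true mu = 2

def neg_log_lik(mu):
    # -log L(x | mu) with sigma^2 = 1, dropping the constant (n/2) log(2*pi)
    return 0.5 * np.sum((x - mu) ** 2)

result = minimize_scalar(neg_log_lik)
print("numerical MLE:", result.x)
print("sample mean:  ", x.mean())   # the two should agree
```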

Ex. 3: I got data; a little birdie tells me it's normal (but does not tell me μ or σ²).

[Plot: the observed data points along the x-axis.]

Which is more likely: (a) this? μ, σ² both unknown.

[Plot: a candidate normal curve, roughly μ ± 1 marked, overlaid on the observed data.]

Which is more likely: (b) or this? μ, σ² both unknown.

[Plot: a wider candidate normal curve, roughly μ ± 3 marked, overlaid on the observed data.]

Which is more likely: (c) or this? μ, σ² both unknown.

[Plot: a candidate normal curve, roughly μ ± 1 marked, overlaid on the observed data.]

Which is more likely: (d) or this? μ, σ² both unknown.

[Plot: a narrower candidate normal curve, roughly μ ± 0.5 marked, overlaid on the observed data.]

Which is more likely: (d) or this? μ, σ² both unknown. Looks good by eye, but how do I optimize my estimates of μ and σ²?

[Plot: the same candidate normal curve, roughly μ ± 0.5 marked, overlaid on the observed data.]

Ex. 3: x_i ~ N(µ, σ²), µ and σ² both unknown; write θ_1 = µ, θ_2 = σ².

  ln L(x_1, x_2, ..., x_n | θ_1, θ_2) = Σ_{i=1..n} [ −(1/2) ln(2πθ_2) − (x_i − θ_1)²/(2θ_2) ]

  ∂/∂θ_1 ln L(x_1, x_2, ..., x_n | θ_1, θ_2) = Σ_{i=1..n} (x_i − θ_1)/θ_2 = 0  ⇒  θ̂_1 = (Σ_{i=1..n} x_i)/n = x̄

[Plot: the likelihood surface as a function of θ_1 and θ_2.]

The sample mean is the MLE of the population mean, again. In general, a problem like this results in 2 equations in 2 unknowns. It is easy in this case, since θ_2 drops out of the ∂/∂θ_1 = 0 equation.

Ex. 3 (cont.):

  ln L(x_1, x_2, ..., x_n | θ_1, θ_2) = Σ_{i=1..n} [ −(1/2) ln(2πθ_2) − (x_i − θ_1)²/(2θ_2) ]

  ∂/∂θ_2 ln L(x_1, x_2, ..., x_n | θ_1, θ_2) = Σ_{i=1..n} [ −1/(2θ_2) + (x_i − θ_1)²/(2θ_2²) ] = 0

  ⇒  θ̂_2 = (Σ_{i=1..n} (x_i − θ̂_1)²)/n = s̄²

The sample variance is the MLE of the population variance.
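
The two closed forms can be checked against a direct numerical maximization of the two-parameter log-likelihood. This is my own sketch (it assumes numpy and scipy are available), not part of the slides.

```python
# Ex. 3 check: MLE of (mu, sigma^2) for normal data, closed form vs. numerical.
# A minimal sketch; numpy and scipy are assumed to be available.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=200)

def neg_log_lik(params):
    mu, sigma2 = params
    n = len(x)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

result = minimize(neg_log_lik, x0=[0.0, 1.0], bounds=[(None, None), (1e-6, None)])

print("numerical MLE (mu, sigma^2):", result.x)
print("closed form mu:     ", x.mean())
print("closed form sigma^2:", np.mean((x - x.mean()) ** 2))  # divides by n, not n-1
```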

Ex. 3 (cont.): Bias?

If Y is the sample mean, Y = (Σ_{1≤i≤n} X_i)/n, then E[Y] = (Σ_{1≤i≤n} E[X_i])/n = nµ/n = µ, so the MLE is an unbiased estimator of the population mean.

Similarly, (Σ_{1≤i≤n} (X_i − µ)²)/n is an unbiased estimator of σ². Unfortunately, if µ is unknown and is estimated from the same data as above, then σ̂² = (Σ_{1≤i≤n} (X_i − µ̂)²)/n is a consistent, but biased, estimate of the population variance. (An example of overfitting.) The unbiased estimate divides by n − 1 instead of n; the biased version is still correct in the limit as n → ∞.

Moral: MLE is a great idea, but not a magic bullet.
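
To see the bias concretely, here is a small simulation sketch (my own, assuming numpy is available): averaged over many samples, the n-denominator estimate comes in low by a factor of about (n − 1)/n, while the n − 1 version does not.

```python
# Bias of the MLE variance estimate: average over many simulated samples.
# A minimal sketch; numpy is assumed to be available.
import numpy as np

rng = np.random.default_rng(312)
true_sigma2, n, trials = 4.0, 5, 100_000

samples = rng.normal(loc=0.0, scale=np.sqrt(true_sigma2), size=(trials, n))
mle_var      = samples.var(axis=1, ddof=0)   # divide by n   (the MLE)
unbiased_var = samples.var(axis=1, ddof=1)   # divide by n-1 (unbiased)

print("true sigma^2:             ", true_sigma2)
print("mean of MLE estimate:     ", mle_var.mean())       # ~ (n-1)/n * 4 = 3.2
print("mean of unbiased estimate:", unbiased_var.mean())  # ~ 4.0
```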

Summary

MLE is one way to estimate parameters from data. You choose the form of the model (normal, binomial, ...); the math chooses the value(s) of the parameter(s). Defining the likelihood function (based on the pmf or pdf of the model) is often the critical step; the math/algorithms to optimize it are generic, often simply solving (d/dθ)(log Likelihood(data | θ)) = 0. MLE has the intuitively appealing property that the parameters maximize the likelihood of the observed data; basically it just assumes your sample is representative. Of course, unusual samples will give bad estimates (estimate normal human heights from a sample of NBA stars?), but that is an unlikely event. Often, but not always, MLE has other desirable properties, like being unbiased, or at least consistent.