Learning From Data: Maximum Likelihood Estimators (MLE)


Parameter Estimation

Assuming a sample x_1, x_2, ..., x_n is from a parametric distribution f(x | θ), estimate θ.

E.g.: given the sample HHTTTTTHTHTTTHH of (possibly biased) coin flips, estimate θ = probability of Heads. Here f(x | θ) is the Bernoulli probability mass function with parameter θ.

Likelihood

P(x | θ): probability of event x given model θ.

Viewed as a function of x (θ fixed), it is a probability; e.g., Σ_x P(x | θ) = 1.

Viewed as a function of θ (x fixed), it is a likelihood; Σ_θ P(x | θ) can be anything, and only relative values are of interest.

E.g., if θ = probability of heads in a sequence of coin flips, then P(HHTHH | θ = 0.6) > P(HHTHH | θ = 0.5), i.e., the event HHTHH is more likely when θ = 0.6 than when θ = 0.5. And what θ makes HHTHH most likely?
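To make the coin comparison concrete, here is a small Python sketch (the helper name is illustrative, not from the slides) that evaluates P(HHTHH | θ) at θ = 0.6 and θ = 0.5:

```python
# Likelihood of an H/T sequence, viewed as a function of theta with x fixed.
def seq_likelihood(seq, theta):
    """Probability of the sequence given P(Heads) = theta."""
    p = 1.0
    for c in seq:
        p *= theta if c == "H" else 1 - theta
    return p

l6 = seq_likelihood("HHTHH", 0.6)  # 0.6^4 * 0.4 = 0.05184
l5 = seq_likelihood("HHTHH", 0.5)  # 0.5^5 = 0.03125
print(l6 > l5)  # True: HHTHH is more likely under theta = 0.6
```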

Likelihood Function

Probability of HHTHH, given P(H) = θ, is L(θ) = θ^4 (1 − θ):

θ      θ^4 (1 − θ)
0.2    0.0013
0.5    0.0313
0.8    0.0819
0.95   0.0407

[Figure: plot of P(HHTHH | θ) against θ, peaking near θ = 0.8.]
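As a sanity check on the table and the plot's peak, a brute-force grid search (a sketch, not from the slides) over θ finds the maximizer of θ^4 (1 − θ):

```python
# Grid search for the theta maximizing L(theta) = theta^4 * (1 - theta).
# Calculus gives the exact maximizer: dL/dtheta = theta^3 * (4 - 5*theta) = 0,
# so theta = 4/5.
grid = [t / 1000 for t in range(1001)]
best = max(grid, key=lambda th: th ** 4 * (1 - th))
print(best)  # 0.8
```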

Maximum Likelihood Parameter Estimation

One (of many) approaches to parameter estimation. The likelihood of (independent) observations x_1, x_2, ..., x_n is

L(x_1, x_2, ..., x_n | θ) = ∏_{i=1}^{n} f(x_i | θ)

As a function of θ, what θ maximizes the likelihood of the data actually observed? Typical approach: solve

∂/∂θ L(x | θ) = 0   or   ∂/∂θ log L(x | θ) = 0

Example 1

n coin flips, x_1, x_2, ..., x_n; n_0 tails, n_1 heads, n_0 + n_1 = n; θ = probability of heads.

L(x_1, ..., x_n | θ) = (1 − θ)^{n_0} θ^{n_1}
log L = n_0 log(1 − θ) + n_1 log θ
d(log L)/dθ = −n_0/(1 − θ) + n_1/θ = 0  ⇒  θ̂ = n_1/(n_0 + n_1) = n_1/n

(Also verify it's a max, not a min, and not better on the boundary.)

Observed fraction of successes in sample is MLE of success probability in population.
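For the earlier sample HHTTTTTHTHTTTHH, the MLE is just the observed fraction of heads; a minimal Python check:

```python
# MLE for the Bernoulli parameter: theta_hat = n1 / n (fraction of heads).
sample = "HHTTTTTHTHTTTHH"
n1 = sample.count("H")
theta_hat = n1 / len(sample)
print(n1, len(sample), theta_hat)  # 6 15 0.4
```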

Bias

A desirable property: an estimator Y of a parameter θ is an unbiased estimator if E[Y] = θ.

For the coin example above, the MLE is unbiased: Y = fraction of heads = (Σ_{1≤i≤n} X_i)/n, where X_i is the indicator for heads in the i-th trial, so E[Y] = (Σ_{1≤i≤n} E[X_i])/n = nθ/n = θ.
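A quick simulation (illustrative only; the seed, θ, and sample sizes are arbitrary choices, not from the slides) is consistent with E[Y] = θ:

```python
import random

# Average the estimator Y = (# heads) / n over many simulated samples;
# the average should land close to the true theta, matching E[Y] = theta.
random.seed(0)
theta, n, trials = 0.3, 20, 20000
avg = sum(
    sum(random.random() < theta for _ in range(n)) / n
    for _ in range(trials)
) / trials
print(abs(avg - theta) < 0.01)  # True (up to simulation noise)
```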

Aside: are all unbiased estimators equally good?

No! E.g., ignore all but the 1st flip; if it was H, let Y = 1; else Y = 0.

Exercise: show this is unbiased.
Exercise: if the observed data has at least one H and at least one T, what is the likelihood of the data given the model with θ = Y?

Parameter Estimation

Assuming a sample x_1, x_2, ..., x_n is from a parametric distribution f(x | θ), estimate θ. E.g.: given n normal samples, estimate mean and variance:

f(x) = (1/√(2πσ²)) e^{−(x−µ)²/(2σ²)},   θ = (µ, σ²)

[Figure: normal density with mean µ and width ±σ marked.]

Ex2: I got data; a little birdie tells me it's normal, and promises σ² = 1.

[Figure: observed data points plotted along the x-axis.]

Which is more likely: (a) this? [Figure: a candidate normal density (σ = 1) overlaid on the observed data.]

Which is more likely: (b) or this? [Figure: a different candidate normal density (σ = 1) overlaid on the observed data.]

Which is more likely: (c) or this? [Figure: another candidate normal density (σ = 1) overlaid on the observed data.]

Which is more likely: (c) or this? Looks good by eye, but how do I optimize my estimate of μ?

Ex. 2: x_i ~ N(µ, σ²), σ² = 1, µ unknown.

ln L(x_1, ..., x_n | θ) = Σ_{1≤i≤n} [ −(1/2) ln(2π) − (x_i − θ)²/2 ]
d(ln L)/dθ = Σ_{1≤i≤n} (x_i − θ) = 0  ⇒  θ̂ = (Σ_{1≤i≤n} x_i)/n = x̄

(And verify it's a max, not a min, and not better on the boundary.)

Sample mean is MLE of population mean.
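A numeric cross-check (a sketch; the data values are made up for illustration) that the sample mean maximizes the N(µ, 1) log-likelihood:

```python
import math
import statistics

# Compare a grid-search maximizer of the N(mu, 1) log-likelihood
# with the sample mean; the two should coincide.
data = [0.2, 1.1, -0.4, 0.9, 0.5]  # made-up observations

def log_lik(mu):
    return sum(-0.5 * math.log(2 * math.pi) - (x - mu) ** 2 / 2 for x in data)

grid = [m / 1000 for m in range(-2000, 2001)]
mu_hat = max(grid, key=log_lik)
print(mu_hat, statistics.mean(data))  # grid maximizer equals the sample mean
```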

Ex3: I got data; a little birdie tells me it's normal (but does not tell me σ²).

[Figure: observed data points plotted along the x-axis.]

Which is more likely: (a) this? [Figure: a candidate normal density overlaid on the observed data.]

Which is more likely: (b) or this? [Figure: a different candidate normal density overlaid on the observed data.]

Which is more likely: (c) or this? [Figure: another candidate normal density overlaid on the observed data.]

Which is more likely: (d) or this? [Figure: yet another candidate normal density overlaid on the observed data.]

Which is more likely: (d) or this? Looks good by eye, but how do I optimize my estimates of μ & σ?

Ex 3: x_i ~ N(µ, σ²), µ, σ² both unknown.

[Figure: likelihood surface as a function of θ_1 and θ_2.]

Sample mean is MLE of population mean, again.

Ex. 3, (cont.)

ln L(x_1, x_2, ..., x_n | θ_1, θ_2) = Σ_{1≤i≤n} [ −(1/2) ln(2πθ_2) − (x_i − θ_1)²/(2θ_2) ]

∂/∂θ_2 ln L(x_1, x_2, ..., x_n | θ_1, θ_2) = Σ_{1≤i≤n} [ −(1/2)(2π/(2πθ_2)) + (x_i − θ_1)²/(2θ_2²) ] = 0

θ̂_2 = (Σ_{1≤i≤n} (x_i − θ̂_1)²)/n = s̄²

Sample variance is MLE of population variance.

Ex. 3, (cont.) Bias?

If Y is the sample mean, Y = (Σ_{1≤i≤n} X_i)/n, then E[Y] = (Σ_{1≤i≤n} E[X_i])/n = nμ/n = μ, so the MLE is an unbiased estimator of the population mean.

Similarly, (Σ_{1≤i≤n} (X_i − μ)²)/n is an unbiased estimator of σ². Unfortunately, if μ is unknown and estimated from the same data as above, θ̂_2 = (Σ_{1≤i≤n} (X_i − X̄)²)/n is a consistent, but biased, estimate of the population variance. (An example of overfitting.) The unbiased estimate is (Σ_{1≤i≤n} (X_i − X̄)²)/(n − 1); i.e., as n → ∞ the biased estimate converges to the correct value.

Moral: MLE is a great idea, but not a magic bullet.
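A simulation sketch (parameter values and seed are arbitrary choices, not from the slides) of the bias: dividing by n underestimates σ², while dividing by n − 1 does not:

```python
import random

# Estimate the expectation of the two variance estimators by averaging over
# many samples of size n from N(0, sigma^2) with sigma^2 = 4.
# Theory: E[MLE] = sigma^2 * (n - 1) / n = 3.2, while the n-1 version
# is unbiased (expectation 4.0).
random.seed(1)
n, trials = 5, 40000
mle_avg = unbiased_avg = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(n)]
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    mle_avg += ss / n
    unbiased_avg += ss / (n - 1)
mle_avg /= trials
unbiased_avg /= trials
print(round(mle_avg, 1), round(unbiased_avg, 1))  # roughly 3.2 and 4.0
```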

More on Bias of θ̂_2

Biased? Yes. Why? As an extreme, think about n = 1. Then θ̂_2 = 0; probably an underestimate!

Also, think about n = 2. Then θ̂_1 is exactly between the two sample points, the position that exactly minimizes the expression for θ̂_2. Any other choices for θ_1, θ_2 make the likelihood of the observed data slightly lower. But it's actually pretty unlikely that two sample points would be chosen exactly equidistant from, and on opposite sides of, the mean, so the MLE θ̂_2 systematically underestimates θ_2. (But not by much, and the bias shrinks with sample size.)

Summary

MLE is one way to estimate parameters from data. You choose the form of the model (normal, binomial, ...); math chooses the value(s) of the parameter(s). It has the intuitively appealing property that the parameters maximize the likelihood of the observed data; basically, it just assumes your sample is representative. Of course, unusual samples will give bad estimates (estimate normal human heights from a sample of NBA stars?), but that is an unlikely event. Often, but not always, MLE has other desirable properties, like being unbiased.