CSE 312 Winter 2017 Learning From Data: Maximum Likelihood Estimators (MLE)
Parameter Estimation
Given: independent samples x1, x2, ..., xn from a parametric distribution f(x | θ).
Goal: estimate θ. (Not formally a conditional probability, but the "|" notation is convenient.)
E.g.: given the sample HHTTTTTHTHTTTHH of (possibly biased) coin flips, estimate θ = probability of Heads; here f(x | θ) is the Bernoulli probability mass function with parameter θ.
Likelihood (For Discrete Distributions)
P(x | θ): probability of event x given model θ.
Viewed as a function of x (fixed θ), it's a probability; e.g., Σx P(x | θ) = 1.
Viewed as a function of θ (fixed x), it's called likelihood; e.g., Σθ P(x | θ) can be anything; relative values are the focus.
E.g., if θ = probability of heads in a sequence of coin flips, then P(HHTHH | 0.6) > P(HHTHH | 0.5), i.e., the event HHTHH is more likely when θ = 0.6 than when θ = 0.5.
And what θ makes HHTHH most likely?
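A minimal numeric check of the claim above (not from the slides; assumes Python with no extra libraries). HHTHH has 4 heads and 1 tail, so its probability under independent flips is θ⁴(1−θ):

```python
# Likelihood of the sequence HHTHH (4 heads, 1 tail) as a function of
# theta = P(Heads), under independent Bernoulli flips.
def likelihood(theta, heads=4, tails=1):
    return theta**heads * (1 - theta)**tails

print(likelihood(0.5))  # 0.03125
print(likelihood(0.6))  # ~0.05184 -> HHTHH is indeed more likely under theta = 0.6
```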
Likelihood Function
P(HHTHH | θ): probability of HHTHH, given P(H) = θ:

    θ       θ^4 (1-θ)
    0.2     0.0013
    0.5     0.0313
    0.8     0.0819
    0.95    0.0407

[Plot: P(HHTHH | θ) as a function of θ on [0, 1]; the maximum is at θ = 0.8.]
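A quick calculus check (not on the original slide, but standard, and previewing the general recipe on the next slide) confirms the maximum seen in the table and the plot:

```latex
\frac{d}{d\theta}\,\theta^4(1-\theta) \;=\; 4\theta^3 - 5\theta^4 \;=\; \theta^3(4 - 5\theta) \;=\; 0
\quad\Longrightarrow\quad \hat{\theta} = \tfrac{4}{5} = 0.8
```

(The root θ = 0 is a boundary minimum, not the maximum.)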
Maximum Likelihood Parameter Estimation (For Discrete Distributions)
One (of many) approaches to parameter estimation.
Likelihood of (independent) observations x1, x2, ..., xn:

    L(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)    (*)

As a function of θ, what θ maximizes the likelihood of the data actually observed?
Typical approach: solve

    \frac{\partial}{\partial\theta} L(\vec{x} \mid \theta) = 0    or    \frac{\partial}{\partial\theta} \log L(\vec{x} \mid \theta) = 0

(*) In general, the (discrete) likelihood is the joint pmf; the product form follows from independence.
Example 1
n independent coin flips x1, x2, ..., xn; n0 tails, n1 heads, n0 + n1 = n; θ = probability of heads.

    L(x_1, \ldots, x_n \mid \theta) = (1-\theta)^{n_0}\,\theta^{n_1}
    \ln L = n_0 \ln(1-\theta) + n_1 \ln\theta
    \frac{d}{d\theta} \ln L = -\frac{n_0}{1-\theta} + \frac{n_1}{\theta} = 0
        \Longrightarrow \hat{\theta} = \frac{n_1}{n}

(Also verify it's a max, not a min, and not better on the boundary.)
Observed fraction of successes in the sample is the MLE of the success probability in the population.
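A minimal sketch verifying the closed form numerically (not from the slides; assumes NumPy and SciPy are installed), using the HHTTTTTHTHTTTHH sample from the earlier slide:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sample from the slides: HHTTTTTHTHTTTHH -> n1 = 6 heads, n0 = 9 tails
flips = np.array([1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1])
n1 = flips.sum()
n0 = len(flips) - n1

def neg_log_likelihood(theta):
    # -ln L = -(n1 * ln(theta) + n0 * ln(1 - theta))
    return -(n1 * np.log(theta) + n0 * np.log(1 - theta))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-9, 1 - 1e-9), method="bounded")
print(res.x)       # ~0.4 -- numerical maximizer of the likelihood
print(n1 / (n0 + n1))  # 0.4 -- the closed form: observed fraction of heads
```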
Parameter Estimation
Given: independent samples x1, x2, ..., xn from a parametric distribution f(x | θ), estimate θ.
E.g.: given n normal samples, estimate the mean and variance:

    f(x \mid \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/(2\sigma^2)}, \qquad \theta = (\mu, \sigma^2)

[Plot: normal density centered at μ, with μ ± σ marked.]
Ex. 2: I got data; a little birdie tells me it's normal, and promises σ² = 1.
[Plot: observed data points marked X along the x-axis.]
Which is more likely: (a) this? μ unknown, σ² = 1.
[Plot: a candidate normal curve (μ ± 1σ marked) overlaid on the observed data.]
Which is more likely: (b) or this? μ unknown, σ² = 1.
[Plot: the same σ² = 1 normal curve at a different candidate μ, overlaid on the observed data.]
Which is more likely: (c) or this? μ unknown, σ² = 1.
[Plot: the σ² = 1 normal curve at a third candidate μ, overlaid on the observed data.]
Looks good by eye, but how do I optimize my estimate of μ?
Likelihood (For Continuous Distributions)
The probability of any specific observation xi is zero, so "likelihood = probability" fails. Instead, as usual, we swap density for pmf: the likelihood of x1, x2, ..., xn is defined to be their joint density, and given independence of the xi, that's the product of their marginal densities.
Why it's sensible:
a) for maximizing likelihood, we really only care about relative likelihoods, and density captures that;
b) it has the desired property that likelihood increases with better fit to the model; and
c) if the density at x is f(x), then for any small δ > 0 the probability of a sample falling within ±δ/2 of x is approximately δ·f(x), so density really is capturing probability; and since δ is constant with respect to θ, it simply drops out of d/dθ log L(...) = 0.
Otherwise, the MLE approach is just like the discrete case: form the likelihood and solve

    \frac{\partial}{\partial\theta} \log L(\vec{x} \mid \theta) = 0
Ex. 2: x_i \sim N(\mu, \sigma^2), \sigma^2 = 1, \mu unknown.

    L(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-(x_i - \theta)^2/2}    (product of densities)

    \ln L(x_1, x_2, \ldots, x_n \mid \theta) = \sum_{i=1}^{n} \left[ -\frac{1}{2}\ln(2\pi) - \frac{(x_i - \theta)^2}{2} \right]

    \frac{d}{d\theta} \ln L(x_1, x_2, \ldots, x_n \mid \theta) = \sum_{i=1}^{n} (x_i - \theta) = \left( \sum_{i=1}^{n} x_i \right) - n\theta = 0
        \Longrightarrow \hat{\theta} = \sum_{i=1}^{n} x_i / n = \bar{x}

(And verify it's a max, not a min, and not better on the boundary.)
Sample mean is the MLE of the population mean.
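A minimal sketch confirming this numerically (not from the slides; assumes NumPy, and uses a hypothetical simulated sample with true μ = 2):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)  # hypothetical sample; sigma^2 = 1 known

def log_likelihood(mu):
    # sum_i [ -0.5*ln(2*pi) - (x_i - mu)^2 / 2 ]
    return np.sum(-0.5 * np.log(2 * np.pi) - (x - mu) ** 2 / 2)

# Brute-force grid search over candidate mu values
grid = np.linspace(0, 4, 4001)
mu_hat = grid[np.argmax([log_likelihood(m) for m in grid])]
print(mu_hat, x.mean())  # grid maximizer agrees with the sample mean
```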
Ex. 3: I got data; a little birdie tells me it's normal (but does not tell me μ, σ²).
[Plot: observed data points marked X along the x-axis.]
Which is more likely: (a) this? μ, σ² both unknown.
[Plot: candidate normal curve with spread μ ± 1 overlaid on the observed data.]
Which is more likely: (b) or this? μ, σ² both unknown.
[Plot: candidate normal curve with spread μ ± 3 overlaid on the observed data.]
Which is more likely: (c) or this? μ, σ² both unknown.
[Plot: candidate normal curve with spread μ ± 1, at another candidate μ, overlaid on the observed data.]
Which is more likely: (d) or this? μ, σ² both unknown.
[Plot: candidate normal curve with spread μ ± 0.5 overlaid on the observed data.]
Looks good by eye, but how do I optimize my estimates of μ and σ²?
Ex. 3: x_i \sim N(\mu, \sigma^2), \mu and \sigma^2 both unknown; write \theta_1 = \mu, \theta_2 = \sigma^2.

    \ln L(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_2) = \sum_{i=1}^{n} \left[ -\frac{1}{2}\ln(2\pi\theta_2) - \frac{(x_i - \theta_1)^2}{2\theta_2} \right]

    \frac{\partial}{\partial\theta_1} \ln L(x_1, x_2, \ldots, x_n \mid \theta_1, \theta_2) = \sum_{i=1}^{n} \frac{x_i - \theta_1}{\theta_2} = 0
        \Longrightarrow \hat{\theta}_1 = \sum_{i=1}^{n} x_i / n = \bar{x}

[Plot: likelihood surface as a function of (θ₁, θ₂).]
Sample mean is the MLE of the population mean, again.
In general, a problem like this results in 2 equations in 2 unknowns. Easy in this case, since θ₂ drops out of the ∂/∂θ₁ = 0 equation.
Ex. 3 (cont.):

    \ln L(x_1, \ldots, x_n \mid \theta_1, \theta_2) = \sum_{i=1}^{n} \left[ -\frac{1}{2}\ln(2\pi\theta_2) - \frac{(x_i - \theta_1)^2}{2\theta_2} \right]

    \frac{\partial}{\partial\theta_2} \ln L(x_1, \ldots, x_n \mid \theta_1, \theta_2) = \sum_{i=1}^{n} \left[ -\frac{1}{2\theta_2} + \frac{(x_i - \theta_1)^2}{2\theta_2^2} \right] = 0
        \Longrightarrow \hat{\theta}_2 = \sum_{i=1}^{n} (x_i - \hat{\theta}_1)^2 / n = s^2

Sample variance is the MLE of the population variance.
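A minimal sketch of the two closed-form estimates (not from the slides; assumes NumPy, with hypothetical simulated data). Note that the MLE of the variance divides by n, not n−1:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=100)  # hypothetical data, mu = 5, sigma^2 = 4

mu_hat = x.mean()                     # theta1_hat = sample mean
var_hat = np.mean((x - mu_hat) ** 2)  # theta2_hat = mean squared deviation (divide by n)

# NumPy's var() divides by n by default (ddof=0), i.e., it computes the MLE:
assert np.isclose(var_hat, x.var(ddof=0))
print(mu_hat, var_hat)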
Ex. 3 (cont.): Bias?
If Y is the sample mean, Y = (Σ_{1≤i≤n} X_i)/n, then E[Y] = (Σ_{1≤i≤n} E[X_i])/n = nμ/n = μ, so the MLE is an unbiased estimator of the population mean.
Similarly, (Σ_{1≤i≤n} (X_i − μ)²)/n is an unbiased estimator of σ².
Unfortunately, if μ is unknown and estimated from the same data, as above, then (Σ_{1≤i≤n} (X_i − X̄)²)/n is a consistent, but biased, estimate of the population variance. (An example of overfitting.) The unbiased estimate is:

    s^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 / (n-1)

I.e., the biased estimate has expectation (n−1)σ²/n, so its limit as n → ∞ is correct.
Moral: MLE is a great idea, but not a magic bullet.
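A small simulation illustrating the bias (not from the slides; assumes NumPy, with hypothetical parameters n = 5 and σ² = 1, so the biased estimator should average (n−1)/n · σ² = 0.8):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, trials = 5, 1.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
mle_var = samples.var(axis=1, ddof=0)       # divide by n   (MLE, biased)
unbiased_var = samples.var(axis=1, ddof=1)  # divide by n-1 (unbiased)

print(mle_var.mean())       # ~0.8 = (n-1)/n * sigma^2
print(unbiased_var.mean())  # ~1.0 = sigma^2
```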
Summary
MLE is one way to estimate parameters from data.
You choose the form of the model (normal, binomial, ...); the math chooses the value(s) of the parameter(s).
Defining the likelihood function (based on the pmf or pdf of the model) is often the critical step; the math/algorithms to optimize it are generic. Often it is simply (d/dθ)(log Likelihood(data | θ)) = 0.
Has the intuitively appealing property that the parameters maximize the likelihood of the observed data; basically, it just assumes your sample is representative.
Of course, unusual samples will give bad estimates (estimate normal human heights from a sample of NBA stars?), but that is an unlikely event.
Often, but not always, the MLE has other desirable properties, like being unbiased, or at least consistent.