Maximum Likelihood Estimation


The likelihood and log-likelihood functions are the basis for deriving estimators of parameters, given data. While the shapes of these two functions differ, they reach their maximum at the same parameter value. The value of p that corresponds to this maximum point is defined as the maximum likelihood estimate (MLE), and that value is denoted $\hat{p}$. This is the value that is "most likely," relative to the other possible values. This is a simple, compelling concept, and it carries a host of good statistical properties.

Thus, in general, we seek $\hat{\theta}$, the value that maximizes the log-likelihood function. In the binomial model, the log-likelihood $\log(\mathcal{L})$ is a function of only one variable, so it is easy to plot and visualize. The maximum likelihood estimate of p (the unknown parameter in the model) is the value that maximizes the log-likelihood, given the data; we denote this $\hat{p}$. In the binomial model there is an analytical form (termed a "closed form") of the MLE, so numerical maximization of the log-likelihood is not required. In this simple case, $\hat{p} = y/n = 7/11 = 0.6363...$; of course, if the observed data were different, $\hat{p}$ would differ.

The log-likelihood links the data, the unknown model parameters, and the assumptions, and it allows rigorous statistical inference. Real-world problems have more than one variable or parameter (e.g., p in the example). Computers can find the maximum of the multi-dimensional log-likelihood function; the biologist need not be terribly concerned with these details. The actual numerical value of the log-likelihood at its maximum point is of substantial importance. In the binomial coin-flipping example with n = 11 and y = 7, $\max(\log \mathcal{L}) = -1.411$ (see graph).

The log-likelihood function is of fundamental importance in the theory of inference and in all of statistics. It is the basis for the methods explored in FW-663. Students should make every effort to get comfortable with this function in the simple cases; extending the concepts to more complex cases will then come easily.
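A minimal sketch in Python (assuming NumPy and SciPy are available) makes the closed-form result concrete: it evaluates the binomial log-likelihood on a grid of candidate values of p and confirms that the maximum occurs at $\hat{p} = y/n$, with a maximized log-likelihood of about -1.411.

```python
import numpy as np
from scipy.stats import binom

n, y = 11, 7                        # the coin-flipping example: 7 successes in 11 trials
p_hat = y / n                       # closed-form MLE for the binomial model

# evaluate log L(p) = log P(Y = y | n, p) over a grid of candidate p values
p_grid = np.linspace(0.01, 0.99, 981)
log_lik = binom.logpmf(y, n, p_grid)

print(p_hat)                        # 0.6363...
print(p_grid[np.argmax(log_lik)])   # grid value at the maximum, essentially y/n
print(binom.logpmf(y, n, p_hat))    # maximized log-likelihood, about -1.411
```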

Likelihood Theory -- What Good Is It?

1. The basis for deriving estimators or estimates of model parameters (e.g., survival probabilities). These are termed "maximum likelihood estimates," MLEs.

2. Estimates of the precision (or repeatability) of those estimates. This is usually the conditional (on the model) sampling variance-covariance matrix (to be discussed).

3. Profile likelihood intervals (asymmetric confidence intervals).

4. A basis for testing hypotheses:
   - tests between nested models (so-called likelihood ratio tests),
   - goodness-of-fit tests for a given model.

5. Model selection criteria, based on Kullback-Leibler information.

Numbers 1-3 (above) require a model to be "given." Number 4, statistical hypothesis testing, has become less useful in many respects in the past two decades, and we do not stress this approach as much as others might. Likelihood theory is also important in Bayesian statistics.

Properties of Maximum Likelihood Estimators

For "large" samples ("asymptotically"), MLEs are optimal:

1. MLEs are asymptotically normally distributed.
2. MLEs are asymptotically "minimum variance."
3. MLEs are asymptotically unbiased (MLEs are often biased, but the bias $\to 0$ as $n \to \infty$).

One-to-one transformations of MLEs are also MLEs (the invariance property). For example, mean life span $\bar{L}$ is defined as $\bar{L} = -1/\log_e(S)$, where $S$ is the survival probability. Thus an estimator is $\hat{\bar{L}} = -1/\log_e(\hat{S})$, and $\hat{\bar{L}}$ is also an MLE.

Maximum likelihood estimation is the backbone of statistical estimation. It is based on deep theory, originally developed by R. A. Fisher (his first paper on this theory was published in 1912, when he was 22 years old!). While beginning classes often focus on least squares estimation ("regression"), likelihood theory is the omnibus approach across the sciences, engineering, and medicine.

The Likelihood Principle states that all the relevant information in the sample is contained in the likelihood function. The likelihood function is also the basis for Bayesian statistics. See Royall (1997) and Azzalini (1996) for more on likelihood theory.
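The invariance (one-to-one transformation) property above can be illustrated with a tiny sketch, using a hypothetical survival value:

```python
import numpy as np

S_hat = 0.8                        # hypothetical MLE of an annual survival probability
L_bar_hat = -1.0 / np.log(S_hat)   # by invariance, -1/log_e(S_hat) is the MLE of mean life span
print(L_bar_hat)                   # about 4.48
```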

Maximum Likelihood Estimates

Generally, calculus is used to find the maximum point of the log-likelihood function and obtain MLEs in closed form. This is tedious for biologists and often not useful in real problems (where a closed-form estimator may not even exist). The log-likelihood functions we will see have a single mode or maximum point and no local optima. These conditions make the use of numerical methods appealing and efficient.

Consider, first, the binomial model with a single unknown parameter, $p$. Using calculus, one could take the first partial derivative of the log-likelihood function with respect to $p$, set it to zero, and solve for $p$. This solution gives $\hat{p}$, the MLE. This value of $\hat{p}$ is the one that maximizes the log-likelihood function; it is the value of the parameter that is most likely, given the data.

The likelihood function provides information on the relative likelihood of various parameter values, given the data and the model (here, a binomial). Think of 10 of your friends, 9 of whom have one raffle ticket each, while the 10th has 4 tickets. The person with 4 tickets has a higher likelihood of winning, relative to the other 9. If you were to try to select the most likely winner of the raffle, which person would you pick? Most would select the person with 4 tickets (would you?). Would you feel strongly that this person would win? Why or why not? Now, suppose 8 people had a single ticket, one had 4 tickets, and the last had 80 tickets. Surely the person with 80 tickets is most likely to win (but not with certainty). In this simple example you have a feeling about the "strength of evidence" for the likely winner. In the first case, one person has an edge, but not much more. In the second case, the person with 80 tickets is relatively very likely to win.

The shape of the log-likelihood function relates, in a conceptual way, to the raffle ticket example. If the log-likelihood function is relatively flat, one can interpret this as several (perhaps many) values of $p$ being nearly equally likely; they are relatively alike. This is quantified as the sampling variance or standard error. A fairly flat log-likelihood implies considerable uncertainty, and this is reflected in large sampling variances and standard errors, and wide confidence intervals. On the other hand, if the log-likelihood function is fairly peaked near its maximum point, some values of $p$ are relatively very likely compared to others (like the person with 80 raffle tickets). This implies a considerable degree of certainty, reflected in small sampling variances and standard errors, and narrow confidence intervals. So, both the value of the log-likelihood at its maximum point and the shape of the function near this maximum are important.
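In practice, the maximum is usually located numerically rather than by calculus. A minimal sketch (Python, with SciPy assumed) minimizes the negative binomial log-likelihood for the coin-flipping data:

```python
from scipy.optimize import minimize_scalar
from scipy.stats import binom

n, y = 11, 7

def neg_log_lik(p):
    # negative log-likelihood of the binomial model; minimizing it maximizes log L
    return -binom.logpmf(y, n, p)

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)     # numerical MLE, close to 7/11 = 0.6364
print(-res.fun)  # maximized log-likelihood, about -1.411
```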

The shape of the likelihood function near the maximum point can be measured by the analytical second partial derivatives, and these can be closely approximated numerically by a computer. Such numerical derivatives are important in complicated problems where the log-likelihood exists in 20-60 dimensions (i.e., has 20-60 unknown parameters).

[Figure: the likelihood function plotted against theta, over the range 0.75 to 1.]
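For the one-parameter binomial example, the curvature at the maximum can be approximated with a simple finite difference, and its negative inverse gives the estimated sampling variance of $\hat{p}$ (a sketch, assuming NumPy and SciPy):

```python
import numpy as np
from scipy.stats import binom

n, y = 11, 7
p_hat = y / n

def log_lik(p):
    return binom.logpmf(y, n, p)

h = 1e-5
# central finite-difference approximation to the second derivative (curvature) at the MLE
d2 = (log_lik(p_hat + h) - 2 * log_lik(p_hat) + log_lik(p_hat - h)) / h**2

var_hat = -1.0 / d2        # estimated sampling variance: inverse of the negative curvature
print(var_hat)             # about p_hat * (1 - p_hat) / n = 0.0210
print(np.sqrt(var_hat))    # estimated standard error, about 0.145
```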

[Figure: the corresponding log-likelihood function plotted against theta, over the range 0.75 to 1.]

The standard, analytical method of finding the MLEs is to take the first partial derivatives of the log-likelihood with respect to each parameter in the model. For example,

$$ \frac{\partial \ln(\mathcal{L}(p))}{\partial p} = \frac{11}{p} - \frac{5}{1 - p} \qquad (n = 16). $$

Set this to zero,

$$ \frac{\partial \ln(\mathcal{L}(p))}{\partial p} = \frac{11}{p} - \frac{5}{1 - p} = 0, $$

and solve to get $\hat{p} = 11/16$, the MLE.
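The same calculation can be written in general form for $y$ successes in $n$ trials:

$$ \log \mathcal{L}(p) = \text{constant} + y \log p + (n - y) \log(1 - p), $$

$$ \frac{\partial \log \mathcal{L}(p)}{\partial p} = \frac{y}{p} - \frac{n - y}{1 - p} = 0
\quad \Longrightarrow \quad y(1 - p) = (n - y)\,p
\quad \Longrightarrow \quad \hat{p} = \frac{y}{n}. $$

With $y = 11$ and $n = 16$ this reproduces $\hat{p} = 11/16$; with $y = 7$ and $n = 11$ it gives the earlier $\hat{p} = 7/11$.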

For most models we have more than one parameter. In general, let there be $K$ parameters, $\theta_1, \theta_2, \ldots, \theta_K$. Based on a specific model we can construct the log-likelihood,

$$ \log(\mathcal{L}(\theta_1, \theta_2, \ldots, \theta_K \mid \text{data})), $$

and $K$ log-likelihood equations,

$$ \frac{\partial \log(\mathcal{L})}{\partial \theta_1} = 0, \quad \frac{\partial \log(\mathcal{L})}{\partial \theta_2} = 0, \quad \ldots, \quad \frac{\partial \log(\mathcal{L})}{\partial \theta_K} = 0. $$

The solution of these equations gives the MLEs, $\hat{\theta}_1, \hat{\theta}_2, \ldots, \hat{\theta}_K$. The MLEs are almost always unique; in particular, this is true of multinomial-based models.

In principle, $\log(\mathcal{L}(\theta_1, \theta_2, \ldots, \theta_K \mid \text{data}))$ defines a "surface" in $K$-dimensional space, and ideas of curvature still apply (as mathematical constructs). Plotting is hard for more than 2 parameters.

Sampling variances and covariances of the MLEs are computed from the log-likelihood, $\log(\mathcal{L}(\theta_1, \theta_2, \ldots, \theta_K \mid \text{data}))$, based on its curvature at the maximum. The actual formulae involve second and mixed partial derivatives of the log-likelihood, hence quantities like

$$ \frac{\partial^2 \log(\mathcal{L})}{\partial \theta_1^2} \quad \text{and} \quad \frac{\partial^2 \log(\mathcal{L})}{\partial \theta_1 \, \partial \theta_2}, $$

evaluated at the MLEs. Let $D$ be the estimated variance-covariance matrix for the $K$ MLEs; $D$ is a $K \times K$ matrix. The inverse of $D$ is the matrix with elements as below.

$$ -\frac{\partial^2 \log(\mathcal{L})}{\partial \theta_i^2} $$

as the $i$th diagonal element, and

$$ -\frac{\partial^2 \log(\mathcal{L})}{\partial \theta_i \, \partial \theta_j} $$

as the $(i, j)$th off-diagonal element (these second and mixed partial derivatives are evaluated at the MLEs).

The use of log-likelihood functions (rather than likelihood functions) is deeply rooted in the nature of likelihood theory. Note also that likelihood ratio test (LRT) theory leads to tests which basically always involve taking $-2$ times the log-likelihood evaluated at the MLEs. Therefore we give this quantity a symbol and a name, the deviance:

$$ \text{deviance} = -2 \log_e(\mathcal{L}(\hat{\theta})) + 2 \log_e(\mathcal{L}_s(\hat{\theta}_s)) = -2 \left[ \log_e(\mathcal{L}(\hat{\theta})) - \log_e(\mathcal{L}_s(\hat{\theta}_s)) \right], $$

evaluated at the MLEs for some model. Here, the first term is the log-likelihood, evaluated at its maximum point, for the model in question, and the second term is the log-likelihood, evaluated at its maximum point, for the saturated model. The meaning of a saturated model will become clear in the following material; basically, in the multinomial models, it is a model with as many parameters as cells. This final term in the deviance can often be dropped, as it is often a constant across models. The deviance for the saturated model is 0. Deviance, like information, is additive.

The deviance is approximately $\chi^2$ distributed with df = (number of cells) $- K$ and is thus useful in examining the goodness-of-fit of a model. There are some situations where use of the deviance in this way will not provide correct results. MARK outputs the deviance as a measure of model fit, and this is often very useful.
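As a concrete illustration of the deviance (a sketch with hypothetical data, assuming NumPy and SciPy), consider two binomial cells and compare a reduced model with a single common $p$ against the saturated model with one $p$ per cell:

```python
import numpy as np
from scipy.stats import binom

# hypothetical data: two cells, each with its own number of trials and successes
n = np.array([16, 20])
y = np.array([11, 9])

# saturated model: one p per cell, p_hat_i = y_i / n_i
log_lik_sat = binom.logpmf(y, n, y / n).sum()

# reduced model: a single common p for both cells, MLE = total successes / total trials
p_common = y.sum() / n.sum()
log_lik_reduced = binom.logpmf(y, n, p_common).sum()

deviance = -2 * (log_lik_reduced - log_lik_sat)
# approximately chi-square with df = (number of cells) - K = 2 - 1 = 1
print(deviance)
```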

[Figure: the deviance plotted against theta, over the range 0.75 to 1.]