Maximum Likelihood Estimation Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 10, 2017

[This handout draws very heavily from Regression Models for Categorical and Limited Dependent Variables, 1997, by J. Scott Long. See Long's book, especially sections 2.6, 3.5 and 3.6, for additional details.]

Most of the models we will look at are (or can be) estimated via maximum likelihood.

Brief Definition. The maximum likelihood estimate is that value of the parameter that makes the observed data most likely.

Example. This is adapted from J. Scott Long's Regression Models for Categorical and Limited Dependent Variables. Define pi as the probability of observing whatever value of y was actually observed for a given observation, i.e.

   pi = Pr(yi = 1 | xi)        if yi = 1 is observed
   pi = 1 - Pr(yi = 1 | xi)    if yi = 0 is observed

So, for example, if the predicted probability of the event occurring for case i was .70, and the event did occur, then pi = .70. If, on the other hand, the event did not occur, then pi = .30.

If the observations are independent, the likelihood equation is

   L(β | y, X) = Π pi     (the product of the pi across all observations)

The likelihood tends to be an incredibly small number, and it is generally easier to work with the log likelihood. Ergo, taking logs, we obtain the log likelihood equation:

   ln L(β | y, X) = Σ ln pi
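To see the arithmetic on a small scale first, here is a made-up illustration (the probabilities are invented for this sketch and do not come from the example below). Suppose there are three independent cases with predicted probabilities of the event of .70, .20 and .90, and observed outcomes y = 1, 0, 1. Then p1 = .70, p2 = 1 - .20 = .80, and p3 = .90, so the likelihood is .70 * .80 * .90 = .504 and the log likelihood is ln(.504), roughly -.685. You could verify this in Stata with

. display .70 * .80 * .90
. display ln(.70) + ln(.80) + ln(.90)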

Before proceeding, let's see how this works in practice! Here is how you compute pi and the log of pi using Stata:

. use https://www3.nd.edu/~rwilliam/statafiles/logist.dta, clear

. logit grade gpa tuce psi

Iteration 0:   log likelihood = -20.59173
Iteration 1:   log likelihood = -13.259768
Iteration 2:   log likelihood = -12.894606
Iteration 3:   log likelihood = -12.889639
Iteration 4:   log likelihood = -12.889633
Iteration 5:   log likelihood = -12.889633

Logistic regression                               Number of obs   =         32
                                                  LR chi2(3)      =      15.40
                                                  Prob > chi2     =     0.0015
Log likelihood = -12.889633                       Pseudo R2       =     0.3740

------------------------------------------------------------------------------
       grade |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gpa |   2.826113   1.262941     2.24   0.025     .3507938    5.301432
        tuce |   .0951577   .1415542     0.67   0.501    -.1822835    .3725988
         psi |   2.378688   1.064564     2.23   0.025      .292181    4.465195
       _cons |  -13.02135   4.931325    -2.64   0.008    -22.68657   -3.356131
------------------------------------------------------------------------------

. * Compute probability that y = 1
. predict prob
(option pr assumed; Pr(grade))

. * If y = 1, pi = probability y = 1
. gen pi = prob if grade == 1
(21 missing values generated)

. * If y = 0, replace pi with probability y = 0
. replace pi = (1 - prob) if grade == 0
(21 real changes made)

. * compute log of pi
. gen lnpi = ln(pi)

. list grade gpa tuce psi prob pi lnpi, sep(8)

     +-------------------------------------------------------------+
     | grade    gpa   tuce   psi       prob         pi        lnpi |
     |-------------------------------------------------------------|
  1. |     0   2.06     22     1   .0613758   .9386242   -.0633401 |
  2. |     1   2.39     19     1   .1110308   .1110308   -2.197947 |
  3. |     0   2.63     20     0   .0244704   .9755296   -.0247748 |

               --- Output Deleted ---

 30. |     1      4     21     0   .5698931   .5698931   -.5623066 |
 31. |     1      4     23     1   .9453403   .9453403   -.0562103 |
 32. |     1   3.92     29     0   .6935114   .6935114   -.3659876 |
     +-------------------------------------------------------------+

So, this tells us that the predicted probability of the first case being 0 was .9386. The probability of the second case being a 1 was .11. The probability of the 3rd case being a 0 was .9755; and so on. The likelihood is therefore

   L(β | y, X) = Π pi = .9386 * .1110 * .9755 * ... * .6935 = .000002524

which is a really small number; indeed, so small that your computer or calculator may have trouble calculating it correctly (and this is only 32 cases; imagine the difficulty if you have hundreds of thousands). Much easier to calculate is the log likelihood, which is

   ln L(β | y, X) = Σ ln pi = (-.0633) + (-2.198) + ... + (-.366) = -12.88963
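As an aside, the gen/replace pair above can be collapsed into a single line with Stata's cond() function. This is just an equivalent alternative; pi2 and lnpi2 are illustrative variable names, not part of the original handout, and the assert line is an optional sanity check that the two versions agree up to storage precision:

. * same calculation as above, done in one step with cond()
. gen double pi2 = cond(grade == 1, prob, 1 - prob)
. gen double lnpi2 = ln(pi2)
. assert abs(lnpi - lnpi2) < .0001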

Stata's total command makes this calculation easy for us:

. total lnpi

Total estimation                    Number of obs   =         32

--------------------------------------------------------------
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        lnpi |  -12.88963   3.127734     -19.26869   -6.510578
--------------------------------------------------------------

Note this is the exact same value that logit reported as the log likelihood for the model. We call this number LLM, i.e. the log likelihood for the model.

Incidentally, if we repeated this process for the constant-only model, i.e. if we instead ran the logistic regression

. logit grade

we would eventually get

. total lnpi

Total estimation                    Number of obs   =         32

--------------------------------------------------------------
             |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        lnpi |  -20.59173    1.76514     -24.19176   -16.9917
--------------------------------------------------------------

This is the same value that the original logit reported as the log likelihood for iteration 0. We call this LL0. As other handouts will show, LL0 and LLM can be used to compute various statistics of interest.
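For instance, the LR chi-square and McFadden's pseudo R2 that logit printed in its header can be recovered from LL0 and LLM alone. The display lines below plug in the totals computed above and should reproduce, up to rounding, the LR chi2(3) = 15.40 and Pseudo R2 = 0.3740 shown earlier; equivalently, if the logit results are still in memory, the stored results e(ll) and e(ll_0) could be used instead of typing the numbers:

. * LR chi-square: -2 * (LL0 - LLM)
. display -2 * (-20.59173 - (-12.889633))

. * McFadden's pseudo R2: 1 - LLM/LL0
. display 1 - (-12.889633) / (-20.59173)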

Expanded Definition. The maximum likelihood estimates are those values of the parameters that make the observed data most likely. That is, the maximum likelihood estimates will be those values which produce the largest value for the likelihood equation (i.e. get it as close to 1 as possible, which is equivalent to getting the log likelihood equation as close to 0 as possible).

Notes

- Maximum likelihood estimation is generally more complicated than this. You usually have more cases and are often estimating several parameters simultaneously. Nonetheless, the general idea is the same: the ML estimates are those values that make the observed data most likely. See Long for the technical details of how ML estimation is done.

- To say that something is the most likely value is not the same as saying it is likely; there are, after all, an infinite number of other parameter values that would produce almost the same observed data.

- As we've just seen, the likelihood that is reported for models is a very small number (which is part of the reason you'll see the log of the likelihood, or the deviance [-2LL, i.e. -2 * the log of the likelihood], reported instead).

- In the case of OLS regression, the maximum likelihood estimates and the OLS estimates are one and the same.

Properties of ML estimators

- The ML estimator is consistent. As the sample size grows large, the probability that the ML estimator differs from the true parameter by more than an arbitrarily small amount tends toward 0.

- The ML estimator is asymptotically efficient, which means that the variance of the ML estimator is the smallest possible among consistent estimators.

- The ML estimator is asymptotically normally distributed, which justifies various statistical tests.
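In more formal notation (added here for reference; the notation is mine, not the handout's), write theta for the parameter vector and theta-hat for its ML estimate from a sample of size N. The three properties can then be summarized as

   \hat{\theta}_{ML} \xrightarrow{p} \theta
   \qquad\text{and}\qquad
   \sqrt{N}\,\bigl(\hat{\theta}_{ML} - \theta\bigr) \xrightarrow{d} N\bigl(0,\, I_1(\theta)^{-1}\bigr),

where I_1(theta) is the Fisher information for a single observation. The inverse information is the smallest asymptotic variance a consistent estimator can attain, and the limiting normal distribution is what justifies the usual z and Wald tests.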

ML and Sample Size. For ML estimation, the desirable properties of consistency, normality and efficiency are asymptotic, i.e. these properties have been proven to hold as the sample size approaches infinity. The small sample behavior of ML estimators is largely unknown. Long says there are no hard and fast rules for sample size. He says it is risky to use ML with samples smaller than 100, while samples over 500 seem adequate. More observations are needed if there are a lot of parameters; he suggests that at least 10 observations per parameter seems reasonable for the models he discusses. If the data are ill-conditioned (e.g. IVs are highly collinear) or if there is little variation in the DV (e.g. nearly all outcomes are 1), a larger sample is required. Some models seem to require more cases, e.g. ordinal regression models. Both Long and Allison agree that the standard advice is that with small samples you should accept larger p-values as evidence against the null hypothesis. Given that the degree to which ML estimates are normally distributed in small samples is unknown, it is more reasonable to require smaller p-values in small samples.

Numerical Methods for ML Estimation. For OLS regression, you can solve for the parameters using algebra. Algebraic solutions are rarely possible with nonlinear models. Consequently, numerical methods are used to find the estimates that maximize the log likelihood function. Numerical methods start with a guess of the values of the parameters and iterate to improve on that guess. The iterative process stops when estimates do not change much from one step to the next. Long (1997) describes various algorithms that can be used when computing ML estimates.

Occasionally, there are problems with numerical methods:

- It may be difficult or impossible to reach convergence, e.g. you'll get a message like "Convergence not obtained after 250 iterations."

- Convergence does occur, but you get the wrong solution (this is rare, but still, you might want to be suspicious if the numbers just don't look right to you).

- In some cases, ML estimates do not exist for a particular pattern of data. For example, with a binary outcome and a single binary IV, ML estimates are not possible if there is no variation in the IV for one of the outcomes (e.g. everybody coded 1 on the IV is also coded 1 on the DV). Long and Freese (2006) also discuss this problem. Here is an example:

. list

     +--------------+
     | x   y   freq |
     |--------------|
  1. | 1   1     25 |
  2. | 1   0      0 |
  3. | 0   1     30 |
  4. | 0   0     12 |
     +--------------+

. logit y x [fw=freq]

note: x != 0 predicts success perfectly
      x dropped and 25 obs not used

Iteration 0:   log likelihood = -25.127323

Logit estimates                                   Number of obs   =         42
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .
Log likelihood = -25.127323                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .9162907    .341565     2.68   0.007     .2468356    1.585746
------------------------------------------------------------------------------

. dis invlogit(_b[_cons])
.7142857

In the above example, whenever x = 1, y = 1. ML estimates are impossible. So, what Stata does is drop x from the model, along with all observations where x = 1. As a result, you are left analyzing the 42 cases where x = 0 and y = 0 or 1. For those 42 cases, p(y = 1) = 30/42 = .7142857.
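A quick way to spot this kind of pattern before fitting the model is simply to cross-tabulate the outcome against the suspect binary predictor. With the toy data above (using freq as a frequency weight), the empty x = 1, y = 0 cell is exactly the "no variation in the IV for one of the outcomes" problem; this check is just an illustration, not something the handout itself runs:

. * an empty cell in this 2 x 2 table signals that ML estimates will not exist for x
. tabulate y x [fw=freq]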

If you seem to be having problems with your estimation, Long suggests the following:

Model specification. Make sure the software is estimating the model you want to estimate, i.e. make sure you haven't made a mistake in specifying what you want to run. (And, if it is running what you wanted it to run, make sure that what you wanted it to do actually makes sense!)

Incorrect variables. Make sure the variables are correct, e.g. variables have been computed correctly. Check the descriptive statistics. Long says his experience is that most problems with numerical methods are due to data that have not been cleaned.

Number of observations. Convergence generally occurs more rapidly when there are more observations. Not that there is much you can do about sample size, but this may explain why you are having problems.

Scaling of variables. Scaling can be important. The larger the ratio between the largest standard deviation and the smallest standard deviation, the more problems you will have with numerical methods. For example, if you have income measured in dollars, it may have a very large standard deviation relative to other variables. Recoding income to thousands of dollars may solve the problem. Long says that, in his experience, problems are much more likely when the ratio between the largest and smallest standard deviations exceeds 10. (You may want to rescale for presentation purposes anyway, e.g. the effect of 1 dollar of income may be extremely small and have to be reported to several decimal places; coding income in thousands of dollars may make your tables look better.)

Distribution of outcomes. If a large proportion of cases are censored in the tobit model, or if one of the categories of a categorical variable has very few cases, convergence may be difficult. Long says you can't do much about this, although for the latter problem I think you could sometimes combine categories.

When the model is appropriate for the data, Long says that ML estimation tends to work well and convergence is often achieved within 5 iterations. Rescaling can solve some problems. If you are still having problems, you can try using a different program that uses a different algorithm; a problem that may be very difficult for one algorithm may work quite well for another.

The troubleshooting FAQ for my gologit2 program also has several suggestions for what to do when you have convergence problems, and much of this advice will apply to programs besides gologit2. See https://www3.nd.edu/~rwilliam/gologit2/tsfaq.html
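Within Stata, one concrete version of the advice to try a different algorithm is to rerun the command with different maximize options. The lines below, written for the logit model used earlier in this handout, are only a sketch of the kinds of options available (trace, iterate(), difficult and technique() are standard Stata maximize options), not a recipe:

. * print the parameter vector at each iteration so you can see where the search stalls
. logit grade gpa tuce psi, trace

. * cap the number of iterations instead of letting the search run indefinitely
. logit grade gpa tuce psi, iterate(50)

. * use a different stepping algorithm in nonconcave regions of the likelihood
. logit grade gpa tuce psi, difficult

. * switch the maximization technique (Newton-Raphson is the default)
. logit grade gpa tuce psi, technique(bfgs)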