Fitting parametric distributions using R: the fitdistrplus package

M. L. Delignette-Muller - CNRS UMR 5558
R. Pouillot
J.-B. Denis - INRA MIAJ
useR! 2009, 10/07/2009

Background
- Specifying the probability distribution that best fits sample data, among a predefined family of distributions, is a frequent need, especially in Quantitative Risk Assessment.
- A general-purpose maximum-likelihood fitting routine exists for the parameter estimation step: fitdistr(MASS) (Venables and Ripley, 2002).
- The other steps can be implemented using R (Ricci, 2005),
- but no specific package is dedicated to the whole process,
- and it is difficult to work with censored data.

Objective
Build a package that provides functions to support the whole process of specifying a distribution from data:
- choose, among a family of distributions, the best candidates to fit a sample
- estimate the distribution parameters and their uncertainty
- assess and compare the goodness-of-fit of several distributions
and that specifically handles different kinds of data:
- discrete
- continuous, with possibly censored values (right-, left- and interval-censored, with several upper and lower bounds)

Technical choices
- Skewness-kurtosis graph for the choice of distributions (Cullen and Frey, 1999)
- Two fitting methods
  - matching moments: for a limited number of distributions and non-censored data
  - maximum likelihood (mle) using optim(stats): for any distribution, predefined or defined by the user, for non-censored or censored data
- Uncertainty on parameter estimations
  - standard errors from the Hessian matrix (only for mle)
  - parametric or non-parametric bootstrap
- Assessment of goodness-of-fit
  - chi-squared, Kolmogorov-Smirnov, Anderson-Darling statistics
  - density, cdf, P-P and Q-Q plots
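The maximum-likelihood choice above can be illustrated with a minimal base-R sketch. It is not the package's internal code: the data are simulated, and the log-scale parameterization, the starting values and the names negloglik, start, est and se are assumptions made for this illustration only.

```r
# Sketch: maximum likelihood for a gamma distribution with optim(), base R only.
set.seed(1)
x <- rgamma(200, shape = 4, rate = 0.05)  # simulated stand-in data (assumption)

# Negative log-likelihood; parameters are optimized on the log scale
# so that shape and rate stay positive during the search.
negloglik <- function(logpar, data) {
  -sum(dgamma(data, shape = exp(logpar[1]), rate = exp(logpar[2]), log = TRUE))
}

# Moment-based starting values (an assumption, any sensible start would do).
start <- log(c(shape = mean(x)^2 / var(x), rate = mean(x) / var(x)))
opt <- optim(start, negloglik, data = x, hessian = TRUE)

est <- exp(opt$par)                       # back to the natural scale
# Standard errors from the Hessian of the negative log-likelihood,
# carried to the natural scale with the delta method: se(theta) ~ theta * se(log theta).
se_log <- sqrt(diag(solve(opt$hessian)))
se <- est * se_log

print(cbind(estimate = est, "Std. Error" = se))
```

The Hessian-based standard errors rely on the usual large-sample approximation, which is one reason a bootstrap is offered as an alternative.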

Main functions of fitdistrplus
- descdist: provides a skewness-kurtosis graph to help choose the best candidate(s) to fit a given dataset
- fitdist and plot.fitdist: for a given distribution, estimate parameters and provide goodness-of-fit graphs and statistics
- bootdist: for a fitted distribution, simulates the uncertainty in the estimated parameters by bootstrap resampling
- fitdistcens, plot.fitdistcens and bootdistcens: the same functions dedicated to continuous data with censored values

Skewness-kurtosis plot for continuous data
Ex. on consumption data: food serving sizes (g)
> descdist(serving.size)
[Figure: Cullen and Frey graph. x-axis: square of skewness; y-axis: kurtosis. The observation is plotted against the theoretical distributions: normal, uniform, exponential, logistic, beta, lognormal, gamma (Weibull is close to gamma and lognormal).]

Skewness-kurtosis plot for continuous data with bootstrap option
> descdist(serving.size, boot=1001)
[Figure: Cullen and Frey graph. x-axis: square of skewness; y-axis: kurtosis. The observation and the bootstrapped values are plotted against the theoretical distributions: normal, uniform, exponential, logistic, beta, lognormal, gamma (Weibull is close to gamma and lognormal).]
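The quantities behind such a graph can be sketched in base R. The example below uses simulated data in place of serving.size and simple moment estimators; descdist itself may use different (for example bias-corrected) estimators, and the helper names sk_kurt and gamma_line are hypothetical.

```r
# Sketch: the coordinates of a Cullen and Frey graph, base R only.
set.seed(2)
x <- rgamma(300, shape = 4, rate = 0.05)  # simulated stand-in for serving.size (assumption)

# Moment-based squared skewness and (non-excess) kurtosis.
sk_kurt <- function(v) {
  m  <- mean(v)
  s2 <- mean((v - m)^2)
  c(sq_skewness = (mean((v - m)^3) / s2^1.5)^2,
    kurtosis    = mean((v - m)^4) / s2^2)
}

obs <- sk_kurt(x)                          # the "Observation" point

# Nonparametric bootstrap of that point: resample the data and recompute.
boot <- t(replicate(1001, sk_kurt(sample(x, replace = TRUE))))

# Theoretical references: single points for fixed-shape distributions,
# and a line for the gamma family (kurtosis = 3 + 1.5 * squared skewness).
ref_points <- rbind(normal      = c(0, 3),
                    uniform     = c(0, 1.8),
                    exponential = c(4, 9),
                    logistic    = c(0, 4.2))
gamma_line <- function(sq_skew) 3 + 1.5 * sq_skew

print(obs)
print(gamma_line(obs[["sq_skewness"]]))    # kurtosis a gamma would have at this skewness
```

Comparing the observed kurtosis with the value returned by gamma_line gives a rough numerical counterpart to reading the graph by eye.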

Skewness-kurtosis plot for discrete data
Ex. on microbial data: counts of colonies on small food samples
> descdist(colonies.count, discrete=TRUE)
[Figure: Cullen and Frey graph. x-axis: square of skewness; y-axis: kurtosis. The observation is plotted against the theoretical distributions: normal, negative binomial, Poisson.]

Fit of a given distribution by maximum likelihood or matching moments
Ex. on consumption data: food serving sizes (g)

Maximum likelihood estimation
> fg.mle <- fitdist(serving.size, "gamma", method="mle")
> summary(fg.mle)
      estimate Std. Error
shape   4.0083    0.34134
rate    0.0544    0.00494
Loglikelihood: -1254

Matching moments estimation
> fg.mom <- fitdist(serving.size, "gamma", method="mom")
> summary(fg.mom)
      estimate
shape   4.2285
rate    0.0574
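For the gamma distribution, matching moments has a closed form. With mean $m$ and variance $v$, the relations $m = \text{shape}/\text{rate}$ and $v = \text{shape}/\text{rate}^2$ give $\text{rate} = m/v$ and $\text{shape} = m^2/v$. The sketch below applies this to simulated data (an assumption, since serving.size is not reproduced here). It uses the variance with divisor n, and whether the package uses n or n - 1 is not stated in the slides. Recent releases of fitdistrplus appear to name this method "mme" rather than "mom".

```r
# Sketch: matching-moments estimates for a gamma distribution, base R only.
set.seed(3)
x <- rgamma(250, shape = 4, rate = 0.05)  # simulated stand-in data (assumption)

m <- mean(x)
v <- mean((x - m)^2)                      # empirical variance, divisor n (assumption)

mom <- c(shape = m^2 / v, rate = m / v)
print(mom)

# By construction the fitted gamma reproduces the empirical mean and variance.
print(c(fitted_mean = mom[["shape"]] / mom[["rate"]],
        fitted_var  = mom[["shape"]] / mom[["rate"]]^2))
```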

Comparison of goodness-of-fit statistics
Ex. on consumption data: food serving sizes (g)
Comparison of the fits of three distributions using the Anderson-Darling statistic

Gamma
> fitdist(serving.size, "gamma")$ad
[1] 3.566019
Lognormal
> fitdist(serving.size, "lnorm")$ad
[1] 4.543654
Weibull
> fitdist(serving.size, "weibull")$ad
[1] 3.573646
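The Anderson-Darling statistic being compared here can be computed by hand as $A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F(x_{(i)}) + \ln\left(1 - F(x_{(n+1-i)})\right)\right]$, where $F$ is the fitted cdf and $x_{(i)}$ are the ordered observations. The following base-R sketch applies it to simulated data with two candidate fits only (gamma by matching moments, lognormal by its closed-form maximum likelihood). The helper ad_stat is hypothetical and is not the package's implementation.

```r
# Sketch: Anderson-Darling statistic for a fitted distribution, base R only.
ad_stat <- function(x, cdf, ...) {
  n <- length(x)
  u <- cdf(sort(x), ...)                  # fitted cdf at the ordered observations
  i <- seq_len(n)
  -n - mean((2 * i - 1) * (log(u) + log(1 - rev(u))))
}

set.seed(4)
x <- rgamma(300, shape = 4, rate = 0.05)  # simulated stand-in data (assumption)

# Gamma fitted by matching moments.
m <- mean(x); v <- var(x)
ad_gamma <- ad_stat(x, pgamma, shape = m^2 / v, rate = m / v)

# Lognormal fitted by maximum likelihood (closed form on the log scale).
lx <- log(x)
ad_lnorm <- ad_stat(x, plnorm, meanlog = mean(lx),
                    sdlog = sqrt(mean((lx - mean(lx))^2)))

print(c(gamma = ad_gamma, lnorm = ad_lnorm))  # smaller means a closer fit
```

Because the parameters are estimated from the same data, the statistic is best read as a relative measure between candidate distributions rather than against standard critical values.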

Goodness-of-fit graphs for continuous data
Ex. on consumption data: food serving sizes (g)
> plot(fg.mle)
[Figure: four panels. "Empirical and theoretical distr." (density against data); "QQ plot" (sample quantiles against theoretical quantiles); "Empirical and theoretical CDFs" (CDF against data); "PP plot" (sample probabilities against theoretical probabilities).]

Goodness-of-fit graphs for discrete data
Ex. on microbial data: counts of colonies on small food samples
> fnbinom <- fitdist(colonies.count, "nbinom")
> plot(fnbinom)
[Figure: two panels. "Empirical (black) and theoretical (red) distr." (density against data); "Empirical (black) and theoretical (red) CDFs" (CDF against data).]

Fit of a given distribution by maximum likelihood to censored data
Ex. on microbial censored data: concentrations in food with left-censored values (not detected) and interval-censored values (detected but not counted)
> log10.conc
   left right
1  1.73  1.73
2  1.51  1.51
3  0.77  0.77
4  1.96  1.96
5  1.96  1.96
6 -1.40  0.00
7 -1.40 -0.70
8    NA -1.40
9 -0.11 -0.11
...
> fnorm <- fitdistcens(log10.conc, "norm")
> summary(fnorm)
     estimate Std. Error
mean    0.118      0.332
sd      1.426      0.261
Loglikelihood: -32.1
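A censored-data likelihood of this kind can be sketched in base R. Each exact observation contributes its density, a left-censored one (left is NA) contributes F(right), a right-censored one (right is NA) contributes 1 - F(left), and an interval-censored one contributes F(right) - F(left). The example uses only the nine rows printed above, since the remaining rows are elided, so its estimates are not expected to match the slide. The function cens_negloglik and the starting values are assumptions for illustration.

```r
# Sketch: normal fit to censored data by maximum likelihood, base R only.
# Only the nine rows shown in the slide; the full dataset is not available here.
dat <- data.frame(
  left  = c(1.73, 1.51, 0.77, 1.96, 1.96, -1.40, -1.40,    NA, -0.11),
  right = c(1.73, 1.51, 0.77, 1.96, 1.96,  0.00, -0.70, -1.40, -0.11)
)

cens_negloglik <- function(par, dat) {
  mu <- par[1]
  sigma <- exp(par[2])                    # log scale keeps sd positive
  exact    <- !is.na(dat$left) & !is.na(dat$right) & dat$left == dat$right
  leftc    <- is.na(dat$left)             # only an upper bound is known
  rightc   <- is.na(dat$right)            # only a lower bound is known
  interval <- !exact & !leftc & !rightc
  ll <- numeric(nrow(dat))
  ll[exact]    <- dnorm(dat$left[exact], mu, sigma, log = TRUE)
  ll[leftc]    <- pnorm(dat$right[leftc], mu, sigma, log.p = TRUE)
  ll[rightc]   <- pnorm(dat$left[rightc], mu, sigma,
                        lower.tail = FALSE, log.p = TRUE)
  ll[interval] <- log(pnorm(dat$right[interval], mu, sigma) -
                      pnorm(dat$left[interval], mu, sigma))
  -sum(ll)
}

opt <- optim(c(0.5, 0), cens_negloglik, dat = dat)  # arbitrary start (assumption)
est <- c(mean = opt$par[1], sd = exp(opt$par[2]))
print(est)
print(-opt$value)                         # maximized log-likelihood
```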

Goodness-of-fit graphs for censored data
Ex. on microbial censored data: concentrations in food
> plot(fnorm)
[Figure: "Cumulative distribution plot" (CDF against censored data).]

Bootstrap resampling
Ex. on microbial censored data
> bnorm <- bootdistcens(fnorm)
> summary(bnorm)
Nonparametric bootstrap medians and 95% CI
     Median   2.5% 97.5%
mean  0.233 -0.455 0.875
sd    1.294  0.908 1.776
> plot(bnorm)
[Figure: scatterplot of the bootstrapped values of the two parameters (mean against sd).]
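The principle of the nonparametric bootstrap summarized above can be sketched in base R: resample the data with replacement, refit, and summarize the refitted parameters by their median and 2.5% and 97.5% quantiles. For brevity the sketch uses simulated non-censored data and the closed-form normal fit, which are assumptions. The helper fit_norm is hypothetical.

```r
# Sketch: nonparametric bootstrap of fitted parameters, base R only.
set.seed(6)
x <- rnorm(40, mean = 0.2, sd = 1.4)      # simulated stand-in data (assumption)

# Closed-form maximum likelihood estimates for a normal distribution.
fit_norm <- function(v) c(mean = mean(v), sd = sqrt(mean((v - mean(v))^2)))

# One row per bootstrap replicate, one column per parameter.
boot <- t(replicate(1001, fit_norm(sample(x, replace = TRUE))))

# Median and 95% percentile interval for each parameter.
summ <- t(apply(boot, 2, quantile, probs = c(0.5, 0.025, 0.975)))
colnames(summ) <- c("Median", "2.5%", "97.5%")
print(round(summ, 3))
```

Plotting the two columns of boot against each other gives the kind of scatterplot of bootstrapped values described for plot(bnorm).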

Use of the bootstrap in risk assessment
The bootstrap sample may be used to take uncertainty into account in risk assessment, in two-dimensional Monte Carlo simulations, as proposed in the package mc2d.
[Figure: diagram with the labels "Uncertain hyperparameter 1", "Uncertain hyperparameter 2", "Variability", "Uncertain and Variable parameter" and "Uncertainty".]
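A two-dimensional Monte Carlo simulation of this kind can be sketched in base R, without mc2d. Each bootstrap parameter pair represents one draw in the uncertainty dimension, and for each pair a sample is simulated in the variability dimension. The data, the sizes n_unc and n_var, and the choice of the 95th percentile as the quantity of interest are all assumptions made for illustration.

```r
# Sketch: two-dimensional Monte Carlo from bootstrapped parameters, base R only.
set.seed(7)
x <- rnorm(40, mean = 0.2, sd = 1.4)      # simulated stand-in data (assumption)

n_unc <- 200                              # draws in the uncertainty dimension
n_var <- 1000                             # draws in the variability dimension

# Uncertainty: bootstrapped (mean, sd) pairs act as the uncertain hyperparameters.
boot_par <- t(replicate(n_unc, {
  b <- sample(x, replace = TRUE)
  c(mean = mean(b), sd = sd(b))
}))

# Variability: for each hyperparameter pair, simulate the variable quantity.
sims <- t(apply(boot_par, 1, function(p) rnorm(n_var, p["mean"], p["sd"])))

# A variability statistic (95th percentile) computed per uncertainty draw,
# then summarized across the uncertainty dimension.
p95 <- apply(sims, 1, quantile, probs = 0.95)
print(quantile(p95, c(0.025, 0.5, 0.975)))
```

The final line reads as an uncertainty interval around a variability percentile, which is the kind of output a two-dimensional simulation is meant to provide.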

Conclusion
- fitdistrplus could help risk assessment. It is part of a collaborative project with two other packages under development, mc2d and ReBaStaBa: the R-Forge project "Risk Assessment with R", http://riskassessment.r-forge.r-project.org/
- fitdistrplus could also be used more widely to help fit univariate distributions to data.

Still many things to do
fitdistrplus is still under development. Many improvements are planned:
- other goodness-of-fit statistics
- other graphs for goodness-of-fit for censored data (Turnbull, ...)
- optimized choice of the algorithm used in optim for the likelihood maximization
- graphs of likelihood contours (detection of identifiability problems)
- ...
Do not hesitate to send us other improvement ideas!