
Lecture 3

Common problem in applications: find a density which fits an experimental sample well. Given a sample x_1, ..., x_n, we look for a density f which may have generated that sample. There exist infinitely many such densities for a given sample (think of the case n = 1). But for some of them the sample is natural, typical; for others it is extreme, unusual, even if possible. We look for a density such that the sample is typical for it.

Let us treat two examples: the results of a test for future students in medicine (a large and regular sample), and the intensity of the last 19 volcanic eruptions at Campi Flegrei (few data, one outlier).

Load into R the file dati_campi_flegrei.txt (first column, except the last component) and save the data in the vector Piro:

A <- read.table(file = "dati_campi_flegrei.txt", header = TRUE)
Piro <- A[1:19, 1]

Load also test_medicina.txt, saved in the vector Medi:

B <- read.table(file = "test_medicina.txt", header = TRUE)
Medi <- B[, 2]

These are the Piro data: 5.4, 9.3, 23.4, 10, 27.6, 29.5, 52.9, 44.3, 18.3, 38.7, 7.4, 347.6, 5.3, 19.1, 44.3, 29.5, 71.2, 5.4, 18.1

Histograms and empirical cumulatives

A histogram is a kind of empirical density. But it is not uniquely determined by the data: it depends on the classes. Let us see two histograms of Piro, hist(Piro) and hist(Piro, 15):

[Figure: two histograms of Piro, with default classes and with 15 classes; x-axis Piro (0-350), y-axis Frequency]

They show absolute frequencies. If we want area one under the graph, let us use hist(Piro, 15, freq = FALSE):

[Figure: histogram of Piro with freq = FALSE; x-axis Piro (0-350), y-axis density (0.000-0.020)]

We get a first idea of the data and of the probability of different values. Due to the outlier 347.6, most of the histogram is squeezed to the left. We may expand it by

Piro.cut <- c(Piro[1:11], Piro[13:19])
hist(Piro.cut, 7, freq = FALSE)

[Figure: histogram of Piro.cut with 7 classes; x-axis Piro.cut (0-80), y-axis density (0.000-0.030)]

From the expansion we note that there is no ascending part on the left, as we have in Weibull or Gamma distributions with shape > 1. Thus, if we use Weibull, we choose shape < 1. Much more regular is the histogram of Medi:

[Figure: histogram of Medi; x-axis Medi (0-60), y-axis density (0.00-0.04)]

Let us plot the empirical cumulative: plot.ecdf(Piro). It is absolute, with no choice of classes. For Piro and Medi:

[Figure: empirical CDF of Piro; x-axis 0-300, y-axis Fn(x)]

[Figure: empirical CDF of Medi; x-axis 0-60, y-axis Fn(x)]

Parametric and non-parametric methods

Using a parametric method means choosing a class of distributions (Weibull, normal, etc.) characterized by a few parameters (usually 2) and looking for the best parameters; then one compares the results of different classes. Non-parametric methods search for a density in very large classes, having a very large number of degrees of freedom. Even such classes may be parametrized, but with too many parameters (sometimes infinitely many). Thus they are very flexible and fit data very closely.

The previous histograms help us in the choice of the parametric class. For instance, we shall exclude Gaussians for Piro, as well as Beta, but examine Weibull and possibly Gamma. Moreover, the decreasing shape of the histogram suggests shape < 1. Vice versa, for Medi, Gaussians look suitable, although there is a mild asymmetry. Recall the way Gamma and Weibull are asymmetric; it is more natural to try Weibull. For the Piro data there is an outlier, so presumably a heavy, or subexponential, tail. Gamma distributions are not subexponential. Weibull are, if shape < 1. Another class offered by R is the log-normal. Summarizing: Gaussian and Weibull for Medi, Weibull and log-normal for Piro.

One more distribution: log-normal

If X is Gaussian (normal), the random variable Y = e^X is called log-normal. Being at the exponent, X has the effect that Y sometimes takes very large values. For instance, if X takes typical values in 2-4, but sometimes 5, the typical values of Y will be 7-55, but sometimes 150. It is exactly what happens to Piro. The parameters of a log-normal are the mean and standard deviation of the corresponding Gaussian. To mimic the numbers just given above, take a Gaussian with μ = 3 and σ = 1. We have:

x <- 1:100
y <- dlnorm(x, 3, 1)
plot(x, y)

[Figure: log-normal density dlnorm(x, 3, 1) for x in 0-100; y-axis 0.000-0.030]
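The orders of magnitude above are just exp(2) ≈ 7.4, exp(4) ≈ 54.6, exp(5) ≈ 148. As an illustrative check, not part of the original code, we can simulate from this log-normal and look at a few quantiles:

x <- rnorm(10000, mean = 3, sd = 1)
quantile(exp(x), c(0.16, 0.84, 0.977))   # roughly 7.4, 54.6, 148: bulk 7-55, occasionally ~150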

The only qualitative drawback of this distribution, for Piro, is the initial ascending part. But it is very fast, so we may choose to ignore it. The heavy tail can be seen from the definition, the graph, or the density:

f(x) = 1/(x σ √(2π)) · exp( −(log x − μ)² / (2σ²) ),   for x > 0.

Exponential and logarithm compensate and the decay is polynomial.

A non-parametric method

Let us run:

require(KernSmooth)
density <- bkde(Piro, kernel = "normal", bandwidth = 20)
plot(density, type = "l")

[Figure: kernel density estimates of Piro; x-axis density$x, y-axis density$y]

The package KernSmooth (kernel smoothing) is loaded explicitly, since it is not loaded by default. The aim of this package is to find non-parametric densities, using smoothing methods based on suitable kernels. There are several kernels; we try another one below. The feature of this method is that it fits our data very closely. Run:

hist(Piro, 15, freq = FALSE)
lines(density, type = "l")

[Figure: histogram of Piro with the kernel density estimate overlaid; x-axis 0-350]
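As for the other kernel promised above, here is a minimal sketch; the "box" kernel is one of the options offered by bkde, and the bandwidth 10 is our illustrative choice, not a value from the lecture:

# a second kernel density estimate, to see how the result depends on these choices
density2 <- bkde(Piro, kernel = "box", bandwidth = 10)
plot(density2, type = "l")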

The drawback, for us, of this method is its main feature: it is too close to these particular data. Does the precise value of the outlier 347.6 have a physical meaning, or next time might we get 527 or 293? In this example we think that 347.6 has no absolute meaning. Thus the density given by kernel smoothing is not physical.

Parameter estimation

Assume we have chosen a class and we want to find the optimal parameters. Two classical approaches are the method of Maximum Likelihood (ML) and the method of moments. We may also find the parameters by optimizing other quantities, like the L¹ distance described below. Let us describe here only ML.

Given a density f and an experimental value x, the number f(x) is not the probability of x (which is zero). It is called, however, the likelihood of x. Given a sample x_1, ..., x_n, the product

L(x_1, ..., x_n) = f(x_1) ··· f(x_n)

is called the likelihood of x_1, ..., x_n. When the density depends on parameters, say (a, s), we write f(x | a, s) and L(x_1, ..., x_n | a, s). The ML method is: given a sample x_1, ..., x_n, find the (a, s) which maximizes L(x_1, ..., x_n | a, s). If it were a probability, we could say: which choice of parameters maximizes the probability of our sample?

Since most probability densities are related to exponentials and products, taking the logarithm is convenient: log L(x_1, ..., x_n | a, s). Maximizing it is equivalent. If this function is differentiable in (a, s), and the possible maximum is inside the domain of definition, we must have

∇_(a,s) log L(x_1, ..., x_n | a, s) = 0.

These are the ML equations. Sometimes they can be solved explicitly; otherwise, numerical optimization is needed. R gives us a routine to compute ML estimates of parameters, for several classes of densities: fitdistr. In our cases:

require(MASS)
fitdistr(Piro, "weibull")
fitdistr(Piro, "weibull", list(shape = 0.5, scale = 20))
fitdistr(Piro, "weibull", list(shape = 2, scale = 100))
fitdistr(Piro, "log-normal")
fitdistr(Medi, "normal")
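fitdistr maximizes the likelihood numerically. As a sketch of what happens under the hood (the starting values c(1, 30) are our own illustrative guess), one can minimize the negative log-likelihood directly:

# negative log-likelihood of the Weibull class for the Piro sample
negloglik <- function(p) -sum(dweibull(Piro, shape = p[1], scale = p[2], log = TRUE))
optim(c(1, 30), negloglik)$par   # should be close to fitdistr(Piro, "weibull")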

For the Gaussian fit of Medi, compare with

mean(Medi)
sd(Medi)

The case fitdistr(Medi, "weibull") gives an error because of negative values. We cancel them, saving the result in the vector Medi.plus (e.g. Medi.plus <- Medi[Medi > 0]), and run

fitdistr(Medi.plus, "weibull")

We also changed the initial guesses of the parameters in fitdistr(Piro, "weibull") to check that the maximum did not change. We also checked that the Gaussian fit is obtained just by taking the empirical mean and standard deviation (the method of moments, in its simplest case). The results are:

fitdistr(Piro, "weibull"): shape 0.85, scale 38.11
fitdistr(Piro, "log-normal"): meanlog 3.09, sdlog 1.02
fitdistr(Medi, "normal"): mean 34.97, sd 11.06
fitdistr(Medi.plus, "weibull"): shape 3.58, scale 38.84

Comparison between density and histogram

The first idea is to compare density and histogram. Let us see Piro with Weibull and log-normal:

a <- 0.85
s <- 38.11
x <- (0:5000)/10
hist(Piro, 15, freq = FALSE)
y <- dweibull(x, a, s)
lines(x, y)

[Figure: histogram of Piro with the ML Weibull density overlaid, and with the ML log-normal density overlaid; x-axis 0-350]

Both look reasonable, but the comparison is very difficult. Not so different is the Weibull with parameters

a <- 0.8
s <- 100

[Figure: histogram of Piro with the Weibull(0.8, 100) density overlaid; x-axis 0-350]

The fit of the outlier looks improved, worsening a little bit elsewhere. We do not say that this kind of comparison is useless, simply that it is not trivial and not final. Let us see Medi, with Gaussian and Weibull:

[Figure: histogram of Medi with the ML Gaussian density overlaid; histogram of Medi.plus with the ML Weibull density overlaid]

Both are very good. There is no evidence of improvement by the Weibull in coping with the asymmetry (the Weibull, with those parameters, is almost symmetric).

We have seen an example, Medi, where the density-histogram comparison is convincing, and another where it is poor. The presence of an outlier will always deteriorate a density-histogram comparison. Indeed, to be physical, a density must be distributed over a wide range, not only around the outlier.

Comparison between cumulatives

Another comparison is that of the cumulatives, empirical and theoretical. For Piro, with Weibull and log-normal, we have

a <- 0.85
s <- 38.11
x <- (0:5000)/10
plot.ecdf(Piro)
y <- pweibull(x, a, s)
lines(x, y)
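The code above overlays the Weibull cumulative; for the log-normal announced in the same sentence, the analogous overlay would be, as a sketch using the ML parameters reported above:

plot.ecdf(Piro)
lines(x, plnorm(x, 3.09, 1.02))   # ML log-normal cumulative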

[Figure: empirical CDF of Piro with the ML Weibull cumulative overlaid, and with the ML log-normal cumulative overlaid; x-axis 0-300, y-axis Fn(x)]

Here, for the first time, we have a hint of the superiority of the log-normal. If we try again the Weibull with

a <- 0.8
s <- 100

we get

[Figure: empirical CDF of Piro with the Weibull(0.8, 100) cumulative overlaid]

which is much worse. Thus: the comparison of cumulatives is very informative. For Medi, with Gaussian and Weibull:
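A sketch of the corresponding code (the parameters are the ML values reported above; Medi.plus as constructed earlier):

x <- (0:700)/10
plot.ecdf(Medi)
lines(x, pnorm(x, 34.97, 11.06))    # ML Gaussian cumulative
plot.ecdf(Medi.plus)
lines(x, pweibull(x, 3.58, 38.84))  # ML Weibull cumulative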

[Figure: empirical CDFs of Medi and Medi.plus with the Gaussian and Weibull cumulatives overlaid; x-axis 0-60, y-axis Fn(x)]

Both look perfect. However, we notice a very small discrepancy in the tails. The right tail is better fitted by the Weibull, the left tail by the Gaussian, and not by much. Recall that a Weibull of shape a = 3.58 decays as exp(−(x/s)^3.58), while a Gaussian decays as exp(−x²/(2σ²)). The decay on the right is very strong (even more than a Weibull with a = 3.58). The decay on the left is slower than Gaussian.

Comparison between samples

Another comparison, essentially heuristic, is based on the generation of a sample from the given distribution. Try with

a <- 0.85
s <- 38.11
rweibull(19, a, s)
Piro

If we repeat this a few times, we usually get numbers similar to those of Piro, except that most often we do not get numbers of the order of 300. The same for the log-normal. This is the only hint, until now, that we have under-estimated the outlier. Traditional methods of fit have this tendency. One can see that the parameters

m <- 3.09
s <- 1.3
rlnorm(19, m, s)

give us samples still similar to Piro but most of the time with outliers of the right order. The comparison between cumulatives is good:

[Figure: empirical CDF of Piro with the modified log-normal (σ = 1.3) cumulative overlaid; x-axis 0-300]

and we see why this is better for the outlier. Which should we prefer?

Q-Q plot

To describe this method, we need the definition of quantile. It is the inverse of the cdf. In all our examples the cdf F is continuous and strictly increasing (except maybe on half-lines). Therefore, given α ∈ (0, 1), there exists one and only one number q_α such that F(q_α) = α. The number q_α is called the quantile of order α. For instance, if α = 5%, it is also called the fifth percentile (if α = 25%, the 25th percentile, and so on). Moreover, the 25th, 50th and 75th percentiles are also called the first, second and third quartiles.

The empirical cdf F̂ is defined as follows: given a sample x_1, ..., x_n, we order it; if x_(1) ≤ ... ≤ x_(n) is the result, we set

F̂(x_(i)) = i/n.

Some people prefer

F̂(x_(i)) = (i − 0.5)/n,

which is more symmetric. If the sample comes from a cdf F, then F̂(x_(i)) is nearly equal to F(x_(i)). Compute the inverse of F, the quantile function q, and get that q(F̂(x_(i))) is roughly equal to x_(i). But then the points (x_(i), q(F̂(x_(i)))) will be close to the line y = x. We plot these points and get a feeling of the goodness of fit. For Piro, with Weibull and log-normal:

Dati <- Piro
a <- 0.85
s <- 38.11
quant <- function(alpha) { qweibull(alpha, a, s) }
x <- 1:500
L <- length(Dati)
F.hat <- (1:L)/L - 0.5/L
Dati.ord <- sort(Dati)
plot(x, x, type = "l")
q <- quant(F.hat)
lines(Dati.ord, q, type = "b")
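For the log-normal Q-Q plot it suffices to change the quantile function; a sketch, using the ML parameters reported above:

quant <- function(alpha) { qlnorm(alpha, 3.09, 1.02) }
q <- quant(F.hat)
plot(x, x, type = "l")
lines(Dati.ord, q, type = "b")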

[Figure: Q-Q plots for Piro, ML Weibull and ML log-normal; both axes 0-500]

Let us add the modified log-normal (σ = 1.3):

[Figure: Q-Q plot for Piro, modified log-normal; both axes 0-500]

which clearly shows what happens: the fit of the outlier is improved, while the fit of some other points is worse. The ML log-normal is better than the ML Weibull; our modified log-normal is good as well and improves the outlier. For Medi, with Gaussian and Weibull:

L <- length(Medi)
F.hat <- (1:L)/L - 0.5/L
Medi.ord <- sort(Medi)
m <- 34.97
s <- 11.06
q <- qnorm(F.hat, m, s)
x <- (0:700)/10
plot(x, x, type = "l")
lines(Medi.ord, q, type = "b")

[Figure: Q-Q plots for Medi, Gaussian and Weibull; both axes 0-70]

The result is surprising! We expected a very strong fit, and on the contrary we see very clearly the drawbacks in the tails. The problem is only there; the body of the distribution is perfect. The pictures seen until now were dominated by the body. This Q-Q plot confirms what we saw previously: the decay on the right is very fast (a little more than a Weibull with a = 3.58, which, however, is very good); the decay on the left is slower than Gaussian.

Numerical summaries, distances

After several graphical comparisons, let us see some numerical ones. Let us anticipate that they will not be much better than the graphical ones, but they will add some information. One of the problems with them is that there are too many. If we use these indices to compare two given distributions, it may work: most of them will give the same order. If, on the contrary, we hope to use them to identify the optimal density in a class, or similarly to prove that the ML density is the best, we get in trouble. Usually, the optimal parameters depend on the index. To summarize, a certain degree of subjectivity remains, and cannot be eliminated, by the numerical indices.

A distance between cumulatives

Among the many possible ones, particularly natural is the L¹ distance between the empirical and theoretical cumulatives:

I := ∫ | F̂(x) − F(x) | dx.

It measures the distance between the probabilities of events of the form {X ≤ t}, averaged over t. For simple dimensional and expository reasons, it may be convenient to use the following small variant, which we may call the error of fit:

E = 100 · I / (x_max − x_min),

where x_max and x_min refer to the sample x_1, ..., x_n. The results for Piro are:

ML Weibull: E = 6.13
ML log-normal: E = 5.39, the best of the two
modified log-normal: E = 5.96, better than the ML Weibull.

Exercise. Write R code which computes, for every positive number k, the index

I_k := ∫ | F̂(x) − F(x) |^k dx

and the error of degree k:

E_k = 100 · ( I_k / (x_max − x_min) )^(1/k).

Which discrepancies between the densities are captured as k → ∞? (Pay attention to the typical sizes of the numbers involved.)

Are these values typical? We may use the error E to compare different densities, as above. We may use it to compute optimal parameters. But we may also use it as a statistical test, to understand, for instance, whether the ML log-normal is acceptable or not in itself (not whether it is better than another density). We do it in the following way. Consider the ML log-normal. Generate from it a sample of cardinality 19 and compute its error E with respect to our log-normal. Repeat 1000 times, getting 1000 values of E: e_1, ..., e_1000. A proportion k/1000 of them will be greater than the value e obtained by comparing the experimental sample with the log-normal. We interpret k/1000 as the probability that, at random, from that log-normal we may get a sample as extreme as the experimental one. Call the empirical p-value the number k/1000 (or k/10000 etc., depending on the number of trials). If the p-value is small, e.g. 0.05, it means that it was not easy to get such a sample at random. This indicates that the log-normal is not natural enough. If, on the contrary, the p-value is not so small, even some 0.15, we cannot exclude that the sample comes from that distribution. In the end, we have a criterion to reject or not reject a distribution. Non-rejection does not mean confirmation: several other distributions have the same property of non-rejection. The code gives us E, the p-value and the histogram of e_1, ..., e_1000.
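The code itself is not reproduced in the text; here is a minimal sketch of the procedure just described, under our own assumptions: the integration grid is our choice, and the fixed normalizing Range is suggested by the figure title 100*I.rand/Range below.

# Error of fit E for the ML log-normal of Piro, and its empirical p-value (sketch)
m <- 3.09; s <- 1.02                      # ML log-normal parameters
x <- seq(0, 1000, by = 0.1)               # integration grid (our choice)
Range <- max(Piro) - min(Piro)            # fixed normalization (our assumption)
L1 <- function(sample) sum(abs(ecdf(sample)(x) - plnorm(x, m, s))) * 0.1
E <- 100 * L1(Piro) / Range               # error of the experimental sample
I.rand <- replicate(1000, L1(rlnorm(19, m, s)))
e <- 100 * I.rand / Range                 # simulated errors e_1, ..., e_1000
p.value <- mean(e > E)                    # empirical p-value
hist(e)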

For Piro, ML log-normal: E = 5.39, p-value = 0.214.

[Figure: histogram of the 1000 simulated errors 100*I.rand/Range for the ML log-normal; x-axis 0-35, y-axis Frequency]

(The p-value varies a little from trial to trial.) We cannot reject this distribution, although this is an indication that the fit is not so good. Much worse is the result for the ML Weibull: E = 6.127, p-value = 0.149.

[Figure: histogram of the 1000 simulated errors 100*I.rand/Range for the ML Weibull; x-axis 0-20, y-axis Frequency]

All methods confirm the superiority of the log-normal fit.

Exercise. Find the p-value for the error of degree k introduced in the exercise above.

Exercise. Analyze the data of this lecture by means of the Gamma class. Recall to use dgamma(x, shape = a, scale = s), etc.