Lecture 3: Fit

A common problem in applications: find a density which fits an experimental sample well. Given a sample x1, ..., xn, we look for a density f which may have generated that sample. There exist infinitely many such densities for a given sample (think of the case n = 1). But for some of them the sample is natural, typical; for others it is extreme, unusual, even if possible. We look for a density such that the sample is typical for it.

Let us treat two examples: the results of an admission test for future students in medicine (a large and regular sample), and the intensity of the last 19 volcanic eruptions at Campi Flegrei (few data, one outlier).

Load in R the file dati_campi_flegrei.txt (first column, except the last component), and save the data in the vector Piro:

A <- read.table(file="dati_campi_flegrei.txt", header=TRUE)
Piro <- A[1:19,1]

Load also test_medicina.txt, saved in the vector Medi:

B <- read.table(file="test_medicina.txt", header=TRUE)
Medi <- B[,2]

These are the Piro data:

5.4, 9.3, 23.4, 10, 27.6, 29.5, 52.9, 44.3, 18.3, 38.7, 7.4, 347.6, 5.3, 19.1, 44.3, 29.5, 71.2, 5.4, 18.1

Histograms and empirical cumulatives

A histogram is a kind of empirical density. But it is not uniquely determined by the data: it depends on the classes. Let us see two histograms of Piro, hist(Piro) and hist(Piro,15):

[Figure: two histograms of Piro, with default classes and with 15 classes, absolute frequencies]

They show absolute frequencies. If we want area one under the graph, we use hist(Piro,15,freq=FALSE):
[Figure: histogram of Piro with 15 classes, density scale]

We get a first idea of the data and of the probability of the different values. Due to the outlier 347.6, most of the histogram is squeezed to the left. We may expand it by removing the outlier (the 12th value):

Piro.cut <- c(Piro[1:11], Piro[13:19])
hist(Piro.cut, 7, freq=FALSE)

[Figure: histogram of Piro.cut]

From the expansion we note that there is no ascending part on the left, as we would have in Weibull or Gamma distributions with shape > 1. Thus, if we use a Weibull, we choose shape <= 1. Much more regular is the histogram of Medi:

[Figure: histogram of Medi]

Let us plot the empirical cumulative: plot.ecdf(Piro). It is absolute, involving no choice of classes. For Piro and Medi:

[Figure: empirical cdf of Piro]
[Figure: empirical cdf of Medi]

Parametric and non-parametric methods

Using a parametric method means choosing a class of distributions (Weibull, normal, etc.) characterized by a few parameters (usually 2) and looking for the best parameters; then one compares the results of different classes. Non-parametric methods search for a density in very large classes, having a very large number of degrees of freedom. Even such classes may be parametrized, but with too many parameters (sometimes infinitely many). Thus they are very flexible and fit the data very closely.

The previous histograms help us in the choice of the parametric class. For instance, we shall exclude Gaussians for Piro, as well as Beta, but examine Weibull and possibly Gamma. Moreover, the decreasing shape of the histogram suggests shape < 1. Vice versa, for Medi, Gaussians look suitable, although there is a mild asymmetry. Recalling the way Gamma and Weibull are asymmetric, it is more natural to try Weibull. For the Piro data there is an outlier, so presumably a heavy, or at least sub-exponential, tail. Gamma densities are not sub-exponential; Weibull densities are, if shape < 1. Another class offered by R is the log-normal. Summarizing: Gaussian and Weibull for Medi, Weibull and log-normal for Piro.

One more distribution: log-normal

If X is Gaussian, or normal, the random variable Y = e^X is called log-normal. Being at the exponent, X has the effect that Y sometimes takes very large values. For instance, if X takes typical values in 2-4, but sometimes 5, the typical values of Y will be 7-55, but sometimes 150. This is exactly what happens with Piro. The parameters of a log-normal are the mean and standard deviation of the corresponding Gaussian. To mimic the numbers just given, take a Gaussian with μ = 3 and σ = 1. We have:

x <- 1:100
y <- dlnorm(x, 3, 1)
plot(x, y)
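The definition Y = e^X can also be checked empirically: generate a Gaussian sample, exponentiate it, and compare the histogram with dlnorm. A minimal sketch (the sample size, the seed, and the truncation at 150 for readability are our own illustrative choices):

```r
# Empirical check of the definition Y = exp(X), with mu = 3, sigma = 1
# as in the plot above (sketch; sample size and seed are arbitrary).
set.seed(1)                        # for reproducibility
X <- rnorm(10000, 3, 1)            # Gaussian sample
Y <- exp(X)                        # log-normal by definition
hist(Y[Y < 150], 50, freq=FALSE)   # bulk only; the cut tail holds ~2% of the mass
x <- 1:150
lines(x, dlnorm(x, 3, 1))          # theoretical log-normal density
```

The histogram and the curve should agree closely over the bulk, with occasional very large values of Y confirming the heavy right tail.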
[Figure: density of the log-normal with μ = 3, σ = 1]

The only qualitative drawback of this distribution, for Piro, is the ascending initial part. But it is very fast, so we may choose to ignore it. The heavy tail can be seen from the definition, from the graph, or from the density:

f(x) = 1/(x σ √(2π)) · exp(−(log(x) − μ)² / (2σ²))   for x > 0.

Exponential and logarithm compensate and the decay is much slower than exponential, roughly polynomial over any bounded range of x.

A non-parametric method

Let us run:

require(KernSmooth)
density <- bkde(Piro, kernel="normal", bandwidth=20)
plot(density, type="l")

[Figure: two kernel density estimates of Piro]

The package KernSmooth (kernel smoothing) is loaded explicitly, since it is not loaded by default. The aim of this package is to find non-parametric densities, using smoothing methods based on suitable kernels. There are several kernels; we try another one below. The feature of this method is that it fits our data very closely. Run:

hist(Piro, 15, freq=FALSE)
lines(density, type="l")
[Figure: histogram of Piro with the kernel density estimate superimposed]

The drawback, for us, of this method is its main feature: it stays too close to these particular data. Does the precise value of the outlier 347.6 have a physical meaning, or next time may we get 527 or 293? In this example we think that 347.6 has no absolute meaning. Thus the density given by kernel smoothing is not physical.

Parameter estimation

Assume we have chosen a class and we want to find the optimal parameters. Two classical approaches are the method of Maximum Likelihood (ML) and the method of moments. We may also find the parameters by optimizing other quantities, like the L1 distance described below. Let us describe here only ML.

Given a density f and an experimental value x, the number f(x) is not the probability of x (which is zero). It is called, however, the likelihood of x. Given a sample x1, ..., xn, the product

L(x1, ..., xn) = f(x1) ··· f(xn)

is called the likelihood of x1, ..., xn. When the density depends on parameters, say (a, s), we write f_{a,s}(x) and L(x1, ..., xn; a, s). The ML method is: given a sample x1, ..., xn, find the (a, s) which maximizes L(x1, ..., xn; a, s). If it were a probability, we could say: which choice of parameters maximizes the probability of our sample?

Since most probability densities are related to exponentials and products, taking the logarithm is convenient: log L(x1, ..., xn; a, s). Maximizing it is equivalent. If this function is differentiable in (a, s), and the maximum is inside the domain of definition, we must have

∇_{a,s} log L(x1, ..., xn; a, s) = 0.

These are the ML equations. Sometimes they can be solved explicitly; sometimes numerical optimization is needed.

R gives us a routine to compute ML estimates of the parameters, for several classes of densities: fitdistr. In our cases:

require(MASS)
fitdistr(Piro, "weibull")
fitdistr(Piro, "weibull", list(shape=0.5, scale=20))
fitdistr(Piro, "weibull", list(shape=2, scale=100))
fitdistr(Piro, "log-normal")
fitdistr(Medi, "normal")
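To see what fitdistr does internally, one may maximize the log-likelihood directly with optim. A sketch for the Weibull case (the starting values and the name negloglik are our own choices; we optimize over the logarithms of the parameters to keep them positive):

```r
# Direct numerical maximization of the Weibull log-likelihood
# (a sketch of what fitdistr() does; starting guesses are arbitrary).
Piro <- c(5.4, 9.3, 23.4, 10, 27.6, 29.5, 52.9, 44.3, 18.3, 38.7,
          7.4, 347.6, 5.3, 19.1, 44.3, 29.5, 71.2, 5.4, 18.1)
negloglik <- function(p)             # p = log(shape), log(scale)
  -sum(dweibull(Piro, shape=exp(p[1]), scale=exp(p[2]), log=TRUE))
fit <- optim(c(0, log(30)), negloglik)   # minimize -log L
exp(fit$par)                             # back to (shape, scale)
```

The resulting parameters should be close to the fitdistr values reported below.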
mean(Medi)
sd(Medi)

The call fitdistr(Medi, "weibull") gives an error because of the negative values. We cancel them, save the result in Medi.plus, and run fitdistr(Medi.plus, "weibull"). We also changed the initial guesses of the parameters in fitdistr(Piro, "weibull") to check that the maximum did not change. We also checked that the Gaussian fit is obtained just by taking the empirical mean and standard deviation (the method of moments, in its simplest case). The results are:

fitdistr(Piro, "weibull"): shape 0.85, scale 38.11
fitdistr(Piro, "log-normal"): meanlog 3.09, sdlog 1.02
fitdistr(Medi, "normal"): mean 34.97, sd 11.06
fitdistr(Medi.plus, "weibull"): shape 3.58, scale 38.84

Comparison between density and histogram

The first idea is to compare density and histogram. Let us see Piro with Weibull and log-normal:

a <- 0.85
s <- 38.11
x <- (0:5000)/10
hist(Piro, 15, freq=FALSE)
y <- dweibull(x, a, s)
lines(x, y)

[Figure: histograms of Piro with the ML Weibull and ML log-normal densities superimposed]

Both look reasonable, but the comparison is very difficult. Not so different is the Weibull with parameters

a <- 0.8
s <- 100
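The removal of the negative values mentioned above can be done with logical indexing (a sketch; the name Medi.plus follows the text, assuming Medi has been loaded as at the beginning):

```r
# Keep only the strictly positive values, as required by the Weibull
# fit; negative or zero observations make fitdistr() fail.
require(MASS)
Medi.plus <- Medi[Medi > 0]
fitdistr(Medi.plus, "weibull")
```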
[Figure: histogram of Piro with the Weibull(0.8, 100) density superimposed]

The fit of the outlier looks improved, worsening a little elsewhere. We do not say that this kind of comparison is useless, simply that it is not trivial and conclusive. Let us see Medi, with Gaussian and Weibull:

[Figure: histograms of Medi and Medi.plus with the Gaussian and Weibull densities superimposed]

Both are very good. There is no evidence of improvement by Weibull in coping with the asymmetry (Weibull, with those parameters, is almost symmetric). We have seen one example, Medi, where the density-histogram comparison is convincing, and another where it is poor. The presence of an outlier will always deteriorate a density-histogram comparison. Indeed, to be physical, a density must be distributed over a wide range, not only around the outlier.

Comparison between cumulatives

Another comparison is that of the cumulatives, empirical and theoretical. For Piro, Weibull and log-normal, we have

a <- 0.85
s <- 38.11
x <- (0:5000)/10
plot.ecdf(Piro)
y <- pweibull(x, a, s)
lines(x, y)
[Figure: empirical cdf of Piro with the ML Weibull and ML log-normal cumulatives superimposed]

Here, for the first time, we have a hint of the superiority of the log-normal. If we try again the Weibull with

a <- 0.8
s <- 100

we get

[Figure: empirical cdf of Piro with the Weibull(0.8, 100) cumulative superimposed]

which is much worse. Thus: the comparison of cumulatives is very informative. For Medi, Gaussian and Weibull:

[Figure: empirical cdf of Medi with the Gaussian cumulative superimposed]
[Figure: empirical cdf of Medi with the Weibull cumulative superimposed]

Both look perfect. However, we notice a very small discrepancy in the tails. The right tail is better fitted by the Weibull, the left tail by the Gaussian, and not by much. Recall that a Weibull of shape a = 3.58 decays as exp(−(x/s)^3.58), while a Gaussian decays as exp(−x²/(2σ²)). The decay on the right is very strong (even more than a Weibull with a = 3.58). The decay on the left is slower than Gaussian.

Comparison between samples

Another comparison, essentially heuristic, is based on the generation of a sample from the given distribution. Try

a <- 0.85
s <- 38.11
rweibull(19, a, s)
Piro

If we repeat this a few times, we usually get numbers similar to those of Piro, except that most often we do not get numbers of the order of 300. The same happens for the log-normal. This is the only hint, so far, that we have under-estimated the outlier. Traditional fitting methods have this tendency. One can see that the parameters

m <- 3.09
s <- 1.3
rlnorm(19, m, s)

give samples still similar to Piro but, most of the time, with outliers of the right order. The comparison between cumulatives is still good:

[Figure: empirical cdf of Piro with the modified log-normal cumulative superimposed]
and we see why this is better for the outlier. Which one should we prefer?

Q-Q plot

To describe this method, we need the definition of quantile. It is the inverse of the cdf. In all our examples the cdf F is continuous and strictly increasing (except maybe on half-lines). Therefore, given α in (0,1), there exists one and only one number q_α such that F(q_α) = α. The number q_α is called the quantile of order α. For instance, if α = 5%, it is also called the fifth percentile (if α = 25%, the 25th percentile, and so on). Moreover, the 25th, 50th and 75th percentiles are also called first, second and third quartiles.

The empirical cdf F̂ is defined as follows: given a sample x1, ..., xn, we order it; if x_(1), ..., x_(n) is the result, we set

F̂(x_(i)) = i/n.

Some people prefer

F̂(x_(i)) = (i − 0.5)/n,

which is more symmetric. If a sample comes from a cdf F, then F̂(x_(i)) is nearly equal to F(x_(i)). Computing the inverse of F, the quantile, we get that q_{F̂(x_(i))} is roughly equal to x_(i) = q_{F(x_(i))}. But then the points (x_(i), q_{F̂(x_(i))}) will be close to the line y = x. We plot these points and get a feeling for the goodness of fit. For Piro, Weibull and log-normal:

Dati <- Piro
a <- 0.85
s <- 38.11
quant <- function(alpha) qweibull(alpha, a, s)
x <- 1:500
L <- length(Dati)
F.hat <- (1:L)/L - 0.5/L
Dati.ord <- sort(Dati)
plot(x, x, type="l")
q <- quant(F.hat)
lines(Dati.ord, q, type="b")
[Figure: Q-Q plots of Piro against the ML Weibull and the ML log-normal]

Let us add the modified log-normal (σ = 1.3):

[Figure: Q-Q plot of Piro against the modified log-normal]

which clearly shows what happens: the fit of the outlier is improved, while the fit of some other points is worse. The ML log-normal is better than the ML Weibull; our modified log-normal is good as well and improves the outlier. For Medi, Gaussian and Weibull:

L <- length(Medi)
F.hat <- (1:L)/L - 0.5/L
Medi.ord <- sort(Medi)
m <- 34.97
s <- 11.06
q <- qnorm(F.hat, m, s)
x <- (0:700)/10
plot(x, x, type="l")
lines(Medi.ord, q, type="b")
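The Q-Q construction above can be packaged in a small helper function, to avoid repeating the same lines for every distribution. A sketch (the name qq.line and its arguments are our own choices, not from the text):

```r
# Q-Q plot of a sample against a theoretical quantile function,
# following the recipe in the text (symmetric empirical cdf values).
qq.line <- function(dati, quant) {
  L <- length(dati)
  F.hat <- (1:L)/L - 0.5/L            # (i - 0.5)/n
  plot(sort(dati), quant(F.hat))      # ordered sample vs quantiles
  abline(0, 1)                        # reference line y = x
}
# e.g. qq.line(Piro, function(a) qlnorm(a, 3.09, 1.02))
```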
[Figure: Q-Q plots of Medi against the Gaussian and the Weibull]

The result is surprising! We expected a very strong fit, and on the contrary we see very clearly the drawbacks in the tails. The problem is only there; the body of the distribution is perfect. The pictures seen until now were dominated by the body. This Q-Q plot confirms what we saw previously: the decay on the right is very fast (a little more than a Weibull with a = 3.58, which, however, is very good); slower than Gaussian on the left.

Numerical summaries, distances

After several graphical comparisons, let us see some numerical ones. Let us anticipate that they will not be much better than the graphical ones, but they will add a few pieces of information. One of the problems with them is that there are too many. If we use these indices to compare two given distributions, it may work: most of them will give the same ordering. If, on the contrary, we hope to use them to identify the optimal density in a class, or similarly to prove that the ML density is the best, we get into trouble. Usually, the optimal parameters depend on the index. To summarize, a certain degree of subjectivity remains, and cannot be eliminated, by the numerical indices.

A distance between cumulatives

Among the many possible ones, particularly natural is the L1 distance between the empirical and theoretical cumulatives:

I := ∫ |F̂(x) − F(x)| dx.

It measures the distance between the probabilities of events of the form X ≤ t, averaged in t. For simple dimensional and expository reasons, it may be convenient to use the following small variant, which we may call error of fit:
E := 100 · I / (x_max − x_min),

where x_max and x_min refer to the sample x1, ..., xn. The results for Piro are:

ML Weibull: E = 6.13
ML log-normal: E = 5.39, the best of the two
modified log-normal: E = 5.96, better than the ML Weibull.

Exercise. Write R code which computes, for every positive number k, the index

I_k := ∫ |F̂(x) − F(x)|^k dx

and the error of degree k:

E_k := 100 · (I_k / (x_max − x_min))^{1/k}.

Which discrepancies between the densities are captured as k → ∞? (Pay attention to the typical sizes of the numbers involved.)

Are these values typical? We may use the error E to compare different densities, as above. We may use it to compute optimal parameters. But we may also use it as a statistical test, to understand, for instance, whether the ML log-normal is acceptable or not in itself (not whether it is better than another density). We do it in the following way. Consider the ML log-normal. Generate from it a sample of cardinality 19 and compute its error E with respect to the log-normal itself. Repeat 1000 times and get 1000 values of E: e_1, ..., e_1000. A fraction α = k/1000 of them will be greater than the value e obtained by comparing the experimental sample with the log-normal. We interpret α as the probability that, at random, from that log-normal we may get a sample as extreme as the experimental one. Call empirical p-value the number α = k/1000 (or k/10000, etc., depending on the number of trials). If the p-value is small, e.g. < 0.05, it means that it was not easy to get such a sample at random. This indicates that such a log-normal is not natural enough. If, on the contrary, the p-value is not so small, even some 0.15, we cannot exclude that the sample comes from that distribution. In the end, we have a criterion to reject or not reject a distribution. Not rejecting does not mean a confirmation: several other distributions have the same property of non-rejection. The code gives us E, the p-value and the histogram of e_1, ..., e_1000. For Piro, ML log-normal:

E = 5.39, p-value ≈ 0.214
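The Monte Carlo procedure just described can be sketched in R as follows. The integration grid, its truncation at 500 (a small approximation, since simulated samples rarely exceed it), and the helper name err.fit are our own choices:

```r
# Monte Carlo p-value for the ML log-normal, following the text
# (sketch; grid, truncation and helper names are illustrative).
Piro <- c(5.4, 9.3, 23.4, 10, 27.6, 29.5, 52.9, 44.3, 18.3, 38.7,
          7.4, 347.6, 5.3, 19.1, 44.3, 29.5, 71.2, 5.4, 18.1)
m <- 3.09; s <- 1.02                     # ML log-normal parameters
x <- seq(0, 500, by=0.1)                 # integration grid
err.fit <- function(dati) {
  F.hat <- ecdf(dati)                    # empirical cdf as a function
  I <- sum(abs(F.hat(x) - plnorm(x, m, s))) * 0.1   # L1 distance
  100 * I / (max(dati) - min(dati))      # error of fit E
}
E <- err.fit(Piro)                       # error of the real sample
e <- replicate(1000, err.fit(rlnorm(19, m, s)))   # simulated errors
p.value <- mean(e > E)
hist(e)
```

The value of p.value fluctuates a little from run to run, as noted below; increasing the number of trials reduces the fluctuation.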
[Figure: histogram of the 1000 simulated errors for the ML log-normal]

(The p-value varies a little from trial to trial.) We cannot reject this distribution, although this is an indication that the fit is not so good. Much worse is the result for the ML Weibull:

E = 6.127, p-value ≈ 0.149

[Figure: histogram of the 1000 simulated errors for the ML Weibull]

All methods confirm the superiority of the log-normal fit.

Exercise. Find the p-value for the error of degree k introduced in the exercise above.

Exercise. Analyze the data of this lecture by means of the Gamma class. Recall to use dgamma(x, shape=a, scale=s), etc.