USE OF PROC IML TO CALCULATE L-MOMENTS FOR THE UNIVARIATE DISTRIBUTIONAL SHAPE PARAMETERS SKEWNESS AND KURTOSIS


Michael A. Walega
Covance, Princeton, New Jersey

Introduction

Exploratory data analysis statistics, such as those generated by the SAS procedure PROC UNIVARIATE (1990), are useful tools to characterize the underlying distribution of data prior to more rigorous statistical analyses. Assessment of the distributional shape of data is usually accomplished by careful examination of the values of the third and fourth central moments, skewness and kurtosis. However, when the sample size is small or the underlying distribution is non-normal, the information obtained from the sample skewness and kurtosis can be misleading. One alternative to the central moment shape statistics is the use of linear combinations of order statistics (L-moments) to examine the distributional shape characteristics of data. L-moments have several theoretical advantages over the central moment shape statistics: characterization of a wider range of distributions, robustness to outliers and more accurate estimates in small samples. This paper focuses on the development of a macro program that uses SAS/IML (1989) to generate the central moment and L-moment distributional shape parameters. In addition, the results of simulations, conducted with various sample sizes and distributions, will be presented.

Background

Largely through the influence of John Tukey's work (1977), statisticians have increasingly emphasized the exploratory analysis of data prior to more formal statistical analyses (t-tests, ANOVA, etc.). Tukey has suggested that to fully understand the nature of a variable and its measurement, characteristics other than the central tendency (mean) and variability (standard deviation) need to be examined. Many classical statistical tests rely on the assumption that the underlying distribution of the data (or residuals) is Gaussian.
Bickel (1988) and Van Der Laan and Verdooren (1987) discuss the concept of robustness and how it pertains to the assumption of normality. As discussed by Glass et al. (1972), incorrect conclusions may be reached when the normality assumption is not valid, especially when one-tail tests are employed or the sample size or significance level is very small. Hopkins and Weeks (1990) also discuss the effects of highly nonnormal data on hypothesis testing of variances. Thus, the examination of the skewness (departure from symmetry) and kurtosis (deviation from a normal curve) is an important component of exploratory data analyses.

Various methods to estimate skewness and kurtosis have been proposed (MacGillivray and Balanda, 1988). For many years, the conventional coefficients of skewness and kurtosis, $\gamma$ and $\kappa$ (Hosking, 1990), have been used to describe the shape characteristics of distributions. However, as pointed out by Hosking (1990) and Royston (1992), these coefficients are not without limitations. Both are sensitive to minute changes in the tails of a distribution, susceptible to moderate outliers and biased in small to moderately sized samples from skew distributions. Also, the information conveyed by the third and fourth central moments with regard to the shape of a distribution can be difficult to assess. Thus, it would be appropriate to determine whether other, more robust measures of skewness and kurtosis can be used to assess the shape of a distribution.

L-moments

One class of more robust measures is based on linear combinations of order statistics, or L-moments. In theory, L-moments are less prone to the effects of sampling variability than conventional moments. Hosking (1990) provides an excellent overview of the theory behind the derivation and application of L-moments as summary statistics for univariate probability distributions.
Royston (1992) compares the properties of the conventional shape parameters to their L-moment counterparts for two lognormal distributions. The detailed theory behind L-moments is not discussed here; the reader is referred to the two aforementioned papers. Instead, a brief overview of the development of the equations necessary to apply L-moments is given below. As in the paper by Royston, the notation of Hosking (1990) will be employed.

Let $X_1, \dots, X_n$ be a sample of size $n$ drawn from the distribution of a random variable $X$ with mean $\mu$ and variance $\sigma^2$, and let $X_{1:n} \le \dots \le X_{n:n}$ be the order statistics. The L-moments of $X$ are defined by

\[ \lambda_r = r^{-1} \sum_{k=0}^{r-1} (-1)^k \binom{r-1}{k} E X_{r-k:r}, \qquad r = 1, 2, \dots, \]

where $\lambda_r$ is the $r$th L-moment of the distribution and $E X_{i:r}$ is the expected value of the $i$th smallest observation in a sample of size $r$.

The first four central moments of a random variable $X$ can be written as

\[ \mu = E(X), \quad \sigma^2 = E(X - \mu)^2, \quad \gamma = E(X - \mu)^3 / \sigma^3 \quad \text{and} \quad \kappa = E(X - \mu)^4 / \sigma^4. \]

In a similar fashion, the first four L-moments of a random variable $X$ can be written as

\[ \lambda_1 = E(X), \]
\[ \lambda_2 = \tfrac{1}{2} E(X_{2:2} - X_{1:2}), \]
\[ \lambda_3 = \tfrac{1}{3} E(X_{3:3} - 2X_{2:3} + X_{1:3}) \quad \text{and} \]
\[ \lambda_4 = \tfrac{1}{4} E(X_{4:4} - 3X_{3:4} + 3X_{2:4} - X_{1:4}). \]

It can be seen that $\lambda_1$ is equivalent to the usual measure of central tendency, $\mu$. $\lambda_2$ is similar to $\sigma^2$ in that both measure the difference between two randomly selected values of $X$; however, by its nature $\sigma^2$ assigns more weight to extreme sample values than does $\lambda_2$. $\lambda_3$ is a scale-dependent measure of skewness for a sample of size 3, and $\lambda_4$ is proportional to a weighted difference between the outer extremes and the central portion of samples of size 4 (Royston, 1992).

Scale-free versions of the L-moments for skewness, $\tau_3$, and kurtosis, $\tau_4$, can be written as $\tau_3 = \lambda_3 / \lambda_2$ and $\tau_4 = \lambda_4 / \lambda_2$. An alternative measure of skewness, $\tau'_3$, is defined as $(1 + \tau_3) / (1 - \tau_3)$. This measure is the ratio of the expected length of the upper tail to that of the lower tail in samples of size 3, and as such may be easier to interpret than $\tau_3$. $\lambda_2$, $\tau_3$, $\tau'_3$ and $\tau_4$ are subject to the constraints

\[ \lambda_2 > 0, \quad -1 < \tau_3 < 1, \quad 0 < \tau'_3 < \infty \quad \text{and} \quad \tfrac{1}{4}(5\tau_3^2 - 1) \le \tau_4 < 1. \]
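As a quick check of the definitions above, consider a worked example that is not part of the original paper: $X$ uniform on $(0, 1)$, for which the expected order statistics are $E X_{i:r} = i/(r+1)$. Substituting into the expressions for the first four L-moments gives:

```latex
\[
\begin{aligned}
\lambda_1 &= E(X) = \tfrac{1}{2}, \\
\lambda_2 &= \tfrac{1}{2}\left(\tfrac{2}{3} - \tfrac{1}{3}\right) = \tfrac{1}{6}, \\
\lambda_3 &= \tfrac{1}{3}\left(\tfrac{3}{4} - 2 \cdot \tfrac{2}{4} + \tfrac{1}{4}\right) = 0, \\
\lambda_4 &= \tfrac{1}{4}\left(\tfrac{4}{5} - 3 \cdot \tfrac{3}{5} + 3 \cdot \tfrac{2}{5} - \tfrac{1}{5}\right) = 0,
\end{aligned}
\]
so that
\[
\tau_3 = \lambda_3 / \lambda_2 = 0, \qquad
\tau_4 = \lambda_4 / \lambda_2 = 0, \qquad
\tau'_3 = (1 + \tau_3) / (1 - \tau_3) = 1 .
\]
```

The symmetric uniform distribution therefore has zero L-skewness and an upper-to-lower tail ratio of 1, and the values satisfy the constraints, since $\tfrac{1}{4}(5 \cdot 0 - 1) = -\tfrac{1}{4} \le 0 < 1$.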
If a random sample of size $n$ is drawn from the distribution of the random variable $X$ and $x_{1:n} \le \dots \le x_{n:n}$ are the ordered sample values, then estimates of the L-moments $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$, namely $l_1$, $l_2$, $l_3$ and $l_4$, can be calculated as follows. First, define $w_2$, $w_3$ and $w_4$ as

\[ w_2 = \frac{1}{n(n-1)} \sum_{i=2}^{n} (i-1)\, x_{i:n}, \]
\[ w_3 = \frac{1}{n(n-1)(n-2)} \sum_{i=3}^{n} (i-1)(i-2)\, x_{i:n} \quad \text{and} \]
\[ w_4 = \frac{1}{n(n-1)(n-2)(n-3)} \sum_{i=4}^{n} (i-1)(i-2)(i-3)\, x_{i:n}. \]

Then the L-moments and the corresponding shape statistics can be estimated as

\[ l_1 = \sum_{i=1}^{n} x_i / n, \]
\[ l_2 = 2w_2 - l_1, \]
\[ l_3 = 6w_3 - 6w_2 + l_1, \]
\[ l_4 = 20w_4 - 30w_3 + 12w_2 - l_1, \]
\[ t_3 = l_3 / l_2 \quad \text{and} \quad t_4 = l_4 / l_2. \]

$t_3$ and $t_4$ are the sample L-skewness and L-kurtosis, respectively. The sample estimate of the alternative measure of skewness, $t'_3$, is defined as $(1 + t_3) / (1 - t_3)$.

The Program

The macro program L_MOMENTS was originally written using SAS V6.08 under the VMS operating environment. It has been modified to run under V6.09 and V6.12 on HP-UNIX. With slight modification (detailed below), the program should run on any operating system. The user is required to provide the name of the SAS data set to be used in the analyses (macro variable INDS) and the name(s) of the variables to be analyzed, separated by spaces (macro variable VARS). There is no limit on the number of variables that can be analyzed. Options available to the user include:

Specify the location of the SAS data library (macro variable LIB). Default is the current user location.
BY group processing (macro variable BYVAR). There is no limit on the number of BY variables, which are delimited by a space. Default is no BY group processing.
Generate stem-and-leaf, box and normal probability plots (macro variable PLOTS). Default is no plots.
Generate a hardcopy of the usual PROC UNIVARIATE output, with the central moment and L-moment shape statistics appended (macro variable PRINT). Default is to have the output provided.
Create an output data set (temporary or permanent) that contains the central moment and L-moment shape statistics (macro variable OUT). Default is that no output data set is created.

A brief description of the flow of the program follows. A driver macro is used to initialize variables, search for the analysis data set, call a macro that writes the options selected by the user to the LMOMENTS.LOG file, and call a macro that performs the calculations. If no analysis data set is found, the program reports this error to the .log file and terminates. Otherwise, the calculation macro begins.
If BY variable processing is requested, the data are sorted before submission to PROC UNIVARIATE for analysis. PROC PRINTTO is used to capture the usual output and send it to the file UNI.DAT. An output data set from PROC UNIVARIATE is used to store the number of non-missing observations for each analysis.

If the user chooses to generate a hardcopy of the results, a DATA step is used to process the UNI.DAT file. The functions PUT and SUBSTR are used in conjunction with the $HEX16. format to search for page breaks and set a flag that will be used to fire a PUT _PAGE_ in a DATA _NULL_ step at the end of the program. Next, a flag is set if BY variable box plots are created. For each page of output that does not contain BY variable box plots, a counter is incremented. The counter is used to facilitate direct read access of the shape statistics data set created by PROC IML for use in the DATA _NULL_ step that generates the hardcopy output. Next, the part of the output line that displays the values for skewness and kurtosis is removed. Finally, a flag is set that indicates the last line of the tabular portion of the PROC UNIVARIATE output.

For each analysis variable, the raw data are sorted, then merged and transposed. The output data set from PROC UNIVARIATE that contains the number of non-missing observations is also transposed. PROC IML is then used to calculate the central moments and L-moments for skewness and kurtosis. The sample skewness and kurtosis are calculated using the same method as PROC UNIVARIATE. Then, conditional upon there being at least four non-missing observations, the values for $w_1$, $w_2$, $w_3$ and $w_4$ are calculated for each combination of BY variable and analysis variable and appended to an interim matrix. If this condition is not met, the calculation of the L-moment parameters is not possible and a flag is set.
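The core calculation described above can be sketched outside SAS. The following is a minimal Python illustration, not the author's PROC IML code: the function name `sample_l_moments`, the use of `None` for missing values and the dictionary keys are all assumptions made for this example. It follows the estimator formulas for $w_2$, $w_3$, $w_4$ and $l_1$ to $l_4$ given earlier and returns `None` when fewer than four non-missing values are available, mirroring the flag described in the text.

```python
def sample_l_moments(values):
    """Sample L-moment shape statistics computed from the w2, w3, w4 sums.

    Hypothetical helper for illustration. Returns a dict, or None when the
    statistics cannot be computed (fewer than four non-missing values, or
    all values equal so that l2 is zero).
    """
    x = sorted(v for v in values if v is not None)  # None marks a missing value
    n = len(x)
    if n < 4:
        return None

    # Ranks are 1-based. Terms with i < 2, 3 or 4 are zero, so summing over
    # every rank is the same as starting the sums at i = 2, 3 and 4.
    ranked = list(enumerate(x, start=1))
    w2 = sum((i - 1) * xi for i, xi in ranked) / (n * (n - 1))
    w3 = sum((i - 1) * (i - 2) * xi for i, xi in ranked) / (n * (n - 1) * (n - 2))
    w4 = sum((i - 1) * (i - 2) * (i - 3) * xi for i, xi in ranked) / (
        n * (n - 1) * (n - 2) * (n - 3)
    )

    l1 = sum(x) / n
    l2 = 2 * w2 - l1
    l3 = 6 * w3 - 6 * w2 + l1
    l4 = 20 * w4 - 30 * w3 + 12 * w2 - l1
    if l2 <= 0:
        return None  # constant data: the ratios t3 and t4 are undefined

    t3 = l3 / l2
    t4 = l4 / l2
    # Alternative skewness measure; infinite in the degenerate case t3 = 1.
    t3_alt = (1 + t3) / (1 - t3) if t3 < 1 else float("inf")
    return {"l1": l1, "l2": l2, "t3": t3, "t3_alt": t3_alt, "t4": t4}
```

For the evenly spaced values 1, 2, 3, 4 the function gives $l_1 = 2.5$, $l_2 = 5/6$ and, up to floating-point rounding, $t_3 = 0$, $t_4 = 0$ and $t'_3 = 1$, as expected for a symmetric sample.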
Finally, the L-moment parameters are calculated, concatenated with the BY variables (if present), the central moment parameters and the conditional flag described above, and placed into a SAS data set. The names of the analysis variables are placed in a separate SAS data set.

Once the calculations have been completed, user-defined options direct the results to hardcopy output and/or a temporary or permanent SAS data set. If hardcopy output is requested, a DATA _NULL_ step writes the modified PROC UNIVARIATE output and, using the direct access counter previously described, places the shape statistics immediately below the last line of the PROC UNIVARIATE tabular output. If the sample size flag generated by PROC IML has been fired, a ** is printed for the L-moment output, with an appropriate footnote. If the user has requested that a temporary (OUT=T) or permanent (OUT=P) data set be created, then the two resultant data sets from PROC IML are merged and the data set is created as appropriate.

Simulations

Simulations were conducted to explore the applicability of the L-moment shape statistics to varying sample sizes and distributional shapes. For each of the following distributions, 5000 data sets were generated for sample sizes 5, 10, 20, 40, 60, 125 and 250:

Logistic: y = a + k*log(x/(1-x)), where a = 0 and k = 1;
Gumbel: y = a - b*log(-log(x)), where a = 1 and b = 1;
Normal(0, 1);
Exponential: y = a - b*log(1-x), where a = 1 and b = 1;
Lognormal: y = exp(a*x), where a = 0.5;
Lognormal: y = exp(a*x), where a = 1.

In the equations, x is a random normal (0,1) variate. The table below lists the theoretical values for the shape statistics for the above distributions (values for the central moments are taken from Hastings and Peacock, 1975; except for $\tau_4$ for the Lognormal distributions (Royston, 1992), values for the L-moments are taken from Hosking, 1990):

Distribution      $\gamma$   $\kappa$   $\tau_3$   $\tau_4$
Logistic
Gumbel
Normal
Exponential
Lognormal(0.5)
Lognormal(1.0)

The method of Royston (1992) was used to quantify the results of the simulations. For the logistic and normal distributions, the mean absolute values for $\tau_3$ and $\kappa$ were determined. Otherwise, each of the 5000 values of a shape parameter was standardized by dividing it by the nominal (theoretical) value, and the standardized values were then averaged. The results are presented in Appendix 1, comparing the usual shape statistics to the L-moment statistics for the simulated sample sizes. The results are presented as the nominal values for $\tau_3$ and $\kappa$ for the logistic and normal distributions, or as percent of nominal value for the other distributions.
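The percent-of-nominal summary can be illustrated for one case, the exponential distribution. The sketch below is a Python approximation and not the simulation code used for the paper. It rests on several assumptions: x is taken to be uniform on (0, 1), because log(1-x) is only defined for x < 1; the conventional skewness is the bias-adjusted form n/((n-1)(n-2)) * sum(((x - mean)/s)^3), assumed here to correspond to what PROC UNIVARIATE reports by default; and the nominal values are the standard exponential results $\gamma = 2$ and $\tau_3 = 1/3$. The function names and the seed are hypothetical.

```python
import math
import random


def l_skewness(sample):
    """Sample L-skewness t3 = l3 / l2 from the w2 and w3 sums."""
    x = sorted(sample)
    n = len(x)
    ranked = list(enumerate(x, start=1))  # 1-based ranks
    w2 = sum((i - 1) * xi for i, xi in ranked) / (n * (n - 1))
    w3 = sum((i - 1) * (i - 2) * xi for i, xi in ranked) / (n * (n - 1) * (n - 2))
    l1 = sum(x) / n
    return (6 * w3 - 6 * w2 + l1) / (2 * w2 - l1)


def adjusted_skewness(sample):
    """Bias-adjusted conventional sample skewness (assumed PROC UNIVARIATE form)."""
    n = len(sample)
    mean = sum(sample) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / s) ** 3 for v in sample)


def percent_of_nominal_exponential(n, replicates=5000, a=1.0, b=1.0, seed=12345):
    """Mean skewness statistics as a percent of their nominal values.

    Samples are generated as y = a - b*log(1 - x) with x uniform on [0, 1),
    which is an assumption about the generator, not a detail from the paper.
    """
    rng = random.Random(seed)
    sum_g = 0.0
    sum_t3 = 0.0
    for _ in range(replicates):
        sample = [a - b * math.log(1.0 - rng.random()) for _ in range(n)]
        sum_g += adjusted_skewness(sample)
        sum_t3 += l_skewness(sample)
    nominal_g = 2.0        # conventional skewness of the exponential distribution
    nominal_t3 = 1.0 / 3.0  # L-skewness of the exponential distribution
    pct_g = 100.0 * (sum_g / replicates) / nominal_g
    pct_t3 = 100.0 * (sum_t3 / replicates) / nominal_t3
    return pct_g, pct_t3
```

Calling `percent_of_nominal_exponential(20)` returns the two percentages for samples of size 20. Because the generator, seed and skewness formula are assumptions, the figures need not reproduce the entries in Appendix 1.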
Upon review of the results it appears that, for the simulations conducted and independent of sample size or distributional shape, the L-moment shape statistics are in general less biased than the central moment shape statistics. As such, the L-moment shape statistics may be more useful indicators of the type of departure of a sample from normality (Royston, 1992).

Example

The usefulness of L-moment shape statistics becomes apparent when they are applied to the analysis of pharmacokinetic parameters. It has been suggested that many pharmacokinetic parameters follow a lognormal distribution. To examine this, data from Metzler and Huang (1983) will be used to calculate the central moment and L-moment shape statistics for the untransformed and log-transformed area under the plasma concentration-time curve (AUC) data. Figures 1 and 2 present an example of the output produced using the macro call

%LMOMENTS(INDS=TEST,PLOTS=Y,VARS= AUC LOGAUC);

The shape statistics for the untransformed data suggest that the underlying distribution is positively skewed, with some evidence of kurtosis. Log-transformation of the data results in a closer approximation to normality. Note the disparity between the central moment, $\kappa$, and L-moment, $\tau_4$, measures of kurtosis. This can probably be attributed to the poor small sample performance of $\kappa$ compared to $\tau_4$, and to the bias of $\kappa$ in nonnormal distributions.

Discussion

The L-moment shape indices $t_3$, $t'_3$ and $t_4$ have several advantages over the usual shape statistics $\gamma$ and $\kappa$. Accurate characterization of several nonnormal distributions, reasonable unbiasedness in small samples, ease of interpretation and robustness to outliers make the L-moment shape statistics useful measures of the shape of a distribution. As shown in the example, the L-moment shape statistics could be useful indicators of when transformation of data is required.
A macro program was developed to include the calculation of the L-moment shape statistics with the central moment shape statistics in a hardcopy of PROC UNIVARIATE output, an output data set, or both.
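The use of the L-moment shape statistics as a guide to transformation can be illustrated with a small Python sketch. It does not use the Metzler and Huang data. Instead it builds a hypothetical, deterministic "lognormal-like" sample of 20 values from normal quantiles, which is an assumption made purely for illustration, and compares $t_3$ and $t'_3$ before and after taking logs. The names `l_shape` and `lognormal_scores` are hypothetical.

```python
import math
from statistics import NormalDist


def l_shape(values):
    """Return (t3, t3_alt) for at least four distinct values, using w2 and w3."""
    x = sorted(values)
    n = len(x)
    ranked = list(enumerate(x, start=1))  # 1-based ranks
    w2 = sum((i - 1) * xi for i, xi in ranked) / (n * (n - 1))
    w3 = sum((i - 1) * (i - 2) * xi for i, xi in ranked) / (n * (n - 1) * (n - 2))
    l1 = sum(x) / n
    t3 = (6 * w3 - 6 * w2 + l1) / (2 * w2 - l1)
    return t3, (1 + t3) / (1 - t3)


def lognormal_scores(n=20, a=1.0):
    """Hypothetical data: exp(a * z) at the normal quantiles z of (i - 0.5) / n."""
    nd = NormalDist()
    return [math.exp(a * nd.inv_cdf((i - 0.5) / n)) for i in range(1, n + 1)]


raw = lognormal_scores()
logged = [math.log(v) for v in raw]

t3_raw, t3_alt_raw = l_shape(raw)
t3_log, t3_alt_log = l_shape(logged)

print(f"untransformed:   t3 = {t3_raw:.4f}, t3' = {t3_alt_raw:.4f}")
print(f"log-transformed: t3 = {t3_log:.4f}, t3' = {t3_alt_log:.4f}")
```

For the untransformed values $t_3$ is positive and $t'_3$ exceeds 1, indicating a longer upper tail. After the log transformation the values are symmetric about zero, so $t_3$ is zero and $t'_3$ is 1 up to rounding error.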

SAS and SAS/IML are registered trademarks of the SAS Institute, Inc., Cary, NC. HP-UNIX is a registered trademark of the Hewlett-Packard Corporation, Boise, Idaho.

The author can be reached at
Covance
210 Carnegie Center
Princeton, NJ
Phone: (609)
Michael.Walega@covance.com

References

Bickel, P. Robust Estimation. In S. Kotz and N. Johnson (eds.), Encyclopedia of Statistical Sciences, Volume 8. John Wiley and Sons (1988), New York, NY.
Glass, G.V., Peckham, P.D. and J.R. Sanders. Consequences of failure to meet assumptions underlying the fixed effects analysis of variance and covariance. Rev. Educ. Res. 42 (1972).
Hastings, N.A.J. and J.B. Peacock. Statistical Distributions. John Wiley and Sons (1975), New York, NY.
Hopkins, K.D. and D.L. Weeks. Tests for normality and measures of skewness and kurtosis: Their place in research reporting. Educ. Psychol. Meas. 50 (1990).
Hosking, J.R.M. L-moments: Analysis and estimation of distributions using linear combinations of order statistics. J. Royal Stat. Soc. B 52 (1990).
MacGillivray, H.L. and K.P. Balanda. The relationships between skewness and kurtosis. Austral. J. Stat. 30 (1988).
Metzler, C.M. and D.C. Huang. Statistical methods for bioavailability and bioequivalence. Clin. Res. Pract. Drug Reg. Affairs 1 (1983).
Royston, P. Which measures of skewness and kurtosis are best? Stat. Med. 11 (1992).
SAS Institute, Inc. SAS Language: Reference, Version 6, First Edition. Cary, NC: SAS Institute, Inc.
SAS Institute, Inc. SAS Procedures Guide, Version 6, Third Edition. Cary, NC: SAS Institute, Inc. (1990).
SAS Institute, Inc. SAS/IML Software: Usage and Reference, Version 6, First Edition. Cary, NC: SAS Institute, Inc. (1989).
Tukey, J.W. Exploratory Data Analysis. Addison-Wesley (1977), Reading, MA.
Van Der Laan, P. and L.R. Verdooren. Classical analysis of variance methods and nonparametric counterparts. Biom. J. 6 (1987).

APPENDIX 1 - Simulation Results

Sample Size   Logistic   Gumbel
% 95% 38% 79% 4% 101%
% 102% 55% 87% 12% 105%
% 100% 72% 100% 20% 107%
% 101% 80% 100% 28% 108%
% 102% 85% 101% 34% 108%
% 102% 100% 103% 40% 108%
% 102% 104% 103% 42% 108%

Sample Size   Normal   Exponential
% 99% 27% 58% 4% 99%
% 102% 43% 63% 12% 104%
% 101% 57% 65% 21% 108%
% 100% 64% 67% 32% 112%
% 100% 69% 71% 38% 114%
% 100% 74% 73% 46% 115%
% 100% 82% 74% 51% 115%

Sample Size   Lognormal (0.5)   Lognormal (1.0)
5 33% 78% 3% 82% 17% 75% 1% 67%
10 44% 82% 7% 87% 21% 81% 4% 77%
20 61% 89% 18% 91% 30% 87% 8% 83%
40 68% 86% 28% 96% 37% 92% 10% 90%
60 75% 98% 33% 98% 42% 95% 13% 94%
% 99% 40% 99% 51% 97% 18% 96%
% 100% 44% 100% 61% 98% 20% 98%

FIGURE 1. PROC UNIVARIATE output for Variable=AUC (N = 20, Mean 7.08, Mode 2.33; five highest observations 9.28, 10.73, 10.8, 14.02 and 16.26): Moments, Quantiles (Def=5) and Extremes tables, with Skewness and Kurtosis reported by the usual method and by L-moments (T3, T3', T4), followed by the stem-and-leaf display, box plot and normal probability plot.
NOTE: T3' = (1+T3)/(1-T3). The L-Moment statistics are subject to the following constraints: -1 < T3 < 1, 0 < T3' < Infinity, ¼ * (5 * (T3**2) - 1) <= T4 <= 1. ** indicates that L-Moment statistics could not be computed. REF: P. Royston, Stat. Med. 11 (1992).

FIGURE 2. PROC UNIVARIATE output for Variable=LOGAUC (N = 20): Moments, Quantiles (Def=5) and Extremes tables, with Skewness and Kurtosis reported by the usual method and by L-moments (T3, T3', T4), followed by the stem-and-leaf display, box plot and normal probability plot.
NOTE: T3' = (1+T3)/(1-T3). The L-Moment statistics are subject to the following constraints: -1 < T3 < 1, 0 < T3' < Infinity, ¼ * (5 * (T3**2) - 1) <= T4 <= 1. ** indicates that L-Moment statistics could not be computed. REF: P. Royston, Stat. Med. 11 (1992).
