Technology Support Center Issue


United States Environmental Protection Agency
Office of Research and Development
Office of Solid Waste and Emergency Response
EPA/600/R-02/084
October 2002

Technology Support Center Issue

Estimation of the Exposure Point Concentration Term Using a Gamma Distribution

Anita Singh¹, Ashok K. Singh², and Ross J. Iaci³

The Technology Support Projects, Technology Support Center (TSC) for Monitoring and Site Characterization was established in 1987 as a result of an agreement between the Office of Research and Development (ORD), the Office of Solid Waste and Emergency Response (OSWER) and all ten Regional Offices. The objectives of the Technology Support Project and the TSC were to make available and provide ORD's state-of-the-science contaminant characterization technologies and expertise to Regional staff, facilitate the evaluation and application of site characterization technologies at Superfund and RCRA sites, and to improve communications between Regions and ORD Laboratories. The TSC identified a need to provide federal, state, and private environmental scientists working on hazardous waste sites with a technical issue paper that identifies data assessment applications that can be implemented to better define and identify the distribution of hazardous waste site contaminants. The examples given in this Issue paper and the recommendations provided were the result of numerous data assessment approaches performed by the TSC at hazardous waste sites.

This paper was prepared by Anita Singh, Ashok K. Singh, and Ross J. Iaci. Support for this project was provided by the EPA National Exposure Research Laboratory's Environmental Sciences Division with the assistance of the Superfund Technical Support Projects Technology Support Center for Monitoring and Site Characterization, OSWER's Technology Innovation Office, the U.S. DOE Idaho National Engineering and Environmental Laboratory, and the Associated Western Universities Faculty Fellowship Program.
For further information, contact Christopher Sibert, Technology Support Center Director, at (702) , Anita Singh at (702) , or A. K. Singh at (702) .

¹ Lockheed Martin Environmental Services, 1050 East Flamingo, Ste. E120, Las Vegas, NV
² Department of Mathematics, University of Nevada, Las Vegas, NV
³ Department of Statistics, University of Georgia, Athens, GA

Technology Support Center for Monitoring and Site Characterization, National Exposure Research Laboratory, Environmental Sciences Division, Las Vegas, NV
Technology Innovation Office, Office of Solid Waste and Emergency Response, U.S. EPA, Washington, D.C.
Walter W. Kovalick, Jr., Ph.D., Director
289CMB02.RPT Rev. 3/6/03

In Superfund and RCRA projects of the U.S. EPA, cleanup, exposure, and risk assessment decisions are often made based upon the mean concentrations of the contaminants of potential concern. A 95% upper confidence limit (UCL) of the population mean is used to estimate the exposure point concentration (EPC) term (EPA, 1992), to determine the attainment of cleanup standards (EPA, 1989), to estimate background level contaminant concentrations, or to compare the soil concentrations with the site specific soil screening levels (EPA, 1996). It is, therefore, important to compute an accurate and stable 95% UCL of the population mean from the available data. The formula for computing a UCL depends upon the data distribution. Typically, environmental data are positively skewed, and a lognormal distribution (EPA, 1992) is often used to model such data distributions. The H-statistic (Land, 1971) based upper confidence limit of the mean (denoted henceforth as H-UCL) is used in these applications. However, recent research in this area (Hardin and Gilbert, 1992; Singh, et al., 1997, 1999; and Schultz and Griffin, 1999) suggests that this may not be an appropriate choice. It is observed that for large values of standard deviation (e.g., exceeding ) of the log-transformed data, the use of the H-statistic leads to unreasonably large, unstable, and impractical UCL values. This is especially true for sample sets of smaller sizes (e.g., n < 20-25). The H-UCL is also very sensitive to a few low or high values. For example, the addition of a sample below the detection limit can cause the H-UCL to increase by a large amount.

Realizing that the use of the H-statistic can result in an unreasonably large UCL, it has been recommended (EPA, 1992) to use the maximum observed value as an estimate of the UCL (EPC term) in cases where the H-UCL exceeds the maximum observed value. Also, when the sample size is 5 or less, the maximum observed concentration is often used as an estimate of the EPC term. However, it is observed that for highly skewed data sets, use of the maximum observed concentration may not provide the specified 95% coverage to the population mean (as shown in Section 5). This is especially true for samples of small size (e.g., 5-10). For larger data sets (e.g., n ≥ 20), the use of the maximum observed value results in an overestimate of the 95% UCL of the population mean. For such highly skewed data sets, use of a gamma distribution based UCL of the mean provides a viable option. A positively skewed data set can quite often be modeled by lognormal as well as gamma distributions. Due to its relative computational ease, however, the lognormal distribution is used to model positively skewed data sets.
However, use of a lognormal model for an environmental data set unjustifiably elevates the minimum variance unbiased estimate of the mean and its UCL to levels that may not be applicable in practice. In this paper, we propose the use of a gamma distribution to model positively skewed data sets. The objective of the present work is to study procedures which can be used to compute a stable and accurate UCL of the mean based upon a gamma distribution. Several parametric and non-parametric (e.g., standard bootstrap, bootstrap-t, Hall's bootstrap, Chebyshev inequality) methods of computing a UCL of the unknown population mean, μ, have also been considered. Monte Carlo simulation experiments have been performed to compare the performances of these methods. The various methods have been compared in terms of the coverage (confidence coefficient) probabilities achieved by the various UCLs. Based upon this study, in Section 6, recommendations have been made about the computation of a UCL of the mean for skewed data distributions originating from various environmental applications.

1. Introduction

Suppose the Regional Project Manager (RPM) of a Superfund site believes that the mean concentration of the contaminant of potential concern (COPC) exceeds a specified cleanup standard, C_s, but the potentially responsible party (PRP) claims that the mean concentration is below C_s. In statistical terminology this can be stated in terms of testing of hypotheses. The hypotheses of interest are the null hypothesis that the mean concentration exceeds the cleanup standard, H₀: μ ≥ C_s, versus the alternative hypothesis, Hₐ: μ < C_s. This formulation of the problem is protective of the environment because it assumes that the area in question is contaminated, and the burden of testing is to show otherwise.

In order to perform a test of these hypotheses, a random sample is collected from the site and concentrations of the COPC in these samples are determined. A suitable statistical test is then used to make a decision. A convenient way to perform a test of hypotheses about an unknown population parameter is first to compute a confidence interval for the parameter, and then reject H₀ if the hypothesized value, in this case the cleanup standard, C_s, lies outside of the confidence interval. For the one-sided hypotheses mentioned above, the test is based on a one-sided UCL of the mean. A one-sided UCL is a statistic such that the true population mean, μ, is less than the UCL with a prescribed probability or level of confidence, say (1 − α). For example, if the UCL is a 95% one-sided upper confidence limit, then μ < UCL with 95% confidence (or with 0.95 probability), and the set of all real numbers less than UCL forms a 95% upper one-sided confidence interval. The corresponding statistical test will reject H₀ (i.e., declare the site clean) if UCL < C_s, and the significance level of this test, or false positive error rate, is α. This follows because if the site is contaminated (i.e., μ ≥ C_s), then the probability of declaring it clean is the probability that UCL < C_s, which is at most α.

Testing of these hypotheses and computation of a UCL of the mean depend upon the population distribution of the COPC concentrations. Several procedures are available in the literature of environmental statistics to compute a UCL of the mean of a normal or a lognormal distribution (e.g., Singh, Singh, and Engelhardt, 1997, 1999; Schultz and Griffin, 1999). In this paper, we focus our effort on the inference procedures for an unknown population mean based upon a gamma distribution. The objective here is to study procedures that can be used to compute an accurate and stable UCL of the mean. Several parametric (Johnson, 1978; Chen, 1995; and Grice and Bain, 1980) and nonparametric (e.g., standard bootstrap, bootstrap-t (Efron, 1982, Hall, 1988), Hall's bootstrap (Hall, 1992), Chebyshev inequality) methods of computing a UCL of the population mean, μ, of a skewed distribution have also been considered. The various methods have been compared in terms of the coverage (confidence coefficient) probabilities provided by the various 95% UCLs. Monte Carlo simulation experiments have been performed to compare the performances of these methods. Based upon this study, recommendations have been made about the computation of a UCL of the mean for skewed data distributions originating from various environmental applications.
Section 2 has a brief description of the gamma distribution and a discussion of goodness-of-fit tests for the gamma distribution. Section 2 also describes estimation of gamma parameters and the computation of the UCL of the mean based upon a gamma distribution. Section 3 describes the various other methods which can be used to compute a UCL of the population mean. Section 4 has some examples illustrating the procedures used. Section 5 discusses the Monte Carlo experiments used to compare these methods, and their results. Section 6 consists of our recommendations for dealing with heavily skewed data sets.

2. The Gamma Distribution

A continuous random variable, X (e.g., COPC concentration), is said to follow a two-parameter gamma distribution, G(k, θ), with parameters k > 0 and θ > 0, if its probability density function is given by the following equation:

$$ f(x; k, \theta) = \frac{1}{\theta^{k}\,\Gamma(k)}\, x^{k-1} e^{-x/\theta}, \qquad x > 0, \tag{1} $$

and zero otherwise. The parameter k is the shape parameter, and θ is the scale parameter (the location parameter is set to zero). Plots of the gamma distribution, G(k, θ), for varying choices of the shape parameter, k, and the scale parameter, θ, are shown in Figures 1-4. These figures have been generated using the statistical software package MINITAB. The mean, variance, and skewness of a gamma distribution, G(k, θ), are given as follows:

$$ \text{Mean} = \mu = k\theta. \tag{2} $$

$$ \text{Variance} = \sigma^{2} = k\theta^{2}. \tag{3} $$

$$ \text{Skewness} = 2/\sqrt{k}. \tag{4} $$

From equation (4), it is noted that skewness increases as the shape parameter k decreases. Figures 1 and 2 have the graphs of highly skewed distributions. As k increases, skewness decreases, and consequently a gamma distribution starts approaching a normal distribution for larger values of k (e.g., k ≥ 10), as can be seen in Figures 3 and 4. Thus for larger values of k, the UCL based upon a gamma distribution and a UCL based upon a normal distribution are in close agreement.
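The relationship between the shape parameter and skewness can be illustrated with a minimal Python sketch of equations (2)-(4). The function name `gamma_moments` is an illustrative choice and not part of the paper.

```python
import math


def gamma_moments(k: float, theta: float) -> dict:
    """Mean, variance, and skewness of G(k, theta), following equations (2)-(4)."""
    if k <= 0 or theta <= 0:
        raise ValueError("shape k and scale theta must both be positive")
    return {
        "mean": k * theta,               # equation (2)
        "variance": k * theta ** 2,      # equation (3)
        "skewness": 2.0 / math.sqrt(k),  # equation (4): depends on k only
    }


# Skewness falls as k grows, whatever the scale (theta = 50 as in Figures 2 and 4).
for k in (0.1, 0.5, 2.0, 10.0):
    m = gamma_moments(k, 50.0)
    print(f"k={k:5.1f}  mean={m['mean']:8.2f}  skewness={m['skewness']:.3f}")
```

Because the skewness depends only on k, changing the scale from 1 to 50 rescales the mean and variance but leaves the shape of the distribution unchanged.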
From Figures 1-4, it can also be seen that the scale parameter, θ, simply affects the scale of the distribution and has no effect on the shape of the gamma distribution. In practice, a highly skewed data set can be fitted by both lognormal and gamma distributions. However, the difference between the UCLs obtained using the two distributions can be enormous. This is especially true when the shape parameter is small (e.g., k < 1). This is illustrated in examples 2-5 given in Section 4.

Figure 1. Graphs of the gamma distributions G(0.1, 1), G(0.2, 1), and G(0.5, 1).

Figure 2. Graphs of the gamma distributions G(0.1, 50), G(0.2, 50), and G(0.5, 50).

Figure 3. Graphs of the gamma distributions G(2, 1), G(4, 1), and G(10, 1).

Figure 4. Graphs of the gamma distributions G(2, 50), G(4, 50), and G(10, 50).

2.1 Goodness-of-Fit Tests for Gamma Distribution

Since goodness-of-fit tests for gamma distributions are not readily available, a brief description of those tests is given here. Several tests based upon empirical distribution functions (EDF) exist in the statistical literature and can be used to test for a gamma distribution. These tests include the Kolmogorov-Smirnov D-test statistic, the Anderson-Darling A²-test statistic, and the Cramér-von Mises test statistics, W² and U² (e.g., see D'Agostino and Stephens (1986), page 101). The exact critical values of these statistics are not available; this is especially true when the shape parameter, k, of the gamma distribution is less than 1. Some asymptotic upper-tail critical values of the test statistics W², A², and U² are given in D'Agostino and Stephens (1986) for values of the shape parameter k ≥ 1 (pages ). Schneider (1978) also studied the goodness-of-fit tests for the gamma distribution. He derived the critical values of the Kolmogorov-Smirnov D-test statistic for selected values of the shape parameter, k, and the sample size for the gamma distribution with unknown parameters. All of these tests are right-tailed. This means that if a computed test statistic exceeds its respective α100% critical value, the null hypothesis of a gamma distribution will be rejected at the α level of significance.

Most of the commercially available software packages such as SAS and S-PLUS do not provide goodness-of-fit tests for a gamma distribution. The software ExpertFit (developed by Law & Associates, Inc., 2001) performs a goodness-of-fit test for the gamma distribution using the Anderson-Darling test statistic, A², and the Kolmogorov-Smirnov test statistic. Because exact critical values of the general test statistics are unavailable, ExpertFit (Law and Kelton (2000)) uses approximate critical values of the test statistic under the assumption that all parameters (e.g., shape and scale) of the distribution are known, that is, that the distribution is completely specified as given in Stephens (1970). Those critical values are the generic critical values for all completely specified distributions. ExpertFit uses these generic critical values to test for a gamma distribution. These critical values are also given on page 105 of D'Agostino and Stephens (1986). The authors of this article also developed a program, GamGood (2002), to test for a gamma distribution. This program computes the various goodness-of-fit test statistics using the formulae given on page 101 of D'Agostino and Stephens (1986).
In this paper, we also use the smoothed percentage points of the Kolmogorov-Smirnov (K-S) D-test statistic as computed by Schneider and Clickner (1976) and Schneider (1978) to test for a gamma distribution. An illustration of the goodness-of-fit test for a gamma distribution is given in Example 1.

Example 1

The following data set of size 20 is given by Grice and Bain (1980): 152, 152, 115, 109, 137, 88, 94, 77, 160, 165, 125, 40, 128, 123, 136, 101, 62, 153, 83, and 69. None of the parameters of the underlying distribution are known. The various goodness-of-fit test statistics are given by A² = , W² = , U² = , and D = . The estimated shape parameter, k, for this data set is (see Example 1, to be continued). For a shape parameter of 7.513, the asymptotic 5% critical values (Table 4-21, page 155, D'Agostino and Stephens, 1986) of these statistics are A² = 0.755, W² = 0.127, and U² = 0.117, and the critical value of the K-S statistic is D = (Table 7 of Schneider, 1978). Since all of the test statistics are less than their respective critical values, there is insufficient evidence at the 0.05 level of significance to conclude that the data do not follow a gamma distribution.

2.2 Estimation of Parameters of the Gamma Distribution

Next, we consider the estimation of the parameters of a gamma distribution. The population mean and variance of a gamma distribution, G(k, θ), are functions of both parameters, k and θ. In order to estimate the mean, one has to obtain estimates of k and θ. Computation of the maximum likelihood estimate (MLE) of k is quite complex and requires the computation of digamma and trigamma functions (Choi and Wette, 1969). Several authors (Choi and Wette, 1969, Bowman and Shenton, 1988, Johnson, Kotz, and Balakrishnan, 1994) have studied the estimation of shape and scale parameters of a gamma distribution.
The maximum likelihood estimation procedure to estimate the shape and scale parameters of a gamma distribution is described below. Let x₁, x₂, ..., x_n be a random sample (of COPC concentrations) of size n from a gamma distribution, G(k, θ), with unknown shape and scale parameters k and θ, respectively. The log likelihood function is given as follows:

$$ \log L(k, \theta) = -nk\log\theta - n\log\Gamma(k) + (k-1)\sum_{i=1}^{n}\log x_i - \frac{1}{\theta}\sum_{i=1}^{n} x_i. \tag{5} $$

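To find the MLEs of k and θ, which are k̂ and θ̂, respectively, we differentiate the log likelihood function as given in (5) with respect to k and θ, and set the derivatives to zero. This results in the following two equations:

$$ -n\log\hat\theta - n\,\frac{\Gamma'(\hat k)}{\Gamma(\hat k)} + \sum_{i=1}^{n}\log x_i = 0, \tag{6} $$

$$ -\frac{n\hat k}{\hat\theta} + \frac{1}{\hat\theta^{2}}\sum_{i=1}^{n} x_i = 0. \tag{7} $$

Solving equation (7) for θ̂ and substituting the result in equation (6), we get the following equation:

$$ \log\hat k - \frac{\Gamma'(\hat k)}{\Gamma(\hat k)} = \log\bar x - \frac{1}{n}\sum_{i=1}^{n}\log x_i. \tag{8} $$

There does not exist a closed form solution of equation (8). This equation needs to be solved numerically for k̂, which requires the computation of digamma and trigamma functions. This is quite easy to do using a personal computer. An estimate of k can be computed iteratively by using the Newton-Raphson (Faires and Burden, 1993) method, leading to the following iterative equation:

$$ \hat k_{l} = \hat k_{l-1} - \frac{\log\hat k_{l-1} - \Psi(\hat k_{l-1}) - M}{1/\hat k_{l-1} - \Psi'(\hat k_{l-1})}. \tag{9} $$

The iterative process stops when k̂ starts to converge. In practice, convergence is typically achieved in fewer than 10 iterations. In equation (9):

$$ M = \log\bar x - \frac{1}{n}\sum_{i=1}^{n}\log x_i, \qquad \Psi(k) = \frac{d}{dk}\log\Gamma(k), \qquad \Psi'(k) = \frac{d}{dk}\Psi(k), $$

where Ψ(k) is the digamma function and Ψ′(k) is the trigamma function. In order to obtain the MLEs of k and θ, one needs to compute the digamma and trigamma functions. Good approximate values for these two functions (Choi and Wette, 1969) can be obtained using the following approximations. For k ≥ 8, these functions are approximated by:

$$ \Psi(k) \approx \log k - \left\{1 + \left[1 - \left(\tfrac{1}{10} - \tfrac{1}{21k^{2}}\right)\Big/k^{2}\right]\Big/(6k)\right\}\Big/(2k), \tag{10} $$

and

$$ \Psi'(k) \approx \left\{1 + \left\{1 + \left[1 - \left(\tfrac{1}{5} - \tfrac{1}{7k^{2}}\right)\Big/k^{2}\right]\Big/(3k)\right\}\Big/(2k)\right\}\Big/k. \tag{11} $$

For k < 8, one can use the following recurrence relations to compute these functions:

$$ \Psi(k) = \Psi(k+1) - \frac{1}{k}, \tag{12} $$

$$ \Psi'(k) = \Psi'(k+1) + \frac{1}{k^{2}}. \tag{13} $$

The iterative process requires an initial estimate of k. A good starting value for k in this iterative process is given by k₀ = 1/(2M). Thom (1968) suggests the following approximation as an estimate of k:

$$ \hat k \approx \frac{1}{4M}\left(1 + \sqrt{1 + \frac{4M}{3}}\right). \tag{14} $$

Bowman and Shenton (1988) suggested using k̂ as given by equation (14) as a starting value of k for an iterative procedure, calculating k̂_l at the l-th iteration from the following formula:

$$ \hat k_{l} = \frac{\hat k_{l-1}\left\{\log\hat k_{l-1} - \Psi(\hat k_{l-1})\right\}}{M}. \tag{15} $$

Both equations (9) and (15) have been used to compute the MLE of k.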
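The Newton-Raphson procedure described above can be sketched in Python as follows. This is a minimal sketch rather than the authors' program: the function names are illustrative, the digamma and trigamma functions use the standard asymptotic series for arguments of 8 or more together with the recurrence relations for smaller arguments, and the stopping rule (relative change below `tol`) is an assumption, since the text only says that iteration stops when the estimate converges.

```python
import math


def digamma(k: float) -> float:
    """Psi(k): recurrence Psi(k) = Psi(k+1) - 1/k up to k >= 8, then asymptotic series."""
    shift = 0.0
    while k < 8.0:
        shift -= 1.0 / k
        k += 1.0
    k2 = k * k
    series = math.log(k) - (1 + (1 - (1 / 10 - 1 / (21 * k2)) / k2) / (6 * k)) / (2 * k)
    return series + shift


def trigamma(k: float) -> float:
    """Psi'(k): recurrence Psi'(k) = Psi'(k+1) + 1/k^2 up to k >= 8, then asymptotic series."""
    shift = 0.0
    while k < 8.0:
        shift += 1.0 / (k * k)
        k += 1.0
    k2 = k * k
    series = (1 + (1 + (1 - (1 / 5 - 1 / (7 * k2)) / k2) / (3 * k)) / (2 * k)) / k
    return series + shift


def gamma_mle(x, tol=1e-10, max_iter=100):
    """Return (k_hat, theta_hat) for positive data x by Newton-Raphson on log k - Psi(k) = M."""
    n = len(x)
    mean = sum(x) / n
    m_stat = math.log(mean) - sum(math.log(v) for v in x) / n  # the quantity M
    if m_stat <= 0.0:
        raise ValueError("data must be positive and not all equal")
    k = 1.0 / (2.0 * m_stat)  # starting value k0 = 1/(2M)
    for _ in range(max_iter):
        f = math.log(k) - digamma(k) - m_stat
        f_prime = 1.0 / k - trigamma(k)
        k_new = k - f / f_prime
        if k_new <= 0.0:  # safeguard (an assumption): keep the iterate positive
            k_new = k / 2.0
        if abs(k_new - k) < tol * k:
            k = k_new
            break
        k = k_new
    return k, mean / k  # theta_hat = sample mean / k_hat, from the scale equation


# Data set of Example 1 (Grice and Bain, 1980).
data = [152, 152, 115, 109, 137, 88, 94, 77, 160, 165,
        125, 40, 128, 123, 136, 101, 62, 153, 83, 69]
k_hat, theta_hat = gamma_mle(data)
print(f"k_hat = {k_hat:.3f}, theta_hat = {theta_hat:.3f}")
```

Starting from k₀ = 1/(2M) places the first iterate below the root, which keeps the Newton-Raphson steps well behaved for this equation; the positivity safeguard is only a defensive addition.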
It is observed that the estimate, k̂, based upon the Newton-Raphson method as given by equation (9) is in close agreement with that obtained using equation (15) with Thom's approximation as an initial estimate. Choi and Wette (1969) further concluded that the MLE of k, k̂, is biased high. A bias-corrected (Johnson, et al., 1994) estimate of k is given by the following equation:

$$ \hat k^{*} = \frac{(n-3)\,\hat k}{n} + \frac{2}{3n}. \tag{16} $$

In (16), k̂ is the MLE of k obtained using either (9) or (15). Substitution of equation (16) in

equation (7) yields an estimate of the scale parameter, θ, given as follows:

$$ \hat\theta^{*} = \bar x / \hat k^{*}. \tag{17} $$

Next we provide an example illustrating the computations of the MLEs of k and θ. Consider the data set of Example 1. The sample mean, x̄, is . The MLEs of the two parameters, k and θ, are obtained iteratively using the Newton-Raphson method (equation 9), and Bowman and Shenton's proposal as given by equation (15). The two sets of estimates are in agreement and are given by k̂ = 8.799, and θ̂ = . The corresponding bias-corrected estimates of k and θ, as given by equations (16) and (17), are k̂* = and θ̂* = . Note that the bias-corrected MLE of the shape parameter, k = , which is quite high; consequently, the skewness of this data set is mild and its MLE = 0.73 (from equation (4)). Goodness-of-fit tests performed on this data set fail to reject the hypothesis that the data are normal, and also fail to reject the hypothesis that they are lognormal.

2.3 Computation of UCL of the Mean of a Gamma, G(k, θ), Distribution

In the statistical literature, even though procedures exist to compute a UCL of the mean of a gamma distribution (Grice and Bain, 1980, Wong, 1993), those procedures have not become popular due to their computational complexity. Those approximate and adjusted procedures depend upon the Chi-square distribution and an estimate of the shape parameter, k. As seen above, computation of an MLE of k is quite involved, and this works as a deterrent to the use of a gamma distribution-based UCL of the mean. However, given the easy availability of personal computers, the computation of a gamma UCL should no longer be a problem.

Given a random sample, x₁, x₂, ..., x_n, of size n from a gamma, G(k, θ), distribution, it can be shown that 2n x̄/θ follows a Chi-square distribution, $\chi^{2}_{2nk}$, with 2nk degrees of freedom (df). It is noted that (2n x̄)/θ = 2(X₁ + X₂ + ... + X_n)/θ. Using a simple transformation of variables, it is seen that each of the random variables 2X_i/θ, i = 1, 2, ..., n, follows a chi-square, $\chi^{2}_{2k}$, distribution. Also, those chi-square random variables are independently distributed. Since the sum of independently distributed chi-square random variables also follows a chi-square distribution, it is concluded that (2n x̄)/θ follows a chi-square distribution with 2nk degrees of freedom.

When the shape parameter, k, is known, a uniformly most powerful test of size α of the null hypothesis, H₀: μ ≥ C_s, against the alternative hypothesis, H₁: μ < C_s, is to reject H₀ if $\bar x / C_s < \chi^{2}_{2nk}(\alpha)/(2nk)$. The corresponding (1 − α)100% uniformly most accurate UCL for the mean, μ, is then given by the probability statement:

$$ P\!\left(\frac{2nk\,\bar x}{\chi^{2}_{2nk}(\alpha)} \ge \mu\right) = 1 - \alpha, \tag{18} $$

where $\chi^{2}_{\nu}(\alpha)$ denotes the α cumulative percentage point of the Chi-square distribution with ν degrees of freedom. That is, if Y follows $\chi^{2}_{\nu}$, then $P(Y \le \chi^{2}_{\nu}(\alpha)) = \alpha$. In practice, k is not known and needs to be estimated from data. A reasonable procedure is to replace k by its bias-corrected estimate, k̂*, as given by equation (16). This results in the following approximate (1 − α)100% UCL of the mean:

$$ \text{Approximate UCL} = \frac{2n\hat k^{*}\,\bar x}{\chi^{2}_{2n\hat k^{*}}(\alpha)}. \tag{19} $$

It should be pointed out that the UCL given in (19) is an approximate UCL and there is no guarantee that the confidence level of (1 − α) will be achieved by this UCL. However, it does provide a way of computing a UCL of the mean of a gamma distribution. Simulation studies conducted in Section 5 suggest that an approximate gamma UCL thus obtained provides the specified coverage (95%) as the shape parameter, k, approaches 0.5. Thus when k ≥ 0.5, one can use the approximate UCL given by (19). It should be observed that this approximation is good even for smaller (e.g., n = 5) sample sizes.

Grice and Bain (1980) computed an adjusted probability level, β, which can be used in (19) to achieve the specified confidence level of (1 − α). For α = 0.05 (confidence coefficient of 0.95), α = 0.1, and α = 0.01, these adjusted probability levels are given below for some values of the sample size n (Table 1). One can use linear interpolation to obtain an adjusted β for values of n not covered in the table. The adjusted (1 − α)100% UCL of the gamma mean, μ = kθ, is given by:

$$ \text{Adjusted UCL} = \frac{2n\hat k^{*}\,\bar x}{\chi^{2}_{2n\hat k^{*}}(\beta)}, \tag{20} $$

where β is given in Table 1 for α = 0.05, 0.1, and 0.01. Note that as the sample size, n, becomes large, the adjusted probability level, β, approaches α. Except for the computation of the MLE of k, equations (19) and (20) provide simple Chi-square-distribution-based UCLs of the mean of a gamma distribution. It should also be noted that the UCLs as given by (19) and (20) only depend upon the estimate of the shape parameter, k, and are independent of the scale parameter, θ, and its estimate. Consequently, as expected, it is observed that coverage probabilities for the mean associated with these UCLs do not depend upon the values of the scale parameter, θ. This is further discussed in Section 5.

Table 1. Adjusted Critical Level, β, for Various Values of α and n

n    α = 0.05: probability level, β    α = 0.1: probability level, β    α = 0.01: probability level, β

It is observed (Figures 5-7) that except for highly skewed (k < 0.15) data and samples of small size (e.g., < 10), the adjusted gamma UCL given by (20) provides the specified 95% coverage of the population mean. It is also noted that for highly skewed (k < 0.15) data sets of small sizes, except for the H-UCL, the coverage probability provided by the adjusted gamma UCL is the highest and is close to the specified level, 0.95. However, for these highly skewed data sets, the H-statistic results in unacceptably large values of the UCL. This is further illustrated in examples 2-4. For values of k ≥ 0.2, the specified coverage of 0.95 is always approximately achieved by the adjusted gamma UCL given by equation (20), as shown in Figures 7-14.

Figure 5. Graphs of coverage probabilities by 95% UCLs of mean of G(k = 0.10, θ = 50).

Figure 6. Graphs of coverage probabilities by 95% UCLs of mean of G(k = 0.15, θ = 50).

Figure 7. Graphs of coverage probabilities by 95% UCLs of mean of G(k = 0.20, θ = 50).

Figure 8. Graphs of coverage probabilities by 95% UCLs of mean of G(k = 0.25, θ = 50).

Figure 9. Graphs of coverage probabilities by 95% UCLs of mean of G(k = 0.50, θ = 50).

Figure 10. Graphs of coverage probabilities by 95% UCLs of mean of G(k = 1.0, θ = 50).

Figure 11. Graphs of coverage probabilities by 95% UCLs of mean of G(k = 2.0, θ = 50).

Figure 12. Graphs of coverage probabilities by 95% UCLs of mean of G(k = 4.0, θ = 50).

Figure 13. Graphs of coverage probabilities by 95% UCLs of mean of G(k = 6.0, θ = 50).

Figure 14. Graphs of coverage probabilities by 95% UCLs of mean of G(k = 10.0, θ = 50).
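The chi-square-based gamma UCL of Section 2.3 can be sketched in Python as below. Several parts are assumptions made for illustration: the bias correction is taken in the form (n − 3)k̂/n + 2/(3n), the chi-square percentage point is obtained from the Wilson-Hilferty normal approximation (a substitute for exact chi-square tables that is accurate for moderate and large degrees of freedom but unreliable when 2nk̂* is very small), and the function names are hypothetical. Passing α as `prob` gives the approximate UCL of (19); passing an adjusted level β from Table 1 gives the adjusted UCL of (20).

```python
import math
from statistics import NormalDist


def chi2_quantile_wh(p: float, df: float) -> float:
    """Wilson-Hilferty approximation to the p-th cumulative point of chi-square(df)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3


def gamma_ucl(x, k_mle: float, prob: float = 0.05) -> float:
    """Gamma UCL of the mean: 2 n k* xbar / chi2_{2 n k*}(prob)."""
    n = len(x)
    mean = sum(x) / n
    k_star = (n - 3) * k_mle / n + 2.0 / (3.0 * n)  # bias-corrected shape (assumed form)
    df = 2.0 * n * k_star
    return df * mean / chi2_quantile_wh(prob, df)


# Data set of Example 1 with the MLE of k reported in the text (8.799).
data = [152, 152, 115, 109, 137, 88, 94, 77, 160, 165,
        125, 40, 128, 123, 136, 101, 62, 153, 83, 69]
ucl_95 = gamma_ucl(data, k_mle=8.799, prob=0.05)
print(f"approximate 95% gamma UCL: {ucl_95:.2f}")
```

A smaller probability level moves the chi-square percentage point further into the lower tail and therefore raises the UCL, which is how the adjusted level β (smaller than α for small n) restores the nominal coverage.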

Example 1 (Continued)

The data set of size 20 and the associated MLEs of parameters k and θ are given in Example 1. For n = 20 and α = 0.05, the adjusted probability level, β = (Table 1), and the adjusted df, 2n k̂* = ν* = . The approximate 95% UCL of the mean obtained using equation (19) is given by UCL = , and the adjusted 95% UCL of the mean obtained using equation (20) is given by UCL = . As noted above, this data set passes both normality as well as lognormality tests. The associated Student's t-statistic based and H-statistic based UCLs are and , respectively. For this mildly skewed data set, one can use any of these four UCLs.

3. Other UCL Computation Methods

Several authors (Johnson, 1978, Kleijnen, Kloppenburg, and Meeuwsen, 1986, Chen, 1995, Sutton, 1993) have developed inference procedures for estimating the means of asymmetrical distributions. Also, several bootstrap procedures (Efron, 1982, Hall, 1988 and 1992, Manly, 1997) have been recommended for the computation of confidence intervals for means of skewed distributions. These are summarized below and are also included in the simulation experiments described in Section 5. Some examples have been included to illustrate these procedures.

3.1 UCL Based Upon Student's t-statistic

A (1 − α)100% one-sided upper confidence limit for the mean based upon Student's t-statistic is given by the following equation:

$$ \text{UCL} = \bar x + t_{\alpha,\,n-1}\, s_x/\sqrt{n}, \tag{21} $$

where $t_{\alpha,\,n-1}$ is the upper α-th percentile of the Student's t distribution with n − 1 degrees of freedom, and the sample variance is given by:

$$ s_x^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar x)^{2}. $$

This UCL should be used when either the data follow a normal distribution, or when the data distribution is only mildly skewed and the sample size n is large. For highly skewed data sets, the UCL based upon this method fails to provide the specified (1 − α)100% coverage for the population mean, μ.

3.2 UCL Based Upon Modified Student's t-Statistic for Asymmetric Distributions

Johnson (1978) and Sutton (1993) proposed the use of a modified t-statistic for testing the mean of a positively skewed distribution. An adjusted (1 − α)100% UCL (Singh, Singh, and Engelhardt, 1999) of the mean, μ, based upon the modified t-statistic is given as follows:

$$ \text{UCL} = \bar x + \frac{\hat\mu_3}{6 s_x^{2} n} + t_{\alpha,\,n-1}\, s_x/\sqrt{n}, \tag{22} $$

where μ̂₃, an unbiased moment estimate (Kleijnen, Kloppenburg, and Meeuwsen, 1986) of the third central moment, μ₃, is given as follows:

$$ \hat\mu_3 = \frac{n\sum_{i=1}^{n}(x_i - \bar x)^{3}}{(n-1)(n-2)}. \tag{23} $$

The simulation study conducted in Section 5 suggests that the UCL based upon the modified t-statistic also fails to provide the specified coverage (95% here) for skewed data sets from gamma distributions.

3.3 UCL of the Mean Based Upon the Adjusted Central Limit Theorem for Skewed Distributions

Given a random sample, x₁, x₂, ..., x_n, of size n from a population with finite variance, σ², and mean, μ, the Central Limit Theorem (CLT) states that the asymptotic distribution (as n approaches infinity) of the sample mean, x̄_n, is normal with mean μ and variance σ²/n. An often cited rule of thumb for a minimum sample size satisfying the CLT is n ≥ 30. However, this is not adequate if the population is highly skewed (Singh, Singh, and Engelhardt, 1999). A refinement of the CLT approach which makes an adjustment for skewness is discussed by Chen (1995). Specifically, the "adjusted CLT" UCL is given by:

$$ \text{UCL} = \bar x + \left[z_{\alpha} + \frac{\hat k_3}{6\sqrt{n}}\left(1 + 2z_{\alpha}^{2}\right)\right] s_x/\sqrt{n}, \tag{24} $$

where $z_{\alpha}$ is the upper α-th percentile of the standard normal distribution and k̂₃, the coefficient of skewness, is given by $\hat k_3 = \hat\mu_3 / s_x^{3}$. The simulation study conducted in Section 5 suggests that even for larger samples, the adjustment made in the CLT-UCL method is not effective enough to provide the specified (95%) coverage for skewed data sets. As skewness decreases, the coverage provided by the adjusted CLT-UCL approaches 95% for larger sample sizes, as can be seen in Figures .

3.4 UCL of the Mean of a Lognormal Distribution Based Upon Land's Method

In practice, a skewed data set can be modeled by both lognormal and gamma distributions. However, due to computational ease, the lognormal distribution is typically used to model such skewed data sets. A (1 − α)100% UCL for the mean, μ, of a lognormal distribution based upon Land's H-statistic (1971) is given as follows:

$$ \text{UCL} = \exp\!\left(\bar y + 0.5\, s_y^{2} + \frac{s_y H_{1-\alpha}}{\sqrt{n-1}}\right), \tag{25} $$

where ȳ and s_y² are the sample mean and variance of the log-transformed data. Tables of values denoted by $H_{1-\alpha}$ can be found in Gilbert (1987). From the simulation experiments discussed in Section 5, it is observed that the H-statistic based UCL grossly overestimates the 95% UCL and consequently, the coverage provided by an H-UCL is always larger than the specified coverage of 95%. In Section 4, examples illustrating this unreasonable behavior of the H-statistic based UCL are included. The practical merit of an H-UCL is doubtful as it results in unacceptably high UCL values. This is especially true for samples of small size (e.g., < 25) with values of s_y exceeding . This is illustrated in examples .

3.5 UCL of the Mean Based Upon the Chebyshev Inequality

The Chebyshev inequality can be used to obtain a reasonably conservative but stable estimate of the UCL of the mean. The two-sided Chebyshev Theorem states that given a random variable X with finite mean and standard deviation, μ and σ, we have:

$$ P(-j\sigma \le X - \mu \le j\sigma) \ge 1 - \frac{1}{j^{2}}. \tag{26} $$

Here, j is a positive real number. This result can be applied with the sample mean, x̄, to obtain a conservative UCL for the population mean. Specifically, a (1 − α)100% UCL of the mean, μ, is given by: (27) Of course, this would require the user to know the value of σ.
The obvious modification would be to replace σ with the sample standard deviation, s_x, but this is estimated from data, and therefore the result is no longer guaranteed to be conservative. In general, if μ is an unknown mean, μ̂ is an estimate, and σ̂(μ̂) is an estimate of the standard error of μ̂, then a quantity of the form UCL = μ̂ + j σ̂(μ̂), with the multiplier j taken from the Chebyshev inequality at the 95% level, will provide a 95% UCL for μ, which should tend to be conservative, but this is not assured. In this article we use equation (27) to compute a 95% UCL of the mean based upon the Chebyshev inequality. From the Monte Carlo results discussed in Section 5, it is observed that for highly skewed data sets (with k < 0.5), the coverage provided by the Chebyshev UCL is smaller than the specified coverage of 0.95. This is especially true when the sample size is smaller than 20. As expected, for larger sample sizes, the coverage provided by the Chebyshev UCL is at least 95%. This means that for larger samples, the Chebyshev UCL will result in a higher (but stable) UCL of the gamma, G(k, θ), mean.

Bootstrap Procedures

General methods for deriving estimates, such as the method of maximum likelihood, often result in estimates that are biased. Bootstrap procedures as discussed by Efron (1982) are nonparametric statistical techniques which can be used to reduce the bias of point estimates and construct approximate confidence intervals for parameters such as the population mean. These procedures require no assumptions regarding the statistical distribution (e.g., normal, lognormal, gamma) of the underlying population, and can be applied to a variety of situations no matter how complicated. However, it should be pointed out that a parametric statistical method (depending upon distributional assumptions), when appropriate, is more efficient than its nonparametric counterpart. In practice, parametric assumptions are often difficult to justify, especially in environmental applications.
In these cases, nonparametric methods provide valuable tools for obtaining reliable estimates of the parameters of interest. Use of these methods has been considered in environmental applications (Singh, Singh, and Engelhardt, 1997, 1999; Schulz and Griffin, 1999). Some of those methods are described as follows.

Let $x_1, x_2, \ldots, x_n$ be a random sample of size $n$ from a population with an unknown parameter $\theta$ (e.g., $\theta = \mu$), and let $\hat{\theta}$ be an estimate of $\theta$ which is a function of all $n$ observations. For example, the parameter $\theta$ could be the mean, and a reasonable choice for the estimate $\hat{\theta}$ might be the sample mean $\bar{x}$. In the bootstrap procedures, repeated samples of size $n$ are drawn with replacement from the given set of observations. The process is repeated a large number of times (e.g., 1000), and each time an estimate, $\hat{\theta}_i$, of $\theta$ (the mean, here) is computed. The estimates thus obtained are used to compute an estimate of the standard error of $\hat{\theta}$. The statistical literature contains an extensive array of bootstrap methods for constructing confidence intervals. In this article three of those methods are considered: 1) the standard bootstrap method, 2) the bootstrap-t method (Efron, 1982; Hall, 1988), and 3) Hall's bootstrap method (Hall, 1992; Manly, 1997).

3.6 UCL of the Mean Based Upon the Standard Bootstrap Method

Step 1. Let $(x_{i1}, x_{i2}, \ldots, x_{in})$ represent the $i$th sample of size $n$ drawn with replacement from the original data set $(x_1, x_2, \ldots, x_n)$. Compute the sample mean $\bar{x}_i$ of the $i$th sample.

Step 2. Repeat Step 1 independently $N$ times, each time calculating a new estimate. Denote these estimates by $\bar{x}_1, \bar{x}_2, \bar{x}_3, \ldots, \bar{x}_N$. The bootstrap estimate of the population mean is the arithmetic mean, $\bar{x}_B$, of the $N$ estimates $\bar{x}_i$. The bootstrap estimate of the standard error is given by:

$$\hat{\sigma}_B = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(\bar{x}_i - \bar{x}_B\right)^2} \qquad (28)$$

The general bootstrap estimate, denoted by $\bar{\theta}_B$, is the arithmetic mean of the $N$ estimates. The difference, $\bar{\theta}_B - \hat{\theta}$, provides an estimate of the bias of the estimate, $\hat{\theta}$.
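Steps 1 and 2, together with the bootstrap standard error and bias estimates, can be sketched in Python as follows. This is a minimal illustration rather than the ProUCL implementation: the function names, the fixed seed, and the sample values are hypothetical, and the divisor N - 1 in the standard error is an assumption about the form of equation (28).

```python
import random
import statistics


def bootstrap_means(data, n_boot=1000, seed=0):
    """Steps 1-2: means of n_boot resamples of size n drawn with replacement."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_boot)]


def bootstrap_summary(data, n_boot=1000, seed=0):
    """Return (bootstrap mean, bootstrap standard error, bias estimate)."""
    means = bootstrap_means(data, n_boot, seed)
    x_bar_b = statistics.fmean(means)        # arithmetic mean of the N estimates
    se_b = statistics.stdev(means)           # assumed form of (28), divisor N - 1
    bias = x_bar_b - statistics.fmean(data)  # bootstrap mean minus original estimate
    return x_bar_b, se_b, bias


# Hypothetical skewed sample (not data from the examples in this paper).
sample = [0.4, 1.1, 2.5, 3.0, 7.8, 12.6, 25.1, 40.3, 88.9, 151.2]
x_bar_b, se_b, bias = bootstrap_summary(sample, n_boot=2000, seed=1)
print(f"bootstrap mean = {x_bar_b:.2f}, SE = {se_b:.2f}, bias = {bias:.2f}")
```

For the sample mean the bias estimate is expected to be close to zero, since the sample mean is an unbiased estimate of the population mean; the standard error is the quantity carried forward into the bootstrap UCLs.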
The standard bootstrap confidence interval is derived from the following pivotal quantity, $t$:

$$t = \frac{\hat{\theta} - \theta}{\hat{\sigma}_B} \qquad (29)$$

where $\hat{\sigma}_B$ is the bootstrap estimate of the standard error from equation (28). A $(1-\alpha)100\%$ standard bootstrap UCL for $\theta$, which assumes that the quantity in equation (29) is approximately normal, is given as follows:

$$\mathrm{UCL} = \bar{x} + z_{\alpha}\,\hat{\sigma}_B \qquad (30)$$

where $z_{\alpha}$ denotes the upper $\alpha$th quantile of the standard normal distribution. It is observed that the standard bootstrap method does not adequately adjust for skewness, and the UCL given by equation (30) fails to provide the specified $(1-\alpha)100\%$ coverage of the population mean of skewed data distributions.

3.7 UCL of the Mean Based Upon the Bootstrap-t Method

Another variation of the bootstrap method, called the "bootstrap-t" by Efron (1982), is a nonparametric procedure which uses the bootstrap methodology to estimate quantiles of the t-statistic, given by (29), directly from data (Hall, 1988). In practice, for non-normal populations, the required t-quantiles may not be easily obtained, or may be impossible to derive exactly. In this method, as in Steps 1 and 2 described above, $\bar{x}$ is the sample mean computed from the original data, and $\bar{x}_i$ and $s_{x,i}$ are the sample mean and sample standard deviation computed from the $i$th resampling of the original data. The $N$ quantities $t_i = \sqrt{n}\,(\bar{x}_i - \bar{x})/s_{x,i}$ are computed and sorted, yielding ordered quantities $t_{(1)} \le t_{(2)} \le \cdots \le t_{(N)}$. The estimate of the lower $\alpha$th quantile of the pivotal quantity (29) is $t_{\alpha,B} = t_{(\alpha N)}$. For example, if N = 1000 bootstrap samples are generated, then the 50th ordered value, $t_{(50)}$, would be the bootstrap estimate of the lower 0.05th quantile of the t-statistic as given by (29). Then a $(1-\alpha)100\%$ UCL of the mean based upon the bootstrap-t method is given as follows:

$$\mathrm{UCL} = \bar{x} - t_{(\alpha N)}\,\frac{s_x}{\sqrt{n}} \qquad (31)$$

3.8 UCL of the Mean Based Upon Hall's Bootstrap Method

Hall (1992) proposed a bootstrap method which adjusts for bias as well as skewness. In this method, $\bar{x}_i$, $s_{x,i}$, and $\hat{k}_{3i}$, that is, the sample mean, sample standard deviation, and sample skewness, respectively (as given in Section 3.3 above), are computed from the $i$th resampling ($i = 1, 2, \ldots, N$) of the original data. Let $\bar{x}$ be the sample mean, $s_x$ the sample standard deviation, and $\hat{k}_3$ the sample skewness computed from the original data. The quantities $W_i$ and $Q_i$ given as follows are computed for each of the $N$ bootstrap samples, where:

$$W_i = \frac{\bar{x}_i - \bar{x}}{s_{x,i}}, \qquad Q_i(W_i) = W_i + \frac{\hat{k}_{3i} W_i^2}{3} + \frac{\hat{k}_{3i}^2 W_i^3}{27} + \frac{\hat{k}_{3i}}{6n}$$

The quantities $Q_i(W_i)$ given above are arranged in ascending order. For a specified $(1-\alpha)$ confidence coefficient, compute the $(\alpha N)$th ordered value, $q_{\alpha}$, of the quantities $Q_i(W_i)$. Then compute $W(q_{\alpha})$ using the inverse function, which is given as follows:

$$W(q_{\alpha}) = \frac{3}{\hat{k}_3}\left[\left(1 + \hat{k}_3\left(q_{\alpha} - \frac{\hat{k}_3}{6n}\right)\right)^{1/3} - 1\right] \qquad (32)$$

Finally, the $(1-\alpha)100\%$ UCL of the population mean based upon Hall's bootstrap method (Manly, 1997) is given as follows:

$$\mathrm{UCL} = \bar{x} - W(q_{\alpha})\,s_x \qquad (33)$$

It is observed (Section 5) that the coverage probabilities provided by the bootstrap-t and Hall's bootstrap methods are in close agreement. For larger samples, these two methods approximately provide the specified 95% coverage to the population mean, $k\theta$. For smaller sample sizes, the coverage provided by these methods is only slightly lower than the specified level of 0.95. It is also noted that, for highly skewed data sets (with $k \le 0.25$; Figures 5-8) of small size (e.g., n < 10), the coverage probability provided by these two methods is higher than that of the Chebyshev UCL.

4. Examples

Several examples illustrating the computation of the various 95% UCLs of the population mean are included in this section. The software ProUCL (EPA, 2002) has been used to compute some of the UCL values. Gamma UCLs are computed using the program Chi_test (2002). Examples are generated from the gamma distribution and the lognormal distribution, and UCLs are computed using all of the methods discussed in this paper. It is observed that for small data sets, it is not easy to distinguish between a gamma model and a lognormal model. It is further noted that use of a gamma distribution results in practical and reliable UCLs of the population mean.
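The bootstrap-t and Hall's bootstrap procedures of Sections 3.7 and 3.8 can be sketched in Python as follows. This is a hedged illustration, not the ProUCL code. It assumes the bootstrap-t UCL has the form x_bar - t_(alpha N) * s_x / sqrt(n), that Hall's transformation is Q(W) = W + k3 W^2/3 + k3^2 W^3/27 + k3/(6n), and that the sample skewness is the common bias-adjusted estimator; all function names and the sample values are hypothetical.

```python
import math
import random
import statistics


def sample_skewness(x):
    """Bias-adjusted sample skewness (assumed form of the Section 3.3 estimator)."""
    n, m, s = len(x), statistics.fmean(x), statistics.stdev(x)
    return n * sum((v - m) ** 3 for v in x) / ((n - 1) * (n - 2) * s ** 3)


def resamples(data, n_boot, seed):
    """Yield resamples of size n drawn with replacement, skipping constant ones."""
    rng = random.Random(seed)
    for _ in range(n_boot):
        res = rng.choices(data, k=len(data))
        if max(res) > min(res):  # s_{x,i} must be positive
            yield res


def lower_order_stat(values, alpha):
    """Return the (alpha N)th ordered value, using 1-based ranks."""
    ordered = sorted(values)
    return ordered[max(1, int(alpha * len(ordered))) - 1]


def hall_q(w, k3, n):
    """Hall's transformation Q(W) (assumed form)."""
    return w + k3 * w ** 2 / 3 + k3 ** 2 * w ** 3 / 27 + k3 / (6 * n)


def hall_w(q, k3, n):
    """Inverse of hall_q; the real cube root is taken for negative arguments."""
    if k3 == 0:
        return q  # with zero skewness the transformation is the identity
    inner = 1 + k3 * (q - k3 / (6 * n))
    return 3 * (math.copysign(abs(inner) ** (1 / 3), inner) - 1) / k3


def bootstrap_t_ucl(data, alpha=0.05, n_boot=1000, seed=0):
    n, x_bar, s_x = len(data), statistics.fmean(data), statistics.stdev(data)
    t_vals = [math.sqrt(n) * (statistics.fmean(r) - x_bar) / statistics.stdev(r)
              for r in resamples(data, n_boot, seed)]
    return x_bar - lower_order_stat(t_vals, alpha) * s_x / math.sqrt(n)


def hall_ucl(data, alpha=0.05, n_boot=1000, seed=0):
    n, x_bar, s_x = len(data), statistics.fmean(data), statistics.stdev(data)
    q_vals = [hall_q((statistics.fmean(r) - x_bar) / statistics.stdev(r),
                     sample_skewness(r), n)
              for r in resamples(data, n_boot, seed)]
    q_alpha = lower_order_stat(q_vals, alpha)
    return x_bar - hall_w(q_alpha, sample_skewness(data), n) * s_x


# Hypothetical skewed sample (not data from the examples in this paper).
sample = [0.4, 1.1, 2.5, 3.0, 7.8, 12.6, 25.1, 40.3, 88.9, 151.2]
print(f"bootstrap-t UCL = {bootstrap_t_ucl(sample, seed=1):.1f}")
print(f"Hall's UCL      = {hall_ucl(sample, seed=1):.1f}")
```

Because the lower quantile of the resampled statistic is negative for right-skewed data, both UCLs lie above the sample mean. Skipping constant resamples is a practical guard against division by zero and is a choice of this sketch.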
Simulation results discussed in Section 5 suggest that the adjusted gamma UCL approximately provides the specified 95% coverage to the population mean for data sets whose shape parameter, k, exceeds a threshold value.

Simulated Examples from Gamma Distribution

Example 2

A data set of size 15 is generated from a gamma, G(0.2, 100), distribution with the true population mean = 20 and skewness = 4.472. The data are: , , , , , , , , , , , , , , 

The data set consists of very small values as well as some large values. These types of data sets often occur in environmental applications. The sample mean is . Using the Shapiro-Wilk test, it is concluded that the data also follow a lognormal model. The standard deviation (sd) of the log-transformed data is quite large, 5.618; therefore, the H-statistic based UCL of the mean becomes impractically large. The bias-corrected MLEs of $k$ and $\theta$ are and , respectively. The adjusted (using the bias-corrected estimate of $k$) df, $\hat{\nu}^*$ = . For $\alpha$ = 0.05 and n = 15, the critical probability level, $\beta$, to be used is (from Table 1). The UCLs obtained using the various methods are summarized in the following table.

UCL Computation Method                                   95% UCL of Mean
Approximate gamma UCL (equation (19))
Adjusted gamma UCL (equation (20))
UCL based upon t-statistic (equation (21))
UCL based upon modified t-statistic (equation (22))
UCL based upon adjusted CLT (equation (24))
UCL based upon H-statistic (equation (25))               5.4E+13
UCL based upon Chebyshev (equation (27))
UCL based upon standard bootstrap (equation (30))
UCL based upon bootstrap-t (equation (31))
Hall's bootstrap UCL (equation (33))

Note that the H-UCL becomes unacceptably large. Since the H-UCL exceeds the maximum observed value of , using the recommendation made in the EPA (1992) RAGS document, one would use that maximum value as an estimate of the EPC term. Simulation results summarized in the next section (Figures 6-7) suggest that for n = 15 and an estimate of k = 0.165, the adjusted UCL based upon a gamma model provides the specified 95% coverage to the population mean. Therefore, for this data set, the use of the adjusted gamma UCL of (equation (20)) is an appropriate choice for an estimate of the EPC term. The maximum observed value represents an overestimate of the EPC term.

Example 3

A data set of size 15 is generated from a gamma distribution with k = 0.5 and $\theta$ = 100, with mean $\mu = k\theta$ = 50 and skewness = 2.828. The data are: , , 0.33, 1.42, 13.17, , , 158.0, 70.65, 25.05, , 63.65, 62.50, 11.58, 

Using the Shapiro-Wilk test, the hypothesis that the data also follow a lognormal distribution cannot be rejected. The sample mean = . The bias-corrected estimates of $k$ and $\theta$ are , and , respectively. The adjusted df, $\hat{\nu}^* = 2n\hat{k}^*$, for the Chi-square distribution = . As before, for $\alpha$ = 0.05 and n = 15, the critical probability level, $\beta$ = . The 95% UCLs of the mean obtained using the various methods described above are given below.
UCL Computation Method                                   95% UCL of Mean
Approximate gamma UCL (equation (19))
Adjusted gamma UCL (equation (20))
UCL based upon t-statistic (equation (21))
UCL based upon modified t-statistic (equation (22))
UCL based upon adjusted CLT (equation (24))
UCL based upon H-statistic (equation (25))
UCL based upon Chebyshev (equation (27))
UCL based upon standard bootstrap (equation (30))
UCL based upon bootstrap-t (equation (31))
Hall's bootstrap UCL (equation (33))

Again note that the H-UCL is , which is much higher than the UCLs obtained using any of the other methods. Simulation results suggest that, for n = 15 and an MLE of k equal to , both the approximate and the adjusted UCLs based upon a gamma model provide the specified 95% coverage to the population mean (Figure 9). Also, note that the Chebyshev UCL is very close to the adjusted gamma UCL. Any of these three methods may be used to compute the UCL of the population mean.
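The population means and skewness values quoted for the gamma examples follow from the standard moment formulas of a gamma, $G(k, \theta)$, distribution. As a quick check for Examples 2 and 3 (a worked illustration using only the parameter values stated in the text):

```latex
\begin{align*}
\mu &= k\theta, & \text{skewness} &= \frac{2}{\sqrt{k}} \\
G(0.2,\,100):\quad \mu &= 0.2 \times 100 = 20, & \text{skewness} &= \frac{2}{\sqrt{0.2}} \approx 4.472 \\
G(0.5,\,100):\quad \mu &= 0.5 \times 100 = 50, & \text{skewness} &= \frac{2}{\sqrt{0.5}} \approx 2.828
\end{align*}
```

Smaller values of the shape parameter $k$ therefore correspond to more highly skewed data sets, which is why the examples are ordered by increasing $k$.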

Example 4

A random sample of size n = 10 is generated from a gamma, G(1, 100), distribution with mean 100 and skewness = 2. The data are: , , , , , , , , , 

At the 0.05 level of significance, the hypothesis that these data follow a lognormal distribution cannot be rejected. The data also pass the Shapiro-Wilk test for normality. The sample mean is . The bias-corrected MLEs of $k$ and $\theta$ are and , respectively, and the associated df = . For n = 10 and $\alpha$ = 0.05, the critical probability level, $\beta$, to be used (to achieve a confidence coefficient of 0.95) is = . The UCLs obtained using the various methods are given as follows.

UCL Computation Method                                   95% UCL of Mean
Approximate gamma UCL (equation (19))
Adjusted gamma UCL (equation (20))
UCL based upon t-statistic (equation (21))
UCL based upon modified t-statistic (equation (22))
UCL based upon adjusted CLT (equation (24))
UCL based upon H-statistic (equation (25))
UCL based upon Chebyshev (equation (27))
UCL based upon standard bootstrap (equation (30))
UCL based upon bootstrap-t (equation (31))
Hall's bootstrap UCL (equation (33))

Once again, note that the H-UCL is , which is much higher than the UCLs obtained using any of the other methods. Simulation results summarized in the next section suggest that, for n = 10 and an estimate of k equal to (Figures 9-10), both the approximate and the adjusted UCLs based upon the gamma model provide at least the specified 95% coverage to the population mean. The 95% Chebyshev UCL also provides the specified coverage to the population mean. For this combination of skewness and sample size, any of these three methods may be used to compute a 95% UCL of the population mean.

Example 5

A mildly skewed data set of size 10 was generated from a gamma distribution, G(4, 100), with mean 400 and skewness = 1. The data are: , , , , , , , , , 

The sample mean = . Based upon the Shapiro-Wilk test, at the 0.05 level of significance, the data reject neither the hypothesis of normality nor that of lognormality.
The sd of the log-transformed data is . The bias-corrected MLEs of $k$ and $\theta$ are and , respectively. The associated df = . For n = 10 and $\alpha$ = 0.05, the critical probability level, $\beta$ (to achieve a confidence coefficient of 0.95), to be used is = . The UCLs obtained using the various methods are given below. For this data set, the difference between the H-UCL and the other UCLs is small. Simulation results suggest that as the sample size increases, these differences in the UCLs will decrease. From these results (Figures 11-12), it is noted that for a sample of size 10 and an estimate of k = 2.55, both the approximate gamma UCL and the adjusted gamma UCL provide at least the specified 95% coverage to the population mean. Either of the two methods can be used to compute a 95% UCL of the mean. The Chebyshev inequality results in an overestimate of the UCL.

UCL Computation Method                                   95% UCL of Mean
Approximate gamma UCL (equation (19))
Adjusted gamma UCL (equation (20))
UCL based upon t-statistic (equation (21))
UCL based upon modified t-statistic (equation (22))
UCL based upon adjusted CLT (equation (24))
UCL based upon H-statistic (equation (25))
UCL based upon Chebyshev (equation (27))
UCL based upon standard bootstrap (equation (30))
UCL based upon bootstrap-t (equation (31))
Hall's bootstrap UCL (equation (33))

Simulated Examples from Lognormal Distributions

Next we consider a couple of small data sets generated from lognormal distributions. It is observed that those data sets also follow gamma models.

Example 6

A sample of size n = 15 is generated from the lognormal distribution with parameters $\mu$ = 5, $\sigma$ = 2; the true mean of this distribution is 1096.6, the coefficient of variation (CV) is 7.32, and the skewness is 414.36. The generated data are: 47.42, , , , 14.73, 7.67, 73.36, , , , 14.8, 37.32, 24.74, , 

A goodness-of-fit test showed the data distribution to be non-normal (P < 0.01) and also showed that the data pass the test of lognormality (P > 0.15). The software packages ExpertFit (2001) and GamGood (2002) were used to test the goodness-of-fit of the gamma distribution. The observed value of the Anderson-Darling test statistic is 1.094, and the approximate critical value for test size 0.05 is 2.492; hence, an approximate gamma distribution can also be used to model the probability distribution of this data set. The Chi-square goodness-of-fit test with four equal intervals led to the same conclusion. The bias-adjusted estimates of the shape, $k$, and scale, $\theta$, of the gamma distribution are and , respectively. The 95% UCLs computed from the various methods are given below. Notice that the H-UCL is more than 5 times higher than the maximum concentration in the sample, and more than 10 times higher than all the other UCLs. All UCLs are larger than the true population mean (1096.6) for this data set.
From Figures 8 and 9, it is observed that for an estimate of k = 0.321 and n = 15, the adjusted gamma UCL = provides the specified 95% coverage to the population mean.

Method                       95% UCL
Student's t
Adjusted CLT
Modified t
CLT
Standard Bootstrap
Bootstrap t
Hall's Bootstrap
Chebyshev (Mean, Std)
H-UCL
Adjusted Gamma UCL

Continuing with this example, suppose that another sample is collected and it turns out to be below the detection limit (DL) of the instrument. Suppose further that DL = 10 and that, following EPA guidance documents, this value is replaced by DL/2 = 5. One would expect that this additional non-detect observation would result in a reduction of the UCL. The UCLs calculated from this sample of n = 16 observations are given below:

Method                       95% UCL
Student's t                  1884
Adjusted CLT
Modified t
CLT
Standard Bootstrap
Bootstrap t
Chebyshev
H-UCL
Gamma UCL
