Introduction to Alternative Statistical Methods, or Stuff They Didn't Teach You in STAT 101
Classical Statistics For the most part, classical statistics assumes normality, i.e., if all experimental units of interest were measured and those measurements were plotted, then the distribution would look bell-shaped.
Normal Distribution [figure: the bell-shaped curve of the normal distribution]
Normal Distribution The standard normal is the iconic normal distribution, but there are as many normal distributions as there are combinations of finite mean and finite, positive variance.
Normal Distributions [figure: several normal curves with different means and variances]
Central Limit Theorem All this is well and good, you say, but there is no reason, in general, to assume normality. How do we proceed? The famous French mathematician Laplace (1749–1827) discovered in the early 19th century that the sample mean, X̄, is approximately normally distributed for sufficiently large sample sizes, provided the sample is random and the underlying population variance is finite.
Central Limit Theorem How large is sufficiently large? For light-tailed distributions, a sample size of n ≥ 30 or so is usually sufficient. What do I mean by light-tailed? The rate at which the tail or tails of the distribution approach zero must be at least as fast as e^(−x), in math-speak. (The tails of the normal distribution approach zero at the rate of e^(−x²), which is faster still.)
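To see the theorem in action, here is a minimal simulation sketch (plain Python, standard library only; the exponential distribution and the sample size of 30 are my choices for illustration, not from the talk). It draws many samples from a skewed but light-tailed distribution and checks that the sample means cluster around the true mean the way the CLT predicts.

```python
import random
import statistics

# CLT sketch: draw many samples of size n = 30 from a skewed,
# light-tailed distribution (exponential with mean 1) and look at
# the distribution of the sample means.
random.seed(1)
n, reps = 30, 5000
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

# The sample means center on the population mean (1.0) with standard
# deviation close to sigma / sqrt(n) = 1 / sqrt(30) ≈ 0.183.
print(statistics.fmean(means))   # close to 1.0
print(statistics.stdev(means))   # close to 0.183
```

A histogram of `means` would already look convincingly bell-shaped, even though the underlying exponential distribution is strongly skewed.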
Limitations of the CLT So where can things come off of the rails? It seems like the Central Limit Theorem takes care of just about everything.
What Manner of Distribution is This? [figure: a symmetric, bell-shaped density curve]
The Cauchy Distribution This is the Cauchy distribution. It sort of looks normal, and some people might mistake it for a normal distribution (present company excepted, of course!) but observe the tails.
Cauchy vs. Normal [figure: Cauchy and normal densities overlaid]
Cauchy vs. Normal The Cauchy distribution has heavy tails. What is meant by heavy tails here? The tails of the Cauchy distribution approach zero at the rate of 1/x². How does that compare with the normal distribution? Example: for x = 10, 1/x² = 1/100 = 0.01. By way of contrast, e^(−x²) = e^(−100) ≈ 3.72 × 10⁻⁴⁴ (i.e., a decimal with 43 leading zeroes). Quite a difference!
A Pathological Distribution In fact, the tails of the Cauchy are so heavy that the mean and the variance do not exist. A single observation is just as good an estimator of the center of the Cauchy distribution as the average of an entire random sample! (The sample mean of n Cauchy observations has exactly the same Cauchy distribution as one observation.) The Cauchy distribution represents an extreme situation and is good for testing where our methods break down. Will you encounter the Cauchy distribution in practice? Probably not, but it is a distinct possibility that you will encounter the next problematic example.
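A quick simulation makes the pathology concrete. This sketch (my illustration, not from the talk) generates standard Cauchy variates by the inverse-CDF trick tan(π(U − 1/2)) and compares the spread of Cauchy sample means with that of normal sample means: averaging 1,000 Cauchy observations buys you essentially nothing.

```python
import math
import random
import statistics

random.seed(2)

def cauchy():
    # Standard Cauchy via the inverse CDF applied to a uniform variate.
    return math.tan(math.pi * (random.random() - 0.5))

n, reps = 1000, 200
cauchy_means = [statistics.fmean(cauchy() for _ in range(n)) for _ in range(reps)]
normal_means = [statistics.fmean(random.gauss(0, 1) for _ in range(n)) for _ in range(reps)]

def iqr(xs):
    # Interquartile range: Q3 - Q1.
    q = statistics.quantiles(xs, n=4)
    return q[2] - q[0]

# The mean of n Cauchy observations is itself standard Cauchy, so its
# spread does not shrink as n grows; normal means tighten like 1/sqrt(n).
print(iqr(cauchy_means))  # roughly 2, same as a single Cauchy observation
print(iqr(normal_means))  # roughly 1.35 / sqrt(1000) ≈ 0.04
```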
Mixed Normal A mixed normal distribution occurs when a population of interest actually contains two subpopulations that are normally distributed but with different means and variances within each subpopulation. For whatever reason, these subpopulations cannot be easily isolated, and the resulting distribution is not normal, even though it consists of two normal subpopulations.
Mixed Normal A specific example (from Rand Wilcox's Applying Contemporary Statistical Techniques): Assume a population consists of both dieters and non-dieters in a ratio of 1:9, i.e., 10% have dieted and 90% have not. Let X represent the amount of weight loss observed for an individual during the previous year. Further assume that X is distributed N(0, 100) for dieters and X is distributed N(0, 1) for non-dieters.
Mixed Normal The resulting distribution can be represented as (0.9)N(0, 1) + (0.1)N(0, 100), which has mean = (0.9)(0) + (0.1)(0) = 0 and, because both components have mean 0, variance = (0.9)(1) + (0.1)(100) = 0.9 + 10 = 10.9. Thus, even though non-dieters represent 90% of the population and their variance is 1, we observe much greater variability in the resulting mixed model. This presents a problem for inferences about the population mean, for example. How do we proceed in such problem cases?
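The arithmetic above is easy to verify by simulation. This sketch (my illustration; sample size chosen arbitrarily) draws from the 0.9/0.1 mixture and confirms that the sample variance lands near 10.9 even though 90% of the draws come from a component with variance 1.

```python
import random
import statistics

# Mixture variance, valid here because both components have mean 0:
theoretical = 0.9 * 1 + 0.1 * 100
print(round(theoretical, 1))  # 10.9

# Simulate the contaminated normal: 90% N(0,1), 10% N(0,100) (sd = 10).
random.seed(3)
sample = [random.gauss(0, 1) if random.random() < 0.9 else random.gauss(0, 10)
          for _ in range(200_000)]
print(statistics.variance(sample))  # close to 10.9
```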
Trimming and Winsorizing Data A trimmed mean is obtained by deleting a certain percentage of the smallest and largest values and then calculating the mean based on the remaining values. For example, a 10% trimmed mean is obtained by deleting 10% of the highest values and 10% of the lowest values, leaving you with 80% of the original values upon which a mean is then calculated.
A concrete example: Trimmed Mean {3.54, 6.61, 2.88, 2.20, 8.04, 5.31, 6.51, 6.37, 3.86, 7.82, 0.967, 1.12, 7.00, 4.87, 8.39, 4.15, 3.11, 7.48, 16.62, 2.77} The mean of this sample is 5.48. 10% trimming removes 0.967, 1.12, 8.39, and 16.62. The 10% trimmed mean is based on the 20 − 2 − 2 = 16 remaining values and, in this example, is 5.16.
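The calculation above can be reproduced in a few lines of Python (a sketch using only the standard library; the helper function name is mine):

```python
import statistics

data = [3.54, 6.61, 2.88, 2.20, 8.04, 5.31, 6.51, 6.37, 3.86, 7.82,
        0.967, 1.12, 7.00, 4.87, 8.39, 4.15, 3.11, 7.48, 16.62, 2.77]

def trimmed_mean(xs, proportion=0.10):
    """Drop the lowest and highest `proportion` of values, then average the rest."""
    k = int(len(xs) * proportion)          # number of values cut from each end
    trimmed = sorted(xs)[k:len(xs) - k]
    return statistics.fmean(trimmed)

print(round(statistics.fmean(data), 2))   # 5.48
print(round(trimmed_mean(data), 2))       # 5.16
```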
Trimmed Mean Isn't throwing out data a bad idea? Not necessarily. In the 1960s and 1970s, Lehmann and Bickel (two famous statisticians) showed that the 10% trimmed mean is nearly as good as X̄ for approximately normal data and a much safer bet than X̄ for heavy-tailed data. (A. DasGupta, Asymptotic Theory of Statistics and Probability, p. 271)
Trimmed Mean Incidentally, you have encountered trimmed means before, even if you have not recognized them as such. The median of a distribution is, in effect, a 50% trimmed mean: trim (just under) 50% of the lowest data values and 50% of the highest, and you are left with the middle of the sample as an estimate of the location parameter, or center, of the distribution.
Trimmed Mean Moreover, removing transparently erroneous data or data collected on an unsuitable experimental unit is not trimming. Trimming occurs when you remove data values that are legitimate (or cannot be identified as illegitimate) but small or large with respect to the sample as a whole.
Winsorizing Data Winsorizing is distinct from trimming in that a certain proportion of the lowest and highest values are not discarded but are instead replaced by the smallest and largest of the remaining values, so that the sample size stays the same.
Winsorizing Data Returning to my previous example: {3.54, 6.61, 2.88, 2.20, 8.04, 5.31, 6.51, 6.37, 3.86, 7.82, 0.967, 1.12, 7.00, 4.87, 8.39, 4.15, 3.11, 7.48, 16.62, 2.77} To winsorize this data at 10%, first set aside the 2 lowest and 2 highest values (2 = 10% of 20); the smallest and largest remaining values are 2.20 and 8.04, respectively. Replacing the two lowest values by 2.20 and the two highest by 8.04, the winsorized data set becomes:
Winsorizing Data {3.54, 6.61, 2.88, 2.20, 8.04, 5.31, 6.51, 6.37, 3.86, 7.82, 2.20, 2.20, 7.00, 4.87, 8.04, 4.15, 3.11, 7.48, 8.04, 2.77} The 10% winsorized mean is the average of these 20 values, repeats included, which is 5.15, as compared to the sample mean of 5.48. The winsorized standard deviation is just the standard deviation of the winsorized data; in this example, it is 2.21 (as compared to 3.5 for the original sample).
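Winsorizing is equally easy to sketch in code (standard library only; the clamping helper is my own construction). It clamps each extreme value to the nearest retained value rather than deleting it.

```python
import statistics

data = [3.54, 6.61, 2.88, 2.20, 8.04, 5.31, 6.51, 6.37, 3.86, 7.82,
        0.967, 1.12, 7.00, 4.87, 8.39, 4.15, 3.11, 7.48, 16.62, 2.77]

def winsorize(xs, proportion=0.10):
    """Clamp the lowest/highest `proportion` of values to the nearest kept value."""
    k = int(len(xs) * proportion)
    s = sorted(xs)
    lo, hi = s[k], s[-k - 1]               # 2.20 and 8.04 for this data
    return [min(max(x, lo), hi) for x in xs]

w = winsorize(data)
print(round(statistics.fmean(w), 2))   # 5.15
print(statistics.stdev(w))             # about 2.21, versus 3.5 for the raw data
```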
(1-α)% CI for Trimmed Mean You cannot use the standard method of constructing confidence intervals about means for trimmed means! It would be unsound for a number of reasons, including that the remaining values in the trimmed data set are no longer independent or identically distributed.
(1-α)% CI for Trimmed Mean How can the leftover data points be dependent when they were independent just a few minutes ago, before trimming? I did not trim at random: I ordered the data first, then trimmed the lowest and highest 10%. For observations to be independent, knowing one must tell you nothing about another. When you order the data and observe that the second-largest value is 8, say, then you know the largest value cannot be 7 or, indeed, any value less than 8. Hence, the trimmed data set is neither independent nor identically distributed.
(1-α)% CI for Trimmed Mean The correct standard error for a 10% trimmed mean is s_w / (0.8 √n), where s_w is the standard deviation of the 10% winsorized sample and 0.8 is the proportion of observations remaining after trimming.
(1-α)% CI for Trimmed Mean The correct standard error for the 10% trimmed mean of my example data is 0.62, and a 95% confidence interval for the 10% trimmed mean is 5.16 ± 2.13(0.62) = (3.84, 6.47), where 2.13 is the appropriate critical value from the t distribution with 16 − 1 = 15 degrees of freedom (16 being the number of values left after deleting the two smallest and two largest); i.e., t_0.975(15) = 2.13.
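Putting the pieces together, here is the whole confidence-interval calculation as one sketch (standard library only; the t critical value is hard-coded because the standard library has no t quantile function):

```python
import math
import statistics

data = [3.54, 6.61, 2.88, 2.20, 8.04, 5.31, 6.51, 6.37, 3.86, 7.82,
        0.967, 1.12, 7.00, 4.87, 8.39, 4.15, 3.11, 7.48, 16.62, 2.77]

n = len(data)
k = int(0.10 * n)                     # 2 values cut from each end
s = sorted(data)
t_mean = statistics.fmean(s[k:n - k])  # 10% trimmed mean, 5.16

# Winsorized SD: clamp the cut values to the nearest kept values.
winsorized = [min(max(x, s[k]), s[-k - 1]) for x in data]
s_w = statistics.stdev(winsorized)

se = s_w / (0.8 * math.sqrt(n))       # standard error of the trimmed mean
t_crit = 2.131                        # t_0.975(15); 15 = 16 - 1 df
lo, hi = t_mean - t_crit * se, t_mean + t_crit * se
print(round(lo, 2), round(hi, 2))     # approximately (3.84, 6.47)
```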
CIs for Binomial Proportions Everyone is taught the Wald-Wolfowitz (more commonly called simply the Wald) confidence interval for binomial proportions, i.e., p̂ ± z_(1−α/2) √(p̂(1−p̂)/n), which is based on a normal approximation. They are not taught how poorly it performs in general, however, even for large n.
CIs for Binomial Proportions What do I mean by performs poorly? The coverage probability of the Wald-Wolfowitz CI is often less than you intend when you choose your α. (For example, you might intend to have a 95% CI but end up with an 89% CI.) What is the coverage probability? It is the probability that the interval you construct contains the true parameter value. A 95% confidence interval has (or should have) a coverage probability of 0.95.
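The undercoverage is not a simulation artifact; it can be computed exactly, because for fixed n and p there are only n + 1 possible intervals. This sketch (my illustration; the case p = 0.05, n = 100 is my choice) sums the binomial probabilities of every outcome whose Wald interval covers p.

```python
import math

def wald_coverage(n, p, z=1.96):
    """Exact coverage of the nominal 95% Wald interval for Binomial(n, p)."""
    cover = 0.0
    for k in range(n + 1):
        phat = k / n
        half = z * math.sqrt(phat * (1 - phat) / n)
        if phat - half <= p <= phat + half:
            cover += math.comb(n, k) * p**k * (1 - p)**(n - k)
    return cover

# For p = 0.05 and n = 100 the actual coverage falls well short of the
# nominal 0.95 (it comes out near 0.88).
print(round(wald_coverage(100, 0.05), 3))
```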
CIs for Binomial Proportions But what does this mean? For the frequentist statistician, it means that were she to replicate her experiment 20 times, she would expect 19 out of the 20 confidence intervals she constructs to contain the true value of the parameter she is estimating. For the frequentist, the parameter is a fixed constant of nature, and once the CI is constructed, it either contains the true value or it does not.
CIs for Binomial Proportions By way of contrast, for the Bayesian statistician, the parameter is not fixed; there is a probability distribution associated with it, and even after the CI is constructed, he can speak of there being a probability that the parameter is contained in that interval. The distinction between frequentist and Bayesian statistics is not particularly important here and would, in any event, require a talk of its own.
CIs for Binomial Proportions What should you use instead of the Wald-Wolfowitz confidence interval? There are a number of alternatives, but I recommend the Wilson score interval.
Wilson Score Interval The most general form of the Wilson score interval is as follows: (p̂ + z²/(2n) ± z √(p̂(1−p̂)/n + z²/(4n²))) / (1 + z²/n), where z = z_(1−α/2).
Wilson Score Interval A 95% Wilson interval is (approximately): (p̂ + 2/n ± 2 √(p̂(1−p̂)/n + 1/n²)) / (1 + 4/n)
Wilson Score Interval As an example, suppose we observe 7 successes in 100 trials. An (approximate) 95% Wilson interval would be (0.034, 0.14). I have been writing approximate in parentheses because I used 2 in place of the correct critical value z_0.975 = 1.96 to make the formula on the previous slide look cleaner. The difference is slight. If you want to be extra fastidious, you can use the critical value from the t distribution, t_0.975(n−1), which in this example would be t_0.975(99) = 1.984.
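The Wilson interval is simple enough to sketch directly (standard library only; the function name is mine). With the slide's z = 2 shortcut it reproduces the 7-out-of-100 example.

```python
import math

def wilson_interval(successes, n, z=2.0):
    """Wilson score interval; z = 2 approximates the 95% critical value 1.96."""
    phat = successes / n
    center = phat + z * z / (2 * n)
    half = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (center - half) / denom, (center + half) / denom

lo, hi = wilson_interval(7, 100)
print(round(lo, 3), round(hi, 3))   # 0.034 0.139
```

Note that unlike the Wald interval, the lower endpoint can never go negative, which is another point in the Wilson interval's favor for proportions near 0 or 1.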
In Summary Bell-shaped distributions need not have nice properties. Do not simply assume the underlying distribution of your data is normal and apply standard statistical methods; to do so is the statistical equivalent of running with scissors. Instead, investigate the data until you are satisfied that it is approximately normal.
In Summary Even when the data appears to be approximately normal, you might want to consider trimming and/or winsorizing it. The efficiency of the 10% trimmed mean is such that it is competitive with the sample mean even under normality and is a better bet for heavy-tailed data. (This situation is not unlike the Wilcoxon-Mann-Whitney test vs. the two-sample t test.)
In Summary The Wald-Wolfowitz confidence interval for binomial proportions, like fast food, probably should be avoided.