Simple Descriptive Statistics These are ways to summarize a data set quickly and accurately. The most common way of describing a variable's distribution is in terms of two of its properties: central tendency describes the central value of the distribution, around which the observations cluster; dispersion describes how the observations are distributed about the central value. Today, we'll focus on measures of dispersion.
Why Do We Need Measures of Dispersion at all? Measures of central tendency tell us nothing about the variability / dispersion / deviation / range of values about the central value. Consider the following two unimodal symmetric distributions: Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA: Macmillan College Publishing Co., p. 91.
Measures of Dispersion In addition to measures of central tendency, we can also summarize data by characterizing its variability. Measures of dispersion are concerned with the distribution of values around the mean in data:
1. Range
2. Quartile range etc.
3. Mean deviation
4. Variance, standard deviation, and z-scores
5. Coefficient of variation
Measures of Dispersion - Range 1. Range: this is the most simply formulated of all measures of dispersion. Given a set of measurements x_1, x_2, x_3, …, x_(n-1), x_n, the range is defined as the difference between the largest and smallest values: Range = x_max − x_min. This is another descriptive measure that is vulnerable to the influence of outliers in a data set, which can result in a range that is not really descriptive of most of the data.
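As a quick illustration (the data values here are invented), the range and its sensitivity to outliers can be sketched in Python:

```python
# Sketch: the range is simply x_max - x_min.
data = [3, 7, 2, 9, 14, 5]

range_value = max(data) - min(data)   # 14 - 2 = 12

# A single outlier stretches the range dramatically:
with_outlier = data + [100]
outlier_range = max(with_outlier) - min(with_outlier)   # 100 - 2 = 98
```

Adding one extreme value inflates the range from 12 to 98, even though most of the data are unchanged.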
Measures of Dispersion - Quartile Range etc. 2. Quartile Range etc. We can divide distributions into a number of parts, each containing an equal number of observations:
Quartiles: each contains 25% of all values
Quintiles: each contains 20% of all values
Deciles: each contains 10% of all values
Percentiles: each contains 1% of all values
A standard application of this approach for describing dispersion involves calculating the interquartile range (a.k.a. quartile deviation).
Measures of Dispersion - Quartile Range etc. 2. Quartile Range etc. cont. Rogerson (p. 6) defines the interquartile range as the difference between the values of the 25th and 75th percentiles (i.e. the minimum value of the 2nd quartile and the maximum value of the 3rd quartile). This measure is well suited to skewed distributions, since it measures deviation around the median. The interquartile range provides 2 of the 5 values displayed in a box plot, which is a convenient graphical summary of a data set.
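A minimal sketch using the standard library's `statistics.quantiles` (the data set is hypothetical); note how the outlier at 100 stretches the range but leaves the interquartile range untouched:

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 100]  # hypothetical values

# quantiles(n=4) returns the three cut points: 25th, 50th, 75th percentiles
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1                     # interquartile range: resistant to the outlier
full_range = max(data) - min(data)  # range: dominated by the outlier
```

Here `iqr` is 7.0 while `full_range` is 99, illustrating why the interquartile range is preferred for skewed or outlier-prone data.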
Measures of Dispersion - Quartile Range etc. 2. Quartile Range etc. cont. A box plot graphically displays the following five values: the median, the minimum value, the maximum value, the 25th percentile value, and the 75th percentile value (Rogerson, p. 8). Under some circumstances, the whiskers are not used for the min. and max. because of outliers.
Measures of Dispersion - Mean Deviation 3. Mean Deviation Once we have calculated the mean value for a data set, we can assess the difference between any observation and that mean; this is termed the statistical distance: statistical distance = x_i − x̄. If we take the absolute values of these and sum over all observations, dividing by n, we have calculated the mean deviation: Mean deviation = Σ_{i=1}^{n} |x_i − x̄| / n
Measures of Dispersion - Mean Deviation 3. Mean Deviation cont. Why is it necessary to take absolute values of the statistical distances (x_i − x̄) before summing them to get the mean deviation? Because the statistical distances are both positive and negative, and summed without absolute values they would cancel to zero: Σ_{i=1}^{n} (x_i − x̄) = 0. Hence: Mean deviation = Σ_{i=1}^{n} |x_i − x̄| / n
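A short sketch (with made-up values) confirming that the signed statistical distances sum to zero, while their absolute values give the mean deviation:

```python
# Sketch of the mean deviation: the average absolute statistical distance.
data = [2, 4, 6, 8]
mean = sum(data) / len(data)   # x-bar = 5

signed = sum(x - mean for x in data)                       # cancels to zero
mean_deviation = sum(abs(x - mean) for x in data) / len(data)  # (3+1+1+3)/4 = 2
```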
Measures of Dispersion - Variance, Standard Deviation, Z-scores 4. Variance etc. As an alternative to taking the absolute values of the statistical distances, we can square each deviation before taking the sum, which yields the sum of squares: Sum of squares = Σ_{i=1}^{n} (x_i − x̄)². The sum of squares expresses the total squared variation about the mean, and using this value we can calculate variances and standard deviations for both populations and samples.
Measures of Dispersion - Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Variance is formulated as the sum of squares divided by the population size, or by the sample size minus one:
Population variance: σ² = Σ_{i=1}^{N} (x_i − μ)² / N
Sample variance: s² = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1)
Note the differences in the two formulae, both in the notation and in the denominators!
Measures of Dispersion - Variance, Standard Deviation, Z-scores 4. Variance etc. cont. We subtract one from the sample size when calculating the sample variance because we wish to produce an unbiased statistic. Unbiased here means that if we repeatedly drew samples from the population and calculated the sample variance for each, the average of those sample variances would converge on the true population variance. Because the spread within a sample tends to understate the spread of the whole population, we divide by n − 1 to enlarge the sample variance slightly, providing a better estimate of the population variance.
Measures of Dispersion - Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Standard deviation is calculated by taking the square root of the variance:
Population standard deviation: σ = √( Σ_{i=1}^{N} (x_i − μ)² / N )
Sample standard deviation: s = √( Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) )
Why do we prefer the standard deviation over the variance as a measure of dispersion? Because its magnitude and units match those of the mean and the original data.
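Both variants are available in Python's standard library; a sketch with an illustrative data set whose population standard deviation works out to exactly 2:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values, mean = 5

pop_var = statistics.pvariance(data)   # divides the sum of squares by N
samp_var = statistics.variance(data)   # divides by n - 1 (larger result)
pop_sd = statistics.pstdev(data)       # square root of pop_var
```

Note that `samp_var` exceeds `pop_var` precisely because of the n − 1 denominator discussed above.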
Measures of Dispersion - Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Just as the mean can be applied to spatial distributions through the bivariate mean center and weighted mean center formulae (computed from the (x, y) coordinates of a set of spatial objects), standard deviation can be applied to examining the dispersion of a spatial distribution. This is called the standard distance (SD): SD = √( Σ_{i=1}^{n} (x_i − x̄)² / (n − 1) + Σ_{i=1}^{n} (y_i − ȳ)² / (n − 1) )
Measures of Dispersion - Variance, Standard Deviation, Z-scores 4. Variance etc. cont. In the event that the observations are weighted, we have a similar formula for calculating the weighted standard distance: SD_w = √( Σ_{i=1}^{n} w_i (x_i − x̄)² / Σ_{i=1}^{n} w_i + Σ_{i=1}^{n} w_i (y_i − ȳ)² / Σ_{i=1}^{n} w_i ). Note that the x̄ and ȳ used in this formula need to be calculated using the weighted mean center formulae.
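A sketch of the weighted standard distance for four hypothetical, equally weighted points; the weighted mean center is computed first, as the formula requires:

```python
import math

# Hypothetical point data: corners of a 2x2 square, all with unit weight.
points = [(1.0, 1.0), (3.0, 1.0), (1.0, 3.0), (3.0, 3.0)]
weights = [1.0, 1.0, 1.0, 1.0]

w_total = sum(weights)

# Weighted mean center (x-bar, y-bar):
xc = sum(w * x for w, (x, y) in zip(weights, points)) / w_total
yc = sum(w * y for w, (x, y) in zip(weights, points)) / w_total

# Weighted standard distance, with sum(w_i) in each denominator:
sd_w = math.sqrt(
    sum(w * (x - xc) ** 2 for w, (x, y) in zip(weights, points)) / w_total
    + sum(w * (y - yc) ** 2 for w, (x, y) in zip(weights, points)) / w_total
)
```

For this symmetric layout the mean center is (2, 2) and the standard distance is √2.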
Measures of Dispersion - Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Sometimes, we want to compare data from different distributions, which in turn have different means and variances. In these circumstances, it's convenient to have a standardized measure of dispersion that can be calculated for an individual observation. The z-score (a.k.a. standard normal variate, standard normal deviate, or just the standard score) is calculated by subtracting the sample mean from the observation, and then dividing that difference by the sample standard deviation: z = (x_i − x̄) / s
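A minimal z-score computation (the sample values are invented):

```python
import statistics

data = [10, 12, 14, 16, 18]

mean = statistics.mean(data)   # x-bar = 14
s = statistics.stdev(data)     # sample standard deviation (n - 1 denominator)

# z-score for the observation 18: how many standard deviations above the mean?
z = (18 - mean) / s
```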
Measures of Dispersion - Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Z-scores can be interpreted as an indication of how many standard deviations above (or below, if the z-score is negative) the sample mean a given observation lies; when a value has a z-score of 2 or greater (or −2 or less), we are more than likely looking at an outlier. We can also form some expectations about the size of the standard deviation for a distribution (even if it is not bell-shaped), and the dispersion of values within that distribution, using Chebyshev's Theorem.
Measures of Dispersion - Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Chebyshev's Theorem states: given a set of data, no matter how it is distributed, a proportion of at least (1 − 1/z²) of the observations (where z > 1) will lie within z standard deviations of their mean x̄, and the proportion falling beyond the limits x̄ ± zs will be less than 1/z². Plugging in values for z:
z = 1: 1 − 1/z² = 0
z = 2: 1 − 1/z² = 3/4 (75% of obs.)
z = 3: 1 − 1/z² = 8/9 (≈89% of obs.)
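Chebyshev's bound can be checked empirically; here is a sketch on a deliberately skewed, made-up data set:

```python
import statistics

# A skewed, decidedly non-bell-shaped data set (values chosen for illustration).
data = [1] * 50 + [2] * 30 + [10] * 15 + [50] * 5

mean = statistics.fmean(data)
sd = statistics.pstdev(data)

z = 2
# Observed proportion within z standard deviations of the mean:
within = sum(1 for x in data if abs(x - mean) <= z * sd) / len(data)
# Chebyshev's guaranteed lower bound: 1 - 1/z^2
bound = 1 - 1 / z**2
```

The observed proportion comfortably exceeds the guaranteed 75% bound, as the theorem promises for any distribution.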
Measures of Dispersion - Variance, Standard Deviation, Z-scores 4. Variance etc. cont. For bell-shaped (normal) distributions we can go even further than Chebyshev's Theorem, and state an empirical rule about the expected dispersion of values in such a distribution. The Empirical Rule: given a distribution that is approximately bell-shaped, the interval
1. μ ± σ will contain ~68.27% of the values
2. μ ± 2σ will contain ~95.45% of the values
3. μ ± 3σ will contain ~99.73% of the values
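The Empirical Rule can be checked by simulation; a sketch using Python's `random` module (seeded so the result is reproducible):

```python
import random

random.seed(42)
# 100,000 draws from a standard normal distribution (mu = 0, sigma = 1):
draws = [random.gauss(0, 1) for _ in range(100_000)]

# Observed proportions within 1, 2, and 3 standard deviations of the mean:
within_1sd = sum(1 for x in draws if abs(x) <= 1) / len(draws)
within_2sd = sum(1 for x in draws if abs(x) <= 2) / len(draws)
within_3sd = sum(1 for x in draws if abs(x) <= 3) / len(draws)
```

The observed proportions land very close to the theoretical 68.27%, 95.45%, and 99.73%.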
Measures of Dispersion Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Displaying this Empirical Rule graphically, we can get a sense of what defines a normal distribution: Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA: Macmillan College Publishing Co., p. 100.
Measures of Dispersion - Coefficient of Variation 5. Coefficient of Variation We cannot directly compare the standard deviations of frequency distributions with different means, because a distribution with a higher mean is likely to have a larger deviation. In addition to z-scores (which describe the deviation of an individual observation), we need an overall measure of dispersion that is normalized with respect to the mean of the same distribution: Coefficient of variation = s / x̄ (sample) or σ / μ (population), optionally multiplied by 100%
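A sketch comparing two invented data sets whose standard deviations are identical but whose coefficients of variation differ because of their different means:

```python
import statistics

# Two hypothetical variables, each with sample standard deviation 10:
heights_cm = [160, 170, 180]   # mean 170
weights_kg = [60, 70, 80]      # mean 70

cv_height = statistics.stdev(heights_cm) / statistics.mean(heights_cm)
cv_weight = statistics.stdev(weights_kg) / statistics.mean(weights_kg)
```

Despite equal standard deviations, the weights are relatively more dispersed (CV ≈ 0.143 vs. ≈ 0.059), which the raw standard deviations alone would not reveal.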
Further Moments of the Distribution While measures of dispersion are useful for helping us describe the width of the distribution, they tell us nothing about its shape. You'll recall these figures from earlier in the lecture: Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA: Macmillan College Publishing Co., p. 91.
Further Moments of the Distribution There are further statistics that describe the shape of the distribution, using formulae similar to those of the mean and variance:
1. The 1st moment of the distribution: Mean (describes central value)
2. The 2nd moment of the distribution: Variance (describes dispersion)
3. The 3rd moment of the distribution: Skewness (describes asymmetry)
4. The 4th moment of the distribution: Kurtosis (describes peakedness)
Further Moments of the Distribution Each of the moments of the distribution makes use of a formula based on a summation of the statistical distances (x_i − x̄) for a set of data, where the exponent in the formula is determined by the moment:
2nd moment (dispersion): Variance = Σ_{i=1}^{n} (x_i − x̄)² / (n − 1)
3rd moment (asymmetry): Skewness = Σ_{i=1}^{n} (x_i − x̄)³ / (n s³)
4th moment (peakedness): Kurtosis = Σ_{i=1}^{n} (x_i − x̄)⁴ / (n s⁴) − 3
Further Moments of the Distribution - Skewness Skewness This statistic measures the degree of asymmetry exhibited by the data (i.e. whether there are more observations on one side of the mean than the other): Skewness = Σ_{i=1}^{n} (x_i − x̄)³ / (n s³). Because the exponent in this moment is odd, skewness can be positive or negative; with positive skewness there are more observations below the mean than above it (and vice-versa for negative skewness).
Further Moments of the Distribution - Skewness Skewness cont. Skewness can also be assessed by comparing the mean and the median:
Positive skewness: Median < Mean
Negative skewness: Mean < Median
This can also be assessed by calculating Pearson's coefficient of skewness: Sk = 3(x̄ − Md) / s, where x̄ is the mean, Md is the median, and s is the standard deviation. Sk follows the above sign convention, and values with an absolute value less than 3 indicate moderate skewness.
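A sketch computing both the moment-based skewness and Pearson's coefficient on a made-up data set with a long right tail:

```python
import statistics

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]   # long right tail -> positive skew

n = len(data)
mean = statistics.fmean(data)   # 4.7
s = statistics.stdev(data)      # sample standard deviation
md = statistics.median(data)    # 3.0

# Moment-based skewness: sum of cubed distances over n * s^3
skewness = sum((x - mean) ** 3 for x in data) / (n * s ** 3)

# Pearson's coefficient of skewness: 3 * (mean - median) / s
pearson_sk = 3 * (mean - md) / s
```

Both statistics come out positive, and the median sits below the mean, consistent with the conventions above.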
Further Moments of the Distribution - Kurtosis Kurtosis This statistic measures how flat or peaked the distribution is, and is formulated as: Kurtosis = Σ_{i=1}^{n} (x_i − x̄)⁴ / (n s⁴) − 3. The 3 is subtracted in this formula so that the kurtosis of a normal distribution has the value 0 (this condition is also termed a mesokurtic distribution).
Further Moments of the Distribution - Kurtosis Kurtosis cont. When the kurtosis < 0, the frequencies throughout the curve are closer to equal (i.e. the curve is flatter and wider), and this condition is termed platykurtic. When the kurtosis > 0, high frequencies are concentrated in only a small part of the curve (i.e. the curve is more peaked), and this condition is termed leptokurtic. NOTE: both skewness and kurtosis are sensitive to the size of n; when n is small and there are outliers, they are less useful.
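A sketch contrasting a flat (platykurtic) and a sharply peaked (leptokurtic) made-up data set using the kurtosis formula above:

```python
import statistics

def kurtosis(data):
    """Moment-based kurtosis, with 3 subtracted so a normal curve scores 0."""
    n = len(data)
    mean = statistics.fmean(data)
    s = statistics.stdev(data)
    return sum((x - mean) ** 4 for x in data) / (n * s ** 4) - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # near-uniform: frequencies roughly equal
peaked = [5] * 20 + [1, 9]               # heavy concentration at the center
```

The near-uniform data yield a negative (platykurtic) value, while the concentrated data yield a positive (leptokurtic) one.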