David Tenenbaum GEOG 090 UNC-CH Spring 2005

1 Simple Descriptive Statistics Review and Examples
If you work with numeric data sets in an academic or research context, you will likely make use of all three measures of central tendency (mode, median, and mean), some key measures of dispersion (standard deviation, z-scores, and the coefficient of variation), and the statistics that describe the shape of a distribution (skewness and kurtosis).
In this lecture, we will review the procedures for calculating these statistics and work through an example for each of them, using a data set smaller than those typically found in research applications.

2 Measures of Central Tendency - Review
1. Mode: the most frequently occurring value in the distribution
2. Median: the value of a variable such that half of the observations are above and half are below this value, i.e. this value divides the distribution into two groups of equal size
3. Mean: a.k.a. the average, the most commonly used measure of central tendency

3 Measures of Central Tendency - Review
1. Mode: the most frequently occurring value in the distribution
Procedure for finding the mode of a data set:
1) Sort the data, putting the values in ascending order
2) Count the instances of each value (if this is continuous data with a high degree of precision and many decimal places, this may be quite tedious)
3) Find the value that has the most occurrences; this is the mode (if more than one value occurs an equal number of times and these counts exceed all others, we have multiple modes)
Use the mode for multi-modal or nominal data sets
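The counting procedure above can be sketched in Python. This is only an illustration and not part of the original lecture: the helper name `find_modes` and the six-value sample are made up for the example, and the function assumes a non-empty data set.

```python
from collections import Counter


def find_modes(values):
    """Return every value tied for the highest count, in ascending order.

    Hypothetical helper for illustration; assumes `values` is non-empty.
    """
    counts = Counter(values)        # step 2: count the instances of each value
    highest = max(counts.values())  # step 3: the largest number of occurrences
    # Several values can share the highest count (multiple modes).
    return sorted(v for v, c in counts.items() if c == highest)


data = [3, 1, 2, 3, 2, 5]           # made-up values, not from the lecture
print(find_modes(data))             # 2 and 3 each occur twice
```

Counting with a dictionary-like `Counter` makes the explicit sorting step unnecessary; the result is sorted only so that multiple modes are reported in a predictable order.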

4 Measures of Central Tendency - Review
2. Median: ½ of the values are above and ½ are below this value
Procedure for finding the median of a data set:
1) Sort the data, putting the values in ascending order
2) Find the value with an equal number of values above and below it (if there is an even number of values, you will need to average two values together):
   Odd number of observations: take the value [(n-1)/2]+1 places from the lowest, e.g. for n=19, [(19-1)/2]+1 = the 10th value
   Even number of observations: average the (n/2)th and [(n/2)+1]th values, e.g. for n=20, average the 10th and 11th values
Use the median with asymmetric distributions, when you suspect outliers are present, or with ordinal data
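As a rough sketch of the odd and even cases, the procedure can be written as a short Python function. The name `find_median` is an assumption made for this example, and the two test inputs are simply the integers 1 to 19 and 1 to 20, chosen to mirror the n=19 and n=20 cases above.

```python
def find_median(values):
    """Median by the sort-and-pick-the-middle procedure (hypothetical helper)."""
    ordered = sorted(values)              # step 1: ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2                       # 0-based index of the upper middle value
    if n % 2 == 1:
        # Odd n: 0-based index n // 2 is the ([(n-1)/2]+1)th value counting from 1.
        return ordered[middle]
    # Even n: average the (n/2)th and [(n/2)+1]th values (counting from 1).
    return (ordered[middle - 1] + ordered[middle]) / 2


print(find_median(range(1, 20)))   # n=19: the 10th value
print(find_median(range(1, 21)))   # n=20: average of the 10th and 11th values
```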

5 Measures of Central Tendency - Review
3. Mean: a.k.a. the average, the most commonly used measure of central tendency
Procedure for finding the mean of a data set:
1) Sum all the values in the data set
2) Divide the sum by the number of values in the data set
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Use the mean when you have interval or ratio data sets with a large sample size, few (or no?) outliers, and a reasonably symmetric unimodal distribution
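The advice about outliers can be illustrated with a small sketch. The function name `find_mean` and both five-value lists are invented for this example; they are not data from the lecture.

```python
def find_mean(values):
    """Arithmetic mean: sum the values, then divide by how many there are."""
    n = len(values)
    if n == 0:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / n


typical = [2, 3, 4, 5, 6]        # made-up values
with_outlier = [2, 3, 4, 5, 60]  # same values, but the largest is replaced by an outlier

print(find_mean(typical))        # the mean sits in the middle of the data
print(find_mean(with_outlier))   # a single outlier pulls the mean well above most values
```

In both lists the median is 4, which is one reason the median is suggested when outliers are suspected.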

6 Measures of Central Tendency - Review
An example data set: daily low temperatures recorded in Chapel Hill from January 18, 2005 through January 31, 2005, in degrees Fahrenheit:
Jan. 18   11 degrees        Jan. 25   25 degrees
Jan. 19   11 degrees        Jan. 26   33 degrees
Jan. 20   25 degrees        Jan. 27   22 degrees
Jan. 21   29 degrees        Jan. 28   18 degrees
Jan. 22   27 degrees        Jan. 29   19 degrees
Jan. 23   14 degrees        Jan. 30   30 degrees
Jan. 24   11 degrees        Jan. 31   27 degrees
For these 14 values, we will calculate all three measures of central tendency: the mode, median, and mean

7 Measures of Central Tendency - Review
1. Mode: find the most frequently occurring value
1) Sort the data, putting the values in ascending order:
   11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33
2) Count the instances of each value:
   Value   11   14   18   19   22   25   27   29   30   33
   Count   3x   1x   1x   1x   1x   2x   2x   1x   1x   1x
3) Find the value that has the most occurrences:
   In this case, the mode is 11 degrees Fahrenheit, but is this a good measure of the central tendency of this data? Had there been only two days with a recorded temperature of 11 degrees, what would be the mode?

8 Measures of Central Tendency - Review
2. Median: ½ of the values are above and ½ are below this value
1) Sort the data, putting the values in ascending order:
   11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33
2) Find the value with an equal number of values above and below it (if there is an even number of values, you will need to average two values together):
   Even number of observations: average the (n/2)th and [(n/2)+1]th values
   Here, n=14, so average the (14/2)th and [(14/2)+1]th values, i.e. the 7th and 8th values: (22+25)/2 = 23.5 degrees F
Here, the median is 23.5 degrees F. Is this a good measure of central tendency for this data?

9 Measures of Central Tendency - Review
3. Mean: a.k.a. the average, the most commonly used measure of central tendency
1) Sum all the values in the data set: $\sum_{i=1}^{n} x_i = 302$
2) Divide the sum by the number of values in the data set: here, n=14, so calculate the mean using 302/14 = 21.57
The mean is 21.57 degrees F. Is this a good measure of central tendency for this data set?
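The three worked results can be cross-checked with Python's standard `statistics` module. This is a sketch added for illustration rather than part of the lecture; the variable names are assumptions, and `statistics.multimode` requires Python 3.8 or later. The values are the 14 sorted temperatures from the mode and median examples.

```python
import statistics

# The 14 daily low temperatures (degrees F), in ascending order
temps = [11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33]

modes = statistics.multimode(temps)    # list of all values tied for most frequent
median = statistics.median(temps)      # averages the 7th and 8th values for n=14
mean = statistics.mean(temps)          # 302 / 14

print(modes, median, round(mean, 2))

# The question posed in the mode example: what if only two days had 11 degrees?
two_elevens = temps[1:]                # drop one of the three 11s
print(statistics.multimode(two_elevens))
```

With one 11 removed, three values are tied at two occurrences each, so the data set becomes multi-modal.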

10 Measures of Dispersion - Review
1. Standard Deviation: the most frequently used measure of dispersion, because it has the same units as the values and their mean
2. Z-scores: these express the difference of an individual value from the mean in terms of standard deviations, and thus can be compared to z-scores drawn from other data sets or distributions
3. Coefficient of Variation: an overall measure of dispersion that is normalized with respect to the mean of the same distribution, and thus is comparable to coefficients of variation from other data sets

11 Measures of Dispersion - Review
1. Standard Deviation
Standard deviation is calculated by taking the square root of the variance:
Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
Why do we prefer standard deviation over variance as a measure of dispersion? Its magnitude and units match those of the values and their mean.

12 Measures of Dispersion - Review
1. Standard Deviation: the most frequently used measure of dispersion, because it has the same units as the values and their mean (unlike variance)
Procedure for finding the standard deviation of a data set:
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
3) Square each of the statistical distances, $(x_i - \bar{x})^2$
4) Sum the squared statistical distances; this is the sum of squares
5) Divide the sum of squares by N for a population or by (n-1) for a sample; this gives you the variance
6) Take the square root of the variance to get the standard deviation
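The six steps map directly onto a few lines of Python. The sketch below is illustrative only: the function name `std_dev`, its `sample` flag, and the eight-value list are assumptions made for this example, not part of the lecture.

```python
import math


def std_dev(values, sample=True):
    """Standard deviation following the six-step procedure (hypothetical helper).

    sample=True divides by (n-1); sample=False divides by N (population).
    """
    n = len(values)
    mean = sum(values) / n                                   # step 1
    distances = [x - mean for x in values]                   # step 2
    sum_of_squares = sum(d ** 2 for d in distances)          # steps 3 and 4
    variance = sum_of_squares / (n - 1 if sample else n)     # step 5
    return math.sqrt(variance)                               # step 6


data = [2, 4, 4, 4, 5, 5, 7, 9]            # made-up values with mean 5
print(std_dev(data, sample=False))         # population: sqrt(32 / 8)
print(round(std_dev(data), 3))             # sample: sqrt(32 / 7)
```

The sample version is always a little larger than the population version for the same values, because the sum of squares is divided by a smaller number.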

13 Measures of Dispersion - Review
2. Z-scores: these express the difference of an individual value from the mean in terms of standard deviations, and thus can be compared to z-scores drawn from other data sets or distributions
Procedure for finding the z-score of an observation:
1) Calculate the mean
2) Calculate the statistical distance $(x_i - \bar{x})$ for each value for which we wish to find the z-score
3) Calculate the standard deviation
4) Calculate the z-score using the formula
$$\text{Z-score} = \frac{x_i - \bar{x}}{S}$$
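A minimal sketch of the z-score procedure, using the standard `statistics` module for the mean and the sample standard deviation. The helper name `z_scores` and the eight-value list are assumptions for illustration; the function needs at least two values that are not all identical, since otherwise the standard deviation is undefined or zero.

```python
import statistics


def z_scores(values):
    """Return the z-score of every value, using the sample standard deviation."""
    mean = statistics.mean(values)           # step 1
    s = statistics.stdev(values)             # step 3: sample SD, divides by (n-1)
    # Steps 2 and 4: statistical distance divided by the standard deviation.
    return [(x - mean) / s for x in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]              # made-up values with mean 5
for x, z in zip(data, z_scores(data)):
    print(x, round(z, 2))
```

Values below the mean get negative z-scores, values above it get positive ones, and a value equal to the mean gets a z-score of 0.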

14 Measures of Dispersion - Review
3. Coefficient of Variation: an overall measure of dispersion that is normalized with respect to the mean of the same distribution, and thus is comparable to coefficients of variation from other data sets
Procedure for finding the coefficient of variation for a data set:
1) Calculate the mean
2) Calculate the standard deviation
3) Calculate the coefficient of variation using the formula
$$\text{Coefficient of variation} = \frac{S}{\bar{x}} \quad \text{or} \quad \frac{\sigma}{\mu} \quad (\times 100\%)$$
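To show what "normalized with respect to the mean" buys us, the sketch below computes the coefficient of variation for the same made-up heights expressed in metres and in centimetres. The function name and the data are assumptions for this example, and the calculation only makes sense when the mean is not zero.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (hypothetical helper)."""
    return statistics.stdev(values) / statistics.mean(values)


heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]   # made-up heights in metres
heights_cm = [150, 160, 170, 180, 190]       # the same heights in centimetres

print(round(coefficient_of_variation(heights_m) * 100, 2), "%")
print(round(coefficient_of_variation(heights_cm) * 100, 2), "%")
```

The standard deviations of the two lists differ by a factor of 100, but the coefficient of variation is the same, which is what makes it comparable across data sets.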

15 Measures of Dispersion - Review
We will use the same example data set: the daily low temperatures in Chapel Hill for January 18-31, 2005, in degrees F.
For these 14 values, we will calculate the three measures of dispersion listed above: the standard deviation, some z-scores, and the coefficient of variation

16 Measures of Dispersion - Review
1. Standard Deviation: the most frequently used measure of dispersion, because it has the same units as the values and their mean (unlike variance)
1) Calculate the mean: we have previously found the mean = 21.57 degrees F
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value:
Jan. 18   (11 - 21.57) = -10.57        Jan. 25   (25 - 21.57) =   3.43
Jan. 19   (11 - 21.57) = -10.57        Jan. 26   (33 - 21.57) =  11.43
Jan. 20   (25 - 21.57) =   3.43        Jan. 27   (22 - 21.57) =   0.43
Jan. 21   (29 - 21.57) =   7.43        Jan. 28   (18 - 21.57) =  -3.57
Jan. 22   (27 - 21.57) =   5.43        Jan. 29   (19 - 21.57) =  -2.57
Jan. 23   (14 - 21.57) =  -7.57        Jan. 30   (30 - 21.57) =   8.43
Jan. 24   (11 - 21.57) = -10.57        Jan. 31   (27 - 21.57) =   5.43
I have rounded the values for display here to 2 decimal places; ideally you want to do as little rounding as possible

17 Measures of Dispersion - Review
1. Standard Deviation cont.
3) Square each of the statistical distances, $(x_i - \bar{x})^2$:
Jan. 18   (-10.57)^2 = 111.76        Jan. 25   (3.43)^2  =  11.76
Jan. 19   (-10.57)^2 = 111.76        Jan. 26   (11.43)^2 = 130.61
Jan. 20   (3.43)^2   =  11.76        Jan. 27   (0.43)^2  =   0.18
Jan. 21   (7.43)^2   =  55.18        Jan. 28   (-3.57)^2 =  12.76
Jan. 22   (5.43)^2   =  29.47        Jan. 29   (-2.57)^2 =   6.61
Jan. 23   (-7.57)^2  =  57.33        Jan. 30   (8.43)^2  =  71.04
Jan. 24   (-10.57)^2 = 111.76        Jan. 31   (5.43)^2  =  29.47
4) Sum the squared statistical distances; this is the sum of squares:
$$\text{Sum of squares} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = 751.43$$
(The squares are computed from the unrounded distances.)

18 Measures of Dispersion - Review
1. Standard Deviation cont.
5) Divide the sum of squares by N for a population or by (n-1) for a sample; this gives you the variance
   Here, our sample n = 14, so 751.43/(14-1) = 57.8
6) Take the square root of the variance to calculate the standard deviation
   Taking the square root of our variance (57.8) gives us the standard deviation for our data set: $\sqrt{57.8} = 7.6$

19 Measures of Dispersion - Review
2. Z-scores: we will calculate z-scores for the lowest and highest temperatures in our sample (11 and 33 degrees)
1) Calculate the mean: we have previously found the mean = 21.57 degrees F
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value for which we wish to find the z-score. We have already calculated these statistical distances:
   Jan. 18   (11 - 21.57) = -10.57
   Jan. 26   (33 - 21.57) =  11.43
3) Calculate the standard deviation: we have already calculated the standard deviation for our data set and found it to be 7.6 degrees

20 Measures of Dispersion - Review
2. Z-scores cont.
4) Calculate the z-score using the formula
$$\text{Z-score} = \frac{x_i - \bar{x}}{S}$$
i.e. divide the statistical distances by the standard deviation:
   Jan. 18   -10.57 / 7.6 = -1.39
   Jan. 26    11.43 / 7.6 =  1.5
If we had another set of minimum temperatures from a previous January (from 2004, for example), we could calculate the z-scores for values from that data set and make a reasonable comparison to these values

21 Measures of Dispersion - Review
3. Coefficient of Variation: a normalized measure of dispersion for the variation throughout a data set
1) Calculate the mean: we have previously found the mean = 21.57 degrees F
2) Calculate the standard deviation: we have previously found the std. dev. = 7.6 degrees F
3) Calculate the coefficient of variation using the formula
$$\text{Coefficient of variation} = \frac{S}{\bar{x}} \quad \text{or} \quad \frac{\sigma}{\mu} \quad (\times 100\%)$$
Using the example values: 7.6/21.57 = 0.3524, or 35.24%
This value could be compared with that from 2004, etc.

22 Skewness and Kurtosis - Review
1. Skewness: this statistic measures the degree of asymmetry exhibited by the data (i.e. whether there are more observations on one side of the mean than the other)
2. Kurtosis: this statistic measures the degree to which the distribution is flat or peaked

23 Skewness and Kurtosis - Review
1. Skewness: this statistic measures the degree of asymmetry exhibited by the data (i.e. whether there are more observations on one side of the mean than the other):
$$\text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$$
Because the exponent in this moment is odd, skewness can be positive or negative; positive skewness has more observations below the mean than above it (and negative skewness vice versa)

24 Skewness and Kurtosis - Review
1. Skewness: this statistic measures the degree of asymmetry exhibited by the data
Procedure for finding the skewness of a data set:
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
3) Cube each of the statistical distances, $(x_i - \bar{x})^3$
4) Sum the cubed statistical distances; this sum of cubes is the numerator in the skewness formula
5) Divide the sum of cubes by the sample size multiplied by the standard deviation cubed, i.e. the denominator is $n s^3$ in $\left[\sum (x_i - \bar{x})^3\right] / \left[n s^3\right]$
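The five steps can be sketched in Python as follows. This is an illustrative implementation of the formula as given in this lecture (sum of cubes divided by $n s^3$, with $s$ the sample standard deviation); the function name `skewness` is an assumption, and other texts and libraries use slightly different conventions. The data are the 14 sorted temperatures from the central tendency examples.

```python
import math

# The 14 daily low temperatures (degrees F), in ascending order
temps = [11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33]


def skewness(values):
    """Skewness as [sum of cubed distances] / [n * s^3] (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n                                    # step 1
    distances = [x - mean for x in values]                    # step 2
    s = math.sqrt(sum(d ** 2 for d in distances) / (n - 1))   # sample std. dev.
    sum_of_cubes = sum(d ** 3 for d in distances)             # steps 3 and 4
    return sum_of_cubes / (n * s ** 3)                        # step 5


print(round(skewness(temps), 3))
```

Because this sketch keeps full precision for the standard deviation, its result may differ in the third decimal place from a hand calculation that uses the rounded value 7.6.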

25 Skewness and Kurtosis - Review
2. Kurtosis: this statistic measures how flat or peaked the distribution is, and is formulated as:
$$\text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$$
The 3 is subtracted in this formula so that the kurtosis of a normal distribution has the value 0 (a distribution with this property is termed mesokurtic)

26 Skewness and Kurtosis - Review
2. Kurtosis: this statistic measures how flat or peaked the distribution is
Procedure for finding the kurtosis of a data set:
1) Calculate the mean
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value
3) Raise each of the statistical distances to the 4th power, i.e. $(x_i - \bar{x})^4$
4) Sum the statistical distances raised to the 4th power, $\sum (x_i - \bar{x})^4$
5) Divide the sum by the sample size multiplied by the standard deviation raised to the 4th power, i.e. the denominator is $n s^4$ in $\left[\sum (x_i - \bar{x})^4\right] / \left[n s^4\right]$
6) Subtract 3 from $\left[\sum (x_i - \bar{x})^4\right] / \left[n s^4\right]$
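A sketch of the six steps in Python, again following the formula used in this lecture (sum of 4th powers divided by $n s^4$, minus 3, with $s$ the sample standard deviation). The function name `kurtosis` and the second, deliberately peaked list are assumptions made for the illustration.

```python
import math

# The 14 daily low temperatures (degrees F), in ascending order
temps = [11, 11, 11, 14, 18, 19, 22, 25, 25, 27, 27, 29, 30, 33]


def kurtosis(values):
    """Kurtosis as [sum of 4th powers] / [n * s^4] - 3 (hypothetical helper)."""
    n = len(values)
    mean = sum(values) / n                                    # step 1
    distances = [x - mean for x in values]                    # step 2
    s = math.sqrt(sum(d ** 2 for d in distances) / (n - 1))   # sample std. dev.
    sum_of_fourths = sum(d ** 4 for d in distances)           # steps 3 and 4
    return sum_of_fourths / (n * s ** 4) - 3                  # steps 5 and 6


print(round(kurtosis(temps), 3))          # negative: flatter than a normal curve

# Made-up values: most observations at the centre, two far out in the tails
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]
print(round(kurtosis(peaked), 3))         # positive: more peaked than a normal curve
```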

27 Skewness & Kurtosis - Review
We will use the same example data set: the daily low temperatures in Chapel Hill for January 18-31, 2005, in degrees F.
Using these 14 values, we will calculate the two descriptive statistics of distribution shape listed above: the skewness and the kurtosis

28 Skewness & Kurtosis - Review
1. Skewness: this statistic measures the degree of asymmetry exhibited by the data
1) Calculate the mean: we have previously found the mean = 21.57 degrees F
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value. We have previously calculated the statistical distances:
Jan. 18   -10.57        Jan. 25     3.43
Jan. 19   -10.57        Jan. 26    11.43
Jan. 20     3.43        Jan. 27     0.43
Jan. 21     7.43        Jan. 28    -3.57
Jan. 22     5.43        Jan. 29    -2.57
Jan. 23    -7.57        Jan. 30     8.43
Jan. 24   -10.57        Jan. 31     5.43

29 Skewness & Kurtosis - Review
1. Skewness cont.
3) Cube each of the statistical distances, $(x_i - \bar{x})^3$:
Jan. 18   (-10.57)^3 = -1181.41        Jan. 25   (3.43)^3  =    40.3
Jan. 19   (-10.57)^3 = -1181.41        Jan. 26   (11.43)^3 =  1492.71
Jan. 20   (3.43)^3   =    40.3         Jan. 27   (0.43)^3  =     0.08
Jan. 21   (7.43)^3   =   409.94        Jan. 28   (-3.57)^3 =   -45.55
Jan. 22   (5.43)^3   =   159.98        Jan. 29   (-2.57)^3 =   -17
Jan. 23   (-7.57)^3  =  -434.04        Jan. 30   (8.43)^3  =   598.77
Jan. 24   (-10.57)^3 = -1181.41        Jan. 31   (5.43)^3  =   159.98
4) Sum the cubed statistical distances; this is the sum of cubes:
$$\text{Sum of cubes} = \sum_{i=1}^{n} (x_i - \bar{x})^3 = -1138.78$$
(The cubes are computed from the unrounded distances.)

30 Skewness & Kurtosis - Review
1. Skewness cont.
5) Divide the sum of cubes (-1138.78) by $n s^3$ (S = 7.6 from above):
$$\frac{\sum (x_i - \bar{x})^3}{n s^3} = \frac{-1138.78}{14 \times (7.6)^3} = \frac{-1138.78}{14 \times 438.98} = \frac{-1138.78}{6145.66} = -0.19$$
The negative value of skewness indicates that our sample distribution has greater frequencies at the higher values of temperature (although interpreting skewness with a sample this small and a distribution that is not really normally shaped is somewhat of a stretch)

31 Skewness & Kurtosis - Review
2. Kurtosis: this statistic measures the degree to which the distribution is flat or peaked
1) Calculate the mean: we have previously found the mean = 21.57 degrees F
2) Calculate the statistical distances $(x_i - \bar{x})$ for each value. We have previously calculated the statistical distances:
Jan. 18   -10.57        Jan. 25     3.43
Jan. 19   -10.57        Jan. 26    11.43
Jan. 20     3.43        Jan. 27     0.43
Jan. 21     7.43        Jan. 28    -3.57
Jan. 22     5.43        Jan. 29    -2.57
Jan. 23    -7.57        Jan. 30     8.43
Jan. 24   -10.57        Jan. 31     5.43

32 Skewness & Kurtosis - Review
2. Kurtosis cont.
3) Raise each of the statistical distances to the 4th power, $(x_i - \bar{x})^4$:
Jan. 18   (-10.57)^4 = 12489.20        Jan. 25   (3.43)^4  =   138.18
Jan. 19   (-10.57)^4 = 12489.20        Jan. 26   (11.43)^4 = 17059.56
Jan. 20   (3.43)^4   =   138.18        Jan. 27   (0.43)^4  =     0.03
Jan. 21   (7.43)^4   =  3045.24        Jan. 28   (-3.57)^4 =   162.69
Jan. 22   (5.43)^4   =   868.44        Jan. 29   (-2.57)^4 =    43.72
Jan. 23   (-7.57)^4  =  3286.33        Jan. 30   (8.43)^4  =  5046.80
Jan. 24   (-10.57)^4 = 12489.20        Jan. 31   (5.43)^4  =   868.44
4) Sum the statistical distances raised to the 4th power:
$$\text{Sum of 4th powers} = \sum_{i=1}^{n} (x_i - \bar{x})^4 = 68125.24$$
(The 4th powers are computed from the unrounded distances.)

33 Skewness & Kurtosis - Review
2. Kurtosis cont.
5) Divide the sum of 4th powers (68125.24) by $n s^4$ (S = 7.6 from above):
$$\frac{\sum (x_i - \bar{x})^4}{n s^4} = \frac{68125.24}{14 \times (7.6)^4} = \frac{68125.24}{14 \times 3336.22} = \frac{68125.24}{46707.05} = 1.46$$
6) Subtract 3 from $\left[\sum (x_i - \bar{x})^4\right] / \left[n s^4\right]$:
   Using our values, the kurtosis is 1.46 - 3 = -1.54
Because this kurtosis is < 0, this sample has a platykurtic distribution, meaning the curve is flatter than a normal curve (but caveats to interpretation apply)

Basic Procedure for Histograms
1. Compute the range of observations (min. & max. value)
2. Choose an initial # of classes (most likely based on the range of values, try and find a number of classes that