STAT 113 Variability
Colin Reimer Dawson, Oberlin College
September 14, 2017

Outline
- Last Time: Shape and Center
- Variability
- Boxplots and the IQR
- Variance and Standard Deviation
- Transformations

Distribution of a Quantitative Variable
The distribution of a quantitative variable is characterized by:
A. Shape (symmetric, skewed, bimodal, etc.)
B. Center (mean, median)
C. Spread (Interquartile Range, Standard Deviation)
D. Outliers (if any)

Skewness
A distribution is skewed when the extreme values on one side are more extreme than those on the other. We call a distribution right-skewed when the longer tail is on the right, and left-skewed when the longer tail is on the left.

Resistance/Robustness
The mean is strongly affected by skew and by outliers: it is pulled toward the extreme values. In these cases, we generally prefer a measure of central tendency that is resistant to the influence of extreme values (also called robust). The median is a resistant/robust measure of center.
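
A small sketch of what resistance means in practice, using Python's standard `statistics` module. The numbers are hypothetical and chosen only for illustration: adding one extreme value moves the mean a long way and the median very little.

```python
import statistics

# Hypothetical data: five typical values, then the same values plus one extreme one.
typical = [30, 35, 40, 45, 50]
with_outlier = typical + [500]

mean_before = statistics.mean(typical)
median_before = statistics.median(typical)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)
```

The mean is pulled toward the added value, while the median only shifts to the midpoint of the two middle observations.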

Measures of Variability
We want to quantify the consistency, or lack thereof, of the data. A general term for lack of consistency is variability. We will look at:
- Range
- Interquartile Range
- Variance / Standard Deviation

The Range
The range is easy to compute, but not very reliable.
Figure: Historical Annual Returns for Two Hypothetical Index Funds (panels: Fund C1 and Fund C2)

The Range
The range is easy to compute, but not very reliable.
Figure: Annual Returns for 3 random samples of 5 years (panels: Fund E full data set, Sample 1, Sample 2, Sample 3)
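
The sampling instability of the range can be sketched as follows. The return values below are hypothetical stand-ins, not the Fund E data from the figure, and the seed is fixed only so that the run is repeatable. A sample's range can never exceed the range of the full data set, and it usually falls short by an amount that varies from sample to sample.

```python
import random

# Hypothetical annual returns (%) standing in for a full 20-year record.
full_record = [-9, -6, -4, -2, -1, 0, 1, 2, 3, 4,
               5, 5, 6, 7, 8, 9, 10, 11, 13, 15]


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


rng = random.Random(113)  # fixed seed so the run is repeatable
full_range = value_range(full_record)

# Three random samples of 5 years each, drawn without replacement.
sample_ranges = [value_range(rng.sample(full_record, 5)) for _ in range(3)]

print("full range:", full_range)
print("ranges of three 5-year samples:", sample_ranges)
```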

Robust Measures of Variability
We'd like a more robust measure of variability, one that is not affected so much by extreme values. Analogous to the median: describe the middle part of the data. The idea: find the middle half of the data, and then take its range. Specifically, exclude the lowest 25% and the highest 25%, and take the difference between the highest and lowest remaining values.

Quartiles
The median divides the data in two. Percentiles divide the data into 100 pieces. Quartiles divide the data into four pieces. The $k$-th quartile (written $Q_k$) is the point below which $k$ quarters of the data lie. So, in terms of quartiles, the median is $Q_2$, the minimum value is $Q_0$, and the maximum value is $Q_4$. We can calculate the range using quartiles as $Q_4 - Q_0$.

Quartiles
Figure: $Q_0$ through $Q_4$ marked along a Height (in.) axis

The Inter-Quartile Range (IQR)
The Inter-Quartile Range (or IQR) is the distance between the first and third quartiles:
$$\mathrm{IQR} = Q_3 - Q_1$$
Pedantic Note: The IQR is a single number, not the two quartiles themselves.
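
A minimal sketch of the IQR computation, and of its resistance, on hypothetical data. It assumes the "inclusive" interpolation rule of Python's `statistics.quantiles`; quartile conventions differ between textbooks and software, so other tools may give slightly different values.

```python
import statistics

# Hypothetical data, listed in sorted order for readability.
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

# n=4 asks for the three cut points Q1, Q2, Q3.
# method="inclusive" is one of several quartile conventions (an assumption here).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
data_range = max(data) - min(data)

# Replace the largest value with an extreme one and recompute both measures.
extreme = data[:-1] + [150]
e_q1, e_q2, e_q3 = statistics.quantiles(extreme, n=4, method="inclusive")
extreme_iqr = e_q3 - e_q1
extreme_range = max(extreme) - min(extreme)

print("IQR:  ", iqr, "->", extreme_iqr)
print("range:", data_range, "->", extreme_range)
```

The range grows with the extreme value, while the IQR is unchanged because the middle half of the data is the same in both lists.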

The Inter-Quartile Range (IQR)
Figure: $Q_0$ through $Q_4$ marked along a Height (in.) axis, with the Range and the IQR indicated

The Five-Number Summary
The quartiles are very natural to report together to describe the center and spread of a distribution. $Q_0$ through $Q_4$ collectively form the five-number summary of a quantitative distribution.
$$\text{Five-Number Summary} = (x_{\min}, Q_1, \text{Median}, Q_3, x_{\max}) = (Q_0, Q_1, Q_2, Q_3, Q_4)$$
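
The five-number summary can be sketched as a small helper. The function name `five_number_summary` and the height values are hypothetical, and the quartile rule (`method="inclusive"`) is again an assumption, since conventions vary.

```python
import statistics


def five_number_summary(values):
    """Return (Q0, Q1, Q2, Q3, Q4) = (min, Q1, median, Q3, max).

    Quartiles come from statistics.quantiles with method="inclusive",
    which is one convention among several (an assumption here).
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


# Hypothetical heights in inches.
heights = [22, 28, 31, 33, 36, 38, 41, 44, 49]
summary = five_number_summary(heights)
print(summary)
```

The range and the IQR can both be read off the result: the range is the last entry minus the first, and the IQR is the fourth entry minus the second.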

Box-and-Whisker Plots
From the five-number summary, we construct a graph called a box-and-whisker plot (or just box plot, for short).
1. Draw an axis.
2. Draw a rectangle (box) from $Q_1$ to $Q_3$.
3. Draw a line across the box (or place a dot) at $Q_2$.
4. Draw lines (whiskers) extending outward from the box on both sides to either
   (a) (Simplest version) $x_{\min}$ and $x_{\max}$, or
   (b) (R default) $Q_1 - 1.5\,\mathrm{IQR}$ and $Q_3 + 1.5\,\mathrm{IQR}$.
5. In version (b), plot points beyond the whiskers individually.
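
Version (b) can be sketched numerically without drawing anything. The data are hypothetical and the quartile rule is an assumption. The sketch also assumes that each whisker stops at the most extreme data value still inside the 1.5 IQR limits, which is, to my understanding, how R draws them in practice; the limits themselves are called fences here, a name not used on the slide.

```python
import statistics

# Hypothetical data with one unusually large value.
data = [2, 4, 4, 5, 7, 9, 10, 12, 30]

q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the ones plotted individually.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers stop at the most extreme values still inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("fences:", lower_fence, upper_fence)
print("whiskers:", whisker_low, whisker_high)
print("plotted individually:", outliers)
```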

Box-and-Whisker Plot: Version 1
Figure: $Q_0$ through $Q_4$, the Range, and the IQR shown above a box-and-whisker plot on the same axis

Box-and-Whisker Plot: Version 2
Figure: $Q_0$ through $Q_4$, the Range, and the IQR shown above a box-and-whisker plot on the same axis

Box-and-Whisker Plot: Right Skew
Figure: Density and box-and-whisker plot of 2001 Household Income (Thousands of 2016$)

Matching Graphs to Variables
Handout

Deviations
Rather than simply measuring the distance between extremes, we can develop measures based on distance from center.
Deviation Scores: For each data point, its deviation score is its signed distance from the mean.
$$\text{Deviation}_i = x_i - \bar{x}, \quad \text{for each } i = 1, \dots, n$$

Deviations
Figure: Height (in.) data with the mean marked at 36.76 and two example deviations labeled 12.76 and 6.24
How can we use these for an overall measure of spread?
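
One reason the raw deviations cannot simply be averaged is that they always sum to zero, since the positive and negative deviations cancel. A short sketch on hypothetical heights (not the data in the figure), chosen so that the mean is a whole number:

```python
# Hypothetical heights in inches, chosen so the mean is a whole number.
heights = [30, 34, 36, 40, 45]

mean_height = sum(heights) / len(heights)

# Signed deviation of each observation from the mean: x_i - x_bar.
deviations = [x - mean_height for x in heights]

print("mean:", mean_height)
print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
```

Squaring the deviations before averaging removes the cancellation.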

Variance
If we square all the deviations from the mean and average them, we get the variance.
Variance: The variance, written $s^2$, is the average of the squared deviations from the mean. That is,
$$s^2 = \frac{\sum_{i=1}^{n} \text{Deviation}_i^2}{n - 1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
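
The variance formula can be traced step by step on a small hypothetical data set. The sketch computes $s^2$ by hand with the $n - 1$ denominator and compares it with `statistics.variance` from the Python standard library, which uses the same denominator.

```python
import statistics

heights = [30, 34, 36, 40, 45]  # hypothetical heights in inches
n = len(heights)
mean_height = sum(heights) / n

# Square each deviation from the mean, then divide the total by n - 1.
squared_deviations = [(x - mean_height) ** 2 for x in heights]
sample_variance = sum(squared_deviations) / (n - 1)

# statistics.variance also divides by n - 1.
library_variance = statistics.variance(heights)

print("squared deviations:", squared_deviations)
print("by hand:", sample_variance, " library:", library_variance)
```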

What's with that denominator?
With an average, you're supposed to divide by the number of things, aren't you? Why $n - 1$? Usually we are working with a sample, and are interested in estimating the population variability. We get no information about variability from the first observation, so there are only $n - 1$ degrees of freedom in the sample.
Interesting math side fact: the variance equals half the average squared distance between all distinct pairs of data points.
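
The pairwise-distance fact can be checked numerically. Stated exactly, the sample variance with the $n - 1$ denominator is one half of the average squared difference over all distinct pairs of observations. The data below are hypothetical.

```python
import itertools
import statistics

heights = [30, 34, 36, 40, 45]  # hypothetical heights in inches

# Squared difference for every distinct pair (x_i, x_j) with i < j.
pair_sq = [(a - b) ** 2 for a, b in itertools.combinations(heights, 2)]
mean_pair_sq = sum(pair_sq) / len(pair_sq)

sample_variance = statistics.variance(heights)

print("average squared pairwise distance:", mean_pair_sq)
print("sample variance s^2:", sample_variance)
```

The first number is twice the second, and no mean appears anywhere in the pairwise calculation.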

Standard Deviation
Variance ($s^2$) is in squared units relative to the data. No problem: just take the square root.
Standard Deviation: $s = \sqrt{s^2}$ is the standard deviation.
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} \text{Deviation}_i^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Same range, different s
Figure: Fund C1 (s = 18.2) and Fund C2 (s = 8.1)
The standard deviation uses all the data.
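
A sketch of the same idea on made-up numbers: two hypothetical sets of returns share the same minimum and maximum, and therefore the same range, but differ in how the values in between are spread. These are not the Fund C1 and C2 data behind the figure.

```python
import statistics

# Hypothetical annual returns (%), not the data behind the figure.
spread_out = [-20, -15, -10, 0, 10, 20, 25, 30]
clustered = [-20, 3, 4, 5, 5, 6, 7, 30]

range_spread = max(spread_out) - min(spread_out)
range_clustered = max(clustered) - min(clustered)

# statistics.stdev is the square root of the n - 1 variance.
sd_spread = statistics.stdev(spread_out)
sd_clustered = statistics.stdev(clustered)

print("ranges:", range_spread, range_clustered)
print("standard deviations:", round(sd_spread, 1), round(sd_clustered, 1))
```

The range sees only the two extremes, so it cannot tell the two sets apart; the standard deviation is smaller for the set whose interior values cluster near the mean.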

Outliers
Skewness can be an important feature of a distribution. But sometimes a few unusual data points make an otherwise well-behaved distribution look skewed or multimodal. When such points are not part of the overall pattern, they are called outliers.
- Sometimes they reflect measurement errors (e.g., a misplaced decimal)
- Sometimes they represent genuinely unusual observations

On-Base Percentage
A common statistic for batters in baseball is On-Base Percentage.
Figure: Distribution of major-league hitters with at least 100 Plate Appearances in 2002. (Density of On-Base Percentage; Skewness = 0.630; Barry Bonds labeled.)

Distribution without Bonds
Figure: Density of On-Base Percentage with Bonds excluded (Skewness = 0.199)

Visualizing Outliers
Figure: Two panels, each with On-Base Percentage on the axis

Problems with s and s^2
These measures, even more than the mean itself, are heavily influenced by extreme values.
Figure: Density of 2001 Household Income (Thousands of 2016$)

Problems with s and s^2
Figure: Density of 2001 Household Income (Thousands of 2016$), in three panels: full data, top 0.1% excluded, and top 1% excluded

Problems with s and s^2
Figure: Density of 2001 Household Income (Thousands of 2016$) with the axis restricted to 0 to 300, in three panels: full data, top 0.1% excluded, and top 1% excluded

Variance-Stabilizing Transformations
The mean and standard deviation are unstable in the presence of skew. However, they have such useful properties otherwise that it is often better to try to remove skew, rather than fall back on other measures. The most common way to remove skew is by a nonlinear transformation of the underlying scale. Take the original variable, $x$, and define a new variable $y = f(x)$, where $f$ is a one-to-one function.
Most common case: right-skewed data with positive values
- Logarithmic transform (take $y = \log(x)$)
- Square root (take $y = \sqrt{x}$)
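
A sketch of how a logarithmic transform can remove right skew. The incomes are hypothetical, built so that each value is double the previous one, which makes their logarithms evenly spaced. The `skewness` helper is a moment-based coefficient chosen for illustration; it is an assumption, since the slides do not say how their skewness values were computed.

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment / (second central moment)^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical incomes (thousands of dollars), each double the previous one.
incomes = [10, 20, 40, 80, 160, 320, 640]
log_incomes = [math.log10(x) for x in incomes]

skew_raw = skewness(incomes)
skew_log = skewness(log_incomes)

print("skewness, original scale:", round(skew_raw, 2))
print("skewness, log scale:     ", round(skew_log, 2))
```

On the original scale the few large values produce clear positive skew; on the log scale the values are symmetric about their mean, so the coefficient is zero up to rounding error. Real data will rarely be this tidy.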

Variance-Stabilizing Transformations
Original vs. Logarithmic Income Distribution:
Figure: Density of 2001 Household Income on the original scale (Thousands of 2016$) and on a logarithmic scale (2016$, axis from 10^2 to 10^6)

Summary: Quantitative Data
Visualizing a quantitative variable
- Dot Plots
- Box-and-Whisker Plots
- Histograms
- Density curves
Describing the distribution of a numeric variable
- Shape (symmetry, skew, modes)
- Center (mean, median)
- Spread (IQR, standard deviation)
- Outliers (if any)

Summary: Shape and Center
- A distribution is skewed when the extreme values on one end are more extreme than on the other
  - We say that it is skewed in the direction of the more extreme values (e.g., right-skewed if there are a few very large values)
- The mean is the balance point of the data, written $\bar{x}$
  - The mean has nice math properties, but is affected by skew
- The median divides the cases in half
  - It is resistant to outliers/skewness

Variability Summary
- The range is unstable for a sample, and is extremely vulnerable to outliers/skew
- The Interquartile Range (IQR) is the range of the middle half of the data, and is resistant (like the median)
- The variance is the average of the squared deviations of the observations from the mean
- The standard deviation is the square root of the variance, which restores the units to the original scale
- Nonlinear transformations (log, square root, etc.) can be used when appropriate to reduce skew and stabilize variance