STAT 113 Variability

Colin Reimer Dawson, Oberlin College, September 14, 2017

Outline
Last Time: Shape and Center
Variability
Boxplots and the IQR
Variance and Standard Deviation
Transformations

Distribution of a Quantitative Variable
The distribution of a quantitative variable is characterized by:
A. Shape (symmetric, skewed, bimodal, etc.)
B. Center (mean, median)
C. Spread (Interquartile Range, Standard Deviation)
D. Outliers (if any)

Skewness
A distribution is skewed when the extreme values on one side are more extreme than those on the other. We call a distribution right-skewed when the longer tail is on the right, and left-skewed when the longer tail is on the left.

Resistance/Robustness
The mean is strongly affected by skew and by outliers: it is pulled toward the extreme values. In these cases, we generally prefer a measure of central tendency that is resistant to the influence of extreme values (also called robust). The median is a resistant (robust) measure of center.
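To make the contrast concrete, here is a minimal sketch using Python's standard library. The data values are invented for illustration and do not come from the lecture.

```python
from statistics import mean, median

# Hypothetical data: five typical values, then the same values plus one extreme value.
typical = [30, 32, 35, 37, 41]
with_outlier = typical + [400]

# How far does each measure of center move when the extreme value is added?
mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)

print(f"mean:   {mean(typical)} -> {mean(with_outlier):.2f}")
print(f"median: {median(typical)} -> {median(with_outlier)}")
```

In this made-up example the single extreme value drags the mean far above every typical observation, while the median moves by only one unit.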

Measures of Variability
We want to quantify the consistency, or lack thereof, of the data. A general term for lack of consistency is variability. We will look at:
Range
Interquartile Range
Variance / Standard Deviation

The Range
The range is easy to compute, but not very reliable.
[Figure: Historical Annual Returns for Two Hypothetical Index Funds. Panels: Fund C1, Fund C2.]

The Range
The range is easy to compute, but not very reliable.
[Figure: Annual Returns for 3 random samples of 5 years. Panels: Fund E (Full Data Set), Fund Sample 1, Fund Sample 2, Fund Sample 3.]
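The instability of the range under sampling can be sketched with a small simulation. Everything here is an assumption for illustration: the "full data set" is 50 simulated returns (not the Fund E data), and the seed is arbitrary.

```python
import random

def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)

random.seed(113)  # arbitrary seed so the sketch is repeatable

# Hypothetical full data set: 50 simulated annual returns (percent).
full_data = [random.gauss(5, 6) for _ in range(50)]
full_range = value_range(full_data)

# Three random samples of 5 years each, as in the figure's setup.
sample_ranges = [value_range(random.sample(full_data, 5)) for _ in range(3)]

print("full-data range:", round(full_range, 1))
print("sample ranges:  ", [round(r, 1) for r in sample_ranges])
```

A sample's range can never exceed the range of the data it was drawn from, and it varies from sample to sample depending on whether an extreme year happens to be included.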

Robust Measures of Variability
We'd like a more robust measure of variability, one that is not affected so much by extreme values. Analogous to the median: describe the middle part of the data. The idea: find the middle half of the data, and then take its range. Specifically, exclude the lowest 25% and the highest 25%, and take the difference between the highest and lowest remaining values.

Quartiles
The median divides the data in two. Percentiles divide the data into 100 pieces. Quartiles divide the data into four pieces. The $k$th quartile (written $Q_k$) is the point below which $k$ quarters of the data lie. So, in terms of quartiles, the median is $Q_2$, the minimum value is $Q_0$, and the maximum value is $Q_4$. We can calculate the range using quartiles as $Q_4 - Q_0$.

Quartiles
[Figure: dot plot of Height (in.) with $Q_0$, $Q_1$, $Q_2$, $Q_3$, and $Q_4$ marked.]

The Inter-Quartile Range (IQR)
The Inter-Quartile Range (or IQR) is the distance between the first and third quartiles: $\mathrm{IQR} = Q_3 - Q_1$.
Pedantic note: The IQR is a single number, not the two quartiles themselves.

The Inter-Quartile Range (IQR)
[Figure: dot plot of Height (in.) with $Q_0$ through $Q_4$ marked, showing the range and the IQR.]

The Five-Number Summary
The quartiles are very natural to report together to describe the center and spread of a distribution. $Q_0$ through $Q_4$ collectively form the five-number summary of a quantitative distribution.
Five-Number Summary $= (x_{\min}, Q_1, \text{Median}, Q_3, x_{\max}) = (Q_0, Q_1, Q_2, Q_3, Q_4)$
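As a rough sketch, the five-number summary, the IQR, and the range can be computed as follows. The heights are invented, and the quartile convention (linear interpolation, `method="inclusive"` in Python's `statistics.quantiles`) is an assumption: other software may use a slightly different rule and give slightly different quartiles.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (Q0, Q1, Q2, Q3, Q4) using linearly interpolated quartiles."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# Hypothetical heights in inches (not the lecture's data).
heights = [24, 30, 33, 36, 38, 40, 43, 46, 50]

q0, q1, q2, q3, q4 = five_number_summary(heights)
iqr = q3 - q1          # range of the middle half of the data
data_range = q4 - q0   # range expressed in terms of quartiles

print("five-number summary:", (q0, q1, q2, q3, q4))
print("IQR:", iqr, " range:", data_range)
```

Note that the IQR comes out as a single number, consistent with the pedantic note above.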

Box-and-Whisker Plots
From the five-number summary, we construct a graph called a box-and-whisker plot (or just box plot, for short):
1. Draw an axis.
2. Draw a rectangle (box) from $Q_1$ to $Q_3$.
3. Draw a line across the box (or place a dot) at $Q_2$.
4. Draw lines (whiskers) extending outward from the box on both sides to either
   (a) (Simplest version) $x_{\min}$ and $x_{\max}$, or
   (b) (R default) the most extreme data points no farther out than $Q_1 - 1.5\,\mathrm{IQR}$ and $Q_3 + 1.5\,\mathrm{IQR}$.
5. In version (b), plot points beyond the whiskers individually.
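The whisker rule in version (b) can be sketched in a few lines. This is only an illustration under stated assumptions: the incomes are made up, the function name is hypothetical, and the quartiles use Python's `method="inclusive"` interpolation, which may differ slightly from the quartile rule a given plotting package uses.

```python
from statistics import quantiles

def whiskers_and_outliers(values):
    """Return (lower whisker end, upper whisker end, points plotted individually)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points that lie within the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return min(inside), max(inside), outliers

# Hypothetical household incomes in thousands of dollars.
incomes = [22, 35, 41, 48, 52, 60, 67, 75, 400]

low, high, outliers = whiskers_and_outliers(incomes)
print("whiskers from", low, "to", high, "; plotted individually:", outliers)
```

Here the quartiles are 41 and 67, so the fences sit at 2 and 106, and only the value 400 falls beyond them.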

Box-and-Whisker Plot: Version 1
[Figure: dot plot with $Q_0$ through $Q_4$, the range, and the IQR marked, drawn above the corresponding box-and-whisker plot.]

Box-and-Whisker Plot: Version 2
[Figure: dot plot with $Q_0$ through $Q_4$, the range, and the IQR marked, drawn above the corresponding box-and-whisker plot.]

Box-and-Whisker Plot: Right Skew
[Figure: density plot and box-and-whisker plot of 2001 Household Income (Thousands of 2016$).]

Matching Graphs to Variables
Handout

Deviations
Rather than simply measuring the distance between extremes, we can develop measures based on distance from the center.
Deviation scores: For each data point, its deviation score is its distance from the mean.
$\text{Deviation}_i = x_i - \bar{x}$, for each $i = 1, \dots, n$

Deviations
[Figure: dot plot of Height (in.) with the mean (36.76) marked and two example deviations labeled 6.24 and 12.76.]
How can we use these for an overall measure of spread?
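One way to see why the raw deviations cannot simply be averaged: they always cancel out. The sketch below uses invented heights chosen so the mean is a whole number; it is an illustration, not the lecture's data.

```python
from statistics import mean

# Hypothetical heights in inches, chosen so that the mean is exactly 37.
heights = [24, 31, 36, 40, 43, 48]

x_bar = mean(heights)
deviations = [x - x_bar for x in heights]   # signed: negative below the mean
squared = [d ** 2 for d in deviations]      # squaring removes the sign

print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum(squared))
```

Because positive and negative deviations sum to zero, an overall measure of spread needs to remove the sign first, for example by squaring.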

Variance
If we square all the deviations from the mean and average them, we get the variance.
The variance, written $s^2$, is the average of the squared deviations from the mean. That is,
$$s^2 = \frac{\sum_{i=1}^{n} \text{Deviation}_i^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
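The formula translates almost directly into code. A minimal sketch, with invented data and a hypothetical function name, checked against the standard library's `statistics.variance` (which also divides by $n - 1$):

```python
from statistics import mean, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

# Hypothetical heights in inches.
heights = [24, 31, 36, 40, 43, 48]

s2 = sample_variance(heights)
print("by hand:", s2)
print("library:", variance(heights))
```

For these six values the squared deviations sum to 372, so the variance is 372 / 5 = 74.4 square inches.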

What's with that denominator?
With an average, you're supposed to divide by the number of things, aren't you? Why $n - 1$?
Usually we are working with a sample, and are interested in estimating the population variability. We get no information about variability from the first observation, so there are only $n - 1$ degrees of freedom in the sample.
Interesting math side fact: the variance equals half the average squared distance between all distinct pairs of data points.
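The pairwise side fact can be checked numerically. The sketch below uses invented data and a hypothetical function name; it compares the library's sample variance with the average squared difference over all distinct pairs, which matches once a factor of one half is included.

```python
from itertools import combinations
from statistics import variance

def half_mean_squared_pair_distance(values):
    """Half the average of (a - b)^2 over all distinct pairs of data points."""
    pairs = list(combinations(values, 2))
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))

# Hypothetical heights in inches.
heights = [24, 31, 36, 40, 43, 48]

pairwise = half_mean_squared_pair_distance(heights)
print("pairwise version:", pairwise)
print("sample variance: ", variance(heights))
```

No mean appears in the pairwise version, yet the $n - 1$ denominator of the sample variance is exactly what makes the two agree.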

Standard Deviation
Variance ($s^2$) is in squared units relative to the data. No problem: just take the square root.
$s = \sqrt{s^2}$ is the standard deviation:
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} \text{Deviation}_i^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$
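The point about units can be illustrated by changing the measurement scale. In this sketch (invented data; the inches-to-centimeters conversion is used only as an example), rescaling the data by 2.54 rescales the standard deviation by 2.54 but the variance by 2.54 squared.

```python
from math import sqrt
from statistics import stdev, variance

# Hypothetical heights in inches, and the same heights in centimeters.
heights_in = [24, 31, 36, 40, 43, 48]
heights_cm = [2.54 * h for h in heights_in]

s_in, s_cm = stdev(heights_in), stdev(heights_cm)
var_in, var_cm = variance(heights_in), variance(heights_cm)

print("s ratio:       ", s_cm / s_in)      # about 2.54 (original units)
print("variance ratio:", var_cm / var_in)  # about 2.54 ** 2 (squared units)
print("sqrt(variance) equals s:", abs(sqrt(var_in) - s_in) < 1e-12)
```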

Same range, different s
[Figure: Historical annual returns for Fund C1 ($s = 18.2$) and Fund C2 ($s = 8.1$).]
The standard deviation uses all the data.

Outliers
Skewness can be an important feature of a distribution. But sometimes a few unusual data points make an otherwise well-behaved distribution look skewed or multimodal. When not part of the overall pattern, these are called outliers.
Sometimes they reflect measurement errors (e.g., a misplaced decimal).
Sometimes they represent genuinely unusual observations.

On-Base Percentage
A common statistic for batters in baseball is On-Base Percentage.
[Figure: Distribution of major-league hitters with at least 100 Plate Appearances in 2002. Density of On Base Percentage, with Barry Bonds labeled. Skewness = 0.630.]

Distribution without Bonds
[Figure: Density of On Base Percentage with Bonds excluded. Skewness = 0.199.]

Visualizing Outliers
[Figure: two plots of On Base Percentage.]

Problems with $s$ and $s^2$
These measures, even more than the mean itself, are heavily influenced by extreme values.
[Figure: density of 2001 Household Income (Thousands of 2016$).]
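A small sketch of this sensitivity, comparing the standard deviation with the IQR when one value is replaced by an extreme one. The incomes are invented, the `iqr` helper is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption.

```python
from statistics import quantiles, stdev

def iqr(values):
    """Interquartile range using linearly interpolated quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical household incomes in thousands of dollars.
incomes = [22, 35, 41, 48, 52, 60, 67, 75, 90]
# Same data, but the largest income is replaced by an extreme value.
with_extreme = incomes[:-1] + [2000]

print("s:  ", round(stdev(incomes), 1), "->", round(stdev(with_extreme), 1))
print("IQR:", iqr(incomes), "->", iqr(with_extreme))
```

In this made-up example the standard deviation grows by more than an order of magnitude, while the IQR does not change at all, because the middle half of the data is untouched.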

Problems with $s$ and $s^2$
[Figure: three density plots on an axis from 0 to 2000. Panels: 2001 Household Income (Thousands of 2016$); Income with Top 0.1% Excluded; Income with Top 1% Excluded.]

Problems with $s$ and $s^2$
[Figure: the same three density plots on an axis from 0 to 300. Panels: 2001 Household Income (Thousands of 2016$); Income with Top 0.1% Excluded; Income with Top 1% Excluded.]

Variance-Stabilizing Transformations
The mean and standard deviation are unstable in the presence of skew. However, they have such useful properties otherwise that it is often better to try to remove the skew than to fall back on other measures.
The most common way to remove skew is a nonlinear transformation of the underlying scale: take the original variable, $x$, and define a new variable $y = f(x)$, where $f$ is a one-to-one function.
Most common case: right-skewed data with positive values.
Logarithmic transform: take $y = \log(x)$
Square root: take $y = \sqrt{x}$
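A rough sketch of the effect of a log transform on right-skewed, positive data. The incomes are invented, and the skewness measure used here (the third standardized moment) is an assumption; the slides do not say which skewness formula they report.

```python
from math import log10
from statistics import mean

def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (an assumed definition)."""
    x_bar = mean(values)
    m2 = mean((x - x_bar) ** 2 for x in values)
    m3 = mean((x - x_bar) ** 3 for x in values)
    return m3 / m2 ** 1.5

# Hypothetical right-skewed incomes in thousands of dollars (all positive).
incomes = [20, 30, 40, 50, 60, 80, 120, 250, 1000]
logged = [log10(x) for x in incomes]   # y = log(x), base 10 chosen for readability

print("skewness before:", round(skewness(incomes), 2))
print("skewness after: ", round(skewness(logged), 2))
```

Because the logarithm is one-to-one and increasing, the ordering of the observations is preserved; only the spacing changes, pulling in the long right tail.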

Variance-Stabilizing Transformations
Original vs. Logarithmic Income Distribution:
[Figure: two density plots. Panels: 2001 Household Income (Thousands of 2016$); 2001 Household Income (2016$) on a logarithmic axis from $10^2$ to $10^6$.]

Summary: Quantitative Data
Visualizing a quantitative variable:
Dot Plots
Box-and-Whisker Plots
Histograms
Density curves
Describing the distribution of a numeric variable:
Shape (symmetry, skew, modes)
Center (mean, median)
Spread (IQR, standard deviation)
Outliers (if any)

Summary: Shape and Center
A distribution is skewed when the extreme values on one end are more extreme than on the other.
We say that it is skewed in the direction of the more extreme values (e.g., right-skewed if there are a few very large values).
The mean is the balance point of the data, written $\bar{x}$.
The mean has nice math properties, but is affected by skew.
The median divides the cases in half.
It is resistant to outliers and skewness.

Variability Summary
The range is unstable for a sample, and is extremely vulnerable to outliers and skew.
The Interquartile Range (IQR) is the range of the middle half of the data, and is resistant (like the median).
The variance is the average of the squared deviations of the observations from the mean.
The standard deviation is the square root of the variance, which restores the units to the original scale.
Nonlinear transformations (log, square root, etc.) can be used when appropriate to reduce skew and stabilize variance.