Standardized Data Percentiles, Quartiles and Box Plots Grouped Data Skewness and Kurtosis


Descriptive Statistics (Part 2), Chapter 4
Percentiles, Quartiles and Box Plots; Grouped Data; Skewness and Kurtosis
McGraw-Hill/Irwin. Copyright 2009 by The McGraw-Hill Companies, Inc.

Chebyshev's Theorem
Developed by mathematicians Jules Bienaymé (1796-1878) and Pafnuty Chebyshev (1821-1894).
For any population with mean μ and standard deviation σ, the percentage of observations that lie within k standard deviations of the mean must be at least 100[1 − 1/k²], for k > 1.

Chebyshev's Theorem
For k = 2 standard deviations, 100[1 − 1/2²] = 75%, so at least 75.0% will lie within μ ± 2σ.
For k = 3 standard deviations, 100[1 − 1/3²] = 88.9%, so at least 88.9% will lie within μ ± 3σ.
Although applicable to any data set, these limits tend to be too wide to be useful.

The Empirical Rule
The normal or Gaussian distribution was named for Karl Gauss (1777-1855). The normal distribution is symmetric and is also known as the bell-shaped curve.
The Empirical Rule states that for data from a normal distribution, we expect that for
k = 1, about 68.26% will lie within μ ± 1σ
k = 2, about 95.44% will lie within μ ± 2σ
k = 3, about 99.73% will lie within μ ± 3σ
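To see how loose the Chebyshev bound is compared with the Empirical Rule, the two can be computed side by side. This is a minimal sketch; the function names `chebyshev_pct` and `normal_pct` are illustrative and not part of the slides, and the normal percentages are computed from the error function rather than read from a z table.

```python
import math

def chebyshev_pct(k):
    """Minimum percentage within k standard deviations, for any distribution (k > 1)."""
    return 100 * (1 - 1 / k**2)

def normal_pct(k):
    """Percentage of a normal distribution within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))

for k in (2, 3):
    # Chebyshev gives a guaranteed floor; the normal figure is what we expect
    # only if the data really are normally distributed.
    print(k, round(chebyshev_pct(k), 1), round(normal_pct(k), 2))
```

The exact normal percentages round to 68.27 and 95.45 for k = 1 and k = 2. The slide's 68.26 and 95.44 are what you get by doubling four-decimal table areas (0.3413 and 0.4772).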

The Empirical Rule
Distance from the mean is measured in terms of the number of standard deviations. Note: no upper bound is given. Data values outside μ ± 3σ are rare.

Example: Exam Scores
If 80 students take an exam, how many will score within 2 standard deviations of the mean?
Assuming exam scores follow a normal distribution, the Empirical Rule states that about 95.44% will lie within μ ± 2σ, so 95.44% × 80 ≈ 76 students will score within ± 2σ of μ.
How many students will score more than 2 standard deviations from the mean?

Unusual Observations
Unusual observations are those that lie beyond μ ± 2σ. Outliers are observations that lie beyond μ ± 3σ.
For example, the P/E ratio data contains several large data values. Are they unusual or outliers?
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14 14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19 19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26 26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91

The Empirical Rule
If the sample came from a normal distribution, then the Empirical Rule states
x̄ ± 1s = 22.72 ± 1(14.08) = (8.6, 36.8)
x̄ ± 2s = 22.72 ± 2(14.08) = (−5.4, 50.9)
x̄ ± 3s = 22.72 ± 3(14.08) = (−19.5, 65.0)
Are there any unusual values or outliers?
7 8 ... 48 55 68 91
[Figure: number line marked at −19.5, −5.4, 8.6, 22.72, 36.8, 50.9 and 65.0. Values beyond x̄ ± 2s are labeled unusual; values beyond x̄ ± 3s are labeled outliers.]
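The intervals and the unusual/outlier question can be checked directly from the 68 P/E ratios. The sketch below assumes the sample standard deviation (n − 1 divisor), which reproduces the slide's s = 14.08. The helper name `interval` is illustrative.

```python
import statistics

# P/E ratios for the 68 stocks, as listed on the slide.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

mean = statistics.mean(pe)
s = statistics.stdev(pe)  # sample standard deviation

def interval(k):
    """Return (mean - k*s, mean + k*s)."""
    return (mean - k * s, mean + k * s)

lo2, hi2 = interval(2)
lo3, hi3 = interval(3)

# Unusual: beyond 2 standard deviations but not beyond 3.
unusual = [x for x in pe if (x < lo2 or x > hi2) and lo3 <= x <= hi3]
# Outliers: beyond 3 standard deviations.
outliers = [x for x in pe if x < lo3 or x > hi3]

print(round(mean, 2), round(s, 2))
print("unusual:", unusual, "outliers:", outliers)
```

Under these definitions, 55 is unusual and 68 and 91 are outliers. No values fall below the lower limits, since both are negative.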

Defining a Standardized Variable
A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean.
Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$
Standardization formula for a sample: $z_i = \frac{x_i - \bar{x}}{s}$
z_i tells how far away the observation is from the mean. For example, for the P/E data, the first value is x_1 = 7. The associated z value is
$z_1 = \frac{x_1 - \bar{x}}{s} = \frac{7 - 22.72}{14.08} = -1.12$

Defining a Standardized Variable
A negative z value means the observation is below the mean. A positive z value means the observation is above the mean. For x_68 = 91,
$z_{68} = \frac{x_{68} - \bar{x}}{s} = \frac{91 - 22.72}{14.08} = 4.85$
Here are the standardized z values for the P/E data: [table not reproduced in this transcription]
What do you conclude for these three values?
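The sample standardization formula can be applied to every P/E value at once. This is a minimal sketch using the sample mean and sample standard deviation computed from the data rather than the rounded 22.72 and 14.08. The list name `z` is illustrative.

```python
import statistics

# P/E ratios for the 68 stocks, as listed on the slide.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

mean = statistics.mean(pe)
s = statistics.stdev(pe)  # sample standard deviation

# z_i = (x_i - mean) / s for each observation
z = [(x - mean) / s for x in pe]

print(round(z[0], 2))   # z value for the smallest observation, x = 7
print(round(z[-1], 2))  # z value for the largest observation, x = 91
```

A z value above 3 in absolute terms corresponds to the slide's outlier definition, and one between 2 and 3 to an unusual observation.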

Defining a Standardized Variable
MegaStat calculates standardized values and also checks for outliers. In Excel, use =STANDARDIZE(Array, Mean, STDev) to calculate a standardized z value.

Outliers
What do we do with outliers in a data set?
If due to erroneous data, then discard.
An outrageous observation (one completely outside of an expected range) is certainly invalid.
Recognize unusual data points and outliers and their potential impact on your study.
Research books and articles on how to handle outliers.

Estimating Sigma
For a normal distribution, the range of values is about 6σ (from μ − 3σ to μ + 3σ). If you know the range R (high − low), you can estimate the standard deviation as σ = R/6. This is useful for approximating the standard deviation when only R is known. The estimate depends on the assumption of normality.

Percentiles and Quartiles: Percentiles
Percentiles are data that have been divided into 100 groups. For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.
Deciles are data that have been divided into 10 groups.
Quintiles are data that have been divided into 5 groups.
Quartiles are data that have been divided into 4 groups.

Percentiles and Quartiles: Percentiles
Percentiles are used to establish benchmarks for comparison purposes (e.g., health care, manufacturing and banking industries use the 5th, 25th, 50th, 75th and 90th percentiles).
Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios.
Percentiles are used in employee merit evaluation and salary benchmarking.

Percentiles and Quartiles: Quartiles
Quartiles are scale points that divide the sorted data into four groups of approximately equal size.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
The three values that separate the four groups are called Q1, Q2, and Q3, respectively.

Percentiles and Quartiles: Quartiles
The second quartile Q2 is the median, an important indicator of central tendency.
Lower 50% | Q2 | Upper 50%
Q1 and Q3 measure dispersion since the interquartile range Q3 − Q1 measures the degree of spread in the middle 50 percent of data values.
Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%
The first quartile Q1 is the median of the data values below Q2, and the third quartile Q3 is the median of the data values above Q2.
Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%
For the first half of the data, 50% lies above and 50% lies below Q1. For the second half of the data, 50% lies above and 50% lies below Q3.

Percentiles and Quartiles: Quartiles
Depending on n, the quartiles Q1, Q2, and Q3 may be members of the data set or may lie between two of the sorted data values.

Percentiles and Quartiles: Method of Medians
For small data sets, find quartiles using the method of medians:
Step 1. Sort the observations.
Step 2. Find the median Q2.
Step 3. Find the median of the data values that lie below Q2.
Step 4. Find the median of the data values that lie above Q2.
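The four steps above can be sketched as a short function. One assumption is made that the slide does not spell out: the data are split by position, so when n is odd the middle observation is left out of both halves. The function name `quartiles_by_medians` is illustrative.

```python
import statistics

def quartiles_by_medians(data):
    """Return (Q1, Q2, Q3) using the method of medians."""
    xs = sorted(data)                 # Step 1: sort the observations
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    q2 = statistics.median(xs)        # Step 2: the median
    lower = xs[: n // 2]              # values positioned below the median
    upper = xs[(n + 1) // 2 :]        # values positioned above the median
    q1 = statistics.median(lower)     # Step 3
    q3 = statistics.median(upper)     # Step 4
    return q1, q2, q3

# P/E ratios for the 68 stocks, as listed earlier in the chapter.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

print(quartiles_by_medians(pe))
```

With n = 68 each half holds 34 values, so Q1 averages the 17th and 18th sorted values and Q3 averages the 51st and 52nd.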

Percentiles and Quartiles: Excel Quartiles
Use the Excel function =QUARTILE(Array, k) to return the kth quartile. Excel treats quartiles as a special case of percentiles. For example, to calculate Q3:
=QUARTILE(Array, 3)
=PERCENTILE(Array, 0.75)
Excel calculates the quartile positions as:
Position of Q1: 0.25n + 0.75
Position of Q2: 0.50n + 0.50
Position of Q3: 0.75n + 0.25

Percentiles and Quartiles: Example: P/E Ratios and Quartiles
Consider the following P/E ratios for 68 stocks in a portfolio.
7 8 8 10 10 10 10 12 13 13 13 13 13 13 13 14 14 14 15 15 15 15 15 16 16 16 17 18 18 18 18 19 19 19 19 19 20 20 20 21 21 21 22 22 23 23 23 24 25 26 26 26 26 27 29 29 30 31 34 36 37 40 41 45 48 55 68 91
Use quartiles to define benchmarks for stocks that are low-priced (bottom quartile) or high-priced (top quartile).

Percentiles and Quartiles: Example: P/E Ratios and Quartiles
Using Excel's method of interpolation, the quartile positions are:

Quartile   Position Formula             Interpolate Between
Q1         0.25(68) + 0.75 = 17.75      X17 and X18
Q2         0.50(68) + 0.50 = 34.50      X34 and X35
Q3         0.75(68) + 0.25 = 51.25      X51 and X52

The quartiles are:

Quartile      Formula
First (Q1)    Q1 = X17 + 0.75 (X18 − X17) = 14 + 0.75 (14 − 14) = 14
Second (Q2)   Q2 = X34 + 0.50 (X35 − X34) = 19 + 0.50 (19 − 19) = 19
Third (Q3)    Q3 = X51 + 0.25 (X52 − X51) = 26 + 0.25 (26 − 26) = 26
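The position-and-interpolate calculation can be sketched in a few lines. The three position formulas come from the slides. Writing them as the single expression q·n + (1 − q) with q = k/4 is a compact restatement introduced here, and the function names `quartile_position` and `excel_quartile` are illustrative, not Excel's.

```python
def quartile_position(n, k):
    """1-based position of the k-th quartile, e.g. 0.25n + 0.75 for k = 1."""
    q = k / 4
    return q * n + (1 - q)

def excel_quartile(data, k):
    """Interpolate the k-th quartile (k = 1, 2, 3) between neighbouring sorted values."""
    xs = sorted(data)
    n = len(xs)
    pos = quartile_position(n, k)
    lo = int(pos)                     # index of the lower neighbour, X_lo
    frac = pos - lo                   # fractional distance toward X_(lo+1)
    x_lo = xs[lo - 1]                 # convert 1-based position to 0-based index
    x_hi = xs[min(lo, n - 1)]         # guard against running off the end
    return x_lo + frac * (x_hi - x_lo)

# P/E ratios for the 68 stocks, as listed on the slide.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

for k in (1, 2, 3):
    print(k, quartile_position(len(pe), k), excel_quartile(pe, k))
```

Because the neighbouring values are identical at all three positions in this data set, the interpolation fraction has no effect on the result.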

Percentiles and Quartiles: Example: P/E Ratios and Quartiles
So, to summarize:
Lower 25% of P/E Ratios | Q1 = 14 | Second 25% of P/E Ratios | Q2 = 19 | Third 25% of P/E Ratios | Q3 = 26 | Upper 25% of P/E Ratios
These quartiles express central tendency and dispersion. What is the interquartile range?
Because of clustering of identical data values, these quartiles do not provide clean cut points between groups of observations.

Percentiles and Quartiles: Tip
Whether you use the method of medians or Excel, your quartiles will be about the same. Small differences in calculation techniques typically do not lead to different conclusions in business applications.

Percentiles and Quartiles: Caution
Quartiles generally resist outliers. However, quartiles do not provide clean cut points in the sorted data, especially in small samples with repeating data values.
Data set A: 1, 2, 4, 4, 8, 8, 8, 8     Q1 = 3, Q2 = 6, Q3 = 8
Data set B: 0, 3, 3, 6, 6, 6, 10, 15   Q1 = 3, Q2 = 6, Q3 = 8
Although they have identical quartiles, these two data sets are not similar. The quartiles do not represent either data set well.

Box Plots
A useful tool of exploratory data analysis (EDA). Also called a box-and-whisker plot. Based on a five-number summary: Xmin, Q1, Q2, Q3, Xmax.
Consider the five-number summary for the 68 P/E ratios:

Xmin   Q1   Q2   Q3   Xmax
7      14   19   26   91

Box Plots
The box plot is displayed visually, like this. [Figure: box plot of the P/E ratios, not reproduced in this transcription.]
A box plot shows central tendency, dispersion, and shape.

Box Plots: Fences and Unusual Data Values
Use quartiles to detect unusual data points. The cutoff points are called fences and can be found using the following formulas:

               Inner fences            Outer fences
Lower fence    Q1 − 1.5 (Q3 − Q1)      Q1 − 3.0 (Q3 − Q1)
Upper fence    Q3 + 1.5 (Q3 − Q1)      Q3 + 3.0 (Q3 − Q1)

Values outside the inner fences are unusual while those outside the outer fences are outliers.

Box Plots: Fences and Unusual Data Values
For example, consider the P/E ratio data:

               Inner fences                 Outer fences
Lower fence    14 − 1.5 (26 − 14) = −4      14 − 3.0 (26 − 14) = −22
Upper fence    26 + 1.5 (26 − 14) = +44     26 + 3.0 (26 − 14) = +62

Ignore the lower fences since they are negative and P/E ratios are only positive.
Truncate the whiskers at the fences and display unusual values and outliers as dots.
[Figure: box plot with the inner and outer fences marked; unusual values and outliers are shown as dots.]
Based on these fences, there are three unusual P/E values and two outliers.
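The fence arithmetic and the resulting classification can be reproduced as follows. This is a sketch that takes Q1 = 14 and Q3 = 26 as given from the quartile example. The variable names are illustrative.

```python
# P/E ratios for the 68 stocks, as listed earlier in the chapter.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

q1, q3 = 14, 26          # quartiles from the P/E example
iqr = q3 - q1            # interquartile range

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences
outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # outer fences

# Unusual: outside the inner fences but not outside the outer fences.
unusual = [x for x in pe
           if (x < inner[0] or x > inner[1]) and outer[0] <= x <= outer[1]]
# Outliers: outside the outer fences.
outliers = [x for x in pe if x < outer[0] or x > outer[1]]

print(inner, outer)
print("unusual:", unusual, "outliers:", outliers)
```

The fence rule flags 45, 48 and 55 as unusual and 68 and 91 as outliers, which is a wider net than the μ ± 2σ and μ ± 3σ rule cast on the same data.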

Percentiles and Quartiles: Midhinge
The midhinge is the average of the first and third quartiles:
$\text{Midhinge} = \frac{Q_1 + Q_3}{2}$
The name midhinge derives from the idea that, if the box were folded in half, it would resemble a hinge.

Box Plots
[Figure: annotated box plot of right-skewed data showing the whiskers, the box from Q1 to Q3, the minimum, the median (Q2) and the maximum. The center of the box is the midhinge.]

Correlation: Correlation Coefficient
The sample correlation coefficient is a statistic that describes the degree of linearity between paired observations on two quantitative variables X and Y.
Its range is −1 ≤ r ≤ +1.
Excel's formula is =CORREL(Xdata, Ydata).

Correlation: Correlation Coefficient
Illustration of Correlation Coefficients [figure not reproduced in this transcription]
What is the nature of the relationship between square feet of shopping area and sales that is implied by the following correlation? [figure not reproduced in this transcription]
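The slide's formula for r did not survive transcription, so the sketch below uses the standard Pearson definition: the sum of cross-products of deviations divided by the square root of the product of the two sums of squared deviations. The function name `correlation` and the square-feet and sales figures are hypothetical and chosen only for illustration.

```python
import math

def correlation(xs, ys):
    """Sample (Pearson) correlation coefficient for paired observations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: store size (square feet) and sales (thousands of dollars).
square_feet = [1000, 1500, 2000, 2500, 3000]
sales = [210, 290, 405, 480, 620]

r = correlation(square_feet, sales)
print(round(r, 3))
```

A value of r near +1 indicates a strong positive linear relationship, near −1 a strong negative one, and near 0 little linear association.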

Grouped Data: Nature of Grouped Data
Although some information is lost, grouped data are easier to display than raw data. When bin limits are given, the mean and standard deviation can be estimated. Accuracy of grouped estimates depends on
- the number of bins
- distribution of data within bins
- bin frequencies

Grouped Data: Mean and Standard Deviation
Consider the frequency distribution for prices of Lipitor for three cities: [frequency table not reproduced in this transcription]
where m_j = class midpoint, f_j = class frequency, k = number of classes, n = sample size.

Grouped Data: Nature of Grouped Data
Estimate the mean and standard deviation by
$\bar{x} = \sum_{j=1}^{k} \frac{f_j m_j}{n} = \frac{3427.5}{47} = 72.92553$
$s = \sqrt{\sum_{j=1}^{k} \frac{f_j (m_j - \bar{x})^2}{n - 1}} = \sqrt{\frac{2091.48936}{47 - 1}} = 6.74293$
Note: don't round off too soon.
Now estimate the coefficient of variation:
CV = 100 (s / x̄) = 100 (6.74293 / 72.92553) = 9.2%

Grouped Data: Accuracy Issues
How accurate are grouped estimates compared to ungrouped estimates? For the previous example, we can compare the grouped data statistics to the ungrouped data statistics.
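The grouped-data formulas can be sketched as a small function. The Lipitor frequency table is not available in this transcription, so the bins and frequencies below are hypothetical and chosen to make the arithmetic easy to follow. The function name `grouped_mean_sd` is also illustrative.

```python
import math

def grouped_mean_sd(bins, freqs):
    """Estimate the mean and sample standard deviation from grouped data.

    bins  -- list of (lower limit, upper limit) pairs, one per class
    freqs -- class frequencies f_j, in the same order
    """
    mids = [(lo + hi) / 2 for lo, hi in bins]            # class midpoints m_j
    n = sum(freqs)                                       # sample size
    mean = sum(f * m for f, m in zip(freqs, mids)) / n
    ss = sum(f * (m - mean) ** 2 for f, m in zip(freqs, mids))
    sd = math.sqrt(ss / (n - 1))
    return mean, sd

# Hypothetical frequency distribution (not the Lipitor data).
bins = [(5, 15), (15, 25), (25, 35)]
freqs = [1, 2, 1]

mean, sd = grouped_mean_sd(bins, freqs)
cv = 100 * sd / mean                                     # coefficient of variation, in percent
print(mean, round(sd, 5), round(cv, 1))
```

Every observation in a class is treated as if it sat at the class midpoint, which is why the accuracy of the estimate depends on how the data are distributed within the bins.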

Grouped Data: Accuracy Issues
Accuracy tends to improve as the number of bins increases.
If the first or last class is open-ended, there will be no class midpoint (no mean can be estimated).
Assume a lower limit of zero for the first class when the data are nonnegative.
You may be able to assume an upper limit for some variables (e.g., age).
Median and quartiles may be estimated even with open-ended classes.

Skewness and Kurtosis: Skewness
Generally, skewness may be indicated by looking at the sample histogram or by comparing the mean and median. [figure not reproduced in this transcription]
This visual indicator is imprecise and does not take into consideration sample size n.

Skewness and Kurtosis: Skewness
Skewness is a unit-free statistic. The coefficient can be used to compare two samples measured in different units, or one sample with a known reference distribution (e.g., the symmetric normal distribution).
Calculate the sample's skewness coefficient as:
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$
In Excel, go to Tools > Data Analysis > Descriptive Statistics or use the function =SKEW(array).
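The skewness coefficient above translates directly into code. This sketch uses the sample standard deviation for s and applies the formula to the chapter's P/E ratios. The function name `skewness` is illustrative.

```python
import statistics

def skewness(data):
    """Sample skewness coefficient: n/((n-1)(n-2)) * sum(((x - mean)/s)**3)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = statistics.mean(data)
    s = statistics.stdev(data)        # sample standard deviation
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

# P/E ratios for the 68 stocks, as listed earlier in the chapter.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

print(skewness(pe))               # positive: the long tail is on the right
print(skewness([1, 2, 3, 4, 5]))  # symmetric data give a coefficient of zero
```

A positive coefficient for the P/E data is consistent with the few very large ratios seen in the box plot.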

Skewness and Kurtosis: Skewness
Consider the following table showing the 90% range for the sample skewness coefficient. [table not reproduced in this transcription]
Coefficients within the 90% range may be attributed to random variation.

Skewness and Kurtosis: Skewness
(Figure 4.36) Coefficients outside the range suggest the sample came from a nonnormal population.
As n increases, the range of chance variation narrows.

Skewness and Kurtosis: Kurtosis
Kurtosis is the relative length of the tails and the degree of concentration in the center. Consider three kurtosis prototype shapes. [Figure: three prototype shapes, one labeled as having heavier tails; not reproduced in this transcription.]
A histogram is an unreliable guide to kurtosis since scale and axis proportions may differ.
Excel and MINITAB calculate kurtosis as:
$\text{Kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$
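The kurtosis formula above can be sketched the same way as skewness. The subtraction of 3(n − 1)²/((n − 2)(n − 3)) centers the statistic so that values near zero are expected for normal data. The function name `kurtosis` is illustrative.

```python
import statistics

def kurtosis(data):
    """Sample kurtosis coefficient with the small-sample correction term."""
    n = len(data)
    if n < 4:
        raise ValueError("need at least four observations")
    mean = statistics.mean(data)
    s = statistics.stdev(data)        # sample standard deviation
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

# P/E ratios for the 68 stocks, as listed earlier in the chapter.
pe = [7, 8, 8, 10, 10, 10, 10, 12, 13, 13, 13, 13, 13, 13, 13, 14, 14,
      14, 15, 15, 15, 15, 15, 16, 16, 16, 17, 18, 18, 18, 18, 19, 19, 19,
      19, 19, 20, 20, 20, 21, 21, 21, 22, 22, 23, 23, 23, 24, 25, 26, 26,
      26, 26, 27, 29, 29, 30, 31, 34, 36, 37, 40, 41, 45, 48, 55, 68, 91]

print(kurtosis(pe))  # positive: heavier tails than a normal distribution
```

The fourth power makes the statistic very sensitive to extreme values, so the single observation at 91 accounts for much of the P/E result.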

Skewness and Kurtosis: Kurtosis
Consider the following table of the expected 90% range for the sample kurtosis coefficient. [table not reproduced in this transcription]
A sample coefficient within the ranges may be attributed to chance variation.

Skewness and Kurtosis: Kurtosis
Coefficients outside the range would suggest the sample differs from a normal population.
As sample size increases, the chance range narrows.
Inferences about kurtosis are risky for n < 50.

Applied Statistics in Business & Economics. End of Chapter 4B.