Frequency Distribution and Summary Statistics


Dongmei Li
Department of Public Health Sciences, Office of Public Health Studies
University of Hawaiʻi at Mānoa

Outline
1. Stemplot
2. Frequency table
3. Summary statistics

1. Stem-and-leaf plots (stemplots)
Always start by looking at the data with graphs and plots.
Our favorite technique for looking at a single variable is the stemplot.
A stemplot is a graphical technique that organizes data into a histogram-like display.
"You can observe a lot by looking." (Yogi Berra)

Stemplot: Illustrative Example
Select a simple random sample (SRS) of 10 ages.
List the data as an ordered array: 05 11 21 24 27 28 30 42 50 52
Divide each data point into a stem value and a leaf value. In this example the tens place is the stem value and the ones place is the leaf value; e.g., 21 has a stem value of 2 and a leaf value of 1.

Stemplot illustration (cont.)
Draw an axis for the stem values:
0|
1|
2|1
3|
4|
5|
×10  (axis multiplier; important!)
Place the leaves next to their stem values (21 is plotted here).

Stemplot illustration (cont.)
Plot all data points and rearrange the leaves in rank order:
0|5
1|1
2|1478
3|0
4|2
5|02
×10
Here is the plot rotated to horizontal (for demonstration purposes):
    8
    7
    4     2
5 1 1 0 2 0
-----------
0 1 2 3 4 5
-----------
Rotated stemplot
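The stem-and-leaf construction above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `stemplot` and its `leaf_unit` parameter are assumptions made for this example, and the sketch assumes non-negative data.

```python
from collections import defaultdict

def stemplot(values, leaf_unit=1):
    """Return the lines of a stemplot.

    Each value is divided by leaf_unit and rounded; the last digit is the
    leaf and the remaining digits are the stem. Assumes non-negative data.
    """
    scaled = sorted(int(round(v / leaf_unit)) for v in values)
    leaves = defaultdict(list)
    for v in scaled:
        leaves[v // 10].append(v % 10)
    lines = []
    # Walk every stem between the smallest and largest so empty stems show up.
    for stem in range(scaled[0] // 10, scaled[-1] // 10 + 1):
        lines.append(f"{stem}|" + "".join(str(d) for d in leaves[stem]))
    return lines

ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]
print("\n".join(stemplot(ages)))
```

Passing `leaf_unit=0.1` rounds values to one decimal place before splitting, which is how data recorded to two decimals would be handled.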

Interpreting Stemplots
Shape: symmetry; modality (number of peaks); kurtosis (width of tails); departures (outliers)
Location: gravitational center (mean); middle value (median)
Spread: range and interquartile range; standard deviation and variance

Shape
Shape refers to the pattern of the data when plotted. Here's the silhouette of our data:
    X
    X
    X     X
X X X X X X
-----------
0 1 2 3 4 5
-----------
Consider: symmetry, modality, kurtosis

Shape: Idealized Density Curve
A large dataset is introduced. A density curve is superimposed to better discuss shape.

Symmetrical Shapes

Asymmetrical Shapes

Modality (no. of peaks)

Kurtosis (width of tails)
Mesokurtic (medium)
Platykurtic (flat): skinny tails
Leptokurtic (steep): fat tails
Kurtosis is not easily judged by eye.

Stemplot: Second Example
Data: 1.47, 2.06, 2.36, 3.43, 3.74, 3.78, 3.94, 4.42
Stem = ones place; leaves = tenths place.
Round to keep one digit after the decimal point (e.g., 1.47 -> 1.5). Do not plot the decimal point.
1|5
2|14
3|4789
4|4
(×1)
Shape: asymmetric, skewed to the left, unimodal, no outliers

Draw a stemplot using JMP
Open the JMP data set named Stem_and_leaf_plot.jmp
Analyze > Distribution > Data > Stem and Leaf

Third Illustrative Example (n = 26)
Age data set from 26 subjects: {14, 17, 18, 19, 22, 22, 23, 24, 24, 26, 26, 27, 28, 29, 30, 30, 30, 31, 32, 33, 34, 34, 35, 36, 37, 38}
Data set: Stem_and_leaf_plot_example2.jmp
What is the distribution of the age variable?

2. Frequency Table
Frequency = count
Relative frequency = proportion or %
Cumulative frequency = % less than or equal to the level

AGE   | Freq   Rel.Freq   Cum.Freq.
------+----------------------------
  3   |    2     0.3%       0.3%
  4   |    9     1.4%       1.7%
  5   |   28     4.3%       6.0%
  6   |   37     5.7%      11.6%
  7   |   54     8.3%      19.9%
  8   |   85    13.0%      32.9%
  9   |   94    14.4%      47.2%
 10   |   81    12.4%      59.6%
 11   |   90    13.8%      73.4%
 12   |   57     8.7%      82.1%
 13   |   43     6.6%      88.7%
 14   |   25     3.8%      92.5%
 15   |   19     2.9%      95.4%
 16   |   13     2.0%      97.4%
 17   |    8     1.2%      98.6%
 18   |    6     0.9%      99.5%
 19   |    3     0.5%     100.0%
------+----------------------------
Total |  654   100.0%

Frequency Table with Class Intervals
When data are sparse, group them into class intervals.
Create 4 to 12 class intervals.
Classes can be uniform or non-uniform.
End-point convention: e.g., a first class interval of 0 to 10 includes 0 but excludes 10 (0 to 9.99).
Tally the frequencies.
Calculate the relative frequencies.
Calculate the cumulative frequencies.

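Class Intervals
Uniform class intervals table (width 10) for data: 05 11 21 24 27 28 30 42 50 52

Class   | Freq   Relative Freq. (%)   Cumulative Freq. (%)
--------+-------------------------------------------------
 0-9    |   1           10                   10
10-19   |   1           10                   20
20-29   |   4           40                   60
30-39   |   1           10                   70
40-49   |   1           10                   80
50-59   |   2           20                  100
--------+-------------------------------------------------
Total   |  10          100                   --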
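The tally, relative frequency, and cumulative frequency steps can be sketched as follows. The function name `frequency_table` and its parameters are hypothetical, chosen for this example; the sketch assumes uniform class intervals that include the lower end point and exclude the upper one.

```python
def frequency_table(values, width, start=0):
    """Rows of (lower, upper, freq, relative %, cumulative %) for uniform classes.

    Each class includes its lower end point and excludes its upper one.
    """
    n = len(values)
    top = max(values)
    rows = []
    cumulative = 0
    lower = start
    while lower <= top:
        upper = lower + width
        freq = sum(1 for v in values if lower <= v < upper)  # tally
        cumulative += freq
        rows.append((lower, upper, freq, 100 * freq / n, 100 * cumulative / n))
        lower = upper
    return rows

ages = [5, 11, 21, 24, 27, 28, 30, 42, 50, 52]
for lower, upper, freq, rel, cum in frequency_table(ages, width=10):
    print(f"{lower:>2}-{upper - 1:<2} {freq:>3} {rel:>6.1f} {cum:>6.1f}")
```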

Histogram
A histogram is a frequency chart for a quantitative measurement. Notice how the bars touch.
[Figure: histogram of frequency (0 to 5) by age class: 0-9, 10-19, 20-29, 30-39, 40-49, 50-59]

Bar Chart
A bar chart with non-touching bars is reserved for categorical measurements and non-uniform class intervals.
[Figure: bar chart of counts (0 to 500) by school level: Pre-, Elem., Middle, High]

3. Summary Statistics
Central location: mean, median, mode
Spread: range and interquartile range (IQR); variance and standard deviation

Location: Mean
Eye-ball method: visualize where the plot would balance.
Arithmetic method: sum the values and divide by n.
    8
    7
    4     2
5 1 1 0 2 0
-----------
0 1 2 3 4 5
-----------
     ^
  Grav. center
Eye-ball method: around 25 to 30 (takes practice)
Arithmetic method: mean = 290 / 10 = 29

Notation
$n$: sample size
$X$: the variable (e.g., ages of subjects)
$x_i$: the value of individual $i$ for variable $X$
$\sum$: sum all values (capital sigma)
Illustrative data (ages of participants): 21 42 5 11 30 50 28 27 24 52
$n = 10$
$X$ = AGE variable
$x_1 = 21,\ x_2 = 42,\ \ldots,\ x_{10} = 52$
$\sum x_i = x_1 + x_2 + \cdots + x_{10} = 21 + 42 + \cdots + 52 = 290$

Central Location: Sample Mean
Arithmetic average
Traditional measure of central location
Sum the values and divide by n
"xbar" refers to the sample mean:
$\bar{x} = \frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right) = \frac{1}{n}\sum_{i=1}^{n} x_i$

Example: Sample Mean
Ten individuals selected at random have the following ages: 21 42 5 11 30 50 28 27 24 52
Note that $n = 10$, $\sum x_i = 21 + 42 + \cdots + 52 = 290$, and
$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{10}(290) = 29.0$
The sample mean is the gravitational center of a distribution.
[Figure: data plotted on a 0 to 60 axis, balancing at mean = 29]
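As a quick check of the arithmetic, the same calculation can be written in Python. The variable names are chosen for this illustration only.

```python
ages = [21, 42, 5, 11, 30, 50, 28, 27, 24, 52]

n = len(ages)        # sample size
total = sum(ages)    # sum of all x_i
xbar = total / n     # sample mean

print(n, total, xbar)
```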

Uses of the Sample Mean
The sample mean can be used to predict:
The value of an observation drawn at random from the sample
The value of an observation drawn at random from the population
The population mean

Population Mean
$\mu = \frac{\sum x_i}{N} = \frac{1}{N}\sum x_i$
Same operation as the sample mean, except that it is based on the entire population ($N$ = population size)
Conceptually important
Usually not available in practice
Sometimes referred to as the expected value

Central Location: Median
Ordered array: 05 11 21 24 27 28 30 42 50 52
When n is even, the median is the average of the (n/2)th and (n/2 + 1)th values. When n is odd, the median is the ((n + 1)/2)th value.
For the illustrative data, n = 10, so the median falls between the 5th and 6th values, 27 and 28. Averaging the adjacent values gives M = (27 + 28)/2 = 27.5.

More Examples of Medians
Example A: 2 4 6. Median = 4
Example B: 2 4 6 8. Median = 5 (average of 4 and 6)
Example C: 6 2 4. Median ≠ 2 (values must be ordered first; the median is 4)
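The even/odd rule for the median translates directly into code. The following is a minimal sketch; the function name `median` is our own, and Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Median by the slide's rule: sort first, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n + 1) / 2)th value, which is index mid when counting from 0
        return ordered[mid]
    # n even: average of the (n / 2)th and (n / 2 + 1)th values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6]))
print(median([2, 4, 6, 8]))
print(median([6, 2, 4]))
print(median([5, 11, 21, 24, 27, 28, 30, 42, 50, 52]))
```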

The Median is Robust
The median is more resistant to skews and outliers than the mean; it is more robust.
This data set has a mean of 1600: 1362 1439 1460 1614 1666 1792 1867
Here's the same data set with a data entry error "outlier" (the last value). This data set has a mean of 2743: 1362 1439 1460 1614 1666 1792 9867
The median is 1614 in both instances, demonstrating its robustness in the face of outliers.
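The effect of the outlier on the mean and the median can be recomputed with Python's standard `statistics` module. The list names `clean` and `with_error` are labels introduced for this sketch.

```python
from statistics import mean, median

clean = [1362, 1439, 1460, 1614, 1666, 1792, 1867]
with_error = [1362, 1439, 1460, 1614, 1666, 1792, 9867]  # last value mistyped

# The mean shifts substantially; the median does not move.
print(mean(clean), median(clean))
print(round(mean(with_error)), median(with_error))
```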

Mode
The mode is the most commonly encountered value in the dataset.
This data set has a mode of 7: {4, 7, 7, 7, 8, 8, 9}
This data set has no mode: {4, 6, 7, 8} (each point appears only once)
The mode is useful only in large data sets with repeating values.
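One way to find the mode is to count occurrences. In this sketch the function name `modes` is hypothetical, and returning an empty list when no value repeats is a convention adopted here to match the "no mode" case above.

```python
from collections import Counter

def modes(values):
    """Most frequent value(s); an empty list when every value appears once."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

print(modes([4, 7, 7, 7, 8, 8, 9]))
print(modes([4, 6, 7, 8]))
```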

Comparison of Mean, Median, Mode
Note how the mean gets pulled toward the longer tail more than the median does.
mean = median: symmetrical distribution
mean > median: positive skew
mean < median: negative skew

Spread
Two distributions can be quite different yet have the same mean.
These data compare particulate matter in air samples (μg/m^3) at two sites. Both sites have a mean of about 36, but Site 1 exhibits much greater variability. We would miss the high-pollution days if we relied solely on the mean.
Site 1|  |Site 2
----------------
    42|2|
     8|2|
     2|3|234
    86|3|6689
     2|4|0
      |4|
      |5|
      |5|
      |6|
     8|6|
×10

Spread: Range
Range = maximum - minimum
Illustrative example (two-site particulate data):
Site 1 range = 68 - 22 = 46
Site 2 range = 40 - 32 = 8
Beware: the sample range will tend to underestimate the population range.
Always supplement the range with at least one additional measure of spread.

Spread: Quartiles
Quartile 1 (Q1): cuts off the bottom quarter of the data = median of the lower half of the data set
Quartile 3 (Q3): cuts off the top quarter of the data = median of the upper half of the data set
Interquartile range (IQR) = Q3 - Q1; it covers the middle 50% of the distribution
Ordered data: 05 11 21 24 27 | 28 30 42 50 52 (Q1 = 21, median between 27 and 28, Q3 = 42)
Q1 = 21, Q3 = 42, and IQR = 42 - 21 = 21

Quartiles (Tukey's Hinges): Example 2
Data are metabolic rates (cal/day), n = 7: 1362 1439 1460 1614 1666 1792 1867 (the median is 1614)
When n is odd, include the median in both halves of the data set.
Bottom half: 1362 1439 1460 1614, which has a median of 1449.5 (Q1)
Top half: 1614 1666 1792 1867, which has a median of 1729 (Q3)
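The hinge rule, including the convention of putting the median in both halves when n is odd, can be sketched as below. The function names are our own. Note that many software packages use other quartile definitions, so their Q1 and Q3 may differ slightly from these hinges.

```python
def median_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def tukey_hinges(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2  # when n is odd, the median belongs to both halves
    return median_sorted(ordered[:half]), median_sorted(ordered[n - half:])

print(tukey_hinges([5, 11, 21, 24, 27, 28, 30, 42, 50, 52]))
print(tukey_hinges([1362, 1439, 1460, 1614, 1666, 1792, 1867]))
```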

Five-Point Summary
Q0 (the minimum)
Q1 (25th percentile)
Q2 (median)
Q3 (75th percentile)
Q4 (the maximum)

Boxplots
1. Calculate the 5-point summary. Draw a box from Q1 to Q3 with a line at the median.
2. Calculate the IQR and the fences as follows:
   Fence_Lower = Q1 - 1.5(IQR)
   Fence_Upper = Q3 + 1.5(IQR)
   Do not draw the fences.
3. Determine whether any values lie outside the fences (outside values). If so, plot these separately.
4. Determine the most extreme values inside the fences (inside values). Draw a whisker from Q3 to the upper inside value and a whisker from Q1 to the lower inside value.

Illustrative Example: Boxplot
Data: 05 11 21 24 27 28 30 42 50 52
1. 5-point summary: {5, 21, 27.5, 42, 52}; box from 21 to 42 with a line at 27.5
2. IQR = 42 - 21 = 21
   F_U = Q3 + 1.5(IQR) = 42 + (1.5)(21) = 73.5
   F_L = Q1 - 1.5(IQR) = 21 - (1.5)(21) = -10.5
3. No values above the upper fence; no values below the lower fence
4. Upper inside value = 52; lower inside value = 5. Draw the whiskers.
[Figure: boxplot on a 0 to 60 axis; upper inside = 52, Q3 = 42, Q2 = 27.5, Q1 = 21, lower inside = 5]

Illustrative Example: Boxplot 2
Data: 3 21 22 24 25 26 28 29 31 51
1. 5-point summary: 3, 22, 25.5, 29, 51; draw the box
2. IQR = 29 - 22 = 7
   F_U = Q3 + 1.5(IQR) = 29 + (1.5)(7) = 39.5
   F_L = Q1 - 1.5(IQR) = 22 - (1.5)(7) = 11.5
3. One value above the top fence (51); one value below the bottom fence (3)
4. Upper inside value is 31; lower inside value is 21. Draw the whiskers.
[Figure: boxplot on a 0 to 60 axis; outside value (51), inside value (31), upper hinge (29), median (25.5), lower hinge (22), inside value (21), outside value (3)]

Illustrative Example: Boxplot 3
Seven metabolic rates: 1362 1439 1460 1614 1666 1792 1867
1. 5-point summary: 1362, 1449.5, 1614, 1729, 1867
2. IQR = 1729 - 1449.5 = 279.5
   F_U = Q3 + 1.5(IQR) = 1729 + (1.5)(279.5) = 2148.25
   F_L = Q1 - 1.5(IQR) = 1449.5 - (1.5)(279.5) = 1030.25
3. None outside
4. Whiskers end at 1867 and 1362
[Figure: boxplot on a 1300 to 2000 axis, N = 7]
Data source: Moore,
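The four boxplot steps used in these examples can be collected into one function. This is a sketch under the assumptions of the slides (Tukey's hinges for Q1 and Q3, fences at 1.5 times the IQR); the name `boxplot_stats` and the dictionary keys are invented for this illustration.

```python
def median_sorted(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def boxplot_stats(values):
    """5-point summary, fences, whisker ends, and outside values."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2  # the median joins both halves when n is odd
    q1 = median_sorted(ordered[:half])
    q3 = median_sorted(ordered[n - half:])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    outside = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "five_point": (ordered[0], q1, median_sorted(ordered), q3, ordered[-1]),
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),  # lower and upper inside values
        "outside": outside,                   # plotted as separate points
    }

print(boxplot_stats([3, 21, 22, 24, 25, 26, 28, 29, 31, 51]))
```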

Boxplots: Interpretation
Location: position of the median; position of the box
Spread: hinge-spread (IQR); whisker-to-whisker spread; range
Shape: symmetry or direction of skew; long whiskers (tails) indicate leptokurtosis

Side-by-side boxplots
Boxplots are especially useful when comparing groups.

Spread: Standard Deviation
The most common descriptive measure of spread
Based on deviations around the mean
[Figure: a data set with a mean of 36, showing the deviations of two of its values]
The data point 33 has a deviation of 33 - 36 = -3.
The data point 40 has a deviation of 40 - 36 = 4.

Variance and Standard Deviation
Deviation = $x_i - \bar{x}$
Sum of squared deviations: $SS = \sum (x_i - \bar{x})^2$
Sample variance: $s^2 = \frac{SS}{n - 1}$
Sample standard deviation: $s = \sqrt{s^2}$

Standard deviation (formula)
$s = \sqrt{\frac{1}{n - 1}\sum (x_i - \bar{x})^2}$
The sum under the root is the sum of squares (SS).
The sample standard deviation $s$ is the estimator of the population standard deviation $\sigma$.

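Illustrative Example: Standard Deviation

Observation | Deviation        | Squared deviation
x_i         | x_i - xbar       | (x_i - xbar)^2
------------+------------------+------------------
36          | 36 - 36 =  0     | 0^2    =  0
38          | 38 - 36 =  2     | 2^2    =  4
39          | 39 - 36 =  3     | 3^2    =  9
40          | 40 - 36 =  4     | 4^2    = 16
36          | 36 - 36 =  0     | 0^2    =  0
34          | 34 - 36 = -2     | (-2)^2 =  4
33          | 33 - 36 = -3     | (-3)^2 =  9
32          | 32 - 36 = -4     | (-4)^2 = 16
------------+------------------+------------------
SUMS        | 0*               | SS = 58

* The sum of the deviations always equals zero.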

Illustrative Example (cont.)
Sample variance: $s^2 = \frac{SS}{n - 1} = \frac{58}{8 - 1} = 8.286\ (\mu\text{g/m}^3)^2$
Standard deviation: $s = \sqrt{s^2} = \sqrt{8.286} = 2.88\ \mu\text{g/m}^3$
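The deviation table and the variance calculation can be reproduced step by step. The variable names below are chosen for this sketch; the standard library's `statistics.stdev` performs the same n - 1 calculation in one call.

```python
import math

pm = [36, 38, 39, 40, 36, 34, 33, 32]  # the eight observations (μg/m^3)

n = len(pm)
xbar = sum(pm) / n                       # sample mean
deviations = [x - xbar for x in pm]      # x_i - xbar
ss = sum(d ** 2 for d in deviations)     # sum of squared deviations
s2 = ss / (n - 1)                        # sample variance
s = math.sqrt(s2)                        # sample standard deviation

print(xbar, sum(deviations), ss)
print(round(s2, 3), round(s, 2))
```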

Interpretation of Standard Deviation
Measure of spread (e.g., if group 1 has s_1 = 15 and group 2 has s_2 = 10, group 1 has more spread, i.e., variability)
68-95-99.7 rule (next slide)
Chebychev's rule (two slides hence)

68-95-99.7 Rule (Normal Distributions Only!)
68% of data in the range μ ± σ
95% of data in the range μ ± 2σ
99.7% of data in the range μ ± 3σ
Example: Suppose a variable has a Normal distribution with μ = 30 and σ = 10. Then:
68% of values are between 30 ± 10 = 20 to 40
95% are between 30 ± (2)(10) = 30 ± 20 = 10 to 50
99.7% are between 30 ± (3)(10) = 30 ± 30 = 0 to 60
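The three percentages come from the Normal distribution itself: the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)). The sketch below, with function names of our own choosing, computes these probabilities and the matching intervals for the example.

```python
import math

def normal_coverage(k):
    """Probability that a Normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

def interval(mu, sigma, k):
    """End points of mu ± k * sigma."""
    return mu - k * sigma, mu + k * sigma

for k in (1, 2, 3):
    low, high = interval(30, 10, k)
    print(f"within {k} sd: {100 * normal_coverage(k):.1f}% between {low} and {high}")
```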

Chebychev's Rule (All Distributions)
Chebychev's rule says that at least 75% of the values will fall in the range μ ± 2σ (for any shaped distribution).
Example: A distribution with μ = 30 and σ = 10 has at least 75% of the values in the range 30 ± (2)(10) = 10 to 50.
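The 75% figure is the k = 2 case of the general Chebychev bound, which states that at least 1 - 1/k^2 of the values lie within k standard deviations of the mean for any k > 1. The general form is a standard result rather than something stated on the slide, and the function name below is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

mu, sigma, k = 30, 10, 2
print(chebyshev_lower_bound(k), (mu - k * sigma, mu + k * sigma))
```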

Rules for Rounding
Carry at least four significant digits during calculations.
Round at the last step of the operation.
Always report units.
Always use common sense and good judgment.

Choosing Summary Statistics
Always report a measure of central location, a measure of spread, and the sample size.
Symmetrical, mound-shaped distributions: report the mean and standard deviation.
Odd-shaped distributions: report the 5-point summary (or the median and IQR).

Software and Calculators
Use software and calculators to check work.

Excel Data Analysis ToolPak
Data set: Boxplot.xlsx
Summary statistics using Excel

Excel Data Analysis ToolPak
Get summary statistics using Excel

Boxplot using Excel
YouTube link for drawing a boxplot in Excel: http://www.youtube.com/watch?v=s8zw4PVarwE&feature=related
[Figure: side-by-side boxplots of female age and male age on a 0 to 70 axis]

Boxplot and summary statistics by JMP
Data set: Boxplot.jmp
What can you say about how the age distributions of females and males compare?


Exercise
Surgical times. Durations of surgeries (hours) for 15 patients receiving artificial hearts are shown here:
7.0 6.5 3.5 3.1 2.8 2.5 3.8 2.6 2.4 2.1 1.8 2.3 3.1 3.0 2.5
Create a stemplot of these data. Describe the distribution. Are there any outliers?
What is the standard deviation of this data set?
Draw a boxplot based on this data set.
Data sets: Presentation2_Exercise.xlsx, Presentation2_exercise.jmp