Wk 2 Hrs 1 (Tue, Jan 10) Wk 2 - Hr 2 and 3 (Thur, Jan 12)

Descriptive statistics:
- Measures of centrality (mean, median, mode, trimmed mean)
- Measures of spread (MAD, standard deviation, variance)
- Other measures (quantiles, skewness, shape parameters)

Variation: Not everything can be controlled. Results may vary, even in a factory setting. Some bags will get more chips than others; we say there is variation in the weights of the bags. Image source: Failblog.org, Quality Control Fail

There are laws about the proportion of bags sold that can be underweight. A company needs to know the proportion that will be under but can't afford to check every single bag. Instead, they check a sample of bags and hope it represents the population. (Like my survey of 39 students.)

...but those samples are not going to be the same every time. Most of you have done this before during R-R-R-Roll Up The Rim season. They say there's a one in six chance of winning, but did you win on EXACTLY one in six cups? Did you win as much as your friends?

Mine: 0 / 3 = 0% wins
Jason: 3 / 13 = 23% wins
Emelie: 6 / 39 = 15% wins

Each person's Roll Up The Rim season is different. Why? Variability!
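To see this kind of variability for yourself, here is a minimal simulation sketch. It assumes every cup wins independently with probability 1/6 (the advertised odds mentioned above); the function name `simulate_season` and the seed are made up for illustration.

```python
import random

def simulate_season(cups, win_prob=1 / 6, rng=None):
    """Simulate one person's season: return (wins, win rate)."""
    rng = rng or random.Random()
    # Each cup is an independent try that wins with probability win_prob.
    wins = sum(1 for _ in range(cups) if rng.random() < win_prob)
    return wins, wins / cups

# Fixed seed so the run is repeatable; change it to see different seasons.
rng = random.Random(2017)
for name, cups in [("Mine", 3), ("Jason", 13), ("Emelie", 39)]:
    wins, rate = simulate_season(cups, rng=rng)
    print(f"{name}: {wins} / {cups} = {rate:.0%} wins")
```

Rerunning with different seeds gives different win rates even though the underlying chance never changes, and the people with fewer cups tend to swing the most.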

Why should you care? When you're doing a social study or experiment, your results aren't going to be hard set. Image: xkcd.com

If you did the same study tomorrow with similar subjects, you'd get different results. It would help if we had an idea of how big we should expect these differences to be. Image: xkcd.com

That's what measures of spread like the interquartile range (IQR) and the standard deviation are for. They help us measure how uncertain we are about our central values. The IQR is intuitive, works for a wide range of distributions, and has the 1.5 x IQR rule for finding outliers. But it's tied to the median and related measures like the quartiles.

A spread measure based on the mean is the standard deviation. To deviate means to stray from the norm. A standard deviation is the typical amount strayed from the mean.

When the distribution looks kind of like this (symmetric and bell-shaped):
- about ⅔ of the distribution is within 1 sd of the mean
- about 95% is within 2 sd of the mean
- about 99.7% is within 3 sd of the mean

Example: Grade 5 reading scores have a mean of 120 and a standard deviation (sd) of 25.
120 + 1sd = 145
120 - 1sd = 95
So about 2/3 of the grade 5s have a reading score between 95 and 145.

Example: Grade 5 reading scores have a mean of 120 and a standard deviation (sd) of 25.
120 + 2sd = 120 + 2(25) = 170
120 - 2sd = 120 - 2(25) = 70
So about 95% of the grade 5s have a reading score between 70 and 170.
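The two reading-score intervals can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the helper name `sd_interval` is made up for illustration.

```python
def sd_interval(mean, sd, k):
    """Return the interval from k sd below the mean to k sd above it."""
    return mean - k * sd, mean + k * sd

mean, sd = 120, 25  # Grade 5 reading scores from the example
for k, share in [(1, "about 2/3"), (2, "about 95%")]:
    low, high = sd_interval(mean, sd, k)
    print(f"{share} of scores are between {low} and {high}")
```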

Another way to determine outliers when using the mean and standard deviation is the 3 standard deviation rule. Anything more than three standard deviations below or above the mean is an outlier.

With the reading scores (mean 120, sd 25), anything below 120 - 3(25) = 45 or above 120 + 3(25) = 195 is an outlier. Like the mean and standard deviation, this outlier measure is only appropriate for symmetric data.

Quartiles and the Five Number Summary
- The five numbers are the Minimum (Q0), Lower Quartile (Q1), Median (Q2), Upper Quartile (Q3), and Maximum (Q4).
- Q1 means bigger than 1 quarter of the data.
- Q3 means bigger than 3 quarters of the data.
For the values {0, 1, 2, 4, 5, 5, 7, 10, 10, 12, 13, 17, 39}, the five number summary is: 0, 3, 7, 12.5, 39.

There are several ways to compute the quartiles, but here's the one I used. In this data set, {0, 1, 2, 4, 5, 5, 7, 10, 10, 12, 13, 17, 39}, there are 13 numbers, so n = 13. The median is the 7th value. The lower quartile is the 3.5th smallest value (halfway between the 2 and the 4, so 3). The upper quartile is the 3.5th largest value (halfway between the 12 and the 13, so 12.5).
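Here is a minimal sketch of one quartile method that matches the numbers on this slide. It assumes the rule is "take the value at position (n + 1)/4, (n + 1)/2, and 3(n + 1)/4 in the sorted data, interpolating between neighbours when the position is fractional". That rule is inferred from the "3.5th value" wording, not stated outright, and other software may use a different convention. The function names are made up for illustration.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based position; fractional positions interpolate
    linearly between the two neighbouring values."""
    whole = int(position)
    fraction = position - whole
    below = sorted_values[whole - 1]
    if fraction == 0:
        return below
    above = sorted_values[whole]
    return below + fraction * (above - below)

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max). Assumes at least 3 values."""
    data = sorted(values)
    n = len(data)
    q1 = value_at_position(data, (n + 1) / 4)
    q2 = value_at_position(data, (n + 1) / 2)
    q3 = value_at_position(data, 3 * (n + 1) / 4)
    return data[0], q1, q2, q3, data[-1]

data = [0, 1, 2, 4, 5, 5, 7, 10, 10, 12, 13, 17, 39]
print(five_number_summary(data))  # (0, 3.0, 7, 12.5, 39)
```

With n = 13 the positions are 3.5, 7, and 10.5, and the 10.5th smallest value is the same as the 3.5th largest.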

Inter-Quartile Range
The Inter-Quartile Range (literally the range between the quartiles, called the IQR for short) is a measure of spread based on the median rather than the mean. Like the median, it's robust to outliers.

- The Inter-Quartile Range is calculated as IQR = Q3 - Q1.
- The size of the IQR indicates how spread out the middle half of the data is.

Outliers (1.5 x IQR Rule)
1. Now that we have a measure of spread, we can use it to identify values that are much farther from the center than usual.
2. How? Spread measures like the IQR tell us how far a typical value could be from the average, so anything much more than the typical distance can be identified.

- We call these data points outliers. They (figuratively) lie outside the rest of the data.
- Because an outlier stands out from the rest of the data, it
  o might not belong there, or
  o is worthy of extra attention.

- One way to define an outlier is
  o anything below Q1 - 1.5 x IQR or
  o anything above Q3 + 1.5 x IQR.
This is called the 1.5 x IQR rule. (Important.)

- Example: {0, 1, 2, 4, 5, 5, 7, 10, 10, 12, 13, 17, 39}
Q1 = 3, Q3 = 12.5, IQR = 12.5 - 3 = 9.5.
Q1 - 1.5 x IQR = 3 - 1.5(9.5) = 3 - 14.25 = -11.25
Anything less than -11.25 is an outlier. In this case there are no outliers on the low end.

- Example: {0, 1, 2, 4, 5, 5, 7, 10, 10, 12, 13, 17, 39}
Q1 = 3, Q3 = 12.5, IQR = 9.5.
Q3 + 1.5 x IQR = 12.5 + 1.5(9.5) = 12.5 + 14.25 = 26.75
Anything more than 26.75 is an outlier. 39 is the only outlier.

More on IQR and Outliers:
- There are other ways to define outliers, but 1.5 x IQR is one of the most straightforward.
- If our range has a natural restriction (like it can't possibly be negative), it's okay for an outlier limit to be beyond that restriction.
- If a value is more than Q3 + 3 x IQR or less than Q1 - 3 x IQR, it is sometimes called an extreme outlier.
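The outlier limits from the example can be checked with a short sketch. It takes Q1 = 3 and Q3 = 12.5 as given in the example rather than recomputing them, and the helper name `iqr_limits` is made up for illustration. Setting `k=3` gives the limits for extreme outliers.

```python
def iqr_limits(q1, q3, k=1.5):
    """Return (lower limit, upper limit) for the k x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [0, 1, 2, 4, 5, 5, 7, 10, 10, 12, 13, 17, 39]
q1, q3 = 3, 12.5  # quartiles from the example

low, high = iqr_limits(q1, q3)
outliers = [x for x in data if x < low or x > high]
print(low, high, outliers)  # -11.25 26.75 [39]

# Extreme outliers use 3 x IQR instead of 1.5 x IQR.
extreme_low, extreme_high = iqr_limits(q1, q3, k=3)
extreme = [x for x in data if x < extreme_low or x > extreme_high]
print(extreme_low, extreme_high, extreme)  # -25.5 41.0 []
```

So 39 is an outlier under the 1.5 x IQR rule but not an extreme outlier, because it is below the 3 x IQR upper limit of 41.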

- The standard graph for showing the median, quartiles, and outliers of a data set is the boxplot, for {0, 1, 2, 4, 5, 5, 7, 10, 10, 12, 13, 17, 39} it looks like this:

- The five-number summary is in the boxplot:
- The box from 3 to 12.5 is the region between Q1 and Q3.
- The line going through the middle of the box at 7 is the median.

- The lines going out the ends of the box are called the whiskers. They show the range of values that are not outliers.
- The lower whisker goes to the lowest value, 0. The upper whisker goes to 17 because it's the biggest value before the upper limit of 26.75 is hit.

- The individual dot at 39 shows an outlier. - Outliers in SPSS are labelled with their row number so you can find them in data view. - In SPSS extreme outliers are shown as stars. - The farthest outliers on either side are the minimum and maximum. - If there are no outliers on a side, the end of the whisker is that minimum or maximum.
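As a rough sketch of how the whisker ends follow from the 1.5 x IQR limits: each whisker stops at the most extreme value that is still inside the limits. This assumes the same quartiles as the example (Q1 = 3, Q3 = 12.5); the function name `whisker_ends` is made up for illustration, and plotting software such as SPSS may differ in details.

```python
def whisker_ends(values, q1, q3, k=1.5):
    """Return (lower whisker end, upper whisker end): the smallest and
    largest values that are not outliers under the k x IQR rule."""
    iqr = q3 - q1
    low_limit = q1 - k * iqr
    high_limit = q3 + k * iqr
    inside = [x for x in values if low_limit <= x <= high_limit]
    return min(inside), max(inside)

data = [0, 1, 2, 4, 5, 5, 7, 10, 10, 12, 13, 17, 39]
lower, upper = whisker_ends(data, q1=3, q3=12.5)
print(lower, upper)
```

On the low side there are no outliers, so the whisker ends at the minimum; on the high side it ends at the largest value under 26.75.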

Boxplots and Skew - Skewed distributions have more extreme values on one side, so a boxplot of a skewed distribution will have one whisker longer than the other. - There will also be more outliers on one side of the boxplot than the other.

Side-by-side Boxplots - Boxplots can also be used to compare the distributions of two samples. - Example: Heights of adult men and women.

- There is some overlap.
- In general men are taller.
- The variance is about the same.
- Both distributions appear to be symmetric.

This page left obnoxiously blank

What exactly IS an outlier?
- It's a value far from anything else that warrants special consideration aside from the rest of the data.
- Often it's a mistake in data entry. If we were recording a grade of 73%, mistyped, and recorded 3% or 730%, both of these values would be far from the rest of the data and would indicate that the data is not being represented properly.

- If the times to finish a final exam had Q1 at 120 minutes and Q3 at 150 minutes, but someone finished in 62 minutes, that person could be a student with a stronger than recommended background for that course or someone who gave up during the exam.
- In both cases, their exam wouldn't be a good representation of the exams as a whole.
- Sometimes outliers can tell you your assumptions and expectations are wrong.

again...

Finally, there's the variance. The variance is the average squared difference between a value and the mean. The standard deviation is the square root of the variance. We won't be using the variance, but I will be referring to it to explain some concepts in the future.
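A minimal sketch of that definition in Python follows. It divides by n, taking "average squared difference" literally (the population version); many textbooks and software packages divide by n - 1 for sample data instead, and the slides do not say which is intended. The scores are made-up illustrative values, not data from the course.

```python
import math
import statistics

def variance(values):
    """Average squared difference from the mean (divides by n)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def standard_deviation(values):
    """Square root of the variance, so it is back in the original units."""
    return math.sqrt(variance(values))

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only
print(variance(scores))            # 4.0
print(standard_deviation(scores))  # 2.0

# The standard library has both versions:
print(statistics.pvariance(scores))  # divides by n
print(statistics.variance(scores))   # divides by n - 1
```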

The standard deviation is only used for symmetric (or close to symmetric) distributions. When data is skewed, the standard deviation breaks down because the direction of the deviations becomes important.

Example: Positively/right skewed distribution. The first standard deviation below the mean (blue) covers more of the distribution than the first one above (red). So a standard deviation below implies something different than a standard deviation above.

Example: Right skewed distribution. Since the mean is more than the median, there are more values below the mean. Does that imply that a deviation below the mean is standard? For skew, avoid the whole mess and use the IQR.

Pop quiz: If the distribution is symmetric and the data is interval, then the best measure of variability is: a) Interquartile range b) Standard Deviation Hint: What is the default central measure? Which measure above is based on that?

Question: If the data is ordinal, then which measure of variability/spread is not possible (without extra assumptions): a) Interquartile range b) Standard Deviation Hint: The standard deviation is based on the mean. Do ordinals have means?

Answer: Standard deviation is impossible for ordinal data because you usually can't get the mean of ordinal data. To get the mean for ordinal data, you need to treat it like interval data, which means assuming that the categories are evenly spaced.

Which of the following standard deviations is/are impossible?
- 40
- 7 potatoes
- -4
Hint: The standard deviation is the square root of the variance.

Answer: -4 is impossible. Standard deviation is the (positive) square root of the variance. It doesn't make sense for the typical distance from the mean to be a negative number. 7 potatoes is a fine standard deviation if the variable is number of potatoes. (For interest, the variance would be measured in potatoes².)