Numerical Descriptive Measures. Measures of Center: Mean and Median

Steve Sawin
Statistics
Numerical Descriptive Measures

Having seen the shape of a distribution by looking at the histogram, the two most obvious questions to ask about the specific distribution are: where is the data clumped, and how spread out is it? Both of these are numerical, quantitative questions. They are sufficiently vague, however, that each has several reasonable answers depending on the situation.

Measures of Center: Mean and Median

Any measure of the center of a distribution can be called the average, though in practice we usually use that term to mean the mean.

Definitions: The mean of a set of numbers is the sum of all the numbers divided by how many there are. We can write it as a formula as follows. Suppose $x_1, x_2, \ldots, x_n$ are $n$ numbers. The mean of these $n$ numbers is

$$(x_1 + x_2 + \cdots + x_n)/n,$$

which we write more compactly as

$$\left( \sum_{i=1}^{n} x_i \right) / n.$$

This funny notation $\sum_{i=1}^{n}$, called sigma notation because $\Sigma$ is the Greek capital letter sigma, means: run through each number from 1 to $n$, and for each number, substitute that number in for $i$ in the formula that follows, and then add them up. We will use sigma notation frequently in the future. A nice geometric way to think about the mean: if you put a weight on the number line at each of the $n$ values, the mean is the place where the whole ensemble would balance on the head of a pin.

The median of the data is (roughly) the number such that half of the data points are below it and half are above it. We say roughly because we have to be careful when several observations have the same value. The precise way to say it is: put the $n$ observations in order from smallest to largest. Then, if $n$ is odd, pick the $(n+1)/2$-th value (that is, count $(n+1)/2$ along the sequence and pick that value). If $n$ is even, the median is the average of the $(n/2)$-th value and the $(n/2 + 1)$-th value.

The book speaks of the mode, which is the peak of the histogram, but we will not care about that at all.

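
The two definitions above can be sketched in a few lines of code. This is only an illustration, assuming Python; the function names and the sample numbers are made up for the example and are not part of the notes.

```python
def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)          # smallest to largest
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the (n + 1)/2-th value (Python counts positions from 0)
        return ordered[(n + 1) // 2 - 1]
    # n even: average of the (n/2)-th and (n/2 + 1)-th values
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


data = [3, 1, 4, 1, 5, 9, 2, 6]       # made-up observations
print(mean(data))                      # 3.875
print(median(data))                    # 3.5
```

Here the sorted data are 1, 1, 2, 3, 4, 5, 6, 9, so with n = 8 the median is the average of the 4th and 5th values, 3 and 4.
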
Names: The tradition is to use Greek letters for parameters, and Roman letters for statistics. When you calculate the mean of a sample (a statistic) you write it as $\bar{x}$. When you calculate the mean of a population (a parameter) you write it as $\mu$ (the Greek letter mu). There is not really a standard terminology for medians, but the median of a sample is usually called $m$, and the median of a population is sometimes called $M$.

Properties: The mean is sensitive to outliers, meaning a few extreme values tend to pull the mean towards them. The median, which depends only on the ordering of the values, is much less affected.

Calculating: Both are easy to calculate by hand in small examples, and both are best done by calculator or computer for large data sets. In Excel, both show up in the Descriptive Statistics choice of the Data Analysis tool (part of the Analysis ToolPak add-in). Both can also be calculated directly in Excel. For example, typing =AVERAGE(A1:A20) into a cell will put the mean of the numbers from A1 to A20 in that cell, and =MEDIAN(A1:A20) will put the median of the numbers from A1 to A20 into it. There are some other commands which are variations on these and treat blank or text values differently, which you can play around with. For large data sets the median is actually much harder to calculate than the mean (the data must be put in order first), though that is rarely an issue with a computer.

Rules of Thumb: The mean is the point at which the histogram would balance, which usually makes it pretty easy to estimate from a histogram. The median is the point where half the data is below it and half above, which is just a little harder to estimate, but still not bad. For symmetric distributions they should be equal; for right-skewed distributions the mean will generally be higher (and both will generally be above the peak), while for left-skewed distributions the mean is generally lower (and both are generally below the peak).

Use: In most situations the mean and median are very close and both fit our sense of the middle or typical value pretty well. In highly skewed data they can be very different, and it is generally less clear what the middle of the data should be.

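
A small sketch may make the sensitivity of the mean to outliers concrete. It assumes Python's standard `statistics` module, and the income figures are hypothetical numbers chosen only for illustration.

```python
from statistics import mean, median

incomes = [30, 32, 35, 38, 40]        # hypothetical incomes, in thousands
with_outlier = incomes + [1000]       # the same data plus one extreme value

# Without the outlier the mean and median agree.
print(mean(incomes), median(incomes))             # 35 35

# One extreme value drags the mean far to the right (to roughly 195.8),
# while the median only moves to 36.5.
print(mean(with_outlier), median(with_outlier))
```

This is the pattern described for right-skewed data: the mean ends up above the median, and the median stays closer to what most of the values look like.
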
Generally, in highly skewed data the median is closer to our sense of what the middle should be, and people tend to use medians for skewed distributions like income and prices. However, the mean is much nicer mathematically, and this makes it a more practical quantity to deal with most of the time in inferential statistics. An important point that occasionally comes up is that because the median depends only on the ordering of the values, it makes sense even for ranked data, while the mean, which involves adding and dividing, does not. You should never use the mean to summarize ranked data, though people often do.

Measures of Variation

Variation is an extremely important notion. Quality control and most of the modern applications of statistics to business and engineering focus on reducing variation, the enemy of planning. In science, variation is error, the thing that stands between your measurements and the true answer. We will consider five measures of variation. One, the range, is of little significance, though very simple. Two are nearly the same thing: the population standard deviation and the sample standard deviation have almost the same definition and, in any case where they are at all useful, practically the same value. The last two, population variance and sample variance, are the squares of the two standard deviations, so they are just a repackaging of the same information.

Definitions: The range is the difference between the largest and the smallest value. Since this only tells you about the largest and smallest values, it is generally not very useful.

The population variance is defined by the formula

$$\left( \sum_{i=1}^{n} (x_i - \mu)^2 \right) / n = \left( (x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_n - \mu)^2 \right) / n,$$

where $x_1, x_2, \ldots, x_n$ are your observations and $\mu$ is the mean of these numbers. In other words, for each number we take the difference between it and the mean (this can be seen as its distance from the center), square it (so that it is positive whether the value is smaller or larger than the mean), and then average these quantities. It is easy to see that this number gets bigger as the data gets more spread out, does not change if you add a constant to all of the values (that is, shift the whole histogram to the left or right without changing its shape), and does not change if you add lots more numbers that are equally spread out. So it is a good measure of variation.

One problem with the population variance is that it does not scale properly. If you double all your numbers, you do not double the variance (in fact you multiply it by four). Another way to say this is that if your data has units like inches (e.g., if it represents heights) then the variance would have units of square inches (like an area). To solve this we take the square root. This is called the population standard deviation:

$$\sqrt{\left( \sum_{i=1}^{n} (x_i - \mu)^2 \right) / n}.$$

It is generally considered a better way to report the variation in a population.

For technical reasons we shall ignore for now, one uses slightly different formulas for computing the standard deviation and variance for samples. The sample variance is given by

$$\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

and the sample standard deviation is given by

$$\sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}.$$

Notice the only difference is that you are calling the mean $\bar{x}$ rather than $\mu$ and are dividing by $n - 1$ rather than $n$. Since the standard deviation and variance really don't tell you much unless you have a lot of data, these two quantities are generally extremely close. If someone asks for the standard deviation without specifying which, they mean the sample standard deviation.
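
The four definitions can be written out directly from the formulas. The following is a minimal sketch, assuming Python; the function names and the list of heights are invented for the example. It also checks two of the properties claimed above: shifting the data leaves the variance alone, and doubling the data multiplies it by four.

```python
import math


def sum_of_squared_deviations(values):
    """Sum of (x_i - mean)^2 over all observations."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def population_variance(values):
    return sum_of_squared_deviations(values) / len(values)         # divide by n


def sample_variance(values):
    return sum_of_squared_deviations(values) / (len(values) - 1)   # divide by n - 1


heights = [60, 62, 64, 66, 68]        # hypothetical heights in inches; mean is 64

print(population_variance(heights))              # 8.0  (square inches)
print(sample_variance(heights))                  # 10.0 (square inches)
print(math.sqrt(population_variance(heights)))   # population s.d., about 2.83 inches
print(math.sqrt(sample_variance(heights)))       # sample s.d., about 3.16 inches

shifted = [x + 10 for x in heights]   # add a constant to every value
doubled = [2 * x for x in heights]    # double every value
print(population_variance(shifted))   # 8.0, unchanged
print(population_variance(doubled))   # 32.0, four times as large
```

With only five observations the two standard deviations differ noticeably; the gap between dividing by n and by n - 1 shrinks as the data set grows.
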

Names: The population standard deviation is called $\sigma$ (the Greek lower case sigma; you already met its upper case cousin) and so of course the population variance is called $\sigma^2$. The sample standard deviation is called $s$, and the sample variance is then $s^2$.

Properties: Standard deviation and variance are both sensitive to outliers, and adding an outlier will generally bump these numbers up significantly. The standard deviation can generally be interpreted as the typical distance from the mean of a random point. For what it is worth, if you spun the histogram around the mean (its balancing point, remember), the standard deviation tells you how hard a push you would need to get it spinning.

Calculating: To calculate standard deviation or variance by hand, first calculate the mean. Then make a column of the observations, then a column of each observation minus the mean, then a column of the squares of these differences. Then just add up the numbers in the last column, divide by $n$ or $n - 1$, and take the square root or not. Except in the simplest cases, though, you are better off using Excel. The Descriptive Statistics option will give you the (sample) standard deviation and sample variance (and range too). If you want the population standard deviation of cells A1 through A20, use =STDEVP(A1:A20). Population variance is =VARP(A1:A20). You can also do the sample s.d. and variance directly with =STDEV(A1:A20) and =VAR(A1:A20).

Rules of Thumb: The best way to think of the standard deviation is as a sort of typical distance from the mean. In particular, if your data is bell-shaped, then roughly 68% of the data will be within one standard deviation of the mean. That is, 68% of the data will fall between $\mu - \sigma$, the point a distance $\sigma$ below the mean $\mu$, and $\mu + \sigma$, the point a distance $\sigma$ above the mean $\mu$. Similarly, 95% of the data will fall within two standard deviations of the mean (between $\mu - 2\sigma$ and $\mu + 2\sigma$) and over 99% will fall within three standard deviations of the mean.

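
The 68%, 95% and 99% figures can be checked numerically. The sketch below assumes that "bell-shaped" means an exactly normal distribution, and it uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `fraction_within` is made up for the example.

```python
from statistics import NormalDist


def fraction_within(k, mu=0.0, sigma=1.0):
    """Fraction of a normal distribution lying between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The result does not depend on the particular mean and standard deviation, which is why the rule can be stated once for all bell-shaped data. Real data is only approximately normal, so observed percentages will differ somewhat from these.
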
This means the standard deviation and the mean are a sort of universal measure of how unusual an observation is. Someone whose height is one standard deviation above the mean is tall, but not surprising. Someone two standard deviations above the mean is strikingly tall; you would look twice. If you meet someone who is three standard deviations above the mean, you will stare, and then go home and tell your roommate about it. This rule of thumb is technically only true for the population standard deviation, but it works well enough for both. Even if the data is far from bell-shaped, at least 3/4 of the data falls within two standard deviations of the mean, so you can get some idea of the standard deviation by looking at the histogram.

Use: The standard deviation is most useful when your data is roughly bell-shaped. When it is skewed or otherwise far from bell-shaped, it is more difficult to interpret. Generally we will be interested in looking at the distribution scaled by the mean and s.d.; that is to say, we will talk about points one standard deviation above the mean, or two standard deviations below the mean, or whatever. In fact, the best universal measure of where a data point fits in a bell-shaped distribution is its z-score. If $x$ is a number in a distribution of mean $\mu$ and standard deviation $\sigma$, its z-score is

$$z = \frac{x - \mu}{\sigma}.$$

A z-score of 2 means the data point is two standard deviations above the mean. The variance will occasionally be useful to talk about, but it will really be a kind of helpful understudy to the standard deviation.
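
As a brief illustration of the z-score formula, here is a sketch in Python. The mean of 69 inches and standard deviation of 3 inches are hypothetical values picked for the example, not figures from the notes.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (positive) or below (negative) the mean."""
    return (x - mu) / sigma


mu, sigma = 69, 3                 # hypothetical mean and s.d. of heights, in inches

print(z_score(75, mu, sigma))     # 2.0: two standard deviations above the mean
print(z_score(66, mu, sigma))     # -1.0: one standard deviation below the mean
print(z_score(69, mu, sigma))     # 0.0: exactly at the mean
```

Because the units cancel in the division, a z-score is a pure number, so observations from differently scaled distributions can be compared directly.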