Simple Descriptive Statistics

David Tenenbaum GEOG 090 UNC-CH Spring 2005

Simple Descriptive Statistics These are ways to summarize a data set quickly and accurately. The most common way of describing a variable's distribution is in terms of two of its properties: central tendency describes the central value of the distribution, around which the observations cluster, and dispersion describes how the observations are distributed about that central value. Today, we'll focus on measures of dispersion

Why Do We Need Measures of Dispersion at all? Measures of central tendency tell us nothing about the variability / dispersion / deviation / range of values about the central value. Consider the following two unimodal symmetric distributions: Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA: Macmillan College Publishing Co., p. 91.

Measures of Dispersion In addition to measures of central tendency, we can also summarize data by characterizing its variability. Measures of dispersion are concerned with the distribution of values around the mean in data:
1. Range
2. Quartile range etc.
3. Mean deviation
4. Variance, standard deviation and z-scores
5. Coefficient of variation

Measures of Dispersion - Range 1. Range: this is the most simply formulated of all measures of dispersion. Given a set of measurements $x_1, x_2, x_3, \ldots, x_{n-1}, x_n$, the range is defined as the difference between the largest and smallest values: $\text{Range} = x_{\max} - x_{\min}$. This is another descriptive measure that is vulnerable to the influence of outliers in a data set, which result in a range that is not really descriptive of most of the data
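
To make the definition concrete, here is a minimal Python sketch; the sample values are invented for illustration only:

```python
# Minimal sketch: range of a small, made-up data set
data = [12, 7, 3, 15, 9]

data_range = max(data) - min(data)  # Range = x_max - x_min
print(data_range)                   # 12
```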

Measures of Dispersion Quartile Range etc. 2. Quartile Range etc. We can divide distributions into a number of parts, each containing an equal number of observations:
Quartiles: each contains 25% of all values
Quintiles: each contains 20% of all values
Deciles: each contains 10% of all values
Percentiles: each contains 1% of all values
A standard application of this approach for describing dispersion involves calculating the interquartile range (a.k.a. quartile deviation)
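
As an illustration, the quartiles and interquartile range can be computed with NumPy's percentile function. The data values below are made up, and note that different interpolation rules give slightly different quartiles for small samples:

```python
import numpy as np

# Made-up sample data
data = np.array([4, 8, 15, 16, 23, 42, 7, 19, 11, 30])

q25, q50, q75 = np.percentile(data, [25, 50, 75])
iqr = q75 - q25  # interquartile range

print(f"25th percentile: {q25}")
print(f"median:          {q50}")
print(f"75th percentile: {q75}")
print(f"IQR:             {iqr}")
```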

Measures of Dispersion Quartile Range etc. 2. Quartile Range etc. cont. Rogerson (p. 6) defines the interquartile range as the difference between the values of the 25th and 75th percentiles (i.e. the minimum value of the 2nd quartile and the maximum value of the 3rd quartile). This is well applied to skewed distributions, since it measures deviation around the median. The interquartile range provides 2 of the 5 values displayed in a box plot, which is a convenient graphical summary of a data set

Measures of Dispersion Quartile Range etc. 2. Quartile Range etc. cont. A box plot graphically displays the following five values: the median, the minimum value, the maximum value, the 25th percentile value, and the 75th percentile value (figure: Rogerson, p. 8). Under some circumstances, the whiskers are not used for the min. and max. because of outliers
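
A box plot of this kind can be drawn with matplotlib. This is only a sketch with invented data; by default matplotlib extends the whiskers to 1.5 × IQR and marks points beyond that as outliers, which matches the caveat above about the whiskers not always reaching the min. and max.:

```python
import matplotlib.pyplot as plt
import numpy as np

# Invented data with one large value that will plot as an outlier
data = np.array([4, 8, 15, 16, 23, 7, 19, 11, 12, 90])

fig, ax = plt.subplots()
ax.boxplot(data)            # box = 25th-75th percentiles, line = median
ax.set_ylabel("Value")
plt.savefig("boxplot.png")  # write the figure to a file
```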

Measures of Dispersion Mean Deviation 3. Mean Deviation Once we have calculated the mean value for a data set, we can assess the difference between any observation and that mean, and this is termed the statistical distance: $\text{statistical distance} = x_i - \bar{x}$. If we take the absolute values of these, and sum for all observations, we have calculated the mean deviation: $\text{Mean deviation} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$

Measures of Dispersion Mean Deviation 3. Mean Deviation cont. Why is it necessary to take absolute values of the statistical distances $(x_i - \bar{x})$ before summing them to get the mean deviation? Because the statistical distances would be both positive and negative, and when summed without absolute values they would sum to zero: $\text{Mean deviation} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
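
The point about signed deviations summing to zero is easy to verify numerically. This sketch, with made-up data, computes both the raw sum of deviations and the mean (absolute) deviation:

```python
# Made-up data
data = [2.0, 4.0, 6.0, 8.0, 10.0]
n = len(data)
mean = sum(data) / n

signed = [x - mean for x in data]            # statistical distances
mean_dev = sum(abs(d) for d in signed) / n   # mean (absolute) deviation

print(sum(signed))   # ~0.0 (up to floating-point rounding)
print(mean_dev)      # 2.4
```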

Measures of Dispersion Variance, Standard Deviation, Z-scores 4. Variance etc. As an alternative to taking the absolute values of the statistical distances, we can square each deviation before taking their sum, which yields the sum of squares: $\text{Sum of Squares} = \sum_{i=1}^{n} (x_i - \bar{x})^2$. The sum of squares expresses the total squared variation about the mean, and using this value we can calculate variances and standard deviations for both populations and samples

Measures of Dispersion Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Variance is formulated as the sum of squares divided by the population size, or by the sample size minus one: population variance $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$; sample variance $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$. Note the differences in the two formulae, both in the notation and in the denominators!
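
In code, the difference between the two denominators is just a divisor choice; NumPy exposes it through the ddof argument (ddof=0 divides by N, ddof=1 divides by n − 1). The data below are made up:

```python
import numpy as np

data = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # made-up values

pop_var = np.var(data, ddof=0)    # divide by N   (population formula)
samp_var = np.var(data, ddof=1)   # divide by n-1 (sample formula)

print(pop_var)   # 8.0
print(samp_var)  # 10.0
```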

Measures of Dispersion Variance, Standard Deviation, Z-scores 4. Variance etc. cont. We subtract one from the sample size when calculating the sample variance because we wish to produce an unbiased statistic. In this context, unbiased refers to the idea that the statistic is calculated from a sample, which is some subset of the population; if we did this repeatedly (using many samples), then the average of all the sample variances should converge on the true population variance, which is generally larger than the variance computed from any single sample. Thus, we divide by n - 1 to increase the sample variance and provide a better estimate of the population variance
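
One way to see why dividing by n − 1 helps is a small simulation: draw many samples from a population with a known variance and compare the average of the two estimators. This is a sketch only; the population, sample size, and number of trials are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0                  # population variance (std dev = 2)
n, trials = 10, 20000

biased, unbiased = [], []
for _ in range(trials):
    sample = rng.normal(loc=0.0, scale=2.0, size=n)
    biased.append(np.var(sample, ddof=0))    # divide by n
    unbiased.append(np.var(sample, ddof=1))  # divide by n - 1

print(np.mean(biased))    # tends to fall below 4.0 (around 3.6)
print(np.mean(unbiased))  # close to 4.0 on average
```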

Measures of Dispersion Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Standard deviation is calculated by taking the square root of the variance: population standard deviation $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$; sample standard deviation $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$. Why do we prefer the standard deviation over the variance as a measure of dispersion? Because its magnitude and units match those of the mean and the original data.

Measures of Dispersion Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Just as the mean can be applied to spatial distributions through the bivariate mean center and weighted mean center formulae (computed by considering the (x,y) coordinates of a set of spatial objects), the standard deviation can be applied to examining the dispersion of a spatial distribution. This is called the standard distance (SD): $SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} + \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$
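
A direct translation of the standard distance formula above (with the n − 1 denominators as given on the slide) might look like this; the coordinates are invented:

```python
import numpy as np

# Invented (x, y) coordinates of a set of spatial objects
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 6.0, 6.0, 8.0])

n = len(x)
sd = np.sqrt(np.sum((x - x.mean())**2) / (n - 1) +
             np.sum((y - y.mean())**2) / (n - 1))
print(sd)
```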

Measures of Dispersion Variance, Standard Deviation, Z-scores 4. Variance etc. cont. In the event that the observations are weighted, we have a similar formula to use for calculating the weighted standard distance: $SD_w = \sqrt{\frac{\sum_{i=1}^{n} w_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} w_i} + \frac{\sum_{i=1}^{n} w_i (y_i - \bar{y})^2}{\sum_{i=1}^{n} w_i}}$. Note that the $\bar{x}$ and $\bar{y}$ used in this formula need to be calculated using the weighted mean center formulae
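
The weighted version follows the same pattern, with the weighted mean centre used in place of the ordinary means; again, the coordinates and weights are invented:

```python
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])   # invented coordinates
y = np.array([1.0, 3.0, 6.0, 6.0, 8.0])
w = np.array([1.0, 2.0, 1.0, 3.0, 1.0])   # invented weights

# Weighted mean centre
xw = np.sum(w * x) / np.sum(w)
yw = np.sum(w * y) / np.sum(w)

sd_w = np.sqrt(np.sum(w * (x - xw)**2) / np.sum(w) +
               np.sum(w * (y - yw)**2) / np.sum(w))
print(sd_w)
```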

Measures of Dispersion Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Sometimes we want to compare data from different distributions, which in turn have different means and variances. In these circumstances, it's convenient to have a standardized measure of dispersion that can be calculated for an individual observation. The z-score (a.k.a. standard normal variate, standard normal deviate, or just the standard score) is calculated by subtracting the sample mean from the observation, and then dividing that difference by the sample standard deviation: $z = \frac{x_i - \bar{x}}{s}$
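
Computing z-scores for a whole sample is a one-liner once the mean and sample standard deviation are available; the data here are made up:

```python
import numpy as np

data = np.array([12.0, 15.0, 9.0, 22.0, 17.0])  # made-up observations

z = (data - data.mean()) / data.std(ddof=1)     # sample std dev (n - 1)
print(z)
```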

Measures of Dispersion Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Z-scores can be interpreted as an indication of how many standard deviations above (or below, if the z-score is negative) the sample mean a given observation is; i.e. when we have a value with a z-score of 2 or greater, we are more than likely looking at an outlier. We can form some expectations about the size of the standard deviation for a distribution (even if it is not bell-shaped), and the dispersion of values within that distribution: Chebyshev's Theorem

Measures of Dispersion Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Chebyshev's Theorem states: given a set of data, no matter how it is distributed, it may be shown that a proportion of at least $(1 - 1/z^2)$ of the observations (where the value of z is greater than 1) will lie within z standard deviations of their mean $\bar{x}$, and the proportion falling beyond the limits $\bar{x} \pm zs$ will be less than $1/z^2$. Plugging in values for z:
z = 1: $1 - 1/z^2$ = 0
z = 2: $1 - 1/z^2$ = 3/4 (75% of obs.)
z = 3: $1 - 1/z^2$ = 8/9 (89% of obs.)
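
Chebyshev's bound can be checked empirically on any data set, normal or not. This sketch uses a deliberately skewed, simulated sample, which is an assumption made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=1.0, size=10000)   # skewed, non-normal data

mean, s = data.mean(), data.std(ddof=1)
for z in (2, 3):
    within = np.mean(np.abs(data - mean) <= z * s)  # observed proportion
    bound = 1 - 1 / z**2                            # Chebyshev lower bound
    print(f"z={z}: observed {within:.3f} >= bound {bound:.3f}")
```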

Measures of Dispersion Variance, Standard Deviation, Z-scores 4. Variance etc. cont. For bell-shaped (normal) distributions we can go even further than Chebyshev's Theorem, and state an empirical rule about the expectation of the dispersion of values in such a distribution. The Empirical Rule: given a distribution that is approximately bell-shaped, the interval
1. $\mu \pm \sigma$ will contain ~68.27% of the values
2. $\mu \pm 2\sigma$ will contain ~95.45% of the values
3. $\mu \pm 3\sigma$ will contain ~99.73% of the values
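
The same check on (approximately) normal data recovers the 68.27 / 95.45 / 99.73 percentages; the sample here is simulated, an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=0.0, scale=1.0, size=100000)  # simulated normal data

for k in (1, 2, 3):
    within = np.mean(np.abs(data - data.mean()) <= k * data.std())
    print(f"within {k} std dev: {within:.4f}")  # ~0.6827, ~0.9545, ~0.9973
```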

Measures of Dispersion Variance, Standard Deviation, Z-scores 4. Variance etc. cont. Displaying this Empirical Rule graphically, we can get a sense of what defines a normal distribution: Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA: Macmillan College Publishing Co., p. 100.

Measures of Dispersion Coefficient of Variation 5. Coefficient of Variation We cannot directly compare the standard deviations of frequency distributions with different means, because a distribution with a higher mean is likely to have a larger deviation. In addition to z-scores (which describe the deviation of an observation), we need an overall measure of dispersion that is normalized with respect to the mean from the same distribution: $\text{Coefficient of variation} = \frac{s}{\bar{x}}$ or $\frac{\sigma}{\mu}$ (optionally multiplied by 100%)
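
A quick illustration with two made-up samples that share the same standard deviation but have different means shows why the coefficient of variation is useful for comparison:

```python
import numpy as np

a = np.array([8.0, 10.0, 12.0])     # mean 10, made-up
b = np.array([98.0, 100.0, 102.0])  # mean 100, same spread, made-up

for name, data in (("a", a), ("b", b)):
    cv = data.std(ddof=1) / data.mean()
    print(f"{name}: std={data.std(ddof=1):.2f}, CV={cv:.1%}")
# Same standard deviation (2.0), but the relative spread of b is much smaller
```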

Further Moments of the Distribution While measures of dispersion are useful for helping us describe the width of the distribution, they tell us nothing about the shape of the distribution. You'll recall these figures from earlier in the lecture: Source: Earickson, RJ, and Harlin, JM. 1994. Geographic Measurement and Quantitative Analysis. USA: Macmillan College Publishing Co., p. 91.

Further Moments of the Distribution There are further statistics that describe the shape of the distribution, using formulae that are similar to those of the mean and variance:
1. The 1st moment of the distribution: Mean (describes the central value)
2. The 2nd moment of the distribution: Variance (describes dispersion)
3. The 3rd moment of the distribution: Skewness (describes asymmetry)
4. The 4th moment of the distribution: Kurtosis (describes peakedness)

Further Moments of the Distribution Each of the moments of the distribution makes use of a formula that is based on a summation of the statistical distances $(x_i - \bar{x})$ for a set of data, where the exponents in the formulae are determined by the moment. 2nd moment: $\text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ (dispersion)

Further Moments of the Distribution Each of the moments of the distribution makes use of a formula that is based on a summation of the statistical distances $(x_i - \bar{x})$ for a set of data, where the exponents in the formulae are determined by the moment. 3rd moment: $\text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$ (asymmetry)

Further Moments of the Distribution Each of the moments of the distribution makes use of a formula that is based on a summation of the statistical distances $(x_i - \bar{x})$ for a set of data, where the exponents in the formulae are determined by the moment. 4th moment: $\text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$ (peakedness)

Further Moments of the Distribution - Skewness Skewness This statistic measures the degree of asymmetry exhibited by the data (i.e. whether there are more observations on one side of the mean than the other): $\text{Skewness} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n s^3}$. Because the exponent in this moment is odd, skewness can be positive or negative; with positive skewness there are more observations below the mean than above it (and the reverse for negative skewness)
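
A direct implementation of the skewness formula above (dividing by n and by s cubed, with s the sample standard deviation) on a made-up, right-skewed sample:

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 10.0])  # made-up, right-skewed

n = len(data)
mean = data.mean()
s = data.std(ddof=1)

skewness = np.sum((data - mean)**3) / (n * s**3)
print(skewness)   # positive, consistent with the long right tail
```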

Further Moments of the Distribution - Skewness Skewness cont. Skewness can also be assessed by comparing the mean and the median: positive skewness means Median < Mean; negative skewness means Mean < Median. This can also be assessed by calculating Pearson's coefficient of skewness: $Sk = \frac{3(\bar{x} - Md)}{s}$, where $\bar{x}$ is the mean, Md is the median, and s is the std. deviation. Sk follows the above sign convention, and values with a magnitude less than 3 indicate moderate skewness
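
Pearson's coefficient is equally short to compute; the values below are invented and reuse the right-skewed sample from the previous sketch:

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 10.0])  # invented, right-skewed

sk = 3 * (data.mean() - np.median(data)) / data.std(ddof=1)
print(sk)   # positive: the median (3.0) lies below the mean (~3.57)
```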

Further Moments of the Distribution - Kurtosis Kurtosis This statistic measures how flat or peaked the distribution is, and is formulated as: $\text{Kurtosis} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n s^4} - 3$. The 3 is subtracted in this formula because it makes the kurtosis of a normal distribution equal to 0 (this condition is also termed having a mesokurtic distribution)
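
The kurtosis formula (with the subtraction of 3 so that a normal distribution scores 0) can be implemented the same way; the simulated normal sample is an assumption made for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(size=100000)   # simulated, approximately normal sample

n = len(data)
mean = data.mean()
s = data.std(ddof=1)

kurtosis = np.sum((data - mean)**4) / (n * s**4) - 3
print(kurtosis)   # close to 0 for a normal sample (mesokurtic)
```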

Further Moments of the Distribution - Kurtosis Kurtosis cont. When the kurtosis < 0, the frequencies throughout the curve are closer to being equal (i.e. the curve is flatter and wider) and this condition is termed platykurtic. When the kurtosis > 0, there are high frequencies in only a small part of the curve (i.e. the curve is more peaked) and this condition is termed leptokurtic. NOTE: Both skewness and kurtosis are sensitive to the size of n; when n is small and there are outliers, they are less useful