Statistics I Chapter 2: Analysis of univariate data


Numerical summary

Central tendency: mean, median, mode
Location: quartiles, percentiles
Spread: range, interquartile range, variance, standard deviation, coefficient of variation
Form: coefficient of asymmetry, coefficient of kurtosis

Descriptive statistics

What are they useful for?
Can we calculate them for all types of variables?
Which are the most useful in each case?
How can we use the calculator or Excel?

Measures of central tendency

The mean
The median
The mode

Central tendency: the (arithmetic) mean

The mean is the average of all the data:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \dots + x_n}{n}$

It is the most common measure of location.
It is the center of gravity of the data.
It can be calculated only for quantitative variables.

The mean: example

For the experience (in years) of the 46 professionals of a computer company, what is the mean?

$\bar{x} = \frac{1 + 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + \dots + 17 + 20}{46} = 7.5$ years

How can we calculate it using the absolute frequency table? And using the relative one?

Experience, $x_i$   Absolute freq., $n_i$   Relative freq., $f_i$
1                   5                       0.109
2                   4                       0.087
3                   4                       0.087
4                   4                       0.087
5                   3                       0.065
6                   4                       0.087
7                   1                       0.022
8                   4                       0.087
10                  4                       0.087
11                  2                       0.043
12                  2                       0.043
13                  2                       0.043
14                  1                       0.022
15                  1                       0.022
16                  3                       0.065
17                  1                       0.022
20                  1                       0.022
Total               46                      1
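The two questions on this slide can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the variable names are illustrative. It computes the mean from the absolute frequencies as $\sum x_i n_i / n$ and from the relative frequencies as $\sum x_i f_i$.

```python
# Experience (years) and absolute frequencies taken from the slide's table.
experience = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 20]
abs_freq = [5, 4, 4, 4, 3, 4, 1, 4, 4, 2, 2, 2, 1, 1, 3, 1, 1]

n = sum(abs_freq)  # total number of professionals

# Mean from absolute frequencies: sum(x_i * n_i) / n
mean_abs = sum(x * ni for x, ni in zip(experience, abs_freq)) / n

# Mean from relative frequencies: sum(x_i * f_i), with f_i = n_i / n
rel_freq = [ni / n for ni in abs_freq]
mean_rel = sum(x * fi for x, fi in zip(experience, rel_freq))

print(n, mean_abs, round(mean_rel, 6))
```

Both routes give the same value because $f_i = n_i / n$; using the rounded relative frequencies printed in the table would introduce a small rounding error.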

The mean with grouped data

The formula is the same, but it uses the center (midpoint) of each interval in place of $x_i$.
For the salary of the 46 professionals of a computer company, what is the mean?
Note: the mean salary computed from the raw data equals 17250.413.

The mean: properties

Linearity: if $Y = a + bX$, then $\bar{y} = a + b\bar{x}$.
If the salaries of the 46 professionals are increased by 2 %, how does the mean salary change? Afterwards each salary is reduced by 100 dollars; what is the final mean salary?

Disadvantage: it is affected by extreme values (outliers).
Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 50, 4, 2

$\bar{x} = \frac{3 + 1 + 5 + 4 + 2}{5} = 3$, $\bar{y} = \frac{3 + 1 + 50 + 4 + 2}{5} = 12$

A single outlier has multiplied the mean by 4.
When the data are skewed, a robust alternative measure of central tendency is more appropriate.
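As an illustration of both properties, here is a short Python sketch (not from the slides). It assumes the salary change can be written as $Y = a + bX$ with $b = 1.02$ and $a = -100$, and it reuses the raw-data mean salary of 17250.413 quoted on the grouped-data slide; the helper name `transformed_mean` is hypothetical.

```python
from statistics import mean

x = [3, 1, 5, 4, 2]
y = [3, 1, 50, 4, 2]  # same data with 5 replaced by the outlier 50

mean_x, mean_y = mean(x), mean(y)  # the outlier pulls the mean from 3 to 12


def transformed_mean(mean_value, a, b):
    """Mean of Y = a + b*X obtained from the mean of X (linearity)."""
    return a + b * mean_value


# Assumed transformation: a 2 % raise followed by a 100-unit reduction.
a, b = -100, 1.02

# Check linearity on the small sample: transform every value, then average.
direct = mean([a + b * v for v in x])
via_property = transformed_mean(mean_x, a, b)

# Apply the same rule to the mean salary quoted on the grouped-data slide.
raw_mean_salary = 17250.413
final_mean_salary = transformed_mean(raw_mean_salary, a, b)

print(mean_x, mean_y, direct, via_property, round(final_mean_salary, 5))
```

The point of the design is that the final mean can be obtained without touching the 46 individual salaries.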

Central tendency: the median

The median is the most central datum.

1. Order the data from smallest to largest.
2. Include repetitions.
3. The median is the physical centre.

Odd number of data (n = 11): 1 1 1 3 3 5 5 7 8 8 9, so M = 5.
Even number of data (n = 10): 1 1 1 3 3 5 5 7 8 8, so $M = \frac{3 + 5}{2} = 4$.

For the ordered list from smallest to largest, $x_{(1)}, x_{(2)}, \dots, x_{(n)}$:

$M = x_{((n+1)/2)}$ if n is odd, and $M = \frac{x_{(n/2)} + x_{(n/2+1)}}{2}$ if n is even.
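The order-statistic rule on this slide translates directly into code. The following Python function is a minimal sketch written for illustration (the function name is not from the slides); note that Python lists are 0-based, so $x_{(k)}$ is `ordered[k - 1]`.

```python
def median(values):
    """Median via the rule: middle value for odd n,
    average of the two middle values for even n."""
    ordered = sorted(values)  # order the data, keeping repetitions
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # x_((n+1)/2) in 1-based notation
    return (ordered[mid - 1] + ordered[mid]) / 2  # (x_(n/2) + x_(n/2+1)) / 2


odd_sample = [1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9]
even_sample = [1, 1, 1, 3, 3, 5, 5, 7, 8, 8]

print(median(odd_sample), median(even_sample))
```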

The median via the table of frequencies

$N_i$ and $F_i$ are the cumulative absolute and cumulative relative frequencies. The median is the first value whose cumulative relative frequency exceeds 0.5.

Experience, $x_i$   $n_i$   $f_i$   $N_i$   $F_i$
1                   5       0.109   5       0.109
2                   4       0.087   9       0.196
3                   4       0.087   13      0.283
4                   4       0.087   17      0.370
5                   3       0.065   20      0.435   (< 0.5)
6                   4       0.087   24      0.522   (> 0.5, so M = 6)
7                   1       0.022   25      0.543
8                   4       0.087   29      0.630
9                   0       0       29      0.630
10                  4       0.087   33      0.717
11                  2       0.043   35      0.761
12                  2       0.043   37      0.804
13                  2       0.043   39      0.848
14                  1       0.022   40      0.870
15                  1       0.022   41      0.891
16                  3       0.065   44      0.957
17                  1       0.022   45      0.978
18                  0       0       45      0.978
19                  0       0       45      0.978
20                  1       0.022   46      1.000
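Reading the median off the cumulative frequencies can be sketched in Python as follows. This is an illustrative sketch, not the slides' own procedure: the helper `value_at_position` is a hypothetical name, and it walks the cumulative absolute frequencies $N_i$ until the requested position is covered.

```python
# Values and absolute frequencies from the table (zero-frequency rows omitted).
experience = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 20]
abs_freq = [5, 4, 4, 4, 3, 4, 1, 4, 4, 2, 2, 2, 1, 1, 3, 1, 1]


def value_at_position(values, freqs, position):
    """Ordered observation at a 1-based position, found via cumulative counts."""
    cumulative = 0
    for x, ni in zip(values, freqs):
        cumulative += ni  # this is N_i
        if position <= cumulative:
            return x
    raise ValueError("position exceeds the number of observations")


n = sum(abs_freq)
if n % 2 == 1:
    median_exp = value_at_position(experience, abs_freq, (n + 1) // 2)
else:
    lower = value_at_position(experience, abs_freq, n // 2)
    upper = value_at_position(experience, abs_freq, n // 2 + 1)
    median_exp = (lower + upper) / 2

print(n, median_exp)
```

With n = 46 the two central positions are 23 and 24, and both fall in the row for 6 years, which agrees with M = 6 in the table.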

The median: properties

Linearity: if $Y = a + bX$, then $M_y = a + bM_x$.
If the salaries of the 46 professionals are increased by 2 %, how does the median salary change? Afterwards each salary is reduced by 100 dollars; what is the final median salary?
Can we calculate the median with the education level data? Can we calculate the median with the 0-1 position-of-responsibility variable?

Advantage: it is not affected by outliers.
Example: X: 3, 1, 5, 4, 2 and Y: 3, 1, 50, 4, 2 give $M_x = 3$ and $M_y = 3$.
When the data are skewed, the median is a better measure of central tendency than the mean.

The median and the mean for asymmetric data

Annual gross salary in 2014, Encuesta de Estructura Salarial 2014 (Wage Structure Survey 2014), INE.
"The difference between the mean salary and the median salary arises because very high salaries strongly influence the mean, even though they correspond to few workers." (INE press release of 28 October 2016)

Central tendency: the mode

The mode is the most frequent value.
The mode of the variable experience in the example of the 46 professionals is 1 year, with an absolute frequency of 5 employees. The values 2, 3, 4, 6, 8 and 10 each have an absolute frequency of 4 employees.

Central tendency: the mode

Does this definition make sense with the education level data?
Does this definition make sense with the 0-1 position-of-responsibility variable?

Central tendency: the mode

Does this definition make sense with continuous data?
For continuous data grouped into intervals, we refer to the modal interval instead.

The mode: properties

It can be calculated for both qualitative and quantitative variables. Indeed, of the three measures (mean, median, mode), it is the only one that makes sense for nominal qualitative variables.
It is not affected by outliers.
There may be no mode.
There may be more than one mode: bimodal, trimodal, plurimodal. What can this indicate?
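A brief Python sketch of these properties, using `statistics.multimode` from the standard library (Python 3.8 or later), which returns every most-frequent value. The education-level list and the bimodal list are hypothetical data invented for illustration; only the experience frequencies come from the slides.

```python
from statistics import multimode

# Rebuild the 46 raw observations from the frequency table of the example.
experience = [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, 16, 17, 20]
abs_freq = [5, 4, 4, 4, 3, 4, 1, 4, 4, 2, 2, 2, 1, 1, 3, 1, 1]
raw = [x for x, ni in zip(experience, abs_freq) for _ in range(ni)]

experience_modes = multimode(raw)  # a single mode: 1 year

# Hypothetical nominal data: the mode is still well defined.
education = ["primary", "secondary", "secondary", "university"]
education_modes = multimode(education)

# Hypothetical bimodal data: two values share the highest frequency.
bimodal_modes = multimode([1, 1, 2, 2, 3])

print(experience_modes, education_modes, bimodal_modes)
```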

Location measures

Quartiles
Percentiles

Location measures: quartiles and percentiles

Quartiles split the ranked data into four segments with an equal number of values per segment.
Percentiles split the ranked data into a hundred segments with an equal number of values per segment.

1. Order the data from smallest to largest.
2. Include repetitions.
3. Select each quartile (percentile) according to its position:

The first quartile $Q_1$ has position $\frac{1}{4}(n + 1)$.
The second quartile $Q_2$ (= median) has position $\frac{1}{2}(n + 1)$.
The third quartile $Q_3$ has position $\frac{3}{4}(n + 1)$.
The k-th percentile $P_k$ has position $k(n + 1)/100$, for $k = 1, \dots, 99$.
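The position rule can be sketched in Python as below. One point is an assumption that the slide does not state: when the position is not a whole number, this sketch interpolates linearly between the two neighbouring ordered values. The function names are illustrative.

```python
def value_at_position(data, position):
    """Ordered observation at a 1-based, possibly fractional, position."""
    ordered = sorted(data)
    n = len(ordered)
    position = min(max(position, 1), n)  # keep the position inside 1..n
    lower = int(position)  # whole part of the position
    fraction = position - lower
    if lower == n:
        return ordered[-1]
    # Assumption: linear interpolation between neighbouring values.
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def quartile(data, q):
    """q-th quartile (q = 1, 2, 3), at position q * (n + 1) / 4."""
    return value_at_position(data, q * (len(data) + 1) / 4)


def percentile(data, k):
    """k-th percentile (k = 1..99), at position k * (n + 1) / 100."""
    return value_at_position(data, k * (len(data) + 1) / 100)


sample = [1, 1, 1, 3, 3, 5, 5, 7, 8, 8, 9]  # n = 11, positions 3, 6 and 9
q1, q2, q3 = (quartile(sample, q) for q in (1, 2, 3))
print(q1, q2, q3, percentile(sample, 50))
```

As far as I am aware, `statistics.quantiles(data, n=4, method="exclusive")` in the Python standard library is based on the same $(n + 1)$ positions; other software may use different conventions and give slightly different quartiles.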

Quartiles: example

Percentiles: example

Measures of spread

The range and the interquartile range
The variance and the standard deviation
The coefficient of variation

Variation: range and interquartile range (IQR)

The range is the simplest measure of variation:

$R = x_{\max} - x_{\min}$

It ignores the way the data are distributed.
It is sensitive to outliers.
Example: given the observations 3, 1, 5, 4, 2, $R = 5 - 1 = 4$.
Example: given the observations 3, 1, 5, 4, 100, $R = 100 - 1 = 99$.

The interquartile range (IQR) can eliminate some outlier problems: discard the high and low observations and calculate the range of the middle 50 % of the data.

$IQR = \text{3rd quartile} - \text{1st quartile} = Q_3 - Q_1$

Variation: interquartile range and boxplot

Outliers are observations that fall
below the value of $Q_1 - 1.5\,IQR$, or
above the value of $Q_3 + 1.5\,IQR$.
For extreme outliers, replace 1.5 by 3 in the above definition.

[Figure: boxplot with $x_{\min} = 12$, $Q_1 = 24$, median ($Q_2$) = 31, $Q_3 = 42$, $x_{\max} = 58$; each of the four segments contains 25 % of the data; IQR = 18.]
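The fence rule can be written as a small Python sketch. The quartiles $Q_1 = 24$ and $Q_3 = 42$ are the ones from the boxplot on this slide; the list of observations and the function names are hypothetical, added only to show the rule at work.

```python
def fences(q1, q3, k=1.5):
    """Lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(data, q1, q3, k=1.5):
    """Observations strictly below the lower fence or above the upper fence."""
    low, high = fences(q1, q3, k)
    return [v for v in data if v < low or v > high]


q1, q3 = 24, 42  # quartiles from the boxplot (IQR = 18)
mild_fences = fences(q1, q3)  # k = 1.5
extreme_fences = fences(q1, q3, k=3)  # k = 3 for extreme outliers

# Hypothetical observations: the boxplot's five values plus two large ones.
observations = [12, 24, 31, 42, 58, 75, 100]
outliers = find_outliers(observations, q1, q3)
extreme_outliers = find_outliers(observations, q1, q3, k=3)

print(mild_fences, extreme_fences, outliers, extreme_outliers)
```

With these quartiles the boxplot's own minimum (12) and maximum (58) lie inside the fences, so the plotted data contain no outliers.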

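Measure of variation: variance

The variance is the average of the squared deviations of the values from the mean.

Population variance:

$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

Sample variance (divided by n); the second form is faster to calculate:

$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n}$

Sample quasi-variance, or corrected sample variance (divided by n - 1):

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$

They are related via $\hat{\sigma}^2 = \frac{n - 1}{n} s^2$.

If a, b ($b \neq 0$) are real numbers and $y = a + bx$, then $s_y^2 = b^2 s_x^2$.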

Measure of variation: standard deviation (SD)

The most commonly used measure of spread; it shows variation about the mean.
The population standard deviation, the sample standard deviation and the sample quasi-standard deviation are, respectively,

$\sigma = \sqrt{\sigma^2}$, $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$, $s = \sqrt{s^2}$

The SD has the same units as the original data, whereas the variance is in squared units.
Variance and SD are both affected by outliers.

Calculating variance and standard deviation

Example:
X: 11, 12, 13, 16, 16, 17, 18, 21
Y: 14, 15, 15, 15, 16, 16, 16, 17
Z: 11, 11, 11, 12, 19, 20, 20, 20

$\bar{x} = \frac{124}{8} = 15.5$, $\bar{y} = \frac{124}{8} = 15.5$, $\bar{z} = \frac{124}{8} = 15.5$

$\sum_{i=1}^{n} x_i^2 = 11^2 + 12^2 + \dots + 21^2 = 2000$
$\sum_{i=1}^{n} y_i^2 = 14^2 + 15^2 + \dots + 17^2 = 1928$
$\sum_{i=1}^{n} z_i^2 = 11^2 + 11^2 + \dots + 20^2 = 2068$

$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{2000 - 8(15.5)^2}{8 - 1} = \frac{78}{7} = 11.1429$, so $s_x = 3.3381$
$s_y^2 = \frac{1928 - 8(15.5)^2}{8 - 1} = \frac{6}{7} = 0.8571$, so $s_y = 0.9258$
$s_z^2 = \frac{2068 - 8(15.5)^2}{8 - 1} = \frac{146}{7} = 20.8571$, so $s_z = 4.5670$
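The calculation above can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic, with illustrative function names: it uses the shortcut $(\sum x_i^2 - n\bar{x}^2)/(n - 1)$ for the quasi-variance and the relation $\hat{\sigma}^2 = \frac{n-1}{n} s^2$ for the sample variance.

```python
from math import sqrt


def quasi_variance(data):
    """Corrected sample variance s^2 = (sum x_i^2 - n * mean^2) / (n - 1)."""
    n = len(data)
    mean = sum(data) / n
    return (sum(v * v for v in data) - n * mean ** 2) / (n - 1)


def sample_variance(data):
    """Sample variance (divided by n), obtained from s^2 via (n - 1) / n."""
    n = len(data)
    return (n - 1) / n * quasi_variance(data)


samples = {
    "X": [11, 12, 13, 16, 16, 17, 18, 21],
    "Y": [14, 15, 15, 15, 16, 16, 16, 17],
    "Z": [11, 11, 11, 12, 19, 20, 20, 20],
}

for name, values in samples.items():
    s2 = quasi_variance(values)
    print(name, round(s2, 4), round(sqrt(s2), 4))
```

The shortcut form is convenient by hand, but with floating-point numbers and large values it can lose precision; summing squared deviations from the mean is the safer choice in code.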

Comparing standard deviations

Example (cont.):
X: 11, 12, 13, 16, 16, 17, 18, 21
Y: 14, 15, 15, 15, 16, 16, 16, 17
Z: 11, 11, 11, 12, 19, 20, 20, 20

[Figure: dot plots of X, Y and Z on a common axis from 11 to 21.]
X: $\bar{x} = 15.5$, $s_x = 3.3$
Y: $\bar{y} = 15.5$, $s_y = 0.9$
Z: $\bar{z} = 15.5$, $s_z = 4.6$

Measure of variation: coefficient of variation (CV)

It measures relative variation and is defined as

$CV = \frac{s}{\bar{x}}$

It is a unitless number (sometimes given as a percentage).
It shows variation relative to the mean.

Example:
Stock A: average price last year = 50, standard deviation = 5
Stock B: average price last year = 100, standard deviation = 5

$CV_A = \frac{5}{50} = 0.10$, $CV_B = \frac{5}{100} = 0.05$

Both stocks have the same SD, but stock B is less variable relative to its mean price.
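The stock comparison can be checked with a very small Python sketch. The function name is illustrative, and the guard against a zero mean is an addition of this sketch, since the ratio is undefined in that case.

```python
def coefficient_of_variation(sd, mean):
    """CV = s / mean; undefined when the mean is zero."""
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return sd / mean


cv_a = coefficient_of_variation(5, 50)   # stock A
cv_b = coefficient_of_variation(5, 100)  # stock B

# Stock B has the smaller relative variation.
print(cv_a, cv_b, cv_b < cv_a)
```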

Numerical summaries and frequency tables. Standardization.

If the data are discrete, with k distinct values $x_i$ and absolute frequencies $n_i$, then

$\bar{x} = \frac{\sum_{i=1}^{k} x_i n_i}{n}$ and $s^2 = \frac{\sum_{i=1}^{k} x_i^2 n_i - n\bar{x}^2}{n - 1}$

If the data are continuous, we replace $x_i$ in the above definition by the midpoints of the class intervals.

To standardize a variable x means to calculate

$z = \frac{x - \bar{x}}{s}$

If you apply this formula to all observations $x_1, \dots, x_n$ and call the transformed values $z_1, \dots, z_n$, then the mean of the z's is zero and their standard deviation is one.
Standardization = finding the z-score.
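The claim that standardized values have mean zero and standard deviation one is easy to verify numerically. The sketch below uses `statistics.mean` and `statistics.stdev` (the latter divides by n - 1, matching the quasi-standard deviation s); the data are the X sample from the variance example, and the function name is illustrative.

```python
from statistics import mean, stdev


def standardize(data):
    """z-scores: subtract the mean and divide by the quasi-standard deviation."""
    m = mean(data)
    s = stdev(data)  # uses the n - 1 denominator
    return [(v - m) / s for v in data]


x = [11, 12, 13, 16, 16, 17, 18, 21]
z = standardize(x)

# Up to floating-point rounding, the z-scores have mean 0 and SD 1.
print(round(mean(z), 10), round(stdev(z), 10))
```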

Measures of form

Fisher's coefficient of asymmetry
Fisher's coefficient of kurtosis
Empirical rule

Shape: comparing mode, mean and median

Three types of distributions:
Skewed to the left: Mean < Median < Mode ($\bar{x} < M$)
Symmetric: Mean = Median = Mode ($\bar{x} = M$)
Skewed to the right: Mode < Median < Mean ($M < \bar{x}$)

Note: the symmetric distribution in the middle is known as bell-shaped or normal.

Measures of form: asymmetry

Fisher's coefficient of asymmetry:

$\gamma_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{S^3}$

The data are skewed to the right (positive skew) if $\gamma_1 > 0$, and vice versa.

[Figure: two histograms. Right-skewed data with $\gamma_1 = 2.236$; left-skewed data with $\gamma_1 = -1.401$.]
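A minimal Python sketch of the coefficient follows. The slide does not say which standard deviation S denotes; this sketch assumes the version divided by n, and using the quasi-standard deviation s instead would change the magnitude slightly but not the sign. The function name and the test data are illustrative.

```python
from math import sqrt


def skewness(data):
    """Fisher's coefficient of asymmetry: third central moment over S^3.

    Assumption: S is the standard deviation computed with the n denominator.
    """
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in data) / n  # third central moment
    return m3 / sqrt(m2) ** 3


symmetric = [1, 2, 3, 4, 5]
right_skewed = [3, 1, 50, 4, 2]  # long tail to the right
left_skewed = [-v for v in right_skewed]  # mirror image

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
```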

Measures of form: kurtosis

Fisher's coefficient of kurtosis:

$\gamma_2 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4}{S^4} - 3$

For the standard normal, $\gamma_2 = 0$. The distribution is leptokurtic (sharper than the standard normal) if $\gamma_2 > 0$, and platykurtic if $\gamma_2 < 0$.

[Figure: two density plots, a leptokurtic distribution and a platykurtic distribution.]
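The kurtosis coefficient can be sketched in Python in the same way. As before, this assumes that S is the standard deviation with the n denominator, which the slide does not specify; the two small data sets are hypothetical and chosen so the result is easy to verify by hand.

```python
def excess_kurtosis(data):
    """Fisher's coefficient of kurtosis: fourth central moment over S^4, minus 3.

    Assumption: S^2 is the variance computed with the n denominator.
    """
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in data) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


# Hypothetical flat, two-point data: m2 = 1, m4 = 1, so gamma_2 = -2.
flat = [-1, 1, -1, 1]

# Hypothetical peaked data with two extreme values:
# m2 = 5, m4 = 125, so gamma_2 = 125 / 25 - 3 = 2.
peaked = [0] * 8 + [-5, 5]

print(excess_kurtosis(flat), excess_kurtosis(peaked))
```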

Empirical rule

If the data are bell-shaped (normal), that is, symmetric and with light tails, the following rule holds:

68 % of the data are in $(\bar{x} - 1s, \bar{x} + 1s)$
95 % of the data are in $(\bar{x} - 2s, \bar{x} + 2s)$
99.7 % of the data are in $(\bar{x} - 3s, \bar{x} + 3s)$

Note: this rule is also known as the 68-95-99.7 rule.

Example: we know that for a sample of 100 observations, the mean is 40 and the quasi-standard deviation is 5. Assuming that the data are bell-shaped, give the limits of an interval that captures 95 % of the observations.

95 % of the $x_i$'s are in $(\bar{x} \pm 2s) = (40 \pm 2(5)) = (30, 50)$.
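The example interval, and the rule itself, can be illustrated with a short Python sketch. The interval uses the mean 40 and quasi-standard deviation 5 from the example; the simulated normal sample (20000 draws with a fixed seed) and the function names are assumptions of this sketch, added only to show that the observed shares come out close to 68 %, 95 % and 99.7 %.

```python
import random
from statistics import mean, stdev


def empirical_interval(center, s, k):
    """Interval (center - k*s, center + k*s)."""
    return (center - k * s, center + k * s)


def share_within(data, center, s, k):
    """Fraction of observations inside the k-standard-deviation interval."""
    low, high = empirical_interval(center, s, k)
    return sum(low < v < high for v in data) / len(data)


# The slide's example: mean 40, quasi-standard deviation 5, k = 2.
interval_95 = empirical_interval(40, 5, 2)

# Simulated bell-shaped data (an assumption of this sketch, not slide data).
random.seed(0)
simulated = [random.gauss(40, 5) for _ in range(20000)]
m, s = mean(simulated), stdev(simulated)
shares = [share_within(simulated, m, s, k) for k in (1, 2, 3)]

print(interval_95, [round(p, 3) for p in shares])
```

The percentages are approximate: they are properties of the normal model, so real data that are only roughly bell-shaped will match them only roughly.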