Lecture 2 Describing Data


Lecture 2 Describing Data Thais Paiva STA 111 - Summer 2013 Term II July 2, 2013

Lecture Plan
1. Types of data
2. Describing the data with plots
3. Summary statistics for central tendency and spread
4. Histograms and boxplots

Data
Recall that:
- A population is composed of units. Units can represent individuals, financial stocks, nations, trees, etc.
- A sample is a collection of units drawn from the population.
- For each unit we can measure several different characteristics (variables).
To present a dataset, we typically use rows to represent units and columns to represent variables.

An Example Dataset

Name  Gender  Salary (thousands)  Education  # of family members
AA    F       50                  B          2
BB    M       43                  B          3
CC    F       65                  M          1
DD    M       200                 B          2
EE    M       60                  M          4
FF    M       25                  S          2
GG    F       15                  S          0
HH    F       80                  D          3
II    M       22                  S          1
JJ    F       69                  B          4
KK    F       70                  M          2

Gender: M = male; F = female
Education: B = bachelor; S = senior high; M = masters; D = doctoral

Types of Variables
Qualitative variables
- Values fall into one of several groups or categories.
- They represent characteristics that cannot be naturally associated with a number.
- Ordinal: although not naturally associated with a number, the values can be ordered (e.g., education).
- Nominal (categorical): there is no natural order (e.g., gender).
Quantitative variables
- Take numerical values for which numerical operations make sense.
- Discrete: take only integer values (e.g., # of family members).
- Continuous: can take any value, sometimes within a specified range (e.g., height).
Sometimes a variable can be treated as either quantitative or qualitative.

Sample versus Population
Recall that:
- a parameter is a descriptive measure of a population;
- a statistic is a descriptive measure of a sample.
In many cases, for statistical inference we want the statistics to approximate the unknown parameters. Some approximations are better than others, and we will see which are good approximations. In this lecture, we focus on statistics calculated from our sample.

Gender
According to the data, there are 5 males and 6 females in the sample. A very simple graph is the bar chart.
[Figure: bar chart of gender; x-axis: F, M; y-axis: frequency, 0 to 6]
Here the y-axis represents frequency, which is simply the number of data values that fall into each category. An alternative is to use percentage.
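The frequencies and percentages behind such a bar chart can be tabulated in a few lines. This is only a sketch: the `gender` list is typed in by hand from the example dataset (rows AA to KK), and the variable names are illustrative, not part of the lecture.

```python
from collections import Counter

# Gender column of the example dataset, in row order AA..KK (typed in by hand).
gender = ["F", "M", "F", "M", "M", "M", "F", "F", "M", "F", "F"]

counts = Counter(gender)  # frequency: number of data values in each category
n = len(gender)
percentages = {g: 100 * c / n for g, c in counts.items()}

for g in sorted(counts):
    print(f"{g}: frequency = {counts[g]}, percentage = {percentages[g]:.1f}%")
```

Plotting `counts` gives the frequency version of the chart; plotting `percentages` gives the percentage version with the same shape.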

Education
[Figure: bar chart of education; x-axis: B, D, M, S; y-axis: frequency, 0 to 4]
A bar chart allows us to quickly see the distribution of the variable in our sample. The distribution of a single variable describes the overall pattern of the data; for example, which value is most common? Here we treat education as nominal. We can also rearrange the x-axis as S, B, M, D and treat education as ordinal.

Salary: Frequencies
Using a bar chart to describe salary is more complicated because salary is continuous. One way is to divide the data into categories, for example smaller or bigger than 100 (thousands).
[Figure: histogram of salary with two bins; x-axis: Salary, 0 to 200; y-axis: Frequency, 0 to 10]
This representation, which creates categories (bins) for continuous data, is called a histogram.

Salary: Histograms
There are many ways to define the salary categories!
[Figure: two histograms of salary, one with bins of width 50K and one with bins of width 20K; x-axis: Salary, 0 to 200; y-axis: Frequency]
Categories can be chosen for scientific reasons or to display certain features of the distribution.

Salary: Distribution
[Figure: density-scale histogram of salary; x-axis: Salary, 0 to 200; y-axis: Density, 0.000 to 0.008]
Instead of plotting the number of data points that fall into each bin, we can convert the counts into relative frequencies. This graph is related to the density of the distribution; more on this later. We simply changed the scale, but the general shape of the distribution remains the same.

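Salary: Distribution
Constructing the distribution requires a little attention because it is based on relative frequencies:
$$\text{relative frequency} = \frac{\text{frequency}}{\text{sample size}}$$
$$f_{<100} = \frac{\#\,\text{salaries} < 100}{\#\,\text{salaries}} = \frac{10}{11}, \qquad f_{\geq 100} = \frac{\#\,\text{salaries} \geq 100}{\#\,\text{salaries}} = \frac{1}{11}$$
The relative frequencies must then be distributed evenly along the cells:
$$\text{height}_{<100} = \frac{f_{<100}}{\text{base}_{<100}} = \frac{10}{11} \times \frac{1}{100 - 0} \approx 0.009$$
$$\text{height}_{\geq 100} = \frac{f_{\geq 100}}{\text{base}_{\geq 100}} = \frac{1}{11} \times \frac{1}{200 - 100} \approx 0.001$$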

Salary: Distribution
A histogram built from relative frequencies has an important property: the total area of all the rectangles is equal to 1.
$$\text{height}_{<100} \times \text{base}_{<100} + \text{height}_{\geq 100} \times \text{base}_{\geq 100} = 0.009 \times (100 - 0) + 0.001 \times (200 - 100) = 1$$
This property is worth remembering! It will be very important later in the course.
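The area property can be checked numerically. The sketch below assumes the two bins from the slides, taken here as [0, 100) and [100, 200] (the treatment of the bin edges is an assumption; no salary equals 100, so it does not affect the result). The unrounded heights are about 0.0091 and 0.0009, which round to the 0.009 and 0.001 used above.

```python
# Salary column of the example dataset, in thousands.
salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]

# Assumed bin edges: [0, 100) and [100, 200]; the last bin includes its right edge.
bins = [(0, 100), (100, 200)]
n = len(salary)

heights = []
for low, high in bins:
    is_last = high == bins[-1][1]
    count = sum(1 for s in salary if low <= s < high or (is_last and s == high))
    relative_frequency = count / n
    # Spread the relative frequency evenly over the width of the bin.
    heights.append(relative_frequency / (high - low))

# Total area of the rectangles: height times base, summed over bins.
total_area = sum(h * (high - low) for h, (low, high) in zip(heights, bins))

print([round(h, 4) for h in heights])
print(round(total_area, 10))
```

The same check works for any choice of bins, as long as every observation falls into exactly one bin.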

Describing Distributions
It is important to convey certain characteristics of a distribution, for example to compare different distributions. We will focus on three specific areas:
1. Shape
2. Center
3. Spread
Questions to ask: Do the extreme values tend to be small or large? What values are most common? Where is the middle of the distribution? How much variability is present?

Shape: Mode
The mode of a variable is the most frequent observation in the data set.
Mode(gender) = FEMALE [6 (F) and 5 (M)]
Mode(education) = BACHELOR [4 (B), 3 (S), 3 (M), 1 (D)]
If you look at the graphs, you will see that the mode coincides with the peak of the graph. For salary, however, the variable is continuous, so we need to group the data first. Again, the mode will depend on the choice of the groups.

Modes
While the mode is a useful statistic, a distribution often exhibits several peaks.
[Figure: two histograms, one unimodal and one bimodal; y-axis: Frequency]

Skewness
Sometimes a variable is more extreme in one direction, e.g. medical costs, housing prices, ...
[Figure: two histograms, one left skewed and one right skewed; y-axis: Frequency]
A symmetric distribution does not exhibit skewness.

Measuring the Center: Mean
Given that $y_1, y_2, \ldots, y_n$ are the $n$ observations within a sample, the sample mean is defined as
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
- This is the same as the average.
- We will use the bar notation to denote a sample mean.
- Note how each observed value contributes equally to the mean.

Example: mean salary
If the salary is NOT grouped, then the mean is
$$\frac{50 + 43 + 65 + 200 + 60 + 25 + 15 + 80 + 22 + 69 + 70}{11} = \frac{699}{11} \approx 63.55$$
If salary is grouped (e.g. by 20K), we can use the midpoint of each group:

Group    0-20  20-40  40-60  60-80  80-180  180-200
n_i      1     2      3      4      0       1
n_i / n  .09   .18    .27    .36    0       .09

$$\frac{1 \times 10 + 2 \times 30 + 3 \times 50 + 4 \times 70 + 0 \times 130 + 1 \times 190}{11} \approx 62.73$$
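Both versions of the mean can be reproduced directly. This is a minimal sketch: the salaries, group midpoints, and group counts are copied from the example, and the variable names are illustrative.

```python
# Salary column of the example dataset, in thousands.
salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]

# Ungrouped mean: every observation contributes equally.
mean_raw = sum(salary) / len(salary)

# Grouped mean: each group is represented by its midpoint, weighted by its count.
midpoints = [10, 30, 50, 70, 130, 190]  # groups 0-20, 20-40, 40-60, 60-80, 80-180, 180-200
counts = [1, 2, 3, 4, 0, 1]
mean_grouped = sum(m * c for m, c in zip(midpoints, counts)) / sum(counts)

print(f"ungrouped mean: {mean_raw:.2f}")
print(f"grouped mean:   {mean_grouped:.2f}")
```

The grouped mean differs slightly from the ungrouped one because the midpoint only approximates the values inside each group.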

Measuring the Center: Median
The median is a statistic based on the order of the observations. It is a particular case of a percentile: the 50th percentile.
Example:
Median(education) = Bachelor, since [S S S B B (B) B M M M D]
Median(# family members) = 2, since [0 1 1 2 2 (2) 2 3 3 4 4]
Median(gender) = ? It cannot be evaluated because the variable is not ordinal.

Salary Median
If the data are NOT grouped, then the median is 60, since [15 22 25 43 50 (60) 65 69 70 80 200].
If the data are grouped:

Group               0-20  20-40  [40-60]  60-80  80-180  180-200
n_i                 1     2      3        4      0       1
n_i / n             .09   .18    .27      .36    0       .09
Cumulative n_i      1     3      (6)      10     10      11
Cumulative n_i / n  0.09  0.27   (0.54)   0.90   0.90    1

The median lies in the group where the cumulative relative frequency first reaches or exceeds 0.50.
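The grouped-median rule amounts to a running sum over the groups. The sketch below uses the group boundaries and counts from the table; `statistics.median` from the Python standard library is used only to confirm the ungrouped value.

```python
import statistics

# Salary column of the example dataset, in thousands.
salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]
median_raw = statistics.median(salary)  # middle value of the sorted data

# Groups and counts from the table above.
groups = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 180), (180, 200)]
counts = [1, 2, 3, 4, 0, 1]
n = sum(counts)

# Walk through the groups, accumulating counts, and stop at the first group
# whose cumulative relative frequency reaches or exceeds 0.50.
cumulative = 0
median_group = None
for group, count in zip(groups, counts):
    cumulative += count
    if cumulative / n >= 0.5:
        median_group = group
        break

print("ungrouped median:", median_raw)
print("median group:", median_group)
```

The grouped version only identifies the group containing the median; the ungrouped value 60 sits at the upper edge of that group.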

Mean versus Median
For measuring central tendency:
- Mean: quantitative data with a symmetric distribution
- Median: quantitative data with a skewed distribution
Because the median is based solely on the rank of the data, for a unimodal distribution:

Shape         Relationship
Left skewed   mean < median
Symmetric     mean ≈ median
Right skewed  mean > median

Spread of a Distribution
When describing a distribution, knowing the center may not be sufficient. For example, financial analysts are usually interested in the expected return (mean) and the risk (spread). In this sense, the spread indicates how much uncertainty (variability) is associated with the expected return. Consider two hypothetical stock returns at times 1, 2, 3, 4, 5:
1. -2, -1, 0, 1, 2 with mean 0
2. -200, -100, 0, 100, 200 with mean 0

Measuring Spread: Range
The range is simply defined as
$$\text{range} = \max - \min$$
Examples:
range(salary) = 200 - 15 = 185
range(# family members) = 4 - 0 = 4
range(gender) = not possible
range(education) = not possible

Measuring Spread: Percentile
The p-th percentile is the observation within the sample such that p% of the remaining observations are equal to or lower than it. The position of the p-th percentile is
$$\frac{p}{100} (n + 1)$$
- the 25th percentile is called the 1st quartile ($q_1$)
- the 50th percentile is called the median
- the 75th percentile is called the 3rd quartile ($q_3$)
Note that the data must be qualitative ordinal or quantitative. Like the median, the quartiles are based on the order of the data.

Measuring Spread: Percentile Salary Example
For salary, the original data are (50 43 65 200 60 25 15 80 22 69 70) and the sorted data are (15 22 25 43 50 60 65 69 70 80 200).
salary[0.25 × (11 + 1)] = salary[3] = 25 -> 25th percentile
salary[0.5 × (11 + 1)] = salary[6] = 60 -> median
salary[0.75 × (11 + 1)] = salary[9] = 70 -> 75th percentile
What if only 10 data points were available? A little bit more difficult!

Measuring Spread: Percentile Example
Suppose you have the following data: y = 1, 3, 6, 8. What are $q_1$, the median, and $q_3$? According to the percentile formula, the locations are
0.25 × 5 = 1.25
0.5 × 5 = 2.5
0.75 × 5 = 3.75
Therefore
q_1 = 1.5 (between 1 and 3)
median = 4.5 (between 3 and 6)
q_3 = 7.5 (between 6 and 8)
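The position rule p/100 × (n + 1) can be turned into a small function. This is a sketch under two assumptions that the slides do not spell out: a fractional position is resolved by linear interpolation between the two neighbouring sorted values (which reproduces 1.5, 4.5, and 7.5 above), and positions outside 1..n are clamped to the smallest or largest value. The function name `percentile` is illustrative; statistical packages may use different conventions and give slightly different answers.

```python
def percentile(data, p):
    """Return the p-th percentile using the (n + 1) position rule.

    Assumes `data` is non-empty. Fractional positions are resolved by
    linear interpolation (an assumption, see the lead-in).
    """
    values = sorted(data)
    n = len(values)
    position = p / 100 * (n + 1)         # 1-based position in the sorted data
    position = min(max(position, 1), n)  # clamp to the valid range (assumption)
    lower_index = int(position)          # integer part of the position
    fraction = position - lower_index    # fractional part of the position
    if lower_index == n:
        return values[-1]
    below = values[lower_index - 1]
    above = values[lower_index]
    return below + fraction * (above - below)


y = [1, 3, 6, 8]
salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]

print([percentile(y, p) for p in (25, 50, 75)])
print([percentile(salary, p) for p in (25, 50, 75)])
```

With 11 salaries the positions are whole numbers (3, 6, 9), so no interpolation is needed; with 4 values every position is fractional.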

Boxplot
The boxplot is a useful graphical representation that combines the five-number summary: minimum, $q_1$, median, $q_3$, maximum.
[Figure: boxplots of Salary (axis 50 to 200) and # of Family Members (axis 0 to 4)]

Measuring Spread: Interquartile Range
- Compared to a histogram, the boxplot shows a lot less detail.
- It is useful for comparing the distribution across different groups.
- The box between the first and third quartiles gives the interquartile range (IQR): $\text{IQR} = q_3 - q_1$. For salary, IQR = 70 - 25 = 45.
- The IQR measures the central spread and is not influenced by the tails - very useful for describing the spread of a skewed distribution!
- The location of the thick median line in the box also shows the skewness around the center of the distribution.

Outliers
Sometimes very extreme (strange) observations, called outliers, can occur, and it is important to identify them. Outliers can:
- represent recording errors
- represent interesting observations
- influence summary statistics (e.g. the mean)
A popular rule is to suspect an observation if it falls more than 1.5 × IQR below the first quartile or above the third quartile.

Modified Boxplot
[Figure: modified boxplot of Salary; axis 50 to 200]
Using the 1.5 × IQR rule, a modified boxplot shows the potential outliers. The lower and upper limits are
25 - 1.5 × (70 - 25) = -42.5
70 + 1.5 × (70 - 25) = 137.5
Here the 200K value is shown as a point. Excluding the outlier, our new sample mean is 49.9K instead of 63.5K!
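The 1.5 × IQR rule is easy to apply in code. The sketch below takes the quartiles 25 and 70 from the salary example as given rather than recomputing them; the variable names are illustrative.

```python
# Salary column of the example dataset, in thousands.
salary = [50, 43, 65, 200, 60, 25, 15, 80, 22, 69, 70]

# Quartiles taken from the salary example.
q1, q3 = 25, 70
iqr = q3 - q1

# Limits of the 1.5 x IQR rule.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Observations outside the limits are flagged as potential outliers.
outliers = [s for s in salary if s < lower_limit or s > upper_limit]
kept = [s for s in salary if lower_limit <= s <= upper_limit]

mean_all = sum(salary) / len(salary)
mean_without_outliers = sum(kept) / len(kept)

print("limits:", lower_limit, upper_limit)
print("potential outliers:", outliers)
print(f"mean with all data: {mean_all:.1f}, without outliers: {mean_without_outliers:.1f}")
```

The lower limit is negative, so no salary can fall below it; only the upper limit matters for this variable.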

Measuring Spread: Variance
The sample variance, $s^2$, is computed as the sum of the squared deviations about the mean, $\bar{x}$, divided by $(n - 1)$:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
variance(gender) = not possible!!
variance(# family members) = [(2 - 2.18)^2 + (3 - 2.18)^2 + (1 - 2.18)^2 + (2 - 2.18)^2 + (4 - 2.18)^2 + (2 - 2.18)^2 + (0 - 2.18)^2 + (3 - 2.18)^2 + (1 - 2.18)^2 + (4 - 2.18)^2 + (2 - 2.18)^2] / (11 - 1) = 1.56

Measuring Spread: Standard Deviation
The standard deviation, $s$, is defined as the square root of the variance: $s = \sqrt{s^2}$.
Example:
sd(gender) = not possible!!
sd(# family members) = $\sqrt{1.56}$ = 1.25
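The variance and standard deviation of the family-members variable can be computed step by step from the definitions. This is a minimal sketch; `statistics.variance` and `statistics.stdev` from the Python standard library (which also divide by n - 1) are used only as a cross-check.

```python
import math
import statistics

# "# of family members" column of the example dataset.
family = [2, 3, 1, 2, 4, 2, 0, 3, 1, 4, 2]

n = len(family)
mean = sum(family) / n

# Sum of squared deviations about the mean, divided by (n - 1).
s2 = sum((x - mean) ** 2 for x in family) / (n - 1)
s = math.sqrt(s2)

print(f"mean = {mean:.2f}, variance = {s2:.2f}, sd = {s:.2f}")

# Cross-check against the standard library's sample variance and sd.
print(statistics.variance(family), statistics.stdev(family))
```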

Degrees of Freedom
The degrees of freedom represent the effective number of values free to vary in the computation of a statistic. In computing the variance, we need to calculate the difference $d_i = (x_i - \bar{x})$ for each observation. However, $\bar{x}$ itself is calculated from the $n$ observations! In fact, if you give me the differences for any $(n - 1)$ observations, I can calculate the remaining one! This is because
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
Remember this property!
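A short numerical check makes the argument concrete: the deviations from the mean sum to zero, so the last one can be recovered from the other n - 1. The sketch uses the family-members variable from the example dataset; the variable names are illustrative.

```python
# "# of family members" column of the example dataset.
family = [2, 3, 1, 2, 4, 2, 0, 3, 1, 4, 2]

mean = sum(family) / len(family)
deviations = [x - mean for x in family]

# The deviations sum to zero (up to floating-point rounding).
total = sum(deviations)

# Given any n - 1 deviations, the remaining one is minus their sum.
known = deviations[:-1]
recovered_last = -sum(known)

print(f"sum of deviations: {total:.10f}")
print(f"last deviation: {deviations[-1]:.4f}, recovered: {recovered_last:.4f}")
```

Only n - 1 of the deviations carry independent information, which is why the variance divides by n - 1.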

Putting It All Together
[Figure: three histograms, panels A, B, and C; x-axis: -10 to 20; y-axis: Frequency]

          A     B     C
Mean      5.2   2.0   1.2
Median    5.1   2.1   0.5
Variance  16.0  1.0   1.3
S.D.      4.0   1.0   1.1

Putting It All Together
[Figure: boxplots of a, b, and c; axis: -5 to 15]

Summary
Measures of Central Tendency:
- Mean
- Median
Measures of Spread:
- Percentiles and quartiles
- Interquartile range
- Variance and standard deviation
Graphical Representations of Distribution:
- Histogram
- Boxplot and modified boxplot
- Mode and skewness