Lecture 1: Review and Exploratory Data Analysis (EDA)


Ani Manichaikul (amanicha@jhsph.edu), 16 April 2007

Course Information I
- Office hours: for questions and help. When? I'll announce this tomorrow.
- Homework: three assignments; follow-up on material from class.
- Written exam
  - When: Wednesday 16 May, 10.00-12.00
  - Where: Multimedia classroom and computer classroom, Ruskeasuo campus (B wing, second floor)

Course Information II
- 16 April 2007 to 16 May 2007
- 08.30-12.30 Monday, Tuesday, Thursday, Friday
  - 08.30-10.15 Lecture
  - 10.15-10.30 Break
  - 10.30-12.30 Informal lecture, class exercise or computer lab
- Activities for the second half of class will vary; also time for questions!

Class goals
Biostat I
- Numbers and probability
- Sampling distributions and inference
- Statistical models and association / causality
Biostat II
- Developing scientific questions
- Translating questions into regression models
- Interpreting results of regression
- Critiquing the literature

Issues and recurring themes
- Populations are complicated... statistical techniques may not capture all of the nuances
- Natural laws will not perfectly predict outcomes
- Signal-to-noise: comparing a trend to its variability
- Bias-variance trade-off: unadjusted vs. adjusted estimates
- Population vs. sample

What is Biostatistics?
- Biostatistics is the use of data to describe and make inferences about a scientific problem
- Remember the "Bio" in Biostatistics!
- Biostatistics has limitations: you can't have it all

Types of Biostatistics
1. Descriptive statistics
   - Exploratory data analysis (EDA): often not in literature
   - Summaries: Table 1 in a paper
   - Goal: to visualize relationships, generate hypotheses
2. Inferential statistics
   - Confirmatory data analysis
   - Methods section of a paper
   - Goal: quantify relationships, test hypotheses

Exploratory Data Analysis (EDA)
- Look at your data! If you can't see it, then don't believe it!
- EDA allows us to:
  1. Visualize distributions and relationships
  2. Detect errors
  3. Assess assumptions for confirmatory analysis
- EDA is the first step of data analysis

EDA methods (One-Way)
- Ordering: stem-and-leaf plots
- Grouping: frequency displays, distributions; histograms
- Summaries: summary statistics, standard deviation, box-and-whisker plots

Stem-and-Leaf Plots I
Age in years (10 observations): 25, 26, 29, 32, 35, 36, 38, 44, 49, 51

Age Interval  Observations
20-29         5 6 9
30-39         2 5 6 8
40-49         4 9
50-59         1
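As a rough illustration of how the display above is built, the sketch below groups the ten ages by their tens digit (the stem) and lists the units digits (the leaves). The function name `stem_and_leaf` and the `stem_width` parameter are assumptions made for this example, not part of the lecture.

```python
from collections import defaultdict

# Age data from the slide
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]


def stem_and_leaf(values, stem_width=10):
    """Map each stem (value // stem_width) to its sorted leaves (value % stem_width)."""
    groups = defaultdict(list)
    for v in sorted(values):  # order the data first
        groups[v // stem_width].append(v % stem_width)
    return dict(groups)


plot = stem_and_leaf(ages)
for stem, leaves in plot.items():
    low = stem * 10
    leaf_text = " ".join(str(leaf) for leaf in leaves)
    print(f"{low}-{low + 9} | {leaf_text}")
```

Sorting before grouping keeps the leaves in increasing order within each stem, which is what makes the display readable as an ordered listing of the data.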

Stem-and-Leaf Plots II
- The age interval is the stem
- The observations are the leaves
- Rule of thumb: the number of stems should roughly equal the square root of the number of observations
- Or the stems should be logical categories

Stem-and-Leaf Plots III
Some statistical programs print output like this, where 2* means 20-29:

Age Interval  Observations
2*            5 6 9
3*            2 5 6 8
4*            4 9
5*            1

Stem-and-Leaf Plots IV
Output may also be shown like this, where 3* means 30-34 and 3. means 35-39:

Age Interval  Observations
2.            5 6 9
3*            2
3.            5 6 8
4*            4
4.            9
5*            1

Frequency Distribution Tables
- Shows the number of observations for each range of data
- Intervals can be chosen in ways similar to stem-and-leaf displays

Age Interval  Frequency
20-29         3
30-39         4
40-49         2
50-59         1

Cumulative Frequency Distribution Tables
Show the frequency, the relative frequency, and the cumulative frequency of observations

Age Interval  Frequency  Cum. Freq.  Rel. Freq.  Cum. Rel. Freq.
20-29         3          3           0.3         0.3
30-39         4          7           0.4         0.7
40-49         2          9           0.2         0.9
50-59         1          10          0.1         1.0

- This table shows an empirical distribution function obtained from a sample
- The true distribution function is the distribution of the entire population
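The columns of the table can be reproduced with a few lines of code. This is a minimal sketch using the slide's age data and intervals; the variable names are chosen for this example only.

```python
from itertools import accumulate

# Age data and intervals from the slides
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
intervals = [(20, 29), (30, 39), (40, 49), (50, 59)]

n = len(ages)
# Frequency: number of observations falling in each interval
freq = [sum(lo <= a <= hi for a in ages) for lo, hi in intervals]
# Cumulative frequency: running total of the frequencies
cum_freq = list(accumulate(freq))
# Relative versions: divide by the sample size
rel_freq = [f / n for f in freq]
cum_rel_freq = [c / n for c in cum_freq]

for (lo, hi), f, cf, rf, crf in zip(intervals, freq, cum_freq, rel_freq, cum_rel_freq):
    print(f"{lo}-{hi}  {f:>2}  {cf:>2}  {rf:.1f}  {crf:.1f}")
```

The cumulative relative frequency column is the empirical distribution function evaluated at the upper end of each interval.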

Histograms
Picture of the frequency or relative frequency distribution

[Figure: Histogram of Age; x-axis: Age, y-axis: Frequency]

Note: Graphs are generally better to use in presentations than tables. They allow your audience to visualize a trend quickly.

Summary Statistics
- Percentiles
- Measures of central tendency
- Measures of dispersion or variability

Percentiles
The $r$th percentile, $P_r$, is the value that is greater than or equal to $r$ percent of a sample of $n$ observations, or less than or equal to $(100 - r)$ percent of the observations

Percentile  Quartile  Formula
$P_{25}$    $Q_1$     $\frac{n+1}{4}$th observation
$P_{50}$    $Q_2$     $\frac{n+1}{2}$th observation
$P_{75}$    $Q_3$     $\frac{3(n+1)}{4}$th observation
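To see what the position formulas give for the age data (n = 10), the short sketch below evaluates them. It only computes positions; how to turn a fractional position into a value depends on a convention (interpolation, rounding, or the median-of-halves approach used in the quartile slides), so no values are assigned here.

```python
n = 10  # sample size of the age example

# Positions (1-based ranks in the ordered data) given by the formulas
positions = {
    "Q1": (n + 1) / 4,
    "Q2": (n + 1) / 2,
    "Q3": 3 * (n + 1) / 4,
}

for name, pos in positions.items():
    print(f"{name}: position {pos} in the ordered data")
```

A position of 5.5 for the median means "halfway between the 5th and 6th ordered observations", which matches the averaging rule used for an even sample size.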

Calculating quartiles I
From the age data, with n = 10: 25, 26, 29, 32, 35, 36, 38, 44, 49, 51

$Q_2$ = median = average of the 5th and 6th observations = $\frac{35 + 36}{2} = 35.5$

Remember to order your data!

Calculating quartiles II
- $Q_1$ = median of lower half of data = third smallest value = 29
- $Q_3$ = median of upper half of data = third largest value = 44

Note: If n is odd, include the median in the upper and lower half of the data.
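The median-of-halves rule described above can be sketched as follows. The function name `quartiles` is an assumption for this example; the handling of odd n follows the note on the slide (the median is included in both halves). Other software may use different quartile conventions and give slightly different values.

```python
from statistics import median

# Age data from the slides
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    data = sorted(values)  # remember to order the data
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        lower, upper = data[:half], data[half:]
    else:
        # Odd n: the median belongs to both halves
        lower, upper = data[: half + 1], data[half:]
    return median(lower), median(data), median(upper)


q1, q2, q3 = quartiles(ages)
print(q1, q2, q3)
```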

Measures of Central Tendency

Measure  Formula
Mean     $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median   Middle observation
Mode     Most frequent observation

From the age example the mean is:
$\frac{25+26+29+32+35+36+38+44+49+51}{10} = 36.5$

The mode is more helpful for categorical data. For example, the most frequent age interval is 30-39, which has 4 observations.
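A small sketch of the three measures for the age data, using Python's standard library. Because every age occurs only once, the mode is taken over the 10-year age intervals, as on the slide; the variable names are assumptions for this example.

```python
from collections import Counter
from statistics import mean, median

# Age data from the slides
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]

mean_age = mean(ages)      # sum of the observations divided by n
median_age = median(ages)  # average of the two middle observations (n is even)

# Mode of the grouped data: count observations per 10-year interval
interval_counts = Counter(10 * (a // 10) for a in ages)
modal_start, modal_count = interval_counts.most_common(1)[0]

print(mean_age, median_age)
print(f"Modal interval: {modal_start}-{modal_start + 9} ({modal_count} observations)")
```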

Measures of spread: Range
- Range = max - min: the difference between the maximum and minimum values
- From the age example: Max = 51, Min = 25, so Range = 51 - 25 = 26

Measures of spread: Variance
- Variance = expected value of the squared deviation of the observations from the true mean:
  $\sigma^2 = E[(X - \mu_X)^2]$
- Sample variance = average of the squared deviations of the observations from the sample mean (with divisor $n - 1$):
  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
- Sample variance from the age example = 82.9

Standard deviation
- Standard deviation = square root of the variance:
  $\sigma = \sqrt{E[(X - \mu_X)^2]}$
- Sample standard deviation = square root of the sample variance:
  $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
- From the age data: $s = \sqrt{82.9} = 9.1$

Note: The units of the variance are years$^2$, while the units of the standard deviation are years.

Interpretation: The standard deviation gives an idea of how much observations differ from the mean
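The sample variance and standard deviation for the age data can be checked with the sketch below. It computes the variance once by hand from the definition and once with the standard library's `statistics.variance`, which also uses the n - 1 divisor.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Age data from the slides
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]

n = len(ages)
x_bar = mean(ages)

# By hand: sum of squared deviations from the sample mean, divided by n - 1
sum_sq_dev = sum((x - x_bar) ** 2 for x in ages)
s2_by_hand = sum_sq_dev / (n - 1)
s_by_hand = sqrt(s2_by_hand)

# Library versions (sample variance and sample standard deviation)
s2 = variance(ages)
s = stdev(ages)

print(round(s2_by_hand, 1), round(s_by_hand, 1))
print(round(s2, 1), round(s, 1))
```

The sum of squared deviations is 746.5, so the sample variance is 746.5 / 9, which rounds to the 82.9 quoted on the slide.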

Box-and-whisker plots I
Box-and-whisker plots display quartiles. Some terminology:
- Upper Hinge = $Q_3$ = third quartile
- Lower Hinge = $Q_1$ = first quartile
- Interquartile range (IQR) = $Q_3 - Q_1$; contains the middle 50% of the data
- Upper Fence = Upper Hinge + 1.5 * (IQR)
- Lower Fence = Lower Hinge - 1.5 * (IQR)
- Outliers: data values beyond the fences
- Whiskers are drawn to the smallest and largest observations within the fences

Box-and-whisker plots II

[Figure: Boxplot of Age; axis: Age in Years]

- IQR = 44 - 29 = 15
- Upper Fence = 44 + 15 * 1.5 = 66.5
- Lower Fence = 29 - 15 * 1.5 = 6.5
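The fence arithmetic, and the way fences are used to flag outliers and place whiskers, can be sketched as below. The helper name `fences` and its parameter `k` are assumptions for this example; the quartile values 29 and 44 are the ones computed in the slides.

```python
# Age data and quartiles from the slides
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
q1, q3 = 29, 44


def fences(lower_hinge, upper_hinge, k=1.5):
    """Return (lower fence, upper fence) at k * IQR beyond the hinges."""
    iqr = upper_hinge - lower_hinge
    return lower_hinge - k * iqr, upper_hinge + k * iqr


lower_fence, upper_fence = fences(q1, q3)

# Outliers are values beyond the fences
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

# Whiskers reach the smallest and largest observations within the fences
inside = [a for a in ages if lower_fence <= a <= upper_fence]
whiskers = (min(inside), max(inside))

print(lower_fence, upper_fence, outliers, whiskers)
```

For the age data no value lies beyond the fences, so the whiskers simply extend to the minimum and maximum.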

Pairwise EDA
- 2 categorical variables: frequency table
- 1 categorical, 1 continuous variable: stratified stem-and-leaf plots, side-by-side box plots
- 2 continuous variables: scatterplot

2 Categorical Variables: Frequency Table

Age Interval  Female  Male  Total
20-29         1       2     3
30-39         2       2     4
40-49         1       1     2
50-59         1       0     1
Total         5       5     10

Looks like the men tend to be younger than the women in this example.

1 Categorical and 1 Continuous Variable I: Stratified Stem-and-Leaf Plots

              Female  Male
Age Interval  Obs.    Obs.
20-29         6       5 9
30-39         5 6     2 8
40-49         9       4
50-59         1
Total         5       5     (10 overall)

1 Categorical and 1 Continuous Variable II: Side-by-Side Box Plots

[Figure: Boxplot of Age by Gender; y-axis: Age in Years; groups: Female, Male]

Allows us to compare the distribution of the continuous variable (age) across values of the categorical variable (gender)
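A numerical counterpart to side-by-side box plots is to summarize the continuous variable within each category. The sketch below does this for the ages read off the stratified stem-and-leaf plot (female: 26, 35, 36, 49, 51; male: 25, 29, 32, 38, 44); the dictionary layout is an assumption for this example.

```python
from statistics import mean, median

# Ages by gender, read from the stratified stem-and-leaf plot
ages_by_gender = {
    "Female": [26, 35, 36, 49, 51],
    "Male": [25, 29, 32, 38, 44],
}

# Per-group summaries that a side-by-side box plot would let us compare
summary = {
    gender: {
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }
    for gender, values in ages_by_gender.items()
}

for gender, stats in summary.items():
    print(gender, stats)
```

Both the median and the mean are lower for the men, in line with the impression from the two-way frequency table.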

2 Continuous Variables: Scatterplot

[Figure: Age by Height; x-axis: Age in Years, y-axis: Height in Centimeters]

Scatterplots visually display the relationship between two continuous variables

EDA: What to notice
- Shape
- Center
- Spread

Common Distribution Shapes
- Symmetrical and bell shaped
- Positively skewed or skewed to the right
- Negatively skewed or skewed to the left

Other Distribution Shapes
- Bimodal
- Reverse J shaped
- Uniform

Measures of Center
- Mode: peak(s)
- Median: equal areas point
- Mean: balancing point

Skewness I
Positively skewed (skewed to the right)
- Longer tail in the high values
- Mean > Median > Mode

[Figure: positively skewed distribution with the mode, median and mean marked from left to right]

Skewness II
Negatively skewed (skewed to the left)
- Longer tail in the low values
- Mode > Median > Mean

[Figure: negatively skewed distribution with the mean, median and mode marked]

Symmetric
- Right and left sides are mirror images
- Left tail looks like right tail
- Mean = Median = Mode

EDA: What to notice
Outliers
- Values that are far from the bulk of the data
- Outliers can influence the value of some statistical measures

Age example:

Data                     Mean
Original                 36.5
With 80-year-old added   40.5
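The effect of the added 80-year-old can be verified directly. The sketch below recomputes the mean from the slide and, as an addition not shown on the slide, the median for comparison.

```python
from statistics import mean, median

# Age data from the slides, and the same data with an 80-year-old added
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
with_outlier = ages + [80]

mean_before, mean_after = mean(ages), mean(with_outlier)
median_before, median_after = median(ages), median(with_outlier)

print(f"Mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"Median: {median_before:.1f} -> {median_after:.1f}")
```

The single extreme value moves the mean by about 4 years but the median by only half a year, which illustrates why the median is often described as less sensitive to outliers.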

Take Home Message
- Look at your data FIRST!
- Happy exploring!