Exploratory Data Analysis


Stemplot and Boxplot

STAT 74 Descriptive Statistics

Stemplots (or Stem-and-Leaf Plots)

- The leading digits are called stems.
- The final digits are called leaves.

Example: number of hysterectomies performed by 15 male doctors:

27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28

Stemplot of the male data, with leaves in data order (left) and sorted (right):

  2 | 7 5 5 0 8        2 | 0 5 5 7 8
  3 | 3 1 7 6 4        3 | 1 3 4 6 7
  4 | 4                4 | 4
  5 | 0 9              5 | 0 9
  6 |                  6 |
  7 |                  7 |
  8 | 6 5              8 | 5 6

By female doctors, the numbers are:

5, 7, 10, 14, 18, 19, 25, 29, 31, 33

Back-to-back stemplot

   Female | Stem | Male
      7 5 |  0   |
  9 8 4 0 |  1   |
      9 5 |  2   | 0 5 5 7 8
      3 1 |  3   | 1 3 4 6 7
          |  4   | 4
          |  5   | 0 9
          |  6   |
          |  7   |
          |  8   | 5 6

Example: height data with gender (see data sheet)

Female: [?], 6?, 64, 65, 65, 65, 66, 67
Male: 6?, 64, 69, 7?, 7?, 7?, 72, 72, 72, 72, 73, 74, 75

Back-to-back stemplot

         Female | Stem | Male
  7 6 5 5 5 4 ? |  6   | ? 4 9
                |  7   | ? ? ? 2 2 2 2 3 4 5

Split back-to-back stemplot (* => leaves 0-4, # => leaves 5-9)

     Female | Stem | Male
        4 ? |  6*  | ? 4
  7 6 5 5 5 |  6#  | 9
            |  7*  | ? ? ? 2 2 2 2 3 4
            |  7#  | 5
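The stem-and-leaf construction above can be sketched in a few lines of Python. This is a minimal illustration, assuming non-negative integer data with the tens digit as stem and the ones digit as leaf; the function names `stemplot` and `back_to_back` are hypothetical, and the two lists are the doctor counts from the example as read here.

```python
from collections import defaultdict

def stemplot(values):
    """Map each stem (tens digit) to its sorted leaves (ones digits)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    return leaves

def back_to_back(left, right):
    """Build text rows with left leaves reversed so both sides grow outward from the stem."""
    l, r = stemplot(left), stemplot(right)
    stems = range(min(min(l), min(r)), max(max(l), max(r)) + 1)
    rows = []
    for s in stems:
        left_leaves = " ".join(str(d) for d in reversed(l.get(s, [])))
        right_leaves = " ".join(str(d) for d in r.get(s, []))
        rows.append(f"{left_leaves:>10} | {s} | {right_leaves}")
    return rows

# Assumed data: counts for the male and female doctors in the example.
male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

rows = back_to_back(female, male)
for row in rows:
    print(row)
```

Stems with no observations (here 6 and 7 for the male data) are still printed, which keeps the gap in the distribution visible.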

Examine Distribution

Order Statistics (Hogg & Tanis)

Let $x_1, x_2, \ldots, x_n$ be a random sample from a distribution of continuous type. Let $y_1$ be the smallest of the $x_i$'s, $y_2$ the next smallest in order of magnitude, and so on, with $y_n$ the largest of the $x_i$'s, i.e. $y_1 < y_2 < \cdots < y_n$. Then $y_i$, for $i = 1, 2, \ldots, n$, is called the $i$th order statistic of the sample.

Percentile (measure of location)

If $0 < p < 1$, the $(100p)$th sample percentile, denoted $\tilde{\pi}_p$, has approximately $np$ sample observations less than it and $n(1-p)$ sample observations greater than it. It is undefined if $p < 1/(n+1)$ or $p > n/(n+1)$.

To find $\tilde{\pi}_p$:

1. Compute $(n+1)p$ [position index].
2. If $(n+1)p$ is an integer, then $\tilde{\pi}_p = y_{(n+1)p}$.
3. If $(n+1)p$ is not an integer and is equal to $r + a/b$, where $r$ is an integer and $a/b$ is a proper fraction, then

$$\tilde{\pi}_p = y_r + \frac{a}{b}\,(y_{r+1} - y_r) = \left(1 - \frac{a}{b}\right) y_r + \frac{a}{b}\, y_{r+1}.$$

Example [odd number of data]:

[?], 6?, 6?, 64, 64, 65, 65, 65, 66, 67, 69, 7?, 7?, 7?, 72, 72, 72, 72, 73, 74, 75

Q1 = 64.5, Median = 69, Q3 = 72

To find $\tilde{\pi}_{.25} = \tilde{q}_1$ (the first quartile):

$(n+1)p = (21+1)(.25) = 5.5$, so $r = 5$ and $a/b = .5$, and

$$\tilde{\pi}_{.25} = (1 - .5)(64) + (.5)(65) = 64.5.$$

Similarly, $\tilde{m} = \tilde{\pi}_{.50} = 69$ and $\tilde{\pi}_{.75} = 72$.

Quartiles: The first quartile, Q1, or 25th percentile, is the median of the lower half of the list of ordered observations. The third quartile, Q3, or 75th percentile, is the median of the upper half of the list of ordered observations.
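The $(n+1)p$ position rule can be sketched as follows. This is a minimal illustration rather than a library routine: the function name `sample_percentile` is hypothetical, exact fractions are used so that the integer test on $(n+1)p$ is not affected by rounding, and the data are the female-doctor counts from the stemplot example as read here.

```python
import math
from fractions import Fraction

def sample_percentile(data, p):
    """(100p)th sample percentile by the (n+1)p position rule."""
    y = sorted(data)                      # order statistics y_1 <= ... <= y_n
    n = len(y)
    p = Fraction(p).limit_denominator(10**6)
    if p < Fraction(1, n + 1) or p > Fraction(n, n + 1):
        raise ValueError("percentile undefined: need 1/(n+1) <= p <= n/(n+1)")
    pos = (n + 1) * p                     # position index r + a/b
    r = math.floor(pos)                   # integer part r
    frac = pos - r                        # fractional part a/b
    if frac == 0:
        return float(y[r - 1])            # y_r (lists are 0-indexed)
    return float(y[r - 1] + frac * (y[r] - y[r - 1]))

# Assumed data: counts for the female doctors (n = 10).
female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]
q1 = sample_percentile(female, 0.25)      # position 2.75
median = sample_percentile(female, 0.50)  # position 5.5
q3 = sample_percentile(female, 0.75)      # position 8.25
print(q1, median, q3)
```

Other software may use a different interpolation convention, so quartiles computed elsewhere can differ slightly from this rule.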

Example (data sheet):

[odd number of data]
[?], 6?, 6?, 64, 64, 65, 65, 65, 66, 67, 69, 7?, 7?, 7?, 72, 72, 72, 72, 73, 74, 75
Q1 = 64.5, Median = 69, Q3 = 72

[even number of data]
[?], [?], 6?, 6?, 64, 64, 65, 65, 65, 66, 67, 69, 7?, 7?, 7?, 72, 72, 72, 72, 73, 74, 75
Q1 = 64, Median = 68, Q3 = 72

Interquartile range (IQR) = Q3 - Q1

For the odd-sized data set, IQR = 72 - 64.5 = 7.5. For the even-sized data set, IQR = 72 - 64 = 8.

The five-number summary

- Minimum value
- Q1
- Median
- Q3
- Maximum value

Example (data sheet, without the outlier): Min = [?], Q1 = 64.5, Median = 69, Q3 = 72, Max = 75.

[Figure: boxplot of the data]

Inner and outer fences; mild and extreme outliers

The inner fences are located at a distance of 1.5 IQR below Q1 and at a distance of 1.5 IQR above Q3. The outer fences are located at a distance of 3 IQR below Q1 and at a distance of 3 IQR above Q3.

- Data values falling between the inner and outer fences are considered mild outliers.
- Data values falling outside the outer fences are considered extreme outliers.

When outliers exist, the whiskers extend to the smallest and largest data values within the inner fences.
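The fence rules above can be sketched as a short calculation. This is a minimal illustration under stated assumptions: quartiles use the $(n+1)p$ position rule from the percentile definition, the helper names are hypothetical, and the data are the male-doctor counts from the stemplot example as read here (the height data are not used because some of their digits are unclear).

```python
import math

def percentile(y, p):
    """(n+1)p rule on an already sorted list y (assumes p is in the defined range)."""
    pos = (len(y) + 1) * p
    r = math.floor(pos)
    frac = pos - r
    return y[r - 1] if frac == 0 else y[r - 1] + frac * (y[r] - y[r - 1])

def boxplot_summary(data):
    y = sorted(data)
    q1, med, q3 = (percentile(y, p) for p in (0.25, 0.50, 0.75))
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # outer fences
    extreme = [v for v in y if v < outer[0] or v > outer[1]]
    mild = [v for v in y if (v < inner[0] or v > inner[1]) and v not in extreme]
    inside = [v for v in y if inner[0] <= v <= inner[1]]
    return {
        "five_number": (y[0], q1, med, q3, y[-1]),
        "iqr": iqr,
        "inner_fences": inner,
        "outer_fences": outer,
        "mild": mild,
        "extreme": extreme,
        "whiskers": (inside[0], inside[-1]),   # whisker ends when outliers exist
    }

# Assumed data: counts for the male doctors (n = 15).
male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
summary = boxplot_summary(male)
print(summary)
```

For these data the two largest counts fall between the upper inner and outer fences, so they are flagged as mild outliers and the upper whisker stops at the largest value inside the inner fence.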

[Figure: boxplot annotated with the IQR, the inner fences, and the outer fences]

Side-by-side Box Plot

[Figure: side-by-side boxplots by sex]

Quantile

For data values $y_1, y_2, \ldots, y_n$ such that $y_1 \le y_2 \le \cdots \le y_n$ in a sample data set, $y_r$ is the quantile of order $r/(n+1)$.

A quantile-quantile plot (q-q plot) plots the quantiles of one sample against the quantiles of another sample or of a distribution.

Quantile-Quantile Plot

Examine the distributions of the two samples:

Sample 1: .5, [?], 7.4, [?], 9.2, .9
Sample 2: -.92, -[?], -.2, .2, .62, .4

[Figure: quantile-quantile plot, Sample 2 quantile versus Sample 1 quantile, with the fourth pair of quantiles marked]
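The pairing behind a q-q plot of two equal-sized samples can be sketched as below. This is a minimal illustration: the function names are hypothetical, and the two samples are made-up values chosen only to show the mechanics, not the samples from the slide.

```python
def quantile_orders(n):
    """Order r/(n+1) attached to the r-th smallest value, r = 1..n."""
    return [r / (n + 1) for r in range(1, n + 1)]

def qq_pairs(sample1, sample2):
    """Pair the r-th smallest value of each sample (equal sizes assumed)."""
    if len(sample1) != len(sample2):
        raise ValueError("this sketch assumes samples of equal size")
    return list(zip(sorted(sample1), sorted(sample2)))

# Hypothetical samples of size 6, for illustration only.
sample1 = [6.1, 4.5, 7.4, 9.2, 8.0, 10.9]
sample2 = [0.3, -0.9, 1.4, -0.2, 0.6, -1.5]

orders = quantile_orders(len(sample1))
pairs = qq_pairs(sample1, sample2)
for order, (x, y) in zip(orders, pairs):
    print(f"order {order:.3f}: ({x}, {y})")
```

Plotting the pairs and checking for a straight-line pattern is then a matter of passing them to any plotting tool.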

Examine Distribution with Quantile-Quantile Plot

[Figure: two quantile-quantile plots. In one, the two distributions are about the same; in the other, the two distributions are not the same.]

z-percentile

Example: [?], [?], 7, [?], 9, [?]

Here n = 6, and the fourth smallest value is the [4/(6+1)]th quantile, i.e. the .571 quantile or 57.1th percentile of the distribution of the sample. The 57.1th percentile of a standard normal distribution is the z-score that has 57.1% of the distribution below it. Therefore, use .571 for the area (.5714 is the closest table entry) and look up the z-score corresponding to it, which is around 0.18. So the z-percentile for this .571 quantile is 0.18.

(sample quantile, z-percentile) => ([?], 0.18)

Normal Quantile Plot

[Figure: normal quantile plot, z-percentile versus sample quantile, with the point ([?], 0.18) marked]

Examine Distribution with Normal Quantile Plot

[Figure: two normal quantile plots, one labeled "Close to Normal" and one labeled "Not Normal"]

Examine Other Distributions

- Find the quantiles of the data and the corresponding percentiles of the distribution to be examined.
- Plot the ordered pairs of sample quantiles and their corresponding percentiles of that distribution.
- Examine whether the points follow a straight-line pattern.
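The table lookup in the z-percentile example can be reproduced with the standard library. This is a minimal sketch: the function name `normal_quantile_pairs` is hypothetical, the sample is made up for illustration, and the orders follow the r/(n+1) convention used in the example.

```python
from statistics import NormalDist

def normal_quantile_pairs(sample):
    """Pair each ordered value with the standard normal percentile of order r/(n+1)."""
    y = sorted(sample)
    n = len(y)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(v, std_normal.inv_cdf(r / (n + 1))) for r, v in enumerate(y, start=1)]

# Hypothetical sample of size 6, for illustration only.
sample = [6.1, 4.5, 7.4, 9.2, 8.0, 10.9]
pairs = normal_quantile_pairs(sample)

# Fourth smallest value: order 4/7 = .571, z-percentile about 0.18.
value, z = pairs[3]
print(value, round(z, 2))
```

If the plotted pairs fall close to a straight line, the sample is consistent with a normal distribution.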

Sample Quantile

Let $x_{(1)}$ denote the smallest sample observation, $x_{(2)}$ the 2nd smallest sample observation, and $x_{(n)}$ the largest sample observation. For $i = 1, 2, \ldots, n$, $x_{(i)}$ is the $[(i - .5)/n]$th sample quantile. (Approximation)

Example: [?], [?], 7, [?], 9, [?]

Here n = 6, and the fourth smallest value, $x_{(4)}$, is the $[(4 - .5)/6]$th quantile of the distribution of the sample, i.e. the .583 sample quantile or 58.3 percentile.

Grouping and Displaying Categorical Data

Frequency Table and Charts

Class  | Frequency | Relative Frequency
Female | 9         | 9/22 = .409 = 40.9%
Male   | 13        | 13/22 = .591 = 59.1%
Total  | 22        | 100%

[Figure: bar chart of percent by sex]

[Figure: pie chart with slices of 40.9% and 59.1%]

Examine Bivariate Data

Two Categorical Variables

Contingency Table

             | Cancer | No cancer | Row Total
Smoker       | 20     | 30        | 50
Non-smoker   | 5      | 45        | 50
Column Total | 25     | 75        | 100

Odds of a smoker having cancer: 20/30 = 6/9
Odds of a non-smoker having cancer: 5/45 = 1/9
Odds Ratio = (6/9)/(1/9) = 6

Cluster bar chart

[Figure: cluster bar chart of percent by smoking status (Smoke, Do not smoke), with bars for "Have cancer" and "Do not have cancer"]
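The odds and odds-ratio arithmetic for a 2x2 contingency table can be checked with exact fractions. This is a minimal sketch: the function names are hypothetical, and the cell counts (20 and 30 for smokers, 5 and 45 for non-smokers) are the table values as read here.

```python
from fractions import Fraction

def odds(events, non_events):
    """Odds of the event: count with the event over count without it."""
    return Fraction(events, non_events)

def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table with rows (a, b) and (c, d)."""
    return odds(a, b) / odds(c, d)

# Assumed cell counts: (cancer, no cancer) for smokers and non-smokers.
smoker_odds = odds(20, 30)       # 20/30 = 6/9
nonsmoker_odds = odds(5, 45)     # 5/45 = 1/9
ratio = odds_ratio(20, 30, 5, 45)
print(smoker_odds, nonsmoker_odds, ratio)
```

Using `Fraction` keeps the intermediate odds exact, so the ratio comes out as a whole number rather than a rounded float.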

Two Quantitative Variables

Data: average annual temperature and the mortality index for a type of breast cancer in women in a certain region of Europe.

[Table: Temperature and Mortality Index values]

[Figure: Mortality Index versus Average Temperature]

A Categorical and a Quantitative Variable

Side-by-side Boxplot

[Figure: side-by-side boxplots by sex]

Time Plot

[Figure: Youngstown Homicide Rate by Year, HRATE versus YEAR]

[Figure: National Homicide Rate by Year, National Homicide Rate versus YEAR, shown in two panels]