Data Analysis and Statistical Methods Statistics 651

http://www.stat.tamu.edu/~suhasini/teaching.html

Lecture 10 (MWF) Checking for normality of the data using the QQplot

Suhasini Subba Rao

Checking for Normality (a very rough check)

Suppose x_1, ..., x_n is a sample from a normal distribution with mean µ and variance σ². First order the observations from the smallest to the largest: x_(1), ..., x_(n). Estimate the mean and standard deviation from the data, giving x̄ and s. Plot all the observations on a number line. Locate the mean x̄ on this line and also the intervals [x̄ - s, x̄ + s], [x̄ - 2s, x̄ + 2s] and [x̄ - 3s, x̄ + 3s].

If the observations came from a normal distribution, then roughly:
68% of the observations should lie in the interval [x̄ - s, x̄ + s];
95% of the observations should lie in the interval [x̄ - 2s, x̄ + 2s];
99.7% of the observations should lie in the interval [x̄ - 3s, x̄ + 3s].

To check this, count the number of points in each interval and divide by the total number of observations.
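The counting step of the rough check can be sketched in a few lines of Python. This is only an illustration: the function name `proportion_within` and the simulated sample are assumptions made here, not part of the lecture, which uses JMP.

```python
import random
from statistics import mean, stdev

def proportion_within(data, k):
    """Proportion of observations within k sample standard deviations of the sample mean."""
    xbar, s = mean(data), stdev(data)
    inside = sum(1 for x in data if xbar - k * s <= x <= xbar + k * s)
    return inside / len(data)

# Hypothetical data: 1000 draws from a normal distribution (seed fixed for repeatability).
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Compare the observed proportions with the 68%, 95% and 99.7% benchmarks.
for k, target in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    print(f"within {k}s: {proportion_within(sample, k):.3f} (normal: about {target})")
```

For normal data the three proportions should land near the benchmarks, but landing near them does not prove normality.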

Example: Minimum temperatures

The mean of this distribution is -10C and the standard deviation is 10C. Calculate the proportion of observations within one standard deviation of the mean, within two standard deviations, and so on.

This is an extremely rough way to check for normality. There exist non-normal distributions for which all of the following are still true:
roughly 68% of the observations lie in the interval [x̄ - s, x̄ + s];
95% of the observations lie in the interval [x̄ - 2s, x̄ + 2s];
99.7% of the observations lie in the interval [x̄ - 3s, x̄ + 3s].

Motivating the QQplot

A QQplot orders the data from the smallest to the largest and plots the ordered data against the corresponding normal quantiles. This allows one to check for normality of the data. Precisely:

The data X_1, ..., X_n are ordered from smallest to largest, giving X_(1), ..., X_(n).
Plot X_(i) against the i/n quantile of the normal distribution (omitting the first and last observations).
If the data come from a normal distribution (with the mean and variance estimated from the data), the data (empirical quantiles) will match the normal quantiles, and the plot should lie on a straight line (the x = y line).

A QQplot has nothing to do with linear regression. The line you see in the plot is not the line of best fit.
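The construction just described can be sketched in Python as follows. The function name `qq_points` and the simulated sample are assumptions for illustration only. The code pairs X_(i) with the i/n quantile of a normal distribution fitted to the data and drops the first and last observations, as on the slide. Software such as JMP may use slightly different plotting positions.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Return (normal quantile, ordered observation) pairs for a QQ plot.

    The i-th ordered value is paired with the i/n quantile of a normal
    distribution whose mean and standard deviation are estimated from the
    data. The first and last observations are omitted (the n/n quantile
    would be infinite).
    """
    ordered = sorted(data)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    return [(fitted.inv_cdf(i / n), ordered[i - 1]) for i in range(2, n)]

# Hypothetical data: 500 draws from a normal distribution.
rng = random.Random(10)
sample = [rng.gauss(64.5, 2.5) for _ in range(500)]
points = qq_points(sample)

# For normal data the pairs should sit close to the x = y line.
worst = max(abs(q - x) for q, x in points)
print(f"{len(points)} points, largest gap from the x = y line: {worst:.2f}")
```

Plotting the pairs (with any plotting tool) together with the x = y line gives the QQplot.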

Checking for normality: The QQ plot

This plots what has been described above. The QQplot consists of points and a straight 45 degree line.

[Figure: the ordered observations X_(1), ..., X_(5) plotted against y_(1), ..., y_(5), together with the x = y line.]

If the points tend to lie on the straight line, then this suggests the observations come from a normal distribution.

Making a QQplot in JMP

Always use software to make a QQplot.

Analyze > Distribution. A window will pop up. Highlight the variable, press Y, Columns and press okay. Once the histogram pops up, click on the red triangle, then click on Normal Quantile Plot.

Example: Antarctic maximum temperatures

This is the histogram and QQplot of the maximum temperatures in Antarctica. It would appear that the maximum temperatures are close to normal. The mean of this data set is µ = 4.5 and the standard deviation is σ = 2.16.

Using this information we can calculate probabilities.

Question: This month the maximum temperature is 7 degrees. What is its percentile?

Answer: We assume normality: P(X ≤ 7) = P(Z ≤ (7 - 4.5)/2.16) = 0.87. Assuming normality, 7C is at the 87th percentile.

The percentile can be checked by using the actual data to calculate it. Based on the data, the proportion of temperatures less than 7 degrees is about 86.5%. Since 87% and 86.5% are very close, we see that the normal distribution approximates the distribution of maximum temperatures well.
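The two percentiles compared in this example can be computed as in the sketch below. The function names are assumptions for illustration. The temperature data themselves are not reproduced here, so the empirical version is shown on a small made-up list.

```python
from statistics import NormalDist

def normal_percentile(x, mu, sigma):
    """P(X <= x) when X is assumed normal with mean mu and standard deviation sigma."""
    return NormalDist(mu, sigma).cdf(x)

def empirical_percentile(x, data):
    """Proportion of the observations that are less than or equal to x."""
    return sum(1 for v in data if v <= x) / len(data)

# Normal-based percentile of 7 degrees, using the mean 4.5 and sd 2.16 from the slide.
p_normal = normal_percentile(7, 4.5, 2.16)
print(f"normal-based percentile: {p_normal:.3f}")

# Made-up observations, only to show how the data-based proportion is counted.
toy_data = [1, 3, 5, 7, 9]
print(f"empirical percentile (toy data): {empirical_percentile(7, toy_data):.2f}")
```

When the two numbers are close for the real data, as in the maximum temperature example, the normal approximation is doing a reasonable job at that point of the distribution.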

Example: Antarctic minimum temperature QQplot

This is the histogram and QQplot of the minimum temperatures in Antarctica. The distribution of minimum temperatures is far from normal. The mean and standard deviation of this data set are -13.8 degrees and 9.3 degrees respectively.

If we assume normality of the data to calculate the percentile corresponding to -10C, we get P(X ≤ -10) = P(Z ≤ (-10 + 13.8)/9.4) = 0.654 (about 65.4%). But based on the data, the proportion of temperatures less than -10 degrees is about 55%, which is quite different from the proportion calculated using the normal distribution. Approximating the distribution with a normal distribution gives incorrect percentiles/probabilities.

Interpreting a QQ-plot

Some experienced statisticians have shaman-like powers when it comes to interpreting QQ-plots. You don't need them, but it is good to have a feel for them. There are three main features you need to look for:

Left skew. This means the distribution is not symmetric. Find the mode (the highest point of the distribution). The part of the distribution to the right of the mode should be shorter than the part to the left of the mode.

Right skew. This means the distribution is not symmetric. Find the mode (the highest point of the distribution). The part of the distribution to the right of the mode should be longer than the part to the left of the mode.

Heavy tails. This means that values far from the mean are much more likely than under a normal distribution. For example, for a normal distribution most of the observations (99.7%) lie within the interval [x̄ - 3s, x̄ + 3s]. For a heavy-tailed distribution a noticeably smaller proportion lie in this interval.

Skewed distributions

A right skewed distribution (red) has a long right tail (green is normal). For a left skewed distribution the QQplot is the mirror image along the 45 degree line (an arch going upwards and towards the left).

A right skewed distribution and the QQplot

This is right skewed. The QQplot has a U shape.

QQplot of a left skewed distribution

The above indicates a left skewed distribution. The points are arched, going from below the 45 degree line, across it and down again.

Heavy tail distribution

A heavy-tailed distribution has much thicker tails than a normal distribution (the blue are the tails of a normal and the red are the tails of a thick-tailed distribution).

QQplot of a heavy tailed distribution

The plot is like an S. On the left of the plot the points are to the left of the 45 degree line, and towards the right they move to the right of the 45 degree line.

What does thick tailed distribution mean?

Look at the histogram of the following data set (200 observations). Look at the proportion of points outside one, two and three standard deviations of the mean (for a normal distribution the proportions inside are 68%, 95% and 99.7%). A lot more points lie outside than for the normal distribution. Look at the tails: they are higher (thicker) than those of the normal distribution.

The corresponding QQplot

Below we make a QQplot of the above data set. The S shape suggests the distribution has thick tails.

QQplots: Some general warnings

When there are only a limited number of observations, it is extremely difficult to check for normality using a QQplot.

QQplot of the height data

The heights are not normally distributed. The horizontal lines that we see arise because the data are integer valued (heights are given to the nearest inch). There is a mild U shape, which suggests some element of skewness.

QQplot of the average of 5 heights

This does not look normal, but the points are closer to the x = y line than in the QQplot of the raw heights on the previous page.

QQplot of binary data

Let us return to the example of people liking apple juice. 100 people were interviewed and each person was asked whether they like apple juice or not (1 = yes, 0 = no). Here is the data: http://www.stat.tamu.edu/~suhasini/teaching651/apple_juice.txt

34% of this sample liked apple juice. This data is binary (not normal!), which is why you see the two lines. It is clearly not normal, and you cannot make it more normal by increasing the sample size. What does become normal is the sample proportion (which in this case is 34%). This is due to the CLT, which we discuss in lecture 12, and it holds only when the sample size is relatively large.
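The difference between the binary observations and the sample proportion can be illustrated with a small simulation. This is a sketch under an assumption that is not in the lecture: the true proportion is taken to be 0.34, the observed value. The function name `sample_proportion` is also hypothetical.

```python
import random
from statistics import mean, stdev

def sample_proportion(rng, n, p):
    """Proportion of 1s among n simulated yes/no answers, each 1 with probability p."""
    return sum(1 for _ in range(n) if rng.random() < p) / n

rng = random.Random(1)
# Assumed for illustration: the true proportion equals the observed 34%.
# Each entry is the sample proportion from one simulated survey of 100 people.
proportions = [sample_proportion(rng, 100, 0.34) for _ in range(2000)]

centre, spread = mean(proportions), stdev(proportions)
within_one = sum(1 for q in proportions if abs(q - centre) <= spread) / len(proportions)
print(f"mean {centre:.3f}, sd {spread:.3f}, within one sd: {within_one:.2f}")
```

Each individual answer is still 0 or 1, but the simulated proportions pile up around 0.34 in a roughly bell-shaped way, with about two thirds of them within one standard deviation of their mean.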

Simulating data in JMP

Make a new data table. Go to Table > Cols > New Columns.. > (In Column Properties select Formula) > Select Random in the new pop-up window and the distribution you want to simulate from.

In the window above I chose Normal with mean 64.5 and standard deviation 2.5. This means that JMP will draw numbers centred around 64.5 with spread 2.5.
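For readers without JMP, the same kind of simulation can be sketched in Python. The seed and the sample size of 200 are arbitrary choices made here, not values from the lecture.

```python
import random
from statistics import mean, stdev

# Draw 200 values from a normal distribution with mean 64.5 and standard deviation 2.5.
rng = random.Random(651)  # arbitrary seed so the run is repeatable
heights = [rng.gauss(64.5, 2.5) for _ in range(200)]

# The sample mean and standard deviation should be close to 64.5 and 2.5.
print(f"sample mean {mean(heights):.2f}, sample sd {stdev(heights):.2f}")
```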

Transforming Data

If the data is far from normal we often transform it so that it has fewer outliers and is less skewed. Standard transforms for positive data {X_i}:

The log transform: Y_i = log X_i. The variance of the transformed observations tends to be less than the variance of the original observations (sometimes this transformation is called "variance stabilisation"). It is often used when the sample mean and sample variance of X are close to each other.

The power transform: Y_i = X_i^β, β ≠ 0. This transformation tends to control outliers and unskew the data.

The Box-Cox transform: Y_i = (X_i^λ - 1)/λ, λ ≠ 0.
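The three transforms can be written as short Python functions. This is a sketch with hypothetical function names. The `lam == 0` branch follows the usual Box-Cox convention of using the log in that case, which the slide does not state.

```python
import math

def log_transform(xs):
    """Y_i = log X_i, for positive data."""
    return [math.log(x) for x in xs]

def power_transform(xs, beta):
    """Y_i = X_i ** beta, for positive data."""
    return [x ** beta for x in xs]

def box_cox(xs, lam):
    """Y_i = (X_i ** lam - 1) / lam for lam != 0.

    Assumption (standard convention, not on the slide): lam == 0 gives log X_i,
    which is the limit of the formula as lam tends to 0.
    """
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]

data = [1.0, 4.0, 9.0, 100.0]  # made-up positive observations
print(power_transform(data, 0.5))
print(box_cox(data, 0.5))
```

Note how the large value 100 is pulled much closer to the rest of the data after either transform.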

Power transformation: Illustration

On the left is a QQplot of the original data and on the right is the QQplot of the square root of the data (i.e. X_i → √X_i = X_i^(1/2)). Observe how the square root of the data is still skewed, but it is less skewed than the original data. Reducing skewness in data is a very useful way of making the CLT work for smaller sample sizes (see later).
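The reduction in skewness can also be checked numerically. The sketch below uses a moment-based skewness coefficient and a simulated right skewed sample (exponential draws). Both are assumptions chosen here for illustration and are not the data shown in the lecture.

```python
import math
import random
from statistics import mean

def sample_skewness(xs):
    """Moment-based skewness: mean cubed deviation divided by the 1.5 power of the mean squared deviation."""
    m = mean(xs)
    m2 = mean([(x - m) ** 2 for x in xs])
    m3 = mean([(x - m) ** 3 for x in xs])
    return m3 / m2 ** 1.5

# Hypothetical right skewed data: 2000 draws from an exponential distribution.
rng = random.Random(29)
raw = [rng.expovariate(1.0) for _ in range(2000)]
rooted = [math.sqrt(x) for x in raw]

# A symmetric sample has skewness near 0; right skew gives a positive value.
print(f"skewness of raw data:    {sample_skewness(raw):.2f}")
print(f"skewness of square root: {sample_skewness(rooted):.2f}")
```

The square-rooted data should still show positive skewness, but less than the raw data, which matches what the two QQplots show.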

QQ plots and testing for normality

There are statistical tests (I have not defined this yet) for checking normality. One of the most famous is the Kolmogorov-Smirnov test.

QQ plots for other distributions

It is possible to make a QQplot for other distributions, that is, to check whether the observations are drawn from another distribution of interest. The QQplot must be modified for the new distribution (the quantiles of that distribution are compared with the ordered data). If you want to know how, please ask me. Again, the Kolmogorov-Smirnov test can be used to check whether the observations come from the distribution of interest.
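As a rough idea of what the Kolmogorov-Smirnov test measures, the sketch below computes its test statistic: the largest gap between the empirical distribution function of the data and the distribution function of a fitted normal. The function name is hypothetical and only the statistic is computed. Turning it into a formal test needs critical values, and because the mean and standard deviation are estimated from the data here, the standard Kolmogorov-Smirnov tables do not apply directly.

```python
import random
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(data):
    """Largest distance between the empirical CDF and the CDF of a normal
    distribution whose mean and standard deviation are estimated from the data."""
    ordered = sorted(data)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    d = 0.0
    for i, x in enumerate(ordered, start=1):
        f = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

rng = random.Random(30)
normal_sample = [rng.gauss(64.5, 2.5) for _ in range(500)]  # hypothetical normal data
binary_sample = [1] * 34 + [0] * 66  # 34% ones, as in the apple juice example

print(f"normal data: D = {ks_statistic_normal(normal_sample):.3f}")
print(f"binary data: D = {ks_statistic_normal(binary_sample):.3f}")
```

A small D means the data track the fitted normal closely. The binary data give a much larger D, in line with what its QQplot shows.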