

Chapter 2-3: Testing Normality

Introduction

In the previous chapters we discussed a variety of descriptive statistics which assume that the data are normally distributed. This chapter focuses upon testing whether a distribution is normal, and then upon possible ways of transforming the data so that the distribution better approximates the normal distribution. The two most common deviations from normality, skewness and kurtosis, will be discussed here.

Figure 2-3.1 shows the typical shapes of skewed distributions in comparison to the normal distribution. Positively skewed data have a long tail towards the positive or higher-scores side of the distribution. This is because a few very high scores skew the distribution in this direction. In Biomedical Physiology and Kinesiology it is not uncommon to find positively skewed variables. Variables such as skinfold thicknesses and weight are usually positively skewed. Even muscle girths tend to be positively skewed, as a few people train excessively in the gym to produce very large muscles.

Negative skewness is not as common in the types of variables we might encounter in Biomedical Physiology and Kinesiology. An obvious example is the height of basketball players in the NBA. There are very few short players in the league. Some do exist, however, and because there are only a few of them and they are much shorter than the rest of the players, they cause the distribution to be skewed towards the small side.

Figure 2-3.1: Normal, Positively Skewed and Negatively Skewed distributions (panels: Normal, Positive Skewness, Negative Skewness)

Another form of deviation from normality is kurtosis. Figure 2-3.2 shows different kurtic distributions. The normal distribution is referred to as mesokurtic. Rather than asymmetry as described by skewness, kurtosis is a measure of how centrally located the data are within the distribution.
In a leptokurtic distribution the data are bunched more towards the centre, causing the distribution to look thinner and more peaked. In the platykurtic distribution, the shape looks more flattened as the data are more spread out around the centre.

Kurtosis is often the forgotten deviation from normality. Researchers will concern themselves with skewness before they will consider kurtosis. That said, skewness is also often overlooked. Weight and skinfold measures are usually skewed, but researchers rarely correct the problem before applying parametric statistics. You can find thousands of papers in the scientific literature where parametric statistics have been applied to skinfold and weight data, regardless of the skewness. The good news, however, is that although skewness is a violation of the assumption of normality in these parametric tests, the significance of findings is not profoundly affected.

Figure 2-3.2: Mesokurtic (Normal), Platykurtic and Leptokurtic distributions

Coefficient of Skewness

A normal distribution, by definition, is symmetrical; that is, the distribution looks the same on either side of the centre line. Positively and negatively skewed distributions are asymmetrical. The Coefficient of Skewness quantifies this asymmetry.

$$\text{Coefficient of Skewness} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^3}{(N - 1)\, s^3}$$

where $\bar{X}$ is the mean, $s$ is the standard deviation, and $N$ is the number of data points.

If the data are normally distributed the coefficient of skewness is zero. In fact, any symmetric data will have a coefficient of skewness near zero. The sign of the coefficient tells the type of skewness. A positive coefficient of skewness means positive skewness, and a negative coefficient means negative skewness. A coefficient greater than 1 is regarded as significant positive skewness, whereas a coefficient less than -1 is regarded as significant negative skewness. The coefficient of skewness is an option for selection in the SPSS Descriptive statistics dialog box.

Figure 2-3.3 shows the SPSS histogram of the Sum of 5 Skinfolds (S5SF) in 5,362 women from the Canada Fitness Survey (CFS) of 1981.
The red bars show the distribution of the data, whereas a superimposed black line shows a normal distribution with the same mean and standard deviation as the S5SF data. This superimposed line allows you to visually appraise how far your data deviate from the normal distribution. In these data the distribution is positively skewed.

Measurement & Inquiry in Kinesiology

A quantification of the degree of skewness is given by the coefficient of skewness listed in the SPSS Descriptive Statistics output for Weight (WT), Height (HT) and Sum of 5 Skinfolds (S5SF) in the same data, also shown in Figure 2-3.3. The coefficient of skewness for S5SF is 1.043 (significantly skewed). Interestingly, Height (HT) is not skewed (0.09), but Weight (WT) is more skewed than S5SF, with a coefficient of 1.297. The standard error of the coefficient (Std. Error in the output) gives a measure of confidence in the coefficient. The 95% confidence interval of the coefficient is the coefficient of skewness ± 1.96 × the standard error of the coefficient. For weight, the 95% confidence interval for the coefficient of skewness would therefore be:

1.297 ± (1.96 × 0.032) = 1.234 to 1.360

[Histogram: Sum of 5 Skinfolds (Women); x-axis: S5SF; Mean = 75.8, Std. Dev = 29.01, N = 5362]

Figure 2-3.3: SPSS Histogram of Sum of 5 Skinfolds (S5SF) in 5362 Females from the Canada Fitness Survey (1981) and SPSS Descriptive Statistics output for Weight (WT), Height (HT) and Sum of 5 Skinfolds (S5SF) in the same data.
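The coefficient and its confidence interval can also be computed outside SPSS. The following is a minimal Python sketch of the formula as written in this chapter, not of the SPSS routine itself; the function names and the two small data lists are hypothetical and serve only as illustration. The last lines repeat the confidence interval arithmetic for weight using the values quoted above.

```python
import math


def coefficient_of_skewness(values):
    """Chapter's formula: sum((X_i - mean)^3) / ((N - 1) * s^3)."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation, with N - 1 in the denominator.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sum((x - mean) ** 3 for x in values) / ((n - 1) * s ** 3)


def skewness_confidence_interval(coefficient, std_error, z=1.96):
    """95% interval: coefficient +/- 1.96 x standard error of the coefficient."""
    half_width = z * std_error
    return coefficient - half_width, coefficient + half_width


# Hypothetical illustration data (not from the Canada Fitness Survey).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print("symmetric:", coefficient_of_skewness(symmetric))
print("right skewed:", round(coefficient_of_skewness(right_skewed), 3))

# Values quoted in the text for Weight (WT): coefficient 1.297, Std. Error 0.032.
low, high = skewness_confidence_interval(1.297, 0.032)
print(f"95% CI for WT skewness: {low:.3f} to {high:.3f}")
```

A sample whose interval excludes zero, as here, is skewed by more than sampling error alone would explain.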

Coefficient of Kurtosis

As illustrated in Figure 2-3.2, a platykurtic distribution is more flattened, while a leptokurtic distribution is more peaked, than the mesokurtic or normal distribution. The degree of kurtosis is quantified by the Coefficient of Kurtosis.

$$\text{Coefficient of Kurtosis} = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^4}{(N - 1)\, s^4}$$

where $\bar{X}$ is the mean, $s$ is the standard deviation, and $N$ is the number of data points.

Normalizing Data

Many statistical tests are based on the assumption of normally distributed data. As discussed previously, many real data sets are in fact not approximately normal. However, an appropriate transformation of a data set can often yield a transformed data set that does follow approximately a normal distribution. This increases the applicability and usefulness of statistical techniques based on the normality assumption.

A simple data transformation applicable to moderately positive or right-skewed data is the log10 transformation. Figure 2-3.4 shows the frequency distribution for Triceps Skinfold (TPSF) for the CFS data set of 1,765 women aged 20-30 years. The coefficient of skewness shows significant skewness at 1.17, and the histogram illustrates this positive skewness. The lower panel of Figure 2-3.4 shows the distribution of the log10 transform of the data. A new variable (log10TPSF) was produced by calculating the log10 of each TPSF measure. The new distribution is more normally distributed, with a coefficient of skewness of 0.02. In this case the transformation worked well; however, it is not perfect for all situations. It tends to work better in moderately rather than extremely skewed data. A better but more complex transform is the Box-Cox transform, which will be described later in this chapter.
[Upper panel: histogram of TPSF (y-axis: Frequency); µ = 16.4, σ = 5.7, Skewness = 1.17]
[Lower panel: histogram of LOGTPSF (y-axis: Frequency); µ = 1.19, σ = 0.15, Skewness = 0.02]

Figure 2-3.4: SPSS Histograms of Triceps Skinfold (TPSF) and log10TPSF in 1,765 females aged 20-30 years from the Canada Fitness Survey (1981)
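The effect of the log10 transformation on both coefficients can be sketched in a few lines of Python. The data here are simulated, positively skewed values standing in for skinfold measurements (an assumption; they are not the CFS data), and the function names are hypothetical. Both coefficients follow the formulas given in this chapter. Note that under this form of the kurtosis coefficient a normally distributed sample gives a value near 3 rather than 0.

```python
import math
import random


def moment_coefficient(values, power):
    """Chapter's form: sum((X_i - mean)^power) / ((N - 1) * s^power)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sum((x - mean) ** power for x in values) / ((n - 1) * s ** power)


def skewness(values):
    return moment_coefficient(values, 3)


def kurtosis(values):
    return moment_coefficient(values, 4)


# Simulated stand-in for triceps skinfolds in mm (hypothetical, right skewed).
rng = random.Random(0)
tpsf = [rng.lognormvariate(math.log(15.5), 0.35) for _ in range(2000)]

# New variable: the log10 of each measurement, as described in the text.
log_tpsf = [math.log10(x) for x in tpsf]

print(f"raw:   skewness={skewness(tpsf):.2f}  kurtosis={kurtosis(tpsf):.2f}")
print(f"log10: skewness={skewness(log_tpsf):.2f}  kurtosis={kurtosis(log_tpsf):.2f}")
```

Because the simulated values are log-normal by construction, the log10 transform works almost perfectly here; real skinfold data will usually normalize less cleanly.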

Normal Probability Plots

[Panels: Normal P-P Plot of HT; Normal P-P Plot of WT. x-axis: Observed Cum Prob; y-axis: Expected Cum Prob]

Figure 2-3.5: SPSS Expected Cumulative Probability vs Observed Cumulative Probability Plots for Height (HT) and Weight (WT) in women, from the data depicted in Figure 2-3.3

The normal probability plot is a useful tool in determining how normal your distribution is. In the normal probability plot, the cumulative probability for the data (observed) is plotted against the cumulative probability of the data if it were normally distributed (expected), as shown for weight (WT) and height (HT) in Figure 2-3.5. The approximately normally distributed variable, height, can be seen to have a linear relationship between observed and expected values. If the two sets of values agreed perfectly (a correlation of 1) then height would be perfectly normal. The correlation between observed and expected is therefore a measure of normality of the observed scores. The skewed variable, weight, can be seen to have divergent observed scores of cumulative probability, as shown by the bend in the normal probability plot for weight.

The normal probability plots can be called up in SPSS by using the P-P option of the GRAPH menu. Figure 2-3.6 shows the dialog box for this option. The variables to be tested for normality are moved over to the Variables box. Ensure that Normal is selected in the Test Distribution box. SPSS can produce plots to test more than the normal distribution.

Figure 2-3.6: SPSS P-P Plots option of the GRAPH menu to produce normal probability plots
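The idea behind the P-P plot, and the use of the observed-expected correlation as a measure of normality, can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reproduction of the SPSS procedure: the observed cumulative probability uses the plotting position (i - 0.5)/N, which may differ from the SPSS default, and the two samples are simulated (one roughly normal, one right skewed), not the CFS height and weight data.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def pp_points(values):
    """Return (observed, expected) cumulative probabilities for a normal P-P plot."""
    ordered = sorted(values)
    n = len(ordered)
    # Normal distribution with the same mean and standard deviation as the data.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Observed cumulative probability from ranks; (i - 0.5) / n is an assumed choice.
    observed = [(i - 0.5) / n for i in range(1, n + 1)]
    expected = [fitted.cdf(x) for x in ordered]
    return observed, expected


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Simulated samples (hypothetical): one near normal, one positively skewed.
rng = random.Random(1)
height_like = [rng.gauss(162.0, 6.0) for _ in range(500)]
weight_like = [rng.lognormvariate(math.log(58.0), 0.6) for _ in range(500)]

r_height = pearson_r(*pp_points(height_like))
r_weight = pearson_r(*pp_points(weight_like))
print(f"near-normal sample: r = {r_height:.4f}")
print(f"skewed sample:      r = {r_weight:.4f}")
```

Both correlations are high because cumulative probabilities always rise together, so it is the difference between them, and the visible bend in the plot, that signals the departure from normality.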

Box-Cox Transformation

The Box-Cox transformation is a family of transformations, defined as:

$$T(X) = \frac{X^{\lambda} - 1}{\lambda}$$

where $X$ is the response variable and $\lambda$ is the transformation parameter. For $\lambda = 0$, the natural log of the data is taken instead of using the above formula.

As discussed earlier, the normal probability plot gives us an appreciation of the degree of normality of the distribution, as the values of the observed cumulative frequency distribution are plotted against the expected normal cumulative frequency distribution of a variable with the same mean and standard deviation. The correlation between the expected and observed values is a measure of agreement of the observed data with the normal distribution. This correlation coefficient can be used as the criterion for judging the value of λ that best normalizes the distribution. Figure 2-3.7 shows a typical curve of the correlation coefficients found for different values of λ. In this case -0.6 was the value of λ that gave the highest correlation (0.91) between the observed and expected values of the cumulative frequency. -0.6 would therefore be chosen as the value of λ to best transform the data to a normal distribution. Unfortunately SPSS does not carry out the Box-Cox analysis, but we can find the best value of λ using MS EXCEL, as described below.

Figure 2-3.7: Plot of correlations of expected and observed values of cumulative probability curve for different values of λ (y-axis: Correlation; x-axis: λ from -2 to +2). Maximum correlation found for λ = -0.6.

Calculating the Box-Cox λ using MS EXCEL

Rather than using the correlation between expected and observed cumulative frequency values as the criterion of normality, we will use the coefficient of skewness, which will approach 0 the closer the distribution is to normal.
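To see how the skewness criterion behaves across the Box-Cox family, the transformation can be sketched in Python and applied for a handful of λ values. This is a minimal illustration: the function names are hypothetical, the skewness follows the formula given earlier in this chapter, and the 273 values are simulated right-skewed numbers standing in for the skinfold data (an assumption, not the CFS data).

```python
import math
import random


def box_cox(values, lam):
    """T(X) = (X^lam - 1) / lam; the natural log is used when lam is 0."""
    if lam == 0:
        return [math.log(x) for x in values]
    return [(x ** lam - 1) / lam for x in values]


def skewness(values):
    """Chapter's formula: sum((X_i - mean)^3) / ((N - 1) * s^3)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sum((x - mean) ** 3 for x in values) / ((n - 1) * s ** 3)


# Simulated stand-in for a sum of skinfolds in mm (hypothetical data).
rng = random.Random(2)
sum5sf = [rng.lognormvariate(math.log(70.0), 0.4) for _ in range(273)]

# Skewness of the transformed scores for several trial values of lambda.
for lam in (-2, -1, -0.5, 0, 0.5, 1, 2):
    transformed = box_cox(sum5sf, lam)
    print(f"lambda = {lam:+.1f}   skewness = {skewness(transformed):+.3f}")
```

With λ = 1 the transform only shifts the data by 1, so the skewness is that of the raw scores; lowering λ pulls in the long right tail, and the best λ is the one at which the skewness crosses zero.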
Figure 2-3.8 shows an EXCEL set up for the calculation of the best value of λ using the SOLVER function. The data being analysed are the Sum of 5 Skinfolds on 273 women, aged 18 to 19 years from the Canada Fitness Survey data set. The coefficient of skewness for this variable is 1.19; therefore, the data are significantly skewed and a Box-Cox transformation would be in order.

The first steps in calculating the best-fitting value of λ are as follows:

1. Calculate the column of transformed scores for Sum5SF based upon the value of λ entered in cell E1. The value in cell E1 can be any number. Choose a small number similar to the likely answer for λ. In this case 1 was used. It matters little exactly what this number is, since it is only a starting point for SOLVER. The equation entered in B2 is =((A2^$E$1)-1)/$E$1, which is the Box-Cox transform equation shown earlier in the chapter, written in EXCEL computational form with specific cell references.

2. Calculate the coefficient of skewness for the transformed scores. In Figure 2-3.8 this was placed in cell E2. This is achieved using the SKEW() function of EXCEL, which returns the coefficient of skewness of the data in the selected range of cells. In Figure 2-3.8 the equation typed in E2 was =SKEW(B2:B274).

3. Choose the SOLVER function from the TOOLS menu. Figure 2-3.8 shows the SOLVER dialog box.

Figure 2-3.8: MS EXCEL SOLVER set up for Box-Cox transformation calculation.

SOLVER requires you to give the address of the target cell. In this case we give the cell address of the coefficient of skewness, E2. Next, specify whether you want SOLVER to seek a maximum, a minimum, or a specific value. In this case we want the coefficient of skewness to get as close as possible to 0. SOLVER needs to change one or more cells that affect the target cell E2. Thus E1 (the cell containing the value of λ) is entered in the By Changing Cells box. SOLVER is now set up. If you now click Solve, SOLVER will go through a high-speed process of changing the value of λ in cell E1, checking the value of E2, and changing E1 again until E2 reaches the closest possible value to 0.

In this case the value of -0.245 brought the coefficient of skewness closest to 0. Therefore λ = -0.245 would be used to transform the sum of 5 skinfold data to best approximate a normally distributed variable.
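The same search can be sketched without EXCEL. The Python below is an illustration only: it replaces SOLVER's own optimization method with a simple bisection search (a swapped-in technique, chosen because it is easy to follow), it uses the skewness formula from this chapter, and it runs on simulated data standing in for the 273 skinfold values. EXCEL's SKEW() includes a small-sample adjustment that rescales the coefficient by a constant factor for a given N, so both versions reach zero at the same λ. All function names and the search limits of -3 and 3 are assumptions.

```python
import math
import random


def box_cox(values, lam):
    """T(X) = (X^lam - 1) / lam; the natural log is used when lam is 0."""
    if lam == 0:
        return [math.log(x) for x in values]
    return [(x ** lam - 1) / lam for x in values]


def skewness(values):
    """Chapter's formula: sum((X_i - mean)^3) / ((N - 1) * s^3)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return sum((x - mean) ** 3 for x in values) / ((n - 1) * s ** 3)


def best_lambda(values, low=-3.0, high=3.0, tolerance=1e-4):
    """Bisection for the lambda at which the transformed skewness crosses 0.

    Assumes the skewness is negative at `low` and positive at `high`.
    """
    if skewness(box_cox(values, low)) > 0 or skewness(box_cox(values, high)) < 0:
        raise ValueError("skewness does not change sign on the search interval")
    while high - low > tolerance:
        mid = (low + high) / 2
        if skewness(box_cox(values, mid)) < 0:
            low = mid   # still negatively skewed: lambda must be larger
        else:
            high = mid  # positively skewed: lambda must be smaller
    return (low + high) / 2


# Simulated stand-in for the Sum of 5 Skinfolds on 273 women (hypothetical data).
rng = random.Random(2)
sum5sf = [rng.lognormvariate(math.log(70.0), 0.4) for _ in range(273)]

lam = best_lambda(sum5sf)
print(f"raw skewness:         {skewness(sum5sf):+.3f}")
print(f"best lambda:          {lam:+.3f}")
print(f"transformed skewness: {skewness(box_cox(sum5sf, lam)):+.5f}")
```

As with SOLVER, the starting range matters little provided it brackets the answer; if the skewness has the same sign at both limits the sketch raises an error rather than returning a misleading λ.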