Data Distributions and Normality

Definition: (Non)Parametric
Parametric statistics assume that the data come from a normal distribution and make inferences about the parameters of that distribution. These tests compare the means (central tendency) of the distributions relative to their variability (spread).
Non-parametric statistics do not depend on fitting a parameterized distribution such as the normal. These tests compare the medians (the 50th percentile of each distribution) and the ranks of the observations among the samples.

The Normal Distribution
X ~ N(µ, σ)
Every normal distribution can be described using only two parameters: the mean (µ) and the standard deviation (σ).

The Normal Distribution Is the Basis of Parametric Statistics
Parametric statistical methods require that numerical variables approximate a normal distribution. They compare means and standard deviations.
In a normal distribution:
- ~68% of observations fall within 1 standard deviation of the mean
- ~95% within 2 standard deviations
- ~99.7% within 3 standard deviations
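The coverage of a normal distribution within 1, 2, and 3 standard deviations can be checked numerically. The sketch below uses Python's standard-library `statistics.NormalDist`; the helper name `share_within` is an illustrative assumption, not part of the lecture.

```python
from statistics import NormalDist


def share_within(k, mu=0.0, sigma=1.0):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

The result does not depend on the particular µ and σ, which is why the rule applies to every normal distribution.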

Assessing Normality
Three ways to assess the normality of the data:
1) Graphical displays: histogram, density plot, boxplot, Q-Q plot
2) Skewness / kurtosis
- Are they different from 0 (the value for a normal distribution)?
- Rule of thumb: too large (> 1) or too small (< -1)
3) Shapiro-Wilk test: tests whether the data differ from a normal distribution
- Significant = non-normal data
- Non-significant = data consistent with a normal distribution

Assessing Normality
1) Graphical displays: histogram, density plot, boxplot

Assessing Normality
1) More graphical displays
Q-Q plot (quantile-quantile plot): compares the quantiles of the observed data with the quantiles of theoretical data from a normal distribution.
OPTIONS tab: select the type and the parameters of the theoretical distribution. Default: Normal.
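To make the idea of a Q-Q plot concrete, here is a minimal sketch that computes the plotted points by hand instead of using Rcmdr. It uses the 18 age guesses from the estimation example in this lecture. The plotting positions (i - 0.5) / n are an assumption (one common convention); Rcmdr may use a slightly different one.

```python
from statistics import NormalDist, mean, stdev

# The 18 guesses of the professor's age used in the estimation example.
ages = [34, 36, 37, 37, 38, 38, 38, 38, 39, 40, 40, 41, 41, 42, 42, 42, 42, 48]


def qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    observed = sorted(data)
    n = len(observed)
    # Reference normal distribution with the sample mean and sample S.D.
    reference = NormalDist(mean(observed), stdev(observed))
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, observed))


points = qq_points(ages)
for theo, obs in points:
    print(f"{theo:6.2f}  {obs}")
```

If the data were normal, the pairs would fall close to the line observed = theoretical; points far from that line, typically at the ends, are candidate outliers.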

Assessing Normality
Q-Q plot (quantile-quantile plot): things to look for
- How many points are plotted?
- Are there any outliers?

Quantifying Distributions
2) Skewness: a measure of the symmetry of a distribution.
- Symmetric distributions have skewness = 0.
- Positive skew: the mean is larger than the median; skewness > 0.
- Negative skew: the mean is smaller than the median; skewness < 0.

Quantifying Distributions
2) Kurtosis: a measure of the degree to which observations cluster in the tails or in the center of the distribution, reported relative to the normal distribution (kurtosis = 0).
- Positive kurtosis: heavier tails (more extreme values) and a sharper peak than a normal distribution. Leptokurtic.
- Negative kurtosis: lighter tails (fewer extreme values) and a flatter peak than a normal distribution. Platykurtic.
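Skewness and kurtosis can be computed directly from the moments of the data. The sketch below uses the simple moment-based definitions (third and fourth central moments divided by powers of the second), with 3 subtracted from kurtosis so that a normal distribution scores 0. This choice of formula is an assumption: statistical packages, including those behind Rcmdr, often apply small-sample corrections and may report slightly different values. The data are the 18 age guesses from the estimation example in this lecture.

```python
from statistics import mean

ages = [34, 36, 37, 37, 38, 38, 38, 38, 39, 40, 40, 41, 41, 42, 42, 42, 42, 48]


def central_moment(data, order):
    """Average of (x - mean)**order over the data (divides by n, not n - 1)."""
    m = mean(data)
    return sum((x - m) ** order for x in data) / len(data)


def skewness(data):
    """Moment-based skewness: 0 for a symmetric distribution."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def excess_kurtosis(data):
    """Moment-based kurtosis minus 3: 0 for a normal distribution."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3


print(f"skewness        = {skewness(ages):.2f}")
print(f"excess kurtosis = {excess_kurtosis(ages):.2f}")
```

For these data the skewness is positive, matching the mean (39.6) being slightly larger than the median (39.5); the single guess of 48 is what pulls the right tail out.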

Assessing Normality - Example
- Use the Normality.Example.xls dataset (posted on the class web-site)
- Follow along with this example using Rcmdr
- Open RStudio and activate Rcmdr
- Import the dataset and start exploring

An Example in Estimation
How old is your professor?
N = 18 guesses, range = 34 to 48
Age (yrs): 34 36 37 37 38 38 38 38 39 40 40 41 41 42 42 42 42 48

An Example in Estimation
How old is your professor? N = 18 guesses
What is the midpoint value?
Age (yrs): 34 36 37 37 38 38 38 38 39 40 40 41 41 42 42 42 42 48

An Example in Estimation
N = 18 guesses
Mean = 39.6, Median = 39.5, S.D. = 3.1
value   frequency   relative frequency
34      1           0.056
35      0           0.000
36      1           0.056
37      2           0.111
38      4           0.222
39      1           0.056
40      2           0.111
41      2           0.111
42      4           0.222
43      0           0.000
44      0           0.000
45      0           0.000
46      0           0.000
47      0           0.000
48      1           0.056
sum     18          1
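The summary statistics and the frequency table for the 18 guesses can be reproduced with a few lines of standard-library Python. This is a sketch for checking the numbers by hand, not the Rcmdr workflow used in class; `stdev` is the sample standard deviation (dividing by n - 1).

```python
from collections import Counter
from statistics import mean, median, stdev

ages = [34, 36, 37, 37, 38, 38, 38, 38, 39, 40, 40, 41, 41, 42, 42, 42, 42, 48]

n = len(ages)
summary = {
    "n": n,
    "mean": round(mean(ages), 1),
    "median": median(ages),
    "sd": round(stdev(ages), 1),  # sample S.D. (n - 1 in the denominator)
}
print(summary)

# Frequency and relative frequency for every whole-year value in the range.
counts = Counter(ages)
table = [(value, counts[value], counts[value] / n)
         for value in range(min(ages), max(ages) + 1)]
for value, freq, rel in table:
    print(f"{value:5d} {freq:9d} {rel:18.3f}")
```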

An Example in Estimation
N = 18 guesses
Percentiles: 5% = 34, 25% = 38, 50% = 39.5, 75% = 42, 95% = 48
value   relative freq.   cumulative freq.
34      0.056            0.056
35      0.000            0.056
36      0.056            0.111
37      0.111            0.222
38      0.222            0.444
39      0.056            0.500
40      0.111            0.611
41      0.111            0.722
42      0.222            0.944
43      0.000            0.944
44      0.000            0.944
45      0.000            0.944
46      0.000            0.944
47      0.000            0.944
48      0.056            1.000
sum     1
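The cumulative frequency column is a running total of the relative frequencies. A minimal sketch, again using only the standard library, accumulates the counts first and divides by n at the end so that rounding error does not build up:

```python
from collections import Counter
from itertools import accumulate

ages = [34, 36, 37, 37, 38, 38, 38, 38, 39, 40, 40, 41, 41, 42, 42, 42, 42, 48]

n = len(ages)
counts = Counter(ages)
values = list(range(min(ages), max(ages) + 1))

# Running total of counts, converted to a proportion of all guesses.
cumulative = {value: running / n
              for value, running in zip(values, accumulate(counts[v] for v in values))}

for value in values:
    print(f"{value:5d} {cumulative[value]:8.3f}")
```

The cumulative frequency reaches exactly 0.500 at age 39, meaning half of the guesses are 39 or lower and half are 40 or higher, which is why the median is reported as 39.5.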

Data Summary with Rcmdr Summaries: - Active data set

Data Summary with Rcmdr Summaries: - Numerical summaries

Normality Test with Rcmdr
Test of normality:
- Select the data
- Use the Shapiro-Wilk test
- Test multiple groups at once using "by groups"

Normality Test with Rcmdr
Test of normality: Shapiro-Wilk (SW) test
Null hypothesis: data ARE normal
Alternate hypothesis: data ARE NOT normal

Normality Test with Rcmdr
Test of normality: Shapiro-Wilk (SW) test
Is this result significant? How can you tell?
- P value > 0.05 (alpha): the result is NOT significant.
- The null hypothesis is not rejected: the data are consistent with a normal distribution.
What do you need to report?
- Test name, sample size (n or df), test statistic, p value
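The decision rule and the reporting checklist can be written down as a small helper. This sketch does not run the Shapiro-Wilk test itself (Rcmdr does that); it only formats a result you already have. The function name and the numbers in the usage lines are hypothetical placeholders, not output from the class dataset.

```python
def report_normality_test(statistic, p_value, n, alpha=0.05):
    """Format a Shapiro-Wilk result: test name, sample size, statistic, p value."""
    if p_value < alpha:
        decision = "significant: reject the null, data differ from a normal distribution"
    else:
        decision = "not significant: null not rejected, data consistent with normality"
    return f"Shapiro-Wilk test: W = {statistic:.3f}, n = {n}, p = {p_value:.3f} ({decision})"


# Hypothetical values, for illustration only.
print(report_normality_test(statistic=0.950, p_value=0.200, n=18))
print(report_normality_test(statistic=0.800, p_value=0.010, n=18))
```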

Confidence Intervals
Formulation used by many tests: 95% confidence intervals
Lower bound: Mean - (1.96 * SE)
Upper bound: Mean + (1.96 * SE)
By definition: 95% of the confidence intervals (from different experiments) will contain the true parameter µ.
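Applied to the 18 age guesses from the estimation example, the formulation works out as follows. This is a minimal sketch using the z value of 1.96 from the slide; the function name `normal_ci` is an illustrative assumption.

```python
from math import sqrt
from statistics import mean, stdev

ages = [34, 36, 37, 37, 38, 38, 38, 38, 39, 40, 40, 41, 41, 42, 42, 42, 42, 48]


def normal_ci(data, z=1.96):
    """Return (lower, upper) bounds of the interval Mean +/- z * SE."""
    m = mean(data)
    se = stdev(data) / sqrt(len(data))  # SE = S.D. / sqrt(n)
    return m - z * se, m + z * se


lower, upper = normal_ci(ages)
print(f"95% CI: {lower:.1f} to {upper:.1f}")
```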

NOTE: Estimates Depend on Sample Size
C.I. formulation: Mean +/- (Z score * SE) = Mean +/- (1.96 * SE)
S.E. = S.D. / sqrt(n) = 3.127466 / sqrt(18) = 0.737151
n     mean    SD     sqrt(n)   SE     95% CI (+/-)
3     38.3    1.5    1.7       0.9    1.7
6     40.2    4.4    2.4       1.8    3.5
9     40.1    3.5    3.0       1.2    2.3
12    39.9    3.2    3.5       0.9    1.8
15    39.7    3.0    3.9       0.8    1.5
18    39.6    3.1    4.2       0.7    1.4
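The SE and 95% CI columns follow mechanically from the n and SD columns. The sketch below recomputes them from the rounded SD values shown in the table, so it is a check of the arithmetic and not a re-analysis of the raw guesses.

```python
from math import sqrt

# (n, SD) pairs as listed in the table.
rows = [(3, 1.5), (6, 4.4), (9, 3.5), (12, 3.2), (15, 3.0), (18, 3.1)]

results = []
for n, sd in rows:
    se = sd / sqrt(n)        # standard error of the mean
    half_width = 1.96 * se   # the +/- part of the 95% CI
    results.append((n, round(sqrt(n), 1), round(se, 1), round(half_width, 1)))

for n, root_n, se, half_width in results:
    print(f"n = {n:2d}  sqrt(n) = {root_n}  SE = {se}  95% CI = +/- {half_width}")
```

Because SE divides by sqrt(n), halving the width of the interval requires roughly four times as many observations when the SD stays about the same.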

NOTE: Estimates Are Influenced by Chance
Age estimate: 39.6 years (SD = 3.1)
C.I. formulation: Mean +/- (Z score * SE) = Mean +/- (1.96 * SE)
S.E. = S.D. / sqrt(n)
n    mean    SD     sqrt(n)   SE     95% CI (+/-)   lower   upper
9    40.1    3.5    3.0       1.2    2.3            37.8    42.4
9    39.1    2.8    3.0       0.9    1.8            37.3    40.9
Are these two samples from the same population?

Interpreting Confidence Intervals
The confidence interval (CI) is the interval that includes the estimated parameter with a probability determined by the confidence level (usually 95%).

Interpreting Confidence Intervals
Case 1. The two samples are indistinguishable: they are consistent with coming from the same population.
Case 2. The two samples are different: they are likely not from the same population.
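One rough way to tell the two cases apart is to check whether the 95% confidence intervals of the two samples overlap. The sketch below applies this to the two samples of n = 9 from the lecture (means 40.1 and 39.1, SDs 3.5 and 2.8). Treating overlap as "indistinguishable" is a visual heuristic assumed here for illustration; it is not a formal hypothesis test, and the function names are hypothetical.

```python
from math import sqrt


def ci_from_summary(mean, sd, n, z=1.96):
    """Return (lower, upper) for Mean +/- z * SD / sqrt(n)."""
    half_width = z * sd / sqrt(n)
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True if the two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


sample_1 = ci_from_summary(mean=40.1, sd=3.5, n=9)
sample_2 = ci_from_summary(mean=39.1, sd=2.8, n=9)

print(f"sample 1: {sample_1[0]:.1f} to {sample_1[1]:.1f}")
print(f"sample 2: {sample_2[0]:.1f} to {sample_2[1]:.1f}")
print("overlap:", intervals_overlap(sample_1, sample_2))
```

Here the intervals overlap, so the two samples fall under Case 1.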

Summary - Parametric Statistics
Benefits and costs:
- Parametric methods make more assumptions than non-parametric methods. If the extra assumptions are correct, parametric methods have more statistical power (they produce more accurate and precise estimates).
- However, if those assumptions are incorrect, parametric methods can be very misleading. They can cause false positives (type I errors). Thus, they are often not considered robust.

Summary - Normality
Indicators of a normal (Gaussian) distribution:
A. Mean = Median = Mode
B. Skewness: measures the asymmetry of the distribution. A value of zero indicates symmetry. An absolute value > 1 indicates a non-normal, skewed distribution.
C. Kurtosis: measures how the mass of the distribution is divided between the center and the tails. A value of zero indicates a normal distribution. An absolute value > 1 indicates a non-normal, unbalanced distribution.

Summary - Suggested Approach
- Use parametric tests whenever possible.
- Take care to examine diagnostic statistics and to determine whether the extra assumptions are met.
- If you are in doubt, perform the matching non-parametric test and compare the results.
  If they agree: go with the results of the parametric test.
  If they disagree: find out what caused the disagreement.