MEASURES OF CENTRAL TENDENCY & VARIABILITY + NORMAL DISTRIBUTION

Day 3, Summer 2017.07.31

DISTRIBUTION
- Symmetry
- Modality (unimodal, bimodal)
- Skewness (positive or negative)
- Kurtosis

CHAPTER 4 Measures of Central Tendency

One major purpose of statistical procedures is to summarize raw data in a meaningful way so that we can draw conclusions.
For example, suppose you wonder how the students in your colleague's class did on the final exam this year. There is one number you REALLY want to know.
Statistics that describe central tendency are numerical values that describe the center of a distribution of scores for a variable.

CENTRAL TENDENCY
Three common measures of central tendency:
- Mode
- Median
- Mean

The most common measure of central tendency is the mean. The statistical notation for the mean is:
- $\mu$ when we are dealing with populations
- $\bar{X} = \frac{\sum X}{N}$ when we are dealing with samples
The median can be an observed number or not. How about the mode?
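To make the three measures concrete, here is a minimal sketch using Python's standard `statistics` module. The scores are invented for illustration and are not from the lecture.

```python
import statistics

# Hypothetical exam scores, invented for illustration.
scores = [55, 60, 70, 70, 75, 80, 90]

mean = statistics.mean(scores)      # sum of the scores divided by N
median = statistics.median(scores)  # middle value of the sorted scores
mode = statistics.mode(scores)      # most frequent value

print(mean, median, mode)
```

With an even number of scores the median is the average of the two middle values, so it need not be an observed score. The mode, by definition, is always an observed value.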

MEAN AND MEDIAN [figure slides not recovered]

MODE [figure slides not recovered]

COMPARE AND CONTRAST
- The more symmetric a distribution is, the closer these three measures of central tendency will be.
- If a distribution is truly normal (symmetric and unimodal), then the mean, median, and mode will be exactly the same.
- Unfortunately, this rarely happens. We must choose a measure that best suits our purposes and data.

ADVANTAGES AND DISADVANTAGES - MEAN
Advantages:
- The mean can be defined mathematically with a simple equation and can easily be manipulated algebraically.
- Sample means are a more stable estimate of the population's central tendency than sample medians or modes.
Disadvantages:
- Influenced by extreme values (very sensitive to outliers).
- The sample mean may not be an actual value observed in the data.

ADVANTAGES AND DISADVANTAGES - MEDIAN
Advantages:
- It is unaffected by extreme scores (outliers).
Disadvantages:
- Depends on the sample of data and is not easily generalized to the greater population.
- Does not enter statistical equations readily and is therefore more difficult to work with than the mean.
- May not be an actual value observed in the data.
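The contrast between the mean's sensitivity to outliers and the median's resistance can be seen in a small sketch. Both score lists are hypothetical; the second simply replaces the top score with an extreme value.

```python
import statistics

# Hypothetical scores; the second list swaps the top score for an extreme one.
scores = [50, 55, 60, 65, 70]
with_outlier = [50, 55, 60, 65, 700]

mean_before = statistics.mean(scores)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(scores)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)      # the mean is pulled upward
print("median:", median_before, "->", median_after)  # the median does not move
```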

SOME ADVANTAGES AND DISADVANTAGES - MODE
Advantages:
- Any randomly selected observation, $X_i$, is more likely to be the mode than any other score.
- It is the only measure of central tendency that can be used with nominal data.
- It is not affected by extreme scores.
Disadvantages:
- Depends on the sample of data and may not be representative of the population.
- Can depend on the way the data are grouped.
- Cannot be defined by a simple mathematical equation.

CHAPTER 5 Measures of Variability (Dispersion)

VARIABILITY / DISPERSION
- Variability is defined as how the data are distributed around a measure of central tendency (e.g., the mean).
- Measures of variability describe the way and degree to which the data are spread.
- Measures of variability quantify how similar the scores in a sample are to one another.

CONSIDER THE FOLLOWING:
Two classes were assigned to the same teacher. In the first class, the kids come from a variety of backgrounds; in the second class, all the kids come from families where at least one parent is a teacher or professor.
How similar do you expect the pretest scores within the two groups to be?

THE RESULTING DATA [slide content not recovered]

THE DATA FROM A GRAPHICAL PERSPECTIVE [figure not recovered]
- Class 1: more variability
- Class 2: less variability

MEASURES OF VARIABILITY
- The range
- The interquartile range
- Deviation: average deviation, mean absolute deviation, variance, standard deviation

RANGE
The distance between the lowest and highest value.
Data from the previous example: [values not recovered]
The range can be heavily influenced by extreme scores.

THE INTERQUARTILE RANGE
- The interquartile range is the range of the middle 50% of the observations.
- It is a trimmed statistic: how much is trimmed from the lower end and the upper end, respectively?
- It is calculated by taking the difference between the 75th percentile and the 25th percentile.
- The interquartile range has the opposite problem from the range: it discards too much of the data.
- Percentile: the score below which a particular percentage of the observations fall.
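A minimal sketch of the calculation, using `statistics.quantiles` from the Python standard library. The data are hypothetical, and quartile conventions differ slightly between textbooks and software, so other tools may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical scores, invented for illustration.
data = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1  # 75th percentile minus 25th percentile
print(q1, q3, iqr)
```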

FINDING THE INTERQUARTILE RANGE [slide content not recovered]

DEVIATION
The difference between every data point and the mean.
- The average deviation
- The mean absolute deviation (m.a.d.)
- Variance
- Standard deviation (SD)

AVERAGE DEVIATION
We could find $d_i = (X_i - \bar{X})$ for each observed value.
Then use $\frac{\sum_{i=1}^{N} d_i}{N} = \text{mean}(d_i)$ to look at how far, on average, the observations are from the mean.
While the logic is sound, the average deviation for any sample will always be equal to zero. Why?
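One way to answer the "why" is a short derivation that uses nothing beyond the definition of the sample mean, $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$, which implies $\sum_{i=1}^{N} X_i = N\bar{X}$:

```latex
\sum_{i=1}^{N} d_i
  = \sum_{i=1}^{N} (X_i - \bar{X})
  = \sum_{i=1}^{N} X_i - N\bar{X}
  = N\bar{X} - N\bar{X}
  = 0
```

The positive and negative deviations cancel exactly, so their mean is zero for any data set.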

There are two ways to keep the positive and negative deviations from canceling out:
- Take the absolute value of each deviation (ignore the sign), which gives the MAD
- Square each deviation, since the square of a negative number is positive

MAD
Mean absolute deviation: $MAD = \frac{\sum |X_i - \bar{X}|}{N}$
Not convenient for statistical manipulation.

VARIANCE
We start by finding how each observed value differs from the mean: $(X_i - \bar{X})$
To get rid of the negative deviations, we square each of these values: $(X_i - \bar{X})^2$
Then we sum the squared deviations (often called the "sum of squares"): $\sum_{i=1}^{N} (X_i - \bar{X})^2$
Finally, we calculate the average.

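VARIANCE: FINAL EQUATIONS
Population: $\sigma_x^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$
Sample: $s_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
(In the population formula, $\bar{X}$ is the population mean $\mu$.)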

STANDARD DEVIATION (SD)
Because we squared the deviations while calculating the variance, we have altered the original scale. This makes the variance difficult to interpret.
To convert back to the original scale, we take the square root, called the standard deviation.
- $\sigma$ is the population standard deviation
- $s$ is the sample standard deviation
Think of the SD as a measure of how far our data values deviate from the mean, on average.

STANDARD DEVIATION: FINAL EQUATIONS
Population: $\sigma_x = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$
Sample: $s_x = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
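As a rough illustration of the population and sample versions, the standard `statistics` module offers both. The scores below are hypothetical and chosen so the arithmetic is easy to follow (mean 5, sum of squares 32).

```python
import statistics

# Hypothetical scores: mean = 5, sum of squared deviations = 32.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = statistics.pvariance(scores)  # divides the sum of squares by N
pop_sd = statistics.pstdev(scores)      # square root of pop_var
samp_var = statistics.variance(scores)  # divides the sum of squares by n - 1
samp_sd = statistics.stdev(scores)      # square root of samp_var

print(pop_var, pop_sd, samp_var, samp_sd)
```

The sample versions are always a little larger than the population versions for the same data, because the same sum of squares is divided by a smaller number.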

OUR EXAMPLE [slide content not recovered]

BACK TO OUR EXAMPLE
A loose interpretation:
- Class 1 deviated, either positively or negatively, on average 24 points from the mean.
- Class 2 deviated, either positively or negatively, on average 12 points from the mean.
In general, we can conclude that the values in Class 2 tend to be more similar to one another (homogeneous) than those in Class 1.
Interpretation in terms of our example: teachers' kids all performed very similarly, whereas those from other families were much more variable in their performance.

CHARACTERISTICS OF SD
- Basically a measure of the average of the deviations of each score from the mean.
- Can be used to build confidence intervals to see how many scores fall below or above the mean (more on this in Chapter 6).

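DON'T BE SCARED.
Definitional:
$\sigma_x^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$ and $s_x^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Computational:
$\sigma_x^2 = \frac{\sum_{i=1}^{N} X_i^2 - \frac{\left(\sum_{i=1}^{N} X_i\right)^2}{N}}{N}$ and $s_x^2 = \frac{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}{n - 1}$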
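A quick sketch to check that the definitional and computational routes give the same sum of squares. The function names and the scores are invented for this illustration.

```python
import math

def ss_definitional(xs):
    """Sum of squared deviations from the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)

def ss_computational(xs):
    """Sum of squared scores minus (sum of scores) squared over N."""
    return sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)

# Hypothetical scores, invented for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

ss_def = ss_definitional(scores)
ss_comp = ss_computational(scores)
print(ss_def, ss_comp)

# Dividing by N gives the population variance; by n - 1, the sample variance.
print(ss_def / len(scores), ss_def / (len(scores) - 1))
```

The computational form avoids subtracting the mean from every score, which mattered for hand calculation. Both forms describe the same quantity.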

THE PERPETUAL QUESTION: WHY DIVIDE BY n-1 FOR SAMPLE STATISTICS?
It is an adjustment to produce an unbiased estimate.
- Concrete examples in the book: Gravetter & Wallnau, pp. 100-101
- Seeing the statistics: Howell, pp. 99-101
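The unbiasedness claim can be checked exactly on a toy case. The sketch below assumes a hypothetical five-value population and enumerates every possible ordered sample of size 2 drawn with replacement. It then averages the two candidate variance estimates over all of those samples. The population, the sample size, and all names are illustrative choices, not taken from the textbooks cited above.

```python
from fractions import Fraction
from itertools import product

# Hypothetical tiny population, invented for illustration.
population = [0, 25, 50, 75, 100]
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N  # true population variance

n = 2
# Every ordered sample of size n drawn with replacement (5 ** 2 = 25 samples).
samples = list(product(population, repeat=n))

def sum_of_squares(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample)

# Average each estimator over all possible samples.
avg_div_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, avg_div_n, avg_div_n_minus_1)
```

Dividing by n underestimates the population variance on average, while dividing by n - 1 hits it exactly in this enumeration.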

SEEING THE STATISTICS
www.uvm.edu/~dhowell/fundamentals8/seeingstatisticsapplets/applets.html
The true mean of the population is 50, and the SD is 29.2. Now we will sample from this population...

REPRESENTING DISTRIBUTIONS WITH GRAPHICS: BOXPLOT
A boxplot (or box-and-whisker plot) includes a measure of central tendency (the median) and a measure of dispersion (the interquartile range).
- Hinges = 1st and 3rd quartiles = 25th and 75th percentiles
- H-spread: the range between the two quartiles
- Whisker: extends at most 1.5 * H-spread from the top and bottom of the box

BOXPLOT: HOWELL
The whiskers stop at the farthest values observed in the data set, but extend no more than 1.5 * H-spread from the box. Observed values beyond that point are marked as outliers.
You see the full range of the upper whisker, but a very short lower whisker. Why?
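The whisker rule can be sketched in a few lines. The data are hypothetical, and `statistics.quantiles` is used here as a stand-in for hinges; Howell's hinge calculation may give slightly different quartiles on some data sets.

```python
import statistics

# Hypothetical scores; 150 is deliberately extreme.
data = [40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 150]

q1, median, q3 = statistics.quantiles(data, n=4)
h_spread = q3 - q1

# Whiskers may reach at most 1.5 * H-spread beyond the box.
lower_fence = q1 - 1.5 * h_spread
upper_fence = q3 + 1.5 * h_spread

# Each whisker stops at the farthest observed value inside the fences.
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything beyond the fences is plotted as an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(whisker_low, whisker_high, outliers)
```

A whisker is short whenever the farthest observed value on that side sits close to the box, which is one way a plot can end up with a long upper whisker and a very short lower one.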

OUR EXAMPLE IN SPSS
At least two routes:
- Graphs > Boxplot
- Analyze > Descriptive Statistics > Explore

KEY TERMS
- Describing a distribution: 4 terms
- Measures of central tendency: 3 terms
- Measures of variability: 2 terms
- Displaying a distribution: 1 term

BREAK

THE NORMAL DISTRIBUTION & Z-SCORES

OVERVIEW
- Probability for discrete vs. continuous data
- The normal distribution
- The standard normal distribution
- z-transformations and z-scores
- Using z-scores to find probabilities

Think of discrete variables in terms of the probability of a specific outcome.
We have a known number (100) of purple (10), red (40), and white (50) marbles. What is the probability of choosing a red marble?

FREQUENCY, AREA, AND PROBABILITY FOR DISCRETE VARIABLES
[Pie chart with slices of 40%, 10%, and 50%]
The pie chart represents the frequency distribution of red, purple and white marbles in a bag.

We think of continuous variables in terms of the probability of obtaining a value that falls within a range.
With the distribution of IQ scores that I collected for a study, what is the probability that somebody will have an IQ score of 90?

IQ Score Range   Frequency   Proportion   Cumulative
70-74                1          0.02         0.02
75-79                2          0.04         0.06
80-84                3          0.06         0.12
85-89                5          0.10         0.22
90-94                6          0.12         0.34
95-99               12          0.24         0.58
100-104              8          0.16         0.74
105-109              6          0.12         0.86
110-114              3          0.06         0.92
115-119              3          0.06         0.98
120-124              1          0.02         1.00
Total               50          1.00
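The Proportion and Cumulative columns follow directly from the frequencies. A minimal sketch, using the frequencies from the table above (the variable names are just for this illustration):

```python
from itertools import accumulate

bins = ["70-74", "75-79", "80-84", "85-89", "90-94", "95-99",
        "100-104", "105-109", "110-114", "115-119", "120-124"]
freqs = [1, 2, 3, 5, 6, 12, 8, 6, 3, 3, 1]

total = sum(freqs)

# Proportion: each frequency divided by the total number of scores.
proportions = [f / total for f in freqs]

# Cumulative proportion: running total of frequencies divided by the total.
cumulative = [c / total for c in accumulate(freqs)]

for b, f, p, c in zip(bins, freqs, proportions, cumulative):
    print(f"{b:>8} {f:>3} {p:.2f} {c:.2f}")
```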

A PROBABLY NON-PROFESSIONAL WAY TO EXPLAIN
As with the pie chart earlier, we can relate area to probability. Think of the area as the area of the interval covered by each bar.
Your answer? How many potential ranges could we create? What does this mean?

I AM INTERESTED IN THE SCORE 90
- With an interval of 20 points / 3 groups: 90-109: 31/50 = 0.62
- With an interval of 10 points / 6 groups: 90-99: 18/50 = 0.36
- With an interval of 5 points / 11 groups: 90-94: 6/50 = 0.12
- With an interval of 2 points / 50 groups: 90-92: 4/50 = 0.08

A CHANGE OF CONCEPT
The probability of exactly any single value is 0, because we can break the intervals down into finer and finer ones without limit, so the bar width shrinks toward 0.
But we still want to talk about a specific value. To interpret a single score in probability terms, we use the probability density function (PDF).
Each x value corresponds to exactly one PDF value, which plays a role similar to a frequency and is the height of the point on the normal curve at x.
How does this work?

PDF [slide content not recovered]

GRAPHING THE PDF AND RELATING IT TO AREA
[Figure: three panels, each titled "Graphing probability density function"; x-axis: IQ Scores (70 to 130, marked every 10, 2, and 5 points); y-axis: Density (0 to 0.04)]

PROBABILITY DENSITY FUNCTION (PDF)
$f(X) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(X - \mu)^2 / 2\sigma^2}$, where $\pi \approx 3.14$ and $e \approx 2.718$
For every X value, we can plug the value into the function and get a number f(X), which is the height of the point on the normal curve at that X value. We call it the density. This is the y value in your z-table. The largest y value is at the center of the normal distribution, where z = 0.
E.g., $f(90) = \frac{1}{11.33\sqrt{2 \times 3.14}} \, (2.718)^{-(90 - 97.74)^2 / (2 \times 11.33^2)} = 0.0279$
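The worked example for f(90) can be reproduced with a few lines of Python. This sketch uses `math.pi` and `math.exp` rather than the rounded constants 3.14 and 2.718, and the result agrees to four decimal places. The function name is made up for this illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for a given mean and SD."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

# IQ data from the lecture: mean 97.74, SD 11.33.
density_at_90 = normal_pdf(90, mu=97.74, sigma=11.33)
density_at_mean = normal_pdf(97.74, mu=97.74, sigma=11.33)

print(round(density_at_90, 4))
print(round(density_at_mean, 4))  # the curve is highest at the mean
```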

PERCENTILES
- Percentile: the point below which a specified percentage of scores in the distribution fall.
- Percentile rank: the percentage of scores equal to or less than the given score.
Getting the percentile rank involves integration in calculus. You don't have to do that calculation; someone has already prepared the table for us (the z-table). We just need to know how to use it.
A percentile is a score; a percentile rank is a percentage.

NORMAL DISTRIBUTION
The normal distribution is important because:
- Many dependent variables are assumed to be normally distributed in the population.
- The sampling distribution of the mean is normally distributed (more coming).
- Many statistical models are based on an assumption of a normally distributed variable.

NORMAL DISTRIBUTION
[Figure: graph of the probability density function; x-axis: IQ Scores (70 to 130); y-axis: Density (0 to 0.04)]
- Bell-shaped curve
- Unimodal
- Symmetric: mean, median and mode are all in the center
- Not skewed
- Extends from -∞ to +∞
- The total area under the curve is 1

NORMAL DISTRIBUTION
About 68% of the distribution lies within 1 SD of the mean, 95% lies within 2 SD of the mean, and 99.7% lies within 3 SD of the mean.
We can immediately make some inferences.
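These three percentages can be checked with `statistics.NormalDist` from the Python standard library, which gives the area under the normal curve to the left of any point. A minimal sketch:

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1

# Area within k SDs of the mean = area left of +k minus area left of -k.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in within.items():
    print(f"within {k} SD: {area:.4f}")
```

The 95% figure is a rounding of roughly 95.45%; exactly 95% of the area lies within about 1.96 SD of the mean.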

STANDARD NORMAL DISTRIBUTION
The standard normal distribution is just a special case of the normal distribution, with mean = 0 and SD = 1.
Any normal distribution can be transformed into a standard normal distribution.
Why bother transforming, or standardizing, a distribution?

HOW MANY TABLES DO WE NEED?
For our IQ data, the mean is 97.74 and SD = 11.33. One SD below the mean is 97.74 - 11.33 = 86.41; one SD above the mean is 97.74 + 11.33 = 109.07. The percentile rank of 84.13% corresponds to a raw score of 109.07.
For SAT scores, mean = 500 and SD = 100. One SD below the mean is 400; one SD above the mean is 600. The percentile rank of 84.13% corresponds to a raw score of 600.

STANDARDIZED SCORES
When we transform our variables to the z-distribution (the standard normal distribution), we are standardizing our scores.
This essentially means we put all of our values on the same scale and end up with a distribution with mean = 0 and SD = 1.
We call the process the z-transformation. The standardized scores that come out of this process are called z-scores.

Z-SCORE TRANSFORMATION
$z_i = \frac{X_i - \mu}{\sigma}$
- $X$ is our original data
- $\mu$ is the mean of the population
- $\sigma$ is the population standard deviation
The end result will be a set of standardized scores. All scores below the mean will be negative and all scores above the mean will be positive.
We can interpret the value of a z-score as how many standard deviations a score is above or below the mean:
- A z-score of 1.0 is a score that is exactly 1 SD above the mean.
- A z-score of -1.5 is a score that is exactly 1.5 SD below the mean.

Z-SCORE EXAMPLE
Test score: mean = 50, standard deviation = 10.
So the z-score if you received a 60 is $z = \frac{X - \mu}{\sigma} = \frac{60 - 50}{10} = \frac{10}{10} = 1$
and the z-score if you received a 45 is $z = \frac{X - \mu}{\sigma} = \frac{45 - 50}{10} = \frac{-5}{10} = -0.5$

SO?
Now we can refer to the z-table to see what percentile rank a score of 60 or 45 corresponds to. A full z-score table can be found in Howell, pp. 598-601, Table E-10.
A z-score of 1 corresponds to a percentile rank of 0.8413. This means 84.13% of scores fall at or below a z-score of 1, or a raw score of 60.
A z-score of -0.5 corresponds to a percentile rank of 0.3085. This means 30.85% of scores fall at or below a z-score of -0.5, or a raw score of 45.
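The table lookup can be mimicked in code. In this sketch, `NormalDist().cdf` from the Python standard library stands in for the z-table, and the helper name `z_score` is made up for the illustration. The mean of 50 and SD of 10 come from the test-score example above.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

mu, sigma = 50, 10
standard_normal = NormalDist()  # plays the role of the z-table

results = {}
for raw in (60, 45):
    z = z_score(raw, mu, sigma)
    rank = standard_normal.cdf(z)  # proportion of scores at or below z
    results[raw] = (z, rank)
    print(f"raw {raw}: z = {z:+.1f}, percentile rank = {rank:.4f}")
```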

SUMMARY
- The PDF is introduced to get at probability for continuous variables.
- How do we transform any score within a distribution into a z-score (that is, standardize the raw scores)?