STAT 203 - Chapter 6
The Standard Deviation (SD) as a Ruler and The Normal Model

In Chapter 5, we introduced a few measures of center and spread, and discussed how the mean and standard deviation are good summaries when the histogram (or distribution) is symmetric and unimodal. When it is not symmetric, we use the median and IQR as summaries, although for most of the course we will deal with things that are approximately symmetric and unimodal.

Understanding what a Standard Deviation is matters a great deal, as many statistical methods rely on it, and we will see it come up again and again throughout the course.

Recall: The SD can be thought of as a measure of the typical deviation from the mean.

Example: I am at a crucial point in my life where I'm trying to decide what to do with it; teach or research? For my Master's I graduated with a grade of 87%, and the mean Master's grade is 83% with a standard deviation of 5%. For my course evaluations I have a mean rating of 4.65 (out of 5), and the mean evaluation score is typically 3.5 with a standard deviation of 0.4. Which one am I relatively better at? (Note: These are made up numbers!)

If we just look at the values it is hard to compare the two. 87 is 4 larger than 83, and 4.65 is only 1.15 larger than 3.5, but... 87 is a lot bigger than 4.65, and you can only go 1.5 over the average of 3.5, versus 17 over the average grade of 83, so... how do we compare the two?

The answer is to use the Standard Deviation as a measuring stick, as it summarizes the typical deviation from the mean. Essentially, we want to find out how far each value is from its respective mean, in terms of its typical deviation from the mean.
The Master's grade is 87 - 83 = 4% above the mean grade, and in terms of standard deviation, it is 4% / (SD = 5%) = 0.8 standard deviations above the mean.

The evaluation score is 4.65 - 3.5 = 1.15 above the mean score, and in terms of standard deviation, it is 1.15 / (SD = 0.4) = 2.875 standard deviations above the mean.

In terms of their own respective means and typical deviations from the mean (or SDs), the evaluation score is much higher above its own mean than the Master's grade is.

In your everyday life you are essentially using statistical tools to make decisions, without even knowing it... In my opinion, statistics is simply a discipline that takes the way a person thinks about things and makes logical decisions based on what they observe in everyday life, and formalizes it into a set of objective rules.

Adding or Multiplying each value by a constant:

1. Adding a constant (shifting)
If we add a constant (c) to each observation in the data, then:
The measures of center (mean, median, midrange) will all have the constant (c) added to them, and so will the Quartiles.
The measures of spread (variance, SD, range, IQR) will all remain the same.

2. Multiplying by a constant (scaling)
If we multiply each observation by a constant (c), then:
The measures of center (mean, median, midrange), the measures of spread (SD, range, IQR), and the Quartiles will all be multiplied by the constant (c). (If c is negative, the measures of spread are multiplied by |c|, since a spread cannot be negative.)
The variance will be multiplied by c^2.

In short, adding changes the center, but not the spread. Multiplying changes both the spread and the center. Multiplying by a constant is how we change measurement units, (eg) kg to lbs.
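The shifting and scaling rules can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the data values, the shift of 5, and the approximate kg-to-lbs factor of 2.2 are all made up for illustration.

```python
import statistics as st

# Hypothetical weights in kg (made-up data, for illustration only)
weights_kg = [55.0, 60.0, 72.0, 81.0, 90.0]

def summarize(data):
    """Return (mean, median, SD, variance, range) for a list of numbers."""
    return (
        st.mean(data),
        st.median(data),
        st.stdev(data),
        st.variance(data),
        max(data) - min(data),
    )

c_shift = 5.0   # assumed constant added to every observation
c_scale = 2.2   # approximate kg -> lbs conversion factor

original = summarize(weights_kg)
shifted = summarize([y + c_shift for y in weights_kg])
scaled = summarize([y * c_scale for y in weights_kg])

labels = ["mean", "median", "SD", "variance", "range"]
for name, o, a, m in zip(labels, original, shifted, scaled):
    print(f"{name:>8}: original={o:9.3f}  shifted={a:9.3f}  scaled={m:9.3f}")
```

In the printed table, the shifted column should differ from the original only in the mean and median rows, while every row of the scaled column is the original times 2.2, except the variance, which is the original times 2.2^2.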
Standardizing (Z-scores):

Question: How can we compare observations that were measured on different scales or from two different distributions?

Answer: By summarizing how far away each of the observations is from the mean, in terms of its standard deviation (or average/typical deviation from the mean)!

The Z-score summarizes how far a given observation (y_i) is from its mean (ȳ), in terms of its SD (s).

Z-score: Z = (y_i - ȳ) / s = (difference between observation and mean) / (standard deviation)

Exercise: A flight from Vancouver to Toronto usually takes 4.5 hours with a SD of 15 minutes. If my last flight took 4 hours and 10 minutes, how far is this from the mean in standard units?

When we standardize, we are adding (actually subtracting) a constant from every observation, and then multiplying (actually dividing) every observation by a constant... check the rules for adding and multiplying by a constant on the previous page.

If we let M = y_i - ȳ, then the mean of M is ȳ - ȳ = 0, and the SD of M is unchanged (it is still s).
If we now let Z = M / s, then the mean of Z is the mean of M times the constant (1/s), which equals 0 / s = 0. The SD of Z is the SD of M times the constant, which is s / s = 1.

So, Z-scores have a mean of 0 and a SD of 1.

A positive Z-score means that the observation is above the mean, and a negative one means it is below it. The farther an observation is from the mean, the larger the Z-score will be in absolute value.
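To make the Z-score formula concrete, here is a small Python sketch (not from the original notes). It reuses the made-up grade and evaluation numbers from the teach-or-research example earlier in the chapter, then standardizes a short hypothetical data set to illustrate that the resulting Z-scores have mean 0 and SD 1. The function name z_score and the data list are assumptions for illustration.

```python
import statistics as st

def z_score(y, mean, sd):
    """How many SDs the observation y lies above (+) or below (-) the mean."""
    return (y - mean) / sd

# Made-up numbers from the teach-or-research example
z_grade = z_score(87, 83, 5)       # Master's grade
z_eval = z_score(4.65, 3.5, 0.4)   # course evaluation score
print(round(z_grade, 3), round(z_eval, 3))   # 0.8 2.875

# Standardizing a whole (hypothetical) data set
data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
y_bar = st.mean(data)
s = st.stdev(data)
z_values = [z_score(y, y_bar, s) for y in data]

# Up to floating-point rounding, the mean is 0 and the SD is 1
print(st.mean(z_values), st.stdev(z_values))
```

The second print may show values such as a tiny number very close to 0 rather than exactly 0; that is ordinary floating-point rounding, not a failure of the rule.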
The Normal Model (Bell Curve, Normal Distribution):

This is where we take a small step into the theoretical world of statistics. Many types of data one collects have a distribution that is bell-shaped and roughly symmetric, and the Normal Model is appropriate for summarizing these (note that we are dealing with only quantitative variables here). (eg) weight, IQ scores,...

Characteristics of the Normal Model:
1. It is bell-shaped, unimodal, and perfectly symmetric about the mean (ȳ or µ).
2. The spread of the distribution is determined by the standard deviation (s or σ).
3. This model is denoted by N(µ, σ^2), where µ = mean, σ^2 = variance, and σ is the SD.
4. The total area under the curve is 100% (just as the total area of the bars of a histogram is 100%).

[Figure: Theoretical Normal Models. Curves for N(2, 36), N(-4, 9), and N(2, 4). Horizontal axis: Values, from -15 to 15. Vertical axis: Probability (%).]

Notes: For the Normal Model, we use (µ) for the mean instead of (ȳ), and (σ) for the SD instead of (s). Why?
The (ȳ) and (s) are Sample Statistics: numerical summaries of the observed data. (sample)
The (µ) and (σ) are Population Parameters: values that specify the theoretical model. (population)
Standardized Values (for the Normal Model):

Z = (y - µ) / σ

When we standardize an observation from a Normal Model, the Z-score follows the Standard Normal Model, N(0, 1).

What we do is use a theoretical Normal Model to describe the distribution of an observed variable. One must check the histogram to make sure that such a model is appropriate (symmetric and unimodal). We take the observed estimates of the mean and SD, and if a Normal Model seems appropriate, we use the Normal Model with the same mean and SD to approximate the observed data. We then standardize the value(s) of interest, so that we can use a Standard Normal variable (N(0, 1)). We can then answer questions such as: What proportion of males have weights above 190 lbs? What proportion are between 210 and 220 lbs? And so on...

The 68-95-99.7 Rule:
Approximately 68% of the data will be within ±1 SD of the mean.
Approximately 95% of the data will be within ±2 SDs of the mean.
Approximately 99.7% of the data will be within ±3 SDs of the mean.

(eg) If a class has a mean grade of 70% and a SD of 5%, and the grades are normally distributed, then approximately 68% of students will receive grades between 65% and 75%, approximately 95% will receive grades between 60% and 80%, and approximately 99.7% between 55% and 85%.

Let's Draw a Picture:
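The percentages in the 68-95-99.7 rule can also be checked numerically. The sketch below is an illustration rather than part of the original notes. It relies on the standard identity P(|Z| < k) = erf(k / sqrt(2)) for a Standard Normal variable and uses the class-grade example (mean 70%, SD 5%) to label the intervals. The function name within_k_sd is an assumption.

```python
import math

def within_k_sd(k):
    """Area under a Normal Model within k SDs of the mean: P(|Z| < k)."""
    return math.erf(k / math.sqrt(2))

mean_grade, sd_grade = 70, 5   # class example: mean 70%, SD 5%

for k in (1, 2, 3):
    low = mean_grade - k * sd_grade
    high = mean_grade + k * sd_grade
    print(f"within {k} SD ({low}%-{high}%): {within_k_sd(k):.4f}")
```

The three areas come out to roughly 0.6827, 0.9545, and 0.9973, which is where the rounded figures 68%, 95%, and 99.7% come from.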
Finding Percentages Under the Normal Model:
1. Draw a Normal Model and label where the mean is. Then shade the area of interest.
2. Standardize the y-value(s) that are at the boundaries of the area of interest.
3. Use the Normal Table in Appendix E of the textbook to find the area of the shaded region.

Example: What is the area (probability) below a Z-score of Z = 1.5? What is the area (probability) between Z-scores of -1.2 and 1.2?

Summary:
1. We estimate the mean and SD for our observed data.
2. Check if a Normal Model is appropriate (symmetric, unimodal).
3. If it is, then we standardize the values of interest.
4. Use the Normal Table to find the percentages we are interested in.

(The Normal Model is commonly used in statistics, so make sure to practice many of these problems.)
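Table lookups like the two in the example above can be cross-checked in software. The following is a minimal sketch using Python's standard-library statistics.NormalDist (available from Python 3.8); it is not the method used in the notes, which rely on the Normal Table, and the helper names area_below and area_between are assumptions for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)   # Standard Normal Model, N(0, 1)

def area_below(z):
    """Area under the Standard Normal curve to the left of z."""
    return std_normal.cdf(z)

def area_between(z_low, z_high):
    """Area under the Standard Normal curve between z_low and z_high."""
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)

print(f"{area_below(1.5):.4f}")           # 0.9332
print(f"{area_between(-1.2, 1.2):.4f}")   # 0.7699
```

For a boundary given as a y-value rather than a Z-score, standardize it first with Z = (y - µ) / σ, exactly as in step 2 of the procedure, and then pass the result to these helpers. A printed table may differ slightly in the last digit because of rounding.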
Exercises:

1. Suppose that math SAT scores follow the normal model. The past results of the math SAT exams show that males and females have mean scores of 500 and 455 and standard deviations of 100 and 120, respectively. Chris and Kim took the math SAT exam, and they both scored 620.
(a) Compare their scores using the z-score.
(b) What percentage of males score over 600 on the math SAT test?
(c) What percentage of females score between 255 and 555 on the math SAT test?

2. Find the area under the Normal Model for the following Z-scores.
(a) smaller than -1.10
(b) bigger than -1.10
(c) bigger than 2.15
(d) between 0 and 1.18
(e) between -1.10 and 1.62
(f) smaller than -4.50
(g) bigger than -6.34

3. Find the z-scores corresponding to the following percentiles:
(a) 50th
(b) 70th
(c) 15th

4. Suppose that scores on a standard IQ test approximately follow the normal model with mean µ = 110 and standard deviation σ = 25.
(a) What percentage of people have IQ scores above 100?
(b) What percentage have scores between 90 and 120?
(c) Find the interquartile range for the IQ scores.
5. The length of human pregnancies from conception to birth varies according to a distribution that is approximately normal with mean 266 days and standard deviation 16 days.
(a) Between what values do the lengths of the middle 95% of all pregnancies fall? Use the 68-95-99.7 rule to answer this question.
(b) How short are the shortest 1% of all pregnancies?