
FEEG6017 lecture: The normal distribution, estimation, confidence intervals. Markus Brede, mb8@ecs.soton.ac.uk

The normal distribution The normal distribution is the classic "bell curve". We've seen that we can produce one by adding or averaging a large-enough group of random variates from any distribution. It can also be specified as a probability density function.
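For reference, the density of a normal distribution with mean μ and standard deviation σ is

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \]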

The normal distribution The normal distribution is central to statistical inference. A particular normal distribution is fully characterized by just two parameters: the mean, μ, and the standard deviation, σ. In other words, once you've said where the centre of the distribution is, and how wide it is, you've said all you can about it. The general shape of the curve is consistent.

The normal distribution

Standard normal distribution Because the normal distribution has this constant shape, we can translate any instance of it into a standardized form. This follows from the scaling relations we discussed in the proof of the CLT. For each value, subtract the mean and divide by the standard deviation. This gives us the standard normal distribution, which has μ = 0 and σ = 1 (red line on previous slide). Values on the standard normal indicate the number of standard deviations that an observation is above or below the mean. They're also called z-scores.
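In symbols, the z-score of an observation x from a distribution with mean μ and standard deviation σ is

\[ z = \frac{x - \mu}{\sigma}. \]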

Areas under the normal curve The normal distribution's consistent shape is useful because we can say precise things about areas under the curve. It's a probability distribution so the area sums to 100%. 68% of the time the variate will be within plus or minus one standard deviation of the mean (i.e., a z-score between -1 and 1). 95% of variates will be within two standard deviations. 99.7% of variates will be within three standard deviations.

Areas under the normal curve

Variates from the normal distribution Suppose we have a normal distribution with a mean of 100 and a standard deviation of 10. We can reason about how unusual particular values are. For example, only 0.1% of cases will have a score higher than 130. Around 95% of cases will lie between 80 and 120. Conversely, only about 5% of cases will be more than 20 points away from 100. 34% of cases will be between 100 and 110.
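These figures are easy to verify with a statistical library. A minimal sketch using scipy.stats (one of several tools that will do the job; not the lecture's own code):

    from scipy.stats import norm

    # Normal distribution with mean 100 and standard deviation 10.
    dist = norm(loc=100, scale=10)

    print(dist.sf(130))                   # P(X > 130): about 0.0013, i.e. ~0.1%
    print(dist.cdf(120) - dist.cdf(80))   # P(80 < X < 120): about 0.95
    print(dist.cdf(110) - dist.cdf(100))  # P(100 < X < 110): about 0.34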

Z-tables In practice these days we use statistical calculators like R to figure out these areas under the normal curve. In the past, you had to look up a pre-computed "Z-table".

    Positive z-score    Area under the curve to the right of this point
    1.0                 0.1587
    1.5                 0.0668
    2.0                 0.0228
    2.5                 0.0062

Notable z-scores A few useful z-scores to remember... Z = 1.645 leaves 5% of the curve to the right. Z = 1.96 leaves 2.5% of the curve to the right. Z = 2.575 leaves 0.5% of the curve to the right.
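In R these come from qnorm; in Python, a quick check with scipy.stats (an illustration, not the lecture's own code):

    from scipy.stats import norm

    print(norm.ppf(0.95))   # ~1.645: leaves 5% of the curve to the right
    print(norm.ppf(0.975))  # ~1.960: leaves 2.5% to the right
    print(norm.ppf(0.995))  # ~2.576: leaves 0.5% to the right
    print(norm.sf(2.0))     # ~0.0228: matches the Z-table row for z = 2.0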

Thinking of a particular sample mean as a variate from a normal distribution Recall the uniform distribution of integers between 1 and 6 we get from throwing a single die. We found previously that if we repeatedly take samples of size N from that distribution, we end up with our collection of sample means being approximately normally distributed.

Sampling distribution of the mean What can we say about this approximately normal distribution of sample means? The mean is the same as the original population mean, i.e., 3.5 in this case. The standard deviation is the original distribution's standard deviation scaled by 1 / sqrt(N), where N is the sample size. In the N=25 case, that's 1.708 / sqrt(25) = 0.342.
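A sketch of that experiment in Python with numpy (not the lecture's own code; the seed is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    # 10,000 samples of 25 die throws each; take the mean of each sample.
    means = rng.integers(1, 7, size=(10_000, 25)).mean(axis=1)

    print(means.mean())  # close to the population mean, 3.5
    print(means.std())   # close to 1.708 / sqrt(25) = 0.342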

A note about why we're doing this Remember that in real cases nobody is interested in the green histogram for its own sake. If you really had the resources to collect 10,000 samples of size 25, you'd just call it one huge sample of 250,000. In the real case you get only your single sample of size 25, and you're trying to make inferences about the population based on just that. These computational experiments where we generate many such samples are attempts to stand back from that one-shot perspective.

A problem So it looks as though we're ready to draw conclusions about how unusual certain sample means might be. After all, we've got an approximately normal distribution (of sample means) and we know its mean and sd. But there's a problem: we're helping ourselves to information we wouldn't have in the real case. The original population's mean and standard deviation are typically exactly the things we're trying to find, not pre-given information.

Sampling distribution of the variance We need to work with the only things we have, i.e., the properties of our sample. We know that the mean of our sample is a "good guess" for the mean of the population. What about the variance of our sample? We haven't systematically checked the relationship between sample variance and population variance yet. Let's take 10,000 samples of size 25, calculate the variance in each case, and consider the distribution of those sample variances.

Sampling distribution of the variance At first glance this all looks good. The variances of many samples of size 25 seem to be themselves roughly normally distributed and they seem to be zeroing in on the true value of 2.917. But let's look more closely: for sample sizes between 2 and 10, we'll try collecting 10,000 samples and noting the average value of the calculated sample variance. Turns out there's a systematic problem of underestimation.
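A sketch of that check with numpy (again, not the lecture's own code):

    import numpy as np

    rng = np.random.default_rng(0)
    # True population variance of a fair die: 35/12, about 2.917.

    for n in range(2, 11):
        samples = rng.integers(1, 7, size=(10_000, n))
        # Naive variance (divisor n) of each sample, averaged over samples.
        print(n, samples.var(axis=1).mean())

The averages come out systematically below 2.917, and the shortfall is worst for the smallest samples.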

Biased and unbiased estimators The sample mean is an unbiased estimator of the population mean. This means that although our sample mean may be quite far from the true value, it has no systematic tendency to be too high or too low: on average it equals the population mean. The sample variance is a biased estimator of the population variance. The sample variance will systematically underestimate the population variance, especially so for small sample sizes.

The sample variance and sample SD Bessel's correction is needed in order to find an unbiased estimator of the population variance. This means simply that we need to divide through by (N - 1) instead of N when calculating the variance and standard deviation. The underestimation happens because we're using the same small set of numbers to estimate both the mean and the variance.

Bessel's correction The Maths When we estimate the variance with Bessel's correction, we are essentially calculating the expectation

\[ E\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\right]. \]

Writing each deviation from the sample mean in terms of deviations from the true mean, \( x_i - \bar{x} = (x_i-\mu)-(\bar{x}-\mu) \), we have

\[ \sum_{i=1}^{n}\bigl((x_i-\mu)-(\bar{x}-\mu)\bigr)^2 = \sum_{i=1}^{n}(x_i-\mu)^2 - n(\bar{x}-\mu)^2. \]

Hence

\[ E\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\right] = \frac{1}{n-1}\left(\sum_{i=1}^{n}E\left[(x_i-\mu)^2\right] - n\,E\left[(\bar{x}-\mu)^2\right]\right) = \frac{1}{n-1}\left(\sum_{i=1}^{n}V[x_i] - n\,V[\bar{x}]\right). \]

Bessel's Maths (cont.) The x_i are a random sample from a distribution with variance σ². Thus V(x_i) = σ². We also have

\[ V[\bar{x}] = V\left[\frac{1}{n}\sum_{i=1}^{n}x_i\right] = \frac{1}{n^2}\sum_{i=1}^{n}V[x_i] = \frac{\sigma^2}{n}. \]

Back to the last expression from the previous slide:

\[ E\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\right] = \frac{1}{n-1}\left(n\sigma^2 - n\cdot\frac{\sigma^2}{n}\right) = \frac{1}{n-1}\left(n\sigma^2 - \sigma^2\right) = \sigma^2. \]

So the corrected estimator is unbiased: its expected value is exactly the population variance.

Calculating the two values in Python & R pylab.var(variablename) or pylab.std(variablename) will get you the population version, i.e., the divisor is N. If you want the sample version (divisor = N-1) you can specify pylab.var(variablename, ddof=1) or pylab.std(variablename, ddof=1). In R, somewhat confusingly, the functions var(variablename) and sd(variablename) automatically assume the sample version, i.e., divisor = N-1. The good news: the difference is negligible for decent sample sizes.
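A minimal illustration of the difference (pylab simply re-exports these numpy functions):

    import numpy as np

    x = np.array([1.0, 6.0, 3.0, 2.0, 5.0])

    print(np.var(x))          # population version: divisor N
    print(np.var(x, ddof=1))  # sample version: divisor N - 1
    print(np.std(x, ddof=1))  # corrected standard deviation

In R, var(c(1, 6, 3, 2, 5)) returns the same value as the ddof=1 call.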

A realistic case At last we're in a position to take a particular sample of size 25 and see how we could use it to reason about the population. For the sake of the exercise, we'll pretend that we don't already know the true values of the population mean and variance.

A realistic case Some output from Python...

    Mean of the sample is 3.4
    Variance of the sample is 3.44
    Sample variance, estimating pop variance, is 3.58333333333
    SD of the sample is 1.8547236991
    Sample SD, estimating pop SD, is 1.8929694486

So, based on our sample information, the best guess for the population mean is 3.4, and for the population standard deviation it's 1.893. (Not bad guesses: true values are 3.5 and 1.708.)

A realistic case We can place this information in a wider context. We know that our sample mean is "noisy", i.e., it's really drawn from an approximately normal distribution of possible sample means. Our best guess for that distribution is that its mean is 3.4 and its standard deviation is 1.893 / sqrt(25) = 0.379. That calculation gives us the standard error of the mean, i.e., the estimated standard deviation of the sampling distribution for the mean.
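In code, the standard error falls straight out of the sample. A sketch with a hypothetical sample (numbers will differ from the slide's):

    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.integers(1, 7, size=25)  # one sample of 25 die throws

    est_sd = sample.std(ddof=1)          # estimated population SD
    sem = est_sd / np.sqrt(len(sample))  # standard error of the mean
    print(sample.mean(), est_sd, sem)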

Confidence intervals If the sampling distribution of the mean is normally distributed, we can say something about how unlikely it is to get an extreme value from that distribution. We know, for example, that getting a z-score more extreme than ±3 only happens 0.3% of the time. Using our best estimates for the sampling distribution's mean and SD, that's the equivalent of saying that sample means outside the range of 3.4 ± (3 x 0.379) will only happen 0.3% of the time.

Confidence intervals A very handy z-score is 1.96, because it leaves exactly 2.5% of the distribution to the right. If we consider both edges of the normal distribution, that means that only 5% of the time will we get values more extreme than z = ±1.96. So, 95% of the time, we expect our sample mean to lie within the range 3.4 ± (1.96 x 0.379). That's between 2.657 and 4.143.

Confidence intervals Remember that 3.4 is our absolute best guess for the mean. (We don't know that the true value is 3.5.) But we also know that 3.4 is unlikely to be exactly right: we're vulnerable to sampling error. A particular sample mean lies within ±1.96 standard errors of the population mean 95% of the time. We can reverse this logic and conclude that, 95% of the time, the true population mean lies within ±1.96 standard errors of our sample mean.

Confidence intervals To see this in formulae: the Z-table tells us that with 95% probability we have

\[ -1.96 < \frac{\mu - \bar{x}}{s} < 1.96, \]

where \( \bar{x} \) is the sample mean, μ is the true population mean, and \( s = s_0/\sqrt{n} \) is the estimated standard deviation of the sampling distribution (the standard error), with s_0 the estimated population standard deviation. Hence, with 95% probability:

\[ \bar{x} - 1.96\,s < \mu < \bar{x} + 1.96\,s, \]

i.e., with 95% probability the true mean is in \( \bar{x} \pm 1.96\,s \).

Confidence intervals And that's all a confidence interval is. In this case, we would say that the 95% confidence interval for the true population mean is between 2.657 and 4.143. Note that the right answer, 3.5, is within that interval. Not guaranteed: 5% of the time, the real value will be outside the interval. Of course we won't know when! Confidence intervals can be calculated for different confidence levels (90%, 99%) with different z-scores, and can be calculated for quantities other than the mean.
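Putting the whole recipe together, a sketch with numpy and scipy (the sample here is a hypothetical stand-in for the one on the slides; swap in norm.ppf(0.995) for a 99% interval):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    sample = rng.integers(1, 7, size=25)  # hypothetical sample of 25 die throws

    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(len(sample))  # standard error of the mean
    z = norm.ppf(0.975)                              # 1.96 for a 95% level

    print(mean - z * sem, mean + z * sem)  # 95% confidence interval for the mean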

Additional material David M. Lane's tutorials on the normal distribution and on sampling. A guide to reporting standard deviations and standard errors. The Python code used to produce simulations and graphs for this lecture: part 1 and part 2.