Distributions. Patrick Breheny.
Outline: Random variables; The binomial distribution; The normal distribution; Sampling distributions.


Distributions September 17

Random variables
Anything that can be measured or categorized is called a variable. If the value that a variable takes on is subject to variability, then the variable is a random variable. We have already seen several random variables in the data that we have looked at so far: sex, class, age, and reduction in FVC.

Distributions
Data and probability meet in the notion of a distribution. A probability distribution applies the theory of probability to describe the behavior of a random variable. A distribution describes the probability that a random variable will be observed to take on a specific value or fall within a specific range of values.

Discrete distributions
Categorical variables are said to have discrete probability distributions. In a discrete distribution, variables can only take on a finite (or at most countable) set of distinct values. Because the possible values can be listed, the distribution can describe the probability that each value will occur.
Examples (random variable: possible outcomes):
Survival: yes, no
Number of copies of a genetic mutation: 0, 1, 2
Number of children a woman will have in her lifetime: 0, 1, 2, ...
Number of people in a sample who smoke: 0, 1, 2, ..., n

Continuous distributions
Continuous variables can take on an infinite number of possible values. Because of this, any particular value will have probability 0. So what does a continuous distribution describe? It describes the probability that a continuous random variable will fall within a certain range.
Examples of continuous random variables: height, weight, cholesterol levels, survival time.

Listing the ways
When trying to figure out the probability of something, it is sometimes very helpful to list all the different ways that the random process can turn out. If all the ways are equally likely, then each one has probability $1/n$, where $n$ is the total number of ways. Thus, the probability of the event is the number of ways it can happen divided by $n$.

Genetics example
For example, the possible outcomes of an individual inheriting cystic fibrosis genes are CC, Cc, cC, and cc. If all these possibilities are equally likely (as they would be if each of the individual's parents had one copy of each version of the gene), then the probability of having one copy of each version is 2/4.

Coin example
Another example where the outcomes are equally likely is flips of a coin. Suppose we flip a coin three times; what is the probability that exactly one of the flips was heads?
Possible outcomes: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT.
Three of the eight outcomes (HTT, THT, TTH) contain exactly one head, so the probability is 3/8.
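The listing approach is easy to mechanize. The following is a minimal sketch in Python (a language chosen here for illustration; the slides do not use it) that enumerates the eight sequences and counts those with exactly one head. The variable names are hypothetical.

```python
from itertools import product

# All 2^3 equally likely sequences of three coin flips
outcomes = ["".join(flips) for flips in product("HT", repeat=3)]

# Keep the sequences that contain exactly one head
one_head = [o for o in outcomes if o.count("H") == 1]

# Equally likely outcomes: probability = favorable ways / total ways
probability = len(one_head) / len(outcomes)
print(outcomes)
print(one_head, probability)
```

Changing `repeat=3` to a larger number shows how quickly the list grows, which is the motivation for a counting formula.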

The binomial coefficients
Counting the number of ways something can happen quickly becomes a hassle (imagine flipping a coin 100 times and counting the number of ways that 20 heads can show up). Luckily, mathematicians long ago discovered that when there are two possible outcomes and $n$ trials, the number of ways for one of the outcomes to happen exactly $k$ times is
$$\frac{n!}{k!\,(n-k)!}$$
The notation $n!$ means to multiply $n$ by all the positive integers that come before it (e.g., $3! = 3 \cdot 2 \cdot 1$).

Calculating the binomial coefficients
For the coin example, we could have used the binomial coefficients instead of listing all the ways the flips could happen:
$$\frac{3!}{1!\,(3-1)!} = \frac{3 \cdot 2 \cdot 1}{(1)(2 \cdot 1)} = 3$$
Many calculators and computer programs (including SAS) have specific functions for calculating binomial coefficients, which we will explore in lab.
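As an illustration of such a built-in function, here is a minimal sketch in Python rather than SAS (an assumption made for this example; the course lab uses SAS). The helper `n_choose_k` is a hypothetical name that spells out the factorial formula, and `math.comb` is the standard-library equivalent.

```python
import math

def n_choose_k(n, k):
    """Binomial coefficient from the factorial formula n! / (k! (n - k)!)."""
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

ways = n_choose_k(3, 1)          # the coin example: one head in three flips
ways_builtin = math.comb(3, 1)   # built-in function giving the same count
print(ways, ways_builtin)

# The "100 flips, 20 heads" count that would be a hassle to list by hand
print(math.comb(100, 20))
```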

When sequences are not equally likely
Suppose we draw 3 balls, with replacement, from an urn that contains 10 balls: 2 red balls and 8 green balls. What is the probability that we will draw exactly two red balls? As before, there are three possible sequences (RRG, RGR, and GRR), but the sequences no longer have probability 1/8.

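When sequences are not equally likely (cont'd)
The probability of each sequence is
$$\frac{2}{10} \cdot \frac{2}{10} \cdot \frac{8}{10} = \frac{2}{10} \cdot \frac{8}{10} \cdot \frac{2}{10} = \frac{8}{10} \cdot \frac{2}{10} \cdot \frac{2}{10}$$
Thus, the probability of drawing exactly two red balls is
$$3 \cdot \frac{2}{10} \cdot \frac{2}{10} \cdot \frac{8}{10} = 9.6\%$$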
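The urn calculation can be checked by brute force. This is a sketch only, assuming Python and hypothetical variable names: it enumerates every sequence of three draws, keeps those with exactly two red balls, and adds up their probabilities.

```python
from itertools import product

# Per-draw probabilities; drawing with replacement keeps them constant
p = {"R": 2 / 10, "G": 8 / 10}

total = 0.0
sequences = []
for seq in product("RG", repeat=3):
    if seq.count("R") == 2:
        sequences.append("".join(seq))
        # Independent draws: multiply the per-draw probabilities
        total += p[seq[0]] * p[seq[1]] * p[seq[2]]

print(sequences, round(total, 3))
```

Each of the three qualifying sequences contributes the same product, which is why the slide can simply multiply one sequence's probability by 3.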

The binomial formula
This line of reasoning can be summarized in the following formula: the probability that an event will occur $k$ times out of $n$ is
$$\frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$$
In this formula, $n$ is the number of trials, $k$ is the number of times the event is to occur, and $p$ is the probability that the event will occur on any particular trial.

The binomial formula (cont'd)
This formula works for any random variable that counts the number of times an event occurs out of $n$ trials, provided that the following assumptions are met: (1) the number of trials $n$ must be fixed in advance; (2) the probability that the event occurs, $p$, must be the same from trial to trial; (3) the trials must be independent.
If these assumptions are met, the random variable is said to follow a binomial distribution, or to be binomially distributed.

Example
According to the CDC, 22% of the adults in the United States smoke. Suppose we sample 10 people; what is the probability that exactly 5 of them will smoke? We can use the binomial formula with $n = 10$, $k = 5$, and $p = 0.22$:
$$\frac{10!}{5!\,(10-5)!}\,(0.22)^5 (1-0.22)^{10-5} = 3.7\%$$

Example (cont'd)
What is the probability that our sample will contain two or fewer smokers? We can add up probabilities from the binomial distribution:
$$P(x \le 2) = P(x = 0) + P(x = 1) + P(x = 2) = 0.083 + 0.235 + 0.298 \approx 61.7\%$$
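Both smoking calculations follow directly from the binomial formula. The sketch below assumes Python; `binomial_pmf` is a hypothetical helper written for this illustration, not a function from the course materials.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials with event probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.22  # sample of 10 people, 22% smoking rate

p_five = binomial_pmf(5, n, p)
p_two_or_fewer = sum(binomial_pmf(k, n, p) for k in range(3))  # k = 0, 1, 2

print(f"P(x = 5)  = {p_five:.3f}")
print(f"P(x <= 2) = {p_two_or_fewer:.3f}")
```

Summing the formula over a range of `k` values is all that "two or fewer" requires; no separate formula is needed.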

NHANES
Every few years, the CDC conducts a huge survey of randomly chosen Americans called the National Health and Nutrition Examination Survey (NHANES). Hundreds of variables are measured on these individuals: demographic variables like age, education, and income; physiological variables like height, weight, blood pressure, and cholesterol levels; dietary habits; disease status; and lots more, covering everything from cavities to sexual behavior.

A common histogram shape
Histograms of infant mortality rates, heights, and cholesterol levels:
[Figure: three frequency histograms: infant mortality rate (Africa), height in inches (NHANES, adult women), and HDL cholesterol (NHANES, adult women).]
What do these histograms have in common?

The normal curve
Mathematicians discovered long ago that the equation
$$y = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$
described the histograms of many random variables.
[Figure: the normal curve, y plotted against x for x from -4 to 4.]

Features of the normal curve
The normal curve is symmetric around x = 0. The normal curve is always positive. The normal curve drops rapidly down to zero as x moves away from 0.

Standardizing a variable
Before we can see the normal curve in action, we must discuss standardization of variables. To standardize a variable, we subtract the average and divide by the standard deviation:
$$x_i^{\text{std}} = \frac{x_i - \bar{x}}{s_x}$$
Once you have standardized a variable, its mean will be 0 and its standard deviation will be 1. Standardized values tell you how many standard deviations away from the mean an observation is.
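A short sketch of standardization, assuming Python. The function name `standardize` and the data values are made up for illustration, and the sample standard deviation (with an n - 1 denominator) is assumed, since the slides do not say which version is meant.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1 denominator)
    return [(v - mean) / sd for v in values]

heights = [60.0, 62.5, 63.0, 64.5, 67.5]  # made-up values, for illustration only
z = standardize(heights)

print([round(v, 2) for v in z])
print(round(statistics.mean(z), 6), round(statistics.stdev(z), 6))
```

The second printed line should show a mean of 0 and a standard deviation of 1 (up to floating-point rounding), matching the claim on the slide.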

The normal curve in action
[Figure: three density histograms in standard units: infant mortality rate (Africa), height (NHANES, adult women), and HDL cholesterol (NHANES, adult women).]
Note that the data have been standardized and that the vertical axis is now called density. Data whose histogram looks like the normal curve are said to be normally distributed or to follow a normal distribution.

Probabilities from the normal curve
Probabilities are given by the area under the normal curve.
[Figure: the normal curve, density plotted against x for x from -4 to 4.]

The 68%/95% rule
This is where the 68%/95% rule of thumb that we discussed earlier comes from.
[Figure: three panels of the normal curve, labeled P = 68%, P = 95%, and P = 100%.]

Calculating probabilities
By knowing that the total area under the normal curve is 1, we can get a rough idea of the area under a curve by looking at a plot. However, to get exact numbers, we will need a computer. "How much area is under this normal curve?" is an extremely common question in statistics, and programmers have developed algorithms to answer this question very quickly. The output from these algorithms is commonly collected into tables, which is what you will have to use for exams.

Calculating the area under a normal curve, example 1
Find the area under the normal curve between 0 and 1.
[Figure: three panels of the normal curve illustrating the subtraction of areas.]
The area to the left of 1 is .84 and the area to the left of 0 is .5, so the answer is .84 - .5 = .34.

Calculating the area under a normal curve, example 2
Find the area under the normal curve above 1.
[Figure: three panels of the normal curve illustrating the subtraction of areas.]
The total area is 1 and the area to the left of 1 is .84, so the answer is 1 - .84 = .16.

Calculating the area under a normal curve, example 3
Find the area under the normal curve that lies outside -1 and 1.
[Figure: three panels of the normal curve illustrating the subtraction of areas.]
The answer is 1 - (.84 - .16) = .32. Alternatively, we could have used symmetry: 2(.16) = .32.
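The three area examples can be reproduced on a computer instead of a table. This is a minimal sketch assuming Python's standard-library `statistics.NormalDist`, whose `cdf` method returns the area to the left of a value; the variable names are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal curve: mean 0, standard deviation 1

between_0_and_1 = Z.cdf(1) - Z.cdf(0)       # example 1
above_1 = 1 - Z.cdf(1)                      # example 2
outside_pm_1 = 1 - (Z.cdf(1) - Z.cdf(-1))   # example 3

print(round(between_0_and_1, 2), round(above_1, 2), round(outside_pm_1, 2))
```

Every example reduces to adding or subtracting left-tail areas, which is exactly how a printed table is used.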

Calculating percentiles
A related question of interest is, "What is the xth percentile of the normal curve?" This is the opposite of the earlier question: instead of being given a value and asked to find the area to the left of the value, now we are told the area to the left and asked to find the value. With a table, we can perform this inverse search by finding the probability in the body of the table, then looking to the margins to find the percentile associated with it.

Calculating percentiles (cont'd)
What is the 60th percentile of the normal curve? There is no .600 in the table, but there is a .599, which corresponds to 0.25. The real 60th percentile must lie between 0.25 and 0.26 (it's actually 0.2533). For this class, 0.25, 0.26, or anything in between is an acceptable answer.
How about the 10th percentile? The 10th percentile is -1.28.

Calculating values such that a certain area lies within/outside them
Find the number x such that the area outside -x and x is equal to 10%.
[Figure: three panels of the normal curve illustrating the two 5% tails.]
By symmetry, each tail must contain 5% of the area, so our answer is ±1.645 (the 5th/95th percentile).
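The inverse search can also be done by computer. The sketch below assumes Python's standard-library `statistics.NormalDist` and its `inv_cdf` method, which takes an area to the left and returns the corresponding value; the variable names are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal curve

p60 = Z.inv_cdf(0.60)   # 60th percentile
p10 = Z.inv_cdf(0.10)   # 10th percentile

# Area outside -x and x equal to 10% means 5% in each tail,
# so x is the 95th percentile
x = Z.inv_cdf(0.95)

print(round(p60, 4), round(p10, 2), round(x, 3))
```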

Reconstructing a histogram
In week 2, we said that the mean and standard deviation provide a two-number summary of a histogram. We can now make this observation a little more concrete. Anything we could have learned from a histogram, we will now determine by approximating the real distribution of the data by the normal distribution. This approach is called the normal approximation.

NHANES adult women
The data set we will work with in these examples is the NHANES sample of the heights of 2,649 adult women. The mean height is 63.5 inches. The standard deviation of height is 2.75 inches.

Estimating probabilities
Suppose we want to estimate the percent of women who are under 5 feet tall. 5 feet, or 60 inches, is 3.5/2.75 = 1.27 standard deviations below the mean. Using the normal distribution, the probability of falling more than 1.27 standard deviations below the mean is $P(x < -1.27) = 10.2\%$. In the actual sample, 10.6% of women were under 5 feet tall.

Estimating probabilities (cont'd)
Another example: suppose we want to estimate the percent of women who are between 5'3" and 5'6" (63 and 66 inches). These heights are 0.18 standard deviations below the mean and 0.91 standard deviations above the mean, respectively. Using the normal distribution, the probability of falling in this region is 39.0%. In the actual data set, 38.8% of women are between 5'3" and 5'6".

Approximating percentiles
Suppose instead that we wished to find the 75th percentile of these women's heights. For the normal distribution, the 75th percentile is 0.67 (more precisely, 0.6745). The mean plus 0.6745 standard deviations in height is 63.5 + 0.6745(2.75) ≈ 65.35 inches. For the actual data, the 75th percentile is 65.39 inches.
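The three NHANES estimates can be reproduced from the mean and standard deviation alone. This is a sketch assuming Python's standard-library `statistics.NormalDist`; it uses only the summary values quoted on the slides (63.5 and 2.75 inches), not the NHANES data themselves, and the variable names are hypothetical.

```python
from statistics import NormalDist

# Normal approximation to the heights of adult women (inches)
heights = NormalDist(mu=63.5, sigma=2.75)

under_5ft = heights.cdf(60)                   # P(height < 60)
between = heights.cdf(66) - heights.cdf(63)   # P(63 < height < 66)
p75 = heights.inv_cdf(0.75)                   # 75th percentile

print(f"Under 5 feet:        {under_5ft:.1%}")
print(f"Between 5'3 and 5'6: {between:.1%}")
print(f"75th percentile:     {p75:.2f} inches")
```

Working on the original scale, as here, is equivalent to standardizing first and then using the standard normal curve; small differences from the slide values come from rounding the standardized values to two decimals.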

The broad applicability of the normal approximation
These examples are by no means special: the distributions of many random variables are very closely approximated by the normal distribution. Indeed, this is why statisticians call it the normal distribution. Other names for the normal distribution include the Gaussian distribution (after the mathematician Gauss) and the bell curve (after its shape). For variables with approximately normal distributions, the mean and standard deviation essentially tell us everything about the data; other summary statistics and graphics are redundant.

Caution
Other variables, however, are not approximated well by the normal distribution, and give misleading or nonsensical results when you apply the normal approximation to them. For example, the value 0 lies 1.63 standard deviations below the mean infant mortality rate for Europe. The normal approximation therefore predicts that about 5.1% of the countries in Europe will have negative infant mortality rates.

Caution (cont'd)
As another example, the normal distribution will always predict the median to lie 0 standard deviations above the mean; i.e., it will always predict that the median equals the mean. As we have seen, however, the mean and median can differ greatly when distributions are skewed. For example, according to the U.S. Census Bureau, the mean income in the United States is $66,570, while the median income is $48,201.

Data distributions
So far, we have discussed the distribution of data. Certain random variables, like the number of males or smokers in our study, followed a binomial distribution. Other random variables, like height, seemed to follow a normal distribution.

For the rest of the course, we will be concerned with the distribution of statistics. For example: we go out, collect a sample of 10 people, measure their heights, and then take the average. If we were to repeat this procedure, we would get a different average. Thus, we can think about the average of our sample as a random variable. Like any random variable, it will have a distribution. This distribution is called a sampling distribution, to reflect the fact that it is subject to variability depending on the random sample.

What's the point?
In practice, no one obtains sampling distributions directly. Investigators do not collect 10 samples of 10 individuals and report 10 different means. If they can afford to sample 100 people, they collect a single sample of 100 people and report a single mean. So why do we study sampling distributions?

What's the point? (cont'd)
The reason we study sampling distributions is to understand how, in theory, our summary statistic would be distributed. If we understand the sampling distribution of the average height, then we will understand the amount of variability likely to be present in the actual, single average that we calculate. Understanding sampling distributions is central to inference, and to answering the question: "How accurate is my generalization to the population likely to be?"

Seeing sampling distributions
As we said before, investigators do not collect multiple samples in order to look at their sampling distributions. In order to see sampling distributions, we will have to do one of the following: (1) argue or prove that the statistic will have a certain distribution, like the normal or binomial distribution; or (2) conduct a computer simulation in which we do collect multiple samples, so that we can then summarize, graph, and describe the sampling distribution of our summary statistic.

Crossover trials
For example, consider the placebo-controlled experiment from assignment 2, which tested whether a new drug for cystic fibrosis was effective at preventing deterioration of lung function. I didn't tell you this at the time, but in reality, the trial featured a crossover design: each participant in the trial received both the drug and the placebo (at different times), crossing over to receive the other treatment halfway through the trial. Like all well-designed crossover trials, the therapy (treatment/placebo) that each participant received first was chosen at random. Furthermore, there was a washout period during the crossover between the two drug periods.

Cystic fibrosis crossover trial
So, for our cystic fibrosis study, let's consider the statistic x = number of patients who did better on the drug than on placebo. In the actual experiment, 11 patients did better on the drug than on placebo (x = 11). But how variable is x? Could it have easily been 7, or 14, or even 5? This is why we study sampling distributions.

A ball and urn recreation of the experiment
One way to get a handle on the variability of x is to recreate our experiment using balls and urns. Consider placing 11 balls saying "drug" and 3 balls saying "placebo" into an urn. We can recreate our experiment by drawing 14 balls from the urn with replacement and counting the number of "drug" balls (x). On a computer, this experiment can be replicated 10,000 times in a fraction of a second.
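The urn recreation might be simulated as follows. This is a sketch only, assuming Python; the seed, the function name `one_experiment`, and the variable names are arbitrary choices made for illustration, not details from the course.

```python
import random
from collections import Counter

random.seed(1)  # arbitrary seed so that the run is repeatable

urn = ["drug"] * 11 + ["placebo"] * 3  # 11 "drug" balls, 3 "placebo" balls
n_reps = 10_000                        # number of simulated experiments

def one_experiment():
    """Draw 14 balls with replacement and count the "drug" balls (x)."""
    draws = random.choices(urn, k=14)
    return draws.count("drug")

counts = Counter(one_experiment() for _ in range(n_reps))

# Proportion of simulated experiments giving each value of x
for x in range(15):
    print(x, counts[x] / n_reps)
```

The printed proportions are one simulated approximation to the sampling distribution of x; a different seed gives slightly different numbers.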

Results of the recreation
[Figure: proportion of simulated experiments (Pct, from 0.00 to 0.25) for each value of x from 0 to 14.]

Using the binomial distribution
Of course, this is sort of the long way around: we could have gotten these percentages without a computer by recognizing that x satisfies all the conditions of the binomial distribution. Here, we would say that x follows a binomial distribution with n = 14 and p = 11/14. This would give us the same percentages as before, to within the experimental error of the simulation. Both are perfectly valid ways of exploring sampling distributions to get a sense of the variability of sample statistics. We will discuss how to explore sampling distributions more explicitly in the coming weeks.
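The exact binomial percentages for n = 14 and p = 11/14 can be computed directly from the binomial formula. The sketch below assumes Python, and `binomial_pmf` is a hypothetical helper name.

```python
import math

n, p = 14, 11 / 14  # 14 patients, each "better on drug" with probability 11/14

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Exact probability of each possible value of x
for k, prob in enumerate(pmf):
    print(k, round(prob, 3))
```

These exact values can be compared against simulated proportions; any remaining differences reflect simulation error rather than a different distribution.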