Random variables The binomial distribution The normal distribution Other distributions. Distributions. Patrick Breheny.


Distributions February 11

Random variables
Anything that can be measured or categorized is called a variable. If the value that a variable takes on is subject to variability, then the variable is a random variable. We have already seen several random variables in the data that we have looked at so far: sex, class, age, blood pressure.

Distributions
Data and probability meet in the notion of a distribution. A probability distribution applies the theory of probability to describe the behavior of a random variable. A distribution describes the probability that a random variable will be observed to take on a specific value or fall within a specific range of values.

Discrete distributions
Categorical variables are said to have discrete probability distributions. In a discrete distribution, variables can only take on a finite number of values. Because they only take on a finite number of values, the distribution can describe the probability that each value will occur.
Examples:
Random variable                                     Possible outcomes
Survival                                            Yes, no
# of copies of a genetic mutation                   0, 1, 2
# of children a woman will have in her lifetime     0, 1, 2, ...
# of people in a sample who smoke                   0, 1, 2, ..., n

Continuous distributions
Continuous variables can take on an infinite number of possible values. Because of this, any particular value will have probability 0. So what does a continuous distribution describe? It describes the probability that a continuous random variable will fall within a certain range.
Examples of continuous random variables: height, weight, cholesterol levels, survival time.

Listing the ways
When trying to figure out the probability of something, it is sometimes very helpful to list all the different ways that the random process can turn out. If all the ways are equally likely, then each one has probability $1/n$, where $n$ is the total number of ways. Thus, the probability of the event is the number of ways it can happen divided by $n$.

Genetics example
For example, the possible outcomes of an individual inheriting cystic fibrosis genes are: CC, Cc, cC, cc. If all these possibilities are equally likely (as they would be if the individual's parents had one copy of each version of the gene), then the probability of having one copy of each version is 2/4.

Coin example
Another example where the outcomes are equally likely is flips of a coin. Suppose we flip a coin three times; what is the probability that exactly one of the flips was heads?
Possible outcomes: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT
Exactly one of the flips is heads in three of these eight outcomes (HTT, THT, TTH). The probability is therefore 3/8.
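The listing approach in the coin example can be checked by brute force. The following is a minimal Python sketch, not part of the original slides; the variable names are illustrative only.

```python
from itertools import product

# Every sequence of three flips; for a fair coin each one is equally likely.
outcomes = ["".join(flips) for flips in product("HT", repeat=3)]

# Keep only the sequences that contain exactly one head.
exactly_one_head = [seq for seq in outcomes if seq.count("H") == 1]

# Number of ways the event can happen divided by the total number of ways.
prob_one_head = len(exactly_one_head) / len(outcomes)

print(outcomes)
print(exactly_one_head, prob_one_head)
```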

The binomial coefficients
Counting the number of ways something can happen quickly becomes a hassle (imagine listing the outcomes involved in flipping a coin 100 times). Luckily, mathematicians long ago discovered that when an event either occurs or doesn't occur on each of $n$ trials, the number of ways the event can occur exactly $k$ times is
$$\frac{n!}{k!\,(n-k)!}$$
The notation $n!$ means to multiply $n$ by all the positive integers that come before it (e.g., $3! = 3 \cdot 2 \cdot 1$). Note: $0! = 1$.

Calculating the binomial coefficients
For the coin example, we could have used the binomial coefficients instead of listing all the ways the flips could happen:
$$\frac{3!}{1!\,(3-1)!} = \frac{3 \cdot 2 \cdot 1}{(1)(2 \cdot 1)} = 3$$
Many calculators and computer programs (including SAS) have specific functions for calculating binomial coefficients, which we will explore in lab.
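As an illustration of such a function outside SAS, here is a small Python sketch (an assumption of this edit, not the lab's software). The helper name `binomial_coefficient` is hypothetical; `math.comb` is the standard-library equivalent in Python 3.8 and later.

```python
from math import comb, factorial

def binomial_coefficient(n, k):
    """Number of ways an event can occur exactly k times in n trials."""
    # Direct translation of n! / (k! (n - k)!); the division is always exact.
    return factorial(n) // (factorial(k) * factorial(n - k))

# Coin example: ways to get exactly one head in three flips.
ways_3_1 = binomial_coefficient(3, 1)

# The built-in function gives the same answer without computing factorials by hand.
assert ways_3_1 == comb(3, 1)

print(ways_3_1)
print(comb(100, 50))  # feasible even when listing the outcomes is not
```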

When sequences are not equally likely
Suppose we draw 3 balls, with replacement, from an urn that contains 10 balls: 2 red balls and 8 green balls. What is the probability that we will draw two red balls? As before, there are three possible sequences: RRG, RGR, and GRR, but the sequences no longer have probability $1/8$.

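When sequences are not equally likely (cont'd)
The probability of each sequence is
$$\frac{2}{10}\cdot\frac{2}{10}\cdot\frac{8}{10} = \frac{2}{10}\cdot\frac{8}{10}\cdot\frac{2}{10} = \frac{8}{10}\cdot\frac{2}{10}\cdot\frac{2}{10} \approx .03$$
Thus, the probability of drawing two red balls is
$$3 \cdot \frac{2}{10}\cdot\frac{2}{10}\cdot\frac{8}{10} = 9.6\%$$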
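The urn calculation can be verified by enumerating the sequences and multiplying the per-draw probabilities. This is a sketch using exact fractions; the names are illustrative and not from the slides.

```python
from fractions import Fraction
from itertools import product

# Per-draw probabilities: 2 red and 8 green balls, drawn with replacement.
prob = {"R": Fraction(2, 10), "G": Fraction(8, 10)}

p_two_red = Fraction(0)
for seq in product("RG", repeat=3):
    if seq.count("R") == 2:
        # Draws are independent, so the sequence probability is a product.
        p_seq = prob[seq[0]] * prob[seq[1]] * prob[seq[2]]
        print("".join(seq), float(p_seq))
        p_two_red += p_seq

print(float(p_two_red))  # total over RRG, RGR, GRR
```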

The binomial formula
This line of reasoning can be summarized in the following formula: the probability that an event will occur $k$ times out of $n$ is
$$\frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}$$
In this formula, $n$ is the number of trials and $p$ is the probability that the event will occur on any particular trial. We can then use the above formula to figure out the probability that the event will occur $k$ times.

Example
According to the CDC, 22% of the adults in the United States smoke. Suppose we sample 10 people; what is the probability that 5 of them will smoke? We can use the binomial formula, with
$$\frac{10!}{5!\,(10-5)!}\,(.22)^5 (1-.22)^{10-5} = 3.7\%$$

Example (cont'd)
What is the probability that our sample will contain two or fewer smokers? We can add up probabilities from the binomial distribution:
$$P(x \le 2) = P(x = 0) + P(x = 1) + P(x = 2) = .083 + .235 + .298 = 61.7\%$$
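The two smoking calculations can be reproduced with a direct translation of the binomial formula. This is a minimal Python sketch; the function name `binomial_pmf` is hypothetical and chosen only for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability that an event with per-trial probability p occurs k times in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability that exactly 5 of 10 sampled adults smoke, with p = 0.22.
p_five = binomial_pmf(5, 10, 0.22)

# Probability of two or fewer smokers: add up P(x = 0), P(x = 1), P(x = 2).
p_two_or_fewer = sum(binomial_pmf(k, 10, 0.22) for k in range(3))

print(round(p_five, 3), round(p_two_or_fewer, 3))
```

Summing the unrounded terms is why the total rounds to 61.7% even though the three rounded terms shown on the slide add to .616.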

The binomial formula: when to use
This formula works for any random variable that counts the number of times an event occurs out of $n$ trials, provided that the following assumptions are met:
The number of trials $n$ must be fixed in advance.
The probability that the event occurs, $p$, must be the same from trial to trial.
The trials must be independent.
If these assumptions are met, the random variable is said to follow a binomial distribution, or to be binomially distributed.

A common histogram shape
Histograms of infant mortality rates, heights, and cholesterol levels:
[Figure: three frequency histograms: infant mortality rate (Africa), height in inches (NHANES adult women), and HDL cholesterol (NHANES adult women).]
What do these histograms have in common?

The normal curve
Mathematicians discovered long ago that the equation
$$y = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$
described the histograms of many random variables.
[Figure: plot of $y$ against $x$ for $x$ from -4 to 4.]

Features of the normal curve
The normal curve:
is symmetric around x = 0
is always positive
drops rapidly toward zero as x moves away from 0

The normal curve in action
[Figure: density histograms of infant mortality rate (Africa), height (NHANES adult women), and HDL cholesterol (NHANES adult women), each in standard units.]
Note that the data have been standardized and that the vertical axis is now called density. Data whose histogram looks like the normal curve are said to be normally distributed or to follow a normal distribution.

Probabilities from the normal curve
Probabilities are given by the area under the normal curve.
[Figure: normal density curve with a shaded area; horizontal axis x, vertical axis density.]

The 68%/95% rule
This is where the 68%/95% rule of thumb that we discussed earlier comes from.
[Figure: three panels of the normal curve with shaded areas labelled P=68%, P=95%, and P=100%.]

Calculating probabilities
By knowing that the total area under the normal curve is 1, we can get a rough idea of the area under a curve by looking at a plot. However, to get exact numbers, we will need a computer. "How much area is under this normal curve?" is an extremely common question in statistics, and programmers have developed algorithms to answer this question very quickly. The output from these algorithms is commonly collected into tables, which is what you will have to use for exams.

Calculating the area under a normal curve, example 1
Find the area under the normal curve between 0 and 1.
[Figure: shaded areas under the normal curve.]
The area to the left of 1 is .84 and the area to the left of 0 is .5, so the answer is .84 - .5 = .34.

Calculating the area under a normal curve, example 2
Find the area under the normal curve above 1.
[Figure: shaded areas under the normal curve.]
The total area is 1 and the area to the left of 1 is .84, so the answer is 1 - .84 = .16.

Calculating the area under a normal curve, example 3
Find the area under the normal curve that lies outside -1 and 1.
[Figure: shaded areas under the normal curve.]
The area between -1 and 1 is .84 - .16, so the answer is 1 - (.84 - .16) = .32. Alternatively, we could have used symmetry: 2(.16) = .32.
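The three area calculations above can also be done by computer instead of a table. The sketch below assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

# cdf(x) is the area under the curve to the left of x.
area_0_to_1 = std_normal.cdf(1) - std_normal.cdf(0)                # example 1
area_above_1 = 1 - std_normal.cdf(1)                               # example 2
area_outside_1 = 1 - (std_normal.cdf(1) - std_normal.cdf(-1))      # example 3

print(round(area_0_to_1, 2), round(area_above_1, 2), round(area_outside_1, 2))
```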

Calculating percentiles
A related question of interest is, "What is the xth percentile of the normal curve?" This is the opposite of the earlier question: instead of being given a value and asked to find the area to the left of the value, now we are told the area to the left and asked to find the value. With a table, we can perform this inverse search by finding the probability in the body of the table, then looking to the margins to find the percentile associated with it.

Calculating percentiles (cont'd)
What is the 60th percentile of the normal curve? There is no .600 in the table, but there is a .599, which corresponds to 0.25. The real 60th percentile must lie between 0.25 and 0.26 (it's actually 0.2533). For this class, 0.25, 0.26, or anything in between is an acceptable answer.
How about the 10th percentile? The 10th percentile is -1.28.

Calculating values such that a certain area lies within/outside them
Find the number x such that the area outside -x and x is equal to 10%.
[Figure: shaded areas under the normal curve.]
By symmetry, 5% of the area must lie in each tail. Our answer is therefore ±1.645 (the 5th/95th percentile).
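The inverse lookups described for percentiles can be sketched in Python as well. This assumes Python 3.8 or later (`statistics.NormalDist.inv_cdf`); it is an illustration, not the table-based method required for exams.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# inv_cdf(p) returns the value with area p to its left, i.e. the 100p-th percentile.
pct_60 = std_normal.inv_cdf(0.60)
pct_10 = std_normal.inv_cdf(0.10)

# For 10% of the area to lie outside -x and x, 5% must lie in each tail,
# so x is the 95th percentile.
x_outside_10 = std_normal.inv_cdf(0.95)

print(round(pct_60, 4), round(pct_10, 2), round(x_outside_10, 3))
```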

Reconstructing a histogram
In week 2, we said that the mean and standard deviation provide a two-number summary of a histogram. We can now make this observation a little more concrete. Anything we could have learned from a histogram, we will now determine by approximating the real distribution of the data by the normal distribution. This approach is called the normal approximation.

NHANES adult women
The data set we will work with on these examples is the NHANES sample of the heights of 2,649 adult women. The mean height is 63.5 inches. The standard deviation of height is 2.75 inches.

Procedure: Probabilities using the normal curve
The procedure for calculating probabilities with the normal approximation is as follows:
#1 Draw a picture of the normal curve and shade in the appropriate probability.
#2 Convert to standard units: letting $x$ denote a number in the original units and $z$ a number in standard units,
$$z = \frac{x - \bar{x}}{SD}$$
where $\bar{x}$ is the mean and $SD$ is the standard deviation.
#3 Determine the area under the normal curve using a table or computer.

Estimating probabilities: Example #1
Suppose we want to estimate the percent of women who are under 5 feet tall. 5 feet, or 60 inches, is 3.5/2.75 = 1.27 standard deviations below the mean. Using the normal distribution, the probability of falling more than 1.27 standard deviations below the mean is $P(z < -1.27) = 10.2\%$. In the actual sample, 10.6% of women were under 5 feet tall.

Estimating probabilities: Example #2
Another example: suppose we want to estimate the percent of women who are between 5'3" and 5'6" (63 and 66 inches). These heights are 0.18 standard deviations below the mean and 0.91 standard deviations above the mean, respectively. Using the normal distribution, the probability of falling in this region is 39.0%. In the actual data set, 38.8% of women are between 5'3" and 5'6".
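The two height examples follow the standardize-then-look-up procedure, which can be sketched as below. The helper `to_standard_units` is a hypothetical name; the mean and SD are the NHANES values quoted in the slides. Because the code does not round z to two decimals, its results can differ slightly in the last digit from the table-based figures.

```python
from statistics import NormalDist

MEAN_HEIGHT = 63.5   # inches, NHANES adult women (from the slides)
SD_HEIGHT = 2.75     # inches

std_normal = NormalDist(mu=0, sigma=1)

def to_standard_units(x, mean, sd):
    """Step 2 of the procedure: z = (x - mean) / SD."""
    return (x - mean) / sd

# Example 1: percent of women under 5 feet (60 inches).
z_60 = to_standard_units(60, MEAN_HEIGHT, SD_HEIGHT)
p_under_60 = std_normal.cdf(z_60)

# Example 2: percent of women between 63 and 66 inches.
z_63 = to_standard_units(63, MEAN_HEIGHT, SD_HEIGHT)
z_66 = to_standard_units(66, MEAN_HEIGHT, SD_HEIGHT)
p_63_to_66 = std_normal.cdf(z_66) - std_normal.cdf(z_63)

print(round(z_60, 2), round(p_under_60, 3))
print(round(z_63, 2), round(z_66, 2), round(p_63_to_66, 3))
```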

Procedure: Percentiles using the normal curve
We can also use the normal distribution to approximate percentiles. The procedure for calculating percentiles with the normal approximation is as follows:
#1 Draw a picture of the normal curve and shade in the appropriate area under the curve.
#2 Determine the percentiles of the normal curve corresponding to the shaded region using a table or computer.
#3 Convert from standard units back to the original units:
$$x = \bar{x} + z(SD)$$
where, again, $x$ is in original units, $z$ is in standard units, $\bar{x}$ is the mean, and $SD$ is the standard deviation.

Approximating percentiles: Example
Suppose instead that we wished to find the 75th percentile of these women's heights. For the normal distribution, 0.67 is the 75th percentile. The mean plus 0.67 standard deviations in height is 65.35 inches. For the actual data, the 75th percentile is 65.39 inches.
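The percentile example can be sketched the same way: find the percentile in standard units, then convert back to inches. This is an illustrative Python snippet (Python 3.8 or later), using the unrounded z value rather than the table value 0.67.

```python
from statistics import NormalDist

MEAN_HEIGHT = 63.5   # inches, NHANES adult women (from the slides)
SD_HEIGHT = 2.75     # inches

# Step 2: the 75th percentile of the normal curve, in standard units.
z_75 = NormalDist(mu=0, sigma=1).inv_cdf(0.75)

# Step 3: convert back to original units, x = mean + z * SD.
height_75 = MEAN_HEIGHT + z_75 * SD_HEIGHT

print(round(z_75, 2), round(height_75, 2))
```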

The broad applicability of the normal approximation
These examples are by no means special: the distributions of many random variables are very closely approximated by the normal distribution. Indeed, this is why statisticians call it the normal distribution. Other names for the normal distribution include the Gaussian distribution (after its inventor) and the bell curve (after its shape). For variables with approximately normal distributions, the mean and standard deviation essentially tell us everything about the data; other summary statistics and graphics are redundant.

Caution
Other variables, however, are not approximated well by the normal distribution, and give misleading or nonsensical results when you apply the normal approximation to them. For example, the value 0 lies 1.63 standard deviations below the mean infant mortality rate for Europe. The normal approximation therefore predicts that 5.1% of the countries in Europe will have negative infant mortality rates.

Caution (cont'd)
As another example, the normal distribution will always predict the median to lie 0 standard deviations above the mean, i.e., it will always predict that the median equals the mean. As we have seen, however, the mean and median can differ greatly when distributions are skewed. For example, according to the U.S. Census Bureau, the mean income in the United States is $66,570, while the median income is $48,201.

Are there other distributions, for modeling skewed or otherwise abnormal data? Yes; statisticians have invented dozens and dozens of distributions. However, the binomial and normal distributions can be applied to an incredibly wide array of problems, and you will rarely need to be familiar with other distributions. One potential exception is the Poisson distribution, which is often used in epidemiology to model the occurrence of rare diseases.

The Poisson distribution
For example, the number of homicides committed in London on a given day is a discrete number. Suppose we wanted to predict the probability that there would be two homicides tomorrow, or that there would be fewer than a dozen in the next month. In principle, you could use the binomial distribution, but you'd need to know the number of people in London (a very large number) and the probability that a given person will be killed tomorrow (a very small number). We'll skip the details of exactly how the Poisson distribution works, but it provides a way to calculate the desired probabilities based only on the average number of deaths per day.
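For readers who want to see what such a calculation looks like, here is a hedged Python sketch using the standard Poisson probability formula, $P(k) = e^{-\lambda}\lambda^k / k!$, where $\lambda$ is the average count. The slides do not give the London rate, so `daily_rate` below is a purely hypothetical placeholder, and the 30-day month is likewise an assumption.

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """Probability of exactly k events when the average count is `rate`."""
    return exp(-rate) * rate**k / factorial(k)

# HYPOTHETICAL average number of homicides per day; not taken from the slides.
daily_rate = 0.5

# Probability of exactly two homicides tomorrow.
p_two_tomorrow = poisson_pmf(2, daily_rate)

# Fewer than a dozen in the next month, assuming a 30-day month so that
# the average count for the month is 30 times the daily average.
monthly_rate = 30 * daily_rate
p_fewer_than_dozen = sum(poisson_pmf(k, monthly_rate) for k in range(12))

print(round(p_two_tomorrow, 4), round(p_fewer_than_dozen, 4))
```

Only the average rate enters the calculation; neither the population size nor the per-person probability is needed.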

The Poisson distribution in action
[Figure: expected versus observed number of occurrences, 2004-2007, by homicides per day (0, 1, 2, 3, 4, 5+).]