Distributions. Patrick Breheny.
Outline: Random variables; The binomial distribution; The normal distribution; Other distributions.

Distributions February 11

Random variables
Anything that can be measured or categorized is called a variable. If the value that a variable takes on is subject to variability, then the variable is a random variable. We have already seen several random variables in the data that we have looked at so far: sex, class, age, and blood pressure.

Distributions
Data and probability meet in the notion of a distribution. A probability distribution applies the theory of probability to describe the behavior of a random variable. A distribution describes the probability that a random variable will be observed to take on a specific value or fall within a specific range of values.

Discrete distributions
Categorical variables are said to have discrete probability distributions. In a discrete distribution, variables can only take on a finite (or at least countable) number of distinct values. Because the possible values can be listed one by one, the distribution can describe the probability that each value will occur.
Examples (random variable: possible outcomes):
Survival: yes, no
Number of copies of a genetic mutation: 0, 1, 2
Number of children a woman will have in her lifetime: 0, 1, 2, ...
Number of people in a sample who smoke: 0, 1, 2, ..., n

Continuous distributions
Continuous variables can take on an infinite number of possible values. Because of this, any particular value will have probability 0. So what does a continuous distribution describe? It describes the probability that a continuous random variable will fall within a certain range.
Examples of continuous random variables: height, weight, cholesterol levels, survival time.

Listing the ways
When trying to figure out the probability of something, it is sometimes very helpful to list all the different ways that the random process can turn out. If all the ways are equally likely, then each one has probability $1/n$, where n is the total number of ways. Thus, the probability of the event is the number of ways it can happen divided by n.

Genetics example
For example, the possible outcomes of an individual inheriting cystic fibrosis genes are CC, Cc, cC, and cc. If all these possibilities are equally likely (as they would be if each of the individual's parents had one copy of each version of the gene), then the probability of having one copy of each version is 2/4.

Coin example
Another example where the outcomes are equally likely is flips of a coin. Suppose we flip a coin three times; what is the probability that exactly one of the flips was heads?
Possible outcomes: HHH, HHT, HTH, HTT, THH, THT, TTH, TTT
Exactly one head occurs in three of the eight outcomes (HTT, THT, TTH). The probability is therefore 3/8.
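The listing above can also be produced by a computer. The following is a minimal Python sketch (Python is not the software used in this course, and the variable names are illustrative) that enumerates the eight sequences and counts those with exactly one head.

```python
from itertools import product

# Enumerate every equally likely sequence of three coin flips.
outcomes = ["".join(flips) for flips in product("HT", repeat=3)]

# Keep only the sequences that contain exactly one head.
one_head = [seq for seq in outcomes if seq.count("H") == 1]

# Number of ways the event can happen divided by the total number of ways.
probability = len(one_head) / len(outcomes)

print(outcomes)
print(one_head, probability)
```

Changing `repeat=3` to a larger number shows how quickly the list grows, which is the motivation for counting the ways with a formula instead.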

The binomial coefficients
Counting the number of ways something can happen quickly becomes a hassle (imagine listing the outcomes involved in flipping a coin 100 times). Luckily, mathematicians long ago discovered that when there are two possible outcomes (an event occurs or doesn't occur) on each of n trials, the number of ways for the event to occur k times is
$\frac{n!}{k!(n-k)!}$
The notation n! means to multiply n by all the positive integers that come before it (e.g., $3! = 3 \cdot 2 \cdot 1$). Note: 0! = 1.

Calculating the binomial coefficients
For the coin example, we could have used the binomial coefficients instead of listing all the ways the flips could happen:
$\frac{3!}{1!(3-1)!} = \frac{3 \cdot 2 \cdot 1}{1 \cdot (2 \cdot 1)} = 3$
Many calculators and computer programs (including SAS) have specific functions for calculating binomial coefficients, which we will explore in lab.
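As an illustration outside of SAS, Python's standard library offers `math.comb` for binomial coefficients. The sketch below checks it against the factorial formula for the coin example; the variable names are illustrative only.

```python
from math import comb, factorial

n, k = 3, 1  # three flips, exactly one head

# The factorial formula n! / (k! (n - k)!), using integer division.
by_formula = factorial(n) // (factorial(k) * factorial(n - k))

# The built-in binomial coefficient function.
by_builtin = comb(n, k)

print(by_formula, by_builtin)

# The same function handles cases that would be hopeless to list by hand,
# such as the number of ways to get 50 heads in 100 flips.
print(comb(100, 50))
```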

When sequences are not equally likely
Suppose we draw 3 balls, with replacement, from an urn that contains 10 balls: 2 red balls and 8 green balls. What is the probability that we will draw exactly two red balls? As before, there are three possible sequences: RRG, RGR, and GRR, but the sequences no longer have probability 1/8.

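When sequences are not equally likely (cont'd)
The probability of each sequence is
$\frac{2}{10} \cdot \frac{2}{10} \cdot \frac{8}{10} = \frac{2}{10} \cdot \frac{8}{10} \cdot \frac{2}{10} = \frac{8}{10} \cdot \frac{2}{10} \cdot \frac{2}{10} \approx .03$
Thus, the probability of drawing two red balls is
$3 \cdot \frac{2}{10} \cdot \frac{2}{10} \cdot \frac{8}{10} = 9.6\%$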
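The urn calculation can be checked by brute force. This is a rough Python sketch, with illustrative names, that lists every sequence of three draws, multiplies the per-draw probabilities, and adds up the sequences containing exactly two red balls.

```python
from itertools import product

# Probability of each color on a single draw (with replacement).
p_color = {"R": 2 / 10, "G": 8 / 10}

total = 0.0
for seq in product("RG", repeat=3):
    if seq.count("R") == 2:
        # Independent draws: multiply the probabilities along the sequence.
        prob = 1.0
        for ball in seq:
            prob *= p_color[ball]
        print("".join(seq), round(prob, 3))
        total += prob

print(round(total, 3))
```

Each of the three qualifying sequences has the same probability, which is why the slide can simply multiply one of them by 3.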

The binomial formula
This line of reasoning can be summarized in the following formula: the probability that an event will occur k times out of n is
$\frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}$
In this formula, n is the number of trials and p is the probability that the event will occur on any particular trial. We can then use the above formula to figure out the probability that the event will occur k times.

Example
According to the CDC, 22% of the adults in the United States smoke. Suppose we sample 10 people; what is the probability that exactly 5 of them will smoke? We can use the binomial formula, with n = 10, k = 5, and p = .22:
$\frac{10!}{5!(10-5)!} (.22)^5 (1-.22)^{10-5} = 3.7\%$
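For readers who want to verify the arithmetic, here is a small Python sketch of the binomial formula applied to the smoking example. The function name `binomial_pmf` is an illustrative choice, not something defined in the course.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability that an event with per-trial probability p occurs k times in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Smoking example: n = 10 people sampled, p = 0.22, exactly k = 5 smokers.
p_five = binomial_pmf(5, 10, 0.22)
print(f"{p_five:.1%}")

# Sanity check: the probabilities of all possible counts add up to 1.
print(sum(binomial_pmf(k, 10, 0.22) for k in range(11)))
```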

Example (cont'd)
What is the probability that our sample will contain two or fewer smokers? We can add up probabilities from the binomial distribution:
$P(x \le 2) = P(x = 0) + P(x = 1) + P(x = 2) = .083 + .235 + .298 = 61.7\%$
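The three terms in this sum can be computed directly. The sketch below redefines an illustrative helper, `binomial_pmf`, so that it runs on its own.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Binomial formula: probability of k occurrences in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.22

# P(x = 0), P(x = 1), P(x = 2)
terms = [binomial_pmf(k, n, p) for k in range(3)]
p_two_or_fewer = sum(terms)

print([round(t, 3) for t in terms])
print(f"{p_two_or_fewer:.1%}")
```

Summing the unrounded terms gives 61.7%; adding the three rounded values shown on the slide gives .616, and the small difference is due to rounding.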

The binomial formula: when to use
This formula works for any random variable that counts the number of times an event occurs out of n trials, provided that the following assumptions are met:
The number of trials n must be fixed in advance.
The probability that the event occurs, p, must be the same from trial to trial.
The trials must be independent.
If these assumptions are met, the random variable is said to follow a binomial distribution, or to be binomially distributed.

A common histogram shape
Histograms of infant mortality rates, heights, and cholesterol levels:
[Figure: three frequency histograms: infant mortality rate (Africa), height in inches (NHANES adult women), and HDL cholesterol (NHANES adult women).]
What do these histograms have in common?

The normal curve
Mathematicians discovered long ago that the equation
$y = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$
described the histograms of many random variables.
[Figure: plot of y against x for x between -4 and 4.]

Features of the normal curve
The normal curve is symmetric around x = 0, is always positive, and drops rapidly toward zero as x moves away from 0.
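These features can be checked numerically. The following Python sketch evaluates the equation for the normal curve, $y = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, at a few points; the function name `normal_curve` is illustrative.

```python
from math import exp, pi, sqrt

def normal_curve(x):
    """Height of the normal curve at x: exp(-x^2 / 2) / sqrt(2 * pi)."""
    return exp(-x * x / 2) / sqrt(2 * pi)

# Symmetric around 0: the height at x equals the height at -x.
print(normal_curve(1.5), normal_curve(-1.5))

# Always positive, highest at 0, and close to zero a few units away.
for x in [0, 1, 2, 3, 4]:
    print(x, round(normal_curve(x), 4))
```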

The normal curve in action
[Figure: three density histograms, each in standard units: infant mortality rate (Africa), height (NHANES adult women), and HDL cholesterol (NHANES adult women).]
Note that the data has been standardized and that the vertical axis is now called density. Data whose histogram looks like the normal curve are said to be normally distributed or to follow a normal distribution.

Probabilities from the normal curve
Probabilities are given by the area under the normal curve.
[Figure: the normal curve (density against x) with an area under the curve shaded.]

The 68%/95% rule
This is where the 68%/95% rule of thumb that we discussed earlier comes from.
[Figure: three normal curves with shaded areas labeled P=68%, P=95%, and P=100%.]
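The rule of thumb can be reproduced with a computer. This sketch uses `NormalDist` from Python's standard `statistics` module (an illustrative choice of software, not the one used in this course) to find the area within 1 and within 2 standard units of 0.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(z):
    """Area under the normal curve between -z and z."""
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

within_one = area_within(1)
within_two = area_within(2)

print(f"within 1: {within_one:.1%}")
print(f"within 2: {within_two:.1%}")
```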

Calculating probabilities
By knowing that the total area under the normal curve is 1, we can get a rough idea of the area under a curve by looking at a plot. However, to get exact numbers, we will need a computer. "How much area is under this normal curve?" is an extremely common question in statistics, and programmers have developed algorithms to answer this question very quickly. The output from these algorithms is commonly collected into tables, which is what you will have to use for exams.

Calculating the area under a normal curve, example 1
Find the area under the normal curve between 0 and 1.
[Figure: three normal curves showing the area to the left of 1, the area to the left of 0, and the difference between them.]
.84 - .5 = .34

Calculating the area under a normal curve, example 2
Find the area under the normal curve above 1.
[Figure: three normal curves showing the total area, the area to the left of 1, and the difference between them.]
1 - .84 = .16

Calculating the area under a normal curve, example 3
Find the area under the normal curve that lies outside -1 and 1.
[Figure: three normal curves illustrating the total area, the area between -1 and 1, and the area outside -1 and 1.]
1 - (.84 - .16) = .32
Alternatively, we could have used symmetry: 2(.16) = .32
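The three area examples can be reproduced with a computer instead of a table. The sketch below uses Python's `statistics.NormalDist`, whose `cdf` method returns the area to the left of a value; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve: mean 0, standard deviation 1

# Example 1: area between 0 and 1 (area left of 1 minus area left of 0).
between_0_and_1 = z.cdf(1) - z.cdf(0)

# Example 2: area above 1 (total area minus area left of 1).
above_1 = 1 - z.cdf(1)

# Example 3: area outside -1 and 1, computed two ways.
outside = 1 - (z.cdf(1) - z.cdf(-1))
outside_by_symmetry = 2 * above_1

print(round(between_0_and_1, 2), round(above_1, 2), round(outside, 2))
```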

Calculating percentiles
A related question of interest is, "What is the xth percentile of the normal curve?" This is the opposite of the earlier question: instead of being given a value and asked to find the area to the left of the value, now we are told the area to the left and asked to find the value. With a table, we can perform this inverse search by finding the probability in the body of the table, then looking to the margins to find the percentile associated with it.

Calculating percentiles (cont'd)
What is the 60th percentile of the normal curve? There is no .600 in the table, but there is a .599, which corresponds to 0.25. The real 60th percentile must lie between 0.25 and 0.26 (it's actually 0.2533). For this class, 0.25, 0.26, or anything in between is an acceptable answer.
How about the 10th percentile? The 10th percentile is -1.28.
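With a computer, the inverse search is a single function call. This sketch uses the `inv_cdf` method of Python's `statistics.NormalDist`, which takes an area to the left and returns the corresponding value; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve

# Value with 60% of the area to its left.
pct_60 = z.inv_cdf(0.60)

# Value with 10% of the area to its left.
pct_10 = z.inv_cdf(0.10)

print(round(pct_60, 4), round(pct_10, 2))
```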

Calculating values such that a certain area lies within/outside them
Find the number x such that the area outside -x and x is equal to 10%.
[Figure: three normal curves illustrating 10% of the area split equally between the two tails, 5% below -x and 5% above x.]
Our answer is therefore ±1.645 (the 5th and 95th percentiles).
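The reasoning here is that an outside area of 10% leaves 5% in each tail, so x is the 95th percentile. A short Python sketch of that step, using `statistics.NormalDist` with illustrative variable names:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal curve

outside_area = 0.10

# Split the outside area evenly between the two tails,
# so x has area 1 - 0.05 = 0.95 to its left.
x = z.inv_cdf(1 - outside_area / 2)

# Check: the area outside -x and x should be 10%.
check = 1 - (z.cdf(x) - z.cdf(-x))

print(round(x, 3), round(check, 3))
```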

Reconstructing a histogram
In week 2, we said that the mean and standard deviation provide a two-number summary of a histogram. We can now make this observation a little more concrete. Anything we could have learned from a histogram, we will now determine by approximating the real distribution of the data by the normal distribution. This approach is called the normal approximation.

NHANES adult women
The data set we will work with in these examples is the NHANES sample of the heights of 2,649 adult women. The mean height is 63.5 inches. The standard deviation of height is 2.75 inches.

Procedure: Probabilities using the normal curve
The procedure for calculating probabilities with the normal approximation is as follows:
#1 Draw a picture of the normal curve and shade in the appropriate probability.
#2 Convert to standard units: letting x denote a number in the original units and z a number in standard units,
$z = \frac{x - \bar{x}}{SD}$
where $\bar{x}$ is the mean and SD is the standard deviation.
#3 Determine the area under the normal curve using a table or computer.

Estimating probabilities: Example #1
Suppose we want to estimate the percent of women who are under 5 feet tall. 5 feet, or 60 inches, is 3.5/2.75 = 1.27 standard deviations below the mean. Using the normal distribution, the probability of falling more than 1.27 standard deviations below the mean is $P(z < -1.27) = 10.2\%$. In the actual sample, 10.6% of women were under 5 feet tall.

Estimating probabilities: Example #2
Another example: suppose we want to estimate the percent of women who are between 5'3" and 5'6" (63 and 66 inches). These heights are 0.18 standard deviations below the mean and 0.91 standard deviations above the mean, respectively. Using the normal distribution, the probability of falling in this region is 39.0%. In the actual data set, 38.8% of women are between 5'3" and 5'6".
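Both height examples follow the same convert-then-look-up procedure, which can be sketched in Python as follows. The mean (63.5) and standard deviation (2.75) are the NHANES values quoted earlier; the helper name `to_standard_units` is illustrative.

```python
from statistics import NormalDist

MEAN, SD = 63.5, 2.75  # NHANES adult women, height in inches
z = NormalDist()       # standard normal curve

def to_standard_units(x):
    """Convert a height in inches to standard units: (x - mean) / SD."""
    return (x - MEAN) / SD

# Example 1: percent of women under 5 feet (60 inches).
p_under_60 = z.cdf(to_standard_units(60))

# Example 2: percent of women between 63 and 66 inches.
p_63_to_66 = z.cdf(to_standard_units(66)) - z.cdf(to_standard_units(63))

print(round(to_standard_units(60), 2), f"{p_under_60:.1%}")
print(round(to_standard_units(63), 2), round(to_standard_units(66), 2), f"{p_63_to_66:.1%}")
```

The computer carries the standard units to more decimal places than a table allows, so its answers can differ slightly from table-based ones in later digits.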

Procedure: Percentiles using the normal curve
We can also use the normal distribution to approximate percentiles. The procedure for calculating percentiles with the normal approximation is as follows:
#1 Draw a picture of the normal curve and shade in the appropriate area under the curve.
#2 Determine the percentiles of the normal curve corresponding to the shaded region using a table or computer.
#3 Convert from standard units back to the original units:
$x = \bar{x} + z(SD)$
where, again, x is in original units, z is in standard units, $\bar{x}$ is the mean, and SD is the standard deviation.

Approximating percentiles: Example
Suppose instead that we wished to find the 75th percentile of these women's heights. For the normal distribution, 0.67 is the 75th percentile. The mean plus 0.67 standard deviations in height is 65.35 inches. For the actual data, the 75th percentile is 65.39 inches.
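The percentile procedure can be sketched the same way in Python: find the percentile in standard units, then convert back to inches. The mean and standard deviation are the NHANES values quoted earlier; the variable names are illustrative.

```python
from statistics import NormalDist

MEAN, SD = 63.5, 2.75  # NHANES adult women, height in inches

# Step 2: the 75th percentile of the normal curve, in standard units.
z_75 = NormalDist().inv_cdf(0.75)

# Step 3: convert back to original units, x = mean + z * SD.
height_75 = MEAN + z_75 * SD

print(round(z_75, 2), round(height_75, 2))
```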

The broad applicability of the normal approximation
These examples are by no means special: the distributions of many random variables are very closely approximated by the normal distribution. Indeed, this is why statisticians call it the normal distribution. Other names for the normal distribution include the Gaussian distribution (after its inventor) and the bell curve (after its shape). For variables with approximately normal distributions, the mean and standard deviation essentially tell us everything about the data; other summary statistics and graphics are redundant.

Caution
Other variables, however, are not well approximated by the normal distribution, and give misleading or nonsensical results when you apply the normal approximation to them. For example, the value 0 lies 1.63 standard deviations below the mean infant mortality rate for Europe. The normal approximation therefore predicts that 5.1% of the countries in Europe will have negative infant mortality rates.

Caution (cont'd)
As another example, the normal distribution will always predict the median to lie 0 standard deviations above the mean; i.e., it will always predict that the median equals the mean. As we have seen, however, the mean and median can differ greatly when distributions are skewed. For example, according to the U.S. Census Bureau, the mean income in the United States is $66,570, while the median income is $48,201.

Are there other distributions, for modeling skewed or otherwise abnormal data? Yes; statisticians have invented dozens and dozens of distributions. However, the binomial and normal distributions can be applied to an incredibly wide array of problems, and you will rarely need to be familiar with other distributions. One potential exception is the Poisson distribution, which is often used in epidemiology to model the occurrence of rare diseases.

The Poisson distribution
For example, the number of homicides committed in London on a given day is a discrete number. Suppose we wanted to predict the probability that there would be two homicides tomorrow, or that there would be fewer than a dozen in the next month. In principle, you could use the binomial distribution, but you'd need to know the number of people in London (a very large number) and the probability that a given person will be killed tomorrow (a very small number). We'll skip the details of exactly how the Poisson distribution works, but it provides a way to calculate the desired probabilities based only on the average number of deaths per day.
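For the curious, the following Python sketch illustrates the idea with the standard Poisson formula, $P(x = k) = e^{-\lambda} \lambda^k / k!$, where $\lambda$ is the average number of events. The average rate and population size below are made-up numbers chosen only for illustration; they are not the London figures, and the function names are illustrative.

```python
from math import comb, exp, factorial

def poisson_pmf(k, rate):
    """Probability of exactly k events when the average number of events is `rate`."""
    return exp(-rate) * rate**k / factorial(k)

def binomial_pmf(k, n, p):
    """Binomial formula: probability of k occurrences in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical values for illustration only (NOT taken from the London data).
rate = 0.5      # assumed average number of homicides per day
n = 7_000_000   # assumed number of people
p = rate / n    # implied chance that a given person is killed on a given day

p_two_poisson = poisson_pmf(2, rate)
p_two_binomial = binomial_pmf(2, n, p)

print(round(p_two_poisson, 4), round(p_two_binomial, 4))
```

With a very large n and a very small p, the two calculations agree closely, but the Poisson version needs only the average rate.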

The Poisson distribution in action
[Figure: expected versus observed number of occurrences, 2004 to 2007, by homicides per day (0, 1, 2, 3, 4, 5+).]