Distributions

1. What are distributions?

When we look at a random variable, such as Y, one of the first things we want to know is: what is its distribution? In other words, if we have a large number of Y's, what kind of shape does the frequency histogram have? We talked about some of these shapes already (shapes of distributions, etc.).

The basic idea (simplified):

We take a sample and measure some random variable (e.g. blood oxygen levels of bats).
We look to see how this random variable is distributed.
Based on this distribution, we then make estimates and/or perform tests that might reveal interesting information about the population.

But how we proceed is based on how the random variable is distributed. Not only that, but many of our analyses and tests rely on particular kinds of distributions.

Why is this so important? Because the probability of getting a particular result depends on the distribution. For example, consider the two following distributions for the length of an insect:

[figure: two different distributions of insect length]

Obviously, the probability of our insect being less than 10 cm long depends a lot on the shape of the distribution.
So here are some examples of distributions:

If we toss a coin 25 times, and if Y = number of heads, then Y will have a binomial distribution. We write Y ~ Binomial. The ~ symbol means "distributed as". Often we put the parameters (more on this soon) of our distribution in parentheses after the type of distribution, for example: Y ~ Binom(25, 0.5). Notice we abbreviated the distribution (we usually do). What are 25 and 0.5? They are parameters (n = 25, and p = 0.5), but we'll save the details for another page or two.

If we measure heights of a sample of men on campus (Y = heights of men on campus), we can be pretty sure that Y will have a normal distribution. We write Y ~ N (the normal distribution is almost always abbreviated N).

2. The binomial distribution

We already used this distribution when we did probability. Here it is again:

$$\Pr\{Y = y\} = \binom{n}{y} p^{y} (1 - p)^{n - y}$$

From now on we will definitely be using y instead of j.

So what makes this a distribution? Because we can use it to calculate the probabilities of all possible outcomes and then see what the distribution of Y looks like.

Here's an example using our coin. We toss it 10 times and note that n = 10, p = 0.5 (these are the parameters of our distribution, but more soon).
We get:

    Heads   Tails   Probability
      10       0      0.00098
       9       1      0.00977
       8       2      0.04395
       7       3      0.11719
       6       4      0.20508
       5       5      0.24609
       4       6      0.20508
       3       7      0.11719
       2       8      0.04395
       1       9      0.00977
       0      10      0.00098
    Sum:              1.00000

(You should recognize some of these numbers.)

A summary like this can be very useful. For example, we can now easily calculate the probability that Y = 0, 1 or 2 (where Y = number of heads):

Pr{0 ≤ Y ≤ 2} = 0.00098 + 0.00977 + 0.04395 = 0.05470

Also notice that if we add up all the possible outcomes we get 1.0:

Pr{0 ≤ Y ≤ 10} = 1.0

This is important but ought to be obvious: if we toss a coin, something has to happen, and the above list is every single possibility!

If we want to see what the distribution of Y looks like we can plot it:

[figure: plot of the distribution of Y for n = 10, p = 0.5]
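The table above can be reproduced with a few lines of Python. This is only a minimal sketch and not part of the original notes: the helper name `binomial_pmf` is made up for illustration, and `math.comb` (Python 3.8 or later) supplies the binomial coefficient.

```python
from math import comb

def binomial_pmf(y, n, p):
    """Pr{Y = y} for Y ~ Binom(n, p); the helper name is illustrative only."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

n, p = 10, 0.5  # ten tosses of a fair coin, as in the notes

# Probability of each possible number of heads, listed from 10 down to 0.
table = {y: binomial_pmf(y, n, p) for y in range(n, -1, -1)}
for heads, prob in table.items():
    print(f"{heads:>5} {n - heads:>5} {prob:.5f}")

# Pr{0 <= Y <= 2}: add the probabilities for 0, 1 and 2 heads.
pr_0_to_2 = sum(table[y] for y in (0, 1, 2))

# All possible outcomes together must have probability 1.
total = sum(table.values())
print(f"Pr{{0 <= Y <= 2}} = {pr_0_to_2:.5f}, total = {total:.5f}")
```

The unrounded value of Pr{0 ≤ Y ≤ 2} is 0.0546875; the 0.05470 in the notes comes from adding table entries that were already rounded to five decimals.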
So what (finally) are parameters?

Parameters determine what our distribution looks like. For a random variable, Y, we need to know two things to figure out what the distribution of Y looks like:

1) What kind of distribution Y has.
2) What the parameters of this distribution are.

Since we're looking at the binomial distribution, let's change the parameters and see what happens to Y:

Instead of n = 10 and p = 0.5, let's use n = 3 and p = 0.2. Notice that now Y can go from 0 to 3. So let's again calculate all the probabilities for Y:

    Y     Probability
    0       0.512
    1       0.384
    2       0.096
    3       0.008
    Sum     1.000

And if we plot Y, this time our distribution looks rather different:

[figure: plot of the distribution of Y for n = 3, p = 0.2]
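To get a rough feel for how the parameters change the shape, the two binomial distributions discussed so far can be drawn as crude text bar charts. This is a sketch under stated assumptions: `binomial_pmf` and `text_plot` are hypothetical helper names, and the bar width of 50 characters is an arbitrary choice.

```python
from math import comb

def binomial_pmf(y, n, p):
    """Pr{Y = y} for Y ~ Binom(n, p); the helper name is illustrative only."""
    return comb(n, y) * p**y * (1 - p)**(n - y)

def text_plot(n, p, width=50):
    """Return one line per value of y, with a bar proportional to Pr{Y = y}."""
    lines = []
    for y in range(n + 1):
        prob = binomial_pmf(y, n, p)
        bar = "#" * round(prob * width)  # width is an arbitrary scale factor
        lines.append(f"{y:>2} {prob:.3f} {bar}")
    return lines

# Same kind of distribution, two different sets of parameters.
for n, p in [(10, 0.5), (3, 0.2)]:
    print(f"Y ~ Binom({n}, {p})")
    print("\n".join(text_plot(n, p)))
    print()
```

The first chart is symmetric around 5, while the second piles up at 0 and 1.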
(See also figure 3.13, p. 109 [3.15, p. 106] {3.6.4, p. 111} in your text using n = 5 and p = 0.39 {remember that the 4th edition uses p = 0.37}.)

Again, to emphasize this: the parameters determine what our particular distribution looks like!

3. About distributions in general:

We've learned several things about distributions:

1) The shape of a distribution can vary based on the parameters.

2) All possible outcomes must add up to one.

a) If Y is discrete, this is easy. For example, with the binomial what we are saying is:

$$\sum_{y=0}^{n} \binom{n}{y} p^{y} (1 - p)^{n - y} = 1$$

In other words, take all possible values of y, put them into the binomial distribution formula, add these up, and you'll get 1.00.

b) If Y is continuous, then the area under the curve formed by our distribution will add up to one.

How do we add up all possible outcomes if our distribution is continuous? We need calculus. Don't worry, you're not responsible for anything involving calculus. But what we're saying is:

$$\int_{-\infty}^{+\infty} (\text{continuous distribution of } y)\, dy = 1$$

Historical note: the symbol ∫ is an elongated "S", short for "sum" (Latin "summa"). In calculus we can add up a sequence of infinitely small things, which in this case must add up to one.

Let's use the normal distribution as an example.

4. The normal distribution

The importance of the normal distribution to statistics cannot be overemphasized. The Germans even put this on the old 10 DM bill! It is sometimes also known as the Gaussian distribution.
So what is it?

$$f(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^{2}}$$

Good! Now you know everything, right?

Seriously, here are a couple of examples from your text:

Example 4.2, p. 122-123 [120-121] {4.1.3, p. 122}:

We're looking at the thickness of eggshells from hens, and somehow we know that:

μ = 0.38 mm, and σ = 0.03 mm

This gives us the following picture (note the scale on the x-axis):

[figure: normal curve for eggshell thickness]

Example 4.4, p. 124 [p. 121] {4.1.4, p. 123}:

This time we're looking at the number of white blood cells per cubic mm, and again we somehow know that:

μ = 7,000 cells/mm³, and σ = 100 cells/mm³

[figure: normal curve for white blood cell count]

(By the way, are these data really normally distributed?)
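For readers who want to see the formula in action, here is a small Python sketch that evaluates the normal curve for the eggshell example and checks numerically that the area under it is very close to 1. The function names and the choice of a midpoint rule over μ ± 8σ are assumptions made for illustration; they are not from the text.

```python
from math import exp, pi, sqrt

def normal_pdf(y, mu, sigma):
    """Height of the normal curve f(y) for mean mu and standard deviation sigma."""
    z = (y - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

def area_under_curve(mu, sigma, lo, hi, steps=10_000):
    """Approximate the area under f(y) between lo and hi with the midpoint rule."""
    width = (hi - lo) / steps
    heights = (normal_pdf(lo + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return sum(heights) * width

mu, sigma = 0.38, 0.03  # eggshell thickness in mm, from Example 4.2

# The curve is highest at the mean.
peak = normal_pdf(mu, mu, sigma)

# Assumption: mu +/- 8 sigma stands in for the range -infinity to +infinity.
total_area = area_under_curve(mu, sigma, mu - 8 * sigma, mu + 8 * sigma)
print(f"f(mu) = {peak:.3f}, area = {total_area:.6f}")
```

Note that the height of the curve at the mean is larger than 1 here. That is fine: for a continuous distribution it is the area, not the height, that represents probability.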
Some comments on the normal distribution:

The curve peaks at the mean (μ).

The inflection points (where the curvature changes direction) are at μ ± σ. (See also fig. 4.6, p. 125 [122] {fig. 4.2.1, p. 124}.)

The parameters for the normal distribution are μ and σ. If I know what these are, I know what my normal distribution looks like.

The curve for the normal distribution actually goes from -∞ to +∞.

The area under the curve will add up to 1, or using calculus we can say:

$$\int_{-\infty}^{+\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^{2}}\, dy = 1$$

(Again, since this is calculus you are not responsible for the above equation.)

So now we know what the normal distribution is and what it looks like. Why is it so important?

1) Because many things, particularly in biology, have a normal, or approximately normal, distribution: heights, weights, IQ, many blood hormone levels, etc.
2) Because of something called the Central Limit Theorem. We'll get back to this. If you're really curious, see section 5.4 in your text. (Basically it implies that even if Y is not normally distributed, we can often still use a normal distribution in statistics. It is one of the most important theorems/results in statistics.)

5. The normal distribution and probability:

Since many things in biology (and elsewhere) have a normal distribution, we need to learn how to answer probability questions using the normal distribution.

For example, suppose Y = height of male basketball players, and we want to know:

Pr{Y < 6'}

Incidentally, notice that Pr{Y < 6'} = Pr{Y ≤ 6'}. Why?

What we're asking is: what's the probability a male basketball player is less than 6 feet tall?

If you know calculus, then you might think we could do:

$$\int_{-\infty}^{6} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^{2}}\, dy$$

Unfortunately this integral has no simple formula, so this doesn't work except for a few special values of y (notice also that we need to know μ and σ).

Instead, we need to use normal distribution tables that list probabilities. If we know μ and σ, we find a table for those values of μ and σ, and then find Pr{Y < 6'}.

The obvious problem is that we would need an infinite number of normal tables, one for every possible combination of μ and σ. This is obviously impossible, so we need to do something else.

The standard normal distribution.

Instead, we use one normal distribution to calculate all our probabilities. This is called the standard normal distribution and has:

μ = 0, and σ = 1 (= σ²)
Here's how we use this distribution:

Subtract the mean from the distribution you're studying. This will obviously give you μ = 0.

Divide by the standard deviation of the distribution you're studying. This will give you σ = 1.

We call this new number Z, for "z-score". We say Z ~ N(0,1).

Here's the formula:

$$Z = \frac{Y - \mu}{\sigma}$$

So if we use Z instead of our original Y, we only need to list our areas in one table and then use the standard normal (or sometimes "z") curve.

Comment: usually we let a computer (or fancy calculator) spit out the answer.
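As a hedged illustration of what "letting a computer spit out the answer" can look like, the sketch below standardizes a value and then evaluates the standard normal area with `math.erf` from Python's standard library, using the identity Pr{Z < z} = ½(1 + erf(z/√2)). The function names are made up, and the question Pr{Y < 0.41 mm} for the eggshell example is not from the text; it is chosen only because it gives a round z-score.

```python
from math import erf, sqrt

def z_score(y, mu, sigma):
    """Standardize y: subtract the mean, then divide by the standard deviation."""
    return (y - mu) / sigma

def standard_normal_cdf(z):
    """Pr{Z < z} for Z ~ N(0, 1), the quantity a z-table lists."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_cdf(y, mu, sigma):
    """Pr{Y < y} for Y ~ N(mu, sigma), computed by converting Y to Z first."""
    return standard_normal_cdf(z_score(y, mu, sigma))

mu, sigma = 0.38, 0.03  # eggshell thickness in mm (Example 4.2)

# Illustrative question (not from the text): Pr{Y < 0.41 mm}.
z = z_score(0.41, mu, sigma)
print(f"z = {z:.2f}, Pr{{Y < 0.41}} = {normal_cdf(0.41, mu, sigma):.4f}")
```

Because every normal distribution is converted to Z before the area is computed, the single function `standard_normal_cdf` plays the role of the single table.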
So here's how we can calculate some probabilities using the standard normal (or z) curve/tables:

Pr{Z > 1.53}:

Let's look at what we want first (it's always a good idea to sketch/draw pictures of what you want):

[figure: standard normal curve with the area above 1.53 shaded]

Table 3 in your text will give you the area below a particular value of Z.

Go to table 3 in your text.
Find 1.53: read 1.5 off the column on the left side going down, and read the .03 off the top row going across.
Now read across and down until these two values (1.5 and .03 in our example) intersect, and write down that number.
This is the area of the normal curve that is below 1.53. You should see 0.9370.

So we can write Pr{Z < 1.53} = 0.9370.

But we want the area above 1.53. We remember that the total area under the curve = 1.0, so we can do:

1 - 0.9370 = 0.0630

And finally we can say: Pr{Z > 1.53} = 0.0630
Comment: since the standard normal distribution is symmetrical around 0, you could also do the following:

Change the sign of the value we're interested in: instead of 1.53, use -1.53.
Now we can just look up Pr{Z < -1.53} and we get 0.0630.

This is a little bit of a shortcut. If it's confusing, don't worry about it and stick to the method presented above.

Let's try Pr{-1.2 < Z < 0.8}:

Again, let's look at our area first:

[figure: standard normal curve with the area between -1.2 and 0.8 shaded]

Look up the values in the z-table for 0.8 and -1.2:

Pr{Z < -1.2} = 0.1151
Pr{Z < 0.8} = 0.7881

And since we want the area between these two z-values, we can subtract one from the other:

Pr{-1.2 < Z < 0.8} = Pr{Z < 0.8} - Pr{Z < -1.2} = 0.7881 - 0.1151 = 0.6730

Look in your text on p. 127 [125] {126} for this example.
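The table lookups for Pr{Z > 1.53} and Pr{-1.2 < Z < 0.8} can be cross-checked in Python. This is a sketch only: `phi` is a made-up name for the area below z, computed with `math.erf` rather than read from Table 3, so the last digit can differ slightly from the table because the table entries are rounded.

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve below z, i.e. Pr{Z < z}."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Pr{Z > 1.53}: total area is 1, so subtract the area below 1.53.
upper_tail = 1 - phi(1.53)

# Symmetry shortcut: the area above 1.53 equals the area below -1.53.
by_symmetry = phi(-1.53)

# Pr{-1.2 < Z < 0.8}: area below 0.8 minus area below -1.2.
between = phi(0.8) - phi(-1.2)

print(f"Pr{{Z > 1.53}}       = {upper_tail:.4f}")
print(f"Pr{{Z < -1.53}}      = {by_symmetry:.4f}")
print(f"Pr{{-1.2 < Z < 0.8}} = {between:.4f}")
```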
But of course, we usually deal with Y, not Z. So let's do a practical example.

Exercise 4.3, p. 133 [131] (we're only doing select parts of the exercise):

For brain weights of Swedish men, we somehow know that μ = 1,400 gm, and σ = 100 gm.

a) Find the probability that a (random) brain is 1,500 gm or less (note that your text asks the question just a little differently, but it works out the same):

Pr{Y < 1,500}:

Convert to Z:

$$Z = \frac{1500 - 1400}{100} = 1$$

(very convenient!)

Note that Pr{Y < 1,500} ≡ Pr{Z < 1.0}. (The symbol ≡ means "exactly equivalent to".)

Before we go on, let's look at some pictures:

[figures: the area below 1,500 under the Y curve, and the area below 1.0 under the Z curve]

Notice that the areas are identical.

Look up 1.00 in table 3 and get 0.8413.

So Pr{Y < 1,500} = Pr{Z < 1.0} = 0.8413

c) Find the probability that a brain is 1,325 gm or more:

Pr{Y > 1,325}:

$$Z = \frac{1325 - 1400}{100} = -0.75$$

Again, remember that Pr{Y > 1,325} ≡ Pr{Z > -0.75}
Just one picture this time:

[figure: standard normal curve with the area above -0.75 shaded gray]

And, of course, we want the area in gray.

Look up -0.75 in table 3 and get 0.2266.

Remember that this time we need to subtract this result from 1:

So Pr{Y > 1,325} = Pr{Z > -0.75} = 1 - 0.2266 = 0.7734

f) (last one) Find the probability that a brain is between 1,200 and 1,325 gm:

Pr{1,200 < Y < 1,325}:

This time we need two values of Z:

$$Z_1 = \frac{1200 - 1400}{100} = -2.0 \qquad Z_2 = \frac{1325 - 1400}{100} = -0.75$$

Look up Z₁ to get 0.0228.
Look up Z₂ to get 0.2266.

So we have:

Pr{1,200 < Y < 1,325} = Pr{-2.0 < Z < -0.75} = 0.2266 - 0.0228 = 0.2038
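The three parts of the brain-weight exercise can be double-checked with a short Python sketch. The helper `normal_cdf` is a hypothetical name; it converts Y to Z and then uses `math.erf` in place of Table 3, so results agree with the table values only up to the table's rounding.

```python
from math import erf, sqrt

def normal_cdf(y, mu, sigma):
    """Pr{Y < y} for Y ~ N(mu, sigma): convert to Z, then take the area below Z."""
    z = (y - mu) / sigma
    return 0.5 * (1 + erf(z / sqrt(2)))

mu, sigma = 1400, 100  # brain weight in gm, from Exercise 4.3

# a) Pr{Y < 1,500}: area below Z = 1.0
pr_a = normal_cdf(1500, mu, sigma)

# c) Pr{Y > 1,325}: one minus the area below Z = -0.75
pr_c = 1 - normal_cdf(1325, mu, sigma)

# f) Pr{1,200 < Y < 1,325}: area below Z = -0.75 minus area below Z = -2.0
pr_f = normal_cdf(1325, mu, sigma) - normal_cdf(1200, mu, sigma)

print(f"a) {pr_a:.4f}  c) {pr_c:.4f}  f) {pr_f:.4f}")
```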