
2.5 Probability Distributions

The topics in this section are related and necessary topics for both course objectives.

A probability distribution indicates how the probabilities are distributed over the outcomes (sample points). There are infinitely many different distributions. However, as we learned in the last section, there exist only two kinds of probability distributions: discrete and continuous probability distributions. Discrete probability distributions are for discrete outcomes and continuous probability distributions are for continuous outcomes.

Discrete Distributions

Examples of discrete probability distributions are discrete uniform distributions, Bernoulli distributions, and binomial distributions.

A discrete uniform distribution is a probability distribution with equal probability for every possible discrete outcome. There is no restriction on the number of possible outcomes other than that there should be a finite number of them. If there are k outcomes, then each outcome has the same probability 1/k. All outcomes have the same probability, so they are uniform in terms of probability. Thus, the name uniform distribution is used for this kind of probability distribution. All the probabilities must be equal, but there is no restriction on the outcomes other than that they should be discrete (and only a finite number of them). They can be numerical or non-numerical. As a result, this distribution has a sample space of the form S = {s1, s2, ..., sk}. The notation for a discrete uniform distribution is Uniform{s1, s2, ..., sk}. Note that the sample space is used as part of the notation. If X is a discrete uniform variable, then we write X~Uniform{s1, s2, ..., sk}, which reads as "X is distributed as the discrete uniform distribution". Generally, the symbol tilde ~ means "distributed as" in Statistics.

An example of a discrete uniform variable is the number of dots up on a balanced die after it is rolled. Let X be the number of dots up on a balanced die. Then X~Uniform{1, 2, 3, 4, 5, 6}. See the diagram of the uniform probability distribution below.

[Diagram: bars of equal height 1/6 at each of the outcomes 1 through 6 along the X-axis.]

In a graph like this one, generally, the probabilities are indicated by vertical bars on the sample points along the X-axis. The length of a bar indicates the amount of probability (weight) for its sample point. Random variables in equally likely situations have uniform distributions. If you are interested in equally likely situations and their related topics such as combinations and permutations, please read the optional appendix to this chapter, Appendix: Counting & Probabilities (Optional).

A Bernoulli distribution is a probability distribution for only two possible outcomes. The probability distribution in the Coin Toss Example is an example of a Bernoulli distribution. The name Bernoulli is from the Swiss mathematician James Bernoulli, who established this probability distribution. For James Bernoulli, check out the following website. If the two outcomes of a Bernoulli distribution have equal probabilities (50% each), then it is also a discrete uniform distribution. There is no restriction on the outcomes other than that there should be two and only two outcomes. As a result, this distribution has a sample space of the form S = {s1, s2}. The notation for a Bernoulli distribution is Bernoulli(p), where p is the probability of one of the two outcomes (s1 or s2).
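To make the die example concrete, here is a minimal simulation sketch (it assumes Python with NumPy is available, which the course text does not require; the variable names are my own). It rolls a balanced die many times and checks that the relative frequency of each face settles near 1/6.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Simulate X ~ Uniform{1, 2, 3, 4, 5, 6}: many rolls of a balanced die.
rolls = rng.integers(low=1, high=7, size=60_000)

# Relative frequency of each face; each should be close to 1/6.
for face in range(1, 7):
    rel_freq = np.mean(rolls == face)
    print(f"face {face}: relative frequency {rel_freq:.4f} (theory 1/6 = {1/6:.4f})")
```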

Let X be a Bernoulli variable. Then we write X~Bernoulli(p). See the diagram of a Bernoulli probability distribution given below.

[Diagram: two bars along the X-axis, one of height p at s1 and one of height 1 - p at s2.]

As an example, if we let X be the number of T's on one toss of a crooked coin whose probability of landing T up is 66%, then X~Bernoulli(0.66). This X has a sample space of S = {0, 1}. See its probability distribution given below.

[Diagram: bars of height 0.34 at 0 and 0.66 at 1 along the X-axis.]
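As a hedged illustration of the crooked-coin example (counting T as 1, so X~Bernoulli(0.66)), the sketch below simulates many tosses and compares the observed proportion of 1's with p = 0.66; this is my own example, assuming NumPy is available.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
p = 0.66  # probability that the crooked coin lands T up

# X is 1 if the toss is T, 0 if it is H; simulate many tosses.
tosses = rng.random(50_000) < p      # True with probability p
x = tosses.astype(int)

print("relative frequency of 1:", x.mean())      # close to 0.66
print("relative frequency of 0:", 1 - x.mean())  # close to 0.34
```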

A binomial distribution is a probability distribution for outcomes that are whole numbers with some fixed maximum (whole) number. That is, a binomial distribution must be for outcomes of 0, 1, 2, ..., k, where k is the fixed maximum (whole) number; k is also the number of trials for the binomial distribution. The outcomes are typically the number of successes among k independent trials. If there is no success among the k independent trials, then it is 0. If there is one success among them, then it is 1. If they are all successes, then it is k. A binomial distribution gives the probability for each of those cases (each number of successes). As a result, this distribution has a sample space S = {0, 1, 2, ..., k}.

There are three conditions (besides a fixed number of trials) for a binomial distribution.

1. The k trials must all be independent of one another. That is, the success or failure of one trial does not affect the success or failure of the other trials.
2. Each trial must have only two possibilities, success and failure.
3. The probability of success must be the same for all the trials.

The success or failure of each trial can be anything. For instance, if the trial is a coin toss, then the success could be getting H, which means the failure is getting T. If the trial is rolling a die, then the success could be getting an ace or a 5, which means that the failure is getting 2, 3, 4, or 6. In fact, each of these trials is called a Bernoulli trial. If k = 1, then the binomial distribution is a Bernoulli distribution for the outcomes of 0 and 1. Of course, if the probabilities for 0 and 1 are equal to each other, then a Bernoulli distribution is a discrete uniform distribution. The binomial distribution is an extremely important discrete distribution. If k (the number of trials) increases under certain conditions, it becomes very close to a Normal distribution, which is perhaps the most important distribution.
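The sketch below (a sketch only; it assumes SciPy and NumPy are installed and is not part of the original text) builds a binomial count as the sum of k independent Bernoulli trials with the same success probability, exactly as the three conditions describe, and compares a simulated probability with the binomial formula for the Binomial(20, 0.3) example used later in this section.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(seed=3)
k, p = 20, 0.3          # 20 independent trials, success probability 0.3 each

# Each row is one run of k Bernoulli trials; the row sum is the number of successes.
trials = rng.random((100_000, k)) < p
successes = trials.sum(axis=1)       # values in {0, 1, ..., k}

# Compare the simulated probability of exactly 6 successes with the binomial pmf.
print("simulated P(X = 6):", np.mean(successes == 6))
print("Binomial(20, 0.3) pmf at 6:", binom.pmf(6, k, p))
```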

The notation for a binomial distribution is Binomial(k, p), where k is the number of trials and p is the probability of success on each trial. Let X be the number of successes among k independent trials with probability p of success on every trial. Then we write X~Binomial(k, p). The probability distribution of Binomial(20, 0.3) is given below. The horizontal axis is for X, as for all other distributions.

[Diagram: the probability distribution of Binomial(20, 0.3).]

This graph of the binomial distribution was once found at a website that computes probabilities and such for Binomial(k, p), where n is used for k. It also computes µ and σ of binomial distributions, which are the next topics. Note that the value of r must be an integer on that site.

Mean µ

Generally, the center of a probability distribution (for outcomes) is expressed by µ, because µ is the pivotal point for the outcomes and their probability distribution. When you have a bunch of weights (probabilities) on a horizontal rod (axis), the pivotal point is the point on the rod where all the weights balance. That is, it is the location where you can put a finger under the rod and balance the rod with all the weights. In other words, it is the gravitational center of all the weights along the rod. Here, the weights are probabilities and the rod is the horizontal axis (real number line). Namely, µ is the gravitational center of the probabilities along the horizontal axis.

For instance, if the outcomes 0 and 1 have a discrete uniform distribution, then a probability (weight) of 0.5 is at 0 on the real line (rod) and a probability (weight) of 0.5 is at 1 on the real line (rod). Where is the pivotal point on the real line (rod)? It must be 0.5, the midpoint of 0 and 1. It is the point where you can balance these weights (probabilities) with one finger. Suppose 0 and 1 have probabilities P({0}) = 0.3 and P({1}) = 0.7. Then µ cannot be 0.5 anymore since there is a heavier weight at 1. So the pivotal point µ must be closer to 1. In fact, it is 0.7. These two weights are balanced at 0.7.

[Diagram: two balance diagrams along the X-axis. Left: P({0}) = 0.5 and P({1}) = 0.5 balance at µ = 0.5. Right: P({0}) = 0.3 and P({1}) = 0.7 balance at µ = 0.7.]

This mean µ can be computed mathematically as

µ = Σ_{s ∈ S} s · P({s}),

which means that µ is the sum of all the products of the sample points s and their corresponding probabilities P({s}), taken over every sample point s in the sample space S. Simply put, µ is the sum of all the products of each outcome and its probability.

For instance, if 0 and 1 have a discrete uniform distribution, then µ is computed as 0*P({0}) + 1*P({1}) = 0*(0.5) + 1*(0.5) = 0.5. If 0 and 1 have a Bernoulli distribution with P({0}) = 0.3 and P({1}) = 0.7, then µ is computed as 0*P({0}) + 1*P({1}) = 0*(0.3) + 1*(0.7) = 0.7. If there are four outcomes of 2, 4, 6, and 12 with the discrete uniform distribution, that is, P({2}) = P({4}) = P({6}) = P({12}) = 1/4, then µ is computed as 2*(1/4) + 4*(1/4) + 6*(1/4) + 12*(1/4) = 24/4 = 6. If X~Uniform{1, 2, 3, 4, 5, 6}, then µ of this probability distribution is computed as

1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 3.5.

That is, µ = 3.5. Let us look at this point with a diagram of the probability distribution below.

[Diagram: bars of height 1/6 at each of 1 through 6 along the x-axis, balancing at µ = 3.5.]

Now, you see, by putting your finger at 3.5, you should be able to balance all the weights (probabilities) along the x-axis.

Notice that µ must be between the minimum and maximum outcomes (regardless of how their probabilities are distributed), that is, s(1) < µ < s(k), since it is a pivotal point on the real line with probabilities (weights) on the outcomes. So, for instance, if you come up with 8 for µ of the outcomes 3, 4, 5, 6, and 7, then you must know immediately that µ = 8 is incorrect.

Again, µ is estimated by the sample average (sample mean) x̄. You can see how this works with the example of the Bernoulli distribution with P({0}) = 0.3 and P({1}) = 0.7. The value of µ is computed as

µ = 0*P({0}) + 1*P({1}) = 0*(0.3) + 1*(0.7) = 0.7.   (1)

Suppose that data of size 10 are taken from this random system (more specifically, random variable) and that the data are {{1, 0, 0, 1, 1, 0, 1, 1, 1, 1}} with three 0's and seven 1's, since P({0}) = 0.3 and P({1}) = 0.7. The sample average is computed, using its formula, as

x̄ = (1 + 0 + 0 + 1 + 1 + 0 + 1 + 1 + 1 + 1)/(10)
  = (0 + 0 + 0)/(10) + (1 + 1 + 1 + 1 + 1 + 1 + 1)/(10)
  = 3(0)/(10) + 7(1)/(10)
  = 0*(3/10) + 1*(7/10)   (2)
  = 0*(0.3) + 1*(0.7) = 0.7,

which estimates µ = 0.7.
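A minimal code sketch of these two computations (assuming Python with NumPy; not part of the original course materials, and the variable names are mine) computes µ from the formula and then estimates it by x̄ from simulated Bernoulli(0.7) data.

```python
import numpy as np

# True mean from the formula: mu = sum of (outcome * probability).
outcomes = np.array([0, 1])
probs = np.array([0.3, 0.7])
mu = np.sum(outcomes * probs)
print("mu =", mu)                      # 0.7

# Estimate mu with the sample average x-bar from simulated data of size 10.
rng = np.random.default_rng(seed=4)
data = (rng.random(10) < 0.7).astype(int)
print("sample:", data)
print("x-bar =", data.mean())          # close to 0.7, not necessarily equal
```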

Compare (1) for µ and (2) for x̄. The 3/10 and 7/10 in (2) are the estimates for 0.3 = P({0}) and 0.7 = P({1}) in (1), respectively. I hope, comparing (1) and (2), you see why µ can be and should be estimated by x̄. This is true for any probability distribution. With this particular example, the estimates of the probabilities happen to be identical to the true probabilities. As a result, the estimate of µ happens to be identical to the true value of µ. With other data, they might not be identical to each other since, in practice, you might not get three 0's and seven 1's out of data of size 10. However, the numbers of 0's and 1's will be close to 3 and 7 if data of size 10 are taken. If data of size 100 are taken, then the number of 0's will be close to 30 and that of 1's will be close to 70. As a result, 0 is multiplied by a number close to 3/10 (= 30/100) and 1 is multiplied by a number close to 7/10 (= 70/100) in the computation of the sample average. This results in a good (close) estimate of µ by x̄.

Standard Deviation σ

The standard deviation σ indicates the variation (spread) of outcomes, if you remember. The probability distribution affects the value of σ as well. If the outcomes have more probability toward the center µ of the outcomes, then the value of σ is small. For instance, σ for outcomes 3, 4, 5, 6 with P({3}) = 0.1, P({4}) = 0.4, P({5}) = 0.4, and P({6}) = 0.1 is smaller than σ for the same outcomes with P({3}) = 0.4, P({4}) = 0.1, P({5}) = 0.1, and P({6}) = 0.4 because, for the first probability distribution, the outcomes 4 and 5, which are close to the center of the outcomes, have more probability.

[Diagrams: the two distributions on 3, 4, 5, 6. Left: P({3}) = 0.1, P({4}) = 0.4, P({5}) = 0.4, and P({6}) = 0.1; the same µ = 4.5, but σ for this distribution is smaller. Right: P({3}) = 0.4, P({4}) = 0.1, P({5}) = 0.1, and P({6}) = 0.4; the same µ = 4.5, but σ for this distribution is greater.]

This is reflected in data which come from these outcomes according to their probability distributions. Large data from the outcomes with the first probability distribution have a lot of observations around the center of the outcomes. They should have about 80% of the data with values 4 and 5, close to the center µ = 4.5. This produces data with less variation. On the other hand, large data from the same outcomes, with the second probability distribution, have a lot of observations away from the center of the outcomes. They should have about 80% of the data with values 3 and 6, away from the center µ = 4.5. This produces data with more variation.

When a probability distribution has more probability around the center of the outcomes, the distribution is said to be tight. This is clear from the formula of σ for a discrete distribution with k sample points,

σ = √( Σ_{i=1}^{k} (si − µ)² P({si}) ),

which, by the way, is very similar to the formula of s, and it should be so since σ is estimated by s. So, a tight probability distribution tends to have a small σ. This holds true for discrete and continuous distributions.
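As a check on this formula before the hand computation that follows, here is a short sketch (my own illustration in Python; not part of the original text) that computes σ for the two distributions on {3, 4, 5, 6} discussed above.

```python
import numpy as np

def discrete_sigma(outcomes, probs):
    """sigma = sqrt(sum over i of (s_i - mu)^2 * P({s_i}))."""
    outcomes = np.asarray(outcomes, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mu = np.sum(outcomes * probs)
    return np.sqrt(np.sum((outcomes - mu) ** 2 * probs))

s = [3, 4, 5, 6]
tight = [0.1, 0.4, 0.4, 0.1]    # more probability near the center
spread = [0.4, 0.1, 0.1, 0.4]   # more probability away from the center

print("sigma (tight):", round(discrete_sigma(s, tight), 3))    # about 0.806
print("sigma (spread):", round(discrete_sigma(s, spread), 3))  # about 1.360
```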

Let us have numerical examples for computing σ with the probability distribution of P({3}) = 0.1, P({4}) = 0.4, P({5}) = 0.4, and P({6}) = 0.1. We already have µ = 4.5. So,

σ = √( (3 − 4.5)²(0.1) + (4 − 4.5)²(0.4) + (5 − 4.5)²(0.4) + (6 − 4.5)²(0.1) ) = √0.65 ≈ 0.806

(keeping three places after the decimal). Let us compute σ with the probability function of P({3}) = 0.4, P({4}) = 0.1, P({5}) = 0.1, and P({6}) = 0.4.

σ = √( (3 − 4.5)²(0.4) + (4 − 4.5)²(0.1) + (5 − 4.5)²(0.1) + (6 − 4.5)²(0.4) ) = √1.85 ≈ 1.360

(keeping three places after the decimal). These two probability distributions have the same outcomes (sample points) and the same mean (4.5), but the first one, with more probability close to the center µ of the outcomes, has a smaller σ (0.806 against 1.360). In fact, the second probability distribution has more variation by about 0.55 (= 1.360 − 0.806), or about 69% (= 0.554/0.806) more variation than the first probability distribution. These two numerical examples illustrate the following point. As stated earlier, a tighter probability distribution (with more probability toward the center) has less variation and a small value of σ. In practice, we generally prefer a random system with less variation (one with a smaller σ).

Again, σ is estimated by the sample standard deviation s. You can see how this works with the example of the Bernoulli distribution with P({0}) = 0.3 and P({1}) = 0.7. For this, let us use the variance σ² and the sample variance s², which are the squares of the standard deviation σ and the sample standard deviation s, respectively. The reason for using the (sample) variance is that, with them, we do not carry square root signs, and the point to be made remains the same. In fact, the point is clearer without all the square root signs. The variance of the Bernoulli(0.7) distribution is

σ² = (0 − µ)²(0.3) + (1 − µ)²(0.7) = 0.21 (with µ = 0.7).   (3)

That is, σ = √0.21 ≈ 0.458 (keeping three places after the decimal).

Suppose that data of size 10 are taken from this random system (more specifically, random variable) and that the data are {{1, 0, 0, 1, 1, 0, 1, 1, 1, 1}} with three 0's and seven 1's, since P({0}) = 0.3 and P({1}) = 0.7. The sample variance is computed, using its formula, as

s² = ((1 − x̄)² + (0 − x̄)² + (0 − x̄)² + (1 − x̄)² + (1 − x̄)² + (0 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)²)/(10 − 1)
   = ((0 − x̄)² + (0 − x̄)² + (0 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)²)/(9)
   = ((0 − x̄)² + (0 − x̄)² + (0 − x̄)²)/(9) + ((1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)² + (1 − x̄)²)/(9)
   = (0 − x̄)²(3/9) + (1 − x̄)²(7/9)   (4)
   ≈ 0.233 with x̄ = 0.7 (keeping three places after the decimal).

Taking the square root, s ≈ 0.483 (keeping three places after the decimal).

Compare (3) for σ² and (4) for s². The µ in (3) is estimated by x̄ in (4), and the 3/9 and 7/9 in (4) are the estimates for 0.3 = P({0}) and 0.7 = P({1}) in (3), respectively. I hope, comparing (3) and (4), you see why σ² can be estimated by s², so that σ can be estimated by s. This is true for any probability distribution. Also, note that the estimates, 0.233 and 0.483, are close respectively to σ² = 0.21 and σ ≈ 0.458. If n = 10, instead of n − 1 = 9, were used with the data, the estimates would be identical to σ² = 0.21 and σ ≈ 0.458. Again, they are estimates and, depending on data, they are often close to but not exactly the same as their estimated parameters.
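The following sketch (an illustration of mine, assuming NumPy; not part of the original text) repeats the comparison of (3) and (4) in code: the true variance σ² of Bernoulli(0.7) and the sample variance s², with its n − 1 divisor, from the same ten observations.

```python
import numpy as np

# True variance of Bernoulli(0.7): sigma^2 = (0 - mu)^2 * 0.3 + (1 - mu)^2 * 0.7.
mu = 0.7
sigma2 = (0 - mu) ** 2 * 0.3 + (1 - mu) ** 2 * 0.7
print("sigma^2 =", round(sigma2, 3), " sigma =", round(sigma2 ** 0.5, 3))  # 0.21, 0.458

# Sample variance s^2 (divisor n - 1 = 9) from the data with three 0's and seven 1's.
data = np.array([1, 0, 0, 1, 1, 0, 1, 1, 1, 1])
s2 = np.var(data, ddof=1)      # ddof=1 gives the n - 1 divisor
print("s^2 =", round(s2, 3), " s =", round(np.sqrt(s2), 3))               # 0.233, 0.483
```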

Continuous Distributions

Generally, for continuous outcomes, probability is given by an area, just like the area of a bar (with a relative frequency for the height) in a histogram. The sample space of a continuous distribution is an interval (or intervals) of the real number line (it could be the entire real number line). All real numbers in the interval are possible outcomes (sample points). See the diagrams of a discrete probability distribution and a continuous probability distribution below.

[Diagrams: a discrete distribution with sample space S = {s1, s2, s3, s4, s5}, where each probability is given by the length (height) of a bar on an outcome, and the probabilities must all add up to one (100%); and a continuous distribution with sample space S = [a, b], where probability is given as an area between the curve and the horizontal axis, and the total area between the curve and the horizontal axis must be one (100%).]

With continuous sample points (outcomes), an event (a set) is given by an interval like [c, d] since you cannot list all the elements (sample points) in the set explicitly one by one to indicate the event (set). The probability for an event such as [c, d] is given as the (size of the) area over [c, d] between the curve and the horizontal axis. If a continuous random variable X is distributed over its sample space S = [a, b] as given above, then the probability for the event [c, d], P(c ≤ X ≤ d), is given as the area indicated in the diagram below.

[Diagrams: the discrete distribution with S = {s1, s2, s3, s4, s5} and the event A = {s3, s4}; and the continuous distribution with S = [a, b] and the event B = [c, d], whose probability is the shaded area over [c, d].]

With a discrete distribution like this one, the probability of an event is computed by adding the probabilities of all the sample points that constitute the event (set). For instance, if A = {s3, s4}, then P(A) = P({s3}) + P({s4}). With a continuous distribution like this one, the probability for an event (set) is given as the area over the event (set) between the curve and the horizontal axis. If B = [c, d], then P(B) is the area between the curve and the horizontal axis over [c, d]; that is, P(B) = P(c ≤ X ≤ d) is this area.

Note, with the continuous distribution given above, that P(−∞ < X ≤ d) = P(a ≤ X ≤ d), since P(−∞ < X ≤ d) = P(−∞ < X < a) + P(a ≤ X ≤ d) and P(−∞ < X < a) = 0. Can you tell me why P(−∞ < X < a) = 0? Similarly, P(c ≤ X < ∞) = P(c ≤ X ≤ b), since P(c ≤ X < ∞) = P(c ≤ X ≤ b) + P(b < X < ∞) and P(b < X < ∞) = 0. If you understand these, then it is obvious that P(−∞ < X < ∞) = P(a ≤ X ≤ b) = 1.

The computations of µ and σ for continuous probability distributions generally require integration and are beyond the scope of this book, although for some continuous distributions they can be computed without integration.

Nonetheless, µ and σ of continuous probability distributions indicate and mean the same things as those of discrete probability distributions. The mean µ of a continuous probability distribution is the point on the horizontal axis (line) where the entire area (probability) between the curve (or line) and the horizontal line balances. It is the pivotal point, or the gravitational center, of the distribution on the horizontal line. It indicates the center of the (continuous) probability distribution. It looks like µ is just a little bit left of d in the above diagram.

The standard deviation σ of a continuous probability distribution indicates how spread out the distribution is. A tighter distribution has a small value of σ, indicating small variation among outcomes. Even for the same sample space, if more probability is located away from the center (say µ), the value of σ for the distribution is larger. See the diagrams given below.

[Diagrams: Distributions A and B have the same µ but one has a wider sample space than the other; the one with the wider sample space has the larger σ. Distributions C and D have the same sample space and the same µ; σD < σC since Distribution D has more probability toward the center.]

Let us have four examples of continuous probability distributions.

A continuous uniform distribution is a probability distribution for continuous real-number outcomes in an interval, with equal probability for equal-length subintervals of the interval. The probability exists only in the interval, in an equal (uniform) manner, so that any subintervals of the interval with equal length have the same probability. For instance, if it has equal probability between 3 and 19, then any subinterval of, say, length 2 has the same probability of 12.5%. That is, when you take the next measurement, the chance of the measurement value being between 5 and 7 is 12.5%, and the chance of it being between 12 and 14 is also 12.5%. This distribution is denoted as Uniform[a, b], where [a, b] is the interval with equal (uniform) probability. As a result, this distribution has a sample space of the form S = [a, b], an interval on the real number line. The continuous uniform distribution discussed as an example in the last paragraph is denoted as Uniform[3, 19], with a = 3 and b = 19. Its sample space is S = [3, 19]. See the diagram given below.

[Diagram: X~Uniform[a, b], here X~Uniform[3, 19] with a = 3 and b = 19. The height of the rectangle must be 1/(b − a) = 1/(19 − 3) = 1/16 since the length of the rectangle is b − a = 19 − 3 = 16 and the total area must be one (100%). The area over [5, 7] is P(5 ≤ X ≤ 7) = 2*(1/16) = 1/8 = 0.125, and the area over [12, 14] is P(12 ≤ X ≤ 14) = 2*(1/16) = 1/8 = 0.125, the same probability. The mean is µ = 11.]

The mean µ is always the midpoint of the point where the probability starts and the point where the probability ends; that is, (a + b)/2.

In the example of the uniform distribution above, the mean is 11, µ = 11, which is computed as (3 + 19)/2.

Generally, the value of σ is smaller for a continuous uniform distribution with a shorter interval of probability. For instance, the example continuous uniform distribution, Uniform[3, 19], has a value of σ of about 4.6, while Uniform[56, 58] has a value of σ of about 0.58. Generally, a tighter uniform distribution has a smaller σ, which means less variation. This makes sense since the outcomes from Uniform[56, 58] have a maximum variation (without considering the probability) of only 2 (= 58 − 56), while the other one has a maximum variation (without considering the probability) of 16 (= 19 − 3).

A histogram for data from outcomes with a uniform distribution looks like a rectangle, with the tops of the bars all at about the same height. For the first example uniform distribution, a histogram looks like a rectangle that starts at 3 and ends at 19 with a height of 6.25% (in relative frequency). If relative frequencies are used, the total area of the rectangle should be close to 1, since the size of an area estimates a probability and the total probability (area) must be one.

A Normal distribution is a continuous distribution with a perfect bell shape with one mode (bump) right at µ, the center of the outcomes. The outcomes are all possible real numbers. Its sample space is the entire real number line. A bell shape means a perfectly symmetric, unimodal (one bump at the center) shape tapering off to the sides. It is the most prevalent and most commonly found distribution in nature. Hence, it is called a Normal distribution. It is also known as the Gaussian distribution (especially in physics), named after one of the two discoverers of Normal distributions, the German mathematician/physicist Johann Carl Friedrich Gauss. The other discoverer is the French/English mathematician Abraham de Moivre. A Normal distribution is denoted by N(µ, σ), where µ and σ are the mean and standard deviation of the Normal distribution, respectively, and they are parameters. The first position is for µ and the second position is for σ; an m comes before an s. See the graph of N(µ, σ) given below.

[Diagram: the Normal distribution N(µ, σ).]

The mean µ is the center of the outcomes and their probabilities (that is, the center of the Normal distribution, where the bump is located), and σ indicates the spread of the outcomes with their probabilities (thus, the spread of the Normal distribution). The smaller the value of σ is, the tighter the probability distribution is. That is, a Normal distribution with a smaller σ has a tighter bell shape. The graphs of four different Normal distributions are plotted in the diagram given below. They are graphs of N(0, 0.2), N(0, 1.0), N(0, 5.0), and N(-2, 0.5). As you can see, µ indicates the center of the distribution and σ indicates the spread of the distribution.

[Diagram: four Normal density curves; the green line is the standard Normal distribution.]

As you remember, [µ − 3σ, µ + 3σ] catches most of the outcomes and probability. In the case of a Normal distribution, this six-sigma interval catches 99.73% of the probability for the outcomes. If σ is small, then the interval is short, but it still catches 99.73% since the bell shape is tighter. The sample space of a Normal distribution is the entire real number line, but most of the probability is found in [µ − 3σ, µ + 3σ].

The standard Normal distribution is a Normal distribution with µ = 0 and σ = 1. That is, the standard Normal distribution is denoted by N(0, 1). It is customary to use Z for the standard Normal variable. That is, Z is a random variable with the standard Normal distribution (recall what a random variable is). In other words, Z has the standard Normal distribution, which means the outcomes of Z have the standard Normal distribution. We also write Z~N(0, 1), which stands for "Z is distributed as the standard Normal distribution".

If X~N(µ, σ), then P(µ - 3σ < X < µ + 3σ) = 0.9973, P(µ - 2σ < X < µ + 2σ) = 0.9545, and P(µ - σ < X < µ + σ) = 0.6827. In terms of the standard Normal distribution, P(-3 < Z < 3) = 0.9973, P(-2 < Z < 2) = 0.9545, and P(-1 < Z < 1) = 0.6827. See the graph of the standard Normal distribution given below.

[Graph: the standard Normal distribution. This diagram was found at a website that is no longer available.]

You do not have to memorize these numbers since Normal tables, computer Normal probability functions, and their inverse functions are all available. For instance, the following websites provide programs that find probabilities from given points (z values) and find points (z values) from given probabilities.
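In place of the websites, a hedged sketch using SciPy's standard Normal functions (assuming SciPy is available; this is my own illustration, not the course's tool) reproduces both directions: probabilities from z values with the cdf, and z values from probabilities with the inverse cdf (ppf).

```python
from scipy.stats import norm

# Probabilities from z values (what a standard Normal table gives).
print("P(-1 < Z < 1) =", norm.cdf(1) - norm.cdf(-1))    # about 0.6827
print("P(-2 < Z < 2) =", norm.cdf(2) - norm.cdf(-2))    # about 0.9545
print("P(-3 < Z < 3) =", norm.cdf(3) - norm.cdf(-3))    # about 0.9973

# z values from probabilities (the inverse direction).
print("z with P(Z < z) = 0.975:", norm.ppf(0.975))      # about 1.96
```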

The Normal standardization is used to change any Normal distribution to the standard Normal distribution by subtracting its µ and then dividing the difference by its σ. If X~N(µ, σ), then (X − µ)/σ has the standard Normal distribution; that is, (X − µ)/σ ~ N(0, 1) and, hence, Z = (X − µ)/σ. It is not difficult to remember. For a Normal distribution with µ and σ to be the standard Normal distribution, µ must be zero and σ must be 1. So, how do you get 0 from µ? By subtracting µ, of course. Now, σ must be 1. So, how do you get 1 from σ? By dividing by σ. You just apply this subtraction of µ and division by σ to a Normal random variable X with µ and σ, which is the Normal standardization. Also, these operations must be done in that order, since µ shows up in the first position and σ shows up in the second position in N(µ, σ).

Because of this Normal standardization, only the standard Normal table is necessary to find probabilities of any Normal distribution. That is, the standard Normal distribution is all that is needed to find any probability of any Normal distribution. For example, we can find the probability of X less than 5.2, where X is distributed as the Normal distribution with mean 1.9 and standard deviation 2.5, using the Normal standardization, as follows. First, the statement "the probability of X less than 5.2" needs to be translated into the probability statement P(X < 5.2) so that the Normal standardization can be performed. The Normal standardization gives the same probability in terms of the standard Normal variable Z as

P(X < 5.2) = P( (X − 1.9)/2.5 < (5.2 − 1.9)/2.5 ) = P(Z < 1.32).

That is, finding P(Z < 1.32) is finding the probability of X less than 5.2 where X~N(1.9, 2.5). You could use a standard Normal table to find this probability. However, in this course, we use internet websites to find this kind of probability. Go to the following website.

Click on the circle for the area on the left of the z value and enter 1.32 under z value. Then click on the button. This should give you 0.9066 under probability. That is, the probability of X less than 5.2 is 0.9066. Note that, if the probability to find were P(Z > 1.32), then click the circle for the area on the right of the z value and do the same. You would get 0.0934 under probability this time for P(Z > 1.32). By the way, this makes sense: 0.9066 + 0.0934 = 1, the total probability of one. You need to understand that the little z under the distribution is the z value, and the red area in the distribution is the probability on the left or on the right of the z value, which is determined by the inequality in the probability statement. For instance, the z value in the probability statement P(Z < 1.32) is 1.32 (that is, z = 1.32) and the probability is on the left of z = 1.32 (so the red area is on the left of z).

Here are some exercises with this website. Can you find P(Z > -0.67), P(Z < -1.38), P(2.14 < Z), and P(-1.44 > Z)? They are, respectively, 0.7486, 0.0838, 0.0162, and 0.0749. I hope you can get them.
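The same answers can be checked in code; this sketch (mine, assuming SciPy is available) standardizes X~N(1.9, 2.5) and evaluates the exercise probabilities with the standard Normal cdf.

```python
from scipy.stats import norm

# P(X < 5.2) for X ~ N(1.9, 2.5) via standardization: z = (5.2 - 1.9) / 2.5 = 1.32.
z = (5.2 - 1.9) / 2.5
print("P(X < 5.2) = P(Z < 1.32) =", round(norm.cdf(z), 4))   # about 0.9066
print("P(Z > 1.32) =", round(1 - norm.cdf(z), 4))            # about 0.0934

# Exercise probabilities.
print("P(Z > -0.67) =", round(1 - norm.cdf(-0.67), 4))       # about 0.7486
print("P(Z < -1.38) =", round(norm.cdf(-1.38), 4))           # about 0.0838
print("P(2.14 < Z)  =", round(1 - norm.cdf(2.14), 4))        # about 0.0162
print("P(-1.44 > Z) =", round(norm.cdf(-1.44), 4))           # about 0.0749
```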

By the way, the little z and "z value" are used with the distribution because it is the standard Normal distribution. You can also find these z values by using this website if probabilities are given. That is, you can find the z value z such that P(Z < z), P(Z > z), or P(z > Z) equals a given probability. I hope you can find them. Hint: What could the other button in the website be for?

The third continuous distribution is the Student's t distribution. This distribution was introduced, in 1908, by an applied mathematician, William Gosset, who was employed by Guinness Breweries in Ireland. A Student's t distribution is a probability distribution very similar to the standard Normal distribution but with more probability away from the center. Its sample space is the entire real number line, just like that of a Normal distribution. It has a symmetric bell shape centered at 0, like the standard Normal distribution. However, it has more probability away from the center (on "the tails"). That is, the tails of the bell are thicker, and the bell shape is not as tight as that of the standard Normal distribution. A Student's t distribution comes with the degree of freedom (this is the parameter for this distribution), and its notation is Student(m), where m > 0; this m is the degree of freedom. The smaller the degree of freedom is, the thicker the tails of the bell are (the more probability away from the center). That is, as the degree of freedom increases, the Student's t distribution has less probability away from the center and more probability toward the center. In fact, as the degree of freedom approaches infinity, the Student's t distribution approaches the standard Normal distribution. This is called the asymptotic Normality of the Student's t distribution, which is the relation between the standard Normal and Student's t distributions. See the graphs of five Student's t distributions that are plotted in the diagram given below. They are Student(1), Student(2), Student(5), Student(10), and Student(∞) = N(0, 1), which illustrates the asymptotic Normality of the Student's t distribution.

[Diagram: the five Student's t density curves.]

In fact, if the Student's t distribution's degree of freedom exceeds 30, it is very close to the standard Normal distribution. In practice, the standard Normal distribution is used in place of a Student's t distribution if its degree of freedom is greater than 30. A probability or a point (t value) can be found by giving one of them, along with the degree of freedom, at the following website.

A Chi-square distribution is a continuous probability distribution with the positive real numbers for its outcomes; that is, S = (0, ∞). A Chi-square distribution comes with the degree of freedom (this is the parameter for this distribution), and its notation is Chi-square(m), where m > 0; this m is the degree of freedom. Depending on the degree of freedom, it has a skewed bell shape starting at the origin, has one bump (unimodal), and tapers down (to the horizontal axis) toward ∞. However, this is not the case for a Chi-square distribution with a small degree of freedom.
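A brief sketch (assuming SciPy; my own illustration, not the course's website) shows the asymptotic Normality numerically: the Student's t cdf at a fixed point approaches the standard Normal cdf as the degree of freedom grows, and a Chi-square probability is computed the same way.

```python
from scipy.stats import t, norm, chi2

# Student's t approaches the standard Normal as the degree of freedom grows.
for df in (1, 2, 5, 10, 30, 1000):
    print(f"P(T with {df} df < 1.96) = {t.cdf(1.96, df):.4f}")
print(f"P(Z < 1.96)             = {norm.cdf(1.96):.4f}")

# A Chi-square probability, e.g. with 4 degrees of freedom.
print("P(Chi-square(4) < 9.49) =", round(chi2.cdf(9.49, 4), 4))   # about 0.95
```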

See the graphs of five Chi-square distributions that are plotted in the diagram given below. They are Chi-square(1), Chi-square(2), Chi-square(3), Chi-square(4), and Chi-square(5).

[Diagram: the five Chi-square density curves.]

Probabilities and points (Χ² values), along with the degree of freedom, can be found by giving one of them at the following website.

By the way, the following are the common notations for these continuous random variables: Z ~ N(0, 1), Tn ~ Student(n), and Χn² ~ Chi-square(n). By the asymptotic Normality of the Student's t distribution, Z = T∞.

Sampling Probability Distributions (Sampling Distributions)

The probability distribution of an estimator is called the sampling probability distribution, since an estimator cannot do its estimation without data which are taken from the objects sampled (sampling is involved). On the other hand, the probability distribution for outcomes is called the underlying probability distribution. That is, a sampling (probability) distribution is a probability distribution of an estimator, and an underlying (probability) distribution is a probability distribution for the outcomes of a random variable or system which is not an estimator. When the probability distributions of both outcomes and an estimator are discussed, it gets confusing, so "sampling" or "underlying" is attached to "probability distribution" to clearly indicate which probability distribution you are referring to. When only one kind of probability distribution is involved, you do not need to use "sampling" or "underlying" with the probability distribution.

Let us have an example of underlying and sampling probability distributions with a sample average. Suppose there are only three outcomes of 0, 4, and 8 with the equal probability of 1/3 for each outcome. That is, P({0}) = P({4}) = P({8}) = 1/3, which is the underlying probability distribution (with S = {0, 4, 8}) because these 1/3's are the probabilities for the outcomes (no estimator is involved at this point). By the way, it is a discrete uniform distribution. What is µ? It must be 4; that is, the true value of the parameter µ is 4 (µ = 4) for this underlying distribution. By the way, what is σ? It must be √(32/3) ≈ 3.266.

In practice, we do not know the true value of µ, but suppose that we need it to make a correct/good decision. Then it must be estimated by the sample average x̄. Let us take a sample of size 2 with replacement. Then we have nine possible samples of size 2, which are

{{0,0}}, 1/9, 0
{{0,4}}, 1/9, 2
{{4,0}}, 1/9, 2
{{0,8}}, 1/9, 4
{{8,0}}, 1/9, 4
{{4,4}}, 1/9, 4

{{4,8}}, 1/9, 6
{{8,4}}, 1/9, 6
{{8,8}}, 1/9, 8

The number given on the right of each sample (data) is the probability of getting that sample (data). Remember, in practice, you take only one sample, not all of them; this is a theoretical exercise to find the sampling distribution. The probability of each sample is determined by multiplying the chance of getting the first outcome for the sample (data), which is 1/3, and the probability of getting the second outcome for the sample (data), which is 1/3: (1/3)*(1/3) = 1/9. The last number given at the end of the row for each sample is the estimate for µ based on that sample (data) using the sample average.

The possible estimates for µ are 0, 2, 4, 6, 8. So this estimator, the sample average, has the probability distribution {P({0}) = 1/9, P({2}) = 2/9, P({4}) = 3/9, P({6}) = 2/9, P({8}) = 1/9}, which is the probability distribution of the estimator (the sample average) for µ. This probability distribution is a sampling distribution since it is the probability distribution of an estimator for µ. Note that this sampling probability distribution has a sample space of S = {0, 2, 4, 6, 8} while the underlying distribution has a sample space of S = {0, 4, 8}. The mean for both the underlying and sampling distributions is identical at 4, but the standard deviation of the sampling distribution, √(16/3), is less than that of the underlying distribution, √(32/3), because there is less probability on the end sample points of 0 and 8 (1/3 each for the underlying distribution versus 1/9 each for the sampling distribution). See the diagram given below.

[Diagrams: the underlying distribution of X~Uniform{0, 4, 8}, with probabilities 1/3, 1/3, 1/3 and σX = √(32/3), and the sampling distribution of x̄, with probabilities 1/9, 2/9, 3/9, 2/9, 1/9 and σx̄ = √(32/6) = √(16/3); both have mean 4 (µx̄ = µX = 4). There are more sample points between 0 and 8 for the sampling distribution than for the underlying distribution.]

When the sample size is large, say 30 or greater, the sampling distribution of a sample average is very close to a Normal distribution. Generally, the sampling probability distribution of a sample average approaches a Normal distribution, and this is true regardless of the underlying probability distribution (even if it is a discrete underlying probability distribution). Let us look at the last example. It started with a discrete probability distribution (underlying distribution) of three outcomes. It is a discrete uniform distribution, which is flat. However, the sampling probability distribution has more estimates (outcomes), increased to five from three. Also, the probability gets more concentrated toward the center, and the distribution is getting bell-shaped. This is with a sample size of only two. If the sample size is increased to 30, 50, 100, and 200, then the sampling distribution becomes a tighter bell shape, with more sample points, similar to the shape of a Normal distribution.
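The sampling distribution worked out above can be reproduced by enumerating all nine samples; the sketch below (my own, assuming Python with NumPy) does this and confirms µx̄ = 4 and σx̄ = √(16/3).

```python
import itertools
import numpy as np

outcomes = [0, 4, 8]                     # underlying Uniform{0, 4, 8}

# Enumerate every sample of size 2 drawn with replacement; each has probability 1/9.
samples = list(itertools.product(outcomes, repeat=2))
averages = np.array([np.mean(s) for s in samples])

# Sampling distribution of the sample average.
values, counts = np.unique(averages, return_counts=True)
for v, c in zip(values, counts):
    print(f"x-bar = {v}: probability {c}/9")

# Mean and standard deviation (standard error) of the sampling distribution.
print("mean of x-bar :", averages.mean())      # 4, the same as mu
print("sigma of x-bar:", averages.std())       # sqrt(16/3), about 2.309
print("sqrt(16/3)    :", np.sqrt(16 / 3))
```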

Also, notice that µ is 4 for both the underlying and sampling probability distributions. That is, µ does not change from the underlying probability distribution to the sampling probability distribution, which means that the sample average is an unbiased estimator for µ, since µx̄ = µ.

Central Limit Theorem

There is a theorem called the Central Limit Theorem that states that the sample average's sampling probability distribution is approximately a Normal distribution, N(µ, σ/√n), for a large sample size n, where µ and σ are those of the original underlying distribution and σ/√n is the standard deviation (standard error) of the sampling probability distribution. That is, the sampling probability distribution gets tighter by the factor of √n. So, as the sample size n increases, the sampling probability distribution gets tighter. The Normal standardization can be applied to the Central Limit Theorem:

(x̄ − µ)/(σ/√n)

is approximately distributed as the standard Normal distribution. This becomes very important in the next chapter. It should be noted that, if the underlying probability distribution is a Normal distribution, the sampling probability distribution of a sample average is exactly a Normal distribution, with a bell shape tighter by the factor of √n.

Let us look at this from the standpoint of the accuracy and precision of estimators. The sample average has the same µ as that of the original outcomes, and this original µ is what the sample average estimates. This means that the sample average is an unbiased estimator, and it is accurate. The standard deviation of the estimator is the standard error σ/√n. It represents the imprecision of the estimator. As the sample size increases, the standard error (imprecision) decreases, and the precision increases, as stated in a previous section.

That is, the sample average, as an estimator for µ, is accurate (unbiased), and its precision increases as the data get larger.

Copyrighted by Michael Greenwich, 03/


More information

Chapter 7. Sampling Distributions

Chapter 7. Sampling Distributions Chapter 7 Sampling Distributions Section 7.1 Sampling Distributions and the Central Limit Theorem Sampling Distributions Sampling distribution The probability distribution of a sample statistic. Formed

More information

Statistics (This summary is for chapters 17, 28, 29 and section G of chapter 19)

Statistics (This summary is for chapters 17, 28, 29 and section G of chapter 19) Statistics (This summary is for chapters 17, 28, 29 and section G of chapter 19) Mean, Median, Mode Mode: most common value Median: middle value (when the values are in order) Mean = total how many = x

More information

Statistics 431 Spring 2007 P. Shaman. Preliminaries

Statistics 431 Spring 2007 P. Shaman. Preliminaries Statistics 4 Spring 007 P. Shaman The Binomial Distribution Preliminaries A binomial experiment is defined by the following conditions: A sequence of n trials is conducted, with each trial having two possible

More information

2 DESCRIPTIVE STATISTICS

2 DESCRIPTIVE STATISTICS Chapter 2 Descriptive Statistics 47 2 DESCRIPTIVE STATISTICS Figure 2.1 When you have large amounts of data, you will need to organize it in a way that makes sense. These ballots from an election are rolled

More information

Chapter 5. Sampling Distributions

Chapter 5. Sampling Distributions Lecture notes, Lang Wu, UBC 1 Chapter 5. Sampling Distributions 5.1. Introduction In statistical inference, we attempt to estimate an unknown population characteristic, such as the population mean, µ,

More information

Discrete Random Variables

Discrete Random Variables Discrete Random Variables MATH 130, Elements of Statistics I J. Robert Buchanan Department of Mathematics Fall 2017 Objectives During this lesson we will learn to: distinguish between discrete and continuous

More information

MAS1403. Quantitative Methods for Business Management. Semester 1, Module leader: Dr. David Walshaw

MAS1403. Quantitative Methods for Business Management. Semester 1, Module leader: Dr. David Walshaw MAS1403 Quantitative Methods for Business Management Semester 1, 2018 2019 Module leader: Dr. David Walshaw Additional lecturers: Dr. James Waldren and Dr. Stuart Hall Announcements: Written assignment

More information

Discrete Random Variables

Discrete Random Variables Discrete Random Variables MATH 130, Elements of Statistics I J. Robert Buchanan Department of Mathematics Fall 2018 Objectives During this lesson we will learn to: distinguish between discrete and continuous

More information

Section Distributions of Random Variables

Section Distributions of Random Variables Section 8.1 - Distributions of Random Variables Definition: A random variable is a rule that assigns a number to each outcome of an experiment. Example 1: Suppose we toss a coin three times. Then we could

More information

Example - Let X be the number of boys in a 4 child family. Find the probability distribution table:

Example - Let X be the number of boys in a 4 child family. Find the probability distribution table: Chapter8 Probability Distributions and Statistics Section 8.1 Distributions of Random Variables tthe value of the result of the probability experiment is a RANDOM VARIABLE. Example - Let X be the number

More information

A useful modeling tricks.

A useful modeling tricks. .7 Joint models for more than two outcomes We saw that we could write joint models for a pair of variables by specifying the joint probabilities over all pairs of outcomes. In principal, we could do this

More information

A.REPRESENTATION OF DATA

A.REPRESENTATION OF DATA A.REPRESENTATION OF DATA (a) GRAPHS : PART I Q: Why do we need a graph paper? Ans: You need graph paper to draw: (i) Histogram (ii) Cumulative Frequency Curve (iii) Frequency Polygon (iv) Box-and-Whisker

More information

Discrete Probability Distributions

Discrete Probability Distributions Page 1 of 6 Discrete Probability Distributions In order to study inferential statistics, we need to combine the concepts from descriptive statistics and probability. This combination makes up the basics

More information

Chapter 7: SAMPLING DISTRIBUTIONS & POINT ESTIMATION OF PARAMETERS

Chapter 7: SAMPLING DISTRIBUTIONS & POINT ESTIMATION OF PARAMETERS Chapter 7: SAMPLING DISTRIBUTIONS & POINT ESTIMATION OF PARAMETERS Part 1: Introduction Sampling Distributions & the Central Limit Theorem Point Estimation & Estimators Sections 7-1 to 7-2 Sample data

More information

Lecture 6: Chapter 6

Lecture 6: Chapter 6 Lecture 6: Chapter 6 C C Moxley UAB Mathematics 3 October 16 6.1 Continuous Probability Distributions Last week, we discussed the binomial probability distribution, which was discrete. 6.1 Continuous Probability

More information

Module Tag PSY_P2_M 7. PAPER No.2: QUANTITATIVE METHODS MODULE No.7: NORMAL DISTRIBUTION

Module Tag PSY_P2_M 7. PAPER No.2: QUANTITATIVE METHODS MODULE No.7: NORMAL DISTRIBUTION Subject Paper No and Title Module No and Title Paper No.2: QUANTITATIVE METHODS Module No.7: NORMAL DISTRIBUTION Module Tag PSY_P2_M 7 TABLE OF CONTENTS 1. Learning Outcomes 2. Introduction 3. Properties

More information

STAT:2010 Statistical Methods and Computing. Using density curves to describe the distribution of values of a quantitative

STAT:2010 Statistical Methods and Computing. Using density curves to describe the distribution of values of a quantitative STAT:10 Statistical Methods and Computing Normal Distributions Lecture 4 Feb. 6, 17 Kate Cowles 374 SH, 335-0727 kate-cowles@uiowa.edu 1 2 Using density curves to describe the distribution of values of

More information

11.5: Normal Distributions

11.5: Normal Distributions 11.5: Normal Distributions 11.5.1 Up to now, we ve dealt with discrete random variables, variables that take on only a finite (or countably infinite we didn t do these) number of values. A continuous random

More information

Probability and distributions

Probability and distributions 2 Probability and distributions The concepts of randomness and probability are central to statistics. It is an empirical fact that most experiments and investigations are not perfectly reproducible. The

More information

Lecture 2. Probability Distributions Theophanis Tsandilas

Lecture 2. Probability Distributions Theophanis Tsandilas Lecture 2 Probability Distributions Theophanis Tsandilas Comment on measures of dispersion Why do common measures of dispersion (variance and standard deviation) use sums of squares: nx (x i ˆµ) 2 i=1

More information

Introduction to Statistics I

Introduction to Statistics I Introduction to Statistics I Keio University, Faculty of Economics Continuous random variables Simon Clinet (Keio University) Intro to Stats November 1, 2018 1 / 18 Definition (Continuous random variable)

More information

Chapter 7. Random Variables

Chapter 7. Random Variables Chapter 7 Random Variables Making quantifiable meaning out of categorical data Toss three coins. What does the sample space consist of? HHH, HHT, HTH, HTT, TTT, TTH, THT, THH In statistics, we are most

More information

Chapter 5. Statistical inference for Parametric Models

Chapter 5. Statistical inference for Parametric Models Chapter 5. Statistical inference for Parametric Models Outline Overview Parameter estimation Method of moments How good are method of moments estimates? Interval estimation Statistical Inference for Parametric

More information

Binomial Distribution. Normal Approximation to the Binomial

Binomial Distribution. Normal Approximation to the Binomial Binomial Distribution Normal Approximation to the Binomial /29 Homework Read Sec 6-6. Discussion Question pg 337 Do Ex 6-6 -4 2 /29 Objectives Objective: Use the normal approximation to calculate 3 /29

More information

Module 4: Probability

Module 4: Probability Module 4: Probability 1 / 22 Probability concepts in statistical inference Probability is a way of quantifying uncertainty associated with random events and is the basis for statistical inference. Inference

More information

TOPIC: PROBABILITY DISTRIBUTIONS

TOPIC: PROBABILITY DISTRIBUTIONS TOPIC: PROBABILITY DISTRIBUTIONS There are two types of random variables: A Discrete random variable can take on only specified, distinct values. A Continuous random variable can take on any value within

More information

CHAPTER 8 PROBABILITY DISTRIBUTIONS AND STATISTICS

CHAPTER 8 PROBABILITY DISTRIBUTIONS AND STATISTICS CHAPTER 8 PROBABILITY DISTRIBUTIONS AND STATISTICS 8.1 Distribution of Random Variables Random Variable Probability Distribution of Random Variables 8.2 Expected Value Mean Mean is the average value of

More information

Discrete Random Variables

Discrete Random Variables Discrete Random Variables In this chapter, we introduce a new concept that of a random variable or RV. A random variable is a model to help us describe the state of the world around us. Roughly, a RV can

More information

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION In Inferential Statistic, ESTIMATION (i) (ii) is called the True Population Mean and is called the True Population Proportion. You must also remember that are not the only population parameters. There

More information

Chapter 4 and 5 Note Guide: Probability Distributions

Chapter 4 and 5 Note Guide: Probability Distributions Chapter 4 and 5 Note Guide: Probability Distributions Probability Distributions for a Discrete Random Variable A discrete probability distribution function has two characteristics: Each probability is

More information

Chapter 4 Random Variables & Probability. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables

Chapter 4 Random Variables & Probability. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables Chapter 4.5, 6, 8 Probability for Continuous Random Variables Discrete vs. continuous random variables Examples of continuous distributions o Uniform o Exponential o Normal Recall: A random variable =

More information

In a binomial experiment of n trials, where p = probability of success and q = probability of failure. mean variance standard deviation

In a binomial experiment of n trials, where p = probability of success and q = probability of failure. mean variance standard deviation Name In a binomial experiment of n trials, where p = probability of success and q = probability of failure mean variance standard deviation µ = n p σ = n p q σ = n p q Notation X ~ B(n, p) The probability

More information

STAT 201 Chapter 6. Distribution

STAT 201 Chapter 6. Distribution STAT 201 Chapter 6 Distribution 1 Random Variable We know variable Random Variable: a numerical measurement of the outcome of a random phenomena Capital letter refer to the random variable Lower case letters

More information

Chapter 7 1. Random Variables

Chapter 7 1. Random Variables Chapter 7 1 Random Variables random variable numerical variable whose value depends on the outcome of a chance experiment - discrete if its possible values are isolated points on a number line - continuous

More information

3.1 Measures of Central Tendency

3.1 Measures of Central Tendency 3.1 Measures of Central Tendency n Summation Notation x i or x Sum observation on the variable that appears to the right of the summation symbol. Example 1 Suppose the variable x i is used to represent

More information

Statistics (This summary is for chapters 18, 29 and section H of chapter 19)

Statistics (This summary is for chapters 18, 29 and section H of chapter 19) Statistics (This summary is for chapters 18, 29 and section H of chapter 19) Mean, Median, Mode Mode: most common value Median: middle value (when the values are in order) Mean = total how many = x n =

More information

Probability Distribution Unit Review

Probability Distribution Unit Review Probability Distribution Unit Review Topics: Pascal's Triangle and Binomial Theorem Probability Distributions and Histograms Expected Values, Fair Games of chance Binomial Distributions Hypergeometric

More information

Econ 6900: Statistical Problems. Instructor: Yogesh Uppal

Econ 6900: Statistical Problems. Instructor: Yogesh Uppal Econ 6900: Statistical Problems Instructor: Yogesh Uppal Email: yuppal@ysu.edu Lecture Slides 4 Random Variables Probability Distributions Discrete Distributions Discrete Uniform Probability Distribution

More information

Probability is the tool used for anticipating what the distribution of data should look like under a given model.

Probability is the tool used for anticipating what the distribution of data should look like under a given model. AP Statistics NAME: Exam Review: Strand 3: Anticipating Patterns Date: Block: III. Anticipating Patterns: Exploring random phenomena using probability and simulation (20%-30%) Probability is the tool used

More information

Distributions in Excel

Distributions in Excel Distributions in Excel Functions Normal Inverse normal function Log normal Random Number Percentile functions Other distributions Probability Distributions A random variable is a numerical measure of the

More information

Chapter 7 Sampling Distributions and Point Estimation of Parameters

Chapter 7 Sampling Distributions and Point Estimation of Parameters Chapter 7 Sampling Distributions and Point Estimation of Parameters Part 1: Sampling Distributions, the Central Limit Theorem, Point Estimation & Estimators Sections 7-1 to 7-2 1 / 25 Statistical Inferences

More information

Sampling. Marc H. Mehlman University of New Haven. Marc Mehlman (University of New Haven) Sampling 1 / 20.

Sampling. Marc H. Mehlman University of New Haven. Marc Mehlman (University of New Haven) Sampling 1 / 20. Sampling Marc H. Mehlman marcmehlman@yahoo.com University of New Haven (University of New Haven) Sampling 1 / 20 Table of Contents 1 Sampling Distributions 2 Central Limit Theorem 3 Binomial Distribution

More information