3. Probability Distributions and Sampling


CE8: Quantitative Research Methods, Ian Rudy

3.1 Introduction: the US Presidential Race

Appendix 2 shows a page from the Gallup WWW site. As you probably know, Gallup is an opinion poll company. The page reports the results of a poll taken during a period in October. It shows 46% of likely voters supporting George Bush, rather than supporting one of the three other candidates (Al Gore, Ralph Nader and Pat Buchanan) or being undecided about how they will eventually vote. The poll was based on a sample of 769 people likely to vote, who were contacted by telephone. It is typical of polls carried out in the run-up to an election, or of a survey of customer opinions of a product.

Such polls are very useful, but we must be careful how we interpret them, since they always contain an error. This is especially important when they claim to show small changes in the underlying popularity of some person or item, or small differences between the popularity of people or items.

The page notes the existence of the error. Towards the bottom of the page it says "one can say with 95% confidence that the margin of sampling error is +/- 4 percentage points". Our aim is to understand:

- why the error occurs
- what the phrase "95% confidence" means
- how this value of 4 percentage points is calculated
- how it depends on the sample size used in the poll

Though it is not our main topic here, you may also be interested in the detail of how Gallup carry out their polls, and can read about it in Appendix 1, which is also taken from the Gallup WWW site.

3.2 The Unreliability of a Sample

Whenever we use a relatively small sample of people to infer something about a very large number of people, we will make an error. To see this, imagine that we could repeat the poll with a different set of 769 people. We take care that the second set is chosen in a similar way to the first, so there is no reason why it should produce a different result on average. But in any single case it almost certainly will. This is because the randomness in how a randomly chosen person votes is only partly averaged out by the sample.

We can discover how much the percentage of people voting for George Bush varies from poll to poll, even if there is no underlying change in his popularity, by simulating polls using a computer. We will assume that the percentage of voters in the population who will vote for George Bush is 46%. If we simulate 100 polls, each involving 769 people, we find that the percentage of people in each sample who say they will vote for him is close to 46%, but fluctuates around it. We can draw a histogram to show this:

[Figure 3.1: Histogram of the Percentage of People Voting for Bush in Polls of 769 People. Horizontal axis: Percent Voting Bush in a Sample Poll. Source: simulated by computer, assuming 46% of the population will vote Bush.]

The half-width of the histogram is about 5 percentage points. This is close to the value of "+/- 4 percentage points" that Gallup mentioned. So we can understand why the error occurs: we cannot predict in advance how a randomly chosen individual votes, and that randomness is only partly averaged out when we average over a sample of 769 people. The only way to average out the randomness completely is to take a very, very large sample: in fact, we need to take the entire population. But we can see from our graph what spread of poll results is likely, and from this get a rough idea of how large the error in a single poll could be.

3.3 How the Unreliability of a Sample Depends on Its Size

It is interesting to ask how the width of the histogram depends on the size of the poll. We can simulate polls of any size using the computer, and if we do so, we get the results shown in Figure 3.2. The obvious point from these graphs is that they get narrower as the sample size increases. This is quite reasonable: the randomness in individuals is more effectively averaged out as the sample size increases.

What is less obvious is how the graphs get narrower. It turns out that the width of the graph is approximately proportional to $1/\sqrt{n}$, where n is the sample size. For example, quadrupling the sample size only halves the width. This can be proven mathematically: we do not need to know how, but we do need to know that it is true. It is an immensely valuable rule, since it allows us to predict the width of the graph without having to carry out many polls, or do the simulations with a computer.
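The simulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used to produce the figures: the function names `simulate_poll` and `poll_spread` are invented here, each voter is modelled as an independent random draw with a 46% chance of supporting Bush, and the "width" of the histogram is measured by the standard deviation of the poll results (a quantity these notes introduce later).

```python
import random
import statistics


def simulate_poll(n, p, rng):
    """Return the percentage of n simulated voters who support the candidate."""
    supporters = sum(1 for _ in range(n) if rng.random() < p)
    return 100.0 * supporters / n


def poll_spread(n, p=0.46, polls=100, seed=1):
    """Standard deviation (in percentage points) of repeated polls of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    results = [simulate_poll(n, p, rng) for _ in range(polls)]
    return statistics.pstdev(results)


if __name__ == "__main__":
    for n in (100, 769, 3000):
        spread = poll_spread(n, polls=400)
        # If the width is proportional to 1/sqrt(n), the last column
        # should stay roughly constant as n changes.
        print(f"n={n:5d}  spread={spread:5.2f}  spread*sqrt(n)={spread * n ** 0.5:6.1f}")
```

If the rule holds, the product of the spread and the square root of n should come out roughly the same for every sample size, up to the random fluctuation of the simulation itself.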


3.4 The Shape of the Graph

It is also very useful to know what shape the graph has. The shapes of the graphs above are rather irregular. This is because we had only 100 polls. If we have a larger number, then the curve becomes smoother. Figure 3.3 shows the results of 5000 polls, with 769 people per poll.

[Figure 3.3: Histogram of the Fraction of People Voting for Bush in 5000 Polls of 769 People. Horizontal axis: Fraction Voting Bush in a Sample Poll. Source: simulated by computer, assuming 46% of the population will vote Bush.]

A problem with drawing these histograms with frequencies on the y-axis is that the values depend on the number of polls we ran. If we are only interested in the shape of the histogram, we should do something to fix this. It is conventional to rescale the height of each of the bars in such a way that the area under the graph becomes equal to 1.0. We then get what is known as a probability distribution: see Figure 3.4.

[Figure 3.4: Probability Distribution of the Percentage of People Voting for Bush in Polls of 769 People. Horizontal axis: Fraction Voting Bush in a Sample Poll. Source: simulated by computer, assuming 46% of the population will vote Bush.]
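The rescaling that turns a frequency histogram into a probability distribution can be sketched as follows. This is a minimal illustration, not the method used to draw the figures: the helper names `simulate_fractions` and `density_histogram`, the bin range 0.38 to 0.54 and the choice of 16 bins are all assumptions made for the example.

```python
import random


def simulate_fractions(polls, n, p, seed=0):
    """Simulate `polls` polls of n people; return the supporting fraction in each."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(polls)]


def density_histogram(values, bins, lo, hi):
    """Return (heights, width) with bar heights scaled so the total area is 1.0.

    Values outside [lo, hi) are ignored.
    """
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v < hi:
            index = min(bins - 1, int((v - lo) / width))
            counts[index] += 1
    total = sum(counts)
    # Dividing by (total * width) makes sum(height * width) equal to 1.
    heights = [c / (total * width) for c in counts]
    return heights, width


if __name__ == "__main__":
    fractions = simulate_fractions(polls=1000, n=769, p=0.46)
    heights, width = density_histogram(fractions, bins=16, lo=0.38, hi=0.54)
    print("total area:", sum(h * width for h in heights))
    # Area of a run of bars = share of polls falling in that range.
    print("share of polls between 0.44 and 0.48:",
          sum(h * width for h in heights[6:10]))
```

Because the heights are divided by the number of polls as well as the bar width, running 1000 polls or 5000 polls gives bars of about the same height, which is exactly the property wanted when only the shape matters.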

The reason it is known as a probability distribution is that the area of any one bar is equal to the probability that a randomly chosen poll lies within the range of the bar. The total area of any successive set of bars is equal to the probability that a randomly chosen poll lies within the range of those bars. Although this doesn't sound an especially useful property, it turns out that it is.

The vertical axis no longer represents a frequency. Technically, it is called a probability density, but you do not need to worry about this.

Note that it is common for people to talk about a "distribution", regardless of whether the area underneath it is equal to 1.0. You will also hear people talk of "normalising" the distribution. This just means altering the heights of the bars so that the total area underneath is equal to 1.0.

It turns out that the shape of our probability distribution is approximately the same as that of an idealised distribution, called the Normal distribution. (The word Normal has nothing to do with the process of normalising the distribution, described above.) Figure 3.5 shows the appropriate Normal distribution for this case superimposed on our distribution of poll results. The Normal distribution has a smooth bell shape, and is symmetrical about its middle.

[Figure 3.5: Probability Distribution of the Percentage of People Voting for Bush in Polls of 769 People, Together with Normal Distribution. Horizontal axis: Fraction Voting Bush in a Sample Poll.]

In fact, the distribution of the poll results would always tend towards a Normal distribution as long as we do enough polls, and as long as the sample size per poll is above about 30. This is why the Normal distribution is so-called: it is what "normally" happens. The Normal distribution is also a reasonable approximation when the sample size per poll is less than 30 but (if we are being fussy) the shape is slightly different. It is better described by what is known as a Binomial distribution. You can read about the Binomial distribution in Clare Morris' book if you wish (it is not part of this Quantitative Research Methods course itself).
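For readers who want to see how close the Normal approximation is, the sketch below compares the exact Binomial probability that a poll lands in a given range with the corresponding area under a Normal curve. Several things here are assumptions made for the illustration: the function names, the choice of range (44% to 48% for a poll of 769, and 40% to 50% for a poll of 20), the Normal curve's mean n·p and standard deviation sqrt(n·p·(1-p)) measured in numbers of voters (the same formula as the one given in the next part of these notes, rescaled from percentages to counts), and the half-vote "continuity correction" at each end of the range.

```python
import math


def binomial_prob(n, p, k_lo, k_hi):
    """Exact probability that the number of supporters is in k_lo..k_hi inclusive."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k_lo, k_hi + 1))


def normal_cdf(x, mean, sd):
    """Area under a Normal curve to the left of x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def normal_prob(n, p, k_lo, k_hi):
    """Normal approximation to binomial_prob, with a half-vote continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return normal_cdf(k_hi + 0.5, mean, sd) - normal_cdf(k_lo - 0.5, mean, sd)


if __name__ == "__main__":
    # Poll of 769 people: between 339 and 369 supporters is roughly 44% to 48%.
    print("n=769 exact :", binomial_prob(769, 0.46, 339, 369))
    print("n=769 Normal:", normal_prob(769, 0.46, 339, 369))
    # Poll of 20 people: between 8 and 10 supporters is 40% to 50%.
    print("n=20  exact :", binomial_prob(20, 0.46, 8, 10))
    print("n=20  Normal:", normal_prob(20, 0.46, 8, 10))
```

Comparing the two pairs of numbers gives a feel for how the agreement changes between a large poll and a very small one.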

So we have now discovered that the distribution of the US Poll samples follows a Normal distribution. This is incredibly useful, since we can find anything we want to know about the Normal distribution either from Tables or from a spreadsheet program such as Microsoft Excel. We therefore do not in real life have to repeat the poll 5000 times, or do the simulations by computer.

We do, however, need to know how to specify the "half-width" in more precise terms than we have done so far. In practice, people do this by specifying a quantity known as the standard deviation. One standard deviation on either side of the mean accounts for about 68% of the area under the curve; two (more precisely, 1.96) standard deviations on either side of the mean account for 95% of the area under the curve.

In the case of the US Poll example, the formula for the standard deviation of the samples is:

$$\sqrt{\frac{P(100-P)}{n}}$$

where P is the underlying popularity expressed as a percentage and n is the sample size. In general, the standard deviation of samples is called the Standard Error, and Clare Morris uses the term STEP (STandard Error of Percentages) for the particular kind of standard error given by the above formula.

3.5 The Error in Polls

We can now sum up what we have found. Let us denote the underlying popularity in the population of a person or item (expressed as a percentage of people who would support/buy the person or item) by P. Then the popularity of the person or item in a single poll of n people (assuming n to be more than about 30) will be distributed according to a Normal distribution, with:

mean = P
standard deviation (or STEP) = $\sqrt{P(100-P)/n}$

In the case above, George Bush's popularity was estimated to be 46%, so P=46. The poll size, n, was 769. We find that the STEP is $\sqrt{46(100-46)/769} \approx 1.80$ percentage points.

To find the error in a poll, we use the formula:

error = Z × STEP

Z is a number we choose: if we take Z=1.96, we cover 95% of the area under the Normal Distribution. (You might find it helps to remember that Z is actually just a number of standard deviations, measured away from the centre of the Normal Distribution. Increasing Z increases the extent to which we cover the whole area under the curve.) Hence we would expect 95% of poll results to be within ±1.96(1.80) = ±3.53 percentage points of ours. The largest possible value of this error occurs when P=50 (when the underlying popularity of the person is 50%), and our value of P=46 is so close to this as to make the difference negligible. This is the sense in which Gallup said "+/- 4 percentage points".

3.6 A More General Note About Distributions

The Normal Distribution is only one of many theoretical distributions that arise when we analyse data. You do not need to know about many of these distributions, but you should be aware of how people describe distributions in general. They usually specify three things:

the Location of the Distribution

In simple terms, this just means where on the horizontal axis the "middle" (in some rough sense) of the distribution is. We can calculate this for the set of poll results by taking the mean (that is, the average) of the polls. Say we do N polls, and the i-th poll gives Bush a fraction of the vote equal to $x_i$; then the mean is just:

$$\text{mean} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

The Σ sign just means a sum, that is, $x_1 + x_2 + \dots + x_N$.

the Spread (or "width") of the Distribution

The spread is an indication of how wide the graph is. In principle, one could indicate the spread by calculating the range of poll results (that is, the biggest minus the smallest). This has the advantage that the calculation is simple and easy to understand. However, in practice using the range is not a good idea, for various reasons, not least of which is that it depends on only two values (the biggest and the smallest). If one of these is abnormally large or small for the particular data one has available, then the range can be significantly affected. In practice, therefore, people tend to use the standard deviation instead. The most general formula for the standard deviation is:

$$\sqrt{\frac{\sum_{i=1}^{N} (x_i - \text{population mean})^2}{N}}$$

(NB: In practice, you will probably need to use (N-1) on the bottom of this formula instead of N.
See the note below.)

There are good reasons for using what appears at first sight to be an unnecessarily complicated formula! All it is doing is finding the average squared difference between the data values and the mean data value, with a square root to make sure the units of the standard deviation are the same as those of the original data values. The size of the standard deviation tends to be around one quarter or one sixth of the full width of the distribution, though in detail this depends on the shape of the distribution and the particular sample one happens to be working with. The square of the standard deviation is known as the variance.

It is worth noting here that the above formula for standard deviation is not used very often. Instead, one uses a formula with (N-1) on the bottom rather than N. You do not need to understand in detail the reasons for the difference between these two versions of the formula. The key reason is that in practice we tend not to know the population mean in the formula above, and have to estimate it using the mean of the sample we have collected. This turns out to be OK as long as one changes the N into an (N-1).

the Shape of the Distribution

In principle describing the shape is tricky, as there might be lots of different shapes. But in practice, the Normal distribution matches many of the shapes well. The mean and standard deviation vary from situation to situation (in this case they depend upon the underlying popularity of George Bush amongst the US population, and the size of the sample in each poll), but once one has calculated or specified these, there is a single Normal distribution curve, and we can use this to predict the distribution of individual poll results.
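The calculations in sections 3.5 and 3.6 can be checked with a short Python sketch. It is offered as an illustration only: the function names `step` and `margin_of_error` are invented here, and the list `polls` is made-up data, not results from any real poll. The `statistics` module's `pstdev` divides by N and `stdev` divides by (N-1), which matches the two versions of the standard deviation formula discussed above.

```python
import math
import statistics


def step(P, n):
    """STEP: standard error of a percentage P (0 to 100) from a sample of n people."""
    return math.sqrt(P * (100 - P) / n)


def margin_of_error(P, n, z=1.96):
    """Error = Z * STEP; z=1.96 covers 95% of the area under the Normal curve."""
    return z * step(P, n)


if __name__ == "__main__":
    print("STEP for P=46, n=769:        ", round(step(46, 769), 2))
    print("95% error for P=46, n=769:   ", margin_of_error(46, 769))
    print("95% error for P=50, n=769:   ", margin_of_error(50, 769))
    print("95% error for P=46, n=4*769: ", margin_of_error(46, 4 * 769))

    # Hypothetical poll results (percent supporting the candidate), invented
    # purely to show the two standard deviation formulas side by side.
    polls = [45.1, 47.3, 46.0, 44.2, 48.1, 46.5]
    print("mean:                        ", statistics.mean(polls))
    print("standard deviation with N:   ", statistics.pstdev(polls))
    print("standard deviation with N-1: ", statistics.stdev(polls))
```

Carrying the STEP at full precision gives a 95% error very slightly below the 3.53 quoted in the text, which multiplies 1.96 by the rounded value 1.80. Quadrupling the sample size halves the error, in line with the $1/\sqrt{n}$ rule from section 3.3.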
