Workshop 1. Descriptive Statistics, Distributions, Sampling and Monte Carlo Simulation. Part I: The Firestone Case 1

Sami Najafi Asadolahi
Statistics for Managers

Workshop 1: Descriptive Statistics, Distributions, Sampling and Monte Carlo Simulation

The purpose of the workshops is to give you hands-on experience with Excel, which can be used to perform statistical analyses based on the ideas covered in the lectures. This first workshop has two parts, both with step-by-step guidelines. However, please remember to take a moment to think about the problems and issues you are trying to solve. If you have any problems with the following exercises, please ask the lecturer or workshop tutors for help.

Part I: The Firestone Case [1]

Background: Firestone is offering a refund to customers in order to promote its snow tires. For a more detailed background please read the separate document, Firestone Canada Inc.pdf. To summarize, Firestone is offering to refund customers if the snowfall in their area is less than average over the coming winter, according to the following refund scheme:

  Snowfall less than 20% of average   100% refund
  Snowfall less than 30% of average    75% refund
  Snowfall less than 40% of average    50% refund
  Snowfall more than 40% of average    no refund

The purpose of this exercise is to work out the probability of having to award refunds in a specific area, namely the area around Toronto. From these probabilities we can also calculate the expected cost of the refunds in this area (which will presumably be used by the insurance company to determine the premium it will charge to cover the refund scheme). The normal distribution will be useful for our analysis in order to get a better estimate of the probabilities of having to award refunds. This case is a good example of a situation where historical data alone is not sufficient and a known distribution, such as the normal distribution, allows a more meaningful analysis. The following provides a guideline to follow in carrying out the analysis.
However, please remember to take a moment to step back and think about the analysis you are doing.

Step 1: Load the data file

The data for the Toronto snowfall is in Fire-to.xls. This file is read-only, so you should save it to your own disk space in order to back up your own analysis and results.

Step 2: Examine the data

The data file contains monthly snowfall records for the Toronto area. The data covers the period January 1940 to May 1983; the first few years' worth of data should appear as below:

[1] Original case modified by Neil Burgess & Kostis Christodoulou.

TORONTO snowfall (cms)

  Year   Jan   Feb   Mar   Apr   May  Jun  Jul  Aug  Sep  Oct   Nov   Dec
  1940   29.7  38.9  35.1  10.4  0    0    0    0    0    0     42.2  11.9
  1941   41.7  27.4  26.2   0    0    0    0    0    0    0      0    16
  1942   11.9  33.5  14.2  14    0    0    0    0    0    0     14.5  37.1
  1943   68.3  20.3  23.9  20.3  0    0    0    0    0    0      4.1   2.8
  1944    6.6  49.5  30    17    0    0    0    0    0    0     10.7  92.5
  1945   45.7  30.2   3.3  10.9  0    0    0    0    0    0     12.7  18.5
  1946   56.6  54.6   1.5   3.3  0    0    0    0    0    0     13    24.1
  1947   59.9  25.1  36.6  12.7  0.5  0    0    0    0    0     13    34.5
  1948   37.1  48.5  16.3   0.3  0    0    0    0    0    0      0    42.2
  1949   21.8  30.5  32.5   1.3  0    0    0    0    0    0     22.6  10.9
  1950   41.7  70.9  29.7   1.8  0    0    0    0    0    0     57.2   3.6
  1951   28.7  20.8  23.9   0    0    0    0    0    0    0     34    71.1
  1952   45.5  19.1   7.9   0.3  0    0    0    0    0    2.3    0     9.1

Exhibit 1: Example of snowfall data

Step 3: Calculate yearly totals

The refund scheme is calculated on yearly snowfall, so we need to aggregate the monthly data. We will calculate the yearly totals in column N, the first empty column. Label the total series by selecting cell N2 and entering Totals. To calculate the yearly total for 1940, enter in cell N3 the formula:

  =SUM(B3:M3)

Copy this formula down the spreadsheet in order to calculate the totals for the other years: select cell N3 and drag down to select the range N3:N45, then from the Editing group under the Home tab, select Fill and then Down (or use the shortcut Ctrl-D). You should see that column N has now been filled in with the yearly totals. (Alternatively, double-click the small box on the bottom-right corner of the selection.)

Step 4: Calculate summary statistics for yearly snowfall figures

From the Data menu, select Data Analysis. In the Data Analysis dialog box (shown below) select Descriptive Statistics and press OK (or double-click). If Data Analysis does not appear in the Data tab, press File and then Options. In the Excel Options window that opens, select Add-Ins from the left column. In the Manage drop-down menu, select Excel Add-ins and press Go.
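The row-summing in Step 3 is simple enough to check outside Excel. A minimal Python sketch, using the 1940 row transcribed from Exhibit 1:

```python
# Monthly snowfall (cm) for Toronto in 1940, transcribed from Exhibit 1
# (columns Jan..Dec; May-Oct are zero).
snow_1940 = [29.7, 38.9, 35.1, 10.4, 0, 0, 0, 0, 0, 0, 42.2, 11.9]

# Equivalent of entering =SUM(B3:M3) in cell N3
yearly_total = sum(snow_1940)
print(round(yearly_total, 1))  # 168.2
```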
In the ensuing dialog box, tick Analysis ToolPak and then click OK. Now retry Data and then Data Analysis.

Exhibit 2: The Data Analysis dialog box

Specify the parameters for the descriptive statistics procedure:

Place the focus in the Input Range box by clicking in the white box. Then either:
1. select the data on the underlying spreadsheet using the mouse (i.e. select N2 and drag down to N45), or
2. specify the range directly by typing N2:N45.

Specify that the data is grouped by columns by clicking on the top radio button. Tick the box to say that we have Labels in the first row. Towards the bottom of the dialog box, choose the Output Option: New Worksheet Ply, and give the results worksheet the title Summary. Finally, tick the bottom box, which specifies Summary Statistics.

Exhibit 3: The parameters for the Descriptive Statistics procedure

The dialog box should now look like Exhibit 3. Perform the analysis by pressing the OK button. Excel will create a new worksheet with the results and call it Summary.

Step 5: Examine the results of the Descriptive Analysis

First, resize the columns so that you can see the results properly: leave the first two columns selected, then on the Home tab, in the Cells group, click Format, and under Cell Size click AutoFit Column Width. Now we can see the summary statistics for the annual snowfall figures. Pay particular attention to the following statistics:

Mean - this is the average annual snowfall
St. deviation - this measures the variability of the annual figures around the mean
Minimum - lowest annual snowfall

Maximum - highest annual snowfall
Count - number of observations in the sample

Your results should be the same as Exhibit 4, below:

  Total
  Mean                       142.6488372
  Standard Error             4.869088399
  Median                     141.1
  Mode                       #N/A
  Standard Deviation         31.92874785
  Sample Variance            1019.444939
  Kurtosis                   -0.191009286
  Skewness                   -0.040209449
  Range                      139.6
  Minimum                    66.7
  Maximum                    206.3
  Sum                        6133.9
  Count                      43
  Confidence Level (95.0%)   9.826221316

Exhibit 4: Summary statistics for Toronto snowfall

Step 6: Calculate the refund levels

Now that we know the average, we can calculate the thresholds below which the different levels of refund apply.

First let's enter the refund levels: in cell D2 (of the Summary sheet) enter the title Refund level, and then in the cells below (i.e. D3 to D5) enter the refund levels: 100%, 75% and 50%. Now the levels of snowfall below which the refunds apply: in cell F2 put the title Snowfall below, and then in cells F3 to F5 the description of when each refund applies, i.e. 20% of average, 30% of average, and 40% of average. In cell H3 enter the formula which calculates the threshold for the first refund level:

  =20%*B3

and similarly for the other two cases.

Compare these refund levels to the information in the summary statistics. The minimum recorded snowfall is 66.7 and the refunds only apply if the snowfall is less than 57.1. What does this suggest about the probability of giving refunds? The insurance company might take the view that it is only a fluke that the 43-year sample has no years of very low snowfall; to deal with this we will assume that the snowfall has an underlying distribution and then re-calculate the probabilities of giving refunds using that theoretical distribution. In order to choose a sensible theoretical distribution we will first look at the histogram of our sample distribution.

Step 7: Plotting the histogram of the sample distribution

Go back to the FIRE-TO worksheet, which contains the snowfall data. First we specify the boundaries for the bins: in cell P2 enter the title Range. In P3 enter the value 0; in P4 enter the value 20; in P5 enter 40; and so on until P18 with value 300.
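As a cross-check on Step 6, the thresholds can be computed directly from the mean in Exhibit 4. A minimal Python sketch:

```python
mean_snowfall = 142.6488  # the Mean from Exhibit 4 (cell B3 of the Summary sheet)

# Equivalents of the worksheet formulas =20%*B3, =30%*B3 and =40%*B3
thresholds = {pct: pct * mean_snowfall for pct in (0.20, 0.30, 0.40)}
for pct, level in sorted(thresholds.items()):
    print(f"snowfall below {level:.1f} cm -> refund tier at {pct:.0%} of average")
```

These reproduce the three threshold values (about 28.5, 42.8 and 57.1 cm) that appear later in Exhibit 8.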

(A shortcut is to enter 0 in P3 and 20 in P4, then select the two cells and extend the list by dragging the small box on the bottom-right corner of the selection downwards.)

Now activate the histogram procedure. From the Data menu, select Data Analysis as before, but this time choose the Histogram option:

Specify the Input Range as N2:N45 (i.e. the data).
Specify the Bin Range as P2:P18.
Tick the Labels box.
Specify the Output option of New Worksheet Ply and give it the name Distribution.
Tick the bottom box for Chart Output.

Activate the procedure by pressing the OK button. Excel should create a new worksheet with the title you gave it (i.e. Distribution), which contains the histogram information and the histogram itself. Note: you can resize the histogram by selecting the chart and dragging the corners.

Exhibit 5: Frequency histogram for annual snowfall in the Toronto region

The histogram shows a typical distribution with a central peak and fewer observations in the two extremes, or tails. It appears reasonable to model this using a normal distribution, but let's do an extra comparison first in order to be sure. (Another convenient way of preparing data for a histogram is the function FREQUENCY. Explore in the Excel help how to use it. Note that this is an array formula, which means that it returns more than one value. To get its output you need to select the output cells, as many as the bins, and press CTRL+SHIFT+ENTER.)

Step 8 (optional): Plotting a Normal distribution which closely matches the data

The normal distribution takes two parameters: a location parameter, which is the mean of the distribution, and a dispersion parameter, which measures variability around the mean. For these two parameters we can simply use the mean and standard deviation contained in the summary statistics.
First we must calculate the cumulative distribution: for a given value, this tells us the probability that an observation will be less than (or equal to) the chosen value.

Why do we use the cumulative distribution rather than the density curve?

Enter the title Cumulative in cell C1 of the Distribution worksheet (move the histogram out of the way if necessary) and then enter the following formula in cell C2:

  =NORM.DIST(A2,142.65,31.93,1)

Note: A2 is the point at which we wish to evaluate the cumulative distribution, 142.65 is the mean of the normal distribution, 31.93 is the standard deviation, and the 1 tells Excel that we want cumulative probabilities rather than densities. Then copy the formula down through cells C3 to C17. The cell reference will refer to the different bin values contained in cells A2 to A17, and the formula will evaluate the cumulative distribution at these different points. Exhibit 6 shows a plot of the values in column C. The cumulative distribution rises from zero at low values (no observations fall below this point) to a maximum of 1 at high values (all the observations fall below this point).

Exhibit 6: Cumulative normal distribution

In order to compare with the histogram we need to convert the cumulative distribution first into probabilities for each bin and then into frequencies per bin. In cell D1 enter the title Probability. For the first bin the probability is the same as the cumulative probability (there are no lower bins to adjust for), so we enter the formula for the probability in cell D2 as:

  =C2

For the second bin the probability is the cumulative probability (prob. of bin 1 or 2) minus the previous cumulative probability (prob. of bin 1), so the formula for cell D3 is:

  =C3-C2

You can then copy the formula in cell D3 down through the other cells D4 to D17. We now have the probabilities for each bin.
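The differencing of cumulative probabilities in column D can be mirrored with Python's statistics.NormalDist. A sketch, using the rounded mean and standard deviation from the NORM.DIST formula above:

```python
from statistics import NormalDist

snow = NormalDist(mu=142.65, sigma=31.93)  # mean and st. dev. from Exhibit 4
bins = list(range(0, 301, 20))             # bin values 0, 20, ..., 300 (A2:A17)

# Column C: cumulative probabilities, as in =NORM.DIST(A2,142.65,31.93,1)
cumulative = [snow.cdf(b) for b in bins]

# Column D: per-bin probabilities; the first bin is =C2, the rest are =C3-C2 etc.
probs = [cumulative[0]] + [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]

print(round(sum(probs), 4))  # nearly 1: almost all probability lies below 300 cm
```

As expected, the largest per-bin probability falls in the bin containing the mean (the 140-160 bin).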
In order to compare to the frequency histogram we simply multiply the probabilities by our original sample size of 43 (given in the summary statistics): In the Distribution worksheet, enter the title FreqN in cell E1.

Then in cell E2 enter the formula:

  =D2*43

and copy this down through cells E3 to E17. We can add this data to the original histogram plot in order to compare the two: right-click on the histogram and choose Select Data. Press Add, put E1 as the Series name and E2:E17 as the Series values, then press OK. Now select the data you just inserted on the graph by clicking on one of the columns and then right-clicking. Select Change Series Chart Type and choose a chart option with a line (as shown in Exhibit 7 below).

Exhibit 7: Comparison of empirical (actual) and theoretical distributions

It is possible to perform a (chi-squared) statistical test to see whether the differences between the two distributions are simply due to random fluctuations or whether they are statistically significant. We will look at that approach later in class. For the purposes of this exercise, however, we can see that the actual data closely matches the theoretical normal distribution, and that it is valid to continue the analysis on the assumption that future snowfalls will follow the same normal distribution.

Step 9: Re-estimate the refund probabilities based on the assumption of a normal distribution

Go back to the Summary worksheet. Assuming that the normal distribution is a good fit, we can now calculate the refund probabilities as the cumulative probabilities at the refund thresholds.

In cell J2 enter the title Probability. In cell J3 enter the formula:

  =NORM.DIST(H3,$B$3,$B$7,1)

Note: by using absolute ($) references to the cells containing the mean and standard deviation we avoid errors creeping in when we copy the formula. Instead of typing $ you can press F4. Copy the formula down to cells J4 and J5. Cells J3, J4 and J5 now contain the estimates of the refund probabilities at the different levels, and the worksheet should look like Exhibit 8:

  Refund level   Snowfall below                 Probability
  100%           20% of average    28.52977     0.000176
  75%            30% of average    42.79465     0.000882
  50%            40% of average    57.05953     0.003674

Exhibit 8: Calculation of refund probabilities

The values are still very small, but at least they are no longer zero and probably represent much more realistic estimates of the refund probabilities. Why?

Step 10: Calculating the expected refund per $ of sales

As a final step we can estimate a fair insurance premium. In cell D10 enter the title: Expected cost of refund per $. In H10 enter the formula to calculate the expected refund [= sum(amount of refund x prob. of refund)]:

  =100%*J3 + 75%*(J4-J3) + 50%*(J5-J4)

Note: for the lower levels of refund we subtract the probability of a higher refund in order to avoid double counting. Cell H10 should now contain the value 0.002101. We can view this as the expected percentage refunded (0.2101%), or the expected amount (in dollars) of refund per dollar. This number would form the basis of the premium which the insurance company would charge (plus a margin, of course!) to cover the financial risk contained in the refund option.

Please be ready to answer the following questions.

Questions

1. Why are we using the normal distribution to calculate the refund probabilities? Why don't we use historical data?

Let us now investigate the sensitivity of the results to the accuracy of our parameters (i.e. the mean and standard deviation of the annual snowfall).

2. How would the refund probabilities alter if the official average, on which the offer is based, were different from our sample average (as could happen if the average were calculated over a shorter or longer sample, for instance)? Before changing the numbers, think about how the probabilities would change.
a) What if the official average were lower than our value of 142.65?
b) What if the official average were higher than our value?

3.
It is also possible that our sample misrepresents the true variability of the data - perhaps snowfalls have been more consistent than usual, or more variable than usual.
a) How would the refund probabilities be affected if the true variability were higher than in our sample (standard deviation greater than our 31.93)?
b) What if the true variability were lower (standard deviation less than 31.93)?
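For reference, the calculations of Steps 9 and 10 can be reproduced outside Excel with Python's statistics.NormalDist. A sketch using the rounded parameters from Exhibit 4:

```python
from statistics import NormalDist

snow = NormalDist(mu=142.65, sigma=31.93)   # mean and st. dev. from Exhibit 4

# Step 9: cumulative probabilities at the refund thresholds (cells J3:J5)
p20 = snow.cdf(0.20 * 142.65)   # P(snowfall < 20% of average) -> 100% refund
p30 = snow.cdf(0.30 * 142.65)   # P(snowfall < 30% of average)
p40 = snow.cdf(0.40 * 142.65)   # P(snowfall < 40% of average)

# Step 10: expected refund per $ of sales,
# mirroring =100%*J3 + 75%*(J4-J3) + 50%*(J5-J4)
expected_cost = 1.00 * p20 + 0.75 * (p30 - p20) + 0.50 * (p40 - p30)
print(f"{expected_cost:.6f}")   # close to the 0.002101 in cell H10
```

Changing mu or sigma in this sketch is a quick way to explore sensitivity Questions 2 and 3.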

Concluding notes

Note how the actual cost of the refund to the company is very low, but this is much less obvious to the customer, who might well feel that they are getting a very good deal. The fact that the company is well aware of this can be seen in the penultimate paragraph of the case description. Although we are using fairly basic statistical concepts, this is a realistic example which illustrates the basis for calculating insurance premiums. In fact, similar ideas form the basis of pricing financial derivatives such as options.

Part II: Sampling Distributions

Background: This part of the workshop illustrates the concept of sampling. This is one of the most useful statistical concepts because, for many reasons (e.g. cost, practicality), we usually need to draw a conclusion from a limited sample of data rather than from the whole data set. For example, a supermarket would like to find out how much its customers spend, on average, at each visit. For cost and practicality reasons, data is gathered from a few customers, but not from all of them. In other words, based on a sample of customers the supermarket would like to draw a conclusion about the average spending across all customers. The answer based on the limited sample won't be 100% accurate, since not all customers were asked, and thus the average spending of the customers asked is not exactly the same as the average spending calculated from all customers. This error, generated by looking only at a sample rather than the whole customer base, is called a sampling error. This exercise allows you to understand the statistical implications of sampling and the characteristics of the sampling error. An Excel spreadsheet is provided, which has a graph (the upper graph) showing the distribution of the spending of the whole customer base of a supermarket. (Detailed guidelines for using the spreadsheet are given on the next page.)
By pressing a button you can take one sample from this distribution, which corresponds to doing one survey: asking a random sample of customers how much each of them spent on their visit to the supermarket. The average value from this sample is plotted in the lower graph. You can choose the sample size, i.e. the number of customers you ask.

To go a step further, you can also take several samples to get a sense of how different the outcome could be if you were to take more than one sample. Remember that in reality we usually take only one sample (we do only one survey), which includes a certain number of customers, and then we calculate the average spending from that sample. However, to understand the error associated with the average spending calculated from this limited sample instead of the whole customer base, we imagine that we take a few other samples and analyse how far they are from each other by looking at their histogram. If their histogram is very spread out, then you know that the average spending calculated from a sample could be quite far from the average spending of the whole customer base. However, if you increase your sample size (ask more customers), the average spending calculated from a sample should be a better estimate of the average spending of the whole customer base. We will explore this distribution here.

As mentioned above, any statistic based on a limited sample will be subject to sampling error. Often we are interested in the average (mean) value of a variable: average cost, average profit, average time to complete a job, average salary, etc. It is safe to make three assumptions when making inferences about the population mean based on the sample mean. We will explore these assumptions here in this part of the workshop and in our next lecture:

- on average, the mean of a sample will be the same as the mean of the underlying population distribution;
- the sampling error of the mean will follow a normal distribution, assuming we have a reasonably large sample of 30 or more;
- the standard error (standard deviation of the sample mean) equals the underlying population standard deviation divided by the square root of the sample size.

It is useful to get a feel for how this works by experimenting with data. Using an Excel spreadsheet, we will consider an example of estimating the average spend of a supermarket customer, trying out different underlying distributions and taking different sample sizes to estimate the true average spend.

Step 1: Load the spreadsheet file

The spreadsheet that we will use is clt95.xls. Open the file and, if asked, reassure Excel that the workbook is from a reliable source by clicking Enable Macros, and then click OK in the next dialog box.

[The Control sheet shows two charts - Distribution of Amount Spent in Store (average spend 20.00, standard deviation 19.97) and Distribution of Sample Average - together with the input cells: Choose shape (1, 2 or 3), Number of samples (max 1000), Sample size (max 250), a Speed setting (0=slow, 1=fast) and a Press to Resample button.]

Resize so that you can see both graphs fully. The only things you will alter in this spreadsheet are:

- the shape of the underlying distribution
- the number of samples
- the size of each sample
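The three assumptions listed above can also be checked numerically before touching the spreadsheet. A minimal simulation sketch, assuming (purely for illustration) that the Shape 1 population behaves like an exponential distribution with mean 20, which matches its stated mean of 20 and standard deviation of about 20:

```python
import random
from statistics import mean, stdev

random.seed(42)                     # reproducible illustration
POP_MEAN = 20.0                     # an exponential's st. dev. equals its mean
N_SAMPLES, SAMPLE_SIZE = 1000, 100

# Each simulated "survey" asks SAMPLE_SIZE customers and records the average spend
sample_means = [
    mean(random.expovariate(1 / POP_MEAN) for _ in range(SAMPLE_SIZE))
    for _ in range(N_SAMPLES)
]

print(round(mean(sample_means), 1))   # near the population mean of 20
print(round(stdev(sample_means), 1))  # near 20 / sqrt(100) = 2
```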

Step 2: Set the initial parameters

Set the shape of the underlying distribution by clicking in cell C5 (the top boxed cell, with a number in red), then type 1 and press Enter. This specifies the underlying population distribution, which is displayed in the top graph. Shape 1 is a distribution with an average of 20 and a standard deviation of 19.97; lower values are most likely and the distribution decays gradually, so that there is still quite a high chance of an individual customer spending 40 pounds or even more (values higher than 43 are not displayed on the graph). Is this a realistic distribution for supermarket spending patterns?

In a similar way, specify the initial Number of samples by entering the value 1 in cell D13 (with a blue number). Initially we will look at a single sample, which is all we normally have in reality. Set the Sample size (in cell D15, with the green number) to 10. We will start with quite a small sample and experiment with larger ones later.

Step 3: Take a Sample

Take a sample by clicking on the Press to Resample button in the middle of the left-hand side of the screen. You can see the average value of the sample by looking in cell D23 (next to the cell that says Average, below the Analysis of Sample Averages title). This value is also displayed in the bottom graph. You should notice that the average value of your sample will be close to, but not exactly the same as, the average of the underlying population (which we know is 20). Also notice that the cell next to St. Dev. is giving a #DIV/0! error: this is because we cannot calculate a measure of variability (which is what the standard deviation is) for the sample mean from only one sample.

Step 4: Take some more samples

Take a series of samples by clicking the Press to Resample button. Repeat this a few times. Notice how the Average value changes from one sample to another: each sample we take gives us a different estimate of the underlying value. Because we are taking a relatively small sample there is quite a large margin of error, so let's try a larger sample.

Step 5: Take some larger samples

Change the sample size by clicking in cell D15 and changing the value from 10 to 100 (press Enter). Resample a few times by clicking the button. You should notice that with the larger samples we are now more accurate: most times the sample Average is now quite close to the true value of 20. Let's analyse the distribution of the sampling error by taking a set of samples and analysing the variability between them.

Step 6: Take a set of samples

Change the number of samples by clicking in cell D13 and changing the value from 1 to 10. Generate the set of ten samples by pressing the Resample button. The bottom graph now shows a histogram of the distribution of the sample averages, similar to the one below:
Because we are taking a relatively small sample there is quite a large margin of error, let s try a larger sample: Step 5: Take some larger samples Change the sample size by clicking in cell D15 and changing the value from 10 to 100 (press enter). Resample a few times by clicking the button. You should notice that with the larger samples we are now more accurate: most times the sample Average is now quite close to the true value of 20. Let s analyse the distribution of the sampling error by taking a set of samples and analysing the variability between them: Step 6: Take a set of samples Change the number of samples by clicking in cell D13 and changing the value from 1 to 10. Generate the set of ten samples by pressing the Resample button. The bottom graph now shows a histogram of the distributions of the sample averages, similar to the one below:

[The Control sheet now shows the Distribution of Sample Average histogram, with Average: 19.76 and St.Dev: 1.94.]

In this case the histogram indicates that of the ten samples, 2 had an average value of 17-point-something, 1 sample had an average of 18-point-something, 3 samples were between 19 and 20, 2 between 20 and 21, 1 between 21 and 22, and 1 between 23 and 24. Look at your own histogram and interpret your results similarly. Notice that the average of the sample averages is quite close to the true value of 20. Notice also that the standard deviation (sampling error) of these averages is roughly one-tenth of the population standard deviation: 1.94 in this case, as opposed to 19.97 for the underlying population.

Step 7: Analyse the samples themselves

Look at the actual observations within the different samples by clicking on the Samples worksheet tab at the bottom of the screen. You should see a display similar to that below:

            Average  Obs1  Obs2  Obs3  Obs4  Obs5  Obs6  Obs7  Obs8  Obs9  Obs10  Obs11
  Sample1   19.14     4     9    17     1    22     9    13    22    49    12      0
  Sample2   20.70    33    10     3    14    44     4    46     4    12    62      7
  Sample3   23.57    17     1    21    97    31    18    17    25    12     7     11
  Sample4   21.79     4     7    33    30    50     1     1     5    13    51     71
  Sample5   17.51     2    38    16    22     3    86    19     6     2    75      4
  Sample6   19.05    13     5    61     0    26    19    26     6    12     6     17
  Sample7   19.96     8    17    15    36    40    14     8    41    27     5     14
  Sample8   17.09    38     8    24    10     4    42    21    37     3     9      9
  Sample9   20.16     2     8    20    32    21    28    10    34    89    24     38
  Sample10  18.66    38     7     5    60    32    22    27    19     3    28      7

The sheet shows the composition of the 10 different samples, each made up of 100 observations (some will be off the right-hand edge of the screen). The sheet also shows the average value within each sample. Notice that although the individual observations vary quite widely (from 0 to 97 in those shown above), the sample averages are quite consistent.

In other words, we get more accurate information by looking at the sample average than by looking at a single observation. The information that we were looking at previously, on the Control sheet, is a summary of these sample averages. Notice that in this example Sample 5 has an average of 17.51 and Sample 8 has an average of 17.09; these are the two samples that the histogram told us were in the 17-point-something range. Check that the values for your sample averages correspond to the counts shown in the histogram on your Control sheet. Similarly, the average of all these sample averages should come out to the 19.76 shown on the Control sheet (your value will depend on your particular samples), and the standard error (the standard deviation of these averages) will come out to roughly 2 (in this case 1.94). Click Press to Resample in the Samples sheet.

Step 8: Investigate other sample sizes and numbers of samples
By changing cells D13 and D15 in the Control worksheet, experiment with different sample sizes and different numbers of samples. Experiment until you are comfortable with the key relationships: smaller sample sizes give a larger margin of error (remember: the standard error of the sample mean is inversely proportional to the square root of the sample size), and with a large number of samples the histogram of the sample averages is close to a normal distribution. The example below is for 1,000 samples of size 100 (warning: this one takes a long time to run!).

[Figure: histogram "Distribution of Sample Average" for 1,000 samples of size 100, approximately bell-shaped]

Remember: this happens in spite of the fact that the population distribution is not even close to a normal distribution. What you are seeing is the Central Limit Theorem at work: add together enough of anything and it comes out looking normal!

Exercise
Change the underlying distribution by setting cell C5 of the Control worksheet to 2 or 3.
Experiment with these distributions and confirm that the same effects occur.
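The Central Limit Theorem effect described above can be sketched for any skewed population. Here is a hedged Python sketch using a lognormal population as an assumed example (the workbook's alternative distributions may differ); it measures skewness, which is strongly positive for the raw observations but much closer to zero for the sample averages.

```python
# Sketch: sample averages look far more "normal" than the skewed population
# they come from. The lognormal population is an assumed illustration.
import random
import statistics

random.seed(1)

def skewness(xs):
    """Standardised third moment: roughly 0 for a symmetric distribution."""
    mu, sd = statistics.mean(xs), statistics.stdev(xs)
    return sum(((x - mu) / sd) ** 3 for x in xs) / len(xs)

# A strongly right-skewed population, nothing like a normal curve.
population = [random.lognormvariate(0, 1) for _ in range(10_000)]

# 1000 samples of size 100, keeping only each sample's average.
sample_means = [statistics.mean(random.choice(population) for _ in range(100))
                for _ in range(1000)]

print(f"population skewness:  {skewness(population):5.2f}")    # strongly positive
print(f"sample-mean skewness: {skewness(sample_means):5.2f}")  # much closer to 0
```

This is the same effect the Control sheet histogram shows: averaging washes out the shape of the underlying distribution.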

Part III: Airline overbooking 2

This example presents a stylised version of the calculations airlines use when they seek to optimise revenue by overbooking flights. On average, a certain percentage of ticketed passengers cancel very late or do not turn up in time, so to avoid empty seats airlines often sell more tickets than there are seats. We will assume that the no-show probability for an individual passenger is 5%.

What probability distribution describes the number of passengers who show up, if n tickets are sold and each passenger makes the no-show decision independently? Thinking in terms of a binomial distribution, we assume that each ticketed passenger, independently of the others, shows up with probability 95%.

For a flight with 200 seats, the airline wants to find how sensitive various occupancy probabilities are to the number of tickets it sells. In particular, it wants to calculate:
1. The probability that more than 205 passengers show up
2. The probability that more than 200 passengers show up
3. The probability that at least 195 seats will be filled
4. The probability that at least 190 seats will be filled

In the first two events the airline will have to bump customers from the flight and compensate them; in the last two events the airline would have preferred the extra revenue from having the seats occupied. As a basis for analysing this trade-off, and hence optimising n, we will use Excel's BINOM.DIST function and a data table. Follow the steps on the next page.

Step 1: Start up Excel and load the file Overbook.xls

Step 2: Examine the file
The file contains a layout that will help us address the airline's problem. Note that we have defined two names: NTickets refers to cell B6 and PNoShow refers to cell B4. Using range names instead of cell addresses makes Excel's formulas much easier to read.

Step 3: Explore the BINOM.DIST function
Familiarise yourself with the BINOM.DIST function using the Excel help.
When do we use the cumulative distribution and when the probability mass function?

Step 4: Calculate the probabilities
In cell B6 we enter a possible number of tickets sold, and from this we calculate the required probabilities in row 10. For example, the formulas in cells B10 and D10 are:

Cell B10:  =1-BINOM.DIST(205, NTickets, 1-PNoShow, 1)
Cell D10:  =1-BINOM.DIST(194, NTickets, 1-PNoShow, 1)

2 Based upon an initial problem developed by Neil Burgess & Kostis Christodoulou.

The probability of more than 205 is 1.0 minus the probability of 205 or fewer, whereas the probability of at least 195 is 1.0 minus the probability of 194 or fewer.

Step 5: Set up the data table
Once we have calculated the probabilities in row 10, we would like to see how sensitive they are to the number of tickets issued. Excel's data tables allow us to explore how an output value changes as we vary one of the inputs. We will create a one-way data table: one-way because there is only one input (the number of tickets issued), even though four output probabilities are calculated.

To create the data table, list several possible values of tickets issued down column A and transfer the probabilities from row 10 to row 14, i.e., in cell B14 type =B10 and copy it across. Once you have copied the probabilities, highlight the range surrounded by a red border, i.e., the range that includes the different input values (tickets issued) and the calculated probabilities. Select Data > What-If Analysis > Data Table; as you are filling in columns by varying the number of tickets issued, the column input cell is $B$6 (just click on the cell) and there is no row input cell.

Questions:
1. As the airline sells more tickets, how likely is it to bump customers off the flight?
2. How likely is it to fill most seats?
3. What happens to the results if the probability of a no-show increases to, say, 10%?
4. What further information would you like to have in order to make a recommendation to the airline on how many tickets it should issue?
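As a cross-check on the spreadsheet, the same probabilities and the one-way data table can be computed outside Excel. This is a sketch: the ticket count n = 215 is an assumed illustration value, not necessarily the one in Overbook.xls.

```python
# Pure-Python cross-check of the spreadsheet probabilities.
# n_tickets = 215 is an assumed illustration value.
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), i.e. Excel's BINOM.DIST(k, n, p, TRUE)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

n_tickets, p_show = 215, 0.95   # show-up probability = 1 - no-show probability

print("P(more than 205 show):", 1 - binom_cdf(205, n_tickets, p_show))
print("P(more than 200 show):", 1 - binom_cdf(200, n_tickets, p_show))
print("P(at least 195 show): ", 1 - binom_cdf(194, n_tickets, p_show))
print("P(at least 190 show): ", 1 - binom_cdf(189, n_tickets, p_show))

# The one-way data table: how two of the probabilities move as n varies.
for n in range(200, 221, 5):
    print(n, round(1 - binom_cdf(205, n, p_show), 4),
             round(1 - binom_cdf(194, n, p_show), 4))
```

As in the spreadsheet, selling more tickets raises both the chance of filling the seats and the chance of having to bump passengers; the table makes the trade-off visible.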

Part IV: Profit-at-Risk

This exercise provides simple practice in the use of Monte Carlo simulation techniques for risk analysis. Our example looks at the gross profits of a stylised manufacturing company where the main uncertainties are the sales volume and the variable costs of producing the product, which are assumed to vary independently. Fixed costs are not random. Of interest is the profit distribution, since it depends on these two uncertainties; in particular the average profit, the variance, and an estimate of the Value-at-Risk (i.e., how much the company could lose in the worst 5% of outcomes).

You will become familiar with the @Risk Excel add-in, which estimates this profit distribution by collecting the results of many iterations of the model, where each iteration takes values from the input distributions: sales volume and variable costs.

Step 1: Close down all copies of Excel

Step 2: Start @Risk (which will automatically restart Excel) by selecting: Start > Programs > Palisade Decision Tools > @Risk 5.7 for Excel

Step 3: When @Risk has loaded, open the data file
The data and set-up for the product profit-contribution calculation are in Sales.xls. This file is read-only, so you should save it to your own disk space (H: drive) in order to preserve your own analyses and results. Now that you have loaded @Risk with Excel, you can define the input distributions, replacing the single values in the relevant cells.

Step 4: Define distributions for sales volume and variable costs
Select cell B8 (sales volume). We would like to generate values of sales volume that represent the historic data distribution. We have historic data in range H5:H22, so we specify the value of cell B8 as =RiskDuniform(H5:H22). This will sample values from H5:H22, where each value has an equal ("uniform") chance of being selected.

Now select cell B10 (variable costs) and click Define Distributions under the Model tab in the @Risk menu. This opens up a window as shown.
This allows us to generate random values for B10 according to many different probability distribution specifications. We will use a normal distribution with an assumed mean and standard deviation. In the window, select the Normal distribution under the Common tab and enter µ = 50 and σ = 7.5. This generates variable costs randomly from a normal distribution with mean 50 and standard deviation 7.5.
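Before running @Risk, it may help to see the same sampling logic in plain Python. This is a sketch with assumed placeholder numbers: the historic volumes, price and fixed costs below are invented for illustration; the real figures live in Sales.xls.

```python
# Sketch of one @Risk-style iteration loop. All data values are assumed
# placeholders, not the contents of Sales.xls.
import random
import statistics

random.seed(7)

historic_volumes = [950, 1020, 880, 1100, 990, 1050, 930, 1010]  # placeholder data
PRICE, FIXED_COSTS = 100.0, 30_000.0                             # placeholder values

def one_iteration():
    volume = random.choice(historic_volumes)          # like RiskDuniform(H5:H22)
    var_cost = random.gauss(50, 7.5)                  # like RiskNormal(50, 7.5)
    return volume * (PRICE - var_cost) - FIXED_COSTS  # profit contribution

profits = [one_iteration() for _ in range(10_000)]
print("mean profit:", round(statistics.mean(profits)))
print("st.dev:     ", round(statistics.stdev(profits)))
```

Each call to one_iteration plays the role of one recalculation of the spreadsheet; collecting many of them approximates the profit distribution, which is exactly what @Risk automates.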

Step 5: Change the simulation settings
Select Simulation Settings under Simulation in the @Risk menu. This opens the simulation settings window shown below. In the General tab, change the number of iterations to 10,000 so that you get a better approximation of the profit distribution when you run the simulations. In addition, select the radio button Random Values (Monte Carlo). Click OK; now if you press the F9 function key you can see that the values are simulated from their distributions and that the profit contribution changes accordingly.

Step 6: Add outputs and run simulations
Select cell B12 and click Add Output under the Model tab to add Profit Contribution as the output that you want to analyse, and press OK. Then click Start Simulation under the Simulation tab to start the simulations. This will open the "@RISK Simulated input: B10" window as below, where you can analyse the results in more detail.

Step 7: Analyse the profit contribution distribution
In the left panel you can see Outputs > B12 Profit contribution. Right-click it and select @Risk Browse Results. This opens the profit contribution distribution. On the right you can see a panel with information about the output distribution, such as the minimum, mean and maximum. Inside the distribution window you can also move two vertical bars which give the profit contribution values at certain percentiles: the bottom 5% (Left P; p1 in Summary Statistics) and the top 5% (Right P; p2). If the vertical bars won't settle on a desired value, reset e.g. Left P in the panel to the right of the histogram.

Based on the distribution, can you find the 95% Value-at-Risk? This is the value of profit (or loss) which has a 95% chance of being exceeded.

Assume that you can improve the manufacturing process so that the standard deviation of the variable costs is reduced to 2.5 (i.e., cell B10 becomes =RiskNormal(50, 2.5)), but this increases your fixed costs by 2%. Would you go ahead and implement this? Would the decision be the same for a risk-neutral and a risk-averse decision maker? How does the shape of the profit contribution distribution change?

Although simple, this example illustrates the basic methodology underlying much of the modern risk management analytics that are becoming mandatory for compliance purposes in trading organisations. (For further interest, visit http://www.riskmetrics.com)
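The VaR reading described above can be sketched in Python. The simulated profits here are assumed illustrative values, not the Sales.xls results: the 95% VaR is simply the 5th percentile of the profit distribution, the value exceeded in 95% of iterations.

```python
# Sketch of reading the 95% VaR off a simulated profit distribution.
# The normal profits below are assumed stand-in results, not Sales.xls output.
import random
import statistics

random.seed(11)
profits = [random.gauss(20_000, 8_000) for _ in range(10_000)]

profits.sort()
var_95 = profits[int(0.05 * len(profits))]   # 5th percentile of profit
print(f"95% VaR: profit exceeds {var_95:,.0f} in 95% of iterations")

# Sanity check: roughly 95% of outcomes lie above the VaR threshold.
share_above = sum(p > var_95 for p in profits) / len(profits)
print(f"share of iterations above VaR: {share_above:.3f}")
```

This mirrors what the Left P bar does in the @Risk window: dragging it to 5% reports the profit level that separates the worst 5% of outcomes from the rest.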