Rational Decision Making



Department of Decision Sciences
Rational Decision Making
Only study guide for DSC2602
University of South Africa, Pretoria

© 2010 University of South Africa. All rights reserved. Printed and published by the University of South Africa, Muckleneuk, Pretoria.

DSC2602/1/2011

Cover: Eastern Transvaal, Lowveld (1928), J. H. Pierneef. J. H. Pierneef is one of South Africa's best known artists. Permission for the use of this work was kindly granted by the Schweickerdt family. The tree structure is a recurring theme in various branches of the decision sciences.

Preface

Everyday life is full of decisions. What should I wear today? What should I eat? Should I buy the red or blue shirt? Should I buy a specific house or buy a piece of land? What is the shortest route from my house to work? And many more. Some of these decisions can be made without thinking or by guesswork. Some can be solved by reasoning or emotions. Some are a bit more difficult and may need additional information.

People have been using mathematical tools to aid decision making for decades. During World War II many techniques were developed to assist the military in decision making. These developments were so successful that after World War II many companies used similar techniques in managerial decision making and planning.

The decision making task of modern management is more demanding and more important than ever. Many organisations employ operations research or management science personnel or consultants to apply the principles of scientific management to problems and decision making.

In this module we focus on a number of useful models and techniques that can be used in the decision making process. Two important themes run through the study guide: data analysis and decision making techniques.

Firstly we look at data analysis. This approach starts with data that are manipulated or processed into information that is valuable to decision making. The processing and manipulation of raw data into meaningful information are the heart of data analysis. Data analysis includes data description, data inference, the search for relationships in data and dealing with uncertainty, which in turn includes measuring uncertainty and modelling uncertainty explicitly. In addition to data analysis, other decision making techniques are discussed. These techniques include decision analysis, project scheduling and network models.

Chapter 1 illustrates a number of ways to summarise the information in data sets, also known as descriptive statistics. It includes graphical and tabular summaries, as well as summary measures such as means, medians and standard deviations.

Uncertainty is a key aspect of most business problems. To deal with uncertainty, we need a basic understanding of probability. Chapter 2 covers basic rules of probability and in Chapter 3 we discuss the important concept of probability distributions in some generality.

In Chapter 4 we discuss statistical inference (estimation), where the basic problem is to estimate one or more characteristics of a population. Since it is too expensive to obtain the population information, we instead select a sample from the population and then use the information in the sample to infer the characteristics of the population.

In Chapter 5 we look at the topic of regression analysis, which is used to study relationships between variables.

In Chapter 6 we study another type of decision making called decision analysis, where costs and profits are considered to be important. The problem is not whether to accept or reject a statement but to select the best alternative from a list of several possible decisions. Usually no statistical data are available. Decision analysis is the study of how people make decisions, particularly when faced with imperfect information or uncertainty.

Chapter 7 deals with project management. Project management consists of planning projects, acquiring resources, scheduling activities and evaluating complete projects. Managers are responsible for project management. They must know how long a specific project will take to finish, what the critical tasks are, and very often, what the probability is of completing the project within a given time span.

In Chapter 8 the subject of network models is discussed. Network models consist of nodes and arcs. Many real-world problems have a network structure or can be modelled in network form. These include problems in areas such as production, distribution, project planning, facilities location, resource management and financial planning. The graphical network representation of problems provides a powerful visual and conceptual aid to indicate the relationship between components of a system.

Contents

1 Descriptive statistics
    Introduction
    Data collection
    Sampling methods
    Simple random sampling
    Stratified random sampling
    Systematic sampling
    Presentation of data
    Types of data
    The frequency table and histogram
    The pie chart
    The cumulative frequency polygon
    The stem-and-leaf diagram
    Descriptive measures
    Measures of locality
    The mean
    The median
    The mode
    Measures of dispersion
    The variance of a data set
    The standard deviation of a data set
    The quartile deviation
    The coefficient of variation
    1.6 The box-and-whiskers diagram
    Summary of descriptive measures
    Measures of locality
    Measures of dispersion
    Exercises

2 Probability concepts
    Introduction
    Classical probability
    Some rules in probability theory
    Conditional probability
    Joint probabilities: multiplication law
    Summary of basic probability concepts
    Exercises

3 Probability distributions
    Introduction
    Random variable
    Discrete random variables
    The probability distribution of a discrete random variable
    Expected value of a discrete probability distribution
    Variance of a discrete probability distribution
    Discrete distribution function
    Discrete random distributions
    The binomial distribution
    The Poisson distribution
    Continuous random variables
    The probability distribution of a continuous random variable
    Continuous distribution function
    Continuous distributions
    The normal distribution
    The Standard Normal distribution
    The exponential distribution
    The uniform distribution
    Exercises

4 Estimation
    Introduction
    Types of estimators
    Point estimation
    Estimating the mean
    Estimating the variance
    Estimating proportions
    Interval estimators
    The standard error
    Confidence intervals for means
    Confidence intervals for proportions
    Exercises

5 Correlation and regression
    Introduction
    Correlation analysis
    The scatter diagram
    The correlation coefficient
    Pearson's correlation coefficient
    Spearman's rank correlation coefficient
    Simple linear regression
    The estimated regression line
    The method of least squares
    Rules and assumptions underlying regression analysis
    Residual plot analysis
    The coefficient of determination
    The F-test for overall significance
    Forecasting accuracy
    Fitting nonlinear relationships
    The use of spreadsheets in simple linear regression
    Exercises

6 Decision analysis
    Introduction
    Structuring decision problems
    The basic steps in decision making
    Payoff tables
    Decision trees
    Decision making without probabilities or under uncertainty
    The optimistic approach - maximax criterion
    The conservative approach - maximin criterion
    Minimax regret approach
    Decision making with probabilities or under risk
    The expected value approach
    Decision trees and the expected value approach
    Sensitivity analysis
    Expected value of perfect information
    Decision analysis with sample information
    Utility and decision making
    Obtaining utility values for payoffs
    Utility curve
    The expected utility approach
    Utility functions
    Exercises

7 Project management
    Introduction
    PERT/CPM
    Project scheduling with certain activity durations: single time estimates
    Define the project
    Project network diagrams
    Conventions for constructing network diagrams
    Drawing network diagrams
    Network calculations
    Calculating early and late event times
    The duration of the project
    The critical path
    Total float
    Other measures of float
    Using linear programming to find a critical path
    Formulating an LP model
    Using LINDO or LINGO to solve the LP model
    Project scheduling with uncertain activity durations: multiple time estimates
    Probability of project completion time
    Project scheduling with time-cost tradeoffs
    Formulating an LP model to crash a project
    Using LINGO to solve the LP model for crashing a project
    Exercises

8 Network models
    Introduction
    The shortest-route problem
    A shortest-route algorithm
    The labelling phase
    Backtracking phase
    Formulating the shortest-route problem as an LP model
    Solving shortest-path problems with LINGO
    The maximum-flow problem
    The minimum-spanning tree problem
    Minimum-spanning tree algorithm
    Exercises

A Solutions to exercises
    A.1 Chapter 1: Descriptive statistics
    A.2 Chapter 2: Probability concepts
    A.3 Chapter 3: Probability distributions
    A.4 Chapter 4: Estimation
    A.5 Chapter 5: Correlation and regression
    A.6 Chapter 6: Decision analysis
    A.7 Chapter 7: Project management
    A.8 Chapter 8: Network models

B Statistical tables
    B.1 The cumulative Poisson distribution
    B.2 The standard normal distribution
    B.3 Student's t-distribution
    B.4 The F-distribution
    B.5 The cumulative binomial distribution

C Bibliography

CHAPTER 1
Descriptive statistics

1.1 Introduction

No business can exist without the information given by numbers. Managing numbers is an important part of understanding and solving problems. Numbers provide a universal language that can easily be understood and supply a description of some aspects of most problems. The collection of numbers and other facts such as names, addresses, opinions etc. provides data. The data only becomes information when it informs the user. Statistics is about changing data to information by analysing the data.

Statistical analysis can be divided into two main branches, namely descriptive statistics and inferential statistics. Descriptive statistics deals with methods of organising, summarising and representing data in a convenient and informative way by means of tabulation, graphical representation and calculation of descriptive measures. Inferential statistics is a body of methods used to draw conclusions or inferences about characteristics of populations based on sample data by using the descriptive measures calculated.

In this chapter we discuss methods for collecting data and some descriptive statistics.

1.2 Data collection

Data can come from existing sources or may need to be collected. Technology makes it possible to collect huge amounts of data. For example, retailers collect point-of-sale data on products and customers, and credit agencies have all sorts of data on people who have or would like to obtain credit. Where data must be collected, it can come from a census, in which everybody or every item of interest is included, or from a sample drawn from the population of interest.

First we look at an example to clarify some definitions.

A tyre company, Radial, advertises that its XXX tyres, generally known as Triple X, will complete at least km before one of the four tyres will no longer meet the minimum safety requirements. Several complaints, however, have been received that the tyres completed only km before the minimum requirements were exceeded. Radial sells directly to the public and it is company policy to keep a record of each customer. During the two years that XXX tyres have been manufactured, sets have been sold. Radial feels that they just do not have the time, personnel or money to locate and question all of their customers. They feel that if they question 100 customers, they will get a good idea of the actual situation. In other words, they will take a sample of 100 from a population of

Consider the following definitions:

VARIABLE: Any property or characteristic that can be measured or observed is called a variable. A variable can take on a range of different values. For example, the distance completed on a set of tyres differs for each customer and the observations therefore vary continually. In Radial's case, distance completed is a variable.

SAMPLE UNIT: The sample unit is the item that is measured or counted with regard to the variable being studied. Radial's sample unit is a set of tyres measured for minimum safety requirements.

POPULATION: A population is the set of all the elements or items being studied. In Radial's case, the sets of XXX tyres that have been sold form the population.

SAMPLE: A sample is a representative group or a subset of the population. The 100 sets of tyres that Radial will investigate form the sample.

Note: What is very important is that the sample must always be representative of the population. It should be designed and administered in such a way as to minimise the chance of being biased (a biased sample outcome does not represent the population of interest). If the sample is likely to leave out certain people or there is a relatively high level of non-response, it would probably not be representative of the population.

1.3 Sampling methods

Simple random sampling

A good sample requires that every item in the population has an equal and independent chance of being included in the sample. A simple random sample of n elements is a sample that is chosen in such a way that every combination of n elements has an equal chance of being the sample selected.

One method of drawing a simple random sample is to allocate a number to each item in the population. A computer is used to generate a sequence of random numbers. These numbers are then used to identify items in the population to be included in the sample.

Example

Printapage, a printing company, has 30 clients with outstanding balances (in rand) as shown in Table 1.1.

    Account number    Balance

Table 1.1: Outstanding balances (in rand)

The following random numbers are available: 22; 17; 83; 57; 27; 54; 19; 51; 39; 59; 84 and 20. Use these numbers to draw a random sample of 5 from the 30 customer accounts.

Solution

Since the total number of elements in the population is 30, a random number larger than 30 is of no use. The sample units are the numbers of the accounts to be drawn. These are 22; 17; 27; 19 and 20. The corresponding outstanding balances are ; 102; 16; 429 and 197.
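The selection step in the Printapage solution can be sketched in a few lines of Python. This is only an illustration: the function name `draw_sample` is hypothetical, and it assumes that the accounts are numbered 1 to 30 and that a repeated random number would be skipped (sampling without replacement), neither of which the example states explicitly.

```python
def draw_sample(random_numbers, population_size, sample_size):
    """Return the first `sample_size` usable numbers from a random-number list.

    A number is usable if it identifies an item in the population
    (assumed here to be numbered 1..population_size) and has not
    already been drawn.
    """
    chosen = []
    for number in random_numbers:
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
        if len(chosen) == sample_size:
            break
    return chosen


# Random numbers given in the Printapage example.
random_numbers = [22, 17, 83, 57, 27, 54, 19, 51, 39, 59, 84, 20]
sample_accounts = draw_sample(random_numbers, population_size=30, sample_size=5)
print(sample_accounts)  # [22, 17, 27, 19, 20]
```

The numbers 83, 57, 54, 51, 39, 59 and 84 are discarded because they exceed the population size of 30.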

Example

A political candidate wishes to determine the opinions of the voters in his ward. He decides on a sample size of 20. Using random numbers, he chooses 20 telephone numbers from the directory for a telephonic survey. Is this procedure correct? Give a reason for your answer.

Solution

Not all residents will have telephones, and the numbers of those who do have telephones may not all be included in the telephone directory. Such a sample can therefore not be considered random.

Stratified random sampling

Simple random sampling requires no prior (a priori) knowledge of the population and can therefore be done with relatively little effort. It could, however, happen that all the elements drawn for the sample are nearly homogeneous or alike. This may cause the conclusions about the population to be biased. If, however, you have prior information about the population, you could rule out this problem to some degree and obtain more accurate information about the population by making use of stratified random sampling.

The population is divided into mutually exclusive sets or strata. This means that a specific element may only belong to one group or stratum. The strata must be chosen in such a way that there will be large differences between the strata, but small differences between the elements within the same stratum. Simple random samples are then taken from each stratum. The number of elements taken from each stratum is often proportional to the size of that stratum.

Example

Divide Printapage's 30 customers into three strata as follows:

    Stratum    Balance
    1          < 200
    2          200 - 600
    3          > 600

A proportional sample of size 12 must be drawn from the population. How would you do it?

Solution

Printapage's customers are divided into three strata, as shown in Table 1.2.

                 Stratum 1    Stratum 2        Stratum 3
    Balance      (< R200)     (R200 - R600)    (> R600)
    Frequency    15           10               5

Table 1.2: Dividing Printapage's data into 3 strata

To draw a proportional sample of size 12, the following number of items must be drawn from each stratum:

    $\frac{15}{30} \times 12 = 6$ elements from stratum 1,
    $\frac{10}{30} \times 12 = 4$ elements from stratum 2, and
    $\frac{5}{30} \times 12 = 2$ elements from stratum 3.

Lastly, a simple random sample as described in the previous section is drawn from each stratum.

Systematic sampling

Systematic sampling starts at a randomly selected starting point in the population. Each subsequent k-th element is then chosen.

Example

A political candidate wishes to determine the opinions of the voters in his ward. He has a list of voters available. One way of obtaining a systematic sample would be to start with voter number 6 and then select every tenth voter to complete a questionnaire. What are the advantages and disadvantages of such a method?

Solution

Advantage(s): Systematic sampling is convenient, especially when the size of the population is not known.

Disadvantage(s): If the variable being considered is periodic in nature, systematic sampling could produce misleading results. For example, if we were to estimate a shop's sales using a 1-in-7 systematic sampling design, it could happen that only sales figures for Saturdays were selected. Sales would then be overestimated.

1.4 Presentation of data

Once data has been collected, either by you or by someone else, the initial task is to obtain some overall impression of the findings. This can be done by visually representing the data using frequency tables, charts and diagrams. But before we try to visualise the data, let us first consider the different types of data one might get.

Types of data

There are two main groupings of data: qualitative and quantitative. Qualitative data is characterised by categorical answers such as yes or no, male or female, etc. Quantitative data is characterised by numerical values. Quantitative data can further be divided into two groups, discrete data and continuous data.

Discrete data include everything that can be considered as a separate unit because of its nature, for example, number of units sold, number of consumers, number of job opportunities, etc., that is, everything that you can count on your fingers.

Continuous data are usually the result of a measurement and do not consist of fixed, isolated points. There can be a whole range of values between any two values. Examples are length, mass, time and temperature measurements.

Example

Classify the data in each of the following questions:

(a) Do you own a TV set? Yes / No
(b) How many TV sets do you own?
(c) How many kilometres did you drive on your set of Radial tyres?
(d) What was your electricity bill last month?

Solution

(a) Qualitative
(b) Quantitative discrete
(c) Quantitative continuous
(d) Quantitative continuous

Let's look at Radial's data again. Radial, the company introduced in Section 1.2, has taken a sample of 100 and is happy that it is representative. The sample elements (in thousands) are shown in Table 1.3.

Table 1.3: Sample elements for Radial

We identified Radial's data as quantitative and continuous. Perhaps if we could picture the data, we would be able to form a better idea of what is going on.

The frequency table and histogram

The histogram is one of the most common ways of visually representing data. It is a graphical representation of a frequency table. A frequency table is a table in which the data are grouped into intervals. To draw a histogram we must first set up a frequency table. The steps needed to set up a frequency table are as follows:

Step 1 Find the range (R) of the data, where R = maximum value of data set - minimum value of data set.

Step 2 Decide on the number of intervals. If too few or too many intervals are used, one cannot get a good idea of the distribution of the data. It is not always easy to decide how many intervals to use. $\frac{R}{10}$ is a good number if R is large, but any number between 5 and 8 is acceptable. Do not use fewer than 5.

Step 3 Determine the width of the intervals as $\frac{R}{\text{number of intervals}}$. The width must be a whole number; this will make it easier to determine the limits of the intervals.

Step 4 Determine the interval limits. The limits should be such that there is no doubt into which interval a value falls. For example, when we are working with Radial's data, we cannot choose intervals such as 55 - 65 and 65 - 75. Why not? Well, where would you place a value of 65? For the mathematical manipulations that we will be doing with grouped data, it is also necessary that we do not work with intervals such as "55 to just smaller than 65" and "65 to just smaller than 75". What does "just smaller than 65" mean?

The rule that we will use is to take the lower limit of the first interval as half a unit less than the minimum value, so that there can be no confusion as to which interval a value belongs. The lower limit of the first class must be a value which is smaller than the minimum data value, and the upper limit of each interval is the same as the lower limit of the succeeding interval.

Step 5 Tabulate the data.

Example

(a) Set up a frequency table for Radial's data shown in Table 1.3.
(b) What percentage of customers were able to do km or more on a set of tyres?
(c) What percentage of customers were able to do km or less on a set of tyres?
(d) Draw a histogram of the data using the frequency table.

Solution

(a) Set up the frequency table following these steps:

Step 1 The range of the data: The minimum value is 14 and the maximum value is 98. The range is therefore R = 98 - 14 = 84.

Step 2 The number of intervals: The number of intervals is calculated as $\frac{R}{10} = 8{,}4$. Therefore, use 8 intervals.

Step 3 The interval width: The interval width is calculated as $\frac{R}{8} = \frac{84}{8} = 10{,}5$. Therefore, use a width of 11.

Step 4 The interval limits: The interval limits are determined as follows: The minimum value is 14, so the lower limit of the first interval will start at half a unit less than the minimum, which is 13,5. The upper limit of the first interval is determined by adding the width to the lower limit, that is 13,5 + 11 = 24,5. The first interval is therefore 13,5 - 24,5. The second interval starts at 24,5 and also has a width of 11. Its upper limit is therefore 24,5 + 11 = 35,5. The last interval starts at 90,5 and its upper limit is 101,5, which is well above the largest element in the sample. The intervals are:

    13,5 - 24,5    (width = 11)
    24,5 - 35,5
    35,5 - 46,5
    46,5 - 57,5
    57,5 - 68,5
    68,5 - 79,5
    79,5 - 90,5
    90,5 - 101,5

Step 5 Tabulate the data: The only remaining thing to do is to group the data into the intervals. Now go back to the data set and consider the first four sample elements in the first row, which are 61, 38, 19 and 58. Our aim is to find in which one of the following intervals they belong:

    Interval
    13,5 - 24,5
    24,5 - 35,5
    35,5 - 46,5
    46,5 - 57,5
    57,5 - 68,5
    68,5 - 79,5
    79,5 - 90,5
    90,5 - 101,5

19 fits in the first interval because it is greater than 13,5 and less than 24,5.
38 fits in the third interval because it is greater than 35,5 and less than 46,5.
61 and 58 fit in the fifth interval because they are greater than 57,5 and less than 68,5.

Instead of writing 19; 38; 61 and 58 in their corresponding intervals, we represent each of them with a line, |, as follows:

    Interval        Tally
    13,5 - 24,5     |
    24,5 - 35,5
    35,5 - 46,5     |
    46,5 - 57,5
    57,5 - 68,5     ||
    68,5 - 79,5
    79,5 - 90,5
    90,5 - 101,5

The fifth element in a group of lines is indicated by a line drawn across the group, so that four lines with a fifth drawn across them represent a group of five.

Note: The total number of elements falling into an interval is called the frequency.

The complete frequency table for Radial is given in Table 1.4.

    Interval        Frequency
    13,5 - 24,5         7
    24,5 - 35,5         6
    35,5 - 46,5        13
    46,5 - 57,5        13
    57,5 - 68,5        30
    68,5 - 79,5        22
    79,5 - 90,5         8
    90,5 - 101,5        1
    Total             100

Table 1.4: The frequency table for Radial

It is clear that the highest frequency occurs in the interval 57,5 - 68,5. This shows that most of the customers were able to do between 57,5 and 68,5 thousand kilometres on a set of tyres.
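Steps 1 to 5 for Radial's data can be sketched in Python as shown below. This is a minimal illustration, not part of the study guide: the function names are hypothetical, rounding R/10 down to get the number of intervals and rounding the width up to a whole number are assumptions that happen to reproduce the choices made in the solution (8 intervals of width 11), and Python writes decimals with a point where the guide uses a comma.

```python
import math


def interval_limits(minimum, maximum, half_unit=0.5):
    """Return the list of interval limits for a frequency table."""
    data_range = maximum - minimum                # Step 1: R = 98 - 14 = 84
    n_intervals = math.floor(data_range / 10)     # Step 2: R/10 = 8.4, use 8 (assumed: round down)
    width = math.ceil(data_range / n_intervals)   # Step 3: 84/8 = 10.5, use 11 (assumed: round up)
    lower = minimum - half_unit                   # Step 4: start half a unit below the minimum
    return [lower + i * width for i in range(n_intervals + 1)]


def tally(values, limits):
    """Step 5: count how many values fall strictly between each pair of limits."""
    counts = [0] * (len(limits) - 1)
    for value in values:
        for i in range(len(counts)):
            if limits[i] < value < limits[i + 1]:
                counts[i] += 1
                break
    return counts


limits = interval_limits(minimum=14, maximum=98)
print(limits)   # [13.5, 24.5, 35.5, 46.5, 57.5, 68.5, 79.5, 90.5, 101.5]

# The first four sample elements discussed in the text.
counts = tally([61, 38, 19, 58], limits)
print(counts)   # [1, 0, 1, 0, 2, 0, 0, 0]
```

Because every limit ends in ,5 and the data are whole numbers, no value can coincide with a limit, which is exactly why the half-unit rule removes any doubt about where a value belongs.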

(b) The intervals 79,5 - 90,5 and 90,5 - 101,5 represent the number of customers who were able to drive km or more on a set of tyres. The total number is thus the sum of the frequencies in these intervals, namely 8 + 1 = 9. The percentage of customers who were able to do km or more on a set of tyres is $\frac{9}{100} \times 100\% = 9\%$.

Note: The fraction $\frac{9}{100}$ is called the relative frequency.

(c) The first three intervals account for the customers who were able to drive km or less. The sum of the frequencies in these intervals, namely 7 + 6 + 13 = 26, is thus equal to the total number of these customers. The percentage of customers who were only able to do km or less on a set of tyres is $\frac{26}{100} \times 100\% = 26\%$.

(d) Now we can graphically represent the frequency table by drawing the interval lengths on a horizontal axis and the frequencies on a vertical axis. This is called a histogram. The histogram for Radial is given in Figure 1.1.

Figure 1.1: Histogram for Radial (horizontal axis: Distance, marked at the interval limits 13,5 to 101,5; vertical axis: Frequency)

(Notice that the horizontal axis starts at 0 and that the zigzag line is there to break the line in order to prevent a huge space from appearing on the left of the actual graph.)

The pie chart

Another way of representing data is by means of a pie chart. A pie chart is drawn as a circle and the slices of the circle represent the relative frequencies expressed as a percentage. It is often difficult to draw a pie chart by hand. In Radial's case one needs to divide the circle into 100 equal slices, which is not an easy task! The pie chart for Radial is more or less as shown in Figure 1.2.

Figure 1.2: Pie chart representing Radial's data

The cumulative frequency polygon

We calculated in the example above that 26% of the customers were only able to do km or less on a set of tyres. Such information can be presented graphically if we first obtain the cumulative "less than" table. Such a table is set up from the frequency table by stating each upper limit as "less than ...". The cumulative frequency table for Radial is given in Table 1.5.

    Upper limit    Frequency    Cumulative frequency
    < 24,5             7          7
    < 35,5             6         13   (7+6=13)
    < 46,5            13         26   (7+6+13=26)
    < 57,5            13         39   (7+6+13+13=39)
    < 68,5            30         69   (7+6+13+13+30=69)
    < 79,5            22         91   etc.
    < 90,5             8         99
    < 101,5            1        100

Table 1.5: The cumulative frequency table for Radial

You have probably realised that cumulative means added up. This information can now be represented by a cumulative frequency polygon as shown in Figure 1.3.
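The cumulative "less than" table, and the percentages found in parts (b) and (c) of the example, can be reproduced with a short Python sketch. The variable and function names are illustrative only. The frequencies are those of Table 1.4, where the value 22 for the interval 68,5 - 79,5 is the one that makes the frequencies add up to the stated total of 100.

```python
from itertools import accumulate

# Upper interval limits and frequencies from Radial's frequency table.
upper_limits = [24.5, 35.5, 46.5, 57.5, 68.5, 79.5, 90.5, 101.5]
frequencies = [7, 6, 13, 13, 30, 22, 8, 1]

# Running totals: each entry is the number of observations below that upper limit.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]


def percent_below(limit):
    """Percentage of observations that lie below the given upper limit."""
    return 100 * cumulative[upper_limits.index(limit)] / total


print(cumulative)                  # [7, 13, 26, 39, 69, 91, 99, 100]
print(percent_below(46.5))         # 26.0  (part (c): first three intervals)
print(100 - percent_below(79.5))   # 9.0   (part (b): last two intervals)
```

Plotting `cumulative` against `upper_limits` and joining the points with straight lines gives the cumulative frequency polygon.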

Figure 1.3: The cumulative frequency polygon for Radial (horizontal axis: Distance, marked at the upper limits 24,5 to 101,5; vertical axis: Cumulative frequency)

The stem-and-leaf diagram

The stem-and-leaf diagram is also a useful diagram and is easy to set up. The first step is to decide how to separate each observation into two parts: the stem and the leaf. Radial's data can be separated in such a way that the first digit of each number is the stem and the second digit is the leaf.

First we determine the biggest and smallest numbers in the data set and separate them into a stem and a leaf. The smallest number in the data, 14, has stem 1 and leaf 4. The largest number, 98, has stem 9 and leaf 8.

Next we fill in the rest of the data. All the other numbers lie between these two. We can therefore set up the stem from 1 to 9, with the second digit of each number being written next to its stem, as shown in Table 1.6.

    Stem    Leaf    Frequency

Table 1.6: Stem-and-leaf diagram for Radial (unsorted)

But the leaves of every stem MUST be in ascending order (from the smallest value to the largest). Radial's sorted stem-and-leaf diagram is given in Table 1.7.

    Stem    Leaf    Frequency

Table 1.7: Stem-and-leaf diagram for Radial

Now turn the page on its side. It is easy to see that most of the customers drove between and thousand kilometres on a set of tyres.

1.5 Descriptive measures

The presentation of charts and diagrams can be regarded as the first step in analysing data and is not sufficient for most purposes. They provide an overall picture of the data but give only an approximate indication of specific properties such as midpoint and spread of data. Proper analysis requires a summary of the data in the form of descriptive statistical measures. Descriptive measures are single numerical values that indicate the shape or distribution of the data set. There are descriptive measures of location, spread, symmetry and kurtosis.

Measures of locality

A measure of location or position gives an indication of the midpoint or general size of the distribution. Examples are the mean, median and mode of a data set.

The mean

Radial advertises that its XXX tyres will last for at least km before one of the four tyres will no longer meet the minimum safety requirements. What is the mean number of kilometres that can be driven on a set of XXX tyres?

Radial has only the sample of 100 observations available for estimating the mean. If we consider the sample as being representative of the population, we can use the sample mean as an estimator of the population mean. To obtain the sample mean we add up all the observations and divide the result by the number of observations. (This is called the arithmetic mean.)

If we add up all the observations in Radial's sample and divide the sum by the number of observations, we get $\frac{5\,828}{100} = 58{,}28$.

We can therefore expect a set of tyres to last 58,28 × 1 000 = 58 280 km on average. The formula for the mean is

    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

where

    $\bar{x}$ (read as x-bar) is the generally accepted symbol for the arithmetic mean,
    $n$ is the number of observations,
    $\Sigma$ is the Greek letter for S and means sum,
    $x_i$ represents the i-th observation, and
    $\sum_{i=1}^{n} x_i$ is just another way of writing $x_1 + x_2 + \cdots + x_n$.

Note: You may enter the data into the statistics mode of your calculator and find the value of the mean by pressing a button. This is much faster than doing the calculation by hand. See the manual of your calculator.

When the raw data, that is the values in the original data set, are available, it is easy to calculate the mean. Sometimes, however, the data is given in the form of a frequency table and the actual values are not known. Let's look at Radial's data again. Assume that Radial's sample data is available in the following form only:

    (Distance in 1 000 km)
    Interval        Frequency ($f_i$)
    13,5 - 24,5         7
    24,5 - 35,5         6
    35,5 - 46,5        13
    46,5 - 57,5        13
    57,5 - 68,5        30
    68,5 - 79,5        22
    79,5 - 90,5         8
    90,5 - 101,5        1
    Total             100

We do not know what the actual values in each interval are. For computational purposes we make the following assumption: All values in an interval are equal to the middle value of the interval.

The middle value is calculated by adding the lower and the upper limits of the interval and dividing the result by two. The middle value of the first interval is:

    $\frac{13{,}5 + 24{,}5}{2} = 19$.

We thus assume that all the observations in the first interval are equal to 19. The contribution of these seven observations to the grand total is therefore 7 × 19 = 133.

The formula for the mean of a frequency distribution is

    $\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$

where

    $f_i$ = the frequency for the i-th interval,
    $x_i$ = the middle value of the i-th interval, and
    $k$ = number of intervals.

Note: Data in a frequency table are often referred to as grouped data.

Example

Calculate the mean distance travelled on a set of XXX tyres using the frequency distribution in Table 1.4.

Solution

In Table 1.8 the frequency, $f_i$, the middle value, $x_i$, and the product of these are given for each interval.

    Interval        $f_i$    $x_i$    $f_i x_i$
    13,5 - 24,5       7       19        133
    24,5 - 35,5       6       30        180
    35,5 - 46,5      13       41        533
    46,5 - 57,5      13       52        676
    57,5 - 68,5      30       63      1 890
    68,5 - 79,5      22       74      1 628
    79,5 - 90,5       8       85        680
    90,5 - 101,5      1       96         96
    Total           100               5 816

Table 1.8: Calculating the mean from the frequency table

The mean is calculated as

    $\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{5\,816}{100} = 58{,}16$.

The mean distance is therefore 58,16 × 1 000 = 58 160 km.

Note: The intervals of the frequency distribution are all of equal width, that is 11. After you have calculated the middle value of the first interval, the successive middle values can be obtained by adding 11 to the previous middle value.
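The grouped-mean calculation of Table 1.8 can be checked with a short Python sketch. The variable names are illustrative only, the interval limits and frequencies are those of Radial's frequency table, and Python uses a decimal point where the guide uses a comma.

```python
# Interval limits and frequencies from Radial's frequency table.
limits = [13.5, 24.5, 35.5, 46.5, 57.5, 68.5, 79.5, 90.5, 101.5]
frequencies = [7, 6, 13, 13, 30, 22, 8, 1]

# Middle value of each interval: (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in zip(limits, limits[1:])]

# Each interval contributes frequency * middle value to the grand total.
products = [f * x for f, x in zip(frequencies, midpoints)]

grouped_mean = sum(products) / sum(frequencies)

print(midpoints)      # [19.0, 30.0, 41.0, 52.0, 63.0, 74.0, 85.0, 96.0]
print(sum(products))  # 5816.0
print(grouped_mean)   # 58.16
```

The result, 58,16 thousand kilometres, is close to but not identical to the mean of 58,28 obtained from the raw data, because every observation in an interval is treated as if it were equal to the middle value.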

The calculation based on the above assumption results in a total that is different from the actual total. A mean calculated in this way will therefore generally differ from the actual mean. For example, if the observations in the interval 79,5–90,5 are

88; 80; 83; 82; 90; 86; 86; 80,

their sum is 675. If we instead use the middle value and the frequency of the interval to calculate the contribution of these observations to the total, we get 680.

Interval       Frequency ($f_i$)    Middle value ($x_i$)    $f_i x_i$
79,5–90,5      8                    85                      8 × 85 = 680

In our example the mean obtained using the original data values is … km, while the mean obtained using the frequency distribution is 58 160 km.

The mean is the measure of locality that is used most often. It can, however, be misleading. For example, if we calculate the mean of 2; 3; 5; 71, we see that the mean is

\[ \bar{x} = \frac{2 + 3 + 5 + 71}{4} = \frac{81}{4} = 20,25. \]

When a data set has a mean of about 20, one intuitively expects most of the values to lie in the vicinity of 20. In this instance, however, most of the values are less than 6, while one value is an outlier of 71! The mean is rather sensitive to outliers. On its own, without any additional information, it can often lead to incorrect conclusions.

Another disadvantage of the mean is that it is difficult to calculate for an open frequency distribution, as the following example will illustrate.

Example
Table 1.9 gives the property value distribution for ratepayers in the Steelcity Metropolitan Substructure.

Property value (in Rands)    Frequency (in thousands)
Less than …                  …
…                            …
More than …                  …

Table 1.9: Property value distribution

What assumptions will we have to make about the middle values of the first and the last interval? What is an acceptable lower limit for the first interval? What is an acceptable upper limit for the last interval? If we have access to the original data, we may be able to make a good guess, but this is not always possible. The use of the mean is therefore restricted when there are open distributions.

A big advantage of the arithmetic mean is that it uses all the available data. Later we will see that this is not the case for the other measures of locality. Since the mean can be calculated exactly, it forms the basis for many advanced analyses and is not only descriptive in nature.

The median

Since the mean is sensitive to extreme values (outliers), and may often result in misleading conclusions, the median is often preferred as a measure of locality.

The median is the value that divides an ordered data set into two equal parts. If the data set is sorted in ascending order, 50% of the data values will lie below, or to the left of, the median, and 50% will lie above, or to the right of, the median.

The median is determined as follows: If a data set of size $n$ is sorted in ascending sequence, then the median (me) is the $\frac{n+1}{2}$-th value of the data set.

Example
Determine the median of the following data sets:
(a) 6, 9, 12, 12, 13, 15, 18, 24, 27
(b) 2, 3, 5, 71

Solution
(a) The data set is arranged in ascending order and $n = 9$. The median (me) is the $\frac{9+1}{2} = 5$-th value, that is the median is 13.
(b) The data set is arranged in ascending order and $n = 4$. The median (me) is the $\frac{4+1}{2} = 2\frac{1}{2}$-th value. The $2\frac{1}{2}$-th value is a value between the second and third values, that is halfway between 3 and 5, or $\frac{3+5}{2} = 4$. The median is therefore 4 and we can say that 50% of the data lie to the left of 4 and 50% to the right of 4.

The median may also be calculated for a frequency table. The cumulative frequency table is used to identify the median interval.
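The position rule for the median of raw data can be sketched in Python using the two data sets from the example. The function name `median_by_position` is an assumption. The last line repeats the mean of data set (b) to show how differently the two measures react to the outlier 71.

```python
from statistics import mean


def median_by_position(values):
    """Median as the (n + 1) / 2-th value of the sorted data (1-based position)."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2
    lower = ordered[int(position) - 1]   # value at the whole-number position
    if position.is_integer():
        return lower
    upper = ordered[int(position)]       # next value, needed for a "half" position
    return (lower + upper) / 2           # halfway between the two neighbours


set_a = [6, 9, 12, 12, 13, 15, 18, 24, 27]
set_b = [2, 3, 5, 71]

print(median_by_position(set_a))  # 13
print(median_by_position(set_b))  # 4.0
print(mean(set_b))                # 20.25
```

The standard library function `statistics.median` returns the same values for these two data sets.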

Let's use Radial's data once again, as shown in Table 1.10.

Interval       Frequency ($f_i$)    Cumulative frequency
13,5–24,5      7                    7
24,5–35,5      6                    13
35,5–46,5      …                    …
46,5–57,5      …                    …
57,5–68,5      …                    …
68,5–79,5      …                    …
79,5–90,5      8                    …
90,5–101,5     …                    100

Table 1.10: Cumulative frequency table for Radial

To identify the median interval, find the interval within which the $\frac{n+1}{2} = \frac{100+1}{2} = 50,5$th value occurs. The median interval is therefore 57,5–68,5.

The biggest advantages of the median are that open intervals pose no problem and that it is not affected by extreme values. However, it ignores the largest part of the data and cannot be manipulated mathematically.

The mode

The mode of a data set is the value that occurs most often. Consider the following example: A survey was conducted amongst couples who married eight years ago. Table 1.11 gives the frequency table of the data obtained.

Number of children per couple ($x_i$)    Number of couples ($f_i$)
0                                        …  (highest frequency)
…                                        …

Table 1.11: Frequency table for number of children per couple

The value with the highest frequency is zero: more couples have no children than any other number of children. The mode is therefore zero.

The mode, however, is not a good measure of locality. Sometimes there is no value that occurs more often than any other value, or there is more than one value with the same maximum number of occurrences. In addition, the mode has the same drawbacks as the median. The only thing in favour of the mode is that it is easy to understand.

For grouped data the modal interval is the interval with the highest frequency.
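Both ideas can be sketched in Python. The function name `median_interval_index` and the three frequencies used to try it out are assumptions made for illustration; they are not Radial's data. When the (n + 1)/2-th position falls exactly between two intervals, this sketch simply returns the later one. The mode examples reuse the two data sets from the median example; `statistics.multimode` needs Python 3.8 or later.

```python
from itertools import accumulate
from statistics import multimode


def median_interval_index(frequencies):
    """Index of the first interval whose cumulative frequency reaches (n + 1) / 2."""
    n = sum(frequencies)
    position = (n + 1) / 2
    for index, cumulative in enumerate(accumulate(frequencies)):
        if cumulative >= position:
            return index


# Hypothetical frequencies (NOT Radial's data): n = 10, so the position is 5.5.
# The cumulative frequencies are 2, 7, 10, so the second interval is returned.
example_index = median_interval_index([2, 5, 3])
print(example_index)  # 1

# One clear mode: the value 12 occurs twice.
modes_a = multimode([6, 9, 12, 12, 13, 15, 18, 24, 27])
print(modes_a)  # [12]

# No value occurs more often than any other, so every value is returned.
modes_b = multimode([2, 3, 5, 71])
print(modes_b)  # [2, 3, 5, 71]
```

The second call to `multimode` shows the weakness mentioned above: when all values tie, the mode says nothing useful about where the data are centred.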

Measures of dispersion

The variance of a data set

In the previous section, we considered the problems that occur when working with a measure of locality only. Even though the arithmetic mean uses all the values in a data set, it does not give much information about what the data set really looks like. We also need information on the spread of the data around the mean.

Consider the data set 5; 3; 8; 4; 1; 5; 0; 6 with a mean of $\bar{x} = \frac{32}{8} = 4$. Let's plot the data points around the mean:

[Figure: the data points plotted on a number line around the mean $\bar{x} = 4$.]

Now calculate the distance from $\bar{x} = 4$ to each value and then calculate the mean of these distances:

1 + 1 + 4 + 0 + 3 + 1 + 4 + 2 = 16 and $\frac{16}{8} = 2$.

When we work with Radial's sample it may be quite difficult to calculate the mean distance. How long do you think it would take using 100 values?

An alternative is to calculate the deviation from the mean for each observation, that is $x - \bar{x}$:

1; −1; 4; 0; −3; 1; −4; 2.

When these are added, you get 0, which tells you nothing! However, the number crunchers of the old days did not become discouraged, and came up with the clever idea of taking $(x - \bar{x})$ and squaring it. The square of any value is never negative, so the squared deviations cannot cancel each other out.

The mean of the squared deviations is called the variance. The positive square root of the variance is called the standard deviation, and we will use this measurement to give an indication of the spread of data around the mean.

The variance of a sample is defined as

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}. \]

Notice that we divide by $n - 1$ and not by $n$. The reason for this is that the sample variance ($s^2$) is used to estimate the population variance ($\sigma^2$). If we were to divide by $n$, it would give an

underestimation of the population variance. Division by $n - 1$ therefore gives a better estimator. Calculators can normally calculate both $s^2$ and $\sigma^2$. Make sure which one you should use on your calculator.

The squared deviations of our sample are calculated in Table 1.12.

$x_i$    $(x_i - \bar{x})$    $(x_i - \bar{x})^2$
5        1                    1
3        −1                   1
8        4                    16
4        0                    0
1        −3                   9
5        1                    1
0        −4                   16
6        2                    4
Sum      0                    48

Table 1.12: Calculations to find the variance and the standard deviation

The variance is $s^2 = \frac{48}{8 - 1} = 6,86$ and the standard deviation is $s = \sqrt{6,86} = 2,62$.

An alternative formula for computing the variance is

\[ s^2 = \frac{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}{n(n - 1)}. \]

Note: You may enter the data into the statistics mode of your calculator and find the value of the standard deviation by pressing a button. This is much faster than doing the calculation by hand. See the manual of your calculator.

The variance can also be calculated for grouped data. For a frequency table the sample variance is defined as follows:

\[ s^2 = \frac{\sum_{i=1}^{k} f_i x_i^2 - n \bar{x}^2}{n - 1} \]

where

\[ \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i} = \frac{\sum_{i=1}^{k} f_i x_i}{n}, \]

$x_i$ = middle value of the i-th interval, and

\[ n = \sum_{i=1}^{k} f_i. \]

Let's look at Radial's data again. The variance of Radial's frequency table is calculated in Table 1.13. (Remember that we have already calculated the mean as $\bar{x} = 58,16$.)

The variance is

\[ s^2 = \frac{\sum_{i=1}^{k} f_i x_i^2 - n \bar{x}^2}{n - 1} = \frac{33\,231,44}{99} = 335,67. \]
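Returning to the eight-value data set 5; 3; 8; 4; 1; 5; 0; 6, the two formulas for the sample variance can be checked against each other in Python. This is a minimal sketch with assumed variable names; `statistics.variance` and `statistics.stdev` from the standard library also divide by n − 1.

```python
import math
from statistics import stdev, variance

data = [5, 3, 8, 4, 1, 5, 0, 6]
n = len(data)
x_bar = sum(data) / n                                # 32 / 8 = 4.0

deviations = [x - x_bar for x in data]               # these add up to 0
mean_distance = sum(abs(d) for d in deviations) / n  # 16 / 8 = 2.0

# Definition: sum of squared deviations divided by n - 1.
s2_definition = sum(d ** 2 for d in deviations) / (n - 1)   # 48 / 7

# Alternative formula: [n * sum(x^2) - (sum x)^2] / [n * (n - 1)].
s2_alternative = (n * sum(x * x for x in data) - sum(data) ** 2) / (n * (n - 1))

s = math.sqrt(s2_definition)

assert math.isclose(s2_definition, s2_alternative)
assert math.isclose(s2_definition, variance(data))
assert math.isclose(s, stdev(data))

print(round(s2_definition, 2))  # 6.86
print(round(s, 2))              # 2.62
```

Dividing by n instead of n − 1 would give 48/8 = 6, which is the smaller value that `statistics.pvariance` returns for the same data.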

Interval       $f_i$    $x_i$    $f_i x_i^2$
13,5–24,5      7        19       2 527
24,5–35,5      6        30       5 400
35,5–46,5      …        41       …
46,5–57,5      …        52       …
57,5–68,5      …        63       …
68,5–79,5      …        74       …
79,5–90,5      8        85       57 800
90,5–101,5     …        96       …

Table 1.13: Calculations for the variance of Radial's frequency table

The standard deviation of a data set

The standard deviation is defined as the square root of the variance. The standard deviation of a sample is defined as

\[ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}. \]

But what does this tell us? It tells us how far away the observations are from the mean. The larger the standard deviation, the further away the data points are from the mean.

The following schematic representation shows how many of the data points lie between one standard deviation to the left and to the right of the mean, and between two standard deviations to the left and to the right of the mean. These percentages apply to data that are roughly bell-shaped.

[Figure: a distribution centred on $\bar{x}$, with about 68% of the data points between $\bar{x} - s$ and $\bar{x} + s$, and about 95% between $\bar{x} - 2s$ and $\bar{x} + 2s$.]

The standard deviation plays an important role in inferential statistics, that is, the field concerned with drawing scientifically based conclusions about populations from sample data.

The quartile deviation

The median is that value which separates a sorted data set into two equal parts.

[Figure: a sorted data set divided by the median (me) into two parts of 50% each.]

If we divide a sorted data set into four equal parts we get the following three quartiles:

[Figure: a sorted data set divided into four parts of 25% each by $q_1$, $q_2$ (= me) and $q_3$.]

$q_1$ represents the value that indicates the end of the first 25% of the data values;
$q_2$ represents the value that indicates the end of the second 25% (or the value which divides the data set into two equal parts, that is the median); and
$q_3$ is the value indicating the end of the third 25%.

The middle 50% of the data lies between $q_1$ and $q_3$. The quartile deviation is

\[ q_D = \frac{q_3 - q_1}{2} \]

and is the measurement of the dispersion of the data around the median. As with the median, the quartile deviation does not use all the observations. It ignores outliers since the top 25% and the bottom 25% of the data values are not taken into account.

Example
The purchasing manager of a group of clothing shops has recorded the following 15 observations on the number of days that pass between reordering items from a new range of children's clothing.

Reordering intervals (in days): …

Calculate and interpret the quartile deviation of the reordering intervals.

Solution
The value of the median or $q_2$ is the value of the $\frac{1}{2}(n+1)$th observation in a ranked data set. Similarly, the value of $q_1$ is the value of the $\frac{1}{4}(n+1)$th observation and the value of $q_3$ is the value of the $\frac{3}{4}(n+1)$th observation in an ordered data set. (An ordered data set is a data set that is arranged in ascending order.)

The ordered data set is:

5; 12; 15; 17; 17; 18; 18; 22; 22; 23; 23; 26; 26; 28; 29.
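The position rules for the quartiles can be sketched in Python for this ordered data set. The helper name `value_at_position` is an assumption, and so is its linear interpolation for fractional positions. With n = 15 the three positions are the whole numbers 4, 8 and 12, so no interpolation is needed here.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position; fractional positions are interpolated linearly."""
    whole = int(position)
    fraction = position - whole
    value = ordered[whole - 1]
    if fraction:
        # Assumed rule: move the same fraction of the way to the next value.
        value += fraction * (ordered[whole] - ordered[whole - 1])
    return value


ordered = [5, 12, 15, 17, 17, 18, 18, 22, 22, 23, 23, 26, 26, 28, 29]
n = len(ordered)                                  # 15

q1 = value_at_position(ordered, (n + 1) / 4)      # 4th value
q2 = value_at_position(ordered, (n + 1) / 2)      # 8th value, the median
q3 = value_at_position(ordered, 3 * (n + 1) / 4)  # 12th value
quartile_deviation = (q3 - q1) / 2

print(q1, q2, q3)          # 17 22 26
print(quartile_deviation)  # 4.5
```

Under these rules the middle 50% of the reordering intervals lies between 17 and 26 days, which gives a quartile deviation of 4,5 days.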
