
Chapter 3 Descriptive Statistics

Chapter 2 presented graphical techniques for organizing and displaying data. Even though such graphical techniques allow the researcher to make some general observations about the shape and spread of the data, a more complete understanding of the data can be attained by summarizing them using statistics. This chapter presents such statistical measures, including measures of central tendency, measures of variability, and measures of shape. The computation of these measures is different for ungrouped and grouped data.

3.1 Measures of Central Tendency: Ungrouped Data

One type of measure that is used to describe a set of data is the measure of central tendency. Measures of central tendency yield information about the centre, or middle part, of a group of numbers. Measures of central tendency do not focus on the span of the data set or how far values are from the middle numbers. The measures of central tendency presented here for ungrouped data are the mode, the median, the mean, percentiles, and quartiles.

Mode, Median, and Mean

The mode is the most frequently occurring value in a set of data. The median is the middle value in an ordered array of numbers. For an array with an odd number of terms, the median is the middle number. For an array with an even number of terms, the median is the average of the two middle numbers. The mean is the average of a group of numbers and is computed by summing all numbers and dividing by the number of numbers.

Demonstration Problem 3.1

Shown below is a list of the 11 largest motor vehicle producers in the world and the number of vehicles produced by each in 2009.

Auto Manufacturer        Production (millions)
Toyota Motor Corp.       7.2
General Motors           6.5
Volkswagen Group         6.1
Ford Motor Co.           4.7
Hyundai                  4.6
PSA Peugeot Citroën      3.0
Honda                    3.0
Nissan                   2.7
Fiat                     2.5
Suzuki                   2.4
Renault                  2.3

1. Input the data into Excel. Save as Demo_3-1.

2. Click on the Data tab and Data Analysis (if you don't see this option, refer to Chapter 2 to install the Analysis ToolPak).
3. Select Descriptive Statistics and then select the data for the Input Range box. Select a cell to the right of the data for the Output Range and select Summary statistics. If you input and selected the label, select Labels in first row (make sure the label is in the cell directly above the data).
4. Widen the column with the text to see all of the text (click and drag or double-click the line between the column letters).
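As a cross-check outside Excel, the mean, median, and mode of the vehicle production figures can be recomputed in a few lines of Python. This is only a sketch using the standard library's statistics module; the variable names are illustrative and are not part of the Excel workflow.

```python
from statistics import mean, median, multimode

# Production in millions of vehicles (Demonstration Problem 3.1)
production = [7.2, 6.5, 6.1, 4.7, 4.6, 3.0, 3.0, 2.7, 2.5, 2.4, 2.3]

avg = mean(production)         # sum of the values divided by their count
mid = median(production)       # middle value of the sorted data (11 values)
modes = multimode(production)  # list of the most frequent value(s)

print(f"mean = {avg:.2f}, median = {mid}, mode(s) = {modes}")
```

multimode is used rather than mode so that a data set with several equally frequent values returns all of them instead of a single arbitrary one.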

Remarks

The mean uses all the data, and each data item influences the mean. This is an advantage, but it is also a disadvantage because extremely large or small values can pull the mean toward the extreme value. In this data set, the mean is 4.09 and the median is 3, showing that the large values pull the mean upward, whereas a typical value is closer to 3. The mode, or most common value, is also 3.

Percentiles

Percentiles are measures of central tendency that divide a group of data into 100 parts. There are 99 percentiles because it takes 99 dividers to separate a group of data into 100 parts. Let's use a small data set to find specific percentiles using an Excel function.

Demonstration Problem 3.2

1. Input the following data into Excel in a column: 14, 12, 19, 23, 5, 13, 28, 17.
2. Click in a cell to the right of the data and input the function =PERCENTILE(select range of data, 0.3). The 30th percentile is represented by 0.3.
3. The answer is 13.1, which rounds to 13 as a whole number. A percentile may or may not be one of the data values.

Note: The Rank and Percentile feature of the Data Analysis tool of Excel has the capability of ordering the data, assigning ranks to the data, and yielding the percentiles of the data. To access this command, click on Data Analysis and select Rank and Percentile from the menu. In the Rank and Percentile dialogue box, enter the location of the data to be analysed in Input Range.

[Figure: Rank and Percentile output for this data set]
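The 13.1 result is not one of the data values because the percentile is interpolated between neighbouring values. A minimal sketch of that interpolation follows. It assumes PERCENTILE locates the position k(n − 1) in the sorted data and interpolates linearly, which reproduces the 13.1 reported above; the helper name percentile_inc is hypothetical.

```python
def percentile_inc(values, k):
    """Percentile for 0 <= k <= 1 by linear interpolation in the sorted data.

    Assumed convention: position k * (n - 1), counted from zero.
    """
    ordered = sorted(values)
    position = k * (len(ordered) - 1)
    lower = int(position)          # index of the value just below the position
    fraction = position - lower    # how far to move toward the next value
    if lower + 1 >= len(ordered):  # k == 1 lands exactly on the maximum
        return float(ordered[-1])
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])


data = [14, 12, 19, 23, 5, 13, 28, 17]
p30 = percentile_inc(data, 0.3)
print(round(p30, 1))
```

For k = 0.3 and n = 8 the position is 2.1, so the result lies one tenth of the way from the third sorted value (13) to the fourth (14).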

Quartiles

Quartiles are measures of central tendency that divide a group of data into four subgroups or parts. If the observations are ordered from smallest to largest, each of the four parts contains 25% of the observations. The first quartile (Q1) is the median of the lower 50% of the observations, those from the minimum to the overall median M. The second quartile is the overall median (M), below which 50% of all observations lie. The third quartile (Q3) is the median of the upper 50% of the observations. A five-number summary gives a concise description of the distribution: the minimum, Q1, M (median), Q3, and the maximum. A box plot is a graph of the five-number summary. Side-by-side box plots are useful for comparing several distributions.

Demonstration Problem 3.3

1. Open the Demo_3-3 file from the folder titled Demonstration Problem Data Sets on the student companion site located at www.wiley.com/college/cortinhas.
2. Below the data, input the labels and formulas for the minimum, Q1, median, Q3, and maximum according to the instructions given earlier in Chapter 1, using Function Wizard or by inputting the formulas manually. You could also use the QUARTILE function for all of the values by inserting 0, 1, 2, 3, or 4 for the second argument. For example, Maximum would be =QUARTILE(C2:C17,4).
3. To view the functions used on a worksheet, there is an option in Excel to display all equations. Select File > Options > Advanced. Scroll down to Display options for this worksheet and select Show formulas in cells instead of their calculated results. Click OK and you will be able to view all of the formulas used on the current worksheet. Reverse the selection when you want to see only the results. You can always click on a cell and see the formula that was input in the Formula Bar.

4. The resulting values calculated are shown as follows:

[Figure: calculated five-number summary]

Note: These values are not quite the same as the values calculated in the textbook. That is because the quartiles are calculated by a different algorithm in different software programs. If you use the Min, Max, and Median functions, those values will be the same. The values that differ are Q1 and Q3.

3.2 Measures of Variability: Ungrouped Data

Business researchers can use another group of analytic tools, measures of variability, to describe the spread or the dispersion of a set of data. Using measures of variability in conjunction with measures of central tendency makes possible a more complete numerical description of the data. Methods of computing measures of variability differ for ungrouped data and grouped data. This section focuses on seven measures of variability for ungrouped data: range, interquartile range, mean absolute deviation, variance, standard deviation, z scores, and coefficient of variation.

Range

The range is the difference between the largest value and the smallest value of a data set. Although it is usually a single numeric value, some business researchers define the range of data as the ordered pair of smallest and largest numbers (smallest, largest). It is a crude measure of variability, describing the distance to the outer bounds of the data set. An advantage of the range is its ease of computation. A disadvantage of the range is that, because it is computed with the values that are on the extremes of the data, it is affected by extreme values, and its application as a measure of variability is limited.

Interquartile Range

Another measure of variability is the interquartile range. The interquartile range is the range of values between the first and third quartile. Essentially, it is the range of the middle 50% of the data and is determined by computing the value of Q3 minus Q1.
The interquartile range is especially useful in situations where data users are more interested in values towards the middle and less interested in extremes. In addition, the interquartile range is used in the construction of box-and-whisker plots. The interquartile range value can differ slightly when using different software programs due to the underlying algorithms defining the quartiles.
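To see how the choice of quartile algorithm changes Q1, Q3, and therefore the interquartile range, the sketch below computes a five-number summary under two conventions offered by Python's statistics.quantiles ("inclusive" and "exclusive"). It reuses the eight values from Demonstration Problem 3.2 because the Demo_3-3 values are not listed here. Treating the inclusive method as the counterpart of Excel's QUARTILE is an assumption, and the helper name five_number_summary is hypothetical.

```python
from statistics import quantiles

# Demonstration Problem 3.2 values, reused purely for illustration
data = [14, 12, 19, 23, 5, 13, 28, 17]


def five_number_summary(values, method):
    """Return (minimum, Q1, median, Q3, maximum) for the given quantile method."""
    q1, med, q3 = quantiles(values, n=4, method=method)
    return min(values), q1, med, q3, max(values)


inclusive = five_number_summary(data, "inclusive")
exclusive = five_number_summary(data, "exclusive")

data_range = max(data) - min(data)           # largest value minus smallest value
iqr_inclusive = inclusive[3] - inclusive[1]  # Q3 minus Q1
iqr_exclusive = exclusive[3] - exclusive[1]

print("inclusive:", inclusive, "IQR =", iqr_inclusive)
print("exclusive:", exclusive, "IQR =", iqr_exclusive)
print("range =", data_range)
```

The minimum, median, and maximum agree under both methods; only Q1 and Q3 move, which is why the interquartile range differs while the range does not.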

Demonstration Problem 3.3 (continued)

1. Open the Demo_3-3_Results file from the folder titled Demonstration Problem Data Sets on the student companion site located at www.wiley.com/college/cortinhas or use the results calculated in the previous exercise on quartiles.
2. To calculate the range in Excel, use a simple subtraction formula. Click on a cell below the quartile calculations and input =, then click on the computed maximum value, type - and click on the computed minimum value.

[Figure: range formula and its result]

3. To calculate the interquartile range, use a simple subtraction formula with Q1 and Q3. Click on a cell below the range calculation and input =, then click on the computed Q3 value, type - and click on the computed Q1 value.

[Figure: interquartile range formula and its result]

Mean Absolute Deviation, Variance, and Standard Deviation

Three other measures of variability are the variance, the standard deviation, and the mean absolute deviation. The variance and standard deviation are widely used in statistics. Although the standard deviation has some stand-alone potential, the importance of variance and standard deviation lies mainly in their role as tools used in conjunction with other statistical devices.

Mean Absolute Deviation

The mean absolute deviation (MAD) is the average of the absolute values of the deviations around the mean for a set of numbers. Excel's AVEDEV function returns this value directly, but setting up a table with formulas shows how the calculation works:

$\text{MAD} = \dfrac{\sum |x_i - \mu|}{N}$

MAD Problem

1. A small company started a production line to build computers. During the first five weeks of production, the output is 5, 9, 16, 17, and 18 computers, respectively. Calculate the mean absolute deviation for the computer production data.
2. Input the data vertically in one column.
In a cell below the data, calculate the mean of the data set by inputting the function =AVERAGE, selecting the data to find the mean, and pressing Enter.
3. In the next column, input a formula to find the absolute value of the mean subtracted from the value in the first column. Remember that the reference to the average in the formula has to be absolute; press the function key F4 to make it so. Copy the formula down. Insert a formula below to sum the values using the function =SUM(select the data).

[Figure: worksheet with the formula and the resulting values]

4. Below those values, input a function to sum the absolute values of (x − μ) as =SUM(select the data). Please note that the sum of the signed deviations (x − μ), before absolute values are taken, should be 0.

[Figure: worksheet with the deviation columns and their sums]

5. The MAD calculation can be input in a cell as the sum of the absolute values of (x − μ) divided by the number of values N, which is 5. The result is 4.8.

Variance

Because absolute values are not conducive to easy manipulation, mathematicians developed an alternative mechanism for overcoming the zero-sum property of deviations from the mean. This approach utilizes the square of the deviations from the mean. The result is the variance, an important measure of variability. The variance is the average of the squared deviations about the arithmetic mean for a set of numbers. The population variance is denoted by σ².

$\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$

Population Variance and Standard Deviation Problem

1. Open the MAD_results file or use the results calculated in the previous exercise on MAD.
2. The most straightforward way to calculate a population variance is to use the variance function in Excel. For this example, choose the population variance (VAR.P). The result is 26.0.

3. The square root of this value is the standard deviation. In Excel, type the function =SQRT(click on the number or type the number to find the square root). The standard deviation value is 5.1. Alternatively, use the population standard deviation function in Excel (STDEV.P).

Standard Deviation

The population standard deviation is a measure of the spread of a distribution. A normal distribution is completely described by its centre at the mean and its spread defined by multiples of its standard deviation. The standard deviation σ is the square root of the variance σ².

$\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$

Population versus Sample Variance and Standard Deviation

The sample variance is denoted by s² and the sample standard deviation by s. The main use for sample variances and standard deviations is as estimators of population variances and standard deviations. Because of this, computation of the sample variance and standard deviation differs slightly from computation of the population variance and standard deviation. Both the sample variance and sample standard deviation use n − 1 in the denominator.
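The worksheet table for the computer production data (5, 9, 16, 17, 18) can be mirrored step by step in Python. This sketch follows the definitions directly rather than calling a library routine, so each column of the worksheet has a visible counterpart; the variable names are illustrative. It also divides the same sum of squares by n − 1 to show how the sample variance differs from the population variance.

```python
from math import sqrt

output = [5, 9, 16, 17, 18]  # weekly computer production

n = len(output)
mu = sum(output) / n                   # mean of the data
deviations = [x - mu for x in output]  # signed deviations; these sum to zero

mad = sum(abs(d) for d in deviations) / n  # mean absolute deviation
sum_sq = sum(d ** 2 for d in deviations)   # sum of squared deviations
pop_var = sum_sq / n                       # population variance (divide by N)
pop_sd = sqrt(pop_var)                     # population standard deviation
sample_var = sum_sq / (n - 1)              # sample variance (divide by n - 1)

print(f"sum of deviations = {sum(deviations)}")
print(f"MAD = {mad}, population variance = {pop_var}, population SD = {pop_sd:.1f}")
print(f"sample variance = {sample_var}")
```

The only difference between the two variances is the denominator, so the sample figure is always the larger of the two for the same data.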

$s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

Sample Variance and Standard Deviation Problem

1. Input the following data representing a sample of six of the largest accounting firms in the United States and the number of partners associated with each firm.

Firm                       Number of Partners
Deloitte & Touche          2654
Ernst & Young              2108
PricewaterhouseCoopers     2069
KPMG                       1664
RSM McGladrey              720
Grant Thornton             309

2. The most straightforward way to calculate a sample variance is to use the sample variance function in Excel. Choose the function VAR.S. The result is 806,631.07.
3. In a similar way, the sample standard deviation can be calculated by using the sample standard deviation function in Excel. For this example, choose the function STDEV.S. The result is 898.13.
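The same sample figures can be checked with Python's standard library, whose statistics.variance and statistics.stdev functions divide by n − 1. This is a sketch for comparison only; pairing them with VAR.S and STDEV.S is based on that shared denominator.

```python
from statistics import pvariance, stdev, variance

# Number of partners at six accounting firms (sample data from the problem above)
partners = [2654, 2108, 2069, 1664, 720, 309]

sample_var = variance(partners)  # sum of squared deviations / (n - 1)
sample_sd = stdev(partners)      # square root of the sample variance
pop_var = pvariance(partners)    # same sum divided by n, shown for contrast

print(f"sample variance = {sample_var:,.2f}")
print(f"sample standard deviation = {sample_sd:.2f}")
print(f"population variance (for contrast) = {pop_var:,.2f}")
```

Treating the six firms as a whole population instead of a sample would give the smaller pvariance figure, which is why the choice between the two functions has to follow from how the data were collected.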

z Scores

Normal distributions are defined as perfectly symmetric, bell-shaped curves. The normal distribution has certain qualities that allow calculations based on standardized values of the standard deviation (as described in the text). A z score represents the number of standard deviations a value (x) is above or below the mean of a set of numbers when the data are normally distributed. Using z scores allows translation of a value's raw distance from the mean into units of standard deviations. If a variable x has a normal distribution N(µ, σ) with a mean µ and a population standard deviation σ, then the standardized variable

$z = \dfrac{x - \mu}{\sigma}$

has a standard normal distribution. For samples,

$z = \dfrac{x_i - \bar{x}}{s}$

Normal distribution calculations determine the probability that a value is at least as high, or at least as low, as a given value. The calculations can also determine a specific value for a given probability or proportion. These methods are described in the text.

Demonstration Problem 3.5: A Normal Distribution Variation

In the computing industry, the average age of professional employees tends to be younger than in many other business professions. Suppose the average age of a professional employed by a particular computer firm is 28 with a standard deviation of 6 years. A histogram of professional employee ages with this firm reveals that the data are normally distributed. What percentage of employees are younger than 22? Older than 45? Determine the range of ages within which at least 80% of the workers' ages would fall.

This is a variation of Demonstration Problem 3.5 in the textbook, which addresses the non-normal distribution. Excel functions can calculate normal distribution values. The following method demonstrates calculating probabilities with and without using a z-score calculation:

1. Open an empty worksheet in Excel.
2. Type in the given values and their headings. We are interested in the proportion or probability of having an age as young as or younger than 22. Because of how the normal distribution table is constructed, the area to the left of a specific value is always calculated. If an area to the right of a specific value is desired, we will have to subtract the value from 1.
3. Click in the cell to the right of the P < 22 to input the function to calculate the standardized z-value of a specific x-value. Input =STANDARDIZE(22,28,6) and press Enter. The arguments of the function in order are the x-value, the population mean, and the population standard deviation. Instead of typing in the specific values, you can click on the cells containing those values. The result is -1.
4. Click in the next cell to the right and input the function to calculate the probability of an x-value being as low or lower than 22. Input =NORM.S.DIST(-1,1) and press Enter. Instead of typing -1, you can click on the cell containing the z-score of -1. The second argument of True or 1 requests the cumulative percentage to the left of the given value. The result is 0.1587. That is, 15.87% of the ages are less than or equal to 22 years.
5. If a z-value is not required, you can input the function =NORM.DIST(22,28,6,1) and press Enter. The arguments of the function are the x-value, the population mean, the population standard deviation, and 1 for True for a cumulative probability or 0 for False for the height of the density curve at x. For this example, we are interested in the cumulative probability to the left of the x-value of 22, so this value is True, or 1. The result is 0.1587. That is, 15.87% of the employees are less than or equal to 22 years of age.
6. Click in the cell to the right of the P > 45 to input =1-NORM.DIST(45,28,6,1) and press Enter. NORM.DIST with a final argument of 1 gives the cumulative probability to the left of 45, which is 0.9977, so subtracting it from 1 gives the percentage to the right of the given value. The result is 0.0023. That is, 0.23% of the employees are 45 years old or older. (A final argument of 0 would instead return the height of the density curve at 45, about 0.0012, which is not a probability.)
7. To calculate what percentage of values would lie between two ages, you could subtract the probability to the left of the smaller value from the probability to the left of the larger value. The in-between percentage would be 0.9977 - 0.1587 = 0.8390. You can also input the formula =NORM.DIST(45,28,6,1)-NORM.DIST(22,28,6,1).
8. The final calculation is to determine the range within which at least 80% of the workers' ages would fall. In an empty cell, input the formula =NORM.INV(0.8,28,6) and press Enter. The arguments are the cumulative probability, the population mean, and the population standard deviation. This function returns the age below which a proportion of 0.8, or 80%, of the values fall. In this problem we wanted to find the lower 80% of values. The result is about 33 years.

3.3 Measures of Central Tendency: Grouped Data

Three measures of central tendency are presented in the textbook for grouped data: the mean, the median, and the mode. Excel has no functions that calculate these directly for grouped data; they must be obtained by modeling the manual calculations shown in the text.

3.4 Measures of Shape

Measures of shape are tools that can be used to describe the shape of a distribution of data. In this section, the text examines two measures of shape: skewness and kurtosis. It also looks at box-and-whisker plots. Skewness is addressed in the following section, Descriptive Statistics on the Computer.

Box-and-Whisker Plots and Five-Number Summary

Another way to describe a distribution of data is by using a box-and-whisker plot. A box-and-whisker plot, sometimes called a box plot, is a diagram that utilizes the upper and lower quartiles along with the median and the two most extreme values to depict a distribution graphically. Excel versions before 2016 have no built-in chart type for box-and-whisker plots; for those versions, Excel templates or macros that provide this capability can be downloaded.

3.5 Descriptive Statistics on the Computer

Excel's descriptive statistics output for the same computer production data is displayed below. The Excel output contains the mean, the median, the mode, the sample standard deviation, the sample variance, and the range.
The descriptive statistics feature yields a lot of useful information about a data set.

1. Input the computer production data into one column of Excel: 5, 9, 16, 17, 18.
2. Click on the Data tab and Data Analysis.
3. Select Descriptive Statistics and then select the data for the Input Range box. Select a cell to the right of the data for the Output Range and select Summary statistics. If you input and selected the label, select Labels in first row (make sure the label is in the cell directly above the data).

4. Widen the column with the text to see all of the text (click and drag or double-click the line between the column letters).
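For readers who want to verify a summary table like this by hand, the sketch below assembles comparable statistics for the computer production data in Python. The skewness and kurtosis helpers assume the sample-adjusted formulas commonly documented for Excel's SKEW and KURT functions; the helper names are hypothetical, and the mode is left out because no value in this data set repeats.

```python
from statistics import mean, median, stdev, variance


def excel_style_skew(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum of cubed standardized deviations."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)


def excel_style_kurt(values):
    """Sample excess kurtosis with the usual small-sample correction terms."""
    n = len(values)
    m, s = mean(values), stdev(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


output = [5, 9, 16, 17, 18]  # weekly computer production

summary = {
    "Mean": mean(output),
    "Median": median(output),
    "Standard Deviation": stdev(output),  # sample standard deviation
    "Sample Variance": variance(output),
    "Kurtosis": excel_style_kurt(output),
    "Skewness": excel_style_skew(output),
    "Range": max(output) - min(output),
    "Minimum": min(output),
    "Maximum": max(output),
    "Sum": sum(output),
    "Count": len(output),
}

for label, value in summary.items():
    print(f"{label:20s}{value}")
```

Note that the variance here is the sample variance (32.5), not the population variance of 26.0 computed earlier for the same data, because the summary output reports sample statistics.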

Kurtosis

Kurtosis describes the amount of peakedness of a distribution. Kurtosis near 0 means a data set exhibits peakedness close to that of the normal distribution (see examples in the text).

Skewness

Skewness is present when a distribution is asymmetrical or lacks symmetry. A distribution can be positively skewed or negatively skewed (see examples in the text). A skew greater than +1 indicates a high degree of positive skew, a skew less than −1 indicates a high degree of negative skew, and a value between −1 and +1 indicates a relatively symmetric data set.

SUMMARY OF EXCEL COMMANDS USED IN CHAPTER 3

Excel has the capability of producing many of the statistics in this chapter piecemeal, and there is one Excel feature, Descriptive Statistics, that produces many of these statistics in one output.

Descriptive Statistics

Select the Data tab on the Excel worksheet. From the Analysis panel at the top right of the Data tab, click on Data Analysis. If your Excel worksheet does not show the Data Analysis option, then you can load it as an add-in following directions given in Chapter 2. From the Data Analysis pulldown menu, select Descriptive Statistics. In the Descriptive Statistics dialogue box, click in the Input Range and select the data to be analysed. Check Labels in first row if your data contain a label in the first row (cell). Check the box beside Summary statistics. The Summary statistics feature computes a wide variety of descriptive statistics. The output includes the mean, the median, the mode, the standard deviation, the sample variance, a measure of kurtosis, a measure of skewness, the range, the minimum, the maximum, the sum, and the count.

Rank and Percentiles

The Rank and Percentile feature of the Data Analysis tool of Excel has the capability of ordering the data, assigning ranks to the data, and yielding the percentiles of the data. To access this command, click on Data Analysis (see above) and select Rank and Percentile from the menu. In the Rank and Percentile dialogue box, enter the location of the data to be analysed in Input Range. Check Labels in first row if your data contain a label in the first row (cell).

Using Functions

Many of the individual statistics presented in this chapter can be computed using the Insert Function (fx) feature of Excel. To access Insert Function, go to the Formulas tab on an Excel worksheet (top centre tab); Insert Function is on the far left of the menu bar. In the Insert Function dialogue box at the top, there is a pulldown menu where it says Or select a category. From the pulldown menu associated with this command, select Statistical. There are 83 different statistics that can be computed using one of these commands. Select the one that you want to compute and enter the location of the data. Some of the more useful commands in this menu are AVERAGE, MEDIAN, MODE.SNGL, SKEW, STDEV.S, and VAR.S.
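The function-based approach carries over to other tools. As one hedged illustration, the normal-distribution questions from Demonstration Problem 3.5 (mean age 28, standard deviation 6) can be re-run with Python's statistics.NormalDist. The Excel function named in each comment is an assumed counterpart for orientation, not a claim that the two are implemented identically.

```python
from statistics import NormalDist

ages = NormalDist(mu=28, sigma=6)  # normally distributed employee ages

z_22 = (22 - ages.mean) / ages.stdev     # standardized value, cf. STANDARDIZE(22,28,6)
p_below_22 = ages.cdf(22)                # area to the left of 22, cf. NORM.DIST(22,28,6,1)
p_above_45 = 1 - ages.cdf(45)            # right tail: 1 minus the cumulative area
p_between = ages.cdf(45) - ages.cdf(22)  # area between the two ages
age_80 = ages.inv_cdf(0.8)               # age below which 80% fall, cf. NORM.INV(0.8,28,6)

print(f"z for age 22      = {z_22}")
print(f"P(age <= 22)      = {p_below_22:.4f}")
print(f"P(age >= 45)      = {p_above_45:.4f}")
print(f"P(22 <= age <= 45) = {p_between:.4f}")
print(f"80th percentile   = {age_80:.2f} years")
```

The right-tail probability is obtained by subtracting the cumulative area from 1, since a cumulative distribution function only ever reports the area to the left of a value.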