Blake Ferguson
5 years ago
1 Exercise One

Note that the data is not grouped!

1.1 Calculate the mean ROI

Below you find the raw data in tabular form:

[Table: columns Obs and Data for the 12 observations; the values are not preserved in this copy]

The formula for the sample mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

This translates into the sum of all observations (numerator) divided by the sample size (denominator). In tabular format, the Data column is summed and the total is divided by the number of observations:

Total   sum of the Data column
n       12
Mean    Total / 12

1.2 Find the median ROI

Firstly, all observations need to be placed in ascending order. The number of observations plus one ($n + 1$) is divided by 2 to determine the location of the median. You may also use the general formula for the location of a percentile:

$$L_p = (n + 1) \cdot \frac{P}{100}$$

where $L_p$ is the location of a particular percentile, $n$ is the sample size and $P$ is the percentile to be determined. In this example:

$$L_{50} = (12 + 1) \cdot \frac{50}{100} = 6.5$$

It should be evident that the 50th percentile corresponds to the median. The table is extended with a Sorted column and the median location:

Total       sum of the Data column
n           12
Mean        Total / 12
(n + 1)/2   (12 + 1) / 2 = 6.5

It should also be clear that the location of the median is between the 6th and the 7th (sorted) observation. More specifically, it is halfway between the two. The values associated with observations 6 and 7 are 14.9 and 17.4, respectively. Hence, the median is halfway between 14.9 and 17.4:

$$\text{Median} = 14.9 + 0.5 \cdot (17.4 - 14.9) = 16.15$$

The following tabular calculation allows determination of any percentile:

P            50
Lower        14.9
Upper        17.4
L_p          6.5
Fraction     0.5
Percentile   Lower + (Upper - Lower) * Fraction = 14.9 + (17.4 - 14.9) * 0.5 = 16.15

Fraction denotes the fractional part (0.5) of the location $L_p = 6.5$. The percentile above is the median.

1.3 Find the ROI range

The range is defined as the maximum observation minus the minimum observation:

$$\text{Range} = x_{\max} - x_{\min}$$

1.4 Calculate the standard deviation

Firstly, the difference between each observation and the mean is calculated. All differences are squared and summed. Lastly, the sum of squares is divided by the number of observations minus one ($n - 1$). Formally, the variance is expressed as follows:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Where $x_i$ is the $i$th observation of the dataset, $\bar{x}$ is the sample mean, and $n$ is the sample size (note that the formula for the population standard deviation differs). The formula translates into the sum of the squared differences between the observations and the mean, divided by the sample size minus one. In tabular format, each of the 12 rows holds Obs, Data, Diff = (Data - Mean) and Diff^2 = (Diff)^2, followed by:

Total      sum of the Diff^2 column
n          12
Mean       Total of Data / 12
n - 1      12 - 1 = 11
Variance   Total / 11
Stdev      Variance ^ (1/2)

The standard deviation is simply the square root of the variance (note that an exponent of 1/2 is equivalent to taking the square root):

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Students should recognize that the term underneath the square root is simply the formula for the variance.

1.5 Coefficient of Variation

The coefficient of variation is defined as the ratio of the sample standard deviation $s$ to the sample mean $\bar{x}$:

$$CV = \frac{s}{\bar{x}}$$
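The steps of Exercise One can be sketched in a few lines of Python. This is a minimal illustration, not the original worksheet: the `roi` list is a hypothetical sample (the exercise data is not available here), and the function names `percentile_location`, `percentile` and `describe` are made up for this sketch. The percentile follows the location rule $L_p = (n + 1) \cdot P / 100$ with linear interpolation between the two neighbouring sorted observations.

```python
import math


def percentile_location(n, p):
    """Location L_p = (n + 1) * P / 100 in the sorted sample (1-based)."""
    return (n + 1) * p / 100


def percentile(data, p):
    """Interpolated percentile: Lower + (Upper - Lower) * Fraction."""
    values = sorted(data)
    n = len(values)
    # Clamp the location so extreme percentiles stay inside the sample.
    loc = min(max(percentile_location(n, p), 1), n)
    k = int(math.floor(loc))        # 1-based position of the lower value
    fraction = loc - k              # fractional part of the location
    lower = values[k - 1]
    upper = values[min(k, n - 1)]   # next sorted value (or the last one)
    return lower + (upper - lower) * fraction


def describe(data):
    """Mean, median, range, sample variance, standard deviation and CV."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    stdev = math.sqrt(variance)
    return {
        "n": n,
        "mean": mean,
        "median": percentile(data, 50),
        "range": max(data) - min(data),
        "variance": variance,
        "stdev": stdev,
        "cv": stdev / mean,
    }


# Hypothetical ROI sample, NOT the data of the exercise.
roi = [4.0, 8.0, 6.0, 10.0, 2.0, 12.0]
summary = describe(roi)
print(summary)
```

For a sample of 12 whose 6th and 7th sorted values are 14.9 and 17.4, `percentile(data, 50)` returns the value halfway between the two, matching the median calculation in 1.2.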

2 Exercise Two

Note that the data is grouped!

2.1 Determine the first quartile

The solution is presented jointly with the next question.

2.2 Determine the third quartile

The following table represents the raw data:

Intervals   f
10 to 19     6
20 to 29    24
30 to 39    18
40 to 49     9
50 to 59     3

A note on calculating the class width above: although the data is technically continuous in nature (ages always are), the only data recorded is the age in years. Thus, somebody aged 19.5 years would be recorded as a 19-year-old. The same goes for someone aged 19.7, although rounding would yield 20 years. If this is the case, the more appropriate frequency table is the following:

Intervals    f
10 to <20     6
20 to <30    24
30 to <40    18
40 to <50     9
50 to <60     3

It reduces the ambiguity associated with fractional values (e.g. it is clear that 19.7 falls into the first interval). It also makes it very clear that the interval width is $C = 20 - 10 = 10$. Erroneously, the first table may have led you to believe the interval width is $19 - 10 = 9$. Manually counting the possible age observations (10, 11, 12, 13, 14, 15, 16, 17, 18, 19) shows, however, that the actual width is ten. Where there are no inequalities (<) in the interval column, the class width is calculated as upper limit - lower limit + 1 = 19 - 10 + 1 = 10. The same goes for the student mark example that precedes the end-of-chapter exercises. In the following, the table with inequalities is retained.

Firstly, the locations of the first quartile (25th percentile) and the third quartile (75th percentile) are determined, using the same formula as in 1.2 when calculating the median:

$$L_p = (n + 1) \cdot \frac{P}{100}$$

where $L_p$ is the location of a particular percentile, $n$ is the sample size (here $n = 60$, the sum of the frequency column) and $P$ (25 and 75) is the percentile to be determined. In this example:

$$L_{25} = (60 + 1) \cdot \frac{25}{100} = 15.25$$

$$L_{75} = (60 + 1) \cdot \frac{75}{100} = 45.75$$

It is now possible to identify the class intervals the quartiles (percentiles) belong to. Check the less-than cumulative frequency column (f(<)) until you come across the first value that is larger than the specified locations $L_{25}$ and $L_{75}$:

Intervals    f    f(<)
10 to <20     6     6
20 to <30    24    30    <- L_25
30 to <40    18    48    <- L_75
40 to <50     9    57
50 to <60     3    60
Total        60

Because 6 < 15.25 < 30 and 30 < 45.75 < 48, the locations fall into the intervals marked above. Note that if a location had been exactly equal to a value in the f(<) column (e.g. $L_p = 30$), the relevant interval would be the next higher one (since that value is explicitly excluded by the inequality sign).

Next, the following parameters are required:

P      The percentile (e.g. 25 for the first quartile)
n      Sample size
O_p    The lower limit of the interval $L_p$ falls into
C      Class width
f(<)   The cumulative frequency of the interval preceding the one $L_p$ falls into
f_p    The observed frequency of the interval $L_p$ falls into

From the previous table we can gather all the information required (note that $P = 25$ for the first quartile and $P = 75$ for the third quartile):

        Q1    Q3
P       25    75
n       60    60
O_p     20    30
C       10    10
f(<)     6    30
f_p     24    18

The relevant formula to be used next for any percentile:

$$\text{Percentile}_P = O_p + C \cdot \frac{\frac{n \cdot P}{100} - f(<)}{f_p}$$

Students will note that the general formula above reduces to the formulas for the quartiles and the median for grouped data when simplified (this is left as an exercise). The general formula works in any circumstances. For the two quartiles of the exercise:

$$Q_1 = 20 + 10 \cdot \frac{\frac{60 \cdot 25}{100} - 6}{24} = 23.75$$

$$Q_3 = 30 + 10 \cdot \frac{\frac{60 \cdot 75}{100} - 30}{18} \approx 38.33$$

2.3 The interquartile range (IQR)

The interquartile range is the difference between the third and first quartile:

$$IQR = Q_3 - Q_1 = 38.33 - 23.75 = 14.58$$

2.4 The quartile deviation

The quartile deviation is defined as half the interquartile range:

$$QD = \frac{IQR}{2} = \frac{14.58}{2} = 7.29$$

2.5 Interpretation

Approximately 50% of the participants of the study who responded positively to the commercial fall between the ages of 24 and 38. However, all people in the initial sample liked the commercial, which is a rather unlikely scenario. To make any inferences about which age category responds most favourably to the commercial, one would need at least one respondent reacting negatively to the commercial. This question is rather silly! All we can gather from this is the age distribution of respondents in the initial sample.
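The grouped-data calculation of Exercise Two can be sketched in Python as follows. This is a hedged illustration under stated assumptions: the class frequencies (6, 24, 18, 9, 3) are reconstructed from the cumulative frequencies 6, 30 and 48, the final frequency of 3 and the total of 60 quoted in the text; the interpolation term uses $n \cdot P / 100$, which is an assumed reading of the general formula; and the function name `grouped_percentile` is hypothetical.

```python
def grouped_percentile(lower_limits, freqs, width, p):
    """Percentile for grouped data: O_p + C * (n * P / 100 - f(<)) / f_p.

    The class is located with L_p = (n + 1) * P / 100, taking the first
    class whose less-than cumulative frequency is strictly larger than L_p.
    """
    n = sum(freqs)
    location = (n + 1) * p / 100
    cumulative = 0  # f(<) of the previous interval
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f > location:
            return lower + width * (n * p / 100 - cumulative) / f
        cumulative += f
    raise ValueError("percentile location lies beyond the last interval")


# Age classes 10 to <20, ..., 50 to <60 (frequencies reconstructed, see above).
lower_limits = [10, 20, 30, 40, 50]
freqs = [6, 24, 18, 9, 3]
width = 10

q1 = grouped_percentile(lower_limits, freqs, width, 25)
q3 = grouped_percentile(lower_limits, freqs, width, 75)
iqr = q3 - q1   # interquartile range
qd = iqr / 2    # quartile deviation
print(round(q1, 2), round(q3, 2), round(iqr, 2), round(qd, 2))
```

Rounded to whole years, the two quartiles agree with the age range of 24 to 38 quoted in the interpretation.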

3 Question Three

The skewness of a sample is approximated as follows:

[Formula not preserved in this copy]

The (excess) kurtosis of a sample can be approximated with:

[Formula not preserved in this copy]

where all parameters are defined as for the standard deviation of the sample. Skewness and kurtosis are sometimes referred to as the third and fourth moments of the distribution (with the mean and the standard deviation as the first two moments). It is hardly surprising that researchers usually employ computers when calculating skewness and kurtosis. Under the assumption of a normal, bell-shaped distribution function, the expected values for skewness and kurtosis are 0 and 3, respectively. The above formula for kurtosis, however, expresses the kurtosis in excess of three. The expected value for excess kurtosis is therefore 0.

Skewness is a measure of symmetry. Note that a positive skewness (> 0) indicates a distribution function with the tail extended to the right, whereas a negatively skewed (< 0) function has a tail extended towards the left. The standard normal distribution function has a skewness of 0.

Kurtosis is a measure of peakedness. An excess kurtosis > 0 indicates a peaked (leptokurtic) function, and an excess kurtosis < 0 indicates a flatter than normal (platykurtic) function. As with the skewness, the excess kurtosis of the normal distribution function is 0 (mesokurtic).
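As a rough illustration of how such statistics are computed for ungrouped data, the sketch below uses the bias-adjusted sample estimators found in common spreadsheet software. This specific form is an assumption and may differ from the formulas intended in this question; the function names and the example data are hypothetical. Both functions need at least four observations and a non-zero standard deviation.

```python
import math


def _mean_and_stdev(data):
    """Sample mean and sample standard deviation (divisor n - 1)."""
    n = len(data)
    mean = sum(data) / n
    stdev = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return mean, stdev


def sample_skewness(data):
    """Bias-adjusted skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(data)
    mean, s = _mean_and_stdev(data)
    total = sum(((x - mean) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * total


def sample_excess_kurtosis(data):
    """Bias-adjusted excess kurtosis (about 0 for normally distributed data)."""
    n = len(data)
    mean, s = _mean_and_stdev(data)
    total = sum(((x - mean) / s) ** 4 for x in data)
    scaled = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * total
    return scaled - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))


# Hypothetical examples: a symmetric sample and a sample with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]
print(sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print(sample_skewness(right_tailed))
```

The symmetric sample yields a skewness of zero and a negative excess kurtosis (flatter than normal), while the right-tailed sample yields a positive skewness, in line with the sign conventions described above.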

It is possible to draw a frequency polygon to approximate the form of the distribution function. We first add two dummy intervals of the same width as the other intervals, one before the first and one after the last interval. The associated frequencies are zero (since no observation is smaller than the lower bound of the first or larger than the upper bound of the last interval). The midpoints are calculated as well. The table now looks as follows:

Intervals    f    f(<)   Mid
0 to <10      0     0      5
10 to <20     6     6     15
20 to <30    24    30     25
30 to <40    18    48     35
40 to <50     9    57     45
50 to <60     3    60     55
60 to <70     0    60     65

The associated histogram and polygon are then:

[Figure: histogram and frequency polygon of the age categories, 0 to <10 through 60 to <70]

It is virtually impossible to comment on the kurtosis from the above rough approximation. It is possible to say that the function is not extremely leptokurtic. However, deciding whether the function above deviates from normal requires the calculation of a formal statistic (i.e. kurtosis). Unfortunately, the formulas presented here are for non-grouped data only; we require the original data to calculate the kurtosis.

There is some evidence of positive skewness. However, it is questionable whether the above graph presents sufficient statistical evidence to infer that the distribution is non-normal. In practice, two formal test statistics have received recognition: the parametric Jarque-Bera test, based on the actual parameter estimates for skewness and kurtosis; and the non-parametric Lilliefors-Kolmogorov-Smirnov test. For grouped data, the second test may produce better results. A formal treatment of non-normality is beyond the scope of the course. As in 2.5, I do not believe that you are able to produce any meaningful results with respect to the kurtosis from the information available.
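The construction of the frequency polygon described in this question (two zero-frequency dummy intervals plus class midpoints) can be sketched as follows. The frequencies are again the reconstructed values 6, 24, 18, 9 and 3, which is an assumption, and `polygon_points` is a hypothetical helper name.

```python
def polygon_points(lower_limits, freqs, width):
    """Return (midpoint, frequency) pairs for a frequency polygon.

    A zero-frequency dummy interval of the same width is added before the
    first and after the last class so the polygon starts and ends at zero.
    """
    lowers = [lower_limits[0] - width] + list(lower_limits) + [lower_limits[-1] + width]
    padded = [0] + list(freqs) + [0]
    return [(lower + width / 2, f) for lower, f in zip(lowers, padded)]


# Age classes 10 to <20, ..., 50 to <60 (frequencies reconstructed, see above).
points = polygon_points([10, 20, 30, 40, 50], [6, 24, 18, 9, 3], 10)
for midpoint, frequency in points:
    print(midpoint, frequency)
```

Plotting these points and joining them with straight lines gives the polygon; the peak at the 20 to <30 class with a longer run of classes to its right is what suggests positive skewness.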

Lectures delivered by Prof.K.K.Achary, YRC
