Copyright 2011 Pearson Education, Inc. Publishing as Addison-Wesley.


Appendix: Statistics in Action

Part I

Financial Time Series

1. These data show the effects of stock splits. If you investigate further, you'll find that most of these splits (such as in May 1970) are 3-for-1 splits rather than the 2-for-1 splits seen in the text example. You can recognize that the May 1970 split is a 3-for-1 split by observing that the price is $97.50 at the end of April 1970 and falls to about 1/3 of that price at the end of May.

2. a. The White Space Rule says that this figure doesn't show much of what is happening in the first ¾ of the time period, up through about 1985. This portion of the data is hidden because of the large values produced by compounding in the later years.
b. No. These data have a strong pattern of dependence that would be concealed by a histogram.

3. The variation in the month-to-month changes lacks a clear pattern, and a histogram is a helpful summary of such variation.

4. The percentage changes show the gains produced by an investment one month at a time. Cumulative values hide the month-to-month variation in the performance of the stock behind the effects of the compounding value.

5. a. A return is 1/100 times the corresponding percentage change, so the SD of the returns is 1/100 times the SD of the percentage changes.
b. If we accept the Empirical Rule, then we can find the number of standard deviations that separate the mean from -2.5%. A drop of 2.5% amounts to (-2.5 - 1.55)/7.55 ≈ -0.54, or about ½ of an SD below the mean. That's not very unusual based on the Empirical Rule, which we should check by inspecting the histogram of percentage changes. A simpler approach might be better than relying on the Empirical Rule: find the percentage of months with a drop of 2.5% or more. This approach is not without assumptions. We still need to make sure that there are no patterns (dependence) in the time series of percentage changes. (Use the visual test for association.)
c. Let's find the VaR at 2.5%, excluding the worst 2.5% of outcomes.
Using the Empirical Rule, that implies that the stock could change by its mean minus two standard deviations, or by 1.55 - 2*7.55 = -13.55%. Multiplying by the amount invested converts this percentage into a dollar loss.

Executive Compensation

1. Make sure that the histogram is bell-shaped and based on data that do not show a pattern.

2. The skewness in incomes persists after removing outliers. Trimming outliers from a skewed distribution seldom improves its shape; it only sacrifices data for little or no change in the shape of the distribution.

3. You can convert to base-10 logs from natural (base-e) logs by the formula log_e x = 2.303 log_10 x. The constant 2.303 ≈ log_e 10. As a result, all log transformations have the same effect on the shape of a histogram; they only differ in the scale.

4. The value is closer to the median, not the mean. The mean of compensation is $4.268 million, and the median is $2.533 million. The average of the log_10 compensation is 6.411, and taking the antilog gives $2.576 million. Working on a log scale down-weights the effects of outliers, so 10 raised to the mean of the base-10 logs is closer to the median. There's an important lesson here: logs and averaging do not commute. The average of the log of a variable is less than the log of the average.

Part II

Dice Simulation

1. Luck! The mean for Red is astonishingly large. For most investors, however, Red loses value because of its large volatility. For some, though, there remains a chance of winning a lot with Red. All you have to do is keep rolling those 4s, 5s, and 6s.

2. The mix used to form Pink in the example is not the best ratio, but we have to define what we mean by best. Given the discussion of this example, we might for instance choose to find the mix that has the largest long-run expected value. That is, pick the fraction x to keep in Red in order to maximize
V(x) = [x(0.71) + (1-x)(0.008)] - ½ [x^2 Var(Red) + (1-x)^2 (0.0016)],
where the first bracket is the mean and the second bracket is the variance. That's a basic calculus problem (or you could use the Excel Solver). Setting the derivative V'(x) equal to 0 gives x ≈ 0.403. That is, keep about 40% of your wealth in Red with the rest in White. The long-run return grows from about 14% (0.141, Table 6) to about 15% (V(0.403) ≈ 0.149).

3. The investments perform independently of one another. What happens to Red, for instance, has no effect on what happens to either Green or White. Real investments are all affected by the health of the general economy.

4. Method (a) does not compound whereas method (b) does. That makes a big difference in the results. Here are the details. We can do a very complete analysis of method (a) since it amounts to summing independent random variables. The expected value of the random variable A, which represents the outcome of a $1 bet using method (a), is the average of the six equally likely outcomes. On average you win 3.33% of the amount wagered. The variance of A is Var(A) = E(A^2) - (E A)^2 ≈ 0.1067. The expected outcome of 100 rounds with $1 each time, because each round is independent, is then 100 times the expected value of a single round. The variance of 100 independent rounds is then 100*0.1067 = 10.67 (SD 3.266). The CLT implies, then, that you're likely to win, but not very much. For example, the chance of winning at this game, P(A_1 + A_2 + ... + A_100 > 100), can be approximated by a standard normal probability. Method (b) compounds, as in the dice game. The amount won in each round affects the amount won in the next. The expected position after 100 rounds (starting from a $1 initial wager) compounds the 3.33% average gain 100 times. Compounding suggests a large gain, but compounding also means that volatility drag enters the picture. Using the expression from the text, the long-run return of this approach is (mean minus half of the variance)
long-run return = 0.0333 - ½ (0.1067) ≈ -0.020
That is, the volatility will gradually reduce the value of the investment.
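The "mean minus half of the variance" calculations in the Dice Simulation answers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the means 0.71 and 0.008, the White variance 0.0016, and the method (b) values 0.0333 and 0.1067 come from the text, while the Red variance is not legible here, so `VAR_RED = 1.32 ** 2` is an assumed stand-in chosen because it reproduces the quoted x ≈ 0.403 and V ≈ 0.149. All function and constant names are hypothetical.

```python
def long_run_return(mean, variance):
    """Volatility-adjusted return: mean minus half of the variance."""
    return mean - 0.5 * variance

# Values quoted in the text.
MEAN_RED, MEAN_WHITE = 0.71, 0.008
VAR_WHITE = 0.0016
# ASSUMPTION: Red's variance is missing from the text; this value is a
# stand-in that happens to reproduce the quoted optimum.
VAR_RED = 1.32 ** 2

def mix_value(x):
    """V(x) for a mix holding fraction x in Red and 1 - x in White."""
    mean = x * MEAN_RED + (1 - x) * MEAN_WHITE
    # The investments are independent, so the variances add with squared weights.
    variance = x ** 2 * VAR_RED + (1 - x) ** 2 * VAR_WHITE
    return long_run_return(mean, variance)

def best_fraction():
    """Solve V'(x) = 0, i.e.
    (MEAN_RED - MEAN_WHITE) - x * VAR_RED + (1 - x) * VAR_WHITE = 0."""
    return (MEAN_RED - MEAN_WHITE + VAR_WHITE) / (VAR_RED + VAR_WHITE)

x_star = best_fraction()
v_star = mix_value(x_star)
method_b = long_run_return(0.0333, 0.1067)

print(f"best fraction in Red: {x_star:.3f}")
print(f"long-run return of that mix: {v_star:.3f}")
print(f"long-run return of method (b): {method_b:.3f}")
```

Because V(x) is a concave quadratic, the single root of V'(x) is the maximum, so no numerical solver is needed; a spreadsheet solver would be expected to land on the same fraction.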
Typically, strategy (b) loses about 2% each round.

M&Ms

Here's another way to think about the method for weighing by counts. We are going to compute the upper 99.5% point of the distribution of the weight of a package that has too few pieces. If the weight of a package is more than this percentile, then there is a very high probability that the package has enough pieces.

1. Follow the illustration for the bolts from the text. We need to find the weight x such that (with T_59 the total weight of 59 M&Ms and using normality)
P(T_59 < x) = P(Z < (x - 59*0.86)/(0.04*sqrt(59))) = 0.995
The expression on the right-hand side of the inequality has to equal the 99.5th percentile of a standard normal distribution,
(x - 59*0.86)/(0.04*sqrt(59)) = 2.576
Solving for x gives x = 59*0.86 + 2.576*0.04*sqrt(59) ≈ 51.53 grams. That is, 99.5% of packages with 59 pieces weigh less than 51.53 grams. A package that weighs more than this signals that it has more than 59 pieces. A bag with 60 pieces weighs on average slightly more, 60*0.86 = 51.6 grams. Hence, many bags with 60 pieces will fall below the threshold and get an extra piece. (See the answer to #3 for the probability.)

2. We need to assume that the weights of individual candies are normally distributed, whereas for #1 we could appeal to the central limit theorem. As in the solution of #1, let T_9 denote the weight of 9 M&Ms. We need to find the weight threshold x so that
P(T_9 < x) = P(Z < (x - 9*0.86)/(0.04*sqrt(9))) = 0.995
The expression on the right has to equal 2.576, or
(x - 9*0.86)/(0.04*sqrt(9)) = 2.576
and x = 9*0.86 + 2.576*0.04*sqrt(9) ≈ 8.05 grams. That's the upper 99.5% point of the weights of bags with 9 pieces. A bag with 10 pieces on average weighs 10*0.86 = 8.6 grams. We won't have to add so many pieces as in #1 (see the answer to #3 next).

3. The probability that we have to put in more than enough to fill the bag with 60 pieces (i.e., the chance that a bag with 60 weighs less than the acceptance threshold found in #1) is
P(T_60 < 51.53) = P(Z < (51.53 - 60*0.86)/(sqrt(60)*0.04)) = P(Z < -0.23) ≈ 0.41
The chance that a bag with 10 weighs less than the threshold from #2 is
P(T_10 < 8.05) = P(Z < (8.05 - 10*0.86)/(sqrt(10)*0.04)) = P(Z < -4.35) ≈ 0
There's less chance to put in extra pieces in these small bags. On the other hand, we need to assume that the weights are normally distributed. (The result works out like this for M&Ms because the variation in the weight of a piece is so small relative to the mean; i.e., the coefficient of variation is small.)

Part III

Rare Events

1. Reverse the definition of success and failure. If all of the results were successful, then the approximate 95% confidence interval for the population proportion of success is 1 - 3/n to 1.

2. To raise the level of confidence requires solving a slightly different equation. The solution p* found in the text solves the equation (1-p*)^n = 0.05, which implies that p* = 1 - 0.05^(1/n). If we set the level of confidence to 99.75%, then the new value for the confidence limit q* is q* = 1 - 0.0025^(1/n). We can also find an approximation for q* using the same procedure described in Behind the Math that gives the Rule of Three its name. Following that argument, write q* = x/n. It follows that x = -log_e 0.0025 ≈ 5.99. That means we get the higher level of confidence by using the Rule of Six. The 99.75% confidence interval is approximately [0, 6/n].

3. The Rule of Three is designed to work for large values of n. If n = 20, then the exact value for the upper limit of the 95% confidence interval p* is (as in Question 2) p* = 1 - 0.05^(1/n) = 1 - 0.05^(1/20) ≈ 0.139. The approximate value is larger, namely 3/20 = 0.15. The implication is that the 95% confidence interval [0, 3/n] is conservative, or what you might call cautious. The coverage of this simple interval is larger than the nominal level of 95%.

Testing Association

1. No. The fact that we have 200 samples from each location makes it easier to compare the proportions in each case, but that's only convenient, not necessary.
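Returning briefly to the Rare Events answers, the comparison between the exact upper limit and the Rule of Three can be checked numerically. The sketch below assumes only the equation (1-p*)^n = alpha from the text; the function names are hypothetical and chosen for illustration.

```python
import math

def exact_upper_limit(n, alpha=0.05):
    """Solve (1 - p)**n = alpha for p, the exact upper confidence limit."""
    return 1 - alpha ** (1 / n)

def rule_numerator(alpha):
    """Numerator k of the large-n approximation k/n, from k = -log_e(alpha)."""
    return -math.log(alpha)

n = 20
exact = exact_upper_limit(n)   # exact 95% limit for n = 20
approx = 3 / n                 # Rule of Three approximation

print(f"exact limit:  {exact:.4f}")
print(f"Rule of Three: {approx:.4f}")
print(f"numerator at 95%:    {rule_numerator(0.05):.3f}")
print(f"numerator at 99.75%: {rule_numerator(0.0025):.3f}")
```

The approximate limit exceeds the exact one for every n, which is why the simple interval [0, 3/n] errs on the cautious side.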
The main constraint on the number of cases in each row is that the count be large enough to satisfy the sample size condition, which assures that we expect at least 10 in each cell. (This rule corresponds to the sample size condition used for proportions, n p ≥ 10 and n (1-p) ≥ 10.) The sample size within the rows does not determine the degrees of freedom for the chi-squared statistic; that's always (#rows-1)(#columns-1).

2. No. These are 84 p-values, not 84 population parameters. For products with p-values less than 0.025, we anticipate finding association in the population. That may not be the case. Recall that the p-value accepts a chance for a false-positive result. It could be the case that H0: no association is true, but the p-value is less than 0.025. In fact, if H0 holds, 2.5% of p-values will be less than 0.025. Since we have 650 p-values, we can expect about 0.025*650 ≈ 16 p-values to be less than 0.025 by chance alone. We just don't know which these are!

3. The analysis shows whether customer preferences for items (such as color preferences) depend on where we find the customers. If color choice and location are associated, then we may choose to stock different color mixes in the different locations. On the other hand, if the color choices are independent of the location, we can manage the stock in all of the locations similarly.

Part IV

Analyzing Experiments

1. The estimates in Table 4 imply that the change in sales in the Midwest if ads feature small labor costs is D(Midwest) + D(Sm Labor) + Interaction = $57.5.

The easy way to get this answer is to recognize that the fit of the anova regression is the mean of the cell in Table 5 that combines Midwest and Small Labor, $57.5 (third row, first column).

2. The standard errors reflect the sample sizes and the balanced layout of the table. We observe 10 cases for each of the 12 combinations of Region and Advertising, with 30 observations for each region and 40 for each ad type. Intuitively, it is easy to anticipate that the interactions are less well determined (higher SE) since they rely on combinations. To see why the coefficients of the dummy variables for region and advertising type have the same standard errors even though they are associated with different numbers of cases, write out several equations of the fitted model. In particular, the intercept of the fitted model is the mean for the Total price in the West, ŷ = b0. The fitted value for Total price in the Midwest is ŷ = b0 + b_Midwest. Hence, the coefficient of D(Midwest) is the difference between the means of these two cells of Table 1, which matches the slope of D(Midwest) in Table 4 on page 733. Analogously, the fitted value for Small Parts in the West is ŷ = b0 + b_SmParts. Hence, the coefficient for D(SmParts) is the difference between the mean of Total price in the West and the mean of Small Parts in the West in Table 1 (a difference of 27.4). Thus, the estimated slopes of the dummy variables for region and ad type are differences between pairs of means in Table 1. The variance of the difference between two means each estimated from 10 cases is σ^2 (1/10 + 1/10) = σ^2/5 (see Chapter 18), so the standard error is σ/sqrt(5). Using s_e to estimate σ, we obtain SE = s_e/sqrt(5) ≈ 80.6, as shown in Table 4 for the slopes for advertising and region.

3. The interaction remains, but the plot now changes the roles of ad type and region. Rather than join averages associated with the same type of ad, join averages from the same region. Here's the plot.
[Plot: cell averages by ad type, with lines joining the averages from the same region]
The fact that the lines that join the means from each region are not parallel again shows the presence of the interaction.

4. Use the s_e from the shown model as the estimate of σ. Then estimate the SE for the difference between any pair of means in Table 1 as sqrt(2) s_e/sqrt(10). Within a region, we have 3 means to compare, so there are 3 pairwise comparisons. If we want to keep an overall alpha level of 0.05, we can test each comparison at level 0.05/3 ≈ 0.0167 and use the corresponding z value. Hence, in order to be statistically significantly different, the absolute value of the difference between means within a row of Table 1 must be at least this z value times the SE. In the Northeast, only the difference between Small Labor and Small Parts exceeds this threshold.
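The adjustment for three pairwise comparisons in answer 4 can be sketched with the Python standard library. This is an illustration, not the book's calculation: the SE of 80.6 is the value quoted for the slopes in Table 4, the use of a two-sided z value is an assumption, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def se_difference(s_e, n_per_cell):
    """SE of the difference between two means, each from n_per_cell cases."""
    return s_e * sqrt(2 / n_per_cell)

def pairwise_threshold(se_diff, n_comparisons, overall_alpha=0.05):
    """Split alpha evenly across the comparisons and return
    (alpha per test, z value, smallest significant absolute difference)."""
    alpha_each = overall_alpha / n_comparisons
    # ASSUMPTION: two-sided test, so alpha_each / 2 goes in each tail.
    z = NormalDist().inv_cdf(1 - alpha_each / 2)
    return alpha_each, z, z * se_diff

SE_DIFF = 80.6  # quoted in the text for the region and ad-type slopes
alpha_each, z_value, threshold = pairwise_threshold(SE_DIFF, n_comparisons=3)

print(f"alpha per comparison: {alpha_each:.4f}")
print(f"z value: {z_value:.2f}")
print(f"minimum significant difference: {threshold:.1f}")
```

With 10 cases per cell, `se_difference(s_e, 10)` equals s_e/sqrt(5), which is the same quantity as sqrt(2) s_e/sqrt(10).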

5. We deleted a random selection of 15 cases from the data table. Here's the corresponding table of estimates. In general, the estimates are similar, but the consistency of the standard errors is lost. (Most of the standard errors are slightly larger due to the smaller sample size.)

Intercept <.0001*
Region[Northeast]
Region[South]
Region[Midwest]
Price Partition[Small labor]
Price Partition[Small Parts]
Price Partition[Small labor]*Region[Northeast] *
Price Partition[Small labor]*Region[South] *
Price Partition[Small labor]*Region[Midwest]
Price Partition[Small Parts]*Region[Northeast]
Price Partition[Small Parts]*Region[South]
Price Partition[Small Parts]*Region[Midwest] *

Automated Modeling

1. The following table shows the coefficient estimates for the stepwise model including D(Rush). The estimate is negative, indicating that rush jobs (given the other characteristics in the model) tend to be less costly.

Intercept <.0001*
Labor Hours *
Breakdown/Unit <.0001*
Total Metal Cost <.0001*
Temp Deviation <.0001*
Plant[NEW] *
1/Units *
D[Rush] *

Table 2 explains why the fitted model makes it appear that, both marginally and within the regression, rushed jobs are cheaper to produce. The explanation is that rushed jobs tend to be simpler jobs, lacking detail.

2. If you add Room Temp to the shown regression, the resulting fit is

Intercept
Labor Hours <.0001*
Breakdown/Unit <.0001*
Total Metal Cost <.0001*
Temp Deviation <.0001*
Plant[NEW] *
Room Temp

The net effect for room temperature combines the linear term in Temp with the quadratic term 0.0356 (Temp-75)^2. You can table and chart this function of temperature in a spreadsheet to find the minimum value (or use calculus). Adding the linear trend shifts the optimum low-cost temperature from 75 down to about 73.1 (75 minus the Room Temp coefficient divided by 2*0.0356).
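Both routes to the low-cost temperature in answer 2 (calculus and tabulating the function) can be sketched as follows. The quadratic coefficient 0.0356 and the center 75 come from the text; the Room Temp coefficient is missing here, so `LINEAR_COEF = 0.135` is a hypothetical value back-solved to give an optimum near 73.1, and the function names are illustrative.

```python
QUAD_COEF = 0.0356    # coefficient of (Temp - 75)^2, from the text
CENTER = 75.0         # temperature at which the quadratic term vanishes
# HYPOTHETICAL: the Room Temp coefficient is not legible in the text; this
# value is back-solved so that the optimum lands near the quoted 73.1.
LINEAR_COEF = 0.135

def net_effect(temp, linear=LINEAR_COEF, quad=QUAD_COEF, center=CENTER):
    """Combined linear and quadratic effect of room temperature on cost."""
    return linear * temp + quad * (temp - center) ** 2

def optimal_temp(linear=LINEAR_COEF, quad=QUAD_COEF, center=CENTER):
    """Set the derivative linear + 2*quad*(temp - center) to zero."""
    return center - linear / (2 * quad)

# Spreadsheet-style check: tabulate on a fine grid and take the minimum.
grid = [60 + 0.01 * i for i in range(3001)]   # 60.00 to 90.00 degrees
grid_best = min(grid, key=net_effect)

print(f"calculus optimum: {optimal_temp():.2f}")
print(f"grid optimum:     {grid_best:.2f}")
```

A positive linear coefficient pulls the minimum below 75, and a larger coefficient pulls it further; with no linear term the optimum stays at 75.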

3. No. The backward elimination does not have to reach the same model as found by the forward search. The lack of agreement is usually caused by collinearity. For this example, we will start from the saturated model (which does not include Plant since it is redundant with Manager) and remove variables one at a time. For consistency with the forward stepwise analysis, we remove a variable at each step if its p-value is larger than the threshold we used in the forward selection, p-to-remove = p-to-enter = 0.05/26 ≈ 0.0019. The following tables summarize the model we found with backward stepwise search.

R^2
s_e

Analysis of Variance
Source, DF, Sum of Squares, Mean Square, F Ratio
Model
Error
C. Total
Prob > F <.0001*

Parameter Estimates
Intercept <.0001*
Labor Hours <.0001*
Breakdown/Unit *
Total Metal Cost <.0001*
Temp Deviation *

The model is the same as that found from the forward search except for the plant effect. We could not include it in the saturated model along with the manager effect!

4. This is a hard question to answer in regression modeling in general. What does it mean if a variable is not among the explanatory variables in a regression? Most importantly, recognize that it does not mean that the omitted variable is unrelated to the response. The number of machine hours is statistically significantly correlated with costs, but not once we adjust for other variables, such as labor hours, that are correlated with machine hours. If we add machine hours to the stepwise regression, we get the following summary of estimates.

Intercept <.0001*
Labor Hours <.0001*
Breakdown/Unit <.0001*
Total Metal Cost <.0001*
Temp Deviation <.0001*
Plant[NEW] *
Machine Hours

The estimated effect is not statistically significant. The wide confidence interval, however, reminds us that machine hours could have quite an impact even after adjusting for the other variables. The CI is approximately the estimate ± 2(45.13), or about -109 to 71 dollars per hour.
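The "estimate ± 2 SE" interval for Machine Hours can be written out as a short sketch. The standard error 45.13 is from the text; the point estimate is not legible, so `ESTIMATE = -19.0` is a hypothetical value taken as the midpoint of the quoted interval, and the function name is illustrative.

```python
def approx_ci(estimate, se, multiplier=2.0):
    """Approximate 95% confidence interval: estimate plus or minus 2 SE."""
    half_width = multiplier * se
    return estimate - half_width, estimate + half_width

SE_MACHINE_HOURS = 45.13   # standard error quoted in the text
# HYPOTHETICAL: the coefficient itself is missing from the text; -19 is the
# midpoint of the quoted interval (-109 to 71) and is used only to illustrate.
ESTIMATE = -19.0

lower, upper = approx_ci(ESTIMATE, SE_MACHINE_HOURS)
covers_zero = lower < 0 < upper   # True means not statistically significant

print(f"approximate CI: {lower:.0f} to {upper:.0f} dollars per hour")
print(f"interval includes zero: {covers_zero}")
```

An interval that includes zero is consistent with the lack of statistical significance, while its width of roughly 180 dollars per hour shows how much uncertainty about the effect remains.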
