Copyright 2011 Pearson Education, Inc. Publishing as Addison-Wesley.

Appendix: Statistics in Action

Part I

Financial Time Series

1. These data show the effects of stock splits. If you investigate further, you'll find that most of these splits (such as the one in May 1970) are 3-for-1 splits rather than the 2-for-1 splits seen in the text example. You can recognize the May 1970 split as a 3-for-1 split by observing that the price is $97.50 at the end of April 1970 and falls to $30.625 (about 1/3 of the prior price) at the end of May 1970.

2. a. The White Space Rule says that this figure doesn't show much of what is happening in the first three-quarters of the time period, up through about 1985 or 1990. This portion of the data is hidden because of the large values produced by compounding in the later years.

b. No. These data have a strong pattern of dependence that would be concealed by a histogram.

3. The variation in the month-to-month changes lacks a clear pattern, and a histogram is a helpful summary of such variation.

4. The percentage changes show the gains produced by an investment one month at a time. Cumulative values hide the month-to-month variation in the performance of the stock behind the effects of compounding.

5. a. 0.0755. A return is 1/100 times the corresponding percentage change, so the SD of the returns is 1/100 times the SD of the percentage changes.

b. If we accept the Empirical Rule, then we can find the number of standard deviations that separate the mean from -2.5%. A drop of 2.5% amounts to (-2.5 - 1.55)/7.55 ≈ -0.54, or about half an SD below the mean. That's not very unusual based on the Empirical Rule, which we should check by inspecting the histogram of percentage changes. A simpler approach might be better than relying on the Empirical Rule: find the percentage of months with a drop of 2.5% or more. This approach is not without assumptions; we still need to make sure that there are no patterns (dependence) in the time series of percentage changes. (Use the visual test for association.)

c. Let's find the VaR at 2.5%, excluding the worst 2.5% of outcomes. Using the Empirical Rule, the stock could change by its mean minus two standard deviations, or by 1.55 - 2(7.55) = -13.55%. That's a loss of $135.50.
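Here is a short sketch of how the calculations in 5(b) and 5(c) might be reproduced. The mean (1.55) and SD (7.55) of the monthly percentage changes come from the discussion above; the $1,000 position size is an assumption chosen to match the $135.50 loss.

```python
# Sketch of the Empirical Rule calculations in 5(b) and 5(c).
# Mean and SD of monthly percentage changes are taken from the text;
# the $1,000 position size is an assumption for illustration.
mean_pct = 1.55   # mean monthly % change
sd_pct   = 7.55   # SD of monthly % changes
position = 1000   # hypothetical investment, in dollars

# (b) How unusual is a 2.5% drop?
z = (-2.5 - mean_pct) / sd_pct
print(f"z-score of a 2.5% drop: {z:.2f}")            # about -0.54 SDs below the mean

# (c) VaR at 2.5%: by the Empirical Rule, roughly 2.5% of months fall
# more than 2 SDs below the mean.
worst_pct = mean_pct - 2 * sd_pct                    # = -13.55%
var_dollars = -worst_pct / 100 * position
print(f"Monthly VaR(2.5%): ${var_dollars:.2f}")      # about $135.50
```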
Executive Compensation

1. Make sure that the histogram is bell-shaped and based on data that do not show a pattern.

2. The skewness in incomes persists after removing outliers. Trimming outliers from a skewed distribution seldom improves its shape; it only sacrifices data for little or no change in the shape of the distribution.

3. You can convert to base-10 logs from natural (base-e) logs by the relation log_e x ≈ 2.303 log_10 x, where the constant 2.303 ≈ log_e 10. As a result, all log transformations have the same effect on the shape of a histogram; they differ only in scale.

4. The value is closer to the median, not the mean. The mean of compensation is $4.268 million, and the median is $2.533 million. The average of the log_10 compensation is about 6.411, and taking the antilog gives 10^6.411 ≈ $2.576 million. Working on a log scale down-weights the effects of outliers, so 10 raised to the mean of the base-10 logs is closer to the median. There's an important lesson here: logs and averaging do not commute. The average of the log of a variable is less than the log of the average.
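A quick numerical illustration of that last point (logs and averaging do not commute). The data below are simulated from a lognormal distribution with arbitrary parameters, not the actual compensation data, so only the qualitative comparison matters.

```python
import numpy as np

# Simulated right-skewed data standing in for the compensation figures;
# the lognormal parameters are arbitrary choices for illustration.
rng = np.random.default_rng(7)
comp = rng.lognormal(mean=1.0, sigma=1.2, size=10_000)

mean_comp           = comp.mean()
median_comp         = np.median(comp)
antilog_of_mean_log = 10 ** np.log10(comp).mean()

print(f"mean            : {mean_comp:.3f}")
print(f"median          : {median_comp:.3f}")
print(f"10^(mean log10) : {antilog_of_mean_log:.3f}")
# The antilog of the average log lands near the median and below the mean:
# averaging on a log scale down-weights the large outliers.
```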

Part II

Dice Simulation

1. Luck! The mean for Red is astonishingly large: 1.71^20 ≈ 45,700. For most investors, however, Red loses value because of its large volatility. For some, though, there remains a chance of winning a lot with Red. All you have to do is keep rolling those 4s, 5s, and 6s.

2. The 50-50 mix used to form Pink in the example is not the best ratio, but we have to define what we mean by "best." Given the discussion of this example, we might, for instance, choose to find the mix that has the largest long-run expected value. That is, pick the fraction x to keep in Red in order to maximize

   V(x) = [x(0.71) + (1-x)(0.008)] - (1/2)[x^2(1.7424) + (1-x)^2(0.0016)]

where the first bracket is the mean of the mix and the second is its variance. That's a basic calculus problem (or you could use the Excel Solver). The derivative of V(x) is

   V'(x) = 0.71 - 0.008 - 1.7424x + 0.0016(1-x)

Setting V'(x) = 0 gives x ≈ 0.403. That is, keep about 40% of your wealth in Red with the rest in White. The long-run return grows from about 14% (0.141, Table 6) to about 15% (V(0.403) ≈ 0.149).
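The same optimum can be found numerically instead of with calculus or the Solver. This sketch uses only the means and variances quoted above (0.71 and 1.7424 for Red, 0.008 and 0.0016 for White).

```python
import numpy as np

# Long-run return of a mix that keeps a fraction x in Red and (1-x) in White:
# mean of the mix minus half of its variance (the volatility-drag formula from the text).
def V(x):
    mean = x * 0.71 + (1 - x) * 0.008
    var  = x**2 * 1.7424 + (1 - x)**2 * 0.0016
    return mean - 0.5 * var

x = np.linspace(0, 1, 100_001)
best = x[np.argmax(V(x))]
print(f"best fraction in Red: {best:.3f}")      # about 0.403
print(f"long-run return     : {V(best):.3f}")   # about 0.149, versus 0.141 for the 50-50 Pink mix
```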
3. The investments perform independently of one another. What happens to Red, for instance, has no effect on what happens to either Green or White. Real investments are all affected by the health of the general economy.

4. Method (a) does not compound whereas method (b) does, and that makes a big difference in the results. Here are the details. We can do a very complete analysis of method (a) since it amounts to summing independent random variables. The expected value of the random variable A, which represents the outcome of a $1 bet using method (a), is

   E(A) = (0.7 + 0.8 + 0.9 + 1.1 + 1.2 + 1.5)/6 = 1.0333

On average you win 3.33% of the amount wagered. The variance of A is

   Var(A) = E(A^2) - (E A)^2 = (0.7^2 + 0.8^2 + 0.9^2 + 1.1^2 + 1.2^2 + 1.5^2)/6 - 1.0333^2 = 0.1067

Because each round is independent, the expected outcome of 100 rounds with $1 wagered each time is 100(1.0333) = 103.33, and the variance of the total is 100(0.1067) = 10.67 (SD ≈ 3.266). The CLT then implies that you're likely to win, but not very much. For example, the chance of winning at this game is

   P(A_1 + A_2 + ... + A_100 > 100) ≈ P(N(0,1) > (100 - 103.33)/3.266 ≈ -1.02) ≈ 0.846

Method (b) compounds, as in the dice game: the amount won in each round affects the amount won in the next. The expected position after 100 rounds (starting from a $1 initial wager) is 1.0333^100 ≈ 26.5. Compounding suggests a large gain, but compounding also means that volatility drag enters the picture. Using the expression from the text (mean minus half of the variance), the long-run return of this approach is

   long-run return = 0.0333 - (1/2)(0.1067) ≈ -0.02

That is, the volatility will gradually reduce the value of the investment. Typically, strategy (b) loses about 2% each round.

M&Ms

Here's another way to think about the method for weighing by counts. We are going to compute the upper 99.5% point of the distribution of the weight of a package that has too few pieces. If the weight of a package is more than this percentile, then there is a very high probability that the package has enough pieces.

1. Follow the illustration for the bolts from the text. We need to find the weight x such that (with T_59 the total weight of 59 M&Ms and using normality)

   P(T_59 < x) = P(Z < (x - 59(0.86))/(0.04 sqrt(59))) = 0.995

The expression on the right-hand side of the inequality has to equal the 99.5th percentile of a standard normal distribution, 2.5758:

   (x - 59(0.86))/(0.04 sqrt(59)) = 2.5758

Solving for x gives x = 59(0.86) + 2.5758(0.04)sqrt(59) ≈ 51.53 grams. That is, 99.5% of packages with 59 pieces weigh less than 51.53 grams. A package that weighs more than this signals that it has more than 59 pieces. A bag with 60 pieces weighs slightly more on average, 60(0.86) = 51.6 grams. Hence, many bags with 60 pieces will fall below the 51.53-gram threshold and get an extra piece. (See the answer to #3 for the probability.)

2. We need to assume that the weights of individual candies are normally distributed, whereas for #1 we could appeal to the central limit theorem. As in the solution of #1, let T_9 denote the weight of 9 M&Ms. We need to find the weight threshold x so that

   P(T_9 < x) = P(Z < (x - 9(0.86))/(0.04 sqrt(9))) = 0.995

The expression on the right has to equal 2.5758, or

   (x - 9(0.86))/(0.04 sqrt(9)) = 2.5758

and x = 9(0.86) + 2.5758(0.04)sqrt(9) ≈ 8.05 grams. That's the upper 99.5% point of the weights of bags with 9 pieces. A bag with 10 pieces weighs 10(0.86) = 8.6 grams on average. We won't have to add as many extra pieces as in #1 (see the answer to #3).

3. The probability that we have to put in more than enough to fill the bag with 60 pieces (that is, the chance that a bag with 60 pieces weighs less than the acceptance threshold found in #1) is

   P(T_60 < 51.53) = P(Z < (51.53 - 60(0.86))/(0.04 sqrt(60))) = P(Z < -0.22) ≈ 0.41

The chance that a bag with 10 pieces weighs less than the threshold from #2 is

   P(T_10 < 8.05) = P(Z < (8.05 - 10(0.86))/(0.04 sqrt(10))) = P(Z < -4.35), which is essentially zero.

There's less chance of putting in extra pieces in these small bags. On the other hand, we need to assume that the weights are normally distributed. (The result works out like this for M&Ms because the variation in the weight of a piece is so small relative to the mean; i.e., the coefficient of variation is small.)
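Here is a small sketch of the threshold calculation in #1 and the follow-up probability in #3, using the per-piece mean of 0.86 grams and SD of 0.04 grams quoted above.

```python
from math import sqrt
from scipy.stats import norm

mu, sigma = 0.86, 0.04        # per-piece mean and SD (grams), from the text

# Threshold: 99.5th percentile of the total weight of a 59-piece bag.
n = 59
threshold = n * mu + norm.ppf(0.995) * sigma * sqrt(n)
print(f"weight threshold for 59 pieces: {threshold:.2f} g")   # about 51.53 g

# Probability that a 60-piece bag still falls below that threshold
# (and so gets an extra piece), assuming normally distributed totals.
p_extra = norm.cdf((threshold - 60 * mu) / (sigma * sqrt(60)))
print(f"P(60-piece bag below threshold): {p_extra:.2f}")      # roughly 0.4
```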

Part III

Rare Events

1. Reverse the definition of success and failure. If all of the results were successful, then the approximate 95% confidence interval for the population proportion of successes runs from 1 - 3/n to 1.

2. Raising the level of confidence requires solving a slightly different equation. The solution p* found in the text solves

   (1 - p*)^n = 0.05, which implies that p* = 1 - 0.05^(1/n)

If we set the level of confidence to 99.75%, then the new confidence limit q* is

   q* = 1 - 0.0025^(1/n)

We can also find an approximation for q* using the same procedure described in Behind the Math that gives the Rule of Three its name. Following that argument, write q* = x/n. It follows that x = -log_e 0.0025 ≈ 6. That means we get the higher level of confidence by using a "Rule of Six": the 99.75% confidence interval is approximately [0, 6/n].

3. The Rule of Three is designed to work for large values of n. If n = 20, then the exact value for the upper limit of the 95% confidence interval is (as in Question 2)

   p* = 1 - 0.05^(1/n) = 1 - 0.05^(1/20) ≈ 0.139

The approximate value is larger, namely 3/20 = 0.15. The implication is that the 95% confidence interval [0, 3/n] is conservative, or what you might call cautious: the coverage of this simple interval is larger than the nominal 95% level.
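A quick numerical check of the exact upper limit against the Rule of Three and Rule of Six approximations, in the all-successes setting described above.

```python
from math import log

def exact_upper_limit(n, alpha):
    """Exact upper limit of the 100*(1-alpha)% interval when all n trials succeed
    (here 'success' means the rare event did not occur)."""
    return 1 - alpha ** (1 / n)

def rule_of_thumb(n, alpha):
    """Approximation -ln(alpha)/n: alpha=0.05 gives the Rule of Three,
    alpha=0.0025 gives the Rule of Six."""
    return -log(alpha) / n

for n in (20, 100, 1000):
    print(n,
          round(exact_upper_limit(n, 0.05), 4),   round(rule_of_thumb(n, 0.05), 4),
          round(exact_upper_limit(n, 0.0025), 4), round(rule_of_thumb(n, 0.0025), 4))
# For n = 20 the exact 95% limit is about 0.139 versus 3/20 = 0.15,
# so the simple [0, 3/n] interval is conservative.
```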
Testing Association

1. No. The fact that we have 200 samples from each location makes it easier to compare the proportions in each case, but that's only convenient, not necessary. The main constraint on the number of cases in each row is that the count be large enough to satisfy the sample size condition, which requires an expected count of at least 10 in each cell. (This rule corresponds to the sample size condition used for proportions, n*p ≥ 10 and n*(1-p) ≥ 10.) The sample size within the rows does not determine the degrees of freedom for the chi-squared statistic; that's always (#rows - 1)(#columns - 1).

2. No. These are 84 p-values, not 84 population parameters. For products with p-values less than 0.025, we anticipate finding association in the population. That may not be the case. Recall that the p-value accepts a chance of a false-positive result. It could be that H_0: no association is true, and yet the p-value is less than 0.025. In fact, if H_0 holds, 2.5% of p-values will be less than 0.025. Since we have 650 p-values, we can expect about 0.025(650) ≈ 16 p-values to be less than 0.025 by chance alone. We just don't know which ones these are!
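To see how many "significant" chi-squared tests to expect by chance alone, the count of false positives can be treated as a binomial random variable. The sketch below assumes, purely for illustration, that the 650 tests are independent.

```python
from scipy.stats import binom

n_tests, alpha = 650, 0.025
expected = n_tests * alpha
print(f"expected false positives: {expected:.1f}")   # about 16

# If the tests were independent, the number of false positives would be
# Binomial(650, 0.025); a central 95% range for that count:
lo, hi = binom.ppf(0.025, n_tests, alpha), binom.ppf(0.975, n_tests, alpha)
print(f"95% range for the count: {lo:.0f} to {hi:.0f}")
```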

3. The analysis shows whether customer preferences for items (such as color preferences) depend on where we find the customers. If color choice and location are associated, then we may choose to stock different color mixes in the different locations. On the other hand, if color choices are independent of location, we can manage the stock in all of the locations similarly.

Part IV

Analyzing Experiments

1. The estimates in Table 4 imply that the change in sales in the Midwest if ads feature small labor costs is

   D(Midwest) + D(Sm Labor) + Interaction = -17.7 - 106.9 + 182.1 = $57.5

The easy way to get this answer is to recognize that the fit of the ANOVA regression is the mean of the cell in Table 5 that combines Midwest and Small Labor, $57.5 (third row, first column).

2. The standard errors reflect the sample sizes and the balanced layout of the table. We observe 10 cases for each of the 12 combinations of Region and Advertising, with 30 observations for each region and 40 for each ad type. Intuitively, it is easy to anticipate that the interactions are less well determined (higher SE) since they rely on combinations. To see why the coefficients of the dummy variables for region and advertising type have the same standard errors even though they are associated with different numbers of cases, write out several equations of the fitted model. In particular, the intercept of the fitted model is the mean for Total price in the West, ŷ = b_0. The fitted value for Total price in the Midwest is ŷ = b_0 + b_Midwest. Hence, the coefficient of D(Midwest) is the difference between the means of these two cells of Table 1 (291.1 - 308.8 = -17.7, matching the slope of D(Midwest) in Table 4 on page 733). Analogously, the fitted value for Small Parts in the West is ŷ = b_0 + b_SmParts, so the coefficient of D(SmParts) is the difference between the mean for Small Parts in the West and the mean for Total price in the West (336.2 - 308.8 = 27.4). Thus, the estimated slopes of the dummy variables for region and ad type are differences between pairs of means in Table 1. The standard error of the difference between two means, each estimated from 10 cases, is sqrt(σ^2 (1/10 + 1/10)) = σ/sqrt(5) (see Chapter 18). Using s_e to estimate σ, we obtain SE = 180.156/sqrt(5) ≈ 80.6, as shown in Table 4 for the slopes for advertising and region.

3. The interaction remains, but the plot now changes the roles of ad type and region. Rather than join averages associated with the same type of ad, join averages from the same region. In that plot (not reproduced here), the lines that join the means from each region are not parallel, which again shows the presence of the interaction.

4. Use the s_e from the shown model, s_e = 180.156, as the estimate of σ. Then estimate the SE for the difference between any pair of means in Table 1 as s_e sqrt(2/10) ≈ 80.57. Within a region, we have 3 means to compare, so there are 3 pairwise comparisons. If we want to keep an overall alpha level of 0.05, we can test each comparison at level 0.05/3 ≈ 0.01667, with z value 2.39. Hence, in order to be judged different, the absolute value of the difference between means within a row of Table 1 must be at least 2.39(80.57) ≈ 192.56. In the Northeast, only the difference between Small Labor and Small Parts exceeds this threshold.
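A short sketch of the threshold calculation in #4, using s_e = 180.156 and 10 observations per cell as given above.

```python
from math import sqrt
from scipy.stats import norm

s_e, n_per_cell = 180.156, 10
se_diff = s_e * sqrt(2 / n_per_cell)          # SE of a difference of two cell means
print(f"SE of a difference: {se_diff:.2f}")   # about 80.6

# Bonferroni adjustment for 3 pairwise comparisons within a region,
# keeping the overall two-sided alpha at 0.05.
alpha, n_comparisons = 0.05, 3
z = norm.ppf(1 - alpha / n_comparisons / 2)
print(f"z threshold        : {z:.2f}")        # about 2.39
print(f"minimum difference : {z * se_diff:.1f}")   # about 193 (192.56 with z rounded to 2.39)
```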

5. We deleted a random selection of 15 cases from the data table. Here's the corresponding table of estimates. In general, the estimates are similar, but the consistency of the standard errors is lost. (Most of the standard errors are slightly larger because of the smaller sample size.)

   Term                                            Estimate     Std Error   t Ratio   Prob>|t|
   Intercept                                        308.8        57.7404     5.35     <.0001*
   Region[Northeast]                                -86.35556    83.89485   -1.03      0.3060
   Region[South]                                    -12.2        81.65725   -0.15      0.8816
   Region[Midwest]                                   -2.133333   83.89485   -0.03      0.9798
   Price Partition[Small labor]                     -93.24444    83.89485   -1.11      0.2692
   Price Partition[Small Parts]                      12.866667   83.89485    0.15      0.8784
   Price Partition[Small labor]*Region[Northeast]   262.35556   120.1962     2.18      0.0316*
   Price Partition[Small labor]*Region[South]       462.24444   117.0737     3.95      0.0002*
   Price Partition[Small labor]*Region[Midwest]     163.68889   120.1962     1.36      0.1765
   Price Partition[Small Parts]*Region[Northeast]  -115.0254    124.5212    -0.92      0.3580
   Price Partition[Small Parts]*Region[South]       -45.86667   130.5381    -0.35      0.7261
   Price Partition[Small Parts]*Region[Midwest]     342.02222   120.1962     2.85      0.0055*

Automated Modeling

1. The following table shows the coefficient estimates for the stepwise model including D(Rush). The estimate is negative, indicating that rush jobs (given the other characteristics in the model) tend to be less costly.

   Term               Estimate      Std Error   t Ratio   Prob>|t|
   Intercept           15.104066     2.086741    7.24     <.0001*
   Labor Hours         18.453537     6.392942    2.89      0.0044*
   Breakdown/Unit     369.33108     65.34033     5.65     <.0001*
   Total Metal Cost     2.5884003    0.411672    6.29     <.0001*
   Temp Deviation       0.037684     0.007204    5.23     <.0001*
   Plant[NEW]           4.6362731    1.199853    3.86      0.0002*
   1/Units            821.6258     276.2057      2.97      0.0033*
   D[Rush]             -2.1082262    0.995937   -2.12      0.0356*

Table 2 explains why the fitted model makes it appear, both marginally and within the regression, that rushed jobs are cheaper to produce: rushed jobs tend to be simpler jobs, lacking detail.

2. If you add Room Temp to the shown regression, the resulting fit is

   Term               Estimate      Std Error   t Ratio   Prob>|t|
   Intercept            4.7161224    5.913345    0.80      0.4261
   Labor Hours         34.931763     4.086292    8.55     <.0001*
   Breakdown/Unit     330.01424     65.48335     5.04     <.0001*
   Total Metal Cost     2.1925659    0.403624    5.43     <.0001*
   Temp Deviation       0.03561      0.007328    4.86     <.0001*
   Plant[NEW]           4.3867225    1.221315    3.59      0.0004*
   Room Temp            0.1332053    0.073374    1.82      0.0710

The net effect of room temperature is 0.133 Temp + 0.0356 (Temp - 75)^2. You can table and chart this function of temperature in a spreadsheet to find the minimum value (or use calculus). Adding the linear trend shifts the optimum low-cost temperature from 75 down to 75 - 0.133/(2(0.0356)) ≈ 73.1.
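The shift in the optimum temperature can be checked by minimizing the quadratic net effect directly. The sketch below uses the coefficient estimates from the table above.

```python
import numpy as np

b_linear, b_quad, center = 0.1332, 0.0356, 75   # estimates from the fitted model above

def net_cost_effect(temp):
    # Net effect of temperature on cost: linear term plus quadratic deviation term.
    return b_linear * temp + b_quad * (temp - center) ** 2

# Closed form: the quadratic is minimized at center - b_linear / (2 * b_quad).
t_star = center - b_linear / (2 * b_quad)
print(f"optimal temperature (formula): {t_star:.1f}")   # about 73.1

# Grid check over a plausible range of room temperatures.
grid = np.linspace(65, 85, 2001)
print(f"optimal temperature (grid)   : {grid[np.argmin(net_cost_effect(grid))]:.1f}")
```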

3. No. Backward elimination does not have to reach the same model as found by the forward search. The lack of agreement is usually caused by collinearity. For this example, we start from the saturated model (which does not include Plant, since it is redundant with Manager) and remove variables one at a time. For consistency with the forward stepwise analysis, we remove a variable at each step if its p-value is larger than the threshold used in the forward selection, p-to-remove = p-to-enter = 0.05/26 ≈ 0.00192. The following tables summarize the model found by the backward stepwise search.

   R-squared   0.425376
   s_e         6.849773

   Analysis of Variance
   Source      DF   Sum of Squares   Mean Square   F Ratio   Prob > F
   Model        4         6633.987       1658.50   35.3478   <.0001*
   Error      191         8961.605         46.92
   C. Total   195        15595.591

   Parameter Estimates
   Term               Estimate      Std Error   t Ratio   Prob>|t|
   Intercept           17.867943     1.987962    8.99     <.0001*
   Labor Hours         34.005724     4.225423    8.05     <.0001*
   Breakdown/Unit     225.8511      60.80471     3.71      0.0003*
   Total Metal Cost     2.1407105    0.418003    5.12     <.0001*
   Temp Deviation       0.0249993    0.00681     3.67      0.0003*

The model is the same as that found from the forward search except for the plant effect, which could not be included in the saturated model along with the manager effect.

4. This is a hard question to answer in regression modeling in general. What does it mean if a variable is not among the explanatory variables in a regression? Most importantly, recognize that it does not mean that the omitted variable is unrelated to the response. The number of machine hours is statistically significantly correlated with costs, but not once we adjust for other variables, such as labor hours, that are correlated with machine hours. If we add machine hours to the stepwise regression, we get the following summary of estimates.

   Term               Estimate      Std Error   t Ratio   Prob>|t|
   Intercept           15.043869     2.214395    6.79     <.0001*
   Labor Hours         34.921194     4.414549    7.91     <.0001*
   Breakdown/Unit     337.88525     66.51137     5.08     <.0001*
   Total Metal Cost     2.2668504    0.416594    5.44     <.0001*
   Temp Deviation       0.0367458    0.007361    4.99     <.0001*
   Plant[NEW]           4.2237607    1.42354     2.97      0.0034*
   Machine Hours      -18.81704     45.12525    -0.42      0.6772

The estimated effect is not statistically significant. The wide confidence interval, however, reminds us that machine hours could have quite an impact even after adjusting for the other variables. The CI is approximately -18.82 ± 2(45.13), or about -109 to 71 dollars per hour.
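A minimal check of that interval, using the Machine Hours estimate and standard error from the table above.

```python
est, se = -18.817, 45.125   # Machine Hours estimate and SE from the table above

# Approximate 95% confidence interval: estimate plus or minus 2 standard errors.
lo, hi = est - 2 * se, est + 2 * se
print(f"approximate 95% CI: {lo:.0f} to {hi:.0f} dollars per machine hour")   # about -109 to 71
```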