STATISTICS 110/201, FALL 2017
Homework #5 Solutions
Assigned Mon, November 6, Due Wed, November 15

For this assignment use the Diamonds dataset in the Stat2Data library. The dataset is used in examples in Sections 3.4 and 3.5 of the book, and for part of the Discussion on Friday, November 3. The pages from the book with the exercises below are linked to the course webpage along with this assignment for those who don't have a copy of the book.

1. Do Exercise 3.23 on page 157. For each model in parts a to d, show the results of "summary(model)" from R, then choose a model and explain your choice, as instructed. R instructions for creating second order models and adding quadratic (squared) terms and interactions were given in Discussion 5 (Nov 3) and Lecture 11 (Nov 6).

a.
> summary(depth)
Call:
lm(formula = TotalPrice ~ Depth + I(Depth^2))

Residuals:
   Min     1Q Median     3Q    Max
 -9323  -4251  -2676   2134  45513

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -28406.783 112211.790  -0.253    0.800
Depth          766.369   3353.222   0.229    0.819
I(Depth^2)      -3.233     24.869  -0.130    0.897

Residual standard error: 7616 on 348 degrees of freedom
Multiple R-squared: 0.04748, Adjusted R-squared: 0.042
F-statistic: 8.673 on 2 and 348 DF, p-value: 0.0002111

b.
> summary(both)
Call:
lm(formula = TotalPrice ~ Carat + Depth)

Residuals:
    Min      1Q  Median      3Q     Max
-9234.7 -1223.7  -274.3  1161.0 16368.6

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1059.24    1918.36   0.552    0.581
Carat       15087.01     320.96  47.006  < 2e-16 ***
Depth        -134.94      30.92  -4.364 1.68e-05 ***

Residual standard error: 2809 on 348 degrees of freedom
Multiple R-squared: 0.8704, Adjusted R-squared: 0.8696
F-statistic: 1168 on 2 and 348 DF, p-value: < 2.2e-16

c.
> summary(inter)
Call:
lm(formula = TotalPrice ~ Depth * Carat)

Residuals:
    Min      1Q  Median      3Q     Max
-8254.4 -1311.5  -157.2  1131.8 14513.9

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  31171.41    4219.58   7.387 1.13e-12 ***
Depth         -598.18      65.47  -9.137  < 2e-16 ***
Carat       -11827.73    3436.47  -3.442 0.000648 ***
Depth:Carat    408.45      51.96   7.861 4.84e-14 ***

Residual standard error: 2592 on 347 degrees of freedom
Multiple R-squared: 0.89, Adjusted R-squared: 0.889
F-statistic: 935.7 on 3 and 347 DF, p-value: < 2.2e-16

d.
> summary(second)
Call:
lm(formula = TotalPrice ~ Depth * Carat + I(Depth^2) + I(Carat^2))

Residuals:
     Min       1Q   Median       3Q      Max
-12196.1   -652.7    -38.5    485.7  10582.2

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 24338.820  30297.912   0.803   0.4223
Depth        -728.700    904.439  -0.806   0.4210
Carat        7573.620   3040.787   2.491   0.0132 *
I(Depth^2)      5.276      6.727   0.784   0.4333
I(Carat^2)   4761.592    330.246  14.418   <2e-16 ***
Depth:Carat   -83.891     53.530  -1.567   0.1180

Residual standard error: 2053 on 345 degrees of freedom
Multiple R-squared: 0.9313, Adjusted R-squared: 0.9304
F-statistic: 936.1 on 5 and 345 DF, p-value: < 2.2e-16

Choice of best model: The usual criteria to use include the largest Adjusted R-squared or the smallest MSE or the smallest Residual standard error. (They will always give you the same model, since they are all functions of MSE.) So using any of those, the best model of the four models here and the two in Example 3.11 is the one in Part (d), the full second order model.

2. Do Exercise 3.24 on page 158. R instructions for creating the log of a variable were given in Lecture 3 (Oct 9) and Discussion 2 (Oct 13). R instructions for creating plots to check conditions were given in Discussion 2 (Oct 13) and Discussion 5 (Nov 3).

3.24 a. To examine the constant variance condition, the appropriate graph is the residuals vs fitted values, shown on the right. To examine the normality condition, an appropriate graph is a normal probability plot, shown on the right below (next page), or a histogram of the residuals, shown on the left below. All three graphs show problems with the conditions being met. For the graph on the right, it looks like the variance is increasing. For the normal probability plot, the points deviate substantially from the line, indicating substantial non-normality, also shown by the histogram.
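The three condition checks described in 3.24 a could be produced with a sketch along the following lines. It assumes the Stat2Data package is installed and refits the full second-order model from Exercise 3.23 d. The object name second follows the output above; the names res and fit, the plot titles, and the side-by-side layout are illustrative choices, not necessarily the commands used in Discussion.

```r
library(Stat2Data)   # assumed to be installed; provides the Diamonds data
data(Diamonds)

# Full second-order model from Exercise 3.23 d
second <- lm(TotalPrice ~ Depth * Carat + I(Depth^2) + I(Carat^2),
             data = Diamonds)

res <- resid(second)    # residuals (illustrative name)
fit <- fitted(second)   # fitted values (illustrative name)

par(mfrow = c(1, 3))    # three plots side by side

# Constant variance: look for a fan shape in residuals vs fitted values
plot(fit, res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Normality: histogram and normal probability (Q-Q) plot of the residuals
hist(res, xlab = "Residuals", main = "Histogram of residuals")
qqnorm(res)
qqline(res)
```

A fan shape in the first plot or points bending away from the line in the Q-Q plot would correspond to the problems described in the answer.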
3.24 b. The output for the model with logprice as the response is shown below. The model is still a reasonable choice. It has high Adjusted R-squared, and all of the coefficients have relatively small p-values, even though a few of them are above 0.05. You might want to try dropping the interaction term to see what happens, but that is not required.

> summary(second2)
Call:
lm(formula = logprice ~ Depth * Carat + I(Depth^2) + I(Carat^2))

Residuals:
     Min       1Q   Median       3Q      Max
-0.85021 -0.13209  0.01441  0.13613  0.79710

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.5049624  3.4020467   3.970 8.76e-05 ***
Depth       -0.2027689  0.1015563  -1.997   0.0467 *
Carat        2.5863485  0.3414393   7.575 3.33e-13 ***
I(Depth^2)   0.0013384  0.0007553   1.772   0.0773 .
I(Carat^2)  -0.5714071  0.0370821 -15.409  < 2e-16 ***
Depth:Carat  0.0095943  0.0060107   1.596   0.1114

Residual standard error: 0.2306 on 345 degrees of freedom
Multiple R-squared: 0.9302, Adjusted R-squared: 0.9292
F-statistic: 919.9 on 5 and 345 DF, p-value: < 2.2e-16

3.24 c. The plot of the residuals versus fitted values is shown on the right. The plots to check normality are shown on the next page. While not perfect, the log transformation clearly has helped with both the constant variance and the normality conditions.
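Parts b and c could be sketched together as follows. The sketch assumes the Stat2Data package is installed and that logprice is the natural log of TotalPrice, computed here directly inside lm. The name noInter for the model without the interaction is hypothetical, and this optional comparison is only one way to "see what happens" when the interaction is dropped.

```r
library(Stat2Data)   # assumed to be installed; provides the Diamonds data
data(Diamonds)

# Second-order model with log(TotalPrice) as the response (natural log)
second2 <- lm(log(TotalPrice) ~ Depth * Carat + I(Depth^2) + I(Carat^2),
              data = Diamonds)

# Part c: the same three condition checks, now for the log model
par(mfrow = c(1, 3))
plot(fitted(second2), resid(second2),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted (log model)")
abline(h = 0, lty = 2)
hist(resid(second2), xlab = "Residuals", main = "Histogram of residuals")
qqnorm(resid(second2))
qqline(resid(second2))

# Part b (optional): drop the interaction term and compare.
# noInter is a hypothetical name for the reduced model.
noInter <- update(second2, . ~ . - Depth:Carat)
print(summary(second2)$adj.r.squared)
print(summary(noInter)$adj.r.squared)
print(anova(noInter, second2))   # nested F-test for the single dropped term
```

Because only one term is dropped, the nested F-test here is equivalent to the t-test for Depth:Carat in the summary output, so its p-value should match the 0.1114 reported there.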
3. Do Exercise 3.25 on page 158. R instructions for nested F tests were given in Lecture 8 (Oct 25), Lecture 11 (Nov 6), and Discussion 5 (Nov 3).

3.25 The models being compared are:
Full model: TotalPrice = β0 + β1·Carat + β2·Depth + β3·Carat² + β4·Depth² + β5·Carat·Depth + ε
Reduced model: TotalPrice = β0 + β1·Carat + β2·Carat² + ε

So, using notation from the Full model, the hypotheses are:
Null: β2 = β4 = β5 = 0, or in context, Depth, Depth² and Carat·Depth can be removed from the model.
Alternative: Not all of those coefficients are 0, or in context, at least one of the terms involving Depth is needed in the model.

From the R output below, F = 9.43 and p = 5.24 × 10^-6, so clearly reject the null hypothesis. (The F statistic is the drop in RSS per added term divided by the Full model's MSE: (119342316/3) / (1454702094/345) ≈ 9.43.) Conclude that at least one of the terms involving Depth is needed in the model. The R output is as follows:

> second <- lm(TotalPrice ~ Depth * Carat + I(Depth^2) + I(Carat^2))
> nodepth <- lm(TotalPrice ~ Carat + I(Carat^2))
> anova(nodepth, second)
Analysis of Variance Table

Model 1: TotalPrice ~ Carat + I(Carat^2)
Model 2: TotalPrice ~ Depth * Carat + I(Depth^2) + I(Carat^2)
  Res.Df        RSS Df Sum of Sq      F   Pr(>F)
1    348 1574044410
2    345 1454702094  3 119342316 9.4345 5.24e-06 ***

4. Do Exercise 3.26 on page 158. R instructions for prediction and confidence intervals for simple linear regression were given in Lecture 6 (Oct 18) and Discussion 3 (Oct 20), but you will need to expand them to include multiple variables. Here is an example of the command to create a confidence interval for the second order model using the StateSAT data for a state with 41% Takers and an Expend value of $25:
predict(secondorder, list(Takers=41, Expend=25), se.fit=FALSE, interval="c")

You could also define newdata first, and then use it in the predict command:

newdata <- data.frame(Takers=41, Expend=25)
predict(secondorder, newdata, se.fit=FALSE, interval="c")

3.26 a. The answer can be found using R or by computing from the equation given in Example 3.11. The predicted price is $1794.84. (Depending on rounding, your answer could differ slightly.)

3.26 b. Here is the R code and results for a 95% confidence interval:

> Quad <- lm(TotalPrice ~ Carat + I(Carat^2), data = Diamonds)
> predict(Quad, list(Carat=0.5), se.fit=FALSE, interval="c")
       fit      lwr      upr
1 1794.843 1424.296 2165.389

The 95% confidence interval is $1424.30 to $2165.39.
Interpretation: We are 95% confident that the mean price of all 0.5 carat diamonds is between $1424.30 and $2165.39.

3.26 c. Here is the R code and results for a 95% prediction interval:

> Quad <- lm(TotalPrice ~ Carat + I(Carat^2), data = Diamonds)
> predict(Quad, list(Carat=0.5), se.fit=FALSE, interval="p")
       fit       lwr      upr
1 1794.843 -2404.462 5994.147

The 95% prediction interval is -$2404.46 to $5994.15, but negative prices don't make sense. So we would write the prediction interval as $0 to $5994.15.
Interpretation: Here are some possible ways to write it:
We are 95% confident that the price of a randomly selected 0.5 carat diamond will be between $0 and $5994.15.
We expect that 95% of all 0.5 carat diamonds will cost between $0 and $5994.15.
For all 0.5 carat diamonds, about 95% of them will cost between $0 and $5994.15.
For any of the above you could replace "between $0 and $5994.15" with something like "no more than $5994.15". For instance: For all 0.5 carat diamonds, about 95% of them will cost no more than $5994.15.

3.26 d.
If you didn't already do so for 3.24, you first need to create the log variable, or you could do it directly in the lm command:

> LnFit <- lm(log(TotalPrice) ~ Depth * Carat + I(Depth^2) + I(Carat^2), data = Diamonds)

Then create the intervals:

> predict(LnFit, list(Carat=0.5, Depth=62), se.fit=FALSE, interval="c")
       fit      lwr      upr
1 7.525992 7.484671 7.567314
> predict(LnFit, list(Carat=0.5, Depth=62), se.fit=FALSE, interval="p")
       fit      lwr      upr
1 7.525992 7.070612 7.981373

And finally, exponentiate the middle and endpoints to get the point estimate and intervals in dollars:

> exp(7.525992)
[1] 1855.653
#The point estimate is $1855.65
> exp(7.484671)
[1] 1780.538
> exp(7.567314)
[1] 1933.939
#The confidence interval is $1780.54 to $1933.94.
> exp(7.070612)
[1] 1176.868
> exp(7.981373)
[1] 2925.946
#The prediction interval is $1176.87 to $2925.95
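As an alternative to exponentiating each number by hand, the back-transformation could be applied to the whole predict result at once. This is a minimal sketch assuming the Stat2Data package is installed; the names newdiamond, ci_log, pi_log, ci_dollars and pi_dollars are hypothetical.

```r
library(Stat2Data)   # assumed to be installed; provides the Diamonds data
data(Diamonds)

# Second-order model for log(TotalPrice), as in 3.26 d
LnFit <- lm(log(TotalPrice) ~ Depth * Carat + I(Depth^2) + I(Carat^2),
            data = Diamonds)

# Hypothetical name for the diamond of interest: 0.5 carats, depth 62
newdiamond <- data.frame(Carat = 0.5, Depth = 62)

# Intervals on the log scale: one-row matrices with columns fit, lwr, upr
ci_log <- predict(LnFit, newdiamond, interval = "confidence", level = 0.95)
pi_log <- predict(LnFit, newdiamond, interval = "prediction", level = 0.95)

# exp() works element by element, so fit, lwr and upr are converted together
ci_dollars <- exp(ci_log)
pi_dollars <- exp(pi_log)

print(round(ci_dollars, 2))   # should match the values above up to rounding
print(round(pi_dollars, 2))
```

Because exp() is always positive, the back-transformed limits cannot be negative, unlike the prediction interval from the untransformed model in 3.26 c.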