Stat 40XV Exam Spring 07

I have neither given nor received unauthorized assistance on this exam.

Name Signed                              Date
Name Printed

ATTENTION! Incorrect numerical answers unaccompanied by supporting reasoning will receive NO partial credit. Correct numerical answers to difficult questions unaccompanied by supporting reasoning may not receive full credit. SHOW YOUR WORK/EXPLAIN YOURSELF!
1. There are data on the UCI Machine Learning Repository concerning financial characteristics of some of the Forbes 500 companies in 1986. We will here concern ourselves with 73 of the cases and the 6 quantitative variables Assets, Sales, Market Value, Profits, Cash Flow, and Employees, and the factor variable Sector. (The last of these has 9 levels.) Beginning on Page 8 of this exam there is a printout of some analyses of these data, with particular emphasis on the modeling of Market Value as a function of the other variables.

4 pts a) After accounting for Sales, Cash Flow, and Employees, do the variables Assets and Profits add significantly to one's ability to predict or model Market Value? Give the value of an F statistic and associated degrees of freedom useful for assessing this. Say what you can (given the tables you have to work with) about a corresponding p-value.

b) Notice that in the model for Market Value that includes Sales, Cash Flow, and Employees, the fitted regression coefficient for Sales is negative. This is presumably counter-intuitive. Correlations between predictors are below. Explain how they help account for this seemingly strange outcome.

            Sales       Cash_Flow   Employees
Sales       1.0000000   0.57578     0.84006
Cash_Flow   0.57578     1.0000000   0.74460
Employees   0.84006     0.74460     1.0000000
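The nested-model comparison asked about in part a) can be sketched in R as follows. This is only an illustration on simulated data (the variable names `x1` through `x5`, the coefficients, and the seed are all assumptions, not the exam data); it shows that the statistic built from the two error sums of squares agrees with what `anova()` reports for two nested `lm()` fits.

```r
# Sketch: partial (nested-model) F test on SIMULATED data, not the exam data.
set.seed(1)
n <- 73
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + 0.5 * d$x3 + rnorm(n)

reduced <- lm(y ~ x1 + x2 + x3, data = d)            # 3 predictors
full    <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = d)  # adds 2 more

# F = [(SSE_reduced - SSE_full) / (number of added terms)] / MSE_full
sse_r <- sum(residuals(reduced)^2)
sse_f <- sum(residuals(full)^2)
q     <- df.residual(reduced) - df.residual(full)    # numerator df
F_by_hand <- ((sse_r - sse_f) / q) / (sse_f / df.residual(full))
p_by_hand <- pf(F_by_hand, q, df.residual(full), lower.tail = FALSE)

# anova() on the two nested fits reports the same statistic and p-value.
cmp <- anova(reduced, full)
print(cmp)
```

With only printed ANOVA tables and F tables to work from, the same ratio can be formed by hand and the p-value bracketed from tabled F quantiles.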
c) Below are a graphic from the leaps package and some cross-validation root mean squared prediction errors from caret. What do these suggest about an appropriate set of quantitative predictors for Market Value?

Model Terms                                CVRMSPE
Employees                                  449
Cash, Employees                            4
Sales, Cash, Employees
Sales, Profits, Cash, Employees            55
Assets, Sales, Profits, Cash, Employees

d) There is some output on the printout from an lm() call that includes not only the quantitative predictors of Market Value, but the factor variable Sector as well. All else (values of the quantitative predictors) being equal, which sector seems to have companies with the largest (per company) Market Values? Explain carefully. Remember that there are 9 sectors to consider.
2. There is a famous Wisconsin Breast Cancer dataset on the UCI ML Repository. This dataset has N = 699 - 16 = 683 complete cases (16 have missing entries), each one describing k = 9 characteristics of a biopsied tumor that has been classified as either benign (444 cases) or malignant (239 cases). There is a printout beginning on Page 10 of this exam from an attempt to model the probability that a submitted biopsy is malignant on the basis of values of predictor variables (originally on 1-10 scales) x1 = Clump Thickness, x2 = Cell Size Uniformity, x3 = Cell Shape Uniformity, x4 = Marginal Adhesions, x5 = Single Epithelial Cell Size, x6 = Bare Nuclei, x7 = Bland Chromatin, x8 = Normal Nucleoli, and x9 = Mitoses. Use it to answer the questions on this page.

a) Which of the features x1 through x9 appears to be least important (in the presence of all others) in modeling the probability that a tumor is malignant? Explain.

b) For what linear relationship among the predictor variables x1 through x9 is the estimated probability that a submitted biopsy is malignant exactly .5? (Give values b0, b1, ..., b9, and c so that the relationship is b0 + b1 x1 + b2 x2 + ... + b9 x9 = c.)
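As a reminder of the algebra behind part b), the following block writes out how a fitted logit model turns a linear combination of predictors into a probability. It assumes the standard logit link used in the `glm()` call on the printout; the symbols b0 through b9 stand for the fitted coefficients and are not filled in here.

```latex
\hat{p}(x) \;=\; \frac{1}{1 + \exp\{-(b_0 + b_1 x_1 + \cdots + b_9 x_9)\}},
\qquad
\hat{p}(x) = 0.5
\;\Longleftrightarrow\;
\exp\{-(b_0 + b_1 x_1 + \cdots + b_9 x_9)\} = 1
\;\Longleftrightarrow\;
b_0 + b_1 x_1 + \cdots + b_9 x_9 = 0 .
```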
3. An R dataset concerns an experiment on the pharmacokinetics of theophylline. Subjects were given oral doses of the drug and serum concentrations were measured over time. These can be analyzed using a two-compartment open pharmacokinetic model, that for a single subject (at dose 4.4) is

$$\text{conc} = 4.4\,\frac{K_e K_a}{C}\cdot\frac{\exp(-K_e\,\text{time}) - \exp(-K_a\,\text{time})}{K_a - K_e} + \varepsilon \qquad (*)$$

for model parameters $K_e$ = the elimination rate constant, $K_a$ = the absorption rate constant, and $C$ = the clearance. A printout beginning on Page 10 summarizes an analysis of n = 11 data pairs (time, conc) for one subject. Use it to answer the following questions.

a) Suppose relationship (*) above holds for iid $N(0, \sigma^2)$ errors $\varepsilon$. Give approximate 95% two-sided confidence limits for $\sigma$.

b) What does the plot below suggest about the plausibility of the usual non-linear regression model (*) in the present context?
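For part a), one common approximation treats (error df) times s squared over sigma squared as chi-squared, exactly as in linear regression. A minimal R sketch of that calculation follows; the function name `sigma_limits` and the input values `s = 1`, `df = 8` are hypothetical placeholders, and whether this chi-squared approximation is the one intended in the course is an assumption.

```r
# Sketch: approximate two-sided confidence limits for sigma from a residual
# standard error s on df error degrees of freedom (chi-squared approximation).
sigma_limits <- function(s, df, level = 0.95) {
  alpha <- 1 - level
  c(lower = s * sqrt(df / qchisq(1 - alpha / 2, df)),
    upper = s * sqrt(df / qchisq(alpha / 2, df)))
}

# Hypothetical inputs, for illustration only.
lims <- sigma_limits(s = 1, df = 8)
print(lims)
```

The limits are not symmetric about s, because the chi-squared distribution is right-skewed for small degrees of freedom.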
4. There are old experimental data concerning noise passing through automotive exhaust systems at http://lib.stat.cmu.edu/dasl/datafiles/airpullutionfiltersdat.html. The response variable was y = noise level (db), for vehicles of 3 Sizes (1=small, 2=medium, and 3=large), for silencers/filters of 2 Types (1=standard silencer and 2=Octel pollution filter), and observations on 2 Sides (1=right and 2=left) of the cars studied. Each combination of levels of factors was recorded m = 3 times. Various analyses of these data are on a printout beginning on Page 11. Use it to answer the following questions.

First, ignore the Sides variable and treat the data as if they are factorial data.

8 pts a) Make an interaction plot enhanced with error bars based on 95% confidence limits for combination mean noise. What are your "margins of error" for this plotting? (Give a number.)

+/- margin:

b) Based on the plot above, which effects appear to be both statistically detectable and most important? (Consider Size and Type main effects and interactions. List an order of importance.)

c) The most basic goal of the original study was to establish that the Octel filter was at least as good as the standard silencer. Based on your plot and items on the printout, was this established? Explain.
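The error-bar half-width asked for in part a) above is usually computed as a t quantile times the pooled standard deviation divided by the square root of the per-cell sample size. A small R sketch follows; the function name `cell_mean_margin` and the numbers passed to it are hypothetical, and the actual inputs would have to be read from the printout.

```r
# Sketch: half-width of a two-sided confidence interval for a single cell mean
# in a balanced factorial, using a pooled standard deviation.
cell_mean_margin <- function(s_pooled, df_error, m, level = 0.95) {
  qt(1 - (1 - level) / 2, df_error) * s_pooled / sqrt(m)
}

# Hypothetical inputs, for illustration only.
margin <- cell_mean_margin(s_pooled = 10, df_error = 20, m = 5)
print(margin)
```

Here `m` is the number of observations averaged in each plotted mean, which depends on which factors are being ignored.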
Now consider the full 3-Factor structure of the dataset.

d) Are "effects" of SIDE on NOISE detectable? Explain what on the printout supports your judgment.

e) What is the value and degrees of freedom for an F test of the hypothesis that all effects involving SIDE are 0?

f) What is the effect on perceived "experimental error" when one includes the factor SIDE in the modeling of NOISE? Refer to appropriate values on the printout and explain why what you see makes sense.

g) Using the basic "$L$ and $\hat{L}$ ideas," give 95% two-sided confidence limits for the difference between right and left side mean noise levels for large vehicles using the Octel filter.
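For reference, the usual form of confidence limits for a linear combination of cell means is written out below. It is a sketch under the standard assumptions (independent normal observations with common variance, pooled standard deviation $s_P$ on $\nu$ error degrees of freedom); the symbols $c_i$, $n_i$, $\nu$, and the subscripts R and L are notation chosen here, not taken from the printout.

```latex
L = \sum_i c_i \mu_i, \qquad
\hat{L} = \sum_i c_i \bar{y}_i, \qquad
\hat{L} \;\pm\; t_{\nu,\,0.975}\; s_P \sqrt{\sum_i \frac{c_i^2}{n_i}} .
% Special case: difference of two cell means, each based on m observations
\bar{y}_{R} - \bar{y}_{L} \;\pm\; t_{\nu,\,0.975}\; s_P \sqrt{\frac{2}{m}} .
```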
R Code and Output for the Forbes 500 Company Data

> summary(Companies)
     Assets          Sales         Market_Value     Profits           Cash_Flow
 Min.   :        Min.   : 76    Min.   : 5     Min.   :-77.50    Min.   :-54.
 1st Qu.:        1st Qu.: 706   1st Qu.: 478   1st Qu.: 7.80     1st Qu.: 7.5
 Median : 548    Median : 679   Median : 88    Median : 67.0     Median : 0.4
 Mean   : 7      Mean   : 04    Mean   :5      Mean   : 96.      Mean   : 7.8
 3rd Qu.: 5074   3rd Qu.: 45    3rd Qu.:89     3rd Qu.: 67.50    3rd Qu.: 0.
 Max.   :4045    Max.   :74     Max.   :946    Max.   : 485.00   Max.   :46.0
   Employees             sector
 Min.   : 0.60    Energy       :5
 1st Qu.:.80      Finance      :4
 Median :.60      Manufacturing:0
 Mean   : 8.86    Retail       :0
 3rd Qu.: 7.50    Other        : 7
 Max.   :84.80    HiTech       : 6
                  (Other)      :

> summary(lm(Market_Value~Assets+Sales+Profits+Cash_Flow+
+ Employees,data=Companies))

Call:
lm(formula = Market_Value ~ Assets + Sales + Profits + Cash_Flow +
    Employees, data = Companies)

Residuals:
    Min      1Q  Median      3Q     Max
  -55.9    -44.    -07.     4.7   457.4

Coefficients:
             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  48.6559    8.76464     0.809    0.45
Assets       0.0780     0.044       .89      0.0704   .
Sales        -0.449     0.09705     -4.48    .97e-05  ***
Profits      -4.6849    .8055       -.44     0.08     *
Cash_Flow    6.76447    .44774      4.67     .48e-05  ***
Employees    46.9500    6.469       7.       4.0e-0   ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 05 on 67 degrees of freedom
Multiple R-squared: 0.766, Adjusted R-squared: 0.6955
F-statistic: .89 on 5 and 67 DF, p-value: < 2.2e-16

> anova(lm(Market_Value~Assets+Sales+Profits+Cash_Flow+
+ Employees,data=Companies))
Analysis of Variance Table

Response: Market_Value
           Df   Sum Sq    Mean Sq   F value  Pr(>F)
Assets      1   5858896   5858896   56.889   .607e-0   ***
Sales       1   6446      6446      5.86     .089e-07  ***
Profits     1   5097      5097      .8675    .46e-05   ***
Cash_Flow   1   74574     74574     .684     0.988
Employees   1   55049     55049     5.60     4.06e-0   ***
Residuals  67   69000     0988
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(lm(Market_Value~Sales+Cash_Flow+Employees,data=Companies))

Call:
lm(formula = Market_Value ~ Sales + Cash_Flow + Employees, data = Companies)

Residuals:
    Min      1Q  Median      3Q     Max
 -658.7   -40.9   -80.5    54.7     44.
Coefficients:
             Estimate   Std. Error  t value  Pr(>|t|)
(Intercept)  8.67       76.097      .604     0.
Sales        -0.447     0.0790      -.097    0.008    **
Cash_Flow    .957       0.59600     6.60     6.9e-09  ***
Employees    9.095      6.496       6.57     .90e-08  ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 088 on 69 degrees of freedom
Multiple R-squared: 0.6644, Adjusted R-squared: 0.6498
F-statistic: 45.5 on 3 and 69 DF, p-value: .46e-6

> anova(lm(Market_Value~Sales+Cash_Flow+Employees,data=Companies))
Analysis of Variance Table

Response: Market_Value
           Df   Sum Sq    Mean Sq   F value  Pr(>F)
Sales       1   8866      8866      68.59    6.78e-   ***
Cash_Flow   1   7599      7599      7.6      .56e-06  ***
Employees   1   4787040   4787040   40.46    .90e-08  ***
Residuals  69   87755     84454
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> options(contrasts = rep("contr.sum", 2))
> summary(lm(Market_Value~Assets+Sales+Profits+Cash_Flow+
+ Employees+sector,data=Companies))

Call:
lm(formula = Market_Value ~ Assets + Sales + Profits + Cash_Flow +
    Employees + sector, data = Companies)

Residuals:
    Min      1Q  Median      3Q     Max
 -778.7    -8.7     7.4    64.6      4.

Coefficients:
             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)  8.87        84.7860     0.640    0.54445
Assets       0.057       0.04850     .80      0.05      *
Sales        -0.         0.09        -.656    0.000548  ***
Profits      -.09        .77507      -.75     0.08495   .
Cash_Flow    4.95550     .505        .9       0.0068    **
Employees    4.8590      6.85        6.9      .5e-09    ***
sector1      -969.8808   89.56774    -.8      0.477
sector2      6.45        70.8977     0.505    0.65607
sector3      -.076       44.4766     -0.906   0.68647
sector4      697.744     56.6905     4.760    .0e-05    ***
sector5      -89.7854    96.07       -0.64    0.54066
sector6      5.474       408.57748   0.8      0.89846
sector7      64.986      .9054       .986     0.0578    .
sector8      -640.489    70.9879     -.76     0.089667  .
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 86. on 59 degrees of freedom
Multiple R-squared: 0.80, Adjusted R-squared: 0.7808
F-statistic: 0.7 on 13 and 59 DF, p-value: < 2.2e-16
R Code and Output for the Wisconsin Cancer Data

> model <- glm(y~.,family=binomial(link='logit'),data=WISC)
> summary(model)

Call:
glm(formula = y ~ ., family = binomial(link = "logit"), data = WISC)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
  -.484     -0.5   -0.069      0.0    .4698

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -0.094    .7488       -8.600   < 2e-16   ***
x1           0.550     0.40        .767     0.00065   ***
x2           -0.0068   0.0908      -0.00    0.97609
x3           0.7       0.060       .99      0.6688
x4           0.064     0.45        .678     0.007400  **
x5           0.0966    0.5659      0.67     0.5759
x6           0.80      0.0984      4.08     4.47e-05  ***
x7           0.4479    0.78        .609     0.00907   **
x8           0.0       0.87        .887     0.0595    .
x9           0.5484    0.877       .67      0.0788
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 884.5 on 682 degrees of freedom
Residual deviance: 0.89 on 673 degrees of freedom
AIC: .89

Number of Fisher Scoring iterations: 8

R Code and Output for the Theophylline Data

> cbind(theoph.4$Time,theoph.4$conc)
        [,1]   [,2]
 [1,]   0.00   0.00
 [2,]   0.5    .89
 [3,]   0.60   4.60
 [4,]   .07    8.60
 [5,]   .      8.8
 [6,]   .50    7.54
 [7,]   5.0    6.88
 [8,]   7.0    5.78
 [9,]   9.0    5.
[10,]   .98    4.9
[11,]   4.65   .5

> conc.out<-nls(conc~4.4*(Ke*Ka/C)*(exp(-Ke*Time)-exp(-Ka*Time))/(Ka-Ke),
+ data=theoph.4,start=c(C=.04,Ke=.09,Ka=.),trace=T)
6.450  : 0.04       0.09        .0
5.7685 : 0.07854    0.087759    .66407
5.705  : 0.077      0.0875      .7406768
5.7957 : 0.0740666  0.0875008   .707705
5.795  : 0.079804   0.0874575   .766764
5.795  : 0.0740040  0.0874694   .7455
5.795  : 0.079976   0.0874660   .749099
5.795  : 0.07999    0.08746707  .747
> summary(conc.out)

Formula: conc ~ 4.4 * (Ke * Ka/C) * (exp(-Ke * Time) - exp(-Ka * Time))/(Ka - Ke)

Parameters:
    Estimate   Std. Error  t value  Pr(>|t|)
C   0.07400    0.00545     6.906    0.0004  ***
Ke  0.087467   0.0978      4.4      0.009   **
Ka  .747       0.69080     4.54     0.004   **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8465 on 8 degrees of freedom

Number of iterations to convergence: 7
Achieved convergence tolerance: 6.0e-06

> plot(theoph.4$Time,residuals(conc.out),ce=,pch=9,xlab="time",ylab="residual")
> abline(a=0,b=0)

R Code and Output for the Exhaust Noise Data

> summary(Noise)
     NOISE         SIZE    TYPE    SIDE
 Min.   :760.0     1:12    1:18    1:18
 1st Qu.:78.5      2:12    2:18    2:18
 Median :80.0      3:12
 Mean   :80.
 3rd Qu.:87.5
 Max.   :855.0

> options(contrasts = rep("contr.sum", 2))
> summary(lm(NOISE~SIZE*TYPE,data=Noise))

Call:
lm(formula = NOISE ~ SIZE * TYPE, data = Noise)

Residuals:
    Min       1Q   Median       3Q      Max
   -5.8    -5.08   -0.467   5.0000   5.0000

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  80.9      .48         600.989  < 2e-16  ***
SIZE1        4.08      .906        7.58     .9e-08   ***
SIZE2        .6        .906        .85      .5e-     ***
TYPE1        5.47      .48         4.08     0.0006   ***
SIZE1:TYPE1  -.750     .906        -.967    0.05848  .
SIZE2:TYPE1  6.667     .906        .497     0.00488  **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.088 on 30 degrees of freedom
Multiple R-squared: 0.94, Adjusted R-squared: 0.94
F-statistic: 85.4 on 5 and 30 DF, p-value: < 2.2e-16

> aggregate(Noise$NOISE,by=list(Noise$SIZE,Noise$TYPE),mean)
  Group.1 Group.2 x
1    1       1    85.8
2    2       1    845.8
3    3       1    775.0000
4    1       2    8.5000
5    2       2    8.6667
6    3       2    770.0000
> aggregate(Noise$NOISE,by=list(Noise$SIZE,Noise$TYPE),sd)
  Group.1 Group.2 x
1    1       1    0.684880
2    2       1    5.8456
3    3       1    4.46408
4    1       2    .786
5    2       2    4.0848
6    3       2    6.4555

> aggregate(Noise$NOISE,by=list(Noise$SIZE,Noise$TYPE,Noise$SIDE),mean)
   Group.1 Group.2 Group.3 x
1     1       1       1    86.6667
2     2       1       1    84.6667
3     3       1       1    786.6667
4     1       2       1    80.0000
5     2       2       1    8.6667
6     3       2       1    775.0000
7     1       1       2    85.0000
8     2       1       2    850.0000
9     3       1       2    76.
10    1       2       2    85.0000
11    2       2       2    8.6667
12    3       2       2    765.0000

> aggregate(Noise$NOISE,by=list(Noise$SIZE,Noise$TYPE,Noise$SIDE),sd)
   Group.1 Group.2 Group.3 x
1     1       1       1    5.7750
2     2       1       1    .88675
3     3       1       1    .88675
4     1       2       1    0.000000
5     2       2       1    .88675
6     3       2       1    0.000000
7     1       1       2    0.000000
8     2       1       2    5.000000
9     3       1       2    5.7750
10    1       2       2    0.000000
11    2       2       2    5.7750
12    3       2       2    5.000000

> summary(lm(NOISE~SIZE*TYPE*SIDE,data=Noise))

Call:
lm(formula = NOISE ~ SIZE * TYPE * SIDE, data = Noise)

Residuals:
    Min       1Q   Median       3Q      Max
 -6.667    -.667    0.000        .    6.667

Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)        80.89     0.665       7.865    < 2e-16  ***
SIZE1              4.078     0.900       5.585    4.7e-4   ***
SIZE2              .6        0.900       6.       < 2e-16  ***
TYPE1              5.467     0.665       8.50     .04e-08  ***
SIDE1              0.89      0.665       0.8      0.8904
SIZE1:TYPE1        -.7500    0.900       -4.66    0.00046  ***
SIZE2:TYPE1        6.6667    0.900       7.407    .0e-07   ***
SIZE1:SIDE1        -5.97     0.900       -6.65    7.e-07   ***
SIZE2:SIDE1        -.        0.900       -.469    0.006    *
TYPE1:SIDE1        -0.6944   0.665       -.09     0.86067
SIZE1:TYPE1:SIDE1  -.689     0.900       -.9      0.00794  **
SIZE2:TYPE1:SIDE1  -.889     0.900       -.54     0.5907
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: .89 on 24 degrees of freedom
Multiple R-squared: 0.988, Adjusted R-squared: 0.989
F-statistic: 84 on 11 and 24 DF, p-value: < 2.2e-16