Stat 401XV Exam 3 Spring 2017


I have neither given nor received unauthorized assistance on this exam.

Name Signed
Date
Name Printed

ATTENTION! Incorrect numerical answers unaccompanied by supporting reasoning will receive NO partial credit. Correct numerical answers to difficult questions unaccompanied by supporting reasoning may not receive full credit. SHOW YOUR WORK/EXPLAIN YOURSELF!

1. There are data on the UCI Machine Learning Repository concerning financial characteristics of some of the Forbes 500 companies in 1986. We will here concern ourselves with 73 of the cases and the 6 quantitative variables Assets, Sales, Market Value, Profits, Cash Flow, and Employees, and the factor variable Sector. (The last of these has 9 levels.) Beginning on Page 8 of this exam there is a printout of some analyses of these data, with particular emphasis on the modeling of Market Value as a function of the other variables.

4 pts a) After accounting for Sales, Cash Flow, and Employees, do the variables Assets and Profits add significantly to one's ability to predict or model Market Value? Give the value of an F statistic and associated degrees of freedom useful for assessing this. Say what you can (given the tables you have to work with) about a corresponding p-value.

b) Notice that in the model for Market Value that includes Sales, Cash Flow, and Employees the fitted regression coefficient for Sales is negative. This is presumably counter-intuitive. Correlations between predictors are below. Explain how they help account for this seemingly strange outcome.

            Sales      Cash_Flow  Employees
Sales       1.0000000  0.57578    0.84006
Cash_Flow   0.57578    1.0000000  0.74460
Employees   0.84006    0.74460    1.0000000

c) Below are a graphic from the leaps package and some cross-validation root mean squared prediction errors from caret. What do these suggest about an appropriate set of quantitative predictors for Market Value?

Model Terms                                CVRMSPE
Employees                                  449
Cash, Employees                            4
Sales, Cash, Employees
Sales, Profits, Cash, Employees            55
Assets, Sales, Profits, Cash, Employees

d) There is some output on the printout from an lm() call that includes not only the quantitative predictors of Market Value, but the factor variable Sector as well. All else (values of the quantitative predictors) being equal, which sector seems to have companies with the largest (per company) Market Values? Explain carefully. Remember that there are 9 sectors to consider.
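Part a) of Problem 1 asks whether two extra predictors improve a model that already contains three others, which is the usual setting for a partial F test on nested linear models. The sketch below shows the mechanics in R on simulated data. The names `sim`, `reduced`, and `full` are hypothetical and the numbers have nothing to do with the Forbes data; only the sample size (73) and the 3-versus-5 predictor structure mirror the exam.

```r
# Partial F test for nested linear models (simulated data, hypothetical names).
set.seed(1)
n <- 73
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                  x4 = rnorm(n), x5 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + 0.5 * sim$x3 + rnorm(n)

reduced <- lm(y ~ x1 + x2 + x3, data = sim)            # 3 predictors
full    <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = sim)  # adds 2 more

# By hand: F = [(SSE_R - SSE_F) / (df_R - df_F)] / [SSE_F / df_F]
sse_r <- deviance(reduced); df_r <- df.residual(reduced)
sse_f <- deviance(full);    df_f <- df.residual(full)
f_stat <- ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
p_val  <- pf(f_stat, df_r - df_f, df_f, lower.tail = FALSE)

# Built-in comparison; the F statistic and p-value are in the second row.
cmp <- anova(reduced, full)
```

On an exam printout the two error sums of squares and their degrees of freedom can be read from the "Residuals" rows of the two ANOVA tables, and the p-value can only be bounded using F tables.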

2. There is a famous Wisconsin Breast Cancer dataset on the UCI ML Repository. This dataset has N = 699 − 16 = 683 complete cases (16 have missing entries), each one describing k = 9 characteristics of a biopsied tumor that has been classified as either benign (444 cases) or malignant (239 cases). There is a printout beginning on Page 10 of this exam from an attempt to model the probability that a submitted biopsy is malignant on the basis of values of predictor variables (originally on 1-10 scales) x1 = Clump Thickness, x2 = Cell Size Uniformity, x3 = Cell Shape Uniformity, x4 = Marginal Adhesions, x5 = Single Epithelial Cell Size, x6 = Bare Nuclei, x7 = Bland Chromatin, x8 = Normal Nucleoli, and x9 = Mitoses. Use it to answer the questions on this page.

a) Which of the features x1 through x9 appears to be least important (in the presence of all others) in modeling the probability that a tumor is malignant? Explain.

b) For what linear relationship among the predictor variables x1 through x9 is the estimated probability that a submitted biopsy is malignant exactly .5? (Give values b0, b1, ..., b9, and c so that the relationship is b0 + b1 x1 + b2 x2 + ... + b9 x9 = c.)
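For part b), it may help to recall how the logit link ties a fitted probability to the linear predictor. Writing the fitted coefficients as b0, ..., b9 as in the question, the standard logistic regression relationship (stated here as a reminder, not taken from the exam) is:

```latex
\hat{p}(x) = \frac{1}{1 + \exp\{-(b_0 + b_1 x_1 + \cdots + b_9 x_9)\}},
\qquad
\hat{p}(x) = 0.5
\;\iff\;
\log\frac{\hat{p}(x)}{1 - \hat{p}(x)} = b_0 + b_1 x_1 + \cdots + b_9 x_9 = 0 .
```

The coefficient values themselves come from the "Estimate" column of the glm() summary.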

3. An R dataset concerns an experiment on the pharmacokinetics of theophylline. Subjects were given oral doses of the drug and serum concentrations were measured over time. These can be analyzed using a two-compartment open pharmacokinetic model, that for a single subject (at dose 4.4) is

$$\text{conc} = 4.4\,\frac{K_e K_a}{C\,(K_a - K_e)}\left[\exp(-K_e\,\text{time}) - \exp(-K_a\,\text{time})\right] + \varepsilon \qquad (*)$$

for model parameters $K_e$ = the elimination rate constant, $K_a$ = the absorption rate constant, and $C$ = the clearance. A printout beginning on Page 10 summarizes an analysis of n = 11 data pairs (time, conc) for one subject. Use it to answer the following questions.

a) Suppose relationship (*) above holds for iid $N(0, \sigma^2)$ errors $\varepsilon$. Give approximate 95% two-sided confidence limits for $\sigma$.

b) What does the plot below suggest about the plausibility of the usual non-linear regression model (*) in the present context?
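Part a) of Problem 3 relies on the usual large-sample approximation that, under iid normal errors, (df) s² / σ² is roughly chi-squared with df = n − (number of parameters) degrees of freedom. The helper below is a hypothetical sketch of that calculation in R. The function name `sigma_limits` and the inputs in the example call are made up for illustration and are not exam values.

```r
# Approximate two-sided confidence limits for sigma, given a residual
# standard error s on df degrees of freedom (hypothetical helper).
sigma_limits <- function(s, df, level = 0.95) {
  alpha <- 1 - level
  lower <- s * sqrt(df / qchisq(1 - alpha / 2, df))  # upper chi-square point
  upper <- s * sqrt(df / qchisq(alpha / 2, df))      # lower chi-square point
  c(lower = lower, upper = upper)
}

# Illustrative call with made-up inputs; in practice s and df would be the
# "Residual standard error" and its degrees of freedom from summary() of the fit.
lims <- sigma_limits(s = 1, df = 8)
```

The interval is not symmetric about s, because the chi-squared distribution is skewed for small degrees of freedom.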

4. There are old experimental data concerning noise passing through automotive exhaust systems at http://lib.stat.cmu.edu/dasl/datafiles/airpullutionfiltersdat.html. The response variable was y = noise level (db), for vehicles of 3 Sizes (1=small, 2=medium, and 3=large), for silencers/filters of 2 Types (1=standard silencer and 2=Octel pollution filter), and observations on 2 Sides (1=right and 2=left) of the cars studied. Each combination of levels of 3 factors was recorded m = 3 times. Various analyses of these data are on a printout beginning on Page 11. Use it to answer the following questions.

First, ignore the Sides variable and treat the data as if they are 3×2 factorial data.

8 pts a) Make an interaction plot enhanced with error bars based on 95% confidence limits for combination mean noise. What are your "margins of error" for this plotting? (Give a number.)

± margin:

b) Based on the plot above, which effects appear to be both statistically detectable and most important? (Consider Size and Type main effects and interactions. List an order of importance.)

c) The most basic goal of the original study was to establish that the Octel filter was at least as good as the standard silencer. Based on your plot and items on the printout, was this established? Explain.

Now consider the full 3-Factor structure of the dataset.

d) Are "effects" of SIDE on NOISE detectable? Explain what on the printout supports your judgment.

e) What is the value and degrees of freedom for an F test of the hypothesis that all effects involving SIDE are 0?

f) What is the effect on perceived "experimental error" when one includes the factor SIDE in the modeling of NOISE? Refer to appropriate values on the printout and explain why what you see makes sense.

g) Using the basic "L and L̂ ideas," give 95% two-sided confidence limits for the difference between right and left side mean noise levels for large vehicles using the Octel filter.
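Parts a) and g) of Problem 4 both come down to a t quantile times a standard error built from the pooled residual standard deviation. The two helpers below are a hypothetical sketch in R of those half-widths for a balanced factorial study. The function names and the example inputs are invented for illustration, and the real s and degrees of freedom would be taken from the relevant lm() summary.

```r
# Half-widths of 95% confidence limits in a balanced factorial study
# (hypothetical helpers, not from the exam printout).
# s  : residual standard error (pooled estimate of sigma)
# df : degrees of freedom associated with s
# m  : number of observations averaged in each cell mean
cell_mean_margin <- function(s, df, m, level = 0.95) {
  qt(1 - (1 - level) / 2, df) * s / sqrt(m)
}

# For L = sum(c_i * mu_i), the estimate L-hat = sum(c_i * ybar_i) has standard
# error s * sqrt(sum(c_i^2 / n_i)); the limits are L-hat +/- t * that quantity.
contrast_margin <- function(s, df, coefs, n_i, level = 0.95) {
  qt(1 - (1 - level) / 2, df) * s * sqrt(sum(coefs^2 / n_i))
}

# Made-up inputs: a difference of two cell means with 3 observations each.
d_margin <- contrast_margin(s = 1, df = 24, coefs = c(1, -1), n_i = c(3, 3))
```

Note that the number of observations behind each mean changes when SIDE is ignored, since each Size-by-Type mean then pools the observations from both sides.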

R Code and Output for the Forbes 500 Company Data

> summary(Companies)
     Assets          Sales           Market_Value    Profits           Cash_Flow
 Min.   :        Min.   : 76     Min.   : 5      Min.   :-77.50    Min.   :-54.
 1st Qu.:        1st Qu.: 706    1st Qu.: 478    1st Qu.: 7.80     1st Qu.: 7.5
 Median : 548    Median : 679    Median : 88     Median : 67.0     Median : 0.4
 Mean   : 7      Mean   : 04     Mean   :5       Mean   : 96.      Mean   : 7.8
 3rd Qu.: 5074   3rd Qu.: 45     3rd Qu.:89      3rd Qu.: 67.50    3rd Qu.: 0.
 Max.   :4045    Max.   :74      Max.   :946     Max.   : 485.00   Max.   :46.0
   Employees        sector
 Min.   : 0.60   Energy       :5
 1st Qu.:.80     Finance      :4
 Median :.60     Manufacturing:0
 Mean   : 8.86   Retail       :0
 3rd Qu.: 7.50   Other        : 7
 Max.   :84.80   HiTech       : 6
                 (Other)      :

> summary(lm(Market_Value~Assets+Sales+Profits+Cash_Flow+
+ Employees,data=Companies))

Call:
lm(formula = Market_Value ~ Assets + Sales + Profits + Cash_Flow +
    Employees, data = Companies)

Residuals:
    Min      1Q  Median      3Q     Max
  -55.9    -44.    -07.     4.7   457.4

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   48.6559    8.76464   0.809     0.45
Assets         0.0780      0.044     .89   0.0704 .
Sales          -0.449    0.09705   -4.48  .97e-05 ***
Profits       -4.6849      .8055    -.44     0.08 *
Cash_Flow     6.76447     .44774    4.67  .48e-05 ***
Employees     46.9500      6.469      7.   4.0e-0 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 05 on 67 degrees of freedom
Multiple R-squared: 0.766, Adjusted R-squared: 0.6955
F-statistic: .89 on 5 and 67 DF, p-value: < 2.2e-16

> anova(lm(Market_Value~Assets+Sales+Profits+Cash_Flow+
+ Employees,data=Companies))
Analysis of Variance Table

Response: Market_Value
          Df   Sum Sq  Mean Sq F value    Pr(>F)
Assets     1  5858896  5858896  56.889   .607e-0 ***
Sales      1     6446     6446    5.86  .089e-07 ***
Profits    1     5097     5097   .8675   .46e-05 ***
Cash_Flow  1    74574    74574    .684     0.988
Employees  1    55049    55049    5.60   4.06e-0 ***
Residuals 67    69000     0988
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> summary(lm(Market_Value~Sales+Cash_Flow+Employees,data=Companies))

Call:
lm(formula = Market_Value ~ Sales + Cash_Flow + Employees, data = Companies)

Residuals:
    Min      1Q  Median      3Q     Max
 -658.7   -40.9   -80.5    54.7     44.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     8.67     76.097    .604       0.
Sales         -0.447     0.0790   -.097    0.008 **
Cash_Flow       .957    0.59600    6.60  6.9e-09 ***
Employees      9.095      6.496    6.57  .90e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 088 on 69 degrees of freedom
Multiple R-squared: 0.6644, Adjusted R-squared: 0.6498
F-statistic: 45.5 on 3 and 69 DF, p-value: .46e-6

> anova(lm(Market_Value~Sales+Cash_Flow+Employees,data=Companies))
Analysis of Variance Table

Response: Market_Value
          Df   Sum Sq  Mean Sq F value    Pr(>F)
Sales      1     8866     8866   68.59    6.78e- ***
Cash_Flow  1     7599     7599     7.6   .56e-06 ***
Employees  1  4787040  4787040   40.46   .90e-08 ***
Residuals 69    87755    84454
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> options(contrasts = rep("contr.sum", 2))
> summary(lm(Market_Value~Assets+Sales+Profits+Cash_Flow+
+ Employees+sector,data=Companies))

Call:
lm(formula = Market_Value ~ Assets + Sales + Profits + Cash_Flow +
    Employees + sector, data = Companies)

Residuals:
    Min      1Q  Median      3Q     Max
 -778.7    -8.7     7.4    64.6      4.

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)      8.87    84.7860   0.640  0.54445
Assets          0.057    0.04850     .80     0.05 *
Sales            -0.        0.09   -.656 0.000548 ***
Profits          -.09     .77507    -.75  0.08495 .
Cash_Flow     4.95550       .505      .9   0.0068 **
Employees      4.8590       6.85     6.9  .5e-09 ***
sector1     -969.8808   89.56774     -.8    0.477
sector2          6.45    70.8977   0.505  0.65607
sector3         -.076    44.4766  -0.906  0.68647
sector4       697.744    56.6905   4.760  .0e-05 ***
sector5      -89.7854      96.07   -0.64  0.54066
sector6         5.474  408.57748     0.8  0.89846
sector7        64.986      .9054    .986   0.0578 .
sector8      -640.489    70.9879    -.76 0.089667 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 86. on 59 degrees of freedom
Multiple R-squared: 0.80, Adjusted R-squared: 0.7808
F-statistic: 0.7 on 13 and 59 DF, p-value: < 2.2e-16

R Code and Output for the Wisconsin Cancer Data

> model <- glm(y~.,family=binomial(link='logit'),data=WISC)
> summary(model)

Call:
glm(formula = y ~ ., family = binomial(link = "logit"), data = WISC)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
  -.484    -0.5  -0.069     0.0   .4698

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -0.094      .7488  -8.600  < 2e-16 ***
x1             0.550       0.40    .767  0.00065 ***
x2           -0.0068     0.0908   -0.00  0.97609
x3               0.7      0.060     .99   0.6688
x4             0.064       0.45    .678 0.007400 **
x5            0.0966     0.5659    0.67   0.5759
x6              0.80     0.0984    4.08 4.47e-05 ***
x7            0.4479       0.78    .609  0.00907 **
x8               0.0       0.87    .887   0.0595 .
x9            0.5484      0.877     .67   0.0788
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 884.5 on 682 degrees of freedom
Residual deviance: 0.89 on 673 degrees of freedom
AIC: .89

Number of Fisher Scoring iterations: 8

R Code and Output for the Theophylline Data

> cbind(theoph.4$Time,theoph.4$conc)
      [,1] [,2]
 [1,] 0.00 0.00
 [2,]  0.5  .89
 [3,] 0.60 4.60
 [4,]  .07 8.60
 [5,]    .  8.8
 [6,]  .50 7.54
 [7,]  5.0 6.88
 [8,]  7.0 5.78
 [9,]  9.0   5.
[10,]  .98  4.9
[11,] 4.65   .5

> Conc.out<-nls(conc~4.4*(Ke*Ka/C)*(exp(-Ke*Time)-exp(-Ka*Time))/(Ka-Ke),
+ data=theoph.4,start=c(C=.04,Ke=.09,Ka=.),trace=T)
6.450 : 0.04 0.09 .0
5.7685 : 0.07854 0.087759 .66407
5.705 : 0.077 0.0875 .7406768
5.7957 : 0.0740666 0.0875008 .707705
5.795 : 0.079804 0.0874575 .766764
5.795 : 0.0740040 0.0874694 .7455
5.795 : 0.079976 0.0874660 .749099
5.795 : 0.07999 0.08746707 .747

> summary(Conc.out)

Formula: conc ~ 4.4 * (Ke * Ka/C) * (exp(-Ke * Time) - exp(-Ka * Time))/(Ka - Ke)

Parameters:
   Estimate Std. Error t value Pr(>|t|)
C   0.07400    0.00545   6.906   0.0004 ***
Ke 0.087467     0.0978     4.4    0.009 **
Ka     .747    0.69080    4.54    0.004 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8465 on 8 degrees of freedom

Number of iterations to convergence: 7
Achieved convergence tolerance: 6.0e-06

> plot(theoph.4$Time,residuals(Conc.out),cex=,pch=9,xlab="time",ylab="residual")
> abline(a=0,b=0)

R Code and Output for the Exhaust Noise Data

> summary(Noise)
     NOISE        SIZE   TYPE   SIDE
 Min.   :760.0    1:12   1:18   1:18
 1st Qu.:78.5     2:12   2:18   2:18
 Median :80.0     3:12
 Mean   :80.
 3rd Qu.:87.5
 Max.   :855.0

> options(contrasts = rep("contr.sum", 2))
> summary(lm(NOISE~SIZE*TYPE,data=Noise))

Call:
lm(formula = NOISE ~ SIZE * TYPE, data = Noise)

Residuals:
    Min      1Q  Median      3Q     Max
   -5.8   -5.08  -0.467  5.0000  5.0000

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)     80.9        .48 600.989  < 2e-16 ***
SIZE1           4.08       .906    7.58  .9e-08 ***
SIZE2             .6       .906     .85    .5e- ***
TYPE1           5.47        .48    4.08   0.0006 ***
SIZE1:TYPE1    -.750       .906   -.967  0.05848 .
SIZE2:TYPE1    6.667       .906    .497  0.00488 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.088 on 30 degrees of freedom
Multiple R-squared: 0.94, Adjusted R-squared: 0.94
F-statistic: 85.4 on 5 and 30 DF, p-value: < 2.2e-16

> aggregate(Noise$NOISE,by=list(Noise$SIZE,Noise$TYPE),mean)
  Group.1 Group.2        x
1       1       1     85.8
2       2       1    845.8
3       3       1 775.0000
4       1       2   8.5000
5       2       2   8.6667
6       3       2 770.0000

> aggregate(Noise$NOISE,by=list(Noise$SIZE,Noise$TYPE),sd)
Group.1 Group.2 x
0.684880 5.8456 4.46408.786 5 4.0848 6 6.4555

> aggregate(Noise$NOISE,by=list(Noise$SIZE,Noise$TYPE,Noise$SIDE),mean)
Group.1 Group.2 Group.3 x
86.6667 84.6667 4 786.6667 80.0000 5 8.6667 6 7 775.0000 85.0000 8 850.0000 9 0 76. 85.0000 8.6667 765.0000

> aggregate(Noise$NOISE,by=list(Noise$SIZE,Noise$TYPE,Noise$SIDE),sd)
Group.1 Group.2 Group.3 x
5.7750.88675.88675 4 5 0.000000.88675 6 0.000000 7 8 0.000000 5.000000 9 5.7750 0 0.000000 5.7750 5.000000

> summary(lm(NOISE~SIZE*TYPE*SIDE,data=Noise))

Call:
lm(formula = NOISE ~ SIZE * TYPE * SIDE, data = Noise)

Residuals:
    Min      1Q  Median      3Q     Max
 -6.667   -.667   0.000       .   6.667

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)          80.89      0.665   7.865  < 2e-16 ***
SIZE1                4.078      0.900   5.585   4.7e-4 ***
SIZE2                   .6      0.900      6.  < 2e-16 ***
TYPE1                5.467      0.665    8.50  .04e-08 ***
SIDE1                 0.89      0.665     0.8   0.8904
SIZE1:TYPE1         -.7500      0.900   -4.66  0.00046 ***
SIZE2:TYPE1         6.6667      0.900   7.407   .0e-07 ***
SIZE1:SIDE1          -5.97      0.900   -6.65   7.e-07 ***
SIZE2:SIDE1            -.       0.900   -.469    0.006 *
TYPE1:SIDE1        -0.6944      0.665    -.09  0.86067
SIZE1:TYPE1:SIDE1    -.689      0.900    -.9   0.00794 **
SIZE2:TYPE1:SIDE1    -.889      0.900    -.54   0.5907
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: .89 on 24 degrees of freedom
Multiple R-squared: 0.988, Adjusted R-squared: 0.989
F-statistic: 84 on 11 and 24 DF, p-value: < 2.2e-16