Test #1 (Solution Key)



STAT 47/67 Test #1 (Solution Key)

1. (To be done by hand) Exploring his own drink-and-drive habits, a student recalls the last 7 parties that he attended. He records the number of cans of beer he drank, his highway speed on the way home, and whether he was stopped by police.

Party   Beer, cans   Speed, mph   Stopped by police
1       -            -            yes
2       -            -            yes
3       -            -            no
4       -            -            no
5       -            -            no
6       -            -            yes
7       -            -            yes

Use the following statistical machine learning methods to predict whether or not the person will be stopped by police.

(a) Tree. Draw a decision tree that can be used for this classification. Label each node, to make it clear how to use this tree for classification. What is the training error rate of this tree?

[Figure: scatter plot of Speed versus Beer]

(b) Pruning. Suppose the tree is pruned to just one split. If speed is the only variable used, what split minimizes the Gini index? What is the Gini index for this split, and what is the training error rate of this tree?

(c) Bagging. Four bootstrap samples from the given data are (listed by the party number):

Bootstrap sample #1: ( 1, 1, 2, 3, 4, 5, 6 )
Bootstrap sample #2: ( 3, 4, 4, 5, 5, 7, 7 )
Bootstrap sample #3: ( 3, 3, 4, 6, 7, 7, 7 )
Bootstrap sample #4: ( 1, 2, 2, 2, 5, 5, 6 )

Draw a tree based on each bootstrap sample. Predict the out-of-bag data by the majority vote. Use these results to estimate the testing error rate of this method.

SOLUTION.

(a) [Figure: two alternative decision trees; the surviving node label is "Beer < .5"]

All the terminal nodes are pure, so the training error rate is 0.

(b) There are 3 possible splits: at Speed = 57.5, 65, or 72.5 (or 57, 65, and 73).

- At 57.5, one node is pure, with $\hat{p}_{no} = 1$, $\hat{p}_{yes} = 0$. In the other node there are 2 blue and 4 red points, so $\hat{p}_{no} = 1/3$ and $\hat{p}_{yes} = 2/3$. The Gini index is $G = (1)(0) + (1/3)(2/3) = 2/9$.
- At 65, one node is pure, and for the other node, $\hat{p}_{no} = 3/4$ and $\hat{p}_{yes} = 1/4$. $G = 0 + (3/4)(1/4) = 3/16$.
- At 72.5, one node is pure, and for the other node, $\hat{p}_{no} = 1/2$ and $\hat{p}_{yes} = 1/2$. $G = 0 + (1/2)(1/2) = 1/4$.

Since 3/16 < 2/9 < 1/4, the split that minimizes the Gini index is at Speed = 65. The classification is:

- if Speed < 65, then classify to "no" (not stopped);
- if Speed ≥ 65, then classify to "yes" (stopped).

This rule classifies data points 1-6 correctly and misclassifies point 7. The training error rate is 1/7 or 14.3%.

(c) Bootstrap sample #1 ( 1, 1, 2, 3, 4, 5, 6 )
[Figure: tree with one split]
Both nodes are pure, hence no more splits. Predict OOB $\hat{Y}_7$ = [label missing].

Bootstrap sample #2 ( 3, 4, 4, 5, 5, 7, 7 )
[Figure: tree with one split on Beer, shown on a plot of Speed versus Beer]
Both nodes are pure, hence no more splits. Predict OOB $\hat{Y}_1 = \hat{Y}_2 = \hat{Y}_6$ = no.

Bootstrap sample #3 ( 3, 3, 4, 6, 7, 7, 7 )
[Figure: tree with the split Beer < 1.5]
Both nodes are pure, hence no more splits. Predict OOB $\hat{Y}_1 = \hat{Y}_2$ = no and $\hat{Y}_5$ = yes.

Bootstrap sample #4 ( 1, 2, 2, 2, 5, 5, 6 )
[Figure: tree with one split]
Both nodes are pure, hence no more splits. Predict OOB $\hat{Y}_3 = \hat{Y}_4 = \hat{Y}_7$ = no.

By the majority vote, we have predictions: $\hat{Y}_1 = \hat{Y}_2 = \hat{Y}_3 = \hat{Y}_4 = \hat{Y}_6 = \hat{Y}_7$ = no and $\hat{Y}_5$ = yes. Among the OOB data, $Y_3$ and $Y_4$ are classified correctly, and $Y_1, Y_2, Y_5, Y_6, Y_7$ are misclassified, so the testing error rate is estimated to be 5/7 or 71.4%.
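The arithmetic in parts (b) and (c) can be checked with a few lines of base R. This is only a sketch and not part of the original key. The node counts are inferred from the class proportions quoted in the solution to (b), the majority-vote labels are inferred from the statement that only parties 3 and 4 are classified correctly, and `gini_sum` is a hypothetical helper name. The index follows the key's convention of summing p_no * p_yes over the two nodes, without the factor of 2 or the node-size weights used in some textbooks.

```r
# Sum of p_no * p_yes over the two child nodes (the key's convention).
# 'gini_sum' is a hypothetical helper, not a library function.
gini_sum <- function(left, right) {
  node <- function(counts) {
    p <- counts / sum(counts)
    unname(p["no"] * p["yes"])
  }
  node(left) + node(right)
}

# Class counts per child node, inferred from the proportions in the
# solution to (b); they are NOT read from the raw data table.
splits <- list(
  "57.5" = list(left = c(no = 1, yes = 0), right = c(no = 2, yes = 4)),
  "65"   = list(left = c(no = 3, yes = 1), right = c(no = 0, yes = 3)),
  "72.5" = list(left = c(no = 3, yes = 3), right = c(no = 0, yes = 1))
)
G <- sapply(splits, function(s) gini_sum(s$left, s$right))
best <- names(which.min(G))   # split with the smallest index
print(G)
print(best)

# Out-of-bag (OOB) parties: those that never appear in a bootstrap sample.
boot <- list(c(1, 1, 2, 3, 4, 5, 6), c(3, 4, 4, 5, 5, 7, 7),
             c(3, 3, 4, 6, 7, 7, 7), c(1, 2, 2, 2, 5, 5, 6))
oob <- lapply(boot, function(s) setdiff(1:7, s))
print(oob)

# OOB error rate from the majority-vote labels (inferred, see above).
truth <- c("yes", "yes", "no", "no", "no", "yes", "yes")
vote  <- c("no", "no", "no", "no", "yes", "no", "no")
oob_error <- mean(truth != vote)
print(oob_error)
```

Every party is out-of-bag for at least one of the four samples, which is why all 7 observations contribute to the error estimate.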

2. (Use R for data analysis) A real estate appraiser is interested in a reliable working method of evaluating residential home prices as a function of various features. Data on 5 recent home sales are available on our course web site. All the files there contain identical data in different formats. The following variables are included.

Column   Variable
1        Identification number 1 5
2        Sales price of residence (thousands of dollars)
3        Finished area of residence (square feet)
4        Total number of bedrooms in residence
5        Total number of bathrooms in residence
6        Air conditioning: present or absent
7        Number of cars that garage will hold
8        Swimming pool: present or absent
9        Year property was originally constructed
10       Quality of construction: high, medium, or low
11       Architectural style. Three styles are coded as 1, 2, and 3
12       Lot size (square feet)
13       Location near a highway: yes or no

While predicting Sales Price, we would like to reduce multicollinearity in order to obtain a more reliable prediction. Compare the performance of the following methods.

(a) Variable selection. List variables removed from the model, and write the final regression equation that a real estate appraiser should use. What method do you apply? Predict the sales price of a high quality 4-bedroom house built in 1990 in style #3, with a -car garage, an air conditioner, a swimming pool, 3 bathrooms, finished area of 500 square feet, and a 0,000 square foot lot, that is far from a highway.

(b) Ridge regression. Find the optimal tuning parameter λ. According to the model, how is the home price expected to change if one additional bathroom and a swimming pool are built, without changing other variables?

(c) Lasso. Find the optimal tuning parameter λ. List variables that lasso eliminated from the model and variables that were retained. According to the model, is a home expected to gain value if it is close to a highway? Use your results to answer this question.

(d) Principal components regression. Fit a principal components regression based on standardized (rescaled) X-variables. How many principal components are needed to explain 80% of the total variation of standardized X-variables? How many principal components are needed to explain 80% of the total variation of sales prices?

(Stat-67 only) In (a,b,c), estimate the prediction mean squared error by some cross-validation. Which method provides the lowest prediction MSE?

SOLUTION.

(a) The stepwise selection method removes Bedrooms, Pool, and Air Conditioner from the model. The resulting regression equation is

Price = FinishedArea I Quality=low 13.6 I Quality=medium +1.5 Year LotSize 15.9 I Style= 38.1 I Style= Bathrooms 36.5 Highway +9.4 GarageSize.

Predicted sales price is Ŷ =$44s,413. The prediction mean squared error, computed by leave-one-out cross-validation, is MSE = 3447 (in thousands of dollars squared), which corresponds to a root mean squared error of √MSE = 58.7 thousand dollars.

(b) The optimal tuning parameter is about λ = [value missing]. It gives a prediction mean squared error of MSE = 3486 (in thousands of dollars squared). The home price is expected to increase by an estimated $\hat{\beta}_{bathroom} + \hat{\beta}_{pool}$ = 18.4 thousand dollars.

(c) The optimal tuning parameter is about λ = [value missing]. It gives a prediction mean squared error of MSE = 3467 (in thousands of dollars squared). Lasso has not removed any variables. The slope $\hat{\beta}_{highway}$ is negative (about −33), so the home value is expected to decrease if it is close to a highway (a decrease of about 33 thousand dollars).

(d) The first seven principal components explain 83% of the total variation of X. The first twelve principal components explain at least 80% of the total variation of the response, the sales price.
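The leave-one-out cross-validation MSE quoted in (a) does not require refitting the model n times when the model is an ordinary linear regression: the i-th leave-one-out residual equals e_i / (1 - h_ii), where h_ii is the i-th hat value. The sketch below illustrates this identity on R's built-in `mtcars` data, which is a stand-in and not the home-sales data; the model formula is likewise only an assumption for illustration.

```r
# Stand-in data: built-in mtcars, NOT the home-sales data from the test.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Shortcut: leave-one-out residual = ordinary residual / (1 - hat value).
h <- hatvalues(fit)
loocv_mse <- mean((residuals(fit) / (1 - h))^2)

# Brute-force check: refit n times, leaving one row out each time.
n <- nrow(mtcars)
loo_resid <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  loo_resid[i] <- mtcars$mpg[i] - predict(fit_i, newdata = mtcars[i, ])
}
brute_mse <- mean(loo_resid^2)

# The two estimates agree; the square root is on the scale of the response.
print(c(shortcut = loocv_mse, brute_force = brute_mse,
        rmse = sqrt(loocv_mse)))
```

For a Gaussian `glm` fit, the first element of `cv.glm(...)$delta` from the `boot` package, with its default of leave-one-out folds, should report the same quantity.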

R Codes (FYI)

Reading data

> HOMES = read.table(url("..."))
> names(HOMES)
 [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" "V12" "V13"

or

> HOMES = read.csv(url("..."))
> names(HOMES)
 [1] "ID" "SALES_PRICE" "FINISHED_AREA" "BEDROOMS" "BATHROOMS"
 [6] "GARAGE_SIZE" "YEAR_BUILT" "STYLE" "LOT_SIZE" "AIR_CONDITIONER"
[11] "POOL" "QUALITY" "HIGHWAY"
> head(HOMES)
  ID SALES_PRICE FINISHED_AREA BEDROOMS BATHROOMS GARAGE_SIZE YEAR_BUILT STYLE LOT_SIZE AIR_CONDITIONER POOL QUALITY HIGHWAY
1 ... MEDIUM ...
2 ... MEDIUM ...
3 ... MEDIUM ...
4 ... MEDIUM ...
5 ... MEDIUM ...
6 ... MEDIUM ...
> attach(HOMES)

Regression and variable selection

> null = lm( SALES_PRICE ~ 1, data=HOMES )
> full = lm( SALES_PRICE ~ .-ID, data=HOMES )
> step( null, scope=list(lower=null, upper=full), direction="forward" )

SALES_PRICE ~ FINISHED_AREA + QUALITY + YEAR_BUILT + LOT_SIZE + as.factor(STYLE) + BATHROOMS + HIGHWAY + GARAGE_SIZE

                  Df Sum of Sq RSS AIC
<none>
BEDROOMS
POOL
AIR_CONDITIONER

Coefficients:
(Intercept)  FINISHED_AREA  QUALITYLOW  QUALITYMEDIUM
-.346e  e  e  e+0
YEAR_BUILT  LOT_SIZE  as.factor(STYLE)2  as.factor(STYLE)3
1.49e  e  e  e+01
BATHROOMS  HIGHWAY  GARAGE_SIZE
8.81e  e  e+00

> reg = glm( SALES_PRICE ~ FINISHED_AREA + BATHROOMS + GARAGE_SIZE + YEAR_BUILT + LOT_SIZE + QUALITY + HIGHWAY + as.factor(STYLE), data=HOMES )
> Yhat = predict( reg, data.frame( FINISHED_AREA=500, BATHROOMS=3, GARAGE_SIZE=, YEAR_BUILT=1990, LOT_SIZE=0000, QUALITY="HIGH", HIGHWAY="", STYLE=3 ) )
> Yhat                              # Predicted price of a given home
> library(boot)
> cv.error = cv.glm( HOMES, reg )
> cv.error$delta
[1]                                 # Predicted mean squared error

Ridge Regression

> library(glmnet)
> X = model.matrix( SALES_PRICE ~ .-ID-STYLE+as.factor(STYLE), data=HOMES )
> cv.ridge = cv.glmnet( X, SALES_PRICE, alpha=0, lambda=1:0 )
> cv.ridge$lambda.min
[1] 1
> cv.ridge = cv.glmnet( X, SALES_PRICE, alpha=0, lambda=seq(0,10,0.01) )
> cv.ridge$lambda.min
[1] 0.76
> cv.ridge = cv.glmnet( X, SALES_PRICE, alpha=0, lambda=seq(0,1,0.001) )
> cv.ridge$lambda.min
[1]                                 # Optimal tuning parameter lambda (although it's unstable: it depends on
                                    # randomized CV, so your answer may be different; it has to be between 0 and 1)
> min(cv.ridge$cvm)
[1]                                 # Prediction MSE
> ridgereg = glmnet( X, SALES_PRICE, alpha=0, lambda=cv.ridge$lambda.min )   # Our best model
> predict( ridgereg, type="coefficients" )
(Intercept)         e+03
FINISHED_AREA       e-01
BEDROOMS            e+00
BATHROOMS           e+00
GARAGE_SIZE         e+00
YEAR_BUILT          e+00
LOT_SIZE            e-03
AIR_CONDITIONER     e+00
AIR_CONDITIONER     e-01
POOL                e+00
QUALITYLOW          e+0
QUALITYMEDIUM       e+0
HIGHWAY             e+01
as.factor(STYLE)2   e+01
as.factor(STYLE)3   e+01
> plot(cv.ridge$lambda, cv.ridge$cvm)
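The note above that `lambda.min` is unstable reflects the random assignment of observations to cross-validation folds. The sketch below illustrates the idea in base R only, without `glmnet`: a closed-form ridge fit plus a 5-fold grid search, with the seed fixed so the result is repeatable. It uses the built-in `mtcars` data as a stand-in for the home-sales data, and `ridge_fit` and `cv_mse` are hypothetical helper names. The λ scale here is not comparable to `glmnet`'s, which scales the loss and the predictors differently.

```r
set.seed(1)  # folds are random; fixing the seed makes the choice repeatable

# Stand-in data: standardized predictors from built-in mtcars.
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "drat", "qsec")]))
y <- mtcars$mpg

# Closed-form ridge fit with an unpenalized intercept (hypothetical helper).
ridge_fit <- function(X, y, lambda) {
  xbar <- colMeans(X)
  ybar <- mean(y)
  Xc <- sweep(X, 2, xbar)
  beta <- drop(solve(crossprod(Xc) + lambda * diag(ncol(X)),
                     crossprod(Xc, y - ybar)))
  list(beta = beta, intercept = ybar - sum(xbar * beta))
}

# K-fold cross-validated MSE for one value of lambda (hypothetical helper).
cv_mse <- function(X, y, lambda, folds) {
  errs <- numeric(length(y))
  for (k in unique(folds)) {
    test <- folds == k
    f <- ridge_fit(X[!test, , drop = FALSE], y[!test], lambda)
    pred <- f$intercept + drop(X[test, , drop = FALSE] %*% f$beta)
    errs[test] <- y[test] - pred
  }
  mean(errs^2)
}

folds <- sample(rep(1:5, length.out = nrow(X)))
lambdas <- seq(0, 10, by = 0.5)
mse <- sapply(lambdas, function(l) cv_mse(X, y, l, folds))
best_lambda <- lambdas[which.min(mse)]
print(best_lambda)
print(min(mse))
```

Rerunning with a different seed generally moves `best_lambda`, which is the same instability the session comments warn about.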

LASSO

> cv.lasso = cv.glmnet( X, SALES_PRICE, alpha=1 )
> cv.lasso$lambda.min
[1]
> cv.lasso = cv.glmnet( X, SALES_PRICE, alpha=1, lambda=seq(0,1,0.001) )
> cv.lasso$lambda.min
[1] 0.79                            # Optimal tuning parameter lambda
> plot(cv.lasso$lambda, cv.lasso$cvm)
> min(cv.lasso$cvm)
[1]                                 # Prediction MSE
> lasso = glmnet( X, SALES_PRICE, alpha=1, lambda=cv.lasso$lambda.min )
> predict( lasso, type="coefficients" )   # Our best lasso model
(Intercept)         e+03
FINISHED_AREA       e-01
BEDROOMS            e+00
BATHROOMS           e+00
GARAGE_SIZE         e+00
YEAR_BUILT          e+00
LOT_SIZE            e-03
AIR_CONDITIONER     e+00
POOL                e+00
QUALITYLOW          e+0
QUALITYMEDIUM       e+0
HIGHWAY             e+01
as.factor(STYLE)2   e+01
as.factor(STYLE)3   e+01
> plot(cv.lasso$lambda, cv.lasso$cvm)

PC Regression

> library(pls)
> pcreg = pcr( SALES_PRICE ~ .-ID-STYLE+as.factor(STYLE), data=HOMES, scale=TRUE, validation="CV" )
> summary(pcreg)
VALIDATION: RMSEP
Cross-validated using 10 random segments.
TRAINING: % variance explained
             1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X
SALES_PRICE
             9 comps  10 comps  11 comps  12 comps  13 comps
X
SALES_PRICE
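The "X" row of the `pcr` summary is the cumulative percentage of variance of the standardized predictors explained by the leading components. A minimal base-R sketch of that calculation follows, using `prcomp` on the built-in `mtcars` data as a stand-in for the home-sales data; the variable names are chosen only for illustration.

```r
# Stand-in data: all mtcars columns except the response mpg.
X <- mtcars[, setdiff(names(mtcars), "mpg")]

# Principal components of the standardized (rescaled) predictors.
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of total variance per component, and the running total.
var_share <- pc$sdev^2 / sum(pc$sdev^2)
cum_share <- cumsum(var_share)

# Smallest number of components whose cumulative share reaches 80%.
n_comp_80 <- which(cum_share >= 0.80)[1]
print(round(cum_share, 3))
print(n_comp_80)
```

This covers only the variation of X. The share of the response explained by the first k components depends on the regression on those components and is what the SALES_PRICE row of the summary reports.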
