Milestone 2. Zillow House Price Prediction. Group: Lingzi Hong and Pranali Shetty


MILESTONE 2 REPORT

Data Collection
The following additional features were added:
1. Population, number of college graduates, and education rank by zip code.
2. School type and school rating (in addition to the number of schools within 1 km of each house), with the hope of improving the prediction.

Data Cleaning and Pre-processing
The data we scraped and the attributes we decided to add were divided into four groups: inner house properties, school data, environmental data, and zip-code features. Our main task was to clean the four different datasets and then merge them together. The following was done to clean and pre-process the data:
- Remove prices less than $1M and higher than $7M.
- Remove "$", ",", "K", or "M" characters from the dataset.
- Convert the house built year to the age of the house.
- Convert the appropriate variables (SchoolType, Rank, Postal, Built_Year) to factors.
- Delete instances with more than 100 baths.
- Convert shorthand numbers like 1M to full numeric values in Price.
- Convert acres to square feet in Lot_Area. (These steps were done in Excel, as they were easier to implement there.)
- Remove words like "Built in", "Sold", and others from the dataset.
- The Bed attribute had "Studio" as a value, which technically means no bedroom, so we decided to replace it with 0. The word "Studio" was not uniform throughout (some rows had variants of "Studio" due to scraping errors), so we standardized the values first and then replaced "Studio" with 0.

Data Engineering
We created a variable Age, a measure of how old the house is, calculated as the built year subtracted from 2014 (the present year). Built_Year is a factor, and during our initial analysis we noticed we couldn't derive adequate value from it directly, so Age is how we put it to use.

Data Merging
All four datasets share the base data, the inner house properties. The three external attribute sets (zip-code factors, school data, and environmental factors) were then merged on unique column names to get a full dataset with all attributes.
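The cleaning rules above can be sketched as small helper functions. This is an illustrative Python sketch, not the actual cleaning code (parts of which were done in Excel); the exact parsing rules for "$"/","/"K"/"M", the "Studio" recoding, and the 2014 reference year follow the description above, but the function names are ours.

```python
def clean_price(raw):
    """Convert a scraped price string like '$1.2M' or '950K' to a dollar amount."""
    s = raw.replace("$", "").replace(",", "").strip()
    if s.endswith("M"):
        return float(s[:-1]) * 1_000_000
    if s.endswith("K"):
        return float(s[:-1]) * 1_000
    return float(s)

def clean_beds(raw):
    """'Studio' (in any spelling variant) technically means no bedroom: recode as 0."""
    return 0 if "studio" in str(raw).strip().lower() else int(raw)

def house_age(built_year, present=2014):
    """Age = built year subtracted from the present year (2014)."""
    return present - int(built_year)
```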
By this time we have around 6383 rows and about 19 raw attributes (original, not recoded in any way) that we can work with statistically. Latitude, Longitude, City, Region, and Address will not be considered from now on; they were used fully in the preliminary data analysis to understand the dataset and its features, both through geo-graphs and statistical tests.
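The merge step amounts to a key-based join of the three external tables onto the base house records. A minimal Python sketch of the idea, assuming Postal as the shared key; the field names and values here are made up for illustration and are not the actual merge code:

```python
# Base data: inner house properties (illustrative records).
houses = [
    {"Address": "123 Main St", "Postal": "98118", "Bed": 3},
    {"Address": "456 Oak Ave", "Postal": "98106", "Bed": 2},
]

# External table: zip-code features keyed by the shared column.
zip_features = {
    "98118": {"MedIncome": 52000, "Population": 41000},
    "98106": {"MedIncome": 48000, "Population": 23000},
}

# Join each house record with its zip-code attributes on the unique key.
merged = [dict(house, **zip_features.get(house["Postal"], {})) for house in houses]
```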

Data Overview
After the above processes we have 6383 rows of data and about 19 attributes. Please note that we have not excluded any NA or blank fields at this point; they are tackled per the modelling requirements later. The dataset has four main sections from which attributes can be picked to help in prediction (factors are explicitly marked; the others are continuous data or are converted to a scale during analysis).

Inner House Properties
- Bed (factor, but converted to scale)
- Bath
- Built_Year (factor)
- Price_Sqft (simply Price / Sqft_Area, technically the same information as Price, so not used in analysis)
- Lot_Area
- Sqft_Area

School Properties
- School (number of schools within 1 km of the house)
- SchoolDist (distance from the house to the nearest school)
- SchoolTSRatio (student-teacher ratio at the school)
- SchoolRating (1-10 rank of the school; 10 is best, decreasing to 1)
- SchoolType (factor)

Zipcode Features (attributes taken to describe a community)
- MedIncome
- Postal (factor)
- Population
- College.Graduates
- Rank (education rank of the zip code) (factor)
- MedAge

Environmental Data
- TranEnvi (distance to the nearest water body)
- TransDist (distance to the nearest transportation medium)
- Crime (number of criminal incidents within 3 km of the house)
- Envi (number of water bodies in the area)

Data Models Summary (Q.4 comparative analysis + analysis questions of each model answered here)

Research Question
To understand the factors that affect house prices in Seattle and predict them. We performed the following machine learning techniques to best answer our research question:
- Linear regression
- Multivariate regression
- Regularization: ridge and lasso methods
- Logistic regression
- Naïve Bayes classification
- Decision tree classification: boosting and random forest
- Decision tree regression: boosting and random forest
- SVMs

1. Linear Regression
We ran a linear regression for each of the attributes except the factors. Since linear regression gives its best results when both the independent and dependent variables are continuous, we did not run it for factors. The intercept, coefficient, and residuals of each model were recorded. To compare models we also noted each model's correlation and RMSE. (R-squared was not a good criterion in this case, as each model has only one independent variable.) College.Graduates, Population, SchoolRating, MedIncome, School, and Sqft_Area are significant predictors when a model of all variables is created; among those, College.Graduates, Population, and Sqft_Area are the most predictive. The features that showed strong correlation with price are College.Graduates, Sqft_Area, Bath, Bed (slight, but still better than the rejected features), number of water bodies in the area, Crime, distance to the nearest transport, student-teacher ratio, school rating, median income of the community (zip code), and population of the community. In conclusion, linear regression is applicable to our case, though certainly not with a single variable. Running linear regression helped us understand which variables are significant and which are not, and since many of our attributes are continuous, linear regression is a good starting step.
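The fits themselves were done in R with lm(); to make the two comparison metrics concrete, here is a minimal pure-Python sketch of a closed-form single-variable least-squares fit and the RMSE measure used throughout. The data are made up for illustration:

```python
import math

def fit_simple_ols(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

def rmse(pred, actual):
    """Root mean squared error between predictions and observed values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

# Made-up data: price rises exactly $300 per extra square foot.
sqft = [1000.0, 1500.0, 2000.0, 2500.0]
price = [300000.0, 450000.0, 600000.0, 750000.0]
intercept, slope = fit_simple_ols(sqft, price)
predictions = [intercept + slope * x for x in sqft]
```

The cor(prediction, test) and rmse(prediction, test) calls in the R listings below report exactly these kinds of quantities on the held-out test set.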
The table of per-model results recorded the adjusted R-squared, correlation, RMSE, significance, intercept, coefficient, and residual standard error of each model. The significance of each single-variable model:

Model                      Significant (predictive)?               Residual df
Price ~ Sqft_Area          Yes                                     5100
Price ~ Bath               Yes                                     5104
Price ~ Bed                Yes                                     5104
Price ~ Lot_Area           Yes                                     5040
Price ~ Age                Yes                                     5094
Price ~ Envi               Yes                                     5104
Price ~ Crime              Yes                                     5104
Price ~ TranEnvi           No                                      5104
Price ~ TransDist          Slightly (p < 0.1)                      156
Price ~ School             Yes (p < 0.05)                          156
Price ~ SchoolDist         No                                      156
Price ~ SchoolTSRatio      Yes                                     156
Price ~ SchoolRating       Slight (p < 0.1), but good correlation  120
Price ~ MedIncome          Yes                                     156
Price ~ Population         Yes                                     156
Price ~ College.Graduates  Yes                                     156

2. Multivariate Regression
For multivariate regression we used the four main groups of data: the house properties, the schools around the house, environmental factors, and community characteristics. Starting with each, we added data to the model. We noted the

confidence intervals, ANOVA tables, model coefficients, and intercept values in each case. Overall, 12 models were built, and the correlation, R-squared, and RMSE values of each model helped us analyze which one was the best. Comparing the models on those metrics, Model 11 is picked as the best. Its fitted equation predicts Price from Bath, Bed, Sqft_Area, Lot_Area, Age, ChangeCrime, Envi, TranEnvi, SchoolRating, MedIncome, Rank, Population, and the Bed-Bath interaction (BedBath).

3. Regularization Models
For each of the 12 multivariate regression models, regularization (lasso and ridge) was performed, and the correlation and RMSE of the lasso and ridge fits were recorded. Regularization did not improve the models: it drops the correlation considerably, although at the same time the RMSE value decreases too, which is a good sign. We mainly looked at the lasso method, as ridge usually ends up keeping many variables, which does not lead to a good model.
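The shrinkage effect of regularization is easiest to see in the one-feature case, where ridge has a closed form. A minimal sketch of the idea only; our actual models were multivariate and fitted in R, and the data here are made up:

```python
def ridge_1d(x, y, lam):
    """One-feature ridge regression without intercept:
    beta = sum(x*y) / (sum(x^2) + lambda).
    lambda = 0 recovers ordinary least squares; a larger lambda
    shrinks the coefficient toward zero."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

x = [-2.0, -1.0, 1.0, 2.0]   # centered, made-up feature
y = [-4.0, -2.0, 2.0, 4.0]   # y = 2x exactly
b_ols = ridge_1d(x, y, 0.0)    # ordinary least-squares slope
b_ridge = ridge_1d(x, y, 10.0) # same slope, shrunk by the penalty
```

Lasso has no closed form even in this case (it is solved by soft-thresholding), but it shrinks coefficients all the way to zero, which is why it tends to drop variables while ridge keeps many of them.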

4. Logistic Regression
Our dependent variable is continuous, so this method does not suit it best. Moreover, we could not use the glm() function, as it requires the dependent variable to be binary. So we decided to split our data into 10 categories of price ranges and run ordinal logistic regression using the polr() function from the MASS package. We developed about 8 models from the knowledge we gained from the multiple regression models. However, because the range within each category is so large in terms of price estimation, almost every model we ran reported all features as significant, which we take to mean the model fits the attributes within those wide ranges. So for the purpose of our research question, which is to determine the house price and not a price range, this model fails.

5. Naïve Bayes
We separated price into 5 categories. Findings:
1. The Laplace estimator changed nothing in the model.
2. Applying a kernel made the prediction results better.
3. Among the Naïve Bayes models, the one using the inner property variables plus zip code performs best.
4. From the plot, Bath and house area are the relatively better variables for defining the price class.

Seven classification models were compared: model 1 (inner properties and zip code), model 2 (add Envi), model 3 (add School), model 4 (all variables), model 5 (add Laplace), model 6 (add kernel), and model 7 (kernel and Laplace).
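The binning used by these classifiers can be sketched as follows. The cut points below are hypothetical, since the actual category boundaries are not given in the report:

```python
def price_class(price, cuts):
    """Map a price to class 1..len(cuts)+1 given ascending cut points."""
    for k, cut in enumerate(cuts, start=1):
        if price <= cut:
            return k
    return len(cuts) + 1

# Hypothetical boundaries for 5 price categories (illustration only).
cuts = [300_000, 500_000, 800_000, 1_200_000]
```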


6. Decision Trees
For decision trees we separated price into 5 categories.
a. The data were split into training and testing sets.
b. Decision tree classification model: we trained a decision tree and interpreted some of the resulting if-then rules. To explain one branch of the tree: if the Postal of the house is one of 90106, 98108, 98118, …, then check the age of the house. If the age is greater than 1, go to another branch. If the age is less than 1 and the crime variable is less than 803, the price belongs to the 3rd class. If the age is less than 1 and the crime variable is greater than 803, check the area of the house: if it is less than 1980, the price belongs to the 4th class; if it is between 1980 and 2395, the price belongs to the 5th class; and if the area is greater than 2395, the price belongs to the 4th class.
We then compared the training and testing results by confusion matrix (classes 1-5) and the percentage of data correctly classified, starting with the training dataset.
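The quoted branch can be transcribed directly into code. The thresholds, classes, and zip codes are as stated above; the behaviour at exact threshold values and for the elided zip codes is an assumption:

```python
def classify_branch(postal, age, crime, sqft_area):
    """One branch of the fitted decision tree, transcribed from the rules above."""
    if postal not in {"90106", "98108", "98118"}:  # elided zip codes omitted
        return "other branch"
    if age > 1:
        return "other branch"
    if crime < 803:
        return 3           # 3rd price class
    if sqft_area < 1980:
        return 4           # 4th price class
    if sqft_area < 2395:
        return 5           # 5th price class
    return 4               # back to the 4th class for the largest houses
```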

The same confusion matrix and correct-classification percentage were computed for the testing dataset. Performance on the training dataset is better than on the testing dataset.
c. Boosting overall achieves higher prediction accuracy than the decision tree without boosting; with more trees, the prediction accuracy first increases and then decreases. Five classification models were compared: model 1 (basic model) and models 2 through 5 with boosting over 8, 10, 12, and 20 trees.
d. In our model, performance improves as more variables are included, and training more trees on the same variables also improves the accuracy of prediction. The random forest comparison covered models with 8, 6, and 4 variables, plus 8-variable forests with 50, 100, 150, and 200 trees.
e. Decision tree regression model. Variable importance, in decreasing order: Sqft_Area, Postal, Bath, Crime,

Lot_Area, Envi, Age, Bed.

Plot of the tree model

Comparison of decision tree and random forest: the two were compared on correlation (cor) and RMSE.

7. SVM
Classification model: SVM classifiers with the vanilladot, rbfdot, and anovadot kernels were compared on the percentage of correct classifications. Regression model: SVM regressions with the same three kernels were compared on correlation and RMSE. The linear kernel performs best for classification, while the Gaussian kernel performs best for the regression model.

Comparative Analysis of Classification and Regression Models
Comparison between classification models: in Naïve Bayes, model 6, combining the inner property variables with zip code and a kernel, performs best on prediction accuracy. In decision trees, model 1, with 8 variables in the random forest, performs best. In SVM, the model with the vanilladot kernel performs best. The best model comes from the random forest, which has overall higher prediction accuracy than SVM and Naïve Bayes; for house price prediction, Naïve Bayes does not perform as well as the other classifiers.
Comparison between regression models: in multivariate regression, Model 11 is picked as the best on R-squared, prediction-test correlation, and RMSE. In decision trees, the best regression model comes from the random forest. In the SVM models, the rbfdot kernel performs best. Of the three, the random forest still performs better than the others.
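The selection rule used throughout this comparison (higher correlation and lower RMSE win) can be made explicit in a few lines. The scores below are placeholders for illustration, not the report's actual numbers:

```python
# Placeholder metric values per model family (illustration only).
models = {
    "naive_bayes":   {"cor": 0.61, "rmse": 310_000},
    "svm_rbf":       {"cor": 0.74, "rmse": 240_000},
    "random_forest": {"cor": 0.82, "rmse": 190_000},
}

# Lower RMSE is better; higher correlation is better.
best_by_rmse = min(models, key=lambda m: models[m]["rmse"])
best_by_cor = max(models, key=lambda m: models[m]["cor"])
```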

MODELS (Q.1)

a. Linear Regression
We tried out various independent parameters. Linear regression provides its best results, or rather is preferred, when both the dependent (outcome) variable and the independent variable are continuous.

1. Model 1: House Price and Sqft Area
> summary(model)
Call:
lm(formula = Price ~ Sqft_Area, data = hou_train)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              e-06 ***
Sqft_Area                               < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sqft_Area: Significant
Residual standard error: on 5100 degrees of freedom (4 observations deleted due to missingness)
Multiple R-squared: , Adjusted R-squared:
F-statistic: 4890 on 1 and 5100 DF, p-value: < 2.2e-16

Intercept and Coefficient:
> model$coefficients
(Intercept)   Sqft_Area

14 Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Sqft_Area e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Sqft_Area Prediction and Confidence Bands

15 Diagnostic Plots 2. Model 2 House and Bath > summary(model) Call lm(formula = Price ~ Bath, data = hou_train) Data Residuals

16 Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e-08 *** Bath < 2e-16 *** --- Signif. codes 0 *** ** 0.01 * Bath Significant Residual standard error Residual standard error on 5104 degrees of freedom R-Square values Multiple R-squared , Adjusted R-squared F-statistic 1972 on 1 and 5104 DF, p-value < 2.2e-16 Intercept and Coefficient > model$coefficients (Intercept) Bath Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Bath

17 Prediction and Confidence Bands

18 3. Model 3 House Price and Bed Diagnostic Plots

19 > summary(model) Call lm(formula = Price ~ Bed, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) * Bed <2e-16 *** --- Signif. codes 0 *** ** 0.01 * Bed Significant Residual standard error Residual standard error on 5104 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 5104 DF, p-value < 2.2e-16 Intercept and Coefficient > model$coefficients (Intercept) Bed Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bed e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Bed

20 Prediction and Confidence Bands Diagnostic Plot

21 4. Model 4 Lot Area and House Price > summary(model) Call lm(formula = Price ~ Lot_Area, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) 5.896e e < 2e-16 *** Lot_Area e e e-07 *** --- Signif. codes 0 *** ** 0.01 * Lot_Area Significant Residual Standard Error Residual standard error on 5040 degrees of freedom (64 observations deleted due to missingness) R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 5040 DF, p-value 3.506e-07 Coefficients and Intercept > model$coefficients (Intercept) Lot_Area Correlation > cor(prediction,hou_test$price, use='complete')

22 [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Lot_Area e e e-07 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) e+05 Lot_Area e-01 Prediction and Confidence Bands

23 5. Model 5 House Price and Age of House Diagnostic Plot

24 > summary(model) Call lm(formula = Price ~ Age, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) < 2e-16 *** Age e-13 *** --- Signif. codes 0 *** ** 0.01 * Age Significant Residual Standard Error Residual standard error on 5094 degrees of freedom (10 observations deleted due to missingness) R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 5094 DF, p-value 1.227e-13 Coefficients and Intercept > model$coefficients (Intercept) Age Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Age e e e-13 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Age

25 Prediction and Confidence bands Diagnostic Plots

26 6. Model 6 House Price and Number of Water Views in 0.5km Radius > summary(model) Call lm(formula = Price ~ Envi, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) <2e-16 *** Envi <2e-16 *** --- Signif. codes 0 *** ** 0.01 * Envi Significant Residual Standard Error Residual standard error on 5104 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 5104 DF, p-value < 2.2e-16 Coefficients and Intercept > model$coefficients (Intercept) Envi Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Envi e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Envi

27 Prediction and Confidence Bands

28 7. Model 7 House Price and Crime Diagnostic Plots

29 > summary(model) Call lm(formula = Price ~ Crime, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) 4.374e e <2e-16 *** Crime 8.834e e <2e-16 *** --- Signif. codes 0 *** ** 0.01 * Crime Significant Residual Standard Error Residual standard error on 5104 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 5104 DF, p-value < 2.2e-16 Coefficients and Intercept > model$coefficients (Intercept) Crime Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Crime e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Crime

30 Prediction and Confidence Bands Diagnostic Plots

31 8. Model 8 House Price and Distance to water bodies > summary(model) Call lm(formula = Price ~ TranEnvi, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) <2e-16 *** TranEnvi Signif. codes 0 *** ** 0.01 * TranEnvi Significant Residual Standard Error Residual standard error on 5104 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared 8.001e-05 F-statistic on 1 and 5104 DF, p-value Coefficients and Intercept > model$coefficients (Intercept) TranEnvi Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) TranEnvi e e Residuals e e+11 > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) TranEnvi

From the plot we can see that TranEnvi takes only four distinct values, and most houses have TranEnvi equal to 0, so it is not suitable for linear regression; the diagnostic plots show this as well. Still, we have listed the parameters for review. Prediction and Confidence Bands

33 Diagnostic Plots 9. Model 9 House Price and Distance of House and Its Nearest Transportation Place > summary(model) Call lm(formula = Price ~ TransDist, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e-08 *** TransDist Signif. codes 0 *** ** 0.01 * TransDist Slightly Significant (p<0.1) Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 156 DF, p-value Coefficients and Intercept > model$coefficients (Intercept) TransDist

34 Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) TransDist e e+11 Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) TransDist

35 Prediction and Confidence Bands Diagnostic Plots

36 10. Model 10 House Price and Number of Schools in 2km Radius > summary(model) Call lm(formula = Price ~ School, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e-06 *** School * Signif. codes 0 *** ** 0.01 * School School Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared 0.02 F-statistic on 1 and 156 DF, p-value Coefficients and Intercept > model$coefficients (Intercept) School Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) School e e * Residuals e e Signif. codes 0 *** ** 0.01 * confint(model, level=0.95) 2.5 % 97.5 % (Intercept) School

37 Prediction and Confidence Bands

38 Diagnostic Plots 11. Model 11 House Price and Distance to the Nearest School > summary(model) Call lm(formula = Price ~ SchoolDist, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e-13 *** SchoolDist Signif. codes 0 *** ** 0.01 * SchoolDist Not Significant Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 156 DF, p-value Coefficients and Intercept > model$coefficients (Intercept) SchoolDist

39 Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) SchoolDist e e Residuals e e+11 > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) SchoolDist

40 Prediction and Confidence Bands Diagnostic Plots

41 12. Model 12 House Price and the Ratio of Student to Teacher in the Nearest School > summary(model) Call lm(formula = Price ~ SchoolTSRatio, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) < 2e-16 *** SchoolTSRatio e-09 *** Signif. codes 0 *** ** 0.01 * SchoolTSRatio Significant Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 156 DF, p-value 2.97e-09 Coefficients and Intercept > model$coefficients (Intercept) SchoolTSRatio Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) SchoolTSRatio e e e-09 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) SchoolTSRatio

42 Prediction and Confidence Bands

43 Diagnostic Plots 13. Model 13 House Price and Rating of the Nearest School > summary(model) Call lm(formula = Price ~ SchoolRating, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) <2e-16 *** SchoolRating Signif. codes 0 *** ** 0.01 * SchoolRating Slightly significant (however, significant if we consider p<0.1) Residual Standard Error Residual standard error on 120 degrees of freedom (36 observations deleted due to missingness) R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 120 DF, p-value Coefficients and Intercept

44 > model$coefficients (Intercept) SchoolRating Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) SchoolRating e e Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) SchoolRating

45 Prediction and confidence bands Diagnostic Plots

46 14. Model 14 House Price and the Median Income of the District > summary(model) Call lm(formula = Price ~ MedIncome, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) 1.529e e e-13 *** MedIncome e e e-08 *** --- Signif. codes 0 *** ** 0.01 * MedIncome Significant Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 156 DF, p-value 2.343e-08 Coefficients and Intercept > model$coefficients (Intercept) MedIncome Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) MedIncome e e e-08 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) MedIncome

47 Prediction and Confidence Bands

48 Diagnostic Plots 15. Model 15 House Price and the Population > summary(model) Call lm(formula = Price ~ Population, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) < 2e-16 *** Population e-13 *** --- Signif. codes 0 *** ** 0.01 * Population Significant Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 156 DF, p-value 3.436e-13

49 Coefficients and Intercept > model$coefficients (Intercept) Population Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Population e e e-13 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Population

50 Prediction and Confidence Bands Diagnostic Plots

51 16. Model 16 House Price and the Ratio of College Graduates, thus education level of the District > summary(model) Call lm(formula = Price ~ College.Graduates, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) College.Graduates e-05 *** Signif. codes 0 *** ** 0.01 * College.Graduates Significant Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 156 DF, p-value 4e-05 Coefficients and Intercept > model$coefficients (Intercept) College.Graduates Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) College.Graduates e e e-05 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) College.Graduates

52 Prediction and Confidence Bands

53 Diagnostic Plots 17. Model of all variables to check ranking of variables > summary(model) Call lm(formula = Price ~ Bed + Bath + Sqft_Area + Age + Lot_Area + School + SchoolDist + SchoolTSRatio + SchoolRating + MedIncome + Population + College.Graduates + TransDist + Crime + Envi, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) ** Bed Bath Sqft_Area *** Age Lot_Area School * SchoolDist SchoolTSRatio SchoolRating ** MedIncome **

Population e-05 *** College.Graduates e-05 *** TransDist Crime Envi Signif. codes 0 *** ** 0.01 * College.Graduates, Population, SchoolRating, MedIncome, School, Sqft_Area Significant Residual Standard Error Residual standard error on 106 degrees of freedom (36 observations deleted due to missingness) R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 15 and 106 DF, p-value 1.768e-10 Coefficients and Intercept model$coefficients (Intercept) Bed Bath Sqft_Area Age Lot_Area School SchoolDist SchoolTSRatio SchoolRating MedIncome Population College.Graduates TransDist Crime Envi RMSE > rmse(prediction,hou_test$price) [1] Correlation > cor(prediction,hou_test$price, use='complete') [1] From the above it looks like College.Graduates, Population, SchoolRating, MedIncome, School, and Sqft_Area are significant predictors. Among those, College.Graduates, Population, and Sqft_Area are most predictive. SchoolType is a factor with levels such as public school and private school, so we can't build a linear regression of house price on it. Rank is also a factor and hence can't be used for linear regression. Postal and Built_Year, likewise being factors in the dataset, are not useful in linear regression. b. Multivariate Regression While doing multivariate analysis we started by taking different groups of factors into account one at a time, as follows. Inner House Parameters 1. Model 1 Independent Variables Equation: Bath+Bed+Sqft_Area+Lot_Area > summary(model1, corr=T) Call lm(formula = Price ~ Bath + Bed + Sqft_Area + Lot_Area, data = hou_train)

55 Residuals Min 1Q Median Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e < 2e-16 *** Bath 2.858e e e-05 *** Bed 2.891e e < 2e-16 *** Sqft_Area 3.142e e < 2e-16 *** Lot_Area e e e-06 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area Residual standard error on 5031 degrees of freedom (61 observations deleted due to missingness) R-Square Multiple R-squared 0.493, Adjusted R-squared F-statistic 1223 on 4 and 5031 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Bath Bed Sqft_Area Lot_Area Intercept and Coefficients > model1$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 > anova(model1) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath Bed e e < 2.2e-16 *** e e * Sqft_Area e e < 2.2e-16 *** Lot_Area e e+12 Residuals e e e-06 *** --- Signif. codes 0 *** ** 0.01 * > confint(model1, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+04 Bed e e+04 Sqft_Area e e+02 Lot_Area e e-01 RMSE > rmse(prediction1,hou_test$price) [1] Correlation > cor(prediction1,hou_test$price, use="complete") [1]

56 Diagnostic Plots 2. Model 2 Using Interaction Variables (between Bed and Bath) Independent Variables Equation Bath+Bed+BathBed+Sqft_Area+Lot_Area (Bath*Bed=Bed+Bath+BedBath) > summary(model2, corr=t) Call lm(formula = Price ~ Bath + Bed + BathBed + Sqft_Area + Lot_Area, data = hou_train) Residuals Min 1Q Median Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e < 2e-16 *** Bath 8.030e e < 2e-16 *** Bed 6.664e e < 2e-16 *** Sqft_Area 3.238e e < 2e-16 *** Lot_Area e e e-07 *** BathBed e e e-14 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area, BedBath Residual standard error on 5030 degrees of freedom (61 observations deleted due to missingness) R-Square Multiple R-squared 0.499, Adjusted R-squared
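Model 2's BathBed term is simply the elementwise product of the two columns; in R, the formula Bath*Bed expands to Bath + Bed + Bath:Bed. A one-line sketch of constructing the interaction feature, with hypothetical values:

```python
bath = [2.0, 1.0, 3.5, 2.5]
bed = [3, 2, 4, 3]

# The BathBed interaction column: elementwise product of Bath and Bed
bath_bed = [x * y for x, y in zip(bath, bed)]
```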

57 F-statistic 1002 on 5 and 5030 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Lot_Area Bath Bed Sqft_Area Lot_Area BathBed Intercept and Coefficients > model2$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 BathBed e+04 > anova(model2) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e < 2.2e-16 *** Bed Sqft_Area e e * e e < 2.2e-16 *** Lot_Area e e e-06 *** BathBed e e+12 Residuals e e e-14 *** --- Signif. codes 0 *** ** 0.01 * > confint(model2, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+04 Bed Sqft_Area e e e e+02 Lot_Area e e-01 BathBed e e+04 Correlation > cor(prediction2,hou_test$price, use="complete") [1] RMSE > rmse(prediction2,hou_test$price) [1]

58 Diagnostic Plots 3. Model 3 Add Age of House Independent Variables Equation Bath*Bed+Sqft_Area+Lot_Area+Age > summary(model3, corr=t) Call lm(formula = Price ~ Bath * Bed + Sqft_Area + Lot_Area + Age, data = hou_train) Residuals Min 1Q Median Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e < 2e-16 *** Bath 1.239e e < 2e-16 *** Bed 6.829e e < 2e-16 *** Sqft_Area 2.945e e < 2e-16 *** Lot_Area e e *** Age 2.019e e < 2e-16 *** BathBed e e e-15 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area, BedBath, Age Residual standard error on 5029 degrees of freedom (61 observations deleted due to missingness)

59 R-Square Multiple R-squared , Adjusted R-squared F-statistic on 6 and 5029 DF, p-value < 2.2e-16 Correlation of Coefficients Bath (Intercept) Bath Bed Sqft_Area Lot_Area Age Bed Sqft_Area Lot_Area Age BathBed Intercept and Coefficients > model3$coefficients (Intercept) Bath Bed Sqft_Area e e e e+02 Lot_Area Age BathBed e e e+04 > anova(model3) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e < 2.2e-16 *** Bed Sqft_Area e e * e e < 2.2e-16 *** Lot_Area e e e-06 *** Age BathBed e e < 2.2e-16 *** e e e-15 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model3, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+05 Bed e e+04 Sqft_Area Lot_Area e e e e-01 Age e e+03 BathBed e e+04 Correlation > cor(prediction3,hou_test$price, use="complete") [1] RMSE > rmse(prediction3,hou_test$price) [1]

Diagnostic Plots

4. Model 4
Age is not linear with Price; the relationship looks parabolic (a higher-order polynomial), so a squared term is added.
Independent Variables Equation: Bath*Bed+Sqft_Area+Lot_Area+Age+I(Age^2)

> summary(model4, corr=T)
Call: lm(formula = Price ~ Bath * Bed + Sqft_Area + Lot_Area + Age + I(Age^2), data = hou_train)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e e < 2e-16 ***
Bath 1.158e e < 2e-16 ***
Bed 6.703e e < 2e-16 ***
Sqft_Area 2.918e e < 2e-16 ***
Lot_Area e e **
Age e e ***
I(Age^2) 3.481e e e-16 ***
BathBed e e < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictive features: Bath, Bed, Sqft_Area, Age, Age^2, Lot_Area, BedBath
Residual standard error on 5022 degrees of freedom

61 (67 observations deleted due to missingness) R-Square Multiple R-squared 0.545, Adjusted R-squared F-statistic on 7 and 5022 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Lot_Area Age I(Age^2) Bath Bed Sqft_Area Lot_Area Age I(Age^2) BathBed Intercept and Coefficients > model4$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 Age I(Age^2) BathBed e e e+04 > anova(model4) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath Bed e e < 2.2e-16 *** e e Sqft_Area e e < 2.2e-16 *** Lot_Area Age e e e-07 *** e e < 2.2e-16 *** I(Age^2) e e < 2.2e-16 *** BathBed e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model4, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath Bed e e e e+04 Sqft_Area e e+02 Lot_Area Age e e e e+02 I(Age^2) e e+01 BathBed e e+04 RMSE > rmse(prediction4,hou_test$price) [1] Correlation > cor(prediction4,hou_test$price, use="complete") [1]

Diagnostic Plots

5. Model 5
Use the new Rescaled Age. In the exploratory analysis of the Age variable (as can also be seen above), Age is not linear with Price but rather dips in the middle years: old and new houses probably command a higher price than middle-aged ones. We therefore centralized Age on the mean built year and rescaled it: RevAge = (Mean - YearBuilt)/10.
Independent Variable Equation: Bath*Bed+Sqft_Area+Lot_Area+RevAge+I(RevAge^2)

> summary(model5, corr=T)
Call: lm(formula = Price ~ Bath * Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2), data = hou_train)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e e < 2e-16 ***
Bath 1.158e e < 2e-16 ***
Bed 6.703e e < 2e-16 ***
Sqft_Area 2.918e e < 2e-16 ***
Lot_Area e e **
RevAge e e < 2e-16 ***
I(RevAge^2) 3.481e e e-16 ***
BathBed e e < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
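Model 5's rescaling, RevAge = (mean built year - YearBuilt)/10, can be sketched as follows (hypothetical years, not the report's data):

```python
def rev_age(built_years):
    """Centre Built_Year on the sample mean and rescale by a decade,
    so middle-aged houses sit near zero and the quadratic term can capture the dip."""
    mean_year = sum(built_years) / len(built_years)
    return [(mean_year - y) / 10 for y in built_years]
```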

63 Predictive features Bath, Bed, Sqft_Area, Lot_Area, BedBath, RevAge, RevAge 2 Residual standard error on 5022 degrees of freedom (67 observations deleted due to missingness) R-Square Multiple R-squared 0.545, Adjusted R-squared F-statistic on 7 and 5022 DF, p-value < 2.2e-16 Correlation of Coefficients Bath (Intercept) Bath Bed Sqft_Area Lot_Area RevAge I(RevAge^2) Bed Sqft_Area Lot_Area RevAge I(RevAge^2) BathBed Intercept and Coefficients > model5$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 RevAge I(RevAge^2) BathBed e e e+04 > anova(model5) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e < 2.2e-16 *** Bed Sqft_Area e e e e < 2.2e-16 *** Lot_Area e e e-07 *** RevAge I(RevAge^2) e e < 2.2e-16 *** e e < 2.2e-16 *** BathBed e e < 2.2e-16 *** Residuals e e+10 Signif. codes 0 *** ** 0.01 * > confint(model5, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+05 Bed e e+04 Sqft_Area Lot_Area e e e e-01 RevAge e e+04 I(RevAge^2) e e+03 BathBed e e+04 Correlation > cor(prediction5,hou_test$price, use="complete") [1] RMSE > rmse(prediction5,hou_test$price) [1]

Diagnostic Plots

The model is essentially unchanged, so either Age variable (Age or RevAge) can be used.

Next, add Environmental Factors.

6. Model 6
Add Crime.
Independent Variables Equation: Bath*Bed+Sqft_Area+Lot_Area+RevAge+I(RevAge^2)+Crime

> summary(model6, corr=T)
Call: lm(formula = Price ~ Bath * Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2) + Crime, data = hou_train)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e e < 2e-16 ***
Bath 1.142e e < 2e-16 ***
Bed 6.537e e < 2e-16 ***
Sqft_Area 2.885e e < 2e-16 ***
Lot_Area e e **
RevAge e e < 2e-16 ***
I(RevAge^2) 1.493e e ***
Crime 4.406e e < 2e-16 ***
BathBed e e < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictive features: Bath, Bed, Sqft_Area, Lot_Area, BedBath, Crime, RevAge, RevAge^2

65 Residual standard error on 5021 degrees of freedom (67 observations deleted due to missingness) R-Square Multiple R-squared , Adjusted R-squared F-statistic on 8 and 5021 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Lot_Area RevAge I(RevAge^2) Crime Bath Bed Sqft_Area Lot_Area RevAge 0.12 I(RevAge^2) Crime BathBed Intercept and Coefficients > model6$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 RevAge I(RevAge^2) Crime BathBed e e e e+04 > anova(model6) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath Bed e e < 2.2e-16 *** e e Sqft_Area e e < 2.2e-16 *** Lot_Area RevAge e e e-07 *** e e < 2.2e-16 *** I(RevAge^2) e e < 2.2e-16 *** Crime BathBed e e < 2.2e-16 *** e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model6, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath Bed e e e e+04 Sqft_Area e e+02 Lot_Area RevAge e e e e+04 I(RevAge^2) e e+03 Crime BathBed e e e e+04 Correlation > cor(prediction6,hou_test$price, use="complete") [1] RMSE > rmse(prediction6,hou_test$price) [1]

66 Diagnostic Plots 7. Model 7 Add Envi Independent Variables Equation Bath*Bed+Sqft_Area+Lot_Area+RevAge+I(RevAge^2)+ChangeCrime+Envi > summary(model7, corr=t) Call lm(formula = Price ~ Bath * Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2) + ChangeCrime + Envi, data = hou_train) Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e < 2e-16 *** Bath 1.051e e < 2e-16 *** Bed 6.230e e < 2e-16 *** Sqft_Area 2.791e e < 2e-16 *** Lot_Area e e * RevAge e e < 2e-16 *** I(RevAge^2) 9.362e e * ChangeCrime e e e-16 *** Envi 1.362e e < 2e-16 *** BathBed e e < 2e-16 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area, BedBath,Envi, ChangeCrime, RevAge, RevAge 2 Residual standard error on 5020 degrees of freedom (67 observations deleted due to missingness) R-Square

67 Multiple R-squared , Adjusted R-squared F-statistic on 9 and 5020 DF, p-value < 2.2e-16 Correlation of Coefficients Bath (Intercept) Bath Bed Sqft_Area Lot_Area RevAge I(RevAge^2) Bed Sqft_Area Lot_Area RevAge I(RevAge^2) ChangeCrime Envi BathBed ChangeCrime Envi Bath Bed Sqft_Area Lot_Area RevAge I(RevAge^2) ChangeCrime Envi BathBed > anova(model7) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e < 2.2e-16 *** Bed Sqft_Area e e e e < 2.2e-16 *** Lot_Area e e e-07 *** RevAge I(RevAge^2) e e < 2.2e-16 *** e e < 2.2e-16 *** ChangeCrime e e < 2.2e-16 *** Envi BathBed e e < 2.2e-16 *** e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * Intercept and Coefficients > model7$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 RevAge I(RevAge^2) ChangeCrime Envi BathBed e e e e e+04 > confint(model7, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+05 Bed e e+04 Sqft_Area e e+02 Lot_Area e e-02 RevAge e e+04 I(RevAge^2) e e+03 ChangeCrime e e+04 Envi e e+04 BathBed e e+04 Correlation > cor(prediction7,hou_test$price, use="complete") [1] RMSE > rmse(prediction7,hou_test$price) [1]

Diagnostic Plots

8. Model 8
Use AgeSquare, the square of Age (2014 - Built_Year), and add Envi to it.
Independent Variables Equation: Bath*Bed+Sqft_Area+Lot_Area+AgeSquare+ChangeCrime+Envi
(ChangeCrime is the z-score of Crime, used to center and scale it.)

> summary(model8, corr=T)
Call: lm(formula = Price ~ Bath * Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi, data = hou_train)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e e < 2e-16 ***
Bath 1.028e e < 2e-16 ***
Bed 6.194e e < 2e-16 ***
Sqft_Area 2.799e e < 2e-16 ***
Lot_Area e e *
AgeSquare 1.305e e < 2e-16 ***
ChangeCrime e e e-16 ***
Envi 1.357e e < 2e-16 ***
BathBed e e < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictive features: Bath, Bed, Sqft_Area, Lot_Area, BedBath, ChangeCrime, Envi, Age^2
Residual standard error on 5021 degrees of freedom
(67 observations deleted due to missingness)
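ChangeCrime is described as the z-value of Crime; a minimal sketch of that standardization (using the population standard deviation, which is one reasonable reading of the report):

```python
import math

def z_score(xs):
    """Standardize a column to zero mean and unit (population) standard deviation."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]
```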

69 R-Square Multiple R-squared 0.57, Adjusted R-squared F-statistic 832 on 8 and 5021 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi BathBed 0.62 Envi Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi BathBed 0.08 Intercept and Coefficients > model8$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 AgeSquare ChangeCrime Envi BathBed e e e e+04 > confint(model8, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+05 Bed e e+04 Sqft_Area Lot_Area e e e e-02 AgeSquare e e+01 ChangeCrime e e+04 Envi e e+04 BathBed e e+04 > anova(model8) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath Bed e e < 2.2e-16 *** e e Sqft_Area e e < 2.2e-16 *** Lot_Area AgeSquare e e e-07 *** e e < 2.2e-16 *** ChangeCrime e e < 2.2e-16 *** Envi BathBed e e < 2.2e-16 *** e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * Correlation > cor(prediction8,hou_test$price, use="complete") [1] RMSE > rmse(prediction8,hou_test$price) [1]

70 Diagnostic Plots 9. Model 9 All external and internal values Independent Variables Equation Bath+BathBed+Bed+Sqft_Area+Lot_Area+Age+AgeSquare+ChangeCrime+Envi+TranEnvi > summary(model9, corr=t) Call lm(formula = Price ~ Bath + BathBed + Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi + TranEnvi, data = hou_train) Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e < 2e-16 *** Bath 1.030e e < 2e-16 *** Bed Sqft_Area 6.196e e < 2e-16 *** 2.800e e < 2e-16 *** Lot_Area e e * AgeSquare 1.310e e < 2e-16 *** ChangeCrime e e e-16 *** Envi 1.354e e < 2e-16 *** TranEnvi BathBed 1.156e e e e < 2e-16 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area, BedBath, Envi, ChangeCrime, Age 2 Residual standard error on 5020 degrees of freedom

71 (67 observations deleted due to missingness) R-Square Multiple R-squared 0.57, Adjusted R-squared F-statistic on 9 and 5020 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi BathBed Envi TranEnvi Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi BathBed Intercept and Coefficients > model9$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 AgeSquare ChangeCrime Envi TranEnvi BathBed e e e e e+04 > anova(model9) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath Bed e e < 2.2e-16 *** e e Sqft_Area e e < 2.2e-16 *** Lot_Area AgeSquare e e e-07 *** e e < 2.2e-16 *** ChangeCrime e e < 2.2e-16 *** Envi TranEnvi e e < 2.2e-16 *** e e BathBed e e < 2.2e-16 *** Residuals e e+10 Signif. codes 0 *** ** 0.01 * > confint(model9, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+05 Bed e e+04 Sqft_Area Lot_Area e e e e-02 AgeSquare e e+01 ChangeCrime e e+04 Envi e e+04 TranEnvi e e+04 BathBed e e+04 Correlation > cor(prediction9,hou_test$price, use="complete") [1]

72 RMSE > rmse(prediction9,hou_test$price) [1] Diagnostic Plots Next add School factors 10. Model 10 Test with all school factors and then drop non-significant ones. The good predictors are then added to the above model. Step 1 Test all school factors Independent Variable Equation School+SchoolTSRatio+SchoolDist+SchoolRating (SchoolType is factor and does not yield sensible result in this case so drop it) > summary(model, corr=t) Call lm(formula = Price ~ School + SchoolTSRatio + SchoolDist + SchoolRating, data = hou_train) Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) School e-15 *** SchoolTSRatio e-05 *** SchoolDist SchoolRating *** < 2e-16 *** --- Signif. codes 0 *** ** 0.01 * Residual standard error on 3032 degrees of freedom (2060 observations deleted due to missingness)

73 R-Square Multiple R-squared , Adjusted R-squared F-statistic 166 on 4 and 3032 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) School SchoolTSRatio SchoolDist School SchoolTSRatio SchoolDist SchoolRating As it can be seen above, School is not significant hence will not be used in next model. Step 2 Add significant school factors, Independent Variable Equation Bath+BathBed+Bed+Sqft_Area+Lot_Area+AgeSquare+ChangeCrime+Envi+TranEnvi+SchoolTSRatio+SchoolDist +SchoolRating > summary(model10, corr=t) Call lm(formula = Price ~ Bath + BathBed + Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi + TranEnvi + SchoolTSRatio + SchoolDist + SchoolRating, data = hou_train) Residuals Min 1Q Median Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e < 2e-16 *** Bath 1.223e e < 2e-16 *** Bed 7.198e e < 2e-16 *** Sqft_Area 2.049e e < 2e-16 *** Lot_Area e e * AgeSquare 6.127e e e-05 *** ChangeCrime e e Envi 1.729e e < 2e-16 *** TranEnvi e e ** SchoolTSRatio e e SchoolDist 3.857e e SchoolRating 3.370e e < 2e-16 *** BathBed e e < 2e-16 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area, BedBath, Envi, TranEnvi, SchoolRating, Age 2 Residual standard error on 2997 degrees of freedom (2087 observations deleted due to missingness) R-Square Multiple R-squared 0.543, Adjusted R-squared F-statistic on 12 and 2997 DF, p-value < 2.2e-16 Correlation of Coefficients Bath (Intercept) Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio

74 SchoolDist SchoolRating BathBed Envi TranEnvi SchoolTSRatio SchoolDist SchoolRating Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio SchoolDist SchoolRating BathBed Intercept and Coefficients > model10$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio e e e e e+03 SchoolDist SchoolRating BathBed e e e+04 > anova(model10) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e < 2.2e-16 *** Bed e e Sqft_Area Lot_Area e e < 2.2e-16 *** e e e-08 *** AgeSquare e e < 2.2e-16 *** ChangeCrime Envi e e e-13 *** e e < 2.2e-16 *** TranEnvi e e e-12 *** SchoolTSRatio SchoolDist e e e e < 2.2e-16 *** SchoolRating e e < 2.2e-16 *** BathBed Residuals e e < 2.2e-16 *** e e Signif. codes 0 *** ** 0.01 * > confint(model10, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+05 Bed Sqft_Area e e e e+02 Lot_Area e e-02 AgeSquare ChangeCrime e e e e+04 Envi e e+04 TranEnvi e e+04 SchoolTSRatio e e+03 SchoolDist e e+04 SchoolRating BathBed e e e e+04 Correlation > cor(prediction10,hou_test$price, use="complete") [1] RMSE > rmse(prediction10,hou_test$price) [1]

Diagnostic Plots

Next, add Zipcode factors.

11. Model 11
Test with all zipcode factors and then drop the non-significant ones. The good predictors are then added to the model above.

Step 1: Test all Zipcode factors
Independent Variable Equation: MedIncome+as.numeric(Rank)+MedAge+Population+College.Graduates

> summary(model, corr=T)
Call: lm(formula = Price ~ MedIncome + as.numeric(Rank) + MedAge + Population + College.Graduates, data = hou_train)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.764e e < 2e-16 ***
MedIncome e e e-07 ***
as.numeric(Rank) e e e-11 ***
MedAge e e
Population e e < 2e-16 ***
College.Graduates e e **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error on 5091 degrees of freedom
R-Square
Multiple R-squared, Adjusted R-squared
F-statistic 421 on 5 and 5091 DF, p-value < 2.2e-16

76 Correlation of Coefficients (Intercept) MedIncome as.numeric(rank) MedAge Population MedIncome as.numeric(rank) MedAge Population College.Graduates Drop Median Age and use the rest. Step 2 Add significant Zipcode factors, Independent Variable Equation Bath+BathBed+Bed+Sqft_Area+Lot_Area+AgeSquare+ChangeCrime+Envi+TranEnvi+SchoolTSRatio+SchoolDist +SchoolRating+MedIncome+as.numeric(Rank)+Population+College.Graduates > summary(model11, corr=t) Call lm(formula = Price ~ Bath + BathBed + Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi + TranEnvi + SchoolTSRatio + SchoolDist + SchoolRating + MedIncome + as.numeric(rank) + Population + College.Graduates, data = hou_train) Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) 1.023e e *** Bath 1.114e e < 2e-16 *** Bed Sqft_Area 6.186e e < 2e-16 *** 1.932e e < 2e-16 *** Lot_Area e e * AgeSquare ChangeCrime 3.778e e e e ** e-09 *** Envi 1.316e e < 2e-16 *** TranEnvi SchoolTSRatio e e * e e SchoolDist 3.200e e SchoolRating MedIncome 9.472e e e-06 *** e e < 2e-16 *** as.numeric(rank) e e e-06 *** Population e e < 2e-16 *** College.Graduates e e BathBed e e < 2e-16 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area, Age 2, Envi, ChangeCrime, TranEnvi, SchoolRating, MedI ncome, Rank, Population, BedBath Residual standard error on 2993 degrees of freedom (2087 observations deleted due to missingness) R-Square Multiple R-squared , Adjusted R-squared F-statistic on 16 and 2993 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Lot_Area AgeSquare Bath Bed Sqft_Area Lot_Area

77 AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio SchoolDist SchoolRating MedIncome as.numeric(rank) Population College.Graduates BathBed ChangeCrime Envi TranEnvi SchoolTSRatio SchoolDist Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio SchoolDist SchoolRating MedIncome as.numeric(rank) Population College.Graduates BathBed SchoolRating MedIncome as.numeric(rank) Population Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio SchoolDist SchoolRating MedIncome 0.13 as.numeric(rank) Population 0.11 College.Graduates BathBed College.Graduates Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio SchoolDist SchoolRating MedIncome as.numeric(rank) Population College.Graduates BathBed 0.02 Intercept and Coefficients > model11$coefficients (Intercept) Bath Bed Sqft_Area e e e e+02 Lot_Area e-01 AgeSquare e+00 ChangeCrime e+04 Envi e+04 TranEnvi SchoolTSRatio SchoolDist SchoolRating

e e e e+03
MedIncome as.numeric(Rank) Population College.Graduates
e e e e+03
BathBed e+04

> anova(model11)
Analysis of Variance Table
Response: Price
Df Sum Sq Mean Sq F value Pr(>F)
Bath e e < 2.2e-16 ***
Bed e e
Sqft_Area e e < 2.2e-16 ***
Lot_Area e e e-09 ***
AgeSquare e e < 2.2e-16 ***
ChangeCrime e e e-15 ***
Envi e e < 2.2e-16 ***
TranEnvi e e e-13 ***
SchoolTSRatio e e
SchoolDist e e < 2.2e-16 ***
SchoolRating e e < 2.2e-16 ***
MedIncome e e **
as.numeric(Rank) e e < 2.2e-16 ***
Population e e < 2.2e-16 ***
College.Graduates e e
BathBed e e < 2.2e-16 ***
Residuals e e
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> confint(model11, level=0.95) # CIs for model parameters
2.5 % 97.5 %
(Intercept) e e+06
Bath e e
Bed e e+04
Sqft_Area e e+02
Lot_Area e e
AgeSquare e e+00
ChangeCrime e e+04
Envi e e
TranEnvi e e+04
SchoolTSRatio e e+03
SchoolDist e e
SchoolRating e e+04
MedIncome e e+00
as.numeric(Rank) e e+04
Population e e+00
College.Graduates e e+01
BathBed e e+04

Correlation
> cor(prediction11, hou_test$Price, use="complete")
[1]

RMSE
> rmse(prediction11, hou_test$Price)
[1]

(Please note: Rank, the education rank of each zipcode, spans a wide range of values. Because this ordinal variable has so many levels, it is treated as a continuous scale.)

Diagnostic Plots

12. Model 12
Create a model from the variables with high correlation to Price (r > 0.2):

# house parameters
cor(hou$Price, hou$Bath, use="complete") #
cor(hou$Price, hou$Bed, use="complete") #
cor(hou$Price, hou$Sqft_Area, use="complete") #
cor(hou$Price, hou$Lot_Area, use="complete") #
cor(hou$Price, hou$RevAgeSquare, use="complete") #
# External Environment
cor(hou$Price, hou$ChangeCrime, use="complete") #
cor(hou$Price, hou$Envi, use="complete") #
cor(hou$Price, hou$TranEnvi, use="complete") #
cor(hou$Price, hou$TransDist, use="complete") #
# Parameters by community (zipcode)
cor(hou$Price, as.numeric(hou$Rank), use="complete") #
cor(hou$Price, hou$Population, use="complete") #
cor(hou$Price, hou$College.Graduates, use="complete") #
cor(hou$Price, hou$MedAge, use="complete") #
# School parameters
cor(hou$Price, hou$School) #
cor(hou$Price, hou$SchoolDist) #
cor(hou$Price, hou$SchoolTSRatio) #
cor(hou$Price, hou$SchoolRating, use="complete") #
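Model 12's feature screen keeps only the variables whose correlation with Price exceeds 0.2 in absolute value. A sketch of that filter, with hypothetical feature columns rather than the report's data:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def screen(features, price, threshold=0.2):
    """Keep the names of features with |r| > threshold against Price."""
    return [name for name, col in features.items()
            if abs(pearson(col, price)) > threshold]
```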

80 Independent Variables Equation Bath+Sqft_Area+ChangeCrime+Envi+TransDist+as.numeric(Rank)+College.Graduates+SchoolRating > summary(model, corr=t) Call lm(formula = Price ~ Bath + Sqft_Area + ChangeCrime + Envi + TransDist + as.numeric(rank) + College.Graduates + SchoolRating, data = hou_train) Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e e-06 *** Bath Sqft_Area 3.546e e e e e-07 *** ChangeCrime e e Envi TransDist 1.896e e *** e e as.numeric(rank) 2.125e e e-06 *** College.Graduates 1.010e e e-06 *** SchoolRating e e Signif. codes 0 *** ** 0.01 * Predictive features Sqft_Area, Envi, Rank, College.Graduates Residual standard error on 104 degrees of freedom (4984 observations deleted due to missingness) R-Square Multiple R-squared 0.516, Adjusted R-squared F-statistic on 8 and 104 DF, p-value 1.556e-13 Correlation of Coefficients (Intercept) Bath Sqft_Area ChangeCrime Envi TransDist Bath Sqft_Area ChangeCrime Envi TransDist as.numeric(rank) College.Graduates SchoolRating as.numeric(rank) College.Graduates Bath Sqft_Area ChangeCrime Envi TransDist as.numeric(rank) College.Graduates 1.00 SchoolRating Intercept and Coefficients > model$coefficients (Intercept) Bath Sqft_Area ChangeCrime Envi TransDist as.numeric(rank) College.Graduates SchoolRating > anova(model) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e e-05 ***

81 Sqft_Area e e e-08 *** ChangeCrime Envi e e *** e e *** TransDist e e as.numeric(rank) College.Graduates e e e e e-06 *** SchoolRating e e Residuals e e+10 Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) Bath e e Sqft_Area e ChangeCrime Envi e e TransDist e as.numeric(rank) e+05 College.Graduates e SchoolRating e Correlation > cor(prediction,hou_test$price, use="complete") [1] RMSE > rmse(prediction,hou_test$price) [1] Diagnostic Plots

Best Model (Model 11)

Price = (Intercept) + e+05*Bath + e+04*Bed + e+02*Sqft_Area + e-01*Lot_Area + e+00*Age + e+04*ChangeCrime + e+04*Envi + e+04*TranEnvi + e+03*SchoolRating + e+00*MedIncome + e+04*Rank + e+00*Population + e+04*BedBath

c. Regularization (multiple regression models)

Model 1: Price~Bath+Bed+Sqft_Area+Lot_Area

Lasso
> coef(fit)
5 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
Bed
Bath
Sqft_Area
Lot_Area
> fit$lambda.min
[1]
> z <- as.matrix(hou_test[, c('Bed', 'Bath', 'Sqft_Area', 'Lot_Area')])
> prediction <- predict(fit, newx=z)
> t <- as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse = fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
5 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
Bed
Bath
Sqft_Area
Lot_Area
> fit$lambda.min
[1]
> z <- as.matrix(hou_test[, c('Bed', 'Bath', 'Sqft_Area', 'Lot_Area')])
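The glmnet fits in this section select the penalty lambda.min by cross-validation. Ridge regression in particular has a closed form, b = (X'X + λI)⁻¹X'y; a dependency-free sketch of that formula (no intercept, for brevity, on tiny hypothetical data) shows how a larger λ shrinks the coefficients:

```python
def ridge(X, y, lam):
    """Ridge coefficients via the penalized normal equations (no intercept, for brevity)."""
    k = len(X[0])
    # Augmented system [X'X + lam*I | X'y]
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0) for j in range(k)] +
         [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):  # Gaussian elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [0.0] * k
    for c in reversed(range(k)):  # back substitution
        b[c] = (A[c][k] - sum(A[c][j] * b[j] for j in range(c + 1, k))) / A[c][c]
    return b
```

With λ = 0 this reduces to ordinary least squares; as λ grows, the coefficients are pulled toward zero, which is the trade-off cv.glmnet tunes when it picks lambda.min.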

83 > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse) > rmse [1] Model 2 Price~Bath+Bed+BathBed+Sqft_Area+Lot_Area Lasso > coef(fit) 7 x 1 sparse Matrix of class "dgcmatrix" 1 (Intercept) (Intercept). Bath. Bed. Sqft_Area Lot_Area. BathBed. > fit$lambda.min [1] > z<-model.matrix(f, data=hou_test) > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse) > rmse [1]
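The "." entries in the lasso coef(fit) output are coefficients driven exactly to zero, which is how lasso does variable selection. This comes from the soft-threshold operator used in lasso's coordinate-descent update; a minimal sketch:

```python
def soft_threshold(z, lam):
    """Lasso's shrinkage operator: values within [-lam, lam] become exactly zero,
    which is why the sparse coef(fit) matrix prints '.' for dropped variables."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0
```

Ridge, by contrast, only shrinks coefficients toward zero without zeroing them out, which is why its coef(fit) output keeps every variable.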

84 Ridge > coef(fit) 7 x 1 sparse Matrix of class "dgcmatrix" 1 (Intercept) (Intercept). Bath Bed Sqft_Area Lot_Area BathBed > fit$lambda.min [1] > z<-model.matrix(f, data=hou_test) > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse) > rmse [1] Model 3 Price~Bath*Bed+Sqft_Area+Lot_Area+Age

85 Lasso > coef(fit) 8 x 1 sparse Matrix of class "dgcmatrix" 1 (Intercept) (Intercept). Bath Bed Sqft_Area Lot_Area Age BathBed > fit$lambda.min [1] > z<-model.matrix(f, data=hou_test) > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse) > rmse [1] Ridge > coef(fit) 8 x 1 sparse Matrix of class "dgcmatrix" 1 (Intercept) e+05 (Intercept). Bath e+04 Bed e+04 Sqft_Area e+02 Lot_Area e-01 Age e+03 BathBed e+02 > fit$lambda.min [1] > z<-model.matrix(f, data=hou_test) > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse)

86 > rmse [1] Model 4 Price~Bath*Bed+Sqft_Area+Lot_Area+Age+I(Age^2) Lasso > coef(fit) 9 x 1 sparse Matrix of class "dgcmatrix" 1 (Intercept) (Intercept). Bath Bed. Sqft_Area Lot_Area. Age. I(Age^2) BathBed. > fit$lambda.min [1] > z<-model.matrix(f, data=hou_test) > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse) > rmse [1]

87 Ridge > coef(fit) 9 x 1 sparse Matrix of class "dgcmatrix" 1 (Intercept) e+04 (Intercept). Bath e+04 Bed e+03 Sqft_Area e+02 Lot_Area e-01 Age e+02 I(Age^2) e+01 BathBed e+02 > fit$lambda.min [1] > z<-model.matrix(f, data=hou_test) > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse) > rmse [1]

Model 5: Price ~ Bath*Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2)

Lasso
> coef(fit)
9 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)   .
Bath
Bed
Sqft_Area
Lot_Area      .
RevAge
I(RevAge^2)
BathBed       .
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
9 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)   e+04
(Intercept)   .
Bath          e+04
Bed           e+04
Sqft_Area     e+02
Lot_Area      e-01
RevAge        e+04
I(RevAge^2)   e+03
BathBed       e+02
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 6: Price ~ Bath*Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2) + Crime

Lasso
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)   .
Bath
Bed           .
Sqft_Area
Lot_Area      .
RevAge
I(RevAge^2)   .
Crime
BathBed       .
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)   e+04
(Intercept)   .
Bath          e+04
Bed           e+03
Sqft_Area     e+02
Lot_Area      e-01
RevAge        e+04
I(RevAge^2)   e+03
Crime         e+01
BathBed       e+02
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 7: Price ~ Bath*Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2) + ChangeCrime + Envi

Lasso
> coef(fit)
11 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)   .
Bath
Bed
Sqft_Area
Lot_Area      .
RevAge
I(RevAge^2)   .
ChangeCrime
Envi
BathBed       .
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
11 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)   e+03
(Intercept)   .
Bath          e+04
Bed           e+03
Sqft_Area     e+02
Lot_Area      e-01
RevAge        e+04
I(RevAge^2)   e+02
ChangeCrime   e+04
Envi          e+04
BathBed       e+02
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 8: Price ~ Bath*Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi

Lasso
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)   .
Bath
Bed           .
Sqft_Area
Lot_Area      .
AgeSquare
ChangeCrime
Envi
BathBed       .
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)   e+04
(Intercept)   .
Bath          e+04
Bed           e+03
Sqft_Area     e+02
Lot_Area      e-01
AgeSquare     e+01
ChangeCrime   e+04
Envi          e+04
BathBed       e+02
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 9: Price ~ Bath + BathBed + Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi + TranEnvi

Lasso
> coef(fit)
11 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)   .
Bath
Bed           .
Sqft_Area
Lot_Area      .
AgeSquare
ChangeCrime
Envi
TranEnvi      .
BathBed       .
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
11 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)   e+04
(Intercept)   .
Bath          e+04
Bed           e+04
Sqft_Area     e+02
Lot_Area      e-01
AgeSquare     e+01
ChangeCrime   e+04
Envi          e+04
TranEnvi      e+02
BathBed       e+02
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 10: Price ~ Bath + BathBed + Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi + TranEnvi + SchoolTSRatio + SchoolDist + SchoolRating

Lasso
> coef(fit)
14 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)     e+05
(Intercept)     .
Bath            .
Bed             e+03
Sqft_Area       e+01
Lot_Area        e-01
AgeSquare       .
ChangeCrime     e+04
Envi            e+03
TranEnvi        .
SchoolTSRatio   .
SchoolDist      .
SchoolRating    .
BathBed         .
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
14 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)     e+05
(Intercept)     .
Bath            e+03
Bed             e+03
Sqft_Area       e+01
Lot_Area        e+00
AgeSquare       e-01
ChangeCrime     e+04
Envi            e+03
TranEnvi        .
SchoolTSRatio   e+03
SchoolDist      e+04
SchoolRating    e+03
BathBed         e+01
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 11: Price ~ Bath + BathBed + Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi + TranEnvi + SchoolTSRatio + SchoolDist + SchoolRating + MedIncome + as.numeric(Rank) + Population + College.Graduates

Lasso
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)         .
Bath                .
Sqft_Area
ChangeCrime
Envi
TransDist           .
as.numeric(Rank)    .
College.Graduates
SchoolRating        .
> fit$lambda.min
[1]
> # Predict
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> # correlation
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> # RMSE
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)         .
Bath
Sqft_Area
ChangeCrime
Envi
TransDist
as.numeric(Rank)
College.Graduates
SchoolRating
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 12: Price ~ Bath + Sqft_Area + ChangeCrime + Envi + TransDist + as.numeric(Rank) + College.Graduates + SchoolRating

Lasso
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)         .
Bath                .
Sqft_Area
ChangeCrime
Envi
TransDist           .
as.numeric(Rank)    .
College.Graduates
SchoolRating        .
> fit$lambda.min
[1]
> # Predict
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> # correlation
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> # RMSE
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)         .
Bath
Sqft_Area
ChangeCrime
Envi
TransDist
as.numeric(Rank)
College.Graduates
SchoolRating
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]
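Every Lasso and Ridge run above follows the same glmnet pattern. The following is a minimal sketch of that workflow, assuming the glmnet package is installed and that hou_train and hou_test contain the columns named in the formula; the formula shown is just one of the model formulas above, and any other can be substituted for it.

```r
library(glmnet)

# One of the model formulas above; substitute any other as needed.
f <- Price ~ Bath * Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2)

# Fix the seed so the cross-validation folds are reproducible; the report
# repeats this with four different seeds and averages the results.
set.seed(1)

x <- model.matrix(f, data = hou_train)
y <- hou_train$Price

# alpha = 1 gives the Lasso; alpha = 0 gives Ridge.
fit <- cv.glmnet(x, y, alpha = 1)

coef(fit, s = "lambda.min")                  # coefficients at the best lambda
z <- model.matrix(f, data = hou_test)
prediction <- predict(fit, newx = z, s = "lambda.min")

cor(prediction, hou_test$Price)              # test-set correlation
sqrt(fit$cvm[fit$lambda == fit$lambda.min])  # cross-validated RMSE
```

The RMSE here is computed from the cross-validation MSE at lambda.min, which is the same quantity reported after each model above.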

Regularization did not improve the models. It lowers the correlation considerably, although the RMSE value decreases at the same time, which is a good sign. We looked mainly at the Lasso results, since Ridge usually ends up keeping many variables, which does not lead to a good model.

d. Run multiple times
We ran the above models 4 times and then took the average of the following metrics. (All values are rounded to 4 decimal places, as we entered them manually for each run.) For four random seeds, these were the average values.

Linear Regression
Model 1    Adjusted R-Squared          Correlation          RMSE
Model 2    Adjusted R-Squared          Correlation          RMSE
Model 3    Adjusted R-Squared          Correlation          RMSE
Model 4    Adjusted R-Squared          Correlation          RMSE
Model 5    Adjusted R-Squared          Correlation          RMSE
Model 6    Adjusted R-Squared          Correlation          RMSE
Model 7    Adjusted R-Squared          Correlation          RMSE
Model 8    Adjusted R-Squared 0        Correlation          RMSE
Model 9    Adjusted R-Squared          Correlation          RMSE
Model 10   Adjusted R-Squared 0.02     Correlation          RMSE
Model 11   Adjusted R-Squared          Correlation          RMSE
Model 12   Adjusted R-Squared          Correlation          RMSE
Model 13   Adjusted R-Squared 0.02     Correlation          RMSE
Model 14   Adjusted R-Squared          Correlation          RMSE
Model 15   Adjusted R-Squared          Correlation          RMSE
Model 16   Adjusted R-Squared 0.       Correlation          RMSE
Model 17   Adjusted R-Squared          Correlation          RMSE

Multivariate Regression
Model 1    Adjusted R-Squared          Correlation          RMSE
Model 2    Adjusted R-Squared          Correlation          RMSE
Model 3    Adjusted R-Squared 0.551    Correlation          RMSE
Model 4    Adjusted R-Squared          Correlation          RMSE
Model 5    Adjusted R-Squared          Correlation          RMSE
Model 6    Adjusted R-Squared          Correlation          RMSE
Model 7    Adjusted R-Squared          Correlation          RMSE
Model 8    Adjusted R-Squared          Correlation          RMSE
Model 9    Adjusted R-Squared          Correlation          RMSE
Model 10   Adjusted R-Squared          Correlation          RMSE
Model 11   Adjusted R-Squared          Correlation          RMSE
Model 11   Adjusted R-Squared          Correlation          RMSE

Regularization Models
Model 1 - Lasso     Correlation          RMSE
Model 1 - Ridge     Correlation          RMSE
Model 2 - Lasso     Correlation          RMSE
Model 2 - Ridge     Correlation          RMSE
Model 3 - Lasso     Correlation          RMSE
Model 3 - Ridge     Correlation          RMSE
Model 4 - Lasso     Correlation          RMSE
Model 4 - Ridge     Correlation          RMSE
Model 5 - Lasso     Correlation          RMSE
Model 5 - Ridge     Correlation          RMSE
Model 6 - Lasso     Correlation          RMSE
Model 6 - Ridge     Correlation          RMSE
Model 7 - Lasso     Correlation          RMSE
Model 7 - Ridge     Correlation          RMSE
Model 8 - Lasso     Correlation          RMSE
Model 8 - Ridge     Correlation          RMSE
Model 9 - Lasso     Correlation          RMSE
Model 9 - Ridge     Correlation          RMSE
Model 10 - Lasso    Correlation 0.33     RMSE
Model 10 - Ridge    Correlation          RMSE
Model 11 - Lasso    Correlation          RMSE
Model 11 - Ridge    Correlation          RMSE
Model 12 - Lasso    Correlation          RMSE
Model 12 - Ridge    Correlation          RMSE

Q.2.
a. Logistic Regression
We have taken a different approach for logistic regression here. The glm() function we studied in class is for binary outcomes; since we are trying to predict house prices, dividing them into two classes is not a good approach, so we decided to use Ordered Logistic Regression. The scale attribute Price is therefore converted to an ordinal variable, Pricetag, with predefined intervals (please refer to the code for more details). The polr() function from the MASS package is used. (Please note: since we have 10 categories, the confusion matrix was too big to show here, but the code produces it.)

Model 1: Pricetag ~ Bath + Bed + Sqft_Area + scale(Lot_Area)

> summary(m)
Call:
polr(formula = Pricetag ~ Bath + Bed + Sqft_Area + scale(Lot_Area),
    data = hou_train, Hess = TRUE)
Coefficients:
                 Value  Std. Error  t value
Bath
Bed
Sqft_Area
scale(Lot_Area)

Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(66 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                 p value
Bath
Bed              e-06
Sqft_Area        e-227
scale(Lot_Area)

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, all of the above variables are significant.

Log-odds and odds ratios for a unit change:
> # odds ratios
> exp(coef(m))
> # odds ratios and CIs
> exp(cbind(OR = coef(m), ci))

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction: the crosstab of predicted values was too big to display here, since we have 10 categories of Price. We ran the following code for it:

sf <- function(y) {
  c('Y>=1'  = qlogis(mean(y >= 1)),
    'Y>=2'  = qlogis(mean(y >= 2)),
    'Y>=3'  = qlogis(mean(y >= 3)),
    'Y>=4'  = qlogis(mean(y >= 4)),
    'Y>=5'  = qlogis(mean(y >= 5)),
    'Y>=6'  = qlogis(mean(y >= 6)),
    'Y>=7'  = qlogis(mean(y >= 7)),
    'Y>=8'  = qlogis(mean(y >= 8)),
    'Y>=9'  = qlogis(mean(y >= 9)),
    'Y>=10' = qlogis(mean(y >= 10)))
}
(s <- with(hou_test, summary(as.numeric(Pricetag) ~ Bath + Bed + Sqft_Area + Lot_Area, fun = sf)))

> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Model 2: Pricetag ~ Bath + Bed + BathBed + Sqft_Area + scale(Lot_Area)

> summary(m)
Call:
polr(formula = Pricetag ~ Bath + Bed + BathBed + Sqft_Area + scale(Lot_Area),
    data = hou_train, Hess = TRUE)
Coefficients:
                 Value  Std. Error  t value
Bath
Bed
Sqft_Area
scale(Lot_Area)
BathBed
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(66 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                 p value
Bath
Bed              e-16
Sqft_Area        e-235
scale(Lot_Area)  e-22
BathBed

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, all of the above variables are significant.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Model 3: Pricetag ~ Bath*Bed + Sqft_Area + scale(Lot_Area) + Age

> summary(m)
Call:
polr(formula = Pricetag ~ Bath * Bed + Sqft_Area + scale(Lot_Area) + Age,
    data = hou_train, Hess = TRUE)
Coefficients:
                 Value  Std. Error  t value
Bath
Bed
Sqft_Area
scale(Lot_Area)
Age
BathBed
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(69 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                 p value
Bath             e-78
Bed
Sqft_Area        e-169
scale(Lot_Area)  e-20
Age
BathBed

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, all of the above variables are significant.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Model 4: Pricetag ~ Bath*Bed + Sqft_Area + scale(Lot_Area) + scale(Age) + scale(AgeSquare)

> summary(m)
Call:
polr(formula = Pricetag ~ Bath * Bed + Sqft_Area + scale(Lot_Area) +
    scale(Age) + scale(AgeSquare), data = hou_train, Hess = TRUE)
Coefficients:
                   Value  Std. Error  t value
Bath
Bed
Sqft_Area
scale(Lot_Area)
scale(Age)
scale(AgeSquare)
BathBed
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(59 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                   p value
Bath               e-51
Bed
Sqft_Area          e-194
scale(Lot_Area)    e-19
scale(Age)         e-189
scale(AgeSquare)   e+00
BathBed

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
                   2.5 %  97.5 %
Bath
Bed
Sqft_Area
scale(Lot_Area)
scale(Age)            NA      NA
scale(AgeSquare)              NA
BathBed
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, all of the above variables are significant.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Model 5: Pricetag ~ Bath*Bed + Sqft_Area + scale(Lot_Area) + scale(RevAge) + scale(I(RevAge^2))

> summary(m)
Call:
polr(formula = Pricetag ~ Bath * Bed + Sqft_Area + scale(Lot_Area) +
    scale(RevAge) + scale(I(RevAge^2)), data = hou_train, Hess = TRUE)
Coefficients:
                     Value  Std. Error  t value
Bath
Bed
Sqft_Area
scale(Lot_Area)
scale(RevAge)
scale(I(RevAge^2))
BathBed
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(59 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                     p value
Bath                 e-49
Bed
Sqft_Area            e-184
scale(Lot_Area)      e-19
scale(RevAge)        e-79
scale(I(RevAge^2))   e-21
BathBed

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, all of the above variables are significant.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Changing the Age variable made no change at all.

Model 6: Pricetag ~ BathBed + Bed + scale(Sqft_Area) + scale(Lot_Area) + scale(I(Age^2)) + Crime
(We also scaled Sqft_Area in this model.)

> summary(m)
Call:
polr(formula = Pricetag ~ BathBed + Bed + scale(Sqft_Area) + scale(Lot_Area) +
    scale(I(Age^2)) + Crime, data = hou_train, Hess = TRUE)
Coefficients:
                   Value  Std. Error  t value
Bed
scale(Sqft_Area)
scale(Lot_Area)
scale(I(Age^2))
Crime
BathBed
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(59 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                   p value
Bed                e-05
scale(Sqft_Area)   e+00
scale(Lot_Area)    e-20
scale(I(Age^2))    e-27
Crime              e-41
BathBed

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, the BathBed interaction needs to be dropped.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by 6.96, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Model 7: Pricetag ~ Bath + BathBed + Bed + Sqft_Area + scale(Lot_Area) + AgeSquare + ChangeCrime + Envi + TranEnvi

> summary(m)
Call:
polr(formula = Pricetag ~ Bath + BathBed + Bed + Sqft_Area + scale(Lot_Area) +
    AgeSquare + ChangeCrime + Envi + TranEnvi, data = hou_train, Hess = TRUE)
Coefficients:
                 Value    Std. Error  t value
Bath             6.237e
Bed
Sqft_Area        1.003e
scale(Lot_Area)
AgeSquare
ChangeCrime      8.389e
Envi             1.089e
TranEnvi
BathBed
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(59 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                 p value
Bath
Bed              e-05
Sqft_Area        e-169
scale(Lot_Area)  e-02
AgeSquare        e-04
ChangeCrime      e-34
Envi
TranEnvi
BathBed

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
                 2.5 %  97.5 %
Bath                    e-01
Bed                     e-01
Sqft_Area               e-03
scale(Lot_Area)         e-01
AgeSquare               e-05
ChangeCrime
Envi                    e-01
TranEnvi            NA      NA
BathBed                 e-02
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, all variables are significant.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))
                 OR  2.5 %  97.5 %
Bath
Bed
Sqft_Area
scale(Lot_Area)
AgeSquare
ChangeCrime
Envi
TranEnvi                NA      NA
BathBed

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Model 8 (model of the parameters highly correlated with Price):
Pricetag ~ Bath + Bed + Sqft_Area + ChangeCrime + Envi + TransDist + as.numeric(Rank) + scale(College.Graduates) + scale(SchoolRating)

> summary(m)
Call:
polr(formula = Pricetag ~ Bath + Bed + Sqft_Area + ChangeCrime + Envi +
    TransDist + as.numeric(Rank) + College.Graduates + scale(SchoolRating),
    data = hou_train, Hess = TRUE)
Coefficients:
                     Value  Std. Error  t value
Bath
Bed
Sqft_Area
ChangeCrime
Envi
TransDist
as.numeric(Rank)
College.Graduates
scale(SchoolRating)
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(4994 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                     p value
Bath
Bed                  e-01
Sqft_Area            e-03
ChangeCrime
Envi                 e-11
TransDist            e+00
as.numeric(Rank)
College.Graduates    e+00
scale(SchoolRating)

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
                     2.5 %  97.5 %
Bath
Bed
Sqft_Area
ChangeCrime             NA      NA
Envi
TransDist               NA      NA
as.numeric(Rank)        NA      NA
College.Graduates       NA      NA
scale(SchoolRating)
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, Sqft_Area, ChangeCrime, Envi, TransDist, Rank and College.Graduates are significant.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))
                     OR  2.5 %  97.5 %
Bath
Bed
Sqft_Area
ChangeCrime             NA      NA
Envi
TransDist               NA      NA
as.numeric(Rank)        NA      NA
College.Graduates       NA      NA
scale(SchoolRating)

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Conclusion: the logistic regression models mark almost all variables as significant. However, this approach is not right for our dataset. First, we are aiming to predict the price of a house, which is a scale variable; to perform logistic regression we divided it into categories. Despite the conversion to an ordinal variable, the range of each category is still very large, and as a result nearly every attribute appears significant. Although the correlation is good for some models, this method is not suitable for our research question.
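The polr workflow repeated for each model above can be sketched end to end. This is a sketch only, assuming the MASS package; the 10-interval binning of Price into Pricetag shown with cut() is illustrative, since the report's exact breakpoints are defined in the project code, not reproduced here. The p-value computation from the t statistics is the standard approach for polr, which (as noted above) does not report p-values itself.

```r
library(MASS)

# Illustrative binning only: the report uses predefined intervals.
hou_train$Pricetag <- cut(hou_train$Price, breaks = 10, ordered_result = TRUE)

m <- polr(Pricetag ~ Bath + Bed + Sqft_Area + scale(Lot_Area),
          data = hou_train, Hess = TRUE)

ctable <- coef(summary(m))
# polr reports no p-values; derive them from the t statistics:
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))

ci <- confint(m)               # profiled confidence intervals
confint.default(m)             # CIs assuming normality
exp(cbind(OR = coef(m), ci))   # odds ratios with their CIs
```

An odds ratio for Sqft_Area of, say, r means the odds of landing in a higher price category are multiplied by r for each additional square foot, holding the other predictors constant, which is exactly the interpretation spelled out after each model above.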

b. Naïve Bayes

1. Model 1: classifier with Bath, Bed, Sqft_Area, Lot_Area, Age, Postal
   Precision =
2. Model 2: classifier with Bath, Bed, Sqft_Area, Lot_Area, Age, WaterEnvi, TransEnvi
   Precision =
3. Model 3: classifier with Bath, Bed, Sqft_Area, Lot_Area, Age, WaterEnvi, TransEnvi, School, SchoolRating, SchoolType
   Precision =
4. Model 4: classifier with Bath, Bed, Sqft_Area, Lot_Area, Age, WaterEnvi, TransEnvi, School, SchoolRating, SchoolType, Population, EducationLevel
   Precision =
5. Model 5: Model 4 with the Laplace estimator
   Precision =
6. Model 6: Model 4 with a kernel
   Precision =

Plot of Model 6 (figure not reproduced here)

7. Model 7: Model 5 with a kernel
   Precision =

N-fold validation for Model 5:
accuracies
mean(accuracies) =

Summary
1. The Laplace estimator changes nothing in the model.
2. Applying a kernel makes the prediction results better.
3. Among the Naïve Bayes models, the one using the inner property variables and Zipcode performs best.
4. From the plot we can see that Bath and the area of the house are the relatively better variables for defining the price class.

Classification (NB) models compared: model 1 (inner property and Zipcode), model 2 (add Envi), model 3 (add School), model 4 (all variables), model 5 (add Laplace), model 6 (add kernel), model 7 (kernel and Laplace).
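A Naïve Bayes classifier with both the Laplace estimator and a kernel density option, as used in Models 5 through 7, can be sketched as follows. The klaR package is an assumption here (its NaiveBayes() exposes both fL for Laplace smoothing and usekernel for kernel density estimates); the report does not name the package it used, and e1071's naiveBayes() would be an alternative for the Laplace-only variants.

```r
library(klaR)   # assumed package: NaiveBayes() with fL and usekernel options

nb <- NaiveBayes(Pricetag ~ Bath + Bed + Sqft_Area + Lot_Area + Age + Postal,
                 data = hou_train,
                 fL = 1,            # Laplace estimator (Model 5 / Model 7)
                 usekernel = TRUE)  # kernel densities instead of Gaussian (Models 6 and 7)

pred <- predict(nb, hou_test)
# Confusion matrix; per-class precision is read from its columns.
table(predicted = pred$class, actual = hou_test$Pricetag)
plot(nb)   # class-conditional density plots, as shown for Model 6 above
```

For the N-fold validation of Model 5, the same fit and predict steps would be repeated over fold indices, collecting one accuracy per fold and averaging them with mean(accuracies).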

Q.3.
a. Result of splitting training and testing data
> prop.table(table(hou_train$Pricetag))    # training
> prop.table(table(hou_test$Pricetag))     # testing

b. Classify model: Decision Tree
# Bath, Bed, Postal, Sqft_Area, Lot_Area, Envi, Age, Crime
hou_model<-C5.0(Pricetag~Bath+Bed+Postal+Sqft_Area+Lot_Area+Envi+Age+Crime, data=hou_train, na.action=na.pass)

C5.0 [Release 2.07 GPL Edition]    Mon Oct
Class specified by attribute `outcome'
Read 5307 cases (9 attributes) from undefined.data

Decision tree:
Postal in {98059,98101,98102,98103,98105,98107,98109,98112,98115,98117,98119,98122,98136,98144,98146,98164,98177,98199}
...Sqft_Area > (1553/157.5) Sqft_Area <= Postal in {98059,98164} 5 (0) Postal in {98136,98144,98146,98177}...Sqft_Area <= Crime > Age <= (9/1) Age > Lot_Area <= (2/1) Lot_Area > Sqft_Area <= (3/1) Sqft_Area > (2) Crime <= Bed > 4 2 (17/6) Bed <= 4...Postal = Crime <= (8/1) Crime > (5/2) Postal = Age <= 67...Lot_Area <= (2) Lot_Area > (2) Age > 67...Age <= 74 3 (5/1) Age > 74 2 (2) Postal = Crime <= (3/1) Crime > Lot_Area > (3/1) Lot_Area <= Lot_Area <= (3/1)

130 Lot_Area > (10/2) Postal = Envi > 4 4 (2/1) Envi <= 4...Crime <= (19/8) Crime > Age <= 69 3 (3/1) Age > 69...Lot_Area <= (4/1) Lot_Area > (2) Sqft_Area > Bed > 6...Bath <= (2/1) Bath > (5) Bed <= 6...Postal = (25/7) Postal in {98136,98144,98177}...Crime <= (5/1) Crime > Bed > 5...Age > (3) Age <= Crime <= (16/7) Crime > Crime <= Lot_Area <= (4/1) Lot_Area > (2) Crime > Envi <= 3 2 (2/1) Envi > 3 4 (7/2) Bed <= 5...Sqft_Area > Age > 99 3 (9/1) Age <= 99...Postal = Envi <= 3 3 (4) Envi > 3 5 (19/11) Postal = Age <= 70 4 (13/5) Age > 70 3 (4/1) Postal = Envi > 10 5 (3) Envi <= 10...Crime > (8/1) Crime <= Envi <= 6 3 (4/1) Sqft_Area <= 1560 Envi > 6 4 (5/1)...Postal = Bath <= (17/5) Bath > (2/1) Postal = Age <= 15 3 (9) Age > 15...Bath > (6) Bath <= Bed <= 4...Bath > (3) Bath <= Envi <= 8 3 (2) Envi > 8 4 (3/1) Bed > 4...Bath > (2) Bath <= Sqft_Area <= (2) Sqft_Area > (6/2) Postal = Lot_Area > Envi <= 3 2 (2/1) Envi > 3 5 (3) Lot_Area <= Bed <= 4

131 ...Bath > (7/1) Bath <= Age <= 82 4 (2) Age > 82...Sqft_Area <= (2) Sqft_Area > (2) Bed > 4...Lot_Area > (6/1) Lot_Area <= Sqft_Area <= Lot_Area <= (7) Lot_Area > 4019 [S1] Sqft_Area > Envi <= 3 4 (3) Envi > 3 [S2] Postal in {98101,98102,98103,98105,98107,98109,98112,98115,98117,98119, 98122,98199}...Sqft_Area > (535.5/179.2) Sqft_Area <= Sqft_Area <= Postal in {98101,98117,98119} 3 (13/5) Postal in {98102,98109} 2 (5/2) Postal in {98105,98107,98112,98199} 5 (14/3) Postal = Lot_Area <= (2) Lot_Area > (5) Postal = Envi <= 7 3 (4/1) Envi > 7 1 (2/1) Postal = Envi <= 7 3 (5) Envi > 7...Envi <= 8 2 (2) Envi > 8 1 (2) Sqft_Area > Postal in {98101,98102,98105,98109,98112}...Sqft_Area <= Age <= 45...Crime <= (6/1) Crime > (6/1) Age > 45...Bath > (6.1/2) Bath <= Age <= 90 5 (23/5) Age > 90 4 (17/8) Sqft_Area > Bed <= 4...Envi > 6 5 (20) Envi <= 6...Sqft_Area <= (3) Sqft_Area > (2) Bed > 4...Bath > (17.1/1) Bath <= Lot_Area <= (5/1) Lot_Area > (6/1) Postal in {98103,98107,98115,98117,98119,98122,98199}...Bed > 5 5 (37/17) Bed <= 5...Bed <= 2...Bath <= (14.2/4.2) Bath > Postal in {98103,98107,98115,98117,98119, 98199} 3 (2) Postal = (2) Bed > 2...Bath > Crime > (9/1) Crime <= Age > 8...Envi <= 3 3 (4/1) Envi > 3 4 (3/1)

Decision Tree Output

The full C4.5 decision tree spans several pages and is heavily garbled in this transcription, so only its overall shape is summarized here. The tree splits first on Postal, and within each postal-code group it splits further on Sqft_Area, Lot_Area, Crime, Envi, Age, Bed and Bath. Each leaf assigns a house to one of the five price classes (1-5), with the number of training cases reaching that leaf (and the number misclassified) shown in parentheses; deeper branches are factored out into subtrees labelled [S1]-[S21].

Evaluation on training data (5307 cases): the decision tree misclassifies 20.7% of the cases (the tree-size figure is lost in this transcription).
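For readers unfamiliar with how C4.5-style trees choose splits such as `Age <= 1` or `Crime <= 803`, the sketch below shows the information-gain criterion that drives each split. This is a purely illustrative Python sketch on made-up data, not the report's actual model (which was produced by a C4.5 implementation in R):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, split):
    """Entropy reduction from partitioning rows by split(row) -> True/False."""
    left = [lab for row, lab in zip(rows, labels) if split(row)]
    right = [lab for row, lab in zip(rows, labels) if not split(row)]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# Toy example (hypothetical data, not the report's): splitting on Age <= 1
# cleanly separates price class 3 from price class 4.
rows = [{"Age": 0.5}, {"Age": 0.8}, {"Age": 3}, {"Age": 10}]
labels = [3, 3, 4, 4]
gain = information_gain(rows, labels, lambda r: r["Age"] <= 1)
print(round(gain, 3))  # → 1.0 (a perfect split recovers the full entropy)
```

At each node, C4.5 evaluates candidate thresholds for every attribute and keeps the split with the best gain (C4.5 actually normalizes this to gain ratio), which is why heavily used attributes like Postal and Sqft_Area dominate the top of the tree.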

Interpretation of some of the resulting if-then rules

To illustrate, we explain one branch of the tree, covering houses whose Postal code is 98106, 98108, 98118 or 98178. If Age is greater than 1, the tree continues in another branch. If Age is at most 1 and the Crime value is less than 803, the price belongs to the 3rd class. If Age is at most 1 and Crime is greater than 803, the tree then checks the area of the house: if Sqft_Area is less than 1980, the price belongs to the 4th class; if it is between 1980 and 2395, the price belongs to the 5th class; and if it is greater than 2395, the price belongs to the 4th class again.

Comparison of training and testing results

We compare the training and testing results using the confusion matrix and the percentage of correctly classified instances. [The confusion matrices for the training dataset and for the R output are garbled in this transcription; only the row/column labels (a)-(e), corresponding to price classes 1-5, survive.]

Attribute usage: Postal 99.96%, Sqft_Area 46.15%, Crime 45.92%, Bed 41.06%, Lot_Area 40.04%, Envi 35.80%, Age 33.86%, Bath (value lost in transcription). Time: 0.1 secs.
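The interpreted branch can be written out as a small rule function. This is an illustrative Python sketch, not the report's R code: the postal-code set and thresholds are those quoted in the interpretation, boundary handling at the exact thresholds is approximate (C4.5 uses <=/> splits), and the function covers only this one branch of the full tree:

```python
def price_class_for_branch(postal, age, crime, sqft_area):
    """Price class (1-5) for the single branch interpreted above.

    Returns None for cases handled by other branches of the full tree.
    Thresholds (803, 1980, 2395) are as quoted in the interpretation.
    """
    if postal not in {98106, 98108, 98118, 98178}:
        return None  # handled by a different top-level branch
    if age > 1:
        return None  # handled by another branch of this subtree
    if crime < 803:
        return 3
    # Crime >= 803: decide on the house's square footage.
    if sqft_area < 1980:
        return 4
    if sqft_area <= 2395:
        return 5
    return 4

print(price_class_for_branch(98108, 1, 500, 1500))  # → 3
```

Tracing a few more cases: a one-year-old house in 98106 with Crime 900 and 1500 sq ft lands in class 4, while the same house at 2000 sq ft lands in class 5, matching the branch description above.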


More information

The Loans_processed.csv file is the dataset we obtained after the pre-processing part where the clean-up python code was used.

The Loans_processed.csv file is the dataset we obtained after the pre-processing part where the clean-up python code was used. Machine Learning Group Homework 3 MSc Business Analytics Team 9 Alexander Romanenko, Artemis Tomadaki, Justin Leiendecker, Zijun Wei, Reza Brianca Widodo The Loans_processed.csv file is the dataset we

More information

Building and Checking Survival Models

Building and Checking Survival Models Building and Checking Survival Models David M. Rocke May 23, 2017 David M. Rocke Building and Checking Survival Models May 23, 2017 1 / 53 hodg Lymphoma Data Set from KMsurv This data set consists of information

More information

is the bandwidth and controls the level of smoothing of the estimator, n is the sample size and

is the bandwidth and controls the level of smoothing of the estimator, n is the sample size and Paper PH100 Relationship between Total charges and Reimbursements in Outpatient Visits Using SAS GLIMMIX Chakib Battioui, University of Louisville, Louisville, KY ABSTRACT The purpose of this paper is

More information

STATISTICAL DISTRIBUTIONS AND THE CALCULATOR

STATISTICAL DISTRIBUTIONS AND THE CALCULATOR STATISTICAL DISTRIBUTIONS AND THE CALCULATOR 1. Basic data sets a. Measures of Center - Mean ( ): average of all values. Characteristic: non-resistant is affected by skew and outliers. - Median: Either

More information

Influence of Personal Factors on Health Insurance Purchase Decision

Influence of Personal Factors on Health Insurance Purchase Decision Influence of Personal Factors on Health Insurance Purchase Decision INFLUENCE OF PERSONAL FACTORS ON HEALTH INSURANCE PURCHASE DECISION The decision in health insurance purchase include decisions about

More information

Subject CS2A Risk Modelling and Survival Analysis Core Principles

Subject CS2A Risk Modelling and Survival Analysis Core Principles ` Subject CS2A Risk Modelling and Survival Analysis Core Principles Syllabus for the 2019 exams 1 June 2018 Copyright in this Core Reading is the property of the Institute and Faculty of Actuaries who

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 15: Tree-based Algorithms Cho-Jui Hsieh UC Davis March 7, 2018 Outline Decision Tree Random Forest Gradient Boosted Decision Tree (GBDT) Decision Tree Each node checks

More information

Final Exam - section 1. Thursday, December hours, 30 minutes

Final Exam - section 1. Thursday, December hours, 30 minutes Econometrics, ECON312 San Francisco State University Michael Bar Fall 2013 Final Exam - section 1 Thursday, December 19 1 hours, 30 minutes Name: Instructions 1. This is closed book, closed notes exam.

More information

When determining but for sales in a commercial damages case,

When determining but for sales in a commercial damages case, JULY/AUGUST 2010 L I T I G A T I O N S U P P O R T Choosing a Sales Forecasting Model: A Trial and Error Process By Mark G. Filler, CPA/ABV, CBA, AM, CVA When determining but for sales in a commercial

More information

1 Estimating risk factors for IBM - using data 95-06

1 Estimating risk factors for IBM - using data 95-06 1 Estimating risk factors for IBM - using data 95-06 Basic estimation of asset pricing models, using IBM returns data Market model r IBM = a + br m + ɛ CAPM Fama French 1.1 Using octave/matlab er IBM =

More information

The Norwegian State Equity Ownership

The Norwegian State Equity Ownership The Norwegian State Equity Ownership B A Ødegaard 15 November 2018 Contents 1 Introduction 1 2 Doing a performance analysis 1 2.1 Using R....................................................................

More information

Econometrics is. The estimation of relationships suggested by economic theory

Econometrics is. The estimation of relationships suggested by economic theory Econometrics is Econometrics is The estimation of relationships suggested by economic theory Econometrics is The estimation of relationships suggested by economic theory The application of mathematical

More information

The study on the financial leverage effect of GD Power Corp. based on. financing structure

The study on the financial leverage effect of GD Power Corp. based on. financing structure 5th International Conference on Education, Management, Information and Medicine (EMIM 2015) The study on the financial leverage effect of GD Power Corp. based on financing structure Xin Ling Du 1, a and

More information

Predicting stock prices for large-cap technology companies

Predicting stock prices for large-cap technology companies Predicting stock prices for large-cap technology companies 15 th December 2017 Ang Li (al171@stanford.edu) Abstract The goal of the project is to predict price changes in the future for a given stock.

More information

Graduate School of Business, University of Chicago Business 41202, Spring Quarter 2007, Mr. Ruey S. Tsay. Midterm

Graduate School of Business, University of Chicago Business 41202, Spring Quarter 2007, Mr. Ruey S. Tsay. Midterm Graduate School of Business, University of Chicago Business 41202, Spring Quarter 2007, Mr. Ruey S. Tsay Midterm GSB Honor Code: I pledge my honor that I have not violated the Honor Code during this examination.

More information

Prediction of Stock Price Movements Using Options Data

Prediction of Stock Price Movements Using Options Data Prediction of Stock Price Movements Using Options Data Charmaine Chia cchia@stanford.edu Abstract This study investigates the relationship between time series data of a daily stock returns and features

More information

> attach(grocery) > boxplot(sales~discount, ylab="sales",xlab="discount")

> attach(grocery) > boxplot(sales~discount, ylab=sales,xlab=discount) Example of More than 2 Categories, and Analysis of Covariance Example > attach(grocery) > boxplot(sales~discount, ylab="sales",xlab="discount") Sales 160 200 240 > tapply(sales,discount,mean) 10.00% 15.00%

More information

Loan Approval and Quality Prediction in the Lending Club Marketplace

Loan Approval and Quality Prediction in the Lending Club Marketplace Loan Approval and Quality Prediction in the Lending Club Marketplace Final Write-up Yondon Fu, Matt Marcus and Shuo Zheng Introduction Lending Club is a peer-to-peer lending marketplace where individual

More information

GGraph. Males Only. Premium. Experience. GGraph. Gender. 1 0: R 2 Linear = : R 2 Linear = Page 1

GGraph. Males Only. Premium. Experience. GGraph. Gender. 1 0: R 2 Linear = : R 2 Linear = Page 1 GGraph 9 Gender : R Linear =.43 : R Linear =.769 8 7 6 5 4 3 5 5 Males Only GGraph Page R Linear =.43 R Loess 9 8 7 6 5 4 5 5 Explore Case Processing Summary Cases Valid Missing Total N Percent N Percent

More information

Monetary Economics Risk and Return, Part 2. Gerald P. Dwyer Fall 2015

Monetary Economics Risk and Return, Part 2. Gerald P. Dwyer Fall 2015 Monetary Economics Risk and Return, Part 2 Gerald P. Dwyer Fall 2015 Reading Malkiel, Part 2, Part 3 Malkiel, Part 3 Outline Returns and risk Overall market risk reduced over longer periods Individual

More information

Lecture 13: Identifying unusual observations In lecture 12, we learned how to investigate variables. Now we learn how to investigate cases.

Lecture 13: Identifying unusual observations In lecture 12, we learned how to investigate variables. Now we learn how to investigate cases. Lecture 13: Identifying unusual observations In lecture 12, we learned how to investigate variables. Now we learn how to investigate cases. Goal: Find unusual cases that might be mistakes, or that might

More information

TABLE I SUMMARY STATISTICS Panel A: Loan-level Variables (22,176 loans) Variable Mean S.D. Pre-nuclear Test Total Lending (000) 16,479 60,768 Change in Log Lending -0.0028 1.23 Post-nuclear Test Default

More information

Lecture Note: Analysis of Financial Time Series Spring 2008, Ruey S. Tsay. Seasonal Time Series: TS with periodic patterns and useful in

Lecture Note: Analysis of Financial Time Series Spring 2008, Ruey S. Tsay. Seasonal Time Series: TS with periodic patterns and useful in Lecture Note: Analysis of Financial Time Series Spring 2008, Ruey S. Tsay Seasonal Time Series: TS with periodic patterns and useful in predicting quarterly earnings pricing weather-related derivatives

More information

Lecture Note: Analysis of Financial Time Series Spring 2017, Ruey S. Tsay

Lecture Note: Analysis of Financial Time Series Spring 2017, Ruey S. Tsay Lecture Note: Analysis of Financial Time Series Spring 2017, Ruey S. Tsay Seasonal Time Series: TS with periodic patterns and useful in predicting quarterly earnings pricing weather-related derivatives

More information

NEWCASTLE UNIVERSITY. School SEMESTER /2013 ACE2013. Statistics for Marketing and Management. Time allowed: 2 hours

NEWCASTLE UNIVERSITY. School SEMESTER /2013 ACE2013. Statistics for Marketing and Management. Time allowed: 2 hours NEWCASTLE UNIVERSITY School SEMESTER 2 2012/2013 Statistics for Marketing and Management Time allowed: 2 hours Candidates should attempt ALL questions. Marks for each question are indicated. However you

More information

EXST7015: Multiple Regression from Snedecor & Cochran (1967) RAW DATA LISTING

EXST7015: Multiple Regression from Snedecor & Cochran (1967) RAW DATA LISTING Multiple (Linear) Regression Introductory example Page 1 1 options ps=256 ls=132 nocenter nodate nonumber; 3 DATA ONE; 4 TITLE1 ''; 5 INPUT X1 X2 X3 Y; 6 **** LABEL Y ='Plant available phosphorus' 7 X1='Inorganic

More information

Effects of Financial Parameters on Poverty - Using SAS EM

Effects of Financial Parameters on Poverty - Using SAS EM Effects of Financial Parameters on Poverty - Using SAS EM By - Akshay Arora Student, MS in Business Analytics Spears School of Business Oklahoma State University Abstract Studies recommend that developing

More information

SFSU FIN822 Project 1

SFSU FIN822 Project 1 SFSU FIN822 Project 1 This project can be done in a team of up to 3 people. Your project report must be accompanied by printouts of programming outputs. You could use any software to solve the problems.

More information

General Business 706 Midterm #3 November 25, 1997

General Business 706 Midterm #3 November 25, 1997 General Business 706 Midterm #3 November 25, 1997 There are 9 questions on this exam for a total of 40 points. Please be sure to put your name and ID in the spaces provided below. Now, if you feel any

More information

Survey Sampling, Fall, 2006, Columbia University Homework assignments (2 Sept 2006)

Survey Sampling, Fall, 2006, Columbia University Homework assignments (2 Sept 2006) Survey Sampling, Fall, 2006, Columbia University Homework assignments (2 Sept 2006) Assignment 1, due lecture 3 at the beginning of class 1. Lohr 1.1 2. Lohr 1.2 3. Lohr 1.3 4. Download data from the CBS

More information