Milestone 2. Zillow House Price Prediction. Group: Lingzi Hong and Pranali Shetty


MILESTONE 2 REPORT

Data Collection
The following additional features were added:
1. Population, number of college graduates, and education rank by zip code.
2. School type and school rating (in addition to the number of schools within 1 km of each house), with the hope of improving the prediction.

Data Cleaning and Pre-processing
The data we scraped and the attributes we decided to add were divided into four groups: inner house properties, school data, environmental data, and zip-code features. Our main task was to clean the four different datasets and then merge them together. The following was done to clean and pre-process the data:
- Remove prices less than $1M and higher than $7M.
- Remove "$", ",", "K", or "M" characters from the dataset.
- Convert the house built year to the age of the house.
- Convert the appropriate variables (SchoolType, Rank, Postal, Built_Year) to factors.
- Delete instances with more than 100 baths.
- Convert shorthand numbers like 1M to full numeric values in Price.
- Convert acres to square feet in Lot_Area. (These steps were done in Excel, as they were easier to implement there.)
- Remove words like "Built in", "Sold", and others from the dataset.
- The Bed attribute had "Studio" as a value, which technically means no bedroom, so we decided to replace it with 0. The word "Studio" was not uniform throughout (some rows had variants of "Studio" due to scraping errors), so we standardized the values first and then replaced "Studio" with 0.

Data Engineering
We created a variable Age, a measure of how old the house is, calculated as the built year subtracted from 2014 (the present year). Built_Year is a factor, and during our initial analysis we noticed we couldn't derive adequate value from it directly, so Age is how we put it to use.

Data Merging
All four datasets share the base data, the inner house properties. The three external attribute sets (zip-code factors, school data, and environmental factors) were then merged on unique column names to get a full dataset with all attributes.
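The cleaning rules above can be sketched as small helper functions. This is an illustrative Python sketch, not the actual cleaning code (parts of which were done in Excel); the exact parsing rules for "$"/","/"K"/"M", the "Studio" recoding, and the 2014 reference year follow the description above, but the function names are ours.

```python
def clean_price(raw):
    """Convert a scraped price string like '$1.2M' or '950K' to a dollar amount."""
    s = raw.replace("$", "").replace(",", "").strip()
    if s.endswith("M"):
        return float(s[:-1]) * 1_000_000
    if s.endswith("K"):
        return float(s[:-1]) * 1_000
    return float(s)

def clean_beds(raw):
    """'Studio' (in any spelling variant) technically means no bedroom: recode as 0."""
    return 0 if "studio" in str(raw).strip().lower() else int(raw)

def house_age(built_year, present=2014):
    """Age = built year subtracted from the present year (2014)."""
    return present - int(built_year)
```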
By this time we have around 6383 rows and about 19 raw attributes (original, not recoded in any way) that we can work with statistically. Latitude, Longitude, City, Region, and Address will not be considered from now on; they were used fully in the preliminary data analysis to understand the dataset and its features, both through geo-graphs and statistical tests.
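The merge step amounts to a key-based join of the three external tables onto the base house records. A minimal Python sketch of the idea, assuming Postal as the shared key; the field names and values here are made up for illustration and are not the actual merge code:

```python
# Base data: inner house properties (illustrative records).
houses = [
    {"Address": "123 Main St", "Postal": "98118", "Bed": 3},
    {"Address": "456 Oak Ave", "Postal": "98106", "Bed": 2},
]

# External table: zip-code features keyed by the shared column.
zip_features = {
    "98118": {"MedIncome": 52000, "Population": 41000},
    "98106": {"MedIncome": 48000, "Population": 23000},
}

# Join each house record with its zip-code attributes on the unique key.
merged = [dict(house, **zip_features.get(house["Postal"], {})) for house in houses]
```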

Data Overview
After the above processes we have 6383 rows of data and about 19 attributes. Please note that we have not excluded any NA or blank fields at this point; they are tackled per the modelling requirements later. The dataset has four main sections from which attributes can be picked to help in prediction (factors are explicitly marked; the others are continuous data or are converted to a scale during analysis).

Inner House Properties
- Bed (factor, but converted to scale)
- Bath
- Built_Year (factor)
- Price_Sqft (simply Price / Sqft_Area, technically the same information as Price, so not used in analysis)
- Lot_Area
- Sqft_Area

School Properties
- School (number of schools within 1 km of the house)
- SchoolDist (distance from the house to the nearest school)
- SchoolTSRatio (student-teacher ratio at the school)
- SchoolRating (1-10 rank of the school; 10 is best, decreasing to 1)
- SchoolType (factor)

Zipcode Features (attributes taken to describe a community)
- MedIncome
- Postal (factor)
- Population
- College.Graduates
- Rank (education rank of the zip code) (factor)
- MedAge

Environmental Data
- TranEnvi (distance to the nearest water body)
- TransDist (distance to the nearest transportation medium)
- Crime (number of criminal incidents within 3 km of the house)
- Envi (number of water bodies in the area)

Data Models Summary (Q.4 comparative analysis + analysis questions of each model answered here)

Research Question
To understand the factors that affect house prices in Seattle and predict them. We performed the following machine learning techniques to best answer our research question:
- Linear regression
- Multivariate regression
- Regularization: ridge and lasso methods
- Logistic regression
- Naïve Bayes classification
- Decision tree classification: boosting and random forest
- Decision tree regression: boosting and random forest
- SVMs

1. Linear Regression
We ran a linear regression for each of the attributes except the factors. Since linear regression gives its best results when both the independent and dependent variables are continuous, we did not run it for factors. The intercept, coefficient, and residuals of each model were recorded. To compare models we also noted each model's correlation and RMSE. (R-squared was not a good criterion in this case, as each model has only one independent variable.) College.Graduates, Population, SchoolRating, MedIncome, School, and Sqft_Area are significant predictors when a model of all variables is created; among those, College.Graduates, Population, and Sqft_Area are the most predictive. The features that showed strong correlation with price are College.Graduates, Sqft_Area, Bath, Bed (slight, but still better than the rejected features), number of water bodies in the area, Crime, distance to the nearest transport, student-teacher ratio, school rating, median income of the community (zip code), and population of the community. In conclusion, linear regression is applicable to our case, though certainly not with a single variable. Running linear regression helped us understand which variables are significant and which are not, and since many of our attributes are continuous, linear regression is a good starting step.
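The fits themselves were done in R with lm(); to make the two comparison metrics concrete, here is a minimal pure-Python sketch of a closed-form single-variable least-squares fit and the RMSE measure used throughout. The data are made up for illustration:

```python
import math

def fit_simple_ols(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

def rmse(pred, actual):
    """Root mean squared error between predictions and observed values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred))

# Made-up data: price rises exactly $300 per extra square foot.
sqft = [1000.0, 1500.0, 2000.0, 2500.0]
price = [300000.0, 450000.0, 600000.0, 750000.0]
intercept, slope = fit_simple_ols(sqft, price)
predictions = [intercept + slope * x for x in sqft]
```

The cor(prediction, test) and rmse(prediction, test) calls in the R listings below report exactly these kinds of quantities on the held-out test set.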
The table of per-model results recorded the adjusted R-squared, correlation, RMSE, significance, intercept, coefficient, and residual standard error of each model. The significance of each single-variable model:

Model                      Significant (predictive)?               Residual df
Price ~ Sqft_Area          Yes                                     5100
Price ~ Bath               Yes                                     5104
Price ~ Bed                Yes                                     5104
Price ~ Lot_Area           Yes                                     5040
Price ~ Age                Yes                                     5094
Price ~ Envi               Yes                                     5104
Price ~ Crime              Yes                                     5104
Price ~ TranEnvi           No                                      5104
Price ~ TransDist          Slightly (p < 0.1)                      156
Price ~ School             Yes (p < 0.05)                          156
Price ~ SchoolDist         No                                      156
Price ~ SchoolTSRatio      Yes                                     156
Price ~ SchoolRating       Slight (p < 0.1), but good correlation  120
Price ~ MedIncome          Yes                                     156
Price ~ Population         Yes                                     156
Price ~ College.Graduates  Yes                                     156

2. Multivariate Regression
For multivariate regression we used the four main groups of data: the house properties, the schools around the house, environmental factors, and community characteristics. Starting with each, we added data to the model. We noted the

confidence intervals, ANOVA tables, model coefficients, and intercept values in each case. Overall, 12 models were built, and the correlation, R-squared, and RMSE values of each model helped us analyze which one was the best. Comparing the models on those metrics, Model 11 is picked as the best. Its fitted equation predicts Price from Bath, Bed, Sqft_Area, Lot_Area, Age, ChangeCrime, Envi, TranEnvi, SchoolRating, MedIncome, Rank, Population, and the Bed-Bath interaction (BedBath).

3. Regularization Models
For each of the 12 multivariate regression models, regularization (lasso and ridge) was performed, and the correlation and RMSE of the lasso and ridge fits were recorded. Regularization did not improve the models: it drops the correlation considerably, although at the same time the RMSE value decreases too, which is a good sign. We mainly looked at the lasso method, as ridge usually ends up keeping many variables, which does not lead to a good model.
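The shrinkage effect of regularization is easiest to see in the one-feature case, where ridge has a closed form. A minimal sketch of the idea only; our actual models were multivariate and fitted in R, and the data here are made up:

```python
def ridge_1d(x, y, lam):
    """One-feature ridge regression without intercept:
    beta = sum(x*y) / (sum(x^2) + lambda).
    lambda = 0 recovers ordinary least squares; a larger lambda
    shrinks the coefficient toward zero."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

x = [-2.0, -1.0, 1.0, 2.0]   # centered, made-up feature
y = [-4.0, -2.0, 2.0, 4.0]   # y = 2x exactly
b_ols = ridge_1d(x, y, 0.0)    # ordinary least-squares slope
b_ridge = ridge_1d(x, y, 10.0) # same slope, shrunk by the penalty
```

Lasso has no closed form even in this case (it is solved by soft-thresholding), but it shrinks coefficients all the way to zero, which is why it tends to drop variables while ridge keeps many of them.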

4. Logistic Regression
Our dependent variable is continuous, so this method does not suit it best. Moreover, we could not use the glm() function, as it requires the dependent variable to be binary. So we decided to split our data into 10 categories of price ranges and run ordinal logistic regression using the polr() function from the MASS package. We developed about 8 models from the knowledge we gained from the multiple regression models. However, because the range within each category is so large in terms of price estimation, almost every model we ran reported all features as significant, which we take to mean the model fits the attributes within those wide ranges. So for the purpose of our research question, which is to determine the house price and not a price range, this model fails.

5. Naïve Bayes
We separated price into 5 categories. Findings:
1. The Laplace estimator changed nothing in the model.
2. Applying a kernel made the prediction results better.
3. Among the Naïve Bayes models, the one using the inner property variables plus zip code performs best.
4. From the plot, Bath and house area are the relatively better variables for defining the price class.

Seven classification models were compared: model 1 (inner properties and zip code), model 2 (add Envi), model 3 (add School), model 4 (all variables), model 5 (add Laplace), model 6 (add kernel), and model 7 (kernel and Laplace).
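The binning used by these classifiers can be sketched as follows. The cut points below are hypothetical, since the actual category boundaries are not given in the report:

```python
def price_class(price, cuts):
    """Map a price to class 1..len(cuts)+1 given ascending cut points."""
    for k, cut in enumerate(cuts, start=1):
        if price <= cut:
            return k
    return len(cuts) + 1

# Hypothetical boundaries for 5 price categories (illustration only).
cuts = [300_000, 500_000, 800_000, 1_200_000]
```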


6. Decision Trees
For decision trees we separated price into 5 categories.
a. The data were split into training and testing sets.
b. Decision tree classification model: we trained a decision tree and interpreted some of the resulting if-then rules. To explain one branch of the tree: if the Postal of the house is one of 90106, 98108, 98118, …, then check the age of the house. If the age is greater than 1, go to another branch. If the age is less than 1 and the crime variable is less than 803, the price belongs to the 3rd class. If the age is less than 1 and the crime variable is greater than 803, check the area of the house: if it is less than 1980, the price belongs to the 4th class; if it is between 1980 and 2395, the price belongs to the 5th class; and if the area is greater than 2395, the price belongs to the 4th class.
We then compared the training and testing results by confusion matrix (classes 1-5) and the percentage of data correctly classified, starting with the training dataset.
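The quoted branch can be transcribed directly into code. The thresholds, classes, and zip codes are as stated above; the behaviour at exact threshold values and for the elided zip codes is an assumption:

```python
def classify_branch(postal, age, crime, sqft_area):
    """One branch of the fitted decision tree, transcribed from the rules above."""
    if postal not in {"90106", "98108", "98118"}:  # elided zip codes omitted
        return "other branch"
    if age > 1:
        return "other branch"
    if crime < 803:
        return 3           # 3rd price class
    if sqft_area < 1980:
        return 4           # 4th price class
    if sqft_area < 2395:
        return 5           # 5th price class
    return 4               # back to the 4th class for the largest houses
```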

The same confusion matrix and correct-classification percentage were computed for the testing dataset. Performance on the training dataset is better than on the testing dataset.
c. Boosting overall achieves higher prediction accuracy than the decision tree without boosting; with more trees, the prediction accuracy first increases and then decreases. Five classification models were compared: model 1 (basic model) and models 2 through 5 with boosting over 8, 10, 12, and 20 trees.
d. In our model, performance improves as more variables are included, and training more trees on the same variables also improves the accuracy of prediction. The random forest comparison covered models with 8, 6, and 4 variables, plus 8-variable forests with 50, 100, 150, and 200 trees.
e. Decision tree regression model. Variable importance, in decreasing order: Sqft_Area, Postal, Bath, Crime,

Lot_Area, Envi, Age, Bed.

Plot of the tree model

Comparison of decision tree and random forest: the two were compared on correlation (cor) and RMSE.

7. SVM
Classification model: SVM classifiers with the vanilladot, rbfdot, and anovadot kernels were compared on the percentage of correct classifications. Regression model: SVM regressions with the same three kernels were compared on correlation and RMSE. The linear kernel performs best for classification, while the Gaussian kernel performs best for the regression model.

Comparative Analysis of Classification and Regression Models
Comparison between classification models: in Naïve Bayes, model 6, combining the inner property variables with zip code and a kernel, performs best on prediction accuracy. In decision trees, model 1, with 8 variables in the random forest, performs best. In SVM, the model with the vanilladot kernel performs best. The best model comes from the random forest, which has overall higher prediction accuracy than SVM and Naïve Bayes; for house price prediction, Naïve Bayes does not perform as well as the other classifiers.
Comparison between regression models: in multivariate regression, Model 11 is picked as the best on R-squared, prediction-test correlation, and RMSE. In decision trees, the best regression model comes from the random forest. In the SVM models, the rbfdot kernel performs best. Of the three, the random forest still performs better than the others.
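The selection rule used throughout this comparison (higher correlation and lower RMSE win) can be made explicit in a few lines. The scores below are placeholders for illustration, not the report's actual numbers:

```python
# Placeholder metric values per model family (illustration only).
models = {
    "naive_bayes":   {"cor": 0.61, "rmse": 310_000},
    "svm_rbf":       {"cor": 0.74, "rmse": 240_000},
    "random_forest": {"cor": 0.82, "rmse": 190_000},
}

# Lower RMSE is better; higher correlation is better.
best_by_rmse = min(models, key=lambda m: models[m]["rmse"])
best_by_cor = max(models, key=lambda m: models[m]["cor"])
```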

MODELS (Q.1)

a. Linear Regression
We tried out various independent parameters. Linear regression provides its best results, or rather is preferred, when both the dependent (outcome) variable and the independent variable are continuous.

1. Model 1: House Price and Sqft Area
> summary(model)
Call:
lm(formula = Price ~ Sqft_Area, data = hou_train)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              e-06 ***
Sqft_Area                               < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Sqft_Area: Significant
Residual standard error: on 5100 degrees of freedom (4 observations deleted due to missingness)
Multiple R-squared: , Adjusted R-squared:
F-statistic: 4890 on 1 and 5100 DF, p-value: < 2.2e-16

Intercept and Coefficient:
> model$coefficients
(Intercept)   Sqft_Area

14 Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Sqft_Area e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Sqft_Area Prediction and Confidence Bands

15 Diagnostic Plots 2. Model 2 House and Bath > summary(model) Call lm(formula = Price ~ Bath, data = hou_train) Data Residuals

16 Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e-08 *** Bath < 2e-16 *** --- Signif. codes 0 *** ** 0.01 * Bath Significant Residual standard error Residual standard error on 5104 degrees of freedom R-Square values Multiple R-squared , Adjusted R-squared F-statistic 1972 on 1 and 5104 DF, p-value < 2.2e-16 Intercept and Coefficient > model$coefficients (Intercept) Bath Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Bath

17 Prediction and Confidence Bands

18 3. Model 3 House Price and Bed Diagnostic Plots

19 > summary(model) Call lm(formula = Price ~ Bed, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) * Bed <2e-16 *** --- Signif. codes 0 *** ** 0.01 * Bed Significant Residual standard error Residual standard error on 5104 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 5104 DF, p-value < 2.2e-16 Intercept and Coefficient > model$coefficients (Intercept) Bed Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bed e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Bed

20 Prediction and Confidence Bands Diagnostic Plot

21 4. Model 4 Lot Area and House Price > summary(model) Call lm(formula = Price ~ Lot_Area, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) 5.896e e < 2e-16 *** Lot_Area e e e-07 *** --- Signif. codes 0 *** ** 0.01 * Lot_Area Significant Residual Standard Error Residual standard error on 5040 degrees of freedom (64 observations deleted due to missingness) R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 5040 DF, p-value 3.506e-07 Coefficients and Intercept > model$coefficients (Intercept) Lot_Area Correlation > cor(prediction,hou_test$price, use='complete')

22 [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Lot_Area e e e-07 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) e+05 Lot_Area e-01 Prediction and Confidence Bands

23 5. Model 5 House Price and Age of House Diagnostic Plot

24 > summary(model) Call lm(formula = Price ~ Age, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) < 2e-16 *** Age e-13 *** --- Signif. codes 0 *** ** 0.01 * Age Significant Residual Standard Error Residual standard error on 5094 degrees of freedom (10 observations deleted due to missingness) R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 5094 DF, p-value 1.227e-13 Coefficients and Intercept > model$coefficients (Intercept) Age Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Age e e e-13 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Age

25 Prediction and Confidence bands Diagnostic Plots

26 6. Model 6 House Price and Number of Water Views in 0.5km Radius > summary(model) Call lm(formula = Price ~ Envi, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) <2e-16 *** Envi <2e-16 *** --- Signif. codes 0 *** ** 0.01 * Envi Significant Residual Standard Error Residual standard error on 5104 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 5104 DF, p-value < 2.2e-16 Coefficients and Intercept > model$coefficients (Intercept) Envi Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Envi e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Envi

27 Prediction and Confidence Bands

28 7. Model 7 House Price and Crime Diagnostic Plots

29 > summary(model) Call lm(formula = Price ~ Crime, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) 4.374e e <2e-16 *** Crime 8.834e e <2e-16 *** --- Signif. codes 0 *** ** 0.01 * Crime Significant Residual Standard Error Residual standard error on 5104 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 5104 DF, p-value < 2.2e-16 Coefficients and Intercept > model$coefficients (Intercept) Crime Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Crime e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Crime

30 Prediction and Confidence Bands Diagnostic Plots

31 8. Model 8 House Price and Distance to water bodies > summary(model) Call lm(formula = Price ~ TranEnvi, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) <2e-16 *** TranEnvi Signif. codes 0 *** ** 0.01 * TranEnvi Significant Residual Standard Error Residual standard error on 5104 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared 8.001e-05 F-statistic on 1 and 5104 DF, p-value Coefficients and Intercept > model$coefficients (Intercept) TranEnvi Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) TranEnvi e e Residuals e e+11 > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) TranEnvi

From the plot we can see that TranEnvi takes only four distinct values, and most houses have TranEnvi equal to 0, so it is not suitable for linear regression; the diagnostic plots show this as well. Still, we have listed the parameters for review. Prediction and Confidence Bands

33 Diagnostic Plots 9. Model 9 House Price and Distance of House and Its Nearest Transportation Place > summary(model) Call lm(formula = Price ~ TransDist, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e-08 *** TransDist Signif. codes 0 *** ** 0.01 * TransDist Slightly Significant (p<0.1) Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 156 DF, p-value Coefficients and Intercept > model$coefficients (Intercept) TransDist

34 Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) TransDist e e+11 Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) TransDist

35 Prediction and Confidence Bands Diagnostic Plots

36 10. Model 10 House Price and Number of Schools in 2km Radius > summary(model) Call lm(formula = Price ~ School, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e-06 *** School * Signif. codes 0 *** ** 0.01 * School School Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared 0.02 F-statistic on 1 and 156 DF, p-value Coefficients and Intercept > model$coefficients (Intercept) School Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) School e e * Residuals e e Signif. codes 0 *** ** 0.01 * confint(model, level=0.95) 2.5 % 97.5 % (Intercept) School

37 Prediction and Confidence Bands

38 Diagnostic Plots 11. Model 11 House Price and Distance to the Nearest School > summary(model) Call lm(formula = Price ~ SchoolDist, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e-13 *** SchoolDist Signif. codes 0 *** ** 0.01 * SchoolDist Not Significant Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 156 DF, p-value Coefficients and Intercept > model$coefficients (Intercept) SchoolDist

39 Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) SchoolDist e e Residuals e e+11 > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) SchoolDist

40 Prediction and Confidence Bands Diagnostic Plots

41 12. Model 12 House Price and the Ratio of Student to Teacher in the Nearest School > summary(model) Call lm(formula = Price ~ SchoolTSRatio, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) < 2e-16 *** SchoolTSRatio e-09 *** Signif. codes 0 *** ** 0.01 * SchoolTSRatio Significant Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 156 DF, p-value 2.97e-09 Coefficients and Intercept > model$coefficients (Intercept) SchoolTSRatio Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) SchoolTSRatio e e e-09 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) SchoolTSRatio

42 Prediction and Confidence Bands

43 Diagnostic Plots 13. Model 13 House Price and Rating of the Nearest School > summary(model) Call lm(formula = Price ~ SchoolRating, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) <2e-16 *** SchoolRating Signif. codes 0 *** ** 0.01 * SchoolRating Slightly significant (however, significant if we consider p<0.1) Residual Standard Error Residual standard error on 120 degrees of freedom (36 observations deleted due to missingness) R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 120 DF, p-value Coefficients and Intercept

44 > model$coefficients (Intercept) SchoolRating Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) SchoolRating e e Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) SchoolRating

45 Prediction and confidence bands Diagnostic Plots

46 14. Model 14 House Price and the Median Income of the District > summary(model) Call lm(formula = Price ~ MedIncome, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) 1.529e e e-13 *** MedIncome e e e-08 *** --- Signif. codes 0 *** ** 0.01 * MedIncome Significant Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 156 DF, p-value 2.343e-08 Coefficients and Intercept > model$coefficients (Intercept) MedIncome Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) MedIncome e e e-08 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) MedIncome

47 Prediction and Confidence Bands

48 Diagnostic Plots 15. Model 15 House Price and the Population > summary(model) Call lm(formula = Price ~ Population, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) < 2e-16 *** Population e-13 *** --- Signif. codes 0 *** ** 0.01 * Population Significant Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 156 DF, p-value 3.436e-13

49 Coefficients and Intercept > model$coefficients (Intercept) Population Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Population e e e-13 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) Population

50 Prediction and Confidence Bands Diagnostic Plots

51 16. Model 16 House Price and the Ratio of College Graduates, thus education level of the District > summary(model) Call lm(formula = Price ~ College.Graduates, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) College.Graduates e-05 *** Signif. codes 0 *** ** 0.01 * College.Graduates Significant Residual Standard Error Residual standard error on 156 degrees of freedom R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 1 and 156 DF, p-value 4e-05 Coefficients and Intercept > model$coefficients (Intercept) College.Graduates Correlation > cor(prediction,hou_test$price, use='complete') [1] RMSE > rmse(prediction,hou_test$price) [1] > anova(model) # anova table Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) College.Graduates e e e-05 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) 2.5 % 97.5 % (Intercept) College.Graduates

52 Prediction and Confidence Bands

53 Diagnostic Plots 17. Model of all variables to check ranking of variables > summary(model) Call lm(formula = Price ~ Bed + Bath + Sqft_Area + Age + Lot_Area + School + SchoolDist + SchoolTSRatio + SchoolRating + MedIncome + Population + College.Graduates + TransDist + Crime + Envi, data = hou_train) Data Residuals Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) ** Bed Bath Sqft_Area *** Age Lot_Area School * SchoolDist SchoolTSRatio SchoolRating ** MedIncome **

Population e-05 *** College.Graduates e-05 *** TransDist Crime Envi Signif. codes 0 *** ** 0.01 * College.Graduates, Population, SchoolRating, MedIncome, School, Sqft_Area Significant Residual Standard Error Residual standard error on 106 degrees of freedom (36 observations deleted due to missingness) R-Square Values Multiple R-squared , Adjusted R-squared F-statistic on 15 and 106 DF, p-value 1.768e-10 Coefficients and Intercept model$coefficients (Intercept) Bed Bath Sqft_Area Age Lot_Area School SchoolDist SchoolTSRatio SchoolRating MedIncome Population College.Graduates TransDist Crime Envi RMSE > rmse(prediction,hou_test$price) [1] Correlation > cor(prediction,hou_test$price, use='complete') [1] From the above it looks like College.Graduates, Population, SchoolRating, MedIncome, School, and Sqft_Area are significant predictors. Among those, College.Graduates, Population, and Sqft_Area are most predictive. SchoolType is a factor with levels such as public school and private school, so we can't build a linear regression of house price on it. Rank is also a factor and hence can't be used for linear regression. Postal and Built_Year, likewise being factors in the dataset, are not useful in linear regression. b. Multivariate Regression While doing multivariate analysis we started by taking different groups of factors into account one at a time, as follows. Inner House Parameters 1. Model 1 Independent Variables Equation: Bath+Bed+Sqft_Area+Lot_Area > summary(model1, corr=T) Call lm(formula = Price ~ Bath + Bed + Sqft_Area + Lot_Area, data = hou_train)

55 Residuals Min 1Q Median Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e < 2e-16 *** Bath 2.858e e e-05 *** Bed 2.891e e < 2e-16 *** Sqft_Area 3.142e e < 2e-16 *** Lot_Area e e e-06 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area Residual standard error on 5031 degrees of freedom (61 observations deleted due to missingness) R-Square Multiple R-squared 0.493, Adjusted R-squared F-statistic 1223 on 4 and 5031 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Bath Bed Sqft_Area Lot_Area Intercept and Coefficients > model1$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 > anova(model1) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath Bed e e < 2.2e-16 *** e e * Sqft_Area e e < 2.2e-16 *** Lot_Area e e+12 Residuals e e e-06 *** --- Signif. codes 0 *** ** 0.01 * > confint(model1, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+04 Bed e e+04 Sqft_Area e e+02 Lot_Area e e-01 RMSE > rmse(prediction1,hou_test$price) [1] Correlation > cor(prediction1,hou_test$price, use="complete") [1]

56 Diagnostic Plots 2. Model 2 Using Interaction Variables (between Bed and Bath) Independent Variables Equation Bath+Bed+BathBed+Sqft_Area+Lot_Area (Bath*Bed=Bed+Bath+BedBath) > summary(model2, corr=t) Call lm(formula = Price ~ Bath + Bed + BathBed + Sqft_Area + Lot_Area, data = hou_train) Residuals Min 1Q Median Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e < 2e-16 *** Bath 8.030e e < 2e-16 *** Bed 6.664e e < 2e-16 *** Sqft_Area 3.238e e < 2e-16 *** Lot_Area e e e-07 *** BathBed e e e-14 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area, BedBath Residual standard error on 5030 degrees of freedom (61 observations deleted due to missingness) R-Square Multiple R-squared 0.499, Adjusted R-squared
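Model 2's BathBed term is simply the elementwise product of the two columns; in R, the formula Bath*Bed expands to Bath + Bed + Bath:Bed. A one-line sketch of constructing the interaction feature, with hypothetical values:

```python
bath = [2.0, 1.0, 3.5, 2.5]
bed = [3, 2, 4, 3]

# The BathBed interaction column: elementwise product of Bath and Bed
bath_bed = [x * y for x, y in zip(bath, bed)]
```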

57 F-statistic 1002 on 5 and 5030 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Lot_Area Bath Bed Sqft_Area Lot_Area BathBed Intercept and Coefficients > model2$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 BathBed e+04 > anova(model2) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e < 2.2e-16 *** Bed Sqft_Area e e * e e < 2.2e-16 *** Lot_Area e e e-06 *** BathBed e e+12 Residuals e e e-14 *** --- Signif. codes 0 *** ** 0.01 * > confint(model2, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+04 Bed Sqft_Area e e e e+02 Lot_Area e e-01 BathBed e e+04 Correlation > cor(prediction2,hou_test$price, use="complete") [1] RMSE > rmse(prediction2,hou_test$price) [1]

58 Diagnostic Plots 3. Model 3 Add Age of House Independent Variables Equation Bath*Bed+Sqft_Area+Lot_Area+Age > summary(model3, corr=t) Call lm(formula = Price ~ Bath * Bed + Sqft_Area + Lot_Area + Age, data = hou_train) Residuals Min 1Q Median Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e < 2e-16 *** Bath 1.239e e < 2e-16 *** Bed 6.829e e < 2e-16 *** Sqft_Area 2.945e e < 2e-16 *** Lot_Area e e *** Age 2.019e e < 2e-16 *** BathBed e e e-15 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area, BedBath, Age Residual standard error on 5029 degrees of freedom (61 observations deleted due to missingness)

59 R-Square Multiple R-squared , Adjusted R-squared F-statistic on 6 and 5029 DF, p-value < 2.2e-16 Correlation of Coefficients Bath (Intercept) Bath Bed Sqft_Area Lot_Area Age Bed Sqft_Area Lot_Area Age BathBed Intercept and Coefficients > model3$coefficients (Intercept) Bath Bed Sqft_Area e e e e+02 Lot_Area Age BathBed e e e+04 > anova(model3) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e < 2.2e-16 *** Bed Sqft_Area e e * e e < 2.2e-16 *** Lot_Area e e e-06 *** Age BathBed e e < 2.2e-16 *** e e e-15 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model3, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+05 Bed e e+04 Sqft_Area Lot_Area e e e e-01 Age e e+03 BathBed e e+04 Correlation > cor(prediction3,hou_test$price, use="complete") [1] RMSE > rmse(prediction3,hou_test$price) [1]

Diagnostic Plots

4. Model 4
Age is not linear with Price; the relationship looks parabolic (a higher-order polynomial), so a squared term is added.
Independent Variables Equation: Bath*Bed+Sqft_Area+Lot_Area+Age+I(Age^2)

> summary(model4, corr=T)
Call: lm(formula = Price ~ Bath * Bed + Sqft_Area + Lot_Area + Age + I(Age^2), data = hou_train)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e e < 2e-16 ***
Bath 1.158e e < 2e-16 ***
Bed 6.703e e < 2e-16 ***
Sqft_Area 2.918e e < 2e-16 ***
Lot_Area e e **
Age e e ***
I(Age^2) 3.481e e e-16 ***
BathBed e e < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictive features: Bath, Bed, Sqft_Area, Age, Age^2, Lot_Area, BedBath
Residual standard error on 5022 degrees of freedom

61 (67 observations deleted due to missingness) R-Square Multiple R-squared 0.545, Adjusted R-squared F-statistic on 7 and 5022 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Lot_Area Age I(Age^2) Bath Bed Sqft_Area Lot_Area Age I(Age^2) BathBed Intercept and Coefficients > model4$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 Age I(Age^2) BathBed e e e+04 > anova(model4) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath Bed e e < 2.2e-16 *** e e Sqft_Area e e < 2.2e-16 *** Lot_Area Age e e e-07 *** e e < 2.2e-16 *** I(Age^2) e e < 2.2e-16 *** BathBed e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model4, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath Bed e e e e+04 Sqft_Area e e+02 Lot_Area Age e e e e+02 I(Age^2) e e+01 BathBed e e+04 RMSE > rmse(prediction4,hou_test$price) [1] Correlation > cor(prediction4,hou_test$price, use="complete") [1]

Diagnostic Plots

5. Model 5
Use the new Rescaled Age. In the exploratory analysis of the Age variable (as can also be seen above), Age is not linear with Price but rather dips in the middle years: old and new houses probably command a higher price than middle-aged ones. We therefore centralized Age on the mean built year and rescaled it: RevAge = (Mean - YearBuilt)/10.
Independent Variable Equation: Bath*Bed+Sqft_Area+Lot_Area+RevAge+I(RevAge^2)

> summary(model5, corr=T)
Call: lm(formula = Price ~ Bath * Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2), data = hou_train)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e e < 2e-16 ***
Bath 1.158e e < 2e-16 ***
Bed 6.703e e < 2e-16 ***
Sqft_Area 2.918e e < 2e-16 ***
Lot_Area e e **
RevAge e e < 2e-16 ***
I(RevAge^2) 3.481e e e-16 ***
BathBed e e < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
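Model 5's rescaling, RevAge = (mean built year - YearBuilt)/10, can be sketched as follows (hypothetical years, not the report's data):

```python
def rev_age(built_years):
    """Centre Built_Year on the sample mean and rescale by a decade,
    so middle-aged houses sit near zero and the quadratic term can capture the dip."""
    mean_year = sum(built_years) / len(built_years)
    return [(mean_year - y) / 10 for y in built_years]
```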

63 Predictive features Bath, Bed, Sqft_Area, Lot_Area, BedBath, RevAge, RevAge 2 Residual standard error on 5022 degrees of freedom (67 observations deleted due to missingness) R-Square Multiple R-squared 0.545, Adjusted R-squared F-statistic on 7 and 5022 DF, p-value < 2.2e-16 Correlation of Coefficients Bath (Intercept) Bath Bed Sqft_Area Lot_Area RevAge I(RevAge^2) Bed Sqft_Area Lot_Area RevAge I(RevAge^2) BathBed Intercept and Coefficients > model5$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 RevAge I(RevAge^2) BathBed e e e+04 > anova(model5) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e < 2.2e-16 *** Bed Sqft_Area e e e e < 2.2e-16 *** Lot_Area e e e-07 *** RevAge I(RevAge^2) e e < 2.2e-16 *** e e < 2.2e-16 *** BathBed e e < 2.2e-16 *** Residuals e e+10 Signif. codes 0 *** ** 0.01 * > confint(model5, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+05 Bed e e+04 Sqft_Area Lot_Area e e e e-01 RevAge e e+04 I(RevAge^2) e e+03 BathBed e e+04 Correlation > cor(prediction5,hou_test$price, use="complete") [1] RMSE > rmse(prediction5,hou_test$price) [1]

Diagnostic Plots

The model is essentially unchanged, so either Age variable (Age or RevAge) can be used.

Next, add Environmental Factors.

6. Model 6
Add Crime.
Independent Variables Equation: Bath*Bed+Sqft_Area+Lot_Area+RevAge+I(RevAge^2)+Crime

> summary(model6, corr=T)
Call: lm(formula = Price ~ Bath * Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2) + Crime, data = hou_train)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e e < 2e-16 ***
Bath 1.142e e < 2e-16 ***
Bed 6.537e e < 2e-16 ***
Sqft_Area 2.885e e < 2e-16 ***
Lot_Area e e **
RevAge e e < 2e-16 ***
I(RevAge^2) 1.493e e ***
Crime 4.406e e < 2e-16 ***
BathBed e e < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictive features: Bath, Bed, Sqft_Area, Lot_Area, BedBath, Crime, RevAge, RevAge^2

65 Residual standard error on 5021 degrees of freedom (67 observations deleted due to missingness) R-Square Multiple R-squared , Adjusted R-squared F-statistic on 8 and 5021 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Lot_Area RevAge I(RevAge^2) Crime Bath Bed Sqft_Area Lot_Area RevAge 0.12 I(RevAge^2) Crime BathBed Intercept and Coefficients > model6$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 RevAge I(RevAge^2) Crime BathBed e e e e+04 > anova(model6) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath Bed e e < 2.2e-16 *** e e Sqft_Area e e < 2.2e-16 *** Lot_Area RevAge e e e-07 *** e e < 2.2e-16 *** I(RevAge^2) e e < 2.2e-16 *** Crime BathBed e e < 2.2e-16 *** e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * > confint(model6, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath Bed e e e e+04 Sqft_Area e e+02 Lot_Area RevAge e e e e+04 I(RevAge^2) e e+03 Crime BathBed e e e e+04 Correlation > cor(prediction6,hou_test$price, use="complete") [1] RMSE > rmse(prediction6,hou_test$price) [1]

66 Diagnostic Plots 7. Model 7 Add Envi Independent Variables Equation Bath*Bed+Sqft_Area+Lot_Area+RevAge+I(RevAge^2)+ChangeCrime+Envi > summary(model7, corr=t) Call lm(formula = Price ~ Bath * Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2) + ChangeCrime + Envi, data = hou_train) Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e < 2e-16 *** Bath 1.051e e < 2e-16 *** Bed 6.230e e < 2e-16 *** Sqft_Area 2.791e e < 2e-16 *** Lot_Area e e * RevAge e e < 2e-16 *** I(RevAge^2) 9.362e e * ChangeCrime e e e-16 *** Envi 1.362e e < 2e-16 *** BathBed e e < 2e-16 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area, BedBath,Envi, ChangeCrime, RevAge, RevAge 2 Residual standard error on 5020 degrees of freedom (67 observations deleted due to missingness) R-Square

67 Multiple R-squared , Adjusted R-squared F-statistic on 9 and 5020 DF, p-value < 2.2e-16 Correlation of Coefficients Bath (Intercept) Bath Bed Sqft_Area Lot_Area RevAge I(RevAge^2) Bed Sqft_Area Lot_Area RevAge I(RevAge^2) ChangeCrime Envi BathBed ChangeCrime Envi Bath Bed Sqft_Area Lot_Area RevAge I(RevAge^2) ChangeCrime Envi BathBed > anova(model7) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e < 2.2e-16 *** Bed Sqft_Area e e e e < 2.2e-16 *** Lot_Area e e e-07 *** RevAge I(RevAge^2) e e < 2.2e-16 *** e e < 2.2e-16 *** ChangeCrime e e < 2.2e-16 *** Envi BathBed e e < 2.2e-16 *** e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * Intercept and Coefficients > model7$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 RevAge I(RevAge^2) ChangeCrime Envi BathBed e e e e e+04 > confint(model7, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+05 Bed e e+04 Sqft_Area e e+02 Lot_Area e e-02 RevAge e e+04 I(RevAge^2) e e+03 ChangeCrime e e+04 Envi e e+04 BathBed e e+04 Correlation > cor(prediction7,hou_test$price, use="complete") [1] RMSE > rmse(prediction7,hou_test$price) [1]

Diagnostic Plots

8. Model 8
Use AgeSquare, the square of Age (2014 - Built_Year), and add Envi to it.
Independent Variables Equation: Bath*Bed+Sqft_Area+Lot_Area+AgeSquare+ChangeCrime+Envi
(ChangeCrime is the z-score of Crime, used to center and scale it.)

> summary(model8, corr=T)
Call: lm(formula = Price ~ Bath * Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi, data = hou_train)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) e e < 2e-16 ***
Bath 1.028e e < 2e-16 ***
Bed 6.194e e < 2e-16 ***
Sqft_Area 2.799e e < 2e-16 ***
Lot_Area e e *
AgeSquare 1.305e e < 2e-16 ***
ChangeCrime e e e-16 ***
Envi 1.357e e < 2e-16 ***
BathBed e e < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictive features: Bath, Bed, Sqft_Area, Lot_Area, BedBath, ChangeCrime, Envi, Age^2
Residual standard error on 5021 degrees of freedom
(67 observations deleted due to missingness)
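ChangeCrime is described as the z-value of Crime; a minimal sketch of that standardization (using the population standard deviation, which is one reasonable reading of the report):

```python
import math

def z_score(xs):
    """Standardize a column to zero mean and unit (population) standard deviation."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]
```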

69 R-Square Multiple R-squared 0.57, Adjusted R-squared F-statistic 832 on 8 and 5021 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi BathBed 0.62 Envi Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi BathBed 0.08 Intercept and Coefficients > model8$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 AgeSquare ChangeCrime Envi BathBed e e e e+04 > confint(model8, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+05 Bed e e+04 Sqft_Area Lot_Area e e e e-02 AgeSquare e e+01 ChangeCrime e e+04 Envi e e+04 BathBed e e+04 > anova(model8) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath Bed e e < 2.2e-16 *** e e Sqft_Area e e < 2.2e-16 *** Lot_Area AgeSquare e e e-07 *** e e < 2.2e-16 *** ChangeCrime e e < 2.2e-16 *** Envi BathBed e e < 2.2e-16 *** e e < 2.2e-16 *** Residuals e e Signif. codes 0 *** ** 0.01 * Correlation > cor(prediction8,hou_test$price, use="complete") [1] RMSE > rmse(prediction8,hou_test$price) [1]

70 Diagnostic Plots 9. Model 9 All external and internal values Independent Variables Equation Bath+BathBed+Bed+Sqft_Area+Lot_Area+Age+AgeSquare+ChangeCrime+Envi+TranEnvi > summary(model9, corr=t) Call lm(formula = Price ~ Bath + BathBed + Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi + TranEnvi, data = hou_train) Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e < 2e-16 *** Bath 1.030e e < 2e-16 *** Bed Sqft_Area 6.196e e < 2e-16 *** 2.800e e < 2e-16 *** Lot_Area e e * AgeSquare 1.310e e < 2e-16 *** ChangeCrime e e e-16 *** Envi 1.354e e < 2e-16 *** TranEnvi BathBed 1.156e e e e < 2e-16 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area, BedBath, Envi, ChangeCrime, Age 2 Residual standard error on 5020 degrees of freedom

71 (67 observations deleted due to missingness) R-Square Multiple R-squared 0.57, Adjusted R-squared F-statistic on 9 and 5020 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi BathBed Envi TranEnvi Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi BathBed Intercept and Coefficients > model9$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 AgeSquare ChangeCrime Envi TranEnvi BathBed e e e e e+04 > anova(model9) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath Bed e e < 2.2e-16 *** e e Sqft_Area e e < 2.2e-16 *** Lot_Area AgeSquare e e e-07 *** e e < 2.2e-16 *** ChangeCrime e e < 2.2e-16 *** Envi TranEnvi e e < 2.2e-16 *** e e BathBed e e < 2.2e-16 *** Residuals e e+10 Signif. codes 0 *** ** 0.01 * > confint(model9, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+05 Bed e e+04 Sqft_Area Lot_Area e e e e-02 AgeSquare e e+01 ChangeCrime e e+04 Envi e e+04 TranEnvi e e+04 BathBed e e+04 Correlation > cor(prediction9,hou_test$price, use="complete") [1]

72 RMSE > rmse(prediction9,hou_test$price) [1] Diagnostic Plots Next add School factors 10. Model 10 Test with all school factors and then drop non-significant ones. The good predictors are then added to the above model. Step 1 Test all school factors Independent Variable Equation School+SchoolTSRatio+SchoolDist+SchoolRating (SchoolType is factor and does not yield sensible result in this case so drop it) > summary(model, corr=t) Call lm(formula = Price ~ School + SchoolTSRatio + SchoolDist + SchoolRating, data = hou_train) Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) School e-15 *** SchoolTSRatio e-05 *** SchoolDist SchoolRating *** < 2e-16 *** --- Signif. codes 0 *** ** 0.01 * Residual standard error on 3032 degrees of freedom (2060 observations deleted due to missingness)

73 R-Square Multiple R-squared , Adjusted R-squared F-statistic 166 on 4 and 3032 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) School SchoolTSRatio SchoolDist School SchoolTSRatio SchoolDist SchoolRating As it can be seen above, School is not significant hence will not be used in next model. Step 2 Add significant school factors, Independent Variable Equation Bath+BathBed+Bed+Sqft_Area+Lot_Area+AgeSquare+ChangeCrime+Envi+TranEnvi+SchoolTSRatio+SchoolDist +SchoolRating > summary(model10, corr=t) Call lm(formula = Price ~ Bath + BathBed + Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi + TranEnvi + SchoolTSRatio + SchoolDist + SchoolRating, data = hou_train) Residuals Min 1Q Median Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e < 2e-16 *** Bath 1.223e e < 2e-16 *** Bed 7.198e e < 2e-16 *** Sqft_Area 2.049e e < 2e-16 *** Lot_Area e e * AgeSquare 6.127e e e-05 *** ChangeCrime e e Envi 1.729e e < 2e-16 *** TranEnvi e e ** SchoolTSRatio e e SchoolDist 3.857e e SchoolRating 3.370e e < 2e-16 *** BathBed e e < 2e-16 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area, BedBath, Envi, TranEnvi, SchoolRating, Age 2 Residual standard error on 2997 degrees of freedom (2087 observations deleted due to missingness) R-Square Multiple R-squared 0.543, Adjusted R-squared F-statistic on 12 and 2997 DF, p-value < 2.2e-16 Correlation of Coefficients Bath (Intercept) Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio

74 SchoolDist SchoolRating BathBed Envi TranEnvi SchoolTSRatio SchoolDist SchoolRating Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio SchoolDist SchoolRating BathBed Intercept and Coefficients > model10$coefficients (Intercept) Bath Bed Sqft_Area Lot_Area e e e e e-01 AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio e e e e e+03 SchoolDist SchoolRating BathBed e e e+04 > anova(model10) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e < 2.2e-16 *** Bed e e Sqft_Area Lot_Area e e < 2.2e-16 *** e e e-08 *** AgeSquare e e < 2.2e-16 *** ChangeCrime Envi e e e-13 *** e e < 2.2e-16 *** TranEnvi e e e-12 *** SchoolTSRatio SchoolDist e e e e < 2.2e-16 *** SchoolRating e e < 2.2e-16 *** BathBed Residuals e e < 2.2e-16 *** e e Signif. codes 0 *** ** 0.01 * > confint(model10, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) e e+05 Bath e e+05 Bed Sqft_Area e e e e+02 Lot_Area e e-02 AgeSquare ChangeCrime e e e e+04 Envi e e+04 TranEnvi e e+04 SchoolTSRatio e e+03 SchoolDist e e+04 SchoolRating BathBed e e e e+04 Correlation > cor(prediction10,hou_test$price, use="complete") [1] RMSE > rmse(prediction10,hou_test$price) [1]

Diagnostic Plots

Next, add Zipcode factors.

11. Model 11
Test with all zipcode factors and then drop the non-significant ones. The good predictors are then added to the model above.

Step 1: Test all Zipcode factors
Independent Variable Equation: MedIncome+as.numeric(Rank)+MedAge+Population+College.Graduates

> summary(model, corr=T)
Call: lm(formula = Price ~ MedIncome + as.numeric(Rank) + MedAge + Population + College.Graduates, data = hou_train)
Residuals: Min 1Q Median 3Q Max
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.764e e < 2e-16 ***
MedIncome e e e-07 ***
as.numeric(Rank) e e e-11 ***
MedAge e e
Population e e < 2e-16 ***
College.Graduates e e **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error on 5091 degrees of freedom
R-Square
Multiple R-squared, Adjusted R-squared
F-statistic 421 on 5 and 5091 DF, p-value < 2.2e-16

76 Correlation of Coefficients (Intercept) MedIncome as.numeric(rank) MedAge Population MedIncome as.numeric(rank) MedAge Population College.Graduates Drop Median Age and use the rest. Step 2 Add significant Zipcode factors, Independent Variable Equation Bath+BathBed+Bed+Sqft_Area+Lot_Area+AgeSquare+ChangeCrime+Envi+TranEnvi+SchoolTSRatio+SchoolDist +SchoolRating+MedIncome+as.numeric(Rank)+Population+College.Graduates > summary(model11, corr=t) Call lm(formula = Price ~ Bath + BathBed + Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi + TranEnvi + SchoolTSRatio + SchoolDist + SchoolRating + MedIncome + as.numeric(rank) + Population + College.Graduates, data = hou_train) Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) 1.023e e *** Bath 1.114e e < 2e-16 *** Bed Sqft_Area 6.186e e < 2e-16 *** 1.932e e < 2e-16 *** Lot_Area e e * AgeSquare ChangeCrime 3.778e e e e ** e-09 *** Envi 1.316e e < 2e-16 *** TranEnvi SchoolTSRatio e e * e e SchoolDist 3.200e e SchoolRating MedIncome 9.472e e e-06 *** e e < 2e-16 *** as.numeric(rank) e e e-06 *** Population e e < 2e-16 *** College.Graduates e e BathBed e e < 2e-16 *** --- Signif. codes 0 *** ** 0.01 * Predictive features Bath, Bed, Sqft_Area, Lot_Area, Age 2, Envi, ChangeCrime, TranEnvi, SchoolRating, MedI ncome, Rank, Population, BedBath Residual standard error on 2993 degrees of freedom (2087 observations deleted due to missingness) R-Square Multiple R-squared , Adjusted R-squared F-statistic on 16 and 2993 DF, p-value < 2.2e-16 Correlation of Coefficients (Intercept) Bath Bed Sqft_Area Lot_Area AgeSquare Bath Bed Sqft_Area Lot_Area

77 AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio SchoolDist SchoolRating MedIncome as.numeric(rank) Population College.Graduates BathBed ChangeCrime Envi TranEnvi SchoolTSRatio SchoolDist Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio SchoolDist SchoolRating MedIncome as.numeric(rank) Population College.Graduates BathBed SchoolRating MedIncome as.numeric(rank) Population Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio SchoolDist SchoolRating MedIncome 0.13 as.numeric(rank) Population 0.11 College.Graduates BathBed College.Graduates Bath Bed Sqft_Area Lot_Area AgeSquare ChangeCrime Envi TranEnvi SchoolTSRatio SchoolDist SchoolRating MedIncome as.numeric(rank) Population College.Graduates BathBed 0.02 Intercept and Coefficients > model11$coefficients (Intercept) Bath Bed Sqft_Area e e e e+02 Lot_Area e-01 AgeSquare e+00 ChangeCrime e+04 Envi e+04 TranEnvi SchoolTSRatio SchoolDist SchoolRating

e e e e+03
MedIncome as.numeric(Rank) Population College.Graduates
e e e e+03
BathBed e+04

> anova(model11)
Analysis of Variance Table
Response: Price
Df Sum Sq Mean Sq F value Pr(>F)
Bath e e < 2.2e-16 ***
Bed e e
Sqft_Area e e < 2.2e-16 ***
Lot_Area e e e-09 ***
AgeSquare e e < 2.2e-16 ***
ChangeCrime e e e-15 ***
Envi e e < 2.2e-16 ***
TranEnvi e e e-13 ***
SchoolTSRatio e e
SchoolDist e e < 2.2e-16 ***
SchoolRating e e < 2.2e-16 ***
MedIncome e e **
as.numeric(Rank) e e < 2.2e-16 ***
Population e e < 2.2e-16 ***
College.Graduates e e
BathBed e e < 2.2e-16 ***
Residuals e e
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> confint(model11, level=0.95) # CIs for model parameters
2.5 % 97.5 %
(Intercept) e e+06
Bath e e
Bed e e+04
Sqft_Area e e+02
Lot_Area e e
AgeSquare e e+00
ChangeCrime e e+04
Envi e e
TranEnvi e e+04
SchoolTSRatio e e+03
SchoolDist e e
SchoolRating e e+04
MedIncome e e+00
as.numeric(Rank) e e+04
Population e e+00
College.Graduates e e+01
BathBed e e+04

Correlation
> cor(prediction11, hou_test$Price, use="complete")
[1]

RMSE
> rmse(prediction11, hou_test$Price)
[1]

(Please note: Rank, the education rank of each zipcode, spans a wide range of values. Because this ordinal variable has so many levels, it is treated as a continuous scale.)

Diagnostic Plots

12. Model 12
Create a model from the variables with high correlation to Price (r > 0.2):

# house parameters
cor(hou$Price, hou$Bath, use="complete") #
cor(hou$Price, hou$Bed, use="complete") #
cor(hou$Price, hou$Sqft_Area, use="complete") #
cor(hou$Price, hou$Lot_Area, use="complete") #
cor(hou$Price, hou$RevAgeSquare, use="complete") #
# External Environment
cor(hou$Price, hou$ChangeCrime, use="complete") #
cor(hou$Price, hou$Envi, use="complete") #
cor(hou$Price, hou$TranEnvi, use="complete") #
cor(hou$Price, hou$TransDist, use="complete") #
# Parameters by community (zipcode)
cor(hou$Price, as.numeric(hou$Rank), use="complete") #
cor(hou$Price, hou$Population, use="complete") #
cor(hou$Price, hou$College.Graduates, use="complete") #
cor(hou$Price, hou$MedAge, use="complete") #
# School parameters
cor(hou$Price, hou$School) #
cor(hou$Price, hou$SchoolDist) #
cor(hou$Price, hou$SchoolTSRatio) #
cor(hou$Price, hou$SchoolRating, use="complete") #
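Model 12's feature screen keeps only the variables whose correlation with Price exceeds 0.2 in absolute value. A sketch of that filter, with hypothetical feature columns rather than the report's data:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def screen(features, price, threshold=0.2):
    """Keep the names of features with |r| > threshold against Price."""
    return [name for name, col in features.items()
            if abs(pearson(col, price)) > threshold]
```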

80 Independent Variables Equation Bath+Sqft_Area+ChangeCrime+Envi+TransDist+as.numeric(Rank)+College.Graduates+SchoolRating > summary(model, corr=t) Call lm(formula = Price ~ Bath + Sqft_Area + ChangeCrime + Envi + TransDist + as.numeric(rank) + College.Graduates + SchoolRating, data = hou_train) Residuals Min 1Q Median 3Q Max Coefficients Estimate Std. Error t value Pr(> t ) (Intercept) e e e-06 *** Bath Sqft_Area 3.546e e e e e-07 *** ChangeCrime e e Envi TransDist 1.896e e *** e e as.numeric(rank) 2.125e e e-06 *** College.Graduates 1.010e e e-06 *** SchoolRating e e Signif. codes 0 *** ** 0.01 * Predictive features Sqft_Area, Envi, Rank, College.Graduates Residual standard error on 104 degrees of freedom (4984 observations deleted due to missingness) R-Square Multiple R-squared 0.516, Adjusted R-squared F-statistic on 8 and 104 DF, p-value 1.556e-13 Correlation of Coefficients (Intercept) Bath Sqft_Area ChangeCrime Envi TransDist Bath Sqft_Area ChangeCrime Envi TransDist as.numeric(rank) College.Graduates SchoolRating as.numeric(rank) College.Graduates Bath Sqft_Area ChangeCrime Envi TransDist as.numeric(rank) College.Graduates 1.00 SchoolRating Intercept and Coefficients > model$coefficients (Intercept) Bath Sqft_Area ChangeCrime Envi TransDist as.numeric(rank) College.Graduates SchoolRating > anova(model) Analysis of Variance Table Response Price Df Sum Sq Mean Sq F value Pr(>F) Bath e e e-05 ***

81 Sqft_Area e e e-08 *** ChangeCrime Envi e e *** e e *** TransDist e e as.numeric(rank) College.Graduates e e e e e-06 *** SchoolRating e e Residuals e e+10 Signif. codes 0 *** ** 0.01 * > confint(model, level=0.95) # CIs for model parameters 2.5 % 97.5 % (Intercept) Bath e e Sqft_Area e ChangeCrime Envi e e TransDist e as.numeric(rank) e+05 College.Graduates e SchoolRating e Correlation > cor(prediction,hou_test$price, use="complete") [1] RMSE > rmse(prediction,hou_test$price) [1] Diagnostic Plots

Best Model (Model 11)

Price = (Intercept) + e+05*Bath + e+04*Bed + e+02*Sqft_Area + e-01*Lot_Area + e+00*Age + e+04*ChangeCrime + e+04*Envi + e+04*TranEnvi + e+03*SchoolRating + e+00*MedIncome + e+04*Rank + e+00*Population + e+04*BedBath

c. Regularization (multiple regression models)

Model 1: Price~Bath+Bed+Sqft_Area+Lot_Area

Lasso
> coef(fit)
5 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
Bed
Bath
Sqft_Area
Lot_Area
> fit$lambda.min
[1]
> z <- as.matrix(hou_test[, c('Bed', 'Bath', 'Sqft_Area', 'Lot_Area')])
> prediction <- predict(fit, newx=z)
> t <- as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse = fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
5 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
Bed
Bath
Sqft_Area
Lot_Area
> fit$lambda.min
[1]
> z <- as.matrix(hou_test[, c('Bed', 'Bath', 'Sqft_Area', 'Lot_Area')])
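The glmnet fits in this section select the penalty lambda.min by cross-validation. Ridge regression in particular has a closed form, b = (X'X + λI)⁻¹X'y; a dependency-free sketch of that formula (no intercept, for brevity, on tiny hypothetical data) shows how a larger λ shrinks the coefficients:

```python
def ridge(X, y, lam):
    """Ridge coefficients via the penalized normal equations (no intercept, for brevity)."""
    k = len(X[0])
    # Augmented system [X'X + lam*I | X'y]
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0) for j in range(k)] +
         [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):  # Gaussian elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [0.0] * k
    for c in reversed(range(k)):  # back substitution
        b[c] = (A[c][k] - sum(A[c][j] * b[j] for j in range(c + 1, k))) / A[c][c]
    return b
```

With λ = 0 this reduces to ordinary least squares; as λ grows, the coefficients are pulled toward zero, which is the trade-off cv.glmnet tunes when it picks lambda.min.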

83 > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse) > rmse [1] Model 2 Price~Bath+Bed+BathBed+Sqft_Area+Lot_Area Lasso > coef(fit) 7 x 1 sparse Matrix of class "dgcmatrix" 1 (Intercept) (Intercept). Bath. Bed. Sqft_Area Lot_Area. BathBed. > fit$lambda.min [1] > z<-model.matrix(f, data=hou_test) > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse) > rmse [1]
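The "." entries in the lasso coef(fit) output are coefficients driven exactly to zero, which is how lasso does variable selection. This comes from the soft-threshold operator used in lasso's coordinate-descent update; a minimal sketch:

```python
def soft_threshold(z, lam):
    """Lasso's shrinkage operator: values within [-lam, lam] become exactly zero,
    which is why the sparse coef(fit) matrix prints '.' for dropped variables."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0
```

Ridge, by contrast, only shrinks coefficients toward zero without zeroing them out, which is why its coef(fit) output keeps every variable.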

84 Ridge > coef(fit) 7 x 1 sparse Matrix of class "dgcmatrix" 1 (Intercept) (Intercept). Bath Bed Sqft_Area Lot_Area BathBed > fit$lambda.min [1] > z<-model.matrix(f, data=hou_test) > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse) > rmse [1] Model 3 Price~Bath*Bed+Sqft_Area+Lot_Area+Age

85 Lasso > coef(fit) 8 x 1 sparse Matrix of class "dgcmatrix" 1 (Intercept) (Intercept). Bath Bed Sqft_Area Lot_Area Age BathBed > fit$lambda.min [1] > z<-model.matrix(f, data=hou_test) > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse) > rmse [1] Ridge > coef(fit) 8 x 1 sparse Matrix of class "dgcmatrix" 1 (Intercept) e+05 (Intercept). Bath e+04 Bed e+04 Sqft_Area e+02 Lot_Area e-01 Age e+03 BathBed e+02 > fit$lambda.min [1] > z<-model.matrix(f, data=hou_test) > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse)

86 > rmse [1] Model 4 Price~Bath*Bed+Sqft_Area+Lot_Area+Age+I(Age^2) Lasso > coef(fit) 9 x 1 sparse Matrix of class "dgcmatrix" 1 (Intercept) (Intercept). Bath Bed. Sqft_Area Lot_Area. Age. I(Age^2) BathBed. > fit$lambda.min [1] > z<-model.matrix(f, data=hou_test) > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse) > rmse [1]

87 Ridge > coef(fit) 9 x 1 sparse Matrix of class "dgcmatrix" 1 (Intercept) e+04 (Intercept). Bath e+04 Bed e+03 Sqft_Area e+02 Lot_Area e-01 Age e+02 I(Age^2) e+01 BathBed e+02 > fit$lambda.min [1] > z<-model.matrix(f, data=hou_test) > prediction<-predict(fit, newx=z) > t<-as.vector(hou_test$price) > cor(prediction, t) [,1] > mse=fit$cvm[fit$lambda == fit$lambda.min] > rmse = sqrt(mse) > rmse [1]

Model 5: Price ~ Bath*Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2)

Lasso
> coef(fit)
9 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)   .
Bath
Bed
Sqft_Area
Lot_Area      .
RevAge
I(RevAge^2)
BathBed       .
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
9 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)   e+04
(Intercept)   .
Bath          e+04
Bed           e+04
Sqft_Area     e+02
Lot_Area      e-01
RevAge        e+04
I(RevAge^2)   e+03
BathBed       e+02
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 6: Price ~ Bath*Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2) + Crime

Lasso
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)   .
Bath
Bed           .
Sqft_Area
Lot_Area      .
RevAge
I(RevAge^2)   .
Crime
BathBed       .
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)   e+04
(Intercept)   .
Bath          e+04
Bed           e+03
Sqft_Area     e+02
Lot_Area      e-01
RevAge        e+04
I(RevAge^2)   e+03
Crime         e+01
BathBed       e+02
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 7: Price ~ Bath*Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2) + ChangeCrime + Envi

Lasso
> coef(fit)
11 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)   .
Bath
Bed
Sqft_Area
Lot_Area      .
RevAge
I(RevAge^2)   .
ChangeCrime
Envi
BathBed       .
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
11 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)   e+03
(Intercept)   .
Bath          e+04
Bed           e+03
Sqft_Area     e+02
Lot_Area      e-01
RevAge        e+04
I(RevAge^2)   e+02
ChangeCrime   e+04
Envi          e+04
BathBed       e+02
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 8: Price ~ Bath*Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi

Lasso
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)   .
Bath
Bed           .
Sqft_Area
Lot_Area      .
AgeSquare
ChangeCrime
Envi
BathBed       .
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)   e+04
(Intercept)   .
Bath          e+04
Bed           e+03
Sqft_Area     e+02
Lot_Area      e-01
AgeSquare     e+01
ChangeCrime   e+04
Envi          e+04
BathBed       e+02
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 9: Price ~ Bath + BathBed + Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi + TranEnvi

Lasso
> coef(fit)
11 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)   .
Bath
Bed           .
Sqft_Area
Lot_Area      .
AgeSquare
ChangeCrime
Envi
TranEnvi      .
BathBed       .
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
11 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)   e+04
(Intercept)   .
Bath          e+04
Bed           e+04
Sqft_Area     e+02
Lot_Area      e-01
AgeSquare     e+01
ChangeCrime   e+04
Envi          e+04
TranEnvi      e+02
BathBed       e+02
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 10: Price ~ Bath + BathBed + Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi + TranEnvi + SchoolTSRatio + SchoolDist + SchoolRating

Lasso
> coef(fit)
14 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)     e+05
(Intercept)     .
Bath            .
Bed             e+03
Sqft_Area       e+01
Lot_Area        e-01
AgeSquare       .
ChangeCrime     e+04
Envi            e+03
TranEnvi        .
SchoolTSRatio   .
SchoolDist      .
SchoolRating    .
BathBed         .
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
14 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)     e+05
(Intercept)     .
Bath            e+03
Bed             e+03
Sqft_Area       e+01
Lot_Area        e+00
AgeSquare       e-01
ChangeCrime     e+04
Envi            e+03
TranEnvi        .
SchoolTSRatio   e+03
SchoolDist      e+04
SchoolRating    e+03
BathBed         e+01
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 11: Price ~ Bath + BathBed + Bed + Sqft_Area + Lot_Area + AgeSquare + ChangeCrime + Envi + TranEnvi + SchoolTSRatio + SchoolDist + SchoolRating + MedIncome + as.numeric(Rank) + Population + College.Graduates

Lasso
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)         .
Bath                .
Sqft_Area
ChangeCrime
Envi
TransDist           .
as.numeric(Rank)    .
College.Graduates
SchoolRating        .
> fit$lambda.min
[1]
> # Predict
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> # correlation
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> # RMSE
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)         .
Bath
Sqft_Area
ChangeCrime
Envi
TransDist
as.numeric(Rank)
College.Graduates
SchoolRating
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Model 12: Price ~ Bath + Sqft_Area + ChangeCrime + Envi + TransDist + as.numeric(Rank) + College.Graduates + SchoolRating

Lasso
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)         .
Bath                .
Sqft_Area
ChangeCrime
Envi
TransDist           .
as.numeric(Rank)    .
College.Graduates
SchoolRating        .
> fit$lambda.min
[1]
> # Predict
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> # correlation
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> # RMSE
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]

Ridge
> coef(fit)
10 x 1 sparse Matrix of class "dgCMatrix"
(Intercept)
(Intercept)         .
Bath
Sqft_Area
ChangeCrime
Envi
TransDist
as.numeric(Rank)
College.Graduates
SchoolRating
> fit$lambda.min
[1]
> z<-model.matrix(f, data=hou_test)
> prediction<-predict(fit, newx=z)
> t<-as.vector(hou_test$Price)
> cor(prediction, t)
[,1]
> mse=fit$cvm[fit$lambda == fit$lambda.min]
> rmse = sqrt(mse)
> rmse
[1]
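Every Lasso and Ridge run above follows the same glmnet pattern. The following is a minimal sketch of that workflow, assuming the glmnet package is installed and that hou_train and hou_test contain the columns named in the formula; the formula shown is just one of the model formulas above, and any other can be substituted for it.

```r
library(glmnet)

# One of the model formulas above; substitute any other as needed.
f <- Price ~ Bath * Bed + Sqft_Area + Lot_Area + RevAge + I(RevAge^2)

# Fix the seed so the cross-validation folds are reproducible; the report
# repeats this with four different seeds and averages the results.
set.seed(1)

x <- model.matrix(f, data = hou_train)
y <- hou_train$Price

# alpha = 1 gives the Lasso; alpha = 0 gives Ridge.
fit <- cv.glmnet(x, y, alpha = 1)

coef(fit, s = "lambda.min")                  # coefficients at the best lambda
z <- model.matrix(f, data = hou_test)
prediction <- predict(fit, newx = z, s = "lambda.min")

cor(prediction, hou_test$Price)              # test-set correlation
sqrt(fit$cvm[fit$lambda == fit$lambda.min])  # cross-validated RMSE
```

The RMSE here is computed from the cross-validation MSE at lambda.min, which is the same quantity reported after each model above.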

Regularization did not improve the models. It lowers the correlation considerably, although the RMSE value decreases at the same time, which is a good sign. We looked mainly at the Lasso results, since Ridge usually ends up keeping many variables, which does not lead to a good model.

d. Run multiple times
We ran the above models 4 times and then took the average of the following metrics. (All values are rounded to 4 decimal places, as we entered them manually for each run.) For four random seeds, these were the average values.

Linear Regression
Model 1    Adjusted R-Squared          Correlation          RMSE
Model 2    Adjusted R-Squared          Correlation          RMSE
Model 3    Adjusted R-Squared          Correlation          RMSE
Model 4    Adjusted R-Squared          Correlation          RMSE
Model 5    Adjusted R-Squared          Correlation          RMSE
Model 6    Adjusted R-Squared          Correlation          RMSE
Model 7    Adjusted R-Squared          Correlation          RMSE
Model 8    Adjusted R-Squared 0        Correlation          RMSE
Model 9    Adjusted R-Squared          Correlation          RMSE
Model 10   Adjusted R-Squared 0.02     Correlation          RMSE
Model 11   Adjusted R-Squared          Correlation          RMSE
Model 12   Adjusted R-Squared          Correlation          RMSE
Model 13   Adjusted R-Squared 0.02     Correlation          RMSE
Model 14   Adjusted R-Squared          Correlation          RMSE
Model 15   Adjusted R-Squared          Correlation          RMSE
Model 16   Adjusted R-Squared 0.       Correlation          RMSE
Model 17   Adjusted R-Squared          Correlation          RMSE

Multivariate Regression
Model 1    Adjusted R-Squared          Correlation          RMSE
Model 2    Adjusted R-Squared          Correlation          RMSE
Model 3    Adjusted R-Squared 0.551    Correlation          RMSE
Model 4    Adjusted R-Squared          Correlation          RMSE
Model 5    Adjusted R-Squared          Correlation          RMSE
Model 6    Adjusted R-Squared          Correlation          RMSE
Model 7    Adjusted R-Squared          Correlation          RMSE
Model 8    Adjusted R-Squared          Correlation          RMSE
Model 9    Adjusted R-Squared          Correlation          RMSE
Model 10   Adjusted R-Squared          Correlation          RMSE
Model 11   Adjusted R-Squared          Correlation          RMSE
Model 11   Adjusted R-Squared          Correlation          RMSE

Regularization Models
Model 1 - Lasso     Correlation          RMSE
Model 1 - Ridge     Correlation          RMSE
Model 2 - Lasso     Correlation          RMSE
Model 2 - Ridge     Correlation          RMSE
Model 3 - Lasso     Correlation          RMSE
Model 3 - Ridge     Correlation          RMSE
Model 4 - Lasso     Correlation          RMSE
Model 4 - Ridge     Correlation          RMSE
Model 5 - Lasso     Correlation          RMSE
Model 5 - Ridge     Correlation          RMSE
Model 6 - Lasso     Correlation          RMSE
Model 6 - Ridge     Correlation          RMSE
Model 7 - Lasso     Correlation          RMSE
Model 7 - Ridge     Correlation          RMSE
Model 8 - Lasso     Correlation          RMSE
Model 8 - Ridge     Correlation          RMSE
Model 9 - Lasso     Correlation          RMSE
Model 9 - Ridge     Correlation          RMSE
Model 10 - Lasso    Correlation 0.33     RMSE
Model 10 - Ridge    Correlation          RMSE
Model 11 - Lasso    Correlation          RMSE
Model 11 - Ridge    Correlation          RMSE
Model 12 - Lasso    Correlation          RMSE
Model 12 - Ridge    Correlation          RMSE

Q.2.
a. Logistic Regression
We have taken a different approach for logistic regression here. The glm() function we studied in class is for binary outcomes; since we are trying to predict house prices, dividing them into two classes is not a good approach, so we decided to use Ordered Logistic Regression. The scale attribute Price is therefore converted to an ordinal variable, Pricetag, with predefined intervals (please refer to the code for more details). The polr() function from the MASS package is used. (Please note: since we have 10 categories, the confusion matrix was too big to show here, but the code produces it.)

Model 1: Pricetag ~ Bath + Bed + Sqft_Area + scale(Lot_Area)

> summary(m)
Call:
polr(formula = Pricetag ~ Bath + Bed + Sqft_Area + scale(Lot_Area),
    data = hou_train, Hess = TRUE)
Coefficients:
                 Value  Std. Error  t value
Bath
Bed
Sqft_Area
scale(Lot_Area)

Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(66 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                 p value
Bath
Bed              e-06
Sqft_Area        e-227
scale(Lot_Area)

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, all of the above variables are significant.

Log-odds and odds ratios for a unit change:
> # odds ratios
> exp(coef(m))
> # odds ratios and CIs
> exp(cbind(OR = coef(m), ci))

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction: the crosstab of predicted values was too big to display here, since we have 10 categories of Price. We ran the following code for it:

sf <- function(y) {
  c('Y>=1'  = qlogis(mean(y >= 1)),
    'Y>=2'  = qlogis(mean(y >= 2)),
    'Y>=3'  = qlogis(mean(y >= 3)),
    'Y>=4'  = qlogis(mean(y >= 4)),
    'Y>=5'  = qlogis(mean(y >= 5)),
    'Y>=6'  = qlogis(mean(y >= 6)),
    'Y>=7'  = qlogis(mean(y >= 7)),
    'Y>=8'  = qlogis(mean(y >= 8)),
    'Y>=9'  = qlogis(mean(y >= 9)),
    'Y>=10' = qlogis(mean(y >= 10)))
}
(s <- with(hou_test, summary(as.numeric(Pricetag) ~ Bath + Bed + Sqft_Area + Lot_Area, fun = sf)))

> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Model 2: Pricetag ~ Bath + Bed + BathBed + Sqft_Area + scale(Lot_Area)

> summary(m)
Call:
polr(formula = Pricetag ~ Bath + Bed + BathBed + Sqft_Area + scale(Lot_Area),
    data = hou_train, Hess = TRUE)
Coefficients:
                 Value  Std. Error  t value
Bath
Bed
Sqft_Area
scale(Lot_Area)
BathBed
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(66 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                 p value
Bath
Bed              e-16
Sqft_Area        e-235
scale(Lot_Area)  e-22
BathBed

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, all of the above variables are significant.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Model 3: Pricetag ~ Bath*Bed + Sqft_Area + scale(Lot_Area) + Age

> summary(m)
Call:
polr(formula = Pricetag ~ Bath * Bed + Sqft_Area + scale(Lot_Area) + Age,
    data = hou_train, Hess = TRUE)
Coefficients:
                 Value  Std. Error  t value
Bath
Bed
Sqft_Area
scale(Lot_Area)
Age
BathBed
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(69 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                 p value
Bath             e-78
Bed
Sqft_Area        e-169
scale(Lot_Area)  e-20
Age
BathBed

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, all of the above variables are significant.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Model 4: Pricetag ~ Bath*Bed + Sqft_Area + scale(Lot_Area) + scale(Age) + scale(AgeSquare)

> summary(m)
Call:
polr(formula = Pricetag ~ Bath * Bed + Sqft_Area + scale(Lot_Area) +
    scale(Age) + scale(AgeSquare), data = hou_train, Hess = TRUE)
Coefficients:
                   Value  Std. Error  t value
Bath
Bed
Sqft_Area
scale(Lot_Area)
scale(Age)
scale(AgeSquare)
BathBed
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(59 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                   p value
Bath               e-51
Bed
Sqft_Area          e-194
scale(Lot_Area)    e-19
scale(Age)         e-189
scale(AgeSquare)   e+00
BathBed

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
                   2.5 %  97.5 %
Bath
Bed
Sqft_Area
scale(Lot_Area)
scale(Age)            NA      NA
scale(AgeSquare)              NA
BathBed
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, all of the above variables are significant.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Model 5: Pricetag ~ Bath*Bed + Sqft_Area + scale(Lot_Area) + scale(RevAge) + scale(I(RevAge^2))

> summary(m)
Call:
polr(formula = Pricetag ~ Bath * Bed + Sqft_Area + scale(Lot_Area) +
    scale(RevAge) + scale(I(RevAge^2)), data = hou_train, Hess = TRUE)
Coefficients:
                     Value  Std. Error  t value
Bath
Bed
Sqft_Area
scale(Lot_Area)
scale(RevAge)
scale(I(RevAge^2))
BathBed
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(59 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                     p value
Bath                 e-49
Bed
Sqft_Area            e-184
scale(Lot_Area)      e-19
scale(RevAge)        e-79
scale(I(RevAge^2))   e-21
BathBed

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, all of the above variables are significant.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Changing the Age variable made no change at all.

Model 6: Pricetag ~ BathBed + Bed + scale(Sqft_Area) + scale(Lot_Area) + scale(I(Age^2)) + Crime
(We also scaled Sqft_Area in this model.)

> summary(m)
Call:
polr(formula = Pricetag ~ BathBed + Bed + scale(Sqft_Area) + scale(Lot_Area) +
    scale(I(Age^2)) + Crime, data = hou_train, Hess = TRUE)
Coefficients:
                   Value  Std. Error  t value
Bed
scale(Sqft_Area)
scale(Lot_Area)
scale(I(Age^2))
Crime
BathBed
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(59 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                   p value
Bed                e-05
scale(Sqft_Area)   e+00
scale(Lot_Area)    e-20
scale(I(Age^2))    e-27
Crime              e-41
BathBed

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, the BathBed interaction needs to be dropped.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by 6.96, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Model 7: Pricetag ~ Bath + BathBed + Bed + Sqft_Area + scale(Lot_Area) + AgeSquare + ChangeCrime + Envi + TranEnvi

> summary(m)
Call:
polr(formula = Pricetag ~ Bath + BathBed + Bed + Sqft_Area + scale(Lot_Area) +
    AgeSquare + ChangeCrime + Envi + TranEnvi, data = hou_train, Hess = TRUE)
Coefficients:
                 Value    Std. Error  t value
Bath             6.237e
Bed
Sqft_Area        1.003e
scale(Lot_Area)
AgeSquare
ChangeCrime      8.389e
Envi             1.089e
TranEnvi
BathBed
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(59 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                 p value
Bath
Bed              e-05
Sqft_Area        e-169
scale(Lot_Area)  e-02
AgeSquare        e-04
ChangeCrime      e-34
Envi
TranEnvi
BathBed

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
                 2.5 %  97.5 %
Bath                    e-01
Bed                     e-01
Sqft_Area               e-03
scale(Lot_Area)         e-01
AgeSquare               e-05
ChangeCrime
Envi                    e-01
TranEnvi            NA      NA
BathBed                 e-02
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, all variables are significant.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))
                 OR  2.5 %  97.5 %
Bath
Bed
Sqft_Area
scale(Lot_Area)
AgeSquare
ChangeCrime
Envi
TranEnvi                NA      NA
BathBed

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Model 8 (model of the parameters highly correlated with Price):
Pricetag ~ Bath + Bed + Sqft_Area + ChangeCrime + Envi + TransDist + as.numeric(Rank) + scale(College.Graduates) + scale(SchoolRating)

> summary(m)
Call:
polr(formula = Pricetag ~ Bath + Bed + Sqft_Area + ChangeCrime + Envi +
    TransDist + as.numeric(Rank) + College.Graduates + scale(SchoolRating),
    data = hou_train, Hess = TRUE)
Coefficients:
                     Value  Std. Error  t value
Bath
Bed
Sqft_Area
ChangeCrime
Envi
TransDist
as.numeric(Rank)
College.Graduates
scale(SchoolRating)
Intercepts:
       Value  Std. Error  t value
Residual Deviance:        AIC:
(4994 observations deleted due to missingness)

Table form of the above:
> (ctable <- coef(summary(m)))

P-values added to the above for the significance test:
> (ctable <- cbind(ctable, "p value" = p))
                     p value
Bath
Bed                  e-01
Sqft_Area            e-03
ChangeCrime
Envi                 e-11
TransDist            e+00
as.numeric(Rank)
College.Graduates    e+00
scale(SchoolRating)

Please note: this method does not provide automatic statistical significance tests. Significance will be judged from the confidence intervals.
> (ci <- confint(m))    # default method gives profiled CIs
Waiting for profiling to be done...
                     2.5 %  97.5 %
Bath
Bed
Sqft_Area
ChangeCrime             NA      NA
Envi
TransDist               NA      NA
as.numeric(Rank)        NA      NA
College.Graduates       NA      NA
scale(SchoolRating)
> confint.default(m)    # CIs assuming normality

If the 95% CI does not cross 0, the parameter estimate is statistically significant. Thus, Sqft_Area, ChangeCrime, Envi, TransDist, Rank and College.Graduates are significant.

Log-odds and odds ratios for a unit change:
> exp(coef(m))
> exp(cbind(OR = coef(m), ci))
                     OR  2.5 %  97.5 %
Bath
Bed
Sqft_Area
ChangeCrime             NA      NA
Envi
TransDist               NA      NA
as.numeric(Rank)        NA      NA
College.Graduates       NA      NA
scale(SchoolRating)

Example interpretation: for every 1-unit increase in Sqft_Area, the odds of moving from price category 1 to 2 or beyond (from a lower category to a higher one) increase by the factor shown above, holding all other variables in the model constant.

Prediction:
> cor(as.numeric(prediction), as.numeric(hou_test$Pricetag), use="complete")
[1]

Conclusion: the logistic regression models mark almost all variables as significant. However, this approach is not right for our dataset. First, we are aiming to predict the price of a house, which is a scale variable; to perform logistic regression we divided it into categories. Despite the conversion to an ordinal variable, the range of each category is still very large, and as a result nearly every attribute appears significant. Although the correlation is good for some models, this method is not suitable for our research question.
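The polr workflow repeated for each model above can be sketched end to end. This is a sketch only, assuming the MASS package; the 10-interval binning of Price into Pricetag shown with cut() is illustrative, since the report's exact breakpoints are defined in the project code, not reproduced here. The p-value computation from the t statistics is the standard approach for polr, which (as noted above) does not report p-values itself.

```r
library(MASS)

# Illustrative binning only: the report uses predefined intervals.
hou_train$Pricetag <- cut(hou_train$Price, breaks = 10, ordered_result = TRUE)

m <- polr(Pricetag ~ Bath + Bed + Sqft_Area + scale(Lot_Area),
          data = hou_train, Hess = TRUE)

ctable <- coef(summary(m))
# polr reports no p-values; derive them from the t statistics:
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p))

ci <- confint(m)               # profiled confidence intervals
confint.default(m)             # CIs assuming normality
exp(cbind(OR = coef(m), ci))   # odds ratios with their CIs
```

An odds ratio for Sqft_Area of, say, r means the odds of landing in a higher price category are multiplied by r for each additional square foot, holding the other predictors constant, which is exactly the interpretation spelled out after each model above.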

b. Naïve Bayes

1. Model 1: classifier with Bath, Bed, Sqft_Area, Lot_Area, Age, Postal
   Precision =
2. Model 2: classifier with Bath, Bed, Sqft_Area, Lot_Area, Age, WaterEnvi, TransEnvi
   Precision =
3. Model 3: classifier with Bath, Bed, Sqft_Area, Lot_Area, Age, WaterEnvi, TransEnvi, School, SchoolRating, SchoolType
   Precision =
4. Model 4: classifier with Bath, Bed, Sqft_Area, Lot_Area, Age, WaterEnvi, TransEnvi, School, SchoolRating, SchoolType, Population, EducationLevel
   Precision =
5. Model 5: Model 4 with the Laplace estimator
   Precision =
6. Model 6: Model 4 with a kernel
   Precision =

Plot of Model 6 (figure not reproduced here)

7. Model 7: Model 5 with a kernel
   Precision =

N-fold validation for Model 5:
accuracies
mean(accuracies) =

Summary
1. The Laplace estimator changes nothing in the model.
2. Applying a kernel makes the prediction results better.
3. Among the Naïve Bayes models, the one using the inner property variables and Zipcode performs best.
4. From the plot we can see that Bath and the area of the house are the relatively better variables for defining the price class.

Classification (NB) models compared: model 1 (inner property and Zipcode), model 2 (add Envi), model 3 (add School), model 4 (all variables), model 5 (add Laplace), model 6 (add kernel), model 7 (kernel and Laplace).
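A Naïve Bayes classifier with both the Laplace estimator and a kernel density option, as used in Models 5 through 7, can be sketched as follows. The klaR package is an assumption here (its NaiveBayes() exposes both fL for Laplace smoothing and usekernel for kernel density estimates); the report does not name the package it used, and e1071's naiveBayes() would be an alternative for the Laplace-only variants.

```r
library(klaR)   # assumed package: NaiveBayes() with fL and usekernel options

nb <- NaiveBayes(Pricetag ~ Bath + Bed + Sqft_Area + Lot_Area + Age + Postal,
                 data = hou_train,
                 fL = 1,            # Laplace estimator (Model 5 / Model 7)
                 usekernel = TRUE)  # kernel densities instead of Gaussian (Models 6 and 7)

pred <- predict(nb, hou_test)
# Confusion matrix; per-class precision is read from its columns.
table(predicted = pred$class, actual = hou_test$Pricetag)
plot(nb)   # class-conditional density plots, as shown for Model 6 above
```

For the N-fold validation of Model 5, the same fit and predict steps would be repeated over fold indices, collecting one accuracy per fold and averaging them with mean(accuracies).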

Q.3.
a. Result of splitting training and testing data
> prop.table(table(hou_train$Pricetag))    # training
> prop.table(table(hou_test$Pricetag))     # testing

b. Classify model: Decision Tree
# Bath, Bed, Postal, Sqft_Area, Lot_Area, Envi, Age, Crime
hou_model<-C5.0(Pricetag~Bath+Bed+Postal+Sqft_Area+Lot_Area+Envi+Age+Crime, data=hou_train, na.action=na.pass)

C5.0 [Release 2.07 GPL Edition]    Mon Oct
Class specified by attribute `outcome'
Read 5307 cases (9 attributes) from undefined.data

Decision tree:
Postal in {98059,98101,98102,98103,98105,98107,98109,98112,98115,98117,98119,98122,98136,98144,98146,98164,98177,98199}
...Sqft_Area > (1553/157.5) Sqft_Area <= Postal in {98059,98164} 5 (0) Postal in {98136,98144,98146,98177}...Sqft_Area <= Crime > Age <= (9/1) Age > Lot_Area <= (2/1) Lot_Area > Sqft_Area <= (3/1) Sqft_Area > (2) Crime <= Bed > 4 2 (17/6) Bed <= 4...Postal = Crime <= (8/1) Crime > (5/2) Postal = Age <= 67...Lot_Area <= (2) Lot_Area > (2) Age > 67...Age <= 74 3 (5/1) Age > 74 2 (2) Postal = Crime <= (3/1) Crime > Lot_Area > (3/1) Lot_Area <= Lot_Area <= (3/1)

130 Lot_Area > (10/2) Postal = Envi > 4 4 (2/1) Envi <= 4...Crime <= (19/8) Crime > Age <= 69 3 (3/1) Age > 69...Lot_Area <= (4/1) Lot_Area > (2) Sqft_Area > Bed > 6...Bath <= (2/1) Bath > (5) Bed <= 6...Postal = (25/7) Postal in {98136,98144,98177}...Crime <= (5/1) Crime > Bed > 5...Age > (3) Age <= Crime <= (16/7) Crime > Crime <= Lot_Area <= (4/1) Lot_Area > (2) Crime > Envi <= 3 2 (2/1) Envi > 3 4 (7/2) Bed <= 5...Sqft_Area > Age > 99 3 (9/1) Age <= 99...Postal = Envi <= 3 3 (4) Envi > 3 5 (19/11) Postal = Age <= 70 4 (13/5) Age > 70 3 (4/1) Postal = Envi > 10 5 (3) Envi <= 10...Crime > (8/1) Crime <= Envi <= 6 3 (4/1) Sqft_Area <= 1560 Envi > 6 4 (5/1)...Postal = Bath <= (17/5) Bath > (2/1) Postal = Age <= 15 3 (9) Age > 15...Bath > (6) Bath <= Bed <= 4...Bath > (3) Bath <= Envi <= 8 3 (2) Envi > 8 4 (3/1) Bed > 4...Bath > (2) Bath <= Sqft_Area <= (2) Sqft_Area > (6/2) Postal = Lot_Area > Envi <= 3 2 (2/1) Envi > 3 5 (3) Lot_Area <= Bed <= 4

131 ...Bath > (7/1) Bath <= Age <= 82 4 (2) Age > 82...Sqft_Area <= (2) Sqft_Area > (2) Bed > 4...Lot_Area > (6/1) Lot_Area <= Sqft_Area <= Lot_Area <= (7) Lot_Area > 4019 [S1] Sqft_Area > Envi <= 3 4 (3) Envi > 3 [S2] Postal in {98101,98102,98103,98105,98107,98109,98112,98115,98117,98119, 98122,98199}...Sqft_Area > (535.5/179.2) Sqft_Area <= Sqft_Area <= Postal in {98101,98117,98119} 3 (13/5) Postal in {98102,98109} 2 (5/2) Postal in {98105,98107,98112,98199} 5 (14/3) Postal = Lot_Area <= (2) Lot_Area > (5) Postal = Envi <= 7 3 (4/1) Envi > 7 1 (2/1) Postal = Envi <= 7 3 (5) Envi > 7...Envi <= 8 2 (2) Envi > 8 1 (2) Sqft_Area > Postal in {98101,98102,98105,98109,98112}...Sqft_Area <= Age <= 45...Crime <= (6/1) Crime > (6/1) Age > 45...Bath > (6.1/2) Bath <= Age <= 90 5 (23/5) Age > 90 4 (17/8) Sqft_Area > Bed <= 4...Envi > 6 5 (20) Envi <= 6...Sqft_Area <= (3) Sqft_Area > (2) Bed > 4...Bath > (17.1/1) Bath <= Lot_Area <= (5/1) Lot_Area > (6/1) Postal in {98103,98107,98115,98117,98119,98122,98199}...Bed > 5 5 (37/17) Bed <= 5...Bed <= 2...Bath <= (14.2/4.2) Bath > Postal in {98103,98107,98115,98117,98119, 98199} 3 (2) Postal = (2) Bed > 2...Bath > Crime > (9/1) Crime <= Age > 8...Envi <= 3 3 (4/1) Envi > 3 4 (3/1)

Decision Tree Output

The full C4.5 decision tree spans several pages and is heavily garbled in this transcription, so only its overall shape is summarized here. The tree splits first on Postal, and within each postal-code group it splits further on Sqft_Area, Lot_Area, Crime, Envi, Age, Bed and Bath. Each leaf assigns a house to one of the five price classes (1-5), with the number of training cases reaching that leaf (and the number misclassified) shown in parentheses; deeper branches are factored out into subtrees labelled [S1]-[S21].

Evaluation on training data (5307 cases): the decision tree misclassifies 20.7% of the cases (the tree-size figure is lost in this transcription).
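For readers unfamiliar with how C4.5-style trees choose splits such as `Age <= 1` or `Crime <= 803`, the sketch below shows the information-gain criterion that drives each split. This is a purely illustrative Python sketch on made-up data, not the report's actual model (which was produced by a C4.5 implementation in R):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, split):
    """Entropy reduction from partitioning rows by split(row) -> True/False."""
    left = [lab for row, lab in zip(rows, labels) if split(row)]
    right = [lab for row, lab in zip(rows, labels) if not split(row)]
    n = len(labels)
    remainder = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - remainder

# Toy example (hypothetical data, not the report's): splitting on Age <= 1
# cleanly separates price class 3 from price class 4.
rows = [{"Age": 0.5}, {"Age": 0.8}, {"Age": 3}, {"Age": 10}]
labels = [3, 3, 4, 4]
gain = information_gain(rows, labels, lambda r: r["Age"] <= 1)
print(round(gain, 3))  # → 1.0 (a perfect split recovers the full entropy)
```

At each node, C4.5 evaluates candidate thresholds for every attribute and keeps the split with the best gain (C4.5 actually normalizes this to gain ratio), which is why heavily used attributes like Postal and Sqft_Area dominate the top of the tree.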

Interpretation of some of the resulting if-then rules

To illustrate, we explain one branch of the tree, covering houses whose Postal code is 98106, 98108, 98118 or 98178. If Age is greater than 1, the tree continues in another branch. If Age is at most 1 and the Crime value is less than 803, the price belongs to the 3rd class. If Age is at most 1 and Crime is greater than 803, the tree then checks the area of the house: if Sqft_Area is less than 1980, the price belongs to the 4th class; if it is between 1980 and 2395, the price belongs to the 5th class; and if it is greater than 2395, the price belongs to the 4th class again.

Comparison of training and testing results

We compare the training and testing results using the confusion matrix and the percentage of correctly classified instances. [The confusion matrices for the training dataset and for the R output are garbled in this transcription; only the row/column labels (a)-(e), corresponding to price classes 1-5, survive.]

Attribute usage: Postal 99.96%, Sqft_Area 46.15%, Crime 45.92%, Bed 41.06%, Lot_Area 40.04%, Envi 35.80%, Age 33.86%, Bath (value lost in transcription). Time: 0.1 secs.
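The interpreted branch can be written out as a small rule function. This is an illustrative Python sketch, not the report's R code: the postal-code set and thresholds are those quoted in the interpretation, boundary handling at the exact thresholds is approximate (C4.5 uses <=/> splits), and the function covers only this one branch of the full tree:

```python
def price_class_for_branch(postal, age, crime, sqft_area):
    """Price class (1-5) for the single branch interpreted above.

    Returns None for cases handled by other branches of the full tree.
    Thresholds (803, 1980, 2395) are as quoted in the interpretation.
    """
    if postal not in {98106, 98108, 98118, 98178}:
        return None  # handled by a different top-level branch
    if age > 1:
        return None  # handled by another branch of this subtree
    if crime < 803:
        return 3
    # Crime >= 803: decide on the house's square footage.
    if sqft_area < 1980:
        return 4
    if sqft_area <= 2395:
        return 5
    return 4

print(price_class_for_branch(98108, 1, 500, 1500))  # → 3
```

Tracing a few more cases: a one-year-old house in 98106 with Crime 900 and 1500 sq ft lands in class 4, while the same house at 2000 sq ft lands in class 5, matching the branch description above.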


More information

The Loans_processed.csv file is the dataset we obtained after the pre-processing part where the clean-up python code was used.

The Loans_processed.csv file is the dataset we obtained after the pre-processing part where the clean-up python code was used. Machine Learning Group Homework 3 MSc Business Analytics Team 9 Alexander Romanenko, Artemis Tomadaki, Justin Leiendecker, Zijun Wei, Reza Brianca Widodo The Loans_processed.csv file is the dataset we

More information

Building and Checking Survival Models

Building and Checking Survival Models Building and Checking Survival Models David M. Rocke May 23, 2017 David M. Rocke Building and Checking Survival Models May 23, 2017 1 / 53 hodg Lymphoma Data Set from KMsurv This data set consists of information

More information

is the bandwidth and controls the level of smoothing of the estimator, n is the sample size and

is the bandwidth and controls the level of smoothing of the estimator, n is the sample size and Paper PH100 Relationship between Total charges and Reimbursements in Outpatient Visits Using SAS GLIMMIX Chakib Battioui, University of Louisville, Louisville, KY ABSTRACT The purpose of this paper is

More information

STATISTICAL DISTRIBUTIONS AND THE CALCULATOR

STATISTICAL DISTRIBUTIONS AND THE CALCULATOR STATISTICAL DISTRIBUTIONS AND THE CALCULATOR 1. Basic data sets a. Measures of Center - Mean ( ): average of all values. Characteristic: non-resistant is affected by skew and outliers. - Median: Either

More information

Influence of Personal Factors on Health Insurance Purchase Decision

Influence of Personal Factors on Health Insurance Purchase Decision Influence of Personal Factors on Health Insurance Purchase Decision INFLUENCE OF PERSONAL FACTORS ON HEALTH INSURANCE PURCHASE DECISION The decision in health insurance purchase include decisions about

More information

Subject CS2A Risk Modelling and Survival Analysis Core Principles

Subject CS2A Risk Modelling and Survival Analysis Core Principles ` Subject CS2A Risk Modelling and Survival Analysis Core Principles Syllabus for the 2019 exams 1 June 2018 Copyright in this Core Reading is the property of the Institute and Faculty of Actuaries who

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 15: Tree-based Algorithms Cho-Jui Hsieh UC Davis March 7, 2018 Outline Decision Tree Random Forest Gradient Boosted Decision Tree (GBDT) Decision Tree Each node checks

More information

Final Exam - section 1. Thursday, December hours, 30 minutes

Final Exam - section 1. Thursday, December hours, 30 minutes Econometrics, ECON312 San Francisco State University Michael Bar Fall 2013 Final Exam - section 1 Thursday, December 19 1 hours, 30 minutes Name: Instructions 1. This is closed book, closed notes exam.

More information

When determining but for sales in a commercial damages case,

When determining but for sales in a commercial damages case, JULY/AUGUST 2010 L I T I G A T I O N S U P P O R T Choosing a Sales Forecasting Model: A Trial and Error Process By Mark G. Filler, CPA/ABV, CBA, AM, CVA When determining but for sales in a commercial

More information

1 Estimating risk factors for IBM - using data 95-06

1 Estimating risk factors for IBM - using data 95-06 1 Estimating risk factors for IBM - using data 95-06 Basic estimation of asset pricing models, using IBM returns data Market model r IBM = a + br m + ɛ CAPM Fama French 1.1 Using octave/matlab er IBM =

More information

The Norwegian State Equity Ownership

The Norwegian State Equity Ownership The Norwegian State Equity Ownership B A Ødegaard 15 November 2018 Contents 1 Introduction 1 2 Doing a performance analysis 1 2.1 Using R....................................................................

More information

Econometrics is. The estimation of relationships suggested by economic theory

Econometrics is. The estimation of relationships suggested by economic theory Econometrics is Econometrics is The estimation of relationships suggested by economic theory Econometrics is The estimation of relationships suggested by economic theory The application of mathematical

More information

The study on the financial leverage effect of GD Power Corp. based on. financing structure

The study on the financial leverage effect of GD Power Corp. based on. financing structure 5th International Conference on Education, Management, Information and Medicine (EMIM 2015) The study on the financial leverage effect of GD Power Corp. based on financing structure Xin Ling Du 1, a and

More information

Predicting stock prices for large-cap technology companies

Predicting stock prices for large-cap technology companies Predicting stock prices for large-cap technology companies 15 th December 2017 Ang Li (al171@stanford.edu) Abstract The goal of the project is to predict price changes in the future for a given stock.

More information

Graduate School of Business, University of Chicago Business 41202, Spring Quarter 2007, Mr. Ruey S. Tsay. Midterm

Graduate School of Business, University of Chicago Business 41202, Spring Quarter 2007, Mr. Ruey S. Tsay. Midterm Graduate School of Business, University of Chicago Business 41202, Spring Quarter 2007, Mr. Ruey S. Tsay Midterm GSB Honor Code: I pledge my honor that I have not violated the Honor Code during this examination.

More information

Prediction of Stock Price Movements Using Options Data

Prediction of Stock Price Movements Using Options Data Prediction of Stock Price Movements Using Options Data Charmaine Chia cchia@stanford.edu Abstract This study investigates the relationship between time series data of a daily stock returns and features

More information

> attach(grocery) > boxplot(sales~discount, ylab="sales",xlab="discount")

> attach(grocery) > boxplot(sales~discount, ylab=sales,xlab=discount) Example of More than 2 Categories, and Analysis of Covariance Example > attach(grocery) > boxplot(sales~discount, ylab="sales",xlab="discount") Sales 160 200 240 > tapply(sales,discount,mean) 10.00% 15.00%

More information

Loan Approval and Quality Prediction in the Lending Club Marketplace

Loan Approval and Quality Prediction in the Lending Club Marketplace Loan Approval and Quality Prediction in the Lending Club Marketplace Final Write-up Yondon Fu, Matt Marcus and Shuo Zheng Introduction Lending Club is a peer-to-peer lending marketplace where individual

More information

GGraph. Males Only. Premium. Experience. GGraph. Gender. 1 0: R 2 Linear = : R 2 Linear = Page 1

GGraph. Males Only. Premium. Experience. GGraph. Gender. 1 0: R 2 Linear = : R 2 Linear = Page 1 GGraph 9 Gender : R Linear =.43 : R Linear =.769 8 7 6 5 4 3 5 5 Males Only GGraph Page R Linear =.43 R Loess 9 8 7 6 5 4 5 5 Explore Case Processing Summary Cases Valid Missing Total N Percent N Percent

More information

Monetary Economics Risk and Return, Part 2. Gerald P. Dwyer Fall 2015

Monetary Economics Risk and Return, Part 2. Gerald P. Dwyer Fall 2015 Monetary Economics Risk and Return, Part 2 Gerald P. Dwyer Fall 2015 Reading Malkiel, Part 2, Part 3 Malkiel, Part 3 Outline Returns and risk Overall market risk reduced over longer periods Individual

More information

Lecture 13: Identifying unusual observations In lecture 12, we learned how to investigate variables. Now we learn how to investigate cases.

Lecture 13: Identifying unusual observations In lecture 12, we learned how to investigate variables. Now we learn how to investigate cases. Lecture 13: Identifying unusual observations In lecture 12, we learned how to investigate variables. Now we learn how to investigate cases. Goal: Find unusual cases that might be mistakes, or that might

More information

TABLE I SUMMARY STATISTICS Panel A: Loan-level Variables (22,176 loans) Variable Mean S.D. Pre-nuclear Test Total Lending (000) 16,479 60,768 Change in Log Lending -0.0028 1.23 Post-nuclear Test Default

More information

Lecture Note: Analysis of Financial Time Series Spring 2008, Ruey S. Tsay. Seasonal Time Series: TS with periodic patterns and useful in

Lecture Note: Analysis of Financial Time Series Spring 2008, Ruey S. Tsay. Seasonal Time Series: TS with periodic patterns and useful in Lecture Note: Analysis of Financial Time Series Spring 2008, Ruey S. Tsay Seasonal Time Series: TS with periodic patterns and useful in predicting quarterly earnings pricing weather-related derivatives

More information

Lecture Note: Analysis of Financial Time Series Spring 2017, Ruey S. Tsay

Lecture Note: Analysis of Financial Time Series Spring 2017, Ruey S. Tsay Lecture Note: Analysis of Financial Time Series Spring 2017, Ruey S. Tsay Seasonal Time Series: TS with periodic patterns and useful in predicting quarterly earnings pricing weather-related derivatives

More information

NEWCASTLE UNIVERSITY. School SEMESTER /2013 ACE2013. Statistics for Marketing and Management. Time allowed: 2 hours

NEWCASTLE UNIVERSITY. School SEMESTER /2013 ACE2013. Statistics for Marketing and Management. Time allowed: 2 hours NEWCASTLE UNIVERSITY School SEMESTER 2 2012/2013 Statistics for Marketing and Management Time allowed: 2 hours Candidates should attempt ALL questions. Marks for each question are indicated. However you

More information

EXST7015: Multiple Regression from Snedecor & Cochran (1967) RAW DATA LISTING

EXST7015: Multiple Regression from Snedecor & Cochran (1967) RAW DATA LISTING Multiple (Linear) Regression Introductory example Page 1 1 options ps=256 ls=132 nocenter nodate nonumber; 3 DATA ONE; 4 TITLE1 ''; 5 INPUT X1 X2 X3 Y; 6 **** LABEL Y ='Plant available phosphorus' 7 X1='Inorganic

More information

Effects of Financial Parameters on Poverty - Using SAS EM

Effects of Financial Parameters on Poverty - Using SAS EM Effects of Financial Parameters on Poverty - Using SAS EM By - Akshay Arora Student, MS in Business Analytics Spears School of Business Oklahoma State University Abstract Studies recommend that developing

More information

SFSU FIN822 Project 1

SFSU FIN822 Project 1 SFSU FIN822 Project 1 This project can be done in a team of up to 3 people. Your project report must be accompanied by printouts of programming outputs. You could use any software to solve the problems.

More information

General Business 706 Midterm #3 November 25, 1997

General Business 706 Midterm #3 November 25, 1997 General Business 706 Midterm #3 November 25, 1997 There are 9 questions on this exam for a total of 40 points. Please be sure to put your name and ID in the spaces provided below. Now, if you feel any

More information

Survey Sampling, Fall, 2006, Columbia University Homework assignments (2 Sept 2006)

Survey Sampling, Fall, 2006, Columbia University Homework assignments (2 Sept 2006) Survey Sampling, Fall, 2006, Columbia University Homework assignments (2 Sept 2006) Assignment 1, due lecture 3 at the beginning of class 1. Lohr 1.1 2. Lohr 1.2 3. Lohr 1.3 4. Download data from the CBS

More information