Regression Model Assumptions Solutions


Below are the solutions to these exercises on model diagnostics using residual plots.

# Exercise 1 #
data("cars")
head(cars)

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

# Exercise 2 #
attach(cars)
plot(speed, dist, pch=16, las=1, cex=1.2, cex.lab=1.2,
     xlab='speed (mph)', ylab='stopping distance (ft)')

There is a correlation between stopping distance and speed: stopping distance increases as speed increases.

# Exercise 3 #
lm_model <- lm(dist ~ speed)
plot(speed, dist, pch=16, las=1, cex=1.2, cex.lab=1.2,
     xlab='speed (mph)', ylab='stopping distance (ft)')
abline(lm_model, lwd=3, col='red')

# Exercise 4 #
summary(lm_model)

Call:
lm(formula = dist ~ speed)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511,  Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The estimated slope coefficient is significantly different from zero. The R-squared values indicate that about 65% of the variance in stopping distance can be explained by speed.

# Exercise 5 #
par(mfrow = c(2,2))
plot(lm_model)

# Exercise 6 #
Yes, the data do appear to be homoscedastic. The distribution of the errors is the same for all values of the explanatory variable. This can be seen in the top-left or bottom-left plots of the four-panel figure in exercise 5. A perfectly horizontal line at 0 would indicate that the mean of the residuals is zero and the same for all values of the explanatory variable. We see some deviation from a horizontal line at the tails of the explanatory variable.

# Exercise 7 #
The QQ plot (top-right of the four-panel figure) indicates that the residuals are close to normally distributed. There is some deviation from normality at the tails of the distribution, where there are fewer data points.

# Exercise 8 #
par(mfrow = c(1,1))
plot(speed, lm_model$residuals, pch=16, ylab='residuals')

The residuals are not correlated with the explanatory variable.

# Exercise 9 #
There are several points (in the bottom-right plot of the four-panel figure) with a large Cook's distance that may have an influence on the regression, including the points labeled 49, 23, and 39.
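As a cross-check on reading the plot (this is an addition, not part of the original solution), Cook's distances can also be computed directly with base R's cooks.distance(); a minimal sketch:

# Cook's distance for every observation, largest first
cd <- cooks.distance(lm_model)
head(sort(cd, decreasing = TRUE), 3)  # observations 49, 23, and 39 should rank near the top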

# Exercise 10 #
cars_no_49 <- cars[-49,]
lm_model2 <- lm(cars_no_49$dist ~ cars_no_49$speed)
summary(lm_model2)

Call:
lm(formula = cars_no_49$dist ~ cars_no_49$speed)

(The residual summary and coefficient estimates are omitted here; the intercept is significant at the 0.05 level (*) and the slope is significant with p on the order of 1e-12 (***).)

Residual standard error: 14.1 on 47 degrees of freedom
F-statistic on 1 and 47 DF, p-value: 3.262e-12

We see that removing the 49th record changed some values in the regression model, but nothing that qualitatively alters the results. For example, the estimated slope coefficient is now slightly smaller.

Regression Model Assumptions Exercises

You might fit a statistical model to a set of data and obtain parameter estimates. However, you are not done at this point: you need to make sure the assumptions of the particular model you used were met. One tool is to examine the model residuals; we previously discussed this in a tutorial. The residuals are the difference between your observed data and your predicted values (a minimal sketch of this appears after Exercise 1 below). In this exercise set, you will examine several aspects of residual plots. These residual plots help determine whether you have met your model assumptions. Answers to the exercises are available here.

Exercise 1
Load the cars data set using the data() function. This data contains the stopping distances (feet) for different car speeds (miles per hour). The data was recorded in the 1920s.
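A minimal sketch of the residual definition mentioned above (an addition to the original text; residuals are simply observed minus fitted values):

fit <- lm(dist ~ speed, data = cars)
res_by_hand <- cars$dist - fitted(fit)  # observed minus predicted
all.equal(res_by_hand, residuals(fit))  # TRUE: matches the residuals lm() stores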

Exercise 2
Plot stopping distances on the y-axis and car speeds on the x-axis. What kind of pattern is present?

Exercise 3
Use the lm() function to fit a linear model to the data, with stopping distance as the response variable. Plot the line of best fit.

Exercise 4
Use summary() to obtain parameter estimates and model details. Is the slope significantly different from zero? How much of the variance can be explained by car speed?

Exercise 5
Use the plot() command on the linear model to obtain the four plots of residuals.

Learn more about evaluating different statistical models in the online courses Linear regression in R for Data Scientists and Structural equation modeling (SEM) with lavaan. These courses cover different statistical models that can help you choose the right design for your solution.

Exercise 6
Are the data homoscedastic? Homoscedastic data means the distribution of errors should be the same for all values of the explanatory variable.

Exercise 7
Are the residuals normally distributed?

Exercise 8
Are the residuals correlated with the explanatory variable?

Exercise 9
Bonus test. Now take a look at the fourth plot: the residuals versus leverage. This plot does not indicate whether we have met model assumptions, but it does tell us if certain data points are more influential than others in the regression. Points that have been labeled with a number have a high Cook's distance, which means they are particularly influential for the regression. These are usually the points not clustered with the majority of points. Are there any points with a high Cook's distance?

Exercise 10
Remove the 49th record (the one with a large Cook's distance) from the data. How does this change the parameter estimates in the regression model?

Regression Model Assumptions Tutorial

Regression is used to explore the relationship between one variable (often termed the response) and one or more other variables (termed explanatory). Several exercises are already available on simple linear regression or multiple regression. These are fantastic tools that are used frequently. However, each has a number of assumptions that need to be met. Unfortunately, people often conduct regression analyses without checking their assumptions. In this tutorial, we will focus on how to check assumptions for simple linear regression.

We will use the trees data already found in R. The data includes the girth, height, and volume for 31 Black Cherry trees. The following code loads the data and then creates a plot of height versus girth. The red line is the line of best fit from linear regression.

data("trees")
attach(trees)
plot(Girth, Height, pch=16, las=1)
lm_model <- lm(Height ~ Girth)
abline(lm_model, lwd=2, col='red')

We can also use summary() to obtain the model estimates from linear regression.

summary(lm_model)

Call:
lm(formula = Height ~ Girth)

(The residual summary and coefficient estimates are omitted here; the intercept is highly significant (p on the order of 1e-14, ***) and the Girth coefficient is significant at the 0.01 level (**). The output also reports the residual standard error on 29 degrees of freedom, the R-squared values, and the F-statistic on 1 and 29 DF.)

We obtain parameter estimates, an R-squared value, and other useful bits of information. Great. Using several graphs, let's examine whether our model has met the assumptions of linear regression. We are going to examine model residuals. A residual is simply the difference between each observed value (the black points in the first graph) and the predicted value (the red line).

Learn more about evaluating different statistical models in the online courses Linear regression in R for Data Scientists and Structural equation modeling (SEM) with lavaan. These courses cover different statistical models that can help you choose the right design for your solution.

Thankfully, the plot() command provides several useful plots to check our model assumptions about residuals.

par(mfrow = c(2,2))
plot(lm_model)

Okay, there is a lot going on in these graphs. Let's break down each assumption and how it relates to these four graphs.

1. Homoscedasticity of residuals
The residuals should be homoscedastic. This means that the distribution of errors should be the same for all values of the explanatory variable. In the top-left figure, we see residuals plotted against the fitted values. If the residuals are homoscedastic, we would expect the red line to fall on zero for the entire range of fitted values. There is some discrepancy, but the assumption looks to hold overall. The bottom-left plot can also be used to examine homoscedasticity. Here, standardized residuals are used, with the same reasoning that the red line should be horizontal. It is important to remember that examining model assumptions graphically is more of an art than a science.

2. Normality of the residuals
The residuals should also be normally distributed. To check for normality, examine the qqnorm() plot on the top right of the four-panel figure. Perfectly normally distributed data would all fall on the dashed line. This data looks fairly close to normal, with some deviation at the tails.

3. Residuals should not be correlated with the explanatory variables
Another important assumption is that the residuals should not be correlated with any explanatory variables. Previously, we let the plot() command handle the residuals, but we need the residuals themselves now. To extract model residuals, we use lm_model$residuals. We can then use this command to plot the model residuals versus Girth, the explanatory variable.

par(mfrow = c(1,1))
plot(Girth, lm_model$residuals, ylab='residuals', pch=16)

Here we see there is very little correlation between our residuals and the Girth variable. Therefore, this assumption is met.

4. The mean of the residuals should equal zero
We can see that the mean of the residuals is close to zero from either of our plots, or simply by using mean(lm_model$residuals). Our assumption is met, as the mean is very close to zero.

5. Little or no multicollinearity
If we had more than one explanatory variable, we would have a multiple regression model. In that case, it would be important to verify that none of the explanatory variables are strongly correlated with one another. We only have one explanatory variable in our regression, so this assumption is not relevant here. A quick sketch of both checks follows.
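A minimal sketch of checks 4 and 5 (an addition; the cor() call uses the other trees columns purely for illustration):

mean(lm_model$residuals)  # should be very close to zero for an OLS fit
cor(trees[, c("Girth", "Height", "Volume")])  # pairwise correlations among candidate predictors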

Calculating Marginal Effects Exercises

A common experience for those in the social sciences migrating to R from SPSS or Stata is that some procedures that happened at the click of a button now require more work, or are too obscured by the unfamiliar language to see how to accomplish them. One such procedure I have experienced this with is calculating the marginal effects of a generalized linear model. In this exercise set, we will explore calculating marginal effects for linear, logistic, and probit regression models in R. Exercises in this section will be solved using the margins and mfx packages. It is recommended to take a look at the concise and excellent documentation for these packages before continuing. Answers to the exercises are available here.

Exercise 1
Load the mtcars dataset. Build a linear regression of mpg on wt, qsec, am, and hp.

Exercise 2
Print the coefficients from the linear model in the previous exercise.

Exercise 3
Using the margins package, find the marginal effects (a sketch follows Exercise 6 below).

Exercise 4
Verify that you receive the same results from Exercises 2 and 3. Why do these marginal effects match the coefficients found when printing the linear model object?

Exercise 5
Using the mtcars dataset, build a linear regression similar to Exercise 1, except include an interaction term between am and hp. Find the marginal effects for this regression.

Exercise 6
Using your favorite dataset (mine is field.goals from the nutshell package), construct a logistic regression.
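A hedged sketch of Exercises 1-4, assuming the margins package is installed:

library(margins)

fit <- lm(mpg ~ wt + qsec + am + hp, data = mtcars)
coef(fit)          # Exercise 2: the fitted coefficients

m <- margins(fit)  # Exercise 3: marginal effects
summary(m)

With no interactions or nonlinear terms, the derivative of the response with respect to each regressor is constant, which is why these marginal effects coincide with the coefficients.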

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

- Avoid model over-fitting using cross-validation for optimal parameter selection
- Explore maximum margin methods such as the best penalty of error term and support vector machines with linear and non-linear kernels
- And much more

Exercise 7
Explain why marginal effects for a logit model are more complex than for a linear model.

Exercise 8
For the next two exercises, you may use either package. Calculate the marginal effects at the mean.

Exercise 9
Calculate the average marginal effects.

Exercise 10
If these marginal effects are different, explain why.

Ridge regression in R exercises

The bias-variance tradeoff is always encountered in applying supervised learning algorithms. Least squares regression provides a good fit for the training set, but can suffer from high variance, which lowers predictive ability. To counter this problem, we can regularize the beta coefficients by employing a penalization term. Ridge regression applies an L2 penalty to the residual sum of squares; in contrast, LASSO regression, which was covered here previously, applies an L1 penalty. Using ridge regression, we can shrink the beta coefficients towards zero, which reduces variance at the cost of higher bias and can result in better predictive ability than least squares regression. In this exercise set we will use the glmnet package (package description: here) to implement ridge regression in R. Answers to the exercises are available here.

Exercise 1
Load the lars package and the diabetes dataset (Efron, Hastie, Johnstone and Tibshirani (2003) "Least Angle Regression" (with discussion), Annals of Statistics). This is the same dataset from the LASSO exercise set, and has patient-level data on the progression of diabetes. Next, load the glmnet package, which we will now use to implement ridge regression. The dataset has three matrices: x, x2 and y. x has a smaller set of independent variables, while x2 contains the full set with quadratic and interaction terms. y is the dependent variable, a quantitative measure of the progression of diabetes. Generate separate scatterplots with the line of best fit for all the predictors in x, with y on the vertical axis. Regress y on the predictors in x using OLS. We will use this result as a benchmark for comparison.

Exercise 2
Fit the ridge regression model using the glmnet function and plot the trace of the estimated coefficients against lambda. Note that the coefficients are shrunk closer to zero for higher values of lambda.
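A minimal sketch of Exercises 1 and 2, assuming the lars and glmnet packages are installed; in glmnet, alpha = 0 selects the ridge (L2) penalty:

library(lars)    # provides the diabetes dataset
library(glmnet)
data(diabetes)

ridge_fit <- glmnet(diabetes$x, diabetes$y, alpha = 0)  # alpha = 0 -> ridge penalty
plot(ridge_fit, xvar = "lambda", label = TRUE)          # coefficient trace against log(lambda)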

Exercise 3
Use the cv.glmnet function to get the cross-validation curve and the value of lambda that minimizes the mean cross-validation error.

Exercise 4
Using the minimum value of lambda from the previous exercise, get the estimated beta matrix. Note that the coefficients are lower than the least squares estimates.

Exercise 5
To get a more parsimonious model, we can use a higher value of lambda that is within one standard error of the minimum. Use this value of lambda to get the beta coefficients. Note the shrinkage effect on the estimates (see the sketch after Exercise 7 below).

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

- Avoid model over-fitting using cross-validation for optimal parameter selection
- Explore maximum margin methods such as the best penalty of error term and support vector machines with linear and non-linear kernels
- And much more

Exercise 6
Split the data randomly between a training set (80%) and a test set (20%). We will use these to get the prediction standard error for the least squares and ridge regression models.

Exercise 7
Fit the ridge regression model on the training set and get the estimated beta coefficients for both the minimum lambda and the higher lambda within one standard error of the minimum.
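A sketch of Exercises 3-5; lambda.min and lambda.1se are the fields that cv.glmnet exposes for the two lambda choices discussed above:

cv_fit <- cv.glmnet(diabetes$x, diabetes$y, alpha = 0)
plot(cv_fit)                    # cross-validation curve

cv_fit$lambda.min               # lambda minimizing the mean cross-validation error
coef(cv_fit, s = "lambda.min")  # exercise 4: estimates at the minimum
coef(cv_fit, s = "lambda.1se")  # exercise 5: more parsimonious fit within one standard error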

Exercise 8
Get predictions from the ridge regression model for the test set and calculate the prediction standard error. Do this for both the minimum lambda and the higher lambda within one standard error of the minimum (a sketch of Exercises 6-8 follows Exercise 10 below).

Exercise 9
Fit the least squares model on the training set.

Exercise 10
Get predictions from the least squares model for the test set and calculate the prediction standard error.
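One way Exercises 6-8 could look; the 80/20 split and computing the prediction standard error as the standard deviation of the held-out prediction errors are assumptions about the intended solution:

set.seed(123)  # for a reproducible split
n <- nrow(diabetes$x)
train <- sample(n, size = round(0.8 * n))

cv_train <- cv.glmnet(diabetes$x[train, ], diabetes$y[train], alpha = 0)
pred_min <- predict(cv_train, newx = diabetes$x[-train, ], s = "lambda.min")
sd(diabetes$y[-train] - as.numeric(pred_min))  # prediction standard error on the test set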

LASSO regression in R exercises

The Least Absolute Shrinkage and Selection Operator (LASSO) performs regularization and variable selection on a given model. Depending on the size of the penalty term, LASSO shrinks less relevant predictors to (possibly) zero. Thus, it enables us to consider a more parsimonious model. In this exercise set we will use the glmnet package (package description: here) to implement LASSO regression in R. Answers to the exercises are available here.

Exercise 1
Load the lars package and the diabetes dataset (Efron, Hastie, Johnstone and Tibshirani (2003) "Least Angle Regression", Annals of Statistics). This has patient-level data on the progression of diabetes. Next, load the glmnet package, which will be used to implement LASSO.

Exercise 2
The dataset has three matrices: x, x2 and y. While x has a smaller set of independent variables, x2 contains the full set with quadratic and interaction terms. y is the dependent variable, a quantitative measure of the progression of diabetes. It is a good idea to visually inspect the relationship of each of the predictors with the dependent variable. Generate separate scatterplots with the line of best fit for all the predictors in x, with y on the vertical axis. Use a loop to automate the process.

Exercise 3
Regress y on the predictors in x using OLS. We will use this result as a benchmark for comparison.

Exercise 4
Use the glmnet function to plot the path of each of x's variable coefficients against the L1 norm of the beta vector. This graph indicates at which stage each coefficient shrinks to zero (a sketch follows Exercise 10 below).

Learn more about the glmnet package in the online course Regression Machine Learning with R. In this course you will learn how to:

- Avoid model over-fitting using cross-validation for optimal parameter selection
- Explore maximum margin methods such as the best penalty of error term and support vector machines with linear and non-linear kernels
- And much more

Exercise 5
Use the cv.glmnet function to get the cross-validation curve and the value of lambda that minimizes the mean cross-validation error.

Exercise 6
Using the minimum value of lambda from the previous exercise, get the estimated beta matrix. Note that some coefficients have been shrunk to zero. This indicates which predictors are important in explaining the variation in y.

Exercise 7
To get a more parsimonious model, we can use a higher value of lambda that is within one standard error of the minimum. Use this value of lambda to get the beta coefficients. Note that more coefficients are now shrunk to zero.

Exercise 8
As mentioned earlier, x2 contains a wider variety of predictors. Using OLS, regress y on x2 and evaluate the results.

Exercise 9
Repeat exercise 4 for the new model.

Exercise 10
Repeat exercises 5 and 6 for the new model and see which coefficients are shrunk to zero. This is an effective way to narrow down the important predictors when there are many candidates.
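A sketch of Exercises 4-6; glmnet's default alpha = 1 gives the LASSO (L1) penalty:

lasso_fit <- glmnet(diabetes$x, diabetes$y)   # alpha = 1 (default) -> LASSO
plot(lasso_fit, xvar = "norm", label = TRUE)  # coefficient paths against the L1 norm

cv_lasso <- cv.glmnet(diabetes$x, diabetes$y)
coef(cv_lasso, s = "lambda.min")              # some coefficients are exactly zero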

Quantile Regression in R exercises

The standard OLS (ordinary least squares) model explains the relationship between independent variables and the conditional mean of the dependent variable. In contrast, quantile regression models this relationship for different quantiles of the dependent variable. In this exercise set we will use the quantreg package (package description: here) to implement quantile regression in R. Answers to the exercises are available here.

Exercise 1
Load the quantreg package and the barro dataset (Barro and Lee, 1994). This has data on GDP growth rates for various countries. Next, summarize the data.

Exercise 2
The dependent variable is y.net (annual change in per capita GDP). The remaining variables will be used to explain y.net. It is easier to combine variables using cbind before applying regression techniques. Combine the variables so that we can write Y ~ X.

Exercise 3
Regress y.net on the independent variables using OLS. We will use this result as a benchmark for comparison.
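Before the remaining exercises, a minimal sketch of the rq() interface they rely on (the two predictors here are an illustrative subset, not the full model used in the solutions below):

library(quantreg)
data(barro)

median_fit <- rq(y.net ~ lgdp2 + Iy2, tau = 0.5, data = barro)  # tau = 0.5 is the median
summary(median_fit)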

Exercise 4
Using the rq function, estimate the model at the median of y.net. Compare the results with those from exercise 3.

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

- Avoid model over-fitting using cross-validation for optimal parameter selection
- Explore maximum margin methods such as the best penalty of error term and support vector machines with linear and non-linear kernels
- And much more

Exercise 5
Estimate the model for the first and third quartiles and compare the results.

Exercise 6
Using a single command, estimate the model for 10 equally spaced deciles of y.net.

Exercise 7
The quantreg package also offers shrinkage estimators to determine which variables play the most important role in predicting y.net. Estimate the model with LASSO-based quantile regression at the median level with lambda = 0.5.

Exercise 8
Quantile plots are most useful for interpreting results. To do that, we need to define the sequence of percentiles. Use the seq function to define the sequence of percentiles from 5% to 95% with a jump of 5%.

Exercise 9
Use the result from exercise 8 to plot the graphs. Note that the red line is the OLS estimate, bounded by the dotted lines which represent confidence intervals.

Exercise 10
Using results from exercise 5, test whether the coefficients are significantly different for the first- and third-quartile based regressions.

Quantile Regression in R solutions

Below are the solutions to these exercises on quantile regression.

# # Exercise 1 # #
library(quantreg)
data(barro)
summary(barro)

(The summary output, giving the minimum, quartiles, median, mean, and maximum of each variable, is omitted here; the variables are y.net, lgdp2, mse2, fse2, fhe2, mhe2, lexp2, lintr2, gedy2, Iy2, gcony2, lblakp2, pol2, and ttrad2.)

# # Exercise 2 # #
y <- barro$y.net
x <- cbind(barro$lgdp2, barro$mse2, barro$fse2, barro$fhe2, barro$mhe2,
           barro$lexp2, barro$lintr2, barro$gedy2, barro$Iy2, barro$gcony2,
           barro$lblakp2, barro$pol2, barro$ttrad2)
colnames(x) <- c("Initial Per Capita GDP", "Male Secondary Education",
                 "Female Secondary Education", "Female Higher Education",
                 "Male Higher Education", "Life Expectancy", "Human Capital",
                 "Education/GDP", "Investment/GDP", "Public Consumption/GDP",
                 "Black Market Premium", "Political Instability",
                 "Growth Rate Terms Trade")

# # Exercise 3 # #
model3 <- lm(y ~ x)
summary(model3)

Call:
lm(formula = y ~ x)

(The residual summary and numeric estimates are omitted here. Initial Per Capita GDP (p on the order of 1e-13), Life Expectancy (p on the order of 1e-05), Black Market Premium (p on the order of 1e-09), and Growth Rate Terms Trade (p on the order of 1e-05) are highly significant (***); Male Secondary Education, Investment/GDP, Public Consumption/GDP, and Political Instability are significant at the 0.01 level (**). Residual standard error on 147 degrees of freedom; F-statistic on 13 and 147 DF, p-value < 2.2e-16.)

# # Exercise 4 # #
model4 <- rq(y ~ x, tau = 0.5)
summary(model4, se = "rank")

Call:
rq(formula = y ~ x, tau = 0.5)

tau: [1] 0.5

(The coefficient table, with rank-based lower and upper bounds for each of the fourteen coefficients, is omitted.)

# # Exercise 5 # #
model5a <- rq(y ~ x, tau = 0.25)
summary(model5a, se = "rank")

model5b <- rq(y ~ x, tau = 0.75)
summary(model5b, se = "rank")

(Each call prints a coefficient table like the one in exercise 4, for tau = 0.25 and tau = 0.75 respectively; the numeric output is omitted.)

# # Exercise 6 # #
model6 <- rq(y ~ x, tau = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9))
summary(model6, se = "rank")

(One coefficient table is printed for each tau from 0.1 to 0.9; the numeric output is omitted.)

# # Exercise 7 # #
model7 <- rq.fit.lasso(x, y, tau = 0.5, lambda = 0.5)
print(model7$coefficients)

(The printed coefficients are omitted.)
# Note that some beta estimates have been shrunk closer to zero.

# # Exercise 8 # #
tau_seq <- seq(0.05, 0.95, by = 0.05)

# # Exercise 9 # #
model8 <- rq(y ~ x, tau = tau_seq)
plot(summary(model8))

# Note that 'gcony2' is different from the OLS estimate at
# lower quantiles of 'y.net'.

# # Exercise 10 # #
anova(model5a, model5b)

Warning in summary.rq(x, se = se, covariance = TRUE): 8 non-positive fis
Warning in summary.rq(x, se = se, covariance = TRUE): 3 non-positive fis

Quantile Regression Analysis of Deviance Table

Model: y ~ x
Joint Test of Equality of Slopes: tau in { 0.25 0.75 }

(The Df, Resid Df, and F value are omitted; the test is significant at the 0.01 level (**).)

# Estimates are significantly different.

Evaluate your model with R Exercises

There was a time when statisticians had to crunch numbers manually to fit their data to a model. Since this process was so long, those statisticians usually did a lot of preliminary work: researching models that had worked in the past, or looking for studies in other scientific fields, like psychology or sociology, that could inform their model, with the goal of maximizing their chance of making a relevant model. Then they would create a model and an alternative model and choose the one that seemed more efficient. Now that even an average computer gives us incredible computing power, it's easy to make multiple models and choose the one that best fits the data. Even though it is better to have good prior knowledge of the process you are trying to analyze, and of other models used in the past, coming to a conclusion using mostly the data helps you avoid bias and create better models.

In this set of exercises, we'll see how to apply the most-used error metrics to your models, with the intention of rating them and choosing the one that is most appropriate for the situation. Most of these error metrics are not part of any R package; consequently, you have to apply the equations I give you to your data. Personally, I prefer to write a function which I can easily use on every one of my models, but there are many ways to code these equations. If your code is different from the one in the solution, feel free to post your code in the comments. Answers to the exercises are available here.

Exercise 1
We start by looking at error metrics we can use for regression models. For linear regression problems, the most-used metrics are the coefficient of determination R-squared, which shows what percentage of variance is explained by the model, and the adjusted R-squared, which penalizes models that use variables without an effective contribution to the model (see this page for more details). Load the attitude data set (available in base R) and make three linear models with the goal of explaining the rating variable. The first one uses all the variables from the dataset; the second uses the variables complaints, privileges, learning and advance as independent variables; and the third uses only the complaints, learning and advance variables. Then use the summary() function to print the R-squared and the adjusted R-squared.

Exercise 2
Another way to measure how well your model fits your data is to use the Root Mean Squared Error (RMSE), which is defined as the square root of the average of the squared errors made by your model. You can find the mathematical definition of the RMSE on this page. Calculate the RMSE of the predictions made by your three models.

Exercise 3
The mean absolute error (MAE) is a good alternative to the RMSE if you don't want to penalize large estimation errors of your model too heavily. The mathematical definition of the MAE can be found here. Calculate the MAE of the predictions made by the three models.
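A minimal sketch of Exercises 1-3; writing the metrics as reusable functions is one option among many:

# Error-metric helpers: root mean squared error and mean absolute error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae <- function(actual, predicted) mean(abs(actual - predicted))

model1 <- lm(rating ~ ., data = attitude)  # first model: all explanatory variables
rmse(attitude$rating, predict(model1))
mae(attitude$rating, predict(model1))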

Exercise 4
Sometimes certain prediction errors hurt your model more than others. For example, if you are trying to predict the financial loss of a business over a period of time, underestimating the loss would put the business at risk of bankruptcy, while overestimating the loss results in a conservative model. In those cases, using the Root Mean Squared Logarithmic Error (RMSLE) as an error metric is useful, since this metric penalizes underestimation. The RMSLE is given by the equation on this page. Calculate the RMSLE of the predictions made by the three models.

Exercise 5
Now that we've seen some examples of error metrics which can be used in a regression context, let's see five examples of error metrics which can be used when you perform clustering analysis. But first, we must create a clustering model to test those metrics on. Load the iris dataset and apply the kmeans algorithm. Since the iris dataset has three distinct labels, use the kmeans algorithm with three centers. Also, set the maximum number of iterations to 50 and use the Lloyd algorithm. Once that's done, take time to rearrange the labels of your predictions so they are compatible with the factors in the iris dataset (a sketch of this setup follows Exercise 10 below).

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

- Avoid model over-fitting using cross-validation for optimal parameter selection
- Explore maximum margin methods such as the best penalty of error term and support vector machines with linear and non-linear kernels
- And much more

Exercise 6
Print the confusion matrix of your model.

Exercise 7
The easiest way to measure how well your model categorized the data is to calculate the accuracy, the recall and the precision of your results. Write three functions which return those individual values and calculate those metrics for your model.

Exercise 8
The F-measure summarizes the precision and recall of your model by calculating the harmonic mean of those two values. Write a function which returns the F-measure of your model and compute this measure twice for your data: once with a parameter of 2 and then with a parameter of 0.5.

Exercise 9
The purity is a measure of the homogeneity of your clusters: if every cluster regroups objects of the same class, you'll get a purity score of one, and if there's no majority class in any of the clusters, you'll get a purity score of 0. Write a function which returns the purity score of your model and test it on your predictions.

Exercise 10
The last error metric we'll see today is the Dunn index, which indicates whether the clusters are compact and separated. You can find the mathematical definition of the Dunn index here. Load the clValid package and use the dunn() function on your model to compute the Dunn index of your classification. Note that this function takes an integer vector representing the cluster partitioning as a parameter.
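A sketch of Exercises 5 and 6 under the stated settings (three centers, 50 iterations, Lloyd's algorithm); the set.seed() call is an addition for reproducibility:

set.seed(42)  # added for reproducibility
km <- kmeans(iris[, 1:4], centers = 3, iter.max = 50, algorithm = "Lloyd")

# Confusion matrix: cluster labels against the true species
table(cluster = km$cluster, species = iris$Species)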


More information

AP Stats: 3B ~ Least Squares Regression and Residuals. Objectives:

AP Stats: 3B ~ Least Squares Regression and Residuals. Objectives: Objectives: INTERPRET the slope and y intercept of a least-squares regression line USE the least-squares regression line to predict y for a given x CALCULATE and INTERPRET residuals and their standard

More information

Descriptive Statistics

Descriptive Statistics Chapter 3 Descriptive Statistics Chapter 2 presented graphical techniques for organizing and displaying data. Even though such graphical techniques allow the researcher to make some general observations

More information

Predicting Foreign Exchange Arbitrage

Predicting Foreign Exchange Arbitrage Predicting Foreign Exchange Arbitrage Stefan Huber & Amy Wang 1 Introduction and Related Work The Covered Interest Parity condition ( CIP ) should dictate prices on the trillion-dollar foreign exchange

More information

Notes on a Basic Business Problem MATH 104 and MATH 184 Mark Mac Lean (with assistance from Patrick Chan) 2011W

Notes on a Basic Business Problem MATH 104 and MATH 184 Mark Mac Lean (with assistance from Patrick Chan) 2011W Notes on a Basic Business Problem MATH 104 and MATH 184 Mark Mac Lean (with assistance from Patrick Chan) 2011W This simple problem will introduce you to the basic ideas of revenue, cost, profit, and demand.

More information

APPLICATIONS OF STATISTICAL DATA MINING METHODS

APPLICATIONS OF STATISTICAL DATA MINING METHODS Libraries Annual Conference on Applied Statistics in Agriculture 2004-16th Annual Conference Proceedings APPLICATIONS OF STATISTICAL DATA MINING METHODS George Fernandez Follow this and additional works

More information

Session 5. Predictive Modeling in Life Insurance

Session 5. Predictive Modeling in Life Insurance SOA Predictive Analytics Seminar Hong Kong 29 Aug. 2018 Hong Kong Session 5 Predictive Modeling in Life Insurance Jingyi Zhang, Ph.D Predictive Modeling in Life Insurance JINGYI ZHANG PhD Scientist Global

More information

Stat3011: Solution of Midterm Exam One

Stat3011: Solution of Midterm Exam One 1 Stat3011: Solution of Midterm Exam One Fall/2003, Tiefeng Jiang Name: Problem 1 (30 points). Choose one appropriate answer in each of the following questions. 1. (B ) The mean age of five people in a

More information

DATA HANDLING Five-Number Summary

DATA HANDLING Five-Number Summary DATA HANDLING Five-Number Summary The five-number summary consists of the minimum and maximum values, the median, and the upper and lower quartiles. The minimum and the maximum are the smallest and greatest

More information

General Business 706 Midterm #3 November 25, 1997

General Business 706 Midterm #3 November 25, 1997 General Business 706 Midterm #3 November 25, 1997 There are 9 questions on this exam for a total of 40 points. Please be sure to put your name and ID in the spaces provided below. Now, if you feel any

More information

COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 18, 2006, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTIONS

COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 18, 2006, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTIONS COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 18, 2006, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTIONS Answer all parts. Closed book, calculators allowed. It is important to show all working,

More information

Maximum Likelihood Estimation Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 13, 2018

Maximum Likelihood Estimation Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 13, 2018 Maximum Likelihood Estimation Richard Williams, University of otre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 3, 208 [This handout draws very heavily from Regression Models for Categorical

More information

Risk Analysis. å To change Benchmark tickers:

Risk Analysis. å To change Benchmark tickers: Property Sheet will appear. The Return/Statistics page will be displayed. 2. Use the five boxes in the Benchmark section of this page to enter or change the tickers that will appear on the Performance

More information

Tests for Two ROC Curves

Tests for Two ROC Curves Chapter 65 Tests for Two ROC Curves Introduction Receiver operating characteristic (ROC) curves are used to summarize the accuracy of diagnostic tests. The technique is used when a criterion variable is

More information

Impact of Unemployment and GDP on Inflation: Imperial study of Pakistan s Economy

Impact of Unemployment and GDP on Inflation: Imperial study of Pakistan s Economy International Journal of Current Research in Multidisciplinary (IJCRM) ISSN: 2456-0979 Vol. 2, No. 6, (July 17), pp. 01-10 Impact of Unemployment and GDP on Inflation: Imperial study of Pakistan s Economy

More information

This homework assignment uses the material on pages ( A moving average ).

This homework assignment uses the material on pages ( A moving average ). Module 2: Time series concepts HW Homework assignment: equally weighted moving average This homework assignment uses the material on pages 14-15 ( A moving average ). 2 Let Y t = 1/5 ( t + t-1 + t-2 +

More information

Predicting Charitable Contributions

Predicting Charitable Contributions Predicting Charitable Contributions By Lauren Meyer Executive Summary Charitable contributions depend on many factors from financial security to personal characteristics. This report will focus on demographic

More information

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, President, OptiMine Consulting, West Chester, PA ABSTRACT Data Mining is a new term for the

More information

SEX DISCRIMINATION PROBLEM

SEX DISCRIMINATION PROBLEM SEX DISCRIMINATION PROBLEM 5. Displaying Relationships between Variables In this section we will use scatterplots to examine the relationship between the dependent variable (starting salary) and each of

More information

WC-5 Just How Credible Is That Employer? Exploring GLMs and Multilevel Modeling for NCCI s Excess Loss Factor Methodology

WC-5 Just How Credible Is That Employer? Exploring GLMs and Multilevel Modeling for NCCI s Excess Loss Factor Methodology Antitrust Notice The Casualty Actuarial Society is committed to adhering strictly to the letter and spirit of the antitrust laws. Seminars conducted under the auspices of the CAS are designed solely to

More information

THE UNIVERSITY OF TEXAS AT AUSTIN Department of Information, Risk, and Operations Management

THE UNIVERSITY OF TEXAS AT AUSTIN Department of Information, Risk, and Operations Management THE UNIVERSITY OF TEXAS AT AUSTIN Department of Information, Risk, and Operations Management BA 386T Tom Shively PROBABILITY CONCEPTS AND NORMAL DISTRIBUTIONS The fundamental idea underlying any statistical

More information

STATISTICS 110/201, FALL 2017 Homework #5 Solutions Assigned Mon, November 6, Due Wed, November 15

STATISTICS 110/201, FALL 2017 Homework #5 Solutions Assigned Mon, November 6, Due Wed, November 15 STATISTICS 110/201, FALL 2017 Homework #5 Solutions Assigned Mon, November 6, Due Wed, November 15 For this assignment use the Diamonds dataset in the Stat2Data library. The dataset is used in examples

More information

Modelling the Sharpe ratio for investment strategies

Modelling the Sharpe ratio for investment strategies Modelling the Sharpe ratio for investment strategies Group 6 Sako Arts 0776148 Rik Coenders 0777004 Stefan Luijten 0783116 Ivo van Heck 0775551 Rik Hagelaars 0789883 Stephan van Driel 0858182 Ellen Cardinaels

More information

CHAPTER 2 Describing Data: Numerical

CHAPTER 2 Describing Data: Numerical CHAPTER Multiple-Choice Questions 1. A scatter plot can illustrate all of the following except: A) the median of each of the two variables B) the range of each of the two variables C) an indication of

More information

Web Extension: Continuous Distributions and Estimating Beta with a Calculator

Web Extension: Continuous Distributions and Estimating Beta with a Calculator 19878_02W_p001-008.qxd 3/10/06 9:51 AM Page 1 C H A P T E R 2 Web Extension: Continuous Distributions and Estimating Beta with a Calculator This extension explains continuous probability distributions

More information

Standardized Data Percentiles, Quartiles and Box Plots Grouped Data Skewness and Kurtosis

Standardized Data Percentiles, Quartiles and Box Plots Grouped Data Skewness and Kurtosis Descriptive Statistics (Part 2) 4 Chapter Percentiles, Quartiles and Box Plots Grouped Data Skewness and Kurtosis McGraw-Hill/Irwin Copyright 2009 by The McGraw-Hill Companies, Inc. Chebyshev s Theorem

More information

6 Multiple Regression

6 Multiple Regression More than one X variable. 6 Multiple Regression Why? Might be interested in more than one marginal effect Omitted Variable Bias (OVB) 6.1 and 6.2 House prices and OVB Should I build a fireplace? The following

More information

Frequency Distributions

Frequency Distributions Frequency Distributions January 8, 2018 Contents Frequency histograms Relative Frequency Histograms Cumulative Frequency Graph Frequency Histograms in R Using the Cumulative Frequency Graph to Estimate

More information

Lasso and Ridge Quantile Regression using Cross Validation to Estimate Extreme Rainfall

Lasso and Ridge Quantile Regression using Cross Validation to Estimate Extreme Rainfall Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 3 (2016), pp. 3305 3314 Research India Publications http://www.ripublication.com/gjpam.htm Lasso and Ridge Quantile Regression

More information

Estimating a demand function

Estimating a demand function Estimating a demand function One of the most basic topics in economics is the supply/demand curve. Simply put, the supply offered for sale of a commodity is directly related to its price, while the demand

More information

Gov 2001: Section 5. I. A Normal Example II. Uncertainty. Gov Spring 2010

Gov 2001: Section 5. I. A Normal Example II. Uncertainty. Gov Spring 2010 Gov 2001: Section 5 I. A Normal Example II. Uncertainty Gov 2001 Spring 2010 A roadmap We started by introducing the concept of likelihood in the simplest univariate context one observation, one variable.

More information

Maximum Likelihood Estimation Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 10, 2017

Maximum Likelihood Estimation Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 10, 2017 Maximum Likelihood Estimation Richard Williams, University of otre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 0, 207 [This handout draws very heavily from Regression Models for Categorical

More information