Regression Model Assumptions Solutions


Below are the solutions to these exercises on model diagnostics using residual plots.

# Exercise 1 #
data("cars")
head(cars)

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

# Exercise 2 #
attach(cars)
plot(speed, dist, pch=16, las=1, cex=1.2, cex.lab=1.2,
     xlab='speed (mph)', ylab='stopping distance (ft)')

There is a correlation between stopping distance and speed: stopping distance increases as speed increases.

# Exercise 3 #
lm_model <- lm(dist ~ speed)
plot(speed, dist, pch=16, las=1, cex=1.2, cex.lab=1.2,
     xlab='speed (mph)', ylab='stopping distance (ft)')
abline(lm_model, lwd=3, col='red')

# Exercise 4 #
summary(lm_model)

Call:
lm(formula = dist ~ speed)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511,  Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The estimated slope coefficient is significantly different from zero. The R-squared values indicate that about 65% of the variance in stopping distance can be explained by speed.

# Exercise 5 #
par(mfrow = c(2,2))
plot(lm_model)

# Exercise 6 #
Yes, the data do appear to be homoscedastic. The distribution of the errors is the same for all values of the explanatory variable. This can be seen in the top-left or bottom-left plots of the four-panel figure in exercise 5. A perfectly horizontal line at 0 would indicate that the mean of the residuals is zero and the same for all values of the explanatory variable. We see some deviation from a horizontal line at the tails of the explanatory variable.

# Exercise 7 #
The QQ plot (top-right of the four-panel figure) indicates that the residuals are close to normally distributed. There is some deviation from normality at the tails of the distribution, where there are fewer data points.

# Exercise 8 #
par(mfrow = c(1,1))
plot(speed, lm_model$residuals, pch=16, ylab='residuals')

The residuals are not correlated with the explanatory variable.

# Exercise 9 #
There are several points (in the bottom-right plot of the four-panel figure) with a large Cook's distance that may have an influence on the regression, including the points labeled 49, 23, and 39.
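As a cross-check on reading the plot (this is an addition, not part of the original solution), Cook's distances can also be computed directly with base R's cooks.distance(); a minimal sketch:

# Cook's distance for every observation, largest first
cd <- cooks.distance(lm_model)
head(sort(cd, decreasing = TRUE), 3)  # observations 49, 23, and 39 should rank near the top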

# Exercise 10 #
cars_no_49 <- cars[-49,]
lm_model2 <- lm(cars_no_49$dist ~ cars_no_49$speed)
summary(lm_model2)

Call:
lm(formula = cars_no_49$dist ~ cars_no_49$speed)

(The residual summary and coefficient estimates are omitted here; the intercept is significant at the 0.05 level (*) and the slope is significant with p on the order of 1e-12 (***).)

Residual standard error: 14.1 on 47 degrees of freedom
F-statistic on 1 and 47 DF, p-value: 3.262e-12

We see that removing the 49th record changed some values in the regression model, but nothing that qualitatively alters the results. For example, the estimated slope coefficient is now slightly smaller.

Regression Model Assumptions Exercises

You might fit a statistical model to a set of data and obtain parameter estimates. However, you are not done at this point: you need to make sure the assumptions of the particular model you used were met. One tool is to examine the model residuals; we previously discussed this in a tutorial. The residuals are the difference between your observed data and your predicted values (a minimal sketch of this appears after Exercise 1 below). In this exercise set, you will examine several aspects of residual plots. These residual plots help determine whether you have met your model assumptions. Answers to the exercises are available here.

Exercise 1
Load the cars data set using the data() function. This data contains the stopping distances (feet) for different car speeds (miles per hour). The data was recorded in the 1920s.
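A minimal sketch of the residual definition mentioned above (an addition to the original text; residuals are simply observed minus fitted values):

fit <- lm(dist ~ speed, data = cars)
res_by_hand <- cars$dist - fitted(fit)  # observed minus predicted
all.equal(res_by_hand, residuals(fit))  # TRUE: matches the residuals lm() stores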

Exercise 2
Plot stopping distances on the y-axis and car speeds on the x-axis. What kind of pattern is present?

Exercise 3
Use the lm() function to fit a linear model to the data, with stopping distance as the response variable. Plot the line of best fit.

Exercise 4
Use summary() to obtain parameter estimates and model details. Is the slope significantly different from zero? How much of the variance can be explained by car speed?

Exercise 5
Use the plot() command on the linear model to obtain the four plots of residuals.

Learn more about evaluating different statistical models in the online courses Linear regression in R for Data Scientists and Structural equation modeling (SEM) with lavaan. These courses cover different statistical models that can help you choose the right design for your solution.

Exercise 6
Are the data homoscedastic? Homoscedastic data means the distribution of errors should be the same for all values of the explanatory variable.

Exercise 7
Are the residuals normally distributed?

Exercise 8
Are the residuals correlated with the explanatory variable?

Exercise 9
Bonus test. Now take a look at the fourth plot: the residuals versus leverage. This plot does not indicate whether we have met model assumptions, but it does tell us if certain data points are more influential than others in the regression. Points that have been labeled with a number have a high Cook's distance, which means they are particularly influential for the regression. These are usually the points not clustered with the majority of points. Are there any points with a high Cook's distance?

Exercise 10
Remove the 49th record (the one with a large Cook's distance) from the data. How does this change the parameter estimates in the regression model?

Regression Model Assumptions Tutorial

Regression is used to explore the relationship between one variable (often termed the response) and one or more other variables (termed explanatory). Several exercises are already available on simple linear regression or multiple regression. These are fantastic tools that are used frequently. However, each has a number of assumptions that need to be met. Unfortunately, people often conduct regression analyses without checking their assumptions. In this tutorial, we will focus on how to check assumptions for simple linear regression.

We will use the trees data already found in R. The data includes the girth, height, and volume for 31 Black Cherry trees. The following code loads the data and then creates a plot of height versus girth. The red line is the line of best fit from linear regression.

data("trees")
attach(trees)
plot(Girth, Height, pch=16, las=1)
lm_model <- lm(Height ~ Girth)
abline(lm_model, lwd=2, col='red')

We can also use summary() to obtain the model estimates from linear regression.

summary(lm_model)

Call:
lm(formula = Height ~ Girth)

(The residual summary and coefficient estimates are omitted here; the intercept is highly significant (p on the order of 1e-14, ***) and the Girth coefficient is significant at the 0.01 level (**). The output also reports the residual standard error on 29 degrees of freedom, the R-squared values, and the F-statistic on 1 and 29 DF.)

We obtain parameter estimates, an R-squared value, and other useful bits of information. Great. Using several graphs, let's examine whether our model has met the assumptions of linear regression. We are going to examine model residuals. A residual is simply the difference between each observed value (the black points in the first graph) and the predicted value (the red line).

Learn more about evaluating different statistical models in the online courses Linear regression in R for Data Scientists and Structural equation modeling (SEM) with lavaan. These courses cover different statistical models that can help you choose the right design for your solution.

Thankfully, the plot() command provides several useful plots to check our model assumptions about residuals.

par(mfrow = c(2,2))
plot(lm_model)

Okay, there is a lot going on in these graphs. Let's break down each assumption and how it relates to these four graphs.

1. Homoscedasticity of residuals
The residuals should be homoscedastic. This means that the distribution of errors should be the same for all values of the explanatory variable. In the top-left figure, we see residuals plotted against the fitted values. If the residuals are homoscedastic, we would expect the red line to fall on zero for the entire range of fitted values. There is some discrepancy, but the assumption looks to hold overall. The bottom-left plot can also be used to examine homoscedasticity. Here, standardized residuals are used, with the same reasoning that the red line should be horizontal. It is important to remember that examining model assumptions graphically is more of an art than a science.

2. Normality of the residuals
The residuals should also be normally distributed. To check for normality, examine the qqnorm() plot on the top right of the four-panel figure. Perfectly normally distributed data would all fall on the dashed line. This data looks fairly close to normal, with some deviation at the tails.

3. Residuals should not be correlated with the explanatory variables
Another important assumption is that the residuals should not be correlated with any explanatory variables. Previously, we let the plot() command handle the residuals, but we need the residuals themselves now. To extract model residuals, we use lm_model$residuals. We can then use this command to plot the model residuals versus Girth, the explanatory variable.

par(mfrow = c(1,1))
plot(Girth, lm_model$residuals, ylab='residuals', pch=16)

Here we see there is very little correlation between our residuals and the Girth variable. Therefore, this assumption is met.

4. The mean of the residuals should equal zero
We can see that the mean of the residuals is close to zero from either of our plots, or simply by using mean(lm_model$residuals). Our assumption is met, as the mean is very close to zero.

5. Little or no multicollinearity
If we had more than one explanatory variable, we would have a multiple regression model. In that case, it would be important to verify that none of the explanatory variables are strongly correlated with one another. We only have one explanatory variable in our regression, so this assumption is not relevant here. A quick sketch of both checks follows.
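A minimal sketch of checks 4 and 5 (an addition; the cor() call uses the other trees columns purely for illustration):

mean(lm_model$residuals)  # should be very close to zero for an OLS fit
cor(trees[, c("Girth", "Height", "Volume")])  # pairwise correlations among candidate predictors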

Calculating Marginal Effects Exercises

A common experience for those in the social sciences migrating to R from SPSS or Stata is that some procedures that happened at the click of a button now require more work, or are too obscured by the unfamiliar language to see how to accomplish them. One such procedure I have experienced this with is calculating the marginal effects of a generalized linear model. In this exercise set, we will explore calculating marginal effects for linear, logistic, and probit regression models in R. Exercises in this section will be solved using the margins and mfx packages. It is recommended to take a look at the concise and excellent documentation for these packages before continuing. Answers to the exercises are available here.

Exercise 1
Load the mtcars dataset. Build a linear regression of mpg on wt, qsec, am, and hp.

Exercise 2
Print the coefficients from the linear model in the previous exercise.

Exercise 3
Using the margins package, find the marginal effects (a sketch follows Exercise 6 below).

Exercise 4
Verify that you receive the same results from Exercises 2 and 3. Why do these marginal effects match the coefficients found when printing the linear model object?

Exercise 5
Using the mtcars dataset, build a linear regression similar to Exercise 1, except include an interaction term between am and hp. Find the marginal effects for this regression.

Exercise 6
Using your favorite dataset (mine is field.goals from the nutshell package), construct a logistic regression.
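A hedged sketch of Exercises 1-4, assuming the margins package is installed:

library(margins)

fit <- lm(mpg ~ wt + qsec + am + hp, data = mtcars)
coef(fit)          # Exercise 2: the fitted coefficients

m <- margins(fit)  # Exercise 3: marginal effects
summary(m)

With no interactions or nonlinear terms, the derivative of the response with respect to each regressor is constant, which is why these marginal effects coincide with the coefficients.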

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

- Avoid model over-fitting using cross-validation for optimal parameter selection
- Explore maximum margin methods such as the best penalty of error term and support vector machines with linear and non-linear kernels
- And much more

Exercise 7
Explain why marginal effects for a logit model are more complex than for a linear model.

Exercise 8
For the next two exercises, you may use either package. Calculate the marginal effects at the mean.

Exercise 9
Calculate the average marginal effects.

Exercise 10
If these marginal effects are different, explain why.

Ridge regression in R exercises

The bias-variance tradeoff is always encountered in applying supervised learning algorithms. Least squares regression provides a good fit for the training set, but can suffer from high variance, which lowers predictive ability. To counter this problem, we can regularize the beta coefficients by employing a penalization term. Ridge regression applies an L2 penalty to the residual sum of squares; in contrast, LASSO regression, which was covered here previously, applies an L1 penalty. Using ridge regression, we can shrink the beta coefficients towards zero, which reduces variance at the cost of higher bias and can result in better predictive ability than least squares regression. In this exercise set we will use the glmnet package (package description: here) to implement ridge regression in R. Answers to the exercises are available here.

Exercise 1
Load the lars package and the diabetes dataset (Efron, Hastie, Johnstone and Tibshirani (2003) "Least Angle Regression" (with discussion), Annals of Statistics). This is the same dataset from the LASSO exercise set, and has patient-level data on the progression of diabetes. Next, load the glmnet package, which we will now use to implement ridge regression. The dataset has three matrices: x, x2 and y. x has a smaller set of independent variables, while x2 contains the full set with quadratic and interaction terms. y is the dependent variable, a quantitative measure of the progression of diabetes. Generate separate scatterplots with the line of best fit for all the predictors in x, with y on the vertical axis. Regress y on the predictors in x using OLS. We will use this result as a benchmark for comparison.

Exercise 2
Fit the ridge regression model using the glmnet function and plot the trace of the estimated coefficients against lambda. Note that the coefficients are shrunk closer to zero for higher values of lambda.
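A minimal sketch of Exercises 1 and 2, assuming the lars and glmnet packages are installed; in glmnet, alpha = 0 selects the ridge (L2) penalty:

library(lars)    # provides the diabetes dataset
library(glmnet)
data(diabetes)

ridge_fit <- glmnet(diabetes$x, diabetes$y, alpha = 0)  # alpha = 0 -> ridge penalty
plot(ridge_fit, xvar = "lambda", label = TRUE)          # coefficient trace against log(lambda)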

Exercise 3
Use the cv.glmnet function to get the cross-validation curve and the value of lambda that minimizes the mean cross-validation error.

Exercise 4
Using the minimum value of lambda from the previous exercise, get the estimated beta matrix. Note that the coefficients are lower than the least squares estimates.

Exercise 5
To get a more parsimonious model, we can use a higher value of lambda that is within one standard error of the minimum. Use this value of lambda to get the beta coefficients. Note the shrinkage effect on the estimates (see the sketch after Exercise 7 below).

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

- Avoid model over-fitting using cross-validation for optimal parameter selection
- Explore maximum margin methods such as the best penalty of error term and support vector machines with linear and non-linear kernels
- And much more

Exercise 6
Split the data randomly between a training set (80%) and a test set (20%). We will use these to get the prediction standard error for the least squares and ridge regression models.

Exercise 7
Fit the ridge regression model on the training set and get the estimated beta coefficients for both the minimum lambda and the higher lambda within one standard error of the minimum.
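A sketch of Exercises 3-5; lambda.min and lambda.1se are the fields that cv.glmnet exposes for the two lambda choices discussed above:

cv_fit <- cv.glmnet(diabetes$x, diabetes$y, alpha = 0)
plot(cv_fit)                    # cross-validation curve

cv_fit$lambda.min               # lambda minimizing the mean cross-validation error
coef(cv_fit, s = "lambda.min")  # exercise 4: estimates at the minimum
coef(cv_fit, s = "lambda.1se")  # exercise 5: more parsimonious fit within one standard error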

Exercise 8
Get predictions from the ridge regression model for the test set and calculate the prediction standard error. Do this for both the minimum lambda and the higher lambda within one standard error of the minimum (a sketch of Exercises 6-8 follows Exercise 10 below).

Exercise 9
Fit the least squares model on the training set.

Exercise 10
Get predictions from the least squares model for the test set and calculate the prediction standard error.
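One way Exercises 6-8 could look; the 80/20 split and computing the prediction standard error as the standard deviation of the held-out prediction errors are assumptions about the intended solution:

set.seed(123)  # for a reproducible split
n <- nrow(diabetes$x)
train <- sample(n, size = round(0.8 * n))

cv_train <- cv.glmnet(diabetes$x[train, ], diabetes$y[train], alpha = 0)
pred_min <- predict(cv_train, newx = diabetes$x[-train, ], s = "lambda.min")
sd(diabetes$y[-train] - as.numeric(pred_min))  # prediction standard error on the test set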

LASSO regression in R exercises

The Least Absolute Shrinkage and Selection Operator (LASSO) performs regularization and variable selection on a given model. Depending on the size of the penalty term, LASSO shrinks less relevant predictors to (possibly) zero. Thus, it enables us to consider a more parsimonious model. In this exercise set we will use the glmnet package (package description: here) to implement LASSO regression in R. Answers to the exercises are available here.

Exercise 1
Load the lars package and the diabetes dataset (Efron, Hastie, Johnstone and Tibshirani (2003) "Least Angle Regression", Annals of Statistics). This has patient-level data on the progression of diabetes. Next, load the glmnet package, which will be used to implement LASSO.

Exercise 2
The dataset has three matrices: x, x2 and y. While x has a smaller set of independent variables, x2 contains the full set with quadratic and interaction terms. y is the dependent variable, a quantitative measure of the progression of diabetes. It is a good idea to visually inspect the relationship of each of the predictors with the dependent variable. Generate separate scatterplots with the line of best fit for all the predictors in x, with y on the vertical axis. Use a loop to automate the process.

Exercise 3
Regress y on the predictors in x using OLS. We will use this result as a benchmark for comparison.

Exercise 4
Use the glmnet function to plot the path of each of x's variable coefficients against the L1 norm of the beta vector. This graph indicates at which stage each coefficient shrinks to zero (a sketch follows Exercise 10 below).

Learn more about the glmnet package in the online course Regression Machine Learning with R. In this course you will learn how to:

- Avoid model over-fitting using cross-validation for optimal parameter selection
- Explore maximum margin methods such as the best penalty of error term and support vector machines with linear and non-linear kernels
- And much more

Exercise 5
Use the cv.glmnet function to get the cross-validation curve and the value of lambda that minimizes the mean cross-validation error.

Exercise 6
Using the minimum value of lambda from the previous exercise, get the estimated beta matrix. Note that some coefficients have been shrunk to zero. This indicates which predictors are important in explaining the variation in y.

Exercise 7
To get a more parsimonious model, we can use a higher value of lambda that is within one standard error of the minimum. Use this value of lambda to get the beta coefficients. Note that more coefficients are now shrunk to zero.

Exercise 8
As mentioned earlier, x2 contains a wider variety of predictors. Using OLS, regress y on x2 and evaluate the results.

Exercise 9
Repeat exercise 4 for the new model.

Exercise 10
Repeat exercises 5 and 6 for the new model and see which coefficients are shrunk to zero. This is an effective way to narrow down the important predictors when there are many candidates.
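A sketch of Exercises 4-6; glmnet's default alpha = 1 gives the LASSO (L1) penalty:

lasso_fit <- glmnet(diabetes$x, diabetes$y)   # alpha = 1 (default) -> LASSO
plot(lasso_fit, xvar = "norm", label = TRUE)  # coefficient paths against the L1 norm

cv_lasso <- cv.glmnet(diabetes$x, diabetes$y)
coef(cv_lasso, s = "lambda.min")              # some coefficients are exactly zero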

Quantile Regression in R exercises

The standard OLS (ordinary least squares) model explains the relationship between independent variables and the conditional mean of the dependent variable. In contrast, quantile regression models this relationship for different quantiles of the dependent variable. In this exercise set we will use the quantreg package (package description: here) to implement quantile regression in R. Answers to the exercises are available here.

Exercise 1
Load the quantreg package and the barro dataset (Barro and Lee, 1994). This has data on GDP growth rates for various countries. Next, summarize the data.

Exercise 2
The dependent variable is y.net (annual change in per capita GDP). The remaining variables will be used to explain y.net. It is easier to combine variables using cbind before applying regression techniques. Combine the variables so that we can write Y ~ X.

Exercise 3
Regress y.net on the independent variables using OLS. We will use this result as a benchmark for comparison.
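Before the remaining exercises, a minimal sketch of the rq() interface they rely on (the two predictors here are an illustrative subset, not the full model used in the solutions below):

library(quantreg)
data(barro)

median_fit <- rq(y.net ~ lgdp2 + Iy2, tau = 0.5, data = barro)  # tau = 0.5 is the median
summary(median_fit)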

Exercise 4
Using the rq function, estimate the model at the median of y.net. Compare the results with those from exercise 3.

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

- Avoid model over-fitting using cross-validation for optimal parameter selection
- Explore maximum margin methods such as the best penalty of error term and support vector machines with linear and non-linear kernels
- And much more

Exercise 5
Estimate the model for the first and third quartiles and compare the results.

Exercise 6
Using a single command, estimate the model for 10 equally spaced deciles of y.net.

Exercise 7
The quantreg package also offers shrinkage estimators to determine which variables play the most important role in predicting y.net. Estimate the model with LASSO-based quantile regression at the median level with lambda = 0.5.

Exercise 8
Quantile plots are most useful for interpreting results. To do that, we need to define the sequence of percentiles. Use the seq function to define the sequence of percentiles from 5% to 95% with a jump of 5%.

Exercise 9
Use the result from exercise 8 to plot the graphs. Note that the red line is the OLS estimate, bounded by the dotted lines which represent confidence intervals.

Exercise 10
Using results from exercise 5, test whether the coefficients are significantly different for the first- and third-quartile based regressions.

Quantile Regression in R solutions

Below are the solutions to these exercises on quantile regression.

# # Exercise 1 # #
library(quantreg)
data(barro)
summary(barro)

(The summary output, giving the minimum, quartiles, median, mean, and maximum of each variable, is omitted here; the variables are y.net, lgdp2, mse2, fse2, fhe2, mhe2, lexp2, lintr2, gedy2, Iy2, gcony2, lblakp2, pol2, and ttrad2.)

# # Exercise 2 # #
y <- barro$y.net
x <- cbind(barro$lgdp2, barro$mse2, barro$fse2, barro$fhe2, barro$mhe2,
           barro$lexp2, barro$lintr2, barro$gedy2, barro$Iy2, barro$gcony2,
           barro$lblakp2, barro$pol2, barro$ttrad2)
colnames(x) <- c("Initial Per Capita GDP", "Male Secondary Education",
                 "Female Secondary Education", "Female Higher Education",
                 "Male Higher Education", "Life Expectancy", "Human Capital",
                 "Education/GDP", "Investment/GDP", "Public Consumption/GDP",
                 "Black Market Premium", "Political Instability",
                 "Growth Rate Terms Trade")

# # Exercise 3 # #
model3 <- lm(y ~ x)
summary(model3)

Call:
lm(formula = y ~ x)

(The residual summary and numeric estimates are omitted here. Initial Per Capita GDP (p on the order of 1e-13), Life Expectancy (p on the order of 1e-05), Black Market Premium (p on the order of 1e-09), and Growth Rate Terms Trade (p on the order of 1e-05) are highly significant (***); Male Secondary Education, Investment/GDP, Public Consumption/GDP, and Political Instability are significant at the 0.01 level (**). Residual standard error on 147 degrees of freedom; F-statistic on 13 and 147 DF, p-value < 2.2e-16.)

# # Exercise 4 # #
model4 <- rq(y ~ x, tau = 0.5)
summary(model4, se = "rank")

Call:
rq(formula = y ~ x, tau = 0.5)

tau: [1] 0.5

(The coefficient table, with rank-based lower and upper bounds for each of the fourteen coefficients, is omitted.)

# # Exercise 5 # #
model5a <- rq(y ~ x, tau = 0.25)
summary(model5a, se = "rank")

model5b <- rq(y ~ x, tau = 0.75)
summary(model5b, se = "rank")

(Each call prints a coefficient table like the one in exercise 4, for tau = 0.25 and tau = 0.75 respectively; the numeric output is omitted.)

# # Exercise 6 # #
model6 <- rq(y ~ x, tau = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9))
summary(model6, se = "rank")

(One coefficient table is printed for each tau from 0.1 to 0.9; the numeric output is omitted.)

# # Exercise 7 # #
model7 <- rq.fit.lasso(x, y, tau = 0.5, lambda = 0.5)
print(model7$coefficients)

(The printed coefficients are omitted.)
# Note that some beta estimates have been shrunk closer to zero.

# # Exercise 8 # #
tau_seq <- seq(0.05, 0.95, by = 0.05)

# # Exercise 9 # #
model8 <- rq(y ~ x, tau = tau_seq)
plot(summary(model8))

# Note that 'gcony2' is different from the OLS estimate at
# lower quantiles of 'y.net'.

# # Exercise 10 # #
anova(model5a, model5b)

Warning in summary.rq(x, se = se, covariance = TRUE): 8 non-positive fis
Warning in summary.rq(x, se = se, covariance = TRUE): 3 non-positive fis

Quantile Regression Analysis of Deviance Table

Model: y ~ x
Joint Test of Equality of Slopes: tau in { 0.25 0.75 }

(The Df, Resid Df, and F value are omitted; the test is significant at the 0.01 level (**).)

# Estimates are significantly different.

Evaluate your model with R Exercises

There was a time when statisticians had to crunch numbers manually to fit their data to a model. Since this process was so long, those statisticians usually did a lot of preliminary work: researching models that had worked in the past, or looking for studies in other scientific fields, like psychology or sociology, that could inform their model, with the goal of maximizing their chance of making a relevant model. Then they would create a model and an alternative model and choose the one that seemed more efficient. Now that even an average computer gives us incredible computing power, it's easy to make multiple models and choose the one that best fits the data. Even though it is better to have good prior knowledge of the process you are trying to analyze, and of other models used in the past, coming to a conclusion using mostly the data helps you avoid bias and create better models.

In this set of exercises, we'll see how to apply the most-used error metrics to your models, with the intention of rating them and choosing the one that is most appropriate for the situation. Most of these error metrics are not part of any R package; consequently, you have to apply the equations I give you to your data. Personally, I prefer to write a function which I can easily use on every one of my models, but there are many ways to code these equations. If your code is different from the one in the solution, feel free to post your code in the comments. Answers to the exercises are available here.

Exercise 1
We start by looking at error metrics we can use for regression models. For linear regression problems, the most-used metrics are the coefficient of determination R-squared, which shows what percentage of variance is explained by the model, and the adjusted R-squared, which penalizes models that use variables without an effective contribution to the model (see this page for more details). Load the attitude data set (available in base R) and make three linear models with the goal of explaining the rating variable. The first one uses all the variables from the dataset; the second uses the variables complaints, privileges, learning and advance as independent variables; and the third uses only the complaints, learning and advance variables. Then use the summary() function to print the R-squared and the adjusted R-squared.

Exercise 2
Another way to measure how well your model fits your data is to use the Root Mean Squared Error (RMSE), which is defined as the square root of the average of the squared errors made by your model. You can find the mathematical definition of the RMSE on this page. Calculate the RMSE of the predictions made by your three models.

Exercise 3
The mean absolute error (MAE) is a good alternative to the RMSE if you don't want to penalize large estimation errors of your model too heavily. The mathematical definition of the MAE can be found here. Calculate the MAE of the predictions made by the three models.
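A minimal sketch of Exercises 1-3; writing the metrics as reusable functions is one option among many:

# Error-metric helpers: root mean squared error and mean absolute error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae <- function(actual, predicted) mean(abs(actual - predicted))

model1 <- lm(rating ~ ., data = attitude)  # first model: all explanatory variables
rmse(attitude$rating, predict(model1))
mae(attitude$rating, predict(model1))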

Exercise 4
Sometimes certain prediction errors hurt your model more than others. For example, if you are trying to predict the financial loss of a business over a period of time, underestimating the loss would put the business at risk of bankruptcy, while overestimating the loss results in a conservative model. In those cases, using the Root Mean Squared Logarithmic Error (RMSLE) as an error metric is useful, since this metric penalizes underestimation. The RMSLE is given by the equation on this page. Calculate the RMSLE of the predictions made by the three models.

Exercise 5
Now that we've seen some examples of error metrics which can be used in a regression context, let's see five examples of error metrics which can be used when you perform clustering analysis. But first, we must create a clustering model to test those metrics on. Load the iris dataset and apply the kmeans algorithm. Since the iris dataset has three distinct labels, use the kmeans algorithm with three centers. Also, set the maximum number of iterations to 50 and use the Lloyd algorithm. Once that's done, take time to rearrange the labels of your predictions so they are compatible with the factors in the iris dataset (a sketch of this setup follows Exercise 10 below).

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:

- Avoid model over-fitting using cross-validation for optimal parameter selection
- Explore maximum margin methods such as the best penalty of error term and support vector machines with linear and non-linear kernels
- And much more

Exercise 6
Print the confusion matrix of your model.

Exercise 7
The easiest way to measure how well your model categorized the data is to calculate the accuracy, the recall and the precision of your results. Write three functions which return those individual values and calculate those metrics for your model.

Exercise 8
The F-measure summarizes the precision and recall of your model by calculating the harmonic mean of those two values. Write a function which returns the F-measure of your model and compute this measure twice for your data: once with a parameter of 2 and then with a parameter of 0.5.

Exercise 9
The purity is a measure of the homogeneity of your clusters: if every cluster regroups objects of the same class, you'll get a purity score of one, and if there's no majority class in any of the clusters, you'll get a purity score of 0. Write a function which returns the purity score of your model and test it on your predictions.

Exercise 10
The last error metric we'll see today is the Dunn index, which indicates whether the clusters are compact and separated. You can find the mathematical definition of the Dunn index here. Load the clValid package and use the dunn() function on your model to compute the Dunn index of your classification. Note that this function takes an integer vector representing the cluster partitioning as a parameter.
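A sketch of Exercises 5 and 6 under the stated settings (three centers, 50 iterations, Lloyd's algorithm); the set.seed() call is an addition for reproducibility:

set.seed(42)  # added for reproducibility
km <- kmeans(iris[, 1:4], centers = 3, iter.max = 50, algorithm = "Lloyd")

# Confusion matrix: cluster labels against the true species
table(cluster = km$cluster, species = iris$Species)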


More information

AP Stats: 3B ~ Least Squares Regression and Residuals. Objectives:

AP Stats: 3B ~ Least Squares Regression and Residuals. Objectives: Objectives: INTERPRET the slope and y intercept of a least-squares regression line USE the least-squares regression line to predict y for a given x CALCULATE and INTERPRET residuals and their standard

More information

Descriptive Statistics

Descriptive Statistics Chapter 3 Descriptive Statistics Chapter 2 presented graphical techniques for organizing and displaying data. Even though such graphical techniques allow the researcher to make some general observations

More information

Predicting Foreign Exchange Arbitrage

Predicting Foreign Exchange Arbitrage Predicting Foreign Exchange Arbitrage Stefan Huber & Amy Wang 1 Introduction and Related Work The Covered Interest Parity condition ( CIP ) should dictate prices on the trillion-dollar foreign exchange

More information

Notes on a Basic Business Problem MATH 104 and MATH 184 Mark Mac Lean (with assistance from Patrick Chan) 2011W

Notes on a Basic Business Problem MATH 104 and MATH 184 Mark Mac Lean (with assistance from Patrick Chan) 2011W Notes on a Basic Business Problem MATH 104 and MATH 184 Mark Mac Lean (with assistance from Patrick Chan) 2011W This simple problem will introduce you to the basic ideas of revenue, cost, profit, and demand.

More information

APPLICATIONS OF STATISTICAL DATA MINING METHODS

APPLICATIONS OF STATISTICAL DATA MINING METHODS Libraries Annual Conference on Applied Statistics in Agriculture 2004-16th Annual Conference Proceedings APPLICATIONS OF STATISTICAL DATA MINING METHODS George Fernandez Follow this and additional works

More information

Session 5. Predictive Modeling in Life Insurance

Session 5. Predictive Modeling in Life Insurance SOA Predictive Analytics Seminar Hong Kong 29 Aug. 2018 Hong Kong Session 5 Predictive Modeling in Life Insurance Jingyi Zhang, Ph.D Predictive Modeling in Life Insurance JINGYI ZHANG PhD Scientist Global

More information

Stat3011: Solution of Midterm Exam One

Stat3011: Solution of Midterm Exam One 1 Stat3011: Solution of Midterm Exam One Fall/2003, Tiefeng Jiang Name: Problem 1 (30 points). Choose one appropriate answer in each of the following questions. 1. (B ) The mean age of five people in a

More information

DATA HANDLING Five-Number Summary

DATA HANDLING Five-Number Summary DATA HANDLING Five-Number Summary The five-number summary consists of the minimum and maximum values, the median, and the upper and lower quartiles. The minimum and the maximum are the smallest and greatest

More information

General Business 706 Midterm #3 November 25, 1997

General Business 706 Midterm #3 November 25, 1997 General Business 706 Midterm #3 November 25, 1997 There are 9 questions on this exam for a total of 40 points. Please be sure to put your name and ID in the spaces provided below. Now, if you feel any

More information

COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 18, 2006, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTIONS

COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 18, 2006, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTIONS COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 18, 2006, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTIONS Answer all parts. Closed book, calculators allowed. It is important to show all working,

More information

Maximum Likelihood Estimation Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 13, 2018

Maximum Likelihood Estimation Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 13, 2018 Maximum Likelihood Estimation Richard Williams, University of otre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 3, 208 [This handout draws very heavily from Regression Models for Categorical

More information

Risk Analysis. å To change Benchmark tickers:

Risk Analysis. å To change Benchmark tickers: Property Sheet will appear. The Return/Statistics page will be displayed. 2. Use the five boxes in the Benchmark section of this page to enter or change the tickers that will appear on the Performance

More information

Tests for Two ROC Curves

Tests for Two ROC Curves Chapter 65 Tests for Two ROC Curves Introduction Receiver operating characteristic (ROC) curves are used to summarize the accuracy of diagnostic tests. The technique is used when a criterion variable is

More information

Impact of Unemployment and GDP on Inflation: Imperial study of Pakistan s Economy

Impact of Unemployment and GDP on Inflation: Imperial study of Pakistan s Economy International Journal of Current Research in Multidisciplinary (IJCRM) ISSN: 2456-0979 Vol. 2, No. 6, (July 17), pp. 01-10 Impact of Unemployment and GDP on Inflation: Imperial study of Pakistan s Economy

More information

This homework assignment uses the material on pages ( A moving average ).

This homework assignment uses the material on pages ( A moving average ). Module 2: Time series concepts HW Homework assignment: equally weighted moving average This homework assignment uses the material on pages 14-15 ( A moving average ). 2 Let Y t = 1/5 ( t + t-1 + t-2 +

More information

Predicting Charitable Contributions

Predicting Charitable Contributions Predicting Charitable Contributions By Lauren Meyer Executive Summary Charitable contributions depend on many factors from financial security to personal characteristics. This report will focus on demographic

More information

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, President, OptiMine Consulting, West Chester, PA ABSTRACT Data Mining is a new term for the

More information

SEX DISCRIMINATION PROBLEM

SEX DISCRIMINATION PROBLEM SEX DISCRIMINATION PROBLEM 5. Displaying Relationships between Variables In this section we will use scatterplots to examine the relationship between the dependent variable (starting salary) and each of

More information

WC-5 Just How Credible Is That Employer? Exploring GLMs and Multilevel Modeling for NCCI s Excess Loss Factor Methodology

WC-5 Just How Credible Is That Employer? Exploring GLMs and Multilevel Modeling for NCCI s Excess Loss Factor Methodology Antitrust Notice The Casualty Actuarial Society is committed to adhering strictly to the letter and spirit of the antitrust laws. Seminars conducted under the auspices of the CAS are designed solely to

More information

THE UNIVERSITY OF TEXAS AT AUSTIN Department of Information, Risk, and Operations Management

THE UNIVERSITY OF TEXAS AT AUSTIN Department of Information, Risk, and Operations Management THE UNIVERSITY OF TEXAS AT AUSTIN Department of Information, Risk, and Operations Management BA 386T Tom Shively PROBABILITY CONCEPTS AND NORMAL DISTRIBUTIONS The fundamental idea underlying any statistical

More information

STATISTICS 110/201, FALL 2017 Homework #5 Solutions Assigned Mon, November 6, Due Wed, November 15

STATISTICS 110/201, FALL 2017 Homework #5 Solutions Assigned Mon, November 6, Due Wed, November 15 STATISTICS 110/201, FALL 2017 Homework #5 Solutions Assigned Mon, November 6, Due Wed, November 15 For this assignment use the Diamonds dataset in the Stat2Data library. The dataset is used in examples

More information

Modelling the Sharpe ratio for investment strategies

Modelling the Sharpe ratio for investment strategies Modelling the Sharpe ratio for investment strategies Group 6 Sako Arts 0776148 Rik Coenders 0777004 Stefan Luijten 0783116 Ivo van Heck 0775551 Rik Hagelaars 0789883 Stephan van Driel 0858182 Ellen Cardinaels

More information

CHAPTER 2 Describing Data: Numerical

CHAPTER 2 Describing Data: Numerical CHAPTER Multiple-Choice Questions 1. A scatter plot can illustrate all of the following except: A) the median of each of the two variables B) the range of each of the two variables C) an indication of

More information

Web Extension: Continuous Distributions and Estimating Beta with a Calculator

Web Extension: Continuous Distributions and Estimating Beta with a Calculator 19878_02W_p001-008.qxd 3/10/06 9:51 AM Page 1 C H A P T E R 2 Web Extension: Continuous Distributions and Estimating Beta with a Calculator This extension explains continuous probability distributions

More information

Standardized Data Percentiles, Quartiles and Box Plots Grouped Data Skewness and Kurtosis

Standardized Data Percentiles, Quartiles and Box Plots Grouped Data Skewness and Kurtosis Descriptive Statistics (Part 2) 4 Chapter Percentiles, Quartiles and Box Plots Grouped Data Skewness and Kurtosis McGraw-Hill/Irwin Copyright 2009 by The McGraw-Hill Companies, Inc. Chebyshev s Theorem

More information

6 Multiple Regression

6 Multiple Regression More than one X variable. 6 Multiple Regression Why? Might be interested in more than one marginal effect Omitted Variable Bias (OVB) 6.1 and 6.2 House prices and OVB Should I build a fireplace? The following

More information

Frequency Distributions

Frequency Distributions Frequency Distributions January 8, 2018 Contents Frequency histograms Relative Frequency Histograms Cumulative Frequency Graph Frequency Histograms in R Using the Cumulative Frequency Graph to Estimate

More information

Lasso and Ridge Quantile Regression using Cross Validation to Estimate Extreme Rainfall

Lasso and Ridge Quantile Regression using Cross Validation to Estimate Extreme Rainfall Global Journal of Pure and Applied Mathematics. ISSN 0973-1768 Volume 12, Number 3 (2016), pp. 3305 3314 Research India Publications http://www.ripublication.com/gjpam.htm Lasso and Ridge Quantile Regression

More information

Estimating a demand function

Estimating a demand function Estimating a demand function One of the most basic topics in economics is the supply/demand curve. Simply put, the supply offered for sale of a commodity is directly related to its price, while the demand

More information

Gov 2001: Section 5. I. A Normal Example II. Uncertainty. Gov Spring 2010

Gov 2001: Section 5. I. A Normal Example II. Uncertainty. Gov Spring 2010 Gov 2001: Section 5 I. A Normal Example II. Uncertainty Gov 2001 Spring 2010 A roadmap We started by introducing the concept of likelihood in the simplest univariate context one observation, one variable.

More information

Maximum Likelihood Estimation Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 10, 2017

Maximum Likelihood Estimation Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 10, 2017 Maximum Likelihood Estimation Richard Williams, University of otre Dame, https://www3.nd.edu/~rwilliam/ Last revised January 0, 207 [This handout draws very heavily from Regression Models for Categorical

More information