Regression Model Assumptions Solutions
- Maud Whitehead
- 5 years ago
Below are the solutions to these exercises on model diagnostics using residual plots.

# Exercise 1 #

data("cars")
head(cars)

[head() output omitted: the two columns are speed and dist; the numeric values were lost in transcription]

# Exercise 2 #

attach(cars)
plot(speed, dist, pch = 16, las = 1, cex = 1.2, cex.lab = 1.2,
     xlab = 'speed (mph)', ylab = 'stopping distance (ft)')
There is a correlation between stopping distance and speed: stopping distance increases as speed increases.

# Exercise 3 #

lm_model <- lm(dist ~ speed)
plot(speed, dist, pch = 16, las = 1, cex = 1.2, cex.lab = 1.2,
     xlab = 'speed (mph)', ylab = 'stopping distance (ft)')
abline(lm_model, lwd = 3, col = 'red')
# Exercise 4 #

summary(lm_model)

[summary output abridged: the intercept is significant at the 0.05 level and the speed coefficient is highly significant; F-statistic on 1 and 48 DF, overall p-value 1.49e-12; most numeric estimates were lost in transcription]

The estimated slope coefficient is significantly different from zero. The R-squared values indicate that about 65% of the variance in stopping distance can be explained by speed.

# Exercise 5 #

par(mfrow = c(2, 2))
plot(lm_model)
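As a side note (not part of the original solution), the four diagnostic panels can also be drawn one at a time through the `which` argument of plot(), which is standard plot.lm behavior:

```r
# Draw individual diagnostic plots via plot.lm's 'which' argument:
# 1 = residuals vs fitted, 2 = normal Q-Q, 3 = scale-location,
# 4 = Cook's distance, 5 = residuals vs leverage.
data(cars)
lm_model <- lm(dist ~ speed, data = cars)
plot(lm_model, which = 2)  # just the normal Q-Q plot
```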
# Exercise 6 #

Yes, the data do appear to be homoscedastic: the distribution of the errors is roughly the same for all values of the explanatory variable. This can be seen in the top-left and bottom-left plots of the four-panel figure in Exercise 5. A perfectly horizontal line at 0 would indicate that the mean of the residuals is zero and the same for all values of the explanatory variable. There is some deviation from a horizontal line at the tails of the explanatory variable.

# Exercise 7 #
The Q-Q plot (top-right of the four-panel figure) indicates that the residuals are close to normally distributed. There is some deviation from normality at the tails of the distribution, where there are fewer data points.

# Exercise 8 #

par(mfrow = c(1, 1))
plot(speed, lm_model$residuals, pch = 16, ylab = 'residuals')

The residuals are not correlated with the explanatory variable.
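As a quick numeric complement to the plot (not part of the original solution): least squares residuals are uncorrelated with the regressor by construction, so the sample correlation is zero up to floating-point rounding.

```r
data(cars)
lm_model <- lm(dist ~ speed, data = cars)
# OLS residuals are orthogonal to the regressors, so this is ~0
cor(cars$speed, lm_model$residuals)
```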
# Exercise 9 #

There are several points (in the bottom-right plot of the four-panel figure) with a large Cook's distance that may have an influence on the regression, including the points labeled 49, 23, and 39.

# Exercise 10 #

cars_no_49 <- cars[-49, ]
lm_model2 <- lm(cars_no_49$dist ~ cars_no_49$speed)
summary(lm_model2)

[summary output abridged: the intercept is significant at the 0.05 level and the speed coefficient is highly significant; residual standard error 14.1 on 47 degrees of freedom; F-statistic on 1 and 47 DF, overall p-value 3.262e-12; most numeric estimates were lost in transcription]
We see that removing the 49th record changed some values in the regression model, but not enough to qualitatively alter the results; for example, the estimated slope coefficient is now slightly smaller.

Regression Model Assumptions Exercises

You might fit a statistical model to a set of data and obtain parameter estimates. However, you are not done at this point: you need to make sure the assumptions of the particular model you used were met. One tool is to examine the model residuals; we previously discussed this in a tutorial. The residuals are the differences between your observed data and your predicted values. In this exercise set, you will examine several aspects of residual plots. These residual plots help determine whether you have met your model assumptions. Answers to the exercises are available here.

Exercise 1
Load the cars data set using the data() function. This data contains the stopping distances (feet) for different car speeds (miles per hour). The data was recorded in the 1920s.
Exercise 2
Plot stopping distances on the y-axis and car speeds on the x-axis. What kind of pattern is present?

Exercise 3
Use the lm() function to fit a linear model to the data with the stopping distance as the response variable. Plot the line of best fit.

Exercise 4
Use summary() to obtain parameter estimates and model details. Is the slope significantly different from zero? How much of the variance can be explained by car speed?

Exercise 5
Use the plot() command on the linear model to obtain the four plots of residuals.

Learn more about evaluating different statistical models in the online courses Linear regression in R for Data Scientists and Structural equation modeling (SEM) with lavaan. These courses cover different statistical models that can help you choose the right design for your solution.

Exercise 6
Are the data homoscedastic? Homoscedastic data means the distribution of errors is the same for all values of the explanatory variable.

Exercise 7
Are the residuals normally distributed?

Exercise 8
Are the residuals correlated with the explanatory variable?

Exercise 9
Bonus test. Now take a look at the fourth plot: the residuals versus leverage. This plot does not indicate whether we have met model assumptions, but it does tell us whether certain data points are more influential than others in the regression. Points that have been labeled with a number have a high Cook's distance, which means they are particularly influential for the regression. These are usually the points not clustered with the majority of points. Are there any points with a high Cook's distance?

Exercise 10
Remove the 49th record (the one with a large Cook's distance) from the data. How does this change the parameter estimates of the regression model?

Regression Model Assumptions Tutorial

Regression is used to explore the relationship between one variable (often termed the response) and one or more other variables (termed explanatory). Several exercises are already available on simple linear regression and multiple regression. These are fantastic tools that are used frequently. However, each has a number of assumptions that need to be met; unfortunately, people often conduct regression analyses without checking those assumptions. In this tutorial, we will focus on how to check the assumptions of simple linear regression. We will use the trees data already found in R. The data
includes the girth, height, and volume of 31 black cherry trees. The following code loads the data and then creates a plot of height versus girth. The red line is the line of best fit from linear regression.

data("trees")
attach(trees)
plot(Girth, Height, pch = 16, las = 1)
lm_model <- lm(Height ~ Girth)
abline(lm_model, lwd = 2, col = 'red')

We can also use summary() to obtain the model estimates from linear regression.

summary(lm_model)

Call:
lm(formula = Height ~ Girth)

[summary output abridged: the intercept is highly significant and Girth is significant at the 0.01 level; residual standard error on 29 degrees of freedom; most numeric estimates were lost in transcription]

We obtain parameter estimates, an R-squared value, and other useful bits of information. Great. Using several graphs, let's examine whether our model has met the assumptions of linear regression. We are going to examine the model residuals. A residual is simply the difference between each observed value (the black points in the first graph) and the predicted value (the red line).

Thankfully, the plot() command provides several useful plots to check our model assumptions about residuals.

par(mfrow = c(2, 2))
plot(lm_model)
Okay, there is a lot going on in these graphs. Let's break down each assumption and how it relates to these four graphs.

1. Homoscedasticity of the residuals
The residuals should be homoscedastic: the distribution of errors should be the same for all values of the explanatory variable. In the top-left figure, we see the residuals plotted against the fitted values. If the residuals are homoscedastic, we would expect the red line to fall on zero for the entire range of fitted values. There is some discrepancy, but the assumption looks to hold overall. The bottom-left plot can also be used to examine homoscedasticity; here, standardized residuals are used, with the same reasoning that the red line should be horizontal. It is important to remember that examining model assumptions graphically is more of an art than a science.

2. Normality of the residuals
The residuals should also be normally distributed. To check for normality, examine the qqnorm() plot in the top right of the four-panel figure. Perfectly normally distributed data would all fall on the dashed line. This data looks fairly close to normal, with some deviation at the tails.

3. Residuals should not be correlated with the explanatory variables
Another important assumption is that the residuals should not be correlated with any explanatory variables. Previously, we let the plot() command handle the residuals, but now we need the residuals themselves. To extract the model residuals we use lm_model$residuals. We can then plot the model residuals versus Girth, the explanatory variable.

par(mfrow = c(1, 1))
plot(Girth, lm_model$residuals, ylab = 'residuals', pch = 16)
Here we see there is very little correlation between our residuals and the Girth variable, so this assumption is met.

4. The mean of the residuals should equal zero
We can see that the mean of the residuals is close to zero from our plots, or simply by running mean(lm_model$residuals). Our assumption is met, as the mean is very close to zero.

5. Little or no multicollinearity
If we had more than one explanatory variable, we would have a multiple regression model. In that case, it would be important to verify that none of the explanatory variables are strongly correlated with one another. We only have one explanatory variable in our regression, so this assumption is not relevant here.

Calculating Marginal Effects Exercises

A common experience for those in the social sciences migrating to R from SPSS or Stata is that some procedures that happened at the click of a button now require more work, or are too obscured by the unfamiliar language to see how to accomplish them. One such procedure I've experienced this with is calculating the marginal effects of a generalized linear model. In this exercise set, we will
explore calculating marginal effects for linear, logistic, and probit regression models in R. The exercises in this section will be solved using the margins and mfx packages. It is recommended to take a look at the concise and excellent documentation for these packages before continuing. Answers to the exercises are available here.

Exercise 1
Load the mtcars dataset. Build a linear regression of mpg on wt, qsec, am, and hp.

Exercise 2
Print the coefficients from the linear model in the previous exercise.

Exercise 3
Using the margins package, find the marginal effects.

Exercise 4
Verify that you receive the same results from Exercises 2 and 3. Why do these marginal effects match the coefficients found when printing the linear model object?

Exercise 5
Using the mtcars dataset, build a linear regression similar to Exercise 1, except include an interaction term between am and hp. Find the marginal effects for this regression.

Exercise 6
Using your favorite dataset (mine is field.goals from the nutshell package), construct a logistic regression.

Learn more about Model Evaluation in the online course Regression Machine Learning with R. In this course you will learn how to:
Avoid model over-fitting using cross-validation for optimal parameter selection
Explore maximum margin methods such as the best penalty of error term, and support vector machines with linear and nonlinear kernels
And much more

Exercise 7
Explain why marginal effects for a logit model are more complex than for a linear model.

Exercise 8
For the next two exercises, you may use either package. Calculate the marginal effects at the mean.

Exercise 9
Calculate the average marginal effects.

Exercise 10
If these marginal effects are different, explain why.

Ridge regression in R exercises

The bias-variance tradeoff is always encountered in applying
supervised learning algorithms. Least squares regression provides a good fit for the training set, but it can suffer from high variance, which lowers predictive ability. To counter this problem, we can regularize the beta coefficients by employing a penalization term. Ridge regression applies an l2 penalty to the residual sum of squares; in contrast, LASSO regression, which was covered here previously, applies an l1 penalty. Using ridge regression, we can shrink the beta coefficients towards zero, which reduces variance at the cost of higher bias and can result in better predictive ability than least squares regression. In this exercise set we will use the glmnet package (package description: here) to implement ridge regression in R. Answers to the exercises are available here.

Exercise 1
Load the lars package and the diabetes dataset (Efron, Hastie, Johnstone and Tibshirani (2003) Least Angle Regression (with discussion), Annals of Statistics). This is the same dataset from the LASSO exercise set and has patient-level data on the progression of diabetes. Next, load the glmnet package that we will now use to implement ridge regression. The dataset has three matrices: x, x2 and y. x has a smaller set of independent variables, while x2 contains the full set with quadratic and interaction terms. y is the dependent variable, a quantitative measure of the progression of diabetes. Generate separate scatterplots with the line of best fit for all the predictors in x, with y on the vertical axis. Regress y on the predictors in x using OLS. We will use this result as a benchmark for comparison.

Exercise 2
Fit the ridge regression model using the glmnet function and plot the trace of the estimated coefficients against lambda. Note that the coefficients are shrunk closer to zero for higher values of lambda.
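To build intuition for what glmnet is doing, here is a minimal sketch of ridge shrinkage on simulated data, using the closed-form estimate beta = (X'X + lambda I)^(-1) X'y. (glmnet standardizes the data and fits an intercept, so its numbers will differ; this is an illustration, not the exercise solution.)

```r
set.seed(1)
X <- scale(matrix(rnorm(100 * 3), 100, 3))   # simulated, standardized predictors
y <- X %*% c(2, -1, 0.5) + rnorm(100)        # known coefficients plus noise

# closed-form ridge estimate: (X'X + lambda I)^(-1) X'y
ridge_beta <- function(X, y, lambda)
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))

# the l2 norm of the coefficient vector shrinks as lambda grows
sum(ridge_beta(X, y, 50)^2) < sum(ridge_beta(X, y, 0)^2)  # TRUE
```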
Exercise 3
Use the cv.glmnet function to get the cross-validation curve and the value of lambda that minimizes the mean cross-validation error.

Exercise 4
Using the minimum value of lambda from the previous exercise, get the estimated beta matrix. Note that the coefficients are lower than the least squares estimates.

Exercise 5
To get a more parsimonious model, we can use a higher value of lambda that is within one standard error of the minimum. Use this value of lambda to get the beta coefficients. Note the shrinkage effect on the estimates.

Exercise 6
Split the data randomly between a training set (80%) and a test set (20%). We will use these to get the prediction standard error for the least squares and ridge regression models.

Exercise 7
Fit the ridge regression model on the training set and get the estimated beta coefficients for both the minimum lambda and the higher lambda within one standard error of the minimum.
Exercise 8
Get predictions from the ridge regression model for the test set and calculate the prediction standard error. Do this for both the minimum lambda and the higher lambda within one standard error of the minimum.

Exercise 9
Fit the least squares model on the training set.

Exercise 10
Get predictions from the least squares model for the test set and calculate the prediction standard error.

LASSO regression in R exercises

The Least Absolute Shrinkage and Selection Operator (LASSO) performs regularization and variable selection on a given model. Depending on the size of the penalty term, LASSO shrinks less relevant predictors to (possibly) zero, enabling us to consider a more parsimonious model. In this exercise set we will use the glmnet package (package description: here) to implement LASSO regression in R. Answers to the exercises are available here.
Exercise 1
Load the lars package and the diabetes dataset (Efron, Hastie, Johnstone and Tibshirani (2003) Least Angle Regression, Annals of Statistics). This has patient-level data on the progression of diabetes. Next, load the glmnet package, which will be used to implement LASSO.

Exercise 2
The dataset has three matrices: x, x2 and y. While x has a smaller set of independent variables, x2 contains the full set with quadratic and interaction terms. y is the dependent variable, a quantitative measure of the progression of diabetes. It is a good idea to visually inspect the relationship of each of the predictors with the dependent variable. Generate separate scatterplots with the line of best fit for all the predictors in x, with y on the vertical axis. Use a loop to automate the process.

Exercise 3
Regress y on the predictors in x using OLS. We will use this result as a benchmark for comparison.

Exercise 4
Use the glmnet function to plot the path of each of x's variable coefficients against the L1 norm of the beta vector. This graph indicates at which stage each coefficient shrinks to zero.

Exercise 5
Use the cv.glmnet function to get the cross-validation curve and the value of lambda that minimizes the mean cross-validation error.

Exercise 6
Using the minimum value of lambda from the previous exercise, get the estimated beta matrix. Note that some coefficients have been shrunk to zero. This indicates which predictors are important in explaining the variation in y.

Exercise 7
To get a more parsimonious model, we can use a higher value of lambda that is within one standard error of the minimum. Use this value of lambda to get the beta coefficients. Note that more coefficients are now shrunk to zero.

Exercise 8
As mentioned earlier, x2 contains a wider variety of predictors. Using OLS, regress y on x2 and evaluate the results.

Exercise 9
Repeat Exercise 4 for the new model.

Exercise 10
Repeat Exercises 5 and 6 for the new model and see which coefficients are shrunk to zero. This is an effective way to narrow down important predictors when there are many candidates.
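The mechanism behind this variable selection is the soft-thresholding operator that the l1 penalty induces; a minimal illustration (not from the exercise set):

```r
# Soft-thresholding: shrink values toward zero and set small ones exactly
# to zero. This is why LASSO zeroes out less relevant coefficients, while
# ridge's l2 penalty only shrinks them.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

soft_threshold(c(-3, -0.2, 0.1, 2), lambda = 0.5)
# -2.5  0.0  0.0  1.5
```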
Quantile Regression in R exercises

The standard OLS (Ordinary Least Squares) model explains the relationship between the independent variables and the conditional mean of the dependent variable. In contrast, quantile regression models this relationship for different quantiles of the dependent variable. In this exercise set we will use the quantreg package (package description: here) to implement quantile regression in R. Answers to the exercises are available here.

Exercise 1
Load the quantreg package and the barro dataset (Barro and Lee, 1994). This has data on GDP growth rates for various countries. Next, summarize the data.

Exercise 2
The dependent variable is y.net (annual change in per capita GDP). The remaining variables will be used to explain y.net. It is easier to combine variables using cbind before applying regression techniques. Combine the variables so that we can write Y ~ X.

Exercise 3
Regress y.net on the independent variables using OLS. We will use this result as a benchmark for comparison.
Exercise 4
Using the rq function, estimate the model at the median of y.net. Compare the results with Exercise 3.

Exercise 5
Estimate the model for the first and third quartiles and compare the results.

Exercise 6
Using a single command, estimate the model for each decile of y.net (tau = 0.1 to 0.9).

Exercise 7
The quantreg package also offers shrinkage estimators to determine which variables play the most important role in predicting y.net. Estimate the model with LASSO-based quantile regression at the median level with lambda = 0.5.

Exercise 8
Quantile plots are most useful for interpreting results. To do that we need to define the sequence of percentiles. Use the seq function to define the sequence of percentiles from 5% to 95% with a jump of 5%.

Exercise 9
Use the result from Exercise 8 to plot the graphs. Note that
the red line is the OLS estimate, bounded by the dotted lines which represent confidence intervals.

Exercise 10
Using the results from Exercise 5, test whether the coefficients are significantly different for the first and third quartile based regressions.

Quantile Regression in R solutions

Below are the solutions to these exercises on quantile regression.

# # Exercise 1 # #

library(quantreg)
data(barro)
summary(barro)

[summary output omitted: the numeric quartiles, means, and extremes were lost in transcription; the variables are y.net, lgdp2, mse2, fse2, fhe2, mhe2, lexp2, lintr2, gedy2, Iy2, gcony2, lblakp2, pol2 and ttrad2]

# # Exercise 2 # #

y <- barro$y.net
x <- cbind(barro$lgdp2, barro$mse2, barro$fse2, barro$fhe2,
           barro$mhe2, barro$lexp2, barro$lintr2, barro$gedy2,
           barro$Iy2, barro$gcony2, barro$lblakp2, barro$pol2,
           barro$ttrad2)
colnames(x) <- c("Initial Per Capita GDP", "Male Secondary Education",
                 "Female Secondary Education", "Female Higher Education",
                 "Male Higher Education", "Life Expectancy",
                 "Human Capital", "Education/GDP", "Investment/GDP",
                 "Public Consumption/GDP", "Black Market Premium",
                 "Political Instability", "Growth Rate Terms Trade")
# # Exercise 3 # #

model3 <- lm(y ~ x)
summary(model3)

[summary output abridged: numeric estimates were lost in transcription. Significant predictors include Initial Per Capita GDP, Male Secondary Education, Life Expectancy, Investment/GDP, Public Consumption/GDP, Black Market Premium, Political Instability and Growth Rate Terms Trade; residual standard error on 147 degrees of freedom; F-statistic on 13 and 147 DF, p-value < 2.2e-16]

# # Exercise 4 # #

model4 <- rq(y ~ x, tau = 0.5)
summary(model4, se = "rank")

[rq summary output omitted: the coefficient table with lower and upper bounds was lost in transcription]

# # Exercise 5 # #

model5a <- rq(y ~ x, tau = 0.25)
summary(model5a, se = "rank")

model5b <- rq(y ~ x, tau = 0.75)
summary(model5b, se = "rank")

[rq summary outputs omitted: the coefficient tables for tau = 0.25 and tau = 0.75 were lost in transcription]

# # Exercise 6 # #

model6 <- rq(y ~ x, tau = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9))
summary(model6, se = "rank")

[rq summary outputs omitted: one coefficient table is printed for each of the nine values of tau; the numeric values were lost in transcription]

# # Exercise 7 # #

model7 <- rq.fit.lasso(x, y, tau = 0.5, lambda = 0.5)
print(model7$coefficients)

[coefficient printout abridged: numeric values were lost in transcription]

# Note that some beta estimates have been shrunk closer to zero.

# # Exercise 8 # #

tau_seq <- seq(0.05, 0.95, by = 0.05)

# # Exercise 9 # #

model8 <- rq(y ~ x, tau = tau_seq)
plot(summary(model8))
# Note that 'gcony2' is different from the OLS estimate at
# lower quantiles of 'y.net'.

# # Exercise 10 # #

anova(model5a, model5b)

[anova output abridged: warnings about non-positive fis; Quantile Regression Analysis of Deviance Table, Model: y ~ x, joint test of equality of slopes for tau in {0.25, 0.75}; the F value is significant at the 0.01 level]

# Estimates are significantly different.
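As background on what rq() minimizes (an illustration, not part of the original solutions): quantile regression replaces the squared-error loss with the tilted absolute "check" loss, whose minimizer over a constant is the corresponding sample quantile.

```r
# check (pinball) loss: rho_tau(u) = u * (tau - I(u < 0)), summed over errors
pinball <- function(u, tau) sum(u * (tau - (u < 0)))

y <- c(1, 2, 3, 10, 100)
# minimizing the tau = 0.5 check loss over a constant recovers the median
q_hat <- optimize(function(q) pinball(y - q, tau = 0.5), range(y))$minimum
```

For tau = 0.5 the check loss is proportional to the absolute error, which is why median regression is robust to the outlying value 100 in a way that the mean is not.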
Evaluate your model with R Exercises

There was a time when statisticians had to crunch numbers by hand to fit a model to their data. Since this process was so long, they usually did a lot of preliminary work: researching models that had worked in the past, or looking for studies in other scientific fields, like psychology or sociology, that could inform their model, with the goal of maximizing the chance of building a relevant one. They would then create a model and an alternative model and choose the one that seemed more efficient. Now that even an average computer gives us incredible computing power, it is easy to make multiple models and choose the one that best fits the data. Even though it is better to have good prior knowledge of the process you are trying to analyze and of models used in the past, coming to a conclusion using mostly the data helps you avoid bias and create better models.

In this set of exercises, we'll see how to apply the most used error metrics to your models, with the intention of rating them and choosing the one that is most appropriate for the situation. Most of these error metrics are not part of any R package, so you will have to apply the equations I give you to your data. Personally, I prefer to write a function that I can easily use on every one of my models, but there are many ways to code these equations. If your code is different from the one in the
solution, feel free to post your code in the comments. Answers to the exercises are available here.

Exercise 1
We start by looking at error metrics we can use for regression models. For linear regression problems, the most used metrics are the coefficient of determination R², which shows what percentage of the variance is explained by the model, and the adjusted R², which penalizes models that use variables without an effective contribution (see this page for more details). Load the attitude data set (included with base R) and make three linear models with the goal of explaining the rating variable: the first uses all the variables in the dataset, the second uses complaints, privileges, learning and advance as independent variables, and the third uses only complaints, learning and advance. Then use the summary() function to print R² and the adjusted R².

Exercise 2
Another way to measure how well your model fits your data is the Root Mean Squared Error (RMSE), defined as the square root of the average of the squared errors made by your model. You can find the mathematical definition of the RMSE on this page. Calculate the RMSE of the predictions made by your three models.

Exercise 3
The mean absolute error (MAE) is a good alternative to the RMSE if you don't want to penalize large estimation errors too heavily. The mathematical definition of the MAE can be found here. Calculate the MAE of the predictions made by the three models.
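For Exercises 2 and 3, here is a minimal sketch of the two metrics as reusable functions (the function names and the usage line are my own choice, not from the solutions):

```r
# RMSE: square root of the mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# MAE: mean of the absolute errors
mae <- function(actual, predicted) mean(abs(actual - predicted))

# e.g. for a hypothetical model1 fitted on the attitude data:
# rmse(attitude$rating, predict(model1))
```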
Exercise 4
Sometimes some prediction errors hurt your model more than others. For example, if you are trying to predict the financial loss of a business over a period of time, underestimating the loss would put the business at risk of bankruptcy, while overestimating it would merely result in a conservative model. In those cases, the Root Mean Squared Logarithmic Error (RMSLE) is a useful error metric, since it penalizes underestimation more than overestimation. The RMSLE is given by the equation RMSLE = sqrt(mean((log(predicted + 1) - log(actual + 1))²)), as given on this page. Calculate the RMSLE of the predictions made by the three models.

Exercise 5
Now that we have seen some examples of error metrics that can be used in a regression context, let's see five examples of error metrics that can be used when you perform clustering analysis. But first, we must create a clustering model to test those metrics on. Load the iris dataset and apply the k-means algorithm. Since the iris dataset has three distinct labels, use the k-means algorithm with three centers. Also, set the maximum number of iterations to 50 and use the Lloyd algorithm. Once that is done, take the time to rearrange the labels of your predictions so they are compatible with the factors in the iris dataset.

Learn more about model evaluation in the online course Regression Machine Learning with R. In this course you will learn how to: avoid model over-fitting using cross-validation for optimal parameter selection; explore maximum-margin methods such as the best penalty of
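A sketch of the RMSLE function and the clustering setup, assuming the common log(x + 1) form of the metric; the seed value is an arbitrary choice I added so the k-means result is reproducible:

```r
# Exercise 4: RMSLE penalizes underestimation more than overestimation
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted + 1) - log(actual + 1))^2))
}

# Exercise 5: k-means with three centers, 50 iterations, Lloyd's algorithm
data(iris)
set.seed(42)  # arbitrary seed; kmeans() starts from random centers
km <- kmeans(iris[, 1:4], centers = 3, iter.max = 50, algorithm = "Lloyd")

# Cross-tabulate clusters against species to decide how to relabel them
table(km$cluster, iris$Species)
```

The cross-tabulation shows which cluster number corresponds to which species, so you can rename the cluster labels to match the factor levels of iris$Species.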
error term in support vector machines with linear and nonlinear kernels; and much more.

Exercise 6
Print the confusion matrix of your model.

Exercise 7
The easiest way to measure how well your model categorized the data is to calculate the accuracy, the recall and the precision of your results. Write three functions that return those individual values, and calculate those metrics for your model.

Exercise 8
The F-measure summarizes the precision and recall values of your model by calculating a weighted harmonic mean of those two values. Write a function that returns the F-measure of your model, and compute this measure twice for your data: once with a parameter of 2 and once with a parameter of 0.5.

Exercise 9
Purity is a measure of the homogeneity of your clusters: if each cluster groups only objects of the same class, you get a purity score of one, and if there is no majority class in any of the clusters, you get a purity score near 0. Write a function that returns the purity score of your model and test it on your predictions.

Exercise 10
The last error metric we will see today is the Dunn index, which indicates whether the clusters are compact and well separated. You can find the mathematical definition of the Dunn index here. Load the clValid package and use the dunn() function on your model to compute the Dunn index of your classification. Note that this function takes an integer vector representing the cluster partitioning as a parameter.
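The classification metrics above can be sketched as follows. This assumes the truth and prediction vectors are factors sharing the same levels (so the confusion matrix is square), and it macro-averages precision and recall over classes, which is one common convention rather than the only one; the commented dunn() call shows the clValid usage:

```r
# Exercise 6: the confusion matrix is just a cross-tabulation
# confusion <- table(truth, pred)

accuracy <- function(truth, pred) mean(truth == pred)

# Exercise 7: per-class precision/recall, macro-averaged over classes
precision <- function(truth, pred) {
  cm <- table(truth, pred)       # rows = truth, columns = predictions
  mean(diag(cm) / colSums(cm))   # true positives / predicted positives
}
recall <- function(truth, pred) {
  cm <- table(truth, pred)
  mean(diag(cm) / rowSums(cm))   # true positives / actual positives
}

# Exercise 8: weighted F-measure; beta > 1 favors recall, beta < 1 precision
f_measure <- function(truth, pred, beta = 1) {
  p <- precision(truth, pred)
  r <- recall(truth, pred)
  (1 + beta^2) * p * r / (beta^2 * p + r)
}

# Exercise 9: fraction of points in the majority class of their cluster
purity <- function(truth, cluster) {
  cm <- table(cluster, truth)
  sum(apply(cm, 1, max)) / length(truth)
}

# Exercise 10: Dunn index via the clValid package (not run here)
# library(clValid)
# dunn(Data = iris[, 1:4], clusters = km$cluster)
```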