Regression Lecture Notes VII Statistics 112, Fall 2002
Outline
- Predicting \(Y\) based on \(X\).
- Use of the conditional mean (the regression function) to make predictions.
- Prediction based on a sample.
- Regression line.
- Least squares line.
Reading for today: Section 2.3. Also review the concept of a conditional distribution and the conditional mean from Chapter 4.
Reading for next time: Section 2.4.
A Prediction Problem
You are trying to determine how many years to stay in school. You are currently focusing on the economic aspects and want to predict what your earnings will be for various levels of education. Suppose that you have available the joint distribution of education and earnings from a population of individuals like yourself.

                          Earnings
Educ. (years)   $30,000   $40,000   $50,000   $60,000
12                0.7       0.1       0.1       0.1
14                0.1       0.7       0.1       0.1
16                0.1       0.1       0.7       0.1
18                0.1       0.1       0.1       0.7

Table 1: Conditional distribution of earnings given education

What is your prediction of your earnings if you only graduate from high school?
What is your prediction of your earnings if you graduate from college?
Conditional Mean for Prediction
One reasonable approach to predicting your earnings for a given level of schooling: use the mean level of earnings for each level of schooling.

Educ. (years)   Mean Earnings
12              ______
14              ______
16              ______
18              ______

The distribution of earnings among those who obtained a given level of schooling (e.g., high school) is called the conditional distribution of earnings given the level of schooling.
The mean (expectation) of earnings among those who obtained a certain level of schooling is called the conditional mean (expectation) of earnings for the given level of schooling.
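The blank table above can be filled in by weighting each earnings level by its conditional probability from Table 1. The following is a minimal sketch of that arithmetic; the names `cond_dist` and `conditional_mean` are illustrative and do not come from the notes.

```python
# Table 1: conditional distribution of earnings given years of education.
earnings = [30_000, 40_000, 50_000, 60_000]
cond_dist = {
    12: [0.7, 0.1, 0.1, 0.1],
    14: [0.1, 0.7, 0.1, 0.1],
    16: [0.1, 0.1, 0.7, 0.1],
    18: [0.1, 0.1, 0.1, 0.7],
}


def conditional_mean(probs, values):
    """Mean of `values` weighted by the conditional probabilities `probs`."""
    return sum(p * v for p, v in zip(probs, values))


# Conditional mean of earnings for each education level.
cond_means = {educ: conditional_mean(p, earnings) for educ, p in cond_dist.items()}

for educ, mean_earn in cond_means.items():
    print(f"Educ. {educ}: mean earnings = ${mean_earn:,.0f}")
```

Under Table 1 this gives conditional means of $36,000, $42,000, $48,000 and $54,000 for 12, 14, 16 and 18 years of education.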
The Regression Function
A response variable measures an outcome of a study and is often denoted by \(Y\).
An explanatory variable explains, predicts or causes changes in the response variable and is often denoted by \(X\).
We are often interested in predicting \(Y\) based on \(X\):

X (explanatory)                            Y (response)
Education                                  Earnings
Height                                     Weight
Past DJIA                                  Present DJIA
Blood alcohol content                      Probability of an accident
Exposure to radioactive contamination      Cancer mortality

The conditional mean of \(Y\) given \(X = x\) is a reasonable prediction of \(Y\) given \(X = x\). This conditional mean is denoted by \(E(Y \mid X = x)\) and it is called the regression function (it is a function of \(x\)).
No prediction is infallible because there is scatter about the conditional mean, e.g., a high school graduate will occasionally earn $60,000.
Linear Regression Function
Sometimes, the regression function is approximately linear in \(x\), i.e., the plot of \(E(Y \mid X = x)\) vs. \(x\) is a straight line, \(E(Y \mid X = x) = \beta_0 + \beta_1 x\).
What are the interpretations of the intercept and slope of a linear regression function?
- The intercept \(\beta_0\) is the value of \(E(Y \mid X = x)\) when \(x = 0\).
- The slope \(\beta_1\) is the amount that \(E(Y \mid X = x)\) increases for every one unit increase in \(x\), e.g., your expected earnings increase $6,000 for each extra year of education.
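Both interpretations follow directly from the linear form. A short derivation, writing \(\beta_0\) and \(\beta_1\) for the intercept and slope (these symbol names are assumed here for illustration):

```latex
\begin{align*}
E(Y \mid X = x + 1) - E(Y \mid X = x)
  &= \bigl(\beta_0 + \beta_1 (x + 1)\bigr) - \bigl(\beta_0 + \beta_1 x\bigr) \\
  &= \beta_1, \\
E(Y \mid X = 0) &= \beta_0 .
\end{align*}
```

The first line holds for any \(x\), which is why the slope can be read as the change in the conditional mean per unit increase in \(x\) without specifying a starting point.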
Prediction Based on a Sample
Typically, we do not know the population regression function and must estimate it based on a sample from the population, \((x_1, y_1), \ldots, (x_n, y_n)\).
Suppose that in a random sample of your family and friends, you obtain the following data:

Name   Educ.   Earn.
AC     12      $30,000
DE     12      $20,000
BJ     12      $25,000
TL     14      $33,000
QJ     16      $38,000
RE     16      $32,000
LB     16      $40,000
AJ     16      $30,000
PG     18      $40,000
JU     18      $50,000
A natural idea for predicting \(Y\) based on \(X\) is to use the sample mean of \(Y\) given \(X = x\), i.e., the mean in a vertical strip of the scatterplot.
Some problems with this approach:
- There may only be a small number of observations for each \(x\) in the sample, meaning that the sample estimate of \(E(Y \mid X = x)\) will have high variance. What is a 95% confidence interval for the mean earnings for college graduates (without a master's degree) assuming that the population distribution is normal?
- What if we want to predict the earnings for someone who attends one year of college? How would we do it?
- The plot of the sample means of \(Y\) given \(X = x\) (the graph of averages) will typically not be a line even if the population regression function is linear.
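The strip means and the confidence interval asked about above can be computed from the ten sample points. This is a rough sketch using only the Python standard library. It assumes that "college graduate (without a master's degree)" corresponds to 16 years of education, and it hard-codes the t critical value for 3 degrees of freedom from a t table instead of computing it.

```python
from math import sqrt
from statistics import mean, stdev

# Sample from the notes: (years of education, earnings in dollars).
sample = [(12, 30_000), (12, 20_000), (12, 25_000), (14, 33_000), (16, 38_000),
          (16, 32_000), (16, 40_000), (16, 30_000), (18, 40_000), (18, 50_000)]

# Group earnings into "vertical strips", one per education level.
strips = {}
for educ, earn in sample:
    strips.setdefault(educ, []).append(earn)

# Graph of averages: sample mean of earnings within each strip.
graph_of_averages = {educ: mean(ys) for educ, ys in sorted(strips.items())}

# 95% t-interval for mean earnings at 16 years (assumed to mean "college graduate").
ys_16 = strips[16]
n_16 = len(ys_16)
T_CRIT_DF3 = 3.182  # 0.975 quantile of the t distribution with n_16 - 1 = 3 df
half_width = T_CRIT_DF3 * stdev(ys_16) / sqrt(n_16)
ci_16 = (mean(ys_16) - half_width, mean(ys_16) + half_width)

print(graph_of_averages)
print(f"95% CI at 16 years: ${ci_16[0]:,.0f} to ${ci_16[1]:,.0f}")
```

The strip means come out to $25,000, $33,000, $35,000 and $45,000, which do not lie on a straight line, and the interval at 16 years is wide (roughly $27,400 to $42,600) because it rests on only four observations. The strip at 14 years holds a single observation, so no interval of this kind can be computed there at all.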
The Regression Line
A regression line is a straight line that best fits the graph of averages. It is an estimate of the population regression function when the population regression function is linear.
Which line?
Prediction errors. Because we want to use the line to predict \(Y\) based on \(X\), it is sensible to choose the line that makes the smallest prediction errors.
The prediction error for point \((x_i, y_i)\) is
error = \(y_i\) − (predicted \(y\) based on \(x_i\)),
i.e., the vertical distance between the regression line at \(x_i\) and \(y_i\).
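To make "smallest prediction errors" concrete, the sketch below computes the prediction errors of two candidate lines on the sample and compares their sums of squared errors. Both candidate lines are hypothetical choices made for illustration and do not appear in the notes.

```python
# Sample from the notes: (years of education, earnings in dollars).
sample = [(12, 30_000), (12, 20_000), (12, 25_000), (14, 33_000), (16, 38_000),
          (16, 32_000), (16, 40_000), (16, 30_000), (18, 40_000), (18, 50_000)]


def prediction_errors(intercept, slope, data):
    """Observed y minus the line's prediction at x, for each data point."""
    return [y - (intercept + slope * x) for x, y in data]


def sum_squared_errors(intercept, slope, data):
    """Sum of the squared prediction errors of the line on the data."""
    return sum(e ** 2 for e in prediction_errors(intercept, slope, data))


# Two hypothetical candidate lines, given as (intercept, slope).
line_a = (0, 2_000)        # earnings = 2,000 * educ
line_b = (-10_000, 3_000)  # earnings = -10,000 + 3,000 * educ

sse_a = sum_squared_errors(*line_a, sample)
sse_b = sum_squared_errors(*line_b, sample)

print(f"Line A: sum of squared errors = {sse_a:,}")
print(f"Line B: sum of squared errors = {sse_b:,}")
```

On this sample line B has the smaller sum of squared errors (210,000,000 against 394,000,000), so by that criterion it is the better of the two candidates.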
The Least Squares Line
One way to judge the magnitude of prediction errors is to square them.
The least squares line is the line that makes the sum of squared prediction errors smallest, i.e., the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Least squares line: \(\hat{y} = b_0 + b_1 x\).
What is the interpretation of \(b_0\) and \(b_1\)?
What would you predict your earnings to be if you graduate from college?
What would you predict your earnings to be if you attend one year of college?
What would you predict your earnings to be if you obtain a doctorate?
Which of the above predictions do you feel most confident about? Which prediction do you feel least confident about?
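The least squares line for the sample can be computed with the standard closed-form formulas for simple linear regression, which these notes do not state: slope = S_xy / S_xx and intercept = (mean of y) − slope × (mean of x). The sketch below is one way to do this. The names `b0`, `b1` and `predict` are illustrative, and the years of education assigned to each scenario (16 for college, 13 for one year of college, 21 for a doctorate) are assumptions.

```python
# Sample from the notes: (years of education, earnings in dollars).
sample = [(12, 30_000), (12, 20_000), (12, 25_000), (14, 33_000), (16, 38_000),
          (16, 32_000), (16, 40_000), (16, 30_000), (18, 40_000), (18, 50_000)]

xs = [x for x, _ in sample]
ys = [y for _, y in sample]
n = len(sample)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of cross-products and squares about the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in sample)
s_xx = sum((x - x_bar) ** 2 for x in xs)

b1 = s_xy / s_xx         # least squares slope
b0 = y_bar - b1 * x_bar  # least squares intercept


def predict(educ):
    """Predicted earnings at `educ` years of education from the fitted line."""
    return b0 + b1 * educ


print(f"Least squares line: earnings = {b0:,.0f} + {b1:,.0f} * educ")

# Years of education per scenario are assumptions made for this sketch.
scenarios = [("college graduate", 16), ("one year of college", 13), ("doctorate", 21)]
for label, educ in scenarios:
    print(f"{label} ({educ} years): ${predict(educ):,.0f}")
```

For this sample the fitted line is earnings = −11,800 + 3,040 × educ, giving predictions of $36,840, $27,720 and $52,040 for the three scenarios. Note that 21 years lies outside the 12 to 18 year range observed in the sample, and that no one in the sample has exactly 13 years of education.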