Linear regression model

We assume that two quantitative variables, x and y, are linearly related; that is, the population of (x, y) pairs is related by an ideal population regression line

  y = α + βx + e

where α and β represent the y-intercept and slope coefficients. The quantity e is included to represent the fact that the relation is subject to random errors in measurement; e can be interpreted as either
- the deviation of that value of y from the population regression line, or
- the error in using the line to predict a value of y from the corresponding given x.
We assume that e is a normally distributed random variable with mean µ_e = 0 and standard deviation σ_e, which will be large when prediction errors are large and small when prediction errors are small.
Note that e is a different random variable for different values of x; all such e are assumed to be independent of each other and identically distributed (so there is no harm in giving them the same name).

For a fixed value x* of x, the quantity α + βx* represents the (fixed) height of the regression line at x = x*, so y = α + βx* + e is subject to the same kind of variability as e: namely, y is normally distributed with mean µ_y = α + βx* and standard deviation σ_y = σ_e.

β, being the slope of the line, represents the change in µ_y associated with a unit change in x; that is, β is the average change in y associated with a unit change in x.

The parameters of interest for the regression model are σ_e, which measures the typical size of errors made in using the line to predict y values, and β, which measures the average change in y associated with a unit change in x.
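As a sketch of this model, we can simulate (x, y) pairs in Python; the values α = 0.9, β = 1.04, and σ_e = 0.13 are hypothetical, chosen only for illustration:

```python
import random

# Hypothetical population parameters (for illustration only)
alpha, beta, sigma_e = 0.9, 1.04, 0.13

random.seed(1)
xs = [1, 2, 3, 4, 5]
# Each y is the fixed line height alpha + beta*x plus a normal error e
# with mean 0 and standard deviation sigma_e; a fresh e is drawn per x.
ys = [alpha + beta * x + random.gauss(0.0, sigma_e) for x in xs]
for x, y in zip(xs, ys):
    print(f"x = {x}, y = {y:.3f}")
```

Re-running without the fixed seed gives a different sample from the same population, which is exactly the sample-to-sample variability studied below.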
Estimating regression parameters

Estimating σ_e: the standard deviation about the regression line,

  s_e = √(SSResid/(n − 2)),

is not an unbiased estimator of σ_e (although s_e² is unbiased for σ_e²). [TI-83: STAT TESTS LinRegTTest (denoted s)]

Estimating β: the slope of the regression line,

  b = r·(s_y/s_x),

is an unbiased estimator for β. [TI-83: STAT TESTS LinRegTTest; also STAT CALC LinReg(a+bx).]
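A minimal sketch of these estimates, computed on a small hypothetical data set (the numbers are made up for illustration):

```python
import math

# Hypothetical data set
xs = [1, 2, 3, 4, 5]
ys = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx                # slope estimate for beta
a = y_bar - b * x_bar          # intercept estimate for alpha

# Standard deviation about the regression line
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s_e = math.sqrt(ss_resid / (n - 2))

print(f"b = {b:.4f}, a = {a:.4f}, s_e = {s_e:.4f}")
```

For this data set the slope works out to b = 1.04 with a = 0.90 and s_e ≈ 0.132, matching what LinRegTTest would report.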
The sampling distribution for b

The sampling distribution of b is studied to determine how estimates of β behave from sample to sample. Assuming that the n data points produce independent, identically normally distributed errors e, all with mean 0 and standard deviation σ_e, we have that
- µ_b = β
- σ_b = σ_e/(s_x·√(n − 1))
- the sampling distribution of b is normal.
Since neither σ_e nor σ_b is known, we estimate σ_e with the statistic s_e, and σ_b with the statistic

  s_b = s_e/(s_x·√(n − 1)),

and then standardize b with the statistic

  t = (b − β)/s_b, having df = n − 2.
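These quantities can be sketched in Python on the same small hypothetical data set; note that s_x·√(n − 1) equals √(Σ(x − x̄)²), which the code exploits:

```python
import math
import statistics

# Hypothetical data set
xs = [1, 2, 3, 4, 5]
ys = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
a = y_bar - b * x_bar

ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s_e = math.sqrt(ss_resid / (n - 2))

# Estimated standard deviation of b; s_x * sqrt(n - 1) == sqrt(s_xx)
s_x = statistics.stdev(xs)
s_b = s_e / (s_x * math.sqrt(n - 1))

# t statistic for a hypothesized slope (illustrative choice beta_0 = 1.0)
beta_0 = 1.0
t = (b - beta_0) / s_b
print(f"s_b = {s_b:.4f}, t = {t:.3f}  (df = {n - 2})")
```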
Confidence interval for β

Assuming that the n data points produce independent, identically normally distributed errors e, all with mean 0 and standard deviation σ_e, we obtain the following confidence interval for β:

  b ± (t-crit.)·s_b

where the t-critical value is based on df = n − 2.
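A sketch of the interval on the same hypothetical data set; since the stdlib has no inverse-t function, the critical value for df = 3 is taken from a t table:

```python
import math

# Hypothetical data set
xs = [1, 2, 3, 4, 5]
ys = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
a = y_bar - b * x_bar
s_e = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
s_b = s_e / math.sqrt(s_xx)

# 95% two-tailed t-critical value for df = n - 2 = 3, from a t table
t_crit = 3.182
lo, hi = b - t_crit * s_b, b + t_crit * s_b
print(f"95% CI for beta: ({lo:.3f}, {hi:.3f})")
```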
Model utility test for linear regression

If the slope of the regression line is β = 0, then the line is horizontal and values of y do not depend on x, so there is no point in using the line to predict y from knowledge of x. A test of whether β = 0 can therefore determine whether it is appropriate to model a linear relationship between the variables x and y.

Hypotheses: H_0: β = 0, H_a: β ≠ 0

Test statistic: t = (b − 0)/s_b, with df = n − 2

Assumptions: independent normally distributed errors with mean 0 and equal standard deviations

[TI-83: STAT TESTS LinRegTTest]
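The test can be sketched by comparing |t| against a tabled critical value (the stdlib has no t-distribution p-value function, so this uses the fixed-level form of the test on the same hypothetical data):

```python
import math

# Hypothetical data set
xs = [1, 2, 3, 4, 5]
ys = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
a = y_bar - b * x_bar
s_e = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
s_b = s_e / math.sqrt(s_xx)

# Model utility test: H0: beta = 0 vs Ha: beta != 0
t = (b - 0) / s_b
t_crit = 3.182          # two-tailed 5% critical value for df = 3 (t table)
reject = abs(t) > t_crit
print(f"t = {t:.2f}, reject H0 at the 5% level: {reject}")
```

Here t is far beyond the critical value, so the linear model is judged useful for this data.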
Residual analysis

We can use a residual plot (a plot of residuals vs. x values) to check whether it is reasonable to assume that the errors are identically distributed independent normal variables; the z-scores of these residuals can be used to display a standardized residual plot:

  z_resid = (resid − 0)/s_resid

but the standard deviation of each residual varies from point to point and is not automatically calculated by the TI-83. Many statistical packages, however, do perform these calculations.

What to look for:
- absence of patterns in the (standardized) residual plot
- very few large residuals (more than 2 standard deviations from the x-axis)
- no variation in the spread of the residuals (variation would indicate that σ_e varies with x)
- no influential points (points far removed from the bulk of the plot)
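A sketch of the standardization, using the standard simple-regression formula for the per-point residual standard deviation, s_e·√(1 − 1/n − (x_i − x̄)²/Σ(x − x̄)²), on the same hypothetical data:

```python
import math

# Hypothetical data set
xs = [1, 2, 3, 4, 5]
ys = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
a = y_bar - b * x_bar
resids = [y - (a + b * x) for x, y in zip(xs, ys)]
s_e = math.sqrt(sum(r ** 2 for r in resids) / (n - 2))

# The standard deviation of the i-th residual varies with x_i,
# which is why each residual gets its own divisor.
z_resids = []
for x, r in zip(xs, resids):
    s_r = s_e * math.sqrt(1 - 1 / n - (x - x_bar) ** 2 / s_xx)
    z_resids.append(r / s_r)

for x, z in zip(xs, z_resids):
    print(f"x = {x}: standardized residual = {z:+.2f}")
```

Plotting z_resids against xs (e.g. with a scatter plot) gives the standardized residual plot; all values here stay well inside ±2, as hoped.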
The sampling distribution for a + bx*

Assuming that the n data points produce independent, identically normally distributed errors e, all with mean 0 and standard deviation σ_e, we study the distribution of the prediction statistic a + bx* for some fixed choice of x = x*.
- a + bx* is an unbiased estimate of the true regression value α + βx*, which thus represents µ_(a+bx*)
- σ_(a+bx*) = σ_e·√(1/n + (x* − x̄)²/((n − 1)s_x²)), and is estimated by the statistic s_(a+bx*) = s_e·√(1/n + (x* − x̄)²/((n − 1)s_x²))
- a + bx* is normally distributed, but replacing σ_(a+bx*) with the estimate s_(a+bx*) produces a standardized t variable with df = n − 2.
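A sketch of the estimate s_(a+bx*) on the same hypothetical data, for an illustrative choice x* = 4 (note that (n − 1)s_x² = Σ(x − x̄)²):

```python
import math

# Hypothetical data set
xs = [1, 2, 3, 4, 5]
ys = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)       # equals (n - 1) * s_x**2
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
a = y_bar - b * x_bar
s_e = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

x_star = 4                                     # illustrative choice of x*
y_hat = a + b * x_star
s_pred = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
print(f"a + b*x* = {y_hat:.3f}, s_(a+bx*) = {s_pred:.4f}")
```

The farther x* sits from x̄, the larger the (x* − x̄)² term makes s_(a+bx*), so predictions away from the center of the data are less precise.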
Confidence interval for a + bx*

With the same assumptions as above, the confidence interval formula for a + bx*, the mean value of the predicted y, is

  (a + bx*) ± (t-crit)·s_(a+bx*)

where t has df = n − 2.

Prediction intervals

With the same assumptions as above, the prediction interval formula for y*, the prediction of y for the x value x = x*, is

  (a + bx*) ± (t-crit)·√(s_e² + s_(a+bx*)²)

where t has df = n − 2 (variability comes not only from the size of the error but also from the extent to which the estimate a + bx* differs from the mean value).
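Both intervals can be sketched together on the same hypothetical data, again with the df = 3 critical value taken from a t table:

```python
import math

# Hypothetical data set
xs = [1, 2, 3, 4, 5]
ys = [2.0, 2.9, 4.1, 4.9, 6.2]
n = len(xs)

x_bar, y_bar = sum(xs) / n, sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
a = y_bar - b * x_bar
s_e = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

x_star = 4                                     # illustrative choice of x*
y_hat = a + b * x_star
s_pred = s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)

t_crit = 3.182   # 95% two-tailed t-critical value for df = 3 (t table)
ci = (y_hat - t_crit * s_pred, y_hat + t_crit * s_pred)
pi_half = t_crit * math.sqrt(s_e ** 2 + s_pred ** 2)
pi = (y_hat - pi_half, y_hat + pi_half)
print(f"95% CI for mean y at x* = {x_star}: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% PI for y* at x* = {x_star}: ({pi[0]:.3f}, {pi[1]:.3f})")
```

The prediction interval is always wider than the confidence interval, since its standard error adds the full error variance s_e² to the variance of the estimated mean.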