SIMPLE REGRESSION THEORY II

Simple Regression Theory II
Samuel L. Baker

Assessing how good the regression equation is likely to be

Assignment 1A gets into drawing inferences about how close the regression line might be to the true line. We make these inferences by examining how close the data points are to the regression line we draw. If the data points line up well, we infer that our regression line is likely to be close to the true line. If our data points are widely scattered around our regression line, we consider ourselves less certain about where the true line is.

If we are unlucky, the data points will fool us by lining up very well in a wrong direction. Other times, again if we are unlucky, the data points will fail to line up very well, even though there really is a relationship between our X and Y variables. If those things happen, we will come to an incorrect conclusion about the relationship between X and Y. Hypothesis testing, which we discuss soon, is how we try to sort this out.

In Simple Regression Theory I, I discussed the assumption that the observed points come from a true line with a random error added or subtracted. I called this Assumption 1, rather than just The Assumption, because we will need some more assumptions if we want to make inferences about how good our regression line is for explaining and predicting.

Let's review the notation that we are using. Each data point is produced by this true equation:

Yi = α + βXi + ei

This equation says that the ith data point's Y value is the true line's intercept, plus the ith X value times the true line's slope, plus a random error ei that is different for each data point. That is Assumption 1 expressed with algebra.

Once we accept Assumption 1, there are two parameters, the true slope β and the true intercept α, that we want to estimate, based on our data. An estimate is a specific numerical guess as to what some unknown parameter is. In the Assignment 1A instructions, the example spreadsheet shows 0.067857 as an estimate of the true slope β. The word estimate in that sentence is important. 0.067857 is not β.
It is an estimate of β. An estimator is a recipe, or a method, for obtaining an estimate. You have used two estimators of β already. One is the eyeball estimator (plot the points, draw the line, inspect the graph, calculate the slope). The other is the least squares estimator (type the data into a spreadsheet, implement the least squares formula, report the result).

Here is an idea that can take some getting used to: Estimates are random variables. This may seem like a funny idea, because you get your estimate from your data. However, if you use the theory we have been developing here, each Y value in your data is a random variable. That is because each Y value includes the random error that is causing it to be above or below the true line. Your estimates of the intercept and slope, α^ and β^ ("alpha-hat" and "beta-hat"), derive from these Y values. The α^ and β^ that you get from any particular data set depend on what the errors ei were when those data were generated. This makes your estimates of the parameters random variables.
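The least squares recipe itself is short enough to sketch in code. Here is a minimal, self-contained illustration; the data values are made up for the example, not taken from the assignment:

```python
# A sketch of the least squares estimator: feed in data, get back
# estimates of the intercept (alpha-hat) and slope (beta-hat).

def least_squares(xs, ys):
    """Return (alpha_hat, beta_hat) for the line Y = alpha + beta*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # beta-hat = sum of (Xi - X-bar)(Yi - Y-bar) over sum of (Xi - X-bar)^2
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sxy / sxx
    # The least squares line passes through the point of means
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

xs = [1, 2, 3, 4, 5]           # made-up example data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
alpha_hat, beta_hat = least_squares(xs, ys)
print(alpha_hat, beta_hat)
```

A spreadsheet's slope and intercept functions implement exactly this recipe.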
In Assignment 1, everybody in the class had data with the same true slope and intercept, but different errors. As a result of the different errors, everybody got different values for α^ and β^. That is what it means to say that the estimates of α and β are random variables.

Because each estimate is a random variable, each estimator (recipe for getting the estimate) has what is called a sampling distribution. This means that each estimator, when used in a particular situation, has an expected value and a variance (an expected spread). A good estimator, a good way of making an estimate, would be one that is likely to give an estimate that is close to the true value. A good estimator would have an expected value that is near the true value and a variance that is small.

As mentioned, Assumption 1 is that there is a true line, and that the observed data points scatter around that line due to random error. By adding more assumptions, we can use the sampling distribution idea to move toward assessing how good an estimator the least squares method is. We can also use this idea to assess how good an estimator the eyeball method is.

Assumption 2: The expected value of any error is 0.

We can write this assumption in algebraic terms as:

Expected(ei) = 0 for each observation i. (i numbers the observations from 1 to N.)

This means that we assume that the points we observe are not systematically above or below the true line. If, in practice, there are more points, say, above the true line than below it, that is entirely due to random error.

If this assumption is true, then the expected values of the least squares line's slope and intercept are the true line's slope and intercept. Algebraically, we can write:

Expected(α^) = α and Expected(β^) = β

Those formulas mean that the least squares regression line is just as likely to be above the true line as below it, and just as likely to be too steep as to be not steep enough. The slope and intercept estimates are aimed at their targets. The jargon term for this is that the estimators are unbiased.
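Unbiasedness can be demonstrated with a small simulation, much like the class exercise. The sketch below assumes a hypothetical true line Y = 2 + 0.5X with normal errors; each simulated "class member" gets different errors, hence a different β^, but the β^ values center near the true slope:

```python
# Simulate many data sets from one true line to show that beta-hat is a
# random variable whose distribution centers on the true slope (0.5 here).
import random

def beta_hat(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

random.seed(1)
true_alpha, true_beta = 2.0, 0.5   # hypothetical true line
xs = list(range(1, 21))
estimates = []
for _ in range(2000):
    # Each Y is the true line plus a random error with expected value 0
    ys = [true_alpha + true_beta * x + random.gauss(0, 1) for x in xs]
    estimates.append(beta_hat(xs, ys))

mean_est = sum(estimates) / len(estimates)
print(min(estimates), max(estimates))   # the estimates vary from sample to sample
print(mean_est)                         # but their average is near 0.5
```

The spread of the 2000 estimates is the sampling distribution of the least squares slope estimator, made visible.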
The next two assumptions are about the variances and covariances of the observations' errors (ei).

Assumption 3: All the errors have the same variance.

Expected(ei²) = σ² for all i

Expected(ei²) is the variance of the ith data point's error ei. The variance formula usually involves subtracting the mean, but here the mean of each error is 0. That is Assumption 2, that the expected value of ei is 0. Notice that, in the formula Expected(ei²) = σ², there is an i subscript on the left side of the equals sign, but no subscript on the right side. This expresses the idea that all the errors' variances are the same. I use σ² ("sigma-squared") here for the variance of the error, because that is the textbook convention.

All the errors having the same variance means that all of the observations are equally likely to be far from
or near to the true line. No one observation is more reliable than any other, deserving more weight. In almost any data set, some data points will be closer to the true line than others. This, we assume, is entirely by luck. If one or two points stick way out, like a basketball player in a room full of jockeys, we have doubts about Assumption 3.

Assumption 4: All the errors are independent of each other.

Expected(ei×ej) = 0 for any two observations such that i ≠ j

Expected(ei×ej) is the covariance of the two random variables ei and ej. The assumption that this is 0 for any pair of observations means this: If one point happens to be above the true line, it is not any more or less likely that the next point will also be above the true line.

A comment on making these assumptions

These assumptions 1 through 4 are not made just from the data. Rather, these assumptions must come primarily from our general understanding of how the data were generated. We can look at the data and get an idea about how reasonable these assumptions are. (One of the assignments does this.) Usually, we pick the assumptions we make based on a combination of the look of the data and our understanding of where the data came from.

When you fit a least squares line to some points, you are implicitly making assumptions 1 through 4. When you draw a straight line by eye through a bunch of points, you are also implicitly making assumptions 1 through 4. What if you are not comfortable with making those assumptions? Later in the course, we will get into some of the alternative models that you can try when one or more of these assumptions do not hold. For example, we will talk about non-linear models. For now, let us stick with the linear model, which means that we accept those assumptions.

Variances of the least squares intercept and slope estimates

Earlier, it was pointed out that the estimates for the slope and intercept, α^ and β^, are random variables, so each has a mean, or expected value, and a variance.
From assumptions 1 through 4, one can derive formulas for the variances of α^ and β^ when they are estimated using least squares. (You cannot derive formulas for the variances of α^ and β^ when they are estimated using the draw-by-eye method. To estimate those variances, you could ask a number of people to draw regression lines by eye. That is what this class does for Assignment 1!)

Here are formulas for the variances of the least squares estimators of α^ and β^. (If you would like to see how these formulas are derived, please consult a statistics textbook.)

Variance(α^) = σ² [ 1/N + X̄² / Σ(Xi - X̄)² ]

Variance(β^) = σ² / Σ(Xi - X̄)²
In these formulas, σ² is the variance of the errors. Assumption 3 says all errors have the same variance, which is σ².

One of the themes of this course is that you can learn something from formulas like these, if you take the trouble to examine them. Don't let them intimidate you! Each formula has σ² multiplied or divided by something. This means that the variances of our regression line parameters are both proportional to σ², the variance of the errors from the true equation. An implication is that the smaller the errors are, the less randomness there is in our α^ and β^ values, and the closer our estimated parameters are likely to be to the true parameters.

In the formula for the variance of β^, the denominator gets bigger when there are more X's and when those X's are more spread out. A bigger denominator makes a fraction smaller. This tells us something about designing a study or experiment: If you want your estimates to come out close to the truth, get a lot of observations that are spread out over a big range of values of the independent variable.

Those formulas have a major drawback: We cannot directly use them! That is because we don't know what σ² is. What we can do is estimate σ². We call our estimate s², and calculate it from the residuals like this:

s² = Σui² / (N - 2)

In that expression, the residuals are designated by ui. The numerator says to square each residual and then add them all up, giving you the sum of the squares of the residuals. (The least squares line is the line that makes this sum the smallest.) To get the estimate of the variance of the errors, we divide the sum of squared residuals by N - 2. That makes the estimated variance like an average squared residual. I say "like" an average squared residual because we divide by N - 2, rather than by N.

Why divide by N - 2, instead of N? Using N - 2 allows for the fact that the least squares line will fit the data better than the true line does. The least squares line is, by the definition of least squares, the line with the smallest sum of squared residuals.
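The recipe for s² can be sketched in code. This is a hypothetical, self-contained example with made-up data: fit the line, compute the residuals, then divide the sum of squared residuals by N - 2:

```python
# Estimate sigma-squared from the residuals: s^2 = sum(u_i^2) / (N - 2).

def estimate_error_variance(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    # Residuals: observed Y minus the Y predicted by the least squares line
    us = [y - (alpha_hat + beta_hat * x) for x, y in zip(xs, ys)]
    # Divide by N - 2, not N: two parameters (slope, intercept) were estimated
    return sum(u ** 2 for u in us) / (n - 2)

xs = [1, 2, 3, 4, 5]            # made-up example data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
s_squared = estimate_error_variance(xs, ys)
print(s_squared)
```

The square root of this value is s, the standard error of the regression discussed below.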
Any other line, including the true line, will have a larger sum of squared residuals.

N - 2 is the degrees of freedom. In your introductory statistics course, you used models in which the degrees of freedom is N - 1. For instance, when you estimated a population variance based on the data in a sample of N from that population, the degrees of freedom was N - 1. Why did you subtract 1 from N then, and why are we subtracting 2 now? Then, to estimate the variance of a population around the population's mean, you used one statistic, the mean of the sample. Now, we are using predicted values calculated from two statistics, the slope and the intercept of the regression line. It is easier to get a predicted value that is close to your observation when you have two estimated parameters, instead of just one. Using N - 2 allows for that.

In general, the number of degrees of freedom is N - P, where P is the number of parameters used to get the
number(s) that you subtract from each observation's value. To repeat, the reason you subtract P from N in these situations is that the estimated mean and the estimated regression line fit the data better than the actual population mean or the true line would. The more parameters you have, the easier it is to fit your model to the data, even if the model really is not any good. You have to make the degrees of freedom smaller to make the estimated variance come out big enough to correct for this.

Substituting s² for σ² in the formulas for the variances of α^ and β^ gives estimated variances for α^ and β^. You just change every "σ" into an "s":

Estimated Variance(α^) = s² [ 1/N + X̄² / Σ(Xi - X̄)² ]

Estimated Variance(β^) = s² / Σ(Xi - X̄)²

Standard Error

The standard error of something is the square root of its estimated variance. s, the square root of s², is called the standard error of the regression. Here is the formula:

s = √( Σui² / (N - 2) )

β^ has a standard error, too. It is the square root of the estimated variance of β^:

Std. Error of β^ = √( s² / Σ(Xi - X̄)² )

This measures how much risk there is in using β^ as an estimate of β. You can deduce from the formula that the risk in the estimate of β is smaller if: the actual points lie close to the regression line (so that s² is small), or the X's are far apart (so that the sum of squared X deviations is big).

Hypothesis tests on the slope parameter

To do conventional hypothesis testing and confidence intervals, we must make another assumption about
SIMPLE REGRESSIO THEORY II 6 the errors, one that s even more restrctve: Assumpton 5 (needed for hypothess testng): Each error has the normal dstrbuton. Each error e has the normal dstrbuton wth a mean of 0 and a varance of σ. The normal dstrbuton should be famlar from ntroductory statstcs. The normal dstrbuton s densty functon has a symmetrcal bell shape. A normal random varable wll be wthn one standard devaton of ts mean about 68.6% of the tme. It wll be wthn two standard devatons 99.54% of the tme. Ths means that, on average, 99.54% of your data ponts should be closer to the true lne than tmes σ. The normal dstrbuton s tght, wth very few outlers. If the errors are normally dstrbuted, as assumpton 5 states, then the followng expresson has the t dstrbuton wth (umber of observatons - ) degrees of freedom. If you would lke to see ths derved, please consult a statstcs textbook. The general dea s that we use the t dstrbuton for ths rather than the normal or z dstrbuton to allow for the fact that the regresson lne generally fts the data better than the true lne. The regresson lne fts too well, n other words. Usng the wder t dstrbuton, rather than the narrower normal dstrbuton, corrects for ths. A branchng place n our dscusson. At ths pont, we can ether go through the mechancs of hypothess testng, or we can dscuss the basc phlosophy of hypothess testng. Whch s best for you depends on what you already know about statstcs and how you learn.. For the phlosophy dscusson, please see the downloadable Phlosophy of Hypothess Testng fle. Then come back here to see the mechancs.. If you want to see the mechancs frst, contnue readng here. Hypothess testng mechancs for smple regresson To use the t formula above to test hypotheses about what the true β s, do ths:. Plug your hypotheszed value for β n where the β s n ths expresson: Usually, the hypotheszed value s 0, but not always.. 
2. Plug in the estimated coefficient where the β^ is, and put the standard error of the coefficient in the denominator of the fraction.
3. Evaluate the expression. This is your t value.
4. Find the critical value from the t table (there is a t table in the downloadable file of tables) by first picking a significance level. The most common significance level is 5%, or 0.05. That tells you which column in the table to use. Use the row in the t table that corresponds to the number of
SIMPLE REGRESSIO THEORY II 7 degrees of freedom you have. In general, the degrees for freedom s the number of observatons mnus the number of parameters that you are estmatng. For smple regresson, the number of parameters s (the parameters are the slope and the ntercept), so use the row for (umber of observatons - ) degrees of freedom. 5. The column and the row gve you a crtcal t value. If your calculated t value (gnore any mnus sgn) s bgger than ths crtcal value from the t table number, reject the hypotheszed value for β. Otherwse, you don't reject t. In Assgnment A, you wll expand the spreadsheet from Assgnment to nclude a cell that calculates the value of the t fracton. Ths t value, as we wll call t, wll be only for testng the hypothess that the true β s 0. Ths wll do steps,, and 3 for you. In step 5 above, you are supposed to gnore any mnus sgn. That s because we are dong a two-taled test. We do ths because we want to reject hypotheszed values for the true slope that are ether too hgh or too low to be reasonable, gven our data. The column headngs n the t table n the downloadable fle show sgnfcance levels for a two-taled test. Some books present ther t tables dfferently. They base ther tables on a one-taled test, so they have you use the column headed 0.05 to get the crtcal value for a two-taled 0.05-sgnfcance-level test. The crtcal value you get s the same, because the t table value at the 0.05 sgnfcance level for a two-taled test s equal to the t table value at the 0.05 sgnfcance level for a one-taled test. When you use other books t tables, read the fne prnt so you know whch column to use. Sgnfcance levels and types of errors Why use a sgnfcance level of 0.05? Only because t s the most common choce. You can choose any level you want. In choosng a sgnfcance level, you are tradng off two types of possble error:. Type error: Rejectng an hypothess that's true. 
2. Type II error: Refusing to reject a hypothesis that's false.

If you are testing the hypothesis that the true slope is 0, which is what you usually do, then these become:

1. Type I error: The true slope is 0, but you say that there is a slope. In other words, there really is no relationship between X and Y, but you fool yourself into thinking that there is a relationship.
2. Type II error: The true slope is not 0, but you say that the true slope might be 0. In other words, there really is a relationship between X and Y, but you say that you are not sure that there is a relationship.

Smaller significance levels, like 0.01 or 0.001, make Type I errors less likely. Actually, the significance level is the probability of making a Type I error. At a 0.001 significance level, if you do find that your estimate is significant, you can be very confident that the true value is different from the hypothesized value. You will only be wrong one time in a thousand. If your hypothesized value is 0, which it usually is, you can be very confident that the true parameter is not 0. If what you are testing is your estimate of the
SIMPLE REGRESSIO THEORY II 8 slope, whch t usually s, then you can be very confdent that the slope s not 0 and that there really s a relatonshp between your X and your Y varables. The drawback of a small sgnfcance level s that t ncreases the probablty of a Type error. Wth a small sgnfcance level, t s more lkely that you wll fal to detect a true relatonshp between X and Y. Wth a small sgnfcance level, you are demandng overwhelmng evdence of a relatonshp beng there. The deal way to pck a sgnfcance level s to wegh the consequences of each type of error, and pck a sgnfcance level that best balances the costs. What researchers often do, however, s pck.05, because they know that ths sgnfcance level s generally accepted. Watch out for confusng or contradctory termnology:. α-level of sgnfcance. The sgnfcance level s sometmes called the α-level. Try not to confuse ths wth the ntercept α of a lnear regresson equaton.. Hgh sgnfcance level. A low alpha number s a hgh level of sgnfcance. If you reject the hypothess that a coeffcent s 0 at the 0.00 level, that coeffcent s hghly sgnfcant. 3. 95% sgnfcance or 0.05 sgnfcance. These terms are nterchangeable. The same goes for 99% sgnfcance or 0.0 sgnfcance. Usually, the context wll enable you keep these straght. Confdence ntervals for equaton parameters An alternatve way to test hypotheses about β s to calculate a confdence nterval for β^ and then see f your hypotheszed value s nsde t. The 95% confdence nterval (two-taled test) for β^ s: 95% confdence nterval for the slope estmate The t 0.05 n the formulas above means the value from the t table n the column for the 0.05 sgnfcance level for a two-taled test and the row for - degrees of freedom. For a 99% confdence nterval, you would use t 0.0. The ± means that you evaluate the expresson once wth a + to get the top of the confdence nterval, then you evaluate t wth a - to get the bottom. 
This confidence interval is called two-tailed because it has a high end and a low end that are equidistant from the estimated value. That is what the ± does. (A one-tailed confidence interval would have one end that is above the estimated value and go off to minus infinity on the left, or it could have one end that is below the estimated value and go off to plus infinity on the right.)

Look now at the right hand version of the confidence interval formula above. Let's explore the underlying
SIMPLE REGRESSIO THEORY II 9 relatonshps. The s n the numerator tells us that the confdence ntervals for the coeffcents get bgger f the resduals are bgger. The denomnator tells us that havng more X values and havng them more spread out from ther mean makes the confdence nterval smaller. Agan, ths tells us that we gan confdence n our estmate f the resduals are small and f we have a lot of spread-out X values. For the ntercept, α, the 95% confdence nterval s: Confdence nterval for the predcton of Y We can also calculate confdence ntervals for a predcton. Let X 0 be the X value for whch we want the Y$ = α$ + β$ X predcted Y value. We ll call that predcted Y value Y^ 0.. The 95% confdence nterval for the predcton s: 0 0 Here's what we can see n ths expresson: The confdence nterval s the bg expresson to the rght of the ±. The wdth of ths confdence nterval depends on the sum of squared resduals s and depends nversely on 3(X -X ), the sum of the squared devatons of the X values from ther mean. Bg resduals, relatve to the spread of the X's, make s bg, whch makes the confdence ntervals wde. More X values, or more spreadout X values, make 3(X -X ) expresson larger, whch makes the confdence nterval narrower. The (X 0 -X&) expresson n the numerator under the square root sgn tells us that confdence nterval gets wder as the X 0 you choose gets further from the mean of the X's.
R², a measure of fit

An overall measure of how well the regression line fits the points is R² (read "R squared"). R² is always between 0 (no fit) and 1 (perfect fit). R² shows how big the residuals are in relation to the deviations of the Y's from their mean. It is customary to say that the R² tells you how much of the variation in the Y's is "explained" by the regression line. Here is the formula:

R² = 1 - Σ(Yi - Ŷi)² / Σ(Yi - Ȳ)²

Notice that the only difference between the top and the bottom of the fraction is that the top has Y-hats and the bottom has Y-bar. The Y-hats are the predicted values for Y from the regression equation. Y-bar is the average of the Y values.

Here is the R² formula in words: The R-squared tells you how much your ability to predict is improved by using the regression line, compared with not using it. The least possible improvement is 0. This is if the regression line is no help at all. You might as well use the mean of the Y values for your prediction. The most possible improvement is 1. This is if the regression line fits the data perfectly. That is why the R-squared is always between 0 and 1. The regression line is never worse than worthless (0), and it can't be better than perfect (1).

Some statistical software reports an "adjusted" R-squared. This allows for the fact that an X variable that is really completely unrelated to your Y variable will probably have some relationship to Y in your data just by luck. The adjusted R-squared reduces the R-squared by how much fit would probably happen just by luck. Sometimes this reduction is more than the calculated R-squared, so you can have an adjusted R-squared that is less than 0.

All conclusions from the R-squared are based on the assumptions behind using least squares being true. If those assumptions are not true, then it is possible that using the regression line to predict would be worse than worthless.
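As a sketch with made-up data, R² can be computed directly from its definition. A data set that lies exactly on a straight line gives R² of exactly 1; noisy data gives something between 0 and 1:

```python
# R-squared: one minus the sum of squared residuals over the sum of
# squared deviations of Y from its mean.

def r_squared(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    ss_resid = sum((y - (alpha_hat + beta_hat * x)) ** 2
                   for x, y in zip(xs, ys))     # top: Y minus Y-hat, squared
    ss_total = sum((y - y_bar) ** 2 for y in ys)  # bottom: Y minus Y-bar, squared
    return 1 - ss_resid / ss_total

xs = [1, 2, 3, 4, 5]
perfect = r_squared(xs, [2 * x + 1 for x in xs])   # points exactly on a line
noisy = r_squared(xs, [2.1, 3.9, 6.2, 7.8, 10.1])  # made-up scattered points
print(perfect, noisy)
```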
Correlation

Correlation is a measure of how much two variables are linearly related to each other. If two variables, X and Y, go up and down together in more or less a straight-line way, they are positively correlated. If Y goes down when X goes up, they are negatively correlated. In a simple regression, the correlation is the square root of the regression's R². For this reason, the correlation (also called the "correlation coefficient") is designated as r. An alternative formula for r is:

r = Σ(Xi - X̄)(Yi - Ȳ) / √( Σ(Xi - X̄)² Σ(Yi - Ȳ)² )

The correlation coefficient r can range from -1 to +1.

With some algebra, you can show that r and β^ are related. Multiply the top and bottom of the r formula by the square root of the sum of the squares of the X deviations. You get this:

r = [ Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)² ] × [ √Σ(Xi - X̄)² / √Σ(Yi - Ȳ)² ]

Look at the parts of this fraction that are not under square root symbols. These are the left halves of the numerator and denominator. Do you recognize them as the same as the formula for the least squares β^? The higher the slope of the regression line is (higher absolute value of β^), the higher r is. When X and Y are not correlated, r = 0. According to the formula above, β^ must be 0 as well. So, if X and Y are unrelated, the least squares line is horizontal.
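A quick numerical check of this relationship, again with made-up data: compute r from its formula, then compare it with β^ times the square root of the ratio of the X and Y sums of squares:

```python
# Correlation coefficient r, and a check that r = beta-hat * sqrt(Sxx / Syy).

def correlation(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

xs = [1, 2, 3, 4, 5]            # made-up example data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r = correlation(xs, ys)

# Rebuild beta-hat and the sums of squares to verify the algebra above
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
beta_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
print(r, beta_hat * (sxx / syy) ** 0.5)   # the two numbers agree
```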
Durbin-Watson Statistic

In one of the assignments, you will see that the regular goodness of fit statistics (R² and t) cannot detect situations where a linear least squares model is not appropriate. The Durbin-Watson statistic can detect some of those situations. In particular, the Durbin-Watson statistic tests for serial (as in "series") correlation of the residuals. Here is the Durbin-Watson statistic formula:

DW = Σ(ut - ut-1)² / Σut²

(The u's in this formula are the residuals.)

A rule of thumb for interpretation of DW:

DW < 1 indicates that residuals track each other. A positive residual tends to be followed by another positive residual. A negative residual tends to be followed by another negative residual.
DW near 2 indicates no serial correlation.
DW > 3 indicates that residuals alternate, positive-negative-positive-negative.

For a more formal test, see the Durbin-Watson table in the downloadable file of tables.

Serial correlation of residuals indicates that you can do better than your current model at predicting. Least squares assumes that the next residual is expected to be 0. If there is serial correlation, that means that you can partly predict a residual from the one that came before. If the Durbin-Watson test finds serial correlation, it may indicate that a curved model would be better than a straight-line model. Another possibility is that the true relationship is a line, but the error in one observation is affecting the error in the next observation. If this is so, you can get better predictions with a more elaborate model that takes this effect into account.
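The Durbin-Watson formula is easy to compute from a list of residuals. This sketch uses two made-up residual sequences to illustrate the rule of thumb: an alternating sequence pushes DW above 3, and a tracking sequence pulls it below 1:

```python
# Durbin-Watson statistic: sum of squared differences of successive
# residuals, divided by the sum of squared residuals.

def durbin_watson(us):
    num = sum((us[t] - us[t - 1]) ** 2 for t in range(1, len(us)))
    den = sum(u ** 2 for u in us)
    return num / den

alternating = [1, -1, 1, -1, 1, -1]            # positive-negative-positive...
tracking = [1, 1.1, 0.9, -1, -1.2, -0.8]       # runs of same-sign residuals
print(durbin_watson(alternating))   # well above 3
print(durbin_watson(tracking))      # well below 1
```

A least squares fit whose residuals behave like the second sequence is a signal to consider a curved model or a model that links successive errors.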