Chapter 11 Part 6. Correlation Continued. LOWESS Regression


February 17, 2009

Goal:
To review the properties of the correlation coefficient.
To introduce you to the various tools that can be used to decide if a distribution is normally distributed.
To introduce you to tests of normality, skewness and kurtosis.
To introduce you to the use of LOWESS regression as a diagnostic tool.
To introduce you to looking carefully at your data.

Skills:
You should be able to determine whether or not a distribution is normal.
You should know how to use and interpret both parametric and non-parametric correlation coefficients.
You should know how to obtain and interpret a LOWESS line.
You should now have the experience to know better than to just run Stata commands without looking at the data.

Commands: pnorm, qnorm, sfrancia, swilk, sktest

Datasets: allhat1000old.dta, GreeneTouchold.dta, Example151Goodold.dta

Last class we looked at the correlation (i.e. linear relationship) between the baseline diastolic blood pressure (DBP) and weight at baseline. At the time I pointed out that in order to statistically test whether there was a linear relationship, both of the variables (DBP and weight) would have to be normally distributed. Last class we just focused on how to get and interpret correlation coefficients and ignored the need for normality. Today we are going back to check on the normality of the variables.

We are going to first consider graphical techniques for deciding if the variables are normally distributed. For each of baseline DBP and baseline weight I have plotted a histogram, a Q-Q plot (quantiles of the normal distribution, Stata command qnorm) and a standardized normal probability plot (Stata command pnorm). Notice that for both the Q-Q plot (also called a quantile-normal plot) and the standardized normal probability plot the straight line represents the normal distribution. So ideally all of our points would fall on this normal line. Deviations from the line are indications of non-normality. There is no statistical test that goes with these plots, although there are statistical tests for normality. So we just look at the plots and make our best judgement. The Q-Q plot is sensitive to non-normality near the tails of the plot. The standardized normal probability plot is sensitive to non-normality in the middle range of the data.

Below I have added baseline triglycerides to the mix because it is skewed in a different way from the other two variables. So we will include the correlation of triglycerides with the other two variables.

After using the plots to consider normality we are going to use the summarize command with detail to get the skewness and kurtosis of each variable. Finally, we are going to use the Shapiro-Wilk normality test (for data sets where 4 ≤ n ≤ 2000, the Stata command is swilk), the Shapiro-Francia normality test (for 5 ≤ n ≤ 5000, the Stata command is sfrancia) and a skewness and kurtosis test for normality (the Stata command is sktest). Usually statistical tests are considered better than plots, but in this case the plots are actually preferred.
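The two plots can be built by hand, which makes it easier to see what the straight line stands for. The sketch below is in Python rather than Stata and is not Stata's implementation. It assumes the plotting position i/(n+1) that appears on the pnorm axis label in these notes, fits the normal distribution with the sample mean and standard deviation, and uses a made-up right-skewed sample (the names qq_points, pp_points and sample are hypothetical).

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Pairs (normal quantile, observed value), the coordinates of a Q-Q plot."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    # Assumed plotting position: i / (n + 1) for the i-th smallest value.
    return [(fitted.inv_cdf(i / (n + 1)), x) for i, x in enumerate(xs, start=1)]


def pp_points(values):
    """Pairs (empirical P[i], normal F[(x - m)/s]), the coordinates of a
    standardized normal probability plot."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    return [(i / (n + 1), fitted.cdf(x)) for i, x in enumerate(xs, start=1)]


# Hypothetical right-skewed sample; these are not values from allhat1000.dta.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 20]

for q, x in qq_points(sample):
    # Points on the "normal line" would have observed == normal quantile.
    print(f"normal quantile {q:7.2f}   observed {x:5.1f}   difference {x - q:6.2f}")
```

In a perfectly normal sample the difference column would hover around zero. For this right-skewed sample both ends sit above the line, which is the upward curve described for the weight and triglyceride Q-Q plots.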

[Figure: histogram of baseline visit 1 DBP (allhat1000.dta). y-axis: Frequency; x-axis: Avg DBP at Visit 1.]

Notice the tall bars at 80 mm Hg and 90 mm Hg. This indicates what is called a digit preference (i.e. people tend to round off to 80 and 90). The distribution is skewed to the left.

[Figure: quantiles of normal distribution plot for baseline visit 1 DBP. y-axis: Avg DBP at Visit 1; x-axis: Inverse Normal.]

The Stata command is "qnorm". This plot is sensitive to non-normality near the tails. This is also called a Q-Q plot or a normal quantile plot. Notice that the histogram is skewed to the left and the qnorm plot curves downward. The plot deviates from the normal line more on the left end of the graph.

We know how to use the dropdown menus to get the histogram, but how do we use them to get the Q-Q plot?

You have the option of selecting a plot type (3rd tab from the left), but the usual choice is the scatterplot, which in this case is the default plot. The scatterplot is what is pictured for the Q-Q plot on the page above.

[Figure: standardized normal probability plot for baseline visit 1 DBP, with arrows marking two spots.]

The Stata command is "pnorm". The plot is sensitive to non-normality in the middle range of the data. The 2 spots pointed out by the arrows are probably related to the tall bars at 80 and 90 mm Hg. The plot above is obtained using the normal probability plot (pnorm).


[Figure: histogram of baseline weight.]

The distribution of weight is skewed to the right.

[Figure: quantiles of normal distribution plot for baseline weight.]

The Stata command is qnorm. This plot is sensitive to non-normality near the tails. This is called the Q-Q plot. Notice that the histogram is skewed to the right and the qnorm plot curves upward.

[Figure: standardized normal probability plot for baseline weight.]

The Stata command is pnorm. The plot is sensitive to non-normality in the middle range of the data.

[Figure: histogram of baseline triglycerides for the antihypertensive study (allhat1000.dta). y-axis: Frequency; x-axis: Triglycerides-BL Anti.]

Notice that the triglyceride distribution is very skewed to the right.

[Figure: quantiles of normal distribution plot for baseline triglycerides for the antihypertensive study. y-axis: Triglycerides-BL Anti; x-axis: Inverse Normal.]

The Stata command is "qnorm". This plot is sensitive to non-normality near the tails. This is also called a Q-Q plot. Notice that the histogram is skewed to the right and the qnorm plot curves upward. Notice that the spaced out dots on the right side of the plot go with the spaced out bars in the histogram above.

[Figure: standardized normal probability plot for baseline triglycerides for the antihypertensive study. y-axis: Normal F[(ATRIG-m)/s]; x-axis: Empirical P[i] = i/(n+1).]

The Stata command is "pnorm". The plot is sensitive to non-normality in the middle range of the data.

A distribution can be skewed to both the left and the right, in which case the curve in the Q-Q plot can go up at one end and down at the other.

. sum(bv1dbp),det
[Output: detailed summary of Avg DBP at Visit 1 (percentiles, smallest and largest values, Obs, Sum of Wgt, Mean, Std. Dev., Variance, Skewness, Kurtosis). The only value preserved is the 50th percentile, 85.]

. sum(blwgt),det
[Output: detailed summary of Weight(lbs) at Baseline, same layout. The only value preserved is the 50th percentile, 178.]

. sum(atrig),det
[Output: detailed summary of Triglycerides-BL Anti, same layout. The only value preserved is the 50th percentile, 138.]

[Table: skewness and kurtosis of baseline DBP, baseline weight and baseline triglycerides (TG), taken from the summarize output above. The numeric values are not preserved.]

DBP has skewness < 0 so it is skewed to the left (something we have already seen in the graph). Weight and TG are both skewed to the right (i.e. skewness > 0). DBP is the least skewed (i.e. the skewness value is the closest to zero) and TG is the most skewed (i.e. the skewness value is the furthest from 0, the skewness value of the normal distribution).

The kurtosis value for the normal distribution is 3. DBP has the kurtosis value closest to 3 and weight has the kurtosis value most distant from 3.

Tests for normality and for skewness and kurtosis:

help swilk, help sfrancia          dialogs: swilk sfrancia

Title
    [R] swilk -- Shapiro-Wilk and Shapiro-Francia tests for normality

Syntax
    Shapiro-Wilk normality test
        swilk varlist [if] [in] [, options]
    Shapiro-Francia normality test
        sfrancia varlist [if] [in]

Description
    swilk performs the Shapiro-Wilk W test for normality, and sfrancia performs the Shapiro-Francia W' test for normality. swilk can be used with 4<=n<=2,000 observations, and sfrancia can be used with 5<=n<=5,000 observations; see [R] sktest for a test allowing more observations.
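To make the two reference values concrete (0 for skewness, 3 for kurtosis), here is a small Python sketch of the moment-based definitions. It assumes the convention in which the normal distribution has kurtosis 3, as quoted above; the function name and the sample lists are hypothetical and are not taken from allhat1000.dta.

```python
def skewness_kurtosis(values):
    """Moment-based skewness and kurtosis (normal distribution: 0 and 3)."""
    n = len(values)
    m = sum(values) / n
    # Central moments of order 2, 3 and 4, each divided by n.
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical samples for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 20]
left_skewed = [-v for v in right_skewed]   # mirror image of the sample above

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    skew, kurt = skewness_kurtosis(data)
    print(f"{name:13s} skewness = {skew:6.2f}   kurtosis = {kurt:5.2f}")
```

Mirroring a sample flips the sign of the skewness and leaves the kurtosis unchanged, which is why the sign alone tells you the direction of the skew.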

help sktest          dialog: sktest

Title
    [R] sktest -- Skewness and kurtosis test for normality

Syntax
    sktest varlist [if] [in] [weight] [, noadjust]
    aweights and fweights are allowed; see weight.

Description
    For each variable in varlist, sktest presents a test for normality based on skewness and another based on kurtosis and then combines the two tests into an overall test statistic. sktest requires a minimum of 8 observations to make its calculations.

Option
    Main
    noadjust suppresses the empirical adjustment made by Royston (1991) to the overall chi-squared and its significance level and presents the unaltered test as described by D'Agostino, Belanger, and D'Agostino Jr. (1990).

I usually go with the default since Stata usually chooses as the default the most commonly used statistic.

For each of the three normality tests above, the null hypothesis is that the distribution is normal. So if we reject the null hypothesis we have declared that the distribution is not normal. The skewness and kurtosis test tests the skewness and kurtosis individually, as well as presenting a combined test for overall normality.

You'll find the Shapiro-Francia and Shapiro-Wilk tests under swilk in the Stata manuals. For the Shapiro-Francia test, W' is the test statistic and V' is a transform of W'. Both give the same information. The median value of V' is 1 and large values indicate non-normality.

. sfrancia BV1DBP BLWGT ATRIG
[Output: Shapiro-Francia W' test for normal data, with columns Variable, Obs, W', V', z, Prob>z and rows BV1DBP, BLWGT, ATRIG. The numeric values are not preserved.]

The interpretation of the W and V for the Shapiro-Wilk test is similar to that of the test statistics of the Shapiro-Francia test.

. swilk BV1DBP BLWGT ATRIG
[Output: Shapiro-Wilk W test for normal data, with columns Variable, Obs, W, V, z, Prob>z and rows BV1DBP, BLWGT, ATRIG. The numeric values are not preserved.]

The sktest skewness and kurtosis test for normality you'll find under sktest in the Stata manuals.

. sktest BV1DBP BLWGT ATRIG
[Output: Skewness/Kurtosis tests for Normality, with columns Variable, Pr(Skewness), Pr(Kurtosis), adj chi2(2), Prob>chi2 and rows BV1DBP, BLWGT, ATRIG. The numeric values are not preserved.]

Well, in each case we have rejected the normality of the variables. Last time we considered these same three variables using the spearman and ktau commands for the nonparametric Spearman and Kendall's tau correlations. Neither of those tests assumes normality, so they are not affected by the rejections above. The normality tests are considered too sensitive. So what should we do? The histograms are somewhat skewed, but the Q-Q plots and the standardized normal probability plots aren't too bad, and the nonparametric correlations are available as a check that does not depend on normality. I would just report the parametric correlation results given below.

. pwcorr BV1DBP BLWGT ATRIG,obs sig
[Output: pairwise correlation matrix for BV1DBP, BLWGT and ATRIG with significance levels and numbers of observations. The numeric values are not preserved.]

Each of the three p-values says that we fail to reject the null hypothesis that the correlation is equal to zero (i.e. the pairs of variables are not linearly correlated).
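The reason the Spearman correlation does not lean on normality is that it is simply the Pearson correlation computed on ranks. A minimal Python sketch of that idea follows; it is not Stata's spearman command, ties get average ranks, and the data are made up to show a relationship that is perfectly monotone but not linear.

```python
import math


def pearson(x, y):
    """Ordinary (parametric) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y increases with x, but along a curve rather than a line.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print("Pearson :", round(pearson(x, y), 3))
print("Spearman:", round(spearman(x, y), 3))
```

Because only the ordering of the values enters the rank version, stretching or skewing either variable with a monotone transformation leaves the Spearman coefficient unchanged.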

An example to try to convince you that the assumption behind the linear regression graph with the normal curves is not as strange as it might seem.

[Figure: regression line passing through the means $\mu_{y|x_t}$ and $\mu_{y|x_p}$ of normal curves drawn at $x_t$ and $x_p$.]

I have repeatedly used the graph above so you would get the idea that the regression line is a line through the means of a series of normal curves. Below I have used the data set allhat1000.dta to try to show you with real data that a line through means makes sense. In the plot below the Visit 1 DBP for 1000 people is on the x-axis and the Visit 2 DBP for the same 1000 people is on the y-axis.

Notice that for Visit 1 DBP = 70 there are 27 values for Visit 2 DBP with mean = 77.3, for Visit 1 DBP = 80 there are 95 values for Visit 2 DBP with mean = 82.2, and for Visit 1 DBP = 90 there are 108 values for Visit 2 DBP with mean = 87.8.

. sum(bv2dbp) if BV1DBP == 70
. sum(bv2dbp) if BV1DBP == 80
. sum(bv2dbp) if BV1DBP == 90
[Output: Obs, Mean, Std. Dev., Min and Max of BV2DBP for each command. Only the Obs and Mean values quoted above are preserved.]

The plot below gives the scatter of Visit 1 (x-axis) and Visit 2 (y-axis) values of DBP. The solid dots are for Visit 1 DBP = 70, 80 and 90. The squares are the points (70, 77.3), (80, 82.2) and (90, 87.8).

[Figure: scatterplot of Visit 2 DBP (mm Hg, y-axis) against Visit 1 DBP (mm Hg, x-axis).]

Below I have added the least squares regression line and 3 more points. The additional points (x = 82, 85 and 100) were obtained in the same manner as the 3 in the graph above. Notice that the 6 points are not so far off the regression line.

[Figure: 1000 people for whom DBP was measured at 2 different time points. y-axis: Diastolic Blood Pressure at Visit 2; x-axis: Diastolic Blood Pressure at Visit 1. The dark circles are the points $(x, \bar{y}_x)$ for x = 70, 80, 82, 85, 90 and 100.]
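The comparison made in the graph (mean of y at each x value against the fitted line) can be written out in a few lines. This is a Python sketch on made-up pairs, not the allhat1000.dta values; pairs, group_means and ols are hypothetical names.

```python
from collections import defaultdict

# Hypothetical (visit 1 DBP, visit 2 DBP) pairs; not taken from allhat1000.dta.
pairs = [(70, 74), (70, 78), (70, 80),
         (80, 80), (80, 82), (80, 85),
         (90, 86), (90, 88), (90, 90)]


def group_means(data):
    """Mean of y among the observations sharing each value of x."""
    groups = defaultdict(list)
    for x, y in data:
        groups[x].append(y)
    return {x: sum(ys) / len(ys) for x, ys in groups.items()}


def ols(data):
    """Least squares intercept and slope for y regressed on x."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    slope = sxy / sxx
    return my - slope * mx, slope


intercept, slope = ols(pairs)
for x, ybar in sorted(group_means(pairs).items()):
    on_line = intercept + slope * x
    print(f"x = {x}: mean of y = {ybar:.2f}, regression line = {on_line:.2f}")
```

With real data the group means will not sit exactly on the line, but they should scatter closely around it when a straight line is a reasonable description.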

Now of course we are dealing with a finite number of y values at each value of x, whereas for the theoretical graph there would be a whole distribution's worth of y-values. But I hope you can see that the idea of a normal distribution of y's at each value of x, with the mean of the distribution of y's being on the regression line, does make sense (we of course can't show the normal part).

LOWESS regression (a diagnostic tool for regression):

I have mentioned LOWESS regression before and even graphed one but haven't given you any real details. Ordinary Least Squares regression fits a line to the data even if it is clear that a line is not appropriate (see pages 18 and 19 below). LOWESS regression doesn't make any model assumptions. The LOWESS (locally weighted scatterplot smoother) regression curve is one of our diagnostic tools for regression.

The idea is that each point $(x_i, y_i)$ in the dataset is fitted to a separate linear regression line based on adjacent observations. These points are weighted so that the further away the x value is from $x_i$, the less impact it has on determining the estimate $\hat{y}_i$. The proportion of the total data that is used to create each estimate $\hat{y}_i$ is called the bandwidth.

In Stata the default bandwidth is 0.8 (i.e. 80% of the dataset), which works for mid-size data sets. For large data sets using a bandwidth of 0.3 or 0.4 is recommended; a bandwidth of 0.99 is recommended for small data sets. The wider the bandwidth the smoother the curve. Narrow bandwidths produce curves that are more sensitive to local perturbations in the data. The recommendation is to experiment with different bandwidths. (William D. Dupont, Statistical Modeling for Biomedical Researchers, Cambridge University Press, 2002).

There is no statistical test related to LOWESS regression. We simply eyeball the graphs. The LOWESS curve below shows that fitting a line to the DBP visit 1 and visit 2 data is a pretty good idea.
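The description above (a separate weighted line at each point, with the bandwidth setting how much of the data each line sees) can be sketched in Python as follows. This is a simplified illustration, not Stata's lowess command: it assumes tricube weights and a nearest-neighbour window, does no robustness iterations, and runs on made-up data.

```python
import math


def lowess(xs, ys, bandwidth=0.8):
    """Locally weighted linear fit evaluated at each x in xs (simplified sketch)."""
    n = len(xs)
    # Number of neighbours in each local fit: the bandwidth share of the data.
    k = min(n, max(2, math.ceil(bandwidth * n)))
    fitted = []
    for x0 in xs:
        # The k observations closest to x0 form the local window.
        nearest = sorted(range(n), key=lambda j: abs(xs[j] - x0))[:k]
        d_max = max(abs(xs[j] - x0) for j in nearest)
        # Assumed tricube weights: 1 at x0, falling to 0 at the window edge.
        w = [(1 - (abs(xs[j] - x0) / d_max) ** 3) ** 3 if d_max > 0 else 1.0
             for j in nearest]
        sw = sum(w)
        mx = sum(wi * xs[j] for wi, j in zip(w, nearest)) / sw
        my = sum(wi * ys[j] for wi, j in zip(w, nearest)) / sw
        sxx = sum(wi * (xs[j] - mx) ** 2 for wi, j in zip(w, nearest))
        sxy = sum(wi * (xs[j] - mx) * (ys[j] - my) for wi, j in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Value of the local weighted regression line at x0.
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Hypothetical data: a straight-line trend with an alternating disturbance.
xs = list(range(10))
ys = [2 * x + 1 + (1.5 if x % 2 else -1.5) for x in xs]

for bw in (0.3, 0.8, 0.99):
    print(bw, [round(v, 2) for v in lowess(xs, ys, bandwidth=bw)])
```

Trying several bandwidths on the same data, as in the loop at the end, is the experiment the notes recommend. If the data lie exactly on a line, every local fit reproduces that line, so the LOWESS curve and the OLS line coincide.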

[Figure: Lowess Curve and OLS Regression Line. y-axis: DBP at Visit 2; x-axis: DBP at Visit 1. Lowess curve is solid; regression line is dashed.]

Below we are still using the data set allhat1000.dta, but now baseline weight is the predictor and visit 1 DBP is the outcome.

[Figure: Lowess Curve and OLS Regression Line. y-axis: DBP at Visit 1; x-axis: Weight at baseline (lbs). Regression line = dashed; Lowess curve = solid.]

The LOWESS curve immediately above is pretty flat. Below I have given $\bar{y}$ (i.e. the mean of the visit 1 DBP = 84.5) and the LOWESS curve.

[Figure: Lowess Curve and Mean of DBP at Visit 1 (84.5). y-axis: DBP at Visit 1; x-axis: Weight at baseline (lbs). Lowess curve = solid; Mean of DBP = dashed.]

Notice there is not a lot of difference between the two graphs above. The regression output is given below.

. regress BV1DBP BLWGT
[Output: regression table for BV1DBP on BLWGT, with F(1, 997) = 0.87. The other numeric values are not preserved.]

We have F = 0.87, df = 1 and 997 with p = [value not preserved]. Well, that is a definite failure to reject the null hypothesis (i.e. fail to reject slope = 0). Notice that the estimate of the slope is [value not preserved]. That is about as close to a flat line as you can get. Notice that the y-intercept = 83.1 is pretty close to the mean of the Visit 1 DBP (84.5). No wonder the two graphs look alike.

Of course, in this case we didn't really need either the regression output or the LOWESS curve to make that determination. Just looking at the scatter plot should have given us a pretty good idea of what the answer would be.

Let us go back and look at the Greene-Touchstone data (i.e. the estriol/birth weight problem). Remember I said that it was not really a wonderful set of data to use as an example for regression, but it is a good example of some of the problems with regression.

[Figures: three panels titled "Greene-Touchstone Study, OLS regression line (dashed) and LOWESS (solid)", each drawn with a different bandwidth. y-axis: Birthweight in gms; x-axis: Estriol mg/24 hrs.]

Notice that the smoothest LOWESS curve has bandwidth = 0.99 and the most jagged has bandwidth = 0.3. The LOWESS curve (regardless of bandwidth) says that fitting a regression line is not exactly the best choice we could make. The regression output below shows that while we rejected the null hypothesis (i.e. we concluded that the slope was different from zero), $R^2$ is only 0.37, indicating that the regression line accounts for only 37% of the variability in the data.

. regress bwt100 estriol
[Output: regression table for bwt100 on estriol, with F on 1 and 29 df. The numeric values are not preserved.]

We need to be careful about interpreting regression results without looking at the graph. The data file used below is Example151Goodold.dta. The example is taken from Common Errors in Statistics (and How to Avoid Them) by P.I. Good and J.W. Hardin. Each of the 4 regression runs below has the same $R^2$ and the same estimates for the slope and the y-intercept. Do you think they are all the same?

. regress y1 x1
. regress y2 x2
. regress y3 x3
. regress y4 x4
[Output: four regression tables, each with F on 1 and 9 df. The numeric values are not preserved.]

The lines in the 4 graphs below should be written $\hat{y} = 3 + 0.5x$, with $R^2 = 0.67$.

[Figures: four scatterplots (y1 vs x1, y2 vs x2, y3 vs x3, y4 vs x4), each showing the OLS regression line and the Lowess regression curve.]
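The point can be checked numerically. The Python sketch below rests on an assumption: the contents of Example151Goodold.dta are not shown in these notes, so it uses the values of Anscombe's (1973) quartet, which has the same layout (four x-y pairs of 11 observations, slope about 0.5, intercept about 3, $R^2$ about 0.67). If the file holds different numbers, substitute them.

```python
# Assumed data: Anscombe's quartet, standing in for Example151Goodold.dta.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "y1 on x1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "y2 on x2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "y3 on x3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "y4 on x4": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ols_summary(x, y):
    """Least squares intercept, slope and R-squared for y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy ** 2 / (sxx * syy)


results = {name: ols_summary(x, y) for name, (x, y) in quartet.items()}
for name, (intercept, slope, r2) in results.items():
    print(f"{name}: intercept = {intercept:.2f}, slope = {slope:.2f}, R-squared = {r2:.2f}")
```

The four summaries agree to two decimal places even though x4 takes only two distinct values, so the numbers alone cannot tell the four scatterplots apart. That is the reason to look at the graph before trusting the regression output.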

Inappropriate use of regression: Baseline medications and Race

. tab RACE
[Output: frequency table for Race with rows White, Black, Asian/Pacific Islander, Other and Total. The counts and percentages are not preserved.]

. tab BLMEDS
[Output: frequency table for Baseline Medications with rows On 1-2 drugs ge 2 months, On drugs lt 2 months, Currently untreated and Total. The counts and percentages are not preserved.]

. label list racelbl
racelbl:
    1 White
    2 Black
    3 Amer Indian/Alaskan native
    4 Asian/Pacific Islander
    5 Other

. label list blmedlbl
blmedlbl:
    1 On 1-2 drugs ge 2 months
    2 On drugs lt 2 months
    3 Currently untreated


Response to question in class: If you think you should transform your data, how do you decide which would be the best transformation to use?

I would use gladder or qladder, which give you an array of potential transformations so you can see how normal each transformation would make your variable of interest. gladder gives histograms and qladder gives Q-Q plots.

Notice that Ladder-of-powers histograms is highlighted but that the line below histograms is the command for the Q-Q plots. Just typing in gladder BLWGT gets you the same results.
