Chapter 11 Part 6 Correlation Continued LOWESS Regression February 17, 2009

Goal: To review the properties of the correlation coefficient. To introduce you to the various tools that can be used to decide if a distribution is normally distributed. To introduce you to tests of normality, skewness and kurtosis. To introduce you to the use of LOWESS regression as a diagnostic tool. To introduce you to looking carefully at your data.

Skills: You should be able to determine whether or not a distribution is normal. You should know how to use and interpret both parametric and non-parametric correlation coefficients. You should know how to obtain and interpret a LOWESS line. You should now have the experience to know better than to just run Stata commands without looking at the data.

Commands: pnorm, qnorm, sfrancia, swilk, sktest

Datasets: allhat1000old.dta, GreeneTouchold.dta, Example151Goodold.dta

Last class we looked at the correlation (i.e. linear relationship) between the baseline diastolic blood pressure and weight at baseline. At the time I pointed out that in order to statistically test whether there was a linear relationship, both of the variables (DBP and weight) would have to be normally distributed. Last class we just focused on how to get and interpret correlation coefficients and ignored the need for normality. Today we are going back to check on the normality of the variables.

We are going to first consider graphical techniques for deciding if the variables are normally distributed. For each of baseline DBP and baseline weight I have plotted a histogram, a Q-Q plot (quantiles of the normal distribution, Stata command qnorm) and a standardized normal probability plot (Stata command pnorm). Notice that for both the Q-Q plot (also called a quantile-normal plot) and the standardized normal probability plot the straight line represents the normal distribution. So ideally all of our points would fall on this normal line. Deviations from the line are indications of non-normality. There is no statistical test that goes with these plots, although there are statistical tests for normality, so we just look at the plots and make our best judgement. The Q-Q plot is sensitive to non-normality near the tails of the distribution. The standardized normal probability plot is sensitive to non-normality in the middle range of the data. Below I have added baseline triglycerides to the mix because it is skewed in a different way from the other two variables, so we will include the correlation of triglycerides with the other two variables.

After using the plots to consider normality we are going to use the summarize command with the detail option to get the skewness and kurtosis of each variable. Finally, we are going to use the Shapiro-Wilk normality test (for data sets with 4 <= n <= 2000; the Stata command is swilk), the Shapiro-Francia normality test (for 5 <= n <= 5000; the Stata command is sfrancia) and a skewness and kurtosis test for normality (the Stata command is sktest). Usually statistical tests are considered better than plots, but in this case the plots are actually preferred.
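For readers who want to see what qnorm is actually plotting, here is a small Python sketch (mine, not part of the original handout; Stata does the equivalent internally). It pairs the ordered data with the quantiles of a normal distribution fitted to the data, using the plotting position i/(n+1) that also appears on the pnorm axis labels. Points falling near the 45-degree reference line suggest normality.

```python
import statistics

def qnorm_points(data):
    """Coordinates behind a normal Q-Q plot, as drawn by Stata's qnorm.

    x-coordinates: quantiles of N(mean, sd) fitted to the data, taken at
    plotting positions i/(n+1); y-coordinates: the ordered data. Normal
    data should fall close to the y = x reference line."""
    ys = sorted(data)
    n = len(ys)
    fitted = statistics.NormalDist(statistics.mean(ys), statistics.stdev(ys))
    xs = [fitted.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(xs, ys))

# A small, roughly symmetric made-up sample of DBP-like values:
points = qnorm_points([60, 70, 75, 80, 80, 85, 90, 90, 95, 100, 110])
```

Plotting `points` with any scatterplot tool reproduces the Q-Q plots shown below.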

[Figure: histogram of Avg DBP at Visit 1 (allhat1000.dta), frequency against DBP from 40 to 120 mm Hg.] Notice the tall bars at 80 mm Hg and 90 mm Hg. This indicates what is called a digit preference (i.e. people tend to round off to 80 and 90). The distribution is skewed to the left.

[Figure: "Quantiles of normal distribution plot", Avg DBP at Visit 1 against the Inverse Normal.] The Stata command is "qnorm". This plot is sensitive to non-normality near the tails. This is also called a Q-Q plot or a normal quantile plot. Notice that the histogram is skewed to the left and the qnorm plot curves downward. The plot deviates from the normal line more on the left end of the graph. We know how to use the dropdown menus to get the histogram, but how do we use them to get the Q-Q plot?

You have the option of selecting a plot type (3rd tab from the left), but the usual choice is the scatterplot, which in this case is the default plot. The scatterplot is what is pictured for the Q-Q plot on the page above.

[Figure: standardized normal probability plot of Avg DBP at Visit 1.] The Stata command is "pnorm". The plot is sensitive to non-normality in the middle range of the data. The 2 spots pointed out by the arrows are probably related to the tall bars at 80 and 90 mm Hg. The plot above is obtained using the standardized normal probability plot (pnorm).

[Figures: histogram, qnorm plot and pnorm plot of Weight (lbs) at Baseline.] The distribution of weight is skewed to the right. The Stata command for the second plot is qnorm. This plot is sensitive to non-normality near the tails; it is the Q-Q plot. Notice that the histogram is skewed to the right and the qnorm plot curves upward. The Stata command for the third plot is pnorm. The plot is sensitive to non-normality in the middle range of the data.

[Figure: histogram of baseline triglycerides for the antihypertensive study (allhat1000.dta), 0 to 500.] Notice that the triglyceride distribution is very skewed to the right.

[Figure: "Quantiles of normal distribution plot", Triglycerides-BL Anti against the Inverse Normal.] The Stata command is "qnorm". This plot is sensitive to non-normality near the tails. This is also called a Q-Q plot. Notice that the histogram is skewed to the right and the qnorm plot curves upward. Notice that the spaced-out dots on the right side of the plot go with the spaced-out bars in the histogram above.

[Figure: "Standardized normal probability plot", Normal F[(ATRIG-m)/s] against Empirical P[i] = i/(n+1).] The Stata command is "pnorm". The plot is sensitive to non-normality in the middle range of the data.

A distribution can be skewed to both the left and the right, in which case the curve in the Q-Q plot can go up at one end and down at the other.

. sum BV1DBP, det

                      Avg DBP at Visit 1
-------------------------------------------------------------
      Percentiles      Smallest
 1%           60             46
 5%         67.5             50
10%           71             51       Obs             1000
25%           79             52       Sum of Wgt.     1000
50%           85                      Mean          84.452
                        Largest       Std. Dev.   9.797535
75%           91            110
90%           97            110       Variance    95.99169
95%          100            110       Skewness    -.373029
99%        103.5            110       Kurtosis    3.286597

. sum BLWGT, det

                    Weight(lbs) at Baseline
-------------------------------------------------------------
      Percentiles      Smallest
 1%          106             85
 5%          127             92
10%          137             94       Obs              999
25%          155             96       Sum of Wgt.      999
50%          178                      Mean        182.5145
                        Largest       Std. Dev.   38.59176
75%          205            332
90%          233            334       Variance    1489.324
95%          250            343       Skewness    .6869403
99%          294            350       Kurtosis    4.048437

. sum ATRIG, det

                     Triglycerides-BL Anti
-------------------------------------------------------------
      Percentiles      Smallest
 1%           40             27
 5%           60             36
10%           72             36       Obs              972
25%         97.5             37       Sum of Wgt.      972
50%          138                      Mean        151.7541
                        Largest       Std. Dev.   71.51119
75%          195            364
90%          250            442       Variance     5113.85
95%          293            480       Skewness    .8662118
99%          336            485       Kurtosis    3.712122
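The skewness and kurtosis figures that summarize reports with the detail option are simple moment ratios. As a sketch (mine, not from the handout), they can be computed from scratch like this:

```python
def skewness_kurtosis(x):
    """Moment-based skewness and kurtosis, matching what Stata's
    `summarize, detail` reports: skewness = m3 / m2**1.5 and
    kurtosis = m4 / m2**2, where m_k is the k-th central moment.
    A normal distribution has skewness 0 and kurtosis 3."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# A perfectly symmetric sample has skewness exactly 0:
skew, kurt = skewness_kurtosis([1, 2, 3, 4, 5])
```

Negative skewness means the long tail is on the left; positive means it is on the right; which is exactly how the table on the next page is read.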

Variable                         Skewness    Kurtosis
baseline DBP                        -0.37        3.29
baseline weight                      0.69        4.05
baseline triglycerides (TG)          0.87        3.71

DBP has skewness < 0 so it is skewed to the left (something we have already seen in the graph). Weight and TG are both skewed to the right (i.e. skewness > 0). DBP is the least skewed (i.e. its skewness value is the closest to zero) and TG is the most skewed (i.e. its skewness value is the furthest from 0, the skewness value of the normal distribution). The kurtosis value for the normal distribution is 3. DBP has the kurtosis value closest to 3 and weight has the kurtosis value most distant from 3.

Tests for normality and for skewness and kurtosis: help swilk, help sfrancia (dialogs: swilk, sfrancia)

[R] swilk -- Shapiro-Wilk and Shapiro-Francia tests for normality

Syntax

    Shapiro-Wilk normality test
        swilk varlist [if] [in] [, options]

    Shapiro-Francia normality test
        sfrancia varlist [if] [in]

Description

    swilk performs the Shapiro-Wilk W test for normality, and sfrancia
    performs the Shapiro-Francia W' test for normality. swilk can be used
    with 4 <= n <= 2,000 observations, and sfrancia can be used with
    5 <= n <= 5,000 observations; see [R] sktest for a test allowing more
    observations.

help sktest (dialog: sktest)

[R] sktest -- Skewness and kurtosis test for normality

Syntax

    sktest varlist [if] [in] [weight] [, noadjust]

    aweights and fweights are allowed; see weight.

Description

    For each variable in varlist, sktest presents a test for normality
    based on skewness and another based on kurtosis and then combines the
    two tests into an overall test statistic. sktest requires a minimum
    of 8 observations to make its calculations.

Option

    noadjust suppresses the empirical adjustment made by Royston (1991)
    to the overall chi-squared and its significance level and presents
    the unaltered test as described by D'Agostino, Balanger, and
    D'Agostino Jr. (1990).

I usually go with the default since Stata usually chooses as the default the most commonly used statistic. For each of the three normality tests above, the null hypothesis is that the distribution is normal. So if we reject the null hypothesis we have declared that the distribution is not normal. The skewness and kurtosis test tests the skewness and kurtosis individually, as well as presenting a combined test for overall normality.

You'll find the Shapiro-Francia and Shapiro-Wilk tests under swilk in the Stata manuals. For the Shapiro-Francia test, W' is the test statistic and V' is a transform of W'. Both give the same information. The median value of V' is 1 and large values indicate non-normality.

. sfrancia BV1DBP BLWGT ATRIG

                  Shapiro-Francia W' test for normal data
Variable        Obs        W'         V'         z     Prob>z
-------------+-------------------------------------------------
BV1DBP         1000    0.98858      7.628     4.410   0.00001
BLWGT           999    0.97509     16.627     5.892   0.00001
ATRIG           972    0.94783     34.019     7.178   0.00001

The interpretation of the W and V for the Shapiro-Wilk test is similar to that of the

test statistics of the Shapiro-Francia test.

. swilk BV1DBP BLWGT ATRIG

                   Shapiro-Wilk W test for normal data
Variable        Obs        W          V          z     Prob>z
-------------+-------------------------------------------------
BV1DBP         1000    0.98794      7.606     5.025   0.00000
BLWGT           999    0.97526     15.592     6.802   0.00000
ATRIG           972    0.94763     32.187     8.587   0.00000

The sktest skewness and kurtosis test for normality you'll find under sktest in the Stata manuals.

. sktest BV1DBP BLWGT ATRIG

                  Skewness/Kurtosis tests for Normality
                                                ------- joint ------
Variable     Pr(Skewness)   Pr(Kurtosis)   adj chi2(2)    Prob>chi2
-------------+-------------------------------------------------------
BV1DBP          0.000          0.076          22.85         0.0000
BLWGT           0.000          0.000          71.28         0.0000
ATRIG           0.000          0.000            .           0.0000

Well, in each case we have rejected the normality of the variables. Last time we considered these same three variables using the spearman and ktau commands for the non-parametric correlation tests, Spearman's rho and Kendall's tau. For both tests we failed to reject the null hypothesis of zero correlation for each of the 3 pairs of variables. The parametric normality tests are considered too sensitive. So what should we do? The histograms are somewhat skewed, but the Q-Q plots and the standardized normal probability plots aren't too bad, and the non-parametric correlation results agree with the parametric ones. I would just report the parametric correlation results given below.

. pwcorr BV1DBP BLWGT ATRIG, obs sig

              BV1DBP     BLWGT     ATRIG
-------------+---------------------------
BV1DBP        1.0000
                1000

BLWGT         0.0295    1.0000
              0.3520
                 999       999

ATRIG        -0.0200    0.0524    1.0000
              0.5344    0.1029
                 972       971       972

Each of the three p-values says that we fail to reject the null hypothesis that the correlation is equal to zero (i.e. the pairs of variables are not linearly correlated).
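Since the handout keeps contrasting the parametric (Pearson) correlation with the non-parametric Spearman correlation, a short sketch may help (my own illustration, with made-up data): Spearman's rho is nothing more than Pearson's r computed on the ranks of the data, which is why it responds to monotonic rather than strictly linear association.

```python
def pearson(x, y):
    """Pearson product-moment correlation (what pwcorr reports)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson's r on the ranks. (Ties are ignored
    here for simplicity; Stata's spearman command averages tied ranks.)"""
    rank = lambda v: [sorted(v).index(a) + 1 for a in v]
    return pearson(rank(x), rank(y))

# y = x**3 is monotonic but not linear: Spearman sees a perfect
# association while Pearson reports something weaker.
r_p = pearson([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])
r_s = spearman([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])
```

This is why the two kinds of coefficient can disagree when a relationship is curved.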

An example to try to convince you that the assumption behind the linear regression graph with the normal curves is not as strange as it might seem.

[Figure: a regression line passing through the means mu(y|x_t) and mu(y|x_p) of normal curves centered at x_t and x_p.]

I have repeatedly used the graph above so you would get the idea that the regression line is a line through the means of a series of normal curves. Below I have used the data set allhat1000.dta to try to show you with real data that a line through means makes sense. In the plot below the Visit 1 DBP for 1000 people is on the x-axis and the Visit 2 DBP for the same 1000 people is on the y-axis. Notice that for Visit 1 DBP = 70 there are 27 values for Visit 2 DBP with mean = 77.3, for Visit 1 DBP = 80 there are 95 values for Visit 2 DBP with mean = 82.2, and for Visit 1 DBP = 90 there are 108 values for Visit 2 DBP with mean = 87.8.

. sum BV2DBP if BV1DBP == 70

Variable      Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
BV2DBP         27    77.33333    8.086075         60         92

. sum BV2DBP if BV1DBP == 80

Variable      Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
BV2DBP         95    82.16842    7.855954         60        100

. sum BV2DBP if BV1DBP == 90

Variable      Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
BV2DBP        108    87.83333    7.323627         60        105

The plot below gives the scatter of Visit 1 (x-axis) and Visit 2 (y-axis) values of DBP. The solid dots are for Visit 1 DBP = 70, 80 and 90. The squares are the points (70, 77.3), (80, 82.2) and (90, 87.8).

[Figure: scatter plot of Visit 2 DBP (mm Hg), 40 to 120, against Visit 1 DBP (mm Hg), 45 to 110.]

Below I have added the least squares regression line and 3 more points. The additional points (for x = 82, 85 and 100) were obtained in the same manner as the 3 in the graph above. Notice that the 6 points are not so far off the regression line.

[Figure: "1000 people for whom DBP measured at 2 different time points". Diastolic Blood Pressure at Visit 2 against Diastolic Blood Pressure at Visit 1, with the least squares regression line. The dark circles are the points (x, mean of y at x) for x = 70, 80, 82, 85, 90 and 100.]
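The "line through means" idea can also be checked numerically. Here is a Python sketch with a small hypothetical data set (the pairs below are made-up stand-ins, not the allhat1000.dta values): the mean of y at each distinct x sits close to the fitted value of the least squares line at that x.

```python
def ols(x, y):
    """Closed-form least squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

# Hypothetical (Visit 1, Visit 2) DBP pairs, two people per x value:
pairs = [(70, 75), (70, 79), (80, 81), (80, 83), (90, 86), (90, 90)]
slope, intercept = ols([p[0] for p in pairs], [p[1] for p in pairs])

# Mean of y at each x versus the fitted value of the line there:
cond_means = {xv: sum(b for a, b in pairs if a == xv) / 2 for xv in (70, 80, 90)}
fits = {xv: intercept + slope * xv for xv in (70, 80, 90)}
```

With real data the agreement is only approximate, as the six plotted points above show, but the conditional means track the line.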

Now of course we are dealing with a finite number of y values at each value of x, whereas for the theoretical graph there would be a whole distribution's worth of y-values. But I hope you can see that the idea of a normal distribution of y's at each value of x, with the mean of the distribution of y's being on the regression line, does make sense (we of course can't show the normal part).

LOWESS regression (a diagnostic tool for regression):

I have mentioned LOWESS regression before and even graphed one, but haven't given you any real details. Ordinary least squares regression fits a line to the data even if it is clear that a line is not appropriate (see the examples near the end of this handout). LOWESS regression doesn't make any model assumptions. The LOWESS (locally weighted scatterplot smoother) regression curve is one of our diagnostic tools for regression. The idea is that each point (x_i, y_i) in the dataset is fitted to a separate linear regression line based on adjacent observations. These points are weighted so that the further away an x value is from x_i, the less impact it has on determining the estimate y-hat_i. The proportion of the total data that is used to create each estimate is called the bandwidth. In Stata the default bandwidth is 0.8 (i.e. 80% of the dataset), which works for mid-size data sets. For large data sets a bandwidth of 0.3 or 0.4 is recommended; a bandwidth of 0.99 is recommended for small data sets. The wider the bandwidth, the smoother the curve. Narrow bandwidths produce curves that are more sensitive to local perturbations in the data. The recommendation is to experiment with different bandwidths. (William D. Dupont, Statistical Modeling for Biomedical Researchers, Cambridge University Press, 2002.) There is no statistical test related to LOWESS regression. We simply eyeball the graphs. The LOWESS curve below shows that fitting a line to the DBP visit 1 and visit 2 data is a pretty good idea.
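To make the description above concrete, here is a bare-bones Python sketch of the LOWESS idea (my simplification: a single pass with tricube weights and no robustness iterations, so it is cruder than Stata's lowess command). For each x_i a weighted least squares line is fitted using the nearest bandwidth-fraction of the data, and the fitted value at x_i becomes one point of the smooth curve.

```python
def lowess_fit(x, y, bandwidth=0.8):
    """Minimal LOWESS sketch: local linear fits with tricube weights.

    For each x[i], only the nearest `bandwidth` fraction of the points
    gets positive weight, falling off smoothly with distance from x[i]."""
    n = len(x)
    k = max(2, round(bandwidth * n))
    smooth = []
    for i in range(n):
        # h = distance to the k-th nearest neighbour of x[i]
        h = sorted(abs(xj - x[i]) for xj in x)[k - 1] or 1.0
        w = [(1 - min(abs(xj - x[i]) / h, 1.0) ** 3) ** 3 for xj in x]
        sw = sum(w)
        mx = sum(wi * xj for wi, xj in zip(w, x)) / sw
        my = sum(wi * yj for wi, yj in zip(w, y)) / sw
        sxx = sum(wi * (xj - mx) ** 2 for wi, xj in zip(w, x))
        sxy = sum(wi * (xj - mx) * (yj - my) for wi, xj, yj in zip(w, x, y))
        b = sxy / sxx if sxx else 0.0
        smooth.append(my + b * (x[i] - mx))
    return smooth

# On exactly linear data the smooth reproduces the line at any
# bandwidth, which is why a LOWESS curve hugging the OLS line is
# evidence that a straight line fits well.
xs = list(range(10))
ys = [2 * v + 1 for v in xs]
```

On curved data the same code bends with the scatter, and smaller bandwidths make it more jagged, exactly as in the Greene-Touchstone panels below.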

[Figure: "Lowess Curve and OLS Regression Line". DBP at Visit 2 against DBP at Visit 1; the Lowess curve is solid, the regression line is dashed.]

Below we are still using the data set allhat1000.dta, but now baseline weight is the predictor and visit 1 DBP is the outcome.

[Figure: "Lowess Curve and OLS Regression Line". DBP at Visit 1 against Weight at baseline (lbs); regression line dashed, Lowess curve solid.]

The LOWESS curve immediately above is pretty flat. Below I have plotted y-bar (i.e. the mean of the visit 1 DBP = 84.5) along with the LOWESS curve.

[Figure: "Lowess Curve and Mean of DBP at Visit 1 (84.5)". DBP at Visit 1 against Weight at baseline (lbs); Lowess curve solid, mean of DBP dashed.]

Notice there is not a lot of difference between the two graphs above. The regression output is given below.

. regress BV1DBP BLWGT

      Source        SS        df       MS             Number of obs =     999
-------------+------------------------------          F(  1,   997) =    0.87
       Model   83.3040589     1  83.3040589           Prob > F      =  0.3520
    Residual   95792.5518   997  96.0807942           R-squared     =  0.0009
-------------+------------------------------          Adj R-squared = -0.0001
       Total   95875.8559   998  96.0679918           Root MSE      =  9.8021

      BV1DBP       Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       BLWGT    .0074864     .00804     0.93   0.352    -.0082909    .0232638
       _cons    83.09008   1.499837    55.40   0.000     80.14688    86.03328

We have F = 0.87 with df = 1 and 997 and p = 0.35. Well, that is a definite failure to reject the null hypothesis (i.e. we fail to reject slope = 0). Notice that the estimate of the slope is 0.007. That is about as close to a flat line as you can get. Notice that the y-intercept = 83.1 is pretty close to the mean of the Visit 1 DBP (84.5). No wonder the two graphs look alike. Of course, in this case we didn't really need either the regression output or the LOWESS curve to make that determination. Just looking at the scatter plot should have given us a pretty good idea of what the answer would be.
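As a cross-check on how the summary numbers in a regress table fit together (this is just arithmetic on the sums of squares copied from the output above, not part of the original handout), R-squared is Model SS over Total SS, and F is Model MS over Residual MS:

```python
# Sums of squares and degrees of freedom from `regress BV1DBP BLWGT`:
model_ss, resid_ss = 83.3040589, 95792.5518
model_df, resid_df = 1, 997

total_ss = model_ss + resid_ss
r_squared = model_ss / total_ss                          # about 0.0009
f_stat = (model_ss / model_df) / (resid_ss / resid_df)   # about 0.87
adj_r_squared = 1 - (resid_ss / resid_df) / (total_ss / (model_df + resid_df))
```

The recomputed values match the 0.0009, 0.87 and -0.0001 printed in the table; the slightly negative adjusted R-squared is what a slope of essentially zero buys you.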

Let us go back and look at the Greene-Touchstone data (i.e. the estriol/birth weight problem). Remember I said that it was not really a wonderful set of data to use as an example for regression, but it is a good example of some of the problems with regression.

[Figures: three panels from the Greene-Touchstone study plotting birthweight in gms (2500 to 4500) against estriol mg/24 hrs (5 to 25), each with the OLS regression line (dashed) and the LOWESS curve (solid), for bandwidths 0.5, 0.3 and 0.99.]

Notice that the smoothest LOWESS curve has bandwidth = 0.99 and the most jagged has bandwidth = 0.3. The LOWESS curve (regardless of bandwidth) says that fitting a regression line is not exactly the best choice we could make. The regression output below shows that while we rejected the null hypothesis (i.e. we concluded that the slope was different from zero), R-squared is only 0.37, indicating that the regression line accounts for only 37% of the variability in the data.

. regress bwt100 estriol

      Source        SS        df       MS             Number of obs =      31
-------------+------------------------------          F(  1,    29) =   17.16
       Model   2505744.76     1  2505744.76           Prob > F      =  0.0003
    Residual   4234255.24    29  146008.801           R-squared     =  0.3718
-------------+------------------------------          Adj R-squared =  0.3501
       Total      6740000    30  224666.667           Root MSE      =  382.11

      bwt100       Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     estriol    60.81905   14.68117     4.14   0.000     30.79268    90.84542
       _cons    2152.343   262.0417     8.21   0.000     1616.407    2688.278

We need to be careful about interpreting regression results without looking at the graph. The data file used below is Example151Goodold.dta. The example is taken from Common Errors in Statistics (and How to Avoid Them) by P.I. Good and J.W. Hardin. Each of the 4 regression runs below has the same R-squared and the same estimates for the slope and the y-intercept. Do you think they are all the same?

. regress y1 x1

      Source        SS        df       MS             Number of obs =      11
-------------+------------------------------          F(  1,     9) =   17.99
       Model   27.5100011     1  27.5100011           Prob > F      =  0.0022
    Residual   13.7626904     9  1.52918783           R-squared     =  0.6665
-------------+------------------------------          Adj R-squared =  0.6295
       Total   41.2726916    10  4.12726916           Root MSE      =  1.2366

          y1       Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1    .5000909   .1179055     4.24   0.002     .2333701    .7668117
       _cons    3.000091   1.124747     2.67   0.026     .4557369    5.544445

. regress y2 x2

      Source        SS        df       MS             Number of obs =      11
-------------+------------------------------          F(  1,     9) =   17.97
       Model   27.5000024     1  27.5000024           Prob > F      =  0.0022
    Residual    13.776294     9  1.53069933           R-squared     =  0.6662
-------------+------------------------------          Adj R-squared =  0.6292
       Total   41.2762964    10  4.12762964           Root MSE      =  1.2372

          y2       Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x2          .5   .1179638     4.24   0.002     .2331475    .7668526
       _cons    3.000909   1.125303     2.67   0.026     .4552978     5.54652

. regress y3 x3

      Source        SS        df       MS             Number of obs =      11
-------------+------------------------------          F(  1,     9) =   17.97
       Model   27.4700075     1  27.4700075           Prob > F      =  0.0022
    Residual   13.7561905     9  1.52846561           R-squared     =  0.6663
-------------+------------------------------          Adj R-squared =  0.6292
       Total   41.2261979    10  4.12261979           Root MSE      =  1.2363

          y3       Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x3    .4997273   .1178777     4.24   0.002     .2330695    .7663851
       _cons    3.002455   1.124481     2.67   0.026     .4587014    5.546208

. regress y4 x4

      Source        SS        df       MS             Number of obs =      11
-------------+------------------------------          F(  1,     9) =   18.00
       Model   27.4900007     1  27.4900007           Prob > F      =  0.0022
    Residual   13.7424908     9  1.52694342           R-squared     =  0.6667
-------------+------------------------------          Adj R-squared =  0.6297
       Total   41.2324915    10  4.12324915           Root MSE      =  1.2357

          y4       Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x4    .4999091   .1178189     4.24   0.002     .2333841    .7664341
       _cons    3.001727   1.123921     2.67   0.026     .4592411    5.544213

The fitted lines in the 4 graphs below can all be written y-hat = 3 + 0.5x, with R-squared = 0.67 in every case.

[Figures: four panels plotting y1 vs x1, y2 vs x2, y3 vs x3 and y4 vs x4 (x from 0 to 20, y from 0 to 14), each labeled y = 3 + 0.5x, R-squared = 0.67, and each showing the OLS regression line and the LOWESS regression curve.]
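The coefficients printed above match the published values for Anscombe's famous quartet, four data sets constructed so that very different scatters share one fitted line. Recomputing two of the quartet's sets with closed-form OLS shows why every table prints slope near 0.5, intercept near 3.0 and R-squared near 0.67 (a sketch using the published Anscombe values, which I am assuming mirror the Example151Goodold.dta file; sets 1 to 3 share the same x):

```python
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def fit(x, y):
    """Closed-form simple regression: slope, intercept and R-squared."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    return slope, my - slope * mx, sxy ** 2 / (sxx * syy)

fit1 = fit(x1, y1)  # nearly identical slope, intercept, R-squared...
fit2 = fit(x1, y2)  # ...even though y2 against x1 is a smooth curve
```

Only a plot (or a LOWESS curve) can tell the two apart; the regression tables cannot.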

Inappropriate use of regression

[Figure: scatter plot of Baseline medications (1 to 3) against Race (1 to 5), two categorical variables.]

. tab RACE

                      Race     Freq.     Percent        Cum.
---------------------------+-----------------------------------
                     White       544       54.40       54.40
                     Black       395       39.50       93.90
    Asian/Pacific Islander        14        1.40       95.30
                     Other        47        4.70      100.00
---------------------------+-----------------------------------
                     Total     1,000      100.00

. tab BLMEDS

     Baseline Medications     Freq.     Percent        Cum.
-------------------------+-----------------------------------
 On 1-2 drugs ge 2 months       855       85.50       85.50
     On drugs lt 2 months        33        3.30       88.80
      Currently untreated       112       11.20      100.00
-------------------------+-----------------------------------
                    Total     1,000      100.00

. label list racelbl
racelbl:
           1 White
           2 Black
           3 Amer Indian/Alaskan native
           4 Asian/Pacific Islander
           5 Other

. label list blmedlbl
blmedlbl:
           1 On 1-2 drugs ge 2 months
           2 On drugs lt 2 months
           3 Currently untreated

Response to a question in class: if you think you should transform your data, how do you decide which would be the best transformation to use? I would use gladder or qladder, which give you an array of potential transformations (the ladder of powers) so you can see how normal each transformation would make your variable of interest. gladder gives histograms and qladder gives Q-Q plots. In the dropdown menus, notice that "Ladder-of-powers histograms" is highlighted, but the line below it is the command for the Q-Q plots. Just typing gladder BLWGT gets you the same results.
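The ladder-of-powers idea can also be checked numerically: apply each candidate transformation and see which one pulls the skewness closest to zero. A small Python sketch (the sample below is made up to resemble a right-skewed triglyceride-style variable; gladder itself displays nine rungs of the ladder):

```python
import math

def skew(x):
    """Moment-based skewness, as in Stata's summarize, detail."""
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample standing in for baseline triglycerides:
tg = [40, 60, 72, 97, 110, 125, 138, 160, 195, 250, 293, 480]

# A few rungs of the ladder of powers. The negative-power rungs are
# sign-flipped so each transformation stays order-preserving.
ladder = {
    "identity": lambda v: v,
    "sqrt": math.sqrt,
    "log": math.log,
    "1/sqrt": lambda v: -1 / math.sqrt(v),
    "inverse": lambda v: -1 / v,
}
for name, f in ladder.items():
    print(name, round(skew([f(v) for v in tg]), 3))
```

Descending the ladder pulls in a long right tail, which is why the log is the usual first try for a variable like triglycerides.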