SAS Simple Linear Regression Example


This handout gives examples of how to use SAS to generate a simple linear regression plot, check the correlation between two variables, fit a simple linear regression model, and check the residuals from the model. It also shows some of the ODS (Output Delivery System) output in SAS.

Read in Raw Data

We first read in the raw data from the werner2.dat raw dataset and set up the missing value codes using a data step. We then check descriptive statistics for the numeric variables, using Proc Means.

    OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";
    libname b510 "C:\Users\kwelch\Desktop\B510";

    DATA b510.werner;
      INFILE "C:\Users\kwelch\Desktop\B510\werner2.dat";
      INPUT ID 1-4 AGE 5-8 HT 9-12 WT 13-16 PILL 17-20 CHOL 21-24
            ALB 25-28 .1 CALC 29-32 .1 URIC 33-36 .1;
      IF HT = 999 THEN HT = .;
      IF WT = 999 THEN WT = .;
      IF CHOL = 600 THEN CHOL = .;
      IF ALB = 99 THEN ALB = .;
      IF CALC = 99 THEN CALC = .;
      IF URIC = 99 THEN URIC = .;
    RUN;

    /*Check the Data*/
    title "DESCRIPTIVE STATISTICS";
    proc means data=b510.werner;
    run;

DESCRIPTIVE STATISTICS
The MEANS Procedure

    Variable     N           Mean        Std Dev        Minimum        Maximum
    ---------------------------------------------------------------------------
    ID         188        1598.96        1057.09      3.0000000        3519.00
    AGE        188     33.8191489     10.1126942     19.0000000     55.0000000
    HT         186     64.5107527      2.4850673     57.0000000     71.0000000
    WT         186    131.6720430     20.6605767     94.0000000    215.0000000
    PILL       188      1.5000000      0.5013351      1.0000000      2.0000000
    CHOL       187    235.1550802     44.5706219     50.0000000    390.0000000
    ALB        186      4.1112903      0.3579694      3.2000000      5.0000000
    CALC       185      9.9621622      0.4795556      8.6000000     11.1000000
    URIC       187      4.7705882      1.1572312      2.2000000      9.9000000
    ---------------------------------------------------------------------------

Correlation

We now check the correlation between the response (or dependent) variable, CHOL, and the predictor (or independent) variable, AGE. It is positive and significant (r = .369, p < .0001). Note that there are 188 observations for AGE, but only 187 for CHOL, and that the correlation is based on the 187 observations that have values for both variables.

    title "Pearson Correlation";
    proc corr data=b510.werner;
      var age chol;
    run;

Pearson Correlation
The CORR Procedure

    2 Variables: AGE CHOL

                                Simple Statistics
    Variable      N         Mean      Std Dev       Sum      Minimum      Maximum
    AGE         188     33.81915     10.11269      6358     19.00000     55.00000
    CHOL        187    235.15508     44.57062     43974     50.00000    390.00000

    Pearson Correlation Coefficients
    Prob > |r| under H0: Rho=0
    Number of Observations

                    AGE         CHOL
    AGE         1.00000      0.36923
                              <.0001
                    188          187

    CHOL        0.36923      1.00000
                 <.0001
                    187          187

Scatterplot

We now check a bivariate scatterplot to assess whether the relationship between CHOL and AGE appears to be linear, and to check for outliers. Although the relationship between these two variables is not very tight, it does appear to be linear and increasing.

    title "Scatterplot with Regression Line";
    proc sgplot data=b510.werner;
      reg y=chol x=age;
    run;
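As a supplementary check on linearity, a loess smooth can be overlaid on the fitted line; if the two curves track each other closely, a straight line is a reasonable summary. This is only a sketch and not part of the original handout: the loess overlay and the title text are additions, and it assumes the b510.werner dataset created in the data step above.

```sas
/* Sketch: overlay a loess smooth on the least-squares line.      */
/* The loess overlay and the title are additions to the handout.  */
title "Scatterplot with Regression Line and Loess Smooth";
proc sgplot data=b510.werner;
  reg   y=chol x=age / nomarkers;  /* fitted straight line only      */
  loess y=chol x=age;              /* data points plus loess smooth  */
run;
```

Marked curvature in the loess curve relative to the straight line would suggest considering a transformation or a nonlinear term before relying on the simple linear model.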

Simple Linear Regression

We now fit a linear regression model, with CHOL as the Y (dependent or outcome) variable and AGE as the X (independent or predictor) variable, using Proc Reg. We first illustrate the most basic Proc Reg syntax, and then show some useful options. The Quit statement tells SAS that there are no more statements coming for this run of Proc Reg.

The output shows that there is a positive relationship between these two variables. When age increases by one year, average cholesterol is predicted to increase by 1.63 units, and this is a significant relationship (t(185) = 5.40, p < .0001). Note that the degrees of freedom for the t-test are 185, the same as the error degrees of freedom. The model R-square (.1363) is the square of the correlation between the two variables (.36923). There were 187 observations used in the regression model.

    title "Simple Linear Regression Model with no options";
    proc reg data=b510.werner;
      model chol = age;
    run;
    quit;

Simple Linear Regression Model with no options
The REG Procedure
Model: MODEL1
Dependent Variable: CHOL

    Number of Observations Read                   188
    Number of Observations Used                   187
    Number of Observations with Missing Values      1

                             Analysis of Variance
                                  Sum of          Mean
    Source              DF       Squares        Square    F Value    Pr > F
    Model                1         50373         50373      29.20    <.0001
    Error              185        319123    1724.99020
    Corrected Total    186        369497

    Root MSE            41.53300    R-Square    0.1363
    Dependent Mean     235.15508    Adj R-Sq    0.1317
    Coeff Var           17.66196

                             Parameter Estimates
                         Parameter      Standard
    Variable     DF       Estimate         Error    t Value    Pr > |t|
    Intercept     1      179.96174      10.65564      16.89      <.0001
    AGE           1        1.62897       0.30144       5.40      <.0001

Simple Linear Regression with Diagnostic Plots

We now include some diagnostic plots using Proc Reg. We also generate a new dataset called OUTREG1 that contains all of the original variables, plus the predicted value for each observation (PREDICT), the residual (RESID), the studentized deleted residual (RSTUD), and Cook's Distance (COOKD), along with the 95% limits for an individual prediction (LCL, UCL) and for the mean (LCLM, UCLM) that are printed later in this handout.

    ods graphics on;
    title "Simple Linear Regression with Diagnostic Plots";
    proc reg DATA=B510.werner;
      MODEL CHOL=AGE / stb clb;
      OUTPUT OUT=OUTREG1 P=PREDICT R=RESID RSTUDENT=RSTUD COOKD=COOKD
             LCL=LCL UCL=UCL LCLM=LCLM UCLM=UCLM;
    run;
    quit;
    ods graphics off;

The partial output below shows the standardized estimate (obtained with the STB option), which is the estimated change in Y (in standard deviation units) when X is increased by one standard deviation. This estimate is 0.369. We also see the 95% confidence limits for the parameter estimate (obtained with the CLB option), which are from 1.03 to 2.22.

                                  Parameter Estimates
                         Parameter      Standard                          Standardized
    Variable     DF       Estimate         Error    t Value    Pr > |t|       Estimate
    Intercept     1      179.96174      10.65564      16.89      <.0001              0
    AGE           1        1.62897       0.30144       5.40      <.0001        0.36923

                  Parameter Estimates
    Variable     DF       95% Confidence Limits
    Intercept     1      158.93955      200.98392
    AGE           1        1.03426        2.22368
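The relationships among these output values can be checked by hand. The data step below is a minimal sketch, not part of the original handout: it copies the correlation, slope, standard error, and error degrees of freedom from the output above and recomputes R-square as the squared correlation, the t value as estimate divided by standard error, the F value as the squared t value, and the 95% confidence limits for the slope as estimate plus or minus the t critical value times the standard error. The variable names are illustrative only.

```sas
/* Sketch: recompute summary statistics from values copied from the output above */
data _null_;
  r     = 0.36923;            /* Pearson correlation between AGE and CHOL   */
  b1    = 1.62897;            /* slope estimate for AGE                     */
  se    = 0.30144;            /* standard error of the slope                */
  df    = 185;                /* error degrees of freedom                   */

  rsq   = r**2;               /* R-square in simple linear regression       */
  t     = b1 / se;            /* t value for the slope                      */
  f     = t**2;               /* model F value with one predictor           */
  tcrit = tinv(0.975, df);    /* two-sided 95% critical value               */
  lower = b1 - tcrit*se;      /* lower 95% confidence limit for the slope   */
  upper = b1 + tcrit*se;      /* upper 95% confidence limit for the slope   */

  put rsq= 6.4 t= 6.2 f= 6.2 lower= 8.5 upper= 8.5;  /* written to the log */
run;
```

The values written to the SAS log should agree, up to rounding, with the R-Square, t Value, F Value, and 95% Confidence Limits that Proc Reg reports.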

The diagnostic panel shows a series of diagnostic plots for this regression model.

The residual plot below shows a scatterplot with the residuals on the Y-axis and AGE on the X-axis. We want to look for a lack of pattern in these residuals. We can see that there is one low outlier, at about age 25.
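A similar residual plot can also be drawn directly from the output dataset, which is convenient when the plot needs to be customized. This is a sketch under the assumption that OUTREG1 was created by the Output statement above and contains the variables AGE and RESID; the title and the reference line are additions.

```sas
/* Sketch: residuals versus AGE from the OUTREG1 output dataset */
title "Residuals versus AGE";
proc sgplot data=outreg1;
  scatter x=age y=resid;      /* one point per observation           */
  refline 0 / axis=y;         /* horizontal reference line at zero   */
run;
```

Points scattered evenly around the zero line, with no curve or funnel shape, are consistent with the linearity and constant-variance assumptions.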

The fit plot shown below displays the regression model fit and summarizes some of the statistics for the model.

Check the output dataset

We now check the output dataset, using Proc Print. We also request that Proc Print display the labels for each variable, by using the Label option. We print selected variables for those observations whose studentized deleted residuals are greater than or equal to 3 in absolute value, using a Where statement.

More SAS/Statistics Tutorial at www.littledumbdoctor.com

    title "Partial Listing of Output Dataset";
    proc print data=outreg1;
      where abs(rstud) >= 3;
      VAR ID AGE CHOL PREDICT RESID RSTUD COOKD LCL UCL LCLM UCLM;
    run;

Partial Listing of Output Dataset

    Obs    ID  AGE  CHOL  PREDICT     RESID     RSTUD     COOKD      LCL      UCL     LCLM     UCLM
      4  1797   25    50  220.686  -170.686  -4.32214  0.081802  138.358  303.014  212.698  228.674
    182  3134   50   390  261.410   128.590   3.20326  0.094792  178.695  344.126  250.106  272.714

Check the residuals for normality

We now check the studentized residuals for normality, using Proc Univariate. This is similar to the output from the ODS graphics that was shown in the earlier panel.

    title "Checking Residuals for Normality";
    proc univariate data=outreg1 PLOT NORMAL;
      var rstud;
      histogram / normal;
      qqplot / normal(mu=est sigma=est);
    run;

The residuals appear to be fairly normally distributed, but there is at least one very low outlier, which we identified earlier when we checked the values in the output dataset.

[Figures, both titled "Checking Residuals for Normality": a histogram of Percent by Studentized Residual without Current Obs, and a normal Q-Q plot of Studentized Residual without Current Obs against Normal Quantiles]

Refit the regression model without the cases in question

We now refit the model without the two outliers, by using a Where statement.

    ods graphics on;
    title "Rerun the model without two obs";
    proc reg data=b510.werner;
      where id not in (1797, 3134);
      model chol=age;
    run;
    quit;
    ods graphics off;

We can see the changes in the parameter estimates from the output below. The slope for AGE drops from 1.63 to 1.44 and R-square drops from 0.1363 to 0.1236, but the relationship remains significant (t(183) = 5.08, p < .0001).

Dependent Variable: CHOL

    Number of Observations Read                   186
    Number of Observations Used                   185
    Number of Observations with Missing Values      1

                             Analysis of Variance
                                  Sum of          Mean
    Source              DF       Squares        Square    F Value    Pr > F
    Model                1         38478         38478      25.82    <.0001
    Error              183        272754    1490.46158
    Corrected Total    184        311232

    Root MSE            38.60650    R-Square    0.1236
    Dependent Mean     235.31892    Adj R-Sq    0.1188
    Coeff Var           16.40603

                             Parameter Estimates
                         Parameter      Standard
    Variable     DF       Estimate         Error    t Value    Pr > |t|
    Intercept     1      186.70039       9.98091      18.71      <.0001
    AGE           1        1.43658       0.28274       5.08      <.0001
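Instead of typing the ID values into the Where statement, the cases to drop can be flagged from the output dataset using the same cutoff as in the Proc Print step. This is a sketch of an alternative, not the handout's method: the dataset name FLAGGED and the variable OUTLIER are hypothetical, and it assumes OUTREG1 holds the studentized deleted residual under the name RSTUD, as in the Proc Print step above.

```sas
/* Sketch: flag large studentized deleted residuals, then refit without them. */
/* FLAGGED and OUTLIER are hypothetical names introduced for illustration.    */
data flagged;
  set outreg1;
  if missing(rstud) then outlier = 0;   /* no residual (CHOL missing): not flagged */
  else outlier = (abs(rstud) >= 3);     /* 1 if |RSTUD| >= 3, otherwise 0          */
run;

title "Rerun the model without flagged obs";
proc reg data=flagged;
  where outlier = 0;
  model chol = age;
run;
quit;
```

With the cutoff of 3 used earlier, this should exclude the same two observations (ID 1797 and 3134), and the rule stays attached to the criterion rather than to hard-coded IDs if the data change.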