Transformation and Weighted Least Squares


APM 63 Regression Analysis Project: Transformation and Weighted Least Squares

Yanjun Yan, yayan@syr.edu
Due on 4/4/5 (Thu.). Turned in on 4/4 (Thu.).

1. INTRODUCTION

This project aims at modeling the peak rate of flow Q of water from six watersheds following storm episodes. The storm episodes have been chosen from a large data set to give a range of storm intensities. The independent variables used in this study are the area of the watershed (mi²) X1, the average slope of the watershed (%) X3, the estimated soil storage capacity (inches of water) X6, the rainfall (inches) X8, and the time period during which rainfall exceeded ¼ inch/hour X9.

The scatter plots of the dependent variable Q against each independent variable show certain nonlinear patterns. Therefore a logarithm transform is applied to bring the transformed variables into a more nearly linear relationship. Without the logarithm transform, ordinary least squares (OLS) modeling on the raw data may violate the assumptions of normality or homoscedasticity, as judged from the residual analysis. With the logarithm transform, both OLS and weighted least squares (WLS) models are fitted to the transformed variables so that the two can be compared. Residual analysis plays an important role in comparing the OLS and WLS models. Because the variable of interest is Q itself rather than its logarithm, the model fitted to the log-transformed variables must be transformed back and interpreted further to understand the physical relation between the original dependent variable Q and the independent variables X's.

2. METHODS, RESULTS AND DISCUSSION

(1) Show basic statistical descriptive parameters and show the scatter plots for the relationships between Q and X1, X3, X6, X8 and X9.

Proc CORR gives the simple statistics and the correlations between the variables, shown below. The SAS output lists the mean, sum, standard deviation, minimum and maximum of each variable of interest and the correlations between them.

Proc GPLOT or PLOT can produce the scatter plots, but the plots are large and not very clear when copied into this document. To save space and achieve better image quality, the scatter plots are generated by a MATLAB script, as shown in Figure 1.

After studying Figure 1 and examining the correlation matrix again, Q seems to be highly correlated with X1, X8 and X9, which is consistent with the scatter plots. The scatter plots also show a roughly curvilinear relationship between Q and some of the X's, such as X8 and X9. In addition, the variation of Q differs across the range of certain X's, especially in the scatter plots of Q against X1, X3 and X6. This pattern indicates that some transform should be used to linearize the relations.

Simple Statistics

Variable  N  Mean  Std Dev  Sum  Minimum  Maximum
Q 3 9 97 38737 8. 479
X1 3.43.78337 7.96.3 7.
X3 3 7.5 4.6 5. 3. 5.
X6 3.3.5669 39..5.
X8 3.837.657 85..75 5.5
X9 3 3.6333.57775 94.9.7 6.5

Pearson Correlation Coefficients, N = 30
Prob > |r| under H0: Rho=0
(first line of each pair: correlation; second line: p-value)

          Q         X1        X3        X6        X8        X9
Q      1.00000    .7834     .54       .456      .3337     .858
                  <.0001    .77       .83       .79       .33
X1      .7834    1.00000   -.785      .79       .795      .986
       <.0001               .688      .5378     .368      .7
X3      .54      -.785     1.00000    .466     -.73      -.886
        .77       .688                .449      .79       .4974
X6      .456      .79       .466     1.00000    .758     -.35
        .83       .5378     .449                .79       .9484
X8      .3337     .795     -.73       .758     1.00000    .88745
        .79       .368      .79       .79                 <.0001
X9      .858      .986     -.886     -.35       .88745   1.00000
        .33       .7        .4974     .9484    <.0001
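As a rough illustration of what the Pearson correlation table reports, the sketch below computes a correlation matrix by hand in plain Python. The function names `pearson_r` and `correlation_matrix` and the numbers in `demo` are made up for this example; they are not the watershed data and this is not a substitute for Proc CORR, which also reports p-values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """columns maps a variable name to its list of values.
    Returns a nested dict R with R[a][b] = correlation of a and b."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}

# Made-up illustrative numbers, NOT the watershed data.
demo = {
    "Q":  [50.0, 120.0, 400.0, 900.0, 2100.0],
    "X1": [0.3, 0.9, 2.1, 4.0, 7.5],
    "X8": [1.0, 2.5, 1.5, 4.0, 3.0],
}
R = correlation_matrix(demo)
for name, row in R.items():
    print(name, {k: round(v, 4) for k, v in row.items()})
```

The matrix is symmetric with ones on the diagonal, which is the same structure as the SAS table above.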

[Figure 1: scatter plots of Q vs. X's, one panel each for X1, X3, X6, X8 and X9.]
Figure 1. Scatter plot between the dependent variable Q and several independent variables X's.

(2) Apply Ordinary Least Squares (OLS) to the model [1] as follows and conduct a residual analysis on this model:

[1] $Q = \beta_0 + \beta_1 X_1 + \beta_2 X_3 + \beta_3 X_6 + \beta_4 X_8 + \beta_5 X_9 + e$.

[Figure 2: scatter plots of the model [1] OLS residuals vs. predicted Q and vs. X1, X3, X6, X8 and X9.]
Figure 2. Scatter plot between the residual and Q or X's.

Constructing the model [1] relies on several assumptions, listed below. If these assumptions are violated, the linear regression fit or the associated tests may no longer be valid, and methods have been proposed to remedy each kind of violation.

1. Normality. The residuals are assumed to be normally distributed. Normality is not necessary for estimating the regression parameters, but it is needed for tests of significance and for constructing confidence interval estimates of the parameters. If the normality condition is violated, the parameter estimates are still the best linear unbiased estimates provided the other assumptions are met, but the probability levels associated with the tests of significance and the confidence coefficients will not be correct. Transforming the dependent variable to a form that is more nearly normally distributed is the usual recourse for non-normality.

2. Homogeneity. The assumption of common variance plays a key role in ordinary least squares. All observations in OLS receive the same weight. But if the variance differs between observations, a rational use of the data would give more weight to the observations that contain the most information. The direct impact of heterogeneous variances in OLS is a loss of precision in the estimates, compared with the precision that would have been realized if the heterogeneous variances had been taken into account. Both transformation of the dependent variable and the use of weighted least squares (WLS) can help handle heterogeneous variances.

3. Uncorrelated errors. The assumption of uncorrelated errors makes the mathematics more tractable, but correlation among residuals can arise from many sources. The impact of correlated errors on OLS is a loss of precision in the estimates, similar to that of heterogeneous variances. The remedy is to use a model that takes the correlation structure of the data into account, such as generalized least squares.

The OLS regression on model [1] is fitted by Proc REG with option SPEC.
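For readers without SAS, the OLS fit and its residuals can be sketched in plain Python through the normal equations $(X^T X)\beta = X^T y$. This is only a minimal illustration: the helper names `solve` and `ols` are hypothetical, the data are synthetic, and the sketch produces none of the tests (ANOVA, t tests, SPEC) that Proc REG reports.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def ols(X, y):
    """X: rows of predictor values (no intercept column).
    Returns (beta, residuals); beta[0] is the intercept."""
    Z = [[1.0] + list(row) for row in X]  # prepend intercept column
    p = len(Z[0])
    XtX = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Xty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    resid = [yi - sum(b * v for b, v in zip(beta, z)) for z, yi in zip(Z, y)]
    return beta, resid

# Synthetic data generated exactly from y = 2 + 3*x1 - x2 (not the flow data).
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8], [6, 2]]
y = [2 + 3 * a - b for a, b in X]
beta, resid = ols(X, y)
print([round(b, 6) for b in beta])
```

Plotting `resid` against the fitted values and against each predictor is the residual analysis used throughout this report. Solving the normal equations directly is fine for a small, well-conditioned example like this one, but production software typically uses a more stable factorization.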
The scatter plots of the residuals against Q and the X's are likewise generated by MATLAB, as shown in Figure 2. The OLS results are as follows:

The REG Procedure
Model: MODEL1OLS
Dependent Variable: Q

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5   356444           63888         5.5       <.0001
Error             24   3693             468
Corrected Total   29   456835

Root MSE         645.664     R-Square   .7593
Dependent Mean   9.3333      Adj R-Sq   .79
Coeff Var        49.99998

Parameter Estimates
Variable  DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept -48.7655 447.64478 -.9.378
X1 345.963 44.885 7.7 <.0001
X3 8.956 3.347.74.3
X6 -74.59 9.3965 -.5.35
X8 478.683.77.8.3
X9 -4.5589 7.975 -.4.7

Test of First and Second Moment Specification
DF  Chi-Square  Pr > ChiSq
9.59.4836

Model [1] is now fitted by OLS as follows:

Q = -48.7655 + 345.963 * X1 + 8.956 * X3 + (-74.59) * X6 + 478.683 * X8 + (-4.5589) * X9. (eq. 1)

From the ANOVA itself, we cannot see clearly whether any assumption is violated or whether the model is inadequate, since the model fitting does not check those assumptions. If the data satisfy all the model assumptions, we can be confident in the t tests listed with the parameter estimates. But if the assumptions are violated, those tests will be misleading. This is why residual analysis is important for checking the assumptions.

From the SPEC option, the p-value of the test for non-constant variance is .4836. The null hypothesis is constant variance. If the significance level is set to 0.05, the null hypothesis of constant variance cannot be rejected. Therefore the variance may not differ much across the range of the X's, so a common variance within a certain range of the X's can be used in the later WLS analysis.

Proc UNIVARIATE is then used to analyze the residual statistics and check normality. The skewness of the residuals is -.664666 and the kurtosis is .473, neither of which is small. However, none of the normality tests is significant, and the normal probability plot of the residuals mostly coincides with the ideal normality line except for the lowest-left point. Therefore the residuals are still mostly normal. Considering also that the t test is fairly robust to non-normality, the priority at this stage is to linearize the relation rather than to normalize the residuals.

Moments
N                 30            Sum Weights        30
Mean              0             Sum Observations   0
Std Deviation     587.38649     Variance           344954.94
Skewness          -.664666      Kurtosis           .473
Uncorrected SS    3693.3        Corrected SS       3693.3
Coeff Variation   .             Std Error Mean     7.35

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W      .96879      Pr < W      .493
Kolmogorov-Smirnov    D      .89847      Pr > D      >0.1500
Cramer-von Mises      W-Sq   .4564       Pr > W-Sq   >0.2500
Anderson-Darling      A-Sq   .776        Pr > A-Sq   >0.2500

[Normal probability plot of the model [1] OLS residuals (SAS line-printer plot).]

(3) Take natural logarithm transforms for all variables (Q, X1~X9). Compute the correlation matrix using the transformed variables. Which variables are most likely to contribute significantly to the variation in ln(Q)? Are there highly correlated independent variables?

The option CORR in Proc REG is used to compute the correlation matrix between the log-transformed variables. The correlation matrix is shown below. From the correlation matrix, LnQ is highly correlated with LnX1, with a correlation coefficient as high as .94. LnQ is also strongly correlated with LnX3, with a correlation coefficient of .643. Apart from these two, the correlations between LnQ and the other LnX's are not very strong:
The correlation coefficient between LnQ and LnX6 is .648.
The correlation coefficient between LnQ and LnX8 is .94.
The correlation coefficient between LnQ and LnX9 is .385.
Therefore LnX1 is most likely to contribute significantly to the variation in LnQ. LnX3 is second only to LnX1 and may also contribute substantially to the variation in LnQ.

Among the independent variables themselves, LnX8 and LnX9 are highly correlated, with a correlation coefficient as high as .863. LnX1 and LnX3 are moderately correlated, with a correlation coefficient of .574. The correlations between the other pairs are not large.
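One way to see why the log transform in step (3) is a natural candidate for the curvilinear patterns in Figure 1 is a single-predictor power law with multiplicative error. This is an illustrative assumption, not something the data are shown to satisfy; the symbols $a$, $b$ and $\varepsilon$ are introduced only for this sketch.

```latex
% Illustrative assumption: power-law response with multiplicative error.
Q = a\,X^{b}\,e^{\varepsilon}
\quad\Longleftrightarrow\quad
\ln Q = \ln a + b\,\ln X + \varepsilon .
```

Under this assumption the relation is curved on the original scale but exactly linear in the logs, and an error that scales with the size of Q on the original scale becomes additive on the log scale.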

Correlation

Variable    LNX1      LNX3      LNX6      LNX8      LNX9      LNQ
LNX1       1.0000     .574      .48       .94       .385      .94
LNX3        .574     1.0000     .643     -.974     -.694      .643
LNX6        .48       .643     1.0000     .35      -.         .54
LNX8        .94      -.974      .35      1.0000     .863      .49
LNX9        .385     -.694     -.         .863     1.0000     .55
LNQ         .94       .643      .54       .49       .55      1.0000

(4) Apply OLS to the model [2] and conduct a residual analysis for this model.

[2] $\ln Q = \beta_0 + \beta_1 \ln X_1 + \beta_2 \ln X_3 + \beta_3 \ln X_6 + \beta_4 \ln X_8 + \beta_5 \ln X_9 + e$

Model [2] differs from model [1] in that every variable is log-transformed. The scatter plots of LnQ vs. the LnX's are shown in Figure 3. The relationships between LnQ and the LnX's appear more linear than the relationships between Q and the X's in Figure 1. Following the same residual analysis procedure as in step (2) for model [1], Proc REG with option SPEC generates the following results:

The REG Procedure
Model: model2ols
Dependent Variable: LNQ

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5   7.848            4.5696        34.56     <.0001
Error             24   .773.466
Corrected Total   29   7.3955

Root MSE         .484        R-Square   .9845
Dependent Mean   6.36677     Adj R-Sq   .983
Coeff Var        3.37437

Parameter Estimates
Variable  DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept 5.8759. 6.68 <.0001
LNX1 .688.343 9.7 <.0001
LNX3 .3757.9489 3.95.6
LNX6 -.33576.85-4..4
LNX8 .7499.6768.9 <.0001
LNX9 -.45653.4975-9.73 <.0001

Test of First and Second Moment Specification
DF  Chi-Square  Pr > ChiSq
.63.898

[Figure 3: scatter plots of LnQ vs. LnX's, one panel each for LnX1, LnX3, LnX6, LnX8 and LnX9.]
Figure 3. Scatter plot between LnQ and LnX's.

Model [2] is now fitted by OLS as follows:

LnQ = 5.8759 + .688 * LnX1 + .3757 * LnX3 + (-.33576) * LnX6 + .7499 * LnX8 + (-.45653) * LnX9. (eq. 2)
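The residual analyses in this report lean on the sample skewness and kurtosis of the residuals. The sketch below computes bias-adjusted versions of both in plain Python. The function name `sample_skew_kurt` is hypothetical, the inputs are made-up numbers, and it is an assumption (not verified here) that these formulas match the ones Proc UNIVARIATE uses by default.

```python
import math

def sample_skew_kurt(x):
    """Bias-adjusted sample skewness and excess kurtosis (needs n >= 4)."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))  # sample std dev
    z3 = sum(((v - m) / s) ** 3 for v in x)
    z4 = sum(((v - m) / s) ** 4 for v in x)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return skew, kurt

# Made-up residuals: a symmetric set and a set with one low outlier.
symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
left_tail = [-10.0, 0.0, 1.0, 1.0, 2.0]
print(sample_skew_kurt(symmetric))
print(sample_skew_kurt(left_tail))
```

A single low outlier, like the one noted in the residual plots, is enough to drive the skewness negative and inflate the kurtosis.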

According to the t tests, the LnX's are more significant for LnQ than the X's are for Q. The constant-variance test also supports the constant variance hypothesis for the residuals of LnQ in model [2] more strongly than for the residuals of Q in model [1]. The scatter plots of the residuals of LnQ vs. LnQ and the LnX's are shown in Figure 4. Except for one apparent outlier in the lower part of each panel, the residuals of model [2] show no prominent nonlinear trend and look random, which indicates that the logarithm transformation is effective in removing the nonlinear relation between Q and the X's. However, the variance of the residuals seems to increase with the predicted LnQ, which calls for the use of WLS in a later section.

[Figure 4: scatter plots of the model [2] OLS residuals vs. predicted LnQ and vs. LnX1, LnX3, LnX6, LnX8 and LnX9.]
Figure 4. Scatter plot of the residual of LnQ to LnQ and LnX's.

However, the normality check by Proc UNIVARIATE, shown below, indicates that the magnitudes of both the skewness (now -.876655, from -.664666) and the kurtosis (now 5.6776995, from .473) increase a lot. The normality tests become more significant in rejecting the null hypothesis of normality. The normal probability plot of the residuals from the log-transformed model also swings further away from the normality line. All the evidence indicates that the logarithm transformation worsens the non-normality of the residuals. In other words, a transformation that cures one violation of the assumptions may not improve another.

The UNIVARIATE Procedure
Variable: resid2 (Residual)

Moments
N                 30            Sum Weights        30
Mean              0             Sum Observations   0
Std Deviation     .9544         Variance           .389758
Skewness          -.876655      Kurtosis           5.6776995
Uncorrected SS    .77975        Corrected SS       .77975
Coeff Variation   .             Std Error Mean     .356867

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W      .84647      Pr < W      .5
Kolmogorov-Smirnov    D      .635        Pr > D      .4
Cramer-von Mises      W-Sq   .8367       Pr > W-Sq   .8
Anderson-Darling      A-Sq   .778        Pr > A-Sq   <0.0050

[Normal probability plot of the model [2] OLS residuals (SAS line-printer plot).]

(5) Conduct Weighted Least Squares (WLS) for the model [2]. Suggestions: (5.1) use the levels of X6 (0.5, 1.0, 1.5 and 2.0) to group the data, (5.2) compute the variance (VAR) of the residuals from the model [2] using OLS for each group, (5.3) use 1/VAR as a weight for each group, (5.4) print out the variance and weight for each group, and (5.5) conduct WLS for the model.

Following suggestions (5.1) to (5.4), the data are first grouped by X6 into 4 groups, and the variance of the residuals is calculated within each group. All observations in a group share the same weight 1/VAR. The weights are summarized below, where Obs is the observation index range, G is the group index, var is the variance within that group, and w is the weight for that group. For instance, observations 1 to 6 belong to group 1 with common variance .4646, and the residuals corresponding to these data points are weighted by the common weight .5396.

Obs     G   var      w
1-6     1   .4646    .5396
7-15    2   .868     .4738
16-     3   .68      6.46
-30     4   .393     7.839

The WLS regression results from Proc REG with option SPEC and WEIGHT w are as follows:

The REG Procedure
Model: model2wls
Dependent Variable: LNQ
Weight: w

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5   38.363           66.636        56.8      <.0001
Error             24   6.3667.9836
Corrected Total   29   37.67699

Root MSE         .483        R-Square   .995
Dependent Mean   6.497       Adj R-Sq   .9898
Coeff Var        6.36783

Parameter Estimates
Variable  DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept 5.944.64 36.64 <.0001
LNX1 .68397.7 4. <.0001
LNX3 .3854.7587 5. <.0001
LNX6 -.334.74-4.6.
LNX8 .678.4665.4 <.0001
LNX9 -.43948.36 -.67 <.0001

Test of First and Second Moment Specification
DF  Chi-Square  Pr > ChiSq
4.7.863

Model [2] is now fitted by WLS as follows:

LnQ = 5.944 + .68397 * LnX1 + .3854 * LnX3 + (-.334) * LnX6 + .678 * LnX8 + (-.43948) * LnX9. (eq. 3)

<Discussion on how the OLS and WLS residuals should be compared>

This discussion is based on page -3 of Peter J. Bickel and Kjell A. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics, Vol. 1, 2nd Edition.

In OLS, the variance of the residuals is assumed to be a constant $\sigma^2$ over the whole range of the data, so the objective is to minimize

$$\sum_{i=1}^{n} \Big[ y_i - \Big( \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} \Big) \Big]^2 ,$$

where the regression parameters are what should be derived from the observations. However, suppose the variance of the residuals is not constant but $\mathrm{Var}(e_i) = v_i^2 \sigma^2$. We may consider a transformed set of observations $y_i / v_i$ and $x_{ij} / v_i$, $j = 1, \dots, k$; $i = 1, \dots, n$, obtained by dividing by the coefficient $v_i$ of the standard deviation. If we minimize

$$\sum_{i=1}^{n} \Big[ \frac{y_i}{v_i} - \Big( \frac{\beta_0}{v_i} + \sum_{j=1}^{k} \beta_j \frac{x_{ij}}{v_i} \Big) \Big]^2$$

as in OLS, the corresponding residual is $e_i / v_i$. Notice that $\mathrm{Var}(e_i / v_i) = v_i^2 \sigma^2 / v_i^2 = \sigma^2$ satisfies the assumption of OLS, so WLS is essentially OLS applied to the transformed observations $\{ y_i / v_i \}$ and $\{ x_{ij} / v_i \}$, $j = 1, \dots, k$; $i = 1, \dots, n$.

In practice, the true variance of the residuals may not be available, so it is replaced by the empirical estimate of the residual variance $\mathrm{VAR}_i$, and the weight is $w_i = 1 / \mathrm{VAR}_i$. The index $i$ is nominal in the sense that either each observation has its own weight or a group of observations shares the same weight. In our case, as shown in the weight table above, a common weight is used for the observations in each group.

In summary, the objective of WLS is to minimize

$$\sum_{i=1}^{n} w_i \Big[ y_i - \Big( \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} \Big) \Big]^2 ,$$

and the conceptual procedure is to first weight the observations and then do OLS. The resulting WLS regression model is of the form $\sqrt{w_i}\, \hat{y}_i = \sqrt{w_i}\, \beta_0 + \sqrt{w_i} \sum_{j=1}^{k} \beta_j x_{ij}$, which is equivalent to the OLS formula $\hat{y}_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}$. Therefore the WLS regression parameters should be close to the OLS regression parameters.

Further, in comparing the OLS and WLS residuals: OLS treats $\{ e_i \}$ as the residual in the minimization, $\{ y_i \}$ itself as the dependent variable, and $\{ x_{ij} \}$, $j = 1, \dots, k$, as the independent variables. WLS, in contrast, treats $\{ \sqrt{w_i}\, e_i \}$ as the residual in the minimization, $\{ \sqrt{w_i}\, y_i \}$ as the dependent variable, and $\sqrt{w_i} \big( \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} \big)$ as the regression model.
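The claim that WLS is OLS on observations scaled by $\sqrt{w_i}$ can be checked numerically. The sketch below, in plain Python, fits the same synthetic data twice: once by solving the weighted normal equations, and once by scaling every row (including the intercept column) by $\sqrt{w_i}$ and running unweighted least squares. The helper names `solve` and `fit` and all the numbers are illustrative assumptions, not the project's data or weights.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit(Z, y, w=None):
    """Least squares on design rows Z (intercept column already included).
    With weights w it minimises sum(w_i * e_i**2); without, sum(e_i**2)."""
    w = w or [1.0] * len(y)
    p = len(Z[0])
    A = [[sum(wi * z[i] * z[j] for wi, z in zip(w, Z)) for j in range(p)]
         for i in range(p)]
    b = [sum(wi * z[i] * yi for wi, z, yi in zip(w, Z, y)) for i in range(p)]
    return solve(A, b)

# Synthetic single-predictor data with made-up weights.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.3, 5.7]
w = [4.0, 4.0, 1.0, 1.0, 0.25, 0.25]
Z = [[1.0, xi] for xi in x]

beta_wls = fit(Z, y, w)                       # weighted normal equations
Zs = [[math.sqrt(wi) * v for v in z] for wi, z in zip(w, Z)]
ys = [math.sqrt(wi) * yi for wi, yi in zip(w, y)]
beta_scaled = fit(Zs, ys)                     # plain OLS on scaled rows
print(beta_wls, beta_scaled)
```

Both routes give the same coefficients, which is why the weighted residuals $\sqrt{w_i}\, e_i$, rather than the raw $e_i$, are the quantities that should look homoscedastic after a WLS fit.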
Therefore, in comparing the residuals of OLS and WLS, all the variables in WLS should be weighted for a fair comparison. Please note that in our project the notation W is used as W = 1/VAR, so both sides of the WLS results should be premultiplied by $\sqrt{W}$. For a fair comparison with the ordinary residuals from OLS, the residuals from WLS and the LnQ and LnX's are first weighted by the weights used in the WLS regression and then plotted in the residual plots in Figure 5. The variation of the residuals of the WLS model is almost the same over the whole range of the predicted LnQ. Therefore the WLS method is very effective in stabilizing the variance of the residuals.

[Figure 5: scatter plots of the model [2] WLS residuals vs. predicted LnQ and vs. LnX1, LnX3, LnX6, LnX8 and LnX9.]
Figure 5. Scatter plot of the WLS residuals of LnQ to LnQ and LnX's. All variables are already weighted by the square root of the weight.

The normality check by Proc UNIVARIATE gives the following result:

The UNIVARIATE Procedure
Variable: resid4

Moments
N                 30            Sum Weights        30
Mean              -.34873       Sum Observations   -.469
Std Deviation     .9574899      Variance           .977364
Skewness          -.6343        Kurtosis           .73984
Uncorrected SS    6.36674       Corrected SS       6.34886
Coeff Variation   -73.58        Std Error Mean     .7394737

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W      .9996       Pr < W      .85
Kolmogorov-Smirnov    D      .5377       Pr > D      .7
Cramer-von Mises      W-Sq   .59457      Pr > W-Sq   .79
Anderson-Darling      A-Sq   .93644      Pr > A-Sq   .68

[Normal probability plot of the weighted model [2] WLS residuals (SAS line-printer plot).]

With WLS on model [2], the magnitudes of both the skewness (now -.6343, from -.8767) and the kurtosis (now .73984, from 5.6776995) decrease a lot relative to OLS on model [2]. The normality tests become less significant in rejecting the null hypothesis of normality. Only the normal probability plot of the residuals looks comparable to the OLS result. This evidence indicates that the WLS method improves normality relative to the OLS method. So in this case WLS not only stabilizes the variance but also happens to improve the normality of the residuals.

(6) Compare the OLS model and WLS model for the model [2], including parameter estimates, standard errors of the parameters, and residual plots.

For easier comparison, the OLS result on model [2] (equation 2) and the WLS result on model [2] (equation 3) are copied here:

LnQ = 5.8759 + .688 * LnX1 + .3757 * LnX3 + (-.33576) * LnX6 + .7499 * LnX8 + (-.45653) * LnX9. (eq. 2)
LnQ = 5.944 + .68397 * LnX1 + .3854 * LnX3 + (-.334) * LnX6 + .678 * LnX8 + (-.43948) * LnX9. (eq. 3)

The regression coefficients from the two methods do not differ much, which is as expected from the discussion on the fair comparison of OLS and WLS residuals. Even though the OLS result is unbiased, it is subject to greater sampling variation, which can be seen from the estimated standard errors in Table 1. Moreover, all independent variables become more significant in WLS than they are in OLS.

For a succinct comparison of the scatter plots of the residuals vs. predicted LnQ, Figure 6 is extracted from Figure 4 and Figure 5. The left part of Figure 6 (from Figure 4) is by the OLS method. The right part of Figure 6 (from Figure 5) is by the WLS method, with all the variables weighted by the weights used in the WLS. From Figure 6, it is easy to see that with OLS the variance of the residuals tends to increase with the predicted LnQ. With WLS, the variance of the residuals seems to be constant over the range of the data, which is exactly why the WLS method was applied.

Table 1. Parameter estimates by OLS and WLS
(each entry lists Parameter Estimate, Standard Error, t Value, Pr > |t|)

Variable    OLS                          WLS
Intercept   5.8759. 6.68 <.0001          5.944.64 36.64 <.0001
LNX1        .688.343 9.7 <.0001          .68397.7 4. <.0001
LNX3        .3757.9489 3.95.6            .3854.7587 5. <.0001
LNX6        -.33576.85-4..4              -.334.74-4.6.
LNX8        .7499.6768.9 <.0001          .678.4665.4 <.0001
LNX9        -.45653.4975-9.73 <.0001     -.43948.36 -.67 <.0001

[Figure 6: residual scatter plots of model [2] by OLS and WLS; left panel: OLS residual vs. predicted LnQ by OLS; right panel: WLS residual vs. predicted LnQ by WLS.]
Figure 6. Scatter plot of the residuals vs. predicted LnQ by OLS and WLS.

(7) Re-express the WLS model [2] on the original scale (by taking the antilogarithm of your equation). Does this equation make sense? Would you expect the variables in the model to be important?

From equation (3), the inverse logarithm transform gives the formula for the original Q:

Q = exp[ 5.944 + .68397 * LnX1 + .3854 * LnX3 + (-.334) * LnX6 + .678 * LnX8 + (-.43948) * LnX9 ]

$$Q = \frac{e^{5.944}\, X_1^{.68397}\, X_3^{.3854}\, X_8^{.678}}{X_6^{.334}\, X_9^{.43948}} = \frac{37.3357\, X_1^{.68397}\, X_3^{.3854}\, X_8^{.678}}{X_6^{.334}\, X_9^{.43948}} \quad \text{(eq. 4)}$$

From this formula, Q increases monotonically with X1, X3 and X8, and decreases monotonically with X6 and X9. As to the physical meaning of the variables, Q is the peak rate of flow of water from the watersheds. We would expect it to be large if the area of the watershed (X1) is large, if the average slope of the watershed (X3) is steep, or if the rainfall (X8) is heavy. On the other hand, if the estimated soil storage capacity (X6) is large, the peak rate of flow (Q) is expected to decrease. If the time period during which rainfall exceeded ¼ inch/hour (X9) is long, which means that strong storms do not happen very frequently, Q is also expected to decrease. Based on this analysis, the model is consistent with the physical mechanism, so it is meaningful and the variables used in the model are important for predicting Q.

3. SUMMARY

This project models the peak rate Q of flow of water from six watersheds based on the area of the watershed (X1), the average slope of the watershed (X3), the estimated soil storage capacity (X6), the rainfall (X8) and the time period during which rainfall exceeded ¼ inch/hour (X9). The ordinary least squares method is first used on the raw observations of these variables, but the residual plot shows a nonlinear relationship between the residuals and the predicted Q values. Therefore the logarithm transformation is used to linearize the relation between the dependent variable and the independent variables. OLS is then applied again to the log-transformed variables, but the residual analysis shows that the variance of the residuals is not constant over the range of observations. The weighted least squares (WLS) method is therefore used to stabilize the variance, and in this case it also happens to improve the normality of the residuals.

The OLS and WLS models are compared in detail on the log-transformed variables. Finally the inverse logarithm transform is applied to convert LnQ back to Q, and the physical meaning of the constructed model is discussed. The model shows that Q increases with X1, X3 and X8 but decreases with X6 and X9, which makes sense physically.
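The back-transformation in step (7) turns a log-linear fit into a power law: the intercept becomes a multiplier and each slope becomes an exponent. The sketch below illustrates that identity in plain Python. The function names and the coefficient values are hypothetical and chosen only for illustration; they are not the fitted values of eq. 3.

```python
import math

def power_law_from_loglinear(intercept, slopes):
    """Return (multiplier, exponents) such that
    exp(intercept + sum(b_j * ln x_j)) == multiplier * prod(x_j ** b_j)."""
    return math.exp(intercept), dict(slopes)

def predict_q(multiplier, exponents, x):
    """Evaluate the power-law form at the predictor values in dict x."""
    q = multiplier
    for name, b in exponents.items():
        q *= x[name] ** b
    return q

# Hypothetical coefficients for illustration only (not eq. 3).
b0 = 1.5
slopes = {"X1": 0.7, "X6": -0.3}
x = {"X1": 2.0, "X6": 1.5}

mult, expo = power_law_from_loglinear(b0, slopes)
q_direct = math.exp(b0 + sum(b * math.log(x[k]) for k, b in slopes.items()))
q_power = predict_q(mult, expo, x)
print(q_direct, q_power)
```

In this form a positive exponent means Q rises with that variable and a negative exponent means Q falls, and each exponent can be read as an approximate elasticity: a 1% increase in the variable changes the predicted Q by roughly that many percent, other variables held fixed.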

Appendix: p.sas

**************************************************
* THE SAS IS USED FOR MULTIPLE LINEAR REGRESSION *
* FLOW RATE DATA                                 *
**************************************************;
OPTIONS NOCENTER NODATE LS=97 PS=76 PAGENO=1;

*---INPUT DATA-----------------------------------;
DATA ALL;
  INFILE 'I:\APM63\FLOW.DAT';
  INPUT X1-X9 Q;
  LNQ=LOG(Q);
  LNX1=LOG(X1); LNX2=LOG(X2); LNX3=LOG(X3);
  LNX4=LOG(X4); LNX5=LOG(X5); LNX6=LOG(X6);
  LNX7=LOG(X7); LNX8=LOG(X8); LNX9=LOG(X9);
RUN;

*==== Descriptive parameters ====;
proc corr data=all;
  var Q X1 X3 X6 X8 X9;
proc gplot data=all;
  plot Q*X1;
  symbol v=star h=1;
  title 'Scatter Plot between Q and X1';

*==== OLS on the raw data: model [1] ====;
proc reg data=all;
  model1ols: model Q = X1 X3 X6 X8 X9 /spec;
  output out=out1 p=pred1 r=resid1;
  title 'OLS for model [1]';

*==== Residual Analysis for OLS on the raw data: model [1] ====;
proc gplot data=out1;
  plot resid1*pred1='*' /VREF=0;
  plot resid1*X1='*' /VREF=0;
  plot resid1*X3='*' /VREF=0;
  plot resid1*X6='*' /VREF=0;
  plot resid1*X8='*' /VREF=0;
  plot resid1*X9='*' /VREF=0;
proc univariate data=out1 plot normal;
  var resid1;

*==== OLS on the log data: model [2] ====;
proc reg data=all corr;
  model2ols: model LNQ = LNX1 LNX3 LNX6 LNX8 LNX9 /spec;
  output out=out2 p=pred2 r=resid2;
  title 'OLS for model [2]';

*==== Residual Analysis for OLS on the log data: model [2] ====;
proc gplot data=out2;
  plot resid2*pred2='*' /VREF=0;
  plot resid2*LNX1='*' /VREF=0;
  plot resid2*LNX3='*' /VREF=0;
  plot resid2*LNX6='*' /VREF=0;
  plot resid2*LNX8='*' /VREF=0;
  plot resid2*LNX9='*' /VREF=0;
proc univariate data=out2 plot normal;
  var resid2;

*==== group the data by X6: model [2] ====;
data new;
  set all;
  if X6=0.5 then G=1;
  else if X6=1.0 then G=2;
  else if X6=1.5 then G=3;
  else if X6=2.0 then G=4;
proc sort data=new;
  by G;
proc reg data=new;
  grouping: model LNQ = LNX1 LNX3 LNX6 LNX8 LNX9 /spec;
  ID G;
  output out=out3 p=pred3 r=resid3;
proc sort data=out3;
  by G;

*==== compute the variance of the residuals for each group ====;
proc means n mean var data=out3 noprint;
  var resid3;
  by G;
  output out=variance n=n mean=mean var=var;

*==== compute the weight for each group ====;
data weight;
  merge new variance;
  by G;
  w=1/var;
proc print data=weight;
  var G var w;

*==== WLS on the log data: model [2] ====;
proc reg data=weight corr;
  model2wls: model LNQ = LNX1 LNX3 LNX6 LNX8 LNX9 /spec;
  weight w;
  output out=out3 p=pred3 r=resid3;
  title 'WLS for model [2]';

* weight the predictions, residuals and predictors by sqrt(w) for plotting;
data out4;
  set out3;
  pred4=sqrt(w)*pred3;
  resid4=sqrt(w)*resid3;
  LNX1w=sqrt(w)*LNX1;
  LNX3w=sqrt(w)*LNX3;
  LNX6w=sqrt(w)*LNX6;
  LNX8w=sqrt(w)*LNX8;
  LNX9w=sqrt(w)*LNX9;
proc gplot data=out4;
  plot resid4*pred4='*' /VREF=0;
  plot resid4*LNX1w='*' /VREF=0;
  plot resid4*LNX3w='*' /VREF=0;
  plot resid4*LNX6w='*' /VREF=0;
  plot resid4*LNX8w='*' /VREF=0;
  plot resid4*LNX9w='*' /VREF=0;
proc univariate data=out4 plot normal;
  var resid4;
run;
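The grouping and weighting steps of the SAS program (group by X6 level, take the variance of the OLS residuals within each group, use 1/VAR as the weight) can be sketched in plain Python as follows. The function name `group_weights` and the residual values are made up for illustration. `statistics.variance` is the sample variance with an n - 1 denominator, which is assumed here to correspond to the VAR statistic requested from Proc MEANS.

```python
from collections import defaultdict
from statistics import variance

def group_weights(groups, residuals):
    """Return (variance per group, weight per observation).
    Every observation gets 1 / (sample variance of residuals in its group)."""
    by_group = defaultdict(list)
    for g, r in zip(groups, residuals):
        by_group[g].append(r)
    var_by_group = {g: variance(rs) for g, rs in by_group.items()}
    weights = [1.0 / var_by_group[g] for g in groups]
    return var_by_group, weights

# Made-up group labels and residuals, NOT the project's values.
groups = [1, 1, 1, 2, 2, 2]
resid = [-0.1, 0.0, 0.1, -0.4, 0.0, 0.4]
var_by_group, w = group_weights(groups, resid)
for g in sorted(var_by_group):
    print(g, var_by_group[g], 1.0 / var_by_group[g])
```

Groups whose residuals scatter more receive smaller weights, so they pull less on the WLS fit. Each group needs at least two observations for the sample variance to be defined.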

Table 2. Description of variables in the dataset, among which X1, X3, X6, X8 and X9 are used in developing the model.

Name   Variable
Dependent variable
Q      Peak rate of flow (cfs) of water
Independent variables
X1     Area of watersheds (mi²)
X2     Area impervious to water (mi²)
X3     Average slope of watershed (%)
X4     Longest stream flow in watershed (in thousands of feet)
X5     Surface absorbency index: 0 = complete absorbency, 100 = no absorbency
X6     Estimated soil storage capacity (inches of water)
X7     Infiltration rate of water into soil (inches/hour)
X8     Rainfall (inches)
X9     Time period during which rainfall exceeded ¼ inch/hour