Regression and Simulation


This is an introductory R session, so it may go slowly if you have never used R before. Do not be discouraged. A great way to learn a new language like this is to plunge right in. We will simulate some data suitable for a regression. This will get you started in R and will teach you some useful tricks. I want you to build a script in RStudio (File / New File / R Script) and execute the lines from the script as you proceed.

1. Generate data for simple regression

The first task is to generate some data for fitting a simple regression model. By simple, we mean a single predictor X1 and a response Y. We will use a sample size of N = 100, and generate N pairs (X1i, Yi) from the model

    Yi = β0 + X1i β1 + ei.

In R we will call X1 simply X1, since R does not like subscripts in names. Now for some details: the X1i should come from a Gaussian distribution with mean 20 and standard deviation 3. Take β0 = 25 and β1 = 2 to be the true parameters, and assume that ei comes from a normal distribution with mean 0 and standard deviation 15. Generate your data, make a histogram of X1, and plot the values of Y against those of X1. Your plot should look something like this:

[Figure: scatterplot of Y against X1, with the true regression line]

Useful functions are rnorm, plot, and hist. If you type ?rnorm, for example, you will see a help file. In the plot, include the true regression line; abline is useful here.

Answers

First we will generate some data from a Gaussian distribution and call it X1.

N = 100
?rnorm
X1 = rnorm(N, 20, 3)
hist(X1)

[Figure: histogram of X1]

Now we will make a response Y that depends on X1 plus some error. We will make the error Gaussian as well.

Y = 25 + X1*2 + rnorm(N, 0, 15)
plot(X1, Y)
abline(25, 2, col = "red")

[Figure: scatterplot of Y against X1, with the true regression line in red]

Notice how we included the true regression line in the plot.

2. Fit a linear regression

Let's now fit a linear regression of Y on X1 and look at its properties. For this you will use the function lm in the following manner:

fit = lm(Y ~ X1)

Use the summary() command on your fitted object fit to see what is inside. Try coef as well. We can include the fitted regression line in the plot. The abline function knows about fitted lm models, so you can give fit as a single argument to abline().

[Figure: scatterplot of Y against X1, with the fitted regression line]

Once you get this working, see what happens if you execute your entire script (including the data generation) a second time. You should get a somewhat different result, since different data are generated. You can prevent this, if you like, by using the set.seed command. Type ?set.seed to learn how. Try this and see that your data generation is now reproducible.

Answers

Here is our initial regression fit.

summary(fit)

Call:
lm(formula = Y ~ X1)

Residuals:
    Min      1Q  Median      3Q     Max 
-35.348  -9.830   0.121  11.964  36.474 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  16.1050    10.3013   1.563    0.121    
X1            2.3589     0.5072   4.650 1.04e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.57 on 98 degrees of freedom
Multiple R-squared:  0.1808,    Adjusted R-squared:  0.1724 
F-statistic: 21.63 on 1 and 98 DF,  p-value: 1.036e-05

coef(fit)

(Intercept)          X1 
   16.10495     2.35887 

plot(X1, Y)
abline(25, 2, col = "red")
abline(fit, col = "blue", lwd = 2)

[Figure: scatterplot of Y against X1, with the true (red) and fitted (blue) regression lines]

If we repeat this process, we get something slightly different.

X1 = rnorm(N, 20, 3)
Y = 25 + X1*2 + rnorm(N, 0, 15)
fit = lm(Y ~ X1)
plot(X1, Y)
summary(fit)

Call:
lm(formula = Y ~ X1)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.217 -10.210   0.504  11.952  33.338 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  34.9713    10.0636   3.475 0.000762 ***
X1            1.5293     0.4915   3.112 0.002437 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.05 on 98 degrees of freedom
Multiple R-squared:  0.08992,   Adjusted R-squared:  0.08063 
F-statistic: 9.683 on 1 and 98 DF,  p-value: 0.002437

abline(25, 2, col = "red")
abline(fit, col = "blue", lwd = 2)

[Figure: scatterplot of Y against X1, with the true (red) and fitted (blue) regression lines]

If we want to be able to repeat an analysis, including the random number generation, then we have to set the random number seed. This is easy to do in R.

set.seed(101)  # any number will do
X1 = rnorm(N, 20, 3)
Y = 25 + X1*2 + rnorm(N, 0, 15)
fit = lm(Y ~ X1)
plot(X1, Y)
summary(fit)

Call:
lm(formula = Y ~ X1)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.952  -9.341  -2.383   9.540  33.741 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12.8444    10.8411   1.185    0.239    
X1            2.5795     0.5398   4.778 6.21e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.05 on 98 degrees of freedom
Multiple R-squared:  0.189,     Adjusted R-squared:  0.1807 
F-statistic: 22.83 on 1 and 98 DF,  p-value: 6.206e-06

coef(fit)

(Intercept)          X1 
  12.844385    2.579512 

abline(25, 2, col = "red")
abline(fit, col = "blue", lwd = 2)

[Figure: scatterplot of Y against X1, with the true (red) and fitted (blue) regression lines]

3. Simulation to check aspects of regression

We learn from the theory that the variance of the slope estimate in simple regression is given by the expression

    Var(β̂1) = σ² / Σi (X1i − X̄1)²,

where the sum runs over i = 1, ..., N, X̄1 is the mean of the X1i, and σ is the standard deviation of the ei. (Here the hat on top of β̂1 means it is estimated from our data.) What this theory is really telling us is that if we fix our X1 values and repeatedly generate new ei (and hence Yi) and refit the regression each time, the variance of the series of β̂1's that we get will be given by this expression. Since we typically don't know σ, we need to estimate it as well, using the expression

    σ̂² = (1/(N − 2)) Σi (Yi − Ŷi)².
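As a quick check of these two expressions before simulating, here is a minimal sketch in R, assuming the X1, Y, fit, and N from the set.seed(101) example above and the true σ = 15 we simulated with (the names sigma, sxx, and sigma.hat are ours, introduced just for this sketch):

sigma = 15                                  # true error SD used in the simulation
sxx = sum((X1 - mean(X1))^2)                # sum of squares of X1 about its mean
sqrt(sigma^2 / sxx)                         # theoretical SD of the slope estimate
sigma.hat = sqrt(sum(residuals(fit)^2) / (N - 2))  # plug-in estimate of sigma
sqrt(sigma.hat^2 / sxx)                     # estimated SD of the slope estimate
summary(fit)$coefficients[2, "Std. Error"]  # same quantity as reported by summary(fit)

The last two lines should agree exactly, since summary(fit) computes the slope's standard error from these same formulas.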

The linear model summary tells us the estimated standard error of the slope is 0.54 for these data (at least mine did). Let's check it out! Run a simulation where you generate a new response vector each time, fit the regression, extract the slope coefficient, and store it. Do this 1000 times using a for loop (?'for'), keeping the same values for X1. Summarize the results and in particular compute the standard deviation of the 1000 values you get. How closely does it match the theoretical value?

Answers

We will generate 1000 different response vectors, compute the slope each time, and then calculate the standard deviation of the estimated slopes.

set.seed(101)  # any number will do
beta = double(1000)
for (i in 1:1000) {
  Y = 25 + X1*2 + rnorm(N, 0, 15)
  fit = lm(Y ~ X1)
  beta[i] = coef(fit)[2]
}

Now let's look at these betas (slope estimates).

hist(beta)

[Figure: histogram of the 1000 simulated slope estimates]

mean(beta)
[1] 1.997322

sd(beta)
[1] 0.555623

Pretty close!

4. Slightly different simulation

Here we held the original X1 values fixed. What if we generate new versions of them as well when we do our simulations? Repeat your simulations, adding this little twist. What do you learn about the distribution and standard error of β̂1 in this setting? Any surprises? What would you have expected?

Answers

set.seed(101)  # any number will do
beta2 = double(1000)
for (i in 1:1000) {
  X1 = rnorm(N, 20, 3)
  Y = 25 + X1*2 + rnorm(N, 0, 15)
  fit = lm(Y ~ X1)
  beta2[i] = coef(fit)[2]
}

mean(beta2)
[1] 2.005391

sd(beta2)
[1] 0.5228691

It's smaller! What happened?

5. Multiple Regression

We now generate an X2 and create a multiple regression model

    Yi = β0 + X1i β1 + X2i β2 + ei.

Let X2 have the same distribution as X1 (but be independent of X1), and let β2 = 1 in the simulation. Fit the multiple linear regression model (the formula is now Y ~ X1 + X2) and use summary to summarize the results. What do you observe about the standard errors of each coefficient, and in particular the coefficient of X1? For your current version of X1 and X2, what is their sample correlation (cor())? Is your sample correlation zero? Why or why not?

Answers

We now generate an X2 and create a multiple regression model.

set.seed(109)
X1 = rnorm(N, 20, 3)
X2 = rnorm(N, 20, 3)
Y = 25 + X1*2 + X2*1 + rnorm(N, 0, 15)
fit = lm(Y ~ X1)
fit2 = lm(Y ~ X1 + X2)
summary(fit)

Call:
lm(formula = Y ~ X1)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.496 -10.097  -0.656  10.313  44.703 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  36.1417     8.9195   4.052 0.000102 ***
X1            2.3445     0.4269   5.491 3.14e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.19 on 98 degrees of freedom
Multiple R-squared:  0.2353,    Adjusted R-squared:  0.2275 
F-statistic: 30.16 on 1 and 98 DF,  p-value: 3.141e-07

summary(fit2)

Call:
lm(formula = Y ~ X1 + X2)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.970 -11.170  -1.228  10.061  43.059 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  19.5201    12.7245   1.534   0.1283    
X1            2.3316     0.4221   5.524 2.78e-07 ***
X2            0.8643     0.4770   1.812   0.0731 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.01 on 97 degrees of freedom
Multiple R-squared:  0.2603,    Adjusted R-squared:  0.2451 
F-statistic: 17.07 on 2 and 97 DF,  p-value: 4.447e-07

In this case, X1 and X2 are realizations of two independent random variables, each Gaussian, so the presence of one does not really affect the other in the regression. Here their sample correlation is

cor(X1, X2)
[1] 0.01693538
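It is not exactly zero, and it should not be: independence is a property of the random variables, not of any finite sample. As a rough sketch of how much incidental correlation to expect (reusing N = 100 from above; the name cors is ours), the sample correlation of two independent Gaussian samples fluctuates around 0 with standard deviation near 1/sqrt(N) = 0.1:

cors = replicate(1000, cor(rnorm(N, 20, 3), rnorm(N, 20, 3)))
mean(cors)   # close to 0
sd(cors)     # close to 1/sqrt(N) = 0.1, so a value like 0.017 is unremarkable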

cor(,x2) [1] 0.01693538 Suppose we wanted the distribution of X 2 to be correlated with X 1. How might we do that? Figure out a way to do this, and experiment until your X 2 has sample correlation about 0.6 with X 1. Now generate a Y from your regression model, and use summary to examine the coefficients. What do you observe? Now try to produce X 2 with sample correlation about 0.8 with X 1, and repeat the regression experiment above. What do you observe? Answers How do we generate an X 2 that has correlation with X 1? One simple way is to generate a Z independent of X 1, and then make X 2 = αx 1 + Z for some value of α (below we use α = 0.7). Z=X2 X2=0.7*+Z plot(,x2) X2 25 30 35 40 45 15 20 25 30 cor(,x2) [1] 0.6285433 Now when we fit the linear model, the variance of the coefficients increases in the multiple linear regression versus the simple linear regression, because of the correlation. Y= 25 + *2 + X2*1+ rnorm(n,0,15) fit2=lm(y~+x2) summary(fit) 11

Call:
lm(formula = Y ~ X1)

Residuals:
    Min      1Q  Median      3Q     Max 
-51.343  -8.826  -0.108   9.669  36.166 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  43.1348     8.8712   4.862 4.42e-06 ***
X1            2.7012     0.4246   6.361 6.40e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.1 on 98 degrees of freedom
Multiple R-squared:  0.2923,    Adjusted R-squared:  0.285 
F-statistic: 40.47 on 1 and 98 DF,  p-value: 6.404e-09

summary(fit2)

Call:
lm(formula = Y ~ X1 + X2)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.512  -8.147   0.254   8.648  34.816 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  22.1669    12.5245   1.770 0.079888 .  
X1            1.9217     0.5341   3.598 0.000507 ***
X2            1.0903     0.4695   2.322 0.022322 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.78 on 97 degrees of freedom
Multiple R-squared:  0.3295,    Adjusted R-squared:  0.3157 
F-statistic: 23.84 on 2 and 97 DF,  p-value: 3.799e-09
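For the final prompt, a sample correlation of about 0.8, one alternative to trial and error: when X1 and Z are independent with equal variances, cor(X1, αX1 + Z) = α / sqrt(α² + 1), so a target correlation ρ suggests α = ρ / sqrt(1 − ρ²), which is about 1.33 for ρ = 0.8. A minimal sketch along these lines (rho and alpha are our names; fresh draws of X1 and Z as above):

rho = 0.8
alpha = rho / sqrt(1 - rho^2)   # about 1.333 when X1 and Z have equal variances
X1 = rnorm(N, 20, 3)
Z = rnorm(N, 20, 3)             # independent of X1, same spread
X2 = alpha*X1 + Z
cor(X1, X2)                     # should land near 0.8, up to sampling variation
Y = 25 + X1*2 + X2*1 + rnorm(N, 0, 15)
summary(lm(Y ~ X1 + X2))

With this stronger correlation, expect the standard error of the X1 coefficient to inflate further than it did at correlation 0.6, for the same collinearity reason noted above.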