Regression and Simulation

This is an introductory R session, so it may go slowly if you have never used R before. Do not be discouraged. A great way to learn a new language like this is to plunge right in. We will simulate some data suitable for a regression. This will get you started in R and will teach you some useful tricks. I want you to build a script in RStudio (File/New File/R Script) and execute the lines from the script as you proceed.

1. Generate data for simple regression

The first task is to generate some data for fitting a simple regression model. By simple, we mean a single predictor X1 and a response Y. We will use a sample size of N=100, and generate N pairs (X1i, Yi) from the model

Yi = β0 + X1i β1 + ei.

In R we will call it simply X1, since R does not like subscripts in names. Now for some details: the X1i should come from a Gaussian distribution with mean 20 and standard deviation 3. Take β0 = 25 and β1 = 2 to be the true parameters, and assume that ei comes from a normal distribution with mean 0 and standard deviation 15. Generate your data, make a histogram of X1, and plot the values of Y against those of X1. Your plot should look something like this:

[Figure: scatterplot of Y against X1.]

Useful functions are rnorm, plot, and hist. If you type ?rnorm, for example, you will see a help file. In the plot, include the true regression line; abline is useful here.
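If rnorm is new to you, a quick sanity check (a side exercise, not part of the assignment) is to draw a large sample and confirm that its sample mean and standard deviation are close to the arguments we passed in:

```r
# Illustrative check of rnorm(n, mean, sd); not part of the exercise itself
set.seed(1)               # for reproducibility of this check
x = rnorm(1e5, 20, 3)     # 100,000 draws from N(20, 3^2)
mean(x)                   # should be close to 20
sd(x)                     # should be close to 3
```

With 100,000 draws, both summaries should agree with 20 and 3 to about two decimal places.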
Answers

First we will generate some data from a Gaussian distribution and call it X1.

N=100
X1=rnorm(N,20,3)
hist(X1)

[Figure: histogram of X1.]

Now we will make a response Y that depends on X1 plus some error. We will make the error Gaussian as well.

Y = 25 + X1*2 + rnorm(N,0,15)
plot(X1,Y)
abline(25,2,col="red")
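As a quick side check (not required by the exercise), the model implies that the marginal standard deviation of Y is sqrt(β1²·3² + 15²) = sqrt(261) ≈ 16.2, which we can compare against the sample value from a fresh simulation:

```r
# Side check: under the model, Var(Y) = beta1^2 * Var(X1) + Var(e)
#           = 2^2 * 3^2 + 15^2 = 261
set.seed(1)                       # illustrative seed, any will do
N  = 100
X1 = rnorm(N, 20, 3)
Y  = 25 + X1*2 + rnorm(N, 0, 15)
sqrt(2^2 * 3^2 + 15^2)            # theoretical sd(Y), about 16.16
sd(Y)                             # sample sd, should be in the same ballpark
```

With only N=100 observations the sample value will bounce around the theoretical one by a unit or two.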
[Figure: scatterplot of Y against X1, with the true regression line in red.]

Notice how we included the true regression line in the plot.

2. Fit a linear regression

Let's now fit a linear regression of Y on X1 and look at its properties. For this you will use the function lm in the following manner:

fit = lm(Y ~ X1)

Use the summary() command on your fitted object fit to see what is inside. Try coef as well. We can include the fitted regression line in the plot. The abline function knows about fitted lm models, so you can give fit as a single argument to abline().
[Figure: scatterplot of Y against X1, with the true (red) and fitted (blue) regression lines.]

Once you get this working, see what happens if you execute your entire script (including the data generation) a second time. You should get a somewhat different result, since different data are generated. You can prevent this, if you like, by using the set.seed command. Type ?set.seed to learn how. Try this and see that your data generation is now reproducible.

Answers

Here is our initial regression fit.

fit = lm(Y ~ X1)
summary(fit)

Call:
lm(formula = Y ~ X1)

Residuals:
     Min       1Q   Median       3Q      Max
 -35.348   -9.830    0.121   11.964   36.474

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16.1050    10.3013   1.563    0.121
X1            2.3589     0.5072   4.650 1.04e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.57 on 98 degrees of freedom
Multiple R-squared: 0.1808, Adjusted R-squared: 0.1724
F-statistic: 21.63 on 1 and 98 DF, p-value: 1.036e-05
coef(fit)

(Intercept)          X1
   16.10495     2.35887

plot(X1,Y)
abline(25,2,col="red")
abline(fit,col="blue",lwd=2)

[Figure: scatterplot with the true (red) and fitted (blue) regression lines.]

If we repeat this process, we get something slightly different.

X1=rnorm(N,20,3)
Y = 25 + X1*2 + rnorm(N,0,15)
plot(X1,Y)
fit = lm(Y ~ X1)
summary(fit)

Call:
lm(formula = Y ~ X1)

Residuals:
     Min       1Q   Median       3Q      Max
 -32.217  -10.210    0.504   11.952   33.338

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  34.9713    10.0636   3.475 0.000762 ***
X1            1.5293     0.4915   3.112 0.002437 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.05 on 98 degrees of freedom
Multiple R-squared: 0.08992, Adjusted R-squared: 0.08063
F-statistic: 9.683 on 1 and 98 DF, p-value: 0.002437

abline(25,2,col="red")
abline(fit,col="blue",lwd=2)

[Figure: scatterplot with the true (red) and fitted (blue) regression lines.]

If we want to be able to repeat an analysis, including the random number generation, then we have to set the random number seed. This is easy to do in R.

set.seed(101) # any number will do
X1=rnorm(N,20,3)
Y = 25 + X1*2 + rnorm(N,0,15)
plot(X1,Y)
fit = lm(Y ~ X1)
summary(fit)

Call:
lm(formula = Y ~ X1)

Residuals:
     Min       1Q   Median       3Q      Max
 -39.952   -9.341   -2.383    9.540   33.741

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  12.8444    10.8411   1.185    0.239
X1            2.5795     0.5398   4.778 6.21e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.05 on 98 degrees of freedom
Multiple R-squared: 0.189, Adjusted R-squared: 0.1807
F-statistic: 22.83 on 1 and 98 DF, p-value: 6.206e-06

coef(fit)

(Intercept)          X1
  12.844385    2.579512

abline(25,2,col="red")
abline(fit,col="blue",lwd=2)

[Figure: scatterplot with the true (red) and fitted (blue) regression lines.]

3. Simulation to check aspects of regression

We learn from the theory that the variance of the slope estimate in simple regression is given by the expression

Var(β̂1) = σ² / Σ_{i=1}^{N} (X1i − X̄1)²,

where X̄1 is the mean of the X1i, and σ is the standard deviation of the ei. (Here the hat on top of β̂1 means it is estimated from our data.) What this theory is really telling us is that if we fix our X1 values and repeatedly generate new ei (and hence Yi) and refit the regression each time, the variance of the series of β̂1's that we get will be given by this expression. Since we typically don't know σ, we need to estimate it as well, using the expression

σ̂² = (1/(N−2)) Σ_{i=1}^{N} (Yi − Ŷi)².
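The variance formula above is easy to evaluate directly in R. Here is a minimal sketch, assuming the same settings used in the answers (seed 101, N = 100, and the true σ = 15 rather than an estimate):

```r
# Theoretical sd of the slope estimate for a fixed design X1:
# sqrt( sigma^2 / sum((X1 - mean(X1))^2) )
set.seed(101)               # matches the seed used in the answers
N     = 100
X1    = rnorm(N, 20, 3)
sigma = 15                  # true error sd, known here because we simulated
se_beta1 = sqrt(sigma^2 / sum((X1 - mean(X1))^2))
se_beta1                    # should be close to the 0.54 reported by summary
```

The simulation below should produce a standard deviation of the slope estimates close to this value.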
The linear model summary tells us the estimated standard error of the slope is 0.54 for these data (at least mine did). Let's check it out! Run a simulation where you generate a new response vector each time, fit the regression, extract the slope coefficient, and store it. Do this 1000 times using a for loop (?'for'), keeping the same values for X1. Summarize the results and in particular compute the standard deviation of the 1000 values you get. How closely does it match the theoretical value?

Answers

We will generate 1000 different response vectors, compute the slope each time, and then calculate the standard deviation of the estimated slopes.

set.seed(101) # any number will do
beta=double(1000)
for(i in 1:1000) {
  Y = 25 + X1*2 + rnorm(N,0,15)
  fit = lm(Y ~ X1)
  beta[i]=coef(fit)[2]
}

Now let's look at these betas (slope estimates).

hist(beta)

[Figure: histogram of beta.]

mean(beta)

[1] 1.997322
sd(beta)

[1] 0.555623

Pretty close!

4. Slightly different simulation

Here we held the original X1 values fixed. What if we generate new versions of them as well when we do our simulations? Repeat your simulations, adding this little twist. What do you learn about the distribution and standard error of β̂1 in this setting? Any surprises? What would you have expected?

Answers

set.seed(101) # any number will do
beta2=double(1000)
for(i in 1:1000){
  X1=rnorm(N,20,3)
  Y = 25 + X1*2 + rnorm(N,0,15)
  fit = lm(Y ~ X1)
  beta2[i]=coef(fit)[2]
}
mean(beta2)

[1] 2.005391

sd(beta2)

[1] 0.5228691

It's smaller! What happened?

5. Multiple Regression

We now generate an X2 and create a multiple regression model

Yi = β0 + X1i β1 + X2i β2 + ei.

Let X2 have the same distribution as X1 (but be independent of X1), and let β2 = 1 in the simulation. Fit the multiple linear regression model (the formula is now Y~X1+X2) and use summary to summarize the results. What do you observe about the standard errors of each coefficient, and in particular the coefficient of X1? For your current version of X1 and X2, what is their sample correlation? (cor(X1,X2)). Is your sample correlation zero? Why or why not?

Answers

We now generate an X2 and create a multiple regression model.
set.seed(109)
X1=rnorm(N,20,3)
X2=rnorm(N,20,3)
Y = 25 + X1*2 + X2*1 + rnorm(N,0,15)
fit = lm(Y ~ X1)
fit2 = lm(Y ~ X1 + X2)
summary(fit)

Call:
lm(formula = Y ~ X1)

Residuals:
     Min       1Q   Median       3Q      Max
 -37.496  -10.097   -0.656   10.313   44.703

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  36.1417     8.9195   4.052 0.000102 ***
X1            2.3445     0.4269   5.491 3.14e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.19 on 98 degrees of freedom
Multiple R-squared: 0.2353, Adjusted R-squared: 0.2275
F-statistic: 30.16 on 1 and 98 DF, p-value: 3.141e-07

summary(fit2)

Call:
lm(formula = Y ~ X1 + X2)

Residuals:
     Min       1Q   Median       3Q      Max
 -37.970  -11.170   -1.228   10.061   43.059

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  19.5201    12.7245   1.534   0.1283
X1            2.3316     0.4221   5.524 2.78e-07 ***
X2            0.8643     0.4770   1.812   0.0731 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.01 on 97 degrees of freedom
Multiple R-squared: 0.2603, Adjusted R-squared: 0.2451
F-statistic: 17.07 on 2 and 97 DF, p-value: 4.447e-07

In this case, X1 and X2 are realizations of two independent random variables, each Gaussian, so the presence of one does not really affect the other in the regression. Here their sample correlation is
cor(X1,X2)

[1] 0.01693538

Suppose we wanted the distribution of X2 to be correlated with X1. How might we do that? Figure out a way to do this, and experiment until your X2 has sample correlation of about 0.6 with X1. Now generate a Y from your regression model, and use summary to examine the coefficients. What do you observe? Now try to produce an X2 with sample correlation of about 0.8 with X1, and repeat the regression experiment above. What do you observe?

Answers

How do we generate an X2 that is correlated with X1? One simple way is to generate a Z independent of X1, and then make X2 = αX1 + Z for some value of α (below we use α = 0.7).

Z=X2
X2=0.7*X1+Z
plot(X1,X2)

[Figure: scatterplot of X2 against X1.]

cor(X1,X2)

[1] 0.6285433

Now when we fit the linear model, the variance of the coefficients increases in the multiple linear regression versus the simple linear regression, because of the correlation.

Y = 25 + X1*2 + X2*1 + rnorm(N,0,15)
fit = lm(Y ~ X1)
fit2 = lm(Y ~ X1 + X2)
summary(fit)
Call:
lm(formula = Y ~ X1)

Residuals:
     Min       1Q   Median       3Q      Max
 -51.343   -8.826   -0.108    9.669   36.166

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  43.1348     8.8712   4.862 4.42e-06 ***
X1            2.7012     0.4246   6.361 6.40e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.1 on 98 degrees of freedom
Multiple R-squared: 0.2923, Adjusted R-squared: 0.285
F-statistic: 40.47 on 1 and 98 DF, p-value: 6.404e-09

summary(fit2)

Call:
lm(formula = Y ~ X1 + X2)

Residuals:
     Min       1Q   Median       3Q      Max
 -49.512   -8.147    0.254    8.648   34.816

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  22.1669    12.5245   1.770 0.079888 .
X1            1.9217     0.5341   3.598 0.000507 ***
X2            1.0903     0.4695   2.322 0.022322 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.78 on 97 degrees of freedom
Multiple R-squared: 0.3295, Adjusted R-squared: 0.3157
F-statistic: 23.84 on 2 and 97 DF, p-value: 3.799e-09
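The increase in the standard error of β̂1 seen above has a standard summary: with two predictors whose sample correlation is r, the variance of each slope estimate is multiplied by the variance inflation factor 1/(1 − r²). Here is an illustrative sketch using freshly simulated data built the same way (the seed and the exact numbers are not meant to reproduce the output above):

```r
# Variance inflation factor for two correlated predictors: VIF = 1/(1 - r^2)
set.seed(109)              # illustrative seed
N  = 100
X1 = rnorm(N, 20, 3)
Z  = rnorm(N, 20, 3)
X2 = 0.7*X1 + Z            # correlated with X1, as in the construction above
r  = cor(X1, X2)
vif = 1/(1 - r^2)
sqrt(vif)   # multiplicative increase in the standard error of beta1-hat
```

With r around 0.6 this gives roughly a 25-30% increase in the standard error, which matches the change from about 0.42 in the simple regression to about 0.53 in the multiple regression.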