Lecture Notes in Financial Econometrics (MBF, MSc course at UNISG)


Lecture Notes in Financial Econometrics (MBF, MSc course at UNISG)

Paul Söderlind
24 June 2005

University of St. Gallen and CEPR. Address: s/bf-hsg, Rosenbergstrasse 52, CH-9000 St. Gallen, Switzerland. E-mail: Paul.Soderlind@unisg.ch. Document name: FinEcmtAll.TeX

Contents

1 Review of Statistics
  1.1 Random Variables and Distributions
  1.2 Hypothesis Testing
  1.3 Normal Distribution of the Sample Mean as an Approximation

2 Least Squares and Maximum Likelihood Estimation
  2.1 Least Squares
  2.2 Maximum Likelihood
  A Some Matrix Algebra

3 Testing CAPM
  3.1 Market Model
  3.2 Several Factors
  3.3 Fama-MacBeth

4 Event Studies
  4.1 Basic Structure of Event Studies
  4.2 Models of Normal Returns
  4.3 Testing the Abnormal Return
  4.4 Quantitative Events
  A Derivation of (4.8)

5 Time Series Analysis
  5.1 Descriptive Statistics
  5.2 White Noise

  5.3 Autoregression (AR)
  5.4 Moving Average (MA)
  5.5 ARMA(p,q)
  5.6 VAR(p)
  5.7 Non-stationary Processes

6 Predicting Asset Returns
  6.1 Asset Prices, Random Walks, and the Efficient Market Hypothesis
  6.2 Autocorrelations
  6.3 Other Predictors and Methods
  6.4 Security Analysts
  6.5 Technical Analysis
  6.6 Empirical U.S. Evidence on Stock Return Predictability

7 ARCH and GARCH
  7.1 Heteroskedasticity
  7.2 ARCH Models
  7.3 GARCH Models
  7.4 Non-Linear Extensions
  7.5 (G)ARCH-M
  7.6 Multivariate (G)ARCH

8 Option Pricing and Estimation of Continuous Time Processes
  8.1 The Black-Scholes Model
  8.2 Estimation of the Volatility of a Random Walk Process

9 Kernel Density Estimation and Regression
  9.1 Non-Parametric Regression

1 Review of Statistics

Reference: Bodie, Kane, and Marcus (2002) (statistical review in appendix); or any textbook in statistics. More advanced material is denoted by a star (*). It is not required reading.

1.1 Random Variables and Distributions

1.1.1 Distributions

A univariate distribution of a random variable $x$ describes the probability of different values. If $f(x)$ is the probability density function, then the probability that $x$ is between $A$ and $B$ is calculated as the area under the density function from $A$ to $B$

$\Pr(A \le x < B) = \int_A^B f(x)\,dx.$  (1.1)

See Figure 1.1 for an example. The distribution can often be described in terms of the mean and the variance. For instance, a normal (Gaussian) distribution is fully described by these two numbers. See Figure 1.4 for an illustration.

Remark 1 If $x \sim N(\mu, \sigma^2)$, then the probability density function is

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}.$

This is a bell-shaped curve centered on $\mu$, where $\sigma$ determines the width of the curve.

A bivariate distribution of the random variables $x$ and $y$ contains the same information as the two respective univariate distributions, but also information on how $x$ and $y$ are related. Let $h(x, y)$ be the joint density function; then the probability that $x$ is between $A$ and $B$ and $y$ is between $C$ and $D$ is calculated as the volume under the surface of the

density function

$\Pr(A \le x < B \text{ and } C \le y < D) = \int_A^B \int_C^D h(x, y)\,dy\,dx.$  (1.2)

A joint normal distribution is completely described by the means and the covariance matrix

$\begin{bmatrix} x \\ y \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_x \\ \mu_y \end{bmatrix}, \begin{bmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{bmatrix} \right),$  (1.3)

where $\sigma_x^2$ and $\sigma_y^2$ denote the variances of $x$ and $y$ respectively, and $\sigma_{xy}$ denotes their covariance. Clearly, if the covariance $\sigma_{xy}$ is zero, then the variables are unrelated to each other. Otherwise, information about $x$ can help us to make a better guess of $y$. See Figure 1.1 for an example. The correlation of $x$ and $y$ is defined as

$\text{Corr}(x, y) = \frac{\sigma_{xy}}{\sigma_x \sigma_y}.$  (1.4)

If two random variables happen to be independent of each other, then the joint density function is just the product of the two univariate densities (here denoted $f(x)$ and $k(y)$)

$h(x, y) = f(x)\,k(y)$ if $x$ and $y$ are independent.  (1.5)

This is useful in many cases, for instance, when we construct likelihood functions for maximum likelihood estimation.

1.1.2 Conditional Distributions

If $h(x, y)$ is the joint density function and $f(x)$ the (marginal) density function of $x$, then the conditional density function is

$g(y|x) = h(x, y) / f(x).$  (1.6)

For the bivariate normal distribution (1.3) we have the distribution of $y$ conditional on a given value of $x$ as

$y|x \sim N\left( \mu_y + \frac{\sigma_{xy}}{\sigma_x^2}(x - \mu_x),\; \sigma_y^2 - \frac{\sigma_{xy}\sigma_{xy}}{\sigma_x^2} \right).$  (1.7)

Figure 1.1: Density functions of univariate and bivariate normal distributions. (Panels: pdf of N(0,1); pdfs of bivariate normals with different correlations.)

Notice that the conditional mean can be interpreted as the best guess of $y$ given that we know $x$. Similarly, the conditional variance can be interpreted as the variance of the forecast error (using the conditional mean as the forecast). The conditional and marginal distributions coincide if $y$ is uncorrelated with $x$. (This follows directly from combining (1.5) and (1.6).) Otherwise, the mean of the conditional distribution depends on $x$, and the variance is smaller than in the marginal distribution (we have more information). See Figure 1.2 for an illustration.

1.1.3 Mean and Standard Deviation

The mean and variance of a series are estimated as

$\bar{x} = \sum_{t=1}^{T} x_t / T$ and $\hat{\sigma}^2 = \sum_{t=1}^{T} (x_t - \bar{x})^2 / T.$  (1.8)

Figure 1.2: Density functions of normal distributions. (Panels: pdfs of bivariate normals with correlations 0.1 and 0.8, and the corresponding conditional pdfs of y for given values of x.)

The standard deviation (here denoted $\text{Std}(x_t)$), the square root of the variance, is the most common measure of volatility. (Sometimes we use $T-1$ in the denominator of the sample variance instead of $T$.)

A sample mean is normally distributed if $x_t$ is normally distributed, $x_t \sim N(\mu, \sigma^2)$. The basic reason is that a linear combination of normally distributed variables is also normally distributed. However, a sample average is typically approximately normally distributed even if the variable is not (discussed below).

Remark 2 If $x \sim N(\mu_x, \sigma_x^2)$ and $y \sim N(\mu_y, \sigma_y^2)$, then

$ax + by \sim N\left(a\mu_x + b\mu_y,\; a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\,\text{Cov}(x, y)\right).$

If $x_t$ is iid (independently and identically distributed), then it is straightforward to find the variance of the sample average. Then we have that

$\text{Var}(\bar{x}) = \text{Var}\left( \sum_{t=1}^{T} x_t / T \right) = \sum_{t=1}^{T} \text{Var}(x_t / T) = T\,\text{Var}(x_t) / T^2 = \sigma^2 / T.$  (1.9)

The first equality is just a definition and the second equality follows from the assumption that $x_t$ and $x_s$ are independently distributed. This means, for instance, that $\text{Var}(x_2 + x_3) = \text{Var}(x_2) + \text{Var}(x_3)$ since the covariance is zero. The third equality follows from the assumption that $x_t$ and $x_s$ are identically distributed (so their variances are the same). The fourth equality is a trivial simplification.

A sample average is (typically) unbiased, that is, the expected value of the sample average equals the population mean. To illustrate that, consider the expected value of the sample average of the iid $x_t$

$\text{E}\,\bar{x} = \text{E} \sum_{t=1}^{T} x_t / T = \sum_{t=1}^{T} \text{E}\,x_t / T = \text{E}\,x_t.$  (1.10)

The first equality is just a definition; the second equality is always true (the expectation of a sum is the sum of expectations); and the third equality follows from the assumption of identical distributions, which implies identical expectations.

1.1.4 Covariance and Correlation

The covariance of two variables (here $x$ and $y$) is typically estimated as

$\widehat{\text{Cov}}(x_t, y_t) = \sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y}) / T.$  (1.11)

(Sometimes we use $T-1$ in the denominator of the sample covariance instead of $T$.) The correlation of two variables is then estimated as

$\widehat{\text{Corr}}(x_t, y_t) = \frac{\widehat{\text{Cov}}(x_t, y_t)}{\widehat{\text{Std}}(x_t)\,\widehat{\text{Std}}(y_t)},$  (1.12)

where $\widehat{\text{Std}}(x_t)$ is an estimated standard deviation. A correlation must be between $-1$ and $1$ (try to show it). Note that covariance and correlation measure the degree of linear relation only. This is illustrated in Figure 1.3.
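These moment estimators are easy to try out numerically. A minimal sketch (Python with numpy; the notes do not prescribe any software, and the sample sizes and seed are arbitrary choices) checks (1.9) by simulation and computes (1.11) and (1.12) for one artificial sample:

```python
import numpy as np

rng = np.random.default_rng(42)
T, mu, sigma = 100, 0.5, 2.0

# Simulate many samples and check that Var(x-bar) is close to sigma^2/T, as in (1.9)
xbars = rng.normal(mu, sigma, size=(10_000, T)).mean(axis=1)
print(np.var(xbars), sigma**2 / T)          # both close to 0.04

# Sample moments as in (1.8), (1.11) and (1.12) for one draw of (x, y)
x = rng.normal(size=T)
y = 0.5 * x + rng.normal(size=T)
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()   # (1.11), with 1/T
corr_xy = cov_xy / (x.std() * y.std())              # (1.12)
print(cov_xy, corr_xy)
```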

Figure 1.3: Example of correlations on an artificial sample. Both subfigures use the same sample of y. (Subfigure a: y = x + 0.2*N(0,1); subfigure b: z = y^2.)

1.2 Hypothesis Testing

The basic approach in testing a null hypothesis is to compare the test statistic (the sample average, say) with how its distribution would look if the null hypothesis were true. If the test statistic would be very unusual, then the null hypothesis is rejected: we are not willing to believe in a null hypothesis that looks very different from what we see in data.

For instance, suppose the null hypothesis (denoted $H_0$) is that the true value of some parameter is $\beta_0$. Suppose also that we know that the distribution of the parameter estimator, $\hat{\beta}$, is normal (discussed in some detail later on) with a known variance of $s^2$. For instance, $\hat{\beta}$ could be a sample mean. Construct the test statistic by standardizing the sample mean

$t = \frac{\hat{\beta} - \beta_0}{s} \sim N(0, 1)$ if $H_0$ is true.  (1.13)

Notice that $t$ has a standard normal distribution if the null hypothesis is true. The test is to see if the test statistic would be very unusual, but then we need to define what "unusual" means (see below).

Remark 3 The logic of using the standardized t statistic in (1.13) is easily seen by an example. Suppose a random variable (here denoted $x$, but think of any test statistic) is distributed as $N(0.5, 2)$; then the following probabilities are all 5%

$\Pr(x \le -1.83) = \Pr(x - 0.5 \le -2.33) = \Pr\left( \frac{x - 0.5}{\sqrt{2}} \le -1.65 \right).$

Notice that $(x - 0.5)/\sqrt{2} \sim N(0, 1)$. See Figure 1.4 for an illustration.

1.2.1 Two-Sided Test

As an example of a two-sided test, we could have the null hypothesis and the alternative hypothesis

$H_0: \beta = 4$
$H_1: \beta \ne 4.$  (1.14)

To test this, we follow these steps:

1. Construct the distribution under $H_0$: from (1.13) it is such that $t = (\hat{\beta} - 4)/s \sim N(0, 1)$.

2. Would the test statistic ($t$) be very unusual under the $H_0$ distribution ($N(0, 1)$)? Since the alternative hypothesis is $\beta \ne 4$, a value of $t$ far from zero ($\hat{\beta}$ far from 4) must be considered unusual.

3. Put a value on what you mean by unusual. For instance, suppose you regard something that would happen with 10% probability as unusual. (This is called the size of the test.)

4. In a $N(0, 1)$ distribution, $t < -1.65$ has a 5% probability, and so does $t > 1.65$. These are your 10% critical values.

5. Reject $H_0$ if $|t| > 1.65$.

The idea is that, if the hypothesis is true, then this decision rule gives the wrong decision in 10% of the cases. That is, 10% of all possible random samples will make us reject a true hypothesis. If we prefer a 5% significance level (which makes the risk of a false rejection smaller), then we should use the critical values of $-1.96$ and $1.96$.
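The steps above are mechanical enough to code directly. A minimal sketch (Python with scipy assumed; the numbers anticipate Example 4 below):

```python
from scipy.stats import norm

beta_hat, beta0, s = 6.0, 4.0, 1.5       # the numbers used in Example 4 below
t = (beta_hat - beta0) / s               # the test statistic (1.13)

size = 0.10
crit = norm.ppf(1 - size / 2)            # two-sided 10% critical value, about 1.645
print(t, crit, abs(t) > crit)            # t = 1.33 < 1.645: cannot reject H0

# Equivalently, via the p-value
p = 2 * (1 - norm.cdf(abs(t)))
print(p)                                 # about 0.18 > 0.10
```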

Example 4 Let $s = 1.5$, $\hat{\beta} = 6$ and $\beta_0 = 4$ (under $H_0$). Then, $t = (6 - 4)/1.5 \approx 1.33$, so we cannot reject $H_0$ at the 10% significance level.

Example 5 If instead $\hat{\beta} = 7$, then $t = 2$, so we can reject $H_0$ at the 10% (and also the 5%) level.

See Figure 1.4 for some examples of normal distributions.

Figure 1.4: Density functions of normal distributions with shaded 5% tails. (Panels: N(0.5, 2) with Pr(x <= -1.83) = 0.05; N(0, 2) for y = x - 0.5, with Pr(y <= -2.33) = 0.05; N(0, 1) for z = (x - 0.5)/sqrt(2), with Pr(z <= -1.65) = 0.05.)

1.2.2 One-Sided Test

A one-sided test is a bit different since it has a different alternative hypothesis (and therefore a different definition of "unusual"). As an example, suppose the alternative hypothesis is that the mean is larger than 4

$H_0: \beta \le 4$
$H_1: \beta > 4.$  (1.15)

To test this, we follow these steps:

1. Construct the distribution at the boundary of $H_0$: set $\beta = 4$ in (1.13) to get the same test statistic as in the two-sided test: $t = (\hat{\beta} - 4)/s \sim N(0, 1)$.

2. A value of $t$ a lot higher than zero ($\hat{\beta}$ much higher than 4) must be considered unusual. Notice that $t < 0$ ($\hat{\beta} < 4$) isn't unusual at all under $H_0$.

3. In a $N(0, 1)$ distribution, $t > 1.29$ has a 10% probability. This is your 10% critical value.

4. Reject $H_0$ if $t > 1.29$.

Example 6 Let $s = 1.5$, $\hat{\beta} = 6$ and $\beta \le 4$ (under $H_0$). Then, $t = (6 - 4)/1.5 \approx 1.33$, so we can reject $H_0$ at the 10% significance level (but not the 5% level).

Example 7 Let $s = 1.5$, $\hat{\beta} = 3$ and $\beta \le 4$ (under $H_0$). Then, $t = (3 - 4)/1.5 \approx -0.67$, so we cannot reject $H_0$ at the 10% significance level.

1.2.3 A Joint Test of Several Parameters

Suppose we have estimated both $\beta_x$ and $\beta_y$ (the estimates are denoted $\hat{\beta}_x$ and $\hat{\beta}_y$) and that we know that they have a joint normal distribution with covariance matrix $\Sigma$. We now want to test the null hypothesis

$H_0: \beta_x = 4$ and $\beta_y = 2$  (1.16)
$H_1: \beta_x \ne 4$ and/or $\beta_y \ne 2$ ($H_0$ is not true...).

To test two parameters at the same time, we somehow need to combine them into one test statistic. The most straightforward way is to form a chi-square distributed test statistic.

Remark 8 If $v_1 \sim N(0, \sigma_1^2)$ and $v_2 \sim N(0, \sigma_2^2)$ and they are uncorrelated, then $(v_1/\sigma_1)^2 + (v_2/\sigma_2)^2 \sim \chi_2^2$.

Remark 9 If the $J \times 1$ vector $v \sim N(0, \Sigma)$, then $v' \Sigma^{-1} v \sim \chi_J^2$. See Figure 1.5 for the pdf.

We calculate the test statistic for (1.16) as

$c = \begin{bmatrix} \hat{\beta}_x - 4 \\ \hat{\beta}_y - 2 \end{bmatrix}' \begin{bmatrix} \text{Var}(\hat{\beta}_x) & \text{Cov}(\hat{\beta}_x, \hat{\beta}_y) \\ \text{Cov}(\hat{\beta}_x, \hat{\beta}_y) & \text{Var}(\hat{\beta}_y) \end{bmatrix}^{-1} \begin{bmatrix} \hat{\beta}_x - 4 \\ \hat{\beta}_y - 2 \end{bmatrix},$  (1.17)

and then compare it with a 10% (say) critical value from a $\chi_2^2$ distribution.

Example 10 Suppose

$\begin{bmatrix} \hat{\beta}_x \\ \hat{\beta}_y \end{bmatrix} = \begin{bmatrix} 3 \\ 5 \end{bmatrix}$ and $\begin{bmatrix} \text{Var}(\hat{\beta}_x) & \text{Cov}(\hat{\beta}_x, \hat{\beta}_y) \\ \text{Cov}(\hat{\beta}_x, \hat{\beta}_y) & \text{Var}(\hat{\beta}_y) \end{bmatrix} = \begin{bmatrix} 5 & 3 \\ 3 & 4 \end{bmatrix};$

then (1.17) is

$c = \begin{bmatrix} -1 \\ 3 \end{bmatrix}' \begin{bmatrix} 4/11 & -3/11 \\ -3/11 & 5/11 \end{bmatrix} \begin{bmatrix} -1 \\ 3 \end{bmatrix} \approx 6.1,$

which is higher than the 10% critical value of the $\chi_2^2$ distribution (which is 4.61).

Figure 1.5: Probability density functions of a N(0,1) and $\chi_n^2$. (Panels: pdfs of N(0,1) and t(5), with 10% one-sided critical values 1.64 and 1.68; pdfs of chi-square(n) for n = 2 and n = 5, with 10% critical values 4.61 and 9.24; pdfs of F(n,5) for n = 2 and n = 5, with 10% critical values 2.41 and 1.97.)

1.3 Normal Distribution of the Sample Mean as an Approximation

In many cases, it is unreasonable to just assume that the variable is normally distributed. The nice thing with a sample mean (or sample average) is that it will still be normally distributed, at least approximately (in a reasonably large sample). This section gives a short summary of what happens to sample means as the sample size increases (often called "asymptotic theory").

The law of large numbers (LLN) says that the sample mean converges to the true population mean as the sample size goes to infinity. This holds for a very large class of random variables, but there are exceptions. A sufficient (but not necessary) condition for this convergence is that the sample average is unbiased (as in (1.10)) and that the variance goes to zero as the sample size goes to infinity (as in (1.9)). (This is also called convergence in mean square.) To see the LLN in action, see Figure 1.6.

The central limit theorem (CLT) says that $\sqrt{T}\bar{x}$ converges in distribution to a normal distribution as the sample size increases. See Figure 1.6 for an illustration. This also holds for a large class of random variables, and it is a very useful result since it allows us to test hypotheses. Most estimators (including least squares and other methods) are effectively some kind of sample average, so the CLT can be applied.

Figure 1.6: Sampling distributions. (Panels: the distribution of the sample average and of $\sqrt{T}$ times the sample average of $z_t - 1$, where $z_t$ has a $\chi^2(1)$ distribution, for several sample sizes T.)
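The sampling experiment behind Figure 1.6 is easy to replicate. A sketch (numpy assumed; the number of simulations and the sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
for T in (5, 25, 100):
    # z_t - 1 where z_t ~ chi2(1): zero mean, variance 2
    z = rng.chisquare(1, size=(10_000, T)) - 1
    xbar = z.mean(axis=1)
    # LLN: xbar collapses toward 0 as T grows;
    # CLT: sqrt(T)*xbar keeps a variance of about 2 and looks increasingly normal
    print(T, xbar.std(), (np.sqrt(T) * xbar).std())
```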

Bibliography

Bodie, Z., A. Kane, and A. J. Marcus, 2002, Investments, McGraw-Hill/Irwin, Boston, 5th edn.

2 Least Squares and Maximum Likelihood Estimation

More advanced material is denoted by a star (*). It is not required reading.

2.1 Least Squares

2.1.1 Simple Regression: Constant and One Regressor

The simplest regression model is

$y_t = \beta_0 + \beta_1 x_t + u_t$, where $\text{E}\,u_t = 0$ and $\text{Cov}(x_t, u_t) = 0,$  (2.1)

where we can observe (have data on) the dependent variable $y_t$ and the regressor $x_t$, but not the residual $u_t$. In principle, the residual should account for all the movements in $y_t$ that we cannot explain (by $x_t$).

Note the two very important assumptions: (i) the mean of the residual is zero; and (ii) the residual is not correlated with the regressor, $x_t$. If the regressor summarizes all the useful information we have in order to describe $y_t$, then the assumptions imply that we have no way of making a more intelligent guess of $u_t$ (even after having observed $x_t$) than that it will be zero.

Suppose you do not know $\beta_0$ or $\beta_1$, and that you have a sample of data: $y_t$ and $x_t$ for $t = 1, \ldots, T$. The LS estimator of $\beta_0$ and $\beta_1$ minimizes the loss function

$\sum_{t=1}^{T} (y_t - b_0 - b_1 x_t)^2 = (y_1 - b_0 - b_1 x_1)^2 + (y_2 - b_0 - b_1 x_2)^2 + \ldots$  (2.2)

by choosing $b_0$ and $b_1$ to make the loss function value as small as possible. The objective is thus to pick values of $b_0$ and $b_1$ in order to make the model fit the data as closely as possible, where "close" is taken to mean a small variance of the unexplained part (the residual).

Remark 11 (First order condition for minimizing a differentiable function.) We want to find the value of $b$ in the interval $b_{low} \le b \le b_{high}$ which makes the value of the differentiable function $f(b)$ as small as possible. The answer is $b_{low}$, $b_{high}$, or the value of $b$ where $df(b)/db = 0$. See Figure 2.1.

Figure 2.1: Quadratic loss functions. Subfigure a: one coefficient, the loss function $2b^2$; subfigure b: two coefficients, the loss function $2b^2 + (c-4)^2$.

Figure 2.2: Example of OLS estimation. (Two samples generated by y = bx + u; in both cases the OLS estimate is b = 1.8, with R2 = 0.92 and R2 = 0.81 respectively; the accompanying panels show the sum of squared errors as a function of b.)

The first order conditions for a minimum are that the derivatives of this loss function with respect to $b_0$ and $b_1$ should be zero. Let $(\hat{\beta}_0, \hat{\beta}_1)$ be the values of $(b_0, b_1)$ where that is true

$\frac{\partial}{\partial \hat{\beta}_0} \sum_{t=1}^{T} (y_t - \hat{\beta}_0 - \hat{\beta}_1 x_t)^2 = -2 \sum_{t=1}^{T} (y_t - \hat{\beta}_0 - \hat{\beta}_1 x_t) \cdot 1 = 0,$  (2.3)

$\frac{\partial}{\partial \hat{\beta}_1} \sum_{t=1}^{T} (y_t - \hat{\beta}_0 - \hat{\beta}_1 x_t)^2 = -2 \sum_{t=1}^{T} (y_t - \hat{\beta}_0 - \hat{\beta}_1 x_t) x_t = 0,$  (2.4)

which are two equations in two unknowns ($\hat{\beta}_0$ and $\hat{\beta}_1$), which must be solved simultaneously. These equations show that both the constant and $x_t$ should be orthogonal to the fitted residuals, $\hat{u}_t = y_t - \hat{\beta}_0 - \hat{\beta}_1 x_t$. This is indeed a defining feature of LS and can be seen as the sample analogue of the assumptions in (2.1) that $\text{E}\,u_t = 0$ and $\text{Cov}(x_t, u_t) = 0$. To see this, note that (2.3) says that the sample average of $\hat{u}_t$ should be zero. Similarly, (2.4) says that the sample cross moment of $\hat{u}_t$ and $x_t$ should also be zero, which implies that the sample covariance is zero as well, since $\hat{u}_t$ has a zero sample mean.

Remark 12 Note that $\beta_i$ is the true (unobservable) value which we estimate to be $\hat{\beta}_i$. Whereas $\beta_i$ is an unknown (deterministic) number, $\hat{\beta}_i$ is a random variable since it is calculated as a function of the random sample of $y_t$ and $x_t$.

Remark 13 Least squares is only one of many possible ways to estimate regression coefficients. We will discuss other methods later on.

Remark 14 (Cross moments and covariance.) A covariance is defined as

$\text{Cov}(x, y) = \text{E}[(x - \text{E}\,x)(y - \text{E}\,y)] = \text{E}(xy - x\,\text{E}\,y - y\,\text{E}\,x + \text{E}\,x\,\text{E}\,y) = \text{E}\,xy - \text{E}\,x\,\text{E}\,y - \text{E}\,y\,\text{E}\,x + \text{E}\,x\,\text{E}\,y = \text{E}\,xy - \text{E}\,x\,\text{E}\,y.$

When $x = y$, then we get $\text{Var}(x) = \text{E}\,x^2 - (\text{E}\,x)^2$. These results hold for sample moments too.
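Numerically, the first order conditions (2.3)-(2.4) are just a 2x2 linear system. A minimal sketch (Python with numpy; the true coefficients and sample size are made up), which also verifies the orthogonality conditions:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
x = rng.normal(size=T)
u = rng.normal(size=T)
y = 1.0 + 0.8 * x + u                        # true beta0 = 1, beta1 = 0.8

# Solve the first-order conditions (2.3)-(2.4): regress y on [1, x]
X = np.column_stack([np.ones(T), x])
b0, b1 = np.linalg.solve(X.T @ X, X.T @ y)

# By construction, the fitted residuals are orthogonal to the constant and to x
uhat = y - b0 - b1 * x
print(b0, b1)
print(uhat.mean(), (uhat * x).mean())        # both zero up to rounding
```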

When the means of $y$ and $x$ are zero, then we can disregard the constant. In this case, (2.4) with $\hat{\beta}_0 = 0$ immediately gives $\sum_{t=1}^{T} y_t x_t = \hat{\beta}_1 \sum_{t=1}^{T} x_t x_t$, or

$\hat{\beta}_1 = \frac{\sum_{t=1}^{T} y_t x_t / T}{\sum_{t=1}^{T} x_t x_t / T}.$  (2.5)

In this case, the coefficient estimator is the sample covariance (recall: the means are zero) of $y_t$ and $x_t$, divided by the sample variance of the regressor $x_t$. (This statement is actually true even if the means are not zero and a constant is included on the right hand side; it is just more tedious to show.) See Figure 2.2 for an illustration.

2.1.2 Least Squares: Goodness of Fit

The quality of a regression model is often measured in terms of its ability to explain the movements of the dependent variable. Let $\hat{y}_t$ be the fitted (predicted) value of $y_t$. For instance, with (2.1) it would be $\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 x_t$. If a constant is included in the regression (or the means of $y$ and $x$ are zero), then a check of the goodness of fit of the model is given by

$R^2 = \text{Corr}(y_t, \hat{y}_t)^2.$  (2.6)

See Figure 2.3 for an example. This is the squared correlation of the actual and predicted values of $y_t$.

To understand this result, suppose that $x_t$ has no explanatory power, so $R^2$ should be zero. How does that happen? Well, if $x_t$ is uncorrelated with $y_t$, then the numerator in (2.5) is zero, so $\hat{\beta}_1 = 0$. As a consequence $\hat{y}_t = \hat{\beta}_0$, which is a constant, and a constant is always uncorrelated with anything else (as correlations measure comovements around the means).

To get a bit more intuition for what $R^2$ represents, suppose the estimated coefficients equal the true coefficients, so $\hat{y}_t = \beta_0 + \beta_1 x_t$. In this case, $R^2 = \text{Corr}(\beta_0 + \beta_1 x_t + u_t,\; \beta_0 + \beta_1 x_t)^2$, that is, the squared correlation of $y_t$ with the systematic part of $y_t$. Clearly, if the model is perfect, so $u_t = 0$, then $R^2 = 1$. In contrast, when there are no movements in the systematic part ($\beta_1 = 0$), then $R^2 = 0$.

Figure 2.3: Predicting US stock returns (various investment horizons) with the dividend-price ratio. (Panels: the slope and R2 from regressions of returns on the lagged return and on the dividend/price ratio, across return horizons in months; the slopes are shown with 90% confidence bands based on Newey-West standard errors.)

2.1.3 Least Squares: Outliers

Since the loss function in (2.2) is quadratic, a few outliers can easily have a very large influence on the estimated coefficients. For instance, suppose the true model is $y_t = 0.75 x_t + u_t$, and that the residual is very large for some time period $s$. If the regression coefficient happened to be 0.75 (the true value, actually), the loss function value would be large due to the $u_s^2$ term. The loss function value will probably be lower if the coefficient is changed to pick up the $y_s$ observation, even if this means that the errors for the other observations become larger (the sum of the squares of many small errors can very well be less than the square of a single large error).
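A small experiment shows how a single outlier moves the LS estimate. A sketch (numpy assumed; the contamination size of 25 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 50
x = rng.normal(size=T)
y = 0.75 * x + 0.5 * rng.normal(size=T)

def ols_slope(x, y):
    # Slope of a no-intercept regression, as in (2.5)
    return (x * y).sum() / (x * x).sum()

b = ols_slope(x, y)
yhat = b * x
print(b, np.corrcoef(y, yhat)[0, 1] ** 2)   # slope near 0.75, and R2 as in (2.6)

# Now contaminate one observation with a huge residual
y_bad = y.copy()
y_bad[0] = y[0] + 25
print(ols_slope(x, y_bad))                  # the quadratic loss typically drags the slope away
```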

Figure 2.4: Data and regression lines from OLS and LAD, for data generated by y = 0.75x + u. (In the plotted sample, the OLS estimates of the intercept and slope are (0.25, 0.9) and the LAD estimates are (0, 0.75).)

There is of course nothing sacred about the quadratic loss function. Instead of (2.2) one could, for instance, use a loss function in terms of the absolute value of the error, $\sum_{t=1}^{T} |y_t - \beta_0 - \beta_1 x_t|$. This would produce the Least Absolute Deviation (LAD) estimator. It is typically less sensitive to outliers. This is illustrated in Figure 2.4. However, LS is by far the most popular choice. There are two main reasons: LS is very easy to compute, and it is fairly straightforward to construct standard errors and confidence intervals for the estimator. (From an econometric point of view you may want to add that LS coincides with maximum likelihood when the errors are normally distributed.)

2.1.4 The Distribution of $\hat{\beta}$

Note that the estimated coefficients are random variables since they depend on which particular sample has been drawn. This means that we cannot be sure that the estimated coefficients are equal to the true coefficients ($\beta_0$ and $\beta_1$ in (2.1)). We can calculate an estimate of this uncertainty in the form of variances and covariances of $\hat{\beta}_0$ and $\hat{\beta}_1$. These can be used for testing hypotheses about the coefficients, for instance, that $\beta_1 = 0$.

To see where the uncertainty comes from, consider the simple case in (2.5). Use (2.1) to substitute for $y_t$ (recall $\beta_0 = 0$)

$\hat{\beta}_1 = \frac{\sum_{t=1}^{T} x_t (\beta_1 x_t + u_t) / T}{\sum_{t=1}^{T} x_t x_t / T} = \beta_1 + \frac{\sum_{t=1}^{T} x_t u_t / T}{\sum_{t=1}^{T} x_t x_t / T},$  (2.7)

so the OLS estimate, $\hat{\beta}_1$, equals the true value, $\beta_1$, plus the sample covariance of $x_t$ and $u_t$ divided by the sample variance of $x_t$. One of the basic assumptions in (2.1) is that the covariance of the regressor and the residual is zero. This should hold in a very large sample (or else OLS cannot be used to estimate $\beta_1$), but in a small sample it may be different from zero. Since $u_t$ is a random variable, $\hat{\beta}_1$ is too. Only as the sample gets very large can we be (almost) sure that the second term in (2.7) vanishes.

Equation (2.7) will give different values of $\hat{\beta}_1$ when we use different samples, that is, different draws of the random variables $u_t$, $x_t$, and $y_t$. Since the true value, $\beta_1$, is a fixed constant, this distribution describes the uncertainty we should have about the true value after having obtained a specific estimated value. The first conclusion from (2.7) is that, with $u_t = 0$ the estimate would always be perfect, and with large movements in $u_t$ we will see large movements in $\hat{\beta}_1$. The second conclusion is that a small sample (small $T$) will also lead to large random movements in $\hat{\beta}_1$, in contrast to a large sample where the randomness in $\sum_{t=1}^{T} x_t u_t / T$ is averaged out more effectively (it should be zero in a large sample).

There are three main routes to learn more about the distribution of $\hat{\beta}$: (i) set up a small experiment in the computer and simulate the distribution (Monte Carlo or bootstrap simulation); (ii) pretend that the regressors can be treated as fixed numbers and then assume something about the distribution of the residuals; or (iii) use the asymptotic (large sample) distribution as an approximation. The asymptotic distribution can often be derived, in contrast to the exact distribution in a sample of a given size. If the actual sample is large, then the asymptotic distribution may be a good approximation.
The simulation approach has the advantage of giving a precise answer, but the disadvantage of requiring a very precise question (we must write computer code that is tailor-made for the particular model we are looking at, including the specific parameter values). See Figure 2.5 for an example.
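For instance, a Monte Carlo of the distribution of $\hat{\beta}_1$ in (2.5) takes only a few lines of code. A sketch (numpy assumed; parameter values are illustrative), which also compares the simulated variance with the fixed-regressor formula derived in (2.9) below:

```python
import numpy as np

rng = np.random.default_rng(3)
T, beta1, sigma = 50, 0.75, 1.0
x = rng.normal(size=T)                 # kept fixed across the simulated samples

bhats = np.empty(5_000)
for i in range(bhats.size):
    u = sigma * rng.normal(size=T)     # new draw of the residuals in each sample
    y = beta1 * x + u
    bhats[i] = (x * y).sum() / (x * x).sum()    # the estimator (2.5)

# The simulated distribution is centered on the true value; its variance
# matches sigma^2 / sum(x_t^2), the fixed-regressor formula in (2.9) below
print(bhats.mean(), bhats.var(), sigma**2 / (x * x).sum())
```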

2.1.5 Fixed Regressors

The assumption of fixed regressors makes a lot of sense in controlled experiments, where we actually can generate different samples with the same values of the regressors (the heat or whatever). It makes much less sense in econometrics. However, it is easy to derive results for this case, and those results happen to be very similar to what asymptotic theory gives.

Remark 15 (Linear combination of normally distributed variables.) If the random variables $z_t$ and $v_t$ are normally distributed, then $a + bz_t + cv_t$ is too. To be precise, $a + bz_t + cv_t \sim N(a + b\mu_z + c\mu_v,\; b^2\sigma_z^2 + c^2\sigma_v^2 + 2bc\sigma_{zv})$.

Suppose $u_t \sim N(0, \sigma^2)$; then (2.7) shows that $\hat{\beta}_1$ is normally distributed. The reason is that $\hat{\beta}_1$ is just a constant ($\beta_1$) plus a linear combination of normally distributed residuals (with fixed regressors, $x_t / \sum_{t=1}^{T} x_t x_t$ can be treated as constant). It is straightforward to see that the mean of this normal distribution is $\beta_1$ (the true value), since the rest is a linear combination of the residuals, and they all have a zero mean. Finding the variance of $\hat{\beta}_1$ is just slightly more complicated. First, write (2.7) as

$\hat{\beta}_1 = \beta_1 + \frac{1}{\sum_{t=1}^{T} x_t x_t} (x_1 u_1 + x_2 u_2 + \ldots + x_T u_T).$  (2.8)

Second, remember that we treat $x_t$ as fixed numbers ("constants"). Third, assume that the residuals are iid: they are uncorrelated with each other (independently distributed) and have the same variances (identically distributed). The variance of (2.8) is therefore

$\text{Var}(\hat{\beta}_1) = \left( \frac{1}{\sum_{t=1}^{T} x_t x_t} \right)^2 \left[ x_1^2 \text{Var}(u_1) + x_2^2 \text{Var}(u_2) + \ldots + x_T^2 \text{Var}(u_T) \right]$
$= \left( \frac{1}{\sum_{t=1}^{T} x_t x_t} \right)^2 \left[ x_1^2 \sigma^2 + x_2^2 \sigma^2 + \ldots + x_T^2 \sigma^2 \right]$
$= \left( \frac{1}{\sum_{t=1}^{T} x_t x_t} \right)^2 \left( \sum_{t=1}^{T} x_t x_t \right) \sigma^2 = \frac{\sigma^2}{\sum_{t=1}^{T} x_t x_t}.$  (2.9)

Notice that the denominator increases with the sample size while the numerator stays constant: a larger sample gives a smaller uncertainty about the estimate. Similarly, a lower volatility of the residuals (lower $\sigma^2$) also gives a lower uncertainty about the estimate.

Example 16 When the regressor is just a constant (equal to one), $x_t = 1$, then we have $\sum_{t=1}^{T} x_t x_t = \sum_{t=1}^{T} 1 \cdot 1 = T$, so $\text{Var}(\hat{\beta}) = \sigma^2 / T$. (This is the classical expression for the variance of a sample mean.)

Example 17 When the regressor is a zero mean variable, then we have $\sum_{t=1}^{T} x_t x_t = \text{Var}(x_t) T$, so $\text{Var}(\hat{\beta}) = \sigma^2 / [\text{Var}(x_t) T]$. The variance is increasing in $\sigma^2$, but decreasing in both $T$ and $\text{Var}(x_t)$. Why?

2.1.6 A Bit of Asymptotic Theory

A law of large numbers would (in most cases) say that both $\sum_{t=1}^{T} x_t^2 / T$ and $\sum_{t=1}^{T} x_t u_t / T$ in (2.7) converge to their expected values as $T \to \infty$. The reason is that both are sample averages of random variables (clearly, both $x_t^2$ and $x_t u_t$ are random variables). These expected values are $\text{Var}(x_t)$ and $\text{Cov}(x_t, u_t)$, respectively (recall that both $x_t$ and $u_t$ have zero means). The key to showing that $\hat{\beta}$ is consistent is that $\text{Cov}(x_t, u_t) = 0$. This highlights the importance of using good theory to derive not only the systematic part of (2.1), but also in understanding the properties of the errors. For instance, when economic theory tells us that $y_t$ and $x_t$ affect each other (as prices and quantities typically do), then the errors are likely to be correlated with the regressors, and LS is inconsistent. One common way to get around that is to use an instrumental variables technique. Consistency is a feature we want from most estimators, since it says that we would at least get it right if we had enough data.

Suppose that $\hat{\beta}$ is consistent. Can we say anything more about the asymptotic distribution?
Well, the distribution of $\hat{\beta}$ converges to a spike with all the mass at $\beta$, but the distribution of $\sqrt{T}(\hat{\beta} - \beta)$ will typically converge to a non-trivial normal distribution. To see why, note from (2.7) that we can write

$\sqrt{T}(\hat{\beta} - \beta) = \left( \sum_{t=1}^{T} x_t^2 / T \right)^{-1} \sqrt{T} \sum_{t=1}^{T} x_t u_t / T.$  (2.10)

The first term on the right hand side will typically converge to the inverse of $\text{Var}(x_t)$, as discussed earlier. The second term is $\sqrt{T}$ times a sample average (of the random variable $x_t u_t$) with a zero expected value, since we assumed that $\hat{\beta}$ is consistent. Under weak conditions, a central limit theorem applies, so $\sqrt{T}$ times a sample average converges to a normal distribution. This shows that $\sqrt{T}(\hat{\beta} - \beta)$ has an asymptotic normal distribution. It turns out that this is a property of many estimators, basically because most estimators are some kind of sample average. The properties of this distribution are quite similar to those that we derived by assuming that the regressors were fixed numbers.

2.1.7 Diagnostic Tests

Exactly what the variance of $\hat{\beta}$ is, and how it should be estimated, depends mostly on the properties of the errors. This is one of the main reasons for diagnostic tests. The most common tests are for homoskedastic errors (equal variances of $u_t$ and $u_{t-s}$) and no autocorrelation (no correlation of $u_t$ and $u_{t-s}$). When ML is used, it is common to investigate if the fitted errors satisfy the basic assumptions, for instance, of normality.

Figure 2.5: Distribution of the t-statistic of the LS estimator when the residuals are non-normal. (Model: $R_t = 0.9 f_t + \epsilon_t$, where $\epsilon_t = v_t - 2$ and $v_t$ has a $\chi^2(2)$ distribution; panels show the distribution of the t-statistic for T = 5 and T = 100, its kurtosis, the frequencies of |t-stat| > 1.645 and |t-stat| > 1.96, and the pdfs of N(0,1) and $\chi^2(2)$.)

2.1.8 Multiple Regression

All of the previous results still hold in a multiple regression, with suitable reinterpretations of the notation.

2.1.9 Multiple Regression: Details

Consider the linear model

$y_t = x_{1t}\beta_1 + x_{2t}\beta_2 + \ldots + x_{kt}\beta_k + u_t = x_t'\beta + u_t,$  (2.11)

where $y_t$ and $u_t$ are scalars, $x_t$ is a $k \times 1$ vector, and $\beta$ is a $k \times 1$ vector of the true coefficients (see Appendix A for a summary of matrix algebra). Least squares minimizes the sum of the squared fitted residuals

$\sum_{t=1}^{T} \hat{u}_t^2 = \sum_{t=1}^{T} (y_t - x_t'\hat{\beta})^2,$  (2.12)

by choosing the vector $\hat{\beta}$. The first order conditions are

$0_{k \times 1} = \sum_{t=1}^{T} x_t (y_t - x_t'\hat{\beta})$, or $\sum_{t=1}^{T} x_t y_t = \sum_{t=1}^{T} x_t x_t' \hat{\beta},$  (2.13)

which can be solved as

$\hat{\beta} = \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} \sum_{t=1}^{T} x_t y_t.$  (2.14)

Example 18 With 2 regressors ($k = 2$), (2.13) is

$\begin{bmatrix} 0 \\ 0 \end{bmatrix} = \sum_{t=1}^{T} \begin{bmatrix} x_{1t}(y_t - x_{1t}\hat{\beta}_1 - x_{2t}\hat{\beta}_2) \\ x_{2t}(y_t - x_{1t}\hat{\beta}_1 - x_{2t}\hat{\beta}_2) \end{bmatrix}$

and (2.14) is

$\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} = \left( \sum_{t=1}^{T} \begin{bmatrix} x_{1t}x_{1t} & x_{1t}x_{2t} \\ x_{2t}x_{1t} & x_{2t}x_{2t} \end{bmatrix} \right)^{-1} \sum_{t=1}^{T} \begin{bmatrix} x_{1t}y_t \\ x_{2t}y_t \end{bmatrix}.$
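In code, (2.14) is one call to a linear solver. A sketch (numpy assumed; the coefficient vector is made up):

```python
import numpy as np

rng = np.random.default_rng(4)
T, beta = 500, np.array([1.0, 0.8, -0.3])

X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])   # x_t = (1, x_1t, x_2t)'
y = X @ beta + rng.normal(size=T)

# (2.14): beta_hat = (sum x_t x_t')^{-1} sum x_t y_t; solve() avoids an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```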

2.2 Maximum Likelihood

A different route to arrive at an estimator is to maximize the likelihood function. To understand the principle of maximum likelihood estimation, consider the following example.

Suppose we know $x_t \sim N(\mu, 1)$, but we don't know the value of $\mu$. Since $x_t$ is a random variable, there is a probability of every observation, and the density function of $x_t$ is

$L = \text{pdf}(x_t) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{1}{2}(x_t - \mu)^2 \right],$  (2.15)

where $L$ stands for "likelihood." The basic idea of maximum likelihood estimation (MLE) is to pick model parameters to make the observed data have the highest possible probability. Here this gives $\hat{\mu} = x_t$. This is the maximum likelihood estimator in this example.

What if there are two observations, $x_1$ and $x_2$? In the simplest case where $x_1$ and $x_2$ are independent, $\text{pdf}(x_1, x_2) = \text{pdf}(x_1)\,\text{pdf}(x_2)$, so

$L = \text{pdf}(x_1, x_2) = \frac{1}{2\pi} \exp\left[ -\frac{1}{2}(x_1 - \mu)^2 - \frac{1}{2}(x_2 - \mu)^2 \right].$  (2.16)

Take logs (log likelihood)

$\ln L = -\ln(2\pi) - \frac{1}{2}\left[ (x_1 - \mu)^2 + (x_2 - \mu)^2 \right],$  (2.17)

which is maximized by setting $\hat{\mu} = (x_1 + x_2)/2$.

To apply this idea to a regression model

$y_t = \beta x_t + u_t,$  (2.18)

we could assume that $u_t$ is iid $N(0, \sigma^2)$. The probability density function of $u_t$ is

$\text{pdf}(u_t) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} u_t^2 / \sigma^2 \right).$  (2.19)

Since the errors are independent, we get the joint pdf of $u_1, u_2, \ldots, u_T$ by multiplying the marginal pdfs of each of the errors

$L = \text{pdf}(u_1)\,\text{pdf}(u_2) \cdots \text{pdf}(u_T) = (2\pi\sigma^2)^{-T/2} \exp\left[ -\frac{1}{2}\left( \frac{u_1^2}{\sigma^2} + \frac{u_2^2}{\sigma^2} + \ldots + \frac{u_T^2}{\sigma^2} \right) \right].$  (2.20)

Substitute $y_t - x_t\beta$ for $u_t$ and take logs to get the log likelihood function of the sample

$\ln L = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{1}{2}\sum_{t=1}^{T} (y_t - x_t\beta)^2 / \sigma^2.$  (2.21)

Suppose (for simplicity) that we happen to know the value of $\sigma^2$. It is then clear that this likelihood function is maximized by minimizing the last term, which is proportional to the sum of squared errors, just like in (2.2): LS is ML when the errors are iid normally distributed (but only then). (This holds also when we do not know the value of $\sigma^2$; it is just slightly messier to show.) See Figure 2.6.

Maximum likelihood estimators have very nice properties, provided the basic distributional assumptions are correct, that is, if we maximize the right likelihood function. In that case, MLE is typically the most efficient/precise estimator (at least in very large samples). ML also provides a coherent framework for testing hypotheses (including the Wald, LM, and LR tests).

Example 19 Consider the simple regression where we happen to know that the intercept is zero, $y_t = \beta_1 x_t + u_t$. Suppose we have the following data

$[y_1\; y_2\; y_3] = [-1.5\; -0.6\; 2.1]$ and $[x_1\; x_2\; x_3] = [-1\; 0\; 1].$

Suppose $\beta_1 = 2$; then we get the following values for $u_t = y_t - 2x_t$ and its square

$u = [0.5,\; -0.6,\; 0.1]$, with squares $[0.25,\; 0.36,\; 0.01]$ and sum $0.62$.

Now suppose instead that $\beta_1 = 1.8$; then we get

$u = [0.3,\; -0.6,\; 0.3]$, with squares $[0.09,\; 0.36,\; 0.09]$ and sum $0.54$.

The latter choice of $\beta_1$ will certainly give a larger value of the likelihood function (it is actually the optimum). See Figure 2.6.
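Numerically, ML estimation of (2.18) amounts to minimizing minus the log likelihood (2.21) over $(\beta, \sigma)$. A sketch (Python with scipy assumed; parametrizing by $\ln\sigma$ is just a convenient way to keep $\sigma^2$ positive), which confirms that the ML and LS estimates of $\beta$ coincide:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
T = 200
x = rng.normal(size=T)
y = 0.9 * x + 0.5 * rng.normal(size=T)

def neg_loglik(params):
    # Minus the log likelihood (2.21) for y_t = beta*x_t + u_t, u_t iid N(0, sigma2)
    beta, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)
    u = y - beta * x
    return 0.5 * T * np.log(2 * np.pi) + 0.5 * T * np.log(sigma2) + 0.5 * (u**2).sum() / sigma2

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]))
beta_ml = res.x[0]
beta_ols = (x * y).sum() / (x * x).sum()
print(beta_ml, beta_ols)                 # the two coincide, as the text argues
```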

Figure 2.6: Example of OLS and MLE estimation. (Data and fitted line for y = bx + u with b = 1.8 (OLS); the log likelihood and the sum of squared errors, both as functions of b.)

A Some Matrix Algebra

Let

$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$, $c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}$, $A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$, and $B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}.$

Matrix addition (or subtraction) is element by element

$A + B = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} + \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11} + B_{11} & A_{12} + B_{12} \\ A_{21} + B_{21} & A_{22} + B_{22} \end{bmatrix}.$

To turn a column into a row vector, use the transpose operator, as in

$x' = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}' = \begin{bmatrix} x_1 & x_2 \end{bmatrix}.$

To do matrix multiplication, the two matrices need to be conformable: the first matrix has as many columns as the second matrix has rows. For instance, $xc$ does not work, but $x'c$ does

$x'c = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = x_1 c_1 + x_2 c_2.$

Some further examples:

$xx' = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \begin{bmatrix} x_1 & x_2 \end{bmatrix} = \begin{bmatrix} x_1^2 & x_1 x_2 \\ x_2 x_1 & x_2^2 \end{bmatrix}$

$Ac = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} A_{11}c_1 + A_{12}c_2 \\ A_{21}c_1 + A_{22}c_2 \end{bmatrix}.$

A matrix inverse is the closest we get to "dividing" by a matrix. The inverse of a matrix $D$, denoted $D^{-1}$, is such that

$DD^{-1} = I$ and $D^{-1}D = I,$

where $I$ is the identity matrix (ones along the diagonal, and zeroes elsewhere). For instance, the $A^{-1}$ matrix has the same dimensions as $A$, and the elements (here denoted $Q_{ij}$) are such that the following holds

$A^{-1}A = \begin{bmatrix} Q_{11} & Q_{12} \\ Q_{21} & Q_{22} \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}.$

Example 20 We have

$\begin{bmatrix} -2 & 1 \\ 3/2 & -1/2 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, so $\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}^{-1} = \begin{bmatrix} -2 & 1 \\ 3/2 & -1/2 \end{bmatrix}.$
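These operations map one-to-one into matrix software. A sketch (numpy assumed) reproducing the examples above:

```python
import numpy as np

x = np.array([[1.0], [2.0]])      # a 2x1 column vector
c = np.array([[3.0], [4.0]])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

print(x.T @ c)                    # x'c, the 1x1 inner product
print(x @ x.T)                    # xx', a 2x2 outer product
print(A @ c)                      # Ac

Ainv = np.linalg.inv(A)
print(Ainv)                       # [[-2, 1], [1.5, -0.5]], as in Example 20
print(Ainv @ A)                   # the identity matrix, up to rounding: A^{-1}A = I
```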

3 Testing CAPM

Reference: Elton, Gruber, Brown, and Goetzmann (2003) 15

More advanced material is denoted by a star (*). It is not required reading.

3.1 Market Model

The basic implication of CAPM is that the expected excess return of an asset ($\mu_i^e$) is linearly related to the expected excess return on the market portfolio ($\mu_m^e$) according to

$\mu_i^e = \beta_i \mu_m^e$, where $\beta_i = \frac{\text{Cov}(R_i, R_m)}{\text{Var}(R_m)}.$  (3.1)

Let $R_{it}^e = R_{it} - R_{ft}$ be the return on asset $i$ in excess of the riskfree asset, and let $R_{mt}^e$ be the excess return on the market portfolio. CAPM with a riskfree return says that $\alpha_i = 0$ in

$R_{it}^e = \alpha_i + b_i R_{mt}^e + \varepsilon_{it}$, where $\text{E}\,\varepsilon_{it} = 0$ and $\text{Cov}(R_{mt}^e, \varepsilon_{it}) = 0.$  (3.2)

The two last conditions are automatically imposed by LS. Take expectations to get

$\text{E}(R_{it}^e) = \alpha_i + b_i\,\text{E}(R_{mt}^e).$  (3.3)

Notice that the LS estimate of $b_i$ is the sample analogue to $\beta_i$ in (3.1). It is then clear that CAPM implies that $\alpha_i = 0$, which is also what empirical tests of CAPM focus on.

This test of CAPM can be given two interpretations. If we assume that $R_{mt}$ is the correct benchmark (the tangency portfolio for which (3.1) is true by definition), then it is a test of whether asset $R_{it}$ is correctly priced. This is typically the perspective in performance analysis of mutual funds. Alternatively, if we assume that $R_{it}$ is correctly priced, then it is a test of the mean-variance efficiency of $R_{mt}$. This is the perspective of CAPM tests.

The t-test of the null hypothesis that $\alpha = 0$ uses the fact that, under fairly mild conditions, the t-statistic has an asymptotically normal distribution, that is

$\frac{\hat{\alpha}}{\text{Std}(\hat{\alpha})} \stackrel{d}{\to} N(0, 1)$ under $H_0: \alpha = 0.$  (3.4)

Note that this is the distribution under the null hypothesis that the true value of the intercept is zero, that is, that CAPM is correct (in this respect, at least).

The test assets are typically portfolios of firms with similar characteristics, for instance, small size or having their main operations in the retail industry. There are two main reasons for testing the model on such portfolios: individual stocks are extremely volatile, and firms can change substantially over time (so the beta changes). Moreover, it is of interest to see how the deviations from CAPM are related to firm characteristics (size, industry, etc.), since that can possibly suggest how the model needs to be changed.

The results from such tests vary with the test assets used. For US portfolios, CAPM seems to work reasonably well for some types of portfolios (for instance, portfolios based on firm size or industry), but much worse for other types of portfolios (for instance, portfolios based on firm dividend yield or book value/market value ratio). Figure 3.1 shows some results for US industry portfolios.

3.1.1 Interpretation of the CAPM Test

Instead of a t-test, we can use the equivalent chi-square test

$\frac{\hat{\alpha}^2}{\text{Var}(\hat{\alpha})} \stackrel{d}{\to} \chi_1^2$ under $H_0: \alpha = 0.$  (3.5)

It is quite straightforward to use the properties of minimum-variance frontiers (see Gibbons, Ross, and Shanken (1989), and also MacKinlay (1995)) to show that the test statistic in (3.5) can be written

$\frac{\hat{\alpha}^2}{\text{Var}(\hat{\alpha})} = \frac{(SR_q)^2 - (SR_m)^2}{[1 + (SR_m)^2]/T},$  (3.6)

where $SR_m$ is the Sharpe ratio of the market portfolio (as before) and $SR_q$ is the Sharpe ratio of the tangency portfolio when investment in both the market return and asset $i$ is possible. (Recall that the tangency portfolio is the portfolio with the highest possible Sharpe ratio.)
If the market portfolio has the same (squared) Sharpe ratio as the tangency portfolio of the mean-variance frontier of $R_{it}$ and $R_{mt}$ (so the market portfolio is mean-

Figure 3.1: CAPM regressions on US industry indices. (Scatter of mean excess returns against betas and against the predicted mean excess returns (with alpha = 0); a table of alphas, Wald statistics and p-values per industry portfolio; the test statistics use Newey-West standard errors and are distributed as chi-square(n), where n is the number of test assets.)

Figure 3.2: Effect on the MV frontier of adding assets. (MV frontiers before and after adding a new asset with abnormal return alpha = 0, alpha > 0 and alpha < 0 compared to the market of the original assets; solid curves: 2 assets, dashed curves: 3 assets.)

variance efficient also when we take $R_{it}$ into account), then the test statistic, $\hat{\alpha}^2 / \text{Var}(\hat{\alpha})$, is zero, and CAPM is not rejected.

This is illustrated in Figure 3.2, which shows the effect of adding an asset to the investment opportunity set. In this case, the new asset has a zero beta (since it is uncorrelated with all original assets), but the same type of result holds for any new asset. The basic point is that the market model tests if the new asset moves the location of the tangency portfolio. In general, we would expect that adding an asset to the investment opportunity set would expand the mean-variance frontier (and it does) and that the tangency portfolio changes accordingly. However, the tangency portfolio is not changed by adding an asset with a zero intercept. The intuition is that such an asset has neutral performance compared to the market portfolio (it obeys the beta representation), so investors should stick to the market portfolio.

3.1.2 Econometric Properties of the CAPM Test

A common finding from Monte Carlo simulations is that these tests tend to reject a true null hypothesis too often when the critical values from the asymptotic distribution are used: the actual small-sample size of the test is thus larger than the asymptotic (or "nominal") size (see Campbell, Lo, and MacKinlay (1997) Table 5.1). The practical consequence is that we should either use adjusted critical values (from Monte Carlo or bootstrap simulations) or, more pragmatically, only believe in strong rejections of the null hypothesis.

To study the power of the test (the frequency of rejections of a false null hypothesis) we have to specify an alternative data generating process (for instance, how much extra return in excess of that motivated by CAPM) and the size of the test (the critical value to use). Once that is done, it is typically found that these tests require a substantial deviation from CAPM and/or a long sample to get good power.
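A sketch of the basic alpha test on simulated data (Python with numpy; all parameter values are made up, and plain iid-error standard errors are used here, whereas the figures in these notes use Newey-West standard errors):

```python
import numpy as np

rng = np.random.default_rng(6)
T = 600                                   # e.g. 50 years of monthly data
Rm = 0.005 + 0.045 * rng.normal(size=T)   # excess market return (made-up numbers)
eps = 0.05 * rng.normal(size=T)
Re = 0.002 + 0.9 * Rm + eps               # a test asset with a true alpha of 0.2% per month

X = np.column_stack([np.ones(T), Rm])
b = np.linalg.solve(X.T @ X, X.T @ Re)    # (alpha_hat, b_hat), the regression (3.2)
uhat = Re - X @ b
s2 = (uhat**2).mean()
V = s2 * np.linalg.inv(X.T @ X)           # iid-error covariance matrix of the coefficients
t_alpha = b[0] / np.sqrt(V[0, 0])         # the t-statistic in (3.4)
print(b, t_alpha)
```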

3.1.3 Several Assets

In most cases there are several ($n$) test assets, and we actually want to test if all the $\alpha_i$ (for $i = 1, 2, \ldots, n$) are zero. Ideally we then want to take into account the correlation of the different alphas.

While it is straightforward to construct such a test, it is also a bit messy. As a quick way out, the following will work fairly well. First, test each asset individually. Second, form a few different portfolios of the test assets (equally weighted, value weighted) and test these portfolios. Although this does not deliver one single test statistic, it provides plenty of information to base a judgement on. For a more formal approach, see Section 3.1.4.

A quite different approach to study a cross-section of assets is to first perform a CAPM regression (3.2) and then the following cross-sectional regression

$\bar{R}_i^e = \gamma + \lambda \hat{\beta}_i + u_i,$  (3.7)

where $\bar{R}_i^e$ is the (sample) average excess return on asset $i$. Notice that the estimated betas are used as regressors and that there are as many data points as there are assets ($n$).

There are severe econometric problems with this regression equation, since the regressor contains measurement errors (it is only an uncertain estimate), which typically tend to bias the slope coefficient ($\lambda$) towards zero. To get the intuition for this bias, consider an extremely noisy measurement of the regressor: it would be virtually uncorrelated with the dependent variable (noise isn't correlated with anything), so the estimated slope coefficient would be close to zero.

If we could overcome this bias (and we can, by being careful), then the testable implications of CAPM are that $\gamma = 0$ and that $\lambda$ equals the average market excess return. We also want (3.7) to have a high $R^2$, since it should be unity in a very large sample (if CAPM holds).

3.1.4 Several Assets: SURE Approach

This section outlines how we can set up a formal test of CAPM when there are several test assets. For simplicity, suppose we have two test assets. Stack (3.2) for the two equations

$R_{1t}^e = \alpha_1 + b_1 R_{mt}^e + \varepsilon_{1t},$  (3.8)
$R_{2t}^e = \alpha_2 + b_2 R_{mt}^e + \varepsilon_{2t},$  (3.9)

where $\text{E}\,\varepsilon_{it} = 0$ and $\text{Cov}(R_{mt}^e, \varepsilon_{it}) = 0$. This is a system of seemingly unrelated regressions (SURE), with the same regressor (see, for instance, Greene (2003) 14). In this case, the efficient estimator (GLS) is LS on each equation separately. Moreover, the covariance matrix of the coefficients is particularly simple.

To see what the covariances of the coefficients are, write the regression equation for asset 1, (3.8), on a traditional form

$R_{1t}^e = x_t'\beta_1 + \varepsilon_{1t}$, where $x_t = \begin{bmatrix} 1 \\ R_{mt}^e \end{bmatrix}$, $\beta_1 = \begin{bmatrix} \alpha_1 \\ b_1 \end{bmatrix},$  (3.10)

and similarly for the second asset (and any further assets). Define

$\hat{\Sigma}_{xx} = \sum_{t=1}^{T} x_t x_t' / T$, and $\hat{\sigma}_{ij} = \sum_{t=1}^{T} \hat{\varepsilon}_{it} \hat{\varepsilon}_{jt} / T,$  (3.11)

where $\hat{\varepsilon}_{it}$ is the fitted residual of asset $i$. The key result is then that the (estimated) asymptotic covariance matrix of the vectors $\hat{\beta}_i$ and $\hat{\beta}_j$ (for assets $i$ and $j$) is

(estimated) Asy. $\text{Cov}(\hat{\beta}_i, \hat{\beta}_j) = \hat{\sigma}_{ij} \hat{\Sigma}_{xx}^{-1} / T.$  (3.12)

(In many textbooks, this is written as $\hat{\sigma}_{ij}(X'X)^{-1}$.)

The null hypothesis in our two-asset case is

$H_0: \alpha_1 = 0$ and $\alpha_2 = 0.$  (3.13)

In a large sample, the estimator is normally distributed (this follows from the fact that the LS estimator is a form of sample average, so we can apply a central limit theorem). Therefore, under the null hypothesis we have the following result. Let $A$ be the upper left element of $\hat{\Sigma}_{xx}^{-1} / T$. Then

$\begin{bmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \end{bmatrix} \sim N\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{11}A & \sigma_{12}A \\ \sigma_{12}A & \sigma_{22}A \end{bmatrix} \right)$ (asymptotically).  (3.14)

In practice we use the sample moments for the covariance matrix. Notice that the zero means in (3.14) come from the null hypothesis: the distribution is (as usual) constructed by pretending that the null hypothesis is true.
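This covariance matrix is straightforward to compute, and it leads directly to the chi-square statistic formalized in (3.15) below. A sketch for $n$ assets (Python with numpy/scipy assumed; the simulated data and parameter values are made up):

```python
import numpy as np
from scipy.stats import chi2

def joint_alpha_test(Re, Rm):
    """Re: T x n matrix of excess returns, Rm: T-vector of excess market returns.
    Returns the chi2(n) statistic of H0: all alphas are zero, built from (3.12)."""
    T, n = Re.shape
    X = np.column_stack([np.ones(T), Rm])
    B = np.linalg.solve(X.T @ X, X.T @ Re)        # 2 x n: columns are (alpha_i, b_i)
    E = Re - X @ B                                # fitted residuals
    Sigma = E.T @ E / T                           # sigma_ij as in (3.11)
    A = np.linalg.inv(X.T @ X / T)[0, 0] / T      # upper-left element of Sxx^{-1}/T
    alphas = B[0]
    Omega = Sigma * A                             # covariance of the alphas, as in (3.14)
    stat = alphas @ np.linalg.solve(Omega, alphas)
    return stat, 1 - chi2.cdf(stat, df=n)

# Tiny simulated check with two assets and zero true alphas
rng = np.random.default_rng(7)
T = 600
Rm = 0.005 + 0.045 * rng.normal(size=T)
Re = np.outer(Rm, [0.9, 1.1]) + 0.05 * rng.normal(size=(T, 2))
print(joint_alpha_test(Re, Rm))                   # stat ~ chi2(2) under H0
```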

We can now construct a chi-square test by using the following fact.

Remark 21 If the $n \times 1$ vector $y \sim N(0, \Omega)$, then $y' \Omega^{-1} y \sim \chi_n^2$.

To apply this, let $\Omega$ be the covariance matrix in (3.14) and form the test statistic

$\begin{bmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \end{bmatrix}' \Omega^{-1} \begin{bmatrix} \hat{\alpha}_1 \\ \hat{\alpha}_2 \end{bmatrix} \sim \chi_2^2.$  (3.15)

This can also be transformed into an F test, which might have better small sample properties.

3.1.5 Representative Results of the CAPM Test

One of the more interesting studies is Fama and French (1993) (see also Fama and French (1996)). They construct 25 stock portfolios according to two characteristics of the firm: the size (by market capitalization) and the book-value-to-market-value ratio (BE/ME). In June each year, they sort the stocks according to size and BE/ME. They then form a 5 x 5 matrix of portfolios, where portfolio $ij$ belongs to the $i$th size quintile and the $j$th BE/ME quintile.

They run a traditional CAPM regression on each of the 25 portfolios (monthly data) and then study if the expected excess returns are related to the betas as they should be according to CAPM (recall that CAPM implies $\text{E}\,R_{it}^e = \beta_i \lambda$, where $\lambda$ is the risk premium (excess return) on the market portfolio).

However, it is found that there is almost no relation between $\text{E}\,R_{it}^e$ and $\beta_i$ (there is a "cloud" in the $\beta_i$ versus $\text{E}\,R_{it}^e$ space, see Cochrane (2001) 20.2, Figure 20.9). This is due to the combination of two features of the data. First, within a BE/ME quintile, there is a positive relation (across size quintiles) between $\text{E}\,R_{it}^e$ and $\beta_i$, as predicted by CAPM (see Cochrane (2001) 20.2, Figure 20.10). Second, within a size quintile there is a negative relation (across BE/ME quintiles) between $\text{E}\,R_{it}^e$ and $\beta_i$, in stark contrast to CAPM (see Cochrane (2001) 20.2, Figure 20.11).

Figure 3.3: Fama-French regressions on US industry indices. (Factors: US market, SMB (size), and HML (book-to-market); a table of alphas, Wald statistics and p-values per industry portfolio; the test statistics use Newey-West standard errors and are distributed as chi-square(n), where n is the number of test assets.)

3.2 Several Factors

In multifactor models, (3.2) is still valid provided we reinterpret $b_i$ and $R_{mt}^e$ as vectors, so $b_i' R_{mt}^e$ stands for

$b_{io} R_{ot}^e + b_{ip} R_{pt}^e + \ldots$  (3.16)

In this case, (3.2) is a multiple regression, but the test (3.4) still has the same form. Figure 3.3 shows some results for the Fama-French model on US industry portfolios.

3.3 Fama-MacBeth

Reference: Cochrane (2001) 12.3; Campbell, Lo, and MacKinlay (1997) 5.8; Fama and MacBeth (1973)

The Fama and MacBeth (1973) approach is a bit different from the regression approaches discussed so far. The method has three steps, described below.

First, estimate the betas $\beta_i$ ($i = 1, \ldots, n$) from (3.2) (this is a time-series regression). This is often done on the whole sample, assuming the betas are constant.
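Only the first of the three steps is stated above; the remaining two in the sketch below follow the standard description in Fama and MacBeth (1973) and Cochrane (2001) 12.3: period-by-period cross-sectional regressions of returns on the estimated betas, and then averaging the cross-sectional coefficients, using their time variation for standard errors. A minimal sketch (Python with numpy; all simulated numbers are made up):

```python
import numpy as np

def fama_macbeth(Re, Rm):
    """Re: T x n excess returns, Rm: T-vector of market excess returns."""
    T, n = Re.shape
    X = np.column_stack([np.ones(T), Rm])
    betas = np.linalg.solve(X.T @ X, X.T @ Re)[1]   # step 1: one time-series beta per asset

    # Step 2: cross-sectional regression Re_t = gamma + lambda*beta + u, for each period t
    Z = np.column_stack([np.ones(n), betas])
    gl = np.array([np.linalg.lstsq(Z, Re[t], rcond=None)[0] for t in range(T)])

    # Step 3: average (gamma_t, lambda_t) over time; use their time variation for std errors
    means = gl.mean(axis=0)
    stds = gl.std(axis=0) / np.sqrt(T)
    return means, stds

rng = np.random.default_rng(8)
T, n = 600, 10
Rm = 0.005 + 0.045 * rng.normal(size=T)
b = 0.5 + rng.random(n)                             # made-up betas between 0.5 and 1.5
Re = np.outer(Rm, b) + 0.05 * rng.normal(size=(T, n))
print(fama_macbeth(Re, Rm))   # gamma near 0, lambda near the mean excess market return
```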


Problem Set 6. I did this with figure; bar3(reshape(mean(rx),5,5) );ylabel( size ); xlabel( value ); mean mo return % Business 35905 John H. Cochrane Problem Set 6 We re going to replicate and extend Fama and French s basic results, using earlier and extended data. Get the 25 Fama French portfolios and factors from the

More information

Stochastic Models. Statistics. Walt Pohl. February 28, Department of Business Administration

Stochastic Models. Statistics. Walt Pohl. February 28, Department of Business Administration Stochastic Models Statistics Walt Pohl Universität Zürich Department of Business Administration February 28, 2013 The Value of Statistics Business people tend to underestimate the value of statistics.

More information

Risk and Return and Portfolio Theory

Risk and Return and Portfolio Theory Risk and Return and Portfolio Theory Intro: Last week we learned how to calculate cash flows, now we want to learn how to discount these cash flows. This will take the next several weeks. We know discount

More information

Logit Models for Binary Data

Logit Models for Binary Data Chapter 3 Logit Models for Binary Data We now turn our attention to regression models for dichotomous data, including logistic regression and probit analysis These models are appropriate when the response

More information

FE670 Algorithmic Trading Strategies. Stevens Institute of Technology

FE670 Algorithmic Trading Strategies. Stevens Institute of Technology FE670 Algorithmic Trading Strategies Lecture 4. Cross-Sectional Models and Trading Strategies Steve Yang Stevens Institute of Technology 09/26/2013 Outline 1 Cross-Sectional Methods for Evaluation of Factor

More information

FINANCIAL ECONOMETRICS AND EMPIRICAL FINANCE MODULE 2

FINANCIAL ECONOMETRICS AND EMPIRICAL FINANCE MODULE 2 MSc. Finance/CLEFIN 2017/2018 Edition FINANCIAL ECONOMETRICS AND EMPIRICAL FINANCE MODULE 2 Midterm Exam Solutions June 2018 Time Allowed: 1 hour and 15 minutes Please answer all the questions by writing

More information

Lecture 6: Non Normal Distributions

Lecture 6: Non Normal Distributions Lecture 6: Non Normal Distributions and their Uses in GARCH Modelling Prof. Massimo Guidolin 20192 Financial Econometrics Spring 2015 Overview Non-normalities in (standardized) residuals from asset return

More information

Empirical Evidence. r Mt r ft e i. now do second-pass regression (cross-sectional with N 100): r i r f γ 0 γ 1 b i u i

Empirical Evidence. r Mt r ft e i. now do second-pass regression (cross-sectional with N 100): r i r f γ 0 γ 1 b i u i Empirical Evidence (Text reference: Chapter 10) Tests of single factor CAPM/APT Roll s critique Tests of multifactor CAPM/APT The debate over anomalies Time varying volatility The equity premium puzzle

More information

MEASURING PORTFOLIO RISKS USING CONDITIONAL COPULA-AR-GARCH MODEL

MEASURING PORTFOLIO RISKS USING CONDITIONAL COPULA-AR-GARCH MODEL MEASURING PORTFOLIO RISKS USING CONDITIONAL COPULA-AR-GARCH MODEL Isariya Suttakulpiboon MSc in Risk Management and Insurance Georgia State University, 30303 Atlanta, Georgia Email: suttakul.i@gmail.com,

More information

where T = number of time series observations on returns; 4; (2,,~?~.

where T = number of time series observations on returns; 4; (2,,~?~. Given the normality assumption, the null hypothesis in (3) can be tested using "Hotelling's T2 test," a multivariate generalization of the univariate t-test (e.g., see alinvaud (1980, page 230)). A brief

More information

Optimal Portfolio Inputs: Various Methods

Optimal Portfolio Inputs: Various Methods Optimal Portfolio Inputs: Various Methods Prepared by Kevin Pei for The Fund @ Sprott Abstract: In this document, I will model and back test our portfolio with various proposed models. It goes without

More information

The Constant Expected Return Model

The Constant Expected Return Model Chapter 1 The Constant Expected Return Model Date: February 5, 2015 The first model of asset returns we consider is the very simple constant expected return (CER) model. This model is motivated by the

More information

Point Estimation. Stat 4570/5570 Material from Devore s book (Ed 8), and Cengage

Point Estimation. Stat 4570/5570 Material from Devore s book (Ed 8), and Cengage 6 Point Estimation Stat 4570/5570 Material from Devore s book (Ed 8), and Cengage Point Estimation Statistical inference: directed toward conclusions about one or more parameters. We will use the generic

More information

Bayesian Linear Model: Gory Details

Bayesian Linear Model: Gory Details Bayesian Linear Model: Gory Details Pubh7440 Notes By Sudipto Banerjee Let y y i ] n i be an n vector of independent observations on a dependent variable (or response) from n experimental units. Associated

More information

Addendum. Multifactor models and their consistency with the ICAPM

Addendum. Multifactor models and their consistency with the ICAPM Addendum Multifactor models and their consistency with the ICAPM Paulo Maio 1 Pedro Santa-Clara This version: February 01 1 Hanken School of Economics. E-mail: paulofmaio@gmail.com. Nova School of Business

More information

Example 1 of econometric analysis: the Market Model

Example 1 of econometric analysis: the Market Model Example 1 of econometric analysis: the Market Model IGIDR, Bombay 14 November, 2008 The Market Model Investors want an equation predicting the return from investing in alternative securities. Return is

More information

1. You are given the following information about a stationary AR(2) model:

1. You are given the following information about a stationary AR(2) model: Fall 2003 Society of Actuaries **BEGINNING OF EXAMINATION** 1. You are given the following information about a stationary AR(2) model: (i) ρ 1 = 05. (ii) ρ 2 = 01. Determine φ 2. (A) 0.2 (B) 0.1 (C) 0.4

More information

Dissertation on. Linear Asset Pricing Models. Na Wang

Dissertation on. Linear Asset Pricing Models. Na Wang Dissertation on Linear Asset Pricing Models by Na Wang A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy Approved April 0 by the Graduate Supervisory

More information

John Hull, Risk Management and Financial Institutions, 4th Edition

John Hull, Risk Management and Financial Institutions, 4th Edition P1.T2. Quantitative Analysis John Hull, Risk Management and Financial Institutions, 4th Edition Bionic Turtle FRM Video Tutorials By David Harper, CFA FRM 1 Chapter 10: Volatility (Learning objectives)

More information

Financial Risk Forecasting Chapter 9 Extreme Value Theory

Financial Risk Forecasting Chapter 9 Extreme Value Theory Financial Risk Forecasting Chapter 9 Extreme Value Theory Jon Danielsson 2017 London School of Economics To accompany Financial Risk Forecasting www.financialriskforecasting.com Published by Wiley 2011

More information

Quantitative Risk Management

Quantitative Risk Management Quantitative Risk Management Asset Allocation and Risk Management Martin B. Haugh Department of Industrial Engineering and Operations Research Columbia University Outline Review of Mean-Variance Analysis

More information

Linda Allen, Jacob Boudoukh and Anthony Saunders, Understanding Market, Credit and Operational Risk: The Value at Risk Approach

Linda Allen, Jacob Boudoukh and Anthony Saunders, Understanding Market, Credit and Operational Risk: The Value at Risk Approach P1.T4. Valuation & Risk Models Linda Allen, Jacob Boudoukh and Anthony Saunders, Understanding Market, Credit and Operational Risk: The Value at Risk Approach Bionic Turtle FRM Study Notes Reading 26 By

More information

The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2009, Mr. Ruey S. Tsay. Solutions to Final Exam

The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2009, Mr. Ruey S. Tsay. Solutions to Final Exam The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2009, Mr. Ruey S. Tsay Solutions to Final Exam Problem A: (42 pts) Answer briefly the following questions. 1. Questions

More information

APPLYING MULTIVARIATE

APPLYING MULTIVARIATE Swiss Society for Financial Market Research (pp. 201 211) MOMTCHIL POJARLIEV AND WOLFGANG POLASEK APPLYING MULTIVARIATE TIME SERIES FORECASTS FOR ACTIVE PORTFOLIO MANAGEMENT Momtchil Pojarliev, INVESCO

More information

University of California Berkeley

University of California Berkeley University of California Berkeley A Comment on The Cross-Section of Volatility and Expected Returns : The Statistical Significance of FVIX is Driven by a Single Outlier Robert M. Anderson Stephen W. Bianchi

More information

Lecture 5a: ARCH Models

Lecture 5a: ARCH Models Lecture 5a: ARCH Models 1 2 Big Picture 1. We use ARMA model for the conditional mean 2. We use ARCH model for the conditional variance 3. ARMA and ARCH model can be used together to describe both conditional

More information

I. Return Calculations (20 pts, 4 points each)

I. Return Calculations (20 pts, 4 points each) University of Washington Winter 015 Department of Economics Eric Zivot Econ 44 Midterm Exam Solutions This is a closed book and closed note exam. However, you are allowed one page of notes (8.5 by 11 or

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maximum Likelihood Estimation EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #6 EPSY 905: Maximum Likelihood In This Lecture The basics of maximum likelihood estimation Ø The engine that

More information

Market Risk Analysis Volume I

Market Risk Analysis Volume I Market Risk Analysis Volume I Quantitative Methods in Finance Carol Alexander John Wiley & Sons, Ltd List of Figures List of Tables List of Examples Foreword Preface to Volume I xiii xvi xvii xix xxiii

More information

Section 0: Introduction and Review of Basic Concepts

Section 0: Introduction and Review of Basic Concepts Section 0: Introduction and Review of Basic Concepts Carlos M. Carvalho The University of Texas McCombs School of Business mccombs.utexas.edu/faculty/carlos.carvalho/teaching 1 Getting Started Syllabus

More information

Review: Population, sample, and sampling distributions

Review: Population, sample, and sampling distributions Review: Population, sample, and sampling distributions A population with mean µ and standard deviation σ For instance, µ = 0, σ = 1 0 1 Sample 1, N=30 Sample 2, N=30 Sample 100000000000 InterquartileRange

More information

Introduction to Algorithmic Trading Strategies Lecture 9

Introduction to Algorithmic Trading Strategies Lecture 9 Introduction to Algorithmic Trading Strategies Lecture 9 Quantitative Equity Portfolio Management Haksun Li haksun.li@numericalmethod.com www.numericalmethod.com Outline Alpha Factor Models References

More information

Intro to GLM Day 2: GLM and Maximum Likelihood

Intro to GLM Day 2: GLM and Maximum Likelihood Intro to GLM Day 2: GLM and Maximum Likelihood Federico Vegetti Central European University ECPR Summer School in Methods and Techniques 1 / 32 Generalized Linear Modeling 3 steps of GLM 1. Specify the

More information

Empirical Test of Affine Stochastic Discount Factor Model of Currency Pricing. Abstract

Empirical Test of Affine Stochastic Discount Factor Model of Currency Pricing. Abstract Empirical Test of Affine Stochastic Discount Factor Model of Currency Pricing Alex Lebedinsky Western Kentucky University Abstract In this note, I conduct an empirical investigation of the affine stochastic

More information

Conditional Heteroscedasticity

Conditional Heteroscedasticity 1 Conditional Heteroscedasticity May 30, 2010 Junhui Qian 1 Introduction ARMA(p,q) models dictate that the conditional mean of a time series depends on past observations of the time series and the past

More information

Section B: Risk Measures. Value-at-Risk, Jorion

Section B: Risk Measures. Value-at-Risk, Jorion Section B: Risk Measures Value-at-Risk, Jorion One thing to always keep in mind when reading this text is that it is focused on the banking industry. It mainly focuses on market and credit risk. It also

More information

IEOR E4703: Monte-Carlo Simulation

IEOR E4703: Monte-Carlo Simulation IEOR E4703: Monte-Carlo Simulation Simulation Efficiency and an Introduction to Variance Reduction Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University

More information

Lecture 1: The Econometrics of Financial Returns

Lecture 1: The Econometrics of Financial Returns Lecture 1: The Econometrics of Financial Returns Prof. Massimo Guidolin 20192 Financial Econometrics Winter/Spring 2016 Overview General goals of the course and definition of risk(s) Predicting asset returns:

More information

Analyzing Oil Futures with a Dynamic Nelson-Siegel Model

Analyzing Oil Futures with a Dynamic Nelson-Siegel Model Analyzing Oil Futures with a Dynamic Nelson-Siegel Model NIELS STRANGE HANSEN & ASGER LUNDE DEPARTMENT OF ECONOMICS AND BUSINESS, BUSINESS AND SOCIAL SCIENCES, AARHUS UNIVERSITY AND CENTER FOR RESEARCH

More information

This homework assignment uses the material on pages ( A moving average ).

This homework assignment uses the material on pages ( A moving average ). Module 2: Time series concepts HW Homework assignment: equally weighted moving average This homework assignment uses the material on pages 14-15 ( A moving average ). 2 Let Y t = 1/5 ( t + t-1 + t-2 +

More information

Introduction to Computational Finance and Financial Econometrics Descriptive Statistics

Introduction to Computational Finance and Financial Econometrics Descriptive Statistics You can t see this text! Introduction to Computational Finance and Financial Econometrics Descriptive Statistics Eric Zivot Summer 2015 Eric Zivot (Copyright 2015) Descriptive Statistics 1 / 28 Outline

More information

Statistics for Business and Economics

Statistics for Business and Economics Statistics for Business and Economics Chapter 5 Continuous Random Variables and Probability Distributions Ch. 5-1 Probability Distributions Probability Distributions Ch. 4 Discrete Continuous Ch. 5 Probability

More information

Alternative VaR Models

Alternative VaR Models Alternative VaR Models Neil Roeth, Senior Risk Developer, TFG Financial Systems. 15 th July 2015 Abstract We describe a variety of VaR models in terms of their key attributes and differences, e.g., parametric

More information

6.041SC Probabilistic Systems Analysis and Applied Probability, Fall 2013 Transcript Lecture 23

6.041SC Probabilistic Systems Analysis and Applied Probability, Fall 2013 Transcript Lecture 23 6.041SC Probabilistic Systems Analysis and Applied Probability, Fall 2013 Transcript Lecture 23 The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare

More information

High-Frequency Data Analysis and Market Microstructure [Tsay (2005), chapter 5]

High-Frequency Data Analysis and Market Microstructure [Tsay (2005), chapter 5] 1 High-Frequency Data Analysis and Market Microstructure [Tsay (2005), chapter 5] High-frequency data have some unique characteristics that do not appear in lower frequencies. At this class we have: Nonsynchronous

More information

Modelling financial data with stochastic processes

Modelling financial data with stochastic processes Modelling financial data with stochastic processes Vlad Ardelean, Fabian Tinkl 01.08.2012 Chair of statistics and econometrics FAU Erlangen-Nuremberg Outline Introduction Stochastic processes Volatility

More information

Assicurazioni Generali: An Option Pricing Case with NAGARCH

Assicurazioni Generali: An Option Pricing Case with NAGARCH Assicurazioni Generali: An Option Pricing Case with NAGARCH Assicurazioni Generali: Business Snapshot Find our latest analyses and trade ideas on bsic.it Assicurazioni Generali SpA is an Italy-based insurance

More information

A Non-Random Walk Down Wall Street

A Non-Random Walk Down Wall Street A Non-Random Walk Down Wall Street Andrew W. Lo A. Craig MacKinlay Princeton University Press Princeton, New Jersey list of Figures List of Tables Preface xiii xv xxi 1 Introduction 3 1.1 The Random Walk

More information

Department of Finance Working Paper Series

Department of Finance Working Paper Series NEW YORK UNIVERSITY LEONARD N. STERN SCHOOL OF BUSINESS Department of Finance Working Paper Series FIN-03-005 Does Mutual Fund Performance Vary over the Business Cycle? Anthony W. Lynch, Jessica Wachter

More information

Small Sample Performance of Instrumental Variables Probit Estimators: A Monte Carlo Investigation

Small Sample Performance of Instrumental Variables Probit Estimators: A Monte Carlo Investigation Small Sample Performance of Instrumental Variables Probit : A Monte Carlo Investigation July 31, 2008 LIML Newey Small Sample Performance? Goals Equations Regressors and Errors Parameters Reduced Form

More information

Financial Econometrics Lecture 5: Modelling Volatility and Correlation

Financial Econometrics Lecture 5: Modelling Volatility and Correlation Financial Econometrics Lecture 5: Modelling Volatility and Correlation Dayong Zhang Research Institute of Economics and Management Autumn, 2011 Learning Outcomes Discuss the special features of financial

More information

Linear Regression with One Regressor

Linear Regression with One Regressor Linear Regression with One Regressor Michael Ash Lecture 9 Linear Regression with One Regressor Review of Last Time 1. The Linear Regression Model The relationship between independent X and dependent Y

More information

Variance clustering. Two motivations, volatility clustering, and implied volatility

Variance clustering. Two motivations, volatility clustering, and implied volatility Variance modelling The simplest assumption for time series is that variance is constant. Unfortunately that assumption is often violated in actual data. In this lecture we look at the implications of time

More information

Lecture 5 Theory of Finance 1

Lecture 5 Theory of Finance 1 Lecture 5 Theory of Finance 1 Simon Hubbert s.hubbert@bbk.ac.uk January 24, 2007 1 Introduction In the previous lecture we derived the famous Capital Asset Pricing Model (CAPM) for expected asset returns,

More information

Simulation Wrap-up, Statistics COS 323

Simulation Wrap-up, Statistics COS 323 Simulation Wrap-up, Statistics COS 323 Today Simulation Re-cap Statistics Variance and confidence intervals for simulations Simulation wrap-up FYI: No class or office hours Thursday Simulation wrap-up

More information

Multiple regression - a brief introduction

Multiple regression - a brief introduction Multiple regression - a brief introduction Multiple regression is an extension to regular (simple) regression. Instead of one X, we now have several. Suppose, for example, that you are trying to predict

More information

SDMR Finance (2) Olivier Brandouy. University of Paris 1, Panthéon-Sorbonne, IAE (Sorbonne Graduate Business School)

SDMR Finance (2) Olivier Brandouy. University of Paris 1, Panthéon-Sorbonne, IAE (Sorbonne Graduate Business School) SDMR Finance (2) Olivier Brandouy University of Paris 1, Panthéon-Sorbonne, IAE (Sorbonne Graduate Business School) Outline 1 Formal Approach to QAM : concepts and notations 2 3 Portfolio risk and return

More information

Corporate Finance, Module 21: Option Valuation. Practice Problems. (The attached PDF file has better formatting.) Updated: July 7, 2005

Corporate Finance, Module 21: Option Valuation. Practice Problems. (The attached PDF file has better formatting.) Updated: July 7, 2005 Corporate Finance, Module 21: Option Valuation Practice Problems (The attached PDF file has better formatting.) Updated: July 7, 2005 {This posting has more information than is needed for the corporate

More information

APPENDIX TO LECTURE NOTES ON ASSET PRICING AND PORTFOLIO MANAGEMENT. Professor B. Espen Eckbo

APPENDIX TO LECTURE NOTES ON ASSET PRICING AND PORTFOLIO MANAGEMENT. Professor B. Espen Eckbo APPENDIX TO LECTURE NOTES ON ASSET PRICING AND PORTFOLIO MANAGEMENT 2011 Professor B. Espen Eckbo 1. Portfolio analysis in Excel spreadsheet 2. Formula sheet 3. List of Additional Academic Articles 2011

More information

The Delta Method. j =.

The Delta Method. j =. The Delta Method Often one has one or more MLEs ( 3 and their estimated, conditional sampling variancecovariance matrix. However, there is interest in some function of these estimates. The question is,

More information

Portfolio Risk Management and Linear Factor Models

Portfolio Risk Management and Linear Factor Models Chapter 9 Portfolio Risk Management and Linear Factor Models 9.1 Portfolio Risk Measures There are many quantities introduced over the years to measure the level of risk that a portfolio carries, and each

More information

Macroeconometric Modeling: 2018

Macroeconometric Modeling: 2018 Macroeconometric Modeling: 2018 Contents Ray C. Fair 2018 1 Macroeconomic Methodology 4 1.1 The Cowles Commission Approach................. 4 1.2 Macroeconomic Methodology.................... 5 1.3 The

More information

Chapter 5 Univariate time-series analysis. () Chapter 5 Univariate time-series analysis 1 / 59

Chapter 5 Univariate time-series analysis. () Chapter 5 Univariate time-series analysis 1 / 59 Chapter 5 Univariate time-series analysis () Chapter 5 Univariate time-series analysis 1 / 59 Time-Series Time-series is a sequence fx 1, x 2,..., x T g or fx t g, t = 1,..., T, where t is an index denoting

More information

Interval estimation. September 29, Outline Basic ideas Sampling variation and CLT Interval estimation using X More general problems

Interval estimation. September 29, Outline Basic ideas Sampling variation and CLT Interval estimation using X More general problems Interval estimation September 29, 2017 STAT 151 Class 7 Slide 1 Outline of Topics 1 Basic ideas 2 Sampling variation and CLT 3 Interval estimation using X 4 More general problems STAT 151 Class 7 Slide

More information

University of New South Wales Semester 1, Economics 4201 and Homework #2 Due on Tuesday 3/29 (20% penalty per day late)

University of New South Wales Semester 1, Economics 4201 and Homework #2 Due on Tuesday 3/29 (20% penalty per day late) University of New South Wales Semester 1, 2011 School of Economics James Morley 1. Autoregressive Processes (15 points) Economics 4201 and 6203 Homework #2 Due on Tuesday 3/29 (20 penalty per day late)

More information

SYSM 6304 Risk and Decision Analysis Lecture 2: Fitting Distributions to Data

SYSM 6304 Risk and Decision Analysis Lecture 2: Fitting Distributions to Data SYSM 6304 Risk and Decision Analysis Lecture 2: Fitting Distributions to Data M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu September 5, 2015

More information