FINANCIAL ECONOMETRICS PROF. MASSIMO GUIDOLIN


Massimo Guidolin (Massimo.Guidolin@unibocconi.it), Dept. of Finance
FINANCIAL ECONOMETRICS PROF. MASSIMO GUIDOLIN
SECOND PART, LECTURE 1: VOLATILITY MODELS ARCH AND GARCH

OVERVIEW
1) Stepwise Distribution Modeling Approach
2) Three Key Facts to Remember
3) Volatility Clustering in the Data
4) Naïve Variance Forecast Models: Rolling Window Variance Estimation and the RiskMetrics System
5) ARCH Models
6) GARCH Models; Comparisons with RiskMetrics
7) Leverage Effects and Component GARCH
8) Estimation of GARCH Models
9) Variance Model Evaluation

OVERVIEW AND GENERAL IDEAS
Financial economists are concerned with modeling volatility in (individual) asset and portfolio returns. This is important because volatility is considered a measure of risk, and investors want a premium for investing in risky assets. As you know, banks and other financial institutions apply so-called value-at-risk (VaR) models to assess their risks. Modeling and forecasting volatility or, in other words, the covariance structure of asset returns, is therefore important.
We will proceed in three steps, following a stepwise distribution modeling (SDM) approach:
1. Establish a variance forecasting model for each of the assets individually and introduce methods for evaluating the performance of these forecasts
2. Consider ways to model conditionally non-normal aspects of the assets in our portfolio, i.e., aspects that are not captured by conditional means, variances, and covariances
3. Link individual variance forecasts with a correlation model

STEP-WISE DISTRIBUTION MODELING APPROACH
The variance and correlation models together will yield a time-varying covariance model, which can be used to calculate the variance of an aggregate portfolio of assets.
The idea that second moments vary over time has an even deeper importance. While most classical finance is built on the assumption that both asset returns and their underlying fundamentals are IID Normal over time, casual inspection of GDP, financial aggregates, interest rates, exchange rates, etc. reveals that these series display time-varying means, variances, and covariances.

THREE KEY RESULTS TO BEAR IN MIND
Time-varying means: Carlo Favero's class, i.e., the past 4 weeks; time-varying variances and covariances: NOW, HERE, i.e., the following 8 weeks (of classes).
What is IID? Independently and identically distributed.
Fundamentals = quantities that justify asset prices in a rational framework, e.g., dividends for stocks, short-term rates for long-term rates, macroeconomic and fiscal policies for exchange rates, etc.
When variances and covariances are time-varying we speak of conditional HETEROSKEDASTICITY.
3 simple facts to remember and understand:
1. The fact that the conditional variance may change in heteroskedastic fashion does not necessarily mean the series is non-stationary.

THREE KEY RESULTS TO BEAR IN MIND
Even though the variance may go through high and low periods, the unconditional (long-run, steady-state, average) variance may exist and actually be constant.
2. Conditional heteroskedasticity implies that the unconditional, long-run distribution of asset returns will be non-normal.
3. There are many models of conditional heteroskedasticity, but in the end we care about their forecasting performance.
For instance, consider the (dividend-corrected) realized returns on a value-weighted index (by CRSP) of NYSE, AMEX, and NASDAQ stocks. This is not the usual kind of data series you will face in this class, but it is closer to those in the readings/textbooks.

VOLATILITY CLUSTERING IN THE DATA
Sample period is 1972:01-2009:12, monthly data.
[Figure: Value-Weighted NYSE/AMEX/NASDAQ Returns, 1972-2008, alternating between quiet periods and episodes of turbulence]
Volatility clusters: high (low) volatility tends to be followed by high (low) volatility.
Our objective is to develop models that can fit the sequence of calm and turbulent periods and especially forecast them.
Notice: value-weighted NYSE/AMEX/NASDAQ returns are portfolio returns!

VOLATILITY CLUSTERING & SERIAL CORRELATION IN SQUARES
As you have seen in the past 5 weeks, there is very weak serial correlation in asset returns. This lack of correlation means that, given yesterday's return, today's return is equally likely to be positive or negative.
The autocorrelation estimates from a standard autocorrelogram can be used to test the hypothesis that the process generating observed returns is a series of independent and identically distributed (IID) variables. The asymptotic standard error of an autocorrelation estimate is approximately 1/T^(1/2), where T is the sample size.
The IID hypothesis can be tested using the portmanteau Q-statistic of Box and Pierce (1970), calculated from the first k autocorrelations as:
Q_k = T Σ_{j=1}^k ρ̂²_j

VOLATILITY CLUSTERING & SERIAL CORRELATION IN SQUARES
where the sample autocorrelation at lag j (for demeaned returns), ρ̂_j, is defined as:
ρ̂_j = Σ_{t=j+1}^T R_t R_{t-j} / Σ_{t=1}^T R²_t
The asymptotic distribution of the Q_k statistic, under the null of an IID process, is chi-square with k degrees of freedom.
[Table: Q-statistics for VW CRSP Stock Returns and 10-Year U.S. Govt. Bond Returns]
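As an illustration, the Q-statistic takes a few lines of MATLAB. Below is a minimal sketch under assumed inputs: R holds demeaned returns (simulated here, not the CRSP data), and the p-value uses chi2cdf, which requires the Statistics Toolbox.

% Box-Pierce Q-statistic from the first k sample autocorrelations
R = randn(1000,1);                        % demo data: IID returns (assumption)
k = 10; T = length(R);
rho = zeros(k,1);
for j = 1:k
    rho(j) = sum(R(j+1:T).*R(1:T-j)) / sum(R.^2);   % sample autocorrelation at lag j
end
Q = T*sum(rho.^2);                        % Box-Pierce (1970) statistic
pval = 1 - chi2cdf(Q, k);                 % asymptotic chi-square(k) null distribution

Running the same lines on R.^2 instead of R reproduces the test on squared returns discussed next.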

VOLATILITY CLUSTERING & SERIAL CORRELATION IN SQUARES
Does this mean that stock and bond returns are (approximately) IID? Unfortunately not: it turns out that the squares and absolute values of stock and bond returns display high and significant autocorrelations. Of course, similar evidence applies to REITs and 1-month T-bills.
[Table: Q-statistics for squared VW CRSP Stock Returns and 10-Year Govt. Bond Returns]

VOLATILITY CLUSTERING & SERIAL CORRELATION IN SQUARES
The high dependence in series of absolute returns proves that the returns process is not made up of IID random variables: large squared returns are more likely to be followed by large squared returns than small squared returns are. But this result alone cannot be used to predict the direction of price changes.
How can we explain this phenomenon? If changes in price volatility create clusters of high and low volatility, this may reflect changes in the flow of relevant information to the market. These stylized facts can be explained by assuming that volatility follows a stochastic process in which today's volatility is positively correlated with the volatility on any future day. This is what ARCH and GARCH models are for.

NAÏVE MODELS OF VARIANCE FORECASTING
Consider the simple model for one asset (or portfolio) return:
R_{t+1} = σ_{t+1} z_{t+1},  z_{t+1} ~ IID N(0,1)
Here R_{t+1} is the continuously compounded return and z_{t+1} is a pure shock to returns, z_{t+1} = R_{t+1}/σ_{t+1}.
The model assumes (as in Christoffersen) that the mean µ = 0. This is an acceptable approximation on daily data; absent this assumption, the model is R_{t+1} = µ + σ_{t+1} z_{t+1}. The assumption of normality will be discussed/removed in lecture 2, after the break (and the midterm).
The easiest way to capture volatility clustering is by letting tomorrow's variance be the simple average of the most recent m squared observations (constant weighting), as in
σ²_{t+1} = (1/m) Σ_{τ=1}^m R²_{t+1-τ}
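In MATLAB, this is just a moving average of squared returns. A minimal sketch, assuming demo returns rather than real data:

% Rolling-window variance forecast: equal weight 1/m on the past m squared returns
R = randn(1000,1);                 % demo data (assumption: zero-mean returns)
m = 25; T = length(R);
sigma2 = nan(T,1);
for t = m:T-1
    sigma2(t+1) = mean(R(t-m+1:t).^2);   % forecast for day t+1, made at the end of day t
end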

NAÏVE MODELS: ROLLING WINDOW FORECASTS
This is often called a rolling window variance forecast model. However, the fact that the model puts equal weights (equal to 1/m) on the past m observations yields unwarranted results: when plotted over time, variance exhibits box-shaped patterns. An extreme return (positive or negative) today will bump up variance by 1/m times the squared return for exactly m periods, after which variance immediately drops back down.
The autocorrelation plot of squared returns suggests that a more gradual decline is warranted in the effect of past returns on today's variance. Also: how shall we pick m?

NAÏVE MODELS: RISKMETRICS
A high m will lead to an excessively smoothly evolving σ_{t+1}, and a low m will lead to an excessively jagged pattern of σ_{t+1}.
A more interesting model is JP Morgan's RiskMetrics system:
σ²_{t+1} = (1 - λ) Σ_{τ=1}^∞ λ^{τ-1} R²_{t+1-τ},  0 < λ < 1
The weights on past squared returns decline exponentially as we move backward in time: 1, λ, λ², ...; this is why it is also called the exponential variance smoother.
Because for τ = 1 we have λ⁰ = 1, it is possible to re-write it as:
σ²_{t+1} = (1 - λ)R²_t + λ(1 - λ) Σ_{τ=1}^∞ λ^{τ-1} R²_{t-τ}
which is equivalent to:
σ²_{t+1} = λσ²_t + (1 - λ)R²_t
See the lecture notes for why this is the case: a weighted average of today's variance and today's squared return.
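The recursive form makes the smoother trivial to implement. A minimal MATLAB sketch, assuming demo returns and initializing at the sample variance (an initialization choice, not part of the model):

% RiskMetrics exponential smoother, daily lambda = 0.94
lambda = 0.94;
R = randn(1000,1);                 % demo data (assumption)
T = length(R);
sigma2 = zeros(T,1);
sigma2(1) = mean(R.^2);            % initialize at the unconditional sample variance
for t = 1:T-1
    sigma2(t+1) = lambda*sigma2(t) + (1-lambda)*R(t)^2;  % weighted avg. of variance and squared return
end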

NAÏVE MODELS: RISKMETRICS
Key advantages of the RiskMetrics model:
- Recent returns matter more for tomorrow's variance than distant returns do, as λ is less than 1 and λ^{τ-1} therefore gets smaller when the lag τ gets bigger
- It only contains one unknown parameter, λ. When estimating λ on a large number of assets, RiskMetrics found that the estimates were quite similar across assets, and therefore simply set λ = 0.94 for every asset for daily data; in this case, no estimation is necessary
- Little data need to be stored in order to calculate tomorrow's variance; in fact, after including 100 lags of squared returns, the cumulated weight is already close to 100%. Of course, once σ²_t is calculated, past returns are not needed
Given all these advantages of the RiskMetrics model, why not simply end the discussion on variance forecasting here?


ARCH MODELS
The RiskMetrics model has a number of shortcomings, but these can be understood only after introducing ARCH models. Historically, ARCH models were the first-line alternative developed to compete with exponential smoothers.
In the zero-mean return case, their structure is very simple:
σ²_{t+1} = ω + αR²_t
This is an ARCH(1). However, it soon became obvious that just using one lag of past squared returns would not be sufficient: one needs to use a large number q > 1 of lags on the RHS. This means that squared returns are best modeled using an AR(q) instead of a simple AR(1).
Yet even ARCH(1) already implies one complication: it requires nonlinear parameter estimation.

GARCH MODELS
If you have paid some attention to what has happened in the last 5 weeks, you know where to look: ARMA models. The simplest generalized autoregressive conditional heteroskedasticity (GARCH(1,1)) model is:
σ²_{t+1} = ω + αR²_t + βσ²_t,  ω > 0, α > 0, β > 0
The implied, unconditional, or long-run average, variance, σ², is
σ² = ω/(1 - α - β)
This derives from the fact that
E[σ²_{t+1}] = ω + αE[R²_t] + βE[σ²_t]  ⟹  σ² = ω + (α + β)σ²
Furthermore, if one solves for ω from the long-run variance expression (ω = (1 - α - β)σ²) and substitutes it into the GARCH equation:
σ²_{t+1} = σ² + α(R²_t - σ²) + β(σ²_t - σ²)

GARCH MODELS
Therefore a GARCH(1,1) implies that tomorrow's variance is a weighted average of the long-run variance, today's squared return, and today's variance. Or, tomorrow's variance is predicted to be the long-run average variance with:
- something added (subtracted) if today's squared return is above (below) its long-run average, and
- something added (subtracted) if today's variance is above (below) its long-run average
How do you forecast variance in a GARCH model? The one-day forecast of variance, σ²_{t+1|t}, is given directly by the model as σ²_{t+1}. As for multi-period forecasts, one can show that:
E_t[σ²_{t+H}] = σ² + (α + β)^{H-1}(σ²_{t+1} - σ²)

GARCH MODELS: FORECASTING
This implies that as the forecast horizon H grows, because (α + β) < 1 implies (α + β)^{H-1} → 0, we get E_t[σ²_{t+H}] → σ². For shorter horizons instead, E_t[σ²_{t+H}] > σ² when σ²_{t+1} > σ², and vice-versa when σ²_{t+1} < σ².
The conditional expectation, E_t[·], refers to taking the expectation using all the information available at the end of day t, which includes the squared return on day t itself.
(α + β) plays a crucial role and is commonly called the persistence level/index of the model. A high persistence, (α + β) close to 1, implies that shocks which push variance away from its long-run average will persist for a long time; of course, eventually the long-horizon forecast will be the long-run average variance, σ².
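Putting the recursion and the forecast formula together, here is a minimal MATLAB sketch of a GARCH(1,1) variance filter plus its term structure of forecasts; the parameter values and demo returns are illustrative assumptions, not estimates:

% GARCH(1,1) variance filter and H-step-ahead forecasts
omega = 1e-6; alpha = 0.05; beta = 0.90;     % illustrative parameters (assumption)
R = 0.01*randn(1000,1);                      % demo daily returns (assumption)
T = length(R);
sigma2 = zeros(T+1,1);
sigma2(1) = omega/(1 - alpha - beta);        % start the filter at the long-run variance
for t = 1:T
    sigma2(t+1) = omega + alpha*R(t)^2 + beta*sigma2(t);
end
% Term structure: E_t[sigma2(t+H)] = sigma2bar + (alpha+beta)^(H-1)*(sigma2(t+1) - sigma2bar)
sigma2bar = omega/(1 - alpha - beta);
H = (1:250)';
fcst = sigma2bar + (alpha + beta).^(H-1).*(sigma2(end) - sigma2bar);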

GARCH MODELS: FORECASTING
In asset allocation problems, we sometimes care about the variance of long-horizon returns. As we assume that returns have zero autocorrelation (from their sample autocorrelations), the variance of the cumulative H-day return is:
σ²_{t+1:t+H} = Σ_{h=1}^H E_t[σ²_{t+h}]
Solving in the GARCH(1,1) case, we have:
σ²_{t+1:t+H} = Hσ² + (σ²_{t+1} - σ²)(1 - (α + β)^H)/(1 - (α + β))
You will see in a moment why establishing a difference w.r.t. Hσ²_{t+1} is so important.
Let's now compare GARCH(1,1) and RiskMetrics: are they so different? In a way they are not: comparing σ²_{t+1} = λσ²_t + (1 - λ)R²_t with σ²_{t+1} = ω + αR²_t + βσ²_t, you can see that RiskMetrics is just a special case of GARCH(1,1).

GARCH MODELS: COMPARISON WITH RISKMETRICS
RiskMetrics is the special case of GARCH(1,1) in which ω = 0, β = λ, and α = 1 - λ, so that α + β = 1. This has a number of implications:
Implication 1: because ω = 0 and α + β = 1, under RiskMetrics the long-run variance does not exist, as σ² = ω/(1 - α - β) gives an indeterminate ratio 0/0.
Therefore, while RiskMetrics ignores the fact that the long-run average variance tends to be relatively stable over time, a GARCH model with (α + β) < 1 does not. Equivalently, while a GARCH with (α + β) < 1 is a stationary process, a RiskMetrics model is not.
Implication 2: because under RiskMetrics α + β = 1, we have (α + β)^{H-1} = 1 for every horizon H, which means that any shock to current variance is destined to

GARCH MODELS: COMPARISON WITH RISKMETRICS
persist forever: if today is a high-variance day, then the RiskMetrics model predicts that all future days will be high-variance. A GARCH model more realistically assumes that eventually, in the future, variance will revert to its average value.
Implication 3: under RiskMetrics, the variance of long-horizon returns is simply
σ²_{t+1:t+H} = Hσ²_{t+1}
What is the density, the distribution of long-horizon returns implied by these models? Impossible to show in closed form; see the posted notes.
[Figure: simulated long-horizon return densities, GARCH(1,1) with α = 0.05, β = 0.90, σ² = 0.00014 vs. RiskMetrics]
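To visualize Implications 2 and 3, this sketch plots the term structure of variance forecasts for the two models, using the parameter values quoted in the figure; the high-variance starting point is an assumption for illustration:

% Mean-reverting GARCH forecasts vs. flat RiskMetrics forecasts
alpha = 0.05; beta = 0.90; sigma2bar = 0.00014;   % values quoted on the slide
sigma2_1 = 2*sigma2bar;                           % assumed high-variance starting point
H = (1:250)';
garch_fcst = sigma2bar + (alpha + beta).^(H-1).*(sigma2_1 - sigma2bar);  % reverts to sigma2bar
rm_fcst = sigma2_1*ones(size(H));                 % alpha + beta = 1: the shock never dies out
plot(H, garch_fcst, H, rm_fcst);
legend('GARCH(1,1)','RiskMetrics'); xlabel('horizon H (days)'); ylabel('E_t[\sigma^2_{t+H}]');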

GARCH MODELS WITH LEVERAGE
A number of empirical papers have emphasized that a negative return increases variance by more than a positive return of the same magnitude, the so-called leverage effect. This is because, in the case of stocks, a negative return implies a drop in equity value, so that the company becomes more highly levered and thus riskier (assuming the level of debt stays constant).
We can modify GARCH models in many ways so that the weight given to the return depends on whether the return is positive or negative. This is described by the (sample) news impact curve (NIC): the NIC measures how new information is incorporated into volatility, i.e., it shows the relationship between the current return R_t and the conditional variance one period ahead, σ²_{t+1}, holding constant all other past and current information.

GARCH MODELS WITH LEVERAGE
In a GARCH(1,1) model we have:
NIC(R_t | σ²_t = σ²) = ω + αR²_t + βσ² = A + αR²_t
which is a quadratic function of R_t and therefore symmetric around 0 (with intercept A ≡ ω + βσ²).
Problem: for most return series, the empirical NIC fails to be symmetric.
As in ARCH models, in GARCH models the negativity of parameters may create difficulties in estimation. Nelson (1991) has proposed a new form of GARCH, the Exponential GARCH (EGARCH), probably the most prominent asymmetric GARCH model, in which positivity of the conditional variance is ensured by the fact that ln(σ²_{t+1}) is directly modeled.

GARCH MODELS WITH LEVERAGE
Two types of EGARCH(1,1) are found in the applied literature; the first type is the one originally proposed by Nelson. Letting z_t ≡ R_t/σ_t, the log-conditional variance is:
ln(σ²_{t+1}) = ω + β ln(σ²_t) + g(z_t),  g(z_t) = θz_t + γ₁(|z_t| - E|z_t|)
The sequence g(z_t) is a zero-mean, i.i.d. random sequence:
- If z_t > 0, g(z_t) is linear in z_t with slope θ + γ₁
- If z_t < 0, g(z_t) is linear in z_t with slope θ - γ₁
Thus, g(z_t) is a function of both the magnitude and the sign of z_t, and it allows the conditional variance process to respond asymmetrically to rises and falls in stock prices. Indeed, it can be rewritten as
g(z_t) = (θ + γ₁)z_t·I(z_t > 0) + (θ - γ₁)z_t·I(z_t < 0) - γ₁E|z_t|

GARCH MODELS WITH LEVERAGE
The term γ₁(|z_t| - E|z_t|) represents a magnitude effect:
- If γ₁ > 0 and θ = 0, the innovations in the conditional variance are positive (negative) when the magnitude of z_t is larger (smaller) than its expected value
- If γ₁ = 0 and θ < 0, the innovation in conditional variance is positive (negative) when return innovations are negative (positive), in accordance with the empirical evidence for stock returns
Another way of capturing the leverage effect is to define an indicator variable, I_t, taking the value 1 if the day t return is negative and zero otherwise, and to specify the variance dynamics as
σ²_{t+1} = ω + α(1 + θI_t)R²_t + βσ²_t
This is equivalent to having σ²_{t+1} = ω + α(1 + θ)R²_t + βσ²_t after negative returns and σ²_{t+1} = ω + αR²_t + βσ²_t after positive ones.

GARCH MODELS WITH LEVERAGE
A θ larger than zero will again capture the leverage effect. This model is sometimes referred to as the GJR-GARCH model (GJR = Glosten, Jagannathan, and Runkle) or threshold GARCH (TARCH) model.
In this model, because E[I_t] = 0.5 (50% of the shocks will be negative and the other 50% positive), the long-run variance equals ω/(1 - α(1 + 0.5θ) - β); the persistence index is instead α(1 + 0.5θ) + β. A sketch of the GJR recursion follows below.
There is also a smaller literature that has connected time-varying volatility not to time-series features, but to observable economic phenomena, especially at daily frequencies. For instance, days on which no trading takes place, i.e., days that follow a weekend or a holiday, have higher variance, e.g.:
σ²_{t+1} = ω + αR²_t + βσ²_t + δ·I^{WE}_{t+1}
where I^{WE}_{t+1} = 1 on a day that follows a weekend or holiday.
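Here is the promised minimal MATLAB sketch of the GJR/TARCH recursion, under illustrative parameter values and demo returns (both assumptions):

% GJR-GARCH (TARCH) variance recursion with leverage parameter theta
omega = 1e-6; alpha = 0.04; theta = 0.5; beta = 0.90;   % illustrative values (assumption)
R = 0.01*randn(1000,1);                                 % demo returns (assumption)
T = length(R);
sigma2 = zeros(T+1,1);
sigma2(1) = omega/(1 - alpha*(1 + 0.5*theta) - beta);   % long-run variance as starting value
for t = 1:T
    I = (R(t) < 0);                                     % leverage indicator
    sigma2(t+1) = omega + alpha*(1 + theta*I)*R(t)^2 + beta*sigma2(t);
end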

PREDETERMINED VARIABLES THAT AFFECT VARIANCE
Other predetermined variables could be yesterday's trading volume or prescheduled news announcement dates, such as company earnings and FOMC meeting dates. Option implied volatilities also have quite a high predictive value in forecasting next-day variance, e.g., the CBOE VIX (squared).
In general, models that use explanatory variables to capture time-variation in variance can be represented as:
σ²_{t+1} = ω + αR²_t + βσ²_t + g(X_t)
where X_t are predetermined variables. It is important to ensure that the GARCH model always generates a positive variance forecast: you need to ensure that ω, α, β, and g(X_t) are all positive.
How do you estimate a GARCH model? That is, how do you estimate the fixed but unknown parameters ω, α, and β?

MAXIMUM LIKELIHOOD ESTIMATION
To perform point estimation, you need to propose an estimator (or method of estimation) with good properties. For GARCH, maximum likelihood estimation (MLE) is such a method. The method is based on knowledge of the likelihood function, which is closely related to the joint probability density function (PDF) of all of your data.
The assumption of IID normal shocks (z_t) implies that the density of the time t observation is:
f(R_t) = (1/√(2πσ²_t)) exp(-R²_t/(2σ²_t))
Because each shock is independent of the others, the total probability (PDF) of the entire sample is then the product of T such densities:
L = Π_{t=1}^T f(R_t)

MAXIMUM LIKELIHOOD ESTIMATION
This is also called the likelihood function; however, because it is more convenient to work with sums than with products, we usually consider the log of the likelihood function, also called the log-likelihood function:
ln L = -(1/2) Σ_{t=1}^T [ln(2π) + ln(σ²_t) + R²_t/σ²_t]
The idea is that the log-lik (its nickname) depends on the unknown parameters of a (say) GARCH(1,1), σ²_t = ω + αR²_{t-1} + βσ²_{t-1}. Therefore we shall simply maximize the log-lik to select the unknown parameters: maximized log-lik ⟹ MLE.
How do you do it, with paper and pencil? In the case of GARCH, you cannot: you need to perform numerical constrained optimization. What? That's why we shall need Matlab.
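For concreteness, here is a minimal MATLAB sketch of GARCH(1,1) estimation by numerically maximizing the Gaussian log-likelihood; the demo data, the starting values, and the crude penalty used to enforce the constraints are all assumptions, not a production implementation (the posted class code is the reference):

% QMLE of GARCH(1,1): minimize the negative Gaussian log-likelihood
R = 0.01*randn(1000,1);                       % demo zero-mean returns (assumption)
p0 = [0.05*var(R); 0.05; 0.90];               % starting values for [omega; alpha; beta]
phat = fminsearch(@(p) garch_nll(p, R), p0);  % base-MATLAB simplex search

function nll = garch_nll(p, R)
    omega = p(1); alpha = p(2); beta = p(3);
    if omega <= 0 || alpha < 0 || beta < 0 || alpha + beta >= 1
        nll = 1e10; return                    % crude penalty enforcing the constraints
    end
    T = length(R);
    s2 = zeros(T,1); s2(1) = var(R);          % initialize the variance filter
    for t = 2:T
        s2(t) = omega + alpha*R(t-1)^2 + beta*s2(t-1);
    end
    nll = 0.5*sum(log(2*pi) + log(s2) + R.^2./s2);   % minus the Gaussian log-lik
end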

MAXIMUM LIKELIHOOD ESTIMATION
Moreover, you need to estimate imposing constraints on the parameters, to keep variance positive and the process stationary: i.e., you need to impose ω > 0, α ≥ 0, β ≥ 0, and (α + β) < 1.
MLEs have very strong theoretical properties:
- They are consistent estimators: this means that as the sample size T → ∞, the probability that the value of the estimators (in repeated samples) shows a large divergence from the true (unfortunately unknown) parameter values goes to 0
- They are the most efficient estimators (i.e., those that give estimates with the smallest standard errors, in repeated samples) among all the (asymptotically) unbiased estimators
Please also see the posted class notes for additional details. What is asymptotically unbiased? Something related to consistent (not exactly the same, but the same for most cases).
Something to notice: MLE requires knowledge of the conditional distribution of the shocks, here z_t ~ IID N(0,1).

QUASI-MAXIMUM LIKELIHOOD ESTIMATION
Who told you that this is actually the case? What if, with your data, this is probably NOT the case? Can we still somehow do what we described above and enjoy some of the good properties of MLE?
Answer: yes, and it is called quasi (or pseudo) maximum likelihood estimation (QMLE).
Key result: even if the conditional distribution of the shocks z_t is not normal, MLE will yield estimates of the mean and variance parameters which converge to the true parameters as the sample gets infinitely large, as long as the mean and variance functions are correctly specified. Correctly specified = the models for the conditional mean and variance functions are right (in a statistical sense).

QUASI-MAXIMUM LIKELIHOOD ESTIMATION
In short, the QMLE result says: you can still use MLE estimation even when the shocks are not normally distributed, if your choices of conditional mean and variance function are good. z_t will have to be IID anyway; you can just do without normality.
Conditional mean function = how µ depends on past information; in the more general model, R_{t+1} = µ + σ_{t+1}z_{t+1}. Conditional variance function: GARCH(p,q), or TARCH(p,q), or RiskMetrics, etc.
In practice, QMLE buys us the freedom to worry about the conditional distribution later on, and we will.
Too good to be true: what is the true cost of QMLE? Simple: QMLEs will in general be less efficient than MLEs. Thus, we trade off theoretical asymptotic parameter efficiency for practicality.
QMLE also comes in handy when

QUASI-MAXIMUM LIKELIHOOD ESTIMATION
we shall need to split up estimation into different stages. Why would you do that? Sometimes practicality again, sometimes to avoid numerical maximization problems (also called laziness).
Example 1 (Variance Targeting): because you know that the long-run (ergodic) variance from a GARCH(1,1) is σ² = ω/(1 - α - β), instead of estimating ω, α, and β you simply set ω = σ̂²(1 - α - β), where σ̂² is the long-run, average variance of the series, easily estimated beforehand as σ̂² = (1/T)Σ_{t=1}^T R²_t.
Two benefits: (i) you impose the long-run variance estimate on the GARCH model directly and avoid that the model yields nonsensical estimates; (ii) you have reduced the number of parameters to be estimated in the model by one. A sketch of variance targeting follows below.
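A minimal MATLAB sketch of Example 1, reusing the Gaussian log-likelihood from before; the demo data and starting values are assumptions:

% Variance targeting: fix omega = sigma2hat*(1 - alpha - beta), estimate only alpha, beta
R = 0.01*randn(1000,1);                       % demo returns (assumption)
sigma2hat = mean(R.^2);                       % long-run variance, estimated beforehand
phat = fminsearch(@(p) vt_nll(p, R, sigma2hat), [0.05; 0.90]);  % only [alpha; beta] left

function nll = vt_nll(p, R, s2bar)
    alpha = p(1); beta = p(2);
    if alpha < 0 || beta < 0 || alpha + beta >= 1, nll = 1e10; return, end
    omega = s2bar*(1 - alpha - beta);         % omega implied by the targeted variance
    T = length(R); s2 = zeros(T,1); s2(1) = s2bar;
    for t = 2:T
        s2(t) = omega + alpha*R(t-1)^2 + beta*s2(t-1);
    end
    nll = 0.5*sum(log(2*pi) + log(s2) + R.^2./s2);
end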

QUASI-MAXIMUM LIKELIHOOD ESTIMATION
Example 2 (TARCH estimation in two steps): given a GJR model, you perform a first round of GARCH estimation (setting θ = 0), obtaining estimates of ω, α, and β as well as filtered variance levels σ²_{t+1}. Call these estimates ω*, α*, and β*. Next you regress (σ²_{t+1} - ω*) on (i) α*R²_t, (ii) α*I_tR²_t, and (iii) β*σ²_t to obtain an estimate of θ, call it θ*. You keep iterating on this process until convergence.
Of course nobody would really do that, but the point is: even these estimates, because they are not obtained in one single pass using all the available information, will be QMLE estimates.
Let's now move where the money is (or isn't): how can you tell whether a (univariate) volatility model works in practice? A number of techniques called diagnostic checks exist; here we just discuss 3 among the many possible methods.

VARIANCE MODEL EVALUATION
1 (Normality tests). If you have estimated by MLE and exploited the assumption that z_t ~ IID N(0,1), then the standardized model residuals, defined as z_t = R_t/σ_t, should have a normal distribution. Use a Jarque-Bera test: JB proposed a test that measures departures from normality in terms of the sample skewness (zero under normality) and kurtosis (three under normality):
JB = T[ŝ²/6 + (κ̂ - 3)²/24]
Under the null hypothesis of normally distributed errors, the JB statistic has a known asymptotic distribution, chi-square with 2 degrees of freedom. Large values of this statistic indicate departures from normality; see also the next lecture.
2 (Squared Autocorrelation Tests). Even though normality has not been assumed (QMLE), a good model implies that the squared standardized residuals, z²_t = R²_t/σ²_t, should display no systematic autocorrelation patterns.

VARIANCE MODEL EVALUATION
Whether this has been achieved can be assessed in the standard autocorrelation plots that you have seen with Prof. Favero. Standard errors are calculated simply as 1/T^(1/2), where T is the number of observations in the sample. So-called Bartlett standard error bands give the range in which the autocorrelations would fall roughly 95% of the time if the true but unknown autocorrelations were all zero.
In the example below, there is little or no serial correlation in the levels of z_t, but there is some serial correlation left in the squares, at low orders. Probably this means that one should build a different/better volatility model.
[Figure: sample autocorrelograms of the levels and squares of the standardized residuals, with Bartlett bands]
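Both diagnostics take a few lines of MATLAB. A minimal sketch, where R and the fitted volatilities sigma are placeholders for your own series (assumptions):

% Diagnostics on standardized residuals z = R./sigma
R = 0.01*randn(1000,1); sigma = 0.01*ones(1000,1);   % placeholders (assumption)
z = R./sigma; z = z - mean(z); T = length(z);        % demean before computing moments
% 1) Jarque-Bera statistic from sample skewness and kurtosis
s = mean(z.^3)/mean(z.^2)^1.5;         % sample skewness (zero under normality)
k = mean(z.^4)/mean(z.^2)^2;           % sample kurtosis (three under normality)
JB = T*(s^2/6 + (k-3)^2/24);           % asymptotically chi-square(2) under the null
% 2) Autocorrelations of squared standardized residuals vs. Bartlett bands
maxlag = 20; u = z.^2 - mean(z.^2);
rho = zeros(maxlag,1);
for j = 1:maxlag
    rho(j) = sum(u(j+1:T).*u(1:T-j))/sum(u.^2);
end
band = 1.96/sqrt(T);                   % approximate 95% Bartlett band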

VARIANCE MODEL EVALUATION
3 (Variance Regressions). The idea is simply to regress squared returns computed over a forecast period on the forecast from the variance model:
R²_{t+1} = b₀ + b₁σ²_{t+1|t} + e_{t+1}
A good variance forecast model should be unbiased, that is, have an intercept b₀ = 0, and be efficient, that is, have a slope b₁ = 1.
Problem: in this regression, the squared return is used as a proxy for the true but unobserved variance in period t + 1. How good a proxy is the squared return? On the one hand, in principle we are fine because, from our model, R_{t+1} = σ_{t+1}z_{t+1} implies E_t[R²_{t+1}] = σ²_{t+1}. On the other hand, the variance of such a proxy may be poor:

VARIANCE MODEL EVALUATION
Var_t(R²_{t+1}) = σ⁴_{t+1}(κ - 1), where κ ≡ E[z⁴_t] is the kurtosis coefficient of the shocks.
Because κ tends to be much higher than 3 in reality, the squared-return proxy for realized variance is often very poor (i.e., very imprecise). Due to the high degree of noise in the squared returns, the fit of the preceding regression as measured by the regression R² will be very low, typically around 5 to 10%, even if the variance model used to forecast is indeed the correct one. Thus, obtaining a low R² in such regressions should not lead one to reject the variance model. However, it remains true that the null hypothesis of b₀ = 0 and b₁ = 1 should not be rejected if the volatility model is any good.
Stay tuned: in lecture 3 we will examine alternative and much better measures of realized variance.
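A minimal MATLAB sketch of this Mincer-Zarnowitz-style regression, with placeholder inputs; sigma2fcst would hold your model's one-step-ahead variance forecasts:

% Variance regression: R(t+1)^2 = b0 + b1*sigma2fcst(t) + error
R = 0.01*randn(1000,1); sigma2fcst = 1e-4*ones(999,1);   % placeholders (assumption)
y = R(2:end).^2;                     % squared-return proxy for realized variance
X = [ones(length(y),1) sigma2fcst];
b = X\y;                             % OLS; a good model has b(1) = 0 and b(2) = 1
e = y - X*b;
R2 = 1 - sum(e.^2)/sum((y - mean(y)).^2);   % expect a low R2 even for a correct model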

READING LIST/HOW TO PREPARE FOR THE EXAM
YOU NEED TO GET A FULL GRASP OF CHAPTER 4 IN CHRISTOFFERSEN'S BOOK. Full grasp means every single sentence/equation must (eventually) make sense to you. A set of exercises/questions posted on our class web page will help you: WORK ON IT!
YOU NEED TO WORK THROUGH THE SAMPLE MATLAB CODE AND PRACTICE PROBLEM POSTED ON THE WEB. Two passes: in the first one, pay attention to the results; in the second, start looking at the Matlab code.
Some portions of ANDERSEN T., BOLLERSLEV T., CHRISTOFFERSEN P., DIEBOLD, F. (2006) Volatility and Correlation Forecasting may help.
Engle, R. F. (2001) GARCH 101: The Use of ARCH/GARCH Models in Applied Econometrics, Journal of Economic Perspectives, 15, 157-168 is a very easy and (almost) fun read.

APPENDIX 1: FORECASTING WITH GARCH(1,1)
In the lecture, we have stated that
E_t[σ²_{t+H}] = σ² + (α + β)^{H-1}(σ²_{t+1} - σ²)
This comes from the fact that:
E_t[σ²_{t+h}] = ω + αE_t[R²_{t+h-1}] + βE_t[σ²_{t+h-1}] = ω + (α + β)E_t[σ²_{t+h-1}]
so that, using ω = (1 - α - β)σ²,
E_t[σ²_{t+h}] - σ² = (α + β)(E_t[σ²_{t+h-1}] - σ²)
and iterating this recursion back to h = 1 yields
E_t[σ²_{t+H}] - σ² = (α + β)^{H-1}(σ²_{t+1} - σ²)