Forecasting Stock Return Volatility in the Presence of Structural Breaks

David E. Rapach, Saint Louis University
Jack K. Strauss, Saint Louis University
Mark E. Wohar, University of Nebraska at Omaha

September 29, 2007

Abstract

We examine the role of structural breaks in forecasting stock return volatility. We begin by testing for structural breaks in the unconditional variance of daily returns for the S&P 500 market index and ten sectoral stock indices for 9/12/1989-1/19/2006 using an iterative cumulative sum of squares procedure. We find evidence of multiple variance breaks in almost all of the return series, indicating that structural breaks are an empirically relevant feature of return volatility. We then undertake an out-of-sample forecasting exercise to analyze how instabilities in unconditional variance affect the forecasting performance of asymmetric volatility models, focusing on procedures that employ a variety of estimation window sizes designed to accommodate potential structural breaks. The exercise demonstrates that structural breaks present important challenges to forecasting stock return volatility. We find that averaging across volatility forecasts generated by individual forecasting models estimated using different window sizes performs well in many cases and appears to offer a useful approach to forecasting stock return volatility in the presence of structural breaks.

JEL classifications: C22, C53, G11, G12

Keywords: Volatility; GJR-GARCH model; Out-of-sample forecasts; Structural breaks; Estimation window

Rapach (corresponding author): Department of Economics, Saint Louis University, 3674 Lindell Boulevard, St. Louis, MO 63108-3397 (rapachde@slu.edu); Strauss: Department of Economics, Saint Louis University, 3674 Lindell Boulevard, St. Louis, MO 63108-3397 (strausjk@slu.edu); Wohar: Department of Economics, University of Nebraska at Omaha, RH 512K, Omaha, NE 68182 (mwohar@mail.unomaha.edu).

1 Introduction

Forecasting the volatility of asset returns has implications for many areas of finance, especially risk management, portfolio management, and the pricing of derivative securities. Given this importance, it is not surprising that a large empirical literature examines volatility forecasting; see the extensive survey by Poon and Granger (2003). Due in large part to the prevalence of volatility clustering in asset returns, the canonical generalized autoregressive conditional heteroskedastic (GARCH) model of Engle (1982) and Bollerslev (1986), especially the GARCH(1,1) model, is the most popular volatility forecasting model. The GARCH model allows current shocks to asset returns to affect future volatility through an autoregressive-type process, thereby introducing persistence into the behavior of volatility. Indeed, fitted GARCH models used for forecasting in the literature are typically highly persistent, so that a return shock today significantly affects the level of volatility of asset returns well into the future.

While GARCH models are frequently used to forecast volatility in the extant literature, relatively little attention has been paid to the role of structural breaks. In fact, researchers typically assume (explicitly or implicitly) the existence of a stable GARCH process when forecasting volatility and hence use a recursive (expanding) or fixed window size when estimating the GARCH model used to generate out-of-sample volatility forecasts. However, financial markets are periodically buffeted by sudden large disturbances, such as the Russian and East Asian financial crises, the collapse of Long-Term Capital Management, and the bursting of the Internet bubble. These disturbances can lead to sharp breaks in the unconditional variance of asset returns and thus breaks in the parameters governing GARCH processes.
Periodic breaks in the unconditional variance of asset returns have important implications for volatility forecasting using GARCH models. Earlier research by Diebold (1986), Hendry (1986), and Lamoureux and Lastrapes (1990), as well as more recent research by Mikosch and Stărică (2004) and Hillebrand (2005), shows that failing to account for structural breaks in the unconditional volatility of asset returns can lead to sizable upward biases in the degree of persistence in estimated GARCH models.[1]
Footnote 1: Also see Kim and Kon (1999) and Granger and Hyung (2004). Intuitively, over-estimating the degree of persistence in volatility by failing to account for structural breaks is related to the argument made by Perron (1989) in the context of unit root testing: failure to account for periodic breaks in the mean (or linear trend) of a stationary series can lead one to over-estimate the degree of persistence in the series and fail to reject the null hypothesis of a unit root (Type II error).
As indicated above, fitted GARCH processes used for forecasting asset return volatility are typically very persistent and often very close

to the integrated GARCH (IGARCH) model of Engle and Bollerslev (1986). The failure to account for structural breaks in the unconditional variance of asset returns may lead to fitting GARCH models that are too persistent, which can have adverse effects on volatility forecasts. Furthermore, fitted GARCH models that neglect structural breaks can fail to track changes in the unconditional variance and thus produce forecasts that systematically under- or overestimate volatility on average for long stretches. In summary, structural breaks have potentially important implications for forecasting the volatility of asset returns.[2]

In this chapter, we investigate the role of structural breaks in forecasting U.S. stock return volatility. Our analysis has two components. First, we test for (possibly multiple) structural breaks in the unconditional variance of eleven daily U.S. stock return series (the S&P 500 and ten sectoral indices) for 9/12/1989-1/19/2006 using a modified version of the Inclán and Tiao (1994) iterative cumulative sum of squares (ICSS) algorithm. The in-sample component of our analysis investigates the empirical relevance of structural breaks in the unconditional variance of U.S. stock returns. In addition, it provides us with a context for interpreting the second component of our analysis, which involves an out-of-sample volatility forecasting exercise. More specifically, we generate stock return volatility forecasts based on the asymmetric GARCH model of Glosten et al. (1993; GJR-GARCH) for the last 1,000 days of our 9/12/1989-1/19/2006 full-sample period. Our reason for basing our volatility forecasting exercise on an asymmetric GARCH model is the ample empirical evidence supporting the existence of a so-called leverage effect in stock returns, that is, that negative return shocks are correlated with larger increases in volatility than positive return shocks.[3] Among the studies presenting empirical evidence of a leverage effect are French et al.
(1987), Pagan and Schwert (1990), Nelson (1991), Engle and Ng (1993), and Glosten et al. (1993). Furthermore, a number of studies show that asymmetric volatility models, including the GJR-GARCH model, outperform standard (symmetric) GARCH models in out-of-sample stock return volatility forecasting "horse races"; see, for example, Brailsford and Faff (1996), Hansen and Lunde (2005), and Awartani and Corradi (2005).
Footnote 2: In the context of exchange rate return volatility forecasting, West and Cho (1995) speculate that the forecasting performance of GARCH models could be improved by allowing for structural breaks in the unconditional variance.
Footnote 3: The term leverage effect is meant here as another name for asymmetric volatility. As formulated by Black (1976) and Christie (1982), the leverage effect provides a theoretical explanation for asymmetric volatility: a negative return shock decreases the value of a stock and thus increases financial leverage, making the stock riskier and thus increasing volatility. Of course, there are other (possibly complementary) theoretical explanations for asymmetric volatility; see, for example, Campbell and Hentschel (1992) and Bekaert and Wu (2000).
Our out-of-sample forecasting exercise involves estimating GJR-GARCH(1,1) forecasting

models using a variety of estimation window sizes, including a benchmark model based on an expanding estimation window (which is appropriate if the GJR-GARCH(1,1) process is stable) and models based on long and short rolling estimation windows, as well as an estimation window whose size is determined by applying the modified ICSS algorithm to the data available at the time of forecast formation. These last three windows all attempt to accommodate structural instability in the data-generating process when estimating a volatility forecasting model, and we compare forecasts generated by these methods to the forecasts generated by the GJR-GARCH(1,1) model estimated using an expanding window to see if it pays to accommodate potential structural breaks when selecting the estimation window.[4] Building on insights in Clark and McCracken (2004) and Pesaran and Timmermann (2007), we also consider combination forecasts based on the individual GJR-GARCH(1,1) model forecasts generated using different estimation window sizes.

Previewing our results, we first find that structural breaks in unconditional volatility are an empirically relevant feature of U.S. stock returns. Using the modified ICSS algorithm, we find evidence of one or more breaks in unconditional volatility for ten of the eleven daily return series we consider, and the breaks are often associated with sizable changes in unconditional variance. We also report GJR-GARCH(1,1) parameter estimates for the sub-samples defined by the significant structural breaks in unconditional variance detected by the modified ICSS algorithm, and we find that the GJR-GARCH(1,1) parameter estimates can vary markedly across the different regimes. Given that our in-sample results indicate that structural breaks are prevalent in U.S. stock returns, we may suspect that it would be helpful to estimate volatility forecasting models using windows that accommodate structural breaks.
However, using an aggregate mean square forecast error (MSFE) metric (Stărică and Granger, 2005; Stărică et al., 2005) and considering a variety of forecast horizons, we find that volatility forecasts generated by the GJR-GARCH(1,1) models estimated using long and short rolling windows and a window selected using the modified ICSS algorithm are often not more accurate than the volatility forecasts generated by the benchmark GJR-GARCH(1,1) model estimated using an expanding window.
Footnote 4: We consider two additional volatility forecasting models that are frequently used in out-of-sample volatility forecasting exercises: the RiskMetrics and simple moving-average models.
The failure of forecasting models estimated with windows designed to accommodate structural breaks to consistently outperform the expanding estimation window can be explained by the bias-efficiency trade-off identified by Clark and McCracken (2004) and Pesaran and Timmermann (2007) in selecting

the optimal estimation window size in the presence of structural breaks. With this bias-efficiency trade-off in mind, and in the spirit of Clark and McCracken (2004) and Pesaran and Timmermann (2007), we consider combination forecasts formed by averaging across the forecasts generated by the individual GJR-GARCH(1,1) models estimated with the different window sizes. We find that certain combination forecasts are able to consistently outperform the benchmark GJR-GARCH(1,1) model estimated using an expanding window, sometimes by a sizable margin. Overall, our results suggest that combining forecasts across GJR-GARCH models estimated using different window sizes offers a practical method for improving the accuracy of stock return volatility forecasts in the presence of structural breaks in unconditional volatility.

The rest of this chapter is organized as follows: Section 2 describes our econometric methodology, Section 3 describes the results of the in-sample and out-of-sample analyses, and Section 4 concludes.

2 Econometric Methodology

2.1 Modified Iterative Cumulative Sum of Squares Algorithm

Let $R_t = 100 \log(P_t / P_{t-1})$ denote the continuously compounded return for a stock index from time $t-1$ to $t$, where $P_t$ is the value of the index at time $t$, and let $r_t = R_t - \mu$, where $\mu$ is the constant (conditional and unconditional) mean of $R_t$. Suppose that we observe $r_t$ for $t = 1, \ldots, T$. Inclán and Tiao (1994) develop a cumulative sum of squares statistic that can be used to test the null hypothesis that the unconditional variance of $r_t$ is constant over the full sample ($t = 1, \ldots, T$) against the alternative hypothesis of a break in the unconditional variance at some point in the sample. The statistic is given by

$IT = \sup_k |(T/2)^{0.5} D_k|$, (1)

where $D_k = (C_k / C_T) - (k/T)$ and $C_k = \sum_{t=1}^{k} r_t^2$ for $k = 1, \ldots, T$. When $r_t \sim \text{iid } N(0, \sigma_r^2)$, Inclán and Tiao (1994) show that $IT$ converges in distribution to $\sup_r |W^*(r)|$ under the null hypothesis, where $W^*(r) = W(r) - rW(1)$ is a Brownian bridge and $W(r)$ is standard Brownian motion.
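As an illustration of equation (1), the following minimal sketch computes the IT statistic and the value of k at which it is attained. The function name `inclan_tiao_stat` and the simulated series (a variance shift at observation 500) are assumptions introduced purely for illustration; they are not part of the original analysis.

```python
import numpy as np

def inclan_tiao_stat(r):
    """Return (IT, k_hat) for a demeaned return series r, following equation (1)."""
    r = np.asarray(r, dtype=float)
    T = r.size
    C = np.cumsum(r ** 2)                  # C_k for k = 1, ..., T
    k = np.arange(1, T + 1)
    D = C / C[-1] - k / T                  # D_k = C_k / C_T - k / T
    scaled = np.sqrt(T / 2.0) * np.abs(D)  # |(T/2)^0.5 D_k|
    k_hat = int(np.argmax(scaled)) + 1     # 1-based index of the maximizer
    return float(scaled[k_hat - 1]), k_hat

# Hypothetical data: iid normal returns whose standard deviation
# jumps from 1 to 3 after observation 500.
rng = np.random.default_rng(0)
r_sim = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(0.0, 3.0, 500)])
it_stat, k_hat = inclan_tiao_stat(r_sim)
print(it_stat, k_hat)
```

For reference, the asymptotic 5% critical value of the supremum of the absolute value of a Brownian bridge is approximately 1.358; the simulated statistic here should lie far above it, with the maximizing k close to the true shift point.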
Finite-sample critical values for $IT$ can be generated by simulation methods. When the null hypothesis is rejected, the value of $k$ that maximizes $|(T/2)^{0.5} D_k|$ serves as the estimate of the break date. A potential drawback to relying on the $IT$ statistic is that it is designed for iid processes, while many return processes appear to be characterized by temporal dependencies, such as the

autocorrelation in conditional volatility captured by GARCH models. A number of studies, including Andreou and Ghysels (2002), de Pooter and van Dijk (2004), and Sansó et al. (2004), show that the $IT$ statistic can be substantially oversized for dependent processes, including GARCH processes.[5] In order to allow for $r_t$ to follow a variety of dependent processes (including GARCH processes) under the null hypothesis, a non-parametric adjustment can be applied to the $IT$ statistic; see, for example, Kokoszka and Leipus (1999), Lee and Park (2001), and Sansó et al. (2004). We employ a non-parametric adjustment based on the Bartlett kernel, which can be expressed as

$AIT = \sup_k |T^{-0.5} G_k|$, (2)

where

$G_k = \hat{\lambda}^{-0.5} [C_k - (k/T) C_T]$,
$\hat{\lambda} = \hat{\gamma}_0 + 2 \sum_{l=1}^{m} [1 - l(m+1)^{-1}] \hat{\gamma}_l$,
$\hat{\gamma}_l = T^{-1} \sum_{t=l+1}^{T} (r_t^2 - \hat{\sigma}^2)(r_{t-l}^2 - \hat{\sigma}^2)$,
$\hat{\sigma}^2 = T^{-1} C_T$,

and the lag truncation parameter $m$ is selected using the procedure in Newey and West (1994). Under general conditions, $AIT$ converges in distribution to $\sup_r |W^*(r)|$, and simulation methods can again be used to generate finite-sample critical values.

Inclán and Tiao (1994) also develop an iterative cumulative sum of squares (ICSS) algorithm based on the $IT$ statistic that can be used to test for multiple breaks in the unconditional variance, and the procedure is described in Steps 0-3 in Inclán and Tiao (1994, p. 916). In order to allow for dependent processes under the null hypothesis, the ICSS procedure can instead be based on the $AIT$ statistic, and we label this the modified ICSS algorithm. In our applications in Section 3 below, we use the modified ICSS algorithm based on the $AIT$ statistic and a 5% significance level to test for structural breaks in the unconditional volatility of daily stock return series for the S&P 500 index and ten sectoral indices.

Footnote 5: Further complicating matters, the size distortions seem to increase with the sample size.
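The kernel-adjusted statistic in equation (2) can be sketched along the same lines. This is a minimal illustration under stated assumptions: the function name `adjusted_it_stat` is hypothetical, and when no lag truncation `m` is supplied the code falls back on a simple rule of thumb, floor(4 (T/100)^(2/9)), which stands in for (and is not the same as) the data-driven Newey and West (1994) selection procedure used in the text.

```python
import numpy as np

def adjusted_it_stat(r, m=None):
    """Return (AIT, k_hat) for a demeaned return series r, following equation (2)."""
    r = np.asarray(r, dtype=float)
    T = r.size
    if m is None:
        # Assumption: rule-of-thumb truncation lag, not the automatic
        # Newey-West (1994) selection procedure.
        m = int(np.floor(4.0 * (T / 100.0) ** (2.0 / 9.0)))
    r2 = r ** 2
    C = np.cumsum(r2)                       # C_k
    sigma2 = C[-1] / T                      # sigma_hat^2 = C_T / T
    e = r2 - sigma2

    def gamma_hat(l):
        # T^{-1} * sum_{t=l+1}^{T} (r_t^2 - sigma2) * (r_{t-l}^2 - sigma2)
        return np.sum(e[l:] * e[:T - l]) / T

    # Bartlett-kernel long-run variance estimate of r_t^2
    lam = gamma_hat(0) + 2.0 * sum(
        (1.0 - l / (m + 1.0)) * gamma_hat(l) for l in range(1, m + 1)
    )
    k = np.arange(1, T + 1)
    G = (C - (k / T) * C[-1]) / np.sqrt(lam)  # G_k
    scaled = np.abs(G) / np.sqrt(T)           # |T^{-0.5} G_k|
    k_hat = int(np.argmax(scaled)) + 1
    return float(scaled[k_hat - 1]), k_hat

# Hypothetical data with a variance shift after observation 500.
rng = np.random.default_rng(0)
r_sim = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(0.0, 3.0, 500)])
ait_stat, ait_k_hat = adjusted_it_stat(r_sim)
print(ait_stat, ait_k_hat)
```

Because the long-run variance estimate absorbs some of the level shift in squared returns, the adjusted statistic is typically smaller than the unadjusted one on the same data, which is consistent with its role as a more conservative test.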

2.2 GJR-GARCH Model

The canonical GARCH(1,1) model is given by

$r_t = h_t^{0.5} \epsilon_t$, (3)

$h_t = \omega + \alpha r_{t-1}^2 + \beta h_{t-1}$, (4)

where $h_t$ represents the conditional volatility of $r_t$ and $\epsilon_t$ is distributed iid with zero mean and unit variance. The condition for the GARCH(1,1) process to be covariance-stationary is $\alpha + \beta < 1$,[6] and the persistence measure for the GARCH(1,1) model is given by $\alpha + \beta$. It is common to estimate the GARCH(1,1) model using quasi maximum likelihood estimation (QMLE) under the assumption that $\epsilon_t \sim N(0, 1)$. QMLE parameter estimates can be shown to be consistent and asymptotically normal under certain conditions; see, for example, Ling and McAleer (2003) and Jensen and Rahbek (2004).

The GARCH(1,1) model assumes that the conditional variance responds symmetrically to positive and negative shocks, so that only the size and not the sign of the shock matters. However, as noted in the introduction, it is well established that stock returns and volatility are negatively correlated. A number of nonlinear GARCH models have been introduced in the literature in order to allow for asymmetry in the response of conditional volatility to return shocks. We focus on the popular Glosten et al. (1993) GJR-GARCH model.[7] The GJR-GARCH(1,1) model can be expressed as

$h_t = \omega + (\alpha + \gamma I_{t-1}) r_{t-1}^2 + \beta h_{t-1}$, (5)

where $I_{t-1}$ is a dummy variable that takes a value of unity when $\epsilon_{t-1} < 0$ and zero otherwise. The specification in (5) allows for positive and negative shocks to have different effects on conditional volatility: a positive unit shock at time $t$ leads to an increase in conditional volatility of $\alpha$ at time $t+1$, whereas a negative unit shock at time $t$ elicits a larger increase in conditional volatility of $\alpha + \gamma$ at time $t+1$ (assuming $\gamma > 0$). When $\gamma = 0$, it is clear that positive and negative shocks have symmetric effects on volatility, and in this case, (5) reduces to (4). The persistence measure for the GJR-GARCH(1,1) model is given by $\alpha + \beta + (\gamma/2)$.
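The asymmetric response in equation (5) can be made concrete with a short sketch of the conditional variance recursion for given parameter values. The function name `gjr_garch_variance`, the parameter values, and the choice to start the recursion at the implied unconditional variance are all illustrative assumptions; the sketch does not perform the QMLE estimation used in the chapter.

```python
import numpy as np

def gjr_garch_variance(r, omega, alpha, gamma, beta, h0=None):
    """Filter conditional variances h_t from returns r using equation (5)."""
    r = np.asarray(r, dtype=float)
    persistence = alpha + beta + gamma / 2.0
    if h0 is None:
        # Assumption: start at omega / (1 - persistence), which
        # requires persistence < 1.
        h0 = omega / (1.0 - persistence)
    h = np.empty(r.size)
    h[0] = h0
    for t in range(1, r.size):
        # I_{t-1} = 1 when the lagged shock is negative; since h > 0,
        # the sign of r_{t-1} equals the sign of epsilon_{t-1}.
        neg = 1.0 if r[t - 1] < 0 else 0.0
        h[t] = omega + (alpha + gamma * neg) * r[t - 1] ** 2 + beta * h[t - 1]
    return h

# Hypothetical parameter values with persistence 0.03 + 0.90 + 0.10/2 = 0.98.
omega, alpha, gamma, beta = 0.02, 0.03, 0.10, 0.90
h_pos = gjr_garch_variance([1.0, 0.0], omega, alpha, gamma, beta)   # positive unit shock
h_neg = gjr_garch_variance([-1.0, 0.0], omega, alpha, gamma, beta)  # negative unit shock
print(h_neg[1] - h_pos[1])  # extra response to the negative shock, equal to gamma
```

With identical starting values, the two filtered variances differ one period after the shock by exactly gamma (up to floating-point error), which is the asymmetry that equation (5) is designed to capture.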
Footnote 6: When $\alpha + \beta = 1$, we have the integrated GARCH(1,1) or IGARCH(1,1) model of Engle and Bollerslev (1986). Note that we require $\alpha > 0$ for $\beta$ to be identified for the GARCH(1,1) model; when $\alpha = 0$, the volatility process is conditionally homoskedastic.
Footnote 7: Other asymmetric GARCH models proposed in the literature include the exponential GARCH (EGARCH) model of Nelson (1991), threshold GARCH model of Zakoian (1994), and quadratic GARCH (QGARCH) model of Sentana (1995). Franses and van Dijk (2000, Ch. 4) provide a useful survey of different types of asymmetric GARCH models.
Note

that we require $\alpha + (\gamma/2) > 0$ for $\beta$ to be identified for the GJR-GARCH(1,1) model; when $\alpha + (\gamma/2) = 0$, the volatility process is characterized by conditional homoskedasticity. Assuming $\alpha + \beta + (\gamma/2) < 1$, the unconditional variance of $r_t$ is given by $\omega / \{1 - [\alpha + \beta + (\gamma/2)]\}$, and structural breaks in the unconditional variance of $r_t$ imply structural breaks in $\omega$, $\alpha$, $\gamma$, and/or $\beta$. Like the GARCH(1,1) model, the GJR-GARCH(1,1) model can be estimated using QMLE under the assumption that $\epsilon_t \sim N(0, 1)$, and we follow this practice in Section 3 below.[8]

2.3 In-Sample Tests

Our in-sample testing strategy involves two steps. First, we use the modified ICSS algorithm to test for structural breaks in the unconditional volatility of daily stock returns for the S&P 500 and ten sectoral indices over the 9/12/1989-1/19/2006 period. Second, we estimate GJR-GARCH(1,1) models for each return series for the full 9/12/1989-1/19/2006 sample, as well as the sub-samples defined by the structural breaks identified by the modified ICSS algorithm.[9] These exercises allow us to analyze the empirical relevance of structural breaks in unconditional volatility for U.S. stock market returns and to examine the effects of structural breaks on fitted GJR-GARCH(1,1) models. Importantly, our in-sample tests also provide a context for analyzing our out-of-sample test results.

2.4 Forecasting Models

We divide the full sample into in-sample and out-of-sample components, where the in-sample component spans the first $R$ observations and the out-of-sample component spans the last $P$ observations, so that $T = R + P$. We simulate the situation of a forecaster in real time and form volatility forecasts over the out-of-sample period for horizons of 1, 20, 60, and 120 days ($s = 1, 20, 60, 120$). Our aim is to explore the effects of structural breaks on volatility forecasting and analyze the usefulness of various forecasting approaches designed to accommodate potential structural breaks.
Footnote 8: As pointed out by Ng and McAleer (2004), while the conditions are known for the existence of moments for the GJR-GARCH(1,1) model, theoretical results are not available regarding the asymptotic distribution, and in practice, estimation proceeds under the assumptions of consistency and asymptotic normality.
Footnote 9: If there are no structural breaks for a return series, we only estimate the GJR-GARCH(1,1) model for the full sample.

We consider the following volatility forecasting models:

GJR-GARCH(1,1) expanding window. This model forms out-of-sample forecasts using a recursive (expanding) estimation window. It serves as our benchmark model and

is the appropriate forecasting model if there are no structural breaks in the unconditional volatility of stock returns (assuming that the GJR-GARCH(1,1) model adequately captures the dynamics of the volatility process). The first out-of-sample forecast at the 1-period horizon ($s = 1$) is given by

$\hat{h}_{R+1|R,\mathrm{EXP}} = \hat{\omega}_{R,\mathrm{EXP}} + (\hat{\alpha}_{R,\mathrm{EXP}} + \hat{\gamma}_{R,\mathrm{EXP}} I_R) r_R^2 + \hat{\beta}_{R,\mathrm{EXP}} \hat{h}_{R,\mathrm{EXP}}$,

where $\hat{\omega}_{R,\mathrm{EXP}}$, $\hat{\alpha}_{R,\mathrm{EXP}}$, $\hat{\gamma}_{R,\mathrm{EXP}}$, $\hat{\beta}_{R,\mathrm{EXP}}$, and $\hat{h}_{R,\mathrm{EXP}}$ are the QMLE estimates of $\omega$, $\alpha$, $\gamma$, $\beta$, and $h_R$, respectively, in (5) based on observations $1, \ldots, R$. The second out-of-sample forecast is given by

$\hat{h}_{R+2|R+1,\mathrm{EXP}} = \hat{\omega}_{R+1,\mathrm{EXP}} + (\hat{\alpha}_{R+1,\mathrm{EXP}} + \hat{\gamma}_{R+1,\mathrm{EXP}} I_{R+1}) r_{R+1}^2 + \hat{\beta}_{R+1,\mathrm{EXP}} \hat{h}_{R+1,\mathrm{EXP}}$,

where $\hat{\omega}_{R+1,\mathrm{EXP}}$, $\hat{\alpha}_{R+1,\mathrm{EXP}}$, $\hat{\gamma}_{R+1,\mathrm{EXP}}$, $\hat{\beta}_{R+1,\mathrm{EXP}}$, and $\hat{h}_{R+1,\mathrm{EXP}}$ are the QMLE estimates of $\omega$, $\alpha$, $\gamma$, $\beta$, and $h_{R+1}$, respectively, in (5) based on observations $1, \ldots, R+1$. We continue in this manner through the end of the available out-of-sample period, leaving us with a series of $P$ 1-step-ahead volatility forecasts, which we denote by $\{\hat{h}_{t|t-1,\mathrm{EXP}}\}_{t=R+1}^{T}$.

GJR-GARCH(1,1) 0.50 rolling window. This model generates forecasts using a rolling estimation window equal to one-half of the size of the in-sample period. The forecasts are generated in the same way as for the GJR-GARCH(1,1) expanding window model, with the exception that the parameter estimates for the first out-of-sample forecast are based on observations $0.50R + 1, \ldots, R$, parameter estimates for the second out-of-sample forecast are based on observations $0.50R + 2, \ldots, R + 1$, etc. Researchers sometimes use parameter estimates based on a rolling window when forming forecasts in order to allow for the parameters to evolve over time in an attempt to accommodate structural breaks. We denote the 1-step-ahead forecasts corresponding to the GJR-GARCH(1,1) 0.50 rolling window by $\{\hat{h}_{t|t-1,\mathrm{ROLL}(0.50)}\}_{t=R+1}^{T}$.

GJR-GARCH(1,1) 0.25 rolling window.
The forecasts for this model are generated in the same way as those for the GJR-GARCH(1,1) 0.50 rolling window model, with the exception that we use a rolling window equal to one-quarter of the size of the in-sample period, so that the parameter estimates for the first out-of-sample forecast are based on observations $0.75R + 1, \ldots, R$, parameter estimates for the second out-of-sample forecast are based on observations $0.75R + 2, \ldots, R + 1$, etc. By using a shorter estimation window, it is less likely that the parameters are estimated using data from different regimes, but the forecasting model has fewer observations available for estimating the parameters of the GJR-GARCH(1,1) model. The 1-step-ahead forecasts for the GJR-GARCH(1,1) 0.25 rolling window model are denoted by $\{\hat{h}_{t|t-1,\mathrm{ROLL}(0.25)}\}_{t=R+1}^{T}$.
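The expanding and rolling schemes described above differ only in which observations enter the estimation sample at each forecast origin. The following sketch makes the index arithmetic explicit; the function name `estimation_window`, the scheme labels, and the use of truncation when the window fraction times R is not an integer are assumptions made for illustration.

```python
def estimation_window(origin, R, scheme):
    """Return (first, last) 1-based observation indices used to estimate
    the model when the forecast is formed at time `origin` (origin >= R)."""
    if scheme == "EXP":
        # Expanding window: always start at observation 1.
        return 1, origin
    # Rolling windows keep a fixed width equal to a fraction of R.
    fraction = {"ROLL(0.50)": 0.50, "ROLL(0.25)": 0.25}[scheme]
    width = int(fraction * R)  # assumption: truncate if fraction * R is not an integer
    return origin - width + 1, origin

# Hypothetical in-sample size R = 1000.
R_demo = 1000
first_windows = {
    scheme: estimation_window(R_demo, R_demo, scheme)
    for scheme in ("EXP", "ROLL(0.50)", "ROLL(0.25)")
}
second_windows = {
    scheme: estimation_window(R_demo + 1, R_demo, scheme)
    for scheme in ("EXP", "ROLL(0.50)", "ROLL(0.25)")
}
print(first_windows)
print(second_windows)
```

With R = 1000, the first rolling windows start at observations 501 and 751, matching the 0.50R + 1 and 0.75R + 1 starting points in the text, and each window slides forward by one observation at the next forecast origin.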

GJR-GARCH(1,1) with breaks. For this model, the size of the estimation window is determined by applying the modified ICSS algorithm to the data available at the time the forecast is made. More specifically, to form the first out-of-sample forecast, we apply the modified ICSS algorithm to observations $1, \ldots, R$. The estimation window for the parameters of the GJR-GARCH(1,1) model comprises observations $T_B(1, \ldots, R) + 1, \ldots, R$, where $T_B(1, \ldots, R)$ is the final break detected by the modified ICSS algorithm applied to observations $1, \ldots, R$; if no breaks are detected over this period, the parameters of the GJR-GARCH(1,1) model are estimated using observations $1, \ldots, R$. To form the second out-of-sample forecast, we apply the modified ICSS algorithm to observations $1, \ldots, R + 1$, and the estimation window for the GJR-GARCH(1,1) model parameters comprises observations $T_B(1, \ldots, R+1) + 1, \ldots, R + 1$. We proceed in this manner through the end of the available out-of-sample period, and we denote the 1-step-ahead forecasts for the GJR-GARCH(1,1) with breaks model by $\{\hat{h}_{t|t-1,\mathrm{BREAKS}}\}_{t=R+1}^{T}$. Note that the forecasts for the GJR-GARCH(1,1) with breaks model do not suffer from look-ahead bias, as only observations available at the time of forecast formation are used to determine the timing of the most recent break. Of course, a potential drawback to this model is that a relatively short sample will be available for estimating the GJR-GARCH(1,1) model parameters if a break is detected near the period of forecast formation.

When forming forecasts for horizons beyond one period ($s > 1$) for the four GJR-GARCH(1,1) forecasting models described above, we use the iterative procedure, $\hat{h}_{t+s|t} = \hat{\omega} + (\hat{\alpha} + \hat{\gamma}/2 + \hat{\beta}) \hat{h}_{t+s-1|t}$; see Franses and van Dijk (2000, p. 192).[10] We denote the sequence of $P - (s - 1)$ $s$-step-ahead out-of-sample volatility forecasts by $\{\hat{h}_{t|t-s,i}\}_{t=R+s}^{T}$ for $i$ = EXP, ROLL(0.50), ROLL(0.25), BREAKS.

Footnote 10: The iterative procedure follows from the fact that $E[I(r_{t+s} < 0) \mid \Omega_t] = 0.5$ and $E[r_{t+s}^2 \mid \Omega_t] = h_{t+s}$, where $\Omega_t$ represents the information available at time $t$.

Our strategy of considering forecasts based on models estimated using an expanding window, a long rolling window, a short rolling window, and a window size determined by the application of a structural break test is similar in spirit to Pesaran and Timmermann (2005). They analyze the performance of different estimation window sizes for autoregressive models designed to forecast the conditional mean of a number of macroeconomic variables. While they focus on forecasting the conditional mean of macroeconomic variables in the presence of structural breaks, we are interested in the performance of different estimation window sizes when forecasting stock return volatility in the presence of structural breaks.
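The iterative multi-step procedure for the GJR-GARCH(1,1) models can be sketched as follows. The function name `multi_step_forecasts` and the parameter values are hypothetical; the recursion itself is the one stated in the text, in which each additional step applies the persistence term (alpha + gamma/2 + beta) to the previous forecast.

```python
import numpy as np

def multi_step_forecasts(h_one_step, omega, alpha, gamma, beta, s):
    """Return the forecast path [h_{t+1|t}, ..., h_{t+s|t}] implied by
    h_{t+j|t} = omega + (alpha + gamma/2 + beta) * h_{t+j-1|t}."""
    persistence = alpha + gamma / 2.0 + beta
    path = np.empty(s)
    path[0] = h_one_step  # the 1-step-ahead forecast from equation (5)
    for j in range(1, s):
        path[j] = omega + persistence * path[j - 1]
    return path

# Hypothetical estimates with persistence 0.98 and a high current forecast.
omega, alpha, gamma, beta = 0.02, 0.03, 0.10, 0.90
path = multi_step_forecasts(4.0, omega, alpha, gamma, beta, s=20)

# When persistence < 1, the path decays geometrically toward the
# unconditional variance omega / (1 - persistence).
persistence = alpha + gamma / 2.0 + beta
sigma2 = omega / (1.0 - persistence)
closed_form = sigma2 + persistence ** np.arange(20) * (4.0 - sigma2)
print(np.allclose(path, closed_form))
```

The closed-form comparison shows why highly persistent fitted models matter for longer horizons: the closer the persistence measure is to one, the more slowly a forecast reverts to the unconditional variance.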

We consider two additional forecasting models:

RiskMetrics. This is a popular model that is often included in out-of-sample volatility forecasting exercises. The 1-step-ahead RiskMetrics forecast is given by the exponentially weighted moving average formula, $\hat{h}_{t+1|t} = (1 - \lambda) \sum_{k=0}^{t-1} \lambda^k r_{t-k}^2$. Following the recommendation of the RiskMetrics Group (1996), we set $\lambda = 0.94$ for daily data, and following convention, we set the $s$-step-ahead forecast for $s > 1$ equal to the 1-step-ahead forecast for the RiskMetrics model. We denote the sequence of $s$-step-ahead out-of-sample volatility forecasts for the RiskMetrics model by $\{\hat{h}_{t|t-s,\mathrm{RM}}\}_{t=R+s}^{T}$.

Moving average. This is a simple moving average model that uses the average of the squared returns over the previous 250 days to form a volatility forecast for day $t$: $\hat{h}_{t|t-1} = (1/250) \sum_{k=1}^{250} r_{t-k}^2$. Stărică et al. (2005) find that this model outperforms a GARCH(1,1) model when forecasting daily stock return volatility in a number of industrialized countries and argue that it is a useful model for accommodating structural breaks. As with the RiskMetrics model, we set the $s$-step-ahead forecast for $s > 1$ equal to the 1-step-ahead forecast for the moving average model. We denote the sequence of $s$-step-ahead out-of-sample volatility forecasts for the moving average model by $\{\hat{h}_{t|t-s,\mathrm{MA}}\}_{t=R+s}^{T}$.

2.5 Combination Forecasts

At first blush, it may seem that we would only want to estimate a forecasting model with data after the most recent break, as this avoids the biases associated with using data from different regimes to estimate the parameters of the forecasting model. However, recent research by Clark and McCracken (2004) and Pesaran and Timmermann (2007) shows that it can be optimal to use pre-break data when estimating forecasting models for the mean square forecast error (MSFE) loss function.
Intuitively, using pre-break data can reduce the variance of parameter estimates, so that there is a bias-efficiency trade-off involved with the inclusion of pre-break data in the estimation window. Clark and McCracken (2004) and Pesaran and Timmermann (2007) derive the optimal estimation window size for a linear regression forecasting model for the conditional mean that minimizes MSFE. They show that the optimal amount of pre-break data to include in the estimation window depends on a number of factors, including the size, direction, and exact timing of the structural breaks. While there are currently no theoretical results available on the optimal estimation window size for forecasting volatility using GARCH-type models, it

is very likely that the types of results derived by Clark and McCracken (2004) and Pesaran and Timmermann (2007) are also relevant for volatility forecasting.

From a practical standpoint, determining the optimal window size for a forecasting model is likely to be quite difficult, even when theoretical results are available, as the size, direction, and timing of the breaks all need to be estimated, and there may be substantial uncertainty surrounding the estimates. Given the difficulty in determining the optimal estimation window size in practice, Clark and McCracken (2004) and Pesaran and Timmermann (2007) consider combining forecasts from time-series models estimated using windows of different sizes as a way of dealing with structural breaks. Both of these studies find that combining forecasts from time-series models estimated using different window sizes performs well in forecasting the conditional mean of a number of macroeconomic variables. In the context of forecasting exchange rate volatility, Rapach and Strauss (2007) consider combining forecasts of exchange rate return volatility generated by GARCH(1,1) models estimated using different window sizes. They find that combining forecasts across estimation window sizes typically improves exchange rate return volatility forecasts in the presence of structural breaks in the unconditional volatility of exchange rate returns.[11] In light of these results, we consider a number of combination forecasts:
1. Mean-all. This is the average of the six individual forecasts listed above (GJR-GARCH(1,1) expanding window, GJR-GARCH(1,1) 0.50 rolling window, GJR-GARCH(1,1) 0.25 rolling window, GJR-GARCH(1,1) with breaks, RiskMetrics, and Moving average).
2. Trimmed mean-all. This is the average of the four individual forecasts that result from excluding the highest and lowest individual forecasts from the six individual forecasts listed above.
3. CM combined.
This is the average of the GJR-GARCH(1,1) expanding window and GJR-GARCH(1,1) 0.25 rolling window forecasts and is in the spirit of Clark and McCracken (2004).
4. Mean-windows. This is the average of the forecasts generated by the four individual GJR-GARCH(1,1) models estimated using the four different window sizes (GJR-GARCH(1,1) expanding window, GJR-GARCH(1,1) 0.50 rolling window, GJR-GARCH(1,1) 0.25 rolling window, and GJR-GARCH(1,1) with breaks).[12]
5. Trimmed mean-windows. This is the average of the two individual forecasts that result from excluding the highest and lowest individual forecasts generated by the four GJR-GARCH(1,1) models estimated using the four different window sizes.

Footnote 11: The usefulness of combining individual forecasts has been known since the seminal work of Bates and Granger (1969). Timmermann (2006) provides a useful recent survey of forecast combination.
Footnote 12: This combination forecast is similar in spirit to Pesaran and Timmermann (2007). They actually form combination forecasts by averaging across forecasts generated by linear regression models estimated using all possible window sizes (subject to a minimum length requirement). While this is feasible for linear regression models and moderate sample sizes, it is not feasible in our stock return volatility applications, as this would entail the estimation of thousands of GJR-GARCH(1,1) models just to form a single combination forecast.

2.6 Forecast Evaluation

Following Stărică and Granger (2005) and Stărică et al. (2005), we compare volatility forecasts across models using an aggregated version of the popular MSFE metric:

$MSFE = [P - (s - 1)]^{-1} \sum_{t=R+s}^{T} (\bar{r}_t^2 - \hat{\bar{h}}_{t|t-s,i})^2$, (6)

where $\bar{r}_t^2 = \sum_{j=1}^{s} r_{t-(j-1)}^2$ and $\hat{\bar{h}}_{t|t-s,i} = \sum_{j=1}^{s} \hat{h}_{t-(j-1)|t-s,i}$. It is well known that squared returns tend to be a very noisy proxy for latent volatility (Andersen and Bollerslev, 1998). Aggregating helps to reduce the idiosyncratic noise in squared returns at horizons beyond one period, thereby providing a more informative volatility metric. An alternative approach to measuring latent volatility is the use of realized volatility measured from intra-day data. However, intra-day data are typically not available for a sufficiently long period for the series in our applications, so we cannot use this measure.[13] We analyze volatility forecasts using the MSFE criterion at horizons of 1, 20, 60, and 120 days ($s = 1, 20, 60, 120$).

We use the Clark and West (2007) variant of the popular Diebold and Mariano (1995) statistic to test for significant differences in MSFE between the benchmark GJR-GARCH(1,1) expanding window model and an individual competing volatility forecasting model. The Clark and West (2007) statistic is designed to test for differences in forecasting performance for nested models, and the competing volatility forecasting models we consider can be viewed as being nested by the GJR-GARCH(1,1) expanding window model under the null hypothesis that the data are generated by a stable GJR-GARCH(1,1) model. The Clark and West (2007) statistic is designed to be more powerful than the Diebold and Mariano (1995) statistic when comparing nested models. While our framework is not equivalent to that considered by Clark and West (2007), we adopt their statistic in an effort to employ a more powerful test of differences in predictive accuracy.[14] To compute the Clark and West (2007) statistic in our volatility forecasting framework to test whether forecasting model $i$ has a lower MSFE than the GJR-GARCH(1,1) expanding window benchmark model, first define

$\hat{f}_t = (\bar{r}_t^2 - \hat{\bar{h}}_{t|t-s,\mathrm{EXP}})^2 - [(\bar{r}_t^2 - \hat{\bar{h}}_{t|t-s,i})^2 - (\hat{\bar{h}}_{t|t-s,\mathrm{EXP}} - \hat{\bar{h}}_{t|t-s,i})^2]$ (7)

for $t = R + s, \ldots, T$. We can test for equal MSFE by regressing $\hat{f}_t$ on a constant and testing for a zero constant term using the t-statistic and a one-sided (upper-tail) test based on the standard normal distribution. The conventional OLS standard error can be used to compute the t-statistic when $s = 1$, while we use the Newey and West (1987) procedure to compute an autocorrelation-consistent standard error when $s > 1$.

Footnote 13: As pointed out in the chapter by Hyung et al. in this volume, the relative rankings of volatility forecasting models are typically the same whether squared returns or realized volatility are used to measure latent volatility.
Footnote 14: It is important to realize that recent research shows that there are a number of issues that arise when testing for significant differences in forecasting performance, including whether the models are nested or non-nested, the size of the in-sample relative to the out-of-sample period ($P/T$), and the type of estimation window used; see Corradi and Swanson (2006) for a useful survey of some of these issues.

3 Empirical Results

3.1 Data

We use daily U.S. stock price indices for the S&P 500 and ten sectors: Consumer durables, Construction, Engineering, Finance, Health, Industrials, Information, Materials, Gas utilities, and Electric utilities. The daily price indices are from Global Financial Data, and we compute daily returns based on daily closing prices.[15] The sample period is 9/12/1989-1/19/2006 (4,129 observations). This sample covers some interesting periods, including the bull market of the 1990s and the subsequent bear market.[16]

Table 1 reports summary statistics for the daily return series. Heteroskedasticity- and autocorrelation-consistent standard errors for the mean, standard deviation, skewness, and excess kurtosis are reported in the table and computed as in West and Cho (1995). Daily stock returns appear quite volatile and exhibit strong evidence of excess kurtosis. There is evidence of serial correlation in the returns for only a few of the indices. In contrast, there is strong evidence
In light of this, we view the Clark and West (2007) statistic as providing a rough guide for assessing statistical significance. 15 We analyze returns for ten individual sectoral indices in addition to the broad S&P 500 index for two reasons. First, forecasting volatility for sectoral returns has received less attention than forecasting broad market returns, despite the importance of sectoral returns; for instance, many leading financial institutions market mutual funds that specialize in particular sectors, so there is general interest in sectoral returns. Secondly, we will analyze a variety of forecasting models, and we can get a better sense of the robustness of our results by considering a relatively large number of sectoral returns in addition to the broad market returns. 16 We compute demeaned returns by subtracting the sample mean from daily returns. Our in-sample and out-of-sample volatility forecasting results are not very sensitive to the assumption of a constant conditional mean. Note that Awartani and Corradi (2005) find that out-of-sample volatility forecasting results for U.S. stock returns are not sensitive to the modeling of the conditional mean. The insensitivity of volatility results to the modeling of the conditional mean for stock returns in not surprising, given that the predictable component in stock returns is very small. 13

of serial correlation in the squared returns. In addition, the Engle (1982) Lagrange-multiplier statistic provides significant evidence of ARCH effects. The high volatility, excess kurtosis, and strong serial correlation in squared returns reported in Table 1 are well-known stylized facts that help to motivate the widespread use of GARCH models in stock return volatility modeling. TABLE 1 ABOUT HERE 3.2 In-Sample Results We apply the modified ICSS algorithm to the S&P 500 and ten sector return series, and Figure 1 plots r t for each series, along with ±3-standard-deviation bands for each of the regimes defined by the structural breaks identified by the modified ICSS algorithm. 17 (The break dates are reported in Table 2.) The modified ICSS algorithm detects at least one structural break in the unconditional volatility for ten of the eleven return series. There are five breaks detected for Finance; four breaks for the S&P 500, Industrials, Information, Materials, and Electric utilities; three breaks for Consumer durables, Engineering, and Health; and one break for Gas utilities. Construction is the only return series for which a volatility break is not detected. FIGURE 1 ABOUT HERE Table 2 presents the full-sample GJR-GARCH(1,1) parameter estimates, as well as parameter estimates for the sub-samples defined by the structural breaks identified by the modified ICSS procedure. 18 For the full sample, the estimate of γ is significant for each series, indicating a significant leverage effect whereby negative news shocks have a larger effect on next period s volatility than postive shocks. The fitted full-sample GJR-GARCH(1,1) models are all highly persistent, with estimates of α + β + (γ/2) ranging from 0.986 to 0.997. These fullsample results asymmetric volatility and highly persistent conditional volatility are typical findings for stock return volatility. 
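To make the leverage and persistence measures concrete, the following minimal Python sketch assumes the standard GJR-GARCH(1,1) recursion, h_t = ω + α ε²_{t-1} + γ ε²_{t-1} 1(ε_{t-1} < 0) + β h_{t-1}, which is consistent with the persistence measure α + β + (γ/2) used in the text. The function names and the parameter values are hypothetical illustrations and are not estimates from Table 2.

```python
def gjr_next_variance(omega, alpha, gamma, beta, shock, h_prev):
    """One step of the assumed GJR-GARCH(1,1) recursion.

    The gamma term is switched on only when the previous shock is negative,
    which is what produces the leverage effect.
    """
    asymmetry = gamma if shock < 0 else 0.0
    return omega + (alpha + asymmetry) * shock ** 2 + beta * h_prev


def gjr_persistence(alpha, gamma, beta):
    """Persistence alpha + beta + gamma/2 (negative shocks half the time)."""
    return alpha + beta + gamma / 2.0


def gjr_unconditional_variance(omega, alpha, gamma, beta):
    """Unconditional variance omega / (1 - persistence), if it is finite."""
    persistence = gjr_persistence(alpha, gamma, beta)
    if persistence >= 1.0:
        raise ValueError("unconditional variance is not finite")
    return omega / (1.0 - persistence)


# Hypothetical parameter values chosen for illustration only.
params = dict(omega=0.01, alpha=0.02, gamma=0.10, beta=0.92)

# Same shock size, opposite signs: bad news raises next-period variance more.
after_bad_news = gjr_next_variance(shock=-1.0, h_prev=1.0, **params)
after_good_news = gjr_next_variance(shock=1.0, h_prev=1.0, **params)

persistence = gjr_persistence(alpha=0.02, gamma=0.10, beta=0.92)
long_run_variance = gjr_unconditional_variance(**params)
```

With these illustrative values the persistence is 0.99, so shocks to volatility decay slowly, and a negative shock raises next-period variance by γ ε² more than a positive shock of the same size.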
TABLE 2 ABOUT HERE

There are interesting contrasts to the full-sample parameter estimates when we estimate GJR-GARCH(1,1) models for the sub-samples defined by the structural breaks in unconditional volatility. Perhaps most notably, persistence as measured by α + β + (γ/2) falls relative to the full-sample measure of persistence in each sub-sample of the ten return series that contain at least one significant break. The decreases in persistence are often sizable; in the extreme, there are some sub-samples (for Engineering, Finance, and Electric utilities) where the estimates of α and γ are both zero, so that α + β + (γ/2) = 0 and the sub-samples are characterized by conditional homoskedasticity. The often sizable decreases in the persistence of the volatility processes relative to the full-sample estimates are a likely manifestation of the upward biases in persistence that result from failing to account for structural breaks, as discussed in Section 1 above. With respect to the leverage effect, the γ estimates for the sub-samples are often larger than those for the full sample, so that the leverage effect often becomes more pronounced for the sub-samples relative to the full sample. The estimates of ω and the unconditional variance (ω/{1 - [α + β + (γ/2)]}) also tend to vary considerably between sub-samples. For example, for Consumer durables, the unconditional variance is 1.084 for the full sample, while it ranges from 0.672 to 2.249 for the different sub-samples. Overall, the in-sample results show that failure to account for structural breaks in unconditional variance can mask important differences in GJR-GARCH(1,1) parameters across various periods.

17 We implement the modified ICSS algorithm using the GAUSS procedures available from Andreu Sansó's web page at http://www.uib.es/depart/deaweb/personal/professores/personalpages/andreusanso/we. The original code computes finite-sample critical values using a response surface that is appropriate for sample sizes up to 1,000 observations. We estimate an additional response surface to generate critical values that are appropriate for sample sizes up to 7,000 observations.

18 We use the GAUSS module Constrained Maximum Likelihood 2.0 to obtain QMLE GJR-GARCH(1,1) parameter estimates.

3.3 Out-of-Sample Results

We divide the full sample into in-sample and out-of-sample portions, where the out-of-sample portion consists of the last 1,000 observations (2/5/2002 to 1/19/2006). Table 3 reports the out-of-sample volatility forecasting results for each return series at horizons of 1, 20, 60, and 120 days.
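The main ingredients of this exercise can be sketched in a few lines of Python. The sketch is illustrative only and is not the authors' GAUSS code: it covers the mean and trimmed-mean combinations, the aggregated MSFE in equation (6), and the Clark and West (2007) adjusted statistic in equation (7). The function names, the array layout, and the Newey-West lag choice of s - 1 are assumptions made for illustration, and the forecasts are taken as already summed over the s-day horizon.

```python
import math
import numpy as np


def combine(forecasts, trim=False):
    """Average forecasts across models (rows); with trim=True, drop the
    highest and lowest forecast in each period before averaging."""
    f = np.sort(np.asarray(forecasts, dtype=float), axis=0)
    if trim:
        if f.shape[0] < 3:
            raise ValueError("trimming needs at least three forecasts")
        f = f[1:-1]
    return f.mean(axis=0)


def aggregate_squared_returns(returns, R, s):
    """Sum of the s most recent squared returns at each target date
    t = R+s, ..., T (1-based), as defined below equation (6)."""
    r2 = np.asarray(returns, dtype=float) ** 2
    T = r2.size
    return np.array([r2[t - s:t].sum() for t in range(R + s, T + 1)])


def msfe(realized, forecast):
    """Equation (6): average squared error over the P-(s-1) targets."""
    e = np.asarray(realized, dtype=float) - np.asarray(forecast, dtype=float)
    return float(np.mean(e ** 2))


def clark_west(realized, bench, model, lags=0):
    """t-statistic and upper-tail normal p-value based on equation (7).

    lags=0 uses the conventional OLS standard error; lags>0 uses a
    Newey-West (Bartlett kernel) long-run variance. Using lags = s - 1
    is an assumption here, not a choice stated in the text.
    """
    y, b, m = (np.asarray(x, dtype=float) for x in (realized, bench, model))
    f = (y - b) ** 2 - ((y - m) ** 2 - (b - m) ** 2)
    n = f.size
    u = f - f.mean()
    if lags == 0:
        lrv = u @ u / (n - 1)
    else:
        lrv = u @ u / n
        for k in range(1, lags + 1):
            lrv += 2.0 * (1.0 - k / (lags + 1)) * (u[k:] @ u[:-k]) / n
    if lrv <= 0:
        raise ValueError("zero variance in adjusted loss differentials")
    t_stat = f.mean() / math.sqrt(lrv / n)
    p_value = 0.5 * math.erfc(t_stat / math.sqrt(2.0))  # P(Z > t_stat)
    return float(t_stat), p_value


# Hypothetical aggregated forecasts from four estimation windows (rows).
windows = np.array([[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [10.0, 1.0]])
mean_windows = combine(windows)
trimmed_mean_windows = combine(windows, trim=True)
```

A CM combined forecast in this layout would be `combine` applied to the two rows holding the expanding window and 0.25 rolling window forecasts. An MSFE ratio is then `msfe(realized, competitor) / msfe(realized, benchmark)`.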
For each forecasting method, Table 3 gives the MSFE, as well as the ratio of the MSFE for an individual forecasting method to the MSFE for the benchmark GJR-GARCH(1,1) expanding window model. A ratio below unity thus indicates that the forecasting method beats the benchmark GJR-GARCH(1,1) expanding window model according to the MSFE metric. The p-values in Table 3 correspond to the Clark and West (2007) test of the null hypothesis of equal expected loss against the alternative hypothesis that the competing model has a lower expected loss than the benchmark GJR-GARCH(1,1) expanding window model. A small p-value indicates that the MSFE for the competing model is significantly less than that for the benchmark model. 19

19 Note that the GJR-GARCH(1,1) expanding window model consistently outperforms a GARCH(1,1) forecasting model estimated using an expanding window for the stock return series. This provides out-of-sample evidence of leverage effects in stock returns and is in line with the out-of-sample volatility forecasting results in Awartani and Corradi (2005) and Hansen and Lunde (2005). To conserve space, we do not report the complete GARCH(1,1) model forecasting results; they are available upon request from the authors.

TABLE 3 ABOUT HERE

Focusing first on the results for S&P 500 returns in Table 3, we see that the two rolling window models and the GJR-GARCH(1,1) with breaks model offer improvements in forecasting accuracy relative to the GJR-GARCH(1,1) expanding window model only in some instances. While the MSFE ratios for the two rolling and break-adjusted windows are all less than unity at the 1-day horizon, the ratios are all above unity at the 20- and 60-day horizons. At the 120-day horizon, the GJR-GARCH(1,1) 0.50 rolling window model beats the benchmark GJR-GARCH(1,1) expanding window model, but the GJR-GARCH(1,1) 0.25 rolling window and GJR-GARCH(1,1) with breaks models are unable to beat the benchmark. These results suggest that it is difficult to identify a particular estimation window for a GJR-GARCH(1,1) model that consistently produces the most accurate volatility forecasts. The RiskMetrics and moving average models never offer forecasting gains for S&P 500 returns relative to the benchmark model, and the MSFE ratios for these methods are typically well above unity. Some of the combination forecasts, especially CM combined, appear to offer more promise for forecasting the volatility of S&P 500 returns. The ratio for the CM combined forecast is below unity at horizons of 1, 20, and 120 days, and it produces the lowest MSFE of all the forecasting methods at horizons of 20 and 120 days.

The in-sample results reported in Section 3.2 above and the theoretical results in Clark and McCracken (2004) and Pesaran and Timmermann (2007) help to explain the out-of-sample volatility forecasting results for S&P 500 returns in Table 3. Recall from Table 2 and Figure 1 that the modified ICSS algorithm applied to the full sample detects four structural breaks in unconditional volatility. With this many breaks, it will likely be difficult in a real-time volatility forecasting setting to determine the optimal window size (in terms of the bias-efficiency tradeoff) for estimating the GJR-GARCH(1,1) forecasting model for any given horizon. Averaging across forecasts generated by models estimated using different window sizes recognizes this uncertainty and helps to better approximate a forecast generated using an optimal estimation window.

Turning to the results for the individual sectoral returns in Table 3, we see that the GJR-GARCH(1,1) forecasting models estimated using the rolling and break-adjusted windows are generally not able to consistently outperform the benchmark GJR-GARCH(1,1) expanding window model at all reported horizons. The main exception is Engineering, where the ratios for the forecasts generated by the GJR-GARCH(1,1) models based on the rolling and break-adjusted windows are all below unity at all reported horizons. Similar to the results for S&P 500 returns, the RiskMetrics and moving average models are almost always outperformed by the benchmark GJR-GARCH(1,1) expanding window model. Some of the combination forecasts perform reasonably well overall for the sectoral returns, especially for Consumer durables, Construction, Engineering, Finance, Industrials, Gas utilities, and Electric utilities at the 1-day horizon. At longer horizons of 20, 60, and 120 days, one or more of the combination forecasts perform well at each horizon for Construction, Engineering, Health, Information, Materials, Gas utilities, and Electric utilities. In a number of instances, the p-values indicate that the combination forecasts are significantly more accurate than the GJR-GARCH(1,1) expanding window model forecasts at conventional significance levels. There are a number of cases where there is a sizable reduction in MSFE relative to the benchmark model; for example, the combination forecasts offer reductions in MSFE of approximately 20% to 30% for Engineering at the 120-day horizon.

Among the combination forecasts, the CM combined method appears to do well on a reasonably consistent basis for the sectoral returns. It delivers the lowest MSFE ratio among all of the forecasting methods for at least one of the reported horizons for Consumer durables, Construction, Industrials, Information, Materials, Gas utilities, and Electric utilities. There are also a number of instances where the p-values indicate that the CM combined forecasts are significantly more accurate than the benchmark GJR-GARCH(1,1) expanding window model forecasts.

Table 4 provides a summary of the out-of-sample volatility forecasting results in Table 3 across all of the return series at each horizon and across all of the return series and all of the horizons.
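A summary of this kind can be computed directly from the MSFE ratios. The short Python sketch below assumes that a method "beats" the benchmark whenever its MSFE ratio is strictly below one. The function name, the array layout, and the ratio values are hypothetical and are not taken from Table 3 or Table 4.

```python
import numpy as np


def share_beating_benchmark(msfe_ratios):
    """Fraction of cases with an MSFE ratio strictly below one.

    msfe_ratios: array of shape (n_series, n_horizons) for one method.
    Returns the share at each horizon and the share across all cases.
    """
    wins = np.asarray(msfe_ratios, dtype=float) < 1.0
    return wins.mean(axis=0), float(wins.mean())


# Hypothetical ratios for three return series (rows) at two horizons (columns).
ratios = np.array([[0.90, 1.10],
                   [0.80, 0.95],
                   [1.20, 1.00]])
by_horizon, overall = share_beating_benchmark(ratios)
```

A ratio of exactly one counts as a tie rather than a win under this convention.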
In line with our observations in the previous paragraph, the CM combined forecast, which is formed as the average of the GJR-GARCH(1,1) expanding window and 0.25 rolling window model forecasts, appears to offer the best performance overall. It beats the benchmark GJR-GARCH(1,1) expanding window model well over half of the time at horizons of 1, 20, and 120 days, and across all horizons, it beats the benchmark model 73% of the time. The only other forecasting method that beats the benchmark model more than half of the time across all horizons is Mean-windows (53%).

Overall, the in-sample results for the sectoral returns show that breaks in unconditional variance are prevalent for most sectors. As with the out-of-sample results for S&P 500 returns, the prevalence of structural breaks can make it difficult to identify the optimal window size in practice, and combining forecasts across forecasting models estimated with different estimation window sizes seems to offer a useful way of dealing with the uncertainty surrounding the optimal estimation window size.

TABLE 4 ABOUT HERE

4 Conclusion

In this chapter, we analyze stock return volatility forecasting in the presence of structural breaks. Our in-sample results indicate that structural breaks in unconditional variance are an empirically relevant characteristic of stock return volatility. In our out-of-sample analysis, we focus on the forecasting performance of GJR-GARCH(1,1) models estimated using windows of various sizes designed to accommodate structural breaks, including long and short rolling windows and a window whose size is selected based on a test for variance breaks applied to the data available at the time of forecast formation. We do not find that models estimated using any of these particular window sizes consistently outperform forecasts generated by a benchmark GJR-GARCH(1,1) model estimated using an expanding window. We interpret this result in light of recent research by Clark and McCracken (2004) and Pesaran and Timmermann (2007) and consider combining forecasts generated by models estimated using different window sizes. Combination forecasts are often able to deliver more accurate forecasts than the benchmark GJR-GARCH(1,1) expanding window model. In particular, combination forecasts based on averaging over volatility forecasts generated by GJR-GARCH(1,1) models estimated using expanding and short rolling windows appear to be an especially effective method for forecasting stock return volatility.

Our findings in this chapter are similar to those in Rapach and Strauss (2007), who analyze forecasts of exchange rate return volatility generated by GARCH(1,1) models estimated using different window sizes. 20 Using the modified ICSS algorithm, they find evidence of one or more structural breaks in unconditional variance for a number of daily U.S. dollar exchange rate return series. They also find that combination forecasts formed by averaging across forecasts generated by GARCH(1,1) models estimated using different window sizes are often able to outperform forecasts generated by a benchmark GARCH(1,1) expanding window model.

20 Rapach and Strauss (2007) use the GARCH(1,1) model for modeling exchange rate return volatility instead of the GJR-GARCH(1,1) model, as the leverage effect is not as apparent for exchange rate return volatility as for stock return volatility; see, for example, Hansen and Lunde (2005).