Forecasting Market Returns: Bagging or Combining?

Steven J. Jordan, Econometric Solutions, econometric.solutions@yahoo.com
Andrew Vivian, Loughborough University, a.j.vivian@lboro.ac.uk
Mark E. Wohar, University of Nebraska-Omaha and Loughborough University, mwohar@mail.unomaha.edu

March 2015

Keywords: Return forecasting, Fundamentals, Macro variables, Technical indicators, Emerging markets, Asia, G7.

Abstract

We provide evidence on applying the bagging method to forecast stock returns out-of-sample for the G7 and a broad set of Asian countries for which there is little prior evidence. We focus on using the recently developed bagging method that explicitly addresses model uncertainty and parameter uncertainty. We are amongst the first to apply the bagging method to market return predictability and amongst the first to examine if bagging can generate economic gains. We find that, when portfolio weight restrictions are applied, bagging generally improves forecast accuracy and generates economic gains relative to the benchmark; bagging also performs well compared to forecast combinations in this setting. We also provide new evidence that the results for bagging cannot be fully explained by data mining concerns. Finally, we report that bagging generates economic gains in G7 countries and overall these gains are highest for countries with high trade openness and high FDI. The potentially substantial economic gains could well be operational given the existence of index funds for most of these countries.

1. INTRODUCTION

A review of the extant stock return forecasting literature indicates that most out-of-sample return forecasting evidence rests on major developed countries and is especially focused on US data. This literature generally suggests that it is difficult to consistently outperform a simple benchmark. For example, Goyal and Welch (2003) state: "By assuming that the equity premium was 'like it always has been', a trader would have performed at least as well in most of our samples." The majority of the international literature considers several macro variables as well as fundamental variables based on dividends and earnings.[1] The international literature, which focuses on major developed countries, has generally provided mixed evidence on the extent of out-of-sample (OOS) predictability.[2] However, there has been very little prior international evidence on amalgamating information from these different predictor variables (Jordan and Vivian, 2011, is a notable exception that considers one simple forecast combination technique).

The main contributions of this paper are threefold: (i) to provide a rigorous and detailed analysis of the bagging method (Inoue and Kilian, 2008), (ii) to provide evidence on out-of-sample forecasting for a set of Asian stock returns, and (iii) to incorporate data mining critical values for appropriate inference on bagging and combination forecast methods, which may be particularly important for the set of G7 countries.

According to Inoue and Kilian (2008, p. 511): "Bagging involves generating a large number of bootstrap resamples of the original forecasting problem, applying a pretest model selection rule to each of the resamples, and averaging the forecasts from the models selected by the pretest on each bootstrap sample." An advantage of this method is that it explicitly accounts for model and parameter instability.
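The procedure quoted above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation. It assumes numpy, a one-step horizon (h = 1), and the settings the paper reports in Section 3.C: a pretest that keeps predictors whose absolute t-statistic is at least 1.645, and a moving-block bootstrap with block size m = h = 1. Two simplifications are assumptions of this sketch: it uses conventional homoskedastic OLS standard errors rather than Newey-West HAC errors, and a default of B = 200 resamples rather than 1000. The function names are hypothetical. Row s of `X` is taken to hold the predictors observed at time s, and `y[s]` the return realized over the following month.

```python
import numpy as np

def ols_tstats(Z, y):
    """OLS coefficients and conventional (homoskedastic) t-statistics."""
    beta = np.linalg.lstsq(Z, y, rcond=None)[0]
    resid = y - Z @ beta
    sigma2 = resid @ resid / (Z.shape[0] - Z.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.pinv(Z.T @ Z)))
    return beta, beta / se

def pretest_forecast(X, y, x_new, crit=1.645):
    """Fit the full model, drop predictors with |t| < crit, refit, forecast."""
    x_new = np.asarray(x_new, dtype=float)
    const = np.ones((len(y), 1))
    _, t = ols_tstats(np.hstack([const, X]), y)
    keep = np.abs(t[1:]) >= crit              # the constant is always retained
    beta, _ = ols_tstats(np.hstack([const, X[:, keep]]), y)
    # Error term set to its expected value of zero.
    return beta[0] + x_new[keep] @ beta[1:]

def bagging_forecast(X, y, x_new, B=200, block=1, seed=0):
    """Average the pretest forecast over B moving-block bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_blocks = int(np.ceil(n / block))
    forecasts = np.empty(B)
    for b in range(B):
        # Draw block start points with replacement, then expand into row indices.
        starts = rng.integers(0, n - block + 1, size=n_blocks)
        idx = (starts[:, None] + np.arange(block)).ravel()[:n]
        # Resample (y, X) rows jointly; forecast with the actual latest predictors.
        forecasts[b] = pretest_forecast(X[idx], y[idx], x_new)
    return forecasts.mean()
```

Because each resample can select a different subset of predictors, the averaged forecast smooths over the hard keep-or-drop decision of a single pretest, which is the sense in which bagging addresses model uncertainty.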
[1] Bossaerts and Hillion (1999) use fundamental variables including long-term bond excess returns, T-bill returns, the stock market's price level, the market's dividend yield, and the market's price-earnings ratio. Rapach, Wohar, and Rangvid (2005) examine predictability of macro variables including: money market rate, T-bill rate, bond spread, long-term bond yield, inflation rate, industrial production growth, narrow money growth, broad money growth, and change in the unemployment rate. Rapach and Wohar (2009) use the dividend-price ratio. Giot and Petitjean (2011) use fundamental variables including long-term bond spread, T-bill returns, the market's dividend yield, and the market's price-earnings ratio. Jordan and Vivian (2011) use fundamental ratios (dividend-price, consumption-price, and output-price) and their growth adjusted counterparts.

[2] Jordan (2012) examines one particular type of time-variation in aggregate returns, long-term reversals. He finds these reversals can be explained by asset pricing models with time-varying risk factors and time-varying alphas. Jordan's results suggest that macroeconomic factors are important determinants of time variation in equity returns which can help explain aggregate return reversals.

Bagging has recently been implemented for forecasting economic variables, with applications to US inflation (Inoue and Kilian, 2008) and US employment growth (Rapach and Strauss, 2010), but there is, to our knowledge, no empirical evidence on the effectiveness of bagging for a dataset of international stock market returns. Results from the bagging method are compared to the no-predictability benchmark (a random-walk with drift model) and a discussion is provided on the performance of bagging compared to forecast combination methods. While forecast combination methods have been well established as effective in improving forecast accuracy in many disparate applications (Clemen, 1989), they have received little attention in stock return forecasting applications until recent studies of the US (Rapach, Strauss and Zhou, 2010) and the set of G7 countries (Jordan and Vivian, 2011).

Prior international studies of equity market return predictability generally utilize a similar set of large economies consisting of the major developed countries.[3] This raises important issues. Firstly, does out-of-sample stock return predictability exist outside the major developed markets? This question emanates from the fact that there is relatively little evidence on emerging markets compared to the mass of evidence that has been produced for the G7 countries. Secondly, major developed countries' stock returns are highly correlated with each other; for example, in the G7 monthly correlations are about 0.7 amongst country pairs. This suggests that data from these countries are only partly independent and may reflect one common effect rather than confirming multiple times that a relationship holds. Thus, results that generally hold for G7 markets may not transfer well to other markets. In contrast, Asian financial markets have much lower correlations with each other and thus provide more independent evidence than a G7 sample.
[3] Bossaerts and Hillion (1999) use data from after 1969 to 1995 for Australia, Belgium, Canada, France, Germany, Italy, Japan, Netherlands, Norway, Spain, Sweden, Switzerland, the UK, and the US. Rapach, Wohar, and Rangvid (2005) use data from the mid 1970s to the late 1990s for Belgium, Canada, Denmark, France, Germany, Italy, Japan, Netherlands, Norway, Sweden, the UK, and the US. Rapach and Wohar (2009) study the G7 countries. Giot and Petitjean (2011) use data starting between the early 1950s and the late 1960s and ending in 2005 for Australia, Canada, France, Germany, Japan, Netherlands, South Africa, Sweden, the UK, and the US. Jordan and Vivian (2011) use a long data set from 1927 to 2009 for Australia, Germany, France, Italy, Sweden, the UK, and the US. McMillan and Wohar (2011) use less sophisticated measures of economic value but include the G7 countries and several Asian countries in their sample: Hong Kong, Korea, Malaysia and Singapore. Guidolin, Hyde, McMillan and Ono (2009) study the G7.

Moreover, Asian economies are of global interest: they produce about 30% of global economic output and make up over 40% of the world population, and their financial

markets are emerging as an important investment class. These countries also differ in terms of economic, institutional, and cultural characteristics in comparison to the US and other major developed countries.[4] We propose that investigating the effectiveness of bagging techniques for Asian market returns is of interest in its own right; nonetheless, we also provide corroborating evidence that the main results generally also hold for the G7 countries. Further, by providing analysis on both Asian and G7 markets, we can provide broad evidence on the cross-country determinants of aggregate return forecastability across markets with very different characteristics.

Our paper provides evidence on the following six questions:

(i) Can the recently developed bagging method beat the historical average benchmark for stock return forecasts?[5] Does this hold consistently in Asian markets and for the G7 economies? We test if the performance of bagging is the same as the historical average benchmark and compare results from bagging with forecast combination techniques in terms of both statistical and economic significance. Notably, this paper includes China and India, key global economies, for which there is relatively little prior literature on aggregate stock return forecasting (the few exceptions include Goh et al., 2013, Jordan et al., 2014 and Sousa et al., 2014).

[4] Compared to the G7 countries, our sample of Asian countries consists of firms that are (1) smaller and (2) higher B/M, with (3) more negative-B/M firms. Culturally, our Asian sample of countries is more accepting of power inequalities and less individualistic than the G7 countries. If culture is measured via the Hofstede Cultural Indices, then the cultural difference is reflected in the fact that the average Asian power distance index is 74, while that of the G7 is only 46 (a difference of 46%). These two groups differ on individualism too, with an average individualism index of 26 for Asia and 74 for the G7 (a difference of 65%). The other major cultural difference is that our Asian countries tend to maintain a long-term view, especially compared to the Western countries that comprise the G7. The Hofstede long-term orientation index is 71 for our Asian sample, while it is only 37 for the G7 (a difference of 91%). Legally, our Asian sample has a higher percentage of common law countries and several have specific Sharia codes that apply to their Muslim population. No G7 country has such Sharia compliant personal codes.

[5] We utilize an historical average benchmark to be consistent with prior literature. However, as a robustness exercise we examine an AR(p) benchmark, which produces similar results to the historical average. Asian markets are less liquid than the G7 economies typically used in past studies. Return autocorrelation can occur in less liquid markets. Robustness results using the AR(p) benchmark demonstrate that our predictor variables are not just capturing this autocorrelation effect.

(ii) Can data mining account for evidence of predictability? Prior literature focuses on the G7, applying bivariate regressions multiple times. Rapach and Wohar (2006) demonstrate that data mining can partly account for evidence on aggregate return forecastability in the US. This suggests data mining could be a concern. Our inference is conducted using data mining robust critical values adapting the

approach of Rapach and Wohar (2006), which is built on the seminal work of White (2000) and Inoue and Kilian (2004). The methods that have been developed to adjust for data mining do so only amongst the variables considered in a given study, rather than amongst all variables that have ever been examined for a given country. Hence, while the data mining issue is more acute for the G7 countries, the methods to account for data mining can more fully address this concern in the Asian sample, where the variables in this study are closer to the set of all variables which have been examined.[6] Further, we apply these data mining robust critical values to forecast combinations and to bagging, which, to our knowledge, has not been previously implemented.

(iii) Does bagging add economic value? Prior applications of bagging to US macroeconomic series and the US bond market do not examine the economic value of bagging forecasts. From a practical perspective, our results suggest implementing equity index trading strategies based upon the new bagging method in medium and smaller markets could help investors to time-vary their portfolio allocations between debt and equity. Hence, we provide new evidence that bagging can generate economic value.

(iv) Which measure of forecast accuracy (for bagging and combination methods) is most closely associated with economic value? Recent literature, in a different context, emphasizes the role of sign forecasts, i.e., the correct direction of the forecast (see for example Guidolin et al., 2009, 2014). We investigate five measures of forecast accuracy and examine how closely these measures are related to economic value. In our context of linear models, we examine if it is the direction of the forecast or the magnitude of the forecast that is more important.

(v) Does imposing forecast restrictions, which are implemented when calculating economic value, yield improvements in forecast accuracy?
[6] The US, in particular, and the G7 more broadly have previously had a huge number of variables used as aggregate return predictors. Strictly speaking, all these previously implemented predictors should be included in the data mining procedure, not just those considered in an individual study. However, many variables that have been previously studied do not have readily obtainable data. Thus, current data mining methods can only partially account for data mining concerns in G7 countries and results should be interpreted cautiously. Another issue is that the data mining procedure focuses on a single (forecast) horizon and does not take account of the different time horizons at which return predictability has been conducted for the G7 countries.

Do economic value and forecast accuracy measures yield more

consistent results when such restrictions are imposed? Another important contribution is to investigate if the economic value and forecast accuracy results differ for the bagging method. In particular, we examine the role that portfolio weight restrictions play and if applying these restrictions to winsorize (extreme) forecasts improves standard metrics of forecast accuracy.

(vi) What can explain the cross-country variation in market return forecasts from the bagging method? We can examine this issue broadly using data from the G7 and 10 additional Asian markets. Here we investigate several types of factors including economic development, economic openness, stock market liquidity, information availability and legal regulations. This enables us to further our understanding beyond that provided in prior analysis that considered alternative settings.[7]

[7] For example, Jordan, Vivian and Wohar (2014) provided some evidence on the cross-country variation in market return forecasts for European countries from bivariate predictive regressions (and a simple average combination). They examined the role of stock market liquidity, size and development. The current paper considers a larger number of countries, a broader set of factors and focuses on bagging and a wider range of combination methods.

2. DATA DESCRIPTION

We examine two samples of countries: (1) the G7 countries for comparison to prior work and (2) a subset of Asian countries for out-of-sample testing. Our eleven Asian countries are China (CH), Hong Kong (HK), India (IN), Indonesia (ID), Japan (JP), Korea (KO), Malaysia (MY), Philippines (PH), Singapore (SG), Thailand (TH) and Taiwan (TW), over a balanced sample period of 1995-2011 employing monthly data. The selection criteria are to include Central, East, South, and Southeast Asian countries for which (1) there is a Datastream (DS) Total Market Index and (2) data are available from at least 1995.[8]

[8] The following countries did not have a DS Total Market Index: Afghanistan, Bhutan, Brunei, Burma, Cambodia, East Timor, Laos, Macau, Maldives, Mongolia, Nepal, Tajikistan, Turkmenistan, and Uzbekistan. For the following countries, market index data (not DS index) started after 1995: Bangladesh, Kazakhstan, and Vietnam. At the suggestion of a referee we excluded two small illiquid markets (Pakistan and Sri Lanka) and included Taiwan using alternative data from Datastream, such as the policy interest rate as a proxy for the risk-free rate.

Notably, there is little prior OOS forecasting evidence for our subset of Asian

countries, with the exception of Japan.[9] Further, we choose the 1995 date because it means we have 96 monthly observations before beginning the OOS forecasting period in 2003, which allows for a reasonably sized OOS test period. The data are primarily from Thomson Datastream. We collect monthly data from January 1995 to June 2011; the start date enables inclusion of a wide number of Asian countries for which there is virtually no prior OOS forecasting evidence.

[9] In particular, there is very little prior OOS forecasting evidence on the economic value of return forecasts (portfolio allocation evidence) in our sample of countries, except for Japan.

We obtain data for 10 variables, including 8 of the variables used by Goyal and Welch (2008, hereafter GW). We include the following fundamental variables from GW:

Dividend price ratio (log), (DP): Difference between the log of dividends paid on the market index and the log of market index price, where dividends are measured using a one-year moving sum.
Dividend yield (log), (DY): Difference between the log of dividends and the log of one month lagged market index price.
Earnings price ratio (log), (EP): Difference between the log of earnings on the market index and the log of stock prices, where earnings are measured using a one-year moving sum.
Book-to-market ratio, (BM): Ratio of book value to market value for the market index.

We include the following two macroeconomic variables from GW:[10]

Risk-free rate, (RF): Interest rate on a low risk short-term security.
Inflation, (INFL): Calculated from CPI; since inflation rate data are generally released in the following month, we use one month lagged inflation data.

[10] Jordan (2012) demonstrates that macro variables are able to capture predictability in international market returns.

We include the following two technical variables from GW:

Stock variance, (SVAR): Sum of squared weekly returns on the market index.

Net equity expansion, (NTIS): Ratio of twelve-month moving sums of net issues by listed stocks to total market capitalization of the index.

We consider two new variables in this context, which are also of interest to technical traders:

Price Pressure, (PRES): Calculated as the ratio of the number of rising stocks to the number of falling stocks in the previous month.
Change in Volume, (CVm): Calculated as the monthly change in the volume of traded stocks (in the index).

Table 1 provides a summary of descriptive statistics for our sample countries. We report the mean and the standard deviation for each independent variable used and for the aggregate market return. There are several interesting comparisons. First, the average nominal returns (RET) vary substantially across countries, from -0.0015 (-0.15% per month or -1.8% per year compounded) in Japan up to 0.0111 (1.11% per month or 14.2% per year compounded) in Indonesia. The standard deviation of returns also varies substantially across countries, from 0.0422 for the UK to 0.1048 for China. The risk-free rate also varies substantially across countries, from a low of 0.000 for Japan to a high of 0.007 for India. This means the Sharpe ratio (excess return per unit of risk) varies dramatically across countries, from over 0.09 in the US to -0.029 in the Philippines. This wide variation across countries exists for most of the variables we study.

[INSERT TABLE 1 AROUND HERE]

Table 2 contains the correlation matrix for returns across our sample countries. There is substantial difference in correlation between the different country pairs. Interestingly, the cross-correlations in Asian countries' returns are modest, on average about 0.40; the lowest correlation is between China and Indonesia at 0.195. This is interesting since the G7 developed markets typically used

in prior literature tend to have correlations above 0.70 on average; this is important since existing methods that adjust for data mining do not account for cross-country effects.

[INSERT TABLE 2 AROUND HERE]

3. METHODOLOGY[11]

[11] This section draws upon Rapach and Strauss (2010).

3.A. Assessing the Impact of Individual Variables

Individual predictive regression models are used to estimate the linkage between the dependent, lagged dependent, and a potential predictor variable (including its lags). Define $\Delta RI_t = RI_t - RI_{t-1}$, where $RI_t$ is the log-level of the total stock return index (stock price index that includes reinvested dividends) at month t. In addition, define

$$y^h_{t,t+h} = \frac{1}{h} \sum_{j=1}^{h} \Delta RI_{t+j}$$

so that $y^h_{t,t+h}$ is the (approximate) monthly growth rate of the total return index, i.e., the average monthly stock return, from time t to t + h, where h is the forecast horizon. In this section we outline the general models for h-step-ahead forecasts; however, in the empirical analysis we focus purely on 1-step-ahead forecasts (h = 1). These predictive regression models take the form of (1) below:

$$y^h_{t,t+h} = \alpha + \lambda x_{i,t} + \varepsilon^h_{t+h} \qquad (1)$$

This model can be employed to estimate h-step-ahead forecasts of stock returns using a recursive expanding window. That is, $y^h_{t,t+h}$ is linear in the potential predictor variable (i.e., $x_{i,t}$). The parameter $\alpha$ is a constant. The parameter $\lambda$ captures the effect of the potential predictor variable. Finally, $\varepsilon^h_{t+h}$ is an error term. For each country's stock return, 10 regression models are estimated, one for each of the 10 explanatory variables. The models generated are used to conduct 1-step-ahead out-of-sample forecasts of stock returns. These forecasts are then compared to the respective benchmark model, which takes the form of the

historical average (a random walk with drift model). The benchmark model has the same form as (1) with the slope coefficient restricted to zero ($\lambda = 0$).

3.B. Assessing the Impact of Variable Groups

3.B.1. Forecast Combinations

The method of forecast combining is considered to be a useful technique for "...sharing strengths of different forecasting procedures..." (Yang, 2004:205) and is an alternative method of imposing "...structure on high-dimensional forecasting models" (Stock and Watson, 2004:1). Empirical as well as theoretical evidence (see Clemen and Winkler, 1986; Rapach and Strauss, 2010; Stock and Watson, 2003; 2004; and Yang, 2004) indicates that forecast combining generally improves the predictive ability of models because it includes more variables or potential predictors, thus increasing the amount of information used in generating forecasts. Recent work by Rapach, Strauss and Zhou (2010) demonstrates that combination forecasts are also useful for US stock index returns.

The forecast combination methods used in this paper are the mean, median, trimmed mean, discount mean squared forecast error (MSFE), cluster (C), and principal components (PC) combinations. Following Rapach and Strauss (2008, 2010), these combination methods can be described as in (2):

$$\hat{y}^h_{CB,t+h|t} = \sum_{i=1}^{n} w_{i,t}\, \hat{y}^h_{i,t+h|t} \qquad (2)$$

where $\hat{y}^h_{CB,t+h|t}$ is the combined forecast of the variable of interest, $\hat{y}^h_{i,t+h|t}$ is the forecast from individual regression model i in (1), and $w_{i,t}$ is the weight placed on that individual forecast. The weights sum to unity.
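As a rough illustration of how the individual forecasts from (1), the historical-average benchmark, and the weighted combination in (2) fit together for h = 1, consider the sketch below. It assumes numpy and an expanding estimation window. The arrays `r` (monthly returns) and `X` (one column per predictor) are hypothetical inputs rather than the paper's data, and the function names are illustrative.

```python
import numpy as np

def recursive_forecasts(r, X, first):
    """One-step-ahead forecasts of r[t+1] formed at t = first-1, ..., T-2.

    r : (T,) returns; X : (T, n) predictors observed at the same dates.
    `first` is the number of in-sample observations.
    Returns (individual, benchmark, actual) with shapes (P, n), (P,), (P,).
    """
    T, n = X.shape
    indiv, bench, actual = [], [], []
    for t in range(first - 1, T - 1):
        y = r[1:t + 1]                        # returns r_1, ..., r_t
        row = []
        for i in range(n):
            x = X[:t, i]                      # lagged predictor x_0, ..., x_{t-1}
            Z = np.column_stack([np.ones(t), x])
            alpha, lam = np.linalg.lstsq(Z, y, rcond=None)[0]
            row.append(alpha + lam * X[t, i]) # forecast from model (1)
        indiv.append(row)
        bench.append(r[:t + 1].mean())        # historical average through time t
        actual.append(r[t + 1])
    return np.array(indiv), np.array(bench), np.array(actual)

def combine(indiv, weights):
    """Equation (2): weighted sum of the individual forecasts."""
    weights = np.asarray(weights, dtype=float)
    if not np.allclose(weights.sum(axis=-1), 1.0):
        raise ValueError("combination weights must sum to one")
    return (indiv * weights).sum(axis=1)
```

Here `weights` may be a single vector of length n or a (P, n) array when the weights change each period, as they do for the performance-based schemes.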

Simple Combinations

Simple combination methods (mean, median, and trimmed mean) differ from other combining methods in that they do not take into account the historical performance of the individual regression forecast models (Stock and Watson, 2003; 2004), and as a result they do not require a "hold-out" period for calculating the weights (Rapach and Strauss, 2010). Three simple combination methods are considered in this paper, namely the mean, median and trimmed mean. Mean combinations are the weighted summation of the forecasts whereby all individual forecasts receive an equal weighting, i.e., $w_{i,t} = 1/n$ (where n is the number of individual regression forecasts). The sample median of the individual regression model forecasts is used for the median combination. Trimmed mean combinations exclude the lowest and highest individual forecasts. The remaining forecasts are aggregated with a weight of $w_{i,t} = 1/(n-2)$ assigned to each forecast (Rapach and Strauss, 2005; 2010).

Discount MSFE Combining Method

The method of discount MSFE combinations allows the modeler to place more value on the recent historical performance of the individual regression forecasts, which is measured using the MSFEs (Rapach and Strauss, 2010). The weights used in this combination process, given in (3), depend on the discount factor $\delta$, which means that the weights are a function of the recent historical forecasting performance of the individual regression models. As $\delta$ decreases, the weight attributed to the most recent historical forecasts increases. A discount factor of unity reduces the combination method to the Bates and Granger (1969) optimal combination forecast when the individual forecasts are uncorrelated (Stock and Watson, 2004). Following Stock and Watson (2004), we use $\delta = 1.0$ and $\delta = 0.9$.

$$w_{i,t} = \frac{m_{i,t}^{-1}}{\sum_{j=1}^{n} m_{j,t}^{-1}} \qquad (3)$$

where

$$m_{i,t} = \sum_{s=T_0}^{t-h} \delta^{\,t-h-s} \left( y^h_{s+h} - \hat{y}^h_{i,s+h|s} \right)^2$$

and $0 < \delta \le 1$.
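The simple and discount MSFE schemes above are short enough to write out directly. The sketch below assumes numpy; `f` is a hypothetical (P, n) array of individual forecasts (one row per forecast date), and `f_hist`, `y_hist` are the past forecasts and realizations available when the weights in (3) are computed. The function names are illustrative.

```python
import numpy as np

def mean_combo(f):
    """Equal weights w = 1/n on every individual forecast."""
    return f.mean(axis=1)

def median_combo(f):
    """Cross-sectional median of the individual forecasts."""
    return np.median(f, axis=1)

def trimmed_mean_combo(f):
    """Drop the lowest and highest forecast, weight the rest by 1/(n-2)."""
    s = np.sort(f, axis=1)
    return s[:, 1:-1].mean(axis=1)

def dmsfe_weights(f_hist, y_hist, delta=0.9):
    """Weights from (3): inverse discounted sums of squared forecast errors.

    f_hist : (S, n) past forecasts, oldest first; y_hist : (S,) realizations.
    """
    S = len(y_hist)
    # Exponent t-h-s: the most recent error gets delta**0, the oldest delta**(S-1).
    disc = delta ** np.arange(S - 1, -1, -1)
    m = (disc[:, None] * (y_hist[:, None] - f_hist) ** 2).sum(axis=0)
    inv = 1.0 / m
    return inv / inv.sum()
```

With `delta=1.0` every past error counts equally; with `delta=0.9` a model whose errors were small recently receives a larger weight than one whose small errors lie further back. A model with a perfect record would give `m = 0`, so a production version would need to guard that division.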

Cluster Combining Method

The cluster combination method, as developed by Aiolfi and Timmermann (2006), is a conditional combining method that incorporates information about the forecast persistence and historical performance of individual models (Rapach and Strauss, 2010). We employ their Previous Best Conditional Combination algorithm, C(K,PB), where C denotes a cluster combination, K the number of clusters, and PB the Previous Best conditional combination strategy. In particular, the individual cluster combination forecasts are computed by grouping the individual regression model forecasts (from (1)) into K equal-sized clusters based on MSFE over the past 36 months of data. For the initial OOS forecast, the holdout out-of-sample period (denoted as $P_0$ and comprising 36 observations) is used to determine the initial clusters.[12] The first cluster comprises the individual regression models with the lowest MSFE values; the second cluster comprises the regression models with the next lowest MSFE values, and so on. We apply the previous best approach (as in Rapach and Strauss, 2010), which focuses solely on the cluster that includes the models with the lowest MSFE. The cluster combination forecast is then simply an average of the forecasts for the next period (time t+1) from those models in the cluster with the lowest MSFE. That is, each model within the lowest MSFE cluster (at time t) is given an equal weighting. Clusters are formed for each estimated period from the initial holdout out-of-sample period to the end of the out-of-sample period. Following Aiolfi and Timmermann, we consider K = 2 and K = 3.

Principal Component Combining Methods

The principal component method of forecast combining comprises the extraction of the (r) principal components from the individual regression forecasts. Following Chan et al. (1999) and Stock and Watson (2004), we generate a combination forecast from the first r principal components.
These components are then used to form a regression in which the weights are estimated using OLS.

[12] A rolling window of 36 monthly observations is used and applied for each time period during the OOS period to determine the constituents of each cluster.

This

method differs from dynamic factor and factor-augmented vector autoregression models in that the factors are estimated from the panel of forecasts and not from the actual variables:

$$y^h_{s+h} = \phi_1 \hat{p}^h_{1,s+h|s} + \dots + \phi_m \hat{p}^h_{m,s+h|s} + v^h_{s+h} \qquad (4)$$

where m denotes the number of retained components. The combination forecast is then simply the sum of the principal components multiplied by their estimated weights. Principal components are linearly estimated from the individual regression forecasts such that each $\hat{p}$ captures the variability and information contained in the individual forecasts. The first component explains the largest share of the variability, followed by the second, third and so on (Chan et al., 1999; Rapach and Strauss, 2010; Stock and Watson, 2004). The number of components included is based on the ICp3 criterion of Bai and Ng (2002), which consistently estimates the true number of factors; following Bai and Ng, we allow a maximum of four components in our principal components (PC) combination forecasts.

3.C. Construction of Bagging Forecasts

Recall that $y^h_{t,t+h}$ is the monthly stock return from time t to t + h, where h is the forecast horizon. Let $x_{i,t}$ denote one of n potential predictors of stock returns (so that i = 1,..., n). We consider 10 potential predictors of stock returns (n = 10) in our analysis. We compute bagging forecasts of stock returns at horizon h using the bagging-augmented pretesting procedure (BA) of Inoue and Kilian (2008). The procedure begins with the general model:

$$y^h_{t,t+h} = \mu + \sum_{i=1}^{n} \delta_i x_{i,t} + \xi^h_{t+h} \qquad (5)$$

where $\xi^h_{t+h}$ is an error term characterized by autocorrelation of degree h - 1. Suppose we are interested in forming a forecast of $y^h_{t,t+h}$ at time t. The pretesting procedure involves estimating (5) via ordinary least squares (OLS) using data from the start of the available sample through time t and computing the t-statistics corresponding to each of the potential predictors.[13] The $x_{i,t}$ variables with t-statistics less than 1.645 in absolute value are dropped from (5), and the model is estimated a second time using only the significant predictors. The forecast is then formed by plugging the $x_{i,t}$ values of the retained predictors into the re-estimated model and setting the error term equal to its expected value of zero.

Bagging can be implemented for the pretesting procedure via a moving-block bootstrap. More specifically, a large number (B) of pseudo-samples of size t for the left-hand-side and right-hand-side variables in (5) are generated by randomly drawing blocks of size m (with replacement) from the observations of these variables available from the beginning of the sample through time t. For each pseudo-sample, we estimate (5) using the pseudo-data and OLS, the pretesting procedure determines the predictors to include in the forecasting model, the model is re-estimated using the pseudo-data, and a forecast of $y^h_{t,t+h}$ is formed by plugging the actual included $x_{i,t}$ values into the re-estimated version of the forecasting model (and again setting the error term equal to its expected value of zero). The bagging model forecast corresponds to the average of the B forecasts for the bootstrapped pseudo-samples.[14]

Dividing the complete available sample of T observations for $\Delta RI_t$ and $x_{i,t}$ (i = 1,..., n) into an in-sample portion comprised of the first R observations and an out-of-sample period comprised of the last P observations, we can form a series of P - (h - 1) recursive simulated out-of-sample forecasts using the bagging procedure.[15] We denote this series by $\{\hat{y}^h_{BA,t+h|t}\}_{t=R}^{T-h}$.

[13] The t-statistics for the OLS estimates of $\delta_i$ in (5) are computed using Newey and West (1987) heteroscedasticity and autocorrelation consistent (HAC) standard errors based on a lag truncation of h - 1.

[14] Following Inoue and Kilian (2008), we use m = h. We use B = 1000.

[15] Recursive indicates that the forecasts are generated using an expanding estimation window.

3.D. Statistical Tests

Tests for encompassing and equal forecast accuracy

In the case of nested models, Clark and McCracken (2001) and McCracken (2007) develop a set of asymptotics that allow for an out-of-sample test of equal population-level predictive ability between

two nested models. They show that, in the context of linear, OLS-estimated models, a number of different statistics can be employed to test for equal forecast accuracy and forecast encompassing, despite the fact that the models are nested. Based on Monte Carlo simulations, Clark and McCracken (2001, 2004) indicate that ENC-NEW is the most powerful statistic, followed by their ENC-T, MSE-F and MSE-T statistics. These rankings suggest that the forecast encompassing statistics, especially ENC-NEW, can have important power advantages over test statistics based on relative MSFE. We report results for the most powerful statistic, ENC-NEW, which is an F-type test and is related to the Harvey et al. (1998) statistic designed to test for forecast encompassing. It has been shown through extensive Monte Carlo simulations in Clark and McCracken (2001, 2004) that the ENC-NEW statistic has power advantages over the original Diebold and Mariano (1995) statistic as well as the Harvey et al. (1998) ENC-T statistic.

$$\text{ENC-NEW} = (T - R - h + 1)\, \frac{\bar{c}}{MSE_2} \qquad (6)$$

where R is the number of observations in the in-sample period,

$$\hat{c}_{t+h} = \hat{u}_{1,t+h}\left(\hat{u}_{1,t+h} - \hat{u}_{2,t+h}\right), \qquad \hat{u}_{i,t+h} = y_{t+h} - \hat{y}_{i,t+h}, \quad i = 1, 2,$$

$$\bar{c} = (T - R - h + 1)^{-1} \sum_{t=R}^{T-h} \hat{c}_{t+h}, \qquad MSE_i = (T - R - h + 1)^{-1} \sum_{t=R}^{T-h} \hat{u}^2_{i,t+h}, \quad i = 1, 2,$$

and $\hat{y}_{i,t+h}$ is the forecast from model i, where i = 1 is the benchmark and i = 2 is the predictive model, so that the statistic is scaled by the MSFE of the predictive (unrestricted) model, as in Clark and McCracken (2001). Under the null hypothesis, the restricted model forecasts encompass the unrestricted model forecasts, while under the one-sided (upper-tail) alternative hypothesis the restricted model forecasts do not encompass the unrestricted model forecasts. Clark and McCracken (2001) note that the limiting distribution of the ENC-NEW statistic is non-standard and pivotal for the one-step-ahead forecasts (h = 1) considered in this paper.
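To make the construction of the statistic in (6) concrete, the sketch below computes ENC-NEW from three aligned series of out-of-sample values. It assumes numpy and takes the number of forecasts, T - R - h + 1, to be the length of the input arrays; following Clark and McCracken (2001), the denominator is the MSFE of the predictive (unrestricted) model. The function name and argument names are illustrative. It returns the statistic only: because the limiting distribution is non-standard, critical values must come from the Clark and McCracken tables or a bootstrap, which this sketch does not attempt.

```python
import numpy as np

def enc_new(actual, f_bench, f_model):
    """ENC-NEW statistic for a nested benchmark (model 1) vs. predictive model (model 2)."""
    actual = np.asarray(actual, dtype=float)
    u1 = actual - np.asarray(f_bench, dtype=float)   # benchmark forecast errors
    u2 = actual - np.asarray(f_model, dtype=float)   # predictive-model forecast errors
    n_forecasts = len(actual)                        # T - R - h + 1
    c_bar = np.mean(u1 * (u1 - u2))
    mse2 = np.mean(u2 ** 2)
    return n_forecasts * c_bar / mse2
```

If the predictive model's forecasts coincide with the benchmark's, every term of the numerator is zero and the statistic is zero; large positive values indicate that the predictive model contains information the benchmark forecasts lack.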
Clark and McCracken (2004) recommend basing inferences the ENC NEW statistics on a bootstrap procedure, given that the statistics are not in general asymptotically pivotal (when h>1). The 16

bootstrap procedure we employ is similar to the one in Clark and McCracken (2004), which is a version of the Kilian (1999) bootstrap procedure, and is discussed in detail in Rapach and Weber (2004) and Rapach and Wohar (2006).

The ENC-T test of Harvey et al. (1998) is defined as:

$$\text{ENC-T} = (T - R - h + 1)^{-1/2}\,\hat{S}_{cc}^{-1/2}\sum_{t=R}^{T-h}\hat{c}_{t+h}, \tag{7}$$

where $\hat{S}_{cc}$ denotes the long-run variance estimate for $\hat{c}_{t+h}$ constructed with a HAC estimator such as that of Newey and West (1987).

As described above, the ENC-T test applies to nested forecasting models. Clark and West (2006, 2007) demonstrate that the test can be viewed as an adjusted test for equal MSE. In the Clark and West framework the null hypothesis is a random walk and the alternative hypothesis is a predictive regression. If the null hypothesis of a random walk is true, then it will have a lower mean-squared error than the alternative (despite the fact that the alternative includes an additional variable) because there is sampling error associated with estimating the alternative model. Clark and West therefore adjust the forecast error of the alternative model to take account of this sampling error. The adjustment subtracts the square of the difference in forecasts from the competing models; this term captures (under the null hypothesis of equal accuracy in population) the extra sampling error in the larger model. Clark and West (2006, 2007) present the loss differential of the test statistic as:

$$\widehat{cw}_{t+h} = \hat{u}_{1,t+h}^{2} - \left[\hat{u}_{2,t+h}^{2} - (\hat{y}_{1,t+h} - \hat{y}_{2,t+h})^{2}\right]. \tag{8}$$

Regressing the series $\{\widehat{cw}_{t+h}\}_{t=R}^{T-h}$ on a constant then generates the adjusted MSE statistic (CW-T), which is the t-statistic for a zero constant and is based on a normal distribution. The second term within the brackets of equation (8) adjusts for the upward bias in MSE produced by estimation of

parameters that are zero under the null. This t-statistic proposed by Clark and West (2006, 2007) is equivalent to the Harvey et al. (1998) ENC-T test for forecast encompassing as considered in such studies as Clark and McCracken (2001, 2005). For tests of equal predictive ability at the population level, Monte Carlo results in Clark and McCracken (2001, 2005), Clark and West (2006, 2007), and McCracken (2007) show that critical values obtained from Monte Carlo simulations of the asymptotic distributions generally yield good size and power properties for one-step-ahead forecasts, but can yield rejection rates greater than nominal size for multi-step forecasts. Similarly, results in Clark and West (2006, 2007) indicate that comparing the Clark-West test (equivalent to ENC-T) against standard normal critical values can work reasonably well but exhibits size distortions as the forecast horizon increases.

Sign Test

Pesaran and Timmermann (1992) investigate the sign of the variable $y_t$ and develop a nonparametric test based on the number of correctly predicted signs in a forecast series of size $T$. The assumptions are that the distributions of $y_t$ and the predictor are continuous, independent and invariant over time. Let:

$\pi_1 = \Pr(y_t > 0)$,
$\pi_2 = \Pr(\hat{y}_t > 0)$,
$p_1$ = sample proportion of times that the actual value of $y_t$ is positive, and
$p_2$ = sample proportion of times that the forecast of $y_t$ is positive.

Under the null hypothesis that $\hat{y}_t$ (the forecast) and $y_t$ are independently distributed (so that the forecast values have no ability to predict the sign of $y_t$), the number of correct sign predictions in the sample has a binomial distribution with $T$ trials and success probability equal to:

$$\pi^{*} = \pi_1\pi_2 + (1 - \pi_1)(1 - \pi_2).$$

When $\pi_1$ and $\pi_2$ are not known, they can be estimated by the sample proportions $p_1$ and $p_2$, so that $\pi^{*}$ can be estimated by $p^{*}$:

$$p^{*} = p_1 p_2 + (1 - p_1)(1 - p_2),$$

where $p^{*}$ is the expected sample proportion of correct predictions under the null hypothesis. Define $p$ as the actual sample proportion of times that the sign of $y_t$ is correctly predicted. The test statistic in this case is shown by Pesaran and Timmermann (1992) to converge in distribution to N(0,1) under the null hypothesis and is given as:

$$PT2 = \frac{p - p^{*}}{\left[\widehat{\mathrm{var}}(p) - \widehat{\mathrm{var}}(p^{*})\right]^{1/2}},$$

where

$$\widehat{\mathrm{var}}(p) = \frac{p^{*}(1 - p^{*})}{T}$$

and

$$\widehat{\mathrm{var}}(p^{*}) = \frac{(2p_1 - 1)^{2}\,p_2(1 - p_2)}{T} + \frac{(2p_2 - 1)^{2}\,p_1(1 - p_1)}{T} + \frac{4\,p_1 p_2 (1 - p_1)(1 - p_2)}{T^{2}}.$$

Pesaran and Timmermann (1992) generalize this test to situations where there are two or more meaningful categories for the actual and forecast values of $y_t$. They also note that in the case of two categories, the square of PT2 is asymptotically equal to the chi-squared statistic in the standard goodness-of-fit test using the 2x2 contingency table categorizing actual and forecast values by sign.

3.E. Measuring Economic Value

Our final set of empirical tests deals with the economic value of forecasts. We analyze whether portfolio allocations could have been improved by following the regression model rather than the historical average benchmark. First, we consider a mean-variance optimizing investor as in Campbell and Viceira (2002)

and Campbell and Thompson (2008). We take the return forecast from the historical average benchmark and compare it to an alternative return forecast from i) bagging and ii) combination forecast methods. Recall that $y_{t+1}$ is the log stock return. Define $Y_{t+1}$ as the stock return ($Y_{t+1} = \exp(y_{t+1}) - 1$). A mean-variance optimizing investor has objective function:

$$O = E(Y_p) - \frac{\gamma}{2}\,\sigma^{2}_{Y_p} \approx E(y_p) + \frac{1}{2}\,\sigma^{2}_{y_p} - \frac{\gamma}{2}\,\sigma^{2}_{y_p}, \tag{11}$$

where $O$ is the objective, $Y_p$ is the portfolio return, $y_p$ is the portfolio log return, and $\gamma$ is the coefficient of relative risk aversion. Such an investor will choose a portfolio weight $\omega_{t,b}$ ($\omega_{t,z}$) in the risky asset under the prediction from the historical average benchmark (the alternative forecast model, i.e. bagging or forecast combination): 16

$$\omega_{t,b} = \frac{1}{\gamma}\,\frac{E(Y_{t+1,b}) - Y_f}{\sigma_t^{2}}, \tag{12}$$

$$\omega_{t,z} = \frac{1}{\gamma}\,\frac{E(Y_{t+1,z}) - Y_f}{\sigma_t^{2}}, \tag{13}$$

where $Y_f$ is the risk-free rate. We use 5-year rolling monthly data to estimate volatility ($\sigma_t^{2}$); however, estimating volatility using alternative horizons has very little impact on the utility gain, since $\sigma_t^{2}$ is the same in the benchmark weight, $\omega_{t,b}$, and the alternative model weight, $\omega_{t,z}$ (see the denominator in equations 12 and 13 above).

16 Weights are recalculated in every time period and portfolio allocations adjusted accordingly.
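As a rough illustration of the weights in equations (12) and (13) and of the utility-gain comparison in equation (14), the following Python sketch computes the risky-asset weight and a realized utility gain. The function names, the example inputs, the value of gamma, the use of the sample variance, and the portfolio return rule (weight in the risky asset, remainder in the risk-free asset) are assumptions made for illustration and are not taken from the paper; no portfolio weight restrictions are imposed here.

```python
from statistics import mean, variance


def mv_weight(expected_return, risk_free, var_t, gamma):
    """Risky-asset weight as in equations (12)-(13):
    (1 / gamma) * (E[Y_{t+1}] - Y_f) / sigma_t^2."""
    return (expected_return - risk_free) / (gamma * var_t)


def portfolio_returns(weights, risky, risk_free):
    """Assumed portfolio rule: weight w in the risky asset, 1 - w in the
    risk-free asset, period by period."""
    return [w * y + (1.0 - w) * f for w, y, f in zip(weights, risky, risk_free)]


def utility_gain(returns_z, returns_b, gamma):
    """Utility gain as in equation (14): difference in mean portfolio return
    minus gamma / 2 times the difference in (sample) return variance."""
    mean_diff = mean(returns_z) - mean(returns_b)
    var_diff = variance(returns_z) - variance(returns_b)
    return mean_diff - 0.5 * gamma * var_diff


if __name__ == "__main__":
    gamma = 3.0  # hypothetical risk-aversion value, chosen only for the example
    w_b = mv_weight(0.006, 0.002, 0.0025, gamma)  # benchmark forecast
    w_z = mv_weight(0.010, 0.002, 0.0025, gamma)  # alternative forecast
    print(w_b, w_z)
```

Because both weights share the same variance estimate in the denominator, the two allocations in this sketch differ only through the return forecast, which is the point made in the text about the choice of volatility window.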

The utility gain ($\Delta O$) from using the regression model rather than the historical average benchmark is:

$$\Delta O = \left(\bar{Y}_z - \bar{Y}_b\right) - \frac{\gamma}{2}\left(\sigma^{2}_{Y_z} - \sigma^{2}_{Y_b}\right). \tag{14}$$

Second, we implement Goetzmann et al.'s (2007) performance measure (referred to as GISW):

$$\mathrm{GISW} = \frac{1}{1 - \Gamma}\left[\ln\left(\frac{1}{T}\sum_{t=0}^{T-1}\left(\frac{1 + Y_{t+1,z}}{1 + Y_{t+1,f}}\right)^{1-\Gamma}\right) - \ln\left(\frac{1}{T}\sum_{t=0}^{T-1}\left(\frac{1 + Y_{t+1,b}}{1 + Y_{t+1,f}}\right)^{1-\Gamma}\right)\right], \tag{15}$$

where:

$$\Gamma = \frac{\ln[E(1 + Y_m)] - \ln(1 + Y_f)}{\mathrm{Var}[\ln(1 + Y_m)]}.$$

GISW measures the average performance of a portfolio relative to the risk-free rate; it is a certainty equivalent measure of abnormal performance. An advantage of GISW is that it is difficult to manipulate. The parameter $\Gamma$ is set to reflect the overall reward (return) to risk (variance) ratio for each country based upon our sample data. This reduces the possibility of manipulation and incorrect inference.

3.F. Data Mining Robust Critical Values

When testing the predictive ability of a large number of financial variables, Lo and MacKinlay (1990) and Inoue and Kilian (2004) note that data mining is a serious concern in studies of stock return predictability, regardless of whether the tests are in-sample or out-of-sample. Inoue and Kilian (2004) note that an important way to control for data mining is to use appropriate critical values, which explicitly account for the possibility of data mining.

Here we employ the data mining bootstrap used in Rapach and Wohar (2006). 17 We consider J different variables serving as candidate predictors in predictive regression models. These J variables are then used to form forecasts using combining methods and bagging. We assume that the data are generated by the following system under the null hypothesis of no predictability:

$$r_t = \alpha_0 + \varepsilon_{1,t}, \tag{16}$$

$$\begin{aligned}
x_{1,t} &= b_{1,0} + b_{1,1}x_{1,t-1} + \dots + b_{1,p_1}x_{1,t-p_1} + \varepsilon_{1,2,t},\\
&\;\;\vdots\\
x_{J,t} &= b_{J,0} + b_{J,1}x_{J,t-1} + \dots + b_{J,p_J}x_{J,t-p_J} + \varepsilon_{J,2,t},
\end{aligned} \tag{17}$$

where the disturbance vector $\varepsilon_t = (\varepsilon_{1,t}, \varepsilon_{1,2,t}, \dots, \varepsilon_{J,2,t})'$ is independently and identically distributed with covariance matrix $\Sigma$. We first estimate (16) and each of the processes in (17) via OLS. The lag order for each of the AR processes in (17) can differ, and we select each lag order using the AIC (considering a maximum lag order of four). 18 We then compute the OLS residuals, $\{\hat{\varepsilon}_t = (\hat{\varepsilon}_{1,t}, \hat{\varepsilon}_{1,2,t}, \dots, \hat{\varepsilon}_{J,2,t})'\}_{t=1}^{T-p}$, where $p = \max_{j \in \{1,\dots,J\}} p_j$. In order to generate a series of disturbances for our pseudo-sample, we randomly draw (with replacement) T + 100 times from the OLS residuals, $\{\hat{\varepsilon}_t\}_{t=1}^{T-p}$, giving us a pseudo-series of disturbance terms, $\{\hat{\varepsilon}_t^{*}\}_{t=1}^{T+100}$. Drawing the OLS residuals in tandem preserves the contemporaneous correlation between all of the disturbances in the original sample.

17 This section draws on Rapach and Wohar (2006). However, please note that Rapach and Wohar (2006) consider individual predictive regressions, while we focus on combinations and bagging in this paper.
18 We do not estimate a VAR process for $(x_{1,t}, \dots, x_{J,t})$ in order to conserve degrees of freedom.

Using the pseudo-series of disturbance terms, the OLS estimates

of the coefficients in (16) and (17), 19 and setting the initial observations for each of the $x_{j,t}$ variables equal to zero in (17), we can build up a pseudo-sample of T + 100 observations for $r_t$ and $x_{1,t}, \dots, x_{J,t}$, namely $\{r_t^{*}, x_{1,t}^{*}, \dots, x_{J,t}^{*}\}_{t=1}^{T+100}$. We drop the first 100 transient start-up observations in order to randomize the initial observations, leaving us with a pseudo-sample of T observations, matching the original sample length. For the pseudo-sample, we calculate the two out-of-sample statistics (MSE-F and CW-T) and the economic value measures for all models (each of the forecast combination techniques and bagging) in turn. For each pseudo-sample we store the maximal value of each metric. We repeat this whole process 5,000 times to generate an empirical distribution for each of the maximal out-of-sample statistics and for each of the maximal economic value measures. After ordering the empirical distribution for each maximal statistic, the 4,500th, 4,750th, and 4,950th values serve as the 10%, 5%, and 1% critical values for each maximal statistic.

4. OUT-OF-SAMPLE STOCK RETURN FORECASTS

Could investors actually utilize fundamental-price based models in order to benefit from more accurate predictions of future stock returns? This issue is of importance to practitioners and academics alike. Asset managers, economic policymakers, as well as pension providers and contributors all need accurate estimates of future market returns. In this section, we examine a range of fundamental-price ratios as well as macro and technical variables for a sample of Asian countries. Following Rapach, Strauss, and Zhou (2010) and Stock and Watson (2004), we consider whether various forecast combination or bagging methods can improve forecast accuracy over individual models. The historical average model is used as the benchmark to control for possible low liquidity effects in Asian emerging markets.
19 We employ bias-adjusted slope coefficients for the AR processes in (17).
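Two steps of the data-mining bootstrap in Section 3.F lend themselves to a short sketch: drawing residual vectors in tandem, and reading critical values off the ordered maximal statistics. The Python below is a minimal illustration under stated assumptions, not the authors' implementation: the function names and data layout (one row per time period or per pseudo-sample) are hypothetical, and the steps that build the pseudo-sample and compute the forecast statistics are omitted.

```python
import random


def draw_pseudo_disturbances(residual_rows, n_draws, rng):
    """Resample whole residual vectors with replacement. Each row holds the
    residuals of all equations for one period, so drawing rows (rather than
    individual entries) keeps their contemporaneous correlation."""
    n = len(residual_rows)
    return [residual_rows[rng.randrange(n)] for _ in range(n_draws)]


def max_stat_critical_values(stat_rows, levels=(0.10, 0.05, 0.01)):
    """stat_rows has one row per pseudo-sample and one column per model.
    Take the maximum across models in each pseudo-sample, order the maxima,
    and return the order statistic matching each significance level
    (e.g. the 4,500th of 5,000 ordered values for the 10% level)."""
    maxima = sorted(max(row) for row in stat_rows)
    n = len(maxima)
    crit = {}
    for level in levels:
        k = max(1, round((1.0 - level) * n))  # 1-based rank of the critical value
        crit[level] = maxima[k - 1]
    return crit


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the example repeatable
    residuals = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
    pseudo = draw_pseudo_disturbances(residuals, 200 + 100, rng)
    print(len(pseudo))
```

A statistic computed on the original sample would then be judged significant at a given level only if it exceeds the corresponding critical value of the maximal statistic, which is what makes the hurdle robust to searching over many models.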

4.A OOS Forecast Accuracy (Individual Regression Forecasts)

Table 3 shows 1-month forecast results for individual predictive regression models. 20 Overall, the individual predictive models have mixed results. In the Asian markets, some predictors provide dismal forecasts, such as stock variance (SVAR), net equity issuance (NTIS), and change in volume at the monthly frequency (CVM). However, the individual fundamental-price ratios, e.g., the dividend-price ratio (DP), dividend yield (DY), earnings-price ratio (EP), and book-to-market ratio (BM), show evidence of predictability, with the exception of EP. DP, DY, and BM show predictability above that of the benchmark model in 8 of 11 countries. We implement the Clark and McCracken (2001) ENC-NEW encompassing test and the Clark and West (2007) CW-T test of equal forecast accuracy. Predictability in our Asian sample is robust even to the very high hurdles imposed by data-snooping adjustments; however, data-snooping-robust significance for the fundamentals is found only under the ENC-NEW test. Technical indicators, e.g., price pressure as measured by rising stocks against falling stocks (PRES), also demonstrate some predictability. PRES demonstrates predictability in 6 of 11 countries using standard statistical significance and in 2 of 11 countries after adjusting for data-snooping bias. The risk-free rate (RF) and inflation (INFL) are the only two predictors that demonstrate predictability under both the CW-T test and data-snooping bias adjustments. Thus, it appears that investors interested in Asian markets should specifically consider these variables in forecasting models.

[INSERT TABLE 3 AROUND HERE]

Predictability in the G7 countries is not found to be robust in our tests. Fundamental ratios do not perform well in the G7 sample. The lack of robust statistical evidence of predictability in the G7 countries contrasts sharply with the robust predictability found in our sample of Asian countries.
20 We focus on one-step-ahead forecasts due to space considerations.

The evidence for

predictability in G7 countries is even more dismal once data mining adjustments are made. When data mining adjustments are used, there is virtually no evidence of predictability in the G7 countries. There is also considerable variation in predictability across countries. Although there is virtually no predictability in the G7, there is evidence at the 5% significance level for DE, JP, and the US. In the Asian sample, we find no evidence of OOS predictability in China, as no variable demonstrates predictability when data-snooping bias is controlled for. Predictability is also weak for the other large Asian markets (HK, India, Japan, Korea, and Taiwan), where only one of 10 predictor variables exhibits some predictability after data-snooping bias adjustments. However, four countries (Indonesia, Malaysia, Singapore, and Thailand) have robust predictability, even after adjusting for data-snooping bias.

4.B OOS Forecast Accuracy (Bagging and forecast combinations)

This section contains the forecast results from bagging and forecast combinations. Table 4 reports results for the 1-month forecast horizon. We find consistent evidence that combining forecasts improves forecast accuracy in our sample of Asian countries, even after adjustments are made for data-snooping bias. This holds for seven of the eight methods examined, the lone exception being the principal component method (PC(C,3B)), and in 10 of 11 countries, the sole exception being China. Combination forecasts provide consistently large gains in HK, India, Indonesia, Malaysia, Singapore, and Thailand. The finding that combination forecasts consistently perform well is strengthened by the fact that they are subject to greater parameter estimation error than the benchmark.
Clark and West (2007) emphasize that parameter estimation error leads to an expectation that the mean-squared prediction error from an alternative model (here the forecast combination) is larger than that from a parsimonious model (here the benchmark).

[INSERT TABLE 4 AROUND HERE]
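The Clark and West adjustment in equation (8) is what corrects for this expected inflation of the larger model's mean-squared prediction error. A minimal Python sketch of the CW-T statistic for one-step-ahead forecasts follows. The function name and inputs are hypothetical, and the sketch assumes a plain OLS standard error for the constant, which is a simplification suited to h = 1; for longer horizons a HAC standard error would be needed.

```python
from math import sqrt
from statistics import mean, variance


def clark_west_stat(actual, f_bench, f_model):
    """CW-T sketch for h = 1. For each period, form the adjusted loss
    differential of equation (8),
        u1^2 - (u2^2 - (f1 - f2)^2),
    where u1, u2 are the benchmark and model forecast errors and f1, f2 the
    forecasts. Regressing this series on a constant gives a t-statistic equal
    to the series mean divided by its standard error."""
    adjusted = []
    for y, f1, f2 in zip(actual, f_bench, f_model):
        u1 = y - f1
        u2 = y - f2
        adjusted.append(u1 ** 2 - (u2 ** 2 - (f1 - f2) ** 2))
    n = len(adjusted)
    std_err = sqrt(variance(adjusted) / n)  # OLS standard error of the constant
    return mean(adjusted) / std_err


if __name__ == "__main__":
    y = [0.02, -0.01, 0.03, 0.00, 0.01, -0.02]      # illustrative returns
    bench = [0.005] * 6                              # stand-in for a historical average
    model = [0.015, -0.005, 0.02, 0.0, 0.01, -0.01]  # stand-in for a combination forecast
    print(clark_west_stat(y, bench, model))
```

The statistic is compared against upper-tail standard normal critical values, in line with the one-sided alternative that the larger model improves on the benchmark.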

Once more, the results for predictability in the G7 set of countries are not promising. There is little to no evidence of predictability, which is documented only for JP and the US for combination methods and only for JP for bagging methods. Again, this is in complete contrast to the results found for Asia. We note that the various classes of forecast combination methods yield broadly similar results; hence in the subsequent analysis we only report results from one method in each class: Mean, DMSFE(1), C(2,PB) and PC(C,3B). Results from the other combination techniques in the subsequent tests are qualitatively similar to those of the reported method from the same combination class.

In terms of forecast errors, the bagging method provides mixed results for the majority of Asian countries; in general it underperforms the benchmark. For several countries (China, HK, Korea, Singapore and Taiwan) bagging forecast errors are 5% larger than those from the benchmark, a substantial magnitude. However, bagging does provide substantial improvement for four countries: Indonesia, Japan, Malaysia, and Thailand. These results for bagging are in contrast to prior literature that provides favorable results for the bagging method when applied to macroeconomic forecasts (Inoue and Kilian, 2008; Rapach and Strauss, 2010).

5. ALTERNATIVE MEASURES OF FORECAST ACCURACY

5.A Sign Forecast Accuracy Tests

Table 5 provides evidence on whether the return direction can be predicted. This may be of particular interest given that some prior work suggests a link between directional performance and economic value (Pesaran and Timmermann, 1992; Leitch and Tanner, 1991). Our analysis compares bagging and forecast combination models to benchmarks of i) only predicting positive returns and ii) the sign of the average of the prior 36 months' returns.
21 If the models can beat the benchmark, then this is evidence that the models can be used to time the market and shift to the risk-free asset for

21 The size of the window is set at 36 months to be consistent with the optimisation period of the forecast combination models which draw on past forecast performance (DMSFE, Cluster and Principal Component methods). We thank an anonymous referee for pointing us towards using a benchmark based on rolling return performance.