Index-Tracking Portfolios and Long-Short Statistical Arbitrage Strategies: A Lasso Based Approach



Author One a,b,1, Author Two c, Author Three a,c, Author Four a,c

a Address One
b Address Two
c Some University

Abstract

In this paper, we apply lasso-type regression extensively to the index tracking (IT) and long-short investing strategies. In both cases, our objective is to exploit the mean-reverting properties of prices reported in the literature. Because it performs variable selection in linear regressions and can handle high-dimensional problems, lasso is an attractive technique for portfolio selection. In the empirical analysis, we consider three market benchmarks (S&P 100 and Russell 1000 for the US stock market, and the Ibovespa index for the Brazilian market). We also form IT portfolios using cointegration to compare with the results obtained using lasso. The findings show similar overall performance for portfolios built with lasso and with cointegration. Nevertheless, portfolios using lasso presented average monthly turnover at least 40% smaller, indicating that lasso-based portfolios not only performed consistently but also had a considerable advantage in terms of transaction costs.

Keywords: Lasso, Index tracking, Long-short, Portfolio selection, Statistical arbitrage

JEL Codes: C520, C550, C580, G

1. Introduction

Index tracking (IT) is a passive investment strategy, as opposed to a buy-and-hold strategy, that involves constructing a portfolio of securities that replicates, or tracks, the total return of a market index over time. For instance, a portfolio may be set up to mimic the returns of the S&P 100 Index or the Dow Jones Industrial Average (DJIA). The choice of passive investment relies on the premise that securities are efficiently priced. This notion of price efficiency is encapsulated in the efficient markets hypothesis (EMH; Fama, 1970).
According to the EMH and studies such as Fama and French (2010), active investment funds generally do not beat overall market performance consistently in the long run; for this reason, investors should be better off choosing a passive strategy, as this type of investment seeks to minimize costs while expecting that the market exhibits mean-reverting properties and that its constituents will present positive returns over time. Moreover, an extension of IT is long-short investment, a self-financing strategy that aims to exploit temporary market inefficiencies to generate alpha (excess return relative to the market performance) by buying undervalued stocks and short selling overvalued stocks (Alexander and Dimitriu, 2002). Comparing index tracking and long-short in terms of implementation methodology (in spite of their conceptual differences), we notice a standard feature in past studies: the need to impose a limit on the number of stocks used to compose the portfolios (Beasley et al., 2003; Alexander and Dimitriu, 2005; Dunis and Ho, 2005; Focardi et al., 2016). Thereby, the manager's choice of each portfolio becomes an optimization problem that requires a numerical approach to find a solution.

I am corresponding author
Email addresses: author.one@mail.com (Author One), author.two@mail.com (Author Two), author.three@mail.com (Author Three), author.four@mail.com (Author Four)
URL: author-one-homepage.com (Author One)
1 I also want to inform about...
2 Small city
Preprint submitted to XVIII Brazilian Finance Meeting, April 9, 2018

In this paper, we present an implementation of the so-called lasso-type regression to solve both the index tracking and long-short problems. Furthermore, because the cointegration approach has already been employed extensively in previous studies for both the IT and long-short investing strategies (for instance, Alexander and Dimitriu, 2002, 2005; Dunis and Ho, 2005; Acosta-González et al., 2015), we consider it a relevant benchmark for validating the solutions estimated using lasso. For this reason, we also solve the IT problem with cointegration, so that the results calculated using lasso can be compared against a methodology widely explored in the literature.

According to the literature on index tracking (for example, Beasley et al., 2003; Guastaroba and Speranza, 2012; Scozzari et al., 2013), the IT optimization problem commonly includes a constraint that limits the number of stocks in the tracking portfolios, as managers attempt to reduce transaction and management costs. Besides, it is standard practice in this strategy to form portfolios with the expectation of holding them for at least one or two months. The reason is that index funds are a passive investment vehicle that aims to reduce transactions and avoid frequent portfolio updates, and thus extra costs (in contrast with active investment, which trades more frequently as traders seek to exploit temporary market inefficiencies over time).
Concerning past studies on index tracking, several approaches have already been presented and discussed to solve this optimization problem, such as optimization (Konno and Wijayanayake, 2001; Mezali and Beasley, 2013), optimization combined with simulation (Consiglio and Zenios, 2001), heuristic methods (Beasley et al., 2003; Guastaroba and Speranza, 2012; Scozzari et al., 2013), cointegration (Alexander, 1999; Alexander and Dimitriu, 2005), and lasso-type regression (Wu et al., 2014; Yang and Wu, 2016).

The lasso approach (least absolute shrinkage and selection operator) was introduced by Tibshirani (1996). It performs variable selection automatically in linear regression modeling by generating sparse estimates of the coefficients (Zeng et al., 2012). Additionally, it has been used successfully in statistical modeling, especially with high-dimensional datasets (Tibshirani, 1996; Zeng et al., 2012). Such features make this technique interesting for the index tracking problem, mainly because of the need to impose a cardinality constraint on the size of the tracking portfolios.

Because a cardinality constraint is needed to solve the IT problem with a limited number of stocks in the portfolio, some studies share a similar feature: the authors' choice to form tracking portfolios using the stocks with the largest weights in the index (for instance, Alexander and Dimitriu, 2005; Dunis and Ho, 2005). In this sense, picking the stocks that compose each tracking portfolio is a decision exogenous to the solving process. Besides, using the stocks with the largest weights tends to be efficient for market benchmarks composed of fewer stocks.
Examples of smaller indexes are the Dow Jones Industrial Average (DJIA), composed of the 30 largest firms listed either on the New York Stock Exchange (NYSE) or on the Nasdaq Stock Market; the DAX, composed of the 30 largest German firms listed on the Frankfurt Stock Exchange; and the Ibovespa, formed by the most liquid stocks in the Brazilian stock market. For instance, in its portfolio composition for the period of September-December 2017, the Ibovespa contained 59 stocks, and the 10 stocks with the largest weights accounted for about 57.7% of the index portfolio [3]. However, as we move to larger indexes like the S&P 100, choosing only the most relevant stocks tends to become less effective. The S&P 100 had a total of 102 stocks in December 2017, and its top 10 stocks accounted for about 31% of the index [4]. For a bigger index such as the Russell 1000, this contrast becomes even more evident: the largest stock weight in this index in December 2017 was Microsoft Corp., corresponding to only 2.23% of the index, and the top 10 stocks accounted for only 14.7% of the Russell 1000's portfolio [5].

As a result, we can understand the usefulness of a method capable of performing variable selection, such as the lasso approach. In the context of index tracking and portfolio optimization, a statistical model that selects the most relevant coefficients is appropriate since it makes the portfolio selection process endogenous to the optimization problem. Moreover, the lasso regression has the potential to solve the index tracking problem effectively for small market indexes as well as for more extensive benchmarks such as the Russell 1000, thus being apt for application with high-dimensional datasets.

Regarding the studies that combined lasso and index tracking, Wu et al. (2014) proposed the so-called Nonnegative Lasso, which consists of computing the lasso regression with all coefficients constrained to be greater than or equal to zero, thereby avoiding short positions in the portfolios. The authors use the Chinese index CSI 300 and its stock constituents (with data from August 2008 to November 2011), and their empirical analysis shows the capacity of the lasso-type regression to generate good quality portfolios in terms of annual tracking error, both in-sample (fitted error) and out-of-sample (predicted error).
Later, Yang and Wu (2016) extended the nonnegative lasso to propose a method called Nonnegative Adaptive Lasso. Moreover, they also applied a two-stage approach combining nonnegative adaptive lasso and nonnegative least squares. Their empirical tests also used the CSI 300, and the results again showed consistent tracking performance in terms of fitted and predicted error. Nevertheless, the studies mentioned above focus on introducing two statistical approaches, while their empirical analysis of the index tracking problem is quite limited.

Thus, our study differs from the previous literature as we focus on the financial environment and apply the lasso-type regression to different markets using diversified sample sizes. More specifically, we use three datasets in our empirical tests: the S&P 100 and Russell 1000 (American stock market; databases with 102 and 907 stocks, respectively), and the Ibovespa Index (Brazilian stock market; dataset with 55 stocks). As a result, we explore the lasso regression in different market environments (a financial market with robust stability, the USA, as well as a more volatile emerging market, Brazil) and with varying sample sizes (datasets ranging from 55 to 907 stocks). Therefore, we seek not only to explore distinct market conditions but also to analyze the capacity of the lasso regression to solve a high-dimensional problem (which is the case of the Russell 1000). As pointed out in the literature (Tibshirani, 1996; Konzen and Ziegelmann, 2016), the lasso regression is a statistical approach that is especially suited to problems with high-dimensional data. Furthermore, the literature on index tracking dealing with larger datasets usually presents optimization models that require some heuristic design to solve the problem in a timely fashion. Thereby, we perform tests with the Russell 1000 as an attempt to assess the potential of lasso to address the index tracking problem with larger data while providing solutions within a short computing time.

[3] Source, accessed on December 26, 2017:
[4] Source, accessed on December 26, 2017:
[5] Source, accessed on December 26, 2017:

Finally, we go beyond index tracking and also use the lasso regression to handle the long-short investing problem. As Alexander and Dimitriu (2002) argued, this strategy is characterized by regular patterns in returns and market neutrality, even though its goal is to generate excess returns relative to the market benchmark. To achieve this goal, long-short consists of short selling overvalued stocks and taking long positions in undervalued stocks [6]. Thus, the strategy builds self-financed portfolios that attempt to exploit temporary market inefficiencies to maximize returns relative to the market performance, and the solution may be obtained using OLS (ordinary least squares) (Alexander, 1999; Alexander and Dimitriu, 2002, 2005; Li and Bao, 2014). However, OLS cannot perform variable selection, and cointegration is a two-step methodology. So, lasso may be useful due to both its more straightforward application (relative to cointegration) and its capacity to select the portfolio constituents by choosing the best-fitting coefficients in the linear regression.

To develop the empirical analysis, we use three indexes from two stock markets: the S&P 100 and Russell 1000 (US stock market), and the Ibovespa (main benchmark in the Brazilian stock market). Moreover, for each index, we select two different portfolio sizes and three distinct updating frequencies.
Overall, the results for index tracking using lasso showed good quality performance (in terms of returns and tracking error) in all analyzed cases. Regarding the S&P 100 and Ibovespa indexes, the performance of all lasso portfolios is satisfactory in terms of annual average return and cumulative return, especially in the case of the US market. Furthermore, comparing the index tracking results obtained using lasso with those obtained using cointegration, we notice very similar performance in general for all portfolios. However, despite having comparable performance in terms of return and volatility, portfolios built using lasso have monthly average turnover at least 40% smaller than portfolios using cointegration, which implies transaction costs at least about 40% lower for portfolios using lasso. This outcome is interesting since the reduced turnover implies a substantial difference in transaction costs, thereby fulfilling the expectations of a passive investment: to reduce costs while keeping a satisfactory performance. Finally, the results for the long-short strategy also confirmed the quality of lasso, as we obtained consistent cumulative returns, especially for the indexes in the US market.

As a result, the contribution of this paper is twofold. First, we add to the index tracking literature by widely testing a statistical model (lasso) that has only been used a few times in past research (and with limited empirical analysis). To expand previous studies, we adopt market benchmarks of different sizes (from 55 to 907 stocks) and from distinct financial environments (US and Brazil). Also, we compute index tracking portfolios using an alternative approach (cointegration), so that we have a basis for the analysis and validation of the results obtained using lasso. Second, the empirical testing also presents innovations as we employ the lasso regression to solve an alternative investing strategy: the long-short approach. Consequently, we also contribute to the finance literature by showing how a different statistical approach can be consistently used for long-short, considering the greater simplicity of lasso relative to cointegration (which is a two-step method that requires a more extended analysis, as discussed in Section 3.2).

[6] In addition to this approach, long-short may also be implemented through pairs trading or trading strategies that involve a stock and an ETF (Exchange-Traded Fund) (Avellaneda and Lee, 2010).

This study is organized as follows. Initially, Section 2 describes the method associated with the lasso-type regression. Then, Section 3 presents the methodology of the study, including the guidelines for the index tracking and long-short investing strategies, as well as the description of the cointegration approach based on simulations. Finally, Section 4 describes the empirical tests and our results, and Section 5 concludes the study.

2. Lasso: Least Absolute Shrinkage and Selection Operator

This Section is dedicated to the discussion of the lasso-type regression methodology. First, Section 2.1 discusses the general concepts regarding lasso. Second, Section 2.2 presents the guidelines concerning the K-fold cross-validation algorithm as a viable method to select the penalty parameter of the lasso regression.

2.1. Lasso: General Concepts

As Konzen and Ziegelmann (2016) point out, the central goal of a linear regression analysis consists of estimating the coefficients of the model $y_i = \beta_0 + X_i \beta + \varepsilon_i$, where $y_i \in \mathbb{R}$ is the dependent variable to be predicted, $X_i = (x_{1i}, \ldots, x_{ki}) \in \mathbb{R}^k$ is the vector of independent variables, the union of $\beta_0$ and $\beta$ is the set of coefficients $(\beta_0, \beta_1, \ldots, \beta_k)$, and $\varepsilon_i$ is the error term, considering a model with variables $j = 1, 2, \ldots, k$ and time frame $i = 1, 2, \ldots, N$.
To compute such a model, several approaches are available; among them, one of the most popular is OLS (Ordinary Least Squares), which is based on the minimization of the sum of the squared residuals (SSR) as follows:

$$\hat{\beta}^{OLS} = \underset{\beta_0, \beta_1, \ldots, \beta_k}{\operatorname{argmin}} \sum_{i \in N} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \Big)^2 \tag{1}$$

However, as pointed out by Tibshirani (1996), the OLS approach presents some inconsistencies, especially as we increase the number of independent variables and move to high-dimensional models [7]. For this reason, Tibshirani (1996) cites two specific techniques that attempt to overcome the OLS inconsistencies: subset selection and ridge regression. Nonetheless, both techniques have downsides as well. Subset selection basically consists of using discrete choices to drop or add variables as one aims to locate the best combination of inputs for the model. Thus, the ideal situation in this case would be to test all $2^k$ possible combinations of the variables (Konzen and Ziegelmann, 2016). Yet, such an analysis has a strong drawback in terms of the computing time necessary to test all combinations [8].

[7] According to Tibshirani (1996), the OLS estimates have basically two issues: (1) prediction accuracy, as the estimated parameters have large variance, and (2) interpretation, which is the case especially in large models, since the method does not perform variable selection and thus makes the interpretation of the results more difficult and inaccurate.
[8] It is possible to find some algorithms in the literature to solve the subset selection problem, such as forward and backward elimination (Hastie et al., 2009), and the Dantzig Selector (Candes and Tao, 2007).
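The computational burden of exhaustive subset selection can be made concrete with a small sketch. The helper names below (`count_subsets`, `enumerate_subsets`) are illustrative and not part of the original study; the values of k are the dataset sizes mentioned in the Introduction (55, 102, and 907 stocks).

```python
from itertools import combinations

def count_subsets(k: int) -> int:
    """Number of candidate models in exhaustive subset selection (2^k)."""
    return 2 ** k

def enumerate_subsets(names):
    """Yield every subset of predictor names, from the empty model to the full one."""
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            yield subset

# With k = 3 predictors the enumeration is still trivial: 8 candidate models.
small = list(enumerate_subsets(["x1", "x2", "x3"]))
print(len(small))

# For index-sized problems the count itself is astronomically large,
# so we only report how many decimal digits 2^k has.
for k in (55, 102, 907):
    digits = len(str(count_subsets(k)))
    print(f"k = {k}: 2^k has {digits} decimal digits")
```

Even before any regression is fitted, the number of candidate models rules out exhaustive search for benchmarks of this size, which is why shrinkage methods are considered next.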

In relation to the ridge regression, Tibshirani (1996) points out its stability in terms of coefficients, in comparison to subset selection, as ridge regression consists of a continuous process that shrinks the regression coefficients. To carry out this process, the model receives a penalty on the size of the coefficients:

$$\hat{\beta}^{Ridge} = \underset{\beta_0, \beta_1, \ldots, \beta_k}{\operatorname{argmin}} \sum_{i \in N} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \Big)^2 \tag{2}$$

Subject to:

$$\sum_{j=1}^{k} \beta_j^2 \le t \tag{3}$$

$$t \ge 0 \tag{4}$$

which is equivalent to:

$$\hat{\beta}^{Ridge} = \underset{\beta_0, \beta_1, \ldots, \beta_k}{\operatorname{argmin}} \Big[ \sum_{i \in N} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \Big)^2 + \lambda \sum_{j=1}^{k} \beta_j^2 \Big] \tag{5}$$

In Equations (2)-(4), the parameter $t \ge 0$ works as a control for the penalty, which is the same role as $\lambda$ in Equation (5). In the case of Equation (5), increasing $\lambda$ strengthens the shrinkage process, while setting $\lambda = 0$ makes $\hat{\beta}^{Ridge}$ equal to $\hat{\beta}^{OLS}$. Different from subset selection, however, the ridge regression approach does not involve variable selection. As Nasekin (2013) highlights, regression analyses usually face a situation where many independent variables are irrelevant for the model and may actually decrease its predictive power.

As a result, Tibshirani (1996) proposes the so-called lasso approach, a shrinkage method that aims to combine features of both subset selection and ridge regression. In this sense, the lasso-type regression imposes a penalty on the coefficients (similar to the ridge regression); meanwhile, its estimating procedure works like a continuous version of the subset selection process. Thus, the method shrinks some of the coefficients while setting others to zero, achieving the basic goal of performing variable selection in the regression model.
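The contrast between shrinking and selecting can be illustrated in a simplified setting. The sketch below assumes orthonormal predictors (an assumption made here for illustration, not a condition imposed in the study) and an objective of the form "sum of squared residuals plus λ times the penalty". Under that assumption, ridge rescales each OLS coefficient by 1/(1 + λ), whereas an absolute-value penalty soft-thresholds each coefficient at λ/2. The function names and the example coefficients are hypothetical.

```python
import math

def ridge_shrink(b_ols, lam):
    """Ridge under orthonormal predictors: proportional shrinkage, never exactly zero."""
    return [b / (1.0 + lam) for b in b_ols]

def lasso_shrink(b_ols, lam):
    """Lasso under orthonormal predictors: soft-thresholding at lam / 2."""
    out = []
    for b in b_ols:
        mag = abs(b) - lam / 2.0
        # Coefficients whose magnitude does not exceed the threshold are set to zero.
        out.append(math.copysign(mag, b) if mag > 0 else 0.0)
    return out

b_ols = [2.0, -0.8, 0.3, -0.1]   # hypothetical OLS coefficients
lam = 1.0

ridge = ridge_shrink(b_ols, lam)  # all four coefficients remain nonzero
lasso = lasso_shrink(b_ols, lam)  # the two smallest coefficients become exactly zero

print(ridge)
print(lasso)
```

In this toy case ridge keeps every variable in the model, while the absolute-value penalty drops the two weakest ones, which is the selection behaviour that motivates its use for portfolios with a limited number of stocks.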
Tibshirani (1996) defines the lasso estimates in the form of the following optimization problem [9]:

$$\hat{\beta}^{lasso} = \underset{\beta_0, \beta_1, \ldots, \beta_k}{\operatorname{argmin}} \sum_{i \in N} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \Big)^2 \tag{6}$$

Subject to:

$$\sum_{j=1}^{k} |\beta_j| \le t \tag{7}$$

$$t \ge 0 \tag{8}$$

where the variables and parameters have the same definitions as in the models for $\hat{\beta}^{OLS}$ and $\hat{\beta}^{Ridge}$. Additionally, we assume that the $x_{ji}$ are standardized, so that $\sum_{i \in N} x_{ji} / N = 0$ and $\sum_{i \in N} x_{ji}^2 / N = 1$. However, even though Equations (2) and (6) are similar, their Constraints (3) and (7), which involve the penalty parameter $t$, are slightly different. As a consequence of Constraint (7), the optimization in Equations (6)-(8) takes the following form using the Lagrangian:

$$\hat{\beta}^{lasso} = \underset{\beta_0, \beta_1, \ldots, \beta_k}{\operatorname{argmin}} \Big[ \sum_{i \in N} \Big( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ji} \Big)^2 + \lambda \sum_{j=1}^{k} |\beta_j| \Big] \tag{9}$$

[9] To keep the description of the lasso-type regression short, we omit the explanation regarding the properties of $\hat{\beta}^{lasso}$. We refer the reader to Zhao and Yu (2006) and Konzen and Ziegelmann (2016) for a complete description of the lasso's consistency.

As Tibshirani (1996) and Hastie et al. (2009) point out, the model in Equation (6) can be re-parametrized by standardizing the predictors, so that the solution for $\beta_0$ is $\hat{\beta}_0 = \bar{y}$; thereby, it is possible to suppose $\bar{y} = 0$, thus omitting $\beta_0$. Furthermore, in a similar way to the ridge regression, the parameter $t$ in Constraint (7) controls the penalty imposed on the coefficients. Nevertheless, while the ridge regression imposes an $L_2$-norm penalty, $\sum_{j=1}^{k} \beta_j^2$, the lasso regression is characterized by an $L_1$-norm penalty, $\sum_{j=1}^{k} |\beta_j|$ (Hastie et al., 2009).

In Equations (6)-(8), $t \ge 0$ represents the penalty on the coefficients and controls the amount of shrinkage applied to the estimates. Tibshirani (1996) defines $\hat{\beta}_j^{o}$ as the full least squares estimates (OLS coefficients) and $t_0 = \sum_{j=1}^{k} |\hat{\beta}_j^{o}|$. Therefore, setting $t < t_0$ shrinks the solutions towards zero, with some coefficients exactly equal to zero. On the other hand, for $t \ge t_0$, the lasso regression estimates will be equal to the OLS estimates. For instance, letting $t = t_0 / 2$ has the effect of (roughly) shrinking the OLS coefficients by 50% on average (Hastie et al., 2009). For this reason, the parameter $t$ should be selected in a dynamic process to minimize an estimate of the expected prediction error.

Finally, concerning Equation (9), it is worth noting that $\lambda = 0$ (in the same way as $t \ge t_0$) results in lasso coefficients equal to the OLS ones. Moreover, increasing $\lambda$ implies a larger penalty that forces the coefficients to converge towards zero. Hence, the choice of $\lambda$ (or, equivalently, the choice of $t$) becomes an important step for the lasso-type regression to achieve good quality results (Nasekin, 2013), and is related to the calculation of the prediction error. As Tibshirani (1996) emphasizes, one option is to choose the value of the penalty parameter that minimizes the prediction error, which is based on the construction of a cross-validation style statistic. In this study, we opted to employ the cross-validation method, since it is traditionally used in the literature (Hastie et al., 2009). The following Section 2.2 presents this method in more detail.

2.2. K-fold Cross-validation

Hastie et al. (2009) describe K-fold cross-validation as the simplest and most widely used method to estimate the prediction error. As Efron and Tibshirani (1993) emphasize, starting from a simple regression model, the prediction error consists of the expected squared difference between a future response and its prediction from the model: $PE = E(y_i - \hat{y}_i)^2$. Then, the in-sample mean squared error is $MSE = (1/n) \sum_{i \in N} (y_i - \hat{y}_i)^2$. However, a more realistic application would be to split the data into training and testing samples, thus using the fitted model from the training sample to estimate the MSE of the testing sample (Efron and Tibshirani, 1993; Tibshirani, 1996). Based on this idea, Efron and Tibshirani (1993) presented the following Algorithm 1 for cross-validation:

Algorithm 1: K-fold Cross-validation (Efron and Tibshirani, 1993)
Step 1: Split the data into K roughly equal-sized parts.
Step 2: For the k-th part, fit the model to the other K − 1 parts of the data, and calculate the prediction error of the fitted model when predicting the k-th part of the data.
Step 3: Do the above for k = 1, 2, ..., K parts, and combine the K estimates of prediction error.

For instance, if we set K = 5, then for each k = 1, 2, ..., K, the model will be fitted to the data of the other K − 1 parts, and the fitted model will be used to compute the MSE of the k-th part of the sample. As described by Efron and Tibshirani (1993), if we let $k(i)$ be the part containing the i-th observation of the data, and define $\hat{y}_i^{-k(i)}$ as the fitted value for the i-th observation (estimated with the model fitted with the $k(i)$-th part of the data removed), then the cross-validation estimate of the prediction error (or cross-validated MSE) will be as follows:

$$CV_{MSE} = \frac{1}{n} \sum_{i \in N} \Big( y_i - \hat{y}_i^{-k(i)} \Big)^2 \tag{10}$$

In the lasso-type estimation, K-fold cross-validation is used to compute the $CV_{MSE}$ statistic in Equation (10) for different values of $\lambda$. Hence, the chosen value of $\lambda$ will be the one that results in the lowest cross-validation error. Figure 1 illustrates the process, where the y-axis represents the cross-validated MSE. As $\lambda$ increases (x-axis), an increasing number of coefficients become equal to zero, which tends to lead to larger error, and the best value of $\lambda$, as already mentioned, is the one that minimizes the cross-validated error, identified by the vertical dotted line in the figure. In our basic simulation to generate this example, as well as in our empirical tests described in Section 4, we use K = 10, i.e.
10-fold cross-validation, based on Breiman and Spector (1992) and Kohavi (1995), who claim that K = 5 or K = 10 are satisfactory choices to solve the lasso-type regression in general cases.

Figure 1: Cross-validated MSE of lasso fit

3. Methodology of the Study

In this Section, we first present the basic methodology for portfolio selection using both the index tracking and long-short investing strategies (Sections 3.1.1 and 3.1.2). Later, we describe the essential guidelines to solve the index tracking portfolio selection using cointegration (Section 3.2).

3.1. Index Tracking and Long-Short Investing Strategies

3.1.1. Index Tracking

The index tracking (IT) strategy consists of defining a portfolio of stocks that aims to replicate the return of a market benchmark. Naturally, if we wish to form a portfolio to track the S&P 100, for instance, the first choice would be a full replication of the index, holding all stock constituents in accordance with their index weights. However, due to cost control, IT portfolios usually have a cardinality constraint that limits their size to a specific number of stocks; additionally, such portfolios are not updated very frequently and are held for at least one month, thus reducing transaction and management costs. As a result of such constraints, IT portfolios are commonly evaluated by their tracking error (TE), which is defined as the standard deviation of the difference between portfolio and index returns in a specific time interval (Beasley et al., 2003; Guastaroba and Speranza, 2012):

$$TE = \Big[ \frac{1}{T} \sum_{t=1}^{T} \big( r_t^p - R_t \big)^2 \Big]^{1/2} \tag{11}$$

where $T$ is the time frame (for instance, one month), $t = 1, 2, \ldots, T$ corresponds to each business day in our dataset, $r_t^p$ is the portfolio daily return, and $R_t$ is the index daily return.

Regarding the lasso regression, the index tracking problem is implemented as follows. The dataset contains a time series of daily log returns for the market index and $N$ stocks, where $r_{jt}^l$ represents the daily log return of the j-th stock on the t-th day, and $R_t^l$ represents the index daily log return. Then, we implement Equation (9) in the following equivalent form:

$$\hat{\beta}^{lasso} = \underset{\beta_0, \beta_1, \ldots, \beta_N}{\operatorname{argmin}} \Big[ \sum_{t \in T} \Big( R_t^l - \beta_0 - \sum_{j=1}^{N} \beta_j r_{jt}^l \Big)^2 + \lambda \sum_{j=1}^{N} |\beta_j| \Big] \tag{12}$$

where $R_t^l = \log(P_t / P_{t-1})$ equals the log return of the index on the t-th business day (where $P_t$ is the index price on the t-th day), and $r_{jt}^l = \log(p_{j,t} / p_{j,t-1})$ is the log return of the j-th stock on day $t$ ($p_{j,t}$ is the price of the j-th stock, $j = 1, 2, \ldots, N$). The value of $\lambda$ is computed using K-fold cross-validation in line with Algorithm 1; as already mentioned, we choose K = 10, i.e. 10-fold cross-validation, based on previous literature (Breiman and Spector, 1992; Kohavi, 1995). After computing Equation (12), the IT portfolio is defined by normalizing the coefficients $\beta_j$, $j = 1, 2, \ldots, N$, to sum up to one; as a result, the weight of the j-th stock in the portfolio equals the normalized j-th coefficient.

Finally, concerning the lasso coefficients, we set up two definitions. First, we impose a constraint on the number of lasso coefficients that may take a value different from zero, which restricts the size of each portfolio. As mentioned before, IT portfolios normally have a limited number of stocks as managers attempt to minimize transaction costs, and for this reason we impose an upper bound on the size of each portfolio. Second, in contrast with prior literature (Wu et al., 2014; Yang and Wu, 2016), we do not impose a nonnegativity constraint on the parameters. Hence, we allow the IT portfolios to have short positions. Usually, IT models avoid short positions due to liquidity and cost issues, because shorting stocks might be difficult when few shares are available to borrow, which leads to larger costs associated with short selling. However, because the indexes selected for the empirical tests in our study are composed of the most liquid stocks in their markets, we opt to allow portfolios to have short positions. Furthermore, our results (as explained in Section 4) already account for the larger costs associated with short selling.

3.1.2. Long-Short

Alexander and Dimitriu (2005) describe the long-short strategy as a natural extension of the IT optimization using cointegration. As detailed in Section 3.2, cointegration is a two-step method based on ordinary least squares (OLS), following a statistical model where the first step consists of regressing the index daily price time series on the time series of daily stock prices (similar to the lasso method). However, in the case of long-short, we take the original index returns and use them to build enhanced indexes by adding (index plus) and subtracting (index minus) an annual excess return equal to α%.
Then, the model is estimated twice (substituting the benchmark plus and the benchmark minus for the original index), and the long-short portfolio is set up as the difference between the portfolio plus (obtained with the regression using the index plus) and the portfolio minus (obtained using the index minus). Thus, we obtain a self-financing portfolio whose objective is to produce positive, low-volatility returns that are uncorrelated with market returns. For instance, if we set α = 5%, then the construction of the index plus consists of adding an annual excess return of 5% (uniformly distributed over daily returns) to the original index daily returns. Likewise, the index minus is constructed by subtracting 5% from the original index returns.

Once the indexes plus/minus are built, we estimate the long-short portfolio with lasso by using Equation (12) to calculate two models, the first using the index plus instead of the original index time series, and the second using the index minus. For each regression, the coefficients are used to form a portfolio normalized to sum up to one (just as in the index tracking methodology). As a result, the outcomes are two portfolios (plus and minus), and the final weight of the i-th stock in the long-short portfolio is the difference between the weights of the i-th stock in the portfolios plus and minus.

According to Alexander and Dimitriu (2005), the conceptual background that supports the choice of the long-short strategy is its self-financing characteristic, since investing in the long-short portfolio is equivalent to selling the short portfolio (constructed using the index minus) to obtain the resources necessary to buy the long portfolio (constructed using the index plus). Such a strategy, then, follows Roll (1992) and Stucchi (2015), who argue that indexes may be inefficient, thus giving the investor the possibility of forming portfolios that outperform the market.
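The mechanics of the long-short construction can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the 252 business days per year used to spread the annual excess return, the function names, the sample returns, and the two coefficient vectors (which stand in for the output of the two fitted regressions) are all assumptions made for the example.

```python
TRADING_DAYS = 252  # assumption: business days per year, not stated in the text

def enhanced_index(index_returns, alpha, sign):
    """Add (sign=+1) or subtract (sign=-1) an annual excess return alpha,
    spread uniformly over the daily returns."""
    daily = alpha / TRADING_DAYS
    return [r + sign * daily for r in index_returns]

def normalize(coefs):
    """Scale regression coefficients so that the portfolio weights sum to one."""
    total = sum(coefs)
    return [c / total for c in coefs]

def long_short_weights(w_plus, w_minus):
    """Final weight of each stock: weight in portfolio plus minus weight in portfolio minus."""
    return [p - m for p, m in zip(w_plus, w_minus)]

# Hypothetical daily index returns and the two enhanced benchmarks for alpha = 5%.
index_returns = [0.004, -0.002, 0.001, 0.003]
index_plus = enhanced_index(index_returns, 0.05, +1)
index_minus = enhanced_index(index_returns, 0.05, -1)

# Hypothetical coefficients standing in for the two fitted regressions.
coef_plus = [0.50, 0.30, 0.00, 0.20]
coef_minus = [0.10, 0.00, 0.45, 0.35]

w_plus = normalize(coef_plus)
w_minus = normalize(coef_minus)
w_ls = long_short_weights(w_plus, w_minus)

# Because both legs sum to one, the long-short weights net to (numerically) zero.
print(w_ls, sum(w_ls))
```

Since each leg is normalized to sum to one, the difference of the two weight vectors sums to zero, which is the self-financing property described above.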

3.2 Cointegration Approach based on Simulations for Index Tracking

The concept of cointegration was introduced by Granger (1981) in time-series analysis and formalized some years later by Engle and Granger (1987). Since then, empirical studies (Alexander et al., 2002; Alexander and Dimitriu, 2005) have shown that financial assets are often cointegrated, which has motivated an alternative approach to equity trading and portfolio construction. By using all the information embedded in prices, it may be possible to detect a long-run equilibrium between a portfolio and a benchmark, which can be used to indicate the optimal strategic asset allocation.

Cointegration is a statistical property whereby a set of time series that are integrated of order 1, i.e. I(1), can be linearly combined to produce a single time series that is stationary, I(0). Formally, if $S_{1,t}, S_{2,t}, \ldots, S_{n,t}$ are I(1) time series, and if there are nonzero real numbers $\beta_1, \beta_2, \ldots, \beta_n$ such that $\beta_1 S_{1,t} + \beta_2 S_{2,t} + \cdots + \beta_n S_{n,t}$ is an I(0) series, then $S_{1,t}, S_{2,t}, \ldots, S_{n,t}$ are said to be cointegrated (Hamilton, 1994). When applied to prices in a stock market index, cointegration occurs when there is at least one portfolio of stocks that has a stationary tracking error, i.e., when there is mean reversion in the price spread between the portfolio and the index. This property does not provide any information for forecasting the individual prices in the system, or the position of the system at some point in the future, but it provides the valuable information that, irrespective of its position, the prices of the portfolio and the index will stay together in the long run. The use of cointegration in asset allocation is based on a two-step approach, as follows.
The first step in selecting a tracking portfolio is to confirm that each price series is I(1) over a predefined in-sample time frame. To infer the portfolio weights, we estimate the cointegration Equation (13) over a prespecified in-sample calibration period. As we assume no short sales, all stock weights must be nonnegative, which is achieved by applying a non-negative least squares (NNLS) estimation that enforces non-negativity of the regression coefficients.

$$\log(P_t) = \beta_{0,t} + \sum_{i=1}^{N} \beta_{i,t} \log(p_{i,t}) + \varepsilon_t \qquad (13)$$

where $P_t$ denotes the index price on day t, $p_{i,t}$ denotes the price of the i-th stock, i = 1, 2, ..., N, and $\varepsilon_t$ is a zero-mean tracking error. By normalizing the cointegration coefficients $\beta_{i,t}$ (for i = 1, 2, ..., N) to sum to one, we determine the proportional weight of each stock i in the portfolio.

The second step is to apply a unit root test to the series of residuals $\varepsilon_t$ resulting from Equation (13) to confirm that the linear combination of the N I(1) stock price series is stationary, i.e. I(0). To confirm whether such a stationary combination exists, we apply the Augmented Dickey-Fuller (ADF) test to $\varepsilon_t$ to test the null hypothesis of no cointegration ($\gamma = 0$), where $\gamma$ is the coefficient of the lagged fitted error term $\hat{\varepsilon}_{t-1}$ in Equation (14). If we let q be the order of the autoregressive (AR) process, $\hat{\varepsilon}_t$ be the estimated error term from Equation (13), and $\Delta\hat{\varepsilon}_t$ be the change between two consecutive error terms, then the ADF regression takes the following form:

$$\Delta\hat{\varepsilon}_t = \gamma\,\hat{\varepsilon}_{t-1} + \sum_{i=1}^{q} \phi_i\,\Delta\hat{\varepsilon}_{t-i} + u_t. \qquad (14)$$
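As a rough illustration of the second step, the sketch below computes the t-statistic of γ for the simplest case of Equation (14), with no lagged difference terms (q = 0). This is a simplification of the ADF test described above, not the authors' code; the simulated residual series and the threshold `PLACEHOLDER_CRITICAL_VALUE` are hypothetical, and in practice the statistic would be compared with the appropriate MacKinnon critical value.

```python
import math
import random
from typing import List


def dickey_fuller_tstat(residuals: List[float]) -> float:
    """t-statistic of gamma in  delta(e_t) = gamma * e_{t-1} + u_t
    (Equation (14) with q = 0, estimated by OLS without a constant)."""
    lagged = residuals[:-1]
    diffs = [residuals[t] - residuals[t - 1] for t in range(1, len(residuals))]
    sxx = sum(x * x for x in lagged)
    gamma = sum(x * y for x, y in zip(lagged, diffs)) / sxx
    errors = [y - gamma * x for x, y in zip(lagged, diffs)]
    sigma2 = sum(u * u for u in errors) / (len(diffs) - 1)  # one estimated parameter
    return gamma / math.sqrt(sigma2 / sxx)


# Simulated mean-reverting residuals (AR(1) with coefficient 0.5), for illustration.
random.seed(0)
stationary = [0.0]
for _ in range(499):
    stationary.append(0.5 * stationary[-1] + random.gauss(0.0, 1.0))

# Placeholder threshold only; NOT a MacKinnon critical value.
PLACEHOLDER_CRITICAL_VALUE = -5.0

t_stat = dickey_fuller_tstat(stationary)
rejects_no_cointegration = t_stat < PLACEHOLDER_CRITICAL_VALUE
print(t_stat, rejects_no_cointegration)
```

A strongly negative statistic indicates that the residuals revert to their mean, which is the evidence of cointegration the test looks for.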

By rejecting the null hypothesis, we confirm that the time series of estimated residuals is stationary, thereby attesting that the variables used in the regression are cointegrated. We consider the critical values suggested by MacKinnon (1992, 2010) at the 1% level of significance for the ADF test. Then, when the null hypothesis is rejected, the portfolio obtained from Equation (13) is a valid portfolio to track the market benchmark.

Finally, as described by Alexander and Dimitriu (2005), cointegration fits the context of portfolio selection and IT strategy because it is an appropriate method for modeling long-run asset price dynamics. However, a drawback of past studies lies in the selection of the assets that compose each portfolio, which is usually exogenous to the portfolio optimization process, since the OLS method does not perform variable selection. As a result, some studies select the stocks for the tracking portfolio based on their weights in the index composition, in which case the portfolios hold the stocks with the largest weights in the index (for instance, Alexander and Dimitriu, 2005; Dunis and Ho, 2005). Nonetheless, selecting the top 10 stocks with the largest weights in the index might become a tricky choice in the case of broader indexes that have lower concentration, such as the S&P 100 or the Russell 1000, as already mentioned in the Introduction.

Under these circumstances, the best option would be to test all possible combinations and select the best portfolio according to some criterion. For example, to form a 10-stock portfolio from a sample of 100 stocks, it is first necessary to select the 10 stocks on which the cointegration regression is estimated. Hence, the best-case scenario would be to test each possible combination of 10 stocks and choose the best one.
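To give a sense of the scale of such an exhaustive search, the following short calculation counts the 10-stock subsets of a 100-stock universe. The regression throughput used for the time estimate is a purely hypothetical figure, not a measurement from the study.

```python
import math

n_stocks, portfolio_size = 100, 10

# Number of distinct 10-stock portfolios that can be drawn from 100 stocks.
n_combinations = math.comb(n_stocks, portfolio_size)
print(n_combinations)            # 17310309456440
print(f"{n_combinations:.2e}")   # 1.73e+13

# Hypothetical throughput, for illustration only.
regressions_per_second = 1_000
seconds_per_year = 365 * 24 * 3600
years_needed = n_combinations / regressions_per_second / seconds_per_year
print(round(years_needed))       # several centuries at this assumed rate
```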
Nevertheless, such an analysis would face a challenge, especially in terms of computing time, since the number of possible combinations of 10 stocks in this case would be approximately 1.73 × 10^13. Thus, we seek to mitigate this issue and overcome the difficulty concerning portfolio selection by using a series of simulations to form each cointegrated portfolio. In this process, to obtain the portfolio for each in-sample subset, we first form a sequence of M different portfolio candidates, where each candidate is composed of s randomly selected stocks, i.e. s corresponds to the size limit of each portfolio (for instance, we chose s = 15 and s = 25 for the S&P 100, which limits the tracking portfolios to 15 and 25 stocks; see Section 4.1 for more detail about the testing setup). Second, after constructing the M portfolio candidates and discarding the ones that do not meet the cointegration requirements previously described, we select the portfolio whose estimation of Equation (13) resulted in the smallest sum of squared residuals (see footnote 10). With this process, we aim to overcome the difficulty of selecting the stocks to form each portfolio using cointegration, as the stock selection is now endogenous to the solving process (not an ex-ante choice).

Footnote 10: In this study, we select M = 50,000, so that we form 50 thousand distinct portfolios and select the best one based on the sum of squared residuals. We chose 50,000 because this was the maximum number of different combinations that we were able to form: as M increases, memory (RAM) usage grows, which imposes a limit on M.

4 Empirical Tests

First, Section 4.1 presents the details regarding the databases and the background definitions used to compute the tests. Then, Section 4.2 discusses the results for index tracking with the S&P 100 and the Ibovespa, and Section 4.3 discusses the results for a high-dimensional dataset: index tracking with the Russell 1000. Later, Section 4.4 describes the comparison between the results for index tracking using lasso and cointegration. Finally, Section 4.5 shows the results for the long-short investing strategy using lasso.

4.1 Database and Testing Setup

We select three databases to perform this study. The first one consists of the S&P 100 (one of the main benchmarks in the US market) and 101 stocks; the second has the Ibovespa index (the reference benchmark in the Brazilian market) and 55 stocks; finally, the third has the Russell 1000 index (composed approximately of the 1,000 largest firms in the US equity market) and 907 stocks. All three datasets were extracted from Economatica, a financial database widely used in Brazil by both market participants and academics. Our database includes daily stock prices from January 2010 to September 2017, a period that includes 1921 trading days. Prices are adjusted for (1) splits, mergers, and other corporate actions and (2) the payment of dividends.

For each dataset, we select two sizes for the tracking portfolios. To track the S&P 100, we form portfolios limited to 15 and 25 stocks; for the Ibovespa index, we estimate portfolios of up to 8 and 12 stocks; finally, for the Russell 1000, we form portfolios limited to 30 and 40 stocks. Additionally, to compute the tests, we choose in-sample intervals of 480 data points (similar to Alexander and Dimitriu, 2002), each data point being one business day, whereas out-of-sample intervals equal 60, 120, and 240 data points (which means updating the portfolios roughly every three months, six months, and one year, i.e. quarterly, semiannual, and annual updates). As a result, the first portfolio is obtained by estimating a regression with data from t = 1 to t = 480, and its results are observed over the next 60, 120, or 240 data points in a rolling horizon framework.
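The rolling horizon framework can be sketched as follows. This is an illustrative reading of the setup rather than the authors' code; in particular, it assumes that the 1921 daily prices yield 1920 daily observations (returns) available for estimation and evaluation, which is an assumption made here so that the windows tile the sample exactly.

```python
from typing import List, Tuple

Window = Tuple[int, int, int, int]  # (fit_start, fit_end, test_start, test_end)


def rolling_windows(n_obs: int, in_sample: int, out_sample: int) -> List[Window]:
    """Enumerate estimation and evaluation windows, 1-indexed and inclusive.
    Each new window starts out_sample observations after the previous one."""
    windows = []
    start = 1
    while start + in_sample - 1 < n_obs:
        fit_end = start + in_sample - 1
        test_end = min(fit_end + out_sample, n_obs)
        windows.append((start, fit_end, fit_end + 1, test_end))
        start += out_sample
    return windows


N_OBS = 1920  # assumption: 1921 daily prices give 1920 daily returns

quarterly = rolling_windows(N_OBS, in_sample=480, out_sample=60)
semiannual = rolling_windows(N_OBS, in_sample=480, out_sample=120)
annual = rolling_windows(N_OBS, in_sample=480, out_sample=240)

print(quarterly[0])   # (1, 480, 481, 540)
print(quarterly[1])   # (61, 540, 541, 600)
print(len(quarterly), len(semiannual), len(annual))  # 24 12 6
```

The buy-and-hold case corresponds to keeping only the first window's portfolio and evaluating it over all remaining observations.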
For instance, in the case of 60 days (quarterly updates), the second portfolio is formed with data from t = 61 to t = 540, and so on. Consequently, we obtain a total of 24 portfolios to cover the whole dataset interval in the case of quarterly updates, 12 portfolios in the case of semiannual updates, and 6 portfolios for annual updates. Moreover, we also consider a buy-and-hold case in which we do not update the portfolios over time, i.e. the first portfolio optimized with data from t = 1 to t = 480 is held until the last day in the samples.

Concerning the lasso-type regression, the empirical analysis consists in evaluating Equation (12) with the daily log returns of the index and the stocks. In contrast, the tests based on cointegration are estimated with the daily log prices of the index and the stocks, in line with the methodology described in Section 3.2. Furthermore, to select each cointegrated portfolio, we first compute 50,000 regressions (each one with a different combination of s stocks, where s is the maximum size of the portfolio) according to the methodology in Section 3.2, and then select the best-fitting portfolio, i.e. the one with the smallest sum of squared residuals. Because increasing the number of combinations used to form candidate portfolios requires more memory (RAM), we could not compute more than 50,000 different combinations. Finally, we highlight that the results presented in the next sections already account for transaction costs, as we use Equation (15) to compute the daily returns in the rolling window projections (Han, 2005; Do and Faff, 2012):

$$r_{i,t} = \log\left(\frac{p_{i,t}}{p_{i,t-1}}\right) + \log\left(\frac{1-C}{1+C}\right) - d \qquad (15)$$

where $p_{i,t}$ is the price of stock i on day t, C represents the transaction costs, and d refers to the costs related to short positions. In our empirical tests, we set C = 0.5% (which refers mainly to brokerage fees), and d = 2% per year (which refers basically to rental costs). Both costs are deducted from the return of stock i on every day the portfolio is updated.

4.2 Index Tracking Using Lasso: Indices S&P 100 and Ibovespa

We start the empirical analysis using lasso regression to solve the index tracking problem for the S&P 100 and the Ibovespa. The portfolios were compared using the following performance measures: (i) annual average returns; (ii) cumulative returns; (iii) annual volatility; (iv) daily TE average; (v) daily TE volatility; and (vi) monthly average turnover, which is defined as follows:

$$\frac{1}{f}\left[\frac{1}{np}\sum_{p=2}^{np}\left(\frac{1}{2}\sum_{i=1}^{N}\left|x_i^{p}-x_i^{p-1}\right|\right)\right] \qquad (16)$$

where $x_i^{p}$ is the weight of stock i in the p-th portfolio, np is the number of portfolios estimated per portfolio size and updating frequency (for instance, considering quarterly updates, we form a total of 24 portfolios), p and p-1 are consecutive rebalancing dates, and f equals 3 for quarterly rebalancing, 6 for semiannual rebalancing, and 12 for annual rebalancing. The results are in Table 1 and Figures 2 and 3.

Concerning the S&P 100, Table 1 initially shows the good quality of the results in terms of tracking performance, especially for portfolios of up to 15 stocks with quarterly updates and up to 25 stocks with semiannual updates, as they present cumulative returns very close to the index. Also, the buy-and-hold portfolios show outstanding results, considering that they are held constant throughout the entire out-of-sample interval (roughly 5.5 years); in both cases (portfolios of up to 15 and 25 stocks), buy-and-hold results in annual average returns (respectively 12.41% and 12.48%) very close to the index average annual return (11.43%).
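The monthly average turnover measure can be sketched in a few lines. The sketch follows one reading of Equation (16), in which the one-way turnover at each rebalancing (half the sum of absolute weight changes) is summed, then divided by the number of portfolios np and by the number of months f between updates; that normalization is an assumption, and the tickers and weights below are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, float]


def one_way_turnover(previous: Weights, current: Weights) -> float:
    """Half the sum of absolute weight changes between two consecutive portfolios."""
    tickers = set(previous) | set(current)
    return 0.5 * sum(abs(current.get(t, 0.0) - previous.get(t, 0.0)) for t in tickers)


def average_monthly_turnover(portfolios: List[Weights], months_between_updates: int) -> float:
    """Summed rebalancing turnover divided by np and by f (assumed normalization)."""
    n_portfolios = len(portfolios)
    total = sum(one_way_turnover(prev, curr)
                for prev, curr in zip(portfolios, portfolios[1:]))
    return total / n_portfolios / months_between_updates


# Hypothetical sequence of three quarterly portfolios (f = 3).
portfolios = [
    {"AAA": 0.5, "BBB": 0.5},
    {"AAA": 0.5, "CCC": 0.5},   # BBB fully replaced by CCC: turnover 0.5
    {"AAA": 0.4, "CCC": 0.6},   # small reweighting: turnover 0.1
]
turnover = average_monthly_turnover(portfolios, months_between_updates=3)
print(f"{turnover:.4f}")  # 0.0667
```

Stocks that enter or leave the portfolio are treated as having zero weight on the side where they are absent, so replacing a position counts as both a sale and a purchase before the factor of one half is applied.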
TABLE 1 HERE

Additionally, increasing the size of the portfolios from 15 to 25 stocks results in a smaller average tracking error for all updating frequencies (comparing portfolios with the same updating frequency), as would be expected (intuitively, larger portfolios should track the index more accurately). Moreover, increasing the size of the portfolios also results in larger correlation with the benchmark index and smaller average monthly turnover.

Lastly, the quality of the tracking portfolios for the S&P 100 can also be observed in Figure 2a, which shows the cumulative returns of portfolios of up to 15 and 25 stocks using semiannual and annual updating frequencies. The figure exhibits good tracking performance in all four cases, as the portfolios remain very close to the index over time. We can reach the same conclusion by observing the monthly returns of those portfolios in Figure 3a; in this case, both portfolios (15 and 25 stocks) show small detachments from the index in a few months, such as Dec 2012, Jul 2016, Aug 2016 and Oct.

As we turn our attention to the results for the Ibovespa index, we first highlight the considerably larger volatility of the Brazilian index in comparison with the S&P 100. In fact, Table 1 shows that the Ibovespa has an annual volatility of 23.05%, almost twice as large as that of the S&P 100 (12.47%). The consequence of such volatility is seen in the portfolios' average tracking error, where the values for the Ibovespa tracking portfolios are in general twice as large as those of the portfolios tracking the S&P 100. Nonetheless, the Ibovespa tracking portfolios also show good results in terms of cumulative returns, especially portfolios of up to 8 and 12 stocks with semiannual and annual updating frequencies. In those cases, the difference between the portfolio's cumulative return and the index's cumulative return remains below 10 percentage points. Furthermore, increasing the number of stocks in the portfolio results in smaller values for the average tracking error, annual volatility and average monthly turnover, as well as larger correlation with the index. Such results are in line with the conclusions drawn from the S&P 100 tracking portfolios.

Figure 2: Out-of-sample forecast per index and portfolio updating frequency. (a) S&P 100; (b) Ibovespa.

Finally, Figures 2b and 3b show the quality of the results for the Ibovespa in terms of cumulative returns as well as monthly returns, considering semiannual and annual updating frequencies. Concerning cumulative returns, the increasing detachment of both portfolios from the index is noticeable, especially between June 2015 and July 2016, with a peak in the difference between monthly returns in February 2016, as shown in Figure 3b. However, these findings are mitigated by the fact that, from June 1st, 2015 to Jan 26th, 2016, the index has a cumulative return of %, then moving up by 52.83% from Jan 26th, 2016 to July 29th, 2016. Such strong volatility shows how difficult it was for tracking portfolios to follow the index during this period and explains the separation between the monthly returns of the index and the portfolios, especially between June 2015 and August.

Figure 3: Monthly return per index and portfolio updating frequency. (a) S&P 100; (b) Ibovespa.

4.3 Index Tracking in a High-dimensional Dataset: Index Russell 1000

In the previous Section 4.2, we described the results for the tracking portfolios using lasso regression and two market benchmarks: the S&P 100 and the Ibovespa. However, neither of those indexes is composed of a very large number of stocks (101 stocks in the dataset for the S&P 100, and 55 stocks for the Ibovespa). Yet, according to the literature on lasso regression (for example, Tibshirani, 1996; Nasekin, 2013; Konzen and Ziegelmann, 2016), a notable characteristic of this statistical approach is its ability to handle high-dimensional problems. This feature results from the capacity of the lasso regression to perform variable selection through the penalty function imposed on the coefficients, which leads the model towards a shrinkage process that keeps only the most relevant coefficients in the regression.

For this reason, we also opted to carry out an empirical analysis of index tracking using a larger market benchmark: the Russell 1000, which is composed approximately of the 1,000 largest firms listed in the US equity market. In our analysis, the dataset for the Russell 1000 has a total of 907 stocks, which poses a challenge for the index tracking problem since the Russell 1000 constituents have minimal concentration in the index portfolio, as mentioned in the Introduction. We describe the results for the tracking portfolios in Table 2 and Figure 4. To track the Russell 1000, we form portfolios limited to 30 and 40 stocks, with quarterly, semiannual and annual updating frequencies (similar to the tracking analysis for the S&P 100 and the Ibovespa).

Initially, Table 2 shows once again the good quality of the tracking solutions in terms of both the average annual returns and the cumulative returns. In the case of portfolios using quarterly updates, the cumulative returns are very low and the tracking performance is poorer relative to the other updating frequencies, since the more frequent portfolio updates resulted in larger transaction costs that penalized the portfolio's cumulative performance. However, as the update interval increases, the results become consistent for all remaining portfolios (semiannual and annual updates, as well as the buy-and-hold strategy). Regarding portfolios of up to 30 stocks with semiannual and annual updates, lasso portfolios present average annual returns of 14.04% and 14.39%, which is a difference close to 2 percentage points from the index annual average return.
Moreover, we notice better performance as we increase the size limit of the tracking portfolios, as would be expected; in the case of semiannual and annual updating intervals, we see a significant improvement in the cumulative returns, especially for semiannual updates (cumulative return equal to %, almost identical to the index cumulative return: %). Naturally, increasing the size of each portfolio (per updating frequency) resulted in a lower average tracking error and annual volatility, as well as larger correlation with the index. Such findings are in accordance with the results for the S&P 100 and the Ibovespa, where we also obtained slightly better performance with larger tracking portfolios.

TABLE 2 HERE

Finally, the results for the Russell 1000 are also shown in Figures 4a and 4b, which present respectively the cumulative performance and the monthly returns of tracking portfolios using semiannual and annual updating frequencies. Regarding the cumulative performance, we notice good tracking results for the portfolios limited to either 30 or 40 stocks, with small detachments from the index around August 2016 (Figure 4a). The same conclusion can be drawn from Figure 4b, in which we observe larger differences between the monthly returns of the portfolios and the index especially in August 2016 and October. Nonetheless, the performance remains adequate during the remaining sample interval, which highlights the good quality of the tracking results obtained using lasso regression for a high-dimensional dataset.

4.4 Validation of the Lasso-type Regression: Comparison with Cointegration Based on Simulations

As discussed in the previous subsections, the application of lasso regression to solve the index tracking problem resulted in promising conclusions regarding the capacity of this method to perform portfolio selection.
Still, a comparison with another statistical method may help shed new light on the previous findings. So, given the extensive use of cointegration in the previous literature on index tracking, we also opted to estimate the tracking portfolios using this method, in order to have a basis for comparison and validation of the results obtained using lasso.

Figure 4: Russell 1000. (a) Out-of-sample forecast per index and portfolio updating frequency; (b) Monthly return per index and portfolio updating frequency.

To carry out the cointegration tests, we followed the methodology described in Section 3.2. Therefore, as mentioned in Section 4.1, to select each tracking portfolio (for each in-sample interval), we first computed 50,000 distinct candidate portfolios, each one with s randomly selected stocks (where s is the size limit of the portfolio); then, after computing the ADF test to verify the cointegration properties and discard invalid portfolios, we chose the one with the smallest sum of squared residuals resulting from Equation (13). Finally, we highlight that the use of the OLS regression would most likely result in both negative and positive OLS estimates, i.e. long and short positions in each portfolio. Nevertheless, none of the portfolios obtained using lasso presented short positions. For this reason, we chose to estimate cointegration using non-negative least squares instead of OLS, thereby avoiding short positions (negative regression coefficients) in the cointegrated portfolios.

The results for cointegration (hereafter referred to as OLS-NN) and lasso are described in Table 3 and Figures 5 and 6. Initially, Table 3 summarizes the results using lasso and OLS-NN for each of the three indexes. In the case of the S&P 100, we observe similar results for both methods in terms of annual average returns as well as cumulative returns. Specifically, lasso portfolios limited to 25 stocks present cumulative returns closer to the index cumulative return; in contrast, OLS-NN portfolios have a smaller average tracking error in all cases, as well as somewhat lower volatility, which implies that OLS-NN generates less risky portfolios.

TABLE 3 HERE

In the case of tracking portfolios for the Ibovespa index, we see an overall pattern similar to the results for the S&P 100. In terms of average annual return and cumulative return, the lasso approach resulted in superior performance for portfolios of up to 12 stocks; meanwhile, OLS-NN portfolios of up to 8 stocks have cumulative returns closer to the index performance (except for the buy-and-hold strategy). In contrast, in terms of average tracking error and annual volatility, all portfolios using OLS-NN perform slightly better than lasso.

Finally, we reach similar conclusions for the results related to the Russell 1000. Table 3 again shows moderately better performance for OLS-NN portfolios, with cumulative returns closer to the index and a lower average tracking error. However, such differences in performance are very small, especially for portfolios of up to 40 stocks, in which case the difference between the average tracking error of OLS-NN and lasso portfolios with the same updating intervals remains below percentage points in all four updating frequencies.

As the findings for OLS-NN and lasso portfolios are hardly distinguishable in terms of overall performance, we turn our attention to the portfolio concentration and average monthly turnover, because both measures translate into portfolio risk and costs. Figure 5 compares the concentration of the stock weights in the portfolios for each index.
In this analysis, we consider all 42 portfolios obtained per index and portfolio size (as already mentioned, we formed a total of 24 portfolios in the case of quarterly updates, 12 portfolios for semiannual updates, and 6 portfolios for annual updates), so that we are able to verify the concentration of the stock weights. In Figure 5a, we can see that the tracking portfolios for the S&P 100 have slightly lower average weights using lasso, if we compare portfolios of the same size. Nonetheless, lasso portfolios also present more extreme (outlier) weights, which explains the larger annual volatility values for lasso portfolios in Table 3. Moreover, similar conclusions can be drawn from the results for the Ibovespa (Figure 5b) and the Russell 1000 (Figure 5c). Overall, lasso portfolios have a larger number of stocks with weights identified as outliers, which is consistent with the larger volatility of those portfolios for all three indexes.

Nonetheless, despite the slightly better results of OLS-NN portfolios regarding the concentration of stock weights, we see a remarkable advantage of portfolios using lasso in Figure 6. Here, we compare the average monthly turnover and the portfolios' average tracking error (organized by index and portfolio size). The figure shows that the average tracking error per portfolio is slightly smaller for portfolios using OLS-NN. For instance, OLS-NN portfolios tracking the S&P 100 and limited to 15 stocks have an average tracking error equal to 0.032%, 0.023%, and 0.016% for quarterly, semiannual, and annual updating frequencies, respectively; meanwhile, lasso portfolios have an average tracking error equal to 0.040%, 0.029%, and 0.020%. However, as we observe the average monthly turnover, the values for lasso portfolios are at least 50% lower: 6.0%, 4.3%, and 3.3%, against 25.7%, 12.4%, and 6.6% for OLS-NN portfolios.
The complete list of results for average monthly turnover is presented in Table 3, and the same pattern mentioned above for the S&P 100 can be noticed in the results for the Ibovespa and the Russell 1000. For instance, lasso portfolios tracking the Ibovespa and limited to 8 stocks resulted in average monthly turnovers equal to 4.9%, 3.6%, and 2.9% (respectively, quarterly, semiannual and annual updates); in contrast, OLS-NN portfolios have turnover equal to 19.4%, 9.5%, and 5.2%. In this case, lasso portfolios have turnover at least 40% smaller (in the case of annual updates). Concerning the Russell 1000, the smallest average monthly turnover for OLS-NN portfolios is 7.6%, in the case of portfolios of up to 30 stocks with annual updates; however, the turnover for lasso portfolios with 30 stocks and quarterly updates equals 4.6%, hence signaling transaction costs about 40% smaller.

Figure 5: Distribution of the stock weights in the portfolios per index, size of portfolio and statistical model. (a) S&P 100: portfolios up to 15 and 25 stocks; (b) Ibovespa: portfolios up to 8 and 12 stocks; (c) Russell 1000: portfolios up to 30 and 40 stocks.

As a result, Figure 6 shows that, on the one hand, portfolios formed using lasso and OLS-NN are very similar concerning overall performance (represented by average tracking errors). On the other hand, the substantial difference in average monthly turnover implies that portfolios using lasso have considerably lower costs. Thus, we can infer from these results the good quality of the lasso regression solutions for index tracking; although lasso portfolios have slightly inferior performance in some cases, this approach resulted in portfolios with overall costs at least 40% lower than OLS-NN portfolios.

4.5 Results for Long-Short Using Lasso

As described in Section 3.1.2, the goal of the long-short strategy is to exploit temporary market failures by assuming long positions in undervalued stocks and short positions in overvalued stocks.
The selection of those stocks is made through the use of benchmarks plus and minus, obtained by adding an annual percentage α% to the index or subtracting it from the index (uniformly distributed over daily returns). After creating those synthetic benchmarks, the next step consists in forming the long and short portfolios separately (tracking the indexes plus and minus); finally, it is necessary to subtract the short portfolio from the long portfolio to obtain the net position in each stock.

Figure 6: Comparison between Average Monthly Turnover and Portfolios Average Predicted Tracking Error. (a) S&P 100; (b) Ibovespa; (c) Russell 1000.

To estimate the long-short portfolios, we selected α = 2% for the datasets of the S&P 100 and the Russell 1000, and α = 2.5% for the dataset of the Ibovespa. Moreover, we calculated long-short portfolios limited to 40 stocks based on the S&P 100 (a maximum of 20 stocks for each of the long and short portfolios separately), portfolios limited to 20 stocks based on the Ibovespa, and portfolios limited to 50 stocks based on the Russell 1000. The results are presented in Tables 4 and 5, and Figure 7. Since all portfolios naturally have stocks with short positions and increasing volatility (in comparison with tracking portfolios), we diminished the updating
