Stock Market Prediction through Technical and Public Sentiment Analysis
Kien Wei Siah, Paul Myers

I. INTRODUCTION

Stock market price behavior has been studied extensively. It is influenced by a myriad of factors, including political and economic events, among others, and is a complex nonlinear time-series problem. Traditionally, stock price forecasting has been performed through technical analysis, which focuses on price action: the process of finding patterns in price history. More recently, research has shown that public sentiment is correlated with stock market events [1], [2], [3]. This project studies the potential of using both behavioral and technical features in stock price prediction models based on traditional classifiers and popular neural networks. We believe that behavioral data may offer insights into financial market dynamics beyond those captured by technical analysis. An improved price forecasting model can yield enormous rewards in stock market trading.

A. Problem Statement

For this project, we focus on the Nikkei 225 (N225) stock index. N225 is the stock market index for the Tokyo Stock Exchange: a price-weighted average of 225 top-rated Japanese companies listed on the Tokyo Stock Exchange. With Japan currently being the third largest economy in the world, and Tokyo being one of the largest global financial centers, the N225 price index is a critical financial indicator that is closely watched by traders and banks around the world.

We formulate the stock price prediction problem as a binary classification problem: whether the future daily return of N225 will be positive (1) or negative (0), i.e. whether N225's closing price tomorrow will be higher (1) or lower (0) than today's closing price. Daily return is defined in Equation 1:

R_i = (C_i - C_{i-1}) / C_{i-1}    (1)

where R_i is the daily return for the i-th day and C_i is the N225 closing price for the i-th day. The daily return for day i is essentially the percent change in closing price from day (i-1) to day i. The future daily return for day i is just R_{i+1}. Note that to obtain the classification target, we must take the sign of the future daily return R_{i+1} rather than its numerical value. As described in the introduction, we investigate the use of price histories and public sentiment indicators available up to day i to predict sign(R_{i+1}). Subsequent sections cover the data collection process. Since this is framed as a classification task, we may use classification accuracy as a metric for evaluating the performance of the various models.

II. DATA COLLECTION AND FEATURE GENERATION

A. Price History

We queried daily historical prices of N225, for all trading days spanning January 1, 2004 to December 31, 2014, from Yahoo! Finance. However, financial time series are well known to be non-stationary, with means, variances and covariances that change over time. Such non-stationary data are difficult to model and will likely give poor classification accuracy when used directly as features. Viewing the daily prices as random walks, we attempted to stationarize the price history (through differencing and lagging) before using it as predictors. To this end, we used three main types of conventional price technical indicators as features [13]:

1) n-day Returns

R_{i,n} = (C_i - C_{i-n}) / C_{i-n}    (2)

where R_{i,n} is the i-th day return with respect to the (i-n)-th day, or the percentage difference between the i-th day closing price C_i and the (i-n)-th day closing price C_{i-n}.
Positive values imply that the N225 index has risen over the n days. For n = 1, we recover the simple daily returns equation (Equation 1).

2) n-day Returns Moving Average

MA_{i,n} = (R_{i,1} + R_{(i-1),1} + ... + R_{(i-n),1}) / n    (3)

where MA_{i,n} is the average of the returns over the previous n days, and n > 1 because a one-day average is just that day's return.

3) n-time Lagged 1-Day Returns

R_{i,1}, R_{(i-1),1}, ..., R_{(i-n),1}    (4)

where R_{(i-n),1} is the (i-n)-th day's 1-day return.

By varying n, we obtain different numbers of features containing varying degrees of information about price trends and past prices. This is one of the multiple parameters we will vary and decide upon using cross validation.
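To make the feature generation concrete, the sketch below shows one way the three indicator families and the classification target could be computed with pandas from a series of daily closing prices. It is a minimal illustration only: the variable names and the exact handling of the boundary days are our own assumptions, not the project's actual code.

import pandas as pd

def make_price_features(close, n):
    """Build technical-indicator features from a pandas Series of daily closing prices.

    close : closing prices indexed by trading day, oldest first
    n     : look-back window, chosen by cross validation in the report
    """
    feats = pd.DataFrame(index=close.index)
    r1 = close.pct_change()                        # 1-day returns R_{i,1} (Equation 1)
    for k in range(1, n + 1):
        feats["ret_%d" % k] = close.pct_change(k)  # k-day returns R_{i,k} (Equation 2)
    for k in range(2, n + 1):
        feats["ma_%d" % k] = r1.rolling(k).mean()  # moving average of the last k 1-day returns (cf. Equation 3)
    for k in range(1, n + 1):
        feats["lag_%d" % k] = r1.shift(k)          # k-time lagged 1-day returns (Equation 4)
    target = (r1.shift(-1) > 0).astype(int)        # sign of tomorrow's return: 1 = up, 0 = down
    return feats, target

The rows for the first n days (and the final day, which has no next-day return) would then be dropped, as discussed in Section IV.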

B. Public Sentiment Indicators

In addition to conventional technical indicators, we also looked at public sentiment indicators. The theory of behavioral economics postulates that emotions play a significant role in influencing the economic decisions of individuals, and research has shown that this applies to societies at large as well. In fact, Bollen et al. used Twitter messages as indicators of public mood states and demonstrated that they were correlated with, and predictive of, the Dow Jones Industrial Average over time [2]. In another study, Preis et al. found patterns in Google query volumes, for search terms related to finance, that constitute early warning signs of stock market movements. They hypothesize that investors search for information online about the markets before eventually deciding whether to buy or sell stocks. This suggests that search query data from Google Trends may contain valuable predictive information about the information-gathering process that precedes trading decisions in the stock market [3].

This project takes inspiration from these two widely cited studies and attempts to integrate some aspects of public sentiment analysis into our features, in the hope that combining behavioral data with technical price indicators will lead to improved performance. To this end, we used behavioral data from two sources: Bloomberg Businessweek and Google Trends. We were unable to replicate Bollen et al.'s study using Twitter messages, as Twitter has restricted public access to very limited amounts of data and other Twitter data sources required paid subscriptions. Therefore, similar to [3], we used trends in Google query volume for finance-related search terms as a proxy for public sentiment. Further, we wrote a script to crawl a free online news archive, Bloomberg Businessweek, for articles published from 2004 to 2014: approximately 210,000 articles were gathered. It is hoped that the state of the economy and prevalent stock market conditions can be extracted from these articles through sentiment analysis.

For Google Trends, we focused on the daily search volumes of five finance-related search terms that showed the greatest predictive potential for stock market forecasting in [3], namely economics, debt, inflation, risk and stocks. Google Trends scores the daily query volumes on a scale of 0-100, normalized with respect to the peak within the date range (2004 to 2014 in our case).

Subsequently, we performed a relatively simple sentiment analysis on the news articles crawled from Bloomberg Businessweek to obtain daily sentiment scores. First, we obtained lists of positive and negative words that are both financial-specific and general. For the financial-specific words, we used the lists published by McDonald, originating from his research on sentiment analysis of financial texts [4]. This is particularly relevant in our case as words with positive meanings in the general context may actually be negative in the financial context. For the general case, we used the lists of positive and negative opinion words, or sentiment words, by Hu and Liu [5]. To compute the sentiment score for each article, we used the following equation:

Score = (POS - NEG) / (POS + NEG)    (5)

where POS is the number of positive words (from the lists obtained earlier) counted in the article and NEG is the number of negative words counted in the article. The positive and negative words were counted as many times as they appear. A score of +1 implies an entirely positive article, 0 (when no words are counted) implies a neutral article, and -1 implies an entirely negative article. Daily scores were obtained by averaging over all the articles in that day. Just computing this score for the 210,000 articles crawled took a few days and had to be done in batches. A more sophisticated sentiment analysis would likely have required more time and been infeasible within the time frame of this project.
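As a rough illustration of the scoring in Equation 5, the following sketch counts list words in a single article and averages the article scores for a day. The tokenization and function names are illustrative assumptions; this is not the exact code used on the 210,000 crawled articles.

import re

def article_score(text, pos_words, neg_words):
    """Score one article as (POS - NEG) / (POS + NEG); returns 0.0 when no listed words occur."""
    tokens = re.findall(r"[a-z']+", text.lower())      # simple lower-cased word tokenization
    pos = sum(1 for w in tokens if w in pos_words)     # every occurrence counts (Equation 5)
    neg = sum(1 for w in tokens if w in neg_words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

def daily_score(articles, pos_words, neg_words):
    """Average the scores of all articles published on a given day."""
    scores = [article_score(a, pos_words, neg_words) for a in articles]
    return sum(scores) / len(scores) if scores else 0.0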
C. Missing Data and Look-Ahead Bias

By crawling for our own data, we inevitably face the problem of missing data, e.g. price histories for some days are missing, and the Bloomberg Businessweek archive does not have articles for every trading day. In dealing with this issue, we have three options: mean imputation, interpolation based on the previous and next data points, or sample and hold. We opted for the last option (carrying forward the last observed valid data point), as we felt that mean imputation and interpolation would introduce some extent of look-ahead bias (using information that would not have been available at that time). For instance, the interpolation of prices or returns implicitly uses the future price, i.e. the interpolated point will be higher if the next price is high. This would lead to inaccurate results. While there are certainly more sophisticated and effective techniques for dealing with missing data, we considered only the simpler methods in view of time constraints.

III. RECURRENT NEURAL NETWORK

A. Vanilla Recurrent Neural Network

Recurrent Neural Networks (RNNs) have shown great potential in many natural language processing tasks (e.g. machine translation, language models, etc.) and are becoming increasingly popular. Unlike vanilla Neural Networks (NNs), the RNN's network topology allows it to make use of sequential information. This is a natural fit for stock market prediction, a time-series problem: knowing previous days' prices may help us predict tomorrow's price.

Fig. 1. Recurrent Neural Network topology [6].

As illustrated in Figure 1, the RNN performs the same operations, with the same weights, for each element of the sequence. It takes into account the previous step's state (s_{t-1}) while computing the output for the current step. This recurrent property gives it a memory, as mentioned earlier. The relevant equations are as follows:

s_t = tanh(U x_t + W s_{t-1})    (6)

o_t = sigmoid(V s_t)    (7)

where U, W and V are the weight matrices used across all time steps, x_t is the input at time step t, s_t is the hidden state at time step t and o_t is the output at time step t.
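A minimal NumPy sketch of the forward pass in Equations 6 and 7 is given below. The weight shapes and names are assumptions for illustration; the project's from-scratch implementation is not reproduced here.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_forward(xs, U, W, V):
    """Run the vanilla RNN over a sequence and return the final output.

    xs : sequence of input vectors x_1, ..., x_T (one feature vector per day)
    U  : input-to-hidden weights,  shape (hidden_dim, input_dim)
    W  : hidden-to-hidden weights, shape (hidden_dim, hidden_dim)
    V  : hidden-to-output weights, shape (1, hidden_dim)
    """
    s = np.zeros(U.shape[0])           # s_0 initialized with zeros
    for x in xs:
        s = np.tanh(U @ x + W @ s)     # Equation 6: new hidden state
    return sigmoid(V @ s)              # Equation 7: probability that the price rises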

We may think of s_t as the memory of the RNN, which contains information about the inputs and computations of all the previous time steps (subject to the vanishing gradient problem elaborated below). As described earlier, the hidden state is computed from the previous hidden state s_{t-1} and the current input x_t (Equation 6). The first hidden state s_0 is typically initialized with zeros. In our stock market prediction problem, we can think of x_t as the feature vector of each day (comprising the features from Section II). Figure 1 shows outputs at all time steps, but in our case we are really only concerned with the output at the final step, which is the prediction of whether the price will rise or fall. In other words, we input feature vectors from the previous t days into the RNN sequentially, and o_t (a sigmoid output, Equation 7) represents the probability of the price rising or falling on the (t+1)-th day. This allows the RNN to capture more temporal information than classifiers (e.g. Support Vector Machines, NNs, Logistic Regression) that only take input from one time step.

Training for RNNs is similar to that for vanilla NNs: backpropagation. However, for RNNs, we backpropagate through time to obtain the gradients of the loss with respect to the weights. The idea is to unfold the RNN across time (similar to Figure 1) and do backpropagation as if it were a normal NN. Since this is a classification problem, we can use the binary cross-entropy loss as the error function L. Because we are only looking at the final output, we can mask all other outputs and only consider the loss from the final output. From here, we may use stochastic gradient descent to minimize the error.

There is one caveat: the vanishing gradient problem. As we know from NN backpropagation in class, the gradients dL/dU, dL/dW and dL/dV are derived from the chain rule, meaning they are products of multiple derivatives. These chain-rule derivatives have upper bounds of 1 (apparent from the tanh and sigmoid activation functions used). This means that gradient values can shrink exponentially fast and vanish after a few time steps, particularly when the neurons are saturated. Because gradients vanish within a limited number of time steps, the vanilla RNN model typically has issues learning long-range dependencies, i.e. the RNN will not learn much from inputs more than a certain number of time steps before the final output. From this, we know that the number of time steps in the input sequence for this RNN model cannot be too large. We may determine this hyper-parameter from cross validation. Note that this is a problem in deep NNs as well. Exploding gradients may also be a problem, but this can be circumvented effectively by clipping the gradients. For this project, we implemented the RNN model described above from scratch in Python and tested its performance on the stock market prediction problem.

B. Gated Recurrent Unit

We also implemented from scratch in Python a more sophisticated RNN variant: the Gated Recurrent Unit (GRU). GRUs are identical to the vanilla RNN described above (taking sequential inputs) except in the way the hidden states s_t are calculated. They were designed to alleviate the vanishing gradient problem through the use of gates (Figure 2), as illustrated by the GRU equations 8 to 12.

Fig. 2. Gated Recurrent Unit topology [8], [9].
z = sigmoid(u z x t + W z s t 1 ) (8) r = sigmoid(u r x t + W r s t 1 ) (9) h = tanh(u h x t + W h (s t 1 r)) (10) s t = (1 z) h + z s t 1 (11) o t = sigmoid(v s t ) (12) where denotes element-wise multiplication. GRU has two gates, specifically a reset gate r and an update gate z. The reset gate r determines how to combine the new input x t with the previous hidden state s t 1, while the update gate z determines how much of the previous hidden state s t 1 to retain in the current hidden state s t. We obtain the vanilla RNN by setting r to all 1 s and z to all 0 s [8]. The GRU is a relatively new model published in recent years. They have fewer parameters than Long Short Term Memory (another RNN variant), rendering them faster to train and requiring less data to generalize. We tested our implementation of GRU on the stock market prediction problem as well. IV. METHODOLOGY A. Baseline and Other Models Since we have framed stock market prediction as a binary classification problem, Logistic Regression (LR) is a natural choice as a baseline model. Beyond LR, we also tested several other more sophisticated models (some of which were not covered in lectures) to gain exposure to common machine learning algorithms. They are Support Vector Machines RBF (SVM RBF), K-Nearest Neighbors (KNN) and AdaBoost (implemented in Scikit-Learn). B. Experiment Design The range of data (price history and sentiment scores) collected span 11 years from January 1, 2004 to December 31, In this project, we would like to predict whether tomorrow s price will be higher (1) or lower (0) than today s price. Thus, each day may be viewed as an observation from which a training example or testing example may be constructed. We created feature vectors based on the features described in Section II: each vector is essentially a concatenation of price technical indicators and public sentiment scores. The target variable is binary and is simply the sign of tomorrow s 1-day returns. We show an example feature vector x (i) and target variable y (i) pair for some arbitrary i-th day below:

IV. METHODOLOGY

A. Baseline and Other Models

Since we have framed stock market prediction as a binary classification problem, Logistic Regression (LR) is a natural choice as a baseline model. Beyond LR, we also tested several other more sophisticated models (some of which were not covered in lectures) to gain exposure to common machine learning algorithms: Support Vector Machines with an RBF kernel (SVM RBF), K-Nearest Neighbors (KNN) and AdaBoost (implemented in Scikit-Learn).

B. Experiment Design

The range of data (price history and sentiment scores) collected spans 11 years from January 1, 2004 to December 31, 2014. In this project, we would like to predict whether tomorrow's price will be higher (1) or lower (0) than today's price. Thus, each day may be viewed as an observation from which a training example or testing example may be constructed. We created feature vectors based on the features described in Section II: each vector is essentially a concatenation of price technical indicators and public sentiment scores. The target variable is binary and is simply the sign of tomorrow's 1-day return. We show an example feature vector x^{(i)} and target variable y^{(i)} pair for some arbitrary i-th day below:

x^{(i)} = [ R_{i,1}, R_{i,2}, ..., R_{i,n},
            MA_{i,2}, ..., MA_{i,n},
            R_{(i-1),1}, R_{(i-2),1}, ..., R_{(i-n),1},
            GT_{i,econ}, GT_{i,debt}, GT_{i,inflat}, GT_{i,risk}, GT_{i,stocks},
            Score_i ]

y^{(i)} = [ Sign(R_{(i+1),1}) ]

where the notation remains the same as introduced in Section II, and GT_{i,YYY} refers to the Google Trends query volume for the word YYY. It is important that the feature vector x^{(i)} does not contain any future information and only uses information available up to that point. n determines the amount of information about past prices and price trends incorporated into the feature vector; the dimension of the feature vector changes with n. Note that because we are predicting tomorrow's price change, we lose one day: no prediction can be made for the last day in the data set, December 31, 2014, because we do not know the true closing price on the following day. Also, depending on the n chosen, we have to drop the first n days' observations: to calculate the n-day returns, the n-day returns moving average and the n-time lagged 1-day returns, we need the previous n days' prices, which are unknown prior to the first day in the data set, January 1, 2004. We select n from cross validation.

Since this is a binary classification task, we may use the binary cross-entropy error function as the objective to minimize for LR and the RNN (Equation 13):

f(θ) = - Σ_{i=1}^{N} [ y^{(i)} log(q(x^{(i)})) + (1 - y^{(i)}) log(1 - q(x^{(i)})) ]    (13)

where N is the number of training examples.

TABLE I
TRAIN AND TEST SET SPLIT
Data Set:  January 2004 to December 2014
Train Set: January 2004 to December 2012
Test Set:  January 2013 to December 2014

Before we began training, we split the data set of observations into train and test sets, roughly 80% and 20% respectively (Table I). We train our models (RNN, GRU, LR, SVM, KNN and AdaBoost) on the train set and subsequently evaluate their performance on the untouched test set.

C. RNN Training

For conventional classifiers like LR, the training method is straightforward: for each prediction, we use x^{(i)} as the input and y^{(i)} as the target, and minimize the error function either stochastically (stochastic gradient descent) or collectively (batch gradient descent). This is not the case for RNNs. Recall that one of the properties of RNNs is that they can process sequential data. This means that we are not restricted to using one feature vector for each prediction; we may input feature vectors from some previous t days into the RNN sequentially and take the final output as the prediction (minimizing the cross-entropy error of the final-step prediction). Using t = 3 as a concrete example, the unrolled computation is:

s^{(0)} --[x^{(i-2)}]--> s^{(1)} --[x^{(i-1)}]--> s^{(2)} --[x^{(i)}]--> y^{(i)}

where s^{(t)} are the hidden state vectors at time step t of the RNN, and s^{(0)} is initialized with all zeros. For training the RNN, we used inputs that are sequences of feature vectors [x^{(i-t+1)}, ..., x^{(i-1)}, x^{(i)}]. We feed them into the RNN sequentially, from x^{(i-t+1)} to x^{(i)}, and the final output gives a probability for the target variable y^{(i)} since we use the sigmoid function (Equation 7). Again, similar to the previous section, depending on t we have to drop the first few days of training examples. This allows the RNN to capture some extent of temporal information that LR does not (e.g. a finer-grained view of how returns are changing from day to day). The larger t is, the more temporal information we feed into the RNN. However, as mentioned in Section III, t is intrinsically limited by the vanishing gradient problem.
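To illustrate how the sequential inputs could be assembled, the sketch below turns a feature matrix (one row per day, built as in Section II) into windows of t consecutive days paired with the final day's up/down label. The array names are our own assumptions; the clipping of the first t-1 days mirrors the description above.

import numpy as np

def make_sequences(X, y, t):
    """Build RNN/GRU training pairs from a (num_days, num_features) matrix X and labels y.

    Each example stacks the feature vectors [x^{(i-t+1)}, ..., x^{(i)}] and is paired
    with y^{(i)}, the sign of the (i+1)-th day's return. The first t-1 days have no
    full window and are dropped.
    """
    seqs, targets = [], []
    for i in range(t - 1, len(X)):
        seqs.append(X[i - t + 1 : i + 1])
        targets.append(y[i])
    return np.array(seqs), np.array(targets)

# Example with t = 3, as in the illustration above:
# X_seq, y_seq = make_sequences(X_train, y_train, t=3)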
t, together with the dimensions of the hidden state vectors s^{(t)}, are the hyper-parameters we can tune using cross validation. The above training method also applies to GRUs (a variant of the RNN). However, we may expect better results from GRUs as they should, in theory, suffer less from the vanishing gradient problem.

D. Cross Validation for Time Series

Cross validation is an important step in model selection and parameter tuning. It provides a measure of the generalization error of the trained classifier. To a certain extent, this technique allows us to avoid over-fitting the training data (and perhaps under-fitting), and consequently to do better on the test data. For independent data, we can typically use K-Folds cross validation, where the training data is randomly split into K ideally equal-sized folds. Each fold may then be used as a validation set while the remaining (K-1) folds become the new training set. We cycle through the K folds so that each fold is left out of training and used for validation once. By taking the average error over these K validation folds, we get an estimate of the generalization error (i.e. how well the classifier will likely perform on unseen test sets).

However, for this project, the data involved is a financial time series and the observations are not independent! Correlation between adjacent observations is often prevalent in time series data; the data has some intrinsic order. The K-Folds cross validation method described earlier breaks down because (assuming we randomly split the training data into K folds) the validation and training samples are no longer independent. Furthermore, the train set should not contain any information that occurs after the validation set; by splitting the data randomly, we cannot be sure of that.

A more principled approach for time series cross validation is forward chaining [7]. Using 5 years of training time series data from 2004 to 2008 as an example, we may split it into 4 folds and perform cross validation as in Table II. This is a more accurate reflection of the situation during testing, where we train on past data and predict future price changes. We adopted this approach for cross validation in this project.

TABLE II
CROSS VALIDATION FOR TIME SERIES
Fold    Train Set                   Validation Set
1       2004                        2005
2       2004, 2005                  2006
3       2004, 2005, 2006            2007
4       2004, 2005, 2006, 2007      2008
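A small sketch of the forward-chaining splits in Table II is shown below, assuming each observation carries a year label; scikit-learn's TimeSeriesSplit implements a similar expanding-window scheme, although the splits used in this project were year-based.

def forward_chaining_splits(years):
    """Yield (train_years, validation_year) pairs as in Table II.

    years : ordered list of years in the training data, e.g. [2004, 2005, 2006, 2007, 2008]
    """
    for k in range(1, len(years)):
        yield years[:k], years[k]

# Example with the 2004-2008 training period:
# for train_years, val_year in forward_chaining_splits([2004, 2005, 2006, 2007, 2008]):
#     ...train on observations from train_years, validate on val_year...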

In Table III, we summarize the hyper-parameters for each model we tested and the respective ranges over which we performed a grid search.

TABLE III
GRID SEARCH HYPER-PARAMETERS
Model       Hyper-Parameters                   Sweep Range
All         n (refer to Section II)            3, 4, 5, 6, 7, 8, 9
All         GT, Score                          with and without
LR          Regularization C                   10e-2, 10e-1, 10e-0, 10e1, 10e2
SVM RBF     Bandwidth γ                        10e-2, 10e-1, 10e-0, 10e1, 10e2
SVM RBF     C                                  10e-2, 10e-1, 10e-0, 10e1, 10e2
KNN         No. of neighbors                   5, 10, 25, 50, 75, 100
AdaBoost    No. of estimators                  5, 10, 25, 50, 75, 100
AdaBoost    Learning rate                      0.01, 0.05, 0.1, 0.5, 1
RNN         Time steps t                       2, 4, 6
RNN         Hidden state s^{(t)} dimensions    10, 30, 50
GRU         Time steps t                       2, 4, 6
GRU         Hidden state s^{(t)} dimensions    10, 30, 50

V. RESULTS AND DISCUSSION

A. Grid Search Cross Validation Results

We performed extensive grid searches for each model to choose the best hyper-parameters based on the resulting cross validation accuracy. Selected results are presented as heat maps in Figures 3, 4, 5, 6, 7 and 8.

Fig. 3. Grid search heat map for Logistic Regression. The optimal parameters from cross validation are n = 8 and regularization C = 0.1, without sentiment scores.

Fig. 4. Grid search heat map for K-Nearest Neighbors. The optimal parameters from cross validation are n = 8 and no. of neighbors = 5, without sentiment scores.

Fig. 5. Grid search heat map for AdaBoost. We swept n as mentioned in Table III; for easy visualization we only present the heat map for the best n here. The optimal parameters from cross validation are n = 3, no. of estimators = 5 and learning rate = 1, without sentiment scores.

Using the best hyper-parameter combination for each model, we trained fresh models (LR, KNN, AdaBoost, SVM RBF, RNN and GRU) on the entire train set (from January 2004 to December 2012) and tested them on the unseen test set (from January 2013 to December 2014). The results are summarized in Table IV.

B. Discussion

From our grid search experiments, we realized that including Google query volumes and sentiment scores did not necessarily lead to improved performance. In fact, for some models (like KNN and LR), including these sentiment scores caused a significant drop in test accuracy. The reason becomes apparent when we overlay the Google query volumes and sentiment scores with the N225 price index. From Figures 9 and 10, we can see that both scores do not seem to be consistently correlated with the N225 price.

TABLE IV. Best cross validation accuracy and test accuracy for each model: LR (baseline), KNN, AdaBoost, SVM RBF, RNN and GRU.

Fig. 6. Grid search heat map for Support Vector Machine RBF. We swept n as mentioned in Table III; for easy visualization we only present the heat map for the best n here. The optimal parameters from cross validation are n = 8, bandwidth γ = 0.1 and C = 1000, without sentiment scores.

Fig. 7. Grid search heat map for the Recurrent Neural Network. We swept n as mentioned in Table III; for easy visualization we only present the heat map for the best n here. The optimal parameters from cross validation are n = 5, hidden state s^{(t)} dimensions = 30 and time steps t = 4, without sentiment scores.

Fig. 8. Grid search heat map for the Gated Recurrent Unit. We swept n as mentioned in Table III; for easy visualization we only present the heat map for the best n here. The optimal parameters from cross validation are n = 5, hidden state s^{(t)} dimensions = 50 and time steps t = 4, without sentiment scores.

Fig. 9. Plot of Bloomberg Businessweek sentiment scores and the N225 price index over time, starting from 2007.

Fig. 10. Plot of the Google Trends query volume for the word debt and the N225 price index over time, starting from 2010.

They do not seem to be predictive of N225 price changes (and while the figures are plotted at the monthly level, the same holds true when we zoom in to the daily level). This likely explains why the sentiment score features do not improve the classifiers' performance: they do not provide useful additional information. It seems that our simple sentiment analysis (scoring by counting positive and negative words from pre-specified lists) is too coarse to extract useful information. Perhaps using more sophisticated sentiment analysis methods that go beyond the word level (such as OpinionFinder in [2], which looks at sentence-level subjectivity) would yield more informative scores. In addition, it may be useful to crawl articles from multiple news archives, rather than just Bloomberg Businessweek, to gain a more diverse corpus that may be more representative of the state of world affairs. Unlike the results reported in [3], Google search volume trends did not improve our results.

This could simply be due to the fact that we are analyzing the N225 in this project, and not the Dow Jones Industrial Average as in the original paper. In hindsight, perhaps using volume trends for search terms in the Japanese language would have been more appropriate, since English is not Japan's first language (but then again, with globalization, the N225 is tradable from almost anywhere in the world). Further, [3] used a greater set of search terms; we restricted ourselves to 5 finance-related terms to keep data collection and computation time reasonable.

Out of all the models tested, LR gave one of the poorest accuracies, only slightly better than random guessing (0.5). However, such a result is consistent with our understanding that LR is ultimately a linear classification model (we did not kernelize LR for this project). It is natural that stock market prediction, a non-linear problem, cannot be well modeled by a linear model. Nevertheless, it serves as a baseline benchmark against which to evaluate the other, more sophisticated algorithms.

Both the RNN and the GRU performed better than LR. Because these are non-linear models, it is natural that they can give better accuracy than LR. One observation is that the GRU (0.558) performs slightly better than the vanilla RNN (0.531), suggesting that the GRU gating architecture may indeed have helped to alleviate the vanishing gradient problem, allowing it to learn better. We also note that both the RNN and the GRU required significantly longer times to train than the other models. This posed an issue particularly for time series cross validation. As a result, we only managed to sweep 3 values each for the time steps t and the hidden state s^{(t)} dimensions (on top of n); sweeping these parameters took over a day for each of the two models.

Finally, we see that the GRU has comparable or slightly lower performance than the SVM RBF (0.565). In general, our SVM RBF accuracy is consistent with that reported in the literature and in other implementations online ([10], [11], [12] and [13]). However, we feel that the GRU has the potential to outperform the SVM RBF classifier. Firstly, as mentioned earlier, we only swept 3 values for each GRU parameter; given more time and resources, we could sweep the parameters at finer resolutions and over a larger range, which would likely give better performance. In addition, we used simple stochastic gradient descent in the GRU implementation; more sophisticated optimization methods (such as RMSprop) could potentially lead to improved accuracy. Lastly, we are currently looking at daily data, which gives us around 2000 training examples. This data set size may be insufficient to learn the reset and update gates' weights effectively. Perhaps if we looked at minute-scale data (which would vastly increase the number of training examples), the GRU would perform much better than the SVM RBF.

Lastly, we did not have sufficient time to thoroughly analyze the results for KNN and AdaBoost. As mentioned in Section IV, we tested these models mostly to gain exposure to a wider range of common machine learning algorithms.

VI. CONCLUSION

In this project we collected price history from Yahoo! Finance, crawled articles from Bloomberg Businessweek and obtained Google query volumes from Google Trends for the period 2004 to 2014. Using the data, we generated price technical indicators and sentiment scores to be used as features for predicting the direction of the future (tomorrow's) price change.
We implemented a vanilla RNN and a GRU from scratch in Python and tested them against LR as a baseline. Through grid searches and time series cross validation, we chose the optimal (according to cross validation error) hyper-parameters for each model. From our experiments, sentiment scores and Google query volumes did not improve the classifiers' performance. This is likely because our simple sentiment analysis does not extract useful information from the news articles. Consistent with our expectations, LR performed the poorest, below the SVM RBF, RNN and GRU. It is logical that a linear model cannot adequately describe a complex non-linear problem such as stock prices. The GRU performed slightly better than the vanilla RNN, indicating that the gating mechanism was effective to some extent in relieving the vanishing gradient issue. Finally, we observed that the GRU has comparable performance with the SVM RBF. However, we feel that the GRU has the potential to outperform the SVM RBF given more time and resources.

Moving forward, we may perform more advanced sentiment analysis, both by using more sophisticated sentence-level methods (such as OpinionFinder) and by crawling news articles from a wider range of websites (such as the Wall Street Journal) for a more diverse corpus. This should serve as a better proxy for public sentiment. We could also explore more specialized Google search terms that are predictive of the N225, perhaps in the Japanese language. For the RNN and GRU, we can certainly improve their performance by sweeping a wider range of parameters at finer resolutions and by using more advanced optimization methods like RMSprop. We also feel that their accuracies should improve given more data (working at the hourly or minute scale instead of the daily scale). Currently, we train the RNN and GRU on a fixed train set and test them on the test set. An alternative is to use a moving train set where we retrain the model every year based on the latest D years of prices, i.e. first train on 2004 and 2005 and test on 2006; then train a fresh model on 2005 and 2006 and test on 2007; and so on. This would allow us to capture short-term trends more effectively. Finally, we used simple sample and hold to deal with missing data in this project; there are certainly more robust methods of dealing with such cases that we did not have the time to explore here.

REFERENCES

[1] Ruiz, Eduardo J. et al. "Correlating Financial Time Series with Micro-Blogging Activity." Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM '12), 2012.
[2] Bollen, Johan, Huina Mao, and Xiaojun Zeng. "Twitter Mood Predicts the Stock Market." Journal of Computational Science 2.1 (2011): 1-8.
[3] Preis, Tobias, Helen Susannah Moat, and H. Eugene Stanley. "Quantifying Trading Behavior in Financial Markets Using Google Trends." Scientific Reports 3 (2013).
[4] McDonald, Bill. "Bill McDonald's Word Lists Page." Nd.edu. Web.

[5] Hu, Minqing, and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), 2004.
[6] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep Learning." Nature (2015).
[7] Arlot, Sylvain, and Alain Celisse. "A Survey of Cross-Validation Procedures for Model Selection." Statistics Surveys 4 (2010).
[8] Britz, Denny. "Recurrent Neural Network." WildML. Web.
[9] Chung, Junyoung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling." NIPS Deep Learning Workshop, 2014.
[10] Fu, Tong, Shou Chen, and Chuanqi Wei. "Hong Kong Stock Index Forecasting." Web.
[11] Dai, Yuqing, and Yuning Zhang. "Machine Learning in Stock Price Trend Forecasting." Web.
[12] Halls-Moore, Michael. "Forecasting Financial Time Series." QuantStart. Web.
[13] Pochetti, Francesco. "Stock Market Prediction: Scikit Classification Algorithms." Web.


More information

Deep Learning in Asset Pricing

Deep Learning in Asset Pricing Deep Learning in Asset Pricing Luyang Chen 1 Markus Pelger 1 Jason Zhu 1 1 Stanford University November 17th 2018 Western Mathematical Finance Conference 2018 Motivation Hype: Machine Learning in Investment

More information

Abstract Making good predictions for stock prices is an important task for the financial industry. The way these predictions are carried out is often

Abstract Making good predictions for stock prices is an important task for the financial industry. The way these predictions are carried out is often Abstract Making good predictions for stock prices is an important task for the financial industry. The way these predictions are carried out is often by using artificial intelligence that can learn from

More information

Forecasting Stock Market Movements using Google Trend Searches

Forecasting Stock Market Movements using Google Trend Searches Forecasting Stock Market Movements using Google Trend Searches Melody Y. Huang, Randall R. Rojas, Patrick D. Convery Department of Economics University of California, Los Angeles Los Angeles, CA 90095

More information

Algorithmic Trading using Reinforcement Learning augmented with Hidden Markov Model

Algorithmic Trading using Reinforcement Learning augmented with Hidden Markov Model Algorithmic Trading using Reinforcement Learning augmented with Hidden Markov Model Simerjot Kaur (sk3391) Stanford University Abstract This work presents a novel algorithmic trading system based on reinforcement

More information

OPENING RANGE BREAKOUT STOCK TRADING ALGORITHMIC MODEL

OPENING RANGE BREAKOUT STOCK TRADING ALGORITHMIC MODEL OPENING RANGE BREAKOUT STOCK TRADING ALGORITHMIC MODEL Mrs.S.Mahalakshmi 1 and Mr.Vignesh P 2 1 Assistant Professor, Department of ISE, BMSIT&M, Bengaluru, India 2 Student,Department of ISE, BMSIT&M, Bengaluru,

More information

Algorithmic Trading using Sentiment Analysis and Reinforcement Learning Simerjot Kaur (SUNetID: sk3391 and TeamID: 035)

Algorithmic Trading using Sentiment Analysis and Reinforcement Learning Simerjot Kaur (SUNetID: sk3391 and TeamID: 035) Algorithmic Trading using Sentiment Analysis and Reinforcement Learning Simerjot Kaur (SUNetID: sk3391 and TeamID: 035) Abstract This work presents a novel algorithmic trading system based on reinforcement

More information

LendingClub Loan Default and Profitability Prediction

LendingClub Loan Default and Profitability Prediction LendingClub Loan Default and Profitability Prediction Peiqian Li peiqian@stanford.edu Gao Han gh352@stanford.edu Abstract Credit risk is something all peer-to-peer (P2P) lending investors (and bond investors

More information

SURVEY OF MACHINE LEARNING TECHNIQUES FOR STOCK MARKET ANALYSIS

SURVEY OF MACHINE LEARNING TECHNIQUES FOR STOCK MARKET ANALYSIS International Journal of Computer Engineering and Applications, Volume XI, Special Issue, May 17, www.ijcea.com ISSN 2321-3469 SURVEY OF MACHINE LEARNING TECHNIQUES FOR STOCK MARKET ANALYSIS Sumeet Ghegade

More information

Machine Learning in Risk Forecasting and its Application in Low Volatility Strategies

Machine Learning in Risk Forecasting and its Application in Low Volatility Strategies NEW THINKING Machine Learning in Risk Forecasting and its Application in Strategies By Yuriy Bodjov Artificial intelligence and machine learning are two terms that have gained increased popularity within

More information

Performance analysis of Neural Network Algorithms on Stock Market Forecasting

Performance analysis of Neural Network Algorithms on Stock Market Forecasting www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 9 September, 2014 Page No. 8347-8351 Performance analysis of Neural Network Algorithms on Stock Market

More information

Investigating Algorithmic Stock Market Trading using Ensemble Machine Learning Methods

Investigating Algorithmic Stock Market Trading using Ensemble Machine Learning Methods Investigating Algorithmic Stock Market Trading using Ensemble Machine Learning Methods Khaled Sharif University of Jordan * kldsrf@gmail.com Mohammad Abu-Ghazaleh University of Jordan * mohd.ag@live.com

More information

Introducing GEMS a Novel Technique for Ensemble Creation

Introducing GEMS a Novel Technique for Ensemble Creation Introducing GEMS a Novel Technique for Ensemble Creation Ulf Johansson 1, Tuve Löfström 1, Rikard König 1, Lars Niklasson 2 1 School of Business and Informatics, University of Borås, Sweden 2 School of

More information

Application of Innovations Feedback Neural Networks in the Prediction of Ups and Downs Value of Stock Market *

Application of Innovations Feedback Neural Networks in the Prediction of Ups and Downs Value of Stock Market * Proceedings of the 6th World Congress on Intelligent Control and Automation, June - 3, 006, Dalian, China Application of Innovations Feedback Neural Networks in the Prediction of Ups and Downs Value of

More information

Predicting Stock Movements Using Market Correlation Networks

Predicting Stock Movements Using Market Correlation Networks Predicting Stock Movements Using Market Correlation Networks David Dindi, Alp Ozturk, and Keith Wyngarden {ddindi, aozturk, kwyngard}@stanford.edu 1 Introduction The goal for this project is to discern

More information

SELECTION BIAS REDUCTION IN CREDIT SCORING MODELS

SELECTION BIAS REDUCTION IN CREDIT SCORING MODELS SELECTION BIAS REDUCTION IN CREDIT SCORING MODELS Josef Ditrich Abstract Credit risk refers to the potential of the borrower to not be able to pay back to investors the amount of money that was loaned.

More information

Chapter IV. Forecasting Daily and Weekly Stock Returns

Chapter IV. Forecasting Daily and Weekly Stock Returns Forecasting Daily and Weekly Stock Returns An unsophisticated forecaster uses statistics as a drunken man uses lamp-posts -for support rather than for illumination.0 Introduction In the previous chapter,

More information

State Switching in US Equity Index Returns based on SETAR Model with Kalman Filter Tracking

State Switching in US Equity Index Returns based on SETAR Model with Kalman Filter Tracking State Switching in US Equity Index Returns based on SETAR Model with Kalman Filter Tracking Timothy Little, Xiao-Ping Zhang Dept. of Electrical and Computer Engineering Ryerson University 350 Victoria

More information

A Machine Learning Investigation of One-Month Momentum. Ben Gum

A Machine Learning Investigation of One-Month Momentum. Ben Gum A Machine Learning Investigation of One-Month Momentum Ben Gum Contents Problem Data Recent Literature Simple Improvements Neural Network Approach Conclusion Appendix : Some Background on Neural Networks

More information

Journal of Computational and Applied Mathematics. The mean-absolute deviation portfolio selection problem with interval-valued returns

Journal of Computational and Applied Mathematics. The mean-absolute deviation portfolio selection problem with interval-valued returns Journal of Computational and Applied Mathematics 235 (2011) 4149 4157 Contents lists available at ScienceDirect Journal of Computational and Applied Mathematics journal homepage: www.elsevier.com/locate/cam

More information

Forecasting Price Movements using Technical Indicators: Investigating the Impact of. Varying Input Window Length

Forecasting Price Movements using Technical Indicators: Investigating the Impact of. Varying Input Window Length Forecasting Price Movements using Technical Indicators: Investigating the Impact of Varying Input Window Length Yauheniya Shynkevich 1,*, T.M. McGinnity 1,2, Sonya Coleman 1, Ammar Belatreche 3, Yuhua

More information

Shynkevich, Y, McGinnity, M, Coleman, S, Belatreche, A and Li, Y

Shynkevich, Y, McGinnity, M, Coleman, S, Belatreche, A and Li, Y Forecasting price movements using technical indicators : investigating the impact of varying input window length Shynkevich, Y, McGinnity, M, Coleman, S, Belatreche, A and Li, Y http://dx.doi.org/10.1016/j.neucom.2016.11.095

More information

In physics and engineering education, Fermi problems

In physics and engineering education, Fermi problems A THOUGHT ON FERMI PROBLEMS FOR ACTUARIES By Runhuan Feng In physics and engineering education, Fermi problems are named after the physicist Enrico Fermi who was known for his ability to make good approximate

More information

Multistage risk-averse asset allocation with transaction costs

Multistage risk-averse asset allocation with transaction costs Multistage risk-averse asset allocation with transaction costs 1 Introduction Václav Kozmík 1 Abstract. This paper deals with asset allocation problems formulated as multistage stochastic programming models.

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 2, Mar Apr 2017

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 2, Mar Apr 2017 RESEARCH ARTICLE Stock Selection using Principal Component Analysis with Differential Evolution Dr. Balamurugan.A [1], Arul Selvi. S [2], Syedhussian.A [3], Nithin.A [4] [3] & [4] Professor [1], Assistant

More information

JACOBS LEVY CONCEPTS FOR PROFITABLE EQUITY INVESTING

JACOBS LEVY CONCEPTS FOR PROFITABLE EQUITY INVESTING JACOBS LEVY CONCEPTS FOR PROFITABLE EQUITY INVESTING Our investment philosophy is built upon over 30 years of groundbreaking equity research. Many of the concepts derived from that research have now become

More information

Alpha-Beta Soup: Mixing Anomalies for Maximum Effect. Matthew Creme, Raphael Lenain, Jacob Perricone, Ian Shaw, Andrew Slottje MIRAJ Alpha MS&E 448

Alpha-Beta Soup: Mixing Anomalies for Maximum Effect. Matthew Creme, Raphael Lenain, Jacob Perricone, Ian Shaw, Andrew Slottje MIRAJ Alpha MS&E 448 Alpha-Beta Soup: Mixing Anomalies for Maximum Effect Matthew Creme, Raphael Lenain, Jacob Perricone, Ian Shaw, Andrew Slottje MIRAJ Alpha MS&E 448 Recap: Overnight and intraday returns Closet-1 Opent Closet

More information

An Introduction to the Mathematics of Finance. Basu, Goodman, Stampfli

An Introduction to the Mathematics of Finance. Basu, Goodman, Stampfli An Introduction to the Mathematics of Finance Basu, Goodman, Stampfli 1998 Click here to see Chapter One. Chapter 2 Binomial Trees, Replicating Portfolios, and Arbitrage 2.1 Pricing an Option A Special

More information

Draft. emerging market returns, it would seem difficult to uncover any predictability.

Draft. emerging market returns, it would seem difficult to uncover any predictability. Forecasting Emerging Market Returns Using works CAMPBELL R. HARVEY, KIRSTEN E. TRAVERS, AND MICHAEL J. COSTA CAMPBELL R. HARVEY is the J. Paul Sticht professor of international business at Duke University,

More information