
Abstract Making good predictions for stock prices is an important task for the financial industry. These predictions are often carried out using artificial intelligence that can learn from data using machine learning algorithms. This study compares three different approaches in this area: a regression tree and two artificial neural networks with two different learning algorithms. The learning algorithms used were Levenberg-Marquardt and Bayesian regularization. The three approaches were evaluated using the average misprediction and the worst misprediction they made over the selected interval from two different indexes, OMXS30 and S&P-500. Of the three approaches, the artificial neural networks outperformed the regression tree, and the Bayesian regularization algorithm performed the best of the two learning algorithms. The conclusions support the usage of artificial neural networks but could not fully establish that the Bayesian regularization algorithm would perform best in the general case.

1 Introduction

Predicting stock prices has been popular among investors for a long time. As the short-term price of a stock has become more and more volatile over the years, predicting its price has become increasingly complex (Schwert, 1989). Even though investors have seen great improvements in their analytical tools, artificial intelligence is gaining the upper hand and will most likely continue to do so in the future. This suggests that, in the near future, the stock market will with high probability be taken over completely by artificial intelligence, with the human side of it replaced by robots (Turner, 2015). One way to predict stock market prices is to use a computer program that implements different types of machine learning algorithms in order to learn how to predict future stock prices from past prices. A common implementation of machine learning that the financial industry has invested heavily in is the artificial neural network (Trippi & Turban, 1992). There are, however, other approaches to making these predictions, and therefore further research is needed.

The most important aspect of stock prediction has always been which algorithms perform the best, based on how accurate the predictions they make are. Different measurements can be used when comparing the predictions, and decisions have to be made on what conclusions to draw from the data. Several areas of machine learning apply to making predictions on time series values. Comparing which area of machine learning applies best is therefore an important aspect of this research area, in order to conclude which area future research should focus on. However, because of the high value of this type of research, many of the conclusions made in this area may not be published, so that companies can maintain an advantage over their competitors.
1.1 Problem statement

The methods currently used in stock forecasting have shown promising results. Earlier reports have been able to predict stock prices when the data is limited to a few stocks or a short time period. In order to determine which area of machine learning future research should focus on, this study will compare three different approaches for predicting time series values. Two of the approaches are different learning algorithms for an artificial neural network, which has been the state-of-the-art method when it

comes to stock prediction. The last one is a regression tree, a time series prediction method derived from the machine learning area of decision trees. This study will compare how well these three methods are able to make a one-day-ahead prediction. The performance will be measured as their average daily misprediction together with their worst daily misprediction.

1.1.1 Research questions

The aim of this study is to answer and draw conclusions from the following research questions: Does an artificial neural network achieve a better result than a regression tree when predicting stock prices? How does the Levenberg-Marquardt learning algorithm compare to the Bayesian regularization learning algorithm when predicting stocks with an artificial neural network?

1.1.2 Scope of study

In order to keep the focus on this report's main purpose, limitations had to be made to keep the study within a targeted timeframe. These limitations include the data that was used and also the number of algorithms whose predictions were compared. The first limitation made for the prediction data was to only go back to the year 2000. This was because the predictions were supposed to work under a reasonably recent time frame, and going further back could alter the practical usage of the conclusions. The report was also limited to two stock market indexes and kept the predictions strictly one day ahead, limiting the long-term possibilities. Because there is a vast number of machine learning algorithms, the report focused on exactly three. These three algorithms were chosen to represent a good selection of the algorithms currently applicable to stock market prediction. Another factor limiting the number of algorithms was that having more than three would make it hard to keep the comparison valuable within a time frame reasonable relative to its scientific outcome.
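The two performance measures just described can be computed directly from a series of predictions and the corresponding actual prices. A minimal sketch in Python (the thesis itself used MATLAB; the function name and the percentage-based error here are illustrative assumptions, not taken from the thesis):

```python
def misprediction_stats(predicted, actual):
    """Average and worst daily misprediction, as a percentage of the
    actual price. Illustrative helper; the thesis does not spell out
    its exact error formula."""
    errors = [abs(p - a) / a * 100 for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors), max(errors)

# Hypothetical one-day-ahead predictions vs. real opening prices
avg_err, worst_err = misprediction_stats([101.0, 99.0, 103.0],
                                         [100.0, 100.0, 100.0])
print(avg_err, worst_err)  # average ~1.67 %, worst 3.0 %
```

The average captures overall accuracy, while the worst-day value captures the risk of a single badly mispredicted day.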

2 Background

This section of the report introduces the reader to the background knowledge needed to understand what the research comprises. It starts with the basics of how stocks and stock markets work. It then moves on to more in-depth knowledge of the machine learning algorithms that are used to produce the results and to draw conclusions from those results.

2.1 What is a stock?

The most fundamental idea behind a stock is that it represents a share of ownership in a company. Stocks are usually issued by a company when it needs to raise capital for investments, as an alternative to borrowing the money (Teweles, 1998). This makes it possible for investors to trade capital for a share of a company from which they hope to make a profit in the future. The price of a stock is what the investor is willing to risk in order to gain an expected profit, and it varies between investors depending on whether they have a short- or long-term perspective on the investment (Becket, 2004).

2.2 The stock market

A stock market is an organized market for buying and selling different types of securities, such as stocks and bonds. There are several stock markets in the world, and their main purposes are to help companies get easy access to capital and to act as a pricing mechanism that determines the value of a stock or a whole company. A shareowner can use a stock market to put their shares up for sale and reach a large group of investors willing to buy the shares at the lowest price. This leads to a price where a balance between supply and demand is reached, which is commonly referred to as the price of the stock (Encyclopedia Britannica, 2014). Because demand and supply constantly change, the price of a single stock usually changes many times a day, often many times per minute. The change in supply and demand for the stocks on the stock market is driven by several factors.
The short-term price of a stock changes rapidly, while the long-term price is determined more by analytical approaches to what the value of the stock really should be (Ro, 2015). Analytical approaches means considering expectations of the company's future earnings and its financial health when valuing the stock. Other important factors affecting the stock price are the general economic trend, the

share's industry performance and also different world events and their effect on the economies (A guide to NYSE marketplace, 2006).

2.3 Artificial intelligence & Machine learning

Artificial intelligence (AI) is the science of constructing intelligent machines, particularly intelligent computer programs. Intelligent machines have the ability to perform tasks that are otherwise associated with intelligent beings (McCarthy, 2007). Examples of such abilities are that the program can learn from past experience or reason and discover meaning in order to achieve its goal (Encyclopedia Britannica, 2014). It is not possible to answer yes or no to the question "Is this machine intelligent or not?". This is because intelligence involves many different mechanisms, and if a task only requires mechanisms that are well understood today, a computer program can easily perform the task. Such programs are called somewhat intelligent (McCarthy, 2007). To create intelligent machines, machine learning (ML) is used. ML is the study of giving a machine, especially a computer program, the ability to learn from past experience, which is a necessary ability for a program to be called intelligent. Tom M. Mitchell (1997) provides a more formal definition in his book Machine Learning: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." An example of this is a computer program that learns the task (T) of playing checkers, where the performance (P) is measured by the percentage of games won and experience (E) is gained by playing against itself. Each program can be classified into one of three main categories, depending on how the program learns a task. The three categories are supervised learning, unsupervised learning and reinforcement learning. In supervised learning the program is trained with known input and the wanted output.
The goal for the program is to learn a more general rule for how to map inputs to outputs. Unsupervised learning does not have known input and output; instead, the program should find patterns or similar structure in the given dataset by itself. Reinforcement learning lets the program interact with a dynamic environment in which it must reach a goal, without anyone telling the program whether it is close to the goal or not (Russell & Norvig, 1995). This thesis will focus on supervised learning because the expected output is known.

2.3.1 Artificial neural network

One way to make a computer program learn from past experience is to use an artificial neural network (ANN). ANNs try to borrow the structure of the biological nervous system, based on our current understanding of it. Instead of a program executing all instructions sequentially, ANNs have many simple computational elements that are connected. These are the artificial neurons, or simply neurons, and they operate in parallel to achieve high processing speed (Zhang, 2000). There are different types of ANNs, but the general idea of how they work is the same. All the neurons in the network are connected in layers. There are at least one input and one output layer of neurons, usually with one or more hidden layers between them. Each neuron can either be activated or not, depending on an activation function and a bias value. All the connections between the neurons have a weight, and the total weighted input of a neuron can be calculated by adding up the weighted values of all connections going into the neuron. This is used in the activation function to determine whether the neuron is activated or not. The network then learns by changing the weights of the connections and the bias values of the neurons (Encyclopedia Britannica, 2014). When an ANN learns from training data it is expected that the more neurons the network has, the better it can learn; this is however not always the case. Overfitting, sometimes called overtraining, occurs when the network adapts too closely to the training data. This causes the network to give great results if tested with exactly the same data it was trained with, but when given other data the network is not able to predict the correct result. Possible reasons for this could be that the network has too many neurons or that the training data is insufficient. It can therefore be hard to train an ANN to produce the best predictions (Nielsen, 2015).
An example of this can be seen in figure 1, where the ANN is overfitted, which causes it to make bad predictions on new data.
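Overfitting of the kind shown in figure 1 is easy to reproduce with a polynomial fit standing in for the network (an illustrative sketch, not code from the thesis): a model with too many free parameters can match noisy training points almost perfectly without that saying anything about how well it predicts held-out points.

```python
import numpy as np

def fit_errors(degree, seed=0):
    """Train/test mean squared error for a degree-`degree` polynomial
    fitted to 10 noisy samples of a sine curve (a stand-in for an ANN
    with too many neurons)."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, 10)
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.1, 10)
    x_test = np.linspace(0.05, 0.95, 10)          # held-out points
    y_test = np.sin(2 * np.pi * x_test)            # noise-free truth
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_err = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_err, test_err

# A degree-9 polynomial interpolates all 10 noisy training points
# (near-zero training error), memorising the noise; the training error
# is then a useless guide to performance on the held-out points.
print(fit_errors(3))
print(fit_errors(9))
```

This is the same trade-off the validation set in section 3.2 is meant to guard against.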

Figure 1: Example of overfitting

2.3.1.1 Backpropagation learning

Backpropagation is one method used to train a multi-layered ANN. In a backpropagation network, each neuron has multiple inputs and calculates one output. Each neuron contains a value called the bias value, θ, which is used in the activation function of the neuron (Nielsen, 2015). The activation function takes the sum of all weighted input values plus the bias value. With the current neuron represented by j and the neurons in the previous layer represented by i, the activation function for neuron j is:

net_j = Σ_i (o_i · w_ij) + θ_j    (1)

In other words, the activation of neuron j is the sum, over all neurons i in the previous layer, of the output of neuron i times the weight of the link between i and j, plus the bias value of neuron j. The output of the current neuron is then given by a transfer function applied to the activation:

o_j = f(net_j)    (2)

The important property the transfer function needs is that it must be possible to compute its first derivative. The sigmoid function is therefore commonly used as the transfer function.
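Equations (1) and (2) amount to a weighted sum passed through the transfer function. A minimal sketch in Python (illustrative only, not code from the thesis), using the sigmoid as the transfer function:

```python
import math

def neuron_output(inputs, weights, bias):
    """Equations (1)-(2): net_j = sum_i(o_i * w_ij) + theta_j,
    then o_j = f(net_j) with the sigmoid as transfer function f."""
    net = sum(o * w for o, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-net))  # sigmoid

# A neuron with two inputs; weights and bias are made-up values.
# With all weights and bias zero the output would be f(0) = 0.5.
print(neuron_output([0.5, -0.3], [0.8, 0.4], bias=0.1))
```

A full network evaluates this neuron by neuron, layer by layer, from the input layer towards the output layer.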

f(net_j) = 1 / (1 + e^(−net_j))    (3)

f′(net_j) = f(net_j) · (1 − f(net_j))    (4)

Each neuron in a backpropagation network has a value called δ used as an error estimate, so that the network can adjust the weights towards the expected result. The value δ is calculated differently depending on whether the neuron is in the output layer or in a hidden layer. For an output neuron, δ is calculated by the following formula:

δ_output = f′(net) · (t − o)    (5)

where f′(net) is the first derivative of the transfer function, t is the expected value and o is the actual output of that neuron. For a hidden neuron j, δ is calculated from the deltas of the succeeding layer, represented by k:

δ_j = f′(net_j) · Σ_k (δ_k · w_jk)    (6)

This value is then used to adjust the weights going backwards, towards the input layer. The weights are changed according to the following formula:

dw_ij = δ_j · o_i · L    (7)

where dw represents the change in weight and L is a learning parameter; typically the same value of L is used for all neurons in the network. The bias value is adjusted in the same way:

dθ_j = δ_j · f(θ_j) · L    (8)

These are all the basic formulas required for a backpropagation network. To summarise, the steps the network takes can be divided into three stages. First, it calculates an output from the given training data. Then, it compares the calculated output with the expected result and calculates the error for all neurons in the network, going backwards. Finally, the weights and bias values are updated depending on the error (Nielsen, 2015).

2.3.1.2 Levenberg-Marquardt learning algorithm

The Levenberg-Marquardt algorithm is an optimization of the standard backpropagation algorithm, based on an approximation of Newton's method. It first computes the output from the given input in

the same way as the standard backpropagation algorithm. It then calculates the error for each neuron and uses this to compute the Jacobian matrix. The Jacobian matrix is then used in the modified Newton method to get a new error value, which in turn is used to compute the sum of squared errors. Depending on whether this value is larger or smaller than the previously calculated error, the algorithm updates the modified variable in the Newton method and goes back to either the first or the third step. The algorithm is complete when the error reaches some pre-defined threshold value (Hagan & Menhaj, 1994).

2.3.1.3 Bayesian regularization learning algorithm

Bayesian regularization is another modification that can be used with the Levenberg-Marquardt algorithm. It minimizes a linear combination of squared errors and weights. The regularization adds another term to this linear combination to make the network produce a smoother response. This can also be used to prevent overfitting, but it requires more time than the plain Levenberg-Marquardt algorithm (Foresee & Hagan, 1997).

2.3.2 Decision tree

Another machine learning technique that can be used for prediction is the decision tree. Decision trees build on the basic ideas behind the hierarchical tree data structure, using divide and conquer to split the data into smaller subsets (Criminisi, Shotton & Konukoglu, 2012). A simple decision tree (a classification decision tree) is trained by asking various questions about data with known outcomes, so that the data gets split into different categories. The goal is to split the data in such a way that the outcome of the variable the user is trying to predict becomes as homogeneous as possible in each category, which leads to a higher confidence in the outcome (Criminisi, Shotton & Konukoglu, 2012).
When the data has been split enough times that each category gives a convincing certainty about its outcome, the final categories are called the leaves. These give the final outcome of the prediction (Magerman, 1995). After the decision tree has been trained with enough data, it is ready to be used on data with an unknown outcome, to see which leaf the data ends up in after all the questions have been answered. That leaf then gives a predicted outcome and, depending on which learning algorithm has been used, how certain the tree is of that outcome (Magerman, 1995).
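The train-then-predict flow just described can be illustrated with a small classification tree. This is a sketch only: the thesis used MATLAB, whereas this example assumes scikit-learn is available, and the features and labels are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up training rows with known outcomes:
# [previous day's price change, volume change] -> direction label
X_train = [[-1.0, 0.2], [-0.5, 0.1], [0.4, -0.1], [1.2, 0.3]]
y_train = ["down", "down", "up", "up"]

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X_train, y_train)  # questions are chosen to make leaves homogeneous

# A new row with unknown outcome ends up in a leaf; that leaf's label
# is the prediction.
print(tree.predict([[0.9, 0.0]]))
```

Here the tree only needs one question (roughly "is the price change positive?") to make its leaves homogeneous, so new rows are classified by that single split.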

Figure 2: Decision tree

2.3.2.1 Regression tree

When dealing with real-valued prediction, it is often better to use a regression tree model, as it has better ways of splitting the data in a non-classification manner. In a regression tree model, instead of just splitting the data in a categorical manner, the data is split at several points for each of the independent variables. This means that a regression tree is needed when the target variable is numerical or continuous (Criminisi, Shotton & Konukoglu, 2010). After each independent variable has been split at several points, it is possible to calculate the sum of squared errors for each group of nodes:

SSE = Σ_i (P − a_i)²,  where a = actual value and P = prediction    (9)

By minimizing the SSE, each group of data obtains a value which is later used for making the predictions. This process is repeated until every group of nodes has received a value. By splitting the data into a larger number of subsets, the predictions can become more precise. This is because each group can then be predicted with a simpler linear or constant prediction model instead of a more advanced polynomial model (Criminisi, Shotton & Konukoglu, 2012).
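The split selection implied by equation (9) can be sketched in a few lines of Python (illustrative code, not from the thesis): every candidate split point of one independent variable is scored by the combined SSE of the two resulting groups, with each group predicted by its mean.

```python
def sse(group):
    """Equation (9) for one group, with the group's prediction P
    taken as its mean value."""
    if not group:
        return 0.0
    p = sum(group) / len(group)
    return sum((p - a) ** 2 for a in group)

def best_split(values, targets):
    """Try every split point of one independent variable and keep the
    one minimising the combined SSE of the two resulting groups."""
    pairs = sorted(zip(values, targets))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[0]:
            best = (total, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best  # (minimum SSE, split threshold)

# The targets jump between x=2 and x=3, so the best split lands at 2.5
print(best_split([1, 2, 3, 4], [10.0, 10.0, 20.0, 20.0]))
```

A full regression tree applies this search recursively to each resulting group until the stopping criterion is met.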

2.4 Related work

Throughout the years, numerous studies have been made concerning stock market predictions. From these published reports, the focus here is on three studies with similar research topics but different approaches. Their prediction methods are related to the models used in this thesis, and the conclusions they draw are therefore highly relevant and should be considered when discussing the results acquired in this report.

The first of these reports was written by Hill, Marquez, O'Connor & Remus (1994) and relates to different types of regression models that can be used for all kinds of forecasting and decision making. Their main focus was to compare and draw conclusions from the results other reports had acquired when making predictions on different kinds of regression and time series problems. The goal was to find out whether artificial neural networks could perform well enough on time series prediction compared to the more classical statistical models. Their conclusions were supposed to provide input to the ongoing discussion of whether ANNs perform well enough to be a basis for decision making in time series forecasting. For these reasons, the models they compared were almost exclusively different versions of ANNs and various statistical models. Their study found that ANNs are able to produce forecasts fully comparable to, and often better than, those of the classical statistical models they were compared to. They did however discuss that these results may have been altered by the conditions under which the comparison was made, and that further research was needed in order to be decisive about their findings. For the purposes of this thesis, their report gives support to the claim that ANNs should perform well on stock predictions, though they did not include any regression trees in their comparative study.
Another report related to time series prediction was written by Kohzadi, Boyd, Kermanshahi & Kaastra (1996). They made a comparative study of how ANNs perform on time series predictions as opposed to autoregressive integrated moving average (ARIMA) models. The comparison was an empirical study in which they implemented both an ARIMA model and an ANN using a feedforward learning algorithm. These models were run on the wheat market over seven different three-year time periods in order to make the conditions fair. For their comparison they used the mean squared error, the absolute error and the mean absolute error. What they found in their research was similar to the previous report: their implementation of an ANN outperformed the ARIMA model by a clear margin. As with the previous report, these findings give support to the good performance of ANNs on time series predictions, though high-frequency trading on the wheat market may not be the same as on the stock market.

Lastly, an empirical study related to stock prediction was written by Tsung-Sheng Chang (2011), who did a comparative study of how ANNs perform on stock prediction compared to a decision tree and a hybrid model. Both the ANN and the decision tree implementations were similar to those in this thesis: the ANN had a standard backpropagation learning algorithm and the decision tree used a regression model for binary outcomes. The data used for these predictions came from a one-year period of closing prices for 10 different digital game content stocks. The report found that ANNs outperformed the two other methods on average accuracy, mainly as a result of the lower volatility they achieved. The report concludes that these results have great limitations, though, due to the small number of stocks predicted and the fact that the 32-day verification period was most likely too short.

3 Method

In this chapter the reader is introduced to the methods used in this report and how the results were obtained. Explanations are provided of what data was used and how the values were processed by each prediction approach. The chapter also describes how the approaches were compared and what variables were used for the comparison.

3.1 Data used

The data used for the report was daily stock prices gathered from the OMX Stockholm stock market and the New York Stock Exchange. The indexes used for the prediction were OMXS30, which consists of the 30 highest-valued companies on the Stockholm stock exchange, and S&P-500, which includes the 500 largest companies on the NYSE. The reason for choosing these indexes was to have one prediction for a large market, the NYSE, and one prediction for a smaller market, OMXS30. This gives two different views of how the algorithms performed under different conditions.

Figure 3: OMXS30 graph

Figure 4: S&P-500 graph

The prices used date from the start of 2016 back to the year-end between 1999 and 2000. The reason for not going further back in the stock history was to keep the price volatility representative of today's conditions. To make the data more manageable for prediction, the index price used for each day was the opening price on that day; this reduces the risk of extreme values compared to taking the day's maximum or minimum price. The boundary between the data used as training sets and the data used in the testing phase was set at the year-end between 2015 and 2016, so all data from 2016 was used as test data. For each selection of days, the day after those days was chosen as the outcome result. The outcome value was then moved into the input values and the first day of the earlier selection of input values was removed, sliding the window forward one day at a time.

3.2 ANN prediction

The prediction using a backpropagation ANN was implemented using the standard MATLAB library. The two ANNs work in the same way except for how they are trained, so the main difference between their implementations was which built-in MATLAB function was used for the training phase. The data used for training the network was divided as follows: 70% was used as training data, 15% as validation data and 15% as testing data. The network used the training data to adjust the node weights, the validation set to avoid overfitting, and finally the testing set to confirm that the final adjustments were good.
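The sliding-window construction described above can be sketched as follows. This is a minimal Python illustration (the thesis used MATLAB); the function name and the window size of 4 days, matching the 4 input nodes described below, are chosen here for illustration.

```python
def make_windows(prices, window=4):
    """Build sliding-window samples from a series of daily opening prices.

    Each input row is `window` consecutive opening prices; the target is
    the following day's opening price. Advancing i by one slides the
    window forward one day, dropping the oldest price and adding the
    previous target to the inputs, as described in the text.
    """
    inputs, targets = [], []
    for i in range(len(prices) - window):
        inputs.append(prices[i:i + window])   # days i .. i+window-1
        targets.append(prices[i + window])    # the day after the window
    return inputs, targets

# Toy series of opening prices (illustrative values only)
prices = [100.0, 101.5, 99.8, 102.3, 103.1, 102.7]
X, y = make_windows(prices)
# X[0] is the first 4 days; y[0] is the fifth day's opening price
```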

Both ANNs were implemented as nonlinear autoregressive neural networks (narnet) with 4 input nodes, 10 nodes in a single hidden layer and 1 output node. The number of hidden nodes was chosen because it gave the best performance during the validation phase. After the network had been trained on the input data, it was run on the target data to retrieve the predictions. Since this short-term prediction only compared one-day-ahead results for single days, one day's prediction did not affect the next day's. To achieve this, the narnet predicted using an open-loop response as opposed to a closed-loop response, so that predicting from one input did not affect the trained network's node weights. Finally, the predicted values were compared to the real values in order to evaluate the network's performance.

3.3 Tree prediction

For the tree prediction, a binary regression decision tree from the MATLAB Statistics and Machine Learning Toolbox was used. The input data was converted from a list of all opening prices into a matrix with 4 columns, one for each of 4 consecutive days, with the fifth day as the expected output for each row. This was then used as the input to generate the tree. To get a prediction from the tree, it must be given the opening prices of 4 consecutive days from the target data, from which it predicts the opening price of the next day. The target data was therefore split up in the same way as the input data, and for each row in the matrix the tree gave a prediction which was saved for later comparison. Because the result of one prediction is not used in the next, the tree does not accumulate errors from day to day, but this also restricts the tree to predicting only one day ahead.
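The one-day-ahead prediction loop shared by both approaches can be sketched as below. This is an illustrative Python sketch, not the thesis's MATLAB code: `model` is a placeholder for the trained narnet or regression tree, and the naive last-price baseline is an assumption used only to make the example runnable. The point it shows is that every prediction is made from real observed prices, so errors cannot accumulate from day to day.

```python
def predict_one_day_ahead(model, target_prices, window=4):
    """Predict each day from the 4 real preceding opening prices.

    Predictions are never fed back into the history, mirroring the
    open-loop setup described above: each day's prediction is
    independent of the previous day's prediction.
    """
    preds = []
    for i in range(len(target_prices) - window):
        history = target_prices[i:i + window]  # 4 real observed prices
        preds.append(model(history))           # predict the next day
    return preds

# Hypothetical stand-in model: predict tomorrow's open as today's open
naive_model = lambda history: history[-1]
preds = predict_one_day_ahead(naive_model, [10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
```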
3.4 Methods for comparison

To compare the different approaches, a misprediction value was calculated for each day, where the predicted value was expressed as a percentage deviation from the real value. The misprediction made for a single day i was calculated as:

M = abs(1 - p(i)/r(i)), where p = predicted value, r = real value (10)

Miss(%) = M * 100 (11)
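Equations (10) and (11) amount to the absolute percentage error of a single day's prediction. A minimal Python sketch (function name is illustrative):

```python
def misprediction_pct(predicted, real):
    """Daily misprediction per equations (10)-(11):
    M = |1 - p/r|, Miss(%) = M * 100."""
    return abs(1 - predicted / real) * 100

# Example: predicting an opening price of 102 when the real value was 100
miss = misprediction_pct(102.0, 100.0)  # a 2% misprediction
```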