FINANCIAL DATA SIMULATOR COMP4801


An interim report submitted in part fulfilment of the degree (BengSci) Computing and Data Analytics under the supervision of Dr. Yip Chi Lap Beta.

Shadman Mahmood
April 17, 2016

ACKNOWLEDGMENTS

I would like to sincerely thank Dr. Yip Chi Lap Beta for selecting me for this project and for answering all my queries whenever needed. I would also like to thank my parents for always being there for me, and my brother, who has always shown me unconditional support. I thank my fiancé, who always supported me and believed in me, even when I did not believe in myself. Finally, I would like to thank my friends, who were always there for me during times of stress. Without them the whole journey would not have been fruitful.

Abbreviations

SMA: Simple Moving Average
RSI: Relative Strength Index
EMA: Exponential Moving Average

Table of Contents

Introduction
Genetic Algorithms
Genetic Algorithms in the Stock Market
The Genetic Algorithm Developed
The Results achieved using Genetic Algorithms
The Random Forest Algorithm
Random Forest in relation to Stock Market
The RandomForestLearner Algorithm
Results achieved using Random Forest Learner
KNN Learner Algorithm
The KNNLearning algorithm developed
An analysis on how stocks can be chosen
Backtesting the Genetic Algorithm (the updated version) using a market simulator
Conclusion

Abstract

Stock market forecasting is an important research area in financial analysis. The high level of uncertainty makes the market difficult to predict. Many machine learning algorithms have been developed for this purpose. However, much of this literature is difficult to follow: it relies on heavy technical terminology and long datasets, and often reports results of low accuracy. This paper presents the process of building a stock price prediction model using three machine learning algorithms, namely the Genetic Algorithm, the Random Forest algorithm and the KNN learning algorithm, with data obtained from the NYSE. A market simulator is also developed in order to backtest a strategy and to see how the predictions of the algorithms would perform in the market. The results obtained reveal that all three algorithms, especially the Genetic Algorithm, are close to accurate in predicting stock market prices in the short term.

INTRODUCTION

A vast amount of capital is traded through stock markets all around the world. A country's national economy is strongly reliant on the performance of its stock market. These markets serve as an investment tool not only for strategic investors but for ordinary people as well, and as a result they influence everyday life directly. All stock markets share one characteristic: uncertainty. This affects both short and long term investments, as investors have to choose between higher profits with higher risks and smaller gains with lower risks. Undesirable as this characteristic may seem, it is unavoidable when the stock market is used as an investment tool. Hence the most desirable thing would be to reduce this uncertainty. The financial data simulator is one of the instruments in this process.

Fundamental analysis is concerned with the metrics of a company's assets, its balance sheet and its liabilities. These are compared with a benchmark, the market average (for example the S&P 500), and can signal to the investor whether to buy or sell the share depending on whether it is performing above or below the market. This approach can be taken for long term investments. However, it is not an appropriate choice for prediction, since it is difficult to build a model based on a company's current situation and difficult to eliminate irrelevant data.

Technical analysis in practice relies on analyzing share price charts and searching for a given chart pattern of price trends. This kind of search is called formation [16]. The analysis is supported by technical analysis coefficients [17]. Many decision support systems exist today to analyze these technical features. Various methods of analyzing time series data on historic stock market returns, as well as multiple linear regression models, have been employed to address this problem. However, these methods focus more on quantitative factors such as technical indexes.
More recently, artificial intelligence techniques such as machine learning, artificial neural networks (ANN), genetic algorithms, the KNN learning algorithm, decision trees, random forests and support vector machines have been applied in this area [15,16]. Again, in much of this literature, finding a common trend or pattern (for example, using neural networks to produce a forecast) meant that the analysis carried out would only be comprehensible to a small group of people. However, modern programming languages such as Python and new modules in R have enabled machine learning techniques to be implemented in a simple way, and have allowed historical data and their corresponding features to be visualized and their interactions to be observed easily. Hence this field of forecasting has expanded to
people with a modern laptop.

The prediction tools for classification approaches include neural networks, regression, genetic algorithms, decision trees and k-Nearest Neighbors (kNN). These techniques generally involve separating the data into training and testing sets. kNN selects the k records of a data set that are closest to an unknown record when making a prediction. Decision trees adopt a different approach, checking whether certain criteria are met before splitting the data into branch-like segments. A single decision tree is more subject to variability and is improved upon by the Random Forest algorithm [10], which is made up of several decision trees. Random forests are regarded as effective in reducing overfitting. More recently, genetic algorithms have been applied in the field of finance and stock markets to find trading rules [2]. They can find optimal solutions to search problems, while neural networks can infer from patterns in data.

Although the desire to obtain higher profits and minimize investment risks serves as a motive for developing better predictive models, sometimes this knowledge remains undisclosed because of the profit motive. As a consequence, forecasting again becomes a technical expert's field, since many of the forecasting techniques proposed are difficult to comprehend and do not reach a level of accuracy high enough to attract real interest.

First, to motivate how prediction can be achieved, a brief background of the proposed model will be introduced in simple terms. Then the process of developing three machine learning models will be described, namely the Genetic Algorithm, the Random Forest learner and the K-nearest neighbor learner.
The results obtained from real-life data will demonstrate the potential strength of these models when short-term stock price prediction is required, the power of a modern programming language like Python for forecasting on a home computer, and a case of live backtesting with paper money.

The rest of the paper proceeds as follows. Each section presents a machine learning algorithm by first briefly describing its purpose, then how it is applied in the stock market and how a forecasting algorithm is developed from it. After the three machine learning algorithms are discussed, a backtesting strategy with paper money is introduced. Since the main idea is to forecast with the highest return at the lowest risk, a sample strategy for selecting stocks is also provided, so that the investor can aim for the highest earnings while keeping losses to a minimum.

The programming language of choice is Python. Initially there was a trade-off between R and Python; however, after some research, the rich libraries available in Python made it the clear winner. Although the Python libraries provide much convenience, some of their documentation is quite difficult to follow, since the libraries are relatively new and still being improved.

2.1 GENETIC ALGORITHMS

Genetic algorithms are based on the genetic processes of biological organisms. The algorithm was developed by Holland [1] and its principles have been elaborated in many texts [2][3][4]. The algorithm mimics a natural population, which is essential for evolution, and also identifies the members of the population that have little or no role left to play. As in nature, the individuals of the same species compete with each other for food and for mates; the ones that survive until a certain point in time produce the larger number of offspring and continue to the next generation. This process repeats with every generation, with the fittest and the strongest surviving.

Genetic algorithms are generally classified as adaptive methods and are used to solve search and optimization problems efficiently. The genes of highly adapted individuals spread to an increasing number of individuals in each successive generation. The genes passed from ancestors to descendants over time result in the best combination of characteristics, which can produce offspring much fitter than their parents, and over time the population becomes more adapted to its environment.

Each individual can be thought of as a possible solution to a given problem. A fitness score is attached to it depending on how good a solution to the problem it is. For example, in the stock market setting the individuals can be the coefficients of technical indicators, and the fitness score can be the prediction error produced by those coefficients, so that the individual with the lowest score is the best one for predicting the stock price for the next day. This fitness score can be thought of as an assessment of how effective an organism is at competing for resources in a real-life scenario. The fitter individuals are given the opportunity to reproduce, cross-breed and sometimes even mutate over generations of a population.
The subsequent generations produced in this way contain the characteristics of the fitter and more desirable members of the previous generations. The good characteristics are thus spread across many generations of the population, and sometimes new ones are added as mutation takes place. Successive generations are created until some stopping criterion is met. The final population can be thought of as a desired selection of solution candidates and can hence be applied to the original problem. Provided the GA has been designed well, the population will converge to an optimal solution to the given problem.

The main advantage of a GA is that it is a robust technique that can be used successfully in a wide range of problem areas, including those that are difficult for other methods to solve. Although GAs are not guaranteed to find the best solution, they can find a desired or acceptable solution rather quickly. Genetic algorithms have an advantage on problems with a non-differentiable or discontinuous objective function, to which
gradient-based methods such as Gauss-Newton would not be applicable [5]. They can also be used to optimize a function with several local optima. Because of the stochastic nature of the selection and recombination operators, evolutionary algorithms are less likely to get stuck in local optima than hill-climbing or other optimization methods. However, the best way to find a solution set may be to hybridize genetic algorithms with other optimization algorithms.

The field of evolutionary algorithms has been slowly gaining wide attention since the beginning of the 1990s. They are being applied in a number of fields, such as engineering, computer science and economics; for a list of other fields see [6]. Recently, genetic algorithms have been used in practice in many fields, as modern computers allow faster optimization. Some notable fields include finance and investment strategies, gene expression profiling, encryption and code breaking, and optimized telecommunications routing.

Before proceeding to the algorithm developed, the next section discusses the framework of genetic algorithms in relation to the stock market.
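The generational cycle described in this section (fitness scoring, selection of fitter parents, crossover, mutation, and a stopping criterion) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the algorithm developed in this report: the objective (sum of squares), the population size, the rates and the tournament size are all made up for the example, and elitism is added so that the best score never gets worse.

```python
import random

def fitness(individual):
    # Toy objective (an assumption for illustration): sum of squares, minimum at all zeros.
    return sum(x * x for x in individual)

def tournament(population, size, rng):
    # Pick `size` random individuals and return the fittest (lowest score).
    return min(rng.sample(population, size), key=fitness)

def one_point_crossover(a, b, rng):
    # Cut both parents at the same random position and swap the tails.
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(individual, rate, rng):
    # Perturb each gene independently with a small probability.
    return [x + rng.gauss(0.0, 1.0) if rng.random() < rate else x for x in individual]

def evolve(pop_size=30, genes=5, generations=20, cx_prob=0.5, mut_rate=0.05, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-10, 10) for _ in range(genes)] for _ in range(pop_size)]
    history = [min(fitness(ind) for ind in population)]
    for _ in range(generations):  # stopping criterion: fixed number of generations
        best = min(population, key=fitness)
        next_gen = [best[:]]  # elitism: the best individual always survives unchanged
        while len(next_gen) < pop_size:
            p1 = tournament(population, 3, rng)
            p2 = tournament(population, 3, rng)
            if rng.random() < cx_prob:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]  # no crossover: children copy the parents
            next_gen.append(mutate(c1, mut_rate, rng))
            if len(next_gen) < pop_size:
                next_gen.append(mutate(c2, mut_rate, rng))
        population = next_gen
        history.append(min(fitness(ind) for ind in population))
    return min(population, key=fitness), history

best, history = evolve()
print("best fitness per generation:", [round(h, 2) for h in history])
```

Because the best individual is copied unchanged into each new generation, the printed sequence can only stay level or decrease, which mirrors the idea that good characteristics are carried forward.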

2.2 Genetic Algorithms in the Stock Market

The parameters are represented as genes. They are combined to form a string of values that, in the case of a GA, is referred to as a chromosome. In biological terms the set of genes in a chromosome can be thought of as the genotype, and the genotype contains the building blocks of the organism, which is termed the phenotype. Since we mimic the natural evolutionary pattern, the same terms are used in the GA. For example, if we are trying to minimize a function of four technical indicators (RSI, Moving Average, Momentum and Stochastic Oscillator) and we represent each by 4 binary digits, our chromosome would be 16 binary digits long. The set of these technical trading features can be thought of as the genotype, while the finished product is the phenotype. An individual's fitness is measured in terms of the performance of its phenotype. Since the phenotype can be computed from the chromosome, the fitness can be computed from the chromosome as well; the function that does so is termed the fitness function.

Fitness function

A fitness function is specific to the problem it is designed to solve. Given a particular chromosome, the fitness function is expected to compute a single numerical fitness value. In the case of prediction it is best to minimize the fitness function, since the aim is the smallest possible difference between the actual and the predicted price.

Reproduction

The GA needs to create a population made up of individuals. Individuals are selected from the population and produce offspring that become part of the next generation. The parents are selected according to two criteria: first randomly, so that there is an equal chance of selection, and secondly from among the fitter individuals. Hence good individuals can be selected across many generations, but poor ones may fail to be selected.
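Returning to the 16-bit chromosome example given earlier in this section, the following sketch shows one way four 4-bit genes could be packed into, and read back from, a single bit string. The gene order, the function names and the example values are assumptions made for illustration only.

```python
GENES = ["rsi", "moving_average", "momentum", "stochastic"]  # order is an assumption
BITS_PER_GENE = 4

def encode(values):
    """Pack one integer in 0..15 per indicator into a 16-character bit string."""
    if len(values) != len(GENES):
        raise ValueError("expected one value per gene")
    if any(not 0 <= v < 2 ** BITS_PER_GENE for v in values):
        raise ValueError("each value must fit in 4 bits")
    return "".join(format(v, "04b") for v in values)

def decode(chromosome):
    """Split the bit string into 4-bit genes and read each as an integer."""
    if len(chromosome) != len(GENES) * BITS_PER_GENE:
        raise ValueError("chromosome must be 16 bits long")
    chunks = [chromosome[i:i + BITS_PER_GENE]
              for i in range(0, len(chromosome), BITS_PER_GENE)]
    return dict(zip(GENES, (int(chunk, 2) for chunk in chunks)))

chromosome = encode([10, 3, 15, 0])   # made-up gene values
print(chromosome)                     # 1010001111110000
print(decode(chromosome))
```

Here the bit string plays the role of the genotype, and the decoded parameter values are what the fitness function would actually be evaluated on.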
Having selected the parents, their chromosomes are recombined using crossover or mutation mechanisms. In crossover, two individuals are taken and their chromosome strings are cut at randomly chosen positions; the segments are then swapped to produce two new chromosomes, of the same length or of varying length depending on the type of crossover performed. For example, single-point crossover modifies two individuals in place, and the two resulting chromosomes have the same length as each other. It should be emphasized that
crossovers are not applied to all pairs of individuals selected in the mating process. Instead, whether a pair undergoes crossover is a random choice made with a small probability. If crossover is not performed, the offspring are simply duplicates of the parents. This random choice therefore ensures that individuals still have a chance of passing their genes across generations without crossover altering them. After crossover, mutation is applied to each child individually with an extremely small probability. This has two main advantages: first, it brings new changes to the genes (i.e. the parameters), and secondly, it makes sure the search space is explored rapidly.

Genetic algorithms, in my opinion, should be used when it is required to find exact solutions to certain optimization problems. In relation to stock market prediction, the genetic algorithm does a very good job of predicting the future price, as will be shown in the next section.

THE GENETIC ALGORITHM DEVELOPED

This section describes how the stock market price was forecast using genetic algorithms as a framework. A couple of things should be noted before describing the algorithm:

1. The algorithm choices, the criteria behind them and the modules used are justified after the algorithm is presented, step by step.
2. Many research papers that use genetic algorithms to forecast stock prices can be found in journals and articles; see for example [2][7][8].
3. Many applications and websites also use genetic algorithms to forecast, but
almost all of the papers and websites do not provide any source code; with this in mind, no source code of the genetic algorithm is provided in the appendix either.
4. However, the code is available to the instructors and can be found on the HKU portal under my account, and the results of all the experimentation can be found online.
5. The code described below is written as a completely new module in Python using DEAP [9] and the QSTK module.

The algorithm calculates the next day's price as follows.

[Step 1] The names of the stocks to be predicted are taken as inputs.

[Step 2] The time period of the training set is specified. I have taken the training data set to be the period from … to …. The testing data set that will be used is from … to ….

[Step 3] The stock information is pulled from the NYSE. It contains the following: the stock code, the timestamp, the day's high, low, closing and adjusted closing prices, and the volume of the stock.

[Step 4] The feature set consists of the Relative Strength Index (RSI), Simple Moving Average (MA), Momentum (MOM) and Stochastic Oscillator (SO), together with the pivot, i.e. the average of the day's prices: pivot = (day's high + day's low + closing price) / 3. They are calculated for both the training and the testing sets.

[Step 5] The coefficients of the above features, including the pivot term, become the individuals whose fitness is to be determined.

[Step 6] The individuals of Step 5 are initialized using various DEAP initialization functions. The stopping criterion is the number of generations, which is 5. The fitness is calculated using the training set as follows.
(a) The differences in RSI, MA and MOM between consecutive days in the training data set are calculated. The difference between pivot and closing prices is calculated for consecutive days as well.
(b) The error of an individual on a pair of consecutive days is

    error = market close (day 2) - [ (RSI difference between consecutive days) * (coeff 1)
                                   + (moving average difference between consecutive days) * (coeff 2)
                                   + (momentum difference between consecutive days) * (coeff 3)
                                   + (stochastic oscillator, itself a kind of predictor) * (coeff 4)
                                   + ((close - pivot) difference) * (coeff 5)
                                   + market close (day 1) ]

The coefficients are the quantities we are trying to find.
(c) In other words, the weighted sum of features is added to the closing price, and the error is its difference from the next day's closing price.

[Step 7] The fittest individual is obtained using another DEAP feature. This individual is then used to calculate the predicted prices.

[Step 8] The fittest individual is passed as a parameter to the prediction function. The next day's price is predicted as follows:
(a) The differences in RSI, MA and MOM between consecutive days in the testing data set are calculated. The difference between pivot and closing prices is calculated for consecutive days as well.
(b) Feature = [ (RSI difference between consecutive days) * individual[0]
              + (moving average difference between consecutive days) * individual[1]
              + (momentum difference between consecutive days) * individual[2]
              + (stochastic oscillator, itself a kind of predictor) * individual[3]
              + ((close - pivot) difference) * individual[4] ]
(c) PredictedPrice = Feature + closing price on that day.
(d) The error = PriceNextDay - PredictedPrice is calculated and returned.

[Step 9] Various statistics, such as the correlation coefficient and the root mean square error, are written to a separate file for each stock, and various statistics of the individual are printed to the console.

The DEAP library is used mainly because it is designed with the evolutionary algorithm framework in mind. In my personal experience, I found it quite difficult to implement stock price prediction in a genetic algorithm framework; with DEAP, however, there were more options. To start with, DEAP does not impose the limiting predefined types encountered before, and it allows customized types to be created.
Secondly, DEAP takes care of everything from initializing the individuals onwards, so specific functions do not have to be written, for example for selecting the best individual from a population or for mutation and crossover; all that needs to be done is to fine-tune the parameters and specify which functions to use. For a list of all the functions, see the DEAP documentation. The only function that has to be written is the evaluate function, and this makes the whole process much easier to compute and visualize.

The QSTK (Quant Software Toolkit) module is also a relatively new module which, as its name suggests, helps with portfolio construction and management. It has relatively
many features that help in pulling data straight from Yahoo into a dataframe. It has many other features that make it very convenient to use, and more information can be found on the web.

Steps 1 to 3 will be used throughout the paper, and hence their justification is given on a separate page for easier reference.
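Before the individual steps are justified, the arithmetic of Steps 6 and 8 can be sketched in plain Python, without DEAP or QSTK. This is only an illustration of the formulas as written in the steps: the dictionary keys, the sample numbers and the choice of mean absolute error as the way of combining daily errors into one fitness value are assumptions, since the report does not state how the errors are aggregated.

```python
def pivot(high, low, close):
    # Average of the day's high, low and closing prices (Step 4).
    return (high + low + close) / 3.0

def feature_term(coeffs, row):
    # Weighted sum from Steps 6(b) and 8(b). `row` holds the day-to-day
    # differences of RSI, MA, MOM and (close - pivot), plus the stochastic oscillator.
    return (coeffs[0] * row["d_rsi"]
            + coeffs[1] * row["d_ma"]
            + coeffs[2] * row["d_mom"]
            + coeffs[3] * row["so"]
            + coeffs[4] * row["d_close_pivot"])

def predict(coeffs, row):
    # Step 8(c): predicted next close = feature term + today's close.
    return feature_term(coeffs, row) + row["close"]

def evaluate(coeffs, rows):
    # Step 6: error = next close - predicted close for each pair of days.
    # Aggregating with the mean absolute error is an assumption of this sketch.
    errors = [row["next_close"] - predict(coeffs, row) for row in rows]
    return (sum(abs(e) for e in errors) / len(errors),)  # one-element tuple

# Made-up numbers, not market data.
rows = [
    {"d_rsi": 2.0, "d_ma": 0.5, "d_mom": 1.0, "so": 60.0, "d_close_pivot": 0.2,
     "close": 100.0, "next_close": 101.0},
    {"d_rsi": -1.0, "d_ma": 0.1, "d_mom": -0.5, "so": 40.0, "d_close_pivot": -0.1,
     "close": 101.0, "next_close": 100.5},
]
coeffs = [0.1, 0.2, 0.3, 0.01, 0.5]
print(predict(coeffs, rows[0]), evaluate(coeffs, rows))
```

The evaluation returns a one-element tuple because DEAP treats fitness values as a sequence, one entry per objective; a genetic algorithm would then search for the coefficient list that makes this value as small as possible.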

[Step 1] The choice of stocks is made by the user; it is not fixed, and any number of stocks can be entered to see their predicted outputs. However, the computation time varies from computer to computer. The stocks here are chosen randomly; in Section 4, the stocks analyzed are chosen according to a trading strategy. Most importantly, the stocks specified in the algorithm have to be listed on the New York Stock Exchange.

[Step 2] We train on the stock prices from … to … for an important reason. Although a longer training period sounds attractive and is described in [2], there is a huge difference in how things changed after the financial crisis of 2008. Many major companies ceased to exist, and since many were reliant on each other, there was a domino effect. The companies that survived slowly recovered after 2008, and many changes were brought in. Hence 2010 seems to be a good time to start training, since it does not reflect abnormal fluctuations in prices and the market slowly began to be more stable. The testing period comes after the training period for the simple reason that we want to predict future prices.

[Step 3] The NYSE is a good and safe place to invest, as it is highly regulated. More importantly, since QSTK works with it directly, retrieving the data is easy. In my personal opinion, this made the process of getting data much quicker than downloading the data and then reading it from CSV files.
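The ordering constraint in Step 2, with every training observation preceding every testing observation, can be sketched as below. The prices and the cut-off date are made-up placeholders chosen only to make the example run; they are not the periods used in this report.

```python
from datetime import date

def chronological_split(series, cutoff):
    """Split (date, price) pairs so that all training rows precede all test rows."""
    ordered = sorted(series, key=lambda item: item[0])
    train = [item for item in ordered if item[0] < cutoff]
    test = [item for item in ordered if item[0] >= cutoff]
    return train, test

# Hypothetical prices and cut-off date, chosen only for illustration.
prices = [
    (date(2010, 1, 4), 30.1),
    (date(2010, 1, 6), 30.4),
    (date(2010, 1, 5), 29.8),
    (date(2010, 1, 7), 30.9),
]
train, test = chronological_split(prices, cutoff=date(2010, 1, 6))
print(len(train), "training rows,", len(test), "testing rows")
```

Splitting by date rather than at random keeps information from the testing period out of the training set, which is what predicting a future price requires.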

[Step 4] QSTK provides a very convenient family of functions, ftu.feat[Feature]; for example, the relative strength index can be calculated for the training and testing sets with the one-line command ftu.featRSI. This family of feature functions offers many options to choose from: RSI, DrawDown, Aroon, Stochastic, RunUp, RunDown, SMA and so on. Initial analysis was carried out using combinations of many features. Features such as Bollinger Bands, Correlation and Beta were tried as well, but were crossed off the list since they require a market benchmark such as the S&P 500. The other machine learning algorithms described in this paper do not use a benchmark either, in keeping with the approach that buy and sell decisions are not based on whether the stock is performing above or below the market, so the technical features that depend on a benchmark were dropped. Combinations of the remaining features mentioned above were tested. The criterion for testing was the minimum fitness value, since the aim is to minimize the difference between predicted and actual prices; the combination giving the lowest fitness value was selected for further analysis. Some of these features are also used in other parts of the paper.

The following descriptions are taken from Investopedia.com.

The relative strength index (RSI) is a technical momentum indicator that compares the magnitude of recent gains to recent losses in an attempt to determine overbought and oversold conditions of an asset. It is calculated using the following formula:

RSI = 100 - 100 / (1 + RS)

where RS = average of x days' up closes / average of x days' down closes. A high value, over 70, means that the stock is overbought, and a low value of RSI means the stock is oversold.

A simple moving average (SMA) is an arithmetic moving average calculated by adding the closing price of the security for a number of time periods and then dividing this total by the number of time periods.
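The two definitions just given can be written out directly. The sketch below is a plain-Python illustration and not the QSTK implementation: it uses simple averages of the up and down moves over the look-back window (other smoothing schemes exist), the function names are made up, and the prices are invented.

```python
def sma(closes, period):
    """Simple moving average: mean of each window of `period` closing prices."""
    if period <= 0 or period > len(closes):
        raise ValueError("period must be between 1 and len(closes)")
    return [sum(closes[i - period:i]) / period
            for i in range(period, len(closes) + 1)]

def rsi(closes, period=14):
    """RSI over the last `period` price changes, using plain averages."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    window = closes[-(period + 1):]
    changes = [b - a for a, b in zip(window, window[1:])]
    avg_up = sum(c for c in changes if c > 0) / period
    avg_down = sum(-c for c in changes if c < 0) / period
    if avg_down == 0:
        return 100.0  # no down closes: RS is unbounded, so RSI saturates at 100
    rs = avg_up / avg_down
    return 100.0 - 100.0 / (1.0 + rs)

closes = [10.0, 11.0, 10.0, 11.0, 10.0]   # made-up prices
print(sma(closes, 3))
print(rsi(closes, period=4))
```

With equal average gains and losses, as in the invented series above, RS equals 1 and the RSI sits at the midpoint of its 0 to 100 range.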
Short-term averages respond quickly to changes in the price of the underlying, while long-term averages are slow to react. A simple moving average gives the average price of a
stock over a length of time. Short-term averages can be observed to signal the beginning of an upward trend.

Momentum is the rate of acceleration of a security's price or volume. The idea of momentum in securities is that their price is more likely to keep moving in the same direction than to change direction. In technical analysis, momentum is considered an oscillator and is used to help identify trends.

The stochastic oscillator is a technical momentum indicator that compares a security's closing price to its price range over a given time period. The oscillator's sensitivity to market movements can be reduced by adjusting the time period or by taking a moving average of the result. The indicator is calculated with the following formula:

%K = 100 * (C - L14) / (H14 - L14)

where
C = the most recent closing price
L14 = the low of the 14 previous trading sessions
H14 = the highest price traded during the same 14-day period
%D = 3-period moving average of %K

In my opinion, the combination of these features gives a rough estimate of the direction in which the price is headed, and hence gives a good and accurate result. The average of the three prices is taken so as to give a rough estimate of how the price varied throughout the day. It should be noted that closing prices rather than adjusted closing prices are used here. Although the two values are similar and adjusted closing prices are generally a better choice, the next two algorithms discussed in this paper use the adjusted closing prices, and hence this brings a bit more variation.

The Aroon indicator is a technical indicator used for identifying trends in an underlying security and the likelihood that the trends will reverse. It is made up of two lines: one line is called "Aroon up", which measures the strength of the
uptrend, and the other line is called "Aroon down", which measures the downtrend. The indicator reports the time it takes for the price to reach, from a starting point, the highest and lowest points over a given time period, each reported as a percentage of total time.

Run Up: a series of price movements that occur in the same direction for a particular security, sector or index. A run is a prolonged uptrend or downtrend characterized by daily gains (uptrend) or daily losses (downtrend). For example, if a particular stock's price increased each day for five trading sessions, the stock would be said to be in a run.

The goal was to choose a combination from the list of available features such that the next day's price could be calculated. Hence the pivot price, the average (high + low + close) / 3, was also calculated. Figure 1 below shows the combinations of technical features and the fitness scores they give.

Figure 1. Various technical features and their corresponding fitness scores, with cxSimulatedBinary and mutGaussian as the crossover and mutation techniques respectively.

Note that for the crossover cxSimulatedBinary, eta = 0.3 is used, and for the mutation tools.mutGaussian, mu = 0.0, sigma = 1.0 and indpb = 0.05 are used when computing the fitness. The same combinations of features were also tested with a different crossover function, cxOnePoint, and with flip-bit mutation, with an independent probability of 0.05 of each attribute being changed. To keep the report concise, only the result of this experiment is shown here; for the rest, documentation can be found online.
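To make the mutation parameters quoted above concrete, the following is a stand-alone sketch of a Gaussian mutation operator with the same three parameters (mu, sigma, indpb). It is written in plain Python for illustration and is not the library's own code; the function name and the coefficient values are made up.

```python
import random

def mut_gaussian(individual, mu, sigma, indpb, rng=random):
    """Add Gaussian noise N(mu, sigma) to each attribute with probability indpb.

    The list is modified in place and also returned. This is a stand-alone
    sketch of the operator, not the library's own code.
    """
    for i in range(len(individual)):
        if rng.random() < indpb:
            individual[i] += rng.gauss(mu, sigma)
    return individual

rng = random.Random(42)
coeffs = [0.1, 0.2, 0.3, 0.01, 0.5]   # made-up coefficients
mutated = mut_gaussian(coeffs[:], mu=0.0, sigma=1.0, indpb=0.05, rng=rng)
changed = sum(1 for a, b in zip(coeffs, mutated) if a != b)
print(changed, "of", len(coeffs), "attributes changed")
```

With indpb = 0.05, each attribute has a 5% chance of being perturbed, so most mutation calls on a five-coefficient individual leave it unchanged.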

Figure 2. Various technical features and their corresponding fitness scores using single-point crossover and flip-bit mutation.

Figure 3. Fitness scores for different values of ETA, with the mutation probability (indpb) fixed.

Comparing Figures 1 and 2, it can be observed that the technical indicators RSI, MA, momentum, stochastic oscillator and pivot price give the lowest fitness value, and that the crossover and mutation techniques cxSimulatedBinary and mutGaussian give the lowest fitness score for them. Hence this set of features is taken, and the corresponding parameters for crossover and mutation are now varied to see whether a better fitness score can be obtained. From Figure 3, it can be seen that another crossover ETA value (description below) results in almost the same fitness as a crossover with ETA 0.30, keeping the mutation probability fixed. However, its fitness is just a bit higher, so the ETA is tested again with a higher value; as Figure 3 shows, this results in a poorer fitness. An ETA lower than 0.3 also results in poorer fitness, so 0.3 is taken as the optimal value. Next, the mutation rate is varied with the optimal ETA found above, 0.3.
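The tuning procedure described here, varying one parameter while the other is held fixed and then repeating with the roles swapped, can be sketched as follows. The objective below is a toy stand-in for "run the genetic algorithm and return its best fitness"; it is deliberately constructed to have its minimum at eta = 0.3 and indpb = 0.05 so the example is easy to follow, and it says nothing about the real experiments. The candidate values are also assumptions.

```python
def sweep(objective, fixed, name, candidates):
    """Vary one parameter while holding the others fixed; return the best value."""
    scores = {}
    for value in candidates:
        params = dict(fixed)
        params[name] = value
        scores[value] = objective(**params)
    best = min(scores, key=scores.get)   # lower fitness is better
    return best, scores

def toy_fitness(eta, indpb):
    # Stand-in for "run the GA and return the best fitness"; purely illustrative.
    return (eta - 0.3) ** 2 + (indpb - 0.05) ** 2

# First vary eta with indpb fixed, then vary indpb with the best eta.
best_eta, _ = sweep(toy_fitness, {"indpb": 0.05}, "eta", [0.1, 0.3, 0.5, 1.0])
best_indpb, _ = sweep(toy_fitness, {"eta": best_eta}, "indpb", [0.01, 0.05, 0.1, 0.2])
print(best_eta, best_indpb)
```

A one-parameter-at-a-time search like this is cheap, but it can miss combinations where the two parameters interact, which is worth keeping in mind when reading the figures.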

Figure 4. Fitness scores for different mutation probabilities, with ETA 0.3.

The mutation probability, i.e. the probability with which attributes are changed, was varied, and from the above analysis it can be seen that slight changes in the mutation rate can affect the fitness score greatly. So, in carrying out Step 4 of the algorithm, preliminary testing was done to find the lowest fitness score by varying first the choice of technical indicators, then the crossover parameter and finally the mutation rate, to obtain a fixed set of indicators for a given stock together with its mutation and crossover settings.

[Step 5] When the weight is -1 in DEAP, the goal is to minimize the fitness. Why are we minimizing? It is assumed that the following relation holds:

market close + [ x * (RSI difference on consecutive days) + y * (moving average difference on consecutive days) + z * (momentum difference) + a * (stochastic oscillator) + b * (pivot point - market close) ] = market close for the next day

If exact values of x, y, z, a and b that satisfy this are found, we are good to go, since the goal is to minimize the difference between the predicted and the actual price. That is why a minimizing fitness is used.

In DEAP, the Toolbox is a container for tools of all sorts, including the initializers. As discussed previously, not all individuals are selected for mating, since we also want to keep the case where some genes are unchanged across generations; a selection operator is therefore used that specifies the number of individuals taking part in each tournament: selTournament with a size of 5. DEAP offers many types of selection, crossover and mutation functions. For example, cxOnePoint executes a one-point crossover on the input sequence individuals; the two individuals are modified in place, and the resulting individuals respectively have the length of the other. However, the crossover function chosen was cxSimulatedBinary, which executes a simulated binary crossover that modifies the input individuals in place.
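As an illustration of what a simulated binary crossover does to two real-valued individuals, here is a stand-alone sketch of the standard SBX formula in plain Python. It is not the library's source code, and the function name, the parent vectors and the random seed are made up for the example.

```python
import random

def cx_simulated_binary(ind1, ind2, eta, rng=random):
    """Simulated binary crossover on two equal-length lists of floats, in place.

    A stand-alone sketch of the standard SBX formula; larger eta keeps the
    children closer to their parents.
    """
    for i, (x1, x2) in enumerate(zip(ind1, ind2)):
        u = rng.random()
        if u <= 0.5:
            beta = 2.0 * u
        else:
            beta = 1.0 / (2.0 * (1.0 - u))
        beta **= 1.0 / (eta + 1.0)
        ind1[i] = 0.5 * ((1 + beta) * x1 + (1 - beta) * x2)
        ind2[i] = 0.5 * ((1 - beta) * x1 + (1 + beta) * x2)
    return ind1, ind2

parent1 = [0.1, 0.2, 0.3, 0.01, 0.5]    # made-up coefficient vectors
parent2 = [0.4, -0.1, 0.0, 0.05, 0.2]
child1, child2 = cx_simulated_binary(parent1[:], parent2[:], eta=0.3,
                                     rng=random.Random(7))
# Each pair of children keeps the same sum (and so the same mean) as the parents.
print([round(a + b, 6) for a, b in zip(child1, child2)])
```

The spread factor beta controls how far the children move from the parents along each coordinate, while their average stays where the parents' average was.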
The simulated binary crossover expects sequence individuals of floating point numbers. A high eta produces children resembling their parents, while a small eta produces solutions that are much more different. A relatively small eta is chosen, for more variation. For mutation, mutGaussian was used. This function applies a Gaussian mutation of mean mu (0) and standard deviation sigma (1) to the input individual, and expects a sequence individual composed of real-valued attributes. The 0.05 argument (indpb) is the independent probability of each attribute being mutated.

[Step7] The hall of fame contains the best individual that ever lived in the population during the evolution. It is lexicographically sorted at all times, so that the first element of the hall of fame is the individual with the best first fitness value ever seen, according to the weights provided to the fitness at creation time. The insertion is made so that old individuals have priority over new individuals. A single copy of each individual is kept at all times; the operator passed to the similar argument decides whether two individuals are equivalent. This gives us the fittest individual across the generations.

Please note that [steps 6, 8 and 9] are not included in the justification of choices, since they perform all the computation. To understand them, it is best to go over the comments in the code.

Before proceeding to the results section, a possible question is: if today's date is 13/4/2016 and I want to predict tomorrow's price (14/4/2016), how would my algorithm calculate the next day's price? The answer is simple: it uses the change in RSI, MA and MOM, and the pivot point difference, between the two previous consecutive days to calculate the predicted close. The stochastic oscillator (SO) is needed just for 13/4/2016.

The Results achieved using Genetic Algorithms

The results achieved using the above algorithm have been very accurate. The following table is a brief snapshot of how closely the algorithm calculated the next day's price. It should be noted that two separate Python files have been included for the genetic algorithm: the one used in this section, and gene.py, which will be referenced later. The difference between the two files is that in the former the stocks are chosen randomly, while in the latter they are chosen strategically to find relations and optimize a portfolio.
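To show how a next-day figure of this kind is produced, the prediction rule from Step 5 and a fitness that the GA could minimize are sketched below. The function names and the sample numbers are made up for illustration, and the use of the mean absolute difference as the error measure is an assumption; the report only states that the difference between predicted and actual prices is minimized.

```python
def predict_next_close(close, d_rsi, d_ma, d_mom, stoch, pivot, coeffs):
    """Next-day close from today's close and the indicator terms of Step 5."""
    x, y, z, a, b = coeffs
    return close + (x * d_rsi + y * d_ma + z * d_mom
                    + a * stoch + b * (pivot - close))


def fitness(coeffs, rows, next_closes):
    """Mean absolute gap between predicted and actual next-day closes (assumed measure)."""
    total = 0.0
    for row, actual in zip(rows, next_closes):
        total += abs(predict_next_close(*row, coeffs=coeffs) - actual)
    return total / len(rows)


# One made-up day: close, RSI change, MA change, momentum change,
# stochastic oscillator, pivot point.
rows = [(100.0, 2.0, 0.5, 1.0, 80.0, 101.0)]
next_closes = [102.0]
score = fitness((0.0, 0.0, 0.0, 0.0, 1.0), rows, next_closes)
```

With a DEAP fitness weight of -1, an evaluation function of this kind would return its score as a one-element tuple, and lower scores would be treated as fitter.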

Figure 5 Predicted prices of AAPL, ADSK, AMZN, INTU

Figure 6 Predicted prices of IBM, NVDA, EA, QCOM

Figures 5 and 6 above are snapshots taken from the results.csv file (for a list of all the results please refer to my FYP webpage; http://i.cs.hku.). From the figures, especially the last three columns, we can see how well the algorithm has predicted: most of the values are very close, with little difference between the predicted value and the actual value. The fitness function does indeed minimize the difference between the predicted and actual values. For a bigger picture of how the algorithm performed for individual stocks, refer to Figures 7, 8 and 9. For more plots, please refer to my FYP webpage.

Figure 7 Predicted vs actual price AAPL

Figure 8 Predicted vs actual price IBM

Figure 9 Predicted vs actual prices MSFT

A small note on the genetic algorithm described above. If the aim of the investor is to look at prices beyond the next day, for example a week ahead, the above GA will not work, since the pivot term is based on the difference between two consecutive days. This matters for back testing, where more days are looked forward into the future. Therefore, with the same crossover and mutation functions (eta of 0.3 for crossover and indpb of 0.05 for mutation), it is now checked whether the model can work with only four technical-feature coefficients, i.e. with the pivot price dropped. After a quick change to the script, the following figure is obtained.

Figure 10: Table showing 4 features

From the above figure it can be seen that dropping the pivot point actually decreased the fitness value, i.e. a better fitness was obtained. Hence these 4 features are used when a back-testing strategy is developed in section 6.

Limitations and Analysis

1. In terms of computation time, the algorithm is extremely quick in finding the fitness of the function.
2. It may be argued that few technical features were used. However, various combinations were tested and analyzed, and this combination gave the best and most accurate results. Users running the algorithm may not get identical results every time, but should expect values close to the results reported here.
3. How the fitness function is written must be carefully considered, so that a better fitness is attainable and actually equates to a better solution for the given problem. If the fitness function is chosen poorly or defined imprecisely, the genetic algorithm may be unable to find a solution to the problem, or may end up solving the wrong problem. This was initially difficult, as choices had to be made. However, with repeated testing and experimentation, the formulation based on coefficients of the changes in the features seemed to give the best results for a wide range of stocks.

4. The fitness achieved depends on a variety of factors: the parameters, the population size, and the mutation, crossover and selection choices. If the population size is small, the search may not explore the whole solution space, and if genetic change is introduced too quickly across generations, convergence may be poor. This too was overcome by repeated experimentation with the mutation, selection and crossover techniques. The DEAP library made such experimentation much more convenient.
5. A GA can succumb to premature convergence. This can happen if fitter individuals appear so rapidly that the population diversity is reduced and a local optimum is reached instead of the global one. This problem was not encountered in the algorithm above, but caution must be taken to prevent it by experimenting further with different parameters.
6. Sudden changes in a stock can be caused by, for example, the company's CEO facing legal action or the company being entangled in legal problems. If such a condition persists over a prolonged period the GA would still provide accurate results, but not if the problem is resolved quickly. For such quick changes the market must be closely monitored, and web-scraping tools together with data-mining techniques can be used for a better result.

The next section discusses two machine learning algorithms, the random forest and the K-Nearest Neighbor algorithm. The program flow for both is similar; they differ only in which class is called to make the prediction, the K-Nearest Neighbor class or the Random Forest class.

3.1 THE RANDOM FOREST ALGORITHM

Random Forest [10] is an ensemble technique and is quite similar to the nearest neighbor predictor, which will be presented in the next section. The idea behind ensemble methods is that individual learners may be weak, but combined together these weak learners may become a strong learner. It is a divide-and-conquer approach. The algorithm can be described as a collection of another machine learning algorithm, the decision tree. As in the computer science notion of a tree, the root is the top of the tree: the input enters at the root, and the data traverses down the tree and is branched into smaller sets. A leaf is a node where the tree does not branch any further. The random forest learner combines the trees in an ensemble, with the notion that the combination will be a better learner than the individual weak learners. The following figure gives a brief overview of Random Forest.

Figure 11 Random Forest Algorithm. Source:-

Generally the algorithm works in the following manner [10]:
1. Take N samples randomly to create subsets of the data.
2. At each node, randomly select some predictor variables, depending on the criteria. The predictor variable that gives the best split according to some objective function is used for the binary split of that node.
3. Continue choosing a number of predictor variables randomly from the given set and repeat step 2.
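The three steps above can be sketched in a few dozen lines of plain Python. This is a generic illustration and not the project's RandomForestLearner class: the function names, the squared-error objective, the `leaf_size` stopping rule and the toy data are all assumptions made for the example.

```python
import random
from statistics import mean


def build_tree(rows, targets, n_features, leaf_size, rng):
    """Grow one unpruned regression tree; every node tries a random subset of features."""
    if len(rows) <= leaf_size or len(set(targets)) == 1:
        return mean(targets)  # leaf: average of the targets that reached it
    best = None
    for f in rng.sample(range(len(rows[0])), n_features):
        # Candidate thresholds: every distinct value except the largest,
        # so that both sides of the split are non-empty.
        for threshold in sorted({row[f] for row in rows})[:-1]:
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            left_mean, right_mean = mean(left), mean(right)
            # Objective function: squared error around each side's mean.
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold)
    if best is None:  # every sampled feature was constant at this node
        return mean(targets)
    _, f, threshold = best
    left = [(row, t) for row, t in zip(rows, targets) if row[f] <= threshold]
    right = [(row, t) for row, t in zip(rows, targets) if row[f] > threshold]
    return (f, threshold,
            build_tree([r for r, _ in left], [t for _, t in left], n_features, leaf_size, rng),
            build_tree([r for r, _ in right], [t for _, t in right], n_features, leaf_size, rng))


def predict_tree(node, row):
    """Walk from the root down to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node


def build_forest(rows, targets, n_trees, n_features, leaf_size, rng):
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrap sample, N rows drawn with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx],
                                 n_features, leaf_size, rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean([predict_tree(tree, row) for tree in forest])


rng = random.Random(0)
rows = [[float(x), float(x % 3)] for x in range(20)]  # made-up feature vectors
targets = [2.0 * x for x in range(20)]                # made-up values to learn
forest = build_forest(rows, targets, n_trees=10, n_features=1, leaf_size=2, rng=rng)
estimate = predict_forest(forest, [10.0, 1.0])
```

Because every tree sees a different bootstrap sample and a different random choice of features at each node, the individual trees disagree, and averaging them smooths out those differences.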

So when an input is given to the random forest, it is run down all of the trees. The result is then given as an average or weighted average of the terminal nodes reached. The next section discusses the random forest algorithm as used in stock prediction.

3.2 Random Forest in relation to Stock Market

Few algorithms for stock market prediction using random forests have been published in the literature. However, this does not mean that the method is not good at predicting future prices (more on this in the results section for Random Forest). Many decision trees are built from the training set, and each node contains a subset of the training features. The idea is that the input variables, i.e. the technical features, are randomly chosen, and the best split is chosen within that random subset. Pruning is not performed, so the forest consists of fully grown trees. If the values of the features at a node are the same, the mean weight of the features is taken.

The following section describes the RandomForestLearner algorithm that was developed to obtain the prediction. Although scikit-learn offers a random forest class for Python, it was not used. First, it is more flexible to work with functions designed to one's own preference than to call predefined class functions with parameters. Secondly, I personally think it defeats the purpose of understanding what the algorithm is trying to achieve. The algorithm is presented first, before some of the choices made are described and justified.

3.3 The RandomForestLearner Algorithm

[Step1] The names of the stocks to be predicted are taken as inputs.
[Step2] The time period of the training set is specified. I have taken the training data set to be a period from [date missing] to [date missing]. The testing data set that is used is from [date missing] to [date missing].
[Step3] The stock information is pulled from the NYSE stock exchange. It contains the stock code, the timestamp, the day's high, low, closing and adjusted closing prices, and the volume of the stock.
[Step4] The training set is split into two parts: the features set, and the Y that is to be predicted. The training features selected are RSI, STD, day-to-day difference, frequency, slope and mean. They are computed from a particular date backwards, over a window of 20 days. For example, if the date is 20/2/2012, the first training feature window is [1/2/2012 to 20/2/2012], the second is [2/2/2012 to 21/2/2012], and so on. The actual price of the stock is taken on 20/2/2012, and the training price of the stock is taken on 25/2/2012.
[Step5] Similarly to Step4, the testing data is split into two parts: the features X to be tested, and the Y that is to be compared against the predicted values. The logic is the same as in Step4.
[Step6] The RandomForestLearner class is called with k ranging from 2 to 15, to find the k that gives the lowest root mean square error between the predicted and actual values.
[Step7] The RandomForestLearner class is called with this best value of k to train and test the data. The training sets are those created in Step4, i.e. both the features and their corresponding Y values are used for training. For prediction, however, only the testing features set is sent to the learner, to obtain the predicted prices of the stocks.
[Step8] Statistics of the predicted and the actual stock prices are obtained. The results are written to csv and pdf files for later reference and viewing.

A very detailed, commented version of this algorithm's class is uploaded for reference in the RandomForestLearner.py file. For the justification of the choices in Step1 to Step3, please refer to page [number not given]. Moreover, Step4 part A is the same for the KNN learner, since the model used is the same.

[Step4 part A] Step4 involves choosing the technical indicators. Four technical indicators were chosen to serve as the basic indicators of the designed model. These were the day-to-day difference, the standard deviation and the frequency (how many times the stock price was over the mean on the current day and/or the previous day); judging from the feature list in Step4, the fourth is presumably the mean. The frequency concept is similar to a run down in stock market terminology. The justification for these four indicators is that today's stock price and the price tomorrow, or within a short time frame, would not fluctuate to a great extent. Additional indicators were then added to these four to see how well they fit the model. A brief introduction to the indicators is given below:

Simple Moving Average (SMA):- The SMA is the average price of a stock over a specific period. Its main job is to smooth out the random fluctuations of the historical prices to provide a clearer view of the trend. In this paper a 3 day weight over a 20 day average of SMA was taken.
Exponential Moving Average (EMA):- The same as the SMA, but more weight is attached to recent prices. The EMA provides a better picture of the trend since it gives more importance to the more recent prices.
Relative Strength Index (R.S.I):- A good indicator of whether the stock is overbought or oversold. A value above 70 is usually considered overbought and a value below 30 is considered oversold. This is a good indication of price fluctuation.

Rate of Change (R.O.C):- This is the gradient of the stock price over the last 20 days. A line y=mx+b is fit, and the value of m is taken, depicting a general linear trend of the stock prices.
Slope:- A linear regression line is fit through the features set, and the corresponding gradient is taken. This differs from the Rate of Change in that the rate of change is the straight line given by (newest value - oldest value)/oldest value within a time frame, whereas the slope takes into consideration all the data points within the time frame.
Frequency:- If the stock has been doing better than the mean on the current day and the previous day, a value of +1 is given.
Standard Deviation:- The standard deviation of the stock price over a period of 20 days.
Mean:- The mean of the stock's adjusted closing prices over 20 days.
Day to Day difference:- The price today minus the price yesterday.
Maximum Drawdown (MDD):- The maximum loss from a peak to a trough, before a new peak is attained.

The experimentation done to find the best indicators will be uploaded. To try the indicators individually or in combination, the instructors only need to comment out the corresponding indicator or indicators. The file name is forestexp.py.

[Step4 part B] To justify the final selection of indicators, the basic four (listed above as day-to-day difference, standard deviation and frequency) were combined with additional indicators such as SMA, EMA and MDD, to see which combination gave the lowest RMSE. Let us refer to the basic four as base.

Indicators                           RMSE
Base
Base+slope                           2.60
Base+roc+slope                       3.07
Base+slope+RSI                       2.53
Base+slope+RSI+MDD                   2.65
Base+slope+RSI+EMA                   2.95
Base+slope+RSI+SMA                   2.84
Base+RSI
Base+RSI+MDD
Base+RSI+MDD+EMA
Base+RSI+MDD+EMA+SMA
Base+RSI+SMA+ROC+EMA+MDD
Base+MDD
Base+MDD+EMA
Base+MDD+SMA+EMA
Base+MDD+SMA+EMA+ROC
Base+SMA
Base+SMA+EMA
Base+SMA+EMA+ROC
Base+EMA
Base+EMA+ROC
Base+SMA+EMA+MDD+slope+RSI+roc

Table 1 RMSE between predicted and actual prices for different technical indicators

From the above table it can be observed that the base indicators plus RSI and slope give the lowest root mean square error between actual and predicted prices. This set of technical indicators therefore becomes the training features (X) of our model, which are sent to the RandomForestLearner along with their corresponding actual values. The last part of Steps 4 and 5 can be better understood by referring to Figure 12 below.
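As an illustration of how such window features can be computed, a minimal sketch for one 20-day window of adjusted closing prices follows. The function names are hypothetical, the RSI uses simple sums of gains and losses over the window (one of several common conventions), and the reading of the frequency indicator as "above the window mean on both of the last two days" is an assumption based on the description above.

```python
from statistics import mean, pstdev


def slope(prices):
    """Gradient m of the least-squares line y = m*x + b through the window."""
    n = len(prices)
    x_bar = (n - 1) / 2.0
    y_bar = mean(prices)
    num = sum((i - x_bar) * (p - y_bar) for i, p in enumerate(prices))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def rsi(prices):
    """Relative Strength Index from the total gains and losses in the window."""
    changes = [b - a for a, b in zip(prices, prices[1:])]
    gains = sum(c for c in changes if c > 0)
    losses = -sum(c for c in changes if c < 0)
    if losses == 0:
        return 100.0  # no down days in the window
    return 100.0 - 100.0 / (1.0 + gains / losses)


def frequency(prices):
    """+1 if the price is above the window mean today and yesterday, else 0."""
    m = mean(prices)
    return 1 if prices[-1] > m and prices[-2] > m else 0


def max_drawdown(prices):
    """Largest peak-to-trough loss in the window, as a fraction of the peak."""
    peak, worst = prices[0], 0.0
    for p in prices:
        peak = max(peak, p)
        worst = max(worst, (peak - p) / peak)
    return worst


def window_features(prices):
    """Feature values for one window of adjusted closing prices."""
    return {
        "mean": mean(prices),
        "std": pstdev(prices),
        "day_to_day": prices[-1] - prices[-2],
        "slope": slope(prices),
        "rsi": rsi(prices),
        "frequency": frequency(prices),
        "mdd": max_drawdown(prices),
    }


window = [float(p) for p in range(1, 21)]  # made-up, steadily rising prices
features = window_features(window)
```

Sliding the window forward one day at a time gives one feature vector per trading day, which is the form the learners expect.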

Figure 12. A pictorial representation of how prediction is achieved (diagram labels: Features Training Set; Actual Stock Value; Actual Value +5 days; Random Forest Learner/KNN learner; Features Testing Set; Predicted Value).

Figure 12 shows how the algorithm achieves prediction. We have a list of training features (X1-X10), the actual Y value, and the actual Y value five days later. The model is trained on the change between the two, given here as ([Y+5]-Y)/5; note that the conversion below multiplies by Y, which corresponds to the relative change ([Y+5]-Y)/Y. The features set and the corresponding Y values are sent to the model for training. We then query the model with the testing features, and the output is the predicted value. However, this output is not yet the price; it is converted as follows:

PredictedPrice = (PredictedValue+1)*(Corresponding Y value)

For Step 6, the best k is found so that the RMSE for each individual stock is minimized. For Step 7, please refer to the comments in the code. The RandomForestLearner class is commented in great detail, since the code can get confusing at times. The next section briefly describes some of the results of the Random Forest algorithm.
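The training target and the conversion back to a price can be sketched as below. The sketch assumes the relative-change reading, i.e. dividing by the current price Y, because that is the form the conversion formula inverts; the function names and the sample prices are hypothetical.

```python
import math


def relative_change_targets(prices, horizon=5):
    """Training target for day t: (price[t + horizon] - price[t]) / price[t]."""
    return [(prices[t + horizon] - prices[t]) / prices[t]
            for t in range(len(prices) - horizon)]


def to_price(predicted_change, current_price):
    """Convert a predicted relative change back into a price."""
    return (predicted_change + 1.0) * current_price


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


prices = [100.0, 101.0, 99.0, 102.0, 103.0, 110.0, 108.0]  # made-up closes
targets = relative_change_targets(prices, horizon=5)
recovered = to_price(targets[0], prices[0])  # price five days after day 0
```

Training on relative changes rather than raw prices keeps the targets on a comparable scale across stocks with very different price levels.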

3.4 Results achieved using Random Forest Learner

The results achieved using the Random Forest algorithm were also very accurate: over 90 percent accuracy was achieved after repeated experimentation on many stocks. For more results and charts please refer to my FYP webpage, given in the hyperlinks above.

Figure 13. Predicted vs actual prices for AAPL, INTU, NFLX

From the above three stocks it can be seen that the model predicts the prices really well, given that we are looking five days into the future. The following figures show how the algorithm performed for each stock. So as not to make the report too long, various statistics that were also calculated, such as the features list and the scatter plots, are not included in this report. All of them can be found on the website.

Figure 14 Predicted vs actual value EA

Figure 15 Predicted vs actual AKAM

Limitations and Analysis

1. The results of learning are quite difficult to interpret. A single decision tree can give some insight, but across many trees it becomes quite difficult to visualize what is really going on. This was addressed by keeping the algorithm as simple as possible, so that what is going on can be followed.
2. If the number of trees is large and the data set is also large, a random forest can be slow to query. For example, if k, the number of trees, is varied in a for loop, computing the predictions for even a small set of stocks takes a very long time; this can be reduced by using smaller trees.
3. Individual decision trees generally suffer from high variance; in the forest this was reduced by bagging.

4. The choice of model limits the technical indicators that can be used. Since the approach to this model was chosen before I learned of the QSTK features module, only a certain number of features could be calculated. To compensate, as many logical features as seemed possible under the chosen model were included and tested against each other to find the best combination.

Overall the random forest algorithm achieves a good accuracy in predicting stock prices. However, caution must be taken that the training data is not overfitted, as poor prediction results might follow. The next section discusses the KNN learning algorithm.

KNN LEARNER ALGORITHM

KNN can be described as a non-parametric and lazy algorithm, and so it can be described in two parts. First, the non-parametric nature means that no assumption is made about the distribution of the data set under consideration [11]. Hence KNN is a very popular choice for modeling or prediction in situations where no theoretical assumptions can be made. Secondly, the lazy nature of KNN means that the training points are not used to build a generalized model; instead, all of them are retained. In simple words, the training phase is very quick, since no underlying model is fitted to the data, while the testing phase requires the whole training data set for prediction. It follows that KNN is computationally more expensive in terms of memory, since the whole training set has to be retained for the testing phase.

In KNN learning the variables are assumed to lie in an N-dimensional space [11], often called the feature space. The notion of a feature space means that the points have a metric, their distance from each other, and the most common measure is the Euclidean distance. Each item in the training set is a vector, and in KNN the k is the number of nearest data points, under the distance metric (for example the Euclidean distance), that are taken into account. So given a training set, KNN retains that set. When a testing point is then presented, the distances between it and the training points are calculated, and the smallest distance corresponds to the training sample that is closest to the testing point.

The performance of a KNN classifier is primarily determined by the choice of K as well as the distance metric applied [12]. The estimate is sensitive to the selection of the neighborhood size K, because the distance of the Kth nearest neighbor to the query determines the radius of the local region, and different K yield different conditional class probabilities. If K is very small, the local estimate tends to be very poor owing to data sparseness and to noisy, ambiguous or mislabeled points. To smooth the estimate, K can be increased so that a larger region around the query is taken into account. Unfortunately, a large value of K easily over-smooths the estimate, and the classification performance degrades with the introduction of outliers from other classes. Related research has been done to deal with this problem and improve the classification performance of KNN [13].

Generally speaking, the classification results are very sensitive to two aspects: data sparseness and noisy, ambiguous or mislabeled points if K is too small, and many outliers from other classes within the neighborhood if K is too large [13]. From a theoretical point of view, the classification performance of KNN is determined by the estimate of the conditional class probabilities of the query in a local region of the data space, which is determined by the distance of the Kth nearest neighbor to the query [13]. So the classification performance is very sensitive to the selected value of K. Furthermore, simple majority voting over the class labels can be a problem if the nearest neighbors vary widely in their distances and the closer ones more reliably indicate the class of the query object. Work such as [13] has the goal of addressing this sensitivity to the choice of the neighborhood size K.

Hence the optimal value of K is very hard to determine: the data cannot be assumed to be uniformly distributed, so a suitable k depends on the type or nature of the training data. The general convention is that a larger K is more tolerant of noise, and can smooth it out better than a smaller K, while a very small K results in overfitting. The best k, however, is usually problem specific, and can be found by repeated experimentation.
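A minimal sketch of the idea, using the Euclidean distance, is given below. The text above describes voting over class labels; because stock prices are continuous, this sketch averages the values of the k nearest neighbors instead, which is an assumption and not necessarily what the project's KNNLearner class does. The function names and data are made up.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training vectors closest to the query."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    nearest = [y for _, y in ranked[:k]]
    return sum(nearest) / len(nearest)


# Made-up training set: nothing is fitted, the points are simply retained.
train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_y = [1.0, 2.0, 3.0, 10.0]
query = [0.1, 0.1]

prediction_k1 = knn_predict(train_x, train_y, query, k=1)
prediction_k3 = knn_predict(train_x, train_y, query, k=3)
prediction_k4 = knn_predict(train_x, train_y, query, k=4)
```

With k=4 the distant point [5.0, 5.0] is pulled into the neighborhood and shifts the prediction, which illustrates the sensitivity to the choice of k discussed above.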

4.2 KNN learners in the Stock Market

The KNN learner has been used to predict stock prices; see for example [13][14]. The general framework is that historical data, some features of the stock, or company fundamentals can form part of the training data. Each of them is mapped to a set of vectors, and this set of vectors represents the stock features. A common metric, for example the Euclidean distance, is used to come to a decision. As mentioned above, KNN does not carry parameters or functions from the training data to the test data. Instead it finds the k closest records in the training data, i.e. those most similar to the test features, and uses them to predict the value of the stock. A vote is taken to select the value that the query takes on as its prediction.

The following KNN algorithm was used for this project. Again, it does not use existing Python KNN learner modules; a new one was coded instead, following the principles described above. It is quite similar to the Random Forest algorithm discussed previously, and most of the justifications are also the same.

4.3 The KNNLearning algorithm developed

[Step1] The names of the stocks to be predicted are taken as inputs.
[Step2] The time period of the training set is specified. I have taken the training data set to be a period from [date missing] to [date missing]. The testing data set that is used is from [date missing] to [date missing].
[Step3] The stock information is pulled from the NYSE stock exchange. It contains the stock code, the timestamp, the day's high, low, closing and adjusted closing prices, and the volume of the stock.
[Step4] The training set is split into two parts: the features set, and the Y that is to be predicted. The training features selected are RSI, MA, ROC, EMA, STD, day-to-day difference, frequency, slope and mean. They are computed from a particular date backwards, over a window of 20 days. For example, if the date is 20/2/2012, the first training feature window is [1/2/2012 to 20/2/2012], the second is [2/2/2012 to 21/2/2012], and so on. The actual price of the stock is taken on 20/2/2012, and the training price of the stock is taken on 25/2/2012.
[Step5] Similarly to Step4, the testing data is split into two parts: the features X to be tested, and the Y that is to be compared against the predicted values. The logic is the same as in Step4.
[Step6] The KNNLearner class is called with k ranging from 2 to 30, to find the number of nearest neighbors k that gives the lowest root mean square error between the predicted and actual values.
[Step7] The KNNLearner class is called again with the k found in Step6 to train and test the data. The training sets are those created in Step4, i.e. both the features and their corresponding Y values are used for training. For prediction, however, only the testing features set is sent to the learner, to obtain the predicted prices of the stocks.
[Step8] Statistics of the predicted and the actual stock prices are obtained. The results are written to csv and pdf files for later reference and viewing.
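Steps 6 and 7 amount to a small search over k, which can be sketched as follows. The helper names, the one-dimensional toy learner and the data are hypothetical; any learner with the same call signature could be passed in as `predict`.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def select_best_k(predict, train_x, train_y, test_x, test_y, k_values):
    """Return (best_k, best_rmse) over the candidate values of k."""
    best_k, best_err = None, float("inf")
    for k in k_values:
        predictions = [predict(train_x, train_y, q, k) for q in test_x]
        err = rmse(predictions, test_y)
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err


def nearest_mean(train_x, train_y, query, k):
    """Toy one-dimensional learner: mean target of the k closest points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return sum(train_y[i] for i in order[:k]) / k


train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # made-up data
train_y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
best_k, best_err = select_best_k(nearest_mean, train_x, train_y,
                                 test_x=[2.0], test_y=[2.0], k_values=range(2, 5))
```

Note that choosing k on the same period that is later used to report accuracy makes the reported error somewhat optimistic; holding out a separate validation period for the search is one way to avoid that.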
Before presenting the results in the next section, a couple of things are worth mentioning:-

a) The algorithm finds the best number of nearest neighbors, i.e. the value of k for which the root mean square error between the predicted and actual stock prices is lowest. This is because the idea is to reduce the difference between the predicted and actual values. The best k found is then plugged in again to obtain a prediction.
b) A consequence of the above is a much slower computation time; however, the prediction accuracy is much improved with this method. I think this should not be an issue, as processing time can be improved with hardware, whereas accuracy, which is pivotal, should not be sacrificed for speed.
c) For an overview of the model used, please refer to Figure 12, and for the choices made please refer to page 14.

The next section discusses the results found using KNN to predict future prices.

4.4 Results found using the KNN to predict future stock prices

Of the three machine learning algorithms presented, of which this is the last, KNN ranks second in prediction accuracy, with the Genetic Algorithm coming first and the Random Forest last. The results using KNN learning are highly accurate as well.

Figure 16: Actual vs predicted prices for AAPL, ADSK, AKAM, CSCO

Figure 16 above shows the results of the algorithm in forecasting stock prices. From the closeness of the values, it can be seen that it does a fairly good job of predicting. The root mean square error for the above stocks is given in the following table:

STOCK CODE    RMS
AAPL
ADSK
AKAM
CSCO

Table 2. Stock codes and their corresponding root mean squares

The following figures show how the algorithm performed for each stock. The full list of results, with their corresponding features and stock prices, can be found on my FYP webpage.

Figure 17 Comparison plot between predicted and actual prices ADSK

Figure 18 Comparison plot between predicted and actual prices ADSK

4.5 Limitations and Analysis

1. The algorithm does not learn anything from the training data, which can result in it not generalizing well and not being robust to noisy data. However, in the case of stock markets, as long as prices do not fluctuate greatly, KNN is able to limit the effect of this noise. Furthermore, k is chosen so that it gives the least difference between predicted and actual values. In general, stock market prices do not fluctuate greatly from day to day, and hence KNN can predict fairly accurately.
2. To predict the label of a new instance, the KNN algorithm finds the K closest neighbors of the new instance in the training data, and the predicted class label is set to the most common label among these K neighboring points. The main disadvantage of this approach is that the algorithm must compute the distances and sort all the training data at each prediction, which can be slow if there is a large number of training examples. As mentioned above, if this is the case there is always the option of increasing the computer's processing power.
3. It is very sensitive to irrelevant or redundant features, because all features contribute to the similarity and thus to the classification. This was avoided by careful feature selection: a number of features were repeatedly introduced and tried, to see how well the algorithm predicted the results.

Sections 2 to 4 describe the machine learning algorithms that have been used to predict stock market data for different testing periods. Although the algorithms came very close in their predictions, it is pivotal that some sort of back testing is performed to see how they would have performed in the market. Before building a simulated market scenario, I would like to present my own idea of how to select stocks so that the portfolio is maximized. Just one thing is assumed: the investor has enough money to buy any stocks, not just those within a certain threshold. This seems a fairly reasonable assumption, since money can always be borrowed if the investor thinks the investment will bring high returns. Hence, after a detailed analysis of how to choose stocks, a back testing strategy using the prediction results of the machine learning algorithm will be provided. If the requirement is just to buy whatever stocks the investor wishes, Section 5 can be ignored.

Section 5 describes how to choose stocks so that a portfolio can be constructed in a way that guarantees the highest risk free return, and a way in which stocks can be purchased, given some initial stocks to start with. The following analysis is something I thought of by myself, but I would not claim it as my own, since the idea has probably been around.

5. AN ANALYSIS ON HOW STOCKS CAN BE CHOSEN

Before presenting my idea of choosing stocks, a couple of things are worth mentioning.

a) This initial analysis does not use any paper money or backtesting strategies; however, a backtest is performed afterwards to verify the results of the idea presented. This will become clearer a bit later.

b) A couple of technical terms will be used:

- The Sharpe Ratio is a measure for calculating risk-adjusted return, and this ratio has become the industry standard for such calculations. The Sharpe ratio is the average return earned in excess of the risk-free rate per unit of volatility or total risk. By subtracting the risk-free rate from the mean return, the performance associated with risk-taking activities can be isolated. One intuition of this calculation is that a portfolio engaging in zero-risk investment, such as the purchase of U.S. Treasury bills (for which the expected return is the risk-free rate), has a Sharpe ratio of exactly zero. Generally, the greater the value of the Sharpe ratio, the more attractive the risk-adjusted return. (source: Investopedia.com)

- Daily Return can be thought of as the return on an investment from one day to the next. It is a kind of performance metric. For example, if the price of a stock is $10 on Day 1 and $15 on Day 2, the daily return is (15-10)/10 = 0.5, and multiplying by 100 gives the percentage (50 percent).

- A portfolio can be described as a grouping of financial assets held by individuals and/or managed by professionals. In this paper only stocks are considered as financial assets.

c) The procedure to find a good combination of stocks is very tedious, since it is not very automated: it involves moving and renaming many files. This problem could be approached in a different, more automated manner, but that was not possible given the processing ability of my computer.

d) The step-by-step procedure of how this was achieved can be found on the Moodle site.

e) The results of the genetic algorithm are used for choosing the stocks; however, any of the algorithms presented can be used instead, following the same steps.

Choosing the best stocks

Each public company is a specialized company and, by the notion of being public, every investor can have the same knowledge of the company. Companies can be grouped by specialization into sectors: for example, Apple and Google fall in the technology sector, while Exxon Mobil, Valero Energy Corporation and Chevron can be classified as companies in the energy sector. A list of sectors and the categories that fall under them can be found online.

There is another thing to consider as well. Companies in the same sector compete with each other, since they are in the same field, and each sector also relies on companies within its own sector or in other sectors. For example, the energy sector is closely linked with the automobile sector, and within the healthcare sector pharmaceutical companies are reliant on hospitals and other healthcare providers. This interdependence is exploited when choosing stocks.

To explain the idea further, consider the following example. If I have decided to purchase AAPL shares, I would randomly choose some other technology companies in this sector, for example MSFT, IBM and GOOG, and whichever stock in this group gives the highest Sharpe ratio is carried into the next step. For example, suppose AAPL gives the highest Sharpe ratio in the portfolio of stocks [AAPL, GOOG, IBM, MSFT]. The Apple stock is then selected for a second round, where it is compared with the stocks of companies it depends on. A quick Google search for the companies Apple is reliant on gives their names. In the case of Apple, these were JBL (Jabil Circuit), MU (Micron Technology) and ADI (Analog Devices). For each of these stocks, the prices were correlated with the Apple stock price.
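The first-round comparison described above (rank the candidate stocks of one sector by Sharpe ratio and keep the best one) can be sketched as follows. This is only a minimal illustration and not the code in fixedtemplate.py: the function names, the zero daily risk-free rate, the annualization by 252 trading days and the price series are all assumptions made for the example.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed number of trading days per year (annualization factor)


def daily_returns(prices):
    """Return the day-over-day fractional returns of a price series."""
    return [(today - prev) / prev for prev, today in zip(prices, prices[1:])]


def sharpe_ratio(prices, daily_risk_free=0.0):
    """Annualized Sharpe ratio: mean excess daily return per unit of volatility.

    Needs at least three prices (two returns) so a standard deviation exists.
    The zero risk-free rate is an assumption of this sketch.
    """
    excess = [r - daily_risk_free for r in daily_returns(prices)]
    volatility = statistics.stdev(excess)
    if volatility == 0:
        # Undefined for a constant return series; treated as 0 here (assumption).
        return 0.0
    return math.sqrt(TRADING_DAYS) * statistics.mean(excess) / volatility


def best_by_sharpe(price_table):
    """Return the symbol whose price series has the highest Sharpe ratio."""
    return max(price_table, key=lambda symbol: sharpe_ratio(price_table[symbol]))


# Hypothetical closing prices, invented purely for illustration.
candidates = {
    "AAA": [100, 101, 102, 103, 104],  # steady rise, low volatility
    "BBB": [100, 90, 105, 95, 100],    # ends flat, high volatility
}
print(best_by_sharpe(candidates))
```

With the worked example from the definitions, daily_returns([10, 15]) gives [0.5], that is, a 50 percent daily return.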
The idea behind this correlation step is to observe whether the stock price of any of these companies increased together with the price of Apple stock within the same time period. If the correlation coefficient is greater than 0.7, that stock is added to the portfolio.

With this idea, two things are obtained. The first is that, before buying a stock, it is better to know how the other stocks in the same sector are performing. Since the performance metric here is the Sharpe ratio, this means that in this case Apple is the best stock to purchase on a risk-adjusted basis. This choice can eliminate confusion if there is only enough money to invest in one stock and the aim is to obtain the highest risk-adjusted return from it. The second exploits the idea of complementary goods in economics, while the first does so for substitute goods. For example, the demand for cars and the demand for petrol, or interest rates and investment, are closely tied to each other. Hence, by correlating Apple stock prices with those of the very companies it is reliant on, it can be observed which stocks were most correlated with Apple when its price rose. Starting from one stock, the final result is therefore two stocks. It should be noted that if the correlation coefficient is less than 0.7, that stock is not considered when constructing the portfolio.

The programming logic of how this was achieved is in the online document; however, a step-by-step guide to how my version of an optimal portfolio is constructed is given here.

[Step1] Identify industry sectors. Technology, Energy, Healthcare and Finance were used. However, in finance no corresponding related company was used: since all companies are closely tied to the finance sector, a more thorough analysis would have been needed.

[Step2] Pick a stock from each industry. AAPL, XOM, BSX (Boston Scientific Corporation) and JPM (JP Morgan) were selected.

[Step3] For each of these stocks, select three other stocks in the same sector. This can be done randomly or according to the investor's preference. For example, AAPL, GOOG, MSFT and IBM can be tested to see which gives the highest Sharpe ratio. Then run the template file (fixedtemplate.py) to find which stock has the highest Sharpe ratio, together with the allocation. These are printed to the console. Take a note of that stock code and put it in the portfolio.

[Step4] With the stock code that gave the best Sharpe ratio, or the maximum allocation, find the companies that are related to it. For the case above these could be JBL (Jabil Circuit), MU (Micron Technology) and ADI (Analog Devices). With the selected stocks, run the template file (similar.py) to obtain the correlation coefficients. If a correlation coefficient is greater than 0.7, put that stock in your portfolio.

[Step5] Repeat Step3 and then Step4 for the remaining sectors.
Note that Step3 always finds an allocation with the highest Sharpe ratio among the stocks tested, but Step4 is not guaranteed to succeed. If no stock with a correlation greater than 0.7 is found in Step4, it is up to the user to decide whether to keep trying more combinations.

[Step6] Backtest this strategy by passing in the list of stocks in the portfolio created, to see the results.
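The correlation filter in Step4 can be sketched as below. This is an illustration only and not the contents of similar.py: the function names and the price series are hypothetical, and the Pearson coefficient is assumed to be computed on raw prices, as the text describes, rather than on returns.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


def correlated_partners(anchor_prices, candidates, threshold=0.7):
    """Return the candidate symbols whose prices correlate above the threshold."""
    return [
        symbol
        for symbol, prices in candidates.items()
        if pearson(anchor_prices, prices) > threshold
    ]


# Hypothetical prices over the same five days, invented for illustration.
anchor = [1, 2, 3, 4, 5]
related = {
    "UP": [2, 4, 6, 8, 10],   # moves with the anchor stock
    "DOWN": [5, 4, 3, 2, 1],  # moves against it
    "MIXED": [3, 1, 4, 1, 5], # only weakly related
}
print(correlated_partners(anchor, related))
```

Only the symbols returned by correlated_partners would be added to the portfolio next to the stock chosen in Step3; an empty list corresponds to the case where Step4 finds nothing.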

For results of this strategy please refer to section 6, page 59.

Limitations

1. A quick glance at the two template files will make the user realize that the code is not user friendly. However, its purpose is to try something different and to find out whether a pattern of complementary or substitute trends can be observed.

2. The template file only calculates for four stocks at a time. This is to simplify the procedure, since it requires the user to manipulate many things, such as keeping copies of files and renaming them.

3. It is possible to look up various stocks and find their Sharpe ratios by using data from the internet, or by downloading various csv files and then manipulating them. However, this was not done, in order to keep things simple.

4. It can be argued that the process is too tedious and that there is room for improvement. However, since time was constrained, more priority was given to developing better forecasting algorithms than to a strategy that gains a more stable and perhaps higher return.

The next section describes a backtesting strategy based on a relatively new Python module, pyalgotrade. It is an event-driven algorithmic trading library with a focus on backtesting strategies. Again, only the genetic algorithm's predicted results were used, for two reasons. First, more priority was given to developing forecasting algorithms. Backtesting is a pivotal part of this process and sufficient time was allocated to it; however, the documentation of pyalgotrade is, in my opinion, quite difficult to comprehend, and hence it took a great deal of time to understand the concept. Second, there was a choice between the genetic algorithm and the KNN and Random Forest learners, and the genetic algorithm was chosen for its higher prediction accuracy. Since this could be implemented, I personally believe the others can be done in the same way, and this will be part of my future work outside this research project.

The backtesting strategy is built on a market simulator that signals when to buy or sell depending on the trading strategy being used.

6. BACKTESTING THE GENETIC ALGORITHM THE UPDATED VERSION USING A MARKET SIMULATOR

Backtesting can be described as taking a strategy and looking back in time to see how the strategy would have performed if it had been followed exactly. The underlying assumption is that if the strategy performed well in the past, it has a higher chance of working well in the future, and similarly, if the strategy did not work well in the past, it probably won't work well in the future. Backtesting can help investors to assess how a trading strategy would be likely to perform in the market. By learning from it, an investor can know more about trading strategies and methods of improving them.

Before the results of the strategies are discussed, a couple of things should be noted.

1. The backtest is performed on the updated genetic algorithm, since it provides more flexibility: the coefficients of the removed pivot point do not help if the aim is to look further into the future, for example a month ahead, and this can affect trading strategies.

2. A simple moving average (SMA) crossover is used. An SMA crossover can be described as follows: when the price of a security crosses above or below its moving average, a signal is sent to flag a possible change in trend.

3. Essentially, when speaking of stocks, long positions are those that are owned and short positions are those that are owed. An investor who owns 100 shares of XYZ stock is said to be long 100 shares. This investor has paid in full the cost of owning the shares. An investor who has sold 100 shares of XYZ stock without currently owning those shares is said to be short 100 shares. The short investor owes 100 shares at settlement and must fulfill the obligation by purchasing the shares in the market to deliver. Oftentimes, the short investor borrows the shares from a brokerage firm in a margin account to make the delivery.
Then, with hopes the stock price will fall, the investor buys the shares at a lower price to pay back the dealer who loaned them. When an investor uses option contracts in an account, long and short positions have slightly different meanings. Buying or holding a call or put option is a long position because the investor owns the right to buy or sell the security to the writing investor at a specified price. Selling or writing a call or put option is just the opposite and is a short position because the investor owes the holder the right to buy the shares from or sell the shares to him at the holder's discretion. (source: Investopedia.com)

4. Backtesting was done using the predicted prices; code for the actual prices is also given. Predicted prices were chosen because it is useful to see how the results of the algorithm would have performed in the market. The code for the actual prices serves as a comparison of how the same set of stocks would have performed.

A brief view of the backtesting code is given below.

- class TradingStrategy -> This class contains our strategy, that is, whether to buy or sell. It is initialized with a starting cash amount (cash_or_brk) and a moving average period of 14 days. The period is variable and can be adjusted accordingly; I varied it in increments of 6 days to find how the final portfolio value was affected.
- function onbars is invoked for each day. I decide whether to buy or sell based on the condition (close_price > simple moving average) of the previous day. Any other logic can be inserted into this function to get appropriate results.
- function onenterok is invoked when we enter a position.
- function onexitok is invoked when we exit a position.
- class DataframeBar -> represents a bar, or in other words an encapsulated data structure for a stock (e.g. AAPL) on a particular day (open, close, high, low, etc.).
- class DataFrameFeed -> represents a feed of bars (described above); it creates a feed of bars based on the dataframe we already have.
- function getnextbars is invoked to iterate over this feed.
- Finally, for each stock, we create a DataFrameFeed object using the dataframe we already have. This is passed to the TradingStrategy object. On execution, the strategy is run until the feed ends.

- Various statistics are included in the form of graphs so that investors can see how their portfolio performed, and the Sharpe ratio is calculated as a metric for the portfolio. Moreover, when to buy and sell each stock, and on which date, is printed to the console.

Only the SMA period was varied, in increments of 6, to see how the stocks performed. The total value of the portfolio was observed, as well as the Sharpe ratio. Although all of the plots will be uploaded, only a snapshot of how the trading strategy performed for one period (SMA 20) is given here (Figure 19). From Figure 20 it can be observed that increasing the SMA period resulted in a decreasing overall portfolio value; hence using a smaller SMA period resulted in a better strategy. Table 2 gives the total portfolio value for actual and predicted prices across the different periods. The decreasing trend can be observed from the table too. The close agreement between the two columns should not be surprising, since the prediction accuracy was around 99 percent.

Table 2
SMA PERIOD    TOTAL PORTFOLIO VALUE PREDICTED    TOTAL PORTFOLIO VALUE ACTUAL


Figure 19: Performance of each stock in the portfolio, the periods when buy and sell signals were generated, net return and cumulative return. The portfolio value across time is also shown.
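The trading rule described in this section (hold the stock when the previous day's close is above its simple moving average, otherwise stay out) can be sketched without pyalgotrade as follows. This is a simplified stand-in for the TradingStrategy class and not the project code: the function names, the execution of orders at the current day's close, the whole-share sizing and the price series are assumptions made for illustration, and commissions are ignored.

```python
def sma(values, period):
    """Simple moving average; None until `period` values are available."""
    averages = []
    for i in range(len(values)):
        if i + 1 < period:
            averages.append(None)
        else:
            window = values[i + 1 - period : i + 1]
            averages.append(sum(window) / period)
    return averages


def backtest_sma(closes, period, cash):
    """Run the rule day by day; return (final portfolio value, list of trades)."""
    averages = sma(closes, period)
    shares = 0
    trades = []
    for day in range(1, len(closes)):
        previous_average = averages[day - 1]
        if previous_average is None:
            continue  # not enough history for a signal yet
        price = closes[day]  # assumed fill price: the current day's close
        above = closes[day - 1] > previous_average
        if above and shares == 0:
            quantity = int(cash // price)  # buy as many whole shares as possible
            if quantity > 0:
                shares = quantity
                cash -= quantity * price
                trades.append((day, "BUY", price))
        elif not above and shares > 0:
            cash += shares * price
            shares = 0
            trades.append((day, "SELL", price))
    # Any open position is valued at the last close.
    return cash + shares * closes[-1], trades


# Hypothetical closing prices, invented for illustration.
closes = [10, 10, 10, 12, 13, 11, 9, 8]
final_value, trades = backtest_sma(closes, period=3, cash=1000)
print(final_value, trades)
```

Calling backtest_sma for several values of period and comparing the final values mirrors the experiment reported in Table 2, where only the SMA period was varied.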
