UNIVERSITY OF CALGARY: Analyzing Causality between Actual Stock Prices and User-weighted Sentiment in Social Media for Stock Market Prediction


UNIVERSITY OF CALGARY

Analyzing Causality between Actual Stock Prices and User-weighted Sentiment in Social Media for Stock Market Prediction

by

Jin-Tak Park

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

GRADUATE PROGRAM IN ELECTRICAL ENGINEERING

CALGARY, ALBERTA

SEPTEMBER, 2016

Jin-Tak Park 2016

Abstract

This thesis proposes an improved sentiment analysis algorithm that reflects the impact of users, and analyzes whether public sentiment calculated by the proposed algorithm can contribute to stock prediction. Unlike existing sentiment analysis algorithms, the proposed algorithm uses the factors of Twitter that are relevant to a user's authority to calculate the sentiment weight of each message. Linear and nonlinear prediction models are constructed to forecast future stock prices of selected companies. The proposed algorithm is applied to both the linear and nonlinear prediction models, and its prediction accuracy is compared with that of the existing sentiment analysis algorithm. To support the premise of the proposed algorithm, that authoritative users affect other users, the causal relationship between them is examined through Granger causality analysis. Further analysis examines the causal relationship between public sentiment and the actual changes in stock prices.

Acknowledgements

There are many people on my journey to a master's degree without whom I would not have made it to this point. First of all, I would like to thank my supervisor, Dr. Henry Leung, for all of his great advice and guidance over the last two years. Without his extensive knowledge, experience, and above all patience, I would not have had the opportunity to complete this meaningful journey to broaden my knowledge and deepen my understanding. I would also like to thank him for the unstinting financial support that helped me to focus on my research without anxiety.

I would like to thank my committee members, Dr. Behrouz Far, Dr. Guenther Ruhe, and Dr. Diwakar Krishnamurthy, who showed keen interest in my research and provided valuable discussions and comments on it. Thanks to all the faculty members and staff in the Department of Electrical and Computer Engineering for their great support.

I have been very fortunate to meet some great people while studying in a beautiful city, Calgary. I wish to thank all my colleagues: Miles, Si, Gayan, Chatura, Summona, King and Edwin for their encouragement and comfort. I would also like to thank all my friends in Calgary, the U.S. and South Korea for their help and lifelong friendships. Special thanks to my girlfriend, who has always been a great comfort to me. My life would not be as beautiful without her.

Last but most important, I especially appreciate my family for all of the sacrifices made on my behalf. I cannot find the words to thank my parents enough for their constant support, encouragement, love and patience. Huge thanks go to my sisters and their husbands for their support and for taking good care of our parents. Thank you all for always being there.

Dedication

To my loving parents, who have always made sacrifices for me.

Table of Contents

Abstract
Acknowledgements
Dedication
Table of Contents
List of Tables
List of Figures and Illustrations
List of Symbols and Abbreviations

Chapter 1: Introduction
Objective Outline

Chapter 2: Background and literature review
Financial Background Efficient Market Hypothesis (EMH) Terms of Stock Trading Types of Stock Trading and Analysis Sentiment Analysis Sentiment Analysis for Financial Problems Sentiment Analysis for Political Problems Impact of Users in Social Networks Machine Learning Algorithms Naïve Bayes Decision Tree Multilayer Perceptron (MLP)

Chapter 3: Sentiment analysis for Social Media
Data Acquisition Data Preprocess Emoticon Translation Data Cleansing Construction of N-grams Sentiment Labeling User-weighted Sentiment Analysis Sentiment Classification User-Weighted Sentiment Analysis Algorithm Discussion

Chapter 4: Time series Stock Prediction with public sentiment
Vector Auto-regression Models Principle VAR modelling Bivariate Linear Granger Causality Analysis Principle and Linear Causality Analysis for Stock Market Prediction Models Z-score normalization Prediction Power Comparison for UWS and Existing Algorithm Granger Causality Analysis between Authoritative Users and Others Experiments on Linear Stock Market Prediction Visualization Prediction Accuracy Measurement Prediction Accuracy Comparison for UWS and the Existing Algorithm Number of Tweets per Day for Improving Linear Prediction Discussion

Chapter 5: Time series Non-Linear Stock Prediction with public sentiment
Support Vector Regression (SVR) Models Non-Linear Granger Causality Analysis Principle Nonlinear Causality Analysis for Stock Market Prediction Models Nonlinear Prediction Power Comparison for UWS and Existing algorithm Granger Causality Analysis in Different Conditions Experiments on Nonlinear Stock Market Prediction Nonlinear Prediction Accuracy Measurement Prediction Accuracy Comparison for the Linear and Nonlinear Models Number of Daily Tweets for Improving Nonlinear Prediction Stability of the Nonlinear Model Discussion

Chapter 6: Conclusion
References
Appendix A
Appendix A.1: Linear Regression
Appendix A.2: Support Vector Machine

List of Tables

3.1. Each tweet in the collected dataset is composed of five attributes: tweet ID, tweet message, retweet count, submission time and user information. The user information attribute has 4 sub-attributes: user ID, followers count, location and language.

Examples of simplifying submission time data. Useless information such as minutes and seconds is removed to save space. Three new attributes are generated and substituted for the old attribute, created_at, through this process.

Sample translations of emoticons. Emoticons are divided into two sentiment categories in this thesis, Positive and Negative. The 10 most frequently used emoticons for each category are shown in this table.

The full list of stop words used in this thesis for the data cleansing process.

Synsets in SentiWordNet. Each synset is associated with three types of sentiment scores: Positivity, Objectivity and Negativity.

The number of hidden layers and the number of hidden units used in this test for the MLP and DNN algorithms. The number of hidden layers and the number of total hidden units of the DNN are set to 3 and 300, while the traditional MLP has one hidden layer and 4 total hidden units.

Detailed classification accuracies of four classifiers. On average, DNN shows the best performance while Naïve Bayes has the lowest classification accuracy of

Confusion matrices after training models: Naïve Bayes, SVM, MLP and DNN.

Tweets classification result of seven companies in U.S. stock market.

Tweets classification result of seven companies in U.S. stock market.

Granger causality analysis between Twitter sentiment and changes of actual stock prices of four companies: Apple (AAPL), Google (GOOGL), Amazon (AMZN), Microsoft (MSFT) and Yahoo (YHOO) in period October 1, 2012 to February 28,

Granger causality analysis between Twitter sentiment calculated by two sentiment analysis algorithms and changes of actual stock prices of four companies.

Granger causality analysis between the authoritative users and the other users for two companies: AAPL and GOOGL. The authoritative users are filtered in three different ranges: top 1%, top 5%, and top 10% of users ranked by user weight.

4.4. Two types of prediction accuracy for models M1 and M2, measured in terms of the average mean absolute percentage error (MAPE) and the direction accuracy (up/down). Prediction of AAPL using M2 shows the best performance in MAPE, and GOOGL using the same method performs best in the direction accuracy. M2 performs better than M1 for predictions of all four companies.

Two types of prediction accuracy for models M2 and M3, measured in terms of the average mean absolute percentage error (MAPE) and the direction accuracy (up/down). Prediction of AMZN using M2 shows the best performance in both the direction accuracy and MAPE. However, the performance of M3 is more accurate compared to M

Nonlinear Granger causality analysis between Twitter sentiment and changes of actual stock prices of four companies: AAPL, GOOGL, AMZN and MSFT in period October 1, 2012 to February 28, 2013, which is the same condition as the linear Granger causality analysis in the previous chapter.

Nonlinear Granger causality analysis between Twitter sentiment calculated by two sentiment analysis algorithms and changes of actual stock prices of four companies.

Pearson correlation test between direction accuracy and five Twitter factors: number of posting users, number of posted tweets, number of stock relevant tweets, number of authoritative users, and number of stock relevant tweets posted by authoritative users. All five factors are calculated as the average over the three days preceding the prediction date.

Result of Granger causality test in the four different conditions. As the other companies do not have enough stock relevant tweets, only AAPL and GOOGL are tested in the case of using stock relevant tweets. Grey colored cells are the results having higher causation than the case of full day + all tweets.

Two types of prediction accuracy for models M4 and M5, measured in terms of the average mean absolute percentage error (MAPE) and the direction accuracy (up/down). Prediction of AAPL using M5 shows the best performance in both the direction accuracy and MAPE. Prediction with Twitter sentiment performs better than prediction without Twitter sentiment for all four companies, as was the case in the linear prediction.

Two types of prediction accuracy for models M2 and M5, measured in terms of the average mean absolute percentage error (MAPE) and the direction accuracy (up/down). Overall, the SVR prediction model performs better than the VAR prediction model.

Two types of prediction accuracy for two nonlinear prediction models, SVR and MLP, measured in terms of the average mean absolute percentage error (MAPE) and the direction accuracy (up/down). The prediction accuracies of the two nonlinear models are not significantly different in either test.

5.8. Convergent Cross-Mapping (CCM) analysis for testing nonlinear causality. CCM here is tested in one direction, to predict stock prices with lagged public sentiment. The result shows that CCM analysis gives a result similar to that of nonlinear Granger causality analysis. Time step τ = 1, and the embedding dimension is varied from one to nine.

A case in which the absolute difference of a day t by M5 is greater than that by M2 while the direction accuracy of M5 is only equal to that of actual stock market.

List of Figures and Illustrations

1.1. A simple structure of the comparison of four models: linear stock prediction models trained by both historical stock prices and public sentiment on social networks calculated by both the existing sentiment analysis algorithm and the proposed UWS, and nonlinear stock prediction models trained by both historical stock prices and public sentiment on social networks calculated by both the existing sentiment analysis algorithm and the proposed UWS.

A simple structure for two models: a linear stock prediction model and a nonlinear prediction model, both trained by historical stock prices only.

The diagram of three forms of EMH. The weak form includes historical prices only and the semi-strong form includes all publicly available information. The strong form includes private information as well as all information in both the semi-strong form and the weak form.

A simple classification of two types of machine learning algorithms: supervised and unsupervised learning algorithms. Classifiers and regression algorithms are commonly used for sentiment analysis and time series prediction.

A simple structure of a decision tree. A color node on the top of the tree is the root node, and other blue colored nodes are non-leaf nodes. Red colored nodes are leaf nodes, which produce the classification.

A simple structure of an MLP. An MLP model is composed of three or more layers: input layer, hidden layer and output layer. Except for input nodes, each neuron has a linear or nonlinear activation function.

An example of emoticon translation and the entire data cleansing process. Emoticons are translated to sentiment words as the first step, and then five steps of data cleansing are processed. After the tokenization step, the bag of words has 6 words including the name of the product.

Correlation between the number of followers and the number of retweets. Although the number of retweets does not increase linearly, the graph shows that it tends to increase as the number of followers increases by

4.1. The movements in time series X are represented in time series Y with some time lag if X Granger causes Y. X in this figure has a time lag of 5 days when used for Y prediction.

Panels of two graphs for four companies: (a) AAPL, (b) GOOGL, (c) AMZN and (d) MSFT. The top graph for each company shows changes of actual stock prices with Z-score normalization, and the bottom graph for each company represents Twitter sentiment with Z-score normalization that has been lagged by 3 days.

Graphs for two companies, AAPL and GOOGL, which show the number of top 10% authoritative users. Both the number of daily tweets and the number of authoritative users increase when the companies announce quarterly earnings reports.

Graphs of Twitter users. The red points are the authoritative users who affect the other users, and the yellow points are general users. Information in the Twitter environment flows from the authoritative users to general users.

Fuzzy surface graphs with the result of the Granger causality test for the two companies: AAPL and GOOGL. The Z-axis is converted to 1 - (p-value), as a lower p-value indicates stronger correlation.

A graph of AAPL price movement from October 3, 2012 to March 28, The models are trained with data ranging from October 3, 2012 to February 28, 2013 and tested on the following month, which is represented by the dashed red line in the figure.

Panels of three graphs for four companies: (a) AAPL, (b) GOOGL, (c) AMZN and (d) MSFT. The top graph shows the overlap of actual stock prices of March 2013 (black), estimated stock prices with M1 (blue) and estimated stock prices with M2 (red). All four results indicate that the proposed M2 predicts closer to the actual stock prices.

The graph shows the direction accuracy of AAPL stock prediction when varying the number of tweets per day from 10 to The direction accuracy of AAPL when using all tweets was 79.17% in Table 4.4. The graph indicates that the direction accuracy approaches its maximum when there are at least 2500 tweets per day.

The graph shows the MAPE of AAPL stock prediction when varying the number of tweets per day from 10 to The MAPE of AAPL when using all tweets was 1.32% in Table 4.4. The graph indicates that the MAPE approaches its minimum when there are at least 2000 tweets per day.

5.1. The changes of p-value when using only stock relevant tweets to calculate Twitter sentiment. Twitter sentiment has stronger causation when the graph is negative. Overall, the p-values when using stock relevant tweets only are smaller than when using all tweets for sentiment calculation.

Improvement of the direction accuracy for both AAPL and GOOGL when using stock relevant tweets only to calculate Twitter sentiment.

Graphs for the comparison of the absolute difference between predicted prices and the actual prices for four companies: (a) AAPL, (b) GOOGL, (c) AMZN and (d) MSFT. For each graph, the result is represented as a red bar for a day t if the predicted price with M5 is closer to the actual price than that with M

Graphs for the comparison of the absolute difference between predicted prices and the actual prices for four companies: (a) AAPL, (b) GOOGL, (c) AMZN and (d) MSFT. For each graph, the result is represented as a red bar for a day t if the predicted price with M5 is closer to the actual price than that with M

The graph shows the direction accuracy of AAPL stock prediction with different numbers of daily tweets from 10 to The direction accuracy of AAPL when using all tweets was 87.50% in Table 5.2. The graph indicates that the direction accuracy approaches its maximum when there are at least 2500 daily tweets.

The graph shows the MAPE of AAPL stock prediction with different numbers of daily tweets from 10 to The MAPE of AAPL when using all tweets was 0.93% in Table 5.2. The graph indicates that the MAPE approaches its minimum when there are at least 2500 daily tweets.

A.2.1. An example of hyperplanes which separate data points into two subsets. The hyperplane h1 separates data points with the maximum margin while the other two hyperplanes do not separate the classes.

A.2.2. The margin between two hyperplanes, 1) $w \cdot x + \beta = 1$ and 2) $w \cdot x + \beta = -1$. It makes the distance between the two bounding hyperplanes as large as possible.

List of Symbols and Abbreviations

Symbol    Definition
SNS       Social Network Service
UWS       User-Weighted Sentiment
GPOMS     Google-Profile of Mood States
LR        Linear Regression
ANN       Artificial Neural Networks
SVM       Support Vector Machine
SOFNN     Self-organizing Fuzzy Neural Network
NLP       Natural Language Processing
RWH       Random Walk Hypothesis
EMH       Efficient Market Hypothesis
AAPL      NASDAQ symbol of Apple Inc.
GOOGL     NASDAQ symbol of Google Inc.
AMZN      NASDAQ symbol of Amazon.com
MSFT      NASDAQ symbol of Microsoft Corporation
YHOO      NASDAQ symbol of Yahoo! Inc.
BAC       NASDAQ symbol of Bank of America
C         NASDAQ symbol of Citi Bank
DJIA      Dow Jones Industrial Average
PLSA      Probabilistic Latent Semantic Analysis
ARSA      Autoregressive Sentiment Aware
LIWC      Linguistic Inquiry and Word Count
URL       Uniform Resource Locator
TURank    Twitter User Rank
MLP       Multilayer Perceptron
POS       Polarity of Sentiment
CAN       Chain Augmented Naïve Bayes
API       Application Programming Interface
TF-IDF    Term Frequency-Inverse Document Frequency
DNN       Deep Neural Network
LSE       Least Square Estimation
MLE       Maximum Likelihood Estimation
VAR       Vector Auto-Regression
MAPE      Mean Absolute Percentage Error
KKT       Karush-Kuhn-Tucker
SVR       Support Vector Regression
RBF       Radial Basis Function

CHAPTER 1: INTRODUCTION

Over the past few years, the use of social network platforms such as Twitter, Facebook, and Instagram has grown exponentially. Through these platforms, people share ideas and information, such as product reviews, opinions about news, and travel experiences, with others in real time. Such information therefore spreads quickly and reaches large numbers of people. Because this enormous amount of information reflects public opinion and insight, various ways of utilizing it have been actively discussed and studied [1].

Sentiment analysis, also known as opinion mining, is a research area concerned with extracting subjective information from raw data through natural language processing. As public opinions and insights are reflected in comments, view counts and ratings on social media, many researchers in this area have attempted to find meaningful information from online sources in various ways. Nevertheless, sentiment analysis in social media remains a difficult problem because of the limitations of the medium. For instance, Twitter limits the length of a tweet (a message posted on Twitter) to 140 characters, so much of a tweet's meaning is implied rather than stated. The presence of slang words, misspellings and emoticons also requires a complex preprocessing step before sentiment analysis.

Besides, previous researchers have focused only on the public sentiment in messages and have not considered the impact of users. However, it is well known that the impact of a user is a fundamental building block of social networks [2], [3], [4]. Some researchers have reported that factors of social media such as view counts and ratings are strongly related to the authority of the user in social networks [5], [6]. In other words, social media content posted by authoritative users is more likely to spread widely.

In social media, a user becomes a friend of other users when those users offer interesting and useful information. Conversely, the number of friends can be a barometer of the usefulness of a user's information. For instance, a stock market expert who often posts trustworthy stock market forecasts would have many investors as friends on social media. The expert's opinion would therefore spread widely, since the expert has a large potential audience who believe that the opinion is useful information. A user can also forward another user's messages to his or her own friends if the messages are useful and/or interesting. In other words, a user whose messages are forwarded many times is likely to have useful information. In the example of the stock market expert, although the expert has posted many useful messages, not all of them are meaningful to others. Most of the messages would be personal, such as greetings to the poster's family or friends. Thus, we can suppose that only the number of forwards is directly related to the flow of information on social media. Therefore, this thesis proposes an improved sentiment analysis algorithm named User-Weighted Sentiment (UWS), which considers the impact of users in order to reflect the usefulness of a message in its sentiment weight.

The financial sector has paid particular attention to social mining research. Many companies have found that this gold mine of information can be a good replacement for their traditional marketing methods. Previously, companies obtained feedback from users through surveys, questionnaires and interviews. These methods were expensive and often extremely time consuming, while the results were not what the companies really expected [7]. Recently, however, companies more often communicate with their current or potential customers in real time through social network platforms, because doing so significantly reduces the cost of feedback.
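The user-weighting idea introduced above can be illustrated with a minimal sketch. This is not the UWS algorithm itself, which is developed in Chapter 3. It assumes that each tweet already carries a polarity in [-1, 1], and it uses a hypothetical log-scaled forward (retweet) count as the weight. The names `Tweet`, `retweet_weight` and `weighted_daily_sentiment`, and the weighting function, are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Tweet:
    polarity: float      # assumed sentiment label in [-1.0, 1.0]
    retweet_count: int   # number of times the message was forwarded


def retweet_weight(retweet_count: int) -> float:
    """Hypothetical weight: equals 1 with no forwards and grows slowly."""
    return 1.0 + math.log1p(retweet_count)


def weighted_daily_sentiment(tweets: list[Tweet]) -> float:
    """Weighted mean polarity of one day's tweets (0.0 if there are none)."""
    total_weight = sum(retweet_weight(t.retweet_count) for t in tweets)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(t.polarity * retweet_weight(t.retweet_count) for t in tweets)
    return weighted_sum / total_weight


# One widely forwarded positive tweet against two unforwarded negative ones.
day = [Tweet(1.0, 100), Tweet(-1.0, 0), Tweet(-1.0, 0)]
daily_score = weighted_daily_sentiment(day)
```

In this toy example the unweighted mean polarity is negative, while the weighted score is positive, because the widely forwarded message dominates. Any monotonically increasing weight would show the same effect; the logarithm is chosen here only to damp very large forward counts.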
Beyond cost efficiency, social mining has also been utilized to predict the future of the economy. The Google-Profile of Mood States (GPOMS) and its variations have been proposed to predict stock prices by measuring public mood in social networks [8], [9], [10]. Some researchers have predicted the results of upcoming elections, since economics is hard to separate from politics [11], [12]. This thesis mainly studies the prediction power of public sentiment in social networks on the stock market.

Stock market prediction has been one of the most popular targets for various machine learning methods such as Linear Regressions (LRs), Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs). Bollen et al. [8] proposed two stock prediction models using a multiple regression model and a Self-organizing Fuzzy Neural Network (SOFNN), respectively, and V. Turchenko et al. utilized an MLP to predict short-term stock prices [13]. SVMs have also been widely used for the prediction of financial time series, since their risk function, consisting of the empirical error and a regularization term, minimizes the expected risk [14], [15], [16]. In the field of machine learning, SVMs are distinctive learning algorithms characterized by capacity control of the decision function, the use of kernel functions, and sparsity of the solution [17], [18], [19]. SVMs are based on the theory of structural risk minimization, which estimates a function by minimizing an upper bound of the generalization error [20]. SVMs are therefore very robust against overfitting and achieve good performance in solving various time series prediction problems. In this thesis, an SVM model is used for nonlinear prediction, and its performance is compared with that of the linear model.

1.1. Objective

The ultimate goals of this thesis are i) to propose an improved sentiment analysis algorithm that reflects the impact of users, and ii) to study whether the sentiment of social media can contribute to stock market prediction. The hypothesis of goal i) is that our proposed sentiment analysis algorithm, named UWS, performs better than the existing sentiment analysis algorithm when calculating the sentiment weight of social media. Figure 1.1 shows a simple structure of goal i).

Figure 1.1. A simple structure of the comparison of four models: linear stock prediction models trained by both historical stock prices and public sentiment on social networks calculated by both the existing sentiment analysis algorithm and the proposed UWS, and nonlinear stock prediction models trained by both historical stock prices and public sentiment on social networks calculated by both the existing sentiment analysis algorithm and the proposed UWS.

Thus, the performance of the following models is compared in this thesis:

a) A linear stock prediction model trained by both historical stock prices and public sentiment on social networks calculated by the existing sentiment analysis algorithm.
b) A linear stock prediction model trained by both historical stock prices and public sentiment on social networks calculated by the proposed UWS.
c) A nonlinear stock prediction model trained by both historical stock prices and public sentiment on social networks calculated by the existing sentiment analysis algorithm.
d) A nonlinear stock prediction model trained by both historical stock prices and public sentiment on social networks calculated by the proposed UWS.

Comparisons between models a) and b), and between models c) and d), show whether public sentiment calculated by UWS contributes to stock prediction better than that calculated by the existing algorithm in both linear and nonlinear predictions.

The hypothesis of goal ii) is that public opinions about companies or their products on social networks are related to the actual changes in stock prices. In addition, it is supposed that the impact of each social media message varies according to the position of the user who posted it. Therefore, the performance of the following models is also compared in this thesis:

e) A linear stock prediction model trained by historical stock prices only.
f) A nonlinear stock prediction model trained by historical stock prices only.

A comparison between models a), b) and e) shows whether public sentiment calculated by the two algorithms contributes to the linear stock prediction, and a comparison between models c), d) and f) shows whether public sentiment calculated by the two algorithms contributes to the nonlinear stock prediction. Finally, a comparison between models b) and d) shows whether the nonlinear stock prediction model performs better than the linear model. Figure 1.2 shows a simple structure for models e) and f).
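The six model configurations and the comparisons drawn between them can be summarized in a small lookup structure. This is only an organizational sketch: the names `Model`, `models` and `comparisons` are illustrative and not part of the thesis, and the nonlinear sentiment comparison is assumed to be taken against the nonlinear price-only baseline f).

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    kind: str                  # "linear" or "nonlinear"
    sentiment: Optional[str]   # "existing", "UWS", or None for prices only


models = {
    "a": Model("linear", "existing"),
    "b": Model("linear", "UWS"),
    "c": Model("nonlinear", "existing"),
    "d": Model("nonlinear", "UWS"),
    "e": Model("linear", None),
    "f": Model("nonlinear", None),
}

# Each question asked in the thesis maps to the group of models it compares.
comparisons = {
    "UWS vs. existing algorithm, linear": ("a", "b"),
    "UWS vs. existing algorithm, nonlinear": ("c", "d"),
    "Does sentiment help linear prediction?": ("a", "b", "e"),
    "Does sentiment help nonlinear prediction?": ("c", "d", "f"),
    "Nonlinear vs. linear with UWS": ("b", "d"),
}

for question, group in comparisons.items():
    print(question, "->", [models[key] for key in group])
```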


Figure 1.2. A simple structure for two models: a linear stock prediction model and a nonlinear prediction model, both trained by historical stock prices only.

1.2. Outline

Chapter 2 provides a literature review of the background and topics studied in this thesis. These topics include opinion mining, financial background and machine learning algorithms. Section 2.1 explains financial background such as the Efficient Market Hypothesis and the terms and types of stock trading. Section 2.2 introduces opinion mining and reviews existing research on how a user's position affects the flow of information on social networks, and Section 2.3 introduces various machine learning models for classification and related work.

Data collection and data pre-processing are explained in Chapter 3. Data pre-processing is one of the data mining techniques used to extract information from raw data. As unwanted information, such as misspellings and slang words, can cause incorrect sentiment detection, data pre-processing has to be done before detecting sentiment in the acquired data. Along with an introduction to data pre-processing and data classification, the UWS algorithm is proposed in this chapter.

The tests that follow find the proper parameters for calculating sentiment weight in support of the proposed algorithm.

In Chapter 4, the main idea of this thesis, predicting future stock prices by combining public sentiment on social networks with historical stock prices, is presented. Granger causality analysis is used in Section 4.1 to show that changes in public sentiment occur systematically before changes in the actual stock prices. Further analysis is provided in Section 4.2 to find the proper amount of information for sentiment detection.

Chapter 5 gives the results of nonlinear Granger causality analysis to investigate whether changes in public sentiment can be used to predict changes in the actual stock prices in nonlinear models. Similar to the previous chapter, further analysis is given in Section 5.2 to find the proper amount of information for sentiment detection in a nonlinear model. A discussion of the findings and concluding remarks are given in Chapter 6.
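As a rough illustration of the Granger causality idea mentioned in the outline, the sketch below computes the F statistic of a bivariate linear test: it asks whether adding past values of x to a regression of y on its own past reduces the residual error. It is a minimal sketch assuming NumPy is available and the two series have equal length; the function names and the synthetic data are illustrative, and the procedure used in Chapter 4 may differ in its details.

```python
import numpy as np


def _rss(target, regressors):
    """Residual sum of squares of an OLS fit with an intercept."""
    design = np.column_stack([np.ones(len(target))] + regressors)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return float(residuals @ residuals)


def granger_f_stat(x, y, lag=1):
    """F statistic for 'x Granger-causes y' using `lag` past values of each."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = y[lag:]
    n = len(target)                                   # usable observations
    # Entry k holds the series delayed by k steps, aligned with `target`.
    y_lags = [y[lag - k:len(y) - k] for k in range(1, lag + 1)]
    x_lags = [x[lag - k:len(x) - k] for k in range(1, lag + 1)]
    rss_restricted = _rss(target, y_lags)             # y's own past only
    rss_full = _rss(target, y_lags + x_lags)          # plus x's past
    dof = n - (2 * lag + 1)                           # residual degrees of freedom
    return ((rss_restricted - rss_full) / lag) / (rss_full / dof)


# Synthetic check: y follows x with a one-step delay, so x should help predict y.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.empty(200)
y[0] = 0.0
y[1:] = 0.8 * x[:-1] + 0.5 * rng.normal(size=199)

f_x_to_y = granger_f_stat(x, y, lag=1)
f_y_to_x = granger_f_stat(y, x, lag=1)
```

A large `f_x_to_y` together with a small `f_y_to_x` is the pattern one would expect if sentiment (x) leads price changes (y) and not the reverse. Converting the statistic to a p-value requires the F distribution with `lag` and `dof` degrees of freedom, which is omitted here to keep the sketch dependency-light.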

22 CHAPTER 2: BACKGROUND AND LITERATURE REVIEW This chapter aims to give an overview of the theoretical backgrounds for the thesis. Mainly it covers Sentiment Analysis, Natural Language Processing (NLP), financial backgrounds, and machine learning algorithms used in this thesis. First, we introduce financial backgrounds to help understanding of stock market. An explanation of sentiment analysis and review of existing researches about how a user s position affects the flow of information on social networks are provided next. Lastly, we quickly review various machine learning models and their uses in the stock prediction field Financial Background Stock prediction has been received wide attention in the field of finance. Billion dollars are traded on the stock market every day, and investors hope to get profits from the stock investment. Not only investors, but also many companies are strongly affected by the movements of the stock market. The most important thing to consider in the stock investment is to decide an appropriate time to buy, hold or sell stocks. However, prediction of stock market is not an easy task, because the stock market indices are essentially dynamic, complicated, nonlinear, and chaotic in nature [21]. Besides, the movements of the stock markets are influenced by many economic factors, such as general economic conditions, political events, oil prices, policies, expectations of investors, and etc [22]. There have been many researches in the stock prediction field on numerous stock exchange markets. Generally, these researches have two major groups those insist two conflicting opinions. The first group argues theories of the Random Walk Hypothesis (RWH) 8

23 and the Efficient Market Hypothesis (EMH) that people cannot predict stock prices from historical and present information. However, the second group claims that stock prices are predictable. They believe that there are recurring patterns in the stock market, which can be predicted. Thus analysts have undertaken in depth studies into various economic factors such as financial conditions, industries, and political events to discover the extent of correlation between the actual changes in stock prices and the economic factors [23]. Our approach is close to the second group which believes the stock market is predictable, so the later chapters will cover on the researches of the second group. In this section, we will first look at the researches of the first group and basic knowledge of finance Efficient Market Hypothesis (EMH) EMH was proposed by E. Fama which is a financial investment theory that states beat the market is impossible since stock market efficiency causes the current stock prices to always incorporate and reflect all relevant information available in the market [24]. According to this theory, stocks are always traded at the fair value, so it is impossible for investors to either purchase or sell stocks with profit. EMH includes three forms: weak, semi strong and strong efficiency as shown in Figure 2.1. If the form of EMH is weak, stock prices cannot be predicted by analyzing historical stock prices. Thus, investors cannot earn returns in the long run by investment strategies based on historical stock prices. Technical analysis will not be able to consistently produce excess returns if the form is weak although some forms of analysis still provide excess returns. If the form of EMH is semi strong, then it goes one step more by combining historical and current public information into the historical and current stock prices. Lastly, the strong form of EMH 9


includes both public and private information, such as internal information on financial trades, in the stock prices. In EMH, stock prices depend heavily on new information, such as news or real-time public opinion, but not on present and past stock prices [8]. However, because of this randomness, the up/down movement of stock prices cannot be predicted from news with more than 50% accuracy.

Figure 2.1. The diagram of three forms of EMH. The weak form includes historical prices only, and the semi-strong form includes all publicly available information. The strong form includes private information as well as all information in both the semi-strong and weak forms.

Terms of Stock Trading

Some stock trading terms reveal the sentiment of traders toward the stock movement. In this section, commonly used terms are introduced to help readers understand them better. When these terms appear in messages, they are very effective in determining the sentiment of a message about the stock movement when training the machine learning algorithms in Chapter 3.

The term "long" refers to buying a stock with the expectation that the asset will rise in value. For example, an investor who owns stock of Apple Inc. (AAPL) is said to have a long position in AAPL. In contrast, the term "short" refers to selling a stock because the investor believes that its price will decrease. Continuing the example, an investor who sells stock of AAPL on the market is said to have a short position in AAPL.

An option is a contract that gives the option holder the right, but not the obligation, to perform a specified transaction with the option issuer according to specified terms. A call option gives the option holder the right to purchase an underlying asset for a specified amount (the strike price) at a certain date in the future. If the stock fails to reach the strike price before that date, the option expires. Thus, investors buy calls only when they believe the price of the underlying asset will increase.

In contrast, a put option gives the option holder the right to sell an underlying asset for the strike price. Investors who hold a put option can exercise it at any time before it expires. They buy the option if they believe the price of the underlying asset will decrease. They have the right, but not the obligation, to sell the underlying asset for the strike price until the option expires. If investors buy a put option, the risk is equal to the money paid for the option, while the profit is equal to the decrease in the price of the underlying asset. However, the profit is limited because the price of the underlying asset cannot fall below zero. To offset a long put and close out the position, investors sell put options with the same terms [25]. On the other hand, if investors exercise a long put, they sell the underlying asset for the strike price.
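The risk and profit of a bought (long) put described above can be illustrated with a small sketch. The function name `long_put_profit`, the per-share convention, and the sample numbers are assumptions made for illustration; fees and early exercise are ignored.

```python
def long_put_profit(strike, premium, price_at_expiry):
    """Profit per share of a long put held until expiry (illustrative only)."""
    # The holder exercises only when the asset trades below the strike price.
    payoff = max(strike - price_at_expiry, 0.0)
    # The premium is paid up front, so it is the most the buyer can lose.
    return payoff - premium


# Hypothetical numbers: strike 100, premium 5.
print(long_put_profit(100.0, 5.0, 120.0))  # price rose: loss equals the premium
print(long_put_profit(100.0, 5.0, 80.0))   # price fell: profit grows with the decrease
print(long_put_profit(100.0, 5.0, 0.0))    # price cannot go below zero, so profit is capped
```

The loss never exceeds the premium, and the profit is bounded because the price of the underlying asset cannot fall below zero.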
Investors who sell put options hold short positions as they expect that the stock market will move upward. They have the obligation to buy at least

shares of the underlying stock at the put strike price. Contrary to put buyers, the risk for investors who sell put options is a decrease in the stock price, while the profit is equal to the credit received from selling the put. Investors who sell put options prefer options that are close to the expiration date because they want the put to expire worthless in order to keep the entire premium. Unlike a long put, a short put is offset by purchasing a put with the same strike price and expiration to close out the position [26].

Types of Stock Trading and Analysis

Stock trading can be classified into four types: day trading, short-term trading, medium-term trading, and long-term trading. First, in day trading, both buying and selling are done on the same day, and all trades are completed before the stock market closes. Investors who prefer day trading are called day traders. Second, investors who prefer trade periods between one day and a few weeks are called short-term traders. Third, medium-term trading has trade periods between a few weeks and a few months, which are longer than short-term trading periods. Lastly, investors who prefer holding stocks for many months to years are called long-term traders.

Fundamental business analysis involves the analysis of various financial factors such as financial statements and health, management, competitive advantages, competitors, and markets. The analysis is based on the assumption that markets may set a stock's price too low, but that the price will eventually reach its fair value. Thus, investors make profits by purchasing underpriced stocks and waiting for the market to bring the prices of those stocks to their fair values.

Unlike fundamental business analysis, technical analysis forecasts the direction or the prices of stocks by analyzing past market data. In technical analysis, it is assumed that all

information is reflected in the historical stock prices. Thus, the prediction of stock prices in technical analysis is an extrapolation from historical prices. Technical analysts believe that market timing is critical, and that profit can be obtained by analyzing historical price and volume movements and comparing them to current prices [26].

Sentiment Analysis

Figuring out what other people think has always been an important part of our decision-making process. With the exponential growth of social media such as personal blogs and social network platforms, new opportunities arise as people can actively use information technologies to find out the opinions of others [1]. The area of opinion mining, or sentiment analysis, aims to figure out public opinions and sentiment from social media.

Sentiment Analysis for Financial Problems

J. Bollen et al. [8] investigated whether measurements of public mood states derived from Twitter are correlated with the value of the Dow Jones Industrial Average (DJIA) over time. They analyzed Twitter messages with two methods: OpinionFinder, which measures positivity and negativity, and Google-Profile of Mood States (GPOMS), which measures sentiment in terms of 6 dimensions: Calm, Alert, Sure, Vital, Kind, and Happy. Granger causality analysis and a Self-Organizing Fuzzy Neural Network (SOFNN) were used to test whether the public mood states measured by the two methods are predictive of actual changes in the DJIA. In this research, four dimensions (Alert, Sure, Vital, and Kind) showed no causal relationship with the actual stock prices in the Granger causality analysis. The dimension Happy Granger-caused the stock prices at a lag of 6 days, but not significantly. Therefore, only one dimension (Calm) was concluded to have a

causal relationship with the stock prices. A. Mittal et al. [9] used the same method as [8] to analyze public sentiment on Twitter, but they used two more algorithms, SVM and Logistic Regression, to predict the stock market.

S. Asur et al. [10] demonstrated how social media can be used to predict real-world outcomes. They used Twitter messages to forecast box-office revenues for movies. They claimed that their simple model, built from the rate at which Twitter messages were created about specific topics, can outperform actual market-based predictors.

P. C. Tetlock [27] used daily content from Wall Street Journal columns to measure the interactions between the media and the actual stock market. He constructed a straightforward measure of media content that appeared to correspond to either negative investor sentiment or risk aversion. The results showed that negative media content forecasted a decrease in market prices, and that unusually high or low values of pessimism led to temporarily high market trading volume. He also argued that the price impact of pessimism appeared especially large and slow to reverse itself in small stocks. The results supported his assumption that media content is correlated with the opinion of investors, who own a disproportionate fraction of small stocks.

Y. Liu et al. [28] studied the mining of sentiment information from blogs and investigated methods of using such sentiment information to predict product sales performance. They proposed Sentiment PLSA (S-PLSA), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. They also presented ARSA, an autoregressive sentiment-aware model that uses the sentiment information collected by S-PLSA to forecast product sales performance. They claimed that their proposed approaches were more effective than pre-existing methods.

Sentiment Analysis for Political Problems

Sentiment analysis is used not only for predicting financial markets, but also for analyzing various other problems such as election prediction. A. Tumasjan et al. [11] investigated whether Twitter was used as a forum for political deliberation and whether online messages on Twitter reflected offline political sentiment, using the context of the German federal election. LIWC was used for text analysis, and the results indicated that Twitter was widely used for political deliberation. The study found that the mere number of messages mentioning a party reflected the election result. Joint mentions of two parties were in line with offline political ties and coalitions. The research argued that the analysis of political sentiment about the positions of parties and politicians on Twitter indicated that the content of Twitter messages reflected the offline political landscape.

H. Wang et al. [12] proposed a system for real-time Twitter sentiment analysis of the then-ongoing 2012 U.S. presidential election. The real-time data processing infrastructure and statistical sentiment model evaluated public sentiment changes in response to emerging political events and news as they unfolded. The study argued that the architecture and method used in their research were generic and thus can be easily adopted and extended for other problems.

A. Bermingham et al. [29] used the previous Irish General Election as a case study for investigating the potential to model political sentiment through sentiment analysis. They combined sentiment analysis using supervised learning with volume-based measures. The study found that political outcomes can be predicted by monitoring online public sentiment. However, their approach showed an error that was not competitive with traditional polling methods. The study also observed a dramatic sentiment shift in the two days before polling day which hinted at the election outcome.
They assumed that a deeper sentiment analysis

during this period would produce the most beneficial application of sentiment analysis in the context of an election campaign.

Impact of Users in Social Networks

However, the studies above focused only on public sentiment in messages and did not consider the impact of users. In fact, influence (or impact) has long been studied in the fields of business, marketing, sociology, and political science [30], [31]. For instance, M. Gladwell [32] studied how a fashion style spreads widely, and J. Berry and E. Keller examined how American people vote [33]. Studying the impact of users thus helps us better understand why certain information flows faster and more widely than other information. Many researchers have recognized the importance of the impact of users on online platforms, specifically Twitter.

M. Cha et al. [6] used a large amount of data collected from Twitter to present an in-depth comparison of three measures of influence: followers, retweets, and replies. They investigated the dynamics of user influence across topics and time. From their observations, popular users who have many followers were not necessarily influential in terms of retweets or replies. Also, the most influential users could hold significant influence over a variety of topics. They argued that topological measures such as the number of followers alone reveal very little about the influence of a user.

E. Bakshy et al. [34] investigated the attributes and relative impact of numerous users by tracking a massive number of diffusion events that took place on the Twitter follower graph. From their observations, users who had been influential in the past and had many followers generated the largest cascades. They also found that messages containing URLs were more

likely to spread in a specific group. They argued that word-of-mouth diffusion can only be harnessed reliably by targeting large numbers of potentially influential users, thereby capturing average effects.

Y. Yamaguchi et al. [46] addressed the problem of finding authoritative users on Twitter. They assumed that authoritative users who often submit useful information play an important role because useful information spreads quickly and widely. To identify authoritative users, they proposed Twitter User Rank (TURank), an algorithm that calculates authority scores of users on Twitter. They claimed that TURank can rank users who are not followed by many users but have large retweet counts in higher positions. The study concluded that the number of retweets is more important than the number of followers for ranking users correctly.

Machine Learning Algorithms

Machine learning is a field of artificial intelligence that focuses on building models with the ability to learn from an input dataset. According to A. L. Samuel [35], machine learning is defined as "a field of study that gives computers the ability to learn without being explicitly programmed." Generally, machine learning has two types of algorithms: supervised and unsupervised learning. Supervised learning presents an algorithm with a training dataset composed of training data and the expected output for it. If the training is


correctly processed, supervised learning algorithms figure out patterns from the training data so that they can correctly map new data that they have never seen before. Unsupervised learning algorithms, on the other hand, do not need labelled training data. The main purpose of unsupervised learning is to figure out patterns in the data that the researchers may not know. For example, a clustering process that uses a distance function to group similar data points together is unsupervised learning. Figure 2.2 shows a simple classification of machine learning algorithms.

Figure 2.2. A simple classification of two types of machine learning algorithms: supervised and unsupervised learning algorithms. Classifiers and regression algorithms are commonly used for sentiment analysis and time series prediction.

In this thesis, we mainly treat supervised learning algorithms because we want to figure out patterns of stock movement from historical information. After reviewing the literature on machine learning algorithms used for sentiment analysis and stock prediction, we found that the most commonly used algorithms with higher performance are Naïve Bayes, Decision Tree, Multilayer Perceptron (MLP), and the Support Vector Machine (SVM). Thus in

this section, we briefly review the basic concepts of four types of supervised learning algorithms and the relevant literature.

Naïve Bayes

Naïve Bayes is a simple probabilistic classifier that applies Bayes' theorem with the assumption of feature independence to classify input data. Naïve Bayes is widely used for text classification in the field of opinion mining [36], [37], [38]. It is popular because of its low time cost and relatively high accuracy. The algorithm is named "Naïve" because of the assumption that the value of each feature is independent of the value of any other feature in the input data. In practice, words in a sentence are strongly related, so the positions and presence of the words in the sentence are very important to the overall meaning and sentiment. Although this conflicts with the assumption of the Naïve Bayes algorithm, the algorithm may still achieve high accuracy when trained correctly in specific domains.

Let $D$ be a training set of documents, where the $i$-th document $D_i$ is represented by an $m$-dimensional word vector $D_i = (w_{i1}, w_{i2}, \ldots, w_{im})$. Assume that there are $n$ classes, $C = (c_1, c_2, \ldots, c_n)$, for $D$. The Naïve Bayes algorithm classifies $D_i$ into the class that has the highest posterior probability. The algorithm classifies $D_i$ into the class $c_j$ when equation (2.1) is satisfied:

$$P(c_j \mid D_i) > P(c_k \mid D_i) \quad \text{for } 1 \le k \le n,\ k \ne j \tag{2.1}$$

The document $D_i$ is classified as $c_j$ if the posterior probability of $c_j$ is the highest among all of the $n$ classes. By Bayes' theorem, $P(c_j \mid D_i) = P(D_i \mid c_j)\,P(c_j) / P(D_i)$. Since $P(D_i)$ is constant across classes, we need to maximize only $P(D_i \mid c_j)\,P(c_j)$ to classify

the document. For Naïve Bayes, balancing the distribution of classes is important for obtaining better classification results; with balanced classes, $P(c_j)$ is identical for all classes. Thus only equation (2.2) needs to be evaluated to classify the document $D_i$.

$$P(D_i \mid c_j) = P(w_{i1}, w_{i2}, \ldots, w_{im} \mid c_j) \tag{2.2}$$

As the value of each feature is assumed to be independent of the value of any other feature in the input data, equation (2.2) can be calculated as equation (2.3).

$$P(w_{i1}, w_{i2}, \ldots, w_{im} \mid c_j) = \prod_{k=1}^{m} P(w_{ik} \mid c_j) = P(w_{i1} \mid c_j)\, P(w_{i2} \mid c_j) \cdots P(w_{im} \mid c_j) \tag{2.3}$$

where the probabilities $P(w_{i1} \mid c_j), P(w_{i2} \mid c_j), \ldots, P(w_{im} \mid c_j)$ are calculated from the training dataset.

Use of Naïve Bayes in Sentiment Analysis

Naïve Bayes has been widely used by many researchers in the field of text classification. For instance, A. Pak et al. [36] proposed a method for the automatic collection of a corpus that can be used for training a sentiment classifier. In their research, TreeTagger was used for part-of-speech tagging, and they observed differences in the tag distributions among three sentiment sets: positive, negative, and neutral. They used the collected corpus to train a sentiment classifier that can determine the positivity, negativity, and objectivity of documents. For this sentiment classifier, they used the multinomial Naïve Bayes algorithm with N-grams and POS-tags as features.
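To make equations (2.1) to (2.3) concrete, the following is a minimal sketch of a word-based Naïve Bayes classifier. It is not the implementation used in this thesis: the function names, the toy documents, the add-one (Laplace) smoothing, and the use of log probabilities to avoid numerical underflow are all assumptions added for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Estimate P(c_j) and P(w | c_j) from tokenized training documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    likelihoods = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        # Add-one smoothing (an assumption) keeps unseen words from zeroing the product.
        likelihoods[c] = {
            w: (word_counts[c][w] + 1) / (total + len(vocabulary))
            for w in vocabulary
        }
    return priors, likelihoods


def classify(tokens, priors, likelihoods):
    """Return the class maximizing P(D_i | c_j) P(c_j), as in equation (2.1)."""
    scores = {}
    for c, prior in priors.items():
        # The sum of logs corresponds to the product in equation (2.3).
        score = math.log(prior)
        for w in tokens:
            if w in likelihoods[c]:  # words never seen in training are skipped
                score += math.log(likelihoods[c][w])
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical, balanced toy corpus: two documents per class.
docs = [["good", "happy"], ["great", "good"], ["bad", "sad"], ["awful", "bad"]]
labels = ["positive", "positive", "negative", "negative"]
priors, likelihoods = train_naive_bayes(docs, labels)
print(classify(["good", "day"], priors, likelihoods))
print(classify(["bad"], priors, likelihoods))
```

Because the toy classes are balanced, the priors are identical and the decision is driven only by the likelihood term, which matches the remark that only equation (2.2) needs to be evaluated in that case.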

A. McCallum et al. [37] compared two different classifiers, the multinomial Naïve Bayes model and the multivariate Bernoulli model, both of which make the Naïve Bayes assumption. The multivariate Bernoulli model is a Bayesian network with no dependencies between words and with binary word features, and the multinomial model is a unigram language model with integer word counts. The multinomial Naïve Bayes model performed better than the multivariate Bernoulli model. Their empirical results showed that the multinomial Naïve Bayes model reduced error by an average of 27%, and by more than 50% at best.

Also, F. Peng et al. [38] proposed a chain augmented Naïve Bayes classifier, named CAN (Chain Augmented Naïve Bayes). The proposed classifier was based on statistical n-gram language modeling. In their study, CAN was able to capture dependence between adjacent attributes as a Markov chain. They claimed that their CAN modeling approach was able to work at either the character level or the word level, which provides language-independent abilities to handle various languages.

Decision Tree

Like Naïve Bayes, the decision tree is a widely used machine learning algorithm, because it can be applied to almost any type of data [39]. It is also a supervised machine learning algorithm, which divides its training data into smaller parts in order to identify patterns for classification. The knowledge in a decision tree is represented in the form of a logical structure like a flow chart, which is easy to understand without statistical background knowledge. Compared with other machine learning algorithms, it is particularly well suited to problems that may have many hierarchical categories.



Generally, a decision tree is learned by a divide-and-conquer approach, since it uses feature values to split the data into smaller subsets of similar classes. A decision tree consists of three types of nodes: a root node, which represents the entire dataset; non-leaf nodes, which perform the computation; and leaf nodes, which produce the classification. The algorithm learns which decisions should be made in order to split the labelled training data into the classes. Figure 2.3 shows a simple structure of a decision tree.

Figure 2.3. A simple structure of a decision tree. The colored node at the top of the tree is the root node, and the blue nodes are non-leaf nodes. The red nodes are leaf nodes, which produce the classification.

After training, we can classify an unknown instance with the decision tree by passing its data through the tree. At each non-leaf node, a specific feature of the input data is compared with a constant that was identified in the training process. For instance, in Figure 2.3, a yellow fruit moves to the second level, where the shape feature of the fruit is tested to determine if

it is round or thin. The fruit passes through these non-leaf nodes until it reaches a leaf node, which represents its assigned class: mango, lemon, or banana.

Use of Decision Tree in Sentiment Analysis

Many researchers in the field of sentiment analysis have used the decision tree algorithm in their studies. C. Castillo et al. [40] analyzed the credibility of news information on Twitter. They focused on automatic methods for assessing the credibility of tweets. In the study, they mainly analyzed trend-relevant tweets and classified them as credible or not credible, based on features extracted from them. To do so, they used various types of machine learning algorithms such as decision trees, SVM, and Bayesian networks. According to their empirical results, J48, one of the decision tree algorithms, performed the best among all tested algorithms.

However, A. Bifet et al. [41] failed to improve the accuracy of classification using the decision tree algorithm. In their research, the Hoeffding tree algorithm was implemented for sentiment classification of tweets. They trained the algorithm with a massive dataset and split it into approximately equal representations of each class. The accuracy of their classification was lower than that of other existing studies, which had an average accuracy of more than 70%.

Another study by C. Zhang et al. [42] also showed that the decision tree performed worse than other machine learning algorithms. They focused on predicting the sentiment polarity of Chinese articles in different topic domains. They proposed a rule-based semantic analysis approach that considers the word dependency structures in sentences and the importance of sentences to predict the sentiment polarity of each article. The empirical results of their research showed that the accuracy of sentiment analysis by decision tree was less than using other machine


learning algorithms, such as SVM and their proposed rule-based semantic analysis. Thus in this thesis, we do not include the decision tree algorithm as a testing model for sentiment analysis in later chapters.

Multilayer Perceptron (MLP)

MLP is a feedforward neural network model that maps input data onto appropriate outputs [43]. It consists of multiple layers in a directed graph, and each layer is fully connected to the next one. In the structure of an MLP, each node in a layer is called a neuron, and every neuron except the input neurons has a linear or nonlinear activation function. Like the Naïve Bayes and decision tree algorithms, MLP is trained in a supervised manner, using a technique called backpropagation. Figure 2.4 shows a simple structure of an MLP.

Figure 2.4. A simple structure of an MLP. An MLP model is composed of three or more layers: input layer, hidden layer, and output layer. Except for the input nodes, each neuron has a linear or nonlinear activation function.

In Figure 2.4, the input layer of the MLP has as many nodes as there are features, and the output layer normally has one node for each class of outputs. Therefore, the structure of an MLP varies only in the number of hidden layers and the number of nodes in each hidden layer. The example in Figure 2.4 has one hidden layer with five nodes.

Backpropagation

The backpropagation algorithm is the most widely used method for updating the weights in an MLP model. It is a gradient descent method that trains the weights in the MLP by minimizing the mean squared error between the calculated outputs and the expected outputs. Generally, the algorithm is composed of the following three steps:

a) Feedforward computation: Feedforward computation can be decomposed into two steps. First, the values of the hidden layer nodes are computed, and second, the value of the output layer is calculated using the values computed in the first step.

b) Backpropagation to the output layer: Once the error of the output node is known, it is used for backward propagation and weight adjustment. The error is propagated from the output layer to the hidden layer, and the weights between the two layers are updated.

c) Backpropagation to the hidden layer: The errors of the hidden layer are propagated from the hidden layer to the input layer. After calculating the errors of the hidden layer, the weights between the hidden layer and the input layer are updated.

The entire process is repeated iteratively until the error is sufficiently small [44].
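The three steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the network used in this thesis: the function name `train_mlp`, the sigmoid activation, the single output node, the learning rate, the number of epochs, and the toy AND dataset are all hypothetical choices.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_mlp(samples, n_hidden=5, learning_rate=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer MLP with one output node by backpropagation."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    # The last weight of every node acts as a bias (its input is fixed to 1).
    w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    history = []  # mean squared error per epoch
    for _ in range(epochs):
        squared_error = 0.0
        for x, target in samples:
            # a) Feedforward: hidden values first, then the output value.
            xb = list(x) + [1.0]
            hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hidden]
            hb = hidden + [1.0]
            output = sigmoid(sum(w * v for w, v in zip(w_out, hb)))
            error = target - output
            squared_error += error ** 2
            # b) Error term of the output node (gradient of the squared error).
            delta_out = error * output * (1.0 - output)
            # c) Error terms of the hidden nodes, using the weights before the update.
            delta_hidden = [
                delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
                for j in range(n_hidden)
            ]
            # Gradient descent updates for both weight layers.
            w_out = [w + learning_rate * delta_out * v for w, v in zip(w_out, hb)]
            w_hidden = [
                [w + learning_rate * delta_hidden[j] * v for w, v in zip(w_hidden[j], xb)]
                for j in range(n_hidden)
            ]
        history.append(squared_error / len(samples))
    return w_hidden, w_out, history


# Hypothetical toy task: learn the logical AND of two binary inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
_, _, history = train_mlp(and_samples)
print(history[0], history[-1])  # error at the first and at the last epoch
```

In this sketch the hidden-layer error terms are computed before the output weights are changed, so that both updates use the gradient of the same error; a fixed number of epochs stands in for the "sufficiently small error" stopping rule.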

Use of MLP in Sentiment Analysis

Like the two machine learning algorithms introduced in the previous sections, MLP has also been widely used in the field of sentiment analysis. For instance, D. Bespalov et al. [45] proposed an efficient embedding for modeling n-gram phrases that projects n-grams onto a low-dimensional latent semantic space. They used an MLP network to build a unified framework that allows the parameters of the latent space and the classification function to be estimated with a bias toward the target classification task at hand.

H. M. Nassif [46] presented, in his thesis, a neural model based on Convolutional Neural Networks and MLP for the tasks of Aspect Category Detection and Aspect Sentiment Detection. The vector representations of words in the reviews were initialized using word2vec. He explored the one-vs-all and multiclass classification schemes, and the static and non-static training methods for word representations. The empirical results showed that the proposed model performed better than the baselines.

CHAPTER 3: SENTIMENT ANALYSIS FOR SOCIAL MEDIA

Twitter is a social network platform that enables users to send and share messages of up to 140 characters, called tweets [47]. Through Twitter, many people share their personal opinions and information about products and companies in real time. Thus in this chapter, we use tweets as the data for analyzing public sentiment, which will be applied to stock prediction models in later chapters. Data collection and data preprocessing are explained first, and then we propose an improved sentiment analysis algorithm. Data preprocessing is a data mining technique for extracting meaningful information from raw data. Since unwanted information in collected tweets, such as misspellings, punctuation, and spam messages, can cause wrong sentiment detection, data preprocessing has to be done before detecting sentiment in the acquired tweets. The improved sentiment analysis algorithm proposed in this chapter reflects the impact of users, which is its main difference from existing sentiment analysis algorithms.

Data Acquisition

Twitter allows researchers access to its streaming application programming interfaces (APIs), which offer low-latency access to flows of tweets. It imposes numerous regulations and limits on its APIs, and requires users to register and authenticate when they send queries. This registration requires a phone number and an address to be verified, so Twitter can easily limit users' access. Twitter does not allow users to collect historical flows of tweets or too many tweets at the same time; hence we have some blank periods in the collected dataset. For this thesis, we obtained a dataset of 360,099,763 tweets that was recorded for about five months, from September 29, 2012 to March 31. Each tweet in the collected dataset is

composed of five attributes, as shown in Table 3.1. The attribute _id holds the tweet ID and text holds the tweet message. The attribute retweet_count indicates the number of retweets of the message, and the submission time of the message is stored in the attribute created_at. The user information is composed of four sub-attributes: id_str, followers_count, location, and lang, which represent the user ID, the number of followers of the user, the current location of the user, and the main language of the user, respectively.

Table 3.1. Each tweet in the collected dataset is composed of five attributes: tweet ID, tweet message, retweet count, submission time, and user information. The user information attribute has 4 sub-attributes: user ID, followers count, location, and language.

Attribute        Type               Example
_id              Tweet ID           ObjectId( c1a7a )
text             Tweet message      why is my iphone being so slow today?! :-(
retweet_count    Retweet count      5
created_at       Submission time    Tue Oct 09 01:10:
user             User information   { id_str : , followers_count : 491, location : Calgary, lang : en }

To store the collected dataset, we used a MongoDB database, since it allows faster reading and writing than traditional table-based relational database systems [48]. MongoDB is a NoSQL database platform that provides database solutions for large volumes of unstructured data. Thus clients can access big data faster and more easily through this platform. A database was created with a simple structure which had the attributes _id,

text, retweet_count, created_at, user_id, followers_count, location, and language. Submission time data is simplified as shown in Table 3.2 to remove useless information and save space.

Table 3.2. Examples of simplifying submission time data. Useless information such as minutes and seconds is removed to save space. Through this process, three new attributes are generated and substituted for the old attribute, created_at.

Old attribute          New attributes
created_at             date      time    day_of_week
Mon Oct 08 02:10:      /10/
Wed Nov 20 12:03:      /11/
Fri Dec 21 16:15:      /12/
Sun Jan 26 20:25:      /01/

In Table 3.2, the created_at attribute is substituted with three attributes: date, time, and day_of_week. The submission date is simplified to the form dd/mm/yyyy, the time attribute keeps only the hour of submission as a value from 1 to 24, and the days of the week, Sunday to Saturday, are encoded as consecutive integers starting from 1.

Data Preprocess

The raw dataset collected through the Twitter API is not ready to be used in sentiment analysis, since it contains unwanted information such as punctuation, emoticons, and misspellings. Thus data cleansing and transformation are required before the collected dataset can be used in sentiment analysis. In this section, several steps of data preprocessing are described.

3.2.1. Emoticon Translation

Emoticons are widely used to express personal sentiment on Twitter. They play an important role in sentiment analysis because the sentiment of an emoticon is in substantial agreement with the sentiment of the entire tweet [49]. Therefore, an emoticon dictionary was prepared by hand-labeling 251 selected emoticons listed on the NetLingo website [50]. Table 3.3 shows sample translations of emoticons in order of frequency.

Table 3.3. Sample translations of emoticons. Emoticons are divided into two sentiment categories in this thesis, Positive and Negative. The 10 most frequently used emoticons for each category are shown.

Sentiment Category   Emoticon   Sentiment Words
Positive             :)         smile
                     :-)        smile
                     =)         happy face
                     :D         big smiley
                     =D         big smiley with happy face
                     :-D        big smiley
                     ^_^        smile (used in Far East)
                     :P         smiley with tongue hanging out
                     ;)         left winking male smiley
                     lol        laugh out loud
Negative             :(         sad
                     :-(        sad
                     >:(        angry
                     :-<        very sad
                     :-Z        angry face
                     :-{{       very angry
                     (:-(       very unhappy
                     :-C        annoying
                     :-@        screaming (negative)
                     :`-(       crying
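As an illustration of the dictionary lookup, the sketch below translates emoticons using a few entries taken from Table 3.3. Matching whole whitespace-separated tokens is an assumption made here for simplicity (it also collapses repeated spaces); the names `EMOTICON_WORDS` and `translate_emoticons` are hypothetical.

```python
# A few entries from Table 3.3; the thesis dictionary has 251 hand-labeled emoticons.
EMOTICON_WORDS = {
    ":)": "smile",
    ":-)": "smile",
    "=)": "happy face",
    ":D": "big smiley",
    ":(": "sad",
    ":-(": "sad",
    ">:(": "angry",
    ":-<": "very sad",
}


def translate_emoticons(tweet):
    """Replace whitespace-separated emoticon tokens with their sentiment words."""
    tokens = tweet.split()
    return " ".join(EMOTICON_WORDS.get(token, token) for token in tokens)
```

Token-level matching avoids partial matches such as ":(" inside ">:(", at the cost of missing emoticons glued to a word.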

3.2.2. Data Cleansing

The purpose of the data cleansing process is to remove unwanted information from the collected tweets. The term unwanted information describes any information within the tweet message that will not be used by the machine learning methods to classify tweets. Through this process, we not only simplify the classification task but also decrease the training cost. The five steps of the data cleansing process are explained below:

a) Lowercase transformation: Uppercase characters in tweets are replaced with lowercase characters, as the lexicon used in this thesis is written in lowercase.
b) Filtering: Unwanted contents such as URL links, user names, and special words (e.g. RT, an abbreviation for retweet) are removed from tweets. If a tweet message contains URL links, we regard it as a spam message and remove the message from our database.
c) Punctuation cleaning: Punctuation marks in tweets (e.g. '.', '!', '?', ',') are replaced with blank spaces to keep the tokenization step simple. However, contractions such as I'm and don't are converted to their original forms: I am and do not.
d) Removing stop words: Stop words are words in a sentence that carry no information needed to understand the full sentence (e.g. the, am, and). To simplify our dataset, stop words are filtered out from tweets. The full list of stop words used in this thesis is shown in Table 3.4.
e) Tokenization: Each tweet message is split on blank spaces to form a bag of words.

An example of emoticon translation and the entire data cleansing process is shown in Figure 3.1.

Table 3.4. The full list of stop words used in this thesis for the data cleansing process.

Stop words: a about above across after afterwards again against all almost alone along already also although always am among amount an and another any anyhow anyone anything anyway anywhere are around as at back be became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can cannot computer could cry describe detail do done down due during each either else elsewhere empty enough etc even ever every everyone everything everywhere except few fill find fire first for former formerly found from front full further get give go had has have he hence her here hereafter hereby herein hereupon hers him his how however i if in indeed interest into is it its keep last latter latterly least less ltd made many may me meanwhile might mill mine more moreover most mostly move much must my name namely neither never nevertheless next no nobody none nor nothing now nowhere of off often on once one only onto or other others otherwise our ours ourselves out over own part per perhaps please put rather same see seem seemed seeming seems serious several she should show side since sincere so some somehow someone something sometime sometimes somewhere still such system take than that the their them themselves then thence there thereafter thereby therefore therein thereupon these they thick thin this those though through throughout thru together top toward towards under until up upon us very via was we well were what whatever when whenever where whereas whereby wherein wherever whether which while whither who whoever whole whom whose why will with within without would yet you your yours yourself yourselves


Figure 3.1. An example of emoticon translation and the entire data cleansing process. Emoticons are translated to sentiment words as the first step, and then the five steps of data cleansing are applied. After the tokenization step, the bag of words has 6 words including the name of the product.

As the first step in the example above, the emoticon :D is translated to the sentiment word Big smile. Next, all uppercase characters are converted to lowercase before unwanted contents are removed. The filtering step removes user names and the hash tags #excited and #iphone. After filtering, punctuation is cleared and the stop words I, am, so, to and my are removed. The last step splits the tweet on blank spaces to make a bag of words. The final bag of words contains only meaningful information, such as sentiment words and the name of the product (or the company), which will be used in the subsequent sentiment analysis.
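The five cleansing steps can be sketched as one small function. This is a minimal sketch under stated assumptions: the regular expressions, the tiny `STOP_WORDS` and `CONTRACTIONS` tables, and the function name `cleanse` are illustrative choices, not the thesis code; tweets containing a URL are dropped (returned as `None`), following the spam rule in step b).

```python
import re

# Small illustrative subset; the thesis uses the full list in Table 3.4.
STOP_WORDS = {"i", "am", "so", "to", "my", "the", "a", "is"}
# Only the two contractions named in the text; a real table would be larger.
CONTRACTIONS = {"i'm": "i am", "don't": "do not"}


def cleanse(tweet):
    """Return the bag of words for a tweet, or None if it is treated as spam."""
    text = tweet.lower()                       # a) lowercase transformation
    if re.search(r"https?://\S+", text):       # b) tweets with URLs are spam
        return None
    text = re.sub(r"@\w+", " ", text)          # b) user names
    text = re.sub(r"#\w+", " ", text)          # b) hash tags
    text = re.sub(r"\brt\b", " ", text)        # b) the special word RT
    for short, full in CONTRACTIONS.items():   # c) expand contractions first
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", " ", text)       # c) punctuation -> blank space
    # d) stop word removal and e) tokenization on blank spaces
    return [word for word in text.split() if word not in STOP_WORDS]
```

For a hypothetical tweet such as `"RT @bob: I'm so excited to unbox my new iPhone! big smile #excited #iphone"`, the function returns six words, in line with the bag of words described for Figure 3.1.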

3.2.3. Construction of N-grams

In computational linguistics, an N-gram is a contiguous sequence of n words from a given sentence. Originally, the N-gram is a type of probabilistic language model that predicts the next word in a sequence in the form of an (n-1)-order Markov model [51]. It is widely used in opinion mining, computational linguistics and communication theory. From the data cleansing process, we obtained a bag of words that contains single words (or unigrams). However, unigrams do not carry as much information as higher-order N-grams [52]. For instance, the phrase do not want to buy X expresses a negative opinion about an item X, while the individual unigrams do, not, want, to, buy and X cannot convey a specific sentiment by themselves. Thus different orders of N-grams (e.g. unigram, bigram, trigram) are considered in our sentiment analysis for higher classification accuracy.

Algorithm 3.1. Construction of N-grams
Input: a bag of n words
Output: a set of N-grams
    S ← a vector of n words
    k ← n
    while k > 0 do
        i ← 0
        while i ≤ n − k do
            W ← S_i + S_{i+1} + ⋯ + S_{i+k−1}
            if W exists in the sentiment lexicon then
                insert W into the individual dictionary
                S ← S \ W
            end if
            i ← i + 1
        end while
        k ← k − 1
    end while
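A runnable sketch of Algorithm 3.1 is given below, assuming the sentiment lexicon is available as a set of space-joined phrases. The function name `construct_ngrams` and the example lexicon are hypothetical. Unlike the pseudocode, the bound of the inner loop is re-evaluated after each removal so that indices stay valid once matched words are deleted.

```python
def construct_ngrams(words, lexicon):
    """Collect the longest lexicon N-grams first, removing matched words."""
    remaining = list(words)   # S: working copy of the bag of words
    found = []                # the individual dictionary
    k = len(remaining)        # start from the longest possible N-gram
    while k > 0:
        i = 0
        while i + k <= len(remaining):
            candidate = " ".join(remaining[i:i + k])
            if candidate in lexicon:
                found.append(candidate)
                del remaining[i:i + k]   # S <- S \ W
            else:
                i += 1
        k -= 1                # move on to shorter N-grams
    return found


# Hypothetical lexicon entries, for illustration only.
example_lexicon = {"do not want", "buy"}
example = construct_ngrams(["do", "not", "want", "to", "buy", "x"], example_lexicon)
```

Searching from long to short N-grams means a phrase such as do not want is stored as a unit before its unigrams can be matched separately.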

Suppose there are m words in a bag of words; then an m-gram is generated as the first step. It is stored in the individual dictionary if the sentiment lexicon includes it. Otherwise, we move to the next step and check whether the (m-1)-grams are in the sentiment lexicon. This process is repeated for (m-k)-grams with increasing k, and stored (m-k)-grams are removed from the tweet before moving to the next step. Algorithm 3.1 represents the entire process of constructing N-grams.

3.2.4. Sentiment Labeling

After storing the N-grams constructed in subsection 3.2.3, we need to assign a sentiment label to each N-gram. For instance, the word happy would be labeled positive and the word sad would be labeled negative. However, labeling a massive number of words manually is difficult and takes too much time. Therefore, a sentiment lexicon is used to overcome these limitations.

SentiWordNet

The sentiment lexicon used for sentiment labeling of N-grams is SentiWordNet [53], [54]. It is a lexical resource based on WordNet that was designed for opinion mining and sentiment analysis [55]. SentiWordNet is composed of numerous synsets, which are the basic items of information in WordNet. Each synset s is associated with three types of sentiment scores, Pos(s), Neg(s) and Obj(s), which represent how positive, negative and neutral the terms in the synset are. Each score ranges from 0.0 to 1.0 and the three scores sum to 1.0. Table 3.5 shows a sample of synsets in SentiWordNet.

Table 3.5. Synsets in SentiWordNet. Each synset is associated with three types of sentiment scores: Positivity, Objectivity and Negativity.

Positivity   Objectivity   Negativity   Words
overgreedy# acceptable# positive#3 plus# negative#9 minus# referable#1 imputable#1 due#4 ascribable# attractive# unaffixed#1 loose# procurable#1 obtainable#1 gettable# long-haired# pretty#1

However, SentiWordNet contains numerous words that are not commonly used, so it is inefficient to consider all 117,686 words in SentiWordNet. To select the 10,000 most frequently used words, we use the Term Frequency-Inverse Document Frequency (TF-IDF) model.

Term Frequency-Inverse Document Frequency (TF-IDF) Model

TF-IDF is a numerical statistic that indicates how important a word is to a document in a dataset or a collection of documents [56]. It is frequently used in information retrieval and opinion mining. The TF-IDF value increases with the number of times the word appears in the document, but it is offset by the frequency of the word in the dataset, which helps to adjust for the fact that some words appear more frequently in general.

The term frequency (TF) weight measures the importance of a word in a document. Suppose we want to find the document most relevant to the query the angry bird. If we filter out all documents that do not contain the, angry and bird, many documents still remain. Thus we count the number of times each term appears in the document, which is called TF. The TF of the word w in document d is calculated by equation (3.1).

\[ TF(w, d) = f_{w,d} \tag{3.1} \]

where $f_{w,d}$ is the raw frequency with which the word w appears in the document d.

The inverse document frequency (IDF) weight measures the importance of a word in a dataset. Returning to the example, the term the is so common in any document that TF might incorrectly emphasize documents that contain the more frequently, without giving weight to the more meaningful terms angry and bird. Thus an IDF is introduced, which diminishes the weight of terms that appear too often in a dataset and gives more weight to rarely used terms. The IDF of the word w in a document set D is calculated by equation (3.2).

\[ IDF(w, D) = \log \frac{N}{n_w} \tag{3.2} \]

where N is the total number of documents in the dataset, $N = |D|$, and $n_w$ is the number of documents containing the word w. Finally, we select 10,000 words by the normalized TF-IDF, which is calculated by equation (3.3).

\[ TF\_IDF(w, d, D) = \frac{TF(w, d) \, \log\left(\frac{N}{n_w}\right)}{\sqrt{\sum_{w \in d} \left[ TF(w, d) \right]^2 \left[ \log\left(\frac{N}{n_w}\right) \right]^2}} \tag{3.3} \]

where $TF\_IDF(w, d, D)$ is the weight of the word w in the tweet message d, $TF(w, d)$ is the term frequency of the word w in the tweet d, and N is the total number of training samples used to select words.

Each selected word w is labeled with one of five sentiments, very positive, positive, neutral, negative, or very negative, using equations (3.4) and (3.5) below.

\[ W_{score} = \frac{\sum_{i=1}^{n} \frac{Pos(i) - Neg(i)}{i}}{\sum_{i=1}^{n} \frac{1}{i}} \tag{3.4} \]

\[ W_{category} = \begin{cases} \text{Very Positive} & 0.7 \le W_{score} \\ \text{Positive} & 0.1 \le W_{score} < \ldots \\ \text{Neutral} & \ldots < W_{score} < \ldots \\ \text{Negative} & \ldots \; W_{score} < 0.1 \\ \text{Very Negative} & \text{otherwise} \end{cases} \tag{3.5} \]

where i is the rank of the word in the set of its synonyms, and Pos(i) and Neg(i) are the positive score and the negative score of the word, respectively. The score ranges for each category in (3.5) were experimentally revised from the previous study [57]. Through this process, the N-grams stored in our database in subsection 3.2.3 receive sentiment labels if they are frequently used words.
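Equations (3.1) to (3.4) can be illustrated with the short sketch below. It assumes each document is a list of tokens and that the document being weighted is itself part of the corpus, so every $n_w$ is at least 1. The function names and the toy inputs are hypothetical, and the natural logarithm is used because the normalization in (3.3) makes the result independent of the base.

```python
import math
from collections import Counter


def normalized_tfidf(index, corpus):
    """Normalized TF-IDF weights (3.3) for document corpus[index]."""
    n_docs = len(corpus)                          # N = |D|
    doc_freq = Counter()                          # n_w for every word
    for tokens in corpus:
        doc_freq.update(set(tokens))
    tf = Counter(corpus[index])                   # TF(w, d), equation (3.1)
    raw = {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}  # TF * IDF
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: (v / norm if norm > 0 else 0.0) for w, v in raw.items()}


def weighted_sentiment_score(sense_scores):
    """W_score of (3.4); sense_scores holds (Pos, Neg) pairs ordered by rank i."""
    numerator = sum((pos - neg) / i
                    for i, (pos, neg) in enumerate(sense_scores, start=1))
    denominator = sum(1 / i for i in range(1, len(sense_scores) + 1))
    return numerator / denominator


# Toy corpus echoing the "the angry bird" example in the text.
toy_corpus = [["the", "angry", "bird"], ["the", "bird"], ["the"]]
toy_weights = normalized_tfidf(0, toy_corpus)
```

In the toy corpus the word the appears in every document, so its IDF and therefore its weight is zero, while angry, which appears in only one document, receives the largest weight.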

3.3. User-weighted Sentiment Analysis

3.3.1. Sentiment Classification

We now need an appropriate classification method to categorize the preprocessed data into three groups: Positive, Neutral and Negative. To find a proper classifier, we discussed machine learning algorithms in Chapter 2. As mentioned there, traditional machine learning algorithms such as Naïve Bayes, SVM and MLP have been widely used as classifiers in opinion mining. Recently, however, deep learning models have received widespread attention for their higher classification accuracy. Many studies have shown that the DNN model gives better results not only for classification but also for various other problems [58], [59], [60], [61]. The DNN used in this thesis is an MLP with more hidden layers than traditional MLP networks. In this section, the DNN is therefore tested against the three traditional machine learning algorithms introduced in the previous chapter, Naïve Bayes, SVM, and MLP, to see how accurately it classifies tweets into the predefined categories.

The training set used for the four algorithms has 9,000 tweets that were manually classified into three categories (i.e. Positive, Neutral and Negative), with 3,000 tweets in each category. To differentiate the DNN from the traditional MLP network, we set the number of hidden layers of the DNN to 3 and the total number of hidden units to 300, while the traditional MLP has one hidden layer and 4 hidden units in total. Details of the training result are given in Table 3.6 and Table 3.7.

Table 3.6. The number of hidden layers and the number of hidden units used in this test for the MLP and DNN algorithms. The number of hidden layers and the total number of hidden units of the DNN are set to 3 and 300, while the traditional MLP has one hidden layer and 4 hidden units in total.

Algorithm   Hidden Layers   Hidden Units
MLP         1               4
DNN         3               300

Table 3.7. Detailed classification accuracies of the four classifiers. On average, DNN shows the best performance while Naïve Bayes has the lowest classification accuracy.

Naïve Bayes   Precision   Recall   F-Measure   ROC Area   Class: Positive, Neutral, Negative, Avg
SVM           Precision   Recall   F-Measure   ROC Area   Class: Positive, Neutral, Negative, Avg
MLP           Precision   Recall   F-Measure   ROC Area   Class: Positive, Neutral, Negative, Avg
DNN           Precision   Recall   F-Measure   ROC Area   Class: Positive, Neutral, Negative, Avg

The training result given in Table 3.7 indicates that the Naïve Bayes algorithm has the lowest classification accuracy among the four algorithms while DNN has the highest accuracy. The SVM algorithm is the second most accurate classifier in this test, and the average accuracy rate increases when moving from MLP to DNN as the classifier. The biggest improvement is in the classification of Neutral tweets from Naïve Bayes to SVM (0.321) and the smallest improvement is in the classification of Negative tweets from Naïve Bayes to DNN (0.050).

Table 3.8. Confusion matrices after training the models: Naïve Bayes, SVM, MLP and DNN. Rows are the actual classes and columns are the classes assigned by the classifier.

Naïve Bayes   Classified as: Positive, Neutral, Negative   Actual: Positive, Neutral, Negative
SVM           Classified as: Positive, Neutral, Negative   Actual: Positive, Neutral, Negative
MLP           Classified as: Positive, Neutral, Negative   Actual: Positive, Neutral, Negative
DNN           Classified as: Positive, Neutral, Negative   Actual: Positive, Neutral, Negative
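The measures reported in Table 3.7 can be derived from a confusion matrix of the kind shown in Table 3.8. The sketch below uses the standard definitions of precision, recall, F-measure and accuracy; the matrix `EXAMPLE` contains hypothetical counts, not the values from the thesis, and the function names are illustrative.

```python
def per_class_metrics(matrix, labels):
    """matrix[a][c] = number of tweets of actual class a classified as class c."""
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        classified_as_k = sum(row[k] for row in matrix)   # column total
        actually_k = sum(matrix[k])                       # row total
        precision = true_pos / classified_as_k if classified_as_k else 0.0
        recall = true_pos / actually_k if actually_k else 0.0
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
        metrics[label] = (precision, recall, f_measure)
    return metrics


def accuracy(matrix):
    """Share of tweets on the diagonal, i.e. classified into their actual class."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


LABELS = ["Positive", "Neutral", "Negative"]
# Hypothetical counts for illustration; these are not the values of Table 3.8.
EXAMPLE = [[8, 1, 1], [2, 7, 1], [0, 2, 8]]
```

The off-diagonal cells of the matrix are the confused tweets, so the accuracy equals one minus the share of confused tweets.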

Although SVM shows good accuracy in the classification of Neutral tweets, its accuracy for Positive tweets is the second lowest. Thus, overall, DNN shows the best performance. The result also shows that the Naïve Bayes classifier has the most confused tweets, 2,126 out of 9,000 tweets during training, while the DNN classifier has only 979 confused tweets in the same test. Detailed confusion matrices of the four algorithms are given in Table 3.8.

Since the DNN classifier performs best among the four tested algorithms, it is used as the sentiment classifier in this thesis. We selected 939,518 tweets related to Apple Inc. (NASDAQ:AAPL) and written in English from the entire dataset. With the trained DNN classifier, 370,918 tweets were classified as positive, and 160,126 and 408,474 tweets were classified as negative and neutral, respectively. Likewise, we categorized the tweets according to the relevant companies, such as Amazon.com (NASDAQ:AMZN), Google Inc. (NASDAQ:GOOGL), Microsoft (NASDAQ:MSFT), and Yahoo! (NASDAQ:YHOO). Table 3.9 shows the classification result for the categorized companies. As the market we aim to forecast in the later chapters is limited to the U.S. stock market, only American companies are considered here.

Table 3.9. Tweet classification result of seven companies in the U.S. stock market.

Company           Total Tweets   Positive   Neutral   Negative
Apple Inc.        939,518        370,918    408,474   160,126
Google Inc.       707,           ,          ,         ,035
Amazon.com        81,369         30,547     32,106    18,176
Microsoft         107,009        25,930     46,125    34,954
Yahoo!            38,642         9,774      23,569    5,299
Bank of America   3,             ,
Citi Bank         1,             ,

3.3.2. User-Weighted Sentiment Analysis Algorithm

Researchers have focused on public sentiment in messages only and have not considered the impact of users. However, user impact plays an important role when information spreads in social networks. On Twitter specifically, a user can follow other users if he or she finds their information interesting and useful, so the number of followers of a user indirectly indicates the authority of the user. For instance, a stock market expert who often posts trustworthy forecasts of the stock market would have many followers waiting for useful investment information. Thus an expert's opinion would spread widely, since the expert has a large potential audience who believe the opinion is useful and trustworthy. In other words, tweets posted by authoritative users are more likely to spread widely. In contrast, information posted by a user without any followers would rarely spread on Twitter.

On Twitter, users can retweet others' tweets if they believe those tweets are useful and/or interesting. In other words, a user is considered authoritative if the tweets posted by the user contain useful and trustworthy information, in which case the tweets would be retweeted by others. Returning to the example, the number of followers does not always guarantee the authority of the user. Although the stock market expert has posted many useful tweets, not all of his tweets are meaningful to others. The expert may say hello to friends or family, and people have no interest in such personal messages. Followers retweet only when they believe a message is useful. Therefore, we can suppose that retweet counts are directly related to the flow of information on Twitter.

Indeed, many researchers have found that the number of retweets is strongly correlated with the authority of the user while the number of followers is only weakly correlated [62], [63], [64], [65]. The best way to verify these earlier findings would be to test them in a similar way by finding the correlation between these parameters. To do so, we would need the full lists of followers and connections of all users to construct the user graph of Twitter. However, our collected dataset does not contain such information because Twitter does not offer it to researchers, in order to protect users' privacy. Thus in this section, we rely on the results of existing research and use the number of retweets as a parameter to calculate the impact of a user.

Although previous research has shown that the number of followers is not directly related to the authority of the user, it can still serve as a criterion for the potential number of retweets. Thus, unlike existing research, we first analyze how the number of followers affects the number of retweets. To determine the correlation between the number of followers and the number of retweets, we randomly choose 30,000 tweets for each of the 14 cases shown in Table 3.10.

Table 3.10. Average number of retweets for each of the 14 follower-count conditions.

Condition   Avg. # of retweet   Condition   Avg. # of retweet
f < f < f < f < f < f < f < f < f < f < f < f < f < f

In each case, tweets with fewer than 5 retweets are filtered out. We also treat the top 10% and bottom 10% of tweets, ordered by retweet count, as noise in the dataset, so the average is calculated only over the middle 80% of tweets in each case. Figure 3.2 shows the correlation between the number of followers and the number of retweets.

Figure 3.2. Correlation between the number of followers and the number of retweets. Although the number of retweets does not increase linearly, the graph shows that it tends to increase as the number of followers increases.

In Figure 3.2, the number of retweets does not increase strictly linearly as the number of followers increases. The retweet count repeatedly rises and falls when the number of followers is greater than 5,000. However, the graph shows that the number of retweets tends to increase as the number of followers increases. Combining this result with previous studies, we conclude that the number of followers of a user can be a weighting factor since it indicates the potential number of retweets.
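The averaging procedure for one follower-count case can be sketched as a trimmed mean. The function name and parameters are hypothetical, and truncating the trim size to a whole number of tweets is an assumption; the thesis does not state how fractional counts were handled.

```python
def trimmed_average_retweets(retweet_counts, min_retweets=5, trim=0.10):
    """Average retweet count over the middle 80% of the qualifying tweets."""
    # Tweets with fewer than min_retweets retweets are filtered out first.
    kept = sorted(c for c in retweet_counts if c >= min_retweets)
    # Assumption: the number of tweets trimmed per side is rounded down.
    cut = int(len(kept) * trim)
    middle = kept[cut:len(kept) - cut]
    return sum(middle) / len(middle) if middle else 0.0


# Hypothetical retweet counts; the single value of 1000 is trimmed as noise.
sample_average = trimmed_average_retweets(
    [1, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1000])
```

Trimming both tails keeps a handful of viral tweets from dominating the average of a case.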

Therefore, the novel method proposed in this subsection takes the retweet count and the number of followers into consideration as sentiment weighting factors. Our improved sentiment analysis algorithm, named User-Weighted Sentiment (UWS) analysis, is based on an existing sentiment analysis algorithm proposed by Skuza et al. [66]. The existing algorithm does not reflect the user's impact on each tweet and gives the same weight to all tweets. Thus, in the existing algorithm the sentiment weight ω of the day d is calculated based only on the total numbers of positive tweets and negative tweets posted on that day, as shown in equation (3.6).

\[ \omega_d = \log \frac{number\_of\_positive\_tweets}{number\_of\_negative\_tweets} \tag{3.6} \]

A stock price is expected to increase if $\omega_d$ is positive; if $\omega_d$ is negative, the stock price is expected to decrease. By improving equation (3.6), the improved sentiment weight ω of the day d, reflecting the user impact φ of the tweet t, is calculated through the proposed equations (3.7), (3.8) and (3.9):

\[ \omega_{retweet} = retweet_t \times \log(1 + followers_u) \tag{3.7} \]

\[ \varphi_t = \log(1 + \omega_{retweet}) \tag{3.8} \]

\[ \omega_d = \log \frac{1 + \varphi_{positive}}{1 + \varphi_{negative}} \tag{3.9} \]

where $retweet_t$ is the retweet count of the tweet t, $followers_u$ is the number of followers of the user u, $\omega_{retweet}$ is the retweet count weighted by $followers_u$, and $\varphi_{positive}$ and $\varphi_{negative}$ are the user impacts of the positive tweets and negative tweets, respectively, that were posted on the day d. In previous studies, the retweet count was driven by the value of the tweet message and was directly related to the user's impact, while the number of followers represented the user's popularity. That is to say, the number of followers indicates only the potential retweet count and is not directly related to the user's impact. Therefore, the logarithm of the number of followers is taken in (3.7) so that it acts as a weight on the retweet count. We add 1 inside every logarithm to avoid taking the logarithm of zero. If $\omega_d$ is positive, the stock price on the day d is expected to rise, and if $\omega_d$ is negative, the stock price is likely to drop.
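Equations (3.6) to (3.9) can be sketched as follows. One assumption is made loudly here: $\varphi_{positive}$ and $\varphi_{negative}$ are taken to be the sums of the per-tweet impacts $\varphi_t$ over the positive and negative tweets of the day, which is one reading of (3.9). The tuple layout and the function names are hypothetical.

```python
import math


def user_impact(retweet_count, followers_count):
    """Per-tweet user impact, equations (3.7) and (3.8)."""
    w_retweet = retweet_count * math.log(1 + followers_count)   # (3.7)
    return math.log(1 + w_retweet)                              # (3.8)


def daily_sentiment_weight(tweets):
    """UWS weight of one day, equation (3.9).

    tweets: list of (label, retweet_count, followers_count) for that day.
    Assumption: impacts are summed per sentiment class before taking the ratio.
    """
    pos = sum(user_impact(r, f) for label, r, f in tweets if label == "positive")
    neg = sum(user_impact(r, f) for label, r, f in tweets if label == "negative")
    return math.log((1 + pos) / (1 + neg))


def baseline_sentiment_weight(tweets):
    """Unweighted weight of one day, equation (3.6)."""
    n_pos = sum(1 for label, _, _ in tweets if label == "positive")
    n_neg = sum(1 for label, _, _ in tweets if label == "negative")
    if n_pos == 0 or n_neg == 0:
        raise ValueError("equation (3.6) is undefined without both classes")
    return math.log(n_pos / n_neg)
```

Under (3.7), a tweet that was never retweeted has zero impact regardless of the follower count, so the weighted measure is driven by tweets that actually spread.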

3.4. Discussion

Twitter is a social network platform that enables users to send and share messages of at most 140 characters, called tweets. In this chapter, we used tweets as the data for analyzing public sentiment, which will be applied to stock prediction models in later chapters. Data collection and data preprocessing were explained first, and then we proposed an improved sentiment analysis algorithm, named UWS.

For sentiment classification, DNN was tested against the three traditional machine learning algorithms introduced in the previous chapter: Naïve Bayes, SVM and MLP. The training result given in this chapter indicated that the Naïve Bayes algorithm has the lowest classification accuracy among the four algorithms while DNN has the highest accuracy. The SVM algorithm was the second most accurate classifier in this test. The biggest improvement was in the classification of Neutral tweets from Naïve Bayes to SVM, and the smallest improvement was in the classification of Negative tweets from Naïve Bayes to DNN.

Our improved sentiment analysis algorithm, UWS, uses the number of retweets as a parameter to calculate the impact of the user, following the results of existing research. Unlike existing research, we also analyzed how the number of followers affects the number of retweets. The result showed that the number of retweets tends to increase as the number of followers increases. Thus we concluded that the number of followers of a user can be a weighting factor since it indicates the potential number of retweets.

In Chapter 4, the proposed sentiment analysis algorithm will be applied to the linear prediction model and compared with the existing algorithm, which does not consider the user's impact as a weighting factor. A comparison between linear and nonlinear predictions will be given in Chapter 5 with a supervised learning algorithm.

CHAPTER 4: TIME SERIES STOCK PREDICTION WITH PUBLIC SENTIMENT

In this chapter, a stock prediction model with public sentiment is presented. To examine whether the Twitter sentiment state correlates with changes in the stock market, Granger causality analysis is applied to the daily time series of Twitter sentiment and the changes in actual stock prices. The number of lagged days of sentiment that shows a significant correlation with the actual changes in stock prices is found through Granger causality analysis. Autoregressive models are used in this chapter as prediction models since past sentiment and historical prices affect the current stock price. The proposed autoregressive prediction model considering Twitter sentiment is compared with a prediction model that reflects historical changes in stock prices only. Further analysis is given in this chapter to find the appropriate amount of sentiment to improve the performance of stock prediction.

4.1. Vector Auto-regression Models

To construct autoregressive models for the daily closing stock prices obtained from Yahoo Finance [67], we first need to understand the basic idea of linear regression. However, as this thesis is not focused on basic machine learning models such as linear regression, we do not treat it in this chapter. Instead, Vector Auto-regression (VAR) is explained briefly in this section, and a detailed explanation of the basic concept is given in Appendix A.

4.1.1. Principle

VAR was proposed by C. A. Sims in 1980 [68]. It is one of the most successful and simplest models for the analysis of multivariate time series. VAR models extend the univariate

Auto-regression (AR) model to dynamic multivariate time series, allowing more than one variable. Unlike structural models with simultaneous equations, VAR models require only a list of variables that can be hypothesized to affect each other with time lags. VAR models are similar to linear regression, but are based on the idea that the output value of the series at time t, $x_t$, depends linearly on its p past values, $x_{t-1}, x_{t-2}, \ldots, x_{t-p}$, where p is the number of past values (lags) used to predict the current value. A VAR model of order p, abbreviated VAR(p), can then be represented as equation (4.1).

\[ x_t = \beta_1 x_{t-1} + \beta_2 x_{t-2} + \cdots + \beta_p x_{t-p} + \varepsilon_t = \sum_{i=1}^{p} \beta_i x_{t-i} + \varepsilon_t \tag{4.1} \]

where $x_t$ is a stationary time series variable, $\beta_1, \beta_2, \ldots, \beta_p$ are time-invariant constants, and $\varepsilon_t$ is white noise. Choosing the lag p in the VAR model is very important since the prediction result depends on the correctness of the selected p [69], [70].

4.1.2. VAR modelling

As one of the ultimate goals of this thesis is to show that Twitter sentiment contributes to improving the accuracy of stock prediction, two different VAR models are prepared for tests in this subsection. The first takes only historical changes in stock prices to set up a structural model. The second VAR model reflects both historical changes in stock prices and the Twitter sentiment of past days as analyzed by the UWS algorithm proposed in the previous chapter. A test will be

done through the later steps of this chapter to determine whether considering Twitter sentiment improves the accuracy of the prediction model.

The time series of actual stock prices, denoted $D_t$, is defined as the daily change in stock price between day t and day (t − 1). To test whether our Twitter sentiment predicts changes in the stock prices, we compare the variance explained by two VAR models as shown in equations (4.2) and (4.3). Equation (4.2), defined as M1, reflects only historical changes in stock prices (p lagged values of D) for prediction, while equation (4.3), defined as M2, considers both historical changes in stock prices and the series of Twitter sentiment (by UWS), denoted $S_t$.

\[ M1: \; D_t = \alpha + \beta_1 D_{t-1} + \beta_2 D_{t-2} + \cdots + \beta_p D_{t-p} + \varepsilon_t = \alpha + \sum_{i=1}^{p} \beta_i D_{t-i} + \varepsilon_t \tag{4.2} \]

\[ M2: \; D_t = \alpha + \beta_1 D_{t-1} + \cdots + \beta_p D_{t-p} + \gamma_1 S_{t-1} + \cdots + \gamma_p S_{t-p} + \varepsilon_t = \alpha + \sum_{i=1}^{p} \beta_i D_{t-i} + \sum_{i=1}^{p} \gamma_i S_{t-i} + \varepsilon_t \tag{4.3} \]

where $\alpha$ is a constant, $\beta = \{\beta_1, \beta_2, \ldots, \beta_p\}$ and $\gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_p\}$ are the parameters of the models ($\beta_p \neq 0$, $\gamma_p \neq 0$), and $\varepsilon_t$ is white noise.
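A minimal sketch of fitting M1 and M2 by ordinary least squares is shown below, written in plain Python so that it runs without external libraries. Ordinary least squares via the normal equations is assumed as the estimator here, and the regressors are assumed not to be collinear; all function names are hypothetical, and a numerical library would normally be preferred for real data.

```python
def lagged_rows(d, s, p, use_sentiment):
    """Build regressors [1, D_{t-1..t-p}, (S_{t-1..t-p})] and targets D_t."""
    rows, targets = [], []
    for t in range(p, len(d)):
        row = [1.0] + [d[t - i] for i in range(1, p + 1)]
        if use_sentiment:                          # M2 adds lagged sentiment
            row += [s[t - i] for i in range(1, p + 1)]
        rows.append(row)
        targets.append(d[t])
    return rows, targets


def solve_least_squares(rows, targets):
    """Solve the normal equations (X^T X) c = X^T y by Gaussian elimination."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    for col in range(k):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for c in range(col, k):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    coef = [0.0] * k
    for i in reversed(range(k)):                   # back substitution
        tail = sum(a[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (b[i] - tail) / a[i][i]
    return coef


def fit_lagged_model(d, s, p, use_sentiment):
    """Return (coefficients, SSR); coefficients are [alpha, beta_1.., gamma_1..]."""
    rows, targets = lagged_rows(d, s, p, use_sentiment)
    coef = solve_least_squares(rows, targets)
    ssr = sum((y - sum(c * x for c, x in zip(coef, row))) ** 2
              for row, y in zip(rows, targets))
    return coef, ssr
```

Calling `fit_lagged_model(d, s, p, False)` and `fit_lagged_model(d, s, p, True)` on the same series gives the residual sums of squares of M1 and M2, which can then be compared.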

4.2. Bivariate Linear Granger Causality Analysis

4.2.1. Principle and Linear Causality Analysis for Stock Market Prediction Models

Granger causality analysis is a statistical hypothesis test proposed by C. W. J. Granger in 1969 [71]. It is based on the assumption that if a time series X is correlated with another time series Y, then changes in X will be useful for predicting future changes in Y. Suppose we have two time series, X and Y, as in Figure 4.1. A movement in X around the 25th day is repeated in Y after 5 days. Another movement in X around the 77th day is also repeated in Y with a time lag of 5 days. Thus future values of Y can be estimated from past values of X.

Figure 4.1. The movements in time series X appear in time series Y with some time lag if X Granger-causes Y. In this figure, X is used with a time lag of 5 days to predict Y.

Therefore, in this chapter, daily changes in Twitter sentiment are tested through Granger causality to determine whether they are useful for predicting the changes in actual stock prices. The test also gives the number of lagged days of Twitter sentiment that shows a statistical correlation with changes in actual stock prices. More formally, we compare the variance explained by two

VAR models: a VAR model reflecting only historical changes in actual stock prices, M1, and a VAR model reflecting both historical changes in actual stock prices and changes in Twitter sentiment, M2. However, we only test whether one time series has predictive information about the other time series; actual causation is not tested here. This approach is motivated by E. Gilbert et al. [72].

The null hypothesis of the test is that the parameters $\gamma$ of the model M2 are equal to zero, so that Twitter sentiment does not predict the actual changes in stock prices. The hypothesis can be written as follows.

\[ H_0: \gamma_i = 0, \quad i = 1, 2, \ldots, p \]

To test Granger causality, the parameters $\beta$ and $\gamma$ of the models M1 and M2 are calculated first. As LSE is used for estimation, the SSR is calculated as in equations (4.4) and (4.5).

\[ SSR_{M1} = \sum_{i=1}^{m} \Big( D_i - \sum_{j=1}^{p} \beta_j D_{i-j} \Big)^2 \tag{4.4} \]

\[ SSR_{M2} = \sum_{i=1}^{m} \Big( D_i - \sum_{j=1}^{p} \beta_j D_{i-j} - \sum_{j=1}^{p} \gamma_j S_{i-j} \Big)^2 \tag{4.5} \]

The test statistic is given by equation (4.6), and the result is compared with a critical value of the F distribution with (p, m − 2p) degrees of freedom to reject the

hypothesis $H_0$ [73]. Just as a critical value of the F distribution can be calculated from a p-value $v$, we can conversely obtain the p-value from the value of the test statistic by equation (4.7) [74], where $F_{p,\,m-2p}(\cdot)$ is the cumulative distribution function of the F distribution.

$$T = \frac{(SSR_{M1} - SSR_{M2})/p}{SSR_{M2}/(m - 2p)} \sim F_{p,\,m-2p} \tag{4.6}$$

$$v = 1 - F_{p,\,m-2p}\!\left( \frac{(SSR_{M1} - SSR_{M2})/p}{SSR_{M2}/(m - 2p)} \right) \tag{4.7}$$

The choice of the number of lags, p, strongly affects the test result. Thus we test with p varying from 1 to 10. The test is also repeated for the stock prices of five companies: Apple (AAPL), Google (GOOGL), Amazon (AMZN), Microsoft (MSFT), and Yahoo (YHOO). Based on the results of Granger causality shown in Table 4.1, we can reject the hypothesis that Twitter sentiment does not predict the actual stock prices. In Table 4.1, the result for AAPL indicates that our sentiment has a Granger causality relation with the actual stock prices for lags ranging from 1 to 7 days. GOOGL shows the strongest Granger causality relations with actual changes of stock prices for lags 3 and 4, and Twitter sentiment of AMZN is related to changes of actual stock prices for lags 3 and 5. MSFT has meaningful p-values for lags 1 to 3. YHOO has a notable p-value for lag 3, although it is significantly related to the actual stock prices only when the lag is 4.
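The test statistic of equations (4.4) to (4.6) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this thesis: the function names `solve_linear`, `ssr` and `granger_f` are hypothetical, the models are fitted without an intercept exactly as (4.4) and (4.5) are written, and the least-squares fit uses plain normal equations for readability.

```python
def solve_linear(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ssr(rows, targets):
    """Least-squares fit (normal equations); return the sum of squared residuals."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    coef = solve_linear(xtx, xty)
    return sum(
        (t - sum(c * v for c, v in zip(coef, r))) ** 2
        for r, t in zip(rows, targets)
    )


def granger_f(d, s, p):
    """F statistic of (4.6) for H0: the p lags of s do not help to predict d."""
    rows_m1, rows_m2, targets = [], [], []
    for t in range(p, len(d)):
        d_lags = [d[t - j] for j in range(1, p + 1)]
        s_lags = [s[t - j] for j in range(1, p + 1)]
        rows_m1.append(d_lags)           # M1: own history only
        rows_m2.append(d_lags + s_lags)  # M2: own history plus sentiment
        targets.append(d[t])
    m = len(targets)
    ssr_m1, ssr_m2 = ssr(rows_m1, targets), ssr(rows_m2, targets)
    return ((ssr_m1 - ssr_m2) / p) / (ssr_m2 / (m - 2 * p))
```

The p-value of (4.7) is the upper tail of the F distribution with (p, m - 2p) degrees of freedom evaluated at this statistic; with SciPy installed it could be obtained as `scipy.stats.f.sf(T, p, m - 2 * p)`.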

Table 4.1. Granger causality analysis between Twitter sentiment and changes of actual stock prices of five companies: Apple (AAPL), Google (GOOGL), Amazon (AMZN), Microsoft (MSFT) and Yahoo (YHOO) in the period October 1, 2012 to February 28, 2013.

Lag (p) | p-value: AAPL, GOOGL, AMZN, MSFT, YHOO

* p-value < 0.05, ** p-value < 0.10

Therefore, our linear prediction model with lag p = 3 is given by equation (4.8).

$$D_t = \alpha + \sum_{i=1}^{3} \beta_i D_{t-i} + \sum_{i=1}^{3} \gamma_i S_{t-i} + \varepsilon_t \tag{4.8}$$

Since the p-value for lag 3 in the YHOO time series is greater than 0.1, we consider only four companies, AAPL, GOOGL, AMZN, and MSFT, in the later analysis. To visualize the correlation between Twitter sentiment and the changes of the actual stock prices, we plot both time series for the four companies other than YHOO in Figure 4.2. The values of both time series plotted

in Figure 4.2 are normalized to z-scores to keep them on the same scale. The Z-score normalization is described in the following steps. As the stock market is closed on weekends and holidays, we do not have actual values for those days. Instead, we linearly interpolate the time series of actual stock prices to fill the blanks.

Z-score normalization

The time series for Granger causality analysis should be stationary. However, the data obtained from Twitter and the changes of the actual stock prices are not stationary, so we need to normalize the data prior to Granger causality analysis. Inspired by Bollen et al. [8], we use Z-score normalization based on the local mean and standard deviation within a sliding window of $\pm\delta$ days. The equations for Z-score normalization are given below.

Z-score:
$$Z_{D_t} = \frac{D_t - \bar{D}_{t\pm\delta}}{\sigma_{D_{t\pm\delta}}} \tag{4.9}$$

Local mean:
$$\bar{D}_{t\pm\delta} = \frac{1}{2\delta + 1} \sum_{i=t-\delta}^{t+\delta} D_i \tag{4.10}$$

Standard deviation:
$$\sigma_{D_{t\pm\delta}} = \sqrt{\frac{1}{2\delta} \sum_{i=-\delta}^{\delta} \left( D_{t+i} - \bar{D}_{t\pm\delta} \right)^2} \tag{4.11}$$
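As an illustration of equations (4.9) to (4.11), the local normalization could be sketched as follows. The function name `local_z_scores` is hypothetical, and two choices are assumptions that the text does not specify: points closer than $\delta$ to either end of the series are skipped, and a window with zero variance is given a z-score of 0.

```python
import math


def local_z_scores(series, delta):
    """Z-score each point against its window [t - delta, t + delta].

    Follows (4.9)-(4.11): local mean over 2*delta + 1 points and a
    standard deviation with divisor 2*delta.
    Assumption: points without a complete window are skipped.
    """
    scores = []
    for t in range(delta, len(series) - delta):
        window = series[t - delta:t + delta + 1]
        mean = sum(window) / (2 * delta + 1)
        var = sum((v - mean) ** 2 for v in window) / (2 * delta)
        std = math.sqrt(var)
        # Assumption: a flat window yields 0 instead of a division by zero.
        scores.append((series[t] - mean) / std if std > 0 else 0.0)
    return scores
```

For a series of length n and window half-width $\delta$, the sketch returns n - 2$\delta$ normalized values, one for each point that has a full window around it.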

(a) Apple Inc. (NASDAQ:AAPL) (b) Google Inc. (NASDAQ:GOOGL) (c) Amazon.com, Inc. (NASDAQ:AMZN) (d) Microsoft Corporation (NASDAQ:MSFT)

Figure 4.2. Panels of two graphs for four companies: (a) AAPL, (b) GOOGL, (c) AMZN and (d) MSFT. The top graph for each company shows the changes of actual stock prices with Z-score normalization, and the bottom graph for each company shows Twitter sentiment with Z-score normalization, lagged by 3 days.

Prediction Power Comparison for UWS and Existing Algorithm

In the previous chapter, we proposed an improved Twitter sentiment analysis algorithm, named UWS. The major difference between the proposed algorithm and the existing sentiment analysis algorithm is that the user impact is reflected when calculating the sentiment weight of each tweet. To test whether the proposed algorithm performs better, we compare the performance of the UWS algorithm with that of the existing algorithm introduced in the previous chapter. To do so, we prepare two VAR models: the first, M2, reflects both historical changes in the stock prices and Twitter sentiment of past days calculated by the UWS algorithm; the second, M3, also reflects both historical changes in the stock prices and Twitter sentiment of past days, but calculates the Twitter sentiment with the existing sentiment analysis algorithm. The test in this subsection shows how much the proposed algorithm improves the prediction accuracy of the stock market compared to the existing algorithm. As derived from (4.2) and (4.3), the time series of actual stock prices, denoted $D_t$, is defined as the daily change in stock price between day $t$ and day $(t - 1)$. Equation (4.13), defined as M2, reflects both historical changes of the stock prices and the series of Twitter sentiment calculated by the proposed UWS algorithm, denoted $S_t$. Equation (4.14), defined as M3, considers both historical changes of the stock prices and the series of Twitter sentiment calculated by the existing sentiment analysis algorithm, denoted $ES_t$.

$$\text{M2:}\quad D_t = \alpha + \beta_1 D_{t-1} + \cdots + \beta_p D_{t-p} + \gamma_1 S_{t-1} + \cdots + \gamma_p S_{t-p} + \varepsilon_t = \alpha + \sum_{i=1}^{p} \beta_i D_{t-i} + \sum_{i=1}^{p} \gamma_i S_{t-i} + \varepsilon_t \tag{4.13}$$

$$\text{M3:}\quad D_t = \tau + \phi_1 D_{t-1} + \cdots + \phi_p D_{t-p} + \omega_1 ES_{t-1} + \cdots + \omega_p ES_{t-p} + \varepsilon_t = \tau + \sum_{i=1}^{p} \phi_i D_{t-i} + \sum_{i=1}^{p} \omega_i ES_{t-i} + \varepsilon_t \tag{4.14}$$

As in subsection 4.2.1, Granger causality between Twitter sentiment from the existing sentiment analysis algorithm and the actual stock prices is tested. In Table 4.2, the result for AAPL indicates that UWS sentiment has a Granger causality relation with the actual stock prices for lags ranging from 1 to 7 days, while sentiment calculated by the existing sentiment analysis algorithm is related to the actual stock prices only for lags ranging from 2 to 5 days. GOOGL shows the strongest Granger causality relations with actual changes of the stock prices for lags 3 and 4 when using UWS, but the existing algorithm shows a relation with the actual stock prices only for lag 3. For the existing sentiment analysis algorithm, neither AMZN nor YHOO shows any Granger causality relation with the actual changes of the stock prices. MSFT with UWS has meaningful p-values for lags 1 to 3, while with the existing sentiment analysis algorithm it has slightly higher but still meaningful p-values for lags 1 and 3. Although Twitter sentiment calculated by the existing algorithm shows weaker Granger causality than UWS, the result still indicates that the lag of 3 days shows the strongest relation.
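To make the index convention of (4.13) and (4.14) concrete, a one-step forecast from already estimated parameters could be sketched as below. The function name `predict_next` and its argument layout are hypothetical; the noise term $\varepsilon_t$ is dropped, and the same function covers M3 when the sentiment history holds $ES_t$ and the parameters are $\tau$, $\phi$ and $\omega$.

```python
def predict_next(alpha, beta, gamma, d_hist, s_hist):
    """One-step forecast of D_t from (4.13), without the noise term.

    beta[i - 1] and gamma[i - 1] multiply D_{t-i} and S_{t-i};
    d_hist and s_hist are ordered oldest first, so index -i is lag i.
    """
    p = len(beta)
    forecast = alpha
    for i in range(1, p + 1):
        forecast += beta[i - 1] * d_hist[-i] + gamma[i - 1] * s_hist[-i]
    return forecast
```

With p = 3, only the last three entries of each history are used, which matches the lag selected by the Granger causality analysis.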

Table 4.2. Granger causality analysis between Twitter sentiment calculated by two sentiment analysis algorithms (UWS and the existing sentiment analysis algorithm) and changes of actual stock prices of five companies.

Lag (p) | p-value: AAPL, GOOGL, AMZN, MSFT, YHOO

* p-value < 0.05 ** p-value <

Granger Causality Analysis between Authoritative Users and Others

From the result in the previous subsection, we concluded that using the proposed UWS algorithm for sentiment analysis improves the linear prediction accuracy of the stock market. In other words, UWS has more prediction power than the existing algorithm. How, then, does reflecting the user's impact yield more prediction power? Is there any causal relationship between authoritative users and the others? To find the reason for the improvement of the prediction accuracy with UWS, we test Granger causality between authoritative users and the other users, assuming that users having higher authority affect the other users' opinions and transfer activity. Granger causality between the two groups of users is tested with the sentiment weights calculated by equations (3.7) - (3.9). We rank all the users by sentiment weight, and treat the top 1%, top 5%, and top 10% of the users as authoritative users. Based on our experience and on [6], an authoritative user should have at least 100 retweets and 1,000 followers. Granger causality is tested only on two companies, AAPL and GOOGL, for lags of up to three days after they announce their quarterly earnings reports. This period is selected because the number of tweets and the number of authoritative users who post tweets about these companies increase dramatically, so a causal relationship test between the two groups of users can support our hypothesis that the authoritative users affect the other users on Twitter when such events occur. Figure 4.3 shows the number of top 10% authoritative users during our training period. In the figure below, the number of authoritative users increases dramatically when the companies announce their quarterly earnings reports.
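The selection of authoritative users by rank can be sketched as follows. This is a minimal illustration only: the function name `top_percent` and the dictionary of user weights are hypothetical, and rounding the group size up is an assumption, since the text does not state how fractional group sizes are handled.

```python
import math


def top_percent(user_weights, percent):
    """Return the user IDs in the top `percent` of users, ranked by weight."""
    ranked = sorted(user_weights, key=user_weights.get, reverse=True)
    # Assumption: round up, so a non-empty input always yields at least one user.
    k = math.ceil(len(ranked) * percent / 100)
    return ranked[:k]
```

Calling the function with 1, 5 and 10 gives the three nested groups used in the test, where each smaller group is a subset of the larger one.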

(a) AAPL (b) GOOGL

Figure 4.3. Graphs for two companies, AAPL and GOOGL, showing the number of top 10% authoritative users. Both the number of daily tweets and the number of authoritative users increase when the companies announce their quarterly earnings reports.

As significantly more authoritative users mention the companies right after the companies announce their quarterly reports, we choose the announcement days to analyze Granger causality. In this test, we set the unit lag length to 0.5 day, so consecutive lags are 12 hours apart. Table 4.3 shows the result of the Granger causality analysis between the authoritative users and the other users for the two companies, AAPL and GOOGL. As people discuss many different topics about the companies on Twitter, we limit the topic of tweets to the stock market only.

Table 4.3. Granger causality analysis between the authoritative users and the other users for two companies: AAPL and GOOGL. The authoritative users are filtered in three different ranges: top 1%, top 5%, and top 10% of users ranked by user weight.

Lag (in steps of 12 hours) | p-value: AAPL (Top 1%, Top 5%, Top 10%), GOOGL (Top 1%, Top 5%, Top 10%)

* p-value < 0.05 ** p-value <

In Table 4.3, the sentiment of the top 10% of authoritative users on both companies affects the other users' sentiment for all lag lengths. With the lags of 12 hours and 24 hours in particular, authoritative users have a strong causal relationship with the other users. When we limit the authoritative users to the top 5% of all daily users, their sentiment has weaker prediction power than that of the top 10%. However, the result shows that authoritative users still affect the other users with a lag of 36 hours. Lastly, for the top 1% of all daily users, authoritative users temporarily affect the other users with a lag of one day. The result indicates that authoritative users' sentiment affects the other users after a big event happens, with a lag of 3 days. Why, then, does the causal relationship between the two groups of users become stronger as the number of authoritative users increases? This phenomenon can be explained by sample size determination. In statistics, a larger sample size leads to higher precision when estimating unknown parameters. Suppose we have a graph of Twitter users as shown in Figure 4.4. In (a), the red points indicate the authoritative users and the yellow points represent general users. In this thesis, we assume that information in the Twitter environment flows from the authoritative users to the other users. Can the two authoritative users in the dashed red box then represent all the other authoritative users? If all of the authoritative users have the same opinion, this is possible. However, they do not always share the same opinion, so the opinion of the two users may differ from that of the other six authoritative users. Therefore, we need more authoritative users when calculating the overall sentiment to bring it closer to the sentiment of all of the authoritative users.

(a) (b)

Figure 4.4. Graphs of Twitter users. The red points are the authoritative users who affect the other users, and the yellow points are general users. Information in the Twitter environment flows from the authoritative users to general users.

The problem is more pronounced if the graph is not fully connected and has isolated groups of nodes, as shown in (b). Unlike the two selected authoritative users in the left figure, the selected users in the right figure are totally isolated from the other nodes. Information in the Twitter environment has less chance of flowing to the isolated groups because of the structure of Twitter. Therefore, these isolated groups have a higher probability of holding their own opinion, which may differ from the overall Twitter sentiment. For this reason, we conclude that the opinion of the top 10% of users Granger-causes the opinion of the other users with stronger prediction power. Figure 4.5 shows fuzzy surface graphs of the Granger causality test results in Table 4.3 for the two companies. The z-axis value is converted to 1 - (p-value), since a lower p-value indicates a stronger relation.

(a) AAPL (b) GOOGL

Figure 4.5. Fuzzy surface graphs of the Granger causality test results for the two companies: AAPL and GOOGL. The z-axis is converted to 1 - (p-value), since a lower p-value indicates a stronger relation.

4.3. Experiments on Linear Stock Market Prediction

Visualization

In this section, we first simulate the prediction of the stock prices of four selected companies, AAPL, GOOGL, AMZN, and MSFT, with two models: M1 and M2. NASDAQ closing prices of the four companies between October 3, 2012 and March 28, 2013 are used. The trained models predict stock prices for March 2013, and the results are compared with the actual stock prices. Thus we have a dataset of 143 days for training and 24 days for prediction, as represented in Figure 4.6.

Figure 4.6. A graph of AAPL price movement from October 3, 2012 to March 28, 2013. We train the models with data ranging from October 3, 2012 to February 28, 2013 and test the models on the following month, which is drawn as a dashed red line in the figure.

The stock market is closed on weekends and holidays, so there are no data for these days. However, we need continuous time series since we will train our models with data of the past

three days, the optimal lag we found through Granger causality. Therefore, we linearly interpolate the time series of actual stock prices to fill the blanks. For a gap of n consecutive missing values after the point t, the linear interpolation is calculated as in equation (4.15),

$$\mathrm{Stock}_{t+m} = \mathrm{Stock}_t + m\,\frac{\mathrm{Stock}_{t+n+1} - \mathrm{Stock}_t}{n + 1} \tag{4.15}$$

where m indicates the m-th point after the point t. The simulated stock prediction is visualized in Figure 4.7. In Figure 4.7, the top graph for each company shows an overlap of the actual stock prices of March 2013 (black), the stock prices estimated with M1 (blue) and the stock prices estimated with M2 (red). The middle graph and the bottom graph show which model predicts closer to the actual price movements. All four results indicate that the proposed M2 predicts closer to the actual stock prices.

Prediction Accuracy Measurement

To compare the performance of the two models, M1 and M2, two types of prediction accuracy are measured: the average mean absolute percentage error (MAPE) and the direction accuracy (up/down) during the prediction period (March 1, 2013 to March 28, 2013). In Table 4.4, adding Twitter sentiment to AAPL leads to the best improvement in MAPE (2.13% to 1.32%), and GOOGL improves the direction accuracy the most (45.83% to 79.17%). For all four companies, adding Twitter sentiment improves prediction performance in both MAPE and direction accuracy. The average prediction accuracies of M1 are 53.13% in direction accuracy and 1.87% in MAPE, while those of M2 are 73.96% in direction accuracy and 1.37% in MAPE.
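The gap-filling rule of equation (4.15), which draws a straight line between the last known price before a gap and the first known price after it, could be sketched as follows. The function name `fill_gaps` and the use of `None` for missing days are hypothetical choices; gaps at the very start or end of the series have only one known neighbour and are left unfilled, which is an assumption not covered by the text.

```python
def fill_gaps(prices):
    """Fill runs of None between two known prices following (4.15)."""
    filled = list(prices)
    t = 0
    while t < len(filled):
        if filled[t] is None:
            t += 1  # leading gap: no known price on the left, leave as is
            continue
        nxt = t + 1
        while nxt < len(filled) and filled[nxt] is None:
            nxt += 1
        if nxt < len(filled):
            n = nxt - t - 1  # number of consecutive missing points
            step = (filled[nxt] - filled[t]) / (n + 1)
            for m in range(1, n + 1):
                filled[t + m] = filled[t] + m * step
        t = nxt
    return filled
```

For example, a Friday close of 10 followed by a two-day weekend and a Monday close of 16 gives the filled values 12 and 14 for Saturday and Sunday.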

Table 4.4. Two types of prediction accuracy for models M1 and M2, measured in terms of the average mean absolute percentage error (MAPE) and the direction accuracy (up/down). Prediction of AAPL using M2 shows the best performance in MAPE, and GOOGL using the same method performs best in direction accuracy. M2 performs better than M1 for the predictions of all four companies.

Stock           Measurement     M1       M2
NASDAQ:AAPL     Direction (%)   %        %
                MAPE (%)        2.13%    1.32%
NASDAQ:GOOGL    Direction (%)   %        %
                MAPE (%)        1.77%    1.49%
NASDAQ:AMZN     Direction (%)   %        %
                MAPE (%)        2.15%    1.42%
NASDAQ:MSFT     Direction (%)   %        %
                MAPE (%)        1.42%    1.25%
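The two accuracy measures reported in Table 4.4 can be sketched as below. The function names `mape` and `direction_accuracy` are hypothetical. MAPE follows its standard definition; the exact definition of direction accuracy is not spelled out in the text, so the sketch assumes that a prediction counts as correct when the predicted move relative to the previous actual price has the same sign as the actual move.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def direction_accuracy(actual, predicted):
    """Share of days (in percent) with a correctly predicted up/down move.

    Assumption: both moves are measured from the previous actual price.
    """
    def sign(v):
        return (v > 0) - (v < 0)

    hits = sum(
        sign(predicted[t] - actual[t - 1]) == sign(actual[t] - actual[t - 1])
        for t in range(1, len(actual))
    )
    return 100.0 * hits / (len(actual) - 1)
```

Under this definition, a predicted price equal to the previous actual price counts as a miss whenever the actual price moved.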

(a) AAPL (b) GOOGL (c) AMZN (d) MSFT

Figure 4.7. Panels of three graphs for four companies: (a) AAPL, (b) GOOGL, (c) AMZN and (d) MSFT. The top graph shows the overlap of the actual stock prices of March 2013 (black), the stock prices estimated with M1 (blue) and the stock prices estimated with M2 (red). All four results indicate that the proposed M2 predicts closer to the actual stock prices.

Prediction Accuracy Comparison for UWS and the Existing Algorithm

To compare the performance of the two models M2 and M3, as in the comparison of M1 and M2 in subsection 4.4.2, two types of prediction accuracy are measured under the same conditions, in terms of MAPE and the direction accuracy (up/down), over the same prediction period as the comparison of M1 and M2. In Table 4.5, using UWS Twitter sentiment (M2) for AMZN leads to the best improvement in both MAPE (1.88% to 1.42%) and direction accuracy (45.83% to 70.83%). For the other three companies as well, UWS Twitter sentiment gives better prediction performance in both measurements than the existing algorithm (M3). The average prediction accuracies of M2 are 73.96% in direction accuracy and 1.37% in MAPE, while those of M3 are 60.42% in direction accuracy and 1.73% in MAPE.

Table 4.5. Two types of prediction accuracy for models M2 and M3, measured in terms of the average mean absolute percentage error (MAPE) and the direction accuracy (up/down). Prediction of AMZN using M2 shows the best performance in both direction accuracy and MAPE. However, M3 is still more accurate than M1.

Stock           Measurement     M2       M3
NASDAQ:AAPL     Direction (%)   %        %
                MAPE (%)        1.32%    1.64%
NASDAQ:GOOGL    Direction (%)   %        %
                MAPE (%)        1.49%    1.73%
NASDAQ:AMZN     Direction (%)   %        %
                MAPE (%)        1.42%    %
NASDAQ:MSFT     Direction (%)   %        %
                MAPE (%)        1.25%    1.68%

Number of Tweets per Day for Improving Linear Prediction

In the previous chapter, we classified tweets by seven companies in the U.S. market. As reviewed in Table 3.9, 939,518 tweets are classified as relevant to AAPL and 707,874 tweets are categorized as GOOGL. MSFT has the third most relevant tweets, and both AMZN (81,369) and YHOO (38,642) have at least 30,000 relevant tweets. However, Bank of America (NYSE:BAC) and Citi Bank (NYSE:C) have fewer than 10,000 tweets during the entire period (October 1, 2012 to March 28, 2013). When predicting stock prices of companies having a small number of tweets, we consider two main problems: a) the average number of tweets per day is too small, so Twitter sentiment cannot represent public sentiment, and b) the time series of tweet sentiment can be discontinuous. Thus, in this subsection, a test is performed to find out how many tweets per day are needed to improve the prediction. To find the minimum number of tweets required to improve the stock prediction, we use the 939,518 tweets categorized as AAPL. We vary the number of tweets per day from 10 to 3000. Our proposed UWS algorithm is used for Twitter sentiment analysis, and the VAR model M2 is used for training in this test. For the test, two types of prediction accuracy are measured again in terms of MAPE and the direction accuracy (up/down) during the same prediction period. The result given in Figure 4.8 shows the direction accuracy of AAPL stock prediction with the number of tweets per day varying from 10 to 3000. The direction accuracy of stock prediction is 79.17% when using the entire 939,518 tweets in Table 4.4. The graph indicates that the direction accuracy is lower than 70% while the number of tweets per day is less than


Figure 4.8. The graph shows the direction accuracy of AAPL stock prediction with the number of tweets per day varying from 10 to 3000. The direction accuracy of AAPL when using the entire set of tweets was 79.17% in Table 4.4. The graph indicates that the direction accuracy approaches its maximum when there are at least 2500 tweets per day.

However, it approaches its maximum accuracy of 79.17% after reaching 2000 tweets per day. There is a small decrease in direction accuracy when the number of tweets per day increases from 2500 to 3000, but the direction accuracy remains stable. Although the accuracy of M2 is higher than that of M1 when the number of tweets per day is greater than 1000, it is not stable and keeps increasing as the number of tweets per day increases. Thus we can conclude that at least 2500 tweets per day are required to improve the direction accuracy. The result in Figure 4.9 shows the MAPE of AAPL stock prediction with the number of tweets per day varying from 10 to 3000. The MAPE of stock prediction is 1.32% when using the entire 939,518 tweets in Table 4.4. The graph indicates that MAPE is greater than 1.90% while the number of tweets per day is less than MAPE of M2 is lower than that of M1 when the number of tweets per day is greater than 500, but it is not stable and the error decreases when the

Figure 4.9. The graph shows MAPE of AAPL stock prediction with varying the number of tweets per day from 10 to 3000. MAPE of AAPL when using entire tweets was 1.32% in Table 4.4. The graph indicates that MAPE approaches its minimum error when having at least 2000 tweets per day.

number of tweets per day increases. Thus we can conclude that at least 2000 tweets per day are required to improve MAPE.

4.4. Discussion

In this chapter, we constructed VAR models for daily closing stock prices obtained from Yahoo Finance. Two autoregressive models were proposed and tested to examine whether Twitter sentiment improves the prediction accuracy of the stock market. Granger causality analysis was performed for four companies, AAPL, GOOGL, AMZN, and MSFT, with the number of lags p varying from 1 to 10. From the Granger causality analysis, we concluded that Twitter sentiment of the past 3 days is highly correlated with the actual changes of the stock prices. To support the proposed algorithm, Granger causality between authoritative users and the other users was tested under the assumption that users having higher authority affect the other users' opinions and transfer activity. From our result, for the top 1% of all daily users, authoritative users temporarily affected the other users with a lag of one day. For the top 10% of authoritative users, their sentiment on both companies affected the other users for all lag lengths. With the lags of 12 hours and 24 hours in particular, authoritative users had a strong causal relationship with the other users. Thus we concluded that authoritative users' sentiment affects the other users after a big event happens, with a lag of 3 days. The prediction was simulated for the four companies using the models M1 and M2. Two types of prediction accuracy were measured in terms of the average mean absolute percentage error (MAPE) and the direction accuracy (up/down) during the prediction period (March 1, 2013 to March 28, 2013). Adding Twitter sentiment to AAPL led to the best improvements in both MAPE values and direction accuracy. Moreover, Twitter sentiment improved prediction performance in both MAPE and direction accuracy for all of the other three companies. The average prediction accuracies of M1 were 53.13% in direction accuracy and 1.87% in MAPE, while those of M2 were 73.96% in direction accuracy and 1.37% in MAPE. Thus we

can conclude that the time series of Twitter sentiment improves the prediction of the future stock market with a lag of 3 days when using VAR models. The prediction was simulated again for the four companies using the models M2 and M3. As in the comparison of M1 and M2, two types of prediction accuracy were measured in terms of MAPE and the direction accuracy (up/down). Using UWS Twitter sentiment (M2) for AMZN led to the best improvements in both MAPE (1.88% to 1.42%) and direction accuracy (45.83% to 70.83%). For the other three companies, UWS Twitter sentiment performed better in both MAPE and direction accuracy than the existing sentiment analysis algorithm (M3). The average prediction accuracies of M2 were 73.96% in direction accuracy and 1.37% in MAPE, while those of M3 were 60.42% in direction accuracy and 1.73% in MAPE. The number of tweets per day needed to improve prediction accuracy was also tested in this chapter. In the direction accuracy test, the accuracy of M2 was higher than that of M1 when the number of tweets per day was greater than 1000. However, it was not stable and kept increasing as the number of tweets per day increased. The result indicated that at least 2,500 daily tweets were required to improve the direction accuracy. In the second test, the MAPE of M2 was lower than that of M1 when the number of tweets per day was greater than 500. However, it was not stable and the error kept decreasing as the number of tweets per day increased. Thus we concluded that at least 2,000 tweets per day are required to improve MAPE.

CHAPTER 5: TIME SERIES NON-LINEAR STOCK PREDICTION WITH PUBLIC SENTIMENT

In the previous chapter, a linear stock prediction model with public sentiment was presented. Granger causality analysis was applied to the daily time series of Twitter sentiment and the changes of the actual stock prices to find out whether the sentiment state linearly correlates with the actual stock market. VAR models were tested for linear stock prediction with and without Twitter sentiment as an input. In this chapter, non-linear Granger causality analysis is similarly applied to the daily time series of Twitter sentiment and the changes of the actual stock prices to find out whether the sentiment state non-linearly correlates with the actual stock market. For the nonlinear prediction, SVM models are used as prediction models. Following the same procedure as in the previous chapter, a proposed SVM prediction model considering Twitter sentiment is compared with a prediction model which only reflects historical changes of the stock prices. The appropriate amount of sentiment for the nonlinear prediction model is also determined to improve its performance.

Support Vector Regression (SVR) Models

As this thesis is not focused on basic machine learning models, the basic idea of SVM is not discussed in this chapter. Instead, in this section, we explain Support Vector Regression (SVR); a detailed explanation of the basic concept is given in Appendix A.2. Support Vector Regression (SVR) models are SVM models specifically designed to solve regression problems. Thus the main principle of SVR is the same as that of SVM classification

models, but SVR minimizes a different optimization function. Like subsection 5.1.1, suppose that we have a set of $n$ data points as below:

$$P_i = (X_i, y_i), \quad i = 1, 2, \ldots, n, \quad y_i \in \mathbb{R}, \; X_i \in \mathbb{R}^d$$

where $X_i$ is the input vector and $y_i$ is the expected output. Then, SVR models estimate the function using (5.1),

$$f(x) = w^T \varphi(x) + \beta \tag{5.1}$$

where $\varphi(x)$ represents the high-dimensional feature space which is nonlinearly mapped from the input space $X$. To estimate proper $w$ and $\beta$, we use the $\varepsilon$-insensitive loss function (5.3) and formulate the optimization problem (5.2).

$$\min_{w,\beta,\xi,\xi^*} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$
$$\text{s.t.} \quad y_i - w^T \varphi(x_i) - \beta \le \varepsilon + \xi_i, \tag{5.2}$$
$$w^T \varphi(x_i) + \beta - y_i \le \varepsilon + \xi_i^*,$$
$$\xi_i \ge 0, \; \xi_i^* \ge 0 \quad \text{for } 1 \le i \le n$$

$$L_\varepsilon(Y, f(x)) = \begin{cases} |Y - f(x)| - \varepsilon & \text{if } |Y - f(x)| \ge \varepsilon \\ 0 & \text{otherwise} \end{cases} \tag{5.3}$$

Finally, the equation (5.1) has the explicit form (5.4) as below.

$$f(x) = \sum_{i=1}^{n} (a_i - a_i^*) K(X_i, X_j) + \beta \tag{5.4}$$

where $a_i$ and $a_i^*$ are non-negative Lagrange multipliers. The Lagrange multipliers in (5.4) satisfy the equality $a_i \cdot a_i^* = 0$, and the values of the multipliers are obtained by maximizing the dual function (5.5) [75].

$$\max_{a,a^*} \; \sum_{i=1}^{n} y_i (a_i - a_i^*) - \varepsilon \sum_{i=1}^{n} (a_i + a_i^*) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (a_i - a_i^*)(a_j - a_j^*) K(X_i, X_j)$$
$$\text{s.t.} \quad \sum_{i=1}^{n} (a_i - a_i^*) = 0, \tag{5.5}$$
$$0 \le a_i \le C, \; 0 \le a_i^* \le C \quad \text{for } 1 \le i \le n$$

where $K(X_i, X_j)$ is the kernel function of the model, which satisfies $K(X_i, X_j) = \varphi(X_i)^T \varphi(X_j)$. Any function can be used as the kernel function if it satisfies Mercer's condition [76]. In this thesis, the (Gaussian) Radial Basis Function (RBF) [77], [78] is used as the kernel function for the SVR prediction model, since it is widely used in financial time-series analysis due to its good performance under limited conditions [79], [80]. The RBF kernel is expressed as (5.6),

$$K(X_i, X_j) = \exp\!\left( -\frac{\lVert X_i - X_j \rVert^2}{2\sigma^2} \right) \tag{5.6}$$

where $\sigma$ is a free parameter and $\lVert X_i - X_j \rVert^2$ is the squared Euclidean distance.
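The building blocks of (5.3), (5.4) and (5.6) can be illustrated with a short sketch. The function names below are hypothetical, and the sketch only evaluates a model whose multipliers have already been obtained from the dual problem (5.5); it does not solve that problem. The argument `dual_coefs` is assumed to hold the differences $a_i - a_i^*$, and the kernel in (5.4) is evaluated between each training vector and the query point.

```python
import math


def eps_insensitive_loss(y, f_x, eps):
    """Loss (5.3): errors smaller than eps cost nothing."""
    return max(abs(y - f_x) - eps, 0.0)


def rbf_kernel(xi, xj, sigma):
    """Gaussian RBF kernel (5.6) with free parameter sigma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def svr_predict(x, support_vectors, dual_coefs, bias, sigma):
    """Evaluate (5.4): dual_coefs[i] stands for a_i - a_i^*."""
    return bias + sum(
        c * rbf_kernel(sv, x, sigma)
        for sv, c in zip(support_vectors, dual_coefs)
    )
```

Training vectors whose coefficient $a_i - a_i^*$ is zero do not contribute to the sum, so in practice only the support vectors need to be stored.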

5.2. Non-Linear Granger Causality Analysis

Principle

In the previous chapter, traditional Granger causality analysis was applied to the daily time series of Twitter sentiment and the changes of the actual stock prices to find the lagged days of sentiment that show a significant correlation with the actual changes of the stock prices. However, traditional Granger causality analysis uses a linear approach to causality testing, so the analysis can have low power to detect nonlinear causality between two sequences. To solve this problem, E. G. Baek et al. [81] proposed a nonparametric statistical method to find nonlinear causality between two sequences. In that research, the correlation integral approach was used as an estimator of spatial dependence across time. Later, based on Baek and Brock's method, C. Hiemstra et al. [82] proposed an improved method that allows each series to display weak temporal dependence. In this research, we use Hiemstra's method to test nonlinear Granger causality and to find the lagged days of sentiment that show a significant correlation with the actual changes of the stock prices. Suppose we have two strictly stationary and weakly dependent time series: $X$ and $Y$. The data points of each series at time $t$ can be represented as $x_t$ and $y_t$. Let $X_t^n$ be the $n$-length lead vector of $X$ as in equation (5.7),

$$X_t^n = \{ x_t, x_{t+1}, \ldots, x_{t+n-1} \} \tag{5.7}$$

Then for the length $n$, the time series $Y$ does not Granger-cause the time series $X$ if the equation (5.8) is satisfied.

$$P\!\left( \lVert X_t^n - X_s^n \rVert < e \;\middle|\; \lVert X_{t-L_x}^{L_x} - X_{s-L_x}^{L_x} \rVert < e, \; \lVert Y_{t-L_y}^{L_y} - Y_{s-L_y}^{L_y} \rVert < e \right) = P\!\left( \lVert X_t^n - X_s^n \rVert < e \;\middle|\; \lVert X_{t-L_x}^{L_x} - X_{s-L_x}^{L_x} \rVert < e \right) \tag{5.8}$$

where $e > 0$ is the maximum-norm distance, and $L_x$ and $L_y$ are the lag lengths of $X$ and $Y$, respectively, which satisfy the conditions $L_x \ge 1$ and $L_y \ge 1$. Note that $P(\cdot)$ denotes probability and $\lVert \cdot \rVert$ denotes the sup norm in the equation (5.8). Let $I(X, Y, e)$ be a kernel that satisfies the following condition.

$$I(X, Y, e) = \begin{cases} 1 & \text{if } \mathrm{distance}(X, Y) < e \\ 0 & \text{else} \end{cases}$$

Then, in order to transform (5.8) into a testable form, we define the correlation-integral estimators of the joint probabilities as the equations (5.9) - (5.12),

$$C_1(n + L_x, L_y, e) \equiv \frac{2}{m(m-1)} \sum\sum_{t<s} I\!\left( X_{t-L_x}^{n+L_x}, X_{s-L_x}^{n+L_x}, e \right) \cdot I\!\left( Y_{t-L_y}^{L_y}, Y_{s-L_y}^{L_y}, e \right) \tag{5.9}$$

$$C_2(L_x, L_y, e) = \frac{2}{m(m-1)} \sum\sum_{t<s} I\!\left( X_{t-L_x}^{L_x}, X_{s-L_x}^{L_x}, e \right) \cdot I\!\left( Y_{t-L_y}^{L_y}, Y_{s-L_y}^{L_y}, e \right) \tag{5.10}$$

$$C_3(n + L_x, e) = \frac{2}{m(m-1)} \sum\sum_{t<s} I\!\left( X_{t-L_x}^{n+L_x}, X_{s-L_x}^{n+L_x}, e \right) \tag{5.11}$$

$$C_4(L_x, e) = \frac{2}{m(m-1)} \sum\sum_{t<s} I\!\left( X_{t-L_x}^{L_x}, X_{s-L_x}^{L_x}, e \right) \tag{5.12}$$

or simply,

$$C_1(n + L_x, L_y, e) = P\left(\lVert X_{t-L_x}^{n+L_x} - X_{s-L_x}^{n+L_x} \rVert < e,\ \lVert Y_{t-L_y}^{L_y} - Y_{s-L_y}^{L_y} \rVert < e\right) \tag{5.13}$$

$$C_2(L_x, L_y, e) = P\left(\lVert X_{t-L_x}^{L_x} - X_{s-L_x}^{L_x} \rVert < e,\ \lVert Y_{t-L_y}^{L_y} - Y_{s-L_y}^{L_y} \rVert < e\right) \tag{5.14}$$

$$C_3(n + L_x, e) = P\left(\lVert X_{t-L_x}^{n+L_x} - X_{s-L_x}^{n+L_x} \rVert < e\right) \tag{5.15}$$

$$C_4(L_x, e) = P\left(\lVert X_{t-L_x}^{L_x} - X_{s-L_x}^{L_x} \rVert < e\right) \tag{5.16}$$

where $t$ and $s$ run over $\max(L_x, L_y) + 1, \ldots, T - n + 1$, $T$ is the length of each series, and $m = T + 1 - n - \max(L_x, L_y)$ is the number of such time indices. By the definition of conditional probability, Hiemstra and Jones [87] rewrote (5.8) as equation (5.17):

$$\frac{C_1(n + L_x, L_y, e)}{C_2(L_x, L_y, e)} = \frac{C_3(n + L_x, e)}{C_4(L_x, e)} \tag{5.17}$$

where the nonparametric test of Baek and Brock [85] is given by

$$\sqrt{m}\left(\frac{C_1(n + L_x, L_y, e)}{C_2(L_x, L_y, e)} - \frac{C_3(n + L_x, e)}{C_4(L_x, e)}\right) \sim N\left(0, \sigma^2(n, L_x, L_y, e)\right) \tag{5.18}$$

$Y$ Granger-causes $X$ with the lags $L_x$ and $L_y$ if the statistic in equation (5.18) takes a significant positive value, while there is no Granger causality between the two sequences if (5.18) takes a significant negative value.
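The correlation-integral counts in (5.9) to (5.12) and the scaled difference in (5.18) can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used in this thesis: the function name `hj_statistic`, the default threshold, and the 0-based indexing are assumptions, and the variance estimate needed to turn the statistic into a p-value is omitted.

```python
from itertools import combinations
from math import sqrt


def sup_dist(a, b):
    """Maximum-norm (sup norm) distance between two equal-length vectors."""
    return max(abs(p - q) for p, q in zip(a, b))


def hj_statistic(x, y, n=1, lx=1, ly=1, e=1.5):
    """Return (sqrt(m) * (C1/C2 - C3/C4), m) for the test 'y Granger-causes x'.

    Hypothetical helper: x and y are equal-length lists, n is the lead length,
    lx and ly are the lag lengths, and e is the distance threshold.
    """
    T = len(x)
    lag = max(lx, ly)
    times = range(lag, T - n + 1)      # 0-based t; m = T + 1 - n - max(lx, ly)
    m = len(times)
    c1 = c2 = c3 = c4 = 0
    for t, s in combinations(times, 2):
        x_lag_close = sup_dist(x[t - lx:t], x[s - lx:s]) < e
        y_lag_close = sup_dist(y[t - ly:t], y[s - ly:s]) < e
        x_full_close = sup_dist(x[t - lx:t + n], x[s - lx:s + n]) < e
        c1 += x_full_close and y_lag_close
        c2 += x_lag_close and y_lag_close
        c3 += x_full_close
        c4 += x_lag_close
    if c2 == 0 or c4 == 0:
        raise ValueError("no close pairs; increase e or use a longer series")
    # The common factor 2 / (m (m - 1)) cancels in both ratios.
    return sqrt(m) * (c1 / c2 - c3 / c4), m


# Toy series in which x simply repeats y with a one-step delay.
y_toy = [0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
x_toy = [0] + y_toy[:-1]
stat, m = hj_statistic(x_toy, y_toy, n=1, lx=1, ly=1, e=0.5)
print(stat > 0, m)
```

In the toy series the lagged values of y fully determine x, so the conditional probability on the left of (5.17) exceeds the one on the right and the statistic is positive. With a constant y the two ratios coincide and the statistic is exactly zero.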

Nonlinear Causality Analysis for Stock Market Prediction Models

As in the previous chapter, we do not test actual causation; we only test whether one time series carries predictive information about the other. We test lag lengths of Twitter sentiment from 1 to 10 days. The nonlinear Granger causality analysis is repeated for the stock prices of the same four companies (AAPL, GOOGL, AMZN, and MSFT) so that the conditions match those of the linear Granger causality analysis in the previous chapter. Based on the results of the nonlinear Granger causality test shown in Table 5.1, we can reject the hypothesis that Twitter sentiment has no nonlinear relationship with the actual changes in the stock prices. In Table 5.1, the result for AAPL indicates that Twitter sentiment Granger-causes the actual stock prices at lags of 3 to 6 and 8 to 9 days. The p-values at lags of 3 and 4 days are significantly lower than those of the linear Granger causality test for AAPL in the previous chapter. For GOOGL, the range of lags with meaningful p-values is also longer in the nonlinear test than in the linear test. For AMZN, Twitter sentiment is related to the changes in the actual stock prices at lags 1 and 5 this time. Unlike the other three companies, AMZN does not have a meaningful p-value at a lag of 4, while MSFT shows nonlinear Granger causality between Twitter sentiment and the actual changes in the stock prices at lags of 4 and 6. Therefore, unlike the linear prediction models in the previous chapter, our SVR prediction models use a lag length of 4. Although the p-value at lag 4 for AMZN is greater than 0.1, it is very close to 0.1, so we still treat it as meaningful and also use a lag length of 4 to predict the stock prices of AMZN.

Table 5.1. Nonlinear Granger causality analysis between Twitter sentiment and changes in the actual stock prices of four companies (AAPL, GOOGL, AMZN, and MSFT) for the period October 1, 2012 to February 28, 2013, under the same conditions as the linear Granger causality analysis in the previous chapter.

Lag (p) p-value AAPL GOOGL AMZN MSFT ** * * * * ** * ** ** * ** ** ** **

* p-value < 0.05 ** p-value <

Nonlinear Prediction Power Comparison for UWS and Existing Algorithm

In the previous chapter, we compared the performance of our proposed UWS algorithm with that of the existing algorithm, both of which are introduced in Chapter 3. We prepared two VAR models: M2, which reflects the Twitter sentiment of past days calculated by the UWS algorithm, and M3, which uses the Twitter sentiment of past days calculated by the existing sentiment analysis algorithm. In this section, in order to show the improvement of our proposed algorithm in nonlinear prediction, we compare the performance of two SVR models: M5, which reflects the Twitter sentiment of past days calculated by the UWS algorithm, and M6, which uses the Twitter sentiment of past days calculated by the existing algorithm.

Table 5.2 shows the results of the nonlinear Granger causality test between the Twitter sentiment calculated by each of the two algorithms and the changes in the actual stock prices. In Table 5.2, the p-values for AAPL show that UWS sentiment nonlinearly Granger-causes the actual stock prices at lags of 3 to 6 and 8 to 9 days, while the sentiment from the existing algorithm is related to the actual stock prices only at lags of 4 to 6 days. For GOOGL, however, the existing algorithm nonlinearly Granger-causes the actual changes in the stock prices at lags of 2 to 4 and 6 to 7 days, which is one day longer than with UWS. With the existing sentiment analysis algorithm, sentiment Granger-causes the actual changes in the stock prices only at a lag of 4 for AMZN and only at a lag of 5 for MSFT, while UWS sentiment does so at lags 1 and 5 for AMZN and at lags 4 and 6 for MSFT. Overall, the Twitter sentiment calculated by the proposed algorithm is more strongly related to the changes in the actual stock prices than that calculated by the existing algorithm.

Table 5.2. Nonlinear Granger causality analysis between Twitter sentiment calculated by two sentiment analysis algorithms and changes in the actual stock prices of four companies.

Lag (p) p-value AAPL GOOGL AMZN MSFT

UWS ** * * * * ** * ** ** * ** ** ** **

Existing Sentiment Analysis * * * * ** * ** ** ** **

* p-value < 0.05 ** p-value < 0.10

Granger Causality Analysis in Different Conditions

Condition 1: Twitter Sentiment using Stock Relevant Tweets Only

The Twitter sentiment tested in this thesis so far has been calculated from all keyword-relevant tweets, as shown in Table 3.9. Do all of these tweets contribute to the prediction accuracy, or can the accuracy be improved by using stock relevant tweets only? To answer these questions, we analyze Granger causality between Twitter sentiment and the actual changes in the stock prices again under a different condition. In this subsection we test only two companies, AAPL and GOOGL, because they have significantly more stock relevant tweets than the other two companies, as shown in Table 5.3.

Table 5.3. Pearson correlation test between direction accuracy and five Twitter factors: number of posting users, number of posted tweets, number of stock relevant tweets, number of authoritative users, and number of stock relevant tweets posted by authoritative users. All five factors are calculated as the average over the three days preceding the prediction date.

Company    Total Tweets    Stock Relevant Tweets
AAPL       939,518         68,359
GOOGL      707,874         50,023
AMZN       81,369          2,337
MSFT       107,

In Table 5.3, there are on average 409 and 300 daily tweets relevant to stock information for AAPL and GOOGL respectively. For AMZN and MSFT, Granger causality cannot be tested because only a few days have stock relevant tweets for these companies. Figure 5.1 shows the change in p-value when only stock relevant tweets are used to calculate Twitter sentiment. In the figure, Twitter sentiment has stronger nonlinear causation when the change in p-value is negative.

Figure 5.1. The change in p-value when only stock relevant tweets are used to calculate Twitter sentiment. Twitter sentiment has stronger causation when the graph is negative. Overall, the p-values obtained with stock relevant tweets only are smaller than those obtained with all tweets.

The test shows that the p-values obtained with stock relevant tweets only are smaller than those obtained when all tweets are used to calculate Twitter sentiment. Specifically, the p-values at a lag of 4 days approach zero for both companies, which means that the sentiment of stock relevant tweets over the past 4 days has stronger nonlinear causation than the Twitter sentiment of all tweets over the same period. The direction accuracy is also tested to investigate whether the sentiment of stock relevant tweets can further improve the accuracy. In Figure 5.2, the left bar for each company shows the prediction accuracy when using all tweets, and the right bar shows the accuracy when using stock relevant tweets only.


Figure 5.2. Improvement of the direction accuracy for both AAPL and GOOGL when only stock relevant tweets are used to calculate Twitter sentiment.

The result in Figure 5.2 shows that stock relevant tweets improve the direction accuracy for both companies. With stock relevant tweets only, predictions can be made for just 21 days for AAPL and 17 days for GOOGL, because there are not enough stock relevant tweets on the remaining 3 and 7 days. However, the accuracy increases from 87.5% (21/24 days) to 90.4% (19/21 days) for AAPL and from 83.3% (20/24 days) to 88.2% (15/17 days) for GOOGL when we have more than 100 daily stock relevant tweets. The result indicates that using stock relevant tweets only for analyzing Twitter sentiment improves the directional prediction, although it does not help improve the prediction of the actual stock prices.

Condition 2: Considering Stock Market Hours

Another condition under which we can test Granger causality is a different time range for a day. So far we have used the daily time range from 00:00 to 23:59 for calculating Twitter sentiment. However, the U.S. NASDAQ stock market is open only from 09:30 to 16:00. For this reason,

Twitter sentiment from this range of the day would have a higher probability of influencing the next day's stock market. Should tweets posted in this time range then be assigned to the previous day or to the current day? To find the optimal time range for stock prediction, we test Granger causality while varying the time range of a day.

Table 5.4 presents the results of the Granger causality test in four different conditions, obtained by varying the time range of a day as well as the type of tweets used for Twitter sentiment analysis. Because we do not have enough stock relevant tweets for AMZN, MSFT, YHOO, BAC, and C, Granger causal relationships in the conditions Full Day + Stock Relevant Tweets and Market Hours + Stock Relevant Tweets are tested on AAPL and GOOGL only. In this test, Full Day is defined as the time range from 00:00 to 23:59, and Market Hours is defined as the time range from 16:00 to 16:00 of the next day. Because the main purpose of this thesis is to predict stock market prices from Twitter sentiment, we do not test Granger causality in the reverse direction.

For the condition Full Day + All of Tweets, the result is the same as the result using the UWS algorithm in Table 5.2. Twitter sentiment for AAPL shows strong predictive power in all four conditions. Because its p-value is already zero in the condition Full Day + All of Tweets, the improvement of the Granger causality test in the other three conditions is not significant for AAPL. For GOOGL, Twitter sentiment appears to have a stronger relationship with the changes in the stock prices in the condition Market Hours + Stock Relevant Tweets, but there is a similar improvement in the condition Full Day + Stock Relevant Tweets for the same company. Therefore, we cannot conclude that applying stock market hours to the time range of a day improves the prediction power of Twitter sentiment.
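The two day definitions compared above can be illustrated with a small sketch of how a tweet timestamp might be assigned to a daily bucket. This is only an illustration under stated assumptions: the function names are hypothetical, timestamps are assumed to be already expressed in the exchange's local time, a Market Hours window is labelled by the date on which it closes, and weekends and holidays are ignored.

```python
from datetime import date, datetime, time, timedelta

MARKET_CLOSE = time(16, 0)  # the 16:00 boundary from the Market Hours definition


def full_day_bucket(ts: datetime) -> date:
    """Full Day: a tweet posted between 00:00 and 23:59 belongs to its calendar date."""
    return ts.date()


def market_hours_bucket(ts: datetime) -> date:
    """Market Hours: the window runs from 16:00 to 16:00 of the next day.

    Assumption: the window is labelled by the date on which it ends, so a tweet
    posted at or after 16:00 counts toward the following day's sentiment.
    """
    if ts.time() >= MARKET_CLOSE:
        return ts.date() + timedelta(days=1)
    return ts.date()


before_close = datetime(2012, 10, 1, 15, 59)
after_close = datetime(2012, 10, 1, 16, 0)
print(full_day_bucket(after_close), market_hours_bucket(before_close), market_hours_bucket(after_close))
```

Under these assumptions a tweet posted after the market closes is grouped with the sentiment that precedes the next trading session, which is the question raised at the start of this subsection.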

Table 5.4. Results of the Granger causality test in the four different conditions. Because the other companies do not have enough stock relevant tweets, only AAPL and GOOGL are tested in the conditions using stock relevant tweets. Grey cells mark results with stronger causation than the condition Full Day + All of Tweets.

Granger causal relationship and p-value
Columns: Company, Full Day + All of Tweets, Full Day + Stock Relevant Tweets, Market Hours + All of Tweets, Market Hours + Stock Relevant Tweets

AAPL     S → P   S → P   S → P   S → P     << 0.05   << 0.05   << 0.05   << 0.05
GOOGL    S → P   S → P   S → P   S → P     < 0.05    << 0.05   < 0.05    << 0.05
AMZN     S → P   S → P   -       >= 0.10   <
MSFT     S → P   S → P   -       < 0.10    <
YHOO     -       S → P   <
BAC      -       -
C        -       -

S: Twitter Sentiment. P: Changes of the Stock Price. Full Day: 00:00-23:59. Market Hours: 16:00-16:00 of next day.

However, applying stock market hours improves the prediction power of Twitter sentiment for AMZN and YHOO. With the Full Day time range, the p-values of both AMZN and YHOO are higher than 0.1, but they fall below 0.1 when Market Hours is applied. Thus we can clearly reject the hypothesis that Twitter sentiment with a lag of 4 days cannot predict the actual changes in the stock prices for both AMZN and YHOO. Last, for the companies BAC and C, there is no significant relationship between Twitter sentiment and the actual changes of the stock

prices. This is mainly because we do not have enough daily tweets for these companies, as tested in the previous subsection; thus varying the time range of a day does not help improve the prediction power of Twitter sentiment.

Experiments on Nonlinear Stock Market Prediction

Nonlinear Prediction Accuracy Measurement

In this section, we test the performance of stock prediction with the following two models:

M4: An SVR model with an input vector $X_t = \{Y_t, D_t\}$, $Y_t = \{D_{t-1}, D_{t-2}, D_{t-3}, D_{t-4}\} = \mathbf{D}_{t-1}$

M5: An SVR model with an input vector $X_t = \{Y_t, D_t\}$, $Y_t = \{\mathbf{D}_{t-1}, \mathbf{S}_{t-1}\}$, $\mathbf{S}_{t-1} = \{S_{t-1}, S_{t-2}, S_{t-3}, S_{t-4}\}$

where $D_t$ is defined as the daily change in the stock price between day $t$ and day $(t - 1)$, and $S_t$ represents the calculated Twitter sentiment of day $t$. Following the nonlinear Granger causality analysis, the lag length for the prediction models is set to 4. As in the linear prediction in the previous chapter, two types of prediction accuracy are measured for the two SVR models, M4 and M5, over the same period as the linear prediction: the average mean absolute percentage error (MAPE) and the direction accuracy (up/down).

In Table 5.5, reflecting Twitter sentiment in the prediction of AAPL again leads to the best improvements in both MAPE (1.60% to 0.93%) and direction error (37.50% to 12.50%, that is, a direction accuracy of 62.50% rising to 87.50%). Twitter sentiment also improves the prediction performance for the other three companies in both MAPE and the direction accuracy, although the performance of M5 for

MSFT is not much different from that of M4. The average prediction accuracies of M4 are 61.46% in direction accuracy and 1.52% in MAPE, while those of M5 are 80.21% in direction accuracy and 1.15% in MAPE.

Table 5.5. Two types of prediction accuracy for models M4 and M5, measured in terms of the average mean absolute percentage error (MAPE) and the direction accuracy (up/down). The prediction of AAPL using M5 shows the best performance in both the direction accuracy and MAPE. Prediction with Twitter sentiment performs better than prediction without Twitter sentiment for all four companies, as was the case in the linear prediction.

Stock            Measurement      Model    Error
NASDAQ:AAPL      Direction (%)    M4       %
                                  M5       %
                 MAPE (%)         M4       1.60%
                                  M5       0.93%
NASDAQ:GOOGL     Direction (%)    M4       %
                                  M5       %
                 MAPE (%)         M4       1.56%
                                  M5       1.30%
NASDAQ:AMZN      Direction (%)    M4       %
                                  M5       %
                 MAPE (%)         M4       1.58%
                                  M5       1.06%
NASDAQ:MSFT      Direction (%)    M4       %
                                  M5       %
                 MAPE (%)         M4       1.35%
                                  M5       1.31%

To find the better prediction model among M4 and M5, we compare the absolute difference between the predicted stock price and the actual stock price for each prediction model. Figure 5.3 shows the difference between $|M4 - \text{actual stock price}|$ and $|M5 - \text{actual stock price}|$ during the prediction period. If the stock price predicted by M5 for a day $t$ is closer to the actual price than the price predicted by M4, the result in Figure 5.3 is positive and is drawn as a red bar. Otherwise, the result is drawn as a blue bar.
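The input vectors of M4 and M5 defined earlier in this section can be assembled from the daily series as sketched below. This is a minimal illustration and not the code used in this thesis: the helper name `build_inputs` and the toy values are hypothetical, and the resulting (features, target) pairs would then be passed to whatever SVR implementation is used for training.

```python
def build_inputs(price_changes, sentiment=None, lags=4):
    """Return a list of (features, target) pairs.

    features = [D_{t-1}, ..., D_{t-lags}] for M4, extended with
    [S_{t-1}, ..., S_{t-lags}] for M5 when a sentiment series is given;
    target = D_t.
    """
    samples = []
    for t in range(lags, len(price_changes)):
        features = [price_changes[t - k] for k in range(1, lags + 1)]
        if sentiment is not None:
            features += [sentiment[t - k] for k in range(1, lags + 1)]
        samples.append((features, price_changes[t]))
    return samples


# Toy values only, chosen to make the lag structure easy to read.
d = [1, 2, 3, 4, 5, 6]
s = [10, 20, 30, 40, 50, 60]
m4_samples = build_inputs(d)
m5_samples = build_inputs(d, s)
print(m4_samples[0])
print(m5_samples[0])
```

With a lag length of 4, the first usable day is the fifth one, so a series of length N yields N - 4 training pairs for either model.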

Figure 5.3. Graphs comparing the absolute difference between the predicted prices and the actual prices for four companies: (a) AAPL, (b) GOOGL, (c) AMZN and (d) MSFT. In each graph, the result for a day t is drawn as a red bar if the price predicted by M5 is closer to the actual price than the price predicted by M4.

In Figure 5.3 (a), the prices predicted by M5 are closer to the actual price than those predicted by M4 for 21 days, while M4 performs better than M5 for 4 days. Specifically, during March the predictions of M5 are on average more than 10 dollars closer to the actual prices than those of M4. For GOOGL, M4 performs better than M5 for 7 days, but M5 performs better for most of the period. The result for AMZN is similar to that for AAPL, while the performance of M5 is worst on the prediction of MSFT. For MSFT, only 14 days are drawn as red bars, which is 33.3% lower than for AAPL. Thus we can conclude that the proposed model reflecting Twitter sentiment performs significantly better than the model without Twitter sentiment for three companies (AAPL, GOOGL, and AMZN), while the two models perform about the same for MSFT.

Prediction Accuracy Comparison for the Linear and Nonlinear Models

The final goal of this thesis is to find, among all the models given in this thesis, the stock prediction model with the best prediction performance. In the previous chapter and this chapter, we have shown that M2 and M5 are the most accurate models for linear and nonlinear prediction respectively. What these models have in common is that they reflect the Twitter sentiment calculated by the proposed UWS algorithm. Therefore, we can conclude that the UWS algorithm improves the prediction accuracy of both linear and nonlinear prediction models. In this section, we further compare the performance of the two models, M2 and M5, to determine whether stock prediction with a nonlinear model is more accurate than with a linear model. As in the previous comparison, two types of prediction accuracy are measured: MAPE and the direction accuracy (up/down). In Table 5.6, the SVR prediction model (M5)

predicts the stock price of AAPL with the best improvement in MAPE (1.32% to 0.93%) and improves the direction accuracy the most for AMZN (direction error from 29.17% to 20.83%). Our SVR prediction model improves the prediction accuracy for all four companies in both MAPE and direction, except for the MAPE of the prediction for MSFT.

Table 5.6. Two types of prediction accuracy for models M2 and M5, measured in terms of the average mean absolute percentage error (MAPE) and the direction accuracy (up/down). Overall, the SVR prediction model performs better than the VAR prediction model.

Stock            Measurement      Model    Error
NASDAQ:AAPL      Direction (%)    M2       %
                                  M5       %
                 MAPE (%)         M2       1.32%
                                  M5       0.93%
NASDAQ:GOOGL     Direction (%)    M2       %
                                  M5       %
                 MAPE (%)         M2       1.49%
                                  M5       1.30%
NASDAQ:AMZN      Direction (%)    M2       %
                                  M5       %
                 MAPE (%)         M2       1.42%
                                  M5       1.06%
NASDAQ:MSFT      Direction (%)    M2       %
                                  M5       %
                 MAPE (%)         M2       1.25%
                                  M5       1.31%

We also compare the absolute difference between the predicted stock price and the actual stock price for each prediction model. Figure 5.4 shows the difference between $|M2 - \text{actual stock price}|$ and $|M5 - \text{actual stock price}|$ during the prediction period.
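The two accuracy measures and the per-day bar value used in these comparisons can be written out as follows. This is a sketch under assumptions: the function names and the toy prices are hypothetical, and the predicted direction is taken to be the sign of the predicted price minus the previous day's actual price, which the text does not state explicitly.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def direction_error(prev_actual, actual, predicted):
    """Percentage of days on which the predicted up/down move differs from the actual one."""
    wrong = sum(
        ((a - q) > 0) != ((p - q) > 0)
        for q, a, p in zip(prev_actual, actual, predicted)
    )
    return 100.0 * wrong / len(actual)


def closeness_bars(actual, pred_baseline, pred_new):
    """Per-day |baseline - actual| - |new - actual|; positive means the new model is closer."""
    return [abs(b - a) - abs(n - a) for a, b, n in zip(actual, pred_baseline, pred_new)]


# Toy prices for two prediction days (not thesis data).
prev_close = [99.0, 100.0]
actual = [100.0, 102.0]
pred_a = [101.0, 99.0]    # hypothetical baseline model
pred_b = [100.5, 101.0]   # hypothetical model with sentiment
print(round(mape(actual, pred_a), 2), direction_error(prev_close, actual, pred_a))
print(closeness_bars(actual, pred_a, pred_b))
```

The direction accuracy reported in the text is 100% minus the direction error computed here, and a positive bar value corresponds to a red bar in the figures.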

Figure 5.4. Graphs comparing the absolute difference between the predicted prices and the actual prices for four companies: (a) AAPL, (b) GOOGL, (c) AMZN and (d) MSFT. In each graph, the result for a day t is drawn as a red bar if the price predicted by M5 is closer to the actual price than the price predicted by M2.

In Figure 5.4, for AAPL, the prices predicted by M5 are closer to the actual price than those predicted by M2 for 21 days, and on the other 4 days M5 is not much different from M2, although M2 is more accurate on those days. For GOOGL and AMZN as well, M5 performs better than M2 on most days. However, the accuracy of M5 is lower than that of M2 for MSFT: the absolute difference is positive for only 10 days during the period.

Number of Daily Tweets for Improving Nonlinear Prediction

In the previous chapter, we determined the minimum number of daily tweets required to improve the accuracy of our proposed linear prediction model. Does the same result hold for the nonlinear prediction model? To find the number of daily tweets needed to improve the accuracy of the nonlinear prediction model, we follow the same procedure as for the linear prediction model. We vary the number of daily tweets from 10 to 3000. Our proposed UWS algorithm is used for Twitter sentiment analysis, and the SVR model, M5, is used for training in this test. As in subsection 4.5.3, two types of prediction accuracy are measured: MAPE and the direction accuracy (up/down).

Figure 5.5 shows the direction accuracy of AAPL stock prediction for numbers of daily tweets from 10 to 3000. The direction accuracy of stock prediction is 87.50% when using the entire 939,518 tweets (Table 5.5). The graph shows that the direction accuracy is lower than 80% when we have fewer than 1500 daily tweets.
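One way to vary the number of daily tweets is to cap each day at a fixed number of randomly chosen tweets before computing sentiment. The sketch below assumes this random-sampling scheme, which the text does not specify; the function name `cap_daily_tweets`, the dictionary layout, and the fixed seed are hypothetical.

```python
import random


def cap_daily_tweets(tweets_by_day, max_per_day, seed=0):
    """Keep at most max_per_day tweets for each day, sampled without replacement.

    tweets_by_day maps a day label to a list of tweets. Days that already have
    max_per_day tweets or fewer are kept unchanged.
    """
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    capped = {}
    for day, tweets in tweets_by_day.items():
        if len(tweets) <= max_per_day:
            capped[day] = list(tweets)
        else:
            capped[day] = rng.sample(tweets, max_per_day)
    return capped


# Toy input: integers stand in for tweets.
toy = {"2013-03-01": list(range(50)), "2013-03-04": list(range(5))}
for cap in (10, 100):
    capped = cap_daily_tweets(toy, cap)
    print(cap, {day: len(items) for day, items in capped.items()})
```

Repeating the sentiment calculation and the M5 prediction for each cap would give one accuracy value per setting, which is the kind of curve plotted against the number of daily tweets.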

Figure 5.5. The direction accuracy of AAPL stock prediction for different numbers of daily tweets from 10 to 3000. The direction accuracy of AAPL when using the entire set of tweets was 87.50% (Table 5.5). The graph indicates that the direction accuracy approaches its maximum when there are at least 2500 daily tweets.

Similar to the linear prediction model in the previous chapter, the direction accuracy approaches its maximum when there are more than 2500 tweets per day. The accuracy of M5 is higher than that of M4 when the number of tweets per day is greater than 100, but it is not stable, and the accuracy decreases at 500 daily tweets. From 1000 daily tweets the accuracy increases again, and it becomes stable at higher daily tweet counts. Thus we can conclude that at least 2500 daily tweets are required to improve the direction accuracy of the nonlinear prediction.

Figure 5.6 shows the MAPE of AAPL stock prediction for numbers of daily tweets from 10 to 3000. The MAPE of stock prediction is 0.93% when using the entire set of tweets (Table 5.5). The graph shows that MAPE is greater than 1.00% when the number of daily tweets is small. The MAPE of M5 is lower than that of M4 when the number of daily tweets is greater than 1000, but it increases again at a higher count. MAPE becomes stable when

the number of daily tweets is sufficiently large. Thus, as with the direction accuracy in Figure 5.5, we can conclude that at least 2500 daily tweets are required to improve MAPE for the nonlinear prediction model.

Figure 5.6. The MAPE of AAPL stock prediction for different numbers of daily tweets from 10 to 3000. The MAPE of AAPL when using the entire set of tweets was 0.93% (Table 5.5). The graph shows that MAPE approaches its minimum when there are at least 2500 daily tweets.

Stability of the Nonlinear Model

The nonlinear prediction model using SVR in this chapter shows better performance than the linear model in the previous chapter. This is because the nonlinear causal relationship between public sentiment and the stock prices is stronger than the linear causal relationship. To support this, we test nonlinear prediction with another learning model, MLP. If the performance of the MLP model were significantly different from that of SVR, the prediction model could not be called stable. As in the previous comparisons, MAPE and the direction accuracy (up/down) are tested.

Table 5.7. Two types of prediction accuracy for two nonlinear prediction models, SVR and MLP, measured in terms of the average mean absolute percentage error (MAPE) and the direction accuracy (up/down). The prediction accuracies of the two nonlinear models are not significantly different in either test.

Stock            Measurement      Model    Error
NASDAQ:AAPL      Direction (%)    MLP      12.50%
                                  SVR      12.50%
                 MAPE (%)         MLP      0.97%
                                  SVR      0.93%
NASDAQ:GOOGL     Direction (%)    MLP      12.50%
                                  SVR      16.67%
                 MAPE (%)         MLP      1.36%
                                  SVR      1.30%
NASDAQ:AMZN      Direction (%)    MLP      25.00%
                                  SVR      20.83%
                 MAPE (%)         MLP      1.21%
                                  SVR      1.06%
NASDAQ:MSFT      Direction (%)    MLP      33.33%
                                  SVR      29.17%
                 MAPE (%)         MLP      1.29%
                                  SVR      1.31%

In Table 5.7, the SVR prediction model predicts the stock price of AAPL with 87.5% directional accuracy, which means that the model correctly predicts 21 of the 24 prediction days. MLP achieves the same directional accuracy for the stock price of AAPL. For GOOGL and AMZN, the SVR model correctly predicts 20 days and 19 days, while MLP correctly predicts 21 days and 18 days, in each case one day different from the SVR model. For MSFT, the direction accuracies of the two prediction models also differ by one day, although the percentages appear to show a large gap between the two models. This is because we predict for only 24 days, so a difference of one day produces a large gap in the percentage. As Table 5.7 shows, the MAPE values of the two prediction models are not significantly different and are mostly similar; thus we can conclude that the nonlinear prediction is stable regardless of the learning model.
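The day counts quoted above follow directly from the direction errors in Table 5.7 and the 24 prediction days. The short computation below makes the conversion explicit; the dictionary layout and the helper name `correct_days` are illustrative only.

```python
PREDICTION_DAYS = 24

# Direction errors (%) as listed in Table 5.7.
direction_error_pct = {
    "AAPL": {"MLP": 12.50, "SVR": 12.50},
    "GOOGL": {"MLP": 12.50, "SVR": 16.67},
    "AMZN": {"MLP": 25.00, "SVR": 20.83},
    "MSFT": {"MLP": 33.33, "SVR": 29.17},
}


def correct_days(error_pct, days=PREDICTION_DAYS):
    """Number of correctly predicted days implied by a direction error in percent."""
    return days - round(error_pct / 100.0 * days)


for stock, errors in direction_error_pct.items():
    mlp, svr = correct_days(errors["MLP"]), correct_days(errors["SVR"])
    print(f"{stock}: MLP {mlp}/24, SVR {svr}/24, difference {abs(mlp - svr)} day(s)")
```

Because one day corresponds to 1/24, or about 4.17 percentage points, a gap such as 33.33% versus 29.17% is exactly one prediction day.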

Table 5.8. Convergent Cross-Mapping (CCM) analysis for testing nonlinear causality. CCM is tested here in one direction only, to predict stock prices from lagged public sentiment. The result shows that CCM analysis gives results similar to those of the nonlinear Granger causality analysis. The time step is τ = 1, and the embedding dimension is varied from one to nine.

Squared Pearson coefficient (ρ²), Sentiment → Prices
E-Dimension AAPL GOOGL AMZN MSFT

E-Dimension: embedding dimension

To check whether the causal relationship between public sentiment and the stock prices is stable regardless of the causality testing method, we look for causal relationships between the two features in a different way. Convergent Cross-Mapping (CCM) analysis is used to compare the causal relationship between public sentiment and the stock prices [93]. Unlike Granger causality, CCM follows from Takens' theorem: if the time series X influences another time series Y, then the historical values of X can be recovered from the values of Y [94]. This is done with cross mapping, in which a time delay embedding is constructed from Y and the ability to estimate X from that embedding is measured [95].

Table 5.8 shows the squared Pearson correlation coefficient, which indicates how strongly public sentiment is related to the stock prices as the number of embedding dimensions varies from one to nine. To match the conditions of the nonlinear Granger causality test, we set the time step τ = 1. In Granger causality, we need to normalize public sentiment and the stock

prices as these values are not stationary. However, CCM can be applied to non-stationary systems as well. Thus we use the difference between the stock prices of two consecutive days and the calculated Twitter sentiment without Z-normalization. Although CCM provides bidirectional results, we do not report the result for the opposite direction, because the purpose of this research is to predict the stock market from public sentiment.

The bolded results in Table 5.8 are the meaningful values of the squared Pearson correlation. For AAPL, public sentiment can be used to explain changes in the stock prices with embedding dimensions from three to six, and nine. In Table 5.1, public sentiment causes the stock prices at lags of three to six days and eight to nine days. Therefore, both causality tests show similar results except for the lag of eight days. For GOOGL, although Granger causality cannot explain causal relationships between public sentiment and the stock prices at lags of one to two days, both methods explain the relationship between public sentiment and the stock prices similarly. However, Table 5.8 shows only weak correlations between public sentiment and the stock prices with one embedding dimension for AMZN and with six embedding dimensions for MSFT. Because the two approaches test causality differently, some test cases can give different results. Overall, the CCM results in Table 5.8 show causal relationships between public sentiment and the stock prices that are similar to the results in Table 5.1. Therefore, the causal relationship between the two features is similar regardless of the testing method.
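The cross-mapping step described above can be sketched in plain Python: a time delay embedding is built from Y, each point is matched to its nearest neighbours on that embedding, and X is estimated from the neighbours' time indices. This is a simplified illustration and not the implementation used for Table 5.8; the function names are hypothetical, the exponential neighbour weighting is one common choice, and the convergence check over increasing library lengths is omitted.

```python
from math import exp, sin, sqrt


def delay_embed(series, E, tau=1):
    """Map each usable time t to the vector (y_t, y_{t-tau}, ..., y_{t-(E-1)tau})."""
    start = (E - 1) * tau
    return {t: tuple(series[t - k * tau] for k in range(E)) for t in range(start, len(series))}


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)


def cross_map_skill(x, y, E, tau=1):
    """Estimate x from the delay embedding of y; return the squared Pearson correlation."""
    manifold = delay_embed(y, E, tau)
    times = sorted(manifold)
    targets, estimates = [], []
    for t in times:
        dists = sorted(
            (sqrt(sum((a - b) ** 2 for a, b in zip(manifold[t], manifold[s]))), s)
            for s in times if s != t
        )
        nearest = dists[:E + 1]                 # E + 1 nearest neighbours
        d1 = max(nearest[0][0], 1e-12)          # guard against a zero distance
        weights = [exp(-d / d1) for d, _ in nearest]
        estimates.append(sum(w * x[s] for w, (_, s) in zip(weights, nearest)) / sum(weights))
        targets.append(x[t])
    return pearson(targets, estimates) ** 2


# Toy check: x is a linear function of y, so y's embedding should recover x well.
y_demo = [sin(0.3 * t) for t in range(60)]
x_demo = [2.0 * v + 1.0 for v in y_demo]
print(cross_map_skill(x_demo, y_demo, E=2) > 0.9)
```

In the setting of this section, x would be the sentiment series and y the daily price changes, with the embedding dimension E varied from one to nine and τ = 1.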

5.4. Discussion

In this chapter, two SVR models were proposed and tested to determine whether Twitter sentiment can improve the nonlinear prediction accuracy of the stock market. As in the previous chapter, nonlinear Granger causality analysis was performed on the four companies (AAPL, GOOGL, AMZN, and MSFT) with lag sizes varying from 1 to 10. We concluded that the Twitter sentiment of the past 4 days was strongly related to the actual changes in the stock prices, which differs from the conclusion of the linear Granger causality test. The nonlinear Granger causality test indicated a stronger relationship between Twitter sentiment and the changes in the actual stock prices than the linear test, which suggests that the relationship between the two is more nonlinear than linear.

We also tested Granger causality under different conditions. For the first case, we analyzed Granger causality between Twitter sentiment and the actual changes in the stock prices with the same procedure as the test in subsection 5.2.1, but with Twitter sentiment calculated from stock relevant tweets only. The test showed that the p-values obtained with stock relevant tweets only were smaller than those obtained when all tweets were used to calculate Twitter sentiment. Specifically, the sentiment of stock relevant tweets over the past 4 days had stronger nonlinear causation than the Twitter sentiment of all tweets over the same period. In the direction accuracy test, the accuracy increased from 87.5% to 90.4% for AAPL and from 83.3% to 88.2% for GOOGL when we had more than 100 daily stock relevant tweets. From this result, we concluded that using stock relevant tweets only to analyze Twitter sentiment has more prediction power than using all tweets.

For the second case, we tested Granger causality while varying the time range of a day. Our result indicated that applying stock market hours improved the prediction power of Twitter sentiment for two companies, AMZN and YHOO. We could reject the hypothesis that Twitter sentiment with a lag of 4 days cannot predict the actual changes in the stock prices for both companies. However, there was no significant relationship between Twitter sentiment and the actual changes in the stock prices for the companies that did not have enough daily tweets; thus we concluded that varying the time range of a day cannot improve the prediction power of Twitter sentiment if there are not enough tweets to analyze.

Stock predictions for the four companies were tested with the proposed models, M4 and M5. As in the previous chapter, two types of prediction accuracy were measured for the comparison: MAPE and the direction accuracy (up/down). Reflecting Twitter sentiment in the prediction of AAPL again led to the best improvements in both MAPE (1.60% to 0.93%) and direction error (37.50% to 12.50%). In addition, Twitter sentiment improved the prediction performance for the other three companies in both MAPE and the direction accuracy. The average prediction accuracies of M5 were 80.21% in direction accuracy and 1.15% in MAPE.

Nonlinear Granger causality between the Twitter sentiment calculated by each of the two algorithms and the changes in the actual stock prices was also tested. Overall, the Twitter sentiment calculated by the proposed algorithm is more strongly related to the changes in the actual stock prices than that calculated by the existing algorithm. The result showed that our proposed UWS sentiment algorithm has stronger nonlinear Granger causation with the actual stock prices for AAPL, AMZN, and MSFT. However, the existing algorithm had stronger Granger causation with the actual changes in the stock prices for GOOGL.

This happens mainly for two reasons. First, the UWS algorithm is based on the assumption that authoritative users carry more weight than other users. However, their opinions do not always represent public opinion on Twitter. For instance, the Korean movie Operation Chromite recently premiered and was a box office hit in South Korea with high ratings, while professional critics gave it the lowest review scores. The proposed algorithm therefore has the limitation that it cannot correctly calculate Twitter sentiment when the opinion of authoritative users is far from public opinion. Second, we considered only the number of retweets and the number of followers when calculating the weight of a user, because of the access limitations of Twitter. In our results, these two factors appeared to be good representatives of the authority of a user. In some cases, however, messages with a massive number of retweets carry no meaningful information for predicting the stock market. To improve the accuracy of prediction models that use the UWS algorithm, we therefore need to consider further Twitter factors that can represent the authority of a user.

We compared two models, M2 and M5, which are the linear and nonlinear models reflecting Twitter sentiment calculated by the UWS algorithm. As in the previous comparisons, two types of prediction accuracy were measured: MAPE and direction accuracy (up/down). Our SVR prediction model achieved the best improvement in MAPE for AAPL (1.32% to 0.93%), and it improved direction accuracy the most for AMZN (direction error reduced from 29.17% to 20.83%). The nonlinear SVR model improved the prediction accuracy for all four companies in both MAPE and direction, except for the MAPE of the prediction for MSFT. A natural question is how the prices predicted by M2 can be closer to the actual prices in Figure 5.4 (d) while the direction accuracy of M5 is higher than that of M2. The reason is

that, although the direction on a day t predicted by M5 matches that of the actual stock market, the absolute difference on day t for M5 can be greater than that for M2, as in the case shown in Table 5.9.

Table 5.9. A case in which the absolute difference on a day t for M5 is greater than that for M2, while the direction predicted by M5 still matches that of the actual stock market.

Model 01/03/ /03/2013 Absolute Diff. Direction
Actual DOWN
M2 UP
M5 DOWN

In Table 5.9, the absolute difference between the predicted price and the actual price is larger for M5 than for M2. However, the direction accuracy of M5 can still be higher than that of M2 in this case. Although there are some limitations in our proposed algorithm and prediction model, it performs better than existing methods in most cases.

We also tested how the number of daily tweets affects the accuracy of the proposed nonlinear prediction model. In the direction accuracy test, the accuracy approached its maximum with more than 2500 daily tweets, similarly to the linear prediction model in the previous chapter. The accuracy of M5 was higher than that of M4 when the number of daily tweets was greater than 100, and it became stable when the number of daily tweets was greater than Thus for the nonlinear prediction model, at least 2500 daily tweets are required to improve the direction accuracy. In the MAPE test, the MAPE of M5 was lower than that of M4 when the number of daily tweets was greater than 1000, but it increased again when the number of daily tweets becomes MAPE became stable when the number of daily tweets was greater than Thus we concluded that at least 2500 tweets per day are required to improve both the direction accuracy and MAPE for the nonlinear prediction model.
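The point that a model can be closer in price yet worse in direction can be made concrete with a small sketch. The definitions below are assumptions: MAPE is the usual mean absolute percentage error, and a day counts as a direction hit when the predicted price and the actual price lie on the same side of the previous actual close (a tie is treated as DOWN). The prices are hypothetical and are not taken from Table 5.9 or any thesis data.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

def direction(prev_close, price):
    """UP if the price is above the previous close, otherwise DOWN (assumed tie rule)."""
    return "UP" if price > prev_close else "DOWN"

def direction_accuracy(prev_close, actual, predicted):
    """Share of days (in percent) on which predicted and actual directions agree."""
    hits = sum(direction(c, p) == direction(c, a)
               for c, a, p in zip(prev_close, actual, predicted))
    return 100.0 * hits / len(actual)

# Hypothetical prices for three trading days.
prev_close = [100.0, 99.0, 101.0]   # actual close of the day before
actual     = [99.0, 101.0, 100.0]
model_a    = [100.5, 100.5, 101.5]  # close to the actual price, often on the wrong side
model_b    = [96.0, 104.0, 97.0]    # farther away, always on the right side

mape_a = mape(actual, model_a)
mape_b = mape(actual, model_b)
acc_a = direction_accuracy(prev_close, actual, model_a)
acc_b = direction_accuracy(prev_close, actual, model_b)
```

Here the first hypothetical model has the lower MAPE while the second has the higher direction accuracy, which is the same kind of trade-off discussed for M2 and M5.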

CHAPTER 6: CONCLUSION

With the exponential growth in the use of social networks, there have been discussions and studies on various ways of using them to discover public opinion and insights. In this thesis, we pursued two goals: i) to propose an improved sentiment analysis algorithm that reflects the impact of the user, and ii) to study whether public sentiment on social networks can contribute to stock market prediction.

First, we proposed an improved sentiment analysis algorithm in Chapter 3. We tested four machine learning algorithms, including DNN, for sentiment classification. The training results indicated that the Naïve Bayes algorithm had the lowest classification accuracy among the four algorithms while DNN had the highest. The biggest improvement was in the classification of Neutral tweets from when using Naïve Bayes to when using SVM, while the least improvement was in the classification of Negative tweets from when using Naïve Bayes to when using DNN. The proposed UWS algorithm uses the number of retweets as a parameter for calculating the impact of a user, following the results of existing research. We carried out further analysis to find how the number of followers affects the number of retweets. From this test, we concluded that the number of followers of a user can be a weighting factor, since it indicates the potential number of retweets. To support the proposed algorithm, Granger causality between authoritative users and the other users was tested under the assumption that users with higher authority affect the other users and transfer activity. From the result, we concluded that the sentiment of authoritative users affects the other users with a lag of 3 days after a big event.

Second, we constructed linear and nonlinear prediction models using machine learning algorithms to forecast the future stock prices of four selected companies. For linear prediction, two VAR models were proposed and tested to determine whether Twitter sentiment improves the prediction accuracy of the stock market. We performed Granger causality analysis with the lag length varied from 1 to 10. From the analysis, we concluded that Twitter sentiment over the past 3 days was highly correlated with the actual changes of the stock market. The simulation of stock prediction using VAR models indicated that Twitter sentiment calculated by the proposed UWS algorithm led to the best improvements in both MAPE and direction accuracy for all companies tested in this research. From the study, we found that the time series of Twitter sentiment improves the prediction of the future stock market with a lag of 3 when using VAR models.

Further analysis was given in this thesis to show that the UWS algorithm performs better than the existing sentiment analysis algorithm. In the simulation, the average prediction accuracy of the proposed UWS algorithm was higher in both tests. We also analyzed the minimum number of tweets per day needed to improve prediction accuracy. We tested how the direction accuracy and MAPE change as the number of daily tweets varies from 10 to The test stated that at least 2,500 daily tweets are required to improve the prediction accuracy through Twitter sentiment.

For nonlinear prediction, two SVR models were proposed and tested to determine whether nonlinearly correlated Twitter sentiment can improve the prediction accuracy of the stock market. As with the VAR models, we performed the nonlinear Granger causality test with the lag length varied from 1 to 10. We concluded that Twitter sentiment with a lag of 4 days was highly correlated with the actual changes of the stock prices. The nonlinear Granger causality test

showed that Twitter sentiment has a more nonlinear relationship with the changes of the actual stock prices, as the nonlinear test found a stronger correlation between Twitter sentiment and the changes of the actual stock prices. The simulation of stock prediction using SVR models indicated that Twitter sentiment calculated by the proposed algorithm led to the best improvements in both MAPE and direction accuracy for all companies tested in this research. Compared to the VAR model, we found that the nonlinear prediction model with the UWS algorithm performed better than the linear prediction model, as Twitter sentiment was nonlinearly correlated with the actual stock market.

We also tested Granger causality under different conditions. Granger causality was tested between the actual changes of the stock prices and Twitter sentiment calculated from stock-relevant tweets only. From the result, we concluded that Twitter sentiment computed from stock-relevant tweets only has more prediction power than sentiment computed from all tweets. We also tested Granger causality by varying the time range of a day. Our results indicated that restricting Twitter sentiment to stock market hours improved its prediction power for the two selected companies. We could reject the hypothesis that Twitter sentiment with lag cannot predict the actual stock prices. However, as there was no significant relationship between Twitter sentiment and the actual changes of the stock prices for the companies that did not have enough daily tweets, we concluded that varying the time range of a day improves the prediction power of Twitter sentiment only when there are enough tweets in the dataset.

Future work shall include identifying further Twitter factors to be reflected in the calculation of user weight in the UWS algorithm. For instance, the list of a user's followers and how authoritative they are on Twitter can be a factor, analyzed through the PageRank [83] algorithm to rank users.

REFERENCES

[1] B. Pang and L. Lee, Opinion mining and sentiment analysis, 2nd ed., vol Foundations and trends in information retrieval, 2008, pp
[2] J. Gao and T. Zhou, Evaluating User Reputation in Online Rating Systems via an iterative group-based ranking method, arXiv preprint arXiv: ,
[3] P. Resnick, K. Kuwabara, R. Zeckhauser, and E. Friedman, Reputation Systems, Communications of the ACM, no. 43, vol. 12, pp ,
[4] S. Standifird, Reputation and e-commerce: eBay auctions and the asymmetrical impact of positive and negative ratings, Journal of Management, no. 27, vol. 3, pp ,
[5] K. Lerman and R. Ghosh, Information Contagion: An Empirical Study of the Spread of News on Digg and Twitter Social Networks, Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM), vol. 10, pp ,
[6] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi, Measuring User Influence in Twitter: The Million Follower Fallacy, Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM), vol. 10, pp ,
[7] B. Liu, Sentiment Analysis and Opinion Mining, 5th ed., vol. 1. Synthesis Lectures on Human Language Technologies, 2012, pp
[8] J. Bollen, H. Mao, and X. Zeng, Twitter mood predicts the stock market, Journal of Computational Science, 2nd ed., pp. 1-8,
[9] A. Mittal and A. Goel, Stock Prediction Using Twitter Sentiment Analysis, Stanford University,

[10] S. Asur and B. A. Huberman, Predicting the Future with Social Media, IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp ,
[11] A. Tumasjan, T. Sprenger, P. Sandner, and I. Welpe, Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment, ICWSM 10, pp ,
[12] H. Wang, D. Can, A. Kazemzadeh, F. Bar, and S. Narayanan, A System for Real-time Twitter Sentiment Analysis of 2012 U.S. Presidential Election Cycle, in Proceedings of the ACL 2012 System Demonstrations, pp ,
[13] V. Turchenko, P. Beraldi, F. De Simone, and L. Grandinetti, Short-term Stock Price Prediction using MLP in Moving Simulation Mode, in IDAACS, vol. 2, pp ,
[14] T. Van Gestel, et al., Financial Time Series Prediction using Least Squares Support Vector Machines Within The Evidence Framework, IEEE Transactions on Neural Networks, vol. 12, no. 4, pp ,
[15] K. Kim, Financial Time Series Forecasting using Support Vector Machines, Neurocomputing, vol. 55, pp ,
[16] W. Huang, Y. Nakamori, and S. Wang, Forecasting stock market movement direction with support vector machine, Computers & Operations Research, vol. 32, no. 10, pp ,
[17] N. Cristianini and J. S. Taylor, An Introduction to Support Vector Machines: and Other Kernel-Based Learning Methods, Cambridge University Press, New York,
[18] V. N. Vapnik, An overview of statistical learning theory, IEEE Transaction of Neural Networks, no. 10, vol. 5, pp ,

[19] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York,
[20] L. Cao, Support vector machines experts for time series forecasting, Neurocomputing, vol. 55, issue 1-2, pp ,
[21] T. Z. Tan, C. Quek, and G. S. Ng, Brain Inspired Genetic Complimentary Learning for Stock Market Prediction, IEEE Congress on Evolutionary Computation, vol. 3, pp ,
[22] Y. Wang, Predicting Stock Price using Fuzzy Grey Prediction System, Expert System with Applications, vol. 22, pp ,
[23] R. Majhi, G. Panda, G. Sahoo, P. K. Dash, and D. P. Das, Stock Market Prediction of S&P 500 and DJIA using Bacterial Foraging Optimization Technique, IEEE Congress on Evolutionary Computation, pp ,
[24] B. G. Malkiel and E. F. Fama, Efficient capital markets: A review of theory and empirical work, The Journal of Finance, vol. 25, no. 2, pp ,
[25] M. Carvalho, Trading Stock Options Made Easy, Lulu.com,
[26] Investopedia Sharper Insight. Smarter Investing. Available:
[27] P. C. Tetlock, Giving Content to Investor Sentiment: The Role of Media in the Stock Market, The Journal of Finance, vol. 62, issue 3, pp ,
[28] Y. Liu, X. Huang, A. An, and X. Yu, ARSA: A Sentiment-Aware Model for Predicting Sales Performance Using Blogs, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp ,

[29] A. Bermingham and A. Smeaton, On using Twitter to monitor political sentiment and predict election results, International Joint Conference for Natural Language Processing (IJCNLP),
[30] E. M. Rogers, Diffusion of Innovation, 3rd ed., A Division of Macmillan Publishing Co., Inc., The Free Press, New York,
[31] E. Katz and P. F. Lazarsfeld, Personal Influence: The Part Played by People in the flow of Mass Communications, The Free Press, New York,
[32] M. Gladwell, The Tipping Point: How Little Things Can Make a Big Difference, Little, Brown and Company, United States of America,
[33] J. Berry and E. D. Keller, The Influentials: One American in Ten Tells the Other nine How to Vote, Where to Eat, and What to Buy, The Free Press, New York,
[34] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts, Everyone's an Influencer: Quantifying Influence on Twitter, Proceedings of the fourth ACM international conference on Web Search and Data Mining (WSDM '11), pp ,
[35] A. L. Samuel, Some studies in machine learning using the game of checkers, IBM Journal of research and development, vol. 3, no. 3, pp ,
[36] A. Pak and P. Paroubek, Twitter as a Corpus for Sentiment Analysis and Opinion Mining, The International Conference on Language Resources and Evaluation, vol. 10, pp ,
[37] A. McCallum and K. Nigam, A comparison of event models for Naïve Bayes text classification, AAAI workshop on learning for text categorization, vol. 752, pp ,

[38] F. Peng and D. Schuurmans, Combining Naïve Bayes and N-gram language models for text classification, European Conference on Information retrieval, Springer Berlin Heidelberg, pp ,
[39] J. Dodd, Twitter Sentiment Analysis, Higher Diploma in Science in Data Analytics, National College of Ireland,
[40] C. Castillo, M. Mendoza, and B. Poblete, Information credibility on Twitter, Proceedings of the ACM 20th International Conference on World Wide Web, pp
[41] A. Bifet and E. Frank, Sentiment knowledge discovery in Twitter streaming data, International Conference on Discovery Science, Springer Berlin Heidelberg, pp. 1-15,
[42] C. Zhang, D. Zeng, J. Li, F. Y. Wang, and W. Zuo, Sentiment analysis of Chinese documents: From sentence to document level, Journal of the Association for Information Science and Technology, vol. 60, issue 12, pp ,
[43] N. Vyrva, Sentiment Analysis in Social Media, Department of Computer Science, Ostfold University College,
[44] L. M. Belue, Multilayer Perceptrons for Classification, No. AFIT/GOR/ENS/92M-02, Air Force Institute of Technology, School of Engineering,
[45] D. Bespalov, Y. Qi, B. Bai, and A. Shokoufandeh, Sentiment Classification Based on Supervised Latent n-gram Analysis, Proceedings of the 20th ACM international conference on Information and knowledge management, pp ,

[46] H. M. Nassif, Learning Sentiment and Semantic Relatedness in User Generated Content Using Neural Models, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology,
[47] Twitter Inc., Social networking service. Available:
[48] Z. Parker, P. Scott, and S. V. Vrbsky, Comparing NoSQL MongoDB to an SQL DB, in Proceedings of the 51st ACM Southeast Conference,
[49] M. Boia, B. Faltings, C. C. Musat, and P. Pu, A :) Is Worth a Thousand Words: How People Attach Sentiment to Emoticons and Words in Tweets, IEEE International Conference on Social Computing (SocialCom), pp ,
[50] E. Jansen and V. James, NetLingo: The Internet Dictionary, NetLingo Inc., Also available:
[51] A. Franz and T. Brants, All Our N-gram are Belong to You, Google Research Blog, Retrieved 16/12/2011, Available:
[52] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai, Class-based n-gram models of natural language, Computational Linguistics, vol. 18, no. 4, pp ,
[53] A. Esuli and F. Sebastiani, SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining, International Conference of Language Resources and Evaluation, pp ,
[54] S. Baccianella, A. Esuli, and F. Sebastiani, SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining, International Conference of Language Resources and Evaluation, pp ,

[55] C. Fellbaum, WordNet, Blackwell Publishing Ltd.,
[56] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets: Second Edition, Cambridge University Press, pp. 1-17,
[57] B. Li, K. C. C. Chan, and C. Ou, Public Sentiment Analysis in Twitter Data for Prediction of a Company's Stock Price Movements, in Proceedings of the IEEE 11th International conference on e-Business Engineering, pp ,
[58] S. M. Ali, B. Ahmed, and K. J. Ballard, Classification of lexical stress patterns using deep neural network architecture, IEEE Spoken Language Technology Workshop (SLT),
[59] J. Pan, C. Liu, Z. Wang, Y. Hu, and H. Jiang, Investigation of deep neural networks (DNN) for large vocabulary continuous speech recognition: Why DNN surpasses GMMs in acoustic modeling, 8th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp ,
[60] F. Seide, G. Li, X. Chen, and D. Yu, Feature engineering in context-dependent deep neural networks for conversational speech transcription, in IEEE ASRU,
[61] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition, in IEEE Signal Processing Magazine,
[62] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi, Measuring User Influence in Twitter: The Million Follower Fallacy, Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM), vol. 10, pp ,

[63] H. Kwak, C. Lee, H. Park, and S. Moon, What is Twitter, a social network or a news media? Proceedings of the 19th International conference on World wide web, ACM, pp ,
[64] Y. Yamaguchi, T. Takahasi, T. Amagasa, and H. Kitagawa, TURank: Twitter User Ranking Based on User-Tweet Graph Analysis, Web Information Systems Engineering (WISE 2010), pp ,
[65] K. Lerman and R. Ghosh, Information Contagion: An Empirical Study of the Spread of News on Digg and Twitter Social Networks, Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (ICWSM), vol. 10, pp ,
[66] M. Skuza and A. Romanowski, Sentiment Analysis of Twitter Data within Big Data Distributed Environment for Stock Prediction, IEEE Proceedings of the Federated Conference on Computer Science and Information Systems, pp ,
[67] YAHOO! FINANCE. (2015). Historical Finance Data. Available:
[68] C. A. Sims, Macroeconomics and reality, Econometrica: Journal of the Econometric Society, pp. 1-48,
[69] R. S. Hacker and A. Hatemi-J, Optimal lag-length choice in stable and unstable VAR models under situations of homoscedasticity and ARCH, Journal of Applied Statistics, vol. 35, no. 6, pp ,
[70] R. S. Hacker and A. Hatemi-J, Can the LR test be helpful in choosing the optimal lag order in the VAR model when information criteria suggest different lag orders? Applied Economics, vol. 41, no. 9, pp ,

[71] C. W. J. Granger, Investigating causal relations by econometric models and cross-spectral methods, Econometrica: Journal of the Econometric Society, pp ,
[72] E. Gilbert and K. Karahalios, Widespread worry and the stock market, Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, pp ,
[73] J. F. Kooijman, Stock market prediction using social media data and finding the covariance of the LASSO, Faculty of Mechanical, Maritime and Materials Engineering, Delft University of Technology,
[74] T. F. Liao, Statistical Group Comparison, Wiley Series in Probability and Statistics, Wiley, Available:
[75] C. Cortes and V. Vapnik, Support-vector networks, Machine learning, vol. 20, no. 3, pp ,
[76] S. R. Gunn, Support Vector Machines for Classification and regression, School of Electronics and Computer Science, University of Southampton,
[77] V. N. Vapnik, The Nature of Statistical Learning theory, Springer Verlag, New York,
[78] T. Joachims, Text categorization with support vector machines: Learning with many relevant features, European conference on machine learning, Springer Berlin Heidelberg, pp ,
[79] W. Chen, J. Shih, and S. Wu, Comparison of support-vector machines and back propagation neural networks in forecasting the six major Asian stock markets, International Journal of Electronic Finance, vol. 1, no. 1, pp ,

[80] W. Huang, Y. Nakamori, and S. Wang, Forecasting stock market movement direction with support vector machine, Computers & Operations Research, vol. 32, no. 10, pp ,
[81] E. G. Baek and W. A. Brock, A nonparametric test for independence of a multivariate time series, Statistica Sinica, pp ,
[82] C. Hiemstra and J. D. Jones, Testing for linear and nonlinear Granger causality in the stock price-volume relation, The Journal of Finance, vol. 49, no. 5, pp ,
[83] L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank citation ranking: bringing order to the web, Stanford University, CA, U.S.,
[84] H. L. Seal, The historical development of the Gauss linear model, Biometrika, vol. 54, pp. 1-24,
[85] J. D. Lowrey and S. R. F. Biegalski, Comparison of least-squares vs. maximum likelihood estimation for standard spectrum technique of β-γ coincidence spectrum analysis, Nuclear Instruments and Methods in Physics Research B, pp ,
[86] L. I. Al-turk, Comparing Between Maximum Likelihood and Least Square Estimators for Gompertz Software Reliability Model, International Journal of Software Engineering & Applications, vol. 5, no. 4, pp ,
[87] B. Scholkopf, K. K. Sung, C. J. C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik, Comparing Support Vector Machines with Gaussian Kernels to Radial Basis Function Classifiers, IEEE Transactions on Signal Processing, vol. 45, no. 11, pp ,

[88] D. M. Romero, W. Galuba, S. Asur, and B. A. Huberman, Influence and Passivity in Social Media, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer Berlin Heidelberg, pp ,
[89] L. Hong, O. Dan, and B. D. Davison, Predicting Popular Messages in Twitter, Proceedings of the 20th international conference companion on World Wide Web, ACM, pp ,
[90] R. Recuero, R. Araujo, and G. Zago, How does Social Capital affect Retweets? ICWSM,
[91] Z. Luo, et al., Who will retweet me?: finding retweeters in twitter, Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, ACM, pp ,
[92] H. Kwak, C. Lee, H. Park, and S. Moon, What is Twitter, a social network or a news media? Proceedings of the 19th international conference on World Wide Web, ACM, pp ,
[93] G. Sugihara, et al., Detecting causality in complex ecosystems, Science 338, pp ,
[94] F. Takens, Detecting strange attractors in turbulence, Dynamical Systems and Turbulence, Lecture Notes in Mathematics 898, pp , Springer Berlin Heidelberg,
[95] H. Ye, et al., Distinguishing time-delayed causal interactions using convergent cross mapping, Scientific Reports 5,

APPENDIX A

Appendix A.1: Linear Regression

Principle

Linear regression is an approach for modelling the relationship between one dependent variable $y$ and one or more independent variables $X = \{x_1, x_2, \dots, x_n\}$. The relationship is modeled by linear predictors whose unknown parameters are estimated from input data [84]. The conditional mean of the dependent variable $y$ given the value of $X$ is most commonly assumed to be an affine function of $X$. Linear models are normally fitted by least squares estimation (LSE), although there are other ways to fit them, such as minimizing the lack of fit in some other norm. In simple linear regression, the equation can be written as

$$ y = \alpha + \beta x \tag{A.1.1} $$

where $y$ is the dependent variable, $\alpha$ is the $y$ intercept, $\beta$ is the slope, and $x$ is the input data. We can extend this simple linear regression to multiple linear regression as in equation (A.1.2).

$$ y_i = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon_i \tag{A.1.2} $$

where $x_1, \dots, x_n$ are the independent variables, $\beta_1, \dots, \beta_n$ are the regression coefficients, and $\varepsilon_i$ is the error. Equation (A.1.2) can also be represented as (A.1.3) and (A.1.4).

$$ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1n} \\ 1 & x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta_1 \\ \vdots \\ \beta_n \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_m \end{bmatrix} \tag{A.1.3} $$

$$ y = X\beta + \varepsilon \tag{A.1.4} $$

Here row $i$ of $X$ holds a leading 1 for the intercept followed by the $i$-th observation of the $n$ independent variables, and $\beta = (\alpha, \beta_1, \dots, \beta_n)^T$.

Estimation Methods

A large number of estimators have been developed for parameter estimation in regression analysis. They differ in properties such as the existence of a closed-form solution and algorithmic complexity. Among them, LSE and maximum likelihood estimation (MLE) are frequently used in regression analysis. Although MLE performs better with a large amount of data, many studies have demonstrated that LSE gives results similar to MLE when the dataset is small, and LSE is also simpler and faster than MLE [85], [86]. As stock prices are predicted on a daily basis in this thesis, the dataset is small, so LSE is chosen to estimate our linear prediction model.

LSE is the simplest and most common estimator in regression analysis. It minimizes the gaps between the observed results in a dataset and the outputs predicted by the linear approximation of the data. Suppose that $\hat{\beta}$ is a candidate for $\beta$. Then the sum of squared residuals (SSR) can be written as in equations (A.1.5) and (A.1.6).

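$$ SSR = \sum_{i=1}^{m} \left( y_i - X_i^T \hat{\beta} \right)^2 \tag{A.1.5} $$

$$ = (y - X\hat{\beta})^T (y - X\hat{\beta}) \tag{A.1.6} $$

The main idea of LSE is to find the optimal $\hat{\beta}$ that minimizes equation (A.1.6). As this is a quadratic function of $\hat{\beta}$, the minimizer can be found via matrix calculus by differentiating and setting the derivative equal to zero, as shown in the equations below.

$$ \frac{d\,SSR}{d\hat{\beta}} = \frac{d}{d\hat{\beta}} \left( (y - X\hat{\beta})^T (y - X\hat{\beta}) \right) \tag{A.1.7} $$

$$ = \frac{d}{d\hat{\beta}} \left( y^T y - \hat{\beta}^T X^T y - y^T X \hat{\beta} + \hat{\beta}^T X^T X \hat{\beta} \right) = -2 X^T y + 2 X^T X \hat{\beta} = 0 $$

Thus $\hat{\beta}$ is given by

$$ \hat{\beta} = (X^T X)^{-1} X^T y \tag{A.1.8} $$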
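The closed form (A.1.8) can be checked numerically with a short sketch. It is an illustration only, written with the Python standard library: instead of forming the inverse explicitly, it solves the equivalent normal equations $(X^T X)\hat{\beta} = X^T y$ by Gaussian elimination. The helper names and the noise-free sample data are assumptions made for the example.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def least_squares(rows, y):
    """Return [alpha, beta_1, ..., beta_n] that minimizes the SSR of (A.1.6)."""
    x = [[1.0] + list(r) for r in rows]          # leading 1 for the intercept, as in (A.1.3)
    cols = [list(c) for c in zip(*x)]            # columns of X, i.e. rows of X^T
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return solve(xtx, xty)                       # solves (X^T X) beta = X^T y

# Noise-free sample data generated from y = 1 + 2*x1 + 3*x2 (illustrative only).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in rows]
beta_hat = least_squares(rows, y)
```

Because the sample data contain no noise, the estimate recovers the generating coefficients up to floating-point rounding. Solving the linear system is generally preferred to computing $(X^T X)^{-1}$ directly, although both express the same estimator.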

Appendix A.2: Support Vector Machine

Principle

Support Vector Machines (SVMs) are supervised learning methods proposed by C. Cortes et al. [87] for classification and regression analysis. Suppose that we want to classify a set of $n$ data points into two subsets, labeled as follows:

$$ P_i = (X_i, y_i), \quad i = 1, 2, \dots, n, \quad y_i \in \{1, -1\}, \quad X_i \in \mathbb{R}^d $$

where $P_i$ is a data point in the set and $y_i$ is either 1 or -1, indicating the class to which the point $X_i$ belongs. Under this assumption, if we have hyperplanes that separate the set of $n$ data points into two subsets, a point $X_i$ lies on a hyperplane if it satisfies equation (A.2.1).

$$ w^T X_i + \beta = 0 \tag{A.2.1} $$

where $w$ is normal to the hyperplane. Figure A.2.1 shows an example of three hyperplanes, $h_1$, $h_2$, and $h_3$, for 10 data points in two subsets. In Figure A.2.1, the hyperplane $h_1$ separates the data points with the maximum margin, while the other two hyperplanes do not separate the classes.
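As a small illustration of equation (A.2.1), the sketch below checks whether a candidate hyperplane puts every labeled point on the correct side, and measures how far the closest point lies from it. The function names, the 2-D points, and the three candidate hyperplanes are hypothetical; they are not the data of Figure A.2.1.

```python
import math

def signed_score(w, beta, x):
    """w^T x + beta, the left-hand side of (A.2.1)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + beta

def separates(w, beta, points, labels):
    """True if every point lies strictly on the side that matches its label (+1 or -1)."""
    return all(y * signed_score(w, beta, x) > 0 for x, y in zip(points, labels))

def closest_distance(w, beta, points):
    """Smallest Euclidean distance from any point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(signed_score(w, beta, x)) for x in points) / norm

# Hypothetical 2-D data with two classes.
points = [(1.0, 1.0), (2.0, 0.0), (0.0, 2.0), (4.0, 4.0), (5.0, 3.0), (3.0, 5.0)]
labels = [-1, -1, -1, 1, 1, 1]

# Candidate hyperplanes given as (w, beta).
h_a = ((1.0, 1.0), -5.0)   # x1 + x2 = 5, halfway between the classes
h_b = ((1.0, 1.0), -7.5)   # x1 + x2 = 7.5, separates but hugs one class
h_c = ((1.0, -1.0), 0.0)   # x1 = x2, does not separate the classes

results = {name: separates(w, b, points, labels)
           for name, (w, b) in (("h_a", h_a), ("h_b", h_b), ("h_c", h_c))}
dist_a = closest_distance(*h_a, points)
dist_b = closest_distance(*h_b, points)
```

Both `h_a` and `h_b` separate the hypothetical classes, but the closest point is much farther from `h_a`, which is the property the maximum-margin hyperplane optimizes.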

Figure A.2.1. An example of hyperplanes which separate data points into two subsets. The hyperplane $h_1$ separates data points with the maximum margin while the other two hyperplanes do not separate the classes.

Therefore, two hyperplanes can be described by equations (A.2.2) and (A.2.3),

$$ w^T X_i + \beta \ge 1 \quad \text{for } y_i = 1 \tag{A.2.2} $$

$$ w^T X_i + \beta \le -1 \quad \text{for } y_i = -1 \tag{A.2.3} $$

and these equations can be combined into equation (A.2.4).

$$ y_i (w^T X_i + \beta) \ge 1 \quad \text{for } 1 \le i \le n \tag{A.2.4} $$

Hard Margin

If the data points are linearly separable, like the example in Figure A.2.1, we can easily select a boundary for the two classes of data that makes the distance between the two bounding hyperplanes as large as possible. The region between the two hyperplanes is called the margin, and we can

calculate it as $\frac{2}{\|w\|}$, where $\|w\|$ is the Euclidean norm of $w$. Figure A.2.2 shows the margin between the two hyperplanes (A.2.2) and (A.2.3). Thus the problem of making the distance between the two bounding hyperplanes as large as possible is equivalent to minimizing $\|w\|^2$, which can be formulated as (A.2.5).

$$ \min_{w, \beta} \; \frac{1}{2} w^T w \quad \text{s.t. } y_i (w^T X_i + \beta) \ge 1 \quad \text{for } 1 \le i \le n \tag{A.2.5} $$

where the factor $\frac{1}{2}$ is added to the objective function of the optimization problem for ease of calculation.

Figure A.2.2. The margin between two hyperplanes, 1) $w^T X_i + \beta \ge 1$ and 2) $w^T X_i + \beta \le -1$. It makes the distance between the two bounding hyperplanes as large as possible.

Soft Margin

However, problems in the real world are mostly nonlinear, so it is hard to apply the maximum-margin hyperplane algorithm above to find feasible solutions. To use the algorithm
