Topic-based vector space modeling of Twitter data with application in predictive analytics

Guangnan Zhu (U6023358)
Australian National University
COMP4560 Individual Project Presentation
Supervisor: Dr. Timothy Graham

Stock prediction is Magic!!!

Module Outline
- Motivation and Background
- Goal
- Methods
- Experiments
- Future Work

Motivation
- People want to make money from the stock market, so stock price prediction is attractive.
- Predicting stock prices is challenging (some researchers believe that stock prices follow a random walk).
- This project studies the relationship between social media and stock price movement.

Background
- Common text representations: bag of words, Word2vec
- Problems:
  - Language is not merely a bag of words but a tool with particular properties which have been fashioned in the course of its use. In one sense, language is "due to meaning" [1].
  - The document-term matrix is too large.
[1] Zellig S. Harris (1954). Distributional Structure. WORD, 10(2-3), 146-162. DOI: 10.1080/00437956.1954.11659520

Background
- Topics are obtained through topic-based modeling.
- Topic-modeling-based approaches can work for stock market prediction [1].
- LDA can be used as an effective dimension-reduction method for text modeling and to extract topics from text [2].
[1] Thien Hai Nguyen (2015). Topic Modeling based Sentiment Analysis on Social Media for Stock Market Prediction.
[2] Lei Li and Yimeng Zhang. An empirical study of text classification using Latent Dirichlet Allocation.

Goal
- Hypothesis: the bag-of-words model performs worse than a topic-based model.
- Topic-based modeling:
  - How can LDA topic modelling be used as a feature extraction technique for supervised machine learning on social media data?
  - How accurately can a topic-based vector space model predict Google's stock market prices?
  - What is the relationship between topics and stock price movements?
- Construct a good prediction model: improve the ML model.

Dataset
- Social media dataset: 3 years (2014-2017) of tweet text about Google, collected using the Twitter API
- Historical price dataset:
  - End of day: 3 years (2014-2017) of Google data from NASDAQ
  - Intra-day: 7 months (01/01/2016-01/07/2016) of Google data from NASDAQ
- Labels: each pair of consecutive prices (P_i, P_{i+1}) is mapped to a label (Up, Fair, Down) representing the price movement
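
The labeling step could be sketched as follows. This is only an illustration: the slides do not say how Up, Fair and Down are separated, so the relative-change rule and the 0.5% `threshold` are assumptions, and the prices are made up.

```python
def label_movement(prev_price, next_price, threshold=0.005):
    """Map the relative change between two consecutive prices to a label.

    threshold is an assumed cut-off (0.5%); the slides do not state one.
    """
    change = (next_price - prev_price) / prev_price
    if change > threshold:
        return "Up"
    if change < -threshold:
        return "Down"
    return "Fair"


def label_series(prices, threshold=0.005):
    """Label every consecutive pair (P_i, P_{i+1}) in a price series."""
    return [label_movement(p, q, threshold) for p, q in zip(prices, prices[1:])]


prices = [100.0, 101.0, 101.2, 99.0]  # made-up closing prices
labels = label_series(prices)
print(labels)
```

A series of n prices yields n - 1 labels, one per transition.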

Methods: Combining Tweets
- Combine by day (end-of-day closing price): the stock market opens and closes once per day
  - Pros: the combined text is long, so topics are easy to extract
  - Cons: the number of instances is small (22 trading days per month)
- Combine by hour (intra-day stock prices): tweets are combined every hour
  - Pros: the number of instances is large (22 * 24 = 528 hours per month)
  - Cons: the combined text is short, so topics are hard to extract
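
A minimal sketch of the two combination schemes, assuming each tweet is available as a `(timestamp, text)` pair. The function name `combine_tweets` and the sample tweets are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from datetime import datetime


def combine_tweets(tweets, by="day"):
    """Concatenate the texts of all tweets that fall into the same day or hour.

    tweets: iterable of (timestamp, text) pairs with datetime timestamps.
    Returns a dict mapping the start of each bucket to one combined document.
    """
    buckets = defaultdict(list)
    for ts, text in tweets:
        if by == "day":
            key = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        elif by == "hour":
            key = ts.replace(minute=0, second=0, microsecond=0)
        else:
            raise ValueError("by must be 'day' or 'hour'")
        buckets[key].append(text)
    return {key: " ".join(texts) for key, texts in sorted(buckets.items())}


# Made-up sample tweets.
tweets = [
    (datetime(2016, 1, 4, 9, 15), "google earnings beat"),
    (datetime(2016, 1, 4, 9, 50), "goog up today"),
    (datetime(2016, 1, 4, 14, 5), "new pixel phone"),
    (datetime(2016, 1, 5, 10, 0), "alphabet stock dips"),
]
daily = combine_tweets(tweets, by="day")
hourly = combine_tweets(tweets, by="hour")
print(len(daily), len(hourly))
```

The trade-off from the slide is visible even here: the hourly grouping yields more documents, but each one is shorter.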

Method: Topic Model
- Unsupervised topic modeling with Latent Dirichlet Allocation (LDA)
- LDA represents documents as mixtures of topics, each of which emits words with certain probabilities.
- α: parameter of the Dirichlet prior on the per-document topic distributions; β: parameter of the Dirichlet prior on the per-topic word distributions
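
For reference, the generative process of LDA in its common smoothed formulation can be written as below. The symbols θ_d (topic mixture of document d), φ_k (word distribution of topic k), z_{d,n} (topic of the n-th word in document d) and w_{d,n} (the word itself) are notation assumed here, not taken from the slides; α and β are treated as Dirichlet prior parameters.

```latex
\begin{aligned}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} &\sim \mathrm{Categorical}(\varphi_{z_{d,n}})
\end{aligned}
```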

Method: Topic Model
- Gamma function: constructs a document-topic matrix
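
To make the document-topic matrix concrete, here is a small pure-Python sketch that fits LDA with collapsed Gibbs sampling and returns one row of topic proportions per document. The slides do not say which inference algorithm or library was used, so Gibbs sampling is swapped in purely for illustration; the function name `lda_doc_topic_matrix`, the defaults for `alpha`, `beta`, `iters` and `seed`, and the toy documents are all assumptions.

```python
import random


def lda_doc_topic_matrix(docs, k, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling and return the document-topic matrix.

    docs: list of token lists; k: number of topics.
    alpha, beta, iters and seed are illustrative defaults, not values from the slides.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)

    doc_topic = [[0] * k for _ in docs]       # tokens in document d assigned to topic t
    topic_word = [[0] * v for _ in range(k)]  # times word w is assigned to topic t
    topic_total = [0] * k                     # tokens assigned to topic t overall
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(k)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                wid, z = word_id[w], assignments[d][n]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + v * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][n] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed, normalised counts give each document's topic proportions.
    return [
        [(count + alpha) / (len(doc) + k * alpha) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]


# Toy documents, made up for the example.
docs = [
    "google stock price rises".split(),
    "stock price falls market".split(),
    "new phone camera launch".split(),
    "phone camera review launch".split(),
]
theta = lda_doc_topic_matrix(docs, k=2)
for row in theta:
    print([round(p, 2) for p in row])
```

Each row sums to 1 and can be used directly as a K-dimensional feature vector for a classifier, which is much smaller than a row of the document-term matrix.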

Method: Topic Model as a Black Box
- The topic model is unsupervised and acts as a black box.
- We do not know exactly what the topics are, but we can look at which words belong to each topic.

Method: Machine Learning
- Prediction using ML
- ML methods: Support Vector Machine (SVM), XGBoost

Result: Combine Methods
- An accuracy of 56% in prediction is regarded as a satisfactory result (Schumaker and Chen, 2009b; Si et al., 2013; Tsibouris and Zeidenberg, 1995).

Result: Number of Topics (Value of K), End-of-Day Data
[Figure: accuracy (y-axis, 0.35 to 0.65) plotted against the number of topics k (x-axis, 20 to 120).]

Result: ML Methods and Predicted Result (End-of-Day Data)
- An accuracy of 56% in prediction is regarded as a satisfactory result (Schumaker and Chen, 2009b; Si et al., 2013; Tsibouris and Zeidenberg, 1995).

[Figure: evaluation measures (SVM_Acc, SVM_F_up_Measure, XGB_Acc, XGB_F_up_Measure) plotted against k_value (x-axis, 0 to 125), end-of-day data; y-axis from 0.2 to 0.6.]
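
The series names in the figure suggest two measures per model: accuracy and an F-measure for the Up class. A minimal sketch of both, assuming that F_up_Measure means the F1 score with Up as the positive class (the slides do not define it); the labels below are made up and are not results from the project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred, positive="Up"):
    """F1 score for a single class, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct positive predictions, so precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only.
y_true = ["Up", "Down", "Up", "Fair", "Up", "Down"]
y_pred = ["Up", "Up", "Down", "Fair", "Up", "Down"]
acc = accuracy(y_true, y_pred)
f_up = f_measure(y_true, y_pred)
print(round(acc, 3), round(f_up, 3))
```

Reporting the per-class F-measure alongside accuracy helps when the three labels are imbalanced, since accuracy alone can look acceptable for a model that rarely predicts Up.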

Conclusion
- Topics discussed on Twitter can predict stock price movements.
- LDA topic models can be used as input features for supervised machine learning and achieve close to state-of-the-art accuracy.
- SVM tends to perform better than more advanced algorithms such as XGBoost.
- The topic-based vector space model performs better than a BoW model.

Future Work
- LDA focuses only on the topics of texts. More factors, such as opinions and mood, need to be considered.
- LDA requires the number of topics K to be specified in advance. A non-parametric technique is needed.
- The result is not stable and is sometimes very poor, so it is not suitable for real-world prediction. A more stable technique is needed.
- Stock price prediction needs more factors. Different models should be fused together.

Thank you. Any questions?