Lazy Prices: Vector Representations of Financial Disclosures and Market Outperformance

Kuspa Kai (kuspakai@stanford.edu), Victor Cheung (hoche@stanford.edu), Alex Lin (alin719@stanford.edu)

Abstract

The Efficient Market Hypothesis (EMH) states that market outperformance is impossible through expert selection because each stock price efficiently incorporates and reflects all relevant evaluative information. We study the validity of the EMH by analyzing the latent information of financial disclosures year over year. Specifically, we explore the concept of Lazy Prices, the idea that changes in financial disclosures are correlated with a decrease in market capitalization, using natural language processing methods to factor in changes the market may not capture. We created a novel database of financial disclosures represented as GloVe vectors from 60,000 raw 10-K documents filed with the Securities and Exchange Commission (SEC) from 1994-2016, and trained several models to predict future market performance. Because our best model did not achieve cross-validated prediction accuracy greater than 56%, our results provide evidence in favor of efficient markets. We present our dataset, our methodology for latent information mining, and our results, along with a discussion of future improvements.

1 Introduction and Related Work

Lazy Prices (Cohen et al., 2016) found that firms that modified their periodic financial reports, rather than defaulting to boilerplate, tended to perform worse in the future compared to firms that did not modify their disclosures. This indicates the existence of abnormal returns. For example, suppose a company changes its annual 10-K disclosure by inserting a sentence into a section describing risk factors. Knowing which particular risk factor was added is not necessary for evaluating market performance in this case, because the relevant feature is the implicit information that risk has changed. The measures used by Cohen et al. were TF-IDF and other string edit distances. Cohen et al. used the magnitude of edit distances between documents as a scale for portfolio management, buying non-changers and shorting changers. Using this method, they achieved returns of 30-60 basis points per month over the following year.

Of particular interest is the possibility that more sophisticated parsing and representation of documents may better capture the latent information of the exact changes that lead to financial outperformance. Finding methods that capture semantic meaning or hierarchical structure in changes to these financial disclosures that are otherwise obscure to the market could plausibly form the basis of a more effective portfolio management strategy. We use neural networks to autonomously learn the relevant differential information contained in consecutive financial filings.

This approach has several advantages over string edit distance because it can represent complexities in the difference between documents. Once we vectorize document text into a feature space, semantic meaning and hierarchical structure may be learned by the neural network.

The success of this strategy depends on the degree to which the Efficient Market Hypothesis is true. It claims, in weaker and stronger forms, that all relevant information governing the value of a security is already incorporated into its price, which is then the best estimate of the value of that security. Fama et al. note that the prices of securities will over-adjust to new intrinsic values as often as they under-adjust, and may adjust either before or after new information is made public. This would make any investment strategy based on identifying mispricing nearly impossible, thus invalidating the existence of abnormal returns over the long run. If the EMH holds, then no amount of abstraction and parsing can consistently predict outperformance. This complicates our model evaluation, since poor performance may indicate either a poor model or a task that is intrinsically impossible; however, if markets are not efficient and the Lazy Prices results are reproducible on our data set, then we should be able to achieve good results given good models.

2 Approach

2.1 Data Representation

The Form 10-K is an annual filing that comprehensively describes a company's performance for a fiscal year. All US domestic companies are mandated by the Securities and Exchange Commission (SEC) to file a 10-K each fiscal year. On this corpus, we train our own word embeddings based on the Word2Vec and Paragraph2Vec paradigms. We have XX tokens with a vocabulary size of YY over ZZ documents in total. We also use an alternative approach based on GloVe vectors trained over the Wikipedia corpus. A document represented through this scheme is the arithmetic average of the GloVe vectors of the words in that document. We repeat this for each of the 50-, 100-, and 300-dimensional GloVe vectors available. These representations were developed both at the document and the section level.

Our decision to split the data in this fashion was driven by the hypothesis given in the Lazy Prices paper. We hypothesized that analyzing changes at the granularity of each section would result in a more meaningful representation of the document semantics. Since many sections do not experience significant changes year-on-year, a GloVe-averaged document-level representation may have a very low signal-to-noise ratio.

Figure 1: List of SEC Form 10-K Sections

Comparing individual sections instead allows us to focus on the semantic differences between smaller components. Additionally, the Lazy Prices paper considered the relative changes between individual sections, specifically noting that some sections, such as Item 7, Management's Discussion and Analysis of Financial Condition and Results of Operations, were on average much more dynamic than others. This section-based representation also allows us to train models on an individual section alone. Since many sections do not experience significant changes year-on-year, even a small change in a normally stagnant section could indicate a larger shift in the company's material performance.
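As a concrete illustration, the following is a minimal sketch of the averaged-GloVe representation described above. The GloVe file name and dimensionality are illustrative assumptions, not details taken from the report.

import numpy as np

def load_glove(path):
    # Load pre-trained GloVe vectors (plain-text format) into a dict of word -> vector.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def document_vector(tokens, glove, dim=100):
    # Represent a document (or section) as the mean of its in-vocabulary word vectors.
    vecs = [glove[t] for t in tokens if t in glove]
    if not vecs:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vecs, axis=0)

# Usage (hypothetical file name):
# glove = load_glove("glove.6B.100d.txt")
# doc_vec = document_vector("item 1a risk factors".split(), glove)

Averaging yields a fixed-length vector regardless of document length, which is what lets two filings of different sizes be compared directly.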

2.2 Data Acquisition

We used the SEC EDGAR database as our main source of filing information. Since 1994, the SEC has mandated that all companies submit a digital filing of their 10-K forms. These are available in multiple formats: HTML, text, and XBRL (eXtensible Business Reporting Language). Since we are most interested in the textual information rather than structured financial data, we focused on acquiring the HTML and text documents. SEC EDGAR does not provide an API through which we can request documents. Consequently, downloading the filing documents from SEC EDGAR required the development of a scraping tool. Our scraper was loosely based on the SECEdgar Python library (https://github.com/rahulrrixe/sec-edgar), but we eventually developed our own expanded scraper to better suit our required functionality. Our final scraper consumes a list of company stock tickers and requests listings of filing indices from SEC EDGAR. We ingested all available 10-K annual filing documents from all companies listed on the NYSE and NASDAQ exchanges from 1994 to the present. The crawler parses the listing year and identifies the relevant documents to download amongst the other attached files, documents, exhibits, etc. It prioritizes HTML documents over text documents to improve our signal-to-noise ratio, as we can more easily parse and identify edge cases in the HTML form. We then parse each document to extract its individual sections, saving those as well as the entire document converted to text. Sections were identified by parsing tables and lists of links within the original HTML file. However, our HTML parser was not able to identify linked sections in all files, and was not able to parse .txt-based filings at all. As such, we developed a second parser that ingests all of the related .txt documents and performs search-based parsing to identify sections. We applied this tool to our HTML files as well, in order to extract any sections that may previously have been missed. In handling the downloaded data, we prioritized sections that were parsed directly from HTML, and used the .txt-parsed files to augment our data where necessary. We implemented an error-checking layer in our parser that checks which section IDs were derived from the raw data and matches possible concatenation errors for each section. Each permutation of possible concatenations between sections is considered and given a distinct section ID, so that training examples only compare macroscopically alike sections between consecutive years while still allowing for the word-by-word differences we sought to capture. This error-checking step was vital to our preprocessing, because comparing macroscopically dissimilar sections from year to year would misrepresent training examples as containing much higher degrees of change than actually present in the data.

2.3 Data Preprocessing

Once we acquired our dataset, we needed to identify valid pairs of documents to compare, generate their embedded representations, and prepare them as inputs to our neural network. A pair was considered valid if it contained two documents from consecutive years with a matching section ID. Each document was cleaned of punctuation, numbering, and uppercase lettering. Each word was tokenized and vectorized using the GloVe representation trained on the Wikipedia dataset. To represent a document, we took the mean of each word embedding in the document. This allows us to compare two documents of variable length.
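A minimal sketch of the pairing step follows. The record schema (ticker, year, section_id, text) is illustrative, not the report's actual internal format, and filing years are assumed to be integers.

from collections import defaultdict

def make_pairs(sections):
    # sections: iterable of dicts with keys "ticker", "year", "section_id", "text".
    # Returns (previous_year, current_year) tuples for matching section IDs
    # filed in consecutive years by the same company.
    index = defaultdict(dict)
    for s in sections:
        index[(s["ticker"], s["section_id"])][s["year"]] = s
    pairs = []
    for (_ticker, _section_id), by_year in index.items():
        for year, curr in by_year.items():
            prev = by_year.get(year - 1)
            if prev is not None:
                pairs.append((prev, curr))
    return pairs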
The labels were created using data from Bloomberg Historical Market Capitalization, and are denoted as a one or a zero. A label of 1 corresponds to a 10-K section whose differences from the previous year's analogous section yield a positive change in market capitalization one year later. A label of 0 denotes a negative change.

3 Model

We attempt a variety of models. We simplify the task of predicting outperformance by calculating the year-on-year percentage change in market capitalization for each company, then partitioning the changes into five categories from 0 to 4, with 0 being the worst performance (decreases in market cap) and 4 being the best (large increases in market cap). This abstracts away from predicting the stock price alone, since prices may change drastically for reasons entirely unrelated to performance, such as stock splits, reverse stock splits, share repurchase programs, and so on.
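A minimal sketch of both labeling schemes, the binary sign label described above and the five-way partition of market-cap changes. The market-cap inputs are illustrative; in our pipeline they come from the Bloomberg data.

import numpy as np

def sign_label(cap_now, cap_next_year):
    # 1 if market capitalization increased one year later, else 0.
    return 1 if cap_next_year > cap_now else 0

def quintile_labels(pct_changes):
    # Map year-on-year percentage changes to buckets 0 (worst) .. 4 (best)
    # using the 20th/40th/60th/80th percentiles of the observed changes.
    edges = np.percentile(pct_changes, [20, 40, 60, 80])
    return np.digitize(pct_changes, edges)

# Usage (hypothetical values):
# quintile_labels(np.array([-0.30, -0.05, 0.02, 0.15, 0.60]))  ->  array([0, 1, 2, 3, 4])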

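For reference, a minimal sketch (assuming Keras/TensorFlow, which the report does not name) of the kind of small classifier evaluated in the next section: two fully connected ReLU layers with L2 regularization, optional dropout, and a 2-way softmax output. Treating the "Dual" input as the concatenation of the previous-year and current-year document vectors is our assumption.

import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(input_dim, hidden=50, l2=1e-4, dropout=0.0):
    # Feedforward classifier over averaged-GloVe document features.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        layers.Dense(hidden, activation="relu",
                     kernel_regularizer=regularizers.l2(l2)),
        layers.Dropout(dropout),
        layers.Dense(hidden, activation="relu",
                     kernel_regularizer=regularizers.l2(l2)),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Usage: dual input formed by concatenating two 100-d document vectors.
# model = build_model(input_dim=2 * 100, hidden=50, dropout=0.2)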
4 Network Architecture and Results

Table 1: Model configurations and train/test/validation accuracy.

Input Data         Input Dim.   Architecture               Train    Test     Val
All, Median        Dual         20 x 30 x 2                0.5687   0.4914   0.4955
All, Median        Dual         50 x 2                     0.6189   0.4979   0.4983
Section 7, Median  Dual         50 x 50                    0.7514   0.493    0.5112
All, Median        Dual         50 x 2                     0.8853   0.5022   0.5023
All, Sign          Dual         50 x 50 x Dropout x 2      0.5956   0.5307   0.5351
Section 7, Sign    Dual         50 x 50 x 2                0.8206   0.5456   0.5494
Section 1, Sign    Dual         50 x 50 x 2                0.6968   0.5172   0.5382
All, Sign          Dual         10 x 10 x 2                0.513    0.5364   0.548
All, Sign          Dual         200 x 2                    0.5439   0.5372   0.537
All, Sign          Dual         5 x 5 x Dropout x 5 x 2    0.5787   0.5407   0.5388
All, Sign          Single       5 x 5 x Dropout x 5 x 2    0.551    0.5371   0.5445

Our most performant network architecture consisted of two fully-connected hidden layers, each with ReLU activations and L2 regularization. We tried feeding our dense document representations into several types of networks, varying the number of parameters, the number of hidden layers, and other hyperparameters. We found that certain sections performed better than others when predicting whether a market-cap change would be positive or negative (Table 1).

5 Results and Discussion

We see performance of our models slightly above chance on the validation set when run with certain hyperparameters. This is an encouraging sign: given the number of traders and arbitrageurs who seek to exploit informational inefficiencies in the market, we might reasonably have expected that no model could pick up the signal hidden amongst the noise. Our choice of document representation is not necessarily ideal; we do not see performance improve drastically with larger representations at the densely connected layer. This may be due to the granularity we have chosen for the comparisons across 10-Ks: differences are captured as well through a smaller dense layer as through a larger one. As well, the averaged GloVe vector is a naive approach, and likely discards information that a more structured representation would preserve (see Next Steps).

6 Next Steps

As concrete next steps, we would like to train word2vec and doc2vec embeddings on the larger corpus we have built up over the course of this project. When we initially attempted this, we were looking at a corpus of roughly 500 companies and slightly fewer than 10,000 documents, with 238 million tokens in total. That number has expanded drastically since, which makes training purpose-specific word vectors possible. The well-written nature of most 10-Ks makes them especially amenable to word2vec training without further cleaning. We would also like to explore more sophisticated models using recurrent neural networks over entire sections; this approach may better preserve the meaning of documents. Furthermore, an attention mechanism may help us naturally home in on the parts of the documents that change year on year or that have a significant impact on market performance.
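As a concrete illustration of this next step, a minimal sketch (assuming the gensim library, which the report does not name) of training corpus-specific word vectors over the parsed 10-K text:

from gensim.models import Word2Vec

def train_filing_vectors(sentences, dim=100):
    # sentences: iterable of token lists drawn from the parsed 10-K sections.
    # Hyperparameters below are illustrative defaults, not tuned values.
    model = Word2Vec(sentences, vector_size=dim, window=5,
                     min_count=5, workers=4, sg=1)
    return model

# Usage (hypothetical variable name):
# model = train_filing_vectors(tokenized_sections)
# model.wv["liquidity"]  # corpus-specific embedding for a token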

Acknowledgements

We would like to thank the CS224n TAs for assisting with the challenges of this project, Richard Socher for contributing to the model design, and Chris Manning for providing the infrastructure without which this project would not have been possible.

References

[1] Cohen, Lauren, Malloy, Christopher J., and Nguyen, Quoc H. Lazy Prices (February 10, 2016). Available at SSRN: https://ssrn.com/abstract=1658471