Lending Club Loan Portfolio Optimization Fred Robson (frobson), Chris Lucas (cflucas)

CS221 Artificial Intelligence, Stanford University, Autumn 2016-2017

Overview
Lending Club is an online peer-to-peer lending platform that matches borrowers and lenders. Each investor is trying to build the best portfolio of loans. Our project uses Artificial Intelligence techniques to try to build the optimal portfolio of Lending Club loans. In particular, our algorithms try to build the optimal portfolio from the loans offered by Lending Club in any given month.

Introduction
Lending Club is an online peer-to-peer lending platform, matching individual borrowers and lenders. An individual looking to borrow enters her information on the platform. An individual looking to lend money then browses the platform and chooses which loans to invest in. Each loan has a list of characteristics: interest rate, job title, annual income, etc. As an investor, you must use this information to decide which loans to invest in. All investors are trying to choose the loans that will give them the best returns. A 2014 New York Times article [1] described how many fund managers use their own credit algorithms to identify loans that may be underpriced or overpriced, and cherry-pick the ones they want. A loan is underpriced if its risk of default is lower than that of other loans offering the same interest rate. In the diagram below, the three red loans would be underpriced. For our project, we created our own algorithm that cherry-picks the best loans. These cherry-picked loans are combined into the optimal investment portfolio.

Literature Review & Similar Studies
We found similar work by Rob Gerritsen [2], which leverages data mining techniques to identify high-risk loans and lending behavior. Gerritsen founded the Exclusive Ore Inc. consulting group, which specializes in data mining for a wide variety of industries including retail, finance, and consumer packaged goods.
In this particular study, Gerritsen used Naïve Bayes and decision tree algorithms for classification, yielding an approximately 85 percent accuracy rate. Although his objective was primarily classification driven, reviewing this study gave us an invaluable understanding of the tools and algorithms to rely on when analyzing loan data, most specifically Sharpe's Ratio.

We also looked at similar studies discussing data mining within the banking sector and, more specifically, the potential pitfalls and accuracy tradeoffs of running various algorithms on lending and loan acquisition data. Hamid et al. [3] built predictive models with the J48, BayesNet and Naive Bayes algorithms in order to classify loan applications between private banks and the agriculture industry. There was also some investigation into the efficacy of neural networks and linear regression for building rating models for loans.

More specifically regarding Lending Club, we found a data science article [8] in which the loan holder's description of what the loan was for turned out to be an important indicator for predicting default. This article was important to the development of our project, as it turned our attention to the text descriptions and gave us some intuition about how to learn from them when formulating our own predictions.

Data
Lending Club has published data on every loan they issued in the period 20[...]-2015 [4]. We used this data to train and test our algorithms. For each loan, the data includes characteristics of the borrower, as well as whether or not the borrower defaulted. We split our database of loans into four partitions: a training and a testing data set for each of the two terms (thirty-six and sixty months). The raw data contains sixty fields for each loan; however, not every field has an intuitive use for our learning models, so we removed those fields from our tables. Finally, of the loans we had left, we removed any entries that were missing any of the remaining fields.

We did make some simplifications to the dataset. In particular, we assumed that there were no irregular payments: each month the individual either pays their monthly installment, or is in default and therefore pays nothing. In practice, there are irregular payments as a result of late fees or payment plans. If a borrower is late paying the installment, the late fee increases the payment. If the individual is regularly struggling to make the payment, Lending Club will agree to a payment plan that reduces the monthly payments so that the borrower does not default. We decided against incorporating irregular payments in order to avoid additional complexities that were not needed for a robust analysis of the Lending Club dataset, a decision further justified by the fact that only ~.2% of the loans in our data set relied on payment plans.

Baseline & Oracle
We implemented baseline and oracle algorithms in order to set a reliable success metric for our model and project.
For our baseline, we constructed the simplest portfolio investment strategy: invest a proportionally equal amount in every loan issued in a given period. In this case, our total profit was the weighted average return to date of all the loans issued in the given month. For our oracle we assumed perfect future knowledge. The oracle picked the loans which we knew had achieved the maximum return.

[Figure: returns to date of the Baseline and Oracle portfolios by month of investment; panels: 36 Month Loans, 60 Month Loans]

The Y axis represents returns to date: if you invested $1 in the portfolio at the date on the X axis, how many dollars you would have now, in 2016, if you never reinvested any of the returns.

Approach (i): Markov Decision Process
Our first approach was to treat investing in a portfolio as a Markov Decision Process (MDP), in which you choose which loans to invest in and each of the loans has a chance of randomly defaulting. We defined our MDP as follows:

State: Portfolio of Loans, Cash, Date
Actions: Invest in a loan offered on the Lending Club platform in a given month
Successors: The different portfolios made possible by defaults
Reward: The total payment received from all the loans in the portfolio that month

For a graphical example of the MDP, see the appendix.

However, in this MDP model the transition probabilities are unknown, so we need to estimate them. To do so, we predicted each loan's chance of default in any given month and treated the loans as independent.

Estimating Probability of Default
In order to estimate transition probabilities, we need to estimate, for each month x with 0 < x < (Loan Maturity):

$P(D_x = 1 \mid D_1 = 0, \ldots, D_{x-1} = 0)$

where $D_x = 1$ indicates loan default in month x.

We used machine learning to estimate each of these. We ran ten iterations with a step size of η = [...] and the loss function below:

$\text{Loss}_{\text{squared}}(x, y, w) = (\phi(x) \cdot w - y)^2$

Our feature vector had 23 features: each loan holder's debt-to-income ratio, the number of years they have been employed, the loan's grade, and 20 binary variables, one for each of the top twenty most meaningful words.

Example feature vector: Total Debt/Income = .67; Loan Grade = A3; Employment Length = 2; binary indicators for "Business", 18 other words, and "Medic".

In order to pull the most meaningful words, we used TF-IDF and Porter stemming (see appendix). The resulting weights gave some interesting results. Whether or not the description of the loan contained the word "bill" had the greatest positive weight.
If the description of the loan contained "start", the weight was highly positive for the first month. By the 36th month, however, the weight had become negative.

MDP Problems
The MDP approach had significant challenges that limited its effectiveness. Most significantly, the search space became too large to feasibly explore. Each month a lender can invest in any subset of the n loans on offer, so there are $2^n$ possible actions. Additionally, if there are m loans in the investor's portfolio, there are $2^m$ different successor states, depending on which loans default. To address this challenge, we looked for ways to constrain the search space and scale our approach down to a more realistic level. We ultimately settled on the following: in any given month, an investor could select a single loan from a random sample of 5 and invest the entire investment amount in it. Although these modifications took us further from an accurate model of the real world, we were able to generate optimal results given the circumstances and get some baseline intuition for what an MDP could bring to the model.
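
As an illustration of the constrained decision described above (pick one loan from a small random sample), the sketch below scores each candidate by its expected total repayment per dollar lent. It uses per-month conditional default probabilities of the kind defined under Estimating Probability of Default, together with the simplification that a borrower either pays the full installment or, once in default, pays nothing. The function names, the tuple layout and the toy numbers are hypothetical, and ranking by expected repayment is an assumption about how such a choice could be made, not the implementation used in this project.

```python
from typing import List, Optional, Sequence, Tuple

# (loan_id, principal, monthly installment, per-month conditional default probabilities)
Loan = Tuple[str, float, float, Sequence[float]]


def survival_probabilities(hazards: Sequence[float]) -> List[float]:
    """Probability that the loan has not defaulted by the end of each month.

    hazards[k] is P(default in month k+1 | no default in earlier months).
    """
    alive = 1.0
    curve = []
    for h in hazards:
        alive *= 1.0 - h
        curve.append(alive)
    return curve


def expected_total_payment(installment: float, hazards: Sequence[float]) -> float:
    """Expected sum of installments when a default stops all later payments."""
    return installment * sum(survival_probabilities(hazards))


def pick_single_loan(sample: Sequence[Loan]) -> Optional[str]:
    """Return the id of the loan with the highest expected repayment per dollar lent."""
    best_id, best_ratio = None, float("-inf")
    for loan_id, principal, installment, hazards in sample:
        ratio = expected_total_payment(installment, hazards) / principal
        if ratio > best_ratio:
            best_id, best_ratio = loan_id, ratio
    return best_id


if __name__ == "__main__":
    # Toy three-month loans (assumed numbers): the higher installment
    # comes with a much higher monthly default probability.
    sample = [
        ("safe", 300.0, 105.0, [0.01, 0.01, 0.01]),
        ("risky", 300.0, 120.0, [0.20, 0.20, 0.20]),
    ]
    for loan_id, principal, installment, hazards in sample:
        ratio = expected_total_payment(installment, hazards) / principal
        print(loan_id, round(ratio, 3))
    print("chosen:", pick_single_loan(sample))
```

Because the whole investment goes into one loan, this score ignores variance entirely, which is the weakness the Sharpe Ratio approach later addresses.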

MDP results
[Figure: returns to date of the Baseline, Oracle and MDP portfolios by month of investment; panels: 36 Month Loans, 60 Month Loans]

Clearly, the results for this approach were not positive; the MDP approach consistently underperformed the baseline. However, given the extreme constraints we placed on the MDP, the quality of the results is not surprising. Because we limited our actions to investing in only one loan, the portfolio is extremely volatile: if that one loan defaults, the whole portfolio goes to zero. Additionally, we limited the MDP to choosing one of 5 random loans. Amongst those 5 loans, there may just not be many that deserve to be cherry-picked. Together these two constraints likely explain the poor results.

Approach (ii): Sharpe's Ratio
Due to the complexity and poor results of the MDP approach, we decided to use a simpler method: maximizing Sharpe's Ratio.

$\text{Sharpe Ratio} = \frac{E[\text{Portfolio Return}] - k}{\sigma_{\text{Portfolio}}}$

The Sharpe Ratio balances the expected returns of the portfolio against the riskiness of the portfolio. The optimal portfolio maximizes the Sharpe Ratio.

In order to calculate the Sharpe Ratio we need to estimate:

$E[R_P] = \sum_i w_i E[R_i]$

$\sigma_P^2 = \sum_i \sum_j w_i w_j \mathrm{Cov}(r_i, r_j)$

Therefore, for each loan we need to estimate expected return, variance, and covariance with other loans.

Estimating Variance and Expected Return
Using the probabilities of default previously calculated for our Markov Decision Process, we ran twenty Monte Carlo simulations for each loan, in which each loan had, in each month, the probability of defaulting that we learnt for the MDP. The expected return is the average across the simulations; the variance is the variance across the simulations.

Estimating Covariance
In order to predict covariance between loans, we used k-means clustering to cluster our loans according to zip code and home ownership. To find the covariance between two loans, we find the covariance between their respective clusters:

$\mathrm{Cov}(x, y) = \mathrm{Cov}([a_1, a_2, \ldots, a_n], [b_1, b_2, \ldots, b_m])$ where $a_i \in k_x$, $b_j \in k_y$

Stochastic Gradient Descent
Finally, we calculate the weight of each investment so that we maximize the Sharpe Ratio. To do so we used Stochastic Gradient Descent. As W is as large as 2[...], we had to use relatively few iterations due to time constraints. In the end, we used [...] iterations and a step size of [...].

Maximizing Sharpe Ratio Results
[Figure: returns to date of the Baseline, Oracle and Max SR portfolios by month of investment; panels: 36 Month Loans, 60 Month Loans]

Excitingly, our Sharpe Ratio portfolio significantly outperformed the baseline for both time periods.

Further Improvements
As for further developments for our project, we see a few potential routes for improvement. First, the prediction algorithm used in our second approach, although successful, could be improved by using more features in our vector and by running more than ten iterations, to get us closer to the truly optimal values for the default probabilities, expected returns and variances of each loan.

More fundamentally, we could change the machine learning approach which we used to predict probability of default. Much of the other literature [7] notes that only a small percentage of borrowers default and that the dataset is therefore very unbalanced.
In order to combat the skewedness, some have used random forests, or have limited the dataset so that the number of loans that defaulted is equal to the number of loans that did not.

As another point of improvement, we could flesh out the aforementioned payment plan capabilities. As Lending Club expands its database of loans, the relevance of payment plans would steadily increase and would require extra work and logic to account for. We feel this would make our model a more holistic representation of loan behavior in the real world.

Conclusions
In this report, we have built two methods to try to generate an optimal portfolio of Lending Club loans. Though our MDP approach was not successful, our Sharpe Ratio approach consistently outperformed the baseline on our test dataset. Whether it would be effective beyond the test dataset and in the real world is more uncertain. Our dataset only includes loans from 20[...]-2016, when there were no financial crises. As a result, our algorithm suggests investing in low-grade loans with higher interest rates. These low-grade loans do well when economic conditions are good, but do badly when economic conditions decline. Therefore, if there were an economic crash similar to the Great Recession, the portfolio our algorithm recommended would likely do badly.

Appendix

Example MDP
[Figure: graphical example of the MDP]

TF-IDF Algorithm
Term Frequency-Inverse Document Frequency (TF-IDF) is an algorithm for extracting the most important words in a document while accounting for the insignificance of words that appear across multiple documents. tf(t, d) is the number of times the term t occurs in description d, while idf(t, D) is the log of the total number of descriptions in D, the set of all descriptions, divided by the number of descriptions that contain the term t. Hence, frequently occurring words are weighted less. We pulled the top twenty most meaningful words and created binary toggle features for each loan, indicating whether or not a given feature word was present in the loan description.

Porter-stemming Algorithm
We leveraged some existing code [5] for reducing the words in each loan description to their root word, or word stem. We used this method to minimize the number of words we needed to iterate over, by consolidating the many similarly rooted words among the descriptions.

References
[1] http://www.nytimes.com/24/5/4/business/loans-that-avoid-banks-maybe-not.html?_r=
[2] Gerritsen, Rob. Assessing Loan Risks: A Data Mining Case Study. XCore Case Studies. Exclusive Ore Inc., Nov. 25. Web. 26 Oct. 2016. <http://xore.com/casestudies/dm_at_usda_(itpro).pdf>.
[3] Hamid, Aboobyda J., Tarig M. Ahmed, and Nazlı İkizler. Developing Prediction Model and Analyzing Pitfalls of Loan Risk In Banks. Machine Learning and Applications: An International Journal, 26 Mar. 2016. Web. 25 Oct. 2016.
[4] https://www.kaggle.com/wendykan/lendingclub-loan-data
[5] https://github.com/nok/sklearn-porter
[6] https://web.stanford.edu/~wfsharpe/art/sr/sr.htm
[7] Yhat. Machine Learning for Predicting Bad Loans. Dec 24. http://blog.yhathq.com/posts/machinelearning-for-predicting-bad-loans.html
[8] http://drjasondavis.com/blog/22/4/8/lending-club-loan-analysis-making-money-withlogistic-regression