Lending Club Loan Portfolio Optimization Fred Robson (frobson), Chris Lucas (cflucas)

CS22 Artificial Intelligence Stanford University Autumn 26-27

Overview

Lending Club is an online peer-to-peer lending platform that matches borrowers and lenders. Each investor is trying to build the best portfolio of loans. Our project uses Artificial Intelligence techniques to try to build the optimal portfolio of Lending Club loans. In particular, our algorithms try to build the optimal portfolio from the loans offered by Lending Club in any given month.

Introduction

Lending Club is an online peer-to-peer lending platform that matches individual borrowers and lenders. An individual looking to borrow enters her information on the platform. An individual looking to lend money then browses the platform and chooses which loans to invest in. Each loan has a list of characteristics: interest rate, job title, annual income, etc. As an investor, you must use this information to choose which loans to invest in. All investors are trying to choose the loans that will give them the best returns. A 24 New York Times article [1] described how many fund managers use their own credit algorithms to identify loans that may be underpriced or overpriced, and cherry-pick the ones they want. A loan is underpriced if its risk of default is lower than that of other loans offering the same interest rate. In the diagram below, the three red loans would be underpriced. For our project, we created our own algorithm that cherry-picks the best loans. These cherry-picked loans are combined into the optimal investment portfolio.

Literature Review & Similar Studies

We found similar work by Rob Gerritsen [2], which leverages data mining techniques to identify high-risk loans and lending behavior. Gerritsen founded the Exclusive Ore Inc. consulting group, which specializes in data mining for a wide variety of industries including retail, finance, consumer packaged goods, etc.
In this particular study, Gerritsen used Naïve Bayes and decision tree algorithms for classification, yielding an accuracy rate of approximately 85 percent. Although his objective was primarily classification, we believe that reviewing this study gave us an invaluable understanding of the tools and algorithms to rely on when analyzing loan data, most specifically Sharpe's Ratio.

We also looked at similar studies discussing data mining within the banking sector and, more specifically, the potential pitfalls and accuracy tradeoffs of running various algorithms on lending and loan acquisition data. Ahmed et al. [3] built predictive models with the j48, BayesNet and Naive Bayes algorithms in order to classify loan applications between private banks and the agriculture industry. There was also some investigation into the efficacy of neural networks and linear regression for building rating models for loans.

More specifically regarding Lending Club, we found data science articles [8] showing that the loan holder's description of what the loan was for is an important indicator for predicting default. This article was important to the development of our project, as it turned our attention to the text descriptions and gave us some intuition about how to learn from them when formulating our own predictions.

Data

Lending Club has published data on every loan it issued in the period 2-25 [4]. We used this data to train and test our algorithms. For each loan, the data includes characteristics of the borrower, as well as whether or not the borrower defaulted. We split our database of loans into four partitions: a training and a testing data set for each of the two terms (thirty-six and sixty months). The raw data contains sixty fields for each loan; however, not every field has an intuitive use for our learning models, so we removed those fields from our tables. Finally, of the loans we had left, we removed any entries that were missing any of the remaining fields.

We did make some simplifications to the dataset. In particular, we assumed that there were no irregular payments: each month the individual either pays the monthly installment, or is in default and therefore pays nothing. In practice, there are irregular payments as a result of late fees or payment plans. If a borrower is late in paying the installment, the late fee increases the payment. If the individual is regularly struggling to make the payment, Lending Club will agree to a payment plan that reduces the monthly payments so that the borrower does not default. We decided against modeling irregular payments in order to avoid additional complexity that was not needed for a robust analysis of the Lending Club dataset, a decision further justified by the fact that only ~.2% of the loans in our data set relied on payment plans.

Baseline & Oracle

We implemented baseline and oracle algorithms in order to set a reliable success metric for our model and project.
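As a building block for this metric, the payment simplification from the Data section can be sketched as follows. This is a minimal illustration rather than the code used in the project: it assumes each loan is a fixed-installment, fully amortizing loan (the standard annuity formula), and the function names `monthly_installment` and `return_to_date`, as well as the example figures, are hypothetical.

```python
def monthly_installment(principal, annual_rate, term_months):
    """Fixed monthly payment of a fully amortizing loan (standard annuity formula).

    Assumption: Lending Club loans are modeled as plain amortizing loans.
    """
    r = annual_rate / 12.0  # monthly interest rate
    if r == 0:
        return principal / term_months
    return principal * r / (1.0 - (1.0 + r) ** -term_months)


def return_to_date(principal, annual_rate, term_months, months_paid):
    """Dollars received so far per dollar lent.

    Simplification from the report: the borrower pays the full installment
    for `months_paid` months and nothing afterwards (no late fees, no
    payment plans).
    """
    paid = min(months_paid, term_months)
    installment = monthly_installment(principal, annual_rate, term_months)
    return installment * paid / principal
```

For example, `return_to_date(10000, 0.12, 36, 12)` gives the fraction of a hypothetical 36-month, 12% loan that has been recovered if the borrower stops paying after one year.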
For our baseline, we constructed the simplest portfolio investment strategy: invest a proportionally equal amount in every loan for a given period. In this case, our total profit was the weighted average return to date of all the loans issued in the given month. For our oracle, we assumed perfect future knowledge: the oracle picked the loans that we knew had achieved the maximum return.

[Figure: returns to date by investment month for the Baseline and Oracle portfolios; panels: 36 Month Loans and 60 Month Loans.]

The Y axis represents returns to date: if you invested $ in the portfolios at the date on the X axis, how many dollars you would have now, in 26, if you never reinvested any of the returns.

Approach (i): Markov Decision Process

Our first approach was to treat investing in a portfolio as a Markov Decision Process (MDP), in which you choose which loans to invest in and each loan has a chance of randomly defaulting. We defined our MDP as follows:

State: Portfolio of loans, cash, date
Actions: Invest in a loan offered on the Lending Club platform in a given month
Successors: The different portfolios made possible by defaults
Reward: The total payment received from all the loans in the portfolio that month

For a graphical example of the MDP, see the appendix.

However, in this MDP model the transition probabilities are unknown, so we need to estimate them. To do so, we predicted each loan's chance of default in any given month and treated the loans as independent.

Estimating Probability of Default

In order to estimate transition probabilities, we need to estimate

$P(D_x = 1 \mid D_1 = \dots = D_{x-1} = 0)$ for each month $x$ before loan maturity,

where $D_x = 1$ indicates loan default in month $x$. We used machine learning to estimate each of these. We ran ten iterations with a step size of =. and the loss function below:

$\text{Loss}_{\text{squared}}(x, y, w) = (\phi(x) \cdot w - y)^2$

Our feature vector had 23 features: each loan holder's debt-to-income ratio, the number of years they have been employed, the loan's grade, and 20 binary variables, one for each of the top twenty most meaningful words.

Example feature vector:

Feature              Value
Total Debt/Income    .67
Loan Grade           A3
Employment Length    2
Business             (binary)
18 other words       (binary)
Medic                (binary)

In order to pull the most meaningful words, we used TF-IDF and Porter stemming (see appendix). The resulting weights gave some interesting results. Whether or not the description of the loan contained the word "bill" had the greatest positive weight.
If the description of the loan contained "start", the weight was highly positive for the first month. By the 36th month, however, the weight had become negative.

MDP Problems

The MDP approach had significant challenges that limited its effectiveness. Most significantly, the search space became too large to explore feasibly. Each month a lender can invest in any subset of the n loans on offer, so there are $2^n$ possible actions. Additionally, if there are m loans in the investor's portfolio, there are $2^m$ different successor states, depending on which loans default.

To address this challenge, we looked for ways to constrain the search space and scale our approach down to a more realistic level. We ultimately settled on the following: in any given month, an investor could select a single loan from a random sample of 5 and invest the entire investment amount in it. Although these modifications took us further from an accurate model of the real world, we were able to generate optimal results given the circumstances and get some baseline intuition for what an MDP could bring to the model.
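To make the size of the successor set concrete, the sketch below enumerates every default/no-default outcome for a small portfolio in one month, treating loans as independent as described above. The per-loan default probabilities are made-up illustrative values, and `successor_distribution` is a hypothetical helper rather than the project's actual code.

```python
from itertools import product


def successor_distribution(default_probs):
    """Enumerate every default/survive outcome for one month.

    default_probs[i] is an assumed probability that loan i defaults this
    month; loans are treated as independent, as in the report.
    Returns a dict mapping a tuple of booleans (True = defaulted) to the
    probability of that outcome.
    """
    outcomes = {}
    for defaults in product([False, True], repeat=len(default_probs)):
        p = 1.0
        for defaulted, q in zip(defaults, default_probs):
            # Independence: multiply the per-loan probabilities together.
            p *= q if defaulted else 1.0 - q
        outcomes[defaults] = p
    return outcomes


# Three loans with illustrative monthly default probabilities.
dist = successor_distribution([0.02, 0.05, 0.10])
print(len(dist))  # number of successor outcomes: 2**3 = 8
```

Because the dictionary doubles in size with every additional loan, enumerating successors this way is only practical for very small portfolios, which is the difficulty that motivated constraining the action space.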

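The squared-loss training used to estimate default probabilities can be sketched as plain stochastic gradient descent. This is an illustrative sketch under stated assumptions: the default `step_size` of 0.01 is a placeholder chosen for the example rather than the value used in the project, the toy data is invented, and `sgd_squared_loss` is a hypothetical name.

```python
import random


def dot(w, phi):
    """Inner product of a weight vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, phi))


def sgd_squared_loss(examples, num_iters=10, step_size=0.01, seed=0):
    """Minimize (phi(x) . w - y)^2 with stochastic gradient descent.

    examples: list of (phi, y) pairs, where phi is a list of floats and
    y is 1.0 if the loan defaulted in the month being modeled, else 0.0.
    step_size is an assumed placeholder value.
    """
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    for _ in range(num_iters):
        order = list(examples)
        rng.shuffle(order)  # visit the examples in a random order each pass
        for phi, y in order:
            residual = dot(w, phi) - y
            # Gradient of the squared loss with respect to w is 2 * residual * phi.
            w = [wi - step_size * 2.0 * residual * xi for wi, xi in zip(w, phi)]
    return w


# Toy data: two features, two hypothetical loans.
toy = [([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
weights = sgd_squared_loss(toy)
```

A linear score fitted this way is not guaranteed to lie between 0 and 1, so in practice it would need to be clipped or replaced by a logistic model before being used as a probability.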
MDP results

Clearly, the results for this approach were not positive; the MDP approach consistently underperformed the baseline.

[Figure: returns to date by investment month for the Baseline, Oracle and MDP portfolios; panels: 36 Month Loans and 60 Month Loans.]

However, given the extreme constraints we placed on the MDP, the quality of the results is not surprising. Because we limited our actions to investing in only one loan, the portfolio is extremely volatile: if that one loan defaults, the whole portfolio goes to zero. Additionally, we limited the MDP to choosing one of 5 random loans. Among those 5 loans, there may simply not be many that deserve to be cherry-picked. Together these two constraints likely explain the poor results.

Approach (ii): Sharpe's Ratio

Due to the complexity and poor results of the MDP approach, we decided to use a simpler method: maximizing Sharpe's Ratio.

$\text{Sharpe Ratio} = \frac{E[\text{Portfolio Return}] - k}{\sigma_{\text{Portfolio}}}$

The Sharpe Ratio balances the expected return of the portfolio against the riskiness of the portfolio. The optimal portfolio maximizes the Sharpe Ratio.

In order to calculate the Sharpe Ratio we need to estimate:

$E[R_P] = \sum_i w_i E[R_i]$

$\sigma_P^2 = \sum_i \sum_j w_i w_j \mathrm{Cov}(r_i, r_j)$

where the subscript $P$ denotes the portfolio and $w_i$ is the weight of loan $i$. Therefore, for each loan we need to estimate expected return, variance, and covariance with other loans.

Estimating Variance and Expected Return

Using the previously calculated probabilities of default from our Markov Decision Process, we ran twenty Monte Carlo simulations for each loan, where in each month each loan had a probability of defaulting given by the probabilities we learnt for the MDP. The expected return is the average across all the simulations; the variance is the variance of the simulations.

Estimating Covariance

In order to predict covariance between loans, we used k-means clustering to cluster our loans according to zip code and home ownership. To find the covariance between two loans, we find the covariance between their respective clusters:

$\mathrm{Cov}(x, y) = \mathrm{Cov}([a_1, a_2, \dots, a_n], [b_1, b_2, \dots, b_m])$ where $a_i \in k_x$, $b_j \in k_y$

Stochastic Gradient Descent

Finally, we calculate the weight of each investment so that we maximize the Sharpe Ratio. To do so we used Stochastic Gradient Descent. As W is as large as 2, we had to use relatively few iterations due to time constraints. In the end, we used iterations and a step size of.

Maximizing Sharpe Ratio Results

[Figure: returns to date by investment month for the Baseline, Oracle and Max SR portfolios; panels: 36 Month Loans and 60 Month Loans.]

Excitingly, our Sharpe Ratio portfolio significantly outperformed the baseline for both time periods.

Further Improvements

As for further developments for our project, we see a few potential routes for improvement. First, the prediction algorithm used in our second approach, although successful, could be improved by using more features in our vector and by running more than ten iterations, to get us closer to the truly optimal values for the default probabilities, expected returns and variances of each of the loans.

More fundamentally, we could change the machine learning approach we used to predict probability of default. Much of the other literature [7] notes that only a small percentage of borrowers default, and that the dataset is therefore very unbalanced.
In order to combat the skew, some have used random forests, or have limited the dataset so that the number of loans that defaulted is equal to the number of loans that did not.

As another point of improvement, we could flesh out the aforementioned payment plan capabilities. As Lending Club expands its database of loans, the relevance of payment plans would steadily increase and require extra work and logic to account for them. We feel this would make our model a more holistic representation of loan behavior in the real world.

Conclusions

In this report, we have built two methods to try to generate an optimal portfolio of Lending Club loans. Though our MDP approach was not successful, our Sharpe Ratio approach consistently outperformed the baseline on our test dataset. Whether it would be effective beyond the test dataset and in the real world is more uncertain. Our dataset only includes loans from 2-26, when there were no financial crises. As a result, our algorithm suggests investing in low-grade loans with higher interest rates. These low-grade loans do well when economic conditions are good, but do badly when economic conditions decline. Therefore, if there were an economic crash similar to the Great Recession, the portfolio our algorithm recommended would likely do badly.

Appendix

Example MDP

[Figure: graphical example of the MDP.]

TF-IDF Algorithm

Term Frequency-Inverse Document Frequency (TF-IDF) is an algorithm to extract the most important words in a document while accounting for the insignificance of words that appear across multiple documents. tf(t, d) is the number of times the term t occurs in description d, while idf(t, D) is the log of the total number of descriptions in D, the set of all descriptions, divided by the number of descriptions that contain the term t. Hence, words that occur in many descriptions are weighted less. We pulled the top twenty most meaningful words and created binary toggle features for each loan indicating whether or not a given feature word was present in the loan description.

Porter-stemming Algorithm

We leveraged some existing code [5] for reducing the words in each loan description to their root word, or word stem. We used this method to minimize the number of words we needed to iterate over by consolidating the many similarly rooted words among the descriptions.

References

[1] http://www.nytimes.com/24/5/4/business/loans-that-avoid-banks-maybe-not.html?_r=
[2] Gerritsen, Rob. Assessing Loan Risks: A Data Mining Case Study. XCore Case Studies. Exclusive Ore Inc., Nov. 25. Web. 26 Oct. 26. <http://xore.com/casestudies/dm_at_usda_(itpro).pdf>.
[3] Hamid, Aboobyda J., Tarig M. Ahmed, and Nazlı İkizler. Developing Prediction Model and Analyzing Pitfalls of Loan Risk In Banks. Machine Learning and Applications: An International Journal, 26 Mar. 26. Web. 25 Oct. 26.
[4] https://www.kaggle.com/wendykan/lendingclub-loan-data
[5] https://github.com/nok/sklearn-porter
[6] https://web.stanford.edu/~wfsharpe/art/sr/sr.htm
[7] Yhat. Machine Learning for Predicting Bad Loans. Dec 24. http://blog.yhathq.com/posts/machinelearning-for-predicting-bad-loans.html
[8] http://drjasondavis.com/blog/22/4/8/lending-club-loan-analysis-making-money-withlogistic-regression