DATA MINING ON LOAN APPROVED DATASET FOR PREDICTING DEFAULTERS


DATA MINING ON LOAN APPROVED DATASET FOR PREDICTING DEFAULTERS

By Ashish Pandit

A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science

Supervised by Dr. Carol Romanowski

Department of Computer Science
B. Thomas Golisano College of Computing and Information Sciences
Rochester Institute of Technology
Rochester, New York
December,

PROJECT REPORT RELEASE PERMISSION FORM

Rochester Institute of Technology
B. Thomas Golisano College of Computing and Information Sciences

TITLE: Data Mining on Loan Approved Dataset for Predicting Defaulters

I, hereby grant permission to the Wallace Memorial Library to reproduce my project in whole or in part.

Date

The project "Data Mining on Loan Approved Dataset for Predicting Defaulters" by has been examined and approved by the following Examination Committee:

Dr. Carol Romanowski
Associate Professor
Project Committee Chair

ACKNOWLEDGEMENT

I would like to thank Dr. Carol Romanowski for giving me the opportunity to do my capstone project under her guidance. I am extremely thankful to her for giving me invaluable inputs and ideas, resolving my doubts whenever I had any, giving me timely feedback after the completion of every milestone, and helping me throughout the semester to complete this project successfully. I would also like to thank Dr. Joe Geigel, my colloquium guide, for explaining to me how the report should be written and for giving me valuable feedback after every milestone presentation and the class poster presentation.

ABSTRACT

In today's world, taking loans from financial institutions has become a very common phenomenon. Every day a large number of people apply for loans, for a variety of purposes. But not all of these applicants are reliable, and not everyone can be approved. Every year, we read about a number of cases where people do not repay the bulk of the loan amount to the banks, due to which the banks suffer huge losses. The risk associated with making a decision on loan approval is immense. So the idea of this project is to gather loan data from multiple data sources and use data mining algorithms on this data to extract important information and predict whether a customer would be able to repay his loan or not. In other words, to predict whether the customer would be a defaulter or not.

Table of Contents

1. INTRODUCTION
   1.1 BACKGROUND AND PROBLEM STATEMENT
   1.2 GOAL OF THE PROJECT
   1.3 WORKFLOW DIAGRAM
   1.4 RELATED WORK
   1.5 HYPOTHESIS
2. PREPARING THE DATASET
   2.1 DATA GATHERING
   2.2 DATA PREPARATION AND CLEANING
3. DATA MINING USING CLASSIFICATION ALGORITHMS ON MERGED DATASET
   3.1 HYBRID NAÏVE BAYES DECISION TREE ALGORITHM
   3.2 NAÏVE BAYES ALGORITHM
   3.3 DECISION TREE ALGORITHM
   3.4 BOOSTING ALGORITHM
   3.5 BAGGING ALGORITHM
   3.6 ARTIFICIAL NEURAL NETWORK ALGORITHM
4. ANALYSING SINGLE DATASET USING CLASSIFICATION ALGORITHMS
   4.1 NAÏVE BAYES ALGORITHM
   4.2 DECISION TREE ALGORITHM
   4.3 BOOSTING ALGORITHM
   4.4 BAGGING ALGORITHM
5. RESULTS AND ANALYSIS
   5.1 COMPARISON OF RESULTS
   5.2 COST SENSITIVE LEARNING
6. CONCLUSION AND FUTURE WORK
   6.1 CONCLUSION
   6.2 FUTURE WORK
REFERENCES

1. INTRODUCTION

1.1 BACKGROUND AND PROBLEM STATEMENT

The importance of loans in our day-to-day lives has increased to a great extent. People are becoming more and more dependent on acquiring loans, be it an education loan, housing loan, car loan, or business loan, from financial institutions like banks and credit unions. However, it is no longer surprising to see that some people are not able to properly gauge the amount of loan they can afford. In some cases people undergo a sudden financial crisis, while some try to scam money out of the banks. The consequences of such scenarios are late or missed payments, defaulting, or, in the worst case, not being able to pay back the bulk of the amount owed to the bank. Assessing the risk involved in a loan application is one of the most important concerns of banks, both for survival in a highly competitive market and for profitability. Banks receive a large number of loan applications from their customers and other people on a daily basis, and not everyone gets approved. Most banks use their own credit scoring and risk assessment techniques in order to analyze a loan application and to make decisions on credit approval. In spite of this, there are many cases every year where people do not repay the loan amount, or default, due to which these financial institutions suffer huge losses. In this project, data mining algorithms will be used to study loan-approved data and extract patterns that help in predicting likely defaulters, thereby helping the banks make better decisions in the future. Multiple datasets from different sources will be combined to form a generalized dataset, and then different machine learning algorithms will be applied to extract patterns and to obtain results with maximum accuracy.
1.2 GOAL OF THE PROJECT

The primary goal of this project is to extract patterns from a common loan-approved dataset and then build a model based on these extracted patterns, in order to predict likely loan defaulters using classification data mining algorithms. Historical data about the customers, such as their age, income, loan amount, and employment length, will be used for the analysis. Later on, some analysis will also be done to find the most relevant attributes, i.e., the factors that affect the prediction result the most.

1.3 WORKFLOW OF PROJECT

The diagram below shows the workflow of this project.

[Workflow Diagram: data from three sources (raw data) → data preprocessing and cleaning → split into training data and test data → classification algorithms build a model → the model classifies each applicant as defaulter or not defaulter]

1.4 RELATED WORK

A lot of work has been done on extracting important information that can be useful to financial institutions. My aim in this project was to gather loan information from multiple sources and apply different classification algorithms to find those that give the best prediction results. I have used the work listed below as reference for my analysis.

Jiang and Li [1] propose a method to improve the prediction results obtained by using the Naïve Bayes and Decision Tree algorithms separately. They tried this hybrid method on 36 UCI datasets and compared the results with those of the individual algorithms. In this project, I have used this hybrid method on the loan-approved dataset obtained by merging three data sources.

The paper by Tiwari and Prakash [2] implements ensemble methods (bagging, boosting and blending) on the SONAR dataset and compares the prediction accuracy with individual algorithms like Naïve Bayes, Decision Tree etc. In this project, I have used the Boosting and Bagging ensemble methods on the merged loan-approved dataset and on the single Lending Club dataset.

The paper by Atiya [4] explains the implementation of artificial neural networks on a bank dataset for predicting bankruptcy. In this project, I have used single-layer and multilayer neural network methods on the loan-approved dataset.

1.5 HYPOTHESIS

My hypothesis was that the Hybrid Naïve Bayes Decision Tree, Boosting and Bagging classification algorithms would give better prediction accuracy than the individual algorithms, and I compared their results against the individual algorithms to verify this. I also wanted to use various classification algorithms on both the merged dataset and an individual dataset, in order to compare the results obtained on the two.

2. PREPARING THE DATASET

2.1 GATHERING DATA

In the first step of accumulating information, previously approved loan datasets from three different sources are gathered. These datasets are merged to form a common dataset, on which the analysis will be done. Table 1 shows details of the datasets:

Table 1: Dataset details

Dataset Name           | No. of attributes | No. of instances | Data Format
Lending Club Loan Data |                   |                  | csv
UCI German Data        |                   |                  | csv
ROC Data               |                   |                  | sav

2.2 DATA PREPARATION AND CLEANING

Tools used for data cleaning:
1) Google Refine
2) Weka
3) R (for converting .sav data to .csv format)

One of the most important tasks in preparing a common dataset is deciding which attributes can be used from the three tables, since all of them have different numbers of attributes, and attributes in different forms. Nine attributes were selected for preparing the new dataset: (a) age of the loan applicant, (b) job profile [less, moderately, highly skilled], (c) annual income, (d) employment length, (e) loan amount, (f) loan duration, (g) purpose of loan, (h) housing [rent, own], and (i) loan history [Defaulter, Not Defaulter], which is the class attribute. These attributes were common to all three, or at least two, of the datasets. All other attributes were logically eliminated from each dataset. Tables 2, 3 and 4 show how the selected attributes look in each of the tables before merging.

Table 2: Lending Club Data

Attribute Name | Type    | Values Example
age            | Missing |
Job profile    | Missing |
Income         | Numeric | 5000, …
Emp Length     | Nominal | <1, 5, 10+
Loan Amount    | Numeric | 1200, 3500, …
Loan Duration  | Numeric | …, 60 (in months)
Purpose        | Nominal | Car loan, House Loan, Business Loan etc.
Housing        | Nominal | Rent, own
Loan History   | Nominal | Defaulter, Not Defaulter

Table 3: UCI German Data

Attribute Name | Type    | Values Example
age            | Numeric | 33, 50, 46
Job profile    | Nominal | Less, Moderately, Highly skilled
Income         | Missing |
Emp Length     | Nominal | <1, 1 to 4, 4 to 7, 7+
Loan Amount    | Numeric | 1200, 3500, …
Duration       | Numeric | 12, 24, 48 (in months)
Purpose        | Nominal | Car loan, House Loan, Business Loan etc.
Housing        | Nominal | Rent, Own, Free
Loan History   | Nominal | Defaulter, Not Defaulter

Table 4: ROC Data

Attribute Name | Type    | Values Example
age            | Numeric | 33, 50, 46
Job profile    | Nominal | Less, Moderately, Highly skilled
Income         | Numeric | 5000, …
Emp Length     | Nominal | 1, 8, 15
Loan Amount    | Numeric | 1200, 3500, …
Duration       | Missing |
Purpose        | Missing |
Housing        | Missing |
Loan History   | Nominal | Defaulter, Not Defaulter

The dataset obtained by merging these three datasets was raw, and it needed a lot of cleaning.

Tackling the data cleaning tasks

(1) Age attribute: As we can see from the tables above, the Lending Club dataset does not have any information regarding the age of the loan applicant, so all 5000 values for that attribute are unknown. Now,

ideally one would have removed the entire attribute, but age might be an important factor in determining whether the applicant could be a defaulter or not. So some logical assumptions were made to fill these missing age values, based on the age and employment length values of the other two tables. A person who has more than 7 years of experience is very likely to be in his mid 30s, whereas a person with 3-4 years of experience is likely to be in his mid-to-late twenties. A density-based clustering algorithm was therefore used to find a relation, or pattern, between the two attributes. This algorithm divides the age and employment length values of the other two datasets into four clusters, as shown in Figure 1.1.

Figure 1.1: Cluster Division

The number of instances in each cluster is shown in Figure 1.2:

Figure 1.2: Clustered instances

Cluster 1 and Cluster 2 have a major difference in their average age values, although the employment length centroid in both is the same (1 <= X < 4). However, the number of instances in cluster 1 is very small (9 percent) compared to 43 percent in cluster 2, so cluster 1 will not be taken into consideration. Based on these three cluster groupings, three categories for the values of age are made: (1) <= 27, (2) 28 <= X <= 37, (3) >= 38. The age values in the UCI and ROC datasets are numeric by default, so both datasets were combined and the text facet feature of Google Refine was used to group all values of the same age together. The attribute has 53 different numeric age value groups, as shown in Figure 1.3. Each of these 53 age group values has been put into its respective category according to the age values.
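The binning step above can be sketched in Python. This is a hypothetical re-implementation, not the original Google Refine workflow, and the typical ages assigned to each employment-length bin are illustrative assumptions, not the actual cluster centroids:

```python
def age_category(age):
    """Map a numeric age into the three bins derived from the clustering step."""
    if age <= 27:
        return "<=27"
    if age <= 37:
        return "28<=X<=37"
    return ">=38"

# Hypothetical imputation for the missing Lending Club ages: pick a typical age
# for each employment-length bin (these values are assumed, not the real centroids).
ASSUMED_TYPICAL_AGE = {"<1": 24, "1 to 4": 30, "4 to 7": 34, "7+": 40}

def impute_age_bin(emp_length_bin):
    """Guess an age category for a row whose age is missing."""
    return age_category(ASSUMED_TYPICAL_AGE[emp_length_bin])
```

The point of the sketch is only the shape of the transformation: numeric ages collapse to three nominal categories, and missing ages are inferred from employment length.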

But the tricky part here is not knowing whether the job the loan applicant holds is his first job or, say, his fifth job. So I will also do the data analysis without considering the age attribute, and see how that affects the prediction accuracy compared to when age is taken into consideration. The various age groups obtained after applying the text facet feature on the age attribute are shown in Figure 1.3.

Figure 1.3: Text facet on age to group age values in the dataset

(2) Employment Length: The employment length values in both the Lending Club and ROC datasets are numeric, but in the UCI German dataset the employment lengths are in four bins, i.e. less than 1 year, 1 to 4 years, 4 to 7 years, and more than 7 years. These four bins will be common for the entire dataset. The numeric values in the other two datasets will be put into their respective bins. The text facet function of Google Refine is used to get groups of all possible values and then put these groups into their respective bins.

(3) Housing: This is a nominal attribute and has four possible values, i.e. rent, own, free and other, as shown in Table 5.

Table 5: Housing attribute categories

Housing        | Frequency (Number of Values)
rent           | 4475
own            | 1405
free           | 108
other          | 30
Missing Values | 102

Since the instances of free (108) and other (30) make up a very small fraction of the dataset, these two are merged into one value, Other, which now has 138 instances. Housing cells with the value rent account for 74% of the values in the dataset, so all 102 missing values of the housing attribute are filled with the mode, i.e. the most frequent value, rent. The housing attribute finally has only three values: rent (4575 values), own (1405 values) and other (138 values).

(4) Loan Purpose: This is a nominal attribute. The loan purpose attribute in the Lending Club dataset had 12 different values, whereas the UCI dataset had 7 different values. All the values for loan purpose were unknown in the ROC dataset. When the datasets are combined, only the following values have a considerable number of instances: car (524 values), credit card (610 values), debt consolidation (2450 values), house loan or home improvement (927 values) and other (620 values). The probability of occurrence of all the other values is very small. So a new category value named other/unknown is introduced, which includes all the remaining category instances. This category also includes the 620 values of the category other. All missing values of loan purpose are also filled with other/unknown. Finally, loan purpose has only five category values: 1) car, 2) credit card, 3) debt consolidation, 4) house loan/home improvement, 5) other/unknown.
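The housing and loan-purpose cleanups follow the same two recipes: mode imputation and rare-category consolidation. A small Python sketch of both (hypothetical helpers; the project did this with Google Refine facets):

```python
from collections import Counter

def impute_with_mode(values, missing=None):
    """Fill missing cells with the most frequent observed value
    (for the housing attribute, the mode is 'rent')."""
    mode = Counter(v for v in values if v != missing).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]

def consolidate_rare(values, keep, other="other/unknown"):
    """Collapse categories outside `keep` (including missing cells) into one
    bucket, as done for the loan purpose attribute."""
    return [v if v in keep else other for v in values]
```

Applied to the purpose column with keep = {car, credit card, debt consolidation, house loan/home improvement}, every other or missing value ends up in other/unknown.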

The various categories of the purpose attribute in the UCI German dataset are shown in Figure 1.4.

Figure 1.4: Purpose attribute in the UCI German dataset

The various categories of the purpose attribute in the Lending Club dataset are shown in Figure 1.5.

Figure 1.5: Purpose attribute in the Lending Club dataset

(5) Job Profile and Income: Job profile, income and employment length are three attributes that are correlated. We can reasonably say that the income of a highly skilled employee with little experience would be higher than that of a less skilled employee, even one with relatively more experience. This relation will be used to find the missing values of Job Profile and Income. The Lending Club dataset has all Job profile values unknown. In this case, the income and employment length values of the Lending Club and ROC tables are used to extract a pattern. The UCI German dataset has not been used here, since all of its income values are unknown. The K-means clustering algorithm is used to find a pattern between the attributes. This algorithm divides the attribute values into three clusters, shown in Figure 1.6.

Figure 1.6: Cluster Division

All the unknown cells of the Job profile column are filled with the value cluster 0, cluster 1 or cluster 2, according to the model created by the K-means clustering algorithm.

The values obtained for the Job_profile attribute by using K-means clustering are shown in Figure 1.7.

Figure 1.7: K-means output (the Cluster column will be renamed job_profile)

Based on the cluster centroids, cluster 0, which has a mean annual income of $77928 and employment length 7+ years, has been renamed highly skilled; cluster 1, which has a mean annual income of $47230 and employment length 1 to 4 years, has been renamed less skilled; and cluster 2, which has a mean annual income of $59420 and employment length 4 to 7 years, has been renamed moderately skilled. The cluster 0, 1 and 2 instances in the column are replaced by highly, less and moderately skilled respectively. Now the relationship derived from the clustering model is used to fill in all the unknown income values. For example, consider filling an income value whose corresponding employment length is 4 to 7 years and whose Job profile is moderately skilled. All the instances having 4 to 7 years and moderately skilled are gathered using the text facet function of OpenRefine, and the mean of those income values is used to fill in the blank cells. There are 1503 matching instances with employment length 4 to 7 years and job profile moderately skilled. The arithmetic mean of the known income values of these 1503 instances, in this case $59342, is inserted into all the blank income cells having 4 to 7 years and moderately skilled. Similarly, all the other missing values of the income attribute are calculated by taking the mean of all income values with the matching job profile and employment length.
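The group-wise mean imputation for income can be expressed as a short Python sketch (a hypothetical re-implementation of the text-facet-and-mean procedure described above):

```python
def fill_income_by_group(rows):
    """rows: dicts with 'job_profile', 'emp_length' and 'income' (None if missing).
    Each missing income gets the arithmetic mean of the known incomes that share
    the same (job profile, employment length) pair."""
    totals = {}  # (job_profile, emp_length) -> [sum, count]
    for r in rows:
        if r["income"] is not None:
            key = (r["job_profile"], r["emp_length"])
            s = totals.setdefault(key, [0.0, 0])
            s[0] += r["income"]
            s[1] += 1
    for r in rows:
        if r["income"] is None:
            s = totals[(r["job_profile"], r["emp_length"])]
            r["income"] = s[0] / s[1]
    return rows
```

For the (4 to 7 years, moderately skilled) group in the report, this mean works out to $59342 over the 1503 matching rows.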

Removing Rows: Loan history is the class attribute. It is a nominal attribute with two category values, i.e. defaulter and not defaulter. It also has 16 blank or missing values. Since this is the class attribute, the 16 rows that do not have a value for it are removed. There are a total of 50 loan history instances having the value current. These loan payments are currently in progress, and there is no indication of whether they will default on their future payments or not, so these 50 rows are also deleted. In addition, there are 177 rows whose employment length value is blank but which have an income associated with them. Here we are not sure whether it is the applicant's previous job's income, or the income of a co-signer or family member. We also have no values for the Job profile and age cells of these rows. Since a lot of important data is missing in these 177 rows, they are deleted as well.

FINAL DATASET

The final dataset, obtained by merging the three datasets and cleaning the result, has a total of 5857 rows. There are no missing values in this new dataset. The list of attributes along with their types is shown in Table 6.

Table 6: Final Dataset

Attribute Name | Type    | Values Example
Age            | Nominal | <=27, 28<=X<=37, >=38
Job profile    | Nominal | less, moderately, highly skilled
Income         | Numeric | 5000, …
Emp Length     | Nominal | <1, 1 to 4, 4 to 7, 7+
Loan Amount    | Numeric | 1200, 3500, …
Duration       | Numeric | …, 60 (in months)
Purpose        | Nominal | Car loan, House Loan, Business Loan etc.
Housing        | Nominal | rent, own, other
Loan History   | Nominal | Defaulter, Not Defaulter

The new dataset has been divided into a training set and a test set in the ratio of 80:20. After splitting the data, models are built on the training set, based on extracted data patterns, using classification algorithms. These classifier models are then evaluated on the test dataset.
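The 80:20 split can be sketched as follows (a hypothetical helper; the project used Weka's own split facilities):

```python
import random

def split_80_20(rows, seed=7):
    """Shuffle the cleaned rows and split them into training and test sets
    in the ratio 80:20."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # deterministic for a fixed seed
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]
```

Applied to the 5857-row merged dataset, this would give 4685 training rows and 1172 test rows; the report's evaluations use a 1156-row test set, so the project's own split was evidently taken slightly differently.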

3. DATA MINING USING CLASSIFICATION ALGORITHMS ON MERGED DATASET

3.1 HYBRID NAÏVE BAYES DECISION TREE ALGORITHM

Naïve Bayes and Decision Trees are two of the most important classification algorithms for prediction purposes, due to their accuracy, simplicity and effectiveness. Their prediction accuracies can be increased further by combining the advantages of both algorithms in a Hybrid Naïve Bayes Decision Tree algorithm. This algorithm gives higher prediction accuracy than Naïve Bayes and Decision Tree used individually, while the time complexity does not increase by a great extent [1]. The implementation of the algorithm is divided into two parts. In the first part, Naïve Bayes and decision tree models are created and assessed individually on the training data. In the second part, the class probabilities obtained for every instance of the test set are averaged, weighted by the classification accuracies obtained on the training data [1]. Finally, the result of the Hybrid Naïve Bayes Decision Tree algorithm is compared with the results of the Naïve Bayes and Decision Tree algorithms calculated individually. The algorithm [1] is as follows:

Phase 1: (Used WEKA)

INPUT: Training Data

STEPS:
1) Build a classifier model on the training data using a Decision Tree, denoted by (C4.5)
2) Evaluate the accuracy of the model on the training data, denoted by (ACC_C4.5)
3) Build a classifier model on the training data using Naïve Bayes, denoted by (NB)
4) Evaluate the accuracy of the model on the training data, denoted by (ACC_NB)
5) Return the models built along with their evaluated accuracies.

OUTPUTS: (C4.5, ACC_C4.5, NB, ACC_NB)

Phase 2: (Used JAVA)

INPUT:
1) The models built in the first phase, i.e. C4.5, NB
2) Their respective accuracies ACC_C4.5, ACC_NB
3) A test data instance, denoted by x

STEPS:
1) For every class label c of the test instance x (in this case 2 class labels: c is either defaulter or not defaulter)
2) Calculate P(c|x)_C4.5 by using the decision tree model (C4.5). The formula [1] is:

   P(c|x)_C4.5 = ( Σ_{i=1..k} δ(c_i, c) + 1 ) / ( k + n_c )

   where k is the number of training instances in the particular leaf node where x falls, c_i is the class of the ith training instance in that leaf, and n_c is the number of classes [1]. The function δ(c_i, c) is equal to 0 if its two parameters are not equal and 1 if they are equal [1].

   For this dataset:
   Calculate P(not defaulter|x)_C4.5 using the decision tree model (C4.5) and the formula above
   Calculate P(defaulter|x)_C4.5 using the decision tree model (C4.5) and the formula above

3) Calculate P(c|x)_NB by using the Naïve Bayesian classifier model (NB). The formula [1] is:

   P(c|x)_NB = P(c) · Π_{j=1..m} P(a_j|c)

Here, m denotes the total number of attributes, a_j is the value of the jth attribute of the test instance x, and c, as mentioned earlier, is a value of the class attribute [1]. The prior probability P(c) [1] is calculated using the formula:

   P(c) = ( Σ_{i=1..n} δ(c_i, c) + 1 ) / ( n + n_c )

Here c_i is the value of the class attribute of the ith training row, n is the total number of rows in the training set, and n_c is the total number of classes (in this case: 2) [1]. The function δ(c_i, c) is equal to one if both parameters are equal and zero if they are not [1]. The conditional probability P(a_j|c) [1] is calculated using the formula:

   P(a_j|c) = ( Σ_{i=1..n} δ(a_ij, a_j) · δ(c_i, c) + 1 ) / ( Σ_{i=1..n} δ(c_i, c) + n_j )

Here a_j is the value of the jth attribute of the test instance x, a_ij is the value of the jth attribute of training row i, c_i is the value of the class attribute of the ith training row, and n_j is the total number of values that the jth attribute can have [1].

For this dataset:
Calculate P(defaulter|x)_NB using the Naïve Bayesian model (NB)
Calculate P(not defaulter|x)_NB using the Naïve Bayesian model (NB)

4) Calculate P(c|x)_C4.5-NB using the formula below [1]:

   P(c|x)_C4.5-NB = ( ACC_C4.5 · P(c|x)_C4.5 + ACC_NB · P(c|x)_NB ) / ( ACC_C4.5 + ACC_NB )

For this dataset:
Calculate P(defaulter|x)_C4.5-NB

Calculate P(not defaulter|x)_C4.5-NB

5) Find the class with the maximum value of P(c|x)_C4.5-NB obtained in the previous step. For this dataset:

If ( P(defaulter|x)_C4.5-NB > P(not defaulter|x)_C4.5-NB ) {
    the class label of the instance is defaulter according to the hybrid algorithm
} else if ( P(not defaulter|x)_C4.5-NB > P(defaulter|x)_C4.5-NB ) {
    the class label of the instance is not defaulter according to the hybrid algorithm
}

OUTPUT: The class label (defaulter or not defaulter) for the test instance x

FINAL RESULT: The Phase 2 steps are repeated for all instances of the test dataset. The class label output obtained for every test instance is copied to an Excel file, and the actual class labels are pasted in the next column of the same file. This file is read by a Java program, and the correctly and incorrectly classified instances are counted by comparing the two columns. A sample of the probabilities obtained by using the hybrid algorithm is shown in Figure 2.1.
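The accuracy-weighted combination in step 4 and the argmax in step 5 reduce to a few lines. This is a hedged Python sketch of that combination step only (the project's actual implementation was in Java; names here are illustrative):

```python
def hybrid_probability(p_c45, p_nb, acc_c45, acc_nb):
    """Step 4: average the two classifiers' class probabilities, weighted by
    the training accuracies returned from Phase 1 [1]."""
    return (acc_c45 * p_c45 + acc_nb * p_nb) / (acc_c45 + acc_nb)

def hybrid_label(probs_c45, probs_nb, acc_c45, acc_nb):
    """Step 5: pick the class with the larger combined probability.
    probs_*: dict mapping class label -> probability for one test instance."""
    combined = {c: hybrid_probability(probs_c45[c], probs_nb[c], acc_c45, acc_nb)
                for c in probs_c45}
    return max(combined, key=combined.get)
```

For example, with ACC_C4.5 = 0.74 and ACC_NB = 0.72, an instance scored {defaulter: 0.6, not defaulter: 0.4} by the tree and {defaulter: 0.3, not defaulter: 0.7} by Naïve Bayes comes out as not defaulter, since the Naïve Bayes probabilities pull the weighted average past 0.5.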

Figure 2.1: Probabilities obtained after completing step 4

The prediction accuracy and confusion matrix obtained by using the Hybrid Naïve Bayes Decision Tree algorithm are shown in Figure 2.2.

Figure 2.2: Hybrid Naïve Bayes Decision Tree Algorithm

Hybrid Naïve Bayes Decision Tree Accuracy: 73.80 %

Figure 2.3 shows the result of using the Naïve Bayes algorithm separately on the merged dataset.

Naïve Bayes Algorithm Result

Figure 2.3: Naïve Bayes classification algorithm using Weka

Naïve Bayes Accuracy: 72.40 %

Figure 2.4 shows the result of using the Decision Tree (J48) algorithm separately on the merged dataset.

Decision Tree Algorithm (J48) Result

Figure 2.4: J48 classification algorithm using Weka

Decision Tree J48 Accuracy: 73.27 %

The classification accuracy obtained by using the hybrid algorithm (73.80 %) shows an improvement over the accuracy of the individual classification algorithms, in this case Naïve Bayes and Decision Tree, although by a relatively small margin, as shown in Table 7. It does, however, serve the purpose of the algorithm, i.e. improving on the accuracy of the individual classification algorithms without greatly increasing the time complexity. Table 7 compares the accuracies obtained by using the Hybrid, Naïve Bayes and Decision Tree algorithms respectively.

Table 7: Comparison of Prediction Accuracies

Classification Algorithm                   | Accuracy | Correctly Classified | Incorrectly Classified
Hybrid Naïve Bayes Decision Tree Algorithm | 73.80 %  | —                    | —

Naïve Bayes Algorithm                      | 72.40 %  | 837/1156             | 319/1156
Decision Tree Algorithm                    | 73.27 %  | 847/1156             | 309/1156

ENSEMBLE METHODS

Ensemble methods either use more than one data mining algorithm, or use one data mining algorithm multiple times, in order to improve the prediction accuracy compared to a single use of an algorithm on the dataset.

1) BOOSTING ALGORITHM

In the first iteration of the Boosting algorithm, a classification model is created on the training data using a data mining algorithm. The second iteration creates a classification model that concentrates on the instances, or rows, that were incorrectly classified in the first iteration [2]. This process continues until some constraint is reached with regards to the accuracy or the number of models [2]. The aim of using the Boosting ensemble method is to get better results than the individual classification algorithms. For this dataset, the AdaBoostM1 classification algorithm is used. AdaBoostM1 is tried with different base classifier algorithms, such as the J48 decision tree, Naïve Bayes, SVM and a neural network, in order to find the base classification algorithm that gives the best results for this dataset. In this case, the J48 decision tree used as the base classifier gives the best results for AdaBoostM1, so the final analysis uses AdaBoostM1 with J48. Also, various numbers of iterations, i.e. numbers of successive models to be created (N), are tried in order to get good prediction accuracy. Values of N above 30 either leave the classification accuracy unchanged or decrease it. Finally, N = 30 is selected, since it gives the best prediction results. Figure 3.1 shows the result of using AdaBoostM1 with J48 as the base classifier and the number of successive models N = 30.

AdaBoostM1 Algorithm using J48 as base classifier: Result

Figure 3.1: AdaBoostM1 algorithm using Weka

AdaBoostM1 using J48 Accuracy: 74.31 %

Table 8 shows the prediction accuracies of the AdaBoostM1 classification algorithm using different base classifier algorithms and different numbers of successive models to be created, N.

Table 8: AdaBoostM1 Classification Algorithm Accuracy Results
(N = number of successive models)

Base Classification Algorithm Used | N = 3 | N = 10 | N = 20 | N = 30
J48 Decision Tree                  |   %   |   %    |   %    |   %
Naïve Bayes                        |   %   |   %    |   %    |   %
Support Vector Machines (SVM)      |   %   |   %    |   %    |   %
K Nearest Neighbors (KNN)          |   %   |   %    |   %    |   %

The base classification algorithm used in this case is the J48 decision tree, since J48 gives the best results for this dataset compared to the other individual algorithms, and the number of successive models is 30. The aim of using the AdaBoostM1 ensemble method is to get better prediction results than the individual classification algorithms, in this case the J48 algorithm.

2) BAGGING ALGORITHM

Bagging is a type of ensemble that divides the entire training data into various small samples and then creates a separate classifier model for every sample [2]. The results obtained from all these classifier models are finally merged using techniques like majority voting or averaging of the results [2]. The main advantage here is that each sample obtained from the training set is unique, so every classifier model that is created is trained on a slightly different, unexplored part of the problem. Like AdaBoostM1, the Bagging algorithm is also tried with different base classifiers, such as the J48 decision tree, Naïve Bayes, SVM and a neural network, in order to find the base classification algorithm that gives the best results for this dataset. In this case, the J48 decision tree used as the base classifier gives the best results for Bagging, so the final analysis uses Bagging with J48. Also, various numbers of samples to be created (N) are tried in order to get good prediction accuracy. Values of N above 30 either leave the classification accuracy unchanged or decrease it. Finally, N = 30 is selected, since it gives the best prediction results. Figure 3.2 shows the result of using Bagging with J48 as the base classifier and the number of samples N = 30.

Bagging Algorithm using J48 as base classifier: Result

Figure 3.2: Bagging algorithm using Weka

Bagging using J48 Accuracy: 75.09 %

Here, the inbuilt Bagging algorithm of WEKA is used. Table 9 shows the prediction accuracies of the Bagging classification algorithm using different base classifier algorithms and different numbers of samples to be created, N.

Table 9: Bagging Classification Algorithm Results

Base Classification Algorithm Used | N = 3 | N = 10 | N = 20 | N = 30
J48 Decision Tree                  |   %   |   %    |   %    |   %
Naïve Bayes                        |   %   |   %    |   %    |   %
Support Vector Machines (SVM)      |   %   |   %    |   %    |   %
K Nearest Neighbors (KNN)          |   %   |   %    |   %    |   %
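The core of Bagging, bootstrap sampling plus majority voting, can be sketched independently of Weka. This is a minimal, hypothetical Python illustration of the mechanics, not the Weka implementation; `base_learner` stands in for J48:

```python
import random
from collections import Counter

def bagging_predict(train_rows, x, base_learner, n_models=30, seed=0):
    """Train n_models base learners on bootstrap samples of the training data
    and combine their predictions on x by majority vote.
    base_learner(rows) must return a function predict(x) -> class label."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        # bootstrap: sample with replacement, same size as the training set
        sample = [rng.choice(train_rows) for _ in train_rows]
        votes[base_learner(sample)(x)] += 1
    return votes.most_common(1)[0][0]
```

A trivial base learner that always predicts the majority class of its sample is enough to exercise the mechanics; in the project the base learner is the J48 decision tree, and n_models corresponds to N = 30 above.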

The base classification algorithm used in this case is the J48 decision tree, since J48 gives the best results for this dataset compared to the other individual algorithms, and the number of samples is 30. The aim of using the Bagging ensemble method is to get better prediction results than the individual classification algorithms, in this case the J48 algorithm. The prediction accuracies obtained by using J48 individually and J48 with Bagging and Boosting are compared in Table 10.

Table 10: Comparison of Prediction Accuracies

Classification Algorithm | Accuracy | Correctly Classified | Incorrectly Classified
J48 Algorithm            | 73.27 %  | 847/1156             | 309/1156
AdaBoostM1 Algorithm     | 74.31 %  | 859/1156             | 297/1156
Bagging Algorithm        | 75.09 %  | 868/1156             | 288/1156

ARTIFICIAL NEURAL NETWORK

Nowadays, the Artificial Neural Network is considered a well-established method for evaluating the loan applications received by banks and for making approval or rejection decisions. Here, a classifier model is built on the merged dataset using both a single-layer and a multilayer feed-forward neural network algorithm. A single-layer feed-forward neural network [5] consists of an input layer, which has all of the attributes used except the class attribute; one hidden layer, which has some number of neurons (specified in the code); and an output layer, which consists of the class attribute. A multilayer feed-forward neural network [5] has the same input and output layers, but multiple hidden layers (the number of hidden layers and neurons is specified in the code). Normally, a multilayer feed-forward neural network has two hidden layers. Both algorithms have been implemented in the R environment. The inbuilt package neuralnet is used to run both algorithms.
The basic command for running this algorithm is:

NeuralNetworkResult <- neuralnet(formula, data, hidden, algorithm, stepmax)

The arguments have the following meanings:

1) formula: specifies the class attribute and then lists all the attributes to be considered when building the model on the class attribute.
2) data: the name of the dataset on which the model is built.
3) hidden: the number of hidden layers, i.e. whether the network is single-layer or multilayer, and the number of neurons in each layer.
4) algorithm: the algorithm used to train the neural network. By default this is resilient backpropagation, indicated by 'rprop+'.
5) stepmax: the maximum number of steps that can be used to build the neural network.

The basic code used for building a single-layer neural network in this case is:

nn <- neuralnet(Not_defaulter + defaulter ~ age + Job_profile + income + emp_length + Loan.amount + Loan.Duration + Purpose + home_ownership, data = nnet_train, algorithm = 'rprop+', hidden = 3, stepmax = 1e6)

(Here the single hidden layer has 3 neurons.)

The basic code used for building a multilayer neural network in this case is:

n <- neuralnet(Not_defaulter + defaulter ~ age + Job_profile + income + emp_length + Loan.amount + Loan.Duration + Purpose + home_ownership, data = nnet_train, algorithm = 'rprop+', hidden = c(3, 2), stepmax = 1e6)

(Here the first hidden layer has 3 neurons and the second has 2 neurons.)

Figure 4.1 shows the network obtained by using the single-layer neural network.

Figure 4.1: Single-Layer Feed-Forward Neural Network using R

A sample code snippet and the result obtained using the single-layer neural network are shown in Figure 4.2.

Figure 4.2: Single-Layer (3 hidden neurons) Neural Network, Accuracy: 71.62 %

Figure 4.3 shows the network obtained by using the multilayer neural network.

Figure 4.3: Multilayer Feed-Forward Neural Network using R

The result obtained using the multilayer neural network is shown in Figure 4.4.

Figure 4.4: Multilayer (5 hidden neurons) Neural Network, Accuracy: 72.57 %

Table 11: Comparing the Results of the Single-Layer and Multilayer Neural Networks

Classification Algorithm       Accuracy    Correctly Classified    Incorrectly Classified
Single-Layer Neural Network    71.62 %     828 / 1156              328 / 1156
Multilayer Neural Network      72.57 %     839 / 1156              317 / 1156
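The single- vs. multilayer comparison above can also be sketched outside R with scikit-learn's MLPClassifier, where hidden_layer_sizes=(3,) plays the role of hidden = 3 and (3, 2) plays the role of hidden = c(3, 2). This is an illustrative analogue on synthetic data, not the project's neuralnet code:

```python
# Sketch: single-layer (3 neurons) vs. multilayer (3 + 2 neurons)
# feed-forward networks, mirroring the two neuralnet() calls in R.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

single = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=1)
multi = MLPClassifier(hidden_layer_sizes=(3, 2), max_iter=2000, random_state=1)

single.fit(X_tr, y_tr)
multi.fit(X_tr, y_tr)
print("single-layer accuracy:", single.score(X_te, y_te))
print("multilayer accuracy:", multi.score(X_te, y_te))
```

As in Table 11, the difference between the two architectures on a given dataset is often small; the extra layer only helps when the data contain structure the single layer cannot capture.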

4. Analyzing a Single Dataset (Lending Club) using Classification Algorithms

Here, some of the algorithms used in the earlier milestones are applied again to build models on the Lending Club dataset, and the models created are then evaluated for their accuracy. This dataset has 4586 instances and 24 attributes. It has been divided into a training set and a test set in the ratio 80:20. After splitting the data, models are built on the training set, based on extracted data patterns, using the classification algorithms. These classifier models are then evaluated using the test dataset. The attributes of this dataset are as follows:

Loan_amount - Numeric
Loan Term - Nominal
Installment_rate - Nominal
Installment - Numeric
Grade - Nominal
Sub_grade - Nominal
Employment_length - Nominal
Open_accounts - Numeric
Public_record - Numeric
Revolving_balance - Numeric
Revolving_until - Nominal
Total_accounts - Numeric
Total_payment - Numeric
loan_status - Nominal
Home_ownership - Nominal
Annual_income - Numeric
Loan_purpose - Nominal
Zip_code - Nominal
Verification_status - Nominal
Address_state - Nominal
Issue_date - Nominal
Earliest_credit_line - Nominal
Inquiry_last_6months - Numeric
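The 80:20 split described above can be sketched as follows. The DataFrame here is synthetic and the class labels are assumed to be "defaulter" / "not defaulter"; in the project the data would come from the Lending Club CSV with the attributes listed above:

```python
# Sketch of the 80:20 train/test split, stratified on the class attribute
# so both splits keep the same defaulter / not-defaulter proportions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Loan_amount": range(100),
    "Annual_income": range(1000, 1100),
    "loan_status": ["not defaulter" if i % 3 else "defaulter" for i in range(100)],
})
train, test = train_test_split(df, test_size=0.2, random_state=42,
                               stratify=df["loan_status"])
print(len(train), len(test))  # 80 20
```

Stratifying on loan_status matters here because the defaulter class is the minority; a plain random split could leave the test set with too few defaulters to evaluate.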

Figure 5.1 shows the result of using the Decision Tree (J48) algorithm on the Lending Club dataset.

Figure 5.1: J48 Classification Algorithm using Weka – Decision Tree J48, Accuracy: 77.69 %

Figure 5.2 shows the result of using the Naïve Bayes algorithm on the Lending Club dataset.

Figure 5.2: Naïve Bayes Classification Algorithm using Weka – Naïve Bayes, Accuracy: 74.17 %

Figure 5.3 shows the result of the AdaBoostM1 algorithm, using Decision Stump as the base classifier, on the Lending Club dataset.

Figure 5.3: AdaBoostM1 Algorithm using Weka – AdaBoostM1 using Decision Stump, Accuracy: 90.49 %

Figure 5.4 shows the result of the Bagging algorithm, using J48 as the base classifier, on the Lending Club dataset.

Figure 5.4: Bagging Algorithm using Weka – Bagging using J48, Accuracy: 85.43 %

5. RESULTS AND ANALYSIS

In this section, we study and analyze the results obtained by building models on the merged loan-approved dataset and the Lending Club dataset using various classification algorithms. We also look for insight into which attributes are most relevant for predicting the results correctly. The prediction accuracies obtained on the merged dataset and the Lending Club dataset using the various classification algorithms are shown in Table 12 and Table 13 respectively.

Table 12: Comparison of the results obtained on the Merged dataset

ALGORITHM USED                 CLASSIFICATION ACCURACY    CORRECTLY CLASSIFIED INSTANCES    INCORRECTLY CLASSIFIED INSTANCES
Naïve Bayes                    72.40 %                    837 / 1156                        319 / 1156
J48                            73.26 %                    847 / 1156                        309 / 1156
Naïve Bayes–J48 Hybrid         73.80 %                    852 / 1156                        304 / 1156
Boosting (AdaBoostM1)          74.31 %                    859 / 1156                        297 / 1156
Bagging                        75.08 %                    868 / 1156                        288 / 1156
Single-Layer Neural Network    71.62 %                    828 / 1156                        328 / 1156
Multilayer Neural Network      72.57 %                    839 / 1156                        317 / 1156

(The highest accuracy, obtained using Bagging, is highlighted.)

Table 13: Comparison of the results obtained on the Lending Club dataset

ALGORITHM USED                 CLASSIFICATION ACCURACY    CORRECTLY CLASSIFIED INSTANCES    INCORRECTLY CLASSIFIED INSTANCES
Naïve Bayes                    74.17 %                    718 / 968                         250 / 968
J48                            77.69 %                    752 / 968                         216 / 968
Boosting (AdaBoostM1)          90.49 %                    876 / 968                         92 / 968
Bagging                        85.43 %                    827 / 968                         141 / 968

(The highest accuracy, obtained using Boosting, is highlighted.)

Observations and Analysis

1) As we can see, the classification accuracies obtained on the single Lending Club dataset are relatively higher, and in some cases much higher, than those obtained on the merged dataset using the same algorithms. Let us analyze the results obtained for both datasets using the J48 and Bagging algorithms. Figures 6.1 and 6.2 show the confusion matrices for the merged dataset and the Lending Club dataset obtained using the J48 algorithm.

Figure 6.1: Confusion Matrix for Merged Dataset using J48
   a     b   <-- classified as
  736    49 |  a = not defaulter
  260   111 |  b = defaulter

Figure 6.2: Confusion Matrix for Lending Club Dataset using J48
   a     b   <-- classified as
  606    78 |  a = not defaulter
  138   146 |  b = defaulter

Figures 6.3 and 6.4 show the confusion matrices for the merged dataset and the Lending Club dataset obtained using the Bagging algorithm.

Figure 6.3: Confusion Matrix for Merged Dataset using Bagging
   a     b   <-- classified as
  (a = not defaulter, b = defaulter)

Figure 6.4: Confusion Matrix for Lending Club Dataset using Bagging
   a     b   <-- classified as
  (a = not defaulter, b = defaulter)

According to Figure 6.1, the J48 algorithm correctly predicts 847 of the 1156 instances; the classification accuracy is 73.26 %. Figure 6.2 shows that the J48 algorithm correctly predicts 752 of the 968 instances; the classification accuracy in this case is 77.69 %. The confusion matrix for the merged dataset using the Bagging algorithm (Figure 6.3) shows that 868 of 1156 instances are correctly classified, an accuracy of 75.08 %. The confusion matrix for the Lending Club dataset using Bagging (Figure 6.4) shows that 827 of 968 instances are correctly classified, an accuracy of 85.43 %. So the classification accuracy obtained for the Lending Club dataset is consistently higher than that obtained on the merged dataset. While merging the datasets, all the uncommon attributes were removed.
The lower prediction accuracy of the merged dataset compared with the Lending Club dataset may be due to the fact that some attributes carrying useful information for identifying instances as defaulters or not defaulters are missing. These attributes would have helped the algorithms gain a

much better understanding of the patterns in the dataset while creating a model, thereby improving the prediction accuracy.

2) Another thing to notice in both datasets, and especially in the merged dataset, is that although the overall prediction accuracy is good, the algorithms are not very good at correctly predicting defaulters while being excellent at predicting non-defaulters. Figures 6.5 and 6.6 show the confusion matrices for the merged dataset and the Lending Club dataset obtained using the J48 algorithm.

Figure 6.5: Confusion Matrix for Merged Dataset using J48
   a     b   <-- classified as
  736    49 |  a = not defaulter
  260   111 |  b = defaulter

Figure 6.6: Confusion Matrix for Lending Club Dataset using J48
   a     b   <-- classified as
  606    78 |  a = not defaulter
  138   146 |  b = defaulter

As per Figure 6.5, the J48 algorithm correctly classifies 736 of the 785 instances whose actual class is not defaulter, a classification accuracy of 93.7 % on that class. On the other hand, it correctly classifies only 111 of the 371 instances whose actual class is defaulter, an accuracy of just 29.9 % on that class. According to Figure 6.6, the J48 algorithm correctly classifies 606 of the 684 not defaulter instances, an accuracy of 88.59 %, but only 146 of the 284 defaulter instances, an accuracy of 51.4 %. Let us consider one more example, with another classification algorithm, on both datasets.
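The per-class accuracies quoted above follow directly from the confusion matrix: divide the diagonal count for a class by that class's row total. A small sketch using the J48 merged-dataset counts from the text (off-diagonal counts inferred from the row totals):

```python
# Per-class accuracy (recall) from a confusion matrix.
# Rows are the actual classes; values are the J48 merged-dataset counts:
# 736 of 785 not defaulters and 111 of 371 defaulters classified correctly.
confusion = {
    "not defaulter": {"not defaulter": 736, "defaulter": 49},   # 785 actual
    "defaulter":     {"not defaulter": 260, "defaulter": 111},  # 371 actual
}

def per_class_accuracy(cm):
    # diagonal entry divided by the row total, for each actual class
    return {actual: row[actual] / sum(row.values()) for actual, row in cm.items()}

acc = per_class_accuracy(confusion)
print(f"{acc['not defaulter']:.4f}")  # 0.9376
print(f"{acc['defaulter']:.4f}")      # 0.2992
```

This is why overall accuracy alone is misleading here: the large not-defaulter class dominates the average and hides the poor defaulter recall.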
Figures 6.7 and 6.8 show the confusion matrices for the merged dataset and the Lending Club dataset obtained using the Naïve Bayes algorithm.

Figure 6.7: Confusion Matrix for Merged Dataset using Naïve Bayes
   a     b   <-- classified as
  728    57 |  a = not defaulter
  262   109 |  b = defaulter

Figure 6.8: Confusion Matrix for Lending Club Dataset using Naïve Bayes
   a     b   <-- classified as
  537   147 |  a = not defaulter
  103   181 |  b = defaulter

In the case of the merged dataset, Naïve Bayes has a classification accuracy of 92.73 % on the instances whose class is not defaulter, but only 29.38 % on the instances whose class is defaulter. For the Lending Club dataset, Naïve Bayes achieves 79 % on the not defaulter instances and 63.73 % on the defaulter instances. As we can see, the prediction accuracy on not defaulter instances remains very good for both datasets, but the accuracy on defaulter instances is not nearly as good, especially for the merged dataset. The train-test split was changed to 70:30 for both datasets to see whether the behavior of the models changed. The prediction accuracy remained almost the same: the overall accuracy dropped by only a small margin, and there was hardly any difference in the prediction accuracy of the defaulter class instances. One reason for the poor prediction accuracy on defaulter instances might be that the attributes in both datasets provide adequate information about the characteristics of a non-defaulter but do not reveal the vital information needed to correctly classify an applicant as a defaulter. In the case of the merged dataset, the cause could also be the removal of uncommon attributes, such as applicant address, interest rate, applicant grade, and credit inquiries in the last 6 months, while merging the 3 datasets, or the presence of many missing values that had to be guessed during data cleaning.
One major reason for the low accuracy on defaulter instances could also be that, in both datasets, the number of rows with class not defaulter is much greater than the number with class defaulter, so the datasets have class imbalance. All of these factors may have led to a lack or loss of the information needed to clearly differentiate the two classes; as a result, all the algorithms are biased toward predicting an applicant as a not defaulter. To tackle this problem, the Cost-Sensitive Learning method has been used. This method penalizes the algorithm for falsely classifying defaulter instances as not defaulters. In this approach,

although the overall prediction accuracy goes down by some margin, the prediction accuracy on defaulters goes up considerably.

COST SENSITIVE LEARNING

The Cost-Sensitive Learning method can be used with any algorithm, such as Naïve Bayes, decision trees, or Bagging. For our datasets it has the following default cost matrix; the values in it are the cost-sensitive weights.

Default Cost Matrix

New Cost Matrix after changing some weight values

Various weight values were tried, and the cost matrix above was finally chosen since it balanced the classifier results considerably and gave decent prediction results. The Naïve Bayes result without Cost-Sensitive Learning on the merged dataset is shown in Figure 7.1.

Figure 7.1: Defaulter instances prediction accuracy: 29.38 % (Overall Accuracy: 72.40 %)
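Weka's cost-matrix approach has no direct scikit-learn equivalent, but the same effect, making a false "not defaulter" prediction on an actual defaulter more expensive, can be sketched with class weights. This is a substitute technique on synthetic data, not the report's Weka setup:

```python
# Sketch of cost-sensitive learning via class weights: errors on the
# minority "defaulter" class (label 1) cost 4x, so the tree shifts its
# decisions to catch more defaulters at the expense of overall accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Imbalanced synthetic data: class 1 ("defaulter") is the minority.
X, y = make_classification(n_samples=3000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

plain = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
# Cost-matrix analogue: misclassifying a defaulter is weighted 4x.
costly = DecisionTreeClassifier(max_depth=4, random_state=0,
                                class_weight={0: 1, 1: 4}).fit(X_tr, y_tr)

print("defaulter recall, plain:", recall_score(y_te, plain.predict(X_te)))
print("defaulter recall, cost-sensitive:", recall_score(y_te, costly.predict(X_te)))
```

As in the Weka experiments, the weighted model typically trades a few points of overall accuracy for a much better defaulter recall.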

The Naïve Bayes result with Cost-Sensitive Learning on the merged dataset is shown in Figure 7.2.

Figure 7.2: Defaulter instances prediction accuracy: % (Overall Accuracy: %)

Although the overall accuracy of the classifier drops to around 65 %, the prediction results for the defaulter class instances improve considerably, and the result no longer looks biased toward predicting an applicant as a not defaulter. Similarly, the Cost-Sensitive Learning method is used with some other algorithms on both datasets, as shown in Table 14 and Table 15.

Table 14: Results obtained on the Merged dataset with and without Cost-Sensitive Learning

                         WITHOUT COST-SENSITIVE LEARNING        WITH COST-SENSITIVE LEARNING
ALGORITHM USED           Overall       Defaulter instances      Overall       Defaulter instances
Naïve Bayes              72.40 %       29.38 %                  %             %
J48                      73.26 %       29.9 %                   %             %
Boosting (AdaBoostM1)    74.31 %       %                        %             %

Bagging                  75.08 %       %                        %             %

Table 15: Results obtained on the Lending Club dataset with and without Cost-Sensitive Learning

                         WITHOUT COST-SENSITIVE LEARNING        WITH COST-SENSITIVE LEARNING
ALGORITHM USED           Overall       Defaulter instances      Overall       Defaulter instances
Naïve Bayes              74.17 %       63.73 %                  %             %
J48                      77.69 %       51.4 %                   %             %
Boosting (AdaBoostM1)    90.49 %       %                        %             %
Bagging                  85.43 %       51.0 %                   %             %

Looking at both tables, it can be seen that the use of Cost-Sensitive Learning reduces the overall prediction accuracy by some margin, but the prediction accuracy on defaulter instances increases considerably, and the classification results no longer look biased toward predicting instances as not defaulter, thereby reducing the classification imbalance.

3) The classification accuracy obtained using the hybrid algorithm (73.80 %) is higher than the accuracy of the individual classification algorithms, in this case Naïve Bayes (72.40 %) and the J48 decision tree (73.26 %), although by a relatively small margin. In this approach, a Naïve Bayes classifier is built on every leaf node of the decision tree [1]. The hybrid algorithm combines the advantages of both Naïve Bayes and decision trees, so it serves the purpose of improving on the individual classification algorithms without greatly increasing the overall time complexity.

4) Another important observation is that when the AdaBoostM1 ensemble method is used on both datasets, the overall accuracy of correctly predicting an applicant as either a not defaulter or a defaulter increases compared with the accuracy obtained using any individual algorithm. For
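Observation 4 can be illustrated with a sketch: boosting a weak base learner (a decision stump, as in the Weka AdaBoostM1 runs) against using the stump alone, on synthetic data rather than the loan datasets:

```python
# Sketch of observation 4: AdaBoost over decision stumps vs. a single stump.
# scikit-learn's AdaBoostClassifier defaults to depth-1 tree (stump) bases.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)

stump = DecisionTreeClassifier(max_depth=1, random_state=2).fit(X_tr, y_tr)
boosted = AdaBoostClassifier(n_estimators=50, random_state=2).fit(X_tr, y_tr)

print("stump alone:", stump.score(X_te, y_te))
print("AdaBoost (50 stumps):", boosted.score(X_te, y_te))
```

Each boosting round reweights the training instances the previous models got wrong, so successive stumps focus on the hard cases; the weighted vote of all 50 usually beats any one of them.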


More information

AN ARTIFICIAL NEURAL NETWORK MODELING APPROACH TO PREDICT CRUDE OIL FUTURE. By Dr. PRASANT SARANGI Director (Research) ICSI-CCGRT, Navi Mumbai

AN ARTIFICIAL NEURAL NETWORK MODELING APPROACH TO PREDICT CRUDE OIL FUTURE. By Dr. PRASANT SARANGI Director (Research) ICSI-CCGRT, Navi Mumbai AN ARTIFICIAL NEURAL NETWORK MODELING APPROACH TO PREDICT CRUDE OIL FUTURE By Dr. PRASANT SARANGI Director (Research) ICSI-CCGRT, Navi Mumbai AN ARTIFICIAL NEURAL NETWORK MODELING APPROACH TO PREDICT CRUDE

More information

LendingClub Loan Default and Profitability Prediction

LendingClub Loan Default and Profitability Prediction LendingClub Loan Default and Profitability Prediction Peiqian Li peiqian@stanford.edu Gao Han gh352@stanford.edu Abstract Credit risk is something all peer-to-peer (P2P) lending investors (and bond investors

More information

Relative and absolute equity performance prediction via supervised learning

Relative and absolute equity performance prediction via supervised learning Relative and absolute equity performance prediction via supervised learning Alex Alifimoff aalifimoff@stanford.edu Axel Sly axelsly@stanford.edu Introduction Investment managers and traders utilize two

More information

Forecasting Agricultural Commodity Prices through Supervised Learning

Forecasting Agricultural Commodity Prices through Supervised Learning Forecasting Agricultural Commodity Prices through Supervised Learning Fan Wang, Stanford University, wang40@stanford.edu ABSTRACT In this project, we explore the application of supervised learning techniques

More information

A Dynamic Hedging Strategy for Option Transaction Using Artificial Neural Networks

A Dynamic Hedging Strategy for Option Transaction Using Artificial Neural Networks A Dynamic Hedging Strategy for Option Transaction Using Artificial Neural Networks Hyun Joon Shin and Jaepil Ryu Dept. of Management Eng. Sangmyung University {hjshin, jpru}@smu.ac.kr Abstract In order

More information

INDIAN STOCK MARKET PREDICTOR SYSTEM

INDIAN STOCK MARKET PREDICTOR SYSTEM INDIAN STOCK MARKET PREDICTOR SYSTEM 1 VIVEK JOHN GEORGE, 2 DARSHAN M. S, 3 SNEHA PRICILLA, 4 ARUN S, 5 CH. VANIPRIYA Department of Computer Science and Engineering, Sir M Visvesvarya Institute of Technology,

More information

An introduction to Machine learning methods and forecasting of time series in financial markets

An introduction to Machine learning methods and forecasting of time series in financial markets An introduction to Machine learning methods and forecasting of time series in financial markets Mark Wong markwong@kth.se December 10, 2016 Abstract The goal of this paper is to give the reader an introduction

More information

Predicting the Success of a Retirement Plan Based on Early Performance of Investments

Predicting the Success of a Retirement Plan Based on Early Performance of Investments Predicting the Success of a Retirement Plan Based on Early Performance of Investments CS229 Autumn 2010 Final Project Darrell Cain, AJ Minich Abstract Using historical data on the stock market, it is possible

More information

Internet Appendix. Additional Results. Figure A1: Stock of retail credit cards over time

Internet Appendix. Additional Results. Figure A1: Stock of retail credit cards over time Internet Appendix A Additional Results Figure A1: Stock of retail credit cards over time Stock of retail credit cards by month. Time of deletion policy noted with vertical line. Figure A2: Retail credit

More information

STOCK PRICE PREDICTION: KOHONEN VERSUS BACKPROPAGATION

STOCK PRICE PREDICTION: KOHONEN VERSUS BACKPROPAGATION STOCK PRICE PREDICTION: KOHONEN VERSUS BACKPROPAGATION Alexey Zorin Technical University of Riga Decision Support Systems Group 1 Kalkyu Street, Riga LV-1658, phone: 371-7089530, LATVIA E-mail: alex@rulv

More information

Data Mining: A Closer Look. 2.1 Data Mining Strategies 8/30/2011. Chapter 2. Data Mining Strategies. Market Basket Analysis. Unsupervised Clustering

Data Mining: A Closer Look. 2.1 Data Mining Strategies 8/30/2011. Chapter 2. Data Mining Strategies. Market Basket Analysis. Unsupervised Clustering Data Mining: A Closer Look Chapter 2 2.1 Data Mining Strategies Data Mining Strategies Unsupervised Clustering Supervised Learning Market Basket Analysis Classification Estimation Prediction Figure 2.1

More information

Introducing GEMS a Novel Technique for Ensemble Creation

Introducing GEMS a Novel Technique for Ensemble Creation Introducing GEMS a Novel Technique for Ensemble Creation Ulf Johansson 1, Tuve Löfström 1, Rikard König 1, Lars Niklasson 2 1 School of Business and Informatics, University of Borås, Sweden 2 School of

More information

Health Insurance Market

Health Insurance Market Health Insurance Market Jeremiah Reyes, Jerry Duran, Chanel Manzanillo Abstract Based on a person s Health Insurance Plan attributes, namely if it was a dental only plan, is notice required for pregnancy,

More information

Based on BP Neural Network Stock Prediction

Based on BP Neural Network Stock Prediction Based on BP Neural Network Stock Prediction Xiangwei Liu Foundation Department, PLA University of Foreign Languages Luoyang 471003, China Tel:86-158-2490-9625 E-mail: liuxwletter@163.com Xin Ma Foundation

More information

A COMPARATIVE STUDY OF DATA MINING TECHNIQUES IN PREDICTING CONSUMERS CREDIT CARD RISK IN BANKS

A COMPARATIVE STUDY OF DATA MINING TECHNIQUES IN PREDICTING CONSUMERS CREDIT CARD RISK IN BANKS A COMPARATIVE STUDY OF DATA MINING TECHNIQUES IN PREDICTING CONSUMERS CREDIT CARD RISK IN BANKS Ling Kock Sheng 1, Teh Ying Wah 2 1 Faculty of Computer Science and Information Technology, University of

More information

Novel Approaches to Sentiment Analysis for Stock Prediction

Novel Approaches to Sentiment Analysis for Stock Prediction Novel Approaches to Sentiment Analysis for Stock Prediction Chris Wang, Yilun Xu, Qingyang Wang Stanford University chrwang, ylxu, iriswang @ stanford.edu Abstract Stock market predictions lend themselves

More information

Predicting First Day Returns for Japanese IPOs

Predicting First Day Returns for Japanese IPOs Predicting First Day Returns for Japanese IPOs Executive Summary Goal: To predict the First Day returns on Japanese IPOs (based on first day closing price), using public information available prior to

More information

Cambridge International Advanced Subsidiary Level and Advanced Level 9706 Accounting June 2015 Principal Examiner Report for Teachers

Cambridge International Advanced Subsidiary Level and Advanced Level 9706 Accounting June 2015 Principal Examiner Report for Teachers Cambridge International Advanced Subsidiary Level and Advanced Level ACCOUNTING Paper 9706/11 Multiple Choice Question Number Key Question Number Key 1 D 16 A 2 C 17 A 3 D 18 B 4 B 19 A 5 D 20 D 6 A 21

More information

UPDATED IAA EDUCATION SYLLABUS

UPDATED IAA EDUCATION SYLLABUS II. UPDATED IAA EDUCATION SYLLABUS A. Supporting Learning Areas 1. STATISTICS Aim: To enable students to apply core statistical techniques to actuarial applications in insurance, pensions and emerging

More information

CHAPTER 11 CONCLUDING COMMENTS

CHAPTER 11 CONCLUDING COMMENTS CHAPTER 11 CONCLUDING COMMENTS I. PROJECTIONS FOR POLICY ANALYSIS MINT3 produces a micro dataset suitable for projecting the distributional consequences of current population and economic trends and for

More information

Loan Approval and Quality Prediction in the Lending Club Marketplace

Loan Approval and Quality Prediction in the Lending Club Marketplace Loan Approval and Quality Prediction in the Lending Club Marketplace Final Write-up Yondon Fu, Matt Marcus and Shuo Zheng Introduction Lending Club is a peer-to-peer lending marketplace where individual

More information

3: Balance Equations

3: Balance Equations 3.1 Balance Equations Accounts with Constant Interest Rates 15 3: Balance Equations Investments typically consist of giving up something today in the hope of greater benefits in the future, resulting in

More information

Anomalies under Jackknife Variance Estimation Incorporating Rao-Shao Adjustment in the Medical Expenditure Panel Survey - Insurance Component 1

Anomalies under Jackknife Variance Estimation Incorporating Rao-Shao Adjustment in the Medical Expenditure Panel Survey - Insurance Component 1 Anomalies under Jackknife Variance Estimation Incorporating Rao-Shao Adjustment in the Medical Expenditure Panel Survey - Insurance Component 1 Robert M. Baskin 1, Matthew S. Thompson 2 1 Agency for Healthcare

More information

Portfolio Recommendation System Stanford University CS 229 Project Report 2015

Portfolio Recommendation System Stanford University CS 229 Project Report 2015 Portfolio Recommendation System Stanford University CS 229 Project Report 205 Berk Eserol Introduction Machine learning is one of the most important bricks that converges machine to human and beyond. Considering

More information

Research Article Design and Explanation of the Credit Ratings of Customers Model Using Neural Networks

Research Article Design and Explanation of the Credit Ratings of Customers Model Using Neural Networks Research Journal of Applied Sciences, Engineering and Technology 7(4): 5179-5183, 014 DOI:10.1906/rjaset.7.915 ISSN: 040-7459; e-issn: 040-7467 014 Maxwell Scientific Publication Corp. Submitted: February

More information

Tutorial 3: Working with Formulas and Functions

Tutorial 3: Working with Formulas and Functions Tutorial 3: Working with Formulas and Functions Microsoft Excel 2010 Objectives Copy formulas Build formulas containing relative, absolute, and mixed references Review function syntax Insert a function

More information

BPIC 2017: Business process mining A Loan process application

BPIC 2017: Business process mining A Loan process application BPIC 2017: Business process mining A Loan process application Dongyeon Jeong, Jungeun Lim, Youngmok Bae Department of Industrial and Management Engineering, POSTECH(Pohang University of Science and Technology),

More information

1. A is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes,

1. A is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, 1. A is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. A) Decision tree B) Graphs

More information

Could Decision Trees Improve the Classification Accuracy and Interpretability of Loan Granting Decisions?

Could Decision Trees Improve the Classification Accuracy and Interpretability of Loan Granting Decisions? Could Decision Trees Improve the Classification Accuracy and Interpretability of Loan Granting Decisions? Jozef Zurada Department of Computer Information Systems College of Business University of Louisville

More information

Session 40 PD, How Would I Get Started With Predictive Modeling? Moderator: Douglas T. Norris, FSA, MAAA

Session 40 PD, How Would I Get Started With Predictive Modeling? Moderator: Douglas T. Norris, FSA, MAAA Session 40 PD, How Would I Get Started With Predictive Modeling? Moderator: Douglas T. Norris, FSA, MAAA Presenters: Timothy S. Paris, FSA, MAAA Sandra Tsui Shan To, FSA, MAAA Qinqing (Annie) Xue, FSA,

More information

Role of soft computing techniques in predicting stock market direction

Role of soft computing techniques in predicting stock market direction REVIEWS Role of soft computing techniques in predicting stock market direction Panchal Amitkumar Mansukhbhai 1, Dr. Jayeshkumar Madhubhai Patel 2 1. Ph.D Research Scholar, Gujarat Technological University,

More information

Machine Learning and the Insurance Industry Prof. John D. Kelleher

Machine Learning and the Insurance Industry Prof. John D. Kelleher Machine Learning and the Insurance Industry Prof. John D. Kelleher ADAPT Centre, Dublin Institute of Technology john.d.kelleher@dit.ie The ADAPT Centre is funded under the SFI Research Centres Programme

More information

SAS Data Mining & Neural Network as powerful and efficient tools for customer oriented pricing and target marketing in deregulated insurance markets

SAS Data Mining & Neural Network as powerful and efficient tools for customer oriented pricing and target marketing in deregulated insurance markets SAS Data Mining & Neural Network as powerful and efficient tools for customer oriented pricing and target marketing in deregulated insurance markets Stefan Lecher, Actuary Personal Lines, Zurich Switzerland

More information

Implementing the Expected Credit Loss model for receivables A case study for IFRS 9

Implementing the Expected Credit Loss model for receivables A case study for IFRS 9 Implementing the Expected Credit Loss model for receivables A case study for IFRS 9 Corporates Treasury Many companies are struggling with the implementation of the Expected Credit Loss model according

More information

Decision Analysis. Introduction. Job Counseling

Decision Analysis. Introduction. Job Counseling Decision Analysis Max, min, minimax, maximin, maximax, minimin All good cat names! 1 Introduction Models provide insight and understanding We make decisions Decision making is difficult because: future

More information

STOCK MARKET PREDICTION AND ANALYSIS USING MACHINE LEARNING

STOCK MARKET PREDICTION AND ANALYSIS USING MACHINE LEARNING STOCK MARKET PREDICTION AND ANALYSIS USING MACHINE LEARNING Sumedh Kapse 1, Rajan Kelaskar 2, Manojkumar Sahu 3, Rahul Kamble 4 1 Student, PVPPCOE, Computer engineering, PVPPCOE, Maharashtra, India 2 Student,

More information

Bank Licenses Revocation Modeling

Bank Licenses Revocation Modeling Bank Licenses Revocation Modeling Jaroslav Bologov, Konstantin Kotik, Alexander Andreev, and Alexey Kozionov Deloitte Analytics Institute, ZAO Deloitte & Touche CIS, Moscow, Russia {jbologov,kkotik,aandreev,akozionov}@deloitte.ru

More information

SELECTION BIAS REDUCTION IN CREDIT SCORING MODELS

SELECTION BIAS REDUCTION IN CREDIT SCORING MODELS SELECTION BIAS REDUCTION IN CREDIT SCORING MODELS Josef Ditrich Abstract Credit risk refers to the potential of the borrower to not be able to pay back to investors the amount of money that was loaned.

More information

MBEJ 1023 Dr. Mehdi Moeinaddini Dept. of Urban & Regional Planning Faculty of Built Environment

MBEJ 1023 Dr. Mehdi Moeinaddini Dept. of Urban & Regional Planning Faculty of Built Environment MBEJ 1023 Planning Analytical Methods Dr. Mehdi Moeinaddini Dept. of Urban & Regional Planning Faculty of Built Environment Contents What is statistics? Population and Sample Descriptive Statistics Inferential

More information

Understanding neural networks

Understanding neural networks Machine Learning Neural Networks Understanding neural networks An Artificial Neural Network (ANN) models the relationship between a set of input signals and an output signal using a model derived from

More information

Optimization of China EPC power project cost risk management in construction stage based on bayesian network diagram

Optimization of China EPC power project cost risk management in construction stage based on bayesian network diagram Acta Technica 62 (2017), No. 6A, 223 232 c 2017 Institute of Thermomechanics CAS, v.v.i. Optimization of China EPC power project cost risk management in construction stage based on bayesian network diagram

More information

Time Series Forecasting Of Nifty Stock Market Using Weka

Time Series Forecasting Of Nifty Stock Market Using Weka Time Series Forecasting Of Nifty Stock Market Using Weka Raj Kumar 1, Anil Balara 2 1 M.Tech, Global institute of Engineering and Technology,Gurgaon 2 Associate Professor, Global institute of Engineering

More information

Acritical aspect of any capital budgeting decision. Using Excel to Perform Monte Carlo Simulations TECHNOLOGY

Acritical aspect of any capital budgeting decision. Using Excel to Perform Monte Carlo Simulations TECHNOLOGY Using Excel to Perform Monte Carlo Simulations By Thomas E. McKee, CMA, CPA, and Linda J.B. McKee, CPA Acritical aspect of any capital budgeting decision is evaluating the risk surrounding key variables

More information

Before How can lines on a graph show the effect of interest rates on savings accounts?

Before How can lines on a graph show the effect of interest rates on savings accounts? Compound Interest LAUNCH (7 MIN) Before How can lines on a graph show the effect of interest rates on savings accounts? During How can you tell what the graph of simple interest looks like? After What

More information

Pattern Recognition by Neural Network Ensemble

Pattern Recognition by Neural Network Ensemble IT691 2009 1 Pattern Recognition by Neural Network Ensemble Joseph Cestra, Babu Johnson, Nikolaos Kartalis, Rasul Mehrab, Robb Zucker Pace University Abstract This is an investigation of artificial neural

More information

Essays on Some Combinatorial Optimization Problems with Interval Data

Essays on Some Combinatorial Optimization Problems with Interval Data Essays on Some Combinatorial Optimization Problems with Interval Data a thesis submitted to the department of industrial engineering and the institute of engineering and sciences of bilkent university

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 2, Mar Apr 2017

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 2, Mar Apr 2017 RESEARCH ARTICLE Stock Selection using Principal Component Analysis with Differential Evolution Dr. Balamurugan.A [1], Arul Selvi. S [2], Syedhussian.A [3], Nithin.A [4] [3] & [4] Professor [1], Assistant

More information

The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2012, Mr. Ruey S. Tsay. Solutions to Final Exam

The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2012, Mr. Ruey S. Tsay. Solutions to Final Exam The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2012, Mr. Ruey S. Tsay Solutions to Final Exam Problem A: (40 points) Answer briefly the following questions. 1. Consider

More information

Producing actionable insights from predictive models built upon condensed electronic medical records.

Producing actionable insights from predictive models built upon condensed electronic medical records. Producing actionable insights from predictive models built upon condensed electronic medical records. Sheamus K. Parkes, FSA, MAAA Shea.Parkes@milliman.com Predictive modeling often has two competing goals:

More information

An Improved Approach for Business & Market Intelligence using Artificial Neural Network

An Improved Approach for Business & Market Intelligence using Artificial Neural Network Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Agricultural and Applied Economics 637 Applied Econometrics II

Agricultural and Applied Economics 637 Applied Econometrics II Agricultural and Applied Economics 637 Applied Econometrics II Assignment I Using Search Algorithms to Determine Optimal Parameter Values in Nonlinear Regression Models (Due: February 3, 2015) (Note: Make

More information

Personal Financial Literacy

Personal Financial Literacy Personal Financial Literacy 7 Unit Overview Being financially literate means taking responsibility for learning how to manage your money. In this unit, you will learn about banking services that can help

More information

Model Maestro. Scorto TM. Specialized Tools for Credit Scoring Models Development. Credit Portfolio Analysis. Scoring Models Development

Model Maestro. Scorto TM. Specialized Tools for Credit Scoring Models Development. Credit Portfolio Analysis. Scoring Models Development Credit Portfolio Analysis Scoring Models Development Scorto TM Models Analysis and Maintenance Model Maestro Specialized Tools for Credit Scoring Models Development 2 Purpose and Tasks to Be Solved Scorto

More information