DATA MINING ON LOAN APPROVED DATASET FOR PREDICTING DEFAULTERS


DATA MINING ON LOAN APPROVED DATASET FOR PREDICTING DEFAULTERS

By Ashish Pandit

A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science

Supervised by Dr. Carol Romanowski

Department of Computer Science
B. Thomas Golisano College of Computing and Information Sciences
Rochester Institute of Technology
Rochester, New York
December,

PROJECT REPORT RELEASE PERMISSION FORM

Rochester Institute of Technology
B. Thomas Golisano College of Computing and Information Sciences

TITLE: Data Mining on Loan Approved Dataset for Predicting Defaulters

I, hereby grant permission to the Wallace Memorial Library to reproduce my project in whole or in part.

Date

The project "Data Mining on Loan Approved Dataset for Predicting Defaulters" by has been examined and approved by the following Examination Committee:

Dr. Carol Romanowski
Associate Professor
Project Committee Chair

ACKNOWLEDGEMENT

I would like to thank Dr. Carol Romanowski for giving me the opportunity to do my capstone project under her guidance. I am extremely thankful to her for giving me invaluable inputs and ideas, resolving my doubts whenever I had any, giving me timely feedback after the completion of every milestone, and helping me throughout the semester to complete this project successfully. I would also like to thank Dr. Joe Geigel, my colloquium guide, for explaining to me how the report should be written and for giving me valuable feedback after every milestone presentation and the class poster presentation.

ABSTRACT

In today's world, taking loans from financial institutions has become a very common phenomenon. Every day a large number of people apply for loans, for a variety of purposes. But not all of these applicants are reliable, and not everyone can be approved. Every year, we read about a number of cases where people do not repay the bulk of the loan amount to the banks, due to which the banks suffer huge losses. The risk associated with making a decision on loan approval is immense. So the idea of this project is to gather loan data from multiple data sources and use data mining algorithms on this data to extract important information and predict whether a customer would be able to repay his loan or not. In other words, to predict whether the customer would be a defaulter or not.

Table of Contents

1. INTRODUCTION
   1.1 BACKGROUND AND PROBLEM STATEMENT
   1.2 GOAL OF THE PROJECT
   1.3 WORKFLOW DIAGRAM
   1.4 RELATED WORK
   1.5 HYPOTHESIS
2. PREPARING THE DATASET
   2.1 DATA GATHERING
   2.2 DATA PREPARATION AND CLEANING
3. DATA MINING USING CLASSIFICATION ALGORITHMS ON MERGED DATASET
   3.1 HYBRID NAÏVE BAYES DECISION TREE ALGORITHM
   3.2 NAÏVE BAYES ALGORITHM
   3.3 DECISION TREE ALGORITHM
   3.4 BOOSTING ALGORITHM
   3.5 BAGGING ALGORITHM
   3.6 ARTIFICIAL NEURAL NETWORK ALGORITHM
4. ANALYSING SINGLE DATASET USING CLASSIFICATION ALGORITHMS
   4.1 NAÏVE BAYES ALGORITHM
   4.2 DECISION TREE ALGORITHM
   4.3 BOOSTING ALGORITHM
   4.4 BAGGING ALGORITHM
5. RESULTS AND ANALYSIS
   5.1 COMPARISON OF RESULTS
   5.2 COST SENSITIVE LEARNING
6. CONCLUSION AND FUTURE WORK
   6.1 CONCLUSION
   6.2 FUTURE WORK
REFERENCES

1. INTRODUCTION

1.1 BACKGROUND AND PROBLEM STATEMENT

The importance of loans in our day-to-day lives has increased to a great extent. People are becoming more and more dependent on acquiring loans, be it an education loan, housing loan, car loan, or business loan, from financial institutions like banks and credit unions. However, it is no longer surprising to see that some people are not able to properly gauge the amount of loan they can afford. In some cases people undergo a sudden financial crisis, while some try to scam money out of the banks. The consequences of such scenarios are late or missed payments, defaulting, or, in the worst case, not being able to pay back the bulk of the amount owed to the bank. Assessing the risk involved in a loan application is one of the most important concerns of banks, both for survival in a highly competitive market and for profitability. Banks receive a large number of loan applications from their customers and other people on a daily basis, and not everyone gets approved. Most banks use their own credit scoring and risk assessment techniques in order to analyze a loan application and to make decisions on credit approval. In spite of this, there are many cases every year where people do not repay the loan amount, or default, due to which these financial institutions suffer huge losses. In this project, data mining algorithms will be used to study loan-approved data and extract patterns that help in predicting likely defaulters, thereby helping the banks make better decisions in the future. Multiple datasets from different sources will be combined to form a generalized dataset, and then different machine learning algorithms will be applied to extract patterns and to obtain results with maximum accuracy.
1.2 GOAL OF THE PROJECT

The primary goal of this project is to extract patterns from a common loan-approved dataset and then build a model based on these extracted patterns, in order to predict likely loan defaulters using classification data mining algorithms. Historical data about the customers, such as their age, income, loan amount, and employment length, will be used for the analysis. Later on, some analysis will also be done to find the most relevant attributes, i.e., the factors that affect the prediction result the most.

1.3 WORKFLOW OF PROJECT

The diagram below shows the workflow of this project.

[Workflow Diagram: data from three sources (raw data) → data preprocessing and cleaning → split into training data and test data → classification algorithms build a model → the model classifies each applicant as defaulter or not defaulter]

1.4 RELATED WORK

A lot of work has been done on extracting important information that can be useful to financial institutions. My aim in this project was to gather loan information from multiple sources and apply different classification algorithms to find those that give the best prediction results. I have used the work listed below as reference for my analysis.

Jiang and Li [1] propose a method to improve the prediction results obtained by using the Naïve Bayes and Decision Tree algorithms separately. They tried this hybrid method on 36 UCI datasets and compared the results with those of the individual algorithms. In this project, I have used this hybrid method on the loan-approved dataset obtained by merging three data sources.

The paper by Tiwari and Prakash [2] implements ensemble methods (bagging, boosting and blending) on the SONAR dataset and compares the prediction accuracy with individual algorithms like Naïve Bayes, Decision Tree etc. In this project, I have used the Boosting and Bagging ensemble methods on the merged loan-approved dataset and on the single Lending Club dataset.

The paper by Atiya [4] explains the implementation of artificial neural networks on a bank dataset for predicting bankruptcy. In this project, I have used single-layer and multilayer neural network methods on the loan-approved dataset.

1.5 HYPOTHESIS

My hypothesis was that the Hybrid Naïve Bayes Decision Tree, Boosting and Bagging classification algorithms would give better prediction accuracy than the individual algorithms, and I compared their results against the individual algorithms to verify this. I also wanted to use various classification algorithms on both the merged dataset and an individual dataset, in order to compare the results obtained on the two.

2. PREPARING THE DATASET

2.1 GATHERING DATA

In the first step of accumulating information, previously approved loan datasets from three different sources are gathered. These datasets are merged to form a common dataset, on which the analysis will be done. Table 1 shows details of the datasets:

Table 1: Dataset details

Dataset Name           | No. of attributes | No. of instances | Data Format
Lending Club Loan Data |                   |                  | csv
UCI German Data        |                   |                  | csv
ROC Data               |                   |                  | sav

2.2 DATA PREPARATION AND CLEANING

Tools used for data cleaning:
1) Google Refine
2) Weka
3) R (for converting .sav data to .csv format)

One of the most important tasks in preparing a common dataset is deciding which attributes can be used from the three tables, since all of them have different numbers of attributes, and attributes in different forms. Nine attributes were selected for preparing the new dataset: (a) age of the loan applicant, (b) job profile [less, moderately, highly skilled], (c) annual income, (d) employment length, (e) loan amount, (f) loan duration, (g) purpose of loan, (h) housing [rent, own], and (i) loan history [Defaulter, Not Defaulter], which is the class attribute. These attributes were common to all three, or at least two, of the datasets. All other attributes were logically eliminated from each dataset. Tables 2, 3 and 4 show how the selected attributes look in each of the tables before merging.

Table 2: Lending Club Data

Attribute Name | Type    | Values Example
age            | Missing |
Job profile    | Missing |
Income         | Numeric | 5000, …
Emp Length     | Nominal | <1, 5, 10+
Loan Amount    | Numeric | 1200, 3500, …
Loan Duration  | Numeric | …, 60 (in months)
Purpose        | Nominal | Car loan, House Loan, Business Loan etc.
Housing        | Nominal | Rent, own
Loan History   | Nominal | Defaulter, Not Defaulter

Table 3: UCI German Data

Attribute Name | Type    | Values Example
age            | Numeric | 33, 50, 46
Job profile    | Nominal | Less, Moderately, Highly skilled
Income         | Missing |
Emp Length     | Nominal | <1, 1 to 4, 4 to 7, 7+
Loan Amount    | Numeric | 1200, 3500, …
Duration       | Numeric | 12, 24, 48 (in months)
Purpose        | Nominal | Car loan, House Loan, Business Loan etc.
Housing        | Nominal | Rent, Own, Free
Loan History   | Nominal | Defaulter, Not Defaulter

Table 4: ROC Data

Attribute Name | Type    | Values Example
age            | Numeric | 33, 50, 46
Job profile    | Nominal | Less, Moderately, Highly skilled
Income         | Numeric | 5000, …
Emp Length     | Nominal | 1, 8, 15
Loan Amount    | Numeric | 1200, 3500, …
Duration       | Missing |
Purpose        | Missing |
Housing        | Missing |
Loan History   | Nominal | Defaulter, Not Defaulter

The dataset obtained by merging these three datasets was raw, and it needed a lot of cleaning.

Tackling the data cleaning tasks

(1) Age attribute: As we can see from the tables above, the Lending Club dataset does not have any information regarding the age of the loan applicant, so all 5000 values for that attribute are unknown. Now,

ideally one would have removed the entire attribute, but age might be an important factor in determining whether the applicant could be a defaulter or not. So some logical assumptions were made to fill these missing age values, based on the age and employment length values of the other two tables. A person who has more than 7 years of experience is very likely to be in his mid 30s, whereas a person with 3-4 years of experience is likely to be in his mid-to-late twenties. A density-based clustering algorithm was therefore used to find a relation, or pattern, between the two attributes. This algorithm divides the age and employment length values of the other two datasets into four clusters, as shown in Figure 1.1.

Figure 1.1: Cluster Division

The number of instances in each cluster is shown in Figure 1.2:

Figure 1.2: Clustered instances

Cluster 1 and Cluster 2 have a major difference in their average age values, although the employment length centroid in both is the same (1 <= X < 4). However, the number of instances in cluster 1 is very small (9 percent) compared to 43 percent in cluster 2, so cluster 1 will not be taken into consideration. Based on these three cluster groupings, three categories for the values of age are made: (1) <= 27, (2) 28 <= X <= 37, (3) >= 38. The age values in the UCI and ROC datasets are numeric by default, so both datasets were combined and the text facet feature of Google Refine was used to group all values of the same age together. The attribute has 53 different numeric age value groups, as shown in Figure 1.3. Each of these 53 age group values has been put into its respective category according to the age values.
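The binning step above can be sketched in Python. This is a hypothetical re-implementation, not the original Google Refine workflow, and the typical ages assigned to each employment-length bin are illustrative assumptions, not the actual cluster centroids:

```python
def age_category(age):
    """Map a numeric age into the three bins derived from the clustering step."""
    if age <= 27:
        return "<=27"
    if age <= 37:
        return "28<=X<=37"
    return ">=38"

# Hypothetical imputation for the missing Lending Club ages: pick a typical age
# for each employment-length bin (these values are assumed, not the real centroids).
ASSUMED_TYPICAL_AGE = {"<1": 24, "1 to 4": 30, "4 to 7": 34, "7+": 40}

def impute_age_bin(emp_length_bin):
    """Guess an age category for a row whose age is missing."""
    return age_category(ASSUMED_TYPICAL_AGE[emp_length_bin])
```

The point of the sketch is only the shape of the transformation: numeric ages collapse to three nominal categories, and missing ages are inferred from employment length.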

But the tricky part here is not knowing whether the job the loan applicant holds is his first job or, say, his fifth job. So I will also do the data analysis without considering the age attribute, and see how that affects the prediction accuracy compared to when age is taken into consideration. The various age groups obtained after applying the text facet feature on the age attribute are shown in Figure 1.3.

Figure 1.3: Text facet on age to group age values in the dataset

(2) Employment Length: The employment length values in both the Lending Club and ROC datasets are numeric, but in the UCI German dataset the employment lengths are in four bins, i.e. less than 1 year, 1 to 4 years, 4 to 7 years, and more than 7 years. These four bins will be common for the entire dataset. The numeric values in the other two datasets will be put into their respective bins. The text facet function of Google Refine is used to get groups of all possible values and then put these groups into their respective bins.

(3) Housing: This is a nominal attribute and has four possible values, i.e. rent, own, free and other, as shown in Table 5.

Table 5: Housing attribute categories

Housing        | Frequency (Number of Values)
rent           | 4475
own            | 1405
free           | 108
other          | 30
Missing Values | 102

Since the instances of free (108) and other (30) make up a very small fraction of the dataset, these two are merged into one value, Other, which now has 138 instances. Housing cells with the value rent account for 74% of the values in the dataset, so all 102 missing values of the housing attribute are filled with the mode, i.e. the most frequent value, rent. The housing attribute finally has only three values: rent (4575 values), own (1405 values) and other (138 values).

(4) Loan Purpose: This is a nominal attribute. The loan purpose attribute in the Lending Club dataset had 12 different values, whereas the UCI dataset had 7 different values. All the values for loan purpose were unknown in the ROC dataset. When the datasets are combined, only the following values have a considerable number of instances: car (524 values), credit card (610 values), debt consolidation (2450 values), house loan or home improvement (927 values) and other (620 values). The probability of occurrence of all the other values is very small. So a new category value named other/unknown is introduced, which includes all the remaining category instances. This category also includes the 620 values of the category other. All missing values of loan purpose are also filled with other/unknown. Finally, loan purpose has only five category values: 1) car, 2) credit card, 3) debt consolidation, 4) house loan/home improvement, 5) other/unknown.
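The housing and loan-purpose cleanups follow the same two recipes: mode imputation and rare-category consolidation. A small Python sketch of both (hypothetical helpers; the project did this with Google Refine facets):

```python
from collections import Counter

def impute_with_mode(values, missing=None):
    """Fill missing cells with the most frequent observed value
    (for the housing attribute, the mode is 'rent')."""
    mode = Counter(v for v in values if v != missing).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]

def consolidate_rare(values, keep, other="other/unknown"):
    """Collapse categories outside `keep` (including missing cells) into one
    bucket, as done for the loan purpose attribute."""
    return [v if v in keep else other for v in values]
```

Applied to the purpose column with keep = {car, credit card, debt consolidation, house loan/home improvement}, every other or missing value ends up in other/unknown.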

The various categories of the purpose attribute in the UCI German dataset are shown in Figure 1.4.

Figure 1.4: Purpose attribute in the UCI German dataset

The various categories of the purpose attribute in the Lending Club dataset are shown in Figure 1.5.

Figure 1.5: Purpose attribute in the Lending Club dataset

(5) Job Profile and Income: Job profile, income and employment length are three attributes that are correlated. We can reasonably say that the income of a highly skilled employee with little experience would be higher than that of a less skilled employee, even one with relatively more experience. This relation will be used to find the missing values of Job Profile and Income. The Lending Club dataset has all Job profile values unknown. In this case, the income and employment length values of the Lending Club and ROC tables are used to extract a pattern. The UCI German dataset has not been used here, since all of its income values are unknown. The K-means clustering algorithm is used to find a pattern between the attributes. This algorithm divides the attribute values into three clusters, shown in Figure 1.6.

Figure 1.6: Cluster Division

All the unknown cells of the Job profile column are filled with the value cluster 0, cluster 1 or cluster 2, according to the model created by the K-means clustering algorithm.

The values obtained for the Job_profile attribute by using K-means clustering are shown in Figure 1.7.

Figure 1.7: K-means output (the Cluster column will be renamed job_profile)

Based on the cluster centroids, cluster 0, which has a mean annual income of $77928 and employment length 7+ years, has been renamed highly skilled; cluster 1, which has a mean annual income of $47230 and employment length 1 to 4 years, has been renamed less skilled; and cluster 2, which has a mean annual income of $59420 and employment length 4 to 7 years, has been renamed moderately skilled. The cluster 0, 1 and 2 instances in the column are replaced by highly, less and moderately skilled respectively. Now the relationship derived from the clustering model is used to fill in all the unknown income values. For example, consider filling an income value whose corresponding employment length is 4 to 7 years and whose Job profile is moderately skilled. All the instances having 4 to 7 years and moderately skilled are gathered using the text facet function of OpenRefine, and the mean of those income values is used to fill in the blank cells. There are 1503 matching instances with employment length 4 to 7 years and job profile moderately skilled. The arithmetic mean of the known income values of these 1503 instances, in this case $59342, is inserted into all the blank income cells having 4 to 7 years and moderately skilled. Similarly, all the other missing values of the income attribute are calculated by taking the mean of all income values with the matching job profile and employment length.
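The group-wise mean imputation for income can be expressed as a short Python sketch (a hypothetical re-implementation of the text-facet-and-mean procedure described above):

```python
def fill_income_by_group(rows):
    """rows: dicts with 'job_profile', 'emp_length' and 'income' (None if missing).
    Each missing income gets the arithmetic mean of the known incomes that share
    the same (job profile, employment length) pair."""
    totals = {}  # (job_profile, emp_length) -> [sum, count]
    for r in rows:
        if r["income"] is not None:
            key = (r["job_profile"], r["emp_length"])
            s = totals.setdefault(key, [0.0, 0])
            s[0] += r["income"]
            s[1] += 1
    for r in rows:
        if r["income"] is None:
            s = totals[(r["job_profile"], r["emp_length"])]
            r["income"] = s[0] / s[1]
    return rows
```

For the (4 to 7 years, moderately skilled) group in the report, this mean works out to $59342 over the 1503 matching rows.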

Removing Rows: Loan history is the class attribute. It is a nominal attribute with two category values, i.e. defaulter and not defaulter. It also has 16 blank or missing values. Since this is the class attribute, the 16 rows that do not have a value for it are removed. There are a total of 50 loan history instances having the value current. These loan payments are currently in progress, and there is no indication of whether they will default on their future payments or not, so these 50 rows are also deleted. In addition, there are 177 rows whose employment length value is blank but which have an income associated with them. Here we are not sure whether it is the applicant's previous job's income, or the income of a co-signer or family member. We also have no values for the Job profile and age cells of these rows. Since a lot of important data is missing in these 177 rows, they are deleted as well.

FINAL DATASET

The final dataset, obtained by merging the three datasets and cleaning the result, has a total of 5857 rows. There are no missing values in this new dataset. The list of attributes along with their types is shown in Table 6.

Table 6: Final Dataset

Attribute Name | Type    | Values Example
Age            | Nominal | <=27, 28<=X<=37, >=38
Job profile    | Nominal | less, moderately, highly skilled
Income         | Numeric | 5000, …
Emp Length     | Nominal | <1, 1 to 4, 4 to 7, 7+
Loan Amount    | Numeric | 1200, 3500, …
Duration       | Numeric | …, 60 (in months)
Purpose        | Nominal | Car loan, House Loan, Business Loan etc.
Housing        | Nominal | rent, own, other
Loan History   | Nominal | Defaulter, Not Defaulter

The new dataset has been divided into a training set and a test set in the ratio of 80:20. After splitting the data, models are built on the training set, based on extracted data patterns, using classification algorithms. These classifier models are then evaluated on the test dataset.
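The 80:20 split can be sketched as follows (a hypothetical helper; the project used Weka's own split facilities):

```python
import random

def split_80_20(rows, seed=7):
    """Shuffle the cleaned rows and split them into training and test sets
    in the ratio 80:20."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # deterministic for a fixed seed
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]
```

Applied to the 5857-row merged dataset, this would give 4685 training rows and 1172 test rows; the report's evaluations use a 1156-row test set, so the project's own split was evidently taken slightly differently.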

3. DATA MINING USING CLASSIFICATION ALGORITHMS ON MERGED DATASET

3.1 HYBRID NAÏVE BAYES DECISION TREE ALGORITHM

Naïve Bayes and Decision Trees are two of the most important classification algorithms for prediction purposes, due to their accuracy, simplicity and effectiveness. Their prediction accuracies can be increased further by combining the advantages of both algorithms in a Hybrid Naïve Bayes Decision Tree algorithm. This algorithm gives higher prediction accuracy than Naïve Bayes and Decision Tree used individually, while the time complexity does not increase by a great extent [1]. The implementation of the algorithm is divided into two parts. In the first part, Naïve Bayes and decision tree models are created and assessed individually on the training data. In the second part, the class probabilities obtained for every instance of the test set are averaged, weighted by the classification accuracies obtained on the training data [1]. Finally, the result of the Hybrid Naïve Bayes Decision Tree algorithm is compared with the results of the Naïve Bayes and Decision Tree algorithms calculated individually. The algorithm [1] is as follows:

Phase 1: (Used WEKA)

INPUT: Training Data

STEPS:
1) Build a classifier model on the training data using a Decision Tree, denoted by (C4.5)
2) Evaluate the accuracy of the model on the training data, denoted by (ACC_C4.5)
3) Build a classifier model on the training data using Naïve Bayes, denoted by (NB)
4) Evaluate the accuracy of the model on the training data, denoted by (ACC_NB)
5) Return the models built along with their evaluated accuracies.

OUTPUTS: (C4.5, ACC_C4.5, NB, ACC_NB)

Phase 2: (Used JAVA)

INPUT:
1) The models built in the first phase, i.e. C4.5, NB
2) Their respective accuracies ACC_C4.5, ACC_NB
3) A test data instance, denoted by x

STEPS:
1) For every class label c of the test instance x (in this case 2 class labels: c is either defaulter or not defaulter)
2) Calculate P(c|x)_C4.5 by using the decision tree model (C4.5). The formula [1] is:

   P(c|x)_C4.5 = ( Σ_{i=1..k} δ(c_i, c) + 1 ) / ( k + n_c )

   where k is the number of training instances in the particular leaf node where x falls, c_i is the class of the ith training instance in that leaf, and n_c is the number of classes [1]. The function δ(c_i, c) is equal to 0 if its two parameters are not equal and 1 if they are equal [1].

   For this dataset:
   Calculate P(not defaulter|x)_C4.5 using the decision tree model (C4.5) and the formula above
   Calculate P(defaulter|x)_C4.5 using the decision tree model (C4.5) and the formula above

3) Calculate P(c|x)_NB by using the Naïve Bayesian classifier model (NB). The formula [1] is:

   P(c|x)_NB = P(c) · Π_{j=1..m} P(a_j|c)

Here, m denotes the total number of attributes, a_j is the value of the jth attribute of the test instance x, and c, as mentioned earlier, is a value of the class attribute [1]. The prior probability P(c) [1] is calculated using the formula:

   P(c) = ( Σ_{i=1..n} δ(c_i, c) + 1 ) / ( n + n_c )

Here c_i is the value of the class attribute of the ith training row, n is the total number of rows in the training set, and n_c is the total number of classes (in this case: 2) [1]. The function δ(c_i, c) is equal to one if both parameters are equal and zero if they are not [1]. The conditional probability P(a_j|c) [1] is calculated using the formula:

   P(a_j|c) = ( Σ_{i=1..n} δ(a_ij, a_j) · δ(c_i, c) + 1 ) / ( Σ_{i=1..n} δ(c_i, c) + n_j )

Here a_j is the value of the jth attribute of the test instance x, a_ij is the value of the jth attribute of training row i, c_i is the value of the class attribute of the ith training row, and n_j is the total number of values that the jth attribute can have [1].

For this dataset:
Calculate P(defaulter|x)_NB using the Naïve Bayesian model (NB)
Calculate P(not defaulter|x)_NB using the Naïve Bayesian model (NB)

4) Calculate P(c|x)_C4.5-NB using the formula below [1]:

   P(c|x)_C4.5-NB = ( ACC_C4.5 · P(c|x)_C4.5 + ACC_NB · P(c|x)_NB ) / ( ACC_C4.5 + ACC_NB )

For this dataset:
Calculate P(defaulter|x)_C4.5-NB

Calculate P(not defaulter|x)_C4.5-NB

5) Find the class with the maximum value of P(c|x)_C4.5-NB obtained in the previous step. For this dataset:

If ( P(defaulter|x)_C4.5-NB > P(not defaulter|x)_C4.5-NB ) {
    the class label of the instance is defaulter according to the hybrid algorithm
} else if ( P(not defaulter|x)_C4.5-NB > P(defaulter|x)_C4.5-NB ) {
    the class label of the instance is not defaulter according to the hybrid algorithm
}

OUTPUT: The class label (defaulter or not defaulter) for the test instance x

FINAL RESULT: The Phase 2 steps are repeated for all instances of the test dataset. The class label output obtained for every test instance is copied to an Excel file, and the actual class labels are pasted in the next column of the same file. This file is read by a Java program, and the correctly and incorrectly classified instances are counted by comparing the two columns. A sample of the probabilities obtained by using the hybrid algorithm is shown in Figure 2.1.
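The accuracy-weighted combination in step 4 and the argmax in step 5 reduce to a few lines. This is a hedged Python sketch of that combination step only (the project's actual implementation was in Java; names here are illustrative):

```python
def hybrid_probability(p_c45, p_nb, acc_c45, acc_nb):
    """Step 4: average the two classifiers' class probabilities, weighted by
    the training accuracies returned from Phase 1 [1]."""
    return (acc_c45 * p_c45 + acc_nb * p_nb) / (acc_c45 + acc_nb)

def hybrid_label(probs_c45, probs_nb, acc_c45, acc_nb):
    """Step 5: pick the class with the larger combined probability.
    probs_*: dict mapping class label -> probability for one test instance."""
    combined = {c: hybrid_probability(probs_c45[c], probs_nb[c], acc_c45, acc_nb)
                for c in probs_c45}
    return max(combined, key=combined.get)
```

For example, with ACC_C4.5 = 0.74 and ACC_NB = 0.72, an instance scored {defaulter: 0.6, not defaulter: 0.4} by the tree and {defaulter: 0.3, not defaulter: 0.7} by Naïve Bayes comes out as not defaulter, since the Naïve Bayes probabilities pull the weighted average past 0.5.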

Figure 2.1: Probabilities obtained after completing step 4

The prediction accuracy and confusion matrix obtained by using the Hybrid Naïve Bayes Decision Tree algorithm are shown in Figure 2.2.

Figure 2.2: Hybrid Naïve Bayes Decision Tree Algorithm

Hybrid Naïve Bayes Decision Tree Accuracy: 73.80 %

Figure 2.3 shows the result of using the Naïve Bayes algorithm separately on the merged dataset.

Naïve Bayes Algorithm Result

Figure 2.3: Naïve Bayes classification algorithm using Weka

Naïve Bayes Accuracy: 72.40 %

Figure 2.4 shows the result of using the Decision Tree (J48) algorithm separately on the merged dataset.

Decision Tree Algorithm (J48) Result

Figure 2.4: J48 classification algorithm using Weka

Decision Tree J48 Accuracy: 73.27 %

The classification accuracy obtained by using the hybrid algorithm (73.80 %) shows an improvement over the accuracy of the individual classification algorithms, in this case Naïve Bayes and Decision Tree, although by a relatively small margin, as shown in Table 7. It does, however, serve the purpose of the algorithm, i.e. improving on the accuracy of the individual classification algorithms without greatly increasing the time complexity. Table 7 compares the accuracies obtained by using the Hybrid, Naïve Bayes and Decision Tree algorithms respectively.

Table 7: Comparison of Prediction Accuracies

Classification Algorithm                   | Accuracy | Correctly Classified | Incorrectly Classified
Hybrid Naïve Bayes Decision Tree Algorithm | 73.80 %  | —                    | —

Naïve Bayes Algorithm                      | 72.40 %  | 837/1156             | 319/1156
Decision Tree Algorithm                    | 73.27 %  | 847/1156             | 309/1156

ENSEMBLE METHODS

Ensemble methods either use more than one data mining algorithm, or use one data mining algorithm multiple times, in order to improve the prediction accuracy compared to a single use of an algorithm on the dataset.

1) BOOSTING ALGORITHM

In the first iteration of the Boosting algorithm, a classification model is created on the training data using a data mining algorithm. The second iteration creates a classification model that concentrates on the instances, or rows, that were incorrectly classified in the first iteration [2]. This process continues until some constraint is reached with regards to the accuracy or the number of models [2]. The aim of using the Boosting ensemble method is to get better results than the individual classification algorithms. For this dataset, the AdaBoostM1 classification algorithm is used. AdaBoostM1 is tried with different base classifier algorithms, such as the J48 decision tree, Naïve Bayes, SVM and a neural network, in order to find the base classification algorithm that gives the best results for this dataset. In this case, the J48 decision tree used as the base classifier gives the best results for AdaBoostM1, so the final analysis uses AdaBoostM1 with J48. Also, various numbers of iterations, i.e. numbers of successive models to be created (N), are tried in order to get good prediction accuracy. Values of N above 30 either leave the classification accuracy unchanged or decrease it. Finally, N = 30 is selected, since it gives the best prediction results. Figure 3.1 shows the result of using AdaBoostM1 with J48 as the base classifier and the number of successive models N = 30.

AdaBoostM1 Algorithm using J48 as base classifier: Result

Figure 3.1: AdaBoostM1 algorithm using Weka

AdaBoostM1 using J48 Accuracy: 74.31 %

Table 8 shows the prediction accuracies of the AdaBoostM1 classification algorithm using different base classifier algorithms and different numbers of successive models to be created, N.

Table 8: AdaBoostM1 Classification Algorithm Accuracy Results
(N = number of successive models)

Base Classification Algorithm Used | N = 3 | N = 10 | N = 20 | N = 30
J48 Decision Tree                  |   %   |   %    |   %    |   %
Naïve Bayes                        |   %   |   %    |   %    |   %
Support Vector Machines (SVM)      |   %   |   %    |   %    |   %
K Nearest Neighbors (KNN)          |   %   |   %    |   %    |   %

The base classification algorithm used in this case is the J48 decision tree, since J48 gives the best results for this dataset compared to the other individual algorithms, and the number of successive models is 30. The aim of using the AdaBoostM1 ensemble method is to get better prediction results than the individual classification algorithms, in this case the J48 algorithm.

2) BAGGING ALGORITHM

Bagging is a type of ensemble that divides the entire training data into various small samples and then creates a separate classifier model for every sample [2]. The results obtained from all these classifier models are finally merged using techniques like majority voting or averaging of the results [2]. The main advantage here is that each sample obtained from the training set is unique, so every classifier model that is created is trained on a slightly different, unexplored part of the problem. Like AdaBoostM1, the Bagging algorithm is also tried with different base classifiers, such as the J48 decision tree, Naïve Bayes, SVM and a neural network, in order to find the base classification algorithm that gives the best results for this dataset. In this case, the J48 decision tree used as the base classifier gives the best results for Bagging, so the final analysis uses Bagging with J48. Also, various numbers of samples to be created (N) are tried in order to get good prediction accuracy. Values of N above 30 either leave the classification accuracy unchanged or decrease it. Finally, N = 30 is selected, since it gives the best prediction results. Figure 3.2 shows the result of using Bagging with J48 as the base classifier and the number of samples N = 30.

Bagging Algorithm using J48 as base classifier: Result

Figure 3.2: Bagging algorithm using Weka

Bagging using J48 Accuracy: 75.09 %

Here, the inbuilt Bagging algorithm of WEKA is used. Table 9 shows the prediction accuracies of the Bagging classification algorithm using different base classifier algorithms and different numbers of samples to be created, N.

Table 9: Bagging Classification Algorithm Results

Base Classification Algorithm Used | N = 3 | N = 10 | N = 20 | N = 30
J48 Decision Tree                  |   %   |   %    |   %    |   %
Naïve Bayes                        |   %   |   %    |   %    |   %
Support Vector Machines (SVM)      |   %   |   %    |   %    |   %
K Nearest Neighbors (KNN)          |   %   |   %    |   %    |   %
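The core of Bagging, bootstrap sampling plus majority voting, can be sketched independently of Weka. This is a minimal, hypothetical Python illustration of the mechanics, not the Weka implementation; `base_learner` stands in for J48:

```python
import random
from collections import Counter

def bagging_predict(train_rows, x, base_learner, n_models=30, seed=0):
    """Train n_models base learners on bootstrap samples of the training data
    and combine their predictions on x by majority vote.
    base_learner(rows) must return a function predict(x) -> class label."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        # bootstrap: sample with replacement, same size as the training set
        sample = [rng.choice(train_rows) for _ in train_rows]
        votes[base_learner(sample)(x)] += 1
    return votes.most_common(1)[0][0]
```

A trivial base learner that always predicts the majority class of its sample is enough to exercise the mechanics; in the project the base learner is the J48 decision tree, and n_models corresponds to N = 30 above.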

The base classification algorithm used in this case is the J48 decision tree, since J48 gives the best results for this dataset compared to the other individual algorithms, and the number of samples is 30. The aim of using the Bagging ensemble method is to get better prediction results than the individual classification algorithms, in this case the J48 algorithm. The prediction accuracies obtained by using J48 individually and J48 with Bagging and Boosting are compared in Table 10.

Table 10: Comparison of Prediction Accuracies

Classification Algorithm | Accuracy | Correctly Classified | Incorrectly Classified
J48 Algorithm            | 73.27 %  | 847/1156             | 309/1156
AdaBoostM1 Algorithm     | 74.31 %  | 859/1156             | 297/1156
Bagging Algorithm        | 75.09 %  | 868/1156             | 288/1156

ARTIFICIAL NEURAL NETWORK

Nowadays, the Artificial Neural Network is considered a well-established method for evaluating the loan applications received by banks and for making approval or rejection decisions. Here, a classifier model is built on the merged dataset using both a single-layer and a multilayer feed-forward neural network algorithm. A single-layer feed-forward neural network [5] consists of an input layer, which has all of the attributes used except the class attribute; one hidden layer, which has some number of neurons (specified in the code); and an output layer, which consists of the class attribute. A multilayer feed-forward neural network [5] has the same input and output layers, but multiple hidden layers (the number of hidden layers and neurons is specified in the code). Normally, a multilayer feed-forward neural network has two hidden layers. Both algorithms have been implemented in the R environment. The inbuilt package neuralnet is used to run both algorithms.
The basic command for running this algorithm is:

NeuralNetworkResult <- neuralnet(formula, data, hidden, algorithm, stepmax)

The arguments have the following meanings:

1) formula: specifies the class attribute and then lists all the attributes to be considered when building the model on the class attribute.
2) data: the name of the dataset on which the model is built.
3) hidden: the number of hidden layers, i.e. whether the network is single-layer or multilayer, and the number of neurons in each layer.
4) algorithm: the algorithm used to train the neural network. By default this is resilient backpropagation, indicated by 'rprop+'.
5) stepmax: the maximum number of steps that can be used to build the neural network.

The basic code used for building a single-layer neural network in this case is:

nn <- neuralnet(Not_defaulter + defaulter ~ age + Job_profile + income + emp_length + Loan.amount + Loan.Duration + Purpose + home_ownership, data = nnet_train, algorithm = 'rprop+', hidden = 3, stepmax = 1e6)

(Here the single hidden layer has 3 neurons.)

The basic code used for building a multilayer neural network in this case is:

n <- neuralnet(Not_defaulter + defaulter ~ age + Job_profile + income + emp_length + Loan.amount + Loan.Duration + Purpose + home_ownership, data = nnet_train, algorithm = 'rprop+', hidden = c(3, 2), stepmax = 1e6)

(Here the first hidden layer has 3 neurons and the second has 2 neurons.)

Figure 4.1 shows the network obtained by using the single-layer neural network.

Figure 4.1: Single-Layer Feed-Forward Neural Network using R

A sample code snippet and the result obtained using the single-layer neural network are shown in Figure 4.2.

Figure 4.2: Single-Layer (3 hidden neurons) Neural Network, Accuracy: 71.62 %

Figure 4.3 shows the network obtained by using the multilayer neural network.

Figure 4.3: Multilayer Feed-Forward Neural Network using R

The result obtained using the multilayer neural network is shown in Figure 4.4.

Figure 4.4: Multilayer (5 hidden neurons) Neural Network, Accuracy: 72.57 %

Table 11: Comparing the Results of the Single-Layer and Multilayer Neural Networks

Classification Algorithm       Accuracy    Correctly Classified    Incorrectly Classified
Single-Layer Neural Network    71.62 %     828 / 1156              328 / 1156
Multilayer Neural Network      72.57 %     839 / 1156              317 / 1156
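The single- vs. multilayer comparison above can also be sketched outside R with scikit-learn's MLPClassifier, where hidden_layer_sizes=(3,) plays the role of hidden = 3 and (3, 2) plays the role of hidden = c(3, 2). This is an illustrative analogue on synthetic data, not the project's neuralnet code:

```python
# Sketch: single-layer (3 neurons) vs. multilayer (3 + 2 neurons)
# feed-forward networks, mirroring the two neuralnet() calls in R.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

single = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=1)
multi = MLPClassifier(hidden_layer_sizes=(3, 2), max_iter=2000, random_state=1)

single.fit(X_tr, y_tr)
multi.fit(X_tr, y_tr)
print("single-layer accuracy:", single.score(X_te, y_te))
print("multilayer accuracy:", multi.score(X_te, y_te))
```

As in Table 11, the difference between the two architectures on a given dataset is often small; the extra layer only helps when the data contain structure the single layer cannot capture.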

4. Analyzing a Single Dataset (Lending Club) using Classification Algorithms

Here, some of the algorithms used in the earlier milestones are applied again to build models on the Lending Club dataset, and the models created are then evaluated for their accuracy. This dataset has 4586 instances and 24 attributes. It has been divided into a training set and a test set in the ratio 80:20. After splitting the data, models are built on the training set, based on extracted data patterns, using the classification algorithms. These classifier models are then evaluated using the test dataset. The attributes of this dataset are as follows:

Loan_amount - Numeric
Loan Term - Nominal
Installment_rate - Nominal
Installment - Numeric
Grade - Nominal
Sub_grade - Nominal
Employment_length - Nominal
Open_accounts - Numeric
Public_record - Numeric
Revolving_balance - Numeric
Revolving_until - Nominal
Total_accounts - Numeric
Total_payment - Numeric
loan_status - Nominal
Home_ownership - Nominal
Annual_income - Numeric
Loan_purpose - Nominal
Zip_code - Nominal
Verification_status - Nominal
Address_state - Nominal
Issue_date - Nominal
Earliest_credit_line - Nominal
Inquiry_last_6months - Numeric
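The 80:20 split described above can be sketched as follows. The DataFrame here is synthetic and the class labels are assumed to be "defaulter" / "not defaulter"; in the project the data would come from the Lending Club CSV with the attributes listed above:

```python
# Sketch of the 80:20 train/test split, stratified on the class attribute
# so both splits keep the same defaulter / not-defaulter proportions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Loan_amount": range(100),
    "Annual_income": range(1000, 1100),
    "loan_status": ["not defaulter" if i % 3 else "defaulter" for i in range(100)],
})
train, test = train_test_split(df, test_size=0.2, random_state=42,
                               stratify=df["loan_status"])
print(len(train), len(test))  # 80 20
```

Stratifying on loan_status matters here because the defaulter class is the minority; a plain random split could leave the test set with too few defaulters to evaluate.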

Figure 5.1 shows the result of using the Decision Tree (J48) algorithm on the Lending Club dataset.

Figure 5.1: J48 Classification Algorithm using Weka – Decision Tree J48, Accuracy: 77.69 %

Figure 5.2 shows the result of using the Naïve Bayes algorithm on the Lending Club dataset.

Figure 5.2: Naïve Bayes Classification Algorithm using Weka – Naïve Bayes, Accuracy: 74.17 %

Figure 5.3 shows the result of the AdaBoostM1 algorithm, using Decision Stump as the base classifier, on the Lending Club dataset.

Figure 5.3: AdaBoostM1 Algorithm using Weka – AdaBoostM1 using Decision Stump, Accuracy: 90.49 %

Figure 5.4 shows the result of the Bagging algorithm, using J48 as the base classifier, on the Lending Club dataset.

Figure 5.4: Bagging Algorithm using Weka – Bagging using J48, Accuracy: 85.43 %

5. RESULTS AND ANALYSIS

In this section, we study and analyze the results obtained by building models on the merged loan-approved dataset and the Lending Club dataset using various classification algorithms. We also look for insight into which attributes are most relevant for predicting the results correctly. The prediction accuracies obtained on the merged dataset and the Lending Club dataset using the various classification algorithms are shown in Table 12 and Table 13 respectively.

Table 12: Comparison of the results obtained on the Merged dataset

ALGORITHM USED                 CLASSIFICATION ACCURACY    CORRECTLY CLASSIFIED INSTANCES    INCORRECTLY CLASSIFIED INSTANCES
Naïve Bayes                    72.40 %                    837 / 1156                        319 / 1156
J48                            73.26 %                    847 / 1156                        309 / 1156
Naïve Bayes–J48 Hybrid         73.80 %                    852 / 1156                        304 / 1156
Boosting (AdaBoostM1)          74.31 %                    859 / 1156                        297 / 1156
Bagging                        75.08 %                    868 / 1156                        288 / 1156
Single-Layer Neural Network    71.62 %                    828 / 1156                        328 / 1156
Multilayer Neural Network      72.57 %                    839 / 1156                        317 / 1156

(The highest accuracy, obtained using Bagging, is highlighted.)

Table 13: Comparison of the results obtained on the Lending Club dataset

ALGORITHM USED                 CLASSIFICATION ACCURACY    CORRECTLY CLASSIFIED INSTANCES    INCORRECTLY CLASSIFIED INSTANCES
Naïve Bayes                    74.17 %                    718 / 968                         250 / 968
J48                            77.69 %                    752 / 968                         216 / 968
Boosting (AdaBoostM1)          90.49 %                    876 / 968                         92 / 968
Bagging                        85.43 %                    827 / 968                         141 / 968

(The highest accuracy, obtained using Boosting, is highlighted.)

Observations and Analysis

1) As we can see, the classification accuracies obtained on the single Lending Club dataset are relatively higher, and in some cases much higher, than those obtained on the merged dataset using the same algorithms. Let us analyze the results obtained for both datasets using the J48 and Bagging algorithms. Figures 6.1 and 6.2 show the confusion matrices for the merged dataset and the Lending Club dataset obtained using the J48 algorithm.

Figure 6.1: Confusion Matrix for Merged Dataset using J48
   a     b   <-- classified as
  736    49 |  a = not defaulter
  260   111 |  b = defaulter

Figure 6.2: Confusion Matrix for Lending Club Dataset using J48
   a     b   <-- classified as
  606    78 |  a = not defaulter
  138   146 |  b = defaulter

Figures 6.3 and 6.4 show the confusion matrices for the merged dataset and the Lending Club dataset obtained using the Bagging algorithm.

Figure 6.3: Confusion Matrix for Merged Dataset using Bagging
   a     b   <-- classified as
  (a = not defaulter, b = defaulter)

Figure 6.4: Confusion Matrix for Lending Club Dataset using Bagging
   a     b   <-- classified as
  (a = not defaulter, b = defaulter)

According to Figure 6.1, the J48 algorithm correctly predicts 847 of the 1156 instances; the classification accuracy is 73.26 %. Figure 6.2 shows that the J48 algorithm correctly predicts 752 of the 968 instances; the classification accuracy in this case is 77.69 %. The confusion matrix for the merged dataset using the Bagging algorithm (Figure 6.3) shows that 868 of 1156 instances are correctly classified, an accuracy of 75.08 %. The confusion matrix for the Lending Club dataset using Bagging (Figure 6.4) shows that 827 of 968 instances are correctly classified, an accuracy of 85.43 %. So the classification accuracy obtained for the Lending Club dataset is consistently higher than that obtained on the merged dataset. While merging the datasets, all the uncommon attributes were removed.
The lower prediction accuracy of the merged dataset compared with the Lending Club dataset may be due to the fact that some attributes carrying useful information for identifying instances as defaulters or not defaulters are missing. These attributes would have helped the algorithms gain a

much better understanding of the patterns in the dataset while creating a model, thereby improving the prediction accuracy.

2) Another thing to notice in both datasets, and especially in the merged dataset, is that although the overall prediction accuracy is good, the algorithms are not very good at correctly predicting defaulters while being excellent at predicting non-defaulters. Figures 6.5 and 6.6 show the confusion matrices for the merged dataset and the Lending Club dataset obtained using the J48 algorithm.

Figure 6.5: Confusion Matrix for Merged Dataset using J48
   a     b   <-- classified as
  736    49 |  a = not defaulter
  260   111 |  b = defaulter

Figure 6.6: Confusion Matrix for Lending Club Dataset using J48
   a     b   <-- classified as
  606    78 |  a = not defaulter
  138   146 |  b = defaulter

As per Figure 6.5, the J48 algorithm correctly classifies 736 of the 785 instances whose actual class is not defaulter, a classification accuracy of 93.7 % on that class. On the other hand, it correctly classifies only 111 of the 371 instances whose actual class is defaulter, an accuracy of just 29.9 % on that class. According to Figure 6.6, the J48 algorithm correctly classifies 606 of the 684 not defaulter instances, an accuracy of 88.59 %, but only 146 of the 284 defaulter instances, an accuracy of 51.4 %. Let us consider one more example, with another classification algorithm, on both datasets.
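The per-class accuracies quoted above follow directly from the confusion matrix: divide the diagonal count for a class by that class's row total. A small sketch using the J48 merged-dataset counts from the text (off-diagonal counts inferred from the row totals):

```python
# Per-class accuracy (recall) from a confusion matrix.
# Rows are the actual classes; values are the J48 merged-dataset counts:
# 736 of 785 not defaulters and 111 of 371 defaulters classified correctly.
confusion = {
    "not defaulter": {"not defaulter": 736, "defaulter": 49},   # 785 actual
    "defaulter":     {"not defaulter": 260, "defaulter": 111},  # 371 actual
}

def per_class_accuracy(cm):
    # diagonal entry divided by the row total, for each actual class
    return {actual: row[actual] / sum(row.values()) for actual, row in cm.items()}

acc = per_class_accuracy(confusion)
print(f"{acc['not defaulter']:.4f}")  # 0.9376
print(f"{acc['defaulter']:.4f}")      # 0.2992
```

This is why overall accuracy alone is misleading here: the large not-defaulter class dominates the average and hides the poor defaulter recall.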
Figures 6.7 and 6.8 show the confusion matrices for the merged dataset and the Lending Club dataset obtained using the Naïve Bayes algorithm.

Figure 6.7: Confusion Matrix for Merged Dataset using Naïve Bayes
   a     b   <-- classified as
  728    57 |  a = not defaulter
  262   109 |  b = defaulter

Figure 6.8: Confusion Matrix for Lending Club Dataset using Naïve Bayes
   a     b   <-- classified as
  537   147 |  a = not defaulter
  103   181 |  b = defaulter

In the case of the merged dataset, Naïve Bayes has a classification accuracy of 92.73 % on the instances whose class is not defaulter, but only 29.38 % on the instances whose class is defaulter. For the Lending Club dataset, Naïve Bayes achieves 79 % on the not defaulter instances and 63.73 % on the defaulter instances. As we can see, the prediction accuracy on not defaulter instances remains very good for both datasets, but the accuracy on defaulter instances is not nearly as good, especially for the merged dataset. The train-test split was changed to 70:30 for both datasets to see whether the behavior of the models changed. The prediction accuracy remained almost the same: the overall accuracy dropped by only a small margin, and there was hardly any difference in the prediction accuracy of the defaulter class instances. One reason for the poor prediction accuracy on defaulter instances might be that the attributes in both datasets provide adequate information about the characteristics of a non-defaulter but do not reveal the vital information needed to correctly classify an applicant as a defaulter. In the case of the merged dataset, the cause could also be the removal of uncommon attributes, such as applicant address, interest rate, applicant grade, and credit inquiries in the last 6 months, while merging the 3 datasets, or the presence of many missing values that had to be guessed during data cleaning.
One major reason for the low accuracy on defaulter instances could also be that, in both datasets, the number of rows with class not defaulter is much greater than the number with class defaulter, so the datasets have class imbalance. All of these factors may have led to a lack or loss of the information needed to clearly differentiate the two classes; as a result, all the algorithms are biased toward predicting an applicant as a not defaulter. To tackle this problem, the Cost-Sensitive Learning method has been used. This method penalizes the algorithm for falsely classifying defaulter instances as not defaulters. In this approach,

although the overall prediction accuracy goes down by some margin, the prediction accuracy on defaulters goes up considerably.

COST SENSITIVE LEARNING

The Cost-Sensitive Learning method can be used with any algorithm, such as Naïve Bayes, decision trees, or Bagging. For our datasets it has the following default cost matrix; the values in it are the cost-sensitive weights.

Default Cost Matrix

New Cost Matrix after changing some weight values

Various weight values were tried, and the cost matrix above was finally chosen since it balanced the classifier results considerably and gave decent prediction results. The Naïve Bayes result without Cost-Sensitive Learning on the merged dataset is shown in Figure 7.1.

Figure 7.1: Defaulter instances prediction accuracy: 29.38 % (Overall Accuracy: 72.40 %)
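Weka's cost-matrix approach has no direct scikit-learn equivalent, but the same effect, making a false "not defaulter" prediction on an actual defaulter more expensive, can be sketched with class weights. This is a substitute technique on synthetic data, not the report's Weka setup:

```python
# Sketch of cost-sensitive learning via class weights: errors on the
# minority "defaulter" class (label 1) cost 4x, so the tree shifts its
# decisions to catch more defaulters at the expense of overall accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score

# Imbalanced synthetic data: class 1 ("defaulter") is the minority.
X, y = make_classification(n_samples=3000, weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

plain = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
# Cost-matrix analogue: misclassifying a defaulter is weighted 4x.
costly = DecisionTreeClassifier(max_depth=4, random_state=0,
                                class_weight={0: 1, 1: 4}).fit(X_tr, y_tr)

print("defaulter recall, plain:", recall_score(y_te, plain.predict(X_te)))
print("defaulter recall, cost-sensitive:", recall_score(y_te, costly.predict(X_te)))
```

As in the Weka experiments, the weighted model typically trades a few points of overall accuracy for a much better defaulter recall.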

The Naïve Bayes result with Cost-Sensitive Learning on the merged dataset is shown in Figure 7.2.

Figure 7.2: Defaulter instances prediction accuracy: % (Overall Accuracy: %)

Although the overall accuracy of the classifier drops to around 65 %, the prediction results for the defaulter class instances improve considerably, and the result no longer looks biased toward predicting an applicant as a not defaulter. Similarly, the Cost-Sensitive Learning method is used with some other algorithms on both datasets, as shown in Table 14 and Table 15.

Table 14: Results obtained on the Merged dataset with and without Cost-Sensitive Learning

                         WITHOUT COST-SENSITIVE LEARNING        WITH COST-SENSITIVE LEARNING
ALGORITHM USED           Overall       Defaulter instances      Overall       Defaulter instances
Naïve Bayes              72.40 %       29.38 %                  %             %
J48                      73.26 %       29.9 %                   %             %
Boosting (AdaBoostM1)    74.31 %       %                        %             %

Bagging                  75.08 %       %                        %             %

Table 15: Results obtained on the Lending Club dataset with and without Cost-Sensitive Learning

                         WITHOUT COST-SENSITIVE LEARNING        WITH COST-SENSITIVE LEARNING
ALGORITHM USED           Overall       Defaulter instances      Overall       Defaulter instances
Naïve Bayes              74.17 %       63.73 %                  %             %
J48                      77.69 %       51.4 %                   %             %
Boosting (AdaBoostM1)    90.49 %       %                        %             %
Bagging                  85.43 %       51.0 %                   %             %

Looking at both tables, it can be seen that the use of Cost-Sensitive Learning reduces the overall prediction accuracy by some margin, but the prediction accuracy on defaulter instances increases considerably, and the classification results no longer look biased toward predicting instances as not defaulter, thereby reducing the classification imbalance.

3) The classification accuracy obtained using the hybrid algorithm (73.80 %) is higher than the accuracy of the individual classification algorithms, in this case Naïve Bayes (72.40 %) and the J48 decision tree (73.26 %), although by a relatively small margin. In this approach, a Naïve Bayes classifier is built on every leaf node of the decision tree [1]. The hybrid algorithm combines the advantages of both Naïve Bayes and decision trees, so it serves the purpose of improving on the individual classification algorithms without greatly increasing the overall time complexity.

4) Another important observation is that when the AdaBoostM1 ensemble method is used on both datasets, the overall accuracy of correctly predicting an applicant as either a not defaulter or a defaulter increases compared with the accuracy obtained using any individual algorithm. For
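Observation 4 can be illustrated with a sketch: boosting a weak base learner (a decision stump, as in the Weka AdaBoostM1 runs) against using the stump alone, on synthetic data rather than the loan datasets:

```python
# Sketch of observation 4: AdaBoost over decision stumps vs. a single stump.
# scikit-learn's AdaBoostClassifier defaults to depth-1 tree (stump) bases.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)

stump = DecisionTreeClassifier(max_depth=1, random_state=2).fit(X_tr, y_tr)
boosted = AdaBoostClassifier(n_estimators=50, random_state=2).fit(X_tr, y_tr)

print("stump alone:", stump.score(X_te, y_te))
print("AdaBoost (50 stumps):", boosted.score(X_te, y_te))
```

Each boosting round reweights the training instances the previous models got wrong, so successive stumps focus on the hard cases; the weighted vote of all 50 usually beats any one of them.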


More information

AN ARTIFICIAL NEURAL NETWORK MODELING APPROACH TO PREDICT CRUDE OIL FUTURE. By Dr. PRASANT SARANGI Director (Research) ICSI-CCGRT, Navi Mumbai

AN ARTIFICIAL NEURAL NETWORK MODELING APPROACH TO PREDICT CRUDE OIL FUTURE. By Dr. PRASANT SARANGI Director (Research) ICSI-CCGRT, Navi Mumbai AN ARTIFICIAL NEURAL NETWORK MODELING APPROACH TO PREDICT CRUDE OIL FUTURE By Dr. PRASANT SARANGI Director (Research) ICSI-CCGRT, Navi Mumbai AN ARTIFICIAL NEURAL NETWORK MODELING APPROACH TO PREDICT CRUDE

More information

LendingClub Loan Default and Profitability Prediction

LendingClub Loan Default and Profitability Prediction LendingClub Loan Default and Profitability Prediction Peiqian Li peiqian@stanford.edu Gao Han gh352@stanford.edu Abstract Credit risk is something all peer-to-peer (P2P) lending investors (and bond investors

More information

Relative and absolute equity performance prediction via supervised learning

Relative and absolute equity performance prediction via supervised learning Relative and absolute equity performance prediction via supervised learning Alex Alifimoff aalifimoff@stanford.edu Axel Sly axelsly@stanford.edu Introduction Investment managers and traders utilize two

More information

Forecasting Agricultural Commodity Prices through Supervised Learning

Forecasting Agricultural Commodity Prices through Supervised Learning Forecasting Agricultural Commodity Prices through Supervised Learning Fan Wang, Stanford University, wang40@stanford.edu ABSTRACT In this project, we explore the application of supervised learning techniques

More information

A Dynamic Hedging Strategy for Option Transaction Using Artificial Neural Networks

A Dynamic Hedging Strategy for Option Transaction Using Artificial Neural Networks A Dynamic Hedging Strategy for Option Transaction Using Artificial Neural Networks Hyun Joon Shin and Jaepil Ryu Dept. of Management Eng. Sangmyung University {hjshin, jpru}@smu.ac.kr Abstract In order

More information

INDIAN STOCK MARKET PREDICTOR SYSTEM

INDIAN STOCK MARKET PREDICTOR SYSTEM INDIAN STOCK MARKET PREDICTOR SYSTEM 1 VIVEK JOHN GEORGE, 2 DARSHAN M. S, 3 SNEHA PRICILLA, 4 ARUN S, 5 CH. VANIPRIYA Department of Computer Science and Engineering, Sir M Visvesvarya Institute of Technology,

More information

An introduction to Machine learning methods and forecasting of time series in financial markets

An introduction to Machine learning methods and forecasting of time series in financial markets An introduction to Machine learning methods and forecasting of time series in financial markets Mark Wong markwong@kth.se December 10, 2016 Abstract The goal of this paper is to give the reader an introduction

More information

Predicting the Success of a Retirement Plan Based on Early Performance of Investments

Predicting the Success of a Retirement Plan Based on Early Performance of Investments Predicting the Success of a Retirement Plan Based on Early Performance of Investments CS229 Autumn 2010 Final Project Darrell Cain, AJ Minich Abstract Using historical data on the stock market, it is possible

More information

Internet Appendix. Additional Results. Figure A1: Stock of retail credit cards over time

Internet Appendix. Additional Results. Figure A1: Stock of retail credit cards over time Internet Appendix A Additional Results Figure A1: Stock of retail credit cards over time Stock of retail credit cards by month. Time of deletion policy noted with vertical line. Figure A2: Retail credit

More information

STOCK PRICE PREDICTION: KOHONEN VERSUS BACKPROPAGATION

STOCK PRICE PREDICTION: KOHONEN VERSUS BACKPROPAGATION STOCK PRICE PREDICTION: KOHONEN VERSUS BACKPROPAGATION Alexey Zorin Technical University of Riga Decision Support Systems Group 1 Kalkyu Street, Riga LV-1658, phone: 371-7089530, LATVIA E-mail: alex@rulv

More information

Data Mining: A Closer Look. 2.1 Data Mining Strategies 8/30/2011. Chapter 2. Data Mining Strategies. Market Basket Analysis. Unsupervised Clustering

Data Mining: A Closer Look. 2.1 Data Mining Strategies 8/30/2011. Chapter 2. Data Mining Strategies. Market Basket Analysis. Unsupervised Clustering Data Mining: A Closer Look Chapter 2 2.1 Data Mining Strategies Data Mining Strategies Unsupervised Clustering Supervised Learning Market Basket Analysis Classification Estimation Prediction Figure 2.1

More information

Introducing GEMS a Novel Technique for Ensemble Creation

Introducing GEMS a Novel Technique for Ensemble Creation Introducing GEMS a Novel Technique for Ensemble Creation Ulf Johansson 1, Tuve Löfström 1, Rikard König 1, Lars Niklasson 2 1 School of Business and Informatics, University of Borås, Sweden 2 School of

More information

Health Insurance Market

Health Insurance Market Health Insurance Market Jeremiah Reyes, Jerry Duran, Chanel Manzanillo Abstract Based on a person s Health Insurance Plan attributes, namely if it was a dental only plan, is notice required for pregnancy,

More information

Based on BP Neural Network Stock Prediction

Based on BP Neural Network Stock Prediction Based on BP Neural Network Stock Prediction Xiangwei Liu Foundation Department, PLA University of Foreign Languages Luoyang 471003, China Tel:86-158-2490-9625 E-mail: liuxwletter@163.com Xin Ma Foundation

More information

A COMPARATIVE STUDY OF DATA MINING TECHNIQUES IN PREDICTING CONSUMERS CREDIT CARD RISK IN BANKS

A COMPARATIVE STUDY OF DATA MINING TECHNIQUES IN PREDICTING CONSUMERS CREDIT CARD RISK IN BANKS A COMPARATIVE STUDY OF DATA MINING TECHNIQUES IN PREDICTING CONSUMERS CREDIT CARD RISK IN BANKS Ling Kock Sheng 1, Teh Ying Wah 2 1 Faculty of Computer Science and Information Technology, University of

More information

Novel Approaches to Sentiment Analysis for Stock Prediction

Novel Approaches to Sentiment Analysis for Stock Prediction Novel Approaches to Sentiment Analysis for Stock Prediction Chris Wang, Yilun Xu, Qingyang Wang Stanford University chrwang, ylxu, iriswang @ stanford.edu Abstract Stock market predictions lend themselves

More information

Predicting First Day Returns for Japanese IPOs

Predicting First Day Returns for Japanese IPOs Predicting First Day Returns for Japanese IPOs Executive Summary Goal: To predict the First Day returns on Japanese IPOs (based on first day closing price), using public information available prior to

More information

Cambridge International Advanced Subsidiary Level and Advanced Level 9706 Accounting June 2015 Principal Examiner Report for Teachers

Cambridge International Advanced Subsidiary Level and Advanced Level 9706 Accounting June 2015 Principal Examiner Report for Teachers Cambridge International Advanced Subsidiary Level and Advanced Level ACCOUNTING Paper 9706/11 Multiple Choice Question Number Key Question Number Key 1 D 16 A 2 C 17 A 3 D 18 B 4 B 19 A 5 D 20 D 6 A 21

More information

UPDATED IAA EDUCATION SYLLABUS

UPDATED IAA EDUCATION SYLLABUS II. UPDATED IAA EDUCATION SYLLABUS A. Supporting Learning Areas 1. STATISTICS Aim: To enable students to apply core statistical techniques to actuarial applications in insurance, pensions and emerging

More information

CHAPTER 11 CONCLUDING COMMENTS

CHAPTER 11 CONCLUDING COMMENTS CHAPTER 11 CONCLUDING COMMENTS I. PROJECTIONS FOR POLICY ANALYSIS MINT3 produces a micro dataset suitable for projecting the distributional consequences of current population and economic trends and for

More information

Loan Approval and Quality Prediction in the Lending Club Marketplace

Loan Approval and Quality Prediction in the Lending Club Marketplace Loan Approval and Quality Prediction in the Lending Club Marketplace Final Write-up Yondon Fu, Matt Marcus and Shuo Zheng Introduction Lending Club is a peer-to-peer lending marketplace where individual

More information

3: Balance Equations

3: Balance Equations 3.1 Balance Equations Accounts with Constant Interest Rates 15 3: Balance Equations Investments typically consist of giving up something today in the hope of greater benefits in the future, resulting in

More information

Anomalies under Jackknife Variance Estimation Incorporating Rao-Shao Adjustment in the Medical Expenditure Panel Survey - Insurance Component 1

Anomalies under Jackknife Variance Estimation Incorporating Rao-Shao Adjustment in the Medical Expenditure Panel Survey - Insurance Component 1 Anomalies under Jackknife Variance Estimation Incorporating Rao-Shao Adjustment in the Medical Expenditure Panel Survey - Insurance Component 1 Robert M. Baskin 1, Matthew S. Thompson 2 1 Agency for Healthcare

More information

Portfolio Recommendation System Stanford University CS 229 Project Report 2015

Portfolio Recommendation System Stanford University CS 229 Project Report 2015 Portfolio Recommendation System Stanford University CS 229 Project Report 205 Berk Eserol Introduction Machine learning is one of the most important bricks that converges machine to human and beyond. Considering

More information

Research Article Design and Explanation of the Credit Ratings of Customers Model Using Neural Networks

Research Article Design and Explanation of the Credit Ratings of Customers Model Using Neural Networks Research Journal of Applied Sciences, Engineering and Technology 7(4): 5179-5183, 014 DOI:10.1906/rjaset.7.915 ISSN: 040-7459; e-issn: 040-7467 014 Maxwell Scientific Publication Corp. Submitted: February

More information

Tutorial 3: Working with Formulas and Functions

Tutorial 3: Working with Formulas and Functions Tutorial 3: Working with Formulas and Functions Microsoft Excel 2010 Objectives Copy formulas Build formulas containing relative, absolute, and mixed references Review function syntax Insert a function

More information

BPIC 2017: Business process mining A Loan process application

BPIC 2017: Business process mining A Loan process application BPIC 2017: Business process mining A Loan process application Dongyeon Jeong, Jungeun Lim, Youngmok Bae Department of Industrial and Management Engineering, POSTECH(Pohang University of Science and Technology),

More information

1. A is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes,

1. A is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, 1. A is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. A) Decision tree B) Graphs

More information

Could Decision Trees Improve the Classification Accuracy and Interpretability of Loan Granting Decisions?

Could Decision Trees Improve the Classification Accuracy and Interpretability of Loan Granting Decisions? Could Decision Trees Improve the Classification Accuracy and Interpretability of Loan Granting Decisions? Jozef Zurada Department of Computer Information Systems College of Business University of Louisville

More information

Session 40 PD, How Would I Get Started With Predictive Modeling? Moderator: Douglas T. Norris, FSA, MAAA

Session 40 PD, How Would I Get Started With Predictive Modeling? Moderator: Douglas T. Norris, FSA, MAAA Session 40 PD, How Would I Get Started With Predictive Modeling? Moderator: Douglas T. Norris, FSA, MAAA Presenters: Timothy S. Paris, FSA, MAAA Sandra Tsui Shan To, FSA, MAAA Qinqing (Annie) Xue, FSA,

More information

Role of soft computing techniques in predicting stock market direction

Role of soft computing techniques in predicting stock market direction REVIEWS Role of soft computing techniques in predicting stock market direction Panchal Amitkumar Mansukhbhai 1, Dr. Jayeshkumar Madhubhai Patel 2 1. Ph.D Research Scholar, Gujarat Technological University,

More information

Machine Learning and the Insurance Industry Prof. John D. Kelleher

Machine Learning and the Insurance Industry Prof. John D. Kelleher Machine Learning and the Insurance Industry Prof. John D. Kelleher ADAPT Centre, Dublin Institute of Technology john.d.kelleher@dit.ie The ADAPT Centre is funded under the SFI Research Centres Programme

More information

SAS Data Mining & Neural Network as powerful and efficient tools for customer oriented pricing and target marketing in deregulated insurance markets

SAS Data Mining & Neural Network as powerful and efficient tools for customer oriented pricing and target marketing in deregulated insurance markets SAS Data Mining & Neural Network as powerful and efficient tools for customer oriented pricing and target marketing in deregulated insurance markets Stefan Lecher, Actuary Personal Lines, Zurich Switzerland

More information

Implementing the Expected Credit Loss model for receivables A case study for IFRS 9

Implementing the Expected Credit Loss model for receivables A case study for IFRS 9 Implementing the Expected Credit Loss model for receivables A case study for IFRS 9 Corporates Treasury Many companies are struggling with the implementation of the Expected Credit Loss model according

More information

Decision Analysis. Introduction. Job Counseling

Decision Analysis. Introduction. Job Counseling Decision Analysis Max, min, minimax, maximin, maximax, minimin All good cat names! 1 Introduction Models provide insight and understanding We make decisions Decision making is difficult because: future

More information

STOCK MARKET PREDICTION AND ANALYSIS USING MACHINE LEARNING

STOCK MARKET PREDICTION AND ANALYSIS USING MACHINE LEARNING STOCK MARKET PREDICTION AND ANALYSIS USING MACHINE LEARNING Sumedh Kapse 1, Rajan Kelaskar 2, Manojkumar Sahu 3, Rahul Kamble 4 1 Student, PVPPCOE, Computer engineering, PVPPCOE, Maharashtra, India 2 Student,

More information

Bank Licenses Revocation Modeling

Bank Licenses Revocation Modeling Bank Licenses Revocation Modeling Jaroslav Bologov, Konstantin Kotik, Alexander Andreev, and Alexey Kozionov Deloitte Analytics Institute, ZAO Deloitte & Touche CIS, Moscow, Russia {jbologov,kkotik,aandreev,akozionov}@deloitte.ru

More information

SELECTION BIAS REDUCTION IN CREDIT SCORING MODELS

SELECTION BIAS REDUCTION IN CREDIT SCORING MODELS SELECTION BIAS REDUCTION IN CREDIT SCORING MODELS Josef Ditrich Abstract Credit risk refers to the potential of the borrower to not be able to pay back to investors the amount of money that was loaned.

More information

MBEJ 1023 Dr. Mehdi Moeinaddini Dept. of Urban & Regional Planning Faculty of Built Environment

MBEJ 1023 Dr. Mehdi Moeinaddini Dept. of Urban & Regional Planning Faculty of Built Environment MBEJ 1023 Planning Analytical Methods Dr. Mehdi Moeinaddini Dept. of Urban & Regional Planning Faculty of Built Environment Contents What is statistics? Population and Sample Descriptive Statistics Inferential

More information

Understanding neural networks

Understanding neural networks Machine Learning Neural Networks Understanding neural networks An Artificial Neural Network (ANN) models the relationship between a set of input signals and an output signal using a model derived from

More information

Optimization of China EPC power project cost risk management in construction stage based on bayesian network diagram

Optimization of China EPC power project cost risk management in construction stage based on bayesian network diagram Acta Technica 62 (2017), No. 6A, 223 232 c 2017 Institute of Thermomechanics CAS, v.v.i. Optimization of China EPC power project cost risk management in construction stage based on bayesian network diagram

More information

Time Series Forecasting Of Nifty Stock Market Using Weka

Time Series Forecasting Of Nifty Stock Market Using Weka Time Series Forecasting Of Nifty Stock Market Using Weka Raj Kumar 1, Anil Balara 2 1 M.Tech, Global institute of Engineering and Technology,Gurgaon 2 Associate Professor, Global institute of Engineering

More information

Acritical aspect of any capital budgeting decision. Using Excel to Perform Monte Carlo Simulations TECHNOLOGY

Acritical aspect of any capital budgeting decision. Using Excel to Perform Monte Carlo Simulations TECHNOLOGY Using Excel to Perform Monte Carlo Simulations By Thomas E. McKee, CMA, CPA, and Linda J.B. McKee, CPA Acritical aspect of any capital budgeting decision is evaluating the risk surrounding key variables

More information

Before How can lines on a graph show the effect of interest rates on savings accounts?

Before How can lines on a graph show the effect of interest rates on savings accounts? Compound Interest LAUNCH (7 MIN) Before How can lines on a graph show the effect of interest rates on savings accounts? During How can you tell what the graph of simple interest looks like? After What

More information

Pattern Recognition by Neural Network Ensemble

Pattern Recognition by Neural Network Ensemble IT691 2009 1 Pattern Recognition by Neural Network Ensemble Joseph Cestra, Babu Johnson, Nikolaos Kartalis, Rasul Mehrab, Robb Zucker Pace University Abstract This is an investigation of artificial neural

More information

Essays on Some Combinatorial Optimization Problems with Interval Data

Essays on Some Combinatorial Optimization Problems with Interval Data Essays on Some Combinatorial Optimization Problems with Interval Data a thesis submitted to the department of industrial engineering and the institute of engineering and sciences of bilkent university

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 2, Mar Apr 2017

International Journal of Computer Science Trends and Technology (IJCST) Volume 5 Issue 2, Mar Apr 2017 RESEARCH ARTICLE Stock Selection using Principal Component Analysis with Differential Evolution Dr. Balamurugan.A [1], Arul Selvi. S [2], Syedhussian.A [3], Nithin.A [4] [3] & [4] Professor [1], Assistant

More information

The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2012, Mr. Ruey S. Tsay. Solutions to Final Exam

The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2012, Mr. Ruey S. Tsay. Solutions to Final Exam The University of Chicago, Booth School of Business Business 41202, Spring Quarter 2012, Mr. Ruey S. Tsay Solutions to Final Exam Problem A: (40 points) Answer briefly the following questions. 1. Consider

More information

Producing actionable insights from predictive models built upon condensed electronic medical records.

Producing actionable insights from predictive models built upon condensed electronic medical records. Producing actionable insights from predictive models built upon condensed electronic medical records. Sheamus K. Parkes, FSA, MAAA Shea.Parkes@milliman.com Predictive modeling often has two competing goals:

More information

An Improved Approach for Business & Market Intelligence using Artificial Neural Network

An Improved Approach for Business & Market Intelligence using Artificial Neural Network Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 5.258 IJCSMC,

More information

Agricultural and Applied Economics 637 Applied Econometrics II

Agricultural and Applied Economics 637 Applied Econometrics II Agricultural and Applied Economics 637 Applied Econometrics II Assignment I Using Search Algorithms to Determine Optimal Parameter Values in Nonlinear Regression Models (Due: February 3, 2015) (Note: Make

More information

Personal Financial Literacy

Personal Financial Literacy Personal Financial Literacy 7 Unit Overview Being financially literate means taking responsibility for learning how to manage your money. In this unit, you will learn about banking services that can help

More information

Model Maestro. Scorto TM. Specialized Tools for Credit Scoring Models Development. Credit Portfolio Analysis. Scoring Models Development

Model Maestro. Scorto TM. Specialized Tools for Credit Scoring Models Development. Credit Portfolio Analysis. Scoring Models Development Credit Portfolio Analysis Scoring Models Development Scorto TM Models Analysis and Maintenance Model Maestro Specialized Tools for Credit Scoring Models Development 2 Purpose and Tasks to Be Solved Scorto

More information