Machine Learning Group Homework 3
MSc Business Analytics
Team 9: Alexander Romanenko, Artemis Tomadaki, Justin Leiendecker, Zijun Wei, Reza Brianca Widodo

The Loans_processed.csv file is the dataset we obtained from the pre-processing step, in which the clean-up Python code was used.

1. Import the pre-processed data set in R. Shuffle the records and split them into a training set (20,000 records), a validation set (8,000 records) and a test set (all remaining records).

We first shuffle the data and then split it randomly into a training, a validation and a test set of 20,000, 8,000 and 10,697 records respectively (a 51.7%, 20.7% and 27.6% split).

2. Using a classification tree (look at the C50 library), try to predict with an accuracy greater than (# of repaid loans / (# of repaid loans + # of charged-off loans)) if a loan will be repaid. Do you manage to achieve this performance on the validation set? What about the training set?

First, we calculate our threshold, which equals %. This means that repaid loans make up % of the sample or, in other words, that charged-off loans make up % of the sample. Next, we build a classification tree on the training set using the C50 library, so that we can later predict, on the validation set, whether a loan is likely to be repaid or not.

Results of the classification tree based on the training set:

Evaluation on training data (20000 cases): decision tree with 2,779 errors (13.9%); confusion matrix over (a) class Charged Off and (b) class Fully Paid.

These results tell us that the tree splits the data at a single point: whether the loan_status variable has the value Charged Off or Fully Paid. The output also reports an error rate of 13.9% (the share of cases that were incorrectly classified), which corresponds to 2,779 of the 20,000 records used for training. In other words, our accuracy on the training set is 86.1%, which is higher than our specified threshold.
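The shuffle-and-split step from question 1 can be sketched as follows. The report does this in R; the version below is only an illustration in plain Python, and the function name, the seed and the use of row indices as a stand-in for the 38,697 rows of Loans_processed.csv are assumptions made for the example.

```python
import random


def shuffle_and_split(records, n_train=20_000, n_valid=8_000, seed=42):
    """Shuffle a copy of `records`, then cut it into train/validation/test."""
    if len(records) < n_train + n_valid:
        raise ValueError("not enough records for the requested split")
    shuffled = list(records)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # all remaining records
    return train, valid, test


# Hypothetical stand-in for the data: one index per row (20,000 + 8,000 + 10,697).
rows = list(range(38_697))
train, valid, test = shuffle_and_split(rows)
print(len(train), len(valid), len(test))
```

Shuffling before cutting matters because the file may be ordered (for example by issue date), in which case a plain head/tail split would give the three sets different class mixes.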

Next, we add weak learners to our model in such a way that newer learners pick up the slack of older learners, which should incrementally increase the accuracy of the model. With the C5.0() function, the number of boosting iterations is set through the trials parameter.

Results of boosting:

Evaluation on training data (20000 cases): decision tree with an error rate of 13.9%; confusion matrix over (a) class Charged Off and (b) class Fully Paid.

The results we obtain from boosting are exactly the same as before.

Finally, we predict the loan status of the records in our validation set with the model built on the training set, and measure the accuracy of that prediction by counting how many of the predicted labels equal the actual labels in the validation set (the loan_status column of the validation table). The predict() function is used for this purpose. The type="class" argument specifies that we want the actual class labels as output, rather than the probability of each class label.

Results of our prediction:

    Charged Off   Fully Paid
              0        8,000

Obtained confusion matrix:

                          Actual
    Predicted      Charged Off   Fully Paid   Total
    Charged Off              0            0       0
    Fully Paid           1,114        6,886   8,000
    Total                1,114        6,886   8,000

According to our prediction, every record in the validation set is predicted to be Fully Paid. In reality, only 6,886 of the 8,000 records in the validation set are Fully Paid. Our accuracy here is therefore 86.1%, with an error rate of 13.9%, which is again higher than our threshold. We can infer from this that the method, as it stands, is not a good way to identify default cases: it never flags a single one.
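The validation result described above can be checked with a few lines of arithmetic. The sketch below is plain Python rather than the R used in the report, treats Charged Off as the positive class (an assumption that matches the later questions), and uses only the two counts stated in the text: 8,000 validation records, of which 6,886 are Fully Paid. The function name is hypothetical.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity and precision for the positive (Charged Off) class."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Undefined ratios are reported as NaN instead of raising an error.
        "sensitivity": tp / (tp + fn) if (tp + fn) else math.nan,
        "precision": tp / (tp + fp) if (tp + fp) else math.nan,
    }


n_valid = 8_000
n_paid = 6_886
n_charged_off = n_valid - n_paid  # 1,114 loans that were actually charged off

# Every record is predicted Fully Paid, so no loan is ever flagged.
metrics = confusion_metrics(tp=0, fp=0, fn=n_charged_off, tn=n_paid)
print(metrics)
```

An accuracy of about 86% alongside a sensitivity of zero is the reason accuracy alone is a poor yardstick here: the model matches the majority-class baseline without detecting any default.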

3. The majority of loans in the data set are being repaid. By default, a classification tree algorithm uses majority votes in the leaf nodes and thus classifies loans in leaf nodes with more than 50% non-defaults as safe. This strategy optimizes the default metric: the number of correctly classified loans. From a business perspective, however, we are interested in identifying loans with a high probability of default, even if the associated data record falls in a leaf node with more than 50% of safe loan samples. R's C50 library contains a cost matrix parameter that allows you to change the optimized metric and thus put more weight on one type of error over the other (e.g., false positives or false negatives). Experiment with different cost matrices to achieve a sensitivity (also known as recall) of approximately 25%, 40% and 50% in your validation set. Also, report the percentage of the loans (n11 / (n11 + n21)) you would recommend to the bank for re-evaluation that were indeed charged off (also known as precision).

Until now we have tried to achieve the highest accuracy. That is not our goal here. From a business point of view, we are mainly interested in identifying the cases where the risk of default is high, because a default carries a high cost for us. The cost matrix feature of the C5.0 package lets us reflect this. Note that the columns of the cost matrix represent the actual class and the rows represent the predicted class.

a. A sensitivity of approximately 25% (24.23%) is obtained with the following cost matrix:

                          Actual
    Predicted      Charged Off   Fully Paid
    Charged Off              0            1
    Fully Paid              14            0

This matrix gives a precision of 30.75%.

[Confusion matrix for validation data: predicted against actual, with totals]

b. A sensitivity of approximately 40% (38.77%) is obtained with the following cost matrix:

                          Actual
    Predicted      Charged Off   Fully Paid
    Charged Off              0            1
    Fully Paid              30            0

This matrix gives a precision of 27.41%.

[Confusion matrix for validation data: predicted against actual, with totals]

c. A sensitivity of approximately 50% (51.52%) is obtained with the following cost matrix:

                          Actual
    Predicted      Charged Off   Fully Paid
    Charged Off              0            1
    Fully Paid              48            0

This matrix gives a precision of 23.8%.

[Confusion matrix for validation data: predicted against actual, with totals]
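The sensitivity and precision figures reported for matrices a, b and c are enough to back-calculate approximate confusion counts for the validation set. The sketch below does this in plain Python (the report itself uses R). It assumes that Charged Off is the positive class and that the validation set holds 8,000 - 6,886 = 1,114 charged-off loans, as stated earlier. The counts are inferred, not copied from the report's tables, and rounding in the reported percentages could shift them by a loan or two. The function name is hypothetical.

```python
def implied_counts(n_total, n_positive, sensitivity, precision):
    """Back out confusion counts from reported sensitivity and precision."""
    tp = round(sensitivity * n_positive)  # charged-off loans that were flagged
    flagged = round(tp / precision)       # all loans flagged as Charged Off
    fp = flagged - tp                     # fully paid loans flagged by mistake
    fn = n_positive - tp                  # charged-off loans that were missed
    tn = n_total - n_positive - fp        # fully paid loans left alone
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}


N_VALID = 8_000
N_CHARGED_OFF = N_VALID - 6_886  # 1,114

# (sensitivity, precision) as reported for each cost matrix.
reported = {"a": (0.2423, 0.3075), "b": (0.3877, 0.2741), "c": (0.5152, 0.2380)}

counts = {
    name: implied_counts(N_VALID, N_CHARGED_OFF, sens, prec)
    for name, (sens, prec) in reported.items()
}
for name, c in counts.items():
    print(name, c,
          f"false alarms: {c['fp'] / N_VALID:.1%}",
          f"missed defaults: {c['fn'] / N_VALID:.1%}")
```

Each step in cost buys extra detected defaults at the price of a larger number of fully paid loans flagged by mistake, which is why precision falls as sensitivity rises.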

4. Pick a cost parameter matrix that you assess as the most appropriate for identifying loan applications that deserve further examination.

Based on the matrices identified in question 3, we believe that matrix c is the most appropriate one for identifying loan applications that deserve further examination, for the following reasons.

Matrix c has the lowest percentage of charged-off loans predicted to be fully paid: 6.8% against 8.5% (for both matrices a and b). On the other hand, the proportion of fully paid loans classified as charged off is higher for matrix c: 23% against 7.6% (matrix a) and 14.3% (matrix b).

Matrix c therefore misclassifies more loans overall. However, we need to keep in mind that for a bank the cost of a lost customer (a fully paid loan predicted to be a default and refused as a result) is much lower than the cost of a defaulted customer. We should therefore give much higher weight to the charged-off loans that were incorrectly classified as good loans. This confirms our initial statement that matrix c is the best option here.

In addition, the results above show a negative relationship between sensitivity and precision: as sensitivity rises from 24.23% to 51.52%, precision falls from 30.75% to 23.8%.

[Plot of sensitivity against precision]
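The argument above depends on how much more a missed default costs than a lost customer. A minimal sketch of that trade-off, in plain Python rather than the report's R, takes the percentages quoted in the discussion at face value and introduces a hypothetical `cost_ratio` (cost of one missed default divided by the cost of one wrongly refused customer). The report does not estimate this ratio, so the values tried below are illustrative only.

```python
# Shares of the validation set quoted in the discussion above (in percent).
reported_shares = {
    "a": {"missed_default": 8.5, "false_alarm": 7.6},
    "b": {"missed_default": 8.5, "false_alarm": 14.3},
    "c": {"missed_default": 6.8, "false_alarm": 23.0},
}


def relative_cost(shares, cost_ratio):
    """Total cost in units of 'one lost customer' per 100 loans."""
    return shares["missed_default"] * cost_ratio + shares["false_alarm"]


def cheapest_matrix(cost_ratio):
    """Name of the cost matrix with the lowest relative cost."""
    return min(reported_shares,
               key=lambda name: relative_cost(reported_shares[name], cost_ratio))


def break_even_ratio(x, y):
    """Cost ratio above which matrix y beats x; y must miss fewer defaults."""
    fewer_misses = x["missed_default"] - y["missed_default"]
    if fewer_misses <= 0:
        raise ValueError("matrix y does not miss fewer defaults than x")
    extra_alarms = y["false_alarm"] - x["false_alarm"]
    return extra_alarms / fewer_misses


for ratio in (2, 5, 12):  # hypothetical cost ratios
    print(ratio, cheapest_matrix(ratio))

print(round(break_even_ratio(reported_shares["a"], reported_shares["c"]), 2))
```

On these figures matrix c only becomes the cheapest choice once a missed default is valued at roughly nine lost customers or more, so the recommendation rests on the bank's losses per default being of that order.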

5. Evaluate the performance of your cost parameter matrix on the test set.

The confusion matrix based on our test set gives a sensitivity of 49.22% and a precision of 22.69%.

[Confusion matrix for test data: predicted against actual, with totals]

To summarise, these results are somewhat disappointing, as the test set shows a misclassification rate of 31.5%. However, charged-off loans incorrectly classified as fully paid account for only 7.3%, which is relatively low in light of the discussion in question 4.
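The two percentages above also determine the size of the other error type. Assuming the 7.3% is a share of the whole test set, as the comparable figures in question 4 appear to be (the report does not say so explicitly), the misclassification rate splits as:

```latex
\underbrace{31.5\%}_{\text{misclassified}}
  \;=\; \underbrace{7.3\%}_{\text{charged off, predicted fully paid}}
  \;+\; \underbrace{24.2\%}_{\text{fully paid, predicted charged off}}
```

Under that assumption, roughly a quarter of the test loans are fully paid loans that would be sent for re-examination unnecessarily, which is consistent with the low precision reported.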
