
ECS171: Machine Learning Lecture 15: Tree-based Algorithms Cho-Jui Hsieh UC Davis March 7, 2018

Outline: Decision Tree, Random Forest, Gradient Boosted Decision Tree (GBDT)

Decision Tree Each node checks one feature $x_i$: go left if $x_i <$ threshold, go right if $x_i \geq$ threshold.

A real example

Decision Tree Strength: It's a nonlinear classifier. Better interpretability. Can naturally handle categorical features. Computation: training is slow; prediction is fast, needing $h$ operations ($h$: depth of the tree, usually $\leq 15$).
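To make the "$h$ operations" point concrete, here is a minimal sketch of the prediction walk (illustrative code, not the lecture's; the Node class and predict helper are assumptions): each query follows one root-to-leaf path, doing one feature comparison per level.

```python
# Minimal sketch (illustrative, not the lecture's code): a decision-tree node
# and the O(h) prediction walk, where h is the depth of the tree.
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index i of the feature x_i checked at this node
        self.threshold = threshold  # go left if x[feature] < threshold, else right
        self.left = left
        self.right = right
        self.value = value          # predicted label if this node is a leaf

def predict(node, x):
    """Follow one root-to-leaf path: at most h feature comparisons."""
    while node.value is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.value

# Example: a depth-2 tree over a 2-dimensional input
tree = Node(feature=0, threshold=0.5,
            left=Node(value="class A"),
            right=Node(feature=1, threshold=2.0,
                       left=Node(value="class B"),
                       right=Node(value="class C")))
print(predict(tree, [0.7, 3.1]))  # -> class C
```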

Splitting the node Classification tree: split the node to reduce entropy (equivalently, maximize information gain). Let $S$ be the set of data points in a node and $c = 1, \dots, C$ the labels. Entropy: $H(S) = -\sum_{c=1}^{C} p(c) \log p(c)$, where $p(c)$ is the proportion of the data belonging to class $c$. Entropy $= 0$ if all samples are in the same class; entropy is largest when $p(1) = \dots = p(C)$.

Information Gain The averaged entropy of a split $S \to S_1, S_2$ is $\frac{|S_1|}{|S|} H(S_1) + \frac{|S_2|}{|S|} H(S_2)$. Information gain (measures how good the split is): $H(S) - \left( \frac{|S_1|}{|S|} H(S_1) + \frac{|S_2|}{|S|} H(S_2) \right)$.
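As a concrete illustration of these two formulas, here is a small Python sketch (the helper names are assumptions, not the lecture's) that computes the entropy of a node and the information gain of a split:

```python
# Sketch of the entropy and information-gain formulas above (illustrative names).
import numpy as np

def entropy(labels):
    """H(S) = -sum_c p(c) log p(c), with p(c) the class proportions in S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """H(S) - (|S1|/|S|) H(S1) - (|S2|/|S|) H(S2)."""
    n = len(parent)
    return entropy(parent) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

# Example: a pure split of a 50/50 node gains one full bit of information.
parent = np.array([0, 0, 1, 1])
print(information_gain(parent, parent[:2], parent[2:]))  # -> 1.0
```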

Splitting the node Given the current node, how do we find the best split? For all features and all thresholds, compute the information gain after the split and choose the best one (maximal information gain). For $n$ samples and $d$ features this needs $O(nd)$ time, as sketched below.
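A naive version of this exhaustive search follows (illustrative code, not the lecture's; for clarity it recomputes the entropy for every candidate threshold, whereas the $O(nd)$ cost per node quoted above comes from pre-sorting each feature and updating class counts incrementally):

```python
# Sketch of the exhaustive split search: for each feature and each candidate
# threshold, compute the information gain and keep the best split.
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    n, d = X.shape
    best = (None, None, -np.inf)           # (feature index, threshold, information gain)
    for j in range(d):                     # all features
        for t in np.unique(X[:, j]):       # all candidate thresholds
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = entropy(y) - (len(left) / n) * entropy(left) \
                              - (len(right) / n) * entropy(right)
            if gain > best[2]:
                best = (j, t, gain)
    return best

X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # feature 0 with threshold 3.0 separates the two classes
```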

Regression Tree Assign a real number to each leaf, usually the average of the $y$ values in that leaf (which minimizes the squared error).

Regression Tree Objective function: $\min_F \frac{1}{n} \sum_{i=1}^{n} (y_i - F(x_i))^2 + \text{(Regularization)}$. The quality of a partition $S = S_1 \cup S_2$ can be computed by the objective value $\sum_{i \in S_1} (y_i - y^{(1)})^2 + \sum_{i \in S_2} (y_i - y^{(2)})^2$, where $y^{(1)} = \frac{1}{|S_1|} \sum_{i \in S_1} y_i$ and $y^{(2)} = \frac{1}{|S_2|} \sum_{i \in S_2} y_i$. Find the best split: try all features and thresholds and pick the one with the minimal objective value.
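A small sketch of this split-quality computation (illustrative, not the lecture's code): each side is predicted by its mean, and the score is the summed squared error, so lower is better.

```python
# Sketch of the regression-tree split quality above.
import numpy as np

def split_quality(y_left, y_right):
    """sum_{i in S1} (y_i - y^(1))^2 + sum_{i in S2} (y_i - y^(2))^2, lower is better."""
    return np.sum((y_left - y_left.mean()) ** 2) + np.sum((y_right - y_right.mean()) ** 2)

y = np.array([1.0, 1.2, 5.0, 5.5])
print(split_quality(y[:2], y[2:]))     # separating low from high values: small error
print(split_quality(y[::2], y[1::2]))  # mixing them: much larger error
```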

Parameters Maximum depth (usually 10); minimum number of samples in each node (10, 50, 100). A single decision tree is not very powerful. Can we build multiple decision trees and ensemble them together?

Random Forest

Random Forest Random Forest (bootstrap ensemble of decision trees): Create $T$ trees. Learn each tree using a subsampled dataset $S_i$ and a subsampled feature set $D_i$. Prediction: average the results from all $T$ trees. Benefits: avoids over-fitting, improves stability and accuracy. Good software available: R: randomForest package; Python: sklearn.
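A quick usage sketch of the sklearn random forest mentioned on the slide (toy data; the parameter values are illustrative, not the lecture's):

```python
# Train and evaluate a random forest with sklearn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number of trees T; max_features controls the feature subsampling.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```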

Gradient Boosted Decision Tree

Boosted Decision Tree Minimize a loss $\ell(y, F(x))$ with $F(\cdot)$ being an ensemble of trees: $F^* = \arg\min_F \sum_{i=1}^{n} \ell(y_i, F(x_i))$ with $F(x) = \sum_{m=1}^{T} f_m(x)$ (each $f_m$ is a decision tree). Direct loss minimization: at each stage $m$, find the best function to minimize the loss: solve $f_m = \arg\min_{f_m} \sum_{i=1}^{N} \ell(y_i, F_{m-1}(x_i) + f_m(x_i))$, then update $F_m \leftarrow F_{m-1} + f_m$. Here $F_m(x) = \sum_{j=1}^{m} f_j(x)$ is the prediction of $x$ after $m$ iterations. Two problems: hard to implement for a general loss, and it tends to overfit the training data.

Gradient Boosted Decision Tree (GBDT) Approximate the current loss by a quadratic approximation: $\sum_{i=1}^{n} \ell_i(\hat{y}_i + f_m(x_i)) \approx \sum_{i=1}^{n} \left( \ell_i(\hat{y}_i) + g_i f_m(x_i) + \tfrac{1}{2} h_i f_m(x_i)^2 \right) = \sum_{i=1}^{n} \frac{h_i}{2} \left( f_m(x_i) - (-g_i/h_i) \right)^2 + \text{constant}$, where $g_i = \partial_{\hat{y}_i} \ell_i(\hat{y}_i)$ is the gradient and $h_i = \partial^2_{\hat{y}_i} \ell_i(\hat{y}_i)$ is the second-order derivative.

Gradient Boosted Decision Tree Find $f_m(x, \theta_m)$ by minimizing the loss function $\arg\min_{f_m} \sum_{i=1}^{N} \left[ f_m(x_i, \theta) - (-g_i/h_i) \right]^2 + R(f_m)$. This reduces training under any loss function to fitting a regression tree (we just need to compute $g_i$ for different losses). The original GBDT uses $h_i = \alpha$ (a fixed step size); XGBoost shows that computing the second-order derivative yields better performance. Algorithm: compute the current gradient for each $\hat{y}_i$; build a base learner (decision tree) to fit the gradient; update the current prediction $\hat{y}_i = F_m(x_i)$ for all $i$.
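Putting the pieces together, here is a minimal sketch of this boosting loop for the squared loss, where the gradient is $g_i = \hat{y}_i - y_i$ and each base learner is an sklearn regression tree (function names and parameters are illustrative, not the lecture's):

```python
# Minimal GBDT sketch for squared loss with a fixed step size alpha.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=50, alpha=0.1, max_depth=3):
    y_hat = np.zeros(len(y))                # current prediction F_m(x_i), starts at 0
    trees = []
    for m in range(n_trees):
        g = y_hat - y                       # gradient of 1/2 (y_hat - y)^2 w.r.t. y_hat
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, -g)                     # base learner fits the negative gradient
        y_hat += alpha * tree.predict(X)    # fixed step size alpha (original GBDT)
        trees.append(tree)
    return trees

def gbdt_predict(trees, X, alpha=0.1):
    return alpha * sum(t.predict(X) for t in trees)

# Toy example: learn y = x^2 on [0, 1].
X = np.linspace(0, 1, 200).reshape(-1, 1)
y = X.ravel() ** 2
trees = gbdt_fit(X, y)
print(np.abs(gbdt_predict(trees, X) - y).mean())  # small mean absolute error
```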

Gradient Boosted Decision Trees (GBDT) Key idea: Each base learner is a decision tree. Each regression tree approximates the functional gradient $\partial \ell / \partial F$.

Conclusions Next class: Matrix factorization, word embedding Questions?