ECS171: Machine Learning


1 ECS171: Machine Learning Lecture 15: Tree-based Algorithms Cho-Jui Hsieh UC Davis March 7, 2018

2 Outline Decision Tree Random Forest Gradient Boosted Decision Tree (GBDT)

3 Decision Tree
Each node checks one feature $x_i$: go left if $x_i <$ threshold, go right if $x_i \geq$ threshold.

4 A real example

6 Decision Tree
Strength: it's a nonlinear classifier; better interpretability; can naturally handle categorical features.
Computation: training is slow; prediction is fast, taking $h$ operations ($h$: depth of the tree, usually 15).
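The prediction cost can be made concrete with a minimal sketch of the root-to-leaf walk. The node layout (dicts with `feature`, `threshold`, `left`, `right`, and `label` keys) and the toy tree are assumptions made for illustration, not part of the lecture.

```python
# Hypothetical node layout: internal nodes carry "feature", "threshold",
# "left" and "right"; leaves carry only "label".
def predict(node, x):
    """Walk from the root to a leaf; return (label, number of comparisons)."""
    steps = 0
    while "label" not in node:
        # Go left if x_i < threshold, otherwise go right.
        if x[node["feature"]] < node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
        steps += 1
    return node["label"], steps


# A toy tree of depth 2 over two features.
tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"label": "A"},
    "right": {
        "feature": 1, "threshold": 1.0,
        "left": {"label": "B"},
        "right": {"label": "C"},
    },
}

print(predict(tree, [1.0, 5.0]))
print(predict(tree, [3.0, 0.5]))
```

The number of comparisons never exceeds the depth of the tree, which is why prediction is cheap compared with training.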

7 Splitting the node
Classification tree: split the node to reduce entropy (that is, to maximize information gain).
Let $S$ be the set of data points in a node and $c = 1, \dots, C$ the labels. Entropy:
$$H(S) = -\sum_{c=1}^{C} p(c) \log p(c),$$
where $p(c)$ is the proportion of the data belonging to class $c$.
Entropy = 0 if all samples are in the same class.
Entropy is large if $p(1) = \dots = p(C)$.
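A small sketch of the entropy formula, useful for checking the two limiting cases. The base-2 logarithm is an assumption (the slide does not fix the base; another base only rescales the value), and `entropy` is a name chosen for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """H(S) = -sum_c p(c) * log2 p(c), with p(c) the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())


pure = entropy(["a", "a", "a", "a"])      # all samples in one class
uniform = entropy(["a", "b", "c", "d"])   # p(1) = ... = p(C)
skewed = entropy(["a", "a", "a", "b"])    # somewhere in between
print(pure, uniform, skewed)
```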

8 Information Gain
The averaged entropy of a split $S \to S_1, S_2$:
$$\frac{|S_1|}{|S|} H(S_1) + \frac{|S_2|}{|S|} H(S_2)$$
Information gain measures how good the split is:
$$H(S) - \left( \frac{|S_1|}{|S|} H(S_1) + \frac{|S_2|}{|S|} H(S_2) \right)$$
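The definition can be checked on a toy node. The sketch below assumes base-2 entropy and uses made-up labels; `information_gain` is a name chosen for this example.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(parent, left, right):
    """H(S) minus the size-weighted average entropy of the two children."""
    n = len(parent)
    averaged = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - averaged


S = ["+", "+", "+", "-", "-", "-"]
# A split that separates the classes perfectly.
perfect = information_gain(S, ["+", "+", "+"], ["-", "-", "-"])
# A split whose children have the same class mix as the parent.
useless = information_gain(S, ["+", "-"], ["+", "+", "-", "-"])
print(perfect, useless)
```

A split that produces pure children recovers all of the parent's entropy, while a split that leaves the class mix unchanged gains nothing.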

9 Information Gain

10 Information Gain

13 Splitting the node
Given the current node, how do we find the best split?
For all the features and all the thresholds: compute the information gain after the split, then choose the best one (maximal information gain).
For $n$ samples and $d$ features: need $O(nd)$ time.
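The exhaustive search can be sketched as follows. Two assumptions are made here: candidate thresholds are midpoints between consecutive distinct feature values, and entropy uses base 2. This naive version recomputes the entropy from scratch for every candidate, so it is slower than the $O(nd)$ figure, which relies on sweeping pre-sorted feature values while updating class counts incrementally.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(k / n) * math.log2(k / n) for k in Counter(labels).values())


def best_split(X, y):
    """Return (gain, feature index, threshold) of the best split, or None."""
    n, d = len(X), len(X[0])
    parent = entropy(y)
    best = None
    for j in range(d):
        values = sorted({row[j] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in range(n) if X[i][j] < t]
            right = [y[i] for i in range(n) if X[i][j] >= t]
            gain = (parent
                    - (len(left) / n) * entropy(left)
                    - (len(right) / n) * entropy(right))
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 7.0], [2.0, 3.0], [3.0, 8.0], [4.0, 2.0]]
y = ["a", "a", "b", "b"]
gain, feature, threshold = best_split(X, y)
print(gain, feature, threshold)
```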

14 Regression Tree
Assign a real number to each leaf, usually the average of the $y$ values in that leaf (this minimizes the squared error).

16 Regression Tree
Objective function:
$$\min_F \frac{1}{n} \sum_{i=1}^{n} (y_i - F(x_i))^2 + (\text{Regularization})$$
The quality of a partition $S = S_1 \cup S_2$ can be computed by the objective function:
$$\sum_{i \in S_1} (y_i - y^{(1)})^2 + \sum_{i \in S_2} (y_i - y^{(2)})^2,$$
where $y^{(1)} = \frac{1}{|S_1|} \sum_{i \in S_1} y_i$ and $y^{(2)} = \frac{1}{|S_2|} \sum_{i \in S_2} y_i$.
Find the best split: try all the features and thresholds and find the one with the minimal objective function.
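A minimal sketch of the partition objective on a single feature. It assumes midpoint thresholds and ignores the regularization term; the data and the function names are made up for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_regression_split(x, y):
    """Best threshold on one feature; returns (objective, threshold)."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi < t]    # S_1
        right = [yi for xi, yi in zip(x, y) if xi >= t]  # S_2
        objective = sse(left) + sse(right)
        if best is None or objective < best[0]:
            best = (objective, t)
    return best


x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
objective, threshold = best_regression_split(x, y)
print(objective, threshold)
```

The chosen threshold falls in the gap between the two clusters of targets, where each side is best summarized by its own mean.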

18 Parameters
Maximum depth: (usually 10)
Minimum number of samples in each node: (10, 50, 100)
A single decision tree is not very powerful. Can we build multiple decision trees and ensemble them together?
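The sketch below shows one way the two parameters could act as stopping rules in a recursive regression-tree builder. The names `max_depth` and `min_samples`, the dict-based node layout, and the toy data are assumptions for illustration; real libraries differ in detail.

```python
def mean(v):
    return sum(v) / len(v)


def sse(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v)


def build(X, y, depth=0, max_depth=3, min_samples=2):
    """Grow a regression tree, stopping at max_depth or when no split
    leaves at least min_samples points on both sides."""
    leaf = {"value": mean(y), "n": len(y)}
    if depth >= max_depth:
        return leaf
    n, best = len(y), None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            L = [i for i in range(n) if X[i][j] < t]
            R = [i for i in range(n) if X[i][j] >= t]
            if len(L) < min_samples or len(R) < min_samples:
                continue  # this split would create a node that is too small
            score = sse([y[i] for i in L]) + sse([y[i] for i in R])
            if best is None or score < best[0]:
                best = (score, j, t, L, R)
    if best is None:
        return leaf
    _, j, t, L, R = best
    return {
        "feature": j, "threshold": t,
        "left": build([X[i] for i in L], [y[i] for i in L],
                      depth + 1, max_depth, min_samples),
        "right": build([X[i] for i in R], [y[i] for i in R],
                       depth + 1, max_depth, min_samples),
    }


def depth_of(node):
    if "value" in node:
        return 0
    return 1 + max(depth_of(node["left"]), depth_of(node["right"]))


def leaves(node):
    if "value" in node:
        return [node]
    return leaves(node["left"]) + leaves(node["right"])


X = [[float(i)] for i in range(8)]
y = [0.0, 0.0, 1.0, 1.0, 5.0, 5.0, 9.0, 9.0]
shallow = build(X, y, max_depth=1)
deep = build(X, y, max_depth=3, min_samples=2)
print(depth_of(shallow), depth_of(deep), [leaf["n"] for leaf in leaves(deep)])
```

Here the minimum node size stops the deeper tree before it reaches the depth limit, which shows how the two parameters interact.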

19 Random Forest

20 Random Forest
Random Forest (bootstrap ensemble for decision trees):
Create $T$ trees.
Learn each tree using a subsampled dataset $S_i$ and a subsampled feature set $D_i$.
Prediction: average the results from all the $T$ trees.
Benefit: avoids over-fitting; improves stability and accuracy.
Good software available: R: randomForest package; Python: sklearn.
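The three ingredients (row subsampling, feature subsampling, averaging) can be sketched independently of any tree learner. Everything here is an illustrative assumption: the function names, sampling rows with replacement at the full dataset size, and the stand-in "trees", which are plain callables rather than trained models.

```python
import random


def bootstrap_indices(n, rng):
    """Sample n row indices with replacement (the subsampled dataset S_i)."""
    return [rng.randrange(n) for _ in range(n)]


def feature_subset(d, k, rng):
    """Pick k of the d feature indices without replacement (the set D_i)."""
    return sorted(rng.sample(range(d), k))


def forest_predict(trees, x):
    """Average the predictions of all T trees."""
    return sum(tree(x) for tree in trees) / len(trees)


rng = random.Random(0)
rows = bootstrap_indices(10, rng)
cols = feature_subset(6, 2, rng)

# Stand-in "trees": any callable x -> prediction works for the averaging step.
trees = [lambda x: 1.0, lambda x: 2.0, lambda x: 6.0]
average = forest_predict(trees, x=None)
print(rows, cols, average)
```

In a full implementation each tree would be trained on the rows in `rows` restricted to the columns in `cols`, then added to `trees`.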

21 Random Forest

22 Gradient Boosted Decision Tree

25 Boosted Decision Tree
Minimize the loss $l(y, F(x))$ with $F(\cdot)$ being an ensemble of trees:
$$F^* = \operatorname*{argmin}_F \sum_{i=1}^{n} l(y_i, F(x_i)) \quad \text{with} \quad F(x) = \sum_{m=1}^{T} f_m(x)$$
(each $f_m$ is a decision tree).
Direct loss minimization: at each stage $m$, find the best function to minimize the loss.
Solve $f_m = \operatorname*{argmin}_{f_m} \sum_{i=1}^{n} l(y_i, F_{m-1}(x_i) + f_m(x_i))$.
Update $F_m \leftarrow F_{m-1} + f_m$.
$F_m(x) = \sum_{j=1}^{m} f_j(x)$ is the prediction of $x$ after $m$ iterations.
Two problems: hard to implement for a general loss; tends to overfit the training data.

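26 Gradient Boosted Decision Tree (GBDT)
Approximate the current loss function by a quadratic approximation:
$$\sum_{i=1}^{n} l_i(\hat{y}_i + f_m(x_i)) \approx \sum_{i=1}^{n} \left( l_i(\hat{y}_i) + g_i f_m(x_i) + \frac{1}{2} h_i f_m(x_i)^2 \right) = \sum_{i=1}^{n} \frac{h_i}{2} \left\| f_m(x_i) + \frac{g_i}{h_i} \right\|^2 + \text{constant},$$
where $g_i = \partial_{\hat{y}_i} l_i(\hat{y}_i)$ is the gradient and $h_i = \partial^2_{\hat{y}_i} l_i(\hat{y}_i)$ is the second-order derivative.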
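To make $g_i$ and $h_i$ concrete, the sketch below writes them out for two common losses and compares the quadratic approximation with the exact loss after a small step. The choice of losses (squared loss with a factor 1/2, and logistic loss with labels in {0, 1} and $\hat{y}$ a raw score) and the numbers used are assumptions for illustration.

```python
import math


def squared_loss_grad_hess(y, y_hat):
    """l = 0.5 * (y - y_hat)^2  ->  g = y_hat - y, h = 1."""
    return y_hat - y, 1.0


def logistic_loss(y, y_hat):
    """l = -y log p - (1 - y) log(1 - p) with p = sigmoid(y_hat)."""
    p = 1.0 / (1.0 + math.exp(-y_hat))
    return -y * math.log(p) - (1 - y) * math.log(1 - p)


def logistic_loss_grad_hess(y, y_hat):
    """For the logistic loss above: g = p - y, h = p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-y_hat))
    return p - y, p * (1.0 - p)


# Quadratic approximation versus exact loss for a small update f.
y, y_hat, f = 1, 0.3, 0.1
g, h = logistic_loss_grad_hess(y, y_hat)
approx = logistic_loss(y, y_hat) + g * f + 0.5 * h * f ** 2
exact = logistic_loss(y, y_hat + f)
print(approx, exact)
```

For the squared loss the approximation is exact, since the loss is already quadratic in $\hat{y}$.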

28 Gradient Boosted Decision Tree
Find $f_m(x, \theta_m)$ by minimizing the loss function:
$$\operatorname*{argmin}_{f_m} \sum_{i=1}^{n} \left[ f_m(x_i, \theta_m) + g_i/h_i \right]^2 + R(f_m)$$
This reduces training with any loss function to fitting a regression tree (just need to compute $g_i$ for different loss functions).
$h_i = \alpha$ (fixed step size) for the original GBDT. XGBoost shows that computing the second-order derivative yields better performance.
Algorithm:
Compute the current gradient for each $\hat{y}_i$.
Build a base learner (decision tree) to fit the negative gradient (the targets $-g_i/h_i$).
Update the current prediction $\hat{y}_i = F_m(x_i)$ for all $i$.
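The three algorithm steps can be sketched in a few lines. This is a toy version under stated assumptions: the loss is $\frac{1}{2}(y - \hat{y})^2$, each base learner is a depth-1 regression tree (a stump) on a single feature with at least two distinct values, the regularizer $R(f_m)$ is omitted, and $h_i$ is held at a constant `h`. All function names are made up for this example.

```python
def fit_stump(x, target):
    """Least-squares regression stump on one feature; returns a predictor."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for xi, r in zip(x, target) if xi < t]
        right = [r for xi, r in zip(x, target) if xi >= t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    _, t, ml, mr = best
    return lambda xi: ml if xi < t else mr


def gbdt_fit(x, y, n_trees=30, h=2.0):
    """Boosting loop for the loss 0.5 * (y - y_hat)^2 with constant h_i = h."""
    y_hat = [0.0] * len(y)
    trees = []
    for _ in range(n_trees):
        # 1. Current gradient of the loss with respect to each prediction.
        g = [yh - yi for yh, yi in zip(y_hat, y)]
        # 2. Fit a base learner to the targets -g_i / h_i.
        tree = fit_stump(x, [-gi / h for gi in g])
        trees.append(tree)
        # 3. Update the current predictions: F_m = F_{m-1} + f_m.
        y_hat = [yh + tree(xi) for yh, xi in zip(y_hat, x)]
    return trees, y_hat


def gbdt_predict(trees, xi):
    return sum(tree(xi) for tree in trees)


def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
trees, y_hat = gbdt_fit(x, y)
print(mse(y, [0.0] * len(y)), mse(y, y_hat))
```

Swapping in a different loss only changes how `g` (and, for second-order methods, `h`) is computed; the tree-fitting step stays the same.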

29 Gradient Boosted Decision Trees (GBDT)
Key idea: each base learner is a decision tree, and each regression tree approximates the functional gradient $\partial l / \partial F$.

34 Conclusions Next class: Matrix factorization, word embedding Questions?
