Gradient Boosting Trees: theory and applications

1 Gradient Boosting Trees: theory and applications Dmitry Efimov November 05, 2016

2 Outline
Decision trees
Boosting
Boosting trees
Metaparameters and tuning strategies
How-to-use remarks

3 Regression tree
[Figure: regression tree diagram. Each internal node shows a split rule of the form X[j] <= threshold together with mse, samples and value; each leaf shows mse, samples and value. The root node holds 339 samples. Most numeric values were lost in extraction.]
Mean square error for node k: $\mathrm{MSE}_k = \frac{1}{m_k} \sum_{i \in R_k} \left(y^{(i)} - \mu_k\right)^2$
$m_k$ - number of samples in node k
$\mu_k$ - average of the targets $y^{(i)}$ in node k
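
As a small illustration of the node formula above, here is a sketch in plain Python. The helper names `node_mse` and `split_mse` are hypothetical, and weighting the two children by their sample counts is an assumption (the usual CART convention) that the slide does not state.

```python
def node_mse(targets):
    """Mean square error of one node: average squared deviation from the node mean."""
    m_k = len(targets)
    mu_k = sum(targets) / m_k
    return sum((y - mu_k) ** 2 for y in targets) / m_k


def split_mse(xs, ys, threshold):
    """Sample-weighted MSE of the two children produced by the rule x <= threshold.

    Assumes both children end up non-empty.
    """
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    total = len(ys)
    return (len(left) * node_mse(left) + len(right) * node_mse(right)) / total


xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.0, 1.0, 3.0, 3.0]
parent_error = node_mse(ys)           # error before splitting
child_error = split_mse(xs, ys, 2.5)  # error after splitting at x <= 2.5
print(parent_error, child_error)
```

A split is worth making when the weighted child error is lower than the parent error, which is the case for the toy data here.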

4 Classification tree
[Figure: classification tree diagram on two features, X[0] and X[1]. Each node shows its split rule, error, samples and prob; the root node holds 134 samples. Most numeric values were lost in extraction.]

5 Classification error (two classes example)
p - fraction of samples from one class in the node
Misclassification error: $\min(p, 1 - p)$
Gini index: $2p(1 - p)$
Cross-entropy: $-p \ln p - (1 - p) \ln(1 - p)$
[Figure: error plotted against p for the misclassification rate, the Gini index and the entropy.]
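
The three two-class measures can be compared numerically with a short sketch. The function names are hypothetical, and treating 0 ln 0 as 0 at the endpoints is the usual convention rather than something the slide spells out.

```python
import math


def misclassification_error(p):
    """min(p, 1 - p) for class fraction p."""
    return min(p, 1.0 - p)


def gini_index(p):
    """2 p (1 - p) for class fraction p."""
    return 2.0 * p * (1.0 - p)


def cross_entropy(p):
    """-p ln p - (1 - p) ln(1 - p), with 0 ln 0 taken as 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


# All three vanish for a pure node and peak at p = 0.5.
for p in (0.0, 0.1, 0.25, 0.5):
    print(p, misclassification_error(p), gini_index(p), cross_entropy(p))
```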

6 Boosting (backfitting algorithm)
Generalized additive model: $\hat{y} = f(x_1, \ldots, x_n) = \alpha + f_1(x_1) + f_2(x_2) + \ldots + f_n(x_n)$
Algorithm 1 Backfitting algorithm for GAM
1: set initial values $\alpha = \frac{1}{m} \sum_{i=1}^{m} y^{(i)}$, $f_j = 0$ for all $j = 1, \ldots, n$
2: repeat
3:   for j = 1 to n do
4:     evaluate working targets $z^{(i)} = y^{(i)} - \alpha - \sum_{k=1, k \neq j}^{n} f_k(x_k^{(i)})$
5:     train model with feature $x_j$ and target $z$ to estimate $f_j$
6: until convergence
7: return $\alpha$, $f_j$
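
A minimal sketch of the backfitting loop in plain Python follows. Two assumptions go beyond the slide: the per-feature model is a one-dimensional least-squares line (any univariate smoother could be substituted), and a fixed number of sweeps stands in for the convergence test. The names `fit_line`, `backfit` and `gam_predict` are hypothetical.

```python
import random


def fit_line(xs, zs):
    """Least-squares line z ~ a + b * x, returned as a callable (assumed building block)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    b = sxz / sxx if sxx > 0 else 0.0
    a = mean_z - b * mean_x
    return lambda x: a + b * x


def backfit(X, y, n_sweeps=20):
    """Backfitting for an additive model; n_sweeps replaces 'until convergence'."""
    m, n = len(X), len(X[0])
    alpha = sum(y) / m                      # step 1: alpha is the mean target
    fs = [lambda x: 0.0 for _ in range(n)]  # step 1: every f_j starts at zero
    for _ in range(n_sweeps):               # step 2
        for j in range(n):                  # step 3
            # step 4: working targets leave out the contribution of feature j
            z = [
                y[i] - alpha - sum(fs[k](X[i][k]) for k in range(n) if k != j)
                for i in range(m)
            ]
            # step 5: refit f_j on feature j alone
            fs[j] = fit_line([X[i][j] for i in range(m)], z)
    return alpha, fs


def gam_predict(alpha, fs, x):
    """Evaluate alpha + sum_j f_j(x_j) for one sample x."""
    return alpha + sum(f(xj) for f, xj in zip(fs, x))


random.seed(0)
X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
y = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in X]
alpha, fs = backfit(X, y)
max_err = max(abs(gam_predict(alpha, fs, x) - t) for x, t in zip(X, y))
print(max_err)
```

Because the toy target is exactly additive and linear, the fitted sum reproduces it almost perfectly after a few sweeps; with strongly correlated features more sweeps would be needed.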

7 Boosting (general idea)
Loss function for nonparametric model: $L(f) = \frac{1}{2m} \sum_{i=1}^{m} \left(y^{(i)} - f(x^{(i)})\right)^2$
From backfitting algorithm: $f_{new} = f_{old} + g$, where g is a building block algorithm
Gradient descent with respect to f: $f_{new} = f_{old} - \alpha \left.\frac{dL}{df}\right|_{f = f_{old}}$
General idea: we train the building block algorithm with the outputs $g = -\left.\frac{dL}{df}\right|_{f = f_{old}}$
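
For the squared loss on this slide the working outputs can be written out explicitly. The short derivation below treats each value $f(x^{(i)})$ as a separate variable, which is the usual reading of "gradient with respect to f" and is assumed here rather than stated on the slide.

```latex
\frac{\partial L}{\partial f(x^{(i)})}
  = -\frac{1}{m}\left(y^{(i)} - f(x^{(i)})\right),
\qquad
g(x^{(i)})
  = -\left.\frac{\partial L}{\partial f(x^{(i)})}\right|_{f = f_{old}}
  = \frac{1}{m}\left(y^{(i)} - f_{old}(x^{(i)})\right)
```

Up to the constant factor 1/m, the building block is therefore trained on the current residuals, which ties the gradient view back to the working targets of the backfitting algorithm.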

8 Boosting trees
Algorithm 2 Gradient Tree Boosting
1: Initialize $f_0(x) = \arg\min_{\mu} \sum_{i=1}^{m} L(y^{(i)}, \mu)$
2: for k = 1 to K do
3:   Compute working targets $r_k^{(i)} = -\left(\frac{dL}{df}\right)_{f = f_{k-1}(x^{(i)})}$
4:   Fit a regression tree to the targets $r_k^{(i)}$ with terminal nodes $R_{kj}$, $j = 1, \ldots, J_k$, and compute $\gamma_{kj} = \arg\min_{\gamma} \sum_{x^{(i)} \in R_{kj}} L(y^{(i)}, f_{k-1}(x^{(i)}) + \gamma)$
5:   Update $f_k(x) = f_{k-1}(x) + \sum_{j=1}^{J_k} \gamma_{kj} \mathbf{1}\{x \in R_{kj}\}$
6: return $f_K(x)$
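
The steps of Algorithm 2 can be sketched in plain Python under simplifying assumptions that are not in the slides: a single feature, per-sample squared loss $\frac{1}{2}(y - f)^2$ (so the working targets are residuals and each $\gamma$ is a leaf mean), depth-1 trees, and no learning rate. The names `fit_stump`, `gradient_tree_boosting` and `boost_predict` are hypothetical.

```python
import math


def fit_stump(xs, rs):
    """Best threshold for a depth-1 regression tree fitted to targets rs.

    Assumes xs contains at least two distinct values.
    """
    best_sse, best_t = None, None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, rs) if x <= t]
        right = [r for x, r in zip(xs, rs) if x > t]
        mean_l, mean_r = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - mean_l) ** 2 for r in left) + sum((r - mean_r) ** 2 for r in right)
        if best_sse is None or sse < best_sse:
            best_sse, best_t = sse, t
    return best_t


def gradient_tree_boosting(xs, ys, K=50):
    f0 = sum(ys) / len(ys)  # step 1: the mean minimizes the squared loss
    preds = [f0] * len(xs)
    trees = []
    for _ in range(K):      # step 2
        r = [y - p for y, p in zip(ys, preds)]  # step 3: negative gradient = residual
        t = fit_stump(xs, r)                    # step 4: regions x <= t and x > t
        left = [ri for x, ri in zip(xs, r) if x <= t]
        right = [ri for x, ri in zip(xs, r) if x > t]
        # step 4: for squared loss the optimal gamma is the mean residual in the region
        gamma_l, gamma_r = sum(left) / len(left), sum(right) / len(right)
        trees.append((t, gamma_l, gamma_r))
        # step 5: add the new tree to the running model
        preds = [p + (gamma_l if x <= t else gamma_r) for x, p in zip(xs, preds)]
    return f0, trees        # step 6


def boost_predict(model, x):
    f0, trees = model
    return f0 + sum(gl if x <= t else gr for t, gl, gr in trees)


xs = [i / 20.0 for i in range(40)]
ys = [math.sin(3.0 * x) for x in xs]
model = gradient_tree_boosting(xs, ys)
mean_y = sum(ys) / len(ys)
mse_before = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_after = sum((y - boost_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_before, mse_after)
```

For other losses only steps 1, 3 and 4 change: the initial constant, the gradient, and the per-leaf minimization.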

9 Metaparameters
General: booster, seed, subsample, colsample_bytree, colsample_bylevel, eval_metric
Optimization related: objective, eta, gamma, lambda, alpha, num_round, scale_pos_weight
Tree related: max_depth, min_child_weight

10 General metaparameters
booster: gbtree, gblinear, dart
seed: random seed
subsample: fraction of training examples sampled for each tree
colsample_bytree: fraction of features sampled for each tree
colsample_bylevel: fraction of features sampled for each tree level
eval_metric: rmse, mae, logloss, auc, map

11 Optimization and tree related metaparameters
Optimization:
objective: reg:linear, binary:logistic, multi:softprob, rank:pairwise
eta: learning rate
gamma: minimum loss reduction required to split a leaf further
lambda: L2 regularization on leaf weights
alpha: L1 regularization on leaf weights
scale_pos_weight: weight of the positive class relative to the negative class
num_round: number of boosting iterations
Tree:
max_depth: maximum depth of a tree
min_child_weight: minimum sum of instance weights (Hessian) required in a child node
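
To show how these metaparameters fit together, here is a sketch of a parameter dictionary in the form accepted by the xgboost Python package. Every numeric value is an illustrative assumption and not a recommendation from the slides; the training call is left as a comment because it needs xgboost installed and a data matrix that is not shown.

```python
# Illustrative values only; none of these numbers come from the slides.
params = {
    # general
    "booster": "gbtree",
    "seed": 0,
    "subsample": 0.8,          # fraction of rows per tree
    "colsample_bytree": 0.8,   # fraction of features per tree
    "colsample_bylevel": 1.0,  # fraction of features per tree level
    "eval_metric": "logloss",
    # optimization related
    "objective": "binary:logistic",
    "eta": 0.1,
    "gamma": 0.0,
    "lambda": 1.0,
    "alpha": 0.0,
    "scale_pos_weight": 1.0,
    # tree related
    "max_depth": 6,
    "min_child_weight": 1,
}
num_round = 200  # number of boosting iterations

# With xgboost installed and a DMatrix `dtrain` built from your data (assumed, not shown):
#   import xgboost as xgb
#   model = xgb.train(params, dtrain, num_boost_round=num_round)
print(sorted(params))
```

In the Python API the number of rounds is passed separately as `num_boost_round` rather than inside the dictionary.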

12 Tuning strategies
Grid search
Randomized search
Manual tuning
[Figure: two panels, one for grid search and one for randomized search, each with axes parameter 1 and parameter 2.]
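
The difference between the first two strategies can be sketched with the standard library alone. The scoring function `toy_score` is a made-up stand-in for a cross-validated model score, and the parameter ranges are assumptions chosen only for illustration.

```python
import itertools
import random


def toy_score(eta, max_depth):
    """Hypothetical validation score (higher is better), peaked at eta=0.1, max_depth=6."""
    return -100.0 * (eta - 0.1) ** 2 - 0.01 * (max_depth - 6) ** 2


def grid_search(score, grid):
    """Evaluate every combination of the listed values."""
    names = list(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        candidate = dict(zip(names, values))
        s = score(**candidate)
        if best is None or s > best[0]:
            best = (s, candidate)
    return best


def randomized_search(score, samplers, n_iter, rng):
    """Evaluate n_iter candidates drawn independently from the given samplers."""
    best = None
    for _ in range(n_iter):
        candidate = {name: draw(rng) for name, draw in samplers.items()}
        s = score(**candidate)
        if best is None or s > best[0]:
            best = (s, candidate)
    return best


grid = {"eta": [0.01, 0.1, 0.3], "max_depth": [3, 6, 9]}
samplers = {
    "eta": lambda rng: rng.uniform(0.01, 0.3),
    "max_depth": lambda rng: rng.randint(3, 9),
}
grid_best = grid_search(toy_score, grid)                                # 9 evaluations
rand_best = randomized_search(toy_score, samplers, 9, random.Random(0))  # 9 evaluations
print(grid_best, rand_best)
```

With the same budget, the grid tries only three distinct values per parameter, while the randomized search tries up to nine distinct values of each.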

13 When to apply xgboost? (just my observations)
features of different origins: categorical, numerical, ordinal
features are not strongly correlated
the number of features is comparatively small
the problem is not of some specific type (for example, not image recognition or time series)
the parametric approach cannot be used
General strategy
1. Use xgboost with basic parameters without tuning
2. Read literature about other approaches
3. Compare the results

14 Use cases
relational datasets (Genentech, RiskyBusiness, Deloitte). Ex.: github.com/diefimov/genentech_2016
datasets with features of different origins (Otto). Ex.: github.com/diefimov/otto_2015
time series, provided they are first converted to the traditional format (West Nile, Western Australia). Ex.: github.com/diefimov/west_nile_virus_2015

15 References
T. Chen and C. Guestrin. XGBoost: A Scalable Tree Boosting System. In 22nd SIGKDD Conference on Knowledge Discovery and Data Mining.
T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning. Springer.
MachineLearning

16 Thank you! Questions? Dmitry Efimov kaggle.com/efimov github.com/diefimov
