Top-down particle filtering for Bayesian decision trees

1 Top-down particle filtering for Bayesian decision trees
Balaji Lakshminarayanan (1), Daniel M. Roy (2) and Yee Whye Teh (3)
(1) Gatsby Unit, UCL; (2) University of Cambridge; (3) University of Oxford

2 Outline
- Introduction
- Sequential prior over decision trees
- Bayesian inference: Top-down particle filtering
- Experiments
- Design choices in the SMC algorithm
- SMC vs MCMC
- Conclusion

5 Introduction
Input: attributes $X = \{x_i\}_{i=1}^N$, labels $Y = \{y_i\}_{i=1}^N$ (i.i.d.)
$y_i \in \{1, \ldots, K\}$ (classification) or $y_i \in \mathbb{R}$ (regression)
Goal: model $p(y \mid x)$
Assume $p(y \mid x)$ is specified by a decision tree $T$
Bayesian decision trees:
Posterior: $p(T \mid Y, X) \propto \underbrace{p(Y \mid T, X)}_{\text{likelihood}} \, \underbrace{p(T \mid X)}_{\text{prior}}$
Prediction: $p(y_* \mid x_*) = \sum_T p(T \mid Y, X) \, p(y_* \mid x_*, T)$

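6 Example: Classification tree
[Figure: a decision tree next to the partition it induces on the $(x_1, x_2)$ plane. The root tests $x_1$ against a threshold and node 1 tests $x_2$ against a threshold; the leaves carry parameters $\theta_0$, $\theta_{10}$, $\theta_{11}$ and correspond to blocks $B_0$, $B_{10}$, $B_{11}$.]
$\theta$: Multinomial parameters at leaf nodes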

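7 Example: Regression tree
[Figure: a decision tree next to the partition it induces on the $(x_1, x_2)$ plane. The root tests $x_1$ against a threshold and node 1 tests $x_2$ against a threshold; the leaves carry parameters $\theta_0$, $\theta_{10}$, $\theta_{11}$ and correspond to blocks $B_0$, $B_{10}$, $B_{11}$.]
$\theta$: Gaussian parameters at leaf nodes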

8 Motivation
- Classic non-Bayesian induction algorithms (e.g. CART) learn a single tree in a top-down manner using greedy heuristics (post-pruning and/or bagging necessary)
- MCMC for Bayesian decision trees [Chipman et al., 1998]: local Monte Carlo modifications to the tree structure (less prone to overfitting but slow to mix)
- Our contribution: a Sequential Monte Carlo (SMC) algorithm that approximates the posterior in a top-down manner
- Take-home message: SMC provides a better computation vs predictive performance tradeoff than MCMC

9 Bayesian decision trees: likelihood
$p(T \mid Y, X) \propto \underbrace{p(Y \mid T, X)}_{\text{likelihood}} \, \underbrace{p(T \mid X)}_{\text{prior}}$

11 Likelihood
Assume $x_n$ falls in the $j$-th leaf node of $T$
Likelihood for the $n$-th data point: $p(y_n \mid x_n, T, \theta) = p(y_n \mid \theta_j, x_n)$
$p(Y \mid T, X, \Theta) = \prod_n p(y_n \mid x_n, T, \theta) = \prod_{j \in \text{leaves}(T)} \prod_{n \in N(j)} p(y_n \mid \theta_j)$
Better: integrate out $\theta_j$, use the marginal likelihood
$p(Y \mid T, X) = \prod_{j \in \text{leaves}(T)} \int_{\theta_j} \prod_{n \in N(j)} p(y_n \mid \theta_j) \, p(\theta_j) \, d\theta_j$
Classification: Dirichlet - Multinomial
Regression: Normal - Normal Inverse Gamma
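For the classification case the integral over the leaf parameters has a closed form. The sketch below is one way to compute it in Python, assuming a symmetric Dirichlet(alpha/K, ..., alpha/K) prior at every leaf; the symmetric parameterization, the function names and the dictionary layout are assumptions made for illustration and are not taken from the slides.

```python
import math
from collections import Counter


def leaf_log_marginal(labels, num_classes, alpha=5.0):
    """Log marginal likelihood of the labels that fall in one leaf.

    Assumes (hypothetically) a symmetric Dirichlet(alpha/K, ..., alpha/K)
    prior on the multinomial parameters of the leaf. Labels are integers
    in 0..K-1.
    """
    a = alpha / num_classes
    counts = Counter(labels)
    n = len(labels)
    # Gamma(alpha) / Gamma(alpha + n) * prod_k Gamma(c_k + a) / Gamma(a)
    out = math.lgamma(alpha) - math.lgamma(alpha + n)
    for k in range(num_classes):
        out += math.lgamma(counts.get(k, 0) + a) - math.lgamma(a)
    return out


def tree_log_marginal(labels_by_leaf, num_classes, alpha=5.0):
    """Sum of leaf terms: the marginal likelihood factorizes over leaves."""
    return sum(
        leaf_log_marginal(labels, num_classes, alpha)
        for labels in labels_by_leaf.values()
    )
```

For example, `tree_log_marginal({"0": [0, 0], "10": [1], "11": [1, 1, 0]}, num_classes=2)` scores a tree with three leaves. Working in log space avoids underflow when leaves hold many data points.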

13 Bayesian decision trees: prior
$p(T \mid Y, X) \propto \underbrace{p(Y \mid T, X)}_{\text{likelihood}} \, \underbrace{p(T \mid X)}_{\text{prior}}$

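14 Partial trees
0. Start with empty tree.
[Figure: a single root node $\epsilon$; the whole input space is one block $B_\epsilon$.]

15 Partial trees
1. Choose to split the root node $\epsilon$ on feature 1 at a threshold.
[Figure: root $\epsilon$ tests $x_1$ against the threshold; the input space is cut into blocks $B_0$ and $B_1$.]

16 Partial trees
2. Choose to not split node 0.
[Figure: same tree and partition as in step 1.]

17 Partial trees
3. Choose to split node 1 on feature 2 at a threshold.
[Figure: node 1 tests $x_2$ against the threshold; blocks $B_0$, $B_{10}$, $B_{11}$.]

18 Partial trees
4. Choose to not split node 10.
5. Choose to not split node 11.
[Figure: final tree with leaves 0, 10, 11 and blocks $B_0$, $B_{10}$, $B_{11}$.]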

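19 Sequence of random variables for a tree
[Figure: tree with root $\epsilon$ splitting on $x_1$ and node 1 splitting on $x_2$.]
1. $\rho_\epsilon = 1$, $\kappa_\epsilon = 1$, $\tau_\epsilon = \ldots$
2. $\rho_0 = 0$
3. $\rho_1 = 1$, $\kappa_1 = 2$, $\tau_1 = \ldots$
4. $\rho_{10} = 0$
5. $\rho_{11} = 0$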

21 Sequential prior over decision trees
Probability of split (assuming a valid split exists):
$p(j \text{ split}) = \alpha_s \left(1 + \text{depth}(j)\right)^{-\beta_s}$, with $\alpha_s \in (0, 1)$, $\beta_s \in [0, \infty)$
$\kappa_j$, $\tau_j$ sampled uniformly from the range of valid splits
Prior distribution:
$p(T, \kappa, \tau \mid X) = \prod_{j \in \text{leaves}(T)} p(j \text{ not split}) \prod_{j \in \text{nonleaves}(T)} p(j \text{ split}) \, p(\kappa_j, \tau_j)$
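The prior is a product of per-node terms, so its log can be accumulated node by node. The following is a minimal Python sketch under stated assumptions: node identifiers are bit strings with the empty string as root (so depth equals string length), every leaf is assumed to have a valid split available, and the caller supplies log p(kappa_j, tau_j) for each internal node. The function and argument names are hypothetical.

```python
import math


def p_split(depth, alpha_s=0.95, beta_s=0.5):
    """Probability that a node at the given depth is split."""
    return alpha_s * (1.0 + depth) ** (-beta_s)


def log_tree_prior(internal, leaves, log_p_choice, alpha_s=0.95, beta_s=0.5):
    """Log prior of a tree structure together with its split choices.

    internal, leaves: iterables of node ids such as '', '0', '1', '10'.
    log_p_choice: dict mapping each internal node id to log p(kappa_j, tau_j).
    Assumes a valid split exists at every leaf; a leaf with no valid
    split would contribute probability 1 instead.
    """
    total = 0.0
    for j in leaves:
        total += math.log1p(-p_split(len(j), alpha_s, beta_s))
    for j in internal:
        total += math.log(p_split(len(j), alpha_s, beta_s)) + log_p_choice[j]
    return total
```

For the five-step example tree (internal nodes `''` and `'1'`, leaves `'0'`, `'10'`, `'11'`), the call would be `log_tree_prior(['', '1'], ['0', '10', '11'], log_p_choice)`. Larger values of `beta_s` make deep splits less likely and so favor shallow trees.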

23 Bayesian decision trees: posterior
$p(T \mid Y, X) \propto \underbrace{p(Y \mid T, X)}_{\text{likelihood}} \, \underbrace{p(T \mid X)}_{\text{prior}}$

25 SMC algorithm for Bayesian decision trees
Importance sampler: draw $T^{(c)} \sim q(\cdot)$
$p(Y \mid X) = \sum_T p(Y, T \mid X) \approx \sum_{c=1}^{C} \underbrace{\frac{1}{C} \frac{p(T^{(c)})}{q(T^{(c)})} \, p(Y \mid X, T^{(c)})}_{w^{(c)}}$
Normalize: $\bar{w}^{(c)} = \frac{w^{(c)}}{\sum_{c'} w^{(c')}}$
Approximate posterior: $p(T \mid Y, X) \approx \sum_c \bar{w}^{(c)} \, \delta(T = T^{(c)})$
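Unnormalized weights are products of many small probabilities, so in practice they are usually kept in log space. The Python sketch below shows one way to normalize log weights with the log-sum-exp trick and to form the weighted prediction over sampled trees; the function names and the list-of-lists layout for per-tree predictions are assumptions for illustration.

```python
import math


def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def normalize_log_weights(log_w):
    """Turn unnormalized log weights into weights that sum to 1."""
    z = log_sum_exp(log_w)
    return [math.exp(lw - z) for lw in log_w]


def weighted_prediction(norm_w, per_tree_probs):
    """Average per-tree class probabilities under the normalized weights.

    per_tree_probs[c][k] is the probability tree c assigns to class k
    for one test point.
    """
    num_classes = len(per_tree_probs[0])
    return [
        sum(w * probs[k] for w, probs in zip(norm_w, per_tree_probs))
        for k in range(num_classes)
    ]
```

With the weights defined as on the slide (each including the factor 1/C), `log_sum_exp(log_w)` is also the log of the estimate of p(Y | X).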

27 SMC algorithm for Bayesian decision trees (contd.)
Sequential importance sampler (SIS):
$p(T_n) = p(T_0) \prod_{n'=1}^{n} p(T_{n'} \mid T_{n'-1})$
$q(T_n) = q_0(T_0) \prod_{n'=1}^{n} q_{n'}(T_{n'} \mid T_{n'-1})$
$p(Y \mid X, T_n) = p(Y \mid X, T_0) \, \frac{p(Y \mid X, T_1)}{p(Y \mid X, T_0)} \cdots \frac{p(Y \mid X, T_n)}{p(Y \mid X, T_{n-1})}$
$w_n = \frac{1}{C} \frac{p(T_n)}{q(T_n)} \, p(Y \mid X, T_n) = w_0 \prod_{n'=1}^{n} \frac{p(T_{n'} \mid T_{n'-1})}{q_{n'}(T_{n'} \mid T_{n'-1})} \underbrace{\frac{p(Y \mid X, T_{n'})}{p(Y \mid X, T_{n'-1})}}_{\text{local likelihood}}$
Sequential Monte Carlo (SMC): SIS + adaptive resampling steps
Every node is processed just once: no multi-path issues
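Two pieces of the SMC step can be sketched briefly in Python. The first is the local likelihood ratio: because the marginal likelihood factorizes over leaves, splitting node j changes it only through the terms for j and its two children. The second is an adaptive resampling step. The slides do not state the adaptivity criterion; the effective-sample-size rule and the `ess_fraction` threshold used here are a common choice and are assumptions, as are all function names.

```python
import random


def local_log_likelihood_ratio(log_ml_parent, log_ml_left, log_ml_right):
    """log[p(Y | X, T_n) / p(Y | X, T_{n-1})] when one node is split.

    Only the split node and its two children differ between the trees.
    Under the prior proposal this is the whole incremental log weight.
    """
    return log_ml_left + log_ml_right - log_ml_parent


def effective_sample_size(norm_w):
    """ESS of normalized weights: 1 / sum of squared weights."""
    return 1.0 / sum(w * w for w in norm_w)


def maybe_resample(particles, norm_w, ess_fraction=0.5, rng=random):
    """Multinomial resampling, triggered only when the ESS is low.

    ess_fraction is a hypothetical tuning parameter. Returns the
    (possibly resampled) particles and their normalized weights.
    """
    num = len(particles)
    if effective_sample_size(norm_w) >= ess_fraction * num:
        return particles, norm_w
    idx = rng.choices(range(num), weights=norm_w, k=num)
    return [particles[i] for i in idx], [1.0 / num] * num
```

After resampling, several entries may refer to the same particle object, so an implementation that mutates trees in place would need to copy them before growing them further.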

29 Experimental setup
Datasets:
- magic-04: N = 19K, D = 10, K = 2
- pendigits: N = 11K, D = 16, K = 10
70% - 30% train-test split
Numbers averaged across 10 different initializations

31 SMC design choices
Proposals:
- prior proposal: $q_n(\rho_j, \kappa_j, \tau_j) = p(\rho_j, \kappa_j, \tau_j)$
- optimal proposal:
  $q_n(\rho_j = \text{stop}) \propto p(j \text{ not split}) \, p(Y_{N(j)} \mid X_{N(j)})$
  $q_n(\rho_j = \text{split}, \kappa_j, \tau_j) \propto p(j \text{ split}) \, p(\kappa_j, \tau_j) \, \underbrace{p(Y_{N(j0)} \mid X_{N(j0)})}_{\text{left child}} \, \underbrace{p(Y_{N(j1)} \mid X_{N(j1)})}_{\text{right child}}$
Set of nodes considered for expansion at iteration n:
- node-wise: next node
- layer-wise: all nodes at depth n
Multinomial resampling
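The optimal proposal can be read as a normalized distribution over "stop" plus every candidate split of the node. A rough Python sketch follows, assuming the candidate splits have been enumerated into a finite list (for continuous thresholds this is a simplifying assumption) and that each candidate already carries its log prior term and the log marginal likelihoods of its two children. The function name and tuple layout are hypothetical.

```python
import math


def optimal_proposal(log_p_stop, log_ml_node, candidates):
    """Probabilities over [stop, split_1, split_2, ...] for one node.

    log_p_stop: log p(j not split).
    log_ml_node: log marginal likelihood of the data at node j.
    candidates: list of (log_prior, log_ml_left, log_ml_right), where
        log_prior = log p(j split) + log p(kappa_j, tau_j).
    Index 0 of the result is the probability of stopping.
    """
    scores = [log_p_stop + log_ml_node]
    scores += [lp + ll + lr for lp, ll, lr in candidates]
    # Normalize in log space to avoid underflow.
    m = max(scores)
    z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [math.exp(s - z) for s in scores]
```

Evaluating every candidate is what makes this proposal more expensive per node than the prior proposal, which is the cost side of the tradeoff examined in the experiments.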

32 Effect of SMC design choices
Figure: Results on magic-04 dataset. [Two panels: test log p(y | x) against mean time (s), and test log p(y | x) against number of particles. Curves: SMC optimal [node], SMC prior [node], SMC optimal [layer], SMC prior [layer].]

33 Effect of irrelevant features on SMC design choices
madelon: N = 2.6K, D = 500, K = 2 (96% of the features are irrelevant)
Figure: Results on madelon dataset. [Two panels: test log p(y | x) against mean time (s), and test log p(y | x) against number of particles. Curves: SMC optimal [node], SMC prior [node].]

35 Predictive performance vs computation: SMC vs MCMC
- Fix hyperparameters: $\alpha = 5$, $\alpha_s = 0.95$, $\beta_s = 0.5$
- MCMC [Chipman et al., 1998]: each iteration uses one of the 4 proposals: grow, prune, change, swap
- MCMC averages predictions over all previous trees
- Vary the number of particles in SMC and the number of MCMC iterations, and compare runtime vs performance

36 Predictive performance vs computation: SMC vs MCMC
Figure: Results on magic-04 dataset. [Two panels: test log p(y | x) against mean time (s), and test accuracy against mean time (s). Curves: SMC optimal [node], SMC prior [node], Chipman-MCMC, CART (gini), CART (entropy).]

37 Take-home message
SMC (prior, node-wise) is at least an order of magnitude faster than MCMC

40 Conclusion
- SMC for fast Bayesian inference for decision trees
  - mimic the top-down generative process of decision trees
  - use local likelihoods + resampling steps to guide tree growth
- For a fixed computational budget, SMC outperforms MCMC
Future directions
- Particle-MCMC for Bayesian Additive Regression Trees
- Mondrian process prior: projective and exchangeable prior for decision trees [Roy and Teh, 2009]

41 Thank you! Code available at

42 References
Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART model search. J. Am. Stat. Assoc.
Roy, D. M. and Teh, Y. W. (2009). The Mondrian process. In Adv. Neural Information Proc. Systems, volume 21.
