Multidimensional Monotonicity Discovery with mBART
1 Multidimensional Monotonicity Discovery with mBART. Rob McCulloch, Arizona State. Collaborations with: Hugh Chipman (Acadia), Edward George (Wharton, University of Pennsylvania), Tom Shively (UT Austin). October 23, 2018, Northern Arizona University. 1 / 46
2 Plan: I) Review BART. II) Introduce Monotone BART: mBART. III) Monotonicity Discovery with mBART.
3 Beginning with a Single Tree Model
4 [Figure] Three different views of a bivariate tree.
5 Bayesian CART: Just add a prior π(M, T). Bayesian CART Model Search (Chipman, George, McCulloch 1998). π(M, T) = π(M | T) π(T). π(M | T): (µ1, µ2, ..., µb) ~ N_b(0, τ² I). π(T): a stochastic process to generate the tree skeleton, plus uniform priors on splitting variables and splitting rules. The closed form for π(T | y) facilitates MCMC stochastic search for promising trees.
6 Note: although we are just talking about the classic decision tree setup, our approach is very different from the usual CART-type approach pioneered by Breiman (in statistics). In CART you just have an algorithm for fitting a tree to training data. We have a full generative model, and our prior plays a key role.
7 Moving on to BART. Bayesian Additive Regression Trees (Chipman, George, McCulloch 2010). The BART ensemble model: Y = g(x; T1, M1) + g(x; T2, M2) + ... + g(x; Tm, Mm) + σz, z ~ N(0, 1). Each (Ti, Mi) identifies a single tree. E(Y | x, T1, M1, ..., Tm, Mm) is the sum of m bottom-node µ's, one from each tree. The number of trees m can be much larger than the sample size n. g(x; T1, M1), g(x; T2, M2), ..., g(x; Tm, Mm) is a highly redundant, over-complete basis with many, many parameters.
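As a sketch of how a sum-of-trees model produces E(Y | x): each tree routes x to a bottom node and contributes that node's µ, and the fit is the sum of the m contributions. The tree encoding and the µ values below are made up for illustration; this is not BART software.

```python
# Illustrative sketch of a sum-of-trees mean function (toy encoding, not BART code).
import numpy as np

def eval_tree(tree, x):
    """Walk one tree: internal nodes are (var, cut, left, right); leaves are mu."""
    while isinstance(tree, tuple):
        var, cut, left, right = tree
        tree = left if x[var] <= cut else right
    return tree  # bottom-node mu

def sum_of_trees_mean(trees, x):
    """E(Y | x, T1,M1,...,Tm,Mm): one bottom-node mu per tree, summed."""
    return sum(eval_tree(t, x) for t in trees)

# Two toy trees on x = (x1, x2); the mu values are invented for illustration.
trees = [
    (0, 0.5, -0.1, 0.1),                  # split on x1 at 0.5
    (1, 0.3, -0.05, (0, 0.7, 0.0, 0.2)),  # split on x2, then on x1
]
print(round(sum_of_trees_mean(trees, np.array([0.8, 0.9])), 6))  # 0.1 + 0.2
```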
8 Complete the Model with a Regularization Prior. π((T1, M1), (T2, M2), ..., (Tm, Mm), σ). π applies the Bayesian CART prior to each (Tj, Mj) independently, so that: each Tj is small; each µ is small; σ is compatible with the observed variation of y. The observed variation of y is used to guide the choice of the hyperparameters for the µ and σ priors. π is a boosting/regularization prior in that it keeps the contribution of each g(x; Ti, Mi) small, explaining only a small portion of the fit.
9 Connections to Other Modeling Ideas. Y = g(x; T1, M1) + ... + g(x; Tm, Mm) + σz, plus π((T1, M1), ..., (Tm, Mm), σ). Bayesian Nonparametrics: lots of parameters (to make the model flexible) and a strong prior to shrink towards simple structure (regularization); BART shrinks towards additive models with some interaction. Dynamic Random Basis Elements: g(x; T1, M1), ..., g(x; Tm, Mm) are dimensionally adaptive. Boosting: the fit becomes the cumulative effort of many weak learners.
10 A Sketch of the BART MCMC Algorithm. Y = g(x; T1, M1) + ... + g(x; Tm, Mm) + σz, plus π((T1, M1), ..., (Tm, Mm), σ). The outer loop is a simple Gibbs sampler: draw (Ti, Mi) given all the other (Tj, Mj) and σ; draw σ given (T1, M1, ..., Tm, Mm). To draw (Ti, Mi) above, subtract the contributions of the other trees from both sides to get a simple one-tree model. We integrate out M to draw T and then draw M | T.
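The backfitting structure of that outer loop can be sketched as follows. Here `update_tree` is a crude placeholder (a shrunken two-bin mean fit to the partial residual), standing in for the actual Metropolis-Hastings tree draw, and the σ draw is omitted; only the "subtract the other trees, refit one tree" pattern is the point.

```python
# Schematic of the BART backfitting loop (a sketch, not the real sampler).
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 5
x = rng.uniform(size=n)
y = np.sin(2 * x) + 0.1 * rng.normal(size=n)

# Stand-in "trees": each tree's contribution is stored as fitted values.
fits = [np.zeros(n) for _ in range(m)]

def update_tree(partial_residual):
    # Placeholder for the M-H tree draw: a one-split (two-bin) mean fit,
    # shrunk so each tree's contribution stays small.
    out = np.empty_like(partial_residual)
    left = x <= 0.5
    out[left] = partial_residual[left].mean()
    out[~left] = partial_residual[~left].mean()
    return 0.9 * out

for sweep in range(20):
    for i in range(m):
        resid_i = y - (sum(fits) - fits[i])  # remove the other trees' fit
        fits[i] = update_tree(resid_i)       # refit tree i to the residual
# In real BART, a draw of sigma given all the trees would follow each sweep.
total_fit = sum(fits)
```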
11 For the draw of T we use a Metropolis-Hastings within Gibbs step. Our proposal moves around tree space by proposing local modifications such as the birth-death step: birth => propose a more complex tree; death => propose a simpler tree. As the MCMC runs, each tree in the sum will grow and shrink, swapping fit amongst them.
12 Build up the fit by adding up tiny bits of fit.
13 Using the MCMC Output to Draw Inference. Each iteration d results in a draw from the posterior of f: f̂_d(·) = g(·; T1d, M1d) + ... + g(·; Tmd, Mmd). To estimate f(x) we simply average the f̂_d(·) draws at x. Posterior uncertainty is captured by the variation of the f̂_d(x); e.g., a 95% HPD region is estimated by the middle 95% of values. We can do the same with functionals of f.
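These summaries are simple to compute once the draws are in hand. A minimal sketch, with stand-in draws of f̂_d(x) in place of real MCMC output:

```python
# Posterior summaries from MCMC draws (illustrative): average per-draw fits
# for a point estimate; take the middle 95% of values for an interval.
import numpy as np

rng = np.random.default_rng(1)
D = 1000
fhat_draws = 2.0 + 0.1 * rng.normal(size=D)  # stand-in draws of f(x) at one x

post_mean = fhat_draws.mean()                      # point estimate of f(x)
lo, hi = np.percentile(fhat_draws, [2.5, 97.5])    # middle 95% of the draws
```

The same two lines applied to draws of any functional of f (a difference, a conditional effect, an integral) give its posterior summary for free.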
14 Out-of-Sample Prediction. Predictive comparisons on 42 data sets. Data from Kim, Loh, Shih and Chaudhuri (2006) (thanks, Wei-Yin Loh!); p = 3 to 65, n = 100 to 7,000. For each data set: 20 random splits into 5/6 train and 1/6 test; 5-fold cross-validation on train to pick hyperparameters (except BART-default!). This gives 20 * 42 = 840 out-of-sample predictions. For each prediction, divide the RMSE of each method by the smallest, so a value of 1.2 means you are 20% worse than the best. Each boxplot represents 840 predictions for one method: Random Forests, Neural Net, Boosting, BART-cv, BART-default. BART-cv is best, and BART-default (using the default prior) does amazingly well!
15 Automatic Uncertainty Quantification. A simple simulated 1-dimensional example. [Figure: 95% pointwise posterior intervals for BART (left) and mBART (right), with the posterior mean and the true f.] Note: mBART, in the right plot, is to be introduced next.
16 Part II. Monotone BART: mBART. Multidimensional Monotone BART (Chipman, George, McCulloch, Shively 2018). Idea: approximate multivariate monotone functions by the sum of many single-tree models, each of which is monotonic.
17 An Example of a Monotonic Tree. [Figure] Three different views of a bivariate monotonic tree.
18 What makes this single tree monotonic? A function g is said to be monotonic in x_i if for any δ > 0, g(x1, x2, ..., xi + δ, x(i+1), ..., xk; T, M) ≥ g(x1, x2, ..., xi, x(i+1), ..., xk; T, M). For simplicity and wlog, let's restrict attention to monotone nondecreasing functions.
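The definition translates directly into a numerical check: bump coordinate i by a positive δ and require that g never decreases. A sketch, with toy functions g and h invented for illustration:

```python
# Numerical check of the monotonicity definition (a sketch).
import numpy as np

def is_monotone_in(g, i, points, deltas=(0.01, 0.1, 0.5)):
    """Return False if any positive bump of coordinate i decreases g."""
    for x in points:
        for d in deltas:
            x_up = x.copy()
            x_up[i] += d
            if g(x_up) < g(x):
                return False
    return True

g = lambda x: x[0] * x[1]        # monotone up in both coordinates on [0,1]^2
h = lambda x: (x[0] - 0.5) ** 2  # not monotone in x_0

pts = [np.array([a, b]) for a in (0.1, 0.4, 0.8) for b in (0.2, 0.6)]
```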
19 To implement this monotonicity in tree language we simply constrain the mean level of a node to be greater than those of its below neighbors and less than those of its above neighbors. [Figure: partition of (x1, x2)] Node 7 is disjoint from node 4. Node 10 is a below neighbor of node 13. Node 7 is an above neighbor of node 13. The mean level of node 13 must be greater than those of nodes 10 and 12, and less than that of node 7.
20 The mBART Prior. Recall the BART parameter θ = ((T1, M1), (T2, M2), ..., (Tm, Mm), σ). Let S = {θ : every tree is monotonic in a desired subset of the xi's}. To impose the monotonicity we simply truncate the BART prior π(θ) to the set S: π*(θ) ∝ π(θ) I_S(θ), where I_S(θ) is 1 if every tree in θ is monotonic.
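One way to see what a truncated prior means operationally: draws from π*(θ) ∝ π(θ) I_S(θ) can be produced by drawing from π and keeping only draws that land in S. The sketch below does this for a toy scalar prior and constraint; in mBART the constraint set S is "every tree is monotonic" and the sampler enforces it inside the MCMC rather than by naive rejection.

```python
# Truncating a prior to a constraint set S via rejection (toy illustration).
import numpy as np

rng = np.random.default_rng(2)

def sample_truncated(sample_prior, in_S, n_draws):
    """Draw theta ~ pi and keep it only when I_S(theta) = 1."""
    draws = []
    while len(draws) < n_draws:
        theta = sample_prior()
        if in_S(theta):          # the indicator I_S(theta)
            draws.append(theta)
    return np.array(draws)

# Toy stand-in: a N(0, 1) prior truncated to the set {theta >= 0}.
draws = sample_truncated(lambda: rng.normal(), lambda t: t >= 0.0, 500)
```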
21 A New BART MCMC: the Christmas Tree Algorithm. Target: π((T1, M1), (T2, M2), ..., (Tm, Mm), σ | y). Bayesian backfitting again: iteratively sample each (Tj, Mj) given (y, σ) and the other (Tj, Mj)'s. Each (T0, M0) -> (T1, M1) update is sampled as follows. Denote the move as (T0, M0_Common, M0_Old) -> (T1, M1_Common, M1_New). Propose T* via birth, death, etc. If M-H with π(T, M | y) accepts (T*, M0_Common): set (T1, M1_Common) = (T*, M0_Common), then sample M1_New from π(M_New | T1, M1_Common, y). Only M0_Old -> M1_New needs to be updated. This works for both BART and mBART.
22 M0_Common = (µ1, µ2, ...): the µ's unaffected by the tree move. Old BART algorithm: integrate out all the µ's and then play around with the tree. Christmas Tree: condition on all the µ's not affected by the proposed tree move.
23 Example: Product of Two x's. Let's consider a very simple simulated monotone example: Y = x1 x2 + ε, xi ~ Uniform(0, 1). [Figure] The plot of the true function f(x1, x2) = x1 x2.
24 First we try a single (just one tree), unconstrained tree model. [Figure: the graph of the fit.] The fit is not terrible, but there are some aspects of the fit which violate monotonicity.
25 Here is the graph of the fit with the monotone constraint: [Figure] We see that our fit is monotonic, and more representative of the true f.
26 Here is the unconstrained BART fit: [Figure] Much better (of course), but not monotone!
27 And, finally, the constrained BART fit: [Figure] Not bad! The same method works with any number of x's!
28 A 5-Dimensional Example. Y = x1 x2 + x3 x4 + x5 + ε, ε ~ N(0, σ²), xi ~ Uniform(0, 1). For various values of σ, we simulated 5,000 observations.
29 RMSE improvement over unconstrained BART. [Figure: boxplots comparing bart and mbart across four settings] σ = 0.2, 0.5, 0.7, ...
30 Part III. Discovering Monotonicity with mBART. Suppose we don't know whether f(x) is monotone up, monotone down, or even monotone at all. Of course, a simple strategy would be to simply compare the fits from BART and mBART. Good news: we can do much better than this! As we'll now see, mBART can be deployed to simultaneously estimate all the monotone components of f. With this strategy, monotonicity can be discovered rather than imposed!
31 The Monotone Decomposition of a Function. To begin simply, suppose x is one-dimensional and f is of bounded variation. Any such f can be uniquely written (up to an additive constant) as the sum of a monotone up function and a monotone down function, f(x) = f_up(x) + f_down(x), where: when f(x) is increasing, f_up(x) increases at the same rate and is flat otherwise; when f(x) is decreasing, f_down(x) decreases at the same rate and is flat otherwise.
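On a grid, this decomposition is just a matter of accumulating the positive and negative increments of f separately, so that f(x) = f(x0) + f_up(x) + f_down(x) (the f(x0) term is the additive constant the slide mentions). A minimal sketch:

```python
# Monotone decomposition of a function sampled on a grid (a sketch):
# positive increments go into f_up, negative increments into f_down.
import numpy as np

def monotone_decompose(fvals):
    d = np.diff(fvals)
    f_up = np.concatenate([[0.0], np.cumsum(np.maximum(d, 0.0))])    # nondecreasing
    f_down = np.concatenate([[0.0], np.cumsum(np.minimum(d, 0.0))])  # nonincreasing
    return f_up, f_down

x = np.linspace(0.0, 2.0 * np.pi, 200)
f = np.sin(x)
f_up, f_down = monotone_decompose(f)
# f_up rises on [0, pi/2] and is flat after each decrease of f; f_down mirrors it.
```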
32 The Discovery Strategy with mBART. Key Idea: to discover the monotone decomposition of f, we simply treat f(x) as a two-dimensional function in R²: f(x) = f(x, x) = f_up(x) + f_down(x). Letting x1 = x2 = x be duplicate copies of x, we apply mBART to estimate f(x1, x2), constrained to be monotone up in the x1 direction and monotone down in the x2 direction. Let's look at some illuminating one-dimensional examples.
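As a data transformation, the discovery trick amounts to duplicating the predictor columns and attaching opposite-sign constraints to the two copies. The helper and sign convention below are illustrative, not from the mBART software:

```python
# The duplicate-x design for monotonicity discovery (illustrative sketch).
import numpy as np

def duplicate_columns(X):
    """Return [X, X]: each predictor gets an 'up' copy and a 'down' copy."""
    X = np.atleast_2d(X)
    return np.hstack([X, X])

X = np.arange(6.0).reshape(3, 2)   # 3 observations, 2 predictors
Xd = duplicate_columns(X)          # 4 columns: x1, x2, x1, x2
# Hypothetical sign convention: +1 = constrain monotone up, -1 = monotone down.
constraints = [+1] * 2 + [-1] * 2  # up in the first copies, down in the second
```

A monotone-constrained fitter run on `Xd` with these constraints then estimates f_up through the first copies and f_down through the second, which is exactly the strategy on this slide.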
33 Example: Suppose Y = x³ + ε. [Figure: BART and mBART fits (left); mBARTd with f_up and f_down (right)] Note that f̂_down ≈ 0 (the red in the right plot), as we would expect when f is monotone up. Note: mBART looks nicer than BART, which is not restricted!
34 As the sample size is increased from 200 to 1,000, f̂_down gets even flatter. [Figure] This suggests consistent estimation of the monotone components!
35 Example: Suppose Y = x² + ε. [Figure: BART and mBART fits (left); mBARTd with f_up and f_down (right)] On the left, BART is good, but simple mBART is not. On the right, f̂_up and f̂_down are spot on. And mBARTd = f̂_up + f̂_down seems even better than BART!
36 Example: Suppose Y = sin(x) + ε. [Figure] BART is good, but simple mBART reveals nothing. f̂_up and f̂_down have discovered the monotone decomposition, and mBARTd = f̂_up + f̂_down is great too. To extend this approach to multidimensional x, we simply duplicate each and every component of x!
37 Discovering Monotonicity: Simple House Price Data. Let's look at a very simple example where we relate y = house price to three characteristics of the house: x = (nbhd, size, brick). In R, head(x), dim(x), summary(x), and summary(y) show data on 128 houses with, e.g., nbhd: Min 1.000, Median 2.000, Mean 1.961, Max 3.000; size: Min 1.450, Median 2.000, Mean 2.001, Max 2.590. y: thousands of dollars. x: three neighborhoods, thousands of square feet, brick or not.
38 Call: lm(formula = price ~ nbhd + size + brick, data = hdat). Coefficients (estimates elided in this transcription's source): nbhd2 (*), nbhd3 (p < 2e-16 ***), size (p ~ e-13 ***), brickYes (p ~ e-12 ***). Residual standard error: 12.5 on 123 degrees of freedom; F-statistic on 4 and 123 DF, p-value < 2.2e-16. If the linear model is correct, we are monotone up in all three variables. Remark: for the linear model we have to dummy up nbhd, but for BART and mBART we can simply leave it as an ordered numerical categorical variable.
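The remark about coding nbhd can be made concrete: a linear model needs the three neighborhoods dummied up into indicator columns, while a tree-based model can split directly on the ordered numeric codes. A small sketch (the helper is hypothetical, written in Python rather than the slide's R for consistency with the other examples here):

```python
# Dummy coding for a linear model vs. ordered numeric input for trees (sketch).
import numpy as np

nbhd = np.array([1, 2, 3, 2, 1])  # toy neighborhood codes

def one_hot(codes, levels=3):
    """Indicator columns for a categorical variable with the given levels."""
    out = np.zeros((len(codes), levels))
    out[np.arange(len(codes)), codes - 1] = 1.0
    return out

dummies = one_hot(nbhd)          # lm-style design (drop one column in practice)
tree_input = nbhd.astype(float)  # a tree can split on 'nbhd <= c' directly
```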
39 Just using x = size of the house, y = price appears to be marginally increasing in size (f̂_down ≈ 0). [Figure: BART, mBART, mBARTd, and linear fits (left); mBARTd's monotone up and monotone down fits (right)] mBART and mBARTd seem much better than BART.
40 Using x = (nbhd, size, brick), here are the relationships between the fitted values from the various models. [Figure: pairwise plots of y, bart, mbart, mbartd, fup, fdown] Note the high correlation between mBART, mBARTd, and f̂_up.
41 x-axis: mBARTd = f̂_up + f̂_down. y-axis: red: f̂_up; green: f̂_down. [Figure] mBARTd ≈ f̂_up, which suggests f is multivariate monotonic!
42 Let's now look at the effect of size conditionally on the six possible values of (nbhd, brick). [Figure: conditional effect of size under BART, mBART, and mBARTd] mBART and mBARTd look very similar! The conditionally monotone effect of size is becoming clearer!
43 And finally, the effect of size conditionally on the six possible values of (nbhd, brick) via f̂_up and f̂_down. [Figure] f̂_up and mBARTd look very similar! Price is clearly conditionally monotone up in all three variables! By simultaneously estimating f̂_up + f̂_down, we have discovered monotonicity without any imposed assumptions!
44 Concluding Remarks. mBARTd = f̂_up + f̂_down provides a new assumption-free approach for the discovery of the monotone components of f in multidimensional settings. Discovering such regions of monotonicity may be of scientific interest in real applications. We have used informal variable selection to identify the monotone components here; more formal variable selection can be used in higher-dimensional settings. As a doubly adaptive shape-constrained regularization approach: mBARTd will adapt to mBART when monotonicity is present; mBARTd will adapt to BART when monotonicity is absent; mBARTd will be at least as good as, and maybe better than, the best of mBART and BART in general.
45 Concluding Remarks. The fully Bayesian nature of BART greatly facilitates extensions such as mBART, mBARTd, and many others. Despite its many compelling successes in practice, theoretical frequentist support for BART is only now beginning to appear. For example, Rockova and van der Pas (2017), "Posterior Concentration for Bayesian Regression Trees and Their Ensembles," recently obtained the first theoretical results for Bayesian CART and BART, showing near-minimax posterior concentration when p > n for classes of Hölder continuous functions. The Monotone BART paper is available on arXiv. Software for mBART is available at
46 Thank You!
Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras Lecture - 05 Normal Distribution So far we have looked at discrete distributions
More informationEco504 Spring 2010 C. Sims FINAL EXAM. β t 1 2 φτ2 t subject to (1)
Eco54 Spring 21 C. Sims FINAL EXAM There are three questions that will be equally weighted in grading. Since you may find some questions take longer to answer than others, and partial credit will be given
More informationStat 401XV Exam 3 Spring 2017
Stat 40XV Exam Spring 07 I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed ATTENTION! Incorrect numerical answers unaccompanied by supporting reasoning
More informationSELECTION OF VARIABLES INFLUENCING IRAQI BANKS DEPOSITS BY USING NEW BAYESIAN LASSO QUANTILE REGRESSION
Vol. 6, No. 1, Summer 2017 2012 Published by JSES. SELECTION OF VARIABLES INFLUENCING IRAQI BANKS DEPOSITS BY USING NEW BAYESIAN Fadel Hamid Hadi ALHUSSEINI a Abstract The main focus of the paper is modelling
More informationσ e, which will be large when prediction errors are Linear regression model
Linear regression model we assume that two quantitative variables, x and y, are linearly related; that is, the population of (x, y) pairs are related by an ideal population regression line y = α + βx +
More informationA Markov Chain Monte Carlo Approach to Estimate the Risks of Extremely Large Insurance Claims
International Journal of Business and Economics, 007, Vol. 6, No. 3, 5-36 A Markov Chain Monte Carlo Approach to Estimate the Risks of Extremely Large Insurance Claims Wan-Kai Pang * Department of Applied
More informationStrategies for Improving the Efficiency of Monte-Carlo Methods
Strategies for Improving the Efficiency of Monte-Carlo Methods Paul J. Atzberger General comments or corrections should be sent to: paulatz@cims.nyu.edu Introduction The Monte-Carlo method is a useful
More informationExtracting Information from the Markets: A Bayesian Approach
Extracting Information from the Markets: A Bayesian Approach Daniel Waggoner The Federal Reserve Bank of Atlanta Florida State University, February 29, 2008 Disclaimer: The views expressed are the author
More informationEfficiency and Herd Behavior in a Signalling Market. Jeffrey Gao
Efficiency and Herd Behavior in a Signalling Market Jeffrey Gao ABSTRACT This paper extends a model of herd behavior developed by Bikhchandani and Sharma (000) to establish conditions for varying levels
More informationBROWNIAN MOTION Antonella Basso, Martina Nardon
BROWNIAN MOTION Antonella Basso, Martina Nardon basso@unive.it, mnardon@unive.it Department of Applied Mathematics University Ca Foscari Venice Brownian motion p. 1 Brownian motion Brownian motion plays
More informationBayesian Multinomial Model for Ordinal Data
Bayesian Multinomial Model for Ordinal Data Overview This example illustrates how to fit a Bayesian multinomial model by using the built-in mutinomial density function (MULTINOM) in the MCMC procedure
More informationIEOR E4703: Monte-Carlo Simulation
IEOR E4703: Monte-Carlo Simulation Simulating Stochastic Differential Equations Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationModelling strategies for bivariate circular data
Modelling strategies for bivariate circular data John T. Kent*, Kanti V. Mardia, & Charles C. Taylor Department of Statistics, University of Leeds 1 Introduction On the torus there are two common approaches
More informationPredicting Inflation without Predictive Regressions
Predicting Inflation without Predictive Regressions Liuren Wu Baruch College, City University of New York Joint work with Jian Hua 6th Annual Conference of the Society for Financial Econometrics June 12-14,
More informationExam M Fall 2005 PRELIMINARY ANSWER KEY
Exam M Fall 005 PRELIMINARY ANSWER KEY Question # Answer Question # Answer 1 C 1 E C B 3 C 3 E 4 D 4 E 5 C 5 C 6 B 6 E 7 A 7 E 8 D 8 D 9 B 9 A 10 A 30 D 11 A 31 A 1 A 3 A 13 D 33 B 14 C 34 C 15 A 35 A
More informationCOMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 18, 2006, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTIONS
COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 18, 2006, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTIONS Answer all parts. Closed book, calculators allowed. It is important to show all working,
More informationApproximations of Stochastic Programs. Scenario Tree Reduction and Construction
Approximations of Stochastic Programs. Scenario Tree Reduction and Construction W. Römisch Humboldt-University Berlin Institute of Mathematics 10099 Berlin, Germany www.mathematik.hu-berlin.de/~romisch
More informationDescribing Data: One Quantitative Variable
STAT 250 Dr. Kari Lock Morgan The Big Picture Describing Data: One Quantitative Variable Population Sampling SECTIONS 2.2, 2.3 One quantitative variable (2.2, 2.3) Statistical Inference Sample Descriptive
More informationLecture 17: More on Markov Decision Processes. Reinforcement learning
Lecture 17: More on Markov Decision Processes. Reinforcement learning Learning a model: maximum likelihood Learning a value function directly Monte Carlo Temporal-difference (TD) learning COMP-424, Lecture
More informationStochastic Processes and Advanced Mathematical Finance. Multiperiod Binomial Tree Models
Steven R. Dunbar Department of Mathematics 203 Avery Hall University of Nebraska-Lincoln Lincoln, NE 68588-0130 http://www.math.unl.edu Voice: 402-472-3731 Fax: 402-472-8466 Stochastic Processes and Advanced
More informationStatistical Inference and Methods
Department of Mathematics Imperial College London d.stephens@imperial.ac.uk http://stats.ma.ic.ac.uk/ das01/ 14th February 2006 Part VII Session 7: Volatility Modelling Session 7: Volatility Modelling
More informationRegret-based Selection
Regret-based Selection David Puelz (UT Austin) Carlos M. Carvalho (UT Austin) P. Richard Hahn (Chicago Booth) May 27, 2017 Two problems 1. Asset pricing: What are the fundamental dimensions (risk factors)
More informationChapter 2 Uncertainty Analysis and Sampling Techniques
Chapter 2 Uncertainty Analysis and Sampling Techniques The probabilistic or stochastic modeling (Fig. 2.) iterative loop in the stochastic optimization procedure (Fig..4 in Chap. ) involves:. Specifying
More informationChapter 7 Sampling Distributions and Point Estimation of Parameters
Chapter 7 Sampling Distributions and Point Estimation of Parameters Part 1: Sampling Distributions, the Central Limit Theorem, Point Estimation & Estimators Sections 7-1 to 7-2 1 / 25 Statistical Inferences
More informationDRAFT. 1 exercise in state (S, t), π(s, t) = 0 do not exercise in state (S, t) Review of the Risk Neutral Stock Dynamics
Chapter 12 American Put Option Recall that the American option has strike K and maturity T and gives the holder the right to exercise at any time in [0, T ]. The American option is not straightforward
More informationIdentifying Long-Run Risks: A Bayesian Mixed-Frequency Approach
Identifying : A Bayesian Mixed-Frequency Approach Frank Schorfheide University of Pennsylvania CEPR and NBER Dongho Song University of Pennsylvania Amir Yaron University of Pennsylvania NBER February 12,
More informationMachine Learning for Quantitative Finance
Machine Learning for Quantitative Finance Fast derivative pricing Sofie Reyners Joint work with Jan De Spiegeleer, Dilip Madan and Wim Schoutens Derivative pricing is time-consuming... Vanilla option pricing
More informationIntroduction to Computational Finance and Financial Econometrics Descriptive Statistics
You can t see this text! Introduction to Computational Finance and Financial Econometrics Descriptive Statistics Eric Zivot Summer 2015 Eric Zivot (Copyright 2015) Descriptive Statistics 1 / 28 Outline
More informationStat 101 Exam 1 - Embers Important Formulas and Concepts 1
1 Chapter 1 1.1 Definitions Stat 101 Exam 1 - Embers Important Formulas and Concepts 1 1. Data Any collection of numbers, characters, images, or other items that provide information about something. 2.
More informationPosterior Inference. , where should we start? Consider the following computational procedure: 1. draw samples. 2. convert. 3. compute properties
Posterior Inference Example. Consider a binomial model where we have a posterior distribution for the probability term, θ. Suppose we want to make inferences about the log-odds γ = log ( θ 1 θ), where
More informationProbability and Statistics
Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 3: PARAMETRIC FAMILIES OF UNIVARIATE DISTRIBUTIONS 1 Why do we need distributions?
More information