A new look at tree based approaches

Xifeng Wang
University of North Carolina Chapel Hill
xifeng@live.unc.edu

April 18, 2018

Xifeng Wang (UNC-Chapel Hill) April 18, 2018
Outline of this presentation

1. Background and motivation
2. Tree and related learning approaches: pros and cons
3. A new look at tree based approaches
4. Challenges and future directions
Part 1: Background and motivation
Many drugs failed at Phase 3

1. Around 50% of confirmatory Phase III trials for new drugs fail, posing great financial burdens on drug developers.
2. Among the failures, around 50% were efficacy failures: the trial failed to meet its primary or secondary efficacy endpoints.
3. Around 30% failed due to safety issues, and 20% for commercial reasons.
4. Oncology trials (48% failure rate) failed more often than non-oncology trials (29% failure rate).

Grignolo, A., & Pretorius, S. (2016). Phase III trial failures: costly, but preventable. Applied Clinical Trials, 25, 36-42.
Many drugs failed at Phase 3

Main triggers of failure:
- Inadequate basic science
- Flawed study design (Phase II surrogate endpoint not confirmed by the Phase III clinical outcome)
- Suboptimal dose selection
- Flawed data collection and analysis
- Problems with study operations
- Other...
Population heterogeneity

Another possible cause: the new drug is effective, but only for a certain sub-population (population heterogeneity).

For approved drugs, the existence of population heterogeneity could imply sub-optimal use of the drug.

This means failed drugs might actually be useful, and approved drugs may be under-valued.
Subgroup analysis: overview

Subgroup analysis for efficacy: partitioning the entire covariate space into subsets of patients that are homogeneous with respect to the treatment effect, which can then be used to evaluate the expected treatment effect (versus control) for patients with a specific set of covariates.

- Subgroups pre-specified: pre-defined subgroups, e.g., age < 50, WBC < 4.3 × 10^9/L, male/female, smoker/non-smoker, ...
- Subgroups learned from data: regression trees and their extensions
Subgroup analysis: tree-based methods

- Tree-based methods play an important role in subgroup analysis
- They belong to the field of statistical learning
- An important branch of predictive modelling approaches
- Different approaches have been developed: CART, Random Forest, ensemble methods, gradient boosting
Part 2: Tree and related learning approaches: pros and cons
CART: classification and regression trees

1. Grow an initial tree. At each node there is a rule for splitting the data based on the cutoff of one variable; pre-set stopping rules decide when a branch is terminal and can be split no more. Different tree-learning algorithms exist:
- ID3 (Iterative Dichotomiser 3)
- C4.5 (successor of ID3)
- CART (Classification And Regression Trees)
- CHAID (CHi-squared Automatic Interaction Detector): performs multi-level splits when computing classification trees
- MARS: extends decision trees to handle numerical data better
- ...

2. Prune, to prevent over-fitting, typically via cross-validation. Pruning choices can be ad hoc and hard to justify or standardize...
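The grow-then-prune workflow above can be sketched with scikit-learn's minimal cost-complexity pruning. This is only an illustration under my own assumptions (toy data, one noise-free split variable); it is not code from the talk.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
y = 2 * (X[:, 0] <= 0.5) + rng.normal(size=500)

# Step 1: grow a deep initial tree, then compute the cost-complexity
# pruning path; each ccp_alpha corresponds to a nested subtree.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Step 2: prune by picking the alpha whose subtree cross-validates best.
scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                    X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(pruned.get_n_leaves())
```

Note that the cross-validated choice of `ccp_alpha` is exactly the kind of tuning step the slide criticizes as hard to standardize: different folds or seeds can select different subtrees.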
IT: Interaction Trees

A data-driven tree procedure, labelled interaction trees (IT), to explore the heterogeneity structure of the treatment effect across a number of subgroups that are objectively defined in a post hoc manner.

Su, X., Tsai, C. L., Wang, H., Nickerson, D. M., & Li, B. (2009). Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10(Feb), 141-158.
SIDES: subgroup identification based on differential effect search

A novel recursive partitioning procedure, which allows direct evaluation of the treatment effect in subgroups and is particularly tuned to evaluating modest-sized data sets, as well as large databases, from randomized clinical trials or other health outcomes databases.

Lipkovich, I., Dmitrienko, A., Denne, J., & Enas, G. (2011). Subgroup identification based on differential effect search: a recursive partitioning method for establishing response to treatment in patient subpopulations. Statistics in Medicine, 30(21), 2601-2621.
Tree-based approaches: pros and cons

Pros:
- Heuristic ideas
- Simplicity of results; easy to explain
- Nonparametric and nonlinear

Cons:
- Exploratory by nature; difficult to do inference, e.g., Type 1 error control
- Stand-alone, with an unclear connection to classical models
Part 3: A new look at tree based approaches
A decision tree is a regression model

1. Growing the initial tree = rule/feature generation
2. Pruning = model (variable) selection
A decision tree is a regression model: a hypothetical example

True model:

y = β0 + β1 Trt + β2 I(x1 <= 0.5) + β3 I(x2 <= 0.5) + β4 I(x1 <= 0.5) Trt + β5 I(x2 <= 0.5) Trt + ε

with (β0, β1, β2, β3, β4, β5) = (2, 2, 2, 2, 2, 2). Here we assume the error term ε follows the standard normal distribution N(0, 1).

We generated n = 1000 observations from the above model, with 8 noise variables x3, ..., x10 added; x1-x10 are simulated from a discrete uniform distribution over (0.02, 0.04, ..., 1.00).
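A minimal numpy sketch of this simulation setup (the seed and variable names are my own choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 1000, 10

# x1..x10 from a discrete uniform grid over {0.02, 0.04, ..., 1.00}.
grid = np.arange(1, 51) / 50.0
X = rng.choice(grid, size=(n, p))
trt = rng.integers(0, 2, size=n)          # binary treatment indicator

z1 = (X[:, 0] <= 0.5).astype(float)       # I(x1 <= 0.5)
z2 = (X[:, 1] <= 0.5).astype(float)       # I(x2 <= 0.5)

# y = 2 + 2*Trt + 2*z1 + 2*z2 + 2*Trt*z1 + 2*Trt*z2 + eps, eps ~ N(0, 1);
# x3..x10 enter only as noise variables.
y = (2 + 2 * trt + 2 * z1 + 2 * z2
     + 2 * trt * z1 + 2 * trt * z2
     + rng.normal(size=n))
print(y.shape)
```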
Estimated Tree by CART

[Tree diagram: root split on trt < 0.5; each side then splits on x2 >= 0.505 and x1 >= 0.505 (one node uses x1 >= 0.5). Leaf means range from 1.999 (n = 128) up to 12.1 (n = 144).]
Estimated model: new approach

1. Generate features using the initial tree from CART (without pruning)
2. Model (variable) selection using LASSO + BIC

Estimated model:

| Term | True Coef | Est. Coef |
| Intercept | 2.0 | 2.12 |
| Trt | 2.0 | 2.06 |
| I(x1 <= 0.483) | 2.0 | 1.79 |
| I(x2 <= 0.505) | 2.0 | 1.98 |
| Trt × I(x1 <= 0.483) | 2.0 | 1.96 |
| Trt × I(x2 <= 0.505) | 2.0 | 1.91 |
| I(x1 <= 0.483) × I(x2 <= 0.505) | 0 | 0.01 |
| Trt × I(x1 <= 0.483) × I(x2 <= 0.505) | 0 | 0.24 |
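The two-step recipe above can be sketched as follows: harvest the split rules of an unpruned CART as indicator features, then let an information-criterion LASSO select among them. This is my own illustration assuming scikit-learn (`LassoLarsIC` with the BIC criterion stands in for the talk's "LASSO + BIC" step), not the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LassoLarsIC

rng = np.random.default_rng(1)
n = 1000
X = rng.choice(np.arange(1, 51) / 50.0, size=(n, 10))
trt = rng.integers(0, 2, size=n).astype(float)
z1, z2 = (X[:, 0] <= 0.5), (X[:, 1] <= 0.5)
y = 2 + 2*trt + 2*z1 + 2*z2 + 2*trt*z1 + 2*trt*z2 + rng.normal(size=n)

# Step 1: grow a CART (no pruning here) and collect its split rules.
full = np.column_stack([trt, X])
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(full, y)
t = tree.tree_
rules = [(t.feature[i], t.threshold[i])
         for i in range(t.node_count) if t.feature[i] >= 0]  # internal nodes

# Step 2: encode each rule I(x_j <= cutoff) and its Trt interaction as
# columns, then let the BIC-tuned LASSO pick the final model.
cols = [trt]
for j, cut in rules:
    ind = (full[:, j] <= cut).astype(float)
    cols += [ind, trt * ind]
design = np.column_stack(cols)
fit = LassoLarsIC(criterion="bic").fit(design, y)
print(np.count_nonzero(fit.coef_))
```

Rules generated from the treatment column itself produce degenerate columns here; the LASSO simply never selects them, but a careful implementation would filter them out.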
A decision tree is a regression model: proof of concept

We apply a model selection approach (e.g., LASSO) to the generated rules and compare the selected model with pruned trees in terms of:

1. Variable selection performance (sensitivity, specificity):
- HITS: frequency with which the final model splits on X1 and X2, and only on them
- Less-True: how many true terms the final model misses
- More-True: how many extra terms the final model includes

2. Model estimation performance:
- Mean squared error (MSE) of the coefficient estimates
- Prediction error of the response on a 300-observation test set
Simulation Results

| Metric | CART | New Approach |
| HITS | 100% | 100% |
| Prediction Error | 96.76 | 94.60 |
| MSE | NA | 0.437 |
| Less-True = 0 | NA | 99% |
| More-True = 0 | NA | 36% |
| More-True = 1 | NA | 57% |
| More-True >= 2 | NA | 7% |

Table: Simulation results for the regression tree based on 100 simulation runs
Special trees: IT

We generate data from two models. Each data set consists of a continuous response Y, a binary treatment, and four covariates X1-X4 simulated from a discrete uniform distribution over (0.02, 0.04, ..., 1.00). However, only a subset of the covariates interact with the treatment.

Model 1: Y = 2 + 2 Trt + 2 Z1 + 2 Z2 + ε
Model 2: Y = 2 + 2 Trt + 2 Z1 + 2 Z2 + 2 Trt Z1 + 2 Trt Z2 + ε

Here we assume the error term ε follows the standard normal distribution N(0, 1), and Z1 = I(X1 <= 0.5), Z2 = I(X2 <= 0.5).
Estimated model: IT

[The final IT structure: first split on x1 <= 0.5, then each branch splits on x2 <= 0.5; the four leaves have n = 152, 147, 148, 153.]
Estimated model: new approach

1. Generate features using a single boosted tree
2. Model (variable) selection using LASSO + BIC

Estimated model (600 observations):

| Term | True Coef | Est. Coef |
| Trt | 2.0 | 2.06 |
| Trt × I(x1 <= 0.483) | 2.0 | 1.96 |
| Trt × I(x2 <= 0.505) | 2.0 | 1.91 |
| Trt × I(x1 <= 0.483) × I(x2 <= 0.505) | 0 | 0.24 |
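One way to sketch the boosted-tree feature-generation step: harvest every (variable, cutoff) pair used anywhere in a small boosted ensemble as a candidate indicator rule. The scikit-learn estimator, hyperparameters, and data here are my own assumptions, not the talk's implementation; the harvested pairs would then feed the LASSO + BIC selection step.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 600
X = rng.choice(np.arange(1, 51) / 50.0, size=(n, 4))
trt = rng.integers(0, 2, size=n)
z1, z2 = (X[:, 0] <= 0.5), (X[:, 1] <= 0.5)
y = 2 + 2*trt + 2*z1 + 2*z2 + 2*trt*z1 + 2*trt*z2 + rng.normal(size=n)

# Fit a shallow boosted ensemble and collect every (variable, cutoff)
# pair used by any stage; these become candidate rules I(X_j <= c)
# and Trt-by-rule interactions for the subsequent LASSO + BIC step.
gbm = GradientBoostingRegressor(n_estimators=20, max_depth=2,
                                random_state=0).fit(X, y)
cutoffs = set()
for stage in gbm.estimators_[:, 0]:
    t = stage.tree_
    for i in range(t.node_count):
        if t.feature[i] >= 0:  # internal (split) node
            cutoffs.add((int(t.feature[i]),
                         round(float(t.threshold[i]), 3)))
print(len(cutoffs))
```

Because boosting revisits the strongest split variables across stages, the harvested cutoffs concentrate near the true change points (here 0.5 on X1 and X2) while still proposing a richer rule set than a single CART.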
Simulation Results

| Model | Hits (IT) | Hits | MSE | Less-True = 0 | More-True = 0 | More-True = 1 | More-True >= 2 |
| Model 1 | 98.5% | 95.5% | 0.20 | 100% | 19.5% | 45.5% | 35% |
| Model 2 | 98.5% | 98% | 0.29 | 100% | 68.5% | 27.5% | 4% |

Table: Simulation results for IT based on 200 simulation runs
Why bother?

Why look at tree-based approaches differently?
- Regression-model theory and methodology are mature, so it is easy to borrow strength
- Variable (model) selection has been well studied
- Inference (e.g., significance, Type 1 error control) becomes possible
- Many extensions are possible, e.g., hidden (latent) heterogeneity
Part 4: Challenges and future directions
Challenges and future directions

- Error-in-variable issue: the estimated rules/features are NOT the true rules/features. Does this cause an error-in-variable problem?
- Inference: post-learning (post-selection) inference in the presence of error-in-variable...
- Feature generation (feature learning): more flexible approaches