Treatment Allocations Based on Multi-Armed Bandit Strategies
1 Treatment Allocations Based on Multi-Armed Bandit Strategies Wei Qian and Yuhong Yang Applied Economics and Statistics, University of Delaware School of Statistics, University of Minnesota Innovative Statistics and Machine Learning for Precision Medicine September 15, 2017
2 Outline: 1. Bandit Problems 2. Methodology and Theory 3. Model Combining 4. Numerical Studies 5. Conclusion
3 Standard Multi-Armed Bandit Problem There is a wall of slot machines. [figure: machines with unknown winning percentages] Each machine has a certain winning probability of paying $1. Chances of winning are unknown to the game player. At each time, one and only one machine can be played, and the immediate result is observed. Goal: maximize the total number of wins over N plays.
5 Exploration-Exploitation Tradeoff Exploration: pull each arm enough times to learn the true reward probabilities. Exploitation: use the existing information and play the empirically best arm.
6 Motivation: Ethical Clinical Studies Slot machines: different treatments for a certain disease. Survival probability: unknown to the doctor. Goal: sequentially assign treatments to patients to maximize the survival rate.
7 A Real Example: ECMO Trial ECMO for treating newborns with persistent pulmonary hypertension? Ethical dilemma of using a conventional randomized controlled trial: current patients versus future patients; two hats on a participating doctor. A solution is response-adaptive design. L.J. Wei's randomized version of the play-the-winner rule was used in a study. The ECMO trial has generated a lot of discussion; see, e.g., two Statistical Science papers in 1989 and 1991.
8 Motivation: Online Services Web applications are generating massive data streams. Online recommendation systems: recommend articles to online newspaper readers; recommend products to customers of online retailers.
10 Motivation: Bandit Problem for Online Services Slot machines: multiple articles. Each internet visit: one and only one article delivered. Clicking probability: unknown to the internet company. Goal: sequentially choose an article for internet users to maximize the total number of clicks, i.e., the click-through rate (CTR).
12 Bandit Problem with Covariates The standard bandit problem assumes constant winning probabilities. In practice, the winning probability can depend on covariates. Personalized medical service: treatment effects (e.g., survival probability) can be associated with patients' prognostic factors.
13 Personalized Web Service Personalized online advertising and article recommendation: an internet user's interest in an ad or an article can be associated with some user information.
14 Multi-Armed Bandit with Covariates (MABC) for Precision Medicine An example scenario: a few FDA-approved drugs are available on the market for treating a certain disease. Currently, doctors perhaps choose among the available drugs based on limited information and readings of scattered publications, if any. Why not use the MABC framework for better medical practice?
15 Two-Armed Bandit Problem with Covariates Two treatments (news articles): A and B. Patient (user) covariate x ∈ [0, 1]. Recovering (clicking) probability: f_A(x), f_B(x). [figure: clicking probability curves f_A(x) and f_B(x) against x]
16 Problem Setup: Two-Armed Bandit with Covariates Given a bandit problem with two arms: treatments A and B. Unknown recovering probabilities given covariate x ∈ [0, 1]^d: f_A(x), f_B(x). Covariates X_n i.i.d. from a continuous distribution P_X. At each time n: (1) observe the patient covariate X_n ~ P_X; (2) based on previous observations and X_n, apply a sequential allocation algorithm to choose the treatment I_n ∈ {A, B}; (3) observe the result Y_{I_n,n} ~ Bernoulli(f_{I_n}(X_n)): recover, Y_{I_n,n} = 1; otherwise, Y_{I_n,n} = 0. Question: how to design the sequential allocation algorithm?
19 A Measure of Performance: Regret Given patient covariate x, the optimal strategy gives the treatment I*(x) := argmax_{i ∈ {A,B}} f_i(x), with optimal recovering probability f*(x) := max_{i ∈ {A,B}} f_i(x). Suppose at time n the patient covariate X_n is observed: the optimal choice is I*(X_n), while the algorithm chooses treatment I_n, incurring regret_n = f*(X_n) − f_{I_n}(X_n). To measure the overall performance, consider the cumulative regret R_N := Σ_{n=1}^N ( f*(X_n) − f_{I_n}(X_n) ). An algorithm is strongly consistent if R_N = o(N) almost surely.
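As a plain numerical sketch of cumulative regret (the reward curves f_A, f_B and the deliberately naive always-A rule are illustrative assumptions, not the talk's setting or algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical recovering probabilities (illustrative, not from the talk)
f_A = lambda x: 0.4 + 0.3 * x
f_B = lambda x: 0.7 - 0.3 * x

N = 1000
X = rng.uniform(0, 1, N)              # covariates X_n ~ P_X
f_star = np.maximum(f_A(X), f_B(X))   # f*(X_n): optimal recovering probability

# A naive rule that always assigns treatment A
f_chosen = f_A(X)

R = np.cumsum(f_star - f_chosen)      # cumulative regret R_1, ..., R_N
print(R[-1] / N)                      # per-round regret does not vanish here
```

Because always-A ignores the covariate, R_N grows linearly in N; a strongly consistent rule would drive R_N/N to zero.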
22 Model Assumptions on f_A and f_B Parametric framework (linear models): Woodroofe, 1979; Auer, 2002; Li et al., 2010; Goldenshluger and Zeevi, 2009, 2013; Bastani and Bayati, 2016. Nonparametric framework: Yang and Zhu, 2002; Rigollet and Zeevi, 2010; Perchet and Rigollet, 2013.
23 Algorithms Two articles A and B with clicking probabilities f_A(x) and f_B(x). (1) Deliver each article an equal number of times (e.g., each is delivered n_0 = 20 times): I_1 = A, I_2 = B, ..., I_{2n_0−1} = A, I_{2n_0} = B. (2) For the next internet visit (n = 2n_0 + 1), observe the internet user covariate X_n. (3) Estimate f_A and f_B using previous data to obtain f̂_{A,n} and f̂_{B,n}. (4) Find the more promising option î_n = argmax_{i ∈ {A,B}} f̂_{i,n}(X_n); deliver an article with the randomization scheme I_n = î_n with probability 1 − π_n, and I_n = i with probability π_n for i ≠ î_n. Observe the result Y_{I_n,n}.
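The four steps can be sketched end to end. Everything here beyond the structure of the algorithm (the reward curves, the bandwidth rate n^{-1/3}, the exploration schedule 1/log²(n), and the 0.5 default for an empty window) is an illustrative assumption, not the talk's exact tuning:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical clicking probabilities for articles A and B
f = {"A": lambda x: 0.4 + 0.3 * x, "B": lambda x: 0.7 - 0.3 * x}

def nw(x, xs, ys, h):
    """Nadaraya-Watson estimate with a uniform kernel; 0.5 when no data is in the window."""
    xs, ys = np.asarray(xs), np.asarray(ys)
    w = np.abs(xs - x) <= h
    return ys[w].mean() if w.any() else 0.5

data = {"A": ([], []), "B": ([], [])}   # per-arm (covariates, responses)
N, n0, clicks = 2000, 20, 0
for n in range(1, N + 1):
    x = rng.uniform()
    if n <= 2 * n0:                                  # step 1: forced alternation
        arm = "A" if n % 2 == 1 else "B"
    else:
        h = n ** (-1 / 3)                            # shrinking bandwidth (assumed rate)
        pi = min(0.5, 1 / np.log(n) ** 2)            # shrinking exploration probability
        est = {a: nw(x, *data[a], h) for a in ("A", "B")}   # step 3: estimate both arms
        best = max(est, key=est.get)                 # step 4: more promising arm
        arm = best if rng.uniform() > pi else ("B" if best == "A" else "A")
    y = rng.binomial(1, f[arm](x))                   # observe Bernoulli response
    data[arm][0].append(x)
    data[arm][1].append(y)
    clicks += y
print(clicks / N)   # empirical CTR; should beat the 0.55 uniform-random baseline
```

In the talk's illustration π_n starts around 20% and decreases; π_n = 1/log²(n) is just one convenient choice of schedule.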
24 Kernel Estimation Given article A, at each time point n, define J_{A,n} = {j : I_j = A, 1 ≤ j ≤ n − 1}. Nadaraya-Watson estimator of f_A(x): f̂_{A,n}(x) = Σ_{j ∈ J_{A,n}} Y_{A,j} K((x − X_j)/h_n) / Σ_{j ∈ J_{A,n}} K((x − X_j)/h_n), with kernel function K(u) : R^d → R and bandwidth h_n. Epanechnikov quadratic kernel: K(u) = (3/4)(1 − u²) I(|u| ≤ 1).
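A minimal sketch of the Nadaraya-Watson estimator with the Epanechnikov kernel for d = 1 (the toy target f(x) = x and sample size are assumptions for the sanity check):

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov quadratic kernel K(u) = (3/4)(1 - u^2) I(|u| <= 1)."""
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def nadaraya_watson(x, xs, ys, h):
    """NW estimate of f(x) from one arm's past pairs (X_j, Y_j), bandwidth h."""
    k = epanechnikov((x - xs) / h)
    s = k.sum()
    return float(np.dot(ys, k) / s) if s > 0 else float("nan")

# Toy check: Bernoulli responses with true f(x) = x
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, 5000)
ys = rng.binomial(1, xs)
print(nadaraya_watson(0.5, xs, ys, h=0.1))  # should be near 0.5
```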
25 A UCB-Type Kernel Estimator Upper Confidence Bound (UCB) kernel estimator: f̂_{A,n}(x) = Σ_{j ∈ J_{A,n}} Y_{A,j} K((x − X_j)/h_n) / Σ_{j ∈ J_{A,n}} K((x − X_j)/h_n) + U_{A,n}(x), where U_{A,n}(x) is a standard error quantity: U_{A,n}(x) = c √( (log N) Σ_{j ∈ J_{A,n}} K²((x − X_j)/h_n) ) / Σ_{j ∈ J_{A,n}} K((x − X_j)/h_n). Under the uniform kernel K(u) = I(|u| ≤ 1), with N_{A,n}(x) = Σ_{j ∈ J_{A,n}} I(|X_j − x| ≤ h_n), this reduces to U_{A,n}(x) = c √( log N / N_{A,n}(x) ).
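Under the uniform kernel the UCB-type estimator reduces to a local average plus c·√(log N / N_{A,n}(x)). A sketch of that special case (the constant c = 2 and the toy data are assumptions):

```python
import numpy as np

def ucb_kernel(x, xs, ys, h, horizon, c=2.0):
    """Uniform-kernel UCB estimate: local mean + c * sqrt(log(horizon) / N_x).

    N_x counts this arm's past covariates within [x - h, x + h].
    Returns +inf when the window is empty (maximally optimistic).
    """
    in_win = np.abs(xs - x) <= h
    n_x = int(in_win.sum())
    if n_x == 0:
        return float("inf")
    return ys[in_win].mean() + c * np.sqrt(np.log(horizon) / n_x)

rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, 200)
ys = rng.binomial(1, 0.3, 200)  # constant true clicking probability 0.3
u = ucb_kernel(0.5, xs, ys, h=0.1, horizon=200)
print(u)  # local mean plus an exploration bonus
```

The bonus shrinks as N_{A,n}(x) grows, so well-explored regions of the covariate space stop being inflated.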
26 Algorithm Illustration Deliver each article 20 times. X_1 = 0.93, article A (time n = 1, n_A = 1, n_B = 0). X_2 = 0.88, article B (time n = 2, n_A = 1, n_B = 1). [figures: clicking probability versus x]
30 Algorithm Illustration Deliver each article 20 times. Time n = 40, n_A = 20, n_B = 20. [figure: clicking probability versus x]
31 Algorithm Illustration Observe X_41; estimate f_A(X_41) and f_B(X_41) by kernel estimation. Time n = 40, n_A = 20, n_B = 20. [figure: clicking probability versus x]
32 Algorithm Illustration Estimate f_A(X_41): consider a window [X_41 − h, X_41 + h]; similar information may give a similar clicking probability. Here f̂_A(X_41) = 0. Estimate f̂_B(X_41) from its own window in the same way. Time n = 40, n_A = 20, n_B = 20. [figures: clicking probability versus x]
36 Algorithm Illustration Article B looks more promising: f̂_A(X_41) < f̂_B(X_41). With π_n = 20%: P(I_41 = B | H_41) = 80%, P(I_41 = A | H_41) = 20%. Time n = 40, n_A = 20, n_B = 20. [figure: clicking probability versus x]
37 Algorithm Illustration Continue the process with decreasing h_n and π_n to the end. Time n = 800, n_A = 349, n_B = 451. [figure: clicking probability versus x]
38 Challenges and Contributions Partial information in the bandit problem. Breakdown of i.i.d. assumptions: existing consistency results for kernel estimation under i.i.d. or weak-dependence assumptions do not apply. Technical tools to develop new arguments: martingale theories, Hoeffding-type inequalities, chaining methods. Strong consistency and finite-time analysis. Dimension reduction and model combination.
41 Asymptotic Performance Theorem (Qian and Yang, JMLR, 2016a). If the f_i (i ∈ {A, B}) are uniformly continuous, and h_n and π_n are chosen to satisfy h_n → 0, π_n → 0 and n h_n^{2d} π_n / (log n)^4 → ∞, then the Nadaraya-Watson estimators are uniformly strongly consistent; that is, for each i ∈ {A, B}, sup_{x ∈ [0,1]^d} |f̂_{i,n}(x) − f_i(x)| → 0 a.s. as n → ∞. Uniform strong consistency of the estimators implies that R_N = o(N) almost surely; equivalently, (Σ_{n=1}^N Y_{I_n,n}) / (Σ_{n=1}^N Y*_n) → 1 a.s. as N → ∞, where Y*_n denotes the response under the optimal arm I*(X_n).
42 Finite-Time Regret Analysis Modulus of continuity: ω(h; f) = sup_{|x_1 − x_2| ≤ h} |f(x_1) − f(x_2)|. Hölder continuity: ω(h; f_i) ≤ ρ h^κ (0 < κ ≤ 1). Theorem (Qian and Yang, JMLR, 2016a). There exists n_δ ∈ N such that with probability larger than 1 − 2δ, R_N < C_1 n_δ + Σ_{n=n_δ}^N ( 2 max_{i ∈ {A,B}} ω(h_n; f_i) + C_2 √( log(n) / (n h_n^d π_n) ) + π_n ) + C_3 √( N log(1/δ) ). The summand upper-bounds f*(X_n) − f_{I_n}(X_n): estimation bias ω(h_n; f_i), estimation variance C_2 √( log(n)/(n h_n^d π_n) ), and exploration price π_n. Nonparametric estimation: bias-variance tradeoff. Bandit problem: exploration-exploitation tradeoff.
44 Finite-Time Regret Upper Bounds Under Hölder continuity, when using the kernel UCB-type estimator, ER_N < C N^{(d+κ)/(d+2κ)} (log N)^c. A larger d and a smaller κ give a larger power index. This matches the minimax rate of Perchet and Rigollet (2013) up to a logarithmic factor. Adaptive performance (Qian and Yang, EJS, 2016b): the near-minimax rate can be achieved without knowing κ a priori (0 < c ≤ κ ≤ 1).
46 Model Combining Different regression methods: kernel estimation, histogram, K-nearest neighbors, linear regression. Model combining: weighted average of different statistical models. AFTER (Yang, 2004) combines different forecasting procedures: a data-driven algorithm with robust performance.
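A minimal sketch of combining candidate models by exponential weighting in the spirit of AFTER (the squared-error loss, the learning rate η = 1, and the toy predictions are simplifying assumptions; Yang, 2004 derives the actual weighting scheme):

```python
import numpy as np

def combine(weights, preds):
    """Combined prediction: weighted average of the candidates' predictions."""
    return float(np.dot(weights, preds))

def update(weights, preds, y, eta=1.0):
    """Multiplicative update: models with larger squared error lose weight."""
    w = weights * np.exp(-eta * (preds - y) ** 2)
    return w / w.sum()

# Three hypothetical candidate models predicting a clicking probability
weights = np.full(3, 1 / 3)
stream = [(1, np.array([0.9, 0.5, 0.1])),
          (1, np.array([0.8, 0.5, 0.2])),
          (0, np.array([0.2, 0.4, 0.9]))]
for y, preds in stream:
    p = combine(weights, preds)   # prediction used before observing y
    weights = update(weights, preds, y)
print(weights)  # the consistently accurate first model ends up heaviest
```

The combined forecast automatically tracks whichever candidate has been predicting well so far, which is the sense in which the algorithm is adaptive.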
47 Model Combining Illustration [figure: clicking probability curves f_A(x) and f_B(x) against x] f_A(x) = 0.7e^{−30(x−0.2)²} + 0.7e^{−30(x−0.8)²}, f_B(x) = x. Time horizon N = 800, π_n = 1/log²(n). Model combining: (1) Nadaraya-Watson estimation (bandwidths h_1 and h_2); (2) linear regression.
48 Model Combining Adaptive Performance Per-round regret r_n = R_n/n. [figure: r_n versus n for the combined method, Nadaraya-Watson with h_1, Nadaraya-Watson with h_2, and linear regression]
49 Yahoo! Front Page Today Module Dataset 46 million internet visit events with user responses and five user covariates over ten days. Contains a pool of about 10 editor-picked news articles. The raw data file is 8 GB per day. Algorithms are implemented efficiently in C++. Potentially adapted for online applications.
50 Evaluation Results Algorithms evaluated by click-through rate (CTR): complete random; naive simple average (no covariates); LinUCB (Chapelle and Li, 2011), a Bayesian-logistic-regression-based algorithm; model combining with kernel estimation (h_1 = n^{−1/6}, h_2 = n^{−1/8}, h_3 = n^{−1/10}). [table: average normalized CTR and standard deviation for Random, Naive, LinUCB, and Combining]
51 Conclusion Precision medicine demands online learning for optimal treatment results. MABC provides a framework for designing effective treatment-allocation rules in a way that integrates learning from experimentation with maximizing the benefits to the patients along the process. Many theoretical and practical issues remain to be addressed.
52 Some References
Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47.
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6.
Perchet, V. and Rigollet, P. (2013). The multi-armed bandit problem with covariates. The Annals of Statistics, 41.
Qian, W. and Yang, Y. (2016a). Kernel estimation and model combination in a bandit problem with covariates. Journal of Machine Learning Research, 17.
Qian, W. and Yang, Y. (2016b). Randomized allocation with arm elimination in a bandit problem with covariates. Electronic Journal of Statistics, 10.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58.
Woodroofe, M. (1979). A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74.
Yang, Y. (2004). Combining forecasting procedures: some theoretical results. Econometric Theory, 20.
Yang, Y. and Zhu, D. (2002). Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. The Annals of Statistics, 30.
Yahoo! Academic Relations (2011). Yahoo! front page today module user click log dataset, version 1.0.
More informationMonte-Carlo Planning Look Ahead Trees. Alan Fern
Monte-Carlo Planning Look Ahead Trees Alan Fern 1 Monte-Carlo Planning Outline Single State Case (multi-armed bandits) A basic tool for other algorithms Monte-Carlo Policy Improvement Policy rollout Policy
More information1 Bandit View on Noisy Optimization
1 Bandit View on Noisy Optimization Jean-Yves Audibert audibert@certis.enpc.fr Imagine, Université Paris Est; Willow, CNRS/ENS/INRIA Paris, France Sébastien Bubeck sebastien.bubeck@inria.fr Sequel Project,
More informationBandit Learning with switching costs
Bandit Learning with switching costs Jian Ding, University of Chicago joint with: Ofer Dekel (MSR), Tomer Koren (Technion) and Yuval Peres (MSR) June 2016, Harvard University Online Learning with k -Actions
More informationProblem Set 3. Thomas Philippon. April 19, Human Wealth, Financial Wealth and Consumption
Problem Set 3 Thomas Philippon April 19, 2002 1 Human Wealth, Financial Wealth and Consumption The goal of the question is to derive the formulas on p13 of Topic 2. This is a partial equilibrium analysis
More informationUniversal Portfolios
CS28B/Stat24B (Spring 2008) Statistical Learning Theory Lecture: 27 Universal Portfolios Lecturer: Peter Bartlett Scribes: Boriska Toth and Oriol Vinyals Portfolio optimization setting Suppose we have
More informationDynamic Programming and Reinforcement Learning
Dynamic Programming and Reinforcement Learning Daniel Russo Columbia Business School Decision Risk and Operations Division Fall, 2017 Daniel Russo (Columbia) Fall 2017 1 / 34 Supervised Machine Learning
More informationMATH3075/3975 FINANCIAL MATHEMATICS TUTORIAL PROBLEMS
MATH307/37 FINANCIAL MATHEMATICS TUTORIAL PROBLEMS School of Mathematics and Statistics Semester, 04 Tutorial problems should be used to test your mathematical skills and understanding of the lecture material.
More informationCore-Selecting Auction Design for Dynamically Allocating Heterogeneous VMs in Cloud Computing
Core-Selecting Auction Design for Dynamically Allocating Heterogeneous VMs in Cloud Computing Haoming Fu, Zongpeng Li, Chuan Wu, Xiaowen Chu University of Calgary The University of Hong Kong Hong Kong
More informationDynamic Pricing for Competing Sellers
Clemson University TigerPrints All Theses Theses 8-2015 Dynamic Pricing for Competing Sellers Liu Zhu Clemson University, liuz@clemson.edu Follow this and additional works at: https://tigerprints.clemson.edu/all_theses
More informationAn introduction to game-theoretic probability from statistical viewpoint
.. An introduction to game-theoretic probability from statistical viewpoint Akimichi Takemura (joint with M.Kumon, K.Takeuchi and K.Miyabe) University of Tokyo May 14, 2013 RPTC2013 Takemura (Univ. of
More informationFuel-Switching Capability
Fuel-Switching Capability Alain Bousquet and Norbert Ladoux y University of Toulouse, IDEI and CEA June 3, 2003 Abstract Taking into account the link between energy demand and equipment choice, leads to
More informationSOCIETY OF ACTUARIES EXAM STAM SHORT-TERM ACTUARIAL MATHEMATICS EXAM STAM SAMPLE QUESTIONS
SOCIETY OF ACTUARIES EXAM STAM SHORT-TERM ACTUARIAL MATHEMATICS EXAM STAM SAMPLE QUESTIONS Questions 1-307 have been taken from the previous set of Exam C sample questions. Questions no longer relevant
More informationCS364B: Frontiers in Mechanism Design Lecture #18: Multi-Parameter Revenue-Maximization
CS364B: Frontiers in Mechanism Design Lecture #18: Multi-Parameter Revenue-Maximization Tim Roughgarden March 5, 2014 1 Review of Single-Parameter Revenue Maximization With this lecture we commence the
More informationIEOR E4602: Quantitative Risk Management
IEOR E4602: Quantitative Risk Management Risk Measures Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Reference: Chapter 8
More informationProbability. An intro for calculus students P= Figure 1: A normal integral
Probability An intro for calculus students.8.6.4.2 P=.87 2 3 4 Figure : A normal integral Suppose we flip a coin 2 times; what is the probability that we get more than 2 heads? Suppose we roll a six-sided
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 9 Sep, 28, 2016 Slide 1 CPSC 422, Lecture 9 An MDP Approach to Multi-Category Patient Scheduling in a Diagnostic Facility Adapted from: Matthew
More informationAN ONLINE LEARNING APPROACH TO ALGORITHMIC BIDDING FOR VIRTUAL TRADING
AN ONLINE LEARNING APPROACH TO ALGORITHMIC BIDDING FOR VIRTUAL TRADING Lang Tong School of Electrical & Computer Engineering Cornell University, Ithaca, NY Joint work with Sevi Baltaoglu and Qing Zhao
More informationAvailable online at ScienceDirect. Procedia Computer Science 95 (2016 )
Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 95 (2016 ) 483 488 Complex Adaptive Systems, Publication 6 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri
More informationChapter 7 1. Random Variables
Chapter 7 1 Random Variables random variable numerical variable whose value depends on the outcome of a chance experiment - discrete if its possible values are isolated points on a number line - continuous
More informationInformation Theory and Networks
Information Theory and Networks Lecture 18: Information Theory and the Stock Market Paul Tune http://www.maths.adelaide.edu.au/matthew.roughan/ Lecture_notes/InformationTheory/
More informationFinancial Risk Modeling on Low-power Accelerators: Experimental Performance Evaluation of TK1 with FPGA
Financial Risk Modeling on Low-power Accelerators: Experimental Performance Evaluation of TK1 with FPGA Rajesh Bordawekar and Daniel Beece IBM T. J. Watson Research Center 3/17/2015 2014 IBM Corporation
More informationJournal of Computational and Applied Mathematics. The mean-absolute deviation portfolio selection problem with interval-valued returns
Journal of Computational and Applied Mathematics 235 (2011) 4149 4157 Contents lists available at ScienceDirect Journal of Computational and Applied Mathematics journal homepage: www.elsevier.com/locate/cam
More informationStatistics for Business and Economics
Statistics for Business and Economics Chapter 5 Continuous Random Variables and Probability Distributions Ch. 5-1 Probability Distributions Probability Distributions Ch. 4 Discrete Continuous Ch. 5 Probability
More informationFinal exam solutions
EE365 Stochastic Control / MS&E251 Stochastic Decision Models Profs. S. Lall, S. Boyd June 5 6 or June 6 7, 2013 Final exam solutions This is a 24 hour take-home final. Please turn it in to one of the
More informationLecture Note 6 of Bus 41202, Spring 2017: Alternative Approaches to Estimating Volatility.
Lecture Note 6 of Bus 41202, Spring 2017: Alternative Approaches to Estimating Volatility. Some alternative methods: (Non-parametric methods) Moving window estimates Use of high-frequency financial data
More informationLecture 12: Introduction to reasoning under uncertainty. Actions and Consequences
Lecture 12: Introduction to reasoning under uncertainty Preferences Utility functions Maximizing expected utility Value of information Bandit problems and the exploration-exploitation trade-off COMP-424,
More informationEntropic Derivative Security Valuation
Entropic Derivative Security Valuation Michael Stutzer 1 Professor of Finance and Director Burridge Center for Securities Analysis and Valuation University of Colorado, Boulder, CO 80309 1 Mathematical
More informationSupplemental Online Appendix to Han and Hong, Understanding In-House Transactions in the Real Estate Brokerage Industry
Supplemental Online Appendix to Han and Hong, Understanding In-House Transactions in the Real Estate Brokerage Industry Appendix A: An Agent-Intermediated Search Model Our motivating theoretical framework
More informationMath 416/516: Stochastic Simulation
Math 416/516: Stochastic Simulation Haijun Li lih@math.wsu.edu Department of Mathematics Washington State University Week 13 Haijun Li Math 416/516: Stochastic Simulation Week 13 1 / 28 Outline 1 Simulation
More information12 The Bootstrap and why it works
12 he Bootstrap and why it works For a review of many applications of bootstrap see Efron and ibshirani (1994). For the theory behind the bootstrap see the books by Hall (1992), van der Waart (2000), Lahiri
More informationHigh Dimensional Edgeworth Expansion. Applications to Bootstrap and Its Variants
With Applications to Bootstrap and Its Variants Department of Statistics, UC Berkeley Stanford-Berkeley Colloquium, 2016 Francis Ysidro Edgeworth (1845-1926) Peter Gavin Hall (1951-2016) Table of Contents
More informationDO NOT OPEN THIS QUESTION PAPER UNTIL YOU ARE TOLD TO DO SO. Performance Pillar. P1 Performance Operations. Wednesday 27 August 2014
DO NOT OPEN THIS QUESTION PAPER UNTIL YOU ARE TOLD TO DO SO. Performance Pillar P1 Performance Operations Instructions to candidates Wednesday 27 August 2014 You are allowed three hours to answer this
More informationTwo-Sample Z-Tests Assuming Equal Variance
Chapter 426 Two-Sample Z-Tests Assuming Equal Variance Introduction This procedure provides sample size and power calculations for one- or two-sided two-sample z-tests when the variances of the two groups
More informationIdiosyncratic risk, insurance, and aggregate consumption dynamics: a likelihood perspective
Idiosyncratic risk, insurance, and aggregate consumption dynamics: a likelihood perspective Alisdair McKay Boston University June 2013 Microeconomic evidence on insurance - Consumption responds to idiosyncratic
More information