
CS/CNS/EE 253: Advanced Topics in Machine Learning
Topic: Dealing with Partial Feedback #2
Lecturer: Daniel Golovin
Scribe: Chris Berlind
Date: Feb 1, 2010

8.1 Review

In the previous lecture we began looking at algorithms for dealing with sequential decision problems in the bandit (or partial) feedback model. In this model, there are K arms indexed by 1, 2, ..., K, each with an associated payoff function r_i(t) which is unknown. In each round t, an arm is chosen and the reward r_i(t) ∈ [0, 1] is gained. Only r_i(t) is revealed to the algorithm at the end of round t, where i is the arm chosen in that round; the algorithm is kept ignorant of r_j(t) for all other arms j. The goal is to find an algorithm specifying how to choose an arm in each round so as to maximize the total reward over all rounds.

We began our study of this model with an assumption of stochastic rewards, as opposed to the harder adversarial rewards case. Thus we assume there is an underlying distribution R_i for each arm i, and each r_i(t) is drawn from R_i independently of all other rewards (both of arm i during rounds other than t, and of other arms during round t). Note we assume the rewards are bounded; specifically, r_i(t) ∈ [0, 1] for all i and t.

We first explored the ε_t-Greedy algorithm, in which with probability ε_t an arm is chosen uniformly at random, and with probability 1 − ε_t the arm with the highest observed average reward is chosen. For the right choice of ε_t, this algorithm has expected regret logarithmic in T.

We can improve upon this algorithm by taking better advantage of the information available to us. In addition to the average payoff of each arm, we also know how many times we have played each arm. This allows us to estimate confidence bounds for each arm, which leads to the Upper Confidence Bound (UCB) algorithm explained in detail in the last lecture. The UCB1 algorithm also has expected regret logarithmic in T.
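As a concrete refresher, here is a minimal Python sketch of ε_t-Greedy for the stochastic setting. The schedule ε_t = min(1, cK/t) and the constant c are illustrative choices; a decaying schedule of this general form is what yields the logarithmic-regret guarantee for a suitable c.

import random

def epsilon_t_greedy(arms, T, c=5.0):
    """eps_t-Greedy: explore uniformly w.p. eps_t = min(1, c*K/t), otherwise
    play the arm with the best empirical mean. arms[i]() draws a reward in [0, 1]."""
    K = len(arms)
    counts = [0] * K
    means = [0.0] * K
    total = 0.0
    for t in range(1, T + 1):
        eps_t = min(1.0, c * K / t)
        if random.random() < eps_t:
            i = random.randrange(K)                    # explore
        else:
            i = max(range(K), key=lambda j: means[j])  # exploit
        r = arms[i]()                                  # only arm i's reward is revealed
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]         # incremental mean update
        total += r
    return total

# Example: two Bernoulli arms with success probabilities 0.3 and 0.6.
# epsilon_t_greedy([lambda: float(random.random() < 0.3),
#                   lambda: float(random.random() < 0.6)], T=10000)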

8.2 Exp3

The regret bounds for the ε_t-Greedy and UCB1 algorithms were proved under the assumption of stochastic payoff functions. When the payoff functions are non-stochastic (e.g. adversarial), these algorithms do not fare so well. Because UCB1 is entirely deterministic, an adversary could predict its play and choose payoffs to force UCB1 into making bad decisions. This flaw motivates the introduction of a new bandit algorithm, Exp3 [1], which is useful in the non-stochastic payoff case. In these notes, we will develop a variant of Exp3 and give a regret bound for it. The algorithm and analysis here are non-standard, and are provided to expose the role of unbiased estimates and their variances in developing effective no-regret algorithms in the non-stochastic payoff case.

8.2.1 Hedge & the Power of Unbiased Estimates

Back in Lecture 2, the Hedge algorithm was introduced to deal with sequential decision-making under the full information model. The reward-maximizing version of the Hedge algorithm is defined as

Hedge(ε)
1   w_i(1) = 1 for i = 1, ..., K
2   for t = 1 to T
3       Play X_t = i w.p. w_i(t) / Σ_j w_j(t)
4       w_i(t+1) = w_i(t)(1 + ε)^{r_i(t)} for i = 1, ..., K

At every timestep t, each arm i has weight w_i(t) = (1 + ε)^{Σ_{t' < t} r_i(t')}, and an arm is chosen with probability proportional to the weights. We let X_t denote the arm chosen in round t. In this algorithm, Hedge always sees the true payoff r_i(t) in each round.
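For concreteness, here is a minimal Python sketch of this reward-maximizing Hedge, assuming full feedback: a caller-supplied rewards(t) returns the whole vector (r_1(t), ..., r_K(t)). The interface is illustrative, not fixed by the lecture.

import random

def hedge(K, T, eps, rewards):
    """Hedge(eps) under full information: rewards(t) -> [r_1(t), ..., r_K(t)] in [0, 1]^K."""
    w = [1.0] * K                         # w_i(1) = 1
    total = 0.0
    for t in range(1, T + 1):
        X = random.choices(range(K), weights=w)[0]  # play X_t = i w.p. w_i(t)/sum_j w_j(t)
        r = rewards(t)                    # full information: every arm's reward is seen
        total += r[X]
        for i in range(K):
            w[i] *= (1.0 + eps) ** r[i]   # w_i(t+1) = w_i(t) (1+eps)^{r_i(t)}
    return total

Note that the update touches every arm's weight, which is exactly what the bandit model forbids; the remainder of this section is about replacing r_i(t) with estimates that are computable from bandit feedback alone.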

Fix some real number b ≥ 1. Suppose each r_i(t) in Hedge is replaced with a random variable R_i(t) such that R_i(t) is always in [0, 1] and E[R_i(t)] = r_i(t)/b. We imagine Hedge gets actual reward r_i(t) if it picks i, but only gets to see feedback R_j(t) for each j rather than the true rewards r_j(t). We can find a lower bound for the expected payoff E[Σ_t b·R_{X_t}(t)] = E[Σ_t r_{X_t}(t)] as follows. First note that the upper bound on Hedge's expected regret on the payoffs R_i(t) ensures

    E[Σ_{t=1}^T R_{X_t}(t)] ≥ (1 − ε) E[max_i Σ_{t=1}^T R_i(t)] − (ln K)/ε.

Also note that for any set of random variables R_1, R_2, ..., R_n,

    E[max_i R_i] ≥ max_i E[R_i].

One way to see this is to let j = argmax_i E[R_i] and note that max_i {R_i} ≥ R_j always; hence E[max_i R_i] ≥ E[R_j] = max_i E[R_i]. Using these two inequalities together with E[R_i(t)] = r_i(t)/b, we infer the following bound. Below, expectation is taken with respect to both the randomness of the R_i(t) and the randomness used by Hedge.

    E[Σ_{t=1}^T r_{X_t}(t)] = E[Σ_{t=1}^T b·R_{X_t}(t)] = b·E[Σ_{t=1}^T R_{X_t}(t)]
        ≥ b·((1 − ε) E[max_i Σ_{t=1}^T R_i(t)] − (ln K)/ε)
        ≥ (1 − ε) max_i b·E[Σ_{t=1}^T R_i(t)] − (b ln K)/ε
        = (1 − ε) max_i Σ_{t=1}^T r_i(t) − (b ln K)/ε.                    (8.2.1)

This indicates that even though Hedge is not seeing the correct payoffs, it still has nearly the same regret bound, due to the linearity of expectation. The only difference is that the ln K term in the regret increases to b ln K. This will turn out to be a very useful property.

8.2.2 A Variation on the Exp3 Algorithm

The idea here is to observe a random variable and feed it to Hedge, since the above analysis shows this will not hurt our performance. Define

    R_i(t) = r_i(t)/p_i(t)   if i is played in round t,
    R_i(t) = 0               otherwise,

where p_i(t) = Pr[X_t = i]. Then E[R_i(t)] = p_i(t)·(r_i(t)/p_i(t)) + (1 − p_i(t))·0 = r_i(t). To use the above ideas we need to scale these random rewards so that they always fall in [0, 1]. Since r_i(t) ∈ [0, 1] by assumption, the required scaling factor is b = (min_{i,t} p_i(t))^{−1}. This suggests that using Hedge directly in the bandit model would result in a poor bound on the expected regret, because some arms might see their selection probability p_i(t) tend to zero, which will cause b to tend to ∞, rendering our bound in equation (8.2.1) useless.

Intuitively this makes sense. Since we are working in the adversarial payoffs model, and lousy historical performance is no guarantee of lousy future performance, we cannot ignore any arm for too long. We must continuously explore the space of arms in case one of the previously bad arms turns out to be the best one overall in hindsight. Alternately, we can view the problem as controlling the variance of our estimate of the average reward (averaged over all rounds so far) for a given arm. Even if our estimate is unbiased (so that the mean is correct), there is a price we pay for its variance.

To enforce the constraint that we continuously explore all arms (and keep these variances under control), we put a lower bound of γ/K on the probabilities p_i(t). This ensures that b = K/γ suffices. The result is a modified form of Hedge. This algorithm, a variation on Exp3, at each timestep plays according to the Hedge algorithm with reward R'_i(t) := R_i(t)/b = γ·R_i(t)/K with probability 1 − γ, and plays an arm uniformly at random otherwise. Formally, it is defined as follows:

Exp3-Variant(ε, γ)
1   for t = 1 to T
2       p_i(t) = (1 − γ) w_i(t)/Σ_j w_j(t) + γ/K for i = 1, ..., K
3       Play X_t = i w.p. p_i(t)
4       Let R'_i(t) = (γ/K)·r_i(t)/p_i(t) if X_t = i, and R'_i(t) = 0 otherwise
5       w_i(t+1) = w_i(t)(1 + ε)^{R'_i(t)} for i = 1, ..., K
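The following minimal Python sketch mirrors this pseudocode; get_reward(i, t), which stands in for r_i(t), is the only feedback the algorithm uses, and the function name and interface are illustrative rather than the lecture's.

import random

def exp3_variant(K, T, eps, gamma, get_reward):
    """Exp3-Variant(eps, gamma): Hedge fed scaled importance-weighted reward
    estimates, with an exploration floor of gamma/K on every selection probability."""
    w = [1.0] * K
    total = 0.0
    for t in range(1, T + 1):
        W = sum(w)
        p = [(1.0 - gamma) * w[i] / W + gamma / K for i in range(K)]  # p_i(t) >= gamma/K
        X = random.choices(range(K), weights=p)[0]                    # play X_t
        r = get_reward(X, t)       # bandit feedback: only the played arm's reward
        total += r
        # R'_X(t) = (gamma/K) * r / p_X(t); unplayed arms implicitly get 0.
        # E[R'_i(t)] = (gamma/K) r_i(t), and the gamma/K floor keeps R'_i(t) in [0, 1],
        # which is exactly what the analysis of Section 8.2.1 requires.
        R = (gamma / K) * r / p[X]
        w[X] *= (1.0 + eps) ** R   # Hedge update; unplayed arms' weights are unchanged
        # (A production version would periodically rescale w to avoid float overflow.)
    return total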

Let OPT(S) := max_i Σ_{t∈S} r_i(t) be the reward of the best fixed arm in hindsight over the rounds in S, and let OPT_T := OPT({1, 2, ..., T}). Using Equation (8.2.1), we get the following bound on the expected reward, where X_t is what we played on round t:

    E[Σ_{t=1}^T r_{X_t}(t)] ≥ (1 − 2ε) E[max_i Σ_{t ∈ EXPLOIT} r_i(t)] − (K ln K)/(εγ).

Here, EXPLOIT is the (random) set of rounds on which the algorithm exploited previous knowledge rather than explored.* It is not too hard to see that E[OPT(EXPLOIT)] ≥ (1 − γ) OPT_T. In effect, giving up the reward for each round with probability γ (to explore) should only cause us to lose a γ fraction of the static optimum OPT_T. Thus we get the following regret bound.

Theorem 8.2.1 The algorithm above obtains expected reward at least E[OPT(EXPLOIT)](1 − 2ε) − (K ln K)/(εγ), and so has expected regret at most (2ε + γ) OPT_T + (K ln K)/(εγ).

Noting OPT_T ≤ T and balancing terms, we can optimize the bound by setting ε, γ = Θ((K ln K)^{1/3} T^{−1/3}), for a regret bound of O(T^{2/3} (K log K)^{1/3}). Compared to the O(K log T) regret bounds in the stochastic reward setting, this is much worse. Ignoring the dependence on K, it means the average regret shrinks as O(T^{−1/3}) instead of O((log T)/T). This algorithm and analysis are not the best possible; as we discuss below, Exp3 achieves an O(√(T K log K)) regret bound, and a lower bound of Ω(√(TK)) is known for the adversarial payoff case.

8.2.3 The Original Exp3 Algorithm

The original Exp3 algorithm has only one parameter, γ, and is obtained by setting ε = e − 1 in our variant, i.e., Exp3(γ) ≡ Exp3-Variant(e − 1, γ). Here is the pseudocode.

Exp3(γ)
1   for t = 1 to T
2       p_i(t) = (1 − γ) w_i(t)/Σ_j w_j(t) + γ/K for i = 1, ..., K
3       Play X_t = i w.p. p_i(t)
4       Let R'_i(t) = (γ/K)·r_i(t)/p_i(t) if X_t = i, and R'_i(t) = 0 otherwise
5       w_i(t+1) = w_i(t) exp(R'_i(t)) for i = 1, ..., K

Auer et al. [1] then prove the following regret bound for Exp3.

Theorem 8.2.2 The expected regret of Exp3(γ) after T rounds is at most (e − 1)γ OPT_T + (K ln K)/γ, where OPT_T is the static optimum for the first T rounds.

With the optimum choice of γ, it is possible to achieve a regret bound of O(√(OPT_T · K ln K)).

* To decide whether a round t was an exploitation or an exploration round, let i be the arm chosen in round t, and flip a coin with bias γ(K p_i(t))^{−1}. If it comes up heads, it is an exploration round; otherwise it is an exploitation round. Proving E[OPT(EXPLOIT)] ≥ (1 − γ) OPT_T is easy if you note that this labeling can be done after all the rounds have been played.
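As a usage sketch (a toy instance of my own, not from the lecture), the original Exp3 is just exp3_variant above with ε = e − 1. Here it runs on a small non-stochastic instance in which the best arm switches halfway through, with γ tuned as in [1] using g = T as an upper bound on OPT_T:

import math

K, T = 2, 10000

# Non-stochastic toy rewards: arm 0 pays 1 in the first half, arm 1 in the second.
def reward(i, t):
    return 1.0 if (t <= T // 2) == (i == 0) else 0.0

gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * T)))  # tuning from [1]
total = exp3_variant(K, T, math.e - 1, gamma, reward)              # Exp3(gamma)
print(total)  # the best fixed arm earns OPT_T = T/2 = 5000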

8.3 Gradient Descent without the Gradient

Unbiased estimates are used in other algorithms in the bandit feedback model as well. For example, Flaxman et al. [2] have shown that it is possible to perform gradient descent in the bandit setting by obtaining an unbiased estimate of an n-dimensional gradient† from an observed (scalar) reward! See their paper and the references therein for more on this topic.

References

[1] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[2] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA '05: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.

† They estimate the gradient of a smoothed version of the objective function, rather than the gradient of the objective function itself.