
CS/CNS/EE 253: Advanced Topics in Machine Learning
Topic: Dealing with Partial Feedback #2
Lecturer: Daniel Golovin
Scribe: Chris Berlind
Date: Feb 1, 2010

8.1 Review

In the previous lecture we began looking at algorithms for dealing with sequential decision problems in the bandit (or partial) feedback model. In this model, there are K arms indexed by 1, 2, ..., K, each with an associated payoff function r_i(t) which is unknown. In each round t, an arm is chosen and the reward r_i(t) ∈ [0, 1] is gained. Only r_i(t) is revealed to the algorithm at the end of round t, where i is the arm chosen in that round; the algorithm is kept ignorant of r_j(t) for all other arms j. The goal is to find an algorithm specifying how to choose an arm in each round so as to maximize the total reward over all rounds.

We began our study of this model with an assumption of stochastic rewards, as opposed to the harder adversarial rewards case. Thus we assume there is an underlying distribution R_i for each arm i, and each r_i(t) is drawn from R_i independently of all other rewards (both of arm i during rounds other than t, and of other arms during round t). Note we assume the rewards are bounded; specifically, r_i(t) ∈ [0, 1] for all i and t.

We first explored the ε_t-Greedy algorithm, in which with probability ε_t an arm is chosen uniformly at random, and with probability 1 − ε_t the arm with the highest observed average reward is chosen. For the right choice of ε_t, this algorithm has expected regret logarithmic in T.

We can improve upon this algorithm by taking better advantage of the information available to us. In addition to the average payoff of each arm, we also know how many times we have played each arm. This allows us to estimate confidence bounds for each arm, which leads to the Upper Confidence Bound (UCB) algorithm explained in detail in the last lecture. The UCB1 algorithm also has expected regret logarithmic in T.
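As a concrete refresher, here is a minimal Python sketch of ε_t-Greedy for the stochastic setting. The schedule ε_t = min(1, cK/t) and the constant c are illustrative choices; a decaying schedule of this general form is what yields the logarithmic-regret guarantee for a suitable c.

import random

def epsilon_t_greedy(arms, T, c=5.0):
    """eps_t-Greedy: explore uniformly w.p. eps_t = min(1, c*K/t), otherwise
    play the arm with the best empirical mean. arms[i]() draws a reward in [0, 1]."""
    K = len(arms)
    counts = [0] * K
    means = [0.0] * K
    total = 0.0
    for t in range(1, T + 1):
        eps_t = min(1.0, c * K / t)
        if random.random() < eps_t:
            i = random.randrange(K)                    # explore
        else:
            i = max(range(K), key=lambda j: means[j])  # exploit
        r = arms[i]()                                  # only arm i's reward is revealed
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]         # incremental mean update
        total += r
    return total

# Example: two Bernoulli arms with success probabilities 0.3 and 0.6.
# epsilon_t_greedy([lambda: float(random.random() < 0.3),
#                   lambda: float(random.random() < 0.6)], T=10000)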

8.2 Exp3

The regret bounds for the ε_t-Greedy and UCB1 algorithms were proved under the assumption of stochastic payoff functions. When the payoff functions are non-stochastic (e.g. adversarial), these algorithms do not fare so well. Because UCB1 is entirely deterministic, an adversary could predict its play and choose payoffs to force UCB1 into making bad decisions. This flaw motivates the introduction of a new bandit algorithm, Exp3 [1], which is useful in the non-stochastic payoff case. In these notes, we will develop a variant of Exp3 and give a regret bound for it. The algorithm and analysis here are non-standard, and are provided to expose the role of unbiased estimates and their variances in developing effective no-regret algorithms in the non-stochastic payoff case.

8.2.1 Hedge & the Power of Unbiased Estimates

Back in Lecture 2, the Hedge algorithm was introduced to deal with sequential decision-making under the full information model. The reward-maximizing version of the Hedge algorithm is defined as

Hedge(ε)
1   w_i(1) = 1 for i = 1, ..., K
2   for t = 1 to T
3       Play X_t = i w.p. w_i(t) / Σ_j w_j(t)
4       w_i(t+1) = w_i(t)(1 + ε)^{r_i(t)} for i = 1, ..., K

At every timestep t, each arm i has weight w_i(t) = (1 + ε)^{Σ_{t' < t} r_i(t')}, and an arm is chosen with probability proportional to the weights. We let X_t denote the arm chosen in round t. In this algorithm, Hedge always sees the true payoff r_i(t) in each round.
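For concreteness, here is a minimal Python sketch of this reward-maximizing Hedge, assuming full feedback: a caller-supplied rewards(t) returns the whole vector (r_1(t), ..., r_K(t)). The interface is illustrative, not fixed by the lecture.

import random

def hedge(K, T, eps, rewards):
    """Hedge(eps) under full information: rewards(t) -> [r_1(t), ..., r_K(t)] in [0, 1]^K."""
    w = [1.0] * K                         # w_i(1) = 1
    total = 0.0
    for t in range(1, T + 1):
        X = random.choices(range(K), weights=w)[0]  # play X_t = i w.p. w_i(t)/sum_j w_j(t)
        r = rewards(t)                    # full information: every arm's reward is seen
        total += r[X]
        for i in range(K):
            w[i] *= (1.0 + eps) ** r[i]   # w_i(t+1) = w_i(t) (1+eps)^{r_i(t)}
    return total

Note that the update touches every arm's weight, which is exactly what the bandit model forbids; the remainder of this section is about replacing r_i(t) with estimates that are computable from bandit feedback alone.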

Fix some real number b ≥ 1. Suppose each r_i(t) in Hedge is replaced with a random variable R_i(t) such that R_i(t) is always in [0, 1] and E[R_i(t)] = r_i(t)/b. We imagine Hedge gets actual reward r_i(t) if it picks i, but only gets to see feedback R_j(t) for each j rather than the true rewards r_j(t). We can find a lower bound for the expected payoff E[Σ_t b·R_{X_t}(t)] = E[Σ_t r_{X_t}(t)] as follows. First note that the upper bound on Hedge's expected regret on the payoffs R_i(t) ensures

    E[Σ_{t=1}^T R_{X_t}(t)] ≥ (1 − ε) E[max_i Σ_{t=1}^T R_i(t)] − (ln K)/ε.

Also note that for any set of random variables R_1, R_2, ..., R_n,

    E[max_i R_i] ≥ max_i E[R_i].

One way to see this is to let j = argmax_i E[R_i] and note that max_i {R_i} ≥ R_j always; hence E[max_i R_i] ≥ E[R_j] = max_i E[R_i]. Using these two inequalities together with E[R_i(t)] = r_i(t)/b, we infer the following bound. Below, expectation is taken with respect to both the randomness of the R_i(t) and the randomness used by Hedge.

    E[Σ_{t=1}^T r_{X_t}(t)] = E[Σ_{t=1}^T b·R_{X_t}(t)] = b·E[Σ_{t=1}^T R_{X_t}(t)]
        ≥ b·((1 − ε) E[max_i Σ_{t=1}^T R_i(t)] − (ln K)/ε)
        ≥ (1 − ε) max_i b·E[Σ_{t=1}^T R_i(t)] − (b ln K)/ε
        = (1 − ε) max_i Σ_{t=1}^T r_i(t) − (b ln K)/ε.                    (8.2.1)

This indicates that even though Hedge is not seeing the correct payoffs, it still has nearly the same regret bound, due to the linearity of expectation. The only difference is that the ln K term in the regret increases to b ln K. This will turn out to be a very useful property.

8.2.2 A Variation on the Exp3 Algorithm

The idea here is to observe a random variable and feed it to Hedge, since the above analysis shows this will not hurt our performance. Define

    R_i(t) = r_i(t)/p_i(t)   if i is played in round t,
    R_i(t) = 0               otherwise,

where p_i(t) = Pr[X_t = i]. Then E[R_i(t)] = p_i(t)·(r_i(t)/p_i(t)) + (1 − p_i(t))·0 = r_i(t). To use the above ideas we need to scale these random rewards so that they always fall in [0, 1]. Since r_i(t) ∈ [0, 1] by assumption, the required scaling factor is b = (min_{i,t} p_i(t))^{−1}. This suggests that using Hedge directly in the bandit model would result in a poor bound on the expected regret, because some arms might see their selection probability p_i(t) tend to zero, which will cause b to tend to ∞, rendering our bound in equation (8.2.1) useless.

Intuitively this makes sense. Since we are working in the adversarial payoffs model, and lousy historical performance is no guarantee of lousy future performance, we cannot ignore any arm for too long. We must continuously explore the space of arms in case one of the previously bad arms turns out to be the best one overall in hindsight. Alternately, we can view the problem as controlling the variance of our estimate of the average reward (averaged over all rounds so far) for a given arm. Even if our estimate is unbiased (so that the mean is correct), there is a price we pay for its variance.

To enforce the constraint that we continuously explore all arms (and keep these variances under control), we put a lower bound of γ/K on the probabilities p_i(t). This ensures that b = K/γ suffices. The result is a modified form of Hedge. This algorithm, a variation on Exp3, at each timestep plays according to the Hedge algorithm with reward R'_i(t) := R_i(t)/b = γ·R_i(t)/K with probability 1 − γ, and plays an arm uniformly at random otherwise. Formally, it is defined as follows:

Exp3-Variant(ε, γ)
1   for t = 1 to T
2       p_i(t) = (1 − γ) w_i(t)/Σ_j w_j(t) + γ/K for i = 1, ..., K
3       Play X_t = i w.p. p_i(t)
4       Let R'_i(t) = (γ/K)·r_i(t)/p_i(t) if X_t = i, and R'_i(t) = 0 otherwise
5       w_i(t+1) = w_i(t)(1 + ε)^{R'_i(t)} for i = 1, ..., K
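The following minimal Python sketch mirrors this pseudocode; get_reward(i, t), which stands in for r_i(t), is the only feedback the algorithm uses, and the function name and interface are illustrative rather than the lecture's.

import random

def exp3_variant(K, T, eps, gamma, get_reward):
    """Exp3-Variant(eps, gamma): Hedge fed scaled importance-weighted reward
    estimates, with an exploration floor of gamma/K on every selection probability."""
    w = [1.0] * K
    total = 0.0
    for t in range(1, T + 1):
        W = sum(w)
        p = [(1.0 - gamma) * w[i] / W + gamma / K for i in range(K)]  # p_i(t) >= gamma/K
        X = random.choices(range(K), weights=p)[0]                    # play X_t
        r = get_reward(X, t)       # bandit feedback: only the played arm's reward
        total += r
        # R'_X(t) = (gamma/K) * r / p_X(t); unplayed arms implicitly get 0.
        # E[R'_i(t)] = (gamma/K) r_i(t), and the gamma/K floor keeps R'_i(t) in [0, 1],
        # which is exactly what the analysis of Section 8.2.1 requires.
        R = (gamma / K) * r / p[X]
        w[X] *= (1.0 + eps) ** R   # Hedge update; unplayed arms' weights are unchanged
        # (A production version would periodically rescale w to avoid float overflow.)
    return total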

Let OPT(S) := max_i Σ_{t∈S} r_i(t) be the reward of the best fixed arm in hindsight over the rounds in S, and let OPT_T := OPT({1, 2, ..., T}). Using Equation (8.2.1), we get the following bound on the expected reward, where X_t is what we played on round t:

    E[Σ_{t=1}^T r_{X_t}(t)] ≥ (1 − 2ε) E[max_i Σ_{t ∈ EXPLOIT} r_i(t)] − (K ln K)/(εγ).

Here, EXPLOIT is the (random) set of rounds on which the algorithm exploited previous knowledge rather than explored.* It is not too hard to see that E[OPT(EXPLOIT)] ≥ (1 − γ) OPT_T. In effect, giving up the reward for each round with probability γ (to explore) should only cause us to lose a γ fraction of the static optimum OPT_T. Thus we get the following regret bound.

Theorem 8.2.1 The algorithm above obtains expected reward at least E[OPT(EXPLOIT)](1 − 2ε) − (K ln K)/(εγ), and so has expected regret at most (2ε + γ) OPT_T + (K ln K)/(εγ).

Noting OPT_T ≤ T and balancing terms, we can optimize the bound by setting ε, γ = Θ((K ln K)^{1/3} T^{−1/3}), for a regret bound of O(T^{2/3} (K log K)^{1/3}). Compared to the O(K log T) regret bounds in the stochastic reward setting, this is much worse. Ignoring the dependence on K, it means the average regret shrinks as O(T^{−1/3}) instead of O((log T)/T). This algorithm and analysis are not the best possible; as we discuss below, Exp3 achieves an O(√(T K log K)) regret bound, and a lower bound of Ω(√(TK)) is known for the adversarial payoff case.

8.2.3 The Original Exp3 Algorithm

The original Exp3 algorithm has only one parameter, γ, and is obtained by setting ε = e − 1 in our variant, i.e., Exp3(γ) ≡ Exp3-Variant(e − 1, γ). Here is the pseudocode.

Exp3(γ)
1   for t = 1 to T
2       p_i(t) = (1 − γ) w_i(t)/Σ_j w_j(t) + γ/K for i = 1, ..., K
3       Play X_t = i w.p. p_i(t)
4       Let R'_i(t) = (γ/K)·r_i(t)/p_i(t) if X_t = i, and R'_i(t) = 0 otherwise
5       w_i(t+1) = w_i(t) exp(R'_i(t)) for i = 1, ..., K

Auer et al. [1] then prove the following regret bound for Exp3.

Theorem 8.2.2 The expected regret of Exp3(γ) after T rounds is at most (e − 1)γ OPT_T + (K ln K)/γ, where OPT_T is the static optimum for the first T rounds.

With the optimum choice of γ, it is possible to achieve a regret bound of O(√(OPT_T · K ln K)).

* To decide whether a round t was an exploitation or an exploration round, let i be the arm chosen in round t, and flip a coin with bias γ(K p_i(t))^{−1}. If it comes up heads, it is an exploration round; otherwise it is an exploitation round. Proving E[OPT(EXPLOIT)] ≥ (1 − γ) OPT_T is easy if you note that this labeling can be done after all the rounds have been played.
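As a usage sketch (a toy instance of my own, not from the lecture), the original Exp3 is just exp3_variant above with ε = e − 1. Here it runs on a small non-stochastic instance in which the best arm switches halfway through, with γ tuned as in [1] using g = T as an upper bound on OPT_T:

import math

K, T = 2, 10000

# Non-stochastic toy rewards: arm 0 pays 1 in the first half, arm 1 in the second.
def reward(i, t):
    return 1.0 if (t <= T // 2) == (i == 0) else 0.0

gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * T)))  # tuning from [1]
total = exp3_variant(K, T, math.e - 1, gamma, reward)              # Exp3(gamma)
print(total)  # the best fixed arm earns OPT_T = T/2 = 5000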

8.3 Gradient Descent without the Gradient

Unbiased estimates are used in other algorithms in the bandit feedback model as well. For example, Flaxman et al. [2] have shown that it is possible to perform gradient descent in the bandit setting by obtaining an unbiased estimate of an n-dimensional gradient† from an observed (scalar) reward! See their paper and the references therein for more on this topic.

References

[1] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The non-stochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[2] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA '05: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.

† They estimate the gradient of a smoothed version of the objective function, rather than the gradient of the objective function itself.