Rollout Allocation Strategies for Classification-based Policy Iteration
1 Rollout Allocation Strategies for Classification-based Policy Iteration V. Gabillon, A. Lazaric & M. Ghavamzadeh Workshop on Reinforcement Learning and Search in Very Large Spaces. June 25, 2010, Haifa, Israel.
2 Outline Classification-based Policy Iteration Rollout Allocation Previous Works New Allocation Strategies over Actions Experiments
4 Definitions I
Discounted Markov Decision Process (MDP): a discounted MDP M is a tuple ⟨X, A, r, p, γ⟩, where
- X is the state space,
- A is the finite set of actions (|A| = M < ∞),
- p(·|x, a) is the transition model, a distribution over X,
- r : X × A × X → R is the reward function,
- γ ∈ (0, 1) is the discount factor.
A deterministic policy is a mapping π : X → A.
5 Definitions II
Rollout: sample a trajectory starting from state x_0 with action a_0 and then following π; its return is
R^π(x_0, a_0) = r(x_0, a_0, x_1) + Σ_{t>0} γ^t r(x_t, π(x_t), x_{t+1}), with x_{t+1} ~ p(·|x_t, a_t).
In expectation, E[R^π(x, a)] = Q^π(x, a) for all x ∈ X and a ∈ A.
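As a concrete illustration of the rollout estimator above, the following sketch averages truncated discounted returns; the two-state chain MDP, its transition/reward model, and all numeric values are hypothetical, not part of the talk.

```python
import random

GAMMA = 0.9    # discount factor gamma
HORIZON = 50   # truncation horizon for the sampled trajectory

def step(x, a):
    """Hypothetical generative model: action 1 reaches the rewarding state w.p. 0.9."""
    x_next = 1 if (a == 1 and random.random() < 0.9) else 0
    return x_next, (1.0 if x_next == 1 else 0.0)

def rollout_return(x0, a0, policy):
    """One rollout: r(x0,a0,x1) + sum_{t>0} gamma^t r(x_t, pi(x_t), x_{t+1})."""
    x, r = step(x0, a0)
    ret = r
    for t in range(1, HORIZON):
        x, r = step(x, policy(x))
        ret += GAMMA ** t * r
    return ret

def estimate_q(x, a, policy, n_rollouts=200):
    """Average of i.i.d. rollout returns; converges to Q^pi(x, a)."""
    return sum(rollout_return(x, a, policy) for _ in range(n_rollouts)) / n_rollouts

def pi(x):
    return 1  # policy: always play action 1

random.seed(0)
q1 = estimate_q(0, 1, pi)  # first action is the good one
q0 = estimate_q(0, 0, pi)  # first action is wasted, then follow pi
```

With this toy model, Q^π(x, 1) exceeds Q^π(x, 0) by roughly the undiscounted first-step reward, so a modest number of rollouts already separates the two estimates.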
6 Classification-based Policy Iteration
Without value-function representation! The policy improvement step is cast as a classification problem. (Figure callouts: the training set should be "as accurate as possible"; the open question is "how to allocate?" the rollouts.)
9 Outline Classification-based Policy Iteration Rollout Allocation Previous Works New Allocation Strategies over Actions Experiments
10 Rollout Allocation Problem
The Goal: learn the greedy policy, i.e., maximize the accuracy of the training set D_T by identifying the greedy action in each state of the rollout set D_R.
The Setting:
- Fixed budget L: at each iteration, L rollouts are performed (using a generative model) to estimate the Q-values over D_R.
- Fixed rollout set: D_R is sampled from a fixed distribution ρ and is kept fixed across iterations.
The Problem: how should we allocate the budget over actions, over states, and over state-action pairs?
11 Example
What is the best strategy to maximize the accuracy of the training set with a fixed budget? Can we do better than just allocating the budget uniformly over the state-action pairs? (Figure: four states, each with actions a1, a2, a3.)
12 Allocation Strategies
- Strategies over actions: the budget is allocated uniformly over states; in each state we face a pure-exploration bandit problem.
- Strategies over states: the budget of one state is allocated uniformly over its actions; how many rollouts in each state? (E.g., allocate more rollouts to states which need more rollouts to discriminate the greedy action.)
- Strategies over states and actions: can we merge strategies from the above two problems, or should we design something new?
14 Outline Classification-based Policy Iteration Rollout Allocation Previous Works New Allocation Strategies over Actions Experiments
15 Previous Works
Rollout Classification-based Policy Iteration (RCPI) was introduced by Lagoudakis & Parr (2003) and Fern et al. (2004); their rollout allocation was uniform over states and actions. Dimitrakakis & Lagoudakis (2008) proposed rollout allocation strategies over states, aiming to create an accurate training set with as few rollouts as possible.
16 Outline Classification-based Policy Iteration Rollout Allocation Previous Works New Allocation Strategies over Actions Experiments
17 Rollout Allocation Strategies over Actions
Pure-exploration multi-armed bandit: a player has L rounds to explore M arms/actions, and outputs the one with the highest estimated expectation. The number of rounds may be unknown.
The Setting:
- The budget is allocated uniformly over states: L(x) = L/N.
- In each state x, the player estimates Q^π(x, a) for all a ∈ A.
- After c(x, a) rollouts, Q̂^π_{c(x,a)}(x, a) = (1/c(x, a)) Σ_{j=1}^{c(x,a)} R^π_j(x, a).
- The player outputs â = arg max_{a ∈ A} Q̂^π_{L(x)}(x, a).
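The uniform baseline for one state can be sketched as follows; the Bernoulli rollout returns with means `q_means` are a hypothetical stand-in for the true Q-values.

```python
import random

def uniform_best_action(q_means, budget, rng):
    """Split the per-state budget L(x) evenly over the M actions and return
    the action with the highest empirical mean (hypothetical Bernoulli returns)."""
    m = len(q_means)
    pulls = budget // m
    means = []
    for q in q_means:
        successes = sum(rng.random() < q for _ in range(pulls))
        means.append(successes / pulls)
    return max(range(m), key=lambda a: means[a])

best = uniform_best_action([0.2, 0.5, 0.8], budget=300, rng=random.Random(1))
```

With 100 pulls per arm and well-separated means, the empirical argmax identifies the greedy action with high probability.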
18 Rollout Allocation Strategies over Actions
The Goal: minimize the probability of not identifying the greedy action w.r.t. π at a state x with a fixed number of rollouts:
P(error) = P[â ≠ a*] = P[arg max_{a ∈ A} Q̂^π_{L(x)}(x, a) ≠ arg max_{a ∈ A} Q^π(x, a)].
Bubeck et al. (2009) proved that cumulative-regret minimizers (e.g., UCB) are not optimal solutions for this problem.
Notation: Δ(x, a) = Q(x, a*) − Q(x, a); Δ(x) = min_{a ∈ A, a ≠ a*} Δ(x, a); H(x) = Σ_{a ∈ A, a ≠ a*} 1/Δ(x, a)^2.
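The gap Δ(x) and complexity H(x) defined on this slide are direct to compute from a state's Q-values; the numbers below are hypothetical.

```python
# Hypothetical Q-values for one state; a* is the greedy action.
q = {"a1": 0.9, "a2": 0.4, "a3": 0.1}
a_star = max(q, key=q.get)
gaps = {a: q[a_star] - v for a, v in q.items() if a != a_star}  # Delta(x, a)
delta = min(gaps.values())                     # Delta(x): smallest gap
H = sum(1.0 / g ** 2 for g in gaps.values())   # H(x) = sum_a 1/Delta(x, a)^2
```

Here Δ(x) = 0.5 and H(x) = 1/0.5² + 1/0.8² = 5.5625; a large H(x) means many rollouts are needed to discriminate the greedy action.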
19 Successive Rejects (SR) Audibert et al. (2010)
L(x) is split into M − 1 phases of lengths n_1 ≤ … ≤ n_{M−1}. Let A_x = A be the set of actions still under consideration.
for m = 1, 2, …, M − 1 do
  Pull the remaining actions uniformly: for each a ∈ A_x, perform n_m − n_{m−1} rollouts from (x, a) (with n_0 = 0).
  Stop considering the worst action: remove arg min_{a ∈ A_x} Q̂_{n_m}(x, a) from A_x.
end for
Return the unique element of A_x.
(Figure: allocation diagram showing the cumulative pulls n_1 + n_2 + … + n_{M−1} summing to at most L(x).)
25 Successive Rejects (SR)
The lengths of the phases are defined, for m ∈ {1, …, M − 1}, as
n_m = ⌈ (1/loḡ(M)) · (L(x) − M)/(M + 1 − m) ⌉, where loḡ(M) = 1/2 + Σ_{m=2}^{M} 1/m.
The (estimated) K best actions are sampled O(L(x)/K) times each.
Characteristics: no parameter, but needs knowledge of L(x).
Theoretical results:
- Successive Rejects: P(error) = O(exp(−(L(x) − M)/(loḡ(M) H(x)))).
- Uniform: P(error) = O(exp(−(L(x)/M) Δ(x)^2)).
SR performs better than Uniform when H(x) ≪ M/Δ(x)^2.
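The SR phase lengths and the rejects loop can be sketched together; as before, rollout returns are modeled as hypothetical Bernoullis with means `q_means`.

```python
import math
import random

def successive_rejects(q_means, budget, rng):
    """Best-arm identification by Successive Rejects for one state."""
    m = len(q_means)
    logbar = 0.5 + sum(1.0 / i for i in range(2, m + 1))

    def n(k):
        # cumulative pulls per surviving arm by the end of phase k
        return math.ceil((budget - m) / (logbar * (m + 1 - k)))

    active, counts, sums = set(range(m)), [0] * m, [0.0] * m
    n_prev = 0
    for phase in range(1, m):
        for a in active:                       # pull surviving arms uniformly
            for _ in range(n(phase) - n_prev):
                sums[a] += float(rng.random() < q_means[a])
                counts[a] += 1
        n_prev = n(phase)
        worst = min(active, key=lambda a: sums[a] / counts[a])
        active.discard(worst)                  # reject the empirically worst arm
    return active.pop()

best = successive_rejects([0.1, 0.4, 0.9], budget=600, rng=random.Random(0))
```

Note how later phases are longer: the surviving (harder to distinguish) arms receive progressively more pulls, which is exactly what yields the H(x)-dependent error bound.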
26 UCB-E Audibert et al. (2010)
Parameter: an exploration parameter s > 0.
Definition of the index: for a ∈ A and c ≥ 1, let B_{a,0} = +∞ and
B_{a,c} = Q̂_c(x, a) + sqrt( s (L(x) − M) / (H(x) c) ).
Algorithm:
for each round t = 1, 2, …, L(x) do
  Explore the action with the highest index: perform a rollout from (x, A_t), where A_t ∈ arg max_{a ∈ A} B_{a,c(x,a)}.
end for
Output the action with the maximum estimated Q-value.
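A minimal sketch of this index rule, assuming H(x) is known and using hypothetical Bernoulli rollout returns; the exact bonus form follows the index on this slide as reconstructed.

```python
import math
import random

def ucb_e(q_means, budget, s, H, rng):
    """UCB-E for one state: each round, pull the arm with the highest
    optimistic index B_{a,c}; then output the empirically best arm."""
    m = len(q_means)
    counts, sums = [0] * m, [0.0] * m

    def index(a):
        if counts[a] == 0:
            return float("inf")  # B_{a,0} = +inf forces one pull of each arm
        mean = sums[a] / counts[a]
        return mean + math.sqrt(s * (budget - m) / (H * counts[a]))

    for _ in range(budget):
        a = max(range(m), key=index)
        sums[a] += float(rng.random() < q_means[a])
        counts[a] += 1
    return max(range(m), key=lambda a: sums[a] / counts[a])

q_means = [0.1, 0.4, 0.9]
H = sum(1.0 / (max(q_means) - q) ** 2 for q in q_means if q != max(q_means))
best = ucb_e(q_means, budget=600, s=1.0, H=H, rng=random.Random(0))
```

The Adaptive variant mentioned next would replace the known H with an online under-estimate, and the Any-time variant would replace the budget term with the current round.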
27 UCB-E
Characteristics: has an exploration parameter; needs knowledge of L(x) and H(x). P(error) = O(exp(−(L(x) − M)/H(x))).
Variants:
- Adaptive: online under-estimation of H(x).
- Any-time: replace L(x) with the current round t.
28 Outline Classification-based Policy Iteration Rollout Allocation Previous Works New Allocation Strategies over Actions Experiments
29 Domains
We used classical settings of Mountain Car and Inverted Pendulum, but with increased noise.
- Mountain Car (MC): formulation of Sutton & Barto (1998); noise increased to 1.2.
- Inverted Pendulum (IP): formulation of Lagoudakis & Parr (2003); noise increased to 15.
30 Artificially increasing the number of actions
The M player actions are mapped onto the few real actions (left action, null action, right action).
- Mapping in Mountain Car: one player action maps to forward, one to reverse, and all the rest to the stay action.
- Mapping in Inverted Pendulum: one maps to right, one to left, and the rest are stochastic actions with P(right) = P(left) = P(null) = 1/3.
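The action-inflation trick above can be sketched as a pair of mapping functions; the index convention (which player actions get the distinguished roles) is an assumption for illustration.

```python
import random

def mountain_car_action(a):
    """Hypothetical convention: player action 0 -> forward, 1 -> reverse,
    every other player action collapses to the stay action."""
    return {0: "forward", 1: "reverse"}.get(a, "stay")

def pendulum_action(a, rng):
    """Player actions beyond the first two are stochastic duplicates with
    P(right) = P(left) = P(null) = 1/3."""
    if a in (0, 1):
        return ("right", "left")[a]
    return rng.choice(["right", "left", "null"])

real = mountain_car_action(7)                   # a redundant player action
dup = pendulum_action(5, random.Random(0))      # a stochastic duplicate
```

This keeps the underlying dynamics fixed while letting M grow, so the experiments can vary the bandit difficulty without changing the control problem.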
31 Training set accuracy (SR vs Uniform)
32 Training set accuracy (SR vs Uniform)
Differences in the percentage of error over a fixed D_R. (Figure: grids over Actions (M) and Budget (L/MN), for several sizes of D_R.)
SR provides a more accurate training set than Uniform. Does this improvement propagate through iterations?
33 SR vs Uniform in Mountain Car
34 SR vs Uniform in Mountain Car
Differences in the performance of the policies (steps). (Figure: four panels for different sizes of D_R, including 20, 50, and 100 states; axes: Actions (M) vs Budget (L/MN).)
When the problem is neither too hard nor too easy, SR outperforms Uniform.
36 SR vs Uniform in Inverted Pendulum
37 SR vs Uniform in Inverted Pendulum
Differences in the performance of the policies (steps). (Figure: four panels for different sizes of D_R, including 50, 100, and 200 states; axes: Actions (M) vs Budget (L/MN).)
38 Adaptive UCB-E vs SR in Mountain Car
39 Adaptive UCB-E vs SR in Mountain Car
Differences in the performance of the policies with s = 1. (Figure: four panels for different sizes of D_R, including 20, 50, and 100 states; axes: Actions (M) vs Budget (L/MN).)
40 Adaptive UCB-E vs SR in Inverted Pendulum
41 Adaptive UCB-E vs SR in Inverted Pendulum
Differences in the performance of the policies with s = 1. (Figure: panels for different sizes of D_R, including 50 and 100 states; axes: Actions (M) vs Budget (L/MN).)
42 Any-time Adaptive UCB-E vs Uniform in Mountain Car
43 Any-time Adaptive UCB-E vs Uniform in Mountain Car
Differences in the performance of the policies with s = 1. (Figure: four panels for different sizes of D_R, including 20, 50, and 100 states; axes: Actions (M) vs Budget (L/MN).)
44 Any-time Adaptive UCB-E vs Uniform in IP
45 Any-time Adaptive UCB-E vs Uniform in IP
Differences in the performance of the policies with s = 1. (Figure: panels for different sizes of D_R, including 50 and 100 states; axes: Actions (M) vs Budget (L/MN).)
46 Final Comparison
47 Final Comparison
Mountain Car: 100 actions, 50 states. (Figure: average number of steps to the goal vs Budget (L/MN), comparing Uniform, Successive Rejects, Adaptive UCB-E, and Any-time Adaptive UCB-E.)
48 Future Work and Conclusions
Summary: we studied different rollout allocation strategies for RCPI problems with a fixed budget. The proposed strategies over actions (SR, UCB-E, ...) significantly outperform the uniform strategy.
Future work: address the issue of rollout allocation over states and over state-action pairs. Any-time Adaptive UCB-E could be merged with any strategy over states.
49 Discussion
50 Bibliography I
51 Dimitrakakis et al.
Construct a rollout set D_R = {x_i}_{i=1}^N with x_i ~iid ρ; rol = 0; D_T = ∅.
while |D_T| ≤ MAXSIZETRAINING and rol ≤ L do
  x ← SelectState(); c(x) ← c(x) + 1
  Sample all the actions in this state: rol ← rol + M
  if a dominating action â* exists in x then
    D_T ← D_T ∪ {((x, â*), +)}
    D_T ← D_T ∪ {((x, a), −)} for all a ≠ â*
    Replace x in D_R by y ~ ρ
  end if
end while
SelectState returns arg max_{x ∈ D_R} Δ̂(x) + UCB, where Δ̂(x) = min_{a ∈ A, a ≠ â*} Q̂^{π_k}(x, â*) − Q̂^{π_k}(x, a).
Significance test: â* is dominating if c(x) Δ̂(x)^2 ≥ K_st R_max log(1/δ), with δ an arbitrary accuracy parameter.
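The SelectState rule can be sketched as below; the exact exploration bonus is an assumption (a generic UCB-style term), since the slide leaves its form implicit, and the gap values are hypothetical.

```python
import math

def select_state(gap_hat, counts, total_pulls, c_ucb=1.0):
    """Pick the rollout-set state maximizing the empirical gap Delta-hat(x)
    plus an exploration bonus (the bonus form here is a hypothetical choice)."""
    def score(x):
        bonus = c_ucb * math.sqrt(math.log(total_pulls + 1) / max(counts[x], 1))
        return gap_hat[x] + bonus
    return max(gap_hat, key=score)

gap_hat = {"x1": 0.05, "x2": 0.60}   # hypothetical empirical gaps
counts = {"x1": 10, "x2": 10}
chosen = select_state(gap_hat, counts, total_pulls=20)
```

States with a large apparent gap are revisited first, so a dominating action can be certified quickly and the state swapped out of D_R.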
52 J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. COLT, 2010.
S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in multi-armed bandits problems. ALT, 2009.
C. Dimitrakakis and M. Lagoudakis. Rollout sampling approximate policy iteration. Machine Learning, 2008.
A. Fern, S. Yoon, and R. Givan. Approximate policy iteration with a policy language bias. NIPS, 2004.
M. Lagoudakis and R. Parr. Reinforcement learning as classification: Leveraging modern classifiers. ICML, 2003.
53 Dimitrakakis & Lagoudakis (2008)
The target of these strategies is to create an accurate training set with as few rollouts as possible. Their work focuses on the states where the greedy action seems easier to discriminate from the other actions. When a dominant action a* is found in state x, (x, a*) is immediately put in D_T and x is replaced by a new state x_new ~ ρ. There is no rollout allocation strategy over actions.
54 Regression-based Policy Iteration
More informationPOMDPs: Partially Observable Markov Decision Processes Advanced AI
POMDPs: Partially Observable Markov Decision Processes Advanced AI Wolfram Burgard Types of Planning Problems Classical Planning State observable Action Model Deterministic, accurate MDPs observable stochastic
More informationTemporal Abstraction in RL. Outline. Example. Markov Decision Processes (MDPs) ! Options
Temporal Abstraction in RL Outline How can an agent represent stochastic, closed-loop, temporally-extended courses of action? How can it act, learn, and plan using such representations?! HAMs (Parr & Russell
More informationOverview: Representation Techniques
1 Overview: Representation Techniques Week 6 Representations for classical planning problems deterministic environment; complete information Week 7 Logic programs for problem representations including
More informationONLINE LEARNING IN LIMIT ORDER BOOK TRADE EXECUTION
ONLINE LEARNING IN LIMIT ORDER BOOK TRADE EXECUTION Nima Akbarzadeh, Cem Tekin Bilkent University Electrical and Electronics Engineering Department Ankara, Turkey Mihaela van der Schaar Oxford Man Institute
More informationThe Value of Stochastic Modeling in Two-Stage Stochastic Programs
The Value of Stochastic Modeling in Two-Stage Stochastic Programs Erick Delage, HEC Montréal Sharon Arroyo, The Boeing Cie. Yinyu Ye, Stanford University Tuesday, October 8 th, 2013 1 Delage et al. Value
More informationLong Term Values in MDPs Second Workshop on Open Games
A (Co)Algebraic Perspective on Long Term Values in MDPs Second Workshop on Open Games Helle Hvid Hansen Delft University of Technology Helle Hvid Hansen (TU Delft) 2nd WS Open Games Oxford 4-6 July 2018
More information16 MAKING SIMPLE DECISIONS
253 16 MAKING SIMPLE DECISIONS Let us associate each state S with a numeric utility U(S), which expresses the desirability of the state A nondeterministic action a will have possible outcome states Result(a)
More informationCS 188: Artificial Intelligence. Outline
C 188: Artificial Intelligence Markov Decision Processes (MDPs) Pieter Abbeel UC Berkeley ome slides adapted from Dan Klein 1 Outline Markov Decision Processes (MDPs) Formalism Value iteration In essence
More informationMDPs: Bellman Equations, Value Iteration
MDPs: Bellman Equations, Value Iteration Sutton & Barto Ch 4 (Cf. AIMA Ch 17, Section 2-3) Adapted from slides kindly shared by Stuart Russell Sutton & Barto Ch 4 (Cf. AIMA Ch 17, Section 2-3) 1 Appreciations
More informationStat 260/CS Learning in Sequential Decision Problems. Peter Bartlett
Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Gittins Index: Discounted, Bayesian (hence Markov arms). Reduces to stopping problem for each arm. Interpretation as (scaled)
More informationTreatment Allocations Based on Multi-Armed Bandit Strategies
Treatment Allocations Based on Multi-Armed Bandit Strategies Wei Qian and Yuhong Yang Applied Economics and Statistics, University of Delaware School of Statistics, University of Minnesota Innovative Statistics
More informationRegret Minimization against Strategic Buyers
Regret Minimization against Strategic Buyers Mehryar Mohri Courant Institute & Google Research Andrés Muñoz Medina Google Research Motivation Online advertisement: revenue of modern search engine and
More informationCS 188: Artificial Intelligence
CS 188: Artificial Intelligence Markov Decision Processes Dan Klein, Pieter Abbeel University of California, Berkeley Non Deterministic Search Example: Grid World A maze like problem The agent lives in
More informationCS885 Reinforcement Learning Lecture 3b: May 9, 2018
CS885 Reinforcement Learning Lecture 3b: May 9, 2018 Intro to Reinforcement Learning [SutBar] Sec. 5.1-5.3, 6.1-6.3, 6.5, [Sze] Sec. 3.1, 4.3, [SigBuf] Sec. 2.1-2.5, [RusNor] Sec. 21.1-21.3, CS885 Spring
More informationStochastic Games and Bayesian Games
Stochastic Games and Bayesian Games CPSC 532l Lecture 10 Stochastic Games and Bayesian Games CPSC 532l Lecture 10, Slide 1 Lecture Overview 1 Recap 2 Stochastic Games 3 Bayesian Games 4 Analyzing Bayesian
More informationMicroeconomics II. CIDE, MsC Economics. List of Problems
Microeconomics II CIDE, MsC Economics List of Problems 1. There are three people, Amy (A), Bart (B) and Chris (C): A and B have hats. These three people are arranged in a room so that B can see everything
More informationSTP Problem Set 2 Solutions
STP 425 - Problem Set 2 Solutions 3.2.) Suppose that the inventory model is modified so that orders may be backlogged with a cost of b(u) when u units are backlogged for one period. We assume that revenue
More informationExtending MCTS
Extending MCTS 2-17-16 Reading Quiz (from Monday) What is the relationship between Monte Carlo tree search and upper confidence bound applied to trees? a) MCTS is a type of UCT b) UCT is a type of MCTS
More informationOnline Network Revenue Management using Thompson Sampling
Online Network Revenue Management using Thompson Sampling Kris Johnson Ferreira David Simchi-Levi He Wang Working Paper 16-031 Online Network Revenue Management using Thompson Sampling Kris Johnson Ferreira
More informationThe exam is closed book, closed calculator, and closed notes except your three crib sheets.
CS 188 Spring 2016 Introduction to Artificial Intelligence Final V2 You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your three crib sheets.
More informationm 11 m 12 Non-Zero Sum Games Matrix Form of Zero-Sum Games R&N Section 17.6
Non-Zero Sum Games R&N Section 17.6 Matrix Form of Zero-Sum Games m 11 m 12 m 21 m 22 m ij = Player A s payoff if Player A follows pure strategy i and Player B follows pure strategy j 1 Results so far
More informationAgricultural and Applied Economics 637 Applied Econometrics II
Agricultural and Applied Economics 637 Applied Econometrics II Assignment I Using Search Algorithms to Determine Optimal Parameter Values in Nonlinear Regression Models (Due: February 3, 2015) (Note: Make
More information91.420/543: Artificial Intelligence UMass Lowell CS Fall 2010
91.420/543: Artificial Intelligence UMass Lowell CS Fall 2010 Lecture 17 & 18: Markov Decision Processes Oct 12 13, 2010 A subset of Lecture 9 slides from Dan Klein UC Berkeley Many slides over the course
More informationLong-Term Values in MDPs, Corecursively
Long-Term Values in MDPs, Corecursively Applied Category Theory, 15-16 March 2018, NIST Helle Hvid Hansen Delft University of Technology Helle Hvid Hansen (TU Delft) MDPs, Corecursively NIST, 15/Mar/2018
More information