Rollout Allocation Strategies for Classification-based Policy Iteration


1 Rollout Allocation Strategies for Classification-based Policy Iteration. V. Gabillon, A. Lazaric & M. Ghavamzadeh. Workshop on Reinforcement Learning and Search in Very Large Spaces, June 25, 2010, Haifa, Israel.

2 Outline: Classification-based Policy Iteration; Rollout Allocation; Previous Works; New Allocation Strategies over Actions; Experiments.


4 Definitions I: Discounted Markov Decision Process (MDP). A discounted MDP $\mathcal{M}$ is a tuple $\langle X, A, r, p, \gamma \rangle$, where: the state space is $X$; the set of actions $A$ is finite ($|A| = M < \infty$); the transition model $p(\cdot \mid x, a)$ is a distribution over $X$; the reward function is $r : X \times A \times X \to \mathbb{R}$; and $\gamma \in (0, 1)$ is a discount factor. A deterministic policy is a mapping $\pi : X \to A$.

5 Definitions II: Rollout. Sampling a trajectory starting from state $x_0$ with action $a_0$ and then following $\pi$ yields the return
$R^\pi(x_0, a_0) = r(x_0, a_0, x_1) + \sum_{t > 0} \gamma^t\, r(x_t, \pi(x_t), x_{t+1})$, where $x_{t+1} \sim p(\cdot \mid x_t, a_t)$ and $a_t = \pi(x_t)$ for $t > 0$.
Its expectation is the action-value function: $\mathbb{E}[R^\pi(x, a)] = Q^\pi(x, a)$ for all $x \in X$ and $a \in A$.
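For concreteness, a minimal Python sketch of such a rollout estimator; it is not from the slides, and the generative model step(x, a) (returning a sampled next state and reward) and the policy pi(x) are assumed interfaces.

```python
def rollout_return(step, pi, x0, a0, gamma, horizon=200):
    """One truncated rollout: take a0 first, then follow pi, and accumulate
    the discounted rewards sum_t gamma^t * r_t."""
    ret, discount = 0.0, 1.0
    x, a = x0, a0
    for _ in range(horizon):
        x_next, r = step(x, a)        # generative model: samples x' ~ p(.|x, a) and r(x, a, x')
        ret += discount * r
        discount *= gamma
        x, a = x_next, pi(x_next)     # after the first step, follow pi
    return ret

def estimate_q(step, pi, x, a, gamma, n_rollouts):
    """Monte Carlo estimate of Q^pi(x, a): average of n_rollouts rollout returns."""
    returns = [rollout_return(step, pi, x, a, gamma) for _ in range(n_rollouts)]
    return sum(returns) / len(returns)
```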

6-8 Classification-based Policy Iteration. Without value-function representation! The policy improvement step is cast as a classification problem: from rollout estimates we build a training set of (state, greedy action) pairs that should be as accurate as possible, which raises the question of how to allocate the rollouts.

9 Outline: Classification-based Policy Iteration; Rollout Allocation; Previous Works; New Allocation Strategies over Actions; Experiments.

10 Rollout Allocation Problem.
The goal: learn the greedy policy, i.e., maximize the accuracy of the training set $D_T$ by identifying the greedy action in each state of the rollout set $D_R$.
The setting: a fixed budget $L$; at each iteration, $L$ rollouts are performed (using a generative model) to estimate the Q-values over $D_R$. The rollout set is sampled from a fixed distribution $\rho$ and is kept fixed across iterations.
The problem: how should we allocate the budget over actions, over states, and over state-action pairs?
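To make the setting concrete, below is a schematic sketch of one classification-based policy-iteration step under the uniform baseline allocation; estimate_q and train_classifier are assumed placeholders (any Monte Carlo Q-estimator and any multi-class classifier), not part of the original work.

```python
def cbpi_iteration(states, actions, budget, estimate_q, train_classifier):
    """One policy-improvement step cast as classification (uniform baseline).
    estimate_q(x, a, n) should average n rollout returns of the current policy."""
    per_pair = max(1, budget // (len(states) * len(actions)))  # c(x, a) under uniform allocation
    training_set = []
    for x in states:                                            # rollout set D_R, sampled once from rho
        q_hat = {a: estimate_q(x, a, per_pair) for a in actions}
        greedy = max(q_hat, key=q_hat.get)                      # estimated greedy action in x
        training_set.append((x, greedy))                        # D_T entry: (state, greedy-action label)
    return train_classifier(training_set)                       # the improved policy
```

The allocation strategies discussed next replace the uniform per-pair split in this loop.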

11 Example. What is the best strategy to maximize the accuracy of the training set with a fixed budget? Can we do better than just allocating the budget uniformly over the state-action pairs? [Figure: a rollout set of four states, each with three actions a1, a2, a3.]

12 Allocation Strategies.
Strategies over actions: the budget is allocated uniformly over states; in each state we then face a pure-exploration bandit problem.
Strategies over states: the budget of each state is allocated uniformly over its actions; the question is how many rollouts each state should receive (e.g., allocate more rollouts to states where the greedy action is harder to discriminate).
Strategies over states and actions: can we merge strategies from the two problems above, or should we design something new?


14 Outline: Classification-based Policy Iteration; Rollout Allocation; Previous Works; New Allocation Strategies over Actions; Experiments.

15 Previous Works. Rollout Classification-based Policy Iteration (RCPI) was introduced by Lagoudakis & Parr (2003) and Fern et al. (2004); there, the rollout allocation was uniform over states and actions. Dimitrakakis & Lagoudakis (2008) proposed rollout allocation strategies over states; the goal of their work is to build an accurate training set with as few rollouts as possible.

16 Outline: Classification-based Policy Iteration; Rollout Allocation; Previous Works; New Allocation Strategies over Actions; Experiments.

17 Rollout Allocation Strategies over Actions.
Pure-exploration multi-armed bandit: a player has $L$ rounds to explore $M$ arms/actions and then outputs the one with the highest estimated expectation. The number of rounds may be unknown.
The setting: the budget is allocated uniformly over states, $L(x) = L/N$. In each state $x$, the player estimates $Q^\pi(x, a)$ for every $a \in A$. After $c(x, a)$ rollouts, the estimate is
$\hat{Q}^\pi_{c(x,a)}(x, a) = \frac{1}{c(x,a)} \sum_{j=1}^{c(x,a)} R^\pi_j(x, a)$,
and at the end of the budget the player outputs $\hat{a}^* = \arg\max_{a \in A} \hat{Q}^\pi_{L(x)}(x, a)$.

18 Rollout Allocation Strategies over Actions.
The goal: minimize the probability of not identifying the greedy action w.r.t. $\pi$ at a state $x$ with a fixed number of rollouts:
$\mathbb{P}(\text{error}) = \mathbb{P}[\hat{a}^* \neq a^*] = \mathbb{P}\big[\arg\max_{a \in A} \hat{Q}^\pi_{L(x)}(x, a) \neq \arg\max_{a \in A} Q^\pi(x, a)\big]$.
Bubeck et al. (2009) proved that cumulative-regret minimizers (e.g., UCB) are not optimal solutions for this problem.
Notation: $\Delta(x, a) = Q(x, a^*) - Q(x, a)$, $\Delta(x) = \min_{a \in A,\, a \neq a^*} \Delta(x, a)$, and $H(x) = \sum_{a \in A,\, a \neq a^*} \frac{1}{\Delta(x, a)^2}$.
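As a small illustration of these quantities, the following sketch computes $\Delta(x,a)$, $\Delta(x)$ and $H(x)$ from a dictionary of Q-values; the numbers in the usage example are made up.

```python
def gaps_and_hardness(q):
    """q: dict mapping action -> Q^pi(x, a). Returns (Delta(x, a), Delta(x), H(x))."""
    best_a = max(q, key=q.get)
    deltas = {a: q[best_a] - q[a] for a in q if a != best_a}   # Delta(x, a) for suboptimal a
    delta_x = min(deltas.values())                             # Delta(x): smallest gap
    hardness = sum(1.0 / d ** 2 for d in deltas.values())      # H(x): hardness of the state
    return deltas, delta_x, hardness

# Example: three actions; the smaller the gaps, the larger H(x).
print(gaps_and_hardness({"a1": 1.0, "a2": 0.8, "a3": 0.2}))
```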

19 Successive Rejects (SR), Audibert et al. (2010).
$L(x)$ is split into $M - 1$ phases of lengths $n_1, \dots, n_{M-1}$. Let $A_x = A$ be the set of actions still under consideration.
For $m = 1, 2, \dots, M - 1$:
pull the remaining actions uniformly: for each $a \in A_x$, perform $n_m - n_{m-1}$ rollouts from $(x, a)$;
stop considering the worst action: remove $\arg\min_{a \in A_x} \hat{Q}_{n_m}(x, a)$ from $A_x$.
Return the unique element of $A_x$.
[Figure: bar chart of the number of pulls per action, showing how the per-phase budgets $n_1, n_1 + n_2, \dots$ sum to $L(x)$.]


25 Successive Rejects (SR).
The phase lengths are defined, for $m \in \{1, \dots, M-1\}$, as
$n_m = \Big\lceil \frac{1}{\overline{\log}(M)} \cdot \frac{L(x) - M}{M + 1 - m} \Big\rceil$, with $\overline{\log}(M) = \frac{1}{2} + \sum_{m=2}^{M} \frac{1}{m}$.
The (estimated) $K$ best actions are sampled $O(L(x)/K)$ times.
Characteristics: no parameter to tune, but knowledge of $L(x)$ is needed.
Theoretical results: $\mathbb{P}(\text{error})$ is $O\big(\exp\big(-\frac{L(x) - M}{\overline{\log}(M)\, H(x)}\big)\big)$ for Successive Rejects and $O\big(\exp\big(-\frac{L(x)\, \Delta(x)^2}{M}\big)\big)$ for Uniform, so SR performs better than Uniform when $H(x) \ll M / \Delta(x)^2$.
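Combining the algorithm and the phase lengths above, a minimal sketch of Successive Rejects for a single state might look as follows; rollout(a) is an assumed callable returning one sampled return $R^\pi(x, a)$, and the max(1, ...) guard for very small budgets is an addition of this sketch.

```python
import math

def successive_rejects(actions, budget, rollout):
    """Best-action identification in one state with Successive Rejects
    (Audibert et al., 2010). rollout(a) returns one sampled return R^pi(x, a)."""
    M = len(actions)
    log_bar = 0.5 + sum(1.0 / m for m in range(2, M + 1))          # log-bar(M)

    def n(m):
        # Phase length n_m as defined above; max(1, ...) guards tiny budgets.
        return max(1, math.ceil((budget - M) / (log_bar * (M + 1 - m))))

    remaining = list(actions)
    sums = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    prev = 0
    for m in range(1, M):                                          # M - 1 phases
        for a in remaining:                                        # n_m - n_{m-1} new rollouts each
            for _ in range(n(m) - prev):
                sums[a] += rollout(a)
                counts[a] += 1
        prev = n(m)
        worst = min(remaining, key=lambda a: sums[a] / counts[a])  # lowest empirical Q-value
        remaining.remove(worst)
    return remaining[0]                                            # estimated greedy action
```

Called with rollout bound to the generative model of a given state, this returns the action SR would put in the training set for that state.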

26 UCB-E, Audibert et al. (2010).
Parameter: an exploration parameter $s > 0$.
Definition of the index: for $a \in A$, for $c \geq 1$ and with $B_{a,0} = +\infty$, let
$B_{a,c} = \hat{Q}_c(x, a) + \sqrt{\frac{s\,(L(x) - M)}{H(x)\, c}}$.
Algorithm: for each round $t = 1, 2, \dots, L(x)$, explore the action with the highest index, i.e., perform a rollout from $(x, A_t)$ with $A_t \in \arg\max_{a \in A} B_{a,\, c(x,a)}$.
Output the action with the maximum estimated Q-value.

27 UCB-E, Audibert et al. (2010).
Characteristics: it has an exploration parameter and needs knowledge of both $L(x)$ and $H(x)$; $\mathbb{P}(\text{error}) = O\big(\exp\big(-\frac{L(x) - M}{H(x)}\big)\big)$.
Versions: Adaptive (online under-estimation of $H(x)$) and Any-time (replace $L(x)$ with the current round $t$).
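A minimal sketch of UCB-E for a single state, following the index reconstructed above; the exact exploration width $\sqrt{s\,(L(x)-M)/(H(x)\,c)}$ is a reading of the garbled slide and should be treated as an assumption, and the adaptive and any-time variants are only indicated in comments.

```python
import math

def ucb_e(actions, budget, rollout, hardness, s=1.0):
    """Best-action identification with UCB-E (Audibert et al., 2010) in one state.
    hardness is H(x); rollout(a) returns one sampled return R^pi(x, a)."""
    M = len(actions)
    sums = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}

    def index(a):
        if counts[a] == 0:
            return float("inf")                       # B_{a,0} = +infinity
        mean = sums[a] / counts[a]
        # Exploration width read as s * (L(x) - M) / (H(x) * c(x, a)).
        # Adaptive UCB-E would re-estimate H(x) online instead of taking it as input;
        # the any-time variant replaces (budget - M) by the current round t.
        return mean + math.sqrt(s * max(budget - M, 0) / (hardness * counts[a]))

    for _ in range(budget):                           # L(x) rounds
        a_t = max(actions, key=index)                 # pull the action with the highest index
        sums[a_t] += rollout(a_t)
        counts[a_t] += 1
    return max(actions,
               key=lambda a: sums[a] / counts[a] if counts[a] else float("-inf"))
```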

28 Outline: Classification-based Policy Iteration; Rollout Allocation; Previous Works; New Allocation Strategies over Actions; Experiments.

29 Domains. We used the classical settings of Mountain Car and Inverted Pendulum, but with increased noise.
Mountain Car (MC): formulation from Sutton & Barto (1998), noise increased to 1.2.
Inverted Pendulum (IP): formulation from Lagoudakis & Parr (2003), noise increased to 15.

30 Artificially increasing the number of actions.
[Figure: the enlarged set of player actions is mapped onto the real actions (left, null, right).]
Mapping in Mountain Car: one player action maps to forward, one to reverse, and all the others to the stay action.
Mapping in Inverted Pendulum: one maps to right, one to left, and the rest are stochastic actions with $P(\text{right}) = P(\text{left}) = P(\text{null}) = \frac{1}{3}$.
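A small sketch of this action-duplication trick for Mountain Car, assuming a generative model step(x, real_action) whose real actions are labeled "forward", "reverse" and "stay"; the Inverted Pendulum mapping would instead make the surplus actions sample uniformly among right, left and null.

```python
def make_expanded_step(step, n_actions):
    """Wrap a 3-action Mountain Car model so it exposes n_actions player actions:
    action 0 -> forward, action 1 -> reverse, every other action -> stay."""
    def expanded_step(x, a):
        if a == 0:
            return step(x, "forward")
        if a == 1:
            return step(x, "reverse")
        return step(x, "stay")   # the remaining n_actions - 2 actions are copies of stay
    return expanded_step
```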

31-32 Training set accuracy (SR vs Uniform).
[Figure: differences in the percentage of error over a fixed $D_R$, as a function of the number of actions (M) and the budget (L/MN), for several sizes of $D_R$.]
SR provides a more accurate training set than Uniform. Does this improvement propagate through iterations?

33-35 SR vs Uniform in Mountain Car.
[Figure: differences in the performance of the policies (steps), as a function of the number of actions (M) and the budget (L/MN), for rollout sets of different sizes (including 20, 50 and 100 states).]
When the problem is neither too hard nor too easy, SR outperforms Uniform.

36-37 SR vs Uniform in Inverted Pendulum.
[Figure: differences in the performance of the policies (steps), as a function of the number of actions (M) and the budget (L/MN), for rollout sets of different sizes (including 50, 100 and 200 states).]

38-39 Adaptive UCB-E vs SR in Mountain Car.
[Figure: differences in the performance of the policies with s = 1, as a function of the number of actions (M) and the budget (L/MN), for rollout sets of different sizes (including 20, 50 and 100 states).]

40-41 Adaptive UCB-E vs SR in Inverted Pendulum.
[Figure: differences in the performance of the policies with s = 1, as a function of the number of actions (M) and the budget (L/MN), for rollout sets of different sizes (including 50 and 100 states).]

42-43 Any-time Adaptive UCB-E vs Uniform in Mountain Car.
[Figure: differences in the performance of the policies with s = 1, as a function of the number of actions (M) and the budget (L/MN), for rollout sets of different sizes (including 20, 50 and 100 states).]

44-45 Any-time Adaptive UCB-E vs Uniform in Inverted Pendulum.
[Figure: differences in the performance of the policies with s = 1, as a function of the number of actions (M) and the budget (L/MN), for rollout sets of different sizes (including 50 and 100 states).]

46-47 Final Comparison.
[Figure: Mountain Car with 100 actions and 50 states; average number of steps to the goal as a function of the budget (L/MN), comparing Uniform, Successive Rejects, Adaptive UCB-E and Any-time Adaptive UCB-E.]

48 Future Work and Conclusions.
Summary: we studied different rollout allocation strategies for RCPI with a fixed budget; the proposed strategies over actions (SR, UCB-E, ...) significantly outperform the uniform strategy.
Future work: address rollout allocation over states and over state-action pairs; Any-time Adaptive UCB-E could be combined with any strategy over states.

49 Discussion

50 Bibliography I

51 Dimitrakakis et al.
Construct a rollout set $D_R = \{x_i\}_{i=1}^N$ with $x_i \overset{iid}{\sim} \rho$; set $rol = 0$ and $D_T = \emptyset$.
While $|D_T| \leq$ MAXSIZETRAINING and $rol \leq L$ do:
$x \leftarrow$ SelectState(); $c(x) \leftarrow c(x) + 1$;
sample all the actions in this state: $rol \leftarrow rol + M$;
if a dominating action $\hat{a}^*$ exists in $x$ then
$D_T \leftarrow D_T \cup \{((x, \hat{a}^*), +)\}$ and $D_T \leftarrow D_T \cup \{((x, a), -)\}$ for all $a \neq \hat{a}^*$,
and replace $x$ in $D_R$ by $y \sim \rho$.
SelectState: return $\arg\max_{x \in D_R} \big(\hat{\Delta}(x) + \text{UCB}(x)\big)$, where $\hat{\Delta}(x) = \min_{a \in A,\, a \neq \hat{a}^*} \big(\hat{Q}^{\pi_k}(x, \hat{a}^*) - \hat{Q}^{\pi_k}(x, a)\big)$.
Significance test: $\hat{a}^*$ is dominating if $c(x)\, \hat{\Delta}(x)^2 \geq K_{st}\, R_{max} \log(1/\delta)$, with $\delta$ an arbitrary accuracy parameter.
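A compact sketch of the state-selection rule and significance test described above; the accessors q_hats and ucb_width are assumed placeholders, and the test mirrors the (partly reconstructed) condition $c(x)\,\hat{\Delta}(x)^2 \geq K_{st}\,R_{max}\log(1/\delta)$.

```python
import math

def empirical_gap(q_hat):
    """Delta-hat(x): gap between the empirical best action and the runner-up."""
    best = max(q_hat, key=q_hat.get)
    return min(q_hat[best] - q_hat[a] for a in q_hat if a != best)

def select_state(rollout_set, q_hats, ucb_width):
    """Pick the state maximizing empirical gap plus exploration bonus UCB(x)."""
    return max(rollout_set, key=lambda x: empirical_gap(q_hats[x]) + ucb_width(x))

def is_dominating(q_hat, c_x, k_st, r_max, delta):
    """Significance test from the slide: c(x) * Delta-hat(x)^2 >= K_st * R_max * log(1/delta)."""
    return c_x * empirical_gap(q_hat) ** 2 >= k_st * r_max * math.log(1.0 / delta)
```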

52 References.
J.-Y. Audibert, S. Bubeck & R. Munos. Best arm identification in multi-armed bandits. COLT, 2010.
S. Bubeck, R. Munos & G. Stoltz. Pure exploration in multi-armed bandits problems. ALT, 2009.
C. Dimitrakakis & M. Lagoudakis. Rollout sampling approximate policy iteration. Machine Learning, 72(3), 2008.
A. Fern, S. Yoon & R. Givan. Approximate policy iteration with a policy language bias. NIPS, 2004.
M. Lagoudakis & R. Parr. Reinforcement learning as classification: Leveraging modern classifiers. ICML, 2003.
R. Sutton & A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

53 Dimitrakakis & Lagoudakis (2008).
The target of these strategies is to create an accurate training set with as few rollouts as possible. Their work focuses on the states where the greedy action seems easier to discriminate from the other actions. When a dominant action $\hat{a}^*$ is found in state $x$, $(x, \hat{a}^*)$ is immediately put in $D_T$ and $x$ is replaced by a new state $x_{new} \sim \rho$. They propose no rollout allocation strategy over actions.

54 Regression-based Policy Iteration
