Reinforcement Learning and Simulation-Based Search

David Silver

Outline
1 Reinforcement Learning
2 Simulation-Based Search
3 Planning Under Uncertainty

Reinforcement Learning: Markov Decision Process
Definition: A Markov Decision Process is a tuple $\langle S, A, P, R \rangle$
S is a finite set of states
A is a finite set of actions
P is a state transition probability matrix, $P^a_{ss'} = P[s' \mid s, a]$
R is a reward function, $R^a_s = E[r \mid s, a]$
Assume for this talk that all sequences terminate, $\gamma = 1$
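
As a concrete reference point, here is a minimal sketch of how such a tuple $\langle S, A, P, R \rangle$ could be held in code; the container and field names are illustrative, not from the talk.

```python
# Minimal sketch of a finite MDP <S, A, P, R> with gamma = 1 (illustrative names only).
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    states: list    # S: finite set of states
    actions: list   # A: finite set of actions
    P: dict         # P[(s, a)] -> list of (next_state, probability) pairs
    R: dict         # R[(s, a)] -> expected immediate reward E[r | s, a]
    terminal: set   # absorbing states; all sequences are assumed to terminate
```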

Reinforcement Learning: Planning and Reinforcement Learning
Planning: given MDP $M$, maximise expected future reward
Reinforcement Learning: given $K$ sample sequences from MDP $M$,
$\{s_1, a_1^k, r_1^k, s_2^k, a_2^k, \ldots, s_{T_k}^k\}_{k=1}^K \sim M$,
maximise expected future reward

A simulator $M$ is a generative model of an MDP
Given a state $s_t$ and action $a_t$, the simulator can generate a next state $s_{t+1}$ and reward $r_{t+1}$
A simulator can be used to generate sequences of experience, starting from any root state $s_1$:
$\{s_1, a_1, r_1, s_2, a_2, \ldots, s_T\} \sim M$
Simulation-based search applies reinforcement learning to simulated experience
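
A minimal sketch of such a generative-model interface, assuming a hypothetical `step(s, a) -> (next_state, reward, done)` simulator and a simulation policy `pi(s) -> a`; it just rolls out one simulated episode from a root state.

```python
# Roll out one simulated episode {s1, a1, r1, s2, a2, ...} from a root state.
# `step` and `pi` are assumed interfaces, not from the talk.
def simulate_episode(step, pi, root_state, max_steps=1000):
    episode = []                      # list of (state, action, reward) triples
    s = root_state
    for _ in range(max_steps):
        a = pi(s)                     # sample an action from the simulation policy
        s_next, r, done = step(s, a)  # the simulator generates next state and reward
        episode.append((s, a, r))
        s = s_next
        if done:
            break
    return episode
```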

Monte-Carlo Search: Monte-Carlo Simulation
Given a model $M$ and a simulation policy $\pi(s, a) = \Pr(a \mid s)$
Simulate $K$ episodes from root state $s_1$:
$\{s_1, a_1^k, r_1^k, s_2^k, a_2^k, \ldots, s_T^k\}_{k=1}^K \sim M, \pi$
Evaluate the state by mean total reward (Monte-Carlo evaluation):
$V(s_1) = \frac{1}{K} \sum_{k=1}^K \sum_{t=1}^T r_t^k \xrightarrow{P} E\left[\sum_{t=1}^T r_t \mid s_1\right]$
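
This evaluation is easy to express in code; the sketch below reuses the hypothetical `simulate_episode`, `step` and `pi` from the previous sketch and simply averages the undiscounted total reward over $K$ episodes.

```python
# Monte-Carlo evaluation of the root state: mean total reward over K simulated episodes.
def mc_evaluate(step, pi, root_state, K=1000):
    total = 0.0
    for _ in range(K):
        episode = simulate_episode(step, pi, root_state)
        total += sum(r for (_, _, r) in episode)   # undiscounted return, gamma = 1
    return total / K                               # estimate of V(s1)
```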

Monte-Carlo Search: Simple Monte-Carlo Search
Given a model $M$ and a simulation policy $\pi$
For each action $a \in A$, simulate $K$ episodes from root state $s_1$ whose first action is $a$:
$\{s_1, a, r_1^k, s_2^k, a_2^k, \ldots, s_T^k\}_{k=1}^K \sim M, \pi$
Evaluate each action by mean total reward:
$Q(s_1, a) = \frac{1}{K} \sum_{k=1}^K \sum_{t=1}^T r_t^k \xrightarrow{P} E\left[\sum_{t=1}^T r_t \mid s_1, a\right]$
Select the real action with maximum value:
$a_1 = \operatorname{argmax}_{a \in A} Q(s_1, a)$
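
A minimal sketch of this procedure under the same assumed `step`/`pi`/`simulate_episode` interfaces: for each candidate first action, run $K$ roll-outs and return the action with the highest mean total reward.

```python
# Simple Monte-Carlo search: evaluate each first action by K roll-outs, then act greedily.
def simple_mc_search(step, pi, root_state, actions, K=100):
    def q_estimate(a):
        total = 0.0
        for _ in range(K):
            s, r0, done = step(root_state, a)   # force the first action to be a
            ret = r0
            if not done:                        # then follow the simulation policy pi
                ret += sum(r for (_, _, r) in simulate_episode(step, pi, s))
            total += ret
        return total / K                        # Q(s1, a) estimate
    return max(actions, key=q_estimate)         # real action with maximum value
```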

Monte-Carlo Search: Monte-Carlo Tree Search
Simulate sequences starting from root state $s_1$
Build a search tree containing all visited states
Repeat (each simulation):
Evaluate states $V(s)$ by the mean total reward of all sequences through node $s$
Improve the simulation policy by picking the child $s'$ with maximum $V(s')$
Converges on the optimal search tree, $V(s) \to V^*(s)$
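
The sketch below illustrates this loop under the same assumed simulator interface. To keep it short it stores mean total rewards for (state, action) pairs rather than child states, acts greedily inside the tree with no exploration bonus (a real implementation would add one, e.g. UCB), and expands one leaf per simulation before rolling out with the default policy.

```python
# Minimal Monte-Carlo tree search sketch: greedy tree policy, roll-out with pi,
# back up the total reward of each simulation to every (state, action) it visited.
from collections import defaultdict
import random

def mcts(step, pi, actions, root_state, n_simulations=1000):
    N = defaultdict(int)      # visit counts per (state, action)
    Q = defaultdict(float)    # mean total reward per (state, action)

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        return random.choice([a for a in actions if Q[(s, a)] == best])

    for _ in range(n_simulations):
        s, path, total, done = root_state, [], 0.0, False
        while not done and any(N[(s, a)] for a in actions):   # in-tree: act greedily
            a = greedy(s)
            path.append((s, a))
            s, r, done = step(s, a)
            total += r
        if not done:                                           # expand a new leaf
            a = random.choice(actions)
            path.append((s, a))
            s, r, done = step(s, a)
            total += r
            if not done:                                       # roll out with pi
                total += sum(r for (_, _, r) in simulate_episode(step, pi, s))
        for (s_i, a_i) in path:                                # back up mean total reward
            N[(s_i, a_i)] += 1
            Q[(s_i, a_i)] += (total - Q[(s_i, a_i)]) / N[(s_i, a_i)]

    return greedy(root_state)                                  # best real action at the root
```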

Monte-Carlo Search
[Figure: example search tree with alternating max and min levels; each node is labelled with its mean total reward and visit count (root 9/12), and the individual roll-out rewards (0s and 1s) are listed beneath the tree]

Monte-Carlo Search: Advantages of MC Tree Search
Highly selective, best-first search
Focused on the future
Uses sampling to break the curse of dimensionality
Works for black-box simulators (only requires samples)
Computationally efficient, anytime, parallelisable

Monte-Carlo Search: Disadvantages of MC Tree Search
Monte-Carlo estimates have high variance
No generalisation between related states

Temporal-Difference Search
Simulate sequences starting from root state $s_1$
Build a search tree containing all visited states
Repeat (each simulation):
Evaluate states $V(s)$ by temporal-difference learning
Improve the simulation policy by picking the child $s'$ with maximum $V(s')$
Converges on the optimal search tree, $V(s) \to V^*(s)$
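
The key difference from Monte-Carlo search is the evaluation step: values are updated online by bootstrapping rather than from complete returns. Below is a minimal TD(0) sketch of one simulated episode, with $\gamma = 1$; `step` is the assumed simulator above, `select_action` a hypothetical greedy simulation policy, and `V` a dict-like table (e.g. `collections.defaultdict(float)`).

```python
# One simulated episode of temporal-difference search: update V(s) towards r + V(s')
# after every step instead of waiting for the episode to end.
def td_search_episode(step, select_action, root_state, V, alpha=0.1):
    s, done = root_state, False
    while not done:
        a = select_action(s, V)            # e.g. greedy in V over sampled successors
        s_next, r, done = step(s, a)
        target = r + (0.0 if done else V[s_next])
        V[s] += alpha * (target - V[s])    # TD(0) update, gamma = 1
        s = s_next
```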

Temporal-Difference Search: Linear Temporal-Difference Search
Simulate sequences starting from root state $s_1$
Build a linear function approximator $V(s) = \phi(s)^\top \theta$ over all visited states
Repeat (each simulation):
Evaluate states $V(s)$ by linear temporal-difference learning
Improve the simulation policy by picking the child $s'$ with maximum $V(s')$
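
The linear variant replaces the table with a feature-based approximator, so related states generalise. A minimal sketch of the corresponding TD(0) parameter update, assuming a hypothetical feature function `phi(s)` that returns a NumPy vector.

```python
# Linear TD(0) update for V(s) = phi(s) . theta, with gamma = 1.
import numpy as np

def linear_td_update(theta, phi, s, r, s_next, done, alpha=0.01):
    v = phi(s) @ theta
    v_next = 0.0 if done else phi(s_next) @ theta
    td_error = r + v_next - v                # delta = r + V(s') - V(s)
    return theta + alpha * td_error * phi(s)
```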

Temporal-Difference Search: Demo

Planning Under Uncertainty
Consider a history $h_t$ of actions, observations and rewards: $h_t = a_1, o_1, r_1, \ldots, a_t, o_t, r_t$
What if the state $s$ is unknown? i.e. we only have some beliefs $b(s) = P(s \mid h_t)$
What if the MDP dynamics $P$ are unknown? i.e. we only have some beliefs $b(P) = p(P \mid h_t)$
What if the MDP reward function $R$ is unknown? i.e. we only have some beliefs $b(R) = p(R \mid h_t)$

Planning Under Uncertainty: Belief State MDP
Plan in an augmented state space over beliefs
Each action now transitions to a new belief state
This defines an enormous MDP over belief states
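
Each transition in this augmented MDP requires an exact belief update. The sketch below shows that update as a discrete Bayes filter over the full state space, which is exactly the expensive step the later slides avoid; `P[(s, a)]` (a list of `(s', prob)` pairs) and the observation likelihood `Z(o, s', a)` are illustrative names, not from the talk.

```python
# Exact belief update after taking action a and observing o (a discrete Bayes filter).
def belief_update(belief, a, o, P, Z):
    new_belief = {}
    for s, p_s in belief.items():
        for s_next, p_trans in P[(s, a)]:
            new_belief[s_next] = new_belief.get(s_next, 0.0) + p_s * p_trans * Z(o, s_next, a)
    norm = sum(new_belief.values())
    return {s: p / norm for s, p in new_belief.items()}   # normalise to P(s' | h, a, o)
```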

Planning Under Uncertainty: Histories and Belief States
[Figure: a history tree rooted at $\epsilon$, branching over actions $a_1, a_2$ and observations $o_1, o_2$ (nodes $a_1$, $a_1 o_1$, $a_1 o_1 a_2$, ...), shown alongside the corresponding belief tree with nodes $P(s)$, $P(s \mid a_1)$, $P(s \mid a_1 o_1)$, $P(s \mid a_1 o_1 a_1)$, ...]

Planning Under Uncertainty: Belief State Planning
We can apply simulation-based search to the belief state MDP, since these methods are effective in very large state spaces
Unfortunately, updating belief states is slow
Belief state planners cannot scale up to realistic problems

Planning Under Uncertainty: Root Sampling
Each simulation, pick one world from the root beliefs: sample a state, transition dynamics and reward function
Run the simulation as if that world were real
Build the plan in history space (fast!)
Evaluate histories $V(h)$, e.g. by Monte-Carlo evaluation
Improve the simulation policy, e.g. by greedy action selection: $a_t = \operatorname{argmax}_a V(h_t a)$
Beliefs are never updated during the search
But it still converges on the optimal search tree with respect to the beliefs, $V(h) \to V^*(h)$
Intuitively, it averages over different worlds, and the tree provides the filter
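
A minimal sketch of this idea under assumed interfaces: `sample_world()` draws one concrete world (initial state plus dynamics and rewards) from the root beliefs, `world.step(s, a)` returns `(next_state, observation, reward, done)`, and the search statistics live purely in history space; none of these names come from the talk.

```python
# Root sampling: sample one world per simulation, plan over histories, never update beliefs.
from collections import defaultdict

def root_sampling_search(sample_world, actions, pi, n_simulations=1000, max_steps=100):
    N = defaultdict(int)       # visit counts per (history, action)
    Q = defaultdict(float)     # mean total reward per (history, action)

    for _ in range(n_simulations):
        world, s = sample_world()                # one concrete world from the root beliefs
        h, path, total = (), [], 0.0
        for _ in range(max_steps):
            if any(N[(h, a)] for a in actions):
                a = max(actions, key=lambda a: Q[(h, a)])   # greedy in history values
            else:
                a = pi(s)                                   # default simulation policy
            s, o, r, done = world.step(s, a)                # simulate in the sampled world
            path.append((h, a))
            total += r
            h = h + (a, o)                                  # extend the history
            if done:
                break
        for (h_i, a_i) in path:                             # back up mean total reward
            N[(h_i, a_i)] += 1
            Q[(h_i, a_i)] += (total - Q[(h_i, a_i)]) / N[(h_i, a_i)]

    return max(actions, key=lambda a: Q[((), a)])           # best real action at the root
```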

Planning Under Uncertainty: Demo