CS 461: Machine Learning Lecture 8


CS 461: Machine Learning, Lecture 8
Dr. Kiri Wagstaff
kiri.wagstaff@calstatela.edu
2/23/08, CS 461, Winter 2008

Plan for Today
- Review: Clustering
- Reinforcement Learning
  - How different from supervised, unsupervised?
  - Key components
  - How to learn: deterministic, nondeterministic
- Homework 4 Solution

Review from Lecture 7
- Unsupervised Learning: why? how?
- K-means Clustering: iterative, sensitive to initialization, non-parametric, local optimum; Rand Index
- EM Clustering: iterative, sensitive to initialization, parametric, local optimum

Reinforcement Learning (Chapter 16)

What is Reinforcement Learning?
- Learning from interaction
- Goal-oriented learning
- Learning about, from, and while interacting with an external environment
- Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal

Supervised Learning
Training info = desired (target) outputs
Inputs -> Supervised Learning System -> Outputs
Error = (target output - actual output)

Reinforcement Learning
Training info = evaluations (rewards / penalties)
Inputs -> RL System -> Outputs (actions)
Objective: get as much reward as possible

Key Features of RL
- Learner is not told which actions to take
- Trial-and-error search
- Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
- The need to explore and exploit
- Considers the whole problem of a goal-directed agent interacting with an uncertain environment

Complete Agent (Learner)
- Temporally situated
- Continual learning and planning
- Objective is to affect the environment
- Environment is stochastic and uncertain
Agent-environment loop: the agent sends actions to the environment; the environment returns states and rewards to the agent.

Elements of an RL problem
- Policy: what to do
- Reward: what is good
- Value: what is good because it predicts reward
- Model of environment: what follows what

Some Notable RL Applications
- TD-Gammon (Tesauro): world's best backgammon program
- Elevator control (Crites & Barto): high-performance down-peak elevator controller
- Inventory management (Van Roy, Bertsekas, Lee, & Tsitsiklis): 10-15% improvement over industry-standard methods
- Dynamic channel assignment (Singh & Bertsekas; Nie & Haykin): high-performance assignment of radio channels to mobile telephone calls

TD-Gammon (Tesauro, 1992-1995)
- Start with a random network
- Play very many games against self
- Learn a value function from this simulated experience
- Action selection by 2-3 ply search
This produces arguably the best player in the world.

The Agent-Environment Interface
Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$
- Agent observes state at step $t$: $s_t \in S$
- produces action at step $t$: $a_t \in A(s_t)$
- gets resulting reward $r_{t+1} \in \mathbb{R}$ and resulting next state $s_{t+1}$
$$\ldots \; s_t \xrightarrow{a_t} r_{t+1},\, s_{t+1} \xrightarrow{a_{t+1}} r_{t+2},\, s_{t+2} \xrightarrow{a_{t+2}} r_{t+3},\, s_{t+3} \xrightarrow{a_{t+3}} \ldots$$

Elements of an RL problem
- $s_t$: state of agent at time $t$
- $a_t$: action taken at time $t$
- In $s_t$, action $a_t$ is taken, the clock ticks, reward $r_{t+1}$ is received, and the state changes to $s_{t+1}$
- Next-state probability: $P(s_{t+1} \mid s_t, a_t)$
- Reward probability: $p(r_{t+1} \mid s_t, a_t)$
- Initial state(s), goal state(s)
- Episode (trial): sequence of actions from initial state to goal
[Alpaydin 2004, The MIT Press]

The Agent Learns a Policy
Policy at step $t$, $\pi_t$: a mapping from states to action probabilities; $\pi_t(s, a)$ = probability that $a_t = a$ when $s_t = s$.
Reinforcement learning methods specify how the agent changes its policy as a result of experience.
Roughly, the agent's goal is to get as much reward as it can over the long run.

Getting the Degree of Abstraction Right
- Time: steps need not refer to fixed intervals of real time.
- Actions: low level (e.g., voltages to motors), high level (e.g., accept a job offer), mental (e.g., shift in focus of attention), etc.
- States: low-level sensations; or abstract, symbolic, based on memory, or subjective (e.g., the state of being surprised or lost).
- The environment is not necessarily unknown to the agent, only incompletely controllable.
- Reward computation is in the agent's environment because the agent cannot change it arbitrarily.

Goals and Rewards
- Goal specifies what we want to achieve, not how we want to achieve it (the "how" = policy)
- Reward: a scalar signal; surprisingly flexible
- The agent must be able to measure success: explicitly, and frequently during its lifespan

Returns
Suppose the sequence of rewards after step $t$ is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$. What do we want to maximize?
In general, we want to maximize the expected return, $E\{R_t\}$, for each step $t$.
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.
$$R_t = r_{t+1} + r_{t+2} + \cdots + r_T,$$
where $T$ is a final time step at which a terminal state is reached, ending an episode.

Returns for Continuing Tasks
Continuing tasks: interaction does not have natural episodes.
Discounted return:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$
where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.
$\gamma$ near 0: shortsighted; $\gamma$ near 1: farsighted.
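As a quick illustration of the discounted-return formula, here is a minimal Python sketch (not from the slides) that computes a truncated discounted return from a finite reward sequence; the rewards and discount rate are made-up values.

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1} for a finite list of rewards
    r_{t+1}, r_{t+2}, ... (a truncated version of the infinite sum)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Example: three +1 rewards with gamma = 0.9 -> 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], gamma=0.9))
```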

An Example
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.
As an episodic task, where the episode ends upon failure:
reward = +1 for each step before failure, so return = number of steps before failure.
As a continuing task with discounted return:
reward = -1 upon failure, 0 otherwise, so return = $-\gamma^k$ for $k$ steps before failure.
In either case, return is maximized by avoiding failure for as long as possible.

Another Example
Get to the top of the hill as quickly as possible.
reward = -1 for each step where not at the top of the hill, so return = -(number of steps before reaching the top of the hill).
Return is maximized by minimizing the number of steps to reach the top of the hill.

Markovian Examples
- Robot navigation
- Settlers of Catan
  State does contain: board layout; location of all settlements and cities; your resource cards; your development cards; memory of past resources acquired by opponents.
  State does not contain: knowledge of opponents' development cards; opponents' internal development plans.

Markov Decision Processes
If an RL task has the Markov property, it is a Markov Decision Process (MDP). If the state and action sets are finite, it is a finite MDP.
To define a finite MDP, you need:
- state and action sets
- one-step dynamics defined by transition probabilities:
$$P_{ss'}^{a} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\} \quad \text{for all } s, s' \in S,\ a \in A(s)$$
- reward expectations:
$$R_{ss'}^{a} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\} \quad \text{for all } s, s' \in S,\ a \in A(s)$$
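To make this definition concrete, here is a minimal Python sketch (not from the lecture) of one way a finite MDP's transition probabilities and expected rewards could be stored as nested dictionaries. The states, actions, and numbers are made up for illustration; the policy-evaluation and value-iteration sketches later in these notes reuse this representation.

```python
# A tiny, made-up finite MDP: P[s][a] maps next states to probabilities,
# and R[s][a][s2] gives the expected reward for that transition.
P = {
    "s0": {"go":   {"s0": 0.2, "s1": 0.8},
           "stay": {"s0": 1.0}},
    "s1": {"go":   {"s0": 1.0},
           "stay": {"s1": 1.0}},
}
R = {
    "s0": {"go":   {"s0": 0.0, "s1": 1.0},
           "stay": {"s0": 0.0}},
    "s1": {"go":   {"s0": 0.0},
           "stay": {"s1": 2.0}},
}

states = list(P)
actions = {s: list(P[s]) for s in states}
```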

An Example Finite MDP: Recycling Robot
At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
Decisions are made on the basis of the current energy level: high, low.
Reward = number of cans collected

Recycling Robot MDP
$S = \{\text{high}, \text{low}\}$
$A(\text{high}) = \{\text{search}, \text{wait}\}$
$A(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$
$R^{\text{search}}$ = expected number of cans while searching
$R^{\text{wait}}$ = expected number of cans while waiting
$R^{\text{search}} > R^{\text{wait}}$

Example: Drive a Car
States? Actions? Goal? Next-state probabilities? Reward probabilities?

Value Functions
The value of a state = expected return starting from that state; depends on the agent's policy.
State-value function for policy $\pi$:
$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\right\}$$
The value of taking an action in a state under policy $\pi$ = expected return starting from that state, taking that action, and then following $\pi$.
Action-value function for policy $\pi$:
$$Q^{\pi}(s, a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\right\}$$

Bellman Equation for a Policy $\pi$
The basic idea:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots = r_{t+1} + \gamma\,(r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots) = r_{t+1} + \gamma R_{t+1}$$
So:
$$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\}$$
Or, without the expectation operator:
$$V^{\pi}(s) = \sum_{a} \pi(s, a) \sum_{s'} P_{ss'}^{a}\,\left[R_{ss'}^{a} + \gamma V^{\pi}(s')\right]$$
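The last equation can be applied repeatedly as an update rule. Below is a minimal Python sketch of iterative policy evaluation, using the nested-dict MDP representation sketched earlier; the policy format, discount rate, and tolerance are illustrative assumptions, not the lecture's notation.

```python
def policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Iteratively apply the Bellman equation for a fixed policy.
    policy[s][a] is pi(s, a); P and R are the nested dicts from before."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(policy[s][a] *
                        sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                            for s2 in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```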

Golf
- State is ball location
- Reward of -1 for each stroke until the ball is in the hole
- Value of a state?
- Actions: putt (use putter), driver (use driver)
- putt succeeds anywhere on the green

Optimal Value Functions
For finite MDPs, policies can be partially ordered: $\pi \ge \pi'$ if and only if $V^{\pi}(s) \ge V^{\pi'}(s)$ for all $s \in S$.
Optimal policy = $\pi^*$
Optimal state-value function: $V^*(s) = \max_{\pi} V^{\pi}(s)$ for all $s \in S$
Optimal action-value function: $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$ for all $s \in S$ and $a \in A(s)$
This is the expected return for taking action $a$ in state $s$ and thereafter following an optimal policy.

Optimal Value Function for Golf
We can hit the ball farther with the driver than with the putter, but with less accuracy.
$Q^*(s, \text{driver})$ gives the value of using the driver first, then using whichever actions are best.

Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to $V^*$ is an optimal policy. Therefore, given $V^*$, one-step-ahead search produces the long-term optimal actions.
Given $Q^*$, the agent does not even have to do a one-step-ahead search:
$$\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)$$
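A one-line illustration of this greedy extraction, assuming Q is stored as a nested dict Q[s][a] (an assumption for this sketch, not the lecture's notation):

```python
def greedy_policy(Q):
    """pi*(s) = argmax_a Q(s, a), assuming Q[s][a] holds action values."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}

# Example with made-up values:
print(greedy_policy({"s0": {"go": 1.0, "stay": 0.2}}))  # {'s0': 'go'}
```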

Summary so far
- Agent-environment interaction: states, actions, rewards
- Policy: stochastic rule for selecting actions
- Return: the function of future rewards the agent tries to maximize
- Episodic and continuing tasks
- Markov Decision Process: transition probabilities, expected rewards
- Value functions: state-value fn for a policy, action-value fn for a policy, optimal state-value fn, optimal action-value fn
- Optimal value functions, optimal policies
- Bellman Equation

Model-Based Learning
The environment, $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$, is known.
There is no need for exploration; the problem can be solved using dynamic programming.
Solve for
$$V^*(s_t) = \max_{a_t} \left( E[r_{t+1} \mid s_t, a_t] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) \right)$$
Optimal policy:
$$\pi^*(s_t) = \arg\max_{a_t} \left( E[r_{t+1} \mid s_t, a_t] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) \right)$$
[Alpaydin 2004, The MIT Press]

Value Iteration
[Alpaydin 2004, The MIT Press]
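The original slide presents Alpaydin's value iteration pseudocode as a figure, which was not captured in this transcription. Below is a minimal tabular sketch in Python using the nested-dict MDP from earlier; it follows the Bellman optimality update above but is not the textbook's exact pseudocode.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Repeatedly apply V(s) <- max_a sum_s2 P(s2|s,a) [R(s,a,s2) + gamma V(s2)]
    until the value estimates stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                            for s2 in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```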

Policy Iteration
[Alpaydin 2004, The MIT Press]
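As with the previous slide, the policy iteration pseudocode appears only as a figure in the source. Here is a minimal sketch under the same assumptions, alternating policy evaluation (reusing the policy_evaluation function sketched earlier, with a deterministic policy expressed as probabilities) and greedy policy improvement.

```python
def policy_iteration(P, R, gamma=0.9):
    """Alternate policy evaluation and greedy improvement until stable."""
    # Start from an arbitrary deterministic policy (first listed action).
    pi = {s: next(iter(P[s])) for s in P}
    while True:
        as_probs = {s: {a: 1.0 if a == pi[s] else 0.0 for a in P[s]} for s in P}
        V = policy_evaluation(P, R, as_probs, gamma)
        stable = True
        for s in P:
            best = max(P[s], key=lambda a: sum(
                P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in P[s][a]))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V
```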

Temporal Difference Learning
- The environment, $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$, is not known: model-free learning.
- There is a need for exploration, to sample from $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$.
- Use the reward received in the next time step to update the value of the current state (action).
- The update is driven by the temporal difference between the value of the current action and the value discounted from the next state.
[Alpaydin 2004, The MIT Press]

Exploration Strategies
- ε-greedy: with probability ε, choose one action at random (uniformly); choose the best action with probability 1-ε.
- Probabilistic (softmax; all p > 0):
$$P(a \mid s) = \frac{\exp Q(s, a)}{\sum_{b=1}^{|A|} \exp Q(s, b)}$$
- Move smoothly from exploration to exploitation. Annealing: gradually reduce $T$ in
$$P(a \mid s) = \frac{\exp[Q(s, a)/T]}{\sum_{b=1}^{|A|} \exp[Q(s, b)/T]}$$
[Alpaydin 2004, The MIT Press]
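A minimal Python sketch of these two selection rules (not from the slides), assuming the action values for the current state are stored in a dict Q_s mapping actions to values:

```python
import math
import random

def epsilon_greedy(Q_s, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the best one."""
    if random.random() < epsilon:
        return random.choice(list(Q_s))   # explore
    return max(Q_s, key=Q_s.get)          # exploit

def softmax_action(Q_s, T=1.0):
    """Sample an action with probability proportional to exp(Q(s,a)/T)."""
    weights = [math.exp(Q_s[a] / T) for a in Q_s]
    return random.choices(list(Q_s), weights=weights, k=1)[0]
```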

Deterministic Rewards and Actions
Deterministic: single possible reward and next state.
$$Q(s_t, a_t) = r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
Used as an update rule (backup):
$$\hat{Q}(s_t, a_t) \leftarrow r_{t+1} + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1}, a_{t+1})$$
Updates happen only after reaching the reward (then they are "backed up").
Starting at zero, Q values increase, never decrease.
[Alpaydin 2004, The MIT Press]
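A minimal sketch of this deterministic backup as code; the dict-of-dicts Q table is an assumption for illustration, not the lecture's notation:

```python
def deterministic_backup(Q, s, a, r, s_next, gamma=0.9):
    """Q-hat(s, a) <- r + gamma * max_a2 Q-hat(s_next, a2).
    Assumes Q[s][a] is a nested dict of current estimates."""
    Q[s][a] = r + gamma * max(Q[s_next].values())
```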

γ = 0.9. Consider the value of the action marked by *:
- If path A is seen first: Q(*) = 0.9 × max(0, 81) ≈ 73. Then B is seen: Q(*) = 0.9 × max(100, 81) = 90.
- Or, if path B is seen first: Q(*) = 0.9 × max(100, 0) = 90. Then A is seen: Q(*) = 0.9 × max(100, 81) = 90.
Q values increase but never decrease.
[Alpaydin 2004, The MIT Press]

Nondeterministic Rewards and Actions
When next states and rewards are nondeterministic (there is an opponent or randomness in the environment), we keep running averages (expected values) instead of assignments.
Q-learning (Watkins and Dayan, 1992):
$$\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \eta\left(r_{t+1} + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1}, a_{t+1}) - \hat{Q}(s_t, a_t)\right)$$
Learning V (TD learning: Sutton, 1988):
$$V(s_t) \leftarrow V(s_t) + \eta\left(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right)$$
The term inside the parentheses is the backed-up estimate from the next step.
[Alpaydin 2004, The MIT Press]

Q-learning
[Alpaydin 2004, The MIT Press]
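The slide shows Alpaydin's tabular Q-learning pseudocode as a figure, which the transcription does not include. The following is a minimal sketch of one way the loop could look; the env object with reset() and step(a) methods is an assumed interface, not the lecture's, and it reuses the epsilon_greedy helper sketched above.

```python
from collections import defaultdict

def q_learning(env, actions, episodes=500, eta=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(lambda: {a: 0.0 for a in actions})
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q[s], epsilon)
            s_next, r, done = env.step(a)          # assumed env interface
            target = r + (0.0 if done else gamma * max(Q[s_next].values()))
            Q[s][a] += eta * (target - Q[s][a])    # running-average update
            s = s_next
    return Q
```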

TD-Gammon (Tesauro, 1992-1995)
Start with a random network; play very many games against self; learn a value function from this simulated experience; action selection by 2-3 ply search.

Program   Training games   Opponents   Results
TDG 1.0   300,000          3 experts   -13 pts / 51 games
TDG 2.0   800,000          5 experts   -7 pts / 38 games
TDG 2.1   1,500,000        1 expert    -1 pt / 40 games

Summary: Key Points for Today
- Reinforcement Learning
  - How different from supervised, unsupervised?
  - Key components: actions, states, transition probabilities, rewards
- Markov Decision Process
  - Episodic vs. continuing tasks
  - Value functions, optimal value functions
- Learn: policy (based on V, Q)
  - Model-based: value iteration, policy iteration
  - TD learning
    - Deterministic: backup rules (max)
    - Nondeterministic: TD learning, Q-learning (running average)

Homework 4 Solution

Next Time
Ensemble Learning (read Ch. 15.1-15.5)
Reading questions are posted on the website.