Lecture 17: More on Markov Decision Processes. Reinforcement learning


COMP-424, Lecture 17 - March 18, 2013

Outline:
- Reinforcement learning
- Learning a model: maximum likelihood
- Learning a value function directly: Monte Carlo, temporal-difference (TD) learning

Recall: MDPs, Policies, Value Functions

- An MDP consists of states S, actions A, rewards $r_a(s)$ and transition probabilities $T_a(s, s')$
- A policy $\pi$ describes how actions are picked at each state: $\pi(s, a) = P(a_t = a \mid s_t = s)$
- The value function of a policy, $V^\pi$, is defined as:
  $V^\pi(s) = E_\pi[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots]$
- We can find $V^\pi$ by solving a linear system of equations
- Policy iteration gives a greedy local search procedure based on the value of policies
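
As a concrete illustration of the "solve a linear system" remark, here is a minimal sketch (not from the lecture; all numbers are made-up toy values): for a fixed policy, the Bellman equations $V = r^\pi + \gamma P^\pi V$ can be solved directly as $(I - \gamma P^\pi) V = r^\pi$.

    import numpy as np

    # Hypothetical 3-state MDP under a fixed policy pi:
    # P_pi[s, s'] is the transition probability, r_pi[s] the expected reward.
    P_pi = np.array([[0.8, 0.2, 0.0],
                     [0.1, 0.6, 0.3],
                     [0.0, 0.0, 1.0]])
    r_pi = np.array([1.0, 0.0, 5.0])
    gamma = 0.9

    # Bellman equations for a fixed policy: V = r_pi + gamma * P_pi V
    # Rearranged as a linear system: (I - gamma * P_pi) V = r_pi
    V = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
    print(V)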

Optimal Policies and Optimal Value Functions

- Our goal is to find a policy that has maximum expected utility, i.e. maximum value. Does policy iteration fulfill this goal?
- The optimal value function $V^*$ is defined as the best value that can be achieved at any state:
  $V^*(s) = \max_\pi V^\pi(s)$
- In a finite MDP, there exists a unique optimal value function (shown by Bellman, 1957)
- Any policy that achieves the optimal value function is called an optimal policy
- There is always at least one deterministic optimal policy
- Both value iteration and policy iteration can be used to obtain an optimal value function.

Main idea

Turn the recursive Bellman equations into update rules, e.g. value iteration:
1. Start with an arbitrary initial approximation $V_0$
2. On each iteration, update the value function estimate:
   $V_{k+1}(s) \leftarrow \max_a \left( r_a(s) + \gamma \sum_{s'} T_a(s, s') V_k(s') \right), \quad \forall s$
3. Stop when the maximum value change between iterations is below a threshold

The algorithm converges (in the limit) to the true $V^*$. A similar update is used for policy evaluation.
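
To make the update rule concrete, here is a minimal value-iteration sketch; the two-state MDP (arrays T, r), discount, and threshold are made-up toy values, not anything from the lecture.

    import numpy as np

    # Hypothetical MDP: T[a, s, s'] transition probabilities, r[a, s] rewards.
    T = np.array([[[0.9, 0.1], [0.0, 1.0]],      # action 0
                  [[0.2, 0.8], [0.5, 0.5]]])     # action 1
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma, theta = 0.9, 1e-6

    V = np.zeros(2)                               # arbitrary initial approximation V_0
    while True:
        # Q[a, s] = r_a(s) + gamma * sum_{s'} T_a(s, s') V(s')
        Q = r + gamma * T @ V
        V_new = Q.max(axis=0)                     # greedy max over actions
        if np.max(np.abs(V_new - V)) < theta:     # stop when the max change is small
            break
        V = V_new
    print(V, Q.argmax(axis=0))                    # optimal values and a greedy policy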

A More Efficient Algorithm

- Instead of updating all states on every iteration, focus on important states
- Here, we can define "important" as visited often
  E.g., board positions that occur in every game, rather than just once in 100 games
- Asynchronous dynamic programming:
  - Generate trajectories through the MDP
  - Update states whenever they appear on such a trajectory
- This focuses the updates on states that are actually possible.
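
A rough sketch of the asynchronous idea, assuming we already have a model to sample trajectories from; the toy MDP below is the same invented one as in the value-iteration sketch, and the episode count and length are arbitrary.

    import numpy as np

    # Same hypothetical toy MDP as in the value-iteration sketch above.
    T = np.array([[[0.9, 0.1], [0.0, 1.0]],
                  [[0.2, 0.8], [0.5, 0.5]]])
    r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
    gamma = 0.9

    rng = np.random.default_rng(0)
    n_actions, n_states = T.shape[0], T.shape[1]
    V = np.zeros(n_states)

    for episode in range(100):
        s = rng.integers(n_states)                 # start a trajectory somewhere
        for t in range(20):                        # hypothetical episode length
            # Bellman backup, but only for the state currently being visited
            Q_s = r[:, s] + gamma * T[:, s, :] @ V
            V[s] = Q_s.max()
            a = Q_s.argmax()                       # act greedily to generate the trajectory
            s = rng.choice(n_states, p=T[a, s])    # sample the next state from the model
    print(V)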

How Is Learning Tied with Dynamic Programming?

- Observe transitions in the environment, learn an approximate model $\hat{r}_a(s)$, $\hat{T}_a(s, s')$:
  - Use maximum likelihood to compute probabilities
  - Use supervised learning for the rewards
- Pretend the approximate model is correct and use it for any dynamic programming method
- This approach is called model-based reinforcement learning
- Many believers, especially in the robotics community
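
A sketch of what learning such an approximate model could look like, using transition counts (maximum likelihood, discussed next) and running reward averages; the experience tuples are invented for illustration.

    from collections import defaultdict

    # Hypothetical experience: a list of observed (s, a, reward, s_next) transitions.
    experience = [(0, 0, 1.0, 0), (0, 0, 0.0, 1), (1, 1, 2.0, 1), (0, 1, 0.0, 1)]

    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next]
    reward_sum = defaultdict(float)                  # running sums for reward averages
    reward_n = defaultdict(int)

    for s, a, rew, s_next in experience:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += rew
        reward_n[(s, a)] += 1

    # Maximum-likelihood transition probabilities and average rewards
    T_hat = {sa: {sn: c / sum(nexts.values()) for sn, c in nexts.items()}
             for sa, nexts in counts.items()}
    r_hat = {sa: reward_sum[sa] / reward_n[sa] for sa in reward_n}
    print(T_hat)
    print(r_hat)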

Simplest Case

- We have a coin X that can land in two positions (head or tail)
- Let $P(X = H) = \theta$ be the unknown probability of the coin landing heads
- In this case, X is a Bernoulli (binomial) random variable
- Given a sequence of independent tosses $x_1, x_2, \dots, x_m$ we want to estimate $\theta$.

More Generally: Statistical Parameter Fitting

- Given instances $x_1, \dots, x_m$ that are independently and identically distributed (i.i.d.):
  - The set of possible values for each variable in each instance is known
  - Each instance is obtained independently of the other instances
  - Each instance is sampled from the same distribution
- Find a set of parameters $\theta$ such that the data can be summarized by a probability $P(x_j \mid \theta)$
- $\theta$ depends on the family of probability distributions we consider (e.g. binomial, multinomial, Gaussian, etc.)

Coin Toss Example

Suppose you see the sequence: H, T, H, H, H, T, H, H, H, T

Which of these values of $P(X = H) = \theta$ do you think is best?
0.2, 0.5, 0.7, 0.9

How Good Is a Parameter Set?

- It depends on how likely it is to generate the observed data
- Let D be the data set (all the instances)
- The likelihood of parameter set $\theta$ given data set D is defined as:
  $L(\theta \mid D) = P(D \mid \theta)$
- If the instances are i.i.d., we have:
  $L(\theta \mid D) = P(D \mid \theta) = P(x_1, x_2, \dots, x_m \mid \theta) = \prod_{j=1}^m P(x_j \mid \theta)$

Example: Coin Tossing

Suppose you see the following data: D = H, T, H, T, T

What is the likelihood for a parameter $\theta$?
$L(\theta \mid D) = \theta (1-\theta) \theta (1-\theta) (1-\theta) = \theta^{N_H} (1-\theta)^{N_T}$

Sufficient Statistics

- To compute the likelihood in the coin tossing example, we only need to know N(H) and N(T) (the number of heads and tails)
- We say that N(H) and N(T) are sufficient statistics for this probabilistic model (binomial distribution)
- In general, a sufficient statistic of the data is a function of the data that summarizes enough information to compute the likelihood
- Formally, s(D) is a sufficient statistic if, for any two data sets D and D':
  $s(D) = s(D') \;\Rightarrow\; L(\theta \mid D) = L(\theta \mid D')$

Maximum Likelihood Estimation (MLE)

- Choose parameters that maximize the likelihood function
- We want to maximize:
  $L(\theta \mid D) = \prod_{j=1}^m P(x_j \mid \theta)$
- This is a product, and products are hard to maximize! The standard trick is to maximize $\log L(\theta \mid D)$ instead:
  $\log L(\theta \mid D) = \sum_{j=1}^m \log P(x_j \mid \theta)$
- To maximize, we take the derivatives of this function with respect to $\theta$ and set them to 0

MLE Applied to the Binomial Data

- The likelihood is: $L(\theta \mid D) = \theta^{N(H)} (1-\theta)^{N(T)}$
- The log likelihood is: $\log L(\theta \mid D) = N(H) \log \theta + N(T) \log(1-\theta)$
- Take the derivative of the log likelihood and set it to 0:
  $\frac{\partial}{\partial \theta} \log L(\theta \mid D) = \frac{N(H)}{\theta} - \frac{N(T)}{1-\theta} = 0$
- Solving this gives
  $\theta = \frac{N(H)}{N(H) + N(T)}$
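
A quick numerical check of this closed form on the coin data from the earlier example (D = H, T, H, T, T); the grid search is only there to confirm the formula.

    import numpy as np

    # Data from the coin-tossing example above: D = H, T, H, T, T
    D = ["H", "T", "H", "T", "T"]
    N_H, N_T = D.count("H"), D.count("T")

    def log_likelihood(theta):
        return N_H * np.log(theta) + N_T * np.log(1 - theta)

    # Closed-form MLE: theta = N(H) / (N(H) + N(T))
    theta_mle = N_H / (N_H + N_T)

    # Numerical check: the closed form should maximize the log likelihood on a grid
    grid = np.linspace(0.01, 0.99, 99)
    theta_grid = grid[np.argmax(log_likelihood(grid))]
    print(theta_mle, theta_grid)   # both should be 0.4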

Observations

- Depending on our choice of probability distribution, when we take the gradient of the likelihood we may not be able to find $\theta$ analytically
- An alternative is gradient ascent on the log likelihood (equivalently, gradient descent on the negative log likelihood):
  1. Start with some guess $\hat{\theta}$
  2. Update $\hat{\theta}$:
     $\hat{\theta} \leftarrow \hat{\theta} + \alpha \frac{\partial \log L(\theta \mid D)}{\partial \theta}$
     where $\alpha \in (0, 1)$ is a learning rate
  3. Go back to 2 (for some number of iterations, or until $\hat{\theta}$ stops changing significantly)
- Sometimes we can also determine a confidence interval around the value of $\theta$
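
A minimal gradient-ascent sketch for the Bernoulli case, where the closed form is of course available directly; the counts, learning rate, and iteration budget below are arbitrary illustrative choices.

    # Gradient ascent on the Bernoulli log likelihood (illustrative sketch only).
    N_H, N_T = 2, 3                 # counts from the example data D = H, T, H, T, T
    alpha = 0.01                    # learning rate
    theta = 0.5                     # initial guess

    for step in range(2000):
        grad = N_H / theta - N_T / (1 - theta)   # d/d(theta) of the log likelihood
        theta += alpha * grad
        theta = min(max(theta, 1e-6), 1 - 1e-6)  # keep theta inside (0, 1)
    print(theta)                    # should approach 0.4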

MLE for the multinomial distribution

- Suppose that instead of tossing a coin, we roll a K-faced die
- The set of parameters in this case is $p(k) = \theta_k$, $k = 1, \dots, K$
- We have the additional constraint that $\sum_{k=1}^K \theta_k = 1$
- What is the log likelihood in this case?
  $\log L(\theta \mid D) = \sum_k N_k \log \theta_k$, where $N_k$ is the number of times value k appears in the data
- We want to maximize the likelihood, but now this is a constrained optimization problem
- (Without the details of the proof; a sketch is given below) the best parameters are given by the "empirical frequencies":
  $\hat{\theta}_k = \frac{N_k}{\sum_{k'} N_{k'}}$
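
The omitted constrained-optimization step can be filled in with a Lagrange multiplier; a brief sketch of the standard argument (not spelled out on the slide):

    \max_\theta \sum_k N_k \log \theta_k \quad \text{s.t.} \quad \sum_k \theta_k = 1
    \;\Rightarrow\;
    \Lambda(\theta, \lambda) = \sum_k N_k \log \theta_k + \lambda \Big(1 - \sum_k \theta_k\Big)

    \frac{\partial \Lambda}{\partial \theta_k} = \frac{N_k}{\theta_k} - \lambda = 0
    \;\Rightarrow\; \theta_k = \frac{N_k}{\lambda},
    \qquad
    \sum_k \theta_k = 1 \;\Rightarrow\; \lambda = \sum_k N_k
    \;\Rightarrow\; \hat{\theta}_k = \frac{N_k}{\sum_{k'} N_{k'}}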

MLE for Bayes Nets

Recall: for more complicated distributions, involving multiple variables, we can use a graph structure (Bayes net).

[Figure: an example Bayes net over variables E, B, R, A, C with edges E -> R, E -> A, B -> A, A -> C, and a conditional probability table at each node: P(E), P(B), P(R|E), P(A|B,E), P(C|A).]

- Each node has a conditional probability distribution of the variable at the node given its parents (e.g. multinomial)
- The joint probability distribution is obtained as a product of the probability distributions at the nodes.

MLE for Bayes Nets

Instances are of the form $(r_j, e_j, b_j, a_j, c_j)$, $j = 1, \dots, m$.

$L(\theta \mid D) = \prod_{j=1}^m p(r_j, e_j, b_j, a_j, c_j \mid \theta)$   (from i.i.d.)
$= \prod_{j=1}^m p(e_j)\, p(r_j \mid e_j)\, p(b_j)\, p(a_j \mid e_j, b_j)\, p(c_j \mid a_j)$   (factorization)
$= \Big(\prod_{j=1}^m p(e_j)\Big) \Big(\prod_{j=1}^m p(r_j \mid e_j)\Big) \Big(\prod_{j=1}^m p(b_j)\Big) \Big(\prod_{j=1}^m p(a_j \mid e_j, b_j)\Big) \Big(\prod_{j=1}^m p(c_j \mid a_j)\Big)$
$= \prod_{i=1}^n L(\theta_i \mid D)$

where $\theta_i$ are the parameters associated with node i.
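
A consequence of this factorization (not spelled out on the slide) is that MLE decomposes into separate counting problems, one per conditional probability table; for example, for node A with parents B and E:

    \hat{P}(A = a \mid B = b, E = e) \;=\; \frac{N(A = a, B = b, E = e)}{N(B = b, E = e)}

where $N(\cdot)$ counts the instances in D matching the given assignment.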

Consistency of MLE

- For any estimator, we would like the parameters to converge to the best possible values as the number of examples grows
- We need to define "best possible" for probability distributions
- Let p and q be two probability distributions over X. The Kullback-Leibler divergence between p and q is defined as:
  $KL(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$

A very brief detour into information theory

- Suppose I want to send some data over a noisy channel
- I have 4 possible values that I could send (e.g. A, C, G, T) and I want to encode them into bits so as to have short messages
- Suppose that all values are equally likely. What is the best encoding?

A very brief detour into information theory (2)

- Now suppose I know that A occurs with probability 0.5, C with probability 0.25, and G and T each with probability 0.125 (so the probabilities sum to 1)
- What is the best encoding?
- What is the expected length of the message I have to send?

Optimal encoding

- Suppose that I am receiving messages from an alphabet of m letters, and letter j has probability $p_j$
- The optimal encoding (by Shannon's theorem) will give $-\log_2 p_j$ bits to letter j
- So the expected message length, if I use the optimal encoding, is equal to the entropy of p:
  $-\sum_j p_j \log_2 p_j$
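
Applied to the distribution from the previous slide (assuming probabilities 0.5, 0.25, 0.125, 0.125 for A, C, G, T), one optimal prefix code and the resulting expected length are:

    A \mapsto 0, \quad C \mapsto 10, \quad G \mapsto 110, \quad T \mapsto 111

    -\sum_j p_j \log_2 p_j = 0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 3 = 1.75 \text{ bits}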

Interpretation of KL divergence

- Suppose now that letters are actually coming from p, but I don't know this. Instead, I believe letters are coming from q, and I use q to construct the optimal encoding.
- The expected length of my messages will be $-\sum_j p_j \log_2 q_j$
- The number of bits I waste with this encoding is:
  $-\sum_j p_j \log_2 q_j + \sum_j p_j \log_2 p_j = \sum_j p_j \log_2 \frac{p_j}{q_j} = KL(p, q)$
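
A quick numerical sanity check of this wasted-bits identity; both distributions below are made up for illustration.

    import numpy as np

    p = np.array([0.5, 0.25, 0.125, 0.125])   # true letter distribution
    q = np.array([0.25, 0.25, 0.25, 0.25])    # assumed (wrong) distribution used for the code

    entropy_p = -np.sum(p * np.log2(p))        # bits with the optimal code for p
    cross_entropy = -np.sum(p * np.log2(q))    # bits when coding with q instead
    kl = np.sum(p * np.log2(p / q))            # KL(p, q)
    print(cross_entropy - entropy_p, kl)       # the two should match (0.25 bits here)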

Properties of MLE

- MLE is a consistent estimator, in the sense that (under a set of standard assumptions), with probability 1:
  $\lim_{|D| \to \infty} \hat{\theta} = \theta^*$
  where $\theta^*$ is the best set of parameters:
  $\theta^* = \arg\min_\theta KL(p^*(X), p(X \mid \theta))$
  ($p^*$ is the true distribution)
- With a small amount of data, the variance may be high (what happens if we observe just one coin toss?)

Model-based reinforcement learning

Very simple outline:
- Learn a model of the reward (e.g. by averaging; more on this next time)
- Learn a model of the probability distribution (e.g. by using MLE)
- Do dynamic programming updates using the learned model as if it were true, to obtain a value function and a policy

This works very well if you have to optimize many reward functions on the same environment (same transitions/dynamics). But you have to fit a probability distribution, which is quadratic in the number of states (so it could be very big). Obtaining the value of a fixed policy is then cubic in the number of states, and we have to run multiple iterations...

Can we get an algorithm linear in the number of states?
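
Putting the pieces together, a minimal model-based RL sketch: estimate the model by counting/averaging, then run value iteration on it as if it were true. The experience tuples and problem sizes are invented toy values.

    import numpy as np

    n_states, n_actions, gamma = 2, 2, 0.9
    # Hypothetical experience: observed (s, a, reward, s_next) transitions.
    experience = [(0, 0, 1.0, 0), (0, 0, 0.0, 1), (0, 1, 0.0, 1),
                  (1, 0, 0.0, 1), (1, 1, 2.0, 1), (1, 1, 2.0, 0)]

    counts = np.zeros((n_actions, n_states, n_states))
    r_sum = np.zeros((n_actions, n_states))
    for s, a, rew, s_next in experience:
        counts[a, s, s_next] += 1
        r_sum[a, s] += rew

    n_sa = counts.sum(axis=2, keepdims=True)
    T_hat = np.where(n_sa > 0, counts / np.maximum(n_sa, 1), 1.0 / n_states)  # MLE transitions
    r_hat = r_sum / np.maximum(n_sa[:, :, 0], 1)                              # average rewards

    # Plan in the learned model as if it were true (value iteration)
    V = np.zeros(n_states)
    for _ in range(1000):
        V = (r_hat + gamma * T_hat @ V).max(axis=0)
    print(V, (r_hat + gamma * T_hat @ V).argmax(axis=0))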

Monte Carlo Methods

- Suppose we have an episodic task: the agent interacts with the environment in trials or episodes, which terminate at some point
- The agent behaves according to some policy $\pi$ for a while, generating several trajectories. How can we compute $V^\pi$?
- Compute $V^\pi(s)$ by averaging the observed returns after s on the trajectories in which s was visited
- As in bandits, we can do this incrementally: after receiving return $R_t$, we update
  $V(s_t) \leftarrow V(s_t) + \alpha (R_t - V(s_t))$
  where $\alpha \in (0, 1)$ is a learning rate parameter
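
A minimal Monte Carlo prediction sketch matching the incremental update above; the episode data (lists of (state, reward) pairs) are invented for illustration.

    # Monte Carlo prediction: incremental update toward observed returns.
    episodes = [[("A", 0.0), ("B", 1.0)],
                [("B", 1.0)],
                [("B", 0.0)]]
    gamma, alpha = 1.0, 0.1
    V = {"A": 0.0, "B": 0.0}

    for episode in episodes:
        # Compute the return G following each visited state by working backwards
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        for state, G in reversed(returns):
            V[state] += alpha * (G - V[state])   # V(s_t) <- V(s_t) + alpha (R_t - V(s_t))
    print(V)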

Temporal-Difference (TD) Prediction

- Monte Carlo uses the actual return, $R_t$, as a target estimate for the value function:
  $V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$
- The simplest TD method, TD(0), uses instead an estimate of the return:
  $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$
- If $V(s_{t+1})$ were correct, this would be like a dynamic programming target!

TD Is a Hybrid between Dynamic Programming and Monte Carlo!

- Like DP, it bootstraps (computes the value of a state based on estimates of its successors)
- Like MC, it estimates expected values by sampling

TD Learning Algorithm

1. Initialize the value function: $V(s) = 0, \forall s$
2. Repeat as many times as wanted:
   (a) Pick a start state s for the current trial
   (b) Repeat for every time step t:
       i. Choose action a based on policy $\pi$ and the current state s
       ii. Take action a, observe reward r and new state s'
       iii. Compute the TD error: $\delta \leftarrow r + \gamma V(s') - V(s)$
       iv. Update the value function: $V(s) \leftarrow V(s) + \alpha \delta$
       v. $s \leftarrow s'$
       vi. If s is not a terminal state, go to 2(b)
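
A runnable sketch of this algorithm on a made-up two-state chain (there is no real action choice, so it is pure prediction for a fixed policy); the environment, its probabilities, and the learning rate are all invented for illustration.

    import random

    # TD(0) prediction following the algorithm above. Toy environment: from 'A'
    # we always move to 'B' with reward 0; 'B' terminates, with reward 1 most of the time.
    def step(state):
        if state == "A":
            return 0.0, "B", False                 # reward, next state, terminal?
        return (1.0 if random.random() < 0.8 else 0.0), None, True

    gamma, alpha = 1.0, 0.1
    V = {"A": 0.0, "B": 0.0}

    for trial in range(5000):
        s = random.choice(["A", "B"])              # pick a start state for the trial
        done = False
        while not done:
            r, s_next, done = step(s)              # take an action, observe r and s'
            v_next = 0.0 if done else V[s_next]    # terminal states have value 0
            delta = r + gamma * v_next - V[s]      # TD error
            V[s] += alpha * delta                  # V(s) <- V(s) + alpha * delta
            s = s_next
    print(V)                                       # V[B] ~ 0.8 and V[A] ~ 0.8 (gamma = 1)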

Example

Suppose you start with all 0 guesses and observe the following episodes:
- B, 1
- B, 1
- B, 1
- B, 1
- B, 0
- A, 0; B (reward not seen yet)

What would you predict for V(B)? What would you predict for V(A)?

Example: TD vs Monte Carlo

- For B, it is clear that V(B) = 4/5
- If you use Monte Carlo, at this point you can only predict your initial guess for A (which is 0)
- If you use TD, at this point you would predict 0 + 4/5! And you would adjust the value of A towards this target.

Example (continued)

Suppose you start with all 0 guesses and observe the following episodes:
- B, 1
- B, 1
- B, 1
- B, 1
- B, 0
- A, 0; B, 0

What would you predict for V(B)? What would you predict for V(A)?

Example: Value Prediction

- The estimate for B would be 4/6
- The estimate for A, if we use Monte Carlo, is 0; this minimizes the sum-squared error on the training data
- If you were to learn a model from this data and do dynamic programming, you would estimate that A goes to B, so the value of A would be 0 + 4/6
- TD is an incremental algorithm: it would adjust the value of A towards 4/5, which was the current estimate for B before the continuation from B was seen
- This is closer to dynamic programming than to Monte Carlo
- TD estimates take the time sequence into account

Advantages

- No model of the environment is required! TD only needs experience with the environment.
- On-line, incremental learning:
  - Can learn before knowing the final outcome
  - Less memory and peak computation are required
- Both TD and MC converge (under mild assumptions), but TD usually learns faster.