Making Complex Decisions

Ch. 17 p.1/29 Making Complex Decisions Chapter 17

Ch. 17 p.2/29 Outline: Sequential decision problems; Value iteration algorithm; Policy iteration algorithm

Ch. 17 p.3/29 A simple environment. [Figure: the 4×3 grid world. The start state S is in the lower-left corner, the terminal states +1 and -1 occupy the top two squares of the rightmost column, and one interior square is blocked. Each move goes in the intended direction with probability p=0.8 and slips to each perpendicular direction with probability p=0.1.]

Ch. 17 p.4/29 A simple environment (cont'd). The agent has to make a series of decisions (or, alternatively, it has to know what to do in each of the 11 possible states). The move action can fail. Each state has a reward.

Ch. 17 p.5/29 What is different? Uncertainty, rewards for states (not just good/bad states), and a series of decisions (not just one).

Ch. 17 p.6/29 Issues: How to represent the environment? How to automate the decision-making process? How to make useful simplifying assumptions?

Ch. 17 p.7/29 Markov decision process (MDP). It is a specification of a sequential decision problem for a fully observable environment. It has three components: S_0, the initial state; T(s,a,s′), the transition model; and R(s), the reward function. The rewards are additive.

Ch. 17 p.8/29 Transition model. It is a specification of outcome probabilities for each state-action pair. T(s,a,s′) denotes the probability of ending up in state s′ if action a is applied in state s. The transitions are Markovian: T(s,a,s′) depends only on s, not on the history of earlier states.
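
As a concrete illustration (not from the slides; the states, actions, probabilities, and rewards below are made up), one simple way to hold such a transition model in code is a dictionary mapping each state and action to a list of (probability, next state) pairs:

# Sketch of an MDP representation: T[s][a] is a list of (probability, next_state)
# pairs, and R[s] is the reward of state s. All names and numbers are illustrative.
T = {
    "a": {"stay": [(1.0, "a")], "go": [(0.9, "b"), (0.1, "a")]},
    "b": {"stay": [(1.0, "b")], "go": [(1.0, "b")]},
}
R = {"a": -0.04, "b": 1.0}

def transition_prob(s, a, s_next):
    """Return T(s, a, s'): the probability of reaching s_next from s via action a."""
    return sum(p for p, nxt in T[s][a] if nxt == s_next)

print(transition_prob("a", "go", "b"))  # 0.9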

Ch. 17 p.9/29 Utility function. It is a specification of agent preferences. The utility function depends on a sequence of states (this is a sequential decision problem, even though the transitions are Markovian). Each state has a finite negative or positive reward, given by R(s).
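
Concretely, in the standard additive discounted form (with the discount factor γ, 0 < γ ≤ 1, that reappears in the algorithms below), the utility of a state sequence is

U_h([s_0, s_1, s_2, ...]) = R(s_0) + γ R(s_1) + γ² R(s_2) + ...

With γ = 1 this is plain additive reward; with γ < 1, rewards further in the future count for less.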

Ch. 17 p.10/29 Policy. It is a specification of a solution for an MDP: it denotes what to do in any state the agent might reach. π(s) denotes the action recommended by the policy π for state s. The quality of a policy is measured by the expected utility of the possible environment histories generated by that policy. π* denotes the optimal policy. An agent with a complete policy is a reflex agent.

Ch. 17 p.11/29 Optimal policy when R(s) = -0.04. [Figure: the optimal policy shown as arrows on the 4×3 grid world.]

Ch. 17 p.12/29 Optimal policy when R(s) < -1.6284. [Figure: the optimal policy shown as arrows on the grid.]

Ch. 17 p.13/29 Optimal policy when -0.4278 < R(s) < -0.0850. [Figure: the optimal policy shown as arrows on the grid.]

Ch. 17 p.14/29 Optimal policy when -0.0221 < R(s) < 0. [Figure: the optimal policy shown as arrows on the grid.]

Ch. 17 p.15/29 Optimal policy when R(s) > 0. [Figure: the optimal policy, with several non-terminal states marked *.]

Ch. 17 p.16/29 Finite vs. infinite horizon. A finite horizon means that there is a fixed time N after which nothing matters (the game is over): U_h([s_0, s_1, ..., s_(N+k)]) = U_h([s_0, s_1, ..., s_N]) for all k > 0. The optimal policy for a finite horizon is nonstationary, i.e., it could change over time. An infinite horizon means that there is no deadline; there is then no reason to behave differently in the same state at different times, i.e., the optimal policy is stationary. This is easier than the nonstationary case.

Ch. 17 p.17/29 Stationary preferences. This means that the agent's preferences between state sequences do not depend on time: if two state sequences [s_0, s_1, s_2, ...] and [s_0′, s_1′, s_2′, ...] begin with the same state (i.e., s_0 = s_0′), then the two sequences should be preference-ordered the same way as the sequences [s_1, s_2, ...] and [s_1′, s_2′, ...].

Ch. 17 p.18/29 Algorithms to solve MDPs.
Value iteration: initialize the value of each state to its immediate reward; iterate to calculate values considering sequential rewards; for each state, select the action with the maximum expected utility.
Policy iteration: get an initial policy; evaluate the policy to find the utility of each state; modify the policy by selecting actions that increase the utility of a state; if any changes occurred, go back to the evaluation step.

Ch. 17 p.19/29 Value Iteration Algorithm

function VALUE-ITERATION(mdp, ε) returns a utility function
  inputs: mdp, an MDP with states S, transition model T, reward function R, discount γ
          ε, the maximum error allowed in the utility of a state
  local variables: U, U′, vectors of utilities for states in S, initially zero
                   δ, the maximum change in the utility of any state in an iteration
  repeat
    U ← U′; δ ← 0
    for each state s in S do
      U′[s] ← R[s] + γ max_a Σ_s′ T(s,a,s′) U[s′]
      if |U′[s] − U[s]| > δ then δ ← |U′[s] − U[s]|
  until δ < ε(1−γ)/γ
  return U
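
As a rough Python sketch of this pseudocode (the two-state MDP, its rewards, and the discount are made-up illustration values, not the grid world from the slides):

# Illustrative value iteration sketch. T[s][a] lists (probability, next_state)
# pairs; the MDP, rewards, and discount below are invented for the example.
T = {
    "a": {"stay": [(1.0, "a")], "go": [(0.9, "b"), (0.1, "a")]},
    "b": {"stay": [(1.0, "b")], "go": [(1.0, "b")]},
}
R = {"a": -0.04, "b": 1.0}

def value_iteration(T, R, gamma=0.9, epsilon=1e-4):
    U = {s: 0.0 for s in T}                        # utilities, initially zero
    while True:
        U_new, delta = {}, 0.0
        for s in T:
            # Bellman update: R(s) + gamma * max_a sum_s' T(s,a,s') U(s')
            U_new[s] = R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in T[s][a]) for a in T[s]
            )
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < epsilon * (1 - gamma) / gamma:  # termination test from the slide
            return U

print(value_iteration(T, R))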

Ch. 17 p.20/29 State utilities with γ = 1 and R(s) = -0.04.

  row 3:  0.812   0.868   0.918    +1
  row 2:  0.762   (wall)  0.660    -1
  row 1:  0.705   0.655   0.611   0.388
          col 1   col 2   col 3   col 4

Ch. 17 p.21/29 Optimal policy using value iteration. To find the optimal policy, choose the action that maximizes the expected utility of the subsequent state: π*(s) = argmax_a Σ_s′ T(s,a,s′) U(s′)
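
A minimal sketch of this extraction step, assuming the same T[s][a] = [(probability, next_state), ...] dictionary representation used in the sketches above:

# Sketch: extract the greedy (MEU) policy from a utility table U,
# e.g. best_policy(T, value_iteration(T, R)).
def best_policy(T, U):
    return {
        s: max(T[s], key=lambda a: sum(p * U[s2] for p, s2 in T[s][a]))
        for s in T
    }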

Ch. 17 p.22/29 Properties of value iteration. The value iteration algorithm can be thought of as propagating information through the state space by means of local updates. It converges to the correct utilities. We can bound the error in the utility estimates if we stop after a finite number of iterations, and we can bound the policy loss that results from executing the corresponding MEU policy.

Ch. 17 p.23/29 More on value iteration. The value iteration algorithm we looked at is solving the standard Bellman equations using Bellman updates.
Bellman equation: U(s) = R(s) + γ max_a Σ_s′ T(s,a,s′) U(s′)
Bellman update: U_(i+1)(s) = R(s) + γ max_a Σ_s′ T(s,a,s′) U_i(s′)

Ch. 17 p.24/29 More on value iteration (cont'd). If we apply the Bellman update infinitely often, we are guaranteed to reach an equilibrium, in which case the final utility values must be solutions to the Bellman equations. In fact, they are also the unique solutions, and the corresponding policy is optimal.

Ch. 17 p.25/29 Policy iteration. With the Bellman equations, we either need to solve a nonlinear set of equations or use an iterative method. Policy iteration starts with an initial policy and performs alternating iterations of evaluation and improvement on it.

Ch. 17 p.26/29 Policy Iteration Algorithm

function POLICY-ITERATION(mdp) returns a policy
  inputs: mdp, an MDP with states S, transition model T
  local variables: U, a vector of utilities for states in S, initially zero
                   π, a policy vector indexed by state, initially random
  repeat
    U ← POLICY-EVALUATION(π, U, mdp)
    unchanged? ← true
    for each state s in S do
      if max_a Σ_s′ T(s,a,s′) U[s′] > Σ_s′ T(s,π[s],s′) U[s′] then
        π[s] ← argmax_a Σ_s′ T(s,a,s′) U[s′]
        unchanged? ← false
  until unchanged?
  return π
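
A rough Python sketch of the same loop, again on the made-up two-state MDP; here POLICY-EVALUATION is approximated by k sweeps of the simplified Bellman update described on the next slides:

# Illustrative policy iteration sketch. Policy evaluation uses k sweeps of the
# simplified Bellman update rather than an exact linear solve.
T = {
    "a": {"stay": [(1.0, "a")], "go": [(0.9, "b"), (0.1, "a")]},
    "b": {"stay": [(1.0, "b")], "go": [(1.0, "b")]},
}
R = {"a": -0.04, "b": 1.0}

def expected_utility(T, U, s, a):
    return sum(p * U[s2] for p, s2 in T[s][a])

def policy_evaluation(pi, U, T, R, gamma=0.9, k=20):
    for _ in range(k):
        U = {s: R[s] + gamma * expected_utility(T, U, s, pi[s]) for s in T}
    return U

def policy_iteration(T, R, gamma=0.9):
    U = {s: 0.0 for s in T}
    pi = {s: next(iter(T[s])) for s in T}          # arbitrary initial policy
    while True:
        U = policy_evaluation(pi, U, T, R, gamma)
        unchanged = True
        for s in T:
            best_a = max(T[s], key=lambda a: expected_utility(T, U, s, a))
            if expected_utility(T, U, s, best_a) > expected_utility(T, U, s, pi[s]):
                pi[s] = best_a
                unchanged = False
        if unchanged:
            return pi

print(policy_iteration(T, R))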

Ch. 17 p.27/29 Properties of Policy Iteration. Implementing the POLICY-EVALUATION routine is simpler than solving the standard Bellman equations because the action in each state is fixed by the policy. The simplified Bellman equation is U_i(s) = R(s) + γ Σ_s′ T(s, π_i(s), s′) U_i(s′)

Ch. 17 p.28/29 Properties of Policy Iteration (cont'd). The simplified set of Bellman equations is linear (n equations with n unknowns can be solved exactly in O(n³) time). If n³ is prohibitive, we can use modified policy iteration, which applies the simplified Bellman update k times: U_(i+1)(s) = R(s) + γ Σ_s′ T(s, π_i(s), s′) U_i(s′)
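
For the exact O(n³) route, policy evaluation is a linear solve of (I − γ P_π) U = R, where P_π is the transition matrix under the fixed policy π. A small sketch with NumPy; the two-state numbers and the policy's transition matrix are illustrative assumptions:

# Sketch: exact policy evaluation by solving (I - gamma * P_pi) U = R.
import numpy as np

states = ["a", "b"]
R = np.array([-0.04, 1.0])
P_pi = np.array([
    [0.1, 0.9],   # transition probabilities from "a" under pi(a)
    [0.0, 1.0],   # transition probabilities from "b" under pi(b)
])
gamma = 0.9

U = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, R)
print(dict(zip(states, U)))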

Ch. 17 p.29/29 Issues revisited (and summary). How to represent the environment? (the transition model) How to automate the decision-making process? (value iteration and policy iteration; one can also use asynchronous policy iteration and work on a subset of states) How to make useful simplifying assumptions? (full observability, stationary policy, infinite horizon, etc.)