Dynamic Programming (DP) Massimo Paolucci University of Genova

DP cannot be applied to every kind of problem. In particular, it is a solution method for problems defined over stages: for each stage a subproblem is defined, and the overall solution is obtained recursively. DP is based on the so-called Optimality Principle, which allows one to reduce the solution of a given problem to the solutions of a series of subproblems.

An example
A company has three machineries and 5M available to expand them.
C_i: possible investment on machinery i
R_i: corresponding profit

Possibility of investment   Machinery 1 (C1, R1)   Machinery 2 (C2, R2)   Machinery 3 (C3, R3)
1                           0, 0                   0, 0                   0, 0
2                           1, 5                   2, 8                   1, 3
3                           2, 6                   3, 9                   -, -
4                           -, -                   4, 12                  -, -

Target: finding the most rewarding investment.

In this simple example one might enumerate all possible alternatives, but in general explicit enumeration is cumbersome and inefficient: for each alternative one has to solve the whole problem; unfeasible alternatives are not recognized a priori; at each step, the information obtained during the computation of previous alternatives is not exploited.

In this example, the approach based on Dynamic Programming can be introduced via a graph. The model:
stage: machinery
state x_i: money allocated to the machineries from stage 0 up to stage i (0 <= x_i <= 5)
arc (x_i, x_{i+1}): most rewarding allocation of the amount x_{i+1} - x_i to machinery i+1
weight of each arc: profit associated with the corresponding investment

[Figure: stage graph with the single node x_0 = 0 at stage 0, states 0, 1, 2, 3, 4, 5 at stages 1 and 2, and the node x_3 = 5 at stage 3; each arc (x_i, x_{i+1}) is labelled with the profit of allocating x_{i+1} - x_i to machinery i+1.]

The problem consists in finding the path 0-5 (from stage 0 to stage 3) with the largest weight. Each path represents an admissible solution. Some arcs represent overspending situations.

Backward phase
The graph is travelled backward, starting from stage 3. To each node one associates f_i(x_i) := length of the maximal path between x_i and x_3.

stage 3: f_3(5) = 0
stage 2: f_2(x_2) = R(x_2, x_3) = length of the maximal path between x_2 and x_3
f_2(0) = f_2(1) = f_2(2) = f_2(3) = f_2(4) = 3
f_2(5) = 0

stage 1: f_1(x_1) = length of the maximal path between x_1 and x_3
f_1(x_1) = max_{x_2} [ R(x_1, x_2) + f_2(x_2) ]
where f_2(x_2) is the length of the maximal path between x_2 and x_3, and R(x_1, x_2) is the length of the arc between x_1 and x_2.

stage 1:
f_1(0) = max[0+3, 0+3, 8+3, 9+3, 12+3, 12+0] = 15
f_1(1) = max[0+3, 0+3, 8+3, 9+3, 12+0] = 12
f_1(2) = max[0+3, 0+3, 8+3, 9+0] = 11
f_1(3) = max[0+3, 0+3, 8+0] = 8
f_1(4) = max[0+3, 0+0] = 3
f_1(5) = max[0+0] = 0

stage 0: f_0(0) = length of the maximal path between x_0 = 0 and x_3
f_0(0) = max_{x_1} [ R(0, x_1) + f_1(x_1) ] = max[0+15, 5+12, 6+11, 6+8, 6+3, 6+0] = 17

Forward phase
There exist various alternative maximal paths, all of total profit 17, which can be found by travelling the solution forward:
path 1) 0 - 1 - 4 - 5    possibilities of investment 2, 3, 2
path 2) 0 - 1 - 5 - 5    possibilities of investment 2, 4, 1
path 3) 0 - 2 - 4 - 5    possibilities of investment 3, 2, 2
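The two phases can be made concrete with a short sketch. The following Python snippet is not part of the original slides: it is a minimal illustration under the same assumptions (an arc may "overspend", i.e. allocate more money than the chosen option requires), with the option tables restating the data of the example and all names (OPTIONS, arc_profit, backward, forward) chosen for illustration.

```python
BUDGET = 5
# (cost, profit) options of the example, one list per machinery
OPTIONS = [
    [(0, 0), (1, 5), (2, 6)],           # machinery 1
    [(0, 0), (2, 8), (3, 9), (4, 12)],  # machinery 2
    [(0, 0), (1, 3)],                   # machinery 3
]

def arc_profit(stage, spend):
    """Best profit obtainable on machinery stage+1 with budget `spend`
    (overspending allowed: the best affordable option is taken)."""
    return max(r for c, r in OPTIONS[stage] if c <= spend)

def backward():
    """Backward phase: f_i(x_i) = maximal profit from stage i to stage 3."""
    f = [dict() for _ in range(4)]
    f[3] = {BUDGET: 0}                        # final state: all money allocated
    for i in (2, 1, 0):
        for x in (range(BUDGET + 1) if i > 0 else [0]):
            f[i][x] = max(arc_profit(i, y - x) + f[i + 1][y]
                          for y in f[i + 1] if y >= x)
    return f

def forward(f):
    """Forward phase: recover one optimal sequence of states x_0, ..., x_3."""
    path = [0]
    for i in range(3):
        x = path[-1]
        path.append(max((y for y in f[i + 1] if y >= x),
                        key=lambda y: arc_profit(i, y - x) + f[i + 1][y]))
    return path

f = backward()
print(f[0][0])     # 17, the maximal total profit
print(forward(f))  # one optimal path, e.g. [0, 1, 4, 5]
```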

We have seen a deterministic example. The DP approach can be applied, with suitable modifications, also to stochastic contexts.

The backward equations of DP (deterministic context)
Let us consider the following general case: a chain of states x_0 -> x_1 -> x_2 -> ... -> x_{N-1} -> x_N, traversed over N stages (i = 0, ..., N-1).
For each stage i one has x_i in {1, ..., q}, i.e., there are q possible different states; in general, the state is a vector in R^n.
A cost T_i(x_i, x_{i+1}) is associated with each arc (e.g., the cost incurred when the arc is travelled).

Minimum total cost: T* = min_{x_1,...,x_{N-1}} Σ_{i=0}^{N-1} T_i(x_i, x_{i+1})
Number of possible paths: q^{N-1}
E.g., with q = 10 and N = 21 there are 10^20 paths. If a path is computed in 10^-6 sec, then finding the solution by enumeration requires 10^14 sec, about 3·10^6 years!!!

DP solves the problem with a backward procedure, in which a subproblem is solved at each stage. To each node one associates the optimal cost required to reach the final stage starting from it.
Stage N-1: T*_{N-1}(x_{N-1}) = T_{N-1}(x_{N-1}, x_N)
For instance, in the case of the route of an airplane, such a cost is the time required to reach x_N from x_{N-1}.

Stage N-2: T*_{N-2}(x_{N-2}) = min_{x_{N-1}} [ T_{N-2}(x_{N-2}, x_{N-1}) + T*_{N-1}(x_{N-1}) ]
Here T*_{N-2}(x_{N-2}) is the minimum time to reach x_N starting from x_{N-2}, T_{N-2}(x_{N-2}, x_{N-1}) is the minimum time to reach x_{N-1} starting from x_{N-2}, and T*_{N-1}(x_{N-1}) is the minimum time to reach x_N starting from x_{N-1}.

In general (backward equations of DP):
T*_i(x_i) = min_{x_{i+1}} [ T_i(x_i, x_{i+1}) + T*_{i+1}(x_{i+1}) ],   i = N-1, N-2, ..., 0
T*_N(x_N) = 0
T*_i(x_i) is called the Cost-To-Go.
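As a minimal sketch (not from the slides), the backward equations can be implemented as follows in Python; costs[i] is assumed to map each existing arc (x, y) between stage i and stage i+1 to its cost T_i(x, y), and all names are illustrative.

```python
def backward_dp(costs, num_stages, states, final_states):
    """Return the Cost-To-Go tables T*_i(x_i), i = N-1, ..., 0."""
    # T*_N(x_N) = 0 for every admissible final state
    cost_to_go = {num_stages: {x: 0.0 for x in final_states}}
    for i in range(num_stages - 1, -1, -1):
        cost_to_go[i] = {}
        for x in states[i]:
            # T*_i(x_i) = min_{x_{i+1}} [ T_i(x_i, x_{i+1}) + T*_{i+1}(x_{i+1}) ]
            candidates = [costs[i][(x, y)] + cost_to_go[i + 1][y]
                          for y in cost_to_go[i + 1] if (x, y) in costs[i]]
            if candidates:
                cost_to_go[i][x] = min(candidates)
    return cost_to_go
```

With q states per stage, each stage requires at most q sums per state, i.e. the q^2 operations per stage discussed below.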

Optimality Principle: whatever the state at a certain stage is, one has to proceed by following the optimal trajectory:
T* = min_{x_1} min_{x_2} ... min_{x_{N-1}} [ T_0(x_0, x_1) + T_1(x_1, x_2) + ... + T_{N-1}(x_{N-1}, x_N) ]
At each stage one has to make q sums for each of the q states, i.e. q·q = q^2 sums per stage. Hence, in total one has q^2·N operations.
Example: 10^2 · 21 · 10^-6 sec = 2.1·10^-3 sec = 2.1 msec. Compare with 3·10^6 years!!!

Bellman's theorem
It proves the correctness of the backward equations of DP.
T* = min_{x_1,...,x_{N-1}} Σ_{i=0}^{N-1} T_i(x_i, x_{i+1})    (x_0: departure, x_N: arrival)
T* = min_{x_1,...,x_{N-1}} [ T_0(x_0, x_1) + Σ_{i=1}^{N-1} T_i(x_i, x_{i+1}) ]
x_1 influences all the terms, but x_2, ..., x_{N-1} influence only the terms inside the sum.

Hence
T* = min_{x_1} [ T_0(x_0, x_1) + min_{x_2,...,x_{N-1}} Σ_{i=1}^{N-1} T_i(x_i, x_{i+1}) ]
The inner minimum is the Cost-To-Go T*_1(x_1), i.e., the optimal cost from the second stage till the last one.

The equation can be rewritten as
T* = min_{x_1} [ T_0(x_0, x_1) + T*_1(x_1) ]    (1st equation of DP)
By writing T*_1(x_1) explicitly:
T*_1(x_1) = min_{x_2,...,x_{N-1}} [ T_1(x_1, x_2) + Σ_{i=2}^{N-1} T_i(x_i, x_{i+1}) ]
          = min_{x_2} [ T_1(x_1, x_2) + min_{x_3,...,x_{N-1}} Σ_{i=2}^{N-1} T_i(x_i, x_{i+1}) ]
          = min_{x_2} [ T_1(x_1, x_2) + T*_2(x_2) ]    (2nd equation of DP)
... and so on.

The curse of dimensionality
In general the state is not a scalar, but an n-dimensional vector. Hence, at each stage of DP one has to keep the information regarding all possible combinations of values.

For instance, when the state has dimension 2 (i.e., n = 2) and each component of x_i can take q possible values, then one has q·q = q^2 states for each stage, and over the N-1 stages the number of possible paths is q^2 · q^2 · ... · q^2 = q^{2(N-1)}; in general, for dimension n, it is q^{n(N-1)}.
[Figure: grid of states (x_1, x_2) at stage 1, ..., stage N.]
Number of operations of DP in dimension n: q^n · q^n = q^{2n} sums per stage, i.e. q^{2n}·N operations over N stages.
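The growth of the two operation counts can be checked with a few lines of purely illustrative arithmetic (not from the slides):

```python
def enumeration_paths(q, N, n):
    return (q ** n) ** (N - 1)   # q^{n(N-1)} possible paths to enumerate

def dp_operations(q, N, n):
    return (q ** (2 * n)) * N    # q^{2n} sums per stage, over N stages

for n in (1, 2, 3):
    print(n, enumeration_paths(10, 21, n), dp_operations(10, 21, n))
# For n = 1 this reproduces the figures above: 10^20 paths versus
# 10^2 * 21 = 2100 sums for DP.
```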

Dynamic Programming: summing up
To apply DP, it is required that the problem can be divided into stages; for every stage, a decision policy has to be determined. A number of states is associated with each stage. The effect of the decision at each stage consists in transforming the present state into a new state, associated with the next stage.

Given the current state, the decisions taken at the previous stages do not influence the decisions at the next stages. The backward solution process determines the optimal decision policy for each state of the previous stage; the optimal decision policy is thus obtained recursively. The optimal sequence of decisions is then determined by travelling the solution forward.