DYNAMIC PROGRAMMING REINFORCEMENT LEARNING. COGS 202 : Week 7 Presentation


Noel Townsend
6 years ago

1 DYNAMIC PROGRAMMING REINFORCEMENT LEARNING COGS 202 : Week 7 Presentation

2 OUTLINE
- Recap (State Value and Action Value functions)
- Computation in MDP
- Dynamic Programming (DP)
- Policy Evaluation
- Policy Improvement
- Policy Iteration
- Value Iteration
- Asynchronous DP
- Generalized Policy Iteration
- Efficiency of DP

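3 COMPONENTS OF AN MDP PROBLEM
- Agent, task, environment
- States, actions, rewards
- Policy $\pi(s, a)$: probability of doing $a$ in $s$
- State value $V^\pi(s)$: a number, the value of a state
- Action value $Q^\pi(s, a)$: a number, the value of a state-action pair
- Model: transition probability $P^a_{ss'}$, the probability of going from $s$ to $s'$ taking action $a$; reward function $R^a_{ss'}$, the reward from doing $a$ in $s$ and reaching $s'$
- Return $R$: sum of future rewards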

4 VALUE FUNCTIONS
- State value function $V^\pi(s)$: expected return when starting in $s$ and following $\pi$
- State-action value function $Q^\pi(s, a)$: expected return when starting in $s$, performing $a$, and following $\pi$
- Useful for finding the optimal policy
- Can be estimated from experience
- Pick the best action using $Q^\pi(s, a)$
- Bellman equations

5 OPTIMAL VALUE FUNCTIONS
- There is a set of optimal policies
- $V^\pi$ defines a partial ordering on policies
- The optimal policies share the same optimal value function
- Bellman optimality equation: a system of $n$ non-linear equations
- Solve for $V^*(s)$
- Easy to extract the optimal policy
- Having $Q^*(s, a)$ makes it even simpler

6 KEY COMPUTATIONS
- How to compute $V^\pi(s)$ given a fixed policy $\pi$?
- How to compute $\pi'$ such that $V^{\pi'} \ge V^\pi$?
- How to compute $\pi^*$?
- How to compute $V^*(s)$ directly?

7 SOLUTIONS
- Dynamic Programming: the classical solution method for MDPs
- Used to compute value functions, and hence optimal policies, using Bellman equations
- Efficiency and utility
- Assumption: finite MDP (the state and action sets are finite)

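8 POLICY EVALUATION
Policy evaluation: for a given policy $\pi$, compute the state-value function $V^\pi$.
Recall the state value function for policy $\pi$:
$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\Big\}$$
Bellman equation for $V^\pi$:
$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V^\pi(s')\big]$$
This is a system of $|S|$ simultaneous linear equations.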
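
Because the Bellman equation for a fixed policy is linear in the unknown values, a small MDP can be evaluated exactly by solving the system directly. The sketch below is only an illustration: the 3-state MDP, the `P[s][a]` layout of `(probability, next_state, reward)` triples, the equiprobable policy, and the helper names are all assumptions that do not come from the slides.

```python
# Hypothetical 3-state MDP (not from the slides). State 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
# Equiprobable random policy: policy[s][a] = pi(s, a).
policy = {s: {0: 0.5, 1: 0.5} for s in P}


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def evaluate_exactly(P, policy, gamma):
    """Build (I - gamma * P_pi) V = R_pi from the Bellman equation and solve it."""
    states = sorted(P)
    n = len(states)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for i, s in enumerate(states):
        for a, pi_sa in policy[s].items():
            for prob, s_next, reward in P[s][a]:
                A[i][states.index(s_next)] -= gamma * pi_sa * prob
                b[i] += pi_sa * prob * reward
    return dict(zip(states, solve_linear(A, b)))


V = evaluate_exactly(P, policy, GAMMA)
```

A direct solve costs roughly cubic time in the number of states, which is one reason iterative sweeps are usually preferred for larger state sets.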

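9 ITERATIVE METHODS
$$V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi$$
Each arrow is one "sweep". A sweep consists of applying a backup operation to each state.
A full policy evaluation backup:
$$V_{k+1}(s) \leftarrow \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V_k(s')\big]$$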

10 ITERATIVE POLICY EVALUATION
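
A minimal sketch of iterative policy evaluation, applying the full policy evaluation backup to every state until the largest change in a sweep falls below a threshold. The toy MDP, the `P[s][a]` data layout, the in-place update, and the stopping threshold `theta` are assumptions made for illustration, not details taken from the slides.

```python
# Hypothetical 3-state MDP (not from the slides). State 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
# Equiprobable random policy: policy[s][a] = pi(s, a).
policy = {s: {0: 0.5, 1: 0.5} for s in P}


def policy_evaluation(P, policy, gamma, theta=1e-10):
    """Sweep over all states until the largest change is below theta."""
    V = {s: 0.0 for s in P}  # V_0: arbitrary initial guess
    while True:
        delta = 0.0
        for s in P:  # one sweep: back up every state once
            v_new = sum(
                pi_sa * prob * (reward + gamma * V[s_next])
                for a, pi_sa in policy[s].items()
                for prob, s_next, reward in P[s][a]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place update (assumed variant)
        if delta < theta:
            return V


V = policy_evaluation(P, policy, GAMMA)
```

The in-place variant reuses freshly updated values within the same sweep; keeping two arrays (old and new) is the other common choice and converges to the same $V^\pi$.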

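11 POLICY IMPROVEMENT
Suppose we have computed $V^\pi$ for a deterministic policy $\pi$. For a given state $s$, would it be better to do an action $a \ne \pi(s)$?
The value of doing $a$ in state $s$ is:
$$Q^\pi(s, a) = E\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V^\pi(s')\big]$$
It is better to switch to action $a$ for state $s$ if and only if $Q^\pi(s, a) > V^\pi(s)$.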

12 THE POLICY IMPROVEMENT THEOREM
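
The slide body did not survive extraction. In the notation used throughout this deck, the standard statement of the theorem for deterministic policies $\pi$ and $\pi'$ can be written as follows; treat it as a reconstruction from the textbook result, not as the slide's exact wording.

```latex
\text{If } Q^{\pi}\big(s, \pi'(s)\big) \ge V^{\pi}(s) \quad \text{for all } s \in S,
\qquad \text{then} \qquad
V^{\pi'}(s) \ge V^{\pi}(s) \quad \text{for all } s \in S.
```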

13 PROOF SKETCH
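
The slide body did not survive extraction. The usual argument, sketched here under the hypothesis $Q^\pi(s, \pi'(s)) \ge V^\pi(s)$ for all $s$, repeatedly expands $Q^\pi$ one step and applies the hypothesis again at the next state.

```latex
\begin{aligned}
V^{\pi}(s) &\le Q^{\pi}\big(s, \pi'(s)\big) \\
 &= E_{\pi'}\{ r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s \} \\
 &\le E_{\pi'}\{ r_{t+1} + \gamma Q^{\pi}\big(s_{t+1}, \pi'(s_{t+1})\big) \mid s_t = s \} \\
 &= E_{\pi'}\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 V^{\pi}(s_{t+2}) \mid s_t = s \} \\
 &\;\;\vdots \\
 &\le E_{\pi'}\{ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s \} \\
 &= V^{\pi'}(s).
\end{aligned}
```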

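14 POLICY IMPROVEMENT CONT.
Do this for all states to get a new policy $\pi'$ that is greedy with respect to $V^\pi$:
$$\pi'(s) = \arg\max_a Q^\pi(s, a) = \arg\max_a \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V^\pi(s')\big]$$
Then $V^{\pi'} \ge V^\pi$.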
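
A minimal sketch of the greedy improvement step: compute $Q^\pi(s, a)$ from a given $V^\pi$ by a one-step lookahead, then pick the maximizing action in each state. The toy MDP, the `P[s][a]` layout, and the hard-coded `V_pi` (rounded values of the equiprobable random policy on this toy MDP) are assumptions for illustration.

```python
# Hypothetical 3-state MDP (not from the slides). State 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
# Assumed input: rounded V^pi of the equiprobable random policy on this MDP.
V_pi = {0: 0.6475, 1: 0.7914, 2: 0.0}


def q_value(P, V, s, a, gamma):
    """One-step lookahead: expected reward plus discounted value of s'."""
    return sum(prob * (reward + gamma * V[s_next])
               for prob, s_next, reward in P[s][a])


def greedy_policy(P, V, gamma):
    """Return the deterministic policy that is greedy with respect to V."""
    return {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
            for s in P}


pi_prime = greedy_policy(P, V_pi, GAMMA)
```

On this toy problem the greedy policy chooses action 1 in states 0 and 1; ties (as in the terminal state) are broken in favor of the first action listed.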

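15 POLICY IMPROVEMENT CONT.
What if $V^{\pi'} = V^\pi$? That is, for all $s \in S$,
$$V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V^{\pi'}(s')\big]\,?$$
But this is the Bellman optimality equation. So $V^{\pi'} = V^*$, and both $\pi$ and $\pi'$ are optimal policies.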

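16 POLICY ITERATION
$$\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$$
Each step $\pi \to V^\pi$ is policy evaluation; each step $V^\pi \to \pi'$ is policy improvement ("greedification").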

17 POLICY ITERATION
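
A minimal sketch of policy iteration for deterministic policies: evaluate the current policy, make it greedy with respect to the resulting values, and stop when no state changes its action. The toy MDP, the `P[s][a]` layout, the tolerance used to break ties, and the function names are assumptions for illustration, not the slide's algorithm box.

```python
# Hypothetical 3-state MDP (not from the slides). State 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def q_value(P, V, s, a, gamma):
    """One-step lookahead value of taking action a in state s."""
    return sum(prob * (reward + gamma * V[s_next])
               for prob, s_next, reward in P[s][a])


def evaluate(P, policy, gamma, theta=1e-10):
    """Iterative policy evaluation for a deterministic policy[s] = a."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


def policy_iteration(P, gamma):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: min(P[s]) for s in P}  # arbitrary initial policy
    while True:
        V = evaluate(P, policy, gamma)
        stable = True
        for s in P:
            old = policy[s]
            best = max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
            # Switch only on a strict improvement so ties cannot oscillate.
            if q_value(P, V, s, best, gamma) > q_value(P, V, s, old, gamma) + 1e-9:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


pi_star, V_star = policy_iteration(P, GAMMA)
```

On this toy problem the loop settles after three evaluation rounds, with action 1 chosen in states 0 and 1.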

18 VALUE ITERATION
A drawback of policy iteration is that each iteration involves a policy evaluation, which itself may require multiple sweeps. Convergence of $V^\pi$ occurs only in the limit, so in principle we have to wait until convergence. As seen, the optimal policy is often obtained long before $V^\pi$ has converged. The policy evaluation step can be truncated in several ways without losing the convergence guarantees of policy iteration. Value iteration stops policy evaluation after just one sweep.

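19 VALUE ITERATION
Recall the full policy evaluation backup:
$$V_{k+1}(s) \leftarrow \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V_k(s')\big]$$
Here is the full value iteration backup:
$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} \big[R^a_{ss'} + \gamma V_k(s')\big]$$
It is a combination of policy improvement and truncated policy evaluation.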

20 VALUE ITERATION CONT.
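
A minimal sketch of value iteration: sweep over the states applying the backup with a max over actions, stop when the largest change is below a threshold, then read off a greedy policy. The toy MDP, the `P[s][a]` layout, the threshold `theta`, and the function names are assumptions for illustration.

```python
# Hypothetical 3-state MDP (not from the slides). State 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def q_value(P, V, s, a, gamma):
    """One-step lookahead value of taking action a in state s."""
    return sum(prob * (reward + gamma * V[s_next])
               for prob, s_next, reward in P[s][a])


def value_iteration(P, gamma, theta=1e-10):
    """Apply the max backup to every state until values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract a deterministic policy that is greedy with respect to V.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma))
              for s in P}
    return policy, V


pi_star, V_star = value_iteration(P, GAMMA)
```

No explicit policy is maintained during the sweeps; the max inside the backup plays the role of the improvement step.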

21 ASYNCHRONOUS DP
All the DP methods described so far require exhaustive sweeps of the entire state set. Asynchronous DP does not use sweeps. Instead it works like this:
- Repeat until a convergence criterion is met: pick a state at random and apply the appropriate backup
- Still needs lots of computation, but does not get locked into hopelessly long sweeps
- Can you select states to back up intelligently? YES: an agent's experience can act as a guide
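
A minimal sketch of asynchronous DP using the value iteration backup on one randomly chosen state at a time. For simplicity it runs a fixed budget of backups instead of testing a convergence criterion; that simplification, the toy MDP, the `P[s][a]` layout, and the names `n_backups` and `seed` are assumptions for illustration.

```python
import random

# Hypothetical 3-state MDP (not from the slides). State 2 is terminal.
# P[s][a] is a list of (probability, next_state, reward) triples.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}


def async_value_iteration(P, gamma, n_backups=1000, seed=0):
    """Back up one randomly picked state per step; no full sweeps."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    states = list(P)
    V = {s: 0.0 for s in P}
    for _ in range(n_backups):  # fixed budget (assumed stopping rule)
        s = rng.choice(states)
        V[s] = max(
            sum(prob * (reward + gamma * V[s_next])
                for prob, s_next, reward in P[s][a])
            for a in P[s]
        )
    return V


V = async_value_iteration(P, GAMMA)
```

Replacing `rng.choice(states)` with the states an agent actually visits is one way to focus the backups on the relevant part of the state set.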

22 GENERALIZED POLICY ITERATION
Generalized Policy Iteration (GPI): any interaction of policy evaluation and policy improvement, independent of their granularity.
[Figure: a geometric metaphor for convergence of GPI]

23 EFFICIENCY OF DP
- Finding an optimal policy takes time polynomial in the number of states.
- BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
- In practice, classical DP can be applied to problems with a few million states.
- Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation.
- It is surprisingly easy to come up with MDPs for which DP methods are not practical.

24 SUMMARY
- Policy evaluation: backups without a max
- Policy improvement: form a greedy policy, if only locally
- Policy iteration: alternate the above two processes
- Value iteration: backups with a max
- Generalized Policy Iteration (GPI)
- Asynchronous DP: a way to avoid exhaustive sweeps
