CS 188: Artificial Intelligence. Markov Decision Processes II. Example: Grid World. Recap: MDPs. Optimal Quantities


1 CS 188: Artificial Intelligence. Markov Decision Processes II. Instructors: Dan Klein and Pieter Abbeel, University of California, Berkeley. [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at the course site.]

Example: Grid World. A maze-like problem: the agent lives in a grid, and walls block the agent's path. Noisy movement: actions do not always go as planned. 80% of the time, the action North takes the agent North; 10% of the time, North takes the agent West; 10% East. If there is a wall in the direction the agent would have been taken, the agent stays put. The agent receives rewards each time step: a small "living" reward each step (which can be negative), and big rewards at the end (good or bad). Goal: maximize the sum of (discounted) rewards.

Recap: MDPs. Markov decision processes are defined by: states S, actions A, transitions P(s' | s, a) (or T(s, a, s')), rewards R(s, a, s') (and a discount γ), and a start state s0. Quantities: a policy is a map of states to actions; utility is the sum of discounted rewards; values are the expected future utility from a state (max node); Q-values are the expected future utility from a q-state (chance node).

Optimal Quantities. The value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally. The value (utility) of a q-state (s, a): Q*(s, a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally. The optimal policy: π*(s) = optimal action from state s. Here s is a state, (s, a) is a q-state, and (s, a, s') is a transition. [Demo: gridworld values (L9D1)]
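To make the setup concrete, here is a minimal sketch of how a grid world like this could be represented in code. The 80/10/10 noise model and the stay-put-at-walls rule come from the slide; the class name, method names, and the default living reward are illustrative assumptions, not the course's actual codebase.

```python
# Minimal sketch of a grid-world MDP (illustrative names, not course code).
# Noise model from the slide: 80% intended direction, 10% each perpendicular;
# bumping into a wall (or the grid edge) leaves the agent where it is.

class GridWorldMDP:
    ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
    PERPENDICULAR = {"N": ("E", "W"), "S": ("E", "W"),
                     "E": ("N", "S"), "W": ("N", "S")}

    def __init__(self, width, height, walls, living_reward=-0.04, noise=0.2):
        self.width, self.height = width, height
        self.walls = set(walls)
        self.living_reward = living_reward
        self.noise = noise
        self.states = [(x, y) for x in range(width) for y in range(height)
                       if (x, y) not in self.walls]

    def actions(self, s):
        return list(self.ACTIONS)

    def _move(self, s, direction):
        # Deterministic move; stay put if blocked by a wall or the grid edge.
        x, y = s
        dx, dy = self.ACTIONS[direction]
        nxt = (x + dx, y + dy)
        if nxt in self.walls or not (0 <= nxt[0] < self.width and 0 <= nxt[1] < self.height):
            return s
        return nxt

    def transitions(self, s, a):
        # T(s, a, s'): list of (next_state, probability) pairs.
        probs = {}
        outcomes = [(a, 1 - self.noise)] + \
                   [(p, self.noise / 2) for p in self.PERPENDICULAR[a]]
        for direction, p in outcomes:
            nxt = self._move(s, direction)
            probs[nxt] = probs.get(nxt, 0.0) + p
        return list(probs.items())

    def reward(self, s, a, s_next):
        # R(s, a, s'): only the small living reward in this sketch;
        # the big terminal rewards would be added for the exit squares.
        return self.living_reward
```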

2 Gridworld Values V*. Gridworld: Q*. [Demo slides showing the optimal values and q-values in the grid world.]

The Bellman Equations. How to be optimal: Step 1: take the correct first action. Step 2: keep being optimal. The definition of optimal utility via the expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values. These are the Bellman equations, and they characterize optimal values in a way we'll use over and over.
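Written out explicitly, the recurrence the slide describes is the standard pair of Bellman optimality equations, in the T, R, γ notation from the recap:

```latex
\begin{align*}
V^*(s)   &= \max_{a} Q^*(s,a) \\
Q^*(s,a) &= \sum_{s'} T(s,a,s') \bigl[\, R(s,a,s') + \gamma\, V^*(s') \,\bigr] \\
V^*(s)   &= \max_{a} \sum_{s'} T(s,a,s') \bigl[\, R(s,a,s') + \gamma\, V^*(s') \,\bigr]
\end{align*}
```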

3 Value Iteration. The Bellman equations characterize the optimal values; value iteration computes them by repeatedly applying the Bellman update: V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V_k(s')]. Value iteration is just a fixed-point solution method, though the V_k vectors are also interpretable as time-limited values.

Convergence*. How do we know the V_k vectors are going to converge? Case 1: if the tree has maximum depth M, then V_M holds the actual untruncated values. Case 2: if the discount is less than 1. Sketch: for any state, V_k and V_{k+1} can be viewed as depth k+1 expectimax results in nearly identical search trees. The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros. That last layer is at best all R_max and at worst all R_min, but everything is discounted by γ^k that far out. So V_k and V_{k+1} differ by at most γ^k max|R|, and as k increases, the values converge.

Policy Methods. Policy Evaluation. [Section header slides.]
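A compact sketch of the value-iteration update just described, written against the hypothetical GridWorldMDP interface from the earlier sketch (the method names are assumptions carried over from that sketch, not the course's implementation):

```python
def value_iteration(mdp, gamma=0.9, iterations=100):
    """Compute time-limited values V_k by repeatedly applying the Bellman update."""
    V = {s: 0.0 for s in mdp.states}              # start from all-zero values
    for _ in range(iterations):
        V_next = {}
        for s in mdp.states:
            # V_{k+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s,a,s') + gamma * V_k(s') ]
            V_next[s] = max(
                sum(p * (mdp.reward(s, a, s2) + gamma * V[s2])
                    for s2, p in mdp.transitions(s, a))
                for a in mdp.actions(s)
            )
        V = V_next
    return V
```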

4 Fixed Policies. Expectimax trees max over all actions to compute the optimal values: do the optimal action at each max node. If we fix some policy π(s), i.e. do what π says to do, the tree becomes simpler: only one action per state, though the tree's values would depend on which policy we fixed.

Utilities for a Fixed Policy. Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy. Define the utility of a state s under a fixed policy π: V^π(s) = expected total discounted rewards starting in s and following π. Recursive relation (one-step look-ahead / Bellman equation): V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')] (see the code sketch below).

Example: Policy Evaluation. Always Go Right vs. Always Go Forward. [Demo slides comparing the values under these two fixed policies.]
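The recursive relation for V^π turns directly into an update rule, exactly like value iteration but with the max removed. A minimal sketch, again against the assumed interface from the earlier code:

```python
def policy_evaluation(mdp, policy, gamma=0.9, iterations=100):
    """Compute V^pi by iterating the fixed-policy Bellman update."""
    V = {s: 0.0 for s in mdp.states}
    for _ in range(iterations):
        V_next = {}
        for s in mdp.states:
            a = policy[s]                          # only one action per state: pi(s)
            # V^pi_{k+1}(s) = sum_{s'} T(s,pi(s),s') [ R(s,pi(s),s') + gamma * V^pi_k(s') ]
            V_next[s] = sum(p * (mdp.reward(s, a, s2) + gamma * V[s2])
                            for s2, p in mdp.transitions(s, a))
        V = V_next
    return V
```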

5 Policy Evaluation. How do we calculate the V's for a fixed policy π? Idea 1: turn the recursive Bellman equations into updates (like value iteration): V^π_{k+1}(s) ← Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π_k(s')]; this is the update sketched in the code above. Efficiency: O(S²) per iteration. Idea 2: without the maxes, the Bellman equations are just a linear system; solve with Matlab (or your favorite linear-system solver). (A sketch of both the linear solve and policy extraction follows below.)

Policy Extraction. Computing actions from values: let's imagine we have the optimal values V*(s). How should we act? It's not obvious! We need to do a mini-expectimax (one step): π*(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]. This is called policy extraction, since it gets the policy implied by the values.

Computing actions from Q-values: let's imagine we have the optimal q-values Q*(s, a). How should we act? Completely trivial to decide: π*(s) = argmax_a Q*(s, a). Important lesson: actions are easier to select from q-values than from values!
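Idea 2 (solving the linear system directly) and policy extraction from values can both be written in a few lines. A sketch using numpy; the state indexing and helper names are illustrative assumptions, not part of the slides:

```python
import numpy as np

def policy_evaluation_exact(mdp, policy, gamma=0.9):
    """Solve V^pi = R^pi + gamma * T^pi V^pi as the linear system (I - gamma T^pi) V = R^pi."""
    states = list(mdp.states)
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T = np.zeros((n, n))          # T^pi: transition matrix under the fixed policy
    R = np.zeros(n)               # R^pi: expected immediate reward under the fixed policy
    for s in states:
        a = policy[s]
        for s2, p in mdp.transitions(s, a):
            T[idx[s], idx[s2]] += p
            R[idx[s]] += p * mdp.reward(s, a, s2)
    V = np.linalg.solve(np.eye(n) - gamma * T, R)
    return {s: V[idx[s]] for s in states}

def extract_policy(mdp, V, gamma=0.9):
    """One-step look-ahead (mini-expectimax): pick the action with the best expected value."""
    policy = {}
    for s in mdp.states:
        policy[s] = max(
            mdp.actions(s),
            key=lambda a: sum(p * (mdp.reward(s, a, s2) + gamma * V[s2])
                              for s2, p in mdp.transitions(s, a)),
        )
    return policy
```

Extraction from q-values is even shorter, as the slide notes: with Q in hand, the best action is just the argmax over a of Q(s, a), with no transition model needed.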

6 Policy Iteration. [Section header slide.]

Problems with Value Iteration. Value iteration repeats the Bellman updates: V_{k+1}(s) ← max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V_k(s')]. Problem 1: it's slow, O(S² A) per iteration. Problem 2: the max at each state rarely changes. Problem 3: the policy often converges long before the values. [Demo: value iteration (L9D2)] k=0, k=1. [Demo frames showing the gridworld values after 0 and 1 iterations.]

7 k=2, k=3, k=4, k=5. [Demo frames showing the gridworld values after 2 through 5 iterations.]

8 k=6, k=7, k=8, k=9. [Demo frames showing the values after 6 through 9 iterations.]

9 k=10, k=11, k=12, k=100. [Demo frames showing the values after 10, 11, 12, and 100 iterations.]

10 Policy Iteration. An alternative approach for computing optimal values: Step 1: policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence. Step 2: policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values. Repeat these steps until the policy converges. This is policy iteration. It's still optimal! It can converge (much) faster under some conditions.

Policy Iteration. Evaluation: for the fixed current policy π_i, find values with policy evaluation; iterate until the values converge: V^{π_i}_{k+1}(s) ← Σ_{s'} T(s, π_i(s), s') [R(s, π_i(s), s') + γ V^{π_i}_k(s')]. Improvement: for fixed values, get a better policy using policy extraction; one-step look-ahead: π_{i+1}(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V^{π_i}(s')]. (A short code sketch follows below.)

Comparison. Both value iteration and policy iteration compute the same thing (all optimal values), and both are dynamic programs for solving MDPs. In value iteration: every iteration updates both the values and (implicitly) the policy; we don't track the policy, but taking the max over actions implicitly recomputes it. In policy iteration: we do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them); after the policy is evaluated, a new policy is chosen (slow like a value iteration pass); the new policy will be better (or we're done).

Summary: MDP Algorithms. So you want to... Compute optimal values: use value iteration or policy iteration. Compute values for a particular policy: use policy evaluation. Turn your values into a policy: use policy extraction (one-step lookahead). These all look the same! They basically are: they are all variations of Bellman updates, and they all use one-step lookahead expectimax fragments. They differ only in whether we plug in a fixed policy or max over actions.
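A short sketch of the two alternating steps, reusing the policy_evaluation and extract_policy helpers from the sketches above (those names are illustrative, not from the course code):

```python
def policy_iteration(mdp, gamma=0.9):
    """Alternate policy evaluation and policy improvement until the policy stops changing."""
    # Start from an arbitrary policy: the first listed action in every state.
    policy = {s: mdp.actions(s)[0] for s in mdp.states}
    while True:
        V = policy_evaluation(mdp, policy, gamma)     # Step 1: evaluate the current fixed policy
        new_policy = extract_policy(mdp, V, gamma)    # Step 2: improve with one-step look-ahead
        if new_policy == policy:                      # policy converged, so we are done
            return policy, V
        policy = new_policy
```

Each pass through the loop is one evaluate-then-improve cycle, matching Steps 1 and 2 on the slide; the evaluation passes are cheap because they consider only the one action π(s) per state.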

11 Double Bandits. Double-Bandit MDP: actions Blue and Red; states Win and Lose. [Diagram: from either state, Blue pays $1 with probability 1.0; Red pays $2 with probability 0.75 (Win) and $0 with probability 0.25 (Lose).] No discount, 100 time steps, and both states have the same value.

Offline Planning. Solving MDPs is offline planning: you determine all quantities through computation, you need to know the details of the MDP, and you do not actually play the game! Play Red vs. Play Blue: which has the higher value? (A worked computation follows below.)

Let's Play! [Sample plays from the class demo:] $2 $2 $0 $2 $2 $2 $2 $0 $0 $0.
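As a worked check, assuming the payoffs shown in the diagram (Blue always pays $1; Red pays $2 with probability 0.75 and $0 otherwise), the 100-step undiscounted values of the two fixed strategies are:

```latex
\begin{align*}
V(\text{play Blue}) &= 100 \times (1.0 \cdot \$1) = \$100 \\
V(\text{play Red})  &= 100 \times (0.75 \cdot \$2 + 0.25 \cdot \$0) = \$150
\end{align*}
```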

12 Online Planning. Let's Play! Rules changed! Red's win chance is different, and the probabilities are now unknown. [Diagram: the same double-bandit MDP, but with Red's win and lose probabilities replaced by question marks.] [Sample plays from the class demo:] $0 $0 $0 $2 $0 $2 $0 $0 $0 $0.

What Just Happened? That wasn't planning, it was learning! Specifically, reinforcement learning. There was an MDP, but you couldn't solve it with just computation; you needed to actually act to figure it out. Important ideas in reinforcement learning that came up: Exploration: you have to try unknown actions to get information. Exploitation: eventually, you have to use what you know. Regret: even if you learn intelligently, you make mistakes. Sampling: because of chance, you have to try things repeatedly. Difficulty: learning can be much harder than solving a known MDP.

Next Time: Reinforcement Learning!
