10/12/2012. Logistics. Planning Agent. MDPs. Review: Expectimax. PS 2 due Thursday 10/18. PS 3 due Thursday 10/25.
- Rosamund Phillips
Transcription
Logistics
- PS 2 due Thursday 10/18
- PS 3 due Thursday 10/25

CSE 473: Markov Decision Processes
Dan Weld
Many slides from Chris Bishop, Mausam, Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer

MDPs: Markov Decision Processes
- Planning Under Uncertainty
- Mathematical Framework
- Bellman Equations
- Value Iteration
- Real-Time Dynamic Programming
- Policy Iteration
- Reinforcement Learning
[Photo: Andrey Markov]

Planning Agent
- The agent receives percepts from the environment, sends actions to it, and must decide: what action next?
- Static vs. Dynamic
- Fully vs. Partially Observable
- Perfect vs. Noisy
- Deterministic vs. Stochastic
- Instantaneous vs. Durative

Objective of an MDP
- Find a policy π: S → A which optimizes one of:
  - minimizes expected cost to reach a goal
  - maximizes expected reward
  - maximizes expected (reward − cost)
- Each objective may be discounted or undiscounted
- Given a finite, infinite, or indefinite horizon

Review: Expectimax
- What if we don't know what the result of an action will be? E.g.,
  - In solitaire, the next card is unknown
  - In Pacman, the ghosts act randomly
- Can do expectimax search
  - Max nodes as in minimax search
  - Chance nodes, like min nodes, except the outcome is uncertain: take the average (expectation) of the children
  - Calculate expected utilities
- Today, we formalize this as a Markov Decision Process
  - Handle intermediate rewards & infinite plans
  - More efficient processing
[Figure: search tree with a max node above chance nodes]
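The expectimax recursion reviewed above can be sketched in a few lines. This is a minimal illustration, not code from the course: the tuple encoding of nodes (`"leaf"`, `"max"`, `"chance"`) and the example tree are assumptions made up for this sketch.

```python
def expectimax(node):
    """Return the expected utility of a search-tree node.

    Hypothetical node encoding (an assumption of this sketch):
      ("leaf", utility)
      ("max", [child, ...])
      ("chance", [(probability, child), ...])
    """
    kind, payload = node
    if kind == "leaf":
        return payload
    if kind == "max":
        # Max node: the agent picks the best child, as in minimax.
        return max(expectimax(child) for child in payload)
    if kind == "chance":
        # Chance node: average the children, weighted by probability.
        return sum(p * expectimax(child) for p, child in payload)
    raise ValueError(f"unknown node kind: {kind}")


# Two actions, each leading to an uncertain outcome.
tree = ("max", [
    ("chance", [(0.5, ("leaf", 8.0)), (0.5, ("leaf", 2.0))]),
    ("chance", [(0.9, ("leaf", 4.0)), (0.1, ("leaf", 20.0))]),
])
root_value = expectimax(tree)
```

The recursion only terminates on finite trees, which is one reason the slides move on to MDPs and value iteration.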
Grid World
- Walls block the agent's path
- The agent's actions may go astray:
  - 80% of the time, the North action takes the agent North (assuming no wall)
  - 10% of the time it actually goes West
  - 10% of the time it actually goes East
  - If there is a wall in the chosen direction, the agent stays put
- Small "living" reward each step
- Big rewards come at the end
- Goal: maximize the sum of rewards

Markov Decision Processes
- An MDP is defined by:
  - A set of states s ∈ S
  - A set of actions a ∈ A
  - A transition function T(s, a, s')
    - Probability that a from s leads to s', i.e., P(s' | s, a)
    - Also called the model
  - A reward function R(s, a, s')
    - Sometimes just R(s) or R(s')
  - A start state (or distribution)
  - Maybe a terminal state
- MDPs: non-deterministic search
- Reinforcement learning: MDPs where we don't know the transition or reward functions

What is Markov about MDPs?
- "Markov" generally means that, conditioned on the present state, the future is independent of the past
- For Markov decision processes, "Markov" means:
  P(S_{t+1} = s' | S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, ..., S_0 = s_0) = P(S_{t+1} = s' | S_t = s_t, A_t = a_t)
[Photo: Andrey Markov]

Solving MDPs
- In deterministic single-agent search problems, we want an optimal plan, or sequence of actions, from the start to a goal
- In an MDP, we want an optimal policy π*: S → A
  - A policy π gives an action for each state
  - An optimal policy maximizes expected utility if followed
  - Defines a reflex agent
[Figure: optimal policy when R(s, a, s') = -0.03 for all non-terminal states s]

Example Optimal Policies
[Figures: optimal Grid World policies for R(s) = -0.01, R(s) = -0.03, R(s) = -0.4, and R(s) = -2.0]
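The Grid World noise model above can be written as a transition function T(s, a, ·). The sketch below makes assumptions that the slides do not spell out: states are `(x, y)` tuples, the grid boundary behaves like a wall, and every action (not just North) slips to its two perpendicular directions with probability 0.1 each. The names `MOVES`, `SLIPS`, and `transition` are made up for this illustration.

```python
# Unit moves for each heading, and the two perpendicular "slip" headings.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
SLIPS = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}


def transition(state, action, walls, width, height):
    """Return {next_state: probability} for one noisy Grid World move."""
    def step(direction):
        dx, dy = MOVES[direction]
        nxt = (state[0] + dx, state[1] + dy)
        # Assumption: leaving the grid is treated like hitting a wall.
        blocked = nxt in walls or not (0 <= nxt[0] < width and 0 <= nxt[1] < height)
        return state if blocked else nxt  # a blocked agent stays put

    left, right = SLIPS[action]
    dist = {}
    for direction, prob in ((action, 0.8), (left, 0.1), (right, 0.1)):
        nxt = step(direction)
        dist[nxt] = dist.get(nxt, 0.0) + prob  # merge outcomes landing on the same cell
    return dist


# 4x3 grid with one interior wall, agent in the bottom-left corner.
dist_north = transition((0, 0), "N", walls={(1, 1)}, width=4, height=3)
```

Here `dist_north` puts probability 0.8 on `(0, 1)`, 0.1 on `(1, 0)`, and 0.1 on staying at `(0, 0)`, because the westward slip runs into the grid boundary.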
Example: High-Low
- Three card types: 2, 3, 4
- Infinite deck, twice as many 2's
- Start with 3 showing
- After each card, you say "high" or "low"
- A new card is flipped
- If you're right, you win the points shown on the new card
- Ties are no-ops (no reward)
- If you're wrong, the game ends
- Differences from expectimax problems:
  - #1: you get rewards as you go
  - #2: you might play forever!

High-Low as an MDP
- States: 2, 3, 4, done
- Actions: High, Low
- Model: T(s, a, s'):
  - P(s' = 4 | 4, Low) = 1/4
  - P(s' = 3 | 4, Low) = 1/4
  - P(s' = 2 | 4, Low) = 1/2
  - P(s' = done | 4, Low) = 0
  - P(s' = 4 | 4, High) = 1/4
  - P(s' = 3 | 4, High) = 0
  - P(s' = 2 | 4, High) = 0
  - P(s' = done | 4, High) = 3/4
  - ...
- Rewards: R(s, a, s'):
  - The number shown on s' if s < s' and a = High (and, symmetrically, if s > s' and a = Low)
  - 0 otherwise
- Start: 3

Search Tree: High-Low
[Figure: High-Low search tree alternating High/Low choices; edge labels T = 0.5, R = 2; T = 0.25, R = 3; T = 0, R = 4; T = 0.25, R = 0]

MDP Search Trees
- Each MDP state gives an expectimax-like search tree
  - s is a state
  - (s, a) is a q-state
  - (s, a, s') is called a transition, with T(s, a, s') = P(s' | s, a) and reward R(s, a, s')
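The High-Low model can be generated from the rules of the game instead of being listed entry by entry. The following is a sketch under the rules as stated on the slides (twice as many 2's, ties are no-ops, a wrong guess ends the game); the names `CARD_PROB`, `high_low_model`, and `T` are chosen for this illustration only.

```python
from fractions import Fraction

# Infinite deck with twice as many 2's as 3's or 4's.
CARD_PROB = {2: Fraction(1, 2), 3: Fraction(1, 4), 4: Fraction(1, 4)}


def high_low_model(state, action):
    """Return a list of (probability, next_state, reward) for the q-state (state, action)."""
    outcomes = []
    for card, prob in CARD_PROB.items():
        if card == state:
            outcomes.append((prob, card, 0))        # tie: no-op, no reward
        elif (card > state) == (action == "High"):
            outcomes.append((prob, card, card))     # right guess: win the points shown
        else:
            outcomes.append((prob, "done", 0))      # wrong guess: the game ends
    return outcomes


def T(state, action, next_state):
    """Transition probability P(next_state | state, action)."""
    return sum(p for p, s2, _ in high_low_model(state, action) if s2 == next_state)


p_done_from_4_high = T(4, "High", "done")
```

Exact fractions are used so that the generated table can be compared directly against the entries on the slide, for example P(s' = done | 4, High) = 3/4.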
Utilities of Sequences
- In order to formalize optimality of a policy, we need to understand utilities of sequences of rewards
- Typically consider stationary preferences:
  [r, r_0, r_1, r_2, ...] ≻ [r, r_0', r_1', r_2', ...] ⇔ [r_0, r_1, r_2, ...] ≻ [r_0', r_1', r_2', ...]
- Theorem: there are only two ways to define stationary utilities
  - Additive utility: U([r_0, r_1, r_2, ...]) = r_0 + r_1 + r_2 + ...
  - Discounted utility: U([r_0, r_1, r_2, ...]) = r_0 + γ r_1 + γ² r_2 + ...

Infinite Utilities?!
- Problem: infinite state sequences have infinite rewards
- Solutions:
  - Finite horizon:
    - Terminate episodes after a fixed T steps (e.g. life)
    - Gives nonstationary policies (π depends on the time left)
  - Absorbing state: guarantee that for every policy, a terminal state will eventually be reached (like "done" for High-Low)
  - Discounting: for 0 < γ < 1,
    U([r_0, ..., r_∞]) = Σ_{t=0}^{∞} γ^t r_t ≤ R_max / (1 − γ)
    - Smaller γ means a smaller "horizon": shorter-term focus

Discounting
- Typically discount rewards by γ < 1 each time step
  - Sooner rewards have higher utility than later rewards
  - Also helps the algorithms converge

Recap: Defining MDPs
- Markov decision processes:
  - States S
  - Start state s_0
  - Actions A
  - Transitions P(s' | s, a), aka T(s, a, s')
  - Rewards R(s, a, s') (and discount γ)
- MDP quantities so far:
  - Policy π = function that chooses an action for each state
  - Utility (aka "return") = sum of discounted rewards

Optimal Utilities
- Define the value of a state s:
  V*(s) = expected utility starting in s and acting optimally
- Define the value of a q-state (s, a):
  Q*(s, a) = expected utility starting in s, taking action a, and thereafter acting optimally
- Define the optimal policy:
  π*(s) = optimal action from state s

Why Not Search Trees?
- Why not solve with expectimax?
- Problems:
  - This tree is usually infinite (why?)
  - The same states appear over and over (why?)
  - We would search once per state (why?)
- Idea: value iteration
  - Compute optimal values for all states all at once using successive approximations
  - Will be a bottom-up dynamic program similar in cost to memoization
  - Do all planning offline, no replanning needed!
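To make the discounted utility concrete, the sketch below evaluates r_0 + γ r_1 + γ² r_2 + ... for a finite reward sequence. The function name `discounted_return` and the example sequences are illustrative assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma**2 * r_2 + ... for a finite reward list."""
    total = 0.0
    # Work backwards: U_t = r_t + gamma * U_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


gamma = 0.9
sooner = discounted_return([1.0, 0.0, 0.0], gamma)  # reward arrives immediately
later = discounted_return([0.0, 0.0, 1.0], gamma)   # same reward, two steps later

# A long run of maximal rewards stays below the bound R_max / (1 - gamma).
long_run = discounted_return([1.0] * 1000, gamma)
bound = 1.0 / (1.0 - gamma)
```

With γ = 0.9 the immediate reward is worth 1.0 while the delayed one is worth about 0.81, and the 1000-step run of unit rewards stays just under the bound of 10.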
The Bellman Equations
- The definition of "optimal utility" leads to a simple one-step look-ahead relationship between optimal utility values:
  V*(s) = max_a Q*(s, a)
  Q*(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]
  V*(s) = max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]

Bellman Backup (MDP)
- Given an estimate of the V* function (say V_n)
- Backing up V_n at state s calculates a new estimate (V_{n+1}):
  Q_{n+1}(s, a) = Σ_{s'} T(s, a, s') [R(s, a, s') + γ V_n(s')]
  V_{n+1}(s) = max_{a ∈ Ap(s)} Q_{n+1}(s, a)
- Q_{n+1}(s, a): value/cost of the strategy: execute action a in s, execute π_n subsequently
- π_n(s) = argmax_{a ∈ Ap(s)} Q_n(s, a)

Bellman Backup
[Figure: backup at state s with actions a1, a2, a3 and successor values V_0 = 0, V_0 = 1, V_0 = 2; Q_1(s, a1) ≈ 2, Q_1(s, a2) ≈ 6.1, Q_1(s, a3) ≈ 6.5; taking the max gives V_1(s) = 6.5]

Value iteration [Bellman 57]
- Assign an arbitrary V_0 to each state
- repeat (iteration n+1)
  - for all states s: compute V_{n+1}(s) by a Bellman backup at s
- until max_s |V_{n+1}(s) − V_n(s)| < ε
  - |V_{n+1}(s) − V_n(s)| is Residual(s); the stopping test is ε-convergence
- Theorem: will converge to unique optimal values
- Basic idea: approximations get refined towards optimal values
- The policy may converge long before the values do

Value Iteration
- Idea:
  - Start with V_0*(s) = 0, which we know is right (why?)
  - Given V_i*, calculate the values for all states for depth i+1:
    V_{i+1}(s) ← max_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V_i(s')]
  - This is called a value update or Bellman update
  - Repeat until convergence
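The value iteration loop above can be sketched as follows. The MDP encoding (`mdp[s][a]` is a list of `(probability, next_state, reward)` triples, with an empty action dict for terminal states) and the three-state example are assumptions of this sketch, not something the slides prescribe.

```python
def value_iteration(mdp, gamma, epsilon=1e-9):
    """Repeat synchronous Bellman backups until the largest residual drops below epsilon."""
    V = {s: 0.0 for s in mdp}  # arbitrary initial assignment V_0
    while True:
        new_V = {}
        for s, actions in mdp.items():
            # Q_{n+1}(s, a) = sum_{s'} T(s,a,s') [R(s,a,s') + gamma V_n(s')]
            q_values = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            ]
            new_V[s] = max(q_values) if q_values else 0.0  # terminal states keep value 0
        residual = max(abs(new_V[s] - V[s]) for s in mdp)
        V = new_V
        if residual < epsilon:
            return V


# Hypothetical example: "stay" earns 1 per step; "go" usually reaches B, which pays 10 on exit.
example_mdp = {
    "A": {"stay": [(1.0, "A", 1.0)],
          "go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"exit": [(1.0, "end", 10.0)]},
    "end": {},
}
V_star = value_iteration(example_mdp, gamma=0.5)
```

For this example the fixed point can be checked by hand: V*(B) = 10, and V*(A) solves V = 0.8 · 5 + 0.2 · 0.5 · V, giving 40/9 ≈ 4.44, which beats staying forever (value 2).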
Value Estimates
- Calculate estimates V_k*(s)
  - The optimal value considering only the next k time steps (k rewards)
  - As k → ∞, V_k approaches the optimal value
- Why:
  - If discounting, distant rewards become negligible
  - If terminal states are reachable from everywhere, the fraction of episodes not ending becomes negligible
  - Otherwise, we can get infinite expected utility, and then this approach actually won't work

Example: Bellman Updates
[Figure: worked Bellman update on Grid World with γ = 0.9, living reward = 0, noise = 0.2]

Example: Value Iteration
[Figure: Grid World value estimates V_1 and V_2]
- Information propagates outward from terminal states, and eventually all states have correct value estimates

Practice: Computing Actions
- Which action should we choose from state s?
  - Given optimal values Q:
    π*(s) = argmax_a Q*(s, a)
  - Given optimal values V:
    π*(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V*(s')]
- Lesson: actions are easier to select from Q's!

Comments
- Decision-theoretic algorithm
- Dynamic programming
- Fixed-point computation
- Probabilistic version of the Bellman-Ford algorithm for shortest-path computation
  - MDP_1: Stochastic Shortest Path Problem
- Time complexity
  - One iteration: O(|S|² |A|)
  - Number of iterations: poly(|S|, |A|, 1/(1 − γ))
- Space complexity: O(|S|)
- Factored MDPs = planning under uncertainty
  - Exponential space, exponential time
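The "Computing Actions" lesson can be seen directly in code: choosing from Q-values needs only a table lookup, while choosing from V-values needs the model and a one-step look-ahead. The sketch assumes the same hypothetical encoding as elsewhere (`mdp[s][a]` is a list of `(probability, next_state, reward)` triples); the example MDP and its optimal values are made up for illustration.

```python
def greedy_from_q(Q, s):
    """Pick argmax_a Q(s, a); no model is needed."""
    return max(Q[s], key=Q[s].get)


def greedy_from_v(mdp, V, gamma, s):
    """Pick argmax_a sum_{s'} T(s,a,s') [R(s,a,s') + gamma V(s')]; needs T and R."""
    def one_step_lookahead(a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
    return max(mdp[s], key=one_step_lookahead)


gamma = 0.5
example_mdp = {
    "A": {"stay": [(1.0, "A", 1.0)],
          "go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"exit": [(1.0, "end", 10.0)]},
    "end": {},
}
# Optimal values for this example, worked out by hand from the Bellman equations.
V_opt = {"A": 40.0 / 9.0, "B": 10.0, "end": 0.0}
Q_opt = {"A": {"stay": 1.0 + gamma * V_opt["A"], "go": V_opt["A"]},
         "B": {"exit": 10.0}}

action_from_q = greedy_from_q(Q_opt, "A")
action_from_v = greedy_from_v(example_mdp, V_opt, gamma, "A")
```

Both routes select the same action; the difference is that `greedy_from_v` cannot run without the transition and reward model.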
Convergence Properties
- V_n → V* in the limit as n → ∞
- ε-convergence: the V_n function is within ε of V*
- Optimality: the current policy is within 2εγ/(1 − γ) of optimal
- Monotonicity
  - V_0 ≤_p V* ⇒ V_n ≤_p V* (V_n monotonic from below)
  - V_0 ≥_p V* ⇒ V_n ≥_p V* (V_n monotonic from above)
  - otherwise V_n is non-monotonic

Convergence
- Define the max-norm: ||U|| = max_s |U(s)|
- Theorem: for any two approximations U^t and V^t,
  ||U^{t+1} − V^{t+1}|| ≤ γ ||U^t − V^t||
  - I.e. any distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true V* (aka U), and value iteration converges to a unique, stable, optimal solution
- Theorem:
  ||U^{t+1} − U^t|| < ε ⇒ ||U^{t+1} − U|| < 2εγ/(1 − γ)
  - I.e. once the change in our approximation is small, it must also be close to correct

Value Iteration Complexity
- Problem size: |A| actions and |S| states
- Each iteration
  - Computation: O(|A| · |S|²)
  - Space: O(|S|)
- Number of iterations
  - Can be exponential in the discount factor γ

Asynchronous Value Iteration
- States may be backed up in any order, instead of systematically, iteration by iteration
- Theorem: as long as every state is backed up infinitely often, asynchronous value iteration converges to optimal

Asynchronous Value Iteration: Prioritized Sweeping
- Why back up a state if the values of its successors are the same?
- Prefer backing up a state whose successors had the most change
- Priority queue of (state, expected change in value)
- Back up in the order of priority
- After backing up a state, update the priority queue for all of its predecessors
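One possible reading of prioritized sweeping is sketched below. It is a simplified variant, not the slides' exact algorithm: a state's priority is taken to be its current Bellman residual, stale queue entries are tolerated rather than updated in place, and the MDP encoding (`mdp[s][a]` is a list of `(probability, next_state, reward)` triples) and the example are assumptions.

```python
import heapq


def prioritized_sweeping(mdp, gamma, epsilon=1e-9, max_backups=100_000):
    """Asynchronous value iteration that backs up the state with the largest residual first."""
    V = {s: 0.0 for s in mdp}

    # predecessors[s2] = states with some action that can lead to s2
    predecessors = {s: set() for s in mdp}
    for s, actions in mdp.items():
        for outcomes in actions.values():
            for p, s2, _ in outcomes:
                if p > 0:
                    predecessors[s2].add(s)

    def backup_value(s):
        q_values = [sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                    for outcomes in mdp[s].values()]
        return max(q_values) if q_values else 0.0

    # heapq is a min-heap, so priorities are stored negated.
    heap = [(-abs(backup_value(s) - V[s]), s) for s in mdp]
    heapq.heapify(heap)
    for _ in range(max_backups):
        if not heap:
            break
        _, s = heapq.heappop(heap)
        new_value = backup_value(s)  # recompute, so stale entries are harmless
        change = abs(new_value - V[s])
        V[s] = new_value
        if change > epsilon:
            # Only predecessors of a changed state can have a new residual.
            for pred in predecessors[s]:
                priority = abs(backup_value(pred) - V[pred])
                if priority > epsilon:
                    heapq.heappush(heap, (-priority, pred))
    return V


example_mdp = {
    "A": {"stay": [(1.0, "A", 1.0)],
          "go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"exit": [(1.0, "end", 10.0)]},
    "end": {},
}
V_sweep = prioritized_sweeping(example_mdp, gamma=0.5)
```

In this example state B is backed up exactly once, after which all further work goes to A, the only state whose successors keep changing.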
Asynchronous Value Iteration: Real-Time Dynamic Programming [Barto, Bradtke, Singh 95]
- Trial: simulate the greedy policy starting from the start state; perform a Bellman backup on visited states
- RTDP: repeat trials until the value function converges
- Why? Why is the next slide saying min?

RTDP Trial
[Figure: trial from start state s_0 towards the Goal; Q_{n+1}(s_0, a) is computed for actions a1, a2, a3, V_{n+1}(s_0) is their min, and a_greedy = a2]

Comments
- Properties
  - If all states are visited infinitely often, then V_n → V*
- Advantages
  - Anytime: more probable states are explored quickly
- Disadvantages
  - Complete convergence can be slow!

Labeled RTDP [Bonet & Geffner ICAPS03]
- Stochastic Shortest Path Problems
  - Policy with min expected cost to reach the goal
- Initialize v_0(s) with an admissible heuristic
  - Underestimates the remaining cost
- Theorem: if the residual of V_k(s) < ε and of V_k(s') < ε for all succ(s), s' in the greedy graph, then V_k is ε-consistent and will remain so
- The labeling algorithm detects convergence
[Figure: greedy graph from s_0 to the Goal]
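An RTDP trial loop for a stochastic shortest path problem might look like the sketch below. Costs are minimized, which is why the greedy step uses min. The encoding (`ssp[s][a]` is a pair `(cost, [(probability, next_state), ...])`), the zero heuristic, the fixed number of trials, and the example problem are all assumptions of this sketch; a full implementation would test for convergence instead.

```python
import random


def rtdp(ssp, start, goal, trials=60, max_steps=1000, seed=0):
    """Run greedy trials from `start`, backing up only the states actually visited."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in ssp}  # zero never overestimates non-negative costs (admissible)

    def q(s, a):
        cost, outcomes = ssp[s][a]
        return cost + sum(p * V[s2] for p, s2 in outcomes)

    for _ in range(trials):
        s = start
        for _ in range(max_steps):
            if s == goal:
                break
            a = min(ssp[s], key=lambda act: q(s, act))  # greedy: min expected cost
            V[s] = q(s, a)                              # Bellman backup at the visited state
            _, outcomes = ssp[s][a]
            next_states = [s2 for _, s2 in outcomes]
            weights = [p for p, _ in outcomes]
            s = rng.choices(next_states, weights=weights)[0]  # simulate the action
    return V


# Hypothetical problem: a1 is cheap but leads to a state that may loop; a2 is a costly shortcut.
example_ssp = {
    "s0": {"a1": (1.0, [(1.0, "s1")]),
           "a2": (4.0, [(1.0, "goal")])},
    "s1": {"a": (1.0, [(0.5, "goal"), (0.5, "s1")])},
    "goal": {},
}
V_rtdp = rtdp(example_ssp, start="s0", goal="goal")
```

In this example the expected cost from s1 solves V = 1 + 0.5 · V, so it is 2, and the estimate at s0 rises from 0 towards 3, staying below the cost 4 of the shortcut throughout.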
Changing the Search Space
- Value Iteration
  - Search in value space
  - Compute the resulting policy
- Policy Iteration
  - Search in policy space
  - Compute the resulting value

Utilities for Fixed Policies
- Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy
- Define the utility of a state s under a fixed policy π:
  V^π(s) = expected total discounted rewards (return) starting in s and following π
- Recursive relation (one-step look-ahead / Bellman equation):
  V^π(s) = Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π(s')]

Policy Evaluation
- How do we calculate the V's for a fixed policy?
- Idea one: modify the Bellman updates
  V^π_0(s) = 0
  V^π_{i+1}(s) ← Σ_{s'} T(s, π(s), s') [R(s, π(s), s') + γ V^π_i(s')]
- Idea two: it's just a linear system; solve with Matlab (or whatever)

Policy Iteration
- Problem with value iteration:
  - Considering all actions each iteration is slow: it takes |A| times longer than policy evaluation
  - But the policy doesn't change each iteration, so time is wasted
- Alternative to value iteration:
  - Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal utilities!) until convergence (fast)
  - Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities (slow but infrequent)
  - Repeat the steps until the policy converges

Policy Iteration
- Policy evaluation: with the current policy π_k fixed, find values with simplified Bellman updates:
  V^{π_k}_{i+1}(s) ← Σ_{s'} T(s, π_k(s), s') [R(s, π_k(s), s') + γ V^{π_k}_i(s')]
  - Iterate until the values converge
- Policy improvement: with fixed utilities, find the best action according to one-step look-ahead:
  π_{k+1}(s) = argmax_a Σ_{s'} T(s, a, s') [R(s, a, s') + γ V^{π_k}(s')]

Policy iteration [Howard 60]
- Assign an arbitrary π_0 to each state
- repeat
  - Policy evaluation: compute V_{n+1}, the evaluation of π_n
    - costly: O(n³)
    - Modified Policy Iteration: approximate by value iteration using the fixed policy
  - Policy improvement: for all states s, compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s, a)
- until π_{n+1} = π_n
- Advantage
  - Searching in a finite (policy) space as opposed to an uncountably infinite (value) space ⇒ faster convergence
  - All other properties follow!
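The two alternating steps of policy iteration can be sketched as below. Evaluation here uses "idea one" (repeated simplified Bellman updates, applied in place) rather than a linear solve. The MDP encoding (`mdp[s][a]` is a list of `(probability, next_state, reward)` triples), the choice of initial policy, and the example are assumptions of this sketch.

```python
def evaluate_policy(mdp, policy, gamma, epsilon=1e-10):
    """Iterate V(s) <- sum_{s'} T(s,pi(s),s') [R + gamma V(s')] until the values settle."""
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s in policy:  # terminal states have no action and keep value 0
            new_v = sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][policy[s]])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < epsilon:
            return V


def policy_iteration(mdp, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    # Arbitrary initial policy: the first listed action in every non-terminal state.
    policy = {s: next(iter(actions)) for s, actions in mdp.items() if actions}
    while True:
        V = evaluate_policy(mdp, policy, gamma)
        improved = {}
        for s in policy:
            def q(a, s=s):
                return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])
            improved[s] = max(mdp[s], key=q)  # one-step look-ahead
        if improved == policy:
            return policy, V
        policy = improved


example_mdp = {
    "A": {"stay": [(1.0, "A", 1.0)],
          "go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"exit": [(1.0, "end", 10.0)]},
    "end": {},
}
pi_star, V_pi = policy_iteration(example_mdp, gamma=0.5)
```

Starting from "stay" in A, one evaluation gives V(A) = 2, the improvement step switches A to "go", and the second round confirms that the policy is stable.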
Modified Policy Iteration
- Assign an arbitrary π_0 to each state
- repeat
  - Policy evaluation: compute V_{n+1}, the approximate evaluation of π_n
  - Policy improvement: for all states s, compute π_{n+1}(s) = argmax_{a ∈ Ap(s)} Q_{n+1}(s, a)
- until π_{n+1} = π_n
- Advantage
  - Probably the most competitive synchronous dynamic programming algorithm

Policy Iteration Complexity
- Problem size: |A| actions and |S| states
- Each iteration
  - Computation: O(|S|³ + |A| · |S|²)
  - Space: O(|S|)
- Number of iterations
  - Unknown, but can be faster in practice
  - Convergence is guaranteed

Comparison
- In value iteration:
  - Every pass (or "backup") updates both the utilities (explicitly, based on current utilities) and the policy (possibly implicitly, based on the current policy)
- In policy iteration:
  - Several passes to update utilities with a frozen policy
  - Occasional passes to update policies
- Hybrid approaches (asynchronous policy iteration):
  - Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

Recap: MDPs
- Markov decision processes:
  - States S
  - Actions A
  - Transitions P(s' | s, a) (or T(s, a, s'))
  - Rewards R(s, a, s') (and discount γ)
  - Start state s_0
- Quantities:
  - Returns = sum of discounted rewards
  - Values = expected future returns from a state (optimal, or for a fixed policy)
  - Q-values = expected future returns from a q-state (optimal, or for a fixed policy)
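Modified policy iteration replaces the exact evaluation step with a few sweeps of the fixed-policy update. The sketch below is one way to read that description: the number of sweeps (`sweeps=3`), the reuse of the value table across rounds, and the MDP encoding (`mdp[s][a]` is a list of `(probability, next_state, reward)` triples) are assumptions made for illustration.

```python
def modified_policy_iteration(mdp, gamma, sweeps=3):
    """Policy iteration with an approximate evaluation step of a few fixed-policy sweeps."""
    V = {s: 0.0 for s in mdp}
    # Arbitrary initial policy: the first listed action in every non-terminal state.
    policy = {s: next(iter(actions)) for s, actions in mdp.items() if actions}

    def q(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in mdp[s][a])

    while True:
        # Approximate policy evaluation: a handful of in-place sweeps, no convergence test.
        for _ in range(sweeps):
            for s in policy:
                V[s] = q(s, policy[s])
        # Policy improvement by one-step look-ahead on the approximate values.
        improved = {s: max(mdp[s], key=lambda a, s=s: q(s, a)) for s in policy}
        if improved == policy:
            return policy, V
        policy = improved


example_mdp = {
    "A": {"stay": [(1.0, "A", 1.0)],
          "go": [(0.8, "B", 0.0), (0.2, "A", 0.0)]},
    "B": {"exit": [(1.0, "end", 10.0)]},
    "end": {},
}
pi_mod, V_mod = modified_policy_iteration(example_mdp, gamma=0.5)
```

Following the slide, the loop stops as soon as the policy is stable, so the returned values are only approximate (here V(A) is about 4.442 against an exact 40/9); running extra evaluation sweeps on the final policy would tighten them.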
More information- International Scientific Journal about Logistics Volume: Issue: 4 Pages: 7-15 ISSN
DOI: 10.22306/al.v3i4.72 Received: 03 Dec. 2016 Accepted: 11 Dec. 2016 THE ANALYSIS OF THE COMMODITY PRICE FORECASTING SUCCESS CONSIDERING DIFFERENT NUMERICAL MODELS SENSITIVITY TO PROGNOSIS ERROR Technical
More informationCPS 270: Artificial Intelligence Markov decision processes, POMDPs
CPS 270: Artificial Intelligence http://www.cs.duke.edu/courses/fall08/cps270/ Markov decision processes, POMDPs Instructor: Vincent Conitzer Warmup: a Markov process with rewards We derive some reward
More informationPrice Trends in a Dynamic Pricing Model with Heterogeneous Customers: A Martingale Perspective
OPERATIONS RESEARCH Vol. 57, No. 5, September October 2009, pp. 1298 1302 in 0030-364X ein 1526-5463 09 5705 1298 inform doi 10.1287/opre.1090.0703 2009 INFORMS TECHNICAL NOTE INFORMS hold copyright to
More informationBuilding Redundancy in Multi-Agent Systems Using Probabilistic Action
Proceeding of the Twenty-Ninth International Florida Artificial Intelligence Reearch Society Conference Building Redundancy in Multi-Agent Sytem Uing Probabilitic Action Annie S. Wu, R. Paul Wiegand, and
More informationExpectimax Search Trees. CS 188: Artificial Intelligence Fall Expectimax Example. Expectimax Pseudocode. Expectimax Pruning?
CS 188: Artificial Intelligence Fall 2011 Expectimax Search Trees What if we don t know what the result of an action will be? E.g., In solitaire, next card is unknown In minesweeper, mine locations In
More informationCS 188: Artificial Intelligence Fall 2011
CS 188: Artificial Intelligence Fall 2011 Lecture 7: Expectimax Search 9/15/2011 Dan Klein UC Berkeley Many slides over the course adapted from either Stuart Russell or Andrew Moore 1 Expectimax Search
More informationLecture 2: Making Good Sequences of Decisions Given a Model of World. CS234: RL Emma Brunskill Winter 2018
Lecture 2: Making Good Sequences of Decisions Given a Model of World CS234: RL Emma Brunskill Winter 218 Human in the loop exoskeleton work from Steve Collins lab Class Structure Last Time: Introduction
More informationCS 5522: Artificial Intelligence II
CS 5522: Artificial Intelligence II Uncertainty and Utilities Instructor: Alan Ritter Ohio State University [These slides were adapted from CS188 Intro to AI at UC Berkeley. All materials available at
More informationCOS402- Artificial Intelligence Fall Lecture 17: MDP: Value Iteration and Policy Iteration
COS402- Artificial Intelligence Fall 2015 Lecture 17: MDP: Value Iteration and Policy Iteration Outline The Bellman equation and Bellman update Contraction Value iteration Policy iteration The Bellman
More informationCapacity Planning in a General Supply Chain with Multiple Contract Types
Capacity Planning in a General Supply Chain with Multiple Contract Type Xin Huang and Stephen C. Grave M.I.T. 1 Abtract The ucceful commercialization of any new product depend to a degree on the ability
More informationCS360 Homework 14 Solution
CS360 Homework 14 Solution Markov Decision Processes 1) Invent a simple Markov decision process (MDP) with the following properties: a) it has a goal state, b) its immediate action costs are all positive,
More information343H: Honors AI. Lecture 7: Expectimax Search 2/6/2014. Kristen Grauman UT-Austin. Slides courtesy of Dan Klein, UC-Berkeley Unless otherwise noted
343H: Honors AI Lecture 7: Expectimax Search 2/6/2014 Kristen Grauman UT-Austin Slides courtesy of Dan Klein, UC-Berkeley Unless otherwise noted 1 Announcements PS1 is out, due in 2 weeks Last time Adversarial
More informationMaximum Expected Utility. CS 188: Artificial Intelligence Fall Preferences. MEU Principle. Rational Preferences. Utilities: Uncertain Outcomes
CS 188: Artificil Intelligence Fll 2011 Mximum Expected Utility Why hould we verge utilitie? Why not minimx? Lecture 8: Utilitie / MDP 9/20/2011 Dn Klein UC Berkeley Principle of mximum expected utility:
More informationConfidence Intervals for One Variance using Relative Error
Chapter 653 Confidence Interval for One Variance uing Relative Error Introduction Thi routine calculate the neceary ample ize uch that a ample variance etimate will achieve a pecified relative ditance
More informationUncertain Outcomes. CS 188: Artificial Intelligence Uncertainty and Utilities. Expectimax Search. Worst-Case vs. Average Case
CS 188: Artificial Intelligence Uncertainty and Utilities Uncertain Outcomes Instructor: Marco Alvarez University of Rhode Island (These slides were created/modified by Dan Klein, Pieter Abbeel, Anca Dragan
More informationTDT4171 Artificial Intelligence Methods
TDT47 Artificial Intelligence Methods Lecture 7 Making Complex Decisions Norwegian University of Science and Technology Helge Langseth IT-VEST 0 helgel@idi.ntnu.no TDT47 Artificial Intelligence Methods
More informationProbabilities. CSE 473: Artificial Intelligence Uncertainty, Utilities. Reminder: Expectations. Reminder: Probabilities
CSE 473: Artificial Intelligence Uncertainty, Utilities Probabilities Dieter Fox [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are
More informationDeep RL and Controls Homework 1 Spring 2017
10-703 Deep RL and Controls Homework 1 Spring 2017 February 1, 2017 Due February 17, 2017 Instructions You have 15 days from the release of the assignment until it is due. Refer to gradescope for the exact
More informationOptimizing revenue for bandwidth auctions over networks with time reservations
Optimizing revenue for bandwidth auction over network with time reervation Pablo Belzarena,a, Fernando Paganini b, André Ferragut b a Facultad de Ingeniería, Univeridad de la República, Montevideo, Uruguay
More informationIntermediate Macroeconomic Theory II, Winter 2009 Solutions to Problem Set 1
Intermediate Macroeconomic Theor II, Winter 2009 Solution to Problem Set 1 1. (18 point) Indicate, when appropriate, for each of the tatement below, whether it i true or fale. Briefl explain, upporting
More informationFISCAL AND MONETARY INTERACTIONS JUNE 15, 2011 MONETARY POLICY AND FISCAL POLICY. Introduction
FISCAL AND MONETARY INTERACTIONS JUNE 15, 2011 Introduction MONETARY POLICY AND FISCAL POLICY Chapter 7: tudied fical policy in iolation from monetary policy Illutrated ome core iue of fical policy (i.e.,
More informationMonte-Carlo Planning Look Ahead Trees. Alan Fern
Monte-Carlo Planning Look Ahead Trees Alan Fern 1 Monte-Carlo Planning Outline Single State Case (multi-armed bandits) A basic tool for other algorithms Monte-Carlo Policy Improvement Policy rollout Policy
More informationThe Problem of Temporal Abstraction
The Problem of Temporal Abstraction How do we connect the high level to the low-level? " the human level to the physical level? " the decide level to the action level? MDPs are great, search is great,
More informationNon-Deterministic Search. CS 188: Artificial Intelligence Markov Decision Processes. Grid World Actions. Example: Grid World
CS 188: Artificil Intelligence Mrkov Deciion Procee Non-Determinitic Serch Dn Klein, Pieter Abbeel Univerity of Cliforni, Berkeley Exmple: Grid World Grid World Action A mze-like problem The gent live
More informationIntroduction to Decision Making. CS 486/686: Introduction to Artificial Intelligence
Introduction to Decision Making CS 486/686: Introduction to Artificial Intelligence 1 Outline Utility Theory Decision Trees 2 Decision Making Under Uncertainty I give a robot a planning problem: I want
More informationDo you struggle with efficiently managing your assets due to a lack of clear, accurate and accessible data? You re not alone.
: k o o L e d i In t e A l a t i p a C k r o W t e A ) M A C ( t n e Managem Do you truggle with efficiently managing your aet due to a lack of clear, accurate and acceible data? You re not alone. Many
More informationCS 461: Machine Learning Lecture 8
CS 461: Machine Learning Lecture 8 Dr. Kiri Wagstaff kiri.wagstaff@calstatela.edu 2/23/08 CS 461, Winter 2008 1 Plan for Today Review Clustering Reinforcement Learning How different from supervised, unsupervised?
More information