
1 Temporal-Difference Learning
Taras Kucherenko, Joonatan Manttari (KTH), March 7, 2017

2 Motivation: disadvantages of MC methods
MC does not work for scenarios without termination.
It updates only at the end of the episode (sometimes that is too late).
It cannot use bootstrapping.

3 Overview
1. Description of TD: introduction to TD, sample backup, TD error, advantages of TD.
2. Examples of TD algorithms: SARSA, Q-Learning, Expected SARSA, Double Q-Learning.
3. Extensions: disadvantages of TD, n-step TD and eligibility traces.

4 Introduction to TD
Temporal-Difference learning: learning after time steps, not entire episodes.
A central idea in RL.
A combination of Monte Carlo and DP:
no need for a model, learn from experience (MC);
updates based on estimates (DP).

5 Introduction to TD
Relation to other methods (figure).

6 TD Prediction
Consider the prediction problem: learn $v_\pi$.
For prediction, both TD and MC use experience: follow $\pi$ and use the experience to update the estimate $V$ of $v_\pi$.
MC methods update once the return is known (at the end of the episode).
TD methods update after one time step.

7 Introduction to TD
Remember the state-value function:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\big[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\big] = \mathbb{E}_\pi\big[R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2} \mid S_t = s\big] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
MC methods estimate $\mathbb{E}_\pi[G_t \mid S_t = s]$ using sampling.
DP methods estimate $\mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$ by using the current estimate $V(S_{t+1})$ instead of the true value $v_\pi(S_{t+1})$.
TD methods use both ideas, sampling and updating from estimates, and so have the potential to combine the advantages of MC and DP.

8 Simple TD
A simple every-visit MC method:
$V(S_t) \leftarrow V(S_t) + \alpha\,[G_t - V(S_t)]$
where $G_t$ is the return following time $t$ and $\alpha$ is a step-size parameter.
Use the target and the current prediction for updates.
Update after the episode ends.
But what if we updated after each time step?

9 MC Refresher

10 Simple TD
A simple TD algorithm: TD(0)
$V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
Update at each time step.
Bootstrapping like DP (uses estimates).
MC target: $G_t$. TD target: $R_{t+1} + \gamma V(S_{t+1})$.
TD(0) updates using only one time step.

11 TD Error
TD updates according to the target and the current estimate:
$V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is known as the TD error.
It is defined for time step $t$ but available only at $t+1$.
The error is the temporal difference between the estimates at $t$ and $t+1$.

12 Sample Backup
Backup diagram of TD(0): immediately updates the prediction.
Sample backup: used by MC and TD; uses $R_{t+1}$ and $V(s')$ of a single sampled successor state $s'$.
Full backup: uses the distribution over all possible successor states $s'$.

13 Tabular TD(0) algorithm for estimating $v_\pi$
Input: the policy $\pi$ to be evaluated.
Initialize $V(s)$ arbitrarily (e.g., $V(s) = 0$ for all $s \in \mathcal{S}^+$).
Repeat (for each episode):
  Initialize $S$
  Repeat (for each step of episode):
    $A \leftarrow$ action given by $\pi$ for $S$
    Take action $A$, observe $R$, $S'$
    $V(S) \leftarrow V(S) + \alpha\,[R + \gamma V(S') - V(S)]$
    $S \leftarrow S'$
  until $S$ is terminal
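
The pseudocode above translates almost line by line into Python. The sketch below is only an illustration, assuming a hypothetical tabular environment object env with reset() and step(action) returning (next_state, reward, done), and a policy function mapping states to actions; none of these names come from the slides.

    from collections import defaultdict

    def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
        """Tabular TD(0) policy evaluation: estimate V close to v_pi for a fixed policy."""
        V = defaultdict(float)                     # V(s) = 0 for all states, including terminal
        for _ in range(num_episodes):
            s = env.reset()                        # Initialize S
            done = False
            while not done:                        # Repeat for each step of the episode
                a = policy(s)                      # A <- action given by pi for S
                s_next, r, done = env.step(a)      # Take action A, observe R, S'
                target = r + gamma * V[s_next] * (not done)   # value of a terminal state is 0
                V[s] += alpha * (target - V[s])    # TD(0) update
                s = s_next                         # S <- S'
        return V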

14 Example: Watching Netflix

15 Example: Watching Netflix
How many episodes of a Netflix series will you watch today?
State signal: time, day, plans, whether auto-play is activated.
Reward: number of episodes watched between states.
State value: expected episodes left to watch.

16 Example: Watching Netflix
Assume your day is as follows (for each state the slide lists the watched episodes, the predicted episodes to go, and the predicted total):
Start watching Sunday at 15:00. Plans: laundry and dinner.
Neighbor says the laundry machine is broken.
Friend calls with dinner plans.
Cliffhanger.

17 Example: Watching Netflix
Assume an every-visit MC method with $\alpha = 1$, $\gamma = 1$.
What would the update be for each state-value estimate?
Remember the update rule:
$V(S_t) \leftarrow V(S_t) + \alpha\,[G_t - V(S_t)]$

18 Example: Watching Netflix
Example of calculating the MC update for the broken laundry machine state:
$V(\text{broken}) \leftarrow V(\text{broken}) + \alpha\,[G_{\text{broken}} - V(\text{broken})]$, giving $V(\text{broken}) \leftarrow 4$.
New predicted total episodes to watch at the broken-laundry state:
$\text{Total}_{\text{broken}} = V(\text{broken}) + \text{episodes watched}(\text{broken}) = 4 + 2 = 6$

19 Example: Watching Netflix
Result of all the updates (figure).
Updates happen only after the day is complete.
But why not update at each new situation? (What if the internet goes down?)

20 Example: Watching Netflix
Comparison with TD(0): what would the update be for each state-value estimate?
Remember the update rule (with $\alpha = \gamma = 1$):
$V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$

21 Example: Watching Netflix
Example of calculating the TD(0) update for the broken laundry machine state:
$V(\text{broken}) \leftarrow V(\text{broken}) + \alpha\,[R_{t+1} + \gamma V(\text{dinner}) - V(\text{broken})]$, giving $V(\text{broken}) \leftarrow 2$.
New predicted total episodes to watch at the broken-laundry state:
$\text{Total}_{\text{broken}} = V(\text{broken}) + \text{episodes watched}(\text{broken}) = 2 + 2 = 4$

22 Example: Watching Netflix
Result of all the updates (figure).

23 Example: Watching Netflix
MC vs TD (figure): TD updates after each situation, and each update is proportional to the error with respect to the next time step.

24 Advantages of TD
Comparison with other methods.
Illustrative examples.
Convergence.

25 Comparison with other methods
Combines sampling (MC) with bootstrapping (DP).
More general than DP: it does not require a model of the environment.
Faster changes than in MC: no need to wait until the end of the episode.

26 Example 1: Random Walk

27 Example 1: Random Walk

28 Example 2: Life of a developer
Episodes (state, reward):
Code runs, 1
Code runs, 1
Code runs, 1
Code runs, 1
Code runs, -1
PC is on, 0; Code runs, -1

29 Convergence of TD
Definition: a sequence $X_n$ of random variables converges in probability to the random variable $X$ if for all $\epsilon > 0$: $\lim_{n \to \infty} \Pr(|X_n - X| \ge \epsilon) = 0$.
Theory: TD converges with probability 1 if the step size satisfies $\sum_{n=1}^{\infty} \alpha_n(a) = \infty$ and $\sum_{n=1}^{\infty} \alpha_n^2(a) < \infty$, where $\alpha_n(a)$ is the step size for action $a$, which we usually just call $\alpha$.
Practice: converges faster, and to a better solution, than MC.
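
As a concrete example (not from the slides), the common schedule $\alpha_n = 1/n$ satisfies both conditions, while a constant step size violates the second one:

    % Checking the two step-size conditions for two common schedules
    \alpha_n = \tfrac{1}{n}: \quad \sum_{n=1}^{\infty} \tfrac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \tfrac{1}{n^{2}} = \tfrac{\pi^{2}}{6} < \infty
    \alpha_n = c \ \text{(constant)}: \quad \sum_{n=1}^{\infty} c = \infty, \qquad \sum_{n=1}^{\infty} c^{2} = \infty

A constant step size therefore falls outside the theorem, although it is often used in practice, especially for nonstationary problems.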

30 TD Methods for Control Problems
Use TD methods to find the optimal policy.
As before, exploration is achieved using on-policy or off-policy methods.
On-policy: SARSA. Off-policy: Q-learning.

31 On-policy & Off-policy
Refresher on on-policy and off-policy methods: for both, the goal is to update state-action values to find the optimal policy.
On-policy: update Q(s, a) from experience gathered while following the current policy.
Off-policy: update Q(s, a) from experience gathered while following another policy (the behavior policy).
Example: a new go-kart driver learning from other drivers.

32 SARSA
Uses the action-value function (not the state-value function).
Estimate $q_\pi(s, a)$ for all $(s, a)$ while following policy $\pi$.
Similar to TD(0): a Markov chain with a reward process.
Convergence is assured for $Q(S_t, A_t)$ by the same theorems.
Different transition scheme.

34 SARSA
Update rule for SARSA:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
Update after each transition.
The update depends on $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$.

35 SARSA
Update rule for SARSA:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
Update after each transition.
If $S_{t+1}$ is terminal, $Q(S_{t+1}, A_{t+1}) = 0$.
The update depends on $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, hence the name SARSA.

36 SARSA
Backup diagram of SARSA (figure).

37 SARSA Algorithm
Initialize $Q(s, a)$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, and set $Q(\text{terminal-state}, \cdot) = 0$.
Repeat (for each episode):
  Initialize $S$
  Choose $A$ from $S$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
  Repeat (for each step of episode):
    Take action $A$, observe $R$, $S'$
    Choose $A'$ from $S'$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
    $Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma Q(S', A') - Q(S, A)]$
    $S \leftarrow S'$, $A \leftarrow A'$
  until $S$ is terminal
On-policy, since we follow our own $\epsilon$-greedy policy to choose actions.
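
A minimal Python sketch of the SARSA algorithm above, under the same assumed environment interface as in the TD(0) sketch (a hypothetical env with reset() and step(), plus a finite list of actions); it is an illustration, not the authors' code.

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, s, actions, eps):
        # With probability eps explore, otherwise act greedily w.r.t. the current Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa(env, actions, num_episodes, alpha=0.5, gamma=1.0, eps=0.1):
        Q = defaultdict(float)                         # Q(s, a) = 0, including terminal states
        for _ in range(num_episodes):
            s = env.reset()
            a = epsilon_greedy(Q, s, actions, eps)     # Choose A from S (eps-greedy)
            done = False
            while not done:
                s2, r, done = env.step(a)              # Take action A, observe R, S'
                a2 = epsilon_greedy(Q, s2, actions, eps)   # Choose A' from S'
                target = r + gamma * Q[(s2, a2)] * (not done)
                Q[(s, a)] += alpha * (target - Q[(s, a)])  # SARSA update
                s, a = s2, a2                          # S <- S', A <- A'
        return Q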

38 SARSA
Convergence properties depend on how the policy $\pi$ is derived from $Q$.
SARSA converges to $\pi_*$ and $q_*$ with probability 1, given that:
all $(s, a)$ pairs are visited infinitely many times, and
the policy converges to the greedy policy (e.g., with $\epsilon = 1/t$).

39 SARSA Example: Windy Grid World
Wind shifts the position upward.
$\mathcal{A}$ = (up, down, left, right).
$R = -1$ for all $(s, a)$ until the goal is reached.
Apply $\epsilon$-greedy SARSA with $\epsilon = 0.1$, $\alpha = 0.5$, and initial $Q(s, a) = 0$ for all $(s, a)$.

40 SARSA Example: Windy Grid World
Results of applying $\epsilon$-greedy SARSA: the policy improves after each time step.
The optimal policy is reached after 6000 time steps.
MC methods could get stuck here; TD is more suitable because it learns during the episode.

41 Q-Learning (Off-Policy)
Update rule:
$Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma \max_a Q(S', a) - Q(S, A)]$
Directly approximates $q_*$, independently of the policy being followed, because the update does not depend on that policy.
Convergence is guaranteed for any behavior policy, as long as:
all state-action pairs continue to be updated, and
the step size satisfies the usual stochastic-approximation conditions.

42 Q-Learning Algorithm
Initialize $Q(s, a)$ arbitrarily for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$, and set $Q(\text{terminal-state}, \cdot) = 0$.
Repeat (for each episode):
  Initialize $S$
  Repeat (for each step of episode):
    Choose $A$ from $S$ using a policy derived from $Q$ (e.g., $\epsilon$-greedy)
    Take action $A$, observe $R$, $S'$
    $Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma \max_a Q(S', a) - Q(S, A)]$
    $S \leftarrow S'$
  until $S$ is terminal
Off-policy, since we take the max for the update (the greedy policy) rather than the action actually chosen next.
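
The same sketch adapted for Q-learning: the only substantive change from the SARSA sketch is that the target uses $\max_a Q(S', a)$ instead of the action that will actually be taken (same assumed env/actions interface as before, illustrative only).

    import random
    from collections import defaultdict

    def q_learning(env, actions, num_episodes, alpha=0.5, gamma=1.0, eps=0.1):
        Q = defaultdict(float)                          # Q(s, a) = 0 for all pairs
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                # Behavior policy: eps-greedy with respect to the current Q
                if random.random() < eps:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda act: Q[(s, act)])
                s2, r, done = env.step(a)
                best_next = max(Q[(s2, act)] for act in actions)   # max_a Q(S', a)
                target = r + gamma * best_next * (not done)
                Q[(s, a)] += alpha * (target - Q[(s, a)])          # off-policy update
                s = s2
        return Q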

43 Example: Cliff Walking

44 Expected SARSA
Update rule for SARSA:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
Update rule for Q-learning:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$

45 Expected SARSA
Update rule for SARSA:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$
Update rule for Q-learning:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)]$
Update rule for Expected SARSA:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma\, \mathbb{E}[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q(S_t, A_t)] = Q(S_t, A_t) + \alpha\,[R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)]$

46 Expected SARSA
Updates deterministically in the same direction as SARSA moves in expectation.
Eliminates the variance caused by randomly selecting $A_{t+1}$.
Computationally more expensive (a sum over $\mathcal{A}(S')$).
Performs better than SARSA, since it uses more information for its updates.
Performs well even with $\alpha = 1$ (a bad $\epsilon$-greedy draw is not a problem).
If applied off-policy with a greedy target policy $\pi$ and an exploratory behavior policy $\mu$, Expected SARSA becomes Q-learning, because then
$\sum_a \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) = \max_a Q(S_{t+1}, a)$.
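
A small sketch of how the expected target can be computed for an $\epsilon$-greedy policy $\pi$ (an illustrative helper, using the same tabular Q/actions conventions as the earlier sketches); plugging it into the SARSA loop in place of the sampled $Q(S', A')$ target gives Expected SARSA.

    def expected_sarsa_target(Q, s_next, actions, r, gamma, eps, done):
        # Target: R + gamma * sum_a pi(a | S') Q(S', a), with pi the eps-greedy policy w.r.t. Q
        if done:
            return r                                   # terminal state has value 0
        q_vals = [Q[(s_next, a)] for a in actions]
        best = max(range(len(actions)), key=lambda i: q_vals[i])
        n = len(actions)
        probs = [eps / n + (1.0 - eps) * (i == best) for i in range(n)]   # eps-greedy probabilities
        expected_q = sum(p * q for p, q in zip(probs, q_vals))
        return r + gamma * expected_q

With eps = 0 the probabilities collapse onto the greedy action and the target reduces to the Q-learning target, matching the off-policy remark above.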

47 Expected SARSA vs Q-learning
Figure: backup diagrams.

48 Maximization bias
Reminder, the Q-learning update rule:
$Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma \max_a Q(S', a) - Q(S, A)]$
We take a maximum over estimates, which can lead to very optimistic value estimates.
This problem is called maximization bias.
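
A tiny numerical illustration of the bias (not from the slides): if every action has true value 0 but we only have noisy estimates, the maximum of the estimates is systematically above 0.

    import random

    random.seed(0)
    n_actions, n_samples, n_runs = 10, 5, 10_000
    total = 0.0
    for _ in range(n_runs):
        # True value of every action is 0; each estimate is a noisy sample mean
        estimates = [sum(random.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
                     for _ in range(n_actions)]
        total += max(estimates)                # the quantity Q-learning bootstraps from
    print("average of max estimate:", total / n_runs)   # clearly positive, not 0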

49 Example: making plans for the weekend

50 Maximization bias

51 Double Q-Learning
Learn two independent estimates, $Q_1(S, A)$ and $Q_2(S, A)$.
Use one to determine the maximizing action: $A^* = \arg\max_a Q_1(S, a)$.
Use the other to provide its value estimate: $Q_2(S, A^*) = Q_2(S, \arg\max_a Q_1(S, a))$.
The resulting algorithm learns an unbiased estimate: $\mathbb{E}[Q_2(S, A^*)] = q(S, A^*)$.
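
A sketch of a single Double Q-learning update step under the same tabular conventions as before (illustrative only); actions would typically be selected $\epsilon$-greedily with respect to $Q_1 + Q_2$.

    import random
    from collections import defaultdict

    def double_q_update(Q1, Q2, s, a, r, s2, actions, done, alpha=0.5, gamma=1.0):
        # With probability 0.5 update Q1 using Q2's value of Q1's greedy action, else the reverse
        if random.random() < 0.5:
            Q1, Q2 = Q2, Q1                     # swap roles; both tables are mutated in place
        if done:
            target = r
        else:
            a_star = max(actions, key=lambda act: Q1[(s2, act)])   # argmax from one estimate
            target = r + gamma * Q2[(s2, a_star)]                  # value from the other estimate
        Q1[(s, a)] += alpha * (target - Q1[(s, a)])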

52 Disadvantages of TD
Reminder, the Q-learning update rule:
$Q(S, A) \leftarrow Q(S, A) + \alpha\,[R + \gamma \max_a Q(S', a) - Q(S, A)]$
It suffers from maximization bias.
The learning is local: only the state immediately before the reward is updated.
With the usual initialization, $Q(S, A) = 0$ for all $S, A$, so the update reads
$Q(S, A) \leftarrow 0 + \alpha\,[R + 0 - 0]$,
and after the first episode only the last state has been updated.

53 n-step TD
TD and MC sit at opposite ends of the backup-depth spectrum.

54 n-step TD

55 n-step TD
Remember the 1-step TD update:
$V(S_t) \leftarrow V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
What if we used the rewards from 2 transitions?
$R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
Or the rewards from n transitions?
$R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$
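
A small helper that computes that n-step target, as a sketch with illustrative argument names; the value update is then $V(S_t) \leftarrow V(S_t) + \alpha\,[G - V(S_t)]$ with this $G$ as the target.

    def n_step_return(rewards, V, s_n, n, gamma, terminal=False):
        """n-step target: R_{t+1} + gamma R_{t+2} + ... + gamma^(n-1) R_{t+n} + gamma^n V(S_{t+n}).

        rewards  -- the observed rewards [R_{t+1}, ..., R_{t+n}]
        s_n      -- the state S_{t+n} reached after n transitions
        terminal -- True if S_{t+n} is terminal, in which case its value counts as 0
        """
        g = 0.0
        for k, r in enumerate(rewards[:n]):
            g += (gamma ** k) * r                  # discounted sum of the first n rewards
        if not terminal:
            g += (gamma ** n) * V[s_n]             # bootstrap from the estimate at S_{t+n}
        return g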

56 n-step TD

57 n-step TD
Example of updates. Assume the following episode in a gridworld.
How would the simple 1-step TD(0) backup look after one episode?
What about the 3-step backup?

58 Backups after one episode: 1-step TD(0), 2-step TD(0), 4-step TD(0), MC (figure).

59 Eligibility Traces
n-step TD can update more values.
However, n transitions are required before an update can be made.
How can we use the return from n transitions without having to wait? Eligibility traces.

60 Eligibility Traces
Eligibility traces use a complex backup: an average of several n-step TD errors.
Example: a complex backup built from the 2-step and 4-step TD errors.

61 Eligibility Traces
Eligibility traces use a complex backup over n-step returns, weighted by $\lambda$ (the trace-decay parameter).
What is the result when $\lambda = 0$? What about $\lambda = 1$?
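
A sketch of that $\lambda$-weighted combination (the forward-view $\lambda$-return), assuming the n-step returns for the state have already been computed; with $\lambda = 0$ only the 1-step return survives, and with $\lambda = 1$ all the weight moves to the full return, which answers the question above.

    def lambda_return(n_step_returns, full_return, lam):
        """Forward-view lambda-return:
        (1 - lam) * sum_{n>=1} lam^(n-1) * G_t^(n)  +  lam^(T-t-1) * G_t.

        n_step_returns -- [G_t^(1), ..., G_t^(T-t-1)], the truncated n-step returns
        full_return    -- the complete return G_t (the Monte Carlo target)
        """
        g = 0.0
        for n, g_n in enumerate(n_step_returns, start=1):
            g += (1.0 - lam) * (lam ** (n - 1)) * g_n   # geometric weighting of n-step returns
        g += (lam ** len(n_step_returns)) * full_return # remaining weight on the full return
        return g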

62 Eligibility Traces
Second part: eligibility traces (figure).

65 Summary
TD works better than MC for prediction, because updates happen at every step rather than only at the end of the episode.
For control, TD is combined with ideas from DP: the common approach is Generalized Policy Iteration (GPI), an idea abstracted from Dynamic Programming.

67 References
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book.

68 TD Error
The Monte Carlo error can be written in terms of TD errors (exactly so when $V$ is held fixed during the episode):
$G_t - V(S_t) = R_{t+1} + \gamma G_{t+1} - V(S_t) + \gamma V(S_{t+1}) - \gamma V(S_{t+1})$
$\quad = \delta_t + \gamma\,(G_{t+1} - V(S_{t+1}))$
$\quad = \delta_t + \gamma \delta_{t+1} + \gamma^2\,(G_{t+2} - V(S_{t+2}))$
$\quad = \delta_t + \gamma \delta_{t+1} + \dots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}\,(G_T - V(S_T))$
$\quad = \delta_t + \gamma \delta_{t+1} + \dots + \gamma^{T-t-1} \delta_{T-1} + \gamma^{T-t}\,(0 - 0)$
$\quad = \sum_{k=0}^{T-t-1} \gamma^k \delta_{t+k}$   (1)
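
A quick numerical check of identity (1) on an arbitrary made-up episode (values chosen only for illustration), keeping $V$ fixed throughout:

    gamma = 0.9
    rewards = [1.0, -2.0, 0.5, 3.0]            # R_{t+1}, ..., R_T for one episode
    values  = [0.2, -0.1, 0.4, 0.3, 0.0]       # V(S_t), ..., V(S_T); terminal value is 0

    # Left-hand side: the Monte Carlo error G_t - V(S_t) at t = 0
    G = sum(gamma ** k * r for k, r in enumerate(rewards))
    mc_error = G - values[0]

    # Right-hand side: discounted sum of TD errors delta_k = R_{k+1} + gamma V(S_{k+1}) - V(S_k)
    deltas = [rewards[k] + gamma * values[k + 1] - values[k] for k in range(len(rewards))]
    td_sum = sum(gamma ** k * d for k, d in enumerate(deltas))

    assert abs(mc_error - td_sum) < 1e-12      # the two sides agree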
