Multi-step Bootstrapping
Jennifer She
Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto
February 7, 2017
Multi-step Bootstrapping
A generalization of Monte Carlo methods and one-step TD methods that includes the methods lying between these two extremes.
Methods are based on sample episodes of states, actions and rewards.
The time intervals for making updates and for bootstrapping are no longer the same, which enables bootstrapping to occur over longer time intervals.
Prediction Problem (Policy Evaluation)
Given a fixed policy π, estimate the state-value function v_π.
Monte Carlo update:
    V(S_t) ← V(S_t) + α (G_t − V(S_t))
    G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... + γ^{T−t−1} R_T
Updates of the state-value estimates happen at the end of each episode.
G_t is the complete return of the episode after S_t.
No bootstrapping is involved (the update does not use other estimates).
One-step TD update:
    V_{t+1}(S_t) ← V_t(S_t) + α (R_{t+1} + γ V_t(S_{t+1}) − V_t(S_t))
Updates happen one step later, bootstrapping on V_t(S_{t+1}).
R_{t+1} + γ V_t(S_{t+1}) approximates G_t.
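A minimal sketch (not from the slides) contrasting the two tabular updates, assuming a value table stored as a dict and an every-visit Monte Carlo variant; names and the trajectory-list layout are illustrative:

```python
def mc_episode_update(V, states, rewards, alpha, gamma):
    """Monte Carlo: after the episode ends, move every visited state toward its
    complete return G_t. rewards[t] holds R_{t+1}, the reward following states[t].
    No bootstrapping: the target uses no other value estimates."""
    G = 0.0
    for t in reversed(range(len(states))):
        G = rewards[t] + gamma * G            # G_t = R_{t+1} + gamma * G_{t+1}
        V[states[t]] += alpha * (G - V[states[t]])

def td0_update(V, s, r, s_next, done, alpha, gamma):
    """One-step TD: update immediately, bootstrapping on V(S_{t+1})."""
    target = r + (0.0 if done else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
```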
n-step TD Prediction
Approximate G_t by looking ahead n steps, bootstrapping on V_{t+n−1}(S_{t+n}):
    G_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n V_{t+n−1}(S_{t+n})   for 0 ≤ t < T − n
    G_t^{(n)} = G_t                                                                     for t + n ≥ T
G_t^{(n)} incorporates the discounted rewards up to R_{t+n} and is called the n-step return.
G_t^{(1)} is the one-step TD target; with t + n ≥ T the n-step return is the full Monte Carlo return.
n-step TD Prediction
For n > 1, V_{t+n−1}(S_{t+n}) involves rewards and value estimates that are not yet available at the transition from t to t + 1.
We must wait until time t + n to update V(S_t):
    V_{t+n}(S_t) ← V_{t+n−1}(S_t) + α (G_t^{(n)} − V_{t+n−1}(S_t))   for 0 ≤ t < T
No updates are made during the first n − 1 time steps; to compensate, n − 1 updates are made at the end of the episode using G_t.
These are still considered TD methods (for n < T): an earlier estimate is changed based on how it differs from a later estimate.
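A sketch of the tabular n-step TD prediction loop described above; the `env` interface (reset()/step() returning next state, reward, done) and the `policy` function are assumptions for illustration:

```python
def n_step_td_prediction(env, policy, V, n, alpha, gamma, num_episodes):
    """n-step TD for estimating V ~ v_pi. V maps state -> value."""
    for _ in range(num_episodes):
        states, rewards = [env.reset()], [0.0]   # rewards[k] holds R_k (index 0 unused)
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                      # time whose estimate is updated now
            if tau >= 0:
                # n-step return G_tau^{(n)}: discounted rewards, then a bootstrap term
                G = sum(gamma ** (k - tau - 1) * rewards[k]
                        for k in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```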
n-step TD Prediction
The expected n-step return is guaranteed to be a better estimate of v_π than V_{t+n−1}, in a worst-case sense:
    max_s | E[G_t^{(n)} | S_t = s] − v_π(s) | ≤ γ^n max_s | V_{t+n−1}(s) − v_π(s) |
Because of this error-reduction property, all n-step TD methods converge to the correct predictions under appropriate technical conditions.
Example
Random walk starting from state C.
Rewards are all 0 except when following the right arrow from state E, which gives reward 1.
The true state values of A through E are 1/6, 2/6, 3/6, 4/6, 5/6.
Initialize V(s) = 0.5 for all s.
Suppose the first episode goes from C to the right, through D and E.
At the end of the episode:
a one-step method increments only V(E) towards 1;
a two-step method increments both V(D) and V(E) towards 1;
for n ≥ 3, all of V(C), V(D) and V(E) are incremented towards 1.
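To make the example concrete, here is a sketch of the five-state random walk (states A–E as indices 0–4, start at C, reward 1 only on stepping off the right end); it plugs into the n-step prediction sketch above. The class name and interface are illustrative, not from the slides:

```python
import random

class RandomWalk:
    """States 0..4 correspond to A..E; the episode ends on stepping off either end."""
    def __init__(self, n_states=5):
        self.n_states = n_states

    def reset(self):
        self.s = self.n_states // 2              # start in the middle state (C)
        return self.s

    def step(self, action):                      # action: -1 (left) or +1 (right)
        self.s += action
        if self.s < 0:                           # fell off the left end: reward 0
            return self.s, 0.0, True
        if self.s >= self.n_states:              # stepped off the right end: reward 1
            return self.s, 1.0, True
        return self.s, 0.0, False

# Usage with the n-step TD sketch, equiprobable random policy:
# V = {s: 0.5 for s in range(5)}
# n_step_td_prediction(RandomWalk(), lambda s: random.choice([-1, 1]), V,
#                      n=2, alpha=0.1, gamma=1.0, num_episodes=100)
```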
Example
Empirical comparison on a similar problem: a random walk with 19 states, where all rewards are 0 except −1 upon exiting at the left-most state.
An intermediate value of n works best.
Control Problem (Policy Evaluation + Policy Improvement)
Find an optimal policy π*.
Alternate between estimating the action-value function q_π (evaluation) and updating the policy π (improvement).
Estimate q_π rather than v_π because this information is needed to decide on the next π.
Control Problem (On-Policy)
Evaluation step
Monte Carlo evaluation:
    Q(S_t, A_t) ← Q(S_t, A_t) + α (G_t − Q(S_t, A_t))
Sarsa (one-step on-policy TD) evaluation:
    Q_{t+1}(S_t, A_t) ← Q_t(S_t, A_t) + α (R_{t+1} + γ Q_t(S_{t+1}, A_{t+1}) − Q_t(S_t, A_t))
    R_{t+1} + γ Q_t(S_{t+1}, A_{t+1}) approximates G_t.
Improvement step
ε-greedy (or any other ε-soft policy) helps maintain exploration:
    A* ← argmax_a Q(S_t, a)
    π(a|S_t) = 1 − ε + ε/|A(S_t)|   if a = A*
    π(a|S_t) = ε/|A(S_t)|           for all other a ∈ A(S_t)
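A sketch of the ε-greedy improvement step: the greedy action gets probability 1 − ε + ε/|A|, every other action ε/|A|. Q is assumed to be a dict keyed by (state, action); names are illustrative:

```python
import random

def epsilon_greedy_probs(Q, state, actions, epsilon):
    """Return {a: pi(a|state)} for an epsilon-greedy policy derived from Q."""
    best = max(actions, key=lambda a: Q[(state, a)])
    probs = {a: epsilon / len(actions) for a in actions}
    probs[best] += 1.0 - epsilon
    return probs

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Sample an action from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(Q, state, actions, epsilon)
    return random.choices(list(probs), weights=list(probs.values()))[0]
```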
n-step Sarsa
Modification to the evaluation step: as in prediction, approximate G_t with
    G_t^{(n)} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n Q_{t+n−1}(S_{t+n}, A_{t+n})   for 0 ≤ t < T − n
    G_t^{(n)} = G_t                                                                              for t + n ≥ T
    Q_{t+n}(S_t, A_t) ← Q_{t+n−1}(S_t, A_t) + α (G_t^{(n)} − Q_{t+n−1}(S_t, A_t))   for 0 ≤ t < T
Expected Sarsa
Replace Q_{t+n−1}(S_{t+n}, A_{t+n}) with
    E[Q_{t+n−1}(S_{t+n}, A_{t+n}) | S_{t+n}] = Σ_a π(a|S_{t+n}) Q_{t+n−1}(S_{t+n}, a)
This moves deterministically in the same direction that Sarsa moves in expectation.
It requires more computation, but eliminates the variance due to sampling A_{t+n}.
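A sketch of one episode of the n-step Sarsa evaluation step; the control loop would keep `policy` ε-greedy with respect to the current Q. The `env` and `policy` interfaces are the same assumptions as in the earlier sketches:

```python
def n_step_sarsa_episode(env, policy, Q, n, alpha, gamma):
    """Run one episode of n-step Sarsa, updating the dict Q[(state, action)] in place.
    policy(state) samples an action (e.g. epsilon-greedy w.r.t. Q)."""
    s = env.reset()
    states, actions, rewards = [s], [policy(s)], [0.0]   # rewards[k] holds R_k
    T = float('inf')
    t = 0
    while True:
        if t < T:
            s_next, r, done = env.step(actions[t])
            states.append(s_next)
            rewards.append(r)
            if done:
                T = t + 1
            else:
                actions.append(policy(s_next))
        tau = t - n + 1
        if tau >= 0:
            # n-step return, bootstrapping on Q(S_{tau+n}, A_{tau+n}) if the episode continues
            G = sum(gamma ** (k - tau - 1) * rewards[k]
                    for k in range(tau + 1, min(tau + n, T) + 1))
            if tau + n < T:
                G += gamma ** n * Q[(states[tau + n], actions[tau + n])]
            sa = (states[tau], actions[tau])
            Q[sa] += alpha * (G - Q[sa])
        if tau == T - 1:
            break
        t += 1
```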
Example
Gridworld scenario where the rewards at all states are 0 except for a positive reward on square G.
Initialize V(s) = 0 for all s.
Suppose you take a path on the first episode and end at G.
At the end of the episode:
a one-step method strengthens only the last state–action pair of the path for the next policy;
an n-step method strengthens the last n state–action pairs of the path for the next policy.
Control Problem (Off-Policy)
Learn the value of one policy π while following another policy µ.
π is often greedy and µ exploratory (e.g. ε-greedy).
Requires coverage: π(a|s) > 0 implies µ(a|s) > 0.
Importance sampling (Monte Carlo)
The step size takes into account the difference between π and µ through the relative probability of all the subsequent actions:
    V(S_t) ← V(S_t) + α ρ_t^T (G_t − V(S_t))
ρ_t^T is the importance sampling ratio; the transition probabilities cancel:
    ρ_t^T = ∏_{k=t}^{T−1} [π(A_k|S_k) p(S_{k+1}|S_k, A_k)] / [µ(A_k|S_k) p(S_{k+1}|S_k, A_k)] = ∏_{k=t}^{T−1} π(A_k|S_k) / µ(A_k|S_k)
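A sketch of the weighted Monte Carlo update over one stored episode generated by µ; `pi_prob` and `mu_prob` are illustrative names for functions returning π(a|s) and µ(a|s), and the trajectory lists are indexed so that rewards[k] holds R_k:

```python
def off_policy_mc_update(V, states, actions, rewards, T, alpha, gamma, pi_prob, mu_prob):
    """Importance-sampled Monte Carlo update of V(S_t) for every t in an episode of length T."""
    for t in range(T):
        # rho_t^T: relative probability of the remaining actions under pi vs. mu
        rho = 1.0
        for k in range(t, T):
            rho *= pi_prob(actions[k], states[k]) / mu_prob(actions[k], states[k])
        # complete return G_t
        G = sum(gamma ** (k - t - 1) * rewards[k] for k in range(t + 1, T + 1))
        V[states[t]] += alpha * rho * (G - V[states[t]])
```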
Off-Policy Learning by Importance Sampling
In n-step methods, returns are constructed over n steps, so only the relative probability of those n actions is of interest.
Incorporate ρ_t^{t+n} (in place of ρ_t^T) into the TD update:
    ρ_t^{t+n} = ∏_{k=t}^{min(t+n, T−1)} π(A_k|S_k) / µ(A_k|S_k)
    V_{t+n}(S_t) ← V_{t+n−1}(S_t) + α ρ_t^{t+n} (G_t^{(n)} − V_{t+n−1}(S_t))   for 0 ≤ t < T
If any π(A_k|S_k) = 0, then ρ_t^{t+n} = 0 and the return is ignored entirely.
If any π(A_k|S_k) ≫ µ(A_k|S_k), then ρ_t^{t+n} increases the weight given to the return, which compensates for the action being rarely selected under µ.
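A sketch of one off-policy n-step TD state-value update from stored trajectory lists, under the same illustrative assumptions as the previous sketch (rewards[k] = R_k; `pi_prob`/`mu_prob` return action probabilities); the ratio is truncated at min(t+n, T−1) as in the formula above:

```python
def off_policy_n_step_td_update(V, states, actions, rewards, t, n, T,
                                alpha, gamma, pi_prob, mu_prob):
    """One update of V(S_t) using the n-step return and rho_t^{t+n}."""
    # rho_t^{t+n}: product of pi/mu over the actions actually taken
    rho = 1.0
    for k in range(t, min(t + n, T - 1) + 1):
        rho *= pi_prob(actions[k], states[k]) / mu_prob(actions[k], states[k])
    # n-step return G_t^{(n)}, bootstrapping on V(S_{t+n}) if the episode continues
    G = sum(gamma ** (k - t - 1) * rewards[k] for k in range(t + 1, min(t + n, T) + 1))
    if t + n < T:
        G += gamma ** n * V[states[t + n]]
    V[states[t]] += alpha * rho * (G - V[states[t]])
```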
Off-Policy Learning by Importance Sampling
For action values, the evaluation step replaces ρ_t^{t+n} with ρ_{t+1}^{t+n}: A_t is already determined, so it requires no further sampling correction.
    Q_{t+n}(S_t, A_t) ← Q_{t+n−1}(S_t, A_t) + α ρ_{t+1}^{t+n} (G_t^{(n)} − Q_{t+n−1}(S_t, A_t))   for 0 ≤ t < T
Expected Sarsa
ρ_{t+1}^{t+n−1} replaces ρ_{t+1}^{t+n}, because no sampling of A_{t+n} is needed: the expected value takes all actions on the (t + n)th step into account.
Off-Policy Learning by Importance Sampling
Importance sampling enables off-policy learning at the cost of increasing the variance of the updates.
It therefore requires smaller step sizes and is slower; some slowdown is inevitable because the data is less relevant to the target policy.
Improvements:
Autostep method (Mahmood et al., 2012)
Invariant updates (Karampatziakis and Langford, 2010)
Usage technique (Mahmood and Sutton, 2015)
Is off-policy learning possible without importance sampling?
Control Problem (Off-Policy)
Expected Sarsa (on-policy, one-step case):
    Q_{t+1}(S_t, A_t) ← Q_t(S_t, A_t) + α (R_{t+1} + γ E[Q_t(S_{t+1}, A_{t+1}) | S_{t+1}] − Q_t(S_t, A_t))
    Q_{t+1}(S_t, A_t) ← Q_t(S_t, A_t) + α (R_{t+1} + γ Σ_a π(a|S_{t+1}) Q_t(S_{t+1}, a) − Q_t(S_t, A_t))
Now use a different policy µ to generate behaviour.
The updated values are independent of µ(A_{t+1}|S_{t+1}).
If π is greedy, this is exactly the Q-learning method:
    π(a|S_{t+1}) = 1 if a = argmax_{a'} Q(S_{t+1}, a'), and 0 otherwise
    Q_{t+1}(S_t, A_t) ← Q_t(S_t, A_t) + α (R_{t+1} + γ max_a Q_t(S_{t+1}, a) − Q_t(S_t, A_t))
So it is possible to form off-policy methods without importance sampling.
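A sketch of the one-step expected update; when the target policy is greedy in Q, the expectation collapses to the max and the update is exactly Q-learning. Q is a dict keyed by (state, action); `pi_probs`, `greedy_target` and `action_space` are illustrative names:

```python
def expected_update(Q, s, a, r, s_next, pi_probs, alpha, gamma, done=False):
    """Q(s,a) <- Q(s,a) + alpha*(r + gamma * sum_a' pi(a'|s') Q(s',a') - Q(s,a)).
    pi_probs(s) returns {action: probability} under the target policy pi."""
    if done:
        target = r
    else:
        target = r + gamma * sum(p * Q[(s_next, b)] for b, p in pi_probs(s_next).items())
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def greedy_target(Q, action_space):
    """Greedy target policy: all probability on argmax_a Q(s, a).
    With this pi, the expectation reduces to max_a Q(s', a), i.e. Q-learning."""
    def pi_probs(s):
        best = max(action_space, key=lambda b: Q[(s, b)])
        return {best: 1.0}
    return pi_probs
```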
Off-Policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
Alternate the incorporation of expected values of future action-value estimates with corrections based on the steps actually taken, up to S_{t+n}:
    G_t^{(n)} = R_{t+1} + γ Σ_a π(a|S_{t+1}) Q_t(S_{t+1}, a)
              − γ π(A_{t+1}|S_{t+1}) Q_t(S_{t+1}, A_{t+1})
              + γ π(A_{t+1}|S_{t+1}) (R_{t+2} + γ Σ_a π(a|S_{t+2}) Q_{t+1}(S_{t+2}, a))
              − γ² π(A_{t+1}|S_{t+1}) π(A_{t+2}|S_{t+2}) Q_{t+1}(S_{t+2}, A_{t+2})
              + γ² π(A_{t+1}|S_{t+1}) π(A_{t+2}|S_{t+2}) (R_{t+3} + γ Σ_a π(a|S_{t+3}) Q_{t+2}(S_{t+3}, a))
              + ...
              + γ^{min(t+n,T)−t−1} (∏_{i=t+1}^{min(t+n,T)−1} π(A_i|S_i)) (R_{min(t+n,T)} + γ Σ_a π(a|S_{min(t+n,T)}) Q_{min(t+n,T)−1}(S_{min(t+n,T)}, a))
Off-Policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
Define the TD error δ_t to simplify notation:
    δ_t = R_{t+1} + γ Σ_a π(a|S_{t+1}) Q_t(S_{t+1}, a) − Q_{t−1}(S_t, A_t)
Then
    G_t^{(n)} = Q_{t−1}(S_t, A_t) + Σ_{k=t}^{min(t+n,T)−1} δ_k ∏_{i=t+1}^{k} γ π(A_i|S_i)
    Q_{t+n}(S_t, A_t) ← Q_{t+n−1}(S_t, A_t) + α (G_t^{(n)} − Q_{t+n−1}(S_t, A_t))
G_t^{(1)} is the target used by Expected Sarsa.
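A sketch of the n-step tree-backup return written with the TD errors δ_k, as in the summed form above, assuming a single Q table in place of the time-indexed Q_t and stored trajectory lists with rewards[k] = R_k; `pi_prob` and `action_space` are illustrative names:

```python
def tree_backup_return(Q, states, actions, rewards, t, n, T, gamma, pi_prob, action_space):
    """G_t^{(n)} = Q(S_t,A_t) + sum_{k=t}^{min(t+n,T)-1} delta_k * prod_{i=t+1}^{k} gamma*pi(A_i|S_i),
    with delta_k = R_{k+1} + gamma * sum_a pi(a|S_{k+1}) Q(S_{k+1}, a) - Q(S_k, A_k)."""
    def expected_q(s):
        return sum(pi_prob(a, s) * Q[(s, a)] for a in action_space)

    G = Q[(states[t], actions[t])]
    weight = 1.0                                  # running product of gamma * pi(A_i|S_i)
    for k in range(t, min(t + n, T)):
        bootstrap = expected_q(states[k + 1]) if k + 1 < T else 0.0
        delta = rewards[k + 1] + gamma * bootstrap - Q[(states[k], actions[k])]
        G += weight * delta
        if k + 1 < min(t + n, T):                 # extend the product for the next term
            weight *= gamma * pi_prob(actions[k + 1], states[k + 1])
    return G
```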
Off-Policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
The n-step Tree Backup algorithm is the natural extension of Q-learning to the multi-step case: like Q-learning, it requires no importance sampling.
However, if µ and π differ greatly, then π(A_{t+i}|S_{t+i}) may be small for some i, and bootstrapping may effectively span only a few steps even when n is large.
Conclusion
n-step bootstrapping looks ahead to the next n rewards, states and actions, generalizing Monte Carlo methods and one-step TD methods.
Advantages
An intermediate amount of bootstrapping often works better than either extreme.
Disadvantages
Requires a delay of n time steps before updating.
Requires more computation per time step.
Requires more memory to store variables from the last n time steps.
Methods covered: n-step TD policy evaluation; on-policy control with n-step Sarsa; off-policy control with importance sampling and with the n-step Tree Backup algorithm.