Multi-step Bootstrapping

1 Multi-step Bootstrapping. Jennifer She. Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto. February 7, 2017.

2-3 Multi-step Bootstrapping. A generalization of Monte Carlo methods and one-step TD methods that includes the methods lying in between these two extremes. The methods are based on sample episodes of states, actions and rewards. The time intervals for making updates and for bootstrapping are no longer the same, which enables bootstrapping to occur over longer time intervals.

4-5 Prediction Problem (Policy Evaluation). Given a fixed policy $\pi$, estimate the state-value function $v_\pi$.
Monte Carlo update: $V(S_t) \leftarrow V(S_t) + \alpha(G_t - V(S_t))$, where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T$.
Updates of the state-value estimates happen at the end of each episode; $G_t$ is the complete return of the episode after $S_t$. No bootstrapping is involved (the target does not use other estimates).
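A minimal sketch of this Monte Carlo update in Python, assuming the episode has already been collected as lists with states[t] = S_t and rewards[t] = R_{t+1} (function name and storage convention are illustrative, not from the slides):

```python
def monte_carlo_update(V, states, rewards, alpha=0.1, gamma=1.0):
    """Every-visit Monte Carlo update from one finished episode.
    states[t] = S_t and rewards[t] = R_{t+1}, for t = 0 .. T-1 (nonterminal states only)."""
    G = 0.0
    # Work backwards so G_t can be accumulated incrementally: G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(states))):
        G = rewards[t] + gamma * G
        V[states[t]] += alpha * (G - V[states[t]])
    return V
```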

6 Prediction Problem (Policy Evaluation). One-step TD update: $V_{t+1}(S_t) \leftarrow V_t(S_t) + \alpha(R_{t+1} + \gamma V_t(S_{t+1}) - V_t(S_t))$. The update happens one step later, bootstrapping on $V_t(S_{t+1})$; $R_{t+1} + \gamma V_t(S_{t+1})$ approximates $G_t$.
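For contrast, a sketch of the one-step TD update, which can be applied online at every transition (illustrative names; the value table is assumed to be a dict):

```python
def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=1.0):
    """One-step TD: V(S_t) <- V(S_t) + alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t))."""
    target = r if done else r + gamma * V[s_next]   # bootstrap on V(S_{t+1}) unless terminal
    V[s] += alpha * (target - V[s])
    return V
```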

7-8 n-step TD Prediction. Approximate $G_t$ by looking ahead $n$ steps, bootstrapping on $V_{t+n-1}(S_{t+n})$:
$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})$ for $0 \le t < T - n$, and $G_t^{(n)} = G_t$ for $t + n \ge T$.
$G_t^{(n)}$ incorporates discounted rewards up to $R_{t+n}$ and is called the n-step return. $G_t^{(1)}$ is the one-step TD target; taking $n \ge T - t$ gives the full return $G_t$, the Monte Carlo target.
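A sketch of the n-step return as a helper function, using the same episode-storage convention as the earlier sketches and falling back to the full return once $t + n \ge T$:

```python
def n_step_return(t, n, rewards, states, V, T, gamma=1.0):
    """G_t^(n): R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * V(S_{t+n}),
    truncated to the ordinary return G_t when t + n >= T."""
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))  # rewards[k] = R_{k+1}
    if t + n < T:
        G += gamma ** n * V[states[t + n]]   # bootstrap on the current estimate
    return G
```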

9-12 n-step TD Prediction. For $n > 1$, computing $G_t^{(n)}$ involves future rewards and value estimates (in particular $V_{t+n-1}(S_{t+n})$) that are not available at the time of the transition from $t$ to $t+1$. We must therefore wait until time $t + n$ to update $V(S_t)$:
$V_{t+n}(S_t) \leftarrow V_{t+n-1}(S_t) + \alpha(G_t^{(n)} - V_{t+n-1}(S_t))$ for $0 \le t < T$.
No updates are made during the first $n - 1$ time steps of an episode; to make up for this, $n - 1$ additional updates are made at the end of the episode using $G_t$. These are still TD methods (for $n < T$): they change an earlier estimate based on how it differs from a later estimate.
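Putting the pieces together, a sketch of one episode of n-step TD prediction following the schedule above: the estimate for time tau = t - n + 1 is updated once the data through time t + 1 is available. An env object with reset() and step(a) returning (next_state, reward, done), and a fixed policy(s), are assumed interfaces, not part of the slides:

```python
def n_step_td_episode(env, policy, V, n, alpha=0.1, gamma=1.0):
    """One episode of tabular n-step TD prediction for a fixed policy."""
    s = env.reset()
    states, rewards = [s], []          # rewards[t] holds R_{t+1}
    T = float('inf')
    t = 0
    while True:
        if t < T:
            s_next, r, done = env.step(policy(states[t]))
            states.append(s_next)
            rewards.append(r)
            if done:
                T = t + 1
        tau = t - n + 1                 # time whose estimate is updated now
        if tau >= 0:
            horizon = min(tau + n, T)
            G = sum(gamma ** (k - tau) * rewards[k] for k in range(tau, horizon))
            if tau + n < T:
                G += gamma ** n * V[states[tau + n]]
            V[states[tau]] += alpha * (G - V[states[tau]])
        if tau == T - 1:
            break
        t += 1
    return V
```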

13-14 n-step TD Prediction (figure-only slides).

15-16 n-step TD Prediction. The expected n-step return is guaranteed to be a better estimate of $v_\pi$ than $V_{t+n-1}$ is, in a worst-case sense:
$\max_s |\mathbb{E}[G_t^{(n)} \mid S_t = s] - v_\pi(s)| \le \gamma^n \max_s |V_{t+n-1}(s) - v_\pi(s)|$.
Because of this error-reduction property, all n-step TD methods converge to the correct predictions under appropriate technical conditions.
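A brief sketch of why the bound holds, treating $V_{t+n-1}$ as a fixed table and assuming the episode does not terminate before $t+n$ (the terminating case only drops terms):

```latex
\begin{aligned}
\mathbb{E}_\pi\!\left[G_t^{(n)} \mid S_t = s\right] - v_\pi(s)
  &= \mathbb{E}_\pi\!\left[\textstyle\sum_{k=1}^{n} \gamma^{k-1} R_{t+k}
      + \gamma^n V_{t+n-1}(S_{t+n}) \,\middle|\, S_t = s\right] \\
  &\quad - \mathbb{E}_\pi\!\left[\textstyle\sum_{k=1}^{n} \gamma^{k-1} R_{t+k}
      + \gamma^n v_\pi(S_{t+n}) \,\middle|\, S_t = s\right] \\
  &= \gamma^n \,\mathbb{E}_\pi\!\left[V_{t+n-1}(S_{t+n}) - v_\pi(S_{t+n}) \mid S_t = s\right].
\end{aligned}
```

Taking absolute values and maximizing over $s$ then gives the stated inequality.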

17-20 Example. Random walk starting from state C. All rewards are 0 except a reward of 1 when exiting right from state E. The true state values of A through E are 1/6, 2/6, 3/6, 4/6, 5/6. Initialize $V(s) = 0.5$ for all $s$. Suppose the first episode goes from C to the right, through D and E, and terminates on the right. At the end of the episode:
For a one-step method, only $V(E)$ is incremented toward 1.
For a two-step method, both $V(D)$ and $V(E)$ are incremented toward 1.
For $n \ge 3$, all of $V(C)$, $V(D)$ and $V(E)$ are incremented toward 1.
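A sketch that reproduces this claim, reusing the n_step_td_episode sketch above; the RandomWalk class and the forced action sequence are illustrative:

```python
class RandomWalk:
    """5-state random walk: states 0..4 correspond to A..E, start at C (state 2).
    Reward 1 only when stepping right out of E; stepping left out of A gives 0."""
    def __init__(self, n_states=5):
        self.n = n_states
    def reset(self):
        self.s = self.n // 2
        return self.s
    def step(self, action):              # action: -1 = left, +1 = right
        self.s += action
        if self.s < 0:
            return self.s, 0.0, True     # left termination
        if self.s >= self.n:
            return self.s, 1.0, True     # right termination
        return self.s, 0.0, False

# Force the episode C -> D -> E -> right terminal, as in the slide's example.
env, forced_actions = RandomWalk(), iter([+1, +1, +1])
V = {s: 0.5 for s in range(5)}
n_step_td_episode(env, policy=lambda s: next(forced_actions), V=V, n=2, alpha=0.1)
print(V)   # n=2: V[3] (D) and V[4] (E) move toward 1; V[2] (C) stays at 0.5
```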

21-22 Example. Empirical comparison on a similar problem: a random walk with 19 states, where all rewards are 0 except a reward of -1 on the left-most transition. An intermediate value of n works best.

23-24 Control Problem (Policy Evaluation + Policy Improvement). Find an optimal policy $\pi_*$. Alternate between estimating the action-value function $q_\pi$ (evaluation) and updating the policy $\pi$ (improvement). We estimate $q_\pi$ instead of $v_\pi$ because action values are the information needed to decide the next $\pi$.

25-26 Control Problem (On-Policy). Evaluation step:
Monte Carlo evaluation: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha(G_t - Q(S_t, A_t))$.
Sarsa (one-step on-policy TD) evaluation: $Q_{t+1}(S_t, A_t) \leftarrow Q_t(S_t, A_t) + \alpha(R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1}) - Q_t(S_t, A_t))$, where $R_{t+1} + \gamma Q_t(S_{t+1}, A_{t+1})$ approximates $G_t$.
Improvement step: $\epsilon$-greedy (or any other $\epsilon$-soft policy) helps maintain exploration. With $A^* = \arg\max_a Q(S_t, a)$, for all $a \in \mathcal{A}(S_t)$:
$\pi(a \mid S_t) \leftarrow 1 - \epsilon + \epsilon / |\mathcal{A}(S_t)|$ if $a = A^*$, and $\pi(a \mid S_t) \leftarrow \epsilon / |\mathcal{A}(S_t)|$ otherwise.
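A sketch of both pieces of this slide, the epsilon-greedy improvement and the one-step Sarsa evaluation update, over a tabular Q stored as a dict keyed by (state, action); names are illustrative:

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Pick the greedy action with prob 1 - eps + eps/|A|; otherwise a uniformly random action."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=1.0):
    """Q(S_t,A_t) <- Q(S_t,A_t) + alpha * (R_{t+1} + gamma*Q(S_{t+1},A_{t+1}) - Q(S_t,A_t))."""
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```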

27-28 n-step Sarsa. Modify the evaluation step: as in prediction, approximate $G_t$ with
$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})$ for $0 \le t < T - n$, and $G_t^{(n)} = G_t$ for $t + n \ge T$, then update
$Q_{t+n}(S_t, A_t) \leftarrow Q_{t+n-1}(S_t, A_t) + \alpha(G_t^{(n)} - Q_{t+n-1}(S_t, A_t))$ for $0 \le t < T$.
Expected Sarsa: replace $Q_{t+n-1}(S_{t+n}, A_{t+n})$ with $\mathbb{E}[Q_{t+n-1}(S_{t+n}, A_{t+n}) \mid S_{t+n}] = \sum_a \pi(a \mid S_{t+n}) Q_{t+n-1}(S_{t+n}, a)$. This moves deterministically in the same direction that Sarsa moves in expectation; it requires more computation per step but eliminates the variance from sampling $A_{t+n}$.
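A sketch of the n-step Sarsa target, with an optional Expected-Sarsa tail that replaces the sampled $Q(S_{t+n}, A_{t+n})$ by its expectation under $\pi$; pi[s][a] giving $\pi(a \mid s)$ is an assumed representation:

```python
def n_step_sarsa_return(t, n, rewards, states, actions, Q, T, gamma=1.0, pi=None):
    """n-step Sarsa target: discounted R_{t+1}..R_{t+n} plus a bootstrap term.
    With pi=None the bootstrap is gamma^n * Q(S_{t+n}, A_{t+n}) (Sarsa);
    with pi given it is gamma^n * sum_a pi(a|S_{t+n}) * Q(S_{t+n}, a) (Expected Sarsa)."""
    horizon = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))   # rewards[k] = R_{k+1}
    if t + n < T:
        s_boot = states[t + n]
        if pi is None:
            tail = Q[(s_boot, actions[t + n])]
        else:
            tail = sum(p * Q[(s_boot, a)] for a, p in pi[s_boot].items())
        G += gamma ** n * tail
    return G

# Update once the data through time t+n is available:
# Q[(states[t], actions[t])] += alpha * (G - Q[(states[t], actions[t])])
```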

29-30 n-step Sarsa (figure-only slides).

31 Example. Gridworld scenario where all rewards are 0 except a positive reward on the square G. Initialize all values to 0. Suppose the agent takes some path on the first episode and ends at G. At the end of the episode, a one-step method strengthens only the last state-action pair of the path for the next policy, while an n-step method strengthens the last n state-action pairs of the path.

32-33 Control Problem (Off-Policy). Learn the value function for one policy $\pi$ while following another policy $\mu$; $\pi$ is often greedy and $\mu$ exploratory (e.g. $\epsilon$-greedy). Requires coverage: $\pi(a \mid s) > 0$ implies $\mu(a \mid s) > 0$.
Importance sampling (Monte Carlo): the update takes into account the difference between $\pi$ and $\mu$ through the relative probability of all the subsequent actions,
$V(S_t) \leftarrow V(S_t) + \alpha \rho_t^T (G_t - V(S_t))$,
where $\rho_t^T$ is the importance sampling ratio
$\rho_t^T = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k) P(S_{k+1} \mid S_k, A_k)}{\mu(A_k \mid S_k) P(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)}$.
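A sketch of the importance-sampling ratio and the off-policy Monte Carlo update above; pi[s][a] and mu[s][a] are assumed to give the two policies' action probabilities:

```python
def importance_ratio(states, actions, pi, mu, start, stop):
    """rho = prod_{k=start}^{stop-1} pi(A_k|S_k) / mu(A_k|S_k).
    The transition probabilities cancel, so only the policy ratios remain."""
    rho = 1.0
    for k in range(start, stop):
        rho *= pi[states[k]][actions[k]] / mu[states[k]][actions[k]]
    return rho

def off_policy_mc_update(V, states, actions, G_t, t, T, pi, mu, alpha=0.1):
    """V(S_t) <- V(S_t) + alpha * rho_t^T * (G_t - V(S_t))."""
    rho = importance_ratio(states, actions, pi, mu, t, T)
    V[states[t]] += alpha * rho * (G_t - V[states[t]])
```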

34-35 Off-Policy Learning by Importance Sampling. In n-step methods the returns are constructed over n steps, so we are only interested in the relative probability of just those n actions. Incorporate $\rho_t^{t+n}$ (in place of $\rho_t^T$) into the TD update:
$\rho_t^{t+n} = \prod_{k=t}^{\min(t+n,\,T-1)} \frac{\pi(A_k \mid S_k)}{\mu(A_k \mid S_k)}$,
$V_{t+n}(S_t) \leftarrow V_{t+n-1}(S_t) + \alpha \rho_t^{t+n} (G_t^{(n)} - V_{t+n-1}(S_t))$ for $0 \le t < T$.
If any $\pi(A_k \mid S_k) = 0$, then $\rho_t^{t+n} = 0$ and the return is ignored entirely. If some $\pi(A_k \mid S_k) \gg \mu(A_k \mid S_k)$, then $\rho_t^{t+n}$ increases the weight given to the return, which compensates for the action being rarely selected under $\mu$.
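The n-step variant only needs the ratio over the actions inside the truncated return; a sketch of the weighted value update, reusing the importance_ratio and n_step_return helpers sketched earlier:

```python
def off_policy_n_step_td_update(V, states, actions, rewards, t, n, T,
                                pi, mu, alpha=0.1, gamma=1.0):
    """V(S_t) <- V(S_t) + alpha * rho_t^{t+n} * (G_t^(n) - V(S_t)), with
    rho_t^{t+n} = prod_{k=t}^{min(t+n, T-1)} pi(A_k|S_k) / mu(A_k|S_k)."""
    rho = importance_ratio(states, actions, pi, mu, t, min(t + n, T - 1) + 1)
    G = n_step_return(t, n, rewards, states, V, T, gamma)   # helper sketched earlier
    V[states[t]] += alpha * rho * (G - V[states[t]])
```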

36-37 Off-Policy Learning by Importance Sampling. For action values, the evaluation step replaces $\rho_t^{t+n}$ with $\rho_{t+1}^{t+n}$, because $A_t$ is already determined and requires no further sampling:
$Q_{t+n}(S_t, A_t) \leftarrow Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1}^{t+n} (G_t^{(n)} - Q_{t+n-1}(S_t, A_t))$ for $0 \le t < T$.
For Expected Sarsa, $\rho_{t+1}^{t+n-1}$ replaces $\rho_{t+1}^{t+n}$, because no sampling of $A_{t+n}$ is required: the expected value takes all actions on the $(t+n)$th step into account.

38 Off-Policy Learning by Importance Sampling (figure-only slide).

39-41 Off-Policy Learning by Importance Sampling. Importance sampling enables off-policy learning at the cost of increasing the variance of the updates; it requires smaller step sizes and is therefore slower. Some slowdown is inevitable because the data is less relevant to the target policy. Improvements include the Autostep method (Mahmood et al., 2012), invariant updates (Karampatziakis and Langford, 2010), and the usage technique (Mahmood and Sutton, 2015). Is off-policy learning possible without importance sampling?

42-45 Control Problem (Off-Policy). Expected Sarsa (on-policy, one-step case):
$Q_{t+1}(S_t, A_t) \leftarrow Q_t(S_t, A_t) + \alpha(R_{t+1} + \gamma \mathbb{E}[Q_t(S_{t+1}, A_{t+1}) \mid S_{t+1}] - Q_t(S_t, A_t))$, i.e.
$Q_{t+1}(S_t, A_t) \leftarrow Q_t(S_t, A_t) + \alpha(R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q_t(S_{t+1}, a) - Q_t(S_t, A_t))$.
Now use a different policy $\mu$ to generate behaviour. The updated values are independent of $\mu(A_{t+1} \mid S_{t+1})$. If $\pi$ is greedy, this is exactly Q-learning:
$\pi(a \mid S_{t+1}) = 1$ if $a = \arg\max_{a'} Q(S_{t+1}, a')$ and $0$ otherwise, which gives
$Q_{t+1}(S_t, A_t) \leftarrow Q_t(S_t, A_t) + \alpha(R_{t+1} + \gamma \max_a Q_t(S_{t+1}, a) - Q_t(S_t, A_t))$.
So it is possible to form off-policy methods without importance sampling.
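A sketch of the one-step Expected Sarsa update and its greedy-$\pi$ special case, Q-learning; pi[s][a] and the actions list are assumed inputs:

```python
def expected_sarsa_update(Q, s, a, r, s_next, done, pi, actions, alpha=0.1, gamma=1.0):
    """Target: R_{t+1} + gamma * sum_a pi(a|S_{t+1}) * Q(S_{t+1}, a).
    The behaviour policy that chose A_{t+1} never appears, so this works off-policy."""
    if done:
        target = r
    else:
        target = r + gamma * sum(pi[s_next][b] * Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, done, actions, alpha=0.1, gamma=1.0):
    """Special case with greedy pi: the target uses max_a Q(S_{t+1}, a)."""
    target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```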

46-47 Off-Policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm. Alternate between incorporating the expected values of future action-value estimates and correcting them based on the actions actually taken, out to $S_{t+n}$:
$G_t^{(n)} = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q_t(S_{t+1}, a)$
$\quad - \gamma \pi(A_{t+1} \mid S_{t+1}) Q_t(S_{t+1}, A_{t+1})$
$\quad + \gamma \pi(A_{t+1} \mid S_{t+1}) \big(R_{t+2} + \gamma \sum_a \pi(a \mid S_{t+2}) Q_{t+1}(S_{t+2}, a)\big)$
$\quad - \gamma^2 \pi(A_{t+1} \mid S_{t+1}) \pi(A_{t+2} \mid S_{t+2}) Q_{t+1}(S_{t+2}, A_{t+2})$
$\quad + \gamma^2 \pi(A_{t+1} \mid S_{t+1}) \pi(A_{t+2} \mid S_{t+2}) \big(R_{t+3} + \gamma \sum_a \pi(a \mid S_{t+3}) Q_{t+2}(S_{t+3}, a)\big)$
$\quad + \dots$
$\quad + \gamma^{\min(t+n,T)-t-1} \Big(\prod_{i=t+1}^{\min(t+n,T)-1} \pi(A_i \mid S_i)\Big) \big(R_{\min(t+n,T)} + \gamma \sum_a \pi(a \mid S_{\min(t+n,T)}) Q_{\min(t+n,T)-1}(S_{\min(t+n,T)}, a)\big)$.

48-50 Off-Policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm. Define the TD error $\delta_t$ to simplify notation:
$\delta_t = R_{t+1} + \gamma \sum_a \pi(a \mid S_{t+1}) Q_t(S_{t+1}, a) - Q_{t-1}(S_t, A_t)$.
Then the n-step tree-backup return and update are
$G_t^{(n)} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{\min(t+n,T)-1} \delta_k \prod_{i=t+1}^{k} \gamma \pi(A_i \mid S_i)$,
$Q_{t+n}(S_t, A_t) \leftarrow Q_{t+n-1}(S_t, A_t) + \alpha(G_t^{(n)} - Q_{t+n-1}(S_t, A_t))$.
$G_t^{(1)}$ is the target used by Expected Sarsa.
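A sketch of the tree-backup return written with these TD errors, using a single fixed Q table rather than the time-indexed $Q_t$ of the slides (pi[s][a] is an assumed policy table):

```python
def tree_backup_return(t, n, states, actions, rewards, Q, pi, T, gamma=1.0):
    """G_t^(n) = Q(S_t,A_t) + sum_{k=t}^{min(t+n,T)-1} delta_k * prod_{i=t+1}^{k} gamma*pi(A_i|S_i),
    with delta_k = R_{k+1} + gamma * sum_a pi(a|S_{k+1}) Q(S_{k+1}, a) - Q(S_k, A_k)."""
    def expected_q(s):
        return sum(p * Q[(s, a)] for a, p in pi[s].items())

    G = Q[(states[t], actions[t])]
    weight = 1.0
    last = min(t + n, T)
    for k in range(t, last):
        exp_q_next = 0.0 if k == T - 1 else expected_q(states[k + 1])  # terminal expectation is 0
        delta = rewards[k] + gamma * exp_q_next - Q[(states[k], actions[k])]
        G += weight * delta
        if k + 1 < last:                                               # weight for the next term
            weight *= gamma * pi[states[k + 1]][actions[k + 1]]
    return G
```

With n = 1 this reduces to the Expected Sarsa target, as the slide notes.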

51-52 Off-Policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm (figure-only slides).

53-54 Off-Policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm. The n-step tree backup algorithm is the natural extension of Q-learning to the multi-step case; like Q-learning, it requires no importance sampling. However, if $\mu$ and $\pi$ differ greatly, then $\pi(A_{t+i} \mid S_{t+i})$ may be small for some $i$, and bootstrapping may effectively span only a few steps even if $n$ is large.

55-58 Conclusion. n-step bootstrapping looks ahead to the next n rewards, states and actions, generalizing Monte Carlo methods and one-step TD methods.
Advantages: an intermediate amount of bootstrapping often works better than either extreme.
Disadvantages: it requires a delay of n time steps before updating, more computation per time step, and more memory to store the variables from the last n time steps.
Methods covered: n-step TD policy evaluation; on-policy control with n-step Sarsa; off-policy control with importance sampling and with the n-step Tree Backup algorithm.
