Mengdi Wang. July 3rd, Laboratory for Information and Decision Systems, M.I.T.



1 Practice July 3rd, 2012 Laboratory for Information and Decision Systems, M.I.T.


3 Infinite-Horizon DP
Minimize over policies $\pi = \{\mu_0, \mu_1, \ldots\}$ the objective cost function
\[ J_\pi(x_0) = \lim_{N\to\infty} E_{w_k,\,k=0,1,\ldots} \left\{ \sum_{k=0}^{N-1} \alpha^k g\big(x_k, \mu_k(x_k), w_k\big) \right\} \]
How to DP
Approximation: parameterize policies/cost vectors, aggregation, etc.
Simulation: use simulation-generated trajectories $\{x_k\}$ to calculate DP quantities, without knowing the system.

4 Markovian Decision Process
Assume the system is an n-state (controlled) Markov chain. Change to Markov chain notation:
States $i = 1, \ldots, n$ (instead of $x$)
Transition probabilities $p_{i_k i_{k+1}}(u_k)$ [instead of $x_{k+1} = f(x_k, u_k, w_k)$]
Cost per stage $g(i, u, j)$ [instead of $g(x_k, u_k, w_k)$]
Cost of a policy $\pi = \{\mu_0, \mu_1, \ldots\}$:
\[ J_\pi(i) = \lim_{N\to\infty} E_{w_k,\,k=0,1,\ldots} \left\{ \sum_{k=0}^{N-1} \alpha^k g\big(i_k, \mu_k(i_k), i_{k+1}\big) \;\Big|\; i_0 = i \right\} \]

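5 MDP Continued
The optimal cost vector $J^*$ satisfies the Bellman equation for all $i$:
\[ J^*(i) = \min_{u \in U} \sum_{j=1}^{n} p_{ij}(u) \big( g(i,u,j) + \alpha J^*(j) \big), \]
or in matrix form
\[ J^* = \min_{\mu : \{1,\ldots,n\} \to U} \{ g_\mu + \alpha P_\mu J^* \}. \]
Shorthand notation for DP mappings:
\[ (TJ)(i) = \min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u) \big( g(i,u,j) + \alpha J(j) \big), \quad i = 1, \ldots, n, \]
\[ (T_\mu J)(i) = \sum_{j=1}^{n} p_{ij}\big(\mu(i)\big) \big( g(i,\mu(i),j) + \alpha J(j) \big), \quad i = 1, \ldots, n. \]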
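The mappings $T$ and $T_\mu$ can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not in the slides: the arrays `P[u, i, j]` and `g[u, i, j]` (transition probabilities and stage costs indexed by control, state, next state) and all function names are hypothetical, and every control is assumed to be available in every state.

```python
import numpy as np

def q_values(J, P, g, alpha):
    """Q[u, i] = sum_j P[u, i, j] * (g[u, i, j] + alpha * J[j])."""
    return np.einsum("uij,uij->ui", P, g + alpha * J[None, None, :])

def bellman_T(J, P, g, alpha):
    """(TJ)(i): minimize the expected one-stage cost plus discounted cost-to-go over u."""
    return q_values(J, P, g, alpha).min(axis=0)

def bellman_T_mu(J, P, g, alpha, mu):
    """(T_mu J)(i): same expression with the control fixed to mu[i]."""
    idx = np.arange(len(J))
    return q_values(J, P, g, alpha)[mu, idx]

def greedy_policy(J, P, g, alpha):
    """A policy mu attaining the minimum in TJ, so that T_mu J = TJ."""
    return q_values(J, P, g, alpha).argmin(axis=0)

def value_iteration(P, g, alpha, tol=1e-10, max_iter=10_000):
    """Iterate J <- TJ until the sup-norm change falls below tol."""
    J = np.zeros(P.shape[1])
    for _ in range(max_iter):
        J_new = bellman_T(J, P, g, alpha)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new
    return J
```

Because $T$ is a contraction with modulus $\alpha$, repeated application converges to the fixed point $J^* = TJ^*$; the greedy policy at that fixed point satisfies $T_\mu J^* = TJ^*$.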

6 Approximation Architecture
Approximation in Policy Space: parameterize the set of policies $\mu$ using a vector $r$, and then optimize over $r$.
Approximation in Value Space: approximate $J^*$ and $J_\mu$ from a family of functions parameterized by $r$, e.g., a linear approximation $J \approx \Phi r$, $J(i) \approx \phi(i)' r$.

7 DP Algorithms: A Roadmap
Approximate PI (*): implement the two steps of PI in an approximate sense.
Policy evaluation: $J_{\mu_t} = T_{\mu_t} J_{\mu_t}$ by approximation/simulation.
  Direct approach (*), e.g., simulation-based least squares.
  Indirect approach: solve $J_{\mu_t} = \Pi T_{\mu_t} J_{\mu_t}$ by TD/LSTD/LSPE.
Policy improvement: $T_{\mu_{t+1}} J_{\mu_t} = T J_{\mu_t}$ using the approximate cost vector/Q-factors.
Approximate $J^*$ and $Q^*$: solve $J^* = TJ^*$ or $Q^* = FQ^*$ directly by simulation, e.g., Q-learning, Bellman error minimization, LP approach.


9 Call
A call option gives the buyer of the option the right to buy the underlying asset at a fixed price (the strike price, $K$). The buyer pays a price for this right.
At or before expiration:
If the value of the underlying asset $S$ > strike price $K$: the buyer makes the difference $S - K$.
If the value of the underlying asset $S$ < strike price $K$: the buyer does not exercise.

10 Valuing American Call
Variables:
Strike price: $K$
Time till expiration: $T$
Price of underlying asset: $S$
Volatility, dividends, etc.
Valuing American options requires the solution of an optimal stopping problem:
\[ \text{Option Price} = E\big[ S(t^*) - K \mid \text{option eventually exercised} \big], \]
where $t^*$ is the optimal exercising time. If the option writers do not solve for $t^*$ correctly, the option buyers will have an arbitrage opportunity to exploit the option writers.

11 Infinite-Horizon DP Formulation
Assume that:
Dynamics of underlying asset: $S_{t+1} = f(S_t, w_t)$
State: $S_t$, price of the underlying asset
Control: $u_t \in \{\text{Exercise}, \text{Hold}\}$
Transition cost: $g_t(\text{Hold}) = 0$, $g_t(\text{Exercise}) = S_t - K$
The option never expires.
There exists a discount factor $\alpha \in (0,1)$.
Bellman Equation: let $J_t(S)$ be the option price on the $t$-th day when the current stock price is $S$. Since the option never expires, the price does not depend on $t$ and the subscript is dropped:
\[ J(S_t) = \max\{ S_t - K,\; \alpha E[J(S_{t+1})] \}. \]

12 Binomial
For simplicity, consider a model with a finite number of states:
\[ S_{t+1} = \begin{cases} \min\{u,\, u S_t\} & \text{with probability } p, \\ \max\{d,\, d S_t\} & \text{with probability } 1-p. \end{cases} \]
The Bellman equation is $J = TJ$, where
\[ TJ(S) = \max\Big\{ S - K,\; \alpha \big[ p\, J(\min\{u,\, uS\}) + (1-p)\, J(\max\{d,\, dS\}) \big] \Big\}. \]
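As a reference point for the exercises, the fixed point $J = TJ$ of this stopping problem can be computed by plain value iteration when $p$ is known. The sketch below makes several assumptions that are not in the slides: prices live on a recombining lattice `S[0] < ... < S[n-1]`, an up move goes to the next lattice index, a down move to the previous one, and moves are clamped at the two ends of the lattice (standing in for the `min`/`max` truncation). The function name and all numeric values are hypothetical.

```python
import numpy as np

def option_value_iteration(S, K, p, alpha, tol=1e-12, max_iter=100_000):
    """Iterate J <- max{S - K, alpha * (p * J(up) + (1 - p) * J(down))}."""
    n = len(S)
    up = np.minimum(np.arange(n) + 1, n - 1)    # index after an up move (clamped)
    down = np.maximum(np.arange(n) - 1, 0)      # index after a down move (clamped)
    J = np.maximum(S - K, 0.0)                  # any starting vector works
    for _ in range(max_iter):
        J_new = np.maximum(S - K, alpha * (p * J[up] + (1 - p) * J[down]))
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new
    return J

# Hypothetical lattice: 41 prices around 1.0 with up factor 1.05 (illustrative values only).
S = 1.05 ** np.arange(-20.0, 21.0)
J_exact = option_value_iteration(S, K=1.0, p=0.5, alpha=0.99)
exercise_region = (S - 1.0) >= 0.99 * (0.5 * J_exact[np.minimum(np.arange(41) + 1, 40)]
                                       + 0.5 * J_exact[np.maximum(np.arange(41) - 1, 0)])
```

The boolean vector `exercise_region` marks the lattice states where exercising is at least as good as holding under the computed prices; it can serve as a baseline when checking simulation-based approximations.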

13 Features
We will approximate the option prices $J$, $J_\mu$ using two sets of features, each consisting of 3 features/basis functions.
Simple polynomial:
\[ L_0(S) = 1, \quad L_1(S) = S, \quad L_2(S) = S^2. \]
Laguerre polynomial:
\[ L_0(S) = \exp(-S), \quad L_1(S) = \exp(-S)(1 - S), \quad L_2(S) = \exp(-S)(1 - S + S^2/2). \]
The basis matrix $\Phi$ is an $n \times 3$ matrix.
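A minimal sketch of how the two basis matrices could be assembled, with one row per state and one column per basis function. The function names are hypothetical, and the Laguerre-type features follow the formulas as written on this slide.

```python
import numpy as np

def poly_features(S):
    """Rows phi(i)' = [1, S_i, S_i^2]."""
    S = np.asarray(S, dtype=float)
    return np.column_stack([np.ones_like(S), S, S**2])

def laguerre_features(S):
    """Rows phi(i)' = exp(-S_i) * [1, 1 - S_i, 1 - S_i + S_i^2 / 2]."""
    S = np.asarray(S, dtype=float)
    w = np.exp(-S)
    return np.column_stack([w, w * (1 - S), w * (1 - S + S**2 / 2)])
```

For a vector of $n$ stock prices, either function returns the $n \times 3$ matrix $\Phi$, and an approximate price vector is then `Phi @ r` for a weight vector `r` of length 3.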

14 Policy Evaluation
Exercise 1.A (Direct Approach)
Use the direct least squares approach
\[ \min_r \; \frac{1}{2} \sum_{k=0}^{N-1} \left( \phi(i_k)' r - \sum_{t=k}^{N-1} \alpha^{t-k} g\big(i_t, \mu(i_t), i_{t+1}\big) \right)^2 \]
to evaluate the profits of a specified exercising strategy.
Construct a simulator that generates trajectories of $\{i_k\}$.
Plot the approximate cost vector as a function of the stock price.

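15 Exercise 1 Continued
Formula of the solution: $J_\mu \approx \Phi r_\mu$, with
\[ r_\mu = \left( \sum_{k=0}^{N-1} \phi(i_k) \phi(i_k)' \right)^{-1} \left( \sum_{k=0}^{N-1} \phi(i_k) \sum_{t=k}^{N-1} \alpha^{t-k} g\big(i_t, \mu(i_t), i_{t+1}\big) \right) = A^{-1} b, \]
where
\[ A = \sum_{k=0}^{N-1} \phi(i_k) \phi(i_k)', \qquad b = \sum_{k=0}^{N-1} \alpha^{t_\mu(k) - k} \big( S(t_\mu(k)) - K \big) \phi(i_k), \]
and $t_\mu(k)$ is the first time of triggering the Exercise control using policy $\mu$ after time $k$.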
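One possible way to organize Exercise 1.A is sketched below. It is not the reference solution, and it rests on assumptions that go beyond the slides: prices sit on a lattice `S`, the simulated chain moves one index up with probability `p` and one index down otherwise (clamped at the ends), the policy is a boolean vector `exercise` over states, and trajectories that never reach an exercise state before the end of the simulation contribute a reward of 0. The discounted reward $\alpha^{t_\mu(k)-k}(S(t_\mu(k)) - K)$ is computed with a single backward pass.

```python
import numpy as np

def simulate(n, p, N, rng):
    """Index trajectory i_0, ..., i_{N-1} of the clamped up/down chain (assumed dynamics)."""
    traj = np.empty(N, dtype=int)
    traj[0] = n // 2
    ups = rng.random(N - 1) < p
    for k in range(N - 1):
        traj[k + 1] = min(max(traj[k] + (1 if ups[k] else -1), 0), n - 1)
    return traj

def direct_ls(traj, S, K, alpha, exercise, Phi):
    """Solve r = A^{-1} b for the direct least squares problem."""
    G = np.zeros(len(traj))      # G[k] = alpha^(t_mu(k) - k) * (S(t_mu(k)) - K)
    nxt = 0.0                    # assumption: no exercise before the horizon pays 0
    for k in range(len(traj) - 1, -1, -1):
        i = traj[k]
        nxt = S[i] - K if exercise[i] else alpha * nxt
        G[k] = nxt
    F = Phi[traj]                # row k is phi(i_k)'
    A = F.T @ F                  # sum_k phi(i_k) phi(i_k)'
    b = F.T @ G                  # sum_k G[k] phi(i_k)
    return np.linalg.solve(A, b), A, b

# Hypothetical setup: polynomial features and a threshold exercising strategy.
rng = np.random.default_rng(0)
S = 1.05 ** np.arange(-20.0, 21.0)
Phi = np.column_stack([np.ones_like(S), S, S**2])
traj = simulate(len(S), p=0.5, N=3000, rng=rng)
r_mu, A, b = direct_ls(traj, S, K=1.0, alpha=0.99, exercise=S > 1.3, Phi=Phi)
J_approx = Phi @ r_mu            # approximate price at every lattice state
```

A quick sanity check on such a program: under the policy that exercises in every state, the reward at step $k$ is exactly $S(i_k) - K$, which the polynomial features represent exactly, so the fitted weights should be close to $(-K, 1, 0)$.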

16 Results - Option Prices
[Figure: option prices (vertical axis) versus stock price (horizontal axis)]

17 Policy Evaluation
Exercise 1.B (Optional): Suppose that holding the option always incurs a cost $g(i,j) = \ldots$ Modify the program of Exercise 1.A to price the American call option.
Exercise 1.C (Optional): Use an indirect approach to price an American call option. Choose any one of the three algorithms: TD/LSTD/LSPE.

18 Use PI to Evaluate
Exercise 2: Use approximate PI to price an American call option. The program should be a function of $S_0, T, p, u, d, K$.
Suggestions:
Start with a randomly generated policy $\mu_0 : \{1, \ldots, n\} \to \{\text{HOLD}, \text{EXERCISE}\}$.
Use approximate policy evaluation (Exercise 1) to evaluate $J_{\mu_t}$ and $Q_{\mu_t}$ for a given policy $\mu_t$.
Plot the trajectories of $\mu_t$.

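19 Policy Iteration for Option
Algorithm (starts with any $\mu_0$)
Policy evaluation:
Evaluate $J_{\mu_t} \approx \Phi r_\mu$ by approximate policy evaluation: use the program of Exercise 1 to compute $r_\mu$.
Evaluate the Q-values. For example, for $i_t \in [2, n-1]$,
\[ Q_{\mu_t}(i_t) = \alpha E[J_{\mu_t}(i_{t+1})] \approx \alpha E[\tilde J_{\mu_t}(i_{t+1})] = \alpha \big( p\, \tilde J_{\mu_t}(i_t + 1) + (1-p)\, \tilde J_{\mu_t}(i_t - 1) \big). \]
Note $\tilde J_\mu(i) = \phi(i)' r_\mu$.
Policy improvement:
\[ \mu_{t+1}(i) = \begin{cases} \text{HOLD} & \text{if } S(i) - K \le \tilde Q_{\mu_t}(i), \\ \text{EXERCISE} & \text{otherwise.} \end{cases} \]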
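The loop described on this slide might be organized as in the following sketch. It is an illustration under stated assumptions rather than the intended solution: the lattice `S`, the clamped up/down dynamics at the boundary states, the polynomial features, the zero reward for trajectories that never exercise, holding on ties, and every function name and numeric value are hypothetical choices.

```python
import numpy as np

def features(S):
    return np.column_stack([np.ones_like(S), S, S**2])

def simulate(n, p, N, rng):
    """Index trajectory of the clamped up/down chain, started at a random state."""
    traj = np.empty(N, dtype=int)
    traj[0] = rng.integers(n)
    ups = rng.random(N - 1) < p
    for k in range(N - 1):
        traj[k + 1] = min(max(traj[k] + (1 if ups[k] else -1), 0), n - 1)
    return traj

def evaluate(traj, S, K, alpha, exercise, Phi):
    """Direct least squares fit of the discounted exercise payoff (Exercise 1)."""
    G = np.zeros(len(traj))
    nxt = 0.0                    # assumption: no exercise before the horizon pays 0
    for k in range(len(traj) - 1, -1, -1):
        i = traj[k]
        nxt = S[i] - K if exercise[i] else alpha * nxt
        G[k] = nxt
    return np.linalg.lstsq(Phi[traj], G, rcond=None)[0]

def approximate_pi(S, K, p, alpha, n_iter=10, N=5000, seed=0):
    """Returns the last fitted price vector and the sequence of policies."""
    rng = np.random.default_rng(seed)
    n = len(S)
    Phi = features(S)
    up = np.minimum(np.arange(n) + 1, n - 1)
    down = np.maximum(np.arange(n) - 1, 0)
    exercise = rng.random(n) < 0.5                       # random initial policy mu_0
    history = [exercise.copy()]
    for _ in range(n_iter):
        r = evaluate(simulate(n, p, N, rng), S, K, alpha, exercise, Phi)
        J = Phi @ r                                      # approximate J for the current policy
        Q = alpha * (p * J[up] + (1 - p) * J[down])      # Q-values from the known p
        exercise = (S - K) > Q                           # hold on ties (assumption)
        history.append(exercise.copy())
    return J, np.array(history)

S = 1.05 ** np.arange(-20.0, 21.0)
J_last, policies = approximate_pi(S, K=1.0, p=0.5, alpha=0.95, n_iter=5, N=2000)
```

Row `t` of `policies` is the policy $\mu_t$ as a boolean vector over lattice states (True for EXERCISE), which is one convenient form for plotting how the policies evolve across iterations. Note that `J_last` is the fitted price vector of the last evaluated policy, not of the final improved one.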

20 Results - Option Prices
[Figure: option prices (vertical axis) versus stock price (horizontal axis)]

21 Convergence of Exercising Policies
[Figure: Convergence of Policies (blue: exercise, red: hold); axes: price versus number of policy iterations]

22 Exercise 3: Online PI for Q Factors
Modify the program of Exercise 2, so that the policy improvement step uses approximate evaluation of Q-factors (instead of exact Q-values calculated using known $p$). For each state $i$, calculate
\[ \tilde Q(i) = E\big[ \alpha \tilde J(i_{k+1}) \mid i_k = i \big] \]
by averaging the samples obtained from the trajectory:
\[ \tilde Q(i) \approx \frac{\sum_{k=0}^{N} 1(i_k = i)\, \alpha \tilde J(i_{k+1})}{\sum_{k=0}^{N} 1(i_k = i)}. \]
Note $\tilde J(i_{k+1}) = \phi(i_{k+1})' r$.
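The sample average above can be computed in one pass over the trajectory. The sketch below is a hedged illustration: `sample_q` is a hypothetical name, `traj` is assumed to be an integer array of visited state indices, `J` the vector of approximate prices $\phi(i)'r$ at every state, and states that the trajectory never leaves from are given `NaN` (a choice made here, not one stated in the exercise).

```python
import numpy as np

def sample_q(traj, J, alpha, n):
    """Average of alpha * J(i_{k+1}) over all steps k with i_k = i, for each state i."""
    cur, nxt = traj[:-1], traj[1:]
    sums = np.bincount(cur, weights=alpha * J[nxt], minlength=n)   # numerator per state
    counts = np.bincount(cur, minlength=n)                         # visits per state
    Q = np.full(n, np.nan)
    visited = counts > 0
    Q[visited] = sums[visited] / counts[visited]
    return Q, counts

# Tiny worked example with hypothetical numbers.
traj = np.array([0, 1, 0, 1, 2, 1])
J = np.array([1.0, 2.0, 4.0])
Q, counts = sample_q(traj, J, alpha=0.5, n=4)
# State 0 is followed by state 1 twice, so Q[0] = 0.5 * 2.0 = 1.0;
# state 3 is never visited, so Q[3] stays NaN.
```

The policy improvement step can then compare $S(i) - K$ with `Q[i]` on visited states only; how to treat unvisited states is left open in the exercise.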

23 The End
Thank you very much! Any questions are welcome :-)


6.231 DYNAMIC PROGRAMMING LECTURE 5 LECTURE OUTLINE Stopping problems Scheduling problems Minimax Control 1 PURE STOPPING PROBLEMS Two possible controls: Stop (incur a one-time stopping cost, and move