Information Relaxations and Duality in Stochastic Dynamic Programs


1 Information Relaxations and Duality in Stochastic Dynamic Programs
David Brown, Jim Smith, and Peng Sun
Fuqua School of Business, Duke University
February 28
1/39

2 Dynamic programming is widely applicable in principle, but often difficult to apply in practice.
- Dynamic programming is widely applicable because the world is full of sequential decision problems with uncertainty.
- Dynamic programming is often difficult to apply because of:
  - The curse of dimensionality: the state space grows exponentially in the number of variables
  - The challenge of modeling the gradual resolution of uncertainty
- Monte Carlo simulation (MCS) is easy to apply with large problems:
  - Can simulate to find the value of a given policy
  - Hard to find an optimal policy or to know how a given policy compares to the optimal policy

3 Our goal is to develop tools for studying large-scale DPs.
Modern approaches to large-scale DPs include:
- Neurodynamic programming: use simulation methods and linear approximations of value functions
  - Bertsekas and Tsitsiklis (1996); Powell (2007): general theory
  - Van Roy and Tsitsiklis (2001), Longstaff and Schwartz (2001): option valuation
- Linear programming techniques with constraint generation:
  - de Farias and Van Roy (2003, 2004, ...): general theory
  - Adelman (forthcoming): revenue management

4 We consider a dual approach based on information relaxations.
- A flexible framework that allows use of MCS with DPs:
  - Provides upper bounds on DP values
  - Users can control the tradeoff between computational effort and accuracy
  - Use with lower bounds from approximate policies to bracket the optimal solution
- Inspired by Haugh and Kogan (2004), who developed a similar dual approach for option pricing problems; see also Rogers (2002) and Anderson and Broadie (2004).
- We generalize to consider:
  - General DPs
  - Arbitrary information relaxations
  - Linear programming duality rather than martingales

5 The Basic Idea
A dynamic program is not a good structure for MCS:
[Figure: decision tree alternating Decision 1, Chance 1, ..., Decision 4, Chance 4; each decision node has alternatives A, B, C and each chance node has outcomes Low, Nominal, High.]
The DP with additional information and a penalty:
[Figure: the same tree reordered so that Chance 1 to Chance 4 (Low, Nominal, High) are resolved first, followed by Decision 1 to Decision 4 (Alt A, Alt B, Alt C), plus a penalty.]

6 Related Literature
- Relaxations of temporal feasibility constraints have been considered in the stochastic programming literature, e.g., Rockafellar and Wets (1991); Shapiro and Ruszczyński (2007):
  - Requires concave reward functions and convex action spaces
  - Considers perfect info relaxations and linear penalties
  - Uses sophisticated convex duality arguments
- Rogers (2007), Pathwise Stochastic Optimal Control:
  - Perfect information relaxation
  - Restricted forms for penalties
- Our results are simpler and more general:
  - General rewards and penalties
  - Simple direct proofs

7 Outline
- Introduction
- Basic setup: The primal DP
- Duality results: Penalties and relaxations
  - Linear programming duality results
  - Good penalties
- Example: Inventory control
- Example: Option pricing with stochastic volatility
- Example: Sequential exploration
- Conclusions

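8 Basic Setup
- Discrete time; finite horizon: $t = 0, \ldots, T$
- Uncertainty is described by a probability space $(\Omega, \mathcal{F}, P)$
- The DM's information is described by a filtration $\mathbb{F} = \{\mathcal{F}_t\}$, where $\mathcal{F}_t$ describes the DM's information at the beginning of period $t$
  - No forgetting: $\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_T \subseteq \mathcal{F}$
- The DM will choose an action $a_t$ in period $t$ from $A_t$; $A \subseteq A_0 \times \cdots \times A_T$ is the set of all possible action vectors $a$
- Rewards $\{r_t(a, \omega)\}$ depend on the actions $a \in A$ and the state $\omega \in \Omega$
- Total reward: $r(a, \omega) = \sum_{t=0}^{T} r_t(a, \omega)$ ($r(a)$ is a random variable.)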

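9 The Primal DP
- A policy $\alpha$ in $\mathcal{A}$ selects an action vector $a \in A$ for each state $\omega \in \Omega$.
- In the primal DP, policies must be temporally feasible, or $\mathbb{F}$-adapted:
  - The choice of action in period $t$ can only depend on what is known then
  - Let $\mathcal{A}_{\mathbb{F}} \subseteq \mathcal{A}$ denote the set of $\mathbb{F}$-adapted policies
- The primal DP:
  $\sup_{\alpha \in \mathcal{A}_{\mathbb{F}}} E[r(\alpha)] \quad \big( = \sup_{\alpha \in \mathcal{A}_{\mathbb{F}}} E[r(\alpha(\omega), \omega)] \big)$
- If we allow mixed policies, the primal DP is a linear program with mixing probabilities as decision variables.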

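10 The Dual Approach: Information Relaxations
- A filtration $\mathbb{G} = \{\mathcal{G}_t\}$ is a relaxation of the filtration $\mathbb{F} = \{\mathcal{F}_t\}$ if, for each $t$, $\mathcal{F}_t \subseteq \mathcal{G}_t \subseteq \mathcal{F}$.
  - The DM knows at least as much in every period under $\mathbb{G}$ as under $\mathbb{F}$.
  - Perfect information relaxation: $\mathbb{I} = \{\mathcal{I}_t\}$ where $\mathcal{I}_t = \mathcal{F}$
- Let $\mathcal{A}_{\mathbb{G}}$ be the set of $\mathbb{G}$-adapted policies. For any relaxation $\mathbb{G}$ of $\mathbb{F}$,
  $\mathcal{A}_{\mathbb{F}} \subseteq \mathcal{A}_{\mathbb{G}} \subseteq \mathcal{A}_{\mathbb{I}} = \mathcal{A}$
- As we relax the filtration, we expand the set of feasible policies.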

11 The Dual Approach: Penalties
- Penalties $z(a, \omega)$, like rewards, depend on the choice of actions $a$ and the state $\omega$
  - Penalties will be used to punish policies that use the additional information in $\mathbb{G}$.
  - Penalized objective function: $E[r(\alpha_G) - z(\alpha_G)]$
- A penalty $z$ is dual feasible if $E[z(\alpha_F)] \le 0$ for all $\alpha_F$ in $\mathcal{A}_{\mathbb{F}}$.
  - A dual feasible penalty assigns zero (or negative) expected penalty to any temporally feasible policy
  - Let $Z_{\mathbb{F}}$ be the set of all dual feasible penalties

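12 The Dual Approach: Weak Duality
Our main result can be viewed as a version of the weak duality lemma of linear programming.
Lemma. If $\alpha_F$ and $z$ are primal and dual feasible, respectively, and $\mathbb{G}$ is a relaxation of $\mathbb{F}$, then
  $E[r(\alpha_F)] \le \sup_{\alpha_G \in \mathcal{A}_{\mathbb{G}}} E[r(\alpha_G) - z(\alpha_G)]$.
Proof:
  $E[r(\alpha_F)] \le E[r(\alpha_F) - z(\alpha_F)]$ (since $E[z(\alpha_F)] \le 0$, because $z \in Z_{\mathbb{F}}$ and $\alpha_F \in \mathcal{A}_{\mathbb{F}}$)
  $\le \sup_{\alpha_G \in \mathcal{A}_{\mathbb{G}}} E[r(\alpha_G) - z(\alpha_G)]$ (because $\alpha_F \in \mathcal{A}_{\mathbb{F}}$ and $\mathcal{A}_{\mathbb{F}} \subseteq \mathcal{A}_{\mathbb{G}}$)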

13 The Dual Approach: Weak Duality
- With the perfect information relaxation, weak duality implies that for any $\alpha_F$ in $\mathcal{A}_{\mathbb{F}}$ and $z \in Z_{\mathbb{F}}$,
  $E[r(\alpha_F)] \le E\big[\sup_{a \in A} \{ r(a) - z(a) \}\big]$.
- This form is convenient for Monte Carlo simulation. In each trial:
  - Randomly generate a state $\omega$
  - Solve a deterministic inner problem of choosing a vector of actions $a$ to maximize $r(a, \omega) - z(a, \omega)$
- With imperfect information relaxations, we can often still use MCS, but the inner problems may be stochastic.
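The simulation recipe on this slide can be sketched on a toy problem that is not from the talk: the DM sees `n` i.i.d. Uniform(0,1) offers one at a time and must accept exactly one. The function names, the threshold rule, and the problem itself are illustrative assumptions. With zero penalty, the inner problem under perfect information is simply to take the largest offer on the path.

```python
import random

def optimal_value(n):
    """Exact primal value by backward induction (n i.i.d. Uniform(0,1) offers)."""
    v = 0.5                        # one offer left: it must be accepted, E[X] = 0.5
    for _ in range(n - 1):
        v = (1.0 + v * v) / 2.0    # E[max(X, v)] for X ~ Uniform(0,1)
    return v

def perfect_info_bound(n, trials, seed=0):
    """Upper bound: average over sampled paths of the deterministic inner problem."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        offers = [rng.random() for _ in range(n)]
        total += max(offers)       # best action vector once the whole path is known
    return total / trials

def threshold_policy_value(n, threshold, trials, seed=1):
    """Lower bound: simulate an adapted rule (accept the first offer >= threshold)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        offers = [rng.random() for _ in range(n)]
        accepted = next((x for x in offers[:-1] if x >= threshold), offers[-1])
        total += accepted
    return total / trials

if __name__ == "__main__":
    print(threshold_policy_value(5, 0.7, 20000), optimal_value(5), perfect_info_bound(5, 20000))
```

The three printed numbers bracket the optimal value from below and above; the gap on the upper side is what a penalty is meant to close.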

14 The Dual Approach: Strong Duality
In principle, we can find the optimal value by choosing the right penalty.
Theorem. Let $\mathbb{G}$ be a relaxation of $\mathbb{F}$. Then
  $\sup_{\alpha_F \in \mathcal{A}_{\mathbb{F}}} E[r(\alpha_F)] = \inf_{z \in Z_{\mathbb{F}}} \Big\{ \sup_{\alpha_G \in \mathcal{A}_{\mathbb{G}}} E[r(\alpha_G) - z(\alpha_G)] \Big\}$.
If the primal problem on the left is bounded, the dual problem on the right has an optimal solution $z^* \in Z_{\mathbb{F}}$ that achieves this bound.
Proof: Let $z^*(a) = r(a) - v^*$, where $v^* = \sup_{\alpha_F \in \mathcal{A}_{\mathbb{F}}} E[r(\alpha_F)]$. Then $z^*$ is dual feasible and $r(a) - z^*(a) = v^*$.
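The two claims in the proof can be spelled out in one line each. This is a sketch in the slide's notation, assuming $v^*$ is finite:

```latex
\begin{align*}
E[z^*(\alpha_F)] &= E[r(\alpha_F)] - v^* \le 0
  \quad \text{for all } \alpha_F \in \mathcal{A}_{\mathbb{F}}
  && \text{(so } z^* \in Z_{\mathbb{F}}\text{)},\\
\sup_{\alpha_G \in \mathcal{A}_{\mathbb{G}}} E[r(\alpha_G) - z^*(\alpha_G)]
  &= \sup_{\alpha_G \in \mathcal{A}_{\mathbb{G}}} E[v^*] = v^*
  && \text{(the dual value equals the primal value)}.
\end{align*}
```

By weak duality no dual feasible penalty can give a bound below $v^*$, so $z^*$ attains the infimum.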

15 The Dual Approach: Complementary Slackness
The complementary slackness condition further characterizes optimal policies and penalties.
Theorem. Let $\alpha_F^*$ and $z^*$ be feasible solutions for the primal and dual problems, respectively, with information relaxation $\mathbb{G}$. A necessary and sufficient condition for these to be optimal solutions for their respective problems is:
  $E[z^*(\alpha_F^*)] = 0$ and $E[r(\alpha_F^*) - z^*(\alpha_F^*)] = \sup_{\alpha_G \in \mathcal{A}_{\mathbb{G}}} E[r(\alpha_G) - z^*(\alpha_G)]$.
With an optimal penalty $z^*$, $\alpha_F^*$ has zero expected penalty and solves the dual problem.

16 The Dual Approach: Structured Policies
If the optimal policies in the primal problem have some particular structure (thresholds, Markovian, stationary, ...), we can focus on policies with this structure in the dual.
Proposition. Let $S \subseteq \mathcal{A}$ and suppose $\alpha_F^* \in \mathcal{A}_{\mathbb{F}} \cap S$ is a primal optimal solution. Then weak and strong duality hold within this restricted policy set $S$. Moreover, for a fixed penalty $z$ in
  $Z_{\mathbb{F}}^S = \{ z \in Z : E[z(\alpha_F)] \le 0 \ \ \forall \alpha_F \in \mathcal{A}_{\mathbb{F}} \cap S \}$,
the bound with the policy restriction to $S$ is weakly tighter than the corresponding bound without the restriction.
Structural constraints on policies lead to tighter bounds and more efficient computation.

17 How to Create Good Penalties
Let $\mathbb{G}$ be a relaxation of $\mathbb{F}$; let $\{w_t(a, \omega)\}$ be a sequence of generating functions where each $w_t$ depends only on the first $t+1$ actions $(a_0, \ldots, a_t)$ of $a$. Define
  $z_t(a) = E[w_t(a) \mid \mathcal{G}_t] - E[w_t(a) \mid \mathcal{F}_t]$ and $z(a) = \sum_{t=0}^{T} z_t(a)$.
Proposition. For these penalties $z$:
(i) $E[z(\alpha_F)] = 0$ for all $\alpha_F$ in $\mathcal{A}_{\mathbb{F}}$ (dual feasible; potentially optimal);
(ii) $\{z_t(a)\}$ is adapted to $\mathbb{G}$ and $z_t$ depends only on the first $t+1$ actions $(a_0, \ldots, a_t)$ of $a$ (leads to DP structure on $\mathbb{G}$); and
(iii) $z$ is orthogonal to $\mathbb{F}$ in that $E[z_t(a) \mid H] = 0$ whenever $H$ is in $\mathcal{F}_t$ (depends only on the additional information in $\mathbb{G}$).
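Property (i) follows from the tower property. The sketch below assumes the relevant expectations exist and uses the fact that the first $t+1$ actions of an $\mathbb{F}$-adapted policy are $\mathcal{F}_t$-measurable (hence $\mathcal{G}_t$-measurable), so the policy can be substituted inside both conditional expectations:

```latex
\begin{align*}
E[z_t(\alpha_F)]
  &= E\big[\, E[w_t(\alpha_F) \mid \mathcal{G}_t] \,\big]
   - E\big[\, E[w_t(\alpha_F) \mid \mathcal{F}_t] \,\big] \\
  &= E[w_t(\alpha_F)] - E[w_t(\alpha_F)] = 0 ,
\end{align*}
```

and summing over $t = 0, \ldots, T$ gives $E[z(\alpha_F)] = 0$.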

18 Examples of Good Penalties
- Define the DP value function:
  - Let $\mathcal{A}_{\mathbb{F}}^t(a)$ be the set of $\alpha$ in $\mathcal{A}_{\mathbb{F}}$ with the first $t$ actions $(a_0, \ldots, a_{t-1})$ constrained to match the first $t$ actions in $a$
  - Define $V_t(a) = \sup_{\alpha \in \mathcal{A}_{\mathbb{F}}^t(a)} E[r(\alpha) \mid \mathcal{F}_t]$
- An optimal penalty:
  - Take $z_t^*(a) = E[V_{t+1}(a) \mid \mathcal{G}_t] - E[V_{t+1}(a) \mid \mathcal{F}_t]$
  - $z_t^*(a)$ cancels the benefit of the knowledge in $\mathbb{G}$ in each period
- Alternatively, replace $V$ with:
  - An approximate value function (Haugh and Kogan)
  - A value function based on a given policy (Anderson and Broadie)
  - Some simple approximation
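A minimal sketch of the value-function penalty, on a toy problem that is not from the talk: `n` i.i.d. Uniform(0,1) offers arrive one at a time and exactly one must be accepted. Under the perfect information relaxation, continuing at period `t` is charged `max(X[t+1], c[t+1]) - c[t]`, where `c[t]` is the optimal continuation value; all names are illustrative assumptions. In this toy case the penalized inner problem equals `max(X[0], c[0])` on every path, so the dual bound matches the optimal value up to sampling noise in the first offer.

```python
import random

def continuation_values(n):
    """c[t] = optimal expected reward from continuing past offer t."""
    c = [0.0] * n                              # c[n-1] = 0: nothing is left after the last offer
    for t in range(n - 2, -1, -1):
        c[t] = (1.0 + c[t + 1] ** 2) / 2.0     # E[max(X, c[t+1])] for X ~ Uniform(0,1)
    return c

def penalized_payoffs(offers, c):
    """Reward minus accumulated penalty for stopping at each period, given the full path."""
    out, penalty = [], 0.0
    for s, x in enumerate(offers):
        out.append(x - penalty)
        if s + 1 < len(offers):
            # z_s = E[V_{s+1} | G_s] - E[V_{s+1} | F_s], incurred only by continuing at s
            penalty += max(offers[s + 1], c[s + 1]) - c[s]
    return out

def dual_bound(n, trials, seed=0):
    """Monte Carlo estimate of E[ max_a { r(a) - z(a) } ] with the value-function penalty."""
    rng = random.Random(seed)
    c = continuation_values(n)
    total = 0.0
    for _ in range(trials):
        offers = [rng.random() for _ in range(n)]
        total += max(penalized_payoffs(offers, c))
    return total / trials

def mean_penalty_of_threshold_rule(n, threshold, trials, seed=1):
    """Average penalty paid by an adapted rule; it should be close to zero."""
    rng = random.Random(seed)
    c = continuation_values(n)
    total = 0.0
    for _ in range(trials):
        offers = [rng.random() for _ in range(n)]
        s = next((t for t, x in enumerate(offers) if x >= threshold), n - 1)
        total += offers[s] - penalized_payoffs(offers, c)[s]
    return total / trials
```

Swapping `continuation_values` for a cruder approximation keeps the bound valid, because the penalty still has zero mean for adapted policies, but the bound is then generally looser.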

19 Properties of Penalties
- Penalties that are close to each other (or close to optimal) give bounds that are close to each other (or close to optimal):
  $\sup_{\alpha_G \in \mathcal{A}_{\mathbb{G}}} E[r(\alpha_G) - z_1(\alpha_G)] \le \sup_{\alpha_G \in \mathcal{A}_{\mathbb{G}}} E[r(\alpha_G) - z_2(\alpha_G)] + \sup_{\alpha_G \in \mathcal{A}_{\mathbb{G}}} E[z_2(\alpha_G) - z_1(\alpha_G)]$
- In practice, there will typically be a tradeoff between:
  - The accuracy of the bound and
  - The computational effort required to compute it.
- We can control this tradeoff through our choice of information relaxation $\mathbb{G}$ and penalty $z$.
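The inequality comes from adding and subtracting $z_2$ and then using the fact that the supremum of a sum is at most the sum of the suprema (all suprema are over $\alpha_G \in \mathcal{A}_{\mathbb{G}}$):

```latex
\begin{align*}
\sup_{\alpha_G} E[r(\alpha_G) - z_1(\alpha_G)]
  &= \sup_{\alpha_G} \Big\{ E[r(\alpha_G) - z_2(\alpha_G)] + E[z_2(\alpha_G) - z_1(\alpha_G)] \Big\} \\
  &\le \sup_{\alpha_G} E[r(\alpha_G) - z_2(\alpha_G)] + \sup_{\alpha_G} E[z_2(\alpha_G) - z_1(\alpha_G)] .
\end{align*}
```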

20 Outline
- Introduction
- Basic setup: The primal DP
- Duality results: Penalties and relaxations
  - Linear programming duality results
  - Good penalties
- Example: Inventory control
- Example: Option pricing with stochastic volatility
- Example: Sequential exploration
- Conclusions

21 Example: Standard Inventory Control Model
- Discrete time: $t = 0, \ldots, T$
- Order $a_t$ units of the good in period $t$; $a_t \ge 0$
- Demand in period $t$, $d_t$, is uncertain; it is revealed after ordering
- Inventory at the end of period $t$ is given by $x_{t+1} = x_t + a_t - d_t$ (assumes unmet demand is backordered)
- Costs: $r_t(a, d) = c_t(a_t) + f_t(x_{t+1})$
- Primal DP: $\inf_{\alpha_F \in \mathcal{A}_{\mathbb{F}}} E\Big[ \sum_{t=0}^{T} r_t(\alpha_F, d) \Big]$
- Perfect information dual DP: $E\Big[ \inf_{a \in A} \Big\{ \sum_{t=0}^{T} \big( r_t(a, d) - z_t(a, d) \big) \Big\} \Big]$

22 Example: Inventory Control: Penalties
- Perfect information bound: $z_t = 0$
  - The inner problem is a deterministic dynamic lot sizing problem: $\inf_{a \in A} \sum_{t=0}^{T} \big( c_t(a_t) + f_t(x_{t+1}) \big)$
  - Easy to solve, especially with linear ordering, holding, and shortage costs
- Smooth bound: $z_t(a, d) = r_t(a, d) - E[r_t(a, d) \mid \mathcal{F}_t]$
  - The inner problem is as before but with a smoothed cost function: $\inf_{a \in A} \sum_{t=0}^{T} \big( c_t(a_t) + E[f_t(x_{t+1}) \mid \mathcal{F}_t] \big)$
  - With linear costs, if we drop the $a_t \ge 0$ constraints, we get a weaker bound that decomposes into a series of simple newsvendor problems.
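The perfect information bound with $z_t = 0$ can be sketched for a small instance with linear costs. Everything specific here is an assumption for illustration: the cost parameters, the Poisson demand sampler, the horizon, and the base-stock rule used as the feasible policy are not the talk's numerical example. Because this is a cost minimization, the relaxation gives a lower bound and any feasible policy gives an upper bound.

```python
import math
import random
from functools import lru_cache

def period_cost(order, x_next, k, h, p):
    """Linear ordering cost plus holding or backorder cost on end-of-period inventory."""
    return k * order + h * max(x_next, 0) + p * max(-x_next, 0)

def inner_min_cost(demands, k, h, p, x0=0):
    """Deterministic lot-sizing problem: all demands are known in advance."""
    T = len(demands)

    @lru_cache(maxsize=None)
    def best(t, x):
        if t == T:
            return 0.0
        # With k > 0 and h >= 0, ordering beyond total remaining demand is never useful.
        top = max(x, sum(demands[t:]))
        return min(
            period_cost(y - x, y - demands[t], k, h, p) + best(t + 1, y - demands[t])
            for y in range(x, top + 1)
        )

    return best(0, x0)

def base_stock_cost(demands, S, k, h, p, x0=0):
    """Cost of an adapted policy: order up to level S before demand is seen."""
    x, total = x0, 0.0
    for d in demands:
        order = max(S - x, 0)
        x = x + order - d
        total += period_cost(order, x, k, h, p)
    return total

def poisson(rng, lam):
    """Knuth's multiplication method for a Poisson draw."""
    limit, count, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return count
        count += 1

def bracket(T=4, lam=3.0, k=1.0, h=1.0, p=4.0, S=4, trials=500, seed=0):
    """Return (lower bound from perfect information, upper bound from the policy)."""
    rng = random.Random(seed)
    lower = upper = 0.0
    for _ in range(trials):
        d = tuple(poisson(rng, lam) for _ in range(T))
        lower += inner_min_cost(d, k, h, p)
        upper += base_stock_cost(d, S, k, h, p)
    return lower / trials, upper / trials
```

Evaluating both quantities on the same demand paths keeps the bracket ordered path by path, since the policy's orders are one feasible choice in the inner problem.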

23 Example: Inventory Control: Numerical Example; Assumptions
- From Zipkin (2000)
- Linear costs; 5 Periods
- Detailed assumptions:
[Table: ordering costs $k_t$, holding costs $h_t$, backorder costs $p_t$, and mean demand $E[d_t]$ by period $t$, for two cases: (1) nonstationary demand and stationary costs; (2) stationary demand and nonstationary costs. The numerical entries are missing from this transcript.]
- We consider Poisson and geometric demand distributions

24 Example: Inventory Control: Numerical Example; Results
[Table: cost, standard error, and % of optimal for the optimal value, smooth bound, newsvendor bound, and perfect info. bound. Separate panels for Poisson and geometric demand distributions, each with two cases: (1) nonstationary demand and stationary costs; (2) stationary demand and nonstationary costs. The numerical entries are missing from this transcript.]

25 Example: Inventory Control: Dual Policies (Poisson)
[Figure: one panel per period for each of the two cases (nonstationary demand and stationary costs; stationary demand and nonstationary costs). Y-axis = period-t order quantity; X-axis = period-t inventory level; red line = optimal policy. The accompanying table of ordering costs $k_t$, holding costs $h_t$, backorder costs $p_t$, and mean demand $E[d_t]$ is missing from this transcript.]

26 Example: Inventory Control: Dual Policies (Geometric)
[Figure: one panel per period for each of the two cases (nonstationary demand and stationary costs; stationary demand and nonstationary costs). Y-axis = period-t order quantity; X-axis = period-t inventory level; red line = optimal policy. The accompanying table of ordering costs $k_t$, holding costs $h_t$, backorder costs $p_t$, and mean demand $E[d_t]$ is missing from this transcript.]

27 Example: Inventory Control: "World Model"; Assumptions
- Suppose ordering costs and mean demand evolve stochastically, but are observed before ordering.
[Figure: evolution of mean demands $E[d_t]$ and ordering costs $k_t$.]
- With a perfect info relaxation, the perfect information, smooth, and newsvendor bounds carry over unchanged.
- We also consider an imperfect info relaxation where $\mathbb{G}$ provides information about the state of the world but not demand; no penalty. The inner problem is a simpler stochastic DP.

28 Example: Inventory Control: "World Model"; Results
[Table: cost, standard error, and % of optimal for the optimal value, imperfect info. bound, smooth bound, newsvendor bound, and perfect info. bound, under Poisson and geometric demands. The numerical entries are missing from this transcript.]
- Bounds also apply to models where the world is not observed:
  - The partially observed MDP model is difficult to model
  - Dual bounds are no more complicated

29 Example: Valuing an Option with Stochastic Volatility
- We will value put and call options on a dividend-paying stock
- Heston's (1993) stochastic volatility model, plus dividends:
  - Stock price $S_t$: $dS_t = \mu S_t \, dt + \sqrt{v_t} \, S_t \, d\xi_t$
  - Instantaneous variance $v_t$: $dv_t = \kappa (\bar{v} - v_t) \, dt + \sigma \sqrt{v_t} \, d\nu_t$
  - Dividends $D_{t_i}$: $S_{t_i} = S_{t_i^-} - D_{t_i}$
  where $d\xi_t$ and $d\nu_t$ are increments of Brownian motions with correlation $\rho$ (risk-neutral process)
- The base filtration $\mathbb{F}$ assumes:
  - $S_t$ and $v_t$ are known at time $t$ (could also take $v_t$ to be unknown)
  - Dividend times $(t_1, \ldots, t_m)$ and amounts $D_{t_i}$ are given

30 Example: Valuing an Option with Stochastic Volatility
- Actions: $a_t$ = exercise or not; can exercise at most once
- Call option rewards:
  $r_t(a_t, S_t, v_t) = e^{-rt}(S_t - K)$ if $a_t$ = exercise, and $0$ otherwise,
  where $K$ is the strike price and $r$ is the risk-free interest rate. (Rewards for a put option are the negative of this.)
- Primal DP: fix a finite set of possible exercise dates $\{t_0, t_1, \ldots, t_T\}$:
  $\sup_{\alpha_F \in \mathcal{A}_{\mathbb{F}}} E\Big[ \sum_{i=0}^{T} r_{t_i}(\alpha_F, S_{t_i}, v_{t_i}) \Big]$

31 Example: Valuing an Option with Stochastic Volatility
- We consider a relaxation $\mathbb{G}$ where all volatilities are known in advance.
- Solved by simulation:
  - Generate all volatilities $v = \{v_t\}_{t=0}^{T}$
  - Given $v$, future stock prices are log-normally distributed with
    $E[S_{t_2} \mid S_{t_1}, v] = S_{t_1} e^{\mu (t_2 - t_1)} \exp\Big( \rho \int_{t_1}^{t_2} \sqrt{v_t} \, d\nu_t - \frac{\rho^2}{2} \int_{t_1}^{t_2} v_t \, dt \Big)$.
  - Solve the inner problem: a one-dimensional option valuation problem with time-varying volatility; we used a trinomial lattice.
- Structural restriction for call options: under $\mathbb{F}$, early exercise occurs only before dividends. This is not true under $\mathbb{G}$, but we can require it in the dual.
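The conditional mean above can be evaluated on a discretized volatility path. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it uses a plain Euler scheme for the variance process with the variance floored at zero inside the square root, and the function names and parameter values in the usage lines are hypothetical.

```python
import math
import random

def simulate_variance_path(v0, kappa, vbar, sigma, dt, steps, rng):
    """Euler steps for dv = kappa*(vbar - v) dt + sigma*sqrt(v) dnu (assumed scheme)."""
    v, variances, shocks = v0, [], []
    for _ in range(steps):
        dnu = rng.gauss(0.0, math.sqrt(dt))    # Brownian increment over one step
        variances.append(v)
        shocks.append(dnu)
        v = v + kappa * (vbar - v) * dt + sigma * math.sqrt(max(v, 0.0)) * dnu
    return variances, shocks

def conditional_mean_price(s1, mu, rho, dt, variances, shocks):
    """Discretized E[S_t2 | S_t1, v]: drift, plus rho * sum sqrt(v) dnu, minus rho^2/2 * sum v dt."""
    drift = mu * dt * len(variances)
    stochastic = sum(math.sqrt(max(v, 0.0)) * dnu for v, dnu in zip(variances, shocks))
    quadratic = sum(max(v, 0.0) * dt for v in variances)
    return s1 * math.exp(drift + rho * stochastic - 0.5 * rho ** 2 * quadratic)

if __name__ == "__main__":
    rng = random.Random(0)
    path, shocks = simulate_variance_path(0.0625, 2.0, 0.0625, 0.3, 0.01, 50, rng)
    print(conditional_mean_price(100.0, 0.05, -0.5, 0.01, path, shocks))
```

With $\rho = 0$ the volatility path carries no information about the expected price, and averaging the conditional mean over many volatility paths recovers the unconditional forward price.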

32 Example: Valuing an Option with Stochastic Volatility: Penalties
- Generating function for period $i$ (time $t_1$ to $t_2$):
  $w_i(a) = 0$ if $a_i$ = exercise;
  $w_i(a) = \Delta(S_{t_1}, t_1) \, e^{-r (t_2 - t_1)} S_{t_2}$ if $a_i$ = do not exercise
  - $\Delta(S_t, t)$ is the delta for an option with constant volatility ($\bar{v}$)
  - Easy to compute in the trinomial lattice; calculate once.
- When $a_i$ = exercise, $z_i(a) = 0$. When $a_i$ = do not exercise,
  $z_i(a) = E[w_i(a) \mid \mathcal{G}_{t_i}] - E[w_i(a) \mid \mathcal{F}_{t_i}] = \Delta(S_{t_1}, t_1) \big( e^{-r (t_2 - t_1)} E[S_{t_2} \mid S_{t_1}, v] - S_{t_1} \big)$
- Approximately cancels the effect of knowing future volatilities on the stock price expectations
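The closed form for $z_i$ in the "do not exercise" case can be checked term by term. The sketch assumes the risk-neutral drift $\mu = r$ and no dividend between $t_1$ and $t_2$, so that $E[S_{t_2} \mid \mathcal{F}_{t_1}] = S_{t_1} e^{\mu (t_2 - t_1)}$:

```latex
\begin{align*}
E[w_i(a) \mid \mathcal{G}_{t_i}]
  &= \Delta(S_{t_1}, t_1) \, e^{-r (t_2 - t_1)} \, E[S_{t_2} \mid S_{t_1}, v], \\
E[w_i(a) \mid \mathcal{F}_{t_i}]
  &= \Delta(S_{t_1}, t_1) \, e^{-r (t_2 - t_1)} \, S_{t_1} e^{\mu (t_2 - t_1)}
   = \Delta(S_{t_1}, t_1) \, S_{t_1} \qquad (\mu = r).
\end{align*}
```

Subtracting the second line from the first gives the expression for $z_i(a)$ on the slide.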

33 Example: Valuing an Option with Stochastic Volatility: Results
Assumptions:
- Put and call options, expire in 1 year
- S = K = $1; κ =.2; σ = 1%; r = μ =%; v = v = 25%
- One $2.5 dividend after 6 months: sometimes optimal to exercise
- Lower bounds found by simulating with a fixed exercise threshold
[Table: mean and standard error for three values of the correlation ρ. Rows for call options and for put options: upper bound with no penalty, upper bound with penalty, and lower bound. The numerical entries are missing from this transcript.]
Stochastic volatility affects option values, but you don't need to solve a two-dimensional DP to pin down the value.

34 Example: Valuing an Option with Stochastic Volatility: Dual Policies
Call option with ρ =.5; actions chosen just before the dividend in the dual problem, by scenario:
[Figure: two scatter plots, "With No Penalty" and "With Penalty". X-axis = stock price at T = .5; Y-axis = volatility at T = .5 (roughly 15% to 35%); each panel marks the threshold from the fixed volatility model. Pink = exercise; blue = wait.]

35 Example: Sequential Exploration
Given n (dependent) wells, what is the optimal drilling strategy?
[Figure: decision tree for the drilling problem. At each stage the DM chooses which remaining well to drill (Drill 1, ..., Drill 5) or to quit, then observes the factor results for that well (e.g., Rock / No Rock, Charge / No Charge).]
- Wells must succeed on each of k underlying factors to be wet
- From Bickel and Smith (2006); Bickel, Smith and Meyer (2007)

36 Example: Sequential Exploration
The primal problem gets complex quickly.
[Table: size of the primal problem (number of actions and states) by number of wells n and number of factors k, next to the size of the inner dual problem (actions and states, for any k). The numerical entries are garbled in this transcript.]
- Perfect information dual: randomly generate well results, then choose the drilling order.
- The inner problem grows exponentially with the number of wells n (albeit slowly), but is independent of the number of factors k.

37 Example: Sequential Exploration: Relaxations and Penalties
- We consider a numerical example with 5 wells and 3 factors:
  - 59,049 states; 91,854 actions
  - Numeric assumptions from Bickel, Smith and Meyer (2007)
- We consider information relaxations that provide perfect information in periods 0, 1, ..., 5.
- Penalties:
  - No penalty: $z_t(a) = 0$
  - Myopic penalty: $z_t(a) = r_t(a) - E[r_t(a) \mid \mathcal{F}_t]$
  - Binary VF penalty: $z_t(a) = \hat{V}_t(a) - E[\hat{V}_t(a) \mid \mathcal{F}_t]$, where $\hat{V}_t$ is the value function for a binary version of the problem.
- Lower bound from using the optimal policy from the binary version of the model.

38 Example: Sequential Exploration: Results
- The bounds get tighter as we delay the provision of perfect information and treat more periods exactly.
- Some penalty is required to get reasonably tight bounds.
[Table: by filtration $\mathbb{G}_i$, upper bounds with no penalty, with the myopic penalty, and with the binary VF penalty, alongside the exact value and the lower bound with the binary policy. The numerical entries are missing from this transcript.]

39 Summary and Conclusions
- Our dual approach is part of an iterative strategy for studying DPs:
  - Lower bounds from MCS with a given policy
  - Upper bounds from information relaxations and penalties
  - If the bounds are too far apart, refine.
- Can control effort and avoid:
  - Exact solution of large-scale DPs
  - Detailed modeling of uncertainty resolution (e.g., when is the state of the world known? is volatility observed?)
- Future research:
  - Infinite horizon and/or continuous time
  - Automatic generation of upper and lower bounds
  - Careful analysis of the complexity vs. accuracy tradeoff
