On Complexity of Multistage Stochastic Programs

Alexander Shapiro

School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332-0205, USA, e-mail: ashapiro@isye.gatech.edu

Abstract. In this paper we derive estimates of the sample sizes required to solve a multistage stochastic programming problem with a given accuracy by the (conditional sampling) sample average approximation method. The presented analysis is self-contained and is based on the relatively elementary, one-dimensional Cramér's Large Deviations Theorem.

Key words: stochastic programming, Monte Carlo sampling, sample average method, large deviations exponential bounds, complexity.
1 Introduction

Consider the following stochastic programming problem

\[
\min_{x \in X} \big\{ f(x) := \mathbb{E}[F(x,\xi)] \big\}, \tag{1.1}
\]

where $\xi$ is a random vector supported on a set $\Xi \subset \mathbb{R}^d$, the expectation in (1.1) is taken with respect to a (known) probability distribution of $\xi$, $X$ is a nonempty subset of $\mathbb{R}^n$ and $F : X \times \Xi \to \mathbb{R}$. In the case of two-stage stochastic programming, the function $F(x,\xi)$ is given as the optimal value of a corresponding second stage problem. In that case the assumption that $F(x,\xi)$ is real valued for all $x \in X$ and $\xi \in \Xi$ can only hold if the corresponding recourse is relatively complete.

Only in very specific situations can the expected value function $f(x)$ be written in closed form. Therefore, it has to be calculated by numerical integration. Already for a number of random variables $d \geq 5$ it is typically impossible to evaluate the corresponding multidimensional integral (expectation) with a high accuracy. This makes stochastic programming problems of the form (1.1) really difficult.

A way of estimating the expected value function is suggested by the Monte Carlo method. That is, a random sample $\xi^1,\dots,\xi^N$ of $N$ realizations of $\xi$ is generated, and the expected value function $f(x)$ is approximated by the sample average function $\hat f_N(x) := N^{-1} \sum_{i=1}^N F(x,\xi^i)$. This is the basic idea of the so-called sample average approximation (SAA) method. It is possible to show, under mild regularity conditions, that for $\varepsilon > 0$ and $\alpha \in (0,1)$ the sample size

\[
N \geq \frac{O(1)\,\sigma^2}{\varepsilon^2} \left[\, n \log\left( \frac{O(1)\,D L}{\varepsilon} \right) + \log\left( \frac{1}{\alpha} \right) \right] \tag{1.2}
\]

guarantees that any $\varepsilon/2$-optimal solution of the SAA problem is an $\varepsilon$-optimal solution of the true problem with probability at least $1-\alpha$ (see [3, 7, 8]). Here $O(1)$ is a generic constant, $D$ is the diameter of the set $X$ (assumed to be finite), $L$ is a Lipschitz constant of $f(x)$ and $\sigma^2$ is a certain constant measuring variability of the objective function $F(x,\xi)$. Recall that for $\varepsilon > 0$ it is said that $\bar x$ is an $\varepsilon$-optimal solution of problem (1.1) if $\bar x \in X$ and $f(\bar x) \leq \inf_{x \in X} f(x) + \varepsilon$.
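The SAA idea can be sketched in a few lines of code. The toy instance below, with $F(x,\xi) = (x-\xi)^2$, $\xi \sim N(0,1)$ and $X = [-1,1]$ (so that $f(x) = x^2 + 1$ with minimizer $x^* = 0$), is a hypothetical example chosen only for illustration; it is not from the paper.

```python
import random

# Toy instance (an assumption for illustration, not from the paper):
# F(x, xi) = (x - xi)^2 with xi ~ N(0, 1), so f(x) = E[F(x, xi)] = x^2 + 1,
# and the true minimizer over X = [-1, 1] is x* = 0 with f(x*) = 1.

def saa_objective(x, sample):
    """Sample average approximation: (1/N) * sum_i F(x, xi^i)."""
    return sum((x - xi) ** 2 for xi in sample) / len(sample)

def solve_saa(sample, grid_size=401):
    """Minimize the SAA objective over a grid on X = [-1, 1] (crude, but
    sufficient for this one-dimensional illustration)."""
    grid = [-1 + 2 * k / (grid_size - 1) for k in range(grid_size)]
    return min(grid, key=lambda x: saa_objective(x, sample))

random.seed(0)
N = 2000
sample = [random.gauss(0.0, 1.0) for _ in range(N)]
x_hat = solve_saa(sample)
# By the law of large numbers, x_hat should be close to x* = 0 and the
# SAA optimal value close to f(x*) = 1.
print(x_hat, saa_objective(x_hat, sample))
```

The point of the sketch is only that minimizing $\hat f_N$ yields a near-optimal solution of the true problem once $N$ is large enough; how large $N$ must be is exactly what the estimate (1.2) quantifies.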
In a sense the estimate (1.2) of the sample size gives a bound on the complexity of solving, with a specified probability, the (true) problem (1.1) by using the sample average approximation. Note that the estimated sample size grows linearly in the dimension $n$ of the first stage problem and is proportional to the squared ratio of the variability coefficient $\sigma$ to the desired accuracy $\varepsilon$. (The following Example 1 shows that this estimate cannot be significantly improved.) This indicates that one may expect to solve the true problem (1.1) with a manageable sample size to a reasonable accuracy by using the SAA method. And, indeed, this was verified in various numerical experiments (cf., [4, 5, 9]).

Example 1 Consider problem (1.1) with $F(x,\xi) := \|x\|^{2k} - 2k\,\langle \xi, x\rangle$, where $k$ is a positive integer, $X := \{x \in \mathbb{R}^n : \|x\| \leq 1\}$, $\langle x, y\rangle$ denotes the standard scalar product of two vectors $x, y \in \mathbb{R}^n$ and $\|x\| = \sqrt{\langle x, x\rangle}$. Suppose, further, that the random vector $\xi$ has normal distribution $N(0,\sigma^2 I_n)$, where $\sigma^2$ is a positive constant, i.e., the components $\xi_i$ of $\xi$ are independent and $\xi_i \sim N(0,\sigma^2)$, $i = 1,\dots,n$. It follows that $f(x) = \|x\|^{2k}$, and hence for $\varepsilon \in [0,1]$ the set of $\varepsilon$-optimal solutions of the true problem (1.1) is $\{x : \|x\|^{2k} \leq \varepsilon\}$.

Now let $\xi^1,\dots,\xi^N$ be an iid random sample of $\xi$ and $\bar\xi_N := (\xi^1 + \dots + \xi^N)/N$. The corresponding sample average function is $\hat f_N(x) = \|x\|^{2k} - 2k\,\langle \bar\xi_N, x\rangle$, and the optimal solution $\hat x_N$ of the SAA problem is $\hat x_N = \|\bar\xi_N\|^{-\gamma}\,\bar\xi_N$, where $\gamma := \frac{2k-2}{2k-1}$ if $\|\bar\xi_N\| \leq 1$, and $\gamma = 1$ if $\|\bar\xi_N\| > 1$. It follows that, for $\varepsilon \in (0,1)$, the optimal solution of the corresponding SAA problem is an $\varepsilon$-optimal solution of the true problem iff $\|\bar\xi_N\|^{\nu} \leq \varepsilon$, where $\nu := \frac{2k}{2k-1}$.

We have that $\bar\xi_N \sim N(0, \frac{\sigma^2}{N} I_n)$, and hence $N\|\bar\xi_N\|^2/\sigma^2$ has the chi-square distribution with $n$ degrees of freedom. Consequently, the probability that $\|\bar\xi_N\|^{\nu} > \varepsilon$ is equal to the probability $P\big( \chi^2_n > N \varepsilon^{2/\nu}/\sigma^2 \big)$. Moreover, $\mathbb{E}[\chi^2_n] = n$ and $P(\chi^2_n > n)$ increases and tends to $1/2$ as $n$ increases, e.g., $P(\chi^2_1 > 1) = 0.3173$, $P(\chi^2_2 > 2) = 0.3679$, $P(\chi^2_3 > 3) = 0.3916$, etc. Consequently, for $\alpha \in (0, 0.3)$ and $\varepsilon \in (0,1)$, for example, the sample size $N$ should satisfy

\[
N > \frac{n \sigma^2}{\varepsilon^{2/\nu}} \tag{1.3}
\]

in order to have the property: with probability $1-\alpha$ an (exact) optimal solution of the SAA problem is an $\varepsilon$-optimal solution of the true problem. Compared with (1.2), the lower bound (1.3) also grows linearly in $n$ and is proportional to $\sigma^2/\varepsilon^{2/\nu}$. It remains to note that the constant $\nu$ decreases to one as $k$ increases.

The aim of this paper is to extend this analysis of the SAA method to the multistage stochastic programming (MSP) setting. A discussion of the complexity of MSP can be found in [8]. It was already argued there that the complexity of the SAA method, when applied to MSP, grows fast with an increase of the number of stages, and seemingly simple MSP problems can be computationally unmanageable.
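The chi-square tail probabilities quoted in Example 1 are easy to check by simulation, using only the fact that a $\chi^2_n$ variable is the sum of $n$ squared independent standard normals. The following sketch is an illustration only; the trial count and seed are arbitrary choices.

```python
import random

def chi2_tail_mc(n, threshold, trials=100_000, seed=1):
    """Monte Carlo estimate of P(chi^2_n > threshold), using the fact that
    a chi^2_n variable is a sum of n squared independent N(0,1) draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
        if s > threshold:
            hits += 1
    return hits / trials

# Reproduce the tail probabilities quoted in Example 1:
# P(chi^2_1 > 1) ~ 0.3173, P(chi^2_2 > 2) ~ 0.3679, P(chi^2_3 > 3) ~ 0.3916.
results = {n: chi2_tail_mc(n, n) for n in (1, 2, 3)}
for n, p in results.items():
    print(n, p)
```

These tails stay bounded away from zero as $n$ grows, which is precisely why the failure probability in Example 1 cannot be made small without taking $N$ larger than the threshold (1.3).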
We estimate sample sizes, required to solve the true problem with a given accuracy, by using tools of the Large Deviations (LD) theory (see, e.g., [2] for a thorough discussion of the LD theory). In that respect our analysis is self-contained and rather elementary since we only employ the upper bound of the (one-dimensional) Cramér's LD Theorem. That is, if $X^1,\dots,X^N$ is a sequence of iid realizations of a random variable $X$ and $\bar X_N := N^{-1}\sum_{i=1}^N X^i$ is the corresponding average, then

\[
P\big( \bar X_N \geq a \big) \leq e^{-N I(a)}. \tag{1.4}
\]

Here $P(A)$ denotes the probability of an event $A$,

\[
I(z) := \sup_{t \in \mathbb{R}} \big\{ tz - \log M(t) \big\}
\]

is the so-called rate function, and $M(t) := \mathbb{E}[e^{tX}]$ is the moment generating function of the random variable $X$.
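The bound (1.4) can be made concrete in the Gaussian case, where the rate function has a closed form: for $X \sim N(0,\sigma^2)$ one has $\log M(t) = \sigma^2 t^2/2$, hence $I(a) = a^2/(2\sigma^2)$. The following Monte Carlo sketch (parameters chosen arbitrarily for illustration) compares the empirical tail of $\bar X_N$ with $e^{-N I(a)}$.

```python
import math
import random

# For X ~ N(0, sigma^2): log M(t) = sigma^2 t^2 / 2, so the rate function is
# I(a) = sup_t { t*a - sigma^2 t^2 / 2 } = a^2 / (2 sigma^2).
def rate_normal(a, sigma):
    return a ** 2 / (2 * sigma ** 2)

def tail_mc(N, a, sigma, trials=50_000, seed=2):
    """Monte Carlo estimate of P(bar X_N >= a) for iid N(0, sigma^2) terms."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.gauss(0.0, sigma) for _ in range(N)) / N
        if mean >= a:
            hits += 1
    return hits / trials

N, a, sigma = 25, 0.5, 1.0
empirical = tail_mc(N, a, sigma)
bound = math.exp(-N * rate_normal(a, sigma))  # e^{-N I(a)} = e^{-25/8}
# The empirical tail probability should not exceed the LD upper bound.
print(empirical, bound)
```

Here the exact tail is $P(Z \geq 2.5) \approx 0.0062$ while the bound is $e^{-25/8} \approx 0.044$, so (1.4) is conservative but of the right exponential order in $N$.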
Let us make the following simple observation which will be used in our derivations. Let $X$ and $Y$ be two random variables and $\varepsilon_1, \varepsilon_2 \in \mathbb{R}$. We have that if $X \leq \varepsilon_1$ and $Y \leq \varepsilon_2$, where $\varepsilon_1 + \varepsilon_2 = \varepsilon$, then $X + Y \leq \varepsilon$, and hence

\[
\{\omega : X(\omega) + Y(\omega) > \varepsilon\} \subset \{\omega : X(\omega) > \varepsilon_1\} \cup \{\omega : Y(\omega) > \varepsilon_2\}.
\]

This implies the following inequality for the corresponding probabilities:

\[
P(X + Y > \varepsilon) \leq P(X > \varepsilon_1) + P(Y > \varepsilon_2). \tag{1.5}
\]

2 Sample average approximations of multistage stochastic programs

Consider the following $T$-stage stochastic programming problem

\[
\min_{x_1 \in X_1} F_1(x_1) + \mathbb{E}\Big[ \inf_{x_2 \in X_2(x_1,\xi_2)} F_2(x_2,\xi_2) + \mathbb{E}\big[ \cdots + \mathbb{E}\big[ \inf_{x_T \in X_T(x_{T-1},\xi_T)} F_T(x_T,\xi_T) \big] \big] \Big] \tag{2.1}
\]

driven by the random data process $\xi_2,\dots,\xi_T$. Here $x_t \in \mathbb{R}^{n_t}$, $t = 1,\dots,T$, are decision variables, $F_t : \mathbb{R}^{n_t} \times \mathbb{R}^{d_t} \to \mathbb{R}$ are continuous functions and $X_t : \mathbb{R}^{n_{t-1}} \times \mathbb{R}^{d_t} \rightrightarrows \mathbb{R}^{n_t}$, $t = 2,\dots,T$, are measurable multifunctions; the function $F_1 : \mathbb{R}^{n_1} \to \mathbb{R}$ and the set $X_1 \subset \mathbb{R}^{n_1}$ are deterministic. We assume that the set $X_1$ is nonempty. For example, in the linear case,

\[
F_t(x_t,\xi_t) := \langle c_t, x_t\rangle, \qquad X_1 := \{x_1 : A_1 x_1 = b_1,\ x_1 \geq 0\},
\]
\[
X_t(x_{t-1},\xi_t) := \{x_t : B_t x_{t-1} + A_t x_t = b_t,\ x_t \geq 0\}, \quad t = 2,\dots,T,
\]

$\xi_1 := (c_1, A_1, b_1)$ is known at the first stage (and hence is nonrandom), and $\xi_t := (c_t, B_t, A_t, b_t) \in \mathbb{R}^{d_t}$, $t = 2,\dots,T$, are data vectors some (or all) elements of which can be random. In the sequel we use $\xi_t$ to denote both the random data vector and its particular realization. Which one of these two meanings is used in a particular situation will be clear from the context.

If we denote by $Q_2(x_1,\xi_2)$ the optimal value of the $(T-1)$-stage problem

\[
\min_{x_2 \in X_2(x_1,\xi_2)} F_2(x_2,\xi_2) + \mathbb{E}\big[ \cdots + \mathbb{E}\big[ \min_{x_T \in X_T(x_{T-1},\xi_T)} F_T(x_T,\xi_T) \big] \big], \tag{2.2}
\]

then we can write the $T$-stage problem (2.1) in the following form of a two-stage programming problem:

\[
\min_{x_1 \in X_1} F_1(x_1) + \mathbb{E}\big[ Q_2(x_1,\xi_2) \big]. \tag{2.3}
\]

Note, however, that if $T > 2$, then problem (2.2) in itself is a stochastic programming problem. Consequently, if the number of scenarios involved in (2.2) is very large, or
infinite, then the optimal value $Q_2(x_1,\xi_2)$ can be calculated only approximately, say by sampling.

For the sake of simplicity we make the following derivations for the 3-stage problem, i.e., we assume that $T = 3$ (it will be clear how the obtained results can be extended to an analysis of $T > 3$). In that case $Q_2(x_1,\xi_2)$ is given by the optimal value of the problem

\[
\min_{x_2 \in X_2(x_1,\xi_2)} F_2(x_2,\xi_2) + \mathbb{E}\big[ Q_3(x_2,\xi_3) \,\big|\, \xi_2 \big], \tag{2.4}
\]

where the expectation is taken with respect to the conditional distribution of $\xi_3$ given $\xi_2$, and

\[
Q_3(x_2,\xi_3) := \inf_{x_3 \in X_3(x_2,\xi_3)} F_3(x_3,\xi_3).
\]

We make the following assumption:

For every $x_1 \in X_1$ the expectation $\mathbb{E}\big[ Q_2(x_1,\xi_2) \big]$ is well defined and finite valued.

Of course, finite valuedness of $\mathbb{E}\big[ Q_2(x_1,\xi_2) \big]$ can only hold if $Q_2(x_1,\xi_2)$ is finite valued for a.e. $\xi_2$, which in turn implies that $X_2(x_1,\xi_2)$ is nonempty for a.e. $\xi_2$, etc. That is, the above assumption implies that the recourse is relatively complete.

Now let $\xi_2^i$, $i = 1,\dots,N_1$, be a random sample of independent realizations of the random vector $\xi_2$. We can approximate problem (2.3) by the following SAA problem:

\[
\min_{x_1 \in X_1} \Big\{ \hat f_{N_1}(x_1) := F_1(x_1) + N_1^{-1} \sum_{i=1}^{N_1} Q_2(x_1,\xi_2^i) \Big\}. \tag{2.5}
\]

Since the values $Q_2(x_1,\xi_2^i)$ are not given explicitly we need to estimate them by conditional sampling (note that in order for the SAA method to produce consistent estimators, conditional sampling is required, see [6]). That is, we generate a random sample $\xi_3^{ij}$, $j = 1,\dots,N_2$, of $N_2$ independent realizations according to the conditional distribution of $\xi_3$ given $\xi_2^i$, $i = 1,\dots,N_1$. Consequently, we approximate $Q_2(x_1,\xi_2^i)$ by

\[
\hat Q_{2,N_2}(x_1,\xi_2^i) := \inf_{x_2 \in X_2(x_1,\xi_2^i)} \Big\{ F_2(x_2,\xi_2^i) + N_2^{-1} \sum_{j=1}^{N_2} Q_3(x_2,\xi_3^{ij}) \Big\}. \tag{2.6}
\]

Finally, we approximate the true (expected value) problem (2.3) by the following so-called Sample Average Approximation (SAA) problem:

\[
\min_{x_1 \in X_1} \Big\{ \tilde f_{N_1,N_2}(x_1) := F_1(x_1) + N_1^{-1} \sum_{i=1}^{N_1} \hat Q_{2,N_2}(x_1,\xi_2^i) \Big\}. \tag{2.7}
\]

The above SAA problem is obtained by approximating the objective function $f(x_1) := F_1(x_1) + \mathbb{E}\big[ Q_2(x_1,\xi_2) \big]$ of problem (2.3) with $\tilde f_{N_1,N_2}(x_1)$.
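The nested structure of the conditional sampling estimator can be sketched on a hypothetical toy 3-stage instance (assumed only for illustration; it is not taken from the paper): the third-stage value $Q_3(x_2,\xi_3) := (x_2-\xi_3)^2$ is available in closed form, the inner minimization in (2.6) is done on a crude grid, and the stages are independent.

```python
import random

# Hypothetical toy instance (an assumption for illustration only):
#   Q3(x2, xi3)       := (x2 - xi3)^2   (third-stage optimal value)
#   Q2hat(x1, xi2)    := min_{x2} { (x2 - x1 - xi2)^2
#                                   + (1/N2) sum_j Q3(x2, xi3^{ij}) }
#   f_tilde(x1)       := x1^2 + (1/N1) sum_i Q2hat(x1, xi2^i)

def q3(x2, xi3):
    return (x2 - xi3) ** 2

def q2_hat(x1, xi2, xi3_sample):
    """Inner SAA (2.6): minimize over x2 on a crude grid around x1 + xi2."""
    def inner(x2):
        return ((x2 - x1 - xi2) ** 2
                + sum(q3(x2, z) for z in xi3_sample) / len(xi3_sample))
    grid = [x1 + xi2 + (k - 50) / 25 for k in range(101)]
    return min(inner(x2) for x2 in grid)

def f_tilde(x1, xi2_sample, xi3_samples):
    """Outer SAA (2.7), built from the inner estimates (2.6)."""
    outer = sum(q2_hat(x1, xi2, xi3)
                for xi2, xi3 in zip(xi2_sample, xi3_samples)) / len(xi2_sample)
    return x1 ** 2 + outer

rng = random.Random(3)
N1, N2 = 50, 50                 # total number of scenarios is N1 * N2
xi2_sample = [rng.gauss(0, 1) for _ in range(N1)]
# Under stagewise independence, one conditional sample of xi3 per xi2^i:
xi3_samples = [[rng.gauss(0, 1) for _ in range(N2)] for _ in range(N1)]
val = f_tilde(0.0, xi2_sample, xi3_samples)
print(val)
```

For this instance a direct calculation gives $f(0) = 1.5$, and the estimate above should land nearby; note that evaluating $\tilde f_{N_1,N_2}$ already touches $N_1 N_2$ scenarios, the point developed in Section 4.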
3 Sample size estimates

In order to proceed with our analysis we need to estimate the probability

\[
P\Big\{ \sup_{x_1 \in X_1} \big| f(x_1) - \tilde f_{N_1,N_2}(x_1) \big| > \varepsilon \Big\} \tag{3.1}
\]

for an arbitrary constant $\varepsilon > 0$. To this end we use the following result about Large Deviations (LD) bounds for the uniform convergence of sample average approximations.

Consider a function $h : X \times \Xi \to \mathbb{R}$ and the corresponding expected value function $\phi(x) := \mathbb{E}[h(x,\xi)]$, where the expectation is taken with respect to the probability distribution $P$ of the random vector $\xi = \xi(\omega)$, $X$ is a nonempty closed subset of $\mathbb{R}^n$ and $\Xi \subset \mathbb{R}^d$ is the support of the probability distribution $P$. Assume that for every $x \in X$ the expectation $\phi(x)$ is well defined, i.e., $h(x,\cdot)$ is measurable and $P$-integrable. Let $\xi^1,\dots,\xi^N$ be an iid sample of the random vector $\xi(\omega)$, and $\hat\phi_N(x) := N^{-1}\sum_{i=1}^N h(x,\xi^i)$ be the corresponding sample average function.

Theorem 1 Suppose that the set $X$ has finite diameter $D$, and the following conditions hold: (i) there exists a constant $\sigma > 0$ such that

\[
M_x(t) \leq \exp\{\sigma^2 t^2/2\}, \quad \forall t \in \mathbb{R},\ \forall x \in X, \tag{3.2}
\]

where $M_x(t)$ is the moment generating function of the random variable $h(x,\xi) - \phi(x)$; (ii) there exists a constant $L > 0$ such that

\[
\big| h(x',\xi) - h(x,\xi) \big| \leq L \|x' - x\|, \quad \forall \xi \in \Xi,\ \forall x', x \in X. \tag{3.3}
\]

Then for any $\varepsilon > 0$,

\[
P\Big\{ \sup_{x \in X} \big| \hat\phi_N(x) - \phi(x) \big| \geq \varepsilon \Big\} \leq O(1) \left( \frac{DL}{\varepsilon} \right)^n \exp\left\{ -\frac{N\varepsilon^2}{16\sigma^2} \right\}. \tag{3.4}
\]

To make the paper self-contained we give a proof of this theorem in the appendix.

We can apply the LD bound (3.4) to obtain estimates of the probability (3.1). We have that

\[
\sup_{x_1 \in X_1} \big| f(x_1) - \tilde f_{N_1,N_2}(x_1) \big| \leq \sup_{x_1 \in X_1} \big| f(x_1) - \hat f_{N_1}(x_1) \big| + \sup_{x_1 \in X_1} \big| \hat f_{N_1}(x_1) - \tilde f_{N_1,N_2}(x_1) \big|,
\]

and hence, by (1.5),

\[
\begin{aligned}
P\Big\{ \sup_{x_1 \in X_1} \big| f(x_1) - \tilde f_{N_1,N_2}(x_1) \big| > \varepsilon \Big\} \leq{}& P\Big\{ \sup_{x_1 \in X_1} \big| f(x_1) - \hat f_{N_1}(x_1) \big| > \varepsilon/2 \Big\} \\
&+ P\Big\{ \sup_{x_1 \in X_1} \big| \hat f_{N_1}(x_1) - \tilde f_{N_1,N_2}(x_1) \big| > \varepsilon/2 \Big\}.
\end{aligned} \tag{3.5}
\]
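Solving the right-hand side of (3.4) for $N$ turns the uniform LD bound into a sample size estimate of the form (1.2). The helper below does this arithmetic with the generic $O(1)$ constant set to $1$ and all parameter values chosen arbitrarily; both are illustrative assumptions, not values taken from the paper.

```python
import math

def uniform_ld_sample_size(eps, alpha, n, D, L, sigma):
    """Smallest N making (D*L/eps)^n * exp(-N*eps^2/(16*sigma^2)) <= alpha,
    i.e.  N >= (16*sigma^2/eps^2) * (n*log(D*L/eps) + log(1/alpha)).
    The generic O(1) factor in (3.4) is set to 1 here, an assumption made
    only for illustration."""
    return math.ceil(16 * sigma ** 2 / eps ** 2
                     * (n * math.log(D * L / eps) + math.log(1 / alpha)))

# Illustrative parameters (assumptions, not from the paper):
N = uniform_ld_sample_size(eps=0.1, alpha=0.01, n=10, D=1.0, L=1.0, sigma=1.0)
print(N)
# The estimate grows linearly in the dimension n and like 1/eps^2 in the
# accuracy, matching the form of (1.2).
print(uniform_ld_sample_size(eps=0.1, alpha=0.01, n=20, D=1.0, L=1.0, sigma=1.0))
```
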
Note that

\[
f(x_1) - \hat f_{N_1}(x_1) = \mathbb{E}[Q_2(x_1,\xi_2)] - N_1^{-1} \sum_{i=1}^{N_1} Q_2(x_1,\xi_2^i),
\]

and

\[
\hat f_{N_1}(x_1) - \tilde f_{N_1,N_2}(x_1) = N_1^{-1} \sum_{i=1}^{N_1} \big[ Q_2(x_1,\xi_2^i) - \hat Q_{2,N_2}(x_1,\xi_2^i) \big].
\]

Let us assume, for the sake of simplicity, the between-stages independence of the random process. That is, suppose that the following condition holds.

(A1) Random vectors $\xi_2$ and $\xi_3$ are independent.

Of course, under this condition the conditional expectation in formula (2.4) does not depend on $\xi_2$. Also in that case the conditional sample $\xi_3^{ij}$ has the (marginal) distribution of $\xi_3$ and is independent of $\xi_2^i$, and can be generated in two ways. Namely, we can either generate the same random sample $\xi_3^{ij}$, $j = 1,\dots,N_2$, for each $i = 1,\dots,N_1$, or these samples can be generated independently of each other.

Let us make further the following assumptions.

(A2) The set $X_1$ has finite diameter $D_1$.

(A3) There is a constant $L_1 > 0$ such that $|Q_2(x_1,\xi_2) - Q_2(x_1',\xi_2)| \leq L_1 \|x_1 - x_1'\|$ for all $x_1, x_1' \in X_1$ and a.e. $\xi_2$.

(A4) There exists a constant $\sigma_1 > 0$ such that for any $x_1 \in X_1$ it holds that $M_{1,x_1}(t) \leq \exp\{\sigma_1^2 t^2/2\}$, $\forall t \in \mathbb{R}$, where $M_{1,x_1}(t)$ is the moment generating function of $Q_2(x_1,\xi_2) - \mathbb{E}[Q_2(x_1,\xi_2)]$.

(A5) There is a positive constant $D_2$ such that for every $x_1 \in X_1$ and a.e. $\xi_2$ the set $X_2(x_1,\xi_2)$ has a finite diameter less than or equal to $D_2$.

(A6) There is a constant $L_2 > 0$ such that

\[
\big| F_2(x_2,\xi_2) - F_2(x_2',\xi_2) \big| + \big| Q_3(x_2,\xi_3) - Q_3(x_2',\xi_3) \big| \leq L_2 \|x_2 - x_2'\|
\]

for all $x_2, x_2' \in X_2(x_1,\xi_2)$, and a.e. $\xi_2$ and $\xi_3$.
(A7) There exists a constant $\sigma_2 > 0$ such that for any $x_2 \in X_2(x_1,\xi_2)$ and all $x_1 \in X_1$ and a.e. $\xi_2$ it holds that $M_{2,x_2}(t) \leq \exp\{\sigma_2^2 t^2/2\}$, $\forall t \in \mathbb{R}$, where $M_{2,x_2}(t)$ is the moment generating function of $Q_3(x_2,\xi_3) - \mathbb{E}[Q_3(x_2,\xi_3)]$.

By Theorem 1, under assumptions (A2)–(A4), we have that

\[
P\Big\{ \sup_{x_1 \in X_1} \big| f(x_1) - \hat f_{N_1}(x_1) \big| > \varepsilon/2 \Big\} \leq O(1) \left( \frac{D_1 L_1}{\varepsilon} \right)^{n_1} \exp\left\{ -\frac{O(1) N_1 \varepsilon^2}{\sigma_1^2} \right\}. \tag{3.6}
\]

For $\xi_2$ and a random sample $\xi_3^1,\dots,\xi_3^{N_2}$ of $N_2$ independent replications of $\xi_3$, consider the function

\[
\hat\psi_{N_2}(x_2,\xi_2) := F_2(x_2,\xi_2) + N_2^{-1} \sum_{j=1}^{N_2} Q_3(x_2,\xi_3^j)
\]

and its expected value

\[
\psi(x_2,\xi_2) = F_2(x_2,\xi_2) + \mathbb{E}[Q_3(x_2,\xi_3)].
\]

By Theorem 1, under assumptions (A5)–(A7), we have that for any $x_1 \in X_1$,

\[
P\Big\{ \sup_{x_2 \in X_2(x_1,\xi_2)} \big| \hat\psi_{N_2}(x_2,\xi_2) - \psi(x_2,\xi_2) \big| > \varepsilon/2 \Big\} \leq C(N_2), \tag{3.7}
\]

where

\[
C(N_2) = O(1) \left( \frac{D_2 L_2}{\varepsilon} \right)^{n_2} \exp\left\{ -\frac{O(1) N_2 \varepsilon^2}{\sigma_2^2} \right\}.
\]

It follows that

\[
P\Big\{ \Big| \inf_{x_2 \in X_2(x_1,\xi_2)} \hat\psi_{N_2}(x_2,\xi_2) - \inf_{x_2 \in X_2(x_1,\xi_2)} \psi(x_2,\xi_2) \Big| > \varepsilon/2 \Big\} \leq C(N_2). \tag{3.8}
\]

Note that $\inf_{x_2 \in X_2(x_1,\xi_2)} \psi(x_2,\xi_2) = Q_2(x_1,\xi_2)$, and for $\xi_2 = \xi_2^i$,

\[
\inf_{x_2 \in X_2(x_1,\xi_2)} \hat\psi_{N_2}(x_2,\xi_2) = \hat Q_{2,N_2}(x_1,\xi_2^i).
\]

It follows from (3.8) that (for both strategies of using the same or independent samples for each $\xi_2^i$) the following inequality holds:

\[
P\Big\{ \big| \hat f_{N_1}(x_1) - \tilde f_{N_1,N_2}(x_1) \big| > \varepsilon/2 \Big\} \leq C(N_2). \tag{3.9}
\]

Suppose further that:
There is a constant $L_3 > 0$ such that $\hat f_{N_1}(\cdot) - \tilde f_{N_1,N_2}(\cdot)$ is Lipschitz continuous on $X_1$ with constant $L_3$.

Then by constructing a $\nu$-net in $X_1$ and using (3.9) it can be shown (in a way similar to the proof of Theorem 1 in the Appendix) that

\[
P\Big\{ \sup_{x_1 \in X_1} \big| \hat f_{N_1}(x_1) - \tilde f_{N_1,N_2}(x_1) \big| > \varepsilon/2 \Big\} \leq O(1) \left( \frac{D_1 L_3}{\varepsilon} \right)^{n_1} \left( \frac{D_2 L_2}{\varepsilon} \right)^{n_2} \exp\left\{ -\frac{O(1) N_2 \varepsilon^2}{\sigma_2^2} \right\}. \tag{3.10}
\]

Combining (3.5) with the estimates (3.6) and (3.10) gives an upper bound for the probability (3.1). Let us also observe that if $\hat x_1$ is an $\varepsilon/2$-optimal solution of the SAA problem (2.7) and $\sup_{x_1 \in X_1} | f(x_1) - \tilde f_{N_1,N_2}(x_1) | \leq \varepsilon/2$, then $\hat x_1$ is an $\varepsilon$-optimal solution of the true problem (2.3). Therefore, we obtain the following result.

Theorem 2 Under the specified assumptions and for $\varepsilon > 0$ and $\alpha \in (0,1)$, and the sample sizes $N_1$ and $N_2$ satisfying

\[
O(1) \left[ \left( \frac{D_1 L_1}{\varepsilon} \right)^{n_1} \exp\left\{ -\frac{O(1) N_1 \varepsilon^2}{\sigma_1^2} \right\} + \left( \frac{D_1 L_3}{\varepsilon} \right)^{n_1} \left( \frac{D_2 L_2}{\varepsilon} \right)^{n_2} \exp\left\{ -\frac{O(1) N_2 \varepsilon^2}{\sigma_2^2} \right\} \right] \leq \alpha, \tag{3.11}
\]

we have that any $\varepsilon/2$-optimal solution of the SAA problem (2.7) is an $\varepsilon$-optimal solution of the true problem (2.3) with probability at least $1-\alpha$.

In particular, suppose that $N = N_1 = N_2$. Then for $L := \max\{L_1, L_2, L_3\}$, $D := \max\{D_1, D_2\}$ and $\sigma := \max\{\sigma_1, \sigma_2\}$ we can use the following estimate of the required sample size $N = N_1 = N_2$:

\[
O(1) \left( \frac{DL}{\varepsilon} \right)^{n_1+n_2} \exp\left\{ -\frac{O(1) N \varepsilon^2}{\sigma^2} \right\} \leq \alpha, \tag{3.12}
\]

which is equivalent to

\[
N \geq \frac{O(1)\sigma^2}{\varepsilon^2} \left[ (n_1 + n_2) \log\left( \frac{O(1) DL}{\varepsilon} \right) + \log\left( \frac{1}{\alpha} \right) \right]. \tag{3.13}
\]

4 Discussion

The estimate (3.13), for 3-stage programs, looks similar to the estimate (1.2) for two-stage programs. Note, however, that if we use the SAA method with conditional sampling and respective sample sizes $N_1$ and $N_2$, then the total number of scenarios is $N = N_1 N_2$. Therefore, our analysis seems to indicate that for 3-stage problems we need random samples with a total number of scenarios of the order of the square of the corresponding sample size for two-stage problems. This analysis can be extended to $T$-stage problems with the conclusion that the total number of scenarios needed to solve the true problem with a
reasonable accuracy grows exponentially with an increase of the number of stages $T$. Some numerical experiments seem to confirm this conclusion (cf., [1]).

Of course, it should be mentioned that the above analysis does not prove in a rigorous mathematical sense that the complexity of multistage programming grows exponentially with an increase of the number of stages. It only indicates that the SAA method, which has shown considerable promise for solving two-stage problems, could be practically inapplicable for solving multistage problems with a large (say greater than 5) number of stages.

Our analysis was performed under several simplifying assumptions. In particular, we considered a 3-stage setting and assumed the between-stages independence condition. An extension of the analysis from 3 to a higher number of stages is straightforward. Removing the between-stages independence assumption may create technical difficulties and requires a further investigation.

5 Appendix

Proof of Theorem 1. By the LD bound (1.4) we have that for any $x \in X$ and $\varepsilon > 0$ it holds that

\[
P\big\{ \hat\phi_N(x) - \phi(x) \geq \varepsilon \big\} \leq \exp\{-N I_x(\varepsilon)\}, \tag{5.1}
\]

where

\[
I_x(z) := \sup_{t \in \mathbb{R}} \big\{ zt - \log M_x(t) \big\} \tag{5.2}
\]

is the rate function of the random variable $h(x,\xi) - \phi(x)$. Similarly,

\[
P\big\{ \hat\phi_N(x) - \phi(x) \leq -\varepsilon \big\} \leq \exp\{-N I_x(-\varepsilon)\},
\]

and hence

\[
P\big\{ |\hat\phi_N(x) - \phi(x)| \geq \varepsilon \big\} \leq \exp\{-N I_x(\varepsilon)\} + \exp\{-N I_x(-\varepsilon)\}. \tag{5.3}
\]

For a constant $\nu > 0$, let $\bar x_1,\dots,\bar x_M \in X$ be such that for every $x \in X$ there exists $\bar x_l$, $l \in \{1,\dots,M\}$, such that $\|x - \bar x_l\| \leq \nu$. Such a set $\{\bar x_1,\dots,\bar x_M\}$ is called a $\nu$-net in $X$. We can choose this net in such a way that $M \leq O(1)(D/\nu)^n$, where $D := \sup_{x,x' \in X} \|x - x'\|$ is the diameter of $X$ and $O(1)$ is a generic constant. By (3.3) we have that

\[
|\phi(x') - \phi(x)| \leq L\|x' - x\| \tag{5.4}
\]

and

\[
|\hat\phi_N(x') - \hat\phi_N(x)| \leq L\|x' - x\|, \tag{5.5}
\]
for any $x', x \in X$. It follows by (5.3) that

\[
\begin{aligned}
P\Big( \max_{1 \leq l \leq M} \big| \hat\phi_N(\bar x_l) - \phi(\bar x_l) \big| \geq \varepsilon \Big)
&= P\Big( \bigcup_{l=1}^M \big\{ |\hat\phi_N(\bar x_l) - \phi(\bar x_l)| \geq \varepsilon \big\} \Big)
\leq \sum_{l=1}^M P\Big( \big| \hat\phi_N(\bar x_l) - \phi(\bar x_l) \big| \geq \varepsilon \Big) \\
&\leq \sum_{l=1}^M \Big[ \exp\{-N I_{\bar x_l}(\varepsilon)\} + \exp\{-N I_{\bar x_l}(-\varepsilon)\} \Big].
\end{aligned} \tag{5.6}
\]

For an $x \in X$ consider $l(x) \in \arg\min_{1 \leq l \leq M} \|x - \bar x_l\|$. By construction of the $\nu$-net we have that $\|x - \bar x_{l(x)}\| \leq \nu$ for every $x \in X$. Then

\[
\big| \hat\phi_N(x) - \phi(x) \big| \leq \big| \hat\phi_N(x) - \hat\phi_N(\bar x_{l(x)}) \big| + \big| \hat\phi_N(\bar x_{l(x)}) - \phi(\bar x_{l(x)}) \big| + \big| \phi(\bar x_{l(x)}) - \phi(x) \big| \leq L\nu + \big| \hat\phi_N(\bar x_{l(x)}) - \phi(\bar x_{l(x)}) \big| + L\nu.
\]

Let us take now a $\nu$-net with such $\nu$ that $L\nu = \varepsilon/4$, i.e., $\nu := \varepsilon/(4L)$. Then

\[
P\Big\{ \sup_{x \in X} \big| \hat\phi_N(x) - \phi(x) \big| \geq \varepsilon \Big\} \leq P\Big\{ \max_{1 \leq l \leq M} \big| \hat\phi_N(\bar x_l) - \phi(\bar x_l) \big| \geq \varepsilon/2 \Big\},
\]

which together with (5.6) implies that

\[
P\Big\{ \sup_{x \in X} \big| \hat\phi_N(x) - \phi(x) \big| \geq \varepsilon \Big\} \leq \sum_{l=1}^M \Big[ \exp\{-N I_{\bar x_l}(\varepsilon/2)\} + \exp\{-N I_{\bar x_l}(-\varepsilon/2)\} \Big]. \tag{5.7}
\]

Moreover, because of condition (i) we have that $\log M_x(t) \leq \sigma^2 t^2/2$, and hence

\[
I_x(z) \geq \frac{z^2}{2\sigma^2}, \quad \forall z \in \mathbb{R}. \tag{5.8}
\]

It follows from (5.7) and (5.8) that

\[
P\Big\{ \sup_{x \in X} \big| \hat\phi_N(x) - \phi(x) \big| \geq \varepsilon \Big\} \leq 2M \exp\left\{ -\frac{N\varepsilon^2}{16\sigma^2} \right\}. \tag{5.9}
\]

Finally, since $M \leq O(1)(D/\nu)^n = O(1)(DL/\varepsilon)^n$, we obtain that (5.9) implies (3.4), and hence the proof is complete. $\square$

References

[1] J. Blomvall and A. Shapiro, Solving multistage asset investment problems by Monte Carlo based optimization, E-print available at: http://www.optimization-online.org, 2004.

[2] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, Springer-Verlag, New York, NY, 1998.
[3] A. J. Kleywegt, A. Shapiro, and T. Homem-de-Mello, The sample average approximation method for stochastic discrete optimization, SIAM Journal on Optimization, 12 (2001), 479–502.

[4] J. Linderoth, A. Shapiro, and S. Wright, The empirical behavior of sampling methods for stochastic programming, Annals of Operations Research, to appear.

[5] W. K. Mak, D. P. Morton, and R. K. Wood, Monte Carlo bounding techniques for determining solution quality in stochastic programs, Operations Research Letters, 24 (1999), 47–56.

[6] A. Shapiro, Inference of statistical bounds for multistage stochastic programming problems, Mathematical Methods of Operations Research, 58 (2003), 57–68.

[7] A. Shapiro, Monte Carlo sampling methods, in: A. Ruszczyński and A. Shapiro (editors), Stochastic Programming, volume 10 of Handbooks in Operations Research and Management Science, North-Holland, 2003.

[8] A. Shapiro and A. Nemirovski, On complexity of stochastic programming problems, E-print available at: http://www.optimization-online.org, 2004.

[9] B. Verweij, S. Ahmed, A. J. Kleywegt, G. Nemhauser, and A. Shapiro, The sample average approximation method applied to stochastic routing problems: a computational study, Computational Optimization and Applications, 24 (2003), 289–333.