A Stochastic Levenberg-Marquardt Method Using Random Models with Application to Data Assimilation

Size: px

Start display at page:

Download "A Stochastic Levenberg-Marquardt Method Using Random Models with Application to Data Assimilation"

Melissa Phillips
5 years ago
Views:

1 A Stochastic Levenberg-Marquardt Method Using Random Models with Application to Data Assimilation E Bergou Y Diouane V Kungurtsev C W Royer July 5, 08 Abstract Globally convergent variants of the Gauss-Newton algorithm are often the preferred methods to tackle nonlinear least squares problems Among such frameworks, the Levenberg- Marquardt and the trust-region methods are two well-established paradigms, and their similarities have often enabled to derive similar analyses of these schemes Both algorithms have indeed been successfully studied when the Gauss-Newton model is replaced by a random model, only accurate with a given probability Meanwhile, problems where even the obective value is subect to noise have gained interest, driven by the need for efficient methods in fields such as data assimilation In this paper, we describe a stochastic Levenberg-Marquardt algorithm that can handle noisy obective function values as well as random models, provided sufficient accuracy is achieved in probability Our method relies on a specific scaling of the regularization parameter, which clarifies further the correspondences between the two classes of methods, and allows us to leverage existing theory for trust-region alorithms Provided the probability of accurate function estimates and models is sufficiently large, we establish that the proposed algorithm converges globally to a first-order stationary point of the obective function with probability one Furthermore, we derive a bound the expected number of iterations needed to reach an approximate stationary point We finally describe an application of our method to variational data assimilation, where stochastic models are computed by the so-called ensemble methods Keywords: Levenberg-Marquardt method, nonlinear least squares, regularization, random models, noisy functions, data assimilation MaIAGE, INRA, Université Paris-Saclay, Jouy-en-Josas, France (elhoucinebergou@inrafr) ISAE-SUPAERO, Université de Toulouse, 3055 Toulouse Cedex 4, France (youssefdiouane@isaefr) Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague Support for this author was provided by the Czech Science Foundation proect S (vyacheslavkungurtsev@felcvutcz) Wisconsin Institute for Discovery, University of Wisconsin-Madison, 330 N Orchard St, Madison, WI 5375, USA (croyer@wiscedu) Support for this author was provided by Subcontract 3F-30 from Argonne National Laboratory

2 Introduction Minimizing a nonlinear least-squares function is one of the most classical problems in numerical optimization, that arises in a variety of fields In many applications, the obective function to be optimized can only be accessed through noisy estimates Typical occurrences of such a formulation can be found when solving inverse problems [6, 7, 8 or while minimizing the error of a model in the context of machine learning [9 In such cases, the presence of noise is often due to the estimation of the obective function via cheaper, less accurate calculations: this is for instance true when part of the data is left aside while computing this estimate In fact, in data-fitting problems such as those coming from machine learning, a huge amount of data is available, and considering the entire data throughout the optimization process can be extremely costly Moreover, the measurements can be redundant and possibly corrupted: in that context, a full evaluation of the function or the gradient may be unnecessary Such concerns have motivated the development of optimization frameworks that cope with inexactness in the obective function or its derivatives In particular, the field of derivative-free optimization [5, where it is assumed that the derivatives exist but are unavailable for use in an algorithm, has expanded in the recent years with the introduction of random models One seminal work in this respect is [, where the authors applied arguments from compressed sensing to guarantee accuracy of quadratic models whenever the Hessian had a certain (unknown) sparsity pattern Trust-region methods based on general probabilistic models were then proposed in [, where convergence to first- and second-order stationary points was established under appropriate accuracy assumptions on the models Global convergence rates were derived for this approach in [9, in expectation and with high probability Of particular interest to us is the extension of trust-region methods with probabilistic models to the case of noisy function values [3: the corresponding algorithm considers two sources of randomness, respectively arising from the noisy function estimates and the random construction of the models A global convergence rate in expectation for this method was derived in [7, where it was established that the method needed O(ɛ ) iterations in expectation to drive the gradient norm below some threshold ɛ In the context of derivative-free least-squares problems where exact function values are available, various deterministic approaches based on globalization of the Gauss-Newton method have been studied The algorithms developed in the derivative-free community are mostly of trustregion type, and rely on building models that satisfy the so-called fully linear property, which requires the introduction of a so-called criticality step to guarantee its satisfaction throughout the algorithmic process [, 3, 3, 9 The recent DFO-GN algorithm [ was equipped with a complexity result, showing a bound of the same order than derivative-free trust-region methods for generic functions [8 As for general problems, considering random models is a possible way of relaxing the need for accuracy at every iteration A Levenberg-Marquardt algorithm based in this idea was proposed in [6, motivated by problems from data assimilation The authors of [6 proposed an extension of the classical LM algorithm that replaces the gradient of the obective function by a noisy estimate, that is accurate only with a certain probability Using similar arguments than for the trust-region case [, almost-sure global convergence to a first-order stationary point was established The case of noisy least squares has also been examined A very recent preprint [0 proposed a efficient approach for handling noisy values in practice, but did not provide theoretical guarantees A Levenberg-Marquardt framework for noisy optimization without derivatives was proposed in [4, with similar goals as those aimed in this paper The method proposed in [4

3 assumes that function values can be estimated to a prescribed accuracy level, and explicitly maintains a sequence of these accuracies throughout the algorithm Although such an approach is relevant when any accuracy level can be used (for instance, all the data can be utilized to estimate the function), it does not allow for arbitrarily bad estimations on any iteration: moreover, the noise level must be small compared to the norm of the upcoming Levenberg-Marquardt step, a condition that may force to reduce this noise level, and resembles the criticality step of derivative-free model-based methods By contrast, the use of random models and estimates with properties only guaranteed in probability allows for arbitrarily bad estimates, which seems more economical at the iteration level, and does not necessarily mean that a good step will not be computed in that case Probabilistic properties thus emerges as an interesting alternative, particularly when it is expensive to compute accurate estimates, and one can then think of exploiting the connection between Levenberg-Marquardt and trust-region methods [3 to analyze the former in the case of noisy problems In this paper, we propose a stochastic framework that builds upon the approach developed in [6 to handle both random models and noise in the function evaluations This new algorithm is also inspired by a recently proposed variant of the Levenberg-Marquardt framework [5, where a specific scaling of the regularization parameter enabled the derivation of worst-case complexity results We adapt the analysis of the stochastic trust-region framework using random models proposed in [7, 3 to prove that our framework enoys comparable convergence and complexity guarantees Unlike [4, our setup allows for arbitrarily inaccurate models or function estimates, as long as it happens with a small probability Our method is particularly suited for applications in data assimilation, which we illustrate in the context of ensemble methods The remainder of the paper is organized as follows Section presents our Levenberg- Marquardt framework; Section 3 established the accuracy requirements we make on the function values and the models, as well as their probabilistic counterparts Global convergence and worstcase complexity of the method are analyzed in Sections 4 and 5, respectively Finally, Section 6 describes an application of our method in data assimilation A Levenberg-Marquardt algorithm based on estimated values In this paper, we consider the following nonlinear least squares problem: min x R n f(x) = r(x), () where r : R n R m is the residual vector-valued function, assumed to be continuously differentiable, and most likely m n During the minimization process, the optimizer can only have access to estimates of f - referred as f This estimate is assumed to be noisy, ie, one has for all x R n, f(x) = r(x, ξ) where the noise ξ is a random variable This section recalls the main features of the Levenberg-Marquardt method, then describes our extension of this algorithm to handle noisy function values and gradients Deterministic Levenberg-Marquardt algorithm Whenever the function r and its Jacobian can be accessed, one possible approach for solving the problem () is based on the Gauss-Newton model More precisely, at a given iterate x, a step 3

4 is computed as a solution of the linearized least squares subproblem min s R n r + J s, where r = r(x ) and J = J(x ) denotes the Jacobian of r at x The subproblem has a unique solution if J has full column rank, and in that case the step is a descent direction for f When J is not of full column rank, the introduction of a regularization parameter can lead to similar properties This is the underlying idea behind the Levenberg-Marquardt [,, 4 algorithm, a globally convergent method based upon the Gauss-Newton model At each iteration, one considers a step of the form (J J + γ I) J r, corresponding to the unique solution of min s R n r + J s + γ s, () where γ 0 is an appropriately chosen regularization parameter, typically updated in the spirit of the classical trust-region radius update strategy at each iteration Several strategies were then developed to update γ Several approaches have considered scaling this parameter using the norm of the gradient of the Gauss-Newton model [5, 33 A similar choice will be adopted in this paper Algorithmic framework based on estimates In this work, we are interested in the case where r and J r cannot be directly accessed, but noisy estimates are available As a result, we will consider a variant of the Levenberg-Marquardt algorithm in which both the function and gradient values are approximated Algorithm presents a description of our method At every iteration, estimates of the values of f and its derivative at the current iterate are obtained, and serve to define a regularized Gauss-Newton model (3), where the regularization parameter is defined using a specific scaling formula: γ = µ m (x ) where µ 0 The model m is then approximately minimized, yielding a trial step s The resulting new point is accepted only if the ratio ρ between estimated decrease (f is again estimated at the new trial point) and model decrease is sufficiently high The Levenberg-Marquardt parameter µ is updated depending on the value of ρ, and also on a condition involving the model gradient Such updates have been widely used in derivative-free model-based methods based on random estimates [, 6, 3, 9 3 Probabilistic properties for the Levenberg-Marquardt method We are interested in the case where the obective function values, the gradient J r and the Jacobian J are noisy, and we only have their approximations 3 Gradient and function estimates We begin by describing our accuracy requirements for the models computed based on sampled values, of the form given in (3) Following previous work on derivative-free Levenberg-Marquardt methods [6, we propose the following accuracy definition, and motivate further its use below 4

5 Algorithm : A Levenberg-Marquardt method using random models Data: Define η (0, ), η, µ min > 0, and λ > Choose x 0 and µ 0 µ min for = 0,, do Compute an estimate f 0 of f(x ) Compute g m and J m, the gradient and the Jacobian estimate at x, set γ = µ g m, and define the model m of f around x by: s R n, m (x + s) = m (x ) + gm s + ( ) s Jm J m + γ I s (3) 3 Compute an approximate solution s of the subproblem 4 Compute an estimate f of f(x + s ), then compute min m (x + s) (4) s R n ρ = f 0 f m (x ) m (x + s ) 5 If ρ η and g m η µ, set x + = x + s and µ + = max µ λ, µ min} Otherwise, set x + = x and µ + = λ µ end Definition 3 Consider a realization of Algorithm, and the model m of f defined around the iterate x of the form (3), and let κ ef, κ eg > 0 Then, the model m is called (κ ef, κ eg )-firstorder accurate with respect to (x, µ ) if g m J r κ eg µ (5) and f(x ) m (x ) κ ef µ (6) Remark 3 The accuracy requirement for the model gradient (5) is similar to the first-order accuracy property introduced by Bergou, Gratton and Vicente [6 However, it is not exactly equivalent as we use µ instead of γ = µ g m The purpose of this new parametrization is twofold First, it allows us to measure the accuracy in formulas (5) and (6) through a parameter that is updated in an explicit fashion throughout the algorithmic run: this is a key property for performing a probabilistic analysis of optimization methods Secondly, we believe this choice to be a better reflection of the relationship between the Levenberg-Marquardt and the trust-region parameter Indeed, for a realization of the method, the Levenberg-Marquardt direction minimizing m (s) is given by ( d = J J + γ I) J r, (7) 5

6 which is also the solution of the trust-region subproblem mind r + J d st d δ = d As a result, we see that for a large value of γ, one would have: ( ) J r δ = O, (9) which suggests that γ is not exactly equivalent to the inverse of the trust-region radius, as suggested in [6, but that it rather is an equivalent to J F δ Still, this relation implies that µ can be seen as an equivalent to δ : in that sense, (5) matches the gradient assumption for fully linear models [5 Note that Definition 3 contains two requirements: in the absence of noise, (6) is trivially satisfied by setting m (x ) = f(x ) In this work, we consider that even function values cannot be accessed inexactly, thus (6) appears to be necessary In the case of noisy function values, we also expect the estimates computed by Algorithm to be sufficiently accurate with a suitable probability This is formalized in the following definitions Definition 3 Given ε f > 0, we say that two values f 0 and f are ε f -accurate estimates of f(x ) and f(x + s ), respectively, for a given µ, if f 0 f(x ) ε f µ and f f(x + s ) ε f µ (0) 3 Probabilistic accuracy of model gradients and function estimates We are further interested in the case where the models are built in some random fashion We will thus consider random models of the form M, and we use the notation m = M (ω) for its realizations Correspondingly, let random variables g M and J M denote the estimates of the gradient J r and the Jacobian J, with their realizations denoted by g m = g M (ω), and J m = J M (ω) Note that the randomness of the models implies the randomness of the iterate X, the parameters Γ, and the step S, and so x = X (ω), γ = Γ (ω), µ = (ω) and s = S (ω) will denote their respective realizations As described in the introduction, another source of randomness from our problem in that the obective function f is accessed through a randomized estimator f For a given iteration index, we define F 0 = f(x ) and F = f(x + S ) The realizations of F 0 and F (taken over the randomness of f as well as that of the iterate X ) will be denoted by f 0 and f We can now provide probabilistic equivalents of Definitions 3 and 3 Definition 33 Let p (0,, κ ef > 0 and κ eg > 0 A sequence of random models M } is said to be p-probabilistically κ ef, κ eg }-first-order accurate with respect to the sequence X, } if the events } gm U = J(X ) r(x ) κ eg 6 γ & f(x ) M (X ) κ ef (8)

7 satisfy the following condition p = P (V F M F ) p, () where F M F = σ(m 0,, M, F0 0, F 0,, F 0, F ) is the σ-algebra generated by M 0,, M and F0 0, F 0,, F 0, F Definition 34 Given constants ε f > 0, and q (0,, the sequences of random quantities F 0 and F is called (q)-probabilistically ε f -accurate, for corresponding sequences X }, Γ }, if the events V = F 0 f(x ) ε f and F f(x + S ) } ε f satisfy the following condition q = P (V F / M F ) q, () where F M F / is the σ-algebra generated by M 0,, M, F 0 0, F 0, F 0, F Here again, we point out that the parameter µ plays the role of a reciprocal of the trustregion radius In that sense, the previous definitions are consistent with the definitions of sufficient accuracy presented in the case of stochastic trust-region methods [3 4 Global convergence to first-order critical points In this section, we aim at establishing convergence of Algorithm when the function estimates and the models satisfy the probabilistic properties described in Section 3 Our analysis bears strong similarities with that of the STORM algorithm [3, but possesses significant differences induced by the use of probabilistic gradients rather than probabilistic fully linear models 4 Assumptions and deterministic results We will analyze Algorithm under the following assumptions Assumption 4 f is continuously differentiable on an open set containing the level set L(x 0 ) = x R n f(x) f(x 0 )}, with Lipschitz continuous gradient, of Lipschitz constant ν We also require that the Jacobian model is uniformly bounded Note that the result is assumed to hold for every realization of the algorithm, therefore such an assumption will be valid in both a deterministic and random context Assumption 4 There exists κ Jm > 0 such that for all and all realizations J m of the -th model Jacobian, one has: J m κ Jm Additionally, we assume that the subproblem is approximately solved so that a fraction of a Cauchy decrease is satisfied for the model 7

8 Assumption 43 There exists θ fcd > 0 such that for every iteration of every realization of the algorithm, m (x ) m (x + s ) θ fcd g m J m (3) + γ We will also assume the following bounds hold Assumption 44 At each iteration and for each realization of the algorithm, the step size satisfies s g m γ = µ, (4) and there exists θ in > 0 such that s (γ s + g m ) 4 J m g m + θ in g m γ = 4 J m + θ in µ (5) Several choices for the approximate minimization of m (x +s) that verify (3),(4) and (5) can be proposed; in particular, the result holds for steps computed via a truncated Conugate Gradient algorithm (initialized with the null vector) applied to the quadratic m (x + s) m (x )) [6, Lemma 5 Lemma 4 Let Assumptions 4, 4, and 44 hold for a realization of Algorithm Consider the -th iteration of that realization, and suppose that m is (κ ef, κ eg )-first-order accurate Then, where κ efs = κ ef +κ eg+ν+4κ Jm f(x + s ) m (x + s ) κ efs µ, (6) Proof Using Assumptions 4, 4, and 44 within a Taylor expansion of the function f around x, we obtain: f(x + s ) m (x + s ) f(x ) + f(x ) s + ν s m (x ) gm s s Jm J m s f(x ) m (x ) + ( ) ν f(x ) g m s + s + J m J m s hence the result κ ef µ + κ eg s + ν + κ J m s µ κ ef + κ eg + ν + 4κ J m Lemma 4 illustrates that our accuracy requirements are enough to guarantee accuracy of any computed step We now state various results holding for a realization of Algorithm that do not make direct use of the probabilistic nature of the method These will be instrumental in proving Theorem 4 8, µ

9 Lemma 4 Let Assumptions 4, 4, 43, and 44 hold for a realization of Algorithm, and consider its -th iteration If the model is (κ ef, κ eg )-first-order accurate and µ max κ J m, 8(κ } ef + κ efs ) η θ fcd g m, (7) then the trial step s satisfies f(x + s ) f(x ) η θ fcd 8 Proof Since the model is (κ ef, κ eg )-first-order accurate, we have: g m µ (8) f(x + s ) f(x ) = f(x + s ) m(x + s ) + m(x + s ) m (x ) + m (x ) f(x ) κ efs µ + m(x + s ) m (x ) + κ ef µ κ ef + κ efs µ = κ ef + κ efs µ η θ fcd η θ fcd g m κ J m + γ g m κ J m + µ g m, where we used the result of Lemma 4 and Assumption 43 Using the first part of (7), we have µ g m κ J m and thus f(x + ) f(x ) κ ef + κ efs µ κ ef + κ efs µ η θ fcd η θ fcd g m κ J m + µ g m g m µ g m = [ κef + κ efs η θ fcd g m µ µ 4 [ η θ fcd g m, µ 8 where the second part of the maximum in (7) was used in the last line, yielding the expected result The next result is a consequence of Lemma 4 Lemma 43 Let the assumptions of Lemma 4 hold If m is (κ ef, κ eg )-first-order accurate and ( µ κ eg + max κ J m, 8(κ }) ef + κ efs ) η θ fcd f(x ), (9) then the trial step s satisfies f(x + s ) f(x ) C f(x ) µ, (0) 9

10 where C = η θ fcd 8 max κ Jm, 8(κ } ef +κ efs ) η θ fcd κ eg+max κ Jm, 8(κ ef +κ efs ) η θ fcd } Proof Since the model is (κ ef, κ eg )-first-order accurate, we have: f(x ) f(x ) g m + g m κ eg µ + g m () Using (9) to bound the left-hand side, we obtain: } κ eg + max κ J m, 8(κ ef +κ efs ) η θ fcd κ eg + g m, µ µ } which gives µ max κ J m, 8(κ ef +κ efs ) η θ fcd g m We are thus in the assumptions of Lemma 4, and (8) holds Using again the fact that the model is (κ ef, κ eg )-first-order accurate together with (7) and (), we have: leading to f(x ) κ eg µ + g m κ eg + max κ eg κ J m, 8(κ ef +κ efs ) η θ fcd } max κ J m, 8(κ ef +κ efs ) η θ fcd g m } f(x κ eg + max κ J m, 8(κ ef +κ efs ) ) η θ fcd Combining this relation with (8) finally gives (0) } f(x ) + g m, Lemma 44 Let Assumptions 4, 4, 43 and 44 hold Consider the -th iteration of a realization of Algorithm such that x is not a critical point of f Suppose further that m is (κ ef, κ eg )-first-order accurate, (f 0, f ) is ε f -accurate, and α + α + 4ακ J m ( η ) µ max ( η ), η g m = κ µg g m, () holds, where α = ε f + κ eg + ν + 5κ J m + θ in Then, the -th iteration is successful (ie ρ η and g m η µ ) Proof To simplify the notations, we will omit the indices in the proof 0

11 ρ = = = = f 0 f m(x) m(x + s) m(x) m(x + s) f 0 + f m(x) m(x + s) g ms s (JmJ m + γi)s f 0 + f m(x) m(x + s) ( f f 0 gms s JmJ m s ) s (g m + γs) m(x) m(x + s) f f 0 gms s JmJ m s + s (g m + γs) m(x) m(x + s) We look in more detail at the first term arising in the numerator; we have: f f 0 gms s JmJ m s = f f(x + s) + f(x + s) f(x) + f(x) f 0 gms s JmJ m s = f f(x + s) + f(x) s + ( f(x + ts) f(x)) s 0 +f(x) f 0 gms s JmJ m s = f f(x + s) + [ f(x) g m s + ( f(x + ts) f(x)) s 0 +f(x) f 0 s JmJ m s f f(x + s) + [ f(x) g m s Thus, we obtain: + 0 f(x + ts) f(x) s dt + f(x) f 0 + s J mj m s ε f µ + κ eg s + ν µ s + ε f µ + J m s = ε f µ + κ eg s µ + ν + κ J m s ρ ε f + κeg s µ µ + ν+κ Jm 4 s + s (g m + γs) m(x) m(x + s) at Using Assumption 44 on the numerator and Assumption 43 on the denominator, we arrive

12 ρ ε f µ + κeg s µ + ν+κ Jm 4 s + s (g m + γs) m(x) m(x + s) ε f + κeg + ν+κ Jm + 4κ Jm +θ in µ µ µ µ m(x) m(x + s) ( εf + κ eg + ν + 5κ ) J m + θ in µ As a result, we have = θ fcd g m J m +γ ( εf + κ eg + ν + 5κ J m + θ in ) gm θ fcd g m κ Jm +γ = ( ε f + κ eg + ν + 5κ J m + θ in ) κ J m + γ γ = α κ J m + γ γ ρ η α κ J m + γ γ η 0 ( η )γ αγ α κ J m Since the right-hand side is a second-order polynomial in γ, this gives α + α + 4ακ J m ( η ) α + α + 4ακ J m ( η ) γ µ ( η ) ( η ) g m But this contradicts (), from which we conclude that we necessarily have ρ < η, and thus ρ > η Since g m η µ as a direct consequence of (), the iteration is a successful one, and the parameter µ is not increased We point out that Lemma 44 only involves the accuracy requirements on the model gradient, thanks to the accuracy of the function estimates Lemma 45 Let Assumptions 4, 4, 43, and 44 hold Consider a successful iteration of index for a realization of Algorithm, such that x is not a critical point of f Suppose further that (f 0, f ) is ε f -accurate, with } η max κ 8ε f J m, (3) η θ fcd γ Then, one has: f(x + s ) f(x ) C µ (4) where C = η η θ fcd 4 ε f > 0

13 Proof By definition of a successful iteration and using the accuracy properties of the models and the estimates, we have f(x + ) f(x ) = f(x + s ) f(x ) = f(x + s ) f + f f 0 + f 0 f(x ) ε f µ ε f µ ε f µ ε f µ + f f 0 + η (m(x + s ) m(x )) η θ fcd η θ fcd as η κ J m Since the iteration is successful, we have µ g m η, leading to f(x + ) f(x ) ε f µ ε f µ = C µ η θ fcd 4 η η θ fcd 4 g m κ J m + µ g m g m η + µ g m g m µ which proves the desired result (the positivity of C comes from (3)) µ 4 Almost-sure global convergence We now turn to the probabilistic properties to be assumed in our algorithm Assumption 45 The random model sequence M } is p-probabilistically κ ef, κ eg }-first-order accurate for some p (0,, κ ef > 0, and κ eg > 0 Assumption 46 The sequence of random function estimates (F 0, F )} is q-probabilistically ε f -accurate for some q (0, and ε f > 0 Assumption 47 The constant η is chosen such as η max κ J m, 6(κ } ef + κ efs ) 8ε f, (5) θ fcd η θ fcd In the rest of the paper, we will assume that pq (if pq =, we have for every, p = P (U F M ) = p = q = P (V F M F ) = q = and the behavior of the algorithm reduces to that of an inexact deterministic algorithm with inexact subproblem solution) We introduce the random function Φ = τf(x ) + τ, (6) 3

14 where τ (0, ) satisfies τ λ τ > max λ, λ λ, λ λ C ζ C κ ef +κ efs } (7) and ζ is a parameter such that ( ζ κ eg + max κ µg, 8(κ }) ef + κ efs ), κ J η θ m, η (8) fcd The proposition below states that the regularization parameters diverges with probability Theorem 4 Let Assumptions 4, 4, 43, 44 and 47 hold Suppose that Assumptions 45 and 46 are also satisfied, with the probabilities p and q chosen in a way specified later on Then, P < = (9) µ =0 Proof We follow the proof technique of [3, Theorem 4 (see also [0) Our goal is to show that there exists σ > 0 such that at every iteration, E [ Φ + Φ F M F σ, (30) where here the expectation is taken over the product σ-algebra generated by all models and function value estimates (note however that Since f is bounded from below (by 0), Φ 0 and > 0, (30) guarantees that the series converges almost surely (see, eg, [4, Proposition 44) We will now prove that (30) holds and give appropriate values for τ and σ Consider a realization of Algorithm, and let φ be the corresponding realization of Φ If is the index of a successful iteration, then x + = x + s, and µ + µ λ One thus has: φ + φ τ (f(x + ) f(x )) + ( τ) λ µ (3) If is the index of an unsuccessful iteration, x + = x and µ + = λµ, leading to ( ) φ + φ = ( τ) λ µ < 0 (3) For both types of iterations, we will consider four possible outcomes, involving the quality of the model and estimates In addition, we will divide the iterations in two groups, depending on the relationship between the true gradient norm and ζ µ, where ζ satisfies (8) above Case : f(x ) ζ µ 4

15 (a) Both m and (f 0, f ) are accurate Since we are in Case, f(x ) (κ eg + κ µg ) µ Because the model is (κ ef, κ eg )-first-order accurate, this implies g m f(x ) κ eg µ (ζ κ eg ) µ κ µg µ so () holds; since the estimates are also accurate, the iteration is successful by Lemma 44 Moreover, ( f(x ) ζ µ κ eg + max κ J m, 8(κ }) ef + κ efs ) η θ µ, fcd so the condition (9) is satisfied, and by Lemma 43, we can guarantee a decrease on the function value More precisely, ϕ + ϕ f(x ) τc + ( τ)(λ ) µ µ (33) By (7), we have τc f(x ) µ + ( τ)(λ ) µ τc ζ + ( τ)(λ ) µ ( ) < ( τ) λ µ, so the last right-hand side of (33) and (3) also holds (the latter will be used for the remaining cases) (b) Only m is accurate The decrease formula of Lemma 43 is still valid in that case: if the iteration is successful, then (33) holds and by (7), (3) also holds Otherwise, (3) holds (c) Only (f 0, f ) is accurate If the iteration is unsuccessful, then (3) is satisfied Otherwise, we can apply Lemma 45 and have a guarantee of decrease in the case of a successful iteration, namely, f(x + s ) f(x ) C µ, from which we obtain ϕ + ϕ [ τc + ( τ)(λ ) µ (34) We again deduce from (7) that (3) also holds in that case (d) Both m and (f 0, f ) are inaccurate We again focus on the successful iteration case, as we can use (3) in the other situation By considering a Taylor expansion of f(x + s ), 5

16 we know that the possible increase in the step is bounded above by: f(x + s ) f(x ) f(x ) s + L s f(x ) s + L s f(x ) s + L s µ ( + L ) f(x ) s ζ ( + L ) f(x ) ζ µ We thus obtain the following bound on the change in φ: where C 3 = ( + L ζ ϕ + ϕ ) f(x ) τc 3 + ( τ)(λ ) µ µ, (35) Putting the four cases together with their associated probability of occurrence, we have [ E Φ + Φ F M F, f(x ) ζ } [ [ f(x ) pq τc + ( τ)(λ ) ( ) + [p( q) + ( p)q ( τ) λ [ f(x ) +( p)( q) τc 3 + ( τ)(λ ) = [ C pq + ( p)( q)c 3 τ f(x ) [ C pq + ( p)( q)c 3 τ f(x ) where the last line uses + [pq λ (p( q) + ( p)q) + ( p)( q)( τ)(λ ) + ( τ)(λ ), pq λ (p( q) + ( p)q) + ( p)( q) (p + ( p))(q + ( q)) = Suppose p and q are chosen such that holds Then, one has by combining (36) and (7): pq / ( p)( q) C 3 C, (36) [ C pq + ( p)( q)c 3 C ( τ)(λ ) (37) τζ 6

17 On the other hand, since f(x ) ζ/µ, we have: ( τ)(λ ) [ C pq + ( p)( q)c 3 τ f(x ) This leads to [ E Φ + Φ F M F, f(x ) ζ µ } 4 C τ f(x ) 4 C τζ, which, using (37), finally gives: [ E Φ + Φ F M F, f(x ) ζ µ } ( τ) ( λ ) 4 ( τ) ( λ ) 4 (38) Case : f(x ) < ζ µ Whenever g m < η µ, the iteration is necessarily unsuccessful and (3) holds We thus assume in what follows that g m η µ, and consider again four cases (a) Both m and (f 0, f ) are accurate It is clear that (3) holds if the iteration is unsuccessful; if it is successful, then we can use the result from Lemma 45, and we have: f(x + ) f(x ) C µ from which we obtain (34) We thus deduce from (7) that (3) also holds in that case (b) Only m is accurate If the iteration is unsuccesful, it is clear that (3) holds Otherwise, using η κ J m that arises from (5) and applying the same argument as in the proof of Lemma 45, we have g m η µ κ Jm µ, thus m (x ) m (x + s ) θ fcd g m κ J m + µ g m θ fcd g m µ g m = θ fcd g m η θ fcd 4 µ 4 Since the model is (κ ef, κ eg )-first-order accurate, the function variation satisfies: f(x ) f(x + s ) = f(x ) m (x ) + m (x ) m (x + s ) + m (x + s ) f(x + s ) κ ef µ + η θ fcd 4 µ κ efs µ ( ) η θ fcd (κ ef + κ efs ) 4 µ κ ef + κ efs µ, µ where the last line comes from (5) As a result, [ φ + φ τ κ ef + κ efs + ( τ)(λ ) by (7) µ ( ( τ) ) λ µ (39) 7

18 (c) Only (f 0, f ) is accurate This case can be analyzed the same way as Case a (d) Both m and (f 0, f ) are inaccurate As in Case d, we have The change in φ thus is f(x + s ) f(x ) f(x ) s + L s ζ s + L µ s ( ζ + L ) ζ < ζc 3 µ µ φ + φ [ τc 3 ζ + ( τ)(λ ) µ (40) Combining all the subcases for Case, we can bound all of those by (3) save for Case d, which occurs with probability ( p)( q) Thus, [ E Φ + Φ F M F, f(x ) < ζ } ( ) [pq + p( q) + q( p)( τ) λ ( ( τ) ) λ + ( p)( q) [ τc 3 ζ + ( τ)(λ ) + ( p)( q) [ τc 3 ζ + ( τ)(λ ) We now assume that p and q have been chosen such that ( p)( q) ( τ) ( ) λ τc 3 ζ + ( τ)(λ ) (4) holds Using (4), we obtain [ E Φ + Φ F M F, f(x ) < ζ } ( τ) ( λ ) ( 4 ( τ) ) λ, (4) which is the same amount of decrease as in (38) Letting σ = 4 ( τ) ( λ ), we have then established that for every iteration, E [ Φ + Φ F M F σ < As a result, the statement of the theorem holds In the proof of Theorem 4, we have enforced several properties on the probability thresholds p and q: we summarize those as follows 8

19 Corollary 4 Under the assumptions of Theorem 4, its statement holds provided the probabilities p and q satisfy: pq / ( p)( q) C 3, (43) C and ( p)( q) ( τ) ( λ ) (τc 3 ζ + ( τ)(λ )) (44) Proposition 4 Let G be a submartingale, in other words, a set of random variables which are integrable (E( G ) < ) and satisfy E(G F ) G, for every, where F = σ(g 0,, G ) is the σ-algebra generated by G 0,, G and E(G F ) denotes the conditional expectation of G given the past history of events F Assume further that there exists M > 0 such that G G M <, for every Consider the random events C = lim G exists and is finite} and D = lim sup G = } Then P (C D) = This finally leads to the desired result Theorem 4 Let the assumptions of Theorem 4 and Corollary 4 hold Then, the sequence of random iterates generated by Algorithm satisfies: ( ) P lim inf f(x ) = 0 = Proof Following the lines of the proof of [3, Theorem 46, we proceed by contradiction and assume that, there exists ɛ > 0 such that P ( f(x ) ɛ } ) > 0 We then consider a realization of Algorithm for which f(x ) ɛ Since lim µ =, there exists 0 such that for every 0, we have: } µ > b = κ µg max ɛ, 6(κ ef + κ efs ) η θ fcd ɛ, κ J m ɛ, η ɛ, λ µ min (45) Let R be a random variable with realizations r = log λ ( b µ ): then for the realization we are considering, we have r < 0 for 0 Our obective is to show that such a realization has a a zero probability of occurrence Consider 0 such that both events S and V happen: the probability of such an event is at least pq Because the model is accurate and we have (45): g m f(x ) κ eg µ ɛ ɛ = ɛ We are thus in the assumptions of Lemmas 43 and 44, from which we conclude that the - th iteration is successful, so the parameter µ is decreased, ie, µ + = µ λ Consequently, r + = r + 9

20 For any other outcome for U and V other than both happen (which occur with probability at most pq), we have µ + λµ As a result, letting F V T = σ(v 0,, V k ) σ(t 0,, T k ) = σ(r 0,, R ), E [ r + F V T pq(r + ) + ( pq)(r ) r, because pq > / as a consequence of the assumptions from Corollary 4 This implies that R is a submartingale We now define another submartingale W by W = ( Ui Vi ), i=0 where A is the indicator random variable of the event A Note that W is defined on the same probability space as R, and that we have: E [ W F V T = E [ W F V T [ + E U V F V T = W + P ( U V F V T ) W, where the last inequality holds because pq / Therefore, W is a submartingale with bounded (±) increments By Proposition 4, it does not have a finite limit and the event lim sup W = } has probability To conclude, observe that by construction of R and W, one has r r 0 w w 0, where w is a realization of W This means that R must be positive infinitely often with probability one, thus that there is a zero probability of having r < 0 0 This contradicts our initial assumption that P( f(x ) ɛ ) > 0, which means that we must have ( ) P lim inf ) = 0 = 5 Complexity analysis In this section, we will analyze the convergence rate of our algorithm using stochastic processes The proposed expected convergence rate methodology is inspired by the complexity analysis developed by Blanchet et al [7 However, it presents a number of variations that lead to a difference in the components of the final complexity bound (see Theorem 5) The derivation of our complexity result is thoroughly detailed in order to clarify the original features of our reasoning Given a stochastic process X }, T is said to a be a stopping time for X }, if, for all, the event T } belongs to the σ-algebra by X, X, X For a given ɛ > 0, define a random time T ɛ T ɛ = inf 0, f(x ) ɛ}, 0

21 let also ɛ = ζ ɛ, where ζ ( κ eg + max κ µg, 8(κ }) ef + κ efs ), κ J η θ m, η (46) fcd Based on Theorem 4, one deduces that T ɛ is a stopping time for the stochastic process defined by Algorithm and hence for Φ, } where Φ is given by (6) Assumption 5 There exists a positive constant Φ max > 0 such that Φ Φ max For simplicity reasons, we will assume that µ 0 = ɛ λ and µ s min = ɛ λ for some integers s, t > 0, t hence for all, one has = ɛ for some integer k We note that, in this case, whenever λ k < ɛ, one has ɛ λ, and hence + ɛ This assumption can be made without loss of generality, for instance, provided µ min = µ 0 λ s t (one can choose µ min so that this is true) and ζ = µ 0 λ s ɛ, where s is the smallest integer such that ζ satisfies (46) We first depart from the analysis of [7 in the next lemma It defines a geometric random walk based on successful iterations The final complexity result heavily depends upon the behavior of this random walk Lemma 5 Let Assumptions 4, 4, 43, 44 and 47 hold For all < T ɛ, whenever ɛ, one has or, equivalently, letting γ = log(λ), one has + = λ Ω + λ ( Ω ), + = e γλ, (47) where Ω is equal to if the iteration is successful and 0 otherwise and Λ = Ω defines a birth-and-death process, ie, P(Λ = F M F, ɛ ) = P(Λ = F M F, ɛ ) = ω, with ω pq Proof By the mechanism of the algorithm one has + = λ Ω + λ ( Ω ) Moreover, if µ ɛ for a given < T ɛ, one has f(x ) ɛ and hence from the definition of ɛ one gets f(x ) ζ µ Assuming U = and V = (ie both m and (f 0, f ) are accurate) Since the model is (κ ef, κ eg )-first-order accurate, this implies g m f(x ) κ eg µ (ζ κ eg ) µ κ µg µ ; since the estimates are also accurate, the iteration is successful by Lemma 44 Hence, one gets ω = P(Λ = F M F, ɛ ) pq Lemma 5 is analogous to [7, Lemma 35, however, in our case, the birth-and-process Λ } is based on successful iterations, whereas [7 considered the iterations where both the function estimates and the model were accurate The next result exactly follows Case in the proof of Theorem 4, therefore its proof is omitted

22 Lemma 5 Let Assumptions 4, 4, 43, 44 and 47 hold Suppose that Assumptions 45 and 46 are also satisfied, with the probabilities p and q satisfy: and ( p)( q) pq / ( p)( q) C 3 C, (48) ( τ) ( λ ) (τc 3 ζ + ( τ)(λ )) (49) Then, there exists a constant σ > 0 such that, conditioned on T ɛ >, one has or, equivalently, E [ Φ + Φ F M F σ <, (50) E [ Φ + F M F In this case, σ = 4 ( τ) ( λ ) We define the renewal process A i } as follows: < Φ σ, (5) A 0 = 0 and A i = min k > A i : k ɛ } A represents the number of iterations for which has a value smaller than ɛ Let also, for all, τ = A A The next result provides a bound on the expected value of τ Lemma 53 Let Assumptions 4, 4, 43, 44 and 47 hold Assuming that pq >, one has for all Proof One has E [τ pq pq (5) E [τ = E [ τ A < ɛ P( A < ɛ ) + E [ τ A = ɛ P( A = ɛ ) maxe [ τ A < ɛ, E [ τ A = ɛ } (53) First we note that whenever < ɛ, one has ɛ λ, and hence + ɛ Thus, if A < ɛ, one deduces that A = A + and then E [ τ A < ɛ = (54) Assuming now that A > A + (if not, meaning that A = A +, the proof is straightforward ), then conditioned on A = ɛ, one has A = ɛ as well We note also that for all k [A, A, one has k ɛ Hence, using Lemma 5, one has k + = k e γλ k,

23 where γ = log(λ) and P(Λ k = Fk M F, k ɛ ) = ω k and P(Λ k = Fk M F, k ɛ ) = ω k Moreover, one has ω k pq The process A, A +,, A } then defines a geometric random walk between two returns to the same state (ie, ɛ ) and τ represents the number of iterations until a return to the initial state For such a geometric random walk, one can define the state probability vector π = (π k ) k corresponding to the limiting stationary distribution [5 Using the local balance equation between the two states k and k +, see [5, Theorem 3, one has Since ω k pq, one deduces that Hence, π k κ k π 0 ( ω k )π k = ω k π k+ ( pq)π k pqπ k+ where κ = pq pq Using the assumption κ < (ie pq > ) and the definition of the state probability k π k =, one has π 0 κ (this is a classical result for geometric random walk, see for instance [5, Example 6) Applying the properties of ergodic Markov chains, one deduces that the expected number of iterations until a return to the initial state (the state 0) is given by π 0 Hence E [ τ A = ɛ = π 0 κ = pq pq (55) By substituting (54) and (55) into (53), one deduces E [τ completed pq pq and hence the proof is We now introduce a counting process N() given by the number of renewals that occur before time : N() = max i : A i } We also consider the sequence of random variables defined by Y 0 = Φ 0 and Y = Φ min(,tɛ) + σ min(,t ɛ) k=0 ( k ) for all The definition of Y } is our second and main difference with the analysis of [7, which leads to a different form for the bound on E [N(T ɛ ) provided in the lemma below, compared to [7, Lemma Lemma 54 Under the assumptions of Lemma 5 Let Assumption 5 hold One has, E [N(T ɛ ) Φ 0 σ ɛ 3

24 Proof Note that Y defines a supermartingale with respect to F M F Indeed, if < T ɛ, then using Lemma 5 one has, E [ Y + F M F = E [ [ Φ + F M F + E σ ( k=0 ( ) ( Φ σ + σ k=0 ( ) = Φ + σ = Y k=0 [ If T ɛ, one has Y + = Y and thus E Y + F M F = Y k k k ) ) F M F Using Assumption 5, one has for all T ɛ, Y = Y Tɛ Φ max + (Tɛ+)σ Hence, since T µ ɛ min is bounded, Y is also bounded Using Theorem 4, one knows that T ɛ is a stopping time and hence by means of the optional stopping Theorem ([6, Theorem 64) for supermartingale, one concludes that E [Y Tɛ E [Y 0 Hence, σe [ Tɛ ( ) E [Y Tɛ E [Y 0 = Φ 0 (56) k=0 k By the definition of the counting process N(T ɛ ), since the renewal times A i (which satisfy Ai ɛ ) are a subset of the iterations 0,,, T ɛ, one has T ɛ k=0 ( k Inserting the latter inequality in (56), one gets Which concludes the proof ) ( ) N(T ɛ ) ɛ E [N(T ɛ ) Φ 0 σ ɛ Using Wald s equation [6, Corollary 63, we can finally obtain a bound on the expected value of T ɛ Theorem 5 Under the assumptions of Lemma 5 Let Assumption 5 hold One has E [T ɛ pq ( ) Φ0 pq σ ɛ +, or, equivalently, E [T ɛ pq ( κs ɛ + ) pq 4

25 where κ s = τf(x 0)+( τ)µ 0 ) ζ, τ (0, ) satisfies λ 4 ( τ) ( τ λ τ > max λ, λ λ, λ λ C ζ C κ ef +κ efs and ζ is a parameter such that ( ζ κ eg + max κ µg, 8(κ }) ef + κ efs ), κ J η θ m, η fcd Proof First note that the renewal process A N(Tɛ)+ = N(T ɛ) i=0 τ i where τ i defines independent inter-arrival times Moreover, since the probabilities p and q satisfy (48), one has pq > / and hence, by applying Lemma 53, for all i =,, N(T ɛ ) one has E [τ i by Wald s equation [6, Corollary 63, one gets, E [ A N(Tɛ)+ = E [τ E [N(T ɛ ) + pq pq E [N(T ɛ) + } pq pq By the definition of N(T ɛ ) one has A N(Tɛ)+ > T ɛ, hence using Lemma 54 one gets E [T ɛ E [ ( ) pq Φ0 A N(Tɛ)+ pq σ ɛ + < + Thus, As for the previous lemma, we observe that the complexity bound of Theorem 5 has a different form than that of [7 Both are of order of ɛ, but our result does not include a term in ɛ 6 Application to data assimilation Data assimilation is the process by which observations of a physical system are incorporated into a model together with prior knowledge, so as to produce an estimate of the state of this system More precisely, the methodology consists in computing z 0,, z T, where z i is the realization of the stochastic state Z i at time i, from (a) an initial state Z 0 z b + N(0, B ), with z b being the prior knowledge at time 0 of the process Z, (b) the observations y i which satisfy y i H i (Z i ) + N(0, R i ), i = 0,, T, and (c) the numerical physical system model Z i M i (Z i ) + N(0, Q i ), i =,, T We note that the model operator M i at time i as well as the observation operator H i are not necessary linear The random vectors Z 0 z b, y i H i (Z i ) and Z i M i (Z i ) define the noises on the prior, on the observation at time i, and on the model at time i, with covariance matrices B, R i, and Q i, respectively The 4DVAR formulation is one of the most popular data assimilation methods It assumes that the errors (the prior, the observation, and the model errors) are independent from each other and uncorrelated in time It also assumes that the posterior probability function of Z (in other words, the probability density function of Z 0,, Z T knowing y 0,, y T is proportional to ( ( )) exp T T z 0 z b (B ) + z i M i (z i ) + y Q i H i (z i ) i R i i= 5 i=0

26 The 4DVAR method maximizes the previous function over z 0,, z T, which is equivalent to minimizing, ( z 0 z b (B ) + T i= x i M i (z i ) Q i + T i=0 y i H i (z i ) R i ) (57) The latter optimization problem is known to the data assimilation community as the weak constraint 4DVAR formulation [8 One of the most significant challenges with this formulation is the practical estimation of the covariance matrices Q i, i =,, T [8, 8 In many applications it is assumed that the physical model is perfect, ie, i, Q i = 0 This scenario, known as the strong constraint 4DVAR formulation [6 is equivalent to solving the following minimization problem, ( min z 0,,z T R n z 0 z b (B ) + ) T i=0 y i H i (z i ), R i (58) s t z i = M i (z i ), i =,, T For the sake of simplicity, we will focus on problem (58) in the rest of the section By defining R H 0 (z) y = [y 0 ; ; y T, R = 0 R 0, and H(z) H o M (z) =, 0 0 R T H T o M T o M T o o M (z) we can re-write problem (58) as ) min ( z 0 z b z 0 R n (B ) + y H(z 0 ) R (59) The problem thus reduces to the determination of z 0, as z,, z T can be computed afterwards using z i = M i (z i ), i =,, T In order to link the notation to the generic optimization problem defined earlier in this paper, we will now denote the vector z 0 in (59) by x In many data assimilation problems, like those appearing in weather forecasting, the covariance matrix B is only known approximately Instead, one has access to an ensemble of N elements ẑ k} N, assumed to be sampled from the Gaussian distribution with the empirical k= mean z b and the unknown covariance matrix B The matrix B is approximated by the empirical covariance matrix of the ensemble: B N = N N k= (ẑ k z b ) ( ẑ k z b ) (60) The matrix B N follows the Wishart distribution [30; thus, if N n +, B N is nonsingular with probability one (the matrix (B N ) follows the inverse Wishart distribution) In this case, E [ B N = B and E [ (B N ) = N N n (B ) (6) We will assume that N is large enough relative to n, so that the empirical covariance matrix B N can be assumed to be non-singular and, furthermore, E [ (B N ) approximates (B ) 6

27 sufficiently well Since E [ B N (or equivalently B ) is usually not known for many problems, in practice, one considers the following problem of minimization in lieu of (59): min x R n x z b (B N ) + y H(x) R, (6) This optimization problem can be seen as a noisy approximation of (59), with B N instead of B To find the solution of the problem (6), a common approach used in the data assimilation community is to proceed iteratively by linearization At a given iteration, one computes s as an approximate solution of the auxiliary linear least squares subproblem defined as min s R n s + x z b (B N ) + y H(x ) H s R, (63) and sets x + = x + s, where H = H (x ) Such an iterative process is known in the data assimilation community as the incremental approach [6 This method is simply the Gauss- Newton method [3 applied to (6) To solve the subproblem (63), we propose to use the ensemble Kalman filter (EnKF) as a linear least squares solver The EnKF [7 consists of applying Monte Carlo techniques to approximately solve the subproblem (63) Recall that we have a ensemble of N elements ẑ k, for k =,, N, which are assumed to be sampled from the Gaussian distribution with the mean z b and the unknown covariance matrix B Thus, the empirical covariance matrix of the ensemble B N, which approximates the matrix B, is given by (60) EnKF generates a new ensemble s k,a as follows ( ( ) s k,a = ẑ k x + K y H(x ) H ẑ k x ˆv k), ( where ˆv k is sampled from N(0, R), and K = B N H H B N H + R) In practice, the matrices B N and K are never computed or stored explicitly The reader is referred to [7 and the references therein for more details on the computation The subproblem (63) solution is then approximated by s a = z b x + K (y H(x ) H (ẑ b x ) ˆv), where ˆv is the empirical mean of the ensemble ˆv k } One can show easily that s a is the minimizer of min s R n s + x z b (B N ) + y H(x ) ˆv H s R (64) Both the incremental method (ie, the Gauss-Newton method) and the method which approximates the solution of the linearized subproblem using EnKF may diverge A regularization approach like that of Algorithm controls the norm of the step so as to guarantee convergence We thus consider min s R n s + x z b (B N ) + y H(x ) ˆv H s R + γ s, for use as a subproblem in Algorithm In order for the algorithm to be globally convergent, one then has to ensure that the regarded data assimilation problem provides estimates for the 7

28 obective function and the gradient that are sufficiently accurate to a suitable high probability By analogy with the previous sections of the paper, we set f(x) f(x + s) m (0) m (s) = x z b (B ) + y H(x) R, (65) = x + s z b (B ) + y H(x + s) R, (66) = x z b (B N ) + y H(x ) ˆv R, (67) = s + x z b (B N ) + y H(x ) ˆv H s R + γ s (68) Furthermore, natural estimates f 0 and f to f(x) and f(x + s) respectively, can be given by f 0 (x ) f (x + s) = x z b (B N ) + y H(x ) R, = x + s z b (B N ) + y H(x + s) R The exact gradient of the non-noisy function (65) is then given by f(x ) = (B ) (x z b ) + H R (H(x ) y), and the gradient of the stochastic model (67) is m (x ) = (B N ) (x z b ) + H R (H(x ) y ˆv) To derive simple bounds for the errors and for simplicity, we make the assumption that ˆv = 0 In practice this assumption can be easily satisfied by centering the ensemble ˆv k } In fact, one generates ˆv k } then consider the ensemble defined by ṽ k = ˆv k ˆv instead of ˆv k } Note that the empirical mean of ṽ k } is then ˆṽ = 0 In the next lemma, we recall the Chebyshev s inequality which will be useful in the sequel of this section Lemma 6 (Chebyshev s inequality) Let X be an n dimensional random vector with expected value µ and covariance matrix V, then for any real number t > 0 P X µ V > t} n t In particular, if X is scalar valued, then one has P X µ > t} V t The next theorem gives estimates of the required bounds on the errors appearing in Assumptions 45 and 46 Note that, at a given iteration, conditioned on the σ-algebra associated up to the current iterate F / M F, the remaining randomness only comes from the matrix BN We will consider a run of the algorithm under stopping criterion of the form µ > µ max Theorem 6 Let denote the current iterate index Assume that the ensemble size N is large enough compared to n, ie, N > (ℵ minλ µ 0, µ max } + )n +, where X z b (B ℵ = max ), X + S z b (B ), X } z b (B ), (B ) (X z b ) ε f ε f κ eg minλ µ 0, µ max } 8 κ eg

Convergence of trust-region methods based on probabilistic models

Convergence of trust-region methods based on probabilistic models A. S. Bandeira K. Scheinberg L. N. Vicente October 24, 2013 Abstract In this paper we consider the use of probabilistic or random models