Evaluation complexity of adaptive cubic regularization methods for convex unconstrained optimization


Coralia Cartis, Nicholas I. M. Gould and Philippe L. Toint

October 30, 2010; revised March 30, 2011

Abstract

The adaptive cubic regularization algorithms described in Cartis, Gould & Toint (2009, 2010) for unconstrained (nonconvex) optimization are shown to have improved worst-case efficiency in terms of the function- and gradient-evaluation count when applied to convex and strongly convex objectives. In particular, our complexity upper bounds match in order (as a function of the accuracy of approximation), and sometimes even improve, those obtained by Nesterov (2004, 2008) and Nesterov & Polyak (2006) for these same problem classes, without requiring exact Hessians or exact or global solution of the subproblem. An additional outcome of our approximate approach is that our complexity results can naturally capture the advantages of both first- and second-order methods.

1 Introduction

State-of-the-art methods for unconstrained smooth optimization typically depend on trust-region [6] or line-search [7] techniques to globalise Newton-like iterations. Of late, a third alternative, in which a local cubic over-estimator of the objective is used as the basis of a regularization strategy for the step computation, has been proposed [9, 2, 3]; see [2, 1] for a detailed description of these contributions. Such ideas have been refined so that they are now well suited to large-scale computation for a wide class of nonlinear nonconvex objectives; rigorous convergence and complexity analyses under weak assumptions, together with promising numerical experience with these techniques, are available [2, 3]. Our objective in this paper is to show that the complexity bounds for this type of algorithm improve significantly in the presence of convexity or strong convexity.
Specifically, at each iteration of what we call an ARC (Adaptive Regularization with Cubics) framework, a possibly nonconvex model

   m_k(s) = f(x_k) + s^T g_k + (1/2) s^T B_k s + (1/3) σ_k ‖s‖³,   (1.1)

is employed as an approximation to the smooth objective f(x_k + s) we wish to minimize. Here σ_k > 0 is a regularization weight, we have written ∇f(x_k) = g(x_k) = g_k, and here and hereafter we choose the Euclidean norm ‖·‖ = ‖·‖_2. To compute the change s_k to x_k, the model m_k is globally minimized, either exactly or approximately, with respect to s ∈ IR^n. Note that if B_k is taken to be the Hessian H(x) of f, and the latter is globally Lipschitz continuous with Lipschitz constant 2σ_k, we have the overestimation property f(x_k + s) ≤ m_k(s) for all s ∈ IR^n [2, 1]. Thus in this case, minimizing m_k with respect to s forces a decrease in f from the value f(x_k), since f(x_k) = m_k(0). In the general ARC algorithmic framework,

[Footnotes: School of Mathematics, University of Edinburgh, The King's Buildings, Edinburgh, EH9 3JZ, Scotland, UK (coralia.cartis@ed.ac.uk). All three authors are grateful to the Royal Society for its support through the International Joint Project. Computational Science and Engineering Department, Rutherford Appleton Laboratory, Chilton, Oxfordshire, OX11 0QX, England, UK (nick.gould@stfc.ac.uk); this work was supported by the EPSRC grant EP/E053351/1. Department of Mathematics, FUNDP - University of Namur, 61, rue de Bruxelles, B-5000, Namur, Belgium (philippe.toint@fundp.ac.be).]

H need not be Lipschitz, nor need B_k be H(x_k), but in this case σ_k must be adjusted as the computation proceeds to ensure convergence [2, 3, §2.1]. The generic ARC framework [2, 3, §2.1] may be summarised as follows:

Algorithm 1.1: Adaptive Regularization using Cubics (ARC) [2, 3].

Given x_0, γ_2 ≥ γ_1 > 1, 1 > η_2 ≥ η_1 > 0, and σ_0 > 0, for k = 0, 1, ... until convergence,

1. Compute a step s_k for which

   m_k(s_k) ≤ m_k(s_k^C),   (1.2)

   where the Cauchy point

   s_k^C = −α_k^C g_k and α_k^C = arg min_{α ∈ IR_+} m_k(−α g_k).   (1.3)

2. Compute f(x_k + s_k) and

   ρ_k = (f(x_k) − f(x_k + s_k)) / (f(x_k) − m_k(s_k)).   (1.4)

3. Set

   x_{k+1} = x_k + s_k if ρ_k ≥ η_1, and x_{k+1} = x_k otherwise.   (1.5)

4. Set

   σ_{k+1} ∈ (0, σ_k] if ρ_k > η_2 [very successful iteration],
   σ_{k+1} ∈ [σ_k, γ_1 σ_k] if η_1 ≤ ρ_k ≤ η_2 [successful iteration],
   σ_{k+1} ∈ [γ_1 σ_k, γ_2 σ_k] otherwise [unsuccessful iteration].   (1.6)

For a detailed description of the algorithm construction, including a justification that (1.2)–(1.4) are well-defined until termination, see [2]. The above ARC algorithm is a very general first-order framework that, due to the Cauchy condition (1.2), ensures at least a steepest-descent-like decrease in each (successful) iteration. This is sufficient to ensure global convergence of ARC to first-order critical points [2, §2.1] and, with steepest-descent-like function-evaluation complexity bounds of order ǫ^{-2} [3, §3], to guarantee

   ‖g_k‖ ≤ ǫ.   (1.7)

These results require that g(x) is uniformly continuous and Lipschitz continuous (respectively) and that {B_k} is uniformly bounded above. Clearly, the Cauchy point s_k^C achieves (1.2) in a computationally inexpensive way (see [2, §2.1]); the choice of interest, however, is when s_k is an (approximate global) minimizer of m_k(s) and B_k a nontrivial approximation to the Hessian H(x_k) (see §3). Although m_k might be nonconvex, its global minimizer over IR^n is always well-defined and can be characterized in a computationally-viable way [2, Thm. 3.1], [9, 2].
This characterization is best suited for exact computation when B_k is sparse or of modest size. For large problems, a suitable alternative is to improve upon the Cauchy point by globally minimizing m_k over (nested and increasing) subspaces that include g_k, which ensures that (1.2) remains satisfied, until a suitable termination condition is achieved. (For instance, in our ARC implementation [2], the successive subspaces over which the model is minimized are generated using the Lanczos method.) These ARC variants are summarized in Algorithm 1.2, where h_k(‖s_k‖, ‖g_k‖) is some generic function of ‖s_k‖ and ‖g_k‖, with specific examples of suitable choices given in (1.10) and (1.11) below.
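The generic framework of Algorithm 1.1 can be sketched in code. The following minimal Python illustration is not the authors' software: it uses exact Hessians for B_k, computes the Cauchy point by a crude one-dimensional grid search (so the Cauchy condition (1.2) holds by construction), and uses illustrative values for the parameters and the σ-updates; all names are hypothetical.

```python
import numpy as np

def arc_minimize(f, grad, hess, x0, sigma0=1.0, eta1=0.1, eta2=0.9,
                 gamma1=2.0, tol=1e-8, max_iter=200):
    """Sketch of the generic ARC framework (Algorithm 1.1), with the step
    taken as the Cauchy point s_k^C = -alpha_k^C g_k of (1.3), found by a
    crude grid search on alpha; parameter values are illustrative only."""
    x, sigma = np.asarray(x0, dtype=float), sigma0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        # m_k(-alpha g) - f(x_k) = -alpha ||g||^2 + (alpha^2/2) g^T B g
        #                          + (sigma_k/3) alpha^3 ||g||^3
        gg, gBg, g3 = g @ g, g @ B @ g, np.linalg.norm(g) ** 3
        alphas = np.linspace(1e-4, 10.0, 100000)
        mvals = -alphas * gg + 0.5 * alphas**2 * gBg + sigma / 3.0 * alphas**3 * g3
        s = -alphas[np.argmin(mvals)] * g          # Cauchy step, so (1.2) holds
        rho = (f(x) - f(x + s)) / (-mvals.min())   # ratio (1.4)
        if rho >= eta1:                            # accept the step, cf. (1.5)
            x = x + s
        if rho > eta2:                             # very successful: shrink sigma
            sigma = max(1e-12, sigma / gamma1)
        elif rho < eta1:                           # unsuccessful: inflate sigma
            sigma = gamma1 * sigma
        # successful (eta1 <= rho <= eta2): keep sigma, cf. (1.6)
    return x
```

On a convex quadratic with B_k exact, every iteration is very successful, σ_k shrinks, and the scheme behaves like exactly line-searched steepest descent, consistent with the Cauchy-point analysis of Section 2.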

Algorithm 1.2: ARC(h) [2, 3].

In each iteration k of Algorithm 1.1, compute s_k in Step 1 as the global minimizer of

   min_{s ∈ IR^n} m_k(s) subject to s ∈ L_k,   (1.8)

where L_k is a subspace of IR^n containing g_k, and such that the termination condition

   TC.h   ‖∇_s m_k(s_k)‖ ≤ θ_k ‖g_k‖, where θ_k = κ_θ min(1, h_k) and h_k = h_k(‖s_k‖, ‖g_k‖) > 0,   (1.9)

is satisfied, for some constant κ_θ ∈ (0, 1) chosen at the start of the algorithm.

Clearly, TC.h is satisfied when s_k is the global minimizer of m_k over the whole space, but one hopes that termination of the subspace minimization will occur well before this inevitable outcome, at least in the early stages of the iteration. Note that, in fact, TC.h only requires an approximate critical point of the model, and as such the global subspace minimization in (1.8) may only need to hold along the one-dimensional subspace determined by s_k [2, (3.11), (3.12)], provided (1.2) holds. For ARC(h) to be a proper second-order method, a careful choice of h_k needs to be made, such as h_k = ‖s_k‖ or h_k = ‖g_k‖², yielding the termination criteria

   TC.s   ‖∇_s m_k(s_k)‖ ≤ θ_k ‖g_k‖, where θ_k = κ_θ min(1, ‖s_k‖),   (1.10)

and

   TC.g2  ‖∇_s m_k(s_k)‖ ≤ θ_k ‖g_k‖, where θ_k = κ_θ min(1, ‖g_k‖²).   (1.11)

Forthwith, we refer to ARC(h) with TC.s and with TC.g2 as ARC(S) and ARC(g2), respectively. The benefit of requiring the more stringent conditions (1.8), and (1.10) or (1.11), in the above ARC variants is that ARC(S) and ARC(g2) are also guaranteed to converge locally Q-quadratically and globally to second-order critical points [2, §4.2, §5], and to have improved function-evaluation complexity of order ǫ^{-3/2} to ensure (1.7) [3, §5], provided H(x) is globally Lipschitz continuous along the path of the iterates and there is sufficiently good agreement between H(x_k) and its approximation B_k.
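The termination tests (1.9)–(1.11) are cheap to evaluate once the model gradient ∇_s m_k(s_k) is available. A minimal sketch (the function name and the value of κ_θ are illustrative, not from the paper):

```python
import numpy as np

def tc_satisfied(model_grad, g, s, kappa_theta=0.5, rule="s"):
    """Check TC.s (1.10) or TC.g2 (1.11): ||grad m_k(s_k)|| <= theta_k ||g_k||.

    model_grad is nabla_s m_k(s_k); kappa_theta in (0, 1) is illustrative."""
    gnorm = np.linalg.norm(g)
    if rule == "s":
        theta = kappa_theta * min(1.0, np.linalg.norm(s))   # TC.s
    else:
        theta = kappa_theta * min(1.0, gnorm ** 2)          # TC.g2
    return np.linalg.norm(model_grad) <= theta * gnorm
```

As ‖g_k‖ → 0 the TC.g2 tolerance θ_k‖g_k‖ shrinks like ‖g_k‖³ rather than like ‖g_k‖‖s_k‖, so the subproblem is asymptotically solved to higher accuracy under TC.g2 than under TC.s.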
In this paper, we investigate the worst-case function-evaluation complexity of the basic ARC framework and of its second-order variants ARC(S) and/or ARC(g2) when applied to the minimization of special classes of objectives, namely convex and strongly convex ones. In particular, we show that, as expected, these algorithms satisfy improved bounds compared to the nonconvex case. Specifically, generic ARC (Algorithm 1.1) takes at most O(ǫ^{-1}) and O(log ǫ^{-1}) function evaluations to reach the neighbourhood

   f(x_k) − f_* ≤ ǫ   (1.12)

of the (global) minimum f_* of convex and strongly convex objectives, respectively, with Lipschitz continuous gradients, where the dependence of these bounds on problem conditioning is carefully considered (see page 9). Unsurprisingly, due to the simple Cauchy decrease condition (1.2) required on the step, these bounds match in order those for standard steepest-descent methods on the same classes of objectives [10]. When applied to convex objectives with bounded level sets and globally Lipschitz continuous Hessian, ARC(g2) with B_k = H(x_k) will reach approximate optimality in the sense of (1.12) in at most O(ǫ^{-1/2}) function evaluations; this matches in order the bound obtained in [11, 12] for cubic regularization on the same problem class when the exact subproblem solution is computed in each iteration. Note that, asymptotically, in ARC(g2) the subproblem is solved to higher accuracy than in ARC(S), which seems to be crucial when deriving the improved bound compared to the first-order basic ARC. We also present an illustration on a common convex objective which indicates that, despite being worst-case, the bounds presented here may be tight.

If the objective is strongly convex, then ARC(S) and ARC(g2) (with approximate Hessians as B_k) require at most O(log κ + log log ǫ^{-1}) function evaluations to satisfy (1.12), where κ is a problem-dependent constant and where the double-logarithm term expresses the local Q-quadratic rate of convergence of these variants. The strongly-convex-case bound improves that obtained in [11, 12] for cubic regularization with exact subproblem solution in that the former has a logarithmic dependence on κ while the latter only includes a polynomial dependence on problem condition numbers. Our result is a direct consequence of using increasing accuracy in the subproblem solution, with first-order-like behaviour, and hence complexity, early on, and second-order characteristics asymptotically. Note that the assumption labelling used throughout the paper was chosen to maintain consistency with the notation introduced in [2, 3].

The structure of the paper is as follows. Section 2 analyzes the complexity of basic ARC, and Section 3 that of the second-order variants ARC(S) and ARC(g2), in the convex and strongly convex cases. Section 3.3 presents a convex example of inefficient ARC behaviour with O(ǫ^{-1/2}) complexity, and Section 4 draws some conclusions and open questions.

2 The complexity of the basic ARC framework

This section addresses the basic ARC algorithm, Algorithm 1.1. We assume that

AF.1   f ∈ C^1(IR^n),   (2.1)

and that the gradient g is Lipschitz continuous on an open convex set X containing all the iterates {x_k},

AF.4   ‖g(x) − g(y)‖ ≤ κ_H ‖x − y‖, for all x, y ∈ X, and some κ_H ≥ 1.   (2.2)

If f ∈ C^2(IR^n), then AF.4 is satisfied if the Hessian H(x) is bounded above on X. Note, however, that for now we only assume AF.1. In particular, no Lipschitz continuity of H(x) will be required in this section. The model m_k is assumed to achieve

AM.1   ‖B_k‖ ≤ κ_B, for all k ≥ 0, and some κ_B ≥ 1.
   (2.3)

In the case when f ∈ C^2(IR^n) and B_k = H(x_k) for all k, AF.4 implies AM.1 with κ_B = κ_H. Naturally, we assume f is bounded below, letting f_* > −∞ be the (global) minimum of f and

   Δ_k = f(x_k) − f_*, for all k ≥ 0.   (2.4)

2.1 Relating successful and total iteration counts

Note that the total number of ARC iterations is the same as the number of function evaluations (as we also need to evaluate f on unsuccessful iterations in order to be able to compute ρ_k in (1.4)), while the number of successful ARC iterations is the same as that of gradient evaluations. Let us introduce some useful notation. Throughout, denote the index set

   S = {k ≥ 0 : k successful or very successful in the sense of (1.6)},   (2.5)

and, given any j ≥ 0, let

   S_j = {k ≤ j : k ∈ S},   (2.6)

with |S_j| denoting the cardinality of the latter. Concerning σ_k, we may require that on each very successful iteration k ∈ S_j, σ_{k+1} is chosen such that

   σ_{k+1} ≥ γ_3 σ_k, for some γ_3 ∈ (0, 1].   (2.7)

Note that (2.7) allows {σ_k} to converge to zero on very successful iterations (but no faster than {γ_3^k}). A stronger condition on σ_k is

   σ_k ≥ σ_min, k ≥ 0,   (2.8)

for some σ_min > 0. These conditions on σ_k and the construction of ARC's Steps 2–4 allow us to quantify the total iteration count as a function of the successful ones.

Theorem 2.1. For any fixed j ≥ 0, let S_j be defined in (2.6). Assume that (2.7) holds and let σ̄ > 0 be such that

   σ_k ≤ σ̄, for all k ≤ j.   (2.9)

Then

   j ≤ (1 − log γ_3/log γ_1) |S_j| + (1/log γ_1) log(σ̄/σ_0).   (2.10)

In particular, if σ_k satisfies (2.8), then it also achieves (2.7) with γ_3 = σ_min/σ̄, and we have that

   j + 1 ≤ (1 + (2/log γ_1) log(σ̄/σ_min)) |S_j|.   (2.11)

Proof. Apply [3, Theorem 2.1] and the fact that the unsuccessful iterations up to j, together with S_j, form a partition of {0, ..., j}. □

Values for σ̄ in (2.9) are provided in (2.16) below and, under stronger assumptions, in (3.6). (Note that, due to Lemmas 2.4 and 2.6, the condition required for (2.16) is achieved by the gradient of convex and strongly convex functions, with appropriate values of ǫ, whenever Δ_k > ǫ.) Thus, based on the above theorem, we are left with bounding the successful iteration count |S_j| up to an iteration j that is within ǫ of the optimum, which we focus on for the remainder of the paper, and which has the outcome that the total iteration count up to j is of the same order in ǫ as |S_j|.

2.2 Some useful properties

The next lemma summarizes some useful properties of the basic ARC iteration.

Lemma 2.2. Suppose that the step s_k satisfies (1.2).

i) [2, Lemma 2.1] Let AM.1 hold. Then for k ≥ 0, we have that

   f(x_k) − m_k(s_k) ≥ (‖g_k‖/(6√2)) min( ‖g_k‖/κ_B, (1/2)√(‖g_k‖/σ_k) ),   (2.12)

and so Δ_k in (2.4) is monotonically decreasing,

   Δ_{k+1} ≤ Δ_k, k ≥ 0.   (2.13)

ii) [3, Lemma 3.2] Let AF.1, AF.4 and AM.1 hold. Also, assume that

   σ_k ‖g_k‖ > (108√2/(1 − η_2)) (κ_H + κ_B) = κ_HB.   (2.14)

Then iteration k is very successful and

   σ_{k+1} ≤ σ_k.   (2.15)

iii) [3, Lemma 3.3] Let AF.1, AF.4 and AM.1 hold. For any ǫ > 0 and j ≥ 0 such that ‖g_k‖ > ǫ for all k ∈ {0, ..., j}, we have

   σ_k ≤ max(σ_0, γ_2 κ_HB/ǫ), 0 ≤ k ≤ j.   (2.16)

A generic property follows.

Lemma 2.3. Assume AF.1, AF.4 and AM.1 hold, and that, when applying ARC to minimizing f,

   Δ_k ≤ κ_c ‖g_k‖^p, for all k ≥ 0,   (2.17)

for some κ_c > 0 and p > 0, with Δ_k defined in (2.4). Then

   f(x_k) − m_k(s_k) ≥ κ_m Δ_k^{2/p}, for all k ≥ 0,   (2.18)

where κ_HB is defined in (2.14) and

   κ_m = (1/(2√2 κ_c^{2/p})) min( 1/κ_B, κ_c^{1/p}/max(σ_0 Δ_0^{1/p}, γ_2 κ_c^{1/p} κ_HB) ).   (2.19)

Proof. We first show that

   σ_k Δ_k^{1/p} ≤ max(σ_0 Δ_0^{1/p}, γ_2 κ_c^{1/p} κ_HB), for all k ≥ 0.   (2.20)

For this, we use the implication

   σ_k Δ_k^{1/p} > κ_c^{1/p} κ_HB  ⟹  σ_{k+1} Δ_{k+1}^{1/p} ≤ σ_k Δ_k^{1/p},   (2.21)

which follows from (2.15) in Lemma 2.2 ii), (2.17) and (2.13). Thus, when σ_0 Δ_0^{1/p} ≤ γ_2 κ_c^{1/p} κ_HB, (2.21) implies σ_k Δ_k^{1/p} ≤ γ_2 κ_c^{1/p} κ_HB, where the factor γ_2 is introduced for the case when σ_k Δ_k^{1/p} is less than κ_c^{1/p} κ_HB and the iteration k is not very successful. Letting k = 0 in (2.21) gives the first inequality in (2.20) when σ_0 Δ_0^{1/p} ≥ γ_2 κ_c^{1/p} κ_HB, since γ_2 > 1. Next we deduce from (2.12) and (2.17) that

   f(x_k) − m_k(s_k) ≥ (Δ_k^{2/p}/(2√2 κ_c^{2/p})) min( 1/κ_B, κ_c^{1/p}/(σ_k Δ_k^{1/p}) ),

which, together with (2.20) and the definition of κ_HB, gives (2.18) and (2.19). □

In the next two sections, we show that, when applied to convex and strongly convex functions with globally Lipschitz continuous gradients, the basic ARC algorithm, with only the Cauchy condition for the step computation, satisfies the same upper iteration complexity bounds as steepest descent applied to these problem classes, namely O(ǫ^{-1}) and O(log ǫ^{-1}), respectively; see [10, Theorems 2.1.14, 2.1.15].

2.3 Basic ARC complexity on convex objectives

Let us now assume that

AF.7   f is convex,   (2.22)

and also that the level sets of f are bounded, namely,

AF.8   ‖x − x_*‖ ≤ D, for all x such that f(x) ≤ f(x_0),   (2.23)

where x_* is any global minimizer of f and D ≥ 1. The following property specifies the values of p and κ_c for which (2.17) holds in the convex case.

Lemma 2.4. Assume AF.1 and AF.7–AF.8 hold, and let f_* = f(x_*) be the (global) minimum of f. When applying ARC to minimizing f, we have, for Δ_k in (2.4),

   Δ_k ≤ D ‖g_k‖, for all k ≥ 0.   (2.24)

Proof. AF.7 implies f(x) ≥ f(y) + g(y)^T(x − y), for all x, y ∈ IR^n. This with x = x_* and y = x_k, the Cauchy–Schwarz inequality, f(x_k) ≤ f(x_0) and AF.8 give (2.24). □

An O(ǫ^{-1}) upper bound on the ARC iteration count for reaching within-ǫ optimality of the objective value is given next.

Theorem 2.5. Assume AF.1, AF.4, AF.7–AF.8 and AM.1 hold, and let f_* = f(x_*) be the (global) minimum of f. Then, when applying ARC to minimizing f, we have

   Δ_j = f(x_j) − f_* ≤ 1/(η_1 κ_m^c |S_j|), j ≥ 0,   (2.25)

where S_j is defined in (2.6), and κ_m^c has the expression

   κ_m^c = (1/(2√2 D²)) min( 1/κ_B, D/max(σ_0 Δ_0, γ_2 D κ_HB) ).   (2.26)

Thus, given any ǫ > 0, ARC takes at most

   κ_s^c / ǫ   (2.27)

successful iterations and gradient evaluations to generate f(x_j) − f_* ≤ ǫ, where κ_s^c = 1/(η_1 κ_m^c).

Proof. From (1.4) and (1.5), we have

   f(x_k) − f(x_{k+1}) ≥ η_1 (f(x_k) − m_k(s_k)), k ∈ S.   (2.28)

Lemma 2.4 implies that the conditions of Lemma 2.3 are satisfied with p = 1 and κ_c = D, and so (2.18) and (2.28) imply f(x_k) − f(x_{k+1}) ≥ η_1 κ_m^c Δ_k², where κ_m^c is defined in (2.26). Thus, recalling (2.4), we have

   Δ_k − Δ_{k+1} ≥ η_1 κ_m^c Δ_k², k ∈ S,

or equivalently,

   1/Δ_{k+1} − 1/Δ_k = (Δ_k − Δ_{k+1})/(Δ_k Δ_{k+1}) ≥ η_1 κ_m^c Δ_k²/(Δ_k Δ_{k+1}) ≥ η_1 κ_m^c, k ∈ S,

where in the last inequality we used (2.13). Since Δ_k = Δ_{k+1} for any k ∉ S, summing up the above inequalities up to j gives

   1/Δ_j ≥ 1/Δ_0 + |S_j| η_1 κ_m^c ≥ |S_j| η_1 κ_m^c, j ≥ 0,

which gives (2.25), and hence also (2.27). □

2.4 Basic ARC complexity on strongly convex objectives

When we know even more about f, namely, that f is strongly convex, a global linear rate of convergence, and hence an improved iteration complexity of at most O(log ǫ^{-1}), can be proved for the basic ARC framework, as we show next. This represents, as expected, a marked improvement over the global sublinear rates of convergence obtained in the nonconvex and convex cases, and over the corresponding iteration complexity bounds. Let us assume that f is strongly convex, namely, there exists a constant µ > 0 such that

AF.9   f(y) ≥ f(x) + g(x)^T(y − x) + (µ/2)‖y − x‖², for all x, y ∈ IR^n.   (2.29)

When AF.9 holds, f has a unique minimizer, say x_*. The next property specifies the values of p and κ_c for which (2.17) holds in the strongly convex case.

Lemma 2.6. Assume AF.1 and AF.9 hold, and let x_* be the global minimizer of f. When applying ARC to minimizing f, we have

   Δ_k ≤ (1/(2µ)) ‖g_k‖², for all k ≥ 0.   (2.30)

Proof. AF.9 implies f(y) ≤ f(x) + g(x)^T(y − x) + (1/(2µ)) ‖g(x) − g(y)‖², for all x, y ∈ IR^n; see [10, Theorem 2.1.10] and its proof. Letting x = x_* and y = x_k in the latter gives (2.30). □

An O(log ǫ^{-1}) upper bound on the ARC iteration count for reaching within-ǫ optimality of the objective value is given next.

Theorem 2.7. Assume AF.1, AF.4, AF.9 and AM.1 hold, and let x_* be the global minimizer of f. Then, when applying ARC to minimizing f, we have

   Δ_j = f(x_j) − f_* ≤ (1 − η_1 κ_m^sc)^{|S_j|} Δ_0, j ≥ 0,   (2.31)

where S_j is defined in (2.6), and κ_m^sc has the expression

   κ_m^sc = (µ/(6√2)) min( 1/(σ_0 √(2µΔ_0)), 1/(γ_2 κ_HB) ) ∈ (0, 1).   (2.32)

Thus, given any ǫ > 0, ARC takes at most

   κ_s^sc log(Δ_0/ǫ)   (2.33)

successful iterations and gradient evaluations to generate f(x_j) − f_* ≤ ǫ, where κ_s^sc = 1/(η_1 κ_m^sc).

Proof. Lemma 2.6 implies that (2.17) holds with p = 2 and κ_c = 1/(2µ), and so the conditions of Lemma 2.3 are satisfied; it follows immediately from (2.18), (2.19), (2.28) and the above choices of p and κ_c that

   Δ_k − Δ_{k+1} = f(x_k) − f(x_{k+1}) ≥ η_1 κ_m^sc Δ_k, k ∈ S,

where κ_m^sc is defined in (2.32), which immediately gives (2.31) since Δ_k = Δ_{k+1} for any k ∉ S. To show that κ_m^sc < 1, use γ_2 ≥ 1, κ_HB ≥ κ_H and κ_H/µ ≥ 1; the latter inequality follows from (2.30) and from (2.37) with x = x_k. The bound (2.31) and the inequality (1 − η_1 κ_m^sc)^{|S_j|} ≤ e^{−η_1 κ_m^sc |S_j|} imply that Δ_j ≤ ǫ provided e^{−η_1 κ_m^sc |S_j|} Δ_0 ≤ ǫ, which then gives (2.33) by applying the logarithm. □

Some remarks on basic ARC's complexity for convex and strongly convex objectives. Let us comment on the results in Theorems 2.5 and 2.7. Note that, despite AF.7 or AF.9, no convexity assumption was made on m_k, confirming the basic ARC framework to be a steepest-descent-like method. The only model assumption is AM.1. Our results match in order, as a function of the accuracy ǫ, the (nonoptimal) complexity bounds for steepest descent applied to convex and strongly convex objectives with Lipschitz continuous gradients given in [10, Corollary 2.1.2, Theorem 2.1.15]. Let us now discuss the condition numbers that occur in our bounds and their connection to standard measures of conditioning. Consider first the convex-case bound in Theorem 2.5. Assume that the initial regularization parameter σ_0 is chosen small enough, namely, σ_0 ≤ 1/‖g_0‖. Then (2.24) implies that σ_0 Δ_0 ≤ D, and so (2.26) becomes κ_m^c = 1/(2√2 γ_2 κ_HB D²), where we also used that γ_2 ≥ 1 and κ_HB ≥ 1. Recalling (2.14) and that γ_2, η_1 and η_2 are user-chosen constants, we deduce that the bound (2.27) is a problem-independent constant multiple of

   max(κ_B, κ_H) D² / ǫ,

where D measures the size of the f(x_0) level set, and κ_H and κ_B are the exact and approximate Lipschitz constants of the gradient, respectively.
The displayed expression coincides with the bound in [10, Corollary 2.1.2] when the exact Hessian is used in place of B_k, so that κ_B = κ_H, and all iterations are successful. Consider now the strongly convex case and Theorem 2.7. Choosing again σ_0 ≤ 1/‖g_0‖, (2.30) provides that σ_0 √(2µΔ_0) ≤ 1. Using this, γ_2 ≥ 1 and κ_HB ≥ 1, (2.32) becomes κ_m^sc = µ/(6√2 γ_2 κ_HB). Employing (2.14) for the expression of κ_HB, (2.31) now becomes

   Δ_j = f(x_j) − f_* ≤ (1 − η̄/c(H))^{|S_j|} Δ_0,   (2.34)

where

   η̄ = η_1(1 − η_2)/(2592 γ_2) ∈ (0, 1) and c(H) = max(κ_H, κ_B)/µ.   (2.35)

Note that c(H) is a uniform upper bound on the Hessian's condition number, which equals the common measure κ_H/µ when exact Hessians are employed in place of B_k. Recalling that η_1, η_2 and γ_2 are user-chosen parameters, we deduce that, whenever σ_0 ≤ 1/‖g_0‖, (2.33) is a problem-independent constant multiple of

   c(H) log(Δ_0/ǫ),   (2.36)

where c(H) is defined in (2.35). When B_k = H(x_k), the function-decrease bound for the steepest-descent method in [10, Theorem 2.1.15] has a similar form to the simplified bound (2.34), with the factor 1 − η̄/c(H) replaced by the slightly smaller expression (c(H) − 1)²/(c(H) + 1)². Note that both (2.27) and (2.33) are worse than the complexity bounds of the optimal gradient method [10]. The latter enjoys a worst-case bound of order O(1/√ǫ) when applied to convex objectives [10, Theorems 2.1.7, 2.2.1], and of order O(√c(H) log ǫ^{-1}), with linear rate factor ((√c(H) − 1)/(√c(H) + 1))², for strongly convex functions. These two upper bounds match the lower complexity bounds for the minimization of convex and strongly convex functions with Lipschitz continuous gradient by means of gradient methods [10], and hence they are optimal from a worst-case complexity point of view.
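The counts above are easy to evaluate for concrete constants. The following small sketch assumes the bound (2.11) in the form j + 1 ≤ (1 + (2/log γ_1) log(σ̄/σ_min)) |S_j| and the strongly convex successful-iteration bound κ_s^sc log(Δ_0/ǫ) of Theorem 2.7; the helper name and all numerical values are illustrative, not from the paper.

```python
import math

def arc_iteration_bounds(delta0, eps, eta1, kappa_m_sc, sigma_bar, sigma_min, gamma1):
    """Worst-case counts for strongly convex objectives (illustrative):
    successful iterations from Theorem 2.7, kappa_s^sc * log(delta0/eps)
    with kappa_s^sc = 1/(eta1 * kappa_m_sc); total iterations by inflating
    with the Theorem 2.1 factor depending on the sigma_k range and gamma_1."""
    successful = math.ceil(math.log(delta0 / eps) / (eta1 * kappa_m_sc))
    factor = 1.0 + (2.0 / math.log(gamma1)) * math.log(sigma_bar / sigma_min)
    return successful, math.ceil(factor * successful)
```

For example, with Δ_0 = 1, ǫ = 10⁻⁶, η_1 = 0.1, κ_m^sc = 0.5, σ̄/σ_min = 100 and γ_1 = 2, this gives 277 successful iterations and a total count roughly 14 times larger; the log ǫ^{-1} dependence of (2.33) is visible directly.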

2.5 Complexity of basic ARC generating approximately-optimal gradients

Let us address the implication of the above results on ARC's complexity for achieving (1.7). This issue is important as the latter can be used as a termination condition for ARC, while Δ_k in (2.4), whose complexity was estimated above, cannot be computed in practice since f_* and x_* are unknown. The following generic property is useful in this and other contexts.

Lemma 2.8. Let AF.1 and AF.4 hold, and assume f is bounded below by f_*. Then

   f(x) − f_* ≥ max_{α ≥ 0} [f(x) − f(x − αg(x))] ≥ (1/(2κ_H)) ‖g(x)‖², for all x ∈ IR^n.   (2.37)

Thus, when ARC is applied to minimizing f, we have

   Δ_k ≥ (1/(2κ_H)) ‖g_k‖², k ≥ 0,   (2.38)

and so, for any ǫ > 0, ‖g_j‖ ≤ ǫ holds whenever

   f(x_j) − f_* ≤ ǫ²/(2κ_H).   (2.39)

Proof. First-order Taylor expansion and AF.4 give the overestimation property

   f(x + s) = f(x) + g(x)^T s + ∫_0^1 (g(x + ts) − g(x))^T s dt ≤ f(x) + g(x)^T s + (κ_H/2) ‖s‖², for all x, s ∈ IR^n.

Thus, letting s = −αg(x), we obtain

   f(x) − f(x − αg(x)) ≥ (α − (κ_H/2) α²) ‖g(x)‖², for all α ≥ 0.

The maximum of the right-hand side of the above inequality is attained at α = 1/κ_H, giving (2.37). □

Under the conditions of Theorem 2.5, ARC will take at most O(ǫ^{-2}) successful iterations to ensure (2.39) when applied to convex objectives. For strongly convex functions, Theorem 2.7 implies the same order of complexity of log ǫ^{-1} for ‖g_j‖ ≤ ǫ. (Note that the term f(x_0) − f_* in (2.25) and (2.31) can be replaced by D‖g_0‖ and ‖g_0‖²/(2µ), respectively.) Now recall [3, Corollary 3.4], which states that, when applied to nonconvex objectives, the basic ARC scheme takes at most O(ǫ^{-2}) iterations to generate a first iterate j with ‖g_j‖ ≤ ǫ. Hence we see that the difference between the convex and nonconvex cases is not so great, and the bound improvement (for ‖g_j‖) is somewhat slight.
Namely, as the bound on ‖g_j‖ in the convex case was obtained from that on the function values f(x_j), which decrease monotonically, it follows from (2.38) that once ‖g_k‖ ≤ ǫ, it will remain so for all subsequent iterations, and so the O(ǫ^{-2}) iteration bound represents the maximum total number of (successful) iterations with ‖g_k‖ > ǫ that may occur. Clearly, there is a marked improvement in ARC's worst-case complexity for the strongly convex case.

3 The complexity of second-order ARC variants

Let us now consider the complexity of Algorithm 1.2 with the inner iteration termination criteria (1.10) and (1.11), namely of the ARC(S) and ARC(g2) variants. For the remainder of the paper, we assume that

AF.3   f ∈ C^2(IR^n).   (3.1)

While no assumption of the Hessian of f being globally or locally Lipschitz continuous has been imposed in the complexity results of Section 2, we now require that the objective's Hessian is globally Lipschitz continuous on the path of the iterates, namely, there exists a constant L > 0 independent of k such that

AF.6   ‖H(x) − H(x_k)‖ ≤ L ‖x − x_k‖, for all x ∈ [x_k, x_k + s_k] and all k ≥ 0,   (3.2)

and that B_k and H(x_k) agree along s_k in the sense that

AM.4   ‖(H(x_k) − B_k) s_k‖ ≤ C ‖s_k‖², for all k ≥ 0, and some constant C > 0.   (3.3)

By using finite differences on the gradient for computing B_k, we showed in [5] that AM.4 can be achieved in O(n log ǫ^{-1}) additional iterations and gradient evaluations (for any user-chosen constant C). Next we recall some results for ARC(h), in particular, necessary conditions for the global subproblem solution (1.8) and expressions for the model decrease (see Lemma 3.1 i)); also, some general properties that hold for a large class of (nonconvex) functions (see Lemma 3.1 ii) and iii)).

Lemma 3.1.

i) [2, Lemmas 3.2, 3.3] Let s_k be the global minimizer of (1.8) for any k ≥ 0. Then

   g_k^T s_k + s_k^T B_k s_k + σ_k ‖s_k‖³ = 0,   (3.4)

and

   f(x_k) − m_k(s_k) = (1/2) s_k^T B_k s_k + (2/3) σ_k ‖s_k‖³.   (3.5)

ii) [2, Lemma 5.2] Let AF.3, AF.6 and AM.4 hold. Then

   σ_k ≤ max(σ_0, (3/2) γ_2 (C + L)) = L_0, for all k ≥ 0.   (3.6)

iii) [3, Lemma 5.2] Let AF.3, AF.4, AF.6, AM.4 and TC.s hold. Then s_k satisfies

   ‖s_k‖ ≥ κ_g √(‖g_{k+1}‖) for all successful iterations k,   (3.7)

where κ_g is the positive constant

   κ_g = √( (1 − κ_θ)/(L + C + L_0 + κ_θ κ_H) ).   (3.8)

Note that in our second-order ARC variants in [2, 3], we employ the more general condition (3.4) and an approximate nonnegative curvature requirement [2, (3.12)] for defining the choice of s_k, which may hold at other points (of local minimum) than the global minimizer over L_k as prescribed by (1.8). When the model is convex, as is often the case here, such situations do not arise.
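The necessary condition (3.4) also suggests how the global minimizer of the cubic model can be computed in the easy (nondegenerate) case: s = −(B_k + λI)^{-1} g_k with λ = σ_k ‖s‖ and B_k + λI positive semidefinite. A minimal sketch via bisection on λ follows; the function name is hypothetical, and the hard case of the characterization in [2, Thm. 3.1] is not handled.

```python
import numpy as np

def cubic_model_global_min(g, B, sigma):
    """Global minimizer of m(s) = g^T s + 0.5 s^T B s + (sigma/3)||s||^3
    in the nondegenerate case, via bisection on lambda in
    s(lambda) = -(B + lambda I)^{-1} g with lambda = sigma * ||s(lambda)||."""
    g, B = np.asarray(g, dtype=float), np.asarray(B, dtype=float)
    eye = np.eye(len(g))
    lam_min = np.linalg.eigvalsh(B)[0]
    lo = max(0.0, -lam_min) + 1e-14      # keep B + lam*I positive definite

    def phi(lam):
        # positive while ||s(lam)|| > lam/sigma, negative beyond the root
        return np.linalg.norm(np.linalg.solve(B + lam * eye, g)) - lam / sigma

    hi = lo + 1.0
    while phi(hi) > 0:                   # bracket the root
        hi *= 2.0
    for _ in range(200):                 # bisection to machine accuracy
        mid = 0.5 * (lo + hi)
        if phi(mid) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return -np.linalg.solve(B + lam * eye, g)
```

The returned s satisfies the stationarity condition g_k + B_k s + σ_k ‖s‖ s = 0, from which (3.4) follows on taking the inner product with s.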
The bound (3.7) ensures that the step s_k does not become too small compared to the size of the gradient, and it is a crucial ingredient for obtaining, as shown in [3, Corollary 5.3], an O(ǫ^{-3/2}) upper bound on the iteration count of ARC(S) to generate ‖g_k‖ ≤ ǫ for general nonconvex functions. Next we improve the order of this bound for convex and strongly convex objectives. Despite solving the subproblem to higher accuracy than the generic ARC framework, the second-order ARC variants still only evaluate the objective function and its gradient once in each (major) iteration and each successful iteration, respectively; hence the correspondence between the (successful) iteration count and the number of (gradient) function evaluations continues to hold. Recall also Theorem 2.1, which relates the total number of iterations to the number of successful ones.

3.1 ARC(g2) complexity on convex objectives

Here, we prove an O(1/√ǫ) iteration upper bound for ARC(g2) to achieve (1.12), which improves the steepest-descent-like bound of order 1/ǫ for basic ARC in Theorem 2.5. A stronger requirement than AF.6 is required in this section, namely, that the Hessian is globally Lipschitz continuous,

AF.6   ‖H(x) − H(y)‖ ≤ L ‖x − y‖, for all x, y ∈ IR^n.   (3.9)

Note that AF.6 and AF.8 imply AF.4 on the f(x_0) level set of f, which is the required domain of gradient Lipschitz continuity for the results in this section. We also employ the true Hessian values for B_k, namely, we make the following choice in ARC(g2),

   B_k = H(x_k), for all k ≥ 0.   (3.10)

Thus AM.4 holds in this case with C = 0, and AF.4 (or AF.6 and AF.8) implies AM.1. A useful lemma is given first.

Lemma 3.2. Let AF.3, AF.6 and AF.7–AF.8 hold. Let f_* = f(x_*) be the (global) minimum of f. Consider the subproblem (1.8) with B_k = H(x_k) and for a(ny) subspace L_k of IR^n with g_k ∈ L_k. Then

   min_{s ∈ L_k} m_k(s) ≤ f(x_k) − 2 κ_m(g2)^c [f(x_k) − f(x_k + s_k^*)]^{3/2},   (3.11)

where s_k^* is a (global) minimizer of f(x_k + s) over s ∈ L_k, and where

   κ_m(g2)^c = 1/(6D √(6D L̄)) and L̄ = max(σ_0, γ_2 L, κ_H).   (3.12)

Proof. From AF.3 and AF.6, we have the overestimation property

   f(x_k + s) ≥ f(x_k) + s^T g_k + (1/2) s^T H(x_k) s − (L/6) ‖s‖³, s ∈ IR^n,   (3.13)

and so, from (1.1) and B_k = H(x_k), we have

   m_k(s) ≤ f(x_k + s) + ((2σ_k + L)/6) ‖s‖³, s ∈ IR^n.

Employing (3.6) and γ_2 ≥ 1, we further obtain

   m_k(s) ≤ f(x_k + s) + L̄ ‖s‖³, s ∈ IR^n,   (3.14)

where L̄ is defined in (3.12). (Note that κ_H is not needed as yet in the definition of L̄; it will be useful later, as we shall see.) Minimizing on both sides of (3.14) gives the first inequality below,

   min_{s ∈ L_k} m_k(s) ≤ min_{s ∈ L_k} {f(x_k + s) + L̄ ‖s‖³} ≤ min_{α ∈ [0,1]} {f(x_k + α s_k^*) + L̄ α³ ‖s_k^*‖³},   (3.15)

where the second inequality follows from the definition of s_k^*, which gives α s_k^* ∈ L_k for all α ∈ [0, 1].
From AF.7, we have f(x_k + α s_k^*) ≤ (1 − α) f(x_k) + α f(x_k + s_k^*), for all α ∈ [0, 1], and so, from (3.15),

   min_{s ∈ L_k} m_k(s) ≤ f(x_k) + min_{α ∈ [0,1]} { α[f(x_k + s_k^*) − f(x_k)] + L̄ α³ ‖s_k^*‖³ }.   (3.16)

The construction of the algorithm implies f(x_k) ≤ f(x_0), so that ‖x_k − x_*‖ ≤ D due to AF.8. Furthermore, f(x_k + s_k^*) ≤ f(x_k), and so ‖x_k + s_k^* − x_*‖ ≤ D. Thus ‖s_k^*‖ ≤ ‖x_k − x_*‖ + ‖x_k + s_k^* − x_*‖ ≤ 2D, and (3.16) implies

   min_{s ∈ L_k} m_k(s) ≤ f(x_k) + min_{α ∈ [0,1]} { α[f(x_k + s_k^*) − f(x_k)] + 8 α³ L̄ D³ }.   (3.17)

The minimum in the right-hand side of (3.17) is attained at α_k^* = min{1, α̂_k}, where

   α̂_k := √(f(x_k) − f(x_k + s_k^*)) / (2D √(6 L̄ D)).

Let us show that α̂_k ≤ 1, namely, f(x_k) − f(x_k + s_k^*) ≤ 24 L̄ D³. AF.7 gives the first inequality in

   f(x_k + s_k^*) − f(x_k) ≥ g_k^T s_k^* ≥ −‖g_k‖ ‖s_k^*‖ ≥ −2D ‖g_k‖ = −2D ‖g_k − g(x_*)‖ ≥ −2 κ_H D²,

where we also used the Cauchy–Schwarz inequality, the bound on ‖s_k^*‖ just before (3.17), AF.4 and AF.8. Since we assumed in AF.8 that D ≥ 1, and the definition of L̄ implies L̄ ≥ κ_H, we conclude that f(x_k) − f(x_k + s_k^*) ≤ 2 κ_H D² ≤ 2 L̄ D³ ≤ 24 L̄ D³. Thus α_k^* = α̂_k, and substituting the above value of α̂_k into (3.17), we deduce (3.11) with the notation (3.12). □

The main result of this section follows.

Theorem 3.3. Let AF.3, AF.6 and AF.7–AF.8 hold. Let f_* = f(x_*) be the (global) minimum of f. Apply ARC(g2) with the choices (2.8) and (3.10) to minimizing f. Then

   Δ_j = f(x_j) − f_* ≤ ( 1/(η_1 β κ_m(g2)^c |S_j|) )², j ≥ 0,   (3.18)

where S_j is defined in (2.6), κ_m(g2)^c in (3.12) and

   β = (1/2) min( 1, κ_G^{3/2}/(4 (κ_H D)^{3/2}) ), with κ_G = σ_min (κ_m(g2)^c)²/(4 κ_θ² κ_H³).   (3.19)

Thus, given any ǫ > 0, ARC(g2) takes at most

   κ_s(g2)^c / √ǫ   (3.20)

successful iterations and gradient evaluations to generate f(x_j) − f_* ≤ ǫ, where κ_s(g2)^c = 1/(η_1 β κ_m(g2)^c).

Proof. Let k ∈ S. From (1.4), (1.5) and (2.5), we have

   f(x_{k+1}) ≤ (1 − η_1) f(x_k) + η_1 m_k(s_k) = (1 − η_1) f(x_k) + η_1 [m_k(s_k) − m_k(s_k^m)] + η_1 m_k(s_k^m),   (3.21)

where s_k^m denotes the global minimizer of m_k(s) over IR^n. AF.7 implies H(x_k) is positive semidefinite and so m_k(s) is convex, which gives the first inequality below,

   m_k(s_k) − m_k(s_k^m) ≤ ∇_s m_k(s_k)^T (s_k − s_k^m) ≤ ‖∇_s m_k(s_k)‖ ‖s_k − s_k^m‖ ≤ κ_θ ‖g_k‖³ ‖s_k − s_k^m‖,   (3.22)

where the last inequality follows from TC.g2 (1.11). To bound ‖s_k − s_k^m‖, recall that both s_k and s_k^m satisfy (3.4), which implies, due to (2.8) and B_k = H(x_k) being positive semidefinite,

   σ_min ‖s‖³ ≤ σ_k ‖s‖³ ≤ −g_k^T s ≤ ‖g_k‖ ‖s‖, where s = s_k or s = s_k^m.

Thus max{‖s_k‖, ‖s_k^m‖} ≤ √(‖g_k‖/σ_min), and so

   ‖s_k − s_k^m‖ ≤ 2 √(‖g_k‖/σ_min).

This and (3.22) now provide the first inequality below,

  m_k(s_k) − m_k(s_k^m) ≤ (2 κ_θ / √σ_min) ‖g_k‖^{7/2} ≤ ( 2 κ_θ κ_H √(2 κ_H) / √σ_min ) ‖g_k‖^{1/2} Δ_k^{3/2},   (3.23)

while the second inequality follows from (2.38). Recalling (3.21), we are left with bounding m_k(s_k^m) above, for which we use Lemma 3.2 with L_k = ℝ^n. Then s_k^* = x_* − x_k, and so f(x_k) − f(x_k + s_k^*) = Δ_k, and (3.11) implies

  m_k(s_k^m) ≤ f(x_k) − 2 κ^c_{m(g2)} Δ_k^{3/2}.

Substituting this bound and (3.23) into (3.21), we deduce

  f(x_{k+1}) ≤ f(x_k) + 2 η_1 ( ( κ_θ κ_H √(2 κ_H) / √σ_min ) ‖g_k‖^{1/2} − κ^c_{m(g2)} ) Δ_k^{3/2},

or equivalently, recalling (2.4) and (3.19),

  Δ_k − Δ_{k+1} ≥ 2 η_1 κ^c_{m(g2)} ( 1 − √( ‖g_k‖ / (2 κ_G) ) ) Δ_k^{3/2}.

Thus we have the implication

  ‖g_k‖ ≤ κ_G/2  ⟹  Δ_k − Δ_{k+1} ≥ η_1 κ^c_{m(g2)} Δ_k^{3/2}.   (3.24)

It remains to prove a bound of the same form as the right-hand side of (3.24) when ‖g_k‖ > κ_G/2. For this, we employ Lemma 3.2 again, this time for s_k and the subspace L_k in the kth iteration of ARC(g2), with g_k ∈ L_k. Thus, noting that the left-hand side of (3.11) is equal to m_k(s_k) in this case, we employ (3.11) to bound the first inequality in (3.21), and obtain

  f(x_{k+1}) ≤ f(x_k) − 2 η_1 κ^c_{m(g2)} [ f(x_k) − f(x_k + s_k^*) ]^{3/2}.   (3.25)

Since x_k + s_k^* is a global minimizer of f(x_k + s) over s ∈ L_k, and g_k ∈ L_k, we have the first inequality below, for any α ≥ 0,

  f(x_k) − f(x_k + s_k^*) ≥ f(x_k) − f(x_k − α g_k) ≥ ‖g_k‖^2 / (2 κ_H) ≥ ( ‖g_k‖ / (2 κ_H D) ) Δ_k,

where the second and third inequalities follow from the second inequality in (2.37) and from (2.24), respectively. It follows from (3.25) that

  f(x_{k+1}) ≤ f(x_k) − η_1 κ^c_{m(g2)} ( ‖g_k‖^{3/2} / ( κ_H D √(2 κ_H D) ) ) Δ_k^{3/2},

or equivalently,

  Δ_k − Δ_{k+1} ≥ η_1 κ^c_{m(g2)} ( ‖g_k‖^{3/2} / ( κ_H D √(2 κ_H D) ) ) Δ_k^{3/2}.

Thus we have the implication

  ‖g_k‖ > κ_G/2  ⟹  Δ_k − Δ_{k+1} ≥ η_1 κ^c_{m(g2)} ( κ_G^{3/2} / ( 4 (κ_H D)^{3/2} ) ) Δ_k^{3/2}.   (3.26)

Finally, we conclude from (3.24) and (3.26) that

  Δ_k − Δ_{k+1} ≥ 2 η_1 β κ^c_{m(g2)} Δ_k^{3/2},  for all k ∈ S,   (3.27)

where β is defined in (3.19). For any k ∈ S, we have the identity and bounds

  1/√Δ_{k+1} − 1/√Δ_k = (Δ_k − Δ_{k+1}) / ( √Δ_k √Δ_{k+1} (√Δ_k + √Δ_{k+1}) ) ≥ 2 η_1 β κ^c_{m(g2)} Δ_k^{3/2} / ( √Δ_k √Δ_{k+1} (√Δ_k + √Δ_{k+1}) ) ≥ η_1 β κ^c_{m(g2)},

where we also used (3.27) and (2.3), respectively. Thus, recalling that Δ_k remains unchanged on unsuccessful iterations and summing the above over the successful iterations up to j, we deduce

  1/√Δ_j ≥ 1/√Δ_0 + |S_j| η_1 β κ^c_{m(g2)} ≥ |S_j| η_1 β κ^c_{m(g2)},  j ≥ 0,

which gives (3.18) and also (3.20). □

As TC.g2 is satisfied at the global minimizer of the cubic model m_k(s), the latter can be chosen as the step in our algorithm; this is an efficient choice as far as the cost of the subproblem solution is concerned, provided the problem is of medium size or the Hessian at the iterates is sparse.

Note the two regimes of analysis in the above proof, namely the model decreases (3.24) and (3.26). To obtain the former, asymptotic, case, the termination criterion TC.g2 was used, while for the latter, early-stages, case, the first-order condition that the gradient be included in the subspace of minimization, and the ensuing decrease along the steepest-descent direction, were essential. Thus the construction of ARC(g2), which behaves like steepest descent early on and then naturally switches to higher accuracy as it approaches the solution, is reflected in our complexity analysis, with the slight caveat that the (converging) gradient is nonmonotonic, and so the distinction between the asymptotic and nonasymptotic regimes is not strict. Furthermore, the nonasymptotic result (3.26) also holds for ARC(S), but the termination condition TC.s does not seem strong enough to ensure a property similar to (3.24) for the asymptotic regime of ARC(S).
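The sublinear rate in (3.18) comes entirely from the worst case of the per-iteration decrease (3.27), and the telescoping step above can be checked numerically. The following sketch is illustrative only and is not from the paper; the constant c stands in for 2η_1βκ^c_{m(g2)}, and the recursion below is the equality case of (3.27).

```python
import math

def decrease_recursion(delta0, c, n):
    """Iterate Delta_{k+1} = Delta_k - c * Delta_k**1.5,
    the equality case of the successful-iteration decrease."""
    deltas = [delta0]
    for _ in range(n):
        d = deltas[-1]
        deltas.append(d - c * d ** 1.5)
    return deltas

deltas = decrease_recursion(delta0=0.5, c=0.1, n=200)
# Each step advances 1/sqrt(Delta_k) by at least c/2, which telescopes
# to Delta_j <= (j*c/2)**(-2), the 1/|S_j|**2 rate behind the bound.
gaps = [1.0 / math.sqrt(b) - 1.0 / math.sqrt(a)
        for a, b in zip(deltas, deltas[1:])]
```

In this run every element of `gaps` is at least c/2 = 0.05, which is exactly the telescoping inequality used in the proof and yields the O(1/√ε) evaluation count.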
Assuming that σ_0 is chosen small enough, the constant κ^c_{m(g2)} in (3.12) and (3.18) that characterizes the asymptotic function decrease is a problem-independent constant multiple of 1/√( max(κ_H, L) D^3 ), while β ∈ (0,1) in (3.18) represents the fraction of this function decrease that can be ensured in the nonasymptotic regime, when only a Cauchy decrease is achieved.

The iteration complexity of Nesterov & Polyak's cubic regularization algorithm applied to convex problems is analysed in [12, Theorem 4] and [11, Theorem 1], and an O(1/√ε) bound is obtained. Here, we relax the requirement that the subproblem be solved globally and exactly, allowing approximate solutions, and obtain a bound of the same order.

Complexity of generating approximately-optimal gradient values. The complexity of ARC(g2) generating a gradient value ‖g_j‖ ≤ ε can be obtained as described in Section 2.5, by using (2.39) in Lemma 2.8; an O(1/ε) upper bound on the total number of iterations and gradient evaluations with ‖g_k‖ > ε ensues.

3.2 ARC(S) complexity on strongly convex objectives

For generality (since TC.s is a milder condition than TC.g2), we focus on ARC(S) in this section, but similar results can be shown for ARC(g2). Let us now assume AF.9. Due to AF.3, (2.29) is equivalent to

  u^T H(x) u ≥ μ ‖u‖^2,  for all u, x ∈ ℝ^n.   (3.28)

Employing (2.29) with y = x and x = x_*, we deduce that AF.8 is implied by AF.9 with

  D = √( 2 Δ_0 / μ ).   (3.29)

The strong convexity of f implies that, asymptotically, ARC(S) converges Q-quadratically to the (global) minimizer, and hence it possesses an associated evaluation complexity of order log_2 log_2 (1/ε) from some iteration j_q ≥ 0 onwards [1, Section 9.5.3].
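A Q-quadratic phase of the form Δ_{k+1} ≤ Δ_k^2/δ reaches accuracy ε in a doubly logarithmic number of steps; the small sketch below makes this count concrete. It is an illustration under assumed values: δ and the entry level δ/2 are placeholders, not the paper's constants.

```python
import math

def quad_phase_iters(delta, eps):
    """Count iterations of Delta_{k+1} = Delta_k**2 / delta,
    started at Delta_0 = delta/2, until Delta_k <= eps."""
    d, k = delta / 2.0, 0
    while d > eps:
        d = d * d / delta
        k += 1
    return k

# agrees with the ceil(log2(log2(delta/eps))) estimate
print(quad_phase_iters(1.0, 1e-12))  # prints 6
```

The returned count equals ⌈log_2 log_2 (δ/ε)⌉ here, the same doubly logarithmic dependence on the accuracy that appears in the strongly convex bounds of this section.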

Lemma 3.4. Assume AF.3–AF.4, AF.6, AF.9 and AM.4 hold, and let x_* be the global minimizer of f. Apply ARC(S) to minimizing f, and assume that the Rayleigh quotient of B_k along s_k is uniformly bounded away from zero, namely

  R_k(s_k) = s_k^T B_k s_k / ‖s_k‖^2 ≥ R_min > 0,  k ∈ S.   (3.30)

Then, recalling κ_g defined in (3.8) and letting δ = (1/2)( η_1 R_min κ_g^2 √μ )^2,

  N_f = { x : f(x) − f(x_*) ≤ δ }   (3.31)

is a neighbourhood of quadratic convergence for f, so that if there exists j_q ≥ 0 such that x_{j_q} ∈ N_f with Δ_{j_q} ≤ δ/2, then x_k ∈ N_f for all k ≥ j_q, and

  Δ_{k+1} ≤ Δ_k^2 / δ,  for all k ∈ S with k ≥ j_q.   (3.32)

Furthermore, given ε > 0, ARC(S) takes at most

  ⌈ log_2 log_2 (δ/ε) ⌉   (3.33)

successful iterations and gradient evaluations from j_q onwards to generate f(x_j) − f_* ≤ ε.

Proof. Let k ∈ S. Then (1.5), (3.5), (3.30) and (3.7) imply

  f(x_k) − f(x_{k+1}) ≥ η_1 ( f(x_k) − m_k(s_k) ) ≥ (η_1/2) R_k(s_k) ‖s_k‖^2 ≥ (η_1/2) R_min ‖s_k‖^2 ≥ (η_1/2) R_min κ_g^2 ‖g_{k+1}‖,  k ∈ S.

Lemma 2.6 applies at k + 1, and so

  Δ_{k+1} ≤ ‖g_{k+1}‖^2 / (2μ).

The last two displayed equations further give

  Δ_k ≥ f(x_k) − f(x_{k+1}) ≥ (η_1/2) R_min κ_g^2 √( 2 μ Δ_{k+1} ),  and so  Δ_{k+1} ≤ Δ_k^2 / δ,  for all k ∈ S,   (3.34)

where δ is defined in (3.31). The expression of N_f in (3.31) follows, as does (3.32). Assuming that x_{j_q} ∈ N_f with Δ_{j_q} ≤ δ/2, we deduce from (3.32) that

  Δ_j ≤ δ ( Δ_{j_q} / δ )^{2^l},  for any j ≥ j_q,   (3.35)

where l = |{ j_q, j_q + 1, …, j } ∩ S| denotes the number of successful iterations from j_q up to j. Now employing Δ_{j_q} ≤ δ/2 in (3.35) shows that Δ_j ≤ ε provided 2^{−2^l} δ ≤ ε, which gives the bound (3.33). □

Remark on satisfying (3.30). If exact Hessians are used, so that B_k = H(x_k) for all k, then AF.9 implies (3.30), due to (3.28). Alternatively, (3.30) can be ensured if AM.4 holds with a sufficiently small C. Namely, note that AF.9, AM.4 and (3.29) imply

  μ ≤ s_k^T H_k s_k / ‖s_k‖^2 ≤ R_k(s_k) + s_k^T (H_k − B_k) s_k / ‖s_k‖^2 ≤ R_k(s_k) + C ‖s_k‖ ≤ R_k(s_k) + 2 C D,  k ≥ 0.

Thus (3.30) holds provided C < μ/(2D). Recall our comments on satisfying AM.4 by finite differencing, following (3.3).

We are left with bounding the number of successful iterations up to j_q, namely, the iterations ARC(S) takes until entering the region of quadratic convergence N_f (which must happen under the conditions of Corollary 3.5, as x_k converges to the unique global minimizer x_*). From the definition of j_q and N_f in Lemma 3.4, this is equivalent to counting the successful iterations until

  Δ_{j_q} = f(x_{j_q}) − f(x_*) ≤ δ/2,   (3.36)

with δ defined in (3.31). The choice of s_k in (1.8) with g_k ∈ L_k implies that ARC(S) always satisfies the Cauchy condition (1.2), and so the bound in Theorem 2.7 holds. This yields an upper bound of order log(Δ_0/δ) on the number of successful iterations up to j_q, and emphasizes again that early on in the running of the algorithm, steepest-descent-like decrease is sufficient, even from a worst-case complexity viewpoint. The bound on the total number of successful iterations is then obtained by adding up the bounds for the two distinct phases, up to and then inside the neighbourhood of quadratic convergence.

Corollary 3.5. Assume AF.3–AF.4, AF.6, AF.9, AM.1 and AM.4 hold, and let x_* be the global minimizer of f. Apply ARC(S) to minimizing f, assuming that (3.30) holds. Then, given any ε > 0, ARC(S) takes, in total, at most

  ⌈ κ_s^{sc} log( 2 Δ_0 / δ ) ⌉ + ⌈ log_2 log_2 (δ/ε) ⌉   (3.37)

successful iterations and gradient evaluations to generate f(x_j) − f(x_*) ≤ ε, where κ_s^{sc} is defined in (2.33) and δ in (3.31).

Proof. The conditions of Theorem 2.7 are satisfied, and so, letting ε = δ/2 in (2.33), we deduce that (3.36) holds in at most ⌈κ_s^{sc} log(2Δ_0/δ)⌉ successful iterations. To bound the number of iterations from j_q to j, we employ Lemma 3.4. The total number of successful iterations up to j is then the sum of these two bounds. □

Recalling our comments following (2.34), let us interpret the condition numbers in (3.37).
In particular, provided σ_0 is chosen sufficiently small, we obtain from (2.36) that κ_s^{sc} is a problem-independent multiple of the bound c(H) in (2.35) on the condition number of the Hessian matrix H(x). Additionally, if B_k = H(x_k), so that C = 0 and R_min = μ, then δ in (3.31) and (3.37) simplifies to a multiple of μ/c(H).

Note that for the non-asymptotic phase of ARC(S), an O(1/√δ) bound can be deduced similarly to the proof of Theorem 3.3. Namely, using Lemma 3.2, which clearly holds for ARC(S), we deduce (3.25); then employ (2.37) just as in the first displayed equation after (3.25), and use (2.30). The total ARC(S) complexity would then be of order δ^{−1/2} + log_2 log_2 (δ/ε), which matches the bounds for cubic regularization with exact subproblem solution in [12] and [11, pages 176–177]. Note that such bounds are weaker than the ones we obtained in Corollary 3.5.

Complexity of generating approximately-optimal gradient values. We have the following result, where the constants have already been defined in Corollary 3.5.

Lemma 3.6. Assume AF.3–AF.4, AF.6, AF.9, AM.1 and AM.4 hold. Apply ARC(S) to minimizing f, assuming that (3.30) holds. Then

  N_g = { x : ‖g(x)‖ ≤ ( (η_1/2) R_min κ_g )^2 = ζ }

is a neighbourhood of quadratic convergence for the gradient g; namely, if there exists j_q such that x_{j_q} ∈ N_g with ‖g_{j_q}‖ ≤ ζ/2, then x_k ∈ N_g for all k ≥ j_q, and

  ‖g_{k+1}‖ ≤ ‖g_k‖^2 / ζ,  for all k ∈ S with k ≥ j_q.   (3.38)

Thus, given ε > 0, ARC(S) takes at most

  ⌈ log_2 log_2 (ζ/ε) ⌉   (3.39)

successful iterations from j_q onwards to generate ‖g_j‖ ≤ ε. Furthermore, to generate ‖g_{j_q}‖ ≤ ζ, ARC(S) takes at most

  ⌈ 2 κ_s^{sc} log( ‖g_0‖ √κ_H / (ζ √μ) ) ⌉   (3.40)

successful iterations, so that the total number of successful iterations and gradient evaluations required to generate ‖g_j‖ ≤ ε is at most the sum of the bounds (3.39) and (3.40).

Proof. AF.9 implies AF.7, which gives

  f(x_{k+1}) − f(x_k) ≥ g_k^T s_k ≥ −‖g_k‖ ‖s_k‖,  k ≥ 0.

This and the first set of displayed equations in the proof of Lemma 3.4 give the first inequality below,

  ‖g_k‖ ≥ (η_1/2) R_min ‖s_k‖ ≥ (η_1/2) R_min κ_g √‖g_{k+1}‖,  k ∈ S,   (3.41)

where the latter inequality follows from (3.7). The expression and properties of N_g follow. The bound (3.39) is obtained similarly to the proof of (3.33) in Lemma 3.4. To deduce (3.40), let ε = ζ in (2.39) and in (2.33), and replace Δ_0 in the latter by its upper bound ‖g_0‖^2/(2μ). □

A similar estimate of a neighbourhood of quadratic convergence for the gradient can be found in [12] for Nesterov & Polyak's cubic regularization algorithm.

3.3 On the tightness of ARC's complexity bounds

The question arises as to whether the complexity bounds on ARC's performance on the special problem classes presented in this section are too pessimistic, even for the worst case, and could potentially be improved. This is particularly relevant for the convex case and the corresponding bound of order 1/√ε (Theorem 3.3), which implies a sublinear rate of convergence of second-order ARC variants on convex functions.
(For the strongly convex case, the log log (1/ε) bound can commonly be observed numerically when Q-quadratic convergence takes place.) Here, we give a convex function that satisfies all the conditions of Theorem 3.3 apart from having bounded level sets, and on which ARC takes precisely order 1/√ε iterations (and function and gradient evaluations) to generate f(x_j) − f_* ≤ ε. Consider a convex function f ∈ C^2(ℝ) with

  f(x) = e^{−x},  for x ≥ 0.   (3.42)

We have the following complexity result, whose proof is given in the Appendix.

Lemma 3.7. The function (3.42) is convex, bounded below by f_* = 0, and has bounded and Lipschitz continuous second derivatives f″(x) for x ∈ [0,∞), with constants κ_H = L = 1, thus satisfying AF.4, AF.6 and AF.7. Apply ARC to minimizing (3.42), starting with x_0 ≥ 0. On each iteration k, compute the step s_k as the global minimizer of the model m_k(s) in (1.1) with B_k = f″(x_k) and with the (reasonable) choice

  σ_k := σ = L/2 = 1/2,  k ≥ 0,   (3.43)

which ensures that every iteration is very successful and that (2.8) holds. Then AM.1 and AM.4 hold (with κ_B = 1 and C = 0), and ARC takes Θ(ε^{−1/2}) total iterations to achieve f(x_k) ≤ ε, where Θ(·) denotes upper and lower bounds of that order.

Several remarks are in order concerning the above example.

This example also applies to Nesterov & Polyak's cubic regularization algorithm [12, 11]; recall our choice of s_k and σ_k above. In particular, it satisfies all the conditions in [11, Theorem 1], including σ_k = L/2, except that of f having bounded level sets. The latter theorem establishes the O(ε^{−1/2}) iteration upper bound for Nesterov & Polyak's cubic regularization.

Approximate termination criteria like TC.g2 and TC.s do not give better performance than the exact subproblem solution in this case (see, in the right-hand plot of Figure 3.1, basic ARC with the Cauchy condition).

If Newton's method is applied to this example, the complexity is better; see Figure 3.1. Similarly, if we allowed σ_k to decrease to zero, so that the step approaches the Newton step, the complexity would again improve. Thus the inefficient behaviour in this example is due to keeping the regularization always switched on, and always strongly regularizing. However, we have shown in [4] that for nonconvex problems, Newton's method can behave worse than second-order ARC in the worst case; in fact, it can be as poor as steepest descent.
It remains to be seen whether this is also possible for convex problems, or for problems with bounded level sets.

Figure 3.1: Graph of (3.42) and the local cubic regularizations at the ARC iterates (left-hand side). Plot of objective values at the iterates, on a log scale, for the ARC variants (Cauchy, σ_k = L/2, σ_k → 0) and for Newton's method (right-hand side).
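The Θ(ε^{−1/2}) behaviour of Lemma 3.7 is easy to reproduce. The sketch below is not the code behind Figure 3.1; it simply runs the fixed-σ cubic regularization of Lemma 3.7 on (3.42), using the closed-form global minimizer of the one-dimensional cubic model (elementary calculus, since the model is convex for s of the right sign). The observed values f(x_k) decay like a constant over k^2, consistent with the Θ(ε^{−1/2}) iteration count.

```python
import math

def cubic_step_1d(g, B, sigma):
    """Global minimizer of m(s) = g*s + 0.5*B*s**2 + (sigma/3)*|s|**3
    for B >= 0, sigma > 0: solve sigma*t**2 + B*t = |g|, s = -sign(g)*t."""
    if g == 0.0:
        return 0.0
    t = (-B + math.sqrt(B * B + 4.0 * sigma * abs(g))) / (2.0 * sigma)
    return -math.copysign(t, g)

def arc_on_exp(iters, x0=0.0, sigma=0.5):
    """Fixed-sigma cubic regularization on f(x) = exp(-x), x >= 0."""
    x = x0
    for _ in range(iters):
        g = -math.exp(-x)   # f'(x)
        B = math.exp(-x)    # f''(x), i.e. B_k is the exact Hessian
        x += cubic_step_1d(g, B, sigma)
    return math.exp(-x)     # f(x_k)

fval = arc_on_exp(2000)     # roughly of order 1/iters**2
```

Since g < 0 along the run, every step is positive and of size about √(2 f(x_k)) once f is small, which integrates to f(x_k) of order k^{−2}: the regularization stays "switched on" and forces the sublinear rate.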

4 Conclusions

The behaviour of ARC on some special problem classes was investigated and, as expected, improved complexity bounds were shown when additional structure was assumed to be present in the problem. In particular, upper bounds of order O(1/√ε) and O(log κ + log log(1/ε)) were proved for second-order ARC variants when applied to convex and strongly convex objectives, respectively. For the latter case, the fact that the number of steps taken before entering the region of quadratic convergence is a logarithmic function of condition numbers is an improvement over existing complexity bounds for second-order methods applied to such problems. We have also given an example of (relatively) inefficient behaviour of second-order ARC on a convex problem with unbounded level sets, on which it takes order 1/√ε iterations to reach within ε of the optimum. Several open questions remain, such as whether a convex objective with bounded level sets can be found on which the latter iteration bound is attained, or whether Newton's method always has better worst-case complexity than ARC in the convex case.

References

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, United Kingdom, 2004.

[2] C. Cartis, N. I. M. Gould and Ph. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245-295, 2011.

[3] C. Cartis, N. I. M. Gould and Ph. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, DOI: 10.1007/s10107-009-0337-y, 2010 (online).

[4] C. Cartis, N. I. M. Gould and Ph. L. Toint. On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization. SIAM Journal on Optimization, 20(6):2833-2852, 2010.

[5] C. Cartis, N. I. M. Gould and Ph. L. Toint. On the oracle complexity of first-order and derivative-free algorithms for smooth nonconvex minimization. ERGO Technical Report 10-005, School of Mathematics, University of Edinburgh, 2010.

[6] A. R. Conn, N. I. M. Gould and Ph. L. Toint. Trust-Region Methods. SIAM, Philadelphia, USA, 2000.

[7] J. E. Dennis and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, New Jersey, USA, 1983. Reprinted as Classics in Applied Mathematics 16, SIAM, Philadelphia, USA, 1996.

[8] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, USA, 1996.

[9] A. Griewank. The modification of Newton's method for unconstrained optimization by bounding cubic terms. Technical Report NA/12, Department of Applied Mathematics and Theoretical Physics, University of Cambridge, United Kingdom, 1981.

[10] Yu. Nesterov. Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2004.

[11] Yu. Nesterov. Accelerating the cubic regularization of Newton's method on convex problems. Mathematical Programming, 112(1):159-181, 2008.

[12] Yu. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.


More information

Revenue Management Under the Markov Chain Choice Model

Revenue Management Under the Markov Chain Choice Model Revenue Management Under the Markov Chain Choice Model Jacob B. Feldman School of Operations Research and Information Engineering, Cornell University, Ithaca, New York 14853, USA jbf232@cornell.edu Huseyin

More information

What can we do with numerical optimization?

What can we do with numerical optimization? Optimization motivation and background Eddie Wadbro Introduction to PDE Constrained Optimization, 2016 February 15 16, 2016 Eddie Wadbro, Introduction to PDE Constrained Optimization, February 15 16, 2016

More information

On the Optimality of a Family of Binary Trees Techical Report TR

On the Optimality of a Family of Binary Trees Techical Report TR On the Optimality of a Family of Binary Trees Techical Report TR-011101-1 Dana Vrajitoru and William Knight Indiana University South Bend Department of Computer and Information Sciences Abstract In this

More information

Infinite Reload Options: Pricing and Analysis

Infinite Reload Options: Pricing and Analysis Infinite Reload Options: Pricing and Analysis A. C. Bélanger P. A. Forsyth April 27, 2006 Abstract Infinite reload options allow the user to exercise his reload right as often as he chooses during the

More information

Portfolio Management and Optimal Execution via Convex Optimization

Portfolio Management and Optimal Execution via Convex Optimization Portfolio Management and Optimal Execution via Convex Optimization Enzo Busseti Stanford University April 9th, 2018 Problems portfolio management choose trades with optimization minimize risk, maximize

More information

Lecture 17: More on Markov Decision Processes. Reinforcement learning

Lecture 17: More on Markov Decision Processes. Reinforcement learning Lecture 17: More on Markov Decision Processes. Reinforcement learning Learning a model: maximum likelihood Learning a value function directly Monte Carlo Temporal-difference (TD) learning COMP-424, Lecture

More information

Principles of Financial Computing

Principles of Financial Computing Principles of Financial Computing Prof. Yuh-Dauh Lyuu Dept. Computer Science & Information Engineering and Department of Finance National Taiwan University c 2008 Prof. Yuh-Dauh Lyuu, National Taiwan University

More information

Sublinear Time Algorithms Oct 19, Lecture 1

Sublinear Time Algorithms Oct 19, Lecture 1 0368.416701 Sublinear Time Algorithms Oct 19, 2009 Lecturer: Ronitt Rubinfeld Lecture 1 Scribe: Daniel Shahaf 1 Sublinear-time algorithms: motivation Twenty years ago, there was practically no investigation

More information

Technical Report Doc ID: TR April-2009 (Last revised: 02-June-2009)

Technical Report Doc ID: TR April-2009 (Last revised: 02-June-2009) Technical Report Doc ID: TR-1-2009. 14-April-2009 (Last revised: 02-June-2009) The homogeneous selfdual model algorithm for linear optimization. Author: Erling D. Andersen In this white paper we present

More information

A Stochastic Levenberg-Marquardt Method Using Random Models with Application to Data Assimilation

A Stochastic Levenberg-Marquardt Method Using Random Models with Application to Data Assimilation A Stochastic Levenberg-Marquardt Method Using Random Models with Application to Data Assimilation E Bergou Y Diouane V Kungurtsev C W Royer July 5, 08 Abstract Globally convergent variants of the Gauss-Newton

More information

Lecture 7: Bayesian approach to MAB - Gittins index

Lecture 7: Bayesian approach to MAB - Gittins index Advanced Topics in Machine Learning and Algorithmic Game Theory Lecture 7: Bayesian approach to MAB - Gittins index Lecturer: Yishay Mansour Scribe: Mariano Schain 7.1 Introduction In the Bayesian approach

More information

Characterization of the Optimum

Characterization of the Optimum ECO 317 Economics of Uncertainty Fall Term 2009 Notes for lectures 5. Portfolio Allocation with One Riskless, One Risky Asset Characterization of the Optimum Consider a risk-averse, expected-utility-maximizing

More information

Approximate Revenue Maximization with Multiple Items

Approximate Revenue Maximization with Multiple Items Approximate Revenue Maximization with Multiple Items Nir Shabbat - 05305311 December 5, 2012 Introduction The paper I read is called Approximate Revenue Maximization with Multiple Items by Sergiu Hart

More information

arxiv: v3 [cs.lg] 1 Jul 2017

arxiv: v3 [cs.lg] 1 Jul 2017 Jonas Moritz Kohler 1 Aurelien Lucchi 1 arxiv:1705.05933v3 [cs.lg] 1 Jul 2017 Abstract We consider the minimization of non-convex functions that typically arise in machine learning. Specifically, we focus

More information

Worst-case evaluation complexity of regularization methods for smooth unconstrained optimization using Hölder continuous gradients

Worst-case evaluation complexity of regularization methods for smooth unconstrained optimization using Hölder continuous gradients Worst-case evaluation comlexity of regularization methods for smooth unconstrained otimization using Hölder continuous gradients C Cartis N I M Gould and Ph L Toint 26 June 205 Abstract The worst-case

More information

Short-time-to-expiry expansion for a digital European put option under the CEV model. November 1, 2017

Short-time-to-expiry expansion for a digital European put option under the CEV model. November 1, 2017 Short-time-to-expiry expansion for a digital European put option under the CEV model November 1, 2017 Abstract In this paper I present a short-time-to-expiry asymptotic series expansion for a digital European

More information

Forecast Horizons for Production Planning with Stochastic Demand

Forecast Horizons for Production Planning with Stochastic Demand Forecast Horizons for Production Planning with Stochastic Demand Alfredo Garcia and Robert L. Smith Department of Industrial and Operations Engineering Universityof Michigan, Ann Arbor MI 48109 December

More information

Lecture 5: Iterative Combinatorial Auctions

Lecture 5: Iterative Combinatorial Auctions COMS 6998-3: Algorithmic Game Theory October 6, 2008 Lecture 5: Iterative Combinatorial Auctions Lecturer: Sébastien Lahaie Scribe: Sébastien Lahaie In this lecture we examine a procedure that generalizes

More information

CS364A: Algorithmic Game Theory Lecture #14: Robust Price-of-Anarchy Bounds in Smooth Games

CS364A: Algorithmic Game Theory Lecture #14: Robust Price-of-Anarchy Bounds in Smooth Games CS364A: Algorithmic Game Theory Lecture #14: Robust Price-of-Anarchy Bounds in Smooth Games Tim Roughgarden November 6, 013 1 Canonical POA Proofs In Lecture 1 we proved that the price of anarchy (POA)

More information

Department of Mathematics. Mathematics of Financial Derivatives

Department of Mathematics. Mathematics of Financial Derivatives Department of Mathematics MA408 Mathematics of Financial Derivatives Thursday 15th January, 2009 2pm 4pm Duration: 2 hours Attempt THREE questions MA408 Page 1 of 5 1. (a) Suppose 0 < E 1 < E 3 and E 2

More information

Optimal Search for Parameters in Monte Carlo Simulation for Derivative Pricing

Optimal Search for Parameters in Monte Carlo Simulation for Derivative Pricing Optimal Search for Parameters in Monte Carlo Simulation for Derivative Pricing Prof. Chuan-Ju Wang Department of Computer Science University of Taipei Joint work with Prof. Ming-Yang Kao March 28, 2014

More information

Socially-Optimal Design of Crowdsourcing Platforms with Reputation Update Errors

Socially-Optimal Design of Crowdsourcing Platforms with Reputation Update Errors Socially-Optimal Design of Crowdsourcing Platforms with Reputation Update Errors 1 Yuanzhang Xiao, Yu Zhang, and Mihaela van der Schaar Abstract Crowdsourcing systems (e.g. Yahoo! Answers and Amazon Mechanical

More information

Best-Reply Sets. Jonathan Weinstein Washington University in St. Louis. This version: May 2015

Best-Reply Sets. Jonathan Weinstein Washington University in St. Louis. This version: May 2015 Best-Reply Sets Jonathan Weinstein Washington University in St. Louis This version: May 2015 Introduction The best-reply correspondence of a game the mapping from beliefs over one s opponents actions to

More information

Analysing multi-level Monte Carlo for options with non-globally Lipschitz payoff

Analysing multi-level Monte Carlo for options with non-globally Lipschitz payoff Finance Stoch 2009 13: 403 413 DOI 10.1007/s00780-009-0092-1 Analysing multi-level Monte Carlo for options with non-globally Lipschitz payoff Michael B. Giles Desmond J. Higham Xuerong Mao Received: 1

More information

The Correlation Smile Recovery

The Correlation Smile Recovery Fortis Bank Equity & Credit Derivatives Quantitative Research The Correlation Smile Recovery E. Vandenbrande, A. Vandendorpe, Y. Nesterov, P. Van Dooren draft version : March 2, 2009 1 Introduction Pricing

More information

Smoothed Analysis of Binary Search Trees

Smoothed Analysis of Binary Search Trees Smoothed Analysis of Binary Search Trees Bodo Manthey and Rüdiger Reischuk Universität zu Lübeck, Institut für Theoretische Informatik Ratzeburger Allee 160, 23538 Lübeck, Germany manthey/reischuk@tcs.uni-luebeck.de

More information

Final Projects Introduction to Numerical Analysis Professor: Paul J. Atzberger

Final Projects Introduction to Numerical Analysis Professor: Paul J. Atzberger Final Projects Introduction to Numerical Analysis Professor: Paul J. Atzberger Due Date: Friday, December 12th Instructions: In the final project you are to apply the numerical methods developed in the

More information

Chapter 3: Black-Scholes Equation and Its Numerical Evaluation

Chapter 3: Black-Scholes Equation and Its Numerical Evaluation Chapter 3: Black-Scholes Equation and Its Numerical Evaluation 3.1 Itô Integral 3.1.1 Convergence in the Mean and Stieltjes Integral Definition 3.1 (Convergence in the Mean) A sequence {X n } n ln of random

More information

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization for Strongly Convex Stochastic Optimization Microsoft Research New England NIPS 2011 Optimization Workshop Stochastic Convex Optimization Setting Goal: Optimize convex function F ( ) over convex domain

More information

The Yield Envelope: Price Ranges for Fixed Income Products

The Yield Envelope: Price Ranges for Fixed Income Products The Yield Envelope: Price Ranges for Fixed Income Products by David Epstein (LINK:www.maths.ox.ac.uk/users/epstein) Mathematical Institute (LINK:www.maths.ox.ac.uk) Oxford Paul Wilmott (LINK:www.oxfordfinancial.co.uk/pw)

More information

In Discrete Time a Local Martingale is a Martingale under an Equivalent Probability Measure

In Discrete Time a Local Martingale is a Martingale under an Equivalent Probability Measure In Discrete Time a Local Martingale is a Martingale under an Equivalent Probability Measure Yuri Kabanov 1,2 1 Laboratoire de Mathématiques, Université de Franche-Comté, 16 Route de Gray, 253 Besançon,

More information

Hints on Some of the Exercises

Hints on Some of the Exercises Hints on Some of the Exercises of the book R. Seydel: Tools for Computational Finance. Springer, 00/004/006/009/01. Preparatory Remarks: Some of the hints suggest ideas that may simplify solving the exercises

More information

3.2 No-arbitrage theory and risk neutral probability measure

3.2 No-arbitrage theory and risk neutral probability measure Mathematical Models in Economics and Finance Topic 3 Fundamental theorem of asset pricing 3.1 Law of one price and Arrow securities 3.2 No-arbitrage theory and risk neutral probability measure 3.3 Valuation

More information

On the Superlinear Local Convergence of a Filter-SQP Method. Stefan Ulbrich Zentrum Mathematik Technische Universität München München, Germany

On the Superlinear Local Convergence of a Filter-SQP Method. Stefan Ulbrich Zentrum Mathematik Technische Universität München München, Germany On the Superlinear Local Convergence of a Filter-SQP Method Stefan Ulbrich Zentrum Mathemati Technische Universität München München, Germany Technical Report, October 2002. Mathematical Programming manuscript

More information

Haiyang Feng College of Management and Economics, Tianjin University, Tianjin , CHINA

Haiyang Feng College of Management and Economics, Tianjin University, Tianjin , CHINA RESEARCH ARTICLE QUALITY, PRICING, AND RELEASE TIME: OPTIMAL MARKET ENTRY STRATEGY FOR SOFTWARE-AS-A-SERVICE VENDORS Haiyang Feng College of Management and Economics, Tianjin University, Tianjin 300072,

More information

Smooth estimation of yield curves by Laguerre functions

Smooth estimation of yield curves by Laguerre functions Smooth estimation of yield curves by Laguerre functions A.S. Hurn 1, K.A. Lindsay 2 and V. Pavlov 1 1 School of Economics and Finance, Queensland University of Technology 2 Department of Mathematics, University

More information

1 Residual life for gamma and Weibull distributions

1 Residual life for gamma and Weibull distributions Supplement to Tail Estimation for Window Censored Processes Residual life for gamma and Weibull distributions. Gamma distribution Let Γ(k, x = x yk e y dy be the upper incomplete gamma function, and let

More information

CS 3331 Numerical Methods Lecture 2: Functions of One Variable. Cherung Lee

CS 3331 Numerical Methods Lecture 2: Functions of One Variable. Cherung Lee CS 3331 Numerical Methods Lecture 2: Functions of One Variable Cherung Lee Outline Introduction Solving nonlinear equations: find x such that f(x ) = 0. Binary search methods: (Bisection, regula falsi)

More information

Maximum Contiguous Subsequences

Maximum Contiguous Subsequences Chapter 8 Maximum Contiguous Subsequences In this chapter, we consider a well-know problem and apply the algorithm-design techniques that we have learned thus far to this problem. While applying these

More information

PORTFOLIO OPTIMIZATION AND EXPECTED SHORTFALL MINIMIZATION FROM HISTORICAL DATA

PORTFOLIO OPTIMIZATION AND EXPECTED SHORTFALL MINIMIZATION FROM HISTORICAL DATA PORTFOLIO OPTIMIZATION AND EXPECTED SHORTFALL MINIMIZATION FROM HISTORICAL DATA We begin by describing the problem at hand which motivates our results. Suppose that we have n financial instruments at hand,

More information

OPTIMAL PORTFOLIO CONTROL WITH TRADING STRATEGIES OF FINITE

OPTIMAL PORTFOLIO CONTROL WITH TRADING STRATEGIES OF FINITE Proceedings of the 44th IEEE Conference on Decision and Control, and the European Control Conference 005 Seville, Spain, December 1-15, 005 WeA11.6 OPTIMAL PORTFOLIO CONTROL WITH TRADING STRATEGIES OF

More information

A Preference Foundation for Fehr and Schmidt s Model. of Inequity Aversion 1

A Preference Foundation for Fehr and Schmidt s Model. of Inequity Aversion 1 A Preference Foundation for Fehr and Schmidt s Model of Inequity Aversion 1 Kirsten I.M. Rohde 2 January 12, 2009 1 The author would like to thank Itzhak Gilboa, Ingrid M.T. Rohde, Klaus M. Schmidt, and

More information

Support Vector Machines: Training with Stochastic Gradient Descent

Support Vector Machines: Training with Stochastic Gradient Descent Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 Support vector machines Training by maximizing margin The SVM

More information

Market Liquidity and Performance Monitoring The main idea The sequence of events: Technology and information

Market Liquidity and Performance Monitoring The main idea The sequence of events: Technology and information Market Liquidity and Performance Monitoring Holmstrom and Tirole (JPE, 1993) The main idea A firm would like to issue shares in the capital market because once these shares are publicly traded, speculators

More information

Stability in geometric & functional inequalities

Stability in geometric & functional inequalities Stability in geometric & functional inequalities A. Figalli The University of Texas at Austin www.ma.utexas.edu/users/figalli/ Alessio Figalli (UT Austin) Stability in geom. & funct. ineq. Krakow, July

More information

Lecture 4: Divide and Conquer

Lecture 4: Divide and Conquer Lecture 4: Divide and Conquer Divide and Conquer Merge sort is an example of a divide-and-conquer algorithm Recall the three steps (at each level to solve a divideand-conquer problem recursively Divide

More information

LECTURE 2: MULTIPERIOD MODELS AND TREES

LECTURE 2: MULTIPERIOD MODELS AND TREES LECTURE 2: MULTIPERIOD MODELS AND TREES 1. Introduction One-period models, which were the subject of Lecture 1, are of limited usefulness in the pricing and hedging of derivative securities. In real-world

More information

COMBINATORICS OF REDUCTIONS BETWEEN EQUIVALENCE RELATIONS

COMBINATORICS OF REDUCTIONS BETWEEN EQUIVALENCE RELATIONS COMBINATORICS OF REDUCTIONS BETWEEN EQUIVALENCE RELATIONS DAN HATHAWAY AND SCOTT SCHNEIDER Abstract. We discuss combinatorial conditions for the existence of various types of reductions between equivalence

More information

The Value of Information in Central-Place Foraging. Research Report

The Value of Information in Central-Place Foraging. Research Report The Value of Information in Central-Place Foraging. Research Report E. J. Collins A. I. Houston J. M. McNamara 22 February 2006 Abstract We consider a central place forager with two qualitatively different

More information

Ellipsoid Method. ellipsoid method. convergence proof. inequality constraints. feasibility problems. Prof. S. Boyd, EE364b, Stanford University

Ellipsoid Method. ellipsoid method. convergence proof. inequality constraints. feasibility problems. Prof. S. Boyd, EE364b, Stanford University Ellipsoid Method ellipsoid method convergence proof inequality constraints feasibility problems Prof. S. Boyd, EE364b, Stanford University Ellipsoid method developed by Shor, Nemirovsky, Yudin in 1970s

More information

Optimal Allocation of Policy Limits and Deductibles

Optimal Allocation of Policy Limits and Deductibles Optimal Allocation of Policy Limits and Deductibles Ka Chun Cheung Email: kccheung@math.ucalgary.ca Tel: +1-403-2108697 Fax: +1-403-2825150 Department of Mathematics and Statistics, University of Calgary,

More information

Final Projects Introduction to Numerical Analysis atzberg/fall2006/index.html Professor: Paul J.

Final Projects Introduction to Numerical Analysis  atzberg/fall2006/index.html Professor: Paul J. Final Projects Introduction to Numerical Analysis http://www.math.ucsb.edu/ atzberg/fall2006/index.html Professor: Paul J. Atzberger Instructions: In the final project you will apply the numerical methods

More information

Non replication of options

Non replication of options Non replication of options Christos Kountzakis, Ioannis A Polyrakis and Foivos Xanthos June 30, 2008 Abstract In this paper we study the scarcity of replication of options in the two period model of financial

More information

Portfolio selection with multiple risk measures

Portfolio selection with multiple risk measures Portfolio selection with multiple risk measures Garud Iyengar Columbia University Industrial Engineering and Operations Research Joint work with Carlos Abad Outline Portfolio selection and risk measures

More information

IDENTIFYING BROAD AND NARROW FINANCIAL RISK FACTORS VIA CONVEX OPTIMIZATION: PART II

IDENTIFYING BROAD AND NARROW FINANCIAL RISK FACTORS VIA CONVEX OPTIMIZATION: PART II 1 IDENTIFYING BROAD AND NARROW FINANCIAL RISK FACTORS VIA CONVEX OPTIMIZATION: PART II Alexander D. Shkolnik ads2@berkeley.edu MMDS Workshop. June 22, 2016. joint with Jeffrey Bohn and Lisa Goldberg. Identifying

More information

Lecture IV Portfolio management: Efficient portfolios. Introduction to Finance Mathematics Fall Financial mathematics

Lecture IV Portfolio management: Efficient portfolios. Introduction to Finance Mathematics Fall Financial mathematics Lecture IV Portfolio management: Efficient portfolios. Introduction to Finance Mathematics Fall 2014 Reduce the risk, one asset Let us warm up by doing an exercise. We consider an investment with σ 1 =

More information

Econ 582 Nonlinear Regression

Econ 582 Nonlinear Regression Econ 582 Nonlinear Regression Eric Zivot June 3, 2013 Nonlinear Regression In linear regression models = x 0 β (1 )( 1) + [ x ]=0 [ x = x] =x 0 β = [ x = x] [ x = x] x = β it is assumed that the regression

More information

Catalyst Acceleration for Gradient-Based Non-Convex Optimization

Catalyst Acceleration for Gradient-Based Non-Convex Optimization Catalyst Acceleration for Gradient-Based Non-Convex Optimization Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, Zaid Harchaoui To cite this version: Courtney Paquette, Hongzhou Lin,

More information