Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)


MSc course on nonlinear optimization

UNCONSTRAINED MINIMIZATION

$$\underset{x \in \mathbb{R}^n}{\text{minimize}} \; f(x)$$

where the objective function $f : \mathbb{R}^n \to \mathbb{R}$.
Assume that $f \in C^1$ (sometimes $C^2$) and Lipschitz.
Often in practice this assumption is violated, but it is not necessary.

LINESEARCH VS TRUST-REGION METHODS

Linesearch methods:
  pick a descent direction $p_k$
  pick a stepsize $\alpha_k$ to reduce $f(x_k + \alpha p_k)$
  $x_{k+1} = x_k + \alpha_k p_k$

Trust-region methods:
  pick a step $s_k$ to reduce a model of $f(x_k + s)$
  accept $x_{k+1} = x_k + s_k$ if the decrease in the model is inherited by $f(x_k + s_k)$
  otherwise set $x_{k+1} = x_k$ and refine the model

TRUST-REGION MODEL PROBLEM

Model $f(x_k + s)$ by:
  linear model: $m_k^L(s) = f_k + s^T g_k$
  quadratic model (symmetric $B_k$): $m_k^Q(s) = f_k + s^T g_k + \tfrac12 s^T B_k s$

Major difficulties:
  models may not resemble $f(x_k + s)$ if $s$ is large
  models may be unbounded from below:
    linear model: always, unless $g_k = 0$
    quadratic model: always if $B_k$ is indefinite, possibly if $B_k$ is only positive semi-definite

THE TRUST REGION

Prevent the model $m_k(s)$ from unboundedness by imposing a trust-region constraint
$$\|s\| \le \Delta_k$$
for some suitable scalar radius $\Delta_k > 0$
$\Longrightarrow$ trust-region subproblem
$$\underset{s \in \mathbb{R}^n}{\text{approx minimize}} \; m_k(s) \quad \text{subject to} \quad \|s\| \le \Delta_k$$
In theory the method does not depend on the norm $\|\cdot\|$; in practice it might!

OUR MODEL

For simplicity, concentrate on the second-order (Newton-like) model
$$m_k(s) = m_k^Q(s) = f_k + s^T g_k + \tfrac12 s^T B_k s$$
and the $\ell_2$ trust-region norm $\|\cdot\| = \|\cdot\|_2$.

Note:
  $B_k = H_k$ is allowed
  analysis for other trust-region norms simply adds extra constants in the following results

BASIC TRUST-REGION METHOD

Given $k = 0$, $\Delta_0 > 0$ and $x_0$, until convergence do:
  Build the second-order model $m(s)$ of $f(x_k + s)$.
  Solve the trust-region subproblem to find $s_k$ for which $m(s_k) < f_k$ and $\|s_k\| \le \Delta_k$, and define
  $$\rho_k = \frac{f_k - f(x_k + s_k)}{f_k - m_k(s_k)}.$$
  If $\rho_k \ge \eta_v$ [very successful], where $0 < \eta_v < 1$:
    set $x_{k+1} = x_k + s_k$ and $\Delta_{k+1} = \gamma_i \Delta_k$, where $\gamma_i \ge 1$
  Otherwise if $\rho_k \ge \eta_s$ [successful], where $0 < \eta_s \le \eta_v < 1$:
    set $x_{k+1} = x_k + s_k$ and $\Delta_{k+1} = \Delta_k$
  Otherwise [unsuccessful]:
    set $x_{k+1} = x_k$ and $\Delta_{k+1} = \gamma_d \Delta_k$, where $0 < \gamma_d < 1$
  Increase $k$ by 1.

SOLVE THE TRUST REGION SUBPROBLEM?

At the very least, aim to achieve as much reduction in the model as would an iteration of steepest descent.

Cauchy point: $s_k^C = -\alpha_k^C g_k$, where
$$\alpha_k^C = \arg\min_{\alpha > 0} m_k(-\alpha g_k) \;\; \text{subject to} \;\; \alpha \|g_k\| \le \Delta_k \;\; = \arg\min_{0 < \alpha \le \Delta_k / \|g_k\|} m_k(-\alpha g_k)$$
This minimizes a quadratic on a line segment $\Longrightarrow$ very easy!

Require that
$$m_k(s_k) \le m_k(s_k^C) \quad \text{and} \quad \|s_k\| \le \Delta_k$$
In practice, hope to do far better than this.
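The basic method and the Cauchy point can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the course's implementation: the function names (`cauchy_point`, `trust_region`), the default parameter values, the stopping test on $\|g_k\|$, and the choice $B_k = H_k$ with the Cauchy point itself used as the step $s_k$ are all assumptions made for the example. Plain Python lists are used to keep it dependency-free.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(B, v):
    return [dot(row, v) for row in B]

def cauchy_point(g, B, delta):
    """Minimize the quadratic model along -g subject to ||s||_2 <= delta."""
    gnorm = math.sqrt(dot(g, g))
    if gnorm == 0.0:
        return [0.0] * len(g)
    curvature = dot(g, matvec(B, g))
    alpha = delta / gnorm                        # step length to the boundary
    if curvature > 0.0:
        alpha = min(alpha, gnorm ** 2 / curvature)   # interior minimizer, if closer
    return [-alpha * gi for gi in g]

def trust_region(f, grad, hess, x0, delta0=1.0, eta_s=0.1, eta_v=0.9,
                 gamma_i=2.0, gamma_d=0.5, tol=1e-8, max_iter=500):
    """Basic trust-region loop; the step is the Cauchy point (an assumption)."""
    x, delta = list(x0), delta0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if math.sqrt(dot(g, g)) <= tol:          # assumed convergence test
            break
        s = cauchy_point(g, B, delta)
        model_decrease = -(dot(g, s) + 0.5 * dot(s, matvec(B, s)))  # f_k - m_k(s_k)
        x_trial = [xi + si for xi, si in zip(x, s)]
        rho = (f(x) - f(x_trial)) / model_decrease
        if rho >= eta_v:                         # very successful
            x, delta = x_trial, gamma_i * delta
        elif rho >= eta_s:                       # successful
            x = x_trial
        else:                                    # unsuccessful
            delta = gamma_d * delta
    return x

# Usage on f(x) = (x0 - 1)^2 + 2 (x1 + 0.5)^2, whose minimizer is (1, -0.5).
f = lambda x: (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2
grad = lambda x: [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)]
hess = lambda x: [[2.0, 0.0], [0.0, 4.0]]
x_min = trust_region(f, grad, hess, [0.0, 0.0])
```

Because only the Cauchy point is used, this behaves like steepest descent with a safeguarded step; a better subproblem solver would replace the call to `cauchy_point`.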

ACHIEVABLE MODEL DECREASE

Theorem 3.1. If $m_k(s)$ is the second-order model and $s_k^C$ is its Cauchy point within the trust-region $\|s\| \le \Delta_k$, then
$$f_k - m_k(s_k^C) \ge \tfrac12 \|g_k\| \min\left[ \frac{\|g_k\|}{1 + \|B_k\|}, \Delta_k \right].$$

PROOF OF THEOREM 3.1

$$m_k(-\alpha g_k) = f_k - \alpha \|g_k\|^2 + \tfrac12 \alpha^2 g_k^T B_k g_k.$$
The result is immediate if $g_k = 0$. Otherwise, there are 3 possibilities:
  (i) the curvature $g_k^T B_k g_k \le 0$ $\Longrightarrow$ $m_k(-\alpha g_k)$ is unbounded from below as $\alpha$ increases $\Longrightarrow$ the Cauchy point occurs on the trust-region boundary.
  (ii) the curvature $g_k^T B_k g_k > 0$ and the minimizer of $m_k(-\alpha g_k)$ occurs at or beyond the trust-region boundary $\Longrightarrow$ the Cauchy point occurs on the trust-region boundary.
  (iii) the curvature $g_k^T B_k g_k > 0$ and the minimizer of $m_k(-\alpha g_k)$, and hence the Cauchy point, occurs before the trust-region boundary is reached.
Consider each case in turn.

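Case (i)

$g_k^T B_k g_k \le 0$ and $\alpha \ge 0$ $\Longrightarrow$
$$m_k(-\alpha g_k) = f_k - \alpha \|g_k\|^2 + \tfrac12 \alpha^2 g_k^T B_k g_k \le f_k - \alpha \|g_k\|^2 \qquad (1)$$
The Cauchy point lies on the boundary of the trust region $\Longrightarrow$
$$\alpha_k^C = \frac{\Delta_k}{\|g_k\|}. \qquad (2)$$
(1) + (2) $\Longrightarrow$
$$f_k - m_k(s_k^C) \ge \|g_k\|^2 \frac{\Delta_k}{\|g_k\|} = \|g_k\| \Delta_k \ge \tfrac12 \|g_k\| \Delta_k.$$

Case (ii)

$$\alpha_k^* \overset{\text{def}}{=} \arg\min_\alpha \; m_k(-\alpha g_k) \equiv f_k - \alpha \|g_k\|^2 + \tfrac12 \alpha^2 g_k^T B_k g_k \qquad (3)$$
$\Longrightarrow$
$$\alpha_k^* = \frac{\|g_k\|^2}{g_k^T B_k g_k} \ge \alpha_k^C = \frac{\Delta_k}{\|g_k\|} \qquad (4)$$
$\Longrightarrow$
$$\alpha_k^C \, g_k^T B_k g_k \le \|g_k\|^2. \qquad (5)$$
(3) + (4) + (5) $\Longrightarrow$
$$f_k - m_k(s_k^C) = \alpha_k^C \|g_k\|^2 - \tfrac12 [\alpha_k^C]^2 g_k^T B_k g_k \ge \tfrac12 \alpha_k^C \|g_k\|^2 = \tfrac12 \|g_k\|^2 \frac{\Delta_k}{\|g_k\|} = \tfrac12 \|g_k\| \Delta_k.$$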

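Case (iii)

$$\alpha_k^C = \alpha_k^* = \frac{\|g_k\|^2}{g_k^T B_k g_k}$$
$\Longrightarrow$
$$f_k - m_k(s_k^C) = \alpha_k^* \|g_k\|^2 - \tfrac12 (\alpha_k^*)^2 g_k^T B_k g_k = \frac{\|g_k\|^4}{g_k^T B_k g_k} - \tfrac12 \frac{\|g_k\|^4}{g_k^T B_k g_k} = \tfrac12 \frac{\|g_k\|^4}{g_k^T B_k g_k} \ge \tfrac12 \frac{\|g_k\|^2}{1 + \|B_k\|},$$
where
$$|g_k^T B_k g_k| \le \|g_k\|^2 \|B_k\| \le \|g_k\|^2 (1 + \|B_k\|)$$
because of the Cauchy-Schwarz inequality.

Corollary 3.2. If $m_k(s)$ is the second-order model, and $s_k$ is an improvement on the Cauchy point within the trust-region $\|s\| \le \Delta_k$, then
$$f_k - m_k(s_k) \ge \tfrac12 \|g_k\| \min\left[ \frac{\|g_k\|}{1 + \|B_k\|}, \Delta_k \right].$$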
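As a sanity check on the bound of Theorem 3.1, here is a small worked instance. The data $g_k$, $B_k$ and $\Delta_k$ are made up for illustration and do not come from the slides; the $\ell_2$ norm is used throughout.

```latex
\begin{align*}
g_k &= (2, 0)^T, \quad B_k = \mathrm{diag}(1, 3), \quad \Delta_k = 1,
      \quad \|g_k\|_2 = 2, \quad \|B_k\|_2 = 3, \\
g_k^T B_k g_k &= 4 > 0, \quad
  \alpha_k^* = \frac{\|g_k\|^2}{g_k^T B_k g_k} = 1
  \ge \frac{\Delta_k}{\|g_k\|} = \tfrac12 = \alpha_k^C
  \quad \text{(case (ii))}, \\
f_k - m_k(s_k^C) &= \alpha_k^C \|g_k\|^2 - \tfrac12 [\alpha_k^C]^2 g_k^T B_k g_k
  = 2 - \tfrac12 = \tfrac32, \\
\tfrac12 \|g_k\| \min\left[ \frac{\|g_k\|}{1 + \|B_k\|}, \Delta_k \right]
  &= \tfrac12 \cdot 2 \cdot \min\left[ \tfrac12, 1 \right] = \tfrac12 \le \tfrac32 .
\end{align*}
```

The actual decrease ($3/2$) comfortably exceeds the guaranteed decrease ($1/2$), as the theorem requires.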

DIFFERENCE BETWEEN MODEL AND FUNCTION

Lemma 3.3. Suppose that $f \in C^2$, and that the true and model Hessians satisfy the bounds $\|H(x)\| \le \kappa_h$ for all $x$ and $\|B_k\| \le \kappa_b$ for all $k$ and some $\kappa_h \ge 1$ and $\kappa_b \ge 0$. Then
$$|f(x_k + s_k) - m_k(s_k)| \le \kappa_d \Delta_k^2,$$
where $\kappa_d = \tfrac12 (\kappa_h + \kappa_b)$, for all $k$.

PROOF OF LEMMA 3.3

Mean value theorem $\Longrightarrow$
$$f(x_k + s_k) = f(x_k) + s_k^T \nabla_x f(x_k) + \tfrac12 s_k^T \nabla_{xx} f(\xi_k) s_k$$
for some $\xi_k \in [x_k, x_k + s_k]$. Thus
$$|f(x_k + s_k) - m_k(s_k)| = \tfrac12 |s_k^T H(\xi_k) s_k - s_k^T B_k s_k| \le \tfrac12 |s_k^T H(\xi_k) s_k| + \tfrac12 |s_k^T B_k s_k| \le \tfrac12 (\kappa_h + \kappa_b) \|s_k\|^2 \le \kappa_d \Delta_k^2$$
using the triangle and Cauchy-Schwarz inequalities.

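ULTIMATE PROGRESS AT NON-OPTIMAL POINTS

Lemma 3.4. Suppose that $f \in C^2$, that the true and model Hessians satisfy the bounds $\|H_k\| \le \kappa_h$ and $\|B_k\| \le \kappa_b$ for all $k$ and some $\kappa_h \ge 1$ and $\kappa_b \ge 0$, and that $\kappa_d = \tfrac12 (\kappa_h + \kappa_b)$. Suppose furthermore that $g_k \ne 0$ and that
$$\Delta_k \le \|g_k\| \min\left( \frac{1}{\kappa_h + \kappa_b}, \frac{1 - \eta_v}{2 \kappa_d} \right).$$
Then iteration $k$ is very successful and
$$\Delta_{k+1} \ge \Delta_k.$$

PROOF OF LEMMA 3.4

By definition, $1 + \|B_k\| \le \kappa_h + \kappa_b$. Together with the first bound on $\Delta_k$ this $\Longrightarrow$
$$\Delta_k \le \frac{\|g_k\|}{\kappa_h + \kappa_b} \le \frac{\|g_k\|}{1 + \|B_k\|}.$$
Corollary 3.2 $\Longrightarrow$
$$f_k - m_k(s_k) \ge \tfrac12 \|g_k\| \min\left[ \frac{\|g_k\|}{1 + \|B_k\|}, \Delta_k \right] = \tfrac12 \|g_k\| \Delta_k.$$
Together with Lemma 3.3 and the second bound on $\Delta_k$ this $\Longrightarrow$
$$|\rho_k - 1| = \left| \frac{f(x_k + s_k) - m_k(s_k)}{f_k - m_k(s_k)} \right| \le 2 \frac{\kappa_d \Delta_k^2}{\|g_k\| \Delta_k} = 2 \frac{\kappa_d \Delta_k}{\|g_k\|} \le 1 - \eta_v$$
$\Longrightarrow$ $\rho_k \ge \eta_v$ $\Longrightarrow$ the iteration is very successful.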

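RADIUS WON'T SHRINK TO ZERO AT NON-OPTIMAL POINTS

Lemma 3.5. Suppose that $f \in C^2$, that the true and model Hessians satisfy the bounds $\|H_k\| \le \kappa_h$ and $\|B_k\| \le \kappa_b$ for all $k$ and some $\kappa_h \ge 1$ and $\kappa_b \ge 0$, and that $\kappa_d = \tfrac12 (\kappa_h + \kappa_b)$. Suppose furthermore that there exists a constant $\epsilon > 0$ such that $\|g_k\| \ge \epsilon$ for all $k$. Then
$$\Delta_k \ge \kappa_\epsilon \overset{\text{def}}{=} \epsilon \gamma_d \min\left( \frac{1}{\kappa_h + \kappa_b}, \frac{1 - \eta_v}{2 \kappa_d} \right)$$
for all $k$.

PROOF OF LEMMA 3.5

Suppose otherwise that iteration $k$ is the first for which $\Delta_{k+1} < \kappa_\epsilon$.
$\Delta_k > \Delta_{k+1}$ $\Longrightarrow$ iteration $k$ is unsuccessful $\Longrightarrow$ $\gamma_d \Delta_k \le \Delta_{k+1}$. Hence
$$\Delta_k \le \epsilon \min\left( \frac{1}{\kappa_h + \kappa_b}, \frac{1 - \eta_v}{2 \kappa_d} \right) \le \|g_k\| \min\left( \frac{1}{\kappa_h + \kappa_b}, \frac{1 - \eta_v}{2 \kappa_d} \right).$$
But this contradicts the assertion of Lemma 3.4 that iteration $k$ must be very successful.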

POSSIBLE FINITE TERMINATION

Lemma 3.6. Suppose that $f \in C^2$, and that both the true and model Hessians remain bounded for all $k$. Suppose furthermore that there are only finitely many successful iterations. Then $x_k = x_*$ for all sufficiently large $k$ and $g(x_*) = 0$.

PROOF OF LEMMA 3.6

$$x_{k_0 + j} = x_{k_0 + 1} = x_*$$
for all $j > 0$, where $k_0$ is the index of the last successful iterate.
All iterations are unsuccessful for sufficiently large $k$ $\Longrightarrow$ $\{\Delta_k\} \to 0$.
Lemma 3.4 then implies that if $\|g_{k_0 + 1}\| > 0$ there must be a successful iteration of index larger than $k_0$, which is impossible $\Longrightarrow$ $\|g_{k_0 + 1}\| = 0$.

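GLOBAL CONVERGENCE OF ONE SEQUENCE

Theorem 3.7. Suppose that $f \in C^2$, and that both the true and model Hessians remain bounded for all $k$. Then either
$$g_l = 0 \text{ for some } l \ge 0$$
or
$$\lim_{k \to \infty} f_k = -\infty$$
or
$$\liminf_{k \to \infty} \|g_k\| = 0.$$

PROOF OF THEOREM 3.7

Let $S$ be the index set of successful iterations. Lemma 3.6 $\Longrightarrow$ Theorem 3.7 is true when $|S|$ is finite.
So consider $|S| = \infty$, and suppose that $f_k$ is bounded below and
$$\|g_k\| \ge \epsilon \qquad (6)$$
for some $\epsilon > 0$ and all $k$, and consider some $k \in S$.
Corollary 3.2, Lemma 3.5, and the assumption (6) $\Longrightarrow$
$$f_k - f_{k+1} \ge \eta_s [f_k - m_k(s_k)] \ge \delta_\epsilon \overset{\text{def}}{=} \tfrac12 \eta_s \epsilon \min\left[ \frac{\epsilon}{1 + \kappa_b}, \kappa_\epsilon \right]$$
$\Longrightarrow$
$$f_0 - f_{k+1} = \sum_{\substack{j = 0 \\ j \in S}}^{k} [f_j - f_{j+1}] \ge \sigma_k \delta_\epsilon,$$
where $\sigma_k$ is the number of successful iterations up to iteration $k$. But
$$\lim_{k \to \infty} \sigma_k = +\infty$$
$\Longrightarrow$ $f_k$ is unbounded below, a contradiction $\Longrightarrow$ a subsequence of the $\|g_k\| \to 0$.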

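GLOBAL CONVERGENCE

Theorem 3.8. Suppose that $f \in C^2$, and that both the true and model Hessians remain bounded for all $k$. Then either
$$g_l = 0 \text{ for some } l \ge 0$$
or
$$\lim_{k \to \infty} f_k = -\infty$$
or
$$\lim_{k \to \infty} g_k = 0.$$

II: SOLVING THE TRUST-REGION SUBPROBLEM

$$\underset{s \in \mathbb{R}^n}{\text{(approximately) minimize}} \; q(s) \equiv s^T g + \tfrac12 s^T B s \quad \text{subject to} \quad \|s\| \le \Delta$$
AIM: find $s_*$ so that
$$q(s_*) \le q(s^C) \quad \text{and} \quad \|s_*\| \le \Delta$$
Might solve
  exactly $\Longrightarrow$ Newton-like method
  approximately $\Longrightarrow$ steepest descent/conjugate gradients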

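THE $\ell_2$-NORM TRUST-REGION SUBPROBLEM

$$\underset{s \in \mathbb{R}^n}{\text{minimize}} \; q(s) \equiv s^T g + \tfrac12 s^T B s \quad \text{subject to} \quad \|s\|_2 \le \Delta$$
Solution characterisation result:

Theorem 3.9. Any global minimizer $s_*$ of $q(s)$ subject to $\|s\|_2 \le \Delta$ satisfies the equation
$$(B + \lambda_* I) s_* = -g,$$
where $B + \lambda_* I$ is positive semi-definite, $\lambda_* \ge 0$ and $\lambda_* (\|s_*\|_2 - \Delta) = 0$. If $B + \lambda_* I$ is positive definite, $s_*$ is unique.

PROOF OF THEOREM 3.9

The problem is equivalent to minimizing $q(s)$ subject to $\tfrac12 \Delta^2 - \tfrac12 s^T s \ge 0$. Theorem 1.9 $\Longrightarrow$
$$g + B s_* = -\lambda_* s_* \qquad (7)$$
for some Lagrange multiplier $\lambda_* \ge 0$ for which either $\lambda_* = 0$ or $\|s_*\|_2 = \Delta$ (or both). It remains to show that $B + \lambda_* I$ is positive semi-definite.

If $s_*$ lies in the interior of the trust-region, $\lambda_* = 0$, and Theorem 1.10 $\Longrightarrow$ $B + \lambda_* I = B$ is positive semi-definite.

If $\|s_*\|_2 = \Delta$ and $\lambda_* = 0$, Theorem 1.10 $\Longrightarrow$ $v^T B v \ge 0$ for all $v \in N_+ = \{ v \mid s_*^T v \ge 0 \}$. If $v \notin N_+$ $\Longrightarrow$ $-v \in N_+$ $\Longrightarrow$ $v^T B v \ge 0$ for all $v$.

The only remaining case is where $\|s_*\|_2 = \Delta$ and $\lambda_* > 0$. Theorem 1.10 $\Longrightarrow$ $v^T (B + \lambda_* I) v \ge 0$ for all $v \in N_+ = \{ v \mid s_*^T v = 0 \}$ $\Longrightarrow$ it remains to consider $v^T B v$ when $s_*^T v \ne 0$.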

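[Figure 3.1: Construction of missing directions of positive curvature. Labels: $s$, $s_*$, $w$, $N_+$.]

Let $s$ be any point on the boundary $\delta R$ of the trust-region $R$, and let $w = s - s_*$. Then
$$-w^T s_* = (s_* - s)^T s_* = \tfrac12 (s_* - s)^T (s_* - s) = \tfrac12 w^T w \qquad (8)$$
since $\|s\|_2 = \Delta = \|s_*\|_2$. (7) + (8) $\Longrightarrow$
$$q(s) - q(s_*) = w^T (g + B s_*) + \tfrac12 w^T B w = -\lambda_* w^T s_* + \tfrac12 w^T B w = \tfrac12 w^T (B + \lambda_* I) w \qquad (9)$$
$\Longrightarrow$ $w^T (B + \lambda_* I) w \ge 0$ since $s_*$ is a global minimizer. But
$$s = s_* - 2 \frac{s_*^T v}{v^T v} v \in \delta R$$
$\Longrightarrow$ (for this $s$) $w \parallel v$ $\Longrightarrow$ $v^T (B + \lambda_* I) v \ge 0$.

When $B + \lambda_* I$ is positive definite, $s_* = -(B + \lambda_* I)^{-1} g$. If $s_* \in \delta R$ and $s \in R$, (8) and (9) become $-w^T s_* \ge \tfrac12 w^T w$ and $q(s) \ge q(s_*) + \tfrac12 w^T (B + \lambda_* I) w$ respectively. Hence, $q(s) > q(s_*)$ for any $s \ne s_*$. If $s_*$ is interior, $\lambda_* = 0$, $B$ is positive definite, and thus $s_*$ is the unique unconstrained minimizer of $q(s)$.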

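ALGORITHMS FOR THE $\ell_2$-NORM SUBPROBLEM

Two cases:
  $B$ positive semi-definite and $B s = -g$ satisfies $\|s\|_2 \le \Delta$ $\Longrightarrow$ $s_* = s$
  $B$ indefinite or $B s = -g$ satisfies $\|s\|_2 > \Delta$. In this case
    $(B + \lambda_* I) s_* = -g$ and $s_*^T s_* = \Delta^2$
    this is a nonlinear (quadratic) system in $s$ and $\lambda$
    concentrate on this case

EQUALITY CONSTRAINED $\ell_2$-NORM SUBPROBLEM

Suppose $B$ has the spectral decomposition
$$B = U^T \Lambda U$$
  $U$: eigenvectors
  $\Lambda$: diagonal matrix of eigenvalues, $\lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$
Require $B + \lambda I$ positive semi-definite $\Longrightarrow$ $\lambda \ge -\lambda_1$.
Define
$$s(\lambda) = -(B + \lambda I)^{-1} g$$
Require
$$\psi(\lambda) \overset{\text{def}}{=} \|s(\lambda)\|_2^2 = \Delta^2$$
Note that, with $\gamma_i = e_i^T U g$,
$$\psi(\lambda) = \| U^T (\Lambda + \lambda I)^{-1} U g \|_2^2 = \sum_{i=1}^{n} \frac{\gamma_i^2}{(\lambda_i + \lambda)^2}$$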
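To make the formula for $\psi(\lambda)$ concrete, here is a small worked instance with an indefinite diagonal $B$, so that $U = I$ and $\gamma = g$. The numbers are invented for illustration and are not the examples plotted on the slides.

```latex
\begin{align*}
B &= \mathrm{diag}(-2, 1), \quad g = (1, 0)^T, \quad \Delta = 1,
     \quad \lambda_1 = -2, \quad \lambda_2 = 1, \quad \gamma = (1, 0)^T, \\
\psi(\lambda) &= \frac{\gamma_1^2}{(\lambda_1 + \lambda)^2} + \frac{\gamma_2^2}{(\lambda_2 + \lambda)^2}
   = \frac{1}{(\lambda - 2)^2}, \qquad \lambda > -\lambda_1 = 2, \\
\psi(\lambda_*) &= \Delta^2 = 1 \;\Longrightarrow\; \lambda_* = 3, \\
s_* &= -(B + \lambda_* I)^{-1} g = -\mathrm{diag}(1, \tfrac14)\,(1, 0)^T = (-1, 0)^T,
   \qquad \|s_*\|_2 = 1 = \Delta .
\end{align*}
```

Here $B + \lambda_* I = \mathrm{diag}(1, 4)$ is positive definite, so by Theorem 3.9 this $s_*$ is the unique global minimizer.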

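CONVEX EXAMPLE

[Figure: plot of $\psi(\lambda)$ against $\lambda$ for a convex example with given $B$ and $g$, showing the level $\Delta^2$ and the solution curve as $\Delta$ varies.]

NONCONVEX EXAMPLE

[Figure: plot of $\psi(\lambda)$ against $\lambda$ for a nonconvex example with given $B$ and $g$, showing the level $\Delta^2$ and minus the leftmost eigenvalue.]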

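THE HARD CASE

[Figure: plot of $\psi(\lambda)$ against $\lambda$ for a hard-case example with given $B$ and $g$, showing the level $\Delta^2$ and minus the leftmost eigenvalue.]

SUMMARY

For indefinite $B$, the hard case occurs when $g$ is orthogonal to the eigenvector $u_1$ for the most negative eigenvalue $\lambda_1$.
  OK if the radius is small enough
  No obvious solution to the equations ... but the solution is actually of the form
  $$s_{\lim} + \sigma u_1$$
  where
  $$s_{\lim} = \lim_{\lambda \to -\lambda_1^+} s(\lambda) \quad \text{and} \quad \| s_{\lim} + \sigma u_1 \|_2 = \Delta$$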
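A two-dimensional instance shows how the hard case plays out. The data below are invented for illustration; $u_1 = e_1$ is the eigenvector for $\lambda_1 = -1$, and $g$ is orthogonal to it.

```latex
\begin{align*}
B &= \mathrm{diag}(-1, 1), \quad g = (0, 1)^T, \quad \Delta = 1,
     \quad \lambda_1 = -1, \quad u_1 = (1, 0)^T, \quad g^T u_1 = 0, \\
s(\lambda) &= -(B + \lambda I)^{-1} g = \left( 0, \; -\frac{1}{1 + \lambda} \right)^T,
   \qquad \|s(\lambda)\|_2 = \frac{1}{1 + \lambda} < \tfrac12 < \Delta
   \quad \text{for all } \lambda > -\lambda_1 = 1, \\
s_{\lim} &= \lim_{\lambda \to 1^+} s(\lambda) = (0, -\tfrac12)^T, \\
\| s_{\lim} + \sigma u_1 \|_2 &= \Delta
   \;\Longrightarrow\; \sigma^2 + \tfrac14 = 1
   \;\Longrightarrow\; \sigma = \pm \tfrac{\sqrt{3}}{2}, \\
s_* &= \left( \pm \tfrac{\sqrt{3}}{2}, \; -\tfrac12 \right)^T, \qquad
(B + \lambda_* I) s_* = \mathrm{diag}(0, 2)\, s_* = (0, -1)^T = -g
   \quad \text{with } \lambda_* = 1 .
\end{align*}
```

No $\lambda > -\lambda_1$ gives $\|s(\lambda)\|_2 = \Delta$, yet $\lambda_* = -\lambda_1$ with a suitable multiple of $u_1$ satisfies the conditions of Theorem 3.9, since $B + \lambda_* I = \mathrm{diag}(0, 2)$ is positive semi-definite. With a smaller radius, $\Delta < \tfrac12$, an ordinary root $\lambda > 1$ exists and the difficulty disappears.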

HOW TO SOLVE $\|s(\lambda)\|_2 = \Delta$

DON'T!!

Solve instead the secular equation
$$\phi(\lambda) \overset{\text{def}}{=} \frac{1}{\|s(\lambda)\|_2} - \frac{1}{\Delta} = 0$$
  no poles
  smallest at eigenvalues (except in the hard case!)
  analytic function $\Longrightarrow$ ideal for Newton
  globally convergent (ultimately quadratic rate except in the hard case)
  need to safeguard to protect Newton from the hard and interior solution cases

THE SECULAR EQUATION

[Figure: plot of $\phi(\lambda)$ against $\lambda$ for a small example problem.]

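NEWTON'S METHOD FOR SECULAR EQUATION

The Newton correction at $\lambda$ is $-\phi(\lambda) / \phi'(\lambda)$. Differentiating
$$\phi(\lambda) = \frac{1}{\|s(\lambda)\|_2} - \frac{1}{\Delta} = (s^T(\lambda) s(\lambda))^{-1/2} - \frac{1}{\Delta}$$
$\Longrightarrow$
$$\phi'(\lambda) = -\frac{s^T(\lambda) \nabla_\lambda s(\lambda)}{(s^T(\lambda) s(\lambda))^{3/2}} = -\frac{s^T(\lambda) \nabla_\lambda s(\lambda)}{\|s(\lambda)\|_2^3}.$$
Differentiating the defining equation
$$(B + \lambda I) s(\lambda) = -g \;\Longrightarrow\; (B + \lambda I) \nabla_\lambda s(\lambda) + s(\lambda) = 0.$$
Notice that, rather than $\nabla_\lambda s(\lambda)$, merely
$$s^T(\lambda) \nabla_\lambda s(\lambda) = -s^T(\lambda) (B + \lambda I)^{-1} s(\lambda)$$
is required for $\phi'(\lambda)$. Given the factorization $B + \lambda I = L(\lambda) L^T(\lambda)$ $\Longrightarrow$
$$s^T(\lambda) (B + \lambda I)^{-1} s(\lambda) = s^T(\lambda) L^{-T}(\lambda) L^{-1}(\lambda) s(\lambda) = (L^{-1}(\lambda) s(\lambda))^T (L^{-1}(\lambda) s(\lambda)) = \|w(\lambda)\|_2^2$$
where $L(\lambda) w(\lambda) = s(\lambda)$.

NEWTON'S METHOD & THE SECULAR EQUATION

Let $\lambda > -\lambda_1$ and $\Delta > 0$ be given.
Until convergence do:
  Factorize $B + \lambda I = L L^T$
  Solve $L L^T s = -g$
  Solve $L w = s$
  Replace $\lambda$ by
  $$\lambda + \left( \frac{\|s\|_2 - \Delta}{\Delta} \right) \left( \frac{\|s\|_2^2}{\|w\|_2^2} \right)$$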
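The Newton iteration for the secular equation can be sketched as follows. This is a minimal, unsafeguarded illustration in plain Python, assuming the easy case with a starting $\lambda > -\lambda_1$ for which $\|s(\lambda)\|_2 > \Delta$; the helper names (`cholesky`, `forward_solve`, `backward_solve`, `secular_newton`), the tolerance and the iteration cap are hypothetical choices for this example, and the safeguards mentioned in the slides for the hard and interior cases are deliberately omitted.

```python
import math

def cholesky(A):
    """Lower-triangular L with A = L L^T; raises if A is not positive definite."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(acc)
            else:
                L[i][j] = acc / L[j][j]
    return L

def forward_solve(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def backward_solve(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def secular_newton(B, g, delta, lam, tol=1e-12, max_iter=50):
    """Newton's method on phi(lam) = 1/||s(lam)|| - 1/delta (no safeguards)."""
    n = len(g)
    for _ in range(max_iter):
        shifted = [[B[i][j] + (lam if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
        L = cholesky(shifted)                          # B + lam I = L L^T
        s = backward_solve(L, forward_solve(L, [-gi for gi in g]))  # L L^T s = -g
        s_norm = math.sqrt(sum(si * si for si in s))
        if abs(s_norm - delta) <= tol * delta:         # assumed stopping test
            break
        w = forward_solve(L, s)                        # L w = s
        w_norm_sq = sum(wi * wi for wi in w)
        lam += (s_norm - delta) / delta * s_norm ** 2 / w_norm_sq
    return lam, s

# Usage: B is positive definite here, and ||s(0)||_2 is about 0.447 > delta.
B = [[2.0, 1.0], [1.0, 3.0]]
g = [1.0, 1.0]
lam, s = secular_newton(B, g, delta=0.25, lam=0.0)
```

Starting where $\|s(\lambda)\|_2 > \Delta$ keeps the iterates to the left of the root, where the factorization of $B + \lambda I$ continues to exist; starting on the other side may push $\lambda$ below $-\lambda_1$, which is one reason a practical code needs safeguards.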

SOLVING THE LARGE-SCALE PROBLEM

  when $n$ is large, factorization may be impossible
  may instead try to use an iterative method to approximate the solution
    steepest descent leads to the Cauchy point
    obvious generalization: conjugate gradients ... but
      what about the trust region?
      what about negative curvature?

CONJUGATE GRADIENTS TO MINIMIZE $q(s)$

Given $s_0 = 0$, set $g_0 = g$, $d_0 = -g$ and $i = 0$.
Until $g_i$ is small or breakdown, iterate:
  $\alpha_i = \|g_i\|_2^2 / d_i^T B d_i$
  $s_{i+1} = s_i + \alpha_i d_i$
  $g_{i+1} = g_i + \alpha_i B d_i$ $\;(\equiv g + B s_{i+1})$
  $\beta_i = \|g_{i+1}\|_2^2 / \|g_i\|_2^2$
  $d_{i+1} = -g_{i+1} + \beta_i d_i$
  and increase $i$ by 1

CRUCIAL PROPERTY OF CONJUGATE GRADIENTS

Theorem 3.10. Suppose that the conjugate gradient method is applied to minimize $q(s)$ starting from $s_0 = 0$, and that $d_i^T B d_i > 0$ for $0 \le i \le k$. Then the iterates $s_j$ satisfy the inequalities
$$\|s_j\|_2 < \|s_{j+1}\|_2$$
for $0 \le j \le k$.

TRUNCATED CONJUGATE GRADIENTS

Apply the conjugate gradient method, but terminate at iteration $i$ if
  1. $d_i^T B d_i \le 0$ $\Longrightarrow$ problem unbounded along $d_i$
  2. $\|s_i + \alpha_i d_i\|_2 > \Delta$ $\Longrightarrow$ solution on trust-region boundary
In both cases, stop with $s_* = s_i + \alpha^B d_i$, where $\alpha^B$ is chosen as the positive root of
$$\|s_i + \alpha^B d_i\|_2 = \Delta$$
Crucially
$$q(s_*) \le q(s^C) \quad \text{and} \quad \|s_*\|_2 \le \Delta$$
$\Longrightarrow$ TR algorithm converges to a first-order critical point
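The truncated conjugate gradient iteration can be sketched in plain Python as below. It follows the two termination rules above; the function names (`truncated_cg`, `boundary_step`), the residual tolerance and the iteration cap of $2n$ are assumptions made for this example rather than anything prescribed by the slides.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(B, v):
    return [dot(row, v) for row in B]

def boundary_step(s, d, delta):
    """Positive root alpha of ||s + alpha d||_2 = delta (assumes ||s||_2 <= delta)."""
    a, b, c = dot(d, d), 2.0 * dot(s, d), dot(s, s) - delta ** 2
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def truncated_cg(B, g, delta, tol=1e-10, max_iter=None):
    """Approximately minimize q(s) = s^T g + 0.5 s^T B s subject to ||s||_2 <= delta."""
    n = len(g)
    s = [0.0] * n
    r = list(g)                      # r = g + B s, the gradient of q at s
    d = [-ri for ri in r]
    for _ in range(max_iter or 2 * n):
        if math.sqrt(dot(r, r)) <= tol:
            break                    # interior solution found
        Bd = matvec(B, d)
        curvature = dot(d, Bd)
        if curvature <= 0.0:         # rule 1: unbounded along d, go to the boundary
            alpha = boundary_step(s, d, delta)
            return [si + alpha * di for si, di in zip(s, d)]
        alpha = dot(r, r) / curvature
        s_next = [si + alpha * di for si, di in zip(s, d)]
        if math.sqrt(dot(s_next, s_next)) > delta:   # rule 2: step leaves the region
            alpha = boundary_step(s, d, delta)
            return [si + alpha * di for si, di in zip(s, d)]
        r_next = [ri + alpha * bi for ri, bi in zip(r, Bd)]
        beta = dot(r_next, r_next) / dot(r, r)
        d = [-ri + beta * di for ri, di in zip(r_next, d)]
        s, r = s_next, r_next
    return s

# Usage on three small cases (data invented for illustration).
s_interior = truncated_cg([[2.0, 0.0], [0.0, 4.0]], [-2.0, 2.0], delta=10.0)
s_boundary = truncated_cg([[2.0, 0.0], [0.0, 4.0]], [-2.0, 2.0], delta=0.5)
s_negative = truncated_cg([[-1.0, 0.0], [0.0, 1.0]], [1.0, 0.0], delta=1.0)
```

In the first case the radius is inactive and the iteration recovers the unconstrained minimizer $-B^{-1} g = (1, -0.5)$; in the other two it stops on the boundary, either because the step would leave the trust region or because negative curvature is met.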

HOW GOOD IS TRUNCATED C.G.?

In the convex case ... very good

Theorem 3.11. Suppose that the truncated conjugate gradient method is applied to minimize $q(s)$ and that $B$ is positive definite. Then the computed and actual solutions to the problem, $s_*$ and $s_*^M$, satisfy the bound
$$q(s_*) \le \tfrac12 q(s_*^M)$$

In the non-convex case ... maybe poor
  e.g., if $g = 0$ and $B$ is indefinite $\Longrightarrow$ $q(s_*) = 0$
  can use the Lanczos method to continue around the trust-region boundary if necessary
