Part 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)


Part 3: Trust-region methods for unconstrained optimization
Nick Gould (RAL)

$$\min_{x \in \mathbb{R}^n} f(x)$$

MSc course on nonlinear optimization

UNCONSTRAINED MINIMIZATION

$$\min_{x \in \mathbb{R}^n} f(x)$$

where the objective function $f : \mathbb{R}^n \to \mathbb{R}$.
- Assume that $f \in C^1$ (sometimes $C^2$) with Lipschitz continuous derivatives.
- In practice this assumption is often violated, but it is not essential.

LINESEARCH VS TRUST-REGION METHODS

Linesearch methods:
- pick a descent direction $p_k$
- pick a stepsize $\alpha_k$ to reduce $f(x_k + \alpha p_k)$
- set $x_{k+1} = x_k + \alpha_k p_k$

Trust-region methods:
- pick a step $s_k$ to reduce a model of $f(x_k + s)$
- accept $x_{k+1} = x_k + s_k$ if the decrease predicted by the model is inherited by $f(x_k + s_k)$
- otherwise set $x_{k+1} = x_k$ and refine the model

TRUST-REGION MODEL PROBLEM

Model $f(x_k + s)$ by:
- the linear model $m_k^L(s) = f_k + s^T g_k$
- the quadratic model (symmetric $B_k$) $m_k^Q(s) = f_k + s^T g_k + \tfrac{1}{2} s^T B_k s$

Major difficulties:
- the models may not resemble $f(x_k + s)$ if $s$ is large
- the models may be unbounded from below:
  - linear model: always, unless $g_k = 0$
  - quadratic model: always if $B_k$ is indefinite, and possibly if $B_k$ is only positive semi-definite
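A tiny worked instance (illustrative numbers, not from the notes) makes the unboundedness explicit: in one dimension with $f_k = 0$, $g_k = -1$ and indefinite $B_k = -2$, the quadratic model is
$$m_k^Q(s) = -s - s^2 \longrightarrow -\infty \quad \text{as } s \to +\infty,$$
so minimizing the model without any restriction on $s$ is meaningless.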

THE TRUST REGION

Prevent the model $m_k(s)$ from unboundedness by imposing a trust-region constraint
$$\|s\| \le \Delta_k$$
for some suitable scalar radius $\Delta_k > 0$.

⟹ trust-region subproblem:
$$\text{(approximately)} \quad \min_{s \in \mathbb{R}^n} m_k(s) \quad \text{subject to} \quad \|s\| \le \Delta_k$$
- in theory the analysis does not depend on the norm
- in practice it might!

OUR MODEL

For simplicity, concentrate on the second-order (Newton-like) model
$$m_k(s) = m_k^Q(s) = f_k + s^T g_k + \tfrac{1}{2} s^T B_k s$$
and the $\ell_2$ trust-region norm $\|\cdot\| = \|\cdot\|_2$.

Note:
- $B_k = H_k$ is allowed
- analysis for other trust-region norms simply adds extra constants in the following results

BASIC TRUST-REGION METHOD

Given $k = 0$, $\Delta_0 > 0$ and $x_0$, until convergence do:
- Build the second-order model $m_k(s)$ of $f(x_k + s)$.
- Solve the trust-region subproblem to find $s_k$ for which $m_k(s_k) < f_k$ and $\|s_k\| \le \Delta_k$, and define
  $$\rho_k = \frac{f_k - f(x_k + s_k)}{f_k - m_k(s_k)}.$$
- If $\rho_k \ge \eta_v$ [very successful, with $0 < \eta_v < 1$], set $x_{k+1} = x_k + s_k$ and $\Delta_{k+1} = \gamma_i \Delta_k$, where $\gamma_i \ge 1$.
- Otherwise, if $\rho_k \ge \eta_s$ [successful, with $0 < \eta_s \le \eta_v < 1$], set $x_{k+1} = x_k + s_k$ and $\Delta_{k+1} = \Delta_k$.
- Otherwise [unsuccessful], set $x_{k+1} = x_k$ and $\Delta_{k+1} = \gamma_d \Delta_k$, where $0 < \gamma_d < 1$.
- Increase $k$ by 1.

SOLVE THE TRUST-REGION SUBPROBLEM?

At the very least, aim to achieve as much reduction in the model as would an iteration of steepest descent.

Cauchy point: $s_k^C = -\alpha_k^C g_k$, where
$$\alpha_k^C = \arg\min_{\alpha > 0} \, m_k(-\alpha g_k) \quad \text{subject to} \quad \alpha \|g_k\| \le \Delta_k$$
$$\phantom{\alpha_k^C} = \arg\min_{0 < \alpha \le \Delta_k / \|g_k\|} m_k(-\alpha g_k)$$
- minimizing a quadratic on a line segment ⟹ very easy!
- require that $m_k(s_k) \le m_k(s_k^C)$ and $\|s_k\| \le \Delta_k$
- in practice, hope to do far better than this
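The basic method above translates almost directly into code. The following is a minimal sketch (not part of the original notes), assuming NumPy and user-supplied callables f, grad and hess, which uses the Cauchy point as its crude subproblem solver; all parameter values are illustrative defaults.

```python
# Minimal sketch of the basic trust-region method with a Cauchy-point step.
import numpy as np

def cauchy_point(g, B, delta):
    """Minimize m(-alpha*g) = f - alpha*||g||^2 + 0.5*alpha^2*g'Bg over 0 < alpha <= delta/||g||."""
    gnorm = np.linalg.norm(g)
    if gnorm == 0.0:
        return np.zeros_like(g)
    gBg = g @ B @ g
    alpha_max = delta / gnorm
    if gBg <= 0.0:                      # non-positive curvature: step to the boundary
        alpha = alpha_max
    else:
        alpha = min(gnorm**2 / gBg, alpha_max)
    return -alpha * g

def trust_region(f, grad, hess, x0, delta0=1.0, eta_s=0.1, eta_v=0.9,
                 gamma_i=2.0, gamma_d=0.5, tol=1e-8, max_iter=500):
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        s = cauchy_point(g, B, delta)                 # any s with m(s) <= m(s^C) would do
        model_decrease = -(g @ s + 0.5 * s @ B @ s)   # f_k - m_k(s_k)
        rho = (f(x) - f(x + s)) / model_decrease
        if rho >= eta_v:                              # very successful
            x, delta = x + s, gamma_i * delta
        elif rho >= eta_s:                            # successful
            x = x + s
        else:                                         # unsuccessful
            delta = gamma_d * delta
    return x
```

For example, `trust_region(lambda x: x @ x, lambda x: 2 * x, lambda x: 2 * np.eye(x.size), np.ones(3))` minimizes a simple convex quadratic.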

ACHIEVABLE MODEL DECREASE

Theorem 3.1. If $m_k(s)$ is the second-order model and $s_k^C$ is its Cauchy point within the trust region $\|s\| \le \Delta_k$, then
$$f_k - m_k(s_k^C) \ge \tfrac{1}{2}\, \|g_k\| \min\left( \frac{\|g_k\|}{1 + \|B_k\|},\, \Delta_k \right).$$

PROOF OF THEOREM 3.1

$$m_k(-\alpha g_k) = f_k - \alpha \|g_k\|^2 + \tfrac{1}{2}\alpha^2 g_k^T B_k g_k.$$
The result is immediate if $g_k = 0$. Otherwise, there are 3 possibilities:
(i) the curvature $g_k^T B_k g_k \le 0$ ⟹ $m_k(-\alpha g_k)$ is unbounded from below as $\alpha$ increases ⟹ the Cauchy point occurs on the trust-region boundary.
(ii) the curvature $g_k^T B_k g_k > 0$ and the minimizer of $m_k(-\alpha g_k)$ occurs at or beyond the trust-region boundary ⟹ the Cauchy point occurs on the trust-region boundary.
(iii) the curvature $g_k^T B_k g_k > 0$ and the minimizer of $m_k(-\alpha g_k)$, and hence the Cauchy point, occurs before the trust-region boundary is reached.
Consider each case in turn.

Case (i): $g_k^T B_k g_k \le 0$ and $\alpha \ge 0$ imply
$$m_k(-\alpha g_k) = f_k - \alpha \|g_k\|^2 + \tfrac{1}{2}\alpha^2 g_k^T B_k g_k \le f_k - \alpha \|g_k\|^2. \quad (1)$$
The Cauchy point lies on the boundary of the trust region, so
$$\alpha_k^C = \frac{\Delta_k}{\|g_k\|}. \quad (2)$$
Then (1) + (2) give
$$f_k - m_k(s_k^C) \ge \|g_k\|^2 \frac{\Delta_k}{\|g_k\|} = \|g_k\| \Delta_k \ge \tfrac{1}{2}\|g_k\| \Delta_k.$$

Case (ii): here
$$\alpha_k^* \stackrel{\text{def}}{=} \arg\min_{\alpha} \; f_k - \alpha \|g_k\|^2 + \tfrac{1}{2}\alpha^2 g_k^T B_k g_k \quad (3)$$
$$\Longrightarrow\quad \alpha_k^* = \frac{\|g_k\|^2}{g_k^T B_k g_k} \ge \alpha_k^C = \frac{\Delta_k}{\|g_k\|} \quad (4)$$
$$\Longrightarrow\quad \alpha_k^C \, g_k^T B_k g_k \le \|g_k\|^2. \quad (5)$$
Then (3) + (4) + (5) give
$$f_k - m_k(s_k^C) = \alpha_k^C \|g_k\|^2 - \tfrac{1}{2}[\alpha_k^C]^2 g_k^T B_k g_k \ge \tfrac{1}{2}\alpha_k^C \|g_k\|^2 = \tfrac{1}{2}\|g_k\|^2 \frac{\Delta_k}{\|g_k\|} = \tfrac{1}{2}\|g_k\| \Delta_k.$$

Case (iii): here
$$\alpha_k^C = \alpha_k^* = \frac{\|g_k\|^2}{g_k^T B_k g_k},$$
and so
$$f_k - m_k(s_k^C) = \alpha_k^* \|g_k\|^2 - \tfrac{1}{2}(\alpha_k^*)^2 g_k^T B_k g_k = \frac{\|g_k\|^4}{g_k^T B_k g_k} - \tfrac{1}{2}\frac{\|g_k\|^4}{g_k^T B_k g_k} = \tfrac{1}{2}\frac{\|g_k\|^4}{g_k^T B_k g_k} \ge \tfrac{1}{2}\frac{\|g_k\|^2}{\|B_k\|} \ge \tfrac{1}{2}\frac{\|g_k\|^2}{1 + \|B_k\|},$$
because of the Cauchy-Schwarz inequality ($g_k^T B_k g_k \le \|B_k\|\,\|g_k\|^2$). □

Corollary 3.2. If $m_k(s)$ is the second-order model, and $s_k$ is an improvement on the Cauchy point within the trust region $\|s\| \le \Delta_k$, then
$$f_k - m_k(s_k) \ge \tfrac{1}{2}\, \|g_k\| \min\left( \frac{\|g_k\|}{1 + \|B_k\|},\, \Delta_k \right).$$
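As a quick numerical sanity check of the decrease bound in Theorem 3.1 / Corollary 3.2 (an illustration, not from the notes), the snippet below evaluates both sides on random data; it reuses the hypothetical cauchy_point helper from the earlier sketch.

```python
# Verify f_k - m_k(s^C) >= 0.5*||g||*min(||g||/(1+||B||), Delta) on random data.
import numpy as np

rng = np.random.default_rng(0)
n, delta = 5, 0.7
A = rng.standard_normal((n, n))
B = (A + A.T) / 2                                  # symmetric, possibly indefinite
g = rng.standard_normal(n)

s_c = cauchy_point(g, B, delta)                    # helper from the earlier sketch
decrease = -(g @ s_c + 0.5 * s_c @ B @ s_c)        # f_k - m_k(s_k^C)
bound = 0.5 * np.linalg.norm(g) * min(
    np.linalg.norm(g) / (1 + np.linalg.norm(B, 2)), delta)
print(decrease, bound)                             # decrease should dominate bound
```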

DIFFERENCE BETWEEN MODEL AND FUNCTION

Lemma 3.3. Suppose that $f \in C^2$, and that the true and model Hessians satisfy the bounds $\|H(x)\| \le \kappa_h$ for all $x$ and $\|B_k\| \le \kappa_b$ for all $k$ and some $\kappa_h \ge 1$ and $\kappa_b \ge 0$. Then
$$|f(x_k + s_k) - m_k(s_k)| \le \kappa_d \Delta_k^2, \quad \text{where } \kappa_d = \tfrac{1}{2}(\kappa_h + \kappa_b),$$
for all $k$.

PROOF OF LEMMA 3.3

The mean value theorem gives
$$f(x_k + s_k) = f(x_k) + s_k^T \nabla_x f(x_k) + \tfrac{1}{2} s_k^T \nabla_{xx} f(\xi_k) s_k$$
for some $\xi_k \in [x_k, x_k + s_k]$. Thus
$$|f(x_k + s_k) - m_k(s_k)| = \tfrac{1}{2}\left| s_k^T H(\xi_k) s_k - s_k^T B_k s_k \right| \le \tfrac{1}{2}\left| s_k^T H(\xi_k) s_k \right| + \tfrac{1}{2}\left| s_k^T B_k s_k \right| \le \tfrac{1}{2}(\kappa_h + \kappa_b)\|s_k\|^2 \le \kappa_d \Delta_k^2,$$
using the triangle and Cauchy-Schwarz inequalities. □

ULTIMATE PROGRESS AT NON-OPTIMAL POINTS

Lemma 3.4. Suppose that $f \in C^2$, that the true and model Hessians satisfy the bounds $\|H_k\| \le \kappa_h$ and $\|B_k\| \le \kappa_b$ for all $k$ and some $\kappa_h \ge 1$ and $\kappa_b \ge 0$, and that $\kappa_d = \tfrac{1}{2}(\kappa_h + \kappa_b)$. Suppose furthermore that $g_k \ne 0$ and that
$$\Delta_k \le \|g_k\| \min\left( \frac{1}{\kappa_h + \kappa_b},\, \frac{1 - \eta_v}{2\kappa_d} \right).$$
Then iteration $k$ is very successful and $\Delta_{k+1} \ge \Delta_k$.

PROOF OF LEMMA 3.4

By definition, $1 + \|B_k\| \le \kappa_h + \kappa_b$, and so the first bound on $\Delta_k$ gives
$$\Delta_k \le \frac{\|g_k\|}{\kappa_h + \kappa_b} \le \frac{\|g_k\|}{1 + \|B_k\|}.$$
Corollary 3.2 then gives
$$f_k - m_k(s_k) \ge \tfrac{1}{2}\|g_k\| \min\left( \frac{\|g_k\|}{1 + \|B_k\|},\, \Delta_k \right) = \tfrac{1}{2}\|g_k\| \Delta_k.$$
Lemma 3.3 and the second bound on $\Delta_k$ then give
$$|\rho_k - 1| = \left| \frac{f(x_k + s_k) - m_k(s_k)}{f_k - m_k(s_k)} \right| \le \frac{\kappa_d \Delta_k^2}{\tfrac{1}{2}\|g_k\|\Delta_k} = \frac{2\kappa_d \Delta_k}{\|g_k\|} \le 1 - \eta_v.$$
Hence $\rho_k \ge \eta_v$, and the iteration is very successful. □

RADIUS WON'T SHRINK TO ZERO AT NON-OPTIMAL POINTS

Lemma 3.5. Suppose that $f \in C^2$, that the true and model Hessians satisfy the bounds $\|H_k\| \le \kappa_h$ and $\|B_k\| \le \kappa_b$ for all $k$ and some $\kappa_h \ge 1$ and $\kappa_b \ge 0$, and that $\kappa_d = \tfrac{1}{2}(\kappa_h + \kappa_b)$. Suppose furthermore that there exists a constant $\epsilon > 0$ such that $\|g_k\| \ge \epsilon$ for all $k$. Then
$$\Delta_k \ge \kappa_\epsilon \stackrel{\text{def}}{=} \epsilon \gamma_d \min\left( \frac{1}{\kappa_h + \kappa_b},\, \frac{1 - \eta_v}{2\kappa_d} \right)$$
for all $k$.

PROOF OF LEMMA 3.5

Suppose otherwise, and that iteration $k$ is the first for which $\Delta_{k+1} < \kappa_\epsilon$. Then $\Delta_k > \Delta_{k+1}$ ⟹ iteration $k$ was unsuccessful ⟹ $\gamma_d \Delta_k \le \Delta_{k+1}$. Hence
$$\Delta_k \le \frac{\kappa_\epsilon}{\gamma_d} = \epsilon \min\left( \frac{1}{\kappa_h + \kappa_b},\, \frac{1 - \eta_v}{2\kappa_d} \right) \le \|g_k\| \min\left( \frac{1}{\kappa_h + \kappa_b},\, \frac{1 - \eta_v}{2\kappa_d} \right).$$
But this contradicts the assertion of Lemma 3.4 that iteration $k$ must be very successful. □

POSSIBLE FINITE TERMINATION

Lemma 3.6. Suppose that $f \in C^2$, and that both the true and model Hessians remain bounded for all $k$. Suppose furthermore that there are only finitely many successful iterations. Then $x_k = x_*$ for all sufficiently large $k$ and $g(x_*) = 0$.

PROOF OF LEMMA 3.6

$x_{k_0 + j} = x_{k_0 + 1} = x_*$ for all $j > 0$, where $k_0$ is the index of the last successful iterate. All iterations are unsuccessful for sufficiently large $k$, so $\{\Delta_k\} \to 0$. Lemma 3.4 then implies that if $\|g_{k_0 + 1}\| > 0$ there must be a successful iteration of index larger than $k_0$, which is impossible. Hence $g_{k_0 + 1} = 0$. □

GLOBAL CONVERGENCE OF ONE SEQUENCE

Theorem 3.7. Suppose that $f \in C^2$, and that both the true and model Hessians remain bounded for all $k$. Then either
$$g_l = 0 \ \text{for some } l \ge 0, \quad \text{or} \quad \lim_{k \to \infty} f_k = -\infty, \quad \text{or} \quad \liminf_{k \to \infty} \|g_k\| = 0.$$

PROOF OF THEOREM 3.7

Let $\mathcal{S}$ be the index set of successful iterations. Lemma 3.6 proves Theorem 3.7 when $\mathcal{S}$ is finite. So consider $|\mathcal{S}| = \infty$, suppose that $f_k$ is bounded below and that
$$\|g_k\| \ge \epsilon \quad (6)$$
for some $\epsilon > 0$ and all $k$, and consider some $k \in \mathcal{S}$. Corollary 3.2, Lemma 3.5, and assumption (6) give
$$f_k - f_{k+1} \ge \eta_s [f_k - m_k(s_k)] \ge \delta_\epsilon \stackrel{\text{def}}{=} \tfrac{1}{2}\eta_s \epsilon \min\left( \frac{\epsilon}{1 + \kappa_b},\, \kappa_\epsilon \right),$$
and hence
$$f_0 - f_{k+1} = \sum_{j=0,\, j \in \mathcal{S}}^{k} [f_j - f_{j+1}] \ge \sigma_k \delta_\epsilon,$$
where $\sigma_k$ is the number of successful iterations up to iteration $k$. But $\lim_{k \to \infty} \sigma_k = +\infty$, so $f_k$ would be unbounded below, a contradiction. Hence either $f_k$ is unbounded below or a subsequence of the $\|g_k\| \to 0$. □

GLOBAL CONVERGENCE

Theorem 3.8. Suppose that $f \in C^2$, and that both the true and model Hessians remain bounded for all $k$. Then either
$$g_l = 0 \ \text{for some } l \ge 0, \quad \text{or} \quad \lim_{k \to \infty} f_k = -\infty, \quad \text{or} \quad \lim_{k \to \infty} \|g_k\| = 0.$$

II: SOLVING THE TRUST-REGION SUBPROBLEM

$$\text{(approximately)} \quad \min_{s \in \mathbb{R}^n} q(s) \equiv s^T g + \tfrac{1}{2} s^T B s \quad \text{subject to} \quad \|s\| \le \Delta$$

AIM: find $s_*$ so that $q(s_*) \le q(s^C)$ and $\|s_*\| \le \Delta$.

Might solve:
- exactly ⟹ Newton-like method
- approximately ⟹ steepest descent / conjugate gradients

THE $\ell_2$-NORM TRUST-REGION SUBPROBLEM

$$\min_{s \in \mathbb{R}^n} q(s) \equiv s^T g + \tfrac{1}{2} s^T B s \quad \text{subject to} \quad \|s\|_2 \le \Delta$$

Solution characterisation result:

Theorem 3.9. Any global minimizer $s_*$ of $q(s)$ subject to $\|s\|_2 \le \Delta$ satisfies the equation
$$(B + \lambda_* I) s_* = -g,$$
where $B + \lambda_* I$ is positive semi-definite, $\lambda_* \ge 0$ and $\lambda_*(\|s_*\|_2 - \Delta) = 0$. If $B + \lambda_* I$ is positive definite, $s_*$ is unique.

PROOF OF THEOREM 3.9

The problem is equivalent to minimizing $q(s)$ subject to $\tfrac{1}{2}\Delta^2 - \tfrac{1}{2} s^T s \ge 0$. Theorem 1.9 gives
$$g + B s_* = -\lambda_* s_* \quad (7)$$
for some Lagrange multiplier $\lambda_* \ge 0$ for which either $\lambda_* = 0$ or $\|s_*\|_2 = \Delta$ (or both). It remains to show that $B + \lambda_* I$ is positive semi-definite.

If $s_*$ lies in the interior of the trust region, then $\lambda_* = 0$, and Theorem 1.10 shows that $B + \lambda_* I = B$ is positive semi-definite. If $\|s_*\|_2 = \Delta$ and $\lambda_* = 0$, Theorem 1.10 shows that $v^T B v \ge 0$ for all $v \in \mathcal{N}_+ = \{v \mid s_*^T v \ge 0\}$; if $v \notin \mathcal{N}_+$, then $-v \in \mathcal{N}_+$, so $v^T B v \ge 0$ for all $v$. The only remaining case is $\|s_*\|_2 = \Delta$ and $\lambda_* > 0$. Theorem 1.10 then shows that $v^T (B + \lambda_* I) v \ge 0$ for all $v \in \mathcal{N}_+ = \{v \mid s_*^T v = 0\}$, so it remains to consider $v^T (B + \lambda_* I) v$ when $s_*^T v \ne 0$.

[Figure 3.1: Construction of missing directions of positive curvature.]

Let $s$ be any point on the boundary $\partial\mathcal{R}$ of the trust region $\mathcal{R}$, and let $w = s - s_*$. Then
$$w^T s_* = (s - s_*)^T s_* = -\tfrac{1}{2}(s - s_*)^T(s - s_*) = -\tfrac{1}{2} w^T w \quad (8)$$
since $\|s\|_2 = \Delta = \|s_*\|_2$. Then (7) + (8) give
$$q(s) - q(s_*) = w^T(g + B s_*) + \tfrac{1}{2} w^T B w = -\lambda_* w^T s_* + \tfrac{1}{2} w^T B w = \tfrac{1}{2} w^T (B + \lambda_* I) w, \quad (9)$$
and hence $w^T (B + \lambda_* I) w \ge 0$ since $s_*$ is a global minimizer. But for any $v$ with $s_*^T v \ne 0$,
$$s = s_* - 2\,\frac{s_*^T v}{v^T v}\, v \in \partial\mathcal{R},$$
and for this $s$ the direction $w$ is parallel to $v$, so $v^T (B + \lambda_* I) v \ge 0$.

When $B + \lambda_* I$ is positive definite, $s_* = -(B + \lambda_* I)^{-1} g$. If $s \in \mathcal{R}$ and $s \ne s_*$, (8) and (9) become $w^T s_* \le -\tfrac{1}{2} w^T w$ and $q(s) \ge q(s_*) + \tfrac{1}{2} w^T (B + \lambda_* I) w$ respectively. Hence $q(s) > q(s_*)$ for any $s \ne s_*$. If $s_*$ is interior, $\lambda_* = 0$, $B$ is positive definite, and thus $s_*$ is the unique unconstrained minimizer of $q(s)$. □

ALGORITHMS FOR THE $\ell_2$-NORM SUBPROBLEM

Two cases:
- $B$ positive semi-definite and $B s = -g$ satisfies $\|s\|_2 \le \Delta$ ⟹ $s_* = s$
- $B$ indefinite, or $B s = -g$ satisfies $\|s\|_2 > \Delta$

In the second case,
$$(B + \lambda_* I) s_* = -g \quad \text{and} \quad s_*^T s_* = \Delta^2,$$
a nonlinear (quadratic) system in $s_*$ and $\lambda_*$; we concentrate on this.

EQUALITY CONSTRAINED $\ell_2$-NORM SUBPROBLEM

Suppose $B$ has the spectral decomposition
$$B = U^T \Lambda U,$$
where $U$ is the matrix of eigenvectors and $\Lambda$ is diagonal with eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$.

Require $B + \lambda I$ positive semi-definite ⟹ $\lambda \ge -\lambda_1$.

Define $s(\lambda) = -(B + \lambda I)^{-1} g$ and require $\psi(\lambda) \stackrel{\text{def}}{=} \|s(\lambda)\|_2^2 = \Delta^2$. Note that (with $\gamma_i = e_i^T U g$)
$$\psi(\lambda) = \|U^T (\Lambda + \lambda I)^{-1} U g\|_2^2 = \sum_{i=1}^n \frac{\gamma_i^2}{(\lambda_i + \lambda)^2}.$$
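For modest $n$ and a dense symmetric $B$, $\psi(\lambda)$ can be evaluated directly from the eigen-decomposition. A small sketch (the example B and g values are assumptions echoing the diagonal illustrations that follow):

```python
# Evaluate psi(lambda) = ||s(lambda)||_2^2 with s(lambda) = -(B + lambda*I)^{-1} g.
import numpy as np

def psi(lam, B, g):
    eigvals, U = np.linalg.eigh(B)        # columns of U are eigenvectors of B
    gamma = U.T @ g                       # gamma_i = u_i^T g
    return np.sum(gamma**2 / (eigvals + lam)**2)

# Illustrative (assumed) values mimicking the diagonal examples below.
B = np.diag([1.0, 3.0, 5.0])
g = np.array([1.0, 1.0, 1.0])
print(psi(0.0, B, g))   # compare with Delta^2: a root of psi(lambda) = Delta^2 with lambda >= -lambda_1 is sought
```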

CONVEX EXAMPLE

[Figure: $\psi(\lambda)$ plotted against $\lambda$ for a convex example with 3×3 diagonal $B$ (diagonal entries 1, 3, 5) and fixed $g$; the level $\Delta^2 = 1.5$ is marked and the solution curve is shown as $\Delta$ varies.]

NONCONVEX EXAMPLE

[Figure: $\psi(\lambda)$ plotted against $\lambda$ for a nonconvex example with 3×3 diagonal $B$ and fixed $g$; minus the leftmost eigenvalue is marked on the $\lambda$ axis.]

THE HARD CASE

[Figure: $\psi(\lambda)$ plotted against $\lambda$ in the hard case, with 3×3 diagonal $B$ and $g$ orthogonal to the leftmost eigenvector; minus the leftmost eigenvalue is marked on the $\lambda$ axis, and the level $\Delta^2 = 0.0903$ is indicated.]

SUMMARY

For indefinite $B$:
- the hard case occurs when $g$ is orthogonal to the eigenvector $u_1$ for the most negative eigenvalue $\lambda_1$
- it is OK if the radius $\Delta$ is small enough
- there is no obvious solution to the equations... but the solution is actually of the form
$$s_* = s^{\lim} + \sigma u_1, \quad \text{where } s^{\lim} = \lim_{\lambda \to -\lambda_1^+} s(\lambda) \text{ and } \sigma \text{ is chosen so that } \|s^{\lim} + \sigma u_1\|_2 = \Delta.$$
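The hard-case construction $s_* = s^{\lim} + \sigma u_1$ can be made concrete in a few lines; the diagonal B, g and delta below are assumed values consistent with the hard-case illustration above, with g deliberately orthogonal to the leftmost eigenvector.

```python
# Hard-case construction s_* = s_lim + sigma * u_1 (assumed example data).
import numpy as np

B = np.diag([-1.0, 3.0, 5.0])          # indefinite example (assumed)
g = np.array([0.0, 1.0, 1.0])          # orthogonal to the leftmost eigenvector
delta = 2.0                            # radius large enough for the hard case

eigvals, U = np.linalg.eigh(B)
lam1, u1 = eigvals[0], U[:, 0]

# s_lim = lim_{lambda -> -lam1^+} s(lambda) = -(B - lam1*I)^+ g  (pseudo-inverse)
s_lim = -np.linalg.pinv(B - lam1 * np.eye(3)) @ g
sigma = np.sqrt(delta**2 - s_lim @ s_lim)   # so that ||s_lim + sigma*u1||_2 = delta
s_star = s_lim + sigma * u1
print(np.linalg.norm(s_star), delta)        # both equal delta
```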

HOW TO SOLVE $\|s(\lambda)\|_2 = \Delta$?

DON'T!! Solve instead the secular equation
$$\phi(\lambda) \stackrel{\text{def}}{=} \frac{1}{\|s(\lambda)\|_2} - \frac{1}{\Delta} = 0$$
- no poles
- smallest at eigenvalues (except in the hard case!)
- an analytic function ⟹ ideal for Newton's method
- globally convergent (ultimately quadratic rate, except in the hard case)
- need to safeguard to protect Newton from the hard and interior-solution cases

THE SECULAR EQUATION

[Figure: $\phi(\lambda)$ plotted against $\lambda$ (three panels) for a small two-dimensional example trust-region subproblem.]

NEWTON'S METHOD FOR THE SECULAR EQUATION

The Newton correction at $\lambda$ is $-\phi(\lambda)/\phi'(\lambda)$. Differentiating
$$\phi(\lambda) = \frac{1}{\|s(\lambda)\|_2} - \frac{1}{\Delta} = \frac{1}{(s^T(\lambda)\, s(\lambda))^{1/2}} - \frac{1}{\Delta}$$
gives
$$\phi'(\lambda) = -\frac{s^T(\lambda)\, \nabla_\lambda s(\lambda)}{(s^T(\lambda)\, s(\lambda))^{3/2}} = -\frac{s^T(\lambda)\, \nabla_\lambda s(\lambda)}{\|s(\lambda)\|_2^3}.$$
Differentiating the defining equation,
$$(B + \lambda I) s(\lambda) = -g \quad \Longrightarrow \quad (B + \lambda I)\, \nabla_\lambda s(\lambda) + s(\lambda) = 0.$$
Notice that, rather than $\nabla_\lambda s(\lambda)$, merely
$$s^T(\lambda)\, \nabla_\lambda s(\lambda) = -s^T(\lambda)(B + \lambda I)^{-1} s(\lambda)$$
is required for $\phi'(\lambda)$. Given the factorization $B + \lambda I = L(\lambda) L^T(\lambda)$,
$$s^T(\lambda)(B + \lambda I)^{-1} s(\lambda) = s^T(\lambda) L^{-T}(\lambda) L^{-1}(\lambda) s(\lambda) = (L^{-1}(\lambda) s(\lambda))^T (L^{-1}(\lambda) s(\lambda)) = \|w(\lambda)\|_2^2,$$
where $L(\lambda) w(\lambda) = s(\lambda)$.

NEWTON'S METHOD & THE SECULAR EQUATION

Let $\lambda > -\lambda_1$ and $\Delta > 0$ be given. Until convergence do:
- Factorize $B + \lambda I = L L^T$.
- Solve $L L^T s = -g$.
- Solve $L w = s$.
- Replace $\lambda$ by
$$\lambda + \left( \frac{\|s\|_2 - \Delta}{\Delta} \right) \left( \frac{\|s\|_2^2}{\|w\|_2^2} \right).$$
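A sketch of this iteration, assuming NumPy/SciPy, a dense symmetric B, and a starting lambda for which B + lambda*I is already positive definite (the safeguards mentioned above are omitted):

```python
# Newton's method on phi(lambda) = 1/||s(lambda)||_2 - 1/Delta = 0.
import numpy as np
from scipy.linalg import solve_triangular

def secular_newton(B, g, delta, lam, tol=1e-10, max_iter=50):
    n = B.shape[0]
    for _ in range(max_iter):
        L = np.linalg.cholesky(B + lam * np.eye(n))   # B + lam*I = L L^T
        y = solve_triangular(L, -g, lower=True)       # L y = -g
        s = solve_triangular(L.T, y, lower=False)     # L^T s = y, so (B + lam*I) s = -g
        w = solve_triangular(L, s, lower=True)        # L w = s
        snorm = np.linalg.norm(s)
        if abs(snorm - delta) <= tol * delta:
            break
        # Newton update derived above.
        lam += ((snorm - delta) / delta) * (snorm**2 / np.linalg.norm(w)**2)
    return s, lam
```

In practice this loop sits inside the safeguarding logic alluded to above, so that interior solutions and the hard case are also handled.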

SOLVING THE LARGE-SCALE PROBLEM

- when $n$ is large, factorization may be impossible
- may instead try to use an iterative method to approximate the solution
- steepest descent leads to the Cauchy point
- an obvious generalization: conjugate gradients...
  - but what about the trust region?
  - what about negative curvature?

CONJUGATE GRADIENTS TO MINIMIZE $q(s)$

Given $s^0 = 0$, set $g^0 = g$, $d^0 = -g$ and $i = 0$. Until $\|g^i\|$ is small or breakdown occurs, iterate
$$\alpha^i = \|g^i\|_2^2 / (d^i)^T B d^i$$
$$s^{i+1} = s^i + \alpha^i d^i$$
$$g^{i+1} = g^i + \alpha^i B d^i \;(\equiv g + B s^{i+1})$$
$$\beta^i = \|g^{i+1}\|_2^2 / \|g^i\|_2^2$$
$$d^{i+1} = -g^{i+1} + \beta^i d^i$$
and increase $i$ by 1.

CRUCIAL PROPERTY OF CONJUGATE GRADIENTS

Theorem 3.10. Suppose that the conjugate gradient method is applied to minimize $q(s)$ starting from $s^0 = 0$, and that $(d^i)^T B d^i > 0$ for $0 \le i \le k$. Then the iterates $s^j$ satisfy the inequalities
$$\|s^j\|_2 < \|s^{j+1}\|_2$$
for $0 \le j \le k$.

TRUNCATED CONJUGATE GRADIENTS

Apply the conjugate gradient method, but terminate at iteration $i$ if
1. $(d^i)^T B d^i \le 0$ ⟹ the problem is unbounded along $d^i$
2. $\|s^i + \alpha^i d^i\|_2 > \Delta$ ⟹ the solution lies on the trust-region boundary

In both cases, stop with $s_* = s^i + \alpha^B d^i$, where $\alpha^B$ is chosen as the positive root of $\|s^i + \alpha^B d^i\|_2 = \Delta$.

Crucially, $q(s_*) \le q(s^C)$ and $\|s_*\|_2 \le \Delta$ ⟹ the TR algorithm converges to a first-order critical point.
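A sketch of the truncated conjugate-gradient strategy just described (often called Steihaug-Toint truncated CG), assuming only that matrix-vector products B @ d are available:

```python
# Truncated CG for min q(s) = s'g + 0.5 s'Bs subject to ||s||_2 <= delta.
import numpy as np

def boundary_step(s, d, delta):
    """Positive root alpha of ||s + alpha*d||_2 = delta."""
    a, b, c = d @ d, 2.0 * (s @ d), s @ s - delta**2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def truncated_cg(B, g, delta, tol=1e-10, max_iter=None):
    n = g.size
    max_iter = max_iter or n
    s, r, d = np.zeros(n), g.copy(), -g.copy()        # s^0 = 0, g^0 = g, d^0 = -g
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol:
            break
        Bd = B @ d
        curv = d @ Bd
        if curv <= 0.0:                               # negative curvature: stop on the boundary
            return s + boundary_step(s, d, delta) * d
        alpha = (r @ r) / curv
        if np.linalg.norm(s + alpha * d) > delta:     # step leaves the region: stop on the boundary
            return s + boundary_step(s, d, delta) * d
        s = s + alpha * d
        r_new = r + alpha * Bd                        # g^{i+1} = g + B s^{i+1}
        beta = (r_new @ r_new) / (r @ r)
        d = -r_new + beta * d
        r = r_new
    return s
```

Because the first iteration moves along $-g$, the returned point is never worse than the Cauchy point, which is all that the earlier convergence theory requires.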

HOW GOOD IS TRUNCATED C.G.?

In the convex case... very good.

Theorem 3.11. Suppose that the truncated conjugate gradient method is applied to minimize $q(s)$ and that $B$ is positive definite. Then the computed and actual solutions to the problem, $s_*$ and $s^M$, satisfy the bound
$$q(s_*) \le \tfrac{1}{2}\, q(s^M).$$

In the non-convex case... maybe poor:
- e.g., if $g = 0$ and $B$ is indefinite ⟹ $q(s_*) = 0$
- can use the Lanczos method to continue around the trust-region boundary if necessary