
Coordinate Minimization
Daniel P. Robinson
Department of Applied Mathematics and Statistics, Johns Hopkins University
November 27, 2018

Outline
1 Introduction
2 Algorithms
    Cyclic order with exact minimization
    Random order with fixed step size
    Cyclic order with fixed step size
    Steepest direction (Gauss-Southwell rule) with fixed step size
    Alternatives
    Summary
3 Examples
    Linear equations
    Logistic regression

Given a function f : R^n → R, consider the unconstrained optimization problem

    minimize_{x ∈ R^n} f(x)

We will consider various assumptions on f:
    nonconvex and differentiable f
    convex and differentiable f
    strongly convex and differentiable f
We will not consider general non-smooth f, because nothing useful can be proved in that setting. We will briefly consider structured non-smooth problems, i.e., problems that use an additional separable regularizer.

Notation: f_k := f(x_k) and g_k := ∇f(x_k).

Basic idea of coordinate minimization: compute the next iterate using the update

    x_{k+1} = x_k − α_k e_{i_k}

Algorithm 1 General coordinate minimization framework
1: Choose x_0 ∈ R^n and set k ← 0.
2: loop
3:   Choose i_k ∈ {1, 2, ..., n}.
4:   Choose α_k > 0.
5:   Set x_{k+1} ← x_k − α_k e_{i_k}.
6:   Set k ← k + 1.
7: end loop

α_k is the step size. Options include:
    fixed, but sufficiently small
    inexact linesearch
    exact linesearch
i_k ∈ {1, 2, ..., n} has to be chosen. Options include:
    cycle through the entire set
    choose it randomly without replacement
    choose it randomly with replacement
    choose it based on which element of ∇f(x_k) is the largest in absolute value
e_{i_k} is the i_k-th coordinate vector; this update seeks better points in span{e_{i_k}}.
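Algorithm 1 leaves the index rule and the step size abstract. The following minimal Python sketch shows the skeleton; the function names, the fixed-step gradient instantiation (which the later slides use), and the small quadratic test problem are all illustrative assumptions, not part of the slides.

```python
# Minimal sketch of Algorithm 1 (general coordinate minimization framework).
# The index rule and step size are supplied by the caller; here we use the
# fixed-step instantiation x_{k+1} = x_k - alpha * grad_i f(x_k) * e_i.
# All names and the test problem below are illustrative assumptions.

def coordinate_minimization(grad, x0, alpha, pick_index, iters=1000):
    x = list(x0)
    for k in range(iters):
        i = pick_index(k, x)            # choose i_k in {0, ..., n-1}
        x[i] -= alpha * grad(x)[i]      # move only along coordinate i_k
    return x

# Example: f(x) = 0.5*(2*x1^2 + 5*x2^2), minimized at the origin.
grad = lambda x: [2.0 * x[0], 5.0 * x[1]]   # component Lipschitz constants 2 and 5
cyclic = lambda k, x: k % len(x)            # cyclic index rule
x = coordinate_minimization(grad, [1.0, -1.0], alpha=0.2, pick_index=cyclic)
```

With alpha = 1/L_max = 0.2 each coordinate step is a contraction, so x approaches the minimizer at the origin.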

The following example shows that coordinate minimization may fail if f is non-smooth:

    minimize_{x_1, x_2} f(x_1, x_2) := −x_1 − x_2 + min{0, x_1}² + min{0, x_2}² + 3|x_1 − x_2|

[Figure: level curves for the function f defined above. Coordinate minimization cannot make progress from any point satisfying x_1 = x_2 > 0.]

Note: coordinate descent works if the non-smoothness is structured (block separable).

Algorithm 2 Coordinate minimization with cyclic order and exact minimization
1: Choose x_0 ∈ R^n and set k ← 0.
2: loop
3:   Choose i_k = mod(k, n) + 1.
4:   Calculate the exact coordinate minimizer: α_k ∈ argmin_{α ∈ R} f(x_k − α e_{i_k}).
5:   Set x_{k+1} ← x_k − α_k e_{i_k}.
6:   Set k ← k + 1.
7: end loop

Comments:
    This algorithm assumes that the exact minimizers exist and that they are unique.
    A reasonable stopping condition should be incorporated, such as ‖∇f(x_k)‖_2 ≤ 10⁻⁶ max{1, ‖∇f(x_0)‖_2}.

An interesting example introduced by Powell [5, formula (2)] is

    minimize_{x_1, x_2, x_3} f(x_1, x_2, x_3) := −x_1 x_2 − x_2 x_3 − x_1 x_3 + Σ_{i=1}^{3} max{0, |x_i| − 1}²

    f is continuously differentiable and nonconvex.
    f has minimizers at the vertices (1, 1, 1) and (−1, −1, −1) of the unit cube.
    Coordinate descent with exact minimization started just outside the unit cube near any non-optimal vertex cycles around neighborhoods of all 6 non-optimal vertices.
    Powell shows that the cyclic nonconvergence behavior is special and is destroyed by small perturbations on this particular example.

[Figure: the three-dimensional example given above, showing the possible lack of convergence of a coordinate descent method with exact minimization.]

Theorem 2.1 (see [6, Theorem 5.32])
Assume that the following hold:
    f is continuously differentiable;
    the level set L_0 := {x ∈ R^n : f(x) ≤ f(x_0)} is bounded; and
    for every x ∈ L_0 and all j ∈ {1, 2, ..., n}, the optimization problem
        minimize_{ζ ∈ R} f(x + ζ e_j)
    has a unique minimizer.
Then any limit point x̄ of the sequence {x_k} generated by Algorithm 2 satisfies ∇f(x̄) = 0.
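The failure of coordinate moves on the non-smooth two-dimensional example can be checked numerically. The formula used below, f(x1, x2) = −x1 − x2 + min{0, x1}² + min{0, x2}² + 3|x1 − x2|, is my reconstruction of the garbled slide formula and should be treated as an assumption; the check itself confirms that from a point with x1 = x2 > 0 no single-coordinate move decreases f, while a diagonal move does.

```python
# Numerical check of the non-smooth example (reconstructed formula, an
# assumption): from x1 = x2 = t > 0 no coordinate move decreases f,
# but moving along the diagonal does.

def f(x1, x2):
    return -x1 - x2 + min(0.0, x1) ** 2 + min(0.0, x2) ** 2 + 3.0 * abs(x1 - x2)

t = 1.0
base = f(t, t)
deltas = [s * 10.0 ** (-e) for e in range(6) for s in (1.0, -1.0)]
coord_moves = [f(t + d, t) for d in deltas] + [f(t, t + d) for d in deltas]
stuck = all(v >= base - 1e-12 for v in coord_moves)   # no coordinate move helps
diagonal_helps = f(t + 0.1, t + 0.1) < base           # but the diagonal does
```

The 3|x1 − x2| kink dominates the linear decrease along any single coordinate, which is exactly why the level curves trap the method on the diagonal.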
This example and others in [5] show that we cannot expect a general convergence result for nonconvex functions similar to that for full-gradient descent. (The picture was taken from [7].)

Proof: Since f(x_{k+1}) ≤ f(x_k) for all k ∈ N, we know that the sequence {x_k}_{k=0}^∞ ⊂ L_0. Since L_0 is bounded, {x_k}_{k=0}^∞ has at least one limit point; let x̄ be any such limit point. Thus, there exists a subsequence K ⊆ N satisfying

    lim_{k ∈ K} x_k = x̄.  (2)

Combining this with monotonicity of {f(x_k)} and continuity of f also shows that

    lim_{k ∈ K} f(x_k) = f(x̄) and f(x_k) ≥ f(x̄) for all k.  (3)

We assume ∇f(x̄) ≠ 0, and then reach a contradiction. First, consider the subsets K_i ⊆ K, i = 0, ..., n − 1, defined as K_i := {k ∈ K : k ≡ i (mod n)}. Since K is an infinite subsequence of the natural numbers, one of the K_i must be an infinite set. Without loss of generality, we assume it is K_0 (the argument is very similar for any other i, because we are using cyclic order). Let us perform a hypothetical "sweep" of coordinate minimization starting from x̄, so that we would obtain

    y := x̄_n, with x̄_l := x̄ + Σ_{j=1}^{l} [τ̄]_j e_j for all l = 1, ..., n,

where [τ̄]_j denotes the exact minimizing step in coordinate j during the sweep, and note that since ∇f(x̄) ≠ 0 by assumption, we must have

    f(y) < f(x̄). (why?)  (4)

NOTE: If K_i were infinite for some i ≠ 0, then we would do the above "sweep" at x̄ starting with coordinate i + 1 and going in cyclic order to cover all n coordinates.

Next, notice that, by construction of the coordinate minimization scheme,

    x_{k+l} = x_k + Σ_{j=1}^{l} τ_{k+j−1} e_j for all k ∈ K_0 and 1 ≤ l ≤ n,  (5)

where τ_{k+j−1} is the step taken in coordinate j, meaning that

    ‖x_{k+l} − x_k‖ = ‖(τ_k, τ_{k+1}, ..., τ_{k+l−1})‖ ≤ 2 max{‖x‖ : x ∈ L_0} < ∞ for all k ∈ K_0 and 1 ≤ l ≤ n,

where we used the assumption that L_0 is bounded. Since this shows that the set {(τ_k, τ_{k+1}, ..., τ_{k+n−1})ᵀ}_{k ∈ K_0} is bounded, we may pass to a subsequence K̄ ⊆ K_0 with

    lim_{k ∈ K̄} (τ_k, τ_{k+1}, ..., τ_{k+n−1}) = τ_L for some τ_L ∈ R^n.  (6)

Taking the limit of (5) over k ∈ K̄ for each l, and using (2) and (6), we find that

    lim_{k ∈ K̄} x_{k+l} = x̄ + Σ_{j=1}^{l} [τ_L]_j e_j for each 1 ≤ l ≤ n.  (7)

We next claim the following, which we will prove by induction:

    [τ_L]_p = [τ̄]_p for all 1 ≤ p ≤ n, and  (8)
    lim_{k ∈ K̄} x_{k+p} = x̄_p for all 1 ≤ p ≤ n.  (9)

Base case: p = 1. We know from the coordinate minimization that f(x_{k+1}) ≤ f(x_k + τ e_1) for all k ∈ K̄ and τ ∈ R. Taking limits over k ∈ K̄ and using continuity of f, (7) with l = 1, and (2) yields

    f(x̄ + [τ_L]_1 e_1) = f(lim_{k ∈ K̄} x_{k+1}) = lim_{k ∈ K̄} f(x_{k+1}) ≤ lim_{k ∈ K̄} f(x_k + τ e_1) = f(lim_{k ∈ K̄} x_k + τ e_1) = f(x̄ + τ e_1) for all τ ∈ R.

Since the minimizers in coordinate directions are unique by assumption, we know that [τ_L]_1 = [τ̄]_1, which is the first desired result. Also, combining it with (7) gives

    lim_{k ∈ K̄} x_{k+1} = x̄ + [τ_L]_1 e_1 = x̄ + [τ̄]_1 e_1 ≡ x̄_1,

which completes the base case.
Induction step: assume that (8) and (9) hold for all 1 ≤ p ≤ p̄ < n. We know from the coordinate minimization that f(x_{k+p̄+1}) ≤ f(x_{k+p̄} + τ e_{p̄+1}) for all k ∈ K̄ and τ ∈ R. Taking the limit over k ∈ K̄, continuity of f, (7) with l = p̄ + 1, and (9) give

    f(x̄ + Σ_{j=1}^{p̄+1} [τ_L]_j e_j) = f(lim_{k ∈ K̄} x_{k+p̄+1}) = lim_{k ∈ K̄} f(x_{k+p̄+1}) ≤ lim_{k ∈ K̄} f(x_{k+p̄} + τ e_{p̄+1}) = f(lim_{k ∈ K̄} x_{k+p̄} + τ e_{p̄+1}) = f(x̄_{p̄} + τ e_{p̄+1}) for all τ ∈ R.

Thus, the definition of x̄_{p̄} and the fact that (8) holds for all p ≤ p̄ show that

    f(x̄_{p̄} + [τ_L]_{p̄+1} e_{p̄+1}) = f(x̄ + Σ_{j=1}^{p̄} [τ̄]_j e_j + [τ_L]_{p̄+1} e_{p̄+1})
        = f(x̄ + Σ_{j=1}^{p̄} [τ_L]_j e_j + [τ_L]_{p̄+1} e_{p̄+1})
        = f(x̄ + Σ_{j=1}^{p̄+1} [τ_L]_j e_j)
        ≤ f(x̄_{p̄} + τ e_{p̄+1}) for all τ ∈ R.

Since the minimization in coordinate directions is unique by assumption, we know that [τ_L]_{p̄+1} = [τ̄]_{p̄+1}, which is the first desired result. Also, combining it with (7) gives

    lim_{k ∈ K̄} x_{k+p̄+1} = x̄ + Σ_{j=1}^{p̄+1} [τ_L]_j e_j = x̄ + Σ_{j=1}^{p̄+1} [τ̄]_j e_j ≡ x̄_{p̄+1},

which completes the proof by induction.

From our induction proof, we have that τ̄ = τ_L. Combining this with (7) and the definition of y gives

    lim_{k ∈ K̄} x_{k+n} = x̄ + Σ_{j=1}^{n} [τ_L]_j e_j = x̄ + Σ_{j=1}^{n} [τ̄]_j e_j = x̄_n ≡ y.  (10)

Finally, combining (3), continuity of f, (10), and (4) shows that

    f(x̄) ≤ lim_{k ∈ K̄} f(x_{k+n}) = f(lim_{k ∈ K̄} x_{k+n}) = f(y) < f(x̄),

which is a contradiction. This completes the proof.

Notation:
    Let L_j denote the jth component Lipschitz constant, i.e., it satisfies |∇_j f(x + t e_j) − ∇_j f(x)| ≤ L_j |t| for all x ∈ R^n and t ∈ R.
    Let L_max denote the coordinate Lipschitz constant, i.e., L_max := max_{1 ≤ i ≤ n} L_i.

Algorithm 3 Coordinate minimization with random order and a fixed step size
1: Choose α ∈ (0, 1/L_max].
2: Choose x_0 ∈ R^n and set k ← 0.
3: loop
4:   Choose i_k ∈ {1, 2, ..., n} randomly with equal probability.
5:   Set x_{k+1} ← x_k − α ∇_{i_k} f(x_k) e_{i_k}.
6:   Set k ← k + 1.
7: end loop

Comments:
    A reasonable stopping condition should be incorporated, such as ‖∇f(x_k)‖_2 ≤ 10⁻⁶ max{1, ‖∇f(x_0)‖_2}.
    A maximum number of iterations should be included in practice.

Theorem 2.2
Suppose that α = 1/L_max and let the following assumptions hold:
    f is convex;
    ∇f is globally Lipschitz continuous;
    the minimum value of f is attained on some set S, i.e., there exists S ⊆ R^n with f* := f(x*) = min_{x ∈ R^n} f(x) for every x* ∈ S;
    there exists a scalar R_0 satisfying max_{x* ∈ S} max_x {‖x − x*‖_2 : f(x) ≤ f(x_0)} ≤ R_0.
Then, the iterate sequence {x_k} generated by Algorithm 3 satisfies

    E[f(x_k)] − f* ≤ 2 n L_max R_0² / k.

Moreover, if f is strongly convex with parameter σ > 0, i.e.,

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (σ/2) ‖y − x‖_2² for all {x, y} ⊂ R^n,  (11)

then

    E[f(x_k)] − f* ≤ (1 − σ/(n L_max))^k (f(x_0) − f*).

Proof (follows Wright [7]): It follows from Taylor's Theorem, the definitions of L_j and L_max, and α = 1/L_max, that

    f(x_{k+1}) = f(x_k − α ∇_{i_k} f(x_k) e_{i_k})
        ≤ f(x_k) − α ∇_{i_k} f(x_k) e_{i_k}ᵀ∇f(x_k) + (1/2) α² L_{i_k} (∇_{i_k} f(x_k))²
        = f(x_k) − α (∇_{i_k} f(x_k))² + (1/2) α² L_{i_k} (∇_{i_k} f(x_k))²
        ≤ f(x_k) − α (∇_{i_k} f(x_k))² + (1/2) α² L_max (∇_{i_k} f(x_k))²
        = f(x_k) − α (1 − α L_max / 2) (∇_{i_k} f(x_k))²   (why? exercise)
        = f(x_k) − (1/(2 L_max)) (∇_{i_k} f(x_k))².  (12)

If we now take the expectation of both sides with respect to i_k, we find that

    E_{i_k}[f(x_{k+1})] ≤ f(x_k) − (1/(2 L_max)) (1/n) Σ_{j=1}^{n} (∇_j f(x_k))² = f(x_k) − (1/(2 n L_max)) ‖∇f(x_k)‖_2².  (13)

Subtracting f* from both sides shows that

    E_{i_k}[f(x_{k+1})] − f* ≤ f(x_k) − f* − (1/(2 n L_max)) ‖∇f(x_k)‖_2².

From the previous slide, we have

    E_{i_k}[f(x_{k+1})] − f* ≤ f(x_k) − f* − (1/(2 n L_max)) ‖∇f(x_k)‖_2².

Taking expectation with respect to all the random variables {i_0, i_1, i_2, ...}, and defining

    φ_k := E[f(x_k)] − f*,

we find that

    φ_{k+1} = E[f(x_{k+1})] − f*
        = E_{i_0, i_1, ..., i_{k−1}}[ E_{i_k}[f(x_{k+1}) | x_k] ] − f*
        ≤ E_{i_0, i_1, ..., i_{k−1}}[ f(x_k) − f* − (1/(2 n L_max)) ‖∇f(x_k)‖_2² ]
        = φ_k − (1/(2 n L_max)) E[‖∇f(x_k)‖_2²]
        ≤ φ_k − (1/(2 n L_max)) (E[‖∇f(x_k)‖_2])²,  (14)

where we used Jensen's Inequality to derive the last inequality. In particular,

    φ_{k+1} ≤ φ_k − (1/(2 n L_max)) (E[‖∇f(x_k)‖_2])².  (15)

Next, note from convexity of f, the definition of R_0, and the fact that f(x_k) ≤ f(x_0) for all k (by construction of the algorithm), that we have

    f(x_k) − f* ≤ ∇f(x_k)ᵀ(x_k − x*) ≤ ‖∇f(x_k)‖_2 ‖x_k − x*‖_2 ≤ R_0 ‖∇f(x_k)‖_2.  (16)

Taking expectation of both sides shows that

    φ_k = E[f(x_k) − f*] ≤ R_0 E[‖∇f(x_k)‖_2].

Combining this bound with (15) yields

    φ_{k+1} ≤ φ_k − (1/(2 n L_max R_0²)) φ_k².

Combining this with φ_{k+1} ≤ φ_k (see (15)), we have

    1/φ_{k+1} − 1/φ_k = (φ_k − φ_{k+1})/(φ_k φ_{k+1}) ≥ (φ_k − φ_{k+1})/φ_k² ≥ 1/(2 n L_max R_0²).  (17)

Summing both sides of (17) for k = 0, 1, ..., l − 1 shows that

    1/φ_l ≥ 1/φ_l − 1/φ_0 = Σ_{k=0}^{l−1} (1/φ_{k+1} − 1/φ_k) ≥ l/(2 n L_max R_0²).

Rearranging, replacing l by k, and using the definition of φ_k, this is equivalent to

    E[f(x_k)] − f* = φ_k ≤ 2 n L_max R_0² / k,

which is the first desired result.

For the second part, assume that f is strongly convex with parameter σ, i.e., that (11) holds. By choosing x = x_k and minimizing both sides with respect to y, we find that

    f* = min_{y ∈ R^n} f(y) ≥ min_{y ∈ R^n} [ f(x_k) + ∇f(x_k)ᵀ(y − x_k) + (σ/2) ‖y − x_k‖_2² ]
        = f(x_k) + ∇f(x_k)ᵀ(y_k − x_k) + (σ/2) ‖y_k − x_k‖_2²
        = f(x_k) − (1/σ) ‖∇f(x_k)‖_2² + (1/(2σ)) ‖∇f(x_k)‖_2²
        = f(x_k) − (1/(2σ)) ‖∇f(x_k)‖_2²,

where y_k := x_k − ∇f(x_k)/σ. In other words, we have proved that

    f* ≥ f(x_k) − (1/(2σ)) ‖∇f(x_k)‖_2².  (18)
Combining the bound (18) with the middle inequality in (14) gives

    φ_{k+1} ≤ φ_k − (1/(2 n L_max)) E[‖∇f(x_k)‖_2²]
        ≤ φ_k − (1/(2 n L_max)) E[2σ (f(x_k) − f*)]
        = φ_k − (σ/(n L_max)) E[f(x_k) − f*]
        = φ_k − (σ/(n L_max)) φ_k
        = (1 − σ/(n L_max)) φ_k.

Applying this recursively shows that

    φ_k ≤ (1 − σ/(n L_max))^k φ_0,

so that, after we use the definition of φ_k, we have

    E[f(x_k)] − f* ≤ (1 − σ/(n L_max))^k (f(x_0) − f*),

which is the second desired result.
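Algorithm 3 is easy to try on a diagonal quadratic f(x) = (1/2) Σ_i λ_i x_i², for which ∇_i f(x) = λ_i x_i and L_i = λ_i. The sketch below is an illustrative assumption (function names, eigenvalues, and seed are made up); it uses α = 1/L_max as in Theorem 2.2.

```python
import random

# Sketch of Algorithm 3 (uniform random coordinate, fixed step 1/L_max)
# on f(x) = 0.5 * sum(lam_i * x_i^2), where grad_i f(x) = lam_i * x_i.
# The eigenvalues, starting point, and seed are illustrative assumptions.

def random_cd(lam, x0, iters, seed=0):
    rng = random.Random(seed)
    alpha = 1.0 / max(lam)                 # alpha = 1 / L_max
    x = list(x0)
    for _ in range(iters):
        i = rng.randrange(len(x))          # uniform random coordinate
        x[i] -= alpha * lam[i] * x[i]      # coordinate-gradient step
    return x

lam = [1.0, 4.0, 10.0]
x = random_cd(lam, [1.0, 1.0, 1.0], iters=2000)
fval = 0.5 * sum(l * xi ** 2 for l, xi in zip(lam, x))   # f* = 0 here
```

Since f is strongly convex with σ = min_i λ_i, Theorem 2.2 predicts a linear rate (1 − σ/(n L_max)) in expectation, and the objective value drops to numerical zero.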

Notation:
    Let L_j denote the jth component Lipschitz constant, i.e., it satisfies |∇_j f(x + t e_j) − ∇_j f(x)| ≤ L_j |t| for all x ∈ R^n and t ∈ R.
    Let L_max denote the coordinate Lipschitz constant, i.e., L_max := max_{1 ≤ i ≤ n} L_i.
    Let L denote the Lipschitz constant for ∇f.

Algorithm 4 Coordinate minimization with cyclic order and a fixed step size
1: Choose α ∈ (0, 1/L_max].
2: Choose x_0 ∈ R^n and set k ← 0.
3: loop
4:   Choose i_k = mod(k, n) + 1.
5:   Set x_{k+1} ← x_k − α ∇_{i_k} f(x_k) e_{i_k}.
6:   Set k ← k + 1.
7: end loop

Comments:
    A reasonable stopping condition should be incorporated, such as ‖∇f(x_k)‖_2 ≤ 10⁻⁶ max{1, ‖∇f(x_0)‖_2}.
    A maximum number of allowed iterations should be included in practice.

Theorem 2.3 (see [1, Theorem 3.6, Theorem 3.9] and [7, Theorem 3])
Suppose that α = 1/L_max and let the following assumptions hold:
    f is convex;
    ∇f is globally Lipschitz continuous with constant L;
    the minimum value of f is attained on some set S, i.e., there exists S ⊆ R^n with f* := f(x*) = min_{x ∈ R^n} f(x) for every x* ∈ S;
    there exists a scalar R_0 satisfying max_{x* ∈ S} max_x {‖x − x*‖_2 : f(x) ≤ f(x_0)} ≤ R_0.
If {x_k} is the iterate sequence of Algorithm 4, then for k ∈ {n, 2n, 3n, ...} we have

    f(x_k) − f* ≤ 4 n L_max (1 + n L²/L_max²) R_0² / (k + 8).  (19)

If f is strongly convex with parameter σ > 0 (see (11)), then for k ∈ {n, 2n, 3n, ...}

    f(x_k) − f* ≤ (1 − σ/(2 L_max n (1 + n L²/L_max²)))^{k/n} (f(x_0) − f*).

Proof: See [1, Theorem 3.6 and Theorem 3.9] and use: (i) each "iteration k" in [1] is a cycle of n iterations here; (ii) choose in [1] the values L_i = L_max for all i; (iii) in [1] we have p = 1 since our blocks of variables are singletons, i.e., coordinate descent.

Comments on Theorem 2.3:
    The numerator in (19) is O(n²), while the numerator in the analogous result (Theorem 2.2) for the random coordinate choice with fixed step size is O(n).
    But Theorem 2.3 is a deterministic result, while Theorem 2.2 is a result in expectation.
    As part of the homework assignment, you will find out for yourself how these methods perform on a simple quadratic objective function.
    It can be shown that L ≤ Σ_{j=1}^{n} L_j (see [3, Lemma 2 with α = 1]).
    It follows from the fact that |∇_j f(x + t e_j) − ∇_j f(x)| ≤ ‖∇f(x + t e_j) − ∇f(x)‖_2 ≤ L |t| holds for all j, t, and x that L_j ≤ L.
    By combining the previous two bullet points, we find that max_j L_j ≤ L ≤ Σ_j L_j ≤ n max_j L_j, so that 1 ≤ L/L_max ≤ n.
    Roughly speaking, L/L_max is closer to 1 when the coordinates are more "decoupled". In light of (19), the complexity result for coordinate descent becomes better as the variables become more decoupled. This makes sense!

Algorithm 5 Coordinate minimization with the Gauss-Southwell rule and a fixed step size
1: Choose α ∈ (0, 1/L_max].
2: Choose x_0 ∈ R^n and set k ← 0.
3: loop
4:   Calculate i_k as the steepest coordinate direction, i.e., i_k ∈ argmax_i |∇_i f(x_k)|.
5:   Set x_{k+1} ← x_k − α ∇_{i_k} f(x_k) e_{i_k}.
6:   Set k ← k + 1.
7: end loop

Comments:
    A reasonable stopping condition should be incorporated, such as ‖∇f(x_k)‖_2 ≤ 10⁻⁶ max{1, ‖∇f(x_0)‖_2}.
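The Gauss-Southwell rule in Algorithm 5 can be sketched on the same kind of diagonal quadratic used earlier (f(x) = (1/2) Σ_i λ_i x_i², so ∇_i f(x) = λ_i x_i); the names and numbers below are illustrative assumptions. Note that computing the full gradient each iteration, as done here, is exactly the per-iteration cost that makes Gauss-Southwell expensive in general.

```python
# Sketch of Algorithm 5: Gauss-Southwell rule with fixed step 1/L_max
# on f(x) = 0.5 * sum(lam_i * x_i^2).  Illustrative assumption only.

def gauss_southwell_cd(lam, x0, iters):
    alpha = 1.0 / max(lam)                               # alpha = 1 / L_max
    x = list(x0)
    for _ in range(iters):
        g = [l * xi for l, xi in zip(lam, x)]            # full gradient (cheap here)
        i = max(range(len(x)), key=lambda j: abs(g[j]))  # steepest coordinate
        x[i] -= alpha * g[i]
    return x

lam = [1.0, 4.0, 10.0]
x = gauss_southwell_cd(lam, [1.0, 1.0, 1.0], iters=300)
fval = 0.5 * sum(l * xi ** 2 for l, xi in zip(lam, x))   # f* = 0 here
```

The deterministic linear rate of Theorem 2.4 (and the sharper Theorem 2.5) applies, so the objective value reaches numerical zero quickly.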

Theorem 2.4
Suppose that α = 1/L_max and let the following assumptions hold:
    f is convex;
    ∇f is globally Lipschitz continuous;
    the minimum value of f is attained on some set S, i.e., there exists S ⊆ R^n with f* := f(x*) = min_{x ∈ R^n} f(x) for every x* ∈ S;
    there exists a scalar R_0 satisfying max_{x* ∈ S} max_x {‖x − x*‖_2 : f(x) ≤ f(x_0)} ≤ R_0.
Then, the iterate sequence {x_k} computed by Algorithm 5 satisfies

    f(x_k) − f* ≤ 2 n L_max R_0² / k.  (20)

If f is strongly convex with parameter σ > 0 (see (11)), then

    f(x_k) − f* ≤ (1 − σ/(n L_max))^k (f(x_0) − f*).  (21)

Proof: From earlier (see (12)), we showed that

    f(x_{k+1}) ≤ f(x_k) − (1/(2 L_max)) (∇_{i_k} f(x_k))².

Combining this with the choice i_k ∈ argmax_i |∇_i f(x_k)| and the standard norm inequality ‖v‖_2 ≤ √n ‖v‖_∞, it holds that

    f(x_{k+1}) ≤ f(x_k) − (1/(2 L_max)) (∇_{i_k} f(x_k))² = f(x_k) − (1/(2 L_max)) ‖∇f(x_k)‖_∞² ≤ f(x_k) − (1/(2 n L_max)) ‖∇f(x_k)‖_2².  (22)

Subtracting f* from both sides and using the previous fact (see (16)) that f(x_k) − f* ≤ R_0 ‖∇f(x_k)‖_2, we find that

    f(x_{k+1}) − f* ≤ f(x_k) − f* − (1/(2 n L_max)) ‖∇f(x_k)‖_2² ≤ f(x_k) − f* − (1/(2 n L_max R_0²)) (f(x_k) − f*)².

Using the notation φ_k := f(x_k) − f*, this is equivalent to

    φ_{k+1} ≤ φ_k − (1/(2 n L_max R_0²)) φ_k²,

which is exactly the same as inequality (17) except that we now have a different (deterministic) definition of φ_k. Then, as shown in that proof, we have

    f(x_k) − f* = φ_k ≤ 2 n L_max R_0² / k,

which is the desired result for convex f.

Next, assume that f is strongly convex, for which we showed earlier (see (18)) that

    f* ≥ f(x_k) − (1/(2σ)) ‖∇f(x_k)‖_2².

Subtracting f* from each side of (22) and then using the previous inequality shows that

    f(x_{k+1}) − f* ≤ f(x_k) − f* − (1/(2 n L_max)) ‖∇f(x_k)‖_2² ≤ f(x_k) − f* − (σ/(n L_max)) (f(x_k) − f*) = (1 − σ/(n L_max)) (f(x_k) − f*),

so that

    f(x_k) − f* ≤ (1 − σ/(n L_max))^k (f(x_0) − f*),

which is the last desired result.

Comments so far for fixed step size:
    Cyclic has the worst dependence on n: cyclic is O(n²), while random and Gauss-Southwell are O(n).
    Random is a rate in expectation; Gauss-Southwell is a deterministic rate.
    There is a better analysis for Gauss-Southwell when we assume that f is strongly convex, and it changes the above comparison! See [4]. We show this next.

Theorem 2.5
Suppose that α = 1/L_max and let the following assumptions hold:
    f is l_1-strongly convex, i.e., there exists σ_1 > 0 such that f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (σ_1/2) ‖y − x‖_1² for all {x, y} ⊂ R^n;
    ∇f is globally Lipschitz continuous;
    the minimum value of f is attained.
Then, the iterate sequence {x_k} computed by Algorithm 5 satisfies

    f(x_k) − f* ≤ (1 − σ_1/L_max)^k (f(x_0) − f*).

Proof (see [4]): l_1-strong convexity means that

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (σ_1/2) ‖y − x‖_1² for all {x, y} ⊂ R^n,

where σ_1 is the l_1-strong convexity parameter. If we now minimize both sides with respect to y and replace x by x_k, we find that

    f* = min_{y ∈ R^n} f(y) ≥ min_{y ∈ R^n} [ f(x_k) + ∇f(x_k)ᵀ(y − x_k) + (σ_1/2) ‖y − x_k‖_1² ]
        = f(x_k) + ∇f(x_k)ᵀ(y_k − x_k) + (σ_1/2) ‖y_k − x_k‖_1²   (why? exercise)
        = f(x_k) − (1/(2σ_1)) ‖∇f(x_k)‖_∞²,

where y_k := x_k + z_k with

    [z_k]_i := 0 if i ≠ l, and [z_k]_l := −∇_l f(x_k)/σ_1,

and l is any index satisfying l ∈ {j : |∇_j f(x_k)| = ‖∇f(x_k)‖_∞}. Therefore, we have that

    ‖∇f(x_k)‖_∞² ≥ 2σ_1 (f(x_k) − f*).

Subtracting f* from both sides of (22), whose middle expression is f(x_k) − (1/(2 L_max)) ‖∇f(x_k)‖_∞² for the Gauss-Southwell choice, and using the previous inequality shows that

    f(x_{k+1}) − f* ≤ f(x_k) − f* − (1/(2 L_max)) ‖∇f(x_k)‖_∞² ≤ f(x_k) − f* − (σ_1/L_max) (f(x_k) − f*) = (1 − σ_1/L_max) (f(x_k) − f*).

Applying this inequality recursively gives

    f(x_k) − f* ≤ (1 − σ_1/L_max)^k (f(x_0) − f*),

which is the desired result.

For strongly convex functions:
    Random coordinate choice has the expected rate E[f(x_k)] − f* ≤ (1 − σ/(n L_max))^k (f(x_0) − f*).
    Gauss-Southwell coordinate choice has the deterministic rate f(x_k) − f* ≤ (1 − σ_1/L_max)^k (f(x_0) − f*).  (23)
The bound for Gauss-Southwell is better since σ/n ≤ σ_1 ≤ σ, so that σ/(n L_max) ≤ σ_1/L_max.

Example: A Simple Diagonal Quadratic Function
Consider the problem

    minimize_{x ∈ R^n} gᵀx + (1/2) xᵀHx,

where H = diag(λ_1, λ_2, ..., λ_n) with λ_i > 0 for all i ∈ {1, 2, ..., n}. For this problem, we know that

    σ = min{λ_1, λ_2, ..., λ_n} and σ_1 = (Σ_{i=1}^{n} 1/λ_i)⁻¹.

Case 1: λ_1 = λ_2 = ... = λ_n = α for some α > 0 (the case in which σ_1 attains its minimum value σ/n), which gives

    σ = α and σ_1 = α/n.

Thus, the convergence constants are:

    random selection: σ/(n L_max) = α/(nα) = 1/n;
    Gauss-Southwell selection: σ_1/L_max = (α/n)/α = 1/n;

so the convergence constants are the same; this is the worst case for Gauss-Southwell.

Case 2: For the other extreme case, let us suppose that λ_1 = β and λ_2 = λ_3 = ... = λ_n = α with α ≥ β. For this case, it can be shown that

    σ = β and σ_1 = (1/β + (n−1)/α)⁻¹ = βα/(α + (n−1)β).

If we now take the limit as α → ∞, we find that σ = β and σ_1 → β = σ. Thus, the convergence constants in the limit satisfy

    random selection: σ/(n L_max) = β/(n L_max);
    Gauss-Southwell selection: σ_1/L_max → β/L_max;

so that Gauss-Southwell is a factor n faster than using a random coordinate selection.

Alternative 1 (strongly convex): individual coordinate Lipschitz constants. The iteration update is

    x_{k+1} = x_k − (1/L_{i_k}) ∇_{i_k} f(x_k) e_{i_k}.

Using a similar analysis as before, it can be shown that

    f(x_k) − f* ≤ [ Π_{j=0}^{k−1} (1 − σ_1/L_{i_j}) ] (f(x_0) − f*).

This gives a better decrease than the prior analysis (see (23)) since each factor satisfies 1 − σ_1/L_{i_j} ≤ 1 − σ_1/L_max, and it is strictly faster provided at least one of the used L_{i_j} satisfies L_{i_j} < L_max.

Alternative 2 (strongly convex): Lipschitz sampling. Use a random coordinate direction chosen using a non-uniform probability distribution:

    P(i_k = j) = L_j / Σ_{l=1}^{n} L_l for all j ∈ {1, 2, ..., n}.

Using an analysis similar to the previous one, but using the new probability distribution when computing the expectation, it can be shown that

    E[f(x_{k+1})] − f* ≤ (1 − σ/(n L̄)) (E[f(x_k)] − f*),

with L̄ being the average component Lipschitz constant, i.e., L̄ := (1/n) Σ_{i=1}^{n} L_i. This analysis was first performed in [2]. The rate is faster than uniform random sampling if not all of the component Lipschitz constants are the same.
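The two cases in the diagonal quadratic example above can be checked numerically. The sketch below uses the slide's formulas σ = min λ_i and σ_1 = (Σ 1/λ_i)⁻¹ for the diagonal case; the specific eigenvalues are illustrative assumptions.

```python
# Numeric check of the diagonal quadratic example, H = diag(lam):
# sigma = min(lam) and, for this diagonal case, sigma_1 = 1/sum(1/lam_i).

def sigma_pair(lam):
    sigma = min(lam)
    sigma1 = 1.0 / sum(1.0 / l for l in lam)
    return sigma, sigma1

# Case 1: all eigenvalues equal -> sigma_1 = sigma/n, and the constants
# sigma/(n*L_max) (random) and sigma_1/L_max (Gauss-Southwell) coincide.
n, a = 4, 2.0
s, s1 = sigma_pair([a] * n)
same = abs(s / (n * a) - s1 / a) < 1e-15

# Case 2: lam = (beta, a, ..., a) with a >> beta -> sigma_1 -> beta = sigma,
# so Gauss-Southwell's constant approaches n times the random one
# (L_max cancels in the ratio).
beta, big = 1.0, 1e9
s2, s12 = sigma_pair([beta] + [big] * (n - 1))
ratio = s12 * n / s2
```

The ratio approaches n = 4 as the large eigenvalues grow, matching the "factor n faster" conclusion.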

Alternative 3 (strongly convex): Gauss-Southwell-Lipschitz rule. Choose i_k according to the rule

    i_k ∈ argmax_i (∇_i f(x_k))² / L_i.  (24)

Using an argument similar to that which led to (12), it may be shown that

    f(x_{k+1}) ≤ f(x_k) − (1/(2 L_{i_k})) (∇_{i_k} f(x_k))².  (25)

The rule (24) is designed to choose i_k to maximize the guaranteed decrease given by (25), which uses the component Lipschitz constants. It may be shown, using this update, that

    f(x_{k+1}) − f* ≤ (1 − σ_L) (f(x_k) − f*),

where σ_L is the strong convexity parameter with respect to the norm ‖v‖_L := Σ_{i=1}^{n} √L_i |v_i|. It is shown in [4, Appendix 6.2] that

    max{ σ/(n L̄), σ_1/L_max } ≤ σ_L ≤ σ_1 / min_i{L_i}.

Ordering of the constants in the linear convergence results when f is strongly convex:

    random uniform sampling (uses L_max) < random Lipschitz sampling (uses {L_i}) < Gauss-Southwell-Lipschitz;
    Gauss-Southwell (uses L_max) < Gauss-Southwell with {L_i} < Gauss-Southwell-Lipschitz.

Comments:
    Gauss-Southwell-Lipschitz achieves the best rate, but is the most expensive per iteration.
    Better rates are obtained if you know and use {L_i} instead of just their max, i.e., L_max.
    Gauss-Southwell-Lipschitz is at least as fast as the faster of the Gauss-Southwell and Lipschitz sampling options.

Linear Equations
Let m ≤ n, b ∈ R^m, and Aᵀ = [a_1 ... a_m] ∈ R^{n×m} with ‖a_i‖_2 = 1 for all i. Furthermore, suppose that Aᵀ has full column rank, so that the linear system Aw = b is consistent (with infinitely many solutions when m < n). To seek the least-length solution, we wish to solve

    minimize_{w ∈ R^n} (1/2) ‖w‖_2² subject to Aw = b.

The Lagrangian dual problem is

    minimize_{x ∈ R^m} f(x) := (1/2) ‖Aᵀx‖_2² − bᵀx,

where we note that ∇f(x) = A(Aᵀx) − b and ∇_i f(x) = a_iᵀ(Aᵀx) − b_i. The solutions to the primal and dual are related via w* = Aᵀx*. Coordinate descent gives

    x_{k+1} = x_k − α (a_iᵀ(Aᵀx_k) − b_i) e_i.

If we maintain an estimate w_k = Aᵀx_k, then we see that

    w_{k+1} = Aᵀx_{k+1} = Aᵀ(x_k − α (a_iᵀw_k − b_i) e_i) = Aᵀx_k − α (a_iᵀw_k − b_i) Aᵀe_i = w_k − α (a_iᵀw_k − b_i) a_i.
Note that if α = 1, then it follows by using ‖a_i‖_2 = 1 that

    a_iᵀw_{k+1} = a_iᵀw_k − (a_iᵀw_k − b_i) a_iᵀa_i = b_i,

so that the i-th equation is satisfied exactly.

Linear Equations Summary
    Coordinate minimization for solving the dual problem associated with linear equations, along the direction e_i with α = 1, satisfies the ith linear equation exactly. It is sometimes called the method of successive projections.
    Update: w_{k+1} = w_k − α (a_iᵀw_k − b_i) a_i costs
        n + 1 additions/subtractions,
        2n + 1 multiplications,
        3n + 2 total floating-point operations.
    Computing ∇f(x) requires a multiplication with A, which is much more expensive.
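The α = 1 update above can be sketched in primal form; each step projects w onto the hyperplane of one equation, so the selected residual becomes exactly zero. The 2×3 system below is made-up illustrative data with unit-norm rows.

```python
# Successive-projection step for linear equations (alpha = 1):
# w_{k+1} = w_k - (a_i^T w_k - b_i) a_i, rows a_i assumed unit-norm.

def project_row(w, a_i, b_i):
    r = sum(ai * wi for ai, wi in zip(a_i, w)) - b_i   # residual a_i^T w - b_i
    return [wi - r * ai for wi, ai in zip(w, a_i)]     # O(n) flops, as tallied above

# Small consistent system with unit-norm rows (hypothetical data).
A = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]
b = [1.0, 2.0]
w = [0.0, 0.0, 0.0]
for _ in range(50):                 # cycle through the equations
    for a_i, b_i in zip(A, b):
        w = project_row(w, a_i, b_i)
residuals = [abs(sum(ai * wi for ai, wi in zip(a_i, w)) - b_i)
             for a_i, b_i in zip(A, b)]
```

Starting from w = 0 the iterates stay in the row space of A, so the limit is the least-length solution w* = Aᵀx*, consistent with the primal-dual relationship above.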

Logistic Regression
Given data {d_j}_{j=1}^{N} ⊂ R^n and labels {y_j}_{j=1}^{N} ⊂ {−1, 1} associated with the data, solve

    minimize_{x ∈ R^n} f(x) := (1/N) Σ_{j=1}^{N} log(1 + e^{−y_j d_jᵀx}).

If we define the data matrix D such that

    D = [d_1ᵀ; d_2ᵀ; ...; d_Nᵀ] ∈ R^{N×n},

then it follows that

    ∇_i f(x) = −(1/N) Σ_{j=1}^{N} (e^{−y_j d_jᵀx} / (1 + e^{−y_j d_jᵀx})) y_j d_{ji}.

Consider the coordinate minimization update

    x_{k+1} = x_k + α e_{i_k} for some i_k ∈ {1, 2, ..., n} and α ∈ R.

For efficiency, we store and update the required quantities {Dx_k} using

    Dx_{k+1} (new value) = D(x_k + α e_{i_k}) = Dx_k + α D e_{i_k} = Dx_k (old value) + α D(:, i_k),

where D(:, i_k) denotes the i_k-th column of D; if x_0 = 0, then we can set Dx_0 = 0.

Logistic Regression Summary
    Coordinate minimization for the logistic regression problem does not require computing the entire gradient during every iteration.
    The update to obtain Dx_{k+1} requires a single vector-vector add.
    Computing ∇_i f(x_k) only requires accessing a single column of the data matrix D.
    Computing ∇f(x) requires accessing the entire data matrix D.

References
[1] A. Beck and L. Tetruashvili, On the convergence of block coordinate descent type methods, SIAM Journal on Optimization, 23 (2013), pp. 2037-2060.
[2] D. Leventhal and A. S. Lewis, Randomized methods for linear constraints: convergence rates and conditioning, Mathematics of Operations Research, 35 (2010), pp. 641-654.
[3] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization, 22 (2012), pp. 341-362.
[4] J. Nutini, M. Schmidt, I. H. Laradji, M. Friedlander, and H. Koepke, Coordinate descent converges faster with the Gauss-Southwell rule than random selection, in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 1632-1641.
[5] M. J. Powell, On search directions for minimization algorithms, Mathematical Programming, 4 (1973), pp. 193-201.
[6] A. P. Ruszczyński, Nonlinear Optimization, Princeton University Press, 2006.
[7] S. J. Wright, Coordinate descent algorithms, Mathematical Programming, 151 (2015), pp. 3-34.