Coordinate Minimization
Daniel P. Robinson
Department of Applied Mathematics and Statistics, Johns Hopkins University
November 27, 2018

Outline
1 Introduction
2 Algorithms
  - Cyclic order with exact minimization
  - Random order with fixed step size
  - Cyclic order with fixed step size
  - Steepest direction (Gauss-Southwell) rule with fixed step size
  - Alternatives
  - Summary
3 Examples
  - Linear equations
  - Logistic regression

Basic idea of coordinate minimization: given a function $f : \mathbb{R}^n \to \mathbb{R}$, consider the unconstrained optimization problem
$$\min_{x \in \mathbb{R}^n} f(x),$$
and compute the next iterate using the update
$$x_{k+1} = x_k - \alpha_k e_{i_k}.$$

We will consider various assumptions on $f$:
- nonconvex and differentiable $f$
- convex and differentiable $f$
- strongly convex and differentiable $f$

We will not consider general non-smooth $f$, because no convergence guarantees can be proved in that setting (see the example below). We will briefly consider structured non-smooth problems, i.e., problems that use an additional separable regularizer.

Notation: $f_k := f(x_k)$ and $g_k := \nabla f(x_k)$.

Algorithm 1 General coordinate minimization framework.
1: Choose $x_0 \in \mathbb{R}^n$ and set $k \gets 0$.
2: loop
3:   Choose $i_k \in \{1, 2, \ldots, n\}$.
4:   Choose a step size $\alpha_k$.
5:   Set $x_{k+1} \gets x_k - \alpha_k e_{i_k}$.
6:   Set $k \gets k + 1$.
7: end loop

$\alpha_k$ is the step size. Options include:
- fixed, but sufficiently small
- inexact linesearch
- exact linesearch

The index $i_k \in \{1, 2, \ldots, n\}$ has to be chosen. Options include:
- cycle through the entire set
- choose it randomly without replacement
- choose it randomly with replacement
- choose it based on which element of $\nabla f(x_k)$ is largest in absolute value

$e_{i_k}$ is the $i_k$-th coordinate vector, so this update seeks better points in $x_k + \mathrm{span}\{e_{i_k}\}$.
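To make the framework concrete, here is a minimal Python sketch of the general loop, assuming gradient-based coordinate steps with a fixed step size. The function `coordinate_minimization`, the helper `grad_i`, and the diagonal quadratic test data `H`, `b` are all hypothetical illustrations, not from the lecture:

```python
import numpy as np

def coordinate_minimization(grad_i, x0, alpha, n_iters=1000, rule="cyclic", rng=None):
    """Sketch of the general coordinate-minimization framework (Algorithm 1).

    grad_i(x, i): returns the i-th partial derivative of f at x.
    rule: "cyclic" or "random" coordinate selection.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    n = x.size
    for k in range(n_iters):
        i = k % n if rule == "cyclic" else int(rng.integers(n))  # choose i_k
        x[i] -= alpha * grad_i(x, i)  # step only along the coordinate e_i
    return x

# Hypothetical test problem: f(x) = 0.5 * x^T diag(H) x - b^T x.
H = np.array([2.0, 5.0, 1.0])
b = np.array([1.0, -1.0, 0.5])
grad_i = lambda x, i: H[i] * x[i] - b[i]          # i-th partial derivative
x = coordinate_minimization(grad_i, np.zeros(3), alpha=1.0 / H.max())
```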
The following example shows that coordinate minimization may fail if $f$ is non-smooth:
$$\min_{x_1, x_2} \ f(x_1, x_2) := -x_1 - x_2 + \min\{0, x_1\}^2 + \min\{0, x_2\}^2 + 3|x_1 - x_2|.$$

Figure 1: Level curves for the function $f$ defined above. Coordinate minimization cannot make progress from any point satisfying $x_1 = x_2 > 0$.

Note: coordinate descent does work if the non-smoothness is structured (block separable).

Algorithm 2 Coordinate minimization with cyclic order and exact minimization.
1: Choose $x_0 \in \mathbb{R}^n$ and set $k \gets 0$.
2: loop
3:   Choose $i_k = \mathrm{mod}(k, n) + 1$.
4:   Calculate the exact coordinate minimizer
     $$\alpha_k \in \arg\min_{\alpha \in \mathbb{R}} f(x_k - \alpha e_{i_k}).$$
5:   Set $x_{k+1} \gets x_k - \alpha_k e_{i_k}$.
6:   Set $k \gets k + 1$.
7: end loop

Comments:
- This algorithm assumes that the exact minimizers exist and that they are unique.
- A reasonable stopping condition should be incorporated, such as $\|\nabla f(x_k)\|_2 \le 10^{-6} \max\{1, \|\nabla f(x_0)\|_2\}$.

An interesting example introduced by Powell [5, formula (2)] is
$$\min_{x_1, x_2, x_3} \ f(x_1, x_2, x_3) := -(x_1 x_2 + x_2 x_3 + x_1 x_3) + \sum_{i=1}^{3} \max\{0, |x_i| - 1\}^2.$$
- $f$ is continuously differentiable and nonconvex.
- $f$ has minimizers at the vertices $(1, 1, 1)$ and $(-1, -1, -1)$ of the unit cube.
- Coordinate descent with exact minimization, started just outside the unit cube near any non-optimal vertex, cycles around neighborhoods of all six non-optimal vertices.
- Powell shows that this cyclic nonconvergence behavior is special and is destroyed by small perturbations on this particular example.

Figure 2: The three-dimensional example given above, showing the possible lack of convergence of a coordinate descent method with exact minimization. This example and others in [5] show that we cannot expect a general convergence result for nonconvex functions similar to that for full-gradient descent. This picture was taken from [7].

Theorem 2.1 (see [6, Theorem 5.32]) Assume that the following hold:
- $f$ is continuously differentiable;
- the level set $L_0 := \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$ is bounded; and
- for every $x \in L_0$ and all $j \in \{1, 2, \ldots, n\}$, the optimization problem $\min_{\zeta \in \mathbb{R}} f(x + \zeta e_j)$ has a unique minimizer.
Then, any limit point $x^*$ of the sequence $\{x_k\}$ generated by Algorithm 2 satisfies $\nabla f(x^*) = 0$.

Proof: Since $f(x_{k+1}) \le f(x_k)$ for all $k \in \mathbb{N}$, we know that the sequence $\{x_k\}_{k=0}^{\infty} \subset L_0$. Since $L_0$ is bounded, $\{x_k\}_{k=0}^{\infty}$ has at least one limit point; let $x^*$ be any such limit point. Thus, there exists a subsequence $K \subseteq \mathbb{N}$ satisfying
$$\lim_{k \in K} x_k = x^*. \quad (2)$$
Combining this with monotonicity of $\{f(x_k)\}$ and continuity of $f$ also shows that
$$\lim_{k \to \infty} f(x_k) = f(x^*) \quad \text{and} \quad f(x_k) \ge f(x^*) \ \text{for all } k. \quad (3)$$
We assume $\nabla f(x^*) \ne 0$, and then reach a contradiction. First, consider the subsets $K_i \subseteq K$, $i = 0, \ldots, n-1$, defined as $K_i := \{k \in K : k \equiv i \ (\mathrm{mod}\ n)\}$. Since $K$ is an infinite subsequence of the natural numbers, one of the $K_i$ must be an infinite set. Without loss of generality, we assume it is $K_0$ (the argument is very similar for any other $i$, because we are using cyclic order).

Let us perform a hypothetical "sweep" of coordinate minimization starting from $x^*$, so that we would obtain $y := x^*_n$, with
$$x^*_l := x^* + \sum_{j=1}^{l} [\tau^*]_j e_j \quad \text{for all } l = 1, \ldots, n,$$
where $[\tau^*]_j$ denotes the exact coordinate minimizer along $e_j$ computed at $x^*_{j-1}$ (with $x^*_0 := x^*$), and note that since $\nabla f(x^*) \ne 0$ by assumption, we must have
$$f(y) < f(x^*). \quad \text{(why?)} \quad (4)$$

NOTE: If $K_i$ were infinite for some $i \ne 0$, then we would do the above "sweep" at $x^*$ starting with coordinate $i+1$ and going in cyclic order to cover all $n$ coordinates.

Next, notice by construction of the coordinate minimization scheme that
$$x_{k+l} = x_k + \sum_{j=1}^{l} \tau_{k+j-1} e_j \quad \text{for all } k \in K_0 \text{ and } 1 \le l \le n, \quad (5)$$
meaning that
$$\|x_{k+l} - x_k\|_2 = \big\|(\tau_k, \tau_{k+1}, \ldots, \tau_{k+l-1})^T\big\|_2 \le 2 \max\{\|x\|_2 : x \in L_0\} < \infty \quad \text{for all } k \in K_0 \text{ and } 1 \le l \le n,$$
where we used the assumption that $L_0$ is bounded. Since this shows that the set $\{(\tau_k, \tau_{k+1}, \ldots, \tau_{k+n-1})^T\}_{k \in K_0}$ is bounded, we may pass to a subsequence $\bar K \subseteq K_0$ with
$$\lim_{k \in \bar K} (\tau_k, \tau_{k+1}, \ldots, \tau_{k+n-1})^T = \tau^L \quad \text{for some } \tau^L \in \mathbb{R}^n. \quad (6)$$
Taking the limit of (5) over $k \in \bar K$ for each $l$, and using (2) and (6), we find that
$$\lim_{k \in \bar K} x_{k+l} = x^* + \sum_{j=1}^{l} [\tau^L]_j e_j \quad \text{for each } 1 \le l \le n. \quad (7)$$

We next claim the following, which we will prove by induction:
$$[\tau^L]_p = [\tau^*]_p \quad \text{for all } 1 \le p \le n, \quad (8)$$
$$\lim_{k \in \bar K} x_{k+p} = x^*_p \quad \text{for all } 1 \le p \le n. \quad (9)$$

Base case: $p = 1$. We know from the coordinate minimization that $f(x_{k+1}) \le f(x_k + \tau e_1)$ for all $k \in \bar K$ and $\tau \in \mathbb{R}$. Taking limits over $k \in \bar K$ and using continuity of $f$, (7) with $l = 1$, and (2) yields
$$f(x^* + [\tau^L]_1 e_1) = f\Big(\lim_{k \in \bar K} x_{k+1}\Big) = \lim_{k \in \bar K} f(x_{k+1}) \le \lim_{k \in \bar K} f(x_k + \tau e_1) = f\Big(\lim_{k \in \bar K} x_k + \tau e_1\Big) = f(x^* + \tau e_1) \quad \text{for all } \tau \in \mathbb{R}.$$
Since the minimizations in coordinate directions are unique by assumption, we know that $[\tau^L]_1 = [\tau^*]_1$, which is the first desired result. Also, combining it with (7) gives
$$\lim_{k \in \bar K} x_{k+1} = x^* + [\tau^L]_1 e_1 = x^* + [\tau^*]_1 e_1 \equiv x^*_1,$$
which completes the base case.

Induction step: assume that (8) and (9) hold for all $1 \le p \le \bar p < n$. We know from the coordinate minimization that $f(x_{k+\bar p+1}) \le f(x_{k+\bar p} + \tau e_{\bar p+1})$ for all $k \in \bar K$ and $\tau \in \mathbb{R}$. Taking the limit over $k \in \bar K$, and using continuity of $f$, (7) with $l = \bar p + 1$, and (9), gives
$$f\Big(x^* + \sum_{j=1}^{\bar p+1} [\tau^L]_j e_j\Big) = f\Big(\lim_{k \in \bar K} x_{k+\bar p+1}\Big) = \lim_{k \in \bar K} f(x_{k+\bar p+1}) \le \lim_{k \in \bar K} f(x_{k+\bar p} + \tau e_{\bar p+1}) = f(x^*_{\bar p} + \tau e_{\bar p+1}) \quad \text{for all } \tau \in \mathbb{R}.$$
Thus, the definition of $x^*_{\bar p}$ and the fact that (8) holds for all $p \le \bar p$ show that
$$f(x^*_{\bar p} + [\tau^L]_{\bar p+1} e_{\bar p+1}) = f\Big(x^* + \sum_{j=1}^{\bar p} [\tau^*]_j e_j + [\tau^L]_{\bar p+1} e_{\bar p+1}\Big) = f\Big(x^* + \sum_{j=1}^{\bar p} [\tau^L]_j e_j + [\tau^L]_{\bar p+1} e_{\bar p+1}\Big) = f\Big(x^* + \sum_{j=1}^{\bar p+1} [\tau^L]_j e_j\Big) \le f(x^*_{\bar p} + \tau e_{\bar p+1}) \quad \text{for all } \tau \in \mathbb{R}.$$
Since the minimization in coordinate directions is unique by assumption, we know that $[\tau^L]_{\bar p+1} = [\tau^*]_{\bar p+1}$, which is the first desired result. Also, combining it with (7) gives
$$\lim_{k \in \bar K} x_{k+\bar p+1} = x^* + \sum_{j=1}^{\bar p+1} [\tau^L]_j e_j = x^* + \sum_{j=1}^{\bar p+1} [\tau^*]_j e_j \equiv x^*_{\bar p+1},$$
which completes the proof by induction.
From our induction proof, we have that $\tau^* = \tau^L$. Combining this with (7) and the definition of $y$ gives
$$\lim_{k \in \bar K} x_{k+n} = x^* + \sum_{j=1}^{n} [\tau^L]_j e_j = x^* + \sum_{j=1}^{n} [\tau^*]_j e_j \equiv x^*_n \equiv y. \quad (10)$$
Finally, combining (3), continuity of $f$, (10), and (4) shows that
$$f(x^*) = \lim_{k \in \bar K} f(x_{k+n}) = f\Big(\lim_{k \in \bar K} x_{k+n}\Big) = f(y) < f(x^*),$$
which is a contradiction. This completes the proof. ∎

Notation:
- Let $L_j$ denote the $j$-th component Lipschitz constant, i.e., it satisfies $|\nabla_j f(x + t e_j) - \nabla_j f(x)| \le L_j |t|$ for all $x \in \mathbb{R}^n$ and $t \in \mathbb{R}$.
- Let $L_{\max}$ denote the coordinate Lipschitz constant, i.e., $L_{\max} := \max_i L_i$.

Algorithm 3 Coordinate minimization with random order and a fixed step size.
1: Choose $\alpha \in (0, 1/L_{\max}]$.
2: Choose $x_0 \in \mathbb{R}^n$ and set $k \gets 0$.
3: loop
4:   Choose $i_k \in \{1, 2, \ldots, n\}$ randomly with equal probability.
5:   Set $x_{k+1} \gets x_k - \alpha \nabla_{i_k} f(x_k) e_{i_k}$.
6:   Set $k \gets k + 1$.
7: end loop

Comments:
- A reasonable stopping condition should be incorporated, such as $\|\nabla f(x_k)\|_2 \le 10^{-6} \max\{1, \|\nabla f(x_0)\|_2\}$.
- A maximum number of iterations should be included in practice.

Theorem 2.2 Suppose that $\alpha = 1/L_{\max}$ and let the following assumptions hold:
- $f$ is convex;
- $\nabla f$ is globally Lipschitz continuous;
- the minimum value of $f$ is attained on some set $S$, i.e., there exists $S \subseteq \mathbb{R}^n$ with $x^* \in S$ and $f^* := f(x^*) = \min_x f(x)$;
- there exists a scalar $R_0$ satisfying
$$\max_{x^* \in S} \ \max_x \ \{\|x - x^*\|_2 : f(x) \le f(x_0)\} \le R_0.$$
Then, the iterate sequence $\{x_k\}$ generated by Algorithm 3 satisfies
$$\mathbb{E}[f(x_k)] - f^* \le \frac{2 n L_{\max} R_0^2}{k}.$$
Moreover, if $f$ is strongly convex with parameter $\sigma > 0$, i.e.,
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \tfrac{\sigma}{2} \|y - x\|_2^2 \quad \text{for all } \{x, y\} \subset \mathbb{R}^n,$$
then
$$\mathbb{E}[f(x_k)] - f^* \le \Big(1 - \frac{\sigma}{n L_{\max}}\Big)^k \big(f(x_0) - f^*\big).$$

Proof (follows Wright [7]): It follows from Taylor's Theorem, the definitions of $L_j$ and $L_{\max}$, and $\alpha = 1/L_{\max}$ that
$$f(x_{k+1}) = f\big(x_k - \alpha \nabla_{i_k} f(x_k) e_{i_k}\big) \le f(x_k) - \alpha \nabla_{i_k} f(x_k)\, e_{i_k}^T \nabla f(x_k) + \tfrac{1}{2} \alpha^2 L_{i_k} |\nabla_{i_k} f(x_k)|^2$$
$$= f(x_k) - \alpha |\nabla_{i_k} f(x_k)|^2 + \tfrac{1}{2} \alpha^2 L_{i_k} |\nabla_{i_k} f(x_k)|^2 \le f(x_k) - \alpha |\nabla_{i_k} f(x_k)|^2 + \tfrac{1}{2} \alpha^2 L_{\max} |\nabla_{i_k} f(x_k)|^2$$
$$= f(x_k) - \alpha \Big(1 - \frac{\alpha L_{\max}}{2}\Big) |\nabla_{i_k} f(x_k)|^2 \quad \text{(why? exercise)}$$
$$= f(x_k) - \frac{1}{2 L_{\max}} |\nabla_{i_k} f(x_k)|^2. \quad (12)$$
If we now take the expectation of both sides with respect to $i_k$, we find that
$$\mathbb{E}_{i_k}[f(x_{k+1})] \le f(x_k) - \frac{1}{2 L_{\max}} \cdot \frac{1}{n} \sum_{j=1}^{n} |\nabla_j f(x_k)|^2 = f(x_k) - \frac{1}{2 n L_{\max}} \|\nabla f(x_k)\|_2^2. \quad (13)$$
Subtracting $f^*$ from both sides shows that
$$\mathbb{E}_{i_k}[f(x_{k+1})] - f^* \le f(x_k) - f^* - \frac{1}{2 n L_{\max}} \|\nabla f(x_k)\|_2^2.$$
From the previous slide, we have
$$\mathbb{E}_{i_k}[f(x_{k+1})] - f^* \le f(x_k) - f^* - \frac{1}{2 n L_{\max}} \|\nabla f(x_k)\|_2^2.$$
Taking the expectation with respect to all the random variables $\{i_0, i_1, i_2, \ldots\}$, and defining
$$\phi_k := \mathbb{E}[f(x_k)] - f^*,$$
we find that
$$\phi_{k+1} = \mathbb{E}[f(x_{k+1})] - f^* = \mathbb{E}_{i_0, i_1, \ldots, i_{k-1}}\big[\mathbb{E}_{i_k}[f(x_{k+1}) \mid x_k]\big] - f^* \le \mathbb{E}_{i_0, i_1, \ldots, i_{k-1}}\Big[f(x_k) - f^* - \frac{1}{2 n L_{\max}} \|\nabla f(x_k)\|_2^2\Big]$$
$$= \phi_k - \frac{1}{2 n L_{\max}} \mathbb{E}\big[\|\nabla f(x_k)\|_2^2\big] \le \phi_k - \frac{1}{2 n L_{\max}} \big(\mathbb{E}[\|\nabla f(x_k)\|_2]\big)^2, \quad (14)$$
where we used Jensen's inequality to derive the last inequality. In particular,
$$\phi_{k+1} \le \phi_k - \frac{1}{2 n L_{\max}} \big(\mathbb{E}[\|\nabla f(x_k)\|_2]\big)^2. \quad (15)$$
Next, note from convexity of $f$, the definition of $R_0$, and the fact that $f(x_k) \le f(x_0)$ for all $k$ by construction of the algorithm, that
$$f(x_k) - f^* \le \nabla f(x_k)^T (x_k - x^*) \le \|\nabla f(x_k)\|_2 \|x_k - x^*\|_2 \le R_0 \|\nabla f(x_k)\|_2. \quad (16)$$
Taking the expectation of both sides shows that
$$\mathbb{E}[\|\nabla f(x_k)\|_2] \ge \frac{1}{R_0} \mathbb{E}[f(x_k) - f^*] = \frac{\phi_k}{R_0}.$$
Combining this bound with (15) yields
$$\phi_{k+1} \le \phi_k - \frac{\phi_k^2}{2 n L_{\max} R_0^2}.$$
Combining this with $\phi_{k+1} \le \phi_k$ (see (15)), we have
$$\frac{1}{\phi_{k+1}} - \frac{1}{\phi_k} = \frac{\phi_k - \phi_{k+1}}{\phi_k \phi_{k+1}} \ge \frac{\phi_k - \phi_{k+1}}{\phi_k^2} \ge \frac{1}{2 n L_{\max} R_0^2}. \quad (17)$$
Summing both sides of (17) for $k = 0, 1, \ldots, l - 1$ shows that
$$\frac{1}{\phi_l} \ge \frac{1}{\phi_l} - \frac{1}{\phi_0} = \sum_{k=0}^{l-1} \Big(\frac{1}{\phi_{k+1}} - \frac{1}{\phi_k}\Big) \ge \frac{l}{2 n L_{\max} R_0^2}.$$
Rearranging, replacing $l$ by $k$, and using the definition of $\phi_k$, this is equivalent to
$$\mathbb{E}[f(x_k)] - f^* = \phi_k \le \frac{2 n L_{\max} R_0^2}{k},$$
which is the first desired result.

For the second part, assume that $f$ is strongly convex with parameter $\sigma$, i.e., that
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \tfrac{\sigma}{2} \|y - x\|_2^2 \quad \text{for all } \{x, y\} \subset \mathbb{R}^n.$$
By choosing $x = x_k$ and minimizing both sides with respect to $y$, we find that
$$f^* = \min_{y \in \mathbb{R}^n} f(y) \ge \min_{y \in \mathbb{R}^n} \Big[f(x_k) + \nabla f(x_k)^T (y - x_k) + \tfrac{\sigma}{2} \|y - x_k\|_2^2\Big] = f(x_k) + \nabla f(x_k)^T (\hat y_k - x_k) + \tfrac{\sigma}{2} \|\hat y_k - x_k\|_2^2$$
$$= f(x_k) - \tfrac{1}{\sigma} \|\nabla f(x_k)\|_2^2 + \tfrac{1}{2\sigma} \|\nabla f(x_k)\|_2^2 = f(x_k) - \tfrac{1}{2\sigma} \|\nabla f(x_k)\|_2^2,$$
where $\hat y_k := x_k - \nabla f(x_k)/\sigma$. In other words,
$$f^* \ge f(x_k) - \tfrac{1}{2\sigma} \|\nabla f(x_k)\|_2^2. \quad (18)$$
Combining this bound with (14) gives
$$\phi_{k+1} \le \phi_k - \frac{1}{2 n L_{\max}} \mathbb{E}\big[\|\nabla f(x_k)\|_2^2\big] \le \phi_k - \frac{1}{2 n L_{\max}} \mathbb{E}\big[2 \sigma (f(x_k) - f^*)\big] = \phi_k - \frac{\sigma}{n L_{\max}} \mathbb{E}[f(x_k) - f^*] = \Big(1 - \frac{\sigma}{n L_{\max}}\Big) \phi_k.$$
Applying this recursively shows that
$$\phi_k \le \Big(1 - \frac{\sigma}{n L_{\max}}\Big)^k \phi_0,$$
so that after we use the definition of $\phi_k$, we have
$$\mathbb{E}[f(x_k)] - f^* \le \Big(1 - \frac{\sigma}{n L_{\max}}\Big)^k \big(f(x_0) - f^*\big),$$
which is the second desired result. ∎
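As a sanity check on Theorem 2.2, the following sketch runs Algorithm 3 on a hypothetical convex quadratic, for which the component Lipschitz constants are the diagonal entries of the Hessian; the function name and test data are illustrative only:

```python
import numpy as np

def random_cd_fixed_step(H, b, x0, n_iters, seed=0):
    """Algorithm 3 sketch: random coordinate, fixed step alpha = 1/Lmax,
    applied to the convex quadratic f(x) = 0.5 x^T H x - b^T x.
    For this f, the j-th component Lipschitz constant is L_j = H[j, j]."""
    rng = np.random.default_rng(seed)
    alpha = 1.0 / np.max(np.diag(H))        # alpha = 1/Lmax
    f = lambda z: 0.5 * z @ H @ z - b @ z
    f_star = f(np.linalg.solve(H, b))       # minimum value, for reference
    x = x0.copy()
    gaps = []
    for k in range(n_iters):
        i = int(rng.integers(H.shape[0]))   # uniform random coordinate
        x[i] -= alpha * (H[i] @ x - b[i])   # step along the i-th partial derivative
        gaps.append(f(x) - f_star)          # expected gap decays like 2n*Lmax*R0^2/k
    return x, gaps

# Usage on a small positive-definite quadratic (hypothetical data):
n = 20
A = np.random.default_rng(1).standard_normal((n, n))
H = A @ A.T + n * np.eye(n)                 # strongly convex Hessian
x, gaps = random_cd_fixed_step(H, b=np.ones(n), x0=np.zeros(n), n_iters=2000)
```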
Notation:
- Let $L_j$ denote the $j$-th component Lipschitz constant, i.e., it satisfies $|\nabla_j f(x + t e_j) - \nabla_j f(x)| \le L_j |t|$ for all $x \in \mathbb{R}^n$ and $t \in \mathbb{R}$.
- Let $L_{\max}$ denote the coordinate Lipschitz constant, i.e., $L_{\max} := \max_i L_i$.
- Let $L$ denote the Lipschitz constant for $\nabla f$.

Algorithm 4 Coordinate minimization with cyclic order and a fixed step size.
1: Choose $\alpha \in (0, 1/L_{\max}]$.
2: Choose $x_0 \in \mathbb{R}^n$ and set $k \gets 0$.
3: loop
4:   Choose $i_k = \mathrm{mod}(k, n) + 1$.
5:   Set $x_{k+1} \gets x_k - \alpha \nabla_{i_k} f(x_k) e_{i_k}$.
6:   Set $k \gets k + 1$.
7: end loop

Comments:
- A reasonable stopping condition should be incorporated, such as $\|\nabla f(x_k)\|_2 \le 10^{-6} \max\{1, \|\nabla f(x_0)\|_2\}$.
- A maximum number of allowed iterations should be included in practice.

Theorem 2.3 (see [1, Theorems 3.6 and 3.9] and [7, Theorem 3]) Suppose that $\alpha = 1/L_{\max}$ and let the following assumptions hold:
- $f$ is convex;
- $\nabla f$ is globally Lipschitz continuous;
- the minimum value of $f$ is attained on some set $S$, i.e., there exists $S \subseteq \mathbb{R}^n$ with $x^* \in S$ and $f^* := f(x^*) = \min_x f(x)$;
- there exists a scalar $R_0$ satisfying $\max_{x^* \in S} \max_x \{\|x - x^*\|_2 : f(x) \le f(x_0)\} \le R_0$.
If $\{x_k\}$ is the iterate sequence of Algorithm 4, then for $k \in \{n, 2n, 3n, \ldots\}$ we have
$$f(x_k) - f^* \le \frac{4 n L_{\max} (1 + n L^2 / L_{\max}^2)\, R_0^2}{k + 8}. \quad (19)$$
If $f$ is strongly convex with parameter $\sigma > 0$ (see Theorem 2.2), then for $k \in \{n, 2n, 3n, \ldots\}$,
$$f(x_k) - f^* \le \Big(1 - \frac{\sigma}{2 L_{\max} (1 + n L^2 / L_{\max}^2)}\Big)^{k/n} \big(f(x_0) - f^*\big).$$

Proof: See [1, Theorem 3.6 and Theorem 3.9] and use: (i) each "iteration $k$" in [1] is a cycle of $n$ iterations here; (ii) choose in [1] the values $L_i = L_{\max}$ for all $i$; (iii) in [1] we have $p = 1$ since our blocks of variables are singletons, i.e., coordinate descent.

Comments on Theorem 2.3:
- The numerator in (19) is $O(n^2)$, while the numerator in the analogous result (Theorem 2.2) for the random coordinate choice with a fixed step size is $O(n)$. But Theorem 2.3 is a deterministic result, while Theorem 2.2 is a result in expectation.
- As part of the homework assignment, you will find out for yourself how these methods perform on a simple quadratic objective function.
- It can be shown that $L \le \sum_{j=1}^{n} L_j$ (see [3, Lemma 2 with $\alpha = 1$]).
- It follows from the fact that $|\nabla_j f(x + t e_j) - \nabla_j f(x)| \le \|\nabla f(x + t e_j) - \nabla f(x)\|_2 \le L |t|$ holds for all $j$, $t$, and $x$ that $L_j \le L$.
- By combining the previous two bullet points, we find that $L_{\max} \le L \le \sum_j L_j \le n L_{\max}$, so that $1 \le L / L_{\max} \le n$.
- Roughly speaking, $L / L_{\max}$ is closer to 1 when the coordinates are more "decoupled". In light of (19), the complexity result for coordinate descent becomes better as the variables become more decoupled. This makes sense!

Algorithm 5 Coordinate minimization with the Gauss-Southwell rule and a fixed step size.
1: Choose $\alpha \in (0, 1/L_{\max}]$.
2: Choose $x_0 \in \mathbb{R}^n$ and set $k \gets 0$.
3: loop
4:   Calculate $i_k$ as the steepest coordinate direction, i.e., $i_k \in \arg\max_i |\nabla_i f(x_k)|$.
5:   Set $x_{k+1} \gets x_k - \alpha \nabla_{i_k} f(x_k) e_{i_k}$.
6:   Set $k \gets k + 1$.
7: end loop

Comments:
- A reasonable stopping condition should be incorporated, such as $\|\nabla f(x_k)\|_2 \le 10^{-6} \max\{1, \|\nabla f(x_0)\|_2\}$.
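Before analyzing Algorithm 5, here is a minimal sketch of the Gauss-Southwell rule, again on a hypothetical quadratic. Note that it recomputes the full gradient at every iteration; in practice one would exploit problem structure to update $\nabla f$ cheaply:

```python
import numpy as np

def gauss_southwell_cd(grad, x0, L_max, n_iters):
    """Algorithm 5 sketch: Gauss-Southwell rule with fixed step 1/L_max.
    grad(x) returns the full gradient of f at x."""
    x = x0.copy()
    alpha = 1.0 / L_max
    for _ in range(n_iters):
        g = grad(x)
        i = int(np.argmax(np.abs(g)))  # steepest coordinate direction
        x[i] -= alpha * g[i]           # fixed step along that coordinate
    return x

# Usage on a hypothetical quadratic f(x) = 0.5 x^T H x - b^T x, so L_max = max diag(H):
H = np.diag([1.0, 3.0, 10.0])
b = np.array([1.0, 1.0, 1.0])
x = gauss_southwell_cd(lambda z: H @ z - b, np.zeros(3), L_max=10.0, n_iters=200)
```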
Theorem 2.4 Suppose that $\alpha = 1/L_{\max}$ and let the following assumptions hold:
- $f$ is convex;
- $\nabla f$ is globally Lipschitz continuous;
- the minimum value of $f$ is attained on some set $S$, i.e., there exists $S \subseteq \mathbb{R}^n$ with $x^* \in S$ and $f^* := f(x^*) = \min_x f(x)$;
- there exists a scalar $R_0$ satisfying $\max_{x^* \in S} \max_x \{\|x - x^*\|_2 : f(x) \le f(x_0)\} \le R_0$.
Then, the iterate sequence $\{x_k\}$ computed from Algorithm 5 satisfies
$$f(x_k) - f^* \le \frac{2 n L_{\max} R_0^2}{k}. \quad (20)$$
If $f$ is strongly convex with parameter $\sigma > 0$ (see Theorem 2.2), then
$$f(x_k) - f^* \le \Big(1 - \frac{\sigma}{n L_{\max}}\Big)^k \big(f(x_0) - f^*\big).$$

Proof: From earlier (see (12)), we showed that
$$f_{k+1} \le f(x_k) - \frac{1}{2 L_{\max}} |\nabla_{i_k} f(x_k)|^2.$$
Combining this with the choice $i_k \in \arg\max_i |\nabla_i f(x_k)|$ and the standard norm inequality $\|v\|_2 \le \sqrt{n}\, \|v\|_\infty$, it holds that
$$f_{k+1} \le f(x_k) - \frac{1}{2 L_{\max}} |\nabla_{i_k} f(x_k)|^2 = f(x_k) - \frac{1}{2 L_{\max}} \|\nabla f(x_k)\|_\infty^2 \quad (21)$$
$$\le f(x_k) - \frac{1}{2 n L_{\max}} \|\nabla f(x_k)\|_2^2. \quad (22)$$
Subtracting $f^*$ from both sides and using the previous fact (see (16)) that $f(x_k) - f^* \le R_0 \|\nabla f(x_k)\|_2$, we find that
$$f_{k+1} - f^* \le f(x_k) - f^* - \frac{1}{2 n L_{\max}} \|\nabla f(x_k)\|_2^2 \le f(x_k) - f^* - \frac{\big(f(x_k) - f^*\big)^2}{2 n L_{\max} R_0^2}.$$
Using the notation $\phi_k = f(x_k) - f^*$, this is equivalent to
$$\phi_{k+1} \le \phi_k - \frac{\phi_k^2}{2 n L_{\max} R_0^2},$$
which is exactly the same as the inequality that led to (17), except that we now have a different (deterministic) definition of $\phi_k$. Then, as shown in that proof, we have
$$f(x_k) - f^* = \phi_k \le \frac{2 n L_{\max} R_0^2}{k},$$
which is the desired result for convex $f$.

Next, assume that $f$ is strongly convex, for which we showed earlier (see (18)) that
$$f^* \ge f(x_k) - \tfrac{1}{2\sigma} \|\nabla f(x_k)\|_2^2.$$
Subtracting $f^*$ from each side of (22) and then using the previous inequality shows that
$$f_{k+1} - f^* \le f(x_k) - f^* - \frac{1}{2 n L_{\max}} \|\nabla f(x_k)\|_2^2 \le f(x_k) - f^* - \frac{\sigma}{n L_{\max}} \big(f(x_k) - f^*\big) = \Big(1 - \frac{\sigma}{n L_{\max}}\Big) \big(f(x_k) - f^*\big),$$
so that
$$f(x_k) - f^* \le \Big(1 - \frac{\sigma}{n L_{\max}}\Big)^k \big(f(x_0) - f^*\big),$$
which is the last desired result. ∎

Comments so far for a fixed step size:
- Cyclic has the worst dependence on $n$: Cyclic is $O(n^2)$; Random and Gauss-Southwell are $O(n)$.
- Random is a rate in expectation; Gauss-Southwell is a deterministic rate.
- There is a better analysis for Gauss-Southwell when we assume that $f$ is strongly convex that changes the above comment! See [4]. We show this next.
Theorem 2.5 Suppose that $\alpha = 1/L_{\max}$ and let the following assumptions hold:
- $f$ is $\ell_1$-strongly convex, i.e., there exists $\sigma_1 > 0$ such that
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \tfrac{\sigma_1}{2} \|y - x\|_1^2 \quad \text{for all } \{x, y\} \subset \mathbb{R}^n;$$
- $\nabla f$ is globally Lipschitz continuous;
- the minimum value of $f$ is attained.
Then, the iterate sequence $\{x_k\}$ computed from Algorithm 5 satisfies
$$f(x_k) - f^* \le \Big(1 - \frac{\sigma_1}{L_{\max}}\Big)^k \big(f(x_0) - f^*\big).$$

Proof (see [4]): $\ell_1$-strong convexity means that
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \tfrac{\sigma_1}{2} \|y - x\|_1^2 \quad \text{for all } \{x, y\} \subset \mathbb{R}^n,$$
where $\sigma_1$ is the $\ell_1$-strong convexity parameter. If we now minimize both sides with respect to $y$ and replace $x$ by $x_k$, we find that
$$f^* = \min_{y \in \mathbb{R}^n} f(y) \ge \min_{y \in \mathbb{R}^n} \Big[f(x_k) + \nabla f(x_k)^T (y - x_k) + \tfrac{\sigma_1}{2} \|y - x_k\|_1^2\Big]$$
$$= f(x_k) + \nabla f(x_k)^T (\hat y_k - x_k) + \tfrac{\sigma_1}{2} \|\hat y_k - x_k\|_1^2 \quad \text{(why? exercise)}$$
$$= f(x_k) - \frac{1}{2 \sigma_1} \|\nabla f(x_k)\|_\infty^2,$$
where $\hat y_k := x_k + \hat z_k$ with
$$[\hat z_k]_i := \begin{cases} 0 & \text{if } i \ne l, \\ -\nabla_l f(x_k)/\sigma_1 & \text{if } i = l, \end{cases}$$
and $l$ any index satisfying $|\nabla_l f(x_k)| = \|\nabla f(x_k)\|_\infty$. Therefore, we have that
$$\|\nabla f(x_k)\|_\infty^2 \ge 2 \sigma_1 \big(f(x_k) - f^*\big).$$
Subtracting $f^*$ from both sides of (21) and using the previous inequality shows that
$$f_{k+1} - f^* \le f(x_k) - f^* - \frac{1}{2 L_{\max}} \|\nabla f(x_k)\|_\infty^2 \le f(x_k) - f^* - \frac{\sigma_1}{L_{\max}} \big(f(x_k) - f^*\big) = \Big(1 - \frac{\sigma_1}{L_{\max}}\Big) \big(f(x_k) - f^*\big).$$
Applying this inequality recursively gives
$$f(x_k) - f^* \le \Big(1 - \frac{\sigma_1}{L_{\max}}\Big)^k \big(f(x_0) - f^*\big),$$
which is the desired result. ∎

For strongly convex functions:
- The random coordinate choice has the expected rate
$$\mathbb{E}[f(x_k)] - f^* \le \Big(1 - \frac{\sigma}{n L_{\max}}\Big)^k \big(f(x_0) - f^*\big).$$
- The Gauss-Southwell coordinate choice has the deterministic rate
$$f(x_k) - f^* \le \Big(1 - \frac{\sigma_1}{L_{\max}}\Big)^k \big(f(x_0) - f^*\big). \quad (23)$$
The bound for Gauss-Southwell is better since
$$\frac{\sigma}{n} \le \sigma_1 \le \sigma, \quad \text{so that} \quad 1 - \frac{\sigma_1}{L_{\max}} \le 1 - \frac{\sigma}{n L_{\max}}.$$
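A small numerical illustration of this comparison (not from the slides): run random and Gauss-Southwell selection with step $1/L_{\max}$ on a hypothetical diagonal quadratic and compare the final optimality gaps:

```python
import numpy as np

# Compare random and Gauss-Southwell coordinate selection with fixed step 1/Lmax
# on the hypothetical quadratic f(x) = 0.5 x^T H x - b^T x with diagonal H.
rng = np.random.default_rng(0)
n = 50
H = np.diag(rng.uniform(1.0, 10.0, size=n))   # diagonal, strongly convex
b = rng.standard_normal(n)
L_max = np.max(np.diag(H))
f = lambda x: 0.5 * x @ H @ x - b @ x
f_star = f(np.linalg.solve(H, b))             # minimum value

def run(rule, n_iters=2000):
    x = np.zeros(n)
    for _ in range(n_iters):
        g = H @ x - b                          # full gradient (cheap here)
        i = int(np.argmax(np.abs(g))) if rule == "gs" else int(rng.integers(n))
        x[i] -= g[i] / L_max
    return f(x) - f_star

print("random:", run("random"), " gauss-southwell:", run("gs"))
# Typically the Gauss-Southwell gap is noticeably smaller, consistent with the
# rates (1 - sigma/(n*Lmax))^k versus (1 - sigma_1/Lmax)^k in (23).
```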
Example: A Simple Diagonal Quadratic Function

Consider the problem
$$\min_x \ g^T x + \tfrac{1}{2} x^T H x, \quad \text{where } H = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) \ \text{with } \lambda_i > 0 \ \text{for all } i \in \{1, 2, \ldots, n\}.$$
For this problem, we know that
$$\sigma = \min\{\lambda_1, \lambda_2, \ldots, \lambda_n\} \quad \text{and} \quad \sigma_1 = \Big(\sum_{i=1}^{n} \frac{1}{\lambda_i}\Big)^{-1}.$$

Case 1: Suppose $\lambda_1 = \lambda_2 = \cdots = \lambda_n = \alpha$ for some $\alpha > 0$; this is the extreme case in which the minimum possible value of $\sigma_1$ occurs, and it gives
$$\sigma = \alpha \quad \text{and} \quad \sigma_1 = \frac{\alpha}{n}.$$
Thus, the convergence constants are
$$\text{random selection: } \frac{\sigma}{n L_{\max}} = \frac{1}{n}, \qquad \text{Gauss-Southwell selection: } \frac{\sigma_1}{L_{\max}} = \frac{1}{n},$$
so the convergence constants are the same; this is the worst case for Gauss-Southwell.

Case 2: For the other extreme case, suppose that $\lambda_1 = \beta$ and $\lambda_2 = \lambda_3 = \cdots = \lambda_n = \alpha$ with $\alpha \ge \beta$. For this case, it can be shown that
$$\sigma = \beta \quad \text{and} \quad \sigma_1 = \Big(\frac{1}{\beta} + \frac{n-1}{\alpha}\Big)^{-1} = \frac{\beta \alpha}{\alpha + (n-1)\beta}.$$
If we now take the limit as $\alpha \to \infty$, we find that
$$\sigma = \beta \quad \text{and} \quad \sigma_1 \to \beta = \sigma.$$
Thus, the rate-determining constants in the limit are
$$\text{random selection: } \frac{\sigma}{n} = \frac{\beta}{n}, \qquad \text{Gauss-Southwell selection: } \sigma_1 = \beta,$$
so that Gauss-Southwell is a factor $n$ faster than using a random coordinate selection.

Alternative 1 (strongly convex): individual coordinate Lipschitz constants. The iteration update is
$$x_{k+1} = x_k - \frac{1}{L_{i_k}} \nabla_{i_k} f(x_k)\, e_{i_k}.$$
Using a similar analysis as before, it can be shown that
$$f(x_k) - f^* \le \Bigg[\prod_{j=0}^{k-1} \Big(1 - \frac{\sigma_1}{L_{i_j}}\Big)\Bigg] \big(f(x_0) - f^*\big).$$
This is a better decrease than the prior analysis (see (23)) since
$$\text{new rate} = \prod_{j=0}^{k-1} \Big(1 - \frac{\sigma_1}{L_{i_j}}\Big) \le \Big(1 - \frac{\sigma_1}{L_{\max}}\Big)^k = \text{previous rate},$$
and it is strictly faster provided at least one of the used $L_{i_j}$ satisfies $L_{i_j} < L_{\max}$.

Alternative 2 (strongly convex): Lipschitz sampling. Use a random coordinate direction chosen using a non-uniform probability distribution:
$$\mathbb{P}(i_k = j) = \frac{L_j}{\sum_{l=1}^{n} L_l} \quad \text{for all } j \in \{1, 2, \ldots, n\}.$$
Using an analysis similar to the previous one, but using the new probability distribution when computing the expectation, it can be shown that
$$\mathbb{E}[f(x_{k+1})] - f^* \le \Big(1 - \frac{\sigma}{n \bar L}\Big) \big(\mathbb{E}[f(x_k)] - f^*\big),$$
with $\bar L$ being the average component Lipschitz constant, i.e., $\bar L := \frac{1}{n} \sum_{i=1}^{n} L_i$. This analysis was first performed in [2]. This rate is faster than uniform random sampling if not all of the component Lipschitz constants are the same. A sketch of this sampling scheme is given below.
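A minimal sketch of Lipschitz sampling, assuming a hypothetical diagonal quadratic so that $L_j$ equals the $j$-th diagonal Hessian entry. Pairing the sampling with the coordinate-specific step $1/L_j$ (as in Alternative 1) is a natural choice here, not something the slides prescribe:

```python
import numpy as np

# Lipschitz sampling (Alternative 2): sample coordinate j with probability
# L_j / sum(L), then step along that coordinate. Hypothetical test problem.
rng = np.random.default_rng(0)
n = 50
L = rng.uniform(1.0, 10.0, size=n)  # component Lipschitz constants L_j
H = np.diag(L)                      # for this quadratic, L_j = H[j, j]
b = rng.standard_normal(n)
probs = L / L.sum()                 # P(i_k = j) = L_j / sum_l L_l

x = np.zeros(n)
for _ in range(2000):
    j = int(rng.choice(n, p=probs))  # non-uniform, Lipschitz-weighted choice
    g_j = H[j] @ x - b[j]            # j-th partial derivative
    x[j] -= g_j / L[j]               # coordinate-specific step size 1/L_j
```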
Alternative 3 (strongly convex): Gauss-Southwell-Lipschitz rule. Choose $i_k$ according to the rule
$$i_k \in \arg\max_i \frac{|\nabla_i f(x_k)|^2}{L_i}. \quad (24)$$
Using an argument similar to that which led to (12), it may be shown that
$$f(x_{k+1}) \le f(x_k) - \frac{1}{2 L_{i_k}} |\nabla_{i_k} f(x_k)|^2. \quad (25)$$
The update (24) is designed to choose $i_k$ to maximize the guaranteed decrease given by (25), which uses the component Lipschitz constants. It may be shown, using this update, that
$$f(x_{k+1}) - f^* \le (1 - \sigma_L) \big(f(x_k) - f^*\big),$$
where $\sigma_L$ is the strong convexity parameter with respect to the norm $\|v\|_L := \sum_{i=1}^{n} \sqrt{L_i}\, |v_i|$. It is shown in [4, Appendix 6.2] that
$$\max\Big\{\frac{\sigma_1}{L_{\max}}, \frac{\sigma}{n \bar L}\Big\} \le \sigma_L \le \frac{\sigma_1}{\min_i\{L_i\}}.$$

Ordering of the constants in the linear convergence results when $f$ is strongly convex:
- random uniform sampling (uses $L_{\max}$) < Gauss-Southwell (uses $L_{\max}$) < Gauss-Southwell with $\{L_i\}$
- random Lipschitz sampling (uses $\{L_i\}$) < Gauss-Southwell-Lipschitz

Comments:
- Gauss-Southwell-Lipschitz gives the best rate, but it is the most expensive per iteration.
- Better rates result if you know and use $\{L_i\}$ instead of just using their maximum $L_{\max}$.
- Gauss-Southwell-Lipschitz is at least as fast as the fastest of the Gauss-Southwell and Lipschitz sampling options.

Linear Equations

Let $m < n$, $b \in \mathbb{R}^m$, and $A^T = [a_1 \ \cdots \ a_m] \in \mathbb{R}^{n \times m}$ with $\|a_i\|_2 = 1$ for all $i$. Furthermore, suppose that $A^T$ has full column rank, meaning that the linear system $Aw = b$ has infinitely many solutions. To seek the least-length solution, we wish to solve
$$\min_{w \in \mathbb{R}^n} \ \tfrac{1}{2} \|w\|_2^2 \quad \text{subject to} \quad Aw = b.$$
The Lagrangian dual problem is
$$\min_{x \in \mathbb{R}^m} \ f(x) := \tfrac{1}{2} \|A^T x\|_2^2 - b^T x,$$
where we note that $\nabla f(x) = A A^T x - b$ and $\nabla_i f(x) = a_i^T A^T x - b_i$. The solutions to the primal and dual are related via $w^* = A^T x^*$.

Coordinate descent gives
$$x_{k+1} = x_k - \alpha (a_i^T A^T x_k - b_i) e_i.$$
If we maintain an estimate $w_k = A^T x_k$, then we see that
$$w_{k+1} = A^T x_{k+1} = A^T x_k - \alpha (a_i^T A^T x_k - b_i) A^T e_i = w_k - \alpha (a_i^T w_k - b_i) a_i.$$
Note that if $\alpha = 1$, then it follows by using $\|a_i\|_2 = 1$ that
$$a_i^T w_{k+1} = a_i^T w_k - (a_i^T w_k - b_i)\, a_i^T a_i = b_i,$$
so that the $i$-th equation is satisfied exactly.

Linear Equations Summary
- Coordinate minimization for solving the dual problem associated with linear equations along the direction $e_i$ with $\alpha = 1$ satisfies the $i$-th linear equation exactly. This is sometimes called the method of successive projections.
- The update $w_{k+1} = w_k - \alpha (a_i^T w_k - b_i) a_i$ costs $n + 1$ additions/subtractions and $2n + 1$ multiplications, i.e., $3n + 2$ total floating-point operations. Computing $\nabla f(x)$ requires a multiplication with $A$, which is much more expensive.
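A sketch of the resulting successive-projection (Kaczmarz-style) iteration with $\alpha = 1$, using hypothetical random data with unit-norm rows:

```python
import numpy as np

# Dual coordinate minimization for Aw = b with alpha = 1: each step makes the
# i-th equation a_i^T w = b_i hold exactly. Hypothetical consistent test data.
rng = np.random.default_rng(0)
m, n = 20, 50
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # normalize rows so ||a_i|| = 1
b = A @ rng.standard_normal(n)                  # consistent system Aw = b

w = np.zeros(n)                                 # corresponds to x_0 = 0
for k in range(5000):
    i = k % m                                   # cyclic sweep over the equations
    w -= (A[i] @ w - b[i]) * A[i]               # O(n) update; a_i^T w becomes b_i
print(np.linalg.norm(A @ w - b))                # residual tends to 0
```

Since the iteration starts at $w = 0$ and every update stays in the row space of $A$, the limit is the least-length solution, matching the primal problem above.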
Logistic Regression

Given data $\{d_j\}_{j=1}^{N} \subset \mathbb{R}^n$ and labels $\{y_j\}_{j=1}^{N} \subset \{-1, 1\}$ associated with the data, solve
$$\min_x \ f(x) := \frac{1}{N} \sum_{j=1}^{N} \log\big(1 + e^{-y_j d_j^T x}\big).$$
If we define the data matrix $D$ such that
$$D = \begin{bmatrix} d_1^T \\ \vdots \\ d_N^T \end{bmatrix} \in \mathbb{R}^{N \times n},$$
then it follows that
$$\nabla_i f(x) = -\frac{1}{N} \sum_{j=1}^{N} \frac{e^{-y_j d_j^T x}}{1 + e^{-y_j d_j^T x}}\, y_j d_{ji}.$$
Consider the coordinate minimization update
$$x_{k+1} = x_k + \alpha e_{i_k} \quad \text{for some } i_k \in \{1, 2, \ldots, n\} \text{ and } \alpha \in \mathbb{R}.$$
For efficiency, we store and update the required quantities $\{D x_k\}$ using
$$\underbrace{D x_{k+1}}_{\text{new value}} = D(x_k + \alpha e_{i_k}) = D x_k + \alpha D e_{i_k} = \underbrace{D x_k}_{\text{old value}} + \ \alpha D(:, i_k),$$
where $D(:, i_k)$ denotes the $i_k$-th column of $D$; if $x_0 \gets 0$, then we can set $D x_0 \gets 0$.

Logistic Regression Summary
- Coordinate minimization for the logistic regression problem does not require computing the entire gradient during every iteration.
- The update to obtain $D x_{k+1}$ requires a single scaled vector-vector addition.
- Computing $\nabla_i f(x_k)$ only requires accessing a single column of the data matrix $D$ (given $D x_k$).
- Computing $\nabla f(x)$ requires accessing the entire data matrix $D$.

References

[1] A. Beck and L. Tetruashvili, On the convergence of block coordinate descent type methods, SIAM Journal on Optimization, 23 (2013), pp. 2037–2060.
[2] D. Leventhal and A. S. Lewis, Randomized methods for linear constraints: convergence rates and conditioning, Mathematics of Operations Research, 35 (2010), pp. 641–654.
[3] Y. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems, SIAM Journal on Optimization, 22 (2012), pp. 341–362.
[4] J. Nutini, M. Schmidt, I. H. Laradji, M. Friedlander, and H. Koepke, Coordinate descent converges faster with the Gauss-Southwell rule than random selection, in Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015, pp. 1632–1641.
[5] M. J. D. Powell, On search directions for minimization algorithms, Mathematical Programming, 4 (1973), pp. 193–201.
[6] A. P. Ruszczyński, Nonlinear Optimization, Princeton University Press, 2006.
[7] S. J. Wright, Coordinate descent algorithms, Mathematical Programming, 151 (2015), pp. 3–34.