Is Greedy Coordinate Descent a Terrible Algorithm?


1 Is Greedy Coordinate Descent a Terrible Algorithm?
Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, Hoyt Koepke
University of British Columbia
Optimization and Big Data, 2015

10 Context: Random vs. Greedy Coordinate Descent
We consider coordinate descent for large-scale optimization:
1 Select a coordinate to update.
2 Take a small gradient step along that coordinate.
Recent interest began with Nesterov [2010]:
  Global convergence rate for randomized coordinate selection.
  Faster than gradient descent if iterations are n times cheaper.
Contrast random selection with the classic Gauss-Southwell (GS) rule:
  i_k = argmax_i |∇_i f(x)|.
[Figure: GS picks whichever of x1, x2, x3 has the steepest slope.]
GS is at least as expensive as random, but Nesterov showed the rate is the same.
So greedy is a terrible algorithm and you should just use random!
But this theory disagrees with practice...
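The two-step loop above can be sketched in a few lines. This is a minimal toy illustration (not the talk's code), on a quadratic f(x) = ½xᵀAx − bᵀx, comparing uniform-random and Gauss-Southwell selection with the constant step size 1/L:

```python
import numpy as np

def f(x, A, b):
    return 0.5 * x @ A @ x - b @ x

def coord_descent(A, b, n_iters, rule, seed=0):
    rng = np.random.default_rng(seed)
    n = len(b)
    L = A.diagonal().max()   # coordinate-wise Lipschitz constant of a quadratic
    x = np.zeros(n)
    for _ in range(n_iters):
        g = A @ x - b        # full gradient, recomputed here only for clarity
        if rule == "random":
            i = rng.integers(n)                # 1. select a coordinate uniformly
        else:
            i = int(np.argmax(np.abs(g)))      # 1. GS: steepest coordinate
        x[i] -= g[i] / L                       # 2. small gradient step along it
    return x
```

Both rules converge on this toy problem; the point of the talk is how their rates compare.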

12 Context: Random vs. Greedy Coordinate Descent
[Figure: objective vs. epochs on l2-regularized sparse least squares, comparing Lipschitz, Random, Cyclic, GS, and GSL selection.]
If random and GS have similar costs, GS works much better.
This work: a refined analysis of GS.

16 Problems where we can apply coordinate descent
When is a coordinate update n times faster than a gradient update?
There are basically two problem classes where this is true:
  h_1(x) = f(Ax) + Σ_{i=1}^n g_i(x_i),   or   h_2(x) = Σ_{(i,j)∈E} f_ij(x_i, x_j) + Σ_{i=1}^n g_i(x_i),
where f and the f_ij are smooth, A is a matrix, and E is the set of edges in a graph.
(The g_i can be general non-degenerate convex functions.)
h_1 includes least squares, logistic regression, Lasso, and SVMs.
  E.g., min_{x ∈ ℝ^n} (1/2)‖Ax − b‖² + λ Σ_{i=1}^n |x_i|.
h_2 includes quadratics, graph-based label propagation, and probabilistic graphical models.
  E.g., min_{x ∈ ℝ^n} (1/2) xᵀAx + bᵀx = (1/2) Σ_{i=1}^n Σ_{j=1}^n a_ij x_i x_j + Σ_{i=1}^n b_i x_i.
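To see why h_1 coordinate updates are cheap, here is a hypothetical instance (our own example, 0.5·‖Ax − b‖² + (λ/2)‖x‖²): maintaining the residual r = Ax − b makes each partial derivative and each update cost one column of A rather than the whole matrix.

```python
import numpy as np

def coord_step(A, r, x, lam, i):
    L_i = A[:, i] @ A[:, i] + lam      # exact coordinate Lipschitz constant
    g_i = A[:, i] @ r + lam * x[i]     # partial derivative, via the residual
    delta = -g_i / L_i
    x[i] += delta
    r += delta * A[:, i]               # keep r = Ax - b in sync: one column's work
    return x, r
```

For sparse A, the column product touches only that column's non-zeros, which is the n-fold saving the slide refers to.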

22 Problems where we can apply Gauss-Southwell
The GS rule may be as expensive as a full gradient step, even for h_1 and h_2.
But there are special cases where GS is n times faster.
Problem h_2: GS is efficient if the maximum degree is comparable to the average degree.
  You can track the gradients and use a max-heap.
Examples:
  Grid-based models: max degree = 4 and average degree ≈ 4. [Meshi et al., 2012]
  Dense quadratic: max degree = (n − 1), average degree = (n − 1).
  Facebook graph: max degree < 7000, average is ≈ 200.
Problem h_1: GS is efficient if the rows and columns of A have O(log n) non-zeros.
  (iteration cost of O((log n)³))
GS can also be approximated as a nearest-neighbour problem. [Dhillon et al., 2011, Shrivastava & Li, 2014]
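The max-heap trick can be sketched as follows, on a toy h_2-style sparse quadratic f(x) = ½xᵀAx − bᵀx (our own example, not the talk's code): updating coordinate i only changes the gradient entries of i's neighbours in the graph of A, so we push just those entries and discard stale heap entries lazily.

```python
import heapq
import numpy as np

def gs_heap(A, b, n_iters):
    n = len(b)
    x = np.zeros(n)
    g = -b.astype(float)                   # gradient at x = 0
    L = A.diagonal().max()
    heap = [(-abs(g[i]), i) for i in range(n)]
    heapq.heapify(heap)
    for _ in range(n_iters):
        while True:                        # lazy deletion of outdated entries
            mag, i = heapq.heappop(heap)
            if -mag == abs(g[i]):          # entry still matches current gradient
                break
        delta = -g[i] / L
        x[i] += delta
        for j in np.nonzero(A[i])[0]:      # only neighbours of i (and i itself)
            g[j] += delta * A[i, j]
            heapq.heappush(heap, (-abs(g[j]), j))
    return x, g
```

Each iteration costs O(degree · log n) heap work instead of an O(n) scan, which is why GS stays cheap when the maximum degree is small.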

25 Notation and Assumptions
We focus on the convex optimization problem
  min_{x ∈ ℝ^n} f(x),
where f is coordinate-wise L-Lipschitz continuous,
  |∇_i f(x + αe_i) − ∇_i f(x)| ≤ L|α|.
We focus on the case where f is µ-strongly convex, meaning that
  x ↦ f(x) − (µ/2)‖x‖²
is a convex function for some µ > 0.
If f is twice-differentiable, this is equivalent to ∇²_ii f(x) ≤ L and ∇²f(x) ⪰ µI.

28 Convergence of Randomized Coordinate Descent
Coordinate descent with constant step size 1/L uses
  x^{k+1} = x^k − (1/L) ∇_{i_k} f(x^k) e_{i_k},
for some chosen variable i_k.
With i_k chosen uniformly from {1, 2, ..., n} [Nesterov, 2010],
  E[f(x^{k+1})] − f(x*) ≤ (1 − µ/(Ln)) [f(x^k) − f(x*)].
Compare this to the rate of gradient descent,
  f(x^{k+1}) − f(x*) ≤ (1 − µ/L_f) [f(x^k) − f(x*)].
Since L ≤ L_f ≤ Ln, coordinate descent is slower per iteration, but n coordinate iterations are faster than one gradient iteration.

33 Classic Analysis of Gauss-Southwell Rule
The GS rule chooses the coordinate with the largest directional derivative,
  i_k = argmax_i |∇_i f(x^k)|.
From the Lipschitz-continuity assumption, this rule satisfies
  f(x^{k+1}) ≤ f(x^k) − (1/(2L)) ‖∇f(x^k)‖_∞².
From strong convexity we have
  f(x*) ≥ f(x^k) − (1/(2µ)) ‖∇f(x^k)‖².
Using ‖∇f(x^k)‖² ≤ n ‖∇f(x^k)‖_∞², we get
  f(x^{k+1}) − f(x*) ≤ (1 − µ/(Ln)) [f(x^k) − f(x*)],
the same rate as randomized selection [Boyd & Vandenberghe, 2004, §9.4.3].

40 Refined Gauss-Southwell Analysis
To avoid the norm inequality, measure strong convexity in the 1-norm.
E.g., find the maximum µ_1 such that
  x ↦ f(x) − (µ_1/2)‖x‖_1²
is convex. We now have that
  f(x*) ≥ f(x^k) − (1/(2µ_1)) ‖∇f(x^k)‖_∞².
This gives a rate of
  f(x^{k+1}) − f(x*) ≤ (1 − µ_1/L) [f(x^k) − f(x*)].
The relationship between µ and µ_1 is given by
  µ/n ≤ µ_1 ≤ µ.
The GS bound is the same as random when µ_1 = µ/n.
Otherwise, GS can be faster by a factor as large as n.

46 Comparison for Separable Quadratic
If f is a quadratic with diagonal Hessian (eigenvalues λ_i), we can show
  µ = min_i λ_i,   and   µ_1 = 1 / (Σ_{i=1}^n 1/λ_i).
If all λ_i are equal:
  There is no advantage to GS (µ_1 = µ/n).
With one very large λ_i:
  Here you would think that GS would be faster.
  But GS and random are still similar (µ_1 ≈ µ/n).
With one very small λ_i:
  Here the GS bound can be better by a factor of n (µ_1 ≈ µ).
  In this case, GS can actually be faster than gradient descent.
µ_1 is the harmonic mean of the λ_i divided by n, H(λ)/n:
  H(λ) is dominated by the minimum of its arguments.
  If each worker takes λ_i time to finish a task on their own, H(λ)/n is the time needed when working together [Ferger, 1931].
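The three cases above are easy to check numerically. A small sketch with made-up eigenvalues, using the slide's formulas µ = min_i λ_i and µ_1 = 1/Σ_i(1/λ_i):

```python
import numpy as np

def mu_and_mu1(lam):
    """For f(x) = 0.5 * sum_i lam_i * x_i^2: mu and the 1-norm constant mu_1."""
    lam = np.asarray(lam, dtype=float)
    return lam.min(), 1.0 / np.sum(1.0 / lam)
```

The harmonic mean is dragged down by its smallest argument, which is exactly why one tiny λ_i keeps µ_1 near µ while one huge λ_i barely moves it from µ/n.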

48 Fast Convergence with Bias Term
Consider the linear-prediction framework in statistics,
  argmin_{x,β} Σ_{i=1}^n f(a_iᵀx + β) + (λ/2)‖x‖² + (σ/2)β²,
where we've included a bias β.
Typically σ ≪ λ, to avoid biasing against a global shift.
This is an instance of h_1 where GS has the most benefit.

51 Rates with Different Lipschitz Constants
Consider the case where we have an L_i for each coordinate,
  |∇_i f(x + αe_i) − ∇_i f(x)| ≤ L_i|α|,
and we use a coordinate-dependent step size,
  x^{k+1} = x^k − (1/L_{i_k}) ∇_{i_k} f(x^k) e_{i_k}.
In this setting, we get a rate of
  f(x^k) − f(x*) ≤ [Π_{j=1}^k (1 − µ_1/L_{i_j})] [f(x^0) − f(x*)].
Since L = max_i L_i, this is faster whenever L_{i_k} < L for some i_k.
But the rate is the same in the worst case, even if the L_i are distinct.
Let's consider the effect of exact coordinate optimization on the L_{i_k}.

53 Gauss-Southwell with Exact Optimization
Exact coordinate optimization chooses the step size minimizing f. We get the same rates for randomized/GS selection because
  f(x^{k+1}) = min_α f(x^k − α ∇_{i_k} f(x^k) e_{i_k})
             ≤ f(x^k − (1/L_{i_k}) ∇_{i_k} f(x^k) e_{i_k})
             ≤ f(x^k) − (1/(2L_{i_k})) [∇_{i_k} f(x^k)]².
But theory again disagrees with practice: empirically, exact optimization is much faster.
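The inequality chain above can be checked directly on a toy quadratic f(x) = ½xᵀAx − bᵀx (our own example): for a quadratic, the exact coordinate minimizer is a step of g_i / A[i, i], which can only do at least as well as the fixed 1/L step.

```python
import numpy as np

def f(x, A, b):
    return 0.5 * x @ A @ x - b @ x

def step_fixed(x, A, b, i, L):
    y = x.copy()
    y[i] -= (A[i] @ x - b[i]) / L          # constant step size 1/L
    return y

def step_exact(x, A, b, i):
    y = x.copy()
    y[i] -= (A[i] @ x - b[i]) / A[i, i]    # minimizes f exactly along e_i
    return y
```

After the exact step, the updated partial derivative is exactly zero, which is the key fact used on the next slides.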

58 Gauss-Southwell with Exact Optimization and Sparsity
For dense problems, the exact-optimization bound is a little better:
  After an exact update, we have ∇_{i_k} f(x^{k+1}) = 0.
  Since i_{k+1} = argmax_i |∇_i f(x^{k+1})|, we never have i_{k+1} = i_k.
[Figure: example gradient vectors ∇f(x^k) and ∇f(x^{k+1}); the coordinate just updated has derivative exactly 0.]
If the L_i are distinct, the worst case is alternating between the two largest L_i.

66 Gauss-Southwell with Exact Optimization and Sparsity
For sparse instances of h_2, exact optimization can be much better:
  After an exact update we have ∇_{i_k} f(x^{k+m}) = 0 for all m, until some i_{k+m−1} is a neighbour of node i_k in the graph.
  We never alternate between large L_i that aren't neighbours.
[Figure: animation of GS with exact updates on a graph; an updated node keeps a zero derivative, so it can't be re-chosen until one of its neighbours is updated.]

68 Gauss-Southwell with Exact Optimization and Sparsity
For sparse instances of h_2, exact optimization can be much better. By bounding the worst-case sequence of L_i values, we have
  f(x^k) − f(x*) = O( (1 − µ_1 / max{L_2^G, L_3^G})^k ) [f(x^0) − f(x*)].
  L_2^G is the largest average of the L_i between neighbours.
  L_3^G is the largest average of the L_i along a 3-node path.
This is much faster if the large L_i are not neighbours.
Similar for h_1: place edges between variables that are non-zero in the same row.

71 Rules Depending on Lipschitz Constants
Assume that we know the L_i, or can approximate them.
Nesterov [2010] shows that sampling i proportional to L_i yields
  E[f(x^{k+1})] − f(x*) ≤ (1 − µ/(n L̄)) [f(x^k) − f(x*)],
where L̄ = (1/n) Σ_{i=1}^n L_i.
This is faster than uniform sampling when the L_i are distinct.
It could be faster or slower than the GS rule. In the separable quadratic case:
  With one large λ_i, Lipschitz sampling is faster.
  With one small λ_i, GS is faster.
So which should we use?

75 Gauss-Southwell-Lipschitz Rule
We obtain a faster rate by using the L_i in the GS rule,
  i_k = argmax_i |∇_i f(x^k)| / √L_i,
which we call the Gauss-Southwell-Lipschitz (GSL) rule.
Intuition: if the gradients are similar, more progress is made if L_i is small.
[Figure: on a 2-D quadratic, GS steps along the steeper coordinate x1, while GSL steps along x2, where the smaller L_i allows more progress.]
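A toy illustration of the intuition, with made-up numbers: when the derivative magnitudes are similar, GSL prefers the coordinate with the smaller L_i.

```python
import numpy as np

def gs_rule(g):
    # Gauss-Southwell: largest raw derivative magnitude
    return int(np.argmax(np.abs(g)))

def gsl_rule(g, L):
    # Gauss-Southwell-Lipschitz: derivative magnitude scaled by 1/sqrt(L_i)
    return int(np.argmax(np.abs(g) / np.sqrt(L)))
```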

79 Gauss-Southwell-Lipschitz Rule
The GSL rule obtains a rate of
  f(x^{k+1}) − f(x*) ≤ (1 − µ_L) [f(x^k) − f(x*)],
where µ_L is the strong-convexity constant in the norm ‖x‖_L = Σ_{i=1}^n √L_i |x_i|.
We have that
  max{ µ/(n L̄), µ_1/L } ≤ µ_L ≤ µ_1 / min_i {L_i},
so GSL is at least as fast as both GS and Lipschitz sampling.
GSL with step size 1/L_{i_k} is the optimal myopic coordinate update for quadratics,
  x^{k+1} = argmin_{i,α} f(x^k + αe_i).
The analysis gives a tighter bound on the maximum-improvement rule, used in certain applications. [Della Pietra et al., 1997, Lee et al., 2006]

83 Gauss-Southwell-Lipschitz as Nearest Neighbour
Consider a special case of h_1,
  min_x h_1(x) = Σ_{i=1}^n f(a_iᵀx),
where the GS rule has the form
  i_k = argmax_i |a_iᵀ r(x^k)|.
Dhillon et al. [2011] approximate GS as a nearest-neighbour problem,
  argmin_i (1/2)‖r(x^k) − a_i‖² = (1/2)‖r(x^k)‖² − a_iᵀ r(x^k) + (1/2)‖a_i‖².
  (use a_i and −a_i to get the absolute value)
The approximation is exact if ‖a_i‖ = 1 for all i.
Using L_i = γ‖a_i‖², we get the exact GSL rule as a nearest-neighbour problem,
  argmin_i (1/2)‖r(x^k) − a_i/(√γ ‖a_i‖)‖² = (1/2)‖r(x^k)‖² − a_iᵀ r(x^k)/(√γ ‖a_i‖) + 1/(2γ).
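The identity above is easy to verify numerically (random data, our own construction): with L_i = γ‖a_i‖², the scaled points ±a_i/(√γ‖a_i‖) all have the same squared norm 1/γ, so the nearest point to r maximizes |a_iᵀr|/√L_i.

```python
import numpy as np

def gsl_direct(Arows, r, gamma):
    # GSL rule: argmax_i |a_i^T r| / sqrt(L_i), with L_i = gamma * ||a_i||^2
    norms = np.linalg.norm(Arows, axis=1)
    return int(np.argmax(np.abs(Arows @ r) / (np.sqrt(gamma) * norms)))

def gsl_nearest(Arows, r, gamma):
    # Same rule as a nearest-neighbour search over +/- a_i / (sqrt(gamma)*||a_i||)
    pts = Arows / (np.sqrt(gamma) * np.linalg.norm(Arows, axis=1, keepdims=True))
    pts = np.vstack([pts, -pts])           # signed copies recover the absolute value
    i = int(np.argmin(np.linalg.norm(pts - r, axis=1)))
    return i % len(Arows)
```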

86 Approximate Gauss-Southwell
In many applications, we can approximate the GS rule.
With multiplicative error,
  |∇_{i_k} f(x^k)| ≥ ‖∇f(x^k)‖_∞ (1 − ɛ_k),
we have a fast rate and do not need ɛ_k → 0:
  f(x^{k+1}) − f(x*) ≤ (1 − µ_1(1 − ɛ_k)²/L) [f(x^k) − f(x*)].
With additive error,
  |∇_{i_k} f(x^k)| ≥ ‖∇f(x^k)‖_∞ − ɛ_k,
we have a fast rate if ɛ_k → 0 fast enough.
With constant additive error, we only reach a certain solution accuracy.

90 Proximal Coordinate Descent
An important application of coordinate descent is problems of the form
  min_{x ∈ ℝ^n} F(x) = f(x) + Σ_i g_i(x_i),
where f is smooth and the g_i might be non-smooth.
  E.g., l1-regularization or bound constraints.
Here we can apply a proximal-gradient style of update,
  x^{k+1} = prox_{(1/L) g_{i_k}} [ x^k − (1/L) ∇_{i_k} f(x^k) e_{i_k} ],
where
  prox_{αg}[y] = argmin_{x ∈ ℝ^n} (1/2)‖x − y‖² + αg(x).
Richtárik and Takáč [2014] show that
  E[F(x^{k+1})] − F(x*) ≤ (1 − µ/(Ln)) [F(x^k) − F(x*)],
the same rate as if the non-smooth g_i were not there.
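For l1-regularization, g_i(x_i) = λ|x_i|, the prox is soft-thresholding, so the update above is one line. A sketch on our own toy instance with f(x) = ½‖Ax − b‖²:

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t*|.|: shrink toward zero, clipping at zero
    return np.sign(z) * max(abs(z) - t, 0.0)

def prox_coord_step(A, b, x, i, lam):
    L_i = A[:, i] @ A[:, i]            # coordinate Lipschitz constant of f
    g_i = A[:, i] @ (A @ x - b)        # partial derivative of the smooth part
    x = x.copy()
    x[i] = soft_threshold(x[i] - g_i / L_i, lam / L_i)
    return x
```

The thresholding is what produces exact zeros, which is why coordinate descent is popular for sparse problems like the Lasso.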

94 Proximal Gauss-Southwell
There are several generalizations of GS to this setting:
GS-s: Maximize the minimal directional derivative,
  i_k = argmax_i { min_{s ∈ ∂g_i} |∇_i f(x^k) + s| }.
  Commonly used for l1-regularization, but x^{k+1} − x^k could be tiny.
GS-r: Maximize how far we move,
  i_k = argmax_i { | x_i^k − prox_{(1/L) g_i}[ x_i^k − (1/L) ∇_i f(x^k) ] | }.
  Effective for bound constraints, but ignores g_i(x_i^{k+1}) − g_i(x_i^k).
GS-q: Maximize progress under a quadratic approximation of f,
  i_k = argmin_i min_d { f(x^k) + ∇_i f(x^k) d + (L/2) d² + g_i(x_i^k + d) − g_i(x_i^k) }.
  Least intuitive, but has the best theoretical properties.
  Generalizes GSL if you use L_i instead of L.

98 Proximal Gauss-Southwell Convergence Rate
For the GS-q rule, we show that
  F(x^{k+1}) − F(x*) ≤ min{ (1 − µ/(Ln)) [F(x^k) − F(x*)],
                            (1 − µ_1/L) [F(x^k) − F(x*)] + ɛ_k },
where ɛ_k → 0 measures the non-linearity of the g_i that are not updated.
We conjecture that the above always holds with ɛ_k = 0.
The above rate does not hold for GS-s or GS-r (even if you change the min to a max).
But one final time, theory disagrees with practice:
  All three rules seem to work pretty well.
  Though GS-r works badly if you use the L_i.

99 Experiment 1: Sparse l2-Regularized Least Squares
Least squares with l2-regularization and a very sparse matrix.
[Figure: objective vs. epochs; GS and GSL converge much faster than Lipschitz, Random, and Cyclic selection.]

100 Experiment 2: Sparse l2-Regularized Logistic Regression

Logistic regression with l2-regularization and a very sparse matrix.

[Figure: objective vs. epochs for the Cyclic, Random, Lipschitz, GS, and GSL rules, each with a constant step-size and with exact coordinate optimization.]

Exact coordinate optimization makes a bigger difference than coordinate selection.

101 Experiment 3: Over-Determined Least Squares

Least squares with a dense matrix and nearest-neighbour GS.

[Figure: objective vs. epochs for the Cyclic, Random, Lipschitz, GS, GSL, and approximated GS/GSL rules.]

Approximate GS is still faster than random sampling.
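For least squares, $\nabla_i f(x) = a_i^\top (Ax - b)$, so the GS rule is a maximum-inner-product search over the columns of $A$ against the residual — which is what makes nearest-neighbour-style approximations possible. The sketch below (hypothetical names; a random-subset search stands in for a real nearest-neighbour structure) shows exact GS and a cheap approximation:

```python
import numpy as np

def greedy_coord(A, r):
    """Exact GS for f(x) = 0.5*||Ax - b||^2: with residual r = Ax - b,
    the gradient is A^T r, so selection is argmax_i |a_i^T r|."""
    return int(np.argmax(np.abs(A.T @ r)))

def approx_greedy_coord(A, r, m, rng):
    """Cheap stand-in for an approximate inner-product search:
    score only m randomly sampled columns and return the best of those."""
    idx = rng.choice(A.shape[1], size=m, replace=False)
    return int(idx[np.argmax(np.abs(A[:, idx].T @ r))])
```

With `m` equal to the number of columns this recovers exact GS; smaller `m` trades selection quality for per-iteration cost, in the spirit of the approximate rules compared in the plot.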

102–104 Discussion

GS is not always practical, but even approximate GS rules may outperform random selection.

We've given a justification for line-search in certain scenarios, and we proposed the GSL rule along with approximate and proximal variants.

The analysis extends to block updates. It could also be used for accelerated/parallel methods [Fercoq & Richtárik, 2013], primal-dual methods [Shalev-Shwartz & Zhang, 2013], and without strong convexity [Luo & Tseng, 1993].


More information

Window Width Selection for L 2 Adjusted Quantile Regression

Window Width Selection for L 2 Adjusted Quantile Regression Window Width Selection for L 2 Adjusted Quantile Regression Yoonsuh Jung, The Ohio State University Steven N. MacEachern, The Ohio State University Yoonkyung Lee, The Ohio State University Technical Report

More information

Multi-armed bandits in dynamic pricing

Multi-armed bandits in dynamic pricing Multi-armed bandits in dynamic pricing Arnoud den Boer University of Twente, Centrum Wiskunde & Informatica Amsterdam Lancaster, January 11, 2016 Dynamic pricing A firm sells a product, with abundant inventory,

More information

Maximizing Winnings on Final Jeopardy!

Maximizing Winnings on Final Jeopardy! Maximizing Winnings on Final Jeopardy! Jessica Abramson, Natalie Collina, and William Gasarch August 2017 1 Abstract Alice and Betty are going into the final round of Jeopardy. Alice knows how much money

More information

Handout 4: Deterministic Systems and the Shortest Path Problem

Handout 4: Deterministic Systems and the Shortest Path Problem SEEM 3470: Dynamic Optimization and Applications 2013 14 Second Term Handout 4: Deterministic Systems and the Shortest Path Problem Instructor: Shiqian Ma January 27, 2014 Suggested Reading: Bertsekas

More information

An Adaptive Learning Model in Coordination Games

An Adaptive Learning Model in Coordination Games Department of Economics An Adaptive Learning Model in Coordination Games Department of Economics Discussion Paper 13-14 Naoki Funai An Adaptive Learning Model in Coordination Games Naoki Funai June 17,

More information

A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization

A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization Hongzhou Lin 1, Julien Mairal 1, Zaid Harchaoui 2 1 Inria, Grenoble 2 University of Washington LCCC Workshop on large-scale and distributed

More information

Agricultural and Applied Economics 637 Applied Econometrics II

Agricultural and Applied Economics 637 Applied Econometrics II Agricultural and Applied Economics 637 Applied Econometrics II Assignment I Using Search Algorithms to Determine Optimal Parameter Values in Nonlinear Regression Models (Due: February 3, 2015) (Note: Make

More information

Utility Indifference Pricing and Dynamic Programming Algorithm

Utility Indifference Pricing and Dynamic Programming Algorithm Chapter 8 Utility Indifference ricing and Dynamic rogramming Algorithm In the Black-Scholes framework, we can perfectly replicate an option s payoff. However, it may not be true beyond the Black-Scholes

More information

Chapter 5 Finite Difference Methods. Math6911 W07, HM Zhu

Chapter 5 Finite Difference Methods. Math6911 W07, HM Zhu Chapter 5 Finite Difference Methods Math69 W07, HM Zhu References. Chapters 5 and 9, Brandimarte. Section 7.8, Hull 3. Chapter 7, Numerical analysis, Burden and Faires Outline Finite difference (FD) approximation

More information