Is Greedy Coordinate Descent a Terrible Algorithm?


1 Is Greedy Coordinate Descent a Terrible Algorithm?
Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, Hoyt Koepke
University of British Columbia
Optimization and Big Data, 2015

10 Context: Random vs. Greedy Coordinate Descent
We consider coordinate descent for large-scale optimization:
1 Select a coordinate to update.
2 Take a small gradient step along that coordinate.
Recent interest began with Nesterov [2010]:
  Global convergence rate for randomized coordinate selection.
  Faster than gradient descent if iterations are n times cheaper.
Contrast random selection with the classic Gauss-Southwell (GS) rule:
  i_k = argmax_i |∇_i f(x)|.
[Figure: GS picks whichever of x1, x2, x3 has the steepest slope.]
GS is at least as expensive as random, but Nesterov showed the rate is the same.
So greedy is a terrible algorithm and you should just use random!
But this theory disagrees with practice...
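The two-step loop above can be sketched in a few lines. This is a minimal toy illustration (not the talk's code), on a quadratic f(x) = ½xᵀAx − bᵀx, comparing uniform-random and Gauss-Southwell selection with the constant step size 1/L:

```python
import numpy as np

def f(x, A, b):
    return 0.5 * x @ A @ x - b @ x

def coord_descent(A, b, n_iters, rule, seed=0):
    rng = np.random.default_rng(seed)
    n = len(b)
    L = A.diagonal().max()   # coordinate-wise Lipschitz constant of a quadratic
    x = np.zeros(n)
    for _ in range(n_iters):
        g = A @ x - b        # full gradient, recomputed here only for clarity
        if rule == "random":
            i = rng.integers(n)                # 1. select a coordinate uniformly
        else:
            i = int(np.argmax(np.abs(g)))      # 1. GS: steepest coordinate
        x[i] -= g[i] / L                       # 2. small gradient step along it
    return x
```

Both rules converge on this toy problem; the point of the talk is how their rates compare.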

12 Context: Random vs. Greedy Coordinate Descent
[Figure: objective vs. epochs on l2-regularized sparse least squares, comparing Lipschitz, Random, Cyclic, GS, and GSL selection.]
If random and GS have similar costs, GS works much better.
This work: a refined analysis of GS.

16 Problems where we can apply coordinate descent
When is a coordinate update n times faster than a gradient update?
There are basically two problem classes where this is true:
  h_1(x) = f(Ax) + Σ_{i=1}^n g_i(x_i),   or   h_2(x) = Σ_{(i,j)∈E} f_ij(x_i, x_j) + Σ_{i=1}^n g_i(x_i),
where f and the f_ij are smooth, A is a matrix, and E is the set of edges in a graph.
(The g_i can be general non-degenerate convex functions.)
h_1 includes least squares, logistic regression, Lasso, and SVMs.
  E.g., min_{x ∈ ℝ^n} (1/2)‖Ax − b‖² + λ Σ_{i=1}^n |x_i|.
h_2 includes quadratics, graph-based label propagation, and probabilistic graphical models.
  E.g., min_{x ∈ ℝ^n} (1/2) xᵀAx + bᵀx = (1/2) Σ_{i=1}^n Σ_{j=1}^n a_ij x_i x_j + Σ_{i=1}^n b_i x_i.
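To see why h_1 coordinate updates are cheap, here is a hypothetical instance (our own example, 0.5·‖Ax − b‖² + (λ/2)‖x‖²): maintaining the residual r = Ax − b makes each partial derivative and each update cost one column of A rather than the whole matrix.

```python
import numpy as np

def coord_step(A, r, x, lam, i):
    L_i = A[:, i] @ A[:, i] + lam      # exact coordinate Lipschitz constant
    g_i = A[:, i] @ r + lam * x[i]     # partial derivative, via the residual
    delta = -g_i / L_i
    x[i] += delta
    r += delta * A[:, i]               # keep r = Ax - b in sync: one column's work
    return x, r
```

For sparse A, the column product touches only that column's non-zeros, which is the n-fold saving the slide refers to.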

22 Problems where we can apply Gauss-Southwell
The GS rule may be as expensive as a full gradient step, even for h_1 and h_2.
But there are special cases where GS is n times faster.
Problem h_2: GS is efficient if the maximum degree is comparable to the average degree.
  You can track the gradients and use a max-heap.
Examples:
  Grid-based models: max degree = 4 and average degree ≈ 4. [Meshi et al., 2012]
  Dense quadratic: max degree = (n − 1), average degree = (n − 1).
  Facebook graph: max degree < 7000, average is ≈ 200.
Problem h_1: GS is efficient if the rows and columns of A have O(log n) non-zeros.
  (iteration cost of O((log n)³))
GS can also be approximated as a nearest-neighbour problem. [Dhillon et al., 2011, Shrivastava & Li, 2014]
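The max-heap trick can be sketched as follows, on a toy h_2-style sparse quadratic f(x) = ½xᵀAx − bᵀx (our own example, not the talk's code): updating coordinate i only changes the gradient entries of i's neighbours in the graph of A, so we push just those entries and discard stale heap entries lazily.

```python
import heapq
import numpy as np

def gs_heap(A, b, n_iters):
    n = len(b)
    x = np.zeros(n)
    g = -b.astype(float)                   # gradient at x = 0
    L = A.diagonal().max()
    heap = [(-abs(g[i]), i) for i in range(n)]
    heapq.heapify(heap)
    for _ in range(n_iters):
        while True:                        # lazy deletion of outdated entries
            mag, i = heapq.heappop(heap)
            if -mag == abs(g[i]):          # entry still matches current gradient
                break
        delta = -g[i] / L
        x[i] += delta
        for j in np.nonzero(A[i])[0]:      # only neighbours of i (and i itself)
            g[j] += delta * A[i, j]
            heapq.heappush(heap, (-abs(g[j]), j))
    return x, g
```

Each iteration costs O(degree · log n) heap work instead of an O(n) scan, which is why GS stays cheap when the maximum degree is small.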

25 Notation and Assumptions
We focus on the convex optimization problem
  min_{x ∈ ℝ^n} f(x),
where f is coordinate-wise L-Lipschitz continuous,
  |∇_i f(x + αe_i) − ∇_i f(x)| ≤ L|α|.
We focus on the case where f is µ-strongly convex, meaning that
  x ↦ f(x) − (µ/2)‖x‖²
is a convex function for some µ > 0.
If f is twice-differentiable, this is equivalent to ∇²_ii f(x) ≤ L and ∇²f(x) ⪰ µI.

28 Convergence of Randomized Coordinate Descent
Coordinate descent with constant step size 1/L uses
  x^{k+1} = x^k − (1/L) ∇_{i_k} f(x^k) e_{i_k},
for some chosen variable i_k.
With i_k chosen uniformly from {1, 2, ..., n} [Nesterov, 2010],
  E[f(x^{k+1})] − f(x*) ≤ (1 − µ/(Ln)) [f(x^k) − f(x*)].
Compare this to the rate of gradient descent,
  f(x^{k+1}) − f(x*) ≤ (1 − µ/L_f) [f(x^k) − f(x*)].
Since L ≤ L_f ≤ Ln, coordinate descent is slower per iteration, but n coordinate iterations are faster than one gradient iteration.

33 Classic Analysis of Gauss-Southwell Rule
The GS rule chooses the coordinate with the largest directional derivative,
  i_k = argmax_i |∇_i f(x^k)|.
From the Lipschitz-continuity assumption, this rule satisfies
  f(x^{k+1}) ≤ f(x^k) − (1/(2L)) ‖∇f(x^k)‖_∞².
From strong convexity we have
  f(x*) ≥ f(x^k) − (1/(2µ)) ‖∇f(x^k)‖².
Using ‖∇f(x^k)‖² ≤ n ‖∇f(x^k)‖_∞², we get
  f(x^{k+1}) − f(x*) ≤ (1 − µ/(Ln)) [f(x^k) − f(x*)],
the same rate as randomized selection [Boyd & Vandenberghe, 2004, §9.4.3].

40 Refined Gauss-Southwell Analysis
To avoid the norm inequality, measure strong convexity in the 1-norm.
E.g., find the maximum µ_1 such that
  x ↦ f(x) − (µ_1/2)‖x‖_1²
is convex. We now have that
  f(x*) ≥ f(x^k) − (1/(2µ_1)) ‖∇f(x^k)‖_∞².
This gives a rate of
  f(x^{k+1}) − f(x*) ≤ (1 − µ_1/L) [f(x^k) − f(x*)].
The relationship between µ and µ_1 is given by
  µ/n ≤ µ_1 ≤ µ.
The GS bound is the same as random when µ_1 = µ/n.
Otherwise, GS can be faster by a factor as large as n.

46 Comparison for Separable Quadratic
If f is a quadratic with diagonal Hessian (eigenvalues λ_i), we can show
  µ = min_i λ_i,   and   µ_1 = 1 / (Σ_{i=1}^n 1/λ_i).
If all λ_i are equal:
  There is no advantage to GS (µ_1 = µ/n).
With one very large λ_i:
  Here you would think that GS would be faster.
  But GS and random are still similar (µ_1 ≈ µ/n).
With one very small λ_i:
  Here the GS bound can be better by a factor of n (µ_1 ≈ µ).
  In this case, GS can actually be faster than gradient descent.
µ_1 is the harmonic mean of the λ_i divided by n, H(λ)/n:
  H(λ) is dominated by the minimum of its arguments.
  If each worker takes λ_i time to finish a task on their own, H(λ)/n is the time needed when working together [Ferger, 1931].
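The three cases above are easy to check numerically. A small sketch with made-up eigenvalues, using the slide's formulas µ = min_i λ_i and µ_1 = 1/Σ_i(1/λ_i):

```python
import numpy as np

def mu_and_mu1(lam):
    """For f(x) = 0.5 * sum_i lam_i * x_i^2: mu and the 1-norm constant mu_1."""
    lam = np.asarray(lam, dtype=float)
    return lam.min(), 1.0 / np.sum(1.0 / lam)
```

The harmonic mean is dragged down by its smallest argument, which is exactly why one tiny λ_i keeps µ_1 near µ while one huge λ_i barely moves it from µ/n.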

48 Fast Convergence with Bias Term
Consider the linear-prediction framework in statistics,
  argmin_{x,β} Σ_{i=1}^n f(a_iᵀx + β) + (λ/2)‖x‖² + (σ/2)β²,
where we've included a bias β.
Typically σ ≪ λ, to avoid biasing against a global shift.
This is an instance of h_1 where GS has the most benefit.

51 Rates with Different Lipschitz Constants
Consider the case where we have an L_i for each coordinate,
  |∇_i f(x + αe_i) − ∇_i f(x)| ≤ L_i|α|,
and we use a coordinate-dependent step size,
  x^{k+1} = x^k − (1/L_{i_k}) ∇_{i_k} f(x^k) e_{i_k}.
In this setting, we get a rate of
  f(x^k) − f(x*) ≤ [Π_{j=1}^k (1 − µ_1/L_{i_j})] [f(x^0) − f(x*)].
Since L = max_i L_i, this is faster whenever L_{i_k} < L for some i_k.
But the rate is the same in the worst case, even if the L_i are distinct.
Let's consider the effect of exact coordinate optimization on the L_{i_k}.

53 Gauss-Southwell with Exact Optimization
Exact coordinate optimization chooses the step size minimizing f. We get the same rates for randomized/GS selection because
  f(x^{k+1}) = min_α f(x^k − α ∇_{i_k} f(x^k) e_{i_k})
             ≤ f(x^k − (1/L_{i_k}) ∇_{i_k} f(x^k) e_{i_k})
             ≤ f(x^k) − (1/(2L_{i_k})) [∇_{i_k} f(x^k)]².
But theory again disagrees with practice: empirically, exact optimization is much faster.
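The inequality chain above can be checked directly on a toy quadratic f(x) = ½xᵀAx − bᵀx (our own example): for a quadratic, the exact coordinate minimizer is a step of g_i / A[i, i], which can only do at least as well as the fixed 1/L step.

```python
import numpy as np

def f(x, A, b):
    return 0.5 * x @ A @ x - b @ x

def step_fixed(x, A, b, i, L):
    y = x.copy()
    y[i] -= (A[i] @ x - b[i]) / L          # constant step size 1/L
    return y

def step_exact(x, A, b, i):
    y = x.copy()
    y[i] -= (A[i] @ x - b[i]) / A[i, i]    # minimizes f exactly along e_i
    return y
```

After the exact step, the updated partial derivative is exactly zero, which is the key fact used on the next slides.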

58 Gauss-Southwell with Exact Optimization and Sparsity
For dense problems, the exact-optimization bound is a little better:
  After an exact update, we have ∇_{i_k} f(x^{k+1}) = 0.
  Since i_{k+1} = argmax_i |∇_i f(x^{k+1})|, we never have i_{k+1} = i_k.
[Figure: example gradient vectors ∇f(x^k) and ∇f(x^{k+1}); the coordinate just updated has derivative exactly 0.]
If the L_i are distinct, the worst case is alternating between the two largest L_i.

66 Gauss-Southwell with Exact Optimization and Sparsity
For sparse instances of h_2, exact optimization can be much better:
  After an exact update we have ∇_{i_k} f(x^{k+m}) = 0 for all m, until some i_{k+m−1} is a neighbour of node i_k in the graph.
  We never alternate between large L_i that aren't neighbours.
[Figure: animation of GS with exact updates on a graph; an updated node keeps a zero derivative, so it can't be re-chosen until one of its neighbours is updated.]

68 Gauss-Southwell with Exact Optimization and Sparsity
For sparse instances of h_2, exact optimization can be much better. By bounding the worst-case sequence of L_i values, we have
  f(x^k) − f(x*) = O( (1 − µ_1 / max{L_2^G, L_3^G})^k ) [f(x^0) − f(x*)].
  L_2^G is the largest average of the L_i between neighbours.
  L_3^G is the largest average of the L_i along a 3-node path.
This is much faster if the large L_i are not neighbours.
Similar for h_1: place edges between variables that are non-zero in the same row.

71 Rules Depending on Lipschitz Constants
Assume that we know the L_i, or can approximate them.
Nesterov [2010] shows that sampling i proportional to L_i yields
  E[f(x^{k+1})] − f(x*) ≤ (1 − µ/(n L̄)) [f(x^k) − f(x*)],
where L̄ = (1/n) Σ_{i=1}^n L_i.
This is faster than uniform sampling when the L_i are distinct.
It could be faster or slower than the GS rule. In the separable quadratic case:
  With one large λ_i, Lipschitz sampling is faster.
  With one small λ_i, GS is faster.
So which should we use?

75 Gauss-Southwell-Lipschitz Rule
We obtain a faster rate by using the L_i in the GS rule,
  i_k = argmax_i |∇_i f(x^k)| / √L_i,
which we call the Gauss-Southwell-Lipschitz (GSL) rule.
Intuition: if the gradients are similar, more progress is made if L_i is small.
[Figure: on a 2-D quadratic, GS steps along the steeper coordinate x1, while GSL steps along x2, where the smaller L_i allows more progress.]
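A toy illustration of the intuition, with made-up numbers: when the derivative magnitudes are similar, GSL prefers the coordinate with the smaller L_i.

```python
import numpy as np

def gs_rule(g):
    # Gauss-Southwell: largest raw derivative magnitude
    return int(np.argmax(np.abs(g)))

def gsl_rule(g, L):
    # Gauss-Southwell-Lipschitz: derivative magnitude scaled by 1/sqrt(L_i)
    return int(np.argmax(np.abs(g) / np.sqrt(L)))
```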

79 Gauss-Southwell-Lipschitz Rule
The GSL rule obtains a rate of
  f(x^{k+1}) − f(x*) ≤ (1 − µ_L) [f(x^k) − f(x*)],
where µ_L is the strong-convexity constant in the norm ‖x‖_L = Σ_{i=1}^n √L_i |x_i|.
We have that
  max{ µ/(n L̄), µ_1/L } ≤ µ_L ≤ µ_1 / min_i {L_i},
so GSL is at least as fast as both GS and Lipschitz sampling.
GSL with step size 1/L_{i_k} is the optimal myopic coordinate update for quadratics,
  x^{k+1} = argmin_{i,α} f(x^k + αe_i).
The analysis gives a tighter bound on the maximum-improvement rule, used in certain applications. [Della Pietra et al., 1997, Lee et al., 2006]

83 Gauss-Southwell-Lipschitz as Nearest Neighbour
Consider a special case of h_1,
  min_x h_1(x) = Σ_{i=1}^n f(a_iᵀx),
where the GS rule has the form
  i_k = argmax_i |a_iᵀ r(x^k)|.
Dhillon et al. [2011] approximate GS as a nearest-neighbour problem,
  argmin_i (1/2)‖r(x^k) − a_i‖² = (1/2)‖r(x^k)‖² − a_iᵀ r(x^k) + (1/2)‖a_i‖².
  (use a_i and −a_i to get the absolute value)
The approximation is exact if ‖a_i‖ = 1 for all i.
Using L_i = γ‖a_i‖², we get the exact GSL rule as a nearest-neighbour problem,
  argmin_i (1/2)‖r(x^k) − a_i/(√γ ‖a_i‖)‖² = (1/2)‖r(x^k)‖² − a_iᵀ r(x^k)/(√γ ‖a_i‖) + 1/(2γ).
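The identity above is easy to verify numerically (random data, our own construction): with L_i = γ‖a_i‖², the scaled points ±a_i/(√γ‖a_i‖) all have the same squared norm 1/γ, so the nearest point to r maximizes |a_iᵀr|/√L_i.

```python
import numpy as np

def gsl_direct(Arows, r, gamma):
    # GSL rule: argmax_i |a_i^T r| / sqrt(L_i), with L_i = gamma * ||a_i||^2
    norms = np.linalg.norm(Arows, axis=1)
    return int(np.argmax(np.abs(Arows @ r) / (np.sqrt(gamma) * norms)))

def gsl_nearest(Arows, r, gamma):
    # Same rule as a nearest-neighbour search over +/- a_i / (sqrt(gamma)*||a_i||)
    pts = Arows / (np.sqrt(gamma) * np.linalg.norm(Arows, axis=1, keepdims=True))
    pts = np.vstack([pts, -pts])           # signed copies recover the absolute value
    i = int(np.argmin(np.linalg.norm(pts - r, axis=1)))
    return i % len(Arows)
```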

86 Approximate Gauss-Southwell
In many applications, we can approximate the GS rule.
With multiplicative error,
  |∇_{i_k} f(x^k)| ≥ ‖∇f(x^k)‖_∞ (1 − ɛ_k),
we have a fast rate and do not need ɛ_k → 0:
  f(x^{k+1}) − f(x*) ≤ (1 − µ_1(1 − ɛ_k)²/L) [f(x^k) − f(x*)].
With additive error,
  |∇_{i_k} f(x^k)| ≥ ‖∇f(x^k)‖_∞ − ɛ_k,
we have a fast rate if ɛ_k → 0 fast enough.
With constant additive error, we only reach a certain solution accuracy.

90 Proximal Coordinate Descent
An important application of coordinate descent is problems of the form
  min_{x ∈ ℝ^n} F(x) = f(x) + Σ_i g_i(x_i),
where f is smooth and the g_i might be non-smooth.
  E.g., l1-regularization or bound constraints.
Here we can apply a proximal-gradient style of update,
  x^{k+1} = prox_{(1/L) g_{i_k}} [ x^k − (1/L) ∇_{i_k} f(x^k) e_{i_k} ],
where
  prox_{αg}[y] = argmin_{x ∈ ℝ^n} (1/2)‖x − y‖² + αg(x).
Richtárik and Takáč [2014] show that
  E[F(x^{k+1})] − F(x*) ≤ (1 − µ/(Ln)) [F(x^k) − F(x*)],
the same rate as if the non-smooth g_i were not there.
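For l1-regularization, g_i(x_i) = λ|x_i|, the prox is soft-thresholding, so the update above is one line. A sketch on our own toy instance with f(x) = ½‖Ax − b‖²:

```python
import numpy as np

def soft_threshold(z, t):
    # prox of t*|.|: shrink toward zero, clipping at zero
    return np.sign(z) * max(abs(z) - t, 0.0)

def prox_coord_step(A, b, x, i, lam):
    L_i = A[:, i] @ A[:, i]            # coordinate Lipschitz constant of f
    g_i = A[:, i] @ (A @ x - b)        # partial derivative of the smooth part
    x = x.copy()
    x[i] = soft_threshold(x[i] - g_i / L_i, lam / L_i)
    return x
```

The thresholding is what produces exact zeros, which is why coordinate descent is popular for sparse problems like the Lasso.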

94 Proximal Gauss-Southwell
There are several generalizations of GS to this setting:
GS-s: Maximize the minimal directional derivative,
  i_k = argmax_i { min_{s ∈ ∂g_i} |∇_i f(x^k) + s| }.
  Commonly used for l1-regularization, but x^{k+1} − x^k could be tiny.
GS-r: Maximize how far we move,
  i_k = argmax_i { | x_i^k − prox_{(1/L) g_i}[ x_i^k − (1/L) ∇_i f(x^k) ] | }.
  Effective for bound constraints, but ignores g_i(x_i^{k+1}) − g_i(x_i^k).
GS-q: Maximize progress under a quadratic approximation of f,
  i_k = argmin_i min_d { f(x^k) + ∇_i f(x^k) d + (L/2) d² + g_i(x_i^k + d) − g_i(x_i^k) }.
  Least intuitive, but has the best theoretical properties.
  Generalizes GSL if you use L_i instead of L.

98 Proximal Gauss-Southwell Convergence Rate
For the GS-q rule, we show that
  F(x^{k+1}) − F(x*) ≤ min{ (1 − µ/(Ln)) [F(x^k) − F(x*)],
                            (1 − µ_1/L) [F(x^k) − F(x*)] + ɛ_k },
where ɛ_k → 0 measures the non-linearity of the g_i that are not updated.
We conjecture that the above always holds with ɛ_k = 0.
The above rate does not hold for GS-s or GS-r (even if you change the min to a max).
But one final time, theory disagrees with practice:
  All three rules seem to work pretty well.
  Though GS-r works badly if you use the L_i.

99 Experiment 1: Sparse l2-Regularized Least Squares
Least squares with l2-regularization and a very sparse matrix.
[Figure: objective vs. epochs; GS and GSL converge much faster than Lipschitz, Random, and Cyclic selection.]

100 Experiment 2: Sparse l2-Regularized Logistic Regression

Logistic regression with l2-regularization and a very sparse matrix.

[Figure: objective vs. epochs for the Cyclic, Random, Lipschitz, GS, and GSL rules, each with a constant step-size and with exact coordinate optimization.]

Exact coordinate optimization makes a bigger difference than coordinate selection.

101 Experiment 3: Over-Determined Least Squares

Least squares with a dense matrix and nearest-neighbour GS.

[Figure: objective vs. epochs for the Cyclic, Random, Lipschitz, GS, GSL, and approximated GS/GSL rules.]

Approximate GS is still faster than random sampling.
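For least squares, $\nabla_i f(x) = a_i^\top (Ax - b)$, so the GS rule is a maximum-inner-product search over the columns of $A$ against the residual — which is what makes nearest-neighbour-style approximations possible. The sketch below (hypothetical names; a random-subset search stands in for a real nearest-neighbour structure) shows exact GS and a cheap approximation:

```python
import numpy as np

def greedy_coord(A, r):
    """Exact GS for f(x) = 0.5*||Ax - b||^2: with residual r = Ax - b,
    the gradient is A^T r, so selection is argmax_i |a_i^T r|."""
    return int(np.argmax(np.abs(A.T @ r)))

def approx_greedy_coord(A, r, m, rng):
    """Cheap stand-in for an approximate inner-product search:
    score only m randomly sampled columns and return the best of those."""
    idx = rng.choice(A.shape[1], size=m, replace=False)
    return int(idx[np.argmax(np.abs(A[:, idx].T @ r))])
```

With `m` equal to the number of columns this recovers exact GS; smaller `m` trades selection quality for per-iteration cost, in the spirit of the approximate rules compared in the plot.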

102–104 Discussion

GS is not always practical, but even approximate GS rules may outperform random selection.

We've given a justification for line-search in certain scenarios, and we proposed the GSL rule along with approximate and proximal variants.

The analysis extends to block updates. It could also be used for accelerated/parallel methods [Fercoq & Richtárik, 2013], primal-dual methods [Shalev-Shwartz & Zhang, 2013], and without strong convexity [Luo & Tseng, 1993].


More information

Window Width Selection for L 2 Adjusted Quantile Regression

Window Width Selection for L 2 Adjusted Quantile Regression Window Width Selection for L 2 Adjusted Quantile Regression Yoonsuh Jung, The Ohio State University Steven N. MacEachern, The Ohio State University Yoonkyung Lee, The Ohio State University Technical Report

More information

Multi-armed bandits in dynamic pricing

Multi-armed bandits in dynamic pricing Multi-armed bandits in dynamic pricing Arnoud den Boer University of Twente, Centrum Wiskunde & Informatica Amsterdam Lancaster, January 11, 2016 Dynamic pricing A firm sells a product, with abundant inventory,

More information

Maximizing Winnings on Final Jeopardy!

Maximizing Winnings on Final Jeopardy! Maximizing Winnings on Final Jeopardy! Jessica Abramson, Natalie Collina, and William Gasarch August 2017 1 Abstract Alice and Betty are going into the final round of Jeopardy. Alice knows how much money

More information

Handout 4: Deterministic Systems and the Shortest Path Problem

Handout 4: Deterministic Systems and the Shortest Path Problem SEEM 3470: Dynamic Optimization and Applications 2013 14 Second Term Handout 4: Deterministic Systems and the Shortest Path Problem Instructor: Shiqian Ma January 27, 2014 Suggested Reading: Bertsekas

More information

An Adaptive Learning Model in Coordination Games

An Adaptive Learning Model in Coordination Games Department of Economics An Adaptive Learning Model in Coordination Games Department of Economics Discussion Paper 13-14 Naoki Funai An Adaptive Learning Model in Coordination Games Naoki Funai June 17,

More information

A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization

A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization Hongzhou Lin 1, Julien Mairal 1, Zaid Harchaoui 2 1 Inria, Grenoble 2 University of Washington LCCC Workshop on large-scale and distributed

More information

Agricultural and Applied Economics 637 Applied Econometrics II

Agricultural and Applied Economics 637 Applied Econometrics II Agricultural and Applied Economics 637 Applied Econometrics II Assignment I Using Search Algorithms to Determine Optimal Parameter Values in Nonlinear Regression Models (Due: February 3, 2015) (Note: Make

More information

Utility Indifference Pricing and Dynamic Programming Algorithm

Utility Indifference Pricing and Dynamic Programming Algorithm Chapter 8 Utility Indifference ricing and Dynamic rogramming Algorithm In the Black-Scholes framework, we can perfectly replicate an option s payoff. However, it may not be true beyond the Black-Scholes

More information

Chapter 5 Finite Difference Methods. Math6911 W07, HM Zhu

Chapter 5 Finite Difference Methods. Math6911 W07, HM Zhu Chapter 5 Finite Difference Methods Math69 W07, HM Zhu References. Chapters 5 and 9, Brandimarte. Section 7.8, Hull 3. Chapter 7, Numerical analysis, Burden and Faires Outline Finite difference (FD) approximation

More information