Is Greedy Coordinate Descent a Terrible Algorithm?
1 Is Greedy Coordinate Descent a Terrible Algorithm? Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, Hoyt Koepke. University of British Columbia. Optimization and Big Data, 2015.
2-10 Context: Random vs. Greedy Coordinate Descent
We consider coordinate descent for large-scale optimization:
1 Select a coordinate to update.
2 Take a small gradient step along that coordinate.
Recent interest began with Nesterov [2010]:
Global convergence rate for randomized coordinate selection.
Faster than gradient descent if iterations are n times cheaper.
Contrast random selection with the classic Gauss-Southwell (GS) rule: i_k = argmax_i |∇_i f(x)|.
[Figure: 3-D example with coordinates x1, x2, x3; GS picks the coordinate with the largest directional derivative.]
Gauss-Southwell is at least as expensive as random, but Nesterov showed the rate is the same. So greedy is a terrible algorithm: just use random!
But this theory disagrees with practice...
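The two-step scheme above can be sketched in a few lines (an illustrative toy implementation, not the authors' code; the quadratic test function and constant step size 1/L are assumptions for the demo):

```python
import numpy as np

def coordinate_descent(grad, x0, L, steps, rule="random", rng=None):
    """Coordinate descent with constant step size 1/L.

    rule: "random" picks i_k uniformly; "gs" picks argmax_i |grad_i|.
    """
    rng = rng or np.random.default_rng(0)
    x = x0.astype(float).copy()
    for _ in range(steps):
        g = grad(x)
        i = rng.integers(len(x)) if rule == "random" else int(np.argmax(np.abs(g)))
        x[i] -= g[i] / L            # small gradient step along coordinate i
    return x

# Example: f(x) = 1/2 x^T A x - b^T x, so grad(x) = A x - b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 2.0])
L = max(np.diag(A))                 # coordinate-wise Lipschitz constant
x_gs = coordinate_descent(lambda x: A @ x - b, np.zeros(2), L, 200, rule="gs")
```

With enough iterations both rules converge to the minimizer A⁻¹b; the point of the talk is how fast each one gets there.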
11-12 Context: Random vs. Greedy Coordinate Descent
[Figure: objective vs. epochs on ℓ2-regularized sparse least squares, comparing Cyclic, Random, Lipschitz sampling, GS, and GSL selection.]
If random and GS have similar costs, GS works much better.
This work: a refined analysis of GS.
13-16 Problems where we can apply coordinate descent
When is a coordinate update n times faster than a gradient update? There are basically two problems where this is true:
h1(x) = f(Ax) + Σ_{i=1}^n g_i(x_i), or h2(x) = Σ_{(i,j)∈E} f_ij(x_i, x_j) + Σ_{i=1}^n g_i(x_i),
where f and the f_ij are smooth, A is a matrix, and E is the set of edges in a graph. (The g_i can be general non-degenerate convex functions.)
h1 includes least squares, logistic regression, Lasso, and SVMs. E.g., min_{x∈ℝ^n} (1/2)‖Ax − b‖² + λ Σ_{i=1}^n |x_i|.
h2 includes quadratics, graph-based label propagation, and probabilistic graphical models. E.g., min_{x∈ℝ^n} (1/2) xᵀAx + bᵀx = (1/2) Σ_{i=1}^n Σ_{j=1}^n a_ij x_i x_j + Σ_{i=1}^n b_i x_i.
17-22 Problems where we can apply Gauss-Southwell
The GS rule may be as expensive as a full gradient step even for h1 and h2, but there are special cases where GS is n times faster.
Problem h2: GS is efficient if the maximum degree is comparable to the average degree; you can track the gradients and use a max-heap. Examples:
Grid-based models: max degree = 4 and average degree ≈ 4. [Meshi et al., 2012]
Dense quadratic: max degree = (n − 1), average degree = (n − 1).
Facebook graph: max degree < 7000, average ≈ 200.
Problem h1: GS is efficient if the rows and columns of A have O(log n) non-zeros (iteration cost O((log n)³)).
GS can also be approximated as a nearest-neighbour problem. [Dhillon et al., 2011, Shrivastava & Li, 2014]
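The max-heap idea can be sketched as follows (an assumed implementation, not from the talk; `gs_quadratic` and the version-counter scheme for invalidating stale heap entries are illustrative choices):

```python
import heapq
import numpy as np

def gs_quadratic(A, b, x, steps):
    """GS coordinate descent on f(x) = 1/2 x^T A x - b^T x.

    A max-heap (via negated keys) with per-coordinate version counters
    gives the greedy rule without rescanning all n gradients each step.
    """
    n = len(x)
    g = A @ x - b                        # full gradient, maintained incrementally
    version = [0] * n
    heap = [(-abs(g[i]), i, 0) for i in range(n)]
    heapq.heapify(heap)
    L = float(np.max(np.diag(A)))
    for _ in range(steps):
        # discard stale entries (their coordinate changed since they were pushed)
        while heap[0][2] != version[heap[0][1]]:
            heapq.heappop(heap)
        i = heap[0][1]
        delta = -g[i] / L
        x[i] += delta
        g += delta * A[:, i]             # only i's graph neighbours change
        for j in np.nonzero(A[:, i])[0]: # re-key every affected coordinate
            version[j] += 1
            heapq.heappush(heap, (-abs(g[j]), j, version[j]))
    return x

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 1.0])
x = gs_quadratic(A, b, np.zeros(2), 300)
```

The per-iteration cost is dominated by the number of neighbours of the chosen coordinate, which is why GS is cheap exactly when max degree is close to average degree.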
23-25 Notation and Assumptions
We focus on the convex optimization problem min_{x∈ℝ^n} f(x), where ∇f is coordinate-wise L-Lipschitz continuous,
|∇_i f(x + αe_i) − ∇_i f(x)| ≤ L|α|.
We focus on the case where f is µ-strongly convex, meaning that
x ↦ f(x) − (µ/2)‖x‖²
is a convex function for some µ > 0. If f is twice-differentiable, these are equivalent to ∇²_ii f(x) ≤ L and ∇²f(x) ⪰ µI.
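For a quadratic f(x) = (1/2)xᵀAx − bᵀx the Hessian is A everywhere, so the two constants can be read off directly. A small check with an arbitrary example matrix (the values are illustrative):

```python
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
L  = np.max(np.diag(A))              # coordinate-wise Lipschitz: max_i A_ii
mu = np.min(np.linalg.eigvalsh(A))   # strong convexity: smallest eigenvalue
print(L, mu)
```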
26-28 Convergence of Randomized Coordinate Descent
Coordinate descent with constant step size 1/L uses
x^{k+1} = x^k − (1/L) ∇_{i_k} f(x^k) e_{i_k},
for some variable i_k. With i_k chosen uniformly from {1, 2, ..., n} [Nesterov, 2010],
E[f(x^{k+1})] − f(x*) ≤ (1 − µ/(Ln)) [f(x^k) − f(x*)].
Compare this to the rate of gradient descent,
f(x^{k+1}) − f(x*) ≤ (1 − µ/L_f) [f(x^k) − f(x*)].
Since Ln ≥ L_f ≥ L, coordinate descent is slower per iteration, but n coordinate iterations are faster than one gradient iteration.
29-33 Classic Analysis of Gauss-Southwell Rule
The GS rule chooses the coordinate with the largest directional derivative,
i_k = argmax_i |∇_i f(x^k)|.
From the Lipschitz-continuity assumption this rule satisfies
f(x^{k+1}) ≤ f(x^k) − (1/(2L)) ‖∇f(x^k)‖∞².
From strong convexity we have
f(x*) ≥ f(x^k) − (1/(2µ)) ‖∇f(x^k)‖².
Using ‖∇f(x^k)‖² ≤ n ‖∇f(x^k)‖∞², we get
f(x^{k+1}) − f(x*) ≤ (1 − µ/(Ln)) [f(x^k) − f(x*)],
the same rate as randomized selection [Boyd & Vandenberghe, 2004, §9.4.3].
34-40 Refined Gauss-Southwell Analysis
To avoid the norm inequality, measure strong convexity in the 1-norm. E.g., find the maximum µ₁ such that
x ↦ f(x) − (µ₁/2)‖x‖₁²
is convex. We now have that
f(x*) ≥ f(x^k) − (1/(2µ₁)) ‖∇f(x^k)‖∞².
This gives a rate of
f(x^{k+1}) − f(x*) ≤ (1 − µ₁/L) [f(x^k) − f(x*)].
The relationship between µ and µ₁ is given by µ/n ≤ µ₁ ≤ µ.
The GS bound is the same as random when µ₁ = µ/n; otherwise, GS can be faster by a factor as large as n.
41-46 Comparison for Separable Quadratic
If f is a quadratic with diagonal Hessian diag(λ₁, ..., λ_n), we can show
µ = min_i λ_i, and µ₁ = 1 / Σ_{i=1}^n (1/λ_i).
If all λ_i are equal: there is no advantage to GS (µ₁ = µ/n).
With one very large λ_i: here you would think GS would be faster, but GS and random are still similar (µ₁ ≈ µ/n).
With one very small λ_i: here the GS bound can be better by a factor of n (µ₁ ≈ µ). In this case, GS can actually be faster than gradient descent.
µ₁ is the harmonic mean of the λ_i divided by n, H(λ)/n: H(λ) is dominated by the minimum of its arguments. If each worker takes λ_i time to finish a task on their own, H(λ)/n is the time needed when working together [Ferger, 1931].
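A quick numeric sketch of the two regimes (the λ values are arbitrary):

```python
import numpy as np

def mu_and_mu1(lam):
    """mu and mu_1 for a separable quadratic with diagonal Hessian diag(lam)."""
    lam = np.asarray(lam, dtype=float)
    mu = lam.min()
    mu1 = 1.0 / np.sum(1.0 / lam)        # harmonic mean H(lam) divided by n
    return mu, mu1

# All lambda_i equal: no advantage to GS (mu_1 = mu/n).
mu, mu1 = mu_and_mu1([2.0, 2.0, 2.0, 2.0])

# One tiny lambda_i: mu_1 approaches mu, a factor-n advantage for GS.
mu_s, mu1_s = mu_and_mu1([1e-3, 2.0, 2.0, 2.0])
```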
47-48 Fast Convergence with Bias Term
Consider the linear-prediction framework in statistics,
argmin_{x,β} Σ_{i=1}^n f(a_iᵀx + β) + (λ/2)‖x‖² + (σ/2)β²,
where we've included a bias β. Typically σ ≪ λ to avoid biasing against a global shift. This is an instance of h1 where GS has the most benefit.
49-51 Rates with Different Lipschitz Constants
Consider the case where we have an L_i for each coordinate,
|∇_i f(x + αe_i) − ∇_i f(x)| ≤ L_i |α|,
and we use a coordinate-dependent step size,
x^{k+1} = x^k − (1/L_{i_k}) ∇_{i_k} f(x^k) e_{i_k}.
In this setting, we get a rate of
f(x^k) − f(x*) ≤ [ Π_{j=1}^k (1 − µ₁/L_{i_j}) ] [f(x^0) − f(x*)].
Since L = max_i L_i, this is faster if L_{i_k} < L for any i_k. But the rate is the same in the worst case, even if the L_i are distinct. Let's consider the effect of exact coordinate optimization on the L_{i_k}.
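A minimal sketch of the coordinate-dependent step size on a separable quadratic, where L_i is just the i-th diagonal entry of the Hessian (the values are illustrative):

```python
import numpy as np

lam = np.array([10.0, 1.0, 0.1])       # per-coordinate Lipschitz constants L_i
b = np.array([1.0, 1.0, 1.0])
grad = lambda x: lam * x - b           # f(x) = 1/2 sum_i lam_i x_i^2 - b^T x

x = np.zeros(3)
for _ in range(50):
    g = grad(x)
    i = int(np.argmax(np.abs(g)))      # GS selection
    x[i] -= g[i] / lam[i]              # step 1/L_i instead of 1/max_i L_i
```

On a separable quadratic the step 1/λ_i solves each coordinate exactly, so the method finishes after one pass over the chosen coordinates; with the conservative step 1/max_i L_i the flat coordinates would crawl.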
52-53 Gauss-Southwell with Exact Optimization
Exact coordinate optimization chooses the step size minimizing f. We get the same rates for randomized/GS because
f(x^{k+1}) = min_α f(x^k − α ∇_{i_k} f(x^k) e_{i_k})
≤ f(x^k − (1/L_{i_k}) ∇_{i_k} f(x^k) e_{i_k})
≤ f(x^k) − (1/(2L_{i_k})) [∇_{i_k} f(x^k)]².
But theory again disagrees with practice: empirically, exact optimization is much faster.
54-58 Gauss-Southwell with Exact Optimization and Sparsity
For dense problems, the exact-optimization bound is a little better: after an exact update, we have ∇_{i_k} f(x^{k+1}) = 0. Since i_{k+1} = argmax_i |∇_i f(x^{k+1})|, we never have i_{k+1} = i_k.
[Figure: the gradient vectors ∇f(x^k) and ∇f(x^{k+1}); the coordinate just updated has a zero entry in the new gradient.]
If the L_i are distinct, the worst case is alternating between the two largest L_i.
59-66 Gauss-Southwell with Exact Optimization and Sparsity
For sparse instances of h2, exact optimization can be much better: after an exact update we have ∇_{i_k} f(x^{k+m}) = 0 for all m until some i_{k+m−1} is a neighbour of node i_k in the graph. We never alternate between large L_i that aren't neighbours.
[Figure: animation of GS updates on a sparse graph; an updated node cannot be re-chosen until one of its neighbours is updated.]
67-68 Gauss-Southwell with Exact Optimization and Sparsity
For sparse instances of h2, by bounding the worst-case sequence of L_i values, we have
f(x^k) − f(x*) = O( (1 − µ₁ / max{L₂^G, L₃^G})^k ) [f(x^0) − f(x*)].
L₂^G is the largest average between neighbours; L₃^G is the largest average over a 3-node path.
This is much faster if the large L_i are not neighbours. A similar result holds for h1, with edges between variables that are non-zero in the same row.
69-71 Rules Depending on Lipschitz Constants
Assume that we know the L_i, or can approximate them. Nesterov [2010] shows that sampling i_k proportional to L_i yields
E[f(x^{k+1})] − f(x*) ≤ (1 − µ/(n L̄)) [f(x^k) − f(x*)],
where L̄ = (1/n) Σ_{i=1}^n L_i. This is faster than uniform sampling when the L_i are distinct, but it could be faster or slower than the GS rule. In the separable quadratic case:
With one large λ_i, Lipschitz sampling is faster.
With one small λ_i, GS is faster.
So which should we use?
72-75 Gauss-Southwell-Lipschitz Rule
We obtain a faster rate by using the L_i in the GS rule,
i_k = argmax_i |∇_i f(x^k)| / √L_i,
which we call the Gauss-Southwell-Lipschitz (GSL) rule. Intuition: if the gradients are similar, we make more progress if L_i is small.
[Figure: 2-D example where the Gauss-Southwell and Gauss-Southwell-Lipschitz rules choose different coordinates.]
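A two-line comparison of the selection rules on a hypothetical gradient (the numbers are chosen so the rules disagree, matching the intuition above):

```python
import numpy as np

g = np.array([1.0, 0.9])          # current coordinate-wise gradients (assumed)
L = np.array([100.0, 1.0])        # coordinate-wise Lipschitz constants (assumed)

i_gs  = int(np.argmax(np.abs(g)))               # GS:  largest |gradient|
i_gsl = int(np.argmax(np.abs(g) / np.sqrt(L)))  # GSL: scaled by 1/sqrt(L_i)
```

GS takes the marginally steeper coordinate 0, while GSL prefers coordinate 1, whose small L_i allows a much larger step and more guaranteed progress.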
76-79 Gauss-Southwell-Lipschitz Rule
The GSL rule obtains a rate of
f(x^{k+1}) − f(x*) ≤ (1 − µ_L) [f(x^k) − f(x*)],
where µ_L is the strong-convexity constant in the norm ‖x‖_L = Σ_{i=1}^n √L_i |x_i|. We have that
max{ µ/(n L̄), µ₁/L } ≤ µ_L ≤ µ₁ / min_i{L_i},
so GSL is at least as fast as both GS and Lipschitz sampling.
GSL with step size 1/L_{i_k} is the optimal myopic coordinate update for quadratics,
x^{k+1} = argmin_{i,α} f(x^k + αe_i).
The analysis gives a tighter bound on the maximum-improvement rule used in certain applications. [Della Pietra et al., 1997, Lee et al., 2006]
80-83 Gauss-Southwell-Lipschitz as Nearest Neighbour
Consider a special case of h1,
min_x h1(x) = Σ_{i=1}^n f(a_iᵀx),
where the GS rule has the form
i_k = argmax_i |a_iᵀ r(x^k)|.
Dhillon et al. [2011] approximate GS as a nearest-neighbour problem,
argmin_i (1/2)‖r(x^k) − a_i‖² = (1/2)‖r(x^k)‖² − a_iᵀ r(x^k) + (1/2)‖a_i‖²
(use both a_i and −a_i to get the absolute value). The approximation is exact if ‖a_i‖ = 1 for all i.
Using L_i = γ‖a_i‖², the exact GSL rule is a nearest-neighbour problem:
argmin_i (1/2)‖r(x^k) − a_i/(√γ‖a_i‖)‖² = (1/2)‖r(x^k)‖² − a_iᵀ r(x^k)/(√γ‖a_i‖) + 1/(2γ).
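A numeric sanity check of this identity (the matrix, residual, and γ are arbitrary assumptions): once every candidate point ±a_i/(√γ‖a_i‖) has the same norm 1/√γ, the nearest neighbour to r is exactly the GSL coordinate.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))        # rows a_i (assumed data)
r = rng.standard_normal(3)             # current residual r(x^k)
gamma = 2.0
norms = np.linalg.norm(A, axis=1)
L = gamma * norms ** 2                 # L_i = gamma * ||a_i||^2

# GSL rule: argmax_i |a_i^T r| / sqrt(L_i)
i_gsl = int(np.argmax(np.abs(A @ r) / np.sqrt(L)))

# Nearest-neighbour search over both signs of each normalized point
P = np.vstack([A / (np.sqrt(gamma) * norms[:, None]),
               -A / (np.sqrt(gamma) * norms[:, None])])
i_nn = int(np.argmin(np.linalg.norm(P - r, axis=1))) % 5
```

Since ‖p‖² = 1/γ for every candidate p, minimizing ‖r − p‖² is the same as maximizing pᵀr, which over both signs recovers argmax_i |a_iᵀr|/√L_i.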
84-86 Approximate Gauss-Southwell
In many applications, we can approximate the GS rule. With multiplicative error,
|∇_{i_k} f(x^k)| ≥ ‖∇f(x^k)‖∞ (1 − ε_k),
we have a fast rate and do not need ε_k → 0:
f(x^{k+1}) − f(x*) ≤ (1 − µ₁(1 − ε_k)²/L) [f(x^k) − f(x*)].
With additive error,
|∇_{i_k} f(x^k)| ≥ ‖∇f(x^k)‖∞ − ε_k,
we have a fast rate if ε_k → 0 fast enough. With constant additive error, we only get a certain solution accuracy.
87-90 Proximal Coordinate Descent
An important application of coordinate descent is to problems of the form
min_{x∈ℝ^n} F(x) ≡ f(x) + Σ_i g_i(x_i),
where f is smooth and the g_i may be non-smooth, e.g. ℓ1-regularization or bound constraints. Here we can apply a proximal-gradient style of update,
x^{k+1} = prox_{(1/L) g_{i_k}} [ x^k − (1/L) ∇_{i_k} f(x^k) e_{i_k} ],
where
prox_{αg}[y] = argmin_{x∈ℝ^n} (1/2)‖x − y‖² + αg(x).
Richtárik and Takáč [2014] show that
E[F(x^{k+1})] − F(x*) ≤ (1 − µ/(Ln)) [F(x^k) − F(x*)],
the same rate as if the non-smooth g_i were not there.
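For g_i(x_i) = λ|x_i| the prox is the soft-threshold operator. A minimal sketch of one proximal coordinate step (plain GS selection and the helper names `soft_threshold`/`prox_cd_step` are illustrative assumptions, not from the talk):

```python
import numpy as np

def soft_threshold(y, t):
    """prox of t*|.| : argmin_x 1/2 (x - y)^2 + t|x|."""
    return np.sign(y) * max(abs(y) - t, 0.0)

def prox_cd_step(x, grad, i, L, lam):
    """One proximal coordinate-descent step on coordinate i."""
    x = x.copy()
    x[i] = soft_threshold(x[i] - grad[i] / L, lam / L)
    return x

# Example: separable quadratic f plus l1, one step on the steepest coordinate.
A = np.diag([2.0, 1.0])
b = np.array([3.0, 0.5])
lam = 1.0
x = np.zeros(2)
g = A @ x - b                           # gradient of the smooth part
L = 2.0
i = int(np.argmax(np.abs(g)))           # plain GS shown for simplicity
x = prox_cd_step(x, g, i, L, lam)
```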
91-94 Proximal Gauss-Southwell
There are several generalizations of GS to this setting:
GS-s: minimize the directional derivative,
i_k = argmax_i { min_{s∈∂g_i} |∇_i f(x^k) + s| }.
Commonly used for ℓ1-regularization, but ‖x^{k+1} − x^k‖ could be tiny.
GS-r: maximize how far we move,
i_k = argmax_i | x_i^k − prox_{(1/L) g_i}[ x_i^k − (1/L) ∇_i f(x^k) ] |.
Effective for bound constraints, but ignores g_i(x_i^{k+1}) − g_i(x_i^k).
GS-q: maximize progress under a quadratic approximation of f,
i_k = argmin_i min_d { f(x^k) + ∇_i f(x^k) d + (L/2) d² + g_i(x_i^k + d) − g_i(x_i^k) }.
Least intuitive, but has the best theoretical properties. Generalizes GSL if you use L_i instead of L.
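The GS-s score has a closed form for ℓ1-regularization, since the subdifferential of λ|x_i| is {λ·sign(x_i)} away from zero and [−λ, λ] at zero. A sketch (`gs_s_scores` and the test values are illustrative assumptions):

```python
import numpy as np

def gs_s_scores(x, grad, lam):
    """|nabla_i f + s| minimized over s in the subdifferential of lam*|x_i|."""
    scores = np.empty_like(grad)
    for i, (xi, gi) in enumerate(zip(x, grad)):
        if xi != 0.0:
            scores[i] = abs(gi + lam * np.sign(xi))   # subgradient is a point
        else:
            scores[i] = max(abs(gi) - lam, 0.0)       # s can cancel up to lam
    return scores

x = np.array([0.0, 2.0, 0.0])
g = np.array([0.3, -1.5, -2.0])
i_k = int(np.argmax(gs_s_scores(x, g, lam=1.0)))
```

Here coordinate 0 gets score 0 (the regularizer absorbs its small gradient), so GS-s correctly skips it even though its raw gradient is non-zero.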
95-98 Proximal Gauss-Southwell Convergence Rate
For the GS-q rule, we show that
F(x^{k+1}) − F(x*) ≤ min{ (1 − µ/(Ln)) [F(x^k) − F(x*)], (1 − µ₁/L) [F(x^k) − F(x*)] + ε_k },
where ε_k → 0 measures the non-linearity of the g_i that are not updated. We conjecture that the above always holds with ε_k = 0.
The above rate does not hold for GS-s or GS-r (even if you change min to max).
But one final time, theory disagrees with practice: all three rules seem to work pretty well, though GS-r works badly if you use the L_i.
99 Experiment 1: Sparse ℓ2-Regularized Least Squares
Least squares with ℓ2-regularization and a very sparse matrix.
[Figure: objective vs. epochs for Cyclic, Random, Lipschitz, GS, and GSL selection.]
100 Experiment 2: Sparse ℓ2-Regularized Logistic
Logistic regression with ℓ2-regularization and a very sparse matrix.
[Figure: objective vs. epochs for each selection rule, with constant step size and with exact coordinate optimization.]
Exact optimization makes a bigger difference than coordinate selection.
101 Experiment 3: Over-Determined Least Squares
Least squares with a dense matrix and nearest-neighbour GS.
[Figure: objective vs. epochs for Cyclic, Random, Lipschitz, exact GS, approximated GS, and approximated GSL.]
Approximate GS is still faster than random sampling.
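The nearest-neighbour approximation used here requires a search structure over the data. As a much simpler stand-in (my own subsampling variant, not the slide's nearest-neighbour scheme), an approximate GS rule can scan only a random subset of coordinates:

```python
import numpy as np

def approx_gs_select(grad, k, rng):
    # Approximate Gauss-Southwell: take the largest |gradient| among k
    # randomly chosen coordinates. k = len(grad) recovers exact GS;
    # small k interpolates toward uniform random selection.
    idx = rng.choice(len(grad), size=k, replace=False)
    return int(idx[np.argmax(np.abs(grad[idx]))])

rng = np.random.default_rng(0)
grad = np.array([0.1, -5.0, 2.0, 0.3])
i = approx_gs_select(grad, 4, rng)  # with k = 4 this is exact GS
```

The general point carries over: any selection rule that is merely correlated with the largest gradient can already beat uniform random sampling.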
102–104 Discussion
GS is not always practical, but even approximate GS rules may outperform random.
We've given a justification for line-search in certain scenarios.
We proposed the GSL rule, and approximate/proximal variants.
The analysis extends to block updates, and could be used for accelerated/parallel methods [Fercoq & Richtárik, 2013], primal-dual methods [Shalev-Shwartz & Zhang, 2013], and without strong convexity [Luo & Tseng, 1993].
More information