Approximate Composite Minimization: Convergence Rates and Examples

1 ISMP Bordeaux
Approximate Composite Minimization: Convergence Rates and Examples
S. Praneeth Karimireddy, Sebastian U. Stich, Martin Jaggi
MLO Lab, EPFL, Switzerland, sebastian.stich@epfl.ch
July 4, 2018
[AISTATS 2018] Adaptive balancing of gradient and update computation times
S. U. Stich (Adaptive) Importance Sampling

2 Outline
1. Theory of Random Descent
2. Coordinate Descent Revisited: Number of iterations, Experiments
3. [...]

3 Convex Optimization

4 Black-Box Optimization
$$\min_{x \in \mathbb{R}^n} f(x)$$
$f : \mathbb{R}^n \to \mathbb{R}$, assume $f \in C^1$.
Oracle access to $f$: $(f(x), \nabla f(x))$, or just $(\nabla_i f(x))$.
Approximate solution $x_\epsilon$: $f(x_\epsilon) - \min_{x \in \mathbb{R}^n} f(x) \le \epsilon$.
$N(f, A, \epsilon)$: number of oracle calls for algorithm $A$ to find an approximate solution.
Worst-case complexity of algorithm $A$ on a class $\mathcal{F}$: $N(A, \epsilon) = \max_{f \in \mathcal{F}} N(f, A, \epsilon)$.

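5 Function Classes
Upper bound, $f \in C^1_L$: $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{1}{2} \|y - x\|_L^2 =: u^L_x(y)$.
Lower bound, $f \in C_\mu$: $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{\mu}{2} \|y - x\|_L^2$.
[Figure: plot of $f(x)$ for each bound; labels $(x_0, f(x_0))$ and $f(x_0) + \langle \nabla f(x_0), t \rangle$.]
Consequences:
$x^\star := \operatorname{argmin}_x f(x)$ is unique.
$f(x) - f(x^\star) \le \tfrac{1}{2\mu} \|\nabla f(x)\|_{L^{-1}}^2$.
$\|x\|_L^2 := \langle x, Lx \rangle$; condition number $\kappa := L/\mu$.
$\mu L \preceq \nabla^2 f(x) \preceq L$.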
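
The gradient-norm consequence on this slide can be recovered from the lower bound by minimizing its right-hand side over $y$. A short derivation, assuming $\mu > 0$ and a symmetric positive definite $L$ (neither assumption is spelled out on the slide):

```latex
\begin{align*}
f(x^\star)
  &\ge f(x) + \langle \nabla f(x), x^\star - x \rangle + \tfrac{\mu}{2} \|x^\star - x\|_L^2 \\
  &\ge \min_{y} \Big\{ f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{\mu}{2} \|y - x\|_L^2 \Big\} \\
  &= f(x) - \tfrac{1}{2\mu} \|\nabla f(x)\|_{L^{-1}}^2 ,
\end{align*}
% the minimum is attained at y - x = -\tfrac{1}{\mu} L^{-1} \nabla f(x)
```

Rearranging gives $f(x) - f(x^\star) \le \tfrac{1}{2\mu} \|\nabla f(x)\|_{L^{-1}}^2$.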

6 General framework: $F(x) := f(x) + \Psi(x)$

7 Setting: $f \in C^1_L$, $f \in C^1_{\mu_f}$ for $\mu_f \ge 0$; $\Psi \in C_{\mu_\Psi}$ for $\mu_\Psi \ge 0$.
Goal: $\min_{x \in \mathbb{R}^n} F(x) := f(x) + \Psi(x)$.
Method 0: $x_0 \in \mathbb{R}^n$, $F(x^\star) = \min_{u \in \mathbb{R}^n} F(x_0 + u)$.
This problem is too hard! Define easier subproblems!

8 Relaxing Accuracy: introduce a relative accuracy parameter $\gamma_k \ge 0$.
[Figure: labels $x_k$, $1$, $\gamma_k$.]
Method 1: $x_0 \in \mathbb{R}^n$,
$$F(x_{k+1}) \le (1 - \gamma_k) \underbrace{F(x_k)}_{\text{old value}} + \gamma_k \underbrace{\min_{u \in \mathbb{R}^n} F(x_k + u)}_{\text{best value}}$$
Verifying the condition requires function evaluations, which might be elusive!

9 Option II: Optimize the upper bound
Upper bound: $F(x_k + u) \le u^L_{x_k}(x_k + u) + \Psi(x_k + u) =: m^L_{x_k}(x_k + u)$.
Note: It suffices to pick $L$ s.t. $F(x_k + u) \le m^L_{x_k}(x_k + u)$, i.e. $f(x_k + u) \le u^L_{x_k}(x_k + u)$ is not required.
Method 2: $x_0 \in \mathbb{R}^n$,
$$F(x_{k+1}) \le m^L_{x_k}(x_{k+1}) \le (1 - \gamma_k) \underbrace{F(x_k)}_{\text{old value}} + \gamma_k \underbrace{\min_{u \in \mathbb{R}^n} m^L_{x_k}(x_k + u)}_{\text{best value}}$$
This requires computation of $\nabla f(x_k)$ (to evaluate $u^L_{x_k}(x_k + u)$) and $\Psi(x_k + u)$ instead of $F(x_k + u)$.

10 Complexity Results
$$N(\text{Method 2}, \epsilon) = \min\Big\{ \underbrace{\frac{2D^2}{\bar\gamma\,\epsilon}}_{\text{smooth case}},\ \underbrace{\frac{1 + \mu_\Psi}{(\mu_f + \mu_\Psi)\,\bar\gamma} \ln\frac{F(x_0) - F(x^\star)}{\epsilon}}_{\text{strongly convex case}} \Big\}$$
$D = \max_y \{ \|y - x^\star\|_L : F(y) \le F(x_0) \}$, average accuracy $\bar\gamma = \frac{1}{k} \sum_{i=1}^{k} \gamma_i$.
Both $\mu_f$ and $\mu_\Psi$ appear in the rate, thus $L$ should ideally approximate the curvature of both $f$ and $\Psi$.
Extends and unifies [Stich et al. 13], [Qu et al. 15], [Tappenden et al. 16], [Mutny, Richtárik, 18], [...]
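
To get a feel for the two regimes in this bound, the expression can be evaluated numerically. The following is a minimal sketch, not part of the talk: the function name and argument names are hypothetical, `gap0` stands for $F(x_0) - F(x^\star)$, and the strongly convex term is treated as infinite when $\mu_f + \mu_\Psi = 0$.

```python
import math

def method2_iteration_bound(D, gamma_bar, eps, mu_f, mu_psi, gap0):
    """Evaluate min{smooth-case term, strongly-convex-case term} of the bound."""
    smooth = 2.0 * D ** 2 / (gamma_bar * eps)
    if mu_f + mu_psi > 0:
        # log term is clamped at 0: if gap0 <= eps the start is already eps-accurate
        log_term = max(math.log(gap0 / eps), 0.0)
        strongly_convex = (1.0 + mu_psi) / ((mu_f + mu_psi) * gamma_bar) * log_term
    else:
        strongly_convex = math.inf  # no strong convexity: only the smooth term applies
    return min(smooth, strongly_convex)

# Halving the average accuracy gamma_bar doubles both terms.
print(method2_iteration_bound(D=1.0, gamma_bar=0.5, eps=0.01, mu_f=0.0, mu_psi=0.0, gap0=1.0))
print(method2_iteration_bound(D=1.0, gamma_bar=0.5, eps=0.01, mu_f=0.1, mu_psi=0.0, gap0=1.0))
```

The second call returns the logarithmic term, which is the smaller of the two for these illustrative values.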

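12 Gradient Method
$L = L I_n$:
$$x_{k+1} = \operatorname{argmin}_{u \in \mathbb{R}^n} \Big\{ f(x_k) + \langle \nabla f(x_k), u - x_k \rangle + \tfrac{L}{2} \|u - x_k\|^2 + \Psi(x_k + u) \Big\}$$
$L^{-1}$ hard to compute: use an (arbitrary) iterative method to solve
$$x_{k+1} =_{\gamma_k} \operatorname{argmin}_{u \in \mathbb{R}^n} \Big\{ f(x_k) + \langle \nabla f(x_k), u - x_k \rangle + \tfrac{1}{2} \|u - x_k\|_L^2 + \Psi(x_k + u) \Big\}$$
(here $=_{\gamma_k}$ marks a $\gamma_k$-approximate minimizer).
Requires full $\nabla f(x_k)$, which might be elusive for $n \gg 1$.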
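
For the scalar case $L = L I_n$ the model can sometimes be minimized in closed form. The sketch below assumes $\Psi = \lambda \|\cdot\|_1$ evaluated at the new point (a choice made here for illustration, not taken from the talk), in which case the exact minimizer ($\gamma_k = 1$) is the familiar soft-thresholding step. The names `soft_threshold`, `prox_gradient_step` and `lam` are hypothetical.

```python
def soft_threshold(v, tau):
    """Scalar minimizer of 0.5 * (y - v)**2 + tau * |y| over y."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0

def prox_gradient_step(x, grad, L, lam):
    """Exact minimizer of <grad, y - x> + (L/2) * ||y - x||^2 + lam * ||y||_1 over y."""
    return [soft_threshold(xi - gi / L, lam / L) for xi, gi in zip(x, grad)]

# Illustrative data: f(x) = 0.5 * ||x - b||^2 with b = [3.0, 0.5], so grad f(0) = -b and L = 1.
x_next = prox_gradient_step([0.0, 0.0], [-3.0, -0.5], L=1.0, lam=1.0)
print(x_next)
```

For a general matrix $L$ or a non-separable $\Psi$ no such formula is available, which is where an inner iterative solver and an accuracy $\gamma_k < 1$ come in.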

13 Random Methods: $T(\text{iteration}) \ll T(\nabla f)$

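14 Random Pursuit
Choose a (random) sketch matrix $U_k \in \mathbb{R}^{n \times m}$, $\operatorname{rank} U_k = m$.
$$x_{k+1} =_{\gamma_k} \operatorname{argmin}_{u \in \operatorname{span} U_k} \Big\{ f(x_k) + \langle \nabla f(x_k), u - x_k \rangle + \tfrac{1}{2} \|u - x_k\|_L^2 + \Psi(x_k + u) \Big\} =: \operatorname{argmin}_{u \in \mathbb{R}^n} m^L_{x_k, U_k}(x_k + u)$$
($\gamma_k$-approximate minimizer.) Requires computation of $U_k^\top \nabla f(x_k)$.
Example: (Block-)Coordinate Descent, $U_k^\top U_k = I_m$, $U_k^\top \nabla f(x_k) = [\nabla_{i_1} f(x_k), \dots, \nabla_{i_m} f(x_k)]$.
Assume $\Psi$ has (block-)separable structure: $\Psi(x_k + U_k z) = \Psi_{U_k^\perp}(x_k) + \Psi_{U_k}(x_k + U_k z)$, where the first term collects the blocks not touched by $U_k$.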
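
The block-coordinate example can be sketched on a small quadratic. Everything specific here is an assumption for illustration: $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$, $\Psi = 0$, blocks of $m$ distinct coordinates drawn uniformly, and the diagonal choice $L = m \cdot \operatorname{diag}(A)$, which upper-bounds the curvature of $f$ on any block of $m$ coordinates when $A$ is positive semidefinite. The subproblem is solved exactly ($\gamma_k = 1$).

```python
import random

def coordinate_gradient(A, b, x, i):
    """i-th partial derivative of f(x) = 0.5 * x^T A x - b^T x."""
    return sum(A[i][j] * x[j] for j in range(len(x))) - b[i]

def objective(A, b, x):
    n = len(x)
    quad = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(n))

def block_cd_step(A, b, x, block):
    """Exactly minimize the model with L = m * diag(A) over the coordinates in `block`."""
    m = len(block)
    # Only the m partial derivatives in the block are computed, all at the current x.
    grads = {i: coordinate_gradient(A, b, x, i) for i in block}
    x_new = list(x)
    for i in block:
        x_new[i] = x[i] - grads[i] / (m * A[i][i])
    return x_new

A = [[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 1.0, 1.0]
rng = random.Random(0)          # fixed seed so the run is reproducible
x = [0.0, 0.0, 0.0]
history = [objective(A, b, x)]
for _ in range(300):
    block = rng.sample(range(3), 2)   # m = 2 distinct coordinates
    x = block_cd_step(A, b, x, block)
    history.append(objective(A, b, x))
print(x, history[-1])
```

Because the model upper-bounds $f$ along the block, the objective values in `history` never increase; the minimizer of this toy problem is $[0.5, 0, 0.5]$ with value $-0.5$.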

15 How does this fit in the framework?
Option I: The $\gamma_k$-approximate solution on $m^L_{x_k, U_k}$ can be seen as a $\tilde\gamma_k$-approximate solution on $m^L_{x_k}$, $\tilde\gamma_k \le \gamma_k$.
Option II: For certain distributions of $U_k$ the results can be explicitly stated.
Example: If $\mathbb{E}[U_k U_k^\top] = \tfrac{m}{n} I_n$, the complexity increases by a factor of $\tfrac{n}{m}$.
[KSJ18] also states explicit results for parallel updates.
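
The condition in the example holds, for instance, when $U_k$ stacks $m$ distinct coordinate vectors chosen uniformly at random (an assumption made here; the slide does not fix the distribution). Then $U_k U_k^\top$ is diagonal with ones at the chosen indices, and the expectation can be checked exactly by enumerating all subsets. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def expected_projection_diagonal(n, m):
    """Diagonal of E[U U^T] when U selects m of n coordinates uniformly at random."""
    subsets = list(combinations(range(n), m))
    # Entry i is the fraction of subsets that contain coordinate i.
    return [Fraction(sum(1 for s in subsets if i in s), len(subsets)) for i in range(n)]

diag = expected_projection_diagonal(5, 2)
print(diag)
```

Each diagonal entry comes out as $m/n$, so on average an iteration sees a $\tfrac{m}{n}$ fraction of the full model, matching the $\tfrac{n}{m}$ slowdown stated above.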

16 Application: (Block) Coordinate Descent, $[\nabla_{i_1} f(x), \dots, \nabla_{i_m} f(x)]$

17 Block Coordinate Descent: sketch matrix $U_k$, $U_k^\top U_k = I_m$.
$m$ ($\le n$)-dimensional subproblem:
$$m^L_{x_k}(x_{k+1}) \le (1 - \gamma_k) F(x_k) + \gamma_k \min_{u \in \operatorname{span} U_k} m^L_{x_k}(x_k + u)$$
Iteration comprises two steps:
1. Compute $U_k^\top \nabla f(x_k) = [\nabla_{i_1} f(x_k), \dots, \nabla_{i_m} f(x_k)]$.
2. Compute approx. minimizer $x_{k+1}$.
Two different costs!
Step 1: cache misses, expensive operations to compute $\nabla_i f(x_k)$.
Step 2: hard subproblem, e.g. $\Psi(x) = \|x\|_{TV}$.

18 Towards optimal balancing
Goal: Set accuracy $\gamma_k$ to optimally balance the costs.
Feasible strategy: Suppose we use an iterative algorithm to approximately solve the model $m^L_{x_k}$, i.e. $x_k = y_0, y_1, \dots, y_t, \dots, y_T = x_{k+1}$.
Each intermediate solution $y_t$ corresponds to an approximate solution with parameter $\gamma_k^t$:
$$m^L_{x_k}(y_t) \le (1 - \gamma_k^t) F(x_k) + \gamma_k^t \min_{u \in \operatorname{span} U_k} m^L_{x_k}(x_k + u)$$
Thus we can express the progress in terms of iterations on the model:
$$p_k(t) := \underbrace{\gamma_k^t}_{\text{relative progress}} \underbrace{\Big( F(x_k) - \min_{u \in \operatorname{span} U_k} m^L_{x_k}(x_k + u) \Big)}_{\text{scale}}$$
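
Rearranging the inequality for $y_t$ gives the largest accuracy parameter an intermediate iterate can claim: $\gamma_k^t = \big(F(x_k) - m^L_{x_k}(y_t)\big) / \big(F(x_k) - \min_u m^L_{x_k}(x_k + u)\big)$. A minimal sketch of that computation, with hypothetical names and under the assumption that the model minimum is known (in practice it usually is not, which is why the talk works with $p_k$ instead):

```python
def relative_accuracy(F_xk, model_at_y, model_min):
    """Largest gamma with model_at_y <= (1 - gamma) * F_xk + gamma * model_min."""
    scale = F_xk - model_min
    if scale <= 0:
        return 1.0  # x_k already minimizes the model; any iterate is fully accurate
    return (F_xk - model_at_y) / scale

# Illustrative numbers: F(x_k) = 10, model minimum 6, inner iterate reaches model value 7.
gamma = relative_accuracy(10.0, 7.0, 6.0)
progress = gamma * (10.0 - 6.0)   # p_k(t) = gamma * scale = F(x_k) - m(y_t)
print(gamma, progress)
```

With equality in the definition, the progress $p_k(t)$ is simply $F(x_k) - m^L_{x_k}(y_t)$, a quantity that can be measured without knowing the model minimum.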

19 Optimal balancing
Each (inner) iteration ($y_t \to y_{t+1}$) denotes one unit of time.
Let $c_k = T(U_k^\top \nabla f(x_k))$ denote the time to compute the required coordinates of the gradient.
Optimal number of inner iterations $t^\star$:
$$t_k^\star := \operatorname{argmax}_t \underbrace{\frac{p_k(t)}{t + c_k}}_{\text{progress per time spent}}$$
Can we compute $t_k^\star$? Can we compute approximate $t_k^\star$?
Predict $t_k$ based on observations at iteration $k - 1$. Goal: $t_k \approx t_k^\star$.
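
If the whole progress curve $p_k(t)$ were known, $t_k^\star$ could be found by direct search. The sketch below does this for a hypothetical geometric curve $p(t) = 1 - 0.6^t$ (an assumption chosen for illustration, not data from the talk) and shows how the optimum moves as the gradient cost $c_k$ grows.

```python
def best_inner_iterations(progress, c_k):
    """Return argmax over t = 1..len(progress) of progress[t - 1] / (t + c_k)."""
    best_t, best_rate = 1, progress[0] / (1 + c_k)
    for t, p in enumerate(progress, start=1):
        rate = p / (t + c_k)          # progress per unit of time spent
        if rate > best_rate:
            best_t, best_rate = t, rate
    return best_t

toy_progress = [1.0 - 0.6 ** t for t in range(1, 51)]   # p(1), ..., p(50)
for c_k in (0.0, 4.0, 20.0):
    print(c_k, best_inner_iterations(toy_progress, c_k))
```

On this toy curve, a free gradient favors a single inner iteration, while more expensive gradients justify more inner iterations per outer step.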

20 (Constant) strategies for approximating $t_k^\star$
Constant strategies:
one: Set $t_k \equiv 1$, $\forall k \ge 0$ (arbitrary constant).
comp: $t_k = c_k$, balance time (best practice).
Can we measure $p_k(t_k)$? $p_k(t_k) = F(x_k) - m^L_{x_k}(x_{k+1})$.
And $p_k'(t_k)$?
$$\frac{d}{dt} \underbrace{\frac{p_k(t)}{t + c_k}}_{\text{progress per time spent}} \Big|_{t = t_k} = \frac{p_k'(t_k)(t_k + c_k) - p_k(t_k)}{(t_k + c_k)^2} =: g_k(t_k)$$
Can be estimated using $p_k'(t_k) \approx p_k(t_k) - p_k(t_k - 1)$.
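
The estimate of $g_k(t_k)$ needs only two measured progress values and the gradient cost. A minimal sketch with a hypothetical function name; the numbers in the usage lines come from an assumed toy curve $p(t) = 1 - 0.6^t$, not from the talk's experiments.

```python
def estimate_g(p_t, p_t_minus_1, t_k, c_k):
    """Quotient-rule derivative of p(t) / (t + c_k) at t_k, with p'(t_k)
    replaced by the backward difference p(t_k) - p(t_k - 1)."""
    p_prime = p_t - p_t_minus_1
    return (p_prime * (t_k + c_k) - p_t) / (t_k + c_k) ** 2

# Toy values p(2) = 0.64, p(3) = 0.784, p(5) = 0.92224, p(6) = 0.953344, with c_k = 4.
g_early = estimate_g(0.784, 0.64, t_k=3, c_k=4.0)
g_late = estimate_g(0.953344, 0.92224, t_k=6, c_k=4.0)
print(g_early, g_late)
```

A positive value suggests that more inner iterations would still raise the progress per time spent; a negative value suggests the inner solver ran too long.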

21 (Adaptive) strategies for approximating $t_k^\star$
General adaptive strategy: Initialize $t_0 := 1$. Each iteration:
1. Estimate $g_k(t_k)$.
2. $t_{k+1} = A(t_k, g_k)$.
Example strategy:
add: $t_{k+1} = t_k + 1$ if $g_k(t_k) > 0$, $\max\{t_k - 1, 1\}$ otherwise.
Two more adaptive strategies:
mult: $t_{k+1} = 2 t_k$ if $g_k(t_k) > 0$, $\max\{t_k - 1, 1\}$ else.
grad: $t_{k+1} = t_k + g_k$.
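
The add and mult update rules translate directly into code. The sketch below also runs add on an assumed toy progress curve $p(t) = 1 - 0.6^t$ with gradient cost $c_k = 20$, both hypothetical choices for illustration; the grad rule is left out because the slide does not specify how a non-integer $t_{k+1}$ should be rounded.

```python
def next_t_add(t_k, g_k):
    """add: grow by one while the estimated rate derivative is positive."""
    return t_k + 1 if g_k > 0 else max(t_k - 1, 1)

def next_t_mult(t_k, g_k):
    """mult: double while the estimated rate derivative is positive."""
    return 2 * t_k if g_k > 0 else max(t_k - 1, 1)

def estimate_g(p, t_k, c_k):
    """Derivative of p(t) / (t + c_k) at t_k using a backward difference for p'."""
    p_t = p(t_k)
    p_prime = p_t - p(t_k - 1)
    return (p_prime * (t_k + c_k) - p_t) / (t_k + c_k) ** 2

def toy_progress(t):
    return 1.0 - 0.6 ** t   # assumed progress curve, identical for every outer iteration

c_k = 20.0
ts = [1]                     # t_0 := 1
for _ in range(10):
    g = estimate_g(toy_progress, ts[-1], c_k)
    ts.append(next_t_add(ts[-1], g))
print(ts)
```

On this stationary toy problem the add rule climbs from 1 and then alternates between 5 and 6, next to the maximizer of $p(t)/(t + c_k)$, which is $t = 5$ here. In the real method the curve changes with $k$, so the rule keeps tracking a moving target.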

22 Experiments

23 Improvement in Time
[Figure: time relative to constant strategy one.]
- adaptivity is important!
- best practice comp is not performing well!
- precise rule less important

24 A closer look
[Figure, panel "mult strategy": (simulated) $t_k$ values vs. iteration $k$.]
[Figure, panel "Problem difficulty": $t_k$ required to reach accuracy $\gamma_k = 0.1$ vs. iteration $k$.]
Less time is spent as the subproblems get harder.

26 Contributions & Open Problem
Theory: We present a sound theoretical framework where
- subproblems need to be solved just approximately (with an arbitrary iterative solver)
- convergence rate depends on average quality of approximation
- parallel, distributed and primal-dual extensions
Practice: We observe that adaptivity is important; not all subproblems are the same!
Open problem: Proof for adaptive schemes missing! The proofs do not extend to the fully adaptive setting, i.e. when $\gamma_k$ depends on $x_{k+1}$ (as is the case for the adaptive strategies).

27 References
[Stich et al. 13] S. U. Stich, C. L. Müller, B. Gärtner. Optimization of convex functions with Random Pursuit. SIAM J. Opt.
[Qu et al. 15] Z. Qu, P. Richtárik, M. Takac, O. Fercoq. SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization.
[Tappenden et al. 16] R. Tappenden, P. Richtárik, J. Gondzio. Inexact Coordinate Descent: Complexity and Preconditioning. J. Opt. T&A.
[Mutny, Richtárik, 18] M. Mutny, P. Richtárik. Parallel Stochastic Newton Method. J. Comp. Math.
[KSJ18] S. P. R. Karimireddy, S. U. Stich, M. Jaggi. Adaptive balancing of gradient and update computation times using global geometry and approximate subproblems. PMLR 84.
