Accelerated Stochastic Gradient Descent Praneeth Netrapalli MSR India

1 Accelerated Stochastic Gradient Descent. Praneeth Netrapalli, MSR India. Presented at OSL workshop, Les Houches, France. Joint work with Prateek Jain, Sham M. Kakade, Rahul Kidambi, and Aaron Sidford.

2 Linear regression: $\min_x f(x) = \|Ax - b\|_2^2$, with $x \in \mathbb{R}^d$, $A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^n$. A basic problem that arises pervasively in applications. Deeply studied in the literature.

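3 Gradient descent for linear regression: $x_{t+1} = x_t - \delta \cdot A^\top (A x_t - b)$, where $A^\top (A x_t - b)$ is the gradient term. Convergence rate: $O\left(\kappa \log \frac{f(x_0) - f^*}{\epsilon}\right)$, where $f^* = \min_x f(x)$ and $\epsilon$ is the target suboptimality. Condition number: $\kappa = \frac{\sigma_{\max}(A^\top A)}{\sigma_{\min}(A^\top A)}$.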
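The update on this slide can be sketched in a few lines of plain Python. This is an illustration only, not code from the talk: the matrix `A`, the vector `b`, the step size, and the iteration count are made-up values, chosen so that the least-squares solution is $(1, 1)$.

```python
def gradient_descent(A, b, step, iters):
    """Run x_{t+1} = x_t - step * A^T (A x_t - b), starting from x_0 = 0."""
    n, d = len(A), len(A[0])
    x = [0.0] * d
    for _ in range(iters):
        # Residual r = A x - b.
        r = [sum(A[i][j] * x[j] for j in range(d)) - b[i] for i in range(n)]
        # Gradient direction g = A^T r (any constant factor is folded into the step).
        g = [sum(A[i][j] * r[i] for i in range(n)) for j in range(d)]
        x = [x[j] - step * g[j] for j in range(d)]
    return x

# Hypothetical toy data: b = A @ (1, 1), so the minimizer is (1, 1).
A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [1.0, 2.0, 2.0]
x_gd = gradient_descent(A, b, step=0.1, iters=500)
print(x_gd)
```

The step size must stay below $2 / \sigma_{\max}(A^\top A)$ for the iteration to converge; the value 0.1 satisfies this for the toy matrix above.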

4 Question: Is it possible to do better? Hope: GD does not reuse past gradients. Answer: Yes! Gradient descent: $x_{t+1} = x_t - \delta \nabla f(x_t)$. Methods that do better: conjugate gradient (Hestenes and Stiefel 1952), heavy ball method (Polyak 1964), accelerated gradient descent (Nemirovsky and Yudin 1977, Nesterov 1983).

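5 Accelerated gradient descent (AGD): $x_{t+1} = y_t - \delta \nabla f(y_t)$, $y_{t+1} = x_{t+1} + \gamma (x_{t+1} - x_t)$. Convergence rate: $O\left(\sqrt{\kappa} \log \frac{f(x_0) - f^*}{\epsilon}\right)$, where $f^* = \min_x f(x)$ and $\epsilon$ is the target suboptimality. Condition number: $\kappa = \frac{\sigma_{\max}(A^\top A)}{\sigma_{\min}(A^\top A)}$.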
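To make the gap between the two rates concrete, the sketch below counts iterations of GD and AGD on a small diagonal quadratic. It is an assumption-laden illustration: the quadratic `h`, the starting point, and the tolerance are invented, and the choices $\delta = 1/L$ and $\gamma = (\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$ are a standard textbook setting rather than values given in the talk.

```python
import math

def f(h, x):
    """Diagonal quadratic f(x) = 0.5 * sum_j h_j x_j^2, minimized at x = 0."""
    return 0.5 * sum(hj * xj * xj for hj, xj in zip(h, x))

def grad(h, x):
    return [hj * xj for hj, xj in zip(h, x)]

def gd_iterations(h, x0, tol):
    """Count GD steps (step size 1/L) until f(x) <= tol."""
    L = max(h)
    x, t = list(x0), 0
    while f(h, x) > tol:
        x = [xj - gj / L for xj, gj in zip(x, grad(h, x))]
        t += 1
    return t

def agd_iterations(h, x0, tol):
    """Count AGD steps until f(x) <= tol, with delta = 1/L and
    gamma = (sqrt(kappa) - 1) / (sqrt(kappa) + 1) (assumed textbook choice)."""
    L, mu = max(h), min(h)
    s = math.sqrt(L / mu)
    delta, gamma = 1.0 / L, (s - 1.0) / (s + 1.0)
    x, y, t = list(x0), list(x0), 0
    while f(h, x) > tol:
        x_new = [yj - delta * gj for yj, gj in zip(y, grad(h, y))]
        y = [xn + gamma * (xn - xo) for xn, xo in zip(x_new, x)]
        x, t = x_new, t + 1
    return t

h = [1.0, 10.0, 100.0]          # hypothetical spectrum, kappa = 100
x0 = [1.0, 1.0, 1.0]
tol = 1e-6 * f(h, x0)
gd_steps = gd_iterations(h, x0, tol)
agd_steps = agd_iterations(h, x0, tol)
print(gd_steps, agd_steps)
```

With $\kappa = 100$ the AGD count should be several times smaller than the GD count, in line with $\sqrt{\kappa}$ versus $\kappa$.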

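6 Accelerated gradient descent (AGD). Convergence rate: $O\left(\sqrt{\kappa} \log \frac{f(x_0) - f^*}{\epsilon}\right)$, compared to $O\left(\kappa \log \frac{f(x_0) - f^*}{\epsilon}\right)$ for GD. Here $f^* = \min_x f(x)$ and $\epsilon$ is the target suboptimality. Condition number: $\kappa = \frac{\sigma_{\max}(A^\top A)}{\sigma_{\min}(A^\top A)}$.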

8 Stochastic approximation (Robbins and Monro 1951). Distribution $D$ on $\mathbb{R}^d \times \mathbb{R}$. $f(x) \triangleq \mathbb{E}_{(a,b) \sim D}\left[(a^\top x - b)^2\right]$. Equivalent to $\|Ax - b\|_2^2$ where $A$ has infinitely many rows. Observe $n$ pairs $(a_1, b_1), \dots, (a_n, b_n)$. Interested in the entire distribution $D$ rather than the data points, as in ML. Fit a linear model to the distribution. Exact gradients cannot be computed.

9 Stochastic gradient descent (SGD) (Robbins and Monro 1951): $x_{t+1} = x_t - \delta \widehat{\nabla} f(x_t)$, where $\mathbb{E}[\widehat{\nabla} f(x_t)] = \nabla f(x_t)$. Return $\frac{1}{n} \sum_i x_i$ (Polyak and Juditsky 1992). SGD is gradient descent in expectation. For linear regression, SGD is $x_{t+1} = x_t - \delta (a_t^\top x_t - b_t) a_t$. Streaming algorithm: extremely efficient and widely used in practice.
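A minimal sketch of the streaming update with iterate averaging, in plain Python. The data model (`a` uniform on $[-1, 1]^d$, small Gaussian noise), the step size, and the helper names `sample_stream` and `sgd_linear_regression` are assumptions made for illustration; none of them comes from the talk.

```python
import random

def sgd_linear_regression(stream, d, step):
    """One pass of x_{t+1} = x_t - step * (a_t . x_t - b_t) * a_t over a stream
    of (a, b) pairs. Returns the last iterate and the average of the iterates."""
    x = [0.0] * d
    running_sum = [0.0] * d
    n = 0
    for a, b in stream:
        residual = sum(aj * xj for aj, xj in zip(a, x)) - b
        x = [xj - step * residual * aj for xj, aj in zip(x, a)]
        running_sum = [sj + xj for sj, xj in zip(running_sum, x)]
        n += 1
    return x, [sj / n for sj in running_sum]

def sample_stream(x_star, n, noise_std, rng):
    """Hypothetical data model: a uniform on [-1, 1]^d, b = a . x_star + noise."""
    for _ in range(n):
        a = [rng.uniform(-1.0, 1.0) for _ in x_star]
        clean = sum(aj * xj for aj, xj in zip(a, x_star))
        yield a, clean + rng.gauss(0.0, noise_std)

rng = random.Random(0)
x_star = [1.0, -2.0]
x_last, x_avg = sgd_linear_regression(
    sample_stream(x_star, 5000, 0.1, rng), d=2, step=0.5)
avg_error = max(abs(u - v) for u, v in zip(x_avg, x_star))
print(avg_error)
```

Each sample is touched once and then discarded, which is what makes the method a streaming algorithm: the number of iterations equals the number of samples.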

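10 Best possible rate. Consider $b = a^\top x^* + \text{noise}$, with noise $\sim N(0, \sigma^2)$. Recall $f(x) \triangleq \mathbb{E}_{(a,b) \sim D}\left[(a^\top x - b)^2\right]$. Let $\hat{x} \triangleq \arg\min_x \sum_{i=1}^n (a_i^\top x - b_i)^2$. Then $\mathbb{E}[f(\hat{x})] - f(x^*) = (1 + o(1)) \frac{\sigma^2 d}{n}$ (van der Vaart, 2000).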
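The $\sigma^2 d / n$ rate can be checked numerically in the simplest case $d = 1$. The following Monte Carlo sketch assumes $a \sim N(0, 1)$ and arbitrary values for `n`, `sigma`, and the number of trials; it is a sanity check under those assumptions, not a result from the talk.

```python
import random

def excess_risk_1d(n, sigma, x_star, rng):
    """Draw n samples with a ~ N(0, 1) and b = a * x_star + N(0, sigma^2) noise,
    fit least squares in one dimension, and return f(x_hat) - f(x_star)."""
    sum_ab = sum_aa = 0.0
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)
        b = a * x_star + rng.gauss(0.0, sigma)
        sum_ab += a * b
        sum_aa += a * a
    x_hat = sum_ab / sum_aa
    # With E[a^2] = 1, f(x) - f(x_star) = (x - x_star)^2.
    return (x_hat - x_star) ** 2

rng = random.Random(0)
n, sigma, trials = 200, 1.0, 2000
mean_excess = sum(excess_risk_1d(n, sigma, 3.0, rng) for _ in range(trials)) / trials
predicted = sigma ** 2 * 1 / n   # sigma^2 * d / n with d = 1
print(mean_excess, predicted)
```

The empirical mean should land close to the predicted value, up to Monte Carlo error and the $o(1)$ term.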

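11 Best possible rate. In general: $x^* \triangleq \arg\min_x f(x)$, with noise level $\sigma^2$ given by $\mathbb{E}\left[(a^\top x^* - b)^2 a a^\top\right] \preceq \sigma^2 \mathbb{E}[a a^\top]$. For $\hat{x} \triangleq \arg\min_x \sum_{i=1}^n (a_i^\top x - b_i)^2$: $\mathbb{E}[f(\hat{x})] - f(x^*) \le (1 + o(1)) \frac{\sigma^2 d}{n}$. Equivalently, $n \le (1 + o(1)) \frac{\sigma^2 d}{\epsilon}$ samples for target suboptimality $\epsilon$ (van der Vaart, 2000).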

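12 Convergence rate of SGD: $O\left(\kappa \log \frac{f(x_0) - f^*}{\epsilon} + \frac{\sigma^2 d}{\epsilon}\right)$ (Jain et al. 2016), where $f^* = \min_x f(x)$ and $\epsilon$ is the target suboptimality. Condition number: $\kappa \triangleq \frac{\max \|a\|_2^2}{\sigma_{\min}(\mathbb{E}[a a^\top])}$. Noise level: $\mathbb{E}\left[(a^\top x^* - b)^2 a a^\top\right] \preceq \sigma^2 \mathbb{E}[a a^\top]$.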

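13 Recap.
Deterministic case: GD $O\left(\kappa \log \frac{f(x_0) - f^*}{\epsilon}\right)$; AGD $O\left(\sqrt{\kappa} \log \frac{f(x_0) - f^*}{\epsilon}\right)$.
Stochastic approximation: SGD $O\left(\kappa \log \frac{f(x_0) - f^*}{\epsilon} + \frac{\sigma^2 d}{\epsilon}\right)$; accelerated SGD: unknown.
Question: Is accelerating SGD possible?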

14 Is this really important? Extremely important in practice. As we saw, acceleration can give orders of magnitude improvement. Neural network training uses Nesterov's AGD as well as Adam, but there is no theoretical understanding of why they work. Jain et al. shows that acceleration leads to more parallelizability. Existing results show AGD is not robust to deterministic noise (d'Aspremont 2008, Devolder et al. 2014) but is robust to random additive noise (Ghadimi and Lan 2010, Dieuleveut et al. 2016). Stochastic approximation falls between these two cases. Key issue: it mixes optimization and statistics (i.e., # iterations = # samples).

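15 Is acceleration possible? Take $b = a^\top x^*$, so the noise level is $\sigma^2 = 0$. SGD convergence rate: $O\left(\kappa \log \frac{f(x_0) - f^*}{\epsilon}\right)$. Accelerated rate: $O\left(\sqrt{\kappa} \log \frac{f(x_0) - f^*}{\epsilon}\right)$?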

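16 Example I: Discrete distribution. $a = e_i$ (the $i$-th standard basis vector) with probability $p_i$. In this case, $\kappa \triangleq \frac{\max \|a\|_2^2}{\sigma_{\min}(\mathbb{E}[a a^\top])} = \frac{1}{p_{\min}}$. Is $O\left(\sqrt{\kappa} \log \frac{f(x_0) - f^*}{\epsilon}\right)$ possible? Or, can we halve the error using $O(\sqrt{\kappa})$ samples?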

17 Example I: Discrete distribution. $a = e_i$ with probability $p_i$; $\kappa = \frac{1}{p_{\min}}$. Fewer than $\kappa$ samples do not observe the $p_{\min}$ direction, so $\sum_i a_i a_i^\top$ is not invertible. Cannot do better than $O(\kappa)$. Acceleration not possible.
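The claim on this slide can be worked out explicitly. The derivation below assumes the reading $a = e_i$ (standard basis vectors), which is what the stated value $\kappa = 1/p_{\min}$ implies; $i_{\min}$ denotes the index of the least likely direction and is notation introduced here.

```latex
\begin{align*}
\mathbb{E}[a a^\top] &= \sum_{i=1}^{d} p_i\, e_i e_i^\top = \mathrm{diag}(p_1, \dots, p_d),
\qquad \|a\|_2^2 = 1, \\
\kappa &= \frac{\max \|a\|_2^2}{\sigma_{\min}(\mathbb{E}[a a^\top])} = \frac{1}{p_{\min}}, \\
\Pr[\text{direction } i_{\min} \text{ unseen in } n \text{ samples}]
&= (1 - p_{\min})^n \approx e^{-n/\kappa}.
\end{align*}
```

So with $n = c\,\kappa$ samples for a small constant $c$, the rare direction is missed with probability roughly $e^{-c}$, and no algorithm can recover the corresponding coordinate of $x^*$ from data that never shows it.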

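18 Example II: Gaussian. $a \sim N(0, H)$, where $H$ is a PSD matrix. In this case, $\kappa \sim \frac{\mathrm{Tr}(H)}{\sigma_{\min}(H)} \ge d$. However, after $O(d)$ samples, $\frac{1}{n} \sum_i a_i a_i^\top \approx H$. It is possible to solve $a_i^\top x^* = b_i$ after $O(d)$ samples. Acceleration might be possible.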

19 Discrete vs Gaussian. [Figure with two panels: Discrete distribution; Gaussian distribution.]

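20 Key issue: matrix spectral concentration. Recall $a_i \sim D$. Let $H \triangleq \mathbb{E}[a_i a_i^\top]$. For $\hat{x} \triangleq \arg\min_x \sum_{i=1}^n (a_i^\top x - b_i)^2$ to be good, we need $(1 - \delta) H \preceq \frac{1}{n} \sum_{i=1}^n a_i a_i^\top \preceq (1 + \delta) H$. How many samples are required for spectral concentration?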
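For the discrete example the spectral sandwich is easy to test numerically, because the empirical second-moment matrix is diagonal. The sketch below assumes $a = e_i$ with probability $p_i$; the probabilities, sample sizes, and the function name `spectrally_concentrated` are illustrative choices, not part of the talk.

```python
import random

def spectrally_concentrated(p, n, delta, rng):
    """For a = e_i with probability p[i], (1/n) sum_i a_i a_i^T is diagonal with
    entries count_i / n, so (1 - delta) H <= empirical <= (1 + delta) H can be
    checked one coordinate at a time."""
    counts = [0] * len(p)
    for i in rng.choices(range(len(p)), weights=p, k=n):
        counts[i] += 1
    return all((1 - delta) * pi <= c / n <= (1 + delta) * pi
               for pi, c in zip(p, counts))

rng = random.Random(0)
p = [0.5, 0.3, 0.2]                      # hypothetical probabilities
few = spectrally_concentrated(p, n=2, delta=0.1, rng=rng)
many = spectrally_concentrated(p, n=20000, delta=0.1, rng=rng)
print(few, many)
```

With only two samples at least one of the three directions is never observed, so the lower bound must fail; with many samples every empirical frequency falls within the $\delta$ band with overwhelming probability.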

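21 Separating optimization and statistics. Matrix variance (Tropp 2012): $\left\| \mathbb{E}\left[\|a\|_2^2\, a a^\top\right] \right\|_2$. Recall $H \triangleq \mathbb{E}[a a^\top]$. Statistical condition number: $\tilde{\kappa} \triangleq \left\| \mathbb{E}\left[\|H^{-1/2} a\|_2^2\, (H^{-1/2} a)(H^{-1/2} a)^\top\right] \right\|_2$. Matrix Bernstein Theorem (Tropp 2015): if $n > O(\tilde{\kappa})$, then $(1 - \delta) H \preceq \frac{1}{n} \sum_{i=1}^n a_i a_i^\top \preceq (1 + \delta) H$.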

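22 Is acceleration possible? $O(\tilde{\kappa})$ samples are sufficient. Recall the SGD convergence rate: $O\left(\kappa \log \frac{f(x_0) - f^*}{\epsilon}\right)$. Always $\tilde{\kappa} \le \kappa$. Acceleration might be possible if $\tilde{\kappa} \ll \kappa$. Discrete case: $\tilde{\kappa} = \frac{1}{p_{\min}} = \kappa$. Gaussian case: $\tilde{\kappa} = O(d) \le \kappa$.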
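Both values of the statistical condition number follow from a short computation. The derivation below uses the definition $\tilde{\kappa} = \left\|\mathbb{E}[\|H^{-1/2}a\|_2^2 (H^{-1/2}a)(H^{-1/2}a)^\top]\right\|_2$ and assumes $a = e_i$ for the discrete case; the whitened vector $z$ is notation introduced here.

```latex
\begin{align*}
\text{Discrete } (a = e_i \text{ w.p. } p_i):\quad
& H = \mathrm{diag}(p), \qquad H^{-1/2} a = e_i / \sqrt{p_i}, \\
& \mathbb{E}\big[\|H^{-1/2}a\|_2^2\,(H^{-1/2}a)(H^{-1/2}a)^\top\big]
  = \sum_i p_i \cdot \frac{1}{p_i} \cdot \frac{e_i e_i^\top}{p_i}
  = \mathrm{diag}(1/p_1, \dots, 1/p_d), \\
& \tilde{\kappa} = \frac{1}{p_{\min}} = \kappa. \\[4pt]
\text{Gaussian } (a \sim N(0, H)):\quad
& z = H^{-1/2} a \sim N(0, I_d), \\
& \mathbb{E}\big[\|z\|_2^2\, z_j z_k\big] = 0 \ (j \neq k), \qquad
  \mathbb{E}\big[\|z\|_2^2\, z_j^2\big]
  = \mathbb{E}[z_j^4] + \sum_{l \neq j} \mathbb{E}[z_l^2]\,\mathbb{E}[z_j^2] = 3 + (d - 1), \\
& \tilde{\kappa} = d + 2 = O(d).
\end{align*}
```

In the Gaussian case $\tilde{\kappa}$ does not depend on the spectrum of $H$ at all, while $\kappa$ grows with $\mathrm{Tr}(H)/\sigma_{\min}(H)$; this is the gap that leaves room for acceleration.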

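23 Result. Convergence rate of ASGD: $O\left(\sqrt{\kappa \tilde{\kappa}} \log \frac{f(x_0) - f^*}{\epsilon} + \frac{\sigma^2 d}{\epsilon}\right)$, compared to $O\left(\kappa \log \frac{f(x_0) - f^*}{\epsilon} + \frac{\sigma^2 d}{\epsilon}\right)$ for SGD. This is an improvement since $\tilde{\kappa} \le \kappa$. Conjecture: lower bound $\Omega\left(\sqrt{\kappa \tilde{\kappa}} \log \frac{f(x_0) - f^*}{\epsilon}\right)$ (inspired by Woodworth and Srebro 2016). Key takeaway: Acceleration is possible! The gain depends on the statistical condition number.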

24 Simulations: no noise. [Figure with two panels: Discrete distribution; Gaussian distribution.]

25 Simulations: with noise. [Figure with two panels: Discrete distribution; Gaussian distribution.]

26 High level challenges. Several versions of accelerated algorithms are known, e.g., conjugate gradient 1952, heavy ball 1964, momentum methods 1983, accelerated coordinate descent 2012, linear coupling 2014. Many of them are equivalent in the deterministic setting but not in the stochastic setting. There are many different analyses even for momentum methods: Nesterov's analysis 1983, coordinate descent 2012, ODE analysis 2013, linear coupling 2014.

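27 Algorithm.
Nesterov 2012 (exact gradients). Parameters: $\alpha, \beta, \gamma, \delta$.
1. $v_0 = x_0$
2. $y_{t-1} = \alpha x_{t-1} + (1 - \alpha) v_{t-1}$
3. $x_t = y_{t-1} - \delta \nabla f(y_{t-1})$
4. $z_{t-1} = \beta y_{t-1} + (1 - \beta) v_{t-1}$
5. $v_t = z_{t-1} - \gamma \nabla f(y_{t-1})$
Our algorithm (stochastic gradients). Parameters: $\alpha, \beta, \gamma, \delta$.
1. $v_0 = x_0$
2. $y_{t-1} = \alpha x_{t-1} + (1 - \alpha) v_{t-1}$
3. $x_t = y_{t-1} - \delta \widehat{\nabla}_t f(y_{t-1})$
4. $z_{t-1} = \beta y_{t-1} + (1 - \beta) v_{t-1}$
5. $v_t = z_{t-1} - \gamma \widehat{\nabla}_t f(y_{t-1})$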
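The five steps can be sketched as a single loop that takes a gradient oracle, so the same code covers both columns of the slide. Everything concrete below is an assumption for illustration: the function name `accelerated_method`, the toy quadratic, and in particular the parameter values, which are a textbook setting for exact gradients and not the tuned values for the stochastic algorithm (the slide does not list those).

```python
import math

def accelerated_method(grad, x0, alpha, beta, gamma, delta, iters):
    """Four-sequence scheme from the slide. `grad(y)` is a gradient oracle:
    pass an exact gradient for the deterministic column, or a single-sample
    estimate such as (a . y - b) * a for the stochastic column."""
    x = list(x0)
    v = list(x0)                                                     # step 1
    for _ in range(iters):
        y = [alpha * xj + (1 - alpha) * vj for xj, vj in zip(x, v)]  # step 2
        g = grad(y)
        x = [yj - delta * gj for yj, gj in zip(y, g)]                # step 3
        z = [beta * yj + (1 - beta) * vj for yj, vj in zip(y, v)]    # step 4
        v = [zj - gamma * gj for zj, gj in zip(z, g)]                # step 5
    return x

# Hypothetical test problem: f(x) = 0.5 * sum_j h_j (x_j - x_star_j)^2.
h = [0.01, 0.1, 1.0]
x_star = [1.0, -2.0, 3.0]

def exact_grad(y):
    return [hj * (yj - sj) for hj, yj, sj in zip(h, y, x_star)]

# Assumed textbook parameters for exact gradients (not taken from the slide).
L, mu = max(h), min(h)
s = math.sqrt(L / mu)
x_out = accelerated_method(exact_grad, [0.0, 0.0, 0.0],
                           alpha=s / (1 + s), beta=1 / s,
                           gamma=1 / math.sqrt(mu * L), delta=1 / L,
                           iters=400)
err = max(abs(u - w) for u, w in zip(x_out, x_star))
print(err)
```

Passing the oracle as a function keeps the update rule separate from the source of gradients, which mirrors the point of the slide: the two algorithms differ only in whether step 3 and step 5 see $\nabla f$ or a stochastic estimate of it.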

28 Proof overview. Recall our guarantee: $O\left(\sqrt{\kappa \tilde{\kappa}} \log \frac{f(x_0) - f^*}{\epsilon} + \frac{\sigma^2 d}{\epsilon}\right)$. The first term depends on the initial error; the second is the statistical error. The two terms need different analyses. For the first term, analyze assuming $\sigma = 0$. For the second term, analyze assuming $x_0 = x^*$.

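29 Part I: Potential function. Iterates $x_t, v_t$ of ASGD; $H \triangleq \mathbb{E}[a a^\top]$. Existing analyses use the potential function $\|x_t - x^*\|_H^2 + \sigma_{\min}(H) \|v_t - x^*\|_2^2$. We use $\|x_t - x^*\|_2^2 + \sigma_{\min}(H) \|v_t - x^*\|_{H^{-1}}^2$. We show $\|x_t - x^*\|_2^2 + \sigma_{\min}(H) \|v_t - x^*\|_{H^{-1}}^2 \le \left(1 - \frac{1}{\sqrt{\kappa \tilde{\kappa}}}\right) \left(\|x_{t-1} - x^*\|_2^2 + \sigma_{\min}(H) \|v_{t-1} - x^*\|_{H^{-1}}^2\right)$.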

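30 Part II: Stochastic process analysis.
$\begin{bmatrix} x_{t+1} - x^* \\ y_{t+1} - x^* \end{bmatrix} = C \begin{bmatrix} x_t - x^* \\ y_t - x^* \end{bmatrix} + \text{noise}$.
Let $\theta_t \triangleq \mathbb{E}\left[ \begin{bmatrix} x_t - x^* \\ y_t - x^* \end{bmatrix} \begin{bmatrix} x_t - x^* \\ y_t - x^* \end{bmatrix}^\top \right]$.
Then $\theta_{t+1} = B \theta_t + \text{noise} \cdot \text{noise}^\top$, so $\theta_n \to \sum_i B^i (\text{noise} \cdot \text{noise}^\top) = (I - B)^{-1} (\text{noise} \cdot \text{noise}^\top)$.
Our algorithm. Parameters: $\alpha, \beta, \gamma, \delta$.
1. $v_0 = x_0$
2. $y_{t-1} = \alpha x_{t-1} + (1 - \alpha) v_{t-1}$
3. $x_t = y_{t-1} - \delta \widehat{\nabla}_t f(y_{t-1})$
4. $z_{t-1} = \beta y_{t-1} + (1 - \beta) v_{t-1}$
5. $v_t = z_{t-1} - \gamma \widehat{\nabla}_t f(y_{t-1})$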
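The step from the recursion to the $(I - B)^{-1}$ expression is a geometric series. A scalar version makes this explicit; here $N$ is shorthand introduced for the noise term $\text{noise} \cdot \text{noise}^\top$, and $|B| < 1$ is assumed.

```latex
\begin{align*}
\theta_{t+1} &= B\,\theta_t + N, \qquad |B| < 1, \\
\theta_n &= B^n \theta_0 + \sum_{i=0}^{n-1} B^i N
\;\xrightarrow[n \to \infty]{}\; \sum_{i=0}^{\infty} B^i N = (1 - B)^{-1} N .
\end{align*}
```

In the matrix setting the same series converges when the eigenvalues of $B$ lie strictly inside the unit circle, and the initial term $B^n \theta_0$ vanishes exactly when $x_0 = x^*$.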

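31 Part II: Stochastic process analysis. We need to understand $(I - B)^{-1} (\text{noise} \cdot \text{noise}^\top)$. $B$ has singular values $> 1$, but fortunately its eigenvalues are $< 1$. Solve the 1-dim version of $(I - B)^{-1} (\text{noise} \cdot \text{noise}^\top)$ via explicit computations. Combine the 1-dim bounds with (statistical) condition number bounds: $(I - B)^{-1} (\text{noise} \cdot \text{noise}^\top) \preceq \tilde{\kappa} H^{-1} + \delta I$.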

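32 Recap.
Deterministic case: GD $O\left(\kappa \log \frac{f(x_0) - f^*}{\epsilon}\right)$; AGD $O\left(\sqrt{\kappa} \log \frac{f(x_0) - f^*}{\epsilon}\right)$.
Stochastic approximation: SGD $O\left(\kappa \log \frac{f(x_0) - f^*}{\epsilon} + \frac{\sigma^2 d}{\epsilon}\right)$; ASGD $O\left(\sqrt{\kappa \tilde{\kappa}} \log \frac{f(x_0) - f^*}{\epsilon} + \frac{\sigma^2 d}{\epsilon}\right)$.
Acceleration is possible; the gain depends on the statistical condition number. Techniques: new potential function, stochastic process analysis. Conjecture: our result is tight.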

33 Streaming optimization for ML. Streaming algorithms are very powerful for ML applications. SGD and its variants are widely used in practice. Classical stochastic approximation focuses on asymptotic rates. Tools from optimization help obtain strong finite sample guarantees. These have implications for parallelization as well.

34 Some examples. Linear regression: finite sample guarantees (Moulines and Bach 2011, Defossez and Bach 2015); parallelization (Jain et al.); acceleration (this talk). Smooth convex functions: finite sample guarantees (Bach and Moulines 2013). PCA (Oja's algorithm): rank-1 (Balsubramani et al. 2013, Jain et al.); higher rank (Allen-Zhu and Li 2016).

35 Open problems. Linear regression: a parameter-free algorithm, e.g., conjugate gradient. General convex functions: acceleration, parallelization? Non-convex functions: streaming algorithms, acceleration, parallelization? PCA: tight finite sample guarantees? Quasi-Newton methods.

36 Thank you! Questions?
