Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

1 Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
Microsoft Research New England
NIPS 2011 Optimization Workshop

2 Stochastic Convex Optimization Setting
Goal: optimize a convex function $F(\cdot)$ over a convex domain $\mathcal{W}$.
Access to unbiased estimates of subgradients of $F$ (e.g. by sampling from a training set).
Simple and very popular method, stochastic gradient descent (SGD):
Initialize $w_1 = 0$.
For $t = 1, 2, \ldots$:
    Get $\hat{g}_t$ such that $\mathbb{E}[\hat{g}_t] \in \partial F(w_t)$.
    $w_{t+1} = \Pi_{\mathcal{W}}(w_t - \eta_t \hat{g}_t)$
With an appropriate step size $\eta_t$, the average $\bar{w}_T = \frac{1}{T}(w_1 + \cdots + w_T)$ has suboptimality $F(\bar{w}_T) - F(w^*) = O(1/\sqrt{T})$.
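The SGD loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's code: the names `projected_sgd`, `average_iterate`, `subgradient_oracle`, `project` and `step_size` are assumptions introduced here, and vectors are plain lists of floats.

```python
from typing import Callable, List

Vector = List[float]


def projected_sgd(
    subgradient_oracle: Callable[[Vector], Vector],
    project: Callable[[Vector], Vector],
    step_size: Callable[[int], float],
    dim: int,
    num_steps: int,
) -> List[Vector]:
    """Run projected SGD and return the iterates w_1, ..., w_T."""
    w = [0.0] * dim  # w_1 = 0, as on the slide
    iterates: List[Vector] = []
    for t in range(1, num_steps + 1):
        iterates.append(w)
        g_hat = subgradient_oracle(w)  # assumed unbiased: E[g_hat] is a subgradient at w_t
        eta = step_size(t)
        # Gradient step followed by projection back onto the domain W.
        w = project([wi - eta * gi for wi, gi in zip(w, g_hat)])
    return iterates


def average_iterate(iterates: List[Vector]) -> Vector:
    """Return the uniform average (w_1 + ... + w_T) / T."""
    num = len(iterates)
    return [sum(w[i] for w in iterates) / num for i in range(len(iterates[0]))]
```

The oracle, projection and step-size schedule are passed in as callables so that the same loop covers both the general convex setting and the strongly convex one discussed next.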

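3 Strongly Convex Stochastic Optimization
$F(\cdot)$ is strongly convex: for some $\lambda > 0$,
$F(w') \ge F(w) + \langle g, w' - w \rangle + \frac{\lambda}{2}\|w' - w\|^2$
for all $w, w' \in \mathcal{W}$ and all $g \in \partial F(w)$.
E.g. regularized learning: $F(w) = \frac{1}{m}\sum_{i=1}^{m} \ell(w, z_i) + \frac{\lambda}{2}\|w\|^2$.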

4 Strongly Convex Stochastic Optimization
Theorem (Hazan et al. (2007); Shalev-Shwartz et al. (2008)). Suppose
    $F$ is $\lambda$-strongly convex,
    $\hat{g}_t$ is uniformly bounded.
By running SGD with step sizes $\eta_t = \frac{1}{\lambda t}$, the expected suboptimality of the average $\bar{w}_T = \frac{1}{T}(w_1 + \cdots + w_T)$ is at most $O\left(\frac{\log(T)}{\lambda T}\right)$.
The proof goes through online learning: a more difficult setting where the gradients are assumed to be provided by an adversary.

5 Better Algorithms
Hazan and Kale (2010): propose an algorithm different from SGD, with the optimal $1/(\lambda T)$ rate (removes the log factor).
Juditsky and Nesterov (2010): independently and a bit later, propose a virtually identical algorithm for a somewhat more general setting.
Important gap: are these new algorithms actually better than SGD? If so, we should abandon SGD (say, for solving support vector machines)!

6 Related Work
Convergence of SGD has been very well studied for over 50 years.
More recent work on non-asymptotic performance (e.g. Bach (2011)).
Almost all results are under smoothness assumptions on $F$, which are inappropriate for many modern applications.
E.g. Support Vector Machines: optimize $\frac{1}{m}\sum_{i=1}^{m} [1 - y_i \langle w, x_i \rangle]_+ + \frac{\lambda}{2}\|w\|^2$.
Any regularized learning framework with non-smooth losses (absolute loss, $\epsilon$-insensitive loss, ...).
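To make the non-smooth SVM objective concrete, here is a hedged sketch of the objective and of one unbiased stochastic subgradient, obtained by sampling a single training example uniformly at random. The function names and the `(x, y)` tuple layout are assumptions made for this illustration.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[List[float], int]  # (feature vector x_i, label y_i in {-1, +1})


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def svm_objective(w: Sequence[float], data: Sequence[Example], lam: float) -> float:
    """Average hinge loss plus (lam / 2) * ||w||^2."""
    hinge = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in data) / len(data)
    return hinge + 0.5 * lam * dot(w, w)


def svm_stochastic_subgradient(
    w: Sequence[float], data: Sequence[Example], lam: float, rng: random.Random
) -> List[float]:
    """Subgradient of the objective estimated from one uniformly sampled example."""
    x, y = rng.choice(data)
    g = [lam * wi for wi in w]  # gradient of the regularizer
    if y * dot(w, x) < 1.0:
        # Hinge term is active: its subgradient is -y * x.
        # At the kink (margin exactly 1) we pick 0, which is also a valid subgradient.
        g = [gi - y * xi for gi, xi in zip(g, x)]
    return g
```

The hinge loss has a kink where the margin equals 1, which is why the smoothness-based analyses mentioned on this slide do not apply directly.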

12 This Work
We study the convergence rate of SGD for stochastic strongly convex (possibly non-smooth) problems.
Main findings:
1. (Mostly known) If $F$ is strongly convex and smooth, then SGD with and without averaging achieves the optimal $O(1/T)$ rate.
2. If $F$ is non-smooth, SGD with averaging might really have a $\Theta(\log(T)/T)$ rate.
3. By a slight change of the averaging step, we can recover the optimal $O(1/T)$ rate for SGD - no new algorithm required!
4. We corroborate our findings experimentally.
Also: avoiding an online learning analysis leads to weaker conditions on the step sizes.

14 Smooth F
We define $F$ as $\mu$-smooth if for any $w \in \mathcal{W}$,
$F(w) - F(w^*) \le \frac{\mu}{2}\|w - w^*\|^2$.
Note: we require smoothness only with respect to $w^*$.
Classical Analysis (Chung 1954, Sacks 1958, ...). Suppose
    $F$ is $\lambda$-strongly convex and $\mu$-smooth,
    $\mathbb{E}[\|\hat{g}_t\|^2]$ is uniformly bounded.
Then the last predictor $w_T$ returned by SGD with step size $c/(\lambda t)$, $c > 1/2$, satisfies
$\mathbb{E}[F(w_T) - F(w^*)] \le O\left(\frac{c\mu}{\lambda^2 T}\right)$.

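15 Smooth F
Proof idea in 3 steps:
1. Strong convexity implies the recursive inequality
$\mathbb{E}[\|w_{t+1} - w^*\|^2] \le (1 - 2\eta_t\lambda)\,\mathbb{E}[\|w_t - w^*\|^2] + O(\eta_t^2)$.
2. Solving the recursion, we get
$\mathbb{E}[\|w_T - w^*\|^2] \le O\left(\frac{c}{\lambda^2 T}\right)$.
3. By smoothness of $F$, this implies
$\mathbb{E}[F(w_T) - F(w^*)] \le O\left(\frac{c\mu}{\lambda^2 T}\right)$.
Theorem. For the average predictor $\bar{w}_T = \frac{1}{T}(w_1 + \cdots + w_T)$,
$\mathbb{E}[F(\bar{w}_T) - F(w^*)] \le O\left(\frac{\mu}{\lambda}\left(\frac{c}{\lambda T}\right)\right)$.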
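The recursive inequality in step 1 can be derived as follows. This is a sketch under stated assumptions: $G^2$ denotes a uniform bound on $\mathbb{E}[\|\hat{g}_t\|^2]$ (a symbol introduced here), $g_t = \mathbb{E}[\hat{g}_t \mid w_t] \in \partial F(w_t)$, and $w^* \in \mathcal{W}$ is the minimizer of $F$.

```latex
\begin{align*}
% Projection onto a convex set is non-expansive and w^* lies in W:
\|w_{t+1} - w^*\|^2
  &\le \|w_t - \eta_t \hat{g}_t - w^*\|^2 \\
  &= \|w_t - w^*\|^2 - 2\eta_t \langle \hat{g}_t, w_t - w^* \rangle
     + \eta_t^2 \|\hat{g}_t\|^2 . \\
% Strong convexity applied at w_t, then at w^* (using optimality of w^*):
\langle g_t, w_t - w^* \rangle
  &\ge F(w_t) - F(w^*) + \tfrac{\lambda}{2}\|w_t - w^*\|^2
   \ge \lambda \|w_t - w^*\|^2 . \\
% Take expectations and combine:
\mathbb{E}\big[\|w_{t+1} - w^*\|^2\big]
  &\le (1 - 2\eta_t\lambda)\,\mathbb{E}\big[\|w_t - w^*\|^2\big] + \eta_t^2 G^2 .
\end{align*}
```

The last term is the $O(\eta_t^2)$ term of the slide. The second inequality in the middle line uses $F(w_t) - F(w^*) \ge \frac{\lambda}{2}\|w_t - w^*\|^2$, which follows from strong convexity at $w^*$ together with the first-order optimality of $w^*$ over $\mathcal{W}$.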

16 Non-Smooth F
When $F$ is non-smooth, we can still show $\mathbb{E}[\|w_T - w^*\|^2] \le O(1/T)$, but this is insufficient to show $O(1/T)$ suboptimality.
On the other hand, by an online learning analysis, $\mathbb{E}[F(\bar{w}_T) - F(w^*)] \le O(\log(T)/T)$.
Maybe the $\log(T)$ is just an artifact of the analysis?
Answer: NO! There exist stochastic problems where the suboptimality of $\bar{w}_T$ is really $\Theta(\log(T)/T)$.

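17 Warm-up
Suppose
    $\mathcal{W} = [0, 1]$,
    $F(w) = \frac{1}{2}w^2 + w$ (non-smooth with respect to $w^* = 0$),
    $\hat{g}_t = \nabla F(w_t) + U[-2, 2]$.
It can be shown that $\mathbb{E}[w_t] = \Omega\left(\frac{1}{t}\right)$.
Therefore, $\mathbb{E}[\bar{w}_T] = \Omega\left(\frac{1}{T}\sum_{t=1}^{T}\frac{1}{t}\right) = \Omega\left(\frac{\log(T)}{T}\right)$.
But this example is not satisfying...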
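The warm-up problem is easy to simulate. The sketch below runs SGD on $F(w) = \frac{1}{2}w^2 + w$ over $[0, 1]$ with additive $U[-2, 2]$ gradient noise and step size $c/(\lambda t)$, where $\lambda = 1$ for this $F$. The function names, the default $c = 1$ and the seeding scheme are assumptions of this illustration; it only generates iterates and does not by itself prove the lower bound.

```python
import random
from typing import List


def run_warmup_sgd(num_steps: int, c: float = 1.0, seed: int = 0) -> List[float]:
    """Return the SGD iterates w_1, ..., w_T for the warm-up example."""
    rng = random.Random(seed)
    lam = 1.0  # F(w) = 0.5 * w**2 + w is 1-strongly convex
    w = 0.0    # w_1 = 0
    iterates: List[float] = []
    for t in range(1, num_steps + 1):
        iterates.append(w)
        # True gradient (w + 1) plus zero-mean uniform noise on [-2, 2].
        g_hat = (w + 1.0) + rng.uniform(-2.0, 2.0)
        # Gradient step, then projection onto W = [0, 1].
        w = min(1.0, max(0.0, w - (c / (lam * t)) * g_hat))
    return iterates


def warmup_suboptimality(w: float) -> float:
    """F(w) - F(w*), with w* = 0 and F(w*) = 0."""
    return 0.5 * w * w + w
```

Averaging `warmup_suboptimality(sum(its) / len(its))` over many seeds and several values of `num_steps` is one way to compare the behavior of the averaged iterate against the last iterate empirically.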

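18 Second Example
$\mathcal{W} = \mathbb{R}$ (can be easily extended to $\mathbb{R}^d$)
$F(w) = \frac{1}{2}w^2 + \begin{cases} w & w \ge 0 \\ -7w & w < 0 \end{cases}$
$\hat{g}_t = \nabla F(w_t) + U[-2, 2]$
The sharp gradient for $w < 0$ makes $w_t$ stay on the positive side - the same effect as the explicit domain constraint in the previous example.
As before, $\mathbb{E}[w_t] = \Omega\left(\frac{1}{t}\right)$, leading to $\mathbb{E}[\bar{w}_T] = \Omega\left(\frac{\log(T)}{T}\right)$.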
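A small sketch of the second example's objective and one choice of subgradient. The slide text lost its signs in extraction, so the $-7w$ branch for $w < 0$ is a reconstruction, chosen here because it is the sign that keeps $F$ convex with minimizer $w^* = 0$. The function names are hypothetical.

```python
def second_example_objective(w: float) -> float:
    """F(w) = 0.5 * w**2 + (w if w >= 0 else -7 * w), assumed reconstruction."""
    return 0.5 * w * w + (w if w >= 0.0 else -7.0 * w)


def second_example_subgradient(w: float) -> float:
    """A subgradient of F at w.

    At w = 0 every value in [-7, 1] is a valid subgradient; this sketch returns 1.
    """
    return w + (1.0 if w >= 0.0 else -7.0)
```

For any $w < 0$ the subgradient is below $-7$, so a single step with a step size of order 1 moves the iterate back to the positive side, which mimics the projection onto $[0, 1]$ in the warm-up example.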

19 Fixing SGD
Conclusion: $\log(T)/T$ is inevitable for SGD with averaging.
Problematic point: the averaging step $\bar{w}_T = \frac{1}{T}(w_1 + \cdots + w_T)$.
However, we can't prove much for $w_T$ in the non-smooth case...
Solution: do something in between!

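20 Fixing SGD
$\alpha$-suffix Averaging. For some $\alpha \in (0, 1)$, define
$\bar{w}_T^{\alpha} = \frac{w_{(1-\alpha)T+1} + \cdots + w_T}{\alpha T}$.
Theorem. For SGD with $\alpha$-suffix averaging and step size $c/(\lambda t)$, $c > 1/2$,
$\mathbb{E}[F(\bar{w}_T^{\alpha}) - F(w^*)] \le O\left(\log\left(\frac{1}{1-\alpha}\right)\frac{c}{\alpha\lambda T}\right)$.
Proof idea: a combination of the online analysis technique with stochastic convergence guarantees on $\mathbb{E}[\|w_T - w^*\|^2]$.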
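The averaging step itself is a one-liner over the stored iterates. The sketch below handles scalar iterates only. The name `suffix_average` is an assumption, as is the rounding of $\alpha T$ to the nearest integer (the slide implicitly takes $\alpha T$ to be an integer). Allowing $\alpha = 1$, which gives the ordinary full average, is a convenience added here.

```python
from typing import Sequence


def suffix_average(iterates: Sequence[float], alpha: float) -> float:
    """Average the last alpha-fraction of the iterates w_1, ..., w_T."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    total = len(iterates)
    # Number of trailing iterates to average; at least one.
    count = max(1, int(round(alpha * total)))
    tail = iterates[total - count:]
    return sum(tail) / count
```

For example, `suffix_average([1.0, 2.0, 3.0, 4.0], 0.5)` averages the last two iterates, while small `alpha` approaches the last iterate and `alpha = 1.0` recovers the full average.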

21 Experiments
[Figure: two panels, "Smooth and Strongly Convex F" and "Non-Smooth and Strongly Convex F". Each plots $(F(w_T) - F(w^*)) \cdot T$ against $\log_2(T)$ for SGD-A, SGD-α, SGD-L and EPOCH-GD.]

22 Experiments
[Figure: four panels, "ASTRO Training Loss", "ASTRO Test Loss", "COV1 Training Loss" and "COV1 Test Loss". Each plots $\log_2(F(w_T))$ against $\log_2(T)$ for SGD-A, SGD-α, SGD-L and EPOCH-GD.]

23 Conclusions and Open Problems
SGD performs optimally in the smooth case.
In the non-smooth case, averaging leads to a $\log(T)/T$ rate - an artifact of the algorithm, not just of the online analysis.
However, this is easily fixed by a different averaging step - no new algorithm necessary.
Experiments accord with the theory.
Open questions:
    Can we show a $1/T$ rate for the last predictor?
    High-probability bounds (not trivial!)
