arxiv: v2 [cs.lg] 7 Oct 2016


Starting Small - Learning with Adaptive Sample Sizes

Hadi Daneshmand, Aurelien Lucchi, Thomas Hofmann
Department of Computer Science, ETH Zurich, Switzerland

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

For many machine learning problems, data is abundant and it may be prohibitive to make multiple passes through the full training set. In this context, we investigate strategies for dynamically increasing the effective sample size, when using iterative methods such as stochastic gradient descent. Our interest is motivated by the rise of variance-reduced methods, which achieve linear convergence rates that scale favorably for smaller sample sizes. Exploiting this feature, we show theoretically and empirically how to obtain significant speed-ups with a novel algorithm that reaches statistical accuracy on an $n$-sample in $2n$, instead of $n \log n$ steps.

1. Introduction

In empirical risk minimization (ERM) (Vapnik, 1998) the training set $S$ is used to define a sample risk $R_S$, which is then minimized with regard to a pre-defined function class. One effectively equates learning algorithms with optimization algorithms. However, for all practical purposes an approximate solution of $R_S$ will be sufficient, as long as the optimization error is small relative to the statistical accuracy at sample size $n := |S|$. This is important for massive data sets, where optimization to numerical precision is infeasible. Instead of performing early stopping on black-box optimization, one ought to understand the trade-offs between statistical and computational accuracy, cf. (Chandrasekaran & Jordan, 2013). In this paper, we investigate a much neglected facet of this topic, namely how to dynamically control the effective sample size in optimization.

Many large-scale optimization algorithms are iterative: they use sampled or aggregated data to perform a sequence of update steps. This includes the popular family of gradient descent methods. Often, the computational complexity increases with the size of the training sample, e.g. in steepest descent, where the cost of a gradient computation scales with $n$. Does one really need a highly accurate gradient though, in particular in the early phase of optimization? Why not use subsets $T_t \subseteq S$ which are increased in size with the iteration count $t$, matching up statistical accuracy with optimization accuracy in a dynamic manner? This is the general program we pursue in this paper. In order to make this idea concrete and to reach competitive results, we focus on a recent variant of stochastic gradient descent (SGD), which is known as SAGA (Defazio et al., 2014). As we will show, this algorithm has a particularly interesting property in how its convergence rate depends on $n$.

1.1. Empirical Risk Minimization

Formally, we assume that training examples $x \in S \subseteq X$ have been drawn i.i.d. from some underlying, but unknown probability distribution $P$. We fix a function class $F$ parametrized by weight vectors $w \in \mathbb{R}^d$ and define the expected risk as $R(w) := \mathbb{E} f_x(w)$, where $f$ is an $x$-indexed family of loss functions, often convex. We denote the minimum and the minimizer of $R(w)$ over $F$ by $R^*$ and $w^*$, respectively. Given that $P$ is unknown, ERM suggests to rely on the empirical (or sample) risk with regard to $S$:

$$R_S(w) := \frac{1}{n} \sum_{x \in S} f_x(w), \qquad w_S^* := \arg\min_{w \in F} R_S(w). \tag{1}$$

Note that one may absorb a regularizer in the definition of the loss $f_x$.

1.2. Generalization bounds

The relation between $w^*$ and $w_S^*$ has been widely studied in the literature on learning theory. It is usually analysed with the help of uniform convergence bounds that take the generic form (Boucheron et al., 2005)

$$\mathbb{E}_S \Big[ \sup_{w \in F} \big| R(w) - R_S(w) \big| \Big] \le H(n), \tag{2}$$

where the expectation is over a random $n$-sample $S$. Here $H$ is a bound that depends on $n$, usually through a ratio $n/d$, where $d$ is the capacity of $F$ (e.g. VC dimension). In the realizable case, we may be able to observe a favorable $H(n) \propto d/n$, whereas in the pessimistic case, we may only be able to establish weaker bounds such as $H(n) \propto \sqrt{d/n}$ (e.g. for linear function classes); see also (Bousquet & Bottou, 2008). The fast convergence rate has been shown to hold for a class of strictly convex loss functions such as quadratic and logistic loss (Bartlett et al., 2006; 2005). We ignore additional log factors that can be eliminated using the chaining technique (Bousquet, 2002; Bousquet & Bottou, 2008).

1.3. Statistical efficiency

Assume now that we have some approximate optimization algorithm, which given $S$ produces solutions $w_S$ that are on average $\epsilon(n)$-optimal, i.e. $\mathbb{E}_S [R_S(w_S) - R_S^*] \le \epsilon(n)$. One can then provide the following quality guarantee in expectation over sample sets $S$ (Bousquet & Bottou, 2008):

$$\mathbb{E}_S [R(w_S) - R^*] \le H(n) + \epsilon(n), \tag{3}$$

which is an additive decomposition of the expected solution suboptimality into an estimation (or statistical) error $H(n)$ and an optimization (or computational) error $\epsilon(n)$. For a given computational budget, one typically finds that $\epsilon(n)$ is increasing with $n$, whereas $H(n)$ is always decreasing. This hints at a trade-off, which may suggest choosing a sample size $m < n$. Intuitively speaking, concentrating the computational budget on fewer data may be better than spreading computations too thinly.

1.4. Stochastic Gradient Optimization

For large scale problems, stochastic gradient descent is a method of choice in order to optimize problems of the form given in Eq. (1). Yet, while SGD update directions equal the true (negative) gradient direction in expectation, high variance typically leads to sub-linear convergence. This is where variance-reducing methods for ERM such as SAG (Roux et al., 2012), SVRG (Johnson & Zhang, 2013), and SAGA (Defazio et al., 2014) come into play. We focus on the latter here, where one can establish the following result on the convergence rate (see appendix).

Lemma 1. Let all $f_x$ be convex with $L$-Lipschitz continuous gradients and assume that $R_S$ is $\mu$-strongly convex. Then the suboptimality of the SAGA iterate $w^t$ after $t$ steps is, over a randomly sampled $S$, bounded by

$$\mathbb{E}_A [R_S(w^t) - R_S^*] \le \rho_n^t \, C_S, \qquad \rho_n = 1 - \min\Big( \frac{1}{n}, \frac{\mu}{L} \Big),$$

where the expectation is over the algorithmic randomness.

This highlights two different regimes: For small $n$, the condition number $\kappa := L/\mu$ dictates how fast the optimization algorithm converges. On the other hand, for large $n$, the convergence rate of SAGA becomes $\rho_n = 1 - 1/n$.

1.5. Contributions

Our main question is: can we obtain faster convergence to a statistically accurate solution by running SAGA on an initially smaller sample, whose size is then gradually increased? Motivated by a simple, yet succinct analysis, we present a novel algorithm, called DYNASAGA, that implements this idea and achieves an optimization error $\epsilon(n)$ on the order of $H(n)$ after only $2n$ iterations.

2. Related Work

Stochastic approximation is a powerful tool for minimizing objective Eq. (1) for convex loss functions. The pioneering work of (Robbins & Monro, 1951) is essentially a streaming SGD method where each observation is used only once. Another major milestone has been the idea of iterate averaging (Polyak & Juditsky, 1992). A thorough theoretical analysis of asymptotic convergence of SGD can be found in (Kushner & Yin, 2003), whereas some non-asymptotic results have been presented in (Moulines & Bach, 2011).

A line of recent work known as variance-reduced SGD, e.g. (Roux et al., 2012; Shalev-Shwartz & Zhang, 2013; Johnson & Zhang, 2013; Defazio et al., 2014; 2015; Konečný & Richtárik, 2013; Zhang et al., 2013), has exploited the finite sum structure of the empirical risk to establish linear convergence for strongly convex objectives and also a better convergence rate for purely convex objectives (Mahdavi et al., 2013). There is also evidence of slightly improved statistical efficiency (Babanezhad et al., 2015). (Frostig et al., 2015) provides a non-asymptotic analysis of a streaming SVRG algorithm (SSVRG), for which a convergence rate approaching that of the ERM is established.

There have also been related data-adaptive sampling approaches, e.g. in the context of unsupervised learning (Lucic et al., 2015) or for non-uniform sampling of data points (Schmidt et al., 2013; He & Takác, 2015) with the goal of sampling important data points more often. This direction is largely orthogonal to our dynamic sizing of the sample, which is purely based on random subsampling. Our sampling strategy is instead based on revisiting samples, which has also been explored in (Wang et al., 2016) to empirically improve the convergence of certain variance-reduced methods.

3. Methodology

3.1. Setting and Assumptions

We work under the assumptions made in Lemma 1 and focus on the large data regime, where $n \ge \kappa$ and the geometric rate of convergence of SAGA depends on $n$ through $\rho_n = 1 - 1/n$. This is an interesting regime as the guaranteed progress per update is larger for smaller samples. This form of $\rho_n$ implies for the case of performing $t = n$ iterations, i.e. performing one pass (footnote 1):

$$\mathbb{E}_A [R_S(w^n) - R_S^*] \le \Big( 1 - \frac{1}{n} \Big)^n C_S \le \frac{C_S}{e}. \tag{4}$$

So we are guaranteed to improve the solution suboptimality on average by a factor $1/e$ per pass. This in turn implies that in order to get to a guaranteed accuracy $O(n^{-\alpha})$, we need $O(\alpha \, n \log n)$ update steps.

Footnote 1: The SAGA analysis holds for i.i.d. sampling, so strictly speaking this is not a pass, but corresponds to $n$ update steps.

3.2. Sample Size Optimization

For illustrative purposes, let us use the above result to select a sample size for SAGA which yields the best guarantees.

Proposition 2. Assume $H(m) = D/m$ and $n$ is given. Define $C$ to be an upper bound on $C_S$, $\forall S$ (from Lemma 1). Then for $m \ge \kappa$,

$$V(m) := \frac{D}{m} + C e^{-n/m}$$

provides a bound on the expected suboptimality of SAGA. It is minimized for the choice

$$m^* = \max\Big\{ \kappa, \; \frac{n}{\log n + \log \frac{C}{D}} \Big\}.$$

Proof. The first claim follows directly from the assumptions and Lemma 1. Moreover, the tightest bound is obtained by differentiating $V$ with regard to $1/m$ and solving for $m$ (see Lemma 9 in appendix).

Figure 1. Tradeoff between sample statistical accuracy term $H(m)$ and optimization suboptimality $\epsilon(m)$ using sample size $m < n$. Note that $\epsilon(m)$ is drawn by taking the first order approximation of the upper bound $C e^{-n/m}$. Here, $m^* = O(n / \log n)$ yields the best balance between these two terms.

The result implies that we will perform roughly $\log n + \log \frac{C}{D}$ epochs on the optimally sized sample. Also the value of the bound is (for simplicity, assuming $C = D$)

$$V(m^*) = \frac{D (\log n + 1)}{n} \;\ll\; V(n) = \frac{D}{n} + \frac{D}{e}, \tag{5}$$

showing that the single pass approximation error on the full sample is too large (constant), relative to the statistical accuracy.

3.3. Dynamic Sample Growth

As we have seen, optimizing over a smaller sample can be beneficial (if we believe the significance of the bounds). But why choose a single sample size once and for all? A smaller sample set seems advantageous early on, but as an optimization algorithm approaches the empirical minimizer, it is hit by the statistical accuracy limit. This suggests that we should dynamically increment the size of the sample set. We illustrate this idea in Figure 2. In order to analyze such a dynamic sampling scheme, we need to relate the suboptimality on a sub-sample $T$ to a suboptimality bound on $S$. We establish a basic result in the following theorem.

Theorem 3. Let $w$ be an $(\epsilon, T)$-optimal solution, i.e. $R_T(w) - R_T^* \le \epsilon$, where $T \subseteq S$, $m := |T|$, $n := |S|$. Then the suboptimality of $w$ for $R_S$ is bounded w.h.p. in the choice of $T$ as:

$$\mathbb{E}_S [R_S(w) - R_S^*] \le \epsilon + \frac{n - m}{n} H(m). \tag{6}$$

Proof. Consider the following equality

$$R_S(w) - R_S^* = \underbrace{R_S(w) - R_T(w)}_{(1)} + \underbrace{R_T(w) - R_T^*}_{(2)} + \underbrace{R_T^* - R_S^*}_{(3)}.$$

We bound the three involved differences (in expectation) as follows. (2): $R_T(w) - R_T^* \le \epsilon$ by assumption. (3): $\mathbb{E}_S [R_T(w_T^*) - R_S(w_S^*)] \le 0$ as $T \subseteq S$. For (1) we apply the bound (see Lemma 10 in the appendix)

$$\mathbb{E}_{S \setminus T} [R_S(w) - R_T(w)] = \frac{n - m}{n} \big[ R(w) - R_T(w) \big].$$

Moreover $\mathbb{E}_T [R(w) - R_T(w)] \le \mathbb{E}_T \big[ \sup_{w'} | R(w') - R_T(w') | \big] \le H(m)$ by Eq. (2), which concludes the proof.

In plain English, this result suggests the following: if we have optimized $w$ to $(\epsilon, T)$ accuracy on a sub-sample $T$ and we want to continue optimizing on a larger sample $S \supseteq T$, then we can bound the suboptimality on $R_S$ by the same $\epsilon$ plus an additional switching cost of $\frac{n - m}{n} H(m)$.
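The sample-size choice of Proposition 2 and the comparison in Eq. (5) can be checked numerically. The following is a minimal sketch, assuming $H(m) = D/m$ as in the proposition; the function names `bound_V` and `optimal_sample_size` and the constants in the demo are illustrative assumptions, not part of the paper.

```python
import math


def bound_V(m, n, C, D):
    """V(m) = D/m + C*exp(-n/m): statistical term plus the SAGA bound after n updates on m points."""
    return D / m + C * math.exp(-n / m)


def optimal_sample_size(n, C, D, kappa):
    """Stationary point of V (Proposition 2), clipped from below at kappa."""
    return max(kappa, n / (math.log(n) + math.log(C / D)))


if __name__ == "__main__":
    # Illustrative constants only (assumed, not taken from the paper).
    n, C, D, kappa = 100_000, 1.0, 1.0, 10.0
    m_star = optimal_sample_size(n, C, D, kappa)
    print("m* =", m_star)
    print("V(m*) =", bound_V(m_star, n, C, D))   # equals D*(1 + log n)/n when C == D
    print("V(n)  =", bound_V(n, n, C, D))        # equals D/n + C/e
```

With $C = D$ the first value scales like $\log n / n$ while the second stays close to the constant $D/e$, which is the gap that Eq. (5) points to.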

Figure 2. Illustration of an optimal progress path via sample size adjustment. The vertical black lines show the progress made at each step, thus illustrating the faster convergence for smaller sample size. (Axes: $R(w)$ against sample size, with levels $H(n/4)$, $H(n/3)$, $H(n/2)$, $H(n)$.)

Table 1. Comparison of obtained bounds for different SAGA variants when performing $T \ge \kappa$ update steps.

METHOD                 OPTIMIZATION ERROR         SAMPLES
SAGA (one pass)        const.                     $T$
SAGA (optimal size)    $O(\log T \cdot H(T))$     $T / \log T$
DYNASAGA               $O(H(T))$                  $T/2$

4. Algorithms & Analysis

4.1. Computational Limited Learning

The work of (Bottou, 2010) emphasized that for massive data sets the limiting factor of any learning algorithm will be its computational complexity $T$, rather than the number of samples $n$. For SGD this computational limit typically translates into the number of stochastic gradients evaluated by the algorithm, i.e. $T$ becomes the number of update steps. One obvious strategy with abundant data is to sample a new data point in every iteration. There are asymptotic results establishing bounds for various SGD variants in (Bousquet & Bottou, 2008). However, SAGA and related algorithms rely on memorizing past stochastic gradients, cf. (Hofmann et al., 2015), which makes it beneficial to revisit data points, and which is at the root of results such as Lemma 1. This leads to a qualitatively different behavior and our findings indicate that indeed, the trade-offs for large scale learning need to be revisited, cf. Table 1.

4.2. SAGA with Dynamic Sample Sizes

We suggest to modify SAGA to work with a dynamic sample size schedule. Let us define a schedule as a monotonic function $M : \mathbb{Z}_+ \to \mathbb{Z}_+$, where $t$ is the iteration number and $M(t)$ the effective sample size used at $t$. We assume that a sequence of data points $X = (x_1, \ldots, x_n)$ drawn from $P$ is given such that $M$ induces a nested sequence of samples $T_t := \{ x_i : 1 \le i \le M(t) \}$. DYNASAGA generalizes SAGA (Defazio et al., 2014) in that it samples data points non-uniformly at each iteration. Specifically, for a given schedule $M$ and iteration $t$, it samples uniformly from $T_t$, but ignores $X \setminus T_t$. The pseudocode for DYNASAGA is shown in Algorithm 2.

Algorithm 2 DYNASAGA
1: Input: training examples $X = (x_1, x_2, \ldots, x_n)$, $x_i \sim P$; total number of iterations $T$ (e.g. $T = 2n$); starting point $w^0 \in \mathbb{R}^d$ (e.g. $w^0 = 0$); learning rate $\eta > 0$ (e.g. $\eta = \frac{1}{4L}$); sample schedule $M : [1:T] \to [1:n]$
2: $w \leftarrow w^0$
3: for $i = 1, \ldots, n$ do
4:   $\alpha_i \leftarrow \nabla f_{x_i}(w^0)$ {can also be done on the fly}
5: end for
6: for $t = 1, \ldots, T$ do
7:   sample $x_i \sim \mathrm{Uniform}(x_1, \ldots, x_{M(t)})$
8:   $g \leftarrow \nabla f_{x_i}(w^{t-1})$
9:   $A \leftarrow \sum_{j=1}^{M(t)} \alpha_j / M(t)$ {can be done incrementally}
10:  $w^t \leftarrow w^{t-1} - \eta (g - \alpha_i + A)$
11:  $\alpha_i \leftarrow g$
12: end for

4.3. Upper Bound Recurrence

Assume we are given a stochastic optimization method that guarantees a geometrical decay at each iteration, i.e.

$$\mathbb{E}_A [R_S(w^t) - R_S^*] \le \rho_n \big[ R_S(w^{t-1}) - R_S^* \big], \tag{7}$$

where $|S| = n$ and the expectation is over the randomness of the optimization process (footnote 2). For acceleration, we pursue the strategy of using the basic inequalities obtained so far and stitching them together in the form of a recurrence. At any iteration $t$ we allow ourselves the choice to augment the current sample of size $m$ by some increment $n - m$. We define an upper bound function $U$ as follows:

$$U(t, n) = \min \Big\{ \rho_n \, U(t-1, n), \; \min_{m < n} \Big[ U(t, m) + \frac{n - m}{n} H(m) \Big] \Big\}, \tag{8}$$

such that $U(0, m) = \xi$, where the initial error $\xi$ is defined as:

$$\xi := \frac{4L}{\mu} \big[ R(w^0) - R(w^*) \big]. \tag{9}$$

Footnote 2: Note that this assumption is slightly stronger than Lemma 1 but it leads to a much simpler proof technique.
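The steps of Algorithm 2 can be sketched in NumPy for one concrete loss. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the squared loss $f_i(w) = \frac{1}{2}(\langle a_i, w \rangle - y_i)^2$, a starting point $w^0 = 0$, and a monotone schedule; the function name `dynasaga_least_squares` and its parameters are hypothetical.

```python
import numpy as np


def dynasaga_least_squares(A, y, T, eta, schedule, seed=0):
    """DYNASAGA sketch for f_i(w) = 0.5 * (<a_i, w> - y_i)^2 (assumed loss, hypothetical helper).

    A: (n, d) inputs, y: (n,) targets, T: number of update steps,
    eta: learning rate, schedule: callable t -> effective sample size M(t).
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w = np.zeros(d)                       # starting point w^0 = 0
    alpha = -y[:, None] * A               # steps 3-5: stored gradients at w^0 = 0
    alpha_sum = np.zeros(d)               # running sum of alpha_j over the active sample
    m_prev = 0
    for t in range(1, T + 1):
        # Effective sample size M(t), kept monotone and within [1, n].
        m = int(min(n, max(m_prev, schedule(t), 1)))
        alpha_sum += alpha[m_prev:m].sum(axis=0)   # admit newly added points
        m_prev = m
        i = rng.integers(m)                        # step 7: uniform over the first M(t) points
        g = (A[i] @ w - y[i]) * A[i]               # step 8: fresh gradient
        w = w - eta * (g - alpha[i] + alpha_sum / m)   # steps 9-10
        alpha_sum += g - alpha[i]                  # keep the running sum in sync
        alpha[i] = g                               # step 11
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((500, 5))
    y = A @ np.ones(5)
    L = np.max(np.sum(A ** 2, axis=1))             # gradient Lipschitz constant of the f_i
    w = dynasaga_least_squares(A, y, T=1000, eta=1 / (4 * L),
                               schedule=lambda t: max(10, t // 2))
    print(0.5 * np.mean((A @ w - y) ** 2))
```

Keeping `alpha_sum` as a running sum is one way to realize the "can be done incrementally" remark in step 9, so that each update costs $O(d)$ rather than $O(M(t) \, d)$.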

We refer the reader to Lemma 8 in the Appendix for further details on how to derive the expression for $\xi$. The construction of Eq. (8) is motivated by the following result:

Proposition 4. W.h.p. over the random $n$-sample $X$, the iterate sequence $w^t$ generated by DYNASAGA fulfils $\mathbb{E}_X [R_{T_t}(w^t) - R_{T_t}^*] \le U(t, n)$.

Proof. By induction over $t$. The result for $t = 0$ follows directly from Lemma 8. The first case in Eq. (8) for the induction step (fixed sample size) follows from Eq. (7). The second case holds by virtue of Theorem 3 for any $m$, hence also for the minimum.

Although the $U$-recursion can be solved for small $n$ using dynamic programming (assuming knowledge of all constants), we analyse a much simpler heuristic and its behavior. This leads to interesting insights, while being very practical. In particular, our algorithm is an anytime algorithm, which does not require knowledge of the total number of iterations $T$ ahead of time.

4.4. Sample Schedules

In this section, we present and analyse two adaptive sample-size schemes for DYNASAGA.

The first scheme, LINEAR, starts with sample size $\kappa$ and performs $2\kappa$ steps. From then on, we add a new sample every other iteration. The effective sample size is thus

$$M_{\mathrm{LIN}}(t) = \max\Big\{ \kappa, \frac{t}{2} \Big\}. \tag{10}$$

Note that this strategy defines an upper bound on $U(2t, t)$ and $U(2t+1, t)$.

The second scheme, ALTERNATING, is a variant that we have also implemented, where we perform updates in alternation: every other iteration we sample a new data point, which is added to the set, and we also force an update on this fresh sample. In alternation, we simply re-sample an existing data point uniformly at random. We do not provide a theoretical analysis for this scheme but show experimentally that it slightly outperforms the LINEAR strategy (see results in the appendix). We thus report results for the ALTERNATING strategy in the experimental section.

4.5. Analysis

We now provide an analysis that establishes the convergence rate of the LINEAR strategy.

Lemma 5. For $H(n) = D n^{-\alpha}$, $0 < \alpha \le 1$, the LINEAR strategy obtains the following suboptimality:

$$U(2n, n) \le H(n) + \frac{\xi}{2} \Big( \frac{\kappa}{n} \Big)^2. \tag{11}$$

Proof. By induction over $n$. The base case follows from $C_m \le \xi$. Using Eq. (8) and (11) for the inductive case, we get

$$U(2(n+1), n+1) \overset{(8)}{\le} \rho_{n+1}^2 \Big[ U(2n, n) + \frac{1}{n+1} H(n) \Big] \overset{(11)}{\le} \frac{\xi}{2} \Big( \frac{\kappa}{n+1} \Big)^2 + \frac{n^2 (n+2)}{(n+1)^3} H(n).$$

Note that by definition of the logarithmic function, $\log[n(n+2)] < 2 \log(n+1)$, and moreover

$$\frac{n \, H(n)}{(n+1) \, H(n+1)} = \frac{n^{1-\alpha}}{(n+1)^{1-\alpha}} \le 1,$$

which completes the proof.

This means that for large enough $n$ the LINEAR strategy is able to approach the statistical accuracy with $2n$ iterations, i.e. two passes over the data. Note the very significant improvement relative to the $\log n$ factor inherent to the optimal fixed sample size choice (see Table 1 for a comparison of these two bounds). What does that imply for the $T = n$ case that we have been emphasizing? It is simple to state an answer as a corollary.

Corollary 6. Under the same assumptions as Lemma 5, it holds for even $n$:

$$U(n, n) \le 3 \cdot 2^{\alpha - 1} H(n) + 2 \xi \Big( \frac{\kappa}{n} \Big)^2.$$

Proof. Note that with Eq. (8) (a) and Lemma 5 (b) we get

$$U(2n, 2n) \overset{(a)}{\le} U(2n, n) + \frac{1}{2} H(n) \overset{(b)}{\le} \frac{3}{2} H(n) + 2 \xi \Big( \frac{\kappa}{2n} \Big)^2.$$

The fact that $H(n) = 2^{\alpha} H(2n)$ completes the proof.

The proof of the above corollary suggests to only use $n = T/2$ samples when performing $T$ steps, and to simply ignore the other half (that potentially could have been sampled). One might wonder if a better strategy than the LINEAR one could be defined, e.g. by iterating more than twice on each newly added sample or by increasing the sample size by more than one. The next lemma answers this question and proves that the LINEAR strategy is optimal for large-scale datasets as long as $H(n) \sim 1/n$.

Lemma 7. Assume that $H(n) \sim D/n$, then the LINEAR strategy is optimal for all sample sizes $n > \kappa$.
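The remark that the $U$-recursion can be solved by dynamic programming for small $n$ can be made concrete, and doing so gives a numerical check of the bound in Lemma 5. The sketch below is an illustration under stated assumptions: it takes $\rho_n = 1 - \min(1/n, 1/\kappa)$ from Lemma 1, assumes all constants are known, and uses the hypothetical function name `u_table`.

```python
def u_table(n_max, T, kappa, xi, H):
    """Tabulate U[t][n] from the recurrence in Eq. (8) for 0 <= t <= T, 1 <= n <= n_max.

    H is a callable giving the statistical accuracy H(m); column 0 is unused.
    """
    def rho(n):
        # Contraction factor of Lemma 1 with mu/L = 1/kappa.
        return 1.0 - min(1.0 / n, 1.0 / kappa)

    U = [[xi] * (n_max + 1) for _ in range(T + 1)]   # U(0, m) = xi
    for t in range(1, T + 1):
        for n in range(1, n_max + 1):                # ascending n, so U[t][m] with m < n is ready
            keep = rho(n) * U[t - 1][n]              # one more update on the same sample
            grow = min(
                (U[t][m] + (n - m) / n * H(m) for m in range(1, n)),
                default=float("inf"),
            )                                        # switch from a smaller sample m < n
            U[t][n] = min(keep, grow)
    return U


if __name__ == "__main__":
    kappa, xi, D = 5, 10.0, 1.0                      # illustrative constants (assumed)
    H = lambda m: D / m
    U = u_table(n_max=40, T=80, kappa=kappa, xi=xi, H=H)
    for n in (5, 10, 20, 40):
        print(n, U[2 * n][n], H(n) + xi / 2 * (kappa / n) ** 2)
```

The table costs $O(T \, n_{\max}^2)$ operations, which is why the paper turns to a simple schedule instead of solving the recursion at scale.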

Proof. Here, we briefly state a sketch of the proof. The details are presented in Appendix A.2. First, we reformulate the problem of the optimal sample size schedule in terms of the number of iterations on each sample size. Given that this problem is convex, we can use the KKT conditions to prove the optimality of incrementing by one sample (see Lemma 12) and iterating twice on each sample size (see Lemma 13).

Figure 3. Results on synthetic dataset (both panels show the suboptimality of the risk). (left) Since $\kappa = \sqrt{n}$, the empirical suboptimality is $O(1/n)$ and we expect the slope measured on this plot to be close to one. (right) Since $\kappa = n^{0.75}$ slows down the convergence rate, the slope of this plot is less than one.

Table 2. Details of the real datasets used in our experiments. All datasets were selected from the LIBSVM dataset collection.

DATASET           SIZE         NUMBER OF FEATURES
RCV1.BINARY
A9A
W8A
IJCNN1
REAL-SIM
COVTYPE.BINARY
SUSY              5,000,000    18

5. Experimental Results

We present experimental results on synthetic as well as real-world data, which largely confirm the above analysis.

5.1. Baselines

We compare DYNASAGA (both the LINEAR and ALTERNATING strategy) to various optimization methods presented in Section 2. This includes SGD (with constant and decreasing step-size), SAGA, streaming SVRG (SSVRG) as well as the mixed SGD/SVRG approach presented in (Babanezhad et al., 2015).

5.2. Experiment on synthetic data

We consider linear regression, where inputs $a \in \mathbb{R}^d$ are drawn from a Gaussian distribution $N(0, \Sigma_{d \times d})$ and outputs are corrupted by additive noise: $y = \langle a, w^* \rangle + \epsilon$, $\epsilon \sim N(0, \sigma^2)$. We are given $n$ i.i.d. observations of this model, $S = \{ (a_i, y_i) \}_{i=1}^n$, from which we compute the least squares risk $R_S(w) = \frac{1}{n} \sum_{i=1}^n ( \langle a_i, w \rangle - y_i )^2$. By considering the matrix $A$ to be a row-wise arrangement of the input vectors $a_i$, we can write the Hessian matrix of $R_n(w)$ as $\Sigma_n = \frac{1}{n} A^\top A$.
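The data-generating model above can be simulated directly. The following is a minimal sketch, assuming a diagonal covariance whose entries are linearly spaced between 1 and $1/\kappa$ (the spacing, the noise level and the function names `make_synthetic` and `empirical_condition_number` are assumptions of this sketch); it also reads the ratio of the largest to the smallest eigenvalue off the empirical matrix $\Sigma_n$.

```python
import numpy as np


def make_synthetic(n, d, kappa, sigma=0.1, seed=0):
    """Draw a_i ~ N(0, diag(eigs)) and y_i = <a_i, w_true> + noise (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    eigs = np.linspace(1.0, 1.0 / kappa, d)            # assumed spectrum: from 1 down to 1/kappa
    A = rng.standard_normal((n, d)) * np.sqrt(eigs)    # scale columns to get the target covariance
    w_true = rng.standard_normal(d)
    y = A @ w_true + sigma * rng.standard_normal(n)
    return A, y, w_true


def empirical_condition_number(A):
    """Ratio of largest to smallest eigenvalue of Sigma_n = A^T A / n."""
    sigma_n = A.T @ A / A.shape[0]
    ev = np.linalg.eigvalsh(sigma_n)                   # eigenvalues in ascending order
    return ev[-1] / ev[0]


if __name__ == "__main__":
    n, d = 20_000, 10
    A, y, _ = make_synthetic(n, d, kappa=np.sqrt(n))
    print(empirical_condition_number(A), np.sqrt(n))
```

For $n$ much larger than $d$ the empirical ratio should be close to the prescribed one, which is what makes it reasonable to control the conditioning of the problem through the population covariance.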
When $n \gg d$, the matrix $\Sigma_n$ converges to $\Sigma$ and we can therefore assume that $R_n(w)$ is $\mu$-strongly convex and $L$-Lipschitz, where the constants $\mu$ and $L$ are the smallest and largest eigenvalues of $\Sigma$. We experiment with two different values for the condition number $\kappa$.

Case $\kappa = \sqrt{n}$: We use a diagonal $\Sigma$ with elements decreasing from 1 to $1/\sqrt{n}$, hence $\kappa = \sqrt{n}$. In this particular case the analysis derived in Lemma 5 predicts an upper bound $U(n, n) < O(\frac{1}{n})$, which is confirmed by the results shown in Figure 3.

Case $\kappa = n^{3/4}$: When $\kappa = n^{3/4}$, the term $(\kappa / n)^2$ is the dominating term in the proposed upper bound. In this case, $U(n, n)$ is thus upper-bounded by $O(\frac{1}{\sqrt{n}})$, which is once again verified experimentally in Figure 3.

5.3. Experiments on Real Datasets

We also ran experiments on several real-world datasets in order to compare the performance of DYNASAGA to state-of-the-art methods. The details of the datasets are shown in Table 2. Throughout all the experiments we used the logistic loss with a regularizer $\lambda = 1/n$ (footnote 3). Figures 4 and 5 show the suboptimality on the empirical risk and expected risk after a single pass over the datasets. The various parameters used for the baseline methods are described in Table 3. A critical factor in the performance of most baselines, especially SGD, is the selection of the step-size. We picked the best-performing step-size within the common range guided by existing theoretical analyses, specifically $\eta = 1/L$ and $\eta = \frac{C}{C + \mu t}$ for various values of $C$.

Footnote 3: We also present some additional results for various regularizers of the form $\lambda = 1/n^p$, $p < 1$, in the appendix.

Overall, we can see that DYNASAGA performs very well, both as an optimization as well as a learning algorithm. SGD is also very competitive and typically achieves faster convergence than the other baselines; however, its behaviour is not stable throughout all the datasets. The SGD variant with decreasing step-size is typically very fast in the early stages but then slows down after a certain number of steps. The results on the RCV dataset are somewhat surprising as SGD with constant step-size clearly outperforms all methods, but we show in the appendix that its behaviour

gets worse as we increase the condition number. As can be seen very clearly, DYNASAGA yields excellent solutions in terms of expected risk after one pass (see suboptimality values that intersect with the vertical red dashed lines).

Figure 4. Suboptimality on the empirical risk (panels: 1. RCV, 2. A9A, 3. W8A, 4. IJCNN1, 5. REAL-SIM, 6. COVTYPE, 7. SUSY; methods: SGD with two step-size settings, SAGA, DYNASAGA, SSVRG, SGD/SVRG). The vertical axis shows the suboptimality of the empirical risk, i.e. $\log_2 \mathbb{E}_{10} [R_T(w^t) - R_T^*]$, where the expectation is taken over 10 independent runs. The training set includes 90% of the data. The vertical red dashed line is drawn after exactly one epoch over the data.

6. Conclusion

We have presented a new methodology to exploit the trade-off between computational and statistical complexity, in order to achieve fast convergence to a statistically efficient solution. Specifically, we have focussed on a modification of SAGA and suggested a simple dynamic sampling schedule that adds one new data point every other update step. Our analysis shows competitive convergence rates both in terms of suboptimality on the empirical risk as well as (more importantly) the expected risk in a one pass or a two pass setting. These results have been validated experimentally. Our approach depends on the underlying optimization method only through its convergence rate for minimizing an empirical risk. We thus suspect that a similar sample size adaptation is applicable to a much wider range of algorithms, including non-convex optimization methods for deep learning.

Figure 5. Suboptimality on the expected risk (panels: 1. RCV, 2. A9A, 3. W8A, 4. IJCNN1, 5. REAL-SIM, 6. COVTYPE, 7. SUSY; methods: SGD with two step-size settings, SAGA, DYNASAGA, SSVRG, SGD/SVRG). The vertical axis shows the suboptimality of the expected risk, i.e. $\log_2 \mathbb{E}_{10} [R_S(w^t) - R_S(w_T^*)]$, where $S$ is a test set which includes 10% of the data and $w_T^*$ is the optimum of the empirical risk on $T$. The vertical red dashed line is drawn after exactly one epoch over the data.

References

Babanezhad, Reza, Ahmed, Mohamed Osama, Virani, Alim, Schmidt, Mark, Konečný, Jakub, and Sallinen, Scott. Stop wasting my gradients: Practical SVRG. Advances in Neural Information Processing Systems, 2015.

Bartlett, Peter L, Bousquet, Olivier, and Mendelson, Shahar. Local Rademacher complexities. Annals of Statistics, 2005.

Bartlett, Peter L, Jordan, Michael I, and McAuliffe, Jon D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 2006.

Bottou, Léon. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT 2010. Springer, 2010.

Boucheron, Stéphane, Bousquet, Olivier, and Lugosi, Gábor. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9, 2005.

Bousquet, Olivier. Concentration inequalities and empirical processes theory applied to the analysis of learning algorithms. PhD thesis, Ecole Polytechnique, 2002.

Bousquet, Olivier and Bottou, Léon. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, 2008.

Boyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

Chandrasekaran, Venkat and Jordan, Michael I. Computational and statistical tradeoffs via convex relaxation. Proceedings of the National Academy of Sciences, 110(13):E1181-E1190, 2013.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.

Defazio, Aaron J, Caetano, Tibério S, and Domke, Justin. Finito: A faster, permutable incremental gradient method for big data problems. In The International Conference on Machine Learning, 2015.

Frostig, Roy, Ge, Rong, Kakade, Sham M., and Sidford, Aaron. Competing with the empirical risk minimizer in a single pass. In The Conference on Learning Theory, 2015.

He, Xi and Takác, Martin. Dual free SDCA for empirical risk minimization with adaptive probabilities. CoRR, 2015.

Hofmann, Thomas, Lucchi, Aurelien, Lacoste-Julien, Simon, and McWilliams, Brian. Variance reduced stochastic gradient descent with neighbors. In Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2015.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.

Konečný, Jakub and Richtárik, Peter. Semi-stochastic gradient descent methods. arXiv preprint, 2013.

Kushner, Harold J and Yin, George. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003.

Lucic, Mario, Ohannessian, Mesrob I, Karbasi, Amin, and Krause, Andreas. Tradeoffs for space, time, data and risk in unsupervised learning. In AISTATS, 2015.

Mahdavi, Mehrdad, Zhang, Lijun, and Jin, Rong. Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems, 2013.

Moulines, Eric and Bach, Francis R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, 2011.

Polyak, Boris T and Juditsky, Anatoli B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 1992.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Roux, Nicolas L, Schmidt, Mark, and Bach, Francis R. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, 2012.

Schmidt, Mark, Roux, Nicolas Le, and Bach, Francis. Minimizing finite sums with the stochastic average gradient. arXiv preprint, 2013.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 14, 2013.

Vapnik, Vladimir. Statistical learning theory, volume 1. Wiley New York, 1998.

Wang, Jialei, Wang, Hai, and Srebro, Nathan. Reducing runtime by recycling samples. arXiv preprint, 2016.

Zhang, Lijun, Mahdavi, Mehrdad, and Jin, Rong. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.

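A. Appendix

A.1. Proofs

Proof of Lemma 1. We start with the convergence rate of SAGA established in (Defazio et al., 2014) as

$$\mathbb{E}_A \| w^t - w_S^* \|^2 \le \rho_n^t \Big[ \| w^0 - w_S^* \|^2 + \frac{|S|}{\mu |S| + L} \Big( R_S(w^0) - \langle \nabla R_S(w_S^*), w^0 - w_S^* \rangle - R_S^* \Big) \Big]. \tag{12}$$

We then use the $L$-smoothness assumption of $f_x(w)$ to relate the suboptimality on the function values to the bound in Eq. (12):

$$\mathbb{E}_A [R_S(w^t) - R_S(w_S^*)] = \mathbb{E}_A \big[ \mathbb{E}_{x \in S} [f_x(w^t)] - \mathbb{E}_{x \in S} [f_x(w_S^*)] \big] \overset{L\text{-smoothness}}{\le} L \, \mathbb{E}_A \| w^t - w_S^* \|^2 \overset{\text{Eq. 12}}{\le} \rho_n^t \, C_S,$$

where $C_S$ is the initial suboptimality on the empirical risk defined as:

$$C_S = L \Big[ \| w^0 - w_S^* \|^2 + \frac{|S|}{\mu |S| + L} \Big( R_S(w^0) - \langle \nabla R_S(w_S^*), w^0 - w_S^* \rangle - R_S^* \Big) \Big].$$

Note that this initial error depends on the set $S$ and its size $|S|$. In the following lemma, we propose an upper bound on this initial error that is independent of $S$.

Lemma 8. W.h.p., the initial suboptimality error of sample $S$ is bounded by:

$$C_S \le \xi := \frac{4L}{\mu} \big[ R(w^0) - R(w^*) \big].$$

Proof. We first use the fact that $R_S(w)$ is $\mu$-strongly convex as well as the optimality of $w_S^*$ to bound $C_S$ as

$$
\begin{aligned}
C_S &:= L \Big( \| w^0 - w_S^* \|^2 + \frac{|S|}{\mu |S| + L} \big[ R_S(w^0) - \langle \nabla R_S(w_S^*), w^0 - w_S^* \rangle - R_S(w_S^*) \big] \Big) \\
&\le \frac{L}{\mu} \big[ R_S(w^0) - R_S(w_S^*) \big] + \frac{L |S|}{\mu |S| + L} \big[ R_S(w^0) - \langle \nabla R_S(w_S^*), w^0 - w_S^* \rangle - R_S(w_S^*) \big] \\
&\le \frac{L}{\mu} \big[ R_S(w^0) - R_S(w_S^*) \big] + \frac{L |S|}{\mu |S| + L} \big[ R_S(w^0) - R_S(w_S^*) \big] \\
&\overset{(L > 0)}{\le} \frac{2L}{\mu} \big[ R_S(w^0) - R_S(w_S^*) \big] \\
&= \frac{2L}{\mu} \Big[ \underbrace{R_S(w^0) - R(w^0)}_{[1]} + R(w^0) - R(w^*) + \underbrace{R(w^*) - R(w_S^*)}_{[3]} + \underbrace{R(w_S^*) - R_S(w_S^*)}_{[2]} \Big].
\end{aligned}
$$

We use the generalization bounds in (Vapnik, 1998) to upper bound [1] and [2]. For [3], we used the uniform convergence rate of the ERM that implies (Vapnik, 1998):

$$R(w_S^*) - R(w^*) \le c \sup_w | R_S(w) - R(w) |,$$

where $c$ is a constant. We then get

$$C_S \overset{\text{w.h.p.}}{\le} \frac{2L}{\mu} \Big[ H(|S|) + R(w^0) - R(w^*) + c H(|S|) + H(|S|) \Big]. \tag{13}$$

We also make the further assumption that with high probability the initial suboptimality is greater than a constant factor of the statistical accuracy, i.e. $R(w^0) - R(w^*) > (2 + c) H(|S|)$. We can then further upper bound $C_S$ as

$$C_S \le \frac{4L}{\mu} \big[ R(w^0) - R(w^*) \big]. \tag{14}$$

Lemma 9 (for Proposition 2). Let $V(m) := \frac{D}{m} + C e^{-n/m}$. Then

$$\arg\min_{0 < m \le n} V(m) = \frac{n}{\log n + \log \frac{C}{D}}.$$

Proof.

$$\frac{dV}{d(m^{-1})} = D - C n e^{-n/m} \overset{!}{=} 0 \iff e^{n/m} = \frac{C n}{D} \iff \frac{n}{m} = \log n + \log \frac{C}{D}.$$

Solving for $m$, this indeed corresponds to a minimum, which can be verified by checking the boundary values $m = 0$ and $m \to \infty$.

Lemma 10 (for Theorem 3).

$$\mathbb{E}_{S \setminus T} [R_S(w) - R_T(w)] = \frac{n - m}{n} \big[ R(w) - R_T(w) \big].$$

Proof.

$$
\begin{aligned}
\mathbb{E}_{S \setminus T} [R_S(w) - R_T(w)] &= \mathbb{E}_{S \setminus T \mid T} [R_S(w) - R_T(w)] \\
&= \mathbb{E}_{S \setminus T} \Big[ \frac{1}{n} \Big( \sum_{x \in T} f_x(w) + \sum_{y \in S \setminus T} f_y(w) \Big) - \frac{1}{m} \sum_{x \in T} f_x(w) \Big] \\
&= \frac{n - m}{n} \, \mathbb{E}_{S \setminus T} \Big[ \frac{1}{n - m} \sum_{y \in S \setminus T} f_y(w) - R_T(w) \Big] \\
&= \frac{n - m}{n} \big[ \mathbb{E}_{S \setminus T} [R_{S \setminus T}(w)] - R_T(w) \big] \\
&= \frac{n - m}{n} \big[ R(w) - R_T(w) \big].
\end{aligned}
$$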
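Before the expectation is taken, the step used in the proof of Lemma 10 is a purely algebraic identity between sample averages, which a few lines of NumPy can confirm. This is an illustrative sketch: the array `losses` stands in for the values $f_x(w)$ at a fixed $w$, and the function name `switching_gap` is hypothetical.

```python
import numpy as np


def switching_gap(losses, m):
    """Return (R_S - R_T, (n-m)/n * (R_{S\\T} - R_T)) for T = first m entries of losses."""
    n = len(losses)
    r_s = losses.mean()            # risk on the full sample S
    r_t = losses[:m].mean()        # risk on the sub-sample T
    r_rest = losses[m:].mean()     # risk on the remaining points S \ T
    return r_s - r_t, (n - m) / n * (r_rest - r_t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lhs, rhs = switching_gap(rng.random(1000), m=250)
    print(lhs, rhs)   # the two values agree up to floating-point rounding
```

Taking the expectation over the fresh points in $S \setminus T$ then replaces the average over $S \setminus T$ by the expected risk $R(w)$, which gives the statement of the lemma.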

Starting Small - Learning with Adaptive Sample Sizes

A.2. Optimality of the LINEAR Strategy

We here introduce a new notation and choose to represent a sample size schedule by a vector $t^n = (t_m)_{m \le n}$, where $t_m$ denotes the number of iterations on sample size $m$. Note that the total number of iterations up to the sample size $n$ is $T = \sum_{m < n} t_m$. We define $n'$ as the sample size that we iterate on immediately before sample size $n$, i.e.

$$n' = \max \{ k < n : t_k > 0 \}. \tag{15}$$

We now rewrite the suboptimality bound in terms of the sample size schedule $t^n$ as

$$A(t^n) = \mathbb{E}_S \big[ R_S(w(t^n)) - R_S(w_S^*) \big] = \rho_n^{t_n} \Big( A(t^{n'}) + \frac{n - n'}{n} H(n') \Big), \tag{16}$$

where the second equality is derived using Lemma 1 and Theorem 3.

One can relate the upper bound $U(n, n)$ to $A(t^n)$ using the following constrained program:

$$U(n, n) = \min_{t^n} A(t^n) \quad \text{subject to} \quad \forall m : t_m \ge 0, \quad \sum_m t_m = n. \tag{17}$$

In the following we aim at showing that the LINEAR strategy is the optimal solution of Equation 17. We first prove a lemma that will be used in the rest of our analysis.

Lemma 11 (Expansion of $A(t^n)$). If $H(n) = D/n$, then

$$A(t^n) := C(t^n) + \sum_{m = m^* + 1}^{n} B_m(t^n), \tag{18}$$

where

$$C(t^n) := \xi \prod_{i = m^*}^{n} \Big( \frac{i - 1}{i} \Big)^{t_i}, \qquad B_m(t^n) := \frac{D}{(m - 1) m} \prod_{i = m}^{n} \Big( \frac{i - 1}{i} \Big)^{t_i}. \tag{19}$$

Proof. Although one could painstakingly unroll the recursion in Equation 16, we here provide a simple induction proof. First, one can easily verify that the equation holds for $n = m^*$. For the inductive step, we assume it holds for $n'$ and prove it holds for all $\{ k : n' < k \le n \}$. According to the definition of $n'$, we have $t_k = 0$ for all $n' < k < n$, and therefore

$$\rho_k^{t_k} = \prod_{m = n' + 1}^{k} \rho_m^{t_m}. \tag{20}$$

We will also make use of the following equality in our analysis:

$$\frac{k - n'}{k} H(n') = H(n') - H(k) \overset{(H(n) = D/n)}{=} \sum_{m = n' + 1}^{k} \big[ H(m - 1) - H(m) \big]. \tag{21}$$

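We are now ready to prove the inductive step.

$$
\begin{aligned}
A(t^k) &\overset{\text{Eq. 16}}{=} \rho_k^{t_k} \Big( A(t^{n'}) + \frac{k - n'}{k} H(n') \Big) && (22) \\
&= \rho_k^{t_k} \Big( C(t^{n'}) + \sum_{m = m^* + 1}^{n'} B_m(t^{n'}) + \frac{k - n'}{k} H(n') \Big) && (23) \\
&\overset{\text{Eq. 19, 20}}{=} C(t^k) + \sum_{m = m^* + 1}^{n'} B_m(t^k) + \rho_k^{t_k} \frac{k - n'}{k} H(n') && (24) \\
&\overset{\text{Eq. 21}}{=} C(t^k) + \sum_{m = m^* + 1}^{n'} B_m(t^k) + \rho_k^{t_k} \sum_{m = n' + 1}^{k} \frac{D}{(m - 1) m} && (25) \\
&\overset{\text{Eq. 20}}{=} C(t^k) + \sum_{m = m^* + 1}^{n'} B_m(t^k) + \sum_{m = n' + 1}^{k} B_m(t^k) && (26) \\
&= C(t^k) + \sum_{m = m^* + 1}^{k} B_m(t^k). && (27)
\end{aligned}
$$

Using the definitions provided in Lemma 11, we investigate the optimality conditions of the optimal sample size strategy. In the following, we simplify our notation and write $B_m$ and $C$ instead of $B_m(t^n)$ and $C(t^n)$. As a first step in our analysis, we introduce the following equations based on the definitions of $B_m$ and $C$:

$$B_m = \frac{D}{m (m - 1)} \prod_{i = m}^{n} \Big( \frac{i - 1}{i} \Big)^{t_i} = \frac{m + 1}{m - 1} \Big( \frac{m - 1}{m} \Big)^{t_m} B_{m + 1}, \tag{28}$$

$$\prod_{i = m}^{n} \Big( \frac{i - 1}{i} \Big)^{t_i} = \exp \Big( \log \prod_{i = m}^{n} \Big( \frac{i - 1}{i} \Big)^{t_i} \Big) = \exp \Big[ \sum_{i = m}^{n} t_i \log \Big( 1 - \frac{1}{i} \Big) \Big]. \tag{29}$$

We now compute the derivative of $A(t^n)$ as

$$\frac{\partial A(t^n)}{\partial t_m} = \log \Big( 1 - \frac{1}{m} \Big) \Big( C(t^n) + \sum_{k = m^* + 1}^{m} B_k(t^n) \Big) \approx - \frac{1}{m} \Big( C + \sum_{k = m^* + 1}^{m} B_k \Big). \tag{30}$$

$C(t^n)$ and $B_m(t^n)$ are log-convex (hence convex) functions with respect to $t^n$. Since the sum operator preserves convexity (Boyd & Vandenberghe, 2004), $A(t^n)$ is convex as well. Let $\lambda_i$, $\nu$ denote the Lagrangian coefficients associated with the inequality and equality constraints respectively. According to the KKT conditions (Boyd & Vandenberghe, 2004) for the optimal solution, the following conditions hold:

$$\lambda_m \ge 0 \tag{31}$$

$$\lambda_m t_m = 0 \tag{32}$$

$$\frac{\partial A(t^n)}{\partial t_m} - \lambda_m + \nu = 0 \tag{33}$$

According to the above conditions there are two possible cases for the partial derivative $\frac{\partial A(t^n)}{\partial t_m}$: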

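For the case of $t_m > 0$, the slackness condition 32 implies that $\lambda_m = 0$. Then, according to condition 33:

$$ \frac{\partial A(t^n)}{\partial t_m} = -\nu \;\overset{\text{EQ. 30}}{\Longrightarrow}\; \frac{1}{m}\left(C + \sum_{k=m_0+1}^{m} B_k\right) = \nu. \tag{34} $$

For the case of $t_m = 0$, $\lambda_m > 0$ (a.) holds based on the complementary slackness condition 32:

$$ \frac{\partial A(t^n)}{\partial t_m} = \lambda_m - \nu \overset{(a.)}{>} -\nu \;\overset{\text{EQ. 30}}{\Longrightarrow}\; \frac{1}{m}\left(C + \sum_{k=m_0+1}^{m} B_k\right) < \nu. \tag{35} $$

In the following two lemmas we use the conditions of optimality derived in Equations 34 and 35 to prove optimality of the … strategy. Specifically, we first prove that for the optimal strategy, $t_m > 0$ for $m_0 < m \le n$ and $t_m = 0$ for $m > n$. We also prove the optimality of incrementing the sample size by one. In the second lemma, we show that $t_m \approx 2$.

Lemma 12 (Optimality of sample size increment). For large enough $m$, a schedule with $t_m = 0$ and $t_{m+1} > 0$ cannot be optimal.

Proof. Let $m'$ denote the largest sample size below $m$ with $t_{m'} > 0$, so that $t_k = 0$ for $m' < k \le m$. Note that by repeated application of Equation (28) we obtain

$$ B_{m+1} < B_m < \dots < B_{m'+1} \overset{\text{EQ. 34 \& 35}}{<} \nu, \tag{36} $$

where the optimality conditions a. $t_{m'} > 0$ (EQ. 34) and b. $t_{m'+1} = 0$ (EQ. 35) yield the last inequality:

\begin{align}
B_{m'+1} &= \left(\sum_{k=m_0+1}^{m'+1} B_k + C\right) - \left(\sum_{k=m_0+1}^{m'} B_k + C\right) \tag{37} \\
&\overset{a.}{=} \left(\sum_{k=m_0+1}^{m'+1} B_k + C\right) - m'\nu \tag{38} \\
&\overset{b.}{<} (m'+1)\nu - m'\nu = \nu. \tag{39}
\end{align}

On the other hand, the optimality conditions a. $t_{m+1} > 0$ (EQ. 34) and b. $t_m = 0$ (EQ. 35) also imply $B_{m+1} > \nu$, which contradicts the previously established $B_{m+1} < \nu$. Indeed, we have

\begin{align}
B_{m+1} &= \left(\sum_{k=m_0+1}^{m+1} B_k + C\right) - \left(\sum_{k=m_0+1}^{m} B_k + C\right) \tag{40} \\
&\overset{a.}{=} (m+1)\nu - \left(\sum_{k=m_0+1}^{m} B_k + C\right) \tag{41} \\
&\overset{b.}{>} (m+1)\nu - m\nu = \nu. \tag{42}
\end{align}

Lemma 13 (Optimality of two iterations). Consider $t^n$ as the minimizer of the optimization problem 17. For sufficiently large $m$ with $m_0 < m \le n$, $t_m \approx 2$.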

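Proof. Using Lemma 12, $t_m > 0$ holds for $m_0 < m \le n$. We proceed with the optimality conditions a. $t_m > 0$ and b. $t_{m-1} > 0$ in Equation 34:

\begin{align}
B_m &= \left(\sum_{k=m_0+1}^{m} B_k + C\right) - \left(\sum_{k=m_0+1}^{m-1} B_k + C\right) \tag{43} \\
&\overset{a.}{=} m\nu - \left(\sum_{k=m_0+1}^{m-1} B_k + C\right) \tag{44} \\
&\overset{b.}{=} m\nu - (m-1)\nu = \nu. \tag{45}
\end{align}

Consequently, $B_m = B_{m+1} = \nu$. Using Equation 28, one concludes that $t_m \approx 2$:

$$ \left(\frac{m-1}{m}\right)^{t_m} = \frac{m-1}{m+1} \;\Longrightarrow\; t_m = \frac{\log\left(1 - \frac{2}{m+1}\right)}{\log\left(1 - \frac{1}{m}\right)} \approx \frac{2m}{m+1} \approx 2. \tag{46} $$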
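The expansion of Lemma 11 and the value of $t_m$ in Equation 46 can be checked numerically. The Python sketch below unrolls the recursion of Equation 16 and compares it with the closed form of Equations 18 and 19, then evaluates the ratio of logarithms from Equation 46. The function names, the example schedule and the constants ξ = 1, D = 2 are illustrative assumptions, and the base case $A(t^{m_0}) = \xi \rho_{m_0}^{t_{m_0}}$ is read off the closed form rather than stated separately in the text.

```python
import math


def rho(m):
    """Contraction factor (m - 1) / m that appears in Equation 19."""
    return 1.0 - 1.0 / m


def bound_recursive(t, m0, n, xi, D):
    """Unroll Equation 16 with H(m) = D / m.

    t maps a sample size to its iteration count (missing sizes count as 0).
    The base case xi * rho(m0) ** t[m0] is an assumption taken from Lemma 11.
    """
    a = xi * rho(m0) ** t.get(m0, 0)
    prev = m0  # plays the role of n', the last sample size iterated on
    for k in range(m0 + 1, n + 1):
        if t.get(k, 0) == 0 and k != n:
            continue  # skipped sizes are absorbed by the (k - n') / k term
        a = rho(k) ** t.get(k, 0) * (a + (k - prev) / k * (D / prev))
        prev = k
    return a


def bound_closed_form(t, m0, n, xi, D):
    """Closed form C + sum of B_m from Equations 18 and 19."""
    def tail_product(start):
        return math.prod(rho(i) ** t.get(i, 0) for i in range(start, n + 1))

    c = xi * tail_product(m0)
    b = sum(D / ((m - 1) * m) * tail_product(m) for m in range(m0 + 1, n + 1))
    return c + b


def iterations_for_equal_b(m):
    """Solve ((m - 1) / m) ** t = (m - 1) / (m + 1) for t (Equation 46)."""
    return math.log1p(-2.0 / (m + 1)) / math.log1p(-1.0 / m)


schedule = {5: 3, 7: 2, 10: 4}  # hypothetical schedule with gaps at 6, 8, 9
print(bound_recursive(schedule, 5, 10, 1.0, 2.0))
print(bound_closed_form(schedule, 5, 10, 1.0, 2.0))
for m in (10, 100, 1000):
    print(m, iterations_for_equal_b(m), 2 * m / (m + 1))
```

The two bounds agree up to floating-point error, and the solution of Equation 46 approaches 2 from below as $m$ grows.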

A.3. Additional Experimental Results

A.3.1. COMPARISON OF THE TWO ADAPTIVE SAMPLE SIZE SCHEMES FOR DYNASAGA

We here compare the … and … schemes on the collection of real datasets presented in Table 2 for a regularizer $\lambda = n^{-1/2}$. The results for the empirical and expected risk, shown in Figure 6 and Figure 7, show that the … scheme slightly outperforms the … strategy.

[Figure 6: panels 1. SUSY, 2. RCV, 3. A9A, 4. W8A, 5. IJCNN1, 6. REAL-SIM, 7. COVTYPE]

Figure 6. Suboptimality on the empirical risk. The vertical axis shows the suboptimality of the empirical risk, i.e. $\log_2 \mathbb{E}_{10}\left[R_T(w^t) - R_T^*\right]$, where the expectation is taken over 10 independent runs. The training set includes 90% of the data. The vertical green dashed line is drawn after exactly one epoch over the data.

[Figure 7: panels 1. SUSY, 2. RCV, 3. A9A, 4. W8A, 5. IJCNN1, 6. REAL-SIM, 7. COVTYPE]

Figure 7. Suboptimality on the expected risk. The vertical axis shows the suboptimality of the expected risk, i.e. $\log_2 \mathbb{E}_{10}\left[R_S(w^t) - R_S(w_T^*)\right]$, where $S$ is a test set which includes 10% of the data and $w_T^*$ is the optimum of the empirical risk on $T$. The vertical green dashed line is drawn after exactly one epoch over the data.

A.3.2. EFFECT OF THE REGULARIZER

We here present additional results for various regularizers of the form $\lambda = 1/n^p$, $p < 1$. In the interest of clarity we only show results on four datasets. We can see a similar trend to the main results presented in the paper for $\lambda = 1/\sqrt{n}$, where DYNASAGA shows very fast convergence in terms of both empirical and expected risk. SGD is also very competitive and typically achieves faster convergence than the other baselines; however, its behaviour is not stable across all the datasets.

[Figure 8: panels 1. RCV, 2. W8A, 3. IJCNN1, 4. COVTYPE; legend: SGD, SAGA, DYNASAGA, SSVRG, SGD:.5, SGD:.5, SGD/SVRG]

Figure 8. Suboptimality on the empirical risk with regularizer λ =

[Figure 9: panels 1. RCV, 2. W8A, 3. IJCNN1, 4. COVTYPE; legend: SGD, SAGA, DYNASAGA, SSVRG, SGD:.5, SGD:.5, SGD/SVRG]

Figure 9. Suboptimality on the expected risk with regularizer λ = 3

[Figure 10: panels 1. RCV, 2. W8A, 3. IJCNN1, 4. COVTYPE; legend: SGD, SAGA, DYNASAGA, SSVRG, SGD:.5, SGD:.5, SGD/SVRG]

Figure 10. Suboptimality on the empirical risk with regularizer λ =

[Figure 11: panels 1. RCV, 2. W8A, 3. IJCNN1, 4. COVTYPE; legend: SGD, SAGA, DYNASAGA, SSVRG, SGD:.5, SGD:.5, SGD/SVRG]

Figure 11. Suboptimality on the expected risk with regularizer λ = 4

A.4. Details of Experiments

The various parameters of all baselines and DYNASAGA are presented in Table 3.

Table 3. Experimental setting

METHOD | PARAMETER | NOTATION | VALUE
SGD | STEP SIZE | η_t | .1/(.1+µt)
SAGA | STEP SIZE | η | .3/(L+µn)
SSVRG AND SGD/SVRG | FACTOR FOR INCREASING SAMPLE SIZE | b | 3
SSVRG AND SGD/SVRG | A CONSTANT PARAMETER | p | 2
SSVRG AND SGD/SVRG | STEP SIZE | η | 1/(1b^p)
SSVRG AND SGD/SVRG | INITIAL BATCH SIZE | k | κ
SSVRG AND SGD/SVRG | NUMBER OF STEPS ON EACH BATCH SIZE | m | κ/η
SGD:.5 | STEP SIZE | η | .5
SGD:.5 | STEP SIZE | η | .5
DYNASAGA | STEP SIZE FOR SAMPLE SIZE m | η(m) | .3/(L+µm)
DYNASAGA | INITIAL BATCH SIZE | k | κ
DYNASAGA | NUMBER OF ITERATIONS ON SAMPLE SIZE m | t(m) | 2
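The DYNASAGA rows of Table 3 (an initial batch size k, t(m) = 2 iterations on each sample size m, and a step size that shrinks as 1/(L+µm)) can be sketched as a simple schedule generator in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the schedule is assumed to increase the sample size by one after every t(m) iterations, each iteration is assumed to draw an index uniformly from the first m points, and the numerator of the step size is passed in as a hypothetical parameter `c` rather than fixed.

```python
import random
from typing import Iterator, Tuple


def adaptive_sample_schedule(
    n: int, k: int, iters_per_size: int = 2, seed: int = 0
) -> Iterator[Tuple[int, int]]:
    """Yield (sample_size, data_index) pairs for an adaptive sample size run.

    Starts at sample size k, spends iters_per_size iterations on each size,
    then grows the sample by one point until it reaches n (an assumption).
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    rng = random.Random(seed)
    for m in range(k, n + 1):
        for _ in range(iters_per_size):
            # Assumed sampling rule: uniform over the first m data points.
            yield m, rng.randrange(m)


def dynasaga_step_size(m: int, L: float, mu: float, c: float) -> float:
    """Step size of the form c / (L + mu * m); c is a hypothetical constant."""
    return c / (L + mu * m)


steps = list(adaptive_sample_schedule(n=10, k=4))
print(len(steps), steps[0][0], steps[-1][0])
print(dynasaga_step_size(m=4, L=1.0, mu=0.1, c=0.3))
```

With two iterations per sample size, the total number of iterations is 2(n - k + 1), so the cost of reaching the full sample grows linearly in n.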


A DOUBLE INCREMENTAL AGGREGATED GRADIENT METHOD WITH LINEAR CONVERGENCE RATE FOR LARGE-SCALE OPTIMIZATION

Arya Mokhtari, Mert Gürbüzbalaban, and Alejandro Ribeiro, Department of Electrical and Systems Engineering,