Optimal Black-Box Reductions Between Optimization Objectives


Optimal Black-Box Reductions Between Optimization Objectives

Zeyuan Allen-Zhu (Princeton University) and Elad Hazan (Princeton University)

arXiv:1603.05642v3 [math.OC], May 2016; first circulated in February 2016. (First appeared on arXiv on March 17, 2016. Corrected a few typos in this most recent version.)

Abstract

The diverse world of machine learning applications has given rise to a plethora of algorithms and optimization methods, finely tuned to the specific regression or classification task at hand. We reduce the complexity of algorithm design for machine learning by reductions: we develop reductions that take a method developed for one setting and apply it to the entire spectrum of smoothness and strong-convexity in applications. Furthermore, unlike existing results, our new reductions are optimal and more practical. We show how these new reductions give rise to new and faster running times on training linear classifiers for various families of loss functions, and conclude with experiments showing their successes also in practice.

1 Introduction

The basic machine learning problem of minimizing a regularizer plus a loss function comes in numerous different variations and names. Examples include Ridge Regression, Lasso, Support Vector Machine (SVM), Logistic Regression and many others. A multitude of optimization methods were introduced for these problems, but in most cases specialized to very particular problem settings. Such specializations appear necessary since objective functions for different classification and regularization tasks admit different convexity and smoothness parameters. We list below a few recent algorithms along with their applicable settings.

- Variance-reduction methods such as SAGA and SVRG [7, 12] intrinsically require the objective to be smooth, and do not work for non-smooth problems like SVM. This is because for loss functions such as hinge loss, no unbiased gradient estimator can achieve a variance that approaches zero.
- Dual methods such as SDCA or APCG [18, 28] intrinsically require the objective to be strongly convex (SC), and do not directly apply to non-SC problems. This is because for a non-SC objective such as Lasso, its dual is not even well-defined.
- Primal-dual methods such as SPDC [32] require the objective to be both smooth and SC.
- Many other algorithms are only analyzed for both smooth and SC objectives [5, 14, 15].

In this paper we investigate whether such specializations are inherent. Is it possible to take a convex optimization algorithm designed for one problem, and apply it to different classification or

regression settings in a black-box manner? Such a reduction should ideally take full and optimal advantage of the objective properties, namely strong-convexity and smoothness, for each setting.

Unfortunately, existing reductions are still very limited for at least two reasons. First, they incur at least a logarithmic factor log(1/ε) in the running time, so they lead only to suboptimal convergence rates. (Footnote 1: Recall that obtaining the optimal convergence rate is one of the main goals in operations research and machine learning. For instance, obtaining the optimal 1/ε rate for online learning was a major breakthrough since the log(1/ε)/ε rate was discovered [11, 13, 24].) Second, after applying existing reductions, algorithms become biased, so the objective value does not converge to the global minimum. These theoretical concerns also translate into running-time losses and parameter-tuning difficulties in practice.

In this paper, we develop new and optimal regularization and smoothing reductions that
- shave off a non-optimal log(1/ε) factor, and
- produce unbiased algorithms.

Besides such technical advantages, our new reductions also enable researchers to focus on designing algorithms for only one setting but infer optimal results more broadly. This is opposed to results such as [4, 23], where the authors develop ad hoc techniques to tweak specific algorithms, rather than all algorithms, and apply them to other settings without losing extra factors and without introducing bias. Our new reductions also enable researchers to prove lower bounds more broadly [30].

1.1 Formal Setting and Classical Approaches

Consider minimizing a composite objective function

    min_{x ∈ ℝ^d} { F(x) := f(x) + ψ(x) },    (1.1)

where f(x) is a differentiable convex function and ψ(x) is a relatively simple (but possibly non-differentiable) convex function, sometimes referred to as the proximal function. Our goal is to find a point x ∈ ℝ^d satisfying F(x) ≤ F(x*) + ε, where x* is a minimizer of F. In most classification and regression problems, f(x) can be written as f(x) = (1/n) Σ_{i=1}^n f_i(⟨x, a_i⟩), where each a_i ∈ ℝ^d is a feature vector. We refer to this as the finite-sum case of (1.1).

Classical Regularization Reduction. Given a non-SC F(x), one can define a new objective F′(x) := F(x) + (σ/2)‖x − x_0‖², in which σ is on the order of ε. In order to minimize F(x), the classical regularization reduction calls an oracle algorithm to minimize F′(x) instead, and this oracle only needs to work with SC functions.

Example. If F is L-smooth, one can apply accelerated gradient descent to minimize F′ and obtain an algorithm that converges in O(√(L/ε) · log(1/ε)) iterations in terms of minimizing the original F. This complexity has a suboptimal dependence on ε and shall be improved using our new regularization reduction.

Classical Smoothing Reduction (finite-sum case). (Footnote 2: The smoothing reduction is typically applied to the finite-sum form only. This is because, for a general high-dimensional function f(x), its smoothed variant f̂(x) may not be efficiently computable.) Given a non-smooth F(x) of a finite-sum form, one can define a smoothed variant

f̂_i(α) = E_{v∼[−1,1]}[f_i(α + εv)] for each f_i, and let F̂(x) = (1/n) Σ_{i=1}^n f̂_i(⟨a_i, x⟩) + ψ(x). (Footnote 3: More formally, one needs this variant to satisfy |f̂_i(α) − f_i(α)| ≤ ε for all α and to be smooth at the same time. This can be done in at least two classical ways if f_i(α) is Lipschitz continuous. One is to define f̂_i(α) = E_{v∼[−1,1]}[f_i(α + εv)] as an integral of f_i over the scaled unit interval, see for instance Chapter 2.3 of [10]; the other is to define f̂_i(α) = max_β { βα − f*_i(β) − (ε/2)β² } using the Fenchel dual f*_i(β) of f_i(α), see for instance [22].) In order to minimize F(x), the classical smoothing reduction calls an oracle algorithm to minimize F̂(x) instead, and this oracle only needs to work with smooth functions.

Example. If F(x) is σ-SC and one applies accelerated gradient descent to minimize F̂, this yields an algorithm that converges in O((1/√(σε)) · log(1/ε)) iterations for minimizing the original F(x). Again, the additional factor log(1/ε) can be removed using our new smoothing reduction.

Besides the non-optimality, applying the above two reductions gives only biased algorithms. One has to tune the regularization or smoothing parameter, and the algorithm only converges to the minimum of the regularized or smoothed problem, which can be away from the true minimizer of F(x) by a distance proportional to the parameter. This makes the reductions hard to use in practice.

1.2 Our New Results

To introduce our new reductions, we first define a property on the oracle algorithm.

Our Black-Box Oracle. Consider an algorithm A that minimizes (1.1) when the objective F is L-smooth and σ-SC. We say that A satisfies the homogeneous objective decrease (HOOD) property in time Time(L, σ) if, for every starting vector x_0, A produces an output x′ satisfying F(x′) − F(x*) ≤ (F(x_0) − F(x*))/4 in time Time(L, σ). In other words, A decreases the objective-value distance to the minimum by a constant factor in time Time(L, σ), regardless of how large or small F(x_0) − F(x*) is. We give a few example algorithms that satisfy HOOD:

- Gradient descent and accelerated gradient descent satisfy HOOD with Time(L, σ) = O(L/σ) · C and Time(L, σ) = O(√(L/σ)) · C respectively, where C is the time needed to compute a gradient ∇f(x) and perform a proximal gradient update [21]. Many subsequent works in this line of research also satisfy HOOD, including [3, 5, 14, 15].
- SVRG and SAGA [12, 31] solve the finite-sum form of (1.1) and satisfy HOOD with Time(L, σ) = O(n + L/σ) · C₁, where C₁ is the time needed to compute a stochastic gradient ∇f_i(x) and perform a proximal gradient update.
- Katyusha [1] solves the finite-sum form of (1.1) and satisfies HOOD with Time(L, σ) = O(n + √(nL/σ)) · C₁.

AdaptReg. For objectives F(x) that are non-SC and L-smooth, our AdaptReg reduction calls an oracle satisfying HOOD a logarithmic number of times, each time with a SC objective F(x) + (σ/2)‖x − x_0‖² for an exponentially decreasing value σ. In the end, AdaptReg produces an output x̂ satisfying F(x̂) − F(x*) ≤ ε with a total running time Σ_{t≥0} Time(L, 2^t ε). Since most algorithms have an inverse polynomial dependence on σ in Time(L, σ), when summing up Time(L, 2^t ε) over values t ≥ 0, we do not incur the additional factor log(1/ε), as opposed to the old reduction.

In addition, AdaptReg is an unbiased and anytime algorithm: F(x̂) converges to F(x*) as time goes on, without the necessity of changing parameters, so the algorithm can be interrupted at any time. We mention some theoretical applications of AdaptReg:

- Applying AdaptReg to SVRG, we obtain a running time O(n log(1/ε) + L/ε) · C₁ for minimizing finite-sum, non-SC, and smooth objectives (such as Lasso and Logistic Regression). This improves on known theoretical running times obtained by non-accelerated methods, including the O((n + L/ε) · log(1/ε)) · C₁ obtained through the old reduction, as well as the O((n + L)/ε) · C₁ obtained through direct methods such as SAGA [7] and SAG [25].
- Applying AdaptReg to Katyusha, we obtain a running time O(n log(1/ε) + √(nL/ε)) · C₁ for minimizing finite-sum, non-SC, and smooth objectives (such as Lasso and Logistic Regression). This is the first and only known stochastic method that converges with the optimal 1/√ε rate (as opposed to log(1/ε)/√ε) for this class of objectives [1].
- Applying AdaptReg to methods that do not originally work for non-SC objectives, such as [5, 14, 15], we improve their running times by a factor of log(1/ε) for working with non-SC objectives.

AdaptSmooth and JointAdaptRegSmooth. For objectives F(x) that are finite-sum, σ-SC, but non-smooth, our AdaptSmooth reduction calls an oracle satisfying HOOD a logarithmic number of times, each time with a smoothed variant F^(λ)(x) of F and an exponentially decreasing smoothing parameter λ. In the end, AdaptSmooth produces an output x̂ satisfying F(x̂) − F(x*) ≤ ε with a total running time Σ_{t≥0} Time(1/(2^t ε), σ). Since most algorithms have a polynomial dependence on L in Time(L, σ), when summing up Time(1/(2^t ε), σ) over values t ≥ 0, we do not incur an additional factor of log(1/ε), as opposed to the old reduction. AdaptSmooth is also an unbiased and anytime algorithm, for the same reason as AdaptReg.

In addition, AdaptReg and AdaptSmooth can effectively work together to solve the finite-sum, non-SC, and non-smooth case of (1.1); we call this reduction JointAdaptRegSmooth. We mention some theoretical applications of AdaptSmooth and JointAdaptRegSmooth:

- Applying AdaptSmooth to Katyusha, we obtain a running time O(n log(1/ε) + √(n/(σε))) · C₁ for minimizing finite-sum, SC, and non-smooth objectives (such as SVM). Therefore, Katyusha combined with AdaptSmooth is the first and only known stochastic method that converges with the optimal 1/√ε rate (as opposed to log(1/ε)/√ε) for this class of objectives [1].
- Applying JointAdaptRegSmooth to Katyusha, we obtain a running time O(n log(1/ε) + √n/ε) · C₁ for minimizing finite-sum, non-SC, and non-smooth objectives (such as ℓ₁-SVM). Therefore, Katyusha combined with JointAdaptRegSmooth is the first and only known stochastic method that converges with the optimal 1/ε rate (as opposed to log(1/ε)/ε) for this class of objectives [1].

Theory vs. Practice. In theory, not all algorithms solving (1.1) satisfy HOOD. Some machine learning algorithms, such as APCG [18], SPDC [32], AccSDCA [29] and SDCA [28], either do not satisfy HOOD or incur an additional log(L/σ) factor in their running times, so they cannot benefit from our new reductions in theory. For example, APCG solves the finite-sum form of (1.1) and produces an output x satisfying F(x) − F(x*) ≤ ε in time O((n + √(nL/σ)) · log(L/(σε))) · C₁. This running time does not have a logarithmic dependence on ε of the form log((F(x_0) − F(x*))/ε). In other words, APCG might in principle take a much longer running time in order to decrease the objective distance to the minimum from 1 to 1/2, as compared to the time needed to decrease it from 10⁻¹⁰ to 10⁻¹⁰/2. Fortunately, although without theoretical guarantee, these methods also benefit from our new reductions, and we include experiments in this paper to confirm such findings.
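To make the removal of the log(1/ε) factor concrete, here is a short calculation (ours, not from the paper), using accelerated gradient descent's Time(L, σ) = O(√(L/σ)) · C as the oracle cost; the same geometric-series argument applies to the other oracles above:

```latex
\sum_{t \ge 0} \mathrm{Time}(L,\, 2^t \varepsilon)
  \;=\; \sum_{t \ge 0} O\!\Big(\sqrt{L/(2^t \varepsilon)}\Big) \cdot C
  \;=\; O\!\Big(\sqrt{L/\varepsilon}\Big) \cdot C \cdot \sum_{t \ge 0} 2^{-t/2}
  \;=\; O\!\Big(\sqrt{L/\varepsilon}\Big) \cdot C .
```

In contrast, the classical reduction runs a single execution at the smallest weight σ ≈ ε and must decrease the objective by a 1/ε factor there, which costs O(√(L/ε) · log(1/ε)) · C.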

Related Works. Catalyst and APPA [9, 17] are reductions that turn non-accelerated methods into accelerated ones. They can be used as regularization reductions too; however, in such a case they become identical to the traditional regularization reduction, and continue to introduce bias and suffer from a log-factor loss in the running time. In fact, Catalyst and APPA fix the regularization parameter throughout the algorithm, but our AdaptReg decreases it exponentially. Therefore, their results cannot imply ours.

PRISMA [23] turns Nesterov's accelerated gradient descent into a method for non-smooth objectives without paying the log factor. However, PRISMA does not apply to all algorithms in a black-box manner, so it is not a reduction. Furthermore, PRISMA requires the algorithm to know the number of iterations in advance, which AdaptSmooth does not.

Roadmap. We include the description and analysis of AdaptReg in Section 3, but only include the description of AdaptSmooth in Section 4. We leave proofs, as well as the description and analysis of JointAdaptRegSmooth, to the appendix. We include experimental results in Section 6.

2 Preliminaries

In this paper we denote by ∇f(x) the full gradient of f if it is differentiable, or a subgradient if f is only Lipschitz continuous. Recall some classical definitions on strong convexity and smoothness.

Definition 2.1 (smoothness and strong convexity). For a convex function f : ℝⁿ → ℝ,
- f is σ-strongly convex if for all x, y ∈ ℝⁿ, it satisfies f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (σ/2)‖x − y‖².
- f is L-smooth if for all x, y ∈ ℝⁿ, it satisfies ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖.

Characterization of SC and Smooth Regimes. In this paper we give numbers to the following categories of objectives F(x) in (1.1). Each of them corresponds to some well-known training problems in machine learning. (Letting (a_i, b_i) ∈ ℝ^d × ℝ be the i-th feature vector and label.)

Case 1: ψ(x) is σ-SC and f(x) is L-smooth. Examples:
- ridge regression: f(x) = (1/2n) Σ_{i=1}^n (⟨a_i, x⟩ − b_i)² and ψ(x) = (σ/2)‖x‖².
- elastic net: f(x) = (1/2n) Σ_{i=1}^n (⟨a_i, x⟩ − b_i)² and ψ(x) = (σ/2)‖x‖² + λ‖x‖₁.

Case 2: ψ(x) is non-SC and f(x) is L-smooth. Examples:
- Lasso: f(x) = (1/2n) Σ_{i=1}^n (⟨a_i, x⟩ − b_i)² and ψ(x) = λ‖x‖₁.
- logistic regression: f(x) = (1/n) Σ_{i=1}^n log(1 + exp(−b_i⟨a_i, x⟩)) and ψ(x) = λ‖x‖₁.

Case 3: ψ(x) is σ-SC and f(x) is non-smooth (but Lipschitz continuous). Example:
- SVM: f(x) = (1/n) Σ_{i=1}^n max{0, 1 − b_i⟨a_i, x⟩} and ψ(x) = (σ/2)‖x‖².

Case 4: ψ(x) is non-SC and f(x) is non-smooth (but Lipschitz continuous). Example:
- ℓ₁-SVM: f(x) = (1/n) Σ_{i=1}^n max{0, 1 − b_i⟨a_i, x⟩} and ψ(x) = λ‖x‖₁.

Definition 2.2 (HOOD property). We say an algorithm A(F, x_0) solving Case 1 of problem (1.1) satisfies the homogeneous objective decrease (HOOD) property with time Time(L, σ) if, for every starting point x_0, it produces an output x′ ← A(F, x_0) such that F(x′) − min_x F(x) ≤ (F(x_0) − min_x F(x))/4 in time Time(L, σ). (Footnote 4: Although our definition is only for deterministic algorithms, if the guarantee is probabilistic, i.e., E[F(x′)] − min_x F(x) ≤ (F(x_0) − min_x F(x))/4, all the results of this paper remain true.)
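For concreteness, here is a minimal sketch (Python/NumPy; the function names are ours, not the paper's) of the loss/regularizer pairs from the four cases above, written in the finite-sum form of (1.1):

```python
import numpy as np

# A is the n-by-d matrix whose rows are the feature vectors a_i; b holds the labels b_i.
# The four regimes are obtained by pairing a loss f with a regularizer psi.

def squared_loss(A, b, x):        # smooth; used in ridge, elastic net, Lasso
    return 0.5 * np.mean((A @ x - b) ** 2)

def logistic_loss(A, b, x):       # smooth; used in logistic regression
    return np.mean(np.logaddexp(0.0, -b * (A @ x)))

def hinge_loss(A, b, x):          # non-smooth but 1-Lipschitz; used in SVM
    return np.mean(np.maximum(0.0, 1.0 - b * (A @ x)))

def l2_reg(x, sigma):             # sigma-strongly convex regularizer
    return 0.5 * sigma * np.dot(x, x)

def l1_reg(x, lam):               # non-strongly-convex regularizer
    return lam * np.abs(x).sum()

# e.g. Case 2 (Lasso):  F = lambda x: squared_loss(A, b, x) + l1_reg(x, lam)
# e.g. Case 3 (SVM):    F = lambda x: hinge_loss(A, b, x) + l2_reg(x, sigma)
```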

Algorithm 1: The AdaptReg Reduction
Input: an objective F(·) in Case 2 (smooth and not necessarily strongly convex); x_0 a starting vector; σ_0 an initial regularization parameter; T the number of epochs; an algorithm A that solves Case 1 of problem (1.1).
Output: x̂_T.
1: x̂_0 ← x_0.
2: for t ← 0 to T − 1 do
3:   Define F^(σ_t)(x) := (σ_t/2)‖x − x_0‖² + F(x).
4:   x̂_{t+1} ← A(F^(σ_t), x̂_t).
5:   σ_{t+1} ← σ_t/2.
6: end for
7: return x̂_T.

In this paper, we denote by C the time needed for computing a full gradient ∇f(x) and performing a proximal gradient update of the form x′ ← arg min_{x′} { ½‖x′ − x‖² + η(⟨∇f(x), x′ − x⟩ + ψ(x′)) }. For the finite-sum case of problem (1.1), we denote by C₁ the time needed for computing a stochastic (sub-)gradient f′_i(⟨a_i, x⟩) and performing a proximal gradient update of the form x′ ← arg min_{x′} { ½‖x′ − x‖² + η(⟨f′_i(⟨a_i, x⟩)a_i, x′ − x⟩ + ψ(x′)) }. For finite-sum forms of (1.1), C is usually on the magnitude of n · C₁.

3 AdaptReg: Reduction from Case 2 to Case 1

We now focus on solving Case 2 of problem (1.1): that is, f(·) is L-smooth, but ψ(·) is not necessarily SC. We achieve so by reducing the problem to an algorithm A solving Case 1 that satisfies HOOD.

AdaptReg works as follows (see Algorithm 1). At the beginning of AdaptReg, we set x̂_0 to equal x_0, an arbitrary given starting vector. AdaptReg consists of T epochs. At each epoch t = 0, 1, ..., T − 1, we define a σ_t-strongly convex objective F^(σ_t)(x) := (σ_t/2)‖x − x_0‖² + F(x). Here, the parameter σ_{t+1} = σ_t/2 for each t, and σ_0 is an input parameter to AdaptReg that will be specified later. We run A on F^(σ_t)(x) with starting vector x̂_t in each epoch, and let the output be x̂_{t+1}. After all T epochs are finished, AdaptReg simply outputs x̂_T. We state our main theorem for AdaptReg below and prove it in Section 3.1.

Theorem 3.1 (AdaptReg). Suppose that in problem (1.1) f(·) is L-smooth. Let x_0 be a starting vector such that F(x_0) − F(x*) ≤ Δ and ‖x_0 − x*‖² ≤ Θ. Then, AdaptReg with σ_0 = Δ/Θ and T = log₂(Δ/ε) produces an output x̂_T satisfying F(x̂_T) − min_x F(x) ≤ O(ε) in a total running time of Σ_{t=0}^{T−1} Time(L, σ_t). (Footnote 5: If the HOOD property is only satisfied probabilistically, as per Footnote 4, our error guarantee becomes probabilistic, i.e., E[F(x̂_T)] − min_x F(x) ≤ O(ε). This is also true for the other reduction theorems of this paper.)

Remark 3.2. We compare the parameter-tuning effort needed for AdaptReg against the classical regularization reduction. In the classical reduction, there are two parameters: T, the number of iterations, which does not need tuning; and σ, which had better equal ε/Θ, an unknown quantity, so it requires tuning. In AdaptReg, we also need to tune only one parameter, namely σ_0. Our T need not be tuned because AdaptReg can be interrupted at any moment, and the x̂_t of the current epoch can be outputted. In our experiments later, we spent the same effort tuning σ in the classical reduction and σ_0 in AdaptReg. As can easily be seen from the plots, tuning σ_0 is much easier than tuning σ.
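The following is a minimal runnable sketch of Algorithm 1 (Python/NumPy; all names are ours). The Case-1 oracle is stubbed with a plain proximal-gradient loop for ψ = λ‖·‖₁, whose iteration count of order L/σ is a heuristic stand-in for any HOOD-satisfying algorithm such as SVRG or Katyusha:

```python
import numpy as np

def soft_threshold(v, t):
    # prox operator of t*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient_oracle(grad_f, lam, L, sigma, x_start, x0):
    # Placeholder Case-1 oracle for F^(sigma)(x) = (sigma/2)||x - x0||^2 + f(x) + lam*||x||_1.
    # Plain proximal gradient descent needs O(L/sigma) steps to cut the objective
    # distance to the minimum by a constant factor (the HOOD property).
    x = x_start.copy()
    eta = 1.0 / (L + sigma)                        # step size for the smooth part
    for _ in range(int(np.ceil(2 * (L + sigma) / sigma))):
        g = grad_f(x) + sigma * (x - x0)           # gradient of the smooth part of F^(sigma)
        x = soft_threshold(x - eta * g, eta * lam)
    return x

def adapt_reg(grad_f, lam, L, x0, sigma0, T):
    # Algorithm 1: one oracle call per epoch, halving the regularization weight.
    x_hat, sigma = x0.copy(), sigma0
    for _ in range(T):
        x_hat = prox_gradient_oracle(grad_f, lam, L, sigma, x_hat, x0)
        sigma /= 2.0
    return x_hat
```

For instance, for Lasso one would pass grad_f as x ↦ Aᵀ(Ax − b)/n and take L to be the largest eigenvalue of AᵀA/n.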

Corollary 3.3.
- When AdaptReg is applied to SVRG, we solve the finite-sum case of Case 2 with running time Σ_{t=0}^{T−1} Time(L, σ_t) = Σ_{t=0}^{T−1} O(n + L·2^t/σ_0) · C₁ = O(n log(Δ/ε) + LΘ/ε) · C₁. This is faster than the O((n + LΘ/ε) · log(Δ/ε)) · C₁ obtained through the old reduction, and faster than the O((n + LΘ)/ε) · C₁ obtained by SAGA [7] and SAG [25].
- When AdaptReg is applied to Katyusha, we solve the finite-sum case of Case 2 with running time Σ_{t=0}^{T−1} Time(L, σ_t) = Σ_{t=0}^{T−1} O(n + √(nL·2^t/σ_0)) · C₁ = O(n log(Δ/ε) + √(nLΘ/ε)) · C₁. This is faster than the O((n + √(nL/ε)) · log(Δ/ε)) · C₁ obtained through the old reduction on Katyusha [1]. (Footnote 6: If the old reduction is applied on APCG, SPDC, or AccSDCA rather than Katyusha, then two log factors will be lost.)

3.1 Convergence Analysis for AdaptReg

For analysis purposes, we define x*_{t+1} to be the exact minimizer of F^(σ_t)(x). The HOOD property of A ensures that

    F^(σ_t)(x̂_{t+1}) − F^(σ_t)(x*_{t+1}) ≤ (1/4) · (F^(σ_t)(x̂_t) − F^(σ_t)(x*_{t+1})).    (3.1)

We denote by x* an arbitrary minimizer of F(x), and the following claim states a simple property about the minimizers of F^(σ_t)(x):

Claim 3.4. We have ‖x*_{t+1} − x*‖ ≤ ‖x_0 − x*‖ for each t.

Proof. By the strong convexity of F^(σ_t)(x) and the fact that x*_{t+1} is its exact minimizer, we have F^(σ_t)(x*_{t+1}) ≤ F^(σ_t)(x*) − (σ_t/2)‖x*_{t+1} − x*‖². Using the fact that F^(σ_t)(x*_{t+1}) ≥ F(x*_{t+1}), as well as the definition F^(σ_t)(x*) = (σ_t/2)‖x_0 − x*‖² + F(x*), we immediately have (σ_t/2)‖x_0 − x*‖² − (σ_t/2)‖x*_{t+1} − x*‖² ≥ F(x*_{t+1}) − F(x*) ≥ 0. ∎

Define D_t := F^(σ_t)(x̂_t) − F^(σ_t)(x*_{t+1}) to be the initial objective distance to the minimum on function F^(σ_t) before we call A in epoch t. At epoch 0, we have the upper bound D_0 = F^(σ_0)(x̂_0) − min_x F^(σ_0)(x) ≤ F(x_0) − F(x*). For each epoch t ≥ 1, we compute that

    D_t  =¹ F^(σ_{t−1})(x̂_t) − ((σ_{t−1} − σ_t)/2)‖x_0 − x̂_t‖² − F^(σ_{t−1})(x*_{t+1}) + ((σ_{t−1} − σ_t)/2)‖x_0 − x*_{t+1}‖²
         ≤² F^(σ_{t−1})(x̂_t) − ((σ_{t−1} − σ_t)/2)‖x_0 − x̂_t‖² − F^(σ_{t−1})(x*_t) − (σ_{t−1}/2)‖x*_t − x*_{t+1}‖² + ((σ_{t−1} − σ_t)/2)‖x_0 − x*_{t+1}‖²
         ≤  F^(σ_{t−1})(x̂_t) − F^(σ_{t−1})(x*_t) + ((σ_{t−1} − σ_t)/2)‖x_0 − x*_{t+1}‖²
         ≤³ F^(σ_{t−1})(x̂_t) − F^(σ_{t−1})(x*_t) + (σ_{t−1} − σ_t)(‖x_0 − x*‖² + ‖x*_{t+1} − x*‖²)
         ≤⁴ F^(σ_{t−1})(x̂_t) − F^(σ_{t−1})(x*_t) + 2(σ_{t−1} − σ_t)‖x_0 − x*‖²
         ≤⁵ D_{t−1}/4 + 2(σ_{t−1} − σ_t)‖x_0 − x*‖²
         =⁶ D_{t−1}/4 + 2σ_t‖x_0 − x*‖².

Above, ① follows from the definitions of F^(σ_t)(·) and F^(σ_{t−1})(·); ② follows from the strong convexity of F^(σ_{t−1})(·) as well as the fact that x*_t is its minimizer; ③ follows because for any two vectors a, b

it satisfies ‖a − b‖² ≤ 2‖a‖² + 2‖b‖²; ④ follows from Claim 3.4; ⑤ follows from the definition of D_{t−1} and (3.1); and ⑥ uses the choice σ_t = σ_{t−1}/2 for t ≥ 1.

Recursively applying the above inequality, we have

    D_T ≤ D_0/4^T + 2‖x_0 − x*‖²(σ_T + σ_{T−1}/4 + σ_{T−2}/4² + ···) ≤ (1/4^T)(F(x_0) − F(x*)) + 4σ_T‖x_0 − x*‖²,    (3.2)

where the second inequality uses our choice σ_t = σ_{t−1}/2. In sum, we obtain a vector x̂_T satisfying

    F(x̂_T) − F(x*) ≤¹ F^(σ_T)(x̂_T) − F^(σ_T)(x*) + (σ_T/2)‖x_0 − x*‖²
                   ≤² F^(σ_T)(x̂_T) − F^(σ_T)(x*_{T+1}) + (σ_T/2)‖x_0 − x*‖²
                   =³ D_T + (σ_T/2)‖x_0 − x*‖²
                   ≤⁴ (1/4^T)(F(x_0) − F(x*)) + 4.5σ_T‖x_0 − x*‖².    (3.3)

Above, ① uses the fact that F^(σ_T)(x) ≥ F(x) for every x; ② uses the definition of x*_{T+1} as the minimizer of F^(σ_T)(·); ③ uses the definition of D_T; and ④ uses (3.2). Finally, after appropriately choosing σ_0 and T, (3.3) directly implies Theorem 3.1.

4 AdaptSmooth: Reduction from Case 3 to Case 1

We now focus on solving the finite-sum form of Case 3 of problem (1.1). That is,

    min_x F(x) = (1/n) Σ_{i=1}^n f_i(⟨a_i, x⟩) + ψ(x),

where ψ(x) is σ-strongly convex and each f_i(·) may not be smooth (but is Lipschitz continuous). Without loss of generality, we assume ‖a_i‖ = 1 for each i ∈ [n], because otherwise one can scale f_i accordingly. We solve this problem by reducing it to an oracle A which solves the finite-sum form of Case 1 and satisfies HOOD. Recall the following definition using the Fenchel conjugate: (Footnote 7: For every explicitly given f_i(·), this Fenchel conjugate can be symbolically computed and fed into the algorithm. This pre-processing is needed for nearly all known algorithms in order for them to apply to non-smooth settings (such as SVRG, SAGA, SPDC, APCG, SDCA, etc.). SGD and its strongly convex variant PEGASOS are the only known methods which do not need this computation; however, they are not accelerated methods.)

Definition 4.1. For each function f_i : ℝ → ℝ, let f*_i(β) := max_α {αβ − f_i(α)} be its Fenchel conjugate. Then, we define the following smoothed variant of f_i, parameterized by λ > 0:

    f_i^(λ)(α) := max_β { βα − f*_i(β) − (λ/2)β² }.

Accordingly, we define

    F^(λ)(x) := (1/n) Σ_{i=1}^n f_i^(λ)(⟨a_i, x⟩) + ψ(x).

From the properties of the Fenchel conjugate (see for instance the textbook [26]), we know that f_i^(λ)(·) is a (1/λ)-smooth function, and therefore the objective F^(λ)(x) falls into the finite-sum form of Case 1 for problem (1.1) with smoothness parameter L = 1/λ.

Our AdaptSmooth works as follows (see Algorithm 2 in Appendix B). At the beginning of AdaptSmooth, we set x̂_0 to equal x_0, an arbitrary given starting vector. AdaptSmooth consists of

T epochs. At each epoch t = 0, 1, ..., T − 1, we define a (1/λ_t)-smooth objective F^(λ_t)(x) using Definition 4.1. Here, the parameter λ_{t+1} = λ_t/2 for each t, and λ_0 is an input parameter to AdaptSmooth that will be specified later. We run A on F^(λ_t)(x) with starting vector x̂_t in each epoch, and let the output be x̂_{t+1}. After all T epochs are finished, AdaptSmooth outputs x̂_T. (Alternatively, if one sets T to be infinity, AdaptSmooth can be interrupted at an arbitrary moment and output the x̂_t of the current epoch.) We state our main theorem for AdaptSmooth below and prove it in Appendix B.

Theorem 4.2. Suppose that in problem (1.1), ψ(·) is σ-strongly convex and each f_i(·) is G-Lipschitz continuous. Let x_0 be a starting vector such that F(x_0) − F(x*) ≤ Δ. Then, AdaptSmooth with λ_0 = Δ/G² and T = log₂(Δ/ε) produces an output x̂_T satisfying F(x̂_T) − min_x F(x) ≤ O(ε) in a total running time of Σ_{t=0}^{T−1} Time(2^t/λ_0, σ).

Remark 4.3. We emphasize that AdaptSmooth requires less parameter-tuning effort than the old reduction, for the same reason as in Remark 3.2. Also, AdaptSmooth, when applied to Katyusha, provides the fastest running time on solving the Case 3 finite-sum form of (1.1), similar to Corollary 3.3.

5 JointAdaptRegSmooth: From Case 4 to Case 1

We show in Appendix C that AdaptReg and AdaptSmooth can work together to reduce the finite-sum form of Case 4 to Case 1. We call this reduction JointAdaptRegSmooth, and it relies on a jointly exponentially decreasing sequence of (σ_t, λ_t), where σ_t is the weight of the convexity parameter that we add on top of F(x), and λ_t is the smoothing parameter that determines how we change each f_i(·). The analysis is analogous to a careful combination of the proofs for AdaptReg and AdaptSmooth.

6 Experiments

We perform experiments to confirm our theoretical speed-ups obtained for AdaptSmooth and AdaptReg. We work on minimizing Lasso and SVM objectives for the following three well-known datasets that can be found on the LibSVM website [8]: covtype, mnist, and rcv1. We defer some dataset and implementation details to Appendix A.

6.1 Experiments on AdaptReg

To test the performance of AdaptReg, consider the Lasso objective, which is non-SC but smooth. We apply AdaptReg to reduce it to Case 1 and apply either APCG [18], an accelerated method, or (Prox-)SDCA [27, 28], a non-accelerated method. Let us make a few remarks:
- APCG and SDCA are both indirect solvers for non-strongly-convex objectives, and therefore regularization is intrinsically required in order to run them for Lasso, or more generally for Case 2. (Footnote 8: Note that some other methods, such as SVRG, although only providing theoretical results for strongly convex and smooth objectives (Case 1), in practice work for Case 2 directly. Therefore, it is not needed to apply AdaptReg to such methods, at least in practice.)
- APCG and SDCA do not satisfy HOOD in theory. However, they still benefit from AdaptReg, as we shall see, demonstrating the practical value of AdaptReg.

A Practical Implementation. In principle, one can implement AdaptReg by setting the termination criteria of the oracle in the inner loop precisely as suggested by the theory, such as setting the number of iterations for SDCA to be exactly O(n + L/σ_t) in the t-th epoch.

[Figure 1 appears here: three panels, (a) covtype, (b) mnist, (c) rcv1.]

Figure 1: Comparing AdaptReg and the classical reduction on Lasso (with ℓ₁ regularizer weight λ). The y-axis is the objective distance to the minimum, and the x-axis is the number of passes of the dataset. The blue solid curves represent APCG under the old regularization reduction, and the red dashed curve represents APCG under AdaptReg. For other values of λ, or the results on SDCA, please refer to Figures 3 and 4 in the appendix.

However, in practice, it is more desirable to automatically terminate the oracle whenever the objective distance to the minimum has been sufficiently decreased. In all of our experiments, we simply compute the duality gap and terminate the oracle whenever the duality gap is below 1/4 times the last recorded duality gap of the previous epoch. For details, see Appendix A.

Experimental Results. For each dataset, we consider three different magnitudes of regularization weights for the ℓ₁ regularizer in the Lasso objective. This totals 9 analysis tasks for each algorithm. For each such task, we first implement the old reduction by adding an additional (σ/2)‖x‖² term to the Lasso objective and then applying APCG or SDCA. We consider values of σ in the set {10^{−k}, 3·10^{−k} : k ∈ ℤ} and show the most representative six of them in the plots (blue solid curves in Figure 3 and Figure 4). Naturally, for a larger value of σ, the old reduction converges faster, but to a point that is farther from the exact minimizer because of the bias. We implement AdaptReg, where we choose the initial parameter σ_0 also from the set {10^{−k}, 3·10^{−k} : k ∈ ℤ}, and present the best one in each of the 18 plots (red dashed curves in Figure 3 and Figure 4). Due to space limitations, we provide only 3 of the 18 plots, for medium-sized λ, in the main body of this paper (see Figure 1), and include Figures 3 and 4 only in the appendix. It is clear from our experiments that
- AdaptReg is more efficient than the old regularization reduction;
- AdaptReg requires no more parameter tuning than the classical reduction;
- AdaptReg is unbiased, so it simplifies the parameter selection procedure. (Footnote 9: It is easy to determine the best σ_0 in AdaptReg; in contrast, in the old reduction, if the desired error is somehow changed for the application, one has to select a different σ and restart the algorithm.)

6.2 Experiments on AdaptSmooth

To test the performance of AdaptSmooth, consider the SVM objective, which is non-smooth but SC. We apply AdaptSmooth to reduce it to Case 1 and apply SVRG [12]. We emphasize that SVRG is an indirect solver for non-smooth objectives, and therefore smoothing is intrinsically required in order to run SVRG for SVM, or more generally for Case 3. (Footnote 10: Note that some other methods, such as APCG or SDCA, although only providing theoretical guarantees for strongly convex and smooth objectives (Case 1), in practice work for Case 3 directly without smoothing (see for instance the discussion in [27]). Therefore, it is unnecessary to apply AdaptSmooth to such methods, at least in practice.)
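For the SVM experiments, the smoothing of Definition 4.1 has a simple closed form for the hinge loss. Below is a small sketch (Python; the naming is ours) of f^(λ) for f(α) = max{0, 1 − α}, whose Fenchel conjugate is f*(β) = β on β ∈ [−1, 0]; the resulting function is (1/λ)-smooth and lies within λG²/2 of the hinge loss (here G = 1), matching Lemma B.1:

```python
def smoothed_hinge(alpha, lam):
    # f^(lambda)(alpha) = max_{beta in [-1, 0]} { beta*alpha - beta - (lam/2)*beta^2 }
    if alpha >= 1.0:                   # maximizer beta* = 0
        return 0.0
    if alpha <= 1.0 - lam:             # maximizer beta* = -1
        return 1.0 - alpha - lam / 2.0
    return (1.0 - alpha) ** 2 / (2.0 * lam)   # interior maximizer beta* = (alpha - 1)/lam
```

The three branches agree at the boundaries (both middle expressions equal λ/2 at α = 1 − λ), so the function is continuously differentiable, as the theory requires.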

[Figure 2 appears here: three panels, (a) covtype, (b) mnist, (c) rcv1.]

Figure 2: Comparing AdaptSmooth and the classical reduction on SVM (with ℓ₂ regularizer weight σ). The y-axis is the objective distance to the minimum, and the x-axis is the number of passes of the dataset. The blue solid curves represent SVRG under the old smoothing reduction, and the red dashed curve represents SVRG under AdaptSmooth. For other values of σ, please refer to Figure 5 in the appendix.

A Practical Implementation. In principle, one can implement AdaptSmooth by setting the termination criteria of the oracle in the inner loop precisely as suggested by the theory, such as setting the number of iterations for SVRG to be exactly O(n + 1/(σλ_t)) in the t-th epoch. In practice, however, it is more desirable to automatically terminate the oracle whenever the objective distance to the minimum has been sufficiently decreased. In all of our experiments, we simply compute the Euclidean norm of the full gradient of the objective, and terminate the oracle whenever this norm is below 1/3 times the last recorded Euclidean norm of the previous epoch. For details, see Appendix A.

Experimental Results. For each dataset, we consider three different magnitudes of regularization weights for the ℓ₂ regularizer in the SVM objective. This totals 9 analysis tasks. For each such task, we first implement the old reduction by smoothing the hinge loss functions (using Definition 4.1) with parameter λ > 0 and then applying SVRG. We consider different values of λ in the set {10^{−k}, 3·10^{−k} : k ∈ ℤ} and show the most representative six of them in the plots (blue solid curves in Figure 5). Naturally, for a larger λ, the old reduction converges faster, but to a point that is farther from the exact minimizer due to its bias. We then implement AdaptSmooth, where we choose the initial smoothing parameter λ_0 also from the set {10^{−k}, 3·10^{−k} : k ∈ ℤ}, and present the best one in each of the 9 plots (red dashed curves in Figure 5). Due to space limitations, we provide only 3 of the 9 plots, for small-sized σ, in the main body of this paper (see Figure 2), and include Figure 5 only in the appendix. It is clear from our experiments that
- AdaptSmooth is more efficient than the old smoothing reduction, especially when the desired training error is small;
- AdaptSmooth requires no more parameter tuning than the classical reduction;
- AdaptSmooth is unbiased, so it simplifies parameter selection, for the same reason as in Footnote 9.

Acknowledgements

We thank Yang Yuan for very enlightening conversations, and Alon Gonen for catching a few typos in an earlier version of this paper. This paper is partially supported by a Microsoft Research Grant.

References

[1] Zeyuan Allen-Zhu. Katyusha: Accelerated Variance Reduction for Faster SGD. ArXiv e-prints, abs/1603.05953, March 2016.
[2] Zeyuan Allen-Zhu and Lorenzo Orecchia. Linear coupling: An ultimate unification of gradient and mirror descent. ArXiv e-prints, abs/1407.1537, July 2014.
[3] Zeyuan Allen-Zhu, Peter Richtárik, Zheng Qu, and Yang Yuan. Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling. In ICML, 2016.
[4] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for Non-Strongly-Convex or Sum-of-Non-Convex Objectives. In ICML, 2016.
[5] Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov's accelerated gradient descent. ArXiv e-prints, abs/1506.08187, June 2015.
[6] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
[7] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In NIPS, 2014.
[8] Rong-En Fan and Chih-Jen Lin. LIBSVM Data: Classification, Regression and Multi-label. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
[9] Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, volume 37, 2015.
[10] Elad Hazan. DRAFT: Introduction to online convex optimization. Foundations and Trends in Machine Learning, XX(XX):1–168, 2015.
[11] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
[12] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, NIPS 2013, pages 315–323, 2013.
[13] Simon Lacoste-Julien, Mark W. Schmidt, and Francis R. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. ArXiv e-prints, abs/1212.2002, 2012.
[14] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In FOCS, pages 147–156. IEEE, 2013.
[15] Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. CoRR, abs/1408.3595, 2014.
[16] Hongzhou Lin. Private communication, 2016.

[17] Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A Universal Catalyst for First-Order Optimization. In NIPS, 2015.
[18] Qihang Lin, Zhaosong Lu, and Lin Xiao. An Accelerated Proximal Coordinate Gradient Method and its Application to Regularized Empirical Risk Minimization. In NIPS, pages 3059–3067, 2014.
[19] Arkadi Nemirovski. Prox-Method with Rate of Convergence O(1/t) for Variational Inequalities with Lipschitz Continuous Monotone Operators and Smooth Convex-Concave Saddle Point Problems. SIAM Journal on Optimization, 15(1):229–251, January 2004.
[20] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). In Doklady AN SSSR (translated as Soviet Mathematics Doklady), volume 269, pages 543–547, 1983.
[21] Yurii Nesterov. Introductory Lectures on Convex Programming, Volume I: A Basic Course. Kluwer Academic Publishers, 2004.
[22] Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127–152, December 2005.
[23] Francesco Orabona, Andreas Argyriou, and Nathan Srebro. PRISMA: Proximal iterative smoothing algorithm. arXiv preprint arXiv:1206.2372, 2012.
[24] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, 2012.
[25] Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint arXiv:1309.2388, 2013. Preliminary version appeared in NIPS 2012.
[26] Shai Shalev-Shwartz. Online Learning and Online Convex Optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.
[27] Shai Shalev-Shwartz and Tong Zhang. Proximal Stochastic Dual Coordinate Ascent. arXiv preprint arXiv:1211.2717, pages 1–18, 2012.
[28] Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14:567–599, 2013.
[29] Shai Shalev-Shwartz and Tong Zhang. Accelerated Proximal Stochastic Dual Coordinate Ascent for Regularized Loss Minimization. In ICML, pages 64–72, 2014.
[30] Blake Woodworth and Nati Srebro. Tight Complexity Bounds for Optimizing Composite Objectives. Working manuscript, 2016.
[31] Lin Xiao and Tong Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[32] Yuchen Zhang and Lin Xiao. Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization. In ICML, 2015.

Appendix

A Experiment Details

The datasets we used in this paper are downloaded from the LibSVM website [8]:
- the covtype (binary.scale) dataset (581,012 samples and 54 features);
- the mnist (class 1) dataset (60,000 samples and 784 features);
- the rcv1 (train.binary) dataset (20,242 samples and 47,236 features).

To make comparison across datasets easier, we scale every vector by the average Euclidean norm of all the vectors in the dataset. In other words, we ensure that the data vectors have an average Euclidean norm of 1 (see the sketch after Figure 3). This step is for comparison only and is not necessary in practice.

We use the default step-length choice for APCG, which requires solving a quadratic univariate function per iteration; for SDCA, to avoid the issue of tuning step lengths, we use the steepest descent (i.e., automatic) choice, which is Option I for SDCA [27]; for SVRG, we use the default step length η = 1/L.

[Figure 3 appears here: nine panels, (a)–(c) covtype, (d)–(f) mnist, (g)–(i) rcv1, one per regularization weight λ.]

Figure 3: Performance comparison for Lasso with weight λ on the ℓ₁ regularizer. The y-axis represents the objective distance to the minimum, and the x-axis represents the number of passes of the dataset. The blue solid curves represent APCG under the old regularization reduction, and the red dashed curve represents APCG under AdaptReg.
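The normalization step described above can be stated in two lines; here is a sketch (Python/NumPy; the naming is ours) of scaling a data matrix so that its rows have average Euclidean norm 1:

```python
import numpy as np

def normalize_average_norm(A):
    # Scale the whole dataset by the average Euclidean norm of its rows,
    # so that the data vectors have an average Euclidean norm of 1.
    avg_norm = np.linalg.norm(A, axis=1).mean()
    return A / avg_norm
```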

[Figure 4 appears here: nine panels, (a)–(c) covtype, (d)–(f) mnist, (g)–(i) rcv1, one per regularization weight λ.]

Figure 4: Performance comparison for Lasso with weight λ on the ℓ₁ regularizer. The y-axis represents the objective distance to the minimum, and the x-axis represents the number of passes of the dataset. The blue solid curves represent SDCA under the old regularization reduction, and the red dashed curve represents SDCA under AdaptReg.

When applying our reductions, it is desirable to automatically terminate the oracle whenever the objective distance to the minimum has been sufficiently decreased, say, by a factor of 4. Unfortunately, the oracle usually does not know the exact minimizer and cannot compute the exact objective distance to the minimum (i.e., D_t). Instead, we use the following heuristics, which were also used by other reduction methods such as Catalyst [16].

- Since SDCA and APCG are primal-dual methods, in our experiments we compute instead the duality gap, which gives a reasonable approximation of D_t. More specifically, for both experiments, we compute the duality gap every n/3 iterations inside the implementation of APCG/SDCA, and terminate it whenever the duality gap is below 1/4 times the last recorded duality gap of the previous epoch. Although one can further tune this parameter 1/4 for better performance, to perform a fair comparison we simply set it to be identically 1/4 across all the datasets and analysis tasks.
- When applying SVRG to Lasso, we cannot compute the duality gap because the objective is not strongly convex. In our experiments, we compute instead the Euclidean norm of the full gradient of the objective (i.e., ‖∇F(x)‖), which gives a reasonable approximation of D_t. More specifically, we use the default setting of SVRG Option I.

[Figure 5 appears here: nine panels, (a)–(c) covtype, (d)–(f) mnist, (g)–(i) rcv1, one per regularization weight σ.]

Figure 5: Performance comparison for ℓ₂-SVM with weight σ on the ℓ₂ regularizer. The y-axis represents the objective distance to the minimum, and the x-axis represents the number of passes of the dataset. The blue solid curves represent SVRG under the old smoothing reduction, and the red dashed curve represents SVRG under AdaptSmooth.

That is, SVRG computes a gradient snapshot every n iterations. When a gradient snapshot is computed, we can also compute its Euclidean norm almost for free. If this norm is below 1/3 times the last norm-of-gradient of the previous epoch, we terminate SVRG for the current epoch. Note that one can further tune this parameter 1/3 for better performance; however, to perform a fair comparison in this paper, we simply set it to be identically 1/3 across all the datasets and analysis tasks.

B Convergence Analysis for AdaptSmooth

We first recall the following property, which bounds the difference between f_i^(λ) and f_i as a function of λ:

Lemma B.1. If each f_i(·) is G-Lipschitz continuous, it satisfies f_i(α) − λG²/2 ≤ f_i^(λ)(α) ≤ f_i(α).

Proof. Letting β* = arg max_β {βα − f*_i(β)}, we have β* ∈ [−G, G], because the domain of f*_i(·) equals the range of f′_i(·), which is a subset of [−G, G] due to the Lipschitz continuity of f_i(·). As

Algorithm 2: The AdaptSmooth Reduction
Input: an objective F(·) in the finite-sum form of Case 3 (strongly convex and not necessarily smooth); x_0 a starting vector; λ_0 an initial smoothing parameter; T the number of epochs; an algorithm A that solves the finite-sum form of Case 1 for problem (1.1).
Output: x̂_T.
1: x̂_0 ← x_0.
2: for t ← 0 to T − 1 do
3:   Define F^(λ_t)(x) := (1/n) Σ_{i=1}^n f_i^(λ_t)(⟨a_i, x⟩) + ψ(x), using Definition 4.1.
4:   x̂_{t+1} ← A(F^(λ_t), x̂_t).
5:   λ_{t+1} ← λ_t/2.
6: end for
7: return x̂_T.

a result, we have

    f_i(α) = max_β {βα − f*_i(β)} = β*α − f*_i(β*) − (λ/2)(β*)² + (λ/2)(β*)²
           ≤ max_β {βα − f*_i(β) − (λ/2)β²} + (λ/2)(β*)² = f_i^(λ)(α) + (λ/2)(β*)² ≤ f_i^(λ)(α) + λG²/2.

The other inequality is obvious. ∎

We also note that:

Fact B.2. For λ₁ ≥ λ₂ ≥ 0, we have f_i^(λ₁)(α) ≤ f_i^(λ₂)(α) for every α ∈ ℝ.

For analysis purposes only, we define x*_{t+1} to be the exact minimizer of F^(λ_t)(x). The HOOD property of the given oracle A ensures that

    F^(λ_t)(x̂_{t+1}) − F^(λ_t)(x*_{t+1}) ≤ (1/4) · (F^(λ_t)(x̂_t) − F^(λ_t)(x*_{t+1})).    (B.1)

We denote by x* the minimizer of F(x), and define D_t := F^(λ_t)(x̂_t) − F^(λ_t)(x*_{t+1}) to be the initial objective distance to the minimum on function F^(λ_t)(·) before we call A in epoch t. At epoch 0, we simply have the upper bound

    D_0 = F^(λ_0)(x̂_0) − F^(λ_0)(x*_1) ≤ F(x_0) − F(x*_1) + λ_0G²/2 ≤ F(x_0) − F(x*) + λ_0G²/2.

Above, the first inequality is by Lemma B.1 and Fact B.2, and the second inequality is because x* is the minimizer of F(·). Next, for each epoch t ≥ 1, we compute that

    D_t := F^(λ_t)(x̂_t) − F^(λ_t)(x*_{t+1}) ≤ F^(λ_{t−1})(x̂_t) + λ_{t−1}G²/2 − F^(λ_{t−1})(x*_{t+1})
         ≤ F^(λ_{t−1})(x̂_t) + λ_{t−1}G²/2 − F^(λ_{t−1})(x*_t) ≤ D_{t−1}/4 + λ_{t−1}G²/2.

Above, the first inequality is by Lemma B.1 and Fact B.2; the second inequality is because x*_t is the minimizer of F^(λ_{t−1})(·); and the last inequality uses (B.1).

Therefore, by telescoping the above inequality with the choice λ_t = λ_{t−1}/2, we have

    D_T ≤ (F(x_0) − F(x*))/4^T + (G²/2)·(λ_{T−1} + λ_{T−2}/4 + ···) + λ_0G²/(2·4^T)
        ≤ (F(x_0) − F(x*))/4^T + 2λ_TG².

In sum, we obtain a vector x̂_T satisfying

    F(x̂_T) − F(x*) ≤ F^(λ_T)(x̂_T) − F^(λ_T)(x*) + λ_TG²/2
                   ≤ F^(λ_T)(x̂_T) − F^(λ_T)(x*_{T+1}) + λ_TG²/2
                   = D_T + λ_TG²/2
                   ≤ (1/4^T)(F(x_0) − F(x*)) + 2.5λ_TG².    (B.2)

Finally, after appropriately choosing λ_0 and T, (B.2) directly implies Theorem 4.2.

C JointAdaptRegSmooth: Reduction from Case 4 to Case 1

In this section, we show that AdaptReg and AdaptSmooth can work together to solve the finite-sum form of Case 4. That is,

    min_x F(x) = (1/n) Σ_{i=1}^n f_i(⟨a_i, x⟩) + ψ(x),

where ψ(x) is not necessarily strongly convex and each f_i(·) may not be smooth (but is Lipschitz continuous). Without loss of generality, we assume ‖a_i‖ = 1 for each i ∈ [n]. We solve this problem by reducing it to an algorithm A solving the finite-sum form of Case 1 that satisfies HOOD.

Following the same definition of f_i^(λ)(·) in Definition 4.1, in this section we consider the following regularized, smoothed objective F^(λ,σ)(x):

Definition C.1. Given parameters λ, σ > 0, let

    F^(λ,σ)(x) := (1/n) Σ_{i=1}^n f_i^(λ)(⟨a_i, x⟩) + ψ(x) + (σ/2)‖x − x_0‖².

From this definition, we know that F^(λ,σ)(x) falls into the finite-sum form of Case 1 for problem (1.1), with L = 1/λ the smoothness parameter and σ the strong convexity parameter.

JointAdaptRegSmooth works as follows (see Algorithm 3). At the beginning of the reduction, we set x̂_0 to equal x_0, an arbitrary given starting vector. JointAdaptRegSmooth consists of T epochs. At each epoch t = 0, 1, ..., T − 1, we define a (1/λ_t)-smooth, σ_t-strongly convex objective F^(λ_t,σ_t)(x) using Definition C.1 above. Here, the parameters λ_{t+1} = λ_t/2 and σ_{t+1} = σ_t/2 for each t, and λ_0, σ_0 are two input parameters to JointAdaptRegSmooth that will be specified later. We run A on F^(λ_t,σ_t)(x) with starting vector x̂_t in each epoch, and let the output be x̂_{t+1}. After all T epochs are finished, JointAdaptRegSmooth simply outputs x̂_T. (Alternatively, if one sets T to be infinity, JointAdaptRegSmooth can be interrupted at an arbitrary moment and output the x̂_t of the current epoch.) We state our main theorem for JointAdaptRegSmooth below and prove it in Appendix D.

Theorem C.2. Suppose that in problem (1.1), each f_i(·) is G-Lipschitz continuous. Let x_0 be a starting vector such that F(x_0) − F(x*) ≤ Δ and ‖x_0 − x*‖² ≤ Θ. Then, JointAdaptRegSmooth with λ_0 = Δ/G², σ_0 = Δ/Θ and T = log₂(Δ/ε) produces an output x̂_T satisfying F(x̂_T) − min_x F(x) ≤ O(ε) in a total running time of Σ_{t=0}^{T−1} Time(2^t/λ_0, σ_t).
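A minimal sketch of this joint loop (Python; the naming is ours, and A stands for any HOOD-satisfying Case-1 oracle applied to the objective F^(λ_t,σ_t) of Definition C.1) simply combines the AdaptReg and AdaptSmooth updates:

```python
def joint_adapt_reg_smooth(A, x0, lambda0, sigma0, T):
    # Halve the smoothing parameter and the regularization weight jointly,
    # calling the Case-1 oracle once per epoch on F^(lambda_t, sigma_t).
    x_hat, lam, sigma = x0.copy(), lambda0, sigma0
    for t in range(T):
        x_hat = A(lam, sigma, x_hat)   # oracle sees a (1/lam)-smooth, sigma-SC objective
        lam, sigma = lam / 2.0, sigma / 2.0
    return x_hat
```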

Algorithm 3: The JointAdaptRegSmooth Reduction
Input: an objective F(·) in the finite-sum form of Case 4 (not necessarily strongly convex or smooth); x_0 a starting vector; λ_0, σ_0 initial smoothing and regularization parameters; T the number of epochs; an algorithm A that solves the finite-sum form of Case 1 for problem (1.1).
Output: x̂_T.
1: x̂_0 ← x_0.
2: for t ← 0 to T − 1 do
3:   Define F^(λ_t,σ_t)(x) := (1/n) Σ_{i=1}^n f_i^(λ_t)(⟨a_i, x⟩) + ψ(x) + (σ_t/2)‖x − x_0‖², using Definition C.1.
4:   x̂_{t+1} ← A(F^(λ_t,σ_t), x̂_t).
5:   σ_{t+1} ← σ_t/2, λ_{t+1} ← λ_t/2.
6: end for
7: return x̂_T.

Example C.3. When JointAdaptRegSmooth is applied to an accelerated gradient descent method such as [2, 5, 14, 15, 20], we solve the finite-sum form of Case 4 with a total running time Σ_{t=0}^{T−1} Time(2^t/λ_0, σ_t) = O(Time(1/λ_T, σ_T)) = O(G√Θ/ε) · C. This matches the best known running time of full-gradient first-order methods on solving Case 4, which is usually obtained via saddle-point based methods such as Chambolle-Pock [6] or the mirror-prox method of Nemirovski [19].

D Convergence Analysis for JointAdaptRegSmooth

For analysis purposes only, we define x*_{t+1} to be the exact minimizer of F^(λ_t,σ_t)(x). The HOOD property of the given oracle A ensures that

    F^(λ_t,σ_t)(x̂_{t+1}) − F^(λ_t,σ_t)(x*_{t+1}) ≤ (1/4) · (F^(λ_t,σ_t)(x̂_t) − F^(λ_t,σ_t)(x*_{t+1})).

We denote by x* an arbitrary minimizer of F(x). The following claim states a simple property about the minimizers of F^(λ_t,σ_t)(x), analogous to Claim 3.4:

Claim D.1. We have (σ_t/2)‖x*_{t+1} − x*‖² ≤ (σ_t/2)‖x_0 − x*‖² + λ_tG²/2 for each t.

Proof. By the strong convexity of F^(λ_t,σ_t)(x) and the fact that x*_{t+1} is its exact minimizer, we have F^(λ_t,σ_t)(x*_{t+1}) ≤ F^(λ_t,σ_t)(x*) − (σ_t/2)‖x*_{t+1} − x*‖². Using the fact that F^(λ_t,σ_t)(x*_{t+1}) ≥ F^(λ_t,0)(x*_{t+1}) ≥ F(x*_{t+1}) − λ_tG²/2 (where the second inequality follows from Lemma B.1), as well as F^(λ_t,σ_t)(x*) = F^(λ_t,0)(x*) + (σ_t/2)‖x_0 − x*‖² ≤ F(x*) + (σ_t/2)‖x_0 − x*‖² (where the second inequality again follows from Lemma B.1), we immediately have

    λ_tG²/2 + (σ_t/2)‖x_0 − x*‖² − (σ_t/2)‖x*_{t+1} − x*‖² ≥ F(x*_{t+1}) − F(x*) ≥ 0. ∎

Let D_t := F^(λ_t,σ_t)(x̂_t) − F^(λ_t,σ_t)(x*_{t+1}) be the initial objective distance to the minimum on function F^(λ_t,σ_t)(·) before we call A in epoch t. At epoch 0, we simply have the upper bound

    D_0 = F^(λ_0,σ_0)(x̂_0) − F^(λ_0,σ_0)(x*_1) ≤¹ F^(0,σ_0)(x_0) − F^(λ_0,0)(x*_1) ≤² F(x_0) − F(x*_1) + λ_0G²/2 ≤³ F(x_0) − F(x*) + λ_0G²/2.

Above, ① uses F^(λ_0,σ_0)(x_0) ≤ F^(0,σ_0)(x_0), which is a consequence of Fact B.2, together with F^(λ_0,σ_0)(x*_1) ≥ F^(λ_0,0)(x*_1); ② uses F^(0,σ_0)(x_0) = F(x_0) from the definition (since x̂_0 = x_0) and F(x*_1) ≤ F^(λ_0,0)(x*_1) + λ_0G²/2 from Lemma B.1; and ③ uses the minimality of x*. Next, for each epoch t ≥ 1, we compute that

    D_t := F^(λ_t,σ_t)(x̂_t) − F^(λ_t,σ_t)(x*_{t+1})
        ≤¹ F^(λ_t,σ_{t−1})(x̂_t) − F^(λ_t,σ_{t−1})(x*_{t+1}) + ((σ_{t−1} − σ_t)/2)‖x*_{t+1} − x_0‖²
        ≤² F^(λ_{t−1},σ_{t−1})(x̂_t) + λ_{t−1}G²/2 − F^(λ_{t−1},σ_{t−1})(x*_{t+1}) + (σ_t/2)‖x*_{t+1} − x_0‖²
        ≤³ F^(λ_{t−1},σ_{t−1})(x̂_t) + λ_{t−1}G²/2 − F^(λ_{t−1},σ_{t−1})(x*_{t+1}) + σ_t‖x*_{t+1} − x*‖² + σ_t‖x_0 − x*‖²
        ≤⁴ F^(λ_{t−1},σ_{t−1})(x̂_t) + λ_{t−1}G²/2 − F^(λ_{t−1},σ_{t−1})(x*_t) + 2σ_t‖x_0 − x*‖² + λ_tG²
        ≤⁵ D_{t−1}/4 + 2σ_t‖x_0 − x*‖² + 2λ_tG².

Above, ① follows from the definition; ② follows from Lemma B.1 and Fact B.2, as well as the choice σ_{t−1} = 2σ_t; ③ follows because for any two vectors a, b it satisfies ‖a − b‖² ≤ 2‖a‖² + 2‖b‖²; ④ follows from Claim D.1 and the minimality of x*_t; and ⑤ follows from the definition of D_{t−1}, from the HOOD property, and from the choice λ_{t−1} = 2λ_t.

By telescoping the above inequality, we have

    D_T ≤ (F(x_0) − F(x*))/4^T + 2G²(λ_T + λ_{T−1}/4 + ···) + 2‖x_0 − x*‖²(σ_T + σ_{T−1}/4 + ···)
        ≤ (1/4^T)(F(x_0) − F(x*)) + 4λ_TG² + 4σ_T‖x_0 − x*‖²,

where the second inequality uses our choices λ_t = λ_{t−1}/2 and σ_t = σ_{t−1}/2 again. In sum, we obtain a vector x̂_T satisfying

    F(x̂_T) − F(x*) ≤¹ F^(λ_T,0)(x̂_T) − F^(0,σ_T)(x*) + λ_TG²/2 + (σ_T/2)‖x_0 − x*‖²
                   ≤² F^(λ_T,σ_T)(x̂_T) − F^(λ_T,σ_T)(x*) + λ_TG²/2 + (σ_T/2)‖x_0 − x*‖²
                   ≤³ F^(λ_T,σ_T)(x̂_T) − F^(λ_T,σ_T)(x*_{T+1}) + λ_TG²/2 + (σ_T/2)‖x_0 − x*‖²
                   ≤⁴ (1/4^T)(F(x_0) − F(x*)) + 4.5λ_TG² + 4.5σ_T‖x_0 − x*‖².    (D.1)

Above, ① uses Lemma B.1 and the definitions; ② uses the monotonicity from Fact B.2; ③ uses the definition of x*_{T+1} as the minimizer of F^(λ_T,σ_T)(·); and ④ uses the definition of D_T and our derived upper bound. Finally, after appropriately choosing σ_0, λ_0 and T, (D.1) immediately implies Theorem C.2.
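As a sanity check of this final step (our own calculation, spelling out the "appropriately choosing" remark): with σ_0 = Δ/Θ, λ_0 = Δ/G² and T = log₂(Δ/ε), where Δ ≥ F(x_0) − F(x*) and Θ ≥ ‖x_0 − x*‖², each of the three terms in (D.1) is O(ε):

```latex
\frac{F(x_0)-F(x^*)}{4^T} \le \frac{\Delta}{(\Delta/\varepsilon)^2} = \frac{\varepsilon^2}{\Delta} \le \varepsilon,
\qquad
4.5\,\lambda_T G^2 = \frac{4.5\,\lambda_0 G^2}{2^T} = 4.5\,\varepsilon,
\qquad
4.5\,\sigma_T \|x_0-x^*\|^2 \le \frac{4.5\,\sigma_0 \Theta}{2^T} = 4.5\,\varepsilon,
```

so F(x̂_T) − F(x*) ≤ O(ε), as claimed in Theorem C.2 (assuming Δ ≥ ε; otherwise x_0 itself already satisfies the guarantee).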


More information

Random Variables. b 2.

Random Variables. b 2. Random Varables Generally the object of an nvestgators nterest s not necessarly the acton n the sample space but rather some functon of t. Techncally a real valued functon or mappng whose doman s the sample

More information

Fiera Capital s CIA Accounting Discount Rate Curve Implementation Note. Fiera Capital Corporation

Fiera Capital s CIA Accounting Discount Rate Curve Implementation Note. Fiera Capital Corporation Fera aptal s IA Accountng Dscount Rate urve Implementaton Note Fera aptal orporaton November 2016 Ths document s provded for your prvate use and for nformaton purposes only as of the date ndcated heren

More information

Fast Laplacian Solvers by Sparsification

Fast Laplacian Solvers by Sparsification Spectral Graph Theory Lecture 19 Fast Laplacan Solvers by Sparsfcaton Danel A. Spelman November 9, 2015 Dsclamer These notes are not necessarly an accurate representaton of what happened n class. The notes

More information

Evaluating Performance

Evaluating Performance 5 Chapter Evaluatng Performance In Ths Chapter Dollar-Weghted Rate of Return Tme-Weghted Rate of Return Income Rate of Return Prncpal Rate of Return Daly Returns MPT Statstcs 5- Measurng Rates of Return

More information

Linear Combinations of Random Variables and Sampling (100 points)

Linear Combinations of Random Variables and Sampling (100 points) Economcs 30330: Statstcs for Economcs Problem Set 6 Unversty of Notre Dame Instructor: Julo Garín Sprng 2012 Lnear Combnatons of Random Varables and Samplng 100 ponts 1. Four-part problem. Go get some

More information

OPERATIONS RESEARCH. Game Theory

OPERATIONS RESEARCH. Game Theory OPERATIONS RESEARCH Chapter 2 Game Theory Prof. Bbhas C. Gr Department of Mathematcs Jadavpur Unversty Kolkata, Inda Emal: bcgr.umath@gmal.com 1.0 Introducton Game theory was developed for decson makng

More information

Financial mathematics

Financial mathematics Fnancal mathematcs Jean-Luc Bouchot jean-luc.bouchot@drexel.edu February 19, 2013 Warnng Ths s a work n progress. I can not ensure t to be mstake free at the moment. It s also lackng some nformaton. But

More information

Note on Cubic Spline Valuation Methodology

Note on Cubic Spline Valuation Methodology Note on Cubc Splne Valuaton Methodology Regd. Offce: The Internatonal, 2 nd Floor THE CUBIC SPLINE METHODOLOGY A model for yeld curve takes traded yelds for avalable tenors as nput and generates the curve

More information

SIMPLE FIXED-POINT ITERATION

SIMPLE FIXED-POINT ITERATION SIMPLE FIXED-POINT ITERATION The fed-pont teraton method s an open root fndng method. The method starts wth the equaton f ( The equaton s then rearranged so that one s one the left hand sde of the equaton

More information

A Set of new Stochastic Trend Models

A Set of new Stochastic Trend Models A Set of new Stochastc Trend Models Johannes Schupp Longevty 13, Tape, 21 th -22 th September 2017 www.fa-ulm.de Introducton Uncertanty about the evoluton of mortalty Measure longevty rsk n penson or annuty

More information

Survey of Math Test #3 Practice Questions Page 1 of 5

Survey of Math Test #3 Practice Questions Page 1 of 5 Test #3 Practce Questons Page 1 of 5 You wll be able to use a calculator, and wll have to use one to answer some questons. Informaton Provded on Test: Smple Interest: Compound Interest: Deprecaton: A =

More information

A MODEL OF COMPETITION AMONG TELECOMMUNICATION SERVICE PROVIDERS BASED ON REPEATED GAME

A MODEL OF COMPETITION AMONG TELECOMMUNICATION SERVICE PROVIDERS BASED ON REPEATED GAME A MODEL OF COMPETITION AMONG TELECOMMUNICATION SERVICE PROVIDERS BASED ON REPEATED GAME Vesna Radonć Đogatovć, Valentna Radočć Unversty of Belgrade Faculty of Transport and Traffc Engneerng Belgrade, Serba

More information

Games and Decisions. Part I: Basic Theorems. Contents. 1 Introduction. Jane Yuxin Wang. 1 Introduction 1. 2 Two-player Games 2

Games and Decisions. Part I: Basic Theorems. Contents. 1 Introduction. Jane Yuxin Wang. 1 Introduction 1. 2 Two-player Games 2 Games and Decsons Part I: Basc Theorems Jane Yuxn Wang Contents 1 Introducton 1 2 Two-player Games 2 2.1 Zero-sum Games................................ 3 2.1.1 Pure Strateges.............................

More information

Data Mining Linear and Logistic Regression

Data Mining Linear and Logistic Regression 07/02/207 Data Mnng Lnear and Logstc Regresson Mchael L of 26 Regresson In statstcal modellng, regresson analyss s a statstcal process for estmatng the relatonshps among varables. Regresson models are

More information

Topics on the Border of Economics and Computation November 6, Lecture 2

Topics on the Border of Economics and Computation November 6, Lecture 2 Topcs on the Border of Economcs and Computaton November 6, 2005 Lecturer: Noam Nsan Lecture 2 Scrbe: Arel Procacca 1 Introducton Last week we dscussed the bascs of zero-sum games n strategc form. We characterzed

More information

Measures of Spread IQR and Deviation. For exam X, calculate the mean, median and mode. For exam Y, calculate the mean, median and mode.

Measures of Spread IQR and Deviation. For exam X, calculate the mean, median and mode. For exam Y, calculate the mean, median and mode. Part 4 Measures of Spread IQR and Devaton In Part we learned how the three measures of center offer dfferent ways of provdng us wth a sngle representatve value for a data set. However, consder the followng

More information

Elton, Gruber, Brown, and Goetzmann. Modern Portfolio Theory and Investment Analysis, 7th Edition. Solutions to Text Problems: Chapter 9

Elton, Gruber, Brown, and Goetzmann. Modern Portfolio Theory and Investment Analysis, 7th Edition. Solutions to Text Problems: Chapter 9 Elton, Gruber, Brown, and Goetzmann Modern Portfolo Theory and Investment Analyss, 7th Edton Solutons to Text Problems: Chapter 9 Chapter 9: Problem In the table below, gven that the rskless rate equals

More information

The Integration of the Israel Labour Force Survey with the National Insurance File

The Integration of the Israel Labour Force Survey with the National Insurance File The Integraton of the Israel Labour Force Survey wth the Natonal Insurance Fle Natale SHLOMO Central Bureau of Statstcs Kanfey Nesharm St. 66, corner of Bach Street, Jerusalem Natales@cbs.gov.l Abstact:

More information

EDC Introduction

EDC Introduction .0 Introducton EDC3 In the last set of notes (EDC), we saw how to use penalty factors n solvng the EDC problem wth losses. In ths set of notes, we want to address two closely related ssues. What are, exactly,

More information

Creating a zero coupon curve by bootstrapping with cubic splines.

Creating a zero coupon curve by bootstrapping with cubic splines. MMA 708 Analytcal Fnance II Creatng a zero coupon curve by bootstrappng wth cubc splnes. erg Gryshkevych Professor: Jan R. M. Röman 0.2.200 Dvson of Appled Mathematcs chool of Educaton, Culture and Communcaton

More information

Doubly Random Parallel Stochastic Algorithms for Large Scale Learning

Doubly Random Parallel Stochastic Algorithms for Large Scale Learning 000 00 00 003 004 005 006 007 008 009 00 0 0 03 04 05 06 07 08 09 00 0 0 03 04 05 06 07 08 09 030 03 03 033 034 035 036 037 038 039 040 04 04 043 044 045 046 047 048 049 050 05 05 053 Doubly Random Parallel

More information

Problem Set 6 Finance 1,

Problem Set 6 Finance 1, Carnege Mellon Unversty Graduate School of Industral Admnstraton Chrs Telmer Wnter 2006 Problem Set 6 Fnance, 47-720. (representatve agent constructon) Consder the followng two-perod, two-agent economy.

More information

Still Simpler Way of Introducing Interior-Point method for Linear Programming

Still Simpler Way of Introducing Interior-Point method for Linear Programming Stll Smpler Way of Introducng Interor-Pont method for Lnear Programmng Sanjeev Saxena Dept. of Computer Scence and Engneerng, Indan Insttute of Technology, Kanpur, INDIA-08 06 October 9, 05 Abstract Lnear

More information

Project Management Project Phases the S curve

Project Management Project Phases the S curve Project lfe cycle and resource usage Phases Project Management Project Phases the S curve Eng. Gorgo Locatell RATE OF RESOURCE ES Conceptual Defnton Realzaton Release TIME Cumulated resource usage and

More information

Jeffrey Ely. October 7, This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.

Jeffrey Ely. October 7, This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. October 7, 2012 Ths work s lcensed under the Creatve Commons Attrbuton-NonCommercal-ShareAlke 3.0 Lcense. Recap We saw last tme that any standard of socal welfare s problematc n a precse sense. If we want

More information

Economics 1410 Fall Section 7 Notes 1. Define the tax in a flexible way using T (z), where z is the income reported by the agent.

Economics 1410 Fall Section 7 Notes 1. Define the tax in a flexible way using T (z), where z is the income reported by the agent. Economcs 1410 Fall 2017 Harvard Unversty Yaan Al-Karableh Secton 7 Notes 1 I. The ncome taxaton problem Defne the tax n a flexble way usng T (), where s the ncome reported by the agent. Retenton functon:

More information

A Case Study for Optimal Dynamic Simulation Allocation in Ordinal Optimization 1

A Case Study for Optimal Dynamic Simulation Allocation in Ordinal Optimization 1 A Case Study for Optmal Dynamc Smulaton Allocaton n Ordnal Optmzaton Chun-Hung Chen, Dongha He, and Mchael Fu 4 Abstract Ordnal Optmzaton has emerged as an effcent technque for smulaton and optmzaton.

More information

Finance 402: Problem Set 1 Solutions

Finance 402: Problem Set 1 Solutions Fnance 402: Problem Set 1 Solutons Note: Where approprate, the fnal answer for each problem s gven n bold talcs for those not nterested n the dscusson of the soluton. 1. The annual coupon rate s 6%. A

More information

Interval Estimation for a Linear Function of. Variances of Nonnormal Distributions. that Utilize the Kurtosis

Interval Estimation for a Linear Function of. Variances of Nonnormal Distributions. that Utilize the Kurtosis Appled Mathematcal Scences, Vol. 7, 013, no. 99, 4909-4918 HIKARI Ltd, www.m-hkar.com http://dx.do.org/10.1988/ams.013.37366 Interval Estmaton for a Lnear Functon of Varances of Nonnormal Dstrbutons that

More information

Multifactor Term Structure Models

Multifactor Term Structure Models 1 Multfactor Term Structure Models A. Lmtatons of One-Factor Models 1. Returns on bonds of all maturtes are perfectly correlated. 2. Term structure (and prces of every other dervatves) are unquely determned

More information

Lecture Note 2 Time Value of Money

Lecture Note 2 Time Value of Money Seg250 Management Prncples for Engneerng Managers Lecture ote 2 Tme Value of Money Department of Systems Engneerng and Engneerng Management The Chnese Unversty of Hong Kong Interest: The Cost of Money

More information

Ch Rival Pure private goods (most retail goods) Non-Rival Impure public goods (internet service)

Ch Rival Pure private goods (most retail goods) Non-Rival Impure public goods (internet service) h 7 1 Publc Goods o Rval goods: a good s rval f ts consumpton by one person precludes ts consumpton by another o Excludable goods: a good s excludable f you can reasonably prevent a person from consumng

More information

UNIVERSITY OF NOTTINGHAM

UNIVERSITY OF NOTTINGHAM UNIVERSITY OF NOTTINGHAM SCHOOL OF ECONOMICS DISCUSSION PAPER 99/28 Welfare Analyss n a Cournot Game wth a Publc Good by Indraneel Dasgupta School of Economcs, Unversty of Nottngham, Nottngham NG7 2RD,

More information

>1 indicates country i has a comparative advantage in production of j; the greater the index, the stronger the advantage. RCA 1 ij

>1 indicates country i has a comparative advantage in production of j; the greater the index, the stronger the advantage. RCA 1 ij 69 APPENDIX 1 RCA Indces In the followng we present some maor RCA ndces reported n the lterature. For addtonal varants and other RCA ndces, Memedovc (1994) and Vollrath (1991) provde more thorough revews.

More information

TCOM501 Networking: Theory & Fundamentals Final Examination Professor Yannis A. Korilis April 26, 2002

TCOM501 Networking: Theory & Fundamentals Final Examination Professor Yannis A. Korilis April 26, 2002 TO5 Networng: Theory & undamentals nal xamnaton Professor Yanns. orls prl, Problem [ ponts]: onsder a rng networ wth nodes,,,. In ths networ, a customer that completes servce at node exts the networ wth

More information

OCR Statistics 1 Working with data. Section 2: Measures of location

OCR Statistics 1 Working with data. Section 2: Measures of location OCR Statstcs 1 Workng wth data Secton 2: Measures of locaton Notes and Examples These notes have sub-sectons on: The medan Estmatng the medan from grouped data The mean Estmatng the mean from grouped data

More information

Robust Boosting and its Relation to Bagging

Robust Boosting and its Relation to Bagging Robust Boostng and ts Relaton to Baggng Saharon Rosset IBM T.J. Watson Research Center P. O. Box 218 Yorktown Heghts, NY 10598 srosset@us.bm.com ABSTRACT Several authors have suggested vewng boostng as

More information

The convolution computation for Perfectly Matched Boundary Layer algorithm in finite differences

The convolution computation for Perfectly Matched Boundary Layer algorithm in finite differences The convoluton computaton for Perfectly Matched Boundary Layer algorthm n fnte dfferences Herman Jaramllo May 10, 2016 1 Introducton Ths s an exercse to help on the understandng on some mportant ssues

More information

Financial Risk Management in Portfolio Optimization with Lower Partial Moment

Financial Risk Management in Portfolio Optimization with Lower Partial Moment Amercan Journal of Busness and Socety Vol., o., 26, pp. 2-2 http://www.ascence.org/journal/ajbs Fnancal Rsk Management n Portfolo Optmzaton wth Lower Partal Moment Lam Weng Sew, 2, *, Lam Weng Hoe, 2 Department

More information

Analysis of Variance and Design of Experiments-II

Analysis of Variance and Design of Experiments-II Analyss of Varance and Desgn of Experments-II MODULE VI LECTURE - 4 SPLIT-PLOT AND STRIP-PLOT DESIGNS Dr. Shalabh Department of Mathematcs & Statstcs Indan Insttute of Technology Kanpur An example to motvate

More information

Minimizing the number of critical stages for the on-line steiner tree problem

Minimizing the number of critical stages for the on-line steiner tree problem Mnmzng the number of crtcal stages for the on-lne stener tree problem Ncolas Thbault, Chrstan Laforest IBISC, Unversté d Evry, Tour Evry 2, 523 place des terrasses, 91000 EVRY France Keywords: on-lne algorthm,

More information

Discrete Dynamic Shortest Path Problems in Transportation Applications

Discrete Dynamic Shortest Path Problems in Transportation Applications 17 Paper No. 98-115 TRANSPORTATION RESEARCH RECORD 1645 Dscrete Dynamc Shortest Path Problems n Transportaton Applcatons Complexty and Algorthms wth Optmal Run Tme ISMAIL CHABINI A soluton s provded for

More information

A New Uniform-based Resource Constrained Total Project Float Measure (U-RCTPF) Roni Levi. Research & Engineering, Haifa, Israel

A New Uniform-based Resource Constrained Total Project Float Measure (U-RCTPF) Roni Levi. Research & Engineering, Haifa, Israel Management Studes, August 2014, Vol. 2, No. 8, 533-540 do: 10.17265/2328-2185/2014.08.005 D DAVID PUBLISHING A New Unform-based Resource Constraned Total Project Float Measure (U-RCTPF) Ron Lev Research

More information

Taxation and Externalities. - Much recent discussion of policy towards externalities, e.g., global warming debate/kyoto

Taxation and Externalities. - Much recent discussion of policy towards externalities, e.g., global warming debate/kyoto Taxaton and Externaltes - Much recent dscusson of polcy towards externaltes, e.g., global warmng debate/kyoto - Increasng share of tax revenue from envronmental taxaton 6 percent n OECD - Envronmental

More information

Centre for International Capital Markets

Centre for International Capital Markets Centre for Internatonal Captal Markets Dscusson Papers ISSN 1749-3412 Valung Amercan Style Dervatves by Least Squares Methods Maro Cerrato No 2007-13 Valung Amercan Style Dervatves by Least Squares Methods

More information

occurrence of a larger storm than our culvert or bridge is barely capable of handling? (what is The main question is: What is the possibility of

occurrence of a larger storm than our culvert or bridge is barely capable of handling? (what is The main question is: What is the possibility of Module 8: Probablty and Statstcal Methods n Water Resources Engneerng Bob Ptt Unversty of Alabama Tuscaloosa, AL Flow data are avalable from numerous USGS operated flow recordng statons. Data s usually

More information

Теоретические основы и методология имитационного и комплексного моделирования

Теоретические основы и методология имитационного и комплексного моделирования MONTE-CARLO STATISTICAL MODELLING METHOD USING FOR INVESTIGA- TION OF ECONOMIC AND SOCIAL SYSTEMS Vladmrs Jansons, Vtaljs Jurenoks, Konstantns Ddenko (Latva). THE COMMO SCHEME OF USI G OF TRADITIO AL METHOD

More information

Simulation Budget Allocation for Further Enhancing the Efficiency of Ordinal Optimization

Simulation Budget Allocation for Further Enhancing the Efficiency of Ordinal Optimization Dscrete Event Dynamc Systems: Theory and Applcatons, 10, 51 70, 000. c 000 Kluwer Academc Publshers, Boston. Manufactured n The Netherlands. Smulaton Budget Allocaton for Further Enhancng the Effcency

More information

Numerical Analysis ECIV 3306 Chapter 6

Numerical Analysis ECIV 3306 Chapter 6 The Islamc Unversty o Gaza Faculty o Engneerng Cvl Engneerng Department Numercal Analyss ECIV 3306 Chapter 6 Open Methods & System o Non-lnear Eqs Assocate Pro. Mazen Abualtaye Cvl Engneerng Department,

More information

Teaching Note on Factor Model with a View --- A tutorial. This version: May 15, Prepared by Zhi Da *

Teaching Note on Factor Model with a View --- A tutorial. This version: May 15, Prepared by Zhi Da * Copyrght by Zh Da and Rav Jagannathan Teachng Note on For Model th a Ve --- A tutoral Ths verson: May 5, 2005 Prepared by Zh Da * Ths tutoral demonstrates ho to ncorporate economc ves n optmal asset allocaton

More information

3/3/2014. CDS M Phil Econometrics. Vijayamohanan Pillai N. Truncated standard normal distribution for a = 0.5, 0, and 0.5. CDS Mphil Econometrics

3/3/2014. CDS M Phil Econometrics. Vijayamohanan Pillai N. Truncated standard normal distribution for a = 0.5, 0, and 0.5. CDS Mphil Econometrics Lmted Dependent Varable Models: Tobt an Plla N 1 CDS Mphl Econometrcs Introducton Lmted Dependent Varable Models: Truncaton and Censorng Maddala, G. 1983. Lmted Dependent and Qualtatve Varables n Econometrcs.

More information

Equilibrium in Prediction Markets with Buyers and Sellers

Equilibrium in Prediction Markets with Buyers and Sellers Equlbrum n Predcton Markets wth Buyers and Sellers Shpra Agrawal Nmrod Megddo Benamn Armbruster Abstract Predcton markets wth buyers and sellers of contracts on multple outcomes are shown to have unque

More information

Optimising a general repair kit problem with a service constraint

Optimising a general repair kit problem with a service constraint Optmsng a general repar kt problem wth a servce constrant Marco Bjvank 1, Ger Koole Department of Mathematcs, VU Unversty Amsterdam, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands Irs F.A. Vs Department

More information

FORD MOTOR CREDIT COMPANY SUGGESTED ANSWERS. Richard M. Levich. New York University Stern School of Business. Revised, February 1999

FORD MOTOR CREDIT COMPANY SUGGESTED ANSWERS. Richard M. Levich. New York University Stern School of Business. Revised, February 1999 FORD MOTOR CREDIT COMPANY SUGGESTED ANSWERS by Rchard M. Levch New York Unversty Stern School of Busness Revsed, February 1999 1 SETTING UP THE PROBLEM The bond s beng sold to Swss nvestors for a prce

More information

Introduction to PGMs: Discrete Variables. Sargur Srihari

Introduction to PGMs: Discrete Variables. Sargur Srihari Introducton to : Dscrete Varables Sargur srhar@cedar.buffalo.edu Topcs. What are graphcal models (or ) 2. Use of Engneerng and AI 3. Drectonalty n graphs 4. Bayesan Networks 5. Generatve Models and Samplng

More information

Multiobjective De Novo Linear Programming *

Multiobjective De Novo Linear Programming * Acta Unv. Palack. Olomuc., Fac. rer. nat., Mathematca 50, 2 (2011) 29 36 Multobjectve De Novo Lnear Programmng * Petr FIALA Unversty of Economcs, W. Churchll Sq. 4, Prague 3, Czech Republc e-mal: pfala@vse.cz

More information

Lecture 7. We now use Brouwer s fixed point theorem to prove Nash s theorem.

Lecture 7. We now use Brouwer s fixed point theorem to prove Nash s theorem. Topcs on the Border of Economcs and Computaton December 11, 2005 Lecturer: Noam Nsan Lecture 7 Scrbe: Yoram Bachrach 1 Nash s Theorem We begn by provng Nash s Theorem about the exstance of a mxed strategy

More information

Raising Food Prices and Welfare Change: A Simple Calibration. Xiaohua Yu

Raising Food Prices and Welfare Change: A Simple Calibration. Xiaohua Yu Rasng Food Prces and Welfare Change: A Smple Calbraton Xaohua Yu Professor of Agrcultural Economcs Courant Research Centre Poverty, Equty and Growth Unversty of Göttngen CRC-PEG, Wlhelm-weber-Str. 2 3773

More information

Likelihood Fits. Craig Blocker Brandeis August 23, 2004

Likelihood Fits. Craig Blocker Brandeis August 23, 2004 Lkelhood Fts Crag Blocker Brandes August 23, 2004 Outlne I. What s the queston? II. Lkelhood Bascs III. Mathematcal Propertes IV. Uncertantes on Parameters V. Mscellaneous VI. Goodness of Ft VII. Comparson

More information

Numerical Optimisation Applied to Monte Carlo Algorithms for Finance. Phillip Luong

Numerical Optimisation Applied to Monte Carlo Algorithms for Finance. Phillip Luong Numercal Optmsaton Appled to Monte Carlo Algorthms for Fnance Phllp Luong Supervsed by Professor Hans De Sterck, Professor Gregore Loeper, and Dr Ivan Guo Monash Unversty Vacaton Research Scholarshps are

More information

2) In the medium-run/long-run, a decrease in the budget deficit will produce:

2) In the medium-run/long-run, a decrease in the budget deficit will produce: 4.02 Quz 2 Solutons Fall 2004 Multple-Choce Questons ) Consder the wage-settng and prce-settng equatons we studed n class. Suppose the markup, µ, equals 0.25, and F(u,z) = -u. What s the natural rate of

More information

Problems to be discussed at the 5 th seminar Suggested solutions

Problems to be discussed at the 5 th seminar Suggested solutions ECON4260 Behavoral Economcs Problems to be dscussed at the 5 th semnar Suggested solutons Problem 1 a) Consder an ultmatum game n whch the proposer gets, ntally, 100 NOK. Assume that both the proposer

More information

Facility Location Problem. Learning objectives. Antti Salonen Farzaneh Ahmadzadeh

Facility Location Problem. Learning objectives. Antti Salonen Farzaneh Ahmadzadeh Antt Salonen Farzaneh Ahmadzadeh 1 Faclty Locaton Problem The study of faclty locaton problems, also known as locaton analyss, s a branch of operatons research concerned wth the optmal placement of facltes

More information

International ejournals

International ejournals Avalable onlne at www.nternatonalejournals.com ISSN 0976 1411 Internatonal ejournals Internatonal ejournal of Mathematcs and Engneerng 7 (010) 86-95 MODELING AND PREDICTING URBAN MALE POPULATION OF BANGLADESH:

More information

Tree-based and GA tools for optimal sampling design

Tree-based and GA tools for optimal sampling design Tree-based and GA tools for optmal samplng desgn The R User Conference 2008 August 2-4, Technsche Unverstät Dortmund, Germany Marco Balln, Gulo Barcarol Isttuto Nazonale d Statstca (ISTAT) Defnton of the

More information