Doubly Random Parallel Stochastic Algorithms for Large Scale Learning


Anonymous Author(s)
Affiliation
Address
Email

Abstract

We consider learning problems over training sets in which both the number of training examples and the dimension of the feature vectors are large. To solve these problems we propose the random parallel stochastic algorithm (RAPSA). We call the algorithm random parallel because it utilizes multiple processors to operate on a randomly chosen subset of blocks of the feature vector. We call the algorithm parallel stochastic because processors choose elements of the training set randomly and independently. Algorithms that are parallel in either of these dimensions exist, but RAPSA is the first attempt at a methodology that is parallel in both the selection of blocks and the selection of elements of the training set. In RAPSA, processors utilize the randomly chosen functions to compute the stochastic gradient component associated with a randomly chosen block. The technical contribution of this paper is to show that this minimally coordinated algorithm converges to the optimal classifier when the training objective is convex. In particular, we show that: (i) when using decreasing stepsizes, RAPSA converges almost surely over the random choice of blocks and functions; (ii) when using constant stepsizes, convergence is to a neighborhood of optimality with a rate that is linear in expectation. An accelerated version (ARAPSA) is further proposed by leveraging ideas of stochastic quasi-Newton optimization. Both RAPSA and ARAPSA are numerically evaluated on the MNIST digit recognition problem.

1 Introduction

Learning is often formulated as an optimization problem that finds a classifier x^* \in R^p that minimizes the average of a loss function across the elements of a training set. For a precise definition, consider a training set with N elements and let f_n : R^p \to R be a convex loss function associated with the n-th element of the training set. The optimal classifier x^* \in R^p is defined as the minimizer of the average cost F(x) := (1/N) \sum_{n=1}^N f_n(x),

    x^* := \argmin_x F(x) := \argmin_x \frac{1}{N} \sum_{n=1}^N f_n(x).    (1)

Problems such as support vector machines, logistic regression, and matrix completion can be put in the form of problem (1). In this paper we are interested in large scale problems where both the number of features p and the number of elements N in the training set are very large, which arise, e.g., in text [1], image [2], and genomic [3] processing. When N and p are large, the parallel processing architecture in Figure 1 becomes of interest. In this architecture, features are divided into B blocks, each of which contains p_b \ll p features, and a set of I \leq B processors work in parallel on randomly chosen feature blocks while using a stochastic subset of elements of the training set.
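To make the setup concrete, the following minimal Python sketch instantiates the finite-sum objective in (1); the synthetic data and the choice of least-squares losses are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Synthetic instance of problem (1) with least-squares losses
# f_n(x) = 0.5 * (z_n' x - y_n)^2; all data here is illustrative.
rng = np.random.default_rng(1)
N, p = 1000, 50
Z = rng.normal(size=(N, p))                        # feature vectors z_n
y = Z @ rng.normal(size=p) + 0.1 * rng.normal(size=N)

def F(x):
    """Average cost F(x) = (1/N) * sum_n f_n(x) as in (1)."""
    return 0.5 * np.mean((Z @ x - y) ** 2)

x_star = np.linalg.lstsq(Z, y, rcond=None)[0]      # minimizer of F
print(F(np.zeros(p)) - F(x_star))                  # optimality gap at x = 0
```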

Figure 1: Random parallel stochastic algorithm (RAPSA). At each iteration, processor P_i picks a random block from the set {x_1, ..., x_B} and a random set of functions from the training set {f_1, ..., f_N}. The functions drawn are used to evaluate a stochastic gradient component associated with the chosen block. RAPSA is shown here to converge to the optimal argument x^* of (1).

In the schematic shown, Processor P_1 fetches functions f_1 and f_n to operate on block x_b, and Processor P_i fetches other functions to operate on another block. Other processors select other elements of the training set and other blocks, with the majority of blocks remaining unchanged and the majority of functions remaining unused. The blocks chosen for update and the functions fetched for determination of block updates are selected independently at random in subsequent slots.

Problems that operate on blocks of the feature vectors or on subsets of the training set, but not on both blocks and subsets, exist. Block coordinate descent (BCD) is the generic name for methods in which the variable space is divided into blocks that are processed separately. Early versions operate by cyclically updating all coordinates at each step [4, 5], while more recent parallelized versions of coordinate descent have been developed to accelerate convergence of BCD [6-8]. Closer to the architecture in Figure 1, methods in which subsets of blocks are selected at random have also been proposed [9]. BCD, whether serial, parallel, or random, can handle cases where the parameter dimension p is large but requires access to all training samples at each iteration.

Methods that utilize a subset of functions are known by the generic name of stochastic approximation and rely on the use of stochastic gradients. In plain stochastic gradient descent (SGD), the gradient of the aggregate function is estimated by the gradient of a randomly chosen function f_n [10]. Since convergence of SGD is slow more often than not, various recent developments have been aimed at accelerating convergence. These attempts include methodologies to reduce the variance of stochastic gradients (e.g., SAG, SAGA, SVRG) and the use of ideas from quasi-Newton optimization to handle difficult curvature profiles [11, 12]. More pertinent to the work considered here are the use of cyclic block SGD updates [13] and the exploitation of sparsity properties of feature vectors to allow for parallel updates [14]. These methods are suitable when the number of elements N in the training set is large, but they do not allow for parallel feature processing unless parallelism is inherent to the problem's structure.

The random parallel stochastic algorithm (RAPSA) proposed in this paper represents the first effort at implementing the architecture in Figure 1, which randomizes over both features and sample functions. In RAPSA, the functions fetched by a processor are used to compute the stochastic gradient component associated with a randomly chosen block (Section 2). The processors do not coordinate in either choice except to avoid selection of the same block. Our main technical contribution is to show that RAPSA iterates converge to the optimal classifier x^* when using a sequence of decreasing stepsizes, and to a neighborhood of the optimal classifier when using constant stepsizes (Section 2.1). In the latter case, we further show that the rate of convergence to this optimality neighborhood is linear in expectation. These results are interesting because only a subset of features is updated per iteration and the functions used to update different blocks are, in general, different.
We further propose an acceleration of RAPSA in which processors also update and exploit a curvature estimation matrix associated with each block (Section 3). This accelerated RAPSA (ARAPSA) algorithm leverages ideas of stochastic quasi-Newton methods [11, 12] and results in faster convergence. Both RAPSA and ARAPSA are numerically evaluated on the MNIST digit recognition problem (Section 4).

2 Random Parallel Stochastic Algorithm (RAPSA)

We consider a more general formulation of (1) in which the number N of functions f_n is not necessarily finite. Introduce then a random variable \theta \in \Theta \subset R^q that determines the choice of the random smooth convex function f(\cdot, \theta) : R^p \to R. We consider the problem of minimizing the expectation of the random functions F(x) := E_\theta[f(x, \theta)],

    x^* := \argmin_x F(x) := \argmin_x E_\theta[ f(x, \theta) ].    (2)

Problem (1) is a particular case of (2) in which each of the functions f_n is drawn with probability 1/N. We refer to f(\cdot, \theta) as instantaneous functions and to F(x) as the average function.

RAPSA utilizes I processors to update a random subset of blocks of the variable x, with each of the blocks relying on a subset of randomly and independently chosen elements of the training set; see Figure 1. Formally, decompose the variable x into B blocks to write x = [x_1; ...; x_B], where block b has length p_b so that x_b \in R^{p_b}. At iteration t, processor i selects a random block index b_i^t for updating and a random subset \Theta_i^t of L instantaneous functions. It then uses these instantaneous functions to determine stochastic gradient components for the subset of variables x_b = x_{b_i^t} as an average of the components of the gradients of the functions f(x^t, \theta) for \theta \in \Theta_i^t,

    \nabla_{x_b} f(x^t, \Theta_i^t) = \frac{1}{L} \sum_{\theta \in \Theta_i^t} \nabla_{x_b} f(x^t, \theta),    b = b_i^t.    (3)

The stochastic gradient block in (3) is then modulated by a possibly time-varying stepsize \gamma^t and used by processor i to update the block x_b = x_{b_i^t},

    x_b^{t+1} = x_b^t - \gamma^t \nabla_{x_b} f(x^t, \Theta_i^t),    b = b_i^t.    (4)

RAPSA is defined by the joint implementation of (3) and (4) across all I processors. The selection of blocks is coordinated so that no two processors operate on the same block. The selection of elements of the training set is uncoordinated across processors. The fact that at any point in time a random subset of blocks is being updated utilizing a random subset of elements of the training set means that RAPSA requires almost no coordination between processors. The contribution of this paper is to show that this very lean algorithm converges to the optimal argument x^*, as we show in the following section.
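Although the paper targets a multi-processor architecture, the random block and sample selections of (3)-(4) are easy to emulate serially. The following Python sketch is a minimal single-process simulation under stated assumptions: grad_fn and all parameter names are illustrative, not part of the paper.

```python
import numpy as np

def rapsa(grad_fn, x0, N, B, I, L, gamma, T, seed=0):
    """Serial simulation of RAPSA, cf. (3)-(4): at each iteration, I of the
    B blocks are drawn without replacement, and each selected block is
    updated with an independent minibatch of L sample indices.

    grad_fn(x, n) is an assumed helper returning the gradient of the
    instantaneous loss f_n at x; gamma may be a constant or a function of t."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    blocks = np.array_split(np.arange(x.size), B)  # uniform block partition
    for t in range(T):
        step = gamma(t) if callable(gamma) else gamma
        for b in rng.choice(B, size=I, replace=False):
            batch = rng.integers(N, size=L)        # Theta_i^t, drawn independently
            g = np.mean([grad_fn(x, n) for n in batch], axis=0)
            x[blocks[b]] -= step * g[blocks[b]]    # update only block b, cf. (4)
    return x
```

A true parallel implementation would hand each drawn block to a different processor; the sketch only reproduces the statistics of the coordinate updates (it also wastefully computes the full sample gradient before slicing, which a block-aware grad_fn would avoid).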

2.1 Convergence Analysis

We show in this section that the sequence of objective function values F(x^t) generated by RAPSA approaches the optimal objective function value F(x^*). In establishing this result, we define the set S^t containing the blocks that are updated at step t, with associated indices I^t \subset {1, ..., B}. Note that the components of the set S^t are chosen uniformly at random from the set of blocks {x_1, ..., x_B}. The definition of S^t is such that the time evolution of the RAPSA iterates can be written as [cf. (4)]

    x_i^{t+1} = x_i^t - \gamma^t \nabla_{x_i} f(x^t, \Theta_i^t)    for all x_i \in S^t,    (5)

while the rest of the blocks remain unchanged, i.e., x_i^{t+1} = x_i^t for x_i \notin S^t. Since the number of updated blocks is equal to the number of processors, the ratio of updated blocks is r := |I^t|/B = I/B.

To prove convergence of RAPSA, we require the following assumptions.

Assumption 1. The instantaneous objective functions f(x, \theta) are differentiable and the average function F(x) is strongly convex with parameter m > 0.

Assumption 2. The average objective function gradients \nabla F(x) are Lipschitz continuous with respect to the Euclidean norm with parameter M, i.e., for all x, \hat{x} \in R^p it holds

    \| \nabla F(x) - \nabla F(\hat{x}) \| \leq M \| x - \hat{x} \|.    (6)

Assumption 3. The second moment of the norm of the stochastic gradient is bounded for all x, i.e., there exists a constant K such that for all variables x it holds

    E_\theta [ \| \nabla f(x^t, \theta^t) \|^2 \,|\, x^t ] \leq K.    (7)

Notice that Assumption 1 only enforces strong convexity of the average function F, while the instantaneous functions f may not even be convex. Further, notice that since the instantaneous functions f are differentiable, the average function F is also differentiable. The Lipschitz continuity of the average function gradients \nabla F is customary in proving objective function convergence for descent algorithms. The restriction imposed by Assumption 3 is a standard condition in the stochastic approximation literature [10], its intent being to limit the variance of the stochastic gradients [15].

Our first result comes in the form of an expected descent lemma that relates the expected difference of subsequent iterates to the gradient of the average function.

Lemma 1. Consider the random parallel stochastic algorithm defined in (3)-(5). Recall the definition of the set of updated blocks I^t, whose elements are randomly chosen from the total of B blocks. Define F^t as a sigma-algebra that measures the history of the system up until time t. Then, the expected value of the difference x^{t+1} - x^t with respect to the random set I^t given F^t is

    E_{I^t}[ x^{t+1} - x^t \,|\, F^t ] = -r \gamma^t \nabla f(x^t, \Theta^t).    (8)

Moreover, the expected value of the squared norm \|x^{t+1} - x^t\|^2 with respect to the random set S^t given F^t can be simplified as

    E_{I^t}[ \| x^{t+1} - x^t \|^2 \,|\, F^t ] = r (\gamma^t)^2 \| \nabla f(x^t, \Theta^t) \|^2.    (9)

Notice that in the regular stochastic gradient descent method, the difference of two consecutive iterates x^{t+1} - x^t is equal to the stochastic gradient \nabla f(x^t, \Theta^t) times the stepsize -\gamma^t. According to the first result in Lemma 1, the expected value of this difference with respect to the random set of blocks I^t is the same as for SGD, except that it is multiplied by the fraction of updated blocks r. The expression in (9) shows the same relation for the expected value of the squared difference \|x^{t+1} - x^t\|^2. These relationships confirm that in expectation RAPSA behaves as SGD, which allows us to establish the global convergence of RAPSA.

Proposition 1. Consider the random parallel stochastic algorithm defined in (3)-(5). If Assumptions 1-3 hold true, then the objective function error sequence F(x^t) - F(x^*) satisfies

    E[ F(x^{t+1}) - F(x^*) \,|\, F^t ] \leq ( 1 - 2 m r \gamma^t )( F(x^t) - F(x^*) ) + \frac{r M K (\gamma^t)^2}{2}.    (10)

Proposition 1 leads to a supermartingale relationship for the sequence of objective function errors F(x^t) - F(x^*). In the following theorem we show that if the sequence of stepsizes satisfies the standard stochastic approximation diminishing stepsize rules (non-summable and squared summable), the sequence of objective function errors F(x^t) - F(x^*) converges to null almost surely. Considering the strong convexity assumption, this result implies almost sure convergence of the sequence \|x^t - x^*\|^2 to null.

Theorem 1. Consider the random parallel stochastic algorithm defined in (3)-(5). If Assumptions 1-3 hold true and the sequence of stepsizes is non-summable, \sum_{t=0}^\infty \gamma^t = \infty, and square summable, \sum_{t=0}^\infty (\gamma^t)^2 < \infty, then the sequence of variables x^t generated by RAPSA converges almost surely to the optimal argument x^*,

    \lim_{t \to \infty} \| x^t - x^* \|^2 = 0    a.s.    (11)

Moreover, if the stepsize is defined as \gamma^t := \gamma^0 T^0 / (t + T^0) and the stepsize parameters are chosen such that 2 m r \gamma^0 T^0 > 1, then the expected average function error E[F(x^t) - F(x^*)] converges to null at least with a sublinear convergence rate of order O(1/t),

    E[ F(x^t) - F(x^*) ] \leq \frac{C}{t + T^0},    (12)

where the constant C is defined as

    C = \max\left\{ \frac{r M K (\gamma^0 T^0)^2}{4 m r \gamma^0 T^0 - 2},\; T^0 ( F(x^0) - F(x^*) ) \right\}.    (13)

(The expectation in (12), and throughout the subsequent convergence rate analysis, is taken with respect to the algorithm history F^0, which contains all randomness in both \Theta^t and I^t for all t \geq 0.)
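The stepsize rule and the O(1/t) guarantee of Theorem 1 are easy to evaluate numerically. Below is a minimal helper under the assumption 2*m*r*gamma0*T0 > 1; all constants are illustrative inputs rather than values from the paper.

```python
def gamma_schedule(t, gamma0, T0):
    """Diminishing stepsize gamma^t = gamma^0 * T^0 / (t + T^0):
    non-summable but square summable, as required by Theorem 1."""
    return gamma0 * T0 / (t + T0)

def rate_bound(t, m, M, K, r, gamma0, T0, initial_gap):
    """Right hand side of (12)-(13); only valid when 2*m*r*gamma0*T0 > 1."""
    C = max(r * M * K * (gamma0 * T0) ** 2 / (4 * m * r * gamma0 * T0 - 2),
            T0 * initial_gap)
    return C / (t + T0)
```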

The result in Theorem 1 shows that when the sequence of stepsizes diminishes as \gamma^t = \gamma^0 T^0 / (t + T^0), the sequence of average objective function values F(x^t) converges to the optimal objective value F(x^*) with probability 1. Further, the rate of convergence in expectation is at least of order O(1/t). Diminishing stepsizes are useful when exact convergence is required; however, when we are interested in a specific accuracy \epsilon, the more efficient choice is a constant stepsize. In the following theorem we study the convergence properties of RAPSA for a constant stepsize \gamma^t = \gamma.

Theorem 2. Consider the random parallel stochastic algorithm defined in (3) and (5). If Assumptions 1-3 hold true and the stepsize is constant, \gamma^t = \gamma, then the sequence of variables x^t generated by RAPSA converges almost surely to a neighborhood of the optimal argument x^* as

    \liminf_{t \to \infty} F(x^t) - F(x^*) \leq \frac{\gamma M K}{4m}    a.s.    (14)

Moreover, if the constant stepsize \gamma is chosen such that 2 m \gamma r < 1, then the expected average function value error E[F(x^t) - F(x^*)] converges linearly to an error bound as

    E[ F(x^t) - F(x^*) ] \leq ( 1 - 2 m \gamma r )^t ( F(x^0) - F(x^*) ) + \frac{\gamma M K}{4m}.    (15)

Notice that according to the result in (15) there exists a trade-off between accuracy and speed of convergence. Decreasing the constant stepsize \gamma leads to a smaller error bound \gamma M K / (4m) and a more accurate convergence, while the linear convergence constant 1 - 2 m \gamma r increases and the convergence rate becomes slower. Further, note that the error of convergence \gamma M K / (4m) is independent of the ratio of updated blocks r, while the constant of linear convergence 1 - 2 m \gamma r depends on r. Therefore, updating a fraction of the blocks at each iteration decreases the speed of convergence for RAPSA relative to SGD, which updates all of the blocks; however, both algorithms reach the same accuracy.

To achieve accuracy \epsilon, the sum of the two terms on the right hand side of (15) should be smaller than \epsilon. Consider \phi as a positive constant that is strictly smaller than 1, i.e., 0 < \phi < 1. Then we want to have

    \frac{\gamma M K}{4m} \leq \phi \epsilon,    ( 1 - 2 m \gamma r )^t ( F(x^0) - F(x^*) ) \leq (1 - \phi) \epsilon.    (16)

Therefore, to satisfy the first condition in (16) we set the stepsize as \gamma = 4 m \phi \epsilon / (M K). Applying this substitution to the second inequality in (16) and considering the inequality a + \ln(1 - a) < 0 for 0 < a < 1, we obtain that

    t \geq \frac{M K}{8 m^2 r \phi \epsilon} \ln\left( \frac{F(x^0) - F(x^*)}{(1 - \phi) \epsilon} \right).    (17)

The lower bound in (17) shows the minimum number of iterations required for RAPSA to achieve accuracy \epsilon.
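The stepsize choice and iteration count implied by (16)-(17) can be computed directly. A minimal helper follows, with all problem constants treated as illustrative inputs (0 < phi < 1):

```python
import math

def constant_stepsize_budget(m, M, K, r, eps, phi, initial_gap):
    """Tuning implied by (16)-(17): choose gamma so the asymptotic error
    gamma*M*K/(4m) equals phi*eps, then bound the iterations needed to push
    the transient term below (1 - phi)*eps."""
    gamma = 4 * m * phi * eps / (M * K)
    t_min = M * K / (8 * m ** 2 * r * phi * eps) * \
        math.log(initial_gap / ((1 - phi) * eps))
    return gamma, math.ceil(t_min)
```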

3 Accelerated Random Parallel Stochastic Algorithm (ARAPSA)

As we mentioned in Section 1, RAPSA operates on first-order information, which may lead to slow convergence in ill-conditioned problems. We introduce Accelerated RAPSA (ARAPSA) as a parallel doubly stochastic algorithm that incorporates second-order information of the objective by separately approximating the function curvature for different blocks. We do this by implementing the oLBFGS algorithm for the different blocks of the variable x. Define \hat{B}_i^t as an approximation of the Hessian inverse of the objective function corresponding to the block x_i. The update of ARAPSA is defined as the multiplication of the RAPSA descent direction by \hat{B}_i^t, i.e.,

    x_i^{t+1} = x_i^t - \gamma^t \hat{B}_i^t \nabla_{x_i} f(x^t, \Theta_i^t)    for all x_i \in S^t.    (18)

We next detail how to properly specify the block approximate Hessian inverse \hat{B}_i^t so that it behaves in a manner comparable to the true Hessian inverse. To do so, define for each block coordinate x_i at step t the variable variation v_i^t and the stochastic gradient variation \hat{r}_i^t as

    v_i^t = x_i^{t+1} - x_i^t,    \hat{r}_i^t = \nabla_{x_i} f(x^{t+1}, \Theta_i^t) - \nabla_{x_i} f(x^t, \Theta_i^t).    (19)

Observe that the stochastic gradient variation \hat{r}_i^t is defined as the difference of stochastic gradients at times t + 1 and t corresponding to the block x_i for a common set of realizations \Theta_i^t. The term \nabla_{x_i} f(x^t, \Theta_i^t) is the same stochastic gradient used at time t in (18), while \nabla_{x_i} f(x^{t+1}, \Theta_i^t) is computed only to determine the stochastic gradient variation \hat{r}_i^t. An alternative and perhaps more natural definition for the stochastic gradient variation is \nabla_{x_i} f(x^{t+1}, \Theta_i^{t+1}) - \nabla_{x_i} f(x^t, \Theta_i^t). However, as pointed out in [11], this formulation is insufficient for establishing the convergence of stochastic quasi-Newton methods.

We proceed to develop a block-coordinate quasi-Newton method by first noting an important property of the true Hessian, and we design our approximation scheme to satisfy this property. In particular, observe that when the iterates x_i^t and x_i^{t+1} are close to each other, the true block Hessian H_i^t satisfies the block secant condition H_i^t v_i^t = \hat{r}_i^t, so that its inverse satisfies (H_i^t)^{-1} \hat{r}_i^t = v_i^t. The secant condition may be interpreted as stating that the stochastic gradient of a quadratic approximation of the objective function evaluated at the next iteration agrees with the stochastic gradient at the current iteration. We select a Hessian inverse approximation matrix associated with block x_i such that it satisfies the secant condition \hat{B}_i^{t+1} \hat{r}_i^t = v_i^t, and thus behaves in a manner comparable to the true block Hessian inverse.

The oLBFGS Hessian inverse update rule maintains the secant condition at each iteration by using the information of the last \tau pairs of variable and stochastic gradient variations {v_i^u, \hat{r}_i^u}_{u=t-\tau}^{t-1}. To state the update rule of oLBFGS for revising the Hessian inverse approximation matrices of the blocks, define for each block i the initial matrix \hat{B}_i^{t,0} := \eta_i^t I, where the constant \eta_i^t for t > 0 is given by

    \eta_i^t := \frac{ (v_i^{t-1})^T \hat{r}_i^{t-1} }{ \| \hat{r}_i^{t-1} \|^2 },    (20)

while the initial value is \eta_i^0 = 1. The matrix \hat{B}_i^{t,0} is the initial approximation of the Hessian inverse associated with block x_i. The approximate matrix \hat{B}_i^t is computed by updating the initial matrix \hat{B}_i^{t,0} using the last \tau pairs of curvature information {v_i^u, \hat{r}_i^u}_{u=t-\tau}^{t-1}. We define the approximate Hessian inverse \hat{B}_i^t = \hat{B}_i^{t,\tau} corresponding to block x_i at step t as the outcome of \tau recursive applications of the update

    \hat{B}_i^{t,u+1} = ( \hat{Z}_i^{t-\tau+u} )^T \hat{B}_i^{t,u} ( \hat{Z}_i^{t-\tau+u} ) + \hat{\rho}_i^{t-\tau+u} ( v_i^{t-\tau+u} )( v_i^{t-\tau+u} )^T,    (21)

where the constants \hat{\rho}_i^{t-\tau+u} and matrices \hat{Z}_i^{t-\tau+u} in (21) for u = 0, ..., \tau - 1 are defined as

    \hat{\rho}_i^{t-\tau+u} = \frac{1}{ ( v_i^{t-\tau+u} )^T \hat{r}_i^{t-\tau+u} }    and    \hat{Z}_i^{t-\tau+u} = I - \hat{\rho}_i^{t-\tau+u} \hat{r}_i^{t-\tau+u} ( v_i^{t-\tau+u} )^T.    (22)

The computation cost of forming \hat{B}_i^t through (21) is on the order of O(p_b^2) operations; however, the update in (18) only requires the descent direction \hat{d}_i^t := \hat{B}_i^t \nabla_{x_i} f(x^t, \Theta_i^t). The work in [16] introduces an efficient implementation of the product \hat{B}_i^t \nabla_{x_i} f(x^t, \Theta_i^t) that requires computation complexity of order O(\tau p_b). We use the same idea for computing the descent direction of ARAPSA for each block; more details are provided in the supplementary material. Therefore, the computation complexity of updating each block for ARAPSA is on the order of O(\tau p_b), while RAPSA requires O(p_b) operations. On the other hand, ARAPSA accelerates the convergence of RAPSA by incorporating second-order information of the objective function in the block updates, as may be observed in the numerical analyses provided in Section 4.
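A minimal Python sketch of one ARAPSA block update, combining (18)-(19): grad_block and two_loop are assumed helpers (the latter standing in for the recursion detailed in the supplementary material), and all names are illustrative.

```python
def arapsa_block_update(x, blk, batch, gamma, memory, grad_block, two_loop, tau=10):
    """One ARAPSA update (18)-(19) for the coordinates in blk.

    grad_block(x, batch, blk) is an assumed helper returning the minibatch
    gradient (3) restricted to blk; two_loop(g, memory) applies the
    inverse-Hessian approximation B_hat. memory holds the last tau
    (v, r_hat) pairs for this block, ordered oldest first."""
    x_old = x[blk].copy()
    g_old = grad_block(x, batch, blk)                  # gradient at x^t, samples Theta^t
    x[blk] = x_old - gamma * two_loop(g_old, memory)   # update (18)
    g_new = grad_block(x, batch, blk)                  # same Theta^t at x^{t+1}, cf. (19)
    memory.append((x[blk] - x_old, g_new - g_old))     # curvature pair (v^t, r_hat^t)
    if len(memory) > tau:
        memory.pop(0)                                  # keep only the last tau pairs
```

The one subtlety the sketch preserves is that the same minibatch is reused at x^{t+1} to form \hat{r}_i^t, per the discussion following (19).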
4 Numerical analysis

In this section we study the practical performance of the doubly stochastic approximation algorithms developed in Sections 2 and 3 by considering the problem of developing an automated decision system to distinguish between distinct hand-written digits. For that purpose, let z \in R^p be a feature vector encoding the pixel intensities of an image, and let y \in {-1, 1} be an indicator variable of whether the image contains the digit 0 or 8, in which case the binary indicator is respectively y = -1 or y = 1. We model the task of learning a hand-written digit detector as a logistic regression problem, where one aims to train a classifier x \in R^p to determine the relationship between feature vectors z_n \in R^p and their associated labels y_n \in {-1, 1} for n = 1, ..., N. The instantaneous functions f_n in (1) may be written as the log-likelihood of a generalized linear model of the odds ratio of whether the label is y_n = 1 or y_n = -1. The empirical risk minimization associated with the training set T = {(z_n, y_n)}_{n=1}^N is to find x^* as the maximum likelihood estimate,

    x^* := \argmin_{x \in R^p} F(x) = \frac{\lambda}{2} \| x \|^2 + \frac{1}{N} \sum_{n=1}^N \log( 1 + \exp( -y_n x^T z_n ) ).    (23)
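The instantaneous gradient used by RAPSA and ARAPSA for this model follows directly from (23). A minimal sketch, with the regularizer folded into each instantaneous function for illustration:

```python
import numpy as np

def logistic_grad(x, z, y, lam):
    """Gradient of a regularized logistic instantaneous loss,
    f_n(x) = (lam/2)*||x||^2 + log(1 + exp(-y * x'z)), matching (23) with
    the regularizer split across samples for illustration; y is +/-1."""
    s = -y / (1.0 + np.exp(y * np.dot(x, z)))   # derivative of log(1 + e^{-y u})
    return lam * x + s * z
```

Plugged in as grad_fn in the earlier RAPSA sketch, this is the per-sample gradient a processor would slice to its assigned block.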

We use the MNIST dataset [17], in which the feature vectors z_n \in R^p are p = 28 x 28 = 784 pixel images whose values are recorded as intensities, or elements of the unit interval [0, 1]. Considered here is the subset associated with digits 0 and 8, yielding a training set T = {(z_n, y_n)}_{n=1}^N. Further, the optimal classifier x^* in (23) is computed using LIBLINEAR [18].

We run RAPSA on this training subset for the case that B = [p/4] = 196, where [ ] denotes rounding a number to its nearest integer, and we uniformly allocate feature vectors into blocks of size p_b = 4 for b = 1, ..., B. To determine the advantages of incomplete randomized parallel processing, we vary I^t = I, the number of blocks updated at each iteration (set equal to the number of processors for simplicity), as I = B, I = [B/2] = 98, I = [B/4] = 49, and I = [B/8] = 25. When I = B, RAPSA is parallel stochastic gradient descent. In the subsequent experiment we set L = 1 (no mini-batching).

Comparing algorithm performance over the iteration index t across varying numbers of block updates I^t is unfair: if RAPSA is run on a problem for which I^t = B/2, then at iteration t it has only processed half the data that parallel SGD has processed by the same iteration. Instead we consider algorithm performance against the number of features processed up to the current time, denoted \tilde{p}_t. Observe that for a feature vector of length p, tp features have been processed by iteration t when all decision variable coordinates are updated at each iteration, as in the case of SGD, where r = 1. When r < 1, this count must be scaled by the proportion of coordinates updated per iteration, which, since blocks are uniformly sized, gives \tilde{p}_t = p r t. Moreover, when the mini-batch size L > 1, \tilde{p}_t = p r t L.

In Figure 2 we show the result of running RAPSA with a constant stepsize.

Figure 2: RAPSA on MNIST data with a constant stepsize and no mini-batching (L = 1). Panels: (a) empirical risk F(x^t) - F(x^*) vs. number of features processed \tilde{p}_t; (b) average classification accuracy vs. \tilde{p}_t. Algorithm performance is comparable across different numbers of decision variable coordinates updated per iteration. RAPSA is I times faster than SGD and achieves comparable performance; in some cases, updating fewer variables per iteration improves accuracy.

In Figure 2(a) we plot F(x^t) - F(x^*) versus the number of features processed \tilde{p}_t. As predicted by Theorem 2, we observe in Figure 2(a) that the empirical risk F(x^t) - F(x^*) converges to a small positive constant and that the difference in convergence rates to a neighborhood of the optimum is negligible. In Figure 2(b) we plot the empirical average classification accuracy on a test set. Note that RAPSA with fewer blocks updated (processors) per iteration achieves improved classification accuracy: the choices I = B, I = [B/2], I = [B/4], and I = [B/8] respectively achieve accuracies of 78%, 87%, and above 90% for the two smallest choices of I.

We now run Accelerated RAPSA (ARAPSA), as stated in Section 3, for this problem setting on the entire MNIST binary training subset associated with digits 0 and 8, with mini-batch size L = 10 and the level of curvature information set as \tau = 10. We select the decreasing stepsize \gamma^t = \gamma^0 T^0 / (t + T^0) with annealing rate T^0 = [N/p] = 45, regularizer \lambda = 1/\sqrt{N}, and \gamma^0 = 1/(\lambda T^0). As before, we study the advantages of incomplete randomized parallel processing by varying I = I^t, the number of blocks updated per iteration, as I \in {B, [B/2], [B/4], [B/8]}.

Figure 3: ARAPSA on MNIST data with regularizer \lambda = 1/\sqrt{N}, mini-batch size L = 10, curvature information level \tau = 10, and diminishing stepsize \gamma^t = \gamma^0 T^0 / (t + T^0) with annealing rate T^0 = [N/p] = 45. Panels: (a) empirical risk F(x^t) - F(x^*) vs. number of features processed \tilde{p}_t; (b) average classification accuracy vs. \tilde{p}_t. The ARAPSA convergence properties hold in practice.

Figure 3(a) shows the error F(x^t) - F(x^*) versus the number of features processed \tilde{p}_t. Observe that the algorithm achieves convergence across the differing numbers of parallel computing nodes I^t, with improved convergence for smaller I^t: for I^t = B, I^t = [B/2], I^t = [B/4], and I^t = [B/8], the algorithm reaches a fixed risk benchmark by \tilde{p}_t = 70, \tilde{p}_t = 40, \tilde{p}_t = 85, and \tilde{p}_t = 77, respectively. Moreover, in Figure 3(b) we observe that ARAPSA achieves comparable accuracy across the different numbers of block variables updated per iteration, surpassing 90%.

We study the effect of incorporating second-order information into the block updates. To do so, we fix the mini-batch size as L = 10 and run ARAPSA and RAPSA on this problem instance with a constant stepsize. Moreover, only a quarter of the blocks are updated per iteration, I^t = [B/4]. The results of this experiment are given in Figure 4.

Figure 4: RAPSA and ARAPSA algorithms on MNIST data with mini-batch size L = 10, the level of curvature information set to \tau = 10, and a constant stepsize. Panels: (a) empirical risk F(x^t) - F(x^*) vs. number of features processed \tilde{p}_t; (b) average classification accuracy vs. \tilde{p}_t. ARAPSA successfully accelerates the convergence of RAPSA and achieves higher accuracy per features processed \tilde{p}_t.

In Figure 4(a) we plot F(x^t) - F(x^*) versus the number of features processed \tilde{p}_t. Observe that ARAPSA converges more quickly and to a point closer to the optimum than RAPSA: for a fixed risk benchmark F(x^t) - F(x^*), ARAPSA requires \tilde{p}_t = 85 while RAPSA requires \tilde{p}_t = 575 features. This accelerated behavior is also apparent in Figure 4(b), where ARAPSA achieves an accuracy of 80% by \tilde{p}_t = 30, whereas RAPSA requires \tilde{p}_t = 650 features for the same benchmark.

The comparable performance of RAPSA to SGD in Figure 2(a), and of ARAPSA to oLBFGS in Figure 3, demonstrates the practical utility of the proposed method. Because RAPSA may be implemented in parallel, it may achieve the same empirical result as SGD but at a rate I times faster. Moreover, in some cases RAPSA achieves superior classification accuracy with fewer blocks updated per iteration. A natural question, given the speedup made possible by parallel computing, is why one would select I < B: for situations where p is very large, choosing I = B would require a large number of computing nodes, which may or may not be available.

References

[1] G. Sampson, R. Haigh, and E. Atwell, "Natural language analysis by stochastic optimization: A progress report on Project APRIL," J. Exp. Theor. Artif. Intell., vol. 1, no. 4, pp. 271-287, Oct. 1989.
[2] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," The Journal of Machine Learning Research, vol. 11, pp. 19-60, 2010.
[3] M. Taşan, G. Musso, T. Hao, M. Vidal, C. A. MacRae, and F. P. Roth, "Selecting causal genes from genome-wide association studies via functionally coherent subnetworks," Nature Methods, 2014.
[4] P. Tseng and O. L. Mangasarian, "Convergence of a block coordinate descent method for nondifferentiable minimization," J. Optim. Theory Appl., 2001.
[5] Y. Xu and W. Yin, "A globally convergent algorithm for nonconvex optimization based on block coordinate update," arXiv preprint arXiv:1410.1386, 2014.
[6] P. Richtárik and M. Takáč, "Parallel coordinate descent methods for big data optimization," arXiv preprint arXiv:1212.0873, 2012.
[7] Z. Lu and L. Xiao, "On the complexity analysis of randomized block-coordinate descent methods," Mathematical Programming, 2013.
[8] Y. Nesterov, "Efficiency of coordinate descent methods on huge-scale optimization problems," SIAM Journal on Optimization, vol. 22, no. 2, pp. 341-362, 2012.
[9] J. Liu, S. J. Wright, C. Ré, V. Bittorf, and S. Sridhar, "An asynchronous parallel stochastic coordinate descent algorithm," arXiv preprint arXiv:1311.1873, 2013.
[10] H. Robbins and S. Monro, "A stochastic approximation method," Ann. Math. Statist., vol. 22, no. 3, pp. 400-407, 1951.
[11] N. Schraudolph, J. Yu, and S. Günter, "A stochastic quasi-Newton method for online convex optimization," 2007.
[12] A. Bordes, L. Bottou, and P. Gallinari, "SGD-QN: Careful quasi-Newton stochastic gradient descent," The Journal of Machine Learning Research, vol. 10, pp. 1737-1754, 2009.
[13] Y. Xu and W. Yin, "Block stochastic gradient iteration for convex and nonconvex optimization," arXiv preprint, Aug. 2014.
[14] F. Niu, B. Recht, C. Ré, and S. J. Wright, "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent," in NIPS, 2011.
[15] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, "Robust stochastic approximation approach to stochastic programming," SIAM Journal on Optimization, vol. 19, no. 4, pp. 1574-1609, 2009.
[16] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Mathematical Programming, vol. 45, no. 1-3, pp. 503-528, 1989.
[17] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits."
[18] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," The Journal of Machine Learning Research, vol. 9, pp. 1871-1874, 2008.

A Proof of Lemma 1

Recall that the components of the vector x^{t+1} are equal to the components of x^t for the coordinates that are not updated at step t, i.e., for i \notin I^t. For the updated coordinates i \in I^t we know that x_i^{t+1} = x_i^t - \gamma^t \nabla_{x_i} f(x^t, \Theta_i^t). Therefore, B - I blocks of the vector x^{t+1} - x^t are 0 and the remaining I randomly chosen blocks are given by -\gamma^t \nabla_{x_i} f(x^t, \Theta_i^t). Notice that there are \binom{B}{I} different ways of picking I blocks out of the whole B blocks. Therefore, the probability of each combination of blocks is 1/\binom{B}{I}. Further, each block appears in \binom{B-1}{I-1} of the combinations. Therefore, the expected value can be written as

    E_{I^t}[ x^{t+1} - x^t \,|\, F^t ] = \frac{ \binom{B-1}{I-1} }{ \binom{B}{I} } \left( -\gamma^t \nabla f(x^t, \Theta^t) \right).    (24)

Observe that simplifying the ratio on the right hand side of (24) leads to

    \frac{ \binom{B-1}{I-1} }{ \binom{B}{I} } = \frac{(B-1)!}{(I-1)!(B-I)!} \times \frac{I!(B-I)!}{B!} = \frac{I}{B} = r.    (25)

Substituting the simplification in (25) into (24) yields the claim in (8). To prove the claim in (9) we can use the same argument as in the proof of (8) to show that

    E_{I^t}[ \| x^{t+1} - x^t \|^2 \,|\, F^t ] = \frac{ \binom{B-1}{I-1} }{ \binom{B}{I} } (\gamma^t)^2 \| \nabla f(x^t, \Theta^t) \|^2.    (26)

By substituting the simplification in (25) into (26), the claim in (9) follows.
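The combinatorial argument above is easy to validate numerically. A small Monte Carlo sanity check of (8), with purely illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
p, B, I, gamma = 12, 6, 2, 0.1
blocks = np.array_split(np.arange(p), B)
g = rng.normal(size=p)                # stands in for grad f(x^t, Theta^t)

trials, diff = 100000, np.zeros(p)
for _ in range(trials):               # average x^{t+1} - x^t over random I^t
    d = np.zeros(p)
    for b in rng.choice(B, size=I, replace=False):
        d[blocks[b]] = -gamma * g[blocks[b]]
    diff += d / trials

print(np.allclose(diff, -(I / B) * gamma * g, atol=2e-3))   # matches (8)
```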

B Proof of Proposition 1

By considering the Taylor expansion of F(x^{t+1}) near the point x^t and observing the Lipschitz continuity of the gradients \nabla F with constant M, we obtain that the average objective function F(x^{t+1}) is bounded above by

    F(x^{t+1}) \leq F(x^t) + \nabla F(x^t)^T (x^{t+1} - x^t) + \frac{M}{2} \| x^{t+1} - x^t \|^2.    (27)

Compute the expectation of both sides of (27) with respect to the random set I^t given the observed set of information F^t. Substitute E_{I^t}[x^{t+1} - x^t | F^t] and E_{I^t}[\|x^{t+1} - x^t\|^2 | F^t] by their simplifications in (8) and (9), respectively, to write

    E_{I^t}[ F(x^{t+1}) \,|\, F^t ] \leq F(x^t) - r \gamma^t \nabla F(x^t)^T \nabla f(x^t, \Theta^t) + \frac{r M (\gamma^t)^2}{2} \| \nabla f(x^t, \Theta^t) \|^2.    (28)

Notice that the stochastic gradient \nabla f(x^t, \Theta^t) is an unbiased estimate of the average function gradient \nabla F(x^t), so that E_{\Theta^t}[ \nabla f(x^t, \Theta^t) \,|\, F^t ] = \nabla F(x^t). Observing this relation and considering the assumption in (7), the expected value of (28) with respect to the set of realizations \Theta^t can be written as

    E_{I^t, \Theta^t}[ F(x^{t+1}) \,|\, F^t ] \leq F(x^t) - r \gamma^t \| \nabla F(x^t) \|^2 + \frac{r M K (\gamma^t)^2}{2}.    (29)

Subtracting the optimal objective function value F(x^*) from both sides of (29) implies that

    E_{I^t, \Theta^t}[ F(x^{t+1}) - F(x^*) \,|\, F^t ] \leq F(x^t) - F(x^*) - r \gamma^t \| \nabla F(x^t) \|^2 + \frac{r M K (\gamma^t)^2}{2}.    (30)

We proceed to find a lower bound for the gradient norm \|\nabla F(x^t)\| in terms of the objective value error F(x^t) - F(x^*). Assumption 1 states that the average objective function F is strongly convex with constant m > 0. Therefore, for any y, z \in R^p we can write

    F(y) \geq F(z) + \nabla F(z)^T (y - z) + \frac{m}{2} \| y - z \|^2.    (31)

For fixed z, the right hand side of (31) is a quadratic function of y whose minimum argument we can find by setting its gradient to zero. Doing so yields the minimizing argument \hat{y} = z - (1/m) \nabla F(z), implying that for all y we must have

    F(y) \geq F(z) + \nabla F(z)^T (\hat{y} - z) + \frac{m}{2} \| \hat{y} - z \|^2 = F(z) - \frac{1}{2m} \| \nabla F(z) \|^2.    (32)

Observe that the bound in (32) holds true for all y and z. Setting y = x^* and z = x^t in (32) and rearranging terms yields a lower bound for the squared gradient norm \|\nabla F(x^t)\|^2:

    \| \nabla F(x^t) \|^2 \geq 2m ( F(x^t) - F(x^*) ).    (33)

Substituting the lower bound in (33) for the squared gradient norm \|\nabla F(x^t)\|^2 in (30) yields the claim in (10).

C Proof of Theorem 1

We use the relationship in (10) to build a supermartingale sequence. To do so, define the stochastic process \alpha^t as

    \alpha^t := F(x^t) - F(x^*) + \frac{r M K}{2} \sum_{u=t}^{\infty} (\gamma^u)^2.    (34)

Note that \alpha^t is well-defined because \sum_{u=t}^\infty (\gamma^u)^2 \leq \sum_{u=0}^\infty (\gamma^u)^2 < \infty is summable. Further define the sequence \beta^t with values

    \beta^t := 2 m \gamma^t r ( F(x^t) - F(x^*) ).    (35)

The definitions of the sequences \alpha^t and \beta^t in (34) and (35), respectively, and the inequality in (10) imply that the expected value of \alpha^{t+1} given F^t can be written as

    E[ \alpha^{t+1} \,|\, F^t ] \leq \alpha^t - \beta^t.    (36)

Since the sequences \alpha^t and \beta^t are nonnegative, it follows from (36) that they satisfy the conditions of the supermartingale convergence theorem. Therefore, we obtain that: (i) the sequence \alpha^t converges almost surely to a limit; (ii) the sum \sum_{t=0}^\infty \beta^t < \infty is almost surely finite. The latter result yields

    \sum_{t=0}^{\infty} 2 m \gamma^t r ( F(x^t) - F(x^*) ) < \infty    a.s.    (37)

Since the sequence of stepsizes is non-summable, there exists a subsequence of F(x^t) - F(x^*) which converges to null. This observation is equivalent to almost sure convergence of the limit inferior of F(x^t) - F(x^*) to null:

    \liminf_{t \to \infty} F(x^t) - F(x^*) = 0    a.s.    (38)

Based on the martingale convergence theorem applied to the sequences \alpha^t and \beta^t in relation (36), the sequence \alpha^t almost surely converges to a limit. Consider the definition of \alpha^t in (34). Observe that the sum \sum_{u=t}^\infty (\gamma^u)^2 is deterministic and its limit is null. Therefore, the sequence of objective function value errors F(x^t) - F(x^*) almost surely converges to a limit. This observation, in association with the result in (38), implies that the whole sequence F(x^t) - F(x^*) converges almost surely to null:

    \lim_{t \to \infty} F(x^t) - F(x^*) = 0    a.s.    (39)

The last step is to prove almost sure convergence of the sequence \|x^t - x^*\|^2 to null as a result of the limit in (39). To do so, we prove a lower bound for the objective function value error F(x^t) - F(x^*) in terms of the squared norm error \|x^t - x^*\|^2. According to the strong convexity assumption, we can write the following inequality:

    F(x^t) \geq F(x^*) + \nabla F(x^*)^T (x^t - x^*) + \frac{m}{2} \| x^t - x^* \|^2.    (40)

Observe that the gradient at the optimal point is the null vector, i.e., \nabla F(x^*) = 0. This observation and rearranging the terms in (40) imply that

    F(x^t) - F(x^*) \geq \frac{m}{2} \| x^t - x^* \|^2.    (41)

The upper bound in (41) for the squared norm \|x^t - x^*\|^2, in association with the fact that the sequence F(x^t) - F(x^*) almost surely converges to null, leads to the conclusion that the sequence \|x^t - x^*\|^2 almost surely converges to zero. Hence, the claim in (11) is valid.

The next step is to study the convergence rate of RAPSA in expectation. In this step we assume that the diminishing stepsize is defined as \gamma^t = \gamma^0 T^0 / (t + T^0). Recall the inequality in (10), substitute \gamma^t by \gamma^0 T^0 / (t + T^0), and compute the expected value given F^0 to obtain

    E[ F(x^{t+1}) - F(x^*) ] \leq \left( 1 - \frac{2 m r \gamma^0 T^0}{t + T^0} \right) E[ F(x^t) - F(x^*) ] + \frac{r M K (\gamma^0 T^0)^2}{2 (t + T^0)^2}.    (42)

We use the following lemma to show that the result in (42) implies sublinear convergence of the sequence of expected objective value errors E[F(x^t) - F(x^*)].

Lemma 2. Let c > 1, b > 0, and t^0 > 0 be given constants and u^t \geq 0 be a nonnegative sequence that satisfies the inequality

    u^{t+1} \leq \left( 1 - \frac{c}{t + t^0} \right) u^t + \frac{b}{(t + t^0)^2},    (43)

for all times t \geq 0. The sequence u^t is then bounded as

    u^t \leq \frac{Q}{t + t^0},    (44)

for all times t \geq 0, where the constant Q is defined as Q := \max\{ b/(c - 1), t^0 u^0 \}.

Proof: See ...

Lemma 2 shows that if a sequence u^t satisfies the condition in (43), then it converges to null at least with rate O(1/t). By assigning the values t^0 = T^0, u^t = E[F(x^t) - F(x^*)], c = 2 m r \gamma^0 T^0, and b = r M K (\gamma^0 T^0)^2 / 2, the relation in (42) implies that the inequality in (43) is satisfied for the case that 2 m r \gamma^0 T^0 > 1. Therefore, the result in (44) holds and we can conclude that

    E[ F(x^t) - F(x^*) ] \leq \frac{C}{t + T^0},    (45)

where the constant C is defined as

    C = \max\left\{ \frac{r M K (\gamma^0 T^0)^2}{4 m r \gamma^0 T^0 - 2},\; T^0 ( F(x^0) - F(x^*) ) \right\}.    (46)

D Proof of Theorem 2

To prove the claim in (14) we use the relationship in (10) to construct a supermartingale. Define the stochastic process \alpha^t with values

    \alpha^t = ( F(x^t) - F(x^*) ) \times 1\left\{ \min_{u \leq t} F(x^u) - F(x^*) > \frac{\gamma M K}{4m} \right\}.    (47)

The process \alpha^t tracks the optimality gap F(x^t) - F(x^*) until the gap becomes smaller than \gamma M K / (4m) for the first time, at which point it becomes \alpha^t = 0. Notice that the stochastic process \alpha^t is always non-negative, i.e., \alpha^t \geq 0. Likewise, we define the stochastic process \beta^t as

    \beta^t = 2 \gamma m r \left( F(x^t) - F(x^*) - \frac{\gamma M K}{4m} \right) \times 1\left\{ \min_{u \leq t} F(x^u) - F(x^*) > \frac{\gamma M K}{4m} \right\},    (48)

which follows 2 \gamma m r ( F(x^t) - F(x^*) - \gamma M K/(4m) ) until the first time that the optimality gap F(x^t) - F(x^*) becomes smaller than \gamma M K / (4m); after this moment the stochastic process \beta^t becomes null.

According to the definition of \beta^t in (48), the stochastic process satisfies \beta^t \geq 0 for all t \geq 0. Based on the relationship (10) and the definitions of the stochastic processes \alpha^t and \beta^t in (47) and (48), we obtain that for all times t \geq 0

    E[ \alpha^{t+1} \,|\, F^t ] \leq \alpha^t - \beta^t.    (49)

To check the validity of (49), we first consider the case that \min_{u \leq t} F(x^u) - F(x^*) > \gamma M K / (4m) holds. In this scenario we can simplify the stochastic processes in (47) and (48) as \alpha^t = F(x^t) - F(x^*) and \beta^t = 2 \gamma m r ( F(x^t) - F(x^*) - \gamma M K/(4m) ). Therefore, according to the inequality in (10), the result in (49) is valid. The second scenario that we check is \min_{u \leq t} F(x^u) - F(x^*) \leq \gamma M K / (4m). Based on the definitions of the stochastic processes \alpha^t and \beta^t, both of these sequences are equal to 0. Further, notice that when \alpha^t = 0, it follows that \alpha^{t+1} = 0. Hence, the relationship in (49) is true.

Given the relation in (49) and the non-negativity of the stochastic processes \alpha^t and \beta^t, we obtain that \alpha^t is a supermartingale. The supermartingale convergence theorem yields: (i) the sequence \alpha^t converges to a limit almost surely; (ii) the sum \sum_{t=1}^\infty \beta^t is finite almost surely. The latter result implies that the sequence \beta^t converges to null almost surely, i.e.,

    \lim_{t \to \infty} \beta^t = 0    a.s.    (50)

Based on the definition of \beta^t in (48), the limit in (50) is true if one of the following events holds: (i) the indicator function is null for sufficiently large t; (ii) the limit \lim_{t \to \infty} ( F(x^t) - F(x^*) - \gamma M K/(4m) ) = 0 holds true. Either of these two events implies that

    \liminf_{t \to \infty} F(x^t) - F(x^*) \leq \frac{\gamma M K}{4m}    a.s.    (51)

Therefore, the claim in (14) is valid. The result in (51) shows that the objective function value sequence F(x^t) almost surely converges to a neighborhood of the optimal objective function value F(x^*).

We proceed to prove the result in (15). Compute the expected value of (10) given F^0 and set \gamma^t = \gamma to obtain

    E[ F(x^{t+1}) - F(x^*) ] \leq ( 1 - 2 m \gamma r ) E[ F(x^t) - F(x^*) ] + \frac{r M K \gamma^2}{2}.    (52)

Notice that the expression in (52) provides an upper bound for the expected objective function error E[F(x^{t+1}) - F(x^*)] in terms of its previous value E[F(x^t) - F(x^*)] and an error term. Rewriting the relation in (52) for step t - 1 leads to

    E[ F(x^t) - F(x^*) ] \leq ( 1 - 2 m \gamma r ) E[ F(x^{t-1}) - F(x^*) ] + \frac{r M K \gamma^2}{2}.    (53)

Substituting the upper bound in (53) for the expectation E[F(x^t) - F(x^*)] in (52) yields an upper bound for the expected error E[F(x^{t+1}) - F(x^*)] as

    E[ F(x^{t+1}) - F(x^*) ] \leq ( 1 - 2 m \gamma r )^2 E[ F(x^{t-1}) - F(x^*) ] + \frac{r M K \gamma^2}{2} \left( 1 + (1 - 2 m r \gamma) \right).    (54)

By recursively applying the steps in (53) and (54) we can bound the expected objective function error E[F(x^{t+1}) - F(x^*)] in terms of the initial objective function error F(x^0) - F(x^*) and the accumulation of the errors as

    E[ F(x^{t+1}) - F(x^*) ] \leq ( 1 - 2 m \gamma r )^{t+1} ( F(x^0) - F(x^*) ) + \frac{r M K \gamma^2}{2} \sum_{u=0}^{t} (1 - 2 m r \gamma)^u.    (55)

Substituting t by t - 1 and simplifying the sum on the right hand side of (55) yields

    E[ F(x^t) - F(x^*) ] \leq ( 1 - 2 m \gamma r )^t ( F(x^0) - F(x^*) ) + \frac{M K \gamma}{4m} \left[ 1 - (1 - 2 m r \gamma)^t \right].    (56)

Observing that the term 1 - (1 - 2 m r \gamma)^t on the right hand side of (56) is strictly smaller than 1 for any stepsize \gamma < 1/(2 m r), the claim in (15) follows.

E Implementation of ARAPSA

For reference, ARAPSA is also summarized in algorithmic form in Algorithm 2. Steps 2 and 3 are devoted to assigning random blocks to the processors. In Step 2 a subset I^t of the available blocks is chosen. These blocks are assigned to different processors in Step 3. In Step 5, processors compute the partial stochastic gradients \nabla_{x_i} f(x^t, \Theta_i^t) corresponding to their assigned blocks, using the samples acquired in Step 4. Steps 6 and 7 are devoted to the computation of the ARAPSA descent direction \hat{d}_i^t. In Step 6 the approximate Hessian inverse \hat{B}_i^{t,0} for block x_i is initialized as \hat{B}_i^{t,0} = \eta_i^t I, which is a scaled identity matrix using the expression for \eta_i^t in (20) for t > 0; the initial value is \eta_i^0 = 1. In Step 7 we use Algorithm 1 for the efficient computation of the descent direction \hat{d}_i^t = \hat{B}_i^t \nabla_{x_i} f(x^t, \Theta_i^t). The descent direction \hat{d}_i^t is used to update the block x_i^t with stepsize \gamma^t in Step 8. Step 9 determines the value of the partial stochastic gradient \nabla_{x_i} f(x^{t+1}, \Theta_i^t), which is required for the computation of the stochastic gradient variation \hat{r}_i^t. In Step 10 the variable variation v_i^t and the stochastic gradient variation \hat{r}_i^t associated with block x_i are computed, to be used in the next iteration.

Algorithm 1 Computation of the ARAPSA step \hat{d}_i^t = \hat{B}_i^t \nabla_{x_i} f(x^t, \Theta_i^t) for block x_i
1: function \hat{d}_i^t = q^\tau = ARAPSA_Step( \hat{B}_i^{t,0}, p^0 = \nabla_{x_i} f(x^t, \Theta_i^t), {v_i^u, \hat{r}_i^u}_{u=t-\tau}^{t-1} )
2: for u = 0, 1, ..., \tau - 1 do {Loop to compute constants \alpha^u and sequence p^u}
3:     Compute and store the scalar \alpha^u = \hat{\rho}_i^{t-u-1} (v_i^{t-u-1})^T p^u
4:     Update the sequence vector p^{u+1} = p^u - \alpha^u \hat{r}_i^{t-u-1}
5: end for
6: Multiply p^\tau by the initial matrix: q^0 = \hat{B}_i^{t,0} p^\tau
7: for u = 0, 1, ..., \tau - 1 do {Loop to compute constants \beta^u and sequence q^u}
8:     Compute the scalar \beta^u = \hat{\rho}_i^{t-\tau+u} (\hat{r}_i^{t-\tau+u})^T q^u
9:     Update the sequence vector q^{u+1} = q^u + (\alpha^{\tau-u-1} - \beta^u) v_i^{t-\tau+u}
10: end for {return \hat{d}_i^t = q^\tau}

Algorithm 2 Accelerated Random Parallel Stochastic Algorithm (ARAPSA) for individual processors
1: for t = 0, 1, 2, ... do
2:     Choose uniformly at random a set I^t \subset {1, ..., B} of block variables to update
3:     Assign the block variables S^t to processors in any manner
4:     Choose a set of realizations \Theta_i^t for each assigned block x_i
5:     Compute the stochastic gradient: \nabla_{x_i} f(x^t, \Theta_i^t) = (1/L) \sum_{\theta \in \Theta_i^t} \nabla_{x_i} f(x^t, \theta) [cf. (3)]
6:     Compute the initial Hessian inverse approximation: \hat{B}_i^{t,0} = \eta_i^t I
7:     Compute the descent direction: \hat{d}_i^t = ARAPSA_Step( \hat{B}_i^{t,0}, \nabla_{x_i} f(x^t, \Theta_i^t), {v_i^u, \hat{r}_i^u}_{u=t-\tau}^{t-1} )
8:     Update the coordinates of the decision variable: x_i^{t+1} = x_i^t - \gamma^t \hat{d}_i^t
9:     Compute the updated stochastic gradient: \nabla_{x_i} f(x^{t+1}, \Theta_i^t) = (1/L) \sum_{\theta \in \Theta_i^t} \nabla_{x_i} f(x^{t+1}, \theta) [cf. (3)]
10:    Update the variations v_i^t = x_i^{t+1} - x_i^t and \hat{r}_i^t = \nabla_{x_i} f(x^{t+1}, \Theta_i^t) - \nabla_{x_i} f(x^t, \Theta_i^t) [cf. (19)]
11: end for
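A compact Python rendering of Algorithm 1's two-loop recursion may help; it is a sketch under the stated conventions (pairs stored oldest to newest, initial scaling \eta^t I as in (20)), not the authors' implementation.

```python
import numpy as np

def arapsa_step(grad, pairs, eps=1e-10):
    """Two-loop recursion of Algorithm 1: returns d = B_hat @ grad without
    forming B_hat. pairs is a list of (v, r_hat) variations for one block,
    ordered oldest to newest; an empty list gives B_hat = I (eta^0 = 1)."""
    p, alphas, rhos = grad.copy(), [], []
    for v, r in reversed(pairs):                 # first loop: newest to oldest
        rho = 1.0 / (np.dot(v, r) + eps)
        a = rho * np.dot(v, p)
        p -= a * r
        alphas.append(a)
        rhos.append(rho)
    if pairs:                                    # initial scaling eta^t I, cf. (20)
        v, r = pairs[-1]
        p *= np.dot(v, r) / (np.dot(r, r) + eps)
    for (v, r), rho, a in zip(pairs, reversed(rhos), reversed(alphas)):
        b = rho * np.dot(r, p)                   # second loop: oldest to newest
        p += (a - b) * v
    return p                                     # descent direction d_hat
```

This is the standard limited-memory trick from [16]: the cost is O(\tau p_b) per block instead of the O(p_b^2) required to form the matrix in (21) explicitly.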


More information

PASS Sample Size Software. :log

PASS Sample Size Software. :log PASS Sample Sze Software Chapter 70 Probt Analyss Introducton Probt and lot analyss may be used for comparatve LD 50 studes for testn the effcacy of drus desned to prevent lethalty. Ths proram module presents

More information

CHAPTER 9 FUNCTIONAL FORMS OF REGRESSION MODELS

CHAPTER 9 FUNCTIONAL FORMS OF REGRESSION MODELS CHAPTER 9 FUNCTIONAL FORMS OF REGRESSION MODELS QUESTIONS 9.1. (a) In a log-log model the dependent and all explanatory varables are n the logarthmc form. (b) In the log-ln model the dependent varable

More information

Evaluating Performance

Evaluating Performance 5 Chapter Evaluatng Performance In Ths Chapter Dollar-Weghted Rate of Return Tme-Weghted Rate of Return Income Rate of Return Prncpal Rate of Return Daly Returns MPT Statstcs 5- Measurng Rates of Return

More information

Optimising a general repair kit problem with a service constraint

Optimising a general repair kit problem with a service constraint Optmsng a general repar kt problem wth a servce constrant Marco Bjvank 1, Ger Koole Department of Mathematcs, VU Unversty Amsterdam, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands Irs F.A. Vs Department

More information

arxiv: v1 [math.nt] 29 Oct 2015

arxiv: v1 [math.nt] 29 Oct 2015 A DIGITAL BINOMIAL THEOREM FOR SHEFFER SEQUENCES TOUFIK MANSOUR AND HIEU D. NGUYEN arxv:1510.08529v1 [math.nt] 29 Oct 2015 Abstract. We extend the dgtal bnomal theorem to Sheffer polynomal sequences by

More information

Production and Supply Chain Management Logistics. Paolo Detti Department of Information Engeneering and Mathematical Sciences University of Siena

Production and Supply Chain Management Logistics. Paolo Detti Department of Information Engeneering and Mathematical Sciences University of Siena Producton and Supply Chan Management Logstcs Paolo Dett Department of Informaton Engeneerng and Mathematcal Scences Unversty of Sena Convergence and complexty of the algorthm Convergence of the algorthm

More information

Alternatives to Shewhart Charts

Alternatives to Shewhart Charts Alternatves to Shewhart Charts CUSUM & EWMA S Wongsa Overvew Revstng Shewhart Control Charts Cumulatve Sum (CUSUM) Control Chart Eponentally Weghted Movng Average (EWMA) Control Chart 2 Revstng Shewhart

More information

A Unified Distributed Algorithm for Non-Games Non-cooperative, Non-convex, and Non-differentiable. Jong-Shi Pang and Meisam Razaviyayn.

A Unified Distributed Algorithm for Non-Games Non-cooperative, Non-convex, and Non-differentiable. Jong-Shi Pang and Meisam Razaviyayn. A Unfed Dstrbuted Algorthm for Non-Games Non-cooperatve, Non-convex, and Non-dfferentable Jong-Sh Pang and Mesam Razavyayn presented at Workshop on Optmzaton for Modern Computaton Pekng Unversty, Bejng,

More information

Dynamic Analysis of Knowledge Sharing of Agents with. Heterogeneous Knowledge

Dynamic Analysis of Knowledge Sharing of Agents with. Heterogeneous Knowledge Dynamc Analyss of Sharng of Agents wth Heterogeneous Kazuyo Sato Akra Namatame Dept. of Computer Scence Natonal Defense Academy Yokosuka 39-8686 JAPAN E-mal {g40045 nama} @nda.ac.jp Abstract In ths paper

More information

Introduction to game theory

Introduction to game theory Introducton to game theory Lectures n game theory ECON5210, Sprng 2009, Part 1 17.12.2008 G.B. Ashem, ECON5210-1 1 Overvew over lectures 1. Introducton to game theory 2. Modelng nteractve knowledge; equlbrum

More information

Capability Analysis. Chapter 255. Introduction. Capability Analysis

Capability Analysis. Chapter 255. Introduction. Capability Analysis Chapter 55 Introducton Ths procedure summarzes the performance of a process based on user-specfed specfcaton lmts. The observed performance as well as the performance relatve to the Normal dstrbuton are

More information

Numerical Analysis ECIV 3306 Chapter 6

Numerical Analysis ECIV 3306 Chapter 6 The Islamc Unversty o Gaza Faculty o Engneerng Cvl Engneerng Department Numercal Analyss ECIV 3306 Chapter 6 Open Methods & System o Non-lnear Eqs Assocate Pro. Mazen Abualtaye Cvl Engneerng Department,

More information

A Case Study for Optimal Dynamic Simulation Allocation in Ordinal Optimization 1

A Case Study for Optimal Dynamic Simulation Allocation in Ordinal Optimization 1 A Case Study for Optmal Dynamc Smulaton Allocaton n Ordnal Optmzaton Chun-Hung Chen, Dongha He, and Mchael Fu 4 Abstract Ordnal Optmzaton has emerged as an effcent technque for smulaton and optmzaton.

More information

0.1 Gradient descent for convex functions: univariate case

0.1 Gradient descent for convex functions: univariate case prnceton unv. F 16 cos 51: Advanced Algorthm Desgn Lecture 14: Gong wth the slope: offlne, onlne, and randomly Lecturer: Sanjeev Arora Scrbe:Sanjeev Arora hs lecture s about gradent descent, a popular

More information

Bid-auction framework for microsimulation of location choice with endogenous real estate prices

Bid-auction framework for microsimulation of location choice with endogenous real estate prices Bd-aucton framework for mcrosmulaton of locaton choce wth endogenous real estate prces Rcardo Hurtuba Mchel Berlare Francsco Martínez Urbancs Termas de Chllán, Chle March 28 th 2012 Outlne 1) Motvaton

More information

SIMPLE FIXED-POINT ITERATION

SIMPLE FIXED-POINT ITERATION SIMPLE FIXED-POINT ITERATION The fed-pont teraton method s an open root fndng method. The method starts wth the equaton f ( The equaton s then rearranged so that one s one the left hand sde of the equaton

More information

Robust Boosting and its Relation to Bagging

Robust Boosting and its Relation to Bagging Robust Boostng and ts Relaton to Baggng Saharon Rosset IBM T.J. Watson Research Center P. O. Box 218 Yorktown Heghts, NY 10598 srosset@us.bm.com ABSTRACT Several authors have suggested vewng boostng as

More information

Available online at ScienceDirect. Procedia Computer Science 24 (2013 ) 9 14

Available online at   ScienceDirect. Procedia Computer Science 24 (2013 ) 9 14 Avalable onlne at www.scencedrect.com ScenceDrect Proceda Computer Scence 24 (2013 ) 9 14 17th Asa Pacfc Symposum on Intellgent and Evolutonary Systems, IES2013 A Proposal of Real-Tme Schedulng Algorthm

More information

Nonlinear Monte Carlo Methods. From American Options to Fully Nonlinear PDEs

Nonlinear Monte Carlo Methods. From American Options to Fully Nonlinear PDEs : From Amercan Optons to Fully Nonlnear PDEs Ecole Polytechnque Pars PDEs and Fnance Workshop KTH, Stockholm, August 20-23, 2007 Outlne 1 Monte Carlo Methods for Amercan Optons 2 3 4 Outlne 1 Monte Carlo

More information

Note on Cubic Spline Valuation Methodology

Note on Cubic Spline Valuation Methodology Note on Cubc Splne Valuaton Methodology Regd. Offce: The Internatonal, 2 nd Floor THE CUBIC SPLINE METHODOLOGY A model for yeld curve takes traded yelds for avalable tenors as nput and generates the curve

More information

Finance 402: Problem Set 1 Solutions

Finance 402: Problem Set 1 Solutions Fnance 402: Problem Set 1 Solutons Note: Where approprate, the fnal answer for each problem s gven n bold talcs for those not nterested n the dscusson of the soluton. 1. The annual coupon rate s 6%. A

More information

c slope = -(1+i)/(1+π 2 ) MRS (between consumption in consecutive time periods) price ratio (across consecutive time periods)

c slope = -(1+i)/(1+π 2 ) MRS (between consumption in consecutive time periods) price ratio (across consecutive time periods) CONSUMPTION-SAVINGS FRAMEWORK (CONTINUED) SEPTEMBER 24, 2013 The Graphcs of the Consumpton-Savngs Model CONSUMER OPTIMIZATION Consumer s decson problem: maxmze lfetme utlty subject to lfetme budget constrant

More information

Likelihood Fits. Craig Blocker Brandeis August 23, 2004

Likelihood Fits. Craig Blocker Brandeis August 23, 2004 Lkelhood Fts Crag Blocker Brandes August 23, 2004 Outlne I. What s the queston? II. Lkelhood Bascs III. Mathematcal Propertes IV. Uncertantes on Parameters V. Mscellaneous VI. Goodness of Ft VII. Comparson

More information

OCR Statistics 1 Working with data. Section 2: Measures of location

OCR Statistics 1 Working with data. Section 2: Measures of location OCR Statstcs 1 Workng wth data Secton 2: Measures of locaton Notes and Examples These notes have sub-sectons on: The medan Estmatng the medan from grouped data The mean Estmatng the mean from grouped data

More information

A Constant-Factor Approximation Algorithm for Network Revenue Management

A Constant-Factor Approximation Algorithm for Network Revenue Management A Constant-Factor Approxmaton Algorthm for Networ Revenue Management Yuhang Ma 1, Paat Rusmevchentong 2, Ma Sumda 1, Huseyn Topaloglu 1 1 School of Operatons Research and Informaton Engneerng, Cornell

More information

Topics on the Border of Economics and Computation November 6, Lecture 2

Topics on the Border of Economics and Computation November 6, Lecture 2 Topcs on the Border of Economcs and Computaton November 6, 2005 Lecturer: Noam Nsan Lecture 2 Scrbe: Arel Procacca 1 Introducton Last week we dscussed the bascs of zero-sum games n strategc form. We characterzed

More information

>1 indicates country i has a comparative advantage in production of j; the greater the index, the stronger the advantage. RCA 1 ij

>1 indicates country i has a comparative advantage in production of j; the greater the index, the stronger the advantage. RCA 1 ij 69 APPENDIX 1 RCA Indces In the followng we present some maor RCA ndces reported n the lterature. For addtonal varants and other RCA ndces, Memedovc (1994) and Vollrath (1991) provde more thorough revews.

More information

Elton, Gruber, Brown, and Goetzmann. Modern Portfolio Theory and Investment Analysis, 7th Edition. Solutions to Text Problems: Chapter 9

Elton, Gruber, Brown, and Goetzmann. Modern Portfolio Theory and Investment Analysis, 7th Edition. Solutions to Text Problems: Chapter 9 Elton, Gruber, Brown, and Goetzmann Modern Portfolo Theory and Investment Analyss, 7th Edton Solutons to Text Problems: Chapter 9 Chapter 9: Problem In the table below, gven that the rskless rate equals

More information

Interval Estimation for a Linear Function of. Variances of Nonnormal Distributions. that Utilize the Kurtosis

Interval Estimation for a Linear Function of. Variances of Nonnormal Distributions. that Utilize the Kurtosis Appled Mathematcal Scences, Vol. 7, 013, no. 99, 4909-4918 HIKARI Ltd, www.m-hkar.com http://dx.do.org/10.1988/ams.013.37366 Interval Estmaton for a Lnear Functon of Varances of Nonnormal Dstrbutons that

More information

Parsing beyond context-free grammar: Tree Adjoining Grammar Parsing I

Parsing beyond context-free grammar: Tree Adjoining Grammar Parsing I Parsng beyond context-free grammar: Tree donng Grammar Parsng I Laura Kallmeyer, Wolfgang Maer ommersemester 2009 duncton and substtuton (1) Tree donng Grammars (TG) Josh et al. (1975), Josh & chabes (1997):

More information

Teaching Note on Factor Model with a View --- A tutorial. This version: May 15, Prepared by Zhi Da *

Teaching Note on Factor Model with a View --- A tutorial. This version: May 15, Prepared by Zhi Da * Copyrght by Zh Da and Rav Jagannathan Teachng Note on For Model th a Ve --- A tutoral Ths verson: May 5, 2005 Prepared by Zh Da * Ths tutoral demonstrates ho to ncorporate economc ves n optmal asset allocaton

More information

Testing for Omitted Variables

Testing for Omitted Variables Testng for Omtted Varables Jeroen Weese Department of Socology Unversty of Utrecht The Netherlands emal J.weese@fss.uu.nl tel +31 30 2531922 fax+31 30 2534405 Prepared for North Amercan Stata users meetng

More information

Elements of Economic Analysis II Lecture VI: Industry Supply

Elements of Economic Analysis II Lecture VI: Industry Supply Elements of Economc Analyss II Lecture VI: Industry Supply Ka Hao Yang 10/12/2017 In the prevous lecture, we analyzed the frm s supply decson usng a set of smple graphcal analyses. In fact, the dscusson

More information

A Bootstrap Confidence Limit for Process Capability Indices

A Bootstrap Confidence Limit for Process Capability Indices A ootstrap Confdence Lmt for Process Capablty Indces YANG Janfeng School of usness, Zhengzhou Unversty, P.R.Chna, 450001 Abstract The process capablty ndces are wdely used by qualty professonals as an

More information

Centre for International Capital Markets

Centre for International Capital Markets Centre for Internatonal Captal Markets Dscusson Papers ISSN 1749-3412 Valung Amercan Style Dervatves by Least Squares Methods Maro Cerrato No 2007-13 Valung Amercan Style Dervatves by Least Squares Methods

More information

AC : THE DIAGRAMMATIC AND MATHEMATICAL APPROACH OF PROJECT TIME-COST TRADEOFFS

AC : THE DIAGRAMMATIC AND MATHEMATICAL APPROACH OF PROJECT TIME-COST TRADEOFFS AC 2008-1635: THE DIAGRAMMATIC AND MATHEMATICAL APPROACH OF PROJECT TIME-COST TRADEOFFS Kun-jung Hsu, Leader Unversty Amercan Socety for Engneerng Educaton, 2008 Page 13.1217.1 Ttle of the Paper: The Dagrammatc

More information

Project Management Project Phases the S curve

Project Management Project Phases the S curve Project lfe cycle and resource usage Phases Project Management Project Phases the S curve Eng. Gorgo Locatell RATE OF RESOURCE ES Conceptual Defnton Realzaton Release TIME Cumulated resource usage and

More information

A DUAL EXTERIOR POINT SIMPLEX TYPE ALGORITHM FOR THE MINIMUM COST NETWORK FLOW PROBLEM

A DUAL EXTERIOR POINT SIMPLEX TYPE ALGORITHM FOR THE MINIMUM COST NETWORK FLOW PROBLEM Yugoslav Journal of Operatons Research Vol 19 (2009), Number 1, 157-170 DOI:10.2298/YUJOR0901157G A DUAL EXTERIOR POINT SIMPLEX TYPE ALGORITHM FOR THE MINIMUM COST NETWORK FLOW PROBLEM George GERANIS Konstantnos

More information

Taxation and Externalities. - Much recent discussion of policy towards externalities, e.g., global warming debate/kyoto

Taxation and Externalities. - Much recent discussion of policy towards externalities, e.g., global warming debate/kyoto Taxaton and Externaltes - Much recent dscusson of polcy towards externaltes, e.g., global warmng debate/kyoto - Increasng share of tax revenue from envronmental taxaton 6 percent n OECD - Envronmental

More information

3/3/2014. CDS M Phil Econometrics. Vijayamohanan Pillai N. Truncated standard normal distribution for a = 0.5, 0, and 0.5. CDS Mphil Econometrics

3/3/2014. CDS M Phil Econometrics. Vijayamohanan Pillai N. Truncated standard normal distribution for a = 0.5, 0, and 0.5. CDS Mphil Econometrics Lmted Dependent Varable Models: Tobt an Plla N 1 CDS Mphl Econometrcs Introducton Lmted Dependent Varable Models: Truncaton and Censorng Maddala, G. 1983. Lmted Dependent and Qualtatve Varables n Econometrcs.

More information

Introduction. Why One-Pass Statistics?

Introduction. Why One-Pass Statistics? BERKELE RESEARCH GROUP Ths manuscrpt s program documentaton for three ways to calculate the mean, varance, skewness, kurtoss, covarance, correlaton, regresson parameters and other regresson statstcs. Although

More information

Mode is the value which occurs most frequency. The mode may not exist, and even if it does, it may not be unique.

Mode is the value which occurs most frequency. The mode may not exist, and even if it does, it may not be unique. 1.7.4 Mode Mode s the value whch occurs most frequency. The mode may not exst, and even f t does, t may not be unque. For ungrouped data, we smply count the largest frequency of the gven value. If all

More information

Solution of periodic review inventory model with general constrains

Solution of periodic review inventory model with general constrains Soluton of perodc revew nventory model wth general constrans Soluton of perodc revew nventory model wth general constrans Prof Dr J Benkő SZIU Gödöllő Summary Reasons for presence of nventory (stock of

More information

Chapter 5 Student Lecture Notes 5-1

Chapter 5 Student Lecture Notes 5-1 Chapter 5 Student Lecture Notes 5-1 Basc Busness Statstcs (9 th Edton) Chapter 5 Some Important Dscrete Probablty Dstrbutons 004 Prentce-Hall, Inc. Chap 5-1 Chapter Topcs The Probablty Dstrbuton of a Dscrete

More information

Blocks of Coordinates, Stoch Progr, and Markets

Blocks of Coordinates, Stoch Progr, and Markets Blocks of Coordnates, Stoch Progr, and Markets Sjur Ddrk Flåm Dep. Informatcs, Unversty of Bergen, Norway Bergamo June 2017 jur Ddrk Flåm (Dep. Informatcs, Unversty Blocks of Bergen, of Coordnates, Norway)

More information

Cumulative Step-size Adaptation on Linear Functions

Cumulative Step-size Adaptation on Linear Functions Cumulatve Step-sze Adaptaton on Lnear Functons Alexandre Chotard, Anne Auger, Nkolaus Hansen To cte ths verson: Alexandre Chotard, Anne Auger, Nkolaus Hansen. Cumulatve Step-sze Adaptaton on Lnear Functons.

More information

Hedging Greeks for a portfolio of options using linear and quadratic programming

Hedging Greeks for a portfolio of options using linear and quadratic programming MPRA Munch Personal RePEc Archve Hedgng reeks for a of otons usng lnear and quadratc rogrammng Panka Snha and Archt Johar Faculty of Management Studes, Unversty of elh, elh 5. February 200 Onlne at htt://mra.ub.un-muenchen.de/20834/

More information