A DOUBLE INCREMENTAL AGGREGATED GRADIENT METHOD WITH LINEAR CONVERGENCE RATE FOR LARGE-SCALE OPTIMIZATION


Aryan Mokhtari, Mert Gürbüzbalaban, and Alejandro Ribeiro

Department of Electrical and Systems Engineering, University of Pennsylvania
Department of Management Science and Information Systems, Rutgers University

Work in this paper is supported by ONR N00014-12-1-0997.

ABSTRACT

This paper considers the problem of minimizing the average of a finite set of strongly convex functions. We introduce a double incremental aggregated gradient method (DIAG) that computes the gradient of only one function at each iteration, which is chosen based on a cyclic scheme, and uses the aggregated average gradient of all the functions to approximate the full gradient. We prove not only that the proposed DIAG method converges linearly to the optimal solution, but also that its linear convergence factor justifies the advantage of incremental methods over full batch gradient descent. In particular, we show theoretically and empirically that one pass of DIAG is more efficient than one iteration of gradient descent.

Index Terms: Incremental methods, gradient descent, linear convergence rate

1. INTRODUCTION

We consider optimization problems where the objective function can be written as the average of a set of strongly convex and smooth functions. Formally, consider the variable x ∈ R^p and the objective function summands f_i : R^p → R. We aim to find the minimizer of the average function f(x) := (1/n) Σ_{i=1}^n f_i(x), i.e.,

$$x^* = \operatorname*{argmin}_{x \in \mathbb{R}^p} f(x) := \operatorname*{argmin}_{x \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^{n} f_i(x). \qquad (1)$$

In this paper, we refer to the functions f_i as the instantaneous functions and to the average function f as the global objective function. This class of optimization problems arises in many applications including machine learning [1], estimation [2], and sensor networks [3]. When the number n of instantaneous functions f_i is large, it is costly to compute descent directions of the aggregate function f. In particular, this makes the use of gradient descent (GD) on (1) costly because each descent step requires cycling through the whole set of instantaneous functions f_i. A standard solution to this drawback is the stochastic gradient descent (SGD) method, which evaluates the gradient of only one of the instantaneous functions in each iteration [4]. This algorithm can be shown to converge under mild conditions while incurring a reasonable per-iteration cost. This advantage notwithstanding, the convergence rate of SGD is sublinear, which is slower than the linear convergence rate of GD. Developing alternative stochastic descent algorithms with linear convergence rates has been a very active area in the last few years. A partial list of this consequential literature includes stochastic averaging gradient [5, 6], variance reduction [7, 8], dual coordinate methods [9, 10], hybrid algorithms [11, 12], and majorization-minimization algorithms [13]. All of these stochastic methods are successful in achieving a linear convergence rate in expectation.

A separate alternative for reducing the per-iteration cost of GD is the use of incremental methods [14, 15]. In incremental methods one function is chosen from the set of n functions at each iteration, as in SGD, but the functions are chosen in a cyclic order as opposed to uniformly at random as in stochastic methods. As in the case of SGD, cyclic incremental GD exhibits sublinear convergence. This limitation motivated the development of the incremental aggregated gradient (IAG) method, which achieves a linear convergence rate [16]. To explain our contribution, we must emphasize that the convergence constant of IAG can be smaller than the convergence constant of GD (Section 2).
Thus, even though IAG is designed to improve upon GD, the available analyses still make it impossible to assert that IAG outperforms GD under all circumstances. In fact, the question of whether it is possible at all to design a cyclic method that is guaranteed to always outperform GD remains open. In this paper we introduce the double incremental aggregated gradient (DIAG) method and show that its convergence rate is linear, with a convergence constant that is guaranteed to be smaller than the convergence constant of GD. The main difference between the DIAG and IAG methods is that DIAG iterates are computed using averages of both iterates and gradients, whereas IAG utilizes gradient averages but does not utilize iterate averages. DIAG is the first cyclic incremental gradient method which is guaranteed to improve on the performance of GD.

We start the paper by presenting the GD and IAG algorithms and reviewing their convergence rates (Section 2). Then, we present the proposed DIAG method and explain how it differs from IAG in approximating the global function f (Section 3). We show that this critical difference leads to an incremental gradient algorithm with a smaller linear convergence factor (Section 4). Moreover, we compare the performances of DIAG, GD, and IAG in solving a quadratic program and a binary classification problem (Section 5). Finally, we close the paper with concluding remarks (Section 6). Proofs of the results in this paper are available in [17].

2. BACKGROUND AND RELATED WORK

Since the objective function in (1) is convex, descent methods can be used to find the optimal argument x*. In this paper, we are interested in studying methods that converge to the optimal argument of f(x) at a linear rate. It is customary for the linear convergence analysis of first-order methods to assume that the functions are smooth and strongly convex. We formalize these conditions in the following assumption.

Assumption 1. The functions f_i are differentiable and strongly convex with constant µ, and the gradients ∇f_i are Lipschitz continuous with constant L, i.e., for all x, y ∈ R^p we have

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|. \qquad (2)$$

The strong convexity of the functions f_i with constant µ implies that the global function f is also strongly convex with constant µ. Likewise, the condition in (2) yields Lipschitz continuity of the global function gradient ∇f with constant L. Note that the conditions in Assumption 1 are mild, and they hold for most machine learning applications.

The optimization problem in (1) can be solved using the gradient descent (GD) method. In GD, the variable x^k is updated by descending along the negative direction of the gradient ∇f(x^k), i.e.,

$$x^{k+1} = x^k - \epsilon \nabla f(x^k) = x^k - \frac{\epsilon}{n}\sum_{i=1}^{n} \nabla f_i(x^k), \qquad (3)$$

where ε is the stepsize. According to the convergence analysis of GD in [18], the sequence of iterates x^k converges linearly to the optimal argument if the stepsize satisfies ε < 2/L. The fastest convergence rate is achieved by the stepsize ε = 2/(µ + L), which leads to the inequality

$$\|x^k - x^*\| \le \left(\frac{\kappa-1}{\kappa+1}\right)^{k} \|x^0 - x^*\|, \qquad (4)$$

where κ = L/µ is the condition number of the objective function. The result in (4) shows that GD reduces the distance between the iterate x^k and the optimal argument x* by the factor (κ−1)/(κ+1) after one iteration, or equivalently after one pass over the dataset.

The IAG method reduces the computational complexity of GD by computing only one gradient at each iteration. In IAG, at each iteration the gradient of only one function, chosen in a cyclic order, is updated, and the average of the stored gradients is used as an approximation of the exact gradient. In particular, if we define y_i^k as the copy of the variable x at the last time the gradient of f_i was updated up to step k, we can write the IAG update as

$$x^{k+1} = x^k - \frac{\epsilon}{n}\sum_{i=1}^{n} \nabla f_i(y_i^k). \qquad (5)$$

It has been shown that IAG is linearly convergent for strongly convex functions with Lipschitz continuous gradients [16]. In particular, the sequence of iterates x^k generated by IAG satisfies the inequality

$$\|x^k - x^*\| \le \left(1 - \frac{2}{25\,(n(\kappa+1))^2}\right)^{k} \|x^0 - x^*\|. \qquad (6)$$

Comparing the decrement factor of GD in (4) with that of IAG after n gradient evaluations in (6) shows that for some values of n and κ the GD method is preferable to IAG in terms of upper bounds. In particular, there exist n and κ such that the decrement factor of one iteration of GD is smaller than the decrement factor of IAG after n iterations, i.e., if the inequality

$$\frac{\kappa-1}{\kappa+1} < \left(1 - \frac{2}{25\,(n(\kappa+1))^2}\right)^{n} \qquad (7)$$

is satisfied, the convergence rate of GD in (4) is better than the convergence rate of IAG in (6). This is more likely to happen when the condition number κ is relatively large. In the following section, we propose a new incremental gradient method that is preferable to GD for all values of n and κ.

3. ALGORITHM DEFINITION

The IAG update in (5) can be written as the minimization of a first-order approximation of the objective function f(x), where each instantaneous function f_i(x) is approximated by

$$f_i(x) \approx f_i(x^k) + \nabla f_i(y_i^k)^T (x - x^k) + \frac{1}{2\epsilon}\|x - x^k\|^2. \qquad (8)$$

Notice that the first two terms, f_i(x^k) + ∇f_i(y_i^k)^T(x − x^k), correspond to a first-order approximation of the function f_i around the iterate y_i^k. The last term, (1/2ε)‖x − x^k‖², is a proximal term added to the first-order approximation. This approximation differs from the customary approximation used in first-order methods, since the first-order approximation of f_i(x) is evaluated at y_i^k while the iterate x^k is used in the proximal term. This observation suggests that the IAG method performs well when the delayed variables y_i^k are close to the current iterate x^k. We resolve this issue by replacing the approximation of IAG in (8) with an approximation that uses y_i^k both in the first-order approximation and in the proximity condition. In particular, we propose a novel cyclic incremental method, called the double incremental aggregated gradient method (DIAG), which approximates the instantaneous function f_i(x) as

$$f_i(x) \approx f_i(y_i^k) + \nabla f_i(y_i^k)^T (x - y_i^k) + \frac{1}{2\epsilon}\|x - y_i^k\|^2. \qquad (9)$$

In general, the approximation in (9) is more accurate than the one in (8), since the first-order approximation f_i(y_i^k) + ∇f_i(y_i^k)^T(x − y_i^k) and the proximal term (1/2ε)‖x − y_i^k‖² are both evaluated at the same point y_i^k.
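To make the difference between (8) and (9) concrete, the following small sketch (our illustration, not part of the paper; the scalar function, the points x^k and y_i^k, and the stepsize are arbitrary choices) evaluates both surrogate models around a stale point.

```python
# Compare the IAG-style surrogate (8) and the DIAG-style surrogate (9)
# on a scalar example. The instantaneous function f_i, the stale point
# y_i^k, the current iterate x^k, and the stepsize eps are hypothetical.
import numpy as np

f = lambda x: 0.5 * x ** 2        # an instantaneous function f_i
grad = lambda x: x                # its gradient

eps = 0.5
x_k, y_i = 2.0, -1.0              # current iterate and a stale copy y_i^k

# Eq. (8): gradient evaluated at y_i^k, proximal term centered at x^k.
m_iag = lambda x: f(x_k) + grad(y_i) * (x - x_k) + (x - x_k) ** 2 / (2 * eps)
# Eq. (9): gradient and proximal term both centered at y_i^k.
m_diag = lambda x: f(y_i) + grad(y_i) * (x - y_i) + (x - y_i) ** 2 / (2 * eps)

for x in np.linspace(-2.0, 2.0, 5):
    print(f"x={x:+.1f}  f={f(x):6.2f}  model(8)={m_iag(x):6.2f}  "
          f"model(9)={m_diag(x):6.2f}")
```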
Considering the approximation in (9), the update of the DIAG method is given by

$$x^{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^p} \left\{ \frac{1}{n}\sum_{i=1}^{n} f_i(y_i^k) + \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(y_i^k)^T (x - y_i^k) + \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2\epsilon}\|x - y_i^k\|^2 \right\}. \qquad (10)$$

The update in (10) minimizes the first-order approximation of the global objective function f(x) that results from the instantaneous-function approximation in (9). Since (10) is a convex program, we can derive a closed-form solution for the variable x^{k+1}:

$$x^{k+1} = \frac{1}{n}\sum_{i=1}^{n} y_i^k - \frac{\epsilon}{n}\sum_{i=1}^{n} \nabla f_i(y_i^k). \qquad (11)$$

The DIAG update in (11) requires the incremental aggregates of both variables and gradients, and it only uses gradient (first-order) information. Hence, we call it the double incremental aggregated gradient method. Since we use a cyclic scheme, the set of variables {y_1^k, y_2^k, ..., y_n^k} is equal to the set {x^k, x^{k−1}, ..., x^{k−n+1}}. Hence, the update of the proposed cyclic incremental aggregated gradient method with the cyclic order f_1, f_2, ..., f_n can be written as

$$x^{k+1} = \frac{1}{n}\sum_{i=1}^{n} x^{k-n+i} - \frac{\epsilon}{n}\sum_{i=1}^{n} \nabla f_i(x^{k-n+i}). \qquad (12)$$

The update in (12) shows that we use the first-order information of the functions f_i around the last n iterates to evaluate the new update x^{k+1}. In other words, x^{k+1} is a function of the last n iterates {x^k, x^{k−1}, ..., x^{k−n+1}}. This observation is fundamental to the analysis of the DIAG method, as we study in Section 4.

Remark 1. One may consider the proposed DIAG method a cyclic version of the stochastic MISO algorithm in [13]. This is a valid interpretation; however, the convergence analysis of MISO cannot guarantee that it outperforms GD for all choices of n and κ, while we establish theoretical results in Section 4 which guarantee the advantage of DIAG over GD for any n and κ. Moreover, the proposed DIAG method is designed based on the new interpretation in (9), which leads to a novel proof technique; see Lemma 1. This new analysis is different from the analysis of MISO in [13] and provides stronger convergence results.

3.1. Implementation Details

A naive implementation of the update in (11) requires computing sums of n vectors per iteration, which is computationally costly. This unnecessary computation can be avoided by tracking the sums in (11) over time. To be more precise, the first sum in (11), which is the sum of the variables, can be updated as

$$\sum_{i=1}^{n} y_i^{k+1} = \sum_{i=1}^{n} y_i^k + x^{k+1} - y_{i_k}^k, \qquad (13)$$

where i_k is the index of the function that is chosen at step k. Likewise, the sum of gradients in (11) can be updated as

$$\sum_{i=1}^{n} \nabla f_i(y_i^{k+1}) = \sum_{i=1}^{n} \nabla f_i(y_i^k) + \nabla f_{i_k}(x^{k+1}) - \nabla f_{i_k}(y_{i_k}^k). \qquad (14)$$

Note that the implementation of DIAG requires memory of order O(np) to store the variables y_i^k and the gradients ∇f_i(y_i^k). The proposed DIAG method is summarized in Algorithm 1. The variables for all n copies of the vector x are initialized to the zero vector, i.e., y_1^0 = ··· = y_n^0 = x^0 = 0, and their corresponding gradients are stored in memory. At each iteration k, the updated variable x^{k+1} is computed in Step 4 using the update in (11). The sums of variables and gradients are updated in Steps 5 and 6, respectively, following the recursions in (13) and (14). In Step 7, the old variable and gradient of the updated function f_{i_k} are replaced with their updated versions, and the other components of the variable and gradient tables remain unchanged. Finally, in Step 8, the function index is updated in a cyclic manner by increasing the index i_k by 1. If the current value of the index i_k is n, we set i_{k+1} = 1 for the next iteration.

Algorithm 1 The proposed DIAG method
1: Require: {y_i^0} = x^0 and {∇f_i(y_i^0)}
2: Set the function index as i_0 = 1
3: for k = 0, 1, ... do
4:   Update the variable x^{k+1} = (1/n) Σ_{i=1}^n y_i^k − (ε/n) Σ_{i=1}^n ∇f_i(y_i^k).
5:   Update the sum of variables Σ_{i=1}^n y_i^{k+1} = Σ_{i=1}^n y_i^k + x^{k+1} − y_{i_k}^k.
6:   Compute ∇f_{i_k}(x^{k+1}) and update the sum of gradients Σ_{i=1}^n ∇f_i(y_i^{k+1}) = ∇f_{i_k}(x^{k+1}) − ∇f_{i_k}(y_{i_k}^k) + Σ_{i=1}^n ∇f_i(y_i^k).
7:   Replace y_{i_k}^k and ∇f_{i_k}(y_{i_k}^k) in the table by x^{k+1} and ∇f_{i_k}(x^{k+1}), respectively. The rest remain unchanged, i.e., y_i^{k+1} = y_i^k and ∇f_i(y_i^{k+1}) = ∇f_i(y_i^k) for i ≠ i_k.
8:   Update the function index: i_{k+1} = mod(i_k, n) + 1.
9: end for
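To make the implementation concrete, the following Python sketch implements Algorithm 1 with the running sums in (13) and (14). It is our minimal illustration rather than the authors' code; the interface (a callable grad_f(i, x) returning ∇f_i(x), and the names n, p, eps, num_passes) is an assumption of this sketch.

```python
# A minimal sketch of Algorithm 1 (DIAG) in NumPy. grad_f(i, x) is assumed
# to return the gradient of the i-th instantaneous function at x.
import numpy as np

def diag_method(grad_f, n, p, eps, num_passes, x0=None):
    """Double incremental aggregated gradient, cyclic order f_1, ..., f_n."""
    x = np.zeros(p) if x0 is None else x0.copy()
    y = np.tile(x, (n, 1))                           # table of copies y_i^k
    g = np.array([grad_f(i, x) for i in range(n)])   # table of gradients
    y_sum, g_sum = y.sum(axis=0), g.sum(axis=0)      # running sums for (11)
    i_k = 0                                          # cyclic function index
    for _ in range(num_passes * n):
        # Step 4, update (11): average iterate minus eps times average gradient.
        x = (y_sum - eps * g_sum) / n
        # Steps 5-6, recursions (13) and (14): swap the i_k-th contributions.
        y_sum += x - y[i_k]
        new_grad = grad_f(i_k, x)
        g_sum += new_grad - g[i_k]
        # Step 7: overwrite the i_k-th table entries; the rest stay unchanged.
        y[i_k], g[i_k] = x, new_grad
        # Step 8: cyclic index update.
        i_k = (i_k + 1) % n
    return x
```

The quadratic-programming sketch in Section 5 below reuses this function.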

4. CONVERGENCE ANALYSIS

In this section, we study the convergence properties of the DIAG method and justify its advantage over the GD algorithm. In the following lemma, we characterize an upper bound for the optimality error at step k + 1 in terms of the optimality errors of the previous n iterations.

Lemma 1. Consider the proposed DIAG method in (11). If the conditions in Assumption 1 hold, and the stepsize is chosen as ε = 2/(µ + L), then the sequence of iterates x^k generated by DIAG satisfies the inequality

$$\|x^{k+1} - x^*\| \le \left(\frac{\kappa-1}{\kappa+1}\right) \frac{\|x^{k} - x^*\| + \cdots + \|x^{k-n+1} - x^*\|}{n}, \qquad (15)$$

where κ = L/µ is the condition number of the objective function.

The result in Lemma 1 plays a significant role in the analysis of the proposed DIAG method. It shows that the error ‖x^{k+1} − x*‖ at step k + 1 is smaller than the average of the last n errors, since the ratio (κ−1)/(κ+1) is strictly smaller than 1. Note that the cyclic scheme of DIAG is critical for proving the result in (15), since it allows us to replace the sum Σ_{i=1}^n ‖y_i^k − x*‖ by the sum of the last n errors ‖x^k − x*‖ + ··· + ‖x^{k−n+1} − x*‖. If we pick functions uniformly at random, as in MISO, it is not possible to write the expression in (15), even in expectation. Likewise, for the IAG method we cannot guarantee that the inequality in (15) holds. This special property distinguishes DIAG from IAG and MISO. In the following theorem, we use the result in Lemma 1 to show that the sequence of errors ‖x^k − x*‖ is convergent.

Theorem 1. Consider the proposed DIAG method in (11). If the conditions in Assumption 1 hold, and the stepsize is chosen as ε = 2/(µ + L), then the error after m passes over the functions f_i, i.e., after k = mn iterations, is bounded above by

$$\|x^{mn} - x^*\| \le \rho^m \left(1 - \frac{1-\rho}{n}\right)^{m} \|x^0 - x^*\|, \qquad (16)$$

where ρ := (κ−1)/(κ+1).

The result in Theorem 1 verifies the advantage of DIAG with respect to GD. The result in (16) shows that the error of DIAG after m passes over the dataset, which is bounded above by ρ^m (1 − (1−ρ)/n)^m ‖x^0 − x*‖, is strictly smaller than the upper bound for the error of GD after m iterations, which is given by ρ^m ‖x^0 − x*‖. Hence, the DIAG method outperforms GD for any choice of κ and n > 1. Notice that the upper bound for GD is tight: there exists an optimization problem such that the error of GD satisfies the equality ‖x^m − x*‖ = ρ^m ‖x^0 − x*‖.

Although the result in Theorem 1 implies that DIAG is preferable to GD, it cannot show linear convergence of DIAG. To be more precise, Theorem 1 shows that the subsequence of errors {‖x^{mn} − x*‖}_{m=1}^∞, associated with the variables at the end of each pass over the set of functions, is linearly convergent. However, we aim to show that the whole sequence ‖x^k − x*‖ converges linearly. The following theorem shows that the sequence of iterates generated by DIAG indeed converges linearly.

Theorem 2. Consider the proposed DIAG method in (11), and recall the definition of the constant ρ := (κ−1)/(κ+1). If the conditions in Assumption 1 hold and ε = 2/(µ + L), then the sequence of iterates x^k generated by DIAG satisfies

$$\|x^k - x^*\| \le a_0 \, \gamma_0^{k} \, \|x^0 - x^*\|, \qquad (17)$$

where γ_0 is the only root of the polynomial equation

$$\gamma^{n+1} - \left(1 + \frac{\rho}{n}\right)\gamma^{n} + \frac{\rho}{n} = 0 \qquad (18)$$

in the interval [0, 1), and a_0 is given by

$$a_0 = \max_{i \in \{1,\dots,n\}} \left(1 - \frac{(i-1)(1-\rho)}{n}\right) \gamma_0^{-i}. \qquad (19)$$

The result in Theorem 2 shows that the whole sequence of iterates x^k generated by DIAG converges linearly to the optimal argument x*. Note that the polynomial in (18) has only one root in the interval [0, 1). To verify this claim, consider the function h(γ) := γ^{n+1} − (1 + ρ/n)γ^n + ρ/n for γ ∈ [0, 1). The derivative of h is given by dh/dγ = (n+1)γ^n − (n+ρ)γ^{n−1}. Therefore, the only critical point of h in the interval (0, 1) is γ* = (n+ρ)/(n+1). The point γ* is a local minimum of h, since the second derivative of h is positive at γ*. Note that h(γ*) < 0, h(0) > 0, and h(1) = 0. These observations imply that h has exactly one root γ_0 in the interval [0, 1), and that this root lies between 0 and γ*.
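As a quick numerical check of Theorem 2, the following sketch (our illustration; the example values of n and κ match the experiments below but are otherwise arbitrary) computes γ_0 by bisection on [0, γ*], using the sign pattern of h established above, and compares DIAG's per-gradient-evaluation factor γ_0 with the corresponding factor ρ^{1/n} of GD, which spends n gradient evaluations per iteration.

```python
# Compute the root gamma_0 of the polynomial in (18) by bisection and
# compare the per-gradient-evaluation rates of DIAG and GD.
import numpy as np

def gamma0(n, kappa):
    rho = (kappa - 1.0) / (kappa + 1.0)
    h = lambda g: g ** (n + 1) - (1.0 + rho / n) * g ** n + rho / n
    # h(0) = rho/n > 0 and h(gamma_star) < 0, so the root lies in
    # [0, gamma_star], where gamma_star = (n + rho)/(n + 1).
    lo, hi = 0.0, (n + rho) / (n + 1)
    for _ in range(200):                 # bisection to machine precision
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

for n, kappa in [(200, 10), (200, 117)]:
    rho = (kappa - 1.0) / (kappa + 1.0)
    # GD uses n gradient evaluations per iteration, so its factor per
    # gradient evaluation is rho**(1/n); DIAG's is gamma_0 by Theorem 2.
    print(f"n={n}, kappa={kappa}: gamma_0={gamma0(n, kappa):.6f}, "
          f"rho^(1/n)={rho ** (1.0 / n):.6f}")
```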
5. NUMERICAL EXPERIMENTS

In this section, we compare the performances of GD, IAG, and DIAG. First, we apply these methods to the quadratic program

$$\min_{x \in \mathbb{R}^p} f(x) := \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2} x^T A_i x + b_i^T x, \qquad (20)$$

where A_i ∈ R^{p×p} is a diagonal matrix and b_i ∈ R^p is a random vector chosen from the box [0, 1]^p. To control the problem condition number, the first p/2 diagonal elements of A_i are chosen uniformly at random from the set {1, 10^{−1}, ..., 10^{−η/2}} and its last p/2 elements from the set {1, 10^{1}, ..., 10^{η/2}}. This selection results in the sum Σ_i A_i having eigenvalues in the range [n 10^{−η/2}, n 10^{η/2}].
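The following sketch generates one instance of this experiment and runs DIAG on it, reusing the diag_method function from the sketch in Section 3.1. It is an illustration only: the random seed is arbitrary, and drawing the diagonal entries log-uniformly is our reading of the construction described above.

```python
# A sketch of the quadratic experiment in (20). Assumes diag_method from
# the earlier DIAG sketch is in scope; seed and sampling are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p, eta = 200, 20, 1

# Diagonal A_i: first p/2 entries in [10**(-eta/2), 1], last p/2 entries
# in [1, 10**(eta/2)] (exponents drawn uniformly, a simplification).
diags = np.hstack([
    10.0 ** rng.uniform(-eta / 2, 0, size=(n, p // 2)),
    10.0 ** rng.uniform(0, eta / 2, size=(n, p // 2)),
])
b = rng.uniform(0, 1, size=(n, p))

mu, L = diags.min(), diags.max()      # common constants valid for every f_i
eps = 2.0 / (mu + L)                  # the theoretical DIAG stepsize

A_bar, b_bar = diags.mean(axis=0), b.mean(axis=0)
x_star = -b_bar / A_bar               # minimizer of the diagonal average f

def grad_f(i, x):                     # gradient of f_i in (20)
    return diags[i] * x + b[i]

x = diag_method(grad_f, n, p, eps, num_passes=20)
# With x^0 = 0, the normalized error is ||x - x*|| / ||x^0 - x*||.
print("normalized error:", np.linalg.norm(x - x_star) / np.linalg.norm(x_star))
```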

Fig. 1. Convergence paths of GD, IAG, and DIAG for the quadratic program with n = 200 and κ = 10 (normalized error ‖x^k − x*‖ / ‖x^0 − x*‖ versus number of gradient evaluations).

Fig. 2. Convergence paths of GD, IAG, and DIAG for the quadratic program with n = 200 and κ = 117 (normalized error versus number of gradient evaluations).

Fig. 3. Convergence paths of GD, IAG, and DIAG for the binary classification application (objective error f(x^k) − f(x*) versus number of passes over the dataset, for both the theoretical and the best tuned stepsizes).

In our simulations, we fix the variable dimension as p = 20 and the number of functions as n = 200. Moreover, the stepsizes of GD and DIAG are set to their best theoretical values, ε_GD = 2/(µ + L) and ε_DIAG = 2/(µ + L), respectively. Note that the stepsize suggested in [16] for IAG is ε_IAG = 0.32µ/(nL(L + µ)); however, this choice of stepsize is very slow in practice. Thus, we use the stepsize ε_IAG = 2/(nL), which performs better than the one suggested in [16]. To have a fair comparison, we compare the algorithms in terms of the total number of gradient evaluations. Note that comparing these methods in terms of the total number of iterations would not be fair, since each iteration of GD requires n gradient evaluations, while IAG and DIAG only require one gradient computation per iteration.

We first consider the case η = 1 and use a realization with condition number κ = 10, i.e., a relatively small condition number. Fig. 1 demonstrates the convergence paths of the normalized error ‖x^k − x*‖ / ‖x^0 − x*‖ for IAG, DIAG, and GD when n = 200 and κ = 10. As we observe, IAG performs better than GD, while the best performance belongs to DIAG. In the second experiment, we increase the problem condition number by setting η = 2 and using a realization with condition number κ = 117. Fig. 2 illustrates the performances of these methods for the case n = 200 and κ = 117. We observe that the convergence path of IAG is almost identical to that of GD. In this experiment, we also observe that DIAG has the best performance among the three methods. Note that the relative performance of IAG and GD changes for problems with different condition numbers. On the other hand, the relative convergence paths of DIAG and GD do not change across settings, and DIAG consistently outperforms GD.

We also compare the performances of GD, IAG, and DIAG in solving a binary classification problem. Consider the logistic regression problem in which n samples {u_i} and their corresponding labels {l_i} are given. The dimension of the samples is p, i.e., u_i ∈ R^p, and the labels l_i are either 1 or −1. The goal is to find the optimal classifier x ∈ R^p that minimizes the regularized logistic loss

$$\min_{x \in \mathbb{R}^p} f(x) := \frac{1}{n}\sum_{i=1}^{n} \log\!\left(1 + \exp(-l_i x^T u_i)\right) + \frac{\lambda}{2}\|x\|^2. \qquad (21)$$

The objective function f in (21) is strongly convex with constant µ = λ, and its gradients are Lipschitz continuous with constant L = λ + ζ/4, where ζ = max_i u_i^T u_i. Note that the functions f_i in this case can be defined as f_i(x) = log(1 + exp(−l_i x^T u_i)) + (λ/2)‖x‖². It is easy to verify that the instantaneous functions f_i are also strongly convex with constant µ = λ, and that their gradients are Lipschitz continuous with constant L = λ + ζ/4.
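For completeness, here is a sketch of the per-function gradient and of the constants µ and L stated above; the helper names are ours, and the code is an illustration rather than the authors' implementation.

```python
# Helpers (hypothetical names) for the regularized logistic loss in (21).
import numpy as np

def logistic_constants(U, lam):
    """U: n-by-p sample matrix, lam: regularization. Returns (mu, L)."""
    zeta = (U * U).sum(axis=1).max()    # zeta = max_i u_i^T u_i
    return lam, lam + zeta / 4.0        # mu = lambda, L = lambda + zeta/4

def grad_f_logistic(i, x, U, labels, lam):
    """Gradient of f_i(x) = log(1 + exp(-l_i x^T u_i)) + (lam/2)||x||^2."""
    margin = labels[i] * U[i].dot(x)
    # d/dx log(1 + exp(-m)) = -l_i u_i / (1 + exp(m))
    return -labels[i] * U[i] / (1.0 + np.exp(margin)) + lam * x
```

Together with the diag_method sketch in Section 3.1, these helpers suffice to reproduce the structure of this experiment.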
We apply GD, IAG, and DIAG to the logistic regression problem in (21) on the MNIST dataset [19]. We assign the label l_i = 1 to the samples corresponding to digit 8 and the label l_i = −1 to those corresponding to digit 0. We obtain a total of n = 11,774 training examples, each of dimension p = 784. The objective function error f(x^k) − f(x*) of the GD, IAG, and DIAG methods versus the number of passes over the dataset is shown in Fig. 3 for the stepsizes ε_GD = 2/(µ + L), ε_IAG = 2/(nL), and ε_DIAG = 2/(µ + L). Moreover, we report the convergence paths of these algorithms for their best choices of stepsize in practice. The results verify the advantage of the proposed DIAG method relative to IAG and GD in both scenarios.

6. CONCLUSIONS

In this paper, we proposed a novel cyclic incremental aggregated gradient method (DIAG) for solving the problem of minimizing the average of a set of smooth and strongly convex functions. The proposed method is the first cyclic incremental method with convergence guarantees better than those of the gradient descent method. Numerical experiments justify the advantage of the proposed DIAG method relative to gradient descent and other first-order incremental methods.

7. REFERENCES

[1] L. Bottou and Y. Le Cun, "On-line learning for very large data sets," Applied Stochastic Models in Business and Industry, vol. 21, no. 2, pp. 137–151, 2005.

[2] D. P. Bertsekas, "Incremental least squares methods and the extended Kalman filter," SIAM Journal on Optimization, vol. 6, no. 3, pp. 807–822, 1996.

[3] S. S. Ram, A. Nedic, and V. Veeravalli, "Stochastic incremental gradient descent for estimation in sensor networks," in 2007 Conference Record of the Forty-First Asilomar Conference on Signals, Systems and Computers. IEEE, 2007, pp. 582–586.

[4] H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400–407, 1951.

[5] N. L. Roux, M. Schmidt, and F. R. Bach, "A stochastic gradient method with an exponential convergence rate for finite training sets," in Advances in Neural Information Processing Systems, 2012, pp. 2663–2671.

[6] A. Defazio, F. Bach, and S. Lacoste-Julien, "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives," in Advances in Neural Information Processing Systems, 2014, pp. 1646–1654.

[7] R. Johnson and T. Zhang, "Accelerating stochastic gradient descent using predictive variance reduction," in Advances in Neural Information Processing Systems, 2013, pp. 315–323.

[8] L. Xiao and T. Zhang, "A proximal stochastic gradient method with progressive variance reduction," SIAM Journal on Optimization, vol. 24, no. 4, pp. 2057–2075, 2014.

[9] S. Shalev-Shwartz and T. Zhang, "Stochastic dual coordinate ascent methods for regularized loss," The Journal of Machine Learning Research, vol. 14, no. 1, pp. 567–599, 2013.

[10] S. Shalev-Shwartz and T. Zhang, "Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization," Mathematical Programming, vol. 155, no. 1-2, pp. 105–145, 2016.

[11] L. Zhang, M. Mahdavi, and R. Jin, "Linear convergence with condition number independent access of full gradients," in Advances in Neural Information Processing Systems, 2013, pp. 980–988.

[12] J. Konečný and P. Richtárik, "Semi-stochastic gradient descent methods," arXiv preprint arXiv:1312.1666, 2013.

[13] J. Mairal, "Incremental majorization-minimization optimization with application to large-scale machine learning," SIAM Journal on Optimization, vol. 25, no. 2, pp. 829–855, 2015.

[14] D. Blatt, A. O. Hero, and H. Gauchman, "A convergent incremental gradient method with a constant step size," SIAM Journal on Optimization, vol. 18, no. 1, pp. 29–51, 2007.

[15] P. Tseng and S. Yun, "Incrementally updated gradient methods for constrained and regularized optimization," Journal of Optimization Theory and Applications, vol. 160, no. 3, pp. 832–853, 2014.

[16] M. Gürbüzbalaban, A. Ozdaglar, and P. Parrilo, "On the convergence rate of incremental aggregated gradient algorithms," arXiv preprint arXiv:1506.02081, 2015.

[17] A. Mokhtari, M. Gürbüzbalaban, and A. Ribeiro, "On the linear convergence of a cyclic incremental aggregated gradient method," University of Pennsylvania Technical Report, 2016. [Online]. Available: https://fling.seas.upenn.edu/~aryanm/wiki/CAG_journal.pdf

[18] Y. Nesterov, Introductory Lectures on Convex Optimization. Springer Science & Business Media, 2004, vol. 87.

[19] Y. LeCun, C. Cortes, and C. J. Burges, "The MNIST database of handwritten digits," 1998.