Exercise List: Proving convergence of the (Stochastic) Gradient Descent Method for the Least Squares Problem.

Robert M. Gower. October 3, 2017

1 Introduction

This is an exercise in proving the convergence of iterative optimization methods. We will take a simple case study: solving the linear least squares problem, and prove the linear convergence of the gradient descent method and a variant of the stochastic gradient descent (SGD) method with importance sampling. This variant of SGD is also known as the randomized Kaczmarz method, and the linear convergence we prove in Ex. 2 was first established in [3]. First we introduce some necessary notation.

Notation: For every $x, y \in \mathbb{R}^n$ let $\langle x, y\rangle \overset{\text{def}}{=} x^\top y$ and let $\|x\| = \sqrt{\langle x, x\rangle}$. Let $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ be the smallest and largest singular values of $A$, defined by

$$\sigma_{\min}(A) \overset{\text{def}}{=} \min_{x \in \mathbb{R}^n,\, x \neq 0} \frac{\|Ax\|}{\|x\|} \qquad\text{and}\qquad \sigma_{\max}(A) \overset{\text{def}}{=} \max_{x \in \mathbb{R}^n,\, x \neq 0} \frac{\|Ax\|}{\|x\|}. \tag{1}$$

Thus clearly

$$\|Ax\| \leq \sigma_{\max}(A)\, \|x\|, \quad \text{for all } x \in \mathbb{R}^n. \tag{2}$$

Let $\|A\|_F \overset{\text{def}}{=} \sqrt{\operatorname{Tr}\left(A^\top A\right)}$ denote the Frobenius norm of $A$. Finally, a result you will need is that for every symmetric positive semi-definite matrix $G$ the $\ell_2$ induced matrix norm can be equivalently defined by

$$\sigma_{\max}(G) = \max_{x \in \mathbb{R}^n,\, x \neq 0} \frac{\|Gx\|}{\|x\|} = \max_{x \in \mathbb{R}^n,\, x \neq 0} \frac{\langle Gx, x\rangle}{\|x\|^2}. \tag{3}$$
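To make the notation concrete, here is a small numerical check of (1)-(3). This is only a sketch, assuming Python with NumPy; the matrix `A`, the dimensions and the random seed are my own choices and are not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 5
A = rng.standard_normal((m, n))

# Singular values as in (1) and the Frobenius norm.
svals = np.linalg.svd(A, compute_uv=False)
sigma_max, sigma_min = svals[0], svals[-1]
fro = np.sqrt(np.trace(A.T @ A))           # same value as np.linalg.norm(A, 'fro')

# Check (2): ||Ax|| <= sigma_max(A) ||x|| for a random x.
x = rng.standard_normal(n)
print(np.linalg.norm(A @ x) <= sigma_max * np.linalg.norm(x) + 1e-12)

# Check (3) for the symmetric PSD matrix G = A^T A: its largest eigenvalue
# dominates the Rayleigh quotient <Gx, x>/||x||^2 (sampled here, so the
# maximum is only approached from below).
G = A.T @ A
lam_max = np.linalg.eigvalsh(G)[-1]        # equals sigma_max(A)**2
rayleigh = max((v @ G @ v) / (v @ v) for v in rng.standard_normal((5000, n)))
print(rayleigh <= lam_max + 1e-12, lam_max - rayleigh)
```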

2 The Linear Least Squares Problem

Now consider the problem of solving the linear system

$$Ax = b, \tag{4}$$

where $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$. We assume that there exists a solution to (4). We also assume that $n \leq m$ and that $A$ has full column rank, so that there is a unique solution $x^* \in \mathbb{R}^n$ to (4). We can recast (4) as the following Least Squares optimization problem

$$x^* = \arg\min_{x \in \mathbb{R}^n} \Big( \tfrac{1}{2}\|Ax - b\|^2 \overset{\text{def}}{=} f(x) \Big). \tag{5}$$

3 Exercises

Ex. 1 Consider the Gradient descent method

$$x_{t+1} = x_t - \alpha \nabla f(x_t), \tag{6}$$

where

$$\alpha = \frac{1}{\sigma_{\max}^2(A)} \tag{7}$$

is a fixed stepsize.

Part I Show or convince yourself that

$$\sigma_{\max}(I - \alpha A^\top A) = 1 - \alpha\, \sigma_{\min}^2(A) = 1 - \frac{\sigma_{\min}^2(A)}{\sigma_{\max}^2(A)}. \tag{8}$$

Part II Calculate the gradient $\nabla f(x)$ of (5) and re-write the iterates (6) with this gradient.

Part III Show that the iterates (6) converge to $x^*$ according to

$$\|x_{t+1} - x^*\| \leq \left(1 - \frac{\sigma_{\min}^2(A)}{\sigma_{\max}^2(A)}\right) \|x_t - x^*\|, \quad \text{for all } t.$$

Hint 1: Subtract $x^*$ from both sides of (6) and use the results from the previous two exercises. Hint 2: Remember that $b = Ax^*$!
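Before proving anything, it can help to run (6)-(7) and watch the contraction claimed in Part III happen. The following is a sketch, not a proof, assuming NumPy; the synthetic full-column-rank matrix, the dimensions and the number of iterations are my own choices.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 50, 10
A = rng.standard_normal((m, n))           # full column rank with probability 1
x_star = rng.standard_normal(n)
b = A @ x_star                            # consistent system, as assumed in (4)

sigma = np.linalg.svd(A, compute_uv=False)
alpha = 1.0 / sigma[0] ** 2               # stepsize (7): 1 / sigma_max(A)^2
rate = 1.0 - (sigma[-1] / sigma[0]) ** 2  # factor claimed in Part III

x = np.zeros(n)
for t in range(25):
    grad = A.T @ (A @ x - b)              # gradient of f(x) = 0.5 ||Ax - b||^2
    x_new = x - alpha * grad              # gradient descent step (6)
    # each step should contract the error at least by the predicted factor
    print(np.linalg.norm(x_new - x_star) <= rate * np.linalg.norm(x - x_star) + 1e-12)
    x = x_new
```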

Answer (Ex. 1, Part I) First note that

$$\langle (I - \alpha A^\top A)x, x\rangle = \|x\|^2 - \alpha \|Ax\|^2 \overset{(7)}{=} \|x\|^2 - \frac{\|Ax\|^2}{\sigma_{\max}^2(A)} \overset{(2)}{\geq} \|x\|^2 - \frac{\sigma_{\max}^2(A)\,\|x\|^2}{\sigma_{\max}^2(A)} = 0,$$

thus the matrix $(I - \alpha A^\top A)$ is positive semi-definite and only has non-negative eigenvalues. Furthermore,

$$\frac{\langle (I - \alpha A^\top A)x, x\rangle}{\|x\|^2} = 1 - \alpha\, \frac{\langle A^\top A x, x\rangle}{\|x\|^2}, \quad \text{for all } x \neq 0. \tag{9}$$

Since $(I - \alpha A^\top A)$ is symmetric positive semi-definite we can use (3) to calculate the induced norm, thus we have

$$\sigma_{\max}(I - \alpha A^\top A) \overset{(3)+(9)}{=} \max_{x \in \mathbb{R}^n,\, x \neq 0} \left(1 - \alpha\, \frac{\langle A^\top A x, x\rangle}{\|x\|^2}\right) = 1 - \alpha \min_{x \in \mathbb{R}^n,\, x \neq 0} \frac{\langle A^\top A x, x\rangle}{\|x\|^2} = 1 - \alpha\, \sigma_{\min}^2(A).$$

Answer (Ex. 1, Part II) Differentiating we have $\nabla f(x) = A^\top(Ax - b) = A^\top A(x - x^*)$, where the last equality follows since $Ax^* = b$. Consequently the gradient descent method (6) can be written as

$$x_{t+1} = x_t - \alpha A^\top A (x_t - x^*). \tag{10}$$

Answer (Ex. 1, Part III) Subtracting $x^*$ from both sides of (10) gives

$$x_{t+1} - x^* = x_t - x^* - \alpha A^\top A (x_t - x^*) = (I - \alpha A^\top A)(x_t - x^*).$$

Taking norms in the above gives

$$\|x_{t+1} - x^*\| \overset{(1)}{\leq} \sigma_{\max}(I - \alpha A^\top A)\, \|x_t - x^*\| \overset{(8)}{=} \left(1 - \alpha\, \sigma_{\min}^2(A)\right) \|x_t - x^*\|.$$

In particular, for $\alpha = 1/\sigma_{\max}^2(A)$ the above shows that

$$\|x_{t+1} - x^*\| \leq \left(1 - \frac{\sigma_{\min}^2(A)}{\sigma_{\max}^2(A)}\right) \|x_t - x^*\|.$$
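The two facts established in Part I, that $I - \alpha A^\top A$ is positive semi-definite and that (8) holds, are also easy to confirm numerically. A minimal sketch, assuming NumPy and a random matrix of my choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 6))
sigma = np.linalg.svd(A, compute_uv=False)
alpha = 1.0 / sigma[0] ** 2               # stepsize (7)

M = np.eye(A.shape[1]) - alpha * A.T @ A
eigs = np.linalg.eigvalsh(M)              # M is symmetric, so eigvalsh applies

print(eigs.min() >= -1e-12)               # positive semi-definite, as shown above
# (8): the largest eigenvalue of M equals 1 - sigma_min(A)^2 / sigma_max(A)^2
print(np.isclose(eigs.max(), 1.0 - (sigma[-1] / sigma[0]) ** 2))
```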

Ex. 2 The least squares problem (5) can be re-written as

$$\min_{x} \tfrac{1}{2}\|Ax - b\|^2 = \min_{x} \sum_{i=1}^m \tfrac{1}{2}\left(A_{i:}x - b_i\right)^2 \overset{\text{def}}{=} \min_{x} \sum_{i=1}^m f_i(x), \tag{11}$$

where $f_i(x) = \tfrac{1}{2}(A_{i:}x - b_i)^2$, $A_{i:}$ denotes the $i$th row of $A$ and $b_i$ denotes the $i$th element of $b$. Given this sum-of-terms structure in (11), we can implement the stochastic gradient method as follows. From a given $x_0 \in \mathbb{R}^n$, consider the iterates

$$x_{t+1} = x_t - \alpha_j \nabla f_j(x_t), \tag{12}$$

where

$$\alpha_j = \frac{1}{\|A_{j:}\|^2}, \tag{13}$$

and $j$ is a random index chosen from $\{1, \ldots, m\}$ such that for every $i \in \{1, \ldots, m\}$ the probability that $j = i$ is given by $\|A_{i:}\|^2 / \|A\|_F^2$. In other words,

$$\mathbb{P}(j = i) = \frac{\|A_{i:}\|^2}{\|A\|_F^2}, \quad \text{for all } i \in \{1, \ldots, m\}.$$

Part I Show that

$$P_j \overset{\text{def}}{=} \alpha_j A_{j:}^\top A_{j:} = \frac{A_{j:}^\top A_{j:}}{\|A_{j:}\|^2} \tag{14}$$

is a projection operator which projects orthogonally onto $\operatorname{Range}(A_{j:}^\top)$. In other words, show that

$$P_j P_j = P_j \quad \text{and} \quad (I - P_j)(I - P_j) = I - P_j. \tag{15}$$

Furthermore, verify that

$$\mathbb{E}[P_j] = \sum_{i=1}^m \mathbb{P}(j = i)\, P_i = \frac{A^\top A}{\|A\|_F^2}. \tag{16}$$
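The sampling rule and the projections $P_j$ take only a few lines of code. Below is a sketch, assuming NumPy and random data of my choosing, of one way to draw $j$ with $\mathbb{P}(j = i) = \|A_{i:}\|^2 / \|A\|_F^2$, to form $P_j$ as in (14), and to check (15) and (16) empirically:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 40, 8
A = rng.standard_normal((m, n))

row_norms2 = np.sum(A ** 2, axis=1)        # ||A_{i:}||^2 for each row
probs = row_norms2 / row_norms2.sum()      # P(j = i) = ||A_{i:}||^2 / ||A||_F^2

# P_j as in (14) for one sampled row index j
j = rng.choice(m, p=probs)
a_j = A[j]
P_j = np.outer(a_j, a_j) / row_norms2[j]

# (15): P_j is idempotent, hence an orthogonal projection (it is symmetric)
print(np.allclose(P_j @ P_j, P_j))

# (16): E[P_j] = A^T A / ||A||_F^2 (exact sum over i, no sampling needed)
EP = sum(probs[i] * np.outer(A[i], A[i]) / row_norms2[i] for i in range(m))
print(np.allclose(EP, A.T @ A / row_norms2.sum()))
```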

Part II Using analogous techniques from the previous exercise, show that the iterates (12) converge according to

$$\mathbb{E}\left[\|x_{t+1} - x^*\|^2\right] \leq \left(1 - \frac{\sigma_{\min}^2(A)}{\|A\|_F^2}\right) \mathbb{E}\left[\|x_t - x^*\|^2\right]. \tag{17}$$

This is an amazing and recent result [3], since it shows that SGD converges exponentially fast despite the fact that the iterates (12) only require access to a single row of $A$ at a time! This result can be extended to any matrix $A$, including rank-deficient matrices. Indeed, so long as there exists a solution to (4), the iterates (12) converge to the solution of least norm, and at a rate of $\left(1 - (\sigma_{\min}^{+}(A))^2/\|A\|_F^2\right)$, where $\sigma_{\min}^{+}(A)$ is the smallest nonzero singular value of $A$ [1]. Thus the assumption that $A$ has full column rank is not necessary. These results have also been extended to a general class of methods [2].

Part III When is this stochastic gradient method (12) faster than the gradient descent method (6)? Note that each iteration of SGD costs $O(n)$ floating point operations while an iteration of the GD method costs $O(nm)$ floating point operations. What happens if $m$ is very big? What if $A$ is very large? Discuss this.

Answer (Ex. 2, Part I) Verify almost all claims by direct computation. For instance,

$$\mathbb{E}[P_j] = \sum_{i=1}^m \mathbb{P}(j = i)\, P_i = \sum_{i=1}^m \frac{\|A_{i:}\|^2}{\|A\|_F^2} \cdot \frac{A_{i:}^\top A_{i:}}{\|A_{i:}\|^2} = \sum_{i=1}^m \frac{A_{i:}^\top A_{i:}}{\|A\|_F^2} = \frac{A^\top A}{\|A\|_F^2}.$$

Answer (Ex. 2, Part II) First note that

$$\nabla f_j(x_t) = A_{j:}^\top (A_{j:} x_t - b_j) = A_{j:}^\top A_{j:} (x_t - x^*).$$

Using the above and subtracting $x^*$ from both sides of (12) we have

$$x_{t+1} - x^* = x_t - x^* - \alpha_j A_{j:}^\top A_{j:} (x_t - x^*) \overset{(13)}{=} \left(I - \frac{A_{j:}^\top A_{j:}}{\|A_{j:}\|^2}\right)(x_t - x^*).$$

Taking the norm squared in the above we have that

$$\|x_{t+1} - x^*\|^2 = \left\|\left(I - \frac{A_{j:}^\top A_{j:}}{\|A_{j:}\|^2}\right)(x_t - x^*)\right\|^2 \overset{(15)}{=} \left\langle \left(I - \frac{A_{j:}^\top A_{j:}}{\|A_{j:}\|^2}\right)(x_t - x^*),\, x_t - x^*\right\rangle$$

$$= \|x_t - x^*\|^2 - \frac{\left\langle A_{j:}^\top A_{j:} (x_t - x^*),\, x_t - x^*\right\rangle}{\|A_{j:}\|^2}.$$

Taking the expectation conditioned on $x_t$ in the above gives

$$\mathbb{E}\left[\|x_{t+1} - x^*\|^2 \,\middle|\, x_t\right] = \|x_t - x^*\|^2 - \left\langle \mathbb{E}\left[\frac{A_{j:}^\top A_{j:}}{\|A_{j:}\|^2}\right](x_t - x^*),\, x_t - x^*\right\rangle \overset{(16)}{=} \|x_t - x^*\|^2 - \frac{\left\langle A^\top A (x_t - x^*),\, x_t - x^*\right\rangle}{\|A\|_F^2}$$

$$\overset{(1)}{\leq} \|x_t - x^*\|^2 - \frac{\sigma_{\min}^2(A)}{\|A\|_F^2}\,\|x_t - x^*\|^2 = \left(1 - \frac{\sigma_{\min}^2(A)}{\|A\|_F^2}\right)\|x_t - x^*\|^2.$$

It remains to take the total expectation in the above.

Answer (Ex. 2, Part III) ...
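The discussion asked for in Part III can be informed by a rough experiment that gives both methods the same budget of row accesses (here 20 full passes over the data). This is only a sketch, assuming NumPy; the dimensions, seed and budget are my own choices and it is not a rigorous benchmark.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 2000, 20
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star

sigma = np.linalg.svd(A, compute_uv=False)
row_norms2 = np.sum(A ** 2, axis=1)
probs = row_norms2 / row_norms2.sum()

# Gradient descent (6)-(7): each step touches all m rows (O(nm) flops).
x_gd = np.zeros(n)
for _ in range(20):
    x_gd -= (A.T @ (A @ x_gd - b)) / sigma[0] ** 2

# SGD / randomized Kaczmarz (12)-(13): 20*m single-row steps (O(n) flops each),
# i.e. the same total number of row accesses as the 20 GD steps above.
x_sgd = np.zeros(n)
for j in rng.choice(m, size=20 * m, p=probs):
    x_sgd -= A[j] * (A[j] @ x_sgd - b[j]) / row_norms2[j]

print(np.linalg.norm(x_gd - x_star), np.linalg.norm(x_sgd - x_star))
```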

References

[1] R. M. Gower and P. Richtárik. Stochastic Dual Ascent for Solving Linear Systems. In: arXiv preprint (2015).

[2] R. M. Gower and P. Richtárik. Randomized Iterative Methods for Linear Systems. In: SIAM Journal on Matrix Analysis and Applications 36.4 (2015).

[3] T. Strohmer and R. Vershynin. A Randomized Kaczmarz Algorithm with Exponential Convergence. In: Journal of Fourier Analysis and Applications 15.2 (2009).
