Support Vector Machines: Training with Stochastic Gradient Descent
Machine Learning, Spring 2018
The slides are mainly from Vivek Srikumar
Support vector machines
- Training by maximizing margin
- The SVM objective
- Solving the SVM optimization problem
- Support vectors, duals and kernels
SVM objective function

\[
\min_{w,b} \; \frac{1}{2} w^\top w \;+\; C \sum_i \max\big(0,\; 1 - y_i (w^\top x_i + b)\big)
\]

Regularization term: maximizes the margin
- Imposes a preference over the hypothesis space and pushes for better generalization
- Can be replaced with other regularization terms which impose other preferences

Empirical loss: hinge loss
- Penalizes weight vectors that make mistakes
- Can be replaced with other loss functions which impose other preferences

C: a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss
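As a concrete illustration (not on the original slides), a minimal numpy sketch of this objective might look as follows; the function name and signature are my own:

    import numpy as np

    def svm_objective(w, b, X, y, C):
        # (1/2) w^T w + C * sum_i max(0, 1 - y_i (w^T x_i + b))
        margins = y * (X @ w + b)               # y_i (w^T x_i + b) per example
        hinge = np.maximum(0.0, 1.0 - margins)  # hinge loss per example
        return 0.5 * w @ w + C * hinge.sum()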
Outline: Training SVM by optimization
1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Solving the SVM optimization problem

\[
\min_{w,b} \; \frac{1}{2} w^\top w + C \sum_i \max\big(0,\; 1 - y_i (w^\top x_i + b)\big)
\]

This function is convex in w, b.

For convenience, use simplified notation that folds the bias into the weight vector:
w_0 ← w,  w ← [w_0; b],  x_i ← [x_i; 1]

\[
\min_{w} \; \frac{1}{2} w_0^\top w_0 + C \sum_i \max\big(0,\; 1 - y_i w^\top x_i\big)
\]
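A small numpy sketch of this bias-folding trick (the toy data here is made up purely for illustration):

    import numpy as np

    X = np.array([[1.0, 2.0], [3.0, 4.0]])        # toy data, one example per row
    w0 = np.array([0.5, -0.5])                    # original weights
    b = 0.25                                      # bias term

    X_aug = np.hstack([X, np.ones((len(X), 1))])  # x_i <- [x_i; 1]
    w = np.concatenate([w0, [b]])                 # w <- [w0; b]
    assert np.allclose(X_aug @ w, X @ w0 + b)     # same scores, no explicit bias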
Recall: Convex functions

A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1], we have
\[
f(\lambda u + (1 - \lambda) v) \;\le\; \lambda f(u) + (1 - \lambda) f(v)
\]

From a geometric perspective: every tangent plane lies below the function, i.e.
\[
f(x) \;\ge\; f(u) + \nabla f(u)^\top (x - u)
\]
Convex functions

Linear functions are convex; the max of convex functions is convex.

Some ways to show that a function is convex:
1. Using the definition of convexity
2. Showing that the second derivative is non-negative (for one-dimensional functions)
3. Showing that the Hessian is positive semi-definite (for vector functions)
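As a quick worked example (added here, not on the original slide), these tests cover both pieces of the SVM objective:

\[
\nabla^2 \left( \tfrac{1}{2} w^\top w \right) = I \succeq 0
\quad\Rightarrow\quad \tfrac{1}{2} w^\top w \text{ is convex;}
\]
\[
1 - y_i w^\top x_i \text{ is linear in } w, \text{ so } \max\big(0,\; 1 - y_i w^\top x_i\big) \text{ is a max of two convex functions, hence convex.}
\]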
Not all functions are convex

Some functions are concave: they satisfy the reverse inequality
\[
f(\lambda u + (1 - \lambda) v) \;\ge\; \lambda f(u) + (1 - \lambda) f(v)
\]

Others are neither convex nor concave.
Convex functions are convenient

A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1], we have
\[
f(\lambda u + (1 - \lambda) v) \;\le\; \lambda f(u) + (1 - \lambda) f(v)
\]

In general, a necessary condition for x to be a minimum of the function f is ∇f(x) = 0.
For convex functions, this is both necessary and sufficient.
Solving the SVM optimization problem

\[
\min_{w} \; \frac{1}{2} w_0^\top w_0 + C \sum_i \max\big(0,\; 1 - y_i w^\top x_i\big)
\]

This function is convex in w. Because the objective is quadratic, this is a quadratic optimization problem.

Older methods: used techniques from quadratic programming. Very slow.

There are no constraints, so we can use gradient descent. Still very slow!
Gradient descent

We are trying to minimize
\[
J(w) = \frac{1}{2} w_0^\top w_0 + C \sum_i \max\big(0,\; 1 - y_i w^\top x_i\big)
\]

General strategy for minimizing a function J(w):
- Start with an initial guess for w, say w_0
- Iterate till convergence:
  - Compute the gradient of J at w_t
  - Update w_t to get w_{t+1} by taking a step in the opposite direction of the gradient

Intuition: the gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.
Gradient descent for SVM

We are trying to minimize
\[
J(w) = \frac{1}{2} w_0^\top w_0 + C \sum_i \max\big(0,\; 1 - y_i w^\top x_i\big)
\]

1. Initialize w_0
2. For t = 0, 1, 2, ...:
   1. Compute the gradient of J(w) at w_t. Call it ∇J(w_t)
   2. Update: w_{t+1} ← w_t − r ∇J(w_t)

r: called the learning rate.
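A minimal sketch of this loop, assuming a caller-supplied gradient function and a fixed step count in place of a convergence test:

    import numpy as np

    def gradient_descent(grad_J, w_init, r=0.01, n_steps=1000):
        # Repeatedly step against the gradient: w_{t+1} <- w_t - r * grad J(w_t).
        # For the SVM objective, grad_J sums over the whole training set.
        w = w_init.copy()
        for t in range(n_steps):
            w = w - r * grad_J(w)
        return w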
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Gradient descent for SVM

The gradient of the SVM objective requires summing over the entire training set:
\[
J(w) = \frac{1}{2} w_0^\top w_0 + C \sum_i \max\big(0,\; 1 - y_i w^\top x_i\big)
\]

Slow, does not really scale.
Stochastic gradient descent for SVM

We are trying to minimize
\[
J(w) = \frac{1}{2} w_0^\top w_0 + C \sum_i \max\big(0,\; 1 - y_i w^\top x_i\big)
\]

Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w_0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   1. Pick a random example (x_i, y_i) from the training set S
   2. Treat (x_i, y_i) as a full dataset (repeated N times, where N is the number of training examples) and take the derivative of the SVM objective at the current w_{t−1} to be ∇J_t(w_{t−1}), where
      \[
      J_t(w) = \frac{1}{2} w_0^\top w_0 + C\,N \max\big(0,\; 1 - y_i w^\top x_i\big)
      \]
   3. Update: w_t ← w_{t−1} − η_t ∇J_t(w_{t−1})
3. Return final w

This algorithm is guaranteed to converge to the minimum of J if η_t is small enough.

But what is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)
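A sketch of this loop, with the per-example (sub)gradient grad_Jt left as a caller-supplied placeholder until it is derived below:

    import numpy as np

    def sgd(grad_Jt, X, y, T=1000, eta=0.01, seed=0):
        # Each step estimates the full gradient from one randomly chosen example.
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for t in range(T):
            i = rng.integers(len(X))              # pick a random example
            w = w - eta * grad_Jt(w, X[i], y[i])  # w_t <- w_{t-1} - eta_t grad J_t
        return w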
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Gradient Descent vs SGD

[Figures: a sequence of plots contrasting the gradient-descent path toward the minimum with the noisier stochastic-gradient-descent path]

Many more updates than gradient descent, but each individual update is less computationally expensive.
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Hinge loss is not differentiable!

The SGD update requires ∇J_t(w_{t−1}). But what is the derivative of the hinge loss with respect to w?
\[
J_t(w) = \frac{1}{2} w_0^\top w_0 + C\,N \max\big(0,\; 1 - y_i w^\top x_i\big)
\]
The max creates a kink at y_i w^\top x_i = 1, where the loss has no gradient.
Detour: Sub-gradients

Generalization of gradients to non-differentiable functions.
(Remember that for convex functions, every tangent lies below the function.)

Informally, a sub-tangent at a point is any line that lies below the function at that point. A sub-gradient is the slope of that line.
Sub-gradients

Formally, g is a subgradient of f at x if, for all z,
\[
f(z) \;\ge\; f(x) + g^\top (z - x)
\]

f is differentiable at x_1: the tangent at this point is unique, so g_1 is the gradient at x_1.
g_2 and g_3 are both subgradients at x_2, where f is not differentiable.

[Example from Boyd]
Sub-gradient of the SVM objective

\[
J_t(w) = \frac{1}{2} w_0^\top w_0 + C\,N \max\big(0,\; 1 - y_i w^\top x_i\big)
\]

General strategy: first solve the max and compute the gradient for each case:

\[
\nabla J_t =
\begin{cases}
[w_0;\, 0] & \text{if } \max\big(0,\; 1 - y_i w^\top x_i\big) = 0 \\
[w_0;\, 0] - C\,N\, y_i x_i & \text{otherwise}
\end{cases}
\]
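A direct numpy translation of this case analysis; the function name and the [w_0; b] layout follow the conventions introduced earlier:

    import numpy as np

    def grad_Jt(w, x_i, y_i, C, N):
        # Sub-gradient of J_t(w) = (1/2) w0^T w0 + C*N*max(0, 1 - y_i w^T x_i),
        # with w = [w0; b] and x_i = [x_i; 1], so the regularizer skips the bias.
        reg = np.append(w[:-1], 0.0)        # gradient of the regularizer: [w0; 0]
        if y_i * (w @ x_i) >= 1:            # hinge inactive: loss contributes 0
            return reg
        return reg - C * N * y_i * x_i      # hinge active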
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Stochastic sub-gradient descent for SVM

\[
\nabla J_t =
\begin{cases}
[w_0;\, 0] & \text{if } \max\big(0,\; 1 - y_i w^\top x_i\big) = 0 \\
[w_0;\, 0] - C\,N\, y_i x_i & \text{otherwise}
\end{cases}
\]

Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w_0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      - If y_i wᵀx_i ≤ 1: w ← (1 − η_t) [w_0; 0] + η_t C N y_i x_i
      - else: w_0 ← (1 − η_t) w_0
3. Return w

η_t: learning rate, many tweaks possible. It is important to shuffle the examples at the start of each epoch.
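Putting the pieces together, a runnable sketch of the full training loop; the 1/(1+t) decay is just one of the many possible learning-rate schedules, and X is assumed to already carry the constant-1 feature:

    import numpy as np

    def svm_sgd(X, y, C, T=20, eta0=0.1, seed=0):
        rng = np.random.default_rng(seed)
        N, d = X.shape
        w = np.zeros(d)                          # w = [w0; b]
        t = 0
        for epoch in range(T):
            for i in rng.permutation(N):         # shuffle each epoch
                eta = eta0 / (1.0 + t)           # decaying learning rate
                if y[i] * (w @ X[i]) <= 1:       # margin violated: full sub-gradient step
                    w = w - eta * np.append(w[:-1], 0.0) + eta * C * N * y[i] * X[i]
                else:                            # only the regularizer contributes
                    w[:-1] *= (1.0 - eta)
                t += 1
        return w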
Convergence and learning rates

With enough iterations, SGD will converge in expectation, provided the step sizes are square-summable but not summable:
- Step sizes η_t are positive
- The sum of squares of the step sizes over t = 1 to ∞ is finite
- The sum of the step sizes over t = 1 to ∞ is infinite

Some examples: η_t = η_0 / (1 + η_0 t) or η_t = η_0 / (1 + t)
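For instance, both example schedules can be written as one-liners (η_0 > 0 is a free constant):

    def eta_a(t, eta0=1.0):
        return eta0 / (1.0 + eta0 * t)  # sum_t eta_t diverges, sum_t eta_t^2 converges

    def eta_b(t, eta0=1.0):
        return eta0 / (1.0 + t)         # same 1/t behavior, same two properties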
Convergence and learning rates

Number of iterations to get to accuracy within ε, for strongly convex functions, with N examples in d dimensions:
- Gradient descent: O(N d ln(1/ε))
- Stochastic gradient descent: O(d/ε)

More subtleties are involved, but SGD is generally preferable when the data size is huge.
Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
Stochastic sub-gradient descent for SVM

Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w_0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      - If y_i wᵀx_i ≤ 1: w ← (1 − η_t) [w_0; 0] + η_t C N y_i x_i
      - else: w_0 ← (1 − η_t) w_0
3. Return w

Compare with the Perceptron update: if y_i wᵀx_i ≤ 0, update w ← w + r y_i x_i.
Perceptron vs. SVM

Perceptron: stochastic sub-gradient descent for a different loss; no regularization, though.

SVM: optimizes the hinge loss, with regularization.
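The contrast is easiest to see in code; an illustrative side-by-side sketch of the two single-example updates (bias handling omitted for brevity):

    import numpy as np

    def perceptron_step(w, x_i, y_i, r):
        # Update only on a mistake (y_i w^T x_i <= 0); w is never shrunk.
        if y_i * (w @ x_i) <= 0:
            return w + r * y_i * x_i
        return w

    def svm_step(w, x_i, y_i, eta, C):
        # Update on any margin violation (y_i w^T x_i <= 1), and always
        # shrink w toward 0 -- the effect of the regularizer.
        if y_i * (w @ x_i) <= 1:
            return (1 - eta) * w + eta * C * y_i * x_i
        return (1 - eta) * w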
SVM summary from optimization perspective

- Minimize regularized hinge loss
- Solve using stochastic gradient descent
  - Very fast: run time does not depend on the number of examples
- Compare with the Perceptron algorithm: a similar framework with a different objective!
  - The Perceptron does not maximize the margin width, though Perceptron variants can force a margin
- Other successful optimization algorithms exist, e.g., dual coordinate descent, implemented in liblinear

Questions?