Support Vector Machines: Training with Stochastic Gradient Descent
- Everett Harrell
1 Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning, Spring 2018. The slides are mainly from Vivek Srikumar.
2 Support vector machines
- Training by maximizing margin
- The SVM objective
- Solving the SVM optimization problem
- Support vectors, duals and kernels
3 SVM objective function

    min_{w,b} ½ wᵀw + C Σ_i max(0, 1 − y_i (wᵀx_i + b))

Regularization term (½ wᵀw): maximizes the margin. It imposes a preference over the hypothesis space and pushes for better generalization. It can be replaced with other regularization terms which impose other preferences.
Empirical loss (the hinge loss): penalizes weight vectors that make mistakes. It can be replaced with other loss functions which impose other preferences.
C: a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss.
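In code, the whole objective is a short vectorized expression. A minimal NumPy sketch (function and variable names are my own, not from the slides):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """Regularized hinge loss: 0.5 * w.w + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    margins = y * (X @ w + b)               # y_i (w^T x_i + b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # per-example hinge loss
    return 0.5 * w @ w + C * hinge.sum()

# Sanity check: a separating w with margin >= 1 incurs zero hinge loss.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(svm_objective(w, 0.0, X, y, C=1.0))   # 0.5: only the regularizer remains
```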
4 Outline: Training SVM by optimization
1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
6 Solving the SVM optimization problem

    min_{w,b} ½ wᵀw + C Σ_i max(0, 1 − y_i (wᵀx_i + b))

This function is convex in w, b. For convenience, use simplified notation that folds the bias into the weight vector: write w_0 for the original weights, then set w ← [w_0; b] and x_i ← [x_i; 1]:

    min_w ½ w_0ᵀw_0 + C Σ_i max(0, 1 − y_i wᵀx_i)
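Folding the bias in is just appending a constant-1 feature to every example. A quick sketch confirming the scores agree (variable names are my own):

```python
import numpy as np

# Fold the bias into the weights: append a constant-1 feature to every example.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # x_i -> [x_i; 1]

w0, b = np.array([0.5, -0.5]), 0.25
w_aug = np.append(w0, b)                           # w -> [w0; b]

# The scores agree: w_aug . [x_i; 1] == w0 . x_i + b for every example.
print(X_aug @ w_aug)
print(X @ w0 + b)
```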
7 Recall: Convex functions
A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1], we have

    f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v)

From a geometric perspective, every tangent plane lies below the function:

    f(x) ≥ f(u) + ∇f(u)ᵀ(x − u)
9 Convex functions
Linear functions are convex. The max of convex functions is convex.
Some ways to show that a function is convex:
1. Using the definition of convexity
2. Showing that the second derivative is nonnegative (for one-dimensional functions)
3. Showing that the Hessian (the matrix of second derivatives) is positive semi-definite (for vector functions)
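Method 1, the definition, can be spot-checked numerically. A small sketch (my own helper, not from the slides) that tests the one-dimensional hinge loss at random triples, and shows sin failing at a hand-picked triple:

```python
import numpy as np

rng = np.random.default_rng(0)

def convexity_holds(f, u, v, t):
    """Check the definition at one triple: f(t*u + (1-t)*v) <= t*f(u) + (1-t)*f(v)."""
    return f(t * u + (1 - t) * v) <= t * f(u) + (1 - t) * f(v) + 1e-12

hinge = lambda z: max(0.0, 1.0 - z)     # one-dimensional hinge loss: convex

# No violation should ever appear for a convex function.
ok = all(convexity_holds(hinge, rng.normal(), rng.normal(), rng.uniform())
         for _ in range(10_000))
print(ok)                                         # True

# sin(x) fails the definition: sin(pi/2) = 1 > 0.5*sin(0) + 0.5*sin(pi).
print(convexity_holds(np.sin, 0.0, np.pi, 0.5))   # False
```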
10 Not all functions are convex
Some functions are concave, and some are neither. For concave functions the inequality is reversed:

    f(λu + (1 − λ)v) ≥ λ f(u) + (1 − λ) f(v)
11 Convex functions are convenient
A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1], we have

    f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v)

In general, a necessary condition for x to be a minimum of the function f is ∇f(x) = 0. For convex functions, this is both necessary and sufficient.
12 Solving the SVM optimization problem

    min_w ½ w_0ᵀw_0 + C Σ_i max(0, 1 − y_i wᵀx_i)

This function is convex in w. This is a quadratic optimization problem because the objective is quadratic. Older methods used techniques from quadratic programming and were very slow. With no constraints we can use gradient descent, but naively it is still very slow!
13 Gradient descent
We are trying to minimize

    J(w) = ½ w_0ᵀw_0 + C Σ_i max(0, 1 − y_i wᵀx_i)

General strategy for minimizing a function J(w):
- Start with an initial guess for w, say w^0
- Iterate till convergence:
  - Compute the gradient of J at w^t
  - Update w^t to get w^{t+1} by taking a step in the opposite direction of the gradient
Intuition: the gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.
17 Gradient descent for SVM
We are trying to minimize

    J(w) = ½ w_0ᵀw_0 + C Σ_i max(0, 1 − y_i wᵀx_i)

1. Initialize w^0
2. For t = 0, 1, 2, ...:
   1. Compute the gradient of J(w) at w^t. Call it ∇J(w^t)
   2. Update: w^{t+1} ← w^t − r ∇J(w^t)
r: called the learning rate.
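The full-batch loop can be sketched directly (a toy implementation with hypothetical names; at the hinge's kink the code uses the sub-gradient that later slides derive):

```python
import numpy as np

def svm_gd(X, y, C=1.0, lr=0.01, steps=500):
    """Full-batch (sub)gradient descent on J(w) = 0.5 w.w + C * sum_i hinge_i."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1                      # examples with nonzero hinge loss
        grad = w - C * (y[active, None] * X[active]).sum(axis=0)
        w -= lr * grad                            # step opposite the gradient
    return w

# Toy separable data: the learned w should classify both points correctly.
X = np.array([[2.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, -1.0])
w = svm_gd(X, y)
print(np.sign(X @ w))    # [ 1. -1.]
```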
18 Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
19 Gradient descent for SVM
We are trying to minimize

    J(w) = ½ w_0ᵀw_0 + C Σ_i max(0, 1 − y_i wᵀx_i)

1. Initialize w^0
2. For t = 0, 1, 2, ...:
   1. Compute the gradient of J(w) at w^t. Call it ∇J(w^t)
   2. Update: w^{t+1} ← w^t − r ∇J(w^t)   (r: the learning rate)
The gradient of the SVM objective requires summing over the entire training set: slow, does not really scale.
20 Stochastic gradient descent for SVM
We are trying to minimize

    J(w) = ½ w_0ᵀw_0 + C Σ_i max(0, 1 − y_i wᵀx_i)

Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w^0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   1. Pick a random example (x_i, y_i) from the training set S
   2. Treat (x_i, y_i) as a full dataset and take the derivative of the SVM objective at the current w^{t−1} to be ∇J_t(w^{t−1})
   3. Update: w^t ← w^{t−1} − γ_t ∇J_t(w^{t−1})
3. Return final w
23 Stochastic gradient descent for SVM
Repeating a single example (x_i, y_i) N times to make a full dataset gives the per-example objective

    J_t(w) = ½ w_0ᵀw_0 + C N max(0, 1 − y_i wᵀx_i)

where N is the number of training examples.
Update: w^t ← w^{t−1} − γ_t ∇J_t(w^{t−1})
26 Stochastic gradient descent for SVM
Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w^0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   1. Pick a random example (x_i, y_i) from the training set S
   2. Repeat (x_i, y_i) to make a full dataset and take the derivative of the SVM objective at the current w^{t−1} to be ∇J_t(w^{t−1})
   3. Update: w^t ← w^{t−1} − γ_t ∇J_t(w^{t−1})
3. Return final w
But what is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)
This algorithm is guaranteed to converge to the minimum of J if γ_t is small enough.
27 Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
28-45 Gradient Descent vs SGD (figures: a gradient descent trajectory, followed by stochastic gradient descent trajectories taking many noisier steps toward the minimum)
46 Gradient Descent vs SGD
Stochastic gradient descent makes many more updates than gradient descent, but each individual update is less computationally expensive.
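A rough experiment makes the tradeoff concrete (toy data, my own names, and a fixed learning rate, so a sketch rather than the slides' exact algorithm): 2000 SGD updates touch 2000 examples in total, where 2000 full-gradient updates would touch 2000·N.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linearly separable data: y_i = sign(w_true . x_i).
N, d = 1000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)

w = np.zeros(d)
lr = 0.001                               # fixed rate, for simplicity
examples_touched = 0
for _ in range(2000):                    # 2000 SGD updates ~ 2 passes over the data
    i = rng.integers(N)
    if y[i] * (X[i] @ w) < 1:            # hinge active: shrink and push (C = 1)
        w = (1 - lr) * w + lr * N * y[i] * X[i]
    else:                                # hinge inactive: shrink only
        w = (1 - lr) * w
    examples_touched += 1

print(examples_touched)                  # 2000; full GD would have touched 2000 * N
print(np.mean(np.sign(X @ w) == y))      # training accuracy of the learned w
```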
47 Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
49 Hinge loss is not differentiable!
What is the derivative of the hinge loss with respect to w?

    J_t(w) = ½ w_0ᵀw_0 + C N max(0, 1 − y_i wᵀx_i)
50 Detour: Sub-gradients
A generalization of gradients to non-differentiable functions. (Remember that every tangent lies below the function for convex functions.)
Informally, a sub-tangent at a point is any line that lies below the function at that point. A sub-gradient is the slope of such a line.
51 Sub-gradients
Formally, g is a sub-gradient of f at x if f(z) ≥ f(x) + gᵀ(z − x) for all z.
If f is differentiable at x_1, the tangent at that point gives the unique sub-gradient: g_1 is the gradient at x_1. At a kink such as x_2 there are many: g_2 and g_3 are both sub-gradients at x_2. [Example from Boyd]
55 Sub-gradient of the SVM objective

    J_t(w) = ½ w_0ᵀw_0 + C N max(0, 1 − y_i wᵀx_i)

General strategy: first resolve the max, then compute the gradient for each case:

    ∇J_t = [w_0; 0]                  if max(0, 1 − y_i wᵀx_i) = 0
    ∇J_t = [w_0; 0] − C N y_i x_i    otherwise
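The case split translates directly to code. A sketch (my own function name; `w` and `x` are the augmented vectors [w_0; b] and [x_i; 1], so the bias coordinate of the regularizer's gradient is zeroed):

```python
import numpy as np

def subgrad_Jt(w, x_aug, y_i, C, N):
    """Sub-gradient of J_t(w) = 0.5*w0.w0 + C*N*max(0, 1 - y_i w.x_i).

    w and x_aug are augmented: w = [w0; b], x_aug = [x_i; 1].
    """
    reg = w.copy()
    reg[-1] = 0.0                        # [w0; 0]: the bias is not regularized
    if y_i * (w @ x_aug) >= 1:           # hinge inactive: max(...) = 0
        return reg
    return reg - C * N * y_i * x_aug     # hinge active

w = np.array([1.0, -1.0, 0.5])           # [w0; b]
x = np.array([2.0, 0.0, 1.0])            # [x_i; 1]
print(subgrad_Jt(w, x, +1.0, C=1.0, N=4))   # hinge inactive: [ 1. -1.  0.]
print(subgrad_Jt(w, x, -1.0, C=1.0, N=4))   # hinge active:   [ 9. -1.  4.]
```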
56 Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
57 Stochastic sub-gradient descent for SVM
Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w^0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   For each training example (x_i, y_i) ∈ S:
      If y_i wᵀx_i ≤ 1:  w ← (1 − γ_t) w + γ_t C y_i x_i
      else:              w ← (1 − γ_t) w
3. Return w
Recall:
    ∇J_t = [w_0; 0]                  if max(0, 1 − y_i wᵀx_i) = 0
    ∇J_t = [w_0; 0] − C N y_i x_i    otherwise
62 Stochastic sub-gradient descent for SVM
Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w^0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      If y_i wᵀx_i ≤ 1:  w ← (1 − γ_t) [w_0; 0] + γ_t C N y_i x_i
      else:              w_0 ← (1 − γ_t) w_0
3. Return w
γ_t: learning rate, many tweaks possible.
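Putting the pieces together, a compact sketch of the full algorithm (the names, the γ_0/(1+t) schedule, and the toy data are my own choices, not from the slides):

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, epochs=50, gamma0=0.1, seed=0):
    """Stochastic sub-gradient descent for the SVM.

    X is augmented with a constant-1 column so w = [w0; b]; the sub-gradient step
    shrinks only the w0 part and pushes on margin violations.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        order = rng.permutation(N)              # shuffle each epoch
        for i in order:
            gamma = gamma0 / (1 + t)            # one simple decaying schedule
            reg = w.copy()
            reg[-1] = 0.0                       # [w0; 0]
            if y[i] * (w @ X[i]) <= 1:          # margin violation: shrink and push
                w = w - gamma * reg + gamma * C * N * y[i] * X[i]
            else:                               # no violation: shrink w0 only
                w = w - gamma * reg
            t += 1
    return w

# Toy separable data (with the constant-1 bias feature appended).
X = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0], [-2.0, -1.0, 1.0], [-1.0, -2.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_svm_sgd(X, y)
print(np.sign(X @ w))    # should match y on this easy data
```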
63 Convergence and learning rates
With enough iterations, SGD will converge in expectation, provided the step sizes are square-summable but not summable:
- Step sizes γ_t are positive
- The sum of squares of the step sizes over t = 1 to ∞ is finite
- The sum of the step sizes over t = 1 to ∞ is infinite
Some examples: γ_t = γ_0 / (1 + γ_0 t / C) or γ_t = γ_0 / (1 + t).
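A quick numeric check of the square-summable-but-not-summable condition for γ_t = γ_0/(1 + t), one of the example schedules (partial sums up to T = 10⁶; the two limits are the harmonic series and γ_0²(π²/6 − 1)):

```python
import numpy as np

gamma0 = 1.0
t = np.arange(1, 1_000_001)
g = gamma0 / (1 + t)

# Sum of step sizes grows like log T (diverges); sum of squares converges.
print(g.sum())           # ~13.4 here, and it keeps growing as T increases
print((g ** 2).sum())    # ~0.645, approaching the finite limit pi^2/6 - 1
```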
64 Convergence and learning rates
Number of iterations to get to accuracy within ε, for strongly convex functions, N examples, d dimensions:
- Gradient descent: O(N d ln(1/ε))
- Stochastic gradient descent: O(d/ε)
There are more subtleties involved, but SGD is generally preferable when the data size is huge.
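Plugging in numbers makes the gap vivid (a sketch; the constants hidden in the O(·) are dropped, so these are orders of magnitude only):

```python
import numpy as np

# Rough total work to reach accuracy eps (constants dropped):
#   gradient descent:            O(N d ln(1/eps))
#   stochastic gradient descent: O(d / eps)
N, d = 1_000_000, 100

for eps in (1e-2, 1e-4):
    gd_work = N * d * np.log(1 / eps)
    sgd_work = d / eps
    print(f"eps={eps:g}: GD ~{gd_work:.2e}, SGD ~{sgd_work:.2e}")
```

For huge N and moderate accuracy, SGD's bound is orders of magnitude smaller; for tiny ε the comparison tilts back toward GD.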
65 Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
66 Stochastic sub-gradient descent for SVM
Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w^0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      If y_i wᵀx_i ≤ 1:  w ← (1 − γ_t) [w_0; 0] + γ_t C N y_i x_i
      else:              w_0 ← (1 − γ_t) w_0
3. Return w
Compare with the Perceptron update: if y_i wᵀx_i ≤ 0, update w ← w + r y_i x_i.
67 Perceptron vs. SVM
Perceptron: stochastic sub-gradient descent for a different loss, with no regularization.
SVM: optimizes the hinge loss, with regularization.
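The two updates side by side in code (a sketch with my own names; γ is fixed, C·N is folded into one constant `CN`, and the slides' separate bias handling is omitted for brevity):

```python
import numpy as np

def perceptron_update(w, x, y, r=1.0):
    """Mistake-driven: update only when y * w.x <= 0; no regularization."""
    if y * (w @ x) <= 0:
        return w + r * y * x
    return w

def svm_sgd_update(w, x, y, gamma=0.1, CN=4.0):
    """Margin-driven: update whenever y * w.x <= 1; always shrink w (regularization)."""
    if y * (w @ x) <= 1:
        return (1 - gamma) * w + gamma * CN * y * x
    return (1 - gamma) * w

w = np.array([0.5, 0.5])
x = np.array([1.0, 0.0])
print(perceptron_update(w, x, +1.0))   # margin 0.5 > 0: no change, [0.5 0.5]
print(svm_sgd_update(w, x, +1.0))      # margin 0.5 <= 1: shrink + push, ~[0.85 0.45]
```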
68 SVM summary from an optimization perspective
- Minimize the regularized hinge loss
- Solve using stochastic gradient descent: very fast, and each update's run time does not depend on the number of examples
- Compare with the Perceptron algorithm: a similar framework with different objectives! The Perceptron does not maximize margin width, though Perceptron variants can force a margin
- Other successful optimization algorithms exist, e.g. dual coordinate descent, implemented in liblinear
Questions?
More informationEE/AA 578 Univ. of Washington, Fall Homework 8
EE/AA 578 Univ. of Washington, Fall 2016 Homework 8 1. Multi-label SVM. The basic Support Vector Machine (SVM) described in the lecture (and textbook) is used for classification of data with two labels.
More informationGlobal convergence rate analysis of unconstrained optimization methods based on probabilistic models
Math. Program., Ser. A DOI 10.1007/s10107-017-1137-4 FULL LENGTH PAPER Global convergence rate analysis of unconstrained optimization methods based on probabilistic models C. Cartis 1 K. Scheinberg 2 Received:
More information6.231 DYNAMIC PROGRAMMING LECTURE 10 LECTURE OUTLINE
6.231 DYNAMIC PROGRAMMING LECTURE 10 LECTURE OUTLINE Rollout algorithms Cost improvement property Discrete deterministic problems Approximations of rollout algorithms Discretization of continuous time
More informationName Date Student id #:
Math1090 Final Exam Spring, 2016 Instructor: Name Date Student id #: Instructions: Please show all of your work as partial credit will be given where appropriate, and there may be no credit given for problems
More informationOptimal Portfolio Selection Under the Estimation Risk in Mean Return
Optimal Portfolio Selection Under the Estimation Risk in Mean Return by Lei Zhu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics
More informationBarapatre Omprakash et.al; International Journal of Advance Research, Ideas and Innovations in Technology
ISSN: 2454-132X Impact factor: 4.295 (Volume 4, Issue 2) Available online at: www.ijariit.com Stock Price Prediction using Artificial Neural Network Omprakash Barapatre omprakashbarapatre@bitraipur.ac.in
More information3/1/2016. Intermediate Microeconomics W3211. Lecture 4: Solving the Consumer s Problem. The Story So Far. Today s Aims. Solving the Consumer s Problem
1 Intermediate Microeconomics W3211 Lecture 4: Introduction Columbia University, Spring 2016 Mark Dean: mark.dean@columbia.edu 2 The Story So Far. 3 Today s Aims 4 We have now (exhaustively) described
More informationCOMP417 Introduction to Robotics and Intelligent Systems. Reinforcement Learning - 2
COMP417 Introduction to Robotics and Intelligent Systems Reinforcement Learning - 2 Speaker: Sandeep Manjanna Acklowledgement: These slides use material from Pieter Abbeel s, Dan Klein s and John Schulman
More informationLecture outline W.B.Powell 1
Lecture outline What is a policy? Policy function approximations (PFAs) Cost function approximations (CFAs) alue function approximations (FAs) Lookahead policies Finding good policies Optimizing continuous
More informationProgressive Hedging for Multi-stage Stochastic Optimization Problems
Progressive Hedging for Multi-stage Stochastic Optimization Problems David L. Woodruff Jean-Paul Watson Graduate School of Management University of California, Davis Davis, CA 95616, USA dlwoodruff@ucdavis.edu
More informationPart 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)
Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective
More information2D5362 Machine Learning
2D5362 Machine Learning Reinforcement Learning MIT GALib Available at http://lancet.mit.edu/ga/ download galib245.tar.gz gunzip galib245.tar.gz tar xvf galib245.tar cd galib245 make or access my files
More informationInteger Programming Models
Integer Programming Models Fabio Furini December 10, 2014 Integer Programming Models 1 Outline 1 Combinatorial Auctions 2 The Lockbox Problem 3 Constructing an Index Fund Integer Programming Models 2 Integer
More informationCPS 270: Artificial Intelligence Markov decision processes, POMDPs
CPS 270: Artificial Intelligence http://www.cs.duke.edu/courses/fall08/cps270/ Markov decision processes, POMDPs Instructor: Vincent Conitzer Warmup: a Markov process with rewards We derive some reward
More informationComparison of Logit Models to Machine Learning Algorithms for Modeling Individual Daily Activity Patterns
Comparison of Logit Models to Machine Learning Algorithms for Modeling Individual Daily Activity Patterns Daniel Fay, Peter Vovsha, Gaurav Vyas (WSP USA) 1 Logit vs. Machine Learning Models Logit Models:
More informationPOMDPs: Partially Observable Markov Decision Processes Advanced AI
POMDPs: Partially Observable Markov Decision Processes Advanced AI Wolfram Burgard Types of Planning Problems Classical Planning State observable Action Model Deterministic, accurate MDPs observable stochastic
More informationALGORITHMIC TRADING STRATEGIES IN PYTHON
7-Course Bundle In ALGORITHMIC TRADING STRATEGIES IN PYTHON Learn to use 15+ trading strategies including Statistical Arbitrage, Machine Learning, Quantitative techniques, Forex valuation methods, Options
More informationAccelerated Option Pricing Multiple Scenarios
Accelerated Option Pricing in Multiple Scenarios 04.07.2008 Stefan Dirnstorfer (stefan@thetaris.com) Andreas J. Grau (grau@thetaris.com) 1 Abstract This paper covers a massive acceleration of Monte-Carlo
More informationThe Optimization Process: An example of portfolio optimization
ISyE 6669: Deterministic Optimization The Optimization Process: An example of portfolio optimization Shabbir Ahmed Fall 2002 1 Introduction Optimization can be roughly defined as a quantitative approach
More informationFinancial derivatives exam Winter term 2014/2015
Financial derivatives exam Winter term 2014/2015 Problem 1: [max. 13 points] Determine whether the following assertions are true or false. Write your answers, without explanations. Grading: correct answer
More informationThe exam is closed book, closed calculator, and closed notes except your three crib sheets.
CS 188 Spring 2016 Introduction to Artificial Intelligence Final V2 You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your three crib sheets.
More informationPortfolio Management and Optimal Execution via Convex Optimization
Portfolio Management and Optimal Execution via Convex Optimization Enzo Busseti Stanford University April 9th, 2018 Problems portfolio management choose trades with optimization minimize risk, maximize
More informationStochastic Dual Dynamic Programming Algorithm for Multistage Stochastic Programming
Stochastic Dual Dynamic Programg Algorithm for Multistage Stochastic Programg Final presentation ISyE 8813 Fall 2011 Guido Lagos Wajdi Tekaya Georgia Institute of Technology November 30, 2011 Multistage
More informationCS 331: Artificial Intelligence Game Theory I. Prisoner s Dilemma
CS 331: Artificial Intelligence Game Theory I 1 Prisoner s Dilemma You and your partner have both been caught red handed near the scene of a burglary. Both of you have been brought to the police station,
More informationProblem 1: Random variables, common distributions and the monopoly price
Problem 1: Random variables, common distributions and the monopoly price In this problem, we will revise some basic concepts in probability, and use these to better understand the monopoly price (alternatively
More informationPricing Kernel. v,x = p,y = p,ax, so p is a stochastic discount factor. One refers to p as the pricing kernel.
Payoff Space The set of possible payoffs is the range R(A). This payoff space is a subspace of the state space and is a Euclidean space in its own right. 1 Pricing Kernel By the law of one price, two portfolios
More informationArtificially Intelligent Forecasting of Stock Market Indexes
Artificially Intelligent Forecasting of Stock Market Indexes Loyola Marymount University Math 560 Final Paper 05-01 - 2018 Daniel McGrath Advisor: Dr. Benjamin Fitzpatrick Contents I. Introduction II.
More informationDepartment of Economics The Ohio State University Final Exam Answers Econ 8712
Department of Economics The Ohio State University Final Exam Answers Econ 8712 Prof. Peck Fall 2015 1. (5 points) The following economy has two consumers, two firms, and two goods. Good 2 is leisure/labor.
More informationFinal Examination Re - Calculus I 21 December 2015
. (5 points) Given the graph of f below, determine each of the following. Use, or does not exist where appropriate. y (a) (b) x 3 x 2 + (c) x 2 (d) x 2 (e) f(2) = (f) x (g) x (h) f (3) = 3 2 6 5 4 3 2
More informationk-layer neural networks: High capacity scoring functions + tips on how to train them
k-layer neural networks: High capacity scoring functions + tips on how to train them A new class of scoring functions Linear scoring function s = W x + b 2-layer Neural Network s 1 = W 1 x + b 1 h = max(0,
More informationMulti-period mean variance asset allocation: Is it bad to win the lottery?
Multi-period mean variance asset allocation: Is it bad to win the lottery? Peter Forsyth 1 D.M. Dang 1 1 Cheriton School of Computer Science University of Waterloo Guangzhou, July 28, 2014 1 / 29 The Basic
More informationBudget Constrained Choice with Two Commodities
1 Budget Constrained Choice with Two Commodities Joseph Tao-yi Wang 2013/9/25 (Lecture 5, Micro Theory I) The Consumer Problem 2 We have some powerful tools: Constrained Maximization (Shadow Prices) Envelope
More informationPredictive Model Learning of Stochastic Simulations. John Hegstrom, FSA, MAAA
Predictive Model Learning of Stochastic Simulations John Hegstrom, FSA, MAAA Table of Contents Executive Summary... 3 Choice of Predictive Modeling Techniques... 4 Neural Network Basics... 4 Financial
More information1 Asset Pricing: Bonds vs Stocks
Asset Pricing: Bonds vs Stocks The historical data on financial asset returns show that one dollar invested in the Dow- Jones yields 6 times more than one dollar invested in U.S. Treasury bonds. The return
More informationCS 7180: Behavioral Modeling and Decision- making in AI
CS 7180: Behavioral Modeling and Decision- making in AI Algorithmic Game Theory Prof. Amy Sliva November 30, 2012 Prisoner s dilemma Two criminals are arrested, and each offered the same deal: If you defect
More informationJournal of Computational and Applied Mathematics. The mean-absolute deviation portfolio selection problem with interval-valued returns
Journal of Computational and Applied Mathematics 235 (2011) 4149 4157 Contents lists available at ScienceDirect Journal of Computational and Applied Mathematics journal homepage: www.elsevier.com/locate/cam
More informationOptimizing the Omega Ratio using Linear Programming
Optimizing the Omega Ratio using Linear Programming Michalis Kapsos, Steve Zymler, Nicos Christofides and Berç Rustem October, 2011 Abstract The Omega Ratio is a recent performance measure. It captures
More informationReasoning with Uncertainty
Reasoning with Uncertainty Markov Decision Models Manfred Huber 2015 1 Markov Decision Process Models Markov models represent the behavior of a random process, including its internal state and the externally
More informationJournal of Internet Banking and Commerce
Journal of Internet Banking and Commerce An open access Internet journal (http://www.icommercecentral.com) Journal of Internet Banking and Commerce, December 2017, vol. 22, no. 3 STOCK PRICE PREDICTION
More informationEfficient Portfolio and Introduction to Capital Market Line Benninga Chapter 9
Efficient Portfolio and Introduction to Capital Market Line Benninga Chapter 9 Optimal Investment with Risky Assets There are N risky assets, named 1, 2,, N, but no risk-free asset. With fixed total dollar
More informationLecture 2: Fundamentals of meanvariance
Lecture 2: Fundamentals of meanvariance analysis Prof. Massimo Guidolin Portfolio Management Second Term 2018 Outline and objectives Mean-variance and efficient frontiers: logical meaning o Guidolin-Pedio,
More informationINVERSE REWARD DESIGN
INVERSE REWARD DESIGN Dylan Hadfield-Menell, Smith Milli, Pieter Abbeel, Stuart Russell, Anca Dragan University of California, Berkeley Slides by Anthony Chen Inverse Reinforcement Learning (Review) Inverse
More informationMaking Complex Decisions
Ch. 17 p.1/29 Making Complex Decisions Chapter 17 Ch. 17 p.2/29 Outline Sequential decision problems Value iteration algorithm Policy iteration algorithm Ch. 17 p.3/29 A simple environment 3 +1 p=0.8 2
More information