Support Vector Machines: Training with Stochastic Gradient Descent


Machine Learning, Spring 2018
The slides are mainly from Vivek Srikumar.

Support vector machines: training by maximizing margin; the SVM objective; solving the SVM optimization problem; support vectors, duals and kernels.

SVM objective function

$$\min_{w,b} \; \frac{1}{2} w^\top w + C \sum_i \max\bigl(0,\; 1 - y_i (w^\top x_i + b)\bigr)$$

Regularization term, $\frac{1}{2} w^\top w$: maximizes the margin. It imposes a preference over the hypothesis space and pushes for better generalization. It can be replaced with other regularization terms which impose other preferences.

Empirical loss, $\sum_i \max(0, 1 - y_i (w^\top x_i + b))$: the hinge loss. It penalizes weight vectors that make mistakes. It can be replaced with other loss functions which impose other preferences.

$C$: a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss.

Outline: Training SVM by optimization
1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron


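Solving the SVM optimization problem

$$\min_{w,b} \; \frac{1}{2} w^\top w + C \sum_i \max\bigl(0,\; 1 - y_i (w^\top x_i + b)\bigr)$$

This function is convex in $w$ and $b$.

For convenience, use simplified notation: rename the original weight vector, $w_0 \leftarrow w$, then fold the bias into the weights and a constant feature into the inputs, $w \leftarrow [w_0, b]$ and $x_i \leftarrow [x_i, 1]$:

$$\min_{w} \; \frac{1}{2} w_0^\top w_0 + C \sum_i \max(0,\; 1 - y_i w^\top x_i)$$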
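As a minimal sketch of the simplified notation (not part of the original slides), the snippet below appends the constant feature 1 to each input and evaluates the objective with $w = [w_0, b]$. The helper names `augment`, `dot` and `svm_objective` are hypothetical, and the two-point dataset is made up for illustration.

```python
from typing import List, Sequence


def augment(x: Sequence[float]) -> List[float]:
    """Append a constant 1 so that the bias b can live inside w = [w_0, b]."""
    return list(x) + [1.0]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def svm_objective(w: Sequence[float], X: Sequence[Sequence[float]],
                  y: Sequence[int], C: float) -> float:
    """J(w) = 1/2 w_0^T w_0 + C * sum_i max(0, 1 - y_i w^T x_i).

    w holds the bias as its last entry; rows of X are already augmented.
    """
    w0 = w[:-1]  # the regularizer excludes the bias
    regularizer = 0.5 * dot(w0, w0)
    hinge = sum(max(0.0, 1.0 - yi * dot(w, xi)) for xi, yi in zip(X, y))
    return regularizer + C * hinge


# Illustrative data (an assumption, not from the slides).
X = [augment([2.0, 0.0]), augment([0.0, 0.0])]
y = [1, -1]
w = [1.0, 0.0, -1.0]  # w_0 = [1, 0], b = -1

print(svm_objective(w, X, y, C=1.0))  # 0.5: both margins equal 1, only the regularizer remains
```

Keeping the bias out of the regularizer is what the $w_0^\top w_0$ term expresses: only the original weights are penalized.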

Recall: Convex functions

A function $f$ is convex if for every $u, v$ in the domain, and for every $\lambda \in [0, 1]$, we have

$$f(\lambda u + (1 - \lambda) v) \le \lambda f(u) + (1 - \lambda) f(v)$$

From a geometric perspective, every tangent plane lies below the function:

$$f(x) \ge f(u) + \nabla f(u)^\top (x - u)$$
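The definition can be spot-checked numerically. The sketch below is an illustration added here, with hypothetical helper names: it samples random $u$, $v$ and $\lambda$ and tests the inequality for a one-dimensional hinge function. A passing check does not prove convexity, while a single failing triple disproves it.

```python
import random
from typing import Callable


def hinge(z: float) -> float:
    """One-dimensional hinge function max(0, 1 - z)."""
    return max(0.0, 1.0 - z)


def satisfies_convexity(f: Callable[[float], float], u: float, v: float,
                        lam: float, tol: float = 1e-12) -> bool:
    """Check f(lam*u + (1-lam)*v) <= lam*f(u) + (1-lam)*f(v) for one triple."""
    left = f(lam * u + (1.0 - lam) * v)
    right = lam * f(u) + (1.0 - lam) * f(v)
    return left <= right + tol  # small tolerance for floating-point rounding


rng = random.Random(0)
hinge_ok = all(
    satisfies_convexity(hinge, rng.uniform(-5, 5), rng.uniform(-5, 5), rng.random())
    for _ in range(1000)
)
print(hinge_ok)

# A concave function violates the inequality, e.g. -x^2 at u=-1, v=1, lam=0.5.
print(satisfies_convexity(lambda x: -x * x, -1.0, 1.0, 0.5))
```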

Convex functions

Examples: linear functions are convex, and max is convex (the pointwise maximum of convex functions is convex).

Some ways to show that a function is convex:
1. Using the definition of convexity
2. Showing that the second derivative is nonnegative (for one-dimensional functions)
3. Showing that the Hessian, the matrix of second derivatives, is positive semi-definite (for functions of a vector)

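Not all functions are convex

[Figure: examples of concave functions, and of functions that are neither convex nor concave.]

Concave functions satisfy the reverse inequality: $f(\lambda u + (1 - \lambda) v) \ge \lambda f(u) + (1 - \lambda) f(v)$.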

Convex functions are convenient

In general, a necessary condition for $x$ to be a minimum of a differentiable function $f$ is $\nabla f(x) = 0$.

For convex functions, this condition is both necessary and sufficient.
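A short worked example, added here for illustration (the two functions are not from the slides), shows why convexity matters for the zero-gradient condition:

```latex
% Convex case: f(x) = (x - 3)^2
f'(x) = 2(x - 3) = 0 \iff x = 3,
\qquad f(3) = 0 \le f(x) \text{ for all } x .

% Non-convex case: h(x) = x^3
h'(0) = 0, \quad \text{but } h(-1) = -1 < h(0) = 0,
\text{ so } x = 0 \text{ is not a minimum.}
```

For the convex $f$, the stationary point is the global minimum. For the non-convex $h$, a zero derivative tells us nothing on its own.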

Solving the SVM optimization problem

$$\min_{w} \; \frac{1}{2} w_0^\top w_0 + C \sum_i \max(0,\; 1 - y_i w^\top x_i)$$

This function is convex in $w$. This is a quadratic optimization problem because the objective is quadratic.

Older methods used techniques from quadratic programming, which are very slow. Because the problem has no constraints, we can use gradient descent, but that is still very slow!

Gradient descent

We are trying to minimize

$$J(w) = \frac{1}{2} w_0^\top w_0 + C \sum_i \max(0,\; 1 - y_i w^\top x_i)$$

General strategy for minimizing a function $J(w)$:
1. Start with an initial guess for $w$, say $w^0$.
2. Iterate till convergence: compute the gradient of $J$ at $w^t$, then update $w^t$ to get $w^{t+1}$ by taking a step in the opposite direction of the gradient.

Intuition: the gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.

[Figure: plot of $J(w)$ against $w$, with successive iterates $w^0, w^1, w^2, w^3$ moving toward the minimum.]

Gradient descent for SVM

We are trying to minimize

$$J(w) = \frac{1}{2} w_0^\top w_0 + C \sum_i \max(0,\; 1 - y_i w^\top x_i)$$

1. Initialize $w^0$.
2. For $t = 0, 1, 2, \ldots$:
   1. Compute the gradient of $J(w)$ at $w^t$. Call it $\nabla J(w^t)$.
   2. Update $w$ as follows: $w^{t+1} \leftarrow w^t - r \nabla J(w^t)$

$r$ is called the learning rate.
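The loop can be sketched on a smooth convex function whose gradient is known in closed form. This is an illustration only: the quadratic below, the helper name `gradient_descent`, and the fixed learning rate `r = 0.1` are assumptions chosen for the example, not the SVM objective itself (the hinge loss is not differentiable everywhere and needs sub-gradients).

```python
from typing import Callable, List, Sequence


def gradient_descent(grad: Callable[[Sequence[float]], List[float]],
                     w_init: Sequence[float],
                     r: float = 0.1,
                     steps: int = 100) -> List[float]:
    """Repeat w <- w - r * grad(w) for a fixed number of steps."""
    w = list(w_init)
    for _ in range(steps):
        g = grad(w)
        # Step in the direction opposite to the gradient.
        w = [wi - r * gi for wi, gi in zip(w, g)]
    return w


def quadratic_grad(w: Sequence[float]) -> List[float]:
    """Gradient of J(w) = (w1 - 3)^2 + (w2 + 1)^2, minimized at (3, -1)."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


w_final = gradient_descent(quadratic_grad, [0.0, 0.0])
print(w_final)  # close to [3, -1]
```

With `r = 0.1` each step shrinks the distance to the minimizer by a factor of 0.8 on this quadratic, so 100 steps are ample. A larger `r` can overshoot and diverge.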


Gradient descent for SVM (continued)

Computing the gradient $\nabla J(w^t)$ of the SVM objective requires summing over the entire training set, so every update is slow and the method does not really scale.

Stochastic gradient descent for SVM

$$J(w) = \frac{1}{2} w_0^\top w_0 + C \sum_i \max(0,\; 1 - y_i w^\top x_i)$$

Given a training set $S = \{(x_i, y_i)\}$, $x \in \mathbb{R}^n$, $y \in \{-1, 1\}$:
1. Initialize $w^0 = 0 \in \mathbb{R}^n$.
2. For epoch $t = 1 \ldots T$:
   1. Pick a random example $(x_i, y_i)$ from the training set $S$.
   2. Treat $(x_i, y_i)$ as a full dataset, that is, repeat it $N$ times, where $N$ is the number of training examples, and take the derivative of the SVM objective at the current $w^{t-1}$ to be $\nabla J_t(w^{t-1})$, where
      $$J_t(w) = \frac{1}{2} w_0^\top w_0 + C N \max(0,\; 1 - y_i w^\top x_i)$$
   3. Update: $w^t \leftarrow w^{t-1} - \gamma_t \nabla J_t(w^{t-1})$
3. Return final $w$.

What is the gradient of the hinge loss with respect to $w$? (The hinge loss is not a differentiable function!)

This algorithm is guaranteed to converge to the minimum of $J$ if $\gamma_t$ is small enough.


Gradient descent vs SGD

[Figure: optimization trajectory of gradient descent, followed by an animation of the trajectory of stochastic gradient descent.]

Stochastic gradient descent makes many more updates than gradient descent, but each individual update is less computationally expensive.


Hinge loss is not differentiable!

What is the derivative of the hinge loss with respect to $w$ in the per-example objective?

$$J_t(w) = \frac{1}{2} w_0^\top w_0 + C N \max(0,\; 1 - y_i w^\top x_i)$$

Detour: Sub-gradients

Sub-gradients generalize gradients to non-differentiable functions. (Remember that every tangent lies below the function for convex functions.)

Informally, a sub-tangent at a point is any line that touches the function at that point and lies below the function everywhere. A sub-gradient is the slope of that line.

Sub-gradients

Formally, $g$ is a subgradient of $f$ at $x$ if, for all $z$,

$$f(z) \ge f(x) + g^\top (z - x)$$

[Figure, example from Boyd: a convex function $f$ with the points $x_1$ and $x_2$ marked.]

$f$ is differentiable at $x_1$, so the tangent at this point is the only sub-tangent: $g_1$ is the gradient at $x_1$. At $x_2$, $g_2$ and $g_3$ are both subgradients.
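As a worked example added for illustration (using the standard subgradient inequality $f(z') \ge f(z) + g\,(z' - z)$), consider the one-dimensional hinge function $h(z) = \max(0, 1 - z)$, which has a kink at $z = 1$:

```latex
\partial h(z) =
\begin{cases}
\{-1\}   & \text{if } z < 1 \quad (h \text{ is differentiable, slope } -1) \\
\{0\}    & \text{if } z > 1 \quad (h \text{ is differentiable, slope } 0) \\
[-1, 0]  & \text{if } z = 1 \quad (\text{every slope between the two is a subgradient})
\end{cases}

% Check at z = 1, where h(1) = 0: we need h(z') \ge g\,(z' - 1) for all z'.
% For z' < 1:  1 - z' \ge g\,(z' - 1) \iff g \ge -1 .
% For z' > 1:  0 \ge g\,(z' - 1)      \iff g \le 0 .
```

Any choice from $[-1, 0]$ at the kink is valid, which is why an algorithm may pick either case when the margin is exactly 1.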

Sub-gradient of the SVM objective

$$J_t(w) = \frac{1}{2} w_0^\top w_0 + C N \max(0,\; 1 - y_i w^\top x_i)$$

General strategy: first solve the max, then compute the gradient for each case.

$$\nabla J_t = \begin{cases} [w_0; 0] & \text{if } \max(0,\; 1 - y_i w^\top x_i) = 0 \\ [w_0; 0] - C N y_i x_i & \text{otherwise} \end{cases}$$
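The two cases translate directly into code. The following is a sketch under the assumption that $w$ stores the bias as its last entry and $x$ is already augmented with a trailing 1; the function name `subgradient_Jt` is hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def subgradient_Jt(w: Sequence[float], x: Sequence[float], y: int,
                   C: float, N: int) -> List[float]:
    """A sub-gradient of J_t(w) = 1/2 w_0^T w_0 + C*N*max(0, 1 - y w^T x)."""
    # Regularizer part: [w_0; 0], the bias entry contributes nothing.
    g = list(w[:-1]) + [0.0]
    if y * dot(w, x) < 1:
        # Hinge term is active: subtract C * N * y * x.
        g = [gi - C * N * y * xi for gi, xi in zip(g, x)]
    return g


# Margin exactly 1: the max is 0, so only the regularizer part remains.
print(subgradient_Jt([1.0, 0.0, -1.0], [2.0, 0.0, 1.0], 1, C=1.0, N=2))
# Margin 0 < 1: the hinge term contributes -C*N*y*x.
print(subgradient_Jt([0.0, 0.0, 0.0], [2.0, 0.0, 1.0], 1, C=1.0, N=2))
```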


Stochastic sub-gradient descent for SVM

Given a training set $S = \{(x_i, y_i)\}$, $x \in \mathbb{R}^n$, $y \in \{-1, 1\}$:
1. Initialize $w^0 = 0 \in \mathbb{R}^n$.
2. For epoch $= 1 \ldots T$:
   1. Shuffle the training set.
   2. For each training example $(x_i, y_i) \in S$, update $w \leftarrow w - \gamma_t \nabla J_t$:
      If $y_i w^\top x_i \le 1$: $w \leftarrow w - \gamma_t [w_0; 0] + \gamma_t C N y_i x_i$
      else: $w_0 \leftarrow (1 - \gamma_t) w_0$
3. Return $w$.

The update uses the sub-gradient

$$\nabla J_t = \begin{cases} [w_0; 0] & \text{if } \max(0,\; 1 - y_i w^\top x_i) = 0 \\ [w_0; 0] - C N y_i x_i & \text{otherwise} \end{cases}$$

In both cases the regularizer shrinks only $w_0$, by the factor $(1 - \gamma_t)$; the bias entry is not shrunk.

$\gamma_t$: learning rate, many tweaks possible. It is important to shuffle the examples at the start of each epoch.
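A minimal pure-Python sketch of this training loop follows. It is an illustration under stated assumptions, not the course's reference implementation: the function name `train_svm_sgd`, the default hyper-parameters, the decaying step size $\gamma_t = \gamma_0 / (1 + t)$, and the toy dataset are all choices made here.

```python
import random
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def train_svm_sgd(X: Sequence[Sequence[float]], y: Sequence[int],
                  C: float = 1.0, gamma0: float = 0.1,
                  epochs: int = 20, seed: int = 0) -> List[float]:
    """Stochastic sub-gradient descent for the SVM objective.

    Returns w = [w_0, b]; each input is augmented with a constant 1.
    """
    rng = random.Random(seed)
    N = len(X)
    data = [(list(x) + [1.0], yi) for x, yi in zip(X, y)]
    w = [0.0] * len(data[0][0])
    t = 0
    for _ in range(epochs):
        rng.shuffle(data)  # shuffle at the start of each epoch
        for x, yi in data:
            gamma = gamma0 / (1.0 + t)  # assumed schedule, decays with t
            t += 1
            margin = yi * dot(w, x)  # evaluated at the current w
            # Regularizer step: shrink w_0 only, leave the bias untouched.
            w = [(1.0 - gamma) * wj for wj in w[:-1]] + [w[-1]]
            if margin <= 1:
                # Hinge term active: add gamma * C * N * y_i * x_i.
                w = [wj + gamma * C * N * yi * xj for wj, xj in zip(w, x)]
    return w


# Toy, linearly separable data (made up for illustration).
X_toy = [[2.0, 1.0], [3.0, 2.0], [-2.0, -1.0], [-3.0, -2.0]]
y_toy = [1, 1, -1, -1]
w_toy = train_svm_sgd(X_toy, y_toy)
print(w_toy)
```

A prediction for a new input `x` would be the sign of `dot(w, list(x) + [1.0])`. In practice `C`, `gamma0` and the number of epochs are tuned on held-out data.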

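Convergence and learning rates

With enough iterations, it will converge in expectation, provided the step sizes are square summable but not summable:
1. The step sizes $\gamma_t$ are positive.
2. The sum of squares of the step sizes over $t = 1$ to $\infty$ is not infinite: $\sum_{t=1}^{\infty} \gamma_t^2 < \infty$.
3. The sum of the step sizes over $t = 1$ to $\infty$ is infinite: $\sum_{t=1}^{\infty} \gamma_t = \infty$.

Some examples: $\gamma_t = \dfrac{\gamma_0}{1 + \frac{\gamma_0 t}{C}}$ or $\gamma_t = \dfrac{\gamma_0}{1 + t}$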
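The two conditions can be illustrated numerically. The sketch below assumes the example schedules are $\gamma_0 / (1 + \gamma_0 t / C)$ and $\gamma_0 / (1 + t)$ (a reading of the garbled slide formulas); the function names are hypothetical. Finite partial sums only suggest the behaviour, they do not prove it.

```python
def schedule_with_C(gamma0: float, t: int, C: float) -> float:
    """gamma_t = gamma0 / (1 + gamma0 * t / C)."""
    return gamma0 / (1.0 + gamma0 * t / C)


def schedule_inverse_t(gamma0: float, t: int) -> float:
    """gamma_t = gamma0 / (1 + t)."""
    return gamma0 / (1.0 + t)


steps = [schedule_inverse_t(1.0, t) for t in range(1, 10001)]
sum_steps = sum(steps)                     # keeps growing like log(t)
sum_squares = sum(s * s for s in steps)    # stays bounded

print(sum_steps, sum_squares)
```

The sum of the step sizes keeps growing as more terms are added, so the iterates can travel arbitrarily far, while the bounded sum of squares keeps the accumulated noise under control.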

Convergence and learning rates

Number of iterations to get to accuracy within $\epsilon$, for strongly convex functions, $N$ examples, $d$ dimensional:
Gradient descent: $O(N d \ln(1/\epsilon))$
Stochastic gradient descent: $O(d/\epsilon)$

More subtleties are involved, but SGD is generally preferable when the data size is huge.
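To get a feel for the two bounds, the sketch below plugs in made-up values of $N$, $d$ and $\epsilon$. It treats the big-O expressions as operation counts with all constants dropped, which is an assumption for illustration only; the function names are hypothetical.

```python
import math


def gd_cost(N: int, d: int, eps: float) -> float:
    """Order-of-magnitude cost N * d * ln(1/eps), constants ignored."""
    return N * d * math.log(1.0 / eps)


def sgd_cost(d: int, eps: float) -> float:
    """Order-of-magnitude cost d / eps, constants ignored."""
    return d / eps


# Illustrative sizes: one million examples, 100 features, accuracy 1e-3.
N, d, eps = 1_000_000, 100, 1e-3
ratio = gd_cost(N, d, eps) / sgd_cost(d, eps)
print(ratio)  # several thousand: the SGD bound does not grow with N
```

The SGD bound has no dependence on $N$, which is why it wins for large datasets at moderate accuracy. For very small $\epsilon$ the $1/\epsilon$ factor eventually dominates.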


Stochastic sub-gradient descent for SVM: comparison with the Perceptron

Given a training set $S = \{(x_i, y_i)\}$, $x \in \mathbb{R}^n$, $y \in \{-1, 1\}$:
1. Initialize $w^0 = 0 \in \mathbb{R}^n$.
2. For epoch $= 1 \ldots T$:
   1. Shuffle the training set.
   2. For each training example $(x_i, y_i) \in S$:
      If $y_i w^\top x_i \le 1$: $w \leftarrow w - \gamma_t [w_0; 0] + \gamma_t C N y_i x_i$
      else: $w_0 \leftarrow (1 - \gamma_t) w_0$
3. Return $w$.

Compare with the Perceptron update: if $y_i w^\top x_i \le 0$, update $w \leftarrow w + r y_i x_i$.
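The difference between the two update rules shows up on an example that is classified correctly but with a margin below 1. The sketch below is illustrative: the function names and the numeric values are assumptions, and both functions expect $x$ augmented with a trailing 1.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def perceptron_update(w: Sequence[float], x: Sequence[float], y: int,
                      r: float = 1.0) -> List[float]:
    """Update only on a mistake (y w^T x <= 0); no regularization."""
    if y * dot(w, x) <= 0:
        return [wj + r * y * xj for wj, xj in zip(w, x)]
    return list(w)


def svm_update(w: Sequence[float], x: Sequence[float], y: int,
               gamma: float, C: float, N: int) -> List[float]:
    """Always shrink w_0; add the example when the margin is at most 1."""
    margin = y * dot(w, x)
    new_w = [(1.0 - gamma) * wj for wj in w[:-1]] + [w[-1]]
    if margin <= 1:
        new_w = [wj + gamma * C * N * y * xj for wj, xj in zip(new_w, x)]
    return new_w


w = [0.25, 0.0]          # w_0 = [0.25], b = 0
x, y = [2.0, 1.0], 1     # margin y * w^T x = 0.5: correct, but inside the margin

print(perceptron_update(w, x, y))                    # unchanged
print(svm_update(w, x, y, gamma=0.5, C=1.0, N=1))    # shrinks w_0, then moves toward x
```

The Perceptron ignores this example because the prediction is already correct, while the SVM update still pushes the margin toward 1.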

Perceptron vs. SVM

Perceptron: stochastic sub-gradient descent for a different loss, with no regularization though.

SVM: optimizes the hinge loss, with regularization.

SVM summary from optimization perspective

Minimize the regularized hinge loss. Solve using stochastic gradient descent: very fast, and the run time of each update does not depend on the number of examples.

Compare with the Perceptron algorithm: similar framework with different objectives! The Perceptron does not maximize the margin width, although Perceptron variants can force a margin.

Other successful optimization algorithms exist, e.g. dual coordinate descent, implemented in liblinear.