Machine Learning (CSE 446): Learning as Minimizing Loss

1 Machine Learning (CSE 446): Learning as Minimizing Loss
Noah Smith, © 2017 University of Washington, nasmith@cs.washington.edu
October 23, 2017

2 Sorry! No office hour for me today. Wednesday is as usual.

3 Perceptron
A model and an algorithm, rolled into one.
Model: $f(x) = \operatorname{sign}(w \cdot x + b)$, known as linear, visualized by a (hopefully) separating hyperplane in feature-space.
Algorithm: PerceptronTrain, an error-driven, iterative updating algorithm.
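The model and the error-driven update can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference PerceptronTrain: the names `predict`, `perceptron_train`, `max_epochs`, and `toy_data` are assumptions made for the example, and the update rule (add $y \cdot x$ to $w$ and $y$ to $b$ on a mistake) is the standard perceptron update.

```python
def sign(a):
    """Return +1 for a positive activation and -1 otherwise."""
    return 1 if a > 0 else -1


def predict(w, b, x):
    """Linear model: f(x) = sign(w . x + b)."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)


def perceptron_train(data, max_epochs=10):
    """Error-driven training: parameters change only on a mistake.

    `data` is a list of (x, y) pairs with y in {-1, +1}.
    `max_epochs` is a hypothetical cap on passes over the data.
    """
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # non-positive margin: a mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: stop early
            break
    return w, b


# Hypothetical, linearly separable toy data.
toy_data = [([2.0, 1.0], 1), ([1.0, 3.0], 1),
            ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w, b = perceptron_train(toy_data)
```

On this toy set a single update is enough, after which every example has a positive margin.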

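4 A Different View of PerceptronTrain: Optimization
Minimize training-set error rate:
$$\min_{w,b} \; \underbrace{\frac{1}{N} \sum_{n=1}^{N} \underbrace{[\![\, y_n (w \cdot x_n + b) \le 0 \,]\!]}_{\text{zero-one loss}}}_{\epsilon^{\text{train}}}$$
[Figure: loss plotted against margin $= y\,(w \cdot x + b)$]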

5 A Different View of PerceptronTrain: Optimization
Minimize training-set error rate:
$$\min_{w,b} \; \underbrace{\frac{1}{N} \sum_{n=1}^{N} \underbrace{[\![\, y_n (w \cdot x_n + b) \le 0 \,]\!]}_{\text{zero-one loss}}}_{\epsilon^{\text{train}}}$$
This problem is NP-hard; even solving it approximately (i.e., getting within a small constant factor of the optimal value) is NP-hard!
[Figure: loss plotted against margin $= y\,(w \cdot x + b)$]

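6 A Different View of PerceptronTrain: Optimization
Minimize training-set error rate:
$$\min_{w,b} \; \underbrace{\frac{1}{N} \sum_{n=1}^{N} \underbrace{[\![\, y_n (w \cdot x_n + b) \le 0 \,]\!]}_{\text{zero-one loss}}}_{\epsilon^{\text{train}}}$$
[Figure: loss plotted against margin $= y\,(w \cdot x + b)$]
What the perceptron does:
$$\min_{w,b} \; \frac{1}{N} \sum_{n=1}^{N} \underbrace{\max\bigl(-y_n (w \cdot x_n + b),\, 0\bigr)}_{\text{perceptron loss}}$$
[Figure: loss plotted against margin $= y\,(w \cdot x + b)$]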

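7 A Different View of PerceptronTrain: Optimization
Minimize training-set error rate:
$$\min_{w,b} \; \underbrace{\frac{1}{N} \sum_{n=1}^{N} \underbrace{[\![\, y_n (w \cdot x_n + b) \le 0 \,]\!]}_{\text{zero-one loss}}}_{\epsilon^{\text{train}}}$$
What the perceptron does:
$$\min_{w,b} \; \frac{1}{N} \sum_{n=1}^{N} \underbrace{\max\bigl(-y_n (w \cdot x_n + b),\, 0\bigr)}_{\text{perceptron loss}}$$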
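To make the two objectives concrete, the sketch below evaluates both losses on the same margins. It assumes each objective is an average over the $N$ training examples; the helper names (`margin`, `average_loss`) and the toy data are hypothetical and exist only for illustration.

```python
def margin(w, b, x, y):
    """Margin of one example: y * (w . x + b)."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def zero_one_loss(m):
    """1 if the example is misclassified (margin <= 0), else 0."""
    return 1.0 if m <= 0 else 0.0


def perceptron_loss(m):
    """max(-margin, 0): zero when correct, linear in the violation otherwise."""
    return max(-m, 0.0)


def average_loss(loss, w, b, data):
    """Average a per-example loss over the training set (assumed 1/N scaling)."""
    return sum(loss(margin(w, b, x, y)) for x, y in data) / len(data)


# Hypothetical data and parameters; the second example has margin -2.
data = [([1.0, 0.0], 1), ([0.0, 2.0], -1), ([2.0, 2.0], 1), ([-1.0, 0.5], -1)]
w, b = [1.0, 1.0], 0.0

error_rate = average_loss(zero_one_loss, w, b, data)      # one mistake in four
perceptron_obj = average_loss(perceptron_loss, w, b, data)  # scales with the violation
```

The zero-one loss only counts the mistake, while the perceptron loss also grows with how far the example sits on the wrong side of the hyperplane.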

9 Squash (Sigmoid) Loss?
[Figure: loss plotted against margin $= y\,(w \cdot x + b)$]

10 Different Kinds of Objective Functions
Continuous (perceptron loss, squash loss) vs. discrete (zero-one loss)
[Figure: loss plotted against margin $= y\,(w \cdot x + b)$]

11 Different Kinds of Objective Functions
Continuous (perceptron loss, squash loss) vs. discrete (zero-one loss)
Convex (perceptron loss) vs. nonconvex (zero-one loss, squash loss)
[Figure: loss plotted against margin $= y\,(w \cdot x + b)$]

12 Different Kinds of Objective Functions
Continuous (perceptron loss, squash loss) vs. discrete (zero-one loss)
Convex (perceptron loss) vs. nonconvex (zero-one loss, squash loss)
Differentiable (squash loss) vs. nondifferentiable (zero-one loss, perceptron loss)
[Figure: loss plotted against margin $= y\,(w \cdot x + b)$]

13 Different Kinds of Objective Functions
Continuous (perceptron loss, squash loss) vs. discrete (zero-one loss)
(The sum of two continuous functions is also continuous.)
Convex (perceptron loss) vs. nonconvex (zero-one loss, squash loss)
(The sum of two convex functions is also convex.)
Differentiable (squash loss) vs. nondifferentiable (zero-one loss, perceptron loss)
(The sum of two differentiable functions is also differentiable.)
[Figure: loss plotted against margin $= y\,(w \cdot x + b)$]
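The convexity claims can be probed numerically with a midpoint test: a convex function never exceeds the average of its endpoint values at the midpoint. The sketch below is only a spot check on a grid, not a proof. The slides give no formula for the squash loss, so the form $1 / (1 + e^{\text{margin}})$ used here is an assumption, as are all function names.

```python
import math


def zero_one_loss(m):
    return 1.0 if m <= 0 else 0.0


def perceptron_loss(m):
    return max(-m, 0.0)


def squash_loss(m):
    # Assumed form of the squash (sigmoid) loss; not given in the slides.
    return 1.0 / (1.0 + math.exp(m))


def violates_midpoint_convexity(loss, points, tol=1e-12):
    """Return True if some pair (a, c) has loss(mid) > average of endpoints."""
    for a in points:
        for c in points:
            mid = 0.5 * (a + c)
            if loss(mid) > 0.5 * (loss(a) + loss(c)) + tol:
                return True
    return False


grid = [i / 2.0 for i in range(-8, 9)]  # margins from -4.0 to 4.0


def summed_loss(m):
    """Sum of two convex functions: stays convex."""
    return perceptron_loss(m) + perceptron_loss(m - 1.0)


perceptron_ok = not violates_midpoint_convexity(perceptron_loss, grid)
sum_ok = not violates_midpoint_convexity(summed_loss, grid)
zero_one_bad = violates_midpoint_convexity(zero_one_loss, grid)
squash_bad = violates_midpoint_convexity(squash_loss, grid)
```

Under these assumptions the perceptron loss and the summed loss pass the check, while the zero-one and squash losses each fail it for at least one pair of margins.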

14 Regularization
Choose your loss function $L$. To fit the training data:
$$\min_{w,b} \; \frac{1}{N} \sum_{n=1}^{N} L\bigl(y_n (w \cdot x_n + b)\bigr)$$

15 Regularization
Choose your loss function $L$. To fit the training data:
$$\min_{w,b} \; \frac{1}{N} \sum_{n=1}^{N} L\bigl(y_n (w \cdot x_n + b)\bigr) + R(w, b)$$
Regularization: add a penalty to the objective function to encourage generalization.

16 Regularization
Choose your loss function $L$. To fit the training data:
$$\min_{w,b} \; \frac{1}{N} \sum_{n=1}^{N} L\bigl(y_n (w \cdot x_n + b)\bigr) + R(w, b)$$
Regularization: add a penalty to the objective function to encourage generalization.
Most common: $R(w, b) = \lambda \|w\|_2^2$.

17 Regularization
Choose your loss function $L$. To fit the training data:
$$\min_{w,b} \; \frac{1}{N} \sum_{n=1}^{N} L\bigl(y_n (w \cdot x_n + b)\bigr) + R(w, b)$$
Regularization: add a penalty to the objective function to encourage generalization.
Most common: $R(w, b) = \lambda \|w\|_2^2$. Note that this term is convex and differentiable.
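As a rough sketch, the regularized objective is the data-fitting term plus the penalty. The example below assumes the loss is averaged over the $N$ examples and uses the perceptron loss as $L$; `regularized_objective`, `lam`, and the toy data are hypothetical names and values chosen for illustration. Following the formula for $R$, only $w$ is penalized, not $b$.

```python
def perceptron_loss(m):
    return max(-m, 0.0)


def regularized_objective(loss, w, b, data, lam):
    """Average loss over the data plus the penalty lam * ||w||_2^2."""
    avg_loss = sum(
        loss(y * (sum(wi * xi for wi, xi in zip(w, x)) + b)) for x, y in data
    ) / len(data)
    penalty = lam * sum(wi * wi for wi in w)  # squared L2 norm of w; b is left out
    return avg_loss + penalty


# Hypothetical data and parameters.
data = [([1.0, 0.0], 1), ([0.0, 2.0], -1), ([2.0, 2.0], 1), ([-1.0, 0.5], -1)]
w, b = [1.0, 1.0], 0.0

unregularized = regularized_objective(perceptron_loss, w, b, data, lam=0.0)
regularized = regularized_objective(perceptron_loss, w, b, data, lam=0.5)
```

With `lam=0.5` the penalty adds $0.5 \cdot (1^2 + 1^2) = 1$ to the average loss, so larger weights cost more even when the fit to the data is unchanged.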

18 Your new hobby: blindfolded mountain escape

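19 Convex Optimization 101
Assume we are minimizing a function $F : \mathbb{R}^d \to \mathbb{R}$ that is continuous, convex, and differentiable with respect to its input, $z$.
[Figure: $F(z)$ plotted against $z$]
At a given point $z_0$, the direction of steepest descent is the negative gradient:
$$g(z_0) = \nabla_z F(z_0) = \begin{bmatrix} \frac{\partial F}{\partial z[1]}(z_0) \\ \frac{\partial F}{\partial z[2]}(z_0) \\ \vdots \\ \frac{\partial F}{\partial z[d]}(z_0) \end{bmatrix}$$
Note that $g : \mathbb{R}^d \to \mathbb{R}^d$.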
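A small numerical check can make the gradient definition concrete. The function `F` below is a hypothetical convex quadratic chosen for illustration, not one from the lecture. The sketch compares its analytic gradient with a central finite-difference estimate, then confirms that a small step along the negative gradient lowers $F$.

```python
def F(z):
    """Hypothetical convex, differentiable function on R^2."""
    return (z[0] - 1.0) ** 2 + 2.0 * (z[1] + 3.0) ** 2


def grad_F(z):
    """Analytic gradient: one partial derivative per coordinate of z."""
    return [2.0 * (z[0] - 1.0), 4.0 * (z[1] + 3.0)]


def numerical_gradient(f, z, h=1e-6):
    """Central finite-difference estimate of each partial derivative."""
    g = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        g.append((f(up) - f(down)) / (2.0 * h))
    return g


z0 = [0.0, 0.0]
g = grad_F(z0)                      # gradient at z0
g_approx = numerical_gradient(F, z0)

# A small step in the negative-gradient direction should decrease F.
z1 = [zi - 0.01 * gi for zi, gi in zip(z0, g)]
```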

20 Gradient Descent
Data: function $F : \mathbb{R}^d \to \mathbb{R}$, number of iterations $K$, step sizes $\eta^{(1)}, \ldots, \eta^{(K)}$
Result: $z \in \mathbb{R}^d$
initialize: $z^{(0)} = 0$;
for $k \in \{1, \ldots, K\}$ do
    $g^{(k)} = \nabla_z F(z^{(k-1)})$;
    $z^{(k)} = z^{(k-1)} - \eta^{(k)} g^{(k)}$;
end
return $z^{(K)}$;
Algorithm 1: GradientDescent
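The GradientDescent pseudocode translates almost line for line into Python. This is a sketch under stated assumptions: the caller supplies the gradient as a function `grad`, and the quadratic `grad_F`, the constant step size of 0.1, and the 200 iterations are hypothetical choices for the demonstration.

```python
def gradient_descent(grad, d, step_sizes):
    """Run gradient descent from z = 0 in R^d.

    `grad` maps a point z to the gradient of F at z; `step_sizes` holds
    one step size per iteration, so K = len(step_sizes).
    """
    z = [0.0] * d                       # initialize: z^(0) = 0
    for eta in step_sizes:              # for k in {1, ..., K}
        g = grad(z)                     # g^(k) = gradient at z^(k-1)
        z = [zi - eta * gi for zi, gi in zip(z, g)]  # z^(k) = z^(k-1) - eta^(k) g^(k)
    return z                            # return z^(K)


def grad_F(z):
    """Gradient of the hypothetical F(z) = (z[0] - 1)^2 + 2 (z[1] + 3)^2."""
    return [2.0 * (z[0] - 1.0), 4.0 * (z[1] + 3.0)]


z_final = gradient_descent(grad_F, d=2, step_sizes=[0.1] * 200)
```

For this particular $F$ the minimizer is $(1, -3)$, and with a step size of 0.1 the distance to it shrinks geometrically in each coordinate. A step size that is too large for a given function can make the iterates diverge instead.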

21 Gradient Descent
