Support Vector Machines: Training with Stochastic Gradient Descent
- Everett Harrell
1 Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning, Spring 2018. The slides are mainly from Vivek Srikumar.
2 Support vector machines
- Training by maximizing margin
- The SVM objective
- Solving the SVM optimization problem
- Support vectors, duals and kernels
3 SVM objective function

    min_{w,b} ½ wᵀw + C Σ_i max(0, 1 − y_i (wᵀx_i + b))

Regularization term (½ wᵀw): maximizes the margin. It imposes a preference over the hypothesis space and pushes for better generalization. It can be replaced with other regularization terms which impose other preferences.
Empirical loss (the hinge loss): penalizes weight vectors that make mistakes. It can be replaced with other loss functions which impose other preferences.
C: a hyper-parameter that controls the tradeoff between a large margin and a small hinge loss.
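In code, the whole objective is a short vectorized expression. A minimal NumPy sketch (function and variable names are my own, not from the slides):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """Regularized hinge loss: 0.5 * w.w + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    margins = y * (X @ w + b)               # y_i (w^T x_i + b) for every example
    hinge = np.maximum(0.0, 1.0 - margins)  # per-example hinge loss
    return 0.5 * w @ w + C * hinge.sum()

# Sanity check: a separating w with margin >= 1 incurs zero hinge loss.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(svm_objective(w, 0.0, X, y, C=1.0))   # 0.5: only the regularizer remains
```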
4 Outline: Training SVM by optimization
1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
6 Solving the SVM optimization problem

    min_{w,b} ½ wᵀw + C Σ_i max(0, 1 − y_i (wᵀx_i + b))

This function is convex in w, b. For convenience, use simplified notation that folds the bias into the weight vector: write w_0 for the original weights, then set w ← [w_0; b] and x_i ← [x_i; 1]:

    min_w ½ w_0ᵀw_0 + C Σ_i max(0, 1 − y_i wᵀx_i)
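Folding the bias in is just appending a constant-1 feature to every example. A quick sketch confirming the scores agree (variable names are my own):

```python
import numpy as np

# Fold the bias into the weights: append a constant-1 feature to every example.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # x_i -> [x_i; 1]

w0, b = np.array([0.5, -0.5]), 0.25
w_aug = np.append(w0, b)                           # w -> [w0; b]

# The scores agree: w_aug . [x_i; 1] == w0 . x_i + b for every example.
print(X_aug @ w_aug)
print(X @ w0 + b)
```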
7 Recall: Convex functions
A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1], we have

    f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v)

From a geometric perspective, every tangent plane lies below the function:

    f(x) ≥ f(u) + ∇f(u)ᵀ(x − u)
9 Convex functions
Linear functions are convex. The max of convex functions is convex.
Some ways to show that a function is convex:
1. Using the definition of convexity
2. Showing that the second derivative is nonnegative (for one-dimensional functions)
3. Showing that the Hessian (the matrix of second derivatives) is positive semi-definite (for vector functions)
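Method 1, the definition, can be spot-checked numerically. A small sketch (my own helper, not from the slides) that tests the one-dimensional hinge loss at random triples, and shows sin failing at a hand-picked triple:

```python
import numpy as np

rng = np.random.default_rng(0)

def convexity_holds(f, u, v, t):
    """Check the definition at one triple: f(t*u + (1-t)*v) <= t*f(u) + (1-t)*f(v)."""
    return f(t * u + (1 - t) * v) <= t * f(u) + (1 - t) * f(v) + 1e-12

hinge = lambda z: max(0.0, 1.0 - z)     # one-dimensional hinge loss: convex

# No violation should ever appear for a convex function.
ok = all(convexity_holds(hinge, rng.normal(), rng.normal(), rng.uniform())
         for _ in range(10_000))
print(ok)                                         # True

# sin(x) fails the definition: sin(pi/2) = 1 > 0.5*sin(0) + 0.5*sin(pi).
print(convexity_holds(np.sin, 0.0, np.pi, 0.5))   # False
```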
10 Not all functions are convex
Some functions are concave, and some are neither. For concave functions the inequality is reversed:

    f(λu + (1 − λ)v) ≥ λ f(u) + (1 − λ) f(v)
11 Convex functions are convenient
A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1], we have

    f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v)

In general, a necessary condition for x to be a minimum of the function f is ∇f(x) = 0. For convex functions, this is both necessary and sufficient.
12 Solving the SVM optimization problem

    min_w ½ w_0ᵀw_0 + C Σ_i max(0, 1 − y_i wᵀx_i)

This function is convex in w. This is a quadratic optimization problem because the objective is quadratic. Older methods used techniques from quadratic programming and were very slow. With no constraints we can use gradient descent, but naively it is still very slow!
13 Gradient descent
We are trying to minimize

    J(w) = ½ w_0ᵀw_0 + C Σ_i max(0, 1 − y_i wᵀx_i)

General strategy for minimizing a function J(w):
- Start with an initial guess for w, say w^0
- Iterate till convergence:
  - Compute the gradient of J at w^t
  - Update w^t to get w^{t+1} by taking a step in the opposite direction of the gradient
Intuition: the gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.
17 Gradient descent for SVM
We are trying to minimize

    J(w) = ½ w_0ᵀw_0 + C Σ_i max(0, 1 − y_i wᵀx_i)

1. Initialize w^0
2. For t = 0, 1, 2, ...:
   1. Compute the gradient of J(w) at w^t. Call it ∇J(w^t)
   2. Update: w^{t+1} ← w^t − r ∇J(w^t)
r: called the learning rate.
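The full-batch loop can be sketched directly (a toy implementation with hypothetical names; at the hinge's kink the code uses the sub-gradient that later slides derive):

```python
import numpy as np

def svm_gd(X, y, C=1.0, lr=0.01, steps=500):
    """Full-batch (sub)gradient descent on J(w) = 0.5 w.w + C * sum_i hinge_i."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1                      # examples with nonzero hinge loss
        grad = w - C * (y[active, None] * X[active]).sum(axis=0)
        w -= lr * grad                            # step opposite the gradient
    return w

# Toy separable data: the learned w should classify both points correctly.
X = np.array([[2.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, -1.0])
w = svm_gd(X, y)
print(np.sign(X @ w))    # [ 1. -1.]
```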
18 Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
19 Gradient descent for SVM
We are trying to minimize

    J(w) = ½ w_0ᵀw_0 + C Σ_i max(0, 1 − y_i wᵀx_i)

1. Initialize w^0
2. For t = 0, 1, 2, ...:
   1. Compute the gradient of J(w) at w^t. Call it ∇J(w^t)
   2. Update: w^{t+1} ← w^t − r ∇J(w^t)   (r: the learning rate)
The gradient of the SVM objective requires summing over the entire training set: slow, does not really scale.
20 Stochastic gradient descent for SVM
We are trying to minimize

    J(w) = ½ w_0ᵀw_0 + C Σ_i max(0, 1 − y_i wᵀx_i)

Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w^0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   1. Pick a random example (x_i, y_i) from the training set S
   2. Treat (x_i, y_i) as a full dataset and take the derivative of the SVM objective at the current w^{t−1} to be ∇J_t(w^{t−1})
   3. Update: w^t ← w^{t−1} − γ_t ∇J_t(w^{t−1})
3. Return final w
23 Stochastic gradient descent for SVM
Repeating a single example (x_i, y_i) N times to make a full dataset gives the per-example objective

    J_t(w) = ½ w_0ᵀw_0 + C N max(0, 1 − y_i wᵀx_i)

where N is the number of training examples.
Update: w^t ← w^{t−1} − γ_t ∇J_t(w^{t−1})
26 Stochastic gradient descent for SVM
Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w^0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   1. Pick a random example (x_i, y_i) from the training set S
   2. Repeat (x_i, y_i) to make a full dataset and take the derivative of the SVM objective at the current w^{t−1} to be ∇J_t(w^{t−1})
   3. Update: w^t ← w^{t−1} − γ_t ∇J_t(w^{t−1})
3. Return final w
But what is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)
This algorithm is guaranteed to converge to the minimum of J if γ_t is small enough.
27 Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
28-45 Gradient Descent vs SGD (figures: a gradient descent trajectory, followed by stochastic gradient descent trajectories taking many noisier steps toward the minimum)
46 Gradient Descent vs SGD
Stochastic gradient descent makes many more updates than gradient descent, but each individual update is less computationally expensive.
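A rough experiment makes the tradeoff concrete (toy data, my own names, and a fixed learning rate, so a sketch rather than the slides' exact algorithm): 2000 SGD updates touch 2000 examples in total, where 2000 full-gradient updates would touch 2000·N.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linearly separable data: y_i = sign(w_true . x_i).
N, d = 1000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)

w = np.zeros(d)
lr = 0.001                               # fixed rate, for simplicity
examples_touched = 0
for _ in range(2000):                    # 2000 SGD updates ~ 2 passes over the data
    i = rng.integers(N)
    if y[i] * (X[i] @ w) < 1:            # hinge active: shrink and push (C = 1)
        w = (1 - lr) * w + lr * N * y[i] * X[i]
    else:                                # hinge inactive: shrink only
        w = (1 - lr) * w
    examples_touched += 1

print(examples_touched)                  # 2000; full GD would have touched 2000 * N
print(np.mean(np.sign(X @ w) == y))      # training accuracy of the learned w
```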
47 Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
49 Hinge loss is not differentiable!
What is the derivative of the hinge loss with respect to w?

    J_t(w) = ½ w_0ᵀw_0 + C N max(0, 1 − y_i wᵀx_i)
50 Detour: Sub-gradients
A generalization of gradients to non-differentiable functions. (Remember that every tangent lies below the function for convex functions.)
Informally, a sub-tangent at a point is any line that lies below the function at that point. A sub-gradient is the slope of such a line.
51 Sub-gradients
Formally, g is a sub-gradient of f at x if f(z) ≥ f(x) + gᵀ(z − x) for all z.
If f is differentiable at x_1, the tangent at that point gives the unique sub-gradient: g_1 is the gradient at x_1. At a kink such as x_2 there are many: g_2 and g_3 are both sub-gradients at x_2. [Example from Boyd]
55 Sub-gradient of the SVM objective

    J_t(w) = ½ w_0ᵀw_0 + C N max(0, 1 − y_i wᵀx_i)

General strategy: first resolve the max, then compute the gradient for each case:

    ∇J_t = [w_0; 0]                  if max(0, 1 − y_i wᵀx_i) = 0
    ∇J_t = [w_0; 0] − C N y_i x_i    otherwise
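The case split translates directly to code. A sketch (my own function name; `w` and `x` are the augmented vectors [w_0; b] and [x_i; 1], so the bias coordinate of the regularizer's gradient is zeroed):

```python
import numpy as np

def subgrad_Jt(w, x_aug, y_i, C, N):
    """Sub-gradient of J_t(w) = 0.5*w0.w0 + C*N*max(0, 1 - y_i w.x_i).

    w and x_aug are augmented: w = [w0; b], x_aug = [x_i; 1].
    """
    reg = w.copy()
    reg[-1] = 0.0                        # [w0; 0]: the bias is not regularized
    if y_i * (w @ x_aug) >= 1:           # hinge inactive: max(...) = 0
        return reg
    return reg - C * N * y_i * x_aug     # hinge active

w = np.array([1.0, -1.0, 0.5])           # [w0; b]
x = np.array([2.0, 0.0, 1.0])            # [x_i; 1]
print(subgrad_Jt(w, x, +1.0, C=1.0, N=4))   # hinge inactive: [ 1. -1.  0.]
print(subgrad_Jt(w, x, -1.0, C=1.0, N=4))   # hinge active:   [ 9. -1.  4.]
```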
56 Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
57 Stochastic sub-gradient descent for SVM
Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w^0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   For each training example (x_i, y_i) ∈ S:
      If y_i wᵀx_i ≤ 1:  w ← (1 − γ_t) w + γ_t C y_i x_i
      else:              w ← (1 − γ_t) w
3. Return w
Recall:
    ∇J_t = [w_0; 0]                  if max(0, 1 − y_i wᵀx_i) = 0
    ∇J_t = [w_0; 0] − C N y_i x_i    otherwise
62 Stochastic sub-gradient descent for SVM
Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w^0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      If y_i wᵀx_i ≤ 1:  w ← (1 − γ_t) [w_0; 0] + γ_t C N y_i x_i
      else:              w_0 ← (1 − γ_t) w_0
3. Return w
γ_t: learning rate, many tweaks possible.
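Putting the pieces together, a compact sketch of the full algorithm (the names, the γ_0/(1+t) schedule, and the toy data are my own choices, not from the slides):

```python
import numpy as np

def train_svm_sgd(X, y, C=1.0, epochs=50, gamma0=0.1, seed=0):
    """Stochastic sub-gradient descent for the SVM.

    X is augmented with a constant-1 column so w = [w0; b]; the sub-gradient step
    shrinks only the w0 part and pushes on margin violations.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        order = rng.permutation(N)              # shuffle each epoch
        for i in order:
            gamma = gamma0 / (1 + t)            # one simple decaying schedule
            reg = w.copy()
            reg[-1] = 0.0                       # [w0; 0]
            if y[i] * (w @ X[i]) <= 1:          # margin violation: shrink and push
                w = w - gamma * reg + gamma * C * N * y[i] * X[i]
            else:                               # no violation: shrink w0 only
                w = w - gamma * reg
            t += 1
    return w

# Toy separable data (with the constant-1 bias feature appended).
X = np.array([[2.0, 1.0, 1.0], [1.5, 2.0, 1.0], [-2.0, -1.0, 1.0], [-1.0, -2.5, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_svm_sgd(X, y)
print(np.sign(X @ w))    # should match y on this easy data
```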
63 Convergence and learning rates
With enough iterations, SGD will converge in expectation, provided the step sizes are square-summable but not summable:
- Step sizes γ_t are positive
- The sum of squares of the step sizes over t = 1 to ∞ is finite
- The sum of the step sizes over t = 1 to ∞ is infinite
Some examples: γ_t = γ_0 / (1 + γ_0 t / C) or γ_t = γ_0 / (1 + t).
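A quick numeric check of the square-summable-but-not-summable condition for γ_t = γ_0/(1 + t), one of the example schedules (partial sums up to T = 10⁶; the two limits are the harmonic series and γ_0²(π²/6 − 1)):

```python
import numpy as np

gamma0 = 1.0
t = np.arange(1, 1_000_001)
g = gamma0 / (1 + t)

# Sum of step sizes grows like log T (diverges); sum of squares converges.
print(g.sum())           # ~13.4 here, and it keeps growing as T increases
print((g ** 2).sum())    # ~0.645, approaching the finite limit pi^2/6 - 1
```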
64 Convergence and learning rates
Number of iterations to get to accuracy within ε, for strongly convex functions, N examples, d dimensions:
- Gradient descent: O(N d ln(1/ε))
- Stochastic gradient descent: O(d/ε)
There are more subtleties involved, but SGD is generally preferable when the data size is huge.
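Plugging in numbers makes the gap vivid (a sketch; the constants hidden in the O(·) are dropped, so these are orders of magnitude only):

```python
import numpy as np

# Rough total work to reach accuracy eps (constants dropped):
#   gradient descent:            O(N d ln(1/eps))
#   stochastic gradient descent: O(d / eps)
N, d = 1_000_000, 100

for eps in (1e-2, 1e-4):
    gd_work = N * d * np.log(1 / eps)
    sgd_work = d / eps
    print(f"eps={eps:g}: GD ~{gd_work:.2e}, SGD ~{sgd_work:.2e}")
```

For huge N and moderate accuracy, SGD's bound is orders of magnitude smaller; for tiny ε the comparison tilts back toward GD.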
65 Outline: Training SVM by optimization
✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron
66 Stochastic sub-gradient descent for SVM
Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}:
1. Initialize w^0 = 0 ∈ ℝⁿ
2. For epoch = 1 ... T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      If y_i wᵀx_i ≤ 1:  w ← (1 − γ_t) [w_0; 0] + γ_t C N y_i x_i
      else:              w_0 ← (1 − γ_t) w_0
3. Return w
Compare with the Perceptron update: if y_i wᵀx_i ≤ 0, update w ← w + r y_i x_i.
67 Perceptron vs. SVM
Perceptron: stochastic sub-gradient descent for a different loss, with no regularization.
SVM: optimizes the hinge loss, with regularization.
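The two updates side by side in code (a sketch with my own names; γ is fixed, C·N is folded into one constant `CN`, and the slides' separate bias handling is omitted for brevity):

```python
import numpy as np

def perceptron_update(w, x, y, r=1.0):
    """Mistake-driven: update only when y * w.x <= 0; no regularization."""
    if y * (w @ x) <= 0:
        return w + r * y * x
    return w

def svm_sgd_update(w, x, y, gamma=0.1, CN=4.0):
    """Margin-driven: update whenever y * w.x <= 1; always shrink w (regularization)."""
    if y * (w @ x) <= 1:
        return (1 - gamma) * w + gamma * CN * y * x
    return (1 - gamma) * w

w = np.array([0.5, 0.5])
x = np.array([1.0, 0.0])
print(perceptron_update(w, x, +1.0))   # margin 0.5 > 0: no change, [0.5 0.5]
print(svm_sgd_update(w, x, +1.0))      # margin 0.5 <= 1: shrink + push, ~[0.85 0.45]
```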
68 SVM summary from an optimization perspective
- Minimize the regularized hinge loss
- Solve using stochastic gradient descent: very fast, and each update's run time does not depend on the number of examples
- Compare with the Perceptron algorithm: a similar framework with different objectives! The Perceptron does not maximize margin width, though Perceptron variants can force a margin
- Other successful optimization algorithms exist, e.g. dual coordinate descent, implemented in liblinear
Questions?
More informationEE/AA 578 Univ. of Washington, Fall Homework 8
EE/AA 578 Univ. of Washington, Fall 2016 Homework 8 1. Multi-label SVM. The basic Support Vector Machine (SVM) described in the lecture (and textbook) is used for classification of data with two labels.
More informationGlobal convergence rate analysis of unconstrained optimization methods based on probabilistic models
Math. Program., Ser. A DOI 10.1007/s10107-017-1137-4 FULL LENGTH PAPER Global convergence rate analysis of unconstrained optimization methods based on probabilistic models C. Cartis 1 K. Scheinberg 2 Received:
More information6.231 DYNAMIC PROGRAMMING LECTURE 10 LECTURE OUTLINE
6.231 DYNAMIC PROGRAMMING LECTURE 10 LECTURE OUTLINE Rollout algorithms Cost improvement property Discrete deterministic problems Approximations of rollout algorithms Discretization of continuous time
More informationName Date Student id #:
Math1090 Final Exam Spring, 2016 Instructor: Name Date Student id #: Instructions: Please show all of your work as partial credit will be given where appropriate, and there may be no credit given for problems
More informationOptimal Portfolio Selection Under the Estimation Risk in Mean Return
Optimal Portfolio Selection Under the Estimation Risk in Mean Return by Lei Zhu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics
More informationBarapatre Omprakash et.al; International Journal of Advance Research, Ideas and Innovations in Technology
ISSN: 2454-132X Impact factor: 4.295 (Volume 4, Issue 2) Available online at: www.ijariit.com Stock Price Prediction using Artificial Neural Network Omprakash Barapatre omprakashbarapatre@bitraipur.ac.in
More information3/1/2016. Intermediate Microeconomics W3211. Lecture 4: Solving the Consumer s Problem. The Story So Far. Today s Aims. Solving the Consumer s Problem
1 Intermediate Microeconomics W3211 Lecture 4: Introduction Columbia University, Spring 2016 Mark Dean: mark.dean@columbia.edu 2 The Story So Far. 3 Today s Aims 4 We have now (exhaustively) described
More informationCOMP417 Introduction to Robotics and Intelligent Systems. Reinforcement Learning - 2
COMP417 Introduction to Robotics and Intelligent Systems Reinforcement Learning - 2 Speaker: Sandeep Manjanna Acklowledgement: These slides use material from Pieter Abbeel s, Dan Klein s and John Schulman
More informationLecture outline W.B.Powell 1
Lecture outline What is a policy? Policy function approximations (PFAs) Cost function approximations (CFAs) alue function approximations (FAs) Lookahead policies Finding good policies Optimizing continuous
More informationProgressive Hedging for Multi-stage Stochastic Optimization Problems
Progressive Hedging for Multi-stage Stochastic Optimization Problems David L. Woodruff Jean-Paul Watson Graduate School of Management University of California, Davis Davis, CA 95616, USA dlwoodruff@ucdavis.edu
More informationPart 3: Trust-region methods for unconstrained optimization. Nick Gould (RAL)
Part 3: Trust-region methods for unconstrained optimization Nick Gould (RAL) minimize x IR n f(x) MSc course on nonlinear optimization UNCONSTRAINED MINIMIZATION minimize x IR n f(x) where the objective
More information2D5362 Machine Learning
2D5362 Machine Learning Reinforcement Learning MIT GALib Available at http://lancet.mit.edu/ga/ download galib245.tar.gz gunzip galib245.tar.gz tar xvf galib245.tar cd galib245 make or access my files
More informationInteger Programming Models
Integer Programming Models Fabio Furini December 10, 2014 Integer Programming Models 1 Outline 1 Combinatorial Auctions 2 The Lockbox Problem 3 Constructing an Index Fund Integer Programming Models 2 Integer
More informationCPS 270: Artificial Intelligence Markov decision processes, POMDPs
CPS 270: Artificial Intelligence http://www.cs.duke.edu/courses/fall08/cps270/ Markov decision processes, POMDPs Instructor: Vincent Conitzer Warmup: a Markov process with rewards We derive some reward
More informationComparison of Logit Models to Machine Learning Algorithms for Modeling Individual Daily Activity Patterns
Comparison of Logit Models to Machine Learning Algorithms for Modeling Individual Daily Activity Patterns Daniel Fay, Peter Vovsha, Gaurav Vyas (WSP USA) 1 Logit vs. Machine Learning Models Logit Models:
More informationPOMDPs: Partially Observable Markov Decision Processes Advanced AI
POMDPs: Partially Observable Markov Decision Processes Advanced AI Wolfram Burgard Types of Planning Problems Classical Planning State observable Action Model Deterministic, accurate MDPs observable stochastic
More informationALGORITHMIC TRADING STRATEGIES IN PYTHON
7-Course Bundle In ALGORITHMIC TRADING STRATEGIES IN PYTHON Learn to use 15+ trading strategies including Statistical Arbitrage, Machine Learning, Quantitative techniques, Forex valuation methods, Options
More informationAccelerated Option Pricing Multiple Scenarios
Accelerated Option Pricing in Multiple Scenarios 04.07.2008 Stefan Dirnstorfer (stefan@thetaris.com) Andreas J. Grau (grau@thetaris.com) 1 Abstract This paper covers a massive acceleration of Monte-Carlo
More informationThe Optimization Process: An example of portfolio optimization
ISyE 6669: Deterministic Optimization The Optimization Process: An example of portfolio optimization Shabbir Ahmed Fall 2002 1 Introduction Optimization can be roughly defined as a quantitative approach
More informationFinancial derivatives exam Winter term 2014/2015
Financial derivatives exam Winter term 2014/2015 Problem 1: [max. 13 points] Determine whether the following assertions are true or false. Write your answers, without explanations. Grading: correct answer
More informationThe exam is closed book, closed calculator, and closed notes except your three crib sheets.
CS 188 Spring 2016 Introduction to Artificial Intelligence Final V2 You have approximately 2 hours and 50 minutes. The exam is closed book, closed calculator, and closed notes except your three crib sheets.
More informationPortfolio Management and Optimal Execution via Convex Optimization
Portfolio Management and Optimal Execution via Convex Optimization Enzo Busseti Stanford University April 9th, 2018 Problems portfolio management choose trades with optimization minimize risk, maximize
More informationStochastic Dual Dynamic Programming Algorithm for Multistage Stochastic Programming
Stochastic Dual Dynamic Programg Algorithm for Multistage Stochastic Programg Final presentation ISyE 8813 Fall 2011 Guido Lagos Wajdi Tekaya Georgia Institute of Technology November 30, 2011 Multistage
More informationCS 331: Artificial Intelligence Game Theory I. Prisoner s Dilemma
CS 331: Artificial Intelligence Game Theory I 1 Prisoner s Dilemma You and your partner have both been caught red handed near the scene of a burglary. Both of you have been brought to the police station,
More informationProblem 1: Random variables, common distributions and the monopoly price
Problem 1: Random variables, common distributions and the monopoly price In this problem, we will revise some basic concepts in probability, and use these to better understand the monopoly price (alternatively
More informationPricing Kernel. v,x = p,y = p,ax, so p is a stochastic discount factor. One refers to p as the pricing kernel.
Payoff Space The set of possible payoffs is the range R(A). This payoff space is a subspace of the state space and is a Euclidean space in its own right. 1 Pricing Kernel By the law of one price, two portfolios
More informationArtificially Intelligent Forecasting of Stock Market Indexes
Artificially Intelligent Forecasting of Stock Market Indexes Loyola Marymount University Math 560 Final Paper 05-01 - 2018 Daniel McGrath Advisor: Dr. Benjamin Fitzpatrick Contents I. Introduction II.
More informationDepartment of Economics The Ohio State University Final Exam Answers Econ 8712
Department of Economics The Ohio State University Final Exam Answers Econ 8712 Prof. Peck Fall 2015 1. (5 points) The following economy has two consumers, two firms, and two goods. Good 2 is leisure/labor.
More informationFinal Examination Re - Calculus I 21 December 2015
. (5 points) Given the graph of f below, determine each of the following. Use, or does not exist where appropriate. y (a) (b) x 3 x 2 + (c) x 2 (d) x 2 (e) f(2) = (f) x (g) x (h) f (3) = 3 2 6 5 4 3 2
More informationk-layer neural networks: High capacity scoring functions + tips on how to train them
k-layer neural networks: High capacity scoring functions + tips on how to train them A new class of scoring functions Linear scoring function s = W x + b 2-layer Neural Network s 1 = W 1 x + b 1 h = max(0,
More informationMulti-period mean variance asset allocation: Is it bad to win the lottery?
Multi-period mean variance asset allocation: Is it bad to win the lottery? Peter Forsyth 1 D.M. Dang 1 1 Cheriton School of Computer Science University of Waterloo Guangzhou, July 28, 2014 1 / 29 The Basic
More informationBudget Constrained Choice with Two Commodities
1 Budget Constrained Choice with Two Commodities Joseph Tao-yi Wang 2013/9/25 (Lecture 5, Micro Theory I) The Consumer Problem 2 We have some powerful tools: Constrained Maximization (Shadow Prices) Envelope
More informationPredictive Model Learning of Stochastic Simulations. John Hegstrom, FSA, MAAA
Predictive Model Learning of Stochastic Simulations John Hegstrom, FSA, MAAA Table of Contents Executive Summary... 3 Choice of Predictive Modeling Techniques... 4 Neural Network Basics... 4 Financial
More information1 Asset Pricing: Bonds vs Stocks
Asset Pricing: Bonds vs Stocks The historical data on financial asset returns show that one dollar invested in the Dow- Jones yields 6 times more than one dollar invested in U.S. Treasury bonds. The return
More informationCS 7180: Behavioral Modeling and Decision- making in AI
CS 7180: Behavioral Modeling and Decision- making in AI Algorithmic Game Theory Prof. Amy Sliva November 30, 2012 Prisoner s dilemma Two criminals are arrested, and each offered the same deal: If you defect
More informationJournal of Computational and Applied Mathematics. The mean-absolute deviation portfolio selection problem with interval-valued returns
Journal of Computational and Applied Mathematics 235 (2011) 4149 4157 Contents lists available at ScienceDirect Journal of Computational and Applied Mathematics journal homepage: www.elsevier.com/locate/cam
More informationOptimizing the Omega Ratio using Linear Programming
Optimizing the Omega Ratio using Linear Programming Michalis Kapsos, Steve Zymler, Nicos Christofides and Berç Rustem October, 2011 Abstract The Omega Ratio is a recent performance measure. It captures
More informationReasoning with Uncertainty
Reasoning with Uncertainty Markov Decision Models Manfred Huber 2015 1 Markov Decision Process Models Markov models represent the behavior of a random process, including its internal state and the externally
More informationJournal of Internet Banking and Commerce
Journal of Internet Banking and Commerce An open access Internet journal (http://www.icommercecentral.com) Journal of Internet Banking and Commerce, December 2017, vol. 22, no. 3 STOCK PRICE PREDICTION
More informationEfficient Portfolio and Introduction to Capital Market Line Benninga Chapter 9
Efficient Portfolio and Introduction to Capital Market Line Benninga Chapter 9 Optimal Investment with Risky Assets There are N risky assets, named 1, 2,, N, but no risk-free asset. With fixed total dollar
More informationLecture 2: Fundamentals of meanvariance
Lecture 2: Fundamentals of meanvariance analysis Prof. Massimo Guidolin Portfolio Management Second Term 2018 Outline and objectives Mean-variance and efficient frontiers: logical meaning o Guidolin-Pedio,
More informationINVERSE REWARD DESIGN
INVERSE REWARD DESIGN Dylan Hadfield-Menell, Smith Milli, Pieter Abbeel, Stuart Russell, Anca Dragan University of California, Berkeley Slides by Anthony Chen Inverse Reinforcement Learning (Review) Inverse
More informationMaking Complex Decisions
Ch. 17 p.1/29 Making Complex Decisions Chapter 17 Ch. 17 p.2/29 Outline Sequential decision problems Value iteration algorithm Policy iteration algorithm Ch. 17 p.3/29 A simple environment 3 +1 p=0.8 2
More information