k-layer neural networks: High capacity scoring functions + tips on how to train them


A new class of scoring functions

Linear scoring function: $s = Wx + b$.

2-layer neural network:
$s_1 = W_1 x + b_1$
$h = \max(0, s_1)$
$s = W_2 h + b_2$

[Figure: network diagrams contrasting the linear scoring function (before) with the 2-layer network (now).]

Not restricted to two layers

2-layer neural network:
$s_1 = W_1 x + b_1$
$h = \max(0, s_1)$
$s = W_2 h + b_2$

3-layer neural network:
$s_1 = W_1 x + b_1$
$h_1 = \max(0, s_1)$
$s_2 = W_2 h_1 + b_2$
$h_2 = \max(0, s_2)$
$s = W_3 h_2 + b_3$

[Figure: network diagrams of the 2-layer and 3-layer networks.]

Some terminology

3-layer neural network:
$s_1 = W_1 x + b_1$ ($W_1$ is $m_1 \times d$)
1st hidden layer activations: $h_1 = \max(0, s_1)$ (apply non-linearity via the activation function)
$s_2 = W_2 h_1 + b_2$ ($W_2$ is $m_2 \times m_1$)
2nd hidden layer activations: $h_2 = \max(0, s_2)$ (apply non-linearity via the activation function)
Output responses: $s = W_3 h_2 + b_3$ ($W_3$ is $c \times m_2$)

[Figure: diagram of the 3-layer network.]

Sometimes referred to as a 2-hidden-layer neural network.
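
To make the wiring concrete, here is a minimal NumPy sketch (not from the slides) of this 3-layer scoring function for a single input vector; the function and argument names are placeholders, and the shapes follow the terminology above.

```python
import numpy as np

def three_layer_scores(x, W1, b1, W2, b2, W3, b3):
    """Score a single input x (shape (d,)); W1: m1 x d, W2: m2 x m1, W3: c x m2."""
    s1 = W1 @ x + b1         # first linear transformation
    h1 = np.maximum(0, s1)   # 1st hidden layer activations (ReLU)
    s2 = W2 @ h1 + b2        # second linear transformation
    h2 = np.maximum(0, s2)   # 2nd hidden layer activations
    return W3 @ h2 + b3      # output responses s, one score per class
```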

Computational graph of our 2-layer neural network

[Figure: computational graph $x \to W_1 x + b_1 \to s_1 \to \max(0, s_1) \to h \to W_2 h + b_2 \to s$, with parameters $W_1, b_1, W_2, b_2$ feeding the linear nodes.]

2-layer neural network with probabilistic outputs

[Figure: the same computational graph with a final $\mathrm{softmax}(s) \to p$ node appended.]

Effect of the number of hidden nodes in a 2-layer network

[Figure: results for m = 3, m = 20, m = 30, m = 100, where m is the number of nodes in the hidden layer. No regularization.]

Result depends on parameter initialization

[Figure: results for m = 3, m = 20, m = 30, m = 100, as on the previous slide but with a different random parameter initialization. m is the number of nodes in the hidden layer. No regularization.]

Effect of regularization

$J(\mathcal{D}, \lambda, \Theta) = \sum_{(x,y) \in \mathcal{D}} l(x, y, \Theta) + \lambda R(\Theta)$

[Figure: results for $\lambda = 0, .001, .01, .1$ with m = 100 nodes in the hidden layer and $L_2$ regularization.]

Do not use the size of the neural network as a regularizer. Use stronger regularization.

High-level overview of how to train the network

Mini-batch GD (or a variant). Loop:
1. Sample a batch of the training data.
2. Forward propagate it through the graph and calculate the loss/cost.
3. Backward propagate to calculate the gradients.
4. Update the parameters using the gradients.
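
As an illustrative sketch of this loop: `sample_batch`, `forward_and_loss` and `backward` are hypothetical placeholders for steps 1-3, not a real API; only the structure of the loop comes from the slide.

```python
def minibatch_gd(params, data, n_steps, batch_size=100, eta=0.01):
    for step in range(n_steps):
        batch = sample_batch(data, batch_size)         # 1. sample a batch
        loss, cache = forward_and_loss(batch, params)  # 2. forward propagate, get loss
        grads = backward(cache, params)                # 3. backward propagate, get gradients
        for name in params:                            # 4. vanilla gradient update
            params[name] -= eta * grads[name]
    return params
```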

Options for activation functions

[Figure: plots of the sigmoid, tanh and ReLU functions.]

$\sigma(x) = \frac{1}{1 + \exp(-x)}$
$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
$\mathrm{ReLU}(x) = \max(0, x)$

The activation function is applied independently to each element of the score vector.

Options for activation functions

[Figure: plots of the Leaky ReLU and ELU functions.]

$\mathrm{LeakyReLU}(x) = \max(0.1x, x)$
$\mathrm{ELU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (\exp(x) - 1) & \text{otherwise} \end{cases}$

The activation function is generally applied independently to each element of the vector.

Options for activation functions

[Figure: plots of the sigmoid, tanh and ReLU functions, as before.]

In modern networks ReLU is the most common activation function.
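
All five activation functions above are elementwise maps. A NumPy sketch (the ELU default α = 1.0 is an assumption, since the slide leaves α unspecified):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x):
    return np.maximum(0.1 * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```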

A better understanding of gradient flow during BackProp has helped the training of neural networks.

Understanding the effect of activation functions

Sigmoid

[Figure: plot of $\sigma(x)$ and its derivative $\frac{d\sigma(x)}{dx}$.]

$\sigma(x) = \frac{1}{1 + \exp(-x)}$

Problems:
1. Saturated activations kill the gradient flow.
2. Sigmoid outputs are not zero-centered.
3. exp() is expensive to compute.

tanh

[Figure: plot of $\tanh(x)$ and its derivative $\frac{d\tanh(x)}{dx}$.]

$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$

Properties:
1. Squashes numbers to the range $[-1, 1]$.
2. tanh outputs are zero-centered.
3. Saturated activations kill the gradients.

Rectified Linear Unit (ReLU)

[Figure: plot of $\max(0, x)$ and its derivative.]

$\mathrm{ReLU}(x) = \max(0, x)$

Pros:
1. Does not saturate for large positive x.
2. Very computationally efficient.
3. In practice, training of a ReLU network converges much faster than one with sigmoid/tanh activation functions.

Cons:
4. The output is not zero-centered.
5. Negative activations have zero gradients, which freezes some parameter weights.

Effect of weight initialization & activation function on gradient flow

Some activation histograms

Initialize a 10-layer network with 500 nodes at each layer. Use a tanh activation function at each layer. Initialize the weights with small random numbers. Generate random input data ($N(0, 1^2)$) with $d = 500$.

[Figure: histograms of the activations at layers 1 through 8.]

Change the initialization to bigger random numbers

Almost all neurons are completely saturated, either at -1 or +1. ⟹ Gradients will be all zero. (Remember the picture of the gradient of tanh.)

[Figure: histograms of the activations at layers 1 through 8.]

Change the initialization to Xavier initialization

Initialize a 10-layer network with 500 nodes at each layer. Use a tanh activation function at each layer. Initialize the weights with Xavier initialization: $W^{(i)}_{lm} \sim N(w;\, 0,\, 1/\sqrt{500})$. Generate random input data ($N(0, 1^2)$) with $d = 500$.

[Figure: histograms of the activations at layers 1 through 8.]
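
The three histogram experiments above can be reproduced with a short sketch. The weight scales 0.01 for the "small" case and 1.0 for the "big" case are assumptions chosen to illustrate the collapse and saturation effects; only the setup (10 tanh layers, 500 units, N(0,1) inputs, Xavier scale $1/\sqrt{500}$) comes from the slides.

```python
import numpy as np

def tanh_activations(scale=None, xavier=False, n_layers=10, m=500, n_samples=1000):
    rng = np.random.default_rng(0)
    x = rng.standard_normal((m, n_samples))           # random N(0,1) input data, d = 500
    layers = []
    for _ in range(n_layers):
        std = 1.0 / np.sqrt(m) if xavier else scale   # Xavier: std = 1/sqrt(500)
        W = std * rng.standard_normal((m, m))
        x = np.tanh(W @ x)                            # tanh activation at each layer
        layers.append(x.ravel())
    return layers                                     # histogram each entry to get the figures

small = tanh_activations(scale=0.01)    # activations collapse towards zero
big = tanh_activations(scale=1.0)       # activations saturate at -1 / +1
xavier = tanh_activations(xavier=True)  # activations keep a healthy spread
```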

Lessening the effect of initialization: Batch normalization

Batch Normalization

Want unit Gaussian activations at each layer? Just make them unit Gaussian!

Idea introduced in: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, arXiv 2015.

Consider the activations at some layer for a batch: $s_1^{(j)}, s_2^{(j)}, \ldots, s_n^{(j)}$. To make each dimension unit Gaussian, apply:

$\hat{s}_i^{(j)} = \mathrm{diag}(\sigma_1, \ldots, \sigma_m)^{-1} \left( s_i^{(j)} - \mu \right)$

where $\mu = \frac{1}{n} \sum_{i=1}^n s_i^{(j)}$ and $\sigma_p^2 = \frac{1}{n} \sum_{i=1}^n \left( s_{i,p}^{(j)} - \mu_p \right)^2$.

Batch Normalization

Usually the normalization is applied after the fully connected layer and before the non-linearity. Therefore, for a k-layer network we have:

for i = 1, ..., k-1
    for $(x^{(i-1)}, y) \in \mathcal{D}$
        Apply the ith linear transformation to the batch: $s^{(i)} = W_i x^{(i-1)} + b_i$
    end
    Compute the batch mean and variances of the ith layer:
        $\mu = \frac{1}{|\mathcal{D}|} \sum_{s^{(i)} \in \mathcal{D}} s^{(i)}$, $\quad \sigma_j^2 = \frac{1}{|\mathcal{D}|} \sum_{s^{(i)} \in \mathcal{D}} \left( s_j^{(i)} - \mu_j \right)^2$ for $j = 1, \ldots, m_i$
    for $(s^{(i)}, y) \in \mathcal{D}$
        Apply BN and the activation function:
        $\hat{s}^{(i)} = \mathrm{BatchNormalise}(s^{(i)}, \mu, \sigma_1, \ldots, \sigma_{m_i})$
        $x^{(i)} = \max(0, \hat{s}^{(i)})$
    end
end
Apply the final linear transformation: $s^{(k)} = W_k x^{(k-1)} + b_k$
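
A minimal sketch of one BN step, with each column of S holding one example's scores for a layer. The running-average momentum of 0.9 (used for the test-time estimates discussed on a later slide) and the eps inside the square root are standard-practice assumptions, not values from the slides.

```python
import numpy as np

def batch_normalise(S, run_mu, run_var, train=True, eps=1e-8):
    if train:
        mu = S.mean(axis=1, keepdims=True)     # batch mean, per dimension
        var = S.var(axis=1, keepdims=True)     # batch variance, per dimension
        run_mu = 0.9 * run_mu + 0.1 * mu       # running averages for test time
        run_var = 0.9 * run_var + 0.1 * var
    else:
        mu, var = run_mu, run_var              # fixed empirical estimates
    S_hat = (S - mu) / np.sqrt(var + eps)      # zero mean, unit variance per dimension
    return S_hat, run_mu, run_var
```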

Batch Normalization: scale & shift the range

Can also allow the network to squash and shift the range of the $\hat{s}^{(i)}$'s at each layer:

$\hat{s}^{(i)} \leftarrow \gamma^{(i)} \odot \hat{s}^{(i)} + \beta^{(i)}$

The $\gamma^{(i)}$'s and $\beta^{(i)}$'s can be learnt and added as parameters of the network. To keep things simple this added complexity is often omitted.

Benefits of Batch Normalization

- Improves gradient flow through the network.
- Reduces the strong dependence on initialization. ⟹ Deeper networks can be learnt more reliably.
- Allows higher learning rates.
- Acts as a form of regularization.

If training a deep network, you should use Batch Normalization.

Batch Normalization at test time

At test time we do not have a batch. Instead, fixed empirical means and variances of the activations at each level are used. These quantities are estimated during training (with running averages).

Babysitting the training process

Training neural networks is not completely trivial

Several hyper-parameters affect the quality of your training. These include:
- learning rate
- degree of regularization
- network architecture
- hyper-parameters controlling weight initialization

If these (potentially correlated) hyper-parameters are not appropriately set ⟹ you will not learn an effective network.

There are multiple quantities you should monitor during training. These quantities indicate:
- a reasonable hyper-parameter setting, and/or
- how the hyper-parameter settings could be changed for the better.

What to monitor during training

Monitor & visualize the loss/cost curve

The evolution of your training loss is telling you something!

[Figure: a typical training loss over time.]

Telltale sign of a bad initialization

Monitor & visualize the accuracy

The gap between the training and validation accuracy indicates the amount of over-fitting. Over-fitting ⟹ should increase regularization during training:
- increase the degree of $L_2$ regularization
- more dropout
- use more training data.

Monitor & visualize the accuracy

The gap between the training and validation accuracy indicates the amount of over-fitting. Under-fitting ⟹ model capacity is not high enough:
- increase the size of the network.

Optimization of the training hyper-parameters

Hyperparameters to adjust

- Initial learning rate.
- Learning rate decay schedule.
- Regularization strength
  - $L_2$ penalty
  - Dropout strength

Cross-validation strategy

Do a coarse-to-fine cross-validation in stages.

Stage 0: Identify the range of feasible learning rates & regularization penalties. (Usually done interactively, training only for a few updates.)
Stage 1: Broad search. The goal is to narrow the search range. Only run training for a few epochs.
Stage 2: Finer search. Increase the training times.
Stage ...: Repeat Stage 2 as necessary.

Use performance on the validation set to identify good hyper-parameter settings.

Prefer random search to grid search

"Randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid."

Random Search for Hyper-Parameter Optimization, Bergstra and Bengio, 2012.
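
A sketch of random search over the two main hyper-parameters, sampled on a log scale as is common practice. `train_and_validate` is a hypothetical helper that trains for a few epochs and returns validation accuracy, and the search ranges are illustrative assumptions.

```python
import numpy as np

def random_search(train_and_validate, n_trials=50, seed=0):
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_trials):
        eta = 10 ** rng.uniform(-5, -1)      # learning rate sampled log-uniformly
        lam = 10 ** rng.uniform(-6, -1)      # L2 penalty sampled log-uniformly
        acc = train_and_validate(eta, lam)   # hypothetical few-epoch training run
        results.append((acc, eta, lam))
    return sorted(results, reverse=True)     # best validation accuracy first
```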

Parameter Updates: Variations of Stochastic Gradient Descent

One weakness of SGD

SGD can be very slow...

Example: use SGD to find the optimum of $f(x) = \exp(0.5\, x^T \Sigma x)$. (150 iterations, $\eta = .01$; the curves show the iso-contours of $f(x)$.)

[Figure: SGD iterates oscillating across the iso-contours of f.]

SGD has trouble navigating ravines: it oscillates across the bottom of the ravine. One could increase the learning rate, but the higher the learning rate, the more likely the optimizer is to diverge. Unfortunately, ravines are common around local optima.

Solution: SGD with momentum

Introduce a momentum vector alongside the gradient vector. Let $\gamma \in [0, 1]$ and let $v$ be the momentum vector:

$v^{(t+1)} = \gamma v^{(t)} + \eta \nabla_x f(x^{(t)})$  (update vector)
$x^{(t+1)} = x^{(t)} - v^{(t+1)}$

Typically $\gamma$ is set somewhere in the range $[.9, .99]$.

[Figure: geometry of the momentum update, combining the previous step $\gamma v^{(t)}$ with the current gradient step $\eta \nabla_x f(x^{(t)})$.]
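
One momentum step as code (a sketch; `grad_f` stands for any callable returning $\nabla_x f(x)$):

```python
def momentum_step(x, v, grad_f, eta=0.01, gamma=0.9):
    v = gamma * v + eta * grad_f(x)   # accumulate the update vector
    return x - v, v                   # step against the update vector
```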

How and why momentum helps

How? Momentum helps accelerate SGD in the appropriate direction, and dampens the oscillations of default SGD. ⟹ Faster convergence.

Why? For dimensions whose gradient is constantly changing sign, the entries in the update vector are damped. For dimensions whose gradient is approximately constant, the entries in the update vector are not damped.

[Figure: trajectory with $\gamma = .9$, $\eta = .01$, 150 iterations.]

Momentum is not the complete answer

When using momentum ⟹ the optimizer can pick up too much speed in one direction ⟹ it can overshoot the local optimum.

[Figure: overshooting trajectory with $\gamma = .9$, $\eta = .03$.]

Solution: Nesterov accelerated gradient (NAG)

Look and measure ahead: use the gradient at an estimate of the parameters at the next iteration. Let $\gamma \in [0, 1]$, then

$e^{(t+1)} = x^{(t)} - \gamma v^{(t)}$  (estimate of $x^{(t+1)}$)
$v^{(t+1)} = \gamma v^{(t)} + \eta \nabla_x f(e^{(t+1)})$  (update vector)
$x^{(t+1)} = x^{(t)} - v^{(t+1)}$

Typically $\gamma$ is set to .9.

[Figure: side-by-side diagrams of the momentum update and the NAG update; NAG evaluates the gradient at the look-ahead point $e^{(t+1)}$.]
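
The corresponding NAG step (again a sketch with a hypothetical `grad_f`); note that the only change from momentum is where the gradient is evaluated:

```python
def nag_step(x, v, grad_f, eta=0.01, gamma=0.9):
    e = x - gamma * v                 # look-ahead estimate of the next iterate
    v = gamma * v + eta * grad_f(e)   # gradient measured at the look-ahead point
    return x - v, v
```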

How and why NAG helps

The anticipatory update prevents the algorithm from taking too large an update and overshooting. The algorithm has increased responsiveness to the landscape of f.

[Figure: trajectory with $\gamma = .9$, $\eta = .01$, 150 iterations.]

Note: NAG has been shown to greatly increase the ability to train RNNs: Bengio, Y., Boulanger-Lewandowski, N. & Pascanu, R., Advances in Optimizing Recurrent Networks (2012). http://arxiv.org/abs/1212.0901

Improvements to NAG

Want to adapt the updates to each individual parameter: perform larger or smaller updates depending on the landscape of the cost function. There is a family of algorithms with adaptive learning rates:
- AdaGrad
- AdaDelta
- RMSProp
- Adam

AdaGrad

For a cleaner statement, introduce some notation: $g_t = \nabla_x f(x^{(t)})$ with $g_t = (g_{t,1}, \ldots, g_{t,d})^T$.

Keep a record of the sum of the squares of the gradients w.r.t. each $x_i$ up to time t:

$G_{t,i} = \sum_{j=1}^t g_{j,i}^2$

The AdaGrad update step for each dimension is

$x_i^{(t+1)} = x_i^{(t)} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon}\, g_{t,i}$

Usually set $\epsilon = 10^{-8}$ and $\eta = .01$.

J. Duchi, E. Hazan & Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, Journal of Machine Learning Research, 2011.
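
The AdaGrad step vectorized over all dimensions (a sketch with a hypothetical `grad_f`; G starts at zero):

```python
import numpy as np

def adagrad_step(x, G, grad_f, eta=0.01, eps=1e-8):
    g = grad_f(x)
    G = G + g * g                               # accumulated squared gradients
    return x - eta / (np.sqrt(G) + eps) * g, G  # per-dimension effective learning rate
```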

AdaGrad's convergence on our toy problem

[Figure: trajectory with $\epsilon = 10^{-8}$, $\eta = .01$, 150 iterations.]

Big weakness of AdaGrad

Each $g_{t,i}^2$ is positive. ⟹ Each $G_{t,i} = \sum_{j=1}^t g_{j,i}^2$ keeps growing during training. ⟹ The effective learning rate $\eta / (\sqrt{G_{t,i}} + \epsilon)$ shrinks and eventually $\to 0$. ⟹ Updates of $x^{(t)}$ stop.

AdaDelta

Devised as an improvement to AdaGrad; it tackles AdaGrad's convergence of the learning rate to zero as t increases. AdaDelta has two central ideas:
- scale the learning rate based on the previous gradient values (like AdaGrad), but using only a recent time window;
- include an acceleration term (like momentum) by accumulating prior updates.

M. Zeiler, ADADELTA: An Adaptive Learning Rate Method, 2012. http://arxiv.org/abs/1212.5701

Technical details of AdaDelta

Compute the gradient vector $g_t$ at the current estimate $x^{(t)}$.

Update the average of the previous squared gradients (AdaGrad-like step):
$G_{t,i} = \rho\, G_{t-1,i} + (1 - \rho)\, g_{t,i}^2$

Compute the update vector:
$u_{t,i} = \frac{\sqrt{U_{t-1,i} + \epsilon}}{\sqrt{G_{t,i} + \epsilon}}\, g_{t,i}$

Compute an exponentially decaying average of the updates (momentum-like step):
$U_{t,i} = \rho\, U_{t-1,i} + (1 - \rho)\, u_{t,i}^2$

The AdaDelta update step:
$x_i^{(t+1)} = x_i^{(t)} - u_{t,i}$
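
The same details as a sketch (G and U both start at zero; the default ρ = 0.95 is an assumption following the value used in Zeiler's experiments, and `grad_f` is again a hypothetical gradient callable):

```python
import numpy as np

def adadelta_step(x, G, U, grad_f, rho=0.95, eps=1e-8):
    g = grad_f(x)
    G = rho * G + (1 - rho) * g * g               # windowed average of squared gradients
    u = np.sqrt(U + eps) / np.sqrt(G + eps) * g   # update vector
    U = rho * U + (1 - rho) * u * u               # decaying average of squared updates
    return x - u, G, U
```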

Adaptive Moment Estimation (Adam)

Computes adaptive learning rates for each parameter. How?
- Stores an exponentially decaying average of past gradients $m^{(t)}$ and of past squared gradients $v^{(t)}$.
- $m^{(t)}$ and $v^{(t)}$ are estimates of, respectively, the first and second moments of the gradient in each dimension.
- Uses the variance + mean$^2$ estimate to damp the update in dimensions with a high second moment.

D. P. Kingma & J. L. Ba, Adam: a Method for Stochastic Optimization, International Conference on Learning Representations, 2015.

Update equations for Adam

Let $g_t = \nabla_x f(x^{(t)})$:

$m^{(t+1)} = \beta_1 m^{(t)} + (1 - \beta_1)\, g_t$
$v^{(t+1)} = \beta_2 v^{(t)} + (1 - \beta_2)\, g_t \odot g_t$

Setting $m^{(0)} = v^{(0)} = 0$ ⟹ $m^{(t)}$ and $v^{(t)}$ are biased towards zero (especially during the initial time-steps). Counter these biases by setting:

$\hat{m}^{(t+1)} = \frac{m^{(t+1)}}{1 - \beta_1^t}, \qquad \hat{v}^{(t+1)} = \frac{v^{(t+1)}}{1 - \beta_2^t}$

The Adam update rule:

$x^{(t+1)} = x^{(t)} - \frac{\eta}{\sqrt{\hat{v}^{(t+1)}} + \epsilon}\, \hat{m}^{(t+1)}$

Suggested default values: $\beta_1 = .9$, $\beta_2 = .999$, $\epsilon = 10^{-8}$.
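
The Adam update as a sketch, bias corrections included. The step size η = .001 is the paper's default, since the slide does not specify one, and `grad_f` is a hypothetical gradient callable:

```python
import numpy as np

def adam_step(x, m, v, t, grad_f, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_f(x)
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias corrections, t = 1, 2, ...
    v_hat = v / (1 - beta2 ** t)
    return x - eta / (np.sqrt(v_hat) + eps) * m_hat, m, v
```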

Adam's performance on our toy problem

[Figure: trajectory with the default parameter settings, 150 iterations.]

Comparison of different algorithms on our toy problem

[Figure: trajectories of Adam, AdaGrad, NAG, Momentum and SGD, shown for ($\epsilon = 10^{-8}$, $\gamma = .9$, $\eta = .01$, 150 iterations) and for ($\epsilon = 10^{-8}$, $\gamma = .9$, $\eta = .03$, 150 iterations).]

Which optimizer to use?

If your data is sparse ⟹ you are likely to achieve the best results using one of the adaptive learning-rate methods. Moreover, with the adaptive learning-rate methods you won't need to tune the learning rate (much!).

RMSProp, AdaDelta and Adam are very similar algorithms that do well in similar circumstances. Adam slightly outperforms RMSProp towards the end of optimization, so Adam might be the best overall choice. Still, vanilla SGD (without momentum) and a simple learning rate annealing schedule may be sufficient, though the time until it finds a local minimum may be long...

Annealing the learning rate

Useful to anneal the learning rate

When training deep networks, it is usually helpful to anneal the learning rate over time. Why?
- It stops the parameter vector from bouncing around too widely.
- ⟹ It can reach into the deeper, but narrower, parts of the loss function.

But knowing when to decay the learning rate is tricky:
- Decay too slowly ⟹ you waste computations bouncing around chaotically with little improvement.
- Decay too aggressively ⟹ the system is unable to reach the best position it can.

Common approaches to learning rate decay

Step decay: after every nth epoch set $\eta = \alpha \eta$ where $\alpha \in (0, 1)$. (Alternatively, some people monitor the validation loss and reduce the learning rate when this loss stops improving.)

Exponential decay: $\eta = \eta_0 e^{-kt}$ where t is the iteration number (counted either in update steps or in epochs); $\eta_0$ and k are hyper-parameters.

1/t decay: $\eta = \frac{\eta_0}{1 + kt}$

Step decay is the most common. It is better to decay conservatively and train for longer.
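
The three schedules as plain functions of the epoch/iteration count (a sketch; the arguments mirror the hyper-parameters named above):

```python
import math

def step_decay(eta0, alpha, epoch, n):
    """Multiply the rate by alpha after every nth epoch, alpha in (0, 1)."""
    return eta0 * alpha ** (epoch // n)

def exponential_decay(eta0, k, t):
    """eta = eta0 * exp(-k t), t counted in update steps or epochs."""
    return eta0 * math.exp(-k * t)

def one_over_t_decay(eta0, k, t):
    """eta = eta0 / (1 + k t)."""
    return eta0 / (1 + k * t)
```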