k-layer neural networks: High capacity scoring functions + tips on how to train them

A new class of scoring functions
Before: linear scoring function. Input: x. Output: $s = W x + b$.
Now: 2-layer neural network. Input: x. $s_1 = W_1 x + b_1$, $h = \max(0, s_1)$. Output: $s = W_2 h + b_2$.
[Figure: network diagrams of the linear scoring function (inputs $x_1, \dots, x_d$, outputs $s_1, s_2, s_3$) and of the 2-layer network (hidden nodes $h_1, \dots, h_m$).]

Not restricted to two layers
2-layer neural network. Input: x. $s_1 = W_1 x + b_1$, $h = \max(0, s_1)$. Output: $s = W_2 h + b_2$.
3-layer neural network. Input: x. $s_1 = W_1 x + b_1$, $h_1 = \max(0, s_1)$, $s_2 = W_2 h_1 + b_2$, $h_2 = \max(0, s_2)$. Output: $s = W_3 h_2 + b_3$.
[Figure: network diagrams of the 2-layer network (hidden nodes $h_1, \dots, h_m$) and of the 3-layer network (hidden nodes $h_{1,1}, \dots, h_{1,m_1}$ and $h_{2,1}, \dots, h_{2,m_2}$).]

Some terminology
3-layer neural network:
- 1st hidden layer: $s_1 = W_1 x + b_1$ ($W_1$ is $m_1 \times d$); activations $h_1 = \max(0, s_1)$ (apply non-linearity via activation function).
- 2nd hidden layer: $s_2 = W_2 h_1 + b_2$ ($W_2$ is $m_2 \times m_1$); activations $h_2 = \max(0, s_2)$ (apply non-linearity via activation function).
- Output responses: $s = W_3 h_2 + b_3$ ($W_3$ is $c \times m_2$).
Sometimes referred to as a 2-hidden-layer neural network.
[Figure: network diagram of the 3-layer network.]

Computational graph of our 2-layer neural network
$x \to [W_1 x + b_1] \to s_1 \to [\max(0, s_1)] \to h \to [W_2 h + b_2] \to s$
Parameters entering the graph: $W_1, b_1$ (first node) and $W_2, b_2$ (last node).

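2-layer neural network with probabilistic outputs
$x \to [W_1 x + b_1] \to s_1 \to [\max(0, s_1)] \to h \to [W_2 h + b_2] \to s \to [\mathrm{softmax}(s)] \to p$
Parameters entering the graph: $W_1, b_1$ (first node) and $W_2, b_2$ (third node).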
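
The forward pass through this graph can be sketched in a few lines of NumPy. This is only an illustration: the sizes d, m, c, n, the random initialization and the function names `softmax` and `forward` are assumptions, not part of the slides.

```python
import numpy as np

def softmax(s):
    """Column-wise softmax; subtracting the max improves numerical stability."""
    e = np.exp(s - s.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def forward(X, W1, b1, W2, b2):
    """X is d x n (one column per input). Returns class probabilities (c x n)."""
    S1 = W1 @ X + b1          # s_1 = W_1 x + b_1
    H = np.maximum(0, S1)     # h = max(0, s_1)
    S = W2 @ H + b2           # s = W_2 h + b_2
    return softmax(S)         # p = softmax(s)

rng = np.random.default_rng(0)
d, m, c, n = 4, 5, 3, 6       # hypothetical sizes
W1, b1 = 0.1 * rng.standard_normal((m, d)), np.zeros((m, 1))
W2, b2 = 0.1 * rng.standard_normal((c, m)), np.zeros((c, 1))
P = forward(rng.standard_normal((d, n)), W1, b1, W2, b2)
```

Each column of `P` holds the class probabilities for one input and sums to one.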

Effect of the number of hidden nodes in a 2 layer network
[Figure: four panels, m = 3, m = 20, m = 30, m = 100.]
m is the number of nodes in the hidden layer. No regularization.

Result depends on parameter initialization
[Figure: four panels, m = 3, m = 20, m = 30, m = 100.]
m is the number of nodes in the hidden layer. No regularization. Different random parameter initialization to previous slide.

Effect of regularization
$J(D, \lambda, \Theta) = \sum_{(x,y) \in D} l(x, y, \Theta) + \lambda\, r(\Theta)$
[Figure: four panels, λ = 0, λ = .001, λ = .01, λ = .1.]
m = 100 nodes in the hidden layer. $L_2$ regularization.
Do not use the size of the neural network as a regularizer. Use stronger regularization instead.

High-level overview of how to train network
Mini-batch GD (or variant)
Loop
1. Sample a batch of the training data.
2. Forward propagate it through the graph and calculate loss/cost.
3. Backward propagate to calculate the gradients.
4. Update the parameters using the gradient.
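
The four steps of the loop can be sketched for a 2-layer network with softmax outputs and a cross-entropy loss. Everything specific here is an assumption made for illustration: the toy data (two Gaussian blobs), the sizes, the hyper-parameter values and the helper name `loss_and_grads`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two Gaussian blobs in 2-D, labels 0 and 1.
n, d, m, c = 200, 2, 20, 2
X = np.hstack([rng.standard_normal((d, n // 2)) + 2.0,
               rng.standard_normal((d, n // 2)) - 2.0])
y = np.repeat([0, 1], n // 2)

W1, b1 = 0.1 * rng.standard_normal((m, d)), np.zeros((m, 1))
W2, b2 = 0.1 * rng.standard_normal((c, m)), np.zeros((c, 1))
eta, lam, batch_size = 0.1, 1e-3, 20   # assumed hyper-parameters

def loss_and_grads(Xb, yb):
    nb = Xb.shape[1]
    # Forward pass.
    S1 = W1 @ Xb + b1
    H = np.maximum(0, S1)
    S = W2 @ H + b2
    E = np.exp(S - S.max(axis=0, keepdims=True))
    P = E / E.sum(axis=0, keepdims=True)
    data_loss = -np.log(P[yb, np.arange(nb)]).mean()
    loss = data_loss + lam * ((W1 ** 2).sum() + (W2 ** 2).sum())
    # Backward pass.
    G = P.copy()
    G[yb, np.arange(nb)] -= 1.0          # gradient w.r.t. s for softmax + cross-entropy
    G /= nb
    gW2, gb2 = G @ H.T + 2 * lam * W2, G.sum(axis=1, keepdims=True)
    G = (W2.T @ G) * (S1 > 0)            # back through max(0, s_1)
    gW1, gb1 = G @ Xb.T + 2 * lam * W1, G.sum(axis=1, keepdims=True)
    return loss, (gW1, gb1, gW2, gb2)

losses = []
for step in range(300):
    idx = rng.choice(n, size=batch_size, replace=False)   # 1. sample a batch
    loss, grads = loss_and_grads(X[:, idx], y[idx])        # 2. forward, 3. backward
    for p, g in zip((W1, b1, W2, b2), grads):              # 4. update (in place)
        p -= eta * g
    losses.append(loss)
```

Plotting `losses` against the step index gives the kind of training-loss curve worth monitoring during training.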

Options for activation functions
- Sigmoid: $\sigma(x) = \frac{1}{1 + \exp(-x)}$
- tanh: $\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
- ReLU: $\mathrm{ReLU}(x) = \max(0, x)$
[Figure: plots of $\sigma(x)$, $\tanh(x)$ and $\max(0, x)$ for $x \in [-10, 10]$.]
Activation function is applied independently to each element of the score vector.

Options for activation functions
- Leaky ReLU: $\max(0.1x, x)$
- ELU: $\mathrm{ELU}(x) = x$ if $x > 0$, and $\alpha(\exp(x) - 1)$ otherwise
[Figure: plots of $\max(0.1x, x)$ and $\mathrm{ELU}(x)$ for $x \in [-10, 10]$.]
Activation function is generally applied independently to each element of the vector.

Options for activation functions
- Sigmoid: $\sigma(x) = \frac{1}{1 + \exp(-x)}$
- tanh: $\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
- ReLU: $\mathrm{ReLU}(x) = \max(0, x)$
In modern networks ReLU is the most common activation function.
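
As a quick reference, the activation functions from the last few slides can be written element-wise in NumPy (tanh is available directly as `np.tanh`). The default `alpha=1.0` for ELU is an assumption; the slides do not give a value.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)

def elu(x, alpha=1.0):
    # np.minimum keeps exp() away from large positive inputs (avoids overflow).
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

x = np.array([-2.0, 0.0, 3.0])
outputs = {
    "sigmoid": sigmoid(x),
    "tanh": np.tanh(x),
    "relu": relu(x),
    "leaky_relu": leaky_relu(x),
    "elu": elu(x),
}
```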

Understanding the effect of activation functions
A better understanding of gradient flow during backpropagation has helped the training of neural networks.

Sigmoid
$\sigma(x) = \frac{1}{1 + \exp(-x)}$
[Figure: plots of $\sigma(x)$ and $\frac{d\sigma(x)}{dx}$ for $x \in [-10, 10]$.]
Problems
1. Saturated activations kill the gradient flow.
2. Sigmoid outputs are not zero-centered.
3. exp() is expensive to compute.

tanh
$\tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$
[Figure: plots of $\tanh(x)$ and $\frac{d\tanh(x)}{dx}$ for $x \in [-10, 10]$.]
Properties
1. Squashes numbers to the range [-1, 1].
2. Tanh outputs are zero-centered.
3. Saturated activations kill the gradients.

Rectified Linear Unit (ReLU)
$\mathrm{ReLU}(x) = \max(0, x)$
[Figure: plots of $\max(0, x)$ and $\frac{d\max(0, x)}{dx}$ for $x \in [-10, 10]$.]
Pros
1. Does not saturate for large positive x.
2. Very computationally efficient.
3. In practice training of a ReLU network converges much faster than one with sigmoid/tanh activation functions.
Cons
1. Output is not zero-centered.
2. Units with negative input have zero gradient, which freezes some of the parameter weights.

Effect of weight initialization & activation function on gradient flow

Some activation histograms
- Initialize a 10-layer network with 500 nodes at each layer.
- Use a tanh activation function at each layer.
- Initialize weights with small random numbers.
- Generate random input data ($N(0, 1^2)$) with d = 500.
[Figure: histograms of activations at each layer, Layer 1 to Layer 8.]

Change the initialization to bigger random numbers
Almost all neurons completely saturated, either -1 or +1.
⇒ Gradients will all be zero. (Remember the picture of the gradient of tanh.)
[Figure: histograms of activations at each layer, Layer 1 to Layer 8.]

Change the initialization to Xavier initialization
- Initialize a 10-layer network with 500 nodes at each layer.
- Use a tanh activation function at each layer.
- Initialize weights with Xavier initialization: each $W_{i,lm}$ is drawn from a zero-mean Gaussian with variance 1/500 (standard deviation $1/\sqrt{500}$).
- Generate random input data ($N(0, 1^2)$) with d = 500.
[Figure: histograms of activations at each layer, Layer 1 to Layer 8.]
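
The three experiments can be reproduced roughly with the sketch below, which records the standard deviation of the tanh activations at each layer instead of plotting histograms. The weight scales 0.01 and 1.0 for "small" and "bigger" random numbers, the number of samples and the function name `activation_stds` are assumptions; the slides do not state them.

```python
import numpy as np

def activation_stds(weight_std, n_layers=10, width=500, n_samples=200, seed=0):
    """Return the standard deviation of the tanh activations at each layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((width, n_samples))   # inputs drawn from N(0, 1)
    stds = []
    for _ in range(n_layers):
        W = weight_std * rng.standard_normal((width, width))
        x = np.tanh(W @ x)
        stds.append(x.std())
    return stds

small = activation_stds(0.01)                # small weights: activations shrink to 0
large = activation_stds(1.0)                 # large weights: activations saturate at -1 / +1
xavier = activation_stds(1 / np.sqrt(500))   # Xavier: variance 1/500
```

With these settings the activations collapse towards zero for the small weights, saturate for the large ones, and keep a usable spread under Xavier initialization.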

Lessening the effect of initialization: Batch normalization

Batch Normalization
Want unit Gaussian activations at each layer? Just make them unit Gaussian!
Idea introduced in: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, arXiv 2015.
Consider activations at some layer for a batch: $s_1^{(j)}, s_2^{(j)}, \dots, s_n^{(j)}$
To make each dimension unit Gaussian, apply:
$\hat{s}_i^{(j)} = \mathrm{diag}(\sigma_1, \dots, \sigma_m)^{-1} \left( s_i^{(j)} - \mu \right)$
where
$\mu = \frac{1}{n} \sum_{i=1}^{n} s_i^{(j)}, \quad \sigma_p^2 = \frac{1}{n} \sum_{i=1}^{n} \left( s_{i,p}^{(j)} - \mu_p \right)^2$

Batch Normalization
Usually apply normalization after the fully connected layer and before the non-linearity. Therefore for a k-layer network have:
- for i = 1, ..., k-1
    for $(x^{(i-1)}, y) \in D$
      Apply ith linear transformation to batch: $s^{(i)} = W_i x^{(i-1)} + b_i$
    end
    Compute batch mean and variances of ith layer:
      $\mu = \frac{1}{|D|} \sum_{s^{(i)} \in D} s^{(i)}, \quad \sigma_j^2 = \frac{1}{|D|} \sum_{s^{(i)} \in D} \left( s_j^{(i)} - \mu_j \right)^2$ for $j = 1, \dots, m_i$
    for $(s^{(i)}, y) \in D$
      Apply BN and activation function:
      $\hat{s}^{(i)} = \mathrm{BatchNormalise}(s^{(i)}, \mu, \sigma_1, \dots, \sigma_{m_i})$
      $x^{(i)} = \max(0, \hat{s}^{(i)})$
    end
  end
- Apply final linear transformation: $s^{(k)} = W_k x^{(k-1)} + b_k$

Batch Normalization: Scale & shift range
Can also allow the network to scale and shift the range of the $\hat{s}^{(i)}$'s at each layer:
$\tilde{s}^{(i)} = \gamma^{(i)} \odot \hat{s}^{(i)} + \beta^{(i)}$
where $\odot$ is the element-wise product. Can learn the $\gamma^{(i)}$'s and $\beta^{(i)}$'s by adding them as parameters of the network.
To keep things simple this added complexity is often omitted.

Benefits of Batch Normalization
- Improves gradient flow through the network.
- Reduces the strong dependence on initialization. ⇒ learn deeper networks more reliably.
- Allows higher learning rates.
- Acts as a form of regularization.
If training a deep network, you should use Batch Normalization.

Batch Normalization at Test Time
At test time there is no batch. Instead, fixed empirical means and variances of the activations at each layer are used. These quantities are estimated during training (with running averages).
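
A minimal sketch of a batch-normalization layer that covers both the training-time and the test-time behaviour described above. The class name `BatchNorm`, the running-average factor `momentum=0.9` and the small constant `eps` added to the variance for numerical safety are assumptions, not taken from the slides.

```python
import numpy as np

class BatchNorm:
    """Batch normalization for activations stored as an m x n matrix (one column per sample)."""

    def __init__(self, m, momentum=0.9, eps=1e-5):
        self.gamma = np.ones((m, 1))          # learnable scale
        self.beta = np.zeros((m, 1))          # learnable shift
        self.running_mean = np.zeros((m, 1))
        self.running_var = np.ones((m, 1))
        self.momentum, self.eps = momentum, eps

    def forward(self, S, training):
        if training:
            mu = S.mean(axis=1, keepdims=True)
            var = S.var(axis=1, keepdims=True)
            # Running averages, kept for use at test time.
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mean, self.running_var
        S_hat = (S - mu) / np.sqrt(var + self.eps)
        return self.gamma * S_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm(m=4)
for _ in range(200):
    out_train = bn.forward(3.0 + 2.0 * rng.standard_normal((4, 64)), training=True)
out_test = bn.forward(3.0 + 2.0 * rng.standard_normal((4, 1)), training=False)
```

In training mode each row of the output has zero mean and (almost) unit variance over the batch; in test mode a single sample can be normalized because the stored statistics are used.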

Babysitting the training process

Training neural networks is not completely trivial
Several hyper-parameters affect the quality of your training. These include
- learning rate
- degree of regularization
- network architecture
- hyper-parameters controlling weight initialization
If these (potentially correlated) hyper-parameters are not appropriately set ⇒ you will not learn an effective network.
There are multiple quantities you should monitor during training. These quantities indicate
- a reasonable hyper-parameter setting and/or
- how the hyper-parameter settings could be changed for the better.

What to monitor during training

Monitor & Visualize the loss/cost curve Evolution of your training loss is telling you something! Typical training loss over time

Telltale sign of a bad initialization

Monitor & visualize the accuracy
Gap between training and validation accuracy indicates the amount of over-fitting.
Over-fitting ⇒ should increase regularization during training:
- increase the degree of $L_2$ regularization
- more dropout
- use more training data.

Monitor & visualize the accuracy
Gap between training and validation accuracy indicates the amount of over-fitting.
Under-fitting ⇒ model capacity not high enough:
- increase the size of the network

Optimization of the training hyper-parameters

Hyperparameters to adjust
- Initial learning rate.
- Learning rate decay schedule.
- Regularization strength
  - $L_2$ penalty
  - Dropout strength

Cross-validation strategy
Do a coarse-to-fine cross-validation in stages.
- Stage 0: Identify the range of feasible learning rates & regularization penalties. (Usually done interactively, training only for a few updates.)
- Stage 1: Broad search. Goal is to narrow the search range. Only run training for a few epochs.
- Stage 2: Finer search. Increase training times.
- Stage ...: Repeat Stage 2 as necessary.
Use performance on the validation set to identify good hyper-parameter settings.

Prefer random search to grid search randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid Random Search for Hyper-Parameter Optimization, Bergstra and Bengio, 2012
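
One way to see the argument: with the same budget of 9 trials, a 3 x 3 grid only ever tries 3 distinct learning rates, while random search tries 9. The sketch below illustrates this; the search ranges and the choice to sample on a log scale are assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 9

# Grid search: 3 x 3 grid over (learning rate, L2 penalty), spaced on a log scale.
grid_lr, grid_lam = np.meshgrid(np.logspace(-4, -1, 3), np.logspace(-5, -2, 3))
grid = np.column_stack([grid_lr.ravel(), grid_lam.ravel()])

# Random search: same budget, each trial samples both values log-uniformly.
random_trials = np.column_stack([
    10 ** rng.uniform(-4, -1, n_trials),
    10 ** rng.uniform(-5, -2, n_trials),
])

distinct_lr_grid = len(np.unique(grid[:, 0]))
distinct_lr_random = len(np.unique(random_trials[:, 0]))
```

If only one of the two hyper-parameters really matters, the random trials explore it far more densely than the grid does.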

Parameter Updates: Variations of Stochastic Gradient Descent

One weakness of SGD
SGD can be very slow...
Example: Use SGD to find the optimum of $f(x) = \exp(.5\, x^T \Sigma x)$
[Figure: path of SGD over the iso-contours of f(x); 150 iterations, η = .01.]
SGD has trouble navigating ravines as it oscillates across the bottom of the ravine.
Could increase the learning rate, but a larger learning rate ⇒ more likely the optimizer will diverge.
Unfortunately, ravines are common around local optima.

Solution: SGD with momentum
Introduce a momentum vector as well as the gradient vector.
Let $\gamma \in [0, 1]$ and let v be the momentum vector:
$v^{(t+1)} = \gamma v^{(t)} + \eta \nabla_x f(x^{(t)})$ (update vector)
$x^{(t+1)} = x^{(t)} - v^{(t+1)}$
Typically set γ somewhere in the range [.9, .99].
[Figure: vector diagram of the momentum update, showing $x^{(t)}$, $\gamma v^{(t)}$, $\eta \nabla_x f(x^{(t)})$ and $x^{(t+1)}$.]

How and why momentum helps
How?
- Momentum helps accelerate SGD in the appropriate direction.
- Momentum dampens the oscillations of default SGD.
- ⇒ Faster convergence.
Why?
- For dimensions whose gradient keeps changing sign, the entries in the update vector are damped.
- For dimensions whose gradient is approximately constant, the entries in the update vector are not damped.
[Figure: path of SGD with momentum on the toy problem; γ = .9, η = .01, 150 iterations.]
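
The effect can be checked numerically on a small ravine-like problem. The sketch below uses an ill-conditioned quadratic $f(x) = \frac{1}{2} x^T \Sigma x$ with $\Sigma = \mathrm{diag}(1, 50)$ and exact gradients; both are assumptions chosen for illustration and not the toy problem from the slides.

```python
import numpy as np

Sigma = np.diag([1.0, 50.0])          # assumed ill-conditioned quadratic (a "ravine")
f = lambda x: 0.5 * x @ Sigma @ x
grad = lambda x: Sigma @ x

def sgd(x, eta=0.01, n_iter=150):
    for _ in range(n_iter):
        x = x - eta * grad(x)
    return x

def sgd_momentum(x, eta=0.01, gamma=0.9, n_iter=150):
    v = np.zeros_like(x)
    for _ in range(n_iter):
        v = gamma * v + eta * grad(x)   # v^(t+1) = gamma v^(t) + eta grad f(x^(t))
        x = x - v                       # x^(t+1) = x^(t) - v^(t+1)
    return x

x0 = np.array([1.0, 1.0])
f_sgd, f_mom = f(sgd(x0)), f(sgd_momentum(x0))
```

After 150 iterations with η = .01, plain gradient descent is still far from the optimum along the flat direction, while the momentum variant has essentially converged.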

Momentum is not the complete answer
When using momentum
⇒ can pick up too much speed in one direction.
⇒ can overshoot the local optimum.
[Figure: path of SGD with momentum on the toy problem; γ = .9, η = .03.]

Solution: Nesterov accelerated gradient (NAG)
Look and measure ahead. Use the gradient at an estimate of the parameters at the next iteration.
Let $\gamma \in [0, 1]$, then
$e^{(t+1)} = x^{(t)} - \gamma v^{(t)}$ (estimate of $x^{(t+1)}$)
$v^{(t+1)} = \gamma v^{(t)} + \eta \nabla_x f(e^{(t+1)})$ (update vector)
$x^{(t+1)} = x^{(t)} - v^{(t+1)}$
Typically γ is set to .9.
[Figure: vector diagrams comparing the momentum update (gradient $\eta \nabla_x f(x^{(t)})$ taken at $x^{(t)}$) with the NAG update (gradient $\eta \nabla_x f(e^{(t+1)})$ taken at $e^{(t+1)}$).]

How and why NAG helps The anticipatory update prevents the algorithm having too large updates and overshooting. Algorithm has increased responsiveness to the landscape of f. (γ =.9, η =.01, 150 iterations) Note: NAG shown to greatly increase the ability to train RNNs: Bengio, Y., Boulanger-Lewandowski, N. & Pascanu, R. Advances in Optimizing Recurrent Networks, (2012). http://arxiv.org/abs/1212.0901
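
A minimal sketch of the NAG update. The quadratic test function with $\Sigma = \mathrm{diag}(1, 50)$ and the use of exact gradients are assumptions for illustration, not the toy problem from the slides.

```python
import numpy as np

Sigma = np.diag([1.0, 50.0])          # assumed toy quadratic
f = lambda x: 0.5 * x @ Sigma @ x
grad = lambda x: Sigma @ x

def nag(x, eta=0.01, gamma=0.9, n_iter=150):
    v = np.zeros_like(x)
    for _ in range(n_iter):
        e = x - gamma * v                 # look-ahead estimate of x^(t+1)
        v = gamma * v + eta * grad(e)     # gradient measured at the estimate
        x = x - v
    return x

x0 = np.array([1.0, 1.0])
f_nag = f(nag(x0))
```

The only change relative to plain momentum is where the gradient is evaluated: at the look-ahead point `e` instead of at `x`.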

Improvements to NAG
Want to adapt the updates to each individual parameter.
Perform larger or smaller updates depending on the landscape of the cost function.
Family of algorithms with adaptive learning rates
- AdaGrad
- AdaDelta
- RMSProp
- Adam

AdaGrad
For a cleaner statement introduce some notation: $g_t = \nabla_x f(x^{(t)})$ and $g_t = (g_{t,1}, \dots, g_{t,d})^T$.
Keep a record of the sum of the squares of the gradients w.r.t. each $x_i$ up to time t:
$G_{t,i} = \sum_{j=1}^{t} g_{j,i}^2$
The AdaGrad update step for each dimension is
$x_i^{(t+1)} = x_i^{(t)} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}}\, g_{t,i}$
Usually set $\epsilon$ = 1e-8 and η = .01.
J. Duchi, E. Hazan & Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, Journal of Machine Learning Research, 2011.

AdaGrad's convergence on our toy problem
[Figure: path of AdaGrad on the toy problem; $\epsilon$ = 1e-8, η = .01, 150 iterations.]

Big weakness of AdaGrad
Each $g_{t,i}^2$ is non-negative.
⇒ Each $G_{t,i} = \sum_{j=1}^{t} g_{j,i}^2$ keeps growing during training.
⇒ the effective learning rate $\eta / \sqrt{G_{t,i} + \epsilon}$ shrinks and eventually → 0.
⇒ updates of $x^{(t)}$ stop.
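
The shrinking effective learning rate is easy to observe numerically. The sketch below runs AdaGrad on an assumed toy quadratic ($\Sigma = \mathrm{diag}(1, 50)$, exact gradients) and records the per-dimension effective learning rate at every step; placing ε inside the square root is one common convention and is an assumption here.

```python
import numpy as np

Sigma = np.diag([1.0, 50.0])          # assumed toy quadratic
grad = lambda x: Sigma @ x

def adagrad(x, eta=0.01, eps=1e-8, n_iter=150):
    G = np.zeros_like(x)              # running sum of squared gradients
    effective_lr = []
    for _ in range(n_iter):
        g = grad(x)
        G += g ** 2
        lr = eta / np.sqrt(G + eps)   # per-dimension effective learning rate
        x = x - lr * g
        effective_lr.append(lr.copy())
    return x, np.array(effective_lr)

x_final, lrs = adagrad(np.array([1.0, 1.0]))
```

Every row of `lrs` is no larger than the previous one, because `G` can only grow.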

AdaDelta
Devised as an improvement to AdaGrad. Tackles the convergence to zero of AdaGrad's learning rate as t increases.
AdaDelta's two central ideas
- scale the learning rate based on the previous gradient values (like AdaGrad) but only using a recent time window,
- include an acceleration term (like momentum) by accumulating prior updates.
M. Zeiler, ADADELTA: An Adaptive Learning Rate Method, 2012. http://arxiv.org/abs/1212.5701

Technical details of AdaDelta
1. Compute the gradient vector $g_t$ at the current estimate $x^{(t)}$.
2. Update the average of previous squared gradients (AdaGrad-like step):
   $G_{t,i} = \rho\, G_{t-1,i} + (1 - \rho)\, g_{t,i}^2$
3. Compute the update vector:
   $u_{t,i} = \frac{\sqrt{U_{t-1,i} + \epsilon}}{\sqrt{G_{t,i} + \epsilon}}\, g_{t,i}$
4. Compute the exponentially decaying average of updates (momentum-like step):
   $U_{t,i} = \rho\, U_{t-1,i} + (1 - \rho)\, u_{t,i}^2$
5. The AdaDelta update step:
   $x_i^{(t+1)} = x_i^{(t)} - u_{t,i}$
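
The steps above translate almost line by line into code. In this sketch the function name `adadelta_step`, the values `rho=0.95` and `eps=1e-6`, and the quadratic used to produce a gradient are assumptions; the slides do not give values for ρ and ε.

```python
import numpy as np

def adadelta_step(x, g, G, U, rho=0.95, eps=1e-6):
    """One AdaDelta step; returns the updated (x, G, U)."""
    G = rho * G + (1 - rho) * g ** 2                 # decaying average of squared gradients
    u = np.sqrt(U + eps) / np.sqrt(G + eps) * g      # update vector
    U = rho * U + (1 - rho) * u ** 2                 # decaying average of squared updates
    return x - u, G, U

Sigma = np.diag([1.0, 50.0])                         # assumed toy quadratic
x = np.array([1.0, 1.0])
G, U = np.zeros(2), np.zeros(2)
x1, G1, U1 = adadelta_step(x, Sigma @ x, G, U)
```

Note that no learning rate η appears: the ratio of the two running averages sets the step size in each dimension.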

Adaptive Moment Estimation (Adam)
Computes adaptive learning rates for each parameter. How?
- Stores an exponentially decaying average of past gradients $m^{(t)}$ and past squared gradients $v^{(t)}$.
- $m^{(t)}$ and $v^{(t)}$ are estimates of the first and second moments, respectively, of the gradient in each dimension.
- Uses the (variance + mean$^2$) estimate to damp the update in dimensions with a high second moment.
D. P. Kingma & J. L. Ba, Adam: a Method for Stochastic Optimization, International Conference on Learning Representations, 2015.

Update equations for Adam
Let $g_t = \nabla_x f(x^{(t)})$:
$m^{(t+1)} = \beta_1 m^{(t)} + (1 - \beta_1)\, g_t$
$v^{(t+1)} = \beta_2 v^{(t)} + (1 - \beta_2)\, g_t \odot g_t$
Set $m^{(0)} = v^{(0)} = 0$ ⇒ $m^{(t)}$ and $v^{(t)}$ are biased towards zero (especially during the initial time-steps).
Counter these biases by setting:
$\hat{m}^{(t+1)} = \frac{m^{(t+1)}}{1 - \beta_1^{t+1}}, \quad \hat{v}^{(t+1)} = \frac{v^{(t+1)}}{1 - \beta_2^{t+1}}$
The Adam update rule:
$x^{(t+1)} = x^{(t)} - \frac{\eta}{\sqrt{\hat{v}^{(t+1)}} + \epsilon}\, \hat{m}^{(t+1)}$
Suggested default values: $\beta_1 = .9$, $\beta_2 = .999$, $\epsilon = 10^{-8}$.
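
A minimal sketch of the Adam update, with the bias correction written so that the first step divides by $1 - \beta$. The quadratic test function ($\Sigma = \mathrm{diag}(1, 50)$, exact gradients) and the learning rate `eta=0.01` are assumptions for illustration.

```python
import numpy as np

Sigma = np.diag([1.0, 50.0])          # assumed toy quadratic
f = lambda x: 0.5 * x @ Sigma @ x
grad = lambda x: Sigma @ x

def adam(x, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_iter=150):
    m = np.zeros_like(x)              # first-moment estimate
    v = np.zeros_like(x)              # second-moment estimate
    for t in range(n_iter):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** (t + 1))    # bias correction
        v_hat = v / (1 - beta2 ** (t + 1))
        x = x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x

x0 = np.array([1.0, 1.0])
x_one_step = adam(x0, n_iter=1)
f_adam = f(adam(x0))
```

After the bias correction the very first step has size close to η in every dimension, regardless of the scale of the gradient.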

Adam's performance on our toy problem
[Figure: path of Adam on the toy problem; default parameter settings, 150 iterations.]

Comparison of different algorithms on our toy problem
[Figure: two panels comparing Adam, AdaGrad, NAG, Momentum and SGD. First panel: $\epsilon$ = 1e-8, γ = .9, η = .01, 150 iterations. Second panel: $\epsilon$ = 1e-8, γ = .9, η = .03, 150 iterations.]

Which optimizer to use?
- Data sparse ⇒ likely to achieve best results using one of the adaptive learning-rate methods.
- Using the adaptive learning-rate methods ⇒ won't need to tune the learning rate (much!).
- RMSProp, AdaDelta, and Adam are very similar algorithms that do well in similar circumstances.
- Adam slightly outperforms RMSProp near the end of optimization.
- Adam might be the best overall choice.
- But vanilla SGD (without momentum) and a simple learning rate annealing schedule may be sufficient, although the time until finding a local minimum may be long...

Annealing the learning rate

Useful to anneal the learning rate
When training deep networks, it is usually helpful to anneal the learning rate over time.
Why?
- Stops the parameter vector from bouncing around too widely.
- ⇒ can reach into deeper, but narrower parts of the loss function.
But knowing when to decay the learning rate is tricky!
- Decay too slowly ⇒ waste computations bouncing around chaotically with little improvement.
- Decay too aggressively ⇒ system unable to reach the best position it can.

Common approaches to learning rate decay
- Step decay: After every nth epoch set $\eta = \alpha \eta$ where $\alpha \in (0, 1)$. (Instead sometimes people monitor the validation loss and reduce the learning rate when this loss stops improving.)
- Exponential decay: $\eta = \eta_0 e^{-kt}$ where t is the iteration number (counted either in update steps or in epochs). Then $\eta_0$ and k are hyper-parameters.
- 1/t decay: $\eta = \frac{\eta_0}{1 + kt}$
Step decay is the most common. Better to decay conservatively and train for longer.
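
The three schedules can be written as small functions of the initial learning rate and the time index. The function names and the default values of `alpha`, `every` and `k` are assumptions chosen for illustration.

```python
import math

def step_decay(eta0, epoch, alpha=0.5, every=10):
    """Multiply the learning rate by alpha after every `every` epochs."""
    return eta0 * alpha ** (epoch // every)

def exponential_decay(eta0, t, k=0.1):
    """eta = eta0 * exp(-k t)."""
    return eta0 * math.exp(-k * t)

def inverse_time_decay(eta0, t, k=0.1):
    """eta = eta0 / (1 + k t)."""
    return eta0 / (1 + k * t)

schedule = [step_decay(0.1, epoch) for epoch in range(30)]
```

Here `schedule` holds the learning rate for 30 epochs: 0.1 for the first ten, 0.05 for the next ten and 0.025 for the last ten.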