Scaling SGD Batch Size to 32K for ImageNet Training

Yang You, Computer Science Division, UC Berkeley (youyang@cs.berkeley.edu)

Outline
Why is large-batch training important?
Why is large-batch training difficult?
How do we scale up the batch size?
Results and benefits of large-batch training

Mini-Batch SGD (Stochastic Gradient Descent)
Take B data points each iteration
Compute gradients of the weights based on those B data points
Update the weights: W = W − η·∇W (momentum and weight decay are also used)
W: weights, ∇W: gradients, η: learning rate, B: batch size

Data-Parallelism on P GPUs
Each GPU i holds a copy of W and computes its own gradient ∇W_i (i ∈ {1, 2, ..., P})
Each GPU gets B/P data points to compute its local ∇W_i
Communication: an all-reduce sum each iteration (sum_{i=1}^{P} ∇W_i)
Each GPU then updates W = W − (η/P)·sum_{i=1}^{P} ∇W_i
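To make the update rule concrete, here is a minimal NumPy sketch of one data-parallel mini-batch SGD step with momentum and weight decay. The per-GPU gradients are simulated in plain Python, the all-reduce is just a sum, and names such as `sgd_data_parallel_step` are illustrative rather than the talk's actual Caffe code.

```python
import numpy as np

def sgd_data_parallel_step(W, velocity, grads_per_gpu, lr,
                           momentum=0.9, weight_decay=0.0005):
    """One data-parallel mini-batch SGD update.

    grads_per_gpu: list of P gradient arrays, one per (simulated) GPU,
    each computed from its own B/P data points.
    """
    P = len(grads_per_gpu)
    # "All-reduce" sum of the per-GPU gradients, then scale by 1/P,
    # matching W = W - (eta/P) * sum_i grad_i on the slide.
    grad = sum(grads_per_gpu) / P
    # Weight decay and momentum, as mentioned on the slide.
    grad = grad + weight_decay * W
    velocity = momentum * velocity - lr * grad
    return W + velocity, velocity

# Toy usage with P = 4 simulated GPUs and random gradients.
W, v = np.zeros(10), np.zeros(10)
grads = [np.random.randn(10) for _ in range(4)]
W, v = sgd_data_parallel_step(W, v, grads, lr=0.01)
```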

Single GPU: large batch size benefits
At B = 512, the GPU achieves peak performance
If we have 16 GPUs, we need a batch size of 8192 (16 × 512) to make sure each GPU stays efficient

Motivation
Which commonly used approach for DNN training should we pick? Data-parallel mini-batch SGD (e.g. Caffe, TensorFlow, Torch), recommended by Dr. Bryan Catanzaro (NVIDIA VP)
How do we speed up mini-batch SGD? Use more processors (e.g. GPUs)
How do we keep each GPU efficient when we use many GPUs? Give each GPU enough computation (find the right B)
How do we give each GPU enough computation? Use a large batch size (use P·B)

Standard Benchmarks
1000-class ImageNet dataset with AlexNet: 58% accuracy in 100 epochs
1000-class ImageNet dataset with ResNet-50: 73% accuracy in 90 epochs
1 epoch: statistically touch all the data once (n/B iterations), where n is the total number of data points
We do not use data augmentation (only dataset preprocessing)

Fixed # epochs = fixed # floating point operations
We fix the amount of computation at 90 × 1.28 million × 7.72 billion floating point operations (≈ 8.9 × 10^17), i.e. 90 epochs of ResNet-50 on the ImageNet-1k dataset

Why can large batches speed up DNN training?
Reduce the number of iterations
Keep the per-iteration time (roughly) constant by using more processors

Why can large batches speed up DNN training?

Batch Size | Epochs | Iterations
512        | 100    | 250,000
1024       | 100    | 125,000
2048       | 100    | 62,500
4096       | 100    | 31,250
8192       | 100    | 15,625
...        | ...    | ...
1,280,000  | 100    | 100

ImageNet dataset: 1,280,000 data points
Goal: get the same accuracy in the same number of epochs
Fixed epochs = fixed number of floating point operations
Far fewer iterations are needed: speedup!

Why can large batches speed up DNN training?

Batch Size | Epochs | Iterations | GPUs | Iteration Time
512        | 100    | 250,000    | 1    | t1
1024       | 100    | 125,000    | 2    | t1 + log(2)·t2
2048       | 100    | 62,500     | 4    | t1 + log(4)·t2
4096       | 100    | 31,250     | 8    | t1 + log(8)·t2
8192       | 100    | 15,625     | 16   | t1 + log(16)·t2
...        | ...    | ...        | ...  | ...
1,280,000  | 100    | 100        | 2500 | t1 + log(2500)·t2

ImageNet dataset: 1,280,000 data points; batch size 512 per GPU
t1: computation time; t2: communication time (α + |W|·β), where α is the latency and β is the inverse bandwidth
t1 >> t2 is possible for ImageNet training over InfiniBand [Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", 2017 (Facebook report)]
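The table's iteration counts follow directly from n·epochs/B; the small sketch below reproduces them together with the t1 + log(P)·t2 per-iteration time model. I assume a base-2 log for a tree-style all-reduce, and the t1, t2 values are purely illustrative.

```python
import math

N, EPOCHS, PER_GPU_BATCH = 1_280_000, 100, 512   # ImageNet-1k, as on the slide

def training_cost(total_batch, t1=1.0, t2=0.05):
    """Iterations and a rough per-iteration time for a global batch size.

    t1: compute time per iteration, t2: per-hop communication time;
    t1 + log2(P)*t2 models a tree-style all-reduce over P GPUs.
    """
    iterations = N * EPOCHS // total_batch
    gpus = max(1, total_batch // PER_GPU_BATCH)
    iter_time = t1 + (math.log2(gpus) * t2 if gpus > 1 else 0.0)
    return iterations, gpus, iter_time

for b in (512, 1024, 2048, 4096, 8192):
    print(b, *training_cost(b))   # 512 -> (250000, 1, 1.0), 8192 -> (15625, 16, 1.2)
```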

Difficulties of Large-Batch Training: many more epochs!
(slide from Dr. Bryan Catanzaro, Feb 13, 2017 at Berkeley)

Difficulties of Large-Batch Training
Accuracy is lost when running the same number of epochs!
If we ignore accuracy, the problem was well studied 20 years ago: a standard divide-and-conquer approach
Divide: partition a batch of data points across different machines
Conquer: an all-reduce operation at each iteration

Outline
Why is large-batch training important?
Why is large-batch training difficult?
How do we scale up the batch size?
Results and benefits of large-batch training

Difficulties of Large-Batch Training
Why is accuracy lost?
Generalization problem [Keskar et al., "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima", ICLR 2017]: high training accuracy, but low test accuracy
Optimization difficulty [Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", 2017 (Facebook report)]: hard to find the right hyper-parameters

Generalization Problem
Large-batch training tends to converge to sharp minima [Keskar et al., ICLR 2017]
Even if you can train a good model, it is hard to generalize: high training accuracy :-) but low test accuracy :-(

Optimization Problem
You can keep the accuracy, but it is hard to optimize [Goyal et al., 2017 (Facebook report)]
Facebook scales the batch size to 8K (able to use 256 NVIDIA P100 GPUs!)

Most Effective Techniques (Facebook's recipe)
Control the learning rate (η)
Linear scaling rule [Alex Krizhevsky, "One weird trick for parallelizing convolutional neural networks", 2014 (Google report)]: if you increase B to kB, then increase η to kη; the number of iterations (and updates) is reduced by k×, so each update should be enlarged by k×
Warmup rule [Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", 2017 (Facebook report)]: start from a small η and increase it over a few epochs, which avoids divergence at the beginning of training
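As a concrete illustration of these two rules, here is a minimal sketch of a linear-scaling plus gradual-warmup schedule. The base values and the 30/60/80 step decay after warmup follow a common ResNet-50 recipe and are assumptions, not numbers from this talk.

```python
def scaled_lr(epoch, base_lr=0.1, base_batch=256, batch=8192, warmup_epochs=5):
    """Linear scaling + gradual warmup of the learning rate.

    Linear scaling: target LR = base_lr * (batch / base_batch).
    Warmup: ramp linearly from base_lr to the target over warmup_epochs,
    then decay by 10x at epochs 30, 60, 80 (an assumed step schedule).
    """
    target = base_lr * batch / base_batch          # k*eta for a k*B batch
    if epoch < warmup_epochs:
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target * 0.1 ** sum(epoch >= e for e in (30, 60, 80))

# Epoch 0 starts at 0.1; by epoch 5 the LR has reached 0.1 * 8192/256 = 3.2.
print([round(scaled_lr(e), 3) for e in (0, 1, 4, 5, 30, 60)])
```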

State-of-the-art Large-Batch ImageNet Training

Team     | Model      | Baseline Batch | Large Batch | Baseline Accuracy | Large Batch Accuracy
Google   | AlexNet    | 128            | 1024        | 57.7%             | 56.7%
Amazon   | ResNet-152 | 256            | 5120        | 77.8%             | 77.8%
Facebook | ResNet-50  | 256            | 8192        | 76.40%            | 76.26%

Google: Alex Krizhevsky, "One weird trick for parallelizing convolutional neural networks", 2014 (Google report)
Amazon: Mu Li, "Scaling Distributed Machine Learning with System and Algorithm Co-design", 2017 (CMU thesis)
Facebook: Goyal et al., "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", 2017 (Facebook report)

Reproducing Facebook's Results
B = 256 and B = 8192: both achieve 73% accuracy in 90 epochs
Our baseline's accuracy is lower than Facebook's because we didn't use data augmentation

Facebook's recipe does not work for AlexNet
We can only scale the batch size to 1024, despite trying everything:
warmup + linear scaling
tuning η, momentum, and weight decay
data shuffling, data scaling, minimum-η tuning, etc.

Batch Size | Base η | poly power | momentum | epochs | test accuracy
512        | 0.02   | 2          | 0.9      | 100    | 0.588
1024       | 0.02   | 2          | 0.9      | 100    | 0.582
4096       | 0.05   | 2          | 0.9      | 100    | 0.531

Facebook's recipe does not work for AlexNet
We couldn't scale up the learning rate
Warmup did help (we tried 1, 2, 3, ..., 10 warmup epochs)
The network diverged at η = 0.07

Batch Size | Base η | warmup | epochs | test accuracy
4096       | 0.01   | yes    | 100    | 0.509
4096       | 0.02   | yes    | 100    | 0.527
4096       | 0.03   | yes    | 100    | 0.520
4096       | 0.04   | yes    | 100    | 0.530
4096       | 0.05   | yes    | 100    | 0.531
4096       | 0.06   | yes    | 100    | 0.516
4096       | 0.07   | yes    | 100    | 0.001

Outline
Why is large-batch training important?
Why is large-batch training difficult?
How do we scale up the batch size?
Results and benefits of large-batch training

Solving the Generalization Problem with Batch Normalization
Generalization problem [Keskar et al., ICLR 2017]:
regular batch: (test loss − train loss) is small
large batch: (test loss − train loss) is large

Solving the Generalization Problem with Batch Normalization
Optimize the model:
use Batch Normalization (BN) instead of Local Response Normalization (LRN)
place BN after the convolutional layers
run more epochs (100 epochs to 128 epochs)

Batch Size | Base LR | poly power | momentum | weight decay | epochs | test accuracy
512        | 0.02    | 2          | 0.9      | 0.0005       | 128    | 0.602
4096       | 0.18    | 2          | 0.9      | 0.0005       | 128    | 0.589
8192       | 0.30    | 2          | 0.9      | 0.0005       | 128    | 0.580

Higher accuracy, but the baseline is also higher
Large-batch accuracy still needs to improve
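The "poly power" column refers to a polynomial learning-rate decay. Below is a minimal sketch, assuming the Caffe-style "poly" policy lr = base_lr · (1 − iter/max_iter)^power, with power = 2 as in the table.

```python
def poly_lr(iteration, max_iterations, base_lr, power=2.0):
    """Polynomial ("poly") learning-rate decay: base_lr * (1 - t/T)**power."""
    return base_lr * (1.0 - iteration / max_iterations) ** power

# Example: B = 4096 for 128 epochs over 1.28M images -> 40,000 iterations.
max_iters = 1_280_000 * 128 // 4096
print(poly_lr(0, max_iters, base_lr=0.18))            # 0.18 at the start
print(poly_lr(max_iters // 2, max_iters, 0.18))       # 0.045 at the midpoint
```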

AlexNet's accuracy still needs to improve
Reduce the epochs from 128 back to 100
There is clearly an accuracy gap

Reason: different weight-to-gradient ratios (||W||₂ / ||∇W||₂) across layers

Layer            | conv1.1 | conv1.0 | conv2.1 | conv2.0 | conv3.1 | conv3.0 | conv4.0 | conv4.1
||W||₂           | 1.86    | 0.098   | 5.546   | 0.16    | 9.40    | 0.196   | 8.15    | 0.196
||∇W||₂          | 0.22    | 0.017   | 0.165   | 0.002   | 0.135   | 0.0015  | 0.109   | 0.0013
||W||₂ / ||∇W||₂ | 8.48    | 5.76    | 33.6    | 83.5    | 69.9    | 127     | 74.6    | 148

Layer            | conv5.1 | conv5.0 | fc6.1 | fc6.0 | fc7.1 | fc7.0 | fc8.1 | fc8.0
||W||₂           | 6.65    | 0.16    | 30.7  | 6.4   | 20.5  | 6.4   | 20.2  | 0.316
||∇W||₂          | 0.09    | 0.0002  | 0.26  | 0.005 | 0.30  | 0.013 | 0.22  | 0.016
||W||₂ / ||∇W||₂ | 73.6    | 69      | 117   | 1345  | 68    | 489   | 93    | 19

L2 norms of the layer weights and gradients of AlexNet (batch size 4096, 1st iteration)
Bad: the same η is used for all layers (W = W − η·∇W); layer fc6.0's best η leads to divergence for layer conv1.0

Layer-wise Adaptive Rate Scaling (LARS)
η = l · γ · ||W||₂ / ||∇W||₂
l: scaling factor, 0.001 for AlexNet and ResNet training
γ: input LR, a tuning parameter for users; we usually tune γ from 1 to 50
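A minimal sketch of a LARS-style update following the slide's formula. How momentum and weight decay interact with the layer-wise rate here is a simplifying assumption; the actual implementation is the authors' modified Caffe ("Auto LR").

```python
import numpy as np

def lars_step(weights, grads, gamma, l=0.001,
              weight_decay=0.0005, momentum=0.9, velocities=None):
    """One LARS update: per-layer eta = l * gamma * ||W||_2 / ||dW||_2.

    weights, grads: dicts mapping layer name -> numpy array.
    """
    if velocities is None:
        velocities = {name: np.zeros_like(w) for name, w in weights.items()}
    for name, W in weights.items():
        dW = grads[name] + weight_decay * W
        # Layer-wise learning rate from the weight-to-gradient ratio.
        eta = l * gamma * np.linalg.norm(W) / (np.linalg.norm(dW) + 1e-12)
        velocities[name] = momentum * velocities[name] + eta * dW
        weights[name] = W - velocities[name]
    return weights, velocities
```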

Effects of LARS

AlexNet
Batch Size | LR rule | poly power | warmup    | weight decay | momentum | epochs | test accuracy
512        | regular | 2          | N/A       | 0.0005       | 0.9      | 100    | 0.588
4096       | LARS    | 2          | 13 epochs | 0.0005       | 0.9      | 100    | 0.584
8192       | LARS    | 2          | 8 epochs  | 0.0005       | 0.9      | 100    | 0.583

AlexNet-BN
Batch Size | LR rule | poly power | warmup    | weight decay | momentum | epochs | test accuracy
512        | LARS    | 2          | 2 epochs  | 0.0005       | 0.9      | 100    | 0.602
4096       | LARS    | 2          | 2 epochs  | 0.0005       | 0.9      | 100    | 0.604
8192       | LARS    | 2          | 2 epochs  | 0.0005       | 0.9      | 100    | 0.601

Outline
Why is large-batch training important?
Why is large-batch training difficult?
How do we scale up the batch size?
Results and benefits of large-batch training

Implementation Details
NVIDIA Caffe 0.16 with our own modification (Auto LR)
1 Intel Xeon CPU E5-2698 v4 @ 2.20GHz
8 NVIDIA P100 GPUs interconnected by NVIDIA NVLink
Batch 8192 with ResNet-50 runs out of memory, so we:
partition the 8192-sample batch into 32 batches of 256
compute the 32 pieces of gradients sequentially
average the gradients once all 32 pieces are available
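This out-of-memory workaround is ordinary gradient accumulation; a framework-agnostic sketch is below, where `compute_gradients` is a placeholder for the framework's forward/backward pass and the 32 × 256 split matches the slide.

```python
def accumulate_gradients(compute_gradients, big_batch, micro_batch=256):
    """Emulate one large-batch gradient by accumulating over micro-batches.

    compute_gradients(samples) must return the *mean* gradient over its
    samples (a placeholder for the framework's forward/backward pass).
    big_batch: e.g. 8192 samples, split here into 32 chunks of 256.
    """
    pieces = [compute_gradients(big_batch[i:i + micro_batch])
              for i in range(0, len(big_batch), micro_batch)]
    # Averaging equal-sized pieces reproduces the full-batch mean gradient.
    return sum(pieces) / len(pieces)
```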

Effects of LARS (figure slides)

Benefits of Large-Batch Training
AlexNet-BN: 3× speedup just by increasing the batch size

Batch Size | Stable Accuracy | 8-GPU speed   | 8-GPU time
512        | 0.602           | 5771 img/sec  | 6h 10m 30s
4096       | 0.604           | 15379 img/sec | 2h 19m 24s

AlexNet: 3× speedup just by increasing the batch size

Batch Size | Stable Accuracy | 8-GPU speed   | 8-GPU time
512        | 0.588           | 5797 img/sec  | 6h 9m 0s
4096       | 0.584           | 16373 img/sec | 2h 10m 52s

Large batches can make full use of the increased computational power

Benefits of Large-Batch Training
Large batches can make full use of the increased computational power

Thanks!
Scaling SGD Batch Size to 32K for ImageNet Training: https://arxiv.org/abs/1708.03888