Analytics in 10 Microseconds Using FPGAs
David B. Thomas, dt10@imperial.ac.uk, Imperial College London
Overview
1. The case for low-latency computation
2. Quasi-Random Monte-Carlo in 10us
3. Binomial Trees in 10us
4. Observations
dt10@ic.ac.uk
Numerical methods: Latency vs Throughput
GPUs and FPGAs are often seen as quite different:
  GPUs are high-throughput
  FPGAs are low-latency and/or power-efficient
This tends to match their traditional areas of use in finance:
  GPUs: number crunching, analytics
  FPGAs: high-frequency trading, data routing
Recent tools provide throughput-oriented FPGA flows:
  Streaming processing (e.g. Maxeler); OpenCL (Altera)
When is latency important?
Network-to-network round-trips, e.g. HFT
  Not considered here
CPU-to-accelerator round-trips
  Software makes a call to a hardware-accelerated function
  Needed for many sophisticated solvers:
    calibration, bisection, minimisation, root-finding, ...
  Intelligence on the CPU, evaluation on the accelerator
  Tight dependency between the two parts
What can a GPU do in 10us?
See: Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization, Daniel Lustig and Margaret Martonosi, Princeton, HPCA 2013
How long is 10us? It's really not that long:
  Two jumbo frames (~12000 bytes) over 10GigE
  About 30K scalar instructions on a 3GHz CPU
  9M FLOPs on an NVidia GTX Titan (assuming peak perf.)
  2500 cycles of a 250MHz FPGA clock
What can the platforms get done in 10us?
  GPU: get the data over, start the kernel... run out of time.
  FPGA: practical numerical computation
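As a sanity check, the budget arithmetic above can be reproduced directly (my own back-of-envelope calculation; one instruction per cycle and full line rate are assumed):

```python
budget_s = 10e-6  # the 10us budget, in seconds

# ~12000 bytes over 10GigE at the 10 Gb/s line rate
wire_time_s = 12000 * 8 / 10e9        # about 9.6 microseconds on the wire

# scalar instructions on a 3GHz CPU, assuming one per cycle
instructions = 3e9 * budget_s         # about 30000

# cycles of a 250MHz FPGA clock
fpga_cycles = 250e6 * budget_s        # about 2500
```

The wire-time figure shows why a CPU-to-GPU round-trip struggles: moving the data alone can consume most of the budget.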
Option pricing as a numerical primitive
Option pricing algorithms attempt to assess a fair price
  e.g. how much is a call on MSFT with a $50 strike worth?
Three things determine the price of an option:
  1. M - a pricing model, e.g. Black-Scholes
  2. O - observed parameters: current stock price, interest rates
  3. E - estimated parameters: stock volatility, market sentiment
This gives a unique and well-defined price p = P_M(O, E)
The difficult part is whether it can be calculated
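For concreteness, the Black-Scholes model mentioned above gives a closed-form P_M for a European call; a minimal sketch (function and parameter names are my own, not from the talk):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, r, T, sigma):
    """Black-Scholes price of a European call.
    O (observed): spot S, rate r; contract terms: strike K, expiry T;
    E (estimated): volatility sigma."""
    d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)
```

With S = K = 100, r = 5%, T = 1 year and sigma = 20%, this gives a price of roughly 10.45.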
How it's used in practice
People rarely actually price options
  Can just look at the current bid/ask price at the exchange
Much more common as part of an inverse problem:
  We can collect all observable variables, including the price
  Then use the option pricer to get the estimated variables
  Given p = P_M(O, E), what is E?
Loop carried dependency

float ImpliedVolBS(float S, float K, float r, float T, float price)
{
    float sig_lo = 0, sig_hi = 10;
    while (sig_hi - sig_lo > THRESH) {
        float sig_mid = (sig_hi + sig_lo) / 2;
        float price_mid = PriceBS(S, K, sig_mid, r, T);
        if (price_mid > price)
            sig_hi = sig_mid;
        else
            sig_lo = sig_mid;
    }
    return (sig_lo + sig_hi) / 2;
}
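The same bisection loop as runnable Python against a Black-Scholes pricer, to make the loop-carried dependency explicit: each bracket update needs the result of the previous pricing call before it can continue (a sketch; `bs_call`, `implied_vol` and the defaults are my own names, not the talk's):

```python
from math import log, sqrt, exp, erf

def bs_call(S, K, r, T, sigma):
    """Black-Scholes European call (standing in for the accelerated pricer)."""
    n = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))   # standard normal CDF
    d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrt(T))
    return S * n(d1) - K * exp(-r * T) * n(d1 - sigma * sqrt(T))

def implied_vol(S, K, r, T, price, thresh=1e-6):
    """Bisection for implied volatility: the intelligence stays on the
    CPU, the evaluation goes to the accelerator, and the round-trip
    sits on the critical path of every iteration."""
    sig_lo, sig_hi = 0.0, 10.0
    while sig_hi - sig_lo > thresh:
        sig_mid = 0.5 * (sig_lo + sig_hi)
        if bs_call(S, K, r, T, sig_mid) > price:
            sig_hi = sig_mid
        else:
            sig_lo = sig_mid
    return 0.5 * (sig_lo + sig_hi)
```

Halving a 10-unit bracket down to 1e-6 takes about 24 sequential pricing calls, which is why per-call latency matters more than throughput here.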
Practical pricing is complex
Very few closed-form models for option pricing
Have to rely on numerical methods for p = P_M(O, E):
  Numerical integration
  Finite difference
  Binomial trees
  Monte-Carlo simulation
Method 1: Low-latency Monte-Carlo
Monte-Carlo is a hard numerical method:
  Generally has poor convergence
  RNGs are often compute-intensive
  Usually avoided if at all possible
Some conventional wisdom:
  Avoid Monte-Carlo at all costs.
  Do not put Monte-Carlo in the middle of an optimiser.
  You cannot do Monte-Carlo in real-time.
How much Monte-Carlo is useful?
Take a simple example: an Asian option
Let's assume it's discretely observed over t time-steps

[mu, sigma] = setup(vol, r, dt);     % scalar setup
for i = 1:n                          % task parallelism
    x = exp(mu + sigma*randn(t,1));  % data parallelism
    s = s0 * cumprod(x);             % loop dependency
    a = mean(s);                     % intra-task reduction
    payoff = max(a - k, 0);
    sum = sum + payoff;              % inter-task reduction
end
price = sum / n;
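The slide's loop, restated as runnable Python with the same annotations as comments (the function name, seed, and the parameter values in the usage note are my own illustrative choices):

```python
import random
from math import exp, sqrt

def asian_call_mc(s0, k, vol, r, T, t, n, seed=1):
    """Discretely-observed arithmetic-average Asian call by Monte-Carlo."""
    rng = random.Random(seed)
    dt = T / t
    mu = (r - 0.5 * vol * vol) * dt       # scalar setup: per-step drift
    sigma = vol * sqrt(dt)                # per-step volatility
    total = 0.0
    for _ in range(n):                    # task parallelism: independent paths
        s, acc = s0, 0.0
        for _ in range(t):                # loop dependency along the path
            s *= exp(mu + sigma * rng.gauss(0.0, 1.0))  # data parallelism
            acc += s                      # intra-task reduction
        total += max(acc / t - k, 0.0)    # inter-task reduction
    return exp(-r * T) * total / n
```

For example, `asian_call_mc(100, 100, 0.2, 0.05, 1.0, 8, 20000)` prices an at-the-money option observed at t = 8 points, and lands in the mid-single-digits.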
Practical Quasi-MC: Brownian Bridges
[figure, animated over five slides: log(price) paths plotted against t = 0..8 (T=8), showing a path built up by recursive midpoint refinement]
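What the animation shows: the bridge fixes the terminal point first, then recursively fills in each midpoint conditioned on its two neighbours, so the earliest (best-distributed) quasi-random dimensions shape the coarse structure of the path. A sketch of that recursion (my own code, not the talk's):

```python
import random
from math import sqrt

def brownian_bridge(T, levels, rng):
    """Brownian path on 2**levels steps by recursive midpoint refinement:
    given endpoints at t_l and t_r, the midpoint is Gaussian with
    mean (W_l + W_r)/2 and variance (t_r - t_l)/4."""
    n = 2 ** levels
    w = [0.0] * (n + 1)              # w[0] = 0 by definition
    w[n] = rng.gauss(0.0, sqrt(T))   # terminal value drawn first
    step = n
    while step > 1:
        half = step // 2
        for lo in range(0, n, step):
            hi = lo + step
            dt = (hi - lo) * (T / n)
            mean = 0.5 * (w[lo] + w[hi])
            w[lo + half] = rng.gauss(mean, sqrt(dt / 4.0))
        step = half
    return w
```

Substituting a quasi-random source for `rng` gives the low-dimensional draws to the large-scale features of the path, which is what makes quasi-Monte-Carlo effective here.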
Spatial recursion
[figure, animated over four slides: QRNG(1)..QRNG(4) feed values r_2, r_3, r_4 through a tree of L/2+N+R/2 combination nodes to produce the path points x_1, x_2, x_3]
Push QRNGs close to ALUs
[figure: the QRNG(1)..QRNG(4) units moved next to the L/2+N+N/2 combination nodes that produce x_1, x_2, x_3]
Specialise QRNGs for position
[figure: each generator becomes a position-specific unit QRNG_1..QRNG_4]
Specialise QRNGs for depth
[figure: the deeper generators shrink to depth-specialised units QR_3, QR_4]
Advantages of spatial parallelism
[figure: the specialised QRNG datapath from the previous slides]
Fully parallel: one simulation per clock cycle
Start-up overhead of a few cycles to prime the pipeline
All data-path: no scheduling or control overhead
Building a full Monte-Carlo unit
[figure: the QRNG tree produces x_1..x_3; exp(.) units turn them into path values s_1..s_4; an adder tree averages them to a; the strike is subtracted to give the payoff price]
Add statistics and we're done
[figure: the same datapath with an Update Statistics block accumulating the price output]
Using up hardware
One path unit does not require a whole FPGA
  Can instantiate as many parallel instances as will fit
  Can fit around 8 instances in a slightly old FPGA
Simulate around 17000 paths in 9us
The remaining 1us goes on start-up costs:
  Communication with the CPU
  Calculating constants needed during simulation
  Filling and draining internal pipelines
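A back-of-envelope check on those figures, assuming each instance retires one path per clock cycle at the 250MHz clock mentioned earlier (my arithmetic, not from the talk):

```python
instances = 8          # parallel path units on the FPGA
f_clk = 250e6          # Hz, the FPGA clock figure quoted earlier
t_sim = 9e-6           # seconds left after ~1us of start-up

path_slots = instances * f_clk * t_sim
# 18000 raw path slots; a few cycles of pipeline fill/drain per
# instance bring this down towards the ~17000 paths quoted
```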
Method 2: Binomial Trees
[figure: a recombining binomial tree of node values v(i,j) and leaf values p(1)..p(5), combined pairwise level by level down to the root v(1,1)]
In-place evaluation
[figure: the same tree laid out as an array, so that each sweep overwrites the node values v(i,.) in place]
Scalar Code

function [price] = binomial(n, p)
  for i = 1:n
    c(i) = setup(i, p);
    v(i) = payoff(i, p);
  end
  for t = n-1:-1:1
    for i = 1:t
      vn(i) = node(v(i), v(i+1), c(i));
    end
    v = vn(1:t);
  end
  price = v(1);
end
Vector Code

function [price] = binomial(n, p)
  i = 1:n-1;
  c(i) = setup(i, p);
  v(1:n) = payoff(1:n, p);
  for t = 1:n
    v(i) = node(v(i), v(i+1), c(i));   % full-width in-place update
  end
  price = v(1);
end
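The same backward sweep as runnable Python for a European call, to make the structure concrete. The CRR parameterisation, the names, and the Black-Scholes cross-check are my own additions; the talk's `setup`/`payoff`/`node` split corresponds to the constant setup, the leaf payoffs, and the combination step here:

```python
from math import exp, sqrt

def crr_call(S, K, r, T, sigma, n):
    """Binomial (Cox-Ross-Rubinstein) price of a European call:
    set up the leaf payoffs, then sweep backwards, combining
    adjacent nodes at each level."""
    dt = T / n
    u = exp(sigma * sqrt(dt))            # up-move factor
    d = 1.0 / u                          # down-move factor
    p = (exp(r * dt) - d) / (u - d)      # risk-neutral up-probability
    disc = exp(-r * dt)                  # per-step discount
    # leaf payoffs at expiry: i up-moves, n-i down-moves
    v = [max(S * u**i * d**(n - i) - K, 0.0) for i in range(n + 1)]
    for t in range(n, 0, -1):            # backward induction
        v = [disc * (p * v[i + 1] + (1.0 - p) * v[i]) for i in range(t)]
    return v[0]
```

At n = 512 (the tree depth used later in the talk) this agrees with the Black-Scholes closed form to a couple of decimal places.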
Choosing a vector size
The problem is intrinsically SIMD
GPU: choose according to local work-group size
  Each node update is ten or so instructions
FPGA: let's make the SIMD width the tree height
A basic binomial node
[figure: the node datapath - two multipliers feeding adders]
Register tree-wide constants
[figure: the same node, with the tree-wide constants held in registers]
Choose SIMD width of 512
[figure: the node replicated across the full width of the tree]
Getting the tree constants into registers
[figure: the constants shifted systolically along the array into their registers]
Result: systolic binomial tree
Tree depth is limited only by resources
  Each node evaluation is 4 DSP blocks for 32-bit fixed-point
  We have no problem reaching n=512 in a Virtex-7
Execution latency is 2*n*k + C
  n: depth of the tree
  k: depth of the node-evaluation pipeline
  C: constant-calculation pipeline depth
  1*n*k: calculate tree constants, move them into place
  1*n*k: evaluate the tree
Choose n=512, k=6, f=300MHz: latency is 10us
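Plugging in the quoted numbers (my arithmetic): each n*k phase alone comes to roughly the 10us figure, which suggests the constant-setup phase overlaps with evaluation in the quoted latency:

```python
n, k, f = 512, 6, 300e6          # tree depth, node pipeline depth, clock (Hz)

phase_cycles = n * k             # cycles for one n*k phase
phase_time = phase_cycles / f    # 3072 cycles -> about 10.2 microseconds
```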
FPGA Strengths: Global clock net
GPUs are limited by point-to-point communication:
  Global communication
  Global synchronisation
A huge strength of FPGAs is the clock net:
  Precisely schedule 1000s of DSPs over 1000s of clock cycles
  We have multiple bits of silicon on the same clock!
  Millions of maths operations with a known schedule
FPGA Strengths: Sheer scale
The bigger the FPGA, the more scaling helps:
  Increases the advantage of hierarchical composition
  Design effort in low-level primitives is repaid
Doubling the functional units: constant overhead
  One more layer of fan-out to create work
  One more layer of fan-in to collect results
  Not usually the case in GPUs: more bus contention
Can choose to spend it on performance or accuracy:
  O(n^-0.5) method: accuracy x1.4, or latency x0.71
  O(n^-1) method: accuracy x2.0, or latency x0.50
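The trade-off in the last two bullets is just the convergence rate applied to a doubling of n (my arithmetic):

```python
# Doubling the functional units doubles n, the work done per unit time.
acc_mc  = 2 ** 0.5    # O(n^-0.5) method: error shrinks by sqrt(2) -> accuracy x1.41
lat_mc  = 2 ** -0.5   # ...or hold accuracy and cut latency to x0.71
acc_det = 2.0         # O(n^-1) method: accuracy x2.0
lat_det = 0.5         # ...or latency x0.50
```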
Conclusion
FPGAs are not only good for latency-oriented tasks
  Because they can do latency, they also do throughput
  Can push into low-latency compute that GPUs cannot
Low latency does not mean simple
  Can complete (fairly) sophisticated Monte-Carlo in 5us
  Gives a throughput of 200K Monte-Carlo runs / sec
Particularly useful on new platforms like Xilinx Zynq
  FPGA and ARM cores sat together on the same chip
  Good for optimisation, root-finding, minimisation...