Analytics in 10 Micro-Seconds Using FPGAs. David B. Thomas Imperial College London

Analytics in 10 Micro-Seconds Using FPGAs
David B. Thomas, dt10@imperial.ac.uk
Imperial College London

Overview
1. The case for low-latency computation
2. Quasi-Random Monte-Carlo in 10us
3. Binomial Trees in 10us
4. Observations
dt10@ic.ac.uk

Numerical methods: Latency vs Throughput
GPUs and FPGAs are often seen as quite different
    GPUs are high throughput
    FPGAs are low latency and/or power efficient
Tends to match traditional areas of use in finance
    GPUs: Number crunching, analytics
    FPGAs: High Frequency Trading, data routing
Recent tools provide throughput oriented FPGAs
    Streaming processing (e.g. Maxeler); OpenCL (Altera)

When is latency important?
Network to network round-trips, e.g. HFT
    Not considered here
CPU to accelerator round-trips
    Software makes call to hardware accelerated function
    Needed for many sophisticated solvers
    Calibration, bisection, minimisation, root-finding, ...
    Intelligence on the CPU, evaluation on the accelerator
    Tight dependency between two parts

What can a GPU do in 10us?
"Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization", Daniel Lustig and Margaret Martonosi, Princeton, HPCA 2013

How long is 10us?
It's really not that long
    Two jumbo frames (~12000 bytes) over 10GigE
    About 30K scalar instructions on a 3GHz CPU
    9M FLOPs on an NVidia GTX Titan (assuming peak perf.)
    2500 cycles in a 250MHz FPGA
What can the platforms get done in 10us?
    GPU: Get the data over, start the kernel... run out of time.
    FPGA: Practical numerical computation
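The link, CPU and FPGA budgets above are simple rate-times-time products. A minimal sketch of that arithmetic follows; the helper name `units_in_window` is illustrative, and reading "30K scalar instructions" as roughly one instruction per cycle is an assumption.

```python
# Back-of-envelope budgets for a 10 microsecond window.
# The rates are the ones quoted on the slide; the helper name is hypothetical.
WINDOW_S = 10e-6


def units_in_window(rate_per_second, window_s=WINDOW_S):
    """Number of bits or cycles that fit in the window at the given rate."""
    return rate_per_second * window_s


bytes_10gige = units_in_window(10e9) / 8   # 10 Gbit/s link, bits -> bytes
cpu_cycles = units_in_window(3e9)          # 3 GHz CPU, assuming ~1 instruction/cycle
fpga_cycles = units_in_window(250e6)       # 250 MHz FPGA fabric clock

print(f"10GigE bytes: {bytes_10gige:.0f}")
print(f"CPU cycles  : {cpu_cycles:.0f}")
print(f"FPGA cycles : {fpga_cycles:.0f}")
```

The link figure comes out at 12500 bytes of raw capacity, in line with the slide's rounded ~12000 bytes.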

Option pricing as a numerical primitive
Option pricing algorithms attempt to assess fair price
    e.g. How much is a call on MSFT with $50 strike worth?
Three things determine the price of an option
    1. M - A pricing model: e.g. Black-Scholes
    2. O - Observed parameters: current stock price, interest rates
    3. E - Estimated parameters: stock volatility, market sentiment
Gives a unique and well-defined price p = P_M(O, E)
Difficult part is whether it can be calculated

How it's used in practice
People rarely actually price options
    Can just look at the current bid/ask price at the exchange
Much more common as part of an inverse problem
    We can collect all observable variables, including the price
    Then use the option pricer to get estimated variables
Given p = P_M(O, E), what is E?

Loop carried dependency

    /* THRESH and PriceBS (the option pricer) are assumed to be defined elsewhere. */
    float ImpliedVolBS(float S, float K, float r, float T, float price)
    {
        float sig_lo = 0, sig_hi = 10;
        while (sig_hi - sig_lo > THRESH) {
            float sig_mid = (sig_hi + sig_lo) / 2;
            /* Each pricer call depends on the outcome of the previous one. */
            float price_mid = PriceBS(S, K, sig_mid, r, T);
            if (price_mid > price) sig_hi = sig_mid;
            else sig_lo = sig_mid;
        }
        return (sig_lo + sig_hi) / 2;
    }
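To make the dependency concrete, here is a runnable sketch of the same bisection. It assumes a closed-form Black-Scholes European call as the pricer, which is a stand-in chosen for illustration; the names `norm_cdf`, `bs_call` and `implied_vol` are not from the slides.

```python
import math


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(S, K, sigma, r, T):
    """Black-Scholes European call price (assumed pricer for this sketch)."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)


def implied_vol(S, K, r, T, price, tol=1e-6, sig_lo=0.0, sig_hi=10.0):
    """Bisect on volatility; every iteration waits on one pricer evaluation."""
    while sig_hi - sig_lo > tol:
        sig_mid = 0.5 * (sig_lo + sig_hi)
        if bs_call(S, K, sig_mid, r, T) > price:
            sig_hi = sig_mid   # model price too high: volatility is lower
        else:
            sig_lo = sig_mid   # model price too low: volatility is higher
    return 0.5 * (sig_lo + sig_hi)


target = bs_call(50.0, 50.0, 0.25, 0.01, 0.5)
print(implied_vol(50.0, 50.0, 0.01, 0.5, target))
```

With these bounds and tolerance the loop makes 24 pricer calls, one after another, so the round-trip latency of each call is paid 24 times.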

Practical pricing is complex
Very few closed form models for option pricing
Have to rely on numerical methods for p = P_M(O, E)
    Numerical integration
    Finite difference
    Binomial trees
    Monte-Carlo simulation

Method 1 : Low latency Monte-Carlo
Monte-Carlo is a hard numerical method
    Generally has poor convergence
    RNGs often compute intensive
    Usually avoided if at all possible
Some conventional wisdom:
    Avoid Monte-Carlo at all costs.
    Do not put Monte-Carlo in the middle of an optimiser.
    You cannot do Monte-Carlo in real-time.

How much Monte-Carlo is useful?
Take a simple example: Asian option
Let's assume it's discretely observed over t time-steps

    [mu,sigma]=setup(vol,r,dt);      % Scalar setup
    total=0;
    for i=1:n                        % Task parallelism
      x=exp(mu+sigma*randn(1,t));    % Data parallelism
      s=s0*cumprod(x);               % Loop dependency
      a=mean(s);                     % Intra-task reduction
      payoff=max(a-k,0);
      total=total+payoff;            % Inter-task reduction
    end
    price=total/n;
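A runnable sketch of the same loop, using ordinary pseudo-random normals rather than quasi-random ones. The body of `setup` is not given on the slide; the geometric Brownian motion drift and volatility used here are an assumption, and, like the slide, the result is left undiscounted.

```python
import math
import random


def asian_call_mc(s0, k, vol, r, dt, t, n, seed=1):
    """Arithmetic-average Asian call estimated from n paths of t time-steps."""
    rng = random.Random(seed)
    # Scalar setup (assumed form: log-return drift and volatility per step).
    mu = (r - 0.5 * vol * vol) * dt
    sigma = vol * math.sqrt(dt)
    total = 0.0
    for _ in range(n):                  # task parallelism: paths are independent
        s = s0
        running = 0.0
        for _ in range(t):              # loop dependency: s depends on previous s
            s *= math.exp(mu + sigma * rng.gauss(0.0, 1.0))
            running += s                # intra-task reduction: average along path
        payoff = max(running / t - k, 0.0)
        total += payoff                 # inter-task reduction: sum over paths
    return total / n


print(asian_call_mc(s0=100.0, k=100.0, vol=0.2, r=0.05, dt=1.0 / 8, t=8, n=20000))
```

The five comments mark the same structure the slide annotates: only the two reductions and the cumulative product constrain how the work can be laid out in hardware.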

Practical Quasi-MC : Brownian Bridges
[Figure, repeated over five slides: Brownian bridge construction of a path; vertical axis log(price), horizontal axis time-steps 0 to 8, with T=8]
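A minimal sketch of a Brownian bridge path construction, assuming equally spaced time-steps and a power-of-two step count such as the 8 used in the figure. The terminal point is drawn first, then each midpoint is the average of its two neighbours plus scaled noise, which is the L/2+N+R/2 pattern in the diagrams that follow. The function name and the ordering of the input normals are illustrative choices.

```python
import math


def brownian_bridge(z, total_time=1.0):
    """Build W(0..T) at len(z) equal steps from standard normals z.

    z[0] sets the terminal value; later entries refine midpoints level by level.
    """
    n = len(z)
    if n == 0 or n & (n - 1):
        raise ValueError("number of steps must be a power of two")
    dt = total_time / n
    w = [0.0] * (n + 1)
    w[n] = math.sqrt(total_time) * z[0]
    k = 1
    span = n
    while span > 1:
        half = span // 2
        # Conditional std of a midpoint given both ends of an interval of length span*dt.
        std = math.sqrt(half * dt / 2.0)
        for left in range(0, n, span):
            w[left + half] = 0.5 * (w[left] + w[left + span]) + std * z[k]
            k += 1
        span = half
    return w


print(brownian_bridge([0.3, -1.1, 0.4, 0.8, -0.2, 0.5, 1.3, -0.7], total_time=8.0))
```

Because the first few inputs fix the coarse shape of the path, a quasi-random sequence can spend its best-distributed dimensions where they matter most.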

Spatial recursion
[Diagram, repeated over four slides: generators QRNG(1) to QRNG(4) supply values r_2, r_3, r_4 to nodes labelled L/2+N+R/2, which produce x_1, x_2, x_3]

Push QRNGs close to ALUs
[Diagram: QRNG(1) to QRNG(4) placed next to the nodes labelled L/2+N+N/2, with outputs x_1, x_2, x_3]

Specialise QRNGs for position
[Diagram: QRNG 1 to QRNG 4 feeding the nodes labelled L/2+N+N/2, with outputs x_1, x_2, x_3]

Specialise QRNGs for depth
[Diagram: QRNG 1, QRNG 2, QR 3 and QR 4 feeding the nodes labelled L/2+N+N/2, with outputs x_1, x_2, x_3]

Advantages of spatial parallelism
[Diagram: QRNG 1, QRNG 2, QR 3 and QR 4 feeding the nodes labelled L/2+N+N/2, with outputs x_1, x_2, x_3]
Fully parallel: one simulation per clock cycle
Start-up overhead of a few cycles to prime pipeline
All data-path: no scheduling or control overhead

Building a full Monte-Carlo unit
[Diagram: QRNG(1) to QRNG(4) with values r_2, r_3, r_4 and x_1, x_2, x_3 feed four exp(.) units giving s_1 to s_4; three adders (+) give a; a subtractor (-) gives price]

Add statistics and we're done
[Diagram: the full Monte-Carlo unit (QRNGs, exp(.) units giving s_1 to s_4, adders giving a, subtractor giving price) with an "Update Statistics" block on the price output]

Using up hardware
One path unit does not require a whole FPGA
    Can instantiate as many parallel instances as will fit
    Can fit around 8 instances in a slightly old FPGA
    Simulate around 17000 paths in 9us
Remaining 1us goes on start-up costs
    Communications with CPU
    Calculating constants needed during simulation
    Filling and draining internal pipelines
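The path count follows from instances times clock rate times time, given one simulation per clock cycle per instance. The sketch below checks the slide's figures; the 250MHz clock is borrowed from the earlier "How long is 10us?" slide as an assumption, since no clock rate is quoted here, and both function names are illustrative.

```python
def paths_simulated(instances, clock_hz, seconds, paths_per_cycle=1):
    """Paths completed by fully pipelined units in the given time."""
    return instances * clock_hz * seconds * paths_per_cycle


def implied_clock_hz(paths, instances, seconds):
    """Clock rate needed to reach a given path count at one path per cycle."""
    return paths / (instances * seconds)


# 8 instances for 9us at an assumed 250MHz clock.
print(paths_simulated(8, 250e6, 9e-6))
# Clock rate implied by the quoted 17000 paths, in MHz.
print(implied_clock_hz(17000, 8, 9e-6) / 1e6)
```

At 250MHz the budget is 18000 paths, so the quoted figure of around 17000 corresponds to a clock of roughly 236MHz or to a little extra pipeline overhead.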

Method 2 : Binomial Trees
[Diagram: binomial tree with nodes v(1,1) to v(4,4) and prices p(1) to p(5)]

In-place evaluation
[Diagram: the same tree drawn as rows, with nodes v(1,1) to v(4,4) aligned against prices p(1) to p(5)]

Scalar Code
[Diagram: binomial tree with nodes v(1,1) to v(4,4) and prices p(1) to p(5)]

    function [price]=binomial(n,p)
      for i=1:n
        c(i)=setup(i,p);
        v(i)=payoff(i,p);
      end
      for t=n:-1:2                  % step back from the leaves to the root
        for i=1:t-1
          vn(i)=node(v(i),v(i+1),c(i));
        end
        v=vn;
      end
      price=v(1);
    end

Vector Code
[Diagram: binomial tree with nodes v(1,1) to v(4,4) and prices p(1) to p(5)]

    function [price]=binomial(n,p)
      c(1:n)=setup(1:n,p);
      v(1:n)=payoff(1:n,p);
      i=1:n-1;
      for t=1:n-1
        v(i)=node(v(i),v(i+1),c(i));   % all lanes update at once
      end
      price=v(1);
    end
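A runnable sketch contrasting the two loop structures. It assumes a European call on a Cox-Ross-Rubinstein tree with tree-wide constants, which is a simplification of the per-node `c(i)` in the pseudocode; `crr_setup`, `node`, `binomial_scalar` and `binomial_vector` are illustrative names.

```python
import math


def crr_setup(s0, k, r, vol, T, n):
    """Leaf payoffs and discounted up/down weights for an n-step CRR tree."""
    dt = T / n
    u = math.exp(vol * math.sqrt(dt))
    d = 1.0 / u
    q = (math.exp(r * dt) - d) / (u - d)          # risk-neutral up probability
    disc = math.exp(-r * dt)
    leaves = [max(s0 * u ** j * d ** (n - j) - k, 0.0) for j in range(n + 1)]
    return leaves, disc * q, disc * (1.0 - q)


def node(v_down, v_up, c_up, c_down):
    """One node update: two multiplies and an add."""
    return c_down * v_down + c_up * v_up


def binomial_scalar(s0, k, r, vol, T, n):
    v, c_up, c_down = crr_setup(s0, k, r, vol, T, n)
    for t in range(n, 0, -1):                     # shrinking active region
        for i in range(t):
            v[i] = node(v[i], v[i + 1], c_up, c_down)   # in place
    return v[0]


def binomial_vector(s0, k, r, vol, T, n):
    v, c_up, c_down = crr_setup(s0, k, r, vol, T, n)
    for _ in range(n):                            # every lane updates every step
        v = [node(v[i], v[i + 1], c_up, c_down) for i in range(n)] + [0.0]
    return v[0]


print(binomial_scalar(100.0, 100.0, 0.05, 0.2, 1.0, 64))
print(binomial_vector(100.0, 100.0, 0.05, 0.2, 1.0, 64))
```

The vector form does redundant work in lanes that have left the active region, but those values never reach lane 0, and the fixed-width update is what maps onto a row of identical hardware nodes.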

Choosing a vector size
The problem is intrinsically SIMD
GPU: choose according to local work group size
    Each node update is ten or so instructions
FPGA: let's make the SIMD width the tree height

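A basic binomial node
[Diagram: node datapath with two multipliers (*) and two adders (+)]

Register tree-wide constants
[Diagram: the same node datapath, with two multipliers (*) and two adders (+)]

Choose SIMD width of 512
[Diagram: the node datapath replicated side by side, shown as rows of multipliers (*) and adders (+)]

Getting the tree constants into registers
[Diagram: multipliers (*) and adders (+) of adjacent nodes]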

Result: systolic binomial tree
Tree depth is limited only by resources
    Each node eval is 4 DSP blocks for 32-bit fixed-point
    We have no problem reaching n=512 in Virtex-7
Execution latency is 2*n*k + C
    n : depth of the tree
    k : depth of the node evaluation pipeline
    C : constant calculation pipeline depth
    1 * n * k : Calculate tree constants, move them into place
    1 * n * k : Evaluate the tree
Choose n=512, k=6, f=300MHz : latency is 10us
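The latency expression can be evaluated directly. The sketch below does so for the quoted parameters, taking C as 0 because no value is given; the function names are illustrative.

```python
def pass_cycles(n, k):
    """Cycles for one n*k pass (either loading constants or evaluating the tree)."""
    return n * k


def tree_latency_cycles(n, k, c=0):
    """Total latency 2*n*k + C in cycles; C defaults to 0 as an assumption."""
    return 2 * pass_cycles(n, k) + c


def cycles_to_us(cycles, clock_hz):
    return cycles / clock_hz * 1e6


n, k, f = 512, 6, 300e6
print(cycles_to_us(pass_cycles(n, k), f))          # one pass
print(cycles_to_us(tree_latency_cycles(n, k), f))  # both passes, C = 0
```

With these values one n*k pass is 3072 cycles, about 10.24us at 300MHz, and the full 2*n*k expression is about 20.48us. The quoted 10us therefore appears to match a single pass, although the slide does not say which.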

FPGA Strengths: Global clock net
GPUs are limited by point-to-point communication
    Global communication
    Global synchronisation
A huge strength of FPGAs is the clock net
    Precisely schedule 1000s of DSPs over 1000s of clock cycles
    We have multiple bits of silicon on the same clock!
    Millions of maths primitives with a known schedule

FPGA Strengths: Sheer scale
The bigger the FPGA, the more scaling helps
    Increases advantage of hierarchical composition
    Design effort in low-level primitives repaid
Doubling the functional units: constant overhead
    One more layer on fan-out to create work
    One more layer on fan-in to collect results
    Not usually the case in GPUs: more bus contention
Can choose to spend it on performance or accuracy
    O(n^-0.5) method : accuracy x 1.4, or latency x 0.71
    O(n^-1) method : accuracy x 2.0, or latency x 0.50
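The accuracy and latency factors can be reproduced from the convergence orders. The sketch assumes the error of a method scales as n^-p in the amount of work n, and reads the slide's factors as 2^p and 2^-p for a doubling of functional units; that reading is an assumption that happens to match the quoted numbers.

```python
def accuracy_gain(resource_factor, p):
    """Error reduction when work grows by resource_factor, for error ~ n**-p."""
    return resource_factor ** p


def latency_factor(resource_factor, p):
    """Reciprocal of the accuracy gain, matching the slide's latency column."""
    return resource_factor ** -p


for name, p in (("O(n^-0.5), e.g. plain Monte-Carlo", 0.5), ("O(n^-1)", 1.0)):
    print(name, round(accuracy_gain(2, p), 2), round(latency_factor(2, p), 2))
```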

Conclusion
FPGAs are not only good for latency oriented tasks
    Because they can do latency, they also do throughput
    Can push into low-latency compute that GPUs cannot
Low-latency does not mean simple
    Can complete (fairly) sophisticated Monte-Carlo in 5us
    Get throughput of 200K Monte-Carlos / sec
Particularly useful on new platforms like Xilinx Zynq
    FPGA and ARM cores sat together on the same chip
    Good for optimisation, root finding, minimisation...