GTC 2016
HPC IN THE POST-2008 CRISIS WORLD
Pierre SPATZ, MUREX, 2016
STANFORD CENTER FOR FINANCIAL AND RISK ANALYTICS
HPC IN THE POST-2008 CRISIS WORLD
Pierre SPATZ, MUREX, 2016
BACK TO 2008
FINANCIAL MARKETS: THE PICTURE BEFORE 2008
- Margins are high; regulation costs are small
- Flexibility of the tools, invention of new exotic features and time to market count more than performance
- Tier 1 and big Tier 2 banks have no budget issues and invest in huge grids of computers
- Other banks act more as intermediaries reselling products and need only an informative present value
- Code is mainly single-threaded
- Most quants focus only on the mathematics, disregarding IT problems, and we are no different
2015 Murex S.A.S. All rights reserved 4
MUREX POSITION: THE PICTURE BEFORE 2008
- We are already a leader in our market
- Tier 1 banks plug their own models into our system and like it for being fully integrated from front office to processing
- Murex front office teams invest heavily in risk measures, scenario flexibility, complex sensitivities for nested calibration cases, and automatic grid management
- The quality of our financial model library is close to that of the biggest banks
- Our customers who want to challenge Tier 1 banks like our models but do not want to invest in a huge infrastructure
COMPUTATION NEEDS IN FINANCE: THE PICTURE BEFORE 2008
- Pricing and front office risk management of:
- Exotic structured products with scripted payoffs, evaluated by Monte-Carlo
- Credit derivatives
- American and barrier options, evaluated by partial differential equations
- 1-year historical value at risk as a night batch
COMPUTERS, CHIPS AND TOOLS: THE PICTURE IN 2008
- Xeon and Opteron have 4 cores and we have no practice of parallel programming
- Sun Microsystems doesn't belong to Oracle yet, and Solaris on SPARC processors is still preferred by our customers
- Quants love Excel and IT wants us to do everything in Java
- The PlayStation 3, with its Cell processor, is available worldwide and can be used and programmed as a workstation under Yellow Dog Linux
- RoadRunner, featuring a double-precision-friendly Cell processor, becomes the first computer to pass the petaflop barrier
- NVIDIA gaming GPUs are said to be programmable using something called CUDA, and the first Unix servers with Tesla cards are delivered to some universities and research centers
- We are playing with our first iPhones, powered by a low-consumption ARM processor
CELL & ARM
- They are mostly CPUs, like Intel Xeons
- ARM processors achieve better performance per watt by implementing simpler instructions and running at lower frequency
- Cell processors achieve better performance for the same number of transistors by implementing wide vector functions inside simpler and slower cores, replacing cache with cores, and making the programmer responsible for accessing memory through explicit, high-latency instructions
- The Cell processor was extremely complex to program and is deprecated today, but we can consider the Xeon Phi, which features a cache, its natural descendant
CPU & GPU: THEY ARE BOTH BUNCHES OF CORES
- CPU multi-cores run at high frequency and are optimized for fast execution of single-threaded code with an unpredictable execution stack; GPU many-cores run at low frequency and are optimized for batch execution of the same set of instructions across the board
- CPUs are not specialized in computation; GPUs are flops machines
- CPUs can handle a huge amount of memory; GPU memory is limited but has high bandwidth
- CPU cores have fast access to memory thanks to a huge and fast L2/L3 cache; GPU cores access memory with a latency but hide it by doing something else
- CPU cores have a fast L1 cache managed automatically; GPU cores have a fast local memory managed by the programmer
- CPU parallelization is better implemented at the level of the task; GPU parallelization is better implemented at the level of the data
- CPU multithreading is software-managed; GPU multithreading is hardware-managed
GPU & 2008 PROBLEMS
EXOTIC STRUCTURED PRODUCTS: MONTE-CARLO WITH SCRIPTED PAYOFFS ON GPU
- Monte-Carlo is embarrassingly parallel
- Best performance: payoff scripting/DSL by path — generate and compile CUDA/OpenCL kernels; in practice you are limited by the number of registers per CUDA core and the complexity of the payoff
- Best flexibility: payoff scripting/DSL by date — use your preferred interpreted scripting language on the CPU and implement vector-based operations on the GPU; in practice you are limited by the memory bandwidth of the GPU
- Choose a good random number generator to cope with a flexible implementation and be able to replay part of the Monte-Carlo for optimization purposes; in practice, D. E. Shaw's Philox is great
THE LATENCY PROBLEM
- GPUs are only efficient when treating big problems, and there is real latency when launching kernels
- In practice, reshape your code to see more problems at the same time (sensitivities, scenarios, trades, ...), but keep in mind that GPU memory is limited
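The "see more problems at once" advice can be illustrated on the CPU, with NumPy broadcasting standing in for one big kernel launch: a single batched evaluation over a (trades × paths) array replaces many small per-trade evaluations. All trade data below is made up for the illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n_trades, n_paths = 8, 50_000
r, T = 0.02, 1.0
S0    = rng.uniform(80.0, 120.0, n_trades)    # spot per trade (dummy data)
K     = rng.uniform(80.0, 120.0, n_trades)    # strike per trade
sigma = rng.uniform(0.1, 0.5, n_trades)       # vol per trade
Z = rng.standard_normal((n_trades, n_paths))  # one normal per (trade, path)

# Batched: every trade and path priced in one vectorized sweep — the
# analogue of launching one big kernel instead of n_trades tiny ones.
ST = S0[:, None] * np.exp((r - 0.5 * sigma[:, None] ** 2) * T
                          + sigma[:, None] * np.sqrt(T) * Z)
pv_batched = np.exp(-r * T) * np.maximum(ST - K[:, None], 0.0).mean(axis=1)

# Looped: one small problem at a time (what we want to avoid on a GPU)
pv_looped = np.array([
    np.exp(-r * T) * np.maximum(
        S0[i] * np.exp((r - 0.5 * sigma[i] ** 2) * T
                       + sigma[i] * np.sqrt(T) * Z[i]) - K[i], 0.0).mean()
    for i in range(n_trades)
])

assert np.allclose(pv_batched, pv_looped)
```

The two results are identical; on a GPU, the batched shape also amortizes the launch latency over all trades, at the price of holding the whole (trades × paths) workspace in limited device memory.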
OPTION PRICING AND CALIBRATION SOLVED BY PARTIAL DIFFERENTIAL EQUATIONS
- LU solvers are not GPU-friendly since they are sequential
- Choose instead a divide-and-conquer algorithm like PCR: n log(n) operations, but in only log(n) steps
- Stencil computation is more about accessing inputs than doing computation: keep your data in local memory as much as possible
- 1D problems are not big enough to feed a GPU, but you have many of them in your portfolios
Worked example, one cyclic-reduction step on the system 2x_i - x_{i-1} - x_{i+1} = 1 with seven unknowns a..g:
2a - b = 1
-a + 2b - c = 1
-b + 2c - d = 1
-c + 2d - e = 1
-d + 2e - f = 1
-e + 2f - g = 1
-f + 2g = 1
Adding each even row to half of each of its two neighbours eliminates a, c, e, g — all in parallel:
b - d/2 = 2
-b/2 + d - f/2 = 2
-d/2 + f = 2
One more step eliminates b and f:
d/2 = 4
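The reduction above can be sketched as a small solver. This is a serial NumPy emulation, assuming a system size of 2^k - 1; on a GPU each level would run as one kernel with one thread per kept row:

```python
import numpy as np

def cyclic_reduction(a, d, c, b):
    """Solve a tridiagonal system by cyclic reduction (the idea behind PCR).
    Row i reads: a[i]*x[i-1] + d[i]*x[i] + c[i]*x[i+1] = b[i];
    a[0] and c[-1] are ignored. len(d) must be 2**k - 1."""
    a = np.asarray(a, float).copy(); a[0] = 0.0
    c = np.asarray(c, float).copy(); c[-1] = 0.0
    d = np.asarray(d, float); b = np.asarray(b, float)
    n = len(d)
    if n == 1:
        return np.array([b[0] / d[0]])
    odd = np.arange(1, n, 2)              # rows kept at this level
    alpha = a[odd] / d[odd - 1]           # eliminates the left neighbour
    beta = c[odd] / d[odd + 1]            # eliminates the right neighbour
    x = np.empty(n)
    # half-size tridiagonal system in the odd unknowns, solved recursively
    x[odd] = cyclic_reduction(
        -alpha * a[odd - 1],
        d[odd] - alpha * c[odd - 1] - beta * a[odd + 1],
        -beta * c[odd + 1],
        b[odd] - alpha * b[odd - 1] - beta * b[odd + 1])
    # back-substitute the even unknowns (also data-parallel)
    even = np.arange(0, n, 2)
    left = np.where(even > 0, a[even] * x[np.maximum(even - 1, 0)], 0.0)
    right = np.where(even < n - 1, c[even] * x[np.minimum(even + 1, n - 1)], 0.0)
    x[even] = (b[even] - left - right) / d[even]
    return x

# The slide's system: 2*x[i] - x[i-1] - x[i+1] = 1, seven unknowns a..g
n = 7
x = cyclic_reduction(np.full(n, -1.0), np.full(n, 2.0), np.full(n, -1.0),
                     np.ones(n))
```

Consistent with the worked example, the middle unknown d comes out as x[3] = 8.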
BACK TO TODAY
FINANCIAL MARKETS: THE PICTURE TODAY
- Lower margins, higher volumes; regulation costs are high
- We see a trend toward exotic standardization, but we still have 40-year PRDCs in our books
- Tier 1 banks and Murex have had GPUs in production for some time and continue to invest, while other experiments, like FPGAs for Monte-Carlo, have failed
- GPUs are mainstream in supercomputers and are here to stay
- Medium-size banks are obliged to manage their risk and run their VaR on exotic portfolios, even when trades are asset-swapped and theoretically risk-free
- CVA is our day-to-day topic, and investing in computers alone, without rewriting code to be efficient and parallel-friendly, is no longer an option
- A good quant is also a good computer science expert
CVA & PFE
- A Monte-Carlo with a reduced set of paths over all the trades done with a counterparty
- Where we need to retrieve all PVs for all future paths and dates, for flexible aggregation and drill-down analysis later
- Where the composition and volume of a counterparty's trades may be very different
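As a toy illustration of what gets aggregated from that cube of PVs, a unilateral CVA can be computed from the simulated (paths × dates) matrix. The flat discount rate, flat hazard rate and LGD below are simplifying assumptions for the sketch, not what a production system would use:

```python
import numpy as np

def cva_from_pv_cube(pv, times, r=0.02, hazard=0.01, lgd=0.6):
    """Unilateral CVA from simulated netted PVs.
    pv[p, t] = netted portfolio PV on path p at date times[t] (in years).
    Flat discount rate r and flat hazard rate are simplifying assumptions."""
    times = np.asarray(times, float)
    epe = np.maximum(pv, 0.0).mean(axis=0)   # expected positive exposure profile
    df = np.exp(-r * times)                  # discount factors
    surv = np.exp(-hazard * np.concatenate(([0.0], times)))
    dpd = surv[:-1] - surv[1:]               # default probability per interval
    return lgd * np.sum(df * epe * dpd)

# Dummy cube: 4 paths, 2 dates, PV constant at 100 on every path
times = np.array([1.0, 2.0])
pv = np.full((4, 2), 100.0)
cva = cva_from_pv_cube(pv, times, r=0.0, hazard=0.01)
```

With r = 0 and a constant exposure of 100, this reduces to lgd * 100 * (1 - exp(-2 * hazard)), which is a handy sanity check.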
CVA: A FLAVOR OF THE DIFFICULTY
- Swaps (LCH), foreign branches, caps, exotics, many other small counterparties
- 1 TB of results generated when computing sensitivities for a medium-size bank, and far more for a Tier 1
- Treating all trades or all counterparties as equivalent would be a mistake when building a system
- Recomputing everything in case of a failure is not an option
HPC FOR CVA
- Group vanilla trades and evaluate them together on GPUs, independently of their counterparty, in a compute-centric cluster (aka small nodes)
- Use GPU American Monte-Carlo with non-linear regression for exotic trades
- Use specific boxes with enough memory for aggregation in a big data-centric subcluster (aka big nodes)
- Use parallel, fast flash file storage as an intermediate buffer and checkpoint for the calculation chain, to ensure performance and reliability
- Use an InfiniBand network as interconnect, able to convey several GB per second
BACK TO THE FUTURE
THE PICTURE TOMORROW
- FRTB, with up to 15,000 scenarios using front office models
- MVA, which leads to computing a historical VaR inside a Monte-Carlo, in a scalable manner, over all trades done with a CCP
- AD and AAD are back in the game but are no game changers yet
- Always faster and more flexible GPUs
- Cars become self-aware
AD AND AAD IN A NUTSHELL
- AD is the good old forward pathwise method for computing sensitivities, but done automatically by tools
- AAD is the same method, but it generates sensitivities to all inputs and intermediate values in a single additional backward sweep, at a ridiculously small compute cost
- AAD can be implemented with special compilers, which are only partially compatible with GPUs, or by overloading the basic C++ scalar operators used to program the Monte-Carlo, which is totally GPU-friendly
- The overloaded operators keep a record (the tape) of all operations and intermediate results of the forward sweep. The tape is then played backward on all paths in parallel, and the derivatives per path are computed with the chain rule, keeping future results constant: for p = p(z(x, y)), dp/dx = (dp/dz)(dz/dx) and dp/dy = (dp/dz)(dz/dy)
- The resulting sensitivities are finally the expectation of the sensitivities computed for each path
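A minimal sketch of the operator-overloading approach, in Python rather than C++ for brevity: each arithmetic operation records its parents and local derivatives (the tape), and one backward sweep applies the chain rule. Real AAD tools are of course far more elaborate:

```python
import math

class Var:
    """A value plus its tape entry: parents and local derivatives."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent Var, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))
    __rmul__ = __mul__

    def exp(self):
        v = math.exp(self.value)
        return Var(v, ((self, v),))

def backward(out):
    """One backward sweep over the tape, applying the chain rule."""
    order, seen = [], set()
    def visit(v):                 # topological order of the recorded graph
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(out)
    out.grad = 1.0
    for v in reversed(order):     # children first, so grads are complete
        for p, d in v.parents:
            p.grad += d * v.grad

# Example: z = x*y + exp(x); one sweep yields both dz/dx and dz/dy
x, y = Var(2.0), Var(3.0)
z = x * y + x.exp()
backward(z)
```

Here `x.grad` comes out as y + exp(x) = 3 + e^2 and `y.grad` as x = 2, both from the single backward pass, which is the whole point of AAD.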
AD AND AAD: PROMISING, GOOD FOR VANILLAS, BUT
- The method is simple, but the implementation can be tricky: everything should be done to have kernels generic enough to keep the GPU fed while avoiding race conditions
- To obtain the best performance, one still needs to rework the order of operations inside the computation tree, which often makes the method incompatible with cases where we want to keep full flexibility at the level of the post-aggregation of several detailed Monte-Carlo results
- AAD is not applicable to all complex exotics, even if the vibrato smoothing method helps
- AAD doesn't solve the stress test and historical VaR problems
- AAD is also said to be memory-bound; well implemented, it is only memory-bandwidth-bound
PASCAL: THE MEMORY BANDWIDTH JUMP FOR A SINGLE GPU
Memory bandwidth per GPU (GB/s): C1060: 102, M2090: 178, K40: 288, Pascal generation: ~1000
- It is the first time since 2008 that the number of bytes per flop has increased for a single GPU across a generation change, and maybe the last
- Our AAD code will simply be 3.5x faster on the next generation; most of our algorithms are at least partially limited by the memory bandwidth of the GPUs and will show huge benefits
SIERRA SUPERCOMPUTER (2017-2018): A FULL-FLEDGED CVA RISK SYSTEM IN A NODE
- The revival of big nodes
- The flops of 8 K40s
- A lot of CPU cores and memory to prepare inputs, convert outputs, interpret scripts, aggregate, query, ...
- Enough GPU/CPU interconnect speed to retrieve CVA or MVA profiles unnoticed
- NVRAM to replace external flash array storage
- Enough network bandwidth to have the flexibility of keeping results locally or remotely
- Bilateral MVA with SIMM at the same cost
- CCP MVA with full revaluation using only a few nodes
THANK YOU!
PARIS | NEW YORK | SINGAPORE
linkedin.com/company/murex | twitter.com/murex_group | www.murex.com | info@murex.com