Many-core Accelerated LIBOR Swaption Portfolio Pricing


2012 SC Companion: High Performance Computing, Networking Storage and Analysis

Jörg Lotze, Paul D. Sutton, Hicham Lahlou
Xcelerit, Dunlop House, Fenian Street, Dublin 2, Ireland
{jorg.lotze,paul.sutton,hicham.lahlou}@xcelerit.com

Abstract

This paper describes the acceleration of a Monte-Carlo algorithm for pricing a LIBOR swaption portfolio using multi-core CPUs and GPUs. Speedups of up to 305x are achieved on two Nvidia Tesla M2050 GPUs and up to 20.8x on two Intel Xeon E5620 CPUs, compared to a sequential CPU implementation. This performance is achieved by using the Xcelerit platform, writing sequential, high-level C++ code, and adopting a simple dataflow programming model. It avoids the complexity involved when using low-level high-performance computing frameworks such as OpenMP, OpenCL, CUDA, or SIMD intrinsics. The paper provides an overview of the Xcelerit platform, details how high performance is achieved through various automatic optimisation and parallelisation techniques, and shows how the tool can be used to implement portable accelerated Monte-Carlo algorithms in finance. It illustrates the implementation of the Monte-Carlo LIBOR swaption portfolio pricer and gives performance results. A comparison of the Xcelerit platform implementation with an equivalent low-level CUDA version shows that the overhead introduced is less than 1.5% in all scenarios.

I. INTRODUCTION

Pricing financial derivatives is one of the most important problems in computational finance. This is often a compute-intensive task, especially when considering large portfolios or derivatives with complicated features. Typically, closed-form solutions only exist for the simplest of cases, for example single-asset European options with the Black-Scholes assumptions applied [1].
In all other cases, numerical solutions have to be found, e.g., applying lattice-based methods (for example binomial or trinomial trees), finite difference schemes, or Monte-Carlo simulations. These algorithms are computationally demanding, and financial institutions worldwide are currently exploring high-performance computing (HPC) hardware such as multi-core CPUs, grids, and GPUs to deal with these demands. However, financial analysts are primarily mathematicians and typically have little or no experience in programming HPC hardware using development frameworks such as OpenMP [2], OpenCL [3], or CUDA [4], or using SIMD intrinsics.

This paper examines the acceleration of a Monte-Carlo algorithm to price a portfolio of LIBOR swaptions. The Xcelerit platform is used to exploit HPC hardware including multi-core CPUs and GPUs using high-level, portable C++ source code. The Xcelerit platform permits users to avoid the challenges associated with programming using the low-level frameworks mentioned above, yet still achieve high-performance applications. Applications can be efficiently executed on different many-core processors and are compiled from a single source codebase.

The Xcelerit platform has a GPU back-end using Nvidia's CUDA Toolkit [4], and we provide comparisons to a low-level CUDA implementation in this paper. CUDA is a proprietary programming toolkit that defines a programming model, a language based on C++, and an API for programming Nvidia GPUs. A compiler for GPU code and a large set of GPU-based libraries for specific purposes are included in the toolkit. Using CUDA requires expert knowledge about the GPU hardware architecture, the CUDA C++ language, and data-parallel programming and synchronization techniques. The Xcelerit platform avoids that complexity by providing a programming interface on a higher level of abstraction.
Fully utilising the compute power of multi-core CPUs, including their Single Instruction Multiple Data (SIMD) instruction sets, requires parallel and low-level programming expertise. The Xcelerit platform automates this task, improving programmer productivity and code portability.

The paper is structured as follows. The algorithm used to price a LIBOR swaption portfolio is presented in Sec. II. Sec. III provides an overview of the Xcelerit platform, its programming model, and how applications are developed. This section also describes the automatic parallelisation and optimisation techniques applied within the platform to achieve high performance. Monte-Carlo simulations are common in computational finance, not only for pricing financial derivatives but also in risk management algorithms. Therefore, a generic strategy for implementing financial Monte-Carlo algorithms using the Xcelerit platform is detailed in Sec. IV. This approach is applied in Sec. V to the LIBOR swaption portfolio pricing algorithm, and performance results are presented for both multi-core CPU and GPU hardware. A detailed performance comparison with an equivalent low-level CUDA implementation is given in Sec. VI, and Sec. VII concludes the paper.

II. PRICING A LIBOR SWAPTION PORTFOLIO

Swaptions are options on financial swap contracts, i.e., they provide one party with the right to enter a swap agreement at a future date, where a pre-determined fixed interest rate

is exchanged for a floating rate [5]. The London Interbank Offered Rate (LIBOR) is the interest rate applied for loans between banks, and is calculated for ten different currencies and 15 borrowing terms ranging from overnight to one year on a daily basis (i.e., it is a floating rate subject to change every day) [5], [6]. In a LIBOR swaption, the floating interest rate of the swap agreement is the LIBOR rate.

To value a portfolio of LIBOR swaptions, a stochastic model that can predict the future development of the LIBOR rate for a given term is required. Based on this model, the value of each swaption can be determined, and the overall portfolio value can be computed by applying a payoff function. The evaluation is typically performed using a Monte-Carlo simulation, i.e., a large number of different possible developments of the LIBOR rate over the time steps is simulated using random numbers, the portfolio is valued for each case, and the final result is obtained by computing an average over all paths. For high accuracy, a large number of Monte-Carlo paths is required, typically more than 100,000.

A. Algorithm

In this paper, we apply the algorithm introduced in [7], which we briefly outline in the following. We denote the forward LIBOR rate by $L_i^n$ for the time interval $[i\delta, (i+1)\delta]$, where $\delta$ is the LIBOR interval. If the simulation time-step is chosen equal to the LIBOR interval, the forward LIBOR rates at the times $n\delta < i\delta$ can be approximated by the equations

$$L_i^{n+1} = L_i^n \exp\left( \left( \sigma_{i-n-1} S_i^n - \tfrac{1}{2} \sigma_{i-n-1}^2 \right) \delta + \sigma_{i-n-1} Z^n \sqrt{\delta} \right) \qquad (1)$$

for all $i > n$ and where $n = 0, \ldots, N_{mat} - 1$. The number of time steps to maturity is denoted by $N_{mat}$. The variable $Z^n$ is a standard normal distributed random variable for the $n$-th time-step, and

$$S_i^n = \sum_{j=n+1}^{i} \frac{\sigma_{j-n-1}\, \delta L_j^n}{1 + \delta L_j^n}. \qquad (2)$$

This model treats the volatility $\sigma$ as a function of time to maturity, which remains fixed once the maturity is reached. Therefore, $L_i^{n+1} = L_i^n$ for all $i \le n$.
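The update in (1) and (2) can be illustrated with a short sequential sketch. This is not the code used in the paper: the function name `simulate_path`, its argument layout, and the use of double precision are assumptions made for illustration only. The running sum `S` accumulates the terms of (2) before each rate is overwritten, so every term uses the rate from time step $n$.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Advance the forward LIBOR rates L[0..N-1] through nmat time steps using
// (1) and (2). sigma[k] is the volatility for time-to-maturity index k,
// z[n] the standard normal sample for step n, delta the LIBOR interval.
// All names here are illustrative assumptions, not part of the paper's code.
void simulate_path(std::vector<double>& L, const std::vector<double>& sigma,
                   const std::vector<double>& z, double delta, int nmat) {
    const int N = static_cast<int>(L.size());
    assert(static_cast<int>(z.size()) >= nmat);
    assert(static_cast<int>(sigma.size()) >= N - 1);
    const double sqrt_delta = std::sqrt(delta);
    for (int n = 0; n < nmat; ++n) {
        double S = 0.0;  // running sum S_i^n from (2)
        for (int i = n + 1; i < N; ++i) {  // rates with i <= n stay fixed
            const double vol = sigma[i - n - 1];
            S += vol * delta * L[i] / (1.0 + delta * L[i]);  // uses L_i^n
            L[i] *= std::exp((vol * S - 0.5 * vol * vol) * delta
                             + vol * z[n] * sqrt_delta);     // now L_i^{n+1}
        }
    }
}
```

Each Monte-Carlo path calls this routine once with its own `z`, which is why the paths can be processed independently.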
Based on these forward LIBOR rates, the payoff $V$ of a portfolio of $N_{opt}$ swaptions with the swap rates $C_j$ and maturities $T_j$ (with $j = 0, \ldots, N_{opt} - 1$) can be computed as

$$V = \left\{ \prod_{i=0}^{N_{mat}-1} \frac{1}{1 + \delta L_i} \right\} \sum_{j=0}^{N_{opt}-1} 100 \left( 1 - B_{T_j} - C_j S_{T_j} \right)^+, \qquad (3)$$

where

$$S_m = \sum_{i=0}^{m-1} \delta B_i, \qquad (4)$$

$$B_m = \prod_{i=0}^{m-1} \frac{1}{1 + \delta L_{N_{mat}+i}}. \qquad (5)$$

Thus, to price a portfolio of LIBOR swaptions using a Monte-Carlo simulation, the following steps have to be taken:
i) Draw $N_{mat} \times N_{paths}$ standard normal random samples ($N_{paths}$ is the number of Monte-Carlo paths),
ii) Compute the LIBOR forward rates for each path using (1),
iii) Compute the portfolio payoff for each path using (3), and
iv) Average all paths to obtain the final portfolio value.

B. Greeks

For financial institutions, it is valuable to determine not only the value of an instrument, but also its sensitivity to changes in the parameters on which this value depends. This permits the application of hedging techniques to compensate for the risk associated with one asset by adding other assets with reverse characteristics to the institution's portfolio [5]. These sensitivities are denoted by Greek letters, commonly referred to as the Greeks. For the LIBOR swaption portfolio pricing detailed in this paper, we focus on the $\lambda$ Greek to serve as an example, which denotes the percentage change in derivative value per percentage change in the underlying price [8]. It is sometimes also called $\Omega$ or elasticity, and is defined as

$$\lambda = \frac{\partial V}{\partial S} \cdot \frac{S}{V}, \qquad (6)$$

where $V$ is the derivative's value and $S$ is the price of the underlying asset. Here we use an adjoint method to compute the value of $\lambda$ simultaneously with the portfolio value, as detailed in [9].

C. Computational Complexity

Acceptable accuracies for a Monte-Carlo simulation are generally achieved for 100,000 paths or more. This means that, for $N_{mat} = 40$, a total of 4 million random numbers must be drawn, and the equations outlined above have to be computed for each path.
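To make the per-path work concrete, the payoff in (3) to (5) can be sketched as a plain sequential function. This is a minimal sketch under stated assumptions: the name `portfolio_payoff`, the argument layout, and the direct use of the index conventions exactly as written in (4) and (5) are illustrative choices, and the λ computation is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Discounted payoff of a swaption portfolio for one path, following (3)-(5)
// as written above. L holds the forward rates after nmat time steps,
// maturities[j] is T_j and swaprates[j] is C_j. Names are hypothetical.
double portfolio_payoff(const std::vector<double>& L, int nmat, double delta,
                        const std::vector<int>& maturities,
                        const std::vector<double>& swaprates) {
    assert(maturities.size() == swaprates.size());
    const int M = static_cast<int>(L.size()) - nmat;  // rates beyond maturity
    assert(M >= 0);
    std::vector<double> B(M + 1, 1.0), S(M + 1, 0.0);
    for (int m = 0; m < M; ++m) {
        S[m + 1] = S[m] + delta * B[m];                 // (4)
        B[m + 1] = B[m] / (1.0 + delta * L[nmat + m]);  // (5)
    }
    double v = 0.0;
    for (std::size_t j = 0; j < maturities.size(); ++j) {
        const int T = maturities[j];
        assert(T >= 0 && T <= M);
        const double swapval = 1.0 - B[T] - swaprates[j] * S[T];
        if (swapval > 0.0) v += 100.0 * swapval;        // positive part in (3)
    }
    for (int i = 0; i < nmat; ++i) v /= 1.0 + delta * L[i];  // discounting in (3)
    return v;
}
```

The work per path grows with both the number of rates and the number of swaptions, and it is repeated for every one of the Monte-Carlo paths.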
Furthermore, when estimating the Greeks, the algorithm becomes significantly more complex. For example, a straightforward sequential C language implementation of the algorithm including Greeks for 15 swaptions, $N_{mat} = 40$, and 128K paths (here K denotes 1024) takes 20 seconds on a Xeon E5620 CPU. For a financial institution with many assets to value and which typically uses more paths for better accuracy, this time is not acceptable. Therefore, high-performance parallel processors have to be considered for this algorithm.

III. THE XCELERIT PLATFORM

The Xcelerit platform consists of the Xcelerit SDK at its core and several add-ons providing specialised functions or interfaces. The Xcelerit SDK permits the efficient use of many-core processors, i.e., multi-core CPUs and GPUs, from a single codebase written in a high-level programming language such as C++, C, C#, or Java (the code samples in this paper use the C++ API). It is available for the Linux and Windows operating systems. Its dataflow programming model, based on the Synchronous Dataflow (SDF) model of

computation [10], allows source code to be automatically optimised and parallelised without requiring programmers to include any parallel constructs. This ensures that user code is simple and portable, making it an attractive framework for implementing financial algorithms such as the LIBOR swaption portfolio pricing explained in Sec. II. Add-ons for the Xcelerit SDK include statistical functions and random number generators, as well as interfaces to Excel and Matlab. A complete package for computational finance, Xcelerit Quant, combines all add-ons relevant to this domain.

A. Programming Model

To use the Xcelerit API, compute-intensive algorithms are expressed as a dataflow graph, i.e., a succession of processing stages (termed actors) which are connected together. A series of actors is continuously applied to streams of data flowing through the graph. Source actors generate the data, generic actors take data from their input ports and compute the output data to be placed into their output ports, and sink actors consume data (for example, save it into a file). Fig. 1 shows a simple example dataflow graph, generating two streams of random numbers, computing their sum, and writing the results to memory. Generic actors can be configured using parameters and constant look-up tables.

Fig. 1. A simple Xcelerit dataflow graph, computing the sum of two random streams of numbers. Two RandomSrc source actors feed a generic Sum actor, whose output is consumed by a MemoryWriter sink actor.

To describe a program using the Xcelerit API, source actors, sink actors, and generic actors are instantiated and connected to a dataflow graph, which is then executed in parallel on the available multi-core CPUs or GPUs. The use of this programming model enables parallel execution while ensuring that common parallel programming bugs such as race conditions or deadlocks cannot be introduced.

B. Application Development

Fig. 2 gives an overview of the components of the Xcelerit SDK.
Developing applications using the Xcelerit SDK involves the following steps:
i) Identify the performance-critical part of the application through profiling,
ii) For these bottlenecks, map the algorithm to a dataflow graph,
iii) Implement the dataflow actors needed (or choose from provided library actors), and
iv) Instantiate actors, connect to a flow graph, and execute the graph.

Fig. 2. The Xcelerit SDK Architecture.

Using the C++ API, actors are implemented as a class inheriting from a common base class. Input and output ports, and optionally parameters and constant look-up tables, are added to the public interface as required. A run() method is implemented which performs the core computation of the actor, i.e., it takes the data from the input ports, processes it, and places the results into the output ports. It may use the parameter values and constant look-up tables, as well as helper functions and other classes, for the computation. These actors are then compiled by the Xcelerit Compiler Driver, which drives a set of compilation tools and existing standard compilers for CPUs and GPUs. C++ actors are compiled into a single binary file that holds both GPU and CPU code. This permits the runtime to pick the appropriate implementation depending on the available compute resources.

For composing a dataflow graph, actor objects are instantiated and connected using C++ operators. The graph can then be executed, and the Xcelerit SDK runtime will automatically handle its efficient execution on the available compute resources. This simple methodology makes the Xcelerit SDK a good candidate to exploit high-performance many-core processors for algorithms such as the LIBOR swaption portfolio pricer described in Sec. II. Through the use of a dataflow programming model, it ensures high-performance parallel execution while maintaining programmer productivity.
To ease development, the Xcelerit SDK comes with a set of development tools, e.g., a profiler, debugger integration, IDE plug-ins, and a library of often-used functions and actors. In addition, conventional CPU and GPU debuggers and profilers can be used.

C. Under the Hood

This section gives an insight into how the Xcelerit SDK works under the hood. It explains the optimisations employed in order to improve the performance. Several types of parallelism can be extracted for the efficient execution of Xcelerit programs. These types of parallelism are

extracted automatically from sequential Xcelerit-enabled user code by the Xcelerit Compiler Driver and Runtime System. Efficient parallel code is generated, and CPU and GPU memory access optimisations are carried out.

Fig. 3. Pipeline parallelism (a), and data parallelism (b) in dataflow graphs as applied by the Xcelerit SDK. The round arrow represents a concurrent thread of execution.

Fig. 4. Comparison of a sequential reduction (a) and a parallel version (b) for adding 8 values together. The sequential version takes 7 time steps to complete, while the parallel version finishes in 3.

1) Pipeline Parallelism: Pipeline parallelism is a form of task parallelism, where every stage in a processing pipeline is executed concurrently on different sets of data. This form of parallelism, illustrated in Fig. 3(a), yields a significant gain in performance, especially when the actors are of similar computational complexity.

2) Data Parallelism: It is possible to execute multiple instances of the same actor function in parallel, each working on a different set of the input and output data. This form of parallelism, illustrated in Fig. 3(b), provides efficient load balancing between concurrent executions, as all execute the same code. Data parallelism is the primary form of parallelism supported by GPUs.

3) Vectorisation (SIMD): Single-instruction multiple-data (SIMD) parallelism works at a much finer granularity. As the name implies, it applies single processor instructions to vectors of data at once, reducing the total number of instructions needed. It is a form of data parallelism which requires hardware support, i.e., the underlying hardware must provide special SIMD instructions and wide registers to hold vector data.
Modern CPUs support SIMD in different instruction sets, e.g., SSE (16-byte data vectors), AVX (32-byte data vectors), or AltiVec (different widths). GPUs from AMD also provide SIMD instructions. An AVX multiplication operation can, for example, compute the element-wise product of vectors of 8 single-precision floating point numbers simultaneously, which results in an approximately 8x faster computation compared to using scalar multiply instructions. The Xcelerit API is implemented specifically to allow back-end compilers to carry out automatic SIMD vectorization. Note that the next release, currently under development, will include more direct methods to implement vectorised computations from a high-level API.

Fig. 5. Example memory and cache hierarchy of typical CPUs (a) and GPUs (b). On the left is an example of a 6-core CPU of Intel's Nehalem architecture, and on the right is a typical GPU of Nvidia's Fermi architecture.

4) Parallel Reductions: Reductions are central to many algorithms, where partial results are combined to a final result. It is therefore crucial for the compiler to generate efficient code for reductions. There are many ways to implement reductions in parallel, where partial reductions are computed concurrently and reduced further in subsequent steps. Fig. 4 illustrates a sequential and a parallel implementation of a reduction that sums up all individual items in a buffer. The parallel sum finishes after 3 steps, whereas the sequential version needs 7 steps to complete. More generally, the parallel reduction needs $\log_2 N$ steps, while the sequential version needs $N - 1$ steps. On GPUs, shared memory (between threads) and efficient thread block sizes can be used to speed up the process.
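The step counts quoted above can be checked with a small pairwise (tree) sum. This is a sequential sketch that only mimics the structure of a parallel reduction; it is not the Xcelerit or CUDA implementation, and the name `tree_sum` is hypothetical. Every pass of the outer loop corresponds to one time step, since all additions within a pass touch disjoint elements and could run concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pairwise (tree) sum over a copy of the input. The number of passes of the
// outer loop is reported through `steps`; within one pass, all additions are
// independent of each other. Illustrative sketch only.
double tree_sum(std::vector<double> v, int& steps) {
    assert(!v.empty());
    steps = 0;
    for (std::size_t stride = 1; stride < v.size(); stride *= 2) {
        for (std::size_t i = 0; i + stride < v.size(); i += 2 * stride)
            v[i] += v[i + stride];  // combine two partial sums
        ++steps;
    }
    return v[0];
}
```

For 8 input values the outer loop runs with strides 1, 2, and 4, i.e., 3 passes, compared with the 7 additions performed one after another in the sequential version.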
The Xcelerit SDK provides a set of highly-optimised built-in reduction actors and allows users to easily configure them with their own reduction functions.

5) Memory Access Optimisations: Typically, processors have a hierarchy of memory, starting from registers (the fastest and smallest), over several levels of cache (fast, larger with every level), to external memory (slowest and largest). Being able to place data used by actors in memory at a level close to the processing element can improve the performance of the computation significantly. The Xcelerit SDK exploits memory locality and performs cache optimisations where possible. That is, it ensures that the data needed is kept physically close to

the processing element, using the memory hierarchy present in today's systems efficiently. For illustration, Fig. 5 shows the architecture of a typical CPU and GPU with different levels of cache and memory. Furthermore, the Xcelerit SDK ensures that while a thread is waiting for data to arrive from memory, other threads can compute (overlapping computation with memory access). For GPUs, Xcelerit applications make efficient use of fast shared memory and registers where possible. Fast global memory access is ensured where possible (coalescing). All of these optimisations ensure that all processors involved in the computation are kept busy and are not held back by memory access latencies. Note that these optimisations are automatic and do not require code annotations.

IV. MONTE-CARLO SIMULATIONS USING THE XCELERIT PLATFORM

Monte-Carlo simulations involve simulating a large number of random paths and computing the per-path results individually. These results are then combined using statistical analysis in order to compute the final results. Generally, the more paths used in the simulation, the more accurate the result. For pricing financial derivatives using a Monte-Carlo method, typically the following steps are followed:
i) Generate random samples,
ii) Calculate the payoff for each path, using random samples,
iii) Average the payoffs of all paths, and
iv) Save the results.

The final result is a numerical estimate for the value of the derivative. This approach can be used to obtain the price of derivatives with different complexity. For example, several sources of uncertainty can easily be incorporated, correlations between underlyings in multi-asset options¹ can be modelled, or different probability distributions can be incorporated.

Fig. 6. Typical Xcelerit Dataflow Graph for Monte-Carlo Simulations. The stages are RandomSrc (random samples), PathCalc (payoff calculation), Reduce (reduce paths/option), and Writer (write results).
Furthermore, Monte-Carlo simulations can also be applied in risk management and other areas, following a similar approach. They are a powerful tool for financial analysts. An Xcelerit dataflow graph for the general steps involved in a Monte-Carlo simulation is shown in Fig. 6. It is a straightforward mapping of the algorithm explained above, and can be implemented directly.

A. Random Number Generation

It is critical for any Monte-Carlo simulation that the pseudorandom number generator used exhibits the desired statistical properties and has a long enough period to avoid repetitions of the data. The accuracy of the final result is highly dependent on the quality of the random number generator used. The Xcelerit SDK Statistics Add-on provides a set of high-quality generators. Note that user-defined generators can also be implemented for specific requirements.

¹ Multi-asset options are also called basket options or rainbow options.

B. Per-Path Computation

This part of the algorithm is typically straightforward, as each path can be computed independently using different random numbers. Payoff functions can be of arbitrary complexity, depending on the type of derivative to be priced. Simple functions only need the final value of the underlying assets, while others might involve weighted payoffs of several underlyings. For risk computations, the per-path calculation can be of a different nature, for example computing the expected loss for each scenario. With the Xcelerit platform, this part is implemented within a custom generic actor, taking the per-path random numbers from its input port and computing the associated payoff for the output port.

C. Reduction and Writer

Averaging the individual per-path payoffs can be achieved by using the provided parallel reduction actors in the Xcelerit SDK. Usually multiple derivatives must be priced, which means a block-wise reduction is required to average all paths per option.
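As an illustration of what a block-wise reduction computes, the following sequential sketch averages the per-path payoffs separately for each option. The function name `blockwise_mean` and the assumed data layout (all paths of one option stored contiguously) are illustrative assumptions; the SDK's reduction actors perform the equivalent operation in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Block-wise mean: `payoffs` is assumed to hold paths_per_option consecutive
// per-path values for each option; one mean per option is returned.
std::vector<double> blockwise_mean(const std::vector<double>& payoffs,
                                   std::size_t paths_per_option) {
    assert(paths_per_option > 0 && payoffs.size() % paths_per_option == 0);
    std::vector<double> means;
    means.reserve(payoffs.size() / paths_per_option);
    for (std::size_t start = 0; start < payoffs.size(); start += paths_per_option) {
        const double sum = std::accumulate(payoffs.begin() + start,
                                           payoffs.begin() + start + paths_per_option,
                                           0.0);
        means.push_back(sum / static_cast<double>(paths_per_option));
    }
    return means;
}
```

Choosing a block size equal to the total number of values yields a single mean, which corresponds to the full reduction used when only one instrument is valued.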
The block-wise reduction outputs a stream of option values (as illustrated in Fig. 6), which are received by a Writer sink actor and, for example, stored to a file. If only a single instrument is valued, as in the LIBOR swaption portfolio case, a full reduction can be used. This is a sink actor which simply reduces all per-path values on its input and stores the final result in a variable. The additional Writer component is not required in this case. If the provided reductions are not sufficient, users can configure a generic reduction sink actor with their own reduction function, which will then be executed in parallel.

V. LIBOR SWAPTION PRICING WITH THE XCELERIT SDK

This section presents the implementation of the LIBOR swaption portfolio pricing algorithm described in Sec. II using the Xcelerit SDK. Performance results for a number of different configurations are presented and compared with a sequential reference implementation.

A. Implementation

The first step in the implementation is to map the algorithm to a dataflow graph, as shown in Fig. 7. The algorithm includes the computation of the λ Greek, hence the two sink actors: one for the swaption portfolio value and another for the associated λ. Only a single portfolio is priced, rather than multiple options, and therefore a full reduction sink can be used for both outputs (instead of a block-wise reduction and writer combination, as described in Sec. IV-C). To provide the random samples, the Xcelerit-provided source actor RandomSrc is set up with the MRG32K3a

random number generator [11] and a normal distribution. The number of samples needed, the mean and standard deviation, and the generator seed are set up in the constructor.

Fig. 7. Xcelerit Dataflow Graphs for LIBOR Swaption Portfolio Pricing (with λ Greek). The RandomSrc actor (normal distribution) feeds the LiborSwaptionGreek actor (payoff and λ calc.), whose two outputs go to one MeanReduce sink for the portfolio value and one MeanReduce sink for the λ value.

The LiborSwaptionGreek actor is user-defined, with an outline given below:

    class LiborSwaptionGreek : public Actor {
    public:
      Input<float> z;                  // rand. sample
      Output<float> v, lb;             // value/lambda
      Constant<float> swaprates, L0, lambda;
      Constant<int> maturities;
      Parameter<int> nopt;
      ...                              // setup / construct

      // core algorithm
      actor void run() const {
        float L[NN], L2[L2_SIZE];      // temporary values
        float *L_b = L;
        // copy initial LIBOR rates from constant
        copy(L0.begin(), L0.end(), L);
        // LIBOR rates calc., store reverse paths
        path_calc_b1(L, L2);
        // compute portfolio value, updating L_b
        v[0] = portfolio_b(L, L_b);
        // Reverse path calc for Greek
        path_calc_b2(L_b, L2);
        // Greek is the last entry in L_b
        lb[0] = L_b[NN-1];
      }

    private:
      actor void path_calc_b1(float* L, float* L2) const;
      actor float portfolio_b(float* L, float* L_b) const;
      actor float path_calc_b2(float* L_b, float* L2) const;
    };

As can be seen, the actor has one input for the random samples, and two outputs for the portfolio value and λ. It is initialised with the swaption data, i.e., the number of swaptions, maturities, and swap rates, and a set of initial LIBOR rates and λ values. From these, the portfolio value and λ are computed using the random input samples. The two reduction sinks are provided by the Xcelerit SDK; they simply compute the mean of all input values.

The code for the actor instantiation, construction of the dataflow graph, and graph execution is as follows:

    // instantiate actors
    RandomSrc<float, RNDGEN_MRG32K3a, RNDDIST_NORMAL>
        samplegen(numpaths*nummat, SEED, 0.0f, 1.0f);
    LiborSwaptionGreek libor(swaprates, maturities, nopt, lambda0, L0);
    MeanReduce<float> meanvalue(&value);
    MeanReduce<float> meanlambda(&lambda);

    // construct dataflow graph
    Flowgraph f;
    f += samplegen >> libor,
         libor.v >> meanvalue,
         libor.lb >> meanlambda;

    // execute graph
    f.run();

It constructs objects of all needed actors, creates a dataflow graph by connecting ports using the >> operator, and runs the graph. This executes the application in a highly efficient way on multi-core CPUs or GPUs, depending on the available resources.

TABLE I. SPEEDUPS ACHIEVED FOR THE LIBOR SWAPTION PRICER USING 512K PATHS (GPUS: TESLA M2050).

    Precision   1 GPU   2 GPUs
    single      155x    297x
    double      89x     171x

B. Results

To evaluate the performance of the Xcelerit platform implementation, the application is executed on a system with the following configuration:

    CPUs: 2 Intel Xeon E5620, hyperthreading off (8 cores)
    GPUs: 2 Nvidia Tesla M2050, ECC off
    RAM: 24GB
    OS: RedHat Enterprise Linux 5.4, 64bit
    CUDA SDK version: 4.2
    GPU driver version:
    Xcelerit SDK version:
    Compiler: NVCC with GCC 4.1 as host compiler
    Compiler flags: -O3 -DNDEBUG

Execution times for the LIBOR swaption pricing algorithm (with Greeks) are measured for single and double precision, using path numbers between 4K and 1024K (here K is 1024), a portfolio of 15 swaptions, and $N_{mat} = 40$. The full dataflow graph execution is measured, including the random number generator and reduction as well as the CPU/GPU data transfers (managed by the Xcelerit SDK). These execution times are compared to an equivalent sequential implementation using a single CPU core. This sequential implementation uses the host API (CPU) of Nvidia's curand random number generator [12] to ensure that a generator of a quality comparable to Xcelerit's built-in random number generator is used. Fig.
8 shows the speedups achieved on GPU hardware with one and two GPUs. Using a single GPU, speedups of up to 155x can be realised for a single precision implementation for 512K paths. By using two M2050 GPUs, this speedup figure can be increased to 297x. This is an improvement by a factor of 1.9 when adding an extra GPU, which shows very good scalability, without changing the source code or re-compiling. The remaining gap to perfect (2x) scaling can be explained by the unavoidable overhead involved in managing multiple GPUs and splitting the data and computation between them.
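As a worked check of the scaling figure, using only the speedups reported for 512K paths, the ratio of two-GPU to one-GPU speedup and the corresponding parallel efficiency (the ratio divided by the number of GPUs, a derived quantity not stated in the measurements themselves) are:

```latex
\begin{align*}
\text{single precision:}\quad & \frac{297}{155} \approx 1.92, &
  \frac{1.92}{2} &\approx 96\% \\
\text{double precision:}\quad & \frac{171}{89} \approx 1.92, &
  \frac{1.92}{2} &\approx 96\%
\end{align*}
```

Both precisions therefore scale almost identically when the second GPU is added.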

All GPU results for 512K paths are also summarised in Tab. I.

Fig. 8. LIBOR swaption pricing performance on the GPU for double and single precision, compared to a sequential implementation running on a single core. The plot shows the speedup over the number of paths (4K to 1024K) for one and two Nvidia Tesla M2050 GPUs.

TABLE II. TIME COMPARISON XCELERIT SDK VS. CUDA FOR LIBOR SWAPTION PORTFOLIO PRICER (TIMES IN MILLISECONDS).

                Paths   All     DT¹     RNG²    CE³     RED⁴
    CUDA        32K     42.0
    Xcelerit    32K
    Overhead    32K     0%      90%     8%      -2%     0%
    CUDA        128K
    Xcelerit    128K
    Overhead    128K    1.5%    70%     11%     -1.2%   67%
    CUDA        512K
    Xcelerit    512K
    Overhead    512K    1.4%    40%     24%     -0.8%   67%

    ¹ CPU/GPU data transfers and memory allocations
    ² Random number generation
    ³ Core LIBOR per-path calculation
    ⁴ Reductions (mean of all per-path values)

Speedups on multi-core CPUs range between 7.4x and 7.9x for 128K paths and more (using the 8 CPU cores in the test system). The difference between single and double precision is insignificant on the CPU. A new version of the Xcelerit SDK providing a direct, simple API for vectorised computations is currently under development. Using this, preliminary results have shown speedups of 20.8x (single precision) and 13.6x (double precision) on the same system for 256K paths. This significantly higher speedup is due to a more efficient use of SIMD, and the difference between single and double precision is due to the different number of vector elements fitting into the available SIMD registers (128 bits wide).

VI. COMPARISONS WITH A LOW-LEVEL CUDA IMPLEMENTATION

In this section, we compare the performance and overhead of using the Xcelerit SDK for the LIBOR swaption portfolio pricer on GPUs with an equivalent low-level implementation using CUDA directly.

A. CUDA Implementation

The CUDA reference implementation which serves as a basis is available from Oxford University [13].
We believe this code has a reasonable optimisation level which reflects the performance of the CUDA framework. The kernels have been left untouched, but for comparison fairness the random number generator has been replaced with Nvidia's curand library [12] (as the originally-used generator is not publicly available), and the mean reduction, originally done on the CPU, has been replaced by a parallel GPU-based version using the Thrust library [14]. Further, the number of threads per block has been adjusted from the original 64 (optimised for older GPUs) to 256 to be optimal for the Tesla M2050. All other code is identical to the original implementation [13].

As the reference CUDA implementation uses single precision floats and executes on a single GPU, we compare with the equivalent variant of the Xcelerit SDK implementation. Adding multiple GPUs to the CUDA version would involve changes to the program architecture, using multiple host threads and a different approach to management of the data transfers. With the Xcelerit SDK, this is all handled automatically. Additional GPUs or CPU cores can be used without the need for source code changes or even recompilation of the application. A single binary makes it possible to fully leverage the available processing hardware.

B. Metrics

For the tests, the same machine configuration as mentioned in Sec. V is used. The following metrics are used for comparison:

    Overall application runtime
    Individual times: data transfers and memory allocations (DT), random number generation (RNG), core execution (CE), reduction (RED)
    Visual Profiler performance and efficiency metrics (detailed below)

These are compared for a range of different Monte-Carlo path numbers.

C. Results

Tab. II shows the overall application runtime as well as a breakdown of the individual tasks for a range of different Monte-Carlo path numbers.
As can be seen, the overall runtime of the Xcelerit SDK version is within 1.5% of the CUDA version in all cases. Slightly more time is taken for the memory allocations and data transfers with the Xcelerit SDK, due to buffering of data between actors. The random number generator also takes slightly more time than the curand version used in the reference implementation. This is due to the more generic implementation of the Xcelerit SDK version. The most relevant part for the overall application is the core LIBOR function (CE in the table), which computes the per-path portfolio values for all Monte-Carlo paths; all other parts are insignificant for the overall result. From the results in Tab. II it can be seen that the Xcelerit SDK is slightly faster than the original CUDA version in this part.

In the following we take a closer look at the core execution function by using the Nvidia Visual Profiler [15]. The following metrics reported by the profiler have been chosen for comparison:
- Reg/Thr: Number of registers used per GPU thread (for information only)
- GlbLd: Global Load Efficiency, i.e., efficiency of reading from global device memory (higher is better)
- GlbSt: Global Store Efficiency, i.e., efficiency of writing to global device memory (higher is better)
- DRAM: Utilization of the available DRAM bandwidth (higher is better)
- Branch: Branch divergence overhead (lower is better)
- Occup: Occupancy of the available processor cores (higher is better)
The results are presented in Tab. III.

TABLE III
NVIDIA VISUAL PROFILER METRICS FOR XCELERIT SDK VS. CUDA FOR LIBOR SWAPTION PORTFOLIO PRICER.
          Paths  Reg/Thr  GlbLd  GlbSt  DRAM   Branch  Occup
CUDA      32K             %      100%   42.8%  12.5%   55.4%
Xcelerit  32K             %      100%   46.8%  12.2%   56.3%
CUDA      128K            %      100%   48.3%  12.4%   64.1%
Xcelerit  128K            %      100%   52.3%  12.2%   64.3%
CUDA      512K            %      100%   49.5%  12.5%   65.5%
Xcelerit  512K            %      100%   53.5%  12.1%   65.7%

As can be seen, most of the metrics show approximately the same results for the CUDA kernel and the Xcelerit SDK actor. The biggest difference is in the global load efficiency, where the Xcelerit SDK achieves approximately 92% and the CUDA kernel only 20%. This explains the difference in the core execution times reported in Tab. II. The Xcelerit SDK ensures that the memory accesses on the device are coalesced, i.e., the data and memory reads are arranged in a fashion that avoids the serialisation of threads within a warp (a group of 32 threads) while accessing memory. This is not always possible, but in this application the benefit clearly shows.

D. Summary
As shown in this section, the overhead of the Xcelerit SDK compared to a low-level CUDA implementation is negligible for the LIBOR swaption portfolio pricing algorithm. This small overhead is the cost of developing algorithms at a much higher level (increasing productivity), with the added benefit of generating portable binaries that can run on any number of GPUs or multi-core CPUs.

VII. CONCLUSIONS
This paper has presented the acceleration of a Monte-Carlo LIBOR swaption portfolio pricer using the Xcelerit platform. It has shown that a dramatic performance increase can be achieved on GPUs (up to 305x on 2 Nvidia Tesla M2050 GPUs) while avoiding the complexity of low-level programming frameworks. The same application can also be executed on multi-core CPUs without recompilation, achieving speedups of up to 20.8x on 2 Intel Xeon E5620 CPUs. Comparison with an equivalent low-level CUDA implementation has shown that the performance overhead added by the Xcelerit platform is very light (<1.5%). Details on how the Xcelerit platform achieves this high level of performance have been presented and a general strategy for implementing financial Monte-Carlo algorithms has been outlined. Thus, it has been shown that the Xcelerit platform can be used to implement complex financial algorithms such as the LIBOR swaption pricer using straightforward high-level programming techniques, while still achieving high performance and portability across HPC processing platforms.

REFERENCES
[1] F. Black and M. Scholes, "The pricing of options and corporate liabilities," The Journal of Political Economy, pp ,
[2] OpenMP Application Program Interface, OpenMP Architecture Review Board, Rev. 3.1, Jul [Online]. Available: org/mp-documents/openmp3.1.pdf
[3] (2011) OpenCL - The open standard for parallel programming of heterogeneous systems. Khronos Group. [Online]. Available:
[4] (2012) NVIDIA CUDA Toolkit. NVIDIA Corporation. [Online]. Available:
[5] J. C. Hull, Fundamentals of Futures and Option Markets, 7th ed., D. Battista, Ed. Pearson Education, Inc.,
[6] E. V. Murphy, "LIBOR: Frequently asked questions," Congressional Research Service, Washington, DC, CRS Report for Congress R42608, Jul [Online]. Available: pdf
[7] M. Giles, "Libor notes," [Online]. Available: ox.ac.uk/gilesm/libor/libor_notes.pdf
[8] E. G. Haug, The Complete Guide to Option Pricing Formulas. McGraw-Hill Professional,
[9] M. Giles, "Monte carlo evaluation of sensitivities in computational finance," Oxford University Computing Laboratory, Oxford, UK, Tech. Rep. 07/12, Jun [Online]. Available: uk/1090/1/na pdf
[10] E. A. Lee and D. G. Messerschmitt, "Synchronous data flow," Proc. IEEE, vol. 75, no. 9, pp , Sep
[11] P. L'Ecuyer, R. Simard, E. J. Chen, and W. D. Kelton, "An object-oriented random-number package with many long streams and substreams," Oper. Res., vol. 50, no. 6, pp , Nov
[12] (2012) NVIDIA curand Random Number Generation library. NVIDIA Corporation. [Online]. Available: cuda/curand
[13] M. Giles. (2007) Libor monte carlo application. Oxford University Computing Laboratory. Oxford, UK. [Online]. Available: http://people.maths.ox.ac.uk/gilesm/hpc/
[14] (2012) Thrust library. NVIDIA Corporation. [Online]. Available:
[15] (2011) NVIDIA Visual Profiler. NVIDIA Corporation. [Online]. Available:
