Stochastic Local Volatility & High Performance Computing

Stochastic Local Volatility & High Performance Computing

A thesis submitted to The University of Manchester for the degree of Master of Philosophy in the Faculty of Humanities

Zaid AIT HADDOU

Manchester Business School


Contents

1 Introduction
   Thesis aim & contribution
   Thesis Structure
2 Literature Review
   Stochastic Volatility
   Local Volatility
   Stochastic Local Volatility
   ADI & HPC
3 The SLV Model
   SLV Model dynamic
   SLV calibration
      Leverage function calibration
      Dupire formula for local volatility
      Fokker-Planck PDE
4 ADI Implementation
   ADI Scheme
   Grid generation
      Spatial grid generation
      Adapting the grid to specific required points
      Space discretization
      Temporal grid generation
   Solving PDE
      ADI for the Fokker-Planck PDE
         Boundary conditions

      4.3.2 ADI for the Option Pricing PDE
         Boundary conditions
Tridiagonal Systems Solvers
   Thomas algorithm
   Cyclic Reduction
   Parallel Cyclic Reduction
High Performance Computing
   Multi/Many Core architecture
   Graphics Processing Units
      GPGPU
      Tesla Architecture
   CUDA programming model
      CUDA
      CUDA program structure
      Thread assignment
      CUDA Memory Types
      Memory Access Optimization
   OpenMP programming model
      OpenMP
      OpenMP program structure
      OpenMP directives
Results
   Stochastic Local Volatility Model Results
      Volatilities & Leverage function surfaces
      SLV Call options pricing
   High Performance Computing Results
      Experimental environments
      Implemented tridiagonal solvers
         CR Implementation 1
         CR Implementation 2
         PCR implementation
      ADI implementation
         OpenMP implementation

         CUDA implementation: Method 1
         CUDA implementation: Method 2
      SLV implementation
         SLV implementation: OpenMP
         SLV implementation: CUDA
CONCLUSION

Word count:


List of Tables

6.1 Linearly stored matrix
Calibrated Heston Parameters
European options pricing for the SLV and Heston model
Technical Specifications GPU
Technical Specifications CPU
CR implementation 1 profiling (global memory)
CR implementation 1 profiling (shared memory)
CR implementation 2 profiling (global memory)
CR second implementation profiling (shared memory)
PCR implementation profiling (shared memory)
OpenMP: ADI implementation timing
OpenMP directives overhead in μs
Method 1: ADI implementation profiling (CUDA:M1)
Method 2: ADI implementation profiling (CUDA:M2)
Method 2: ADI implementation profiling (coalescing) (OPT-CUDA:M2)
OpenMP SLV implementation (512x512 & 150 time steps)
CUDA SLV implementation profiling (512x512 & 150 time steps)


List of Figures

2.1 Implied dynamics from local volatility model
Leverage function calibration procedure (Tian et al. (2013))
Initial probability distribution for the forward Fokker-Planck PDE
Interim probability distribution for the forward Fokker-Planck PDE
Non-uniform grid (Tian et al. (2013))
Probability mass at different time steps
Forward Reduction & Backward substitution in the CR algorithm (Zhang et al. (2010))
Forward Reduction in the PCR algorithm (Zhang et al. (2010))
GPU vs CPU
Tesla architecture
Execution of a CUDA program (CUDA Programming Guide)
Grid configuration (CUDA Programming Guide)
GPU Memory types
Coalesced global memory access
Uncoalesced global memory access
Free Bank-conflicts example for a warp
way shared memory bank conflicts
OpenMP program structure
Artificial Implied Volatility Surface
Generated Local Volatility Surface
Generated Leverage Function Surface
European options pricing absolute error for the SLV and Heston model
Implementation 1: forward reduction for CR algorithm

7.6 CR implementation 1 profiling (global memory)
CR implementation 1 profiling (shared memory)
Implementation 2: Forward reduction for CR algorithm
CR implementation 2 profiling (global memory)
CR second implementation profiling (shared memory)
Tridiagonal solvers running time to solve a 512x512 size system (double precision)
OpenMP directives overhead duration in μs
ADI running time in milliseconds for one time step
SLV implementation benchmark (512x512 & 150 time steps)
SLV implementation benchmark (256x256 & 150 time steps)

The University of Manchester
Zaid AIT HADDOU
Master of Philosophy
Stochastic Local Volatility & High Performance Computing
September 2013

ABSTRACT

In this thesis we investigate the implementation of a Stochastic Local Volatility (SLV) model, using the Alternating Direction Implicit (ADI) scheme, on different High Performance Computing (HPC) platforms, namely CUDA and OpenMP. We start by analysing different implementations of serial and parallel tridiagonal solvers and the various optimization techniques that can make them faster. These tridiagonal solvers are then used to speed up the ADI scheme and therefore the SLV model. To analyse the factors affecting the performance of each tridiagonal solver and of the ADI scheme implemented in CUDA, we used the NVIDIA Visual Profiler. The results obtained show that coalesced global memory access and shared memory access free of bank conflicts are crucial to achieving good speedup. In the final part of the thesis we benchmark the fastest GPU version of the SLV model against a fully multi-threaded CPU implementation. The results show that the CUDA implementation and the OpenMP implementation with 8 threads achieve approximately 8x and 7.5x speedup, respectively, over the single-threaded SLV program. However, we believe that both the CUDA and the OpenMP codes of the SLV model can be optimised further.

Declaration

No portion of the work referred to in the thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright Statement

The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the "Copyright") and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the "Intellectual Property") and any reproductions of copyright works in the thesis, for example graphs and tables ("Reproductions"), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see in any relevant Thesis restriction declarations deposited in the University Library, The University Library's regulations (see rary/aboutus/regulations) and in The University's policy on presentation of Theses.

Acknowledgment

- Zaid ait Haddou is a Marie Curie fellow at the University of Manchester. The research leading to these results has received funding from the European Community's Seventh Framework Programme FP7-PEOPLE-ITN-2008 under grant agreement number PITN-GA
- I would like to thank NAG for advice and technical support, and Chris Armstrong and Jacques Du Toit (both from NAG) for providing codes and for many helpful comments and suggestions.
- My deepest gratitude goes also to both my supervisors, Prof. John Keane and Prof. Ser-Huang Poon, in Manchester Business School, for their precious help and understanding during the whole project.
- Furthermore, I would like to acknowledge with much appreciation the help provided by Mr. Erik Vynkier, and to thank him for hosting me for one month in the SWIP office in Edinburgh.
- I would also like to thank Mr. Daniel Egloff for his technical help.

Chapter 1

Introduction

Today's financial applications need to deal with an enormous amount of information and data by using advanced mathematical structures to form strategies and models. These strategies and models are becoming more and more complex, because they are used by banks to price and assess the credit risk associated with portfolios composed of several thousand derivatives. In turn, computers need more and more processing power to handle larger problems while at the same time ensuring highly accurate calculation in an acceptable time period. For this reason financial institutions are ever more interested in alternatives to their huge machines, which are increasingly challenged not only in calculation speed but also in energy consumption. The accuracy and speed with which prices are obtained and the risk associated with financial products is evaluated, together with the cost of the technology used to obtain these results, are paramount, so the idea of technologies that are fast and consume less energy is extremely attractive.

Parallel processing architectures offer a solution that is starting to be adopted by the financial industry. In the last decade, the use of parallel architectures has seen its scope expand from scientific applications to finance, which has led to very interesting performance compared to traditional processors in pricing financial products and risk management. This advance was made possible thanks in part to the emergence of standardised programming languages such as OpenMP for multi-core CPUs with shared memory and the emergence of CUDA for programming NVIDIA GPUs.

J.P. Morgan (2011), one of the largest investment banks in the world, equipped its data centers with NVIDIA TESLA M2070 GPUs in order to accelerate systems that calculate risk across several equity derivatives. J.P. Morgan was able to achieve a

performance gain of 40x in its risk calculations, which enabled the bank to obtain the desired results in a matter of minutes rather than hours; it therefore became possible to run more frequent risk calculations and more complex scenarios.

1.1 Thesis aim & contribution

The main goal of this thesis is to investigate the implementation of the Stochastic Local Volatility (SLV) model, using the Alternating Direction Implicit (ADI) scheme, on different High Performance Computing (HPC) platforms. The ADI schemes used to solve the Partial Differential Equations (PDEs) for both calibration and option pricing are the most computationally expensive part of the SLV model, and they are therefore ported in this thesis from a single-CPU platform to both a multi-core CPU platform using OpenMP and a many-core GPU platform using CUDA. Solving several tridiagonal systems is the most computationally demanding part of the ADI scheme, and the focus of this thesis is therefore also on the different tridiagonal system solvers that can be implemented to fully exploit the parallelism offered by GPUs. Three tridiagonal solvers are discussed. The simplest of the three is the Thomas algorithm, which is basically a simplified form of Gaussian elimination. The Thomas algorithm is very simple to implement but inherently serial, in other words, not parallelizable. The second and third tridiagonal solvers discussed are the Cyclic Reduction (CR) and Parallel Cyclic Reduction (PCR) algorithms. They are harder to implement but parallelizable, and therefore much faster than the Thomas algorithm on parallel architectures. Several works have discussed the implementation of tridiagonal solvers and ADI schemes on high performance computing platforms. However, to the best of our knowledge, we did not find any work that tackles the implementation of a complete SLV model using HPC technologies.
So, the contribution of this thesis is to give an idea of the various numerical techniques and HPC technologies that can be used and combined to speed up an SLV model based on solving PDEs with the ADI scheme.
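As an aside, the seriality of the Thomas algorithm mentioned above can be seen directly in code. The sketch below is our own illustrative C++ (not the thesis implementation): the forward sweep updates row i using the already-eliminated row i-1, a chain of dependencies that prevents parallel execution.

```cpp
#include <vector>
#include <cstddef>

// Thomas algorithm: solves the tridiagonal system A x = d, where a is
// the sub-diagonal (a[0] unused), b the main diagonal and c the
// super-diagonal (c[n-1] unused). O(n) work, but every elimination
// step depends on the previous row, so the sweep is inherently serial.
std::vector<double> thomas(std::vector<double> a, std::vector<double> b,
                           std::vector<double> c, std::vector<double> d) {
    const std::size_t n = b.size();
    // Forward elimination: remove the sub-diagonal row by row.
    for (std::size_t i = 1; i < n; ++i) {
        const double w = a[i] / b[i - 1];   // depends on row i-1
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    // Backward substitution.
    std::vector<double> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0; )
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}
```

For the diagonally dominant systems produced by finite difference discretizations, no pivoting is needed, which is what makes this simplified form of Gaussian elimination applicable.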

1.2 Thesis Structure

In the introduction we discussed the importance of introducing HPC technologies in financial institutions in order to price a high number of different financial products and to assess the risk involved in each portfolio more quickly. In chapter 2, we analyse three important models for option pricing. We discuss the advantages and shortcomings of both the Stochastic Volatility (SV) and Local Volatility (LV) models and the enhancements made to improve the SV model calibration to market data. We also discuss the SLV model and how it might be used as an alternative that unifies both the SV and LV models. Finally, we discuss the ADI scheme, the different tridiagonal solvers and the implementations made by several researchers using HPC platforms. Next, we examine the dynamics of the SLV model and how it can be calibrated. We mainly focus on the calibration of the leverage function and describe all the necessary steps to complete the task. In the next chapter we discuss the ADI scheme, explaining the theoretical background and how it can be implemented, by describing first the algorithm to generate the finite difference grid in both space and time and then how to use it to solve the forward and backward Kolmogorov PDEs 1 for the leverage function calibration 2 and option pricing respectively. In chapter 6, we discuss the different solvers implemented to solve tridiagonal systems. We start by describing the Thomas algorithm, then the Cyclic Reduction (CR) and finally the Parallel Cyclic Reduction (PCR) algorithm. In the next chapters we give a detailed description of the HPC technologies (CUDA, OpenMP) by examining their architecture, and we also discuss various optimization techniques. We end the thesis with a results chapter that is divided into two sections. The first section is about the option prices generated using the SLV model and the impact of the leverage function on the Heston model.
The second section focuses mainly on HPC, analysing first the different implementations of the tridiagonal solvers and determining which one is faster and why, then analysing the ADI scheme and how its speedup can be improved. Finally, we compare the CUDA and OpenMP implementations of the SLV model to determine which is fastest.

1 The forward and backward Kolmogorov PDEs will be discussed in section and respectively
2 The leverage function and its calibration will be discussed in detail in Chapter 3


Chapter 2

Literature Review

Black and Scholes (1973) developed one of the most important mathematical models able to give a theoretical fair estimate of European option prices. It is a very popular model, as it provides a simple formula to price options and calculate their hedging ratios. However, most market participants nowadays use an adjusted version of the Black-Scholes model due to some well-known problems. One of the greatest drawbacks of the Black-Scholes model is that it assumes that volatility is constant, which does not reflect the reality of option market prices. In other words, the market quotes of options in terms of implied volatility are not the same for different expiry dates and strikes. This phenomenon is known as the smile, and the assumption of the Black-Scholes model regarding volatility is therefore certainly not representative of the much more complex market reality. Realising the inability of the Black-Scholes model to replicate the smile, many practitioners have tried to enhance it by introducing several modifications, and this has given birth to two very important models widely used in the financial industry, the SV and LV models.

2.1 Stochastic Volatility

Stochastic Volatility models for the smile have been discussed by many researchers such as Heston (1993), Stein (1991) and others. These models assume that volatility is a random process. By doing so practitioners have tried to obtain a better representation of the complex reality of financial markets. In other words, the volatility of traded assets varies through time, and that is why any model used for pricing

or hedging of derivative contracts on such assets should take into consideration the fluctuating behaviour of volatility. Heston (1993) proposed a model where the stock price is assumed to follow a geometric Brownian motion and the variance follows a CIR process:

dS_t = r S_t dt + \sqrt{\nu_t} S_t dW_t^1    (2.1)

d\nu_t = k(\theta - \nu_t) dt + \xi \sqrt{\nu_t} dW_t^2    (2.2)

dW_t^1 \cdot dW_t^2 = \rho dt

The Heston model is very popular among practitioners; one of its strengths is the possibility of deriving an analytical formula to price European options, which makes its calibration relatively easy and fast. In addition, the CIR process for the variance, in continuous time, stays positive and therefore will not generate negative variances. However, it may reach zero if the Feller condition 2k\theta > \xi^2 is violated. The Heston model, and SV models in general, are criticized because they introduce a new source of randomness or uncertainty. The fact that SV models consider a stochastic volatility makes them much more complex than the simple Black-Scholes model, because we need to cover the uncertainty arising from the stochastic nature of volatility in order to create a riskless portfolio; as volatility is not a tradable asset this can be quite complicated, and it should therefore be handled carefully in order to avoid inaccurate option prices and hedging ratios. In addition, as observed by Medvedev and Scaillet (2003), SV models cannot be calibrated correctly to the whole market implied volatility surface, as they tend to misprice short-maturity options and the pricing error increases as the time to maturity decreases. In other words, the calibration of SV models often leads to an implied volatility smile for short maturities that is less pronounced than the one observed in the market, because both the stock price and its volatility follow diffusion processes only. This observation is an argument for introducing jump processes to model the price of the underlying.
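To make the positivity issue around the Feller condition concrete, the Heston dynamics (2.1)-(2.2) can be discretized with a simple Euler-Maruyama scheme. The sketch below is purely illustrative (this thesis solves PDEs rather than simulating paths) and uses the common "full truncation" fix, flooring the variance at zero inside the square roots so that the scheme remains well defined even when the Feller condition is violated:

```cpp
#include <cmath>
#include <random>
#include <algorithm>

// One Euler-Maruyama path of the Heston model (2.1)-(2.2) with "full
// truncation": the variance is floored at zero wherever it enters a
// square root or the drift, so the discretized scheme stays well
// defined even if the Feller condition 2*k*theta > xi^2 is violated.
// Returns the terminal asset price S_T.
double heston_euler_path(double S0, double v0, double r, double k,
                         double theta, double xi, double rho,
                         double T, int steps, std::mt19937& gen) {
    std::normal_distribution<double> N(0.0, 1.0);
    const double dt = T / steps, sdt = std::sqrt(dt);
    double S = S0, v = v0;
    for (int i = 0; i < steps; ++i) {
        // Correlated Brownian increments: dW1.dW2 = rho*dt.
        const double z1 = N(gen);
        const double z2 = rho * z1 + std::sqrt(1.0 - rho * rho) * N(gen);
        const double vp = std::max(v, 0.0);                 // truncated variance
        S += r * S * dt + std::sqrt(vp) * S * sdt * z1;     // eq. (2.1)
        v += k * (theta - vp) * dt + xi * std::sqrt(vp) * sdt * z2; // eq. (2.2)
    }
    return S;
}
```

Averaging the discounted payoff over many such paths gives a Monte Carlo price; with the truncation removed, a violated Feller condition would eventually feed a negative value to std::sqrt.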
Gatheral (2006) discussed the Heston plus jumps model, which combines Merton's jump-diffusion model with a CIR stochastic volatility process:

dS_t = r S_t dt + \sqrt{\nu_t} S_t dW_t^1 + (J - 1) S dq    (2.3)

d\nu_t = k(\theta - \nu_t) dt + \xi \sqrt{\nu_t} dW_t^2    (2.4)

dW_t^1 \cdot dW_t^2 = \rho dt

where dq is a Poisson process, equal to 0 with probability 1 - \lambda dt and 1 with probability \lambda dt, and J is the jump size. When dq = 1 the process jumps from S to JS. Although adding jumps can improve the calibration of stochastic volatility models for shorter maturities, it is still artificial: it ignores a large amount of market volatility information and makes the model even more complex.

2.2 Local Volatility

Taking into account the difficulties SV models have in fitting the prices of European options, practitioners tried to create a new simple model that would allow them to fit the volatility smile perfectly without having to introduce any additional source of randomness, in order to ensure that the completeness of the Black-Scholes model is preserved. Completeness is important because it guarantees unique prices. The work of Derman and Kani (1994) and Dupire (1994) showed that under risk-neutrality there is a unique diffusion process that can be used to replicate exactly all observed European option prices; this diffusion process is called the local volatility function, and it has given birth to what is known today as the LV model. However, as discussed by Hagan et al. (2002), the LV model may have poor hedging performance because the dynamics of the implied volatility inferred from the local volatility model are incompatible with what is observed in the market. In other words the model predicts the wrong dynamics of the implied volatility curve, which might lead to pricing and hedging errors in the case of exotic options. Hagan et al. (2002) used the singular perturbation method to calculate the implied volatility from the local volatility. Though this is the reverse of what is usually done, it is quite useful for understanding the dynamics of the LV model. Following the formula derived by Hagan et al. (2002) using asymptotic perturbation theory, the implied volatility at time t_0 with asset price S_0 and strike K can then be approximated by equation (2.5) below:

\sigma_{BS}(K, T, S_0) = \sigma_{LV}(\tfrac{1}{2}[S_0 + K])    (2.5)

Therefore,

\sigma_{BS}(K, T, S_0 + \Delta S) = \sigma_{LV}(\tfrac{1}{2}[(S_0 + \Delta S) + K]) = \sigma_{BS}(K + \Delta S, T, S_0)    (2.6)

As shown in equation (2.6), for a specific maturity, when the spot price increases the implied volatility curve, calculated according to equation (2.5), moves to the left, and when the spot price decreases the implied volatility curve moves to the right, which is, according to Hagan et al. (2002), contrary to what actually happens in the market. For this reason, local volatility is said to have the wrong dynamics for the implied volatility, which makes the vega and delta hedges derived from the model unstable; they may actually be worse than simple Black-Scholes hedges. This flaw of predicting the wrong dynamics in the LV model has led to the popularity of what is known as the SABR model, which captures the accurate dynamics of the smile and therefore leads to stable hedges. This behaviour is illustrated in Figure (2.1) below:

Figure 2.1: Implied dynamics from local volatility model

Balland (2002) also criticized the hedging performance of the LV model, explaining that in a market where the implied volatility does not change as often as the asset price S_t, the hedging strategies implied by the LV model are inefficient due to

spot-change re-calibration. In other words, since the LV model depends on the spot asset price, then over a short period in which the spot asset price changed and the implied volatility did not, the LV model still needs to be re-calibrated.

2.3 Stochastic Local Volatility

To incorporate the strengths of both the SV and LV models, Jex et al. (1999) made an attempt to unify them into a single hybrid model known as the Stochastic Local Volatility model. In other words, they tried to keep the stochastic dynamics of volatility while ensuring at the same time a better calibration to market data. The SLV model therefore enables practitioners to price path-dependent exotic options consistently, because it is both representative of the market dynamics and able to reproduce perfectly the volatility smile generated by European option prices. Tian et al. (2013) discussed the implementation of a Heston-like SLV model to price foreign exchange options. They started by explaining all the details of the calibration of the Heston parameters, for which they assumed a constant mean-reversion speed parameter and variable vol-of-vol, mean-reversion and mixing fraction weight parameters, calibrated to the implied volatilities of each specific maturity, in order to ensure that the stochastic part of the SLV model is well calibrated to market data. Tian et al. (2013) also explained that, although SV models (Heston & SABR) can reproduce the implied volatilities around the At-The-Money (ATM) region, they cannot adequately match the implied volatilities for In-The-Money (ITM) or Out-of-The-Money (OTM) options. That is why it is interesting to introduce a local volatility component, known as the leverage function, into the stochastic volatility model, so as to generate the correct implied volatilities in the sensible regions and thus be able to replicate accurately the whole implied volatility surface.
Furthermore, they discussed in detail the steps to follow in order to calibrate the leverage function to a given local volatility surface by solving a 2D Fokker-Planck PDE of transition probabilities, using the Douglas-Rachford (D-R) ADI scheme 1. Then, they explained how to incorporate the leverage function into the original Heston PDE in order to price exotic options, and compared the pricing results of the SLV model with pure LV and SV models.

1 The D-R ADI scheme will be discussed in detail in chapters 4 and 5

2.4 ADI & HPC

ADI methods were first used in the early 1950s to reduce one N-dimensional system of linear equations to N one-dimensional systems of linear equations. ADI methods have proven to be very powerful techniques for computing numerical solutions of partial differential equations of elliptic and parabolic types. For this reason ADI has become a widely used scheme in finance for solving different PDE problems. Solving tridiagonal systems of linear equations is very common in many scientific and engineering problems, and it is certainly the most critical and time-consuming block of the ADI scheme. The advent of parallel processors made possible the development of several parallel algorithms, such as the CR algorithm, first presented by Hockney (1964), and Recursive Doubling (RD) by Stone (1973). These algorithms accelerated the solution of tridiagonal systems by exploiting the power of parallel processors. After the introduction of GPGPU programming, work began on the parallelization of tridiagonal solvers using GPUs. For example, Kass et al. (2006) were the first to use a GPU version of the CR algorithm, for real-time depth-of-field, and Sengupta et al. (2008) implemented the CR algorithm using the CUDA API and applied it to real-time shallow water simulation. As most GPU solvers on CUDA were based on the CR algorithm, Zhang et al. (2010) discussed the implementation of other types of algorithms, such as the RD, PCR and hybrid algorithms (CR+PCR and CR+RD). They gave a detailed analysis of the complexity of each implemented tridiagonal solver and discussed the advantages and drawbacks of each algorithm by benchmarking and comparing them. According to their results the hybrid algorithms were able to deliver better speed performance, as they are better suited to the GPU architecture. Sakharnykh (2009) also explored the implementation of ADI, using CUDA, to solve a 3D PDE for fluid simulation.
The main idea was to solve several tridiagonal systems in parallel, with each individual tridiagonal system solved serially using the Thomas algorithm. This is quite a naïve method, because there is only one level of parallelization. However, it is still suitable for GPUs, because using ADI to solve a 3D PDE involves the solution of numerous tridiagonal systems, so the majority of GPU cores will be utilised. For a 2D PDE we may speculate that this method will not be very efficient. Sakharnykh (2009) also discussed the coalescing of global memory accesses on GPUs in order to optimise performance, and then benchmarked

his GPU results against the results generated by two different multi-core CPUs. Egloff (2010) analysed the implementation of efficient GPU solvers for one-dimensional PDEs based on the finite difference scheme, using a Tesla C1060. The idea behind his work is to take advantage of the GPU architecture in order to compute a large number of option prices in parallel. In other words, each individual PDE to price an option is built and solved in a thread block using the parallel cyclic reduction algorithm. That means that if a GPU can process N blocks in parallel, then N pricing problems can be processed in parallel. Egloff (2010) started by discussing various implementations of tridiagonal solvers on GPU and CPU. He benchmarked the CPU implementations of the Forsythe-Moler algorithm and the serial version of cyclic reduction against the SSE-optimized Intel MKL solvers sgtsv and dgtsv in single and double precision. He showed that the Intel MKL solver is the fastest and that the cyclic reduction algorithm is not adequate for serial computing but still performs well for small dimensions. In addition, he analysed the performance of the parallel cyclic reduction algorithm on GPU for different grid sizes, using both shared and global memory. Finally he benchmarked the performance of the GPU PDE solver, on a set of European call and put options of different strikes and maturities, against a well optimized CPU implementation, showing a performance improvement of 25x on a single GPU and 38x on two GPUs.

In this chapter we discussed various types of models used to price options, as well as the work completed by several researchers to accelerate the ADI scheme using algorithms that take advantage of parallel processing architectures. In the next chapter, we explain the implementation and calibration of the SLV model.
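To illustrate why the CR/PCR family maps well onto GPUs, here is a serial C++ sketch of the PCR algorithm (our own illustrative code, not taken from any of the works cited above). Within each reduction level every row update is independent of the others, so on a GPU each row can be assigned to a thread; after roughly log2(n) levels the system is diagonal and each unknown is read off directly.

```cpp
#include <vector>

// Parallel Cyclic Reduction (PCR), written serially for clarity.
// a = sub-diagonal, b = diagonal, c = super-diagonal, d = right-hand side.
// At each level, row i eliminates its neighbours at distance s; the inner
// loop has no cross-iteration dependencies, which is the parallelism a
// GPU exploits (one thread per row). After ceil(log2(n)) levels the
// off-diagonals vanish and x[i] = d[i] / b[i].
std::vector<double> pcr(std::vector<double> a, std::vector<double> b,
                        std::vector<double> c, std::vector<double> d) {
    const int n = static_cast<int>(b.size());
    for (int s = 1; s < n; s *= 2) {
        std::vector<double> a2(n), b2(n), c2(n), d2(n);
        for (int i = 0; i < n; ++i) {       // independent row updates
            const bool up = i - s >= 0, dn = i + s < n;
            const double k1 = up ? a[i] / b[i - s] : 0.0;
            const double k2 = dn ? c[i] / b[i + s] : 0.0;
            a2[i] = up ? -k1 * a[i - s] : 0.0;  // now couples to x[i-2s]
            c2[i] = dn ? -k2 * c[i + s] : 0.0;  // now couples to x[i+2s]
            b2[i] = b[i] - (up ? k1 * c[i - s] : 0.0)
                         - (dn ? k2 * a[i + s] : 0.0);
            d2[i] = d[i] - (up ? k1 * d[i - s] : 0.0)
                         - (dn ? k2 * d[i + s] : 0.0);
        }
        a.swap(a2); b.swap(b2); c.swap(c2); d.swap(d2);
    }
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = d[i] / b[i];
    return x;
}
```

Note the trade-off: PCR performs O(n log n) work versus O(n) for the Thomas algorithm, so it only wins when the extra operations are hidden by parallel hardware.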


Chapter 3

The SLV Model

3.1 SLV Model dynamics

The SLV model implemented in this thesis is a Heston-like model, introduced by Jex et al. (1999). The asset price process is a geometric Brownian motion with a local volatility component (the leverage function) L, and the variance moves according to a CIR process. \eta is known as the mixing fraction ratio and its use will be discussed in the next section.

dS_t = r_t S_t dt + L(S_t, t) \sqrt{\nu_t} S_t dW_t^1    (3.1)

d\nu_t = k(\theta - \nu_t) dt + \eta \xi \sqrt{\nu_t} dW_t^2    (3.2)

dW_t^1 \cdot dW_t^2 = \rho dt

The issue with the model above is that if the Feller condition is violated, then simulating the discretized version of the variance CIR process will start generating negative values. Hence, to avoid this problem, we rewrite equations (3.1) and (3.2) in terms of the log spot X_t = \log(S_t / S_0) and the log variance Z_t = \log(\nu_t / \nu_0), scaled by the initial values S_0 and \nu_0 respectively. Therefore,

dX_t = [r_t - \tfrac{1}{2} L(X_t, t)^2 \exp(Z_t)\nu_0] dt + L(X_t, t) \sqrt{\exp(Z_t)\nu_0} dW_t^1,  X_0 = 0    (3.3)

dZ_t = \left[\frac{k\theta - \xi^2/2}{\exp(Z_t)\nu_0} - k\right] dt + \frac{\eta\xi}{\sqrt{\exp(Z_t)\nu_0}} dW_t^2,  Z_0 = 0    (3.4)

dW_t^1 \cdot dW_t^2 = \rho dt

3.2 SLV calibration

As discussed by Tian et al. (2013), in order to calibrate the SLV model to market data we need to follow two steps:

1. Find the stochastic parameters of the pure Heston model to match given market implied volatility data.
2. Calibrate the leverage function L(S_t, t) to a given local volatility surface with a suitable mixing fraction ratio \eta.

The calibration of the pure Heston model to market data is quite simple and can be done easily by using the Heston analytical formula with a nonlinear least-squares method to find the optimal parameters. However, as discussed previously, the Heston model cannot explain the whole implied volatility surface, and this is the reason why we need to calibrate the leverage function. After calibrating the Heston parameters, we add the mixing fraction ratio, denoted by \eta \in [0, 1], which is multiplied by the vol-of-vol parameter in order to manage the weight of local volatility and stochastic volatility in the SLV model. If we assume that \eta = 1 then the stochastic volatility part of the SLV model takes over and the local volatility implied by the leverage function has no effect on the dynamics of the SLV model. When \eta = 0 the opposite happens; in other words, the local volatility component dominates. When 0 < \eta < 1, both the local volatility and the stochastic volatility components work together. The value of the mixing fraction weight parameter may be determined by trying to make the SLV model match the price of some exotic options; however, in this thesis we assume that \eta = 1. After adding the mixing fraction weight, the model is no longer calibrated to market data. Therefore we now need to calibrate the leverage function to a given local volatility surface so that the SLV model will be able to perfectly match European option prices.
In other words, the leverage function corrects the implied volatilities generated by the Heston model, pushing them in the right direction towards the market implied volatilities. This calibration is completed by solving a

2D Fokker-Planck (forward Kolmogorov) PDE.

3.2.1 Leverage function calibration

Assume the LV model shown in equation (3.5) below:

dS_t = r S_t dt + σ_LV(S_t, t) S_t dW_t^1   (3.5)

In order to calibrate the leverage function we use the mimicking theorem of Gyöngy (1986) (the mimicking theorem is discussed in more detail in the appendix). In other words, we connect the local volatility component σ_LV(S_t, t) to the volatility part of the SLV model, L(S_t, t)√ν_t, as follows:

σ_LV(K, t)^2 = E[L(S_t, t)^2 ν_t | S_t = K] = L(K, t)^2 E[ν_t | S_t = K]   (3.6)

Hence,

L(X, t) = σ_LV(X, t) / √(E[ν_t | X]) = σ_LV(X, t) √( ∫ p(X, Z, t) dZ / ( ν_0 ∫ exp(Z) p(X, Z, t) dZ ) )   (3.7)

where at time t = 0 we have

L(X, 0) = σ_LV(X, 0) / √ν_0   (3.8)

p(X, Z, t) represents the transition probability density function of X and Z at time t. From equation (3.7) we see that, in order to calculate the leverage function, we need to determine the local volatility surface and the transition probability density function p(X, Z, t).

Dupire formula for local volatility

Given European option market prices, Dupire (1994) derived the local volatility σ_LV(K, T) as follows:

σ_LV(K, T)^2 = ( ∂C/∂T + rK ∂C/∂K ) / ( (1/2) K^2 ∂^2C/∂K^2 )   (3.9)

So, as is clear from equation (3.9), local volatility can be computed from a set of European option market prices for different strikes and maturities. Gatheral (2006) explained that the local volatility function in equation (3.9) can be seen as a definition of local volatility, irrespective of the kind of process used to model volatility. As option prices are quoted in terms of implied volatilities σ_IV, we can rewrite equation (3.9) as follows:

σ_LV(K, T)^2 = [ σ_IV^2 + 2 σ_IV T ( ∂σ_IV/∂T + rK ∂σ_IV/∂K ) ] / [ (1 + d_1(S_0, K) K √T ∂σ_IV/∂K)^2 + σ_IV K^2 T ( ∂^2σ_IV/∂K^2 − d_1(S_0, K) √T (∂σ_IV/∂K)^2 ) ]   (3.10)

Fokker-Planck PDE

Equation (3.11) below is the 2D Fokker-Planck PDE for the transition probability density function of the SLV model of equations (3.3) and (3.4); it must be solved to calibrate the leverage function.

∂P/∂t = −∂/∂X[ (r_t − ½ L^2(X, t) exp(Z) ν_0) P ]
        − ∂/∂Z[ ( (kθ − ½ξ^2)/(exp(Z) ν_0) − k ) P ]
        + ½ ∂^2/∂X^2[ L^2(X, t) exp(Z) ν_0 P ]
        + ½ ∂^2/∂Z^2[ (ξ^2/(exp(Z) ν_0)) P ]
        + ∂^2/∂Z∂X[ ξ ρ L(X, t) P ]   (3.11)

where

P(X, Z, 0) = δ(X − X_0) · δ(Z − Z_0)   (3.12)

δ(·) is the Dirac delta function and X_0 = Z_0 = 0. We can see from equation (3.7) that, knowing P, we can calculate L, and, knowing L, we can calculate P by solving the PDE (3.11). Since we already have the values of P and L at the initial time t_0, calculating P and L at the time points t_0 = 0, t_1, ..., t_N = T is straightforward. In other words, starting from P(X, Z, t_0) and L(X, t_0) we calculate P(X, Z, t_1) by solving the PDE over the first time step, and then calculate L(X, t_1) via equation (3.7). We repeat

this alternating procedure until we find the values of L and P for all time steps. Figure (3.1) below describes the mechanism used to calibrate the leverage function.

Figure 3.1: Leverage function calibration procedure (Tian et al. (2013))

In order to compute the transition probability density function at t_0 = 0, P(X, Z, t_0), Jensen and Poulsen (2002) proposed using the bivariate normal density function with a very small time step Δt to approximate the Dirac delta function. This is illustrated in Figure (3.2) below. Using this approximation improves the stability of solving PDE (3.11) and works for both uniform and non-uniform meshes.

P^0_{i,j} = P(X_i, Z_j, 0) = 1/(2π α_x α_z √(1 − ρ^2)) · exp( −1/(2(1 − ρ^2)) [ (X_i − β_x)^2/α_x^2 + (Z_j − β_z)^2/α_z^2 − 2ρ(X_i − β_x)(Z_j − β_z)/(α_x α_z) ] )   (3.13)

β_x, β_z represent the means and α_x, α_z the standard deviations of the stochastic variables X_t and Z_t respectively. Therefore, according to equations (3.3) and (3.4), we have:

β_x = (r − ½ L(0, 0)^2 ν_0) Δt;   α_x = L(0, 0) √ν_0 √Δt
β_z = [ (kθ − ½ξ^2)/ν_0 − k ] Δt;   α_z = (ξ/√ν_0) √Δt

Therefore, on a non-uniform grid, the leverage function equation (3.7) can be rewritten as follows:

L(X_i, t_n) = σ_LV(X_i, t_n) √( Σ_{j=1}^{N_z−1} (P^n_{i,j} + P^n_{i,j+1}) ΔZ_j / Σ_{j=1}^{N_z−1} (exp(Z_j) ν_0 P^n_{i,j} + exp(Z_{j+1}) ν_0 P^n_{i,j+1}) ΔZ_j )   (3.14)

where

∫∫ P(X, Z, t) dZ dX = 1   (3.15)

Using the trapezoidal rule we can rewrite equation (3.15) as follows:

Σ_{i=1}^{N_x−1} Σ_{j=1}^{N_z−1} ¼ (P^n_{i,j} + P^n_{i+1,j} + P^n_{i,j+1} + P^n_{i+1,j+1}) ΔX_i ΔZ_j = 1

Figure 3.2: Initial probability distribution for the forward Fokker-Planck PDE

Figure (3.3) below shows how the probability density function looks at t = 1, once we start solving the Fokker-Planck PDE with the ADI scheme.

Figure 3.3: Interim probability distribution for the forward Fokker-Planck PDE


Chapter 4

ADI Implementation

4.1 ADI Scheme

The ADI scheme is a finite difference method used to solve PDEs in two or more dimensions. The idea behind the ADI scheme is to tackle the dimensions of a PDE in separate steps; the system of linear equations to be solved at each step then has a simple structure and can be solved efficiently with a simple tridiagonal matrix algorithm. There are various ADI schemes, such as Douglas and Rachford (1956), Peaceman and Rachford (1955), and Craig and Sneyd (1988). The simplest among these is the Douglas and Rachford (D-R) scheme, which is why we implemented it in this thesis. We use the PDE (4.1) below as an example to illustrate how the D-R ADI scheme works.

∂V/∂t = ∂^2V/∂x^2 + ∂^2V/∂y^2   (4.1)

We start by discretizing the PDE (4.1) implicitly, which gives equation (4.2):

(1 − Δt δ_x^2 − Δt δ_y^2) V^{n+1}_{i,j} = V^n_{i,j}   (4.2)

where

δ_x^2 V^{n+1}_{i,j} = ( V^{n+1}_{i+1,j} − 2 V^{n+1}_{i,j} + V^{n+1}_{i−1,j} ) / Δx^2

We assume that the value of V at time n is known and look for the value of V at time n + 1. Solving equation (4.2) directly would be quite complicated because, as already explained, we would have to deal with the differencing in both the x and y directions at the same time. The idea behind the D-R scheme is to add the term A = Δt^2 δ_x^2 δ_y^2 V^{n+1}_{i,j} to both sides of equation (4.2) and factorise the left-hand side, as shown in equation (4.3) below:

(1 − Δt δ_x^2)(1 − Δt δ_y^2) V^{n+1}_{i,j} = V^n_{i,j} + Δt^2 δ_x^2 δ_y^2 V^{n+1}_{i,j}   (4.3)

Approximating Δt^2 δ_x^2 δ_y^2 V^{n+1}_{i,j} ≈ Δt^2 δ_x^2 δ_y^2 V^n_{i,j}, we can rewrite equation (4.3) as:

(1 − Δt δ_x^2)(1 − Δt δ_y^2) V^{n+1}_{i,j} = V^n_{i,j} + Δt^2 δ_x^2 δ_y^2 V^n_{i,j}   (4.4)

Equation (4.4) can now be solved in two separate steps by introducing an intermediate value V^{n+1/2}.

Splitting:

(1 − Δt δ_x^2) V^{n+1/2}_{i,j} = (1 + Δt δ_y^2) V^n_{i,j}   (4.5)
(1 − Δt δ_y^2) V^{n+1}_{i,j} = V^{n+1/2}_{i,j} − Δt δ_y^2 V^n_{i,j}   (4.6)

As can be seen in equations (4.5) and (4.6), the two-dimensional problem has been reduced to two one-dimensional problems: implicit in x in the first equation and implicit in y in the second. In equation (4.5), we solve for each j (each volatility level in the ADI grid) a tridiagonal system of linear equations:

  | b_0  c_0                      |   | V^{n+1/2}_{0,j}      |   | D^n_{0,j}      |
  | a_1  b_1  c_1                 |   | V^{n+1/2}_{1,j}      |   | D^n_{1,j}      |
  |      ...  ...  ...            | · | ...                  | = | ...            |
  |      a_i   b_i   c_i          |   | V^{n+1/2}_{i,j}      |   | D^n_{i,j}      |
  |          a_imax  b_imax       |   | V^{n+1/2}_{imax,j}   |   | D^n_{imax,j}   |

where D^n_{i,j} = (1 + Δt δ_y^2) V^n_{i,j}.

In the second ADI step, we solve for each i (each asset price in the ADI grid) a tridiagonal system of linear equations:

  | f_0  g_0                      |   | V^{n+1}_{i,0}        |   | H^{n,n+1/2}_{i,0}    |
  | e_1  f_1  g_1                 |   | V^{n+1}_{i,1}        |   | H^{n,n+1/2}_{i,1}    |
  |      ...  ...  ...            | · | ...                  | = | ...                  |
  |      e_j   f_j   g_j          |   | V^{n+1}_{i,j}        |   | H^{n,n+1/2}_{i,j}    |
  |          e_jmax  f_jmax       |   | V^{n+1}_{i,jmax}     |   | H^{n,n+1/2}_{i,jmax} |

where H^{n,n+1/2}_{i,j} = V^{n+1/2}_{i,j} − Δt δ_y^2 V^n_{i,j}.

The systems to be solved at each step have a simple structure and can be solved efficiently with a simple tridiagonal matrix algorithm such as the Thomas algorithm.

4.2 Grid generation

In order to implement the ADI scheme we use a non-uniform mesh in both the asset price and volatility directions, X_i and Z_j respectively. A non-uniform mesh allows a finer mesh in the neighbourhood of critical points where more accuracy is required, thus improving the stability of the finite difference discretization. For instance, in the asset price direction, we usually try to concentrate the grid

on the initial spot and other critical points, such as the upper and lower barrier levels if we want to price, for example, barrier options. Furthermore, using a generic algorithm that generates grids with a non-uniform mesh provides flexibility when solving PDEs, because we can adapt the mesh to the pricing problem at hand without any substantial changes to the algorithm. The method described in the sections below for generating the non-uniform grid in both space and time is discussed in detail by Clark (2010).

4.2.1 Spatial grid generation

Suppose we want to generate non-uniform mesh points on the interval [X_min, X_max] with increased density around a specific point X_conc. The idea behind generating a non-uniform mesh is to use a monotonically increasing mapping X_i = f(E_i), where E_i ∈ {0, 1/N, 2/N, ..., 1} is a uniformly spaced grid. Different functions f(·) can be used to map the uniform grid E to a non-uniform grid X; however, we must make sure that the function is monotonically increasing, i.e. that its first derivative f'(·) is strictly positive. To determine a function f(·) that ensures a smooth mapping concentrated at specific critical points X_conc,k, Tavella and Randall (2000) proposed solving the ODE shown in equation (4.7) below:

df/dE = A [ Σ_{k=1}^{N} J_k(E)^{−2} ]^{−1/2}   (4.7)

where

J_k(E_i) = [ β^2 + (f(E_i) − X_conc,k)^2 ]^{1/2} = [ β^2 + (X_i − X_conc,k)^2 ]^{1/2} = J_k(X_i)   (4.8)

and A is a constant value determined as part of solving the ODE (4.7). When there is one critical point, we can integrate equation (4.7) with boundary conditions f(E_i = 0) = X_min and f(E_i = 1) = X_max to obtain the mapping function f(·), based on the hyperbolic sine, as shown in equation (4.9).

39 X i = f(e i ) = X conc + βsinh(c 1 E i + C 2 (1 E i )) (4.9) β = X max X min ; U [0, ] U So, if we have a uniformly distributed mesh E i [E min = 0, 1/N, 2/N,..., E max = 1], then using equation (4.9) we can generate the non-uniform spatial grid X i. The interval over which the function sinh(.) is sampled is determined by the non-uniformity parameter β; the closer it is to zero, the more non-uniform is the mesh. Note that sinh 1 (x) = log(x x 2 ) C 1 = sinh 1 ( X min X conc ) (4.10) β C 2 = sinh 1 ( X max X conc ) (4.11) β If we want to make the generated grid denser in more than one point the method described above can not be used because we cannot obtain an analytical formula to derive the transformation equation, shown in equation (4.9), to map the uniform to the non-uniform grid. Hence, in order to handle this problem, as it was suggested by Tian et al.(2013), we can use a numerical method, such as the Runge-Kutta 4th-order to solve the ODE (4.7) in order obtain the transformation needed. The method is quite simple and is described by the algorithm below: Algorithm: Coordinate transformation with more than two critical points using Runge-Kutta 4th-order method N : the number of steps between E min = 0 and E max = 1 ΔE = (E max =E min )/N and E i = E min + (i=1)δe, i = 0, 1,..., N X max, X min are known; Choose tolerance level T OL and an initial guess of A Set X 0 = X min, X N = X max while X N =X max > T OL do for i = 1 : N do K 1 = ΔEJ(X i 1 ) K 2 = ΔEJ(X i K 1) K 3 = ΔEJ(X i K 2) 39

      K_4 = ΔE · J(X_{i−1} + K_3)
      X_i = X_{i−1} + (1/6)(K_1 + 2K_2 + 2K_3 + K_4)
    end for
    if (X_N − X_max) > TOL then
      A = A − TOL
    else
      A = A + TOL
    end if
  end while

Figure (4.1) below is an example of the mesh generated using this algorithm. In the log-spot direction the grid becomes denser around the initial point X_0 = log(S_0/S_0) = 0, and in the log-variance direction there are two concentration points, around V_0 and V = 0.

Figure 4.1: Non-uniform grid (Tian et al. (2013))

Adapting the grid to specific required points

After constructing the non-uniform grid, we must ensure that certain points of interest lie on the grid. For example, to compute the value of an option at S_0 we must ensure that S_0 appears in the grid, in order to avoid having to use interpolation techniques. We use a simple example to explain how to force a point into a uniform grid, and then generalize the method to a non-uniform grid. Assume we have a uniform mesh from E_min = 0 to E_max = 5 with ΔE = 1, so the grid E is 0, 1, 2, 3, 4, 5. Calculating the

computed solution of a PDE at S_0 = 2.4 with a finite difference method is then not possible, because the value S_0 = 2.4 does not exist in the grid E. To solve this problem, the first step is to determine the nearest grid value to S_0 = 2.4 in E, which is 2 in this example. The second and last step, generating the required grid, consists of using a linear interpolator L(·) with the input points (x, y) = (E_min, E_min), (2, 2.4), (E_max, E_max). Taking into account that the points of a uniform grid are given by

E_i = E_min + (i/N)(E_max − E_min)   (4.12)

we can determine the input point of the interpolator by inverting equation (4.12):

n = N (S − E_min)/(E_max − E_min)   (4.13)

In the C programming language, n can be rounded to the closest integer using the floor function:

k = floor(n + 0.5)   (4.14)

Hence the input points of the linear interpolator L(·) are (x, y) = [(E_min, E_min), (E_k = 2, 2.4), (E_max, E_max)], which yields the transformed grid E containing the required point 2.4. In the case of a non-uniform grid, we first determine the value S* that must be included in the uniform grid mapped onto the non-uniform grid X, using the inverse of equation (4.9):

S* = [ sinh^{−1}( (S_0 − X_conc)/β ) − C_1 ] / (C_2 − C_1)   (4.15)

Now that we have the value S*, we can generate the transformed uniform grid E containing S* using the method described above; the input points for the linear interpolator are (x, y) = [(E_min, E_min), (E_k, S*), (E_max, E_max)]. After generating E, we use equation (4.9) again to generate the non-uniform grid X with the desired points.

Space discretization

Taking as an example a differentiable function F(X_i), we can approximate its first and second derivatives on the non-uniform mesh points X_1, X_2, ..., X_N, where ΔX_i = X_{i+1} − X_i, by adopting a second-order central discretization at the inner grid points and a forward or backward discretization at the boundary points, as explained by Tian et al. (2013).

First derivative approximation:

  Forward:  dF(X_i)/dX ≈ f_{i,0} F(X_i) + f_{i,1} F(X_{i+1})
  Central:  dF(X_i)/dX ≈ c_{i,−1} F(X_{i−1}) + c_{i,0} F(X_i) + c_{i,1} F(X_{i+1})
  Backward: dF(X_i)/dX ≈ b_{i,−1} F(X_{i−1}) + b_{i,0} F(X_i)

where

  f_{i,0} = −1/ΔX_i;   f_{i,1} = 1/ΔX_i
  c_{i,−1} = −ΔX_i / ( ΔX_{i−1} (ΔX_{i−1} + ΔX_i) );   c_{i,0} = (ΔX_i − ΔX_{i−1}) / (ΔX_{i−1} ΔX_i);   c_{i,1} = ΔX_{i−1} / ( ΔX_i (ΔX_{i−1} + ΔX_i) )
  b_{i,−1} = −1/ΔX_{i−1};   b_{i,0} = 1/ΔX_{i−1}

Second derivative approximation:

  Central: d^2F(X_i)/dX^2 ≈ s_{i,−1} F(X_{i−1}) + s_{i,0} F(X_i) + s_{i,1} F(X_{i+1})

where

  s_{i,−1} = 2 / ( ΔX_{i−1} (ΔX_{i−1} + ΔX_i) );   s_{i,0} = −2 / (ΔX_{i−1} ΔX_i);   s_{i,1} = 2 / ( ΔX_i (ΔX_{i−1} + ΔX_i) )

Mixed derivative approximation:

  Central: d^2F(X_i, Z_j)/dX dZ ≈ Σ_{k=−1}^{1} Σ_{l=−1}^{1} c_{i,k} c_{j,l} F(X_{i+k}, Z_{j+l})
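As a sanity check on the central coefficients above, they can be computed with two small C helpers. The names are ours; the stencils are the standard non-uniform central differences used in this section, and they are exact for quadratic functions.

```c
/* Central-difference weights on a non-uniform mesh: dxm = X_i - X_{i-1},
 * dxp = X_{i+1} - X_i. w[0..2] multiply F(X_{i-1}), F(X_i), F(X_{i+1}). */
void central_first(double dxm, double dxp, double w[3])
{
    w[0] = -dxp / (dxm * (dxm + dxp));
    w[1] = (dxp - dxm) / (dxm * dxp);
    w[2] = dxm / (dxp * (dxm + dxp));
}

void central_second(double dxm, double dxp, double w[3])
{
    w[0] = 2.0 / (dxm * (dxm + dxp));
    w[1] = -2.0 / (dxm * dxp);
    w[2] = 2.0 / (dxp * (dxm + dxp));
}
```

On a uniform mesh (dxm = dxp = h) these reduce to the familiar (−1/2h, 0, 1/2h) and (1/h^2, −2/h^2, 1/h^2) stencils.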

4.2.2 Temporal grid generation

For the discretization in time, following the method described by Clark (2010), we can use a non-uniform discretization that forces a finer mesh in the neighbourhood of t_min = 0 and allows a coarser mesh as we approach the maturity t_max. This discretization is appropriate for solving the forward Kolmogorov PDE, and therefore for the calibration of the leverage function in the SLV model: the first time steps are crucial for determining the general shape of the transition probability density function, so we need more points at the beginning of the time grid, which allows a robust calibration of the leverage function. In this method we map a uniform grid E_i ∈ {0, 1/N, 2/N, ..., 1} into a non-uniform grid t_i using the formulas below:

t_i = t_max · E_i^{α_s + (α_l − α_s) exp(−λ E_i)}   (4.16)
t_i = t_max · exp( log(E_i) ( α_s + (α_l − α_s) exp(−λ E_i) ) )   (4.17)

α_s and α_l determine the form of the time discretization at the short end and the long end respectively; these two terms are coupled together through the mixing parameter λ.

4.3 Solving PDE

A distinctive feature of the implemented SLV model is the correlation between the asset price and its variance. For this reason, solving the forward or backward Kolmogorov PDE, for the leverage function calibration or for option pricing respectively, means dealing with a mixed spatial derivative term. ADI schemes were not originally designed for PDEs that contain a mixed derivative term. Hence, we use a variant of the Douglas and Rachford scheme, developed by In 't Hout and Foulon (2010), which handles the mixed derivative term explicitly while preserving the stability of the ADI scheme.

4.3.1 ADI for the Fokker-Planck PDE

As mentioned previously, we solve PDE (3.11) with the D-R ADI scheme. However, instead of the implicit discretization used to explain the D-R

ADI scheme in section (4.1), we use the theta-scheme discretization, which leads, after splitting, to equations (4.18) and (4.19) below:

Y − α Δt_n F_1(Y, t_n) = P^{n−1} + Δt_n [ F_0(P^{n−1}, t_{n−1}) + (1 − α) F_1(P^{n−1}, t_{n−1}) + F_2(P^{n−1}, t_{n−1}) ]   (4.18)

P^n − α Δt_n F_2(P^n, t_n) = Y − α Δt_n F_2(P^{n−1}, t_{n−1});   n = 1, ..., N   (4.19)

Y represents the intermediate value; F_0 stems from the discretization of the mixed derivative term, and F_1 and F_2 correspond to the spatial discretization in the Z and X directions respectively. In other words:

F_0(P, t) = ∂^2/∂X∂Z [ ξ ρ L P ]
F_1(P, t) = −∂/∂Z [ ( (kθ − ½ξ^2)/(exp(Z) ν_0) − k ) P ] + ½ ∂^2/∂Z^2 [ (ξ^2/(exp(Z) ν_0)) P ]
F_2(P, t) = −∂/∂X [ (r_t − ½ L^2 exp(Z) ν_0) P ] + ½ ∂^2/∂X^2 [ L^2 exp(Z) ν_0 P ]

The parameter α is very important because it affects the stability of the ADI scheme. With α = 0 we obtain the fully explicit discretization of the Fokker-Planck PDE, with α = 1 the fully implicit discretization, and with α = 0.5 the Crank-Nicolson discretization. In this thesis we take α = 0.5, for which the ADI scheme is unconditionally stable.

Boundary conditions

To solve the Fokker-Planck PDE for the leverage function calibration, Tian et al. (2013) considered a one-sided first-order derivative and a zero second derivative at the boundary points in the log-spot and log-variance directions; these are the boundary conditions assumed in this thesis. With these boundaries, however, the ADI scheme starts to suffer from a loss of probability mass: the sum of the computed probabilities decreases after each time iteration. The pace of this loss increases with the vol-of-vol parameter of the Heston model, which affects the accuracy of the calibration of the SLV model.
This issue may be related to a problem mentioned by Sepp (2010), namely that an SLV model with a large vol-of-vol parameter cannot be calibrated consistently to a given local volatility surface. Figure (4.2) shows that, with a relatively small vol-of-vol parameter, the sum

of the transition probabilities is stable, whereas with a relatively large vol-of-vol parameter we start to notice a loss of probability mass.

Figure 4.2: Probability mass at different time steps

This problem could be partially alleviated by increasing the value of the boundary X_max and decreasing the value of X_min. For this project, however, that approach is not possible, because we are using an artificial implied volatility surface, which puts some restrictions on the grid size in order to maintain positivity of the local volatility function.

4.3.2 ADI for the Option Pricing PDE

After calibrating the leverage function, we incorporate it into the Heston PDE as shown below, in order to price European options consistently with market prices.

∂C/∂t + [ r_t − ½ L^2(X, t) exp(Z) ν_0 ] ∂C/∂X + [ (kθ − ½ξ^2)/(exp(Z) ν_0) − k ] ∂C/∂Z
      + ½ L^2(X, t) exp(Z) ν_0 ∂^2C/∂X^2 + ½ (ξ^2/(exp(Z) ν_0)) ∂^2C/∂Z^2
      + ξ ρ L(X, t) ∂^2C/∂Z∂X − r_t C = 0   (4.20)

To solve PDE (4.20) we use the same ADI scheme as for the Fokker-Planck PDE, i.e.:

Y − α Δt_n A_1(Y, t_n) = C^{n+1} + Δt_n [ A_0(C^{n+1}, t_{n+1}) + (1 − α) A_1(C^{n+1}, t_{n+1}) + A_2(C^{n+1}, t_{n+1}) ]   (4.21)

C^n − α Δt_n A_2(C^n, t_n) = Y − α Δt_n A_2(C^{n+1}, t_{n+1});   n = 0, ..., N − 1   (4.22)

where

A_0(C, t) = ξ ρ L ∂^2C/∂X∂Z
A_1(C, t) = [ (kθ − ½ξ^2)/(exp(Z) ν_0) − k ] ∂C/∂Z + ½ (ξ^2/(exp(Z) ν_0)) ∂^2C/∂Z^2 − ½ r_t C
A_2(C, t) = (r_t − ½ L^2 exp(Z) ν_0) ∂C/∂X + ½ L^2 exp(Z) ν_0 ∂^2C/∂X^2 − ½ r_t C

The term −r_t C is divided evenly between A_1 and A_2. After computing the leverage function, we must adapt it to the time and space mesh of the new grid used to price options. For this task we can use, as suggested by Tian et al. (2013), cubic spline interpolation in the log-spot direction and linear interpolation in the time direction. We could also price options by Monte Carlo simulation instead of solving the backward PDE; however, we would then need to interpolate the leverage function at every time step of every path, which could make the simulation computationally demanding.

Boundary conditions

Taking a European call option as an example, we can price it by solving PDE (4.20) with the following Dirichlet boundary conditions:

  C(X_i, Z_j, T) = max(S_0 exp(X_i) − K, 0)
  C(X_min, Z_j, t) = 0
  C(X_max, Z_j, t) = S_0 exp(X_max) − K e^{−r(T−t)}
  C(X_i, Z_min, t) = max(S_0 exp(X_i) − K e^{−r(T−t)}, 0)
  C(X_i, Z_max, t) = S_0 exp(X_i)

Chapter 5

Tridiagonal Systems Solvers

In this chapter we present the tridiagonal solvers implemented in this dissertation to solve the tridiagonal systems built at each time step of the D-R ADI scheme. As an example, we assume that we are solving the first matrix shown in section (4.1).

5.1 Thomas algorithm

The Thomas algorithm is an efficient method for solving tridiagonal systems of linear equations. It has two important steps: the forward sweep, where we eliminate the lower diagonal by computing the modified coefficients c' and d' according to equations (5.1) and (5.2), and the backward substitution, where we compute V_i from c', d' and V_{i+1} according to equation (5.3).

Forward sweep:

  c'_i = c_i / b_i,  i = 1
  c'_i = c_i / (b_i − c'_{i−1} a_i),  i = 2, 3, ..., n − 1   (5.1)

  d'_i = d_i / b_i,  i = 1
  d'_i = (d_i − d'_{i−1} a_i) / (b_i − c'_{i−1} a_i),  i = 2, 3, ..., n   (5.2)

In the second step, the backward substitution, we compute the solution using the coefficients d' and c', as shown in equation (5.3).
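Putting the forward sweep of equations (5.1)–(5.2) together with the backward substitution of equation (5.3), a C sketch of the Thomas algorithm might look as follows. The naming is ours (c_star and d_star hold the modified coefficients), and zero-based indexing is used.

```c
/* Thomas algorithm for the tridiagonal system
 *   a[i]*V[i-1] + b[i]*V[i] + c[i]*V[i+1] = d[i],  i = 0..n-1,
 * with a[0] and c[n-1] unused. c_star and d_star are scratch arrays of
 * size n. The forward sweep eliminates the lower diagonal; the backward
 * substitution recovers the solution. */
void thomas(const double *a, const double *b, const double *c,
            const double *d, double *V,
            double *c_star, double *d_star, int n)
{
    c_star[0] = c[0] / b[0];
    d_star[0] = d[0] / b[0];
    for (int i = 1; i < n; i++) {
        double m = b[i] - c_star[i - 1] * a[i];
        c_star[i] = c[i] / m;                          /* equation (5.1) */
        d_star[i] = (d[i] - d_star[i - 1] * a[i]) / m; /* equation (5.2) */
    }
    V[n - 1] = d_star[n - 1];
    for (int i = n - 2; i >= 0; i--)
        V[i] = d_star[i] - c_star[i] * V[i + 1];       /* equation (5.3) */
}
```

One such solve is performed per grid line in each ADI half-step; in the parallel implementations the independent lines are distributed over threads.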

Backward substitution:

  V_n = d'_n
  V_i = d'_i − c'_i V_{i+1}   (5.3)

The Thomas algorithm is very simple to implement and very fast, but inherently serial: both d' and c' depend on their previous values, and V_i depends on its subsequent value V_{i+1}, which makes the algorithm non-parallelizable.

5.2 Cyclic Reduction

There are also two steps in the CR algorithm: forward reduction and backward substitution. The basic idea behind the forward reduction is to eliminate the unknowns with odd indices, which leads to a new tridiagonal system of half the size of the original, containing only the unknowns with even indices. This reduction continues until we reach a system that can be solved directly. As an illustration of the forward reduction, assume that we have the following three equations:

  a_{i−1} V_{i−2} + b_{i−1} V_{i−1} + c_{i−1} V_i = d_{i−1}
  a_i V_{i−1} + b_i V_i + c_i V_{i+1} = d_i,   i = 2, ..., 2n   (5.4)
  a_{i+1} V_i + b_{i+1} V_{i+1} + c_{i+1} V_{i+2} = d_{i+1}

Forward reduction: the first of these equations is multiplied by α_i = −a_i/b_{i−1} and the last one by λ_i = −c_i/b_{i+1}; the three equations are then added in order to eliminate V_{i−1} and V_{i+1} (the odd-indexed unknowns). Hence, we end up with equation (5.5) below:

  a^{(1)}_i V_{i−2} + b^{(1)}_i V_i + c^{(1)}_i V_{i+2} = d^{(1)}_i,   i = 2, ..., 2n   (5.5)

Backward substitution: after solving for all the even-indexed unknowns, found through the forward reduction,

we can just substitute them into the original equations to compute all the unknowns with odd indices. Both steps of the CR algorithm are parallelizable, which is why it is commonly implemented on parallel architectures. The CR algorithm performs more operations than the Thomas algorithm; however, on a parallel processor it needs only 2 log_2(n) − 1 computational steps instead of the 2n steps of the Thomas algorithm, where n is the size of the tridiagonal system. Figure (5.1) below illustrates the dynamic of the CR algorithm.

Figure 5.1: Forward reduction & backward substitution in the CR algorithm (Zhang et al. (2010))

5.3 Parallel Cyclic Reduction

The PCR algorithm is a modified version of the CR algorithm: it has no backward substitution phase, only forward reduction. The forward reduction mechanism is exactly the same as in the CR algorithm; however, PCR repeatedly reduces the current system to two systems of half the size, instead of one as in CR, until all solutions are found. Figure (5.2) below illustrates the dynamic of the PCR algorithm.

Figure 5.2: Forward reduction in the PCR algorithm (Zhang et al. (2010))

In this chapter, we discussed serial and parallel tridiagonal solvers and explained how they can be implemented. In the next chapter, we describe in detail the two HPC technologies used to accelerate the tridiagonal solvers, the ADI scheme and, finally, the SLV program.
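For reference, a serial C sketch of the cyclic reduction scheme of section 5.2 is shown below. It assumes a system size n = 2^q − 1 (the classic CR restriction) with a[0] = c[n-1] = 0, updates the coefficient arrays in place, and uses our own naming; a parallel version would assign the inner loops to threads.

```c
/* Serial cyclic reduction for a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i],
 * i = 0..n-1, n = 2^q - 1, a[0] = c[n-1] = 0. The forward phase eliminates
 * every other unknown level by level; the backward phase substitutes. */
void cyclic_reduction(double *a, double *b, double *c, double *d,
                      double *x, int n)
{
    int stride;
    /* forward reduction: rows i couple to neighbours at distance stride */
    for (stride = 1; stride < n; stride *= 2) {
        for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
            int il = i - stride, ir = i + stride;
            double k1 = a[i] / b[il];
            double k2 = (ir < n) ? c[i] / b[ir] : 0.0;
            b[i] -= c[il] * k1 + ((ir < n) ? a[ir] * k2 : 0.0);
            d[i] -= d[il] * k1 + ((ir < n) ? d[ir] * k2 : 0.0);
            a[i] = -a[il] * k1;                     /* couples at 2*stride */
            c[i] = (ir < n) ? -c[ir] * k2 : 0.0;
        }
    }
    /* backward substitution, from the single remaining unknown down */
    for (stride /= 2; stride >= 1; stride /= 2) {
        for (int i = stride - 1; i < n; i += 2 * stride) {
            double xl = (i - stride >= 0) ? x[i - stride] : 0.0;
            double xr = (i + stride < n) ? x[i + stride] : 0.0;
            x[i] = (d[i] - a[i] * xl - c[i] * xr) / b[i];
        }
    }
}
```

Each pass over a stride level is the parallel step counted in the 2 log_2(n) − 1 estimate above: all rows within a pass are independent.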

Chapter 6

High Performance Computing

6.1 Multi/Many Core architecture

For a long time, increasing the computational capacity of a machine meant increasing the frequency of its processors as well as the complexity of their components. Moore's law predicted that the number of transistors on a CPU would double approximately every two years, or more precisely every 18 months, and hence that CPU performance would roughly double every 18 months. Nowadays, however, as explained by Dally (2010), chief scientist of NVIDIA, doubling the number of transistors in a serial CPU results in a very modest increase in performance at a tremendous expense in energy. In addition, increasing the number of transistors also requires managing higher levels of heat dissipation, a serious problem that has hindered the development of processors. To solve these problems, engineers turned to a new approach based on the miniaturization of computing units in order to fit more of them on a single chip, giving birth to multicore and manycore processors.

Multicore processors typically refer to devices that have between 2 and 8 cores: processors with multiple physical cores that execute portions of a program in parallel. All the features needed to run a program are present in each physical core: registers, calculating units, etc. Manycore processors, such as GPUs, are devices with many hundreds of cores, unlike multicore processors, which contain only a small number of cores. Hence the level of parallelism achievable by manycore devices is very high. However, the cores of multicore processors are usually much more powerful

than the cores of manycore processors, which is why running single-threaded software on a manycore device can be quite slow. So the question we should ask ourselves is: do we want a few powerful processors, or a large number of less powerful ones? The answer depends on the algorithm of the application one wants to design: if the algorithm is highly parallelizable, then the manycore architecture is probably the best option; otherwise the multicore architecture will almost certainly be much more efficient.

6.2 Graphics Processing Units

Graphics processing is extremely demanding, requiring millions of calculations per second, and this is why the CPU, given its architecture, is not really suited to this kind of task. To meet this need and relieve the CPU of much of the graphics processing, researchers created a processor dedicated to graphics computations, known as the Graphics Processing Unit (GPU). The GPU is a massively parallel processor able to manipulate huge blocks of data in parallel, which is why it is very effective for graphics processing.

Figure 6.1: GPU vs CPU

As shown in Figure (6.1), the goal of the CPU is to minimize the execution time of a single-threaded program by reducing latency as much as possible, which requires more control logic. The GPU, on the other hand, was designed not to perform operations sequentially as quickly as possible, but to have multiple processors simultaneously execute the same instruction on different pieces of data. This concept is known as data-parallelism or Single Instruction Multiple Data (SIMD). What would usually be completed by executing a repeated succession of instructions can now be performed in parallel by a single instruction, which implies less control logic, thus

giving more space to additional units of calculation, or Arithmetic Logic Units (ALUs), for greater parallelism. That being said, it is still possible to make GPU threads run different instructions; however, this is not advisable, as it leads to poor performance: computation only happens in parallel when the threads perform the same operations. Taking the ADI scheme as an example of data-parallelism, we have multiple threads building and solving multiple tridiagonal systems of linear equations in parallel.

6.3 GPGPU

Following the development and popularization of GPUs, many industries became interested in using them for non-graphical applications, hence the name GPGPU (General Purpose GPU). At first, applications such as Monte Carlo simulation for option pricing were ported to the GPU using graphics APIs such as DirectX and OpenGL, and the results were extremely promising, achieving outstanding speedups. However, programming GPUs was still very complex and represented a challenge for programmers, because it required deep knowledge of graphics libraries and of the GPU architecture. In addition, GPUs at the time did not support double precision floating point arithmetic, so it was not possible to run some scientific applications with the required accuracy. These two problems were the major obstacles impeding the use of GPUs in other sectors. To solve these issues, NVIDIA first introduced the G80 (GeForce 8800) GPUs and later the Tesla GPUs, the first GPUs dedicated to GPGPU; second, they created CUDA, a high-level C-like programming language that allows much simpler programming of GPUs. For these two reasons, GPUs have become a really powerful architecture, programmable and open to everyone.

6.4 Tesla Architecture

Figure 6.2: Tesla architecture

Figure (6.2) shows the Tesla architecture, based on a block diagram of a GeForce 8800 (G80) GPU. The G80 contains 8 Texture Processor Cluster (TPC) blocks, together forming the SPA (Streaming Processor Array); each TPC contains two Streaming Multiprocessors (SMs), for a total of sixteen cores, or Streaming Processors (SPs), per TPC, grouped in two sets of eight. Each SM also contains two SFUs (Special Function Units), which are mainly used to evaluate functions such as sqrt, cos and sin. According to the NVIDIA GPU Programming Guide, the SMs are responsible for the creation, organization and execution of threads. Each thread is executed on an SP independently, with its own thread of execution and its own register state. The SPs are scalar, not vector, processors: a vector processor can work simultaneously on an array of values, whereas a scalar processor works on only one value at a time. SPs start from an input stream, and each processor can resume a stream previously processed by other SPs. The fact that NVIDIA puts the streaming

processors in large numbers is illustrative of its efforts toward greater parallelization of data processing. 6.5 CUDA programming model: CUDA: CUDA is a programming language that can be considered an extension of C, and hence is quite easy to learn for programmers who are already familiar with C. It was developed by NVIDIA specifically for GPUs based on the Tesla architecture. A CUDA program is executed on both the host (CPU) and the device (GPU). The serial part of the program is executed on the CPU and is implemented in ANSI C. The part of the program that we want to parallelize is executed on the GPU and is implemented using what is known as device code, which is basically ANSI C extended with keywords to describe parallel functions called kernels. Hence, a CUDA program incorporates both host and device code, which are separated in the compilation phase by the NVIDIA compiler (NVCC). CUDA program structure: The execution of a CUDA program follows the global schema described by Figure (6.3) below: Figure 6.3: Execution of a CUDA program (CUDA Programming Guide) When we call a kernel function in the middle of a program, the GPU takes over execution and starts generating threads in order to execute a specific part of the program in parallel. However, we have to make sure that all the data needed

for the calculations has been transferred from the CPU to the GPU global memory using the cudaMemcpy function. After the calculations finish, the results are transferred back to the CPU memory using the same cudaMemcpy function. We will discuss in the next section how the threads are organized when a kernel is invoked. Thread assignment A kernel is executed by launching a grid of threads. When we call a kernel we have to set the execution configuration between triple angle brackets in order to describe the dimensions of the required grid: Kernel <<< numBlocks, numThreads >>> As illustrated in Figure (6.4), a grid of threads is a two-dimensional array of 2D blocks of threads. The parameter numBlocks defines the number of blocks in the grid and the parameter numThreads represents the number of threads in each block. An index system can uniquely identify each thread, to specify which set of data it will work on. Each thread is located within its block via three-dimensional coordinates, using the variables threadIdx.x, threadIdx.y and threadIdx.z. Similarly, each block is located in the grid via two-dimensional coordinates, using the variables blockIdx.x and blockIdx.y respectively. Figure 6.4: Grid configuration (CUDA Programming Guide)

We now explain where each logical component (blocks and threads) resides physically on the hardware. To illustrate this, we take as an example the architecture of the NVIDIA Tesla M2050 GPU. As already mentioned, thread execution resources are managed by the SMs. The M2050 consists of 14 SMs; each SM can hold a maximum of eight blocks at the same time, as long as there are enough resources to satisfy the needs of each block. In case there are not enough resources to hold 8 blocks simultaneously, the CUDA runtime system reduces the number of blocks assigned to each SM until the resource usage is under the limit. Therefore, the M2050 can simultaneously hold a maximum of 8 x 14 = 112 blocks across all SMs. If we try to use more than 112 blocks, the CUDA runtime system organizes the execution of blocks such that every time the execution of a block ends it is replaced by a new one. Another limitation of the SM, as well as of the block, is that they can accommodate only a limited number of threads or warps (a warp is a group of 32 threads). For instance, for the M2050, no more than 1536 threads (or 48 warps) can be assigned to each SM and no more than 1024 threads to each block, which means that up to 1536 x 14 = 21504 threads can reside simultaneously in all the SMs in order to be executed. From the explanation above, we can deduce that thread execution on an SM is possible with the following combinations: (3 blocks, 512 threads), (6 blocks, 256 threads), (8 blocks, 192 threads), etc. However, the combination (16 blocks, 128 threads) cannot run on a single SM, for the simple reason that an SM in the M2050 cannot hold more than 8 blocks. It is also not possible to have one block with 1536 threads because, even though the limit on threads per SM is not exceeded, the limit on threads per block is violated. The fact that threads are organised in blocks and grids in CUDA provides a high level of scalability.
This scalability means that a specific program can run, unchanged, on any NVIDIA GPU hardware that is based on the Tesla architecture. CUDA Memory Types: So far we have seen the execution structure of a CUDA program, how to call a kernel function by specifying the number of threads and blocks wanted, and how to transfer data from the CPU memory to the GPU global memory. As shown in Figure (6.5), CUDA supports several types of memory. Understanding how to use these memories is crucial to optimizing a program. For this reason, in this section we analyse the characteristics of some of the types of memory we can use in a CUDA program for an NVIDIA GPU with compute capability 2.0. Figure 6.5: GPU Memory types - Global memory: This memory can be accessed by all threads of all SMs. Its bandwidth is very large. However, it delivers relatively slow performance because of contention between threads and because it is a dynamic random access memory (DRAM) with a high latency (hundreds of clock cycles), which leaves the SMs inactive during this time. - Shared memory: As its name suggests, shared memory is shared only by the threads of a block. It is an on-chip memory, which enables the threads to access it relatively quickly as long as there are no bank conflicts (bank conflicts are discussed in detail below). This makes its usage extremely important to achieve optimal memory access performance. For the SLV program, as we will see in the Results chapter, storing the tridiagonal system arrays in global memory gives slower performance than using shared memory. However, using shared memory can limit the occupancy of the streaming multiprocessor and thus decrease performance. - Registers: Register memory is thread-local. In other words, each thread of the same block has a private version of each register variable. It is an on-chip

memory which makes thread access very fast, about one cycle. However, the number of registers available per block is limited. - Local memory: Figure (6.5) shows that local memory is local to each thread. However, physically the local memory is neither close to the threads nor as fast as registers or shared memory; it is as slow as global memory. The compiler automatically allocates a variable into local memory if there are not enough registers available to the thread. This typically happens with arrays and structures created and used in the kernel function. Memory Access Optimization: Coalesced global memory access: Global memory is implemented with DRAM. Hence, in order to optimize global memory access using CUDA threads, understanding how modern DRAM works proves to be extremely important. This understanding allows programmers to deploy techniques that achieve high global memory access efficiency. The crucial aspect of modern DRAM is that it uses a parallel process to increase its rate of data access. In other words, each time a memory location is accessed, many consecutive memory locations, including the requested location, may be accessed. So, taking into account how modern DRAM works, and also the fact that the threads of a warp execute the same instruction at any given point in time, programmers can optimize global memory access by ensuring that consecutive global memory locations are accessed by consecutive threads. When all threads in a warp execute, for instance, a load instruction, the hardware detects whether consecutive global memory addresses are accessed by consecutive threads. If that is the case, the hardware coalesces all the threads' memory accesses. For example, if thread 0 accesses location t, thread 1 accesses location t + 1, ..., and thread 31 accesses location t + 31, then all these global memory accesses are coalesced into one single access.
To give a concrete example, let's assume that we have the following 3x3 matrix:

Table 6.1: Linearly stored matrix

0 1 2
3 4 5
6 7 8

Assume that the matrix (Table 6.1) resides linearly (row-major) in global memory, so that each element (i, j) maps to the memory location (i*3 + j), and assume that we can use only 3 threads. The threads can then access the matrix elements in the ways described below. Method 1 (each thread reads the elements of one row): Thread 0: 0, 1, 2; Thread 1: 3, 4, 5; Thread 2: 6, 7, 8. Method 2 (each thread reads the elements of one column): Thread 0: 0, 3, 6; Thread 1: 1, 4, 7; Thread 2: 2, 5, 8. From the example above we can see that the second method is the more favourable pattern, because at each access step consecutive elements of the matrix are read by consecutive threads and therefore the global memory access is coalesced. Figures (6.6) and (6.7) illustrate coalesced and uncoalesced global memory access respectively. Figure 6.6: Coalesced global memory access

Figure 6.7: Uncoalesced global memory access Shared memory bank conflicts: A memory bank is a logical unit of storage that organizes a physical memory space. For a GPU with compute capability (version) 2.0 or higher, shared memory is an interleaved memory divided into 32 banks, where consecutive 4-byte words are allocated to consecutive banks. Each bank has a data transfer capacity of 4 bytes per clock cycle and can handle one request at a time. This structure aims to allow, where possible, all the threads of a warp to be served at the same time, increasing the bandwidth of the shared memory. When all the threads of a warp cannot be served simultaneously we have a bank conflict: two or more threads are attempting to access an address within the same bank, and therefore the memory controller has to serialise the execution of these threads. The example below illustrates how a bank conflict can happen. Let's assume that we have an array of N floats (4-byte words) and we want to transfer these float values from global memory to shared memory, as shown in the code below: sharedArray[threadIdx.x] = globalArray[threadIdx.x]; Each thread k of a warp will then write to bank k. For instance, thread 1 will copy a float value into bank 1, thread 2 into bank 2, and so on. In this case there is no bank conflict, because the 32 threads of each warp are mapped to the 32 4-byte banks. Having consecutive threads access consecutive banks, as in this example, is not the only way to avoid bank conflicts: the essential requirement is that different threads of a warp access different banks.

Figure (6.8) below illustrates conflict-free bank access to shared memory for a warp. Figure 6.8: Bank-conflict-free example for a warp If we take the same example as before but assume instead that we have char values (1-byte words), then threads 0, 1, 2 and 3 will try to write simultaneously to the first bank. Hence, we have what is known as a 4-way bank conflict, because 4 threads are trying to access the same bank simultaneously, as illustrated in Figure (6.9). Figure 6.9: 4-way shared memory bank conflicts If we assume now that we have double values (8-byte words), then each variable will be split into two 4-byte accesses. Therefore threads 0 and 16, or 1 and 17, for instance, will access the same bank (bank 0 or bank 2 respectively), which generates a 2-way bank conflict.

Shared memory in GPUs with compute capability 3.0 or higher has a configurable bank size that can be set to either 4 bytes or 8 bytes using the function cudaDeviceSetSharedMemConfig(). Therefore, by setting the bank size to 8 bytes we can use double-precision variables in shared memory without causing bank conflicts. According to the CUDA programming guide, bank conflicts in shared memory can have an important impact on kernel performance. Hence, as far as possible, this problem should be avoided. 6.6 OpenMP programming model: OpenMP: Parallel programming on CPUs has long existed for some manufacturers, such as Cray and IBM, but each had its own set of directives, which meant they could only be used on that manufacturer's hardware and were hence not portable. In order to solve this problem, the OpenMP standard was defined by a consortium of industry and academia in 1997, defining a directive-based interface that parallelizes sequential programs, written in C, C++ or Fortran, on shared-memory architectures. OpenMP is a set of simple directives that can be used to parallelize sequential programs. These directives allow the automatic management of parallelism. In other words, OpenMP can automatically create the number of threads needed for the parallelization, depending on the number of processors available on a machine, as well as handling their synchronization and termination. It also makes it possible to define the visibility of the variables used in the parallel region, by using the clauses private() and shared(). OpenMP program structure: The parallel execution model of OpenMP is known as the Fork-Join model, as shown in Figure (6.10).

Figure 6.10: OpenMP program structure The concept of the fork-join model is as follows: - Initial thread (also known as the master thread): 1. Executes the sequential code. 2. Fork: starts the execution of the slave threads. 3. Join: destruction of the slave threads and return to master thread control. - Slave threads: 1. Execute the parallel region. OpenMP directives: The code below describes the parallelization of a simple C loop using OpenMP directives:

int A = 0;
#pragma omp parallel default(none) private(/* ... */) shared(/* ... */) reduction(+:A)
{
    #pragma omp for
    for (int i = 1; i <= imax; i++)
    {
        /* loop body */
    }
}

- Keywords:

#pragma: All OpenMP directives start with the keyword #pragma. reduction: The reduction(+:A) clause computes a single global variable A as the sum of several private variables of the same name. #pragma omp parallel: This is used to declare the parallel region. #pragma omp for: This is used to parallelise the for loop, where each thread runs a specific number of loop iterations. default, private, shared: These clauses tell the compiler which variables are shared by the threads and which ones must be kept private. If a variable is shared, all the threads have access to it; when a variable is declared private, each thread creates its own separate copy, which is destroyed when the parallel region terminates. A variable created inside the parallel region is considered private, whereas a variable declared outside the parallel region is considered shared. The keyword default(none) means that we must indicate to the compiler the visibility of every variable used in the parallel region. The example discussed above is far from comprehensive; for details of all the possible OpenMP constructs and how they can be used, see the official OpenMP website. The optimisation techniques that we discussed for CUDA will be used to improve the performance of the SLV program. The results will be discussed in more detail in the Results chapter.


Chapter 7 Results The results chapter is divided into two parts. The first part focuses on the European call option prices generated using the SLV model, in order to show that the leverage function corrects the mispricing of a simple Heston model. The second part of the chapter is about HPC applied to the SLV program; in other words, how we can make the ADI algorithm faster, by discussing the different tridiagonal solvers (explained in Chapter 5) and how they can be optimised using the CUDA optimization techniques discussed previously. 7.1 Stochastic Local Volatility Model Results Volatilities & Leverage function surfaces Figure (7.1) below shows an artificial implied volatility surface that was used to generate the local volatility surface to which the leverage function will be calibrated. We do not use real market implied volatilities, in order to ensure that the local volatility code, based on cubic spline interpolation in time and in space, will not generate negative local volatilities, as these would cause problems when we calibrate the leverage function.

Figure 7.1: Artificial Implied Volatility Surface Figure 7.2: Generated Local Volatility Surface

Figure 7.3: Generated Leverage Function Surface SLV Call options pricing Table 7.1 shows the Heston parameters calibrated to the artificial implied volatility for a single maturity of T = 0.25 years. We assume a constant interest rate of 2%. V0 ξ k θ S0 ρ rate % Table 7.1: Calibrated Heston Parameters Figure (7.3) shows that the generated leverage function surface tends to be very close to 1, which is expected because, as shown in Table 7.2, the calibrated Heston model is able to generate relatively accurate results, especially for ATM and ITM call options. However, as shown in Figure (7.4), the absolute pricing error of the Heston model is higher than that of the SLV model. That being said, we can see that the SLV model is more accurate than the Heston model, which is mainly due to the

introduction of the calibrated leverage function. The small pricing errors of the SLV model are due to small inaccuracies of the implemented ADI method. Strike Heston SLV Market price Table 7.2: European options pricing for the SLV and Heston model Figure 7.4: European options pricing absolute error for the SLV and Heston model In the next section we will discuss the different techniques used to make the SLV program faster.

7.2 High Performance Computing Results Experimental environments M-Class NVIDIA GPU: Tesla M2050: As shown in Table (7.3), for the many-core architecture we use CUDA to program an NVIDIA Tesla M2050 GPU. It has 448 cores, a core clock speed of 1.15 GHz, compute capability 2.0, a global memory size of 3 GB and up to 48 KB of shared memory per block.

Peak double precision floating point performance: 515 Gigaflops
Peak single precision floating point performance: 1030 Gigaflops
Memory clock speed: 1.55 GHz
Core clock speed: 1.15 GHz
CUDA cores: 448
Compute capability: 2.0
Memory size (GDDR5): 3 GigaBytes
Memory bandwidth (ECC off): 148 GBytes/sec
Power consumption: 225 W TDP
CUDA SDK:
CUDA Driver API:

Table 7.3: Technical Specifications GPU

Intel CPU: Intel(R) Xeon(R) CPU X5650: As shown in Table (7.4), for the multi-core architecture we use OpenMP to program 2 Intel Xeon X5650 processors.

Memory size/type: 48 GB
Core clock speed: 2.67 GHz
Number of cores: 12
Number of threads: 12
Power consumption: 95 W TDP
OpenMP version: 2.0

Table 7.4: Technical Specifications CPU

7.2.2 Implemented tridiagonal solvers In this section we discuss how the CR and PCR tridiagonal solvers were implemented and the different optimization techniques that can be used to make them faster. We analyse each implementation using the NVIDIA Visual Profiler, for both single and double precision. CR Implementation 1: Figure 7.5: Implementation 1: forward reduction for CR algorithm As can be seen in Figure (7.5), in this implementation of CR we start the forward reduction with the number of active threads equal to half the number of equations in the tridiagonal system, and we halve the number of active threads after each forward-reduction step. The backward substitution follows essentially the same logic in reverse: we double the number of active threads at each step as we move backward. In the forward reduction, thread 0 accesses equation 0, thread 1 is inactive, thread 2 accesses equation 2, and so on. This method suffers from the problem known as warp divergence. CUDA blocks are split into warps. All 32 threads of each warp are executed simultaneously and each thread inside the same warp executes the same instruction. Therefore, branch instructions, such as if-else conditions, cause different threads in the same warp to follow different paths; this is known as branch or warp divergence. Example 1:

if (threadIdx.x % 2 == 0) {
    // "if" part: thread execution path
} else {
    // "else" part: thread execution path
}

The code above shows the logic used to implement the forward reduction and backward substitution in this first implementation of the CR algorithm and illustrates one example of warp divergence. The 16 threads of a half-warp with even IDs go through the if branch at the same time, while the other 16 threads with odd IDs wait for them to finish in order to start executing the else branch. The threads converge again after all divergent paths are completed. Example 2:

if (threadIdx.x < WARP_SIZE) {
    // "if" part: warp execution path
} else {
    // "else" part: warp execution path
}

In the second example above there is no branch divergence, because the threads of the first warp of each block execute the if branch and the remaining warps execute the else branch. Hence, the threads within one warp all execute the same instructions. Compute resources are used more efficiently when all threads in a warp have the same branching behaviour, because divergent branches lower warp execution efficiency, which hurts performance. It is worth insisting on the fact that branch divergence can only happen within the threads of the same warp, because different warps are scheduled independently. In order to understand the impact of warp divergence on the performance of this first implementation of the CR algorithm, we profile the code and summarise the results in Tables (7.5) and (7.6) below:

Table (7.5) shows the performance of the first implementation of the CR algorithm solving a 512x512 system, where the lower, main and upper diagonals of the tridiagonal system are stored linearly in the GPU global memory.

Duration (ms) | Warp EE | SM RO | GM RO | GM S-L E
Single precision: | % | 0% | 8% | 34% / 6.2%
Double precision: | % | 0% | 15.6% | 45.1% / 8.1%

Table 7.5: CR implementation 1 profiling (global memory)

- Warp EE: Warp Execution Efficiency
- SM RO: Shared Memory Replay Overhead
- GM RO: Global Memory Replay Overhead
- GM S-L E: Global Memory Store-Load Efficiency

Figure 7.6: CR implementation 1 profiling (global memory) Table (7.6) shows the performance of the first implementation of the CR algorithm solving a 512x512 system, where the lower, main and upper diagonals of the tridiagonal system are stored linearly in the GPU shared memory.

Duration (ms) | Warp EE | SM RO | GM RO | GM S-L E
Single precision: | % | 0% | 1.2% | 100% / 100%
Double precision: | % | 0% | 1.9% | 100% / 100%

Table 7.6: CR implementation 1 profiling (shared memory)

Figure 7.7: CR implementation 1 profiling (shared memory) Figures (7.6) and (7.7) show the percentage of the total time spent on each phase of the Cyclic Reduction algorithm. On the one hand, Table (7.6) shows that when we use shared memory the running time of the CR algorithm decreases significantly, which is expected because, as stated previously, shared memory is faster than global memory. On the other hand, Tables (7.5) and (7.6) show that the Warp Execution Efficiency is low, due to the warp divergence problem mentioned previously, which makes the CR algorithm slower.

CR Implementation 2: Figure 7.8: Implementation 2: forward reduction for CR algorithm For the second implementation of the CR algorithm, we use consecutive threads to access the even equations. In other words, as shown in Figure (7.8), thread 0 accesses equation 0, thread 1 accesses equation 2, and so on. We use consecutive threads in order to avoid the highly divergent branching present in the first implementation of the CR algorithm. This implementation is based on essentially the same logic explained in Example 2 above, which is why it avoids warp divergence. To analyse this implementation we profile the code again; the results are summarised in Tables (7.7) and (7.8) below. Table (7.7) shows the performance of the second implementation of the CR algorithm solving a 512x512 system, where the lower, main and upper diagonals of the tridiagonal system are stored linearly in the GPU global memory.

Duration (ms) | Warp EE | SM RO | GM RO | GM S-L E
Single precision: | % | 0% | 17.8% | 34% / 7.2%
Double precision: | % | 0% | 40.3% | 45.1% / 9.4%

Table 7.7: CR implementation 2 profiling (global memory)

Figure 7.9: CR implementation 2 profiling (global memory) As can be seen in Table (7.7), the Warp Execution Efficiency metric increased significantly in comparison with the first implementation of the CR algorithm, which shows that the warp threads are no longer highly divergent. The fact that the Warp Execution Efficiency is not equal to 100% is mainly due to some if conditions that are needed because the first and last equations of the tridiagonal system must be handled differently from the other equations. Table (7.7) also shows that the global memory replay overhead is relatively high in comparison with the first CR implementation. This can be explained as follows. Typically, data is moved between cache memory and DRAM in a single bus transaction that reads or writes multiple elements at sequential addresses (cache lines). Hence, if the cache line size is, for instance, 12 bytes, then float-type variables will be transferred between DRAM and cache memory in blocks of three variables (3 x 4 bytes). So, if we assume that a warp requests 32 aligned, consecutive 4-byte words, then 128 bytes will be moved across the bus from global memory to the cache. If all the warp threads read from or write to the same cache line, then we have 100% bus utilization. However, if a warp accesses only one 4-byte word in a cache line and skips the other 124 bytes of data, then those 124 bytes are wasted and only 4 bytes are used. This means that the bus bandwidth is being wasted (cache misses), leading to a low bus utilization equal in this example to 4/128 = 3.125%. For this reason coalesced global memory access is very important, because it ensures 100% bus utilization and therefore much better memory access performance.

Taking into consideration the fact that, in the algorithm described above, consecutive threads access memory addresses that are very far apart in physical memory, due to a non-unit stride that doubles at each step of the forward reduction, there is little or no chance for the hardware to combine the accesses. The implemented equations of the CR algorithm illustrate this problem. Forward reduction, first step (stride = 1): Thread 1: b'[2] = b[2] - (a[2] / b[2-1]) * c[2-1] - (c[2] / b[2+1]) * a[2+1]; Thread 2: b'[4] = b[4] - (a[4] / b[4-1]) * c[4-1] - (c[4] / b[4+1]) * a[4+1]. Forward reduction, second step (stride = 2): Thread 1: b'[4] = b[4] - (a[4] / b[4-2]) * c[4-2] - (c[4] / b[4+2]) * a[4+2]; Thread 2: b'[8] = b[8] - (a[8] / b[8-2]) * c[8-2] - (c[8] / b[8+2]) * a[8+2]. This behaviour causes low utilization of cache lines, which leads to very low store/load efficiency. It also increases the number of bus transactions, which explains why the global memory cache replay overhead is quite high, especially in the double precision case. Reducing bus transactions is critical to improving performance, since warps are blocked until their memory requests are serviced. We can alleviate the strided global memory access by using the on-chip shared memory. Table (7.8) shows the performance of the second implementation of the CR algorithm solving a 512x512 system, where the lower, main and upper diagonals of the tridiagonal system are stored linearly in the GPU shared memory.

Duration (ms) | Warp EE | SM RO | GM RO | GM S-L E
Single precision: | % | 60.1% | 2.3% | 100% / 100%
Double precision: | % | 111% | 4.5% | 100% / 100%

Table 7.8: CR second implementation profiling (shared memory)

As shown in Table (7.8), when we use shared memory the running time decreases significantly, as was the case in the first CR implementation.
However, the disadvantage now is that we tend to have more and more bank conflicts towards the end of the forward reduction and at the beginning of the backward substitution,

because of the strided shared memory access. This is exactly the problem discussed by Zhang et al. (2010), and it explains why the shared memory replay overhead metric is very high: as explained previously, when different threads try to access the same shared memory bank simultaneously, the execution of the warp threads is serialized, which deteriorates the memory performance. Figure 7.10: CR second implementation profiling (shared memory) Figures (7.9) and (7.10) show the percentage of the total time spent on each phase of the Cyclic Reduction algorithm. When we compare the two implementations of the CR algorithm we can see that the second method is slower than the first, which demonstrates that avoiding memory access overheads is more important than having a high warp execution efficiency. PCR implementation: To recapitulate what we explained in Chapter 5, in the first step the PCR algorithm performs a reduction on every equation, not only on half of the equations as is the case for the CR algorithm. Therefore all threads will be active, with one thread per equation: thread 0 for equation 0, thread 1 for equation 1, and so on. After this step the problem is reduced to two sets of equations, each with half the number of equations of the original set, as was shown previously in Figure (5.2). Then, following the same logic used in the first step, we perform the reduction again on each

equation of the two independently generated new sets. We continue the reduction, creating new smaller sets, until we reach the final solution. That being said, we can see that all block threads will be used, and their number will not be halved after each step of the forward reduction. In the following PCR implementation, we create temporary arrays (lower, upper and main diagonal temporary arrays) that will be used to store the values of each array of the tridiagonal system in order to perform the reduction. After finishing each step of the forward reduction we move the calculated values from the temporary arrays to the original ones. In this implementation the temporary arrays are also stored in shared memory. Table (7.9) shows the performance of the PCR algorithm solving a 512x512 system, where the lower, main and upper diagonals of the tridiagonal system, both original and temporary arrays, are stored linearly in the GPU shared memory.

Duration (ms) | Warp EE | SM RO | GM RO | GM S-L E
Single precision: | % | 0% | 1.6% | 100% / 100%
Double precision: | % | 0% | 2.5% | 100% / 100%

Table 7.9: PCR implementation profiling (shared memory)

By comparing the execution time of the PCR and the second implementation of the CR algorithm (CR-M2D), both with shared memory, we can see that PCR is basically as fast as or faster than the forward reduction phase of the CR algorithm for both single and double precision, which makes it, as illustrated in Figure (7.11), almost twice as fast as CR, because PCR has no backward substitution. The fact that the forward reduction phase in PCR is faster than in CR is counter-intuitive, because PCR does more work than CR at each step of the forward reduction. However, PCR is free of bank conflicts, which makes its forward reduction much more efficient.
When we compare the execution time of the PCR and the first implementation of CR (CR-M1D), the PCR is still the fastest algorithm, mainly because it has a very high warp execution efficiency. However, it is not as fast as the forward reduction phase of CR, because in this case the CR algorithm does not suffer from bank conflicts.
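The forward reduction just described can be sketched as a serial reference implementation. This is an illustrative sketch, not the CUDA kernel: on the GPU the inner loop over equations is executed by one thread per equation, and the arrays na, nb, nc, nd play the role of the shared-memory temporaries.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Serial reference sketch of Parallel Cyclic Reduction (PCR) for
// a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], with a[0] = c[n-1] = 0.
std::vector<double> pcr_solve(std::vector<double> a, std::vector<double> b,
                              std::vector<double> c, std::vector<double> d) {
    const int n = (int)a.size();
    for (int stride = 1; stride < n; stride *= 2) {
        // Temporaries: every equation is updated from the *old* values of
        // its neighbours, exactly as all threads do simultaneously on the GPU.
        std::vector<double> na = a, nb = b, nc = c, nd = d;
        for (int i = 0; i < n; ++i) {          // one GPU thread per equation
            double alpha = (i - stride >= 0) ? -a[i] / b[i - stride] : 0.0;
            double gamma = (i + stride <  n) ? -c[i] / b[i + stride] : 0.0;
            nb[i] = b[i]
                  + (i - stride >= 0 ? alpha * c[i - stride] : 0.0)
                  + (i + stride <  n ? gamma * a[i + stride] : 0.0);
            nd[i] = d[i]
                  + (i - stride >= 0 ? alpha * d[i - stride] : 0.0)
                  + (i + stride <  n ? gamma * d[i + stride] : 0.0);
            na[i] = (i - stride >= 0) ? alpha * a[i - stride] : 0.0;
            nc[i] = (i + stride <  n) ? gamma * c[i + stride] : 0.0;
        }
        a.swap(na); b.swap(nb); c.swap(nc); d.swap(nd);
    }
    // After log2(n) doubling steps every equation is fully decoupled,
    // so there is no backward substitution phase.
    std::vector<double> x(n);
    for (int i = 0; i < n; ++i) x[i] = d[i] / b[i];
    return x;
}
```

Note the absence of a backward substitution: once the coupling distance exceeds the system size, each thread simply reads off its own unknown, which is why PCR's total time is close to the CR forward reduction alone.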

Figure 7.11: Tridiagonal solvers running time to solve a 512x512 size system (double precision)

Figure 7.11 also shows that running the Thomas algorithm in a single CUDA thread (T-GPU) is extremely slow in comparison to the other, parallel solvers, because the GPU's core clock speed is quite low. That is why implementing the ADI scheme so that the tridiagonal systems can be solved with parallel solvers is essential to achieving good acceleration in CUDA. The Thomas algorithm on the CPU (T-CPU), on the other hand, is faster than the PCR and CR implementations in CUDA. That is one of the reasons, as we will see in the next section, why the acceleration of the ADI scheme and of the SLV program with OpenMP remains competitive, even though the number of tridiagonal systems solved in parallel is far smaller than with CUDA.

ADI implementation

This section explains the different implementations that were made to accelerate the ADI scheme. We take as an example the ADI scheme for pricing an option in OpenMP and CUDA. All the results shown in this section are for one time step and a grid of size 512x512.

OpenMP implementation

In the OpenMP implementation of ADI, each thread builds and solves one system using the Thomas algorithm. In other words, at each step of the ADI scheme we solve several systems in parallel, but each tridiagonal system is solved serially.

               1 Thread   2 Threads   4 Threads   8 Threads
Duration (ms)

Table 7.10: OpenMP: ADI implementation timing

Table (7.10) shows that as we double the number of active threads, the running time decreases significantly. Theoretically, one would expect that doubling the number of threads would halve the running time. Unfortunately, this is not always the case, because of overhead factors in the threaded code: the time needed to create the threads, and the time spent by the threading library scheduling each thread's pieces of work. As the number of threads increases, this overhead becomes more significant and therefore has a greater impact on the total running time. In our case, however, the overhead is small because we only parallelise the outer loop of each step of the ADI scheme, so the impact of the threading overhead on the total running time is very low. To support this statement, we measured the overhead caused by the #pragma omp parallel and #pragma omp for directives. The overhead timings are shown in Table (7.11) and Figure (7.12) below:

                                        1 Thread   2 Threads   4 Threads   8 Threads
Duration (μs): #pragma omp parallel
Duration (μs): #pragma omp for

Table 7.11: OpenMP directives overhead in μs

Figure 7.12: OpenMP directives overhead duration in μs

As shown in Table (7.11) and Figure (7.12), the overhead of the parallel and for directives is very small, measured in microseconds. For this reason, the overhead does not have a noticeable impact on the total running time of the ADI algorithm.

CUDA implementation: Method 1

In the first method implemented for the GPU, we follow exactly the same logic as in the OpenMP version. However, as discussed in the literature review chapter, with a 2D ADI most GPU cores are then left unused, so this method does not really take advantage of the parallelization power offered by GPUs. We store all the tridiagonal system arrays in global memory. Each half-step of the ADI scheme is implemented in a separate kernel: in the first kernel we build and solve the system generated by equation (4.21), and in the second kernel we build and solve the system generated by equation (4.22). Therefore, at each time step we call two kernels. In order to analyse the performance of each kernel we profile the ADI for one time step. The results are shown in Table (7.12) below:

             Duration (ms)   Warp E.E   SM RO   GM RO   GM S-L E
kernel 1                            %      0%   56.6%   9.9%/5.6%
kernel 2                            %      0%   28.7%   19.8%/47.3%
Total time:

Table 7.12: Method 1: ADI implementation profiling (CUDA:M1)

CUDA implementation: Method 2

In the second method, each tridiagonal system is built and solved by all the threads of a block. We now have two levels of parallelization: all tridiagonal systems are solved in parallel across blocks, and each tridiagonal system is solved in parallel by all the block's threads using the PCR algorithm. This method is clearly harder to implement than the first one, but it takes advantage of the number of cores provided by the GPU. The tridiagonal system arrays are stored linearly in the GPU shared memory, except for the 2D option price matrix C, which is stored linearly in the GPU global memory. In order to analyse the performance of each kernel we run the ADI for one time step in the profiler. The results are shown in Table (7.13) below:

             Duration (ms)   Warp E.E   SM RO   GM RO   GM S-L E
kernel 1                            %      0%    0.3%   98.5%/74%
kernel 2                            %      0%    4.5%   25%/10.3%
Total time: 4.1

Table 7.13: Method 2: ADI implementation profiling (CUDA:M2)

As can be seen in Table (7.13), unlike for the first kernel, the global memory load/store efficiency of the second kernel is very low. This result is expected, and can be explained by the fact that in the first kernel consecutive threads access contiguous memory locations of the matrix C, as in the example below:

C[i * (jmax + 1) + threadID]

This is a coalesced access to global memory, which explains why the global memory store/load efficiency is high in comparison with the second kernel.

In the second kernel, consecutive threads access non-contiguous memory locations of the matrix C, as in the example below:

C[threadID * (jmax + 1) + j]

This is a non-coalesced global memory access, which explains the low global memory load/store efficiency of the second kernel. We can solve this problem by using the transpose of the matrix C in the second kernel. This allows each thread to access the desired value of matrix C in the same way as in the first kernel: each thread accesses the value in CTranspose[j * (jmax + 1) + threadID] instead of the value in C[threadID * (jmax + 1) + j]. With this technique the global memory accesses in the second kernel are also coalesced. However, to implement this method we need a kernel that, at each time iteration, computes the transposes of the original and intermediate option price matrices, C and Y respectively, just before starting the second ADI half-step; after finishing that half-step we must call the same kernel once again to return to the original matrix C, because it is needed in the next time step. This means the transpose kernel is called a total of 3 times, which unfortunately increases the computation time of each ADI time step. Table (7.14) below summarises the results:

                   Duration (ms)   Warp E.E   SM RO   GM RO   GM S-L E
kernel 1                                  %      0%    0.3%   98.5%/74%
kernel 2                                  %      0%    0.2%   100%/70.5%
kernel Transpose                          %      0%    5%     100%/100%
Total time: kernel 1 + kernel 2 + 3 * kernel Transpose = 3.75 ms

Table 7.14: Method 2: ADI implementation profiling (coalescing) (OPT-CUDA:M2)

After applying the transpose technique in the second kernel, the global memory store/load efficiency increased dramatically and the computation time of kernel 2 went down from 2.16 to 1.69 milliseconds. However, the improvement in the total running time is not very significant. Figure (7.13) below summarises the accelerations achieved with both CUDA and OpenMP.
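The index relation behind the transpose trick can be checked with a small host-side sketch (illustrative only; the thesis implements the transpose as a CUDA kernel, and on the square 512x512 grid used here imax = jmax, so the row and column lengths coincide).

```cpp
#include <cstddef>
#include <vector>

// C is rows x cols, stored row-major. In kernel 2, thread `tid` walks row
// `tid`, i.e. C[tid * cols + j]: consecutive threads touch addresses `cols`
// elements apart, which is non-coalesced on the GPU. Reading the transpose
// at CT[j * rows + tid] instead gives consecutive threads consecutive
// addresses, i.e. a coalesced access pattern.
std::vector<double> transpose(const std::vector<double>& C, int rows, int cols) {
    std::vector<double> CT((size_t)rows * cols);
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            CT[(size_t)j * rows + i] = C[(size_t)i * cols + j];
    return CT;
}
```

On the GPU a common way to keep the transpose itself fast is a tiled shared-memory transpose kernel, so that both its reads and its writes are coalesced.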

Figure 7.13: ADI running time in milliseconds for one time step

SLV implementation

Now that we have discussed the implementation of the ADI scheme, we explain how the SLV model is implemented in both CUDA and OpenMP. As already explained, the ADI methods for the calibration and for the option pricing are the most computationally demanding parts of the SLV algorithm; hence these are the parts accelerated with OpenMP and CUDA, in order to compare the acceleration achieved with the two technologies. PCR is the tridiagonal solver used in the SLV program. The results are summarised in the two sections below:

SLV implementation: OpenMP

                     1 Thread   2 Threads   4 Threads   8 Threads
Duration (seconds)

Table 7.15: OpenMP SLV implementation (512x512 & 150 time steps)


More information

Advanced Numerical Techniques for Financial Engineering

Advanced Numerical Techniques for Financial Engineering Advanced Numerical Techniques for Financial Engineering Andreas Binder, Heinz W. Engl, Andrea Schatz Abstract We present some aspects of advanced numerical analysis for the pricing and risk managment of

More information

IEOR E4703: Monte-Carlo Simulation

IEOR E4703: Monte-Carlo Simulation IEOR E4703: Monte-Carlo Simulation Simulating Stochastic Differential Equations Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Hints on Some of the Exercises

Hints on Some of the Exercises Hints on Some of the Exercises of the book R. Seydel: Tools for Computational Finance. Springer, 00/004/006/009/01. Preparatory Remarks: Some of the hints suggest ideas that may simplify solving the exercises

More information

Hedging Barrier Options through a Log-Normal Local Stochastic Volatility Model

Hedging Barrier Options through a Log-Normal Local Stochastic Volatility Model 22nd International Congress on Modelling and imulation, Hobart, Tasmania, Australia, 3 to 8 December 2017 mssanz.org.au/modsim2017 Hedging Barrier Options through a Log-Normal Local tochastic Volatility

More information

The Pennsylvania State University. The Graduate School. Department of Industrial Engineering AMERICAN-ASIAN OPTION PRICING BASED ON MONTE CARLO

The Pennsylvania State University. The Graduate School. Department of Industrial Engineering AMERICAN-ASIAN OPTION PRICING BASED ON MONTE CARLO The Pennsylvania State University The Graduate School Department of Industrial Engineering AMERICAN-ASIAN OPTION PRICING BASED ON MONTE CARLO SIMULATION METHOD A Thesis in Industrial Engineering and Operations

More information

Hedging under Model Uncertainty

Hedging under Model Uncertainty Hedging under Model Uncertainty Efficient Computation of the Hedging Error using the POD 6th World Congress of the Bachelier Finance Society June, 24th 2010 M. Monoyios, T. Schröter, Oxford University

More information

Randomness and Fractals

Randomness and Fractals Randomness and Fractals Why do so many physicists become traders? Gregory F. Lawler Department of Mathematics Department of Statistics University of Chicago September 25, 2011 1 / 24 Mathematics and the

More information

Weak Reflection Principle and Static Hedging of Barrier Options

Weak Reflection Principle and Static Hedging of Barrier Options Weak Reflection Principle and Static Hedging of Barrier Options Sergey Nadtochiy Department of Mathematics University of Michigan Apr 2013 Fields Quantitative Finance Seminar Fields Institute, Toronto

More information

1.1 Basic Financial Derivatives: Forward Contracts and Options

1.1 Basic Financial Derivatives: Forward Contracts and Options Chapter 1 Preliminaries 1.1 Basic Financial Derivatives: Forward Contracts and Options A derivative is a financial instrument whose value depends on the values of other, more basic underlying variables

More information

Youngrok Lee and Jaesung Lee

Youngrok Lee and Jaesung Lee orean J. Math. 3 015, No. 1, pp. 81 91 http://dx.doi.org/10.11568/kjm.015.3.1.81 LOCAL VOLATILITY FOR QUANTO OPTION PRICES WITH STOCHASTIC INTEREST RATES Youngrok Lee and Jaesung Lee Abstract. This paper

More information

Local vs Non-local Forward Equations for Option Pricing

Local vs Non-local Forward Equations for Option Pricing Local vs Non-local Forward Equations for Option Pricing Rama Cont Yu Gu Abstract When the underlying asset is a continuous martingale, call option prices solve the Dupire equation, a forward parabolic

More information

King s College London

King s College London King s College London University Of London This paper is part of an examination of the College counting towards the award of a degree. Examinations are governed by the College Regulations under the authority

More information

Project 1: Double Pendulum

Project 1: Double Pendulum Final Projects Introduction to Numerical Analysis II http://www.math.ucsb.edu/ atzberg/winter2009numericalanalysis/index.html Professor: Paul J. Atzberger Due: Friday, March 20th Turn in to TA s Mailbox:

More information

(RP13) Efficient numerical methods on high-performance computing platforms for the underlying financial models: Series Solution and Option Pricing

(RP13) Efficient numerical methods on high-performance computing platforms for the underlying financial models: Series Solution and Option Pricing (RP13) Efficient numerical methods on high-performance computing platforms for the underlying financial models: Series Solution and Option Pricing Jun Hu Tampere University of Technology Final conference

More information

Approximation Methods in Derivatives Pricing

Approximation Methods in Derivatives Pricing Approximation Methods in Derivatives Pricing Minqiang Li Bloomberg LP September 24, 2013 1 / 27 Outline of the talk A brief overview of approximation methods Timer option price approximation Perpetual

More information

STOCHASTIC VOLATILITY AND OPTION PRICING

STOCHASTIC VOLATILITY AND OPTION PRICING STOCHASTIC VOLATILITY AND OPTION PRICING Daniel Dufresne Centre for Actuarial Studies University of Melbourne November 29 (To appear in Risks and Rewards, the Society of Actuaries Investment Section Newsletter)

More information

MSC FINANCIAL ENGINEERING PRICING I, AUTUMN LECTURE 9: LOCAL AND STOCHASTIC VOLATILITY RAYMOND BRUMMELHUIS DEPARTMENT EMS BIRKBECK

MSC FINANCIAL ENGINEERING PRICING I, AUTUMN LECTURE 9: LOCAL AND STOCHASTIC VOLATILITY RAYMOND BRUMMELHUIS DEPARTMENT EMS BIRKBECK MSC FINANCIAL ENGINEERING PRICING I, AUTUMN 2010-2011 LECTURE 9: LOCAL AND STOCHASTIC VOLATILITY RAYMOND BRUMMELHUIS DEPARTMENT EMS BIRKBECK The only ingredient of the Black and Scholes formula which is

More information

1) Understanding Equity Options 2) Setting up Brokerage Systems

1) Understanding Equity Options 2) Setting up Brokerage Systems 1) Understanding Equity Options 2) Setting up Brokerage Systems M. Aras Orhan, 12.10.2013 FE 500 Intro to Financial Engineering 12.10.2013, ARAS ORHAN, Intro to Fin Eng, Boğaziçi University 1 Today s agenda

More information

arxiv: v1 [q-fin.cp] 1 Nov 2016

arxiv: v1 [q-fin.cp] 1 Nov 2016 Essentially high-order compact schemes with application to stochastic volatility models on non-uniform grids arxiv:1611.00316v1 [q-fin.cp] 1 Nov 016 Bertram Düring Christof Heuer November, 016 Abstract

More information

Sample Path Large Deviations and Optimal Importance Sampling for Stochastic Volatility Models

Sample Path Large Deviations and Optimal Importance Sampling for Stochastic Volatility Models Sample Path Large Deviations and Optimal Importance Sampling for Stochastic Volatility Models Scott Robertson Carnegie Mellon University scottrob@andrew.cmu.edu http://www.math.cmu.edu/users/scottrob June

More information

STOCHASTIC CALCULUS AND BLACK-SCHOLES MODEL

STOCHASTIC CALCULUS AND BLACK-SCHOLES MODEL STOCHASTIC CALCULUS AND BLACK-SCHOLES MODEL YOUNGGEUN YOO Abstract. Ito s lemma is often used in Ito calculus to find the differentials of a stochastic process that depends on time. This paper will introduce

More information

Monte Carlo Methods in Structuring and Derivatives Pricing

Monte Carlo Methods in Structuring and Derivatives Pricing Monte Carlo Methods in Structuring and Derivatives Pricing Prof. Manuela Pedio (guest) 20263 Advanced Tools for Risk Management and Pricing Spring 2017 Outline and objectives The basic Monte Carlo algorithm

More information

Two-dimensional COS method

Two-dimensional COS method Two-dimensional COS method Marjon Ruijter Winterschool Lunteren 22 January 2013 1/29 Introduction PhD student since October 2010 Prof.dr.ir. C.W. Oosterlee). CWI national research center for mathematics

More information

Remarks on stochastic automatic adjoint differentiation and financial models calibration

Remarks on stochastic automatic adjoint differentiation and financial models calibration arxiv:1901.04200v1 [q-fin.cp] 14 Jan 2019 Remarks on stochastic automatic adjoint differentiation and financial models calibration Dmitri Goloubentcev, Evgeny Lakshtanov Abstract In this work, we discuss

More information