# GPUs in Finance

## The Workshop

**Date and time:** 8:00 am – 6:30 pm, Friday, 11th September, 2009

**Venue:**

Department of Computing

Imperial College

South Kensington Campus

Huxley Building

180 Queen’s Gate

London

**Venue home page:** http://www.doc.ic.ac.uk/

This workshop focussed on Monte Carlo Pricing Libraries on High-Performance Multi-core Architectures.

## Schedule

__Morning Session__

**08:00 – 08:45:** Registration and refreshments

**08:45 – 09:00:** Welcome Address

**09:00 – 09:30:** Paul Kelly: Software engineering challenges in many-core computing

**09:30 – 10:30:** David Thomas: Monte Carlo methods implementation on GPUs

**10:30 – 12:30:** Lab Session (and coffee): Using Pseudo-random and Quasi-random Number Generators in CUDA

Thomas Bradley & Gernot Ziegler: NVIDIA CUDA SDK

Mike Giles & Robert Tong: NAG Numerical Routines for GPUs

**12:30 – 13:30:** Lunch break, with an introduction to GPU computing hardware by Jas Garcha and a NAG Q&A session and demo by John Holden

__Afternoon Session__

**13:30 – 14:30:** Claudio Albanese: Fourth level BLAS and high performance pricing

**14:30 – 16:00:** Lab Session (and coffee): Claudio Albanese: OPLib, an open source library for Monte Carlo pricing implemented in CUDA

**16:00 – 17:45:** Mike Giles: Monte Carlo and finite difference computations on GPUs (including Lab session)

**17:45 – 18:30:** Panel and open discussion

## Speakers

## Claudio Albanese

Claudio Albanese is a Visiting Professor at the Financial Mathematics Group at King’s College and an independent consultant at Global Valuation Ltd. He received his doctorate in Physics from ETH Zurich, following which he held post-doctoral positions at New York University and Princeton University. He was Associate Professor in the Mathematics Department of the University of Toronto and then Professor of Mathematical Finance at Imperial College London.

## Thomas Bradley

**Thomas Bradley** MEng(Hons) MIEE graduated with a first-class MEng degree in Computer Systems Engineering from the University of Bristol in 2000, having also completed the final year of the Diplôme d’Ingénieur at l’École Nationale Supérieure de Télécommunications in Brest, France. He worked as processor architect for video encoding processors at STMicroelectronics before moving to ClearSpeed Technology plc to lead architecture development for general purpose parallel processors. Since then he has specialised in High Performance Computing software development at ClearSpeed and now at NVIDIA.

## Mike Giles

**Mike Giles** is a Professor of Scientific Computing, a member of the Oxford-Man Institute of Quantitative Finance, and Associate Director of the Oxford e-Research Centre. He focuses on improving the accuracy, efficiency and analysis of Monte Carlo and finite difference methods. He is interested in various aspects of scientific computing, including high performance parallel computing, and in the last couple of years has been working on the exploitation of graphics cards for scientific applications in both finance and computational engineering.

## Paul Kelly

**Paul Kelly** has been on the faculty at Imperial College London since 1989. He teaches courses on compilers and advanced computer architecture. The main focus of his Software Performance Optimisation research group is to extend compiler techniques beyond what conventional compilers can do, by exploiting application domain properties – in particular, recently, with regard to exploiting many-core, SIMT and SIMD hardware. He is serving as Program Committee co-Chair for the 2010 ACM Computing Frontiers conference, and chaired the Software track of IPDPS in 2007.

## David Thomas

**David Thomas** is a post-doctoral research associate in the Department of Computing at Imperial College, working mainly with FPGAs as part of the Custom Computing group.

His two main research interests are random number generators for FPGAs and financial computing using FPGAs.

He has published some 25 articles on computational aspects of Monte Carlo simulations for finance applications, random number generators and applications of reconfigurable computing.

## Robert Tong

**Robert Tong** received a PhD from the University of Bristol (UK) in the area of applied mathematics and scientific computing, following a first degree in mathematics. His research project was prompted by the need to improve the modeling and design of coastal structures after a major failure.

He followed this with the post of Research Fellow in Applied Mathematics at the University of Birmingham, where his focus was on the role of numerical software and mathematical modeling in the context of the failure of structures due to extreme events.

His interest in numerical software led naturally to his joining NAG to work on the development of their libraries. While there he has worked on data approximation methods, including wavelets and radial basis functions, in addition to applications in finance.

## Gernot Ziegler

**Gernot Ziegler** (MSc/civ.ing.) is an Austrian engineer with an MSc degree in Computer Science and Engineering from Linköping University, Sweden. He pursued his PhD studies at the Max-Planck-Institute for Informatics in Saarbrücken, Germany, where he specialized in GPU algorithms for computer vision and data-parallel algorithms for spatial data structures.

## Sponsors

## Materials

- Claudio Albanese, Hongyun Li. Monte Carlo Pricing using Operator Methods and Measure Changes.
- Claudio Albanese. Global Calibration.
- Mike Giles. Monte Carlo and Finite Difference Computations on GPUs.
- Mike Giles. Jacobi Iteration for a Laplace Discretisation on a 3D Structured Grid.
- Paul Kelly. Software Engineering Challenges in Many-Core Computing.
- David Thomas. Monte Carlo Implementations on GPUs.
- Mike Giles. Lab Session 1.
- Claudio Albanese. Lab Session 2.
- Mike Giles. Notes on CUDA practicals on “skynet” cluster.

## Lab Sessions

__Practical 2 Example Output:__

[cudaXX@compC007 prac2]$ bin/release/prac2

NAG GPU normal RNG execution time (ms): 83.077003 , samples/sec: 2.311109e+09

Monte Carlo kernel execution time (ms): 8.961000

Average value and standard deviation of error = 0.41755106 0.00048179

__LIBOR Example Output:__

[cudaXX@compC007 libor]$ bin/release/LIBOR_example_S_Release

GPU time (No Greeks) : 39.938999 msec

CPU time (No Greeks) : 56885.074219 msec

average value v = 48.95407269

average error = 0.00007025

__Example output from laplace3d:__

[cudaXX@compC007 laplace3d]$ ./bin/release/laplace3d_new

Grid dimensions: 256 x 256 x 256

Using device 0: Tesla C1060

Copy u1 to device: 31.299999 (ms)

dimGrid = 8 32 1

dimBlock = 32 8 1

10x GPU_laplace3d: 88.652000 (ms)

Copy u2 to host: 74.970001 (ms)

10x Gold_laplace3d: 1950.051025 (ms)

rms error = 0.000000
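The launch configuration and timings in the laplace3d output can be cross-checked with a few lines of arithmetic. This is a minimal sketch: it assumes (the output does not say so) that each thread handles one (x, y) column of the grid, which is consistent with the reported `dimGrid`, and all variable names are illustrative.

```python
import math

# Figures copied from the laplace3d output above.
nx, ny, nz = 256, 256, 256
block_x, block_y = 32, 8          # dimBlock = 32 8 1
gpu_ms = 88.652                   # 10x GPU_laplace3d
cpu_ms = 1950.051025              # 10x Gold_laplace3d
copy_in_ms = 31.299999            # Copy u1 to device
copy_out_ms = 74.970001           # Copy u2 to host

# Assumption: one thread per (x, y) column, so the grid tiles the x-y plane.
grid_x = math.ceil(nx / block_x)
grid_y = math.ceil(ny / block_y)

# Speedup of the GPU kernel alone, and with host/device copies included.
kernel_speedup = cpu_ms / gpu_ms
end_to_end_speedup = cpu_ms / (gpu_ms + copy_in_ms + copy_out_ms)

print(f"dimGrid = {grid_x} {grid_y} 1")
print(f"kernel-only speedup: {kernel_speedup:.1f}x")
print(f"speedup including transfers: {end_to_end_speedup:.1f}x")
```

For a run of only ten iterations the two copies take longer than the kernel itself, so the transfer cost matters when comparing against the CPU figure.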

## OpLib: A Note from Claudio Albanese

Performance results to be reviewed and discussed during the Laboratory session are divided into GPU and CPU benchmarks for kernel calculation and scenario generation.

My conclusion, based on the equipment I experimented with, is that GPUs show an impressive gain of a factor of 15-20 on kernel calculations. Sustained performance on real-life problems is:

- Tesla 860: 180 GF/sec
- Tesla 1060: 340 GF/sec
- Xeon 5460: 15 GF/sec
- Xeon 5500 (Nehalem): 11 GF/sec

This kernel benchmark is perhaps biased in favour of GPUs, because I spent a lot of time developing BLAS Level 4 routines in CUDA, such as SGEMM4 and SGEMV4, that operate on tensors. On the CPU side I am still using standard BLAS. I attempted to design a CPU-side SGEMM4 by queuing MKL calls, with no success, and will leave that as a challenge for the audience.

On the Monte Carlo scenario generation side, by contrast, I worked a lot on optimizing both GPUs and CPUs. I concluded that the two need to be optimized with radically different strategies to exploit their very different memory/cache configurations. Results for stochastic volatility models on lattices with 512 sites are:

- Tesla 860: 100 million eval/sec
- Tesla 1060: 230 million eval/sec
- Xeon 5460: 180 million eval/sec
- Xeon 5500 (Nehalem): 680 million eval/sec

An “evaluation” here is a single-period draw from a generic Markov chain. For instance, if an interest rate scenario involves generating a curve semiannually over 20 years, that corresponds to 40 evaluations. In my metric, CPUs outperform GPUs by a factor of 3 at scenario generation.
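To make the “evaluation” metric concrete, the short sketch below converts the quoted throughput figures into scenarios per second for the 20-year semiannual example. Comparing the fastest CPU with the fastest GPU is an assumed reading of the factor-of-3 remark, and the variable names are illustrative.

```python
# Throughput figures quoted above, in evaluations per second.
evals_per_sec = {
    "Tesla 860": 100e6,
    "Tesla 1060": 230e6,
    "Xeon 5460": 180e6,
    "Xeon 5500 (Nehalem)": 680e6,
}

# One scenario in the example: a curve generated semiannually over 20 years.
years = 20
periods_per_year = 2
evals_per_scenario = years * periods_per_year  # single-period draws per scenario

for device, rate in evals_per_sec.items():
    scenarios_per_sec = rate / evals_per_scenario
    print(f"{device}: {scenarios_per_sec / 1e6:.2f} million scenarios/sec")

# Assumed reading of the factor-of-3 remark: fastest CPU versus fastest GPU.
cpu_gpu_ratio = evals_per_sec["Xeon 5500 (Nehalem)"] / evals_per_sec["Tesla 1060"]
print(f"Nehalem / Tesla 1060 = {cpu_gpu_ratio:.2f}")
```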

The reason for the GPU under-performance is that I am interested in generic processes, not solvable ones that can be coded using mostly registers and shared memory. For instance, the Black-Scholes example in the SDK only needs Box-Muller transformations. Instead, I give myself a family of cumulative transition probability kernels, stored as large matrices in global memory, that I need to invert at every period in the simulation by performing a binary search. The factor of 3 advantage that CPUs show is due to the fact that the kernels fit snugly into Level 3 cache. On GPUs, as far as I understand, one is forced to use global memory for random uncoalesced access. The kernels would fit in shared memory if it were as large as 2 MB per thread block, but we are far from that mark. On recent CPUs, by contrast, we do have 2 MB of cache per core.
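The inversion step described here can be illustrated in a few lines. This is a minimal CPU-side sketch using only the Python standard library, not OPLib code: the 4-state transition matrix and the function names are hypothetical, and a real lattice would have hundreds of sites.

```python
import bisect
import random
from itertools import accumulate

# Hypothetical 4-state transition matrix; each row sums to 1.
transition = [
    [0.70, 0.20, 0.10, 0.00],
    [0.15, 0.60, 0.20, 0.05],
    [0.05, 0.20, 0.60, 0.15],
    [0.00, 0.10, 0.20, 0.70],
]

# Cumulative transition kernel: row i holds the CDF of the next state given i.
cumulative = [list(accumulate(row)) for row in transition]


def next_state(state, u):
    """Invert the row CDF: smallest j with cumulative[state][j] > u."""
    row = cumulative[state]
    j = bisect.bisect_right(row, u)  # binary search on the sorted CDF row
    # Guard against round-off leaving the last CDF entry slightly below 1.
    return min(j, len(row) - 1)


def simulate_path(start, n_periods, rng):
    """Draw one path; each step is one single-period draw from the chain."""
    path = [start]
    for _ in range(n_periods):
        path.append(next_state(path[-1], rng.random()))
    return path


rng = random.Random(2009)
print(simulate_path(start=1, n_periods=40, rng=rng))
```

Each step reads a different row of `cumulative` at data-dependent positions, which is the access pattern that benefits from a large CPU cache and maps poorly onto coalesced GPU memory reads.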

Another reason for the difference is that I am using hash tables on the CPU side to speed up the calculation of inverse probability distribution functions. Since CPU cores are MIMD with respect to each other, they can branch independently and can take advantage of this. On the GPU side, however, hash tables would worsen performance because of the overhead due to asymmetric branching in thread blocks, and because ultimately thread blocks would default to the worst-case scenario. CPUs accelerate slightly (by about 10%) in single-precision MC calculation, mostly because of better cache management with smaller data structures. I can’t use SSE2 there because I would have to eliminate hash tables in order to avoid uneven branching. On kernel calculations, by contrast, CPUs are about twice as fast in single precision as in double precision thanks to SSE2.
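One common way to accelerate inverse-CDF lookups with a table is a guide (index) table that maps a coarse bucket of the uniform draw to a starting state. The sketch below shows that technique; it is an assumption about what a table-based speed-up might look like, not a description of the hash tables used in OPLib, and the distribution and table size are hypothetical.

```python
import bisect
from itertools import accumulate

# Hypothetical next-state distribution for one row of a transition kernel.
probs = [0.05, 0.10, 0.40, 0.30, 0.10, 0.05]
cdf = list(accumulate(probs))
cdf[-1] = 1.0  # pin the last entry so round-off cannot push a draw out of range

N_BUCKETS = 8  # illustrative size; a real table would be tuned to the lattice

# guide[k] is the state that u = k / N_BUCKETS maps to, i.e. a safe
# starting point for the search for any u in bucket k.
guide = [bisect.bisect_right(cdf, k / N_BUCKETS) for k in range(N_BUCKETS)]


def invert_with_guide(u):
    """Inverse CDF via table lookup plus a short, data-dependent scan."""
    j = guide[int(u * N_BUCKETS)]
    # The number of iterations differs from draw to draw: cheap on a CPU
    # core, but a source of divergence if many GPU threads ran in lockstep.
    while cdf[j] <= u:
        j += 1
    return j


def invert_with_bisect(u):
    """Reference implementation: plain binary search on the CDF."""
    return bisect.bisect_right(cdf, u)


print([invert_with_guide(u) for u in (0.01, 0.2, 0.6, 0.99)])
```

The `while` loop is the uneven branch: most draws exit after zero or one step, but threads executing together would all wait for the slowest one.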

The ideal solution for orchestrating a Monte Carlo pricing engine is thus a hybrid one, with GPUs computing kernels by fourth level BLAS and CPUs generating scenarios and valuing payoffs. The two tasks would take about the same time in a balanced application. The calculation of sensitivities does not require generating scenarios and valuing payoffs repeatedly for each bumped input; instead, one can simply generate a new set of kernels and then value Radon-Nikodym derivatives numerically. The method is very robust even for second-order derivatives and cross-gammas. Finally, backward induction solutions for callables and calibration should obviously be GPU driven.

Development status: I’ve just posted OPLib 1.0 RC5, still a beta but a nearly final release of a set of libraries and benchmarks for high performance pricing routines using a combination of lattice and Monte Carlo methods.
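For a discrete Markov chain, the Radon-Nikodym derivative of the bumped measure with respect to the base measure along a path is the product of the ratios of bumped to base transition probabilities. The sketch below checks that identity on a toy 3-state chain by enumerating every path exactly. The kernels, payoff and names are hypothetical, and it assumes every transition allowed under the bumped kernel is also allowed under the base kernel.

```python
from itertools import product

# Hypothetical base and bumped one-period kernels on a 3-state lattice.
base = [
    [0.6, 0.3, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
]
bumped = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
N_STATES, N_PERIODS, START = 3, 4, 1


def path_probability(kernel, path):
    """Probability of a path under a time-homogeneous kernel."""
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= kernel[a][b]
    return p


def radon_nikodym_weight(path):
    """Bumped-over-base likelihood ratio along a path."""
    w = 1.0
    for a, b in zip(path, path[1:]):
        w *= bumped[a][b] / base[a][b]
    return w


def payoff(path):
    """Toy payoff: the terminal state index."""
    return float(path[-1])


# Every path of N_PERIODS steps starting from START.
paths = [(START,) + tail for tail in product(range(N_STATES), repeat=N_PERIODS)]

# Value under the base kernel, and the bumped value obtained two ways:
# directly, and by reweighting base-measure paths.
base_value = sum(path_probability(base, p) * payoff(p) for p in paths)
bumped_direct = sum(path_probability(bumped, p) * payoff(p) for p in paths)
bumped_reweighted = sum(
    path_probability(base, p) * radon_nikodym_weight(p) * payoff(p) for p in paths
)

print(base_value, bumped_direct, bumped_reweighted)
```

In a Monte Carlo setting the sums over all paths would be replaced by averages over scenarios drawn once from the base kernel, so only the weights need recomputing for each bumped input.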