Applying Deep Learning to Derivatives Valuation
Version 1.1
Abstract
The universal approximation theorem of artificial neural networks states that a feed-forward network with a single hidden layer can approximate any continuous function, given a finite number of hidden units and under mild constraints on the activation functions (see Hornik, 1991; Cybenko, 1989). Deep neural networks are preferred over shallow neural networks, as the latter can be shown to require an exponentially larger number of hidden units (Telgarsky, 2016). This paper applies deep learning to train deep artificial neural networks to approximate derivative valuation functions, using a basket option as an example. To do so it develops a Monte Carlo based sampling technique to derive appropriate training and test data sets. The paper explores a range of network geometries. The performance of the training phase and the inference phase are presented using GPU technology.
1 Introduction
1.1 The need for speed
The case for fast derivative valuations is already well established in contexts like CVA and Credit Risk, where derivatives models are repeatedly called within a Monte Carlo simulation (see for example Green, 2015). Newer XVAs such as KVA and MVA are likely to require even more valuations to be calculated, further increasing requirements on computational performance.
It is not just in the context of XVA that computational performance is required. The new market risk capital (FRTB) rules will require up to 63 expected shortfall (ES) calculations to be applied to the Trading Book. Furthermore, some applications such as KVA on FRTB-CVA will require an approximation for the future CVA and its sensitivities to be available within the Monte Carlo simulation.
While the needs of XVA and FRTB are currently driving the need for high performance valuations, faster valuations have always been sought after. Faster valuations allow more sensitivities to be calculated for the trading desk within the same computational constraints. Desks have long had to weigh the benefits of increased accuracy against the number of sensitivities, stress tests, etc. they are able to calculate. For example, increasing the number of Monte Carlo simulations increases accuracy at the expense of compute time, and a trading desk will need to decide whether it prefers a small number of accurate sensitivities or a larger number of less accurate sensitivities.
An important aspect of quantitative finance research is the development of new techniques that deliver more accuracy with the same computational effort or similar levels of accuracy with much lower computational effort. Monte Carlo variance reduction techniques are an example. Recently, there has been work done to find suitable approximations to computationally expensive valuation functions.
A number of approximation techniques such as Longstaff-Schwartz regression (see for example Cesari et al., 2009) and lattice interpolation using PDE or tree-based valuation models are commonly applied in XVA models. Chebyshev interpolation techniques have been applied in the context of XVA and more generally (Zeron and Ruiz, 2017; Gaß et al., 2015).
Deep Neural Networks (DNNs) bring some major benefits to the approximation of derivative valuation functions:
- DNNs do not suffer from the curse of dimensionality. The techniques developed in this paper can be applied to valuation models with hundreds of input parameters.
- DNNs are broadly applicable. They can be trained by a wide variety of traditional models that employ Monte Carlo simulation, finite differences, binomial trees, etc. as their underlying framework.
1.2 Milestones in Neural Network Research
While the term Deep Learning has only recently come into widespread use, the underlying technology of artificial neural networks has a long history with three distinct phases of development. Artificial Neural Networks (ANNs), as the name implies, take their inspiration from neurons in the brain. ANNs consist of layers of connected nodes where the inputs of a node activate the output via a nonlinear activation function (see Figure 1). The computational model for ANNs was introduced by McCulloch and Pitts (1943), while the perceptron, the first model that could learn the weights in the network, was introduced by Rosenblatt (1958). However, ANNs fell out of favour amongst Artificial Intelligence researchers in the 1970s and interest was not revived until the 1980s. The associative neural network was introduced by Hopfield (1982), while the backpropagation algorithm (Werbos, 1974; Rumelhart et al., 1986), a special case of algorithmic differentiation, accelerated the training of ANNs. Neural networks remained popular until the mid 1990s, when research did not meet the expectations of various commercial ventures, leading to investor disappointment (Goodfellow et al., 2016). Neural networks again became popular from 2006, following the introduction of deep belief networks by Hinton et al. (2006). Over the last decade ANNs have gained in popularity based on better performance than other machine learning techniques, rapid advances in computer hardware, especially GPU technology, better management of large data volumes and much enhanced software frameworks.
ANNs are perhaps best known in finance in the context of predictive algorithms for use in trading strategies (see for example Tenti, 1996). Credit scoring and bankruptcy prediction are also applications where ANNs have been applied. ANNs have recently been applied to CVA, alongside other machine learning techniques, where a classifier approach has been used to map CDS spreads to illiquid counterparties (Brummelhuis and Luo, 2017). Neural networks have previously been applied to the approximation of derivative valuations. Hutchinson, Lo, and Poggio (1994) applied shallow neural networks to the Black-Scholes model in an early financial application. More recently, Culkin and Das (2017) applied Deep Neural Networks to the same problem.
1.3 Applying Deep Learning to Derivatives Valuation
This paper makes the following contributions:
- Demonstrates the use of deep neural network models as approximations to derivative valuation routines and provides a basket option as an example.
- Develops a training methodology for neural network models, where the training and test sets are generated using Monte Carlo simulation.
- Explores a range of geometries for neural network models.
- Provides an assessment of computational performance, while making use of CPU and GPU parallelism during training set generation and network parameter fitting.
The remainder of this paper is organised as follows. In Section 2, a brief overview of deep neural network models and the associated fitting algorithm is provided. The application of deep neural networks to approximating derivative valuations is described in Section 3, including the training methodology. Example results and computational performance for the test case are presented in Section 4. Finally, the paper concludes in Section 5.
2 Deep Neural Networks
2.1 Introducing Neural Networks
Notation  Description
$x$, $x_j$  Input layer; $j$th element of the input layer
$y$  Output (vector or scalar depending on problem context)
$\hat{y}^{(i)}$  Output value from the neural network for training example $i$
$x^{(i)}$  $i$th training example
$X$, $Y$  Matrices of all training examples
$W^{[l]}$, $w^{[l]}_j$  Weight matrix for the $l$th layer and $j$th component vector
$b^{[l]}$, $b^{[l]}_j$  Bias vector for the $l$th layer and $j$th element
$L$  Number of layers in the neural network
$g$  Nonlinear activation function
$z^{[l]}$  Result of matrix operations on layer $l$
$a^{[l]}$  Result of activation function on layer $l$
$a^{[0]} = x$, $a^{[L]} = \hat{y}$  Input and output layers
$J$  Cost function
$\odot$  Elementwise multiplication
An artificial neural network consists of a series of layers of artificial neurons. (For a detailed introduction to Deep Learning, see Goodfellow et al. (2016) and Ng (2018). A summary of the notation used in this article can be found in Table 1.) Each neuron takes a vector of inputs from the previous layer and applies a weight vector and a bias, so that
$$z^{[l]}_j = w^{[l]}_j \cdot a^{[l-1]} + b^{[l]}_j \qquad (1)$$
or in matrix notation
$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]} \qquad (2)$$
The result from layer $l$ is then given by applying a nonlinear activation function $g$,
$$a^{[l]} = g(z^{[l]}) \qquad (3)$$
A number of different activation functions can be used and the main types are listed in Table 2. It is essential that a nonlinear function be used, or the neural network model will only ever produce an output that is a linear function of the input, irrespective of the number of layers in the model. (Note that regression problems with a real-valued output may specify a linear activation function in the output layer.) All neurons in the same layer use the same activation function, but different layers often use different activation functions. Equations (2) and (3) together describe how the input vector $x$ is forward propagated through the network to give the final output $\hat{y}$.
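Forward propagation as described by Equations (2) and (3) is straightforward to express in code. The following is a minimal NumPy sketch rather than the paper's implementation; the layer sizes, ReLU hidden activations and linear output activation are illustrative assumptions.

```python
import numpy as np

def relu(z):
    """ReLU activation, applied elementwise."""
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Forward propagate x through the network.

    weights[l] and biases[l] hold W^[l+1] and b^[l+1]; hidden layers
    use ReLU, and the output layer is linear, as is common for
    regression problems.
    """
    a = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ a + b                                  # Equation (2)
        a = relu(z) if l < len(weights) - 1 else z     # Equation (3)
    return a

# Tiny illustrative geometry: 2 inputs -> 3 hidden neurons -> 1 output.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [np.zeros((3, 1)), np.zeros((1, 1))]
y_hat = forward(np.array([[0.5], [-0.2]]), weights, biases)
```

Because inputs are stacked as columns, the same call propagates a whole batch of examples at once.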
Function  Definition
Sigmoid  $g(z) = \frac{1}{1 + e^{-z}}$
tanh  $g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$
ReLU  $g(z) = \max(0, z)$
Leaky ReLU  $g(z) = \max(\alpha z, z)$, with $\alpha$ small
The number of inputs to the model is dictated by the number of input features, while the number of neurons in the output layer is determined by the problem. For a regression problem with one real-valued output, as described in this paper, there will be a single node in the output layer. If a neural network model is used in a classifier problem with multiple classes then there will be one neuron per class. The number of layers in the model and the number of neurons in each layer can be considered hyperparameters, and these are normally set using hyperparameter tuning and optimising over a portion of the training data set, as described in Section 2.3. Models with many hidden layers are known as deep neural networks. An example neural network geometry is illustrated in Figure 1.
2.2 Training the Model and Back Propagation
During the training phase, the weights and bias parameters of the model are systematically updated to minimize the error between the training data and estimates generated by the model. A set of $m$ training examples is used and the error is given by a cost function $J$,
$$J = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) \qquad (4)$$
where the loss function $L$ depends on the choice of loss measure. A number of choices are available for the loss measure including, for example, the L2 or L1 norm.
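As a concrete example, the cost of Equation (4) with the squared-error (L2) loss can be sketched as follows. The factor of one half is a common convention (it cancels in the gradient), not something specified here.

```python
import numpy as np

def l2_cost(y_hat, y):
    """Cost J of Equation (4) with the squared-error (L2) loss,
    averaged over the m training examples in the columns of y."""
    m = y.shape[-1]
    return np.sum((y_hat - y) ** 2) / (2 * m)
```

For a perfect fit the cost is zero, and it grows quadratically with the prediction error.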
An optimization procedure is required to do this and the standard approach uses batch gradient descent or a variant of it such as minibatch gradient descent. (In this paper we use minibatch gradient descent, where the size of each minibatch is a hyperparameter.) Such optimization procedures require the gradients of the cost function,
$$dW^{[l]} = \frac{\partial J}{\partial W^{[l]}} \qquad (5)$$
$$db^{[l]} = \frac{\partial J}{\partial b^{[l]}} \qquad (6)$$
which in the context of gradient descent are used to update the weights,
$$W^{[l]} := W^{[l]} - \alpha\, dW^{[l]} \qquad (7)$$
$$b^{[l]} := b^{[l]} - \alpha\, db^{[l]} \qquad (8)$$
where $\alpha$ is the learning rate.
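The gradient-descent update amounts to one line of code per parameter. A minimal sketch (the default learning rate value is an illustrative assumption):

```python
import numpy as np

def gradient_step(W, b, dW, db, alpha=0.01):
    """One gradient-descent update: move each parameter against its
    gradient, scaled by the learning rate alpha."""
    return W - alpha * dW, b - alpha * db

# One step with unit gradients shifts every parameter by -alpha.
W, b = gradient_step(np.ones((2, 2)), np.zeros((2, 1)),
                     np.ones((2, 2)), np.ones((2, 1)), alpha=0.1)
```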
To obtain the gradients, a specialised type of algorithmic differentiation is used, known as backpropagation. This algorithm recursively applies the chain rule to propagate derivatives back through the computational graph of the neural network model. So beginning with the output layer, the derivative $da^{[L]}$ is defined by
$$da^{[L]} = \frac{\partial J}{\partial a^{[L]}} \qquad (9)$$
The derivative $dz^{[L]}$ is then given by,
$$dz^{[L]} = da^{[L]} \odot g'(z^{[L]}) \qquad (10)$$
Hence the derivatives of the weights for layer $L$ are given by,
$$dW^{[L]} = \frac{1}{m}\, dz^{[L]} a^{[L-1]T} \qquad (11)$$
$$db^{[L]} = \frac{1}{m} \sum_{i=1}^{m} dz^{[L](i)} \qquad (12)$$
The derivative for the next layer down is then given by,
$$da^{[L-1]} = W^{[L]T} dz^{[L]} \qquad (13)$$
and hence the derivatives can be propagated backwards through the network to obtain all the $dW^{[l]}$ and $db^{[l]}$. Forward and backward propagation are illustrated in Figure 2.
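The backward recursion can be checked against finite differences on a small network. The following sketch assumes one hidden ReLU layer, a linear output layer and the L2 loss, with the 1/m averaging folded into the output-layer error term; it is an illustration under those assumptions, not this paper's code.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)

def forward_backward(X, Y, W1, b1, W2, b2):
    """Forward pass, L2 cost, then backpropagation for a
    one-hidden-layer regression network."""
    m = X.shape[1]
    Z1 = W1 @ X + b1
    A1 = relu(Z1)
    Z2 = W2 @ A1 + b2                      # linear output layer
    cost = np.sum((Z2 - Y) ** 2) / (2 * m)
    dZ2 = (Z2 - Y) / m                     # output layer: g' = 1
    dW2 = dZ2 @ A1.T                       # weight gradient, layer 2
    db2 = dZ2.sum(axis=1, keepdims=True)   # bias gradient, layer 2
    dA1 = W2.T @ dZ2                       # propagate error back
    dZ1 = dA1 * relu_grad(Z1)              # elementwise with g'
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)
    return cost, (dW1, db1, dW2, db2)

# Verify one analytic gradient entry with a central finite difference.
rng = np.random.default_rng(1)
X, Y = rng.standard_normal((2, 5)), rng.standard_normal((1, 5))
W1, b1 = rng.standard_normal((3, 2)), np.zeros((3, 1))
W2, b2 = rng.standard_normal((1, 3)), np.zeros((1, 1))
_, grads = forward_backward(X, Y, W1, b1, W2, b2)
eps = 1e-6
Wp, Wm = W2.copy(), W2.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
cp, _ = forward_backward(X, Y, W1, b1, Wp, b2)
cm, _ = forward_backward(X, Y, W1, b1, Wm, b2)
fd = (cp - cm) / (2 * eps)
```

The finite-difference estimate `fd` agrees with the analytic entry of the layer-2 weight gradient to high precision, which is a useful sanity check for any hand-written backpropagation.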
To train the model, $m$ training examples are used. Typically the input data set of $m$ examples is divided into three independent groups: the training set, which is used to train the network; the test set, which is used to test the trained network; and the development set, which is used to develop or cross-validate the hyperparameters of the model. The fraction of data assigned to each group depends on how much training data is available, with the test and development sets forming a relatively larger proportion if $m$ is relatively small. The network is then trained over a number of iterations of a gradient descent algorithm, with the training error and test set error computed.
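The three-way split can be sketched as follows; the 80/10/10 fractions are an assumption for illustration, not the split used here.

```python
import numpy as np

def split_data(X, Y, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle examples (columns of X and Y) and split them into
    training, development and test sets."""
    m = X.shape[1]
    idx = np.random.default_rng(seed).permutation(m)
    n_train = int(train_frac * m)
    n_dev = int(dev_frac * m)
    tr, dv, te = np.split(idx, [n_train, n_train + n_dev])
    return (X[:, tr], Y[:, tr]), (X[:, dv], Y[:, dv]), (X[:, te], Y[:, te])

# Ten toy examples split 8 / 1 / 1.
(X_tr, Y_tr), (X_dv, Y_dv), (X_te, Y_te) = split_data(
    np.arange(20.0).reshape(2, 10), np.arange(10.0).reshape(1, 10))
```

Shuffling before splitting matters when the sampled examples are not already in random order.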
The weights of the model should not be initialized to zero or the model will never be able to break symmetry. Hence the weights are initialized using a pseudorandom number generator, with samples drawn either from the uniform distribution or the standard normal distribution. One common procedure with the ReLU activation function initializes the weights in layer $l$ using a normal distribution with mean zero and variance $2/n^{[l-1]}$, where $n^{[l-1]}$ is the number of neurons in the previous layer.
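This scaled-normal initialization for ReLU layers can be sketched as:

```python
import numpy as np

def scaled_normal_init(n_out, n_in, seed=0):
    """Initialize a weight matrix for a ReLU layer: zero-mean normal
    samples scaled so the variance is 2 / n_in (the fan-in)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

W = scaled_normal_init(400, 200)   # sample variance ~ 2 / 200 = 0.01
```

The scaling keeps the variance of activations roughly constant from layer to layer, which helps gradients neither vanish nor explode in deep networks.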
2.3 Hyperparameter Tuning
While the weights, $W^{[l]}$, and the bias parameters, $b^{[l]}$, are parameters of the model that are determined during training, other parameters such as the learning rate and the topology of the neural network are hyperparameters that are not directly trained. The hyperparameters could simply be exogenously specified; however, in general their optimal values are not known a priori. Hence hyperparameters are optimised by assessing model performance, once trained, on a separate independent set of samples.
In this paper we focus on two axes of development for deep neural networks: we first vary the number of artificial neurons per hidden layer, and then vary the number of hidden layers while holding the number of neurons in each layer fixed.
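A two-axis search of this kind can be organised as a simple grid over depth and width; the specific values below are illustrative assumptions, not the configurations tested here.

```python
import itertools

widths = [64, 128, 256, 512]   # neurons per hidden layer (illustrative)
depths = [2, 4, 6]             # number of hidden layers (illustrative)

# Each candidate geometry would be trained and scored on the
# development set; the geometry with the lowest dev-set error wins.
grid = list(itertools.product(depths, widths))
```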
3 Derivatives Valuation and Deep Learning
A derivatives valuation model is ultimately just a function which maps inputs, consisting of market data and deal-specific terms, to a single output representing the value. That function may have a known analytic form, though frequently numerical approaches including Monte Carlo simulation, binomial trees or finite difference approximations must be used. For simple European stock options there are just five inputs, while for a more complex product like a Bermudan swaption the number of inputs is significantly larger, involving all the properties of an underlying swap and an option exercise schedule. The number of parameters could be in the hundreds or thousands for such complex products. Having a large number of inputs to a