In previous posts, we introduced the basic RL algorithm, Deep Q-learning (DQN). We then applied it, using neural networks as the agent, to an investment strategy, and finally to a cryptocurrency investment strategy. This time, we will implement a slightly more advanced technique, **Double Deep Q-networks** (DDQN), and create a trading strategy using this algorithm.

## DQN revisited

One of the most successful families of algorithms developed in the last few years has been **Reinforcement Learning** (RL), along with its many variants. It has successfully beaten the best human players (and state-of-the-art algorithms) in many games, including chess and Go.

RL works very well for the kind of problem which mathematicians call a **game**. A game is a setup where one or many **agents** perform a series of **actions** which have an impact on the **state** of the environment. Based on the effect each action has, we can define a **reward** for the agent that performed it. As we saw in previous posts, DQN tries to satisfy the **Bellman equation**:

\( Q_{\theta}(s_t, a_t) = r_{t+1} + \gamma * \max\limits_{a_{t+1}}Q_{\theta}(s_{t+1}, a_{t+1}) \)

As we start with random weights, this equality is obviously not fulfilled at first. We can then use the difference between both sides of the equation as our **loss**. Using Mean Squared Error:

\( {loss} = (r_{t+1} + \gamma * \max\limits_{a_{t+1}}Q_{\theta}(s_{t+1}, a_{t+1}) - Q_{\theta}(s_t, a_t))^2 \)

There is a problem with this loss function, though. When we use the computed loss to update the parameters of the network (\( \theta \)), we change *both* \( \max\limits_{a_{t+1}}Q_{\theta}(s_{t+1}, a_{t+1}) \) *and* \( Q_{\theta}(s_t, a_t) \). This can lead to high instability during training. To avoid it, we define a second network, \( \bar\theta \) (called the **target network**). It will be a frozen, periodically updated copy of the weights of \( \theta \) (called the **online network**), and we will use it to compute \( \max\limits_{a_{t+1}}Q_{\bar\theta}(s_{t+1}, a_{t+1}) \). The loss function thus becomes:

\( {loss} = (r_{t+1} + \gamma * \max\limits_{a_{t+1}}Q_{\bar\theta}(s_{t+1}, a_{t+1}) - Q_{\theta}(s_t, a_t))^2 \)
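As a minimal sketch of how this loss could be computed over a batch of transitions (the callables `q_online` and `q_target`, and the transition layout, are assumptions for illustration, not the exact code from the previous posts):

```python
import numpy as np

def dqn_loss(q_online, q_target, transitions, gamma=0.99):
    """Mean-squared DQN loss for a batch of transitions.

    q_online(s) and q_target(s) are assumed to return one Q-value per action.
    Each transition is (s, a, r, s_next).
    """
    errors = []
    for s, a, r, s_next in transitions:
        # The bootstrap target is computed with the frozen target network,
        # so updating theta does not move both sides of the equation at once.
        target = r + gamma * np.max(q_target(s_next))
        errors.append((target - q_online(s)[a]) ** 2)
    return float(np.mean(errors))
```

In a real training loop, the target network's weights would be copied from the online network every fixed number of steps.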

## DDQN

Due to the maximization step in the equation, DQN tends to over-estimate the Q-values. To mitigate this, we introduce DDQN. In this variant, we use \( \theta \) (the online network) to select the action, but we take that action's Q-value from \( \bar\theta \) (the target network). DDQN's loss function is, finally:

\( {loss} = (r_{t+1} + \gamma * Q_{\bar\theta}(s_{t+1}, \arg\max\limits_{a_{t+1}}Q_{\theta}(s_{t+1}, a_{t+1})) - Q_{\theta}(s_t, a_t))^2 \)
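The difference from DQN is easiest to see in code. A sketch of the DDQN bootstrap target (again, the `q_online` and `q_target` callables are assumed to return one Q-value per action):

```python
import numpy as np

def ddqn_target(q_online, q_target, r, s_next, gamma=0.99):
    """DDQN bootstrap target: online net selects, target net evaluates."""
    # The online network chooses which action looks best...
    best_action = int(np.argmax(q_online(s_next)))
    # ...but its value is read from the target network, decoupling
    # selection from evaluation and reducing over-estimation.
    return r + gamma * q_target(s_next)[best_action]
```

Plain DQN would instead take `np.max(q_target(s_next))`, which picks whichever Q-value the target network happens to over-estimate the most.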

## Some more tweaks

### \( \epsilon \)-greedy actions

Having settled on the loss function, we can now train the network. We will be using the game's **transitions** (identified by the tuple \( (s_t, a_t, r_{t+1}, s_{t+1}) \)). But, during training, how are these transitions generated? Once the network is trained, we will obviously want to choose the action that maximizes Q. During training, however, we do not yet know the true values of Q, or even have good approximations for them. That is why it is a good idea to use an **epsilon-greedy** approach: for each transition, we choose the action with the highest Q-value with probability \( (1 - \epsilon) \), or a random action with probability \( \epsilon \).
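An epsilon-greedy selector can be written in a few lines (a minimal sketch; the function name and argument layout are our own):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Return the greedy action with probability 1 - epsilon,
    otherwise a uniformly random action."""
    if rng.random() < epsilon:
        # Explore: any action is equally likely
        return rng.randrange(len(q_values))
    # Exploit: the action with the highest Q-value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```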

What is a good value for \( \epsilon \)? This is called the *exploration vs exploitation* problem. We want a high value at the beginning, so the network explores many different strategies. And then we want it to go down over time to settle at a much lower rate. The exact initial and final values, as well as the method to bring it down progressively, will depend on each problem’s nature.
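One common way to bring \( \epsilon \) down over time is a linear anneal; the particular start, end, and horizon below are placeholder values, not the ones used in our experiment:

```python
def epsilon_schedule(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly anneal epsilon from eps_start to eps_end over
    decay_steps training steps, then hold it constant."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Exponential decay (multiplying \( \epsilon \) by a constant factor each step) is an equally common alternative; which works better depends on the problem.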

### Experience replay

The final problem we will explore is the **correlation between states**. When, given \( s_t \), the network chooses an action \( a_t \), the next state \( s_{t+1} \) is highly correlated with \( s_t \). In Supervised Learning (SL), we avoid clustering similar data together by shuffling the training data in each epoch. But in RL, we generate new data as we go, by playing the game over and over again. We will therefore use an **experience replay**: a buffer which holds the last \( n \) transitions. We want \( n \) to be big enough to hold plays that follow different strategies, but not too big: as the network learns, old transitions contain little information the network doesn't already know.
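A minimal replay buffer needs only a bounded queue and uniform sampling (a sketch using the standard library; the class and method names are our own):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest transitions
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive states before they reach the network
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```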

## A financial game

When is using RL advantageous? When we do not know what the best strategy is beforehand, but, given an outcome, we can judge how good that outcome is. It is also useful when the final outcome is dependent on a series of intermediate outcomes. Let’s create a setup where Double Deep Q-networks for trading makes sense!

We will have an investment universe made up of five assets: SPDR Gold Shares (GLD), iShares 20+ Year Treasury Bond ETF (TLT), United States Oil ETF (USO), iShares iBoxx Investment Grade Corporate Bond ETF (LQD) and SPDR S&P 500 ETF (SPY). The training period will be from 10 April 2006 to 31 December 2018, and the test period will be from 1 January 2019 to 29 November 2022. Once a week, we will choose one asset, and increase our exposure to it by 10%. This exposure will come at the expense of the asset with the current highest exposure.
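The weekly action described above can be sketched as a simple function of the current weights (a toy illustration of the rule, with our own tie-breaking and clipping assumptions; not the exact environment code):

```python
def rebalance(weights, chosen, step=0.10):
    """Shift `step` of exposure from the currently largest position
    to the `chosen` asset. Weights are fractions summing to 1."""
    weights = list(weights)
    heaviest = max(range(len(weights)), key=lambda i: weights[i])
    if heaviest == chosen:
        return weights  # nothing to move
    moved = min(step, weights[heaviest])  # a position cannot go below zero
    weights[heaviest] -= moved
    weights[chosen] += moved
    return weights
```

The five possible actions (one per asset) map naturally onto the discrete action space that DDQN expects.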

The five assets’ cumulative prices for the train and test periods are, respectively:

### Results

We train our strategy and compare it against a weekly-rebalanced, equally-weighted benchmark. The results for the test period are as follows:

As we can see, Double Deep Q-networks show promise for trading: the strategy fares better than the equally-weighted benchmark over the whole period. Of course, much more fine-tuning would be needed before a model like this could be used in a real investment, and we suspect that with some of it, the strategy could fare even better.

For this example, we have fixed some parameters, such as the number of training epochs, the final \( \epsilon \) and its rate of decay, the memory size of the experience replay buffer, the learning rate (along with the optimizer used, Adam), and the reward calculation (the gross return of the strategy over the following week). We have also chosen a very simplified setup, only allowing for 10% increases in a position, with no direct account of transaction costs. Can you think of a setup where, instead of limiting the weekly rotation, we reward the neural net with the net returns instead of the gross ones?

Implement your own version, and tell us what results you can achieve! See you next time!