Reinforcement Learning is one of the most appealing areas of Artificial Intelligence because it can be applied to many kinds of problems. In this article we will analyze an interesting one: Reinforcement Learning for trading strategies.
Reinforcement Learning
We introduced Reinforcement Learning and Q-Learning in a previous post. To recap an important idea from that post: in the RL framework, an agent interacts with an environment by taking a discrete action, and the environment responds with a reward and a new state. To answer the question of how good it is to take action \(a\) in state \(s\) at timestep \(t\), Q-Learning models the Q-value function with the Bellman Optimality Equation:
$$Q(s_t, a_t) = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a)$$
where \( Q \) is the Q-value, \(r\) is the reward, and \(\gamma\) is the discount factor, which determines the relative importance of immediate versus future rewards.
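As a quick refresher, here is a minimal tabular sketch of this update rule. The state/action sizes, the learning rate alpha, and the update loop are illustrative assumptions, not part of our trading setup:

```python
import numpy as np

# Minimal tabular Q-Learning update illustrating the Bellman target above.
# n_states, n_actions, alpha (learning rate) and gamma are illustrative values.
n_states, n_actions = 10, 5
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    # Bellman target: immediate reward plus discounted best future Q-value
    target = r + gamma * np.max(Q[s_next])
    # Move Q(s, a) a small step towards the target
    Q[s, a] += alpha * (target - Q[s, a])
```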
Now we just need to adapt this framework to use RL for trading. Let’s see how this can be done!
Implementation
All the code for our experiments is available here. We create our environment with OpenAI's Gym and the agent's neural network with the model subclassing API of TensorFlow 2.0, since we need a custom training loop.
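The exact architecture lives in the linked repository; as a rough sketch, a subclassed TF 2.0 model with a hand-written training step could look like the following. The layer sizes, optimizer, and loss are placeholder choices for illustration:

```python
import tensorflow as tf

class QNetwork(tf.keras.Model):
    """Maps an observation to one Q-value per discrete action."""
    def __init__(self, n_actions):
        super().__init__()
        self.hidden = tf.keras.layers.Dense(64, activation="relu")
        self.q_values = tf.keras.layers.Dense(n_actions)

    def call(self, obs):
        return self.q_values(self.hidden(obs))

model = QNetwork(n_actions=5)
optimizer = tf.keras.optimizers.Adam(1e-3)

@tf.function
def train_step(obs, actions, targets):
    # Custom training loop: regress the chosen actions' Q-values
    # towards their Bellman targets
    with tf.GradientTape() as tape:
        q = model(obs)
        q_taken = tf.reduce_sum(q * tf.one_hot(actions, 5), axis=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```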
Our environment keeps track of the two assets the agent has: its money (balance) and its stocks (shares_held).
For our problem, we need to define which actions the agent can take, since Q-Learning only deals with discrete actions (a rough sketch of this mapping follows the list):
- Actions 0 and 1 sell 40% and 10%, respectively, of the shares the agent holds
- Action 2 means that the agent does nothing
- Actions 3 and 4 mean that the agent buys shares with 10% and 40%, respectively, of the money it has
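A hedged sketch of how this mapping might be wired into the environment's step logic is shown below. The exact fractions and bookkeeping live in the repository; `balance` and `shares_held` are the two assets described above, while `current_price` is an assumed helper holding the price at the current timestep:

```python
# Sketch of the action mapping inside the environment.
# balance and shares_held are the two assets mentioned above;
# current_price is an assumed argument with the price at this timestep.
def apply_action(self, action, current_price):
    if action == 0:    # sell 40% of the shares held
        sold = 0.40 * self.shares_held
        self.shares_held -= sold
        self.balance += sold * current_price
    elif action == 1:  # sell 10% of the shares held
        sold = 0.10 * self.shares_held
        self.shares_held -= sold
        self.balance += sold * current_price
    elif action == 2:  # hold: do nothing
        pass
    elif action == 3:  # buy shares with 10% of the balance
        spent = 0.10 * self.balance
        self.balance -= spent
        self.shares_held += spent / current_price
    elif action == 4:  # buy shares with 40% of the balance
        spent = 0.40 * self.balance
        self.balance -= spent
        self.shares_held += spent / current_price
```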
When our agent takes a step, it sees the prices for the last 50 days (this length can be changed) and the net worth of its assets. A neural network assigns a value to each of the possible actions; afterwards, the agent uses these Q-values to decide what to do.
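Sketched with the illustrative QNetwork above, the decision at each step might look roughly like this; `window_size`, the `prices` array, and the epsilon-greedy exploration are assumptions for illustration:

```python
import numpy as np

window_size = 50  # length of the price history the agent observes (configurable)

def choose_action(model, prices, net_worth, t, epsilon=0.1, n_actions=5):
    # Observation: the last `window_size` prices plus the agent's net worth
    obs = np.concatenate([prices[t - window_size:t], [net_worth]]).astype(np.float32)
    if np.random.rand() < epsilon:      # occasional random exploration
        return np.random.randint(n_actions)
    q_values = model(obs[None, :])      # one Q-value per action from the network
    return int(np.argmax(q_values))
```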
We will evaluate the performance of the algorithm over several episodes of 100 steps on the test data. For each of these windows, we will calculate the compound annual growth rate (CAGR).
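For reference, a minimal sketch of the CAGR computation over one 100-step window, assuming roughly 252 trading days per year:

```python
def cagr(start_worth, end_worth, n_steps, steps_per_year=252):
    # Compound annual growth rate over a window of n_steps trading days
    years = n_steps / steps_per_year
    return (end_worth / start_worth) ** (1 / years) - 1

# Example: growing 10,000 to 10,400 over a 100-step window
print(cagr(10_000, 10_400, 100))  # ~0.104, i.e. about 10% annualised
```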
Results
We have used American Express (AXP) stock data in our experiment, from 30-12-1999 to 25-11-2020. The test data begins in 2015.

Here we can see that this task is extremely difficult for our agent. The mean return it is capable of reaching is around 4%. Moreover, no matter how long we train, the algorithm eventually stops learning. This means that we need to make further changes to our algorithm.
Conclusions
To wrap up, we have seen a possible Reinforcement Learning stock trading application. However, to make our RL algorithm work better, we will have to make a considerable number of adjustments. We can consider the following options:
- Hyperparameter optimization: we have fixed some parameters in our experiments, so experiments with other values are necessary. The library Tune, built on top of Ray, is a great tool for these adjustments.
- Different RL algorithms: there are other RL algorithms we can use, and some of them perform better. With the library Stable Baselines we can not only choose the algorithm we prefer (some of them can even deal with continuous actions), but also the policy it uses.
- In our current scenario we have used only the data at our disposal; we could instead generate synthetic data to improve the performance of our agent.
- Lastly, we could switch to a better reward system, using the Sharpe Ratio or the Sortino Ratio instead (see the sketch below).
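As a taste of that last idea, here is a minimal sketch of a Sharpe-Ratio-style reward computed over the recent returns of the agent's net worth; the window length and the zero risk-free rate are simplifying assumptions:

```python
import numpy as np

def sharpe_reward(net_worth_history, window=50, risk_free_rate=0.0, eps=1e-9):
    # Per-step returns of the agent's net worth over the recent window
    worth = np.asarray(net_worth_history[-window:], dtype=float)
    returns = np.diff(worth) / worth[:-1]
    excess = returns - risk_free_rate
    # Sharpe Ratio: mean excess return divided by its volatility
    return float(np.mean(excess) / (np.std(excess) + eps))
```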
We are going to give some of these a shot in our next article, so stay tuned! Thanks for reading!