Image processing is one of the hottest topics in AI research, alongside reinforcement learning, AI ethics and many others. A technique for ordinal regression on people's ages was recently published, and in this post we apply it to financial data.
Ranking classification is a common challenge in industry, and research hasn't stopped looking for better ways to solve it. In a previous QuantDare post, we explained the most important approaches to this problem. Here, we'll focus on a pointwise approach proposed in the paper “Rank-consistent Ordinal Regression for Neural Networks” by Wenzhi Cao, Vahid Mirjalili and Sebastian Raschka.
In that paper, the authors tackle ordinal regression as K-1 binary classification problems, where K is the number of positions in the ranking. More specifically, for each position k we predict whether the current observation's rank is greater than k. We can then calculate the final ranking position with the following formula:
\( q = 1 + \sum_{k=1}^{K-1} \hat{y}_{k}(x_{i}) \),
where \(\hat{y}_{k}\) is the binary prediction for element \(x_{i}\) at position k. \(\hat{y}\) is a list of size K-1 containing the binary indicators for each ranking position k.
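As a minimal sketch, the formula above amounts to thresholding each of the K-1 binary probabilities and counting the ones (the function name and the 0.5 threshold are illustrative choices, not from the paper):

```python
import numpy as np

def rank_from_binary_preds(probs, threshold=0.5):
    """Convert K-1 "greater than position k" probabilities into a
    ranking position q = 1 + sum of binary indicators."""
    y_hat = (np.asarray(probs) > threshold).astype(int)
    return 1 + int(y_hat.sum())

rank_from_binary_preds([0.9, 0.8, 0.7, 0.4, 0.2])  # -> 4
```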
This approach has been used in the past too. However, ranking consistency was not guaranteed: the prediction for position k could be 1 (the observation's rank is greater than position k) while the prediction for position k-1 was 0 (its rank is not greater than position k-1). Here's an example to illustrate the contradiction: we can't say that a 25-year-old person is older than 20 but not older than 18 at the same time.
The authors of this paper improve the loss function to account for the previous problem. The original loss function, presented in the paper Ordinal Regression with Multiple Output CNN for Age Estimation by Niu et al., was a cross-entropy with an importance coefficient for each ranking position k. In the new paper, the authors introduce a second part to the original equation that ensures the probability of position k is greater than or equal to the probability of position k+1. This aspect is explained in detail in the original paper.
For our problem, we will assign more weight to the top ranking positions so the model focuses on minimizing the error there. Our weight will be defined like this:
\( \Lambda_{k} = \frac{q}{y} \)
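To make the role of the importance coefficients concrete, here is a sketch of the weighted sum of binary cross-entropies over the K-1 ordinal tasks (the function name is ours, and this is a simplified NumPy version of the Niu et al.-style loss, without the rank-consistency term the paper adds):

```python
import numpy as np

def weighted_ordinal_loss(probs, levels, importance):
    """Importance-weighted sum of binary cross-entropies over the
    K-1 ordinal tasks.

    probs:      predicted P(rank > k) for k = 1..K-1
    levels:     binary ground-truth indicators for the same tasks
    importance: weight of each ranking position k
    """
    probs = np.clip(probs, 1e-7, 1 - 1e-7)  # avoid log(0)
    bce = -(levels * np.log(probs) + (1 - levels) * np.log(1 - probs))
    return float((importance * bce).sum())
```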
Bringing the solution to financial problems
In the financial world, we want to prioritize which assets to invest in, and that's where the solution above kicks in! We're going to predict the ranking position of a set of indices for each day. Let's see how to formulate this problem.
Our data
The data we're going to work with consists of relative returns computed from closing prices for each business day for nine indices, from 7th January 1999 to 3rd March 2016.

Figure 1. Snapshot of the original data we are going to work with, daily returns from indices.
Target variable definition
The target variable will be a discrete ranking from 0 to K-1, where K is the number of positions in the ranking, equal to the number of indices we have. The values to rank will be the linearly weighted moving average of the returns over the next 21 days, where closer returns carry more weight than more distant ones.

Figure 2. Weights for each day in the window.
The higher the values, the closer we'll be to the top of the ranking (better); the lower the values, the closer we'll be to the bottom (worse).
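A sketch of the target value computation, assuming a simple linearly decreasing weighting over the forward window (the exact weighting scheme from Figure 2 may differ):

```python
import numpy as np

def forward_weighted_return(future_returns):
    """Linearly weighted average of the next len(future_returns) daily
    returns, with nearer days weighted more heavily."""
    n = len(future_returns)
    weights = np.arange(n, 0, -1, dtype=float)  # n, n-1, ..., 1
    weights /= weights.sum()                    # normalize to sum to 1
    return float(np.dot(weights, future_returns))
```

Ranking these values across the nine indices each day yields the target position q for every index.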
Each ranking position (q) will be transformed into a list of binary indicators. Here's an example to illustrate this better: if our observation has ranking position 4 out of 9, it will be encoded as [1, 1, 1, 0, 0, 0, 0, 0], meaning that the observation's ranking position is greater than the first position, greater than the second, and greater than the third, but not greater than the remaining positions.
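This encoding is straightforward to implement (function name is ours):

```python
def encode_rank(q, K):
    """Encode ranking position q (1..K) as K-1 binary indicators:
    1 where q is greater than position k, else 0."""
    return [1 if q > k else 0 for k in range(1, K)]

encode_rank(4, 9)  # -> [1, 1, 1, 0, 0, 0, 0, 0]
```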
Input features
During my experience as a data scientist, I've worked with people who advocate using business knowledge, and people who prefer the model to learn its own features in order to avoid introducing human bias when creating or selecting inputs. I consider myself part of the first group, but this time, given that neural networks have proven to be good at learning features by themselves, I'm going to use only the returns of each index as input and give that other point of view a chance.
Model to use
We are going to use an LSTM model given the sequential nature of our data, with one LSTM per index. The reason for using several models is that we want each model to learn the dependency between values of a single return series; if we used data from all indices at once (shaping our data with K rows per day), consecutive values would come from different indices, and we don't want that. For that reason, I will use one LSTM model per index to predict that index's ranking position each day, feeding the model its consecutive daily returns.
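A sketch of how the per-index input tensors could be built (the 21-day window length is an assumption, mirroring the target horizon; the function name is ours):

```python
import numpy as np

def make_sequences(returns, window=21):
    """Build a (samples, window, 1) tensor of consecutive daily returns
    for a single index's LSTM."""
    X = np.array([returns[i:i + window]
                  for i in range(len(returns) - window)])
    return X[..., np.newaxis]  # add the feature dimension
```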

Figure 3. Detail of an LSTM cell. (Source: http://bit.ly/2pOS3UT)
The model used for each index has two LSTM layers with 8 hidden units each and a dense output layer with K-1 units. The model was trained for 100 epochs with a learning rate of 0.01, reduced on plateau with a patience of 10 epochs down to a minimum learning rate of 0.0001.
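A minimal Keras sketch of the per-index model described above (layer sizes, learning rates and patience come from the post; the input window length and the choice of binary cross-entropy are assumptions on our part):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

K = 9          # number of indices / ranking positions
WINDOW = 21    # assumed input sequence length

model = Sequential([
    LSTM(8, return_sequences=True, input_shape=(WINDOW, 1)),
    LSTM(8),
    Dense(K - 1, activation="sigmoid"),  # K-1 "greater than k" outputs
])
model.compile(optimizer=Adam(learning_rate=0.01),
              loss="binary_crossentropy")

reduce_lr = ReduceLROnPlateau(patience=10, min_lr=0.0001)
# model.fit(X_train, y_train, epochs=100,
#           validation_data=(X_val, y_val), callbacks=[reduce_lr])
```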
Simulation
We'll use the period from 7th January 1999 to 1st January 2012 as the training set, from 2nd January 2012 to 1st January 2014 as the validation set, and from 2nd January 2014 to 1st January 2016 as the test set.
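With the data in a pandas DataFrame indexed by date, the chronological split can be sketched like this (the function name is ours; `.loc` date slicing is inclusive on both ends):

```python
import pandas as pd

def split_by_date(df):
    """Chronological train/validation/test split at the dates used in
    the post (df must have a DatetimeIndex)."""
    train = df.loc["1999-01-07":"2012-01-01"]
    val = df.loc["2012-01-02":"2014-01-01"]
    test = df.loc["2014-01-02":"2016-01-01"]
    return train, val, test
```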
Results
Here you can see the accuracy distribution for every data split, by index:

Figure 4. Accuracy value distributions for each index by data split: blue for the training split, orange for the validation split and green for the test split.
From the distribution plots above, we can see that training accuracy shows mixed results: for some indices the most frequent value was greater than 80% (e.g. Global), while for others it was between 60% and 70% (e.g. Government Bond).
Looking at the validation and test distributions, we see mismatches, sometimes even between the test and validation distributions themselves, meaning that the model's performance isn't constant through time. In some cases the test and validation distributions have their most frequent value well below the training one (e.g. Commodities), suggesting an overfitting problem.
In the following plot, we aim to get a better grasp of which indices perform better or worse:

Figure 5. Box-Plots of accuracy by index.
Here we can see that some indices perform better than others; Commodities and Emerging clearly show the worst median accuracy and reach the lowest minimum accuracy values.
Conclusions and further work
This has been a first hands-on attempt at applying a technique from an image-processing problem to a financial one. From the results we've seen, there's room for improvement, not only in the results per se but also in some of the decisions that were taken. Here are some ideas:
- Choose a better set of hyper-parameters: the ones used were chosen because they've worked reasonably well on other projects with similar data, but a more rigorous procedure should be used to select them.
- Debug the neural network: is the network learning anything at all after epoch t? Are we suffering from vanishing or exploding gradients? How sensitive is the model to randomness? How does the high decimal precision of the inputs impact the network's performance? These kinds of questions should be answered too.
- Control what we're feeding into the model: although it's hard to know what's going on inside a neural network, we can keep some control (just a little, actually) over what it does through the data we feed it. Its performance might even improve with more information, or it could behave more coherently with prior business knowledge!
- Pointwise or listwise?: though we've used a pointwise approach here, one could argue that the unit of prediction is the ranking generated by all observations each day, and that the quality of that ranking is what should be optimized. In this post we don't take information from other indices into account when predicting the ranking position of a specific index, and maybe that's valuable information.
- Improve performance with basic aspects of binary classification: because we predict ranking positions for each index independently, and each index could be biased towards certain ranking positions, we may have imbalanced classes. At the moment we label as 1 every probability greater than 0.5, but this could be refined by finding an optimal threshold. Another way to tackle class imbalance is to modify the weight of each ranking position in the loss function, giving more weight to minority classes/label positions.
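The threshold idea from the last bullet can be sketched as a simple grid search over a validation set (the function name and grid are illustrative choices):

```python
import numpy as np

def best_threshold(probs, labels, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the binary decision threshold that maximizes accuracy on a
    validation set, instead of using the default 0.5."""
    accs = [((probs > t).astype(int) == labels).mean() for t in grid]
    return float(grid[int(np.argmax(accs))])
```

The same search could optimize a metric that reflects class imbalance better than accuracy, such as balanced accuracy or F1.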