The application of Machine Learning for ranking is widely spread. This application of Machine Learning is a little different from the classical ones of classification and regression.
In the case of ranking, the interest is not in the accuracy of an estimated value (regression) or the guess about the membership of an element in a cluster or other (classification); rankings care about proper ordering regarding some property (take a look at this great post about ranking aggregation). In any case, all ML ranking methods need to quantify the quality of the ordering so they can arrive at the best possible ranking (see here for a list of methods). In this post, we will talk about one of the most employed measures for this task: (Normalized) Discounted Cumulative Gain (n)DCG.
This measure of a ranking quality tries to emphasize the correction of the first places in a ranking and give less importance as we go down the ranking. In order to measure the level of correctness or accuracy of what we try to rank, it is employed the term relevance (rel in the formula). The relevance and its scale could be defined in many different ways. For example, if we try to rank some books regarding their relation with philosophy, in a scale of 10, we may give a relevance of 9 to Also sprach Zarathustra while we may give 5 to Treasure Island, basing the relevance in the number of times the word god is mentioned (no, I have not counted it).
You can see in the formula that only the n first elements of the ranking are considered, so it is not compulsory (neither sometimes advisable) to consider all the elements of the list ranked.
Imagine you plan to select 5 stocks from the S&P 500 to invest during one month, so you decide to order the 500 stocks, founding your decision on every stock’s last year momentum. Before doing a real investment, you want to test your (naive) strategy, so, in order to measure the relevance of every stock, you obtain the ranking one month ago (based on the momentum) and evaluate the performance of every stock during last month.
We can define the relevance in multiple ways; in this case we will establish the levels of relevance by separating the stocks in deciles over the performance of the last month: we will give a relevance of 10 to those stocks in the first decile, a relevance of 9 to those in the second decile, and so on. In this example, we are just analysing the ordering at a single date, but we could develop this analysis through historical data and train a learning-to-rank method.
The following table shows the calculations of the DCG for the first 5 stocks ordered by momentum and their relevance values. Even though the second element has a lower relevance (8) than the third one (10), its contribution to the DCG is higher (5.05 vs 5). First positions have more importance due to the division with the logarithmic term.
When we switch the second and the third positions, the DCG improves, since there is higher relevance in a higher position of the ranking.
Now, imagine that instead of selecting from 500 stocks, we just select 5 out of 30 possible stocks (maybe because we have imposed an ethical filter or because we have discarded some sectors). Is it possible to compare directly the ranking from the previous example to the one obtained in this new situation? Yes, if we normalize DCG appropriately. Normalized DCG follows next formula:
where the ideal DCG is:
Being “irel” the relevances of the first n elements if the whole list were perfectly ordered. Following the previous example, in the 500 stock list, as there are 50 elements by decile, the first 5 elements in the ideal ordering would have relevance 10; IDCG5 would be:
However, if the list were made up of 30 stocks, each decile would have 3 elements, so in the ideal ordering, only 3 out of the 5 first elements would have relevance 10, while the next 2 stocks would have relevance 9, as they would belong to the next decile.
Doing the calculations, for the same DCG 25.68, we would obtain on the one hand a nDCG of 0.87 for a list of 500 elements and on the other hand a nDCG of 0.9.
I hope you found the post interesting. Do not hesitate to comment your thoughts about ranking quality measures or methods.