We are rather used to reading this disclaimer (or some variation thereof) in mutual fund prospectuses or investment vehicle webpages. Despite warnings, investors and advisors insist on considering **past performance** (and some other related metrics) **as important factors in asset selection**. But, are they really wrong? In this post, I will try to shed some light on this topic by means of some metrics inspired by the information retrieval community.

# The data

I will focus on weekly return data for current components of the DJIA (Dow Jones Industrial Average); a total of 30 stocks. More specifically, I will try to **gain some insight** on the joint movement of the returns by visualising how their weekly rankings have historically evolved.

This would answer questions such as: what is the probability that past week’s best-performing asset turns out to be this week’s best-performing one? Or, more generally, what is the probability that the *i*-th best-performing asset turns into this week’s *j*-th best-performing one? This information is depicted in the following Hinton diagram, where the area of each square reflects this probability:
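The probabilities behind such a diagram can be estimated by counting rank transitions between consecutive weeks. Here is a minimal sketch; the `returns` array is a random stand-in (the actual DJIA data is not reproduced here), with one row per week and one column per asset:

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(size=(500, 30))  # hypothetical: 500 weeks x 30 stocks

# Rank assets within each week (0 = worst performer, 29 = best).
ranks = returns.argsort(axis=1).argsort(axis=1)

n_assets = returns.shape[1]
transition = np.zeros((n_assets, n_assets))
for prev, curr in zip(ranks[:-1], ranks[1:]):
    # For each asset, record a move from last week's rank to this week's.
    for r_prev, r_curr in zip(prev, curr):
        transition[r_prev, r_curr] += 1

# Normalise each row into a conditional probability distribution.
transition /= transition.sum(axis=1, keepdims=True)
```

Entry `transition[i, j]` then estimates the probability that the *i*-th ranked asset one week ends up ranked *j*-th the next, which is exactly what each square of the Hinton diagram depicts.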

If we take a close look at the bottom-left corner, we notice that the asset ranked first one week tends **to remain first** the following week. This is somewhat of an artifact of the amazingly good performance that Apple Inc. has consistently exhibited during recent years. However, if we leave this fact aside and remove the iCompany from the sample, the data tell us a different story: there does not seem to be statistically significant persistence in return rankings.

But wait a sec! What if we look closer?

# Ranking comparisons

A traditional approach to comparing two rankings has been to use correlation-based metrics which, rephrased to fit the problem at hand, do the following:

- **Pearson’s correlation coefficient**: measures how linear the relationship between consecutive returns is.
- **Spearman’s rank correlation coefficient**: measures how linear the relationship between consecutive rankings is (indeed, it is mathematically equivalent to Pearson’s correlation computed on the rankings).
- **Kendall’s rank correlation coefficient** (a.k.a. Kendall’s tau): counts the number of pairwise disagreements between the two ranking lists.
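All three coefficients are available in SciPy. A minimal sketch, using two made-up vectors of consecutive weekly returns:

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical returns for the same five assets on consecutive weeks.
this_week = [0.021, -0.004, 0.013, 0.008, -0.017]
next_week = [0.015, -0.009, 0.004, 0.012, -0.002]

r, _ = pearsonr(this_week, next_week)      # linearity of returns
rho, _ = spearmanr(this_week, next_week)   # linearity of rankings
tau, _ = kendalltau(this_week, next_week)  # pairwise rank disagreements
```

Each call also returns a p-value (discarded above), which is handy for the significance checks mentioned earlier.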

In the following picture, these metrics are applied to real returns and what I refer to as independent returns, that is, a situation in which consecutive weekly returns are statistically independent.
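One simple way to build such an independent-returns baseline is to permute the weeks: shuffling entire rows keeps each week’s cross-section of returns intact while destroying any week-to-week dependence. A sketch, again on a hypothetical `returns` array of shape (weeks, assets):

```python
import numpy as np

rng = np.random.default_rng(1)
returns = rng.normal(size=(500, 30))  # hypothetical weekly return matrix

# Shuffle the order of the weeks: consecutive rows of `independent`
# are now (approximately) statistically independent draws.
perm = rng.permutation(returns.shape[0])
independent = returns[perm]
```

Any metric computed on consecutive rows of `independent` then serves as the null reference against which the real-returns scores are compared.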

Note that a remarkable difference between real returns and independent returns shows up. However, as the real scores are only marginally different from zero, it may be difficult to use them to carry out successful predictions.

The problem with this family of metrics is that they give the same importance to what’s happening in any part of the ranking. In finance, however, we usually worry about what’s going on in extreme parts of the ranking. So…

## What if the best-performing asset is the only one relevant to you?

In case we only care about what happens in **very specific parts of the ranking**, we can think of applying metrics borrowed from information retrieval systems. Namely:

- **Point-biserial correlation coefficient**: measures how linear the relationship is between the return and the fact of being relevant or not.
- **Precision at best (worst) N**: fraction of next week’s N best (worst) assets represented by this week’s best (worst) asset.
- **Average precision**: average of the previously described precision values over all possible N.
- **Average precision at best (worst) N**: average of the previously described precision values for all n from 1 to N.
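The precision metrics are straightforward to compute by hand. A minimal sketch, treating this week’s single best performer as the only relevant asset (the tickers are made-up examples, not results from the DJIA data):

```python
def precision_at_n(relevant, ranking, n):
    """Fraction of the top-n assets in `ranking` that are relevant."""
    return len(set(relevant) & set(ranking[:n])) / n

def average_precision_at_n(relevant, ranking, n):
    """Mean of precision at k for k = 1..n (the post's definition)."""
    return sum(precision_at_n(relevant, ranking, k)
               for k in range(1, n + 1)) / n

relevant = ["AAPL"]                      # this week's best performer
ranking = ["MSFT", "AAPL", "JNJ", "KO"]  # next week's ranking, best first
```

For the worst-N variants, the same functions apply after reversing the ranking list.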

## What if the N best-performing assets are *all* relevant to you?

All the aforementioned metrics can be trivially extended to the case where **several assets are considered relevant** (for example, the top 5).

Other related metrics are:

- **Recall at best (worst) N**: fraction of all relevant assets that fall inside the best (worst) N positions of next week’s ranking.
- **R-precision**: the precision at R, where R is the number of relevant assets; at that cutoff, precision and recall coincide. It allows for a fair comparison when the number of relevant assets fluctuates from case to case, e.g. due to changes in the asset universe size, or because assets are deemed relevant only if certain performance requirements are satisfied.
- **F1 at best (worst) N**: if we are looking for a good tradeoff between precision and recall, the F1 score provides one as the harmonic mean of the two. The relative importance of each can be controlled by generalising this idea into what is usually referred to as the F-beta score.
- **Reciprocal rank**: the inverse of the ranking position occupied by the single highest- (lowest-) ranked relevant asset. It is appropriate when there is only one relevant result, or when you only really care about the highest-ranked one, even if several of them are relevant to you.
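These extended metrics can be sketched in a few lines; the relevant set and the ranking below are hypothetical examples with, say, this week’s top two assets deemed relevant:

```python
def recall_at_n(relevant, ranking, n):
    """Fraction of all relevant assets found in the top n."""
    return len(set(relevant) & set(ranking[:n])) / len(relevant)

def f1_at_n(relevant, ranking, n):
    """Harmonic mean of precision and recall at n."""
    p = len(set(relevant) & set(ranking[:n])) / n
    r = recall_at_n(relevant, ranking, n)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def reciprocal_rank(relevant, ranking):
    """Inverse of the position of the highest-ranked relevant asset."""
    for i, asset in enumerate(ranking, start=1):
        if asset in relevant:
            return 1 / i
    return 0.0

relevant = {"AAPL", "MSFT"}              # this week's top two (hypothetical)
ranking = ["JNJ", "AAPL", "KO", "MSFT"]  # next week's ranking, best first
```

Note that with `n = len(relevant)` the precision inside `f1_at_n` equals the recall, which is exactly the R-precision described above.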

## What if they are *not* equally relevant?

Having multiple relevant assets opens the door to another family of metrics: those in which **graded relevances** are taken into consideration. This is when relevance is not binary and we want to specify multiple levels of relevance. For example, we can consider that finding this week’s best asset among the top five assets next week is twice as important as finding the second best asset among them. Such a preference can be encoded as a vector of non-binary relevances, which gives rise to a bunch of other metrics:

- **Cumulative gain at best (worst) N**: accumulated relevance value among the N best (worst) assets. The computed value is unaffected by position changes among the N best (worst) assets.
- **Discounted cumulative gain at best (worst) N**: the same as cumulative gain, with the additional consideration that highly relevant assets appearing lower in next week’s ranking list should be penalised. This is achieved by discounting the relevance value logarithmically with the position of the asset.
- **Normalized discounted cumulative gain at best (worst) N**: again, as with the R-precision metric, when the number of relevant assets fluctuates from case to case it is useful to normalise the discounted cumulative gain by its maximum achievable value. That maximum is attained only when assets are ordered according to the relevances we assigned to them.

In this example I have assumed that relevance scores go from 5 to 1 as we go from the first to the fifth asset:
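A minimal sketch of DCG and NDCG under that assumption, using the common logarithmic discount. The `observed` vector is a made-up example: it lists, position by position in next week’s ranking, the relevance earned there (5 down to 1 for this week’s top five, 0 for everything else):

```python
import math

def dcg_at_n(relevances, n):
    """Sum of relevance / log2(position + 1) over the first n positions."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:n], start=1))

def ndcg_at_n(relevances, n):
    """DCG normalised by the best achievable DCG (relevances sorted desc.)."""
    ideal = dcg_at_n(sorted(relevances, reverse=True), n)
    return dcg_at_n(relevances, n) / ideal if ideal > 0 else 0.0

# Hypothetical: where this week's top-5 relevances landed next week.
observed = [0, 5, 3, 0, 4, 1, 2]
```

By construction the NDCG lies in [0, 1] and reaches 1 only when next week’s ordering matches the assigned relevances exactly.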

One may also think of applying inequality-inspired metrics, borrowed from the socioeconomic literature and rooted in the concept of the Lorenz curve … but that will have to wait for a different post. If you want to know more right now, here are some pointers: the Gini, Atkinson, entropy, and Theil (generalised entropy) indexes.

# Concluding remarks

After evaluating dozens of scores, we can conclude that all of them are consistently better for real returns than for independent asset returns. Sometimes only marginally better, but they show that past performance tells us at least a bit about the future.

Keep in mind that this analysis has been conducted, intentionally, in some of **the most difficult** to predict asset classes and time horizons: stocks and one-week returns, respectively. Now it’s up to you to find out in which situation, and how, you may take advantage of this.