
Reproducing the S&P500 by clustering




Let’s begin with a simple question: can we use the movement of the main stocks in the S&P 500 to predict the movement of the index? The second question is: which are the main stocks? The biggest ones? Maybe the most bullish? In short, how should we decide which companies are the most representative?

In this post we propose a way to find those representative stocks by combining two relatively new techniques. The first is t-SNE, a dimensionality-reduction method; the second is affinity propagation (the AP method), which we use to group the companies into clusters.

Feature engineering

First, we have to find the best companies among a lot of players. So, in the first place, we must choose criteria to decide whether a stock is good or not. This means starting with feature engineering. However, we first have to answer a small question: what can we measure about our stocks? Well, a lot of things! But in this case we have chosen two measures:

  • daily returns
  • daily market capitalisation changes

So we will say that two stocks are similar if they have similar daily returns and capitalisation growth. The method splits all historical daily returns and daily capitalisation changes by week, so every week each stock has around 10 features (5 daily returns and 5 daily cap changes). Note that we are compressing the data in time, turning daily data into weekly data. Once we have about 500 objects per week with 10 features each, we continue with the dimensionality reduction problem, as described in the next section.
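The weekly feature table described above could be built roughly as follows. This is only a sketch: the function name `weekly_features`, the Friday week boundary and the toy tickers are assumptions made for illustration, not details taken from the original simulation.

```python
import numpy as np
import pandas as pd


def weekly_features(returns: pd.DataFrame, cap_changes: pd.DataFrame) -> dict:
    """Return {week_end: table} with one row per stock and ~10 columns.

    Both inputs are daily tables indexed by date, one column per ticker.
    """
    features = {}
    # Assumption: a week is the block of trading days ending on a Friday.
    for week_end, ret_block in returns.groupby(pd.Grouper(freq="W-FRI")):
        if ret_block.empty:
            continue
        cap_block = cap_changes.loc[ret_block.index]
        # Transpose so that stocks are rows and trading days are columns.
        r = ret_block.T
        r.columns = [f"ret_{i}" for i in range(r.shape[1])]
        c = cap_block.T
        c.columns = [f"cap_{i}" for i in range(c.shape[1])]
        # Stocks with missing data in the week are dropped.
        features[week_end] = pd.concat([r, c], axis=1).dropna()
    return features


# Toy data: 3 hypothetical tickers over two full trading weeks.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2010-01-04", periods=10)
tickers = ["AAA", "BBB", "CCC"]
returns = pd.DataFrame(rng.normal(0, 0.01, (10, 3)), index=dates, columns=tickers)
cap_changes = pd.DataFrame(rng.normal(0, 0.01, (10, 3)), index=dates, columns=tickers)

weekly = weekly_features(returns, cap_changes)
```

Weeks with holidays would yield fewer than 10 columns, which is why the text speaks of "around 10" features.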

Dealing with Data and skimming dimensions

Now, let’s take a glance at the t-SNE method. In summary, the method measures the similarities among objects in a high-dimensional space and looks for a lower-dimensional arrangement of the same objects whose similarities are as close as possible to the original ones, where closeness is measured by the Kullback-Leibler divergence (in our case, the objects are the S&P 500’s components).

So, we will apply t-SNE to the combination of the two discrete distributions: daily market capitalisation changes and daily returns. In other words, we want the low-dimensional picture to distort the differences between the components’ combined distributions as little as possible.
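As an illustration, the reduction of one week's feature table to 2D might look like the sketch below, using scikit-learn's `TSNE`. The helper name `embed_week`, the standardisation step, the perplexity values and the random toy data are all assumptions; the original post does not state which parameters were used.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def embed_week(features, perplexity=30.0, seed=0):
    """Project a (stocks x features) array to 2D with t-SNE."""
    # Assumption: put returns and cap changes on a comparable scale first.
    scaled = StandardScaler().fit_transform(features)
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,  # must be smaller than the number of stocks
        init="random",
        random_state=seed,      # t-SNE is stochastic; fix the seed to repeat a run
    )
    return tsne.fit_transform(scaled)


# Toy stand-in for one week: 60 "stocks" with 10 features each.
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(60, 10))
embedding = embed_week(toy_features, perplexity=15.0)
```

With the real data each weekly table has about 500 rows, so the default perplexity of 30 is usable as a starting point.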

Now that we have designed the method, we need to apply it to data. This is not a trivial consideration, and here it is even more complicated because the components of the index change over time. Therefore, we need to split the index history into chunks small enough to apply the method to each one. In this case, we have chosen a weekly split. Now we can apply t-SNE and reduce the dimensions to 2D for each calendar week in 2010.

Here are our results for a particular week. In the graph below we see all the stocks grouped by the AP method.

[Figure: 2D t-SNE embedding of the stocks for one week, colored by AP cluster]

The AP method finds relevant clusters to get the best split possible; in this case each cluster is shown in a different color. We can see a lot of clusters, but far fewer than 500: in fact, we have an average of 20 clusters per week. Since our strategy invests in the stock nearest to the center of each cluster, we only invest in about 20 stocks to replicate the S&P 500. We will see how to do this in the next section.

Clustering in low-dimensional space

Once we have reduced our problem’s dimension (using the t-SNE method) and have a good representation of the current week, we apply an affinity propagation (AP) algorithm to cluster the stocks, using Euclidean distances as the measure of similarity. AP is better suited to our purpose (finding similar components) than other methods (e.g. k-means) because it does not need the number of clusters in advance and because it picks actual data points, called exemplars, as cluster centers.

This method works by exchanging messages between pairs of objects (points in a Euclidean space): responsibilities, which a point sends to a candidate exemplar, and availabilities, which the candidate sends back. The exchange is repeated until the set of exemplars stops changing; at that point the clusters are fixed. You can find a whole description of the method in this article. The AP method is more demanding (in terms of computational resources) than simpler alternatives, but it’s quick enough for a few hundred objects. In our case, we only have to group ~500 objects in each clustering process.
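A minimal sketch of this clustering step with scikit-learn's `AffinityPropagation` is shown below. Two things here are assumptions rather than the post's confirmed method: the exemplar of each cluster is taken as the stock "nearest to the center", and the weights are set proportional to cluster size, since the post does not say how weights are computed. The damping and iteration settings are also illustrative.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs


def pick_representatives(embedding, seed=0):
    """Cluster 2D points with AP; return labels, exemplar indices and weights."""
    # The default affinity is the negative squared Euclidean distance.
    ap = AffinityPropagation(damping=0.9, max_iter=1000, random_state=seed)
    ap.fit(embedding)
    exemplars = ap.cluster_centers_indices_
    if len(exemplars) == 0:
        raise RuntimeError("affinity propagation did not converge")
    labels = ap.labels_
    # Hypothetical weighting: each exemplar stands in for its whole cluster.
    sizes = np.bincount(labels, minlength=len(exemplars))
    weights = sizes / sizes.sum()
    return labels, exemplars, weights


# Toy stand-in for a weekly 2D embedding: 60 points in 3 loose groups.
points, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)
labels, exemplars, weights = pick_representatives(points)
```

Because AP chooses exemplars among the data points themselves, `exemplars` can be mapped straight back to tickers by position in the weekly table.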

Running Simulations

We have chosen Python for simulating our strategy. More specifically, we have used packages such as pandas, numpy and sklearn. We have only run the simulation over the year 2010, so deeper testing may be necessary to check all the possible parameters of the t-SNE and AP methods.

Every week, our strategy gives us the stocks and the weights we need to replicate the index over the following week. We use that information to build a synthetic index (adding the returns of each stock multiplied by its weight) and compare it against the returns of the S&P 500. The results are surprising: we are able to reproduce the S&P 500 with a high degree of similarity by investing in only a handful of stocks. Both t-SNE and AP find local optima that depend on random seeds, so our results may vary from one simulation to another, but we have tested further and the results are similar.

[Figure: S&P 500 versus the clustered replication, with the weekly portfolios extrapolated in grey]

The grey series are extrapolations of the weekly portfolios (the recommendation of stocks and weights each week) to the rest of the year, so we have 51 grey lines in the previous chart (one for each week in the year, not counting the last week).
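The construction of the synthetic index can be sketched as a weighted sum of returns, followed by compounding and a correlation check. The function names and the toy numbers are made up for illustration, and computing the correlation on daily returns (rather than on the accumulated series) is an assumption, since the post does not specify which one it reports.

```python
import numpy as np
import pandas as pd


def synthetic_returns(stock_returns: pd.DataFrame, weights: pd.Series) -> pd.Series:
    """Daily return of the portfolio: sum of stock returns times weights."""
    aligned = weights.reindex(stock_returns.columns)
    return stock_returns.mul(aligned, axis=1).sum(axis=1)


def cumulative_pct(daily_returns: pd.Series) -> pd.Series:
    """Accumulated percentage return with daily reinvestment."""
    return ((1.0 + daily_returns).cumprod() - 1.0) * 100.0


# Toy example: two hypothetical stocks held with equal weights for 3 days.
days = pd.bdate_range("2010-01-11", periods=3)
stock_returns = pd.DataFrame(
    {"AAA": [0.01, -0.02, 0.03], "BBB": [0.03, 0.00, -0.01]}, index=days
)
weights = pd.Series({"AAA": 0.5, "BBB": 0.5})
index_returns = pd.Series([0.015, -0.012, 0.008], index=days)  # made-up index

synth = synthetic_returns(stock_returns, weights)
synth_curve = cumulative_pct(synth)
correlation = synth.corr(index_returns)  # Pearson correlation of daily returns
```

In a weekly simulation the weights would be replaced every week by the portfolio computed from the previous week's clusters.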

In the previous graph, we can see the accumulated percentage returns with daily reinvestment for both series. There is a 95.5% correlation between the two series, even though the “clustered” series invests in an average of only 18 stocks. However, we don’t always get such accurate results, so we also have series that are a bit different, like the following.

[Figures: two further replication runs]

These two series have 86% and 80% correlation with the original index.


There are many possibilities in this method. More tests must be performed and predictive models must be developed if we want to use this as an investment system. In the meantime, we can draw some conclusions:

  • We can obtain series similar to the original index, with a high correlation, which could be useful for many purposes (e.g. testing investment strategies on similar series).
  • We can see that all the grey series lie above the index, which suggests there is a good chance of finding a portfolio that beats the original index.
  • We have to consider how random initialisation could affect our strategy, so we would need some kind of persistence method to preserve our track record (such as a DB or binary files for saving our results).
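As a rough idea of what such persistence could look like, the sketch below stores each weekly portfolio together with the random seed that produced it. JSON files are used here only for readability; that choice, the file layout and the function names are assumptions, and a database or binary format would serve the same purpose.

```python
import json
import tempfile
from pathlib import Path


def save_portfolio(path, week, seed, weights):
    """Write one weekly portfolio and the seed that generated it."""
    record = {"week": week, "seed": seed, "weights": weights}
    Path(path).write_text(json.dumps(record, indent=2))


def load_portfolio(path):
    """Read a portfolio record back from disk."""
    return json.loads(Path(path).read_text())


# Round trip in a temporary folder, with hypothetical tickers and weights.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "portfolio_2010-01-08.json"
    save_portfolio(target, "2010-01-08", 42, {"AAA": 0.6, "BBB": 0.4})
    restored = load_portfolio(target)
```

Keeping the seed next to the result makes it possible to rerun t-SNE and AP with the same initialisation later.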
