Let’s begin with a simple question. Can we use the movement of the main stocks in the S&P 500 to predict the movement of the index? But… which are the main stocks? That’s a good question: maybe the biggest ones, maybe the most bullish ones. So, how should we decide which companies are the most representative?

In this post we propose a relatively new approach to try to find **the best stocks**. We reduce the dimensionality of the data with a method called t-SNE and then group the companies with the affinity propagation method (AP method).

## Feature engineering

First, we have to pick the best companies among a lot of players. So, in the first place, we must choose **our criteria** for deciding whether stocks are good or not, which means starting with “feature engineering”. However, we have to answer a “little” question… what can we measure about our stocks? Well, a lot of things, yes! But in this case we have chosen 2 measures:

- daily returns
- daily market capitalization changes

So we will say that our stocks are similar if they have similar daily returns and capitalization growth. The method splits all historical daily returns and daily capitalization changes by week, so every week each stock has around 10 features (5 daily returns and 5 daily cap changes). Note that we are compressing data in time, turning daily data into weekly data. Once we have 500 objects per week with 10 features each, we continue with the dimensionality reduction problem, as described in the next section.
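As a minimal sketch of this step, the function below stacks the two daily measures into one feature row per stock and week. It assumes every week has exactly 5 trading days and that the inputs are plain arrays of shape (days, stocks); the function name `weekly_features` and the synthetic data are only illustrative, not the code used for the post.

```python
import numpy as np

def weekly_features(returns, cap_changes, days_per_week=5):
    """Build one feature row per stock and week.

    returns, cap_changes: arrays of shape (n_days, n_stocks).
    Result: array of shape (n_weeks, n_stocks, 2 * days_per_week).
    Assumption: weeks are fixed blocks of `days_per_week` trading days.
    """
    returns = np.asarray(returns, dtype=float)
    cap_changes = np.asarray(cap_changes, dtype=float)
    n_days, n_stocks = returns.shape
    n_weeks = n_days // days_per_week      # an incomplete trailing week is dropped
    used = n_weeks * days_per_week

    def to_weeks(x):
        # (n_days, n_stocks) -> (n_weeks, n_stocks, days_per_week)
        return x[:used].reshape(n_weeks, days_per_week, n_stocks).transpose(0, 2, 1)

    # First the 5 daily returns, then the 5 daily cap changes
    return np.concatenate([to_weeks(returns), to_weeks(cap_changes)], axis=2)

# Synthetic stand-in data: 20 trading days for 500 stocks
rng = np.random.default_rng(0)
daily_returns = rng.normal(0.0, 0.01, size=(20, 500))
daily_cap_changes = rng.normal(0.0, 0.01, size=(20, 500))

features = weekly_features(daily_returns, daily_cap_changes)
print(features.shape)  # (4, 500, 10)
```

In real data some weeks have fewer than 5 trading days (holidays), which is why the text says "around 10 features"; handling those weeks would need a calendar-aware split.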

## Dealing with Data and skimming dimensions

Now, let’s take a glance at the t-SNE method. In summary, the method is based on **preserving the similarities** among objects: it describes the similarities in the high-dimensional space as a probability distribution and looks for a lower-dimensional layout whose distribution is as close as possible to it, using the Kullback-Leibler divergence as a distance-like measure of the mismatch (in our case, the objects are the S&P 500 components).

So, we apply t-SNE to the combination of the discrete probability distributions of daily market capitalization changes and daily returns. In other words, we look for the layout that keeps the difference between the combined distributions of the components as small as possible.
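To make the Kullback-Leibler measure more concrete, here is a small sketch that computes it for two discrete distributions. The helper name `kl_divergence` and the example vectors are made up for illustration. Note that the measure is not symmetric, so it is not a distance in the strict sense.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()                      # normalize so each vector sums to 1
    q = q / q.sum()
    mask = p > 0                         # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))   # identical distributions give zero
print(kl_divergence(p, q))   # positive when the distributions differ
print(kl_divergence(q, p))   # a different value: the measure is not symmetric
```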

Well, we have designed the method, but we need to **apply it to data** and this is not a trivial question. In this case it is even more complicated because the components of the index change over time, so we need to split the history into chunks small enough to apply the method to each one. In this case we have chosen a weekly split. Now we can apply t-SNE and reduce the dimensions to 2D for each calendar week of 2010.

Here we have our results for a particular week: in the next graph we see all the stocks grouped by the AP method. The AP method finds the clusters that give the best possible split, and in this case each cluster is shown in a different color. We can see a lot of clusters, but their number is far lower than 500; in fact, we have an average of 20 clusters per week. Since our strategy is to invest in the stock nearest to the center of each cluster, we only invest in about 20 stocks to replicate the S&P 500. We will see how to do this in the next section.
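The reduction to 2D for one week could look roughly like the sketch below, which uses `TSNE` from sklearn. The data is synthetic and smaller than the real ~500 stocks to keep the example quick, and the parameter values (`perplexity`, `init`, `random_state`) are assumptions, not the settings used for the results in this post.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for one week: 200 stocks with 10 features each
rng = np.random.default_rng(0)
n_stocks, n_features = 200, 10
week_features = rng.normal(0.0, 0.01, size=(n_stocks, n_features))

# Assumed settings; perplexity must be smaller than the number of stocks
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
embedding = tsne.fit_transform(week_features)

print(embedding.shape)        # (200, 2)
print(tsne.kl_divergence_)    # final value of the Kullback-Leibler objective
```

Fixing `random_state` makes a single run repeatable, which helps when comparing weeks, but a different seed can give a different layout.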

## Cluster in low dimensional space

Once we have reduced the dimension of our problem (using the t-SNE method) and we have a good representation of the current week, we apply an **affinity propagation algorithm** (AP) to cluster the stocks, using Euclidean distances as the measure of similarity. By its definition it is better suited to our purpose (finding similar components) than other methods (e.g. k-means): it does not need the number of clusters to be fixed in advance, and it picks actual data points as the cluster centers.

This method works by passing messages between objects (points in a Euclidean space) in backward processes (sending responsibilities) and forward processes (sending availabilities) until the messages stop changing; at that point every object has settled on a cluster and we have reached the cluster limits. You can find a full description of the method in this article. The AP method is more demanding (in terms of computational resources), but it’s quick enough for a few hundred objects (in our case, we only have to group ~500 objects in each clustering process).
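A minimal sketch of this clustering step with sklearn’s `AffinityPropagation` is shown below. The 2D points are synthetic blobs standing in for a weekly t-SNE embedding, and the `damping` and `max_iter` values are assumptions. With `affinity="euclidean"`, sklearn uses the negative squared Euclidean distance as the similarity.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Synthetic stand-in for a 2D embedding: three blobs of 30 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
embedding = np.vstack([c + rng.normal(0.0, 0.5, size=(30, 2)) for c in centers])

# Assumed settings; a higher damping tends to make convergence more stable
ap = AffinityPropagation(affinity="euclidean", damping=0.9,
                         max_iter=500, random_state=0)
labels = ap.fit_predict(embedding)

# The exemplars are real data points, one per cluster; they play the role
# of the stock nearest to the center of its cluster
exemplar_idx = ap.cluster_centers_indices_
cluster_sizes = np.bincount(labels, minlength=len(exemplar_idx))

for k, idx in enumerate(exemplar_idx):
    print(f"cluster {k}: exemplar row {idx}, {cluster_sizes[k]} members")
```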

## Running Simulations

We have chosen Python to simulate our strategy, using packages such as pandas, numpy and sklearn. We have only run the simulation over the year 2010, and deeper testing may be necessary to check all the possible parameters of the t-SNE and AP methods.

Every week, our strategy gives us the stocks and the weights we need to replicate the index over the next week. We use that information to build a synthetic index (adding the returns of each stock multiplied by its weight) and compare it against the returns of the S&P 500. The results are surprising: we are able to reproduce the S&P 500 with a high degree of similarity while investing in only a handful of stocks. Keep in mind that the t-SNE and AP methods are based on a local approach to optimum solutions that depends on random seeds, so our results may vary from one simulation to another, but we have tried several times and the results are similar. The grey series are extrapolations of the weekly portfolios (the recommended stocks and weights for each week) to the rest of the year, so we have 51 grey lines in the previous chart (one for each week in the year, if we don’t take into account the last week of the year).
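The synthetic index described above can be sketched as a weighted sum of the returns of the selected stocks. The post does not spell out how the weights are computed, so the rule used here (weights proportional to cluster size) is only an assumption, and the function name, column indices and data are hypothetical.

```python
import numpy as np

def synthetic_index_returns(stock_returns, selected, weights):
    """Daily returns of a portfolio holding only the selected stocks.

    stock_returns: array of shape (n_days, n_stocks).
    selected: column indices of the chosen stocks.
    weights: one weight per selected stock (rescaled to sum to 1).
    """
    stock_returns = np.asarray(stock_returns, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return stock_returns[:, selected] @ weights

# Assumed weighting rule: each exemplar is weighted by the size of its cluster
cluster_sizes = np.array([50, 30, 20])
selected = np.array([3, 17, 42])           # hypothetical exemplar columns
weights = cluster_sizes / cluster_sizes.sum()

# Synthetic returns for the following week: 5 days, 100 stocks
rng = np.random.default_rng(0)
next_week_returns = rng.normal(0.0, 0.01, size=(5, 100))

portfolio = synthetic_index_returns(next_week_returns, selected, weights)
print(portfolio.shape)  # (5,)
```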

In the previous graph we see the accumulated percentage returns with daily reinvestment for both series. There is a **95.5% correlation** between the series, while in the “clustered” series we are only investing in **18 stocks on average**. However, our results are not always this accurate; we can also get somewhat different series, like the next ones. The previous series have correlations of 86% and 80% with the original index.
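For reference, accumulated returns with daily reinvestment and the correlation between two series can be computed as in this sketch. The post does not say which series the correlation is measured on; here it is computed on the accumulated series as an assumption, and the data is synthetic.

```python
import numpy as np

def accumulated_returns(daily_returns):
    """Accumulated return with daily reinvestment (compounding)."""
    return np.cumprod(1.0 + np.asarray(daily_returns, dtype=float)) - 1.0

def series_correlation(a, b):
    """Pearson correlation between two series of equal length."""
    return float(np.corrcoef(a, b)[0, 1])

# Synthetic daily returns: an "index" and a noisy replica of it
rng = np.random.default_rng(0)
index_daily = rng.normal(0.0005, 0.01, size=250)
replica_daily = index_daily + rng.normal(0.0, 0.003, size=250)

index_acc = accumulated_returns(index_daily)
replica_acc = accumulated_returns(replica_daily)
print(series_correlation(index_acc, replica_acc))
```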

## Conclusions

There are many possibilities in this method. More tests have to be performed and a predictive model must be developed if we want to use this as an investment system, but we can draw some conclusions:

- We can get series similar to the original index with a high correlation, which could be useful for many purposes (e.g. testing investment strategies on similar series).
- We can see that all the grey series are above the index, which means that there are many chances of **getting a better portfolio** than the original index.
- We have to consider how the random initialization could affect our strategy, so we will probably need some kind of persistence method to preserve our track record, such as a DB or binary files for saving our results.
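As one possible way to keep a track record, the sketch below saves a weekly portfolio together with the random seed in a binary file using Python’s `pickle` module. The record layout, tickers and file name are hypothetical; a database would work just as well.

```python
import os
import pickle
import tempfile

def save_portfolio(path, record):
    """Write one weekly portfolio record to a binary file."""
    with open(path, "wb") as fh:
        pickle.dump(record, fh)

def load_portfolio(path):
    """Read a weekly portfolio record back from a binary file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Hypothetical record: storing the seed lets us reproduce the same run later
record = {
    "week": "2010-W01",
    "seed": 0,
    "tickers": ["AAA", "BBB"],
    "weights": [0.6, 0.4],
}

path = os.path.join(tempfile.gettempdir(), "portfolio_2010_W01.pkl")
save_portfolio(path, record)
restored = load_portfolio(path)
print(restored == record)  # True
```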