Reproducing the S&P500 by clustering

fuzzyperson

08/05/2015


Let’s begin with a simple question: can we use the movement of the main stocks in the S&P 500 to predict the movement of the index? The second question is: which are the main stocks? The biggest ones? Maybe the most bullish? In short, how should we decide which companies are the most representative?

In this post we propose a relatively new approach to choosing the most representative stocks. We first reduce the dimensionality of the data with a method called t-SNE, and then group the companies with the affinity propagation method (AP method).

Feature engineering

First, we have to pick the best companies among a lot of players, so we must choose criteria to decide whether a stock is good or not. This means starting with “feature engineering”. That raises a small question: what can we actually measure about our stocks? Well, a lot of things! But in this case we have chosen two measures:

  • daily returns
  • daily market capitalisation changes

So we will say that two stocks are similar if they have similar daily returns and capitalisation growth. The method splits the historical daily returns and daily capitalisation changes by week, so every week each stock has around 10 features (5 daily returns and 5 daily cap changes). Note that we are compressing the data in time, turning daily data into weekly data. Once we have 500 objects per week with 10 features each, we move on to the dimensionality reduction problem, described in the next section.
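As an illustration, the weekly feature matrix could be assembled along these lines. This is only a sketch on synthetic data: the function name `weekly_features` is hypothetical, and it assumes every week has exactly 5 trading days, which real calendars do not guarantee.

```python
import numpy as np

def weekly_features(returns, cap_changes, days_per_week=5):
    """Stack daily returns and daily cap changes into one row per stock and week.

    returns, cap_changes: arrays of shape (n_days, n_stocks).
    Result: array of shape (n_weeks, n_stocks, 2 * days_per_week).
    """
    returns = np.asarray(returns, dtype=float)
    cap_changes = np.asarray(cap_changes, dtype=float)
    n_days, n_stocks = returns.shape
    n_weeks = n_days // days_per_week        # an incomplete trailing week is dropped
    used = n_weeks * days_per_week

    def by_week(x):
        # (n_days, n_stocks) -> (n_weeks, n_stocks, days_per_week)
        return x[:used].reshape(n_weeks, days_per_week, n_stocks).transpose(0, 2, 1)

    return np.concatenate([by_week(returns), by_week(cap_changes)], axis=2)

rng = np.random.default_rng(0)
daily_returns = rng.normal(0.0, 0.01, size=(20, 500))      # synthetic: 20 days, 500 stocks
daily_cap_changes = rng.normal(0.0, 0.01, size=(20, 500))  # synthetic cap changes
features = weekly_features(daily_returns, daily_cap_changes)
print(features.shape)  # (4, 500, 10)
```

Each slice `features[w]` is then the 500 x 10 matrix for week `w`.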

Dealing with Data and skimming dimensions

Now, let’s take a glance at the t-SNE method. In summary, the method measures the similarities among objects in the high-dimensional space and looks for a lower-dimensional representation that preserves them, by minimising the Kullback-Leibler divergence between the two sets of similarities (in our case, the objects are the S&P 500’s components).

So, we will apply t-SNE to the combination of the two discrete series of each component: its daily market capitalisation changes and its daily returns. In other words, we look for the map in which the differences between the components’ combined distributions are distorted as little as possible.

Now that we have designed the method, we need to apply it to data. This is not a trivial step, and here it is even more complicated because the components of the index change over time. Therefore, we need to split the history into chunks small enough to apply the method to each one. In this case we have chosen a weekly split. Now we can apply t-SNE and reduce the dimensions to 2D for each calendar week in 2010. Below are our results for one particular week, with all the stocks grouped by the AP method.

[Figure: the stocks of one week in the 2D t-SNE space, grouped by the AP method]

The AP method finds relevant clusters to get the best possible split; in the figure, each cluster is shown in a different colour. There are many clusters, but far fewer than 500: in fact, we get an average of 20 clusters per week. Since our strategy invests in the stock nearest to the centre of each cluster, we only invest in about 20 stocks to replicate the S&P 500. We will see how to do this in the next section.
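The dimensionality reduction for a single week could be sketched with sklearn's `TSNE` as follows. The data are synthetic, the perplexity and seed are arbitrary choices rather than the values used in our simulations, and only 200 objects are used to keep the example quick.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for one week: 200 stocks x 10 features
# (5 daily returns and 5 daily cap changes per stock)
week_features = rng.normal(0.0, 0.01, size=(200, 10))

# perplexity and random_state are illustrative assumptions
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
embedding = tsne.fit_transform(week_features)

print(embedding.shape)        # (200, 2): one 2D point per stock
print(tsne.kl_divergence_)    # final value of the Kullback-Leibler objective
```

Fixing `random_state` makes a run repeatable, which matters because t-SNE starts from a random or data-dependent initialisation and only reaches a local optimum.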

Clustering in low-dimensional space

Once we have reduced the dimension of our problem (using the t-SNE method) and we have a good representation of the current week, we apply the affinity propagation (AP) algorithm to cluster the stocks, using Euclidean distance as the measure of similarity. This suits our purpose (finding similar components) better than other methods (e.g. k-means), because it does not need the number of clusters to be fixed in advance and it picks actual data points as cluster centres.

The method works by exchanging messages between pairs of objects (points in a Euclidean space): responsibilities, sent from each point to its candidate centres, and availabilities, sent back from the candidate centres to the points. The exchange is repeated until the assignments stop changing; at that point we have reached the cluster limits. You can find a whole description of the method in this article. The AP method is more demanding (in terms of computational resources), but it’s quick enough for a few hundred objects. In our case, we only have to group ~500 objects in each clustering process.
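A minimal sketch of this clustering step, using sklearn's `AffinityPropagation` on synthetic 2D points that stand in for a t-SNE embedding. The damping, iteration limit and seed are illustrative assumptions, not the settings of our simulations.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
# Synthetic 2D points standing in for one week's embedding: three loose groups
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(0.0, 0.5, size=(20, 2)) for c in centres])

# The default affinity is the negative squared Euclidean distance
ap = AffinityPropagation(damping=0.9, max_iter=500, random_state=0)
labels = ap.fit_predict(points)
exemplars = ap.cluster_centers_indices_   # indices of the points chosen as cluster centres

for k, idx in enumerate(exemplars):
    size = int(np.sum(labels == k))
    print(f"cluster {k}: exemplar is point {idx}, {size} members")
```

Because AP returns actual data points as exemplars, each exemplar can be read directly as the stock that represents its cluster.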

Running Simulations

We have chosen Python to simulate our strategy. More specifically, we have used packages like pandas, numpy and sklearn. We have only run the simulation over the year 2010, so deeper testing may be necessary to check all the possible parameters of the t-SNE and AP methods.

Every week, our strategy gives us the stocks and the weights we need to replicate the index over the following week. We use that information to build a synthetic index (adding up the returns of each stock multiplied by its weight) and compare it against the returns of the S&P 500. The results are surprising: we are able to reproduce the S&P 500 with a high degree of similarity by investing in only a handful of stocks. Both the t-SNE and AP methods converge to local optima that depend on random seeds, so our results may vary from one simulation to another, but we have tested further and the results are similar.

[Figure: accumulated returns of the S&P 500 and of the clustered series, with the extrapolated weekly portfolios in grey]

The grey series are extrapolations of the weekly portfolios (the recommendation of stocks and weights each week) to the rest of the year, so there are 51 grey lines in the previous chart (one for each week in the year, not counting the last week).
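The synthetic index is a weighted sum of the selected stocks' returns, which could be sketched as below. The numbers are made up, and weighting each selected stock by the size of its cluster is an assumption made for the example; the function name `synthetic_index_returns` is hypothetical.

```python
import numpy as np

def synthetic_index_returns(stock_returns, weights):
    """Daily portfolio returns: weighted sum of the selected stocks' returns.

    stock_returns: (n_days, n_selected); weights: (n_selected,), rescaled to sum to 1.
    """
    stock_returns = np.asarray(stock_returns, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return stock_returns @ weights

# Hypothetical week: 3 selected stocks over 5 trading days
week_returns = np.array([
    [0.010, -0.002, 0.004],
    [-0.005, 0.003, 0.000],
    [0.002, 0.001, -0.001],
    [0.000, -0.004, 0.006],
    [0.003, 0.002, 0.001],
])
cluster_sizes = np.array([30, 50, 20])   # assumption: weight by cluster size
portfolio = synthetic_index_returns(week_returns, cluster_sizes)
print(portfolio)   # one portfolio return per day
```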

In the previous graph, we can see the accumulated percentage returns with daily reinvestment for both series. There is a 95.5% correlation between the two series, even though the “clustered” series invests in an average of only 18 stocks. However, we don’t always get such accurate results, so we also have series that differ a bit more, like the following.

[Figures: two further replications of the index, each with an average of 18 stocks]

The previous series have 86% and 80% correlation with the original index.
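For reference, accumulated returns with daily reinvestment and the correlation between two series could be computed as in this sketch. Both series are synthetic, and measuring the correlation on the accumulated curves (rather than on the daily returns) is an assumption of the example.

```python
import numpy as np

def cumulative_returns(daily_returns):
    """Accumulated return with daily reinvestment: running product of (1 + r), minus 1."""
    return np.cumprod(1.0 + np.asarray(daily_returns, dtype=float)) - 1.0

rng = np.random.default_rng(1)
index_daily = rng.normal(0.0005, 0.01, size=250)                 # synthetic index returns
replica_daily = index_daily + rng.normal(0.0, 0.003, size=250)   # synthetic noisy replica

index_curve = cumulative_returns(index_daily)
replica_curve = cumulative_returns(replica_daily)
correlation = np.corrcoef(index_curve, replica_curve)[0, 1]
print(f"correlation of accumulated series: {correlation:.3f}")
```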

Conclusions

This method opens up many possibilities. More tests must be performed, and predictive models must be developed, if we want to use it as an investment system. In the meantime, we can draw some conclusions:

  • We can obtain series that are similar to the original index, with a high correlation, which could be useful for many purposes (e.g. testing investment strategies on similar series).
  • We can see that all the grey series lie above the index, which suggests there is a good chance of getting a better portfolio than the original index.
  • We have to consider how random initialisation could affect our strategy, so we would need some kind of persistence method to preserve our track record (such as a DB or binary files for saving our results).
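One possible way to keep such a track record is to store each weekly portfolio together with the random seed that produced it. The sketch below uses Python's built-in `sqlite3` module; the table layout, the function names and the tickers are all hypothetical.

```python
import sqlite3

def save_portfolio(conn, week, seed, holdings):
    """Store one week's portfolio; holdings maps ticker -> weight."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS portfolios "
        "(week TEXT, seed INTEGER, ticker TEXT, weight REAL)"
    )
    conn.executemany(
        "INSERT INTO portfolios VALUES (?, ?, ?, ?)",
        [(week, seed, ticker, weight) for ticker, weight in holdings.items()],
    )
    conn.commit()

def load_portfolio(conn, week, seed):
    """Return the stored portfolio for a week and seed as a ticker -> weight dict."""
    rows = conn.execute(
        "SELECT ticker, weight FROM portfolios WHERE week = ? AND seed = ?",
        (week, seed),
    ).fetchall()
    return dict(rows)

conn = sqlite3.connect(":memory:")   # pass a file path instead to keep results between runs
save_portfolio(conn, "2010-W01", 0, {"AAA": 0.6, "BBB": 0.4})   # hypothetical tickers
print(load_portfolio(conn, "2010-W01", 0))
```

Storing the seed next to the result makes it possible to rerun t-SNE and AP with the same initialisation and check that the saved portfolio is reproduced.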
