Reproducing the S&P500 by clustering

fuzzyperson

08/05/2015


Let’s begin with a simple question: can we use the movement of the main stocks in the S&P 500 to predict the index movement? But which are the main stocks? That is a good question; perhaps the biggest ones, perhaps the most bullish. So, how should we decide which companies are the most representative?

In this post we propose a relatively new approach to finding the most representative stocks. We embed the companies with a method called t-SNE and then group them with the affinity propagation method (AP method).

Feature engineering

First, we have to pick the best companies from a large pool of candidates. That means choosing the criteria that decide whether a stock is good or not, which in turn means starting with feature engineering. But first we have to answer a “little” question: what can we measure about our stocks? A lot of things, certainly! In this case we have chosen two measures:

  • daily returns
  • daily market capitalization changes

So we will say that two stocks are similar if they have similar daily returns and similar capitalization growth. The method splits the historical daily returns and daily capitalizations by weeks, so every week each stock has around 10 features (5 daily returns and 5 daily cap changes). Note that we are compressing data in time, turning daily data into weekly data. Once we have 500 objects per week with 10 features each, we move on to the dimensionality reduction problem described in the next section.
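The weekly feature construction above can be sketched as follows. This is a minimal, hypothetical example (the data, tickers and market-cap proxy are made up for illustration; the real strategy uses the ~500 S&P 500 constituents):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.bdate_range("2010-01-04", periods=15)  # three trading weeks
tickers = ["AAA", "BBB", "CCC"]
# Hypothetical daily close prices; market cap is a crude price-times-shares proxy
prices = pd.DataFrame(
    100 * np.cumprod(1 + rng.normal(0, 0.01, (15, 3)), axis=0),
    index=dates, columns=tickers)
mcap = prices * [2e6, 5e6, 1e6]

rets = prices.pct_change().dropna()
caps = mcap.pct_change().dropna()

features = {}
for week, r in rets.groupby(rets.index.to_period("W")):
    if len(r) < 5:
        continue  # skip holiday-shortened (or truncated) weeks
    c = caps.loc[r.index]
    # one row per stock: its 5 daily returns followed by its 5 cap changes
    features[week] = np.hstack([r.T.values, c.T.values])

week, X = next(iter(features.items()))
print(week, X.shape)  # 3 stocks x 10 features for that week
```

Each weekly matrix `X` (500 x 10 in the real case) is what gets fed to the dimensionality reduction step.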

Dealing with Data and skimming dimensions

Now, let’s take a glance at the t-SNE method. In summary, it preserves the similarities among objects in a high-dimensional space, measured with the Kullback-Leibler divergence (used as a distance-like function), while translating them to a lower-dimensional space (in our case, the objects are the S&P 500’s components).

So, we will apply t-SNE to the combination of the discrete probability distributions of daily market capitalization changes and daily returns. In other words, we look for a low-dimensional embedding that minimizes the difference between the joint distributions of the components.

Well, we have designed the method, but applying it to the data is not trivial. In this case it is even more complicated because the components of the index change over time, so we need to split the index history into chunks small enough to apply the method to each one; we have chosen a weekly split. Now we can apply t-SNE and reduce the dimensions to 2D for each week of 2010. Here are the results for a particular week, with all the stocks grouped by the AP method:

[Figure: weekly 2-D t-SNE embedding of the S&P 500 components, with each AP cluster shown in a different colour]

The AP method finds the clusters that give the best possible split; in the figure, each cluster is shown in a different colour. There are many clusters, but far fewer than 500: in fact we get an average of 20 clusters per week. Since our strategy invests in the stock nearest to the centre of its cluster, this means we only invest in about 20 stocks to replicate the S&P 500. We will see how to do this in the next section.
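The weekly embedding step can be sketched with scikit-learn’s `TSNE`. The feature matrix here is random stand-in data (60 stocks instead of ~500); the perplexity value is an illustrative choice, not necessarily the one used in the post:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
# Hypothetical weekly feature matrix: 60 stocks x 10 features
# (5 daily returns + 5 daily market-cap changes)
X = rng.normal(0, 0.02, (60, 10))

# Reduce the 10-D weekly features to a 2-D map; t-SNE minimizes the
# Kullback-Leibler divergence between high- and low-dimensional similarities
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=1)
X2d = tsne.fit_transform(X)
print(X2d.shape)  # one 2-D point per stock
```

This is run once per week, producing a fresh 2-D map on which the clustering step operates.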

Cluster in low dimensional space

Once we have reduced the problem’s dimension (using the t-SNE method) and have a good representation of the current week, we apply the affinity propagation algorithm (AP) to cluster the stocks, using Euclidean distance as the measure of similarity. By its very definition, AP is better suited to our purpose (finding similar components) than other methods (e.g. k-means): it selects an exemplar for each cluster and does not require fixing the number of clusters in advance.

The method works by exchanging messages between objects (points in a Euclidean space): “responsibilities” are sent in a backward pass and “availabilities” in a forward pass, until two objects are no longer related to each other, at which point a cluster boundary has been reached. You can find a full description of the method in the original article. The AP method is more demanding in terms of computational resources, but it is quick enough for a few hundred objects (in our case, we only have to group ~500 objects in each clustering run).
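A minimal sketch of the clustering step with scikit-learn’s `AffinityPropagation`, run on stand-in 2-D coordinates (the damping and iteration settings are illustrative choices to help convergence, not values from the post):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(2)
# Hypothetical 2-D t-SNE embedding of 60 stocks
X2d = rng.normal(0, 1, (60, 2))

# With affinity="euclidean" (the default), AP uses negative squared
# Euclidean distance as similarity and decides the cluster count itself
ap = AffinityPropagation(damping=0.9, max_iter=500, random_state=2).fit(X2d)
exemplars = ap.cluster_centers_indices_  # one representative stock per cluster
print(len(exemplars), "clusters; exemplar row indices:", exemplars[:5])
```

The exemplars (the points nearest the centre of each cluster) are exactly the stocks the strategy invests in each week.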

Running Simulations

We have chosen Python for simulating our strategy, using packages such as pandas, NumPy and scikit-learn. We have only run the simulation over the year 2010, and deeper testing may be necessary to check all possible parameters of the t-SNE and AP methods.

Every week, our strategy gives us the stocks and the weights we need to replicate the index over the following week. We use that information to build a synthetic index (summing the returns of each stock multiplied by its weight) and compare it against the returns of the S&P 500. The results are surprising: we are able to reproduce the S&P 500 with a high degree of similarity while investing in only a handful of stocks. Bear in mind that the t-SNE and AP methods are based on local approaches to the optimum and depend on random seeds, so the result may vary from one simulation to another; we have run it several times and the results are similar.

[Figure: accumulated returns of the S&P 500 versus the clustered portfolio]

The grey series are extrapolations of the weekly portfolios (the recommended stocks and weights of each week) to the rest of the year, so the chart contains 51 grey lines (one for each week of the year, leaving out the last week).

The previous graph shows the accumulated percentage returns with daily reinvestment for both series. The correlation between the two series is 95.5%, and the “clustered” series invests in only 18 stocks on average. However, we do not always get such accurate results; we can also obtain somewhat different series like the following ones:

[Figures: two further replications of the index]

These series have 86% and 80% correlation with the original index.
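The weekly replication and comparison step can be sketched as follows. Everything here is stand-in data: a 60-stock universe instead of 500, the first 18 columns pretending to be the AP exemplars, equal weights instead of whatever weighting the real strategy derives, and an equal-weight mean as a proxy for the S&P 500:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Hypothetical daily returns for a 60-stock universe over one week
universe = pd.DataFrame(rng.normal(0.0005, 0.01, (5, 60)))

picked = list(range(18))     # pretend these are the 18 AP exemplars
weights = np.ones(18) / 18   # illustrative equal weights

# Synthetic index: weighted sum of the selected stocks' daily returns
synthetic = (universe[picked] * weights).sum(axis=1)
index = universe.mean(axis=1)  # equal-weight index proxy

# Accumulated percentage returns with daily reinvestment
synthetic_curve = (1 + synthetic).cumprod() - 1
index_curve = (1 + index).cumprod() - 1

corr = np.corrcoef(synthetic, index)[0, 1]
print(f"correlation over the week: {corr:.2f}")
```

Repeating this for every week of the year, and extrapolating each weekly portfolio forward, produces the family of grey curves in the charts above.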

Conclusions

This method opens up many possibilities. More tests have to be performed, and predictive models must be developed if we want to use this system as an investment system, but we can already draw some conclusions:

  • We can obtain series that are similar to the original index, with a high correlation, which could be useful for generating series for many purposes (e.g. testing investment strategies on similar series).
  • All the grey series lie above the index, which suggests there are many chances of building a portfolio that beats the original index.
  • We have to consider how random initialization affects the strategy, so we will probably need some kind of persistence mechanism, such as a database or binary files, to preserve our track record.
