It is well known that data mining algorithms are not perfect and can fail under certain conditions. K-Means is one example, but there is a good alternative: K-Medoids.
In a previous post, “Machine Learning: A Brief Breakdown”, we mentioned that K-Means is the cluster analysis algorithm par excellence and one of the most important data mining and machine learning techniques. psanchezcri even used it to analyse the direction of a financial time series in his post “Returns clustering with K-means algorithm”.
Nevertheless, it is difficult to find discussions of the unexpected results the algorithm produces in certain cases. Most of the documentation available on the Internet is too general, so the main objective of this post is to show a financial example of the problem. With this in mind, we follow four steps:
1. First, we select six stocks from the STOXX Europe 600, forming three pairs from different sectors:
– Financials: Banco Bilbao Vizcaya Argentaria S.A. & Banco Santander SA.
– Consumer Discretionary: LVMH Moet Hennessy Louis Vuitton SA & Christian Dior SA.
– Energy: BP PLC & Galp Energia SGPS SA.
2. We get the prices between 2013/01/01 and 2015/12/31:
3. Using daily returns, we calculate the “1 - correlation” distance between each pair of series. Next, we apply dimensionality reduction to the distance matrix so that the stocks can be drawn as points in Euclidean space. The stocks turn out grouped by sector.
4. Finally, we apply K-Means with 3 clusters to the distance matrix. We expect each cluster to match one sector. As K-Means starts from random points, we run the algorithm 15 times.
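Steps 3 and 4 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code behind the results in this post: the six return series are synthetic (one common factor per sector plus noise) instead of the real prices, K-Means is a basic Lloyd's algorithm that treats each row of the distance matrix as a point and takes random data points as initial centroids, and the names `pearson`, `correlation_distance`, `kmeans` and `matches_sectors` are hypothetical helpers.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equally long series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_distance(returns):
    """Matrix of 1 - correlation for a list of return series."""
    return [[1.0 - pearson(a, b) for b in returns] for a in returns]

def kmeans(points, k, rng, max_iter=100):
    """Lloyd's algorithm with k random data points as initial centroids."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def matches_sectors(labels):
    """True if each sector pair shares a cluster and the three clusters differ."""
    return (labels[0] == labels[1] and labels[2] == labels[3]
            and labels[4] == labels[5] and len(set(labels)) == 3)

# Synthetic stand-in for the six stocks (NOT the real data used in this post):
# daily returns = one common factor per sector + idiosyncratic noise.
rng = random.Random(0)
sectors = [0, 0, 1, 1, 2, 2]
factors = [[rng.gauss(0, 0.01) for _ in range(250)] for _ in range(3)]
returns = [[f + rng.gauss(0, 0.002) for f in factors[s]] for s in sectors]

dist = correlation_distance(returns)

# Run K-Means 15 times from different random starting points.
runs = [kmeans(dist, 3, rng) for _ in range(15)]
hits = sum(matches_sectors(r) for r in runs)
print(f"{hits} of 15 runs recovered the three sectors")
```

The number of successful runs depends on the random starting points, which is exactly the sensitivity this post is about: when two initial centroids fall in the same sector, the remaining two sectors may be merged into one cluster.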
About 80% of the time, K-Means obtains the expected result:
In the remaining 20%, there are “faulty” results like:
However, a very similar technique called K-Medoids provides the expected result 100% of the time. It works like K-Means, but its cluster centres (medoids) are actual data points instead of means of points.
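To make the difference concrete, here is a minimal K-Medoids sketch that works directly on a distance matrix. It assumes a PAM-style procedure (a greedy BUILD phase followed by a SWAP phase), which may or may not be the variant used for the results above, and it runs on a small hand-made distance matrix with three obvious pairs, not on the real stock data. The function names `total_cost` and `k_medoids` are hypothetical.

```python
def total_cost(dist, medoids):
    """Sum over all points of the distance to the closest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)

def k_medoids(dist, k):
    """PAM-style K-Medoids on a precomputed distance matrix."""
    n = len(dist)
    # BUILD: greedily add the point that lowers the total cost the most.
    medoids = []
    for _ in range(k):
        candidates = [i for i in range(n) if i not in medoids]
        medoids.append(min(candidates,
                           key=lambda i: total_cost(dist, medoids + [i])))
    # SWAP: replace a medoid with a non-medoid while that lowers the cost.
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                if total_cost(dist, trial) < total_cost(dist, medoids):
                    medoids, improved = trial, True
    # Assign every point to its closest medoid.
    labels = [min(range(k), key=lambda c: row[medoids[c]]) for row in dist]
    return medoids, labels

# Hand-made "1 - correlation" style distances for six items forming three
# pairs (0,1), (2,3), (4,5). Illustrative values only.
dist = [
    [0.0, 0.2, 0.9, 0.8, 1.0, 0.9],
    [0.2, 0.0, 0.8, 0.9, 0.9, 1.0],
    [0.9, 0.8, 0.0, 0.1, 0.9, 0.8],
    [0.8, 0.9, 0.1, 0.0, 0.8, 0.9],
    [1.0, 0.9, 0.9, 0.8, 0.0, 0.3],
    [0.9, 1.0, 0.8, 0.9, 0.3, 0.0],
]

medoids, labels = k_medoids(dist, 3)
print("medoids:", medoids)
print("labels: ", labels)
```

On this toy matrix each pair ends up in its own cluster, with one member of the pair as its medoid. Because the medoids are always actual rows of the distance matrix and this sketch involves no random initialisation, repeated runs give the same clusters.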