It’s hard enough deciding which Machine Learning technique to use, but after selecting an appropriate clustering algorithm the next challenges begin: how good is the separation and into how many groups should you divide the data?

Maybe three is not always a crowd…

## First, let’s set the scene

We want to create a diversified investment strategy using a **set of predictions** and making use of clustering as a step to selecting the right ones.

- Imagine a prediction object. It represents bets of the direction our financial series will move each day (up or down) and its aim is to correctly detect trend movements. We have many predictions that only use past information to decide how to bet.
- Now imagine an “almost perfect” bet. There are no surprises since it uses future information. To be more realistic, the bet is right most of the time but not always. We have a
**variety of perfect bets**using different timescales and movement magnitudes. - We use the perfect bets to evaluate how good our prediction objects are. The predictions have scores describing its “goodness” compared to each perfect bet.
- The final objective is to select a reduced number of decent prediction objects that
**diversify the investment decision**. Firstly, we should organise the predictions by their similarity to each other based on their scores. And then we can choose some from the best groups to form a joint financial strategy.

Our problem is clearly unsupervised since the data description is unknown and not predefined. We are interested in sorting the data, finding some clear structures and splitting it using the separation. For these reasons, we’re operating within the clustering world.

In supervised learning techniques, since the correct answers are known in the training and testing stages (before implementation), measuring their performance is a direct comparison of algorithm outputs and real answers. However, evaluating unsupervised techniques is not as clear.

In this post, we focus on the first part…organising our predictions into groups using clustering. We look at some cluster performance measures as well as how to decide the optimal number of clusters to use.

## Data Visualisation

We have arranged the data in a matrix of scores, comparing all pairs of predictions and perfect bets. The predictions are the data points and the scores are the features.

By the nature of the score calculation, these features vary between the minimum, worst score of 0 and the maximum, impossible and perfect score of 1.

Plotting our data, the scores tend to be above 0.2 and rarely go above 0.6. There are some patterns and structure in these features which indicate that at least some prediction groupings make sense. At this stage we could eliminate some features if they are too many and/or too similar. Some ML solutions are using dimension reduction algorithms like PCA or LDA.

Directly visualising the relationships between the predictions using all the features is not an option; we’d need as many dimensions as features. Here we have tried two techniques to view the data: Non-negative Matrix Factorization (NNMF) to transform the raw features into just two, and hence view the points in 2 dimensions.

We’re also using multidimensional scaling on the pairwise distances between the predictions to produce a different 2D representation of the data. There’s no clear cluster separation in either plot, although there are some obvious relationships.

But we’re still no closer to creating clusters…

## Clustering Techniques

There’s a varied choice of techniques that can be applied to separate data depending on its nature and structure. Most of them suffer from the “how many clusters?” doubt.

Some ‘partitioning method’ algorithms require choosing the number of clusters before execution; like the “k” in k-means or k-medoids. The same is true in spectral clustering, expectation-maximization algorithms, and also for most soft or fuzzy clustering methods like Gaussian Mixture Models (GMMs).

With hierarchical methods the choice is postponed until *after* running the algorithms. It’s possible to **visualise the whole decomposition or hierarchy of the universe** with the dendrogram and decide where to make a sensible cutoff. One way is to look at the “legs”; the longer the better. For our data the longest legs have 3 clusters.

Other techniques, like affinity propagation and the density-based technique DBSCAN, have the optimal cluster number built into the algorithm, hence removing this choice from the user.

So how do we choose the quantity of clusters k? And how do we know if the clustering is any good?

## Measuring Performance

Clustering validation is an extensive and interesting research topic. Due to its subjective nature it’s by no means a trivial or ‘complete’ area. Let’s take a look over some common statistics and methods.

Ours is a clustering situation where we don’t know the real labels for the data points. So it’s not an option to compare the answers to true values as in some metrics like the Rand index, which measures how similar the algorithm clustering is to the real one, and the V-measure, a mean of homogeneity (clusters of only one class) and completeness (all class members in one cluster).

To measure performance in unknown clustering problems, the best approach is to analyse two values:

**Within cluster similarity:**how close objects are in each cluster.**Between cluster differences:**how different objects are from other clusters.

The key is to reach a balance between them. Taking into account only one could produce an over-optimised solution. Maximising both avoids overfitting.

## Choosing k

A typical way to decide the number of clusters is the elbow method. This is a visual measure in which an error measurement, in this case the within cluster sum of point-to-centroid distances, is plotted against the number of clusters. The ‘elbow’ is when the error measurement decreases less with additional clusters. A smooth ‘elbow’ curve like ours makes it hard to assess the best cluster number.

One clever way to find the elbow is to draw a line from start to end of the curve and longest perpendicular distance to the curve is the optimal cluster number. With our data and this technique, this is closest to 6 clusters. Since this depends on the maximum clusters tested, an option is to estimate ‘the end of the line’ when the total variance explained reaches a high enough value.

The gap statistic is a formalisation to find this ‘elbow’ by comparing the error measurements of the real data and those from randomly distributed data (with no obvious clusters) generated by Monte Carlo sampling.What is this and how should we interpret it? First, **‘the gaps’ are the differences between the error measurements of the two datasets** (prediction vs random). These gaps usually increase with the cluster size since the error falls more with real data than random. When the gap does not increase (i.e. adding clusters is almost random) we have reached the elbow or optimal cluster number. This is the first positive value in the gap differences Gap(k)-Gap(k+1). Our data produces strange results, but the test indicates three clusters is the optimum (positive bar).

As we mentioned, ideally a validity measure should use both performance values (within and between cluster differences). Some examples are the Davies-Bouldin criterion, which involves minimsing the ratio between them, the Calinski-Harabasz and Dunn indexes, which use the same concept but with maximums.

A fourth measure that also uses the two values is the Silhouette Coefficient. It’s frequently used to **choose the best number of clusters** due to its pictorial and interpretable outputs.

Each prediction has a silhouette value (from -1 to 1) describing how similar it is to its assigned cluster, calculated as a function of the within cluster dissimilarity and the lowest between cluster dissimilarity.

The higher the value, the better matched the prediction is to its own cluster and more poorly matched to neighbouring clusters. According to our data, the **values are greater for three clusters than two**. And in fact, it has the highest average silhouette value compared to all values of k. We conclude that using these techniques, **the optimum number of clusters for our data is three**. Not so crowded then…