In previous posts (Visualising Fixed Income ETFs with T-SNE) we have talked about dimensionality reduction algorithms as a way to visualize financial assets and find recognizable patterns. The conclusion was that T-SNE didn’t perform well compared to PCA, which is a more classical approach.
Can we do any better?
T-SNE dates from 2008, and more dimensionality reduction algorithms have been released since then. One of them is UMAP, which stands for Uniform Manifold Approximation and Projection, and was released in 2018. A lot of research on dimensionality reduction has been published over those years, but we chose UMAP for this post because it is so often compared with T-SNE, the last algorithm we tried in this series of posts.
How UMAP works
The UMAP algorithm is based on heavy mathematical concepts, but here we’ll try to break it down into simple steps.
- First, for each observation we find its k nearest neighbours, so every observation gets a small local cluster of k observations. By default, Euclidean distance is used, but other distance metrics can be used instead.
- Observations that lie within a cluster are connected, so let’s think of this as a “graph”.
- These “graphs” have weighted edges, so they are not isolated connected components: they are linked to one another. The weight of the edges makes each observation belong to a group in a fuzzy way, with a degree of membership, rather than belonging to exactly one group in a binary fashion.
- All this time we have been working in the high dimension: up to this point, we have calculated a way to represent the structure of our data in the high dimension. But the point was to represent it in a lower dimension! What should we do then?
- Remember that our “graphs” have weighted edges. What we want is to calculate these weights in the lower dimension while keeping them as close as possible to those in the higher dimension. We can solve this as an optimization problem. To sum up this last step: what we want is a low-dimensional representation that resembles the high-dimensional (original) representation.
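The steps above can be sketched in a few lines of plain Python. This is only a toy illustration under loud simplifications, not the real implementation: it uses an exact nearest-neighbour search, a fixed bandwidth of 1 for the edge weights (UMAP tunes a bandwidth per observation), and a simple 1 / (1 + d²) curve for the low-dimensional weights. The function names and the toy data are made up for this example. It builds the fuzzy graph in the high dimension and then scores two candidate 2-D layouts by how well their weights match it.

```python
import math


def pairwise_distances(points):
    """Euclidean distance between every pair of observations."""
    return [[math.dist(p, q) for q in points] for p in points]


def fuzzy_knn_graph(points, k):
    """Weighted k-nearest-neighbour graph (simplified: bandwidth fixed to 1)."""
    dist = pairwise_distances(points)
    n = len(points)
    directed = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # The k closest observations to i, excluding i itself.
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: dist[i][j])[:k]
        rho = dist[i][neighbours[0]]  # distance to the closest neighbour
        for j in neighbours:
            # The closest neighbour gets weight 1; farther ones decay.
            directed[i][j] = math.exp(-(dist[i][j] - rho))
    # Fuzzy union: i and j are linked if either one "votes" for the edge.
    return [[directed[i][j] + directed[j][i] - directed[i][j] * directed[j][i]
             for j in range(n)] for i in range(n)]


def low_dim_weights(embedding):
    """Edge weights implied by a low-dimensional layout (assumed curve)."""
    dist = pairwise_distances(embedding)
    n = len(embedding)
    return [[0.0 if i == j else 1.0 / (1.0 + dist[i][j] ** 2)
             for j in range(n)] for i in range(n)]


def cross_entropy(high, low, eps=1e-12):
    """How badly the low-dimensional weights match the high-dimensional ones."""
    total = 0.0
    n = len(high)
    for i in range(n):
        for j in range(i + 1, n):
            w = high[i][j]
            v = min(max(low[i][j], eps), 1.0 - eps)
            if w > 0:
                total += w * math.log(w / v)  # penalty for pulling neighbours apart
            if w < 1:
                total += (1 - w) * math.log((1 - w) / (1 - v))  # penalty for merging strangers
    return total


# Toy data: two tight groups of three observations in 3-D.
points = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0),
          (5, 5, 5), (5.1, 5, 5), (5, 5.1, 5)]
graph = fuzzy_knn_graph(points, k=2)

# Two candidate 2-D layouts: one keeps the groups together, one mixes them.
good_layout = [(0, 0), (0.1, 0), (0, 0.1), (3, 3), (3.1, 3), (3, 3.1)]
mixed_layout = [(0, 0), (3, 3), (0.1, 0), (3.1, 3), (0, 0.1), (3, 3.1)]

print("loss, groups kept together:", cross_entropy(graph, low_dim_weights(good_layout)))
print("loss, groups mixed up:     ", cross_entropy(graph, low_dim_weights(mixed_layout)))
```

The layout that keeps neighbours together gets the lower loss. UMAP itself does not compare hand-made layouts like this: it starts from an initial layout and moves the points step by step to reduce a loss of this kind.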
How different is UMAP from T-SNE?
- UMAP is faster.
- It can place new points in the lower-dimensional space without refitting, while with T-SNE we have to rerun the algorithm with the new data.
- It preserves the global structure better, while T-SNE struggles with this (though some disagree on this point).
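The second point is easy to try with the umap-learn package, whose `UMAP` class has `fit_transform` and `transform` methods in the scikit-learn style. The sketch below assumes umap-learn and NumPy are installed; the helper name `fit_then_project`, its arguments and the default values shown are illustrative choices for this example, not something taken from our experiments.

```python
def fit_then_project(train_rows, new_rows, n_neighbors=15, random_state=42):
    """Fit UMAP on train_rows, then place new_rows in the same 2-D space."""
    n_features = len(train_rows[0])
    if any(len(row) != n_features for row in new_rows):
        raise ValueError("new observations must have the same features as the training data")

    # Imported here so the check above works even without these packages.
    import numpy as np
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(n_neighbors=n_neighbors, n_components=2,
                        random_state=random_state)
    # Learn the embedding from the observations we already have...
    train_embedding = reducer.fit_transform(np.asarray(train_rows, dtype=float))
    # ...and reuse the fitted model for observations that arrive later.
    new_embedding = reducer.transform(np.asarray(new_rows, dtype=float))
    return reducer, train_embedding, new_embedding
```

Calling it with the original feature matrix and a handful of later observations returns the fitted model and both sets of 2-D coordinates, so the new points can be drawn on top of the existing plot.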
Show me the plots!
Now that we have a sense of how UMAP works, let’s apply it!
We have tried different configurations, modifying only the number of elements within each neighborhood.
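In umap-learn, the neighbourhood size is the `n_neighbors` parameter. A sweep like the one described could look roughly like this. The candidate values below are placeholders, not necessarily the ones behind our plots, and the helper names are made up for the example. It assumes umap-learn and NumPy are installed.

```python
def usable_neighbor_counts(n_samples, candidates=(5, 15, 30, 50)):
    """Keep only neighbourhood sizes that make sense for n_samples observations."""
    # A neighbourhood needs at least 2 points and fewer points than the dataset has.
    return [k for k in candidates if 2 <= k < n_samples]


def embed_for_each_k(features, candidates=(5, 15, 30, 50), random_state=42):
    """Return {n_neighbors: 2-D embedding}, changing only the neighbourhood size."""
    import numpy as np
    import umap  # provided by the umap-learn package

    X = np.asarray(features, dtype=float)
    embeddings = {}
    for k in usable_neighbor_counts(len(X), candidates):
        # Everything except n_neighbors stays fixed across runs.
        reducer = umap.UMAP(n_neighbors=k, n_components=2,
                            random_state=random_state)
        embeddings[k] = reducer.fit_transform(X)
    return embeddings
```

Small values of `n_neighbors` make UMAP focus on local structure, while larger values give more weight to the overall shape of the data, which is why it is the natural parameter to vary first.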
We’ll use the Fixed Income ETFs from the T-SNE post that are still available.
It seems that UMAP doesn’t cluster the observations very well (e.g. the Govern labels are scattered at the extremes of the different plots).
However, we can’t forget that the goal of dimensionality reduction techniques is not to cluster observations but to transform the features that describe each observation into a new set of “features”. We end up with fewer “features” than we started with and, at the same time, we don’t lose much of the original information.
A common use of dimensionality reduction is to visualise observations with an open mind, because the algorithm may have placed together observations that we did not expect to be related.
So maybe the final conclusion should be that, given the nature of the algorithm we’re using, we should look closely at our data, because the pattern we expected is not the one the algorithm is finding. Or maybe the pattern the algorithm provides makes no sense at all from a business point of view.
PS: What if we use a specific clustering algorithm after reducing dimensionality?
The following references helped me get a grasp of how UMAP works under the hood. The first links explain UMAP from a more general point of view and the last links go a little bit deeper: