Clustering data into groups that share common characteristics can be very useful, but relying on human experts to perform this grouping is costly, and in many cases their decisions are influenced by emotions.
That is why clustering is one of the main topics in unsupervised machine learning, which does not require labels to find patterns in data.
Why not combine both?
Here I combine both approaches by using a Fully Convolutional Autoencoder to reduce the dimensionality of the S&P 500 components, and then applying a classical clustering method, KMeans, to generate the groups.
Why Fully Convolutional?
When using fully connected or convolutional autoencoders, it is common to find a flatten operation that converts the feature maps into a 1D vector. This operation discards the spatial information present in the features.
A fully convolutional network avoids flattening the feature maps by using only convolutional (plus pooling and upsampling) layers throughout the network.
To check the feasibility of the proposal, we follow these steps:
- Get the last 256 prices of the S&P500 components (from 2019-10-15 to 2020-10-06).
- Create the cumulative returns of all the components and scale them.
- Train a Fully Convolutional Autoencoder and extract the encoded features.
- Perform KMeans clustering over the encoded features.
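The article does not show the preprocessing code, but the steps above can be sketched in a few lines of NumPy. Here a random-walk price matrix stands in for the real S&P 500 data, and per-series min-max scaling is my assumption about the scaling step:

```python
import numpy as np

# Hypothetical price matrix: one row per S&P 500 component, one column
# per trading day (random walks stand in for the real downloaded prices).
rng = np.random.default_rng(0)
prices = 100 + rng.normal(0, 1, size=(500, 257)).cumsum(axis=1)

# Daily returns, then cumulative returns (256 values per component).
returns = prices[:, 1:] / prices[:, :-1] - 1
cum_returns = np.cumprod(1 + returns, axis=1) - 1

# Min-max scale each series to [0, 1] so the autoencoder sees
# comparable input ranges across components (assumed scaling choice).
mins = cum_returns.min(axis=1, keepdims=True)
maxs = cum_returns.max(axis=1, keepdims=True)
scaled = (cum_returns - mins) / (maxs - mins)

print(scaled.shape)  # (500, 256)
```

With real data, `prices` would instead come from a market data provider for the 2019-10-15 to 2020-10-06 window.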
The network architecture has two parts:
- The encoder: reduces the input size (500×256) by consecutively applying convolutions and max-pooling operations until reaching a smaller version (500×2).
- The decoder: increases the encoded size (500×2) by consecutively applying upsampling operations and convolutions until reaching the input size (500×256).
The input length is 256, a power of 2, so it can be downsampled and then upsampled by a factor of 2, 'n' times, without rounding issues.
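The exact layer configuration is not listed in the article; a minimal tf.keras sketch that matches the described shapes (256 values in, 2 values encoded, 256 values out) could look like the following. The filter count, kernel size, and single input channel are my own choices, not the author's:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_autoencoder(length=256, filters=16):
    """Fully convolutional autoencoder: 256 -> 2 -> 256, no Flatten layer."""
    inputs = tf.keras.Input(shape=(length, 1))
    x = inputs
    # Encoder: 7 Conv1D + MaxPooling1D stages halve the length 256 -> 2.
    for _ in range(7):
        x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(2)(x)
    # Reduce to a single channel: two encoded values per series.
    encoded = layers.Conv1D(1, 3, padding="same", name="encoded")(x)
    # Decoder: 7 UpSampling1D + Conv1D stages double the length 2 -> 256.
    x = encoded
    for _ in range(7):
        x = layers.UpSampling1D(2)(x)
        x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    outputs = layers.Conv1D(1, 3, padding="same")(x)
    return tf.keras.Model(inputs, outputs)

autoencoder = build_autoencoder()
autoencoder.compile(optimizer="adam", loss="mse")
print(autoencoder.output_shape)  # (None, 256, 1)
```

Because only convolutions, pooling, and upsampling are used, no spatial information is flattened away; the bottleneck named `encoded` holds the two values per component used later for clustering.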
As an example, the Apple stock price (AAPL) is fed forward through the network, generating the following results:
The encoder transforms the 256 daily returns of the AAPL component into two values (left graph), and from those the decoder does its best to reconstruct the original series (right graph).
Finally, we perform the clustering over the encoded samples. But how many clusters are there in our data? That is where the “elbow” method comes into play.
The elbow method consists of running the clustering algorithm with different numbers of clusters and computing a metric called inertia (the sum of squared distances from each sample to its closest cluster centre).
There is no single right or wrong number of clusters; instead, we look for the value where the inertia, after decreasing steeply, produces an “elbow” in the graph.
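The elbow search can be sketched with scikit-learn's KMeans, whose `inertia_` attribute gives exactly this metric. Synthetic 2-D blobs stand in for the real encoded features here:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical encoded features: 500 components, 2 values each
# (four synthetic blobs stand in for the autoencoder output).
rng = np.random.default_rng(0)
encoded = np.vstack([rng.normal(c, 0.1, size=(125, 2)) for c in range(4)])

# Run KMeans for a range of k and record the inertia
# (sum of squared distances of samples to the nearest centroid).
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(encoded)
    inertias.append(km.inertia_)

# Plotting k against inertias reveals the "elbow": the point after
# which the curve stops dropping sharply (near k=4 for these blobs).
print(inertias)
```

In practice one would plot `inertias` against `k` and pick the elbow visually, as the article does.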
For this problem we select 4 clusters, leading to the following grouping of encoded components:
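Fitting the final model and recovering the per-cluster ticker lists is a one-liner with `fit_predict`; here random features and placeholder ticker names stand in for the real encodings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-ins for the real 2-D encoded features and tickers.
rng = np.random.default_rng(0)
encoded = rng.normal(size=(500, 2))
tickers = [f"T{i:03d}" for i in range(500)]  # placeholder ticker names

# Fit KMeans with the 4 clusters chosen from the elbow plot.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(encoded)

# Group the component names by their assigned cluster label.
clusters = {k: [t for t, l in zip(tickers, labels) if l == k]
            for k in range(4)}
print({k: len(v) for k, v in clusters.items()})
```

With the real encodings, `clusters[3]` would hold the tech/pharma group listed below.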
In order to check if the clusters correspond to similar behaviours, we show their original evolution along time:
Clusters 0 and 1 look very similar on visual inspection. That happens because the instances of clusters 0 and 1 are very close in the encoded space (similar behaviour).
As one might guess, cluster 3 corresponds to the big US tech and pharmaceutical companies. All the components belonging to cluster 3 are the following:
'AAPL', 'ADBE', 'ADSK', 'AMZN', 'CTXS', 'EBAY', 'FAST', 'BIIB', 'LRCX', 'MSFT', 'NVDA', 'QCOM', 'REGN', 'NLOK', 'TSCO', 'VRTX', 'MNST', 'NFLX', 'NDAQ', 'APD', 'BBY', 'CLX', 'CAG', 'DHR', 'DVA', 'LLY', 'FCX', 'KR', 'LB', 'LOW', 'SPGI', 'MCO', 'NEM', 'ROK', 'TER', 'TMO', 'TIF', 'UNH', 'VAR', 'FMC', 'PKI', 'CRM', 'FB', 'BLK', 'ABBV', 'CMG', 'EA', 'HUM', 'KSU', 'LDOS', 'URI', 'ALB', 'ANSS', 'BIO', 'CDNS', 'EQIX', 'JKHY', 'MTD', 'QRVO', 'RMD', 'ROL', 'SWKS', 'SNPS', 'CNC', 'DPZ', 'MSCI', 'FTNT', 'FBHS', 'INCY', 'KHC', 'PYPL', 'ATVI', 'CCI', 'CHTR', 'SBAC', 'NOW', 'TMUS', 'DXCM', 'OTIS', 'CARR'
Using a Fully Convolutional Autoencoder as a preprocessing step for clustering time series is useful to remove noise and extract key features, but condensing 256 prices into just 2 values may be too restrictive.
There is some future work that might lead to better clustering:
- Generate encodings with higher dimensionality.
- Use more daily returns to capture past information.
- Apply a different clustering technique such as DBSCAN or Spectral Clustering.