We can use self-organizing maps for clustering data, trained in an unsupervised way. Let’s see how.
This week we are going back to basics, as we will look at one of the first successfully deployed machine learning algorithms: self-organizing maps (SOM, sometimes also called Kohonen maps). This is an unsupervised technique, so we will not need any labeled data for training, just raw inputs. Technically, it is a special kind of neural network, although it works very differently from the classical feedforward/feedback/recurrent/convolutional networks we are so accustomed to these days. The algorithm is mainly used for clustering. It is only rarely used by itself, being more commonly used in conjunction with other, more sophisticated algorithms.
The algorithm
First, we need to start with a map. It can have as many dimensions as we want but, for the sake of simplicity, we will use rectangular maps made up of squares. Each square on the map represents a neuron, and will hold as many variables (weights) as we like. For instance, we could be trying to cluster colors, each represented by 3 values (its RGB components). We initialize these neurons randomly, with reasonable values for our field of study. In our case, they should be initialized with 3 numbers from 0 to 255 (the range for RGB colors).
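As a minimal sketch of this initialization step (assuming a 10×10 map and NumPy; these are my own choices for illustration, and essentially what the example further down does):
import numpy as np

# A hypothetical 10x10 map where each neuron holds 3 weights (one per RGB channel)
m, n = 10, 10
som = np.random.randint(0, 255, (m, n, 3)).astype(float)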
Then we need raw data. In the example we are discussing, these would be examples of specific colors. We present each new datum to the map and find the best matching unit (BMU). The BMU is the neuron that minimizes some distance measure, usually the Euclidean distance, between the values of the example and the weights held by the neuron. Once we find the BMU, we update its values and the values of its neighbors. We repeat these steps until we reach convergence, or until we reach a pre-defined number of steps. Why update the neighbors? Because we want similar values to be close together, and dissimilar values to be spread out. After all, we want to use the self-organizing map for clustering. Pretty easy, right?
Let’s look at it schematically:
- Randomly initialize the neurons of the map.
- Repeat until convergence or until the maximum number of steps is reached:
  - Shuffle the examples.
  - For each example:
    - Find the BMU.
    - Update the BMU’s and its neighbors’ values.
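To make the “Find BMU” step concrete, here is a minimal sketch of a find_bmu helper (the name matches the one used in the example further down), assuming the map is stored as a NumPy array of shape (rows, columns, weights):
import numpy as np

def find_bmu(som, x):
    # Squared Euclidean distance between the example x and every neuron's weight vector
    dist_sq = np.sum((som - x) ** 2, axis=-1)
    # Grid coordinates (row, column) of the neuron closest to the example
    return np.unravel_index(np.argmin(dist_sq), dist_sq.shape)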
So we have seen the basic algorithm. But we still haven’t defined two things, and they turn out to be this algorithm’s bread and butter! How do we update the neurons’ values? And how do we define “neighbor”? It turns out, these questions are closely related.
Updating the weight vectors
To update the weight vector of a neuron, the formula to apply is:
\(
W_{v}(t+1) = W_{v}(t) + \eta(t) \cdot \phi(u, v, t) \cdot (X(t) - W_{v}(t))
\)
Where:
\(v\) are the (in our case, 2D) coordinates of a neuron.
\(u\) are the coordinates of the BMU.
\( W_{v}(t) \) is the weight vector of the neuron at coordinates \(v\) at time step \(t\).
\(X(t)\) is the example presented at time step \(t\).
\(\eta(t)\) is the learning rate at time step \(t\). It depends on the time step, as we will see a bit further on.
\(\phi(u, v, t)\) is the neighborhood distance function, which will give us the distance between the neurons at \(u\) and \(v\) at time step \(t\).
There are a couple of things to comment on. First, why does the learning rate fall with time? Well, just as when training “regular” neural networks, we want the network to learn a lot at first, when all the knowledge it has embedded is random. Then, as an approximate solution starts to appear, we want to explore around it more slowly, settling into a (possibly local) minimum. So how does it usually evolve? We will use the following formula:
\(
\eta(t) = \eta(0) \cdot e^{-t \cdot \lambda}
\)
Where:
\(\eta(0)\) is the initial learning rate.
\(\lambda\) is the learning decay rate.
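In code this schedule is a one-liner; here is a quick sketch (the function and parameter names are my own, purely for illustration):
import numpy as np

def learning_rate(t, eta_0=0.1, decay=0.1):
    # eta(t) = eta(0) * exp(-t * lambda)
    return eta_0 * np.exp(-t * decay)
With these (assumed) defaults, the learning rate starts at 0.1 and drops to roughly 0.037 after 10 steps.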
The neighborhood distance function
And what about the neighborhood distance function? Why does it evolve with time? Well, the objective of the SOM is to have neighboring neurons recognize similar patterns, and faraway neurons recognize dissimilar ones. Thus, we want to start with a big neighborhood, so close neurons recognize the same general patterns. But, as the network learns more, we want to fine-tune the neurons, so we will need a smaller neighborhood to act upon. So, how does the function evolve over time?
\(
\phi(u, v, t) = e^{-\frac{d(u, v)^2}{2 \sigma(t)^2}}
\)
Where:
\(d(u, v)\) is the (usually Euclidean) distance between coordinates \(u\) and \(v\).
\(\sigma(t)\) is the neighborhood distance at time step \(t\).
So, finally, how do we decide \(\sigma(t)\)?
\(
\sigma(t) = \sigma(0) \cdot e^{-t \cdot \beta}
\)
Where:
\(\sigma(0)\) is the initial neighborhood distance.
\(\beta\) is the neighborhood decay rate.
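Putting the update rule, the Gaussian neighborhood function and the current neighborhood distance together, here is one possible sketch of an update_weights helper, written to match the call used in the example below (the decays themselves are applied outside, once per epoch). A vectorized version would be faster, but the explicit loop keeps the formula visible:
import numpy as np

def update_weights(som, learn_rate, train_ex, radius, bmu_coord):
    m, n, _ = som.shape
    bmu_i, bmu_j = bmu_coord
    for i in range(m):
        for j in range(n):
            # Squared grid distance d(u, v)^2 between this neuron and the BMU
            dist_sq = (i - bmu_i) ** 2 + (j - bmu_j) ** 2
            # Gaussian neighborhood function phi(u, v, t)
            phi = np.exp(-dist_sq / (2 * radius ** 2))
            # W_v(t+1) = W_v(t) + eta(t) * phi(u, v, t) * (X(t) - W_v(t))
            som[i, j, :] += learn_rate * phi * (train_ex - som[i, j, :])
    return som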
An example
Now that the algorithm is clear, let’s see an example. We will show a classic one: a SOM that clusters colors. So, when we begin, we will have random colors in our map. And by the end, we expect to have smooth transitions between areas with differing dominant colors.
import numpy as np
import matplotlib.pyplot as plt
# Map dimensions and number of training examples
m = 10
n = 10
n_x = 3000
# Random RGB training colors and a randomly initialized 10x10 SOM
train_data = np.random.randint(0, 255, (n_x, 3))
som = np.random.randint(0, 255, (m, n, 3)).astype(float)
# Show the initial (random) state of the map, hiding the axes
plt.imshow(som.astype(int))
ax = plt.gca()
ax.axes.xaxis.set_visible(False)
ax.axes.yaxis.set_visible(False)
Now you should see something akin to this image:

The colors are all mixed up. So now we set the training parameters and train our SOM.
# Training hyperparameters
learn_rate = .1
neighborhood_distance = 1
lr_decay = .1
neighborhood_decay = .1
epochs = 20
# Keep the initial values around for the decay schedules
learn_rate_0 = learn_rate
radius_0 = neighborhood_distance
# One subplot per epoch (5 rows x 4 columns = 20 epochs)
nrows = 5
ncols = 4
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 16))
for epoch in range(0, epochs):
    np.random.shuffle(train_data)
    for train_ex in train_data:
        # Find the best matching unit and pull it (and its neighbors) towards the example
        u = find_bmu(som, train_ex)
        som = update_weights(som, learn_rate, train_ex, neighborhood_distance, u)
    # Decay the learning rate and the neighborhood radius once per epoch
    learn_rate = learn_rate_0 * np.exp(-epoch * lr_decay)
    neighborhood_distance = radius_0 * np.exp(-epoch * neighborhood_decay)
    # Plot the state of the map after this epoch
    ax[epoch//ncols, epoch%ncols].imshow(som.astype(int))
    ax[epoch//ncols, epoch%ncols].title.set_text('Epoch ' + str(epoch + 1))
    ax[epoch//ncols, epoch%ncols].axes.xaxis.set_visible(False)
    ax[epoch//ncols, epoch%ncols].axes.yaxis.set_visible(False)
After 1, 10, and 20 epochs, mine looks like this:
[Figures: the map after training for 1 epoch, 10 epochs, and 20 epochs]
As you can see, similar colors tend to stick together, while dissimilar ones tend to be farther away. So the self-organizing map has clustered the different colors. Now, we could give the map a new color to classify, and the BMU would tell us the closest color to the one given. The colors close by would be the most similar colors to it.
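For instance, reusing the som and find_bmu from above (the specific color is just an arbitrary choice):
# A hypothetical new color to place on the trained map
new_color = np.array([200, 30, 30])
bmu = find_bmu(som, new_color)
print('Closest neuron:', bmu, 'with weights', som[bmu])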
A few questions are left to the reader:
- How do the results change as we change the value of the parameters (initial learning rate and neighborhood distance, decays, number of training examples,…)?
- Why do we use an exponential function for the decays? Could we use another one? How would the results change?
- Are there good applications for SOMs in finance? Maybe clustering, either of fundamental data or of assets’ price behavior?