# Self-organizing maps for clustering

### Alfredo Timermans

#### 15/12/2021

We can use self-organizing maps for clustering data, trained in an unsupervised way. Let’s see how.

This week we are going back to basics, as we will look at one of the first successfully deployed machine learning algorithms: self-organizing maps (SOM, sometimes also called Kohonen maps). This is an unsupervised technique, so we will not need any labeled data for training, just raw inputs. Technically, it is a special kind of neural network, although it works very differently from the classical feedforward/feedback/recurrent/convolutional networks we are so accustomed to these days. This algorithm is especially suited for clustering. It is rarely used by itself; it is more commonly found in conjunction with other, more sophisticated algorithms.

## The algorithm

First, we need to start with a map. It can have as many dimensions as we want but, for the sake of simplicity, we will use rectangular maps filled with squares. Each square on the map represents a neuron, and holds as many variables (weights) as we like. For instance, we could be trying to cluster colors, represented by 3 values (their RGB components). We initialize these neurons randomly, with reasonable values for our field of study. In our case, they should be initialized with 3 numbers from 0 to 255 (the range for RGB colors).

Then we need raw data. In the example we are discussing, these would be examples of specific colors. We present each new datum to the map and find the best matching unit (BMU). The BMU is the neuron that minimizes some distance measure, usually the Euclidean distance, between the values of the example and the values held by the neuron. When we find the BMU, we update its values and the values of its neighbors. We repeat these steps until we reach convergence, or until we reach a pre-defined number of steps. Why update the neighbors? Because we want similar values to be close together, and dissimilar values to be spread out. After all, we want to use the self-organizing map for clustering. Pretty easy, right?

Let’s look at it schematically:

1. Randomly initialize the neurons of the map.
2. Repeat until convergence or until the maximum number of steps is reached:
    1. Shuffle the examples.
    2. For each example:
        1. Find the BMU.
        2. Update the BMU’s and its neighbors’ values.
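The BMU search in step 2.1 takes only a few lines of NumPy. This is just a sketch: `find_bmu` matches the name of the helper used in the example code later in the post, but the actual implementation there is not shown, so details may differ.

```python
import numpy as np

def find_bmu(som, x):
    """Return the (row, col) coordinates of the best matching unit.

    som: (m, n, d) array holding one d-dimensional weight vector per neuron.
    x:   (d,) input example.
    """
    # Squared Euclidean distance from every neuron to the example; the
    # argmin of the squared distance equals the argmin of the distance.
    dist_sq = np.sum((som - x) ** 2, axis=2)
    # argmin works on the flattened array; unravel it back to 2D coordinates.
    return np.unravel_index(np.argmin(dist_sq), dist_sq.shape)
```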

So we have seen the basic algorithm. But we still haven’t defined two things, and they turn out to be this algorithm’s bread and butter! How do we update the neurons’ values? And how do we define “neighbor”? It turns out, these questions are closely related.

### Updating the weight vectors

To update the weight vector of a neuron, the formula to apply is:

$$W_{v}(t+1) = W_{v}(t) + \eta(t) \cdot \phi(u, v, t) \cdot (X(t) - W_{v}(t))$$

Where:
$$v$$ are the (in our case, 2D) coordinates of a neuron.
$$u$$ are the coordinates of the BMU.
$$W_{v}(t)$$ is the weight vector of neuron at coordinate $$v$$ at time step $$t$$.
$$X(t)$$ is the example presented at time step $$t$$.
$$\eta(t)$$ is the learning rate at time step $$t$$. It depends on the time step, as we will see a bit further on.
$$\phi(u, v, t)$$ is the neighborhood distance function, which weights the update of the neuron at $$v$$ according to its distance from the BMU at $$u$$, and also depends on the time step $$t$$.

There are a couple of things to comment on. First, why does the learning rate fall with time? Well, as when training “regular” neural networks, we want the network to learn a lot at first, when all the knowledge embedded in it is random. Then, as an approximate solution starts to appear, we want to explore it more slowly, settling into a good local minimum instead of jumping over it. So how does the learning rate usually evolve? We will use the following formula:

$$\eta(t) = \eta(0) \cdot e^{-t \cdot \lambda}$$

Where:
$$\eta(0)$$ is the initial learning rate.
$$\lambda$$ is the learning decay rate.
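As a quick illustration, this schedule is a one-liner in Python; the default values here are the $$\eta(0) = 0.1$$ and $$\lambda = 0.1$$ used in the example later in the post.

```python
import numpy as np

def learning_rate(t, eta_0=0.1, decay=0.1):
    """Exponentially decaying learning rate: eta(t) = eta(0) * exp(-t * lambda)."""
    return eta_0 * np.exp(-t * decay)
```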

### The neighborhood distance function

And what about the neighborhood distance function? Why does it evolve with time? Well, the objective of the SOM is to have neighboring neurons recognize similar patterns, and far away neurons to recognize dissimilar things. Thus, we want to start from a big neighborhood, so close neurons recognize the same general patterns. But, as the network learns more, we want to fine-tune the neurons, so we will need a smaller neighborhood to act upon. So, how does the function evolve over time?

$$\phi(u, v, t) = e ^{\frac {-d(u, v)^2} {2 \sigma(t) ^2}}$$

Where:

$$d(u, v)$$ is the (usually Euclidean) distance between coordinates $$u$$ and $$v$$.
$$\sigma(t)$$ is the neighborhood distance at time step $$t$$.

So, finally, how do we decide $$\sigma(t)$$?

$$\sigma(t) = \sigma(0) \cdot e^{-t \cdot \beta}$$

Where:

$$\sigma(0)$$ is the initial neighborhood distance.
$$\beta$$ is the neighborhood decay rate.
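Putting the update rule, the neighborhood function, and its decay together, the whole update step can be sketched as follows. This is only a sketch of what the `update_weights` helper used in the example below might look like, assuming a 2D grid and Euclidean grid distance; the post does not show the actual implementation.

```python
import numpy as np

def update_weights(som, learn_rate, x, sigma, bmu):
    """One update step: pull every neuron toward the example x,
    weighted by its grid distance to the BMU.

    som:        (m, n, d) array of neuron weight vectors.
    learn_rate: eta(t), the current learning rate.
    x:          (d,) example presented at this step.
    sigma:      sigma(t), the current neighborhood distance.
    bmu:        (row, col) coordinates u of the best matching unit.
    """
    m, n, _ = som.shape
    rows, cols = np.meshgrid(np.arange(m), np.arange(n), indexing='ij')
    # Squared grid distance d(u, v)^2 from each neuron v to the BMU u.
    d_sq = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    # Neighborhood function phi(u, v, t) = exp(-d^2 / (2 * sigma^2)).
    phi = np.exp(-d_sq / (2 * sigma ** 2))
    # W_v(t+1) = W_v(t) + eta * phi * (X - W_v), broadcast over the grid.
    som += learn_rate * phi[..., np.newaxis] * (x - som)
    return som
```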

## An example

Now that the algorithm is clear, let’s see an example. We will show a classic one: a SOM that clusters colors. So, when we begin, we will have random colors in our map. And by the end, we expect to have smooth transitions between areas with differing dominant colors.

```python
import numpy as np
import matplotlib.pyplot as plt

...

# Map dimensions: a 10x10 grid of neurons.
m = 10
n = 10

# Number of training examples.
n_x = 3000

# Random RGB training colors (randint's upper bound is exclusive, so 256).
train_data = np.random.randint(0, 256, (n_x, 3))

# Randomly initialized map: one RGB weight vector per neuron.
som = np.random.randint(0, 256, (m, n, 3)).astype(float)

# Show the initial, random map, hiding the axes.
plt.imshow(som.astype(int))
ax = plt.gca()
ax.axes.xaxis.set_visible(False)
ax.axes.yaxis.set_visible(False)
```

Now you should see something akin to this image:

The colors are all mixed up. So we create and train our SOM.

```python
learn_rate = .1
neighborhood_distance = 1
lr_decay = .1
neighborhood_decay = .1
epochs = 20

# Keep the initial values around for the decay schedules.
learn_rate_0 = learn_rate
neighborhood_distance_0 = neighborhood_distance

# A 5x4 grid of subplots, one per epoch.
nrows = 5
ncols = 4
fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 16))

for epoch in range(epochs):
    np.random.shuffle(train_data)
    for train_ex in train_data:
        # Find the best matching unit and update it and its neighbors.
        u = find_bmu(som, train_ex)
        som = update_weights(som, learn_rate, train_ex, neighborhood_distance, u)

    # Decay the learning rate and the neighborhood distance.
    learn_rate = learn_rate_0 * np.exp(-epoch * lr_decay)
    neighborhood_distance = neighborhood_distance_0 * np.exp(-epoch * neighborhood_decay)

    # Plot the map after this epoch.
    ax[epoch // ncols, epoch % ncols].imshow(som.astype(int))
    ax[epoch // ncols, epoch % ncols].title.set_text('Epoch ' + str(epoch + 1))
    ax[epoch // ncols, epoch % ncols].axes.xaxis.set_visible(False)
    ax[epoch // ncols, epoch % ncols].axes.yaxis.set_visible(False)
```



After 1, 10 and 20 epochs mine looks like this: