# Hierarchical clustering: explanation and classification

### Clara Díaz-Pinés

#### 25/05/2022

Clustering algorithms are one of the main techniques in the field of unsupervised learning. In the machine learning context, we understand by unsupervised learning the process of analyzing and identifying patterns in unlabeled datasets.

Unsupervised learning algorithms observe similarities and differences in input data in order to discover hidden patterns or data groupings. As their name suggests, they do not need human supervision; the fact that they use untagged data means that there is no correct answer we can check our result against. This also means that they are applicable in a broader set of cases, they are a good first step for data analysis.

We can distinguish two main techniques in the field of unsupervised learning:

• Dimensionality Reduction: It is used to reduce the number of dimensions or columns in the input dataset. It searches for dependencies between features in order to preserve the minimum possible number of them without losing important information. (If two variables capture similar information, you can use a linear combination of them instead of using both).
• Clustering: Methods used to find natural groups in the input data based on similar features.

## Hierarchical Clustering Algorithms

In this post we are going to discuss clustering algorithms, concretely hierarchical clustering. There are many different clustering algorithms and no single best method for all datasets; we classify and explain some types of hierarchical clustering algorithms, and let the reader decide what is the perfect fit for each concrete problem.

But, what is hierarchical clustering? Wikipedia says:

“In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters.”

Source: [1]

The best way to understand how they work is to dive directly into their characteristics.

## Classification

We can classify hierarchical clustering algorithms attending to three main criteria:

1. Attending to their departure point:
• Agglomerative clustering: This is a “Bottoms-up” approach. We start with each observation being a single cluster, and merge clusters together iteratively on the basis of similarity, to scale in the hierarchy. This is the most commonly used method.
• Divisive clustering: The exact opposite approach; “Top-down”. We start with all the observations falling in the same cluster, and split clusters recursively as we go down in the hierarchy.
2. Attending to the metric used to measure distances between points or observations: Given an input dataset with N features or columns (and assuming these features are real numbers) we can realise observations as points in an N-dimensional space $$(\mathbb{R}^N)$$. We can also easily define a way to measure distances in this space, endowing it with the structure of a metric space. Any function that fulfills the mathematical properties of a distance could be used here, although the most widely used one is the Euclidean distance:

$$d_2(a, b) = \Vert a-b \Vert_2 = \sqrt{\sum_{i=1}^N (a_i-b_i)^2 }.$$

1. Attending to the Linkage criteria, that is, the function we use to determine the distance between two clusters or sets of points, based on the pairwise distances of observations in the sets. The most common linkage functions to measure distance between two sets of observations, say R and S, are:
• Ward’s linkage: The distance between R and S is defined by the increase in total within-cluster variance after the clusters are merged.
• Average linkage: The distance between R and S is defined as the mean of the pairwise distances between points in R and points in S: $$D(R, S) = \frac{1}{\vert R\vert \cdot \vert S\vert} \sum_{p\in R}\sum_{q\in S} d(p,q)$$.
• Complete (or maximum) linkage: The distance between R and S is defined as the maximum distance between two points in each cluster: $$D(R, S) = \max\{d(p,q): p\in R, q\in S\}$$.
• Single (or minimum) linkage: The distance between R and S is defined as the minimum distance between two points in each cluster: $$D(R, S) = \min\{d(p,q): p\in R, q\in S\}$$. Note that this is the traditional topological definition of distance between two sets.

## Agglomerative Clustering: Explained

Since divisive clustering is not commonly used, we are going to focus in agglomerative clustering to explain how the algorithm works, but the functioning is fairly analogous and the reader can make a guess.

As we already stated, in agglomerative clustering we start by considering each observation to be a cluster of its own. The algorithm works iteratively, calculating at each step all the distances between clusters with the linkage function, and merging the two closest ones.

To implement the algorithm, distances are stored in what we call the proximity matrix, D. This matrix has in the $$(i,j)$$-position, the distance between the i-th and the j-th cluster, $$D_{ij} = D(X_i, X_j)$$. Therefore, in the first iteration, the proximity matrix is an $$N \times N$$ matrix, containing just the pairwise distances between our observation points, measured with the chosen metric: $$D_{ij} = d(x_i, x_j)$$.

The proximity matrix is updated in each step, erasing rows and columns as clusters are combined into larger ones. The algorithm stops when there is just one cluster containing all the points.

The result of the clustering is often presented as a dendrogram, which shows the sequence in which clusters were merged. Choosing a level of the dendrogram is what will give the final partitioning clustering, it is therefore like selecting the precision. This choice could be made, for example, by giving the desired number of clusters as an input to the algorithm.

In the above example, cutting at Level 2 of the dendrogram will yield clusters {p, r}, {s} and {q}, while cutting at the third Level will yield clusters {p, r, s} and {q}, which is a coarser clustering, with a smaller number but larger clusters.

## Example Inputs

Note that we have assumed that the features, i.e., the columns in our input dataset, are all numeric, so we can compute distances between data points. However, this isn’t always the case.

It is also worth noting that in order to obtain meaningful results, all the features must be scaled so they all receive the same weight/importance. If we had two features in very different scales, for example one in the order of 100 and the other in the order of 0.1, the clusters would be done attending practically only to the first feature, since distances are calculated in a purely geometric fashion.

Let’s present a simple example in which the characteristics of the data are numeric and comparable. Our input dataset will be a bunch of grocery shops together with the information on their sales of apples and bananas. We want to compare them and group them based on how good their sales of these two products are. The below table presents the average hourly sales of apples and bananas for each of the 7 stores analyzed.

We are going to use a hierarchical clustering algorithm to decide a grouping of this data.

## Naive Implementation

Finally, we present a working example of a single-linkage agglomerative algorithm and apply it to our greengrocer’s example.

In single-linkage clustering, the distance between two clusters is determined by the shortest of the distances between two elements (one in each cluster). At each step, the two clusters that contain the closest pair of elements, not yet belonging to the same cluster, are merged. The method is also known as nearest neighbour clustering, for obvious reasons.

We first create the single-linkage function according to the above definition. The metric we are going to use to measure distances between points is the Euclidean one.

# Returns the distance between two clusters according
# to the single linkage criterion.
distances = []
for point1, point2 in product(cluster1, cluster2):
d = np.linalg.norm(point1 - point2)
distances.append(d)
return min(distances)

Next, we define the algorithm, which takes as input a collection of arrays (the data points). It iterates over the number of clusters, updating the proximity matrix to contain the distances between clusters in the current clustering, until all the points are in the same cluster.

# Initializate clustering:
clusters = []
for x in X:
clusters.append([x])
N = len(clusters)
while N > 0:
# Print clustering
for cluster in clusters:
# create scatter of these samples
pyplot.scatter(np.array(cluster)[:, 0], np.array(cluster)[:, 1])
# show the plot
pyplot.show()
# Update the proximity matrix
D = np.zeros((N, N))
for i in range(N):
for j in range(N):
D[i, j] = distance_clusters
display(pd.DataFrame(data = D).round(1))
# We keep just the upper triangular matrix
up = np.triu_indices(N, k=1)
upper_triangular_D = D[up]
if N > 1:
# Find the minimun distance in the proximity matrix
min_distance = min(upper_triangular_D)
# Find the two clusters that are separated by the shortest distance
for k in range(N):
for l in range(N):
if l > k and D[k, l] == min_distance:
break
else:
continue
break
# Update clustering
clusters_updated = []
for n in range(N):
if n != k and n != l:
clusters_updated.append(clusters[n])
clusters_updated.append(clusters[k] + clusters[l])
clusters = clusters_updated
# Update the number of clusters
N = N - 1

Let’s apply this method to our example inputs. (Note that the feature space -number of columns of the data- doesn’t have to be 2-dimensional, it can have as many dimensions as wanted. We use 2-dimensional examples here just because they are easier to visualize). We obtain the following:

The result of this analysis will depend on how fine we want our grouping to be. Suppose that 3 is a sufficiently small number of clusters for us (this would be an input in an actual implementation of this kind of algorithm). Then the program would have stopped in the fourth iteration, giving the clustering: {A, B}, {C,D}, {E, F, G}. We would have divided the grocers into 3 groups, which could be seen as “worst sales”, “average sales” and “best sales”.

This naive algorithm for single-linkage clustering is easy to understand but slow, can you find a way to make it more efficient?

Try it on your own data!