All

Understanding the shape of data

jdporras

14/11/2018

No Comments

It is often useful to know what shape your data takes. Linear regression is one example, where you want to find out how well your data fits a straight line. Knowing that your data fits a straight line reasonably well gives you an advantage for understanding your data. For instance, you know that your data holds a linear relation, and you can predict the value of one of the variables with the value of the other.

But, of course, data does not always fit a straight line.

Most of the time, the shape it takes is more complex. The number of dimensions might even be too big to understand the shape your data takes, and dimensionality reduction techniques might not even be able to help. Luckily for us, we have Topology on our side. But… what is Topology? and how can it help us?

What is Topology?

Topology (not Topography, though it can be related) is a branch of Mathematics that studies the properties of spaces that remain invariant under continuous transformations. Ok, maybe this definition does not clarify things a lot, but let me expand on this. In Geometry, you are interested in concepts like distance, angles, etc. In Topology, you study the very same objects as in Geometry, but forgetting about distance. Just as in Geometry you would consider two cubes with the exact same size but one rotated respect to the other to be essentially the same object, in Topology, you would consider a cube and a sphere to be the same object, as you can continuously transform one into the other, and you can continuously undo the transformation. In Topology, you can stretch, compress, bend or twist an object without changing its topological nature. Only gluing or tearing wouldn’t be allowed, as they cannot be undone continuously. This is the reason why sometimes, Topology is called the “Rubber Sheet Geometry”.

Let me give you one more example, concerning the birth of Topology, that will make the concept clearer. Topology was born in 1736 when Leonhard Euler gave a solution to the Seven Bridges of Königsberg problem. Given the seven bridges of the city of Königsberg (now Kaliningrad) as shown in the picture, the problem is to find a route through the city that will cross all the bridges exactly once.

The seven bridges of the city of Königsberg (now Kaliningrad)

Euler noted that the lengths of the bridges or the particular shape of the river were not relevant aspects to take into account in order to solve the problem. The only important thing was to know which points the bridges connect. He made the following representation:

Graph representation of the bridges of Königsberg

Here, each line represents each one of the seven bridges, and each point represents the portion of land that is not separated by the river. You could have drawn the points further apart from each other, as you could have drawn the lines in a polygonal way or in a more curved way, but the information enclosed in this representation wouldn’t change at all. Topologically it would remain the same object. Using this simpler representation, Euler found a negative solution to the problem.

You might recognize the previous figure as a graph. That is correct. In fact, graphs are very interesting objects from the topological viewpoint, and as we have just seen, both Graph Theory and Topology share the same birthday.

This problem shows us that the key concepts that Topology studies are continuity and connection. Distances or angles are no longer important.

So… why use Topology?

Ok, so now we have an idea of what Topology is. But, how can it help us in understanding our data? Well, there are many properties of Topology that make it very useful when studying data. We can mention the following:

  1. Coordinates are no longer a thing to worry about, or at least they are not that big of a problem, as the topological features that we study will no longer be affected by them.
  2. Changes under small deformations can be easily tracked with Topology, therefore noise can be handled in a much easier way.
  3. Thanks to Topology we can study objects in a simplified representation, called simplicial complexes (which we will introduce later), that preserve the topological features that we want to focus on.

Ok, that sounds cool! So how can I start applying Topology to my data? Well, we need to introduce some concepts before.

Simplicial complexes

Simplicial complexes are a key concept in Algebraic Topology, that is, the branch of Topology that studies topological objects through their algebraic properties, such as homotopy or homology groups. If you don’t know what a group is, you might take a look at this link. All the groups that we’ll use are abelian groups (groups which hold the commutative property that we all know from primary school), which simplifies things a lot. We will introduce homology groups later on.

The reason for the importance of simplicial complexes is that many objects studied in Mathematics can be reduced to a simplicial complex that preserves the homological groups of the object.

So what exactly is a simplicial complex?

We first define what a simplex is:

  • A 0-simplex is a point. Its boundary is the empty set, i.e., it doesn’t exist.
  • A 1-simplex is a segment. Its boundary consists of two points.
  • A 2-simplex is a triangle. Its boundary consists of three segments, each of which has two points as a boundary.

Generalizing for higher dimensions, you get the idea of what a simplex is. Also, you can see that the boundary of a simplex consists of a union of simplices of lower dimensions, which have been glued by their boundaries. We call each simplex of the boundary a face. Note that the boundary of an n-simplex contains 0-faces (0-dimensional faces), 1-faces (1-dimensional faces), …, and all the way up to (n-1)-faces.

A simplicial complex is an object that results from the union of several simplices, such that:

  1. Every face of a simplex that the simplicial complex contains is also contained in the simplicial complex.
  2. The intersection of any two simplices of the simplicial complex is a face of each of the simplices.

The dimension of the complex is considered to be the highest of the dimensions of its simplices.

0-simplex, 1-simplex, 2-simplex and 3-simplex

From left to right: 0-simplex, 1-simplex, 2-simplex and 3-simplex. Bear in mind that the 3-simplex on the right is not hollow, as all the points enclosed by the 4 triangular faces are part of the 3-simplex.

Basically, you form a simplicial complex by taking several simplices and gluing them by their boundaries, in a way such that the parts that get glued are faces of each simplex being glued (and hence, the parts being glued must be of the same dimension).

simplicial complex example

Here you have an example of a simplicial complex of dimension 3 (you can see the pyramid on the lower left, which is the simplex of highest dimension).

not simplicial complex example

In this picture, you can see that the figure on the right is not a simplicial complex as property 2 doesn’t hold: the intersection of the 2 simplices is not the face of each of them, but only part of the face.

We mentioned boundaries before. A boundary is a key concept in Topology. What happens to the boundary when we glue two simplices together? We will give a proper definition shortly, but for the moment take a look at this particular example:

boundary of union of 1-simplicesThe boundary of the union of this five 1-simplices would consist of the points \(A_1\) and \(A_6\). The points that have been used for gluing are no longer part of the boundary, as they sort of cancel each other out.

 

So now we have boundaries, and we have simplicial complexes. Let’s jump now to the next section, where we define what simplicial homology is.

Simplicial Homology

We start by defining what a chain complex is.

First, taking all the k-simplices \(\sigma_i\) of a simplicial complex \(S\), we give each one of them an orientation, and then we define the kchain complex  \(C_k\) of \(S\) as the free abelian group generated by all the k-simplices of \(S\). What this means basically is that \(C_k\) is formed by all the possible linear combinations of the k-simplices of \(S\):

$$\sum_{i=1}^N c_i \sigma _i, $$

where \(c_i \)s are integers.

A negative sign of one of the  \(c_i\)s can be interpreted as giving the opposite orientation to the k-simplex accompanied by the \(c_i\).

We define the boundary operator on the chain complexes:

$$\partial_k: C_k\longrightarrow C_{k-1},$$

thus, the boundary operator takes each k-chain to a (k-1)-chain.

For a k-chain consisting of solely one k-simplex, we first get the (k-1)-simplices that form the boundary of the k-simplex. We sum them up to form a (k-1)-chain, where the signs of each of the (k-1)-simplices will be given according to the orientation that the k-simplex induces on them (see picture below).

Boundary of 2-simplex

In this picture, the triangle is a 2-simplex, where the orientation it has been assigned is a counterclockwise one. The orientation given to the 1-simplexes of its boundary are represented by the arrows. Then, when finding the boundary of this 2-simplex, as the orientations of the arrows match the counterclockwise orientation given to the 2-simplex, we would have that the boundary is \(a + b + c\). Now, imagine that at the beginning, we had given the \(a\) arrow the opposite direction. Then, as the direction wouldn’t match with that given by the 2-simplex orientation, it would be of negative sign, and the boundary would be \(-a+b+c\). This might look rather cumbersome, but it’s simpler than it seems: you can give the simplices any orientation you want, as long as they stick to it through the whole following process, and as long as you are careful enough with the signs when calculating the boundaries.

Now, as a k-chain is a linear combination of k-simplices, the boundary of a k-chain is defined as the linear combination of the boundaries of its k-simplices:

$$\partial_k\left( \sum_{i=1}^N c_i \sigma _i \right)= \sum_{i=1}^N c_i \partial_k(\sigma_i).$$

This is called a group homomorphism.

Remember when we said that when gluing together 2 simplices, the faces used to do the gluing sort of canceled each other out? This statement will become clearer with the following example:

boundary of sum of simplices

In this picture we see the 2-simplices \(A\) and \(B\), which share a face \(a\) and the points \(P_0\) and \(P_1\). Now as the orientation of both 2-simplices match, we can give a meaning to the 2-chain \(A+B\). It is the result of gluing \(A\) and \(B\) through face \(a\). If the orientations didn’t match, we would have to subtract one of the 2-simplices to the other in order to obtain the gluing. Otherwise, the sum would be just a formal sum, as it would be the case if the simplices had no common faces. Now, let’s find the boundary for \(A+B\). According to the formula we saw before, this boundary is the boundary of \(A\) summed to the boundary of \(B\), that is,

$$\partial_2(A+B) = \partial_2(A) + \partial_2(B) =  (a+b+c) + (-a + d – e).$$

Summing it all up, we have that the resulting boundary is \(b + c + d – e,\) that is, the 4 outermost faces after gluing \(A\) and \(B\), with the sign given by the orientation (the actual order of the addends does not matter, as we are in an abelian group, but the orientation is important to give the elements in the boundary a sign: note the minus sign on \(e\)).

Elements that have no boundaries, that is, elements in \(Z_k = \ker \partial_k,\) are called cycles, and elements in \(B_k = \text{Im}\partial_{k+1},\) that is, elements that are boundaries of other elements, are called, as you may have guessed, boundaries.

It’s straightforward to see that \(\partial^2 = 0,\) that is, if you are the boundary of something, then you don’t have a boundary. Or in other terms, all boundaries are cycles (try calculating step by step the boundary of the boundary of \(A + B\) given in the previous example. Hint: the boundary of \(b\) would be \(P_2 – P_1\), then … ).

Now, we call the sequence of maps $$ C_n \overset{\partial_n}{\rightarrow}\cdots \overset{\partial_{k+1}}{\rightarrow}C_k \overset{\partial_k}{\rightarrow} C_{k-1}\overset{\partial_{k-1}}{\rightarrow}\cdots \overset{\partial_1}{\rightarrow} C_0 $$

the Chain Complex of the simplicial complex S. This chain complex always starts in the n-dimensional chain complex \(C_n\), where n is the dimension of S, as with higher dimensions all the chains would be nil.

The k-th homology group of S is then defined as the quotient abelian group \(H_k(S) = Z_k(S)/B_k(S)\), that is, the cycles modulo the boundaries.

Then, the k-th Homology group won’t be 0 when there are k-cycles which are not boundaries. If you think about it, these cycles which are not boundaries represent k-th dimensional holes on S.

The k-th Betti number, defined as \(\beta_k = \text{rank}(H_k(S))\), tells us the number of k-dimensional holes in SWhen k=0, the Betti number tells us the number of connected components in S.

That is it! We have defined Simplicial homology groups. Now we want to apply them to our data. But data usually consists of a bunch of points distributed in space. So, where is the connection or continuity between points? Aren’t they just separate of each other? Is it just a big 0-simplicial complex that we have to analyze? Don’t worry, Čech has the answers to all these questions.

The Čech complex

Ok, so of course, a bunch of points in space (also called a point cloud) doesn’t have a topological structure per se. One has to define the Čech complex in order to extract an interesting structure that relates the points. I know that I told you before that in Topology we forget about distance, but in this case, between points distance will be the feature that relates the points. So, how do we define this Čech complex from the between-points distance?

First, we define a ball centered on each of the points. We take a parameter \(t>0,\) that will describe the diameter of all the balls. So as \(t\) starts to get bigger, the balls start to grow. When two balls intersect each other, then a 1-simplex gets formed between the two centers. When 3 balls intersect, a 2-simplex gets formed between the 3 centers, and so on. Then, it is clear that for each \(t>0\) we have a simplicial complex \(C_t,\) that will change for specific values of \(t\). This is the way that the Čech complex is formed.

Cech complex example

Example of Čech complex: see how whenever 2 balls have intersected, a 1-simplex (segment) has been formed between the centers of the 2 balls, and whenever 3 balls have intersected, a 2-simplex (triangle) has been formed. If you look at the lower right corner, you’ll see that an intersection between 4 balls is about to happen. When it happens, a 3-simplex (pyramid) will arise from the 4 vertices.

Note that the set of points need not lie in a particular space. Having a distance matrix of the set of points would suffice to construct the Čech complex.

Now, for each t that makes the complex change, we can obtain the homology groups, and the Betti numbers for \(C_t\). We call this the persistent homology of the point cloud, as we are seeing how persistent the topological features for this space are.

It’s clear then, that for each hole we have the values of t for which the hole is born or dies, as eventually, for a big enough \(t\), all the balls will intersect, thus closing all the holes that may exist. This way, we relate each hole with a tuple (birth, death). We can represent this in several ways. One of them is the persistence barcode. Let’s see an example to better understand this barcode:

persistence barcode, persistent homology

The X-axis represents the parameter t, and each segment represents the lifespan of a hole. Then we see that hole number 1 is born when t=0.5 and dies when t=0.55. Hole number 2 is born when t=0.75 and dies when t=0.83, and so on. We distinguish the dimension of the holes by the colors of their segments. In this particular case, color blue represents 1-dimensional holes, while the color red represents 2-dimensional holes (cavities). We see that some holes have a longer lifespan than others, that is, they are more persistent. Those can be considered as the real topological features of the data set, whereas those holes with short lifespans (in this case holes 1 and 2) can be considered a product of noise.

Ok, so there you have it! We have seen how to understand the shape of your data better by extracting topological information from it. But one thing is the theory, and another thing is applying it to practice. There is a lot of algorithmic work behind this that we haven’t showed. Lucky for us, it has already been implemented. For example, you have the package TDA for R, Javaplex for Java, or Topology Toolkit and Gudhi for C++, also usable in Python.

Well, I hope that this post, meant to be an introduction to Topology and Topological Data Analysis, has helped you understand the importance and usefulness of Topology, and has given you some new ideas to study your data. For the sake of simplicity, I’ve had to omit lots of technical details and definitions, but I hope you got a general idea, and in case you want to delve into these topics, you can ask for additional references. Of course, you are welcome to ask any questions that you may have. Now, it’s your turn to apply these ideas that you’ve learnt. Show us the results!