Do you know how a fireman and the direction of a financial time series are related? If your answer is no, you’re reading the right post.

## An introduction to k-Means: Voronoi diagram

Suppose that you a work at an emergency center, and your job is to tell the pilots of firefighter helicopters to take off. You receive an emergency call because there’s a point of the city on fire and a helicopter is necessary to put it out. You need to choose which pilot has to do this work. It’s obvious that the farther helicopters (* grey helicopters*) will arrive later than the closer helicopters (

**), but you don’t know exactly which is the closest.**

*red helicopters**Georgy Voronoi* (a mathematician born in the Russian Empire in 1868), defined the Voronoi diagram (also called *Thiessen polygons* or *Dirichlet Tesselation* in honor of *Alfred Thiessen* and *Gustav Lejeune Dirichlet*) to find the answer to this kind of problems. It **associates all the helicopters to a polygonal cell** (called *Voronoi* cell or *Voronoi* region), where all points included are closer to one helicopter than the others. Using the Euclidean distance, if you keep your eyes on a particular helicopter (\(h_{i}\)), for each pair of helicopters (\(h_{i}\), \(h_{j}\)) \(i \neq j\), the points set which is closer to \(h_{i}\) than \(h_{j}\) is defined as:

\(H_{i}^{j} = \{x \in X \; | \; d\left(x, h_{i}\right) \leq d\left(x, h_{j}\right), \; i \neq j \}\)

\(h_{i} \in H_{i}^{j}; \; h_{j} \notin H_{i}^{j}\)

\(h_{i}\) Voronoi cell is the intersection of all half-planes where \(h_{i}\) is inside:

\(vor\left(h_{i}\right) = \bigcap\limits_{j=1;j \neq i}^{n} H_{i}^{j}\)

If you calculate all *Voronoi* cells (for each helicopter) you could discover which helicopter is the closest:

You can find a lot of *Voronoi diagrams* in nature like:

*Giraffe skin:*

*W**ings of a dragonfly:*

*Dry desert and more:*

## k-Means algorithm in finance

The* Voronoi diagram* **is used in some machine learning techniques like clustering**. In this post, you can learn how the *k-means algorithm* works (a clustering algorithm).

Given a *n*-dimensional points set (*two*-dimensional points set in our example), you need to define the number of classes (*k* classes) that you want to get. Using both things, the algorithm makes the following steps:

1. At first, there are two options to set initial points:

1.1. Choose \(k\) (two-dimensional) points.

1.2. Generate \(k\) random points.

2. Calculate *Voronoi* cells for the initial points.

3. Calculate midpoint of all points included in each region.

4. Calculate *Voronoi* cells for the midpoints.

5. Repeat steps 3 and 4 while midpoints are changing.

*K-means algorithm* could be applied to a financial time series. In finance it’s really important to define *which *kind of returns and trends are in a time serie. As a first approach, using weekly percentage returns of a financial time serie (like *Standard & Poor’s 500 Index*), you could show *two*-dimensional scatter plot (weekly returns vs shift one week weekly returns) using the following matrix (*x* axis is the first column and *y* axis is the second column):

\( \left( \begin{array}{cc} Week_{2} & Week_{1} \\ Week_{3} & Week_{2} \\ \cdots & \cdots \\ Week_{i} & Week_{i-1} \\ \cdots & \cdots \\ Week_{n} & Week_{n-1} \end{array} \right) \)

At this point you need to look how many classes you must get. My objective is** to cluster weekly returns in three kinds of return movements** called * Upward*,

*and*

**Downward***. If I only select these 3 classes, the results are not coherent. So I try it by selecting more classes:*

**No trend***,*

**Upward***,*

**Low Upward***,*

**No trend***and*

**Low Downward***. If I test the*

**Downward***k-means*

*algorithm*using both initial options and 5 classes:

* Option 1*: Predefined initial points. I test the k-means function using like five initial points the four quadrants middle point and axis intersection, for example:

\( \begin{array}{cccc} + & + & \longrightarrow & \text{Upward} \\ – & + & \longrightarrow & \text{Low upward} \\ + & – & \longrightarrow & \text{Low downward} \\ – & – & \longrightarrow & \text{Downward} \\ 0 & 0 & \longrightarrow & \text{No trend} \\\end{array}\)

* Option 2*: Random initial points. Using 5 random points, the centers set by

*k-means algorithm*seem to be worse:

The trend separation produced by predefined initial points is better than the separation produced by random initial points, because the final voronoi cells produced in *option 1* are more coherent than *option 2,* and it’s not necessary to look at the subsequent points to associate it to a movement. I associate the classes (produced by *k-means algorithm* using week_{i} return and week_{i-1} return) to week_{i} because I don’t want to use information early.

To finish the clustering, I only need to rename classes and colour it in *S&P500 Index *spot:

1. * Downward* =

*+*

**Downward**

**Low downward**2. * Upward* =

*+*

**Upward**

**Low Upward**Trends separation doesn’t look perfect, and the trends themselves are short. I need to keep in mind that I’m using only two consecutive returns in clustering. It’s therefore neccesary to leave you with this next question: