**Outliers** are annoying. The analysis would be easier if they did not exist. So why not remove them?

As libesa told us in her last post, “Machine Learning: A Brief Breakdown”, the world is going crazy about Machine Learning, and it is now used in all kinds of domains. In this post, we will see another **application of Machine Learning**.

In Data Science we work with a great deal of data, but not all of it is valid. After all, much of it is gathered by humans, and *to err is human*… With this in mind, before drawing conclusions we should discard the data that can distort the results; that is, we should get rid of the outliers. **Machine Learning Systems are very “clever”**, but we had better lend them a hand and not let them learn from unrealistic examples.

## The idea

Let’s imagine we want to use a machine learning method that takes two characteristics as input. The training set contains 100,000 observations (that sounds good!). However, **if some of the data is incorrect, the system’s conclusions could be wrong**, even if we followed a rigorous process to build it (testing the model, splitting the sample correctly into training, cross-validation and test sets, preventing overfitting, etc.).

The more training data, the better. However, having hundreds and hundreds of values does not imply that all of them are accurate. If we find a strange value in a very large sample, it is hardly ever real, and we had better not use it. Moreover, if we only have one kind of data, it is useless and it is better to remove it.

Going on with our example, the two characteristics are X1 and X2. If we plot one against the other in a two-dimensional graph, the outliers can be detected with the naked eye. However, when we have more than three characteristics, we can no longer plot them all at once, so we cannot readily identify the outliers. So **a method to identify outliers automatically is essential**.

## The procedure to identify outliers

How does the method to identify outliers work? It is really easy, like every Machine Learning procedure.

First of all, we have to estimate the multivariate distribution of the data. To simplify, we will assume that it is a multivariate Normal distribution, so we only have to estimate the parameters that fit the sample (mean and covariance). Then we choose the training set and use it to establish the threshold that determines whether an observation is an outlier. That is the Machine Learning Process! We feed all these inputs into the Machine and, as if by magic, the sample outliers appear.
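The steps above can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions that the post does not spell out: the threshold is applied to the estimated Normal density (points with a density below it are flagged), it is picked by maximising the F1 score on a labelled set, and the data is synthetic. The function names (`fit_gaussian`, `gaussian_density`, `choose_threshold`) are made up for this example.

```python
import numpy as np


def fit_gaussian(X):
    """Estimate the mean vector and covariance matrix of the rows of X."""
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    return mu, sigma


def gaussian_density(X, mu, sigma):
    """Multivariate Normal density evaluated at each row of X."""
    k = mu.size
    diff = X - mu
    inv = np.linalg.inv(sigma)
    # Squared Mahalanobis distance of every row from the mean.
    d2 = np.einsum("ij,jk,ik->i", diff, inv, diff)
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(sigma))
    return np.exp(-0.5 * d2) / norm


def choose_threshold(p, is_outlier, n_steps=1000):
    """Pick the density threshold that maximises F1 on labelled points.

    Assumption: F1 is the selection criterion; the post does not name one.
    """
    best_eps, best_f1 = 0.0, -1.0
    for eps in np.linspace(p.min(), p.max(), n_steps):
        flagged = p < eps
        tp = np.sum(flagged & is_outlier)
        fp = np.sum(flagged & ~is_outlier)
        fn = np.sum(~flagged & is_outlier)
        if tp == 0:
            continue  # nothing correctly flagged, F1 is undefined or zero
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1


# Synthetic illustration only: two characteristics, a clean sample to fit
# the distribution, and a labelled sample with five planted outliers.
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(1000, 2))
planted = [[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0], [7.0, -8.0], [-8.0, -9.0]]
X_val = np.vstack([rng.normal(size=(500, 2)), planted])
y_val = np.array([False] * 500 + [True] * 5)

mu, sigma = fit_gaussian(X_fit)
p_val = gaussian_density(X_val, mu, sigma)
eps, f1 = choose_threshold(p_val, y_val)
outliers = X_val[p_val < eps]
```

Once `eps` has been chosen, new observations can be scored with `gaussian_density` and flagged with the same comparison.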

## The method used for financial data

Could this idea of cleaning outliers from data be used in other fields of interest? It could, with a little imagination!

Let us go to the **financial world**, where we find unprocessed data that sometimes shows unusual jumps. These jumps are usually errors, and they can distort our models’ results.

In financial data, when we say “outlier” we think “error”. This means that **we do not identify outliers to delete them**, but to correct them.

We follow the same procedure described above, but the variables we use now are characteristic of financial series; in particular, we use the daily and monthly returns. We estimate the multivariate distribution and apply it to a preselected training set, in which we have already located the outliers, in order to train the machine. This training lets us set the right threshold to identify the out-of-sample outliers.
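As a rough sketch of how the two variables could be built from a daily price series: the helper below, `return_features`, is a hypothetical name, and it approximates a month by 21 trading days, which is an assumption of this example and not something the post specifies. The price series and the planted error are synthetic.

```python
import numpy as np


def return_features(prices, month=21):
    """Stack daily and `month`-day returns into an (n, 2) feature matrix."""
    prices = np.asarray(prices, dtype=float)
    daily = prices[1:] / prices[:-1] - 1.0
    monthly = prices[month:] / prices[:-month] - 1.0
    # Align both columns on the same dates: the first `month` prices have
    # no monthly return, so the matching daily returns are dropped.
    return np.column_stack([daily[month - 1:], monthly])


# Synthetic illustration: a smooth price path with one planted data error
# (a price multiplied by 10, as if a decimal point had been misplaced).
prices = 100.0 * 1.001 ** np.arange(300)
prices[150] *= 10.0
features = return_features(prices)

# Row i of `features` corresponds to the price at index i + 21, so the
# planted error shows up in row 150 - 21 = 129 as a huge daily return.
jump_row = features[129]
```

Each row of `features` can then play the role of one observation (X1, X2) in the procedure described earlier.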

## Practical case

We apply the Machine Learning Process to 13,000 financial series, such as stocks, ETFs and funds. We use two different methods:

- Method 1: We apply the Machine Learning Process to each series separately. For each series we determine the distribution, we train the machine and we identify the outliers.
- Method 2: We apply the Machine Learning Process to each type of series. For each type (stocks, ETFs or funds) we determine the distribution, we train the machine with a subset of all series in the group and we identify the outliers.
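The difference between the two methods can be sketched as follows. This is a simplified illustration, not the code behind the test: the threshold is a fixed placeholder rather than a trained one, it is applied to the squared Mahalanobis distance (which ranks points exactly like the Normal density for a given fit), Method 2 pools every series of a type instead of a subset, and all names and data are invented for the example.

```python
import numpy as np


def mahalanobis_sq(X, mu, sigma):
    """Squared Mahalanobis distance of each row of X from mu."""
    diff = X - mu
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(sigma), diff)


def flag_per_series(features, threshold):
    """Method 1: fit one Normal distribution per series."""
    flags = {}
    for name, X in features.items():
        mu, sigma = X.mean(axis=0), np.cov(X, rowvar=False)
        flags[name] = mahalanobis_sq(X, mu, sigma) > threshold
    return flags


def flag_per_type(features, series_type, threshold):
    """Method 2: fit one Normal distribution per type on pooled series."""
    flags = {}
    for kind in set(series_type.values()):
        names = [n for n in features if series_type[n] == kind]
        pooled = np.vstack([features[n] for n in names])
        mu, sigma = pooled.mean(axis=0), np.cov(pooled, rowvar=False)
        for n in names:
            flags[n] = mahalanobis_sq(features[n], mu, sigma) > threshold
    return flags


# Synthetic illustration: two series of the same type with very
# different volatility, and one planted jump in the calm one.
rng = np.random.default_rng(1)
features = {
    "calm": rng.normal(scale=0.01, size=(500, 2)),
    "wild": rng.normal(scale=0.05, size=(500, 2)),
}
features["calm"][100] = [0.08, 0.08]
series_type = {"calm": "stock", "wild": "stock"}

THRESHOLD = 20.0  # arbitrary placeholder, not a trained value
method1 = flag_per_series(features, THRESHOLD)
method2 = flag_per_type(features, series_type, THRESHOLD)
```

In this toy setup the pooled fit is dominated by the more volatile series, so a jump that is extreme for the calm series can go unnoticed under Method 2 while Method 1 flags it.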

I would like to say this Machine Learning technique is perfect, but unfortunately I can’t. Method 1 does recognise all the outliers in the sample; however, it also flags some correct data as outliers. Method 2 is definitely not as good as Method 1, because we are mixing several kinds of series when estimating the multivariate distribution and the threshold.
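One way to put numbers on this kind of behaviour is to compare the flags with known labels and compute precision and recall. The sketch below uses made-up flags and labels purely for illustration; they are not results from the test described here, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(flagged, is_outlier):
    """Precision and recall of outlier flags against known labels."""
    pairs = list(zip(flagged, is_outlier))
    tp = sum(f and o for f, o in pairs)        # real outliers that were flagged
    fp = sum(f and not o for f, o in pairs)    # correct data flagged by mistake
    fn = sum(o and not f for f, o in pairs)    # real outliers that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: every real outlier is flagged (recall 1.0), but half
# of the flagged points are actually correct data (precision 0.5).
flagged = [True, True, True, False, False, True]
is_outlier = [True, False, True, False, False, False]
precision, recall = precision_recall(flagged, is_outlier)
```

A detector that catches every outlier but also flags correct data has high recall and lower precision, which is the pattern described for Method 1.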

## To sum up

This is only a very simple test to learn more about outlier detection algorithms and how they can be used to identify incorrect values in financial series. This method could probably be improved by setting better parameters. Meanwhile, **the method can be used as a preliminary filter** to find errors in financial data.