Outliers are annoying. Analysis would be easier if they did not exist.
So why not remove them?
As Libesa told us in her last post titled “Machine Learning: A Brief Breakdown”, the world is going crazy with Machine Learning, and now we use it in all domains. In this post, we will see another application of Machine Learning.
In Data Science we work with a great deal of data, but not all of it is valid. After all, many data are gathered by humans, and to err is human… Taking this into account, before drawing conclusions, we should discard data that can distort the results. That is, we should remove outliers. Machine Learning systems are very “clever” but we had better lend them a hand and not let them learn from unrealistic examples.
The idea
Let’s imagine we want to use a machine learning method that works with two characteristics (features). The training set contains 100,000 observations (that sounds good!). However, if some of the data is incorrect, the conclusions of the system could be wrong, even though we followed a rigorous process to create it (testing the model, splitting the sample correctly into training, cross-validation and test sets, preventing overfitting, etc.).
The more training data, the better. However, having hundreds and hundreds of values does not imply complete accuracy. If we find a strange value in a very large sample, it is hardly ever real, and we had better not use it. Moreover, if we only have one kind of data, it is useless and it’s better to remove it.
Continuing with our example, the two characteristics are X1 and X2. If we plotted them in a two-dimensional graph, we could easily spot the outliers with the naked eye. However, when we have more than three characteristics, it is no longer possible to plot them, so we cannot readily identify the outliers. An automatic method to identify outliers is therefore essential.
The procedure to identify outliers
How does the method to identify outliers work? It’s actually really easy, like all Machine Learning procedures.
First, we have to estimate the multivariate distribution of the data. To simplify, we will assume that this multivariate distribution is the Normal Distribution, so we only have to estimate the parameters that fit the sample (mean and covariance). Then we choose the training set and use it to establish the threshold that determines whether a point is an outlier. That is the Machine Learning Process! We mix all these inputs in the Machine and, as if by magic, the sample outliers appear.
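The steps above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration under several assumptions that are not in the post: the function names are hypothetical, the demo data is synthetic, the training set is assumed to carry labels marking the known outliers, and the threshold is chosen by maximising the F1 score, which is one common criterion rather than necessarily the one used here.

```python
import numpy as np

def fit_gaussian(X):
    """Estimate the mean vector and covariance matrix from the rows of X.

    Assumes X has shape (n_observations, n_features) with n_features >= 2.
    """
    mu = X.mean(axis=0)
    sigma = np.cov(X, rowvar=False)
    return mu, sigma

def log_density(X, mu, sigma):
    """Log of the multivariate normal density at each row of X."""
    d = X.shape[1]
    diff = X - mu
    # Solve a linear system instead of inverting sigma explicitly.
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

def choose_threshold(log_p, is_outlier):
    """Pick the log-density cut-off with the best F1 on labelled data.

    The F1 criterion and the candidate grid are assumptions of this sketch.
    """
    best_eps, best_f1 = None, -1.0
    for eps in np.quantile(log_p, np.linspace(0.0, 0.2, 201)):
        flagged = log_p < eps
        tp = np.sum(flagged & is_outlier)
        fp = np.sum(flagged & ~is_outlier)
        fn = np.sum(~flagged & is_outlier)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Synthetic demo data, made up purely for illustration.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(1000, 2))
outliers = rng.uniform(6.0, 8.0, size=(10, 2))
X = np.vstack([inliers, outliers])
is_outlier = np.r_[np.zeros(1000, dtype=bool), np.ones(10, dtype=bool)]

mu, sigma = fit_gaussian(X)
log_p = log_density(X, mu, sigma)
eps, f1 = choose_threshold(log_p, is_outlier)
flagged = log_p < eps
print(f"threshold={eps:.2f}  F1={f1:.2f}  flagged={flagged.sum()}")
```

Any new point whose log-density falls below `eps` would then be treated as an outlier. Working with log-densities rather than raw densities avoids numerical underflow when a point lies far from the mean.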
The method used for financial data
Could this idea of cleaning data from outliers be used in other fields of interest? Absolutely!
Let’s go to the financial world where we can find unprocessed data that sometimes show unusual jumps. These are usually incorrect and they can distort our models’ results.
In financial data, when we say “outlier” we think “error”. This connection tells us that we identify outliers not to delete them, but to correct them.
We will follow the same procedure described before, but the variables we use now are characteristic of financial series. In particular, we use the daily and monthly returns. We obtain the multivariate distribution and apply it to a preselected training set, in which we have already located the outliers, in order to train the machine. This training lets us set the right threshold to identify the out-of-sample outliers.
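As a rough illustration of the two variables, the sketch below builds a daily and a monthly return for each date from a price series. The post does not say how the returns are computed, so everything here is an assumption: simple (not logarithmic) returns, a rolling window of 21 trading days standing in for a month, and the hypothetical function name `return_features`.

```python
import numpy as np

def return_features(prices, month_len=21):
    """Return an (n, 2) array of [daily return, monthly return] per date.

    month_len=21 trading days is an assumed proxy for one month.
    """
    prices = np.asarray(prices, dtype=float)
    if len(prices) <= month_len:
        raise ValueError("need more than month_len prices")
    daily = prices[1:] / prices[:-1] - 1.0
    monthly = prices[month_len:] / prices[:-month_len] - 1.0
    # Drop the first daily returns so both columns end on the same date.
    return np.column_stack([daily[month_len - 1:], monthly])

# Synthetic price series growing 1% a day, for illustration only.
prices = 100.0 * 1.01 ** np.arange(30)
features = return_features(prices)
print(features.shape)
```

Each row of `features` plays the role of one observation (X1, X2) in the procedure described earlier. A wrongly recorded price would show up as an unusually large daily return on that date.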
Practical case
We apply the Machine Learning Process to 13,000 financial series, such as stocks, ETFs and funds. We use two different methods:
- Method 1: We apply the Machine Learning Process to each series separately. For each series we determine the distribution, we train the machine and we identify the outliers.
- Method 2: We apply the Machine Learning Process to each type of series. For each type (stocks, ETFs or funds) we determine the distribution, we train the machine with a subset of all series in the group and we identify the outliers.
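The difference between the two methods can be sketched as follows. This is a simplified illustration, not the actual setup: the series are synthetic, the names `method_1` and `method_2` are hypothetical, and a fixed cut-off on the squared Mahalanobis distance stands in for the trained threshold (for a given Gaussian, a cut-off on this distance is equivalent to a cut-off on the density).

```python
import numpy as np

def fit(features):
    """Mean and covariance of an (n_obs, n_features) array."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def distance_sq(features, mu, sigma):
    """Squared Mahalanobis distance of each row from the fitted Gaussian."""
    diff = features - mu
    return np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)

def method_1(series, cutoff):
    """One distribution per series, fitted on that series alone."""
    return {name: distance_sq(x, *fit(x)) > cutoff for name, x in series.items()}

def method_2(series, kinds, train_names, cutoff):
    """One distribution per type, fitted on a training subset of that type.

    Every type needs at least one series listed in train_names.
    """
    flags = {}
    for kind in set(kinds.values()):
        pooled = np.vstack([series[n] for n in train_names if kinds[n] == kind])
        mu, sigma = fit(pooled)
        for name, x in series.items():
            if kinds[name] == kind:
                flags[name] = distance_sq(x, mu, sigma) > cutoff
    return flags

# Synthetic return features with different volatilities, for illustration only.
rng = np.random.default_rng(1)
series = {
    "stock_a": rng.normal(0.0, 0.01, size=(250, 2)),
    "stock_b": rng.normal(0.0, 0.03, size=(250, 2)),
    "fund_a": rng.normal(0.0, 0.002, size=(250, 2)),
}
kinds = {"stock_a": "stock", "stock_b": "stock", "fund_a": "fund"}

# Assumed cut-off: the 99.9% point of a chi-square with 2 degrees of freedom.
cutoff = -2.0 * np.log(0.001)

flags_1 = method_1(series, cutoff)
flags_2 = method_2(series, kinds, train_names=["stock_a", "fund_a"], cutoff=cutoff)
for name in series:
    print(name, int(flags_1[name].sum()), int(flags_2[name].sum()))
```

In this synthetic setup, the more volatile `stock_b` is judged against a distribution fitted on the calmer `stock_a`, so the per-type approach flags many of its ordinary observations. That is one way pooling dissimilar series can go wrong.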
I would like to say this Machine Learning technique is perfect, but unfortunately I can’t. Method 1 does recognise all the outliers in the sample; however, it also flags some correct data as outliers. Method 2 is definitely not as good as Method 1, because we are mixing several kinds of series in order to estimate the multivariate distribution and the threshold.
To sum up
This is only a very simple test to learn more about the outlier detection algorithm and how it can be used to identify incorrect values in financial series. The method could likely be improved by choosing better parameters. Meanwhile, it can be used as a preliminary filter to find errors in financial data.