Outliers: Looking For A Needle In A Haystack

T. Fuertes

06/04/2016

Outliers are annoying. Analysis would be easier if they did not exist.

So why not remove them?

As libesa told us in her last post titled “Machine Learning: A Brief Breakdown”, the world is going crazy with Machine Learning, and now we use it in all domains. In this post, we will see another application of Machine Learning.

In Data Science we work with a great deal of data, but not all of it is valid. After all, many data are gathered by humans, and to err is human… Taking this into account, before drawing conclusions, we should discard data that can distort the results. That is, we should remove outliers. Machine Learning systems are very “clever” but we had better lend them a hand and not let them learn from unrealistic examples.

The idea

Let’s imagine we want to use a machine learning method that takes two characteristics as input. We have 100,000 data points to train the algorithm (that sounds good!). However, if some of the data is incorrect, the system could reach the wrong conclusion, even though we followed a rigorous process to create it (testing the model, splitting the sample correctly into training, cross-validation and test sets, preventing overfitting, etc.).

The more training data, the better. However, having hundreds and hundreds of values does not guarantee accuracy. If we find a strange value in a very large sample, it is hardly ever real, and we had better not use it. Moreover, a kind of data that shows up only once in the sample is of little use for learning, so it’s better to remove it.

Continuing with our example, the two characteristics are X1 and X2. If we plot them in a two-dimensional graph, the outliers can be detected with the naked eye. However, when we have more than three characteristics, it is no longer possible to plot them and so we cannot readily spot the outliers. So an automatic method to identify outliers is essential.

[Figure: example 2D]

The procedure to identify outliers

How does the method to identify outliers work? It’s actually quite simple, like most Machine Learning procedures.

First, we have to estimate the multivariate distribution of data. To simplify we will assume that this multivariate distribution is the Normal Distribution, so we only have to define the parameters that fit the sample (mean and covariance). Then we choose the training set and use it in order to establish the threshold that will determine whether there are outliers. That is the Machine Learning Process! We mix all these inputs in the Machine and, as if by magic, the sample outliers appear.
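The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not necessarily the exact procedure behind this post: it assumes two characteristics, a bivariate Normal fitted by the sample mean and covariance, and a threshold on the estimated density chosen by maximising the F1 score on a labelled set. The function names (`fit_gaussian_2d`, `density`, `choose_threshold`), the F1 criterion and the synthetic data are all assumptions made for the example.

```python
import math
import random

def fit_gaussian_2d(points):
    """Estimate the mean vector and 2x2 covariance of (x1, x2) pairs."""
    n = len(points)
    m1 = sum(x1 for x1, _ in points) / n
    m2 = sum(x2 for _, x2 in points) / n
    s11 = sum((x1 - m1) ** 2 for x1, _ in points) / n
    s22 = sum((x2 - m2) ** 2 for _, x2 in points) / n
    s12 = sum((x1 - m1) * (x2 - m2) for x1, x2 in points) / n
    return (m1, m2), (s11, s12, s22)

def density(point, mean, cov):
    """Bivariate Normal density evaluated at `point`."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 ** 2
    d1, d2 = point[0] - mean[0], point[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    maha = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))

def f1_score(flagged, labels):
    """F1 of the flags against the known labels (True = outlier)."""
    tp = sum(f and l for f, l in zip(flagged, labels))
    fp = sum(f and not l for f, l in zip(flagged, labels))
    fn = sum(l and not f for f, l in zip(flagged, labels))
    return 2.0 * tp / (2.0 * tp + fp + fn) if tp else 0.0

def choose_threshold(densities, labels):
    """Try every observed density as threshold and keep the best F1."""
    best_eps, best_f1 = 0.0, -1.0
    for eps in sorted(set(densities)):
        score = f1_score([d <= eps for d in densities], labels)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps

# Synthetic, made-up sample: 500 ordinary points plus 3 planted outliers.
rng = random.Random(42)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
labels = [False] * 500
points += [(10.0, 10.0), (-9.0, 8.0), (12.0, -11.0)]
labels += [True] * 3

mean, cov = fit_gaussian_2d(points)
dens = [density(p, mean, cov) for p in points]
eps = choose_threshold(dens, labels)
print([p for p, d in zip(points, dens) if d <= eps])
```

A point is flagged when its estimated density is at or below the threshold, so a lower threshold flags fewer points and a higher one flags more.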

[Figure: process]

The method used for financial data

Could this idea of cleaning data from outliers be used in other fields of interest? Absolutely!

Let’s go to the financial world where we can find unprocessed data that sometimes show unusual jumps. These are usually incorrect and they can distort our models’ results.

In financial data, when we say “outlier” we think “error”. This connection tells us that we identify outliers not to delete them, but to correct them.

We will follow the same procedure described before, but the variables we use now are characteristic of financial series. In particular, we use the daily and monthly returns. We obtain the multivariate distribution and apply it to the preselected training set, in which we have already located the outliers, in order to train the machine. This training lets us set the right threshold to identify the out-of-sample outliers.
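As a rough sketch of how the two variables could be built from a price series, the snippet below pairs each date’s daily return with a rolling “monthly” return. The 21-trading-day month, the use of simple (not logarithmic) returns and the helper names `simple_returns` and `return_features` are assumptions made for illustration; the post does not specify them.

```python
def simple_returns(prices, lag):
    """Return prices[t] / prices[t - lag] - 1 for every t >= lag."""
    return [prices[t] / prices[t - lag] - 1.0 for t in range(lag, len(prices))]

def return_features(prices, month=21):
    """Pair each date's 1-day return with its `month`-day return."""
    daily = simple_returns(prices, 1)        # first entry belongs to t = 1
    monthly = simple_returns(prices, month)  # first entry belongs to t = month
    # Drop the daily returns that have no monthly counterpart yet.
    return list(zip(daily[month - 1:], monthly))

# Made-up price path with a wrong print on one day (price doubled).
prices = [100.0 + 0.1 * t for t in range(40)]
prices[30] *= 2.0
for day, (d, m) in enumerate(return_features(prices), start=21):
    if abs(d) > 0.2:  # crude filter, used here only to show the jump
        print(day, round(d, 3), round(m, 3))
```

A single bad price shows up as two consecutive extreme daily returns (up, then back down), which is the kind of unusual jump the detector is meant to catch.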

[Figure: financial process]

Practical case

We apply the Machine Learning Process to 13,000 financial series, such as stocks, ETFs and funds. We use two different methods:

[Figure: methods]

  • Method 1: We apply the Machine Learning Process to each series separately. For each series we determine the distribution, we train the machine and we identify the outliers.
  • Method 2: We apply the Machine Learning Process to each type of series. For each type (stocks, ETFs or funds) we determine the distribution, we train the machine with a subset of all series in the group and we identify the outliers.
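The difference between the two methods is only in which data each model sees, which the following sketch tries to make concrete. To keep it short, the “model” is a stand-in (mean and standard deviation of a single return feature) rather than the multivariate distribution, and the series names, the data layout and the random choice of the training subset are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev

def fit(values):
    """Stand-in model: mean and standard deviation of one feature."""
    return mean(values), pstdev(values)

def fit_per_series(series):
    """Method 1: one model for each individual series."""
    return {name: fit(values) for name, (_, values) in series.items()}

def fit_per_type(series, n_train, seed=0):
    """Method 2: one model per asset type, fitted on a subset of its series."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for name, (asset_type, _) in series.items():
        by_type[asset_type].append(name)
    models = {}
    for asset_type, names in by_type.items():
        chosen = rng.sample(names, min(n_train, len(names)))
        pooled = [v for name in chosen for v in series[name][1]]
        models[asset_type] = fit(pooled)
    return models

# Hypothetical input: name -> (asset type, list of returns).
series = {
    "STOCK_A": ("stock", [0.01, -0.02, 0.015, 0.0]),
    "STOCK_B": ("stock", [0.03, -0.04, 0.02, 0.01]),
    "FUND_X": ("fund", [0.001, 0.002, -0.001, 0.0]),
}
print(fit_per_series(series))
print(fit_per_type(series, n_train=2))
```

With Method 1 every series gets its own parameters and threshold; with Method 2 a quiet series and a volatile one of the same type share a single model.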

I would like to say this Machine Learning technique is perfect, but unfortunately I can’t. Method 1 does recognise all the outliers in the sample; however, it also flags correct data as outliers. Method 2 is definitely not as good as Method 1, because we are mixing several kinds of series in order to identify the multivariate distribution and the threshold.
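One way to put numbers on this kind of behaviour is to compare the flagged points with the known errors using precision and recall. These metrics are not used in the test above; the sketch and its indices are made up purely to show the calculation.

```python
def precision_recall(flagged, actual):
    """Precision and recall of flagged indices against the true outliers."""
    true_pos = len(flagged & actual)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical indices: every real error is caught, plus two false alarms.
actual = {3, 17}
flagged = {3, 17, 20, 41}
print(precision_recall(flagged, actual))
```

A detector that finds every real error but also flags correct data has a recall of 1 and a precision below 1, which is the pattern described for Method 1.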

[Figure: Results Outliers]

To sum up

This is only a very simple test to learn more about the outlier detection algorithm and how it can be used to identify incorrect values in financial series. The method could likely be improved by tuning its parameters. Meanwhile, it can be used as a preliminary filter to find errors in financial data.
