Anyone who works with large datasets for Machine Learning knows that it is vital to keep the data clean and to watch out for anomalous entries. Outliers are a particular pain in the neck because they can distort results and lead to misleading conclusions. Several methods can be used to remove outliers from the data, but this post focuses on an unsupervised Machine Learning technique: the autoencoder, a kind of neural network.
In this blog we have already seen several ways to detect outliers with Machine Learning techniques, but here we describe a method that uses neural networks. This blog also contains explanations of neural networks and several examples of their use, and I encourage you to dig into those posts for the full picture of what has been published here.
An autoencoder is an unsupervised artificial neural network. It first compresses the original data into a short code, discarding noise. Then it decompresses that code to reconstruct an output as close as possible to the original input.
How does it work?
The mechanism is based on three steps:
- The encoder. In this phase, the autoencoder compresses the data, discarding unnecessary information.
- The decoder. In this step, the algorithm learns to reconstruct the data from the compressed code, keeping the output as similar as possible to the original input.
- The loss function. It measures the error produced by the decoder so that the training process can correct it.
This unsupervised training is repeated until the error between the original data and the decoded data is minimised. This error is called the reconstruction error.
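The three steps above can be sketched in a few lines of NumPy. This is a minimal illustrative example, not the network used later in the post: for brevity the encoder and decoder are purely linear and have no biases, whereas real autoencoders usually stack nonlinear layers. The encoder compresses 4-dimensional data into a 2-dimensional code, the decoder reconstructs it, and gradient descent on the mean squared error plays the role of the loss function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples in 4 dimensions lying close to a 2-D subspace.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 4))

# Encoder and decoder weights (linear, no biases, for brevity).
W_enc = rng.normal(scale=0.3, size=(4, 2))   # encoder: compresses 4 -> 2
W_dec = rng.normal(scale=0.3, size=(2, 4))   # decoder: reconstructs 2 -> 4

lr = 0.02
for _ in range(5000):
    code = X @ W_enc                  # encoder step: compress
    X_hat = code @ W_dec              # decoder step: reconstruct
    err = X_hat - X                   # signed reconstruction residual
    # Gradients of the mean squared error (the loss function).
    grad_dec = code.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# Reconstruction error after training: small, because the data really
# does live near a 2-D subspace.
mse = np.mean((X - (X @ W_enc) @ W_dec) ** 2)
```

Because the data was built to lie near a 2-dimensional subspace, the 2-unit code is enough to drive the reconstruction error close to the noise level.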
Uses of autoencoder
Having seen how an autoencoder works, it is natural to think that its main use is as a dimensionality reduction technique, and that is right. In fact, if there is only a single hidden layer, the optimal solution of an autoencoder is strongly related to the solution of Principal Component Analysis (PCA). The autoencoder weights, however, are not exactly equal to the principal components.
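The PCA connection can be made concrete. In the sketch below (an illustration under the assumption of a linear, single-hidden-layer autoencoder), PCA is computed via the SVD, and projecting onto the top-k principal components is exactly the best possible linear "encode then decode" with a k-dimensional code, i.e. the lower bound on the reconstruction error such an autoencoder can reach:

```python
import numpy as np

rng = np.random.default_rng(1)

# Data close to a 2-D subspace of 6-D space.
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6))
X += 0.05 * rng.normal(size=X.shape)
X = X - X.mean(axis=0)                  # PCA assumes centred data

# PCA via SVD: the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
P = Vt[:k].T                            # top-k principal components, (6, k)

# Projecting onto the principal subspace and mapping back is the optimal
# linear encode -> decode with a k-dimensional code.
X_pca = (X @ P) @ P.T
pca_mse = np.mean((X - X_pca) ** 2)

# A trained linear autoencoder with a k-unit hidden layer can at best
# match pca_mse; its learned weights span the same subspace as P but
# need not coincide with the principal components themselves.
```

This is why the learned weights "span the principal subspace" without being the principal components: any invertible change of basis inside the code space leaves the reconstruction unchanged.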
Another use of the autoencoder is as a technique to detect outliers. Notice that outliers are observations that “stand out” from the rest of a dataset. If the model is trained on a given dataset, outliers will have a higher reconstruction error, so they will be easy to detect with this neural network.
Let’s see a toy example of the autoencoder as an outlier detector. Imagine we have a dataset of more than 7000 observations. We split the sample into a training set with 80% of the entries and a test set with the remaining 20%. We make sure that the training set contains no outliers, so the neural network trains only on inliers. Meanwhile, there are exactly 17 outliers in the test set.
The goal is to find the outliers in the test set after training on the training set. The autoencoder detects 9 of the 17 real outliers. There are no false positives, although there are false negatives, since some of the outliers are not found.
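The detection recipe behind an experiment like this can be sketched end to end. The snippet below uses synthetic stand-in data (the sizes, the 2-D structure, and the 99th-percentile threshold are all assumptions for illustration, not the post's actual dataset or model), and again a small linear autoencoder for brevity: train on inliers only, score every test point by its reconstruction error, and flag points whose error exceeds a threshold derived from the training errors.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data: inliers live near a 2-D subspace of 5-D space;
# outliers are drawn far from it. 17 outliers, mirroring the post.
mixing = rng.normal(size=(2, 5))

def make_inliers(n):
    return rng.normal(size=(n, 2)) @ mixing + 0.05 * rng.normal(size=(n, 5))

X_train = make_inliers(800)                        # inliers only
X_test = np.vstack([make_inliers(183), 3.0 * rng.normal(size=(17, 5))])
is_outlier = np.array([False] * 183 + [True] * 17)

# Train a small linear autoencoder (5 -> 2 -> 5) on the inliers.
W_enc = rng.normal(scale=0.3, size=(5, 2))
W_dec = rng.normal(scale=0.3, size=(2, 5))
for _ in range(6000):
    code = X_train @ W_enc
    err = code @ W_dec - X_train                   # reconstruction residual
    W_dec -= 0.01 * code.T @ err / len(X_train)
    W_enc -= 0.01 * X_train.T @ (err @ W_dec.T) / len(X_train)

def reconstruction_error(X):
    """Per-sample mean squared reconstruction error."""
    return np.mean((X - (X @ W_enc) @ W_dec) ** 2, axis=1)

# Flag test points whose error exceeds the 99th percentile of the
# training-set errors (the threshold choice is an assumption here).
threshold = np.quantile(reconstruction_error(X_train), 0.99)
flagged = reconstruction_error(X_test) > threshold
```

The threshold is the key design choice: raising it trades false positives for false negatives, which is how a run can end up, as in the post's experiment, missing some true outliers while flagging no inliers.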
This is a useful way to put neural networks to work. In this post we have tried the autoencoder as an outlier detector, although that is not its main use. I would like to know whether this Machine Learning technique is useful for you in this context as well, or whether you find it more powerful as a dimensionality reduction method.