This blog has previously covered various approaches to identifying outliers. In this post, we explore a different use case and apply a different method.
For this purpose, we have new sets of time series grouped by category. Although we know they are well classified, meaning that every series in a group belongs to that category, we need to determine whether any series within a category is an outlier. To solve this problem, we use different Machine Learning models to decide whether all the time series in a category follow the same pattern or whether some of them are outliers.
To train a model for future prediction, we have 20,595 financial series available, belonging to 133 different categories. The series span 3 years, from 1 January 2020 to 31 December 2022. Since some categories contain very few series, we only keep categories with at least 3 financial series. As the goal is to detect which series are outliers, the training categories need to contain outliers. To achieve this, we manually introduce financial series that do not follow the pattern of each category, so that they act as the labelled outliers on which the model is trained.
Before building the model, we need to create the features used for training. We calculate them by grouping the financial series by category and performing the calculations explained below. First, we create a distance matrix based on the correlation between the financial time series, from which we extract the statistics that form part of our features. From this distance matrix, we create the following features:
- Mean: for each financial series, calculate the average distance to the rest of the series within each category. We will name this feature “mean_dist”.
- Standard deviation: for each financial series, calculate the standard deviation of the distances to the rest of the series grouped by categories. We will name this feature “std_dist”.
- Quartiles: calculate the first, second, and third quartiles of the distances for each financial series. We will name these features “Q1”, “Q2” and “Q3” respectively.
- Number of close series: for each category and each financial series, calculate the number of series that are close to it. This calculation is based on the mean, by subtracting each mean from the mean of the current financial series. We will consider series with differences smaller than 0.1 as close series. Therefore, this new feature called “n_series_close” will represent the total number of series with differences smaller than 0.1.
- Number of series: count the number of financial series that belong to each category. We will name this feature “n_series”.
- Category level: encode each category as a number. We will name this feature “category_level”.
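The distance-based features above can be sketched for a single category as follows. This is a minimal illustration, not the exact implementation used here: the distance is assumed to be `1 - correlation`, closeness is assumed to compare absolute differences of `mean_dist` and to exclude the series itself, and the function name `category_features` and the synthetic data are hypothetical.

```python
import numpy as np


def category_features(series, close_threshold=0.1):
    """Compute distance-based features for the series of ONE category.

    series: array of shape (n_observations, n_series), one column per series.
    """
    corr = np.corrcoef(series, rowvar=False)
    dist = 1.0 - corr  # ASSUMPTION: correlation distance defined as 1 - corr
    n_series = dist.shape[0]

    # Keep only the distances to the OTHER series (drop the zero diagonal).
    others = dist[~np.eye(n_series, dtype=bool)].reshape(n_series, n_series - 1)

    mean_dist = others.mean(axis=1)
    std_dist = others.std(axis=1)
    q1, q2, q3 = np.percentile(others, [25, 50, 75], axis=1)

    # ASSUMPTION: two series are "close" when their mean_dist values differ
    # by less than the threshold in absolute value; the series itself is
    # not counted.
    gaps = np.abs(mean_dist[:, None] - mean_dist[None, :])
    n_series_close = (gaps < close_threshold).sum(axis=1) - 1

    return {
        "mean_dist": mean_dist,
        "std_dist": std_dist,
        "Q1": q1,
        "Q2": q2,
        "Q3": q3,
        "n_series_close": n_series_close,
        "n_series": np.full(n_series, n_series),
    }


# Synthetic category: three series sharing a pattern plus one unrelated series.
rng = np.random.default_rng(0)
base = rng.normal(size=250)
similar = np.column_stack([base + 0.1 * rng.normal(size=250) for _ in range(3)])
stray = rng.normal(size=(250, 1))
features = category_features(np.hstack([similar, stray]))
```

In this toy category the unrelated series ends up with the largest `mean_dist` and no close neighbours, which is the kind of signal the models are meant to pick up. The `category_level` feature is not shown, since it depends on how the categories are encoded.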
Finally, we scale the features described above for the training set.
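As a rough sketch of the scaling step, assuming standardization with scikit-learn's `StandardScaler` (the scaler actually used is not stated here) and random placeholder data in place of the real feature table:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrices; the real ones would hold the features above.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_new = rng.normal(loc=5.0, scale=2.0, size=(10, 3))

# Fit the scaler on the training set only, then reuse the same
# parameters for any data scored later.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_new_scaled = scaler.transform(X_new)
```

Fitting on the training set and only transforming new data keeps information from unseen series out of the scaling parameters.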
Which model to choose?
On the one hand, we have tested the K-nearest neighbors (KNN) model. KNN is an instance-based classification algorithm. In this approach, we represent the training instances as points in a multidimensional space and classify new instances based on the majority class among the k nearest instances to them. The value of k represents the number of nearest neighbors considered when making a classification decision. The KNN model is easy to understand and implement, but it can become computationally expensive for large datasets and can be affected by the presence of irrelevant or noisy features.
On the other hand, we have tested the Random Forest model. Random Forest is a supervised learning algorithm based on decision trees. Instead of a single decision tree, Random Forest constructs multiple decision trees and combines their results to obtain a more robust prediction. We train each decision tree on a random sample of the dataset and use a random selection of features at each split. Then, we combine the predictions from the individual trees through voting to obtain the final prediction. Random Forest is known for its ability to handle irrelevant features, cope well with large datasets, and generalize well.
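A minimal KNN sketch with scikit-learn's `KNeighborsClassifier` is shown here. The two synthetic features (loosely standing in for `mean_dist` and `std_dist`), the labels (1 for outlier, 0 for normal) and the choice of `n_neighbors=5` are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic training set: 60 normal series and 10 injected outliers,
# each described by two hypothetical features.
rng = np.random.default_rng(42)
normal = rng.normal(loc=[0.2, 0.05], scale=0.03, size=(60, 2))
outliers = rng.normal(loc=[0.9, 0.30], scale=0.03, size=(10, 2))
X = np.vstack([normal, outliers])
y = np.array([0] * 60 + [1] * 10)  # 1 = outlier, 0 = normal

# Each new point takes the majority label of its 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

knn_pred = knn.predict([[0.25, 0.06], [0.85, 0.28]])
```

The first query point sits inside the normal cluster and the second inside the outlier cluster, so they are labelled 0 and 1 respectively. Because KNN relies on distances, it benefits from the scaling step described earlier.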
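The same kind of toy problem can be sketched with scikit-learn's `RandomForestClassifier`. As before, the synthetic features, the labels and the hyperparameters (`n_estimators=100`, `random_state=0`) are illustrative assumptions rather than the configuration used for the results discussed here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic training set: 60 normal series and 10 injected outliers,
# each described by two hypothetical features.
rng = np.random.default_rng(42)
normal = rng.normal(loc=[0.2, 0.05], scale=0.03, size=(60, 2))
outliers = rng.normal(loc=[0.9, 0.30], scale=0.03, size=(10, 2))
X = np.vstack([normal, outliers])
y = np.array([0] * 60 + [1] * 10)  # 1 = outlier, 0 = normal

# Each tree is trained on a bootstrap sample and considers a random
# subset of features at each split; the forest aggregates the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

rf_pred = forest.predict([[0.25, 0.06], [0.85, 0.28]])
importances = forest.feature_importances_  # one value per feature
```

The `feature_importances_` attribute gives a first indication of which features drive the predictions, which partly offsets the lower interpretability of the ensemble.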
The main differences between the two models are as follows:
- KNN is simple and easy to understand, but it can be computationally expensive and sensitive to irrelevant features.
- Random Forest is more robust and efficient, and it can handle irrelevant features. However, it can be less interpretable because it combines multiple decision trees.
We now examine an example to see which financial series each model identifies as outliers.
To sum up
We have demonstrated a way to detect outliers in financial series. Based on the images, we have concluded that the Random Forest model performs better in this case. There is certainly room for further improvement, such as combining both models or fine-tuning the parameters. In the meantime, you can use this method as a preliminary filter to identify outliers in financial data.