We have talked about outliers several times in this blog, for example, how to detect them or how to transform the data to remove them. Here is another technique to detect outliers in a large data set: the isolation forest algorithm.
The idea behind the isolation forest method
The name of this technique comes from its main idea. The algorithm isolates each point in the data and classifies it as an outlier or an inlier, depending on how long it takes to separate it from the rest. Let me explain this idea further. If we try to isolate a point that is clearly not an outlier, it will have many points around it, so it will be hard to isolate. On the other hand, if the point is an outlier, it will stand alone and we will separate it very easily. We explain the isolation process in more detail below, so be patient.
An advantage of this algorithm is that it works well with huge data sets and with many dimensions. The dimensions refer to the different features in our data set; the data refers, of course, to each element of the data set.
How do we separate each point? For each point of the data set, the simplified procedure is as follows:
1. Select the point to isolate.
2. For each feature, initialize the range to isolate as the interval between that feature's minimum and maximum.
3. Choose a feature randomly.
4. Pick a value within the feature's current range, again at random:
   - If the point lies above the chosen value, raise the minimum of that feature's range to the value.
   - If the point lies below the chosen value, lower the maximum of that feature's range to the value.
5. Repeat steps 3 & 4 until the point is isolated, that is, until it is the only point inside the range for all features.
6. Count how many times you had to repeat steps 3 & 4. We call this quantity the isolation number.
The algorithm declares a point an outlier when its isolation number is small, that is, when steps 3 & 4 don't have to be repeated many times.
Note that the above pseudocode is a simplification of the real process, meant to make it easier to understand. Because the procedure uses random numbers, it is in fact run several times, and the final isolation number is a combination of all the isolation numbers obtained.
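The steps above can be sketched in Python. This is a minimal illustration of the procedure as described, not sklearn's actual implementation; the function name `isolation_number` and the tuple-based data layout are our own choices for the example.

```python
import random

def isolation_number(data, point, max_iters=1000):
    """Count how many random splits it takes to isolate `point`.

    `data` is a list of tuples, one tuple per observation.
    A sketch of the simplified procedure, not sklearn's algorithm.
    """
    n_features = len(point)
    # Step 2: start with the full [min, max] range of each feature.
    ranges = [
        (min(x[f] for x in data), max(x[f] for x in data))
        for f in range(n_features)
    ]

    def inside(x):
        return all(lo <= x[f] <= hi for f, (lo, hi) in enumerate(ranges))

    count = 0
    # Step 5: repeat until the point is alone inside the range box.
    while sum(inside(x) for x in data) > 1 and count < max_iters:
        f = random.randrange(n_features)      # step 3: random feature
        lo, hi = ranges[f]
        split = random.uniform(lo, hi)        # step 4: random split value
        if point[f] >= split:
            ranges[f] = (split, hi)           # point is above: raise the minimum
        else:
            ranges[f] = (lo, split)           # point is below: lower the maximum
        count += 1
    return count
```

Averaging `isolation_number` over many runs for a point far from the bulk of the data gives a noticeably smaller value than for a point inside a dense cluster, which is exactly the signal the algorithm exploits.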
The process in images
An image is said to be worth a thousand words, so here’s an illustration. We identify an outlier and an inlier of our data set, and we apply the previous procedure to both.
For the inlier, the steps have to be repeated 15 times, while the outlier's isolation number is only 8. Isolating an outlier takes fewer iterations than isolating an inlier.
A case study
Let's see how the isolation forest applies to a real data set, using Python's sklearn library. We need to set the percentage of data that we want to consider as outliers: we fix it at 5%.
Each point represents two fundamental metrics of a stock in the S&P 500. As the next illustration shows, most of the data is clustered together, so it's easy to see which points are the outliers and to check that the isolation forest algorithm works quite well.
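A minimal sketch of the sklearn call looks as follows. The S&P 500 data from the post isn't reproduced here, so we stand in a synthetic two-feature data set: a dense cluster plus a few scattered points; everything else mirrors the setup described above, with `contamination=0.05` matching the 5% we fixed.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Synthetic stand-in for the two stock features: a dense
# cluster of inliers plus a handful of scattered outliers.
inliers = rng.normal(loc=0.0, scale=1.0, size=(950, 2))
outliers = rng.uniform(low=-8.0, high=8.0, size=(50, 2))
X = np.vstack([inliers, outliers])

# contamination=0.05 tells the model to flag 5% of the points
# as outliers, the percentage fixed in the post.
model = IsolationForest(contamination=0.05, random_state=42)
labels = model.fit_predict(X)  # +1 = inlier, -1 = outlier

print("points flagged as outliers:", (labels == -1).sum())
```

Roughly 5% of the 1000 points end up flagged, and they are overwhelmingly the scattered ones, which is the behavior we observe on the stock data.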
Now, enjoy trying it for yourself.