Random forest is one of the most well-known ensemble methods for good reason: it is a substantial improvement on simple decision trees. In this post, I’m going to explain how to build a random forest from simple decision trees and test whether it actually improves on the original algorithm.
If you first need to know more about simple decision trees, take a look at my previous post. And if you would rather read in Spanish, you can find a translation of this post here.
As in any other supervised learning method, the starting point is a set of features or attributes and, on the other side, the set of labels or classes that we would like to explain:
What is a random forest?
Random forest is a method that combines a large number of independent trees, each trained on a random, identically distributed subsample of the data.
How to build a random forest
The learning stage consists of creating many independent decision trees from slightly different input data:
- The initial input data is randomly subsampled with replacement.
This resampling step is what the bagging ensemble method consists of. However, random forests usually include a second level of randomness, this time subsampling the features:
- When optimising each node partition, we will only take into account a random subsample of the attributes.
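As a rough illustration, here is a minimal sketch of this learning stage in Python. The post does not name a library, so I assume scikit-learn’s DecisionTreeClassifier as the base tree; the function fit_random_forest and its parameters are hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, n_trees=1000, seed=0):
    """Grow n_trees independent trees, each on a bootstrap sample of the rows,
    with a random subset of the attributes considered at every node split."""
    rng = np.random.RandomState(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Level 1 (bagging): subsample the input data randomly with replacement.
        rows = rng.randint(0, n_samples, size=n_samples)
        # Level 2: max_features="sqrt" makes each node split consider only a
        # random subsample of the attributes.
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=rng.randint(2**31 - 1))
        tree.fit(X[rows], y[rows])
        trees.append(tree)
    return trees
```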
Once a large number of trees have been built, around 1000 for example, the classification stage works like this:
- All trees are evaluated independently, and their outputs are averaged to compute the forest estimate. The probability that a given input belongs to a given class is interpreted as the proportion of trees that classify that input as a member of that class.
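Continuing the sketch above, the classification stage could look like this; the helper name forest_predict_proba is hypothetical:

```python
import numpy as np

def forest_predict_proba(trees, X):
    """Evaluate every tree independently and return, for each input row,
    the proportion of trees that vote for class 1 (the 'up' class)."""
    votes = np.stack([tree.predict(X) for tree in trees])  # shape: (n_trees, n_samples)
    return (votes == 1).mean(axis=0)
```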
What are the advantages of a random forest over a tree?
Stability. Random forests suffer less overfitting to a particular data set than simple trees.
Random forest versus simple tree. Test 1:
We have designed two trading systems. The first system uses a classification tree and the second one uses a random forest, but both are based on the same strategy:
- Attributes: A set of transformations of the input series.
- Classes: For each day, the class is the sign of the next price return (a binary response): 1 if the price moves up and 0 otherwise.
- Learning stage: We will use the beginning of the time series (3000 days in the example) to build the trees.
- Classification stage: We will use the remaining years to test classifier performance. For each day in this period, the tree and the forest will return an estimate, 1 or 0, and its probability.
Our strategy will buy when the probability of class 1 is larger than the probability of class 0, indicating an up movement in the series, and sell otherwise. We will also use the classification probability to compute the trade’s magnitude.
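Here is a minimal sketch of this back-test, assuming daily prices and a feature matrix aligned with them. The function run_strategy, the scikit-learn classifiers and the exact position-sizing rule are my own illustrative assumptions, not necessarily the precise setup used in the tests:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def run_strategy(features, prices, n_train=3000, model=None):
    """Train on the first n_train days, then trade the rest:
    long when P(up) > P(down), short otherwise, sized by the probability."""
    returns = np.diff(prices) / prices[:-1]          # return from day t to t+1
    labels = (returns > 0).astype(int)               # class 1 if the price moves up
    X, y = features[:-1], labels                     # align day t with its next return
    model = model if model is not None else RandomForestClassifier(n_estimators=1000)
    model.fit(X[:n_train], y[:n_train])
    proba_up = model.predict_proba(X[n_train:])[:, 1]
    # Buy when class 1 is more likely than class 0, sell otherwise,
    # and scale the position by how far the probability is from 0.5.
    positions = np.where(proba_up > 0.5, 1.0, -1.0) * np.abs(proba_up - 0.5) * 2
    return positions * returns[n_train:]             # daily strategy returns
```

Swapping RandomForestClassifier(n_estimators=1000) for DecisionTreeClassifier() gives the single-tree version of the same strategy.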
Let’s see the results of applying these strategies to several different financial series, labelled “test*”:
The result, positive or negative, is less extreme for the random forest. The average result of the random forest is not always better than that of the tree, but the risk taken is always lower, which means better drawdown control.
The trees that make up the forest were trained on different yet similar datasets: different random subsamples of the original dataset. This gives the random forest a better capacity to generalise and to perform well in new, unknown situations.
Random forest versus simple tree. Test 2:
Let’s do a second test. Imagine that we would like to build the previous trees again. This time, instead of using 3000 historical data points as the training set, we are going to use 3100 data points. We would expect both strategies to be similar. Although the random forest behaves as expected, this is not true for the classification trees, which are very prone to overfitting.
We trained individual trees and random forests using slightly larger or smaller training sets, from 2500 to 3500 data points, and then measured the variability of the results. In the following graphs, we show the range of the results and their standard deviation:
It’s clear that the random forest technique is less sensitive to variations in the training set.
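One possible way to script such a stability test, reusing the hypothetical run_strategy helper sketched above (the function name and the 100-day step are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def stability_test(features, prices, model_factory, sizes=range(2500, 3501, 100)):
    """Re-run the strategy with training sets of different lengths and
    measure how much the out-of-sample result varies."""
    totals = []
    for n_train in sizes:
        pnl = run_strategy(features, prices, n_train=n_train, model=model_factory())
        totals.append(pnl.sum())                 # total out-of-sample result
    totals = np.array(totals)
    return totals.max() - totals.min(), totals.std()

# Compare the variability of the single tree against the forest:
# stability_test(features, prices, DecisionTreeClassifier)
# stability_test(features, prices, lambda: RandomForestClassifier(n_estimators=1000))
```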
Therefore, it is not true that the random forest method will always perform better than a classification tree.
Nevertheless, we can say that random forests provide better drawdown control and higher stability. These advantages are important enough to make the extra complexity worthwhile.