
Decision Trees: Gini vs Entropy

Pablo Aznar

02/12/2020


Decision Trees are one of the best-known supervised classification methods. As explained in previous posts, “A decision tree is a way of representing knowledge obtained in the inductive learning process. The space is split using a set of conditions, and the resulting structure is the tree.”

A tree is composed of nodes, and those nodes are chosen by looking for the optimum split of the features. Different criteria exist for that purpose. In the decision tree Python implementation of the scikit-learn library, this is controlled by the ‘criterion’ parameter, which sets the function used to measure the quality of a split and allows users to choose between ‘gini’ and ‘entropy’.
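As a quick reference, here is a minimal sketch of how that parameter is set (the Iris dataset is used purely as an example):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Same estimator, changing only the splitting criterion
tree_gini = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
tree_entropy = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
```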

How does each criterion find the optimum split? And what are the differences between them? In this post, we answer these questions: first, we explain the Gini and entropy criteria and their differences, and then we present a practical example comparing both of them.

Moreover, if you are interested in decision trees, this post about tree ensembles may also be of interest to you.

Gini

The Gini impurity is calculated using the following formula:

$$\text{Gini Index} = 1 - \sum_{j} p_{j}^{2}$$

Where \(p_{j}\) is the probability of class j.

The Gini impurity measures how often a randomly chosen element of the dataset would be mislabelled if it were labelled randomly according to the class distribution in the node.

The minimum value of the Gini Index is 0. This happens when the node is pure, meaning that all the elements it contains belong to a single class; such a node will not be split again. Therefore, the optimum split is the one that yields the lowest Gini Index. Moreover, in the binary case, the Gini Index reaches its maximum value when the two classes have the same probability:

$$Gini_{min} = 1 - (1^{2}) = 0$$

$$Gini_{max} = 1 - (0.5^{2} + 0.5^{2}) = 0.5$$
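As an illustration (this is not the scikit-learn internal implementation), the Gini impurity of a node can be computed directly from its class labels:

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum_j p_j^2, with p_j the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))  # pure node -> 0.0
print(gini_impurity([0, 1, 0, 1]))  # balanced binary node -> 0.5
```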

Entropy

The entropy is calculated using the following formula:

$$Entropy = - \sum_{j} p_{j} \cdot \log_{2}(p_{j})$$

Where, as before, \(p_{j}\) is the probability of class j.

Entropy is a measure of information that indicates the disorder of the features with respect to the target. As with the Gini Index, the optimum split is the one that yields the lowest entropy. Entropy reaches its maximum value when the two classes have the same probability, and a node is pure when the entropy reaches its minimum value, which is 0:

$$Entropy_{min} = -1 \cdot \log_{2}(1) = 0$$

$$Entropy_{max} = -0.5 \cdot \log_{2}(0.5) - 0.5 \cdot \log_{2}(0.5) = 1$$
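Analogously, a small sketch of the entropy of a node computed from its class labels (again, an illustration rather than the library code):

```python
import numpy as np

def entropy(labels):
    """Entropy of a node: -sum_j p_j * log2(p_j), with p_j the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 1, 0, 1]))  # balanced binary node -> 1.0
print(entropy([0, 0, 0, 1]))  # skewed node -> ~0.81
```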

Gini vs Entropy

The Gini Index and the Entropy have two main differences:

  • Gini Index takes values in the interval [0, 0.5], whereas the interval of the Entropy is [0, 1]. Both are represented in the following figure, where the Gini Index is also plotted multiplied by two to show that the differences between the two measures are not very significant (a sketch to reproduce these curves is given after this list).
Gini Index and Entropy
  • Computationally, entropy is more expensive since it requires logarithms; consequently, the Gini Index is faster to compute.
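The curves in the previous figure can be reproduced with a short sketch like the following (binary case; matplotlib is assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)  # probability of class 1 in a binary node
gini = 1 - (p ** 2 + (1 - p) ** 2)
entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

plt.plot(p, gini, label="Gini")
plt.plot(p, 2 * gini, linestyle="--", label="2 x Gini")
plt.plot(p, entropy, label="Entropy")
plt.xlabel("Probability of class 1")
plt.ylabel("Impurity")
plt.legend()
plt.show()
```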

Therefore, we are going to analyze the impact of using one criterion or the other on the training time. For that purpose, different synthetic datasets have been generated. All of them have 10 features and can be divided into 4 groups, depending on whether the features are informative, redundant, repeated, or random:

Types of features of the datasets

Moreover, each group is composed of 5 datasets in which the number of samples varies (100, 1,000, 10,000, 100,000, and 200,000), so that the effect of the dataset size on the training time can also be analyzed.
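Such datasets can be generated with scikit-learn's make_classification. The exact feature configuration used in the post is not stated, so the parameter values below are only an assumption:

```python
from sklearn.datasets import make_classification

# Illustrative configurations for the four groups; each dataset has 10 features in total
groups = {
    "informative": dict(n_informative=10, n_redundant=0, n_repeated=0),
    "redundant":   dict(n_informative=2,  n_redundant=8, n_repeated=0),
    "repeated":    dict(n_informative=2,  n_redundant=0, n_repeated=8),
    "random":      dict(n_informative=2,  n_redundant=0, n_repeated=0),  # remaining features are noise
}

datasets = {}
for name, params in groups.items():
    for n_samples in (100, 1_000, 10_000, 100_000, 200_000):
        X, y = make_classification(n_samples=n_samples, n_features=10,
                                   random_state=0, **params)
        datasets[(name, n_samples)] = (X, y)
```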

In the following graphs, the x-axis is the number of samples of the dataset and the y-axis is the training time.

Training time of different datasets

As can be seen, the training time when using the entropy criterion is much higher. Furthermore, the impact of the nature of the features is quite noticeable: redundant features are the most affected, making the training time around 50% longer than with informative features, while random and repeated features have a similar impact. The differences in training time become more noticeable as the datasets grow.
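The timings above could be obtained with a sketch along these lines (the exact benchmarking procedure of the post is not specified, so this is only an approximation):

```python
import time

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def training_time(X, y, criterion, repeats=5):
    """Average fit time (in seconds) of a decision tree with the given criterion."""
    times = []
    for _ in range(repeats):
        clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
        start = time.perf_counter()
        clf.fit(X, y)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# One synthetic dataset as an example; the post repeats this for every group and size
X, y = make_classification(n_samples=100_000, n_features=10,
                           n_informative=2, n_redundant=8, random_state=0)
print("gini   :", training_time(X, y, "gini"))
print("entropy:", training_time(X, y, "entropy"))
```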

Results

We are also going to compare the results obtained with both criteria. For that purpose, we use the datasets from the training time analysis, specifically the ones with 1,000 samples, evaluated with k=3 cross-validation. The following table shows the obtained results, reported as F-scores.

Results comparison

As can be seen, the results are very similar, with the entropy criterion performing slightly better.
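A sketch of how such an evaluation can be run, assuming one of the synthetic datasets with 1,000 samples and using 3-fold cross-validation with the F-score:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset with 1,000 samples; the exact feature configuration is an assumption
X, y = make_classification(n_samples=1_000, n_features=10,
                           n_informative=5, n_redundant=3, random_state=0)

for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0)
    scores = cross_val_score(clf, X, y, cv=3, scoring="f1")
    print(criterion, round(scores.mean(), 3))
```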

Finally, if we compare the structure of the trees, we can see that they are different. For that purpose, we have created a new synthetic dataset with 400 samples and 4 features. Both criteria obtain an F-score of 0.93, but the resulting trees are different: the first split and the right branch are the same in both trees, while the rest of the structure differs.

Although the trees are not identical, the results they obtain are practically the same.

Resulting tree using Gini criterion
Resulting tree using entropy criterion
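The structure of the two trees can be inspected programmatically, for instance with export_text (or plot_tree). The dataset below matches the 400 samples and 4 features mentioned above, but the rest of its configuration is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# 400 samples and 4 features, as in the post; other parameters are illustrative
X, y = make_classification(n_samples=400, n_features=4,
                           n_informative=3, n_redundant=1, random_state=0)

for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=0).fit(X, y)
    print(f"--- {criterion} ---")
    print(export_text(clf, max_depth=2))  # show only the top of each tree
```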

Conclusions

In this post, we have compared the Gini and entropy criteria for splitting the nodes of a decision tree.

On the one hand, the Gini criterion is much faster because it is less computationally expensive.

On the other hand, the obtained results using the entropy criterion are slightly better.

Nevertheless, since the results are so similar, the extra training time required by the entropy criterion does not seem to be worth it.

