Classification trees are used, as the name suggests, in solving classification problems. Here are some definitions and Matlab tips to help you dabble in this subject.
The objective of any problem of this nature is to assign an object to one of a number of specified categories or classes. Classification is a useful technique that’s widely applied in many other fields, such as pattern recognition.
A tree is essentially a set of sequential conditions and actions that relate certain factors to a result or decision. It is a supervised classification method: the data are classified a priori, and the knowledge is extracted from that classification.
First of all, the algorithm analyses the data provided in search of patterns, and then uses the result of this analysis to define sequences and conditions in order to create the classification model.
In this post we go over the main theoretical concepts and translate them to Matlab, which includes the tools required to work with trees in its Statistics Toolbox. I hope you find it useful.
What are classification trees?
A decision tree is a way of representing knowledge obtained in the inductive learning process. The space is split using a set of conditions, and the resulting structure is the tree.
These conditions are created from a series of characteristics or features, the explanatory variables:
We initialise the matrix a with features in Matlab. We define the function attributes to encompass the characteristics that we want the tree to consider.
If we are working with financial series, some examples of features are volatility, the efficiency of the movements, etc.
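As a minimal sketch of what this could look like (the window length and the two features here are illustrative choices of ours, not a prescription), assuming prices is a column vector of closing prices:

```matlab
% Illustrative feature matrix: one row per observation, one column per feature.
% 'prices' is assumed to be a column vector of closing prices.
ret = diff(log(prices));       % log returns
w   = 20;                      % look-back window (arbitrary choice)
n   = numel(ret) - w + 1;      % number of observations

vol = zeros(n, 1);             % feature 1: rolling volatility
eff = zeros(n, 1);             % feature 2: efficiency of the movement
for k = 1:n
    r      = ret(k:k+w-1);
    vol(k) = std(r);
    eff(k) = abs(sum(r)) / sum(abs(r));  % net move relative to path length
end

a = [vol, eff];                % the matrix of features
```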
Each set of features is assigned a response or class, which the tree attempts to explain. This classification is done using future information; hence we can categorise historical data, but it is impossible to do so with new data:
According to the class type, we have classification trees with discrete classes and regression trees with continuous classes. We focus on the binary case, as this is what the Matlab function we are going to use requires. Examples of binary classes are the price moving upwards or downwards, the price moving above a certain value or not, or the market moving sideways versus trending.
We label these classes as -1 and 1 from here onwards, and we define our classes as a vector holding one response for each observation in x:
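Continuing the sketch above, the labels could come from the sign of the return that follows each window; the look-ahead is exactly what makes this impossible for live data:

```matlab
% Illustrative classes: 1 if the price moves up right after the window,
% -1 otherwise. Uses future information, so only historical data can be labelled.
fwd = prices(w+2:end) ./ prices(w+1:end-1) - 1;  % return following each window
c   = ones(n-1, 1);
c(fwd <= 0) = -1;
a   = a(1:n-1, :);             % drop the last window, whose future is unknown
```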
What are classification trees like?
A tree is made up of nodes:
- Internal nodes: each one contains a question about a specific feature (node = <attribute, value>) and has two children, one for each possible answer or decision. The questions are of the type: \(a_i \geq j\)?
- End nodes (leaves): the nodes at the bottom of the tree, each assigned to a single class. The complexity of the tree is given by the number of leaves.
How are classification trees built?
The construction of a tree is the learning stage of the method. It consists of analysing the set of available features and obtaining logical rules adapted to the known classification: the classes assigned to each vector \((a_1,\ldots,a_s)\).
The construction process is recursive:
- All the possible partitions (attribute, value) are analysed, and the one giving the best separation is taken.
- The optimum separation is applied.
- Step 1 is repeated on the child nodes, but only on those that are not leaves.
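In Matlab the whole recursive construction is a single call. A minimal sketch, assuming a recent Statistics Toolbox release (older releases used ClassificationTree.fit or classregtree instead of fitctree):

```matlab
% Grow the classification tree from the features a and the classes c.
tree = fitctree(a, c, 'PredictorNames', {'volatility', 'efficiency'});

% Inspect the learned questions and leaves graphically.
view(tree, 'Mode', 'graph');
```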
All partitions?
Yes, all of them. For each one of the features, the values of the observations are used.
The partitions are defined by taking the midpoint between each pair of consecutive values as the cutoff.
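A small sketch of what this means for one feature i (the variable names are ours):

```matlab
% Candidate cutoffs for feature i: the midpoints between each pair of
% consecutive distinct observed values.
v    = unique(a(:, i));                 % sorted distinct values of feature i
cuts = (v(1:end-1) + v(2:end)) / 2;     % one candidate split per pair
```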
Which is the optimum separation?
The best separation is the one that divides the data into groups such that there is one dominant class (child node 1 => -1 and child node 2 => 1).
The so-called impurity measures are used and, from all possible partitions, we choose the one that minimises the impurity of the two child nodes.
We use the Gini diversity index, although there are many other impurity measures; among them, entropy and information gain stand out.
The impurity is calculated as the product of the probabilities of belonging to one class or the other.
When these probabilities are similar, the separation does not match the classification defined by the vector of classes well: the more alike the probabilities are, the more impurity there is and the greater the Gini index.
The Gini index of a node \(a_{ij} = \) (attribute i, value j) is calculated by adding together the impurities of the child nodes:

$$G(a_{ij})=P(a_{i}<j)\cdot G(c|a_{i}<j) + P(a_{i}\ge j)\cdot G(c|a_{i}\ge j)$$

$$G(c|a_{ij})=P(c=1|a_{ij})\cdot P(c\neq 1|a_{ij})+P(c=-1|a_{ij})\cdot P(c\neq -1|a_{ij})=1-\sum_{c_k} P(c_{k}|a_{ij})^{2}$$

A pure node has a Gini index of 0.
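A direct transcription of these formulas into Matlab could look as follows; giniSplit is a hypothetical helper of ours (saved as giniSplit.m), not a toolbox function:

```matlab
function g = giniSplit(a, c, i, j)
% Gini index of the split "a_i >= j": the weighted sum of the
% impurities of the two child nodes, as in the formula above.
    left = a(:, i) < j;
    g = mean(left)  * giniNode(c(left)) + ...
        mean(~left) * giniNode(c(~left));
end

function g = giniNode(c)
% Node impurity: 1 minus the sum of squared class probabilities (0 if pure).
    p = [mean(c == 1), mean(c == -1)];
    g = 1 - sum(p.^2);
end
```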
In this way, the importance of the features is established. It's possible that the algorithm leaves some of the characteristics we had chosen out of the tree definition; we then interpret such a feature as irrelevant, since the decision is independent of its value. Likewise, the features appearing at the lower levels of the tree have less importance.
By analysing the tree structure we can infer the interest of each of the chosen explanatory variables.
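The Statistics Toolbox exposes this idea directly: predictorImportance adds up the impurity reductions attributed to each feature across the tree, so a feature that never appears scores zero. A quick sketch with the tree fitted earlier:

```matlab
% Relative importance of each feature; zeros flag irrelevant features.
imp = predictorImportance(tree);
bar(imp);
```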
When is a node a leaf?
If the node contains a single class, then it is pure and the process is complete. However, it's typical to use additional stopping criteria, such as when the number of data points within the node is too small; in such cases, continuing with its classification isn't considered relevant.
Furthermore, no further divisions are made if the impurity of the children is no better than the impurity of the parent. Likewise, if every possible separation of a node leads to child nodes with too few values, the partition is not carried out either.
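These stopping criteria map onto options of fitctree; a sketch using the option names from recent releases (the thresholds themselves are arbitrary):

```matlab
tree = fitctree(a, c, ...
    'MinLeafSize',   10, ...   % no leaf with fewer than 10 observations
    'MinParentSize', 20, ...   % do not try to split nodes smaller than this
    'MaxNumSplits',  15);      % cap on the overall complexity of the tree
```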
How should we use classification trees?
One of the greatest advantages of classification trees is that they are intuitive and easy to apply.
The application of the tree is known as the classification stage, and consists of assigning the class corresponding to a new set of features, independent from the learning set.
Once the tree has been created, it's straightforward to find the unknown class of a new value x with characteristics \((a_1,\ldots,a_s)\). One simply answers the question posed at each node and follows the path mapped out by the tree until a leaf node is reached. The predicted class c is the most common class in the leaf to which the value x belongs.
The percentage of data with the most common class in the leaf for x is the accuracy of the classification.
The result of the classification provided by the tree is not merely a decision: it also comes with its own certainty, since the probabilities of all the possible alternatives are available:
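In Matlab, the decision and its certainty come out of the same call. A minimal sketch, where xnew is a hypothetical new feature vector:

```matlab
% 'label' is the predicted class of xnew; each column of 'score' holds the
% fraction of training observations of that class in the leaf xnew falls
% into, i.e. the certainty of the decision.
[label, score] = predict(tree, xnew);
```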