Decision tree is one of the major data structures of statistical learning. Their functioning is based on heuristics, which, while satisfying intuition, give remarkable results in practice (especially when used in “random forests”). Their tree structure also makes them readable by humans, unlike other approaches where the constructed predictor is a “black box”.
A decision tree models a hierarchy of tests on the values of a set of variables called attributes. At the end of these tests, the predictor produces a numerical value or chooses an element from a discrete set of conclusions.
Building a Decision tree
To build a decision tree, we must answer the following 4 questions :
1. How to choose, among all of the variables available, the variable of segmentation of a vertex?
2. When the variable is continuous, how to determine the cut-off threshold during segmentation?
3. How to determine the correct size of the tree? Is it desirable to produce pure leaves totally according to the variable to be predicted, even if the corresponding group corresponds to a particularly small fraction of the observations?
4. Finally, how to assign the value of the variable for prediction? When the group is pure the answer is obvious, in the opposite case, we must adopt a strategy.
To choose the segmentation variable on a vertex, the algorithm relies on a particularly crude technique — it tests all the potential variables and chooses the one that maximizes a given criterion. It is, therefore, necessary that the criterion used characterizes the purity (or the gain in purity) during the passage of the vertex to segment, towards the leaves produced by the segmentation. There are several informational or statistical criteria, the most popular are Shannon’s entropy and Gini coefficient, and their variants.
In the end, these criteria, as long as they make it possible to make partitioning tend towards the constitution of pure groups, play little role in the performances of the algorithms.
Processing of continuous variables
The treatment of continuous variables must be consistent with the use of the segmentation criterion. In the majority of cases, the principle of segmentation of continuous variables is particularly simple — sort the data according to the variable to be treated, test the set of possible cut-off points located between two successive points and to evaluate the value of the criterion in each case. The optimum cutoff point is simply the one that maximizes the segmentation criterion.
Set the right size of the tree
The objective being to produce homogeneous groups, it seems natural to set as a stop rule of construction of the tree the constitution of pure groups from the point of view of the variable to be predicted. Desirable in principle, this attitude is not sustainable in practice.
Indeed, we often work on a sample that we hope is representative of a population. The challenge is, therefore, to find the right measure between capturing useful information, really corresponding to a relationship in the population, and ingesting the specificities of the file we are working on (the so-called learning sample). In other words, we must never forget that the performance of the tree is evaluated on the same data used for its construction — the more the model is complex (the bigger the tree, the more branches it has, the more leaves it has), the more we run the risk of seeing this model unable to be extrapolated to new data.
Indeed, if in an extreme case, we decide to grow our tree as far as possible, we can obtain a tree composed of as many leaves as individuals in the learning sample. Our tree does not make any mistake on this sample because it matches all the characteristics — performance equal to 100%. When this model is applied to new data, that, by nature, do not have all of the characteristics of our learning sample (it is simply another sample), its performance will, therefore, be degraded to the limit that will be closer to 0% !
Thus, when we build a decision tree, we risk what we call an over-adjustment of the model — the model seems efficient (its average error is particularly low) but it is in fact absolutely not! We will have to find the smallest tree possible with the greatest possible performance. The smaller a tree, the more stable it will be in its future predictions (in statistics, the principle of parsimony prevails almost always). The more powerful a tree is, the more satisfying it is for the analyst.
That is to say, to avoid an over-adjustment of our trees (and it is also true for all mathematical models that could be built), it is necessary to apply the principle of parsimony. At identical performance, we always prefer the simplest model, if we want to use this model on new completely unknown data.
The Problem of overfitting models
How to achieve the performance / complexity arbitrage of the models used? We can do this by evaluating the performance of one or more models on the data used to build it (the learning sample or samples), and also on, what is called, one (or more) test sample, which is the data at our disposal that we voluntarily decide not to use in the construction of our models. It’s as if this test data were new data, the reality. It is, above all, the stability of the performance of our model on these two types of samples that will allow us to judge its potential over-adjustment, and consequently its capacity to be used under real conditions where the exact data is not known.
In the case of decision tree, several types of algorithmic solutions have been considered in order to avoid, as far as possible, a problem of potential over-adjustment of the models. These are techniques known as pre and post pruning of trees.
The first strategy that can be used to avoid massive over-adjustment of decision tree is to propose stopping criteria during the expansion phase. This is the principle of pre-pruning.
For example, we consider that segmentation is no longer necessary when the group is too small, or, when the purity has reached a sufficient level that we consider that it is no longer indispensable to segment it. Another criterion frequently encountered in this context is the use of a statistical test to evaluate whether the segmentation introduces a significant information as to the prediction of the values of the variable to be predicted.
The second strategy is to build the tree in two steps — to produce the purest tree possible in the expansion phase by using the first fraction of the data sample (learning sample, not to be confused with the whole of the sample); then reverse to reduce the tree. This is the post-pruning phase, relying on another fraction of the data in order to optimize the performance of the tree. Depending on the software, this second portion of the data is referred to as the validation sample or test sample, the sample used to measure model performance.
Decision tree is a vast topic. If you wish to have a detailed tutorial on the same, do let us know in the comments section below!