Decision tree 

To understand the random forest model, we must first learn about the decision tree, the basic building block of a random forest. We all use decision trees in our daily lives, even if we don't know them by that name. You will be able to relate to the concepts of a decision tree once we start going through the example.

Imagine you approach a bank for a loan. The bank will screen you against a series of eligibility criteria before approving the loan. The loan amount offered will vary from individual to individual, based on which eligibility criteria they satisfy.

The bank may work through a series of decision points to arrive at a final decision about whether to grant a loan and how much to offer, such as the following:

  • Source of income: Employed or self-employed?
  • If employed, place of employment: Private sector or government sector?
  • If private sector, range of salary: Low, medium, or high?
  • If government sector, range of salary: Low, medium, or high?

There may be further questions, such as how long you've been employed with that company, or whether you have any outstanding loans. This process, in its most basic form, is a decision tree:
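The question-and-branch structure above can be sketched as nested conditionals. This is a minimal illustration, not a real lending policy: every threshold and loan amount below is invented.

```python
def loan_decision(income_source, sector=None, salary_band=None):
    """Toy decision tree for the loan example: each nested branch is a
    decision node, and each return value is a leaf. All amounts are
    invented for illustration."""
    if income_source == "self-employed":
        return "manual review"            # leaf
    # Employed: check the place of employment.
    if sector == "government":
        if salary_band == "low":
            return "loan up to $10,000"   # leaf
        return "loan up to $50,000"       # leaf
    # Private sector: check the salary band.
    if salary_band == "high":
        return "loan up to $40,000"       # leaf
    return "loan up to $15,000"           # leaf

print(loan_decision("employed", sector="government", salary_band="high"))
```

Each call walks a single path from the root question down to one leaf, which is exactly how a trained decision tree classifies a new data item.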

As you can see in the preceding diagram, a decision tree is a widely used, effective non-parametric machine learning technique for classification problems. To find solutions, a decision tree makes sequential, hierarchical decisions about the outcomes based on the predictor data.

For any given data item, a series of questions is asked, which leads to a class label or a value. The model asks a series of predefined questions of the incoming data item and, based on the answers, follows the corresponding branch until it arrives at a resulting data value or class label. The model is constructed from the observed data alone; no assumptions are made about the distribution of the errors or the distribution of the data itself.

A decision tree model where the target variable takes a discrete set of values is called a classification tree. In these trees, each leaf represents a class label, while the branches represent the combinations of features that lead to those labels.
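A classification tree can be built in a few lines with scikit-learn, assuming it is installed. The dataset here is invented purely for illustration: two features (years employed, salary in thousands of dollars) and a binary approve/reject label.

```python
from sklearn.tree import DecisionTreeClassifier

# Invented data: [years_employed, salary_in_$1000s] -> approve (1) / reject (0)
X = [[1, 20], [2, 35], [8, 90], [10, 120], [0, 15], [7, 80]]
y = [0, 0, 1, 1, 0, 1]

# Fit a shallow classification tree; the leaves hold the class labels.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

print(clf.predict([[9, 100]]))  # prints [1]: this applicant falls in an "approve" leaf
```

The fitted tree has learned its own split questions from the data, much like the hand-written loan questions, but chosen automatically to best separate the classes.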

A decision tree where the target variable takes continuous values, usually numbers, is called a regression tree.
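The regression counterpart works the same way, except each leaf stores a continuous value (the mean of the training targets that reached it) rather than a class label. Again, the salary-to-loan-amount numbers below are invented, and scikit-learn is assumed to be available.

```python
from sklearn.tree import DecisionTreeRegressor

# Invented data: salary in $1000s -> a continuous loan amount in dollars
X = [[20], [35], [50], [70], [90], [120]]
y = [5000, 8000, 15000, 25000, 40000, 60000]

# Fit a shallow regression tree; each leaf predicts the mean target
# of the training samples that landed in it.
reg = DecisionTreeRegressor(max_depth=2, random_state=0)
reg.fit(X, y)

print(reg.predict([[60]]))
```

Because a leaf predicts an average of training targets, a regression tree's output is piecewise constant: every input that reaches the same leaf gets the same predicted amount.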

These decision trees are well represented using directed acyclic graphs (DAGs). In these graphs, nodes represent decision points and edges are the connections between the nodes. In the preceding loan scenario, the salary question and the resulting loan decision are nodes, while the answer medium (a salary range of $30,000-$70,000) labels the edge connecting them.
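The node-and-edge structure of a fitted tree can be printed directly with scikit-learn's `export_text`, assuming the library is installed. The one-feature dataset below is invented: salary in thousands of dollars against a binary approve/reject label.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented data: salary in $1000s -> loan approved (1) / rejected (0)
X = [[25], [40], [55], [75], [95]]
y = [0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# Each line with a condition is a decision node; the indented lines
# beneath it are the edges leading to its children (here, two leaves).
print(export_text(clf, feature_names=["salary"]))
```

The printed rules make the graph explicit: one root node asking a salary question, with two edges leading to an approve leaf and a reject leaf.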