2 Introduction

In this chapter, we are going to learn about random forests for supervised classification and regression.

As you probably remember, supervised learning refers to methods that use some prior information about the data points (e.g. samples, patients) to build a model connecting that prior information to the data. In this context, classification deals with the situation where the prior information is qualitative. For example, we might have a set of samples where half of them have a disease and the other half do not; the objective is then to find a way to tell who has the disease and who does not. This example involves only two groups, but classification can be applied to any number of groups. In a regression setting, our prior information is continuous (numerical), and our aim is to find a way of connecting the data patterns with those numbers.
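To make the distinction concrete, here is a minimal sketch, with invented toy values rather than data from this chapter, of what the prior information looks like in each setting:

```python
import numpy as np

# Four samples, one measurement each (rows are samples).
X = np.array([[2.1], [1.8], [3.5], [3.9]])

# Classification: the prior information is qualitative (group labels).
y_class = np.array(["disease", "healthy", "disease", "healthy"])

# Regression: the prior information is continuous (numerical).
y_reg = np.array([4.2, 3.1, 7.8, 8.3])
```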

Let’s agree from now on that we will denote our data by \(X\), in which our samples are in the rows and our variables (measurements) are in the columns. We will denote our prior information by \(Y\). As described above, we want to come up with a way to map our \(X\) to \(Y\):

\[Y=f(X)\]

This basically says that we are after a function \(f\) that takes \(X\) (our data) as input and outputs \(Y\) (predictions). There are many methods for finding such a mapping; linear/logistic regression, neural networks and support vector machines are examples. The random forest algorithm can also be considered in this category. However, random forest does not model \(f\) directly in a mathematical sense, but rather builds \(f\) from a series of segmentations of the data, as we will soon see.
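As a preview, the following minimal sketch shows this mapping in code. It assumes scikit-learn is available and uses randomly generated toy data (not data from this chapter); the fitted forest object plays the role of \(f\).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# X: 20 samples (rows) with 3 measurements (columns).
X = rng.normal(size=(20, 3))
# Y: a qualitative label for each sample (two groups).
Y = np.array(["disease"] * 10 + ["healthy"] * 10)

# The fitted forest acts as f: it maps X to predictions of Y.
f = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, Y)
predictions = f.predict(X)
```

Calling `predict` on new data arranged the same way (samples in rows, measurements in columns) returns the estimated \(Y\).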

In order to learn how random forest works, we will need to go through a few central topics:

  • Decision trees
  • Bagging
  • Introducing randomness