Hyperparameter Tuning in Random Forest with R
Anirban Ghatak
1 Hyperparameter Tuning Using tuneRF
Random Forest is one of the most popular and most powerful machine learning algorithms. It is an ensemble method built on Bootstrap Aggregation, or bagging.
Random forests are built on the same fundamental principles as decision trees and bagging. Bagging introduces a random component into the tree-building process that reduces the variance of a single tree’s prediction and improves predictive performance. However, the trees in bagging are not completely independent of each other, since all the original predictors are considered at every split of every tree. Rather, trees from different bootstrap samples typically have similar structure to each other (especially at the top of the tree) due to underlying relationships.
However, Random Forests have important parameters which cannot be estimated directly from the data. Searching for the parameter values that maximize model performance is called parameter tuning.
There are different approaches to searching for the best parameters. A general approach that can be applied to almost any model is to define a set of candidate values, generate reliable estimates of model performance across the candidate values, and then choose the optimal settings.
For Random Forest, the most important parameter to tune is mtry, the number of predictors randomly selected as split candidates at each node. The randomForest package has an inbuilt function called tuneRF to do exactly that.
1.1 Loading and Preparing the data
We will use the popular Boston dataset that comes with the MASS package for this tutorial.
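A minimal setup sketch, assuming the randomForest package is installed:

library(MASS)            # provides the Boston housing data
library(randomForest)    # provides randomForest() and tuneRF()
data(Boston)
dim(Boston)              # 506 rows, 14 columns; medv (column 14) is the response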
Now we will divide our data into training and testing datasets:
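A simple 50/50 random split will do; the sketch below assumes the names train and boston.test that the later tuneRF calls rely on:

set.seed(123)
train <- sample(1:nrow(Boston), nrow(Boston)/2)   # indices of the training rows
boston.test <- Boston[-train, "medv"]             # test-set values of the response medv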
1.2 Running Random Forest with a Pre-decided parameter
We will train our model with a pre-decided parameter:
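A sketch of such a fit, here with mtry = 6 (the same value we pass as mtryStart below; the object name rf.boston is our choice):

set.seed(123)
rf.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                          mtry = 6, ntree = 500, importance = TRUE)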
Now we will check our model’s performance on the test data. For that purpose, our first task will be to get the predictions. After that, we will calculate the RMSE of the model.
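Both steps in one short sketch, assuming the rf.boston model and the split above:

yhat.rf <- predict(rf.boston, newdata = Boston[-train, ])   # predictions on the test set
sqrt(mean((yhat.rf - boston.test)^2))                       # RMSE of the model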
Now, we will check the OOB error of this model using the tuneRF function.
set.seed(123)
tuneRF(Boston[-train,-14], boston.test, mtryStart = 6, stepFactor=1, improve=0.05, trace=TRUE, plot=F, doBest=TRUE)
## mtry = 6 OOB error = 10.73571
## Searching left ...
## Searching right ...
##
## Call:
## randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 10.98436
## % Var explained: 86.42
In the code above:

- Boston[-train,-14] denotes the predictor variables.
- boston.test is the response variable.
- mtryStart is the starting point of the search for the tuning parameter mtry.
- stepFactor inflates or deflates mtry by this amount from the mtryStart value. If we put stepFactor = 1, the model will calculate the OOB error for only that one mtry value. In this case, that is our goal.
- improve is the (relative) improvement in OOB error that must be achieved for the search to continue. Since we are calculating the OOB error for only one mtry value here, this argument is ignored.
- trace specifies whether to print the progress of the search.
- plot specifies whether to plot the OOB error as a function of mtry.
- doBest specifies whether to run a forest using the optimal mtry found.
If doBest=FALSE (default), it returns a matrix whose first column contains the mtry values searched, and the second column the corresponding OOB error.
If doBest=TRUE, it returns the randomForest object produced with the optimal mtry.
1.3 Tuning the Parameter mtry
Now we will see whether any other value of mtry provides us with a better model.
set.seed(123)
tunedrf1 <- tuneRF(Boston[-train,-14], boston.test, mtryStart=2, ntreeTry=500, stepFactor=1.5, improve=0.01, trace=TRUE, plot=F, doBest=F)
## mtry = 2 OOB error = 14.30229
## Searching left ...
## Searching right ...
## mtry = 3 OOB error = 12.29387
## 0.1404261 0.01
## mtry = 4 OOB error = 11.0511
## 0.1010886 0.01
## mtry = 6 OOB error = 10.41273
## 0.05776521 0.01
## mtry = 9 OOB error = 10.98407
## -0.05486934 0.01
The object tunedrf1 contains a matrix of the mtry values searched and the corresponding OOB errors. Let’s have a look:
tunedrf1
## mtry OOBError
## 2 2 14.30229
## 3 3 12.29387
## 4 4 11.05110
## 6 6 10.41273
## 9 9 10.98407
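Since tunedrf1 is a matrix with columns mtry and OOBError, we can also draw the OOB-error curve ourselves (a minimal base-graphics sketch; recall the call above used plot=F):

plot(tunedrf1[, "mtry"], tunedrf1[, "OOBError"], type = "b",
     xlab = "mtry", ylab = "OOB error")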
We could have directly obtained the forest fitted with the best mtry by using doBest=T in the tuneRF function:
set.seed(123)
tunedrf2 <- tuneRF(Boston[-train,-14], boston.test, mtryStart=2, ntreeTry=500, stepFactor=1.5, improve=0.01, trace=TRUE, plot=TRUE, doBest=T)
## mtry = 2 OOB error = 14.30229
## Searching left ...
## Searching right ...
## mtry = 3 OOB error = 12.29387
## 0.1404261 0.01
## mtry = 4 OOB error = 11.0511
## 0.1010886 0.01
## mtry = 6 OOB error = 10.41273
## 0.05776521 0.01
## mtry = 9 OOB error = 10.98407
## -0.05486934 0.01
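Because doBest=TRUE refits the forest at the best mtry found, tunedrf2 is now a fitted randomForest object rather than a matrix, and we can verify that directly:

class(tunedrf2)   # "randomForest": a fitted model, not a matrix of OOB errors
tunedrf2$mtry     # the mtry value used in the final forest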