1 Hyperparameter Tuning Using tuneRF

Random Forest is one of the most popular and powerful machine learning algorithms. It is an ensemble method built on Bootstrap Aggregation, or bagging.

Random forests are built on the same fundamental principles as decision trees and bagging. Bagging introduces a random component into the tree-building process that reduces the variance of a single tree’s prediction and improves predictive performance. However, the trees in bagging are not completely independent of each other, since all the original predictors are considered at every split of every tree. As a result, trees from different bootstrap samples typically have similar structure (especially at the top of the tree) due to underlying relationships among the predictors. Random forests address this by considering only a random subset of predictors at each split, which decorrelates the trees.
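
To make the contrast concrete, here is a minimal sketch (our own illustration, not part of the original analysis; the object names bag.boston and rf.default are ours) that fits both a bagged ensemble and a random forest on the Boston data used below. Bagging is simply a random forest with mtry equal to the number of predictors, so every predictor is a split candidate in every tree:

library(randomForest)
library(MASS)

set.seed(1)
p = ncol(Boston) - 1  # 13 predictors; medv (column 14) is the response

# Bagging: all p predictors are split candidates at every node
bag.boston = randomForest(medv ~ ., data = Boston, mtry = p)

# Random forest: the default mtry, floor(p/3) for regression, decorrelates the trees
rf.default = randomForest(medv ~ ., data = Boston)

# Compare the out-of-bag MSE of the two ensembles
tail(bag.boston$mse, 1)
tail(rf.default$mse, 1)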

However, Random Forests have important parameters (hyperparameters) which cannot be directly estimated from the data. The process of searching for the parameter values that maximize model performance is called hyperparameter tuning.

There are different approaches to searching for the best parameters. A general approach that can be applied to almost any model is to define a set of candidate values, generate reliable estimates of model performance across the candidate values, and then choose the optimal settings.
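
As a minimal sketch of this candidate-set approach (an illustration of ours, using the Boston data and packages introduced in the next section), we could loop over a handful of mtry values, record the out-of-bag error of each forest, and pick the winner:

set.seed(42)
candidates = 2:8  # candidate mtry values
oob.err = sapply(candidates, function(m) {
  fit = randomForest(medv ~ ., data = Boston, mtry = m, ntree = 500)
  tail(fit$mse, 1)  # OOB mean squared error after all 500 trees
})
candidates[which.min(oob.err)]  # the candidate with the lowest OOB error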

For Random Forest, the most important parameter to tune is mtry, the number of predictors randomly sampled as split candidates at each node. The randomForest package has a built-in function called tuneRF to do exactly that.

1.1 Loading and Preparing the data

We will use the popular Boston housing dataset that comes with the MASS package for this tutorial.

library(randomForest)
library(MASS)
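
A quick look confirms the shape of the data; medv, the median home value, is the 14th column and will be our response, which is why column 14 is dropped from the predictor matrix later on:

dim(Boston)    # 506 observations, 14 variables
names(Boston)  # "medv" is the 14th column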

Now we will divide our data into training and testing datasets:

set.seed(1234)
train = sample(1:nrow(Boston), nrow(Boston)/2) # a 50:50 split
boston.test = Boston[-train, "medv"]  # response (medv) values for the test rows

1.2 Running Random Forest with a Pre-decided Parameter

We will first train a model with a pre-decided value of mtry:

rf.boston = randomForest(medv ~ ., data = Boston, subset = train, mtry = 6, importance = TRUE)
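
Because importance = TRUE was set, we can optionally inspect which predictors the forest leans on most heavily (not required for the tuning workflow, but often informative):

importance(rf.boston)   # %IncMSE and IncNodePurity for each predictor
varImpPlot(rf.boston)   # dot charts of the two importance measures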

Now we will check our model’s performance on the test data. For that purpose, our first task will be to get the predictions. After that, we will calculate the test mean squared error (MSE) of the model.

yhat.rf = predict(rf.boston, newdata = Boston[-train, ])
mean((yhat.rf - boston.test)^2)  # test MSE
## [1] 10.10126
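
The value above is the test mean squared error; if the root mean squared error is preferred, simply take the square root:

sqrt(mean((yhat.rf - boston.test)^2))  # test RMSE, roughly 3.18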

Now, we will check the OOB error at mtry = 6 using the tuneRF function.

set.seed(123)
tuneRF(Boston[-train, -14], boston.test, mtryStart = 6, stepFactor = 1,
       improve = 0.05, trace = TRUE, plot = F, doBest = TRUE)
## mtry = 6  OOB error = 10.73571 
## Searching left ...
## Searching right ...
## 
## Call:
##  randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1]) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 10.98436
##                     % Var explained: 86.42

In the code above:

  • Boston[-train,-14] supplies the predictor variables (every column except medv, the 14th).
  • boston.test supplies the response variable.
  • mtryStart is the starting point of the search for the tuning parameter mtry.
  • stepFactor is the factor by which mtry is inflated or deflated at each step of the search. With stepFactor = 1, mtry never moves, so the model calculates the OOB error only for that one mtry value. In this case, that is our goal.
  • improve specifies the minimum (relative) improvement in OOB error required for the search to continue. Since we are calculating the OOB error for only one mtry value here, it is ignored.
  • trace specifies whether to print the progress of the search.
  • plot specifies whether to plot the OOB error as a function of mtry.
  • doBest specifies whether to run a forest using the optimal mtry found.

If doBest=FALSE (the default), tuneRF returns a matrix whose first column contains the mtry values searched and whose second column contains the corresponding OOB errors. If doBest=TRUE, it returns the randomForest object fit with the optimal mtry.

1.3 Tuning the Parameter mtry

Now we will see whether any other value of mtry gives us a better model.

set.seed(123)
tunedrf1 <- tuneRF(Boston[-train, -14], boston.test, mtryStart = 2, ntreeTry = 500,
                   stepFactor = 1.5, improve = 0.01, trace = TRUE, plot = F, doBest = F)
## mtry = 2  OOB error = 14.30229 
## Searching left ...
## Searching right ...
## mtry = 3     OOB error = 12.29387 
## 0.1404261 0.01 
## mtry = 4     OOB error = 11.0511 
## 0.1010886 0.01 
## mtry = 6     OOB error = 10.41273 
## 0.05776521 0.01 
## mtry = 9     OOB error = 10.98407 
## -0.05486934 0.01

The object tunedrf1 is a matrix of the mtry values searched and the corresponding OOB errors. Let’s have a look:

tunedrf1
##   mtry OOBError
## 2    2 14.30229
## 3    3 12.29387
## 4    4 11.05110
## 6    6 10.41273
## 9    9 10.98407
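
Since tunedrf1 is a matrix, the best value can be extracted programmatically, mirroring the res[which.min(res[, 2]), 1] expression that tuneRF itself uses (visible in the printed call earlier):

best.mtry = tunedrf1[which.min(tunedrf1[, 2]), 1]
best.mtry  # 6, the mtry with the lowest OOB error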

We could have obtained the best forest directly by setting doBest=T in the tuneRF call:

set.seed(123)
tunedrf2 <- tuneRF(Boston[-train, -14], boston.test, mtryStart = 2, ntreeTry = 500,
                   stepFactor = 1.5, improve = 0.01, trace = TRUE, plot = TRUE, doBest = T)
## mtry = 2  OOB error = 14.30229 
## Searching left ...
## Searching right ...
## mtry = 3     OOB error = 12.29387 
## 0.1404261 0.01 
## mtry = 4     OOB error = 11.0511 
## 0.1010886 0.01 
## mtry = 6     OOB error = 10.41273 
## 0.05776521 0.01 
## mtry = 9     OOB error = 10.98407 
## -0.05486934 0.01
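
With doBest=TRUE, tunedrf2 is a full randomForest object refit with the winning value of mtry (6 here), so it can be used like any other fitted forest, for example with predict():

class(tunedrf2)  # "randomForest"
tunedrf2$mtry    # 6, the value selected by the search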