Tune Random Forest for the optimal mtry and nodesize parameters
Finds the optimal mtry and nodesize tuning parameters for a random forest using out-of-bag (OOB) error. Applies to all families.
## S3 method for class 'rfsrc'
tune(formula, data,
  mtryStart = ncol(data) / 2,
  nodesizeTry = c(1:9, seq(10, 100, by = 5)), ntreeTry = 50,
  sampsize = function(x){min(x * .632, max(150, x ^ (3/4)))},
  nsplit = 10, stepFactor = 1.25, improve = 1e-3, strikeout = 3, maxIter = 25,
  trace = FALSE, doBest = TRUE, ...)

## S3 method for class 'rfsrc'
tune.nodesize(formula, data,
  nodesizeTry = c(1:9, seq(10, 150, by = 5)),
  sampsize = function(x){min(x * .632, max(150, x ^ (4/5)))},
  nsplit = 1, trace = FALSE, ...)
formula | A symbolic description of the model to be fit.
data | Data frame containing the y-outcome and x-variables.
mtryStart | Starting value of mtry.
nodesizeTry | Values of nodesize optimized over.
ntreeTry | Number of trees used for the tuning step.
sampsize | Function specifying requested size of subsampled data. Can also be passed in as a number.
nsplit | Number of random splits used for splitting.
stepFactor | At each iteration, mtry is inflated (or deflated) by this value.
improve | The (relative) improvement in OOB error must be by this much for the search to continue.
strikeout | The search is discontinued when the relative improvement in OOB error is negative. However, strikeout allows for some tolerance: the search is stopped only once a negative improvement has been noted strikeout times.
maxIter | The maximum number of iterations allowed for each mtry bisection search.
trace | Print the progress of the search?
doBest | Return a forest fit with the optimal mtry and nodesize parameters?
... | Further options to be passed to rfsrc.fast.
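The interaction of stepFactor, improve, strikeout and maxIter may be easier to see in a small sketch. The code below is a simplified illustration under stated assumptions, not the package's implementation: toy_oob_error and step_search are hypothetical names, the error curve is made up rather than computed from a forest, and counting negative improvements as "strikes" is an assumption about how strikeout is applied.

```r
## Toy OOB error curve over mtry (hypothetical; stands in for a forest fit).
toy_oob_error <- function(mtry) (mtry - 6)^2 / 100 + 0.20

## Simplified one-directional step search (an assumption-laden sketch,
## not the code used by tune).
step_search <- function(err_fn, mtryStart, mtryMax,
                        direction = c("right", "left"),
                        stepFactor = 1.25, improve = 1e-3,
                        strikeout = 3, maxIter = 25) {
  direction <- match.arg(direction)
  mtry_cur <- mtryStart
  best <- list(mtry = mtryStart, err = err_fn(mtryStart))
  strikes <- 0
  for (iter in seq_len(maxIter)) {
    ## inflate or deflate mtry, always moving by at least one
    mtry_new <- if (direction == "right") {
      min(mtryMax, max(floor(mtry_cur * stepFactor), mtry_cur + 1))
    } else {
      max(1, min(ceiling(mtry_cur / stepFactor), mtry_cur - 1))
    }
    if (mtry_new == mtry_cur) break        # reached a boundary
    mtry_cur <- mtry_new
    err_new <- err_fn(mtry_cur)
    rel_improve <- 1 - err_new / best$err  # relative improvement in error
    if (rel_improve > improve) {
      best <- list(mtry = mtry_cur, err = err_new)
    } else if (rel_improve < 0) {
      strikes <- strikes + 1               # tolerate a few bad steps
      if (strikes >= strikeout) break
    } else {
      break                                # improvement too small to continue
    }
  }
  best
}

right <- step_search(toy_oob_error, mtryStart = 3, mtryMax = 12, direction = "right")
left  <- step_search(toy_oob_error, mtryStart = 3, mtryMax = 12, direction = "left")
print(c(right = right$mtry, left = left$mtry))
```

In this toy setting the rightward search keeps stepping past the minimum until it has accumulated strikeout negative improvements, which is why a larger strikeout makes the search more exhaustive and slower.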
tune returns a matrix whose first and second columns contain the nodesize and mtry values searched and whose third column is the corresponding OOB error. The OOB error is standardized: for multivariate forests it is the standardized OOB error averaged over the outcomes, and for competing risks it is the standardized OOB error averaged over the event types. If doBest=TRUE, also returns a forest object fit using the optimal mtry and nodesize values.
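As a minimal sketch of how the three-column layout described above can be read, the following picks out the row with the smallest OOB error. The matrix here is hypothetical, with made-up values and column names chosen only for illustration.

```r
## Hypothetical results matrix in the layout described above:
## column 1 = nodesize, column 2 = mtry, column 3 = OOB error.
results <- cbind(nodesize = c(1, 1, 5, 5, 10),
                 mtry     = c(2, 4, 2, 4, 4),
                 err      = c(0.31, 0.28, 0.30, 0.26, 0.29))

## Row with the smallest OOB error
best <- results[which.min(results[, 3]), ]
print(best)
```

With a real tuning result o, the same indexing would presumably be applied to o$results, which is how the example on this page accesses the matrix.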
All calculations (including the final optimized forest) are based on the fast forest interface rfsrc.fast, which utilizes subsampling. However, while this yields a fast optimization strategy, such a solution can only be considered approximate. Users may wish to tweak various options to improve accuracy. Increasing the default sampsize will definitely help. Increasing ntreeTry (which is set to 50 for speed) may also help. It is also useful to look at contour plots of the OOB error as a function of mtry and nodesize (see example below) to identify regions of the parameter space where the error rate is small.
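To get a feel for how much subsampling the defaults imply, the sketch below evaluates the two default sampsize functions, copied from the signatures above, at a few sample sizes. The helper names sampsize_tune and sampsize_nodesize and the chosen values of n are illustrative assumptions.

```r
## Default sampsize functions copied from the tune and tune.nodesize signatures
sampsize_tune     <- function(x) min(x * .632, max(150, x^(3/4)))
sampsize_nodesize <- function(x) min(x * .632, max(150, x^(4/5)))

## Illustrative sample sizes (arbitrary choices)
n <- c(100, 1000, 10000, 100000)

sizes <- data.frame(n = n,
                    tune = sapply(n, sampsize_tune),
                    tune.nodesize = sapply(n, sampsize_nodesize),
                    frac_632 = n * .632)
print(sizes)
```

For larger n the default subsample is a small fraction of the data, so passing a bigger value, for example sampsize = function(x) x * .632 or a fixed number, trades speed for a closer approximation.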
tune.nodesize returns the optimal nodesize where optimization is over nodesize only.
Hemant Ishwaran and Udaya B. Kogalur
## ------------------------------------------------------------
## White wine classification example
## ------------------------------------------------------------

## load the data
data(wine, package = "randomForestSRC")
wine$quality <- factor(wine$quality)

## default tuning call
o <- tune(quality ~ ., wine)

## here is the optimized forest
print(o$rf)

## visualize the nodesize/mtry OOB surface
if (library("akima", logical.return = TRUE)) {

  ## nice little wrapper for plotting results
  plot.tune <- function(o, linear = TRUE) {
    x <- o$results[,1]
    y <- o$results[,2]
    z <- o$results[,3]
    so <- interp(x=x, y=y, z=z, linear = linear)
    idx <- which.min(z)
    x0 <- x[idx]
    y0 <- y[idx]
    filled.contour(x = so$x,
                   y = so$y,
                   z = so$z,
                   xlim = range(so$x, finite = TRUE) + c(-2, 2),
                   ylim = range(so$y, finite = TRUE) + c(-2, 2),
                   color.palette = colorRampPalette(c("yellow", "red")),
                   xlab = "nodesize",
                   ylab = "mtry",
                   main = "OOB error for nodesize and mtry",
                   key.title = title(main = "OOB error", cex.main = 1),
                   plot.axes = {axis(1);axis(2);points(x0,y0,pch="x",cex=1,font=2);
                                points(x,y,pch=16,cex=.25)})
  }

  ## plot the surface
  plot.tune(o)

}

## ------------------------------------------------------------
## tune nodesize for competing risk - wihs data
## ------------------------------------------------------------

data(wihs, package = "randomForestSRC")
plot(tune.nodesize(Surv(time, status) ~ ., wihs, trace = TRUE)$err)