Prediction for Random Forests for Survival, Regression, and Classification
Obtain predicted values using a forest. Also returns performance values if the test data contains y-outcomes.
## S3 method for class 'rfsrc'
predict(object, newdata,
  m.target = NULL,
  importance = c(FALSE, TRUE, "none", "permute", "random", "anti"),
  get.tree = NULL,
  block.size = if (any(is.element(as.character(importance),
      c("none", "FALSE")))) NULL else 10,
  ensemble = NULL,
  na.action = c("na.omit", "na.impute"),
  outcome = c("train", "test"),
  proximity = FALSE,
  forest.wt = FALSE,
  ptn.count = 0,
  distance = FALSE,
  var.used = c(FALSE, "all.trees", "by.tree"),
  split.depth = c(FALSE, "all.trees", "by.tree"),
  seed = NULL,
  do.trace = FALSE,
  membership = FALSE,
  statistics = FALSE,
  ...)
object
  An object of class (rfsrc, grow).

newdata
  Test data. If missing, the original grow (training) data is used.

m.target
  Character vector for multivariate families specifying the target
  outcomes to be used. The default is to use all coordinates.

importance
  Method for computing variable importance (VIMP) when the test data
  contains y-outcomes. The default is "none". Also see vimp for more
  flexibility, including joint VIMP calculations.

get.tree
  Vector of integer(s) identifying the trees over which the
  ensembles are calculated. By default, all trees in the forest are
  used. As an example, the user can extract the ensemble, the VIMP,
  or the proximity from a single tree (or several trees).
block.size
  Should the error rate be calculated on every tree? When NULL, the
  error rate is returned only for the last tree. To view the error
  rate on every nth tree, set the value to an integer between 1 and
  ntree. If importance is requested, VIMP is calculated in "blocks"
  of size equal to block.size.

ensemble
  Optional parameter for specifying the type of ensemble. Can be
  "all", "oob", or "inbag".

na.action
  Missing value action. The default "na.omit" removes the entire
  record if any of its entries are missing; "na.impute" imputes the
  test data.

outcome
  Determines whether the y-outcomes from the training data or the
  test data are used to calculate the predicted value. The default
  and natural choice is "train", which uses the original training
  data.

proximity
  Should proximity between test observations be calculated? Possible
  choices are "inbag", "oob", "all", TRUE, or FALSE.
distance
  Should distance between test observations be calculated? Possible
  choices are "inbag", "oob", "all", TRUE, or FALSE.

forest.wt
  Should the forest weight matrix for test observations be
  calculated? Choices are the same as for proximity.

ptn.count
  The number of terminal nodes that each tree in the grow forest
  should be pruned back to. The terminal node membership for the
  pruned forest is returned, but no other action is taken. The
  default, ptn.count = 0, does no pruning.
var.used
  Record the number of times a variable is split? Possible choices
  are FALSE, "all.trees", or "by.tree".

split.depth
  Return minimal depth for each variable for each case? Possible
  choices are FALSE, "all.trees", or "by.tree".

seed
  Negative integer specifying the seed for the random number
  generator.

do.trace
  Number of seconds between updates to the user on approximate time
  to completion.

membership
  Should terminal node membership and inbag information be returned?

statistics
  Should split statistics be returned? Values can be parsed using
  stat.split.

...
  Further arguments passed to or from other methods.
Predicted values are obtained by "dropping" test data down the training forest (the forest grown using the training data). Performance values are returned if test data contains y-outcome values. Single as well as joint VIMP are also returned if requested.
Setting na.action="na.impute" imputes missing test data (x-variables and/or y-outcomes). Test imputation uses only the grow-forest and training data to avoid biasing error rates and VIMP (Ishwaran et al. 2008). Also see the function impute for an alternative way to do fast and accurate imputation.
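Both routes for handling missing test values can be sketched as follows; this is a minimal illustration assuming the randomForestSRC package (with its impute helper) is installed, using the built-in airquality data:

```r
library(randomForestSRC)

## grow a forest on data with missing values, imputing during training
o <- rfsrc(Ozone ~ ., data = airquality, na.action = "na.impute")

## route 1: impute missing test values on the fly while predicting
p <- predict(o, airquality, na.action = "na.impute")

## route 2: impute the data up front with impute(), then predict
airq.imp <- impute(Ozone ~ ., data = airquality)
p2 <- predict(o, airq.imp)
```

Route 1 is convenient for one-off predictions; route 2 is preferable when the same imputed data will be reused across several calls.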
If no test data is provided, the original training data is used and the code reverts to restore mode, allowing the user to restore the original grow forest. This is useful because it gives the user the ability to extract outputs from the forest that were not requested in the original grow call.
If outcome="test", the predictor is calculated by using y-outcomes from the test data (outcome information must be present). Terminal nodes from the grow-forest are recalculated using y-outcomes from the test set. This yields a modified predictor in which the topology of the forest is based solely on the training data, but where predicted values are obtained from test data. Error rates and VIMP are calculated by bootstrapping the test data and using out-of-bagging to ensure unbiased estimates. See the examples below for illustration.
Use option csv=TRUE to request case-specific VIMP and cse=TRUE to request case-specific error rates. These apply to all families except survival families. See the examples below. Both options can also be applied at the grow stage.
An object of class (rfsrc, predict), which is a list with the following components:
call
  The original grow call to rfsrc.

family
  The family used in the analysis.

n
  Sample size of the test data (depends upon NA values).

ntree
  Number of trees in the grow forest.

yvar
  Test set y-outcomes, or the original grow y-outcomes if none.

yvar.names
  A character vector of the y-outcome names.

xvar
  Data frame of test set x-variables.

xvar.names
  A character vector of the x-variable names.

leaf.count
  Number of terminal nodes for each tree in the grow forest. Vector
  of length ntree.

proximity
  Symmetric proximity matrix of the test data.

forest
  The grow forest.
membership
  Matrix recording terminal node membership for the test data, where
  each column contains the node number that a case falls in for that
  tree.

inbag
  Matrix recording inbag membership for the test data, where each
  column contains the number of times that a case appears in the
  bootstrap sample for that tree.

var.used
  Count of the number of times a variable was used in growing the
  forest.

imputed.indv
  Vector of indices of records in the test data with missing values.

imputed.data
  Data frame of the imputed test data. The first columns are the
  y-outcomes, followed by the x-variables.

split.depth
  Matrix [i][j] or array [i][j][k] recording the minimal depth for
  variable [j] for case [i], either averaged over the forest, or by
  tree [k].

node.stats
  Split statistics returned when statistics = TRUE; can be parsed
  using stat.split.
err.rate
  Cumulative OOB error rate for the test data if y-outcomes are
  present.

importance
  Test set variable importance (VIMP). Can be NULL if importance was
  not requested.

predicted
  Test set predicted value.

predicted.oob
  OOB predicted value (NULL unless outcome = "test").

quantile
  Quantile value at the probabilities requested.

quantile.oob
  OOB quantile value at the probabilities requested (NULL unless
  outcome = "test").
++++++++ for classification settings, additionally ++++++++

class
  In-bag predicted class labels.

class.oob
  OOB predicted class labels (NULL unless outcome = "test").
++++++++ for multivariate settings, additionally ++++++++

regrOutput
  List containing performance values for test multivariate
  regression responses.

clasOutput
  List containing performance values for test multivariate
  categorical (factor) responses.
++++++++ for survival settings, additionally ++++++++

chf
  Cumulative hazard function (CHF).

chf.oob
  OOB CHF (NULL unless outcome = "test").

survival
  Survival function.

survival.oob
  OOB survival function (NULL unless outcome = "test").

time.interest
  Ordered unique death times.

ndead
  Number of deaths.
++++++++ for competing risks, additionally ++++++++

chf
  Cause-specific cumulative hazard function (CSCHF) for each event.

chf.oob
  OOB CSCHF for each event (NULL unless outcome = "test").

cif
  Cumulative incidence function (CIF) for each event.

cif.oob
  OOB CIF for each event (NULL unless outcome = "test").

time.interest
  Ordered unique event times.

ndead
  Number of events.
The dimensions and values of returned objects depend heavily on the underlying family and on whether y-outcomes are present in the test data. In particular, items related to performance will be NULL when y-outcomes are not present. For multivariate families, predicted values, VIMP, error rate, and performance values are stored in the lists regrOutput and clasOutput, which can be extracted using the functions get.mv.error, get.mv.predicted, and get.mv.vimp.
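The extraction helpers above can be sketched as follows; a minimal illustration assuming randomForestSRC is installed, using mtcars purely for demonstration with mpg (continuous) and cyl (factor) as a mixed multivariate outcome:

```r
library(randomForestSRC)

## multivariate forest with a continuous and a categorical outcome
mtcars.mod <- mtcars
mtcars.mod$cyl <- factor(mtcars.mod$cyl)
o <- rfsrc(Multivar(mpg, cyl) ~ ., data = mtcars.mod, importance = TRUE)

## standardized error rates, OOB predicted values, and VIMP,
## gathered across the regression and classification outcomes
err  <- get.mv.error(o, standardize = TRUE)
pred <- get.mv.predicted(o, oob = TRUE)
vmp  <- get.mv.vimp(o, standardize = TRUE)
print(err)
```

The same helpers apply to a (rfsrc, predict) object, so test-set performance can be pulled out in one call instead of walking regrOutput and clasOutput by hand.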
Hemant Ishwaran and Udaya B. Kogalur
Breiman L. (2001). Random forests, Machine Learning, 45:5-32.
Ishwaran H., Kogalur U.B., Blackstone E.H. and Lauer M.S. (2008). Random survival forests, Ann. App. Statist., 2:841-860.
Ishwaran H. and Kogalur U.B. (2007). Random survival forests for R, Rnews, 7(2):25-31.
## ------------------------------------------------------------
## typical train/testing scenario
## ------------------------------------------------------------

data(veteran, package = "randomForestSRC")
train <- sample(1:nrow(veteran), round(nrow(veteran) * 0.80))
veteran.grow <- rfsrc(Surv(time, status) ~ ., veteran[train, ], ntree = 100)
veteran.pred <- predict(veteran.grow, veteran[-train, ])
print(veteran.grow)
print(veteran.pred)

## ------------------------------------------------------------
## example illustrating restore mode
## - if predict is called without specifying the test data
##   the original training data is used and the forest is restored
## ------------------------------------------------------------

## first we make the grow call
airq.obj <- rfsrc(Ozone ~ ., data = airquality)

## now we restore it and compare it to the original call
## they are identical
predict(airq.obj)
print(airq.obj)

## we can retrieve various outputs that were not asked for
## in the original call

## here we extract the proximity matrix
prox <- predict(airq.obj, proximity = TRUE)$proximity
print(prox[1:10, 1:10])

## here we extract the number of times a variable was used to grow
## the grow forest
var.used <- predict(airq.obj, var.used = "by.tree")$var.used
print(head(var.used))

## ------------------------------------------------------------
## vimp for each tree
## illustrates get.tree and how to extract information
## from trees, even if that information was not requested
## in the original call
## ------------------------------------------------------------

## regression analysis but no VIMP
o <- rfsrc(mpg ~ ., mtcars)

## now extract VIMP for each tree using get.tree
vimp.tree <- do.call(rbind, lapply(1:o$ntree, function(b) {
  predict(o, get.tree = b, importance = TRUE)$importance
}))

## boxplot of tree VIMP
boxplot(vimp.tree, outline = FALSE, col = "cyan")
abline(h = 0, lty = 2, col = "red")

## summary information of tree VIMP
print(summary(vimp.tree))

## extract tree-averaged VIMP using importance=TRUE
## remember to set block.size to 1
print(predict(o, importance = TRUE, block.size = 1)$importance)

## use direct call to vimp() for tree-averaged VIMP
print(vimp(o, block.size = 1)$importance)

## ------------------------------------------------------------
## case-specific vimp
## returns VIMP for each case
## ------------------------------------------------------------

o <- rfsrc(mpg ~ ., mtcars)
op <- predict(o, importance = TRUE, csv = TRUE)
csvimp <- get.mv.csvimp(op, standardize = TRUE)
print(csvimp)

## ------------------------------------------------------------
## case-specific error rate
## returns tree-averaged error rate for each case
## ------------------------------------------------------------

o <- rfsrc(mpg ~ ., mtcars)
op <- predict(o, importance = TRUE, cse = TRUE)
cserror <- get.mv.cserror(op, standardize = TRUE)
print(cserror)

## ------------------------------------------------------------
## predicted probability and predicted class labels are returned
## in the predict object for classification analyses
## ------------------------------------------------------------

data(breast, package = "randomForestSRC")
breast.obj <- rfsrc(status ~ ., data = breast[(1:100), ])
breast.pred <- predict(breast.obj, breast[-(1:100), ])
print(head(breast.pred$predicted))
print(breast.pred$class)

## ------------------------------------------------------------
## unique feature of randomForestSRC
## cross-validation can be used when factor labels differ over
## training and test data
## ------------------------------------------------------------

## first we convert all x-variables to factors
data(veteran, package = "randomForestSRC")
veteran.factor <- data.frame(lapply(veteran, factor))
veteran.factor$time <- veteran$time
veteran.factor$status <- veteran$status

## split the data into unbalanced train/test data (5/95)
## the train/test data have the same levels, but different labels
train <- sample(1:nrow(veteran), round(nrow(veteran) * .05))
summary(veteran.factor[train, ])
summary(veteran.factor[-train, ])

## grow the forest on the training data and predict on the test data
veteran.f.grow <- rfsrc(Surv(time, status) ~ ., veteran.factor[train, ])
veteran.f.pred <- predict(veteran.f.grow, veteran.factor[-train, ])
print(veteran.f.grow)
print(veteran.f.pred)

## ------------------------------------------------------------
## example illustrating the flexibility of outcome = "test"
## illustrates restoration of forest via outcome = "test"
## ------------------------------------------------------------

## first we make the grow call
data(pbc, package = "randomForestSRC")
pbc.grow <- rfsrc(Surv(days, status) ~ ., pbc)

## now use predict with outcome = "test"
pbc.pred <- predict(pbc.grow, pbc, outcome = "test")

## notice that error rates are the same!!
print(pbc.grow)
print(pbc.pred)

## note this is equivalent to restoring the forest
pbc.pred2 <- predict(pbc.grow)
print(pbc.grow)
print(pbc.pred)
print(pbc.pred2)

## similar example, but with na.action = "na.impute"
airq.obj <- rfsrc(Ozone ~ ., data = airquality, na.action = "na.impute")
print(airq.obj)
print(predict(airq.obj))

## ... also equivalent to outcome = "test", but na.action = "na.impute" required
print(predict(airq.obj, airquality, outcome = "test", na.action = "na.impute"))

## classification example
iris.obj <- rfsrc(Species ~ ., data = iris)
print(iris.obj)
print(predict.rfsrc(iris.obj, iris, outcome = "test"))

## ------------------------------------------------------------
## another example illustrating outcome = "test"
## unique way to check reproducibility of the forest
## ------------------------------------------------------------

## primary call
set.seed(542899)
data(pbc, package = "randomForestSRC")
train <- sample(1:nrow(pbc), round(nrow(pbc) * 0.50))
pbc.out <- rfsrc(Surv(days, status) ~ ., data = pbc[train, ])

## standard predict call
pbc.train <- predict(pbc.out, pbc[-train, ], outcome = "train")

## non-standard predict call: overlays the test data on the grow forest
pbc.test <- predict(pbc.out, pbc[-train, ], outcome = "test")

## check forest reproducibility by comparing "test" predicted survival
## curves to "train" predicted survival curves for the first 3 individuals
Time <- pbc.out$time.interest
matplot(Time, t(pbc.train$survival[1:3, ]), ylab = "Survival", col = 1, type = "l")
matlines(Time, t(pbc.test$survival[1:3, ]), col = 2)

## ------------------------------------------------------------
## ... just for _fun_ ...
## survival analysis using mixed multivariate outcome analysis
## compare the predicted value to RSF
## ------------------------------------------------------------

## fit the pbc data using RSF
data(pbc, package = "randomForestSRC")
rsf.obj <- rfsrc(Surv(days, status) ~ ., pbc)
yvar <- rsf.obj$yvar

## fit a mixed outcome forest using days and status as y-variables
pbc.mod <- pbc
pbc.mod$status <- factor(pbc.mod$status)
mix.obj <- rfsrc(Multivar(days, status) ~ ., pbc.mod)

## compare oob predicted values
rsf.pred <- rsf.obj$predicted.oob
mix.pred <- mix.obj$regrOutput$days$predicted.oob
plot(rsf.pred, mix.pred)

## compare C-index error rate
rsf.err <- get.cindex(yvar$days, yvar$status, rsf.pred)
mix.err <- 1 - get.cindex(yvar$days, yvar$status, mix.pred)
cat("RSF :", rsf.err, "\n")
cat("multivariate forest:", mix.err, "\n")