randomForestSRC: rfsrc.anonymous – R documentation

rfsrc.anonymous

Anonymous Random Forests

Description

An anonymous random forest is grown like an ordinary random forest, but the returned object is carefully modified so that it does not save the original training data. This lets users share their forest with other researchers without having to share the original data.

Usage

rfsrc.anonymous(formula, data, forest = TRUE, ...)

Arguments

`formula`	A symbolic description of the model to be fit. If missing, unsupervised splitting is implemented.
`data`	Data frame containing the y-outcome and x-variables.
`forest`	Should the forest object be returned? Used for prediction on new data and required by many of the package functions.
`...`	Further arguments as in `rfsrc`. See the `rfsrc` help file for details.

Details

Calls rfsrc and returns an object with the training data removed so that users can share their forest while maintaining privacy of their data.

To predict on test data, however, certain minimal information from the training data must be saved. This includes the names of the original variables and, if factor variables are present, the levels of the factors. The topology of the grown trees is also saved, which includes, among other things, the split values used for splitting tree nodes.
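The sharing workflow described above can be sketched as follows. This is only an illustration, assuming randomForestSRC is installed; the temporary file stands in for whatever channel is actually used to pass the forest to a collaborator, and the `mtcars` split is arbitrary.

```r
# Sketch: share an anonymous forest as a file and predict from the restored copy.
# Assumption: the forest is exchanged as an .rds file (any serialization would do).
if (requireNamespace("randomForestSRC", quietly = TRUE)) {
  library(randomForestSRC)

  set.seed(1)
  train <- sample(seq_len(nrow(mtcars)), 20)

  # data owner: grow the forest and write it to disk
  o <- rfsrc.anonymous(mpg ~ ., mtcars[train, ])
  forest.file <- tempfile(fileext = ".rds")
  saveRDS(o, forest.file)

  # recipient: restore the forest and predict on their own data
  o.shared <- readRDS(forest.file)
  p <- predict(o.shared, mtcars[-train, ])
  print(p)
}
```

The recipient's data must use the same variable names (and, for factors, the same levels) as the training data, since that is the information the forest keeps.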

For maximum privacy, we recommend that variable names be made non-identifiable and that data be coerced to real values. If factors are required, the user should consider using non-identifiable factor levels. In all cases, however, it is the user's responsibility to de-identify their data and to check that data privacy holds. We provide no guarantees of this.
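One way to follow the naming advice above is sketched below in base R. The helper `deidentify` is hypothetical (it is not part of randomForestSRC), and the generic codes `v1, v2, ...` and `L1, L2, ...` are arbitrary choices. It does not by itself guarantee privacy.

```r
# Hypothetical helper: replace variable names and factor levels with generic codes.
deidentify <- function(data) {
  out <- data
  for (j in seq_along(out)) {
    if (is.factor(out[[j]])) {
      # keep the factor, but swap its levels for codes such as "L1", "L2"
      levels(out[[j]]) <- paste0("L", seq_along(levels(out[[j]])))
    }
  }
  # replace column names with "v1", "v2", ...
  names(out) <- paste0("v", seq_along(out))
  out
}

anon.iris <- deidentify(iris)

# private key mapping the generic names back to the originals (not to be shared)
key <- setNames(names(iris), names(anon.iris))
str(anon.iris)
```

The formula passed to `rfsrc.anonymous` then has to use the generic names, for example `v5 ~ .` in place of `Species ~ .`.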

While anonymous random forests work similarly to random forests, there are caveats to keep in mind. First, no missing data is allowed, since missing data imputation requires the training data. Second, while an anonymous forest tries to play nicely with the other functions in the package, it only works with functions that do not require the training data. Users should keep both points in mind before choosing this route.
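Because missing data is not allowed, it can help to check for it before growing the forest. The sketch below uses the base R `airquality` data, which contains missing values, and drops incomplete rows with `na.omit`; that is just one simple option, and other ways of handling missing values before the call are equally possible. The call to `rfsrc.anonymous` assumes randomForestSRC is installed.

```r
# airquality (base R datasets) has missing values in some columns
dat <- airquality
print(anyNA(dat))

# one simple option: keep complete cases only
dat.complete <- na.omit(dat)
stopifnot(!anyNA(dat.complete))

# grow the anonymous forest on the complete cases
if (requireNamespace("randomForestSRC", quietly = TRUE)) {
  library(randomForestSRC)
  print(rfsrc.anonymous(Ozone ~ ., dat.complete))
}
```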

Value

An object of class (rfsrc, grow, anonymous).

Author(s)

Hemant Ishwaran and Udaya B. Kogalur

Examples

## regression
print(rfsrc.anonymous(mpg ~ ., mtcars))

## plot anonymous regression tree (using get.tree)
## illustrates minimal information saved by the forest
plot(get.tree(rfsrc.anonymous(mpg ~ ., mtcars), 10))

## classification
print(rfsrc.anonymous(Species ~ ., iris))

## survival
data(veteran, package = "randomForestSRC")
print(rfsrc.anonymous(Surv(time, status) ~ ., data = veteran))

## competing risk
data(wihs, package = "randomForestSRC")
print(rfsrc.anonymous(Surv(time, status) ~ ., wihs, ntree = 100))

## unsupervised forests
print(rfsrc.anonymous(data = iris))

## multivariate regression
print(rfsrc.anonymous(Multivar(mpg, cyl) ~ ., data = mtcars))

##
## train/test setting but tricky because factor labels differ over
## training and test data
##

# first we convert all x-variables to factors
data(veteran, package = "randomForestSRC")
veteran.factor <- data.frame(lapply(veteran, factor))
veteran.factor$time <- veteran$time
veteran.factor$status <- veteran$status

# split the data into train/test data (50/50)
# the train/test data have the same levels, but different labels
train <- sample(1:nrow(veteran), round(nrow(veteran) * .5))
summary(veteran.factor[train,])
summary(veteran.factor[-train,])

# grow the forest on the training data and predict on the test data
v.grow <- rfsrc.anonymous(Surv(time, status) ~ ., veteran.factor[train, ])
v.pred <- predict(v.grow, veteran.factor[-train, ])
print(v.grow)
print(v.pred)

randomForestSRC

Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC)

v2.11.0

GPL (>= 3)

Authors

Hemant Ishwaran <hemant.ishwaran@gmail.com>, Udaya B. Kogalur <ubk@kogalur.com>

Initial release

2021-03-30