Importance of features in a model.
Creates a data.table of feature importances in a model.
xgb.importance(feature_names = NULL, model = NULL, trees = NULL, data = NULL, label = NULL, target = NULL)
feature_names: character vector of feature names. If the model already contains feature names, those would be used when feature_names = NULL (the default). A non-NULL feature_names can be provided to override the names stored in the model.
model: object of class xgb.Booster.
trees: (only for the gbtree booster) an integer vector of tree indices that should be included in the importance calculation. If set to NULL, all trees of the model are parsed. This can be useful, e.g., in multiclass classification to get feature importances for each class separately.
data: deprecated.
label: deprecated.
target: deprecated.
This function works for both linear and tree models.
For linear models, the importance is the absolute magnitude of the linear coefficients. To obtain a meaningful ranking by importance for a linear model, the features therefore need to be on the same scale (which you would also want anyway when using L1 or L2 regularization).
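As a minimal illustration of that point (not part of the package's shipped examples; the dataset and settings below are arbitrary), standardizing a dense feature matrix before fitting a gblinear booster makes the reported coefficient magnitudes comparable across features:

# illustrative sketch: scale() centers and standardizes each column, so the
# Weight values returned by xgb.importance() are comparable across features
x <- scale(as.matrix(mtcars[, -1]))
y <- as.numeric(mtcars$mpg > median(mtcars$mpg))
lin <- xgboost(data = x, label = y, booster = "gblinear",
               nthread = 1, nrounds = 20, objective = "binary:logistic")
xgb.importance(model = lin)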
For a tree model, a data.table with the following columns:
Features: names of the features used in the model;
Gain: fractional contribution of each feature to the model, based on the total gain of this feature's splits. A higher percentage means a more important predictive feature;
Cover: metric of the number of observations related to this feature;
Frequency: percentage representing the relative number of times a feature has been used in trees.
A linear model's importance data.table has the following columns:
Features: names of the features used in the model;
Weight: the linear coefficient of this feature;
Class: (only for multiclass models) class label.
If feature_names is not provided and the model does not have feature names, the index of each feature is used instead. Because the index is extracted from the model dump (which is produced by the C++ code), it starts at 0 (as in C/C++ or Python) rather than 1 (as usual in R).
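A short sketch of this behavior (the matrix and names below are made up for illustration): training on a matrix without column names yields zero-based feature indices, and passing feature_names restores readable output.

# illustrative sketch: an unnamed matrix, so the model stores no feature names
x <- matrix(rnorm(400), ncol = 4)
y <- as.numeric(x[, 1] + rnorm(100) > 0)
b <- xgboost(data = x, label = y, max_depth = 2, nthread = 1, nrounds = 3,
             objective = "binary:logistic", verbose = 0)
xgb.importance(model = b)                                    # features reported as "0", "1", ...
xgb.importance(feature_names = paste0("x", 1:4), model = b)  # readable names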
# binomial classification using gbtree:
data(agaricus.train, package='xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
xgb.importance(model = bst)
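The resulting table can be passed to the package's plotting helper for a quick visual ranking (a small sketch; xgb.plot.importance() is exported by xgboost and, for tree models, typically ranks features by Gain):

# visualize the importance of the gbtree model fitted above
imp <- xgb.importance(model = bst)
xgb.plot.importance(imp)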
# binomial classification using gblinear:
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, booster = "gblinear",
eta = 0.3, nthread = 1, nrounds = 20, objective = "binary:logistic")
xgb.importance(model = bst)
# multiclass classification using gbtree:
nclass <- 3
nrounds <- 10
mbst <- xgboost(data = as.matrix(iris[, -5]), label = as.numeric(iris$Species) - 1,
max_depth = 3, eta = 0.2, nthread = 2, nrounds = nrounds,
objective = "multi:softprob", num_class = nclass)
# all classes clumped together:
xgb.importance(model = mbst)
# inspect importances separately for each class:
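# (each boosting round adds one tree per class, in class order, so tree indices
#  0, nclass, 2*nclass, ... belong to the first class, 1, nclass+1, ... to the second, etc.)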
xgb.importance(model = mbst, trees = seq(from=0, by=nclass, length.out=nrounds))
xgb.importance(model = mbst, trees = seq(from=1, by=nclass, length.out=nrounds))
xgb.importance(model = mbst, trees = seq(from=2, by=nclass, length.out=nrounds))
# multiclass classification using gblinear:
mbst <- xgboost(data = scale(as.matrix(iris[, -5])), label = as.numeric(iris$Species) - 1,
booster = "gblinear", eta = 0.2, nthread = 1, nrounds = 15,
objective = "multi:softprob", num_class = nclass)
xgb.importance(model = mbst)