yardstick: classification_cost – R documentation

classification_cost

Costs function for poor classification

Description

classification_cost() calculates the cost of a poor prediction based on user-defined costs. The costs are multiplied by the estimated class probabilities and the mean cost is returned.

Usage

classification_cost(data, ...)

## S3 method for class 'data.frame'
classification_cost(
  data,
  truth,
  ...,
  costs = NULL,
  na_rm = TRUE,
  event_level = yardstick_event_level()
)

classification_cost_vec(
  truth,
  estimate,
  costs = NULL,
  na_rm = TRUE,
  event_level = yardstick_event_level(),
  ...
)

Arguments

`data`	A `data.frame` containing the `truth` column and the class probability columns selected by `...`.
`...`	A set of unquoted column names or one or more `dplyr` selector functions to choose which variables contain the class probabilities. If `truth` is binary, only 1 column should be selected. Otherwise, there should be as many columns as factor levels of `truth`.
`truth`	The column identifier for the true class results (that is a `factor`). This should be an unquoted column name although this argument is passed by expression and supports quasiquotation (you can unquote column names). For `_vec()` functions, a `factor` vector.
`costs`	A data frame with columns `"truth"`, `"estimate"`, and `"cost"`. `"truth"` and `"estimate"` should be character columns containing unique combinations of the levels of the `truth` factor. `"cost"` should be a numeric column representing the cost that is applied when `"estimate"` is predicted but the true result is `"truth"`. It is often the case that when `"truth" == "estimate"`, the cost is zero (no penalty for correct predictions). If any combinations of the levels of `truth` are missing, their costs are assumed to be zero. If `NULL`, equal costs are used, applying a cost of `0` to correct predictions and a cost of `1` to incorrect predictions.
`na_rm`	A `logical` value indicating whether `NA` values should be stripped before the computation proceeds.
`event_level`	A single string. Either `"first"` or `"second"` to specify which level of `truth` to consider as the "event". This argument is only applicable when `truth` is binary (that is, when the `"binary"` estimator is used). The default uses an internal helper that generally defaults to `"first"`; however, if the deprecated global option `yardstick.event_first` is set, that will be used instead with a warning.
`estimate`	If `truth` is binary, a numeric vector of class probabilities corresponding to the "relevant" class. Otherwise, a matrix with as many columns as factor levels of `truth`. It is assumed that these are in the same order as the levels of `truth`.
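
To make the `costs` layout concrete, the following base R sketch builds a table matching what the `NULL` default is described as using: a cost of `0` for correct predictions and `1` for every misclassification. The helper name `default_costs` and the levels "A", "B", and "C" are hypothetical and not part of yardstick; the sketch only illustrates the expected columns.

```r
# Hypothetical helper (not part of yardstick): builds an equal-cost table
# with the "truth", "estimate", and "cost" columns described above.
default_costs <- function(lvls) {
  # Every truth/estimate combination of the factor levels, as character columns
  grid <- expand.grid(
    truth = lvls,
    estimate = lvls,
    stringsAsFactors = FALSE
  )
  # 0 for correct predictions, 1 for any misclassification
  grid$cost <- as.numeric(grid$truth != grid$estimate)
  grid
}

costs <- default_costs(c("A", "B", "C"))
costs
```

A table like this can be edited row by row to raise the cost of specific errors before passing it as `costs`.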

Details

As an example, suppose that there are three classes: "A", "B", and "C". Suppose there is a truly "A" observation with class probabilities A = 0.3 / B = 0.3 / C = 0.4. Suppose that, when the true result is class "A", the costs for each class were A = 0 / B = 5 / C = 10, penalizing the probability of incorrectly predicting "C" more than predicting "B". The cost for this prediction would be 0.3 * 0 + 0.3 * 5 + 0.4 * 10 = 5.5. This calculation is done for each sample and the individual costs are averaged.
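
The calculation above can be reproduced by hand in base R. This is a minimal sketch of the arithmetic only, not yardstick's implementation. The first observation and the cost row for true class "A" come from the text; the other observations and cost rows are made-up values for illustration.

```r
# Class probabilities, one row per observation, columns in level order A, B, C.
# Row 1 is the observation from the text; rows 2 and 3 are illustrative.
probs <- matrix(
  c(0.3, 0.3, 0.4,
    0.1, 0.8, 0.1,
    0.2, 0.2, 0.6),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("A", "B", "C"))
)
truth <- factor(c("A", "B", "C"), levels = c("A", "B", "C"))

# cost_mat[t, e] is the cost of predicting class e when the truth is class t.
# Row "A" is from the text; rows "B" and "C" are assumptions.
cost_mat <- matrix(
  c(0, 5, 10,
    1, 0,  1,
    2, 2,  0),
  ncol = 3, byrow = TRUE,
  dimnames = list(truth = c("A", "B", "C"), estimate = c("A", "B", "C"))
)

# Weight each observation's probabilities by the cost row of its true class
per_obs <- rowSums(probs * cost_mat[as.character(truth), , drop = FALSE])

per_obs[1]     # 0.3 * 0 + 0.3 * 5 + 0.4 * 10 = 5.5
mean(per_obs)  # average of the individual costs
```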

Value

A tibble with columns .metric, .estimator, and .estimate and 1 row of values.

For grouped data frames, the number of rows returned will be the same as the number of groups.

For classification_cost_vec(), a single numeric value (or NA).

Author(s)

Max Kuhn

Examples

library(yardstick)
library(dplyr)

# ---------------------------------------------------------------------------
# Two class example
data(two_class_example)

# Assuming `Class1` is our "event", this penalizes false positives heavily
costs1 <- tribble(
  ~truth,   ~estimate, ~cost,
  "Class1", "Class2",  1,
  "Class2", "Class1",  2
)

# Assuming `Class1` is our "event", this penalizes false negatives heavily
costs2 <- tribble(
  ~truth,   ~estimate, ~cost,
  "Class1", "Class2",  2,
  "Class2", "Class1",  1
)

classification_cost(two_class_example, truth, Class1, costs = costs1)

classification_cost(two_class_example, truth, Class1, costs = costs2)

# ---------------------------------------------------------------------------
# Multiclass
data(hpc_cv)

# Define cost matrix from Kuhn and Johnson (2013)
hpc_costs <- tribble(
   ~estimate, ~truth, ~cost,
   "VF",   "VF",     0,
   "VF",    "F",     1,
   "VF",    "M",     5,
   "VF",    "L",    10,
   "F",    "VF",     1,
   "F",     "F",     0,
   "F",     "M",     5,
   "F",     "L",     5,
   "M",    "VF",     1,
   "M",     "F",     1,
   "M",     "M",     0,
   "M",     "L",     1,
   "L",    "VF",     1,
   "L",     "F",     1,
   "L",     "M",     1,
   "L",     "L",     0
)

# You can use the col1:colN tidyselect syntax
hpc_cv %>%
  filter(Resample == "Fold01") %>%
  classification_cost(obs, VF:L, costs = hpc_costs)

# Groups are respected
hpc_cv %>%
  group_by(Resample) %>%
  classification_cost(obs, VF:L, costs = hpc_costs)

yardstick

Tidy Characterizations of Model Performance

v0.0.8

MIT + file LICENSE

Authors

Max Kuhn [aut], Davis Vaughan [aut, cre], RStudio [cph]
