
undersample_tomek

Undersample a dataset by removing Tomek links.


Description

A Tomek link is a pair of instances, one from a minority class and one from the majority class, that are each other's nearest neighbors. This function removes enough instances of cls that belong to Tomek links to leave m instances of cls. If the data contain too few Tomek links, the remaining samples can optionally be discarded at random to yield m rows (see force_m).
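The mutual-nearest-neighbor condition can be illustrated in base R. This is a minimal sketch, not scutr's implementation: find_tomek_links, x, and is_major are hypothetical names introduced here, and ties in distance are broken by taking the first minimum.

```r
# Hypothetical helper (not part of scutr): return pairs of rows that are each
# other's nearest neighbor and sit on opposite sides of the majority/minority split.
find_tomek_links <- function(x, is_major, method = "euclidean") {
  d <- as.matrix(dist(x, method = method))
  diag(d) <- Inf                          # a point is not its own neighbor
  nn <- unname(apply(d, 1, which.min))    # nearest neighbor of each row
  idx <- seq_len(nrow(d))
  mutual <- nn[nn] == idx                 # i and nn[i] point at each other
  mixed <- is_major != is_major[nn]       # one majority, one minority instance
  keep <- mutual & mixed & idx < nn       # report each pair once
  cbind(i = idx[keep], j = nn[keep])
}

# Toy one-dimensional data: rows 1-3 are majority, rows 4-5 are minority.
x <- matrix(c(0, 1, 5, 5.5, 10), ncol = 1)
is_major <- c(TRUE, TRUE, TRUE, FALSE, FALSE)
links <- find_tomek_links(x, is_major)
links  # the only Tomek link is rows 3 and 4
```

Rows 1 and 2 are also mutual nearest neighbors, but both are majority instances, so they do not form a Tomek link.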

Usage

undersample_tomek(
  data,
  cls,
  cls_col,
  m,
  tomek = "minor",
  force_m = TRUE,
  dist_calc = "euclidean"
)

Arguments

data

Dataset to be undersampled.

cls

Majority class to be undersampled.

cls_col

Column in data containing class memberships.

m

Desired number of samples in undersampled dataset.

tomek

Rule used to decide whether a point counts as a minority instance when identifying Tomek links.

  • minor: Minor classes are all those with fewer than m instances.

  • diff: Minor classes are all those that aren't cls.
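The two rules can give very different sets of minor classes. The base R sketch below applies each rule as worded above to iris with cls = "setosa" and m = 15; it is one reading of the documented definitions, not code taken from scutr, and the variable names are hypothetical.

```r
counts <- table(iris$Species)   # 50 instances of each species
cls <- "setosa"
m <- 15

# "minor": classes with fewer than m instances
minor_by_count <- names(counts)[counts < m]

# "diff": every class other than cls
minor_by_diff <- setdiff(names(counts), cls)

minor_by_count  # empty: no species has fewer than 15 instances
minor_by_diff   # "versicolor" "virginica"
```

Under this reading, tomek = "minor" would find no minority classes in iris at m = 15, which may be why the Examples use tomek = "diff".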

force_m

If TRUE, random undersampling is used to discard further samples when too few Tomek links are present to yield m rows of data.
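As a rough illustration of the random fallback, the sketch below draws m rows without replacement from a data frame. random_undersample is a hypothetical helper written for this example; how scutr performs the step internally is not shown here.

```r
# Hypothetical helper (not part of scutr): keep m randomly chosen rows,
# or return the data unchanged if it already has m rows or fewer.
random_undersample <- function(df, m) {
  if (nrow(df) <= m) {
    return(df)
  }
  df[sample(nrow(df), m), , drop = FALSE]
}

set.seed(1)  # make the draw reproducible
setosa <- iris[iris$Species == "setosa", ]
reduced <- random_undersample(setosa, 15)
nrow(reduced)
```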

dist_calc

Distance calculation method. See dist().

Value

Undersampled data frame containing only instances of cls.

Examples

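library(scutr)
# Class counts before undersampling
table(iris$Species)
# force_m = TRUE: random undersampling makes up any shortfall to reach 15 rows
undersamp <- undersample_tomek(iris, "setosa", "Species", 15, tomek = "diff", force_m = TRUE)
nrow(undersamp)
# force_m = FALSE: only Tomek links are removed, so more than 15 rows may remain
undersamp2 <- undersample_tomek(iris, "setosa", "Species", 15, tomek = "diff", force_m = FALSE)
nrow(undersamp2)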

scutr

Balancing Multiclass Datasets for Classification Tasks

v0.1.2
MIT + file LICENSE
Authors
Keenan Ganz [aut, cre]
Initial release
