collapse: recode-replace – R documentation

recode-replace

Recode and Replace Values in Matrix-Like Objects

Description

A small suite of functions to efficiently perform common recoding and replacing tasks in matrix-like objects (vectors, matrices, arrays, data frames, lists of atomic objects):

recode_num and recode_char can be used to efficiently recode multiple numeric or character values, respectively. The syntax is inspired by dplyr::recode, but the functionality is enhanced in the following respects: (1) they are faster than dplyr::recode; (2) when passed a data frame / list, all appropriately typed columns are recoded; (3) they preserve the attributes of the data object and of columns in a data frame / list; and (4) recode_char also supports regular expression matching using grepl.
replace_NA efficiently replaces NA/NaN with a value (default is 0L). The data can be multi-typed. For numeric data a faster and more versatile alternative is provided by data.table::nafill and data.table::setnafill.
replace_Inf replaces Inf/-Inf (or optionally NaN/Inf/-Inf) with a value (default is NA). replace_Inf skips non-numeric columns in a data frame.
replace_outliers replaces values falling outside a 1- or 2-sided numeric threshold, or outside a given number of column standard deviations, with a value (default is NA). replace_outliers skips non-numeric columns in a data frame.
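As a rough illustration of the replacement rules listed above, the following base R sketch mimics them on a plain numeric vector. It is not the package's implementation (which is optimised and also handles matrices, data frames and lists), and the variable names are made up for the example.

```r
# Base R sketch of the replacement semantics; not collapse's own code.
x <- c(1, NA, Inf, -Inf, NaN, 250)

# replace_NA-like: both NA and NaN are replaced (is.na() is TRUE for NaN)
na_filled <- replace(x, is.na(x), 0)

# replace_Inf-like: only Inf/-Inf are replaced, NaN is left alone
inf_cleared <- replace(x, is.infinite(x), NA)

# replace_Inf(..., replace.nan = TRUE)-like: NaN is replaced as well
nonfinite_cleared <- replace(x, is.infinite(x) | is.nan(x), NA)

# replace_outliers-like with a two-sided threshold c(2, 100)
trimmed <- replace(x, which(x < 2 | x > 100), NA)

print(na_filled)
print(inf_cleared)
print(nonfinite_cleared)
print(trimmed)
```

Using `which()` in the last step drops positions where the comparison itself is NA, so already-missing entries are simply left as they are.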

Usage

recode_num(X, ..., default = NULL, missing = NULL)

recode_char(X, ..., default = NULL, missing = NULL, regex = FALSE,
            ignore.case = FALSE, fixed = FALSE)

replace_NA(X, value = 0L)

replace_Inf(X, value = NA, replace.nan = FALSE)

replace_outliers(X, limits, value = NA,
                 single.limit = c("SDs", "min", "max", "overall_SDs"))

Arguments

`X`	a vector, matrix, array, data frame or list of atomic objects.
`...`	comma-separated recode arguments of the form value = replacement, e.g. `2` = 0, Secondary = "SEC" etc. `recode_char` with `regex = TRUE` also supports regular expressions, e.g. `^S|D$` = "STD".
`default`	optional argument to specify a scalar value to replace non-matched elements with.
`missing`	optional argument to specify a scalar value to replace missing elements with. Note that, to increase efficiency, this is done before the rest of the recoding, i.e. the recoding is performed on data in which missing values have already been filled.
`regex`	logical. If `TRUE`, all recode-argument names are (sequentially) passed to `grepl` as a pattern to search `X`. All matches are replaced. Note that `NA`'s are also matched as strings by `grepl`.
`value`	a single (scalar) value to replace matching elements with.
`replace.nan`	logical. `TRUE` replaces `NaN/Inf/-Inf`. `FALSE` (default) replaces only `Inf/-Inf`.
`limits`	either a vector of two numeric values `c(minval, maxval)` constituting a two-sided outlier threshold, or a single numeric value constituting either a factor of standard deviations (default), or the minimum or maximum of a one-sided outlier threshold. See also `single.limit`.
`single.limit`	a character or integer (argument only applies if `length(limits) == 1`): `1 - "SDs"` specifies that `limits` will be interpreted as a (two-sided) threshold in column standard deviations. The underlying code is equivalent to `X[abs(fscale(X)) > limits] <- value` but faster. Since `fscale` is an S3 generic with methods for `grouped_df`, `pseries` and `pdata.frame`, the standardizing will be grouped if such objects are passed (i.e. the outlier threshold is then measured in within-group standard deviations). `2 - "min"` specifies that `limits` will be interpreted as a (one-sided) minimum threshold. The underlying code is equivalent to `X[X < limits] <- value`. `3 - "max"` specifies that `limits` will be interpreted as a (one-sided) maximum threshold. The underlying code is equivalent to `X[X > limits] <- value`. `4 - "overall_SDs"` is equivalent to `"SDs"` but ignores groups when a `grouped_df`, `pseries` or `pdata.frame` is passed (i.e. standardizing and determination of outliers use the overall column standard deviation).
`ignore.case, fixed`	logical. Passed to `grepl` and only applicable if `regex = TRUE`.
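The equivalences given for `single.limit` can be spelled out in base R on a single numeric vector. This is only a sketch: `scale()` is used as a stand-in for `fscale()` (an assumption of the example; both centre on the mean and divide by the sample standard deviation for a plain vector), and the variable names are illustrative.

```r
# Base R rendering of the equivalences stated for `single.limit`.
x <- c(1, 2, 3, 4, 50)

# "min": X[X < limits] <- value, with limits = 2
below <- x
below[below < 2] <- NA

# "max": X[X > limits] <- value, with limits = 4
above <- x
above[above > 4] <- NA

# "SDs": X[abs(fscale(X)) > limits] <- value, with limits = 1.5
z <- as.vector(scale(x))   # stand-in for fscale(x): (x - mean(x)) / sd(x)
sds <- x
sds[abs(z) > 1.5] <- NA

print(below)
print(above)
print(sds)
```

Here only the value 50 lies more than 1.5 standard deviations from the mean, so it is the only element replaced in the last step.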

Note

These functions are not generic and do not offer support for factors or date(-time) objects. See dplyr::recode_factor, forcats and other appropriate packages for dealing with these classes.

Simple replacing tasks on a vector can also effectively be handled by data.table::fcase and data.table::fifelse. Fast vectorised switches are also offered by package kit (functions iif, nif, vswitch, nswitch).

Examples

library(collapse)
recode_char(c("a","b","c"), a = "b", b = "c")
recode_char(month.name, ber = NA, regex = TRUE)
mtcr <- recode_num(mtcars, `0` = 2, `4` = Inf, `1` = NaN)
replace_Inf(mtcr)
replace_Inf(mtcr, replace.nan = TRUE)
replace_outliers(mtcars, c(2, 100))                 # Replace all values below 2 and above 100 with NA
replace_outliers(mtcars, 2, single.limit = "min")   # Replace all values smaller than 2 with NA
replace_outliers(mtcars, 100, single.limit = "max") # Replace all values larger than 100 with NA
replace_outliers(mtcars, 2)                         # Replace all values more than 2 column
                                                    # standard deviations from the column mean with NA
replace_outliers(fgroup_by(iris, Species), 2)       # Passing a grouped_df, pseries or pdata.frame
                                                    # allows outliers to be removed according to
                                                    # within-group standard deviations. See ?fscale
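The examples above do not exercise replace_NA or the `default` and `missing` arguments of the recode functions. A minimal sketch, assuming collapse is installed; the vector `x` and the result names are made up for illustration, and outputs are deliberately not shown.

```r
library(collapse)

x <- c(1, 2, NA, 4)

# Replace missing values with the default replacement value (0L)
filled <- replace_NA(x)

# Recode 1 -> 10 and 2 -> 20; non-matched elements are replaced with `default`
recoded <- recode_num(x, `1` = 10, `2` = 20, default = -1)

# Missing values are filled with `missing` before the recodes are applied
refilled <- recode_num(x, `1` = 10, missing = 0)

print(filled)
print(recoded)
print(refilled)
```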

collapse

Advanced and Fast Data Transformation

v1.5.3

GPL (>= 2) | file LICENSE

Authors

Sebastian Krantz [aut, cre], Matt Dowle [ctb], Arun Srinivasan [ctb], Laurent Berge [ctb], Dirk Eddelbuettel [ctb], Josh Pasek [ctb], Kevin Tappe [ctb], R Core Team and contributors worldwide [ctb], Martyn Plummer [cph], 1999-2016 The R Core Team [cph]

Initial release

2021-03-05