corpus: new_stemmer – R documentation

Stemmer Construction

Description

Make a stemmer from a set of (term, stem) pairs.

Usage

new_stemmer(term, stem, default = NULL, duplicates = "first",
            vectorize = TRUE)

Arguments

`term`	character vector of terms to stem.
`stem`	character vector the same length as `term` with entries giving the corresponding stems.
`default`	if non-`NULL`, a default value to use for terms that do not have a stem; `NULL` specifies that such terms should be left unchanged.
`duplicates`	action to take for duplicates in the `term` list. See ‘Details’.
`vectorize`	whether to produce a vectorized stemmer that accepts and returns vector arguments.

Details

Given a list of terms and a corresponding list of stems, this produces a function that maps each term to its corresponding stem. If default = NULL, then values absent from the term argument are left as-is; otherwise, they are replaced by the default value.

The duplicates argument indicates the action to take if there are duplicate entries in the term argument:

duplicates = "first": take the first matching entry in the stem list.
duplicates = "last": take the last matching entry in the stem list.
duplicates = "omit": use the default value for duplicated terms.
duplicates = "fail": raise an error if there are duplicated terms.
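
The four settings can be compared on a small table with one repeated term. This is a sketch that assumes the corpus package is installed; the terms and stems are made up for illustration, and the comments restate the behavior described above rather than recorded output. It also assumes that "fail" signals its error when the stemmer is constructed.

```r
library(corpus)

# Illustrative data: "ran" appears twice with different stems.
term <- c("ran", "running", "ran")
stem <- c("run", "run", "race")

first <- new_stemmer(term, stem, duplicates = "first")
last  <- new_stemmer(term, stem, duplicates = "last")
omit  <- new_stemmer(term, stem, duplicates = "omit")

first("ran")  # per the description: the first entry, "run"
last("ran")   # the last entry, "race"
omit("ran")   # default = NULL, so the duplicated term is left unchanged

# duplicates = "fail": capture the error message instead of stopping
# (assumption: the error is raised at construction time).
result <- tryCatch(new_stemmer(term, stem, duplicates = "fail"),
                   error = function(e) conditionMessage(e))
```

Terms that are not duplicated, such as "running" here, are stemmed the same way under every setting.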

Value

By default, with vectorize = TRUE, the resulting stemmer accepts a character vector as input and returns a character vector of the same length with entries giving the stems of the corresponding input entries.

Setting vectorize = FALSE gives a function that accepts a single input and returns a single output. This can be more efficient when used as part of a text_filter.
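
As a sketch of the text_filter use mentioned above, a non-vectorized stemmer can be supplied as the filter's stemmer property. This assumes the corpus package is installed and that text_filter() and text_tokens() accept a stemming function as described in the package documentation; the lookup table is invented for illustration.

```r
library(corpus)

# Hypothetical lookup table mapping inflected forms to a base form.
stem_one <- new_stemmer(c("mice", "running"), c("mouse", "run"),
                        vectorize = FALSE)

stem_one("mice")  # a single term in, a single stem out

# Supply the stemmer to a text filter and tokenize with it.
# Terms absent from the table have no stem and are left unchanged,
# because default = NULL.
f <- text_filter(stemmer = stem_one)
tokens <- text_tokens("Mice were running", filter = f)
```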

Examples

# map uppercase to lowercase, leave others unchanged
stemmer <- new_stemmer(LETTERS, letters)
stemmer(c("A", "E", "I", "O", "U", "1", "2", "3"))

# map uppercase to lowercase, drop others
stemmer <- new_stemmer(LETTERS, letters, default = NA)
stemmer(c("A", "E", "I", "O", "U", "1", "2", "3"))

corpus

Text Corpus Analysis

v0.10.2

Apache License (== 2.0) | file LICENSE

Authors

Leslie Huang [cre, ctb], Patrick O. Perry [aut, cph], Finn Årup Nielsen [cph, dtc] (AFINN Sentiment Lexicon), Martin Porter and Richard Boulton [ctb, cph, dtc] (Snowball Stemmer and Stopword Lists), The Regents of the University of California [ctb, cph] (Strtod Library Procedure), Carlo Strapparava and Alessandro Valitutti [cph, dtc] (WordNet-Affect Lexicon), Unicode, Inc. [cph, dtc] (Unicode Character Database)
