Select or remove tokens from a tokens object
These functions select or discard tokens from a tokens object. For convenience, the functions tokens_remove() and tokens_keep() are defined as shortcuts for tokens_select(x, pattern, selection = "remove") and tokens_select(x, pattern, selection = "keep"), respectively. The most common usage of tokens_remove() is to eliminate stop words from a text or text-based object, while the most common use of tokens_select() is to select only tokens with positive pattern matches from a list of regular expressions, including a dictionary. startpos and endpos determine the positions of tokens searched for pattern, and the areas affected are expanded by window.
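The shortcut relationship can be checked directly; a minimal sketch (assuming the quanteda package is installed and attached):

```r
library(quanteda)

# a toy tokens object
toks <- tokens("a b c d")

# tokens_remove() is a shortcut for tokens_select(..., selection = "remove")
identical(tokens_remove(toks, "b"),
          tokens_select(toks, "b", selection = "remove"))

# tokens_keep() is a shortcut for tokens_select(..., selection = "keep")
identical(tokens_keep(toks, "b"),
          tokens_select(toks, "b", selection = "keep"))
```

Both comparisons should return TRUE, since the shortcut functions simply forward their arguments to tokens_select().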
tokens_select(
  x,
  pattern,
  selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  padding = FALSE,
  window = 0,
  min_nchar = NULL,
  max_nchar = NULL,
  startpos = 1L,
  endpos = -1L,
  verbose = quanteda_options("verbose")
)

tokens_remove(x, ...)

tokens_keep(x, ...)
x: a tokens object whose token elements will be removed or kept

pattern: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.

selection: whether to "keep" or "remove" the tokens matching pattern

valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching

case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values

padding: if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected tokens, for instance if a window of adjacency needs to be computed.

window: integer of length 1 or 2; the size of the window of tokens adjacent to pattern that will also be selected or removed. The window can be asymmetric if two elements are specified, with the first giving the window size before pattern and the second the window size after. Terms from overlapping windows are never double-counted, but simply returned in the pattern match.

min_nchar, max_nchar: optional numerics specifying the minimum and maximum length in characters for tokens to be removed or kept; defaults are NULL for no limits

startpos, endpos: integer; position of tokens in documents where pattern matching starts and ends, where 1 is the first token in a document. For negative indexes, counting starts at the ending token of the document, so that -1 denotes the last token in the document, -2 the second to last, etc. When the length of the vector is equal to the number of documents, the positions are applied to the corresponding documents.

verbose: if TRUE, print the number of tokens and documents before and after the function is applied

...: additional arguments passed by tokens_remove() and tokens_keep() to tokens_select()
A tokens object with tokens selected or removed based on their match to pattern.
## tokens_select with simple examples
toks <- as.tokens(list(letters, LETTERS))
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = FALSE)
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = TRUE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = FALSE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = TRUE)

# how case_insensitive works
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = TRUE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = FALSE)

# use window
tokens_select(toks, c("b", "f"), selection = "keep", window = 1)
tokens_select(toks, c("b", "f"), selection = "remove", window = 1)
tokens_remove(toks, c("b", "f"), window = c(0, 1))
tokens_select(toks, pattern = c("e", "g"), window = c(1, 2))

# tokens_remove example: remove stopwords
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
         wash2 <- "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.")
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))

# tokens_keep example: keep two-letter words
tokens_keep(tokens(txt, remove_punct = TRUE), "??")