quanteda: tokens_compound – R documentation


tokens_compound

Convert token sequences into compound tokens

Description

Replace multi-token sequences with a single multi-word, or "compound", token. Each compound token represents a phrase or multi-word expression whose elements are joined by concatenator (by default, the "_" character). This ensures that the sequence is subsequently processed as a single token, for instance when constructing a dfm.
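As a minimal sketch of the effect described above (the sample sentence is made up for illustration), the following compounds one phrase and then builds a dfm, where the compound appears as a single feature:

```r
library("quanteda")

# Hypothetical sample text, chosen only for illustration
toks <- tokens("New York City is larger than York.", remove_punct = TRUE)

# Compound the three-token phrase into one token
comp <- tokens_compound(toks, pattern = phrase("New York City"))
as.list(comp)[[1]]

# dfm() lower-cases by default, so the compound becomes one lower-cased feature
feats <- featnames(dfm(comp))
feats
```

The standalone "York" at the end of the sentence is left alone, because only the full three-token sequence matches the pattern.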

Usage

tokens_compound(
  x,
  pattern,
  valuetype = c("glob", "regex", "fixed"),
  concatenator = "_",
  window = 0,
  case_insensitive = TRUE,
  join = TRUE
)

Arguments

`x`	an input tokens object
`pattern`	a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.
`valuetype`	the type of pattern matching: `"glob"` for "glob"-style wildcard expressions; `"regex"` for regular expressions; or `"fixed"` for exact matching. See valuetype for details.
`concatenator`	the concatenation character that will connect the words making up the multi-word sequences. The default `_` is recommended since it will not be removed during normal cleaning and tokenization (while nearly all other punctuation characters, at least those in the Unicode punctuation class `[P]`, will be removed).
`window`	integer; a vector of length 1 or 2 that specifies the size of the window of tokens adjacent to `pattern` that will be compounded with matches to `pattern`. The window can be asymmetric if two elements are specified, with the first giving the window size before `pattern` and the second the window size after. If paddings (empty `""` tokens) are found, the window will be shrunk to exclude them.
`case_insensitive`	logical; if `TRUE`, ignore case when matching a `pattern` or dictionary values
`join`	logical; if `TRUE`, join overlapping compounds into a single compound; otherwise, form these separately. See examples.
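The following sketch illustrates how three of the arguments above change the result. The choice of "-" as an alternative concatenator and the variable names are assumptions made only for this illustration.

```r
library("quanteda")

toks <- tokens("The United Kingdom is leaving the European Union.",
               remove_punct = TRUE)

# concatenator: join the matched tokens with "-" instead of the default "_"
dash <- tokens_compound(toks, phrase("United Kingdom"), concatenator = "-")

# case_insensitive = FALSE: the lower-case pattern no longer matches
strict <- tokens_compound(toks, phrase("united kingdom"),
                          case_insensitive = FALSE)

# window = c(1, 1): also compound one token on each side of the match
win <- tokens_compound(toks, "leav*", window = c(1, 1))

as.list(dash)[[1]]
as.list(strict)[[1]]
as.list(win)[[1]]
```

With case-sensitive matching the tokens come back unchanged, while the window example folds "is" and "the" into the compound around "leaving".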

Value

A tokens object in which the token sequences matching pattern have been replaced by new compounded "tokens" joined by the concatenator.
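A short check of the return value, written as a sketch rather than a definitive test: the result is still a tokens object, and each compounded sequence now counts as one token.

```r
library("quanteda")

toks <- tokens("The United Kingdom is leaving the European Union.",
               remove_punct = TRUE)
comp <- tokens_compound(toks, phrase(c("United Kingdom", "European Union")))

# The class is unchanged
is.tokens(comp)

# Token counts before and after compounding two two-token sequences
n_before <- unname(ntoken(toks))
n_after <- unname(ntoken(comp))
c(before = n_before, after = n_after)
```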

Note

Patterns to be compounded naturally consist of multi-word sequences, and pattern expects them in a specific form. If each sequence is supplied as a single space-delimited element of a character vector (for example "United Kingdom"), wrap the vector in phrase(). If the words of a sequence are separate character elements, supply a list in which each list element is a character vector holding the words of one sequence.

See the examples below.

Examples

library("quanteda")

txt <- "The United Kingdom is leaving the European Union."
toks <- tokens(txt, remove_punct = TRUE)

# character vector - not compounded
tokens_compound(toks, c("United", "Kingdom", "European", "Union"))

# elements separated by spaces - not compounded
tokens_compound(toks, c("United Kingdom", "European Union"))

# list of characters - is compounded
tokens_compound(toks, list(c("United", "Kingdom"), c("European", "Union")))

# elements separated by spaces, wrapped in phrase() - is compounded
tokens_compound(toks, phrase(c("United Kingdom", "European Union")))

# supplied as values in a dictionary (same as list) - is compounded
# (keys do not matter)
tokens_compound(toks, dictionary(list(key1 = "United Kingdom",
                                      key2 = "European Union")))
# pattern as dictionaries with glob matches
tokens_compound(toks, dictionary(list(key1 = c("U* K*"))), valuetype = "glob")

# note the differences caused by join = FALSE
compounds <- list(c("the", "European"), c("European", "Union"))
tokens_compound(toks, pattern = compounds, join = TRUE)
tokens_compound(toks, pattern = compounds, join = FALSE)

# use window to form ngrams
tokens_remove(toks, pattern = stopwords("en")) %>%
    tokens_compound(pattern = "leav*", join = FALSE, window = c(0, 3))
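To make the effect of join concrete, the sketch below reruns the overlapping-pattern example and extracts the resulting tokens as plain character vectors (the variable names are chosen only for this illustration).

```r
library("quanteda")

toks <- tokens("The United Kingdom is leaving the European Union.",
               remove_punct = TRUE)

# Two patterns that overlap on the token "European"
compounds <- list(c("the", "European"), c("European", "Union"))

joined <- as.list(tokens_compound(toks, pattern = compounds, join = TRUE))[[1]]
separate <- as.list(tokens_compound(toks, pattern = compounds, join = FALSE))[[1]]

joined    # the overlapping matches are merged into "the_European_Union"
separate  # each match is formed on its own: "the_European", "European_Union"
```

With join = FALSE the shared token "European" appears in both compounds, so the result contains one more token than with join = TRUE.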

quanteda

Quantitative Analysis of Textual Data

v3.0.0

GPL-3

Authors

Kenneth Benoit [cre, aut, cph] (<https://orcid.org/0000-0002-0797-564X>), Kohei Watanabe [aut] (<https://orcid.org/0000-0001-6519-5265>), Haiyan Wang [aut] (<https://orcid.org/0000-0003-4992-4311>), Paul Nulty [aut] (<https://orcid.org/0000-0002-7214-4666>), Adam Obeng [aut] (<https://orcid.org/0000-0002-2906-4775>), Stefan Müller [aut] (<https://orcid.org/0000-0002-6315-4125>), Akitaka Matsuo [aut] (<https://orcid.org/0000-0002-3323-6330>), William Lowe [aut] (<https://orcid.org/0000-0002-1549-6163>), Christian Müller [ctb], European Research Council [fnd] (ERC-2011-StG 283794-QUANTESS)
