quanteda: pattern2id – R documentation

pattern2id

Match patterns against token types

Description

Developer function to match regex, fixed, or glob patterns against token types. This allows C++ functions to perform fast searches in a tokens object. The C++ functions use a list of type IDs to construct a hash table, against which sub-vectors of the tokens object are matched. This function also constructs an index of glob patterns for faster matching.

pattern2fixed converts regex and glob patterns to fixed patterns.
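
The idea of converting a glob pattern into fixed patterns can be sketched in base R. This is only an illustration under stated assumptions, not quanteda's implementation: `glob_to_fixed()` is a hypothetical helper that relies on `utils::glob2rx()` and `grep()` and handles a single one-word pattern at a time.

```r
# Hypothetical helper, not part of quanteda: expand one glob pattern
# into the fixed types that it matches.
glob_to_fixed <- function(pattern, types, case_insensitive = TRUE) {
  # utils::glob2rx() translates a glob such as "b*" into an anchored regex
  regex <- utils::glob2rx(pattern)
  # value = TRUE returns the matching types themselves, not their positions
  grep(regex, types, ignore.case = case_insensitive, value = TRUE)
}

types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
glob_to_fixed("b*", types)  # types starting with "b" or "B"
glob_to_fixed("c", types)   # exact match only, ignoring case
glob_to_fixed("d", types)   # no match: an empty character vector
```

Once a pattern has been resolved to fixed types in this way, later matching needs only exact comparisons, which is presumably what makes fixed patterns convenient for fast lookup.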

index_types is an auxiliary function for pattern2id that constructs an index of "glob" or "fixed" patterns to avoid an expensive sequential search. For example, the type "cars" is indexed by the keys "cars", "car?", "c*", "ca*", "car*" and "cars*" when valuetype = "glob".
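
The key scheme in the "cars" example can be reproduced with a few lines of base R. This is a minimal sketch that generates only the keys listed above; `glob_keys()` and the `index` list are hypothetical names, and the real index_types may build its keys differently. The sketch assumes each type has at least one character.

```r
# Hypothetical helper, not part of quanteda: glob keys for a single type.
glob_keys <- function(type) {
  n <- nchar(type)
  # every prefix followed by "*": "c*", "ca*", "car*", "cars*"
  star <- paste0(substring(type, 1, seq_len(n)), "*")
  # final character replaced by "?": "car?"
  question <- paste0(substr(type, 1, n - 1), "?")
  c(type, question, star)
}

glob_keys("cars")
# "cars" "car?" "c*" "ca*" "car*" "cars*"

# Build a lookup from key to the IDs (positions) of the types that carry it
types <- c("cars", "cat")
keys <- lapply(types, glob_keys)
index <- split(rep(seq_along(types), lengths(keys)), unlist(keys))

index[["ca*"]]   # IDs of all types indexed under "ca*"
index[["car?"]]  # only "cars" is indexed under "car?"
```

With such an index, a glob pattern that coincides with a key is resolved by a single lookup by name instead of a scan over every type.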

Usage

pattern2id(
  pattern,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  keep_nomatch = FALSE
)

pattern2fixed(
  pattern,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  keep_nomatch = FALSE
)

index_types(types, valuetype, case_insensitive, max_len = NULL)

Arguments

`pattern`	a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.
`types`	token types against which patterns are matched
`valuetype`	the type of pattern matching: `"glob"` for "glob"-style wildcard expressions; `"regex"` for regular expressions; or `"fixed"` for exact matching. See valuetype for details.
`case_insensitive`	logical; if `TRUE`, ignore case when matching a `pattern` or dictionary values
`keep_nomatch`	logical; if `TRUE`, keep patterns that did not match
`max_len`	maximum length of types to be indexed

Value

pattern2id returns a list of integer vectors containing indices of matched types

pattern2fixed returns a list of character vectors containing types

index_types returns a list of integer vectors containing type IDs with index keys as an attribute

Examples

types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")

pats_regex <- list(c("^a$", "^b"), c("c"), c("d"))
pattern2id(pats_regex, types, "regex", case_insensitive = TRUE)

pats_glob <- list(c("a*", "b*"), c("c"), c("d"))
pattern2id(pats_glob, types, "glob", case_insensitive = TRUE)

pattern <- list(c("^a$", "^b"), c("c"), c("d"))
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
pattern2fixed(pattern, types, "regex", case_insensitive = TRUE)

index <- index_types(c("xxx", "yyyy", "ZZZ"), "glob", FALSE, 3)
quanteda:::search_glob("yy*", attr(index, "type_search"), index)

quanteda

Quantitative Analysis of Textual Data

v3.0.0

GPL-3

Authors

Kenneth Benoit [cre, aut, cph] (<https://orcid.org/0000-0002-0797-564X>), Kohei Watanabe [aut] (<https://orcid.org/0000-0001-6519-5265>), Haiyan Wang [aut] (<https://orcid.org/0000-0003-4992-4311>), Paul Nulty [aut] (<https://orcid.org/0000-0002-7214-4666>), Adam Obeng [aut] (<https://orcid.org/0000-0002-2906-4775>), Stefan Müller [aut] (<https://orcid.org/0000-0002-6315-4125>), Akitaka Matsuo [aut] (<https://orcid.org/0000-0002-3323-6330>), William Lowe [aut] (<https://orcid.org/0000-0002-1549-6163>), Christian Müller [ctb], European Research Council [fnd] (ERC-2011-StG 283794-QUANTESS)