Become an expert in R — Interactive courses, Cheat Sheets, certificates and more!
Get Started for Free

pattern2id

Match patterns against token types


Description

Developer function to match regex, fixed or glob patterns against token types. This allows C++ function to perform fast searches in tokens object. C++ functions use a list of type IDs to construct a hash table, against which sub-vectors of tokens object are matched. This function constructs an index of glob patterns for faster matching.

pattern2fixed converts regex and glob patterns to fixed patterns.

index_types is an auxiliary function for pattern2id that constructs an index of "glob" or "fixed" patterns to avoid expensive sequential search. For example, a type "cars" is index by keys "cars", "car?", "c*", "ca*", "car*" and "cars*" when valuetype="glob".

Usage

pattern2id(
  pattern,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  keep_nomatch = FALSE
)

pattern2fixed(
  pattern,
  types,
  valuetype = c("glob", "fixed", "regex"),
  case_insensitive = TRUE,
  keep_nomatch = FALSE
)

index_types(types, valuetype, case_insensitive, max_len = NULL)

Arguments

pattern

a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.

types

token types against which patterns are matched

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

keep_nomatch

keep patterns that did not match

max_len

maximum length of types to be indexed

Value

a list of integer vectors containing indices of matched types

pattern2fixed returns a list of character vectors containing types

index_types returns a list of integer vectors containing type IDs with index keys as an attribute

Examples

types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")

pats_regex <- list(c("^a$", "^b"), c("c"), c("d"))
pattern2id(pats_regex, types, "regex", case_insensitive = TRUE)

pats_glob <- list(c("a*", "b*"), c("c"), c("d"))
pattern2id(pats_glob, types, "glob", case_insensitive = TRUE)

pattern <- list(c("^a$", "^b"), c("c"), c("d"))
types <- c("A", "AA", "B", "BB", "BBB", "C", "CC")
pattern2fixed(pattern, types, "regex", case_insensitive = TRUE)
index <- index_types(c("xxx", "yyyy", "ZZZ"), "glob", FALSE, 3)
quanteda:::search_glob("yy*", attr(index, "type_search"), index)

quanteda

Quantitative Analysis of Textual Data

v3.0.0
GPL-3
Authors
Kenneth Benoit [cre, aut, cph] (<https://orcid.org/0000-0002-0797-564X>), Kohei Watanabe [aut] (<https://orcid.org/0000-0001-6519-5265>), Haiyan Wang [aut] (<https://orcid.org/0000-0003-4992-4311>), Paul Nulty [aut] (<https://orcid.org/0000-0002-7214-4666>), Adam Obeng [aut] (<https://orcid.org/0000-0002-2906-4775>), Stefan Müller [aut] (<https://orcid.org/0000-0002-6315-4125>), Akitaka Matsuo [aut] (<https://orcid.org/0000-0002-3323-6330>), William Lowe [aut] (<https://orcid.org/0000-0002-1549-6163>), Christian Müller [ctb], European Research Council [fnd] (ERC-2011-StG 283794-QUANTESS)
Initial release

We don't support your browser anymore

Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.