Tokenize text with spaCy
Efficient tokenization of texts using spaCy, without POS tagging, dependency parsing, lemmatization, or named entity recognition.
Usage

spacy_tokenize(
  x,
  what = c("word", "sentence"),
  remove_punct = FALSE,
  remove_url = FALSE,
  remove_numbers = FALSE,
  remove_separators = TRUE,
  remove_symbols = FALSE,
  padding = FALSE,
  multithread = TRUE,
  output = c("list", "data.frame"),
  ...
)
Arguments

x: a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif).

what: the unit for splitting the text; available alternatives are "word" (word segmentation) and "sentence" (sentence segmentation).

remove_punct: remove punctuation tokens.

remove_url: remove tokens that look like a URL or email address.

remove_numbers: remove tokens that look like a number (e.g. "334", "3.1415", "fifty").

remove_separators: remove spaces as separators when all other remove functionalities (e.g. remove_punct) are set to FALSE. When what = "sentence", this option removes trailing spaces if TRUE.

remove_symbols: remove symbols; these are tokens tagged SYM in the pos field, or currency symbols.

padding: if TRUE, leave an empty string where removed tokens previously existed. This is useful when a positional match is needed between the pre- and post-selection tokens, as illustrated in the sketch after this list.

multithread: logical; if TRUE, the processing is parallelized using spaCy's architecture (https://spacy.io/api).

output: type of the returned object, either "list" or "data.frame".

...: not used directly.
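The interaction of the remove_* options with padding is easiest to see in a short sketch. The texts below are invented for illustration, and the snippet assumes spaCy and an English language model are installed and reachable from R:

library(spacyr)
spacy_initialize()

txt <- c(doc1 = "Pi is roughly 3.1415, isn't it?",
         doc2 = "See https://spacy.io for the docs.")

# drop punctuation, numbers, and URL-like tokens; padding = TRUE leaves
# an empty string in each removed slot so that token positions still
# line up with the unfiltered tokenization
spacy_tokenize(txt,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_url = TRUE,
               padding = TRUE)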
Value

Either a list or a data.frame of tokens, depending on the output argument.
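As a minimal sketch of the two output forms (the data.frame variant is assumed here to follow the TIF convention of one token per row keyed by a document identifier; run after spacy_initialize() as above):

# list output (the default): one character vector of tokens per document
spacy_tokenize("And now for something completely different.")

# data.frame output: one row per token
tok_df <- spacy_tokenize("And now for something completely different.",
                         output = "data.frame")
head(tok_df)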
Examples

spacy_initialize()

txt <- "And now for something completely different."
spacy_tokenize(txt)

txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_tokenize(txt2)
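Building on the examples above, a further sketch shows sentence segmentation via what = "sentence" and then shuts down the spaCy backend (txt2 is reused from the previous block):

# split each document into sentences rather than words
spacy_tokenize(txt2, what = "sentence")

# release the spaCy Python process when done
spacy_finalize()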