
spacy_tokenize

Tokenize text with spaCy


Description

Efficient tokenization (without POS tagging, dependency parsing, lemmatization, or named entity recognition) of texts using spaCy.

Usage

spacy_tokenize(
  x,
  what = c("word", "sentence"),
  remove_punct = FALSE,
  remove_url = FALSE,
  remove_numbers = FALSE,
  remove_separators = TRUE,
  remove_symbols = FALSE,
  padding = FALSE,
  multithread = TRUE,
  output = c("list", "data.frame"),
  ...
)

Arguments

x

a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif)

what

the unit for splitting the text; available alternatives are:

"word"

word segmenter

"sentence"

sentence segmenter

remove_punct

remove punctuation tokens.

remove_url

remove tokens that look like a URL or an email address.

remove_numbers

remove tokens that look like a number (e.g. "334", "3.1415", "fifty").

remove_separators

remove spaces as separators; this is relevant only when all other remove options (e.g. remove_punct) are set to FALSE. When what = "sentence", this option removes trailing spaces if TRUE.

remove_symbols

remove symbols: tokens tagged SYM in the pos field, as well as currency symbols.

padding

if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected tokens, for instance if a window of adjacency needs to be computed (see the sketch after this argument list).

multithread

logical; if TRUE, the processing is parallelized using spaCy's architecture (https://spacy.io/api)

output

type of the returned object, either "list" or "data.frame".

...

not used directly
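
An illustrative sketch of how the removal and padding options interact (this is not part of the packaged examples; it assumes spacyr is attached and spacy_initialize() has been run):

txt <- "A sample sentence, showing punctuation!"

# removing punctuation shortens the token vector
spacy_tokenize(txt, remove_punct = TRUE)

# padding = TRUE leaves "" where removed tokens were, so token
# positions still align with those of the unfiltered tokenization
spacy_tokenize(txt, remove_punct = TRUE, padding = TRUE)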

Value

either a list or a data.frame of tokens

Examples

spacy_initialize()
txt <- "And now for something completely different."
spacy_tokenize(txt)

txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_tokenize(txt2)
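
As a further illustration (not part of the packaged examples), the same texts can be segmented into sentences or returned in data.frame form; spacy_finalize() then shuts down the background spaCy process:

spacy_tokenize(txt2, what = "sentence")
spacy_tokenize(txt2, output = "data.frame")
spacy_finalize()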

spacyr

Wrapper to the 'spaCy' 'NLP' Library

v1.2.1
GPL-3
Authors
Kenneth Benoit [cre, aut, cph] (<https://orcid.org/0000-0002-0797-564X>), Akitaka Matsuo [aut] (<https://orcid.org/0000-0002-3323-6330>), European Research Council [fnd] (ERC-2011-StG 283794-QUANTESS)
