Tokenize text with spaCy
Efficient tokenization of texts using spaCy, without POS tagging, dependency parsing, lemmatization, or named entity recognition.
Usage

spacy_tokenize(
  x,
  what = c("word", "sentence"),
  remove_punct = FALSE,
  remove_url = FALSE,
  remove_numbers = FALSE,
  remove_separators = TRUE,
  remove_symbols = FALSE,
  padding = FALSE,
  multithread = TRUE,
  output = c("list", "data.frame"),
  ...
)
Arguments

x: a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropensci/tif).

what: the unit for splitting the text; available alternatives are "word" (word segmentation) and "sentence" (sentence segmentation).

remove_punct: remove punctuation tokens.

remove_url: remove tokens that look like a URL or email address.

remove_numbers: remove tokens that look like a number (e.g. "334", "3.1415", "fifty").

remove_separators: remove spaces as separators when all other remove functionalities (e.g. remove_punct) are set to FALSE. When what = "sentence", this option removes trailing spaces if TRUE.

remove_symbols: remove symbols; these are tokens tagged SYM in the pos field, or currency symbols.

padding: if TRUE, leave an empty string where removed tokens previously existed. This is useful when a positional match is needed between the pre- and post-selection tokens, as illustrated in the sketch after this list.

multithread: logical; if TRUE, the processing is parallelized using spaCy's architecture (https://spacy.io/api).

output: type of the returned object, either "list" or "data.frame".

...: not used directly.
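The interaction of the remove_* options with padding is easiest to see in a short sketch. The texts below are invented for illustration, and the snippet assumes spaCy and an English language model are installed and reachable from R:

library(spacyr)
spacy_initialize()

txt <- c(doc1 = "Pi is roughly 3.1415, isn't it?",
         doc2 = "See https://spacy.io for the docs.")

# drop punctuation, numbers, and URL-like tokens; padding = TRUE leaves
# an empty string in each removed slot so that token positions still
# line up with the unfiltered tokenization
spacy_tokenize(txt,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_url = TRUE,
               padding = TRUE)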
Value

Either a list or a data.frame of tokens, depending on the output argument.
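As a minimal sketch of the two output forms (the data.frame variant is assumed here to follow the TIF convention of one token per row keyed by a document identifier; run after spacy_initialize() as above):

# list output (the default): one character vector of tokens per document
spacy_tokenize("And now for something completely different.")

# data.frame output: one row per token
tok_df <- spacy_tokenize("And now for something completely different.",
                         output = "data.frame")
head(tok_df)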
Examples

spacy_initialize()

txt <- "And now for something completely different."
spacy_tokenize(txt)

txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_tokenize(txt2)
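Building on the examples above, a further sketch shows sentence segmentation via what = "sentence" and then shuts down the spaCy backend (txt2 is reused from the previous block):

# split each document into sentences rather than words
spacy_tokenize(txt2, what = "sentence")

# release the spaCy Python process when done
spacy_finalize()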