Simple tokenization functions for string splitting
A few simple tokenization functions. For a more comprehensive set of tokenizers see the tokenizers package:
https://cran.r-project.org/package=tokenizers.
Also check the stringi::stri_split_* functions.
Usage

word_tokenizer(strings, ...)

char_tokenizer(strings, ...)

space_tokenizer(strings, sep = " ", xptr = FALSE, ...)

postag_lemma_tokenizer(strings, udpipe_model, tagger = "default",
  tokenizer = "tokenizer", pos_keep = character(0),
  pos_remove = c("PUNCT", "DET", "ADP", "SYM", "PART", "SCONJ", "CCONJ",
    "AUX", "X", "INTJ"))
Arguments

strings
  character vector of strings (documents) to tokenize.

...
  other parameters (usually not used - see source code for details).

sep
  character of length 1, the separator to split strings on; defaults to a single space " ".

xptr
  logical (default FALSE). If TRUE, tokenization is performed at the C++ level and an external pointer to the result is returned, which can be useful for large vectors of strings.

udpipe_model
  udpipe model, can be loaded with udpipe::udpipe_load_model().

tagger
  tagger option passed to the udpipe annotator (see udpipe::udpipe_annotate()).

tokenizer
  tokenizer option passed to the udpipe annotator (see udpipe::udpipe_annotate()).

pos_keep
  character vector of part-of-speech tags to keep; the default character(0) means no keep-filter is applied.

pos_remove
  character vector of part-of-speech tags to remove.
Value

A list of character vectors. Each element of the list contains a vector of tokens for the corresponding input string.
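As an illustrative sketch of this return shape (not part of the original examples, and assuming the functions are attached, e.g. with library(text2vec)), char_tokenizer() splits each document into single characters, so the result is one character vector per input document:

library(text2vec)

docs = c("ab", "cde")
tokens = char_tokenizer(docs)
length(tokens)   # 2 - one list element per input document
tokens[[1]]      # c("a", "b")
tokens[[2]]      # c("c", "d", "e")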
Examples

doc = c("first second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster, but far less general - performs the split on a fixed single whitespace symbol
space_tokenizer(doc, " ")
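The examples above do not exercise postag_lemma_tokenizer(). The following is a rough sketch rather than a verbatim example from this page: it assumes the udpipe package is installed and that an English model can be downloaded with udpipe::udpipe_download_model(); the chosen language and the pos_keep values are illustrative.

# requires the udpipe package and an internet connection for the model download
dl  = udpipe::udpipe_download_model(language = "english")
udp = udpipe::udpipe_load_model(file = dl$file_model)

doc = c("first second", "bla, bla, blaa")
# lemmatized tokens, keeping only nouns and adjectives
postag_lemma_tokenizer(doc, udpipe_model = udp, pos_keep = c("NOUN", "ADJ"))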