Iterators (and parallel iterators) over input objects
This family of functions creates iterators over input objects in order to build vocabularies, or DTM (document-term) and TCM (term-co-occurrence) matrices. The iterators are usually consumed by the following functions: create_vocabulary, create_dtm, vectorizers, create_tcm. See their documentation for details.
itoken(iterable, ...)

## S3 method for class 'character'
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  n_chunks = 10, progressbar = interactive(), ids = NULL, ...)

## S3 method for class 'list'
itoken(iterable, n_chunks = 10, progressbar = interactive(),
  ids = names(iterable), ...)

## S3 method for class 'iterator'
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  progressbar = interactive(), ...)

itoken_parallel(iterable, ...)

## S3 method for class 'character'
itoken_parallel(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  n_chunks = 10, ids = NULL, ...)

## S3 method for class 'iterator'
itoken_parallel(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  n_chunks = 1L, ...)

## S3 method for class 'list'
itoken_parallel(iterable, n_chunks = 10, ids = NULL, ...)
iterable
: an object from which to generate an iterator

...
: arguments passed to other methods

preprocessor
: function which takes a chunk of character vectors and performs all pre-processing; should return a character vector of preprocessed documents (default: identity)

tokenizer
: function which takes a character vector from the preprocessor, splits it into tokens, and returns a list of character vectors (default: space_tokenizer)

n_chunks
: integer, the number of chunks into which the input object is divided; chunks are processed one at a time (and, for itoken_parallel, distributed across workers)

progressbar
: logical, whether to display a progress bar (default: interactive())

ids
: vector of document ids; if NULL, names(iterable) are used when available, otherwise increasing ids are assigned
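Since preprocessor and tokenizer are plain R functions, custom ones are easy to supply. A minimal base-R sketch (prep_fun and tok_fun are hypothetical names; space_tokenizer in text2vec behaves much like the tokenizer below):

```r
# Hypothetical preprocessor: lowercase the text and strip punctuation.
# A preprocessor takes a character vector and returns a character vector.
prep_fun = function(x) {
  gsub("[[:punct:]]", "", tolower(x))
}

# Hypothetical tokenizer: split on single spaces.
# A tokenizer takes a character vector and returns a list of character vectors.
tok_fun = function(x) {
  strsplit(x, " ", fixed = TRUE)
}

prep_fun("Hello, World!")                 # "hello world"
tok_fun(prep_fun("Hello, World!"))[[1]]   # c("hello", "world")
```

These could then be passed as itoken(txt, preprocessor = prep_fun, tokenizer = tok_fun).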
S3 methods are provided for creating an itoken iterator from several kinds of input:
list
: all elements of the input list should be character vectors containing tokens

character
: raw text source; the user must provide a tokenizer function

ifiles
: from files; the user must provide a function to read in the file (to ifiles) and a function to tokenize it (to itoken)

idir
: from a directory; the user must provide a function to read in the files (to idir) and a function to tokenize it (to itoken)

ifiles_parallel
: from files, in parallel
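As a sketch of the reader function one would pass to ifiles, here is a base-R example (reader_fun is a hypothetical name; the actual wiring into ifiles/itoken requires the text2vec package and is shown only in a comment):

```r
# Hypothetical reader: takes a file path, returns one string per line.
reader_fun = function(path) {
  readLines(path, encoding = "UTF-8")
}

# Demonstrate on a temporary file.
tmp = tempfile(fileext = ".txt")
writeLines(c("first document", "second document"), tmp)
reader_fun(tmp)  # c("first document", "second document")

# With text2vec loaded, this reader would be used roughly as:
# it = itoken(ifiles(tmp, reader = reader_fun), tokenizer = space_tokenizer)
```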
data("movie_review")
txt = movie_review$review[1:100]
ids = movie_review$id[1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10, ids = ids)

# Example of stemming tokenizer
# stem_tokenizer = function(x) {
#   lapply(word_tokenizer(x), SnowballC::wordStem, language = "en")
# }

it = itoken_parallel(movie_review$review[1:100], n_chunks = 4)
system.time(dtm <- create_dtm(it, hash_vectorizer(2**16), type = 'dgTMatrix'))