Iterators (and parallel iterators) over input objects
This family of functions creates iterators over input objects in order to build vocabularies, or DTM (document-term) and TCM (term-co-occurrence) matrices. The iterators are usually consumed by the following functions: create_vocabulary, create_dtm, vectorizers, create_tcm. See their documentation for details.
itoken(iterable, ...)

## S3 method for class 'character'
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  n_chunks = 10, progressbar = interactive(), ids = NULL, ...)

## S3 method for class 'list'
itoken(iterable, n_chunks = 10, progressbar = interactive(),
  ids = names(iterable), ...)

## S3 method for class 'iterator'
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  progressbar = interactive(), ...)

itoken_parallel(iterable, ...)

## S3 method for class 'character'
itoken_parallel(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  n_chunks = 10, ids = NULL, ...)

## S3 method for class 'iterator'
itoken_parallel(iterable, preprocessor = identity, tokenizer = space_tokenizer,
  n_chunks = 1L, ...)

## S3 method for class 'list'
itoken_parallel(iterable, n_chunks = 10, ids = NULL, ...)
iterable
: an object from which to generate an iterator

...
: arguments passed to other methods

preprocessor
: function which takes a chunk of character vectors and performs all pre-processing; should return a character vector of preprocessed documents (default: identity)

tokenizer
: function which takes a character vector from the preprocessor, splits it into tokens, and returns a list of character vectors (default: space_tokenizer)

n_chunks
: integer, the number of chunks into which the input object is divided; chunks are processed one at a time (and, for itoken_parallel, distributed across workers)

progressbar
: logical, whether to display a progress bar (default: interactive())

ids
: vector of document ids; if NULL, names(iterable) are used when available, otherwise increasing ids are assigned
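Since preprocessor and tokenizer are plain R functions, custom ones are easy to supply. A minimal base-R sketch (prep_fun and tok_fun are hypothetical names; space_tokenizer in text2vec behaves much like the tokenizer below):

```r
# Hypothetical preprocessor: lowercase the text and strip punctuation.
# A preprocessor takes a character vector and returns a character vector.
prep_fun = function(x) {
  gsub("[[:punct:]]", "", tolower(x))
}

# Hypothetical tokenizer: split on single spaces.
# A tokenizer takes a character vector and returns a list of character vectors.
tok_fun = function(x) {
  strsplit(x, " ", fixed = TRUE)
}

prep_fun("Hello, World!")                 # "hello world"
tok_fun(prep_fun("Hello, World!"))[[1]]   # c("hello", "world")
```

These could then be passed as itoken(txt, preprocessor = prep_fun, tokenizer = tok_fun).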
S3 methods are provided for creating an itoken iterator from several kinds of input:
list
: all elements of the input list should be character vectors containing tokens

character
: raw text source; the user must provide a tokenizer function

ifiles
: from files; the user must provide a function to read in the file (to ifiles) and a function to tokenize it (to itoken)

idir
: from a directory; the user must provide a function to read in the files (to idir) and a function to tokenize it (to itoken)

ifiles_parallel
: from files, in parallel
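As a sketch of the reader function one would pass to ifiles, here is a base-R example (reader_fun is a hypothetical name; the actual wiring into ifiles/itoken requires the text2vec package and is shown only in a comment):

```r
# Hypothetical reader: takes a file path, returns one string per line.
reader_fun = function(path) {
  readLines(path, encoding = "UTF-8")
}

# Demonstrate on a temporary file.
tmp = tempfile(fileext = ".txt")
writeLines(c("first document", "second document"), tmp)
reader_fun(tmp)  # c("first document", "second document")

# With text2vec loaded, this reader would be used roughly as:
# it = itoken(ifiles(tmp, reader = reader_fun), tokenizer = space_tokenizer)
```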
data("movie_review")
txt = movie_review$review[1:100]
ids = movie_review$id[1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10, ids = ids)

# Example of stemming tokenizer
# stem_tokenizer = function(x) {
#   lapply(word_tokenizer(x), SnowballC::wordStem, language = "en")
# }

it = itoken_parallel(movie_review$review[1:100], n_chunks = 4)
system.time(dtm <- create_dtm(it, hash_vectorizer(2**16), type = 'dgTMatrix'))