text2vec: vectorizers – R documentation

vectorizers

Vocabulary and hash vectorizers

Description

These functions create a vectorizer object (closure) that defines how to transform a list of tokens into vector space, i.e. how to map words to indices. A vectorizer is supposed to be used only as an argument to create_dtm, create_tcm, create_vocabulary.
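To make the idea of "mapping words to indices" concrete, here is a minimal base R sketch. It is a conceptual illustration only, not how text2vec implements vocab_vectorizer internally; the names `docs`, `vocab` and `toy_dtm` are hypothetical, and the vocabulary order (first appearance) is an arbitrary choice made for the example.

```r
# Conceptual sketch only: NOT text2vec's internal implementation.
# Two already-tokenized documents (hypothetical toy data).
docs <- list(c("the", "cat", "sat"),
             c("the", "dog", "sat", "down"))

# Toy vocabulary: unique terms in order of first appearance (assumption).
vocab <- unique(unlist(docs))

# Each term gets a column index; count terms per document.
toy_dtm <- matrix(0L, nrow = length(docs), ncol = length(vocab),
                  dimnames = list(NULL, vocab))
for (i in seq_along(docs)) {
  idx <- match(docs[[i]], vocab)              # word -> column index
  toy_dtm[i, ] <- tabulate(idx, nbins = length(vocab))
}
toy_dtm
```

A real vectorizer carries this mapping inside a closure, so that create_dtm can apply the same word-to-index rule to every chunk of documents.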

Usage

vocab_vectorizer(vocabulary)

hash_vectorizer(hash_size = 2^18, ngram = c(1L, 1L),
  signed_hash = FALSE)

Arguments

`vocabulary`	`text2vec_vocabulary` object, see create_vocabulary.
`hash_size`	`integer` The number of hash buckets for the feature hashing trick. The number must be greater than 0, and should preferably be a power of 2.
`ngram`	`integer` vector. The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of `n` such that ngram_min <= n <= ngram_max will be used.
`signed_hash`	`logical`, indicating whether to use a signed hash function to reduce the effect of collisions when hashing.
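The arguments above can be illustrated with a small base R sketch of the feature hashing trick. This is a toy under stated assumptions, not the hash function or code path text2vec uses: `toy_hash`, `toy_ngrams` and `toy_hash_row` are hypothetical helpers, the polynomial hash and the "_" n-gram separator are choices made for the example, and a tiny `hash_size` is used so the result is easy to read.

```r
# Toy illustration of feature hashing: NOT text2vec's actual hash function.

# Simple polynomial string hash, kept below 2^31 (hypothetical helper).
toy_hash <- function(term, seed = 31) {
  h <- 0
  for (code in utf8ToInt(term)) {
    h <- (h * seed + code) %% 2^31
  }
  h
}

# All n-grams with ngram[1] <= n <= ngram[2], joined by "_" (assumption).
toy_ngrams <- function(tokens, ngram = c(1L, 1L)) {
  out <- character(0)
  for (n in seq(ngram[1], ngram[2])) {
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1L)
    grams <- vapply(starts, function(s) {
      paste(tokens[s:(s + n - 1L)], collapse = "_")
    }, character(1))
    out <- c(out, grams)
  }
  out
}

# Hash one tokenized document into a fixed-width count vector.
toy_hash_row <- function(tokens, hash_size = 2^4, ngram = c(1L, 1L),
                         signed_hash = FALSE) {
  row <- numeric(hash_size)
  for (term in toy_ngrams(tokens, ngram)) {
    bucket <- toy_hash(term) %% hash_size + 1   # column index in 1..hash_size
    # With signed_hash, a second hash picks +1 or -1, so terms that
    # collide in one bucket tend to cancel instead of always adding up.
    sgn <- if (signed_hash && toy_hash(term, seed = 17) %% 2 == 1) -1 else 1
    row[bucket] <- row[bucket] + sgn
  }
  row
}

tokens <- c("the", "cat", "sat")
unsigned_row <- toy_hash_row(tokens, ngram = c(1L, 2L))
signed_row   <- toy_hash_row(tokens, ngram = c(1L, 2L), signed_hash = TRUE)
```

No vocabulary is stored: the column index comes directly from the hash, which is why the width is fixed by `hash_size` and why distinct terms can share a column.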

Value

A vectorizer object (closure).

Examples

library(text2vec)
data("movie_review")
N = 100

# Hash vectorizer: unigrams and bigrams hashed into 2^18 buckets
vectorizer = hash_vectorizer(2 ^ 18, c(1L, 2L))
# itoken() takes the preprocessing function via `preprocessor`
it = itoken(movie_review$review[1:N], preprocessor = tolower,
             tokenizer = word_tokenizer, n_chunks = 10)
hash_dtm = create_dtm(it, vectorizer)

# Vocabulary vectorizer: build a unigram vocabulary first
it = itoken(movie_review$review[1:N], preprocessor = tolower,
             tokenizer = word_tokenizer, n_chunks = 10)
v = create_vocabulary(it, c(1L, 1L))

vectorizer = vocab_vectorizer(v)

it = itoken(movie_review$review[1:N], preprocessor = tolower,
             tokenizer = word_tokenizer, n_chunks = 10)

# Document-term matrix with one column per vocabulary term
dtm = create_dtm(it, vectorizer)

text2vec

Modern Text Mining Framework for R

v0.6

GPL (>= 2) | file LICENSE

Authors

Dmitriy Selivanov [aut, cre, cph], Manuel Bickel [aut, cph] (Coherence measures for topic models), Qing Wang [aut, cph] (Author of the WarpLDA C++ code)

Initial release