Document-term matrix construction
Description

This is a high-level function for creating a document-term matrix (DTM).
Usage

create_dtm(it, vectorizer,
  type = c("dgCMatrix", "dgTMatrix", "RsparseMatrix"), ...)

## S3 method for class 'itoken'
create_dtm(it, vectorizer,
  type = c("dgCMatrix", "dgTMatrix", "RsparseMatrix"), ...)

## S3 method for class 'itoken_parallel'
create_dtm(it, vectorizer,
  type = c("dgCMatrix", "dgTMatrix", "RsparseMatrix"), ...)
Arguments

it          itoken iterator or a list of itoken iterators.

vectorizer  vectorizer function, e.g. as returned by vocab_vectorizer() or
            hash_vectorizer().

type        character, one of c("dgCMatrix", "dgTMatrix", "RsparseMatrix"):
            the class of sparse matrix to return (see the sketch after this
            table).

...         placeholder for additional arguments (not used at the moment).
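To make the vectorizer and type arguments concrete, here is a minimal sketch (not part of the official examples) using the movie_review data that ships with text2vec. The tokens_iter() helper is purely illustrative, and the exact set of accepted type values, as well as the preprocess_function argument name, may differ between text2vec versions.

library(text2vec)
data("movie_review")

# helper (local to this sketch) that recreates the token iterator,
# since create_vocabulary() and create_dtm() each consume it
tokens_iter = function() {
  itoken(movie_review$review[1:500], preprocess_function = tolower,
         tokenizer = word_tokenizer)
}

v = create_vocabulary(tokens_iter())
vectorizer = vocab_vectorizer(v)

# 'type' controls the class of the sparse matrix that is returned
dtm_c = create_dtm(tokens_iter(), vectorizer)                      # dgCMatrix (default)
dtm_t = create_dtm(tokens_iter(), vectorizer, type = "dgTMatrix")  # triplet form
dim(dtm_c)  # documents x terms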
Details

If a parallel backend is registered and the first argument is a list of
itoken iterators, the function will construct the DTM in multiple threads.
Keep in mind that you have to split the data yourself and provide a list of
itoken iterators: each element of it is handled in a separate thread and the
partial results are combined at the end of processing (see the sketch below).
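A hedged sketch of this parallel path follows. It assumes doParallel as the foreach-compatible backend and uses split() merely as one way to chunk the documents; whether create_dtm() accepts a plain list of itoken iterators (as described above) can depend on the installed text2vec version, and the Examples section below also shows the itoken_parallel() route.

library(text2vec)
library(doParallel)
data("movie_review")

# register a parallel backend; any foreach-compatible backend should work
registerDoParallel(4)

# split the documents ourselves and build one itoken iterator per chunk
chunks = split(movie_review$review, rep(1:4, length.out = nrow(movie_review)))
it_list = lapply(chunks, itoken, preprocess_function = tolower,
                 tokenizer = word_tokenizer)

vectorizer = hash_vectorizer()

# each element of the list is processed in its own worker and the
# partial DTMs are combined into a single matrix at the end
dtm = create_dtm(it_list, vectorizer)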
Value

A sparse document-term matrix with documents in rows and terms in columns,
of the class requested via type.
Examples

## Not run: 
data("movie_review")
N = 1000
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
            tokenizer = word_tokenizer)
v = create_vocabulary(it)
# remove very common and uncommon words
pruned_vocab = prune_vocabulary(v, term_count_min = 10,
                                doc_proportion_max = 0.5,
                                doc_proportion_min = 0.001)
# use the pruned vocabulary for the vectorizer
vectorizer = vocab_vectorizer(pruned_vocab)
# recreate the iterator, since create_vocabulary() consumed it
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
            tokenizer = word_tokenizer)
dtm = create_dtm(it, vectorizer)
# get a tf-idf matrix from the bag-of-words matrix via the TfIdf model
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)

## Example of parallel mode
it = itoken_parallel(movie_review$review[1:N], tolower, word_tokenizer,
                     ids = movie_review$id[1:N])
vectorizer = hash_vectorizer()
dtm = create_dtm(it, vectorizer, type = 'dgTMatrix')

## End(Not run)