Document-term matrix construction
Description

This is a high-level function for creating a document-term matrix (DTM).
Usage

create_dtm(it, vectorizer,
  type = c("dgCMatrix", "dgTMatrix", "RsparseMatrix"), ...)

## S3 method for class 'itoken'
create_dtm(it, vectorizer,
  type = c("dgCMatrix", "dgTMatrix", "RsparseMatrix"), ...)

## S3 method for class 'itoken_parallel'
create_dtm(it, vectorizer,
  type = c("dgCMatrix", "dgTMatrix", "RsparseMatrix"), ...)
Arguments

it          itoken iterator or a list of itoken iterators.

vectorizer  vectorizer function, e.g. as returned by vocab_vectorizer() or
            hash_vectorizer().

type        character, one of c("dgCMatrix", "dgTMatrix", "RsparseMatrix"):
            the class of sparse matrix to return (see the sketch after this
            table).

...         placeholder for additional arguments (not used at the moment).
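To make the vectorizer and type arguments concrete, here is a minimal sketch (not part of the official examples) using the movie_review data that ships with text2vec. The tokens_iter() helper is purely illustrative, and the exact set of accepted type values, as well as the preprocess_function argument name, may differ between text2vec versions.

library(text2vec)
data("movie_review")

# helper (local to this sketch) that recreates the token iterator,
# since create_vocabulary() and create_dtm() each consume it
tokens_iter = function() {
  itoken(movie_review$review[1:500], preprocess_function = tolower,
         tokenizer = word_tokenizer)
}

v = create_vocabulary(tokens_iter())
vectorizer = vocab_vectorizer(v)

# 'type' controls the class of the sparse matrix that is returned
dtm_c = create_dtm(tokens_iter(), vectorizer)                      # dgCMatrix (default)
dtm_t = create_dtm(tokens_iter(), vectorizer, type = "dgTMatrix")  # triplet form
dim(dtm_c)  # documents x terms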
Details

If a parallel backend is registered and the first argument is a list of
itoken iterators, the function will construct the DTM in multiple threads.
Keep in mind that you have to split the data yourself and provide a list of
itoken iterators: each element of it is handled in a separate thread and the
partial results are combined at the end of processing (see the sketch below).
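A hedged sketch of this parallel path follows. It assumes doParallel as the foreach-compatible backend and uses split() merely as one way to chunk the documents; whether create_dtm() accepts a plain list of itoken iterators (as described above) can depend on the installed text2vec version, and the Examples section below also shows the itoken_parallel() route.

library(text2vec)
library(doParallel)
data("movie_review")

# register a parallel backend; any foreach-compatible backend should work
registerDoParallel(4)

# split the documents ourselves and build one itoken iterator per chunk
chunks = split(movie_review$review, rep(1:4, length.out = nrow(movie_review)))
it_list = lapply(chunks, itoken, preprocess_function = tolower,
                 tokenizer = word_tokenizer)

vectorizer = hash_vectorizer()

# each element of the list is processed in its own worker and the
# partial DTMs are combined into a single matrix at the end
dtm = create_dtm(it_list, vectorizer)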
Value

A sparse document-term matrix with documents in rows and terms in columns,
of the class requested via type.
Examples

## Not run: 
data("movie_review")
N = 1000
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
            tokenizer = word_tokenizer)
v = create_vocabulary(it)
# remove very common and uncommon words
pruned_vocab = prune_vocabulary(v, term_count_min = 10,
                                doc_proportion_max = 0.5,
                                doc_proportion_min = 0.001)
# use the pruned vocabulary for the vectorizer
vectorizer = vocab_vectorizer(pruned_vocab)
# recreate the iterator, since create_vocabulary() consumed it
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
            tokenizer = word_tokenizer)
dtm = create_dtm(it, vectorizer)
# get a tf-idf matrix from the bag-of-words matrix via the TfIdf model
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)

## Example of parallel mode
it = itoken_parallel(movie_review$review[1:N], tolower, word_tokenizer,
                     ids = movie_review$id[1:N])
vectorizer = hash_vectorizer()
dtm = create_dtm(it, vectorizer, type = 'dgTMatrix')

## End(Not run)