dtm_remove_sparseterms

Remove terms with high sparsity from a Document-Term-Matrix


Description

Remove terms with high sparsity from a Document-Term-Matrix and, optionally, remove documents which are left with no terms.
The sparsity of a term is the proportion of documents in which the term does not occur.
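To make the definition concrete, here is a minimal sketch that computes the sparsity of each term by hand on a toy matrix. The matrix, its counts and the names toy_dtm and term_sparsity are made up for illustration; the sketch only assumes the Matrix package and does not show how udpipe computes sparsity internally.

```r
library(Matrix)

# Toy document-term matrix: 4 documents (rows) x 3 terms (columns).
# The counts are invented for illustration only.
toy_dtm <- sparseMatrix(
  i = c(1, 2, 3, 4, 1, 2, 3),
  j = c(1, 1, 1, 1, 2, 2, 3),
  x = c(2, 1, 1, 3, 1, 1, 1),
  dims = c(4, 3),
  dimnames = list(paste0("doc", 1:4), c("room", "host", "metro"))
)

# Sparsity per term: share of documents in which the term does not occur
term_sparsity <- 1 - colSums(toy_dtm > 0) / nrow(toy_dtm)
term_sparsity
```

In this toy matrix "room" occurs in every document (sparsity 0), "host" in two of four (sparsity 0.5) and "metro" in one of four (sparsity 0.75).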

Usage

dtm_remove_sparseterms(dtm, sparsity = 0.99, remove_emptydocs = TRUE)

Arguments

dtm

an object returned by document_term_matrix

sparsity

numeric in the 0-1 range giving the sparsity threshold as a proportion (not a percentage). Defaults to 0.99, meaning that terms which occur in less than 1 percent of the documents are dropped.

remove_emptydocs

logical indicating whether to remove documents which contain no terms any more after the term removal is executed. Defaults to TRUE.
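The effect of the two arguments can be sketched on a toy matrix with the Matrix package. This is an illustrative re-creation, not the udpipe implementation: the names toy_dtm, threshold, keep_terms, filtered and without_empty are hypothetical, and the strict comparison at the threshold is an assumption.

```r
library(Matrix)

# Toy document-term matrix: doc4 contains only the rare term "metro".
# The counts are invented for illustration only.
toy_dtm <- sparseMatrix(
  i = c(1, 2, 3, 1, 2, 4),
  j = c(1, 1, 1, 2, 2, 3),
  x = c(1, 2, 1, 1, 1, 1),
  dims = c(4, 3),
  dimnames = list(paste0("doc", 1:4), c("room", "host", "metro"))
)

# Hypothetical threshold, playing the role of the sparsity argument
threshold <- 0.6

# Share of documents in which each term does not occur
term_sparsity <- 1 - colSums(toy_dtm > 0) / nrow(toy_dtm)

# Keep terms below the threshold (strict comparison is an assumption)
keep_terms <- term_sparsity < threshold
filtered <- toy_dtm[, keep_terms, drop = FALSE]

# Analogue of remove_emptydocs = TRUE: drop documents left without terms
without_empty <- filtered[rowSums(filtered) > 0, , drop = FALSE]

dim(filtered)       # all 4 documents kept, "metro" dropped
dim(without_empty)  # doc4 dropped as well
```

Here "metro" has sparsity 0.75 and is dropped, which leaves doc4 without any terms; the last step removes that document, mirroring what remove_emptydocs = TRUE is described to do.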

Value

a sparse Matrix as returned by sparseMatrix where terms with high sparsity are removed and, if remove_emptydocs is TRUE, documents without any terms are also removed

Examples

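library(udpipe)
data(brussels_reviews_anno)

## Keep only the nouns (xpos "NN") and build a document-term matrix of their lemmas
x <- subset(brussels_reviews_anno, xpos == "NN")
x <- x[, c("doc_id", "lemma")]
x <- document_term_frequencies(x)
dtm <- document_term_matrix(x)

## Remove sparse terms (occurring in less than 1 percent of the documents)
## and documents which are left with no terms
x <- dtm_remove_sparseterms(dtm, sparsity = 0.99)
dim(x)

## Remove sparse terms but keep documents which are left with no terms
x <- dtm_remove_sparseterms(dtm, sparsity = 0.99, remove_emptydocs = FALSE)
dim(x)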

udpipe

Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

v0.8.5
MPL-2.0
Authors
Jan Wijffels [aut, cre, cph], BNOSAC [cph], Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic [cph], Milan Straka [ctb, cph], Jana Straková [ctb, cph]
Initial release