dtm_remove_lowfreq

Remove terms occurring with low frequency from a Document-Term-Matrix and documents with no terms


Description

Remove terms occurring with low frequency from a Document-Term-Matrix and documents with no terms

Usage

dtm_remove_lowfreq(dtm, minfreq = 5, maxterms, remove_emptydocs = TRUE)

Arguments

dtm

an object returned by document_term_matrix

minfreq

integer with the minimum number of times the term should occur in order to keep the term

maxterms

integer indicating the maximum number of terms which should be kept in the dtm. The argument is optional.

remove_emptydocs

logical indicating whether to remove documents that are left without any terms once the low-frequency terms have been removed. Defaults to TRUE.

Value

a sparse Matrix as returned by sparseMatrix, from which terms with low occurrence are removed and, if remove_emptydocs is TRUE, documents left without any terms are removed as well
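The filtering described above can be pictured with a small sketch that uses only the Matrix package. The helper filter_lowfreq_sketch is hypothetical and is not udpipe's implementation. It assumes that a term's frequency is its column total and that maxterms keeps the most frequent terms, neither of which this page states explicitly. The toy matrix is invented for illustration.

```r
library(Matrix)

# Hypothetical helper, NOT part of udpipe: a sketch of the documented behaviour.
# Assumption: "frequency" means the column total of the document-term matrix.
filter_lowfreq_sketch <- function(m, minfreq = 5, maxterms = NULL,
                                  remove_emptydocs = TRUE) {
  totals <- Matrix::colSums(m)
  keep <- names(totals)[totals >= minfreq]
  if (!is.null(maxterms)) {
    # Assumption: when capped, the most frequent terms are the ones kept
    ranked <- sort(totals[keep], decreasing = TRUE)
    keep <- names(head(ranked, maxterms))
  }
  m <- m[, colnames(m) %in% keep, drop = FALSE]
  if (remove_emptydocs) {
    # Drop documents whose remaining term counts are all zero
    m <- m[Matrix::rowSums(m) > 0, , drop = FALSE]
  }
  m
}

# Toy matrix: 4 documents x 4 terms, every recorded count is 1
m <- sparseMatrix(
  i = c(1, 1, 1, 2, 2, 3, 4),
  j = c(1, 2, 3, 1, 2, 1, 4),
  x = rep(1, 7),
  dims = c(4, 4),
  dimnames = list(paste0("doc", 1:4), c("apple", "kiwi", "pear", "plum"))
)

dim(filter_lowfreq_sketch(m, minfreq = 2))
dim(filter_lowfreq_sketch(m, minfreq = 2, remove_emptydocs = FALSE))
dim(filter_lowfreq_sketch(m, minfreq = 2, maxterms = 1))
```

In this toy matrix only doc4 contains plum, so once plum falls below the threshold doc4 has no terms left and is dropped unless remove_emptydocs is FALSE.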

Examples

library(udpipe)

## Build a document-term matrix of noun lemmas
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, xpos == "NN")
x <- x[, c("doc_id", "lemma")]
x <- document_term_frequencies(x)
dtm <- document_term_matrix(x)

## Remove terms with low frequencies and documents with no terms
x <- dtm_remove_lowfreq(dtm, minfreq = 10)
dim(x)
## Additionally keep at most 25 terms
x <- dtm_remove_lowfreq(dtm, minfreq = 10, maxterms = 25)
dim(x)
## Same term filter, but keep documents that end up without terms
x <- dtm_remove_lowfreq(dtm, minfreq = 10, maxterms = 25, remove_emptydocs = FALSE)
dim(x)
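For a result small enough to check by hand, the same calls can be run on a toy input. The data frame below is invented for illustration and is not part of the package. It relies on document_term_matrix accepting a data.frame with the columns doc_id, term and freq, and every count is set to 1 so that the outcome does not depend on how frequency is tallied.

```r
library(udpipe)

# Invented toy data: one row per (document, term) pair, each occurring once
toy <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2", "doc3", "doc4"),
  term   = c("apple", "kiwi", "pear", "apple", "kiwi", "apple", "plum"),
  freq   = 1,
  stringsAsFactors = FALSE
)
toy_dtm <- document_term_matrix(toy)
dim(toy_dtm)

# Keep terms occurring at least twice; drop documents left without terms
a <- dtm_remove_lowfreq(toy_dtm, minfreq = 2)
dim(a)
colnames(a)

# Same term filter, but keep documents that end up empty
b <- dtm_remove_lowfreq(toy_dtm, minfreq = 2, remove_emptydocs = FALSE)
dim(b)
```

Here pear and plum each occur once, so a minfreq of 2 should drop both, which leaves doc4 without terms. Comparing dim(a) with dim(b) shows the effect of remove_emptydocs.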

udpipe

Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

v0.8.5
MPL-2.0
Authors
Jan Wijffels [aut, cre, cph], BNOSAC [cph], Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic [cph], Milan Straka [ctb, cph], Jana Straková [ctb, cph]
Initial release