Weight a dfm by tf-idf
Weight a dfm by term frequency-inverse document frequency (tf-idf), with full control over options. Uses fully sparse methods for efficiency.
dfm_tfidf(x, scheme_tf = "count", scheme_df = "inverse", base = 10, force = FALSE, ...)
x |
object for which idf or tf-idf will be computed (a document-feature matrix) |
scheme_tf |
scheme for |
scheme_df |
scheme for |
base |
the base for the logarithms in the |
force |
logical; if |
... |
additional arguments passed to |
dfm_tfidf computes term frequency-inverse document frequency weighting. The default is to use counts instead of normalized term frequency (the relative term frequency within document), but this can be overridden using scheme_tf = "prop".
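As a rough sketch of what the default weighting amounts to, the formula below assumes that scheme_df = "inverse" is the unsmoothed inverse document frequency described in Manning et al. (2008). The symbols are introduced here for illustration only: $\mathrm{tf}_{ij}$ is the count of feature $j$ in document $i$ (or its within-document proportion when scheme_tf = "prop"), $N$ is the number of documents, $\mathrm{df}_j$ is the number of documents containing feature $j$, and $b$ is the base argument.

$$
\mathrm{tfidf}_{ij} = \mathrm{tf}_{ij} \times \log_{b}\!\left(\frac{N}{\mathrm{df}_j}\right)
$$

Under this reading, a feature that occurs in every document has $\mathrm{df}_j = N$, so its weight is zero regardless of how often it occurs.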
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
dfmat1 <- as.dfm(data_dfm_lbgexample)
head(dfmat1[, 5:10])
head(dfm_tfidf(dfmat1)[, 5:10])
docfreq(dfmat1)[5:15]
head(dfm_weight(dfmat1)[, 5:10])

# replication of worked example from
# https://en.wikipedia.org/wiki/Tf-idf#Example_of_tf.E2.80.93idf
dfmat2 <-
    matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3),
           byrow = TRUE, nrow = 2,
           dimnames = list(docs = c("document1", "document2"),
                           features = c("this", "is", "a", "sample",
                                        "another", "example"))) %>%
    as.dfm()
dfmat2
docfreq(dfmat2)
dfm_tfidf(dfmat2, scheme_tf = "prop") %>% round(digits = 2)

## Not run:
# comparison with tm
if (requireNamespace("tm")) {
    convert(dfmat2, to = "tm") %>% tm::weightTfIdf() %>% as.matrix()
    # same as:
    dfm_tfidf(dfmat2, base = 2, scheme_tf = "prop")
}
## End(Not run)
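The two-document example above can also be worked through by hand in base R, which may help when checking what the weights mean. This is a minimal sketch, not quanteda's implementation: it assumes proportional term frequency (as with scheme_tf = "prop") and an unsmoothed base-10 inverse document frequency, and the names counts, tf, df, idf, and tfidf are illustrative only.

```r
# Hand computation of tf-idf on the two-document example (base R only).
# Assumption: idf is log10(N / df) with no smoothing.
counts <- matrix(c(1, 1, 2, 1, 0, 0,
                   1, 1, 0, 0, 2, 3),
                 byrow = TRUE, nrow = 2,
                 dimnames = list(docs = c("document1", "document2"),
                                 features = c("this", "is", "a", "sample",
                                              "another", "example")))

tf    <- counts / rowSums(counts)            # within-document proportions
df    <- colSums(counts > 0)                 # documents containing each feature
idf   <- log(nrow(counts) / df, base = 10)   # inverse document frequency
tfidf <- sweep(tf, 2, idf, `*`)              # scale each feature column by its idf

round(tfidf, 2)
```

Features that appear in both documents ("this", "is") get a weight of zero, while "example", which occurs three times in document2 only, gets the largest weight.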