quanteda: docfreq – R documentation

Become an expert in R — Interactive courses, Cheat Sheets, certificates and more!

docfreq

Compute the (weighted) document frequency of a feature

Description

For a dfm object, returns a (weighted) document frequency for each term. The default is a simple count of the number of documents in which a feature occurs more than a given frequency threshold. (The default threshold is zero, meaning that any feature occurring at least once in a document will be counted.)

Usage

docfreq(
  x,
  scheme = c("count", "inverse", "inversemax", "inverseprob", "unary"),
  base = 10,
  smoothing = 0,
  k = 0,
  threshold = 0
)

Arguments

`x`	a dfm
`scheme`	type of document frequency weighting, computed as follows, where N is defined as the number of documents in the dfm and s is the smoothing constant: `count` df_j, the number of documents for which n_{ij} > threshold `inverse` \textrm{log}_{base}≤ft(s + \frac{N}{k + df_j}\right) `inversemax` \textrm{log}_{base}≤ft(s + \frac{\textrm{max}(df_j)}{k + df_j}\right) `inverseprob` \textrm{log}_{base}≤ft(\frac{N - df_j}{k + df_j}\right) `unary` 1 for each feature
`base`	the base with respect to which logarithms in the inverse document frequency weightings are computed; default is 10 (see Manning, Raghavan, and Schütze 2008, p123).
`smoothing`	added to the quotient before taking the logarithm
`k`	added to the denominator in the "inverse" weighting types, to prevent a zero document count for a term
`threshold`	numeric value of the threshold above which a feature will considered in the computation of document frequency. The default is 0, meaning that a feature's document frequency will be the number of documents in which it occurs greater than zero times.

Value

a numeric vector of document frequencies for each feature

References

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

Examples

dfmat1 <- dfm(data_corpus_inaugural[1:2])
docfreq(dfmat1[, 1:20])

# replication of worked example from
# https://en.wikipedia.org/wiki/Tf-idf#Example_of_tf.E2.80.93idf
dfmat2 <-
    matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3),
           byrow = TRUE, nrow = 2,
           dimnames = list(docs = c("document1", "document2"),
                           features = c("this", "is", "a", "sample",
                                        "another", "example"))) %>%
    as.dfm()
dfmat2
docfreq(dfmat2)
docfreq(dfmat2, scheme = "inverse")
docfreq(dfmat2, scheme = "inverse", k = 1, smoothing = 1)
docfreq(dfmat2, scheme = "unary")
docfreq(dfmat2, scheme = "inversemax")
docfreq(dfmat2, scheme = "inverseprob")

quanteda

Quantitative Analysis of Textual Data

v3.0.0

GPL-3

Authors

Kenneth Benoit [cre, aut, cph] (<https://orcid.org/0000-0002-0797-564X>), Kohei Watanabe [aut] (<https://orcid.org/0000-0001-6519-5265>), Haiyan Wang [aut] (<https://orcid.org/0000-0003-4992-4311>), Paul Nulty [aut] (<https://orcid.org/0000-0002-7214-4666>), Adam Obeng [aut] (<https://orcid.org/0000-0002-2906-4775>), Stefan Müller [aut] (<https://orcid.org/0000-0002-6315-4125>), Akitaka Matsuo [aut] (<https://orcid.org/0000-0002-3323-6330>), William Lowe [aut] (<https://orcid.org/0000-0002-1549-6163>), Christian Müller [ctb], European Research Council [fnd] (ERC-2011-StG 283794-QUANTESS)

Initial release