Weight the feature frequencies in a dfm
dfm_weight(
  x,
  scheme = c("count", "prop", "propmax", "logcount", "boolean", "augmented", "logave"),
  weights = NULL,
  base = 10,
  k = 0.5,
  smoothing = 0.5,
  force = FALSE
)

dfm_smooth(x, smoothing = 1)
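x | document-feature matrix created by dfm
scheme | a label of the weight type: one of "count", "prop", "propmax", "logcount", "boolean", "augmented", or "logave"
weights | if
base | base for the logarithm when scheme is "logcount" or "logave"
k | the k for the augmentation when scheme is "augmented"
smoothing | constant added to the dfm cells for smoothing; default is 1 for dfm_smooth and 0.5 for dfm_weight
force | logical; if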
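To make the roles of base and k concrete, here is a minimal base R sketch of the log-count, augmented, and boolean term-frequency variants as given in Section 6.4 of Manning et al. (2008). The count vector is hypothetical, and leaving zero counts at zero in the augmented variant is an assumption of this sketch; dfm_weight itself may handle edge cases differently.

```r
# Hypothetical term counts for a single document (not package data)
tf <- c(apple = 1, banana = 2, better = 1, much = 0)

base <- 10  # mirrors the default of the `base` argument
k <- 0.5    # mirrors the default of the `k` argument

# Log-count weighting: 1 + log(tf) for non-zero counts, 0 otherwise
logcount <- ifelse(tf > 0, 1 + log(tf, base = base), 0)

# Augmented weighting: k + (1 - k) * tf / max(tf)
# Assumption: zero counts stay at zero so the result remains sparse
augmented <- ifelse(tf > 0, k + (1 - k) * tf / max(tf), 0)

# Boolean weighting: 1 if the feature occurs at all, 0 otherwise
boolean <- as.numeric(tf > 0)

print(rbind(logcount, augmented, boolean))
```

With k = 0.5, the most frequent feature in the document receives a weight of 1 and every other occurring feature falls between 0.5 and 1, which dampens the effect of long documents.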
dfm_weight returns the dfm with weighted values. Note that because the default weighting scheme is "count", simply calling this function on an unweighted dfm will return the same object. Many users will want the normalized dfm consisting of the proportions of the feature counts within each document, which requires setting scheme = "prop".
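The proportion weighting can be illustrated without the package. The sketch below uses a hypothetical dense base R matrix in place of a dfm and shows only the arithmetic (each count divided by its document total), not how dfm_weight is implemented.

```r
# Hypothetical document-feature counts as a plain base R matrix
counts <- matrix(
  c(1, 1, 1, 0,
    1, 2, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("text1", "text2"),
                  c("apple", "banana", "better", "much"))
)

# Proportion weighting: divide every row by its row total.
# The length-2 vector of row sums is recycled down the columns,
# so each cell is divided by the total of its own row.
prop <- counts / rowSums(counts)

print(prop)
print(rowSums(prop))  # each document's weights sum to 1
```

Because every row is rescaled by its own total, documents of different lengths become directly comparable.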
dfm_smooth returns a dfm whose values have been smoothed by adding the smoothing amount. Note that this effectively converts the matrix from sparse to dense format, so the result may exceed available memory, depending on the size of your input matrix.
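The memory warning follows from simple arithmetic: once a constant is added to every cell, no cell is zero any more, so a sparse representation saves nothing. A minimal sketch with a hypothetical base R matrix (a real dfm is stored as a sparse matrix, which this example does not reproduce):

```r
# Hypothetical counts; most cells are zero, as in a typical dfm
counts <- matrix(
  c(1, 0, 0,
    0, 2, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("text1", "text2"), c("apple", "banana", "much"))
)

# Smoothing adds the same constant to every cell, including the zeros
smoothed <- counts + 0.5

zeros_before <- sum(counts == 0)
zeros_after <- sum(smoothed == 0)

print(c(before = zeros_before, after = zeros_after))
```

For a matrix with many documents and a large vocabulary, the smoothed result has to store every cell explicitly, which is why memory use can grow sharply.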
Manning, C.D., Raghavan, P., & Schütze, H. (2008). An Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
dfmat1 <- dfm(data_corpus_inaugural)

dfmat2 <- dfm_weight(dfmat1, scheme = "prop")
topfeatures(dfmat2)

dfmat3 <- dfm_weight(dfmat1)
topfeatures(dfmat3)

dfmat4 <- dfm_weight(dfmat1, scheme = "logcount")
topfeatures(dfmat4)

dfmat5 <- dfm_weight(dfmat1, scheme = "logave")
topfeatures(dfmat5)

# combine these methods for more complex dfm_weightings, e.g. as in Section 6.4
# of Introduction to Information Retrieval
head(dfm_tfidf(dfmat1, scheme_tf = "logcount"))

# apply numeric weights
str <- c("apple is better than banana", "banana banana apple much better")
(dfmat6 <- dfm(str, remove = stopwords("english")))
dfm_weight(dfmat6, weights = c(apple = 5, banana = 3, much = 0.5))

# smooth the dfm
dfmat <- dfm(data_corpus_inaugural)
dfm_smooth(dfmat, 0.5)