tm: weightTfIdf – R documentation

weightTfIdf

Weight by Term Frequency - Inverse Document Frequency

Description

Weight a term-document matrix by term frequency - inverse document frequency.

Usage

weightTfIdf(m, normalize = TRUE)

Arguments

`m`	A `TermDocumentMatrix` in term frequency format.
`normalize`	A logical value indicating whether the term frequencies should be normalized by the total number of term occurrences in each document.

Details

Formally, this function is of class `WeightingFunction` with the additional attributes `name` and `acronym`.

Term frequency \mathit{tf}_{i,j} counts the number of occurrences n_{i,j} of a term t_i in a document d_j. If `normalize = TRUE`, the term frequency \mathit{tf}_{i,j} is divided by \sum_k n_{k,j}, the total number of term occurrences in document d_j.

Inverse document frequency for a term t_i is defined as

\mathit{idf}_i = \log_2 \frac{|D|}{|\{d \mid t_i \in d\}|}

where |D| denotes the total number of documents and |\{d \mid t_i \in d\}| is the number of documents in which the term t_i appears.

Term frequency - inverse document frequency is then defined as the product \mathit{tf}_{i,j} \cdot \mathit{idf}_i.
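
The definitions above can be checked by hand on a small matrix. The following is a minimal base R sketch of the formulas, not the tm implementation: the helper name `tf_idf_sketch`, the toy counts, and the choice to set the idf of a term that occurs in no document to 0 are all assumptions made for illustration.

```r
# Toy term-document matrix of raw counts (hypothetical data):
# rows are terms t_i, columns are documents d_j.
counts <- matrix(
  c(2, 0, 1,
    1, 1, 0,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("apple", "banana", "cherry"), c("d1", "d2", "d3"))
)

# Hypothetical helper mirroring the formulas above; not the tm implementation.
tf_idf_sketch <- function(n, normalize = TRUE) {
  tf <- n
  if (normalize) {
    # Divide each column j by sum_k n_{k,j}; assumes no empty documents.
    tf <- sweep(n, 2, colSums(n), "/")
  }
  # idf_i = log2(|D| / number of documents containing t_i)
  idf <- log2(ncol(n) / rowSums(n > 0))
  # Assumption of this sketch: terms that occur in no document get idf 0.
  idf[!is.finite(idf)] <- 0
  # Recycling over a matrix multiplies row i of tf by idf[i].
  tf * idf
}

weights <- tf_idf_sketch(counts)
print(round(weights, 3))
```

In this toy matrix, "cherry" occurs in all three documents, so its idf is \log_2(3/3) = 0 and every weight in its row is 0. "apple" occurs in two of three documents and makes up 2 of the 4 term occurrences in d1, so its normalized weight there is 0.5 \cdot \log_2(3/2).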

Value

The weighted matrix.
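
A short usage sketch, assuming the tm package is installed. The three toy documents and the variable names are made up for illustration; the calls used (`VCorpus`, `VectorSource`, `TermDocumentMatrix`, `inspect`) are other functions from tm.

```r
library(tm)

# Hypothetical toy corpus of three short documents.
docs <- c("apple banana cherry apple", "banana cherry", "apple cherry")
corpus <- VCorpus(VectorSource(docs))

# Build a term-document matrix of raw term frequencies, then reweight it.
tdm <- TermDocumentMatrix(corpus)
weighted <- weightTfIdf(tdm, normalize = TRUE)

# Alternatively, pass the weighting function when the matrix is built.
tdm_direct <- TermDocumentMatrix(corpus, control = list(weighting = weightTfIdf))

inspect(weighted)
```

Terms that occur in every document, such as "cherry" here, receive a weight of 0 because their inverse document frequency is \log_2(1) = 0.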

References

Gerard Salton and Christopher Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24/5, 513–523.

tm

Text Mining Package

v0.7-8

GPL-3

Authors

Ingo Feinerer [aut, cre] (<https://orcid.org/0000-0001-7656-8338>), Kurt Hornik [aut] (<https://orcid.org/0000-0003-4198-9911>), Artifex Software, Inc. [ctb, cph] (pdf_info.ps taken from GPL Ghostscript)

Initial release

2020-11-17