Term Statistics
Tokenize a set of texts and tabulate the term occurrence statistics.
term_stats(x, filter = NULL, ngrams = NULL, min_count = NULL, max_count = NULL, min_support = NULL, max_support = NULL, types = FALSE, subset, ...)
x |
a text vector to tokenize. |
filter |
if non-NULL, a text filter to use instead of the default text filter for x. |
ngrams |
an integer vector of n-gram lengths to include, or NULL for length-1 n-grams only. |
min_count |
a numeric scalar giving the minimum term count to include in the output, or NULL for no minimum count. |
max_count |
a numeric scalar giving the maximum term count to include in the output, or NULL for no maximum count. |
min_support |
a numeric scalar giving the minimum term support to include in the output, or NULL for no minimum support. |
max_support |
a numeric scalar giving the maximum term support to include in the output, or NULL for no maximum support. |
types |
a logical value indicating whether to include columns for the types that make up the terms. |
subset |
logical expression indicating elements or rows to keep: missing values are taken as false. |
... |
additional properties to set on the text filter. |
term_stats tokenizes a set of texts and computes the occurrence counts and supports for each term. The ‘count’ is the number of occurrences of the term across all texts; the ‘support’ is the number of texts containing the term. Each appearance of a term increments its count by one. Likewise, an appearance of a term in text i increments its support once, not for each occurrence in the text.
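The difference between count and support can be illustrated with a small base R sketch. It assumes plain whitespace tokenization of lower-cased text, which is a simplification and not the tokenizer that term_stats applies; the example texts are made up for illustration.

```r
# Toy illustration of 'count' versus 'support' in base R.
# Assumption: whitespace tokenization on lower-cased text, which is NOT
# the tokenization performed by term_stats itself.
texts <- c("a rose is a rose", "a rose by any other name")

# One character vector of tokens per text
tokens <- strsplit(tolower(texts), " ", fixed = TRUE)

# count: every occurrence across all texts adds one
count <- table(unlist(tokens))

# support: each text contributes at most one per term,
# so duplicates within a text are dropped before tabulating
support <- table(unlist(lapply(tokens, unique)))

as.integer(count["rose"])    # 3: twice in the first text, once in the second
as.integer(support["rose"])  # 2: appears in both texts
```

Here "rose" occurs three times in total but in only two texts, so its count is 3 and its support is 2.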
To include multi-type terms, specify the desired term lengths using the ngrams argument.
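As a rough picture of what a multi-type term is, the following base R sketch builds the n-grams of a token sequence by joining adjacent types. The helper ngrams_of is hypothetical and written only for this illustration; it is not part of the package and says nothing about how term_stats forms terms internally.

```r
# Hypothetical helper (not part of any package): all n-grams of length n
# from a character vector of tokens, joined with a single space.
ngrams_of <- function(tokens, n) {
  # Fewer than n tokens means there are no n-grams of that length
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1L)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1L)], collapse = " "),
         character(1))
}

tokens <- c("a", "rose", "is", "a", "rose")
bigrams <- ngrams_of(tokens, 2)
bigrams  # "a rose" "rose is" "is a" "a rose"
```

Under this sketch, passing ngrams = 1:3 corresponds to collecting the terms of length 1, 2, and 3 together.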
A data frame with columns named term, count, and support, with one row for each appearing term. Rows are sorted in descending order according to support and then count, with ties broken lexicographically by term, using the character ordering determined by the current locale (see Comparison for details).
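The row ordering described above can be imitated on an ordinary data frame with base R's order. This is only a sketch of the sorting rule; the data values are invented for illustration and the code is not taken from the package.

```r
# Made-up statistics for illustration; all terms share the same support
stats <- data.frame(
  term    = c("rose", "is", "a", "name"),
  count   = c(3, 2, 3, 1),
  support = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Descending support, then descending count, then ascending term
ord <- order(-stats$support, -stats$count, stats$term)
sorted <- stats[ord, ]

sorted$term  # "a" "rose" "is" "name"
```

The terms "a" and "rose" tie on both support and count, so the term column decides their relative order.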
If types = TRUE, then the result also includes columns named type1, type2, etc. for the types that make up the term.
term_stats("A rose is a rose is a rose.")

# remove punctuation and English stop words
term_stats("A rose is a rose is a rose.",
           text_filter(drop_symbol = TRUE, drop = stopwords_en))

# unigrams, bigrams, and trigrams
term_stats("A rose is a rose is a rose.", ngrams = 1:3)

# also include the type information
term_stats("A rose is a rose is a rose.", ngrams = 1:3, types = TRUE)