Identify and score multi-word expressions
Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.
textstat_collocations(
  x,
  method = "lambda",
  size = 2,
  min_count = 2,
  smoothing = 0.5,
  tolower = TRUE,
  ...
)
x: a character, corpus, or tokens object whose collocations will be scored. The tokens object should include punctuation, and if any words have been removed, these should have been removed with padding = TRUE.

method: association measure for detecting collocations. Currently this is limited to "lambda".

size: integer; the length of the collocations to be scored.

min_count: numeric; minimum frequency of collocations that will be scored.

smoothing: numeric; a smoothing parameter added to the observed counts (default is 0.5).

tolower: logical; if TRUE, form collocations as lower-cased combinations.

...: additional arguments passed to tokens().
Documents are grouped for the purposes of scoring, but collocations will not span sentences. If x is a tokens object and some tokens have been removed, this should be done using tokens_remove(x, pattern, padding = TRUE) so that counts will still be accurate, but the pads will prevent those collocations from being scored.
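A minimal sketch of this padded-removal workflow, using the built-in data_corpus_inaugural object (the stopword list and min_count value here are illustrative choices, not recommendations):

library("quanteda")
toks <- tokens(data_corpus_inaugural[1:2])
# padding = TRUE leaves empty-string pads where the removed tokens were,
# so counts stay accurate but no collocation can span a removed token
toks <- tokens_remove(toks, pattern = stopwords("english"), padding = TRUE)
head(textstat_collocations(toks, size = 2, min_count = 2))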
The lambda computed for a size = K-word target multi-word expression is the coefficient of the K-way interaction parameter in the saturated log-linear model fitted to the counts of the terms forming the set of eligible multi-word expressions. This is the same as the "lambda" computed in Blaheta and Johnson (2001), where all multi-word expressions are considered (rather than just verbs, as in that paper). The z is the Wald z-statistic, computed as the quotient of lambda and its estimated standard error, as described below.
In detail:

Consider a K-word target expression x, and let z be any K-word expression. Define a comparison function c(x, z) = (j_1, …, j_K) = c such that the kth element of c is 1 if the kth word in z is equal to the kth word in x, and 0 otherwise. Let c_i = (j_{i1}, …, j_{iK}), i = 1, …, 2^K = M, be the possible values of c(x, z), with c_M = (1, 1, …, 1). Consider the set of c(x, z_r) across all expressions z_r in a corpus of text, and let n_i, for i = 1, …, M, denote the number of the c(x, z_r) which equal c_i, plus the smoothing constant smoothing. The n_i are the counts in a 2^K contingency table whose dimensions are defined by the c_i.
λ: the K-way interaction parameter in the saturated log-linear model fitted to the n_i. It can be calculated as

\lambda = \sum_{i=1}^{M} (-1)^{K - b_i} \log n_i

where b_i is the number of the elements of c_i which are equal to 1.

The Wald test z-statistic is calculated as:

z = \frac{\lambda}{\left[ \sum_{i=1}^{M} n_i^{-1} \right]^{1/2}}
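To make the computation concrete, here is a hand calculation of λ and z for a hypothetical size = 2 target expression. The four smoothed cell counts below are made up for illustration (with smoothing = 0.5 already added); textstat_collocations() performs the equivalent computation internally:

# 2^K = 4 cells for K = 2, in the order c = (0,0), (0,1), (1,0), (1,1)
n <- c(1000.5, 3.5, 2.5, 10.5)  # hypothetical smoothed counts n_i
b <- c(0, 1, 1, 2)              # b_i = number of 1s in each c_i
K <- 2
lambda <- sum((-1)^(K - b) * log(n))  # K-way interaction coefficient
z <- lambda / sqrt(sum(1 / n))        # lambda over its estimated std. error
lambda
z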
textstat_collocations returns a data.frame of collocations and their scores and statistics. This consists of the collocations, their counts, length, and λ and z statistics. When size is a vector, then count_nested counts the lower-order collocations that occur within a higher-order collocation (but this does not affect the statistics).
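Because the result is an ordinary data.frame, it can be filtered and sorted with base R. For example, assuming the lambda and z columns described above (the z > 3 cutoff is an arbitrary illustration, not a recommended threshold):

library("quanteda")
tstat <- textstat_collocations(data_corpus_inaugural[1:2], size = 2, min_count = 2)
sig <- tstat[tstat$z > 3, ]                        # keep high-scoring collocations
head(sig[order(sig$lambda, decreasing = TRUE), ])  # largest lambda first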
Kenneth Benoit, Jouni Kuha, Haiyan Wang, and Kohei Watanabe
Blaheta, D. & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACLEACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.
library("quanteda") corp <- data_corpus_inaugural[1:2] head(cols <- textstat_collocations(corp, size = 2, min_count = 2), 10) head(cols <- textstat_collocations(corp, size = 3, min_count = 2), 10) # extracting multi-part proper nouns (capitalized terms) toks1 <- tokens(data_corpus_inaugural) toks2 <- tokens_remove(toks1, pattern = stopwords("english"), padding = TRUE) toks3 <- tokens_select(toks2, pattern = "^([A-Z][a-z\\-]{2,})", valuetype = "regex", case_insensitive = FALSE, padding = TRUE) tstat <- textstat_collocations(toks3, size = 3, tolower = FALSE) head(tstat, 10) # vectorized size txt <- c(". . . . a b c . . a b c . . . c d e", "a b . . a b . . a b . . a b . a b", "b c d . . b c . b c . . . b c") textstat_collocations(txt, size = 2:3) # compounding tokens from collocations toks <- tokens("This is the European Union.") colls <- tokens("The new European Union is not the old European Union.") %>% textstat_collocations(size = 2, min_count = 1, tolower = FALSE) colls tokens_compound(toks, colls, case_insensitive = FALSE) #' # from a collocations object (coll <- textstat_collocations(tokens("a b c a b d e b d a b"))) phrase(coll)