Calculate keyness statistics
Calculate "keyness", a score for features that occur differentially across different categories. Here, the categories are defined by reference to a "target" document index in the dfm, with the reference group consisting of all other documents.
textstat_keyness(
  x,
  target = 1L,
  measure = c("chi2", "exact", "lr", "pmi"),
  sort = TRUE,
  correction = c("default", "yates", "williams", "none"),
  ...
)
x |
a dfm containing the features to be examined for keyness |
target |
the document index (numeric, character or logical) identifying the document forming the "target" for computing keyness; all other documents' feature frequencies will be combined for use as a reference |
measure |
(signed) association measure to be used for computing keyness.
Currently available: |
sort |
logical; if |
correction |
if |
... |
not used |
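The way target splits the documents into a target and a combined reference group can be sketched in base R. This is only an illustration: the count matrix and its document and feature names are made up and stand in for a dfm, and the code is not quanteda's internal implementation.

```r
# Hypothetical document-feature counts (rows = documents, columns = features)
counts <- matrix(
  c(5, 0, 2,
    1, 3, 0,
    2, 4, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("war", "peace", "tax"))
)

target <- 1L  # numeric index of the target document

# Feature frequencies in the target document
n_target <- colSums(counts[target, , drop = FALSE])

# Feature frequencies of all other documents, combined into one reference
# group (negative indexing assumes a numeric target index)
n_reference <- colSums(counts[-target, , drop = FALSE])

data.frame(feature = colnames(counts), n_target, n_reference)
```

Each keyness measure is then computed per feature from these two columns of counts.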
a data.frame of computed statistics and associated p-values, where each row is named by the feature scored and also reports the number of occurrences in the target and reference groups. For measure = "chi2" the statistic is the chi-squared value, signed positively if the observed value in the target exceeds its expected value; for measure = "exact" it is the estimate of the odds ratio; for measure = "lr" it is the likelihood ratio G2 statistic; for measure = "pmi" it is the pointwise mutual information statistic.
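The signed chi-squared value described above can be illustrated for a single feature with a 2 x 2 table of counts. The following is a minimal base R sketch, assuming made-up counts and no continuity correction; the variable names are hypothetical, and the result may differ from what textstat_keyness reports when a correction is applied.

```r
# Hypothetical counts for one feature (illustrative only)
target_feat  <- 30    # occurrences of the feature in the target
ref_feat     <- 10    # occurrences of the feature in the reference
target_other <- 970   # occurrences of all other features in the target
ref_other    <- 1990  # occurrences of all other features in the reference

n <- target_feat + ref_feat + target_other + ref_other

# Expected count of the feature in the target under independence
expected_target <- (target_feat + ref_feat) * (target_feat + target_other) / n

# Pearson chi-squared for a 2 x 2 table, without continuity correction
chi2 <- n * (target_feat * ref_other - ref_feat * target_other)^2 /
  ((target_feat + ref_feat) * (target_other + ref_other) *
   (target_feat + target_other) * (ref_feat + ref_other))

# Sign the statistic: positive if the feature is over-represented in the target
signed_chi2 <- if (target_feat > expected_target) chi2 else -chi2

# p-value from the chi-squared distribution with 1 degree of freedom
p_value <- pchisq(chi2, df = 1, lower.tail = FALSE)
```

The sign carries the direction of the association, so sorting on the signed value places features typical of the target at one end and features typical of the reference at the other.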
textstat_keyness returns a data.frame of features and their keyness scores and frequency counts.
Bondi, M. & Scott, M. (eds) (2010). Keyness in Texts. Amsterdam, Philadelphia: John Benjamins.
Stubbs, M. (2010). Three Concepts of Keywords. In Keyness in Texts, Bondi, M. & Scott, M. (eds): 1–42. Amsterdam, Philadelphia: John Benjamins.
Scott, M. & Tribble, C. (2006). Textual Patterns: Keyword and Corpus Analysis in Language Education. Amsterdam: Benjamins: 55.
Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1): 61–74.
library("quanteda")

# compare pre- v. post-war terms using grouping
period <- ifelse(docvars(data_corpus_inaugural, "Year") < 1945, "pre-war", "post-war")
dfmat1 <- tokens(data_corpus_inaugural) %>%
    dfm() %>%
    dfm_group(groups = period)
head(dfmat1) # make sure 'post-war' is in the first row
head(tstat1 <- textstat_keyness(dfmat1), 10)
tail(tstat1, 10)

# compare pre- v. post-war terms using logical vector
dfmat2 <- dfm(tokens(data_corpus_inaugural))
head(textstat_keyness(dfmat2, docvars(data_corpus_inaugural, "Year") >= 1945), 10)

# compare Trump 2017 to other post-war presidents
dfmat3 <- dfm(tokens(corpus_subset(data_corpus_inaugural, period == "post-war")))
head(textstat_keyness(dfmat3, target = "2017-Trump"), 10)

# using the likelihood ratio method
head(textstat_keyness(dfm_smooth(dfmat3), measure = "lr", target = "2017-Trump"), 10)