
compare_corpus

Compare tCorpus vocabulary to that of another (reference) tCorpus


Description

Compare tCorpus vocabulary to that of another (reference) tCorpus

Usage

compare_corpus(
  tc,
  tc_y,
  feature,
  smooth = 0.1,
  min_ratio = NULL,
  min_chi2 = NULL,
  is_subset = F,
  yates_cor = c("auto", "yes", "no"),
  what = c("freq", "docfreq", "cooccurrence")
)

Arguments

tc

a tCorpus

tc_y

the reference tCorpus

feature

the column name of the feature that is to be compared

smooth

Laplace smoothing is used when calculating the term probabilities. This argument sets the value (pseudocount) that is added to each term frequency.

min_ratio

threshold for the ratio value, which is the ratio of the relative frequency of a term in tc to its relative frequency in tc_y

min_chi2

threshold for the chi^2 value

is_subset

Specify whether tc is a subset of tc_y. If so, the term frequencies of tc are subtracted from the term frequencies in tc_y, so that tc is compared against the remainder of the reference corpus.

yates_cor

mode for using Yates' correction in the chi^2 calculation. Can be turned on ("yes") or off ("no"), or set to "auto", in which case Cochran's rule is used to determine whether Yates' correction is applied.

what

choose whether to compare the frequency ("freq") of terms, or the document frequency ("docfreq"). This also affects how chi^2 is calculated, comparing either freq relative to vocabulary size or docfreq relative to corpus size (N)
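
To make the smooth, min_ratio, min_chi2 and yates_cor arguments more concrete, here is a minimal base R sketch of the quantities they refer to. It is an illustration only: the toy counts are made up, the helper names smoothed_prob and term_chi2 are hypothetical, and it assumes the conventional additive smoothing formula and a per-term 2x2 contingency table. The computation inside compare_corpus may differ in its details, and the "auto" mode (Cochran's rule) is not reproduced here.

```r
# Toy term counts for a corpus (x) and a reference corpus (y).
# All values are invented for illustration.
freq_x <- c(economy = 30, war = 5, health = 15)
freq_y <- c(economy = 20, war = 40, health = 0)
smooth <- 0.1

# Additive (Laplace) smoothing: add the pseudocount to every term count
# before normalising, so a term absent from one corpus never gets probability 0.
# (Assumed formula, not taken from the package source.)
smoothed_prob <- function(freq, smooth) {
  (freq + smooth) / (sum(freq) + length(freq) * smooth)
}
p_x <- smoothed_prob(freq_x, smooth)
p_y <- smoothed_prob(freq_y, smooth)

# Ratio of relative frequencies: above 1 means over-represented in x.
ratio <- p_x / p_y

# Chi^2 for one term from its 2x2 table (this term vs. all other terms,
# corpus x vs. corpus y). `correct` toggles Yates' continuity correction.
term_chi2 <- function(term, correct) {
  tab <- matrix(
    c(freq_x[[term]], sum(freq_x) - freq_x[[term]],
      freq_y[[term]], sum(freq_y) - freq_y[[term]]),
    nrow = 2
  )
  unname(suppressWarnings(chisq.test(tab, correct = correct))$statistic)
}

chi2_plain <- sapply(names(freq_x), term_chi2, correct = FALSE)
chi2_yates <- sapply(names(freq_x), term_chi2, correct = TRUE)

print(round(ratio, 2))
print(round(cbind(chi2_plain, chi2_yates), 2))
```

In this sketch "health" never occurs in the reference counts, yet its ratio stays finite because of the pseudocount. The Yates-corrected statistic is never larger than the uncorrected one, which is why the correction matters most for terms with small counts.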

Value

A vocabularyComparison object

Examples

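library(corpustools)

# Create a tCorpus from the State of the Union demo data
tc = create_tcorpus(sotu_texts, doc_column = 'id')

# Add a preprocessed 'feature' column (stopwords removed, tokens stemmed)
tc$preprocess('token', 'feature', remove_stopwords = TRUE, use_stemming = TRUE)

obama = tc$subset_meta(president == 'Barack Obama', copy=TRUE)
bush = tc$subset_meta(president == 'George W. Bush', copy=TRUE)

# Compare Obama's vocabulary to Bush's (the reference), sorted by chi^2
comp = compare_corpus(obama, bush, 'feature')
comp = comp[order(-comp$chi2),]
head(comp)
plot(comp)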
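
The optional arguments can be used in the same way. The following sketch reuses the demo data from the example above and only relies on arguments listed under Usage. The threshold value of 10 is an arbitrary choice for illustration, and the variable names comp_subset and comp_docfreq are hypothetical.

```r
library(corpustools)

tc <- create_tcorpus(sotu_texts, doc_column = 'id')
tc$preprocess('token', 'feature', remove_stopwords = TRUE, use_stemming = TRUE)
obama <- tc$subset_meta(president == 'Barack Obama', copy = TRUE)

# Compare Obama's speeches against the full corpus. Because obama is a
# subset of tc, is_subset = TRUE subtracts its term frequencies from the
# reference before the comparison.
comp_subset <- compare_corpus(obama, tc, 'feature', is_subset = TRUE)

# Compare document frequencies instead of term frequencies, and apply a
# chi^2 threshold (10 is an arbitrary illustrative value).
comp_docfreq <- compare_corpus(obama, tc, 'feature', is_subset = TRUE,
                               what = 'docfreq', min_chi2 = 10)
head(comp_docfreq)
```

Setting is_subset = TRUE avoids counting the subset's own tokens on both sides of the comparison, which would otherwise pull the ratios towards 1.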

corpustools

Managing, Querying and Analyzing Tokenized Text

v0.4.10
GPL-3
Authors
Kasper Welbers and Wouter van Atteveldt
Initial release
2022-05-03
