udpipe: dtm_chisq – R documentation


dtm_chisq

Compare term usage across 2 document groups using the Chi-square Test for Count Data

Description

Performs a chisq.test to check whether specific terms are more prevalent in one group of documents than in the other.
The function looks at each term in the document term matrix and applies a chisq.test that compares the frequency of that term with the frequency of all other terms, across the two document groups.
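
To make the per-term comparison concrete, the following is a minimal base R sketch of the idea described above, not the package's actual implementation. It assumes that each term is tested with a 2 x 2 table of "this term" versus "all other terms" in the TRUE and FALSE groups; the toy matrix, the term names and the helper `chisq_one_term` are made up for illustration.

```r
# Toy document-term count matrix: 4 documents x 3 terms (made-up data).
dtm <- matrix(c(3, 0, 1,
                2, 1, 0,
                0, 4, 1,
                1, 3, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("doc", 1:4), c("centre", "gare", "metro")))
groups <- c(TRUE, TRUE, FALSE, FALSE)

# Total count of every term within each document group.
freq_true  <- colSums(dtm[groups, , drop = FALSE])
freq_false <- colSums(dtm[!groups, , drop = FALSE])

# Hypothetical helper: build the assumed 2 x 2 table for one term
# (term versus all other terms, by group) and run chisq.test on it.
chisq_one_term <- function(term, correct = TRUE) {
  tab <- matrix(c(freq_true[[term]],  sum(freq_true)  - freq_true[[term]],
                  freq_false[[term]], sum(freq_false) - freq_false[[term]]),
                nrow = 2, byrow = TRUE,
                dimnames = list(group = c("TRUE", "FALSE"),
                                term  = c(term, "other")))
  # Small toy counts trigger an approximation warning, suppressed here.
  test <- suppressWarnings(chisq.test(tab, correct = correct))
  data.frame(term = term,
             chisq = unname(test$statistic),
             p.value = test$p.value,
             freq_true = freq_true[[term]],
             freq_false = freq_false[[term]])
}

result <- do.call(rbind, lapply(colnames(dtm), chisq_one_term))
result[order(result$chisq, decreasing = TRUE), ]
```

The test itself is two-sided, so the direction of the difference has to be read from the group frequencies (for example by comparing `freq_true / sum(freq_true)` with `freq_false / sum(freq_false)`).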

Usage

dtm_chisq(dtm, groups, correct = TRUE, ...)

Arguments

`dtm`	a document term matrix: an object returned by `document_term_matrix`
`groups`	a logical vector with 2 groups (TRUE / FALSE) where the length of the `groups` vector is the same as the number of rows of `dtm` and where element i corresponds to row i of `dtm`
`correct`	passed on to `chisq.test`: a logical indicating whether to apply the continuity correction for 2 by 2 tables
`...`	further arguments passed on to `chisq.test`
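
As an illustration of what the `correct` argument changes, here is a small base R sketch that calls `chisq.test` directly on a single 2 x 2 table. The counts are hypothetical and only stand in for one term versus all other terms in the two groups.

```r
# Hypothetical counts: one term versus all other terms, in two document groups.
tab <- matrix(c(12,  88,
                30, 170),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("TRUE", "FALSE"),
                              term  = c("term", "other")))

# With and without the continuity correction for 2 by 2 tables.
with_correction    <- chisq.test(tab, correct = TRUE)
without_correction <- chisq.test(tab, correct = FALSE)

# The corrected statistic is never larger than the uncorrected one.
c(corrected   = unname(with_correction$statistic),
  uncorrected = unname(without_correction$statistic))
```

The correction makes the test more conservative, which matters most for rare terms with small counts.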

Value

a data.frame with columns term, chisq, p.value, freq, freq_true, freq_false indicating, for each term in the dtm, how frequently it occurs in each group, the Chi-square value and its corresponding p-value.

Examples

library(udpipe)
data(brussels_reviews_anno)
##
## Which nouns occur in text containing the term 'centre'
##
x <- subset(brussels_reviews_anno, xpos == "NN" & language == "fr")
x <- x[, c("doc_id", "lemma")]
x <- document_term_frequencies(x)
dtm <- document_term_matrix(x)
relevant <- dtm_chisq(dtm, groups = dtm[, "centre"] > 0)
head(relevant, 10)

##
## Which adjectives occur in text containing the term 'hote'
##
x <- subset(brussels_reviews_anno, xpos == "JJ" & language == "fr")
x <- x[, c("doc_id", "lemma")]
x <- document_term_frequencies(x)
dtm <- document_term_matrix(x)

group <- subset(brussels_reviews_anno, lemma %in% "hote")
group <- rownames(dtm) %in% group$doc_id
relevant <- dtm_chisq(dtm, groups = group)
head(relevant, 10)


## Not run: 
# do not show scientific notation of the p-values
options(scipen = 100)
head(relevant, 10)

## End(Not run)
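
If changing the global `scipen` option is undesirable, one alternative is to format the p-values explicitly. The sketch below uses base R's `format` on made-up p-values that merely stand in for `relevant$p.value`.

```r
# Made-up p-values standing in for relevant$p.value.
p <- c(3.2e-12, 0.00047, 0.21)

# Fixed notation for this vector only; global options stay untouched.
formatted <- format(p, scientific = FALSE)
formatted
```

Note that `format` returns character strings, so this is best applied only when displaying the results, not before further computation.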

udpipe

Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

v0.8.5

MPL-2.0

Authors

Jan Wijffels [aut, cre, cph], BNOSAC [cph], Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic [cph], Milan Straka [ctb, cph], Jana Straková [ctb, cph]

Initial release