quanteda: corpus_sample – R documentation

corpus_sample

Randomly sample documents from a corpus

Description

Take a random sample of documents of the specified size from a corpus, with or without replacement, optionally by grouping variables or with probability weights.

Usage

corpus_sample(x, size = ndoc(x), replace = FALSE, prob = NULL, by = NULL)

Arguments

`x`	a corpus object whose documents will be sampled
`size`	a positive number giving the number of documents to select. When used with `by`, either a single number to select from each group, or a vector with one element per group giving the number of documents to select from each category of `by`. When `replace = TRUE`, `size` may exceed the number of documents, which makes oversampling possible.
`replace`	if `TRUE`, sample with replacement
`prob`	a vector of probability weights for selecting each of the documents being sampled. Cannot be used together with `by`.
`by`	optional grouping variable for sampling. It is evaluated within the docvars data.frame, so docvars may be referred to by name without quoting. This differs from how `by` behaved in earlier versions; see `news(Version >= "2.9", package = "quanteda")` for details.
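
As an illustration of the `prob` argument, here is a minimal sketch of weighted sampling. The weighting scheme (speeches from 1945 onward three times as likely to be drawn) and the names `w` and `samp` are illustrative assumptions. The sketch also assumes `prob` takes one weight per document, as in base R's `sample()`.

```r
library(quanteda)

set.seed(42)
corp <- data_corpus_inaugural

# Hypothetical weights: one per document, larger for speeches from 1945 onward.
# As with base::sample(), the weights are assumed not to need to sum to 1.
w <- ifelse(corp$Year >= 1945, 3, 1)

# Weighted sample of 10 documents without replacement (no `by`, since
# `prob` cannot be combined with grouping)
samp <- corpus_sample(corp, size = 10, prob = w)

ndoc(samp)                 # number of documents drawn
table(samp$Year >= 1945)   # how many drawn documents fall in the up-weighted period
```

Because the draw is without replacement, each document can appear at most once in `samp`.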

Value

A corpus object containing the sampled documents together with their document variables.
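
The return value can be checked with a short sketch like the one below. It samples within groups and confirms that the result is still a corpus with the same document variables. The choice of `Party` as the grouping variable, the seed, and the name `samp` are illustrative, and the unquoted `by = Party` assumes a quanteda version in which `by` is evaluated in the docvars.

```r
library(quanteda)

set.seed(1)
corp <- data_corpus_inaugural

# Draw 2 documents per party; replace = TRUE guards against any party
# that has fewer than 2 speeches in the corpus
samp <- corpus_sample(corp, size = 2, replace = TRUE, by = Party)

is.corpus(samp)                                         # result is still a corpus
identical(names(docvars(samp)), names(docvars(corp)))   # same docvars columns
table(droplevels(samp$Party))                           # documents drawn per party
```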

Examples

library(quanteda)
set.seed(123)
# sampling from a corpus
summary(corpus_sample(data_corpus_inaugural, size = 5))
summary(corpus_sample(data_corpus_inaugural, size = 10, replace = TRUE))

# sampling with by
corp <- data_corpus_inaugural
corp$century <- paste(floor(corp$Year / 100) + 1)
corp$century <- paste0(corp$century, ifelse(corp$century < 21, "th", "st"))
corpus_sample(corp, size = 2, by = century) %>% summary()
# needs drop = TRUE to avoid empty interactions
corpus_sample(corp, size = 1, by = interaction(Party, century, drop = TRUE), replace = TRUE) %>%
    summary()

# sampling sentences by document
corp <- corpus(c(one = "Sentence one.  Sentence two.  Third sentence.",
                 two = "First sentence, doc2.  Second sentence, doc2."),
               docvars = data.frame(var1 = c("a", "a"), var2 = c(1, 2)))
corpus_reshape(corp, to = "sentences") %>%
    corpus_sample(replace = TRUE, by = docid(.))

# oversampling
corpus_sample(corp, size = 5, replace = TRUE)

quanteda

Quantitative Analysis of Textual Data

v3.0.0

GPL-3

Authors

Kenneth Benoit [cre, aut, cph] (<https://orcid.org/0000-0002-0797-564X>), Kohei Watanabe [aut] (<https://orcid.org/0000-0001-6519-5265>), Haiyan Wang [aut] (<https://orcid.org/0000-0003-4992-4311>), Paul Nulty [aut] (<https://orcid.org/0000-0002-7214-4666>), Adam Obeng [aut] (<https://orcid.org/0000-0002-2906-4775>), Stefan Müller [aut] (<https://orcid.org/0000-0002-6315-4125>), Akitaka Matsuo [aut] (<https://orcid.org/0000-0002-3323-6330>), William Lowe [aut] (<https://orcid.org/0000-0002-1549-6163>), Christian Müller [ctb], European Research Council [fnd] (ERC-2011-StG 283794-QUANTESS)
