Simple Corpora
Create simple corpora.
SimpleCorpus(x, control = list(language = "en"))
x | a DataframeSource, DirSource or VectorSource.
control | a named list of control parameters.
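As a minimal sketch of passing a source as x, the following builds a simple corpus from a data frame. It assumes the DataframeSource convention of recent tm versions, where the first column is named doc_id and the second text; the data frame df and its contents are made up for illustration.

```r
library(tm)

# Hypothetical input; DataframeSource is assumed to expect the columns
# "doc_id" and "text" (in that order).
df <- data.frame(
  doc_id = c("a", "b"),
  text   = c("First text.", "Second text."),
  stringsAsFactors = FALSE
)

# Build the corpus; the control list only sets the language here.
corp_df <- SimpleCorpus(DataframeSource(df), control = list(language = "en"))

length(corp_df)   # number of documents in the corpus
```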
A simple corpus is fully kept in memory. Compared to a VCorpus,
it is optimized for the most common usage scenario: importing plain texts from
files in a directory or directly from a vector in R, preprocessing and
transforming the texts, and finally exporting them to a term-document matrix.
It adheres to the Corpus API. However, it internally takes various
shortcuts to boost performance and minimize memory pressure;
consequently, it operates only under the following constraints:
- only DataframeSource, DirSource and VectorSource are supported,
- no custom readers, i.e., each document is read in and stored as plain text (as a string, i.e., a character vector of length one),
- transformations applied via tm_map must be able to process character vectors and return character vectors (of the same length),
- no lazy transformations in tm_map,
- no meta data for individual documents (i.e., no "local" in meta).
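The usage scenario and constraints above can be sketched as follows: texts come directly from a character vector, every transformation maps a character vector to a character vector of the same length, and the result is exported to a document-term matrix. The input vector docs is made up for illustration; this is one possible pipeline, not the only supported one.

```r
library(tm)

# Hypothetical input texts, one string per document.
docs <- c("The quick brown fox.", "Jumps over the lazy dog!")
corp <- SimpleCorpus(VectorSource(docs), control = list(language = "en"))

# Each transformation receives the whole character vector and must
# return a character vector of the same length.
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)

# Export the preprocessed texts to a document-term matrix.
dtm <- DocumentTermMatrix(corp)

content(corp)   # the transformed texts as a character vector
```

A transformation that returns fewer elements than it receives (for example, one that filters out documents) would violate the same-length constraint.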
An object inheriting from SimpleCorpus and Corpus.
See Corpus for basic information on the corpus infrastructure
employed by package tm.
txt <- system.file("texts", "txt", package = "tm")
(ovid <- SimpleCorpus(DirSource(txt, encoding = "UTF-8"),
                     control = list(language = "lat")))