Corpus class initialization
Corpora indexed using the Corpus Workbench (CWB) offer an efficient data
structure for large, linguistically annotated corpora. The
corpus-class keeps basic information on a CWB corpus. Corresponding to
the name of the class, the corpus-method is the initializer for
objects of the corpus class. A CWB corpus can also be hosted remotely
on an OpenCPU server. The remote_corpus
class (which inherits from the corpus class) keeps the information needed
to reach the server. A limited set of polmineR functions and methods can be
executed on the corpus on the remote machine from the local R session by
calling them on the remote_corpus object. Calling the
corpus-method without an argument will return a data.frame with
basic information on the corpora that are available.
## S4 method for signature 'character'
corpus(.Object, server = NULL, restricted)

## S4 method for signature 'missing'
corpus()
.Object |
The upper-case ID of a CWB corpus, stated as a
length-one character vector.
server |
If |
restricted |
A |
Calling corpus() will return a data.frame listing the corpora
available locally and described in the active registry directory, and some
basic information on the corpora.
A corpus object is instantiated by passing a corpus ID as
argument .Object. Following the conventions of the Corpus Workbench
(CWB), corpus IDs are written in upper case. If .Object includes
lower-case letters, the corpus object is instantiated nevertheless,
but a warning is issued to discourage bad practice. If .Object is not a
known corpus, the error message will include a suggestion if there is a
potential candidate that can be identified by agrep.
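The suggestion mechanism can be illustrated with base R alone. The following is a minimal sketch, not polmineR's actual implementation: the vector known_ids and the variable requested are hypothetical, and in practice the known IDs would be taken from the data.frame returned by corpus().

```r
# Hypothetical corpus IDs, standing in for what corpus() would report
known_ids <- c("REUTERS", "GERMAPARLMINI")

# A corpus ID that is not among the known corpora
requested <- "REUTER"

candidates <- character(0)
if (!requested %in% known_ids) {
  # agrep() performs approximate (fuzzy) matching; value = TRUE returns
  # the matching strings rather than their indices
  candidates <- agrep(requested, known_ids, value = TRUE)
}

if (length(candidates) > 0) {
  message("Corpus '", requested, "' not found. Did you mean '",
          candidates[1], "'?")
}
```

Only the first candidate is reported in this sketch; whether polmineR reports one or several is not stated here.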
A limited set of methods of the polmineR package is exposed
to be executed on a remote OpenCPU server. As a matter of convenience, the
whereabouts of an OpenCPU server hosting a CWB corpus can be stated in an
environment variable "OPENCPU_SERVER". Environment variables for R sessions
can be set easily in the .Renviron file. A convenient way to do this
is to call usethis::edit_r_environ().
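As a sketch of how the environment variable might be set and read, the snippet below uses only base R. The URL is a placeholder chosen for illustration, not a real server. A line of the form OPENCPU_SERVER=http://localhost:8005 in the .Renviron file would have the same effect for every new R session.

```r
# Set the variable for the current session only.
# "http://localhost:8005" is a placeholder URL, an assumption for illustration.
Sys.setenv(OPENCPU_SERVER = "http://localhost:8005")

# Read it back; Sys.getenv() returns "" if the variable is unset
server <- Sys.getenv("OPENCPU_SERVER")

if (!nzchar(server)) {
  stop("Environment variable OPENCPU_SERVER is not set.")
}
```

The value of server can then be passed on, for instance as the server argument of corpus().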
corpus | A length-one character vector, the upper-case ID of a CWB
corpus.
data_dir | The directory where the files for the indexed corpus are.
type | The type of the corpus (e.g. "plpr" for a corpus of plenary protocols).
name | An additional name for the object that may be more telling than the corpus ID.
encoding | The encoding of the corpus, given as a length-one
character vector.
size | Number of tokens (size) of the corpus, a length-one integer
vector.
server | The URL (can be IP address) of the OpenCPU server. The slot is
available only with the remote_corpus class inheriting from the
corpus class.
user | If the corpus on the server requires authentication, the username.
password | If the corpus on the server requires authentication, the password.
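The relationship between the two classes can be pictured with a pair of toy S4 classes. This is a sketch only: toy_corpus and toy_remote_corpus are hypothetical names, they carry just a subset of the slots, and the slot values are placeholders. The actual class definitions in polmineR may differ.

```r
library(methods)

# Toy stand-in for the corpus class (subset of slots, for illustration)
setClass(
  "toy_corpus",
  representation(corpus = "character", encoding = "character", size = "integer")
)

# Toy stand-in for remote_corpus: inherits all slots and adds 'server'
setClass(
  "toy_remote_corpus",
  contains = "toy_corpus",
  representation(server = "character")
)

# Placeholder values, not properties of any real corpus
local_obj <- new("toy_corpus", corpus = "REUTERS", encoding = "UTF-8", size = 100L)
remote_obj <- new(
  "toy_remote_corpus",
  corpus = "REUTERS", encoding = "UTF-8", size = 100L,
  server = "http://localhost:8005"
)

# Slots are accessed with the @ operator
local_obj@corpus

# The subclass is also an instance of the parent class
is(remote_obj, "toy_corpus")

# The 'server' slot exists only on the subclass
"server" %in% slotNames(local_obj)
"server" %in% slotNames(remote_obj)
```

Because the subclass is an instance of the parent class, a method defined for the parent can in principle be dispatched on the subclass as well.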
Methods to extract basic information from a corpus object are
covered by the corpus-methods documentation object. Use the
s_attributes method to get information on structural
attributes. Analytical methods available for corpus objects are
size, count, dispersion, kwic, cooccurrences and
as.TermDocumentMatrix.
use("polmineR")
# get corpora present locally
y <- corpus()
# initialize corpus object
r <- corpus("REUTERS")
r <- corpus("reuters") # works, but issues a warning
# apply core polmineR methods
a <- size(r)
b <- s_attributes(r)
c <- count(r, query = "oil")
d <- dispersion(r, query = "oil", s_attribute = "id")
e <- kwic(r, query = "oil")
f <- cooccurrences(r, query = "oil")
# use corpus initialization in a pipe
y <- corpus("REUTERS") %>% s_attributes()
y <- corpus("REUTERS") %>% count(query = "oil")
# working with a remote corpus
## Not run:
REUTERS <- corpus("REUTERS", server = Sys.getenv("OPENCPU_SERVER"))
count(REUTERS, query = "oil")
size(REUTERS)
kwic(REUTERS, query = "oil")
GERMAPARL <- corpus("GERMAPARL", server = Sys.getenv("OPENCPU_SERVER"))
s_attributes(GERMAPARL)
size(x = GERMAPARL)
count(GERMAPARL, query = "Integration")
kwic(GERMAPARL, query = "Islam")
p <- partition(GERMAPARL, year = 2000)
s_attributes(p, s_attribute = "year")
size(p)
kwic(p, query = "Islam", meta = "date")
GERMAPARL <- corpus("GERMAPARLMINI", server = Sys.getenv("OPENCPU_SERVER"))
s_attrs <- s_attributes(GERMAPARL, s_attribute = "date")
sc <- subset(GERMAPARL, date == "2009-11-10")
## End(Not run)