Read text file(s)
Read texts and any associated document-level metadata from one or more source files. The texts come from the textual component of the files, and the document-level metadata ("docvars") come from either the file contents or the filenames.
readtext(
  file,
  ignore_missing_files = FALSE,
  text_field = NULL,
  docid_field = NULL,
  docvarsfrom = c("metadata", "filenames", "filepaths"),
  dvsep = "_",
  docvarnames = NULL,
  encoding = NULL,
  source = NULL,
  cache = TRUE,
  verbosity = readtext_options("verbosity"),
  ...
)
file |
the complete filename(s) to be read. This is designed to automagically handle a number of common scenarios, so the value can be a "glob"-type wildcard value. Currently available filetypes are: Single file formats:
Reading multiple files and file types: In addition,
|
ignore_missing_files |
if |
text_field, docid_field |
a variable (column) name or column number
indicating where to find the texts that form the documents for the corpus
and their identifiers. This must be specified for file types |
docvarsfrom |
used to specify that docvars should be taken from the
filenames, when the |
dvsep |
separator (a regular expression character string) used in
filenames to delimit docvar elements if |
docvarnames |
character vector of variable names for |
encoding |
vector: either the encoding of all files, or one encoding for each file |
source |
used to specify specific formats of some input file types, such
as JSON or HTML. Currently supported types are |
cache |
if |
verbosity |
|
... |
additional arguments passed through to low-level file reading
function, such as |
A data.frame consisting of the columns doc_id and text, which contain a document identifier and the texts respectively, with any additional columns consisting of document-level variables either found in the file containing the texts or created through the readtext call.
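As a minimal sketch of the return value described above, the following writes two small text files to a temporary directory and reads them back. The directory name, file names, and docvar names (`country`, `year`) are made up for illustration, and the call is guarded so that it only runs if the readtext package is installed.

```r
# Sketch: write two small text files, then read them back with readtext().
# The directory, file names, and docvar names are illustrative assumptions.
tmp <- file.path(tempdir(), "readtext_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines("First example document.",  file.path(tmp, "UK_2010.txt"))
writeLines("Second example document.", file.path(tmp, "IE_2014.txt"))

if (requireNamespace("readtext", quietly = TRUE)) {
  rt <- readtext::readtext(
    file.path(tmp, "*.txt"),          # glob-type wildcard
    docvarsfrom = "filenames",        # take docvars from the file names
    docvarnames = c("country", "year")
  )
  # The result should include doc_id and text, plus the two docvars
  print(names(rt))
  print(nrow(rt))
}
```

Inspecting `names(rt)` is a quick way to confirm which docvars were created before passing the result on to other tools.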
## get the data directory
if (!interactive()) pkgload::load_all()
DATA_DIR <- system.file("extdata/", package = "readtext")

## read in some text data
# all UDHR files
(rt1 <- readtext(paste0(DATA_DIR, "/txt/UDHR/*")))

# manifestos with docvars from filenames
(rt2 <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
                 docvarsfrom = "filenames",
                 docvarnames = c("unit", "context", "year", "language", "party"),
                 encoding = "LATIN1"))

# recurse through subdirectories
(rt3 <- readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"),
                 docvarsfrom = "filepaths", docvarnames = "sentiment"))

## read in csv data
(rt4 <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv")))

## read in tab-separated data
(rt5 <- readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech"))

## read in JSON data
(rt6 <- readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts"))

## read in pdf data
# UNHDR
(rt7 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                 docvarsfrom = "filenames",
                 docvarnames = c("document", "language")))
Encoding(rt7$text)

## read in Word data (.doc)
(rt8 <- readtext(paste0(DATA_DIR, "/word/*.doc")))
Encoding(rt8$text)

## read in Word data (.docx)
(rt9 <- readtext(paste0(DATA_DIR, "/word/*.docx")))
Encoding(rt9$text)

## use elements of path and filename as docvars
(rt10 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                  docvarsfrom = "filepaths", dvsep = "[/_.]"))
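To illustrate how a regular-expression separator such as dvsep can turn filenames into docvar elements, here is a base R sketch. `split_docvars()` is a hypothetical helper written for this illustration, not a readtext function, and it is not claimed to match the package's internal implementation. It covers only the filename case (directory and extension are dropped before splitting), and the example paths are illustrative.

```r
# Hypothetical helper, NOT part of readtext: split file names into docvar
# elements on a regular-expression separator, as dvsep is described to do.
split_docvars <- function(paths, dvsep = "_", docvarnames = NULL) {
  # drop the directory and the file extension before splitting
  stems <- tools::file_path_sans_ext(basename(paths))
  parts <- strsplit(stems, dvsep)
  n <- max(lengths(parts))
  # pad shorter names with NA so every row has the same number of fields
  padded <- lapply(parts, function(p) c(p, rep(NA_character_, n - length(p))))
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  # assumes docvarnames, if given, has exactly n elements
  names(out) <- if (is.null(docvarnames)) paste0("docvar", seq_len(n)) else docvarnames
  out
}

dv <- split_docvars(
  c("txt/EU_euro_2004_de_PSE.txt", "txt/EU_euro_2004_en_PPE.txt"),
  docvarnames = c("unit", "context", "year", "language", "party")
)
print(dv)
```

All elements stay as character strings in this sketch; any type conversion (for example, turning `year` into a number) would be a separate step.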