Read in a corpus file.
Converts pre-processed document matrices stored in popular formats to stm format.
readCorpus(corpus, type = c("dtm", "slam", "Matrix"))
corpus | An input file or filepath to be processed.
type | The type of input file. Several sources are supported; see details.
This function provides a simple utility for converting other document formats to our own. Briefly, dtm takes a standard matrix as input and converts it to our format, slam converts from the simple_triplet_matrix representation used by the slam package, and Matrix handles sparse classes from the Matrix package.

dtm expects a matrix object where each row represents a document and each column represents a word in the dictionary.

slam expects a simple_triplet_matrix from the slam package. This is also the representation of corpora in the popular tm package, so it should work in those cases.

Matrix attempts to coerce the matrix to a simple_triplet_matrix and convert it using the functionality built for the slam package. This will work for most applicable classes in the Matrix package, such as dgCMatrix.

If you are trying to read a .ldac file, see readLdac.
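As a rough illustration of the three input types described above, the sketch below builds a toy two-document matrix and passes it through each route. The matrix contents and object names (dtm, stm_in, sparse, out_dtm, and so on) are made up for the example, and the calls are guarded so the block only touches stm, slam, and Matrix when those packages are installed.

```r
# Toy dense document-term matrix: 2 documents (rows) x 3 terms (columns).
# The counts and term names are invented for illustration only.
dtm <- matrix(c(2, 0, 1,
                1, 3, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), c("tax", "vote", "war")))

if (requireNamespace("stm", quietly = TRUE)) {
  # Route 1: a plain dense matrix.
  out_dtm <- stm::readCorpus(dtm, type = "dtm")

  # Route 2: the simple_triplet_matrix class from the slam package.
  if (requireNamespace("slam", quietly = TRUE)) {
    stm_in <- slam::as.simple_triplet_matrix(dtm)
    out_slam <- stm::readCorpus(stm_in, type = "slam")
  }

  # Route 3: a sparse class from the Matrix package (a dgCMatrix here).
  if (requireNamespace("Matrix", quietly = TRUE)) {
    sparse <- Matrix::Matrix(dtm, sparse = TRUE)
    out_mat <- stm::readCorpus(sparse, type = "Matrix")
  }

  # Inspect the converted object rather than assuming its exact shape.
  str(out_dtm)
}
```

In each case the vocabulary is expected to come from the column names of the input, so it is worth making sure those are set before converting.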
documents | A documents object in our format.
vocab | A vocab object, if information is available to construct one.
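To give a sense of what such a return value might look like, the sketch below builds a documents/vocab pair by hand in base R. It assumes the stm list layout in which each document is a two-row integer matrix (first row: 1-based indices into vocab; second row: counts). The helper dense_to_docs is hypothetical, is not part of stm, and is not how readCorpus is implemented.

```r
# Hypothetical helper (not part of stm): mimic the documents/vocab layout
# by hand, assuming one two-row matrix per document.
dense_to_docs <- function(m) {
  docs <- lapply(seq_len(nrow(m)), function(i) {
    idx <- which(m[i, ] > 0)               # columns with a nonzero count
    rbind(as.integer(idx),                 # row 1: vocabulary indices
          as.integer(m[i, idx]))           # row 2: term counts
  })
  list(documents = docs, vocab = colnames(m))
}

# Invented toy counts: 2 documents, 3 terms.
m <- matrix(c(2, 0, 1,
              0, 3, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("tax", "vote", "war")))

sketch <- dense_to_docs(m)
sketch$documents[[1]]   # terms 1 and 3 with counts 2 and 1
sketch$vocab
```

Zero counts are simply omitted, so the number of columns in each document matrix equals the number of distinct terms that document contains.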
## Not run:
library(textir)
data(congress109)
out <- readCorpus(congress109Counts, type="Matrix")
documents <- out$documents
vocab <- out$vocab
## End(Not run)