Document vectorization
Vectorize a corpus of documents.
vectorize.docs(
  vectorizer = NULL,
  corpus = NULL,
  lang = "en",
  stopwords = lang,
  ngram = 1,
  mincount = 10,
  minphrasecount = NULL,
  transform = c("tfidf", "lsa", "l1", "none"),
  latentdim = 50,
  returndata = TRUE,
  ...
)
vectorizer |
The document vectorizer. |
corpus |
The corpus of documents (a vector of characters). |
lang |
The language of the documents (NULL to disable stemming). |
stopwords |
Stopwords, or the language of the documents. NULL if stop words should not be removed. |
ngram |
Maximum size of n-grams. |
mincount |
Minimum count for a word to be considered frequent. |
minphrasecount |
Minimum count for a word collocation (phrase) to be considered frequent. |
transform |
Transformation (TF-IDF, LSA, L1 normalization, or nothing). |
latentdim |
Number of latent dimensions if LSA transformation is performed. |
returndata |
If TRUE, the vectorized documents are returned. If FALSE, a "vectorizer" is returned. |
... |
Other parameters. |
The vectorized documents, or a "vectorizer" if returndata is FALSE.
## Not run:
require (text2vec)
data ("movie_review")

# Clustering
docs = vectorize.docs (corpus = movie_review$review, transform = "tfidf")
km = KMEANS (docs [sample (nrow (docs), 100), ], k = 10)

# Classification
d = movie_review [, 2:3]
d [, 1] = factor (d [, 1])
d = splitdata (d, 1)
vectorizer = vectorize.docs (corpus = d$train.x, returndata = FALSE, mincount = 50)
train = vectorize.docs (corpus = d$train.x, vectorizer = vectorizer)
test = vectorize.docs (corpus = d$test.x, vectorizer = vectorizer)
model = NB (as.matrix (train), d$train.y)
pred = predict (model, as.matrix (test))
evaluation (pred, d$test.y)
## End(Not run)
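To give a feel for what the "l1" and "tfidf" values of the transform argument refer to, here is a minimal base R sketch on a toy document-term matrix. It does not call vectorize.docs, and the exact weighting that function applies is not stated on this page; the sketch assumes a common textbook form, idf = log (N / df). The matrix and all variable names are hypothetical.

```r
# Toy document-term count matrix: 3 documents (rows), 3 terms (columns).
# Hypothetical data, for illustration only.
counts = matrix (c (2, 0, 1,
                    0, 3, 1,
                    1, 1, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list (paste0 ("doc", 1:3),
                                  c ("film", "plot", "the")))

# L1 normalization: divide each row by its sum, so every row sums to 1.
l1 = counts / rowSums (counts)

# Textbook TF-IDF (an assumption, not necessarily the package's formula):
# normalized term frequency times log (number of docs / document frequency).
df = colSums (counts > 0)          # number of documents containing each term
idf = log (nrow (counts) / df)     # terms present in every document get 0
tfidf = sweep (l1, 2, idf, `*`)    # scale each column by its idf weight

print (round (l1, 3))
print (round (tfidf, 3))
```

Under this weighting, a term that occurs in every document (here "the") receives a weight of zero, which is why TF-IDF tends to down-weight very common words even when stop words are not removed.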