vectorize.docs

Document vectorization


Description

Vectorize a corpus of documents.

Usage

vectorize.docs(
  vectorizer = NULL,
  corpus = NULL,
  lang = "en",
  stopwords = lang,
  ngram = 1,
  mincount = 10,
  minphrasecount = NULL,
  transform = c("tfidf", "lsa", "l1", "none"),
  latentdim = 50,
  returndata = TRUE,
  ...
)

Arguments

vectorizer

The document vectorizer (for example, one previously returned with returndata = FALSE).

corpus

The corpus of documents (a character vector).

lang

The language of the documents, used for stemming (NULL for no stemming).

stopwords

Stopwords, or the language of the documents. NULL if stop words should not be removed.

ngram

Maximum size of n-grams.

mincount

Minimum word count to be considered as frequent.

minphrasecount

Minimum count for a collocation of words to be considered as frequent.

transform

Transformation (TF-IDF, LSA, L1 normalization, or nothing).

latentdim

Number of latent dimensions if LSA transformation is performed.

returndata

If TRUE, the vectorized documents are returned. If FALSE, a "vectorizer" is returned.

...

Other parameters.
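
To make the transform options more concrete, here is a minimal base R sketch of L1 normalization, one common TF-IDF variant, and an LSA-style reduction on a toy document-term matrix. The matrix, the variable names, and the exact formulas are assumptions chosen for illustration; the weighting that vectorize.docs applies internally may differ.

```r
# Toy document-term count matrix: 3 documents, 3 terms (made-up data).
dtm <- matrix(c(2, 0, 1,
                0, 3, 1,
                1, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("doc", 1:3), c("good", "bad", "film")))

# L1 normalization: divide each row by its sum, so each document sums to 1.
l1 <- dtm / rowSums(dtm)

# TF-IDF (one common variant): normalized term frequency times log(N / df),
# where df is the number of documents containing the term.
df <- colSums(dtm > 0)
idf <- log(nrow(dtm) / df)
tfidf <- sweep(l1, 2, idf, `*`)

# LSA-style reduction: keep the first k singular directions of the matrix
# (k plays the role that latentdim plays in the usage above).
k <- 2
s <- svd(tfidf, nu = k, nv = 0)
docs_k <- s$u %*% diag(s$d[1:k], k)
```

In this sketch the term "film" appears in every document, so its IDF weight is log(1) = 0 and its TF-IDF column is all zeros: terms shared by all documents carry no discriminating weight.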

Value

The vectorized documents, or a "vectorizer" if returndata is FALSE.

See Also

Examples

## Not run: 
require (text2vec)
data ("movie_review")
# Clustering: vectorize all reviews with TF-IDF, then cluster a random sample of 100
docs = vectorize.docs (corpus = movie_review$review, transform = "tfidf")
km = KMEANS (docs [sample (nrow (docs), 100), ], k = 10)
# Classification: keep the sentiment label and the review text
d = movie_review [, 2:3]
d [, 1] = factor (d [, 1])
d = splitdata (d, 1)
# Build the vectorizer on the training texts only, then apply it to both sets
vectorizer = vectorize.docs (corpus = d$train.x,
                             returndata = FALSE, mincount = 50)
train = vectorize.docs (corpus = d$train.x, vectorizer = vectorizer)
test = vectorize.docs (corpus = d$test.x, vectorizer = vectorizer)
model = NB (as.matrix (train), d$train.y)
pred = predict (model, as.matrix (test))
evaluation (pred, d$test.y)

## End(Not run)

fdm2id

Data Mining and R Programming for Beginners

v0.9.5
GPL-3
Authors
Alexandre Blansché [aut, cre]
Initial release