fdm2id: getvocab – R documentation

fdm2id

getvocab

Extract words and phrases from a corpus

Description

Extract words and phrases from a corpus of documents.

Usage

getvocab(
  corpus,
  mincount = 5,
  minphrasecount = NULL,
  ngram = 1,
  lang = "en",
  stopwords = lang,
  ...
)

Arguments

`corpus`	The corpus of documents (a vector of characters).
`mincount`	Minimum number of occurrences for a word to be considered frequent.
`minphrasecount`	Minimum number of occurrences for a collocation (a phrase of several words) to be considered frequent.
`ngram`	Maximum size of n-grams.
`lang`	The language of the documents, used for stemming (NULL to disable stemming).
`stopwords`	Stopwords, or the language of the documents. NULL if stopwords should not be removed.
`...`	Other parameters.
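
To make the role of the count thresholds more concrete, here is a minimal base R sketch of frequency filtering on a toy corpus. It is only an illustration of the idea behind `mincount` and `minphrasecount`, not the implementation used by `getvocab`: the toy corpus, the whitespace tokenization, the `_` separator for two-word phrases, and the threshold values are all assumptions made for this example, and no stemming or stopword removal is performed.

```r
# Toy corpus: a character vector, one document per element (hypothetical data).
corpus <- c("the cat sat on the mat",
            "the dog sat on the log",
            "a cat and a dog")

# Split each document on whitespace (assumed tokenization, for illustration).
docs <- strsplit(tolower(corpus), "\\s+")

# Count every distinct word across the whole corpus.
counts <- table(unlist(docs))

# Keep only words seen at least `mincount` times (illustrative threshold).
mincount <- 2
frequent <- names(counts)[as.vector(counts) >= mincount]

# Build two-word phrases from adjacent words within each document.
# The "_" separator is an assumption of this sketch.
bigrams <- unlist(lapply(docs, function(w) {
  paste(head(w, -1), tail(w, -1), sep = "_")
}))
phrasecounts <- table(bigrams)

# Keep only phrases seen at least `minphrasecount` times.
minphrasecount <- 2
frequent_phrases <- names(phrasecounts)[as.vector(phrasecounts) >= minphrasecount]

print(frequent)
print(frequent_phrases)
```

Raising either threshold shrinks the corresponding part of the result, which is the general effect to expect from `mincount` and `minphrasecount`; the exact tokens `getvocab` returns will differ because it can also stem words and remove stopwords.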

Value

The vocabulary used in the corpus of documents.

Examples

## Not run: 
text = loadtext ("http://mattmahoney.net/dc/text8.zip")
vocab1 = getvocab (text) # With stemming
nrow (vocab1)
vocab2 = getvocab (text, lang = NULL) # Without stemming
nrow (vocab2)

## End(Not run)

fdm2id

Data Mining and R Programming for Beginners

v0.9.5

GPL-3

Authors

Alexandre Blansché [aut, cre]

Initial release
