Preprocess tokens in a character vector
preprocess_tokens(x, context = NULL, language = "english",
  use_stemming = F, lowercase = T, ngrams = 1, replace_whitespace = F,
  as_ascii = F, remove_punctuation = T, remove_stopwords = F,
  remove_numbers = F, min_freq = NULL, min_docfreq = NULL,
  max_freq = NULL, max_docfreq = NULL, min_char = NULL, max_char = NULL,
  ngram_skip_empty = T)
x |
A character or factor vector in which each element is a token (i.e. a tokenized text) |
context |
Optionally, a character vector of the same length as x, specifying the context of each token (e.g., document, sentence). Has to be given if ngrams > 1 |
language |
The language used for stemming and removing stopwords |
use_stemming |
Logical. If TRUE, tokens are stemmed (Make sure to specify the right language!) |
lowercase |
Logical. If TRUE, tokens are converted to lowercase |
ngrams |
A number, specifying the number of tokens per ngram. Default is unigrams (1). |
replace_whitespace |
Logical. If TRUE, all whitespace is replaced by underscores |
as_ascii |
Logical. If TRUE, tokens will be converted to ASCII |
remove_punctuation |
Logical. If TRUE, punctuation is removed |
remove_stopwords |
Logical. If TRUE, stopwords are removed (Make sure to specify the right language!) |
remove_numbers |
Logical. If TRUE, tokens that consist only of numbers are removed |
min_freq |
an integer, specifying minimum token frequency. |
min_docfreq |
an integer, specifying minimum document frequency. |
max_freq |
an integer, specifying maximum token frequency. |
max_docfreq |
an integer, specifying maximum document frequency. |
min_char |
an integer, specifying minimum number of characters in a term |
max_char |
an integer, specifying maximum number of characters in a term |
ngram_skip_empty |
Logical. If ngrams are used, determines whether empty (filtered out) terms are skipped (e.g. c("this", NA, "test") becomes "this_test") or not |
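The interaction between context, ngrams and ngram_skip_empty can be illustrated with a small base R sketch. This is not the package's implementation: sketch_ngrams is a hypothetical helper, and it assumes that an ngram is a token joined to its preceding tokens with "_", that ngrams never cross a context boundary, and that ngrams at the start of a context are simply shorter.

```r
# Hypothetical helper (not part of the package): build ngrams from each token
# and its preceding tokens, without crossing context boundaries.
sketch_ngrams <- function(x, context, n = 2, skip_empty = TRUE) {
  out <- rep(NA_character_, length(x))
  for (ctx in unique(context)) {
    idx <- which(context == ctx)                # positions in this context
    if (skip_empty) idx <- idx[!is.na(x[idx])]  # step over filtered-out tokens
    for (j in seq_along(idx)) {
      window <- idx[max(1, j - n + 1):j]        # up to n positions ending at j
      out[idx[j]] <- paste(x[window], collapse = "_")
    }
  }
  out
}

tokens  <- c("this", NA, "test", "another", "test")
context <- c(1, 1, 1, 2, 2)
sketch_ngrams(tokens, context, n = 2)
# "this" NA "this_test" "another" "another_test"
```

In this sketch the filtered-out token stays NA in the output, so the result keeps the same length as the input.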
a factor vector
tokens = c('I', 'am', 'a', 'SHORT', 'example', 'sentence', '!')

## default is lowercase without punctuation
preprocess_tokens(tokens)

## optionally, delete stopwords, perform stemming, and make ngrams
preprocess_tokens(tokens, remove_stopwords = TRUE, use_stemming = TRUE)
preprocess_tokens(tokens, context = NA, ngrams = 3)
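The difference between token frequency (min_freq, max_freq) and document frequency (min_docfreq, max_docfreq) can be sketched in base R. This is an illustration only, not the package's code: sketch_min_filter is a hypothetical helper, and it assumes that document frequency counts the number of distinct contexts in which a token occurs and that filtered-out tokens become NA.

```r
# Hypothetical helper (not part of the package): set tokens to NA when they
# fall below a minimum token frequency or a minimum document frequency.
# Assumes x is a character vector.
sketch_min_filter <- function(x, context, min_freq = NULL, min_docfreq = NULL) {
  keep <- !is.na(x)
  if (!is.null(min_freq)) {
    freq <- table(x)                          # total occurrences per token
    keep <- keep & as.vector(freq[x]) >= min_freq
  }
  if (!is.null(min_docfreq)) {
    first_in_context <- !duplicated(data.frame(x, context))
    docfreq <- table(x[first_in_context])     # number of contexts per token
    keep <- keep & as.vector(docfreq[x]) >= min_docfreq
  }
  x[!keep] <- NA
  x
}

toks <- c("a", "a", "b", "c", "a", "b")
ctx  <- c(1, 1, 1, 2, 2, 3)

sketch_min_filter(toks, ctx, min_freq = 3)
# "a" "a" NA NA "a" NA
sketch_min_filter(toks, ctx, min_docfreq = 2)
# "a" "a" "b" NA "a" "b"
```

Here "b" occurs only twice, so it fails min_freq = 3, but it occurs in two contexts, so it passes min_docfreq = 2.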