recompile a serialized tokens object
Description

This function recompiles a serialized tokens object when the vocabulary has been changed in a way that makes some of its types identical, such as lowercasing when a lowercased version of the type already exists in the type table, or that introduces gaps in the integer map of the types. It also re-indexes the types attribute to account for types that may have become duplicates through a procedure such as stemming or lowercasing, or for the addition of new tokens through compounding.
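One way to see what "duplicates" and "gaps" mean here is to inspect the integer codes directly. The following is a minimal sketch along the lines of the Examples below; it assumes quanteda's internal representation of a tokens object as a named list of integer vectors with a types attribute, and uses the non-exported quanteda:::tokens_recompile(). The commented output is indicative only.

library(quanteda)

toks <- tokens(c(d1 = "a b c d A B C D"))
attr(toks, "types")        # "a" "b" "c" "d" "A" "B" "C" "D"
unclass(toks)[["d1"]]      # integer codes 1 through 8, one per type

# lowercasing the type table directly makes types 5-8 duplicates of types 1-4
attr(toks, "types") <- char_tolower(attr(toks, "types"))

# recompiling merges the duplicated types and closes the gaps in the map
toks <- quanteda:::tokens_recompile(toks)
attr(toks, "types")        # "a" "b" "c" "d"
unclass(toks)[["d1"]]      # codes now drawn only from 1 through 4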
Usage

tokens_recompile(x, method = c("C++", "R"), gap = TRUE, dup = TRUE)
Arguments

x       the tokens object to be recompiled

method  the implementation to use: "C++" (the default) or "R"

gap     if TRUE, close gaps in the integer map of the types

dup     if TRUE, re-index types that have become duplicates so that they map to a single type
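As a rough illustration of these switches (a sketch only: it assumes that gap and dup independently toggle the gap-closing and duplicate-merging steps described above, and that "C++" is the default method), a direct call to the non-exported function might look like this:

library(quanteda)

toks <- tokens(c(one = "a b c d A B C D"))
attr(toks, "types") <- char_tolower(attr(toks, "types"))   # introduce duplicate types

quanteda:::tokens_recompile(toks)                          # both fixes applied (defaults)
quanteda:::tokens_recompile(toks, dup = TRUE, gap = FALSE) # assumed: merge duplicates only
quanteda:::tokens_recompile(toks, method = "R")            # assumed: use the R implementation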
Examples

# lowercasing
toks1 <- tokens(c(one = "a b c d A B C D", two = "A B C d"))
attr(toks1, "types") <- char_tolower(attr(toks1, "types"))
unclass(toks1)
unclass(quanteda:::tokens_recompile(toks1))

# stemming
toks2 <- tokens("Stemming stemmed many word stems.")
unclass(toks2)
unclass(quanteda:::tokens_recompile(tokens_wordstem(toks2)))

# compounding
toks3 <- tokens("One two three four.")
unclass(toks3)
unclass(tokens_compound(toks3, "two three"))

# lookup
dict <- dictionary(list(test = c("one", "three")))
unclass(tokens_lookup(toks3, dict))

# empty pads
unclass(tokens_select(toks3, dict))
unclass(tokens_select(toks3, dict, pad = TRUE))

# ngrams
unclass(tokens_ngrams(toks3, n = 2:3))