Select or remove tokens from a tokens object
These functions select or discard tokens from a tokens object. For convenience, the functions tokens_remove() and tokens_keep() are defined as shortcuts for tokens_select(x, pattern, selection = "remove") and tokens_select(x, pattern, selection = "keep"), respectively. The most common usage of tokens_remove() is to eliminate stop words from a text or text-based object, while the most common use of tokens_select() is to select only tokens with positive pattern matches from a list of regular expressions, including a dictionary. startpos and endpos determine the positions of tokens searched for pattern, and the areas affected are expanded by window.
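The shortcut relationship can be checked directly; a minimal sketch (assuming the quanteda package is installed and attached):

```r
library(quanteda)

# a toy tokens object
toks <- tokens("a b c d")

# tokens_remove() is a shortcut for tokens_select(..., selection = "remove")
identical(tokens_remove(toks, "b"),
          tokens_select(toks, "b", selection = "remove"))

# tokens_keep() is a shortcut for tokens_select(..., selection = "keep")
identical(tokens_keep(toks, "b"),
          tokens_select(toks, "b", selection = "keep"))
```

Both comparisons should return TRUE, since the shortcut functions simply forward their arguments to tokens_select().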
tokens_select(
  x,
  pattern,
  selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  padding = FALSE,
  window = 0,
  min_nchar = NULL,
  max_nchar = NULL,
  startpos = 1L,
  endpos = -1L,
  verbose = quanteda_options("verbose")
)

tokens_remove(x, ...)

tokens_keep(x, ...)
x: a tokens object whose token elements will be removed or kept

pattern: a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.

selection: whether to "keep" or "remove" the tokens matching pattern

valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching

case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values

padding: if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected tokens, for instance if a window of adjacency needs to be computed.

window: integer of length 1 or 2; the size of the window of tokens adjacent to pattern that will also be selected or removed. The window can be asymmetric if two elements are specified, with the first giving the window size before pattern and the second the window size after. Terms from overlapping windows are never double-counted, but simply returned in the pattern match.

min_nchar, max_nchar: optional numerics specifying the minimum and maximum length in characters for tokens to be removed or kept; defaults are NULL for no limits

startpos, endpos: integer; position of tokens in documents where pattern matching starts and ends, where 1 is the first token in a document. For negative indexes, counting starts at the ending token of the document, so that -1 denotes the last token in the document, -2 the second to last, etc. When the length of the vector is equal to the number of documents, the positions are applied to the corresponding documents.

verbose: if TRUE, print the number of tokens and documents before and after the function is applied

...: additional arguments passed by tokens_remove() and tokens_keep() to tokens_select()
A tokens object with tokens selected or removed based on their match to pattern.
## tokens_select with simple examples
toks <- as.tokens(list(letters, LETTERS))
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = FALSE)
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = TRUE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = FALSE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = TRUE)

# how case_insensitive works
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = TRUE)
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = FALSE)

# use window
tokens_select(toks, c("b", "f"), selection = "keep", window = 1)
tokens_select(toks, c("b", "f"), selection = "remove", window = 1)
tokens_remove(toks, c("b", "f"), window = c(0, 1))
tokens_select(toks, pattern = c("e", "g"), window = c(1, 2))

# tokens_remove example: remove stopwords
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
         wash2 <- "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.")
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))

# tokens_keep example: keep two-letter words
tokens_keep(tokens(txt, remove_punct = TRUE), "??")