Select features from a dfm or fcm
dfm_select(
  x,
  pattern = NULL,
  selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  min_nchar = NULL,
  max_nchar = NULL,
  verbose = quanteda_options("verbose")
)

dfm_remove(x, ...)

dfm_keep(x, ...)

fcm_select(
  x,
  pattern = NULL,
  selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  verbose = quanteda_options("verbose"),
  ...
)

fcm_remove(x, ...)

fcm_keep(x, ...)
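x |
the dfm or fcm object whose features will be selected |
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
selection |
whether to "keep" or "remove" the features matching pattern |
valuetype |
the type of pattern matching: "glob", "regex", or "fixed" |
case_insensitive |
logical; if TRUE, ignore case when matching pattern |
min_nchar, max_nchar |
optional numerics specifying the minimum and maximum length in characters for tokens to be removed or kept; defaults are NULL |
verbose |
if TRUE, print messages about the selection |
... |
used only for passing arguments from dfm_remove or dfm_keep to dfm_select, and from fcm_remove or fcm_keep to fcm_select |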
dfm_remove and fcm_remove are simply convenience wrappers for calling dfm_select and fcm_select with selection = "remove".
dfm_keep and fcm_keep are simply convenience wrappers for calling dfm_select and fcm_select with selection = "keep".
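The wrapper relationship above can be checked with a minimal sketch, assuming quanteda is installed. The toy texts and the object names (dfmat_toy, removed_wrapper, and so on) are illustrative and are not part of the package documentation.

```r
library(quanteda)

# Toy dfm; the texts are illustrative only
dfmat_toy <- dfm(tokens(c(d1 = "a b c d", d2 = "a c e")))

# dfm_remove() versus the explicit selection = "remove" call
removed_wrapper  <- dfm_remove(dfmat_toy, pattern = c("a", "b"))
removed_explicit <- dfm_select(dfmat_toy, pattern = c("a", "b"),
                               selection = "remove")
identical(featnames(removed_wrapper), featnames(removed_explicit))

# dfm_keep() versus the explicit selection = "keep" call
kept_wrapper  <- dfm_keep(dfmat_toy, pattern = c("a", "b"))
kept_explicit <- dfm_select(dfmat_toy, pattern = c("a", "b"),
                            selection = "keep")
identical(featnames(kept_wrapper), featnames(kept_explicit))
```

Both comparisons should report that the wrapper and the explicit call retain the same feature names.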
For compatibility with earlier versions, when pattern is a dfm object and selection = "keep", this is equivalent to calling dfm_match(). In this case, the following settings are always used: case_insensitive = FALSE and valuetype = "fixed". This functionality is deprecated, however, and you should use dfm_match() instead.
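As a hedged illustration of the recommended replacement, the sketch below conforms one dfm to the feature set of another with dfm_match(). The two toy dfm objects (dfmat_train, dfmat_new) are hypothetical names chosen for this example.

```r
library(quanteda)

# Two toy dfm objects with partly overlapping features (illustrative only)
dfmat_train <- dfm(tokens(c(d1 = "a b c", d2 = "b c d")))
dfmat_new   <- dfm(tokens(c(n1 = "c d e f")))

# Conform dfmat_new to the features of dfmat_train:
# features missing from dfmat_new are added as zero-count columns,
# and features not present in dfmat_train are dropped.
dfmat_matched <- dfm_match(dfmat_new, features = featnames(dfmat_train))
featnames(dfmat_matched)
```

Passing the feature names explicitly avoids relying on the deprecated behaviour of supplying a dfm as pattern.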
This function selects features based on their labels. To select features based on the values of the document-feature matrix, use dfm_trim().
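The difference between label-based and value-based selection can be sketched as follows. This is a minimal example on a made-up dfm; the min_termfreq threshold of 2 is an arbitrary choice for illustration.

```r
library(quanteda)

# Toy dfm: "tax" occurs three times, "plan" and "policy" once each
dfmat_toy <- dfm(tokens(c(d1 = "tax tax plan", d2 = "tax policy")))

# Label-based: keep features whose names start with "p"
by_label <- dfm_select(dfmat_toy, pattern = "p*",
                       selection = "keep", valuetype = "glob")
featnames(by_label)

# Value-based: keep features occurring at least twice across all documents
by_value <- dfm_trim(dfmat_toy, min_termfreq = 2)
featnames(by_value)
```

Here dfm_select() looks only at the feature names, while dfm_trim() looks only at the counts stored in the matrix.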
library(quanteda)

# build a case-preserving dfm from two short texts
dfmat <- dfm(tokens(c("My Christmas was ruined by your opposition tax plan.",
                      "Does the United_States or Sweden have more progressive taxation?")),
             tolower = FALSE)
dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                        wordsEndingInY = c("by", "my"),
                        notintext = "blahblah"))

# select using a dictionary, with and without case-insensitive matching
dfm_select(dfmat, pattern = dict)
dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)

# select using regular expressions
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")

# select using fixed matching against a stopword list
dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")

# select based on character length
dfm_select(dfmat, min_nchar = 5)

# remove stopwords from a dfm
dfmat <- dfm(tokens(c("This is a document with lots of stopwords.",
                      "No if, and, or but about it: lots of stopwords.")))
dfmat
dfm_remove(dfmat, stopwords("english"))

# remove stopwords from an fcm
toks <- tokens(c("this contains lots of stopwords",
                 "no if, and, or but about it: lots"),
               remove_punct = TRUE)
fcmat <- fcm(toks)
fcmat
fcm_remove(fcmat, stopwords("english"))