Apply a dictionary to a dfm
Apply a dictionary to a dfm by looking up all dfm features for matches in a
set of dictionary values, and replacing those features with a count of
the dictionary's keys. If exclusive = FALSE, the behaviour is instead to
apply a "thesaurus", where each value match is replaced by the dictionary
key, converted to capitals if capkeys = TRUE (so that the replacements
are easily distinguished from features that were terms found originally in
the document).
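The "thesaurus" behaviour described above can be sketched as follows. This is a minimal illustration, not part of the original examples: the dictionary dict_xmas and the one-sentence document are made up for the purpose, and the sketch assumes the quanteda package is installed and attached.

```r
library(quanteda)

# Hypothetical one-key dictionary and one-sentence document (illustration only)
dict_xmas <- dictionary(list(christmas = c("Christmas", "Santa", "holiday")))
dfmat_xmas <- dfm(tokens("Santa brought a holiday gift"))

# exclusive = FALSE keeps the unmatched features; capkeys then defaults to
# TRUE, so the replacement key is written in capitals
dfmat_thes <- dfm_lookup(dfmat_xmas, dict_xmas, exclusive = FALSE)
featnames(dfmat_thes)
```

Under these assumptions, the features matching "Santa" and "holiday" should be merged into a single capitalised key feature, while unmatched features such as "gift" remain in the dfm.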
dfm_lookup(
  x,
  dictionary,
  levels = 1:5,
  exclusive = TRUE,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  capkeys = !exclusive,
  nomatch = NULL,
  verbose = quanteda_options("verbose")
)
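x | the dfm to which the dictionary will be applied
dictionary | a dictionary class object
levels | levels of entries in a hierarchical dictionary that will be applied
exclusive | if TRUE, replace matched features with dictionary keys and drop all unmatched features; if FALSE, replace matches with keys and keep unmatched features (the "thesaurus" behaviour)
valuetype | the type of pattern matching: "glob", "regex", or "fixed"
case_insensitive | logical; if TRUE, ignore case when matching
capkeys | if TRUE, convert dictionary keys to capitals
nomatch | an optional character naming a new feature that will contain the counts of features of x that do not match any dictionary value
verbose | print status messages if TRUE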
If using dfm_lookup with dictionaries containing multi-word
values, matches will only occur if the features themselves are multi-word
or formed from ngrams. A better way to match dictionary values that include
multi-word patterns is to apply tokens_lookup() to the tokens,
and then construct the dfm.
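The tokens-first route mentioned in this note might look like the sketch below. The dictionary dict_mw and the sentence are hypothetical, chosen so that one dictionary value ("United States") spans two tokens; the sketch assumes quanteda is attached and relies on the default arguments of tokens_lookup().

```r
library(quanteda)

# Hypothetical dictionary with a multi-word value (illustration only)
dict_mw <- dictionary(list(country = c("United States", "Sweden")))

# Look up on the tokens, where the two-token sequence can still be matched
toks_mw <- tokens("Does the United States or Sweden have more progressive taxation?")
toks_dict <- tokens_lookup(toks_mw, dict_mw)

# Then construct the dfm from the looked-up tokens
dfmat_mw <- dfm(toks_dict)
dfmat_mw
```

Because the lookup runs before the dfm is built, the sequence "United States" can be matched as a unit, which would not be possible once the text has been reduced to single-word features.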
dfm_replace
dict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                        opposition = c("Opposition", "reject", "notincorpus"),
                        taxglob = "tax*",
                        taxregex = "tax.+$",
                        country = c("United_States", "Sweden")))
dfmat <- dfm(tokens(c("My Christmas was ruined by your opposition tax plan.",
                      "Does the United_States or Sweden have more progressive taxation?")),
             remove = stopwords("english"))
dfmat

# glob format
dfm_lookup(dfmat, dict, valuetype = "glob")
dfm_lookup(dfmat, dict, valuetype = "glob", case_insensitive = FALSE)

# regex v. glob format: note that "united_states" is a regex match for "tax*"
dfm_lookup(dfmat, dict, valuetype = "glob")
dfm_lookup(dfmat, dict, valuetype = "regex", case_insensitive = TRUE)

# fixed format: no pattern matching
dfm_lookup(dfmat, dict, valuetype = "fixed")
dfm_lookup(dfmat, dict, valuetype = "fixed", case_insensitive = FALSE)

# show unmatched tokens
dfm_lookup(dfmat, dict, nomatch = "_UNMATCHED")