Collocations model.
Creates a Collocations model which can be used for phrase extraction.
Format

Collocations is an R6Class object. Its field collocation_stat is a data.table with collocation (phrase) statistics, useful for filtering out non-relevant phrases.

For usage details see the Methods, Arguments and Examples sections.
Usage

model = Collocations$new(vocabulary = NULL, collocation_count_min = 50, pmi_min = 5,
  gensim_min = 0, lfmd_min = -Inf, llr_min = 0, sep = "_")
model$partial_fit(it, ...)
model$fit(it, n_iter = 1, ...)
model$transform(it)
model$prune(pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, llr_min = 0)
model$collocation_stat
Methods

$new(vocabulary = NULL, collocation_count_min = 50, pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, llr_min = 0, sep = "_")
Constructor for the Collocations model. For a description of the arguments see the Arguments section.
$fit(it, n_iter = 1, ...)
Fits the Collocations model to the input iterator it. Iterates over it up to n_iter times, so the model can hierarchically learn multi-word phrases: phrases collapsed in one pass can merge with neighbouring tokens in the next pass. Invisibly returns collocation_stat.
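For intuition, a minimal sketch with a contrived corpus and deliberately loose thresholds (on real data collocation_count_min and pmi_min should be kept much higher):

library(text2vec)
toy = rep("the new york times is a newspaper", 20)
it_toy = itoken(word_tokenizer(toy))
model_toy = Collocations$new(collocation_count_min = 10, pmi_min = 0)
# pass 1 can learn bigrams such as "new_york"; pass 2 can merge them
# with neighbouring tokens into longer phrases such as "new_york_times"
model_toy$fit(it_toy, n_iter = 2)
model_toy$collocation_stat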
$partial_fit(it, ...)
Iterates once over the data and learns collocations. Invisibly returns collocation_stat. This is the workhorse for $fit().
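For example, when the corpus arrives in chunks, one might call $partial_fit() on each chunk and prune once at the end; it_chunk1 and it_chunk2 below are placeholders for iterators over two portions of the data:

model_inc = Collocations$new(collocation_count_min = 10)
model_inc$partial_fit(it_chunk1)  # statistics from the first chunk
model_inc$partial_fit(it_chunk2)  # counts accumulate across calls
model_inc$prune(pmi_min = 5)      # filter low-scoring pairs once at the end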
$transform(it)
Transforms the input iterator using the learned collocations model. The result is a new itoken or itoken_parallel iterator which produces tokens with phrases collapsed into single tokens.
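For instance, given a fitted model and an iterator it (as in the Examples section below), one can check that the transformed iterator emits collapsed phrases by building a vocabulary from it:

it_phr = model$transform(it)
v_phr = create_vocabulary(it_phr)
grep("_", v_phr$term, value = TRUE)  # learned phrases contain the separator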
$prune(pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, llr_min = 0)
Filters out non-relevant phrases with low scores. The user can also do this directly by modifying the collocation_stat object.
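For example, to keep only phrases that score high on both pointwise mutual information and the log-likelihood ratio (the thresholds below are arbitrary and data-dependent):

model$prune(pmi_min = 8, llr_min = 25)
model$collocation_stat  # inspect what survived the filter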
Arguments

model
A Collocations model object.

n_iter
Number of iterations over the data.
pmi_min, gensim_min, lfmd_min, llr_min
Minimal scores of the corresponding statistics required to collapse tokens into a collocation:
- pmi_min: pointwise mutual information;
- gensim_min: the "gensim" score (https://radimrehurek.com/gensim/models/phrases.html), adapted from the word2vec paper;
- lfmd_min: log-frequency biased mutual dependency;
- llr_min: Dunning's log-likelihood ratio between the hypotheses of dependence and independence.
See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8101&rep=rep1&type=pdf and http://www.aclweb.org/anthology/I05-1050 for details. Also see the data in model$collocation_stat for better intuition about the magnitudes, along with the sketch below.
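As a rough guide to these scores, here are textbook-style definitions of PMI and LFMD computed from raw counts; this is a sketch for intuition only and may differ in details from the package's internal computation:

# n_xy: co-occurrence count of the pair; n_x, n_y: unigram counts;
# n: total number of tokens in the corpus
pmi = function(n_xy, n_x, n_y, n) {
  log2((n_xy / n) / ((n_x / n) * (n_y / n)))
}
lfmd = function(n_xy, n_x, n_y, n) {
  p_xy = n_xy / n
  log2(p_xy^2 / ((n_x / n) * (n_y / n))) + log2(p_xy)
}
pmi(n_xy = 100, n_x = 500, n_y = 400, n = 1e6)  # ~8.97, passes pmi_min = 5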
it
An input itoken or itoken_parallel iterator.
vocabulary
A text2vec_vocabulary. If provided, the model will look only for collocations that consist entirely of vocabulary terms.
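For example, to restrict candidate phrases to reasonably frequent terms, one might pass a pruned vocabulary to the constructor (thresholds below are illustrative):

v = prune_vocabulary(create_vocabulary(it), term_count_min = 10)
model = Collocations$new(vocabulary = v, collocation_count_min = 20)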
Examples

library(text2vec)
data("movie_review")

preprocessor = function(x) {
  gsub("[^[:alnum:]\\s]", replacement = " ", tolower(x))
}
sample_ind = 1:100
tokens = word_tokenizer(preprocessor(movie_review$review[sample_ind]))
it = itoken(tokens, ids = movie_review$id[sample_ind])
system.time(v <- create_vocabulary(it))
v = prune_vocabulary(v, term_count_min = 5)

model = Collocations$new(collocation_count_min = 5, pmi_min = 5)
model$fit(it, n_iter = 2)
model$collocation_stat

it2 = model$transform(it)
v2 = create_vocabulary(it2)
v2 = prune_vocabulary(v2, term_count_min = 5)

# check what phrases the model has learned
setdiff(v2$term, v$term)
# [1] "main_character"   "jeroen_krabb"    "boogey_man"      "in_order"
# [5] "couldn_t"         "much_more"       "my_favorite"     "worst_film"
# [9] "have_seen"        "characters_are"  "i_mean"          "better_than"
# [13] "don_t_care"      "more_than"       "look_at"         "they_re"
# [17] "each_other"      "must_be"         "sexual_scenes"   "have_been"
# [21] "there_are_some"  "you_re"          "would_have"      "i_loved"
# [25] "special_effects" "hit_man"         "those_who"       "people_who"
# [29] "i_am"            "there_are"       "could_have_been" "we_re"
# [33] "so_bad"          "should_be"       "at_least"        "can_t"
# [37] "i_thought"       "isn_t"           "i_ve"            "if_you"
# [41] "didn_t"          "doesn_t"         "i_m"             "don_t"

# and in the same way we can create a document-term matrix
# which contains both words and phrases!
dtm = create_dtm(it2, vocab_vectorizer(v2))
# check that the dtm contains phrases
which(colnames(dtm) == "jeroen_krabb")