
Collocations

Collocations model.


Description

Creates a Collocations model which can be used for phrase extraction.

Usage

Collocations

Format

R6Class object.

Fields

collocation_stat

data.table with collocation (phrase) statistics. Useful for filtering out non-relevant phrases.

Usage

For usage details see Methods, Arguments and Examples sections.

model = Collocations$new(vocabulary = NULL, collocation_count_min = 50, pmi_min = 5, gensim_min = 0,
                         lfmd_min = -Inf, llr_min = 0, sep = "_")
model$partial_fit(it, ...)
model$fit(it, n_iter = 1, ...)
model$transform(it)
model$prune(pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, llr_min = 0)
model$collocation_stat

Methods

$new(vocabulary = NULL, collocation_count_min = 50, sep = "_")

Constructor for the Collocations model. For a description of the arguments see the Arguments section.

$fit(it, n_iter = 1, ...)

Fits the Collocations model to the input iterator it. It iterates over the input iterator it n_iter times, so the model can hierarchically learn multi-word phrases. Invisibly returns collocation_stat.

$partial_fit(it, ...)

Iterates once over the data and learns collocations. Invisibly returns collocation_stat. This is the workhorse for $fit().
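
As a rough sketch (not the package's internal implementation), fitting with n_iter = 2 behaves roughly like two consecutive passes with $partial_fit():

# rough sketch only; the actual $fit() may differ (e.g. prune between passes)
model$partial_fit(it)
model$partial_fit(it)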

$transform(it)

Transforms the input iterator using the learned collocations model. The result of the transformation is a new itoken or itoken_parallel iterator which will produce tokens with phrases collapsed into single tokens.
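
For example, a minimal sketch (assuming a fitted model and an itoken iterator it):

# produce a new iterator in which learned phrases are collapsed into single tokens
it_phrases = model$transform(it)
# it can be used anywhere a regular itoken iterator is accepted, e.g. to build
# a vocabulary which contains the learned phrases
v_phrases = create_vocabulary(it_phrases)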

$prune(pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, llr_min = 0)

Filters out non-relevant phrases with low scores. The user can also do this directly by modifying the collocation_stat object.
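
A minimal sketch (assuming a fitted model; the threshold value is illustrative):

# inspect the learned statistics first
head(model$collocation_stat)
# then drop low-scoring phrases before calling model$transform()
model$prune(pmi_min = 8)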

Arguments

model

A Collocations model object

n_iter

number of iterations over the data

pmi_min, gensim_min, lfmd_min, llr_min

minimal scores of the corresponding statistics required in order to collapse tokens into a collocation:

  • pointwise mutual information

  • "gensim" scores - https://radimrehurek.com/gensim/models/phrases.html adapted from word2vec paper

  • log-frequency biased mutual dependency

  • Dunning's logarithm of the ratio between the likelihoods of the hypotheses of dependence and independence

See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.8101&rep=rep1&type=pdf and http://www.aclweb.org/anthology/I05-1050 for details. Also inspect the data in model$collocation_stat for better intuition.
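
As a rough illustration of the first of these statistics (not the package's internal code; the counts and the logarithm base below are assumptions), pointwise mutual information for a candidate bigram can be sketched as:

# illustrative only: PMI of a candidate bigram (x, y)
# n_xy - bigram count, n_x / n_y - unigram counts, n - total number of tokens
pmi = function(n_xy, n_x, n_y, n) {
  log2((n_xy / n) / ((n_x / n) * (n_y / n)))
}
pmi(n_xy = 50, n_x = 200, n_y = 300, n = 1e5)  # ~ 6.4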

it

An input itoken or itoken_parallel iterator

vocabulary

text2vec_vocabulary object. If provided, the model will only look for collocations consisting of vocabulary terms.
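
For example, a minimal sketch (assuming an itoken iterator it; the threshold values are illustrative):

# restrict candidate phrases to terms from a pruned vocabulary
v = prune_vocabulary(create_vocabulary(it), term_count_min = 10)
model = Collocations$new(vocabulary = v, collocation_count_min = 20)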

Examples

library(text2vec)
data("movie_review")

# lowercase the text and replace non-alphanumeric characters with spaces
preprocessor = function(x) {
  gsub("[^[:alnum:]\\s]", replacement = " ", tolower(x))
}
# take a small sample of reviews and tokenize it
sample_ind = 1:100
tokens = word_tokenizer(preprocessor(movie_review$review[sample_ind]))
it = itoken(tokens, ids = movie_review$id[sample_ind])
# build a vocabulary and drop rare terms
system.time(v <- create_vocabulary(it))
v = prune_vocabulary(v, term_count_min = 5)

# learn collocations: keep phrases seen at least 5 times with PMI >= 5
model = Collocations$new(collocation_count_min = 5, pmi_min = 5)
model$fit(it, n_iter = 2)
model$collocation_stat

it2 = model$transform(it)
v2 = create_vocabulary(it2)
v2 = prune_vocabulary(v2, term_count_min = 5)
# check what phrases the model has learned
setdiff(v2$term, v$term)
# [1] "main_character"  "jeroen_krabb"    "boogey_man"      "in_order"
# [5] "couldn_t"        "much_more"       "my_favorite"     "worst_film"
# [9] "have_seen"       "characters_are"  "i_mean"          "better_than"
# [13] "don_t_care"      "more_than"       "look_at"         "they_re"
# [17] "each_other"      "must_be"         "sexual_scenes"   "have_been"
# [21] "there_are_some"  "you_re"          "would_have"      "i_loved"
# [25] "special_effects" "hit_man"         "those_who"       "people_who"
# [29] "i_am"            "there_are"       "could_have_been" "we_re"
# [33] "so_bad"          "should_be"       "at_least"        "can_t"
# [37] "i_thought"       "isn_t"           "i_ve"            "if_you"
# [41] "didn_t"          "doesn_t"         "i_m"             "don_t"

# in the same way we can create a document-term matrix which contains
# both words and phrases!
dtm = create_dtm(it2, vocab_vectorizer(v2))
# check that dtm contains phrases
which(colnames(dtm) == "jeroen_krabb")
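
# the phrase-aware dtm behaves like any other document-term matrix;
# for example (illustrative sketch), it can be weighted with TF-IDF:
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)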

