Train a word2vec model on text
Construct a word2vec model on text. The algorithm is explained at https://arxiv.org/pdf/1310.4546.pdf
word2vec(
  x,
  type = c("cbow", "skip-gram"),
  dim = 50,
  window = ifelse(type == "cbow", 5L, 10L),
  iter = 5L,
  lr = 0.05,
  hs = FALSE,
  negative = 5L,
  sample = 0.001,
  min_count = 5L,
  split = c(" \n,.-!?:;/\"#$%&'()*+<=>@[]\\^_`{|}~\t\v\f\r", ".\n?!"),
  stopwords = character(),
  threads = 1L,
  encoding = "UTF-8",
  ...
)
x: a character vector with text or the path to a file on disk containing the training data
type: the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'cbow'
dim: dimension of the word vectors. Defaults to 50
window: skip length between words. Defaults to 5 for 'cbow' and 10 for 'skip-gram'
iter: number of training iterations. Defaults to 5
lr: initial learning rate, also known as alpha. Defaults to 0.05
hs: logical indicating to use hierarchical softmax instead of negative sampling. Defaults to FALSE, indicating to do negative sampling
negative: integer with the number of negative samples. Only used in case hs is set to FALSE
sample: threshold for occurrence of words: words appearing more frequently than this are randomly down-sampled during training. Defaults to 0.001
min_count: integer indicating the number of times a word should occur in the training data to be considered part of the vocabulary. Defaults to 5
split: a character vector of length 2 where the first element indicates how to split words and the second element indicates how to split sentences in x
stopwords: a character vector of stopwords to exclude from training
threads: number of CPU threads to use. Defaults to 1
encoding: the encoding of x and stopwords. Defaults to 'UTF-8'
...: further arguments passed on to the C++ training function
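As a quick illustration of these arguments, a minimal sketch of a call that trains from a file on disk; the file name reviews.txt and the small stopword list are hypothetical placeholders, not files or data shipped with the package.

library(word2vec)
## train from a file on disk instead of an in-memory character vector;
## "reviews.txt" is a hypothetical plain-text file
model <- word2vec(x = "reviews.txt",
                  type = "cbow",                       # the default algorithm
                  dim = 50,                            # size of the word vectors
                  window = 5,                          # context window for cbow
                  min_count = 5,                       # drop words seen fewer than 5 times
                  stopwords = c("de", "het", "een"),   # hypothetical stopword list
                  threads = 2)                         # use 2 CPU threads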
Some advice on the optimal set of parameters to use for training, as given by Mikolov et al.:
argument type: skip-gram (slower, better for infrequent words) vs cbow (fast)
argument hs: the training algorithm: hierarchical softmax (better for infrequent words) vs negative sampling (better for frequent words, better with low dimensional vectors)
argument dim: dimensionality of the word vectors: usually more is better, but not always
argument window: for skip-gram usually around 10, for cbow around 5
argument sample: sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in the range 0.001 to 0.00001)
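Putting this advice together, a sketch of a skip-gram configuration aimed at infrequent words in a larger corpus; the exact values are illustrative choices within the recommended ranges, and x is assumed to be a character vector of sentences as elsewhere on this page.

model <- word2vec(x = x,
                  type = "skip-gram",   # slower, but better for infrequent words
                  dim = 100,            # more dimensions, usually better
                  window = 10,          # around 10 is advised for skip-gram
                  hs = TRUE,            # hierarchical softmax, better for rare words
                  sample = 0.0001,      # stronger sub-sampling for a large corpus
                  iter = 10)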
Returns an object of class w2v_trained, which is a list with the following elements:
model: an Rcpp pointer to the model
data: a list with elements file (the training data used), stopwords (the character vector of stopwords) and n
vocabulary: the number of words in the vocabulary
success: logical indicating if training succeeded
error_log: the error log in case training failed
control: a list of the training arguments used, namely min_count, dim, window, iter, lr, skipgram, hs, negative, sample, split_words, split_sents, expTableSize and expValueMax
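For instance, these elements can be inspected after training with ordinary list indexing; a small sketch, assuming model holds the result of a word2vec() call as in the examples below.

model$success          # logical, TRUE if training succeeded
model$vocabulary       # number of words in the vocabulary
model$control$dim      # the dim argument used for training
model$data$stopwords   # the stopwords that were excluded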
library(word2vec)
library(udpipe)
## Take data and standardise it a bit
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)

## Build the model, get word embeddings and nearest neighbours
model <- word2vec(x = x, dim = 15, iter = 20)
emb <- as.matrix(model)
head(emb)
emb <- predict(model, c("bus", "toilet", "unknownword"), type = "embedding")
emb
nn <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
nn

## Get vocabulary
vocab <- summary(model, type = "vocabulary")

## Do some calculations with the vectors and find similar terms to these
emb <- as.matrix(model)
vector <- emb["buurt", ] - emb["rustige", ] + emb["restaurants", ]
predict(model, vector, type = "nearest", top_n = 10)

vector <- emb["gastvrouw", ] - emb["gastvrij", ]
predict(model, vector, type = "nearest", top_n = 5)

vectors <- emb[c("gastheer", "gastvrouw"), ]
vectors <- rbind(vectors, avg = colMeans(vectors))
predict(model, vectors, type = "nearest", top_n = 10)

## Save the model to hard disk
path <- "mymodel.bin"
write.word2vec(model, file = path)
model <- read.word2vec(path)

##
## Example getting word embeddings
##   which are different depending on the parts-of-speech tag.
## Look at the help of the udpipe R package
##   to get parts-of-speech tags on text.
##
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language == "fr")
x <- subset(x, grepl(xpos, pattern = paste(LETTERS, collapse = "|")))
x$text <- sprintf("%s/%s", x$lemma, x$xpos)
x <- subset(x, !is.na(lemma))
x <- paste.data.frame(x, term = "text", group = "doc_id", collapse = " ")
x <- x$text
model <- word2vec(x = x, dim = 15, iter = 20, split = c(" ", ".\n?!"))
emb <- as.matrix(model)
nn <- predict(model, c("cuisine/NN", "rencontrer/VB"), type = "nearest")
nn
nn <- predict(model, c("accueillir/VBN", "accueillir/VBG"), type = "nearest")
nn
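Building on the embedding matrix emb from the last example above, the similarity between two word vectors can also be computed by hand; a minimal sketch in base R, where the cosine helper is not part of the package and the two row names are assumed to be in the trained vocabulary.

## cosine similarity between two embedding rows, computed in base R
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
cosine(emb["cuisine/NN", ], emb["rencontrer/VB", ])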