Similarity and distance computation between documents or features
These functions compute matrices of distances or similarities between
documents or features of a dfm()
and return them in a sparse format. The methods are fast
and robust because they operate directly on the sparse dfm objects.
The output can easily be coerced to an ordinary matrix, a data.frame of
pairwise comparisons, or a dist object.
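For instance, a minimal sketch of that workflow, using the built-in data_corpus_inaugural corpus as in the Examples below:

library("quanteda")
# a small document-feature matrix of post-2000 inaugural speeches
dfmat <- dfm(tokens(corpus_subset(data_corpus_inaugural, Year > 2000),
                    remove_punct = TRUE))
# pairwise cosine similarities between documents, returned in sparse form
tstat <- textstat_simil(dfmat, method = "cosine", margin = "documents")
# coerce to an ordinary dense matrix for inspection
as.matrix(tstat)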
textstat_simil(
  x,
  y = NULL,
  selection = NULL,
  margin = c("documents", "features"),
  method = c("correlation", "cosine", "jaccard", "ejaccard", "dice", "edice",
             "hamman", "simple matching"),
  min_simil = NULL,
  ...
)

textstat_dist(
  x,
  y = NULL,
  selection = NULL,
  margin = c("documents", "features"),
  method = c("euclidean", "manhattan", "maximum", "canberra", "minkowski"),
  p = 2,
  ...
)
x, y | a dfm object; y is an optional second dfm against which x is compared (if omitted, comparisons are made among the documents or features of x)
selection | (deprecated - use y instead)
margin | identifies the margin of the dfm on which similarity or distance will be computed: "documents" for documents or "features" for word/term features
method | character; the method identifying the similarity or distance measure to be used; see Details
min_simil | numeric; a threshold below which similarity values will not be returned
... | unused
p | the power of the Minkowski distance
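As a brief illustration of the min_simil argument (the 0.5 threshold is arbitrary and chosen purely for illustration):

library("quanteda")
dfmat <- dfm(tokens(corpus_subset(data_corpus_inaugural, Year > 2000),
                    remove_punct = TRUE))
# keep only similarities of at least 0.5; values below the threshold
# are not returned in the sparse result
textstat_simil(dfmat, method = "cosine", margin = "documents",
               min_simil = 0.5)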
For textstat_simil(), the method options are: "correlation" (the default),
"cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", and "hamman".

For textstat_dist(), the method options are: "euclidean" (the default),
"manhattan", "maximum", "canberra", and "minkowski".
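The p argument applies only to the Minkowski distance; a short sketch (the value p = 3 is arbitrary):

library("quanteda")
dfmat <- dfm(tokens(corpus_subset(data_corpus_inaugural, Year > 2000),
                    remove_punct = TRUE))
# Manhattan distance between documents
textstat_dist(dfmat, method = "manhattan", margin = "documents")
# Minkowski distance with power p = 3 (p = 2 would reproduce Euclidean)
textstat_dist(dfmat, method = "minkowski", p = 3, margin = "documents")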
A sparse matrix from the Matrix package, which will be symmetric unless y is specified.
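For example, supplying y restricts the comparison targets, so the result is no longer square and symmetric; a sketch using the inaugural-speech dfm from the Examples below:

library("quanteda")
dfmat <- dfm(tokens(corpus_subset(data_corpus_inaugural, Year > 2000),
                    remove_punct = TRUE))
# all pairwise document distances: a symmetric ndoc-by-ndoc matrix
tstat_all <- textstat_dist(dfmat, margin = "documents")
dim(as.matrix(tstat_all))
# distances to a single target document: ndoc rows, one column
tstat_one <- textstat_dist(dfmat, dfmat["2017-Trump", ], margin = "documents")
dim(as.matrix(tstat_one))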
The output objects from textstat_simil() and textstat_dist() can easily be
transformed into other formats: as.list() returns a list with one element for
each unique element of the second of the pairs; as.data.frame() returns a
data.frame of pairwise scores; as.dist() returns a dist object; and
as.matrix() converts the result into an ordinary matrix.
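For example, a sketch of these coercions applied to a document-similarity object (the object name tstat is arbitrary):

library("quanteda")
dfmat <- dfm(tokens(corpus_subset(data_corpus_inaugural, Year > 2000),
                    remove_punct = TRUE))
tstat <- textstat_simil(dfmat, method = "cosine", margin = "documents")
as.data.frame(tstat)  # pairwise scores, one row per pair of documents
as.list(tstat)        # a list, one element per second element of the pairs
as.dist(tstat)        # a dist object, e.g. for hclust()
as.matrix(tstat)      # an ordinary dense matrix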
If you want to compute similarity on a "normalized" dfm object
(controlling for variable document lengths, for methods such as correlation
for which different document lengths matter), then wrap the input dfm in
dfm_weight(x, "prop").
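For instance, a sketch of computing correlations on a dfm normalized to relative term frequencies:

library("quanteda")
dfmat <- dfm(tokens(corpus_subset(data_corpus_inaugural, Year > 2000),
                    remove_punct = TRUE))
# normalize to relative term frequencies before computing correlations,
# so that differing document lengths do not drive the result
textstat_simil(dfm_weight(dfmat, "prop"), method = "correlation",
               margin = "documents")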
# similarities for documents
library("quanteda")
dfmat <- corpus_subset(data_corpus_inaugural, Year > 2000) %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("english")) %>%
    dfm()
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
as.matrix(tstat1)
as.list(tstat1)
as.list(tstat1, diag = TRUE)

# min_simil
(tstat2 <- textstat_simil(dfmat, method = "cosine", margin = "documents",
                          min_simil = 0.6))
as.matrix(tstat2)

# similarities for specific documents
textstat_simil(dfmat, dfmat["2017-Trump", ], margin = "documents")
textstat_simil(dfmat, dfmat["2017-Trump", ], method = "cosine", margin = "documents")
textstat_simil(dfmat, dfmat[c("2009-Obama", "2013-Obama"), ], margin = "documents")

# compute some term similarities
tstat3 <- textstat_simil(dfmat, dfmat[, c("fair", "health", "terror")],
                         method = "cosine", margin = "features")
head(as.matrix(tstat3), 10)
as.list(tstat3, n = 6)

# distances for documents
(tstat4 <- textstat_dist(dfmat, margin = "documents"))
as.matrix(tstat4)
as.list(tstat4)
as.dist(tstat4)

# distances for specific documents
textstat_dist(dfmat, dfmat["2017-Trump", ], margin = "documents")
(tstat5 <- textstat_dist(dfmat, dfmat[c("2009-Obama", "2013-Obama"), ], margin = "documents"))
as.matrix(tstat5)
as.list(tstat5)

## Not run:
# plot a dendrogram after converting the object into distances
plot(hclust(as.dist(tstat4)))
## End(Not run)