Corpus clustering based on the Reinert method - Simple clustering
rainette(dtm, k = 10, min_uc_size = 10, min_split_members = 5, cc_test = 0.3, tsj = 3, min_members)
dtm | quanteda dfm object of documents to cluster, usually the result of
k | maximum number of clusters to compute
min_uc_size | minimum number of forms per document
min_split_members | don't try to split groups with fewer members than this value
cc_test | contingency coefficient value for feature selection
tsj | minimum frequency value for feature selection
min_members | deprecated, use
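To make the cc_test argument more concrete, here is a minimal base R sketch of Pearson's contingency coefficient, C = sqrt(chi2 / (chi2 + N)). It assumes this is the coefficient cc_test refers to, which this page does not state. The function name contingency_coef and the 2 x 2 table are hypothetical, and the sketch does not show how rainette compares the value against cc_test.

```r
# Pearson's contingency coefficient for a table of counts.
# Assumption: this is the coefficient behind `cc_test`; the page does not confirm it.
contingency_coef <- function(tab) {
  n <- sum(tab)
  # Expected counts under independence of rows and columns
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi2 <- sum((tab - expected)^2 / expected)
  sqrt(chi2 / (chi2 + n))
}

# Hypothetical table: rows are two candidate groups of documents,
# columns are "feature present" / "feature absent".
tab <- matrix(c(8, 3, 2, 7), nrow = 2,
              dimnames = list(c("group1", "group2"), c("present", "absent")))
cc <- contingency_coef(tab)
cc
```

The coefficient is 0 when the feature is distributed identically in both groups and grows as the feature becomes more specific to one of them.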
See the references for the original articles on the method. Computations and results may differ noticeably from those of the original method; see the package vignettes for more details.
The dtm object is automatically converted to boolean.
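As an illustration of what a boolean conversion means, the base R sketch below turns a small count matrix into a presence/absence matrix. The matrix and its feature names are made up for the example. In quanteda the analogous operation is dfm_weight(dtm, scheme = "boolean"), though whether rainette uses that call internally is not stated here.

```r
# Hypothetical document-term counts: 2 documents, 3 features
counts <- matrix(
  c(2, 0, 1,
    0, 3, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("doc1", "doc2"), c("tax", "peopl", "nation"))
)

# Boolean weighting: 1 if the feature occurs in the document, 0 otherwise
presence <- (counts > 0) * 1L
presence
```

After this step, how often a form occurs within a document no longer matters; only whether it occurs at all.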
The result is a list with both class hclust and class rainette. Besides the elements of an hclust object, two more results are available:
uce_groups: gives the group of each document for each k
group: gives the group of each document for the maximum value of k available
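The two extra elements can be inspected as sketched below. This assumes the quanteda and rainette packages are installed, and it uses a tokens-based preprocessing pipeline as an assumption for quanteda versions where dfm() expects a tokens object. The exact internal layout of uce_groups is not described on this page, so str() is used to look at it rather than assuming one.

```r
library(quanteda)
library(rainette)

# Small corpus, split into segments
corpus <- split_segments(head(data_corpus_inaugural, n = 10))

# Assumed tokens-based preprocessing: drop punctuation and stopwords,
# then keep features occurring at least 3 times
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm_trim(dfm(tok), min_termfreq = 3)

res <- rainette(dtm, k = 3)

class(res)           # should include both "hclust" and "rainette"
table(res$group)     # cluster sizes for the largest k available
str(res$uce_groups)  # group of each document for each k
```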
Reinert M., Une méthode de classification descendante hiérarchique : application à l'analyse lexicale par contexte, Cahiers de l'analyse des données, Volume 8, Numéro 2, 1983. http://www.numdam.org/item/?id=CAD_1983__8_2_187_0
Reinert M., Alceste une méthodologie d'analyse des données textuelles et une application : Aurelia De Gerard De Nerval, Bulletin de Méthodologie Sociologique, Volume 26, Numéro 1, 1990. doi: 10.1177/075910639002600103
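library(quanteda)
library(rainette)

# Use the first ten texts of the inaugural corpus, split into segments
corpus <- data_corpus_inaugural
corpus <- head(corpus, n = 10)
corpus <- split_segments(corpus)

# Build the dfm, then stem features and drop those occurring fewer than 3 times
dtm <- dfm(corpus, remove = stopwords("en"), tolower = TRUE, remove_punct = TRUE)
dtm <- dfm_wordstem(dtm, language = "english")
dtm <- dfm_trim(dtm, min_termfreq = 3)
# Compute at most 3 clusters
res <- rainette(dtm, k = 3)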