lda: newsgroups – R documentation

newsgroups

A collection of newsgroup messages with classes.

Description

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

Usage

data(newsgroup.train.documents)
data(newsgroup.test.documents)
data(newsgroup.train.labels)
data(newsgroup.test.labels)
data(newsgroup.vocab)
data(newsgroup.label.map)

Format

newsgroup.train.documents and newsgroup.test.documents together comprise a corpus of newsgroup documents conforming to the LDA format, partitioned into 11269 training and 7505 test cases (18774 in total), distributed approximately evenly across 20 classes.
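
As a rough illustration of the LDA format mentioned above: the lda package represents each document as a two-row integer matrix, where row 1 holds 0-indexed word identifiers into the vocabulary and row 2 holds the corresponding counts. The sketch below uses a hypothetical toy vocabulary and document (`vocab` and `doc` are made-up stand-ins, not the packaged data) to show how such a matrix could be decoded into words.

```r
# Toy stand-ins (hypothetical): a tiny vocabulary and one document
# shaped like an element of newsgroup.train.documents.
vocab <- c("space", "orbit", "hockey", "goal")

# Each column is one distinct word: row 1 = 0-indexed word id, row 2 = count.
doc <- matrix(as.integer(c(0, 2,    # "space" appears twice
                           1, 1,    # "orbit" appears once
                           3, 4)),  # "goal" appears four times
              nrow = 2)

# Word ids are 0-indexed, R vectors are 1-indexed, so add 1 before lookup.
words <- vocab[doc[1, ] + 1L]
word_counts <- setNames(doc[2, ], words)

print(word_counts)
```

With the real data, the same lookup would presumably be `newsgroup.vocab[newsgroup.train.documents[[i]][1, ] + 1]` for document `i`.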

newsgroup.train.labels is a numeric vector of length 11269 which gives a class label from 1 to 20 for each training document in the corpus.

newsgroup.test.labels is a numeric vector of length 7505 which gives a class label from 1 to 20 for each test document in the corpus.

newsgroup.vocab is the vocabulary of the corpus.

newsgroup.label.map maps the numeric class labels to actual class names.
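
A minimal sketch of how the label map might be applied, assuming it behaves like a character vector (or factor) in which position k holds the name of class k. The objects `labels` and `label_map` below are hypothetical toy stand-ins for newsgroup.train.labels and newsgroup.label.map, shortened to three classes.

```r
# Toy stand-ins (hypothetical): numeric class labels for four documents
# and a three-entry label map.
labels <- c(1, 3, 3, 2)
label_map <- c("alt.atheism", "comp.graphics", "comp.os.ms-windows.misc")

# Index the map by the numeric label to recover class names.
# as.character() keeps this working if the map is stored as a factor.
label_names <- as.character(label_map)[labels]

# Count documents per class.
class_counts <- table(label_names)
print(class_counts)
```

Applied to the packaged data, the analogous call would be along the lines of `table(as.character(newsgroup.label.map)[newsgroup.train.labels])`, which should show roughly equal counts per class if the split is as even as described.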

Source

http://qwone.com/~jason/20Newsgroups/

Examples

data(newsgroup.train.documents)
data(newsgroup.test.documents)
data(newsgroup.train.labels)
data(newsgroup.test.labels)
data(newsgroup.vocab)
data(newsgroup.label.map)

lda

Collapsed Gibbs Sampling Methods for Topic Models

v1.4.2

LGPL

Authors

Jonathan Chang

Initial release

2015-11-22