newsgroups

A collection of newsgroup messages with classes.


Description

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

Usage

data(newsgroup.train.documents)
data(newsgroup.test.documents)
data(newsgroup.train.labels)
data(newsgroup.test.labels)
data(newsgroup.vocab)
data(newsgroup.label.map)

Format

newsgroup.train.documents and newsgroup.test.documents comprise a corpus of approximately 20,000 newsgroup documents conforming to the LDA format, partitioned into 11269 training and 7505 test cases distributed (nearly) evenly across 20 classes.

newsgroup.train.labels is a numeric vector of length 11269 which gives a class label from 1 to 20 for each training document in the corpus.

newsgroup.test.labels is a numeric vector of length 7505 which gives a class label from 1 to 20 for each test document in the corpus.

newsgroup.vocab is the vocabulary of the corpus.

newsgroup.label.map maps the numeric class labels to actual class names.
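
As a rough illustration of how these pieces fit together, the sketch below builds a tiny stand-in corpus rather than loading the real data. It assumes the LDA format described under lda.collapsed.gibbs.sampler: each document is a two-row integer matrix whose first row holds 0-indexed positions into the vocabulary and whose second row holds the corresponding word counts. All toy.* objects and the decode.document helper are hypothetical and not part of the package. Treating the label map as a character vector indexed by class label is likewise an assumption.

```r
# Hypothetical stand-ins for newsgroup.vocab, newsgroup.train.documents,
# newsgroup.train.labels and newsgroup.label.map (not the real data).
toy.vocab <- c("space", "orbit", "hockey", "goal")

# Each document: row 1 = 0-indexed word indices, row 2 = word counts.
toy.documents <- list(
  matrix(as.integer(c(0, 2, 1, 1)), nrow = 2),  # "space" x2, "orbit" x1
  matrix(as.integer(c(2, 1, 3, 3)), nrow = 2)   # "hockey" x1, "goal" x3
)

toy.labels <- c(1, 2)

# Assumption: the label map is a character vector indexed by class label.
toy.label.map <- c("sci.space", "rec.sport.hockey")

# Hypothetical helper: turn one document into a named vector of word counts.
decode.document <- function(doc, vocab) {
  counts <- doc[2, ]
  names(counts) <- vocab[doc[1, ] + 1]  # shift 0-indexed ids to R's 1-indexing
  counts
}

decoded <- lapply(toy.documents, decode.document, vocab = toy.vocab)

# Look up the class name of each document from its numeric label.
class.names <- toy.label.map[toy.labels]

# Total number of word tokens in each document.
doc.lengths <- sapply(toy.documents, function(doc) sum(doc[2, ]))
```

If the real objects follow the same layout, the same helper could be applied to an element of newsgroup.train.documents with newsgroup.vocab as the vocabulary. Check the structure with str() first.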

Source

http://qwone.com/~jason/20Newsgroups/

See Also

lda.collapsed.gibbs.sampler for the format of the corpus.

Examples

data(newsgroup.train.documents)
data(newsgroup.test.documents)
data(newsgroup.train.labels)
data(newsgroup.test.labels)
data(newsgroup.vocab)
data(newsgroup.label.map)

lda

Collapsed Gibbs Sampling Methods for Topic Models

v1.4.2
LGPL
Authors
Jonathan Chang
Initial release
2015-11-22
