
textstat_summary

Summarize documents as syntactic and lexical feature counts


Description

Count syntactic and lexical features of documents such as tokens, types, sentences, and character categories.

Usage

textstat_summary(x, ...)

Arguments

x

a corpus, tokens, or dfm object to be summarized

...

additional arguments passed through to dfm()

Details

Count the total number of characters, tokens, and sentences, as well as special tokens such as numbers, punctuation marks, symbols, tags, and emojis. A rough sketch of how these counts map onto quanteda's own helpers follows the list below.

  • chars = number of characters; equal to nchar()

  • sents = number of sentences; equal to ntoken(tokens(x), what = "sentence")

  • tokens = number of tokens; equal to ntoken()

  • types = number of unique tokens; equal to ntype()

  • puncts = number of punctuation marks (^\p{P}+$)

  • numbers = number of numeric tokens (^\p{Sc}{0,1}\p{N}+([.,]*\p{N})*\p{Sc}{0,1}$)

  • symbols = number of symbols (^\p{S}$)

  • tags = number of tags; sum of tokens matching the pattern_username and pattern_hashtag regular expressions set in quanteda::quanteda_options()

  • emojis = number of emojis (^\p{Emoji_Presentation}+$)
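
The counts above can be approximated directly with quanteda's own helpers. The snippet below is a minimal illustrative sketch, not the package's internal implementation; count_matches() is a hypothetical helper defined here only to apply the regular expressions from the list to each document's tokens.

library("quanteda")

corp <- data_corpus_inaugural[1:2]
toks <- tokens(corp)

chars <- nchar(as.character(corp))                # total characters per document
sents <- ntoken(tokens(corp, what = "sentence"))  # sentences per document
ntoks <- ntoken(toks)                             # tokens per document
types <- ntype(toks)                              # unique tokens per document

# hypothetical helper: count tokens matching a regular expression, per document
count_matches <- function(toks, pattern) {
  vapply(as.list(toks),
         function(x) sum(stringi::stri_detect_regex(x, pattern)),
         integer(1))
}

puncts  <- count_matches(toks, "^\\p{P}+$")
numbers <- count_matches(toks, "^\\p{Sc}{0,1}\\p{N}+([.,]*\\p{N})*\\p{Sc}{0,1}$")
symbols <- count_matches(toks, "^\\p{S}$")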

Examples

if (Sys.info()["sysname"] != "SunOS") {
  library("quanteda")
  library("quanteda.textstats")

  # summarize the same texts as a corpus, a tokens object, and a dfm
  corp <- data_corpus_inaugural[1:5]
  textstat_summary(corp)

  toks <- tokens(corp)
  textstat_summary(toks)

  dfmat <- dfm(toks)
  textstat_summary(dfmat)
}
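
Because the returned object behaves as a plain data frame whose columns follow the names in the Details list, the counts can feed directly into further calculations. A small follow-on sketch, continuing from the objects created above (the ttr column is added here purely for illustration):

summ <- textstat_summary(toks)
# type-token ratio per document, using the "types" and "tokens" columns
# described in Details; "ttr" is not a column produced by the function itself
summ$ttr <- summ$types / summ$tokens
summ[, c("tokens", "types", "ttr")]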

