quanteda: tokenize_internal – R documentation


quanteda tokenizers

Description

Internal tokenization functions that provide the default and legacy methods for text segmentation.

Usage

tokenize_word(x, split_hyphens = FALSE, verbose = quanteda_options("verbose"))

tokenize_word1(x, split_hyphens = FALSE, verbose = quanteda_options("verbose"))

tokenize_character(x, ...)

tokenize_sentence(x, ..., verbose = FALSE)

tokenize_fasterword(x, ...)

tokenize_fastestword(x, ...)

Arguments

`x`	(named) character; input texts
`split_hyphens`	logical; if `TRUE`, split words that are connected by hyphenation and hyphenation-like characters in between words, e.g. `"self-aware"` becomes `c("self", "-", "aware")`
`verbose`	if `TRUE`, print timing messages to the console
`...`	used to pass arguments among the functions
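
The effect of `split_hyphens` described above can be illustrated with a small base-R sketch. This is not quanteda's implementation: `sketch_tokenize()` is a hypothetical helper written only for this illustration. It assumes simple whitespace splitting and handles only the ASCII hyphen, not the wider set of hyphenation-like characters the argument covers.

```r
# Hypothetical helper, NOT part of quanteda: a base-R sketch that only
# illustrates the documented effect of `split_hyphens`.
sketch_tokenize <- function(x, split_hyphens = FALSE) {
  # Split each text on runs of whitespace (a simplifying assumption)
  toks <- strsplit(x, "[[:space:]]+")
  if (split_hyphens) {
    # Break every token into runs of non-hyphen characters and single hyphens
    toks <- lapply(toks, function(tok) {
      pieces <- regmatches(tok, gregexpr("[^-]+|-", tok))
      as.character(unlist(pieces, use.names = FALSE))
    })
  }
  # Keep the document names of the input, if any
  names(toks) <- names(x)
  toks
}

txt <- c(doc6 = "Self-aware machines")
sketch_tokenize(txt)                        # "Self-aware" stays one token
sketch_tokenize(txt, split_hyphens = TRUE)  # "Self", "-", "aware" are separate
```

Like the functions documented here, the sketch returns a list with one character vector per input text.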

Value

A list of character vectors, one per input text, corresponding to the (most conservative) tokenization, including whitespace where applicable. The exception is tokenize_word1(), a special tokenizer for Internet language that includes URLs, #hashtags, @usernames, and email addresses.

Examples

## Not run: 
txt <- c(doc1 = "Tweet https://quanteda.io using @quantedainit and #rstats.",
         doc2 = "The £1,000,000 question.",
         doc4 = "Line 1.\nLine2\n\nLine3.",
         doc5 = "?",
         doc6 = "Self-aware machines! \U0001f600")
tokenize_word(txt)
tokenize_word(txt, split_hyphens = TRUE)
tokenize_word1(txt, split_hyphens = FALSE)
tokenize_word1(txt, split_hyphens = TRUE)
tokenize_fasterword(txt)
tokenize_fastestword(txt)
tokenize_sentence(txt)
tokenize_character(txt[2])

## End(Not run)

quanteda

Quantitative Analysis of Textual Data

v3.0.0

GPL-3

Authors

Kenneth Benoit [cre, aut, cph] (<https://orcid.org/0000-0002-0797-564X>), Kohei Watanabe [aut] (<https://orcid.org/0000-0001-6519-5265>), Haiyan Wang [aut] (<https://orcid.org/0000-0003-4992-4311>), Paul Nulty [aut] (<https://orcid.org/0000-0002-7214-4666>), Adam Obeng [aut] (<https://orcid.org/0000-0002-2906-4775>), Stefan Müller [aut] (<https://orcid.org/0000-0002-6315-4125>), Akitaka Matsuo [aut] (<https://orcid.org/0000-0002-3323-6330>), William Lowe [aut] (<https://orcid.org/0000-0002-1549-6163>), Christian Müller [ctb], European Research Council [fnd] (ERC-2011-StG 283794-QUANTESS)
