tokenizers: chunk_text – R documentation

chunk_text

Chunk text into smaller segments

Description

Given a text or a vector/list of texts, break each text into smaller segments containing the same number of words (the final segment of a text may be shorter). This allows you to treat a very long document, such as a novel, as a set of smaller documents.
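The idea of fixed-size word chunks can be sketched in base R. This is only an illustration of the grouping step, not the package's implementation: the helper name `chunk_words_sketch` is hypothetical, and a plain whitespace split stands in for `tokenize_words`.

```r
# Illustrative helper (hypothetical name; not part of tokenizers).
chunk_words_sketch <- function(text, chunk_size = 100) {
  stopifnot(is.character(text), length(text) == 1, chunk_size >= 1)
  # Simplification: split on whitespace instead of calling tokenize_words.
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # Words 1..chunk_size fall in group 1, the next chunk_size in group 2, etc.
  groups <- ceiling(seq_along(words) / chunk_size)
  # Rejoin the words of each group into one string per chunk.
  unname(vapply(split(words, groups), paste, character(1), collapse = " "))
}

text <- paste(rep("word", 250), collapse = " ")
chunks <- chunk_words_sketch(text, chunk_size = 100)
length(chunks)                   # 3 chunks
lengths(strsplit(chunks, " "))   # 100 100 50
```

The last chunk holds only the leftover words, which is why it can be shorter than `chunk_size`.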

Usage

chunk_text(x, chunk_size = 100, doc_id = names(x), ...)

Arguments

`x`	A character vector or a list of character vectors to be chunked. If `x` is a character vector, it can be of any length, and each element will be chunked separately. If `x` is a list of character vectors, each element of the list should have a length of 1.
`chunk_size`	The number of words in each chunk.
`doc_id`	The document IDs as a character vector. This will be taken from the names of the `x` vector if available. `NULL` is acceptable.
`...`	Arguments passed on to `tokenize_words`.
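
As a small sketch of how these arguments interact, the example below chunks two made-up documents, first letting `doc_id` default to the names of `x` and then supplying IDs explicitly. The document names and IDs are invented for illustration, and the exact format of the resulting chunk names is not shown here because it is not specified above.

```r
library(tokenizers)

# Two made-up documents; the names become the default doc_id.
docs <- c(
  first  = paste(rep("alpha", 25), collapse = " "),
  second = paste(rep("beta", 12), collapse = " ")
)

# Each element is chunked separately into pieces of at most 10 words.
chunks <- chunk_text(docs, chunk_size = 10)
length(chunks)   # 25 words -> 3 chunks, 12 words -> 2 chunks
names(chunks)    # chunk names are derived from the document IDs

# Supply document IDs explicitly instead of relying on names(x).
chunks_custom <- chunk_text(unname(docs), chunk_size = 10,
                            doc_id = c("doc_a", "doc_b"))
names(chunks_custom)
```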

Details

Chunking the text passes it through `tokenize_words`, which strips punctuation and lowercases the text unless you pass arguments to that function through `...`.
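
To illustrate the pass-through, the sketch below chunks a short sentence twice: once with the defaults and once with `lowercase = FALSE` and `strip_punct = FALSE`, two arguments of `tokenize_words`. The sample sentence is chosen only for illustration. Note that when punctuation is kept, punctuation marks count as tokens, so chunk boundaries can shift.

```r
library(tokenizers)

text <- "Call me Ishmael. Some years ago, never mind how long precisely."

# Defaults: text is lowercased and punctuation is stripped.
default_chunks <- chunk_text(text, chunk_size = 4)
default_chunks[[1]]

# Forward arguments to tokenize_words to keep case and punctuation.
kept_chunks <- chunk_text(text, chunk_size = 4,
                          lowercase = FALSE, strip_punct = FALSE)
kept_chunks[[1]]
```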

Examples

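## Not run: 
library(tokenizers)

# mobydick is a sample text shipped with the tokenizers package
chunked <- chunk_text(mobydick, chunk_size = 100)
length(chunked)   # number of chunks
chunked[1:3]      # first three chunks

## End(Not run)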

tokenizers

Fast, Consistent Tokenization of Natural Language Text

v0.2.1

MIT + file LICENSE

Authors

Lincoln Mullen [aut, cre] (<https://orcid.org/0000-0001-5103-6917>), Os Keyes [ctb] (<https://orcid.org/0000-0001-5196-609X>), Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb] (<https://orcid.org/0000-0001-9953-3904>), Kenneth Benoit [ctb] (<https://orcid.org/0000-0002-0797-564X>)

Initial release

2018-03-29