tokenizer

Tokenizers


Description

Tokenize a document or character vector.

Usage

Boost_tokenizer(x)
MC_tokenizer(x)
scan_tokenizer(x)

Arguments

x

A character vector, or an object that can be coerced to character by as.character.

Details

The quality and correctness of a tokenization algorithm depend heavily on the context and application scenario. Relevant factors are the language of the underlying text and the notions of whitespace (which can vary with the encoding used and the language) and punctuation marks. Consequently, for superior results you probably need a custom tokenization function.
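As a quick illustration of how the built-in tokenizers differ, the sketch below runs all three on a short sentence containing contractions and punctuation (the exact token boundaries may vary with the tm version, so no outputs are shown):

```r
library(tm)

x <- "It's a test, isn't it?"

# Boost_tokenizer and MC_tokenizer also break on punctuation;
# scan_tokenizer splits on whitespace only, like scan().
Boost_tokenizer(x)
MC_tokenizer(x)
scan_tokenizer(x)
```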

Boost_tokenizer

Uses the Boost (https://www.boost.org) Tokenizer (via Rcpp).

MC_tokenizer

Implements the functionality of the tokenizer in the MC toolkit (https://www.cs.utexas.edu/users/dml/software/mc/).

scan_tokenizer

Simulates scan(..., what = "character").
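The stated equivalence can be checked directly; this sketch compares scan_tokenizer with a plain scan call on the same string, which should typically agree:

```r
library(tm)

x <- "one two  three"
identical(scan_tokenizer(x),
          scan(text = x, what = "character", quiet = TRUE))
```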

Value

A character vector consisting of tokens obtained by tokenization of x.

See Also

getTokenizers to list tokenizers provided by package tm.

Regexp_Tokenizer for tokenizers using regular expressions provided by package NLP.

tokenize for a simple regular expression based tokenizer provided by package tau.

tokenizers for a collection of tokenizers provided by package tokenizers.

Examples

data("crude")
Boost_tokenizer(crude[[1]])
MC_tokenizer(crude[[1]])
scan_tokenizer(crude[[1]])
strsplit_space_tokenizer <- function(x)
    unlist(strsplit(as.character(x), "[[:space:]]+"))
strsplit_space_tokenizer(crude[[1]])
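Beyond standalone use, a tokenizer function can also be plugged into tm's term-frequency machinery through the tokenize control option (a sketch; see termFreq for the accepted values):

```r
library(tm)

data("crude")
# Build a term-document matrix using MC_tokenizer instead of the default
tdm <- TermDocumentMatrix(crude, control = list(tokenize = MC_tokenizer))
inspect(tdm[1:5, 1:3])
```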

tm: Text Mining Package

Version: 0.7-8
License: GPL-3
Authors: Ingo Feinerer [aut, cre] (<https://orcid.org/0000-0001-7656-8338>), Kurt Hornik [aut] (<https://orcid.org/0000-0003-4198-9911>), Artifex Software, Inc. [ctb, cph] (pdf_info.ps taken from GPL Ghostscript)
Released: 2020-11-17
