tokenizer

Tokenizers


Description

Tokenize a document or character vector.

Usage

Boost_tokenizer(x)
MC_tokenizer(x)
scan_tokenizer(x)

Arguments

x

A character vector, or an object that can be coerced to character by as.character.

Details

The quality and correctness of a tokenization algorithm depend heavily on the context and application scenario. Relevant factors are the language of the underlying text and the notions of whitespace (which can vary with the encoding used and the language) and punctuation marks. Consequently, for superior results you probably need a custom tokenization function.
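As a quick illustration of how the built-in tokenizers differ, the sketch below runs all three on a short sentence containing contractions and punctuation (the exact token boundaries may vary with the tm version, so no outputs are shown):

```r
library(tm)

x <- "It's a test, isn't it?"

# Boost_tokenizer and MC_tokenizer also break on punctuation;
# scan_tokenizer splits on whitespace only, like scan().
Boost_tokenizer(x)
MC_tokenizer(x)
scan_tokenizer(x)
```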

Boost_tokenizer

Uses the Boost (https://www.boost.org) Tokenizer (via Rcpp).

MC_tokenizer

Implements the functionality of the tokenizer in the MC toolkit (https://www.cs.utexas.edu/users/dml/software/mc/).

scan_tokenizer

Simulates scan(..., what = "character").
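The stated equivalence can be checked directly; this sketch compares scan_tokenizer with a plain scan call on the same string, which should typically agree:

```r
library(tm)

x <- "one two  three"
identical(scan_tokenizer(x),
          scan(text = x, what = "character", quiet = TRUE))
```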

Value

A character vector consisting of tokens obtained by tokenization of x.

See Also

getTokenizers to list tokenizers provided by package tm.

Regexp_Tokenizer for tokenizers using regular expressions provided by package NLP.

tokenize for a simple regular expression based tokenizer provided by package tau.

tokenizers for a collection of tokenizers provided by package tokenizers.

Examples

data("crude")
Boost_tokenizer(crude[[1]])
MC_tokenizer(crude[[1]])
scan_tokenizer(crude[[1]])
strsplit_space_tokenizer <- function(x)
    unlist(strsplit(as.character(x), "[[:space:]]+"))
strsplit_space_tokenizer(crude[[1]])
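Beyond standalone use, a tokenizer function can also be plugged into tm's term-frequency machinery through the tokenize control option (a sketch; see termFreq for the accepted values):

```r
library(tm)

data("crude")
# Build a term-document matrix using MC_tokenizer instead of the default
tdm <- TermDocumentMatrix(crude, control = list(tokenize = MC_tokenizer))
inspect(tdm[1:5, 1:3])
```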

tm: Text Mining Package

Version: 0.7-8
License: GPL-3
Authors: Ingo Feinerer [aut, cre] (<https://orcid.org/0000-0001-7656-8338>), Kurt Hornik [aut] (<https://orcid.org/0000-0003-4198-9911>), Artifex Software, Inc. [ctb, cph] (pdf_info.ps taken from GPL Ghostscript)
Released: 2020-11-17
