Become an expert in R — Interactive courses, Cheat Sheets, certificates and more!
Get Started for Free

shingle-tokenizers

Character shingle tokenizers


Description

The character shingle tokenizer functions like an n-gram tokenizer, except the units that are shingled are characters instead of words. Options to the function let you determine whether non-alphanumeric characters like punctuation should be retained or discarded.

Usage

tokenize_character_shingles(x, n = 3L, n_min = n, lowercase = TRUE,
  strip_non_alphanum = TRUE, simplify = FALSE)

Arguments

x

A character vector or a list of character vectors to be tokenized into character shingles. If x is a character vector, it can be of any length, and each element will be tokenized separately. If x is a list of character vectors, each element of the list should have a length of 1.

n

The number of characters in each shingle. This must be an integer greater than or equal to 1.

n_min

This must be an integer greater than or equal to 1, and less than or equal to n.

lowercase

Should the characters be made lower case?

strip_non_alphanum

Should punctuation and white space be stripped?

simplify

FALSE by default so that a consistent value is returned regardless of length of input. If TRUE, then an input with a single element will return a character vector of tokens instead of a list.

Value

A list of character vectors containing the tokens, with one element in the list for each element that was passed as input. If simplify = TRUE and only a single element was passed as input, then the output is a character vector of tokens.

Examples

x <- c("Now is the hour of our discontent")
tokenize_character_shingles(x)
tokenize_character_shingles(x, n = 5)
tokenize_character_shingles(x, n = 5, strip_non_alphanum = FALSE)
tokenize_character_shingles(x, n = 5, n_min = 3, strip_non_alphanum = FALSE)

tokenizers

Fast, Consistent Tokenization of Natural Language Text

v0.2.1
MIT + file LICENSE
Authors
Lincoln Mullen [aut, cre] (<https://orcid.org/0000-0001-5103-6917>), Os Keyes [ctb] (<https://orcid.org/0000-0001-5196-609X>), Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb] (<https://orcid.org/0000-0001-9953-3904>), Kenneth Benoit [ctb] (<https://orcid.org/0000-0002-0797-564X>)
Initial release
2018-03-29

We don't support your browser anymore

Please choose more modern alternatives, such as Google Chrome or Mozilla Firefox.