
txt_clean_word2vec

Text cleaning specific for input to word2vec


Description

Standardise text by

  • Converting text from UTF-8 to ASCII

  • Keeping only alphanumeric characters (letters and numbers)

  • Collapsing multiple spaces into one

  • Removing leading/trailing spaces

  • Lowercasing
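
The steps above can be approximated in base R. The following is only a sketch under stated assumptions, not the package's actual implementation: the function name clean_sketch is hypothetical, non-alphanumeric characters are assumed to be replaced by a space rather than deleted, and non-ASCII characters are simply dropped (the package may transliterate them instead).

```r
# Hypothetical base-R sketch of the cleaning steps; not the word2vec package code.
clean_sketch <- function(x, ascii = TRUE, alpha = TRUE, tolower = TRUE, trim = TRUE) {
  if (ascii) {
    # Assumption: characters with no ASCII equivalent are dropped (sub = "")
    x <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  }
  if (alpha) {
    # Assumption: each run of non-alphanumeric characters becomes one space
    x <- gsub("[^[:alnum:]]+", " ", x)
  }
  # Collapse any run of whitespace (spaces, tabs, newlines) into a single space
  x <- gsub("[[:space:]]+", " ", x)
  if (trim) {
    # Remove leading and trailing white space
    x <- trimws(x)
  }
  if (tolower) {
    x <- base::tolower(x)
  }
  x
}

clean_sketch(c("  Just some.texts,  ok?", "123.456 and\tsome MORE!  "))
```

Inside the function the argument tolower shadows the base function of the same name, so the sketch calls base::tolower explicitly to keep the two apart.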

Usage

txt_clean_word2vec(x, ascii = TRUE, alpha = TRUE, tolower = TRUE, trim = TRUE)

Arguments

x

a character vector in UTF-8 encoding

ascii

logical indicating whether to use iconv to convert the input from UTF-8 to ASCII. Defaults to TRUE.

alpha

logical indicating whether to keep only alphanumeric characters. Defaults to TRUE.

tolower

logical indicating whether to lowercase x. Defaults to TRUE.

trim

logical indicating whether to trim leading/trailing white space. Defaults to TRUE.

Value

a character vector of the same length as x, standardised by converting the encoding to ASCII, lowercasing, and keeping only alphanumeric characters

Examples

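library(word2vec)
# Two raw strings with punctuation, a tab, mixed case and stray spaces
x <- c("  Just some.texts,  ok?", "123.456 and\tsome MORE!  ")
# Clean with the default settings (all options TRUE)
txt_clean_word2vec(x)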

word2vec

Distributed Representations of Words

v0.3.3
Apache License (>= 2.0)
Authors
Jan Wijffels [aut, cre, cph] (R wrapper), BNOSAC [cph] (R wrapper), Max Fomichev [ctb, cph] (Code in src/word2vec)
Initial release
