check_text

Check Text For Potential Problems


Description

check_text - Uncleaned text may result in errors, warnings, and incorrect results in subsequent analysis. check_text checks text for potential problems and suggests possible fixes. Potential text anomalies that are detected include: factors, missing ending punctuation, empty cells, double punctuation, non-space after comma, no alphabetic characters, non-ASCII, missing value, and potentially misspelled words.

available_checks - Provides a data.frame view of all the available checks in the check_text function.

Usage

check_text(x, file = NULL, checks = NULL, n = 10, ...)

available_checks()

Arguments

x

The text variable.

file

A connection, or a character string naming the file to print to. If NULL, the report prints to the console. Note that this is assigned as an attribute and passed to print.

checks

A vector of checks to include from which_are. If checks = NULL, all checks from which_are will be used. Note that all meta checks will be conducted (see which_are for details on meta checks).

n

The number of affected elements to print out (the rest are truncated).

...

Ignored.

Value

Returns a list reporting the following potential text faults:

  • contraction- Text elements that contain contractions

  • date- Text elements that contain dates

  • digit- Text elements that contain digits/numbers

  • email- Text elements that contain email addresses

  • emoticon- Text elements that contain emoticons

  • empty- Text elements that contain empty text cells (all white space)

  • escaped- Text elements that contain escaped back spaced characters

  • hash- Text elements that contain Twitter style hash tags (e.g., #rstats)

  • html- Text elements that contain HTML markup

  • incomplete- Text elements that contain incomplete sentences (e.g., uses ending punctuation like ...)

  • kern- Text elements that contain kerning (e.g., 'The B O M B!')

  • list_column- Text variable that is a list column

  • missing_value- Text elements that contain missing values

  • misspelled- Text elements that contain potentially misspelled words

  • no_alpha- Text elements that contain no alphabetic (a-z) letters

  • no_endmark- Text elements that are missing ending punctuation

  • no_space_after_comma- Text elements that contain commas with no space afterwards

  • non_ascii- Text elements that contain non-ASCII text

  • non_character- Text variable that is not a character column (likely factor)

  • non_split_sentence- Text elements that contain unsplit sentences (more than one sentence per element)

  • tag- Text elements that contain Twitter style handle tags (e.g., @trinker)

  • time- Text elements that contain timestamps

  • url- Text elements that contain URLs

Note

The output is a list containing meta checks and elemental checks, but it prints as formatted output showing the potential problem elements, the accompanying text, and possible suggestions to fix the text.
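
Examples

The following is a minimal sketch of how the functions above might be called, not an excerpt from the package's own examples. The sample vector x is made up for illustration, and the check names passed to checks are assumed to match the names listed under Value. The call is restricted to a few checks so the sketch does not depend on the spelling check. The requireNamespace guard lets the script run even where textclean is not installed.

```r
# Hypothetical sample text containing several of the problems listed above
x <- c(
  "i like",                            # no ending punctuation
  "<p>Visit https://example.com</p>",  # HTML markup and a URL
  "",                                  # empty element
  NA,                                  # missing value
  "they,were there.",                  # no space after comma
  "3;"                                 # no alphabetic characters
)

if (requireNamespace("textclean", quietly = TRUE)) {
  # data.frame view of every check that check_text can run
  print(textclean::available_checks())

  # Run a subset of checks (names assumed to match the Value list),
  # showing at most 5 affected elements per check
  report <- textclean::check_text(
    x,
    checks = c("empty", "no_endmark", "html", "url"),
    n = 5
  )

  # Printing gives the formatted report; the object itself is a list
  print(report)
  print(names(report))
}
```

Calling check_text(x) with checks = NULL runs every available check instead of the subset used here.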


textclean

Text Cleaning Tools

v0.9.3
GPL-2
Authors
Tyler Rinker [aut, cre], ctwheels StackOverflow [ctb]
Initial release
