Character classes
Match character classes.
alnum(lo, hi, char_class = TRUE)
alpha(lo, hi, char_class = TRUE)
blank(lo, hi, char_class = TRUE)
cntrl(lo, hi, char_class = TRUE)
digit(lo, hi, char_class = TRUE)
graph(lo, hi, char_class = TRUE)
lower(lo, hi, char_class = TRUE)
printable(lo, hi, char_class = TRUE)
punct(lo, hi, char_class = TRUE)
space(lo, hi, char_class = TRUE)
upper(lo, hi, char_class = TRUE)
hex_digit(lo, hi, char_class = TRUE)
any_char(lo, hi)
grapheme(lo, hi)
newline(lo, hi)
dgt(lo, hi, char_class = FALSE)
wrd(lo, hi, char_class = FALSE)
spc(lo, hi, char_class = FALSE)
not_dgt(lo, hi, char_class = FALSE)
not_wrd(lo, hi, char_class = FALSE)
not_spc(lo, hi, char_class = FALSE)
ascii_digit(lo, hi, char_class = TRUE)
ascii_lower(lo, hi, char_class = TRUE)
ascii_upper(lo, hi, char_class = TRUE)
ascii_alpha(lo, hi, char_class = TRUE)
ascii_alnum(lo, hi, char_class = TRUE)
char_range(lo, hi, char_class = lo < hi)
lo |
A non-negative integer. Minimum number of repeats, when grouped. |
hi |
A positive integer. Maximum number of repeats, when grouped. |
char_class |
A logical value. Should |
A character vector representing part or all of a regular expression.
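To make the role of lo and hi concrete, here is a minimal base-R sketch of a pattern with a minimum and maximum number of repeats. The helper repeat_class() is hypothetical and written only for this illustration; it is not one of the functions documented here, and the exact string those functions return is not assumed.

```r
# Hypothetical helper (an assumption for illustration, not part of the
# documented API): append a {lo,hi} quantifier to a character class.
repeat_class <- function(class, lo, hi) {
  paste0(class, "{", lo, ",", hi, "}")
}

# Between 3 and 5 digits
rx <- repeat_class("[[:digit:]]", 3, 5)

# Anchor the pattern so the whole string must consist of 3 to 5 digits
full <- paste0("^", rx, "$")
matched <- grepl(full, c("12", "123", "12345", "123456"))
matched
```

Only the second and third inputs fall inside the 3 to 5 range, so only those should match.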
R has many built-in locale-dependent character classes, like [:alnum:] (representing alphanumeric characters, that is, lower or upper case letters or numbers). Some of these behave in unexpected ways when using the ICU engine (that is, when using stringi or stringr). See the punctuation example. For these engines, using Unicode properties (UnicodeProperty) may give you a more reliable match.
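As a rough illustration of why a POSIX class and a Unicode property can disagree, the sketch below uses only base R with perl = TRUE (the PCRE engine), so it does not show ICU behaviour directly. It assumes R's PCRE was built with Unicode property support, which is usual but not guaranteed.

```r
# "!" is in Unicode's Punctuation category; "$" is a currency Symbol.
p <- c("!", "$")

# POSIX class under PCRE: both count as punctuation
posix_punct <- grepl("[[:punct:]]", p, perl = TRUE)

# Unicode property \p{P} (Punctuation): only "!" qualifies
unicode_punct <- grepl("\\p{P}", p, perl = TRUE)

data.frame(char = p, posix_punct, unicode_punct)
```

Being explicit about the property you mean avoids depending on how a given engine interprets [:punct:].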
There are also some generic character classes like \w (representing lower or upper case letters or numbers or underscores). Since version 0.0-3, these use the default char_class = FALSE, since they already act as character classes.
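A small base-R check of that point: under PCRE (perl = TRUE), wrapping \w in square brackets does not change what it matches, so the extra wrapper is redundant. The input string is made up for the example.

```r
x <- "foo_bar 42!"

# Bare generic class
bare <- regmatches(x, gregexpr("\\w+", x, perl = TRUE))[[1]]

# Same class wrapped in a bracket expression
wrapped <- regmatches(x, gregexpr("[\\w]+", x, perl = TRUE))[[1]]

bare
identical(bare, wrapped)
```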
Finally, there are ASCII-only ways of specifying letters like a-zA-Z.
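The practical difference shows up with accented letters. The sketch below, in base R with perl = TRUE, checks an ASCII-only range against a name containing "é". The second call uses the locale-dependent [:alpha:] class for comparison; its result is left unstated because it can vary with locale and engine.

```r
x <- c("Chloe", "Chlo\u00e9")

# ASCII-only letters: the accented name is rejected
ascii_only <- grepl("^[a-zA-Z]+$", x, perl = TRUE)
ascii_only

# Locale-dependent class: whether "\u00e9" counts as a letter
# depends on the locale and the regex engine in use
posix_alpha <- grepl("^[[:alpha:]]+$", x)
posix_alpha
```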
Which version you want depends upon how you want to deal with international characters, and the vagaries of the underlying regular expression engine. I suggest reading the regex help page and doing lots of testing.
# R character classes
alnum()
alpha()
blank()
cntrl()
digit()
graph()
lower()
printable()
punct()
space()
upper()
hex_digit()

# Special chars
any_char()
grapheme()
newline()

# Generic classes
dgt()
wrd()
spc()

# Generic negated classes
not_dgt()
not_wrd()
not_spc()

# Non-locale-specific classes
ascii_digit()
ascii_lower()
ascii_upper()

# Don't provide a class wrapper
digit(char_class = FALSE) # same as DIGIT

# Match repeated values
digit(3)
digit(3, 5)
digit(0)
digit(1)
digit(0, 1)

# Ranges of characters
char_range(0, 7) # octal number

# Usage
(rx <- digit(3))
stringi::stri_detect_regex(c("123", "one23"), rx)

# Some classes behave differently under different engines
# In particular PCRE and Perl recognise all these characters
# as punctuation but ICU does not
p <- c(
  "!", "@", "#", "$", "%", "^", "&", "*", "(", ")", "[", "]", "{", "}",
  ";", ":", "'", '"', ",", "<", ">", ".", "/", "?", "\\", "|", "`", "~"
)
icu_matched <- stringi::stri_detect_regex(p, punct())
p[icu_matched]
p[!icu_matched]
pcre_matched <- grepl(punct(), p)
p[pcre_matched]
p[!pcre_matched]

# A grapheme is a character that can be defined by more than one code point
# PCRE does not recognise the concept.
x <- c("Chloe", "Chlo\u00e9", "Chlo\u0065\u0301")
stringi::stri_match_first_regex(x, "Chlo" %R% capture(grapheme()))

# newline() matches three types of line ending: \r, \n, \r\n.
# You can standardize line endings using
stringi::stri_replace_all_regex("foo\nbar\r\nbaz\rquux", NEWLINE, "\n")