Posix Character Class Does Not Work in Base R Regex

POSIX character class does not work in base R regex

Although stringr ICU regex engines supports bare POSIX character classes in the pattern, in base R regex flavors (both PCRE (perl=TRUE) and TRE), POSIX character classes must be inside bracket expressions. [:alnum:] -> [[:alnum:]].

x <- c("AZaz09 y AZaz09", "ĄŻaz09 y AZŁł09", "26 de Marzo y Pareyra de la Luz")
grepl("[[:alnum:][:blank:]]+[[:blank:]][yY][[:blank:]][[:alnum:][:blank:]]+", x)
## => [1] TRUE TRUE TRUE
grepl("[[:alnum:][:blank:]]+[[:blank:]][yY][[:blank:]][[:alnum:][:blank:]]+", x, perl=TRUE)
## => [1] TRUE TRUE TRUE

See the online demo

When you use [:alnum:] alone, it is a simple bracket expression that matches a single character, a :, a, l, n, u, m.

Pattern details:

  • [[:alnum:][:blank:]]+ - 1+ alphanumeric or horizontal whitespace symbols
  • [[:blank:]] - 1 horizontal whitespace symbols
  • [yY] - either y or Y
  • [[:blank:]] - 1 horizontal whitespace symbols
  • [[:alnum:][:blank:]]+ - 1+ alphanumeric or horizontal whitespace symbols

Problem with the outputs from using double square brackets

You should be careful with square brackets inside regular expressions. Now, I will assume you are using the default TRE regex library that is used with the base R regex functions (when no perl=TRUE is passed).

In this case, you should differentiate between

  • [ and ] that mark the start and end of the POSIX character class, e.g. [:alpha:]
  • [ and ] that mark the start and end of a bracket expression
  • unescaped ] that is not preceded with a matching unescaped open [ is treated as a literal ] char.

The [[a-z][0-9]] regex is not equal to [a-z0-9].

  • [[a-z][0-9]] matches strings like [1], a1] and means:
    • [[a-z] - a bracket expression matching a [ char or any lowercase ASCII letter
    • [0-9] - a digit
    • ] - a ] char.

The [a-z0-9] bracket expression just matches a lowercase ASCII letter or digit.

There is no such a construct in regex as double square brackets. Inside a character class, [ can be used anywhere to match a [ char. ] only matches a ] when it is placed at the start of a bracket expression:

  • [a-z[] matches a single char, a lowercase ASCII letter or [
  • [][a-z] matches a single char, a lowercase ASCII letter, [ or ]
  • [[a-z]] matches a lowercase ASCII letter or [ and then a ] char (so, 2 chars in total)

More things to consider

  • [:digit:] indeed matches :, d, i, g or t in TRE and PCRE regex, but ICU regex library (used in stringr/stringi regex functions) supports these bare POSIX character classes (as mentioned in POSIX character class does not work in base R regex) and matches a digit even outside bracket expressions
  • [A-z] matches more than just letters.

Unable to form the required regex in C

[.*] does not mean "0 or more occurrences" of anything. It means "a single character, either a (literal) . or a (literal) [*]". […] defines a character class, which matches exactly one character from the specified set. Brackets are not even remotely the same as parentheses.

So if you wanted to express "zero or more of any character except newline", you could just write .*. That's what .* means. And if you wanted "one or more" instead of "zero or more", you could change the * to a plus, as long as you remember that regex.h regexes should always be compiled with the REG_EXTENDED flag. Without that flag, + is just an ordinary character. (And there are a lot of other inconveniences.)

But that's probably not really what you want. My guess is that you want something like:

^[*]([.][A-Za-z0-9_]+){2,}$

although you'll have to correct the character class to specify the precise set of characters you think are legitimate.

Again, don't forget the crucial REG_EXTENDED flag when you call regcomp.

Some notes:

  • The {2,} requires at least two components after the *, so that *.cool doesn't match.

  • The ^ and $ at the beginning and end of the regex "anchor" the match to the entire input. That stops the pattern matching just a part of the input, but it might not be exactly what you want, either.

  • Finally, I deliberately used a single-character character class to force [*] and [.] to be ordinary characters. I find that a lot more readable than falling timber (\\) and it avoids having to think about the combination of string escaping and regex-escaping.

For more information, I highly recommend reading man regcomp and man 7 regex. A good introduction to regexes might be useful, as well.

regular expressions in base R: 'perl=TRUE' vs. the default (PCRE vs. TRE)

It is not a good idea to compare apples to oranges, as PCRE regex can do much more than TRE regex enine. Although they share similar constructs, even then appearances might turn out deceitful.

What is similar between TRE and PCRE

TRE supports literals as PCRE. A literal is either an ordinary character, an 8-bit hex character (like \x1B), a wide hex character (like \x{263a}), or an escaped character: \a, \e, \f, \n, \r, \t. PCRE supports more: \cx ("control-x", where x is any ASCII character), \0dd (character with octal code 0dd), \ddd (character with octal code ddd, or back reference), \o{ddd..} (character with octal code ddd..), \xhh (character with hex code hh), \x{hhh..} (character with hex code hhh..).

Both have a . wildcard, but in TRE, it matches any char, in PCRE, it only matches any char but line break char(s) (and which ones depend on the newline convention PCRE verb, (*CR), (*LF), (*CRLF), (*ANYCRLF), (*ANY)). gsub(".+", "~", "_\n_") will result in ~, but gsub(".+", "~", "_\n_", perl=TRUE) will yield ~\n~. And an opposite example, to make TRE . act as in PCRE, use (?n) modifier, gsub("(?n).+", "~", "_\n_") to yield ~\n~ (with no way to choose among line ending styles). In PCRE patterns, to make . match line breaks, you need to use (?s) inline DOTALL modifier before . (or (?s:.*) like modifier groups).

Both support an alternation operator, but since TRE is a text-directed engine the longest alternative matches, and in PCRE, the leftmost alternative "wins". sub("(s|su)", "~", "sub") yields ~b (as su is the longest matching alternative), but sub("(s|su)", "~", "sub", perl=TRUE) produces ~ub (since s is first alternative to match).

Both support backreferences, but TRE only supports up to 9 backreferences. If you need 10 or more, use PCRE. sub("(.)\\1(.)\\2(.)\\3(.)\\4(.)\\5(.)\\6(.)\\7(.)\\8(.)\\9(.)\\10", "~", "112233445566778899aa", perl=TRUE) will find a match, with no perl=TRUE, no match will be detected.

Both seem to have character classes, [...] like constructs, but in fact, in the POSIX world where TRE belongs to, these are called bracket expressions. While you may define literal char ranges in both, or specify literal chars with OR relationship between them, one can't use shorthand char classes in bracket expressions, nor any escape sequence. The [\d]+ pattern in a TRE regex is treated as 1 or more backslashes or/and d letters, while in a PCRE pattern it will be parsed as 1+ digits (try gsub("[\\d]+", "~", "00\\99d") (-> 00~99~) and gsub("[\\d]+", "~", "00\\99d", perl=TRUE) (-> ~\~d)). This fact will explain why [\]\-\[]+ in a PCRE pattern matches 1+ ], - or [ and does not in a TRE expression where you need to use "smart placing", like [][-].

TRE and PCRE support the \d (digits), \D (non-digits), \w ("word" chars), \W ("non-word" chars), \s (any whitespace), \S (any non-whitespace) shorthand character classes. However, PCRE also supports \v (any vertical whitespace), \V (any char other than a vertical whitespace), \h (any horizontal whitespace), \H (any character that is not a horizontal whitespace), \N (any non-newline character), \X (any Unicode grapheme, useful when processing letters with diacritics), \R (any Unicode line break sequence).

Both flavors support quantifiers, regular, greedy ?, *, +, lazy ??, *?, +?, range/limiting quantifiers like greedy {3}, {8,26} or {3,} and their lazy counterparts with ? behind them. Note that TRE has poorer support for limiting quantifiers (it only supports values lower than 256 for {min} quantifier, and throws "out of memory" exceptions for {2557,} and bigger values. Make sure you always use the 0 value for the min value if it is what you imply, since {,2} in TRE actually matches 3 occurrences. However, PCRE supports possessive quantifiers, ++, ?+, *+, {1,5}+. The patterns quantified with them disallow backtracking into them, once matched, the engine never retries them. Besides, like all other regex libraries based on Henry Spencer's regex library dated back to 1986 (Tcl, PostgreSQL), one should avoid mixing lazy and greedy quantifiers on the same level in the regex, because the first pattern sets the greediness of the whole pattern level and often leads to unexpected results.

Both flavors support POSIX character classes that can be used in between [...]. However, TRE supports [:alnum:] (alphanumeric), [:alpha:] (letters), [:blank:] (horizontal whitespace), [:cntrl:] (control chars), [:digit:] (digits), [:graph:] (visible chars, anything except spaces and control characters), [:lower:] (lowercase letters), [:print:] (all printable chars), [:punct:] (symbols and punctuation), [:space:] (any whitespace), [:upper:] (uppercase letters) and [:xdigit:] (chars in hex values). PCRE adds [:word:] ("word" chars) and [:ascii:] (any ASCII char).

Both support word boundaries, but PCRE patterns do that in a more reliable way. Cf. gsub("\\b", "~", "CODE") yielding ~C~O~D~E~ and gsub("\\b", "~", "CODE", perl=T) producing ~CODE~. Although TRE support specific leading \< and trailing \> word boundaries, PCRE \b are still more reliable.

Both support inline modifiers that change certain pattern behavior when using them inside a pattern, e.g. (?i). TRE supports i (case insensitive), n (dot no longer matches newline), r (causes the regex to be matched in a right associative manner rather than the normal left associative manner. By default, concatenation is left associative in TRE, as per the grammar given in the base specifications on regular expressions of Std 1003.1-2001 (POSIX). This flag flips associativity of concatenation to right associative. Associativity can have an effect on how a match is divided into submatches, but does not change what is matched by the entire regexp) and U (swaps greediness, *? becomes greedy and * becomes lazy). PCRE supports i and U modifiers, and more: m (^ and $ match start/end of the line, not the whole string), s (dot matches newline), x (allows using whitespace to format the pattern and use comments), J (allows using names capturing groups with the same name), X (makes escaping letters with a backslash an error if that combination is not a valid regex token), D (makes $ only match the very end of string, else, it also matches a position before the final trailing newline in the string) and A (only match at the start of string, as if there was \A or ^ in front).

Backtracking aspect

See TRE docs: Matching algorithm used in TRE uses linear worst-case time in the length of the text being searched, and quadratic worst-case time in the length of the used regular expression. In other words, the time complexity of the algorithm is O(M2N), where M is the length of the regular expression and N is the length of the text. That leads to issues with patterns like (.{2,})\1+ to search for duplicate consecutive substrings. See Remove repeated elements in a string with R.

So, when you need to rely on backtracking much, choose PCRE.

What can PCRE do and TRE can't

The most visible shortcoming of TRE is that it does not support lookarounds. However, there are a lot of things that PCRE can boast of:

  • (*SKIP)(*FAIL) PCRE verb combination to match and skip patterns while matching
  • Recursion to match whole patterns that can be nested
  • Subroutines to match nested, balanced substrings to match recursive structures
  • \G anchor that matches start of string or the end of the previous successful match
  • (?|...|...) branch reset group allowing capturing groups inside it to share the same IDs
  • \p{...} and opposite \P{...} Unicode character properties
  • Case-changing operators in the replacement patterns turning the whole or part of the match to lower (\L) or upper case (\U) (up to the \E or end of match if it is missing) (actually, it is an extension of the PCRE library used in R)
  • An infinite-width positive lookbehind alternative, \K match reset operator (\K reference)
  • PCRE supports named capturing groups, but they are not widely used in R. Here is a custom example.

There are more things, like anchors (\A (start of string), \Z (end of string), \z (very end of string)), conditional "if-then-else" construct, atomic groupings (working in the same way as possessive quantifiers, but disallowing backtracking into whole sequences of patterns), etc.

Benchmark tests in Windows 7, Linux Ubuntu 16.04, MacOS Sierra 10.12.6

If we want to compare the performance of the TRE and PCRE regex engines in R, we should use simple patterns that match literally the same texts with these 2 engines.

I use R in Windows mostly, but I installed R 3.2.3 on a Linux VM specifically for this testing. The results for MacOS are borrowed from the t.kalinowski's answer.

Let's compare TRE (default) and PCRE (perl=TRUE) regex performance using microbenchmark library (see more benchmarking options in R):

library(microbenchmark)

The text is a Wikipedia article about butterflies.

txt <- "Butterflies are insects in the macrolepidopteran clade Rhopalocera from the order Lepidoptera, which also includes moths. Adult butterflies have large, often brightly coloured wings, and conspicuous, fluttering flight. The group comprises the large superfamily Papilionoidea, which contains at least one former group, the skippers (formerly the superfamily \"Hesperioidea\") and the most recent analyses suggest it also contains the moth-butterflies (formerly the superfamily \"Hedyloidea\"). Butterfly fossils date to the Paleocene, which was about 56 million years ago."

Let's try and extract the last text inside parentheses with sub, a very common sub operation in R:

# sub('.*\\((.*)\\).*', '\\1', txt)
# => [1] "formerly the superfamily \"Hedyloidea\""
PCRE_1 <- function(text) { return(sub('.*\\((.*)\\).*', '\\1', txt, perl=TRUE)) }
TRE_1 <- function(text) { return(sub('.*\\((.*)\\).*', '\\1', txt)) }
test <- microbenchmark( PCRE_1(txt), TRE_1(txt), times = 500000 )
test

The results are the following:

WINDOWS
-------
Unit: microseconds
expr min lq mean median uq max neval
PCRE_1(txt) 163.607 165.418 168.65393 166.625 167.229 7314.588 5e+05
TRE_1(txt) 70.031 72.446 74.53842 73.050 74.257 38026.680 5e+05

MacOS
-----
Unit: microseconds
expr min lq mean median uq max neval
PCRE_1(txt) 31.693 32.857 37.00757 33.413 35.805 43810.177 5e+05
TRE_1(txt) 46.037 47.199 53.06407 47.807 51.981 7702.869 5e+05

Linux
------
Unit: microseconds
expr min lq mean median uq max neval
PCRE_1(txt) 10.557 11.555 13.78216 12.097 12.662 4301.178 5e+05
TRE_1(txt) 25.875 27.350 31.51925 27.805 28.737 17974.716 5e+05

TRE regex sub wins only in Windows, more than 2 times as fast. On both MacOS and Linux, PCRE (perl=TRUE) version wins with a similar ratio.

Now, let's compare the performance of regexps that don't use backtracking that heavily and extract the words inside double quotes:

# regmatches(txt, gregexpr("\"[A-Za-z]+\"", txt))
# => [1] "\"Hesperioidea\"" "\"Hedyloidea\""
PCRE_2 <- function(text) { return(regmatches(txt, gregexpr("\"[A-Za-z]+\"", txt, perl=TRUE))) }
TRE_2 <- function(text) { return(regmatches(txt, gregexpr("\"[A-Za-z]+\"", txt))) }
test <- microbenchmark( PCRE_2(txt), TRE_2(txt), times = 500000 )
test

WINDOWS
-------
Unit: microseconds
expr min lq mean median uq max neval
PCRE_2(txt) 324.799 330.232 349.0281 332.646 336.269 124404.14 5e+05
TRE_2(txt) 187.755 191.981 204.7663 193.792 196.208 74554.94 5e+05

MacOS
-----
Unit: microseconds
expr min lq mean median uq max neval
PCRE_2(txt) 63.801 68.115 75.51773 69.164 71.219 47686.40 5e+05
TRE_2(txt) 63.825 67.849 75.20246 68.883 70.933 49691.92 5e+05

LINUX
-----
Unit: microseconds
expr min lq mean median uq max neval
PCRE_2(txt) 30.199 34.750 44.05169 36.151 43.403 38428.2 5e+05
TRE_2(txt) 37.752 41.854 52.58230 43.409 51.781 38915.7 5e+05

The best mean value belongs to the PCRE regex in Linux, in MacOS, the difference is almost negligent, and in Windows, TRE works much faster.

Summary

It is clear that TRE (default) regex library works much faster in Windows. In Linux, PCRE regex is considerably faster. In MacOS, PCRE regex is still preferable since, with backtracking patterns, PCRE regex is faster than TRE in that OS.

[[::]] or [[::]] don't match

It is a bug, because these constructs (starting word boundary, [[:<:]], and ending [[:>:]] word boundary) are supported by the PCRE library itself:

COMPATIBILITY FEATURE FOR WORD BOUNDARIES

In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
and "end of word". PCRE treats these items as follows:

[[:<:]] is converted to \b(?=\w)
[[:>:]] is converted to \b(?<=\w)

Only these exact character sequences are recognized. A sequence such as
[a[:<:]b] provokes error for an unrecognized POSIX class name. This
support is not compatible with Perl. It is provided to help migrations
from other environments, and is best not used in any new patterns. Note
that \b matches at the start and the end of a word (see "Simple asser-
tions" above), and in a Perl-style pattern the preceding or following
character normally shows which is wanted, without the need for the
assertions that are used above in order to give exactly the POSIX be-
haviour.

When used in PHP code, it works:

if (preg_match_all('/[[:<:]]home[[:>:]]/', 'homeless and home', $m))
{
print_r($m[0]);
}

finds Array ( [0] => home). See the online PHP demo.

So, it is the regex101.com developer team that decided (or forgot) to include support for these paired word boundaries.

At regex101.com, instead, use \b word boundaries (both as starting and ending ones) that are supported by all 4 regex101.com regex engines: PCRE, JS, Python and Go.

These word boundaries are mostly supported by POSIX-like engines, see this PostgreSQL regex demo, for example. The [[:<:]]HR[[:>:]] regex finds a match in Head of HR, but finds no match in <A HREF="some.html and CHROME.

Other regex engines that support [[:<:]] and [[:>:]] word boundaries are base R (gsub with no perl=TRUE argument, e.g.) and MySQL.

In Tcl regex, there is \m for [[:<:]] (starting word boundary) and \M for ending word boundary ([[:>:]]).



Related Topics



Leave a reply



Submit