Using Shorthand Character Classes Inside Character Classes in R Regex

Keep in mind that in TRE regex patterns (the default engine in base R), regex escapes such as \s, \d, and \w cannot be used inside bracket expressions.

So the regex in your case, "[\\s0-9a-z]+,", matches one or more characters that are a \, the letter s, a digit, or a lowercase ASCII letter, followed by a single comma.
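A quick way to see this is to test single characters against the bracket expression with both engines. The sketch below uses a made-up vector x (not from the question); the first call relies on R's default TRE engine.

```r
# Hypothetical test inputs: the letter s, a backslash, a space, a digit
x <- c("s", "\\", " ", "7")

# Default (TRE) engine: inside brackets, \ and s are two literal characters
tre_hits <- grepl("[\\s]", x)

# PCRE engine: \s keeps its whitespace meaning inside brackets
pcre_hits <- grepl("[\\s]", x, perl = TRUE)

print(tre_hits)   # TRUE TRUE FALSE FALSE
print(pcre_hits)  # FALSE FALSE TRUE FALSE
```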

You may use POSIX character classes instead, such as [:space:] (any whitespace) or [:blank:] (horizontal whitespace):

> gsub("[[:space:]0-9a-z]+,", "", vec)
[1] " Fast"

Or use a PCRE regex with \s and the perl=TRUE argument:

> gsub("[\\s0-9a-z]+,", "", vec, perl=TRUE)
[1] " Fast"

To make \s match all Unicode whitespace, add the (*UCP) PCRE verb at the start of the pattern: gsub("(*UCP)[\\s0-9a-z]+,", "", vec, perl=TRUE).
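The effect of (*UCP) can be checked on a whitespace character outside the ASCII range. This is only a sketch: the em space input is an illustrative choice, and it assumes the PCRE library that R was built against includes Unicode property support.

```r
# Hypothetical input: U+2003 (em space), a whitespace character beyond ASCII
em_space <- "\u2003"

# \s without the verb, then \s with Unicode properties switched on
plain_s <- grepl("\\s", em_space, perl = TRUE)
with_ucp <- grepl("(*UCP)\\s", em_space, perl = TRUE)

print(c(plain_s = plain_s, with_ucp = with_ucp))
```

Comparing the two values shows whether the verb changes what \s accepts in your R build.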

Problem using \\d inside a user-defined character class

As requested:

Shorthand character classes such as \\d, \\s, \\w come from Perl, so when you use them inside a bracket expression, make sure to add perl = TRUE to your call.

For example:

sub("([\\w.]+)([€$¥])", "\\2\\1", a_1, perl = TRUE)
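To try the call, here is a small sketch with a made-up a_1 vector (the real data from the question is not shown, so these values are assumptions). It moves a trailing currency symbol in front of the amount.

```r
# Hypothetical amounts with a trailing currency symbol
a_1 <- c("12.50€", "300$", "1.5¥")

# Group 1: word characters and dots; Group 2: one currency symbol.
# The replacement swaps the two groups.
moved <- sub("([\\w.]+)([€$¥])", "\\2\\1", a_1, perl = TRUE)

print(moved)  # "€12.50" "$300" "¥1.5"
```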

More information can be found here:

https://perldoc.perl.org/perlrecharclass

Reusing a character class in a regular expression

Keep in mind that regex features depend on the language being used.

With Java, you can do this:

[acegikmoqstz@#&](?:.*[acegikmoqstz@#&]){2}

But that's all: Java gives you no way to reuse the definition of a named subpattern.

With PHP you can do that:

(?(DEFINE)(?<a>[acegikmoqstz@#&]))\g<a>(?:.*\g<a>){2}
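Since PHP's regex functions are built on PCRE, the same pattern should also work in R with perl = TRUE. The sketch below assumes that and uses two made-up test strings; only the first contains three characters from the class.

```r
# Same pattern as the PHP example; backslashes are doubled for R string syntax
pattern <- "(?(DEFINE)(?<a>[acegikmoqstz@#&]))\\g<a>(?:.*\\g<a>){2}"

# Hypothetical inputs: "a1c2e" has a, c, e from the class; "a1b2d" has only a
x <- c("a1c2e", "a1b2d")

hits <- grepl(pattern, x, perl = TRUE)
print(hits)  # TRUE FALSE
```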

Which regular expression operator means 'Don't' match this character?

You can use negated character classes to exclude certain characters: for example, [^abcde] will match any character except a, b, c, d, or e.

Instead of specifying all the characters literally, you can use shorthands inside character classes in Perl-compatible flavors: [\w] (lowercase) will match any "word character" (letters, digits, and underscore), while [\W] (uppercase) will match anything but word characters; similarly, [\d] will match the digits 0-9 while [\D] matches anything but those digits, and so on.
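In R, these classes can be tried out with gsub. The input string below is made up for illustration, and the shorthand examples pass perl = TRUE so that \W and \D are understood inside the brackets.

```r
# Hypothetical input mixing letters, digits, an underscore and punctuation
x <- "a1_b2-cz!"

# Negated class: delete everything that is not a, b, c, d or e
only_abcde <- gsub("[^abcde]", "", x)

# [\W] deletes non-word characters; [\D] deletes non-digits (PCRE engine)
word_chars <- gsub("[\\W]", "", x, perl = TRUE)
digits <- gsub("[\\D]", "", x, perl = TRUE)

print(only_abcde)  # "abc"
print(word_chars)  # "a1_b2cz"
print(digits)      # "12"
```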

If you use PHP you can take a look at the regex character classes documentation.

Regex search for specific pattern, if found, replace with something else

Your pattern does not work because the TRE regex flavor does not support shorthand character classes inside bracket expressions. You should use either [[:digit:]] or [0-9], but not [\\d], which actually matches a \ or the letter d.

You may use

Before <- "ACEMOGLU, D., ROBINSON, J., (2012) WHY NATIONS FAIL, (3)"
gsub("\\((\\d{4})\\)", "\\1,", Before)
## => [1] "ACEMOGLU, D., ROBINSON, J., 2012, WHY NATIONS FAIL, (3)"

NOTE that I am using \\d without square brackets (a bracket expression) around it. The TRE regex engine treats "\\d{4}" as a pattern that matches four digits. It is equivalent to [0-9]{4} or [[:digit:]]{4}.

Details

  • \\( - a literal (
  • (\\d{4}) - Group 1: any four digits
  • \\) - a literal )
  • \\1 - the backreference to Group 1 value
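The equivalence of the three spellings, and the failure of the bracketed [\\d] form, can be checked with a short sketch on the same Before string. The variable names patterns, results and broken are introduced here for illustration only.

```r
Before <- "ACEMOGLU, D., ROBINSON, J., (2012) WHY NATIONS FAIL, (3)"

# Three spellings of "four digits in parentheses" that TRE understands
patterns <- c("\\((\\d{4})\\)", "\\(([0-9]{4})\\)", "\\(([[:digit:]]{4})\\)")
results <- vapply(patterns, function(p) gsub(p, "\\1,", Before),
                  character(1), USE.NAMES = FALSE)

# [\\d] inside brackets means "\ or d" in TRE, so nothing is replaced
broken <- gsub("\\(([\\d]{4})\\)", "\\1,", Before)

print(unique(results))  # one string: "... J., 2012, WHY NATIONS FAIL, (3)"
print(identical(broken, Before))  # TRUE
```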

How to negate any alphanumeric character with alnum in R (str_)

You can use

library(stringr)
str_replace_all(name, "[^[:alnum:]]+", "")
## or
str_replace_all(name, "[:^alnum:]+", "")

The [^[:alnum:]] pattern is a negated bracket expression ([^...]) that matches any chars other than letters and digits ([:alnum:], a POSIX character class).

The [:^alnum:] pattern is an extension of the POSIX character class with an inverse meaning.

The + is a quantifier that matches one or more occurrences of the pattern it quantifies.

Also, in stringr, the shorthand character classes are Unicode aware, so you may also use

str_replace_all(name, "[\\W_]+", "")

where \W matches any char other than Unicode letters, digits or underscores, and _ matches underscores.
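As a quick check, the sketch below runs two of these patterns on a made-up name value (an assumption, since the original data is not shown). It needs the stringr package to be installed.

```r
library(stringr)

# Hypothetical input with an underscore, a space and punctuation
name <- "John_Doe #42!"

# Negated bracket expression around the POSIX class
via_alnum <- str_replace_all(name, "[^[:alnum:]]+", "")

# Shorthand \W plus an explicit underscore
via_shorthand <- str_replace_all(name, "[\\W_]+", "")

print(via_alnum)      # "JohnDoe42"
print(via_shorthand)  # "JohnDoe42"
```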

Using a regular expression in the list.files function of R

list_files <- list.files(path="my_file_path", recursive = TRUE, pattern = "un[0-9]", full.names = TRUE)
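Note that pattern is a regular expression matched against file names, not a shell glob, so "un[0-9]" selects any file whose name contains "un" followed by a digit. A minimal sketch, using a temporary directory and made-up file names:

```r
# Create a scratch directory with three hypothetical files
demo_dir <- tempfile("list_files_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("run1.csv", "un2.txt", "under.txt")))

# Only names containing "un" followed by a digit are returned
found <- list.files(path = demo_dir, pattern = "un[0-9]", recursive = TRUE)
print(found)  # "run1.csv" "un2.txt"

# Clean up the scratch directory
unlink(demo_dir, recursive = TRUE)
```

To require the name to start with "un", anchor the pattern, for example "^un[0-9]".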

