Reuse Part of a Regex Pattern

Reuse part of a Regex pattern

No, when using the standard library re module, regular expression patterns cannot be 'symbolized'.

You can always do so by re-using Python variables, of course:

digit_letter_letter_digit = r'\d\w\w\d'

then use string formatting to build the larger pattern:

re.match(r"{0},{0}".format(digit_letter_letter_digit), inputtext)

or, using Python 3.6+ f-strings:

dlld = r'\d\w\w\d'
re.match(fr"{dlld},{dlld}", inputtext)

I often use this technique to compose larger, more complex patterns from re-usable sub-patterns.
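As a rough, self-contained sketch of that composition technique (the sample input '1ab2,3cd4' and the extra capturing groups are my own illustrative assumptions, not part of the original answer):

```python
import re

# Re-usable sub-pattern: digit, two word characters, digit.
dlld = r'\d\w\w\d'

# Wrap each use in its own capturing group so both halves can be read back.
pattern = re.compile(fr'({dlld}),({dlld})')

m = pattern.match('1ab2,3cd4')   # hypothetical sample input
parts = m.groups() if m else ()
print(parts)
```

One thing to watch for with f-strings: literal regex braces such as a {2} quantifier have to be doubled ({{2}}) inside the f-string, otherwise they are treated as substitution fields.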

If you are prepared to install an external library, then the regex project can solve this problem with a regex subroutine call. The syntax (?N), where N is a group number such as (?1), re-uses the pattern of an already used (implicitly numbered) capturing group:

(\d\w\w\d),(?1)
^........^ ^..^
|            \
|             re-use pattern of capturing group 1
\
 capturing group 1

You can do the same with named capturing groups, where (?<groupname>...) is the named group groupname, and (?&groupname), (?P&groupname) or (?P>groupname) re-use the pattern of groupname (the latter two forms are alternatives for compatibility with other engines).

And finally, regex supports the (?(DEFINE)...) block to 'define' subroutine patterns without them actually matching anything at that stage. You can put multiple (..) and (?<name>...) capturing groups in that construct to then later refer to them in the actual pattern:

(?(DEFINE)(?<dlld>\d\w\w\d))(?&dlld),(?&dlld)
          ^...............^ ^......^ ^......^
                  |             \       /
   creates 'dlld' pattern     uses 'dlld' pattern twice

Just to be explicit: the standard library re module does not support subroutine patterns.
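A small sketch to make that concrete with only the standard library: the first function checks that re rejects the (?1) call syntax, and the second emulates a DEFINE-style block by plain string substitution. The names supports_subroutine_calls, expand and DEFS are hypothetical helpers invented for this illustration, and the substitution is a swapped-in workaround, not a real subroutine mechanism.

```python
import re

def supports_subroutine_calls():
    """Return True if this re module accepts the (?1) subroutine syntax."""
    try:
        re.compile(r'(\d\w\w\d),(?1)')
    except re.error:
        return False
    return True

# Hypothetical table of named sub-patterns, playing the role of (?(DEFINE)...).
DEFS = {'dlld': r'\d\w\w\d'}

def expand(template, defs):
    """Substitute {name} placeholders with non-capturing copies of each sub-pattern."""
    wrapped = {name: f'(?:{body})' for name, body in defs.items()}
    return template.format(**wrapped)

pattern = re.compile(expand(r'{dlld},{dlld}', DEFS))
print(supports_subroutine_calls())
print(bool(pattern.fullmatch('1ab2,3cd4')))
```

Wrapping each substituted fragment in (?:...) keeps a sub-pattern that contains an alternation from leaking into the surrounding pattern.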

Is it possible to define a pattern and reuse it to capture multiple groups?

To reuse a pattern in an engine that supports subroutine calls, you could use (?n), where n is the number of the group to repeat. For example, your current pattern:

(PAT),(PAT), ... ,(PAT)

can be replaced by:

(PAT),(?1), ... ,(?1)

(?1) is the same pattern as (PAT), whatever PAT is.

You may have multiple patterns:

(PAT1),(PAT2),(PAT1),(PAT2),(PAT1),(PAT2),(PAT1),(PAT2)

may be reduced to:

(PAT1),(PAT2),(?1),(?2),(?1),(?2),(?1),(?2)

or:

((PAT1),(PAT2)),(?1),(?1),(?1)

or:

((PAT1),(PAT2)),(?1){3}
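Where (?n) is not available (the Python standard library re module, for instance), a similar effect can be approximated by generating the repeated pattern. This is only a sketch: repeat_groups, the placeholder PAT value and the sample input are assumptions made up for the example.

```python
import re

def repeat_groups(pat, count, sep=','):
    """Build '(pat),(pat),...' with `count` separate capturing groups.

    `sep` is inserted as a regex fragment, so escape it if it is special.
    """
    return sep.join(f'({pat})' for _ in range(count))

PAT = r'[a-z]+\d'   # hypothetical stand-in for PAT
pattern = re.compile(repeat_groups(PAT, 3))

m = pattern.fullmatch('ab1,c2,def3')   # hypothetical sample input
captured = m.groups() if m else ()
print(captured)
```

Because every copy here is its own capturing group, each field can be read back individually through groups().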

Is there a way to reuse a part of a pattern in a C++ regex?

To repeat a pattern in C++ when using boost::regex, it is possible to use regex subroutines: capture a pattern you need to repeat and use (?n) where n is the capturing group ID. Use (?R) to repeat the whole pattern.

Example:

std::string s{"This is a cat"};
boost::smatch what;
boost::regex expr{R"~(\[(cat)\]|(?1))~"};
if (boost::regex_search(s, what, expr))
{
    std::cout << what[0] << '\n';
}

std::regex does not allow that. You need to build patterns dynamically:

std::string s{"This is a cat"};
std::string block{"cat"};
std::smatch what;
std::regex expr{"\\[" + block + "\\]|" + block};
if (std::regex_search(s, what, expr))
{
    std::cout << what[0] << '\n';
}


Regex reuse a pattern to capture multiple groups?

Judging by the Java documentation, java.util.regex still does not implement PCRE-style subroutine calls such as (?1). In short, Java regex does not support subroutines.

see also Java Regex Manual

regular expression in R, reuse matched string in replacement

Unfortunately, it is not possible to use a backreference to the whole match in base R regex functions.

You can use

sub("(M)([0-9])$", "\\10\\2", x)

With the TRE regex engine used here, you do not have to worry about a digit following a backreference: only the nine backreferences \1 through \9 are allowed, so "\\10" is read as backreference 1 followed by a literal 0. Interestingly, you may add perl=TRUE to the call above and it will yield the same result.

See the R demo online:

x <-  c('2020M6','2020M10')
sub("(M)([0-9])$", "\\10\\2", x)
## => [1] "2020M06" "2020M10"
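For comparison only, here is a sketch of the same replacement in Python rather than R. In Python's re.sub a bare \10 would be read as group 10, so the unambiguous \g<1> form is used instead; Python also allows \g<0> for the whole match. The variable names are illustrative.

```python
import re

values = ['2020M6', '2020M10']

# \g<1> is group 1, then a literal 0, then group 2.
padded = [re.sub(r'(M)([0-9])$', r'\g<1>0\2', v) for v in values]
print(padded)

# \g<0> refers to the whole match in a Python replacement string.
bracketed = re.sub(r'M[0-9]$', r'[\g<0>]', '2020M6')
print(bracketed)
```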


Is it possible to reuse a subpattern inside of a pattern in Regex

^([A-Z][a-z]+(?:\s+|$)){3}$

You can try this. Instead of 3, you can use whatever repeat count you want. See the demo:

http://regex101.com/r/tF5fT5/33
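A quick sketch of how that quantified group behaves in Python's re module (the sample strings are made up for illustration). Note that a repeated capturing group only keeps the text of its last repetition.

```python
import re

# Exactly three capitalized words, separated by whitespace.
three_words = re.compile(r'^([A-Z][a-z]+(?:\s+|$)){3}$')

ok = three_words.match('Foo Bar Baz')
last_word = ok.group(1) if ok else None   # only the final repetition is kept
too_short = three_words.match('Foo Bar')

print(last_word)
print(too_short)
```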

Can I reuse a character in the next group of a regular expression?

Use more groups.

>>> re.findall(r'(([a-z])-([a-z]))','a b-c d')
[('b-c', 'b', 'c')]

And since you don't actually care about the original...

>>> re.findall(r'([a-z])-([a-z])','a b-c d')
[('b', 'c')]

Java regex : How to reuse a consumed character in pattern matching?

Try it this way:

String data = "aaaabbbaaaaab";
Matcher m = Pattern.compile("(?=(a+b+|b+a+))(^|(?<=a)b|(?<=b)a)").matcher(data);
while (m.find())
    System.out.println(m.group(1));

This regex uses lookaround mechanisms and will find (a+b+|b+a+) that

  • is at the start (^) of the input
  • starts with b that is preceded by a
  • starts with a that is preceded by b.

Output:

aaaabbb
bbbaaaaa
aaaaab

Is ^ essentially needed in this regular expression?

Yes, without ^ this regex wouldn't capture aaaabbb at the start of the input.

If I hadn't added (^|(?<=a)b|(?<=b)a) after (?=(a+b+|b+a+)), this regex would match

aaaabbb
aaabbb
aabbb
abbb
bbbaaaaa
bbaaaaa
baaaaa
aaaaab
aaaab
aaab
aab
ab

so I needed to limit these results to only those that start with a that has b before it (without including b in the match, so lookbehind was perfect for that) and those that start with b that is preceded by a.

But let's not forget about a or b placed at the start of the string, which are not preceded by anything. To include them we can use ^.


Maybe it will be easier to show this idea with this regex

(?=(a+b+|b+a+))((?<=^|a)b|(?<=^|b)a)

  • (?<=^|a)b will match b that is placed at start of string, or has a before it
  • (?<=^|b)a will match a that is placed at start of string, or has b before it
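The same idea can be sketched in Python for comparison. This uses the first form of the pattern, because as far as I know Python's re module requires fixed-width lookbehind and so would not accept (?<=^|a). The non-capturing (?:...) wrapper and the variable names are my own choices.

```python
import re

data = 'aaaabbbaaaaab'

# The lookahead captures the run pair without consuming it; the second part
# anchors each match at the string start or at an a/b boundary.
pattern = re.compile(r'(?=(a+b+|b+a+))(?:^|(?<=a)b|(?<=b)a)')

found = [m.group(1) for m in pattern.finditer(data)]
for item in found:
    print(item)
```

Because the captured text lives inside a lookahead, consecutive results are free to overlap, which is what lets the already consumed characters be reused.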

