Regex Named Groups in Java

(Update: August 2011)

As geofflane mentions in his answer, Java 7 now supports named groups.

tchrist points out in a comment that the support is limited.

He details the limitations in his great answer "Java Regex Helper".

Java 7 regex named group support was presented back in September 2010 in Oracle's blog.

In the official release of Java 7, the constructs to support the named capturing group are:

  • (?<name>capturing text) to define a named group "name"
  • \k<name> to backreference a named group "name"
  • ${name} to reference a captured group in Matcher's replacement string
  • Matcher.group(String name) to return the captured input subsequence by the given "named group".
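
As a rough illustration of the four constructs listed above, here is a minimal sketch that uses all of them on one pattern. The class name, method names, and the doubled-word pattern are made up for this example; only the regex syntax and the Matcher calls come from the list.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupConstructs {
    // (?<word>...) defines the group; \k<word> requires the same text to appear again.
    static final Pattern DOUBLED = Pattern.compile("\\b(?<word>\\w+) \\k<word>\\b");

    // Matcher.group(String) returns the text captured by the named group.
    static String firstDoubledWord(String input) {
        Matcher m = DOUBLED.matcher(input);
        return m.find() ? m.group("word") : null;
    }

    // ${word} in the replacement string refers back to the named group.
    static String collapseDoubledWords(String input) {
        return DOUBLED.matcher(input).replaceAll("${word}");
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("it was was fine"));     // was
        System.out.println(collapseDoubledWords("it was was fine")); // it was fine
    }
}
```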

Other alternatives for pre-Java 7 were:

  • Google named-regex (see John Hardy's answer)

    Gábor Lipták mentions (November 2012) that this project might not be active (with several outstanding bugs), and its GitHub fork could be considered instead.
  • jregex (See Brian Clozel's answer)

(Original answer: Jan 2009, with the next two links now broken)

You cannot refer to a named group unless you code your own version of Regex...

That is precisely what Gorbush2 did in this thread.

Regex2

(limited implementation, as pointed out again by tchrist, as it looks only for ASCII identifiers. tchrist details the limitation as:

only being able to have one named group per same name (which you don’t always have control over!) and not being able to use them for in-regex recursion.

Note: You can find true regex recursion examples in Perl and PCRE regexes, as mentioned in Regexp Power, PCRE specs and Matching Strings with Balanced Parentheses slide)

Example:

String:

"TEST 123"

RegExp:

"(?<login>\\w+) (?<id>\\d+)"

Access

matcher.group(1) ==> TEST
matcher.group("login") ==> TEST
matcher.name(1) ==> login

Replace

matcher.replaceAll("aaaaa_$1_sssss_$2____") ==> aaaaa_TEST_sssss_123____
matcher.replaceAll("aaaaa_${login}_sssss_${id}____") ==> aaaaa_TEST_sssss_123____

(extract from the implementation)

public final class Pattern
implements java.io.Serializable
{
[...]
/**
* Parses a group and returns the head node of a set of nodes that process
* the group. Sometimes a double return system is used where the tail is
* returned in root.
*/
private Node group0() {
boolean capturingGroup = false;
Node head = null;
Node tail = null;
int save = flags;
root = null;
int ch = next();
if (ch == '?') {
ch = skip();
switch (ch) {

case '<': // (?<xxx) look behind or group name
ch = read();
int start = cursor;
[...]
// test forGroupName
int startChar = ch;
while(ASCII.isWord(ch) && ch != '>') ch=read();
if(ch == '>'){
// valid group name
int len = cursor-start;
int[] newtemp = new int[2*(len) + 2];
//System.arraycopy(temp, start, newtemp, 0, len);
StringBuilder name = new StringBuilder();
for(int i = start; i< cursor; i++){
name.append((char)temp[i-1]);
}
// create Named group
head = createGroup(false);
((GroupTail)root).name = name.toString();

capturingGroup = true;
tail = root;
head.next = expr(tail);
break;
}

Java support for (?<name>pattern) in patterns

This is supported starting in Java 7. Your C# code can be translated to something like this:

String pattern = ";(?<foo>\\d{6});(?<bar>\\d{6});";
Pattern regex = Pattern.compile(pattern);
Matcher matcher = regex.matcher(";123456;123456;");
boolean success = matcher.find();

String foo = success ? matcher.group("foo") : null;
String bar = success ? matcher.group("bar") : null;

You have to create a Matcher object which doesn't actually perform the regex test until you call find().

(I used find() because it can find a match anywhere in the input string, like the Regex.Match() method. The .matches() method only returns true if the regex matches the entire input string.)
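
The find()/matches() distinction described above can be sketched as follows. The class, the method names, and the shortened pattern are assumptions for this example; the third method shows that reading a group before any match attempt throws IllegalStateException, which is why find() has to be called first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindVersusMatches {
    static final Pattern ID = Pattern.compile("(?<id>\\d{6})");

    // find() succeeds if the pattern occurs anywhere in the input.
    static String findId(String input) {
        Matcher m = ID.matcher(input);
        return m.find() ? m.group("id") : null;
    }

    // matches() succeeds only if the pattern covers the entire input.
    static String matchWholeId(String input) {
        Matcher m = ID.matcher(input);
        return m.matches() ? m.group("id") : null;
    }

    // Creating the Matcher does not run the regex; asking for a group
    // before find() or matches() throws IllegalStateException.
    static boolean groupBeforeMatchThrows(String input) {
        try {
            ID.matcher(input).group("id");
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(findId(";123456;"));       // 123456
        System.out.println(matchWholeId(";123456;")); // null
        System.out.println(matchWholeId("123456"));   // 123456
    }
}
```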

Regular Expression named capturing groups support in Java 7

Specifying named capturing group

Take the following regex with a single capturing group as an example: ([Pp]attern).

Below are 4 examples of how to specify a named capturing group for the regex above:

(?<Name>[Pp]attern)
(?<group1>[Pp]attern)
(?<name>[Pp]attern)
(?<NAME>[Pp]attern)

Note that the name of the capturing group must strictly match the following pattern:

[A-Za-z][A-Za-z0-9]*

The group name is case-sensitive, so you must specify the exact group name when you are referring to them (see below).
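
A small sketch of what these naming rules mean in practice, assuming nothing beyond the standard Pattern and Matcher classes. The helper names are made up for this example, and isValidGroupName is only meant for simple candidate names like the ones shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class GroupNameRules {
    // A name that breaks the [A-Za-z][A-Za-z0-9]* rule makes compilation fail.
    static boolean isValidGroupName(String name) {
        try {
            Pattern.compile("(?<" + name + ">x)");
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Looking a group up under a name the pattern does not define
    // (including a different case) throws IllegalArgumentException.
    static boolean lookupSucceeds(String definedName, String lookupName) {
        Matcher m = Pattern.compile("(?<" + definedName + ">[Pp]attern)").matcher("Pattern");
        m.matches();
        try {
            m.group(lookupName);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidGroupName("group1"));     // true
        System.out.println(isValidGroupName("my_group"));   // false
        System.out.println(lookupSucceeds("Name", "name")); // false
    }
}
```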

Backreference the named capturing group in regex

To back-reference the content matched by a named capturing group in the regex (corresponding to the 4 examples above):

\k<Name>
\k<group1>
\k<name>
\k<NAME>

The named capturing group is still numbered, so in all 4 examples, it can be back-referenced with \1 as per normal.
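
To make the equivalence concrete, here is a sketch with two patterns that differ only in how they back-reference the same group. The quote-matching pattern and the class name are assumptions chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BackreferenceStyles {
    // Same named group, referenced once by name and once by number.
    static final Pattern BY_NAME = Pattern.compile("(?<quote>['\"]).*?\\k<quote>");
    static final Pattern BY_NUMBER = Pattern.compile("(?<quote>['\"]).*?\\1");

    // Returns the first quoted run whose closing quote equals its opening quote.
    static String firstQuoted(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "say 'it\"s' now";
        System.out.println(firstQuoted(BY_NAME, input));   // 'it"s'
        System.out.println(firstQuoted(BY_NUMBER, input)); // 'it"s'
    }
}
```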

Refer to named capturing group in replacement string

To refer to the capturing group in the replacement string (corresponding to the 4 examples above):

${Name}
${group1}
${name}
${NAME}

Same as above, in all 4 examples, the content of the capturing group can be referred to with $1 in the replacement string.
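
The same point for replacement strings, as a short sketch. The year/month pattern and the class name are assumptions for this example; both methods are expected to produce identical output because the named groups are also groups 1 and 2.

```java
import java.util.regex.Pattern;

final class ReplacementStyles {
    static final Pattern YEAR_MONTH = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");

    // Refers to the groups by name.
    static String byName(String input) {
        return YEAR_MONTH.matcher(input).replaceAll("${month}/${year}");
    }

    // Refers to the same groups by number.
    static String byNumber(String input) {
        return YEAR_MONTH.matcher(input).replaceAll("$2/$1");
    }

    public static void main(String[] args) {
        System.out.println(byName("Update: 2011-08"));   // Update: 08/2011
        System.out.println(byNumber("Update: 2011-08")); // Update: 08/2011
    }
}
```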

Named capturing group in COMMENTS mode

Using (?<name>[Pp]attern) as an example for this section.

Oracle's implementation of COMMENTS mode (Pattern.COMMENTS, embedded flag (?x)) parses the following examples identically to the regex above:

(?x)  (  ?<name>             [Pp] attern  )
(?x) ( ?< name > [Pp] attern )
(?x) ( ?< n a m e > [Pp] attern )

Except for ?<, which must not be split, arbitrary spacing is allowed, even inside the name of the capturing group.

Same name for different capturing groups?

While it is possible in .NET, Perl and PCRE to define the same name for different capturing groups, it is currently not supported in Java (Java 8). You can't use the same name for different capturing groups.
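
A sketch of how this restriction shows up, plus one possible workaround. The workaround (a distinct name per alternative, then picking whichever group participated) is an assumption of this example, not something the answer proposes, and the class and group names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class DuplicateGroupNames {
    // Reusing a group name makes Pattern.compile throw PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Workaround sketch: one name per alternative (y1, y2 are hypothetical names).
    static final Pattern YEAR = Pattern.compile("(?<y1>\\d{4})-\\d{2}|\\d{2}/(?<y2>\\d{4})");

    // A group that did not participate in the match returns null.
    static String year(String input) {
        Matcher m = YEAR.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return m.group("y1") != null ? m.group("y1") : m.group("y2");
    }

    public static void main(String[] args) {
        System.out.println(compiles("(?<year>\\d{4})-\\d{2}|\\d{2}/(?<year>\\d{4})")); // false
        System.out.println(year("2011-08")); // 2011
        System.out.println(year("08/2011")); // 2011
    }
}
```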

Named capturing group related APIs

New methods in Matcher class to support retrieving captured text by group name:

  • group(String name) (from Java 7)
  • start(String name) (from Java 8)
  • end(String name) (from Java 8)
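
A minimal sketch of the name-based offset methods, assuming Java 8 or later. The helper name and its return shape are made up for this example; the pattern and input are the "TEST 123" sample used earlier on this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupOffsets {
    // Returns {start, end} of the named group in the first match, or null if none.
    static int[] span(String regex, String groupName, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(groupName), m.end(groupName) };
    }

    public static void main(String[] args) {
        int[] idSpan = span("(?<login>\\w+) (?<id>\\d+)", "id", "TEST 123");
        System.out.println(idSpan[0] + ".." + idSpan[1]); // 5..8
    }
}
```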

The corresponding methods are missing from the MatchResult class as of Java 8. There is an ongoing enhancement request, JDK-8065554, for this issue.

There is currently no API to get the list of named capturing groups in the regex, so we have to jump through extra hoops to get it. That said, such a list is of little use for most purposes other than writing a regex tester.

How to get the names of the regex named capturing group in a match in Java?

Here is my attempt in Scala:

import java.util.regex.{MatchResult, Pattern}

class GroupNamedRegex(pattern: Pattern, namedGroups: Set[String]) {
def this(regex: String) = this(Pattern.compile(regex),
"\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>".r.findAllMatchIn(regex).map(_.group(1)).toSet)

def findNamedMatches(s: String): Iterator[GroupNamedRegex.Match] = new Iterator[GroupNamedRegex.Match] {
private[this] val m = pattern.matcher(s)
private[this] var _hasNext = m.find()

override def hasNext = _hasNext

override def next() = {
val ans = GroupNamedRegex.Match(m.toMatchResult, namedGroups.find(group => m.group(group) != null))
_hasNext = m.find()
ans
}
}
}

object GroupNamedRegex extends App {
case class Match(result: MatchResult, groupName: Option[String])

val r = new GroupNamedRegex("(?<FB>(FACE(\\p{Space}?)BOOK))|(?<GOOGL>(GOOGL(E)?))")
println(r.findNamedMatches("FACEBOOK is buying GOOGLE and FACE BOOK FB").map(s => s.groupName -> s.result.group()).toList)
}

Java String.replaceAll backreference with named groups

Based on https://blogs.oracle.com/xuemingshen/entry/named_capturing_group_in_jdk7

you should use ${nameOfCapturedGroup} which in your case would be ${render}.

DEMO:

String test = "{0000:Billy} bites {0001:Jake}";
test = test.replaceAll("\\{(?<id>\\d\\d\\d\\d):(?<render>.*?)\\}", "${render}");
System.out.println(test);

Output: Billy bites Jake

Get group names in java regex

There is no API in Java to obtain the names of the named capturing groups. I think this is a missing feature.

The easy way out is to pick out candidate named capturing groups from the pattern, then try to access the named group from the match. In other words, you don't know the exact names of the named capturing groups, until you plug in a string that matches the whole pattern.

The Pattern to capture the names of the named capturing group is \(\?<([a-zA-Z][a-zA-Z0-9]*)> (derived based on Pattern class documentation).

(The hard way is to implement a parser for regex and get the names of the capturing groups).

A sample implementation:

import java.util.Scanner;
import java.util.Set;
import java.util.TreeSet;
import java.util.Iterator;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.regex.MatchResult;

class RegexTester {

public static void main(String args[]) {
Scanner scanner = new Scanner(System.in);

String regex = scanner.nextLine();
StringBuilder input = new StringBuilder();
while (scanner.hasNextLine()) {
input.append(scanner.nextLine()).append('\n');
}

Set<String> namedGroups = getNamedGroupCandidates(regex);

Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
int groupCount = m.groupCount();

int matchCount = 0;

if (m.find()) {
// Remove invalid groups
Iterator<String> i = namedGroups.iterator();
while (i.hasNext()) {
try {
m.group(i.next());
} catch (IllegalArgumentException e) {
i.remove();
}
}

matchCount += 1;
System.out.println("Match " + matchCount + ":");
System.out.println("=" + m.group() + "=");
System.out.println();
printMatches(m, namedGroups);

while (m.find()) {
matchCount += 1;
System.out.println("Match " + matchCount + ":");
System.out.println("=" + m.group() + "=");
System.out.println();
printMatches(m, namedGroups);
}
}
}

private static void printMatches(Matcher matcher, Set<String> namedGroups) {
for (String name: namedGroups) {
String matchedString = matcher.group(name);
if (matchedString != null) {
System.out.println(name + "=" + matchedString + "=");
} else {
System.out.println(name + "_");
}
}

System.out.println();

for (int i = 1; i <= matcher.groupCount(); i++) { // groupCount() is the index of the last group
String matchedString = matcher.group(i);
if (matchedString != null) {
System.out.println(i + "=" + matchedString + "=");
} else {
System.out.println(i + "_");
}
}

System.out.println();
}

private static Set<String> getNamedGroupCandidates(String regex) {
Set<String> namedGroups = new TreeSet<String>();

Matcher m = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>").matcher(regex);

while (m.find()) {
namedGroups.add(m.group(1));
}

return namedGroups;
}
}

There is a caveat to this implementation, though. It currently doesn't work with regex in Pattern.COMMENTS mode.

How do I take a string with a named group and replace only that named capture group with a value in Java 7

I got it:

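String string = "/this/(?<capture1>.*)/a/string/(?<capture2>.*)";
Pattern pattern = Pattern.compile(string);
// Match the pattern against its own source, so each group captures its own "(?<...>.*)" text.
Matcher matcher = pattern.matcher(string);

if (matcher.matches()) {
    // Strings are immutable, so keep the value returned by replace().
    string = string.replace(matcher.group("capture1"), "value 1");
    string = string.replace(matcher.group("capture2"), "value 2");
}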

Crazy, but works.

android java regex named groups

Android Pattern class implementation is provided by ICU, to be precise, ICU4C.

The regular expression implementation used in Android is provided by ICU. The notation for the regular expressions is mostly a superset of those used in other Java language implementations. This means that existing applications will normally work as expected, but in rare cases Android may accept a regular expression that is not accepted by other implementations.

And ICU4C currently doesn't support named capturing groups. You have to fall back on numbered capturing groups.

ICU does not support named capture groups. http://bugs.icu-project.org/trac/ticket/5312

You need to write a wrapper and parse the expression yourself to provide named capturing group capability, if you really need the feature.
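
One way such a wrapper could look is sketched below. NamedGroupShim and its methods are hypothetical names, not part of Android, ICU, or the JDK. The sketch rewrites each (?<name>...) into a plain capturing group and remembers its number. It assumes the regex contains no character classes with a literal "(", no \Q...\E quoting, and no \k<name> backreferences, all of which a real parser would have to handle.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class NamedGroupShim {
    private final Pattern pattern;
    private final Map<String, Integer> indexByName = new HashMap<>();

    NamedGroupShim(String regex) {
        StringBuilder plain = new StringBuilder();
        int groupIndex = 0;
        for (int i = 0; i < regex.length(); i++) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                // Copy an escaped character (for example \( ) without interpreting it.
                plain.append(c).append(regex.charAt(++i));
            } else if (c == '(' && isNamedGroupStart(regex, i)) {
                int close = regex.indexOf('>', i);
                if (close < 0) {
                    throw new IllegalArgumentException("Unterminated group name at index " + i);
                }
                // Record the group's number, then emit an ordinary capturing group.
                indexByName.put(regex.substring(i + 3, close), ++groupIndex);
                plain.append('(');
                i = close;
            } else {
                if (c == '(' && !regex.startsWith("?", i + 1)) {
                    groupIndex++; // unnamed capturing group
                }
                plain.append(c);
            }
        }
        pattern = Pattern.compile(plain.toString());
    }

    // True for "(?<" followed by an ASCII letter; "(?<=" and "(?<!" are lookbehinds.
    private static boolean isNamedGroupStart(String regex, int i) {
        if (!regex.startsWith("(?<", i) || i + 3 >= regex.length()) {
            return false;
        }
        char first = regex.charAt(i + 3);
        return (first >= 'a' && first <= 'z') || (first >= 'A' && first <= 'Z');
    }

    Matcher matcher(CharSequence input) {
        return pattern.matcher(input);
    }

    // Looks the name up in the recorded table and delegates to the numbered group.
    String group(Matcher m, String name) {
        Integer index = indexByName.get(name);
        if (index == null) {
            throw new IllegalArgumentException("No group with name <" + name + ">");
        }
        return m.group(index);
    }
}
```

Usage would be along the lines of new NamedGroupShim("(?<login>\\w+) (?<id>\\d+)"), then matcher("TEST 123"), a call to matches() or find(), and group(m, "id"). Because the rewritten pattern contains only numbered groups, it does not rely on the engine understanding the (?<name>...) syntax.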
