Java Regex Capturing Groups

The issue you're having is with the type of quantifier. You're using a greedy quantifier in your first group (index 1 - index 0 represents the whole Pattern), which means it'll match as much as it can (and since it's any character, it'll match as many characters as there are in order to fulfill the condition for the next groups).

In short, your 1st group .* matches anything as long as the next group \\d+ can match something (in this case, the last digit).

As for the 3rd group, it will match anything after the last digit.

If you change it to a reluctant quantifier in your 1st group, you'll get the result I suppose you are expecting, that is, the 3000 part.

Note the question mark in the 1st group.

String line = "This order was placed for QT3000! OK?";
Pattern pattern = Pattern.compile("(.*?)(\\d+)(.*)");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
    System.out.println("group 1: " + matcher.group(1));
    System.out.println("group 2: " + matcher.group(2));
    System.out.println("group 3: " + matcher.group(3));
}

Output:

group 1: This order was placed for QT
group 2: 3000
group 3: ! OK?
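
For comparison, here is a minimal sketch that runs the greedy and the reluctant version side by side on the same input. The class name GreedyVsReluctant and the helper groups are made up for this illustration; only the two patterns and the input line come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {

    // Returns groups 1..3 of the first match, or null if nothing matches.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";

        // Greedy: .* takes as much as it can and leaves \d+ a single digit.
        String[] greedy = groups("(.*)(\\d+)(.*)", line);
        // Reluctant: .*? takes as little as it can, so \d+ gets every digit.
        String[] reluctant = groups("(.*?)(\\d+)(.*)", line);

        System.out.println("greedy:    " + String.join(" | ", greedy));
        System.out.println("reluctant: " + String.join(" | ", reluctant));
    }
}
```

With the greedy pattern, group 1 ends in QT300 and group 2 is just 0; with the reluctant one, group 2 is 3000.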

More info in the Javadoc for java.util.regex.Pattern.

Finally, the capturing groups are delimited by round brackets, and provide a very useful way to use back-references (amongst other things), once your Pattern is matched to the input.

In Java 6 groups can only be referenced by their order (beware of nested groups and the subtlety of ordering).

In Java 7 it's much easier, as you can use named groups.

How to write a regex capture group which matches a character 3 or 4 times before a delimiter?

I suggest this pattern:

(?![\\s,])(?:[^:]*:){3}\\S*(?![^,])

The negative lookaheads prevent the match from starting or ending on a delimiter. The second one in particular forces the match to be followed by the delimiter or the end of the string (it must not be followed by a character that isn't a comma).

Note that the pattern doesn't have capture groups, so the result is the whole match (or group 0).
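
A minimal sketch of how the pattern could be used from Java. The sample input "a:b:c:d, e:f:g:h" and the names ColonFields and items are assumptions made for this illustration, since the question's real data isn't shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColonFields {

    private static final Pattern ITEM =
            Pattern.compile("(?![\\s,])(?:[^:]*:){3}\\S*(?![^,])");

    // Collects every whole match; the pattern has no capturing groups.
    static List<String> items(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = ITEM.matcher(input);
        while (m.find()) {
            result.add(m.group()); // same as m.group(0)
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input: comma-separated items with three colons each.
        System.out.println(items("a:b:c:d, e:f:g:h"));
    }
}
```

For this assumed input the two matches are a:b:c:d and e:f:g:h, without the comma or the space between them.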

Java regex capturing groups indexes

Capturing and grouping

A capturing group, (pattern), creates a group that has the capturing property.

A related one that you might often see (and use) is (?:pattern), which creates a group without the capturing property, hence the name non-capturing group.

A group is usually used when you need to repeat a sequence of patterns, e.g. (\.\w+)+, or to specify where alternation should take effect, e.g. ^(0*1|1*0)$ (^, then 0*1 or 1*0, then $) versus ^0*1|1*0$ (^0*1 or 1*0$).
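
To see the alternation difference in practice, here is a small sketch. The input "0011" and the names AlternationScope and found are chosen for this illustration only.

```java
import java.util.regex.Pattern;

public class AlternationScope {

    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "0011";

        // Anchors apply to both branches: the whole input must be 0*1 or 1*0.
        System.out.println(found("^(0*1|1*0)$", input)); // false

        // Anchors bind to one branch each: ^0*1 alone matches the prefix "001".
        System.out.println(found("^0*1|1*0$", input));   // true
    }
}
```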

A capturing group, apart from grouping, will also record the text matched by the pattern inside the capturing group (pattern). Using your example, (.*):, .* matches ABC and : matches :, and since .* is inside capturing group (.*), the text ABC is recorded for the capturing group 1.
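
The following sketch shows the recorded text for (.*): on the input ABC: as described above. The class and method names are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureBeforeColon {

    // Returns { whole match, text recorded for group 1 }, or null if no match.
    static String[] match(String input) {
        Matcher m = Pattern.compile("(.*):").matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] result = match("ABC:");
        System.out.println("group 0: " + result[0]); // ABC:
        System.out.println("group 1: " + result[1]); // ABC
    }
}
```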

Group number

The whole pattern is defined to be group number 0.

Capturing groups in the pattern are numbered starting from 1. The indices are defined by the order of the opening parentheses of the capturing groups. As an example, here are all 5 capturing groups in the pattern below:

(group)(?:non-capturing-group)(g(?:ro|u)p( (nested)inside)(another)group)(?=assertion)
|     |                       |          | |      |      ||       |     |
1-----1                       |          | 4------4      |5-------5     |
                              |          3---------------3              |
                              2-----------------------------------------2
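
The numbering can be checked with a short sketch that prints every group of the pattern above. The input string is constructed for this illustration so that the pattern matches, and the names GroupNumbering and groups are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumbering {

    private static final Pattern P = Pattern.compile(
            "(group)(?:non-capturing-group)(g(?:ro|u)p( (nested)inside)(another)group)(?=assertion)");

    // Returns groups 0..groupCount() of the first match, or null if none.
    static String[] groups(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Input built by hand so that every part of the pattern matches.
        String input = "groupnon-capturing-groupgrop nestedinsideanothergroupassertion";
        String[] g = groups(input);
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": [" + g[i] + "]");
        }
    }
}
```

Neither the non-capturing groups nor the lookahead get a number, so groupCount() is 5, and the lookahead text is not part of group 0.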

The group numbers are used in back-reference \n in pattern and $n in replacement string.
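
A brief sketch of both uses, with made-up inputs and names (BackReferences, hasDoubledWord, swap):

```java
import java.util.regex.Pattern;

public class BackReferences {

    // \1 in the pattern must match the same text that group 1 captured.
    static boolean hasDoubledWord(String input) {
        return Pattern.compile("\\b(\\w+) \\1\\b").matcher(input).find();
    }

    // $1 and $2 in the replacement insert the text of groups 1 and 2.
    static String swap(String input) {
        return input.replaceAll("(\\w+)=(\\w+)", "$2=$1");
    }

    public static void main(String[] args) {
        System.out.println(hasDoubledWord("this is is fine")); // true
        System.out.println(hasDoubledWord("this is fine"));    // false
        System.out.println(swap("key=value"));                 // value=key
    }
}
```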

In other regex flavors (PCRE, Perl), they can also be used in sub-routine calls.

You can access the text matched by certain group with Matcher.group(int group). The group numbers can be identified with the rule stated above.

In some regex flavors (PCRE, Perl), there is a branch reset feature which allows you to use the same number for capturing groups in different branches of alternation.

Group name

From Java 7, you can define a named capturing group (?<name>pattern), and you can access the content matched with Matcher.group(String name). The regex is longer, but the code is more meaningful, since it indicates what you are trying to match or extract with the regex.

The group names are used in back-reference \k<name> in pattern and ${name} in replacement string.

Named capturing groups are still numbered with the same numbering scheme, so they can also be accessed via Matcher.group(int group).

Internally, Java's implementation just maps from the name to the group number. Therefore, you cannot use the same name for 2 different capturing groups.
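
Here is a minimal sketch of the named-group features mentioned above. The year-month pattern and all class and method names are hypothetical and chosen only for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class NamedGroups {

    // Hypothetical pattern: a year-month pair such as 2024-05.
    private static final Pattern DATE =
            Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");

    // Reads the same group once by name and once by number.
    static String[] year(String input) {
        Matcher m = DATE.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("year"), m.group(1) };
    }

    // ${name} in the replacement string refers to a named group.
    static String monthFirst(String input) {
        return DATE.matcher(input).replaceAll("${month}/${year}");
    }

    // \k<name> in the pattern is a back-reference to a named group.
    static boolean hasDoubledChar(String input) {
        return Pattern.compile("(?<ch>\\w)\\k<ch>").matcher(input).find();
    }

    // Reusing a group name is rejected when the pattern is compiled.
    static boolean duplicateNameRejected() {
        try {
            Pattern.compile("(?<n>a)(?<n>b)");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", year("Released 2024-05")));
        System.out.println(monthFirst("2024-05"));
        System.out.println(hasDoubledChar("letter"));
        System.out.println(duplicateNameRejected());
    }
}
```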

Trying to understand Capturing groups in regex with Java

Among other things, regex lets you obtain portions of the input that were matched by various parts of the regular expression. Sometimes you need the entire match, but often you need only a part of it. For example, this regular expression matches "Page X of Y" strings:

Page \d+ of \d+

If you pass it a string

Page 14 of 203

you will match the entire string. Now let's say that you want only 14 and 203. No problem - regex library lets you enclose the two \d+ in parentheses, and then retrieve only the "14" and "203" strings from the match.

Page (\d+) of (\d+)

The above expression creates two capturing groups. The Matcher object obtained by matching the pattern lets you retrieve the content of these groups individually:

Pattern p = Pattern.compile("Page (\\d+) of (\\d+)");
String text = "Page 14 of 203";
Matcher m = p.matcher(text);
if (m.find()) {
    System.out.println(m.group(1));
    System.out.println(m.group(2));
}

This prints 14 and 203.

Are non-capturing groups redundant?

Your (?:wo)?men and (wo)?men are semantically equivalent, but technically are different, namely, the first is using a non-capturing and the other a capturing group. Thus, the question is why use non-capturing groups when we have capturing ones?

Non-capturing groups are sometimes helpful:

  1. To avoid excessive number of backreferences (remember that it is sometimes difficult to use backreferences higher than 9)
  2. To avoid the problem with 99 numbered backreferences limit (by reducing the number of numbered capturing groups) (source: Regular-expressions.info: Most regex flavors support up to 99 capturing groups and double-digit backreferences.)
    NOTE this does not pertain to Java regex engine, nor to PHP or .NET regex engines.
  3. To lessen the overhead caused by storing the captures in the stack
  4. We can add more groupings to existing regex without ruining the order of capturing groups.

Also, it just makes our matches cleaner:

You can use a non-capturing group to retain the organisational or grouping benefits but without the overhead of capturing.

It does not seem a good idea to re-factor existing regular expressions to convert capturing to non-capturing groups, since it may ruin the code or require too much effort.
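
Point 4 above can be sketched as follows. The mailto input and the names NonCapturingGroups, groupCount and group1 are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonCapturingGroups {

    // Number of capturing groups the pattern defines.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    // Text of group 1 for the first match, or null if there is no match.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(wo)?men"));   // 1
        System.out.println(groupCount("(?:wo)?men")); // 0

        String input = "mailto:bob@example";
        // Existing regex: group 1 is the part before the @.
        System.out.println(group1("(\\w+)@(\\w+)", input));              // bob
        // An added non-capturing prefix leaves the numbering alone.
        System.out.println(group1("(?:mailto:)?(\\w+)@(\\w+)", input));  // bob
        // An added capturing prefix shifts every later group by one.
        System.out.println(group1("(mailto:)?(\\w+)@(\\w+)", input));    // mailto:
    }
}
```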

Java regex repeating capture groups

Basically, the main problem with your regex is that it matches only at the end of the string, and [A-z] matches many more chars than just letters. Your grouping also seems off.

If you load your regex at regex101, you will see it matches

  • \$\{
  • ( - start of a capturing group

    • (?: - start of a non-capturing group

      • (?:[A-z]+ - start of a non-capturing group, and it matches 1+ chars between A and z (your first mistake)

        • (?:\.[A-z0-9()\[\]\"]+)* - 0 or more repetitions of a . and then 1+ letters, digits, (, ), [, ], ", \, ^, _, and a backtick
      • )+ - repeat the non-capturing group 1 or more times
      • | - or
      • (?:\"[\w/?.&=_\-]*\")+ - 1 or more occurrences of ", 0 or more word, /, ?, ., &, =, _, - chars and then a "
      • )+ - repeat the group pattern 1+ times
    • ) - end of the capturing group
  • }+ - 1+ } chars
  • $ - end of string.
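
The [A-z] mistake is easy to confirm with a quick check. The class name AzRange is made up for this sketch.

```java
import java.util.regex.Pattern;

public class AzRange {

    // True if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // The six ASCII chars that sit between 'Z' and 'a': [ \ ] ^ _ `
        String punctuation = "[\\]^_`";

        System.out.println(matches("[A-z]+", punctuation));    // true
        System.out.println(matches("[A-Za-z]+", punctuation)); // false
    }
}
```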

To match any occurrence of your pattern inside a string, you need to use

\$\{(\"[^\"]*\"|\w+(?:\(\))?(?:\.\w+(?:\(\))?)*)}

See the regex demo, get Group 1 value after a match is found. Details:

  • \$\{ - a ${ substring
  • (\"[^\"]*\"|\w+(?:\(\))?(?:\.\w+(?:\(\))?)*) - Capturing group 1:

    • \"[^\"]*\" - ", 0+ chars other than " and then a "
    • | - or
    • \w+(?:\(\))? - 1+ word chars and an optional () substring
    • (?:\.\w+(?:\(\))?)* - 0 or more repetitions of . and then 1+ word chars and an optional () substring
  • } - a } char.

See the Java demo:

String s = "${test.one}${test.two}\n${test.one}${test.two()}\n${test.one}${\"hello\"}";
Pattern pattern = Pattern.compile("\\$\\{(\"[^\"]*\"|\\w+(?:\\(\\))?(?:\\.\\w+(?:\\(\\))?)*)}");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}

Output:

test.one
test.two
test.one
test.two()
test.one
"hello"

Regex to capture groups and ignore last two characters where one is optional

Edited based on comment to include ; and " in the comments as per the examples given:

(?<key>\w+)\s*=\s*(?:[\"]?)(?<value>((")(?!;?$)|;(?!$)|[^;"])+)"?;?$

The following one additionally doesn't allow ; or " to appear in the numeric text. However, to include this, I had to rename the capturing groups because the name cannot be used for more than one group.

(?<key>\w+)\s*=\s*((?:")(?<valueT>((")(?!;?$)|;(?!$)|[^;"])+)";?$|(?<valueN>[^;"]+);?$)

Here is a class that tests it.

For readability, I have separated the key and value regexes in the class. I have added the test cases in a method within the class. However, this still doesn't handle the case of a numeric text containing ; or ". Also, the line needs to be trimmed before being subjected to the pattern test (which I think is feasible).

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameValuePairRegex {

    public static void main( String[] args ) {
        String SPACE = "\\s*";
        String EQ = "=";
        String OR = "|";

        /* The original regex tried by you (for comparison). */
        String orig = "(?<key>\\w+)\\s*=\\s*(?:[\\\"]?)(?<value>.+(?:(?=;)))";

        String key = "(?<key>\\w+)";
        String valuePatternForText = "(?:\")(?<valueT>((\")(?!;?$)|;(?!$)|[^;\"])+)\";?$";
        String valuePatternForNumbers = "(?<valueN>[^;\"]+);?$";
        String p = key + SPACE + EQ + SPACE + "(" + valuePatternForText + OR + valuePatternForNumbers + ")";

        Pattern nvp = Pattern.compile( p );
        System.out.println( nvp.pattern() );
        print( input(), nvp );
    }

    /* Prints each input line followed by the key and value it yields, or "No match". */
    private static void print( List<String> input, Pattern ep ) {
        for( String e : input ) {
            System.out.println( e );
            Matcher m = ep.matcher( e );
            boolean found = m.find();
            if( !found ) {
                System.out.println( "\t\tNo match" );
                continue;
            }

            /* Only one of the two value groups takes part in a match; the other is null. */
            String valueT = m.group( "valueT" );
            String valueN = m.group( "valueN" );

            System.out.print( "\t\t" + m.group( "key" ) + " -> " + ( valueT == null ? "" : valueT ) + " " + ( valueN == null ? "" : valueN ) );
            System.out.println( );
        }
    }

    /* The test cases. */
    private static List<String> input() {
        List<String> neg = new ArrayList<>();
        Collections.addAll( neg,
            "Comment = \"This is a comment\";",
            "Comment = \"This is a comment with semicolon ;\";",
            "Comment = \"This is a comment with semicolon ; and quote\"\";",
            "Comment = \"This is a comment\"",
            "Comment = \"This is a \"comment\"; This is still a comment\";",
            "NumericValue = 123456;",
            "NumericValue = 123;456;",
            "NumericValue = 123\"456;",
            "NumericValue = 123456" );

        return neg;
    }

}

Original answer:

The following modified regex fulfills the requirements you mentioned. I added the exclusion of ; and " from the value part.

Original that you tried:

(?<key>\w+)\s*=\s*(?:[\"]?)(?<group>.+(?:(?=[\"]?;)))

The changed one:

(?<key>\w+)\s*=\s*(?:[\"]?)(?<value>[^;"]+)
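
A minimal sketch of this simpler regex in Java, run against two of the sample lines used in the test class above. The class name KeyValue and the helper parse are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValue {

    private static final Pattern NVP =
            Pattern.compile("(?<key>\\w+)\\s*=\\s*(?:[\"]?)(?<value>[^;\"]+)");

    // Returns { key, value } for the first match, or null if the line doesn't match.
    static String[] parse(String line) {
        Matcher m = NVP.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("key"), m.group("value") };
    }

    public static void main(String[] args) {
        String[] text = parse("Comment = \"This is a comment\";");
        String[] number = parse("NumericValue = 123456;");
        System.out.println(text[0] + " -> " + text[1]);     // Comment -> This is a comment
        System.out.println(number[0] + " -> " + number[1]); // NumericValue -> 123456
    }
}
```

Because the value class excludes ; and ", the value stops at the first of either, so this version cannot keep those characters inside a comment.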

