Regex Look-Behind Without Obvious Maximum Length in Java

Java regex error - Look-behind group does not have an obvious maximum length

Java doesn't support look-behind of unbounded length.

In this case, it seems you can easily ignore it (assuming your entire input is one word):

([a-z])(?!.*\1)([a-z])(?!.*\2)(.)(\3)(.)(\5)

Neither look-behind adds anything: the first asserts that there are at least two characters where you only had one, and the second checks that the second character differs from the first, which (?!.*\1) already covers.
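As a quick check of the look-behind-free pattern, here is a minimal sketch. The class name, the helper accepts and the sample words are made up for illustration, and matches() is used on the assumption stated above that the entire input is one word.

```java
import java.util.regex.Pattern;

class NoRepeatThenPairs {
    // The pattern from the answer, with backslashes doubled for a Java string literal.
    static final Pattern P =
        Pattern.compile("([a-z])(?!.*\\1)([a-z])(?!.*\\2)(.)(\\3)(.)(\\5)");

    // Hypothetical helper: true when the whole word fits the pattern.
    static boolean accepts(String word) {
        // matches() requires the entire input to match, not just a part of it.
        return P.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("abccdd")); // true
        System.out.println(accepts("abccaa")); // false: 'a' occurs again later
        System.out.println(accepts("abbbcc")); // false: 'b' occurs again later
    }
}
```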

Working example: http://regexr.com?2up96

Java regex look-behind group does not have obvious maximum length error

Java Lookbehind is Notoriously Buggy

So you thought Java did not support infinite lookbehind?

But the following pattern will compile!

(?<=\d+)\w+

...though in a match-all it will yield unexpected results.

On the other hand, you can successfully use this other infinite lookbehind (which, to my great surprise, I found in another question)

(?<=\\G\\d+,\\d+,\\d+),

to split this string: 0,123,45,6789,4,5,3,4,6000

It will correctly output:

0,123,45
6789,4,5
3,4,6000

This time the results are what you expect.

But if you tweak the regex ever so slightly to obtain pairs instead of triplets, using the pattern (?<=\\G\\d+,\\d+), (trailing comma included), this time it will not split.
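If all you need is the grouping, one way to avoid depending on look-behind at all is to match the groups instead of splitting on a delimiter. This is a different technique from the \\G split above, not a fix for it. The sketch below assumes the input is a plain comma-separated list of digits; the class name and the helper chunk are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChunkNumbers {
    // Hypothetical helper: groups a comma-separated list into chunks of up to n numbers (n >= 1).
    static List<String> chunk(String input, int n) {
        // One number followed by at most n-1 further ",number" parts; no look-behind involved.
        Pattern p = Pattern.compile("\\d+(?:,\\d+){0," + (n - 1) + "}");
        Matcher m = p.matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "0,123,45,6789,4,5,3,4,6000";
        System.out.println(chunk(s, 3)); // [0,123,45, 6789,4,5, 3,4,6000]
        System.out.println(chunk(s, 2)); // [0,123, 45,6789, 4,5, 3,4, 6000]
    }
}
```

Because the quantifier is greedy, every chunk is full except possibly the last one, which holds whatever is left over.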


The bottom line

Java lookbehind is notoriously buggy. Knowing this, I recommend you
don't waste time trying to understand why it does something
that is undocumented.

The words that led me to this conclusion some time ago come from Jan Goyvaerts, co-author of Regular Expressions Cookbook and an arch-regex-guru who has created a terrific regex engine and needs to stay on top of most regex flavors under the sun for his debugging tool, RegexBuddy:

Java has a number of bugs in its lookbehind implementation. Some (but
not all) of those were fixed in Java 6.

Regex look-behind without obvious maximum length in Java

Glancing at the source code for Pattern.java reveals that the '*' and '+' are implemented as instances of Curly (which is the object created for curly operators). So,

a*

is implemented as

a{0,0x7FFFFFFF}

and

a+

is implemented as

a{1,0x7FFFFFFF}

which is why you see exactly the same behaviors for curlies and stars.
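The equivalence can be observed from the outside with a small sketch like the one below. It only compares matching results and does not inspect Pattern's internals; the class name and the helper prefix are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarVsCurly {
    // Returns the prefix of input matched by regex, or null if the start does not match.
    static String prefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    public static void main(String[] args) {
        // 2147483647 is 0x7FFFFFFF (Integer.MAX_VALUE) written in decimal,
        // because repeat counts inside a regex are written with decimal digits.
        System.out.println(prefix("a*", "aaab"));              // aaa
        System.out.println(prefix("a{0,2147483647}", "aaab")); // aaa
        System.out.println(prefix("a+", "aaab"));              // aaa
        System.out.println(prefix("a{1,2147483647}", "aaab")); // aaa
    }
}
```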

Java Regex: Look behind group does not have an obvious maximum length

The problem is, once again, backslash escaping in Java string literals versus no escaping when the string is read from some kind of input.

When you paste the string (?<=\\().+?(?=\\){1}) like this:

String s1 = "(?<=\\().+?(?=\\){1})";
System.out.println(s1);

you will get this output

(?<=\().+?(?=\){1})

and this is what the regexp parser sees.

But when the same string is read via an InputStream (just as an example), nothing is altered:

String s1 = new BufferedReader(new InputStreamReader(System.in)).readLine();
System.out.println(s1);

will print

(?<=\\().+?(?=\\){1})

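This means that \\ is now a literal backslash, and the ( after it opens a group instead of being a literal parenthesis. The look-behind therefore does not close until the final ), so it contains .+?, and the {1} is attributed to the (?=\\) part and not to the (?<= part.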
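The difference between the two strings can be made visible without any input stream. In this sketch, which uses made-up names and a made-up sample input, AS_TYPED reproduces in source code the text that a reader would deliver unchanged, which requires doubling every backslash once more.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapingDemo {
    // What the regex engine receives when the pattern is written as a Java literal.
    static final String FROM_LITERAL = "(?<=\\().+?(?=\\){1})";

    // The same characters typed at a prompt arrive unchanged; to reproduce that
    // text inside source code, every backslash has to be doubled again.
    static final String AS_TYPED = "(?<=\\\\().+?(?=\\\\){1})";

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(FROM_LITERAL); // (?<=\().+?(?=\){1})
        System.out.println(AS_TYPED);     // (?<=\\().+?(?=\\){1})
        // Only the literal form is compiled here; it matches the text between parentheses.
        System.out.println(firstMatch(FROM_LITERAL, "f(abc)")); // abc
    }
}
```

When the pattern comes from user input, it should therefore be typed with single backslashes, exactly as the regex engine is meant to see it.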

Positive lookbehind regex obvious maximum length

Java supports variable length lookbehind only if the size is limited and the subpattern in the lookbehind isn't too complicated.

In short, you can't write:

(?<=\\[\\w*\\]).*

But you can write:

(?<=\\[\\w{0,1000}\\]).*

However something like:

(?<=\\[(?:\\w{0,2}){0,500}\\w?\\]).*

doesn't work since the max length isn't obvious.
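To see how your own JVM treats these three patterns, a small probe like the following may help. It is only a sketch: the class and helper names are hypothetical, and the results for the first and third pattern are deliberately not stated because they can differ between Java versions. Only the bounded form is relied upon.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class BoundedLookBehind {
    // Reports whether the running JVM accepts the pattern at compile time.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Uses the bounded look-behind to return everything after a "[word]" prefix.
    static String afterBracket(String input) {
        Matcher m = Pattern.compile("(?<=\\[\\w{0,1000}\\]).*").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Outcome depends on the Java version in use.
        System.out.println(compiles("(?<=\\[\\w*\\]).*"));
        System.out.println(compiles("(?<=\\[(?:\\w{0,2}){0,500}\\w?\\]).*"));
        // The explicitly bounded form compiles.
        System.out.println(compiles("(?<=\\[\\w{0,1000}\\]).*")); // true
        System.out.println(afterBracket("[abc]def"));             // def
    }
}
```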

Why does the look-behind expression in this regex not have an obvious maximum length?

\[ is only a single character, so it seems like the obvious maximum length should be 1 + whatever the obvious maximum length was of the look-behind group in the first expression. What gives?

That's the point: "whatever the obvious maximum length was of the look-behind group in the first expression" is not obvious. A rule of thumb is that you can't use + or * inside a look-behind. This holds not only for Java's regex engine but for many other PCRE-flavored engines as well (even Perl's v5.10 engine!).

You can do this with look-aheads however:

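import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("(?=(\\[[a-z]+]))");
Matcher m = p.matcher("] [abc] [123] abc]");
while (m.find()) {
    // end(1) is the index just past the captured "[...]" group
    System.out.println("Found a ']' before index: " + m.end(1));
}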

(That is, a capture group inside a look-ahead (!), whose end(...) gives the index just past the matched group.)

This will print:

Found a ']' before index: 7

EDIT

And if you're interested in replacing such ]'s, you could do something like this:

String s = "] [abc] [123] abc] [foo] bar]";
System.out.println(s);
System.out.println(s.replaceAll("(\\[[a-z]+)]", "$1_"));

which will print:

] [abc] [123] abc] [foo] bar]
] [abc_ [123] abc] [foo_ bar]

