Java.Lang.Stackoverflowerror While Using a Regex to Parse Big Strings

StackOverflowError in regular expression

The problem comes from (.|\\n)* part of p6:

Pattern p6=Pattern.compile("(BENCH:)((.|\\n)*)(BENCH)((.|\\n)*)(CITATION)");

(.|\\n)* compiles into the following structure on Oracle/OpenJDK JRE, whose implementation uses recursion (note GroupTail goes back to Loop) to match repetition of non-deterministic pattern (alternation is always considered non-deterministic in the implementation).

Prolog. Loop wrapper
Loop [732768bb]. Greedy quantifier {0,2147483647}
GroupHead. (DEBUG) local=0
Branch. Alternation (in printed order):
Dot. (?:.), equivalent to [^\n\r\u0085\u2028\u2029]
---
Single. Match code point: U+000A LINE FEED (LF)
---
BranchConn [204d080d]. Connect branches to sequel.
GroupTail [214b9e0c]. (DEBUG) local=0, group=2. --[next]--> Loop [732768bb]

On a long string, the stack runs out, so you get StackOverflowError.

If you want to match any character without exception, then you should use . alone in combination with Pattern.DOTALL flag.

  • You can pass the flag to Pattern.compile(String regex, int flags) method to turn on the flag for the whole expression:

    Pattern p6 = Pattern.compile("(BENCH:)(.*)(BENCH)(.*)(CITATION)", Pattern.DOTALL);
  • Or as suggested in the comment by Jonny 5, you can also use the inline flag (?s):

    Pattern p6 = Pattern.compile("(?s)(BENCH:)(.*)(BENCH)(.*)(CITATION)");
  • Alternatively, you can also turn on the flag for a sub-pattern (?s:.*):

    Pattern p6 = Pattern.compile("(BENCH:)(?s:(.*))(BENCH)(?s:(.*))(CITATION)");

By the way, are you sure you want to match |onrable in p3?

Pattern p3 = Pattern.compile("(([H|h]on)|(HON)).*((ble)|BLE)(.*)");

If you don't want it, remove | from the character class:

Pattern p3 = Pattern.compile("(([Hh]on)|(HON)).*((ble)|BLE)(.*)");

I also see an excessive amount of capturing group. Please review and check whether they are really necessary.

Getting java.lang.StackOverflowError on String.replaceAll in Java

Can someone suggest how I can improve my regex to avoid the
StackOverflowError?

Yes I can gives you two solutions, you just need to see your problem from another side.

Here is a quick analyse about your problem and a quick solution, you can use this regex instead (.*?\"\s+)\bOR\b(\s+application.*?) :

Solution one

String inputStr = //that long String
String regex = "(.*?\"\\s+)\\bOR\\b(\\s+application.*?)";
String replacedStr = inputStr.replaceAll(regex, "$1||$2");

System.out.println(replacedStr);

I notice that the OR you want to replace exist after " ans space OR the application, my regex will match that OR and replace it.

Output for the short example, it will gives you the same result for the long one :

application.path="EXCEL.exe" || application.path="EXCELSIOR.exe" || application.path="XYZ OR ABC.exe"
^^ ^^ ^^ ^^

Solution two

If you are using Java 9+ you can use this regex application.path=(\"(.*?)\"), to match every thing like application.path="something here", the collect the result with ||

String regex = "application.path=(\"(.*?)\")";
String text = Pattern.compile(regex)
.matcher(inputStr).results().map(MatchResult::group)
.collect(Collectors.joining(" || "));

Regex Patterns causing StackoverFlow

I ended up using a combination of both answers from VGR and MatthewGreen. Re2j solved my regex problem and increased the performance of the matching. - ultimately I decided to depend less on regex for this and instead use JSoup for parsing and regex to extract what I wanted from the document after removing the unwanted elements.

Sporadic Stack Overflow error in java Matcher

There are 2 ways to fix your problem:

  • Parse the input string properly and get the key values from the Map.

    I strongly recommend using this method, since the code will be much cleaner, and we no longer have to watch the limit on the input size.

  • Modify the existing regex to greatly reduce the impact of the implementation flaw which causes StackOverflowError.

Parse the input string

You can parse the input string with the following regex:

\G\s*+(\w++)=([^\s"]++|"[^"]*+")(?:\s++|$)
  • All quantifiers are made possessive (*+ instead of *, ++ instead of +), since the pattern I wrote doesn't need backtracking.

  • You can find the basic regex (\w++)=([^\s"]++|"[^"]*+") to match key-value pairs in the middle.

  • \G is to make sure the match starts from where the last match leaves off. It is used to prevent the engine from "bump-along" when it fails to match.

  • \s*+ and (?:\s++|$) are for consuming excess spaces. I specify (?:\s++|$) instead of \s*+ to prevent key="value"key=value from being recognized as valid input.

The full example code can be found below:

private static final Pattern KEY_VALUE = Pattern.compile("\\G\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)");

public static Map<String, String> parseKeyValue(String kvString) {
Matcher matcher = KEY_VALUE.matcher(kvString);

Map<String, String> output = new HashMap<String, String>();
int lastIndex = -1;

while (matcher.find()) {
output.put(matcher.group(1), matcher.group(2));
lastIndex = matcher.end();
}

// Make sure that we match everything from the input string
if (lastIndex != kvString.length()) {
return null;
}

return output;
}

You might want to unquote the values, depending on your requirement.

You can also rewrite the function to pass a List of keys you want to extract, and pick them out in the while loop as you go to avoid storing keys that you don't care about.

Modify the regex

The problem is due to the outer repetition (?:[^"]|"[^"]*")* being implemented with recursion, leading to StackOverflowError when the input string is long enough.

Specifically, in each repetition, it matches either a quoted token, or a single non-quoted character. As a result, the stack grows linearly with the number of non-quoted characters and blows up.

You can replace all instance of (?:[^"]|"[^"]*")* with [^"]*(?:"[^"]*"[^"]*)*. The stack will now grow linearly as the number of quoted tokens, so StackOverflowError will not occur, unless you have thousands of quoted tokens in the input string.

Pattern KEY_CAPTURE = Pattern.compile("app=github(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+user=(?<user>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+repo=(?<repo>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+remote_address=(?<ip>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+now=\"(?<time>\\S+)\\+\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+url=\"(?<url>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+referer=\"(?<referer>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+status=(?<status>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+elapsed=(?<elapsed>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_method=(?<requestmethod>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+created_at=\"(?<createdat>\\S+)(?:-|\\+)\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+pull_request_id=(?<pullrequestid>\\d+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+at=(?<at>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+fn=(?<fn>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+method=(?<method>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+current_user=(?<user2>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+content_length=(?<contentlength>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_category=(?<requestcategory>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+controller=(?<controller>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+action=(?<action>\\S+))?");

It follows equivalent expansion of the regex (A|B)*A*(BA*)*. Which to use as A or B depends on their number of repetitions - whichever repeats more should be A and the other should be B.

Deep diving into the implementation

StackOverflowError in Pattern is a known problem, which could happen when your pattern contains a repetition of a non-deterministic1 capturing/non-capturing group, which is the subpattern (?:[^"]|"[^"]*")* in your case.

1 This is a terminology used in the source code of Pattern, which is probably intended to be an indicator that the pattern has fixed length. However, the implementation considers alternation | to be non-deterministic, regardless of the actual pattern.

Greedy or lazy repetition of a non-deterministic capturing/non-capturing group is compiled into Loop/LazyLoop classes, which implement repetition by recursion. As a result, such pattern is extremely prone to trigger StackOverflowError, especially when the group contains a branch where only a single character is matched at a time.

On the other hand, deterministic2 repetition, possessive repetition, and repetition of independent group (?>...) (a.k.a. atomic group or non-backtracking group) are compiled into Curly/GroupCurly classes, which processes the repetition with loop in most cases, so there will be no StackOverflowError.

2 The repeated pattern is a character class, or a fixed length capturing/non-capturing group without any alternation

You can see how a fragment of your original regex is compiled below. Take note of the problematic part, which starts with Loop, and compare it to your stack trace.

app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?
BnM. Boyer-Moore (BMP only version) (length=10)
app=github
Ques. Greedy optional quantifier
Pos. Positive look-ahead
GroupHead. local=0
Prolog. Loop wrapper
Loop [1889ca51]. Greedy quantifier {0,2147483647}
GroupHead. local=1
Branch. Alternation (in printed order):
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
---
Single. Match code point: U+0022 QUOTATION MARK
Curly. Greedy quantifier {0,2147483647}
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
Node. Accept match
Single. Match code point: U+0022 QUOTATION MARK
---
BranchConn [7e41986c]. Connect branches to sequel.
GroupTail [47e1b36]. local=1, group=0. --[next]--> Loop [1889ca51]
Curly. Greedy quantifier {1,2147483647}
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
Slice. Match the following sequence (BMP only version) (length=5)
user=
GroupHead. local=3
Curly. Greedy quantifier {1,2147483647}
CharProperty.complement. S̄:
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
GroupTail [732c7887]. local=3, group=2. --[next]--> GroupTail [6c9d2223]
GroupTail [6c9d2223]. local=0, group=0. --[next]--> Node [4ea5d7f2]
Node. Accept match
Node. Accept match
Ques. Greedy optional quantifier
Pos. Positive look-ahead
GroupHead. local=4
Prolog. Loop wrapper
Loop [402c5f8a]. Greedy quantifier {0,2147483647}
GroupHead. local=5
Branch. Alternation (in printed order):
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
---
Single. Match code point: U+0022 QUOTATION MARK
Curly. Greedy quantifier {0,2147483647}
CharProperty.complement. S̄:
BitClass. Match any of these 1 character(s):
"
Node. Accept match
Single. Match code point: U+0022 QUOTATION MARK
---
BranchConn [21347df0]. Connect branches to sequel.
GroupTail [7d382897]. local=5, group=0. --[next]--> Loop [402c5f8a]
Curly. Greedy quantifier {1,2147483647}
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
Slice. Match the following sequence (BMP only version) (length=5)
repo=
GroupHead. local=7
Curly. Greedy quantifier {1,2147483647}
CharProperty.complement. S̄:
Ctype. POSIX (US-ASCII): SPACE
Node. Accept match
GroupTail [71f111ba]. local=7, group=4. --[next]--> GroupTail [9c304c7]
GroupTail [9c304c7]. local=4, group=0. --[next]--> Node [4ea5d7f2]
Node. Accept match
Node. Accept match
LastNode.
Node. Accept match

Regex causes StackOverflowError

First of all, you do not need the inner capturing groups. Repeated capturing groups only capture the last occurrence, so, they are useless in Java.

Next, you may unroll this regex so that it matches linearly (without alternation that eats up a lot of resources):

[^']*(?:''[^']*)*

See the updated regex demo.

Pattern details:

  • [^']* - matches zero or more characters other than '
  • (?:''[^']*)* - matches zero or more sequences of:

    • '' - two single apostrophes
    • [^']* - zero or more characters other than '


Related Topics



Leave a reply



Submit