JavaScript Regex Hangs (Using V8)

Javascript regex hangs (using v8)

This catastrophically backtracks on long sequences of spaces that occur after the last closing </tag:main> tag. Consider the case where the subject string ends with 100 spaces. First it matches them all with the . on the left of the alternation. That fails because there's no closing tag, so it tries matching the last character with the \s instead. That fails too, so it tries matching the second-to-last space as a \s and the last space as a .. That fails (still no closing tag) so it tries the last space as a \s. When that fails it matches the third-to-last space as a \s and tries all 4 ways to match the last two spaces. When that fails it tries the fourth-to-last space as a \s and all 8 ways on the last 3 spaces. Then 16, 32 etc. The universe ends before it gets to the 100th-to-last space.

Different VMs have different reactions to regexp matches that take forever because of catastrophic backtracking. Some will simply report 'no match'. In V8 it's like writing any other infinite or near-infinite loop.

Using non-greedy * will do what you want (you want to stop at the first </tag:main>, not the last), but will still do catastrophic backtracking for long strings of spaces where the closing sequence is missing.

Making sure the same characters in the inner bracket can't match both sides of the alternation will reduce the problem from an exponential one to one that is linear in the length of the string. Use a character class instead of an alternation or put \n on the right hand side of the alternation bar. \n is disjoint with . so if you hit a long sequence of spaces the regexp engine doesn't try all left-right-left etc. combinations before terminating.

What could be causing this weird regular expression behaviour in node.js?

I am pretty sure what's happening here is catastrophic backtracking. For example on my machine:

  • For 39 chars in "Hello there, best wishes, from Syria..." it takes around 13-14 seconds.
  • For 40 chars in "Hello there, vbest wishes, from Syria..." it takes 27-28 seconds.
  • For 41 chars in "Hello there, vebest wishes, from Syria..." it takes 56 seconds.

You can see the time taken increases exponentially. To explain how regex engine matches string via backtracking, I will quote the example from above link. It applies regex (x+x+)+y on string xxxxxxxxxxy :

Let's see what happens when you apply this regex to xxxxxxxxxxy. The
first x+ will match all 10 x characters. The second x+ fails. The
first x+ then backtracks to 9 matches, and the second one picks up the
remaining x. The group has now matched once. The group repeats, but
fails at the first x+. Since one repetition was sufficient, the group
matches. y matches y and an overall match is found. The regex is
declared functional, the code is shipped to the customer, and his
computer explodes. Almost.

The above regex turns ugly when the y is missing from the subject
string. When y fails, the regex engine backtracks. The group has one
iteration it can backtrack into. The second x+ matched only one x, so
it can't backtrack. But the first x+ can give up one x. The second x+
promptly matches xx. The group again has one iteration, fails the next
one, and the y fails. Backtracking again, the second x+ now has one
backtracking position, reducing itself to match x. The group tries a
second iteration. The first x+ matches but the second is stuck at the
end of the string. Backtracking again, the first x+ in the group's
first iteration reduces itself to 7 characters. The second x+ matches
xxx. Failing y, the second x+ is reduced to xx and then x. Now, the
group can match a second iteration, with one x for each x+. But this
(7,1),(1,1) combination fails too. So it goes to (6,4) and then
(6,2)(1,1) and then (6,1),(2,1) and then (6,1),(1,2) and then I think
you start to get the drift.

See this page from Jeff on regex performance, which uses the same example. So moral of the story : don't just match stuff, improve the regex. When nesting repetition operators, make absolutely sure that there is only one way to match the same match. For the example I quoted xx+y works better. And for your regex the answer Esailija gave will work much better /^([\w, ]+)$/

Why jquery regular expression return different - different test cases each time?

The problem is that you are using g flag on the RegExp literal. The solution is to remove the g flag.


According to the documentation of RegExp.test() (emphasis mine):

As with exec (or in combination with it), test called multiple times on the same global regular expression instance will advance past the previous match.

After the first call to test, the lastIndex property of the RegExp object /<(.|\n)*?>/g is updated. The index used as the offset for the second call to test. Since there is no more match for the pattern, the second call returns false and lastIndex is reset to 0. You can see the lastIndex property of the regex being changed in this JSFiddle.

Removing the g flag will cause the engine to ignore the lastIndex property.

I have described in detailed behaviour of RegExp.exec (which is also the same for RegExp.text) in this post.

Optimizing a small regex with capturing and ?! assertion

You can increase your pattern speed using this:

\$this->(m_\w++)(?!(?>[^p]++|\Bp++|p(?!rivate \$\1\b))++private \$\1\b)

What's causing this regex to match everything?

I think the best approach would be to use a xpath expression or a xml parser.

However, as you stated in your comment if you want to capture that specific portion using regex, then you can use this:

(<ProjectReference.*?Project2[\s\S]*?</ProjectReference>)

Working demo

Match information

MATCH 1
1. [209-384] `<ProjectReference Include="..\..\Project2\Project2.csproj">
<Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>
<Name>Project2</Name>
</ProjectReference>`

Besides regex101 also used SublimeText to show it's working, however Notepad++ has a poor regex engine and usually messes it up with tricks like [\s\S]*?:

Sample Image

On the other hand, related to your question about "why is failing", your regex is not failing but your pattern allows that greedy match (even using the lazy operator) because of your (.|\s) alternation:

^(\s+)<ProjectReference(.|\s)+?(Project2)</Name>(.|\s)+?</ProjectReference>
^--- HERE

If you check the Regex101 explanation, you can see:

2nd Capturing group (.|\s)+?
Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
1st Alternative: .
. matches any character (except newline)
2nd Alternative: \s
\s match any white space character [\r\n\t\f ]

Regex to remove multi line comments

the dot catches everything except newlines..( if the dotall is false )

so either use the dotall ( as mentioned in other answers/comments this is not supported in javascript, but i will leave it here for reference )

/\/\*(.*)\*\//gs

or add the whitespace chars \s in your expressions

/\/\*((\s|.)*?)\*\//g

Alan mentioned in his comment a bad performance from the answer i gave so use the following instead.. ( which translates to everything whitespace and everything non whitespace, so everything.. )

/\/\*([\s\S]*?)\*\//g

java regex why these two regular expressions are different

When you include regexes in a post, it's a good idea to post them as you're actually using them--in this case, as Java string literals.

"[.\\s]" is a Java string literal representing the regex [.\s]; it matches a literal dot or a whitespace character. Your regex is not trying to match a backslash or an 's', as others have said, but the crucial factor is that . loses its special meaning inside a character class.

"(.|\\s)" is a Java string literal representing the regex (.|\s); it matches (anything but a line separator character OR any whitespace character). It works as you intended, but don't use it! It leaves you extremely vulnerable to catastrophic backtracking, as explained in this answer.

But no worries, all you really need to do is use DOTALL mode (also known as single-line mode), which enables . to match anything including line separator characters.

(?s)<dl\b[^>]*>.*?</dl>

Cases when its better to use parantheses over square brackets in regular expressions?

This is common error among regex beginners, and it's a serious one. Square brackets are used to create character classes, while parentheses create groups. Not only do these constructs serve different purposes, there is no overlap in their functions. In particular, square brackets are not used for grouping. Here are some examples to illustrate:

(abc) matches the sequence "abc"
[abc] matches one of the characters 'a', 'b', or 'c'.

(abc)+ matches abc one or more times ("abc", "abcabc", etc.)

[abc]+ matches one or more characters from the set {'a', 'b', 'c'} ("a", "cc", "baccbcaab", etc.)

(x+) matches at least one 'x' ("x", "xx", "xxxxxxxx", etc.)

[x+] matches 'x' or '+' (the letter 'x' or a literal plus sign - most regex metacharacters lose their special meanings inside character classes)

(a-z) matches the sequence "a-z" ('a', hyphen, 'z')

[a-z] matches any one character in the range a through z inclusive

(\d) matches a digit - \d is a shorthand for [0-9] (ASCII semantics) or \p{Nd} (Unicode semantics; "decimal digit")

[\d] matches a digit - unlike metacharacters, character-class shorthands retain their meanings inside "longhand" (or enumerated) character classes

(\d\d) matches two digits

[\d\d] matches one digit

A character class is an atom: it consumes exactly one character, the same as a literal character like x or % or does. But it allows you to define a set of characters, and it consumes the next character if it's a member of that set. (Specifying the same character more than once has no effect: [abracadabra] consumes one character from the set {'a', 'b', 'c', 'd', 'r'}.)

A group encloses one or more atoms, allowing them to be handled like a single atom:

  • abc? consumes an 'a', followed by a 'b', and the next character if it happens to be 'c'.
  • (abc)? consumes "abc" or nothing.

And while there are many kinds of groups, serving different purposes, none of them is equivalent to a character class. You can use alternation inside a group to achieve similar results--for example, (a|b|c) will match the same thing as [abc]--but it's inherently less efficient, as well as less readable. In fact, it can easily lead to catastrophe, as this answer explains. If you have a choice between a character class and an alternation, you should always go with the character class. If you need to capture the character, wrap the class in parens: ([abc]).

javascript multiline regexp replace

I can not understand why this code does not replace foo on bar

Because the dot . explicitly does not match newline characters.

This would work:

"foo\r\nbar".replace(/foo[\s\S]+/m, "bar")

because newline characters count as whitespace (\s).

Note that the parentheses around foo are superfluous, grouping has no benefits here.



Related Topics



Leave a reply



Submit