Javascript regex hangs (using v8)
This catastrophically backtracks on long sequences of spaces that occur after the last closing </tag:main>
tag. Consider the case where the subject string ends with 100 spaces. First it matches them all with the .
on the left of the alternation. That fails because there's no closing tag, so it tries matching the last character with the \s
instead. That fails too, so it tries matching the second-to-last space as a \s
and the last space as a .
. That fails (still no closing tag) so it tries the last space as a \s
. When that fails it matches the third-to-last space as a \s
and tries all 4 ways to match the last two spaces. When that fails it tries the fourth-to-last space as a \s
and all 8 ways on the last 3 spaces. Then 16, 32 etc. The universe ends before it gets to the 100th-to-last space.
Different VMs have different reactions to regexp matches that take forever because of catastrophic backtracking. Some will simply report 'no match'. In V8 it's like writing any other infinite or near-infinite loop.
Using non-greedy *
will do what you want (you want to stop at the first </tag:main>
, not the last), but will still do catastrophic backtracking for long strings of spaces where the closing sequence is missing.
Making sure the same characters in the inner bracket can't match both sides of the alternation will reduce the problem from an exponential one to one that is linear in the length of the string. Use a character class instead of an alternation or put \n
on the right hand side of the alternation bar. \n
is disjoint with .
so if you hit a long sequence of spaces the regexp engine doesn't try all left-right-left etc. combinations before terminating.
What could be causing this weird regular expression behaviour in node.js?
I am pretty sure what's happening here is catastrophic backtracking. For example on my machine:
- For 39 chars in "
Hello there, best wishes, from Syria...
" it takes around 13-14 seconds. - For 40 chars in "
Hello there, vbest wishes, from Syria...
" it takes 27-28 seconds. - For 41 chars in "
Hello there, vebest wishes, from Syria...
" it takes 56 seconds.
You can see the time taken increases exponentially. To explain how regex engine matches string via backtracking, I will quote the example from above link. It applies regex (x+x+)+y
on string xxxxxxxxxxy
:
Let's see what happens when you apply this regex to xxxxxxxxxxy. The
first x+ will match all 10 x characters. The second x+ fails. The
first x+ then backtracks to 9 matches, and the second one picks up the
remaining x. The group has now matched once. The group repeats, but
fails at the first x+. Since one repetition was sufficient, the group
matches. y matches y and an overall match is found. The regex is
declared functional, the code is shipped to the customer, and his
computer explodes. Almost.The above regex turns ugly when the y is missing from the subject
string. When y fails, the regex engine backtracks. The group has one
iteration it can backtrack into. The second x+ matched only one x, so
it can't backtrack. But the first x+ can give up one x. The second x+
promptly matches xx. The group again has one iteration, fails the next
one, and the y fails. Backtracking again, the second x+ now has one
backtracking position, reducing itself to match x. The group tries a
second iteration. The first x+ matches but the second is stuck at the
end of the string. Backtracking again, the first x+ in the group's
first iteration reduces itself to 7 characters. The second x+ matches
xxx. Failing y, the second x+ is reduced to xx and then x. Now, the
group can match a second iteration, with one x for each x+. But this
(7,1),(1,1) combination fails too. So it goes to (6,4) and then
(6,2)(1,1) and then (6,1),(2,1) and then (6,1),(1,2) and then I think
you start to get the drift.
See this page from Jeff on regex performance, which uses the same example. So moral of the story : don't just match stuff, improve the regex. When nesting repetition operators, make absolutely sure that there is only one way to match the same match. For the example I quoted xx+y
works better. And for your regex the answer Esailija gave will work much better /^([\w, ]+)$/
Why jquery regular expression return different - different test cases each time?
The problem is that you are using g
flag on the RegExp
literal. The solution is to remove the g
flag.
According to the documentation of RegExp.test()
(emphasis mine):
As with
exec
(or in combination with it),test
called multiple times on the same global regular expression instance will advance past the previous match.
After the first call to test
, the lastIndex
property of the RegExp
object /<(.|\n)*?>/g
is updated. The index used as the offset for the second call to test
. Since there is no more match for the pattern, the second call returns false
and lastIndex
is reset to 0. You can see the lastIndex
property of the regex being changed in this JSFiddle.
Removing the g
flag will cause the engine to ignore the lastIndex
property.
I have described in detailed behaviour of RegExp.exec
(which is also the same for RegExp.text
) in this post.
Optimizing a small regex with capturing and ?! assertion
You can increase your pattern speed using this:
\$this->(m_\w++)(?!(?>[^p]++|\Bp++|p(?!rivate \$\1\b))++private \$\1\b)
What's causing this regex to match everything?
I think the best approach would be to use a xpath expression or a xml parser.
However, as you stated in your comment if you want to capture that specific portion using regex, then you can use this:
(<ProjectReference.*?Project2[\s\S]*?</ProjectReference>)
Working demo
Match information
MATCH 1
1. [209-384] `<ProjectReference Include="..\..\Project2\Project2.csproj">
<Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>
<Name>Project2</Name>
</ProjectReference>`
Besides regex101 also used SublimeText to show it's working, however Notepad++ has a poor regex engine and usually messes it up with tricks like [\s\S]*?
:
On the other hand, related to your question about "why is failing", your regex is not failing but your pattern allows that greedy
match (even using the lazy operator) because of your (.|\s)
alternation:
^(\s+)<ProjectReference(.|\s)+?(Project2)</Name>(.|\s)+?</ProjectReference>
^--- HERE
If you check the Regex101 explanation, you can see:
2nd Capturing group (.|\s)+?
Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
1st Alternative: .
. matches any character (except newline)
2nd Alternative: \s
\s match any white space character [\r\n\t\f ]
Regex to remove multi line comments
the dot catches everything except newlines..( if the dotall is false )
so either use the dotall ( as mentioned in other answers/comments this is not supported in javascript, but i will leave it here for reference )
/\/\*(.*)\*\//gs
or add the whitespace chars \s
in your expressions
/\/\*((\s|.)*?)\*\//g
Alan mentioned in his comment a bad performance from the answer i gave so use the following instead.. ( which translates to everything whitespace and everything non whitespace, so everything.. )
/\/\*([\s\S]*?)\*\//g
java regex why these two regular expressions are different
When you include regexes in a post, it's a good idea to post them as you're actually using them--in this case, as Java string literals.
"[.\\s]"
is a Java string literal representing the regex [.\s]
; it matches a literal dot or a whitespace character. Your regex is not trying to match a backslash or an 's', as others have said, but the crucial factor is that .
loses its special meaning inside a character class.
"(.|\\s)"
is a Java string literal representing the regex (.|\s)
; it matches (anything but a line separator character OR any whitespace character). It works as you intended, but don't use it! It leaves you extremely vulnerable to catastrophic backtracking, as explained in this answer.
But no worries, all you really need to do is use DOTALL mode (also known as single-line mode), which enables .
to match anything including line separator characters.
(?s)<dl\b[^>]*>.*?</dl>
Cases when its better to use parantheses over square brackets in regular expressions?
This is common error among regex beginners, and it's a serious one. Square brackets are used to create character classes, while parentheses create groups. Not only do these constructs serve different purposes, there is no overlap in their functions. In particular, square brackets are not used for grouping. Here are some examples to illustrate:
(abc)
matches the sequence "abc"
[abc]
matches one of the characters 'a'
, 'b'
, or 'c'
.
(abc)+
matches abc
one or more times ("abc"
, "abcabc"
, etc.)[abc]+
matches one or more characters from the set {'a', 'b', 'c'}
("a"
, "cc"
, "baccbcaab"
, etc.)
(x+)
matches at least one 'x'
("x"
, "xx"
, "xxxxxxxx"
, etc.)[x+]
matches 'x'
or '+'
(the letter 'x' or a literal plus sign - most regex metacharacters lose their special meanings inside character classes)
(a-z)
matches the sequence "a-z"
('a'
, hyphen, 'z'
)[a-z]
matches any one character in the range a
through z
inclusive
(\d)
matches a digit - \d
is a shorthand for [0-9]
(ASCII semantics) or \p{Nd}
(Unicode semantics; "decimal digit")[\d]
matches a digit - unlike metacharacters, character-class shorthands retain their meanings inside "longhand" (or enumerated) character classes
(\d\d)
matches two digits[\d\d]
matches one digit
A character class is an atom: it consumes exactly one character, the same as a literal character like x
or %
or ☺
does. But it allows you to define a set of characters, and it consumes the next character if it's a member of that set. (Specifying the same character more than once has no effect: [abracadabra]
consumes one character from the set {'a', 'b', 'c', 'd', 'r'}
.)
A group encloses one or more atoms, allowing them to be handled like a single atom:
abc?
consumes an'a'
, followed by a'b
', and the next character if it happens to be'c'
.(abc)?
consumes"abc"
or nothing.
And while there are many kinds of groups, serving different purposes, none of them is equivalent to a character class. You can use alternation inside a group to achieve similar results--for example, (a|b|c)
will match the same thing as [abc]
--but it's inherently less efficient, as well as less readable. In fact, it can easily lead to catastrophe, as this answer explains. If you have a choice between a character class and an alternation, you should always go with the character class. If you need to capture the character, wrap the class in parens: ([abc])
.
javascript multiline regexp replace
I can not understand why this code does not replace foo on bar
Because the dot .
explicitly does not match newline characters.
This would work:
"foo\r\nbar".replace(/foo[\s\S]+/m, "bar")
because newline characters count as whitespace (\s
).
Note that the parentheses around foo
are superfluous, grouping has no benefits here.
Related Topics
Chrome Extension - How to Get Http Response Body
Regex.Test V.S. String.Match to Know If a String Matches a Regular Expression
Export JavaScript Data to CSV File Without Server Interaction
Change Form Values After Submit Button Pressed
Best JavaScript Solution for Client-Side Form Validation and Interaction
How to Convert HTML Table to Excel with Multiple Sheet
How to Change HTML Object Element Data Attribute Value in JavaScript
JavaScript Code Works in Jsfiddle and But Not in the Browser
How to Enable Bootstrap Tooltip on Disabled Button
What Is "Context" in Jquery Selector
How to Access Iframe Parent Page Using Jquery
React Hooks - Right Way to Clear Timeouts and Intervals
Count Number of Matches of a Regex in JavaScript
Why Does a Form with One Text Input Submit on Enter While One with Two Text Inputs Does Not
Create a String of Variable Length, Filled with a Repeated Character