Difference between std::regex_match & std::regex_search?
regex_match only returns true when the entire input sequence has been matched, while regex_search will succeed even if only a sub-sequence matches the regex.
Quoting from N3337, §28.11.2/2 regex_match [re.alg.match]:
Effects: Determines whether there is a match between the regular expression e, and all of the character sequence [first,last). ...
Returns true if such a match exists, false otherwise.
The above description is for the regex_match overload that takes a pair of iterators to the sequence to be matched. The remaining overloads are defined in terms of this overload.
The corresponding regex_search overload is described as §28.11.3/2 regex_search [re.alg.search]:
Effects: Determines whether there is some sub-sequence within [first,last) that matches the regular expression e. ...
Returns true if such a sequence exists, false otherwise.
In your example, if you modify the regex to r{R"(.*?\s\d{2}\s.*)"}; both regex_match and regex_search will succeed (but the match result is not just the day, but the entire date string).
Live demo of a modified version of your example where the day is being captured and displayed by both regex_match and regex_search.
Difference between regex_match and regex_search?
Your regex works fine (both match, which is correct) in VS 2012rc.
In g++ 4.7.1 (-std=gnu++11), if using:
".*FILE_(.+)_EVENT\\.DAT.*", regex_match matches, but regex_search doesn't.
".*?FILE_(.+?)_EVENT\\.DAT.*", neither regex_match nor regex_search matches (O_o).
All variants should match but some don't (for reasons that have been pointed out already by betabandido). In g++ 4.6.3 (-std=gnu++0x), the behavior is identical to g++ 4.7.1.
Boost (1.50) matches everything correctly w/both pattern varieties.
Summary:
                     regex_match    regex_search
-----------------------------------------------------
g++ 4.6.3 linux      OK/-           -
g++ 4.7.1 linux      OK/-           -
vs 2010              OK             OK
vs 2012rc            OK             OK
boost 1.50 win       OK             OK
boost 1.50 linux     OK             OK
-----------------------------------------------------
Regarding your pattern, if you mean a dot character '.', then you should write so ("\\."). You can also reduce backtracking by using non-greedy modifiers (?):
".*?FILE_(.+?)_EVENT\\.DAT.*"
What is the difference between re.search and re.match?
re.match is anchored at the beginning of the string. That has nothing to do with newlines, so it is not the same as using ^ in the pattern.
As the re.match documentation says:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note: If you want to locate a match anywhere in string, use search() instead.
re.search searches the entire string, as the documentation says:
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
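So if you need to match at the beginning of the string, or to match the entire string, use match (for the entire string the pattern must also end with an end anchor such as \Z, since match only anchors the start). It is faster. Otherwise use search.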
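A minimal sketch of the anchoring behaviours described above, using only the standard re module. The pattern and sample strings are made up for illustration, and re.fullmatch assumes Python 3.4 or later:

```python
import re

pattern = re.compile(r"\d+")

# match: anchored at the start only, so trailing text is allowed
prefix = pattern.match("123abc")
# fullmatch: the whole string has to match the pattern
whole = pattern.fullmatch("123abc")
# search: the match may start anywhere in the string
anywhere = pattern.search("abc123")
# match fails when the pattern does not occur at position 0
not_at_start = pattern.match("abc123")

print(prefix.group())    # 123
print(whole)             # None
print(anywhere.group())  # 123
print(not_at_start)      # None
```

In terms of the C++ functions discussed earlier, re.fullmatch is the closest counterpart to std::regex_match and re.search to std::regex_search; re.match sits in between, since it anchors only the start.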
The documentation has a specific section for match vs. search that also covers multiline strings:
Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).
Note that match may differ from search even when using a regular expression beginning with '^': '^' matches only at the start of the string, or in MULTILINE mode also immediately following a newline. The "match" operation succeeds only if the pattern matches at the start of the string regardless of mode, or at the starting position given by the optional pos argument regardless of whether a newline precedes it.
Now, enough talk. Time to see some example code:
# example code:
import re

string_with_newlines = """something
someotherthing"""

print(re.match('some', string_with_newlines))  # matches
print(re.match('someother', string_with_newlines))  # won't match
print(re.match('^someother', string_with_newlines, re.MULTILINE))  # also won't match
print(re.search('someother', string_with_newlines))  # finds something
print(re.search('^someother', string_with_newlines, re.MULTILINE))  # also finds something

m = re.compile('thing$', re.MULTILINE)
print(m.match(string_with_newlines))  # no match
print(m.match(string_with_newlines, pos=4))  # matches
# flags belong in re.compile; the second argument of a compiled pattern's search is pos
print(m.search(string_with_newlines))  # also matches
boost regex_match vs regex_search
No, they aren't equivalent, because the $ in regex_search will match the line-end and ^ will match line-start.
So in a multi-line string the regex_search would still find sub-matches.
I guess adding the flags boost::match_not_eol and boost::match_not_bol would create the regex_match behaviour.
C++ regex difference between platforms
According to the C++ ECMAScript regex flavor reference,
The decimal escape \0 is NOT a backreference: it is a character escape that represents the nul character. It cannot be followed by a decimal digit.
So, to match a NUL char, you need to use \0 as literal text: a literal \ char and a 0 char. You can define it with a regular string literal as "\\0" or, better, with a raw string literal, R"(\0)".
The following prints "Success":
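#include <string>
#include <regex>
#include <iostream>

int main()
{
    // Build the subject string "\x06\x00" one character at a time
    std::string s;
    s += '\x06';
    s += '\x00';
    // The raw string R"(\0)" hands the two characters \ and 0 to the regex engine
    std::regex r(std::string(1, '\x06') + R"(\0)");
    std::smatch sm;
    if (std::regex_search(s, sm, r))
    {
        std::cout << "Success\n";
        return 0;
    }
    std::cout << "Failure\n";
}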
C++ standard regex difference between std=c++11 and std=gnu++11
std::regex::extended uses extended POSIX regular expressions. According to those syntax rules, a backslash can only precede a "special character", which is one of .[\()*+?{|^$. While a left bracket [ is a special character, the right bracket ] is not. So your regular expression should be "\\[1]" instead of "\\[1\\]" to be standard-compliant.
Looking at the standard library source code, there is the following in regex_scanner.tcc:
#ifdef __STRICT_ANSI__
      // POSIX says it is undefined to escape ordinary characters
      __throw_regex_error(regex_constants::error_escape,
                          "Unexpected escape character.");
#else
      _M_token = _S_token_ord_char;
      _M_value.assign(1, __c);
#endif
Which shows that it is a GNU extension to allow escaping non-special characters. I don't know where this extension is documented.
Python regex - understanding the difference between match and search
When calling the function re.match specifically, the ^ character has little meaning because this function begins the matching process at the beginning of the string. However, it does have meaning for other functions in the re module, and when calling match on a compiled regular expression object.
For example:
import re

text = """\
Mares eat oats
and does eat oats
"""
# raw string so that \w reaches the regex engine unchanged
print(re.findall(r'^(\w+)', text, re.MULTILINE))
This prints:
['Mares', 'and']
With re.findall() and re.MULTILINE enabled, it gives you the first word (with no leading whitespace) on each line of your text.
It might be useful if doing something more complex, like lexical analysis with regular expressions, and passing into the compiled regular expression a starting position in the text it should start matching at (which you can choose to be the ending position from the previous match). See the documentation for RegexObject.match method.
Simple lexer / scanner as an example:
import re

text = """\
Mares eat oats
and does eat oats
"""

pattern = r"""
(?P<firstword>^\w+)
|(?P<lastword>\w+$)
|(?P<word>\w+)
|(?P<whitespace>\s+)
|(?P<other>.)
"""

rx = re.compile(pattern, re.MULTILINE | re.VERBOSE)

def scan(text):
    # Match repeatedly, each time starting where the previous token ended
    pos = 0
    m = rx.match(text, pos)
    while m:
        toktype = m.lastgroup          # name of the alternative that matched
        tokvalue = m.group(toktype)
        pos = m.end()
        yield toktype, tokvalue
        m = rx.match(text, pos)

for tok in scan(text):
    print(tok)
which prints
('firstword', 'Mares')
('whitespace', ' ')
('word', 'eat')
('whitespace', ' ')
('lastword', 'oats')
('whitespace', '\n')
('firstword', 'and')
('whitespace', ' ')
('word', 'does')
('whitespace', ' ')
('word', 'eat')
('whitespace', ' ')
('lastword', 'oats')
('whitespace', '\n')
This distinguishes between types of word; a word at the beginning of a line, a word at the end of a line, and any other word.
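The interaction between ^ and the starting position passed to a compiled pattern's match method can also be seen in isolation. The following is a small sketch with a made-up two-line string; the variable names are illustrative only:

```python
import re

text = "ab\ncd"  # index 2 is the newline, index 3 is 'c'

plain = re.compile(r"c")
anchored = re.compile(r"^c")
anchored_ml = re.compile(r"^c", re.MULTILINE)

# match() tries the pattern only at the given position
at_pos = plain.match(text, 3)
# '^' still refers to the real start of the string, not to pos
anchored_at_pos = anchored.match(text, 3)
# in MULTILINE mode '^' also matches right after a newline
anchored_ml_at_pos = anchored_ml.match(text, 3)
# slicing creates a new string, so '^' matches at its start
anchored_on_slice = anchored.match(text[3:])

print(at_pos is not None)              # True
print(anchored_at_pos is not None)     # False
print(anchored_ml_at_pos is not None)  # True
print(anchored_on_slice is not None)   # True
```

This is why the lexer above compiles its pattern with re.MULTILINE: without it, the firstword alternative could only ever match at index 0.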
What's the difference between $/ and $¢ in regex?
The variable $/ refers to the most recent match while the variable $¢ refers to the most recent outermost match. In most basic regexes like the above, that may be one and the same. But as can be seen from the output of the .raku method, Match objects can contain other Match objects (that's what you get when you use $<foo> or $1 for captures).
Suppose instead we had the following regex with a quantified capture:
/ ab (cd { say $¢.from, " ", $¢.to } ) + /
If we ran it, we would see the following output when matching against "abcdcdcd":
0 2
0 4
0 6
But if we change from using $¢ to $/, we get a different result:
2 2
4 4
6 6
(The reason the .to seems to be a bit off is that it, like .pos, is not updated until the end of the capture block.)
In other words, $¢ will always refer to what will be your final match object (i.e., $final = $text ~~ $regex) so you can traverse a complex capture tree inside of the regex exactly as you would after having finished the full match. So in the above example, you could just do $¢[0] to refer to the first match, $¢[1] the second, etc.
Inside of a regex code block, $/ will refer to the most immediate match. In the above case, that's the match for the inside of the ( ), and it won't know about the other matches, nor the original start of the matching: just the start for the ( ) block. So given a more complex regex:
/ a $<foo>=(b $<bar>=(c)+ )+ d /
We can access at any point using $¢ all of the foo tokens by saying $¢<foo>. We can access the bar tokens of a given foo by using $¢<foo>[0]<bar>. If we insert a code block inside of foo's capture, it will be able to access bar tokens by using $<bar> or $/<bar>, but it won't be able to access other foos.
Regular Expressions - What is the difference between .* and (.*)?
.* - Matches any character zero or more times.
(.*) - Matched characters are stored into a group for later back-referencing (any character within () would be captured).
AB.DE - Matches the string ABanycharDE. Dot represents any character except the newline character.
AB(.)DE - AB and DE are matched and the in-between character is captured.
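The difference is easy to observe with any engine that exposes capture groups. The sketch below uses Python's re module purely as an illustration; the sample strings are made up:

```python
import re

line = "AB-DE"

no_group = re.search(r"AB.DE", line)      # dot matches, nothing is stored
with_group = re.search(r"AB(.)DE", line)  # same match, but the dot is captured

print(no_group.group(0))    # AB-DE
print(no_group.groups())    # ()
print(with_group.group(0))  # AB-DE
print(with_group.group(1))  # -

# A captured group can be referenced later in the same pattern
repeated = re.search(r"(.)\1", "hello")
print(repeated.group(0))    # ll

# By default the dot does not match a newline
across_newline = re.search(r"AB.DE", "AB\nDE")
print(across_newline)       # None
```

Both patterns match the same text; the parentheses only change what is remembered about the match.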