Difference Between Regex_Match and Regex_Search

Difference between std::regex_match & std::regex_search?

regex_match only returns true when the entire input sequence has been matched, while regex_search will succeed even if only a sub-sequence matches the regex.

Quoting from N3337,

§28.11.2/2 regex_match [re.alg.match]

Effects: Determines whether there is a match between the regular expression e, and all of the character sequence [first,last). ... Returns true if such a match exists, false otherwise.

The above description is for the regex_match overload that takes a pair of iterators to the sequence to be matched. The remaining overloads are defined in terms of this overload.

The corresponding regex_search overload is described as

§28.11.3/2 regex_search [re.alg.search]

Effects: Determines whether there is some sub-sequence within [first,last) that matches the regular expression e. ... Returns true if such a sequence exists, false otherwise.

In your example, if you modify the regex to r{R"(.*?\s\d{2}\s.*)"}; both regex_match and regex_search will succeed (but the match result is not just the day, but the entire date string).

Live demo of a modified version of your example where the day is being captured and displayed by both regex_match and regex_search.
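
The whole-sequence versus sub-sequence distinction can also be sketched in Python, whose re.fullmatch and re.search behave analogously to std::regex_match and std::regex_search. This is only an illustration, not the original demo: the input string below is a hypothetical date, since the question's actual string is not shown here.

```python
import re

# Hypothetical input; the original question's string is not reproduced here.
date = "Fri Aug 17 2012"

day_only = re.compile(r"\s(\d{2})\s")          # matches just " 17 "
whole_line = re.compile(r".*?\s(\d{2})\s.*")   # padded to cover the whole input

# fullmatch must consume the entire string (analogous to regex_match).
full_day_only = day_only.fullmatch(date)       # None: the pattern covers only part
# search accepts any sub-sequence (analogous to regex_search).
search_day_only = day_only.search(date)        # succeeds, group(1) is the day

# With the padded pattern both succeed, but group(0) is the whole date string.
full_whole = whole_line.fullmatch(date)
search_whole = whole_line.search(date)

print(full_day_only)
print(search_day_only.group(1))
print(full_whole.group(0), "|", full_whole.group(1))
print(search_whole.group(0), "|", search_whole.group(1))
```

Capturing the day in a group keeps it accessible even when the overall match spans the whole string.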

Difference between regex_match and regex_search?

Your regex works fine (both match, which is correct) in VS 2012rc.

In g++ 4.7.1 (-std=gnu++11), if using:

".*FILE_(.+)_EVENT\\.DAT.*", regex_match matches, but regex_search doesn't.
".*?FILE_(.+?)_EVENT\\.DAT.*", neither regex_match nor regex_search matches (O_o).

All variants should match but some don't (for reasons that have been pointed out already by betabandido). In g++ 4.6.3 (-std=gnu++0x), the behavior is identical to g++ 4.7.1.

Boost (1.50) matches everything correctly with both pattern varieties.

Summary:


                        regex_match      regex_search
 -----------------------------------------------------
 g++ 4.6.3 linux            OK/-               -
 g++ 4.7.1 linux            OK/-               -
 vs 2010                     OK                OK
 vs 2012rc                   OK                OK
 boost 1.50 win              OK                OK
 boost 1.50 linux            OK                OK
 -----------------------------------------------------

Regarding your pattern: if you mean a literal dot character '.', you should escape it ("\\."). You can also reduce backtracking by using non-greedy modifiers (?):

".*?FILE_(.+?)_EVENT\\.DAT.*"

What is the difference between re.search and re.match?

re.match is anchored at the beginning of the string. That has nothing to do with newlines, so it is not the same as using ^ in the pattern.

As the re.match documentation says:

If zero or more characters at the
beginning of string match the regular expression pattern, return a
corresponding MatchObject instance.
Return None if the string does not
match the pattern; note that this is
different from a zero-length match.

Note: If you want to locate a match
anywhere in string, use search()
instead.

re.search searches the entire string, as the documentation says:

Scan through string looking for a
location where the regular expression
pattern produces a match, and return a
corresponding MatchObject instance.
Return None if no position in the
string matches the pattern; note that
this is different from finding a
zero-length match at some point in the
string.

So if you need to match at the beginning of the string, or to match the entire string (with a pattern that is also anchored at the end, e.g. with $), use match. It is faster. Otherwise use search.
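
A minimal sketch of that rule, assuming Python 3.4 or later (where re.fullmatch is available). The sample string is made up for illustration. It shows that match anchors only the start, so matching the entire string needs an end anchor or fullmatch:

```python
import re

text = "something"

prefix = re.match(r"some", text)           # succeeds: "some" is a prefix
middle = re.match(r"thing", text)          # None: "thing" does not start at index 0
found = re.search(r"thing", text)          # succeeds: search scans forward
anchored = re.match(r"some$", text)        # None: "$" requires the end of the string
whole = re.fullmatch(r"some\w+", text)     # succeeds: pattern covers every character

print(prefix.group())
print(middle)
print(found.start())
print(anchored)
print(whole.group())
```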

The documentation has a specific section for match vs. search that also covers multiline strings:

Python offers two different primitive
operations based on regular
expressions: match checks for a match
only at the beginning of the string,
while search checks for a match
anywhere in the string (this is what
Perl does by default).

Note that match may differ from search
even when using a regular expression
beginning with '^': '^' matches only
at the start of the string, or in
MULTILINE mode also immediately
following a newline. The “match”
operation succeeds only if the pattern
matches at the start of the string
regardless of mode, or at the starting
position given by the optional pos
argument regardless of whether a
newline precedes it.

Now, enough talk. Time to see some example code:

# example code:
import re

string_with_newlines = """something
someotherthing"""

print(re.match('some', string_with_newlines))  # matches
print(re.match('someother',
               string_with_newlines))  # won't match
print(re.match('^someother', string_with_newlines,
               re.MULTILINE))  # also won't match
print(re.search('someother',
                string_with_newlines))  # finds something
print(re.search('^someother', string_with_newlines,
                re.MULTILINE))  # also finds something

m = re.compile('thing$', re.MULTILINE)

print(m.match(string_with_newlines))  # no match
print(m.match(string_with_newlines, pos=4))  # matches
# The flags are already compiled into m. The second positional argument
# of search() is pos, so re.MULTILINE must not be passed here.
print(m.search(string_with_newlines))  # also matches

boost regex_match vs regex_search

No, they aren't equivalent, because the $ in regex_search will match the line-end and ^ will match line-start.
So in a multi-line string the regex_search would still find sub-matches.
I guess adding the flags boost::match_not_eol and boost::match_not_bol would create the regex_match behaviour.

C++ regex difference between platforms

According to the C++ ECMAScript regex flavor reference,

The decimal escape \0 is NOT a backreference: it is a character escape that represents the nul character. It cannot be followed by a decimal digit.

So, to match a NUL character, the regex itself must contain the two characters \ and 0. You can write that with a regular string literal as "\\0" or, better, with a raw string literal, R"(\0)".

The following prints "Success":

#include <string>
#include <regex>
#include <iostream>

int main()
{
        std::string s;
        s += '\x06';
        s += '\x00';
        std::regex r(std::string(1, '\x06') + R"(\0)");
        std::smatch sm;
        if (std::regex_search(s, sm, r))
        {
                std::cout << "Success\n";
                return 0;
        }

        std::cout << "Failure\n";
}

C++ standard regex difference between std=c++11 and std=gnu++11

std::regex::extended uses extended POSIX regular expressions. According to those syntax rules, a backslash can only precede a "special character", which is one of .[\()*+?{|^$. While a left bracket [ is a special character, the right bracket ] is not. So your regular expression should be "\\[1]" instead of "\\[1\\]" to be standard-compliant.

Looking at the standard library source code, there is the following in regex_scanner.tcc:

#ifdef __STRICT_ANSI__
      // POSIX says it is undefined to escape ordinary characters
      __throw_regex_error(regex_constants::error_escape,
                  "Unexpected escape character.");
#else
      _M_token = _S_token_ord_char;
      _M_value.assign(1, __c);
#endif

This shows that allowing escapes of non-special characters is a GNU extension. I don't know where this extension is documented.

Python regex - understanding the difference between match and search

When calling the function re.match specifically, the ^ character has little meaning, because this function already begins the matching process at the beginning of the string. However, it does have meaning for other functions in the re module, and when calling match on a compiled regular expression object.

For example:

import re

text = """\
Mares eat oats
and does eat oats
"""

print(re.findall(r'^(\w+)', text, re.MULTILINE))

This prints:

['Mares', 'and']

With re.findall() and re.MULTILINE enabled, it gives you the first word (with no leading whitespace) on each line of your text.

It can also be useful for something more complex, such as lexical analysis with regular expressions, where you pass the compiled regular expression a starting position in the text (for example, the ending position of the previous match). See the documentation for the RegexObject.match method.
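
Before the full lexer, here is a small sketch of how ^ interacts with the pos argument of a compiled pattern's match method. The positions used are specific to this sample text. With re.MULTILINE, ^ still has to sit at a real line start, so passing pos is not the same as slicing the string:

```python
import re

text = "Mares eat oats\nand does eat oats\n"
rx = re.compile(r"^\w+", re.MULTILINE)

second_line = text.index("and")               # first character after the newline
at_line_start = rx.match(text, second_line)   # succeeds: "^" matches after "\n"
mid_line = rx.match(text, 6)                  # None: index 6 ("eat") is mid-line
sliced = rx.match(text[6:])                   # succeeds: a slice starts a new string

print(at_line_start.group())
print(mid_line)
print(sliced.group())
```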

Simple lexer / scanner as an example:

import re

text = """\
Mares eat oats
and does eat oats
"""

pattern = r"""
(?P<firstword>^\w+)
|(?P<lastword>\w+$)
|(?P<word>\w+)
|(?P<whitespace>\s+)
|(?P<other>.)
"""

rx = re.compile(pattern, re.MULTILINE | re.VERBOSE)

def scan(text):
    pos = 0
    m = rx.match(text, pos)
    while m:
        # lastgroup names the alternative that matched
        toktype = m.lastgroup
        tokvalue = m.group(toktype)
        # continue matching where the previous token ended
        pos = m.end()
        yield toktype, tokvalue
        m = rx.match(text, pos)

for tok in scan(text):
    print(tok)

which prints

('firstword', 'Mares')
('whitespace', ' ')
('word', 'eat')
('whitespace', ' ')
('lastword', 'oats')
('whitespace', '\n')
('firstword', 'and')
('whitespace', ' ')
('word', 'does')
('whitespace', ' ')
('word', 'eat')
('whitespace', ' ')
('lastword', 'oats')
('whitespace', '\n')

This distinguishes between types of word: a word at the beginning of a line, a word at the end of a line, and any other word.

What's the difference between $/ and $¢ in regex?

The variable $/ refers to the most recent match while the variable $¢ refers to the most recent outermost match. In most basic regexes like the above, that may be one and the same. But as can be seen from the output of the .raku method, Match objects can contain other Match objects (that's what you get when you use $<foo> or $1 for captures).

Suppose instead we had the following regex with a quantified capture

/ ab (cd { say $¢.from, " ", $¢.to } ) + /

If we ran it against "abcdcdcd", we would see the following output:

0 2
0 4
0 6

But if we change from using $¢ to $/, we get a different result:

2 2
4 4
6 6

(The .to seems a bit off because it, like .pos, is not updated until the end of the capture block.)

In other words, $¢ will always refer to what will be your final match object (i.e., $final = $text ~~ $regex), so you can traverse a complex capture tree inside of the regex exactly as you would after having finished the full match. So in the above example, you could just do $¢[0] to refer to the first match, $¢[1] the second, etc.

Inside of a regex code block, $/ will refer to the most immediate match. In the above case, that's the match for the inside of the ( ), and it won't know about the other matches, nor the original start of the matching: just the start for the ( ) block. So given a more complex regex:

/ a $<foo>=(b $<bar>=(c)+ )+ d /

We can access at any point using $¢ all of the foo tokens by saying $¢<foo>. We can access the bar tokens of a given foo by using $¢<foo>[0]<bar>. If we insert a code block inside of foo's capture, it will be able to access bar tokens by using $<bar> or $/<bar>, but it won't be able to access other foos.

Regular Expressions - What is the difference between .* and (.*)?

.* matches any character zero or more times.
(.*) matches the same text, but the matched characters are stored in a group for later back-referencing (whatever is matched inside the () is captured).
AB.DE matches AB, then any single character, then DE. The dot represents any character except the newline character.
AB(.)DE matches the same strings: AB and DE are matched, and the in-between character is captured.
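
The difference can be seen in a short Python sketch. The sample strings are made up for illustration, and other regex flavors expose groups through their own APIs:

```python
import re

# Without parentheses there is no capture group, only the overall match.
no_group = re.search(r"AB.DE", "xxABcDExx")
# With parentheses the in-between character is stored as group 1.
with_group = re.search(r"AB(.)DE", "xxABcDExx")

# (.*) captures everything the greedy .* consumed.
captured = re.match(r"key=(.*)", "key=some value")

# A captured group can be back-referenced inside the pattern with \1.
doubled = re.search(r"(.)\1", "hello")

# By default the dot does not match a newline.
across_newline = re.search(r"AB.DE", "AB\nDE")

print(no_group.group(0), no_group.groups())
print(with_group.group(0), with_group.group(1))
print(captured.group(1))
print(doubled.group(0), doubled.group(1))
print(across_newline)
```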