What Does std::match_results::size Return

What does std::match_results::size return?

EDIT: Some people have downvoted this answer. That may be for a variety of reasons, but if it is because it does not apply to the answer I criticized (no one left a comment to explain the decision), they should take note that W. Stribizew changed the code two months after I wrote this, and I was unaware of it until today, 2021-01-18. The rest of the answer is unchanged from when I first wrote it.

@stribizhev's solution has quadratic worst case complexity for sane regular expressions. For insane ones (e.g. "y*"), it doesn't terminate. In some applications, these issues could be DoS attacks waiting to happen. Here's a fixed version:

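string str("abcdefabcghiabc");
int i = 0;
regex rgx1("abc");
smatch smtch;
auto beg = str.cbegin();
while (regex_search(beg, str.cend(), smtch, rgx1)) {
    std::cout << i << ": " << smtch[0] << std::endl;
    i += 1;
    if ( smtch.length(0) > 0 )
        beg = smtch[0].second;  // resume just past the match; advancing by length(0) alone would skip position(0)
    else if ( beg != str.cend() )
        ++beg;                  // empty match: step one character so the loop makes progress
    else
        break;                  // empty match at the end of the string
}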

According to my personal preference, this will find n+1 matches of an empty regex in a string of length n. You might also just exit the loop after an empty match.

If you want to compare the performance for a string with millions of matches, add the following lines after the definition of str (and don't forget to turn on optimizations), once for each version:

for (int j = 0; j < 20; ++j)
    str = str + str;

What is returned in std::smatch and how are you supposed to use it?

The std::smatch is an instantiation of the match_results class template for matches on string objects (with string::const_iterator as its iterator type). The members of this class are those described for match_results, but using string::const_iterator as its BidirectionalIterator template parameter.

std::match_results supports an operator[]:

If n > 0 and n < size(), returns a reference to the std::sub_match representing the part of the target sequence that was matched by the nth captured marked subexpression.

If n == 0, returns a reference to the std::sub_match representing the part of the target sequence matched by the entire matched regular expression.

If n >= size(), returns a reference to a std::sub_match representing an unmatched sub-expression (an empty subrange of the target sequence).

In your case, regex_search finds the first match only, and then match[0] holds the entire match text, match[1] would contain the text captured with the first capturing group (the first parenthesized pattern part), etc. In this case though, your regex does not contain capturing groups.

You need to use a capturing mechanism here since std::regex does not support lookbehind. You used a lookahead, which checks the text that immediately follows the current location, so the regex you have is not doing what you think it is.

So, use the following code:

#include <regex>
#include <string>
#include <iostream>
using namespace std;

int main() {
    std::regex expression(R"(am\s+(\d+))");
    std::smatch match;
    std::string what("I am 5 years old.");
    if (regex_search(what, match, expression))
    {
        cout << match.str(1) << endl;
    }
    return 0;
}

Here, the pattern is am\s+(\d+). It matches am, one or more whitespace characters, and then captures one or more digits with (\d+). In the code, match.str(1) gives access to the value captured by a capturing group. As the pattern contains only one (...) pair, there is one capturing group and its ID is 1, so str(1) returns the text captured by that group.

The raw string literal (R"(...)") allows using a single backslash for regex escapes (like \d, \s, etc.).

C++11: Safe practice with regex of two possible number of matches

m.size() will always be the number of marked subexpressions in your expression plus 1 (for the whole expression).

In your code you have 4 marked subexpressions, whether these are matched or not has no effect on the size of m.

If you want to know if there are milliseconds, you can check:

m[4].matched

`regex_match` returns both 'not found' and `match_results`

In GCC's implementation of match_results the prefix, suffix, and unmatched string are stored at the end of the sequence managed by the match_results object (which is implemented as a private std::vector base class). Those extra elements should not be visible when iterating from begin() to end(), but the end() function is returning the wrong position. It's returning an iterator to the end of the vector, after the three extra elements. It should be returning an iterator just before those, which would be equal to begin().

This is a bug, obviously. I'll fix it.

The fix is:

       const_iterator
end() const noexcept
- { return _Base_type::end() - (empty() ? 0 : 3); }
+ { return _Base_type::end() - (_Base_type::empty() ? 0 : 3); }


Does the zero match always match when regex_search returns true?

You're misunderstanding the post-conditions information because the C++11 standard (N3337) contains redundant wording in that section.

If regex_search returns false, meaning a match was not found anywhere within the input string, then the state of the match_results object is unspecified, except for the member functions match_results::ready, which returns true, match_results::size, which returns 0, and match_results::empty, which returns true.

The result of match_results::operator[] is unspecified in that case, and you should not be calling it.

On the other hand, if regex_search returns true, that means a match was found, in which case m[0].matched will always be true. There is no case where it can be false in this situation.

This is clarified in the latest draft N3936, which simply states in Table 143:

m[0].matched | true

The issue report that brought about this wording change can be viewed here. Quoting from it:

There's an analogous probem in Table 143: the condition for m[0].matched is "true if a match was found, false otherwise." But Table 143 gives post-conditions for a successful match, so the condition should be simply "true".

How to match multiple results using std::regex

This can be done with std::regex in C++11.

Two methods:

  1. You can use () in regex to define your captures.

Like this:

string var = "first second third forth";

const regex r("(.*) (.*) (.*) (.*)");
smatch sm;

if (regex_search(var, sm, r)) {
    // start at 1: sm[0] is the whole match, the captures follow
    for (std::size_t i = 1; i < sm.size(); i++) {
        cout << sm[i] << endl;
    }
}

See it live: http://coliru.stacked-crooked.com/a/e1447c4cff9ea3e7


  2. You can use sregex_token_iterator():

string var = "first second third forth";

regex wsaq_re("\\s+");
copy( sregex_token_iterator(var.begin(), var.end(), wsaq_re, -1),
      sregex_token_iterator(),
      ostream_iterator<string>(cout, "\n"));

See it live: http://coliru.stacked-crooked.com/a/677aa6f0bb0612f0

Need help constructing Regular expression pattern

There is no problem with the code itself. You mistake m.size() for the number of matches, when in fact it is the number of sub-matches in a single match: the whole match (group 0) plus one entry per capturing group in your regex.

The std::match_results::size reference is not helpful with understanding that:

Returns the number of matches and sub-matches in the match_results object.

Here size() is 2, counting the whole match as group 0 plus the one capturing group you defined around the 2 alternatives, and there is 1 match all in all.

See this IDEONE demo

#include <regex>
#include <string>
#include <iostream>
using namespace std;

int main()
{
    string data("ABOUTLinkedIn\r\n\r\nwall of textdl.boxcloud.com/this/file/bitbyte.zip sent you a message.\r\n\r\nDate: 12/04/2012\r\n\r\nSubject: RE: Reference Ask\r\n\r\nOn 12/03/12 2:02 PM, wall of text wrote:\r\n--------------------\r\nRuba,\r\n\r\nI am looking for a n.");
    std::regex pattern("(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
    std::smatch result;

    while (regex_search(data, result, pattern)) {
        std::cout << "Match: " << result[0] << std::endl;
        std::cout << "Captured text 1: " << result[1] << std::endl;
        std::cout << "Size: " << result.size() << std::endl;
        data = result.suffix().str();
    }
}

It outputs:

Match: dl.boxcloud.com
Captured text 1: dl.boxcloud.com
Size: 2

See, the captured text equals the whole match.

To "fix" that, you may use a non-capturing group, or remove the grouping altogether:

std::regex pattern("(?:dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
// or
std::regex pattern("dl\\.boxcloud\\.com|api-content\\.dropbox\\.com");

Also, consider using a raw string literal when declaring a regex (to avoid backslash hell):

std::regex pattern(R"(dl\.boxcloud\.com|api-content\.dropbox\.com)");

C++: Matching regex, what is in smatch?

The type of matches is a std::match_results, not a vector, but it does have an operator[].

From the reference:

If n == 0, returns a reference to the std::sub_match representing the part of the target sequence matched by the entire matched regular expression.

If n > 0 and n < size(), returns a reference to the std::sub_match representing the part of the target sequence that was matched by the nth captured marked subexpression.

where n is the argument to operator[]. So matches[0] contains the entire matched expression, and matches[1], matches[2], ... contain consecutive capture group expressions.


