What does std::match_results::size return?
EDIT: Some people have downvoted this answer. That may be for a variety of reasons, but if it is because it does not apply to the answer I criticized (no one left a comment to explain the decision), they should take note that W. Stribizew changed the code two months after I wrote this, and I was unaware of it until today, 2021-01-18. The rest of the answer is unchanged from when I first wrote it.
@stribizhev's solution has quadratic worst case complexity for sane regular expressions. For insane ones (e.g. "y*"), it doesn't terminate. In some applications, these issues could be DoS attacks waiting to happen. Here's a fixed version:
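    string str("abcdefabcghiabc");
    int i = 0;
    regex rgx1("abc");
    smatch smtch;
    auto beg = str.cbegin();
    while (regex_search(beg, str.cend(), smtch, rgx1)) {
        std::cout << i << ": " << smtch[0] << std::endl;
        i += 1;
        // Resume at the end of this match; the match need not start at beg,
        // so advancing beg by the match length alone would revisit matches.
        beg = smtch[0].second;
        if ( smtch.length(0) == 0 ) {
            // Empty match: step one character forward so the loop makes progress.
            if ( beg == str.cend() )
                break;
            ++beg;
        }
    }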
According to my personal preference, this will find n+1 matches of an empty regex in a string of length n. You might also just exit the loop after an empty match.
If you want to compare the performance for a string with millions of matches, add the following lines after the definition of str (and don't forget to turn on optimizations), once for each version:
    for (int j = 0; j < 20; ++j)
        str = str + str;
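One alternative that avoids the manual iterator bookkeeping is std::sregex_iterator, which steps from one match to the next on its own. This is a different technique from the loop above, not this answer's own code; count_matches is a hypothetical helper name used only for illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the non-overlapping matches of `pattern` in `text`.
std::size_t count_matches(const std::string& text, const std::regex& pattern) {
    // A default-constructed std::sregex_iterator is the end-of-sequence marker.
    auto first = std::sregex_iterator(text.begin(), text.end(), pattern);
    auto last = std::sregex_iterator();
    return static_cast<std::size_t>(std::distance(first, last));
}
```

For the sample string above, count_matches("abcdefabcghiabc", std::regex("abc")) should give 3.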
What is returned in std::smatch and how are you supposed to use it?
std::smatch is an instantiation of the match_results class template for matches on string objects (with string::const_iterator as its iterator type). The members of this class are those described for match_results, but using string::const_iterator as its BidirectionalIterator template parameter.
std::match_results supports an operator[]:
- If n > 0 and n < size(), returns a reference to the std::sub_match representing the part of the target sequence that was matched by the nth captured marked subexpression.
- If n == 0, returns a reference to the std::sub_match representing the part of the target sequence matched by the entire matched regular expression.
- If n >= size(), returns a reference to a std::sub_match representing an unmatched sub-expression (an empty subrange of the target sequence).
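The three cases can be sketched as follows. describe_submatch is a hypothetical helper, and the pattern in the usage note is only an illustration, not the regex from the question.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: reports what m[n] holds after searching `text` with `re`.
std::string describe_submatch(const std::string& text, const std::regex& re, std::size_t n) {
    std::smatch m;
    if (!std::regex_search(text, m, re)) {
        return "no match";  // operator[] should not be used after a failed search
    }
    if (!m[n].matched) {
        return "unmatched";  // group did not participate, or n >= m.size()
    }
    return m[n].str();
}
```

With the pattern (a)(b)? and the input "ac", n = 0 and n = 1 both give "a", while n = 2 (an optional group that did not participate) and n = 7 (past size()) both give "unmatched".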
In your case, regex_search finds the first match only, and then match[0] holds the entire match text, match[1] would contain the text captured with the first capturing group (the first parenthesized pattern part), etc. In this case though, your regex does not contain capturing groups.
You need to use a capturing mechanism here since std::regex does not support lookbehind. You used a lookahead, which checks the text that immediately follows the current location, so the regex you have is not doing what you think it is.
So, use the following code:
    #include <regex>
    #include <string>
    #include <iostream>
    using namespace std;

    int main() {
        std::regex expression(R"(am\s+(\d+))");
        std::smatch match;
        std::string what("I am 5 years old.");
        if (regex_search(what, match, expression))
        {
            cout << match.str(1) << endl;
        }
        return 0;
    }
Here, the pattern is am\s+(\d+). It matches am, one or more whitespace characters, and then captures one or more digits with (\d+). Inside the code, match.str(1) gives access to the text captured by a capturing group. As there is only one (...) in the pattern, that is, one capturing group, its ID is 1. So, str(1) returns the text captured into this group.
The raw string literal (R"(...)") allows using a single backslash for regex escapes (like \d, \s, etc.).
C++11: Safe practice with regex of two possible number of matches
m.size() will always be the number of marked subexpressions in your expression plus 1 (for the whole expression).
In your code you have 4 marked subexpressions; whether these are matched or not has no effect on the size of m.
If you want to know if there are milliseconds, you can check:
    m[4].matched
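A minimal sketch of that check, assuming a timestamp pattern with four marked subexpressions where the fourth (milliseconds) is optional. The pattern, kTimeRe, has_milliseconds, and group_count are all hypothetical stand-ins, since the regex from the question is not shown here.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical pattern: hours, minutes, seconds, and optional milliseconds.
// It has 4 marked subexpressions, so a successful match has size() == 5.
static const std::regex kTimeRe(R"((\d+):(\d+):(\d+)(?:\.(\d+))?)");

// True only if the whole string matches and the milliseconds group participated.
bool has_milliseconds(const std::string& s) {
    std::smatch m;
    return std::regex_match(s, m, kTimeRe) && m[4].matched;
}

// size() after a successful match, 0 if the string does not match at all.
std::size_t group_count(const std::string& s) {
    std::smatch m;
    return std::regex_match(s, m, kTimeRe) ? m.size() : 0;
}
```

Both "12:34:56" and "12:34:56.789" give a group_count of 5; only the second makes has_milliseconds return true.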
`regex_match` returns both 'not found' and `match_results`
In GCC's implementation of match_results the prefix, suffix, and unmatched string are stored at the end of the sequence managed by the match_results object (which is implemented as a private std::vector base class). Those extra elements should not be visible when iterating from begin() to end(), but the end() function is returning the wrong position. It's returning an iterator to the end of the vector, after the three extra elements. It should be returning an iterator just before those, which would be equal to begin().
This is a bug, obviously. I'll fix it.
The fix is:
const_iterator
end() const noexcept
- { return _Base_type::end() - (empty() ? 0 : 3); }
+ { return _Base_type::end() - (_Base_type::empty() ? 0 : 3); }
Does zero match always matches when regex_search returns true?
You're misunderstanding the post-conditions information because the C++11 standard (N3337) contains redundant wording in that section.
If regex_search returns false, meaning a match was not found anywhere within the input string, then the state of the match_results object is unspecified, except for the member functions match_results::ready, which returns true, match_results::size, which returns 0, and match_results::empty, which returns true.
The result of match_results::operator[] is unspecified in that case, and you should not be calling it.
On the other hand, if regex_search returns true, that means a match was found, in which case m[0].matched will always be true. There is no case where it can be false in this situation.
This is clarified in the latest draft N3936, which simply states in Table 143:
m[0].matched | true
The issue report that brought about this wording change can be viewed here. Quoting from it:
There's an analogous problem in Table 143: the condition for m[0].matched is "true if a match was found, false otherwise." But Table 143 gives post-conditions for a successful match, so the condition should be simply "true".
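The post-conditions described in this answer can be observed with a small probe. SearchState and probe are hypothetical names for illustration; the sketch only reads m[0] after a successful search, since operator[] is unspecified otherwise.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical record of the observable state after one regex_search call.
struct SearchState {
    bool found;          // return value of regex_search
    bool ready;          // m.ready()
    bool empty;          // m.empty()
    std::size_t size;    // m.size()
    bool whole_matched;  // m[0].matched, queried only when found is true
};

SearchState probe(const std::string& text, const std::regex& re) {
    std::smatch m;
    const bool found = std::regex_search(text, m, re);
    // Short-circuit evaluation keeps m[0] from being touched after a failed search.
    return {found, m.ready(), m.empty(), m.size(), found && m[0].matched};
}
```

For a failed search this should report ready == true, empty == true, and size == 0; for a successful one, whole_matched is true.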
How to match multiple results using std::regex
This can be done with regex in C++11.
Two methods:
- You can use () in the regex to define your captures, like this:
    string var = "first second third forth";
    const regex r("(.*) (.*) (.*) (.*)");
    smatch sm;
    if (regex_search(var, sm, r)) {
        // sm[0] is the whole match, so the captures start at index 1.
        for (size_t i = 1; i < sm.size(); i++) {
            cout << sm[i] << endl;
        }
    }
See it live: http://coliru.stacked-crooked.com/a/e1447c4cff9ea3e7
- You can use sregex_token_iterator():
    string var = "first second third forth";
    regex wsaq_re("\\s+");
    copy( sregex_token_iterator(var.begin(), var.end(), wsaq_re, -1),
          sregex_token_iterator(),
          ostream_iterator<string>(cout, "\n"));
See it live: http://coliru.stacked-crooked.com/a/677aa6f0bb0612f0
Need help constructing Regular expression pattern
There is no problem with the code itself. You mistake m.size() for the number of matches, when in fact it is the number of groups in a single match: the whole match plus one entry per capturing group.
The std::match_results::size reference is not helpful with understanding that:
Returns the number of matches and sub-matches in the match_results object.
There are 2 groups (since you defined a capturing group around the 2 alternatives) and 1 match all in all.
See this IDEONE demo
    #include <regex>
    #include <string>
    #include <iostream>
    #include <time.h>
    using namespace std;

    int main()
    {
        string data("ABOUTLinkedIn\r\n\r\nwall of textdl.boxcloud.com/this/file/bitbyte.zip sent you a message.\r\n\r\nDate: 12/04/2012\r\n\r\nSubject: RE: Reference Ask\r\n\r\nOn 12/03/12 2:02 PM, wall of text wrote:\r\n--------------------\r\nRuba,\r\n\r\nI am looking for a n.");
        std::regex pattern("(dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
        std::smatch result;
        while (regex_search(data, result, pattern)) {
            std::cout << "Match: " << result[0] << std::endl;
            std::cout << "Captured text 1: " << result[1] << std::endl;
            std::cout << "Size: " << result.size() << std::endl;
            data = result.suffix().str();
        }
    }
It outputs:
Match: dl.boxcloud.com
Captured text 1: dl.boxcloud.com
Size: 2
See, the captured text equals the whole match.
To "fix" that, you may use a non-capturing group, or remove the grouping altogether:
std::regex pattern("(?:dl\\.boxcloud\\.com|api-content\\.dropbox\\.com)");
// or
std::regex pattern("dl\\.boxcloud\\.com|api-content\\.dropbox\\.com");
Also, consider using a raw string literal when declaring a regex (to avoid backslash hell):
std::regex pattern(R"(dl\.boxcloud\.com|api-content\.dropbox\.com)");
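To see the effect of each variant on size(), here is a minimal sketch. results_size is a hypothetical helper that reports size() after a successful search and 0 otherwise; the input text in the usage note is made up.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: size() of the match_results after a successful search,
// or 0 when the pattern is not found in `text`.
std::size_t results_size(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.size() : 0;
}
```

On a text containing dl.boxcloud.com, the capturing version of the pattern should report 2, while the non-capturing and ungrouped versions should both report 1.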
C++: Matching regex, what is in smatch?
The type of matches is a std::match_results, not a vector, but it does have an operator[].
From the reference:
If n == 0, returns a reference to the std::sub_match representing the part of the target sequence matched by the entire matched regular expression.
If n > 0 and n < size(), returns a reference to the std::sub_match representing the part of the target sequence that was matched by the nth captured marked subexpression.
where n is the argument to operator[]. So matches[0] contains the entire matched expression, and matches[1], matches[2], ... contain consecutive capture group expressions.
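As a rough illustration, the capture groups can be copied out of the match_results into an actual vector. capture_groups is a hypothetical helper, and the date pattern in the usage note is just an example.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the text of each capture group, skipping
// matches[0] (the whole match). Returns an empty vector if nothing matched.
std::vector<std::string> capture_groups(const std::string& text, const std::regex& re) {
    std::vector<std::string> groups;
    std::smatch matches;
    if (std::regex_search(text, matches, re)) {
        for (std::size_t i = 1; i < matches.size(); ++i) {
            groups.push_back(matches[i].str());
        }
    }
    return groups;
}
```

With the pattern (\d{4})-(\d{2})-(\d{2}) and the text "on 2021-01-18 ok", the result should hold "2021", "01", and "18".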