C++11 Regex Matching

See the C++11 implementation status page for GCC's libstdc++: regexes are not supported as of GCC 4.8.

Edit for posterity: as mentioned in the comments, the regex library is now implemented in libstdc++ and should be available in GCC 4.9 and later.

C++11 regex matching a full word that does not end with a period?

You need to make sure the word is followed with a word boundary:

std::regex rex(R"(\w+\b(?!\.))");

Otherwise, backtracking occurs: for the input "joe.", your pattern gives up the final e and matches jo.

I also advise using raw string literals when defining a regex; this gets rid of the excessive backslashes.
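To see the effect of the word boundary, here is a minimal sketch that collects every match of the pattern above. The helper name words_not_before_period and the sample input are assumptions made for illustration; they are not part of the original question.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect every word that is not immediately followed by a period.
// (Hypothetical helper, written only to illustrate the pattern.)
std::vector<std::string> words_not_before_period(const std::string& text) {
    // \w+     one or more word characters
    // \b      a word boundary, so the match cannot stop in the middle of a word
    // (?!\.)  negative lookahead: the next character must not be '.'
    static const std::regex rex(R"(\w+\b(?!\.))");

    std::vector<std::string> words;
    for (std::sregex_iterator it(text.begin(), text.end(), rex), end; it != end; ++it) {
        words.push_back(it->str());
    }
    return words;
}
```

Called with "joe. bob alice.", this should return only bob. Without the \b, the engine would be free to backtrack inside a word and report fragments such as jo.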

Regex grouping matches with C++ 11 regex library

Your regular expression is incorrect because neither capture group does what you want. The first matches a single character from the set [a-zA-Z0-9] followed by <space>:, which works for single-character usernames but nothing else. The second capture group will always be empty because you ask for zero or more characters but also make the match non-greedy, so a zero-character match is a valid result.

Fixing both of these, your regex becomes

std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
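The difference between a lazy and a greedy final group can be checked with a small sketch. The helper message_group and the lazy pattern used in the usage note are assumptions for illustration; the question's original regex is not reproduced here.

```cpp
#include <regex>
#include <string>

// Return the text captured by group 2 of `pattern`, or "<no match>" if the
// search fails. (Hypothetical helper; `pattern` must have at least two groups.)
std::string message_group(const std::string& line, const std::string& pattern) {
    const std::regex rgx(pattern);
    std::smatch matches;
    if (!std::regex_search(line, matches, rgx)) {
        return "<no match>";
    }
    return matches[2].str();
}
```

For the line "WEBMSG #Username :this is a message", the pattern "WEBMSG #([a-zA-Z0-9]+) :(.*?)" yields an empty second group, because a lazy quantifier at the very end of a pattern is satisfied by zero characters. The greedy "WEBMSG #([a-zA-Z0-9]+) :(.*)" yields "this is a message".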

But simply instantiating a regex and a match_results object does not produce matches; you need to apply a regex algorithm. Since you only want to match part of the input string, the appropriate algorithm in this case is regex_search.

std::regex_search(s, matches, rgx);

Putting it all together

std::string s{R"(
tХB:Username!Username@Username.tcc.domain.com Connected
tХB:Username!Username@Username.tcc.domain.com WEBMSG #Username :this is a message
tХB:Username!Username@Username.tcc.domain.com Status: visible
)"};

std::regex rgx("WEBMSG #([a-zA-Z0-9]+) :(.*)");
std::smatch matches;

if (std::regex_search(s, matches, rgx)) {
    std::cout << "Match found\n";

    for (size_t i = 0; i < matches.size(); ++i) {
        std::cout << i << ": '" << matches[i].str() << "'\n";
    }
} else {
    std::cout << "Match not found\n";
}

How to match a sequence of whitespaces with c++11 regex

Just change \s* to \s+ in your regex: \s* matches zero or more whitespace characters, so it also matches the empty string. You also don't need a capturing group.
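A short sketch of the difference, assuming two hypothetical helpers (whitespace_runs and first_star_match_length) that are not part of the original question:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Return every run of one or more whitespace characters found in `text`.
std::vector<std::string> whitespace_runs(const std::string& text) {
    static const std::regex ws(R"(\s+)");
    std::vector<std::string> runs;
    for (std::sregex_iterator it(text.begin(), text.end(), ws), end; it != end; ++it) {
        runs.push_back(it->str());
    }
    return runs;
}

// Length of the first match of \s* in `text`. The search always succeeds,
// because \s* is allowed to match the empty string.
std::size_t first_star_match_length(const std::string& text) {
    static const std::regex ws_star(R"(\s*)");
    std::smatch m;
    std::regex_search(text, m, ws_star);
    return static_cast<std::size_t>(m.length(0));
}
```

With \s+, the input "a  b\t\tc" gives exactly two runs. With \s*, a search in "abc" succeeds immediately at position 0 with a match of length 0, which is rarely what you want.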

matching text ranges with C++11 regexes

Based on the rules described in [re.grammar], we have:

- During matching of a regular expression finite state machine against a sequence of characters, two characters c and d are compared using the following rules:

1. if (flags() & regex_constants::icase) the two characters are equal if traits_inst.translate_nocase(c) == traits_inst.translate_nocase(d);
2. otherwise, if flags() & regex_constants::collate the two characters are equal if traits_inst.translate(c) == traits_inst.translate(d);
3. otherwise, the two characters are equal if c == d.

This applies to your pattern2: we're matching a sequence of characters and we have flags() & icase, so we do a no-case comparison. Since each character in the sequence matches, it "works".

However, with pattern, we don't have a sequence of characters. So we instead use this rule:

- During matching of a regular expression finite state machine against a sequence of characters, comparison of a collating element range c1-c2 against a character c is conducted as follows: if flags() & regex_constants::collate is false then the character c is matched if c1 <= c && c <= c2, otherwise c is matched in accordance with the following algorithm:

string_type str1 = string_type(1,
    flags() & icase ?
        traits_inst.translate_nocase(c1) : traits_inst.translate(c1));
string_type str2 = string_type(1,
    flags() & icase ?
        traits_inst.translate_nocase(c2) : traits_inst.translate(c2));
string_type str = string_type(1,
    flags() & icase ?
        traits_inst.translate_nocase(c) : traits_inst.translate(c));
return traits_inst.transform(str1.begin(), str1.end())
        <= traits_inst.transform(str.begin(), str.end())
    && traits_inst.transform(str.begin(), str.end())
        <= traits_inst.transform(str2.begin(), str2.end());

Since you don't have collate set, the character is compared literally against the range a-z. icase is not taken into account here, which is why it "doesn't work." If you provide collate, however:

std::regex pattern("[a-z]+",
    std::regex_constants::icase | std::regex_constants::collate);

Then we use the algorithm described, which will do a no-case comparison, and the result will be "works". Both compilers are correct - though I find the expected behavior confusing in this case.
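To compare the flag combinations on your own toolchain, a minimal sketch follows. The helper matches_with_flags is a hypothetical name, and no particular result is claimed for the range pattern, since (as discussed above) standard libraries have differed here.

```cpp
#include <regex>
#include <string>

// True if `text` as a whole matches `pattern` compiled with `flags`.
// (Hypothetical helper for experimenting with icase and collate.)
bool matches_with_flags(const std::string& text,
                        const std::string& pattern,
                        std::regex_constants::syntax_option_type flags) {
    const std::regex rex(pattern, flags);
    return std::regex_match(text, rex);
}
```

A literal pattern such as "hello" with std::regex_constants::icase should match "HELLO" under the first rule quoted above. For "[a-z]+", try the call once with icase alone and once with icase | collate, and print both results rather than assuming them.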

How to handle or avoid exceptions from C++11 regex matching functions (§28.11)?

C++11 §28.6 states

The class regex_error defines the type of objects thrown as exceptions
to report errors from the regular expression library.

This means that the <regex> library should not throw anything else by itself. You are correct that constructing a regex_error, which inherits from runtime_error, may throw bad_alloc under out-of-memory conditions, so your error handling code must check for that as well. Unfortunately, this makes it impossible to determine which regex_error construction actually threw bad_alloc.

For the regular expression algorithms in §28.11, §28.11.1 states:

The algorithms described in this subclause may throw an exception of type regex_error. If such an exception e is thrown, e.code() shall return either regex_constants::error_complexity or regex_constants::error_stack.

This means that if the functions in §28.11 ever throw a regex_error, it holds one of these two codes and nothing else. However, note that things you pass to the <regex> library, such as allocators, might also throw; for example, the allocator of match_results may throw when results are added to the given match_results container. Also note that §28.11 has shorthand functions that construct a match_results "as if" internally, such as

template <class BidirectionalIterator, class charT, class traits>
bool regex_match(BidirectionalIterator first, BidirectionalIterator last,
                 const basic_regex<charT, traits>& e,
                 regex_constants::match_flag_type flags =
                     regex_constants::match_default);

template <class BidirectionalIterator, class charT, class traits>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
                  const basic_regex<charT, traits>& e,
                  regex_constants::match_flag_type flags =
                      regex_constants::match_default);

and possibly others. Since these may construct and use a match_results with the standard allocator internally, they might throw anything std::allocator throws. Therefore your simple example of regex_match(anyString, regex(".")) might also throw due to construction and use of the default allocator.

Another caveat: for some <regex> functions and classes it is currently impossible to determine whether a bad_alloc was thrown by an allocator or during construction of a regex_error exception.

In general, if you need better exception specifications, avoid <regex>. If you only require simple pattern matching, you're better off rolling your own safe match/search/replace functions, because it is impossible to constrain your regular expressions so as to avoid these exceptions in a portable or forward-compatible manner; even the empty regular expression "" might give you an exception.
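If you do stay with <regex>, the exceptions discussed above can at least be contained in one place. The following is only a sketch under that assumption; MatchOutcome and try_match are hypothetical names, and the catch list covers just the two exception types named in this answer, not everything a custom allocator could throw.

```cpp
#include <new>
#include <regex>
#include <string>

enum class MatchOutcome { Matched, NotMatched, RegexError, OutOfMemory };

// Compile `pattern` and match it against all of `text`, translating the
// exceptions discussed above into a status value instead of propagating them.
MatchOutcome try_match(const std::string& text, const std::string& pattern) {
    try {
        const std::regex rex(pattern);      // may throw regex_error for a bad pattern
        return std::regex_match(text, rex)  // may throw error_complexity / error_stack
                   ? MatchOutcome::Matched
                   : MatchOutcome::NotMatched;
    } catch (const std::regex_error&) {
        return MatchOutcome::RegexError;
    } catch (const std::bad_alloc&) {
        return MatchOutcome::OutOfMemory;
    }
}
```

Note that the OutOfMemory result cannot tell you whether the bad_alloc came from an allocator or from constructing a regex_error, which is exactly the ambiguity described above.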

PS: Note that the C++11 standard is rather poorly written in some respects and lacks complete cross-referencing. For example, the clauses describing the member functions of match_results say nothing explicit about them throwing, whereas §28.10.1.1 states:

In all match_results constructors, a copy of the Allocator argument shall be used for any memory allocation performed by the constructor or member functions during the lifetime of the object.

So take care when browsing the standards like a lawyer! ;-)

C++11: Safe practice with regex of two possible number of matches

m.size() will always be the number of marked subexpressions in your expression plus 1 (for the whole expression).

In your code you have 4 marked subexpressions; whether or not they are matched has no effect on the size of m.

If you want to know whether there are milliseconds, you can check:

m[4].matched
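As a sketch of this check: the question's original expression is not shown here, so the time pattern, the TimeParse struct, and parse_time below are assumptions chosen only to have four marked subexpressions with an optional last one.

```cpp
#include <cstddef>
#include <regex>
#include <string>

struct TimeParse {
    bool ok;                  // whether the whole string matched
    std::size_t group_count;  // m.size(): marked subexpressions + 1
    bool has_millis;          // m[4].matched
};

TimeParse parse_time(const std::string& text) {
    // Four marked subexpressions; the fourth sits inside an optional
    // non-capturing group, so it may or may not participate in the match.
    static const std::regex rex(R"((\d+):(\d+):(\d+)(?:\.(\d+))?)");
    std::smatch m;
    if (!std::regex_match(text, m, rex)) {
        return TimeParse{false, m.size(), false};
    }
    return TimeParse{true, m.size(), m[4].matched};
}
```

Both "12:34:56.789" and "12:34:56" report a group_count of 5; only has_millis differs between them.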

C++ regex match, not matching

regex_match fails when the string doesn't match the pattern EXACTLY, that is, in its entirety. Note that the brd ff:ff:ff:ff:ff:ff part of the string isn't being matched. All you need to do, then, is append .* to the pattern:

^\\d{1}:\\s+(\\w+).*?link\\/ether\\s{1}([a-z0-9:]+).*

Also, for that example, the loop isn't necessary. You can use:

if (std::regex_match(line, pieces, interface_address)) {
    std::string name = pieces[1];
    std::string address = pieces[2];
    std::cout << name << address << std::endl;
}
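The effect of the trailing .* can be checked in isolation with a small sketch. The helper whole_line_matches and the sample line in the usage note are assumptions; the question's actual input is not reproduced here.

```cpp
#include <regex>
#include <string>

// True only if the entire `line` matches `pattern` (regex_match semantics).
// (Hypothetical helper for comparing the two patterns.)
bool whole_line_matches(const std::string& line, const std::string& pattern) {
    const std::regex rex(pattern);
    return std::regex_match(line, rex);
}
```

For a line such as "2: eth0: <BROADCAST,MULTICAST,UP> mtu 1500 link/ether 00:11:22:33:44:55 brd ff:ff:ff:ff:ff:ff", the pattern without the trailing .* is rejected because the brd part is left over, while the same pattern with .* appended accepts the whole line.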

