Regex C++: Extract Substring

C Regular Expressions: Extracting the Actual Matches

There are quite a lot of regular expression packages, but yours seems to match the one in POSIX: regcomp() etc.

The two structures it defines in <regex.h> are:

  • regex_t containing at least size_t re_nsub, the number of parenthesized subexpressions.

  • regmatch_t containing at least regoff_t rm_so, the byte offset from start of string to start of substring, and regoff_t rm_eo, the byte offset from start of string of the first character after the end of substring.

Note that 'offsets' are not pointers but indexes into the character array.

The execution function is:

  • int regexec(const regex_t *restrict preg, const char *restrict string,
    size_t nmatch, regmatch_t pmatch[restrict], int eflags);

Your printing code should be:

for (int i = 0; i <= r.re_nsub; i++)
{
    int start = m[i].rm_so;
    int finish = m[i].rm_eo;
    // strcpy(matches[ind], ("%.*s\n", (finish - start), p + start)); // Based on question
    sprintf(matches[ind], "%.*s\n", (finish - start), p + start); // More plausible code
    printf("Storing: %.*s\n", (finish - start), matches[ind]); // Print once
    ind++;
    printf("%.*s\n", (finish - start), p + start); // Why print twice?
}

Note that the code should be upgraded to ensure that the string copy (via sprintf()) cannot overflow the target string, for example by using snprintf() instead of sprintf(). It is also a good idea to mark the start and end of a string in the printing. For example:

    printf("<<%.*s>>\n", (finish - start), p + start);

This makes it a whole heap easier to see spaces etc.

[In future, please attempt to provide an MCVE (Minimal, Complete, Verifiable Example) or SSCCE (Short, Self-Contained, Correct Example) so that people can help more easily.]

This is an SSCCE that I created, probably in response to another SO question in 2010. It is one of a number of programs I keep that I call 'vignettes'; little programs that show the essence of some feature (such as POSIX regexes, in this case). I find them useful as memory joggers.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <regex.h>

#define tofind "^DAEMONS=\\(([^)]*)\\)[ \t]*$"

int main(int argc, char **argv)
{
    FILE *fp;
    char line[1024];
    int retval = 0;
    regex_t re;
    regmatch_t rm[2];
    //this file has this line "DAEMONS=(sysklogd network sshd !netfs !crond)"
    const char *filename = "/etc/rc.conf";

    if (argc > 1)
        filename = argv[1];

    if (regcomp(&re, tofind, REG_EXTENDED) != 0)
    {
        fprintf(stderr, "Failed to compile regex '%s'\n", tofind);
        return EXIT_FAILURE;
    }
    printf("Regex: %s\n", tofind);
    printf("Number of captured expressions: %zu\n", re.re_nsub);

    fp = fopen(filename, "r");
    if (fp == 0)
    {
        fprintf(stderr, "Failed to open file %s (%d: %s)\n", filename, errno, strerror(errno));
        regfree(&re);
        return EXIT_FAILURE;
    }

    while ((fgets(line, sizeof(line), fp)) != NULL)
    {
        line[strcspn(line, "\n")] = '\0';
        if ((retval = regexec(&re, line, 2, rm, 0)) == 0)
        {
            printf("<<%s>>\n", line);
            // Complete match
            printf("Line: <<%.*s>>\n", (int)(rm[0].rm_eo - rm[0].rm_so), line + rm[0].rm_so);
            // Match captured in (...) - the \( and \) match literal parenthesis
            printf("Text: <<%.*s>>\n", (int)(rm[1].rm_eo - rm[1].rm_so), line + rm[1].rm_so);
            char *src = line + rm[1].rm_so;
            char *end = line + rm[1].rm_eo;
            // Split the captured text into space-separated names
            while (src < end)
            {
                size_t len = strcspn(src, " ");
                if (src + len > end)
                    len = end - src;
                printf("Name: <<%.*s>>\n", (int)len, src);
                src += len;
                src += strspn(src, " ");
            }
        }
    }
    // Release the file and the compiled regex
    fclose(fp);
    regfree(&re);
    return EXIT_SUCCESS;
}

This was designed to find a particular line starting DAEMONS= in a file /etc/rc.conf (but you can specify an alternative file name on the command line). You can adapt it to your purposes easily enough.

How can I extract submatches using C regex

The reason you do not get offsets for the first capturing group is that you pass 1 as the third argument (size_t nmatch) to regexec.

The value 1 should be changed to 2, as there are two groups whenever the regex \(.\)ave matches: Group 0 holds the whole match and Group 1 holds the value of the first capturing group.

So, you need to use

ret = regexec(&preg, str, 2, pmatch, 0);
//                        ^

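Also, to print the Group 1 value you can use the following. Note that the precision passed to %.*s is the length of the group (rm_eo - rm_so), not its end offset:

if (pmatch[1].rm_so != -1) {
    printf("match[1]=%.*s<\n", (int)(pmatch[1].rm_eo - pmatch[1].rm_so), &str[pmatch[1].rm_so]);
}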

See this C demo:

#include <stdio.h>
#include <regex.h>
#include <string.h>

int main() {

    regex_t preg;
    char str[] = "dave";
    char regex[] = "\\(.\\)ave";

    // flag REG_EXTENDED with unescaped parens in the r.e. doesn't fix anything
    int ret, cflags = REG_ICASE;

    // the elements of unused pmatches used to be set to -1 by regexec, but no longer. a clue perhaps.

    regmatch_t pmatch[2] = {{-1,-1},{-1,-1}};

    ret = regcomp(&preg, regex, cflags);
    if (ret) {
        puts("regcomp fail");
        return ret;
    }
    else
        // preg.re_nsub contains the correct number of groups that regcomp recognized in the r.e. Tests succeeded for 0, 1, 2, and 3 groups.
        printf("regcomp ok; re_nsub=%zu\n", preg.re_nsub);

    ret = regexec(&preg, str, 2, pmatch, 0); // 1 changed to 2 as there is Group 0 (whole match) and Group 1 (for the first capturing group)

    if (ret)
        puts("no match");
    else {
        // regoff_t is not necessarily int, so cast before printing with %d
        printf("match offsets are %d %d\n", (int)pmatch[0].rm_so, (int)pmatch[0].rm_eo);
        // %.*s takes the length of the match (rm_eo - rm_so), not the end offset
        printf("match[0]=%.*s<\n", (int)(pmatch[0].rm_eo - pmatch[0].rm_so), &str[pmatch[0].rm_so]);

        printf("submatch offsets are %d %d\n", (int)pmatch[1].rm_so, (int)pmatch[1].rm_eo);
        if (pmatch[1].rm_so != -1) {
            printf("match[1]=%.*s<\n", (int)(pmatch[1].rm_eo - pmatch[1].rm_so), &str[pmatch[1].rm_so]);
        }
    }
    regfree(&preg);
    return 0;
}
/*
regcomp ok; re_nsub=1
match offsets are 0 4
match[0]=dave<
submatch offsets are 0 1
match[1]=d<
*/

Regex C++: extract substring

Since C++11, regular expressions are built into the standard library (header <regex>). This program shows how to use them to extract the string you are after:

#include <regex>
#include <string>
#include <iostream>

int main()
{
    const std::string s = "/home/toto/FILE_mysymbol_EVENT.DAT";
    std::regex rgx(".*FILE_(\\w+)_EVENT\\.DAT.*");
    std::smatch match;

    // s is const, so begin()/end() return the const_iterator that std::smatch expects
    if (std::regex_search(s.begin(), s.end(), match, rgx))
        std::cout << "match: " << match[1] << '\n';
}

It will output:


match: mysymbol

It should be noted, though, that at the time of writing this did not work in GCC, whose library support for regular expressions was incomplete (a working <regex> arrived in libstdc++ with GCC 4.9). It worked well in VS2010 (and probably VS2012), and should work in clang.


By now (late 2016) all modern C++ compilers and their standard libraries are fully up to date with the C++11 standard, and most if not all of C++14 as well. GCC 6 and the upcoming Clang 4 support most of the coming C++17 standard as well.

Regular Expression extract substring (.)*

Remove the extra capturing groups from your regular expression.

<sip:\+\d{11}(.*)

How to extract just part of this string in C?

Do you really need regex for this? Could you not just split this string into substrings and work with that?

  1. You can remove the extension by finding the dot with strchr
  2. Substring the file name
  3. Use regex to get the rest with ([0-9]{4}.*$)

Regex C++: extract substring between tags

You shouldn't be using regexes to try to match html, but, for this special case, you could do:

#include <string>
#include <regex>

// Your string
std::string str = "<column r=\"1\"><t b=\"red\"><v>1</v></t></column>";

// Your regex, in this specific scenario
// Will NOT work for nested <column> tags!
std::regex rgx("<column.*?>(.*?)</column>");
std::smatch match;

// Try to match it
// (pass the string itself: str.begin() on a non-const string yields
// std::string::iterator, which does not fit std::smatch)
if (std::regex_search(str, match, rgx)) {
    // You can use `match' here to get your substring
}

As Anton said above: don't.

c++ regex extract all substrings using regex_search()

std::regex_search returns after only the first match found.

What std::smatch gives you is all the matched groups in the regular expression. Your regular expression only contains one group so std::smatch only has one item in it.

If you want to find all matches you need to use std::sregex_iterator.

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::string s1("{1,2,3}");
    std::regex e(R"(\d+)");

    std::cout << s1 << std::endl;

    std::sregex_iterator iter(s1.begin(), s1.end(), e);
    std::sregex_iterator end;

    while(iter != end)
    {
        std::cout << "size: " << iter->size() << std::endl;

        for(unsigned i = 0; i < iter->size(); ++i)
        {
            std::cout << "the " << i + 1 << "th match" << ": " << (*iter)[i] << std::endl;
        }
        ++iter;
    }
}

Output:

{1,2,3}
size: 1
the 1th match: 1
size: 1
the 1th match: 2
size: 1
the 1th match: 3

The end iterator is default constructed by design so that it is equal to iter when iter has run out of matches. Notice at the bottom of the loop I do ++iter. That moves iter on to the next match. When there are no more matches, iter has the same value as the default constructed end.

Another example to show the submatching (capture groups):

int main()
{
    std::string s1("{1,2,3}{4,5,6}{7,8,9}");
    std::regex e(R"~((\d+),(\d+),(\d+))~");

    std::cout << s1 << std::endl;

    std::sregex_iterator iter(s1.begin(), s1.end(), e);
    std::sregex_iterator end;

    while(iter != end)
    {
        std::cout << "size: " << iter->size() << std::endl;

        std::cout << "expression match #" << 0 << ": " << (*iter)[0] << std::endl;
        for(unsigned i = 1; i < iter->size(); ++i)
        {
            std::cout << "capture submatch #" << i << ": " << (*iter)[i] << std::endl;
        }
        ++iter;
    }
}

Output:

{1,2,3}{4,5,6}{7,8,9}
size: 4
expression match #0: 1,2,3
capture submatch #1: 1
capture submatch #2: 2
capture submatch #3: 3
size: 4
expression match #0: 4,5,6
capture submatch #1: 4
capture submatch #2: 5
capture submatch #3: 6
size: 4
expression match #0: 7,8,9
capture submatch #1: 7
capture submatch #2: 8
capture submatch #3: 9

How to extract a substring using regex

Assuming you want the part between single quotes, use this regular expression with a Matcher:

"'(.*?)'"

Example:

String mydata = "some string with 'the data i want' inside";
Pattern pattern = Pattern.compile("'(.*?)'");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find())
{
    System.out.println(matcher.group(1));
}

Result:


the data i want

