How to Specify Non-Capturing Groups in Sed

How do you specify non-capturing groups in sed?

The answer is that, as of writing, you can't: sed does not support them.

Non-capturing groups are written (?:a) and are PCRE syntax.

Sed supports BRE (basic regular expressions, aka POSIX BRE), and GNU sed has the -r option (also spelled -E) that switches it to ERE (extended regular expressions, aka POSIX ERE), but that is still not PCRE.

Perl will work, on Windows or Linux.

There are examples here:

https://superuser.com/questions/416419/perl-for-matching-with-regular-expressions-in-terminal

For example, this is from Cygwin on Windows:

$ echo -e 'abcd' | perl -0777 -pe 's/(a)(?:b)(c)(d)/\1/s'
a

$ echo -e 'abcd' | perl -0777 -pe 's/(a)(?:b)(c)(d)/\2/s'
c

There is also a program, albeit for Windows only, that does search and replace on the command line with PCRE support. It's called rxrepl. It's not sed, of course, but it covers the same search-and-replace job.

C:\blah\rxrepl>echo abc | rxrepl -s "(a)(b)(c)" -r "\1"
a

C:\blah\rxrepl>echo abc | rxrepl -s "(a)(b)(c)" -r "\3"
c

C:\blah\rxrepl>echo abc | rxrepl -s "(a)(b)(?:c)" -r "\3"
Invalid match group requested.

C:\blah\rxrepl>echo abc | rxrepl -s "(a)(?:b)(c)" -r "\2"
c

The author (not me) mentioned his program in an answer here: https://superuser.com/questions/339118/regex-replace-from-command-line

It has a really good syntax.

The standard thing to use would be perl, or almost any other programming language that people use.
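If you have to stay with sed, one workaround (a sketch of my own, not something the tools above require) is to write an ordinary capturing group where you would have used (?:...) and simply skip its number in the replacement. Assuming a sed that accepts -E for ERE, the group around b below still counts as group 2, so c is \3 rather than \2:

```shell
# (b) is only there for grouping; it still takes number 2, so (c) is \3.
result=$(echo 'abcd' | sed -E 's/(a)(b)(c)(d)/\3/')
echo "$result"   # prints: c
```

The cost is that every back reference after the extra group shifts by one, which is exactly what (?:...) avoids in the perl version.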

What is a non-capturing group in regular expressions?

Let me try to explain this with an example.

Consider the following text:

http://stackoverflow.com/
https://stackoverflow.com/questions/tagged/regex

Now, if I apply the regex below over it...

(https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?

... I would get the following result:

Match "http://stackoverflow.com/"
Group 1: "http"
Group 2: "stackoverflow.com"
Group 3: "/"

Match "https://stackoverflow.com/questions/tagged/regex"
Group 1: "https"
Group 2: "stackoverflow.com"
Group 3: "/questions/tagged/regex"

But I don't care about the protocol -- I just want the host and path of the URL. So, I change the regex to include the non-capturing group (?:).

(?:https?|ftp)://([^/\r\n]+)(/[^\r\n]*)?

Now, my result looks like this:

Match "http://stackoverflow.com/"
Group 1: "stackoverflow.com"
Group 2: "/"

Match "https://stackoverflow.com/questions/tagged/regex"
Group 1: "stackoverflow.com"
Group 2: "/questions/tagged/regex"

See? The first group has not been captured. The parser uses it to match the text, but ignores it later, in the final result.


EDIT:

As requested, let me try to explain groups too.

Well, groups serve many purposes. They help you extract exact information from a bigger match, they can be named, they let you rematch a previously matched group, and they can be used for substitutions. Let's try some examples, shall we?

Imagine you have some kind of XML or HTML (be aware that regex may not be the best tool for the job, but it is nice as an example). You want to parse the tags, so you could do something like this (I have added spaces to make it easier to understand):

   \<(?<TAG>.+?)\> [^<]*? \</\k<TAG>\>
or
\<(.+?)\> [^<]*? \</\1\>

The first regex has a named group (TAG), while the second one uses a common group. Both regexes do the same thing: they use the value from the first group (the name of the tag) to match the closing tag. The difference is that the first one uses the name to match the value, and the second one uses the group index (which starts at 1).

Let's try some substitutions now. Consider the following text:

Lorem ipsum dolor sit amet consectetuer feugiat fames malesuada pretium egestas.

Now, let's use this dumb regex over it:

\b(\S)(\S)(\S)(\S*)\b

This regex matches words with at least 3 characters, and uses groups to separate the first three letters. The result is this:

Match "Lorem"
Group 1: "L"
Group 2: "o"
Group 3: "r"
Group 4: "em"
Match "ipsum"
Group 1: "i"
Group 2: "p"
Group 3: "s"
Group 4: "um"
...

Match "consectetuer"
Group 1: "c"
Group 2: "o"
Group 3: "n"
Group 4: "sectetuer"
...

So, if we apply the substitution string:

$1_$3$2_$4

... over it, we are trying to use the first group, add an underscore, use the third group, then the second group, add another underscore, and then the fourth group. The resulting string would be like the one below.

L_ro_em i_sp_um d_lo_or s_ti_ a_em_t c_no_sectetuer f_ue_giat f_ma_es m_la_esuada p_er_tium e_eg_stas.

You can use named groups for substitutions too, using ${name}.

To play around with regexes, I recommend http://regex101.com/, which offers a good amount of details on how the regex works; it also offers a few regex engines to choose from.

Sed capture group not working

Your sed pattern does not match the complete line: it stops after a [0-9]+ and never consumes the rest of the string. That's why you see the remaining text in the output.

You can use:

echo "a 10 b 12" | sed -E -n 's/a ([0-9]+).*/\1/p'
10

Or just:

echo "a 10 b 12" | sed -E 's/a ([0-9]+).*/\1/'
10
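To see the difference side by side, here is a short comparison (the variable names are only for illustration):

```shell
# Without .* only "a 10" is replaced, so the rest of the line survives.
partial=$(echo "a 10 b 12" | sed -E 's/a ([0-9]+)/\1/')
# With .* the rest of the line is part of the match and gets replaced too.
full=$(echo "a 10 b 12" | sed -E 's/a ([0-9]+).*/\1/')
echo "$partial"   # prints: 10 b 12
echo "$full"      # prints: 10
```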

Non-capturing groups


Is it ever possible that you would have a non-capturing group that couldn't be substituted with a [...] ?

Sure, when the substrings you want to match aren't individual characters, for example:

^(?:foo|bar)+$

That will match a string like foobarbar. Doing the same using a character set alone wouldn't be possible.
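A quick way to check this from a shell is grep -E. POSIX ERE has no (?:...), so this sketch uses a plain group, which does the same grouping job here; the three input words are made up:

```shell
# Only lines made up entirely of "foo" and "bar" repetitions match.
result=$(printf '%s\n' foobarbar foobaz barfoo | grep -E '^(foo|bar)+$')
echo "$result"
```

This prints foobarbar and barfoo, while foobaz is rejected.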

(?:7|8|9)

Vs -

(7|8|9)

A capturing group should be used whenever you need to capture the text and use it later. For example, if you want to examine the match and extract the matched group, or if you want to backreference the matched group later in the pattern.

Otherwise, the capturing group serves no purpose and a non-capturing group should be used instead.

Using a capturing group when a non-capturing group would work just fine has 2 issues:

  • It's computationally more expensive, since the regex engine has to keep track of the matched group (despite the fact that it doesn't need to)
  • It makes the intent of the pattern harder to understand at a glance. When someone reading a regular expression sees a non-capturing group, they can be sure that the group is being used only for implementing a particular logic (like repetition or alternation), but that whatever gets matched doesn't have to be kept track of for later. In contrast, if a reader of a regular expression sees a capturing group, they will probably expect that the capturing group will be used later, and will have to keep that in mind while reading the rest of the pattern. If the captured group doesn't actually get used anywhere, it's unnecessary cognitive overhead.

Use sed to Remove Capture Group 1 From All Lines In a File

You can use either of these:

sed 's/@[^,]*,/,/' input.csv > input_Corrected.csv
sed 's/@[^,]*//' input.csv > input_Corrected.csv

In the first command, the POSIX BRE pattern @[^,]*, matches a @, then zero or more characters other than a comma, then a comma, and replaces the match with a comma; use it if there MUST be a comma after the match. The second command leaves the comma out of the pattern and keeps the replacement empty.

See the online demo:

s='ABCD123RTY,steve_tyler@gmail.com,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy@hotmail.com,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2@netnet,10.20.30.l6,2021-08-20T15:30:34.480Z'
sed 's/@[^,]*,/,/' <<< "$s"

Output:

ABCD123RTY,steve_tyler,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2,10.20.30.l6,2021-08-20T15:30:34.480Z

How do I output only a capture group with sed

Your pattern is missing the part that matches the text after #, so that text is never consumed. This should solve it:

$ sed -nE "s/(^pytest.+)#.*/\1/p" ./requirements/local.txt
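As a quick check without the real file, here is a sketch with two made-up lines standing in for ./requirements/local.txt. Note that the captured text keeps the space before #, which the brackets in the echo make visible:

```shell
# Two hypothetical requirement lines; only the pytest one should print.
result=$(printf '%s\n' 'pytest==7.4.0 # test runner' 'django==4.2 # web framework' |
  sed -nE 's/(^pytest.+)#.*/\1/p')
echo "[$result]"   # prints: [pytest==7.4.0 ]
```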

How can I output only captured groups with sed?

The key to getting this to work is to tell sed to exclude what you don't want to be output as well as specifying what you do want. This technique depends on knowing how many matches you're looking for. The grep command below works for an unspecified number of matches.

string='This is a sample 123 text and some 987 numbers'
echo "$string" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'

This says:

  • don't default to printing each line (-n)
  • exclude zero or more non-digits
  • include one or more digits
  • exclude one or more non-digits
  • include one or more digits
  • exclude zero or more non-digits
  • print the substitution (p) (on one line)
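The sketch below runs the same substitution (written with -E, which does the same as -r) on the sample string and on a made-up string with three numbers, to show why the technique depends on knowing how many matches to expect:

```shell
pattern='s/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
two=$(echo 'This is a sample 123 text and some 987 numbers' | sed -En "$pattern")
# The third number is not covered by the pattern, so it is left in place.
three=$(echo 'a 1 b 2 c 3' | sed -En "$pattern")
echo "$two"     # prints: 123 987
echo "$three"   # prints: 1 23
```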

In general, in sed you capture groups using parentheses and output what you capture using a back reference:

echo "foobarbaz" | sed 's/^foo\(.*\)baz$/\1/'

will output "bar". If you use -r (or -E, which OS X sed requires and current GNU sed also accepts) for extended regex, you don't need to escape the parentheses:

echo "foobarbaz" | sed -r 's/^foo(.*)baz$/\1/'

There can be up to 9 capture groups and their back references. The back references are numbered in the order the groups appear, but they can be used in any order and can be repeated:

echo "foobarbaz" | sed -r 's/^foo(.*)b(.)z$/\2 \1 \2/'

outputs "a bar a".

If you have GNU grep:

echo "$string" | grep -Po '\d+'

It may also work with BSD grep, including on OS X, although \d is not part of POSIX ERE, so [0-9]+ is the portable spelling with -E:

echo "$string" | grep -Eo '\d+'

These commands will match any number of digit sequences. The output will be on multiple lines.

Variations are possible too, such as:

echo "$string" | grep -Po '(?<=\D )(\d+)'

The -P option enables Perl Compatible Regular Expressions. See man 3 pcrepattern or man 3 pcresyntax.
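For a version that does not depend on \d or on -P at all, here is a sketch using a POSIX bracket range with -E and -o:

```shell
string='This is a sample 123 text and some 987 numbers'
# [0-9]+ is plain ERE, so this does not rely on \d being understood.
result=$(echo "$string" | grep -Eo '[0-9]+')
echo "$result"
```

This prints 123 and 987 on separate lines, one line per match.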

capturing groups in sed

sed is outputting its input because the substitution isn't matching. Since you're probably using GNU sed, try this:

echo "ko05414     ko:ITGA4" | sed 's/\(^ko[0-9]\{5\}\)\tko:\(.*$\)/\1\2/'

The changes are:

  • \d -> [0-9] since GNU sed doesn't recognize \d
  • {} -> \{\} since GNU sed by default uses basic regular expressions.
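An equivalent spelling with -E avoids the backslashes around the groups and braces. This is a sketch assuming GNU sed, which reads \t in the pattern as a tab; printf is used so that the input really contains a tab between the two fields:

```shell
# printf emits a real tab between the two fields.
result=$(printf 'ko05414\tko:ITGA4\n' | sed -E 's/^(ko[0-9]{5})\tko:(.*)$/\1\2/')
echo "$result"   # prints: ko05414ITGA4
```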

prevent sed from printing non-captured group

Converting my comment to answer so that solution is easy to find for future visitors.

You may use this sed command, where -n disables the default printing so that unmatched lines are not output:

sed -nE 's/^.*\[(.* 0.*)\].*/\1/p' file

Also, please keep in mind that .* is greedy, and because of the amount of backtracking it causes, this pattern tends to get slower on large files.

I suggest using this regex with negated character classes instead:

sed -nE 's/^[^[]*\[([^ ]* 0[^]]*)\].*/\1/p' file
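Here is a sketch of that command on two made-up input lines, since the original file is not shown; only the first line contains a bracketed part with " 0" in it:

```shell
# Hypothetical input: the second line has no brackets and is not printed.
result=$(printf '%s\n' 'job [cpu 0.93] done' 'no brackets here' |
  sed -nE 's/^[^[]*\[([^ ]* 0[^]]*)\].*/\1/p')
echo "$result"   # prints: cpu 0.93
```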

