Regex to Match Only Commas Not in Parentheses

Regex to match only commas not in parentheses?

Assuming that there can be no nested parens (otherwise, you can't use a Java Regex for this task because recursive matching is not supported):

Pattern regex = Pattern.compile(
", # Match a comma\n" +
"(?! # only if it's not followed by...\n" +
" [^(]* # any number of characters except opening parens\n" +
" \\) # followed by a closing parens\n" +
") # End of lookahead",
Pattern.COMMENTS);

This regex uses a negative lookahead assertion to ensure that the next parenthesis after the comma (if any) is not a closing parenthesis. Only then is the comma allowed to match.
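As a quick check of that behaviour, here is a minimal Python sketch using the same pattern without the free-spacing comments. The sample strings and the name TOP_LEVEL_COMMA are made up for illustration. The second string shows why the "no nested parens" assumption matters.

```python
import re

# Same idea as the Java pattern: a comma that is NOT followed by
# "any run of non-'(' characters, then ')'".
TOP_LEVEL_COMMA = re.compile(r",(?![^(]*\))")

# Flat parentheses: the split lands only on the top-level commas.
flat_parts = TOP_LEVEL_COMMA.split("a, b (c, d), e")
print(flat_parts)    # ['a', ' b (c, d)', ' e']

# Nested parentheses: the comma after "b" is wrongly treated as top-level,
# because the next parenthesis after it is an opening one.
nested_parts = TOP_LEVEL_COMMA.split("a, (b, (c), d), e")
print(nested_parts)  # ['a', ' (b', ' (c), d)', ' e']
```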

perl regex to get comma not in parenthesis or nested parenthesis

A single regex for this is massively overcomplicated and difficult to maintain or extend. Here is an iterative parser approach:

use strict;
use warnings;

my $str = 'a , (b) , (d$_,c) , ((,),d,(,))';

my $nesting = 0;
my $buffer = '';
my @vals;
while ($str =~ m/\G([,()]|[^,()]+)/g) {
my $token = $1;
if ($token eq ',' and !$nesting) {
push @vals, $buffer;
$buffer = '';
} else {
$buffer .= $token;
if ($token eq '(') {
$nesting++;
} elsif ($token eq ')') {
$nesting--;
}
}
}
push @vals, $buffer if length $buffer;

print "$_\n" for @vals;

You can use Parser::MGC to construct this sort of parser more abstractly.
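For comparison, the same depth-counting idea can be sketched in Python. This is a rough port rather than a drop-in equivalent: the function name split_top_level and its parameters are invented here, and it walks the string character by character instead of tokenizing with a regex.

```python
def split_top_level(text, sep=",", opener="(", closer=")"):
    """Split text on sep wherever the parenthesis nesting depth is zero."""
    parts, buffer, depth = [], [], 0
    for ch in text:
        if ch == sep and depth == 0:
            # Top-level separator: close the current field.
            parts.append("".join(buffer))
            buffer = []
            continue
        if ch == opener:
            depth += 1
        elif ch == closer:
            depth -= 1
        buffer.append(ch)
    # Like the Perl version, keep the last field only if it is non-empty.
    if buffer:
        parts.append("".join(buffer))
    return parts


vals = split_top_level("a , (b) , (d$_,c) , ((,),d,(,))")
print(vals)  # ['a ', ' (b) ', ' (d$_,c) ', ' ((,),d,(,))']
```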

RegEx for matching all commas unless they are enclosed between parentheses or brackets

This looks more like a job for a custom parser than a single regex. I would love to be proved wrong, but while we're waiting, here's a very pedestrian parsing function that gets the job done.

parse_nested <- function(string) {

chars <- strsplit(string, "")[[1]]

parentheses <- numeric(length(chars))
parentheses[chars == "("] <- 1
parentheses[chars == ")"] <- -1
parentheses <- cumsum(parentheses)

brackets <- numeric(length(chars))
brackets[chars == "["] <- 1
brackets[chars == "]"] <- -1
brackets <- cumsum(brackets)

split_on <- which(brackets == 0 & parentheses == 0 & chars == ",")
split_on <- c(0, split_on, length(chars) + 1)

result <- character()

for(i in seq_along(head(split_on, -1))) {
x <- paste0(chars[(split_on[i] + 1):(split_on[i + 1] - 1)], collapse = "")
result <- c(result, x)
}

trimws(result)
}

Which produces:

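# x is not shown in the original; this input is reconstructed from the output below
x <- "A, B (C, D, E), F, G [H, I, J], K (L (M, N), O), P (Q (R, S (T, U)))"
parse_nested(x)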
#> [1] "A" "B (C, D, E)" "F"
#> [4] "G [H, I, J]" "K (L (M, N), O)" "P (Q (R, S (T, U)))"
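The cumulative-sum trick in that function carries over to other languages. Below is a hedged Python sketch of the same approach using itertools.accumulate. The helper running_depth is a name introduced here, and the input string is an assumption reconstructed from the output shown above.

```python
from itertools import accumulate


def running_depth(chars, opener, closer):
    # +1 at each opener, -1 at each closer, summed cumulatively.
    return list(accumulate((c == opener) - (c == closer) for c in chars))


def parse_nested(string):
    parens = running_depth(string, "(", ")")
    brackets = running_depth(string, "[", "]")
    # A comma is a split point only when both depths are zero.
    cuts = [i for i, c in enumerate(string)
            if c == "," and parens[i] == 0 and brackets[i] == 0]
    bounds = [-1] + cuts + [len(string)]
    return [string[a + 1:b].strip() for a, b in zip(bounds, bounds[1:])]


# Assumed input, inferred from the R output above.
x = "A, B (C, D, E), F, G [H, I, J], K (L (M, N), O), P (Q (R, S (T, U)))"
result = parse_nested(x)
print(result)
```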

Regex to match only commas not in parentheses or square brackets

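Maybe you want something like this:

(?!<(?:\(|\[)[^)\]]+),(?![^(\[]+(?:\)|\]))

Note that the leading (?!<...) group is a negative lookahead that begins with a literal <, not a lookbehind. Placed directly before the comma it can never fail, so the trailing lookahead does all the work.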

When fed to Java with the input (note additional ] and ( inserted at random positions to make it well-formed):

Potatoes, Vegetable Oil (Sunflower, Corn, And/or Canola Oil), Honey BBQ Seasoning [Sugar, Salt, Dextrose, Torula Yeast], Onion Powder, Spices, Maltodextrin Fructose, Yeast Extract, Molasses, Natural Flavor [Including Milk], Corn Starch, Honey, Gum Arabic, Paprika Extracts, Caramel Color (Garlic Powder, Citric Acid, And Sunflower Oil).

it produces the output:

Potatoes
Vegetable Oil (Sunflower, Corn, And/or Canola Oil)
Honey BBQ Seasoning [Sugar, Salt, Dextrose, Torula Yeast]
Onion Powder
Spices
Maltodextrin Fructose
Yeast Extract
Molasses
Natural Flavor [Including Milk]
Corn Starch
Honey
Gum Arabic
Paprika Extracts
Caramel Color (Garlic Powder, Citric Acid, And Sunflower Oil).

which is exactly a "split at top-level commas".

However, note that this regex is really inefficient. Counting parentheses with regex lookarounds is not a very good idea. It seems as if it could be solved with a simple scan-left followed by a simple split.

Regex to match only commas but not inside multiple parentheses

Here is the regex which works perfectly for your input.

,(?![^()]*(?:\([^()]*\))?\))

Explanation:

,                    ','
(?!                  negative lookahead, to see if there is not:
  [^()]*             any character except '(' or ')' (0 or more times)
  (?:                group, but do not capture (optional):
    \(               '('
    [^()]*           any character except '(' or ')' (0 or more times)
    \)               ')'
  )?                 end of grouping; the ? makes the whole non-capturing group optional
  \)                 ')'
)                    end of lookahead

Limitations:

This regex works based on the assumption that parentheses will not be nested at a depth greater than 2, i.e. paren within paren. It also expects the inner group, if present, to be immediately followed by the outer closing paren: in (b, (c), d) the comma after b is still matched. It could also fail if unbalanced, escaped, or quoted parentheses occur in the input, because it relies on the assumption that each closing paren corresponds to an opening paren and vice versa.
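To make those limits concrete, here is a small Python sketch that runs the same pattern on two made-up inputs. The first has the shape the regex expects (inner group directly before the outer closing paren), the second does not. The name DEPTH_TWO_COMMA is invented for this example.

```python
import re

DEPTH_TWO_COMMA = re.compile(r",(?![^()]*(?:\([^()]*\))?\))")

# Inner group sits right before the outer ')': only top-level commas split.
ok = DEPTH_TWO_COMMA.split("a, (b, (c, d)), e")
print(ok)   # ['a', ' (b, (c, d))', ' e']

# Text follows the inner group inside the outer parens: the comma after
# "b" is matched even though it is inside parentheses.
bad = DEPTH_TWO_COMMA.split("a, (b, (c), d)")
print(bad)  # ['a', ' (b', ' (c), d)']
```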

Regex to match commas that are not in an array (enclosed in square brackets)

You could use the following regex to match commas not in arrays:

(,)(?![^[]*\])

(Explanation on regex101.)

This says to match any comma which, if it is followed by a close bracket, has an opening bracket before that close bracket.


Example in JS:

outPut = outPut.replace(/(,)(?![^[]*\])/g, '\n');

gives:

"{glossary:{title:example glossary
GlossDiv:{title:S
GlossList:{GlossEntry:{ID:SGML
SortAs:SGML
GlossTerm:Standard Generalized Markup Language
Acronym:SGML
Abbrev:ISO 8879:1986
GlossDef:{para:A meta-markup language
used to create markup languages such as DocBook.
GlossSeeAlso:[GML,XML]}
GlossSee:markup}}}}}"
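The same lookahead works outside JavaScript too. As a minimal sketch, here is a Python version on a shortened, made-up input in the same style. The opening bracket is escaped inside the character class, and the capture group is dropped because the replacement does not use it.

```python
import re

# A comma not followed by "non-'[' characters, then ']'".
NOT_IN_BRACKETS = re.compile(r",(?![^\[]*\])")

text = "title:S,GlossSeeAlso:[GML,XML],GlossSee:markup"
out = NOT_IN_BRACKETS.sub("\n", text)
print(out)
# title:S
# GlossSeeAlso:[GML,XML]
# GlossSee:markup
```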

match all commas that are outside parentheses and square brackets in perl regex

The problem here is in identifying "balanced" pairs, of parentheses/brackets in this case. This is a well-recognized problem, for which there are libraries. They can find the top-level matching pairs, (...)/[...] with all that's inside, and all else outside parens -- then process the "else."

One way, using Regexp::Common

use warnings;
use strict;
use feature 'say';

use Regexp::Common;

my $str = shift // q{A, t(a,b(c,))u B, C, p(d,)q D,};

my @all_parts = split /$RE{balanced}{-parens=>'()[]'}/, $str;

my @no_paren_parts = grep { not /\(.*\) | \[.*\]/x } @all_parts;

say for @no_paren_parts;

This uses split's property to return the list with separators included when the regex in the separator pattern captures. The library regex captures so we get it all back -- the parts obtained by splitting the string by what regex matched but also the parts matched by the regex. The separators contain the paired delimiters while other terms cannot, by construction, so I filter them out by that. Prints


A, t
u B, C, p
q D,

The paren/bracket terms are gone, but how the string is split is otherwise a bit arbitrary.

The above is somewhat "generic," using the library merely to extract the balanced pairs ()/[], along with all other parts of the string. Or, we can remove those patterns from the string

$str =~ s/$RE{balanced}{-parens=>'()[]'}//g;

to be left with


A, tu B, C, pq D,

Now one can simply split by commas

my @terms = split /\s*,\s*/, $str;
say for @terms;

which gives


A
tu B
C
pq D

This is the desired result in this case, as clarified in comments.
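Where no balanced-pair library is at hand, the "remove the groups, then split" idea can be approximated by stripping innermost groups repeatedly until none are left. The Python sketch below does that. The helper strip_balanced is a name made up here, it is not how Regexp::Common works internally, and it assumes the input is well-formed.

```python
import re

# An innermost group: (...) or [...] containing no further brackets.
INNERMOST = re.compile(r"\([^()\[\]]*\)|\[[^()\[\]]*\]")


def strip_balanced(text):
    """Remove all balanced ()/[] groups, working from the inside out."""
    count = 1
    while count:
        text, count = INNERMOST.subn("", text)
    return text


s = strip_balanced("A, t(a,b(c,))u B, C, p(d,)q D,")
print(s)      # A, tu B, C, pq D,

# Unlike Perl's split, re.split keeps a trailing empty field, so drop it.
terms = [t for t in re.split(r"\s*,\s*", s) if t]
print(terms)  # ['A', 'tu B', 'C', 'pq D']
```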

Another notable library, in many ways more fundamental, is the core module Text::Balanced.


An example. With

my $str = q(it, is; surely);

my @terms = split /[,;]/, $str;

one gets 'it', ' is', ' surely' in the array @terms, while with

my @terms = split /([,;])/, $str;

we get in @terms all of: 'it', ',', ' is', ';', ' surely'


Also by construction, @all_parts contains what the regex matched at odd (zero-based) indices. So for all other parts we can fetch the elements at even indices

my @other_than_matched_parts = @all_parts[ grep { not $_ & 1 } 0..$#all_parts ];
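Python's re.split treats capturing groups in the separator pattern the same way, which makes the index layout easy to see. A small sketch with the same sample string:

```python
import re

plain = re.split(r"[,;]", "it, is; surely")
print(plain)      # ['it', ' is', ' surely']

# With a capturing group, the separators come back too.
with_seps = re.split(r"([,;])", "it, is; surely")
print(with_seps)  # ['it', ',', ' is', ';', ' surely']

pieces = with_seps[0::2]  # even indices: text between separators
seps = with_seps[1::2]    # odd indices: what the regex matched
```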

Replace a comma that is not in parentheses using regex

Use a negative lookahead to achieve this:

,(?![^()]*\))

Explanation:

,         # Match a literal ','
(?! # Start of negative lookahead
[^()]* # Match any character except '(' & ')', zero or more times
\) # Followed by a literal ')'
) # End of lookahead

Regex split by comma not inside parenthesis (.NET)

The PCRE regex (\((?:[^()]++|(?1))*\))(*SKIP)(*F)|, uses recursion, which .NET does not support, but the same thing can be done with a balancing construct. Of the PCRE verbs (*SKIP) and (*FAIL), only (*FAIL) can be written as (?!) (it causes an unconditional failure at the place where it stands); .NET does not support skipping a match at a specific position and resuming the search from that failed position.

I suggest replacing all commas that are not inside nested parentheses with some temporary value, and then splitting the string with that value:

var s = Regex.Replace(text, @"\((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\)|(,)", m =>
m.Groups[1].Success ? "___temp___" : m.Value);
var results = s.Split("___temp___");

Details

  • \((?>[^()]+|(?<o>)\(|(?<-o>)\))*(?(o)(?!))\) - a pattern that matches nested parentheses:

    • \( - a ( char
    • (?>[^()]+|(?<o>)\(|(?<-o>)\))* - 0 or more occurrences of

      • [^()]+| - 1+ chars other than ( and ) or
      • (?<o>)\(| - a ( and a value is pushed on to the Group "o" stack
      • (?<-o>)\) - a ) and a value is popped from the Group "o" stack
    • (?(o)(?!)) - a conditional construct that fails the match if Group "o" stack is not empty
    • \) - a ) char
  • | - or
  • (,) - Group 1: a comma

Only the comma captured in Group 1 is replaced with a temp substring since the m.Groups[1].Success check is performed in the match evaluator part.
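The "match the parenthesized part, capture only the comma" trick is not tied to .NET. Python's standard re module has neither balancing groups nor recursion, so the sketch below only handles non-nested parentheses. The sentinel character and the function name are assumptions made for this example, and the input is assumed not to contain the sentinel.

```python
import re

SENTINEL = "\x00"  # assumed never to occur in the input

# Either a flat (...) group, consumed as-is, or a comma captured in group 1.
GROUP_OR_COMMA = re.compile(r"\([^()]*\)|(,)")


def split_outside_parens(text):
    # Replace only the commas captured in group 1; keep groups unchanged.
    marked = GROUP_OR_COMMA.sub(
        lambda m: SENTINEL if m.group(1) else m.group(0), text)
    return marked.split(SENTINEL)


results = split_outside_parens("a, f(b, c), d")
print(results)  # ['a', ' f(b, c)', ' d']
```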


