Splitting Data Inside Quotes and Comma Using Regex

How can I split by commas while ignoring any comma that's inside quotes?

Update:

I think the final one-line version should be:

var cells = (rows[i] + ',').split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/).slice(1).reduce((a, b) => (a.length > 0 && a[a.length - 1].length < 4) ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]] : [...a, [b]], []).map(e => e.reduce((a, b) => a !== undefined ? a : b, undefined))

or put it more beautifully:

var cells = (rows[i] + ',')
  .split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/)
  .slice(1)
  .reduce(
    (a, b) => (a.length > 0 && a[a.length - 1].length < 4)
      ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
      : [...a, [b]],
    [],
  )
  .map(
    e => e.reduce(
      (a, b) => a !== undefined ? a : b, undefined,
    ),
  )
;

This is rather long, but still looks purely functional. Let me explain it:

First, the regular expression part. Each segment you want falls into one of 3 cases:

  1. *?([^",]+?) *?,, which is a string without " or , surrounded with spaces, followed by a ,.
  2. " *?(.+?)" *?,, which is a string, surrounded with a pair of quotes and an indefinite number of spaces beyond the quotes, followed by a ,.
  3. ( *?),, which is an indefinite number of spaces, followed by a ','.

So splitting by a non-capturing group of a union of these three will basically get us to the answer.

Recall that when splitting with a regular expression, the resulting array consists of:

  1. Strings separated by the separator (the regular expression)
  2. All the capturing groups in the separator

In our case, the separators fill the whole string, so the strings separated are all empty strings, except that last desired part, which is left out because there is no , following it. Thus the resulting array should be like:

  1. An empty string
  2. Three strings, representing the three capturing groups of the first separator matched
  3. An empty string
  4. Three strings, representing the three capturing groups of the second separator matched
  5. ...
  6. An empty string
  7. The last desired part, left alone

So why not simply add a , at the end, so that the last part follows the same pattern as all the others? This is how (rows[i] + ',') comes about.

In this case the resulting array becomes capturing groups separated by empty strings. Removing the first empty string, they will appear in a group of 4 as [ 1st capturing group, 2nd capturing group, 3rd capturing group, empty string ].
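To see this layout concretely, here is a small sketch that prints the raw output of split for one row. The sample row is an assumption chosen for illustration; the regular expression is the one from the answer.

```javascript
// Hypothetical sample row, with the trailing comma already appended.
const sampleRow = 'a,"b,c"' + ',';

// The same separator regex as in the answer; it has three capturing groups.
const separator = /(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/;

// split() interleaves the (empty) strings between separators with the
// three capturing groups of each separator; unmatched groups are undefined.
const raw = sampleRow.split(separator);
console.log(raw);
// [ '', 'a', undefined, undefined, '', undefined, 'b,c', undefined, '' ]
```

After dropping the first empty string, the remaining eight entries form two groups of 4.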

What the reduce block does is exactly grouping them into groups of 4:

  .reduce(
    (a, b) => (a.length > 0 && a[a.length - 1].length < 4)
      ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
      : [...a, [b]],
    [],
  )

And finally, take the first non-undefined element of each group. An unmatched capturing group appears as undefined, and our three patterns are mutually exclusive (no two of them can match at the same time), so exactly one of the three capturing groups in each group is defined, and it is precisely the desired part:

  .map(
    e => e.reduce(
      (a, b) => a !== undefined ? a : b, undefined,
    ),
  )

This completes the solution.
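If the chained version is hard to follow, the same idea can be sketched as a small helper. The name splitRow is hypothetical, and the plain loop is a swapped-in replacement for the reduce-based chunking; for well-formed rows it should produce the same cells.

```javascript
// Hypothetical helper wrapping the pipeline described above.
function splitRow(row) {
  const parts = (row + ',')
    .split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/)
    .slice(1); // drop the leading empty string

  const cells = [];
  // Each separator contributes 3 capture slots plus one empty string.
  for (let i = 0; i < parts.length; i += 4) {
    const groups = parts.slice(i, i + 3);
    // Exactly one of the three capture slots is defined.
    cells.push(groups.find(g => g !== undefined));
  }
  return cells;
}

console.log(splitRow('a,"b,c",,d')); // [ 'a', 'b,c', '', 'd' ]
```

Note that the empty cell between the two adjacent commas survives as an empty string.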


I think the following should suffice:

var cells = rows[i].split(/([^",]+?|".+?") *, */).filter(e => e)

or if you don't want the quotes:

var cells = rows[i].split(/(?:([^",]+?)|"(.+?)") *, */).filter(e => e)
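A quick sketch of what the two short variants return, on a sample row that is an assumption made for illustration. The last call shows one limitation worth knowing about: an empty cell is not recognised as a field by these patterns.

```javascript
const row = 'a, "b, c", d';

// Variant that keeps the quotes.
const withQuotes = row.split(/([^",]+?|".+?") *, */).filter(e => e);
console.log(withQuotes); // [ 'a', '"b, c"', 'd' ]

// Variant that drops the quotes; filter() also removes the undefined
// entries left by whichever capturing group did not match.
const withoutQuotes = row.split(/(?:([^",]+?)|"(.+?)") *, */).filter(e => e);
console.log(withoutQuotes); // [ 'a', 'b, c', 'd' ]

// Caveat: with an empty cell, the second comma stays glued to the next field.
const withEmpty = 'a,,b'.split(/([^",]+?|".+?") *, */).filter(e => e);
console.log(withEmpty); // [ 'a', ',b' ]
```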

Split a string by commas but ignore commas within double-quotes using Javascript

Here's what I would do.

var str = 'a, b, c, "d, e, f", g, h';
var arr = str.match(/(".*?"|[^",\s]+)(?=\s*,|\s*$)/g);

/* will match:

    (
        ".*?"       double quotes + anything but double quotes + double quotes
        |           OR
        [^",\s]+    1 or more characters excl. double quotes, comma or spaces of any kind
    )
    (?=             FOLLOWED BY
        \s*,        0 or more empty spaces and a comma
        |           OR
        \s*$        0 or more empty spaces and nothing else (end of string)
    )

*/
arr = arr || [];
// this will prevent JS from throwing an error in
// the below loop when there are no matches
for (var i = 0; i < arr.length; i++) console.log('arr['+i+'] =',arr[i]);
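The same match-based approach can be wrapped in a function that also strips the surrounding quotes. This is only a sketch: the helper name matchFields and the quote-stripping step are additions for illustration, not part of the answer above.

```javascript
// Hypothetical helper: extract fields with match(), then strip the
// surrounding double quotes from quoted fields.
function matchFields(line) {
  // match() returns null when nothing matches, so fall back to [].
  const found = line.match(/(".*?"|[^",\s]+)(?=\s*,|\s*$)/g) || [];
  return found.map(f => f.replace(/^"(.*)"$/, '$1'));
}

console.log(matchFields('a, b, c, "d, e, f", g, h'));
// [ 'a', 'b', 'c', 'd, e, f', 'g', 'h' ]
console.log(matchFields('')); // []
```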

Regular Expression for Comma Based Splitting Ignoring Commas inside Quotes

You need to use the split(java.lang.String, int) method with a negative limit, which keeps trailing empty strings (the one-argument split discards them).

Your code would then look like:

String str = "20Y-62-27412,20Y6227412NK,BRACKET,101H,00D505060,H664374,06/25/2013,1,,";
String[] rowData = str.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)", -1);
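For comparison, here is a rough JavaScript sketch of the same split on the same input. Two differences are assumed here about JavaScript rather than taken from the Java answer: split keeps trailing empty strings without any limit argument, and it would add the text of capturing groups to the result, so the group is written as non-capturing.

```javascript
const str = '20Y-62-27412,20Y6227412NK,BRACKET,101H,00D505060,H664374,06/25/2013,1,,';

// Non-capturing group (?:...) so split() does not add group text to the result.
const rowData = str.split(/,(?=(?:[^"]*"[^"]*")*[^"]*$)/);

console.log(rowData.length);    // 10
console.log(rowData.slice(-2)); // [ '', '' ]
```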

Python, split a string at commas, except within quotes, ignoring whitespace

You can use the regular expression

".+?"|[\w-]+

This will match double-quotes, followed by any characters, until the next double-quote is found - OR, it will match word characters (no commas nor quotes).

https://regex101.com/r/IThYf7/1

import re
s = 'abc,def, ghi, "jkl, mno, pqr","stu"'
for r in re.findall(r'".+?"|[\w-]+', s):
    print(r)

If you want to get rid of the "s around the quoted sections, the best I could figure out by using the regex module (so that \K was usable) was:

(?:^"?|, ?"?)\K(?:(?<=").+?(?=")|[\w-]+)

https://regex101.com/r/IThYf7/3
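Another way to drop the quotes, without \K, is to use capturing groups and pick whichever one matched. The sketch below is in JavaScript, to stay consistent with the other examples on this page; the two-group pattern is an adaptation for illustration, not the pattern from the answer above.

```javascript
const s = 'abc,def, ghi, "jkl, mno, pqr","stu"';

// Group 1 holds the text inside quotes, group 2 an unquoted word.
const pattern = /"(.+?)"|([\w-]+)/g;

// For every match, keep whichever group actually matched.
const fields = Array.from(s.matchAll(pattern), m => (m[1] !== undefined ? m[1] : m[2]));
console.log(fields);
// [ 'abc', 'def', 'ghi', 'jkl, mno, pqr', 'stu' ]
```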

Splitting on comma outside quotes

You can try out this regex:

str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

This splits the string on , that is followed by an even number of double quotes. In other words, it splits on comma outside the double quotes. This will work provided you have balanced quotes in your string.
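To illustrate why balanced quotes matter, here is a small JavaScript sketch using the same lookahead. The input strings are assumptions made up for the demonstration.

```javascript
const splitter = /,(?=(?:[^"]*"[^"]*")*[^"]*$)/;

// Balanced quotes: the comma inside the quotes is left alone.
console.log('a,"b,c",d'.split(splitter)); // [ 'a', '"b,c"', 'd' ]

// Unbalanced quotes: the first comma is followed by an odd number of
// quotes, so it is no longer treated as a separator.
console.log('a,"b,c'.split(splitter)); // [ 'a,"b', 'c' ]
```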

Explanation:

,                         // Split on comma
(?=                       // Followed by
   (?:                    // Start a non-capture group
     [^"]*                // 0 or more non-quote characters
     "                    // 1 quote
     [^"]*                // 0 or more non-quote characters
     "                    // 1 quote
   )*                     // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
   [^"]*                  // Finally 0 or more non-quotes
   $                      // Till the end (This is necessary, else every comma will satisfy the condition)
)

You can even write it like this in your code, using the (?x) modifier with your regex. The modifier ignores any whitespace in your regex, so a regex broken into multiple lines becomes easier to read:

String[] arr = str.split("(?x)   " +
        ", " +           // Split on comma
        "(?= " +         // Followed by
        " (?: " +        // Start a non-capture group
        " [^\"]* " +     // 0 or more non-quote characters
        " \" " +         // 1 quote
        " [^\"]* " +     // 0 or more non-quote characters
        " \" " +         // 1 quote
        " )* " +         // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
        " [^\"]* " +     // Finally 0 or more non-quotes
        " $ " +          // Till the end (This is necessary, else every comma will satisfy the condition)
        ") "             // End look-ahead
);
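JavaScript regular expressions have no (?x) modifier, but a similar commented layout can be approximated by assembling the pattern from string fragments. This is a sketch of one possible workaround; the variable name is hypothetical.

```javascript
// Each fragment carries its own comment; join('') glues them back together.
const outsideQuotesComma = new RegExp([
  ',',      // split on comma
  '(?=',    // followed by
  '(?:',    //   start a non-capture group
  '[^"]*',  //     0 or more non-quote characters
  '"',      //     1 quote
  '[^"]*',  //     0 or more non-quote characters
  '"',      //     1 quote
  ')*',     //   0 or more repetitions of the group (an even number of quotes)
  '[^"]*',  //   finally 0 or more non-quotes
  '$',      //   till the end
  ')',      // end look-ahead
].join(''));

console.log('x,"y,z",w'.split(outsideQuotesComma)); // [ 'x', '"y,z"', 'w' ]
```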

C# Regex Split - commas outside quotes

You could split on all commas that have an even number of quotes following them, using the following regex to find them:

",(?=(?:[^']*'[^']*')*[^']*$)"

You'd use it like

var result = Regex.Split(samplestring, ",(?=(?:[^']*'[^']*')*[^']*$)");

C# Regex Split Quotes and Comma Syntax Error

The problem is the double quotes inside the regex: the compiler chokes on them because it thinks they are the end of the string.
You must escape them (and double the backslash in \s, which is not a valid escape in a regular C# string), like this:

"[^\\s\"']+|\"([^\"]*)\"|\'([^\']*)"

Edit:

You can actually do everything you want with one regex, without splitting first:

@"(?<=[""])[^,]*?(?=[""])"

Here I use an @ quoted string where double quotes are doubled instead of escaped.

The regex uses a lookbehind to find a double quote, then matches any character except a comma ',' zero or more times, then looks ahead for a double quote.

How to use:

string test = @"""0"",""Column"",""column2"",""Column3""";
Regex regex = new Regex(@"(?<=[""])[^,]*?(?=[""])");
foreach (Match match in regex.Matches(test))
{
    Console.WriteLine(match.Value);
}
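The same lookbehind/lookahead pattern can be tried in JavaScript, assuming an engine that supports lookbehind (ES2018 or later). This is only a sketch on the same test string; because the pattern excludes commas, it is not meant for quoted values that contain commas.

```javascript
const test = '"0","Column","column2","Column3"';

// Lookbehind for a quote, lazily take non-comma characters, lookahead for a quote.
const values = test.match(/(?<=")[^,]*?(?=")/g) || [];
console.log(values); // [ '0', 'Column', 'column2', 'Column3' ]
```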

Javascript: Splitting a string by comma but ignoring commas in quotes

> str.match(/('[^']+'|[^,]+)/g)
["A", "B", "C", "E", "'F,G,bb'", "H", "'I9,I8'", "J", "K"]

Though this is what you requested, you may not have accounted for corner cases, for example:

  • 'bob\'s' is a string where ' is escaped
  • a,',c
  • a,,b
  • a,b,
  • ,a,b
  • a,b,'
  • ',a,b
  • ',a,b,c,'

Some of the above are handled correctly by this; others are not. I highly recommend that people use a library that has thought this through, to avoid things such as security vulnerabilities or subtle bugs, now or in the future (if you expand your code, or if other people use it).


Explanation of the RegEx:

  • ('[^']+'|[^,]+) - means match either '[^']+' or [^,]+
  • '[^']+' means quote...one-or-more non-quotes...quote.
  • [^,]+ means one-or-more non-commas

Note: by consuming the quoted string before the unquoted string, we make the parsing of the unquoted string case easier.
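A few of the corner cases listed earlier can be checked directly. The sketch below runs the same pattern on inputs made up for illustration and shows what comes back; whether each result counts as "handled correctly" depends on what your data format expects.

```javascript
const pattern = /('[^']+'|[^,]+)/g;

// Quoted commas are kept together.
console.log("A,B,'F,G,bb',H".match(pattern)); // [ 'A', 'B', "'F,G,bb'", 'H' ]

// An empty field between two commas is silently dropped.
console.log('a,,b'.match(pattern)); // [ 'a', 'b' ]

// A lone quote is returned as an ordinary field.
console.log("a,',c".match(pattern)); // [ 'a', "'", 'c' ]
```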


