Splitting on Comma Outside Quotes

Splitting on comma outside quotes

You can try out this regex:

str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

This splits the string on every comma that is followed by an even number of double quotes. In other words, it splits on commas outside double quotes. This works provided the quotes in your string are balanced.
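To see why the balanced-quotes condition matters, here is a small sketch comparing a balanced and an unbalanced input. The class and method names (BalancedQuoteDemo, split) are made up for illustration; only the regex comes from the answer.

```java
import java.util.Arrays;

public class BalancedQuoteDemo {
    // The regex from the answer: a comma followed by an even number of double quotes.
    static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] split(String s) {
        return s.split(COMMA_OUTSIDE_QUOTES);
    }

    public static void main(String[] args) {
        // Balanced quotes: the comma inside "b,c" is left alone.
        System.out.println(Arrays.toString(split("a,\"b,c\",d")));
        // Unbalanced quotes: the first comma now has an odd number of quotes
        // ahead of it, so it is NOT split on, while the comma after the quote is.
        System.out.println(Arrays.toString(split("a,\"b,c")));
    }
}
```

With the unbalanced input the result is the two pieces a,"b and c, which is probably not what anyone wants, so it is worth validating the input first if you cannot trust it.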

Explanation:

,           // Split on comma
(?= // Followed by
(?: // Start a non-capture group
[^"]* // 0 or more non-quote characters
" // 1 quote
[^"]* // 0 or more non-quote characters
" // 1 quote
)* // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
[^"]* // Finally 0 or more non-quotes
$ // Till the end (This is necessary, else every comma will satisfy the condition)
)

You can even write it like this in your code, using the (?x) modifier with your regex. The modifier makes the regex engine ignore whitespace in the pattern, so a regex broken into multiple lines becomes easier to read:

String[] arr = str.split("(?x)   " + 
", " + // Split on comma
"(?= " + // Followed by
" (?: " + // Start a non-capture group
" [^\"]* " + // 0 or more non-quote characters
" \" " + // 1 quote
" [^\"]* " + // 0 or more non-quote characters
" \" " + // 1 quote
" )* " + // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
" [^\"]* " + // Finally 0 or more non-quotes
" $ " + // Till the end (This is necessary, else every comma will satisfy the condition)
") " // End look-ahead
);
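If the same split runs many times, one option is to compile the pattern once and pass Pattern.COMMENTS, which is the flag form of (?x). In that mode # starts a comment that runs to the end of the line, so the comments can live inside the pattern string itself. This is only a sketch; the class name CommentedPattern and the helper split are hypothetical.

```java
import java.util.regex.Pattern;

public class CommentedPattern {
    // Compiled once and reused. Each pattern line ends with \n so that the
    // trailing # comment stops there.
    static final Pattern COMMA_OUTSIDE_QUOTES = Pattern.compile(
        ",                             # split on a comma\n" +
        "(?=                           # followed by\n" +
        "  (?: [^\"]* \" [^\"]* \" )*  # zero or more pairs of quotes\n" +
        "  [^\"]*                      # then only non-quote characters\n" +
        "  $                           # up to the end of the string\n" +
        ")",
        Pattern.COMMENTS);

    static String[] split(String s) {
        // A negative limit keeps trailing empty strings.
        return COMMA_OUTSIDE_QUOTES.split(s, -1);
    }
}
```

For example, CommentedPattern.split("a,\"b,c\",d") gives the three pieces a, "b,c" and d.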

Splitting on comma outside quotes when escaped quotes exist

Use this regex for the split:

String[] parts = source.split(", *(?=((([^']|'')*'){2})*([^']|'')*$)");

This regex uses a look ahead that asserts the number of quotes following the current position is even, which logically means the comma is not enclosed.

The "key" here is using an alternation to define a "non quote" as either [^'] or '', which means a doubled quote ('') is consumed/treated as if it were a single ordinary character.

Note:

There is a missing final quote in your test case, which I have repaired in the test code below. If the quote is not added, your test case is syntactically invalid SQL and this code relies on quotes being balanced.


Some test code:

String source = "ADDRESS.CITY || ', UK''s', ADDRESS.CITY || ', US''s', ADDRESS.CITY || ', UK''s'";
String[] parts = source.split(", *(?=((([^']|'')*'){2})*([^']|'')*$)");
Arrays.stream(parts).forEach(System.out::println);

Output:

ADDRESS.CITY || ', UK''s'
ADDRESS.CITY || ', US''s'
ADDRESS.CITY || ', UK''s'
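As an alternative to the look-ahead (which rescans the rest of the string for every comma), the same split can be sketched as a single pass that tracks whether the current position is inside a quoted literal. This is not the regex approach above but a hand-written scanner; QuoteAwareScanner is a made-up name, and like the regex it assumes the quotes are balanced.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareScanner {
    // Splits on commas that are outside single-quoted literals.
    static List<String> split(String source) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c == '\'') {
                // An escaped quote ('') toggles the flag twice, so the state
                // after it is the same as before: no special case is needed.
                inQuotes = !inQuotes;
                current.append(c);
                i++;
            } else if (c == ',' && !inQuotes) {
                parts.add(current.toString());
                current.setLength(0);
                i++;
                // Mirror the ", *" in the regex: skip spaces after the separator.
                while (i < source.length() && source.charAt(i) == ' ') {
                    i++;
                }
            } else {
                current.append(c);
                i++;
            }
        }
        parts.add(current.toString());
        return parts;
    }
}
```

On the test string above this returns the same three parts. One difference from String#split is that trailing empty parts are kept rather than dropped.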

Java: splitting a comma-separated string but ignoring commas in quotes

Try:

public class Main { 
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}

Output:

> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"

In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.

Or, a bit friendlier for the eyes:

public class Main { 
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);

String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}

which produces the same as the first example.

EDIT

As mentioned by @MikeFHay in the comments:

I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split()), so I did:

Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))
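The String#split() behaviour referred to here can be seen with the standard library alone. The following sketch uses a made-up class name (TrailingEmptyDemo) and sample line; it shows that the default limit drops trailing empty strings, while a negative limit keeps them.

```java
public class TrailingEmptyDemo {
    static final String REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] split(String line, int limit) {
        return line.split(REGEX, limit);
    }

    public static void main(String[] args) {
        String line = "foo,\"a,b\",,";
        // Limit 0 is what split(regex) uses: trailing empty strings are removed.
        System.out.println(split(line, 0).length);
        // A negative limit keeps the two empty fields at the end.
        System.out.println(split(line, -1).length);
    }
}
```

This is why the examples above pass -1 as the second argument to split.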

Splitting on comma outside pair of double quotes. Ignore double quotes if it is single

Your input format is not a valid CSV format. According to the Wikipedia Comma-separated values page, if quoting is used at all, a literal quote character in a field must itself be quoted (written as a pair of double-quote characters inside a quoted field).

This means that it is unlikely that any existing general purpose CSV parser library will cope with both types of line in the same file.

To illustrate how deep this problem is, consider:

   130,TEXT 1" 67 SERIES, TEXT 2",4,1,998,.010,9,-,7,130

This could mean:

  • one field containing TEXT 1" 67 SERIES, TEXT 2"
  • one field containing TEXT 1 67 SERIES, TEXT 2, or
  • two fields TEXT 1" 67 SERIES and TEXT 2".

The only way to disambiguate this is to code some custom logic to sort it out, based on your own business rules.

I don't think you can do this with split and regexes. You need to write a proper custom parser.
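To give an idea of what such a custom parser might look like, here is a minimal sketch. It bakes in one business rule that is purely an assumption for illustration, not something taken from the question: a double quote only opens a quoted field when it is the first character of the field, and every other stray quote is kept as a literal character. The class name LenientCsvLine is made up.

```java
import java.util.ArrayList;
import java.util.List;

public class LenientCsvLine {
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean quoted = false;       // inside a field that started with a quote
        boolean atFieldStart = true;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            // Treat the end of the line like a separator when looking ahead.
            char next = i + 1 < line.length() ? line.charAt(i + 1) : ',';
            if (quoted) {
                if (c == '"' && next == '"') {
                    field.append('"');            // "" is an escaped quote
                    i++;
                } else if (c == '"' && next == ',') {
                    quoted = false;               // closing quote
                } else {
                    field.append(c);              // includes commas and stray quotes
                }
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
                atFieldStart = true;
                continue;
            } else if (c == '"' && atFieldStart) {
                quoted = true;                    // opening quote
            } else {
                field.append(c);                  // a quote in mid-field is kept literally
            }
            atFieldStart = false;
        }
        fields.add(field.toString());
        return fields;
    }
}
```

Under this assumed rule the ambiguous line above comes out as the third interpretation (two fields, TEXT 1" 67 SERIES and TEXT 2" with its leading space). A different rule would give a different answer, which is exactly the point: the choice has to come from your domain knowledge.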

But in this case, I think you would be entitled to push back on whoever / whatever is creating this CSV data. They should be following the rules. I would be tempted to implement my system to feed the CSV file(s) through an off-the-shelf syntax checker and automatically reject any files that fail validation.
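If no off-the-shelf checker is at hand, the quoting rule itself is small enough to sketch by hand. The following is such a hand-rolled check, not an existing library; it assumes comma-separated, single-line records (no line breaks inside quoted fields), and StrictCsvCheck is a hypothetical name.

```java
public class StrictCsvCheck {
    // Returns true if every quote in the line follows the usual CSV quoting rules:
    // a quote may only open a field, appear doubled inside a quoted field,
    // or close a quoted field right before a comma or the end of the line.
    static boolean isWellFormed(String line) {
        int i = 0;
        int n = line.length();
        while (true) {
            if (i < n && line.charAt(i) == '"') {
                // Quoted field: scan to the closing quote, allowing "" inside.
                i++;
                while (true) {
                    if (i >= n) return false;     // unterminated quoted field
                    if (line.charAt(i) == '"') {
                        if (i + 1 < n && line.charAt(i + 1) == '"') {
                            i += 2;               // escaped quote
                            continue;
                        }
                        i++;                      // closing quote
                        break;
                    }
                    i++;
                }
            } else {
                // Unquoted field: no quote may appear before the next comma.
                while (i < n && line.charAt(i) != ',') {
                    if (line.charAt(i) == '"') return false;
                    i++;
                }
            }
            if (i >= n) return true;              // line ended after a complete field
            if (line.charAt(i) != ',') return false; // text after a closing quote
            i++;                                  // skip the separator
        }
    }
}
```

Lines such as the ambiguous example above fail this check, so the file could be rejected before any splitting is attempted.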

Can you fix the errors in quoting automatically? I think not ... in the general case. As noted, there is no way of telling whether a double-quote in a malformed CSV is supposed to be literal or not. It requires human intelligence, and domain knowledge to understand what the data is supposed to mean.

How can I split by commas while ignoring any comma that's inside quotes?

Update:

I think the final version in a line should be:

var cells = (rows[i] + ',').split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/).slice(1).reduce((a, b) => (a.length > 0 && a[a.length - 1].length < 4) ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]] : [...a, [b]], []).map(e => e.reduce((a, b) => a !== undefined ? a : b, undefined))

or put it more beautifully:

var cells = (rows[i] + ',')
.split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/)
.slice(1)
.reduce(
(a, b) => (a.length > 0 && a[a.length - 1].length < 4)
? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
: [...a, [b]],
[],
)
.map(
e => e.reduce(
(a, b) => a !== undefined ? a : b, undefined,
),
)
;

This is rather long, but still looks purely functional. Let me explain it:

First, the regular expression part. Basically, a segment you want may fall into 3 possibilities:

  1. *?([^",]+?) *?,, which is a string without " or , surrounded with spaces, followed by a ,.
  2. " *?(.+?)" *?,, which is a string, surrounded with a pair of quotes and an indefinite number of spaces beyond the quotes, followed by a ,.
  3. ( *?),, which is an indefinite number of spaces, followed by a ','.

So splitting by a non-capturing group of a union of these three will basically get us to the answer.

Recall that when splitting with a regular expression, the resulting array consists of:

  1. Strings separated by the separator (the regular expression)
  2. All the capturing groups in the separator

In our case, the separators fill the whole string, so the strings separated are all empty strings, except the last desired part, which is left out because there is no , following it. Thus the resulting array should be like:

  1. An empty string
  2. Three strings, representing the three capturing groups of the first separator matched
  3. An empty string
  4. Three strings, representing the three capturing groups of the second separator matched
  5. ...
  6. An empty string
  7. The last desired part, left alone

So why not simply add a , at the end so that we get a perfect pattern? This is how (rows[i] + ',') comes about.

In this case the resulting array becomes capturing groups separated by empty strings. After removing the first empty string, they appear in groups of 4 as [ 1st capturing group, 2nd capturing group, 3rd capturing group, empty string ].

What the reduce block does is exactly grouping them into groups of 4:

  .reduce(
(a, b) => (a.length > 0 && a[a.length - 1].length < 4)
? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
: [...a, [b]],
[],
)

And finally, take the first non-undefined element in each group. An unmatched capturing group appears as undefined, and our three patterns are mutually exclusive (no two of them can match at the same time), so exactly one of the three captures in each group is defined, and it is precisely the desired part:

  .map(
e => e.reduce(
(a, b) => a !== undefined ? a : b, undefined,
),
)

This completes the solution.


I think the following should suffice:

var cells = rows[i].split(/([^",]+?|".+?") *, */).filter(e => e)

or if you don't want the quotes:

var cells = rows[i].split(/(?:([^",]+?)|"(.+?)") *, */).filter(e => e)

Split string on comma and ignore comma in double quotes

I think you can use the regex ,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$) from here: Splitting on comma outside quotes

You can test the pattern here: http://regexr.com/3cddl

Java code example:

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        String txt = "0, 2, 23131312,\"This, is a message\", 1212312";

        // Split on commas that are outside double quotes
        System.out.println(Arrays.toString(txt.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)")));
    }
}

C# Regex Split - commas outside quotes

You could split on all commas that have an even number of quotes following them, using the following Regex to find them:

",(?=(?:[^']*'[^']*')*[^']*$)"

You'd use it like

var result = Regex.Split(samplestring, ",(?=(?:[^']*'[^']*')*[^']*$)");

Split a string by commas but ignore commas within double-quotes using Javascript

Here's what I would do.

var str = 'a, b, c, "d, e, f", g, h';
var arr = str.match(/(".*?"|[^",\s]+)(?=\s*,|\s*$)/g);

/* will match:

    (
".*?" double quotes + anything but double quotes + double quotes
| OR
[^",\s]+ 1 or more characters excl. double quotes, comma or spaces of any kind
)
(?= FOLLOWED BY
\s*, 0 or more empty spaces and a comma
| OR
\s*$ 0 or more empty spaces and nothing else (end of string)
)

*/
arr = arr || [];
// this will prevent JS from throwing an error in
// the below loop when there are no matches
for (var i = 0; i < arr.length; i++) console.log('arr['+i+'] =',arr[i]);
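For readers working in Java like the earlier answers, the same "match the fields instead of splitting" idea can be sketched with Matcher.find(). The pattern is the one from the JavaScript above; the class and method names (MatchFields, fields) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchFields {
    // Same pattern as the JavaScript version: either a quoted run, or a run of
    // characters other than quote, comma and whitespace, followed by a comma
    // or the end of the string.
    static final Pattern FIELD = Pattern.compile("(\".*?\"|[^\",\\s]+)(?=\\s*,|\\s*$)");

    static List<String> fields(String str) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(str);
        while (m.find()) {
            result.add(m.group(1));
        }
        // With no matches the list is simply empty, so no null check is needed.
        return result;
    }
}
```

Called on the sample string from the JavaScript code, this yields six entries, with "d, e, f" (quotes included) as the fourth.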

