Split a Comma-Separated String With Both Quoted and Unquoted Strings

split a comma-separated string with both quoted and unquoted strings

Depending on your needs you may not be able to use a csv parser, and may in fact want to re-invent the wheel!!

You can do so with some simple regex

(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)

This will do the following:

(?:^|,) = Match expression "Beginning of line or string ,"

(\"(?:[^\"]+|\"\")*\"|[^,]*) = A numbered capture group, this will select between 2 alternatives:

  1. stuff in quotes
  2. stuff between commas

This should give you the output you are looking for.

Example code in C#

 static Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);

public static string[] SplitCSV(string input)
{

List<string> list = new List<string>();
string curr = null;
foreach (Match match in csvSplit.Matches(input))
{
curr = match.Value;
if (0 == curr.Length)
{
list.Add("");
}

list.Add(curr.TrimStart(','));
}

return list.ToArray();
}

private void button1_Click(object sender, RoutedEventArgs e)
{
Console.WriteLine(SplitCSV("111,222,\"33,44,55\",666,\"77,88\",\"99\""));
}

Warning As per @MrE's comment - if a rogue new line character appears in a badly formed csv file and you end up with an uneven ("string) you'll get catastrophic backtracking (https://www.regular-expressions.info/catastrophic.html) in your regex and your system will likely crash (like our production system did). Can easily be replicated in Visual Studio and as I've discovered will crash it. A simple try/catch will not trap this issue either.

You should use:

(?:^|,)(\"(?:[^\"])*\"|[^,]*)

instead

Splitting on comma outside quotes

You can try out this regex:

str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

This splits the string on , that is followed by an even number of double quotes. In other words, it splits on comma outside the double quotes. This will work provided you have balanced quotes in your string.

Explanation:

,           // Split on comma
(?= // Followed by
(?: // Start a non-capture group
[^"]* // 0 or more non-quote characters
" // 1 quote
[^"]* // 0 or more non-quote characters
" // 1 quote
)* // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
[^"]* // Finally 0 or more non-quotes
$ // Till the end (This is necessary, else every comma will satisfy the condition)
)

You can even type like this in your code, using (?x) modifier with your regex. The modifier ignores any whitespaces in your regex, so it's becomes more easy to read a regex broken into multiple lines like so:

String[] arr = str.split("(?x)   " + 
", " + // Split on comma
"(?= " + // Followed by
" (?: " + // Start a non-capture group
" [^\"]* " + // 0 or more non-quote characters
" \" " + // 1 quote
" [^\"]* " + // 0 or more non-quote characters
" \" " + // 1 quote
" )* " + // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
" [^\"]* " + // Finally 0 or more non-quotes
" $ " + // Till the end (This is necessary, else every comma will satisfy the condition)
") " // End look-ahead
);

C# Regex Split - commas outside quotes

You could split on all commas, that do have an even number of quotes following them , using the following Regex to find them:

",(?=(?:[^']*'[^']*')*[^']*$)"

You'd use it like

var result = Regex.Split(samplestring, ",(?=(?:[^']*'[^']*')*[^']*$)");

Java: splitting a comma-separated string but ignoring commas in quotes

Try:

public class Main { 
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}

Output:

> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"

In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.

Or, a bit friendlier for the eyes:

public class Main { 
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);

String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}

which produces the same as the first example.

EDIT

As mentioned by @MikeFHay in the comments:

I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split(), so I did:

Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

How can I split a string with string delimiter, ignoring delimiter inside quotes and producing empty strings?

This is a 2-state machine that reads each character in the string, when it encounters a double-quote it will enter a state where it will treat every subsequent character as part of the value until it encounters another double-quote. When it's in the normal state it will form a string from each character encountered until it encounters a comma and adds it to a list of strings to return:

enum State {
InQuotes,
InValue
}

List<String> result = new List<String>();

using(TextReader rdr = new StringReader( line )) {

State state = State.InValue;
StringBuilder sb = new StringBuilder();

Int32 nc; Char c;
while( (nc = rdr.Read()) != -1 ) {
c = (Char)nc;

switch( state ) {

case State.InValue:

if( c == '"' ) {
state = State.InQuotes;
} else if( c == ',' ) {
result.Add( sb.ToString() );
sb.Length = 0;
} else {
sb.Append( c );
}
break;
case State.InQuotes:

if( c == '"' ) {
state = State.InValue;
} else {
sb.Append( c );
}
break;
} // switch
} // while
if( sb.Length > 0 ) result.Add( sb.ToString() );
} // using

Split comma separated string with quotes and commas within quotes and escaped quotes within quotes

You can do it like this:

List<String> result = new ArrayList<String>();

Pattern p = Pattern.compile("(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+", Pattern.DOTALL);
Matcher m = p.matcher(checkString);

while(m.find()) {
result.add(m.group());
}

Split a string by commas but ignore commas within double-quotes using Javascript

Here's what I would do.

var str = 'a, b, c, "d, e, f", g, h';
var arr = str.match(/(".*?"|[^",\s]+)(?=\s*,|\s*$)/g);

Sample Image
/* will match:

    (
".*?" double quotes + anything but double quotes + double quotes
| OR
[^",\s]+ 1 or more characters excl. double quotes, comma or spaces of any kind
)
(?= FOLLOWED BY
\s*, 0 or more empty spaces and a comma
| OR
\s*$ 0 or more empty spaces and nothing else (end of string)
)

*/
arr = arr || [];
// this will prevent JS from throwing an error in
// the below loop when there are no matches
for (var i = 0; i < arr.length; i++) console.log('arr['+i+'] =',arr[i]);


Related Topics



Leave a reply



Submit