Split a comma-separated string with both quoted and unquoted strings
Depending on your needs, a full CSV parser may not be an option, and you may actually want to reinvent the wheel.
You can do so with a fairly simple regex:
(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)
This does the following:
(?:^|,)
= A non-capturing group that matches either the beginning of the line or string, or a comma
(\"(?:[^\"]+|\"\")*\"|[^,]*)
= A numbered capture group that selects between 2 alternatives:
- a quoted field, in which a doubled quote ("") is allowed as an escaped quote
- anything up to the next comma
This should give you the output you are looking for.
Example code in C#
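// Requires: using System; using System.Collections.Generic; using System.Text.RegularExpressions;

// The lookbehind (?<=^|,) is used here instead of (?:^|,) so that the comma is not
// consumed by the match. With (?:^|,), an input that starts with a comma produces an
// empty match at position 0 and the field after that comma is skipped.
// See the warning below about catastrophic backtracking with this quoted-field pattern.
static Regex csvSplit = new Regex("(?<=^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);

public static string[] SplitCSV(string input)
{
    List<string> list = new List<string>();
    foreach (Match match in csvSplit.Matches(input))
    {
        // Group 1 is the field itself; quoted fields keep their surrounding quotes.
        list.Add(match.Groups[1].Value);
    }
    return list.ToArray();
}

private void button1_Click(object sender, RoutedEventArgs e)
{
    // Join the fields for display; writing the array directly only prints its type name.
    Console.WriteLine(string.Join(" | ", SplitCSV("111,222,\"33,44,55\",666,\"77,88\",\"99\"")));
}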
Warning: as per @MrE's comment, if a rogue newline character appears in a badly formed CSV file and you end up with an unbalanced quote (an opening quote with no matching closing quote), you will get catastrophic backtracking (https://www.regular-expressions.info/catastrophic.html) in your regex, and your system will likely crash (as our production system did). This is easy to reproduce in Visual Studio and, as I discovered, will crash it. A simple try/catch will not trap the issue either.
You should use:
(?:^|,)(\"(?:[^\"])*\"|[^,]*)
instead. Note that this simpler pattern no longer treats a doubled quote ("") inside a quoted field as an escaped quote.
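For comparison, here is a minimal Java sketch of the same match-based idea, using the simpler quoted-field alternative from the warning above. It is not a full CSV parser. The names `CsvFieldMatcher` and `splitCsv` are made up for illustration, and anchoring with a lookbehind `(?<=^|,)` rather than `(?:^|,)` is an adjustment assumed here so that an empty first field does not cause the following field to be skipped. Quoted fields keep their surrounding quotes, and doubled quotes inside a field are not supported.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CsvFieldMatcher {
    // A field starts at the beginning of the input or right after a comma. It is
    // either a quoted run without inner quotes or a run of non-comma characters.
    private static final Pattern FIELD = Pattern.compile("(?<=^|,)(\"[^\"]*\"|[^,]*)");

    static List<String> splitCsv(String input) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            fields.add(m.group(1)); // quoted fields keep their quotes
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitCsv("111,222,\"33,44,55\",666,\"77,88\",\"99\""));
        System.out.println(splitCsv(",a,,b,")); // leading, inner and trailing empty fields
    }
}
```

Because the quoted alternative has no nested quantifier, an unclosed quote simply falls through to the non-comma alternative instead of backtracking heavily.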
Splitting on comma outside quotes
You can try out this regex:
str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
This splits the string on a comma that is followed by an even number of double quotes. In other words, it splits on commas outside the double quotes. This works provided the quotes in your string are balanced.
Explanation:
, // Split on comma
(?= // Followed by
(?: // Start a non-capture group
[^"]* // 0 or more non-quote characters
" // 1 quote
[^"]* // 0 or more non-quote characters
" // 1 quote
)* // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
[^"]* // Finally 0 or more non-quotes
$ // Till the end (This is necessary, else every comma will satisfy the condition)
)
You can even write it like this in your code, using the (?x) modifier with your regex. The modifier ignores any whitespace in your regex, so a regex broken into multiple lines becomes easier to read:
String[] arr = str.split("(?x) " +
        ", " +          // Split on comma
        "(?= " +        // Followed by
        " (?: " +       // Start a non-capture group
        " [^\"]* " +    // 0 or more non-quote characters
        " \" " +        // 1 quote
        " [^\"]* " +    // 0 or more non-quote characters
        " \" " +        // 1 quote
        " )* " +        // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
        " [^\"]* " +    // Finally 0 or more non-quotes
        " $ " +         // Till the end (This is necessary, else every comma will satisfy the condition)
        ") "            // End look-ahead
);
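To see why the balanced-quotes condition matters, here is a small sketch that runs the same expression on a well-formed line and on a malformed one. The class name `LookaheadSplitDemo` and the sample inputs are made up for illustration. With an odd number of quotes, the comma before the stray quote is no longer followed by an even count, so the split lands somewhere you probably did not intend.

```java
import java.util.Arrays;

class LookaheadSplitDemo {
    // Same expression as above: a comma followed by an even number of quotes.
    static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static String[] split(String line) {
        return line.split(COMMA_OUTSIDE_QUOTES);
    }

    public static void main(String[] args) {
        // Balanced quotes: the comma inside "b,c" is left alone.
        System.out.println(Arrays.toString(split("a,\"b,c\",d")));
        // Unbalanced quote: the first comma is followed by one quote (odd), so it is
        // not a split point, while the comma after b is followed by none (even).
        System.out.println(Arrays.toString(split("a,\"b,c")));
    }
}
```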
C# Regex Split - commas outside quotes
You could split on all commas that have an even number of quotes following them, using the following regex to find them (note that this example uses single quotes as the quote character):
",(?=(?:[^']*'[^']*')*[^']*$)"
You'd use it like this:
var result = Regex.Split(samplestring, ",(?=(?:[^']*'[^']*')*[^']*$)");
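Switching from double to single quotes only changes the character inside the negated classes. As a rough sketch in Java, the expression can be built for an arbitrary quote character. The helper name `commaOutside` is hypothetical and not part of any library, and the restriction to non-alphanumeric quote characters is an assumption made so that a plain backslash escape is always valid.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class QuoteCharSplit {
    // Builds ",(?=(?:[^q]*q[^q]*q)*[^q]*$)" for the given quote character q.
    static Pattern commaOutside(char quote) {
        if (Character.isLetterOrDigit(quote)) {
            throw new IllegalArgumentException("quote must not be a letter or digit");
        }
        String q = "\\" + quote;        // a backslash before a non-alphanumeric is a literal
        String notQ = "[^" + q + "]*";  // zero or more non-quote characters
        return Pattern.compile(",(?=(?:" + notQ + q + notQ + q + ")*" + notQ + "$)");
    }

    public static void main(String[] args) {
        // Single quotes, as in the C# example; -1 keeps trailing empty fields.
        System.out.println(Arrays.toString(commaOutside('\'').split("a,'b,c',d", -1)));
        // Double quotes give the same expression as the Java answers.
        System.out.println(Arrays.toString(commaOutside('"').split("a,\"b,c\",d", -1)));
    }
}
```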
Java: splitting a comma-separated string but ignoring commas in quotes
Try:
public class Main {
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}
Output:
> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"
In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
Or, a bit friendlier for the eyes:
public class Main {
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        String otherThanQuote = " [^\"] ";
        String quotedString = String.format(" \" %s* \" ", otherThanQuote);
        String regex = String.format("(?x) "+ // enable comments, ignore white spaces
                ", "+                         // match a comma
                "(?= "+                       // start positive look ahead
                " (?: "+                      // start non-capturing group 1
                " %s* "+                      // match 'otherThanQuote' zero or more times
                " %s "+                       // match 'quotedString'
                " )* "+                       // end group 1 and repeat it zero or more times
                " %s* "+                      // match 'otherThanQuote'
                " $ "+                        // match the end of the string
                ") ",                         // stop positive look ahead
                otherThanQuote, quotedString, otherThanQuote);
        String[] tokens = line.split(regex, -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}
which produces the same as the first example.
EDIT: As mentioned by @MikeFHay in the comments:
I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split()), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))
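The trimming mentioned in that comment is plain `String#split` behaviour: with the default limit of 0, trailing empty strings are dropped, while a negative limit keeps them. A small sketch using only the JDK (the class name `SplitLimitDemo` and the sample line are illustrative):

```java
import java.util.Arrays;

class SplitLimitDemo {
    static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // limit == 0 behaves like split(regex); a negative limit keeps trailing empty strings.
    static String[] split(String line, int limit) {
        return line.split(COMMA_OUTSIDE_QUOTES, limit);
    }

    public static void main(String[] args) {
        String line = "a,\"b,c\",,";
        // Default behaviour: the two trailing empty fields are removed.
        System.out.println(Arrays.toString(split(line, 0)));
        // Negative limit: every field is kept, including the trailing empty ones.
        System.out.println(Arrays.toString(split(line, -1)));
    }
}
```

This is why the examples above pass -1 as the second argument.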
How can I split a string with string delimiter, ignoring delimiter inside quotes and producing empty strings?
This is a two-state machine that reads each character in the string. When it encounters a double quote, it enters a state in which every subsequent character is treated as part of the value until it encounters another double quote. In the normal state, it builds a string from each character it reads until it encounters a comma, at which point it adds that string to the list of strings to return:
// Requires: using System; using System.Collections.Generic; using System.IO; using System.Text;
enum State {
    InQuotes,
    InValue
}

// 'line' is the input string to split.
List<String> result = new List<String>();
using( TextReader rdr = new StringReader( line ) ) {
    State state = State.InValue;
    StringBuilder sb = new StringBuilder();
    Int32 nc; Char c;
    while( (nc = rdr.Read()) != -1 ) {
        c = (Char)nc;
        switch( state ) {
            case State.InValue:
                if( c == '"' ) {
                    state = State.InQuotes;
                } else if( c == ',' ) {
                    result.Add( sb.ToString() );
                    sb.Length = 0;
                } else {
                    sb.Append( c );
                }
                break;
            case State.InQuotes:
                if( c == '"' ) {
                    state = State.InValue;
                } else {
                    sb.Append( c );
                }
                break;
        } // switch
    } // while
    // Always add the last value, even when it is empty, so that a trailing comma
    // produces a trailing empty string.
    result.Add( sb.ToString() );
} // using
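The same state machine can be sketched in Java. This is a rough port under a few assumptions, not a drop-in replacement: the class and method names are made up, the delimiter is a single comma, quotes are stripped from the values, and there is no support for escaped quotes. It always adds the last value rather than checking the buffer length, so a trailing comma yields a trailing empty string; that choice is an assumption about the desired output.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteStateSplitter {
    private enum State { IN_VALUE, IN_QUOTES }

    static List<String> split(String line) {
        List<String> result = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        State state = State.IN_VALUE;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case IN_VALUE:
                    if (c == '"') {
                        state = State.IN_QUOTES;    // opening quote is not part of the value
                    } else if (c == ',') {
                        result.add(sb.toString());  // delimiter ends the current value
                        sb.setLength(0);
                    } else {
                        sb.append(c);
                    }
                    break;
                case IN_QUOTES:
                    if (c == '"') {
                        state = State.IN_VALUE;     // closing quote
                    } else {
                        sb.append(c);               // commas are kept literally here
                    }
                    break;
            }
        }
        result.add(sb.toString()); // last value, added even when it is empty
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("a,\"b,c\",,d"));
        System.out.println(split("x,"));
    }
}
```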
Split comma separated string with quotes and commas within quotes and escaped quotes within quotes
You can do it like this:
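// Requires: import java.util.ArrayList; import java.util.List;
//           import java.util.regex.Matcher; import java.util.regex.Pattern;
// 'checkString' is the input string to split.
List<String> result = new ArrayList<String>();
Pattern p = Pattern.compile("(?>[^,'\"]++|(['\"])(?>[^\"'\\\\]++|\\\\.|(?!\\1)[\"'])*\\1|(?<=,|^)\\s*(?=,|$))+", Pattern.DOTALL);
Matcher m = p.matcher(checkString);
while(m.find()) {
    result.add(m.group());
}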
Split a string by commas but ignore commas within double-quotes using Javascript
Here's what I would do.
var str = 'a, b, c, "d, e, f", g, h';
var arr = str.match(/(".*?"|[^",\s]+)(?=\s*,|\s*$)/g);
/* will match:
(
".*?" double quote + as few characters as possible + double quote
| OR
[^",\s]+ 1 or more characters excl. double quotes, comma or spaces of any kind
)
(?= FOLLOWED BY
\s*, 0 or more empty spaces and a comma
| OR
\s*$ 0 or more empty spaces and nothing else (end of string)
)
*/
arr = arr || [];
// this will prevent JS from throwing an error in
// the below loop when there are no matches
for (var i = 0; i < arr.length; i++) console.log('arr['+i+'] =',arr[i]);
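The same expression behaves the same way in Java, which makes it easy to probe its limits. The sketch below uses made-up names (`TokenMatchDemo`, `tokens`) and sample inputs. It shows that the quoted group is returned whole, but also that an unquoted value containing a space is only partly matched and that empty values between commas are skipped, so this approach suits simple, well-behaved input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenMatchDemo {
    // Same expression as the JavaScript version above.
    private static final Pattern TOKEN =
            Pattern.compile("(\".*?\"|[^\",\\s]+)(?=\\s*,|\\s*$)");

    static List<String> tokens(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Quoted group stays together, quotes included.
        System.out.println(tokens("a, b, c, \"d, e, f\", g, h"));
        // Unquoted value with a space: only the part next to the comma is matched.
        System.out.println(tokens("hello world, x"));
        // Empty value between commas produces no token at all.
        System.out.println(tokens("a,,b"));
    }
}
```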