C# Regex Split - Commas Outside Quotes

C# Regex Split - commas outside quotes

You could split on all commas, that do have an even number of quotes following them , using the following Regex to find them:

",(?=(?:[^']*'[^']*')*[^']*$)"

You'd use it like

var result = Regex.Split(samplestring, ",(?=(?:[^']*'[^']*')*[^']*$)");

split a comma-separated string with both quoted and unquoted strings

Depending on your needs you may not be able to use a csv parser, and may in fact want to re-invent the wheel!!

You can do so with some simple regex

(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)

This will do the following:

(?:^|,) = Match expression "Beginning of line or string ,"

(\"(?:[^\"]+|\"\")*\"|[^,]*) = A numbered capture group, this will select between 2 alternatives:

  1. stuff in quotes
  2. stuff between commas

This should give you the output you are looking for.

Example code in C#

 static Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);

public static string[] SplitCSV(string input)
{

List<string> list = new List<string>();
string curr = null;
foreach (Match match in csvSplit.Matches(input))
{
curr = match.Value;
if (0 == curr.Length)
{
list.Add("");
}

list.Add(curr.TrimStart(','));
}

return list.ToArray();
}

private void button1_Click(object sender, RoutedEventArgs e)
{
Console.WriteLine(SplitCSV("111,222,\"33,44,55\",666,\"77,88\",\"99\""));
}

Warning As per @MrE's comment - if a rogue new line character appears in a badly formed csv file and you end up with an uneven ("string) you'll get catastrophic backtracking (https://www.regular-expressions.info/catastrophic.html) in your regex and your system will likely crash (like our production system did). Can easily be replicated in Visual Studio and as I've discovered will crash it. A simple try/catch will not trap this issue either.

You should use:

(?:^|,)(\"(?:[^\"])*\"|[^,]*)

instead

How can I split by commas while ignoring any comma that's inside quotes?

Update:

I think the final version in a line should be:

var cells = (rows[i] + ',').split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/).slice(1).reduce((a, b) => (a.length > 0 && a[a.length - 1].length < 4) ? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]] : [...a, [b]], []).map(e => e.reduce((a, b) => a !== undefined ? a : b, undefined))

or put it more beautifully:

var cells = (rows[i] + ',')
.split(/(?: *?([^",]+?) *?,|" *?(.+?)" *?,|( *?),)/)
.slice(1)
.reduce(
(a, b) => (a.length > 0 && a[a.length - 1].length < 4)
? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
: [...a, [b]],
[],
)
.map(
e => e.reduce(
(a, b) => a !== undefined ? a : b, undefined,
),
)
;

This is rather long, but still looks purely functional. Let me explain it:

First, the regular expression part. Basically, a segment you want may fall into 3 possibilities:

  1. *?([^",]+?) *?,, which is a string without " or , surrounded with spaces, followed by a ,.
  2. " *?(.+?)" *?,, which is a string, surrounded with a pair of quotes and an indefinite number of spaces beyond the quotes, followed by a ,.
  3. ( *?),, which is an indefinite number of spaces, followed by a ','.

So splitting by a non-capturing group of a union of these three will basically get us to the answer.

Recall that when splitting with a regular expression, the resulting array consists of:

  1. Strings separated by the separator (the regular expression)
  2. All the capturing groups in the separator

In our case, the separators fill the whole string, so the strings separated are all empty strings, except that last desired part, which is left out because there is no , following it. Thus the resulting array should be like:

  1. An empty string
  2. Three strings, representing the three capturing groups of the first separator matched
  3. An empty string
  4. Three strings, representing the three capturing groups of the second separator matched
  5. ...
  6. An empty string
  7. The last desired part, left alone

So why simply adding a , at the end so that we can get a perfect pattern? This is how (rows[i] + ',') comes about.

In this case the resulting array becomes capturing groups separated by empty strings. Removing the first empty string, they will appear in a group of 4 as [ 1st capturing group, 2nd capturing group, 3rd capturing group, empty string ].

What the reduce block does is exactly grouping them into groups of 4:

  .reduce(
(a, b) => (a.length > 0 && a[a.length - 1].length < 4)
? [...a.slice(0, a.length - 1), [...a[a.length - 1], b]]
: [...a, [b]],
[],
)

And finally, find the first non-undefined elements (an unmatched capturing group will appear as undefined. Our three patterns are exclusive in that any 2 of them cannot be matched simultaneously. So there is exactly 1 such element in each group) in each group which are precisely the desired parts:

  .map(
e => e.reduce(
(a, b) => a !== undefined ? a : b, undefined,
),
)

This completes the solution.


I think the following should suffice:

var cells = rows[i].split(/([^",]+?|".+?") *, */).filter(e => e)

or if you don't want the quotes:

var cells = rows[i].split(/(?:([^",]+?)|"(.+?)") *, */).filter(e => e)

Replace all commas outside parentheses AND quotes with REGEX or/and C#

You can use

Regex.Replace(text, @"(\([^()]*\)|'[^']*')|\s*,", m => 
m.Groups[1].Success ? m.Value : " DESC,")

Details:

  • ( - Group 1 start (the capturing group is necessary to restore the match later in the resulting string):
    • \([^()]*\) - a ( char, then any zero or more chars other than ( and ) and then a ) char
  • | - or
    • '[^']*' - ', zero or more chars other than ', and a ' char
  • ) - end of the capturing group
  • | - or
  • \s*, - zero or more whitespaces and then a , char.

See the C# demo:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class Test
{
public static void Main()
{
var text = "ISNULL(d.Type, 0), d.Subject + ', ' + d.Name, d.Something,array_position(ARRAY['f', 'p', 'i', 'a']::varchar[], x_field), test";
var pattern = @"(\([^()]*\)|'[^']*')|\s*,";
var result = Regex.Replace(text, pattern, m => m.Groups[1].Success ? m.Value : " DESC,");
Console.WriteLine(result);
}
}

Output:

ISNULL(d.Type, 0) DESC, d.Subject + ', ' + d.Name DESC, d.Something DESC,array_position(ARRAY['f', 'p', 'i', 'a']::varchar[], x_field) DESC, test

Splitting a string and ignoring the delimiter inside quotes

You are better off with a parser, like those mentioned in the comments. That said, it's possible to do it with regex in the following way:

,(?=(?:[^"]*"[^"]*")*[^"]*$)

The positive lookahead ((?= ... )) ensures that there is an even number of quotes ahead of the comma to split on (i.e. either they occur in pairs, or there are none).

[^"]* matches non-quote characters.

Regex split for comma and double quotes CSV format

Use a real csv parser instead of using string methods or regex. You could use the TextFieldParser which is the only one available in the framework directly:

var allLineFields = new List<string[]>();
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(new StringReader(str)))
{
parser.Delimiters = new string[] { "," };
parser.HasFieldsEnclosedInQuotes = true; // <--- !!!
string[] lineFields;
while ((lineFields = parser.ReadFields()) != null)
{
allLineFields.Add(lineFields);
}
}

You need to add a reference to the Microsoft.VisualBasic dll to your project.

There are other available: Parsing CSV files in C#, with header

How can I Split(',') a string while ignore commas in between quotes?

This is a fairly straight forward CSV Reader implementation we use in a few projects here. Easy to use and handles those cases you are talking about.

First the CSV Class

public static class Csv
{
public static string Escape(string s)
{
if (s.Contains(QUOTE))
s = s.Replace(QUOTE, ESCAPED_QUOTE);

if (s.IndexOfAny(CHARACTERS_THAT_MUST_BE_QUOTED) > -1)
s = QUOTE + s + QUOTE;

return s;
}

public static string Unescape(string s)
{
if (s.StartsWith(QUOTE) && s.EndsWith(QUOTE))
{
s = s.Substring(1, s.Length - 2);

if (s.Contains(ESCAPED_QUOTE))
s = s.Replace(ESCAPED_QUOTE, QUOTE);
}

return s;
}


private const string QUOTE = "\"";
private const string ESCAPED_QUOTE = "\"\"";
private static char[] CHARACTERS_THAT_MUST_BE_QUOTED = { ',', '"', '\n' };

}

Then a pretty nice Reader implementation - If you need it. You should be able to do what you need with just the CSV class above.

public sealed class CsvReader : System.IDisposable
{
public CsvReader(string fileName)
: this(new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
}

public CsvReader(Stream stream)
{
__reader = new StreamReader(stream);
}

public System.Collections.IEnumerable RowEnumerator
{
get
{
if (null == __reader)
throw new System.ApplicationException("I can't start reading without CSV input.");

__rowno = 0;
string sLine;
string sNextLine;

while (null != (sLine = __reader.ReadLine()))
{
while (rexRunOnLine.IsMatch(sLine) && null != (sNextLine = __reader.ReadLine()))
sLine += "\n" + sNextLine;

__rowno++;
string[] values = rexCsvSplitter.Split(sLine);

for (int i = 0; i < values.Length; i++)
values[i] = Csv.Unescape(values[i]);

yield return values;
}

__reader.Close();
}

}

public long RowIndex { get { return __rowno; } }

public void Dispose()
{
if (null != __reader) __reader.Dispose();
}

//============================================


private long __rowno = 0;
private TextReader __reader;
private static Regex rexCsvSplitter = new Regex(@",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))");
private static Regex rexRunOnLine = new Regex(@"^[^""]*(?:""[^""]*""[^""]*)*""[^""]*$");

}

Then you can use it like this.

var reader = new CsvReader(new FileStream(file, FileMode.Open));

Note: This would open an existing CSV file, but can be modified fairly easily to take a string[] like you need.



Related Topics



Leave a reply



Submit