Splitting a csv file with quotes as text-delimiter using String.split()
public static void main(String[] args) {
String s = "Sachin,,M,\"Maths,Science,English\",Need to improve in these subjects.";
String[] splitted = s.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
System.out.println(Arrays.toString(splitted));
}
Output:
[Sachin, , M, "Maths,Science,English", Need to improve in these subjects.]
Java: splitting a comma-separated string but ignoring commas in quotes
Try:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
Output:
> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"
In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
Or, a bit friendlier for the eyes:
public class Main {
public static void main(String[] args) {
String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
String otherThanQuote = " [^\"] ";
String quotedString = String.format(" \" %s* \" ", otherThanQuote);
String regex = String.format("(?x) "+ // enable comments, ignore white spaces
", "+ // match a comma
"(?= "+ // start positive look ahead
" (?: "+ // start non-capturing group 1
" %s* "+ // match 'otherThanQuote' zero or more times
" %s "+ // match 'quotedString'
" )* "+ // end group 1 and repeat it zero or more times
" %s* "+ // match 'otherThanQuote'
" $ "+ // match the end of the string
") ", // stop positive look ahead
otherThanQuote, quotedString, otherThanQuote);
String[] tokens = line.split(regex, -1);
for(String t : tokens) {
System.out.println("> "+t);
}
}
}
which produces the same as the first example.
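The "even number of quotes" rule can also be applied without a regex, by toggling a flag each time a quote is seen and splitting only when the flag is off. The following is a minimal sketch, not part of the answer above; the class and method names are made up for illustration, and it assumes the quotes in the line are balanced (with an unbalanced quote it can give different tokens than the lookahead regex, which counts quotes ahead of the comma rather than behind it).

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplit {
    // Splits on commas that are outside double quotes.
    // Quotes are kept in the tokens, as in the regex-based examples.
    static List<String> split(String line) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;          // every quote toggles the state
                current.append(c);
            } else if (c == ',' && !inQuotes) {
                tokens.add(current.toString()); // comma outside quotes ends a token
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        tokens.add(current.toString());         // last token (may be empty)
        return tokens;
    }

    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        for (String t : split(line)) {
            System.out.println("> " + t);
        }
    }
}
```

Because it makes a single pass over the line, this avoids the lookahead rescanning the remainder of the string at every comma.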
EDIT
As mentioned by @MikeFHay in the comments:
I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split()), so I did:
Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))
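The trimming mentioned in that comment is easy to see with plain String#split(): without a limit argument, trailing empty strings are dropped, while a negative limit keeps them. A small sketch (the class name and the sample line are made up for illustration):

```java
import java.util.Arrays;

public class SplitLimitDemo {
    // Same pattern as above: a comma followed by an even number of quotes.
    static final String REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    public static void main(String[] args) {
        String line = "a,\"b,c\",,";   // two empty fields at the end

        // Default limit (0): trailing empty strings are removed.
        System.out.println(Arrays.toString(line.split(REGEX)));      // [a, "b,c"]

        // Negative limit: all fields are kept, including trailing empty ones.
        System.out.println(Arrays.toString(line.split(REGEX, -1)));  // [a, "b,c", , ]
    }
}
```

This is why the examples above pass -1 as the second argument.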
Splitting a CSV File in Java that has extra commas and extra quotes in them
Thanks to Andreas and Tamas Hegedus for helping you clarify the question! Try:
br = new BufferedReader(new FileReader(customerListAllCustomers));
while ((line = br.readLine()) != null) {
// one column, so don't need to use comma as separator
// as a regex, "\\\"" would match a bare quote, so the unescape step uses the literal replace()
String line2 = line.replaceAll("^\"","").replaceAll("\"$","").replace("\\\"","\"");
System.out.println(line2);
}
The replaceAll calls strip leading quotes (^\") and trailing quotes (\"$), and then unescape the remaining quotes (\\\").
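As a self-contained illustration of those three steps, here is a small sketch. It assumes the inner quotes are escaped with a backslash (\"), which is how the description above reads; the class name, method name and sample line are made up. The last step uses String.replace (a literal replacement), since as a regex the pattern \" would match a plain quote rather than a backslash followed by a quote.

```java
public class UnquoteDemo {
    // Strips one leading and one trailing quote, then turns \" into ".
    static String unquote(String line) {
        return line
                .replaceAll("^\"", "")     // leading quote
                .replaceAll("\"$", "")     // trailing quote
                .replace("\\\"", "\"");    // literal backslash-quote -> quote
    }

    public static void main(String[] args) {
        // The raw line is: "He said \"hi\", then left"
        String raw = "\"He said \\\"hi\\\", then left\"";
        System.out.println(unquote(raw)); // He said "hi", then left
    }
}
```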
split a comma-separated string with both quoted and unquoted strings
Depending on your needs you may not be able to use a csv parser, and may in fact want to re-invent the wheel!!
You can do so with some simple regex
(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)
This will do the following:
(?:^|,) = A non-capturing group that matches the beginning of the line or a comma
(\"(?:[^\"]+|\"\")*\"|[^,]*) = A numbered capture group, this will select between 2 alternatives:
- stuff in quotes
- stuff between commas
This should give you the output you are looking for.
Example code in C#
static Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);
public static string[] SplitCSV(string input)
{
List<string> list = new List<string>();
string curr = null;
foreach (Match match in csvSplit.Matches(input))
{
curr = match.Value;
if (0 == curr.Length)
{
list.Add("");
}
list.Add(curr.TrimStart(','));
}
return list.ToArray();
}
private void button1_Click(object sender, RoutedEventArgs e)
{
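// Join the fields for display; writing the array directly would only print "System.String[]"
Console.WriteLine(string.Join(" | ", SplitCSV("111,222,\"33,44,55\",666,\"77,88\",\"99\"")));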
}
Warning: as per @MrE's comment, if a rogue newline character appears in a badly formed CSV file and you end up with an unbalanced quote (an opening " with no closing one), the regex runs into catastrophic backtracking (https://www.regular-expressions.info/catastrophic.html) and your system will likely crash (like our production system did). This is easy to reproduce in Visual Studio, which, as I discovered, it will also crash. A simple try/catch will not trap this issue either.
You should use:
(?:^|,)(\"(?:[^\"])*\"|[^,]*)
instead
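For Java readers, the simplified pattern can be applied with Matcher.find() and capture group 1. This is only a sketch with made-up names, and it differs from the answer above in one respect: instead of the (?:^|,) prefix it prepends a comma to the input, so every field, including an empty first one, is introduced by a delimiter and no match is ever empty. It assumes well-formed single-line input and, like the simplified pattern, does not handle doubled quotes ("") inside a quoted field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvFieldMatcher {
    // A comma, then either a quoted run without inner quotes or a run without commas.
    private static final Pattern FIELD = Pattern.compile(",(\"[^\"]*\"|[^,]*)");

    static List<String> splitCsv(String input) {
        List<String> fields = new ArrayList<>();
        // Prepending a comma gives the first field a leading delimiter too.
        Matcher m = FIELD.matcher("," + input);
        while (m.find()) {
            fields.add(m.group(1)); // field text; quotes are kept
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitCsv("111,222,\"33,44,55\",666,\"77,88\",\"99\""));
        // [111, 222, "33,44,55", 666, "77,88", "99"]
    }
}
```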
How to split csv whose columns may contain comma
Use the Microsoft.VisualBasic.FileIO.TextFieldParser class. This will handle parsing a delimited file, TextReader or Stream where some fields are enclosed in quotes and some are not.
For example:
using Microsoft.VisualBasic.FileIO;
string csv = "2,1016,7/31/2008 14:22,Geoff Dalgas,6/5/2011 22:21,http://stackoverflow.com,\"Corvallis, OR\",7679,351,81,b437f461b3fd27387c5d8ab47a293d35,34";
TextFieldParser parser = new TextFieldParser(new StringReader(csv));
// You can also read from a file
// TextFieldParser parser = new TextFieldParser("mycsvfile.csv");
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
string[] fields;
while (!parser.EndOfData)
{
fields = parser.ReadFields();
foreach (string field in fields)
{
Console.WriteLine(field);
}
}
parser.Close();
This should result in the following output:
2
1016
7/31/2008 14:22
Geoff Dalgas
6/5/2011 22:21
http://stackoverflow.com
Corvallis, OR
7679
351
81
b437f461b3fd27387c5d8ab47a293d35
34
See Microsoft.VisualBasic.FileIO.TextFieldParser for more information.
You need to add a reference to Microsoft.VisualBasic in the Add References .NET tab.
Delimit a string by character unless within quotation marks C#
Copied from my comment: Use an available CSV parser like VisualBasic.FileIO.TextFieldParser.
As requested, here is an example for the TextFieldParser:
var allLineFields = new List<string[]>();
string sampleText = "Method,\"value1,value2\"";
var reader = new System.IO.StringReader(sampleText);
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader))
{
parser.Delimiters = new string[] { "," };
parser.HasFieldsEnclosedInQuotes = true; // <--- !!!
string[] fields;
while ((fields = parser.ReadFields()) != null)
{
allLineFields.Add(fields);
}
}
This list now contains a single string[] with two strings. I have used a StringReader because this sample uses a string; if the source is a file, use a StreamReader (e.g. via File.OpenText).
Split string on comma and ignore comma in double quotes
I think you can use the regex ,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$) from here: Splitting on comma outside quotes
You can test the pattern here: http://regexr.com/3cddl
Java code example:
public static void main(String[] args) {
String txt = "0, 2, 23131312,\"This, is a message\", 1212312";
System.out.println(Arrays.toString(txt.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)")));
}
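Note that the tokens produced by this split keep their leading spaces and their surrounding quotes. If clean values are wanted, a small post-processing step can handle that. A minimal sketch, with a made-up class and method name, assuming a quoted field is wrapped in exactly one pair of quotes:

```java
import java.util.ArrayList;
import java.util.List;

public class CleanTokens {
    static final String REGEX = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    static List<String> splitAndClean(String txt) {
        List<String> out = new ArrayList<>();
        for (String token : txt.split(REGEX, -1)) {
            String t = token.trim();                    // drop spaces around the field
            if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
                t = t.substring(1, t.length() - 1);     // drop the surrounding quotes
            }
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        String txt = "0, 2, 23131312,\"This, is a message\", 1212312";
        for (String t : splitAndClean(txt)) {
            System.out.println("> " + t);
        }
    }
}
```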