Tokenizing a String But Ignoring Delimiters Within Quotes

Tokenizing a String but ignoring delimiters within quotes

It's much easier to use a java.util.regex.Matcher and call find() than to use any kind of split in this kind of scenario.

That is, instead of defining the pattern for the delimiter between the tokens, you define the pattern for the tokens themselves.

Here's an example:

    String text = "1 2 \"333 4\" 55 6    \"77\" 8 999";
    // 1 2 "333 4" 55 6    "77" 8 999

    String regex = "\"([^\"]*)\"|(\\S+)";

    Matcher m = Pattern.compile(regex).matcher(text);
    while (m.find()) {
        if (m.group(1) != null) {
            System.out.println("Quoted [" + m.group(1) + "]");
        } else {
            System.out.println("Plain [" + m.group(2) + "]");
        }
    }

The above prints (as seen on ideone.com):

Plain [1]
Plain [2]
Quoted [333 4]
Plain [55]
Plain [6]
Quoted [77]
Plain [8]
Plain [999]

The pattern is essentially:

"([^"]*)"|(\S+)
 \_____/  \___/
    1       2

There are 2 alternates:

  • The first alternate matches the opening double quote, a sequence of anything but double quote (captured in group 1), then the closing double quote
  • The second alternate matches any sequence of non-whitespace characters, captured in group 2
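  • The order of the alternates matters in this pattern: the quoted alternate must come first, because \S+ would otherwise match the opening double quote as part of a plain token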
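
To see the effect of the ordering, the sketch below runs the same input through both orderings of the alternates. The class name AlternationOrder and the helper tokenize are illustrative only and do not come from the original answer; the helper returns the raw matched text, so quoted tokens keep their quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrder {
    // Collects every match of the given pattern, keeping the raw matched text.
    static List<String> tokenize(String regex, String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "1 \"333 4\" 55";
        // Quoted alternate first: the quoted segment stays in one piece.
        System.out.println(tokenize("\"([^\"]*)\"|(\\S+)", text));
        // Plain alternate first: \S+ consumes the opening quote and stops at the space.
        System.out.println(tokenize("(\\S+)|\"([^\"]*)\"", text));
    }
}
```

With the quoted alternate first, the tokens are 1, "333 4" and 55; with the order reversed, the quoted segment is broken into "333 and 4".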

Note that this does not handle escaped double quotes within quoted segments. If you need to do this, then the pattern becomes more complicated, but the Matcher solution still works.
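
As a rough sketch of what such a more complicated pattern might look like, the example below assumes that a backslash is the escape character (the text above does not say which escaping convention is in use). The class name EscapedQuoteTokenizer and its tokenize method are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteTokenizer {
    // Quoted alternate: inside the quotes, accept either a backslash followed by
    // any character, or any character that is neither a quote nor a backslash.
    // The plain alternate is unchanged. Backslash-escaping is an assumption.
    private static final Pattern TOKEN =
            Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"|(\\S+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                // Drop the escaping backslash: \" becomes " and \\ becomes \
                tokens.add(m.group(1).replaceAll("\\\\(.)", "$1"));
            } else {
                tokens.add(m.group(2));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Input text: say "he said \"hi\" twice" done
        System.out.println(tokenize("say \"he said \\\"hi\\\" twice\" done"));
    }
}
```

For the sample input this prints [say, he said "hi" twice, done]. The find() loop is the same as before; only the quoted alternate and the unescaping step are new.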

References

  • regular-expressions.info/Brackets for Grouping and Capturing, Alternation with Vertical Bar, Character Class, Repetition with Star and Plus

See also

  • regular-expressions.info/Examples - Programmer - Strings - for pattern with escaped quotes

Appendix

Note that StringTokenizer is a legacy class. It's recommended to use java.util.Scanner or String.split, or of course java.util.regex.Matcher for most flexibility.
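
For plain whitespace-separated input with no quoting, the two simpler alternatives might be used roughly as follows. This is only a sketch: the class name PlainTokenizers and its method names are illustrative, and neither approach is quote-aware, which is why the Matcher solution is used above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class PlainTokenizers {
    // String.split with a regex delimiter: one or more whitespace characters.
    static List<String> withSplit(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Scanner uses whitespace as its default delimiter.
    static List<String> withScanner(String text) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "1 2  55 6";
        System.out.println(withSplit(text));
        System.out.println(withScanner(text));
    }
}
```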

Related questions

  • Difference between a Deprecated and Legacy API?
  • Scanner vs. StringTokenizer vs. String.Split
  • Validating input using java.util.Scanner - has many examples

String Tokenizer (Double Quotes and Whitespace)

    String line = "addPhoto \"DSC_018.jpg\" \"DSC_018\" \"Colorado Springs\"";
    String[] pieces = line.split(" \"");

    for (String p : pieces) {
        System.out.println(p.replaceAll("\"", ""));
    }

String Tokenizer : split string by comma and ignore comma in double quotes

Use a CSV parser like OpenCSV to take care of things like commas in quoted elements, values that span multiple lines etc. automatically. You can use the library to serialize your text back as CSV as well.

// Assumes OpenCSV 3.x or later; older releases used the au.com.bytecode.opencsv package.
import com.opencsv.CSVReader;
import java.io.StringReader;

String str = "value1, value2, value3, value4, \"value5, 1234\", " +
        "value6, value7, \"value8\", value9, \"value10, 123.23\"";

CSVReader reader = new CSVReader(new StringReader(str));

String[] tokens;
while ((tokens = reader.readNext()) != null) {
    System.out.println(tokens[0]); // value1
    System.out.println(tokens[4]); // value5, 1234
    System.out.println(tokens[9]); // value10, 123.23
}

Java/Kotlin: Tokenize a string ignoring the contents of nested quotes

You can achieve that with the following regex: ["']+[^"']+?["']+. Using that pattern you retrieve the indices where you want to split like this:

val indices = Regex(pattern).findAll(this).map{ listOf(it.range.start, it.range.endInclusive) }.flatten().toMutableList()

The rest is building the list out of substrings. Here is the complete function:

fun String.splitByPattern(pattern: String): List<String> {

    val indices = Regex(pattern).findAll(this).map{ listOf(it.range.start, it.range.endInclusive) }.flatten().toMutableList()

    var lastIndex = 0
    return indices.mapIndexed { i, ele ->

        // Even positions hold a match start (the exclusive end of the preceding text);
        // odd positions hold an inclusive match end, so add 1 to make it exclusive.
        val end = if(i % 2 == 0) ele else ele + 1

        substring(lastIndex, end).apply {
            lastIndex = end
        }
    }
}

Usage:

val str = """
this "'"is a possible option"'" and ""so is this"" and '''this one too''' and even ""mismatched quotes"
""".trim()

println(str.splitByPattern("""["']+[^"']+?["']+"""))

Output:

[this , "'"is a possible option"'", and , ""so is this"", and , '''this one too''', and even , ""mismatched quotes"]

Java: splitting a comma-separated string but ignoring commas in quotes

Try:

public class Main {
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        for (String t : tokens) {
            System.out.println("> " + t);
        }
    }
}

Output:

> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"

In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.

Or, a bit friendlier for the eyes:

public class Main {
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

        String otherThanQuote = " [^\"] ";
        String quotedString = String.format(" \" %s* \" ", otherThanQuote);
        String regex = String.format("(?x) "+ // enable comments, ignore white spaces
                ",                         "+ // match a comma
                "(?=                       "+ // start positive look ahead
                "  (?:                     "+ // start non-capturing group 1
                "    %s*                   "+ // match 'otherThanQuote' zero or more times
                "    %s                    "+ // match 'quotedString'
                "  )*                      "+ // end group 1 and repeat it zero or more times
                "  %s*                     "+ // match 'otherThanQuote'
                "  $                       "+ // match the end of the string
                ")                         ", // stop positive look ahead
                otherThanQuote, quotedString, otherThanQuote);

        String[] tokens = line.split(regex, -1);
        for (String t : tokens) {
            System.out.println("> " + t);
        }
    }
}

which produces the same output as the first example.
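
Both listings pass -1 as the second argument to split. A minimal sketch of what that argument changes is shown here; the class name SplitLimitDemo and the constant COMMA_OUTSIDE_QUOTES are illustrative names, and the sample line is made up for the demonstration.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Same regex as above: a comma followed by an even number of quotes.
    static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    public static void main(String[] args) {
        // Input text: a,"b,c",,
        String line = "a,\"b,c\",,";
        // Default limit (0): trailing empty strings are removed from the result.
        System.out.println(Arrays.toString(line.split(COMMA_OUTSIDE_QUOTES)));
        // Negative limit: trailing empty strings are kept.
        System.out.println(Arrays.toString(line.split(COMMA_OUTSIDE_QUOTES, -1)));
    }
}
```

The first call returns two tokens, while the second returns four, the last two being empty strings. When empty trailing fields are meaningful, the negative limit is the safer choice.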

EDIT

As mentioned by @MikeFHay in the comments:

I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split()), so I did:

Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

C++ Tokenize a string with spaces and quotes

No library is needed. A simple iteration can do the task (if it is as simple as you describe).

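#include <iostream>
#include <string>
using namespace std;

int main() {
    string str = "add string \"this is a string with space!\"";

    for (size_t i = 0; i < str.length(); i++) {
        char c = str[i];
        if (c == ' ') {
            cout << endl;   // a space outside quotes ends the current token
        } else if (c == '\"') {
            i++;            // skip the opening quote
            // print up to the closing quote; the bounds check guards against an unterminated quote
            while (i < str.length() && str[i] != '\"') { cout << str[i]; i++; }
        } else {
            cout << c;
        }
    }
}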

that outputs

add
string
this is a string with space!

Tokenize CSV line escape double quotes

Don't parse CSV yourself; use a library. Even a format as simple as CSV has nuances: fields can be quoted or unquoted, the file may or may not have a header, and so on. Besides that, you have to test and maintain the code you've written. So writing less code and reusing libraries is good.

There are plenty of CSV libraries for Java:

  • Apache Commons CSV
  • OpenCSV
  • Super CSV
  • Univocity
  • flatpack

IMHO, the first two are the most popular.

Here is an example for Apache Commons CSV:

final Reader in = new FileReader("counties.csv");
final Iterable<CSVRecord> records = CSVFormat.DEFAULT.parse(in);

for (final CSVRecord record : records) { // Simply iterate over the records via a foreach loop. All the parsing is handled for you
    String populationByIndex = record.get(7); // Indexes are zero-based
    String populationByName = record.get("population"); // Or, if your file has headers, you can just use them (the format must be set up to read the header row)

    … // Do whatever you want with the population
}

Look how easy it is! And it will be similar with other parsers.

Splitting on comma outside quotes

You can try out this regex:

str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

This splits the string on , that is followed by an even number of double quotes. In other words, it splits on comma outside the double quotes. This will work provided you have balanced quotes in your string.

Explanation:

,                         // Split on comma
(?=                       // Followed by
   (?:                    // Start a non-capture group
     [^"]*                // 0 or more non-quote characters
     "                    // 1 quote
     [^"]*                // 0 or more non-quote characters
     "                    // 1 quote
   )*                     // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
   [^"]*                  // Finally 0 or more non-quotes
   $                      // Till the end (This is necessary, else every comma will satisfy the condition)
)

You can even write it like this in your code, using the (?x) modifier with your regex. The modifier ignores any whitespace in your regex, so it becomes easier to read a regex broken into multiple lines like so:

String[] arr = str.split("(?x)   " +
        ",          " +   // Split on comma
        "(?=        " +   // Followed by
        "  (?:      " +   // Start a non-capture group
        "    [^\"]* " +   // 0 or more non-quote characters
        "    \"     " +   // 1 quote
        "    [^\"]* " +   // 0 or more non-quote characters
        "    \"     " +   // 1 quote
        "  )*       " +   // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
        "  [^\"]*   " +   // Finally 0 or more non-quotes
        "  $        " +   // Till the end (This is necessary, else every comma will satisfy the condition)
        ")          "     // End look-ahead
);
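
The caveat about balanced quotes can be made concrete with a short sketch. The class name UnbalancedQuoteDemo and the helper splitOutsideQuotes are illustrative names, and the two input strings are made up for the demonstration.

```java
import java.util.Arrays;

class UnbalancedQuoteDemo {
    // Same regex as above: split on a comma followed by an even number of quotes.
    static String[] splitOutsideQuotes(String s) {
        return s.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
    }

    public static void main(String[] args) {
        // Balanced input: a,"b,c",d
        System.out.println(Arrays.toString(splitOutsideQuotes("a,\"b,c\",d")));
        // Unbalanced input: a,"b,c   (the closing quote is missing)
        System.out.println(Arrays.toString(splitOutsideQuotes("a,\"b,c")));
    }
}
```

With balanced quotes the result is a, "b,c" and d. With the closing quote missing, the first comma is followed by an odd number of quotes and is skipped, while the comma that was meant to be protected is split on, giving a,"b and c.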

