How to Make My Split Work Only on One Real Line and Be Capable to Skip Quoted Parts of String

How to make my split work only on one real line and be capable to skip quoted parts of string?

The following code:

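#include <algorithm>
#include <cassert>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

using namespace std;

// Returns an iterator to the first non-empty symbol that matches s at position i,
// or symbols.end() if none matches.
vector<string>::const_iterator matchSymbol(const string & s, string::const_iterator i, const vector<string> & symbols)
{
    vector<string>::const_iterator testSymbol;
    for (testSymbol=symbols.begin();testSymbol!=symbols.end();++testSymbol) {
        if (!testSymbol->empty()) {
            if (0==testSymbol->compare(0,testSymbol->size(),&(*i),testSymbol->size())) {
                return testSymbol;
            }
        }
    }

    assert(testSymbol==symbols.end());
    return testSymbol;
}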

// Splits s on any of delims, stops at the first of terms, and keeps
// matched double-quoted sections together as single tokens.
vector<string> split(const string& s, const vector<string> & delims, const vector<string> & terms, const bool keep_empty = true)
{
    vector<string> result;
    if (delims.empty()) {
        result.push_back(s);
        return result;
    }

    bool checkForDelim=true; // false while inside a matched pair of quotes

    string temp;
    string::const_iterator i=s.begin();
    while (i!=s.end()) {
        vector<string>::const_iterator testTerm=terms.end();
        vector<string>::const_iterator testDelim=delims.end();

        if (checkForDelim) {
            testTerm=matchSymbol(s,i,terms);
            testDelim=matchSymbol(s,i,delims);
        }

        if (testTerm!=terms.end()) {
            // terminator found: ignore the rest of the input
            i=s.end();
        } else if (testDelim!=delims.end()) {
            if (!temp.empty() || keep_empty) {
                result.push_back(temp);
                temp.clear();
            }
            // skip over the whole delimiter
            string::const_iterator j=testDelim->begin();
            while (i!=s.end() && j!=testDelim->end()) {
                ++i;
                ++j;
            }
        } else if ('"'==*i) {
            if (checkForDelim) {
                // look ahead for a matching end quote
                string::const_iterator j=i;
                do {
                    ++j;
                } while (j!=s.end() && '"'!=*j);
                checkForDelim=(j==s.end());
                // parenthesized so that an unmatched quote never forces a split
                if (!checkForDelim && (!temp.empty() || keep_empty)) {
                    result.push_back(temp);
                    temp.clear();
                }
                temp.push_back('"');
                ++i;
            } else {
                //matched end quote
                checkForDelim=true;
                temp.push_back('"');
                ++i;
                result.push_back(temp);
                temp.clear();
            }
        } else if ('\n'==*i) {
            // newline inside quotes: store it as the literal characters \n
            temp+="\\n";
            ++i;
        } else {
            temp.push_back(*i);
            ++i;
        }
    }

    if (!temp.empty() || keep_empty) {
        result.push_back(temp);
    }
    return result;
}

int runTest()
{
    vector<string> delims;
    delims.push_back(" ");
    delims.push_back("\t");
    delims.push_back("\n");
    delims.push_back("split_here");

    vector<string> terms;
    terms.push_back(">");
    terms.push_back("end_here");

    const vector<string> words = split("close no \"\n end_here matter\" how \n far testsplit_heretest\"another split_here test\"with some\"mo>re", delims, terms, false);

    copy(words.begin(), words.end(), ostream_iterator<string>(cout, "\n"));
    return 0;
}

generates:

close
no
"\n end_here matter"
how
far
test
test
"another split_here test"
with
some"mo

Based on the examples you gave, you seemed to want newlines to count as delimiters when they appear outside of quotes and to be represented by the literal \n when inside quotes, so that's what this does. It also adds the ability to have multiple delimiters, such as split_here, which I used in the test.

I wasn't sure if you wanted unmatched quotes to be split the way matched quotes are, since the example you gave has the unmatched quote separated by spaces. This code treats an unmatched quote like any other character, but it should be easy to modify if this is not the behavior you want.

The line:

if (0==testSymbol->compare(0,testSymbol->size(),&(*i),testSymbol->size())) {

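will work on most, if not all, implementations of the STL, but it is not guaranteed to work: it assumes the string's characters are stored contiguously, and it can read past the end of s when fewer than testSymbol->size() characters remain after i. It can be replaced with the safer, but slower, version: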

if (*testSymbol==s.substr(i-s.begin(),testSymbol->size())) {
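
For comparison, here is a minimal Python sketch of the same bounded "does a symbol start at this position?" check. The name match_symbol is made up for illustration and is not part of the C++ code above; it relies on str.startswith accepting a start offset, which never reads past the end of the string.

```python
def match_symbol(s, i, symbols):
    """Return the first non-empty symbol that occurs in s at index i, else None."""
    for symbol in symbols:
        # startswith(prefix, start) simply returns False when fewer
        # characters remain than the symbol needs.
        if symbol and s.startswith(symbol, i):
            return symbol
    return None


print(match_symbol("a split_here b", 2, [" ", "split_here"]))  # split_here
print(match_symbol("abc", 2, ["cd"]))                          # None
```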

How to split but ignore separators in quoted strings, in python?

Most of the answers seem massively overcomplicated. You don't need back references. You don't need to depend on whether or not re.findall gives overlapping matches. Given that the input cannot be parsed with the csv module, a regular expression is pretty much the only way to go, and all you need is to call re.split with a pattern that matches a field.

Note that it is much easier here to match a field than it is to match a separator:

import re
data = """part 1;"this is ; part 2;";'this is ; part 3';part 4;this "is ; part" 5"""
PATTERN = re.compile(r'''((?:[^;"']|"[^"]*"|'[^']*')+)''')
print(PATTERN.split(data)[1::2])

and the output is:

['part 1', '"this is ; part 2;"', "'this is ; part 3'", 'part 4', 'this "is ; part" 5']

As Jean-Luc Nacif Coelho correctly points out, this won't handle empty groups correctly. Depending on the situation, that may or may not matter. If it does matter, it may be possible to handle it by, for example, replacing ';;' with ';<marker>;', where <marker> is some string (without semicolons) that you know does not appear in the data before the split. You also need to restore the data afterwards:

>>> marker = ";!$%^&;"
>>> [r.replace(marker[1:-1],'') for r in PATTERN.split("aaa;;aaa;'b;;b'".replace(';;', marker))[1::2]]
['aaa', '', 'aaa', "'b;;b'"]

However this is a kludge. Any better suggestions?
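
One possible alternative, offered only as a sketch: drop the regular expression and scan the string character by character, tracking whether you are inside a quote. This keeps empty fields without any marker. The function name split_fields and its parameters are made up for illustration, and it assumes that an unterminated quote simply runs to the end of the input.

```python
def split_fields(text, sep=";", quotes="\"'"):
    """Split text on sep, ignoring separators inside single or double quotes."""
    fields = []
    current = []
    open_quote = None  # the quote character we are currently inside, if any
    for ch in text:
        if open_quote:
            current.append(ch)
            if ch == open_quote:
                open_quote = None  # closing quote found
        elif ch in quotes:
            open_quote = ch
            current.append(ch)
        elif ch == sep:
            fields.append("".join(current))  # may be empty, and that is kept
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


print(split_fields("aaa;;aaa;'b;;b'"))
# ['aaa', '', 'aaa', "'b;;b'"]
```

Unlike the pattern above, this always returns at least one field, so an empty input gives [''].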

Tokenize a string excluding delimiters inside quotes

I answered a very similar question here:

How to make my split work only on one real line and be capable to skip quoted parts of string?

The example code

  • uses Boost Spirit
  • supports quoted strings, partially quoted fields, user defined delimiters, escaped quotes
  • supports many (diverse) output containers generically
  • supports models of the Range concept as input (includes char[], e.g.)

Tested with a relatively wide range of compiler versions and Boost versions.

https://gist.github.com/bcfbe2b5f071c7d153a0

Regex for splitting a string using space when not surrounded by single or double quotes

I don't understand why all the others are proposing such complex regular expressions or such long code. Essentially, you want to grab two kinds of things from your string: sequences of characters that aren't spaces or quotes, and sequences of characters that begin and end with a quote, with no quotes in between, for two kinds of quotes. You can easily match those things with this regular expression:

[^\s"']+|"([^"]*)"|'([^']*)'

I added the capturing groups because you don't want the quotes in the list.

This Java code builds the list, adding the capturing group if it matched to exclude the quotes, and adding the overall regex match if the capturing group didn't match (an unquoted word was matched).

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    if (regexMatcher.group(1) != null) {
        // Add double-quoted string without the quotes
        matchList.add(regexMatcher.group(1));
    } else if (regexMatcher.group(2) != null) {
        // Add single-quoted string without the quotes
        matchList.add(regexMatcher.group(2));
    } else {
        // Add unquoted word
        matchList.add(regexMatcher.group());
    }
}

If you don't mind having the quotes in the returned list, you can use much simpler code:

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(subjectString);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group());
}

Split a string that has white spaces, unless they are enclosed within quotes?

string input = "one \"two two\" three \"four four\" five six";
var parts = Regex.Matches(input, @"[\""].+?[\""]|[^ ]+")
    .Cast<Match>()
    .Select(m => m.Value)
    .ToList();
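
To see what this pattern produces, here is a small Python sketch of the same expression (a quoted run, or else a run of non-space characters). The function name split_keep_quotes is made up for illustration. Note that the quotes stay in the results, and that a quote in the middle of a word is treated as an ordinary character.

```python
import re


def split_keep_quotes(text):
    # Either a double-quoted run (lazy, so it stops at the next quote)
    # or a run of characters that are not spaces.
    return re.findall(r'".+?"|[^ ]+', text)


print(split_keep_quotes('one "two two" three "four four" five six'))
# ['one', '"two two"', 'three', '"four four"', 'five', 'six']
```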

Boost spirit grammar based string splitting

String handling (the way you do it) got a lot easier with more recent versions of Spirit. I'd suggest using Spirit from Boost V1.47, which got a major rewrite of the attribute handling code.

But even if it compiled the way you want, it wouldn't parse the way you expect. Spirit is inherently greedy, which means that +char_ will consume whatever is left in your input unconditionally. It seems better to have

+~char_(", ;") % char_(", ;")

i.e. one or more (+) characters which are not (~) in the set ", ;", interspersed with exactly one of those characters. The list parser (%) exposes a vector<A>, where A is the attribute of the left hand expression, a vector<string> in the case above.
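
To make the shape of that grammar concrete, here is a rough hand-written Python analogue. This is not Spirit, only a sketch of "token (delimiter token)*" under my own assumptions; parse_list and DELIMS are made-up names. It returns the tokens plus the index where parsing stopped, so you can tell a full match from a partial one.

```python
DELIMS = set(", ;")


def parse_list(text):
    """Parse `token (delim token)*`; return (tokens, index where parsing stopped)."""
    def token_end(pos):
        # Consume characters that are NOT delimiters, like +~char_(", ;").
        end = pos
        while end < len(text) and text[end] not in DELIMS:
            end += 1
        return end  # end == pos means no token was found

    tokens = []
    pos = token_end(0)
    if pos == 0:
        return tokens, 0  # the first token is mandatory
    tokens.append(text[:pos])
    while pos < len(text) and text[pos] in DELIMS:
        end = token_end(pos + 1)
        if end == pos + 1:
            break  # delimiter not followed by a token: stop before it
        tokens.append(text[pos + 1:end])
        pos = end
    return tokens, pos


print(parse_list("a,b;c d"))  # (['a', 'b', 'c', 'd'], 7)
print(parse_list("a,,b"))     # (['a'], 1)
```

The second call shows why "exactly one" delimiter matters: two delimiters in a row end the list early instead of producing an empty element.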

How can I Split(',') a string while ignore commas in between quotes?

This is a fairly straightforward CSV reader implementation we use in a few projects here. It is easy to use and handles the cases you are talking about.

First, the Csv class:

public static class Csv
{
    public static string Escape(string s)
    {
        if (s.Contains(QUOTE))
            s = s.Replace(QUOTE, ESCAPED_QUOTE);

        if (s.IndexOfAny(CHARACTERS_THAT_MUST_BE_QUOTED) > -1)
            s = QUOTE + s + QUOTE;

        return s;
    }

    public static string Unescape(string s)
    {
        if (s.StartsWith(QUOTE) && s.EndsWith(QUOTE))
        {
            s = s.Substring(1, s.Length - 2);

            if (s.Contains(ESCAPED_QUOTE))
                s = s.Replace(ESCAPED_QUOTE, QUOTE);
        }

        return s;
    }

    private const string QUOTE = "\"";
    private const string ESCAPED_QUOTE = "\"\"";
    private static char[] CHARACTERS_THAT_MUST_BE_QUOTED = { ',', '"', '\n' };
}

Then a pretty nice reader implementation, if you need it. You should be able to do what you need with just the Csv class above.

using System.IO;
using System.Text.RegularExpressions;

public sealed class CsvReader : System.IDisposable
{
    public CsvReader(string fileName)
        : this(new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
    }

    public CsvReader(Stream stream)
    {
        __reader = new StreamReader(stream);
    }

    public System.Collections.IEnumerable RowEnumerator
    {
        get
        {
            if (null == __reader)
                throw new System.ApplicationException("I can't start reading without CSV input.");

            __rowno = 0;
            string sLine;
            string sNextLine;

            while (null != (sLine = __reader.ReadLine()))
            {
                // a quoted field is still open: the row continues on the next line
                while (rexRunOnLine.IsMatch(sLine) && null != (sNextLine = __reader.ReadLine()))
                    sLine += "\n" + sNextLine;

                __rowno++;
                string[] values = rexCsvSplitter.Split(sLine);

                for (int i = 0; i < values.Length; i++)
                    values[i] = Csv.Unescape(values[i]);

                yield return values;
            }

            __reader.Close();
        }
    }

    public long RowIndex { get { return __rowno; } }

    public void Dispose()
    {
        if (null != __reader) __reader.Dispose();
    }

    //============================================

    private long __rowno = 0;
    private TextReader __reader;
    private static Regex rexCsvSplitter = new Regex(@",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))");
    private static Regex rexRunOnLine = new Regex(@"^[^""]*(?:""[^""]*""[^""]*)*""[^""]*$");
}

Then you can use it like this.

var reader = new CsvReader(new FileStream(file, FileMode.Open));

Note: This would open an existing CSV file, but can be modified fairly easily to take a string[] like you need.

Java: splitting a comma-separated string but ignoring commas in quotes

Try:

public class Main {
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";
        String[] tokens = line.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

Output:

> foo
> bar
> c;qual="baz,blurb"
> d;junk="quux,syzygy"

In other words: split on the comma only if that comma has zero, or an even number of quotes ahead of it.
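
The even-number-of-quotes rule can also be spelled out without a regular expression. The following Python sketch is only an illustration of the rule, not a translation of the Java code; split_outside_quotes is a made-up name. It counts the quotes to the right of each comma, which is simple but quadratic in the length of the line.

```python
def split_outside_quotes(line, sep=",", quote='"'):
    """Split on sep only where an even number of quotes follows it."""
    tokens = []
    start = 0
    for pos, ch in enumerate(line):
        # An even count to the right means this separator is not inside quotes.
        if ch == sep and line.count(quote, pos + 1) % 2 == 0:
            tokens.append(line[start:pos])
            start = pos + 1
    tokens.append(line[start:])  # last field, kept even when empty
    return tokens


line = 'foo,bar,c;qual="baz,blurb",d;junk="quux,syzygy"'
for token in split_outside_quotes(line):
    print("> " + token)
```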

Or, a bit friendlier for the eyes:

public class Main {
    public static void main(String[] args) {
        String line = "foo,bar,c;qual=\"baz,blurb\",d;junk=\"quux,syzygy\"";

        String otherThanQuote = " [^\"] ";
        String quotedString = String.format(" \" %s* \" ", otherThanQuote);
        String regex = String.format("(?x) "+ // enable comments, ignore white spaces
                ", "+ // match a comma
                "(?= "+ // start positive look ahead
                " (?: "+ // start non-capturing group 1
                " %s* "+ // match 'otherThanQuote' zero or more times
                " %s "+ // match 'quotedString'
                " )* "+ // end group 1 and repeat it zero or more times
                " %s* "+ // match 'otherThanQuote'
                " $ "+ // match the end of the string
                ") ", // stop positive look ahead
                otherThanQuote, quotedString, otherThanQuote);

        String[] tokens = line.split(regex, -1);
        for(String t : tokens) {
            System.out.println("> "+t);
        }
    }
}

which produces the same as the first example.

EDIT

As mentioned by @MikeFHay in the comments:

I prefer using Guava's Splitter, as it has saner defaults (see discussion above about empty matches being trimmed by String#split()), so I did:

Splitter.on(Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)"))

