Split String by Delimiter, But Not If It Is Escaped

Split string by delimiter, but not if it is escaped

Use dark magic:

$array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);

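\\\\. matches a backslash followed by any character (the s flag lets . match newlines too), (*SKIP)(*FAIL) discards that match so the escaped character is never treated as a delimiter, and \| matches your delimiter.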
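Python's built-in re module does not support the (*SKIP)(*FAIL) verbs, so here is a rough sketch of the same idea as a hand-written scanner instead of a regex. The function name split_unescaped and its parameters are made up for illustration, and it assumes a single-character delimiter and escape character.

```python
def split_unescaped(text: str, delimiter: str = "|", escape: str = "\\") -> list[str]:
    """Split text on delimiter, treating escape + any character as one unit."""
    fields = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == escape and i + 1 < len(text):
            # Keep the escape and the escaped character together, never split here.
            current.append(text[i:i + 2])
            i += 2
        elif ch == delimiter:
            # An unescaped delimiter closes the current field.
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields


print(split_unescaped(r"a|b\|c|d\\|e"))
# prints ['a', 'b\\|c', 'd\\\\', 'e']
```

Like the regex, this leaves the escape characters in the resulting fields; removing them would be a separate step.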

Split using delimiter except when delimiter is escaped

First off, I've dealt with data from Excel before. What you typically see is comma-separated values: if a value is considered to be a string, it will have double quotes around it (and can contain commas and double quotes); if it is considered to be numeric, there are no double quotes. Additionally, a double quote inside the data is escaped by doubling it, like "". Assuming all of that, here's how I've dealt with this in the past:

public static IEnumerable<string> SplitExcelRow(this string value)
{
    // Protect doubled quotes ("") with a placeholder before scanning.
    value = value.Replace("\"\"", "&quot;");
    bool quoted = false;
    int currStartIndex = 0;
    for (int i = 0; i < value.Length; i++)
    {
        char currChar = value[i];
        if (currChar == '"')
        {
            quoted = !quoted;
        }
        else if (currChar == ',')
        {
            if (!quoted)
            {
                // Strip the surrounding quotes, then restore the protected ones.
                yield return value.Substring(currStartIndex, i - currStartIndex)
                    .Trim()
                    .Replace("\"", "")
                    .Replace("&quot;", "\"");
                currStartIndex = i + 1;
            }
        }
    }
    // Emit the final field after the last unquoted comma.
    yield return value.Substring(currStartIndex, value.Length - currStartIndex)
        .Trim()
        .Replace("\"", "")
        .Replace("&quot;", "\"");
}

Of course this assumes the data coming in is valid, so if you have something like "fo,o"b,ar","bar""foo" this will not work. Additionally, if your data contains the literal text &quot;, it will be turned into a ", which may or may not be desirable.
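For comparison, here is a small Python sketch that leans on the standard csv module rather than a hand-written loop. This is a different technique from the one above: the module's default dialect already treats double-quoted fields and doubled quotes ("") the way described. The helper name split_csv_row is made up for illustration.

```python
import csv


def split_csv_row(row: str) -> list[str]:
    """Parse a single CSV line, honouring quoted fields and doubled quotes."""
    # csv.reader accepts any iterable of lines, so wrap the row in a list.
    return next(csv.reader([row]))


print(split_csv_row('1,"Smith, John","He said ""hi""",42'))
# prints ['1', 'Smith, John', 'He said "hi"', '42']
```

Because no placeholder is involved, data that happens to contain the text &quot; passes through unchanged.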

Split a string where the separators can be escaped

If you are using a JavaScript engine whose regex implementation supports negative look-behinds (e.g. Chrome), and only the single, simple escape shown here is needed, with no way to escape the escape character itself, a relatively simple negative look-behind works:

'|1|2|\\|Three and Four\\||5'.split(/(?<!\\)\|/)

// -> ["", "1", "2", "\|Three and Four\|", "5"]

This says: split on a "|" that is not preceded by a "\".

A look-behind can be converted to a look-ahead for engine compatibility. Variations are also discussed in RegEx needed to split javascript string on "|" but not "\|".

However, as pointed out, the above doesn't touch the \| sequence and thus leaves in the escape sequence.
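As a rough Python sketch of the same look-behind split, with one extra pass that drops the leftover escape characters: the helper name split_and_unescape is made up for illustration, and, as above, it assumes there is no way to escape the escape character itself.

```python
import re


def split_and_unescape(text: str) -> list[str]:
    # Split on "|" that is not immediately preceded by a backslash.
    parts = re.split(r"(?<!\\)\|", text)
    # Drop the backslash from each remaining "\|" pair.
    return [part.replace("\\|", "|") for part in parts]


print(split_and_unescape(r"|1|2|\|Three and Four\||5"))
# prints ['', '1', '2', '|Three and Four|', '5']
```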


Alternatively, a multistep approach can solve this and also take care of the escape character as part of the process.

  1. Replace the escaped separators with an "alternate" character/string
  2. Split on the remaining (non-escaped) separators
  3. Convert the "alternate" character/string back in the individual components

In code,

const str = '|1|2|\\|Three and Four\\||5'

// replace \| -> "alternative"
// this assumes that \\| (escape-the-escape) is not allowed
// and that the placeholder ~~~~ never occurs in the data
const rep = str.replace(/\\[|]/g, '~~~~')

// split, then replace back, without any of the escapes
const res = rep.split('|').map(function (f) { return f.replace(/~~~~/g, "|") })

// res -> ["", "1", "2", "|Three and Four|", "5"]

How to properly split on a non escaped delimiter?

Extracting approach

A matching approach is the most stable option, as it allows an arbitrary number of escaping \ chars. You can use

(?s)(?:\\.|[^\\|])+

See the regex demo. Details:

  • (?s) - Pattern.DOTALL embedded flag option
  • (?:\\.|[^\\|])+ - one or more repetitions of \ and then any one char, or any char but \ and |.

See the Java demo:

String s = "A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F";
Pattern pattern = Pattern.compile("(?:\\\\.|[^\\\\|])+", Pattern.DOTALL);
Matcher matcher = pattern.matcher(s);
List<String> results = new ArrayList<>();
while (matcher.find()) {
    results.add(matcher.group());
}
System.out.println(results);
// => [A, B\|C\\, D\\\|E\\\\, F]
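The extracted tokens still contain their escape backslashes. Below is a rough Python sketch of the same pattern with an optional unescaping pass. The names extract_fields, TOKEN and ESCAPED are made up for illustration, and the unescape rule (a backslash followed by any character becomes that character) is an assumption about what the data means.

```python
import re

# Same pattern as above: a backslash plus any char, or any char but \ and |.
TOKEN = re.compile(r"(?:\\.|[^\\|])+", re.DOTALL)
# A backslash followed by any single character (captured).
ESCAPED = re.compile(r"\\(.)", re.DOTALL)


def extract_fields(text: str, unescape: bool = False) -> list[str]:
    tokens = TOKEN.findall(text)
    if unescape:
        # Replace each escape pair with the escaped character itself.
        tokens = [ESCAPED.sub(lambda m: m.group(1), tok) for tok in tokens]
    return tokens


sample = r"A|B\|C\\|D\\\|E\\\\|F"
for field in extract_fields(sample, unescape=True):
    print(field)
# prints A, B|C\, D\|E\\ and F, one per line
```

One thing to keep in mind: because the pattern needs at least one character per match, empty fields disappear, so "a||b" yields only a and b.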

Splitting approach (workaround for split)

You may (ab)use the constrained-width lookbehind support in Java regex and use a limiting quantifier like {0,1000} instead of the * quantifier. A workaround would look like

String s = "A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F";
String[] results = s.split("(?<=(?<!\\\\)(?:\\\\{2}){0,1000})\\|");
System.out.println(Arrays.toString(results));

Note that the (?:\\{2}){0,1000} part only allows up to 1000 pairs of escaping backslashes (2000 characters). That should suffice in most cases, I believe, but you might want to test this first. I'd still recommend the first solution.

Details:

  • (?<= - start of a positive lookbehind:
    • (?<!\\) - a location not immediately preceded with a \
    • (?:\\{2}){0,1000} - zero to one thousand occurrences of double backslash
  • ) - end of the positive lookbehind
  • \| - a | char.
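Python's built-in re module rejects variable-width lookbehinds, so this trick does not carry over directly. A rough sketch of the same rule (a | is a separator only when preceded by an even number of backslashes) can instead count the backslash run by hand. The function name split_on_even_backslashes is made up for illustration, and a single-character delimiter is assumed.

```python
def split_on_even_backslashes(text: str, delimiter: str = "|") -> list[str]:
    fields = []
    start = 0
    backslashes = 0  # length of the backslash run ending just before index i
    for i, ch in enumerate(text):
        if ch == "\\":
            backslashes += 1
            continue
        if ch == delimiter and backslashes % 2 == 0:
            # Even run (including zero): the delimiter is not escaped.
            fields.append(text[start:i])
            start = i + 1
        backslashes = 0
    fields.append(text[start:])
    return fields


for field in split_on_even_backslashes(r"A|B\|C\\|D\\\|E\\\\|F"):
    print(field)
# prints A, B\|C\\, D\\\|E\\\\ and F, one per line
```

This has no 1000-pair limit, and unlike Java's String.split it keeps trailing empty fields.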

Split string by comma, but not escaped in JavaScript

So use match instead of split

"Keyword,slug,description".match(/([^,]+),([^,]+),(.*)/);

will result in

["Keyword,slug,description", "Keyword", "slug", "description"]

There are other ways to write the regular expression; I just picked something quick.
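The point of the (.*) group is that the last field swallows any remaining commas. A rough Python sketch of that idea, without a regex, is to cap the number of splits. The helper name split_first_fields is made up for illustration, and unlike the pattern above it does not require the leading fields to be non-empty.

```python
def split_first_fields(text: str, field_count: int = 3) -> list[str]:
    # Split at most field_count - 1 times; the last field keeps its commas.
    return text.split(",", field_count - 1)


print(split_first_fields("Keyword,slug,description, with, commas"))
# prints ['Keyword', 'slug', 'description, with, commas']
```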

Split string with regex separator except when separator is escaped

You may match the sequences with a pattern that matches either any char that is not a comma, or 1+ commas preceded by an odd number of Zs:

import re
a = 'aaa,bbbZ,cccZZ,dddZZZ,eee'
print(re.findall(r'(?:(?<!Z)Z(?:ZZ)*,+|[^,])+', a))
# => ['aaa', 'bbbZ,cccZZ', 'dddZZZ,eee']

Pattern details:

  • (?:(?<!Z)Z(?:ZZ)*,+|[^,])+ - 1 or more occurrences of:

    • (?<!Z)Z - a Z not immediately preceded with Z
    • (?:ZZ)* - zero or more sequences of ZZ
    • ,+ - 1 or more commas
    • | - or
    • [^,] - any char that is not a comma

With the PyPI regex module, you may use the regex.split method with the (?<=(?<!Z)(?:ZZ)*),+ regex:

import regex
a = 'aaa,bbbZ,cccZZ,dddZZZ,eee'
print(regex.split(r'(?<=(?<!Z)(?:ZZ)*),+', a))
# ['aaa', 'bbbZ,cccZZ', 'dddZZZ,eee']

Here, the pattern matches 1 or more commas (,+) that are preceded with any 0+ sequences of ZZ that are not preceded with another Z (that is, with an even number of Z).

JavaScript split on char but ignoring double escaped chars

See the function below named splitOnNonEscapedDelimeter(), which accepts the string to split and the delimiter to split on, which in this case is :. The usage is within the function onChange().

Note that you must escape the delimiter you pass to splitOnNonEscapedDelimeter() if it is a special character in regular expressions, so that it is interpreted literally.

function nonEscapedDelimeter(delimeter) {
  return new RegExp(String.raw`[^${delimeter}]*?(?:\\\\${delimeter}[^${delimeter}]*?)*(?:${delimeter}|$)`, 'g')
}

function nonEscapedDelimeterAtEnd(delimeter) {
  return new RegExp(String.raw`([^\\].|.[^\\]|^.?)${delimeter}$`)
}

function splitOnNonEscapedDelimeter(string, delimeter) {
  const reMatch = nonEscapedDelimeter(delimeter)
  const reReplace = nonEscapedDelimeterAtEnd(delimeter)

  return string.match(reMatch).slice(0, -1).map(section => {
    return section.replace(reReplace, '$1')
  })
}

function onChange() {
  console.log(splitOnNonEscapedDelimeter(i.value, ':'))
}

i.addEventListener('change', onChange)

onChange()
<textarea id=i>dtet:du\\,eduh ei\\:di:e,j</textarea>

Split String While Ignoring Escaped Character

You need to use a negative lookbehind to take care of escaped single quotes:

String str =
    "Some message I want to split 'but keeping this a\\'s a single string' Voila!";

String[] toks = str.split( " +(?=((.*?(?<!\\\\)'){2})*[^']*$)" );
for (String tok: toks)
    System.out.printf("<%s>%n", tok);

output:

<Some>
<message>
<I>
<want>
<to>
<split>
<'but keeping this a\'s a single string'>
<Voila!>

PS: As you noted, the escaped single quote needs to be typed as \\' in the String assignment; otherwise it will be treated as a plain '.
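For comparison, here is a rough Python sketch that matches the tokens instead of splitting on the spaces between them. This is a different technique from the lookahead split above: it assumes a quoted section always starts at the beginning of a token and that \' is the only escape that matters. The names TOKEN and split_keep_quoted are made up for illustration.

```python
import re

# A single-quoted section (with backslash escapes inside), or a run of non-spaces.
TOKEN = re.compile(r"'(?:\\.|[^'\\])*'|\S+")


def split_keep_quoted(text: str) -> list[str]:
    return TOKEN.findall(text)


text = "Some message I want to split 'but keeping this a\\'s a single string' Voila!"
for tok in split_keep_quoted(text):
    print(f"<{tok}>")
# prints the same eight tokens as the Java output above, one per line
```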


