Split string by delimiter, but not if it is escaped
Use dark magic:
$array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);
\\\\. matches a backslash followed by any character, (*SKIP)(*FAIL) skips that match, and \| matches your delimiter.
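For comparison, here is a minimal Python sketch of the same idea written as an explicit scan rather than a regex, since Python's built-in re module does not provide the (*SKIP)(*FAIL) verbs. The function name split_unescaped and its parameters are made up for illustration. Like the preg_split call above, it leaves the escape characters in the output.

```python
def split_unescaped(text, delimiter="|", escape="\\"):
    """Split text on delimiter, skipping any character that follows escape."""
    fields, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == escape:
            # Keep the escape and the character it protects together.
            current.append(ch)
            current.append(next(chars, ""))
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


print(split_unescaped(r"a|b\|c|d"))  # ['a', 'b\\|c', 'd']
```

Because the scan consumes the escape together with the next character, an escaped backslash directly before a delimiter (\\|) still splits, which matches the behaviour of the regex.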
Split using delimiter except when delimiter is escaped
First off, I've dealt with data from Excel before. What you typically see is comma-separated values: if a value is considered to be a string, it will have double quotes around it (and can contain commas and double quotes); if it is considered to be numeric, there are no double quotes. Additionally, if the data contains a double quote, it will be escaped by doubling it, like "". So assuming all of that, here's how I've dealt with this in the past:
public static IEnumerable<string> SplitExcelRow(this string value)
{
    // Protect doubled (escaped) quotes with a placeholder before scanning.
    value = value.Replace("\"\"", "&quot;");
    bool quoted = false;
    int currStartIndex = 0;
    for (int i = 0; i < value.Length; i++)
    {
        char currChar = value[i];
        if (currChar == '"')
        {
            quoted = !quoted;
        }
        else if (currChar == ',')
        {
            if (!quoted)
            {
                // Strip the surrounding quotes, then restore the escaped ones.
                yield return value.Substring(currStartIndex, i - currStartIndex)
                    .Trim()
                    .Replace("\"", "")
                    .Replace("&quot;", "\"");
                currStartIndex = i + 1;
            }
        }
    }
    yield return value.Substring(currStartIndex, value.Length - currStartIndex)
        .Trim()
        .Replace("\"", "")
        .Replace("&quot;", "\"");
}
Of course, this assumes the incoming data is valid, so if you have something like "fo,o"b,ar","bar""foo" this will not work. Additionally, if your data contains the literal text &quot;, it will be turned into a ", which may or may not be desirable.
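As a point of comparison, Python's standard csv module already implements these quoting rules (quoted fields may contain commas, and a doubled quote stands for one literal quote). The sketch below is only an illustration of that behaviour, not a port of the C# method; the helper name split_excel_row is hypothetical.

```python
import csv


def split_excel_row(line):
    """Parse a single CSV line using the csv module's default dialect."""
    # csv.reader accepts any iterable of lines; we pass a one-element list.
    return next(csv.reader([line]))


print(split_excel_row('1,"fo,o","say ""hi""",2.5'))
# ['1', 'fo,o', 'say "hi"', '2.5']
```

No placeholder substitution is needed here, so data that happens to contain the text &quot; passes through unchanged.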
Split a string where the separators can be escaped
If you are using a JavaScript engine whose regex implementation supports negative look-behinds (e.g. Chrome), and there is only a single, simple escape with no way to escape the escape, a relatively simple negative look-behind works:
'|1|2|\\|Three and Four\\||5'.split(/(?<!\\)\|/)
// -> ["", "1", "2", "\|Three and Four\|", "5"]
This splits on a "|" that is not preceded by a "\".
Here is a method to convert a look-behind to a look-ahead for engine compatibility. Variations are also discussed in RegEx needed to split javascript string on "|" but not "\|".
However, as pointed out, the above doesn't touch the \| sequence and thus leaves the escape character in the results.
Alternatively, a multistep approach can solve this and also take care of the escape character as part of the process:
- Replace the escaped separators with an "alternate" character/string
- Split on the remaining (non-escaped) separators
- Convert the "alternate" character/string back in the individual components
In code,
const str = '|1|2|\\|Three and Four\\||5'
// replace \| -> "alternative"
// this assumes that \\| (escape-the-escape) is not allowed
const rep = str.replace(/\\[|]/g, '~~~~')
// split, then replace back, without any of the escapes
const res = rep.split('|').map(function (f) { return f.replace(/~~~~/g, "|") })
// res -> ["", "1", "2", "|Three and Four|", "5"]
How to properly split on a non escaped delimiter?
Extracting approach
You can use a matching approach, as it is the most stable and allows an arbitrary number of escaping \ chars. You can use
(?s)(?:\\.|[^\\|])+
See the regex demo. Details:
(?s) - Pattern.DOTALL embedded flag option
(?:\\.|[^\\|])+ - one or more repetitions of \ followed by any one char, or any char other than \ and |.
See the Java demo:
String s = "A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F";
Pattern pattern = Pattern.compile("(?:\\\\.|[^\\\\|])+", Pattern.DOTALL);
Matcher matcher = pattern.matcher(s);
List<String> results = new ArrayList<>();
while (matcher.find()) {
    results.add(matcher.group());
}
System.out.println(results);
// => [A, B\|C\\, D\\\|E\\\\, F]
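The same extracting pattern carries over to Python's re module unchanged. The sketch below repeats the Java demo under that assumption (the constant name FIELD is just for illustration) and also shows one thing to keep in mind: because the pattern needs at least one character per match, empty fields between adjacent delimiters are dropped.

```python
import re

# A backslash followed by any char, or any char that is neither \ nor |.
FIELD = re.compile(r"(?:\\.|[^\\|])+", re.DOTALL)

s = r"A|B\|C\\|D\\\|E\\\\|F"
fields = FIELD.findall(s)
print(fields)

# Empty fields are not reported by a matching approach:
empties = FIELD.findall("a||b")
print(empties)  # ['a', 'b']
```

If empty fields matter, a splitting or scanning approach is the safer choice.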
Splitting approach (workaround for split)
You may (ab)use the constrained-width lookbehind support in Java regex and use a limiting quantifier like {0,1000} instead of the * quantifier. A workaround would look like
String s = "A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F";
String[] results = s.split("(?<=(?<!\\\\)(?:\\\\{2}){0,1000})\\|");
System.out.println(Arrays.toString(results));
See this Java demo.
Note that the (?:\\{2}){0,1000} part will only allow up to 1000 pairs of backslashes before the delimiter, which should suffice in most cases, I believe, but you might want to test this first. I'd still recommend the first solution.
Details:
(?<= - start of a positive lookbehind:
(?<!\\) - a location not immediately preceded with a \
(?:\\{2}){0,1000} - zero to one thousand occurrences of a double backslash
) - end of the positive lookbehind
\| - a | char.
Split string by comma, but not escaped in JavaScript
So use match instead of split
"Keyword,slug,description".match(/([^,]+),([^,]+),(.*)/);
will result in
["Keyword,slug,description", "Keyword", "slug", "description"]
There are other ways to write the regular expression, just picked something quick.
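To show why matching helps here, the following Python sketch applies the same pattern to an input whose last field contains extra commas. It is an illustration only: the name split_three is hypothetical, and fullmatch is used instead of an unanchored match so that malformed input is rejected outright.

```python
import re

# Two comma-free fields, then everything else (commas included).
THREE_FIELDS = re.compile(r"([^,]+),([^,]+),(.*)", re.DOTALL)


def split_three(text):
    """Return the three fields as a list, or None if text has fewer than three."""
    m = THREE_FIELDS.fullmatch(text)
    return list(m.groups()) if m else None


print(split_three("Keyword,slug,description, with, commas"))
# ['Keyword', 'slug', 'description, with, commas']
```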
Split string with regex separator except when separator is escaped
You may match the sequences with a pattern that will match either any chars that are not a comma, or any 1+ commas preceded with an odd number of Zs:
import re
a = 'aaa,bbbZ,cccZZ,dddZZZ,eee'
print(re.findall(r'(?:(?<!Z)Z(?:ZZ)*,+|[^,])+', a))
# => ['aaa', 'bbbZ,cccZZ', 'dddZZZ,eee']
See the Python demo and a regex demo.
Pattern details:
(?:(?<!Z)Z(?:ZZ)*,+|[^,])+ - 1 or more occurrences of:
(?<!Z)Z - a Z not immediately preceded with Z
(?:ZZ)* - zero or more sequences of ZZ
,+ - 1 or more commas
| - or
[^,] - any char that is not a comma
With the PyPI regex module, you may use the regex.split method with a (?<=(?<!Z)(?:ZZ)*),+ regex:
import regex
a = 'aaa,bbbZ,cccZZ,dddZZZ,eee'
print(regex.split(r'(?<=(?<!Z)(?:ZZ)*),+', a))
# ['aaa', 'bbbZ,cccZZ', 'dddZZZ,eee']
See another online Python demo.
Here, the pattern matches 1 or more commas (,+) that are preceded with any 0+ sequences of ZZ that are not preceded with another Z (that is, with an even number of Zs).
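If installing the third-party regex module is not an option, the same even/odd rule can be applied with the standard library alone, since the built-in re module only accepts fixed-width lookbehinds. The sketch below is one way to do it, assuming a single-character escape; the function name split_on_unescaped_commas is made up for illustration.

```python
import re


def split_on_unescaped_commas(text, escape="Z"):
    """Split on runs of commas preceded by an even number of escape chars."""
    parts, start = [], 0
    for m in re.finditer(r",+", text):
        before = text[:m.start()]
        # Length of the run of escape chars immediately before the commas.
        run = len(before) - len(before.rstrip(escape))
        if run % 2 == 0:
            parts.append(text[start:m.start()])
            start = m.end()
    parts.append(text[start:])
    return parts


a = 'aaa,bbbZ,cccZZ,dddZZZ,eee'
print(split_on_unescaped_commas(a))
# ['aaa', 'bbbZ,cccZZ', 'dddZZZ,eee']
```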
JavaScript split on char but ignoring double escaped chars
See the function below named splitOnNonEscapedDelimeter(), which accepts the string to split and the delimeter to split on, which in this case is :. The usage is within the function onChange().
Note that you must escape the delimeter you pass to splitOnNonEscapedDelimeter(), so that it is not interpreted as a special character in the regular expression.
function nonEscapedDelimeter(delimeter) {
  return new RegExp(String.raw`[^${delimeter}]*?(?:\\\\${delimeter}[^${delimeter}]*?)*(?:${delimeter}|$)`, 'g')
}
function nonEscapedDelimeterAtEnd(delimeter) {
  return new RegExp(String.raw`([^\\].|.[^\\]|^.?)${delimeter}$`)
}
function splitOnNonEscapedDelimeter(string, delimeter) {
  const reMatch = nonEscapedDelimeter(delimeter)
  const reReplace = nonEscapedDelimeterAtEnd(delimeter)
  return string.match(reMatch).slice(0, -1).map(section => {
    return section.replace(reReplace, '$1')
  })
}
function onChange() {
  console.log(splitOnNonEscapedDelimeter(i.value, ':'))
}
i.addEventListener('change', onChange)
onChange()
<textarea id=i>dtet:du\\,eduh ei\\:di:e,j</textarea>
Split String While Ignoring Escaped Character
You need to use a negative lookbehind to take care of escaped single quotes:
String str = "Some message I want to split 'but keeping this a\\'s a single string' Voila!";
String[] toks = str.split(" +(?=((.*?(?<!\\\\)'){2})*[^']*$)");
for (String tok : toks)
    System.out.printf("<%s>%n", tok);
output:
<Some>
<message>
<I>
<want>
<to>
<split>
<'but keeping this a\'s a single string'>
<Voila!>
PS: As you noted, an escaped single quote needs to be typed as \\' in the String assignment; otherwise it will be treated as a plain '.
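The same input can also be handled by matching tokens instead of splitting on spaces. The Python sketch below takes that route: it is a different technique from the lookahead split above, shown only as an illustration, and it assumes a quoted section always starts at a token boundary. The names TOKEN and split_keeping_quotes are hypothetical.

```python
import re

# Either a single-quoted section (backslash escapes allowed inside),
# or a run of non-whitespace characters.
TOKEN = re.compile(r"'(?:\\.|[^'\\])*'|\S+")


def split_keeping_quotes(text):
    """Return whitespace-separated tokens, keeping quoted sections whole."""
    return TOKEN.findall(text)


s = "Some message I want to split 'but keeping this a\\'s a single string' Voila!"
for tok in split_keeping_quotes(s):
    print(f"<{tok}>")
```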