How to Split a String Containing Both Delimiter and The Escaped Delimiter

How to split a string containing both delimiter and the escaped delimiter?

Ruby 1.8.7 doesn't support negative lookbehind unless Oniguruma is available (it may be compiled in).

Ruby 1.9.3 has it built in; yay:

> s = "a;b;c\\;d"
=> "a;b;c\\;d"
> s.split /(?<!\\);/
=> ["a", "b", "c\\;d"]

1.8.7 with Oniguruma doesn't offer a trivial split, but you can get match offsets and pull apart the substrings that way. I assume there's a better way to do this I'm not remembering:

> require 'oniguruma'
> re = Oniguruma::ORegexp.new "(?<!\\\\);"
> s = "hello;there\\;nope;yestho"
> re.match_all s
=> [#<MatchData ";">, #<MatchData ";">]
> mds = re.match_all s
=> [#<MatchData ";">, #<MatchData ";">]
> mds.collect {|md| md.offset}
=> [[5, 6], [17, 18]]

Other options include:

  • Splitting on ; and post-processing the results, re-joining any piece that ends with a trailing \, or
  • Doing a char-by-char loop that maintains some simple state and splits manually.
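The first option above (split, then post-process) can be sketched in a few lines. This is only an illustration in Python, not the Ruby from the answer; the function name `split_unescaped` and its parameters are made up for the example, and like the lookbehind regex it assumes there is no way to escape the escape character itself.

```python
def split_unescaped(text, delimiter=";", escape="\\"):
    """Split on every delimiter, then glue back pieces that were cut at an escaped one."""
    parts = []
    for piece in text.split(delimiter):
        if parts and parts[-1].endswith(escape):
            # The previous piece ended with the escape character, so the
            # delimiter we just split on was escaped: put it back.
            parts[-1] += delimiter + piece
        else:
            parts.append(piece)
    return parts


print(split_unescaped("a;b;c\\;d"))  # ['a', 'b', 'c\\;d']
```

As with the regex version, the escape character is left in the result; strip it afterwards if you need the bare delimiter.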

Split string which contains escaped delimiters

static final char ESCAPING_CHAR = '\\';

private List<String> parseString(final String str,
                                 final char delimiter,
                                 final boolean removeEmpty)
        throws IOException
{
    final Reader input = new StringReader(str);
    final StringBuilder part = new StringBuilder();
    final List<String> result = new ArrayList<String>();

    int c;
    do {
        c = input.read();                 // get the next character

        if (c != delimiter) {             // so long as it isn't a delimiter...
            if (c == ESCAPING_CHAR)       // if it's an escape
                c = input.read();         // use the following character instead

            if (c >= 0) {                 // only if NOT at end of string...
                part.append((char) c);    // append to current part
                continue;                 // move on to next character
            }
        }

        /* we're at either a real delimiter, or end of string => part complete */

        if (part.length() > 0 || !removeEmpty) { // keep this part?
            result.add(part.toString());  // add current part to result
            part.setLength(0);            // reset for next part
        }
    } while (c >= 0);                     // repeat until end of string found

    return result;
}

Split a string where the separators can be escaped

If you are using a JavaScript engine whose regexes support negative look-behinds (e.g. Chrome), and there is only the single, simple escape shown with no way to escape the escape itself, a relatively simple negative look-behind works:

'|1|2|\\|Three and Four\\||5'.split(/(?<!\\)\|/)

// -> ["", "1", "2", "\\|Three and Four\\|", "5"]

This says to split on a "|" that is not preceded by a "\". It only works in engines, such as Chrome's, that support negative look-behinds.

For compatibility with other engines, a look-behind can be converted to a look-ahead. Variations are also discussed in "RegEx needed to split javascript string on "|" but not "\|"".

However, as pointed out, the above doesn't touch the \| sequence and thus leaves in the escape sequence.


Alternatively, a multistep approach can solve this and also take care of removing the escape character as part of the process.

  1. Replace the escaped separators with an "alternate" character/string
  2. Split on the remaining (non-escaped) separators
  3. Convert the "alternate" character/string back in the individual components

In code,

var str = '|1|2|\\|Three and Four\\||5'

// replace \| -> "alternative"
// this assumes that \\| (escape-the-escape) is not allowed
var rep = str.replace(/\\[|]/g, '~~~~')

// split, then replace back, without any of the escapes
var res = rep.split('|').map(function (f) { return f.replace(/~~~~/g, "|") })

// res -> ["", "1", "2", "|Three and Four|", "5"]

Split string with escaped delimiter using a delimiter

You can use a regular expression!

The negative look-behind (?<!\\) makes the pattern split on an ampersand (&) only if it is not preceded by a backslash (written as two backslashes in the pattern to escape it).

>>> import re
>>> string = r'fir\&st_part&secon\&d_part'
>>> re.split(r'(?<!\\)&', string)
['fir\\&st_part', 'secon\\&d_part']

With the resulting list, you can iterate and replace the escaped '\&' with '&' if necessary!

>>> import re
>>> print([each.replace("\\&", "&") for each in re.split(r'(?<!\\)&', string)])
['fir&st_part', 'secon&d_part']
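One caveat: the look-behind treats any backslash before & as an escape, so it cannot tell an escaped backslash followed by a real delimiter from an escaped delimiter. If that matters, a single pass over the characters can split and unescape at once. The following is a minimal sketch, not part of the answer above; `split_and_unescape` is a hypothetical helper, and it assumes a backslash makes whatever character follows it literal.

```python
def split_and_unescape(text, delimiter="&", escape="\\"):
    """Split on unescaped delimiters and drop the escape characters in one pass."""
    parts, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == escape:
            # Take the next character literally; a lone escape at the very
            # end of the input is kept as-is.
            current.append(next(chars, escape))
        elif ch == delimiter:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


print(split_and_unescape(r"fir\&st_part&secon\&d_part"))  # ['fir&st_part', 'secon&d_part']
```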

Python - How do I split a string that includes an escape character as a delimiter?

Convert your string to raw string by doing r'string'

Try this:

MyString = r'A\x92\xa4\xbf'
delim = '\\' + 'x' #OR simply: delim = '\\x'
MyList = MyString.split(delim)
print(MyList)

Output:

['A', '92', 'a4', 'bf']

This technique should work for any escape sequence, not just \x (let me know otherwise xD): just set the delimiter to the escaped form, e.g. \\x. Working sample: https://repl.it/@stupidlylogical/RawStringPython

Works because:

Python raw string treats backslash (\) as a literal character. This is
useful when we want to have a string that contains backslash and don't
want it to be treated as an escape character.

Explanation:

When an 'r' or 'R' prefix is present, a character following a
backslash is included in the string without change, and all
backslashes are left in the string.

More: https://docs.python.org/2/reference/lexical_analysis.html#string-literals
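To see why the raw prefix matters, it may help to compare the same literal with and without it. A small sketch (the variable names `raw` and `cooked` are just for illustration): the raw string still contains the backslashes to split on, while in the ordinary literal each \xNN has already been turned into a single character, so there is nothing left to split. Note that the prefix only applies to literals in source code; a string that already holds the interpreted characters cannot be made "raw" afterwards.

```python
raw = r'A\x92\xa4\xbf'     # backslashes kept: 13 characters
cooked = 'A\x92\xa4\xbf'   # escapes interpreted: 4 characters

print(len(raw), len(cooked))   # 13 4
print(raw.split('\\x'))        # ['A', '92', 'a4', 'bf']
print(cooked.split('\\x'))     # one element: no literal backslash to split on
```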

Java String.split() regex for handling escaped delimiter and escaped escape characters

If it has to be split then you can try something like

split("(?<!(?<!\\\\)\\\\(\\\\{2}){0,1000000000}),")

I used {0,1000000000} instead of * because a look-behind in Java needs an obvious maximum length, and 1000000000 seems to be good enough, unless you can have more than 1000000000 consecutive \\ in your text.


If it doesn't have to be split then you can use

Matcher m = Pattern.compile("(\\G.*?(?<!\\\\)(\\\\{2})*)(,|(?<!\\G)$)",
Pattern.DOTALL).matcher(testString);
while (m.find()) {
System.out.println(m.group(1));
}

\\G matches at the end of the previous match or, on the Matcher's first iteration when there is no previous match, at the start of the string (like ^).


But the fastest approach, and one that is not so hard to implement, would be to write your own parser, using a flag such as escaped to signal that the character currently being checked was escaped with \.

public static List<String> parse(String text) {
    List<String> tokens = new ArrayList<>();
    boolean escaped = false;
    StringBuilder sb = new StringBuilder();

    for (char ch : text.toCharArray()) {
        if (ch == ',' && !escaped) {
            tokens.add(sb.toString());
            sb.delete(0, sb.length());
        } else {
            if (ch == '\\')
                escaped = !escaped;
            else
                escaped = false;
            sb.append(ch);
        }
    }

    if (sb.length() > 0) {
        tokens.add(sb.toString());
        sb.delete(0, sb.length());
    }

    return tokens;
}

Demo of all approaches:

String testString = "a\\,b\\\\,c,d\\\\\\,e,f\\\\g";
String[] splitedString = testString
        .split("(?<!(?<!\\\\)\\\\(\\\\{2}){0,1000000000}),");
for (String string : splitedString) {
    System.out.println(string);
}

System.out.println("-----");
Matcher m = Pattern.compile("(\\G.*?(?<!\\\\)(\\\\{2})*)(,|(?<!\\G)$)",
        Pattern.DOTALL).matcher(testString);
while (m.find()) {
    System.out.println(m.group(1));
}

System.out.println("-----");
for (String s : parse(testString))
    System.out.println(s);

Output:

a\,b\\
c
d\\\,e
f\\g
-----
a\,b\\
c
d\\\,e
f\\g
-----
a\,b\\
c
d\\\,e
f\\g

Split using delimiter except when delimiter is escaped

First off, I've dealt with data from Excel before, and what you typically see is comma-separated values. If a value is considered to be a string, it will have double quotes around it (and can contain commas and double quotes). If it is considered to be numeric, there are no double quotes. Additionally, if the data contains a double quote, it is escaped by doubling it, like "". So, assuming all of that, here's how I've dealt with this in the past:

public static IEnumerable<string> SplitExcelRow(this string value)
{
    // Protect doubled quotes with a placeholder so they don't toggle the quoted state
    value = value.Replace("\"\"", "&quot;");
    bool quoted = false;
    int currStartIndex = 0;
    for (int i = 0; i < value.Length; i++)
    {
        char currChar = value[i];
        if (currChar == '"')
        {
            quoted = !quoted;
        }
        else if (currChar == ',')
        {
            if (!quoted)
            {
                yield return value.Substring(currStartIndex, i - currStartIndex)
                    .Trim()
                    .Replace("\"", "")
                    .Replace("&quot;", "\"");
                currStartIndex = i + 1;
            }
        }
    }
    yield return value.Substring(currStartIndex, value.Length - currStartIndex)
        .Trim()
        .Replace("\"", "")
        .Replace("&quot;", "\"");
}

Of course, this assumes the incoming data is valid, so if you have something like "fo,o"b,ar","bar""foo" this will not work. Additionally, if your data contains the literal text &quot; then it will be turned into a ", which may or may not be desirable.
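For comparison, the same quoting rules (quoted fields may contain commas, and a doubled "" stands for one literal quote) are what Python's standard csv module applies by default. A minimal sketch, with the function name `split_excel_row` chosen only to mirror the C# method above; unlike that method it does not trim whitespace around fields, and it needs no placeholder text.

```python
import csv


def split_excel_row(row):
    """Parse one comma-separated line using the csv module's default dialect."""
    # csv.reader accepts any iterable of lines; an empty input yields no rows.
    return next(csv.reader([row]), [])


print(split_excel_row('1,"Smith, John","He said ""hi""",42'))
# ['1', 'Smith, John', 'He said "hi"', '42']
```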

Split string by delimiter, but not if it is escaped

Use dark magic:

$array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);

\\\\. matches a backslash followed by a character, (*SKIP)(*FAIL) skips it and \| matches your delimiter.
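Python's standard re module has no (*SKIP)(*FAIL) verbs, but the same idea can be approximated: let the first alternative consume a backslash plus the character after it, and only treat a match as a delimiter when the capturing group took part. This is a rough sketch rather than an exact equivalent of the PHP call above; `split_skip_escaped` and `TOKEN` are names invented for the example.

```python
import re

# Alternative 1 swallows an escaped pair; alternative 2 captures a bare delimiter.
TOKEN = re.compile(r'\\.|(\|)', re.DOTALL)


def split_skip_escaped(text):
    """Split on | unless it is consumed as part of a backslash-escaped pair."""
    parts, start = [], 0
    for m in TOKEN.finditer(text):
        if m.group(1) is not None:  # a real, unescaped delimiter
            parts.append(text[start:m.start()])
            start = m.end()
    parts.append(text[start:])
    return parts


print(split_skip_escaped(r'|1|2|\|Three and Four\||5'))
# ['', '1', '2', '\\|Three and Four\\|', '5']
```

Because an escaped backslash is consumed as a pair too, a | that follows \\ is still treated as a delimiter, which a plain (?<!\\) look-behind would get wrong.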


