How to split a string containing both delimiter and the escaped delimiter?
Ruby 1.8.7 doesn't have negative lookbehind without Oniguruma (which may be compiled in).
1.9.3; yay:
> s = "a;b;c\\;d"
=> "a;b;c\\;d"
> s.split /(?<!\\);/
=> ["a", "b", "c\\;d"]
1.8.7 with Oniguruma doesn't offer a trivial split, but you can get match offsets and pull apart the substrings that way. I assume there's a better way to do this I'm not remembering:
> require 'oniguruma'
> re = Oniguruma::ORegexp.new "(?<!\\\\);"
> s = "hello;there\\;nope;yestho"
> re.match_all s
=> [#<MatchData ";">, #<MatchData ";">]
> mds = re.match_all s
=> [#<MatchData ";">, #<MatchData ";">]
> mds.collect {|md| md.offset}
=> [[5, 6], [17, 18]]
Other options include:
- Splitting on ; and post-processing the results looking for trailing \\, or
- Do a char-by-char loop and maintain some simple state and just split manually.
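The first option (split on every delimiter, then repair the pieces that ended in a backslash) could look roughly like the Python sketch below. The function name split_unescaped and its parameters are illustrative assumptions, and the sketch assumes the escape character itself is never escaped.

```python
def split_unescaped(text, delimiter=";", escape="\\"):
    """Split on every delimiter, then glue back pieces that ended with the escape."""
    parts = []
    pending = None  # a piece that ended with the escape and still needs its tail
    for piece in text.split(delimiter):
        if pending is not None:
            # The delimiter that caused this split was escaped: restore it.
            piece = pending + delimiter + piece
            pending = None
        if piece.endswith(escape):
            pending = piece
        else:
            parts.append(piece)
    if pending is not None:
        # Input ended with a dangling escape; keep it as-is.
        parts.append(pending)
    return parts


result = split_unescaped("a;b;c\\;d")
print(result)
```

Like the lookbehind version, this leaves the backslash in the output, so the pieces still need unescaping if the raw value is wanted.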
Split string which contains escaped delimiters
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

static final char ESCAPING_CHAR = '\\';

private List<String> parseString(final String str,
                                 final char delimiter,
                                 final boolean removeEmpty)
        throws IOException
{
    final Reader input = new StringReader(str);
    final StringBuilder part = new StringBuilder();
    final List<String> result = new ArrayList<String>();
    int c;
    do {
        c = input.read();                  // get the next character
        if (c != delimiter) {              // so long as it isn't a delimiter...
            if (c == ESCAPING_CHAR)        // if it's an escape
                c = input.read();          // use the following character instead
            if (c >= 0) {                  // only if NOT at end of string...
                part.append((char) c);     // append to current part
                continue;                  // move on to next character
            }
        }
        /* we're at either a real delimiter, or end of string => part complete */
        if (part.length() > 0 || !removeEmpty) { // keep this part?
            result.add(part.toString());   // add current part to result
            part.setLength(0);             // reset for next part
        }
    } while (c >= 0);                      // repeat until end of string found
    return result;
}
Split a string where the separators can be escaped
If using a JavaScript with a regex engine that supports negative look-behinds (eg. Chrome), and in a case of only a single/simple escape shown, and no method to escape-the-escape, it's possible to use a relatively simple negative look-behind:
'|1|2|\\|Three and Four\\||5'.split(/(?<!\\)\|/)
// -> ["", "1", "2", "\|Three and Four\|", "5"]
This says, in an engine such as Chrome that supports negative look-behinds, to split on a "|" that is not preceded by a "\". A look-behind can also be converted to a look-ahead for engine compatibility; variations are discussed in RegEx needed to split javascript string on "|" but not "\|".
However, as pointed out, the above doesn't touch the \| sequence and thus leaves the escape sequence in the results.
Alternatively, a multistep approach can solve this and also take care of the escape character as part of the process:
- Replace the escaped separators with an "alternate" character/string
- Split on the remaining (non-escaped) separators
- Convert the "alternate" character/string back in the individual components
str = '|1|2|\\|Three and Four\\||5'
// replace \| -> "alternative"
// this assumes that \\| (escape-the-escape) is not allowed
rep = str.replace(/\\[|]/g, '~~~~')
// split, then replace back, without any of the escapes
res = rep.split('|').map(function (f) { return f.replace(/~~~~/g, "|") })
// res -> ["", "1", "2", "|Three and Four|", "5"]
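One weakness of the multistep approach is that the "alternate" string might already occur in the data. A hedged Python sketch of the same three steps, using a NUL character as the placeholder and refusing input that already contains it, is shown below; the function name and the choice of placeholder are assumptions made for illustration, and escape-the-escape is still not supported.

```python
def split_with_placeholder(text, sep="|", escape="\\", placeholder="\x00"):
    """Protect escaped separators, split on the rest, then restore them unescaped."""
    if placeholder in text:
        # The trick only works if the placeholder cannot collide with real data.
        raise ValueError("placeholder already occurs in the input")
    protected = text.replace(escape + sep, placeholder)   # step 1: hide escaped separators
    pieces = protected.split(sep)                         # step 2: split on the real ones
    return [p.replace(placeholder, sep) for p in pieces]  # step 3: put separators back


res = split_with_placeholder('|1|2|\\|Three and Four\\||5')
print(res)
```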
Split string with escaped delimiter using a delimiter
You can use a regular expression!
The negative lookbehind (?<!\\) splits at an ampersand (&) only if the current position is not preceded by a backslash (written as two backslashes in the pattern to escape it).
>>> import re
>>> string = 'fir\\&st_part&secon\\&d_part'
>>> re.split(r'(?<!\\)&', string)
['fir\\&st_part', 'secon\\&d_part']
With the resulting list, you can iterate and replace the escaped '\&' with '&' if necessary!
>>> print([each.replace('\\&', '&') for each in re.split(r'(?<!\\)&', string)])
['fir&st_part', 'secon&d_part']
Python - How do I split a string that includes an escape character as a delimiter?
Convert your string to raw string by doing r'string'
Try this:
MyString = r'A\x92\xa4\xbf'
delim = '\\' + 'x' #OR simply: delim = '\\x'
MyList = MyString.split(delim)
print(MyList)
Output:
['A', '92', 'a4', 'bf']
This technique works for any escape sequence like \x (let me know otherwise xD); just set the delimiter as \\x.
Working sample: https://repl.it/@stupidlylogical/RawStringPython
Works because:
Explanation: Python raw string treats backslash (\) as a literal character. This is useful when we want to have a string that contains backslash and don't want it to be treated as an escape character.
More: https://docs.python.org/2/reference/lexical_analysis.html#string-literals
When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string.
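To make the difference concrete, here is a small sketch comparing a normal and a raw literal. The variable names are illustrative, and \x41 is used only because it decodes to a printable character.

```python
# In a normal literal, \x41 is an escape sequence and becomes one character: 'A'.
cooked = '\x41'

# With the r prefix the backslash is kept, so this is four characters: \ x 4 1
raw = r'\x41'

print(len(cooked), len(raw))

# Only the raw version still contains the two-character sequence \x to split on.
print(raw.split('\\x'))

parts = r'A\x92\xa4\xbf'.split('\\x')
print(parts)
```

Note that the r prefix only affects how a literal in source code is read; a string that was already decoded (for example, one read from a file) has no backslashes left to split on.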
Java String.split() regex for handling escaped delimiter and escaped escape characters
If it has to be split then you can try something like
split("(?<!(?<!\\\\)\\\\(\\\\{2}){0,1000000000}),")
I used {0,1000000000} instead of * because look-behind in Java needs to have an obvious maximal length, and 1000000000 seems to be good enough, unless you can have more than 1000000000 continuous \\ in your text.
If it doesn't have to be split then you can use
Matcher m = Pattern.compile("(\\G.*?(?<!\\\\)(\\\\{2})*)(,|(?<!\\G)$)",
        Pattern.DOTALL).matcher(testString);
while (m.find()) {
System.out.println(m.group(1));
}
\\G means the end of the previous match or, if this is the first iteration of the Matcher and there was no previous match, the start of the string (^).
But the fastest and not so hard to implement approach would be writing your own parser, which would use a flag like escaped to signal that the currently checked character was escaped with \.
public static List<String> parse(String text) {
List<String> tokens = new ArrayList<>();
boolean escaped = false;
StringBuilder sb = new StringBuilder();
for (char ch : text.toCharArray()) {
if (ch == ',' && !escaped) {
tokens.add(sb.toString());
sb.delete(0, sb.length());
} else {
if (ch == '\\')
escaped = !escaped;
else
escaped = false;
sb.append(ch);
}
}
if (sb.length() > 0) {
tokens.add(sb.toString());
sb.delete(0, sb.length());
}
return tokens;
}
Demo of all approaches:
String testString = "a\\,b\\\\,c,d\\\\\\,e,f\\\\g";
String[] splitedString = testString
.split("(?<!(?<!\\\\)\\\\(\\\\{2}){0,1000000000}),");
for (String string : splitedString) {
System.out.println(string);
}
System.out.println("-----");
Matcher m = Pattern.compile("(\\G.*?(?<!\\\\)(\\\\{2})*)(,|(?<!\\G)$)",
Pattern.DOTALL).matcher(testString);
while (m.find()) {
System.out.println(m.group(1));
}
System.out.println("-----");
for (String s : parse(testString))
System.out.println(s);
Output:
a\,b\\
c
d\\\,e
f\\g
-----
a\,b\\
c
d\\\,e
f\\g
-----
a\,b\\
c
d\\\,e
f\\g
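For comparison, the flag-based parser can be sketched in Python as well. This is a hypothetical port (the default arguments are illustrative assumptions), run against the same test string as the demo; like the Java version, it keeps the escape characters in the tokens and drops an empty final token.

```python
def parse(text, delimiter=",", escape="\\"):
    """Split on delimiter unless the previous, unpaired character was the escape."""
    tokens = []
    current = []
    escaped = False  # True when the previous character was an unpaired escape
    for ch in text:
        if ch == delimiter and not escaped:
            tokens.append("".join(current))
            current = []
        else:
            # An escape toggles the flag (so \\ cancels out); anything else clears it.
            escaped = (ch == escape) and not escaped
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# Same input as the Java demo: a\,b\\,c,d\\\,e,f\\g
test_string = "a\\,b\\\\,c,d\\\\\\,e,f\\\\g"
for token in parse(test_string):
    print(token)
```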
Split using delimiter except when delimiter is escaped
First off, I've dealt with data from Excel before, and what you typically see is comma-separated values; if a value is considered to be a string it will have double quotes around it (and can contain commas and double quotes). If it is considered to be numeric then there are no double quotes. Additionally, a double quote inside the data is escaped by doubling it, like "". So assuming all of that, here's how I've dealt with this in the past:
public static IEnumerable<string> SplitExcelRow(this string value)
{
    // Protect escaped quotes ("") with a placeholder before scanning.
    value = value.Replace("\"\"", "&quot;");
    bool quoted = false;
    int currStartIndex = 0;
    for (int i = 0; i < value.Length; i++)
    {
        char currChar = value[i];
        if (currChar == '"')
        {
            quoted = !quoted;
        }
        else if (currChar == ',')
        {
            if (!quoted)
            {
                // Strip the surrounding quotes, then restore the escaped ones.
                yield return value.Substring(currStartIndex, i - currStartIndex)
                    .Trim()
                    .Replace("\"", "")
                    .Replace("&quot;", "\"");
                currStartIndex = i + 1;
            }
        }
    }
    yield return value.Substring(currStartIndex, value.Length - currStartIndex)
        .Trim()
        .Replace("\"", "")
        .Replace("&quot;", "\"");
}
Of course this assumes the data coming in is valid, so if you have something like "fo,o"b,ar","bar""foo" this will not work. Additionally, if your data contains &quot; then it will be turned into a ", which may or may not be desirable.
Split string by delimiter, but not if it is escaped
Use dark magic:
$array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);
\\\\. matches a backslash followed by a character, (*SKIP)(*FAIL) skips it and \| matches your delimiter.
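Python's re module has no (*SKIP)(*FAIL) verbs, but the same idea (consume every backslash-plus-character pair first, and only split on a delimiter that was not consumed that way) can be approximated by scanning with finditer. The sketch below is an assumption-laden illustration, not a translation of the PCRE internals; the names TOKEN and split_skipping_escapes are made up.

```python
import re

# First alternative: a backslash followed by any character (an escape pair).
# Second alternative: a bare delimiter. DOTALL mirrors the ~s flag.
TOKEN = re.compile(r"\\.|\|", re.DOTALL)


def split_skipping_escapes(text):
    """Split on | but step over anything that is part of a backslash escape."""
    parts = []
    start = 0
    for match in TOKEN.finditer(text):
        if match.group() == "|":
            # A delimiter that was not swallowed by an escape pair: split here.
            parts.append(text[start:match.start()])
            start = match.end()
        # Escape pairs are matched only so the scan moves past them.
    parts.append(text[start:])
    return parts


print(split_skipping_escapes("a|b\\|c|d"))
```

Because escape pairs are consumed left to right, an escaped backslash followed by a delimiter (\\|) splits correctly, which a plain negative lookbehind does not handle.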