Regex Split String But Keep Separators

How to keep the delimiters of Regex.Split?

Just put the pattern into a capture-group, and the matches will also be included in the result.

string[] result = Regex.Split("123.456.789", @"(\.)");

Result:

{ "123", ".", "456", ".", "789" }

This also works for many other languages:

  • JavaScript: "123.456.789".split(/(\.)/g)
  • Python: re.split(r"(\.)", "123.456.789")
  • Perl: split(/(\.)/g, "123.456.789")

(Not in Java, though: String.split discards the delimiters even when the pattern contains a capturing group.)
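To make the role of the capturing group concrete, here is a minimal Python sketch (the variable names are only illustrative) that compares a plain pattern, a capturing group, and a non-capturing group on the same input:

```python
import re

text = "123.456.789"

# Plain pattern: the separators are dropped.
plain = re.split(r"\.", text)
# Capturing group: each matched separator is returned as its own item.
captured = re.split(r"(\.)", text)
# Non-capturing group: behaves like the plain pattern again.
non_capturing = re.split(r"(?:\.)", text)

print(plain)          # ['123', '456', '789']
print(captured)       # ['123', '.', '456', '.', '789']
print(non_capturing)  # ['123', '456', '789']
```

Only a capturing group changes the result, so it is worth checking that the parentheses in a pattern are not of the (?:...) kind.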

How to split string with Regex.Split and keep all separators?

You need a pattern with a lookahead only:

\s+(?=delim1|delim2)

The \s+ matches one or more whitespace characters (since your string contains whitespace). If the whitespace may be missing, use \s* instead, but then you will need to remove empty entries from the result. If the delimiters must be whole words, add \b word boundaries: \s+(?=\b(?:delim1|delim2)\b).

In C#:

addrArr = Regex.Split(inputText, string.Format(@"\s+(?={0})", string.Join("|", delimeters)));

If the delimiters can contain special regex metacharacters, you will need to run Regex.Escape on your delimiters list.

A C# demo:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var inputText = "substring1 delim1 substring2 delim2 substr3";
var delimeters = new List<string> { "delim1", "delim2" };
var addrArr = Regex.Split(inputText,
    string.Format(@"\s+(?={0})", string.Join("|", delimeters.Select(Regex.Escape))));
Console.WriteLine(string.Join("\n", addrArr));
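For comparison, the same idea can be sketched in Python, with re.escape playing the role of Regex.Escape. The function name split_before_delimiters is made up for this illustration:

```python
import re

def split_before_delimiters(text, delimiters):
    """Split on whitespace that is directly followed by one of the delimiters."""
    # Escape each delimiter so regex metacharacters are matched literally.
    alternation = "|".join(re.escape(d) for d in delimiters)
    # The lookahead checks for a delimiter without consuming it,
    # so the delimiter stays at the start of the next piece.
    pattern = r"\s+(?=" + alternation + r")"
    return re.split(pattern, text)

parts = split_before_delimiters(
    "substring1 delim1 substring2 delim2 substr3",
    ["delim1", "delim2"],
)
print(parts)
# ['substring1', 'delim1 substring2', 'delim2 substr3']
```

Only the whitespace in front of a delimiter is consumed; whitespace elsewhere in the string is left alone.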

Python RE library String Split but keep the delimiters/separators as part of the next string

If you are using Python 3.7+, you can split on zero-length matches using re.split and a positive lookahead:

import re

string = 'a+0b-2a+b-b'
print(re.split(r'(?=[+-])', string))
# ['a', '+0b', '-2a', '+b', '-b']

Demo: https://regex101.com/r/AB6UBa/1
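If you are stuck on an older Python where re.split does not split on zero-length matches, one possible workaround (my own suggestion, not part of the answer above) is to match the pieces with re.findall instead of splitting. A rough sketch:

```python
import re

string = 'a+0b-2a+b-b'

# Each term is an optional leading sign followed by one or more non-sign characters.
terms = re.findall(r'[+-]?[^+-]+', string)
print(terms)
# ['a', '+0b', '-2a', '+b', '-b']
```

Be aware that this is not a drop-in replacement: a sign that is immediately followed by another sign has no term to attach to and is silently dropped.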

Javascript and regex: split string and keep the separator

I had a similar but slightly different problem. Anyway, here are examples of the different places the delimiter can end up.

"1、2、3".split("、")          // ["1", "2", "3"]
"1、2、3".split(/(、)/g)      // ["1", "、", "2", "、", "3"]
"1、2、3".split(/(?=、)/g)    // ["1", "、2", "、3"]
"1、2、3".split(/(?!、)/g)    // ["1、", "2、", "3"]
"1、2、3".split(/(.*?、)/g)   // ["", "1、", "", "2、", "3"]

Warning: The fourth pattern only works when every item is a single character, because it splits before every character that is not the delimiter. ConnorsFan presents an alternative:

// Split a path, but keep the slashes that follow directories
var str = 'Animation/rawr/javascript.js';
var tokens = str.match(/[^\/]+\/?|\//g);
// tokens is ["Animation/", "rawr/", "javascript.js"]
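The three useful placements (delimiter as its own item, at the start of the next item, or at the end of the previous item) can be wrapped in one small function. The sketch below is in Python 3.7+ and uses a lookbehind instead of the negative-lookahead trick, so it is not limited to single-character items; split_keep and its mode names are invented for this example:

```python
import re

def split_keep(text, delim, mode="separate"):
    """Hypothetical helper: split text on a literal delimiter and keep it."""
    d = re.escape(delim)  # treat the delimiter as literal text
    patterns = {
        "separate": f"({d})",     # delimiter becomes its own item
        "leading": f"(?={d})",    # delimiter starts the next item
        "trailing": f"(?<={d})",  # delimiter ends the previous item
    }
    return re.split(patterns[mode], text)

print(split_keep("1、2、3", "、", "separate"))  # ['1', '、', '2', '、', '3']
print(split_keep("1、2、3", "、", "leading"))   # ['1', '、2', '、3']
print(split_keep("1、2、3", "、", "trailing"))  # ['1、', '2、', '3']
```

The "leading" and "trailing" modes rely on splitting at zero-length matches, which Python's re.split supports from version 3.7 on.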

Regex split string but keep separators

Use zero-length matching lookarounds; you want to split on

(?=\[)|(?<=\])

That is, at any position where we assert a literal [ ahead, or where we assert a literal ] behind.

As a C# string literal, this is

@"(?=\[)|(?<=\])"

See also

  • regular-expressions.info/Lookarounds

Related questions

  • Java split is eating my characters. -- has many examples

Example in Java

System.out.println(java.util.Arrays.toString(
    "abc[s1]def[s2][s3]ghi".split("(?=\\[)|(?<=\\])")
));
// prints "[abc, [s1], def, [s2], [s3], ghi]"

System.out.println(java.util.Arrays.toString(
    "abc;def;ghi;".split("(?<=;)")
));
// prints "[abc;, def;, ghi;]"

System.out.println(java.util.Arrays.toString(
    "OhMyGod".split("(?=(?!^)[A-Z])")
));
// prints "[Oh, My, God]"
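The same three patterns carry over to other engines, with one difference worth knowing. As a rough Python 3.7+ sketch (variable names are mine): Python's re.split keeps a trailing empty string when the pattern matches at the very end of the input, whereas Java's split drops it.

```python
import re

# Split before "[" and after "]".
brackets = re.split(r'(?=\[)|(?<=\])', 'abc[s1]def[s2][s3]ghi')
print(brackets)  # ['abc', '[s1]', 'def', '[s2]', '[s3]', 'ghi']

# Split after each ";". The match at the end of the string leaves a trailing ''.
semis = re.split(r'(?<=;)', 'abc;def;ghi;')
print(semis)     # ['abc;', 'def;', 'ghi;', '']

# Filter the empty piece out if it is not wanted.
semis_clean = [piece for piece in semis if piece]
print(semis_clean)  # ['abc;', 'def;', 'ghi;']

# Split before every uppercase letter except one at the start.
words = re.split(r'(?=(?!^)[A-Z])', 'OhMyGod')
print(words)     # ['Oh', 'My', 'God']
```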

Dart - Split String with Regex and keep the delimiter

You're close. Try:

var re = RegExp(r'(?=<link=".*?">)|(?<=</link>)');

It has two differences from your RegExp:

  • It swaps the (?= and (?<= because you want a split before a <link...>, so you want a lookahead for that, and after a </link>, so a lookbehind for that.
  • I added the ? to ".*?", because otherwise it could potentially match until a later " on the same line, instead of the first one. Your example didn't have that, but better safe than sorry.

With that, you get the strings:

  1. "This is first text "
  2. "<link=\"www.stackoverflow.com\">First Hello</link>"
  3. "\nThis is the second text "
  4. "<link=\"www.stackoverflow.com\">Second</link>"
  5. "\n"

If you don't want the newlines to be included, you should probably remove them first.

If you want to combine the \n with the </link>, you can change the RegExp to

var re = RegExp(r'(?=<link=".*?">)|(?<=</link>\n*(?<=\n))');

That gives you:

  1. "This is first text "
  2. "<link=\"www.stackoverflow.com\">First Hello</link>\n"
  3. "This is the second text "
  4. "<link=\"www.stackoverflow.com\">Second</link>\n"
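For readers who want to try this outside Dart, here is a hedged Python 3.7+ sketch of both variants. The input string is reconstructed from the pieces listed above, so treat it as an assumption about the original text. Python's re module only accepts fixed-width lookbehinds, so the second pattern assumes exactly one \n after each </link> instead of using \n*:

```python
import re

# Input reconstructed from the listed pieces (an assumption).
text = ('This is first text <link="www.stackoverflow.com">First Hello</link>\n'
        'This is the second text <link="www.stackoverflow.com">Second</link>\n')

# Variant 1: split before <link=...> and after </link>.
pieces = re.split(r'(?=<link=".*?">)|(?<=</link>)', text)

# Variant 2: keep the newline with the closing tag.
# Fixed-width lookbehind: assumes a single "\n" follows each </link>.
raw = re.split(r'(?=<link=".*?">)|(?<=</link>\n)', text)
pieces_with_newline = [p for p in raw if p]  # drop the trailing empty piece

for p in pieces:
    print(repr(p))
for p in pieces_with_newline:
    print(repr(p))
```

The filter in the second variant is needed because the pattern also matches at the very end of the input, which makes re.split return a final empty string.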

How to split string but keep delimiters in java?

From your input string and expected results, I infer that you want to split the string according to three rules:

  • Split at any point that is both preceded and followed by a colon
  • Split at any point that is preceded by a space and followed by a colon
  • Split at any point that is preceded by a colon and followed by a space

You can cover all three cases with one regex that uses alternation:

(?<=:)(?=:)|(?<= )(?=:)|(?<=:)(?= )


Java code:

import java.util.Arrays;

String s = "Hello, :smile::hearth: world!";
System.out.println(Arrays.toString(s.split("(?<=:)(?=:)|(?<= )(?=:)|(?<=:)(?= )")));

This prints your expected output:

[Hello, , :smile:, :hearth:,  world!]

Alternatively, if you can match the tokens instead of splitting, the regex becomes much simpler:

:[^:]+:|\S+


Java code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "Hello, :smile::hearth: world!";
Pattern p = Pattern.compile(":[^:]+:|\\S+");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}

This prints:

Hello,
:smile:
:hearth:
world!
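To see the difference between the two approaches side by side, here is a small Python 3.7+ sketch using the same patterns (variable names are illustrative). Splitting keeps the surrounding spaces inside the pieces, while matching with \S+ leaves them out:

```python
import re

s = "Hello, :smile::hearth: world!"

# Approach 1: split at zero-width positions around the colons.
split_parts = re.split(r'(?<=:)(?=:)|(?<= )(?=:)|(?<=:)(?= )', s)
print(split_parts)
# ['Hello, ', ':smile:', ':hearth:', ' world!']

# Approach 2: match either a :name: token or a run of non-whitespace.
matched_parts = re.findall(r':[^:]+:|\S+', s)
print(matched_parts)
# ['Hello,', ':smile:', ':hearth:', 'world!']
```

Which one fits depends on whether the whitespace around the tokens has to be preserved.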

How to Split string but keep delimiter at the start

Instead of lookbehind you need to use a lookahead for splitting:

(?=,)


You want to split at each position where the next character is a comma, and that is exactly what a lookahead assertion expresses. A lookbehind assertion, by contrast, matches where the previous character is a comma, so it splits after each comma rather than before it.

Code:

String text = "1,2,3,4,5,6";
var split = Regex.Split(text, @"(?=,)");
//=> ["1", ",2", ",3", ",4", ",5", ",6"]
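One way to convince yourself of the difference is to list the positions where each assertion matches. The following Python sketch (3.7+ for the re.split calls; names are only illustrative) does that for a shorter input:

```python
import re

text = "1,2,3"  # the commas sit at indices 1 and 3

# A lookahead matches at the position just before each comma.
before_positions = [m.start() for m in re.finditer(r'(?=,)', text)]
# A lookbehind matches at the position just after each comma.
after_positions = [m.start() for m in re.finditer(r'(?<=,)', text)]

print(before_positions)  # [1, 3]
print(after_positions)   # [2, 4]

print(re.split(r'(?=,)', text))   # ['1', ',2', ',3']
print(re.split(r'(?<=,)', text))  # ['1,', '2,', '3']
```

Both assertions are zero-width, so no character is consumed and nothing is lost from the output; only the cut position moves.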

How to split a string with a regex of 2 delimiters, but keep the delimiters in Python?

Don't use re.split(), use re.findall() with a regexp that matches each sub-expression.

import re

s = "+2x-10+5"
result = re.findall(r'[-+]\w+', s)
print(result)  # ['+2x', '-10', '+5']

How to split a string, but also keep the delimiters?

You can use lookahead and lookbehind, which are features of regular expressions.

System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));

And you will get:

[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]

The last one is what you want.

((?<=;)|(?=;)) matches the empty string (a zero-width position) just after or just before a ;.

EDIT: Fabian Steeg's comments on readability are valid. Readability is always a problem with regular expressions. One thing I do to make them more readable is to store the pattern in a variable whose name says what the regular expression does. You can even put placeholders (e.g. %1$s) in it and use Java's String.format to replace them with the actual string you need; for example:

public static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

public void someMethod() {
    final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
    ...
}
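The named-template idea translates directly to other languages. Below is a minimal Python 3.7+ sketch under my own assumptions: the constant mirrors the Java one, while split_with_delimiter is a made-up helper that also escapes the delimiter so that characters such as "." are taken literally:

```python
import re

# Template with a placeholder for the (already escaped) delimiter.
WITH_DELIMITER = "(?<={0})|(?={0})"

def split_with_delimiter(text, delim):
    """Hypothetical helper: split around delim, keeping it as its own item."""
    pattern = WITH_DELIMITER.format(re.escape(delim))
    return re.split(pattern, text)

print(split_with_delimiter("a;b;c;d", ";"))
# ['a', ';', 'b', ';', 'c', ';', 'd']
print(split_with_delimiter("1.2.3", "."))
# ['1', '.', '2', '.', '3']
```

Without the escaping step, a delimiter like "." would be read as "any character" and the split would happen between every pair of characters.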

