How to Keep the Delimiters of Regex.Split

How to keep the delimiters of Regex.Split?

Just put the pattern into a capturing group, and the matched delimiters will also be included in the result.

string[] result = Regex.Split("123.456.789", @"(\.)");

Result:

{ "123", ".", "456", ".", "789" }

This also works for many other languages:

  • JavaScript: "123.456.789".split(/(\.)/g)
  • Python: re.split(r"(\.)", "123.456.789")
  • Perl: split(/(\.)/g, "123.456.789")

(Not Java, though: String.split does not include captured groups in its result.)

In Python, how do I split a string and keep the separators?

>>> re.split(r'(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']
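
The difference between a capturing group and no group (or a non-capturing group) can be checked directly. This is a minimal sketch that reuses the sample string from the first answer; the variable names are illustrative only.

```python
import re

text = "123.456.789"

# Without a group, the delimiters are discarded.
plain = re.split(r"\.", text)

# A non-capturing group behaves the same as no group at all.
non_capturing = re.split(r"(?:\.)", text)

# A capturing group makes re.split return the delimiters as well.
kept = re.split(r"(\.)", text)

print(plain)          # ['123', '456', '789']
print(non_capturing)  # ['123', '456', '789']
print(kept)           # ['123', '.', '456', '.', '789']

# Nothing is dropped, so joining the pieces restores the input.
assert "".join(kept) == text
```

Since the capturing version loses no characters, it is a reasonable choice when the pieces have to be reassembled later.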

Dart - Split String with Regex and keep the delimiter

You're close. Try:

var re = RegExp(r'(?=<link=".*?">)|(?<=</link>)');

It has two differences from your RegExp:

  • It swaps the (?= and (?<= because you want a split before a <link...>, so you want a lookahead for that, and after a </link>, so a lookbehind for that.
  • I added the ? to ".*?", because otherwise it could potentially match until a later " on the same line, instead of the first one. Your example didn't have that, but better safe than sorry.

With that, you get the strings:

  1. "This is first text "
  2. "<link=\"www.stackoverflow.com\">First Hello</link>"
  3. "\nThis is the second text "
  4. "<link=\"www.stackoverflow.com\">Second</link>"
  5. "\n"

If you don't want the newlines to be included, you should probably remove them first.

If you want to keep the \n together with the </link>, you can change the RegExp to

var re = RegExp(r'(?=<link=".*?">)|(?<=</link>\n*(?<=\n))');

That gives you:

  1. "This is first text "
  2. "<link=\"www.stackoverflow.com\">First Hello</link>\n"
  3. "This is the second text "
  4. "<link=\"www.stackoverflow.com\">Second</link>\n"
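
For comparison, the first lookaround pattern can be tried in Python as well. This is only a sketch: the input string is reconstructed from the pieces listed above (an assumption, since the original question's text is not shown here), and it relies on Python 3.7 or later, where re.split also splits on zero-width matches.

```python
import re

# Input reconstructed from the listed output pieces (an assumption).
text = (
    'This is first text <link="www.stackoverflow.com">First Hello</link>\n'
    'This is the second text <link="www.stackoverflow.com">Second</link>\n'
)

# Split before an opening <link=...> tag or after a closing </link> tag.
# Both alternatives are zero-width, so no characters are consumed.
pattern = re.compile(r'(?=<link=".*?">)|(?<=</link>)')

parts = pattern.split(text)
for part in parts:
    print(repr(part))
```

The second pattern, with \n* inside the lookbehind, does not carry over as written, because Python's re module only accepts fixed-width lookbehinds.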

C# split large string with Regex.Split. Must keep delimiters

You can use

var text = "Artículo 1. This is a test that includes : 1) Sample text 2) Sample text";
var result = Regex.Split(text, @"(?!^)\s+(?=\bArtículo\s+[0-9]+\.|[a-z]\)|[1-9]\d?\)|\bPárrafo\b)", RegexOptions.None);
Console.WriteLine(string.Join("\n", result));
// => Artículo 1. This is a test that includes :
// => 1) Sample text
// => 2) Sample text

See the C# demo and the regex demo.

The regex is

(?!^)\s+(?=\bArtículo\s+[0-9]+\.|[a-z]\)|[1-9]\d?\)|\bPárrafo\b)

It matches

  • (?!^) - a location other than start of string
  • \s+ - 1+ whitespaces (if you use \s*, you will need to add .Where(x => !string.IsNullOrEmpty(x)) after the Regex.Split call)
  • (?=\bArtículo\s+[0-9]+\.|[a-z]\)|[1-9]\d?\)|\bPárrafo\b) - a location that is immediately followed with
    • \bArtículo\s+[0-9]+\.| - whole word Artículo, 1+ whitespaces, 1+ ASCII digits, and a ., or
    • [a-z]\)| - a lowercase ASCII letter and ), or
    • [1-9]\d?\)| - a non-zero digit, then an optional digit and a ), or
    • \bPárrafo\b - a whole word Párrafo.
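
The breakdown above can also be written as a commented pattern. The following is a Python sketch of the same regex using re.VERBOSE, not the original C# code; the sample text is the one from the answer.

```python
import re

text = "Artículo 1. This is a test that includes : 1) Sample text 2) Sample text"

pattern = re.compile(
    r"""
    (?!^)                        # not at the very start of the string
    \s+                          # whitespace consumed by the split
    (?=                          # only when followed by one of:
        \bArtículo\s+[0-9]+\.    #   Artículo, whitespace, digits, a dot
      | [a-z]\)                  #   a lowercase ASCII letter and a closing paren
      | [1-9]\d?\)               #   a non-zero digit, an optional digit, a closing paren
      | \bPárrafo\b              #   the whole word Párrafo
    )
    """,
    re.VERBOSE,
)

parts = pattern.split(text)
print("\n".join(parts))
```

Only the whitespace in front of each marker is removed; the markers themselves stay at the start of the following piece because the lookahead does not consume them.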

How to split string with Regex.Split and keep all separators?

You need a pattern with a lookahead only:

\s+(?=delim1|delim2)

The \s+ will match 1 or more whitespaces (since your string contains whitespaces). In case there can be no whitespaces, use \s* (but then you will need to remove empty entries from the result). See the regex demo. If these delimiters must be whole words, use \b word boundaries: \s+(?=\b(?:delim1|delim2)\b).

In C#:

addrArr = Regex.Split(inputText, string.Format(@"\s+(?={0})", string.Join("|", delimeters)));

If the delimiters can contain special regex metacharacters, you will need to run Regex.Escape on your delimiters list.

A C# demo:

var inputText = "substring1 delim1 substring2 delim2 substr3";
var delimeters = new List<string> { "delim1", "delim2" };
var addrArr = Regex.Split(inputText,
    string.Format(@"\s+(?={0})", string.Join("|", delimeters.Select(Regex.Escape))));
Console.WriteLine(string.Join("\n", addrArr));
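
The same idea, building the lookahead from a list of escaped delimiters, might look like this in Python. The helper name split_before and the second sample string are hypothetical and only serve to show why escaping matters.

```python
import re

def split_before(text, delimiters):
    """Split on whitespace that directly precedes any of the delimiters."""
    # re.escape keeps characters such as + or ( from being read as regex syntax.
    alternatives = "|".join(re.escape(d) for d in delimiters)
    return re.split(rf"\s+(?={alternatives})", text)

parts = split_before("substring1 delim1 substring2 delim2 substr3", ["delim1", "delim2"])
print(parts)  # ['substring1', 'delim1 substring2', 'delim2 substr3']

# Delimiters that contain regex metacharacters still work once escaped.
special = split_before("one c++ two (x) three", ["c++", "(x)"])
print(special)  # ['one', 'c++ two', '(x) three']
```

Without the escaping step, a delimiter such as c++ would be parsed as a quantified pattern rather than literal text.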

How to split a string, but also keep the delimiters?

You can use lookahead and lookbehind, which are features of regular expressions.

System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));

And you will get:

[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]

The last one is what you want.

((?<=;)|(?=;)) matches the empty position immediately before or after a ;, so the split itself consumes no characters.

EDIT: Fabian Steeg's comments on readability are valid. Readability is always a problem with regular expressions. One thing I do to make regular expressions more readable is to create a variable whose name describes what the regular expression does. You can even put placeholders (e.g. %1$s) in it and use Java's String.format to replace the placeholders with the actual string you need; for example:

public static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

public void someMethod() {
    final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
    ...
}
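
The three variants shown above can be wrapped in one small helper. This is a Python sketch rather than a port of the Java code: split_keep and its where parameter are hypothetical names, and it assumes Python 3.7 or later, where re.split accepts zero-width matches.

```python
import re

def split_keep(text, delimiter, where="both"):
    """Split text at a literal delimiter without discarding it.

    where="after"  -> the delimiter stays at the end of each piece
    where="before" -> the delimiter stays at the start of each piece
    where="both"   -> the delimiter becomes a piece of its own
    """
    d = re.escape(delimiter)
    patterns = {
        "after": rf"(?<={d})",
        "before": rf"(?={d})",
        "both": rf"(?<={d})|(?={d})",
    }
    return re.split(patterns[where], text)

print(split_keep("a;b;c;d", ";", "after"))   # ['a;', 'b;', 'c;', 'd']
print(split_keep("a;b;c;d", ";", "before"))  # ['a', ';b', ';c', ';d']
print(split_keep("a;b;c;d", ";", "both"))    # ['a', ';', 'b', ';', 'c', ';', 'd']
```

Naming the three modes plays the same role as the WITH_DELIMITER constant: the caller states the intent and never has to read the lookaround syntax.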

Javascript and regex: split string and keep the separator

I was having a similar but slightly different problem. Anyway, here are examples of different scenarios for where to keep the delimiter.

"1、2、3".split("、") == ["1", "2", "3"]
"1、2、3".split(/(、)/g) == ["1", "、", "2", "、", "3"]
"1、2、3".split(/(?=、)/g) == ["1", "、2", "、3"]
"1、2、3".split(/(?!、)/g) == ["1、", "2、", "3"]
"1、2、3".split(/(.*?、)/g) == ["", "1、", "", "2、", "3"]

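Warning: The fourth only works when every item between delimiters is a single character, because it splits at every position that is not followed by the delimiter. ConnorsFan presents an alternative: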

// Split a path, but keep the slashes that follow directories
var str = 'Animation/rawr/javascript.js';
var tokens = str.match(/[^\/]+\/?|\//g);

How to split string but keep delimiters in java?

From your input string and expected results, I infer that you want to split the string according to three rules:

  • Split at any point that is both preceded and followed by a colon
  • Split at any point that is preceded by a space and followed by a colon
  • Split at any point that is preceded by a colon and followed by a space

Hence you can use this regex using alternations for all three cases mentioned above.

(?<=:)(?=:)|(?<= )(?=:)|(?<=:)(?= )

Regex Demo

Java code:

String s = "Hello, :smile::hearth: world!";
System.out.println(Arrays.toString(s.split("(?<=:)(?=:)|(?<= )(?=:)|(?<=:)(?= )")));

This prints your expected output:

[Hello, , :smile:, :hearth:,  world!]

Alternatively, if you can match the tokens instead of splitting, the regex becomes much simpler:

:[^:]+:|\S+

Regex Demo using match

Java code:

String s = "Hello, :smile::hearth: world!";
Pattern p = Pattern.compile(":[^:]+:|\\S+");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}

This prints:

Hello,
:smile:
:hearth:
world!
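
The two approaches differ in how they treat the spaces, which is easy to see side by side. A Python sketch using the same two patterns (Python 3.7 or later is assumed for the zero-width split):

```python
import re

s = "Hello, :smile::hearth: world!"

# Zero-width split: every character survives, including the spaces.
split_parts = re.split(r"(?<=:)(?=:)|(?<= )(?=:)|(?<=:)(?= )", s)
print(split_parts)  # ['Hello, ', ':smile:', ':hearth:', ' world!']

# Matching: whitespace between tokens is simply skipped.
matched = re.findall(r":[^:]+:|\S+", s)
print(matched)  # ['Hello,', ':smile:', ':hearth:', 'world!']
```

If the surrounding whitespace has to be preserved, the split version is the one to use; if only the tokens matter, the match version is shorter and easier to read.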

How to split a string with a regex of 2 delimiters, but keep the delimiters in Python?

Don't use re.split(), use re.findall() with a regexp that matches each sub-expression.

import re

s = "+2x-10+5"
result = re.findall(r'[-+]\w+', s)
print(result)
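
One caveat: the pattern above requires a sign in front of every term, so an unsigned first term would be dropped. A possible variation, offered only as a sketch (signed_terms is a hypothetical helper name), makes the sign optional:

```python
import re

def signed_terms(expr):
    # [-+]? makes the sign optional, so an unsigned first term is still matched.
    return re.findall(r"[-+]?\w+", expr)

print(signed_terms("+2x-10+5"))  # ['+2x', '-10', '+5']
print(signed_terms("2x-10+5"))   # ['2x', '-10', '+5']
```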

