Split String into a List, But Keeping the Split Pattern

In Python, how do I split a string and keep the separators?

>>> import re
>>> re.split(r'(\W)', 'foo/bar spam\neggs')
['foo', '/', 'bar', ' ', 'spam', '\n', 'eggs']

TypeScript / JavaScript - Split string but keeping the split pattern

Assuming your separator pattern is [A-Z]1, you can use either of these options:

  1. Use a combined positive lookahead and lookbehind to split at the positions just before and just after the delimiter

    str.split(/(?=[A-Z]1)|(?<=[A-Z]1)/)

    Note that browser lookbehind support is still reasonably patchy. In particular, older Safari releases (macOS and iOS) do not support it, so avoid this option if you must support them.

  2. Use a capturing group to include the separator and filter out empty values

    str.split(/([A-Z]1)/).filter(s => s.length)

Split string into a list, but keeping the split pattern

Thanks to Mark Wilkins for the inspiration, but here's a shorter bit of code for doing it:

irb(main):015:0> s = "split on the word on okay?"
=> "split on the word on okay?"
irb(main):016:0> b=[]; s.split(/(on)/).each_slice(2) { |s| b << s.join }; b
=> ["split on", " the word on", " okay?"]

or:

s.split(/(on)/).each_slice(2).map(&:join)

See below the fold for an explanation.


Here's how this works. First, we split on "on", but wrap it in parentheses to make it into a match group. When there's a match group in the regular expression passed to split, Ruby will include that group in the output:

s.split(/(on)/)
# => ["split ", "on", " the word ", "on", " okay?"]

Now we want to join each instance of "on" with the preceding string. each_slice(2) helps by passing two elements at a time to its block. Let's just invoke each_slice(2) to see what results. Since each_slice returns an Enumerator when invoked without a block, we'll apply to_a to it so we can see what it will enumerate over:

s.split(/(on)/).each_slice(2).to_a
# => [["split ", "on"], [" the word ", "on"], [" okay?"]]

We're getting close. Now all we have to do is join the words together. And that gets us to the full solution above. I'll unwrap it into individual lines to make it easier to follow:

b = []
s.split(/(on)/).each_slice(2) do |pair|
  b << pair.join
end
b
# => ["split on", " the word on", " okay?"]

But there's a nifty way to eliminate the temporary b and shorten the code considerably:

s.split(/(on)/).each_slice(2).map do |a|
  a.join
end

map passes each element of its input array to the block; the result of the block becomes the new element at that position in the output array. In MRI >= 1.8.7, you can shorten it even more, to the equivalent:

s.split(/(on)/).each_slice(2).map(&:join)
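
For comparison, the same idea can be sketched in Python. This is a rough equivalent rather than a translation of the Ruby above: the helper name split_keep_trailing is made up for illustration, and it assumes the separator pattern contains no capturing groups of its own (extra groups would add extra items to the re.split output).

```python
import re

def split_keep_trailing(text, sep_pattern):
    """Split text, attaching each separator to the piece before it.

    Hypothetical helper; assumes sep_pattern has no capturing groups.
    """
    # A capturing group makes re.split return the separators too, so the
    # pieces alternate: text, separator, text, separator, ..., text.
    pieces = re.split(f"({sep_pattern})", text)
    texts = pieces[0::2]
    # The final text piece has no separator after it, so pad with "".
    seps = pieces[1::2] + [""]
    return [t + s for t, s in zip(texts, seps)]

print(split_keep_trailing("split on the word on okay?", "on"))
# ['split on', ' the word on', ' okay?']
```

The slicing plays the role of each_slice(2), and the list comprehension plays the role of map(&:join).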

Python: Split string without losing split character

If you want to do this in a single line:


string = "HELLO.WORLD.AGAIN."
pattern = "."
result = string.replace(pattern, f" {pattern} ").split(" ")
# Note: this assumes the string itself contains no spaces.
# A separator at the end of the string leaves an empty string as the last element; uncomment the next line to drop it
# result = result[:-1]
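
If the input may contain spaces, a regex-based variant is one possible alternative. The sketch below is only an illustration (the variable names are arbitrary); it uses re.escape so the separator is treated as literal text and a capturing group so the separator stays in the result.

```python
import re

text = "HELLO.WORLD.AGAIN."
separator = "."

# re.escape turns "." into a literal dot; the capturing group keeps each
# separator in the output of re.split.
parts = re.split(f"({re.escape(separator)})", text)

# The trailing separator produces an empty string at the end; drop empties.
parts = [p for p in parts if p]

print(parts)  # ['HELLO', '.', 'WORLD', '.', 'AGAIN', '.']
```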

How to split a string, but also keep the delimiters?

You can use lookahead and lookbehind, which are features of regular expressions.

System.out.println(Arrays.toString("a;b;c;d".split("(?<=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("(?=;)")));
System.out.println(Arrays.toString("a;b;c;d".split("((?<=;)|(?=;))")));

And you will get:

[a;, b;, c;, d]
[a, ;b, ;c, ;d]
[a, ;, b, ;, c, ;, d]

The last one is what you want.

((?<=;)|(?=;)) matches the empty (zero-width) position just before or just after a ;.

EDIT: Fabian Steeg's comments on readability are valid. Readability is always a problem with regular expressions. One thing I do to make regular expressions more readable is to create a variable whose name describes what the regular expression does. You can even put placeholders (e.g. %1$s) in it and use Java's String.format to replace the placeholders with the actual string you need; for example:

public static final String WITH_DELIMITER = "((?<=%1$s)|(?=%1$s))";

public void someMethod() {
    final String[] aEach = "a;b;c;d".split(String.format(WITH_DELIMITER, ";"));
    // ...
}
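
The same lookaround technique can be sketched in Python, assuming Python 3.7 or later (earlier versions of re.split do not split on empty matches). The helper name with_delimiter is hypothetical and simply mirrors the WITH_DELIMITER constant above; because Python lookbehinds must be fixed-width, it only accepts a literal delimiter string.

```python
import re

def with_delimiter(delim):
    """Build a zero-width pattern matching just before or just after delim.

    Hypothetical helper; delim is treated as literal text.
    """
    d = re.escape(delim)
    return f"(?<={d})|(?={d})"

print(re.split(r"(?<=;)", "a;b;c;d"))            # ['a;', 'b;', 'c;', 'd']
print(re.split(r"(?=;)", "a;b;c;d"))             # ['a', ';b', ';c', ';d']
print(re.split(with_delimiter(";"), "a;b;c;d"))  # ['a', ';', 'b', ';', 'c', ';', 'd']
```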

How do I split a string and keep the separators using python re library?

You can use re.findall to capture each parenthesis group:

import re
string = r"('Option A' | 'Option B') & ('Option C' | 'Option D')"
pattern = r"(\([^\)]+\))"
re.findall(pattern, string)
# ["('Option A' | 'Option B')", "('Option C' | 'Option D')"]

This also works with re.split:

re.split(pattern, string)
# ['', "('Option A' | 'Option B')", ' & ', "('Option C' | 'Option D')", '']

If you want to remove the empty elements that re.split produces, you can filter them out:

[s for s in re.split(pattern, string) if s]
# ["('Option A' | 'Option B')", ' & ', "('Option C' | 'Option D')"]

How the pattern works:

  • ( begin capture group
  • \( matches the character ( literally
  • [^\)]+ matches one or more characters that are not )
  • \) matches the character ) literally
  • ) end capture group
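
One way to keep that breakdown next to the pattern itself is to write it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. The following is just a sketch of the same pattern in that style:

```python
import re

# Same pattern as above, spelled out with one commented part per line.
pattern = re.compile(
    r"""
    (           # begin capture group
      \(        # a literal opening parenthesis
      [^)]+     # one or more characters that are not a closing parenthesis
      \)        # a literal closing parenthesis
    )           # end capture group
    """,
    re.VERBOSE,
)

string = "('Option A' | 'Option B') & ('Option C' | 'Option D')"
print(pattern.findall(string))
# ["('Option A' | 'Option B')", "('Option C' | 'Option D')"]
```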

Split string in Python while keeping the line break inside the generated list

Split String using Regex findall()

import re

my_string = "This is a test.\nAlso\tthis"
my_list = re.findall(r"\S+|\n", my_string)

print(my_list)

How it Works:

  • "\S+": "\S" = non-whitespace characters. "+" is a greedy quantifier, so it finds every run of non-whitespace characters, i.e. the words
  • "|": OR logic
  • "\n": Find "\n" so it's returned as well in your list

Output:

['This', 'is', 'a', 'test.', '\n', 'Also', 'this']

Javascript and regex: split string and keep the separator

I was having a similar but slightly different problem. Anyway, here are examples of the different scenarios for where to keep the delimiter.

"1、2、3".split("、") == ["1", "2", "3"]
"1、2、3".split(/(、)/g) == ["1", "、", "2", "、", "3"]
"1、2、3".split(/(?=、)/g) == ["1", "、2", "、3"]
"1、2、3".split(/(?!、)/g) == ["1、", "2、", "3"]
"1、2、3".split(/(.*?、)/g) == ["", "1、", "", "2、", "3"]

Warning: The fourth only works for single-character delimiters. ConnorsFan presents an alternative:

// Split a path, but keep the slashes that follow directories
var str = 'Animation/rawr/javascript.js';
var tokens = str.match(/[^\/]+\/?|\//g);

How to regex split, but keep the split string?

You need to use a capture group:

>>> re.split(r'(watch/\d+/)', 'http://www.hulu.jp/watch/589851/supernatural-dub-hollywood-babylon/en')
['http://www.hulu.jp/', 'watch/589851/', 'supernatural-dub-hollywood-babylon/en']

How to split string with Regex.Split and keep all separators?

You need a pattern with a lookahead only:

\s+(?=delim1|delim2)

The \s+ will match 1 or more whitespaces (since your string contains whitespaces). In case there can be no whitespaces, use \s* (but then you will need to remove empty entries from the result). If these delimiters must be whole words, use \b word boundaries: \s+(?=\b(?:delim1|delim2)\b).

In C#:

addrArr = Regex.Split(inputText, string.Format(@"\s+(?={0})", string.Join("|", delimeters)));

If the delimiters can contain special regex metacharacters, you will need to run Regex.Escape on your delimiters list.

A C# demo:

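using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var inputText = "substring1 delim1 substring2 delim2 substr3";
var delimeters = new List<string> { "delim1", "delim2" };
// Split on whitespace that is immediately followed by one of the (escaped) delimiters
var addrArr = Regex.Split(inputText,
    string.Format(@"\s+(?={0})", string.Join("|", delimeters.Select(Regex.Escape))));
Console.WriteLine(string.Join("\n", addrArr));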
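
For readers working in Python, the lookahead-only approach might look like the sketch below. The variable names mirror the C# demo, and re.escape plays the role of Regex.Escape.

```python
import re

input_text = "substring1 delim1 substring2 delim2 substr3"
delimiters = ["delim1", "delim2"]

# Escape each delimiter so regex metacharacters are matched literally.
alternation = "|".join(re.escape(d) for d in delimiters)

# Split on whitespace that is immediately followed by a delimiter. The
# lookahead consumes nothing, so each delimiter stays at the start of the
# next piece.
parts = re.split(rf"\s+(?={alternation})", input_text)

print(parts)  # ['substring1', 'delim1 substring2', 'delim2 substr3']
```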

