Empty Strings at the Beginning and End of Split

Why does split on an empty string return a non-empty array?

For the same reason that

",test" split ','

and

",test," split ','

will both return an array of size 2. Everything before the first match is returned as the first element, and the trailing empty string in the second case is discarded.
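For comparison, here is a minimal Python sketch of the same two inputs. It is only an analogue, not the code from the question: Python's str.split keeps the trailing empty string, so the second case has three elements rather than two.

```python
# Python analogue of the two expressions above (variable names are illustrative).
leading = ",test".split(",")
both_ends = ",test,".split(",")

print(leading)    # ['', 'test']
print(both_ends)  # ['', 'test', '']
```

In both cases the text before the first comma is empty, which is why the first element is ''.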

Empty strings at the beginning and end of split

After reading AWK's specification, as suggested by the user mu is too short, I came to feel that the original intention of split in AWK was to extract substrings that correspond to fields, each terminated by a punctuation mark such as "," or ".", with the separator treated as something like an "end of field" character. The intention was not to split a string symmetrically into the left and right sides of each separator position, but to terminate a substring on the left side of each separator. Under this conception, it makes sense to always have some string (even an empty one) to the left of a separator, but not necessarily to its right. Ruby may have inherited this behavior via Perl.
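The "end of field character" idea can be sketched in Python. The helper split_terminated is hypothetical and only illustrates the conception described above; it is not how AWK, Perl, or Ruby actually implement split.

```python
def split_terminated(text, terminator=","):
    """Treat `terminator` as an end-of-field mark (illustrative helper only)."""
    fields = text.split(terminator)
    # Whatever follows the last terminator is an unterminated remainder.
    # Under the terminator view it only counts as a field if it is non-empty.
    if fields[-1] == "":
        fields.pop()
    return fields


print(split_terminated(",test"))   # ['', 'test']
print(split_terminated("test,"))   # ['test']
print(split_terminated(",test,"))  # ['', 'test']
```

An empty field on the left of a terminator survives, while nothing is invented on the right of the final one.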

How to split String without leaving behind empty strings?

Add the "one or more times" greediness quantifier to your character class:

String[] inputTokens = input.split("[(),\\s]+");

When the input starts with a delimiter, this still produces one leading empty String, which is unavoidable with the split() method. Otherwise the result contains no empty Strings.
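The effect of the + quantifier can be tried out with a rough Python analogue using re.split and the same character class. The sample text is made up for illustration. Unlike Java's split, Python's re.split would also keep a trailing empty string if the input ended with a delimiter.

```python
import re

text = "(alpha, beta) gamma"  # illustrative input

# Without "+", every delimiter character is its own split point,
# so adjacent delimiters leave empty strings between them.
without_plus = re.split(r"[(),\s]", text)

# With "+", a whole run of delimiters counts as a single split point.
with_plus = re.split(r"[(),\s]+", text)

print(without_plus)  # ['', 'alpha', '', 'beta', '', 'gamma']
print(with_plus)     # ['', 'alpha', 'beta', 'gamma']
```

The leading '' remains in both results because the input begins with a delimiter.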

When splitting an empty string in Python, why does split() return an empty list while split('\n') returns ['']?

Question: I am using split('\n') to get lines in one string, and found that ''.split() returns an empty list, [], while ''.split('\n') returns [''].

The str.split() method has two algorithms. If no arguments are given, it splits on repeated runs of whitespace. However, if an argument is given, it is treated as a single delimiter with no repeated runs.

In the case of splitting an empty string, the first mode (no argument) will return an empty list because the whitespace is eaten and there are no values to put in the result list.

In contrast, the second mode (with an argument such as \n) will produce the first empty field. Consider if you had written '\n'.split('\n'), you would get two fields (one split, gives you two halves).
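A small sketch that puts the two modes side by side on empty and whitespace-only input (variable names are illustrative):

```python
# Mode 1: no argument, runs of whitespace are separators and are discarded.
no_arg_empty = "".split()
whitespace_only = " \n\t ".split()

# Mode 2: explicit delimiter, every occurrence is one cut.
newline_empty = "".split("\n")
one_newline = "\n".split("\n")

print(no_arg_empty)     # []
print(whitespace_only)  # []
print(newline_empty)    # ['']
print(one_newline)      # ['', '']
```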

Question: Is there any specific reason for such a difference?

This first mode is useful when data is aligned in columns with variable amounts of whitespace. For example:

>>> data = '''\
Shasta      California     14,200
McKinley    Alaska         20,300
Fuji        Japan          12,400
'''
>>> for line in data.splitlines():
        print(line.split())

['Shasta', 'California', '14,200']
['McKinley', 'Alaska', '20,300']
['Fuji', 'Japan', '12,400']

The second mode is useful for delimited data such as CSV where repeated commas denote empty fields. For example:

>>> data = '''\
Guido,BDFL,,Amsterdam
Barry,FLUFL,,USA
Tim,,,USA
'''
>>> for line in data.splitlines():
        print(line.split(','))

['Guido', 'BDFL', '', 'Amsterdam']
['Barry', 'FLUFL', '', 'USA']
['Tim', '', '', 'USA']

Note, the number of result fields is one greater than the number of delimiters. Think of cutting a rope. If you make no cuts, you have one piece. Making one cut, gives two pieces. Making two cuts, gives three pieces. And so it is with Python's str.split(delimiter) method:

>>> ''.split(',')       # No cuts
['']
>>> ','.split(',')      # One cut
['', '']
>>> ',,'.split(',')     # Two cuts
['', '', '']
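The rope rule can be checked mechanically. The sketch below uses a handful of made-up sample strings and compares the number of pieces with the number of delimiters counted by str.count:

```python
samples = ["", ",", ",,", "a,b", ",a,,b,"]  # illustrative inputs

# Pair each sample with (number of pieces, number of cuts).
results = {s: (len(s.split(",")), s.count(",")) for s in samples}

for s, (pieces, cuts) in results.items():
    # One more piece than cuts, regardless of where the cuts fall.
    assert pieces == cuts + 1
    print(repr(s), pieces, cuts)
```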

Question: And is there any more convenient way to count lines in a string?

Yes, there are a couple of easy ways. One uses str.count() and the other uses str.splitlines(). Both ways will give the same answer unless the final line is missing the \n. If the final newline is missing, the str.splitlines approach will give the accurate answer. A faster technique that is also accurate uses the count method but then corrects it for the final newline:

>>> data = '''\
Line 1
Line 2
Line 3
Line 4'''

>>> data.count('\n')                               # Inaccurate
3
>>> len(data.splitlines())                         # Accurate, but slow
4
>>> data.count('\n') + (not data.endswith('\n'))   # Accurate and fast
4
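The corrected count can be wrapped in a small helper. The name count_lines is hypothetical, and the extra check for the empty string is an addition made here: the one-line expression above reports 1 for '', while len(''.splitlines()) is 0. Note also that str.splitlines recognizes other line boundaries such as \r, so the two approaches only agree when \n is the sole line terminator.

```python
def count_lines(text):
    """Count '\\n'-delimited lines, including an unterminated last line."""
    newlines = text.count("\n")
    # A non-empty text that does not end with a newline has one more line.
    has_unterminated_last_line = bool(text) and not text.endswith("\n")
    return newlines + has_unterminated_last_line


print(count_lines("Line 1\nLine 2\nLine 3\nLine 4"))  # 4
print(count_lines("Line 1\nLine 2\n"))                # 2
print(count_lines(""))                                # 0
```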

Question from @Kaz: Why the heck are two very different algorithms shoe-horned into a single function?

The signature for str.split is about 20 years old, and a number of the APIs from that era are strictly pragmatic. While not perfect, the method signature isn't "terrible" either. For the most part, Guido's API design choices have stood the test of time.

The current API is not without advantages. Consider strings such as:

ps_aux_header  = 'USER               PID  %CPU %MEM      VSZ'
patient_header = 'name,age,height,weight'

When asked to break these strings into fields, people tend to describe both using the same English word, "split". When asked to read code such as fields = line.split() or fields = line.split(','), people tend to correctly interpret the statements as "splits a line into fields".
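For illustration, here are the two header strings from above split with the mode that fits each one:

```python
ps_aux_header = 'USER               PID  %CPU %MEM      VSZ'
patient_header = 'name,age,height,weight'

# Column-aligned data: split on runs of whitespace.
ps_fields = ps_aux_header.split()

# Delimited data: split on each comma.
patient_fields = patient_header.split(',')

print(ps_fields)       # ['USER', 'PID', '%CPU', '%MEM', 'VSZ']
print(patient_fields)  # ['name', 'age', 'height', 'weight']
```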

Microsoft Excel's text-to-columns tool made a similar API choice and incorporates both splitting algorithms in the same tool. People seem to mentally model field-splitting as a single concept even though more than one algorithm is involved.

Why are empty strings returned in split() results?

str.split complements str.join, so

"/".join(['', 'segment', 'segment', ''])

gets you back the original string, '/segment/segment/'.

If the empty strings were not there, the first and last '/' would be missing after the join().
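A short round-trip sketch, assuming the original string is '/segment/segment/':

```python
original = "/segment/segment/"

parts = original.split("/")
round_trip = "/".join(parts)

# Dropping the empty strings loses the leading and trailing slashes.
without_empties = "/".join(part for part in parts if part)

print(parts)            # ['', 'segment', 'segment', '']
print(round_trip)       # /segment/segment/
print(without_empties)  # segment/segment
```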

String.Split(), empty strings and method deleting specified characters

string.Split() method:

" ".Split(); results in an array with two string.Empty items, because there is nothing on either side of the space character.

" something".Split(); and "something ".Split(); each result in an array with two items, one of which is an empty string, because one side of the space character is empty.

"a  b".Split(); //double space in between

The first space has a on its left and an empty string on its right (the right side is empty because another delimiter follows immediately). The second space has that same empty string on its left and b on its right. Two delimiters produce three fields, so the result is:

{"a","","b"}
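The same field rule can be observed with a Python analogue. This is not .NET code: it assumes an explicit ' ' separator passed to str.split, which treats each space as one delimiter.

```python
single_space = " ".split(" ")
leading_space = " something".split(" ")
trailing_space = "something ".split(" ")
double_space = "a  b".split(" ")  # two spaces between a and b

print(single_space)    # ['', '']
print(leading_space)   # ['', 'something']
print(trailing_space)  # ['something', '']
print(double_space)    # ['a', '', 'b']
```

Each result has exactly one more item than the input has spaces.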

Split by regex without resulting empty strings in Python

The empty strings are an inevitable result of the regex split (though there is good reasoning as to why that behavior might be desirable). To get rid of them, call filter on the result.

results = re.split(...)
results = list(filter(None, results))

Note the list() transform is only necessary in Python 3 -- in Python 2 filter() returns a list, while in 3 it returns a filter object.
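A runnable sketch with a made-up pattern and input, since the original call is elided. It also shows two alternatives that give the same result here: a list comprehension, and re.findall matching the fields themselves instead of the delimiters.

```python
import re

text = " a, b;;c "  # illustrative input

raw = re.split(r"[,;\s]+", text)

# Three ways to end up without empty strings.
filtered = list(filter(None, raw))
comprehension = [piece for piece in raw if piece]
matched = re.findall(r"[^,;\s]+", text)

print(raw)       # ['', 'a', 'b', 'c', '']
print(filtered)  # ['a', 'b', 'c']
```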

Java: String split(): I want it to include the empty strings at the end

Use str.split("\n", -1), that is, pass a negative limit argument. When split is given a limit of zero, or no limit argument, it discards trailing empty fields. A positive limit caps the number of fields at that number. A negative limit allows any number of fields and does not discard trailing empty fields. This is documented in the Javadoc for String.split(String regex, int limit), and the behavior is taken from Perl.
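The three limit cases can be modelled in a few lines of Python. The helper split_with_limit is hypothetical and handles only a literal, non-empty separator rather than a regex, so treat it as a rough model of the documented behavior and not as Java's implementation.

```python
def split_with_limit(text, sep, limit=0):
    """Rough model of Java's split(regex, limit) for a literal separator."""
    if limit > 0:
        # At most `limit` fields; the last field holds the unsplit remainder.
        return text.split(sep, limit - 1)
    fields = text.split(sep)
    if limit < 0:
        # Any number of fields, trailing empty fields kept.
        return fields
    # limit == 0: drop trailing empty fields, unless nothing was split at all.
    if len(fields) == 1:
        return fields
    while fields and fields[-1] == "":
        fields.pop()
    return fields


print(split_with_limit("a\n\n", "\n"))       # ['a']
print(split_with_limit("a\n\n", "\n", -1))   # ['a', '', '']
print(split_with_limit("a\nb\nc", "\n", 2))  # ['a', 'b\nc']
```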