Why Are Empty Strings Returned in Split() Results

Why are empty strings returned in split() results?

str.split complements str.join, so

"/".join(['', 'segment', 'segment', ''])

gets you back the original string.

If the empty strings were not there, the first and last '/' would be missing after the join().
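As a minimal sketch of that round trip, assuming the original string was "/segment/segment/" (the variable names here are illustrative only):

```python
# Assumed original string; the source only shows the list passed to join().
original = "/segment/segment/"

parts = original.split("/")
print(parts)                   # ['', 'segment', 'segment', '']

# Joining the untouched result restores the string exactly.
rejoined = "/".join(parts)
print(rejoined == original)    # True

# Dropping the empty strings loses the leading and trailing '/'.
stripped = "/".join(p for p in parts if p)
print(stripped)                # segment/segment
```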

When splitting an empty string in Python, why does split() return an empty list while split('\n') returns ['']?

Question: I am using split('\n') to get lines in one string, and found that ''.split() returns an empty list, [], while ''.split('\n') returns [''].

The str.split() method has two algorithms. If no arguments are given, it splits on repeated runs of whitespace. However, if an argument is given, it is treated as a single delimiter with no repeated runs.

In the case of splitting an empty string, the first mode (no argument) will return an empty list because the whitespace is eaten and there are no values to put in the result list.

In contrast, the second mode (with an argument such as \n) will produce the first empty field. Consider '\n'.split('\n'): you get two fields, because one split gives you two halves.
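The two modes can be compared side by side. This is only a small illustration; the variable names are made up for the example:

```python
# Mode 1: no argument, runs of whitespace are consumed and never yield fields.
no_arg = "".split()
whitespace_only = "  \n ".split()

# Mode 2: explicit delimiter, every delimiter separates exactly two fields.
with_arg = "".split("\n")
one_newline = "\n".split("\n")

print(no_arg)            # []
print(whitespace_only)   # []
print(with_arg)          # ['']
print(one_newline)       # ['', '']
```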

Question: Is there any specific reason for such a difference?

This first mode is useful when data is aligned in columns with variable amounts of whitespace. For example:

>>> data = '''\
Shasta      California     14,200
McKinley    Alaska         20,300
Fuji        Japan          12,400
'''
>>> for line in data.splitlines():
        print(line.split())

['Shasta', 'California', '14,200']
['McKinley', 'Alaska', '20,300']
['Fuji', 'Japan', '12,400']

The second mode is useful for delimited data such as CSV where repeated commas denote empty fields. For example:

>>> data = '''\
Guido,BDFL,,Amsterdam
Barry,FLUFL,,USA
Tim,,,USA
'''
>>> for line in data.splitlines():
        print(line.split(','))

['Guido', 'BDFL', '', 'Amsterdam']
['Barry', 'FLUFL', '', 'USA']
['Tim', '', '', 'USA']

Note that the number of result fields is one greater than the number of delimiters. Think of cutting a rope. If you make no cuts, you have one piece. Making one cut gives two pieces. Making two cuts gives three pieces. And so it is with Python's str.split(delimiter) method:

>>> ''.split(',')       # No cuts
['']
>>> ','.split(',')      # One cut
['', '']
>>> ',,'.split(',')     # Two cuts
['', '', '']
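The rope-cutting rule can be checked mechanically. The helper below is a hypothetical name introduced for this sketch; it compares the number of fields with the number of delimiters counted by str.count():

```python
def field_count_matches(text, delimiter):
    """Return True if split() yields exactly one more field than delimiters."""
    return len(text.split(delimiter)) == text.count(delimiter) + 1

# Sample inputs, including the CSV-style line used earlier.
samples = ["", ",", ",,", "Tim,,,USA", "no delimiters here"]
results = [field_count_matches(s, ",") for s in samples]
print(results)   # [True, True, True, True, True]
```

Note that this holds only when a delimiter is passed; the no-argument whitespace mode does not follow the rule.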

Question: And is there any more convenient way to count lines in a string?

Yes, there are a couple of easy ways. One uses str.count() and the other uses str.splitlines(). Both ways will give the same answer unless the final line is missing the \n. If the final newline is missing, the str.splitlines approach will give the accurate answer. A faster technique that is also accurate uses the count method but then corrects it for the final newline:

>>> data = '''\
Line 1
Line 2
Line 3
Line 4'''

>>> data.count('\n')                               # Inaccurate
3
>>> len(data.splitlines())                         # Accurate, but slow
4
>>> data.count('\n') + (not data.endswith('\n'))   # Accurate and fast
4
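One edge case is worth noting: for an empty string the corrected count gives 1, while splitlines() gives 0 lines. A possible wrapper, using the hypothetical name count_lines and assuming lines are terminated only by '\n' (splitlines() also breaks on other boundaries such as '\r'), might look like this:

```python
def count_lines(text):
    """Count lines in text, assuming '\\n' is the only line terminator."""
    if not text:
        return 0   # an empty string contains no lines
    # One line per newline, plus one more if the last line is unterminated.
    return text.count("\n") + (not text.endswith("\n"))

print(count_lines(""))                   # 0
print(count_lines("Line 1\nLine 2"))     # 2
print(count_lines("Line 1\nLine 2\n"))   # 2
```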

Question from @Kaz: Why the heck are two very different algorithms shoe-horned into a single function?

The signature for str.split is about 20 years old, and a number of the APIs from that era are strictly pragmatic. While not perfect, the method signature isn't "terrible" either. For the most part, Guido's API design choices have stood the test of time.

The current API is not without advantages. Consider strings such as:

ps_aux_header  = 'USER               PID  %CPU %MEM      VSZ'
patient_header = 'name,age,height,weight'

When asked to break these strings into fields, people tend to describe both using the same English word, "split". When asked to read code such as fields = line.split() or fields = line.split(','), people tend to correctly interpret the statements as "splits a line into fields".

Microsoft Excel's text-to-columns tool made a similar API choice and incorporates both splitting algorithms in the same tool. People seem to mentally model field-splitting as a single concept even though more than one algorithm is involved.

Why does split on an empty string return a non-empty array?

For the same reason that

",test" split ','

and

",test," split ','

will return an array of size 2. Everything before the first match is returned as the first element.

Why is string.split returning extra empty entries in this example?

This should work.

function formatDate(userDate) {
  // format from M/D/YYYY to YYYYMMDD
  console.log(userDate);
  var dateParts = userDate.split("/");
  return dateParts[2] + dateParts[0] + dateParts[1];
}
console.log(formatDate("12/31/2014"));

Split by regex without resulting empty strings in Python

The empty strings are just an inevitable result of the regex split (though there is good reasoning as to why that behavior might be desirable). To get rid of them you can call filter on the result.

results = re.split(...)
results = list(filter(None, results))

Note the list() transform is only necessary in Python 3 -- in Python 2 filter() returns a list, while in 3 it returns a filter object.
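A complete version of that idea, with a made-up input string and pattern standing in for the elided re.split(...) arguments, could look like this:

```python
import re

# Hypothetical input and pattern, used only to make the example runnable.
text = ",a,,b,"
raw = re.split(r",", text)
print(raw)       # ['', 'a', '', 'b', '']

# filter(None, ...) keeps only truthy items, which drops the empty strings.
cleaned = list(filter(None, raw))
print(cleaned)   # ['a', 'b']

# An equivalent list comprehension, for readers who prefer it.
also_cleaned = [piece for piece in raw if piece]
```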

Why is split returning empty strings even though capturing parentheses are not present?

Let's look at a more minimal example:

",a,,b,".split(",")
// ["", "a", "", "b", ""]

What does this have to do with your case? Well, if you have two delimiters next to each other, a leading delimiter, or a trailing delimiter, you'll get an empty string in the result, since that's what's between them (and in order to maintain the behavior that x.split(a).join(a) should equal x). In your case, both </td> and <td> in the middle are matched, which means there are 2 "delimiters" right next to each other, leading to the empty string in the middle. The <td> at the start and the </td> at the end lead to a leading and trailing delimiter, leading to the empty strings at the start and the end.
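The same effect can be reproduced in Python with re.split. The input row and the pattern below are assumptions chosen to mirror the description, not the asker's actual code:

```python
import re

# Hypothetical table row and delimiter pattern matching <td> or </td>.
row = "<td>a</td><td>b</td>"
cells = re.split(r"</?td>", row)

# Leading match, two adjacent matches in the middle, and a trailing match
# each contribute one empty string.
print(cells)       # ['', 'a', '', 'b', '']

# Keep only the cells that actually contain text.
non_empty = [cell for cell in cells if cell]
print(non_empty)   # ['a', 'b']
```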

Why does splitting a string on itself return an empty slice with a length of two?

As I understand it, the split function returns everything before the / (which is nothing) in the first item, and everything after the / (also nothing) in the second item. Hence, two empty strings. As for why you ever get empty strings, it's so that split() can basically be the opposite of join, as explained here:

Why are empty strings returned in split() results?

Why does Kotlin's split() function result in a leading and trailing empty string?

This is because the Java split(String regex) method explicitly removes them:

This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.

split(String regex, int limit) mentions:

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

"" is a zero-width match. Not sure why you consider toCharArray() to not be intuitive here; splitting by an empty string to iterate over all characters is a roundabout way of doing things. split() is intended to pattern match and get groups of Strings.

PS: I checked JDK 8, 11 and 17, behavior seems to be consistent for a while now.
