Why Does "Split" on an Empty String Return a Non-Empty Array

Why does split on an empty string return a non-empty array?

For the same reason that

",test" split ','

and

",test," split ','

will each return an array of size 2. Everything before the first match is returned as the first element, even when it is empty; the trailing empty string in the second example is dropped.

Split empty string should return empty array

Just do a simple check if the string is falsy before calling split:

function returnArr(str){
  return !str ? [] : str.split(',')
}

returnArr('1,2,3')
// ['1','2','3']
returnArr('')
// []
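
For comparison, the same guard can be sketched in Python, where ''.split(',') likewise returns ['']. The helper name split_or_empty is hypothetical and exists only for this illustration.

```python
def split_or_empty(text: str, sep: str = ",") -> list[str]:
    """Split text on sep, but return [] for an empty input."""
    # An empty string is falsy, so split() is never called on it.
    return text.split(sep) if text else []


print(split_or_empty("1,2,3"))  # ['1', '2', '3']
print(split_or_empty(""))       # []
```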

When splitting an empty string in Python, why does split() return an empty list while split('\n') returns ['']?

Question: I am using split('\n') to get lines in one string, and found that ''.split() returns an empty list, [], while ''.split('\n') returns [''].

The str.split() method has two algorithms. If no arguments are given, it splits on repeated runs of whitespace. However, if an argument is given, it is treated as a single delimiter with no repeated runs.

In the case of splitting an empty string, the first mode (no argument) will return an empty list because the whitespace is eaten and there are no values to put in the result list.

In contrast, the second mode (with an argument such as \n) will produce the first empty field. If you had written '\n'.split('\n'), you would get two fields (one split gives you two halves).
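
The two modes can be seen side by side in a short script. This is just a minimal demonstration of the behavior described above, using plain str.split calls.

```python
empty = ""

# Mode 1: no argument, so runs of whitespace act as the separator
# and empty fields are never produced.
print(empty.split())         # []
print("  \n ".split())       # []

# Mode 2: explicit delimiter, so every delimiter makes exactly one cut.
print(empty.split("\n"))     # ['']
print("\n".split("\n"))      # ['', '']
print("a\n\nb".split("\n"))  # ['a', '', 'b']
```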

Question: Is there any specific reason for such a difference?

This first mode is useful when data is aligned in columns with variable amounts of whitespace. For example:

>>> data = '''\
Shasta      California     14,200
McKinley    Alaska         20,300
Fuji        Japan          12,400
'''
>>> for line in data.splitlines():
        print(line.split())

['Shasta', 'California', '14,200']
['McKinley', 'Alaska', '20,300']
['Fuji', 'Japan', '12,400']

The second mode is useful for delimited data such as CSV where repeated commas denote empty fields. For example:

>>> data = '''\
Guido,BDFL,,Amsterdam
Barry,FLUFL,,USA
Tim,,,USA
'''
>>> for line in data.splitlines():
        print(line.split(','))

['Guido', 'BDFL', '', 'Amsterdam']
['Barry', 'FLUFL', '', 'USA']
['Tim', '', '', 'USA']

Note that the number of result fields is one greater than the number of delimiters. Think of cutting a rope. If you make no cuts, you have one piece. Making one cut gives two pieces. Making two cuts gives three pieces. And so it is with Python's str.split(delimiter) method:

>>> ''.split(',')       # No cuts
['']
>>> ','.split(',')      # One cut
['', '']
>>> ',,'.split(',')     # Two cuts
['', '', '']
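
The rope rule can also be checked mechanically. The sketch below uses a handful of made-up sample strings and asserts that the field count is always the delimiter count plus one when a delimiter is passed explicitly.

```python
samples = ["", ",", ",,", "a,b", ",test,", "Tim,,,USA"]

for text in samples:
    fields = text.split(",")
    # One more field than there are delimiters, every time.
    assert len(fields) == text.count(",") + 1
    print(repr(text), "->", fields)

# The rule does not hold for the no-argument form, which collapses whitespace runs.
print(len("a   b".split()))  # 2
```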

Question: And is there any more convenient way to count lines in a string?

Yes, there are a couple of easy ways. One uses str.count() and the other uses str.splitlines(). Both ways will give the same answer unless the final line is missing the \n. If the final newline is missing, the str.splitlines approach will give the accurate answer. A faster technique that is also accurate uses the count method but then corrects it for the final newline:

>>> data = '''\
Line 1
Line 2
Line 3
Line 4'''

>>> data.count('\n')                               # Inaccurate
3
>>> len(data.splitlines())                         # Accurate, but slow
4
>>> data.count('\n') + (not data.endswith('\n'))   # Accurate and fast
4
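
As a small sketch, the corrected count can be wrapped in a helper and compared with splitlines on a few inputs. The name count_lines is hypothetical. One edge case worth knowing: for an empty string the count-based formula gives 1 while splitlines gives 0, so which answer is "accurate" there depends on whether you treat an empty string as one empty line or as no lines.

```python
def count_lines(text: str) -> int:
    """Count lines, treating a final line without a trailing newline as a line."""
    return text.count("\n") + (not text.endswith("\n"))


for text in ["a\nb\n", "a\nb", "a\n\nb\n", ""]:
    # Columns: input, count-based result, splitlines-based result.
    print(repr(text), count_lines(text), len(text.splitlines()))
```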

Question from @Kaz: Why the heck are two very different algorithms shoe-horned into a single function?

The signature for str.split is about 20 years old, and a number of the APIs from that era are strictly pragmatic. While not perfect, the method signature isn't "terrible" either. For the most part, Guido's API design choices have stood the test of time.

The current API is not without advantages. Consider strings such as:

ps_aux_header  = 'USER               PID  %CPU %MEM      VSZ'
patient_header = 'name,age,height,weight'

When asked to break these strings into fields, people tend to describe both using the same English word, "split". When asked to read code such as fields = line.split() or fields = line.split(','), people tend to correctly interpret the statements as "splits a line into fields".

Microsoft Excel's text-to-columns tool made a similar API choice and incorporates both splitting algorithms in the same tool. People seem to mentally model field-splitting as a single concept even though more than one algorithm is involved.

Java - Why does string split for empty string give me a non empty array?

An interesting puzzle indeed:

> "".split(" ")
String[1] { "" }
> " ".split(" ")
String[0] {  }

The question is, when you split the empty string, why does the result contain the empty string, and when you split a space, why does the result not contain anything? It seems inconsistent, but all is explained in the documentation.

The String.split(String) method "works as if by invoking the two-argument split method with the given expression and a limit argument of zero", so let's read the docs for String.split(String, int). The case of the empty string is answered by this part:

If the expression does not match any part of the input then the resulting array has just one element, namely this string.

The empty string has no part matching a space, so the output is an array containing one element, the input string, exactly as the docs say should happen.

The case of the string " " is answered by these two parts:

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array.
If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

The whole input string " " matches the splitting pattern, and the match has positive width, so in principle the result holds an empty string on either side of the match: ["", ""]. Because the limit parameter n = 0, trailing empty strings are then discarded, and since every element here is empty, both are removed. Hence the resulting array is empty. Calling " ".split(" ", -1) keeps both empty strings.
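
The no-match rule and the trailing-empty-string rule can be mimicked in a few lines of Python. This is only a rough sketch of the limit = 0 behavior for a literal, non-empty separator (Java's split takes a regular expression, which this does not model), and java_like_split is a hypothetical name.

```python
def java_like_split(text: str, sep: str) -> list[str]:
    """Approximate Java's text.split(sep) with limit 0 for a literal, non-empty sep."""
    if sep not in text:
        # No match at all: the result is one element, the input itself.
        return [text]
    parts = text.split(sep)
    # limit == 0: trailing empty strings are discarded.
    while parts and parts[-1] == "":
        parts.pop()
    return parts


print(java_like_split("", " "))        # ['']
print(java_like_split(" ", " "))       # []
print(java_like_split(",test,", ","))  # ['', 'test']
```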

Java String's split method ignores empty substrings

Use String.split(String regex, int limit) with negative limit (e.g. -1).

"aa,bb,cc,dd,,,,".split(",", -1)

When String.split(String regex) is called, it is called with limit = 0, which will remove all trailing empty strings in the array (in most cases, see below).
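
Python's str.split with an explicit delimiter keeps every empty field, which makes it a convenient way to see what the two limit settings do to this input. The sketch below only mirrors the results; it is not how Java implements them.

```python
text = "aa,bb,cc,dd,,,,"

# All fields kept, as with Java's split(",", -1): 7 commas give 8 fields.
all_fields = text.split(",")
print(all_fields)  # ['aa', 'bb', 'cc', 'dd', '', '', '', '']

# Trailing empty fields dropped, as with Java's split(",").
trimmed = list(all_fields)
while trimmed and trimmed[-1] == "":
    trimmed.pop()
print(trimmed)     # ['aa', 'bb', 'cc', 'dd']
```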

The actual behavior of String.split(String regex) is quite confusing:

Splitting an empty string always results in an array of length 1 containing the empty string.
Splitting ";" or ";;;" with the regex ";" results in an empty array. For a non-empty input, all trailing empty strings are removed from the array.

The behavior above can be observed from at least Java 5 to Java 8.

There was an attempt in JDK-6559590 to change the behavior so that splitting an empty string returns an empty array. However, it was soon reverted in JDK-8028321 because it caused regressions in various places. The change never made it into the initial Java 8 release.

Why does Kotlin's split() function result in a leading and trailing empty string?

Kotlin's split() does not drop them. Java differs because its split(String regex) method explicitly removes them:

This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.

split(String regex, int limit) mentions:

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

"" is a zero-width match. Not sure why you consider toCharArray() to not be intuitive here; splitting by an empty string to iterate over all characters is a roundabout way of doing things. split() is intended to pattern match and get groups of Strings.

PS: I checked JDK 8, 11 and 17; the behavior seems to have been consistent for a while now.

Why are empty strings returned in split() results?

str.split complements str.join, so

"/".join(['', 'segment', 'segment', ''])

gets you back the original string.

If the empty strings were not there, the first and last '/' would be missing after the join().
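
A quick round trip shows the point. The path below is just an example value.

```python
path = "/segment/segment/"

parts = path.split("/")
print(parts)  # ['', 'segment', 'segment', '']

# Joining with the same separator restores the original exactly.
assert "/".join(parts) == path

# Dropping the empty strings loses the leading and trailing slashes.
print("/".join(p for p in parts if p))  # segment/segment
```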

The confusion about the split() function of JavaScript with an empty string

From the MDC doc center:

Note: When the string is empty, split returns an array containing one empty string, rather than an empty array.

Read the full docs here: https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/String/split

In other words, this is by design, and not an error :)

JavaScript split gives array size one instead of zero for empty string

Everything before the first match is returned as the first element, even if the string is empty. An empty string is not null.

If you want split to return a zero-length array, I recommend using the underscore.string module and its words method:

_str.words("", ",");
// => []

_str.words("Foo", ",");
// => [ 'Foo' ]
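
The same idea, splitting and then discarding empty fields, can be sketched in Python for comparison. This is a rough analogue written for illustration, not the underscore.string implementation, and unlike a plain guard on the input it also drops empty fields in the middle of the string.

```python
def words(text: str, sep: str = ",") -> list[str]:
    """Split on sep and drop empty fields, so an empty input yields []."""
    return [field for field in text.split(sep) if field]


print(words("", ","))        # []
print(words("Foo", ","))     # ['Foo']
print(words(",a,,b,", ","))  # ['a', 'b']
```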
