Why in Java 8 Split Sometimes Removes Empty Strings At Start of Result Array

The behavior of String.split (which calls Pattern.split) changes between Java 7 and Java 8.

Documentation

Comparing the documentation of Pattern.split in Java 7 and Java 8, we see that the following clause was added:

When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

The same clause is also added to String.split in Java 8, compared to Java 7.
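A small sketch of what the added clause means in practice on Java 8 and later. The class name LeadingMatchDemo and the sample inputs are illustrative assumptions, not taken from the documentation.

```java
import java.util.Arrays;

public class LeadingMatchDemo {
    // Positive-width match at index 0: the delimiter "," consumes a character,
    // so an empty leading substring is kept.
    static String[] positiveWidth() {
        return ",a,b".split(",");
    }

    // Zero-width match at index 0: the empty pattern matches before 'a'
    // without consuming anything, so Java 8+ adds no empty leading substring.
    static String[] zeroWidth() {
        return "abc".split("");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(positiveWidth())); // [, a, b]
        System.out.println(Arrays.toString(zeroWidth()));     // [a, b, c] on Java 8+
    }
}
```

On Java 7 the second call would be expected to return an extra empty string at index 0, which is the difference the clause documents.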

Reference implementation

Let us compare the code of Pattern.split in the reference implementation in Java 7 and Java 8. The code is retrieved from grepcode, for versions 7u40-b43 and 8-b132.

Java 7

    public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<>();
        Matcher m = matcher(input);

        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }

        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};

        // Add remaining segment
        if (!matchLimited || matchList.size() < limit)
            matchList.add(input.subSequence(index, input.length()).toString());

        // Construct result
        int resultSize = matchList.size();
        if (limit == 0)
            while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
                resultSize--;
        String[] result = new String[resultSize];
        return matchList.subList(0, resultSize).toArray(result);
    }

Java 8

    public String[] split(CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        ArrayList<String> matchList = new ArrayList<>();
        Matcher m = matcher(input);

        // Add segments before each match found
        while(m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                if (index == 0 && index == m.start() && m.start() == m.end()) {
                    // no empty leading substring included for zero-width match
                    // at the beginning of the input char sequence.
                    continue;
                }
                String match = input.subSequence(index, m.start()).toString();
                matchList.add(match);
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last one
                String match = input.subSequence(index,
                                                 input.length()).toString();
                matchList.add(match);
                index = m.end();
            }
        }

        // If no match was found, return this
        if (index == 0)
            return new String[] {input.toString()};

        // Add remaining segment
        if (!matchLimited || matchList.size() < limit)
            matchList.add(input.subSequence(index, input.length()).toString());

        // Construct result
        int resultSize = matchList.size();
        if (limit == 0)
            while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
                resultSize--;
        String[] result = new String[resultSize];
        return matchList.subList(0, resultSize).toArray(result);
    }

The addition of the following code in Java 8 excludes the zero-length match at the beginning of the input string, which explains the behavior above.

    if (index == 0 && index == m.start() && m.start() == m.end()) {
        // no empty leading substring included for zero-width match
        // at the beginning of the input char sequence.
        continue;
    }

Maintaining compatibility

Following behavior in Java 8 and above

To make split behave consistently across versions and compatible with the behavior in Java 8:

  1. If your regex can match a zero-length string, add (?!\A) at the end of the regex, wrapping the original regex in a non-capturing group (?:...) if necessary.
  2. If your regex can't match a zero-length string, you don't need to do anything.
  3. If you don't know whether the regex can match a zero-length string, do both actions in step 1.

(?!\A) checks that the match does not end at the beginning of the string; a match that ends there must be an empty match at the beginning of the string.
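The wrapping step can be sketched as a tiny helper. The name forbidEmptyLeadingMatch and the sample regexes are hypothetical, chosen only for illustration; the sketch assumes the regex does not rely on flags (such as comments mode) that would change how the appended text is parsed.

```java
import java.util.Arrays;

public class PortableSplit {
    // Hypothetical helper: wraps the regex in a non-capturing group and appends
    // (?!\A), so that a match may never end at the very start of the input.
    static String forbidEmptyLeadingMatch(String regex) {
        return "(?:" + regex + ")(?!\\A)";
    }

    public static void main(String[] args) {
        // The lookahead (?=\p{Lu}) is zero-width and would match at index 0
        // of "HelloWorld"; the appended (?!\A) rejects that match.
        String regex = forbidEmptyLeadingMatch("(?=\\p{Lu})");
        System.out.println(Arrays.toString("HelloWorld".split(regex))); // [Hello, World]
    }
}
```

Because the match at index 0 is rejected by the regex itself rather than by the split implementation, the result should not depend on which Java version runs the code.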

Following behavior in Java 7 and prior

There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split to point to your own custom implementation.
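As a rough sketch of what such a custom implementation could look like, the helper below follows the logic of the Java 7 listing shown earlier. The class and method names (LegacySplit, splitJava7Style) are hypothetical, and the code has not been checked against an actual Java 7 runtime.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacySplit {
    // Hypothetical helper modeled on the Java 7 listing: it keeps the empty
    // leading substring even when the first match is zero-width.
    static String[] splitJava7Style(String input, String regex, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        List<String> matchList = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);

        // Add the segment before each match found
        while (m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                matchList.add(input.substring(index, m.start()));
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last allowed piece
                matchList.add(input.substring(index));
                index = m.end();
            }
        }

        // As in the listing: index is still 0 when nothing was consumed
        if (index == 0) {
            return new String[] {input};
        }

        // Add the remaining segment
        if (!matchLimited || matchList.size() < limit) {
            matchList.add(input.substring(index));
        }

        // With limit 0, drop trailing empty strings
        int resultSize = matchList.size();
        if (limit == 0) {
            while (resultSize > 0 && matchList.get(resultSize - 1).isEmpty()) {
                resultSize--;
            }
        }
        return matchList.subList(0, resultSize).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitJava7Style("abc", "", 0))); // [, a, b, c]
    }
}
```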

Why does split on an empty string return a non-empty array?

For the same reason that

",test" split ','

and

",test," split ','

will return an array of size 2. Everything before the first match is returned as the first element.
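A Java rendering of the two cases, plus the empty-input case from the question. The class name LeadingPieceDemo and the helper pieces are illustrative assumptions.

```java
public class LeadingPieceDemo {
    // Number of elements produced by splitting on a comma with the default limit
    static int pieces(String input) {
        return input.split(",").length;
    }

    public static void main(String[] args) {
        System.out.println(pieces(",test"));  // 2: "" before the comma, then "test"
        System.out.println(pieces(",test,")); // 2: the trailing "" is discarded
        System.out.println(pieces(""));       // 1: no match, so the input itself is returned
    }
}
```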

Java String split removed empty values

By default, split(delimiter) removes trailing empty strings from the result array. To turn this mechanism off, use the overloaded version split(delimiter, limit) with limit set to a negative value, like

String[] split = data.split("\\|", -1);

A few more details:

split(regex) internally returns the result of split(regex, 0), and in the documentation of this method you can find:

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array.

If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter.

If n is non-positive then the pattern will be applied as many times as possible and the array can have any length.

If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

Exception:

It is worth mentioning that removing trailing empty strings makes sense only if such empty strings were created by the split mechanism. So for "".split(anything), since we can't split "" any further, we get the array [""] as the result.

This happens because no split occurred here, so "", despite being empty and trailing, represents the original string, not an empty string created by the splitting process.
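The effect of the three kinds of limit, and the empty-input exception, can be seen side by side in a short sketch. The input "a|b||" and the class name LimitDemo are made up for illustration.

```java
import java.util.Arrays;

public class LimitDemo {
    // Splits on a literal '|' with the given limit
    static String[] split(String input, int limit) {
        return input.split("\\|", limit);
    }

    public static void main(String[] args) {
        String data = "a|b||";
        System.out.println(Arrays.toString(split(data, 0)));  // [a, b]      trailing empties discarded
        System.out.println(Arrays.toString(split(data, -1))); // [a, b, , ]  trailing empties kept
        System.out.println(Arrays.toString(split(data, 2)));  // [a, b||]    pattern applied once
        System.out.println(split("", 0).length);              // 1: the array holds the original ""
    }
}
```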

Java String split inconsistency

According to the Java docs, split creates an empty String if the first character is the separator, but doesn't create an empty String (or empty Strings) if the last character (or consecutive last characters) is the separator. You will get the same behavior regardless of the separator you use.
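To illustrate that the separator does not matter, here is a sketch with three different separators, each appearing once at the start and twice at the end of the input. The inputs and the class name SeparatorDemo are assumptions for the example.

```java
import java.util.Arrays;

public class SeparatorDemo {
    // Thin wrapper around the one-argument split (limit 0)
    static String[] split(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        // Each call keeps the leading "" and drops the trailing empty strings.
        System.out.println(Arrays.toString(split("xaxbxx", "x")));      // [, a, b]
        System.out.println(Arrays.toString(split("1a2b34", "\\d")));    // [, a, b]
        System.out.println(Arrays.toString(split("--a--b----", "--"))); // [, a, b]
    }
}
```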

String split() dropping trailing empty entries

See javadoc:

This method works as if by invoking the two-argument split method with the given expression and a limit argument of zero. Trailing empty strings are therefore not included in the resulting array.

So it is behaving as defined. If you're not happy with that, you can do what the manual suggests and use a negative parameter for limit.

    String[] parts = "a.b.c...d...".split("\\.", -1);
    for (int i = 0; i < parts.length; i++)
        System.out.println("" + i + ": '" + parts[i] + "'");

0: 'a'
1: 'b'
2: 'c'
3: ''
4: ''
5: 'd'
6: ''
7: ''
8: ''

Java - Why does string split for empty string give me a non empty array?

An interesting puzzle indeed:

> "".split(" ")
String[1] { "" }
> " ".split(" ")
String[0] { }

The question is, when you split the empty string, why does the result contain the empty string, and when you split a space, why does the result not contain anything? It seems inconsistent, but all is explained in the documentation.

The String.split(String) method "works as if by invoking the two-argument split method with the given expression and a limit argument of zero", so let's read the docs for String.split(String, int). The case of the empty string is answered by this part:

If the expression does not match any part of the input then the resulting array has just one element, namely this string.

The empty string has no part matching a space, so the output is an array containing one element, the input string, exactly as the docs say should happen.

The case of the string " " is answered by these two parts:

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array.

If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.

The whole input string " " matches the splitting pattern, and that match has positive width, so the split first produces an empty string on each side of the match. Because the limit parameter n = 0, trailing empty strings are then discarded, and since every element is empty, both of them are removed. Hence the resulting array is empty.
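Passing a negative limit makes the intermediate pieces visible, because it disables the removal of trailing empty strings. A minimal sketch, with the class name PuzzleDemo chosen for illustration:

```java
public class PuzzleDemo {
    // Number of elements produced when splitting on a single space
    static int count(String input, int limit) {
        return input.split(" ", limit).length;
    }

    public static void main(String[] args) {
        System.out.println(count("", 0));   // 1: no match, the input is returned as-is
        System.out.println(count(" ", -1)); // 2: an empty string on each side of the space
        System.out.println(count(" ", 0));  // 0: both empty strings are discarded as trailing
    }
}
```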

why is split returning empty strings even tho capturing parenthesis are not present?

Let's look at a more minimal example:

",a,,b,".split(",")
// ["", "a", "", "b", ""]

What does this have to do with your case? Well, if you have two delimiters next to each other, a leading delimiter, or a trailing delimiter, you'll get an empty string in the result, since that's what's between them (and in order to maintain the behavior that x.split(a).join(a) should equal x). In your case, both </td> and <td> in the middle are matched, which means there are 2 "delimiters" right next to each other, producing the empty string in the middle. The <td> at the start and the </td> at the end act as a leading and a trailing delimiter, producing the empty strings at the start and the end.
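Note that the result shown above keeps the trailing empty string, whereas Java's one-argument split discards it. The sketch below shows the Java behavior for the same input, and that the round trip through String.join only holds with a negative limit. The class and method names are illustrative.

```java
import java.util.Arrays;

public class AdjacentDelimiters {
    // Default limit (0): trailing empty strings are discarded
    static String[] dropTrailing(String s) {
        return s.split(",");
    }

    // Negative limit: every piece is kept, including trailing empty strings
    static String[] keepAll(String s) {
        return s.split(",", -1);
    }

    public static void main(String[] args) {
        String x = ",a,,b,";
        System.out.println(dropTrailing(x).length);                  // 4
        System.out.println(keepAll(x).length);                       // 5
        System.out.println(String.join(",", keepAll(x)).equals(x));  // true
    }
}
```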

Split a string with empty data

You should use split like this to prevent it from removing empty values:

    for (int i = 0; i < lines.length; i++) {
        line = lines[i];
        // A negative limit keeps trailing empty fields in the array
        data = line.split(":", -1);
        data1 = data[0];
        data2 = data[1];
        data3 = data[2];
        data4 = data[3];
    }

If n is non-positive then the pattern will be applied as many times as possible and the array can have any length.
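To see why the negative limit matters for fixed-position fields, consider a line whose last two fields are empty. The sample line "alice:42::" and the names below are hypothetical.

```java
public class FieldCountDemo {
    // Number of fields obtained from a colon-separated line
    static int fieldCount(String line, int limit) {
        return line.split(":", limit).length;
    }

    public static void main(String[] args) {
        String line = "alice:42::"; // hypothetical record with two empty trailing fields
        System.out.println(fieldCount(line, 0));  // 2: reading index 2 or 3 would throw
        System.out.println(fieldCount(line, -1)); // 4: all four positions are present
    }
}
```

With the default limit, indexing the third or fourth field of such a line would fail with an ArrayIndexOutOfBoundsException, since the array only has two elements.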

Why is String.split behaving like this?

You don't get an extra space, you get the empty string (with length 0). It says so in the javadoc:

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

