Java String.Split() Sometimes Giving Blank Strings

Digging through the source code, I found the exact cause of this behaviour.

String.split() internally uses Pattern.split(). Before returning the resulting array, Pattern.split() checks the index at which the last match ended. If that index is 0, either the pattern did not match at all or it matched only an empty string at the beginning of the input. In both cases the method returns a single-element array containing the original input string.

Here's the source code:

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<String>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);

            // Consider this assignment. For a single empty string match
            // m.end() will be 0, and hence index will also be 0
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // The rest of the method is not relevant here

If the last condition in the code above, index == 0, is true, a single-element array containing the input string is returned.

Now, consider the cases when the index can be 0.

  1. When there is no match at all (as noted in the comment above that condition).
  2. When a match is found at the beginning and the matched string has length 0. In that case the assignment in the if block (inside the while loop),

    index = m.end();

    sets index to 0. The only string that can match this way is the empty string (length = 0), which is exactly the case here. There must also be no further matches, otherwise index would be updated to a later position.

So, considering your cases:

  • For d%, there is just a single match for the pattern, the empty string before the first d, so index is 0. Since there are no further matches, index is never updated, the if condition is true, and the single-element array with the original string is returned.

  • For d20+2 there are two matches, one before d and one before +. The second match updates index, so the ArrayList in the code above is returned. It contains an empty string as its first element, because the delimiter matched at the very start of the string, as already explained in @Stema's answer.
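The two cases can be checked by listing where the pattern matches. This is only a sketch: the regex `(?=[dk+-])` is an assumption inferred from the fix suggested in this answer (the original question's pattern is not quoted here), and `matchSpans` is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchPositions {
    // Hypothetical helper: returns "start-end" for every match of regex in input.
    static List<String> matchSpans(String regex, String input) {
        List<String> spans = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            spans.add(m.start() + "-" + m.end());
        }
        return spans;
    }

    public static void main(String[] args) {
        String regex = "(?=[dk+-])"; // assumed pattern, not taken from the question
        System.out.println(matchSpans(regex, "d%"));    // [0-0]
        System.out.println(matchSpans(regex, "d20+2")); // [0-0, 3-3]
    }
}
```

For d% the only match ends at index 0, so index stays 0. For d20+2 the second match ends at index 3, which moves index past 0.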

So, to get the behaviour you want (that is, split on the delimiter only when it is not at the beginning), you can add a negative look-behind to your regex pattern:

"(?<!^)(?=[dk+-])"  // You don't need to escape + and hyphen(when at the end)

This splits on an empty string that is followed by a character from your character class but is not preceded by the beginning of the string.
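A small check of the look-behind pattern might look like the following. The class and method names are made up for illustration, and the sample inputs are assumptions based on the cases discussed above.

```java
import java.util.Arrays;

public class LookBehindSplit {
    // Pattern from the answer: zero-width split point before d, k, + or -,
    // but never at the very beginning of the input.
    static final String REGEX = "(?<!^)(?=[dk+-])";

    // Hypothetical helper that applies the pattern above.
    static String[] splitDice(String input) {
        return input.split(REGEX);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitDice("d%")));     // [d%]
        System.out.println(Arrays.toString(splitDice("d20+2")));  // [d20, +2]
        System.out.println(Arrays.toString(splitDice("1d20+2"))); // [1, d20, +2]
    }
}
```

Because the pattern cannot match at index 0, the result no longer depends on how the JDK treats a zero-width match at the start of the input.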


Consider splitting the string "ad%" on the regex pattern "a(?=[dk+-])". This gives an array whose first element is an empty string. The only change from the earlier case is that the delimiter is now the character a instead of an empty string:

"ad%".split("a(?=[dk+-])");  // Prints - `[, d%]`

Why? Because the length of the matched string is 1. After the first match, index (set from m.end()) is 1 rather than 0, so the single-element array is not returned.

Java String's split method ignores empty substrings

Use String.split(String regex, int limit) with a negative limit (e.g. -1).

"aa,bb,cc,dd,,,,".split(",", -1)

When String.split(String regex) is called, it is called with limit = 0, which will remove all trailing empty strings in the array (in most cases, see below).
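As a quick comparison of the two limits, the sketch below splits the same input both ways. `splitKeepingTrailing` is a hypothetical helper name, not a JDK method.

```java
import java.util.Arrays;

public class NegativeLimit {
    // Hypothetical helper: a negative limit keeps trailing empty strings.
    static String[] splitKeepingTrailing(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        String input = "aa,bb,cc,dd,,,,";

        // limit = 0 (the default): trailing empty strings are dropped.
        System.out.println(input.split(",").length); // 4

        // limit = -1: all eight fields are kept, including the four empty ones.
        String[] all = splitKeepingTrailing(input, ",");
        System.out.println(all.length);              // 8
        System.out.println(Arrays.toString(all));
    }
}
```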

The actual behavior of String.split(String regex) is quite confusing:

  • Splitting an empty string always results in an array of length 1 containing the empty string.
  • Splitting ";" or ";;;" with the regex ";" results in an empty array. For a non-empty input, all trailing empty strings are removed from the array, and here every element is empty.
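The two cases can be seen side by side in a short sketch (the class name is arbitrary):

```java
public class SplitEdgeCases {
    public static void main(String[] args) {
        // Empty input: no match is found, so the input itself is returned.
        String[] fromEmpty = "".split(";");
        System.out.println(fromEmpty.length);       // 1
        System.out.println(fromEmpty[0].isEmpty()); // true

        // Input made only of delimiters: every element is an empty string,
        // and with limit = 0 all of them count as trailing and are removed.
        System.out.println(";".split(";").length);   // 0
        System.out.println(";;;".split(";").length); // 0
    }
}
```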

The behavior above can be observed from at least Java 5 to Java 8.

There was an attempt to change the behavior to return an empty array when splitting an empty string in JDK-6559590. However, it was soon reverted in JDK-8028321 when it caused regressions in various places. The change never made it into the initial Java 8 release.

Why in Java 8 split sometimes removes empty strings at start of result array?

The behavior of String.split (which calls Pattern.split) changes between Java 7 and Java 8.

Documentation

Comparing between the documentation of Pattern.split in Java 7 and Java 8, we observe the following clause being added:

When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

The same clause is also added to String.split in Java 8, compared to Java 7.
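The clause can be illustrated with one zero-width and one positive-width delimiter. This sketch assumes it runs on Java 8 or later; on Java 7 the first result would start with an empty string. The class name is arbitrary.

```java
import java.util.Arrays;

public class LeadingMatch {
    public static void main(String[] args) {
        // Zero-width match at index 0: no empty leading substring (Java 8+).
        System.out.println(Arrays.toString("abc".split("")));   // [a, b, c]

        // Positive-width match at index 0: empty leading substring is kept.
        System.out.println(Arrays.toString("xabc".split("x"))); // [, abc]
    }
}
```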

Reference implementation

Let us compare the code of Pattern.split in the reference implementation in Java 7 and Java 8. The code is retrieved from grepcode, for versions 7u40-b43 and 8-b132.

Java 7

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

Java 8

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

The addition of the following code in Java 8 excludes the zero-length match at the beginning of the input string, which explains the behavior above.

            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }

Maintaining compatibility

Following behavior in Java 8 and above

To make split behave consistently across versions and compatibly with the behavior in Java 8:

  1. If your regex can match a zero-length string, add (?!\A) at the end of the regex and wrap the original regex in a non-capturing group (?:...) (if necessary).
  2. If your regex can't match a zero-length string, you don't need to do anything.
  3. If you don't know whether the regex can match a zero-length string, do both of the actions in step 1.

(?!\A) checks that the match does not end at the beginning of the string. A match that ends there can only be an empty match at the beginning of the string, so it is rejected.
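The wrapping described in step 1 could be sketched as a small helper. `guardLeadingEmptyMatch` is a hypothetical name, and the sample regexes and inputs are assumptions chosen for illustration.

```java
import java.util.Arrays;

public class ZeroWidthGuard {
    // Hypothetical helper: wraps the regex in a non-capturing group and appends
    // (?!\A) so that an empty match at the start of the input is rejected.
    static String guardLeadingEmptyMatch(String regex) {
        return "(?:" + regex + ")(?!\\A)";
    }

    public static void main(String[] args) {
        // A regex that can match the empty string at index 0.
        String zeroWidth = guardLeadingEmptyMatch("(?=[dk+-])");
        System.out.println(Arrays.toString("d20+2".split(zeroWidth))); // [d20, +2]
        System.out.println(Arrays.toString("d%".split(zeroWidth)));    // [d%]

        // A positive-width regex is unaffected: the leading empty string stays.
        String comma = guardLeadingEmptyMatch(",");
        System.out.println(Arrays.toString(",a".split(comma)));        // [, a]
    }
}
```

The non-capturing group matters when the original regex contains a top-level alternation such as a|b, since otherwise (?!\A) would attach only to the last alternative.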

Following behavior in Java 7 and prior

There is no general solution to make split backward-compatible with Java 7 and prior, short of replacing all instances of split with calls to your own custom implementation.
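Such a custom implementation could be a rough port of the Java 7 logic quoted earlier. The sketch below is not JDK code: `splitLegacy` is a hypothetical name, and it is only meant to show the shape of a replacement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacySplit {
    // Hypothetical helper: splits like Java 7's Pattern.split, i.e. a zero-width
    // match at index 0 still contributes an empty leading substring.
    static String[] splitLegacy(Pattern pattern, CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        List<String> matchList = new ArrayList<>();
        Matcher m = pattern.matcher(input);

        // Add the segment before each match (no Java 8 "continue" check here).
        while (m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                matchList.add(input.subSequence(index, m.start()).toString());
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last allowed element
                matchList.add(input.subSequence(index, input.length()).toString());
                index = m.end();
            }
        }

        // No match, or only an empty match at index 0: return the input as-is.
        if (index == 0) {
            return new String[] {input.toString()};
        }

        // Add the remaining segment.
        if (!matchLimited || matchList.size() < limit) {
            matchList.add(input.subSequence(index, input.length()).toString());
        }

        // With limit == 0, drop trailing empty strings.
        int resultSize = matchList.size();
        if (limit == 0) {
            while (resultSize > 0 && matchList.get(resultSize - 1).isEmpty()) {
                resultSize--;
            }
        }
        return matchList.subList(0, resultSize).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                splitLegacy(Pattern.compile(""), "abc", 0)));             // [, a, b, c]
        System.out.println(Arrays.toString(
                splitLegacy(Pattern.compile("(?=[dk+-])"), "d20+2", 0))); // [, d20, +2]
    }
}
```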

Why does split on an empty string return a non-empty array?

For the same reason that

",test" split ','

and

",test," split ','

will return an array of size 2. Everything before the first match is returned as the first element.

Extra empty string when parsing string for digits

The reason you are getting the leading empty string is because your delimiter (a series of non-digit characters) leads the string. It would be the same reason why:

String s = ",";
String[] splitResult = s.split(",");

would produce two empty tokens, one on each side of the delimiter. (Java's single-argument split then drops trailing empty strings, so s.split(",") actually returns an empty array; call s.split(",", -1) to see ["", ""].) Every delimiter, even one at the beginning or end of the string, separates two tokens. If the delimiter is at the beginning of the string, at the end of the string, or if there are two adjacent delimiters (which wouldn't happen in your case because of the greedy + quantifier, but would happen in the above case with String s = ",,";), then the split result will contain empty strings.

An easy way to solve your problem is to filter out the empty strings. You know they can only appear at the front or back of your input string, so this is not really a problem. For example, you could:

  • Check the first and last string in nums. If they are empty, remove them.
  • Use regex to search for the digits, not split by the non-digits.
  • Remove any leading/trailing non-digits before running your code.

I'm not too fluent in Java, but I'm sure there's probably some cleaner Java idiom to do this with the code you've provided.
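One way to apply the second option, searching for the digits instead of splitting on the non-digits, is a Matcher loop. This is a sketch only: `digitRuns` is a hypothetical helper and the sample input is made up, since the question's code is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DigitRuns {
    // Hypothetical helper: collects every maximal run of digits in the input.
    static List<String> digitRuns(String input) {
        List<String> nums = new ArrayList<>();
        Matcher m = Pattern.compile("\\d+").matcher(input);
        while (m.find()) {
            nums.add(m.group());
        }
        return nums;
    }

    public static void main(String[] args) {
        // Leading and trailing non-digits produce no empty strings.
        System.out.println(digitRuns("abc12de345f")); // [12, 345]
    }
}
```

Since only the matches themselves are collected, there is nothing to filter afterwards.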

Behaviour of String.split() when input is empty

In the first case, the original string is returned, because the separator is not found.

From the API docs:

If the expression does not match any part of the input then the resulting array has just one element, namely this string.

String split behaviour on empty string and on single delimiter string

As per the java.util.regex.Pattern source, which String.split(..) uses,

"".split("x");   // returns {""} - valid - when no match is found, return the original string
"x".split("x"); // returns {} - valid - trailing empty strings are removed from the resultant array {"", ""}
"xa".split("x"); // returns {"", "a"} - valid - only trailing empty strings are removed
"ax".split("x"); // returns {"a"} - valid - trailing empty strings are removed from the resultant array {"a", ""}

