Difference Between String.scan and String.split

They serve entirely different purposes. String#scan extracts every match of a pattern from a string and returns the matches in an array, while String#split breaks a string into an array of pieces around a delimiter. The delimiter may be either a static string (like ";" to split on a single semicolon) or a regular expression (like /\s+/ to split on runs of whitespace).

The output of String#split doesn't include the delimiter: everything except the delimiter is returned in the output array. The output of String#scan is the reverse: it includes only what the pattern matched.

# A delimited string split on | returns everything surrounding the | delimiters
"a|delimited|string".split("|")
# Prints: ["a", "delimited", "string"]

# The same string scanned for | only returns the matched | characters
"a|delimited|string".scan("|")
# Prints: ["|", "|"]

Both of the above would also accept a regular expression in place of the simple string "|".

# Split on everything between and including two t's
"a|delimited|string".split(/t.+t/)
# Prints: ["a|delimi", "ring"]

# Search for everything between and including two t's
"a|delimited|string".scan(/t.+t/)
# Prints: ["ted|st"]

Scanner vs. StringTokenizer vs. String.Split

They're essentially horses for courses.

  • Scanner is designed for cases where you need to parse a string, pulling out data of different types. It's very flexible, but arguably doesn't give you the simplest API for simply getting an array of strings delimited by a particular expression.
  • String.split() and Pattern.split() give you an easy syntax for doing the latter, but that's essentially all that they do. If you want to parse the resulting strings, or change the delimiter halfway through depending on a particular token, they won't help you with that.
  • StringTokenizer is even more restrictive than String.split(), and also a bit fiddlier to use. It is essentially designed for pulling out tokens delimited by fixed substrings. Because of this restriction, it's about twice as fast as String.split(). (See my comparison of String.split() and StringTokenizer.) It also predates the regular expressions API, of which String.split() is a part.

You'll note from my timings that String.split() can still tokenize thousands of strings in a few milliseconds on a typical machine. In addition, it has the advantage over StringTokenizer that it gives you the output as a string array, which is usually what you want. Using an Enumeration, as provided by StringTokenizer, is too "syntactically fussy" most of the time. From this point of view, StringTokenizer is a bit of a waste of space nowadays, and you may as well just use String.split().
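
As a rough illustration of the three options above, here is a minimal sketch that tokenizes the same comma-separated input each way. The class name, method names, and sample input are made up for this example; only String.split(), StringTokenizer, and Scanner themselves come from the JDK.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;

class TokenizeThreeWays {

    // String.split(): the delimiter is a regular expression and the result is an array.
    static List<String> withSplit(String input) {
        return Arrays.asList(input.split(","));
    }

    // StringTokenizer: the delimiter is a set of literal characters,
    // and tokens are pulled out one at a time.
    static List<String> withTokenizer(String input) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, ",");
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Scanner: the delimiter is a regular expression, and each token
    // can be read directly as a typed value (here, an int).
    static int sumWithScanner(String input) {
        int sum = 0;
        try (Scanner scanner = new Scanner(input).useDelimiter(",")) {
            while (scanner.hasNextInt()) {
                sum += scanner.nextInt();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(withSplit("10,20,30"));      // [10, 20, 30]
        System.out.println(withTokenizer("10,20,30"));  // [10, 20, 30]
        System.out.println(sumWithScanner("10,20,30")); // 60
    }
}
```

The split() version is a one-liner, the StringTokenizer version needs an explicit loop, and the Scanner version is the only one that parses the tokens into another type for you.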

Is there any difference between String.split(" ") and String.split(/ +/g)?

If you had a string with several consecutive spaces, such as "some    string" (four spaces), the first one would give you one empty string for each extra space:

Array(5) [ "some", "", "", "", "string" ]

The second one will treat multiple spaces as one:

Array [ "some", "string" ]

What is the difference between split method in String class and the split method in Apache StringUtils?

It depends on the use case.

What's the difference?

String[] String.split(String regex)

String[] StringUtils.split(String str, String separatorChars)

  1. Apache's StringUtils.split() is null-safe: StringUtils.split(null) returns null. The JDK's String.split() is not null-safe:

    try {
        String testString = null;
        String[] result = testString.split("-");
        System.out.println(result.length);
    } catch (Exception e) {
        System.out.println(e); // prints java.lang.NullPointerException
    }

  2. The default String#split() uses a regular expression for splitting the string.

    The Apache version, StringUtils#split(), splits on whitespace, a single char, or a set of separator characters passed as a String (null means whitespace), depending on which overload you call.

    Since complex regular expressions are expensive when used extensively, String.split() can be a poor choice in hot code paths. Otherwise String.split() is fine.

  3. When tokenizing a string that starts and ends with the delimiter, as below, String.split() returns an extra empty string at the front, while the Apache version returns only the non-empty tokens.

    String testString = "$Hello$Dear$";

    String[] result = testString.split("\\$");
    System.out.println("Length is " + result.length); // 3
    int i = 1;
    for (String str : result) {
        System.out.println("Str" + (i++) + " " + str);
    }

Output

Length is 3
Str1
Str2 Hello
Str3 Dear

    String[] result = StringUtils.split(testString, "$");
    System.out.println("Length is " + result.length); // 2
    int i = 1;
    for (String str : result) {
        System.out.println("Str" + (i++) + " " + str);
    }

Output

Length is 2
Str1 Hello
Str2 Dear
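
The length of 3 (rather than 4) comes from String.split() dropping trailing empty strings while keeping the leading one. Passing a negative limit keeps the trailing one too. The sketch below uses only the JDK, so it does not call StringUtils itself; splitDroppingEmpties is a hypothetical helper that only approximates the Apache behaviour for this input. It takes one literal separator string, whereas StringUtils.split() treats its second argument as a set of separator characters.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DollarSplitDemo {

    // Hypothetical helper, not part of any library: splits on a literal
    // separator and discards empty tokens. Returns null for null input,
    // mirroring the null-safety described in point 1.
    static String[] splitDroppingEmpties(String input, String separator) {
        if (input == null) {
            return null;
        }
        return Arrays.stream(input.split(Pattern.quote(separator), -1))
                .filter(token -> !token.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String testString = "$Hello$Dear$";

        // Default limit: leading empty string kept, trailing one dropped.
        System.out.println(Arrays.toString(testString.split("\\$")));     // [, Hello, Dear]

        // Negative limit: trailing empty strings are kept as well.
        System.out.println(Arrays.toString(testString.split("\\$", -1))); // [, Hello, Dear, ]

        // Empty tokens filtered out, as in the StringUtils output above.
        System.out.println(Arrays.toString(splitDroppingEmpties(testString, "$"))); // [Hello, Dear]
    }
}
```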

Is there a difference between string.Split() with a single character vs a one length char array?

Your intuition is correct. There will probably be no significant performance difference between the two, and there's definitely no behavioral difference, but the second one is certainly a lot more verbose.

Split function difference between char and string arguments

There's a big difference in the function use.

The split function is overloaded, and this is the implementation from the source code of Scala:

/** For every line in this string:
 *
 *  Strip a leading prefix consisting of blanks or control characters
 *  followed by | from the line.
 */
def stripMargin: String = stripMargin('|')

private def escape(ch: Char): String = "\\Q" + ch + "\\E"

@throws(classOf[java.util.regex.PatternSyntaxException])
def split(separator: Char): Array[String] = toString.split(escape(separator))

@throws(classOf[java.util.regex.PatternSyntaxException])
def split(separators: Array[Char]): Array[String] = {
  val re = separators.foldLeft("[")(_+escape(_)) + "]"
  toString.split(re)
}

So when you're calling split() with a char, you ask to split by that specific char:

scala> "ASD-ASD.KZ".split('.')
res0: Array[String] = Array(ASD-ASD, KZ)

And when you're calling split() with a string, the string is treated as a regular expression. So to get the same result using double quotes, you need to escape the dot:

scala> "ASD-ASD.KZ".split("\\.")
res2: Array[String] = Array(ASD-ASD, KZ)

Where:

  • The first \ escapes the second \ inside the string literal, so the regex engine receives \.
  • The second \ escapes the dot, which is a regex metacharacter, so that it is matched as a literal character
  • . - the character to split the string by
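
Scala's split(String) is the underlying java.lang.String.split(), so the regex behaviour can be sketched in plain Java. The class and method names below are made up for illustration. Pattern.quote() wraps its argument in \Q...\E, which is the same quoting the escape helper in the Scala source above applies.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class DotSplitDemo {

    // The argument is interpreted as a regular expression.
    static String[] splitOnRegex(String input, String regex) {
        return input.split(regex);
    }

    // The argument is quoted first, so it is matched literally.
    static String[] splitOnLiteral(String input, String literal) {
        return input.split(Pattern.quote(literal));
    }

    public static void main(String[] args) {
        String input = "ASD-ASD.KZ";

        // An unescaped dot matches every character, so every token is
        // empty and split() drops all of them.
        System.out.println(Arrays.toString(splitOnRegex(input, ".")));    // []

        // An escaped dot matches only the literal '.'.
        System.out.println(Arrays.toString(splitOnRegex(input, "\\."))); // [ASD-ASD, KZ]

        // Quoting the separator gives the same result without manual escaping.
        System.out.println(Arrays.toString(splitOnLiteral(input, "."))); // [ASD-ASD, KZ]
    }
}
```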

Advantages of .contains() and splitting a string to compare?

I think the most efficient way is .startsWith. It reads only as many characters as the time format is long, and it stops as soon as one character differs.

Why not .split?

Split iterates over the whole line. It has to, because it splits the string into an arbitrary number of parts, and there could still be a # at the very end of the string.

Why not .contains?

Same reason: it keeps trying to match the date across the whole string. Furthermore, a date may also appear somewhere in the middle of the text, in which case you would match lines that are technically not correct.

For instance - here making a small assumption about the format - if the line reads:

20141231 # Scheduled an appointment with Tim on 20150115

Then searching for 20150115 will result in a match. Although the line has something to do with that date, it was not posted on that date.
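
A minimal sketch of that difference, using the example line above and assuming (as the answer does) that the posting date is the first thing on the line. The class and method names are made up for illustration.

```java
class DatePrefixDemo {

    // True only if the line begins with the given date.
    static boolean postedOn(String line, String date) {
        return line.startsWith(date);
    }

    // True if the date appears anywhere in the line, including the free text.
    static boolean mentions(String line, String date) {
        return line.contains(date);
    }

    public static void main(String[] args) {
        String line = "20141231 # Scheduled an appointment with Tim on 20150115";

        System.out.println(postedOn(line, "20141231")); // true
        System.out.println(postedOn(line, "20150115")); // false
        System.out.println(mentions(line, "20150115")); // true, a false positive
    }
}
```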
