Difference between String.scan and String.split
They serve entirely different purposes. String#scan extracts every match of a pattern from a string and returns the matches in an array, while String#split breaks a string up into an array based on a delimiter. The delimiter may be either a static string (like ";" to split on a single semicolon) or a regular expression (like /\s+/ to split on any run of whitespace characters).
The output of String#split doesn't include the delimiter. Rather, everything except the delimiter is returned in the output array, while the output of String#scan includes only what the pattern matched.
# A delimited string split on | returns everything surrounding the | delimiters
"a|delimited|string".split("|")
# Prints: ["a", "delimited", "string"]
# The same string scanned for | only returns the matched | characters
"a|delimited|string".scan("|")
# Prints: ["|", "|"]
Both of the above would also accept a regular expression in place of the simple string "|".
# Split on everything between and including two t's
"a|delimited|string".split(/t.+t/)
# Prints: ["a|delimi", "ring"]
# Search for everything between and including two t's
"a|delimited|string".scan(/t.+t/)
# Prints: ["ted|st"]
Scanner vs. StringTokenizer vs. String.Split
They're essentially horses for courses.
- Scanner is designed for cases where you need to parse a string, pulling out data of different types. It's very flexible, but arguably doesn't give you the simplest API for simply getting an array of strings delimited by a particular expression.
- String.split() and Pattern.split() give you an easy syntax for doing the latter, but that's essentially all that they do. If you want to parse the resulting strings, or change the delimiter halfway through depending on a particular token, they won't help you with that.
- StringTokenizer is even more restrictive than String.split(), and also a bit fiddlier to use. It is essentially designed for pulling out tokens delimited by a fixed set of delimiter characters. Because of this restriction, it's about twice as fast as String.split(). (See my comparison of String.split() and StringTokenizer.) It also predates the regular expressions API, of which String.split() is a part.
You'll note from my timings that String.split() can still tokenize thousands of strings in a few milliseconds on a typical machine. In addition, it has the advantage over StringTokenizer that it gives you the output as a string array, which is usually what you want. Using an Enumeration, as provided by StringTokenizer, is too "syntactically fussy" most of the time. From this point of view, StringTokenizer is a bit of a waste of space nowadays, and you may as well just use String.split().
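As a rough illustration of the trade-offs above, here is a minimal sketch that tokenizes the same comma-separated input with all three APIs. The class name TokenizerComparison, its method names, and the sample input are made up for this example; only the JDK classes themselves come from the discussion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;
import java.util.StringTokenizer;

// Hypothetical demo class; the name and the sample input are illustrative only.
final class TokenizerComparison {

    // String.split(): one call, regex delimiter, result is already an array.
    static List<String> withSplit(String input) {
        return Arrays.asList(input.split(","));
    }

    // StringTokenizer: literal delimiter characters, tokens pulled one at a time.
    static List<String> withTokenizer(String input) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, ",");
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // Scanner: reads typed values directly, here summing the integer fields.
    static int sumWithScanner(String input) {
        int sum = 0;
        try (Scanner scanner = new Scanner(input).useDelimiter(",")) {
            while (scanner.hasNextInt()) {
                sum += scanner.nextInt();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        String input = "10,20,30";
        System.out.println(withSplit(input));      // [10, 20, 30]
        System.out.println(withTokenizer(input));  // [10, 20, 30]
        System.out.println(sumWithScanner(input)); // 60
    }
}
```

The split() version is the shortest when all you want is the strings; the Scanner version only pays off once you need typed parsing.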
Is there any difference between String.split( ) and String.split(/ +/g)?
If you had "some    string" with multiple spaces (four here), the first one gives you:
Array(5) [ "some", "", "", "", "string" ]
The second one will treat multiple spaces as one:
Array [ "some", "string" ]
What is the difference between split method in String class and the split method in Apache StringUtils?
It depends on the use case.
What's the difference?
String[] split(String regEx)
String[] results = StringUtils.split(String str,String separatorChars)
Apache's StringUtils.split() is null-safe: StringUtils.split(null) returns null. The JDK default is not null-safe:
try {
    String testString = null;
    String[] result = testString.split("-");
    System.out.println(result.length);
} catch (Exception e) {
    System.out.println(e); // prints a NullPointerException
}
The default String#split() uses a regular expression for splitting the string. The Apache version, StringUtils#split(), splits on whitespace, a single char, or a set of separator characters, depending on which overload you call; a null separator means whitespace.
Because complex regular expressions are expensive when used extensively, the default String.split() can be a bad choice in that situation. Otherwise it is fine.
When used to tokenize a string like the following, String.split() returns an additional empty string, while the Apache version gives the expected result:
String testString = "$Hello$Dear$";
String[] result = testString.split("\\$");
System.out.println("Length is "+ result.length); //3
int i=1;
for(String str : result) {
System.out.println("Str"+(i++)+" "+str);
}
Output
Length is 3
Str1
Str2 Hello
Str3 Dear
String[] result = StringUtils.split(testString,"$");
System.out.println("Length is "+ result.length); // 2
int i=1;
for(String str : result) {
System.out.println("Str"+(i++)+" "+str);
}
Output
Length is 2
Str1 Hello
Str2 Dear
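The extra empty string comes from how the JDK treats empty tokens: String.split() keeps a leading empty token but drops trailing ones unless a negative limit is passed. The sketch below uses only the JDK (no Apache Commons dependency). The class TrailingEmptyDemo and its helper nonEmptyTokens are hypothetical names, and the filter is just one way to approximate the Apache result for this particular input, not how StringUtils is implemented.

```java
import java.util.Arrays;

// Hypothetical demo class showing how String.split() handles empty tokens.
final class TrailingEmptyDemo {

    // Default behaviour: leading empty token kept, trailing empty tokens removed.
    static String[] defaultSplit(String s) {
        return s.split("\\$");
    }

    // A negative limit keeps trailing empty tokens as well.
    static String[] keepTrailing(String s) {
        return s.split("\\$", -1);
    }

    // Assumption: dropping every empty token mimics the StringUtils.split() result here.
    static String[] nonEmptyTokens(String s) {
        return Arrays.stream(s.split("\\$"))
                .filter(token -> !token.isEmpty())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String testString = "$Hello$Dear$";
        System.out.println(Arrays.toString(defaultSplit(testString)));   // [, Hello, Dear]
        System.out.println(Arrays.toString(keepTrailing(testString)));   // [, Hello, Dear, ]
        System.out.println(Arrays.toString(nonEmptyTokens(testString))); // [Hello, Dear]
    }
}
```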
Is there a difference between string.Split() with a single character vs a one length char array?
Your intuition is correct. There will probably be no significant performance difference between the two, and there's definitely no behavioral difference, but the second one is certainly a lot more verbose.
Split function difference between char and string arguments
There's a big difference in the function use.
The split function is overloaded, and this is the implementation from the Scala source code:
/** For every line in this string:
 *
 *  Strip a leading prefix consisting of blanks or control characters
 *  followed by `|` from the line.
 */
def stripMargin: String = stripMargin('|')
private def escape(ch: Char): String = "\\Q" + ch + "\\E"
@throws(classOf[java.util.regex.PatternSyntaxException])
def split(separator: Char): Array[String] = toString.split(escape(separator))
@throws(classOf[java.util.regex.PatternSyntaxException])
def split(separators: Array[Char]): Array[String] = {
val re = separators.foldLeft("[")(_+escape(_)) + "]"
toString.split(re)
}
So when you're calling split() with a char, you ask to split by that specific char:
scala> "ASD-ASD.KZ".split('.')
res0: Array[String] = Array(ASD-ASD, KZ)
And when you're calling split() with a string, the string is treated as a regex. So to get the same result using double quotes, you need to do:
scala> "ASD-ASD.KZ".split("\\.")
res2: Array[String] = Array(ASD-ASD, KZ)
Where:
- The first \ escapes the following character (the second \) inside the string literal
- The second \ escapes the dot, which is a regex metacharacter, so that it is matched as a literal character
- . is the character to split the string by
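Scala's String-argument call goes straight to java.lang.String.split, so the same pitfall can be sketched in plain Java. The class DotSplitDemo and its method names are hypothetical; Pattern.quote produces the same \Q...\E wrapping that the escape helper in the Scala source builds by hand.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class: an unescaped dot is a regex wildcard, not a literal.
final class DotSplitDemo {

    // "." matches every character, so all tokens are empty and get discarded.
    static String[] splitAsRegex(String s) {
        return s.split(".");
    }

    // Escaping the dot by hand makes it a literal separator.
    static String[] splitEscaped(String s) {
        return s.split("\\.");
    }

    // Pattern.quote wraps the separator in \Q...\E, like Scala's escape helper.
    static String[] splitQuoted(String s) {
        return s.split(Pattern.quote("."));
    }

    public static void main(String[] args) {
        String input = "ASD-ASD.KZ";
        System.out.println(splitAsRegex(input).length);            // 0
        System.out.println(Arrays.toString(splitEscaped(input)));  // [ASD-ASD, KZ]
        System.out.println(Arrays.toString(splitQuoted(input)));   // [ASD-ASD, KZ]
    }
}
```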
Advantages of .contains() and splitting a string to compare?
I think the most efficient way is .startsWith. It only reads characters up to the end of the time format, and it stops searching as soon as one character differs.
Why not .split?
Split iterates over the line to the end, because it aims to split the string into an arbitrary number of parts, and there could be a # at the very end of the string.
Why not .contains?
Same reason: it keeps trying to match the date further along in the string. Furthermore, a date may be stored somewhere in the middle of the text, in which case you can match lines that are technically not correct.
For instance - here making a small assumption about the format - if the line reads:
20141231 # Scheduled an appointment with Tim on 20150115
Then searching for 20150115 will result in a match; although the line has something to do with that date, it was not posted on that date.
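The false positive can be reproduced with a small sketch, assuming (as above) that each line begins with its posting date. The class DatePrefixDemo and its method names are invented for this example.

```java
// Hypothetical demo class comparing a prefix check with a substring search.
final class DatePrefixDemo {

    // True only when the line begins with the given date.
    static boolean postedOn(String line, String date) {
        return line.startsWith(date);
    }

    // True when the date appears anywhere in the line, including the note text.
    static boolean mentions(String line, String date) {
        return line.contains(date);
    }

    public static void main(String[] args) {
        String line = "20141231 # Scheduled an appointment with Tim on 20150115";
        System.out.println(postedOn(line, "20141231")); // true
        System.out.println(postedOn(line, "20150115")); // false
        System.out.println(mentions(line, "20150115")); // true (the false positive)
    }
}
```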