Scanner vs. StringTokenizer vs. String.Split
They're essentially horses for courses.
Scanner is designed for cases where you need to parse a string, pulling out data of different types. It's very flexible, but arguably doesn't give you the simplest API for simply getting an array of strings delimited by a particular expression.
String.split() and Pattern.split() give you an easy syntax for doing the latter, but that's essentially all that they do. If you want to parse the resulting strings, or change the delimiter halfway through depending on a particular token, they won't help you with that.
StringTokenizer is even more restrictive than String.split(), and also a bit fiddlier to use. It is essentially designed for pulling out tokens separated by a fixed set of delimiter characters. Because of this restriction, it's about twice as fast as String.split(). (See my comparison of String.split() and StringTokenizer.) It also predates the regular expressions API, of which String.split() is a part.
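To make the Scanner case concrete, here is a minimal sketch of pulling differently typed values out of one string. The class name `ScannerDemo`, the method `summarize`, and the "name quantity price" input layout are made up for illustration; only the `Scanner` calls themselves are standard API.

```java
import java.util.Locale;
import java.util.Scanner;

class ScannerDemo {
    /** Parses a hypothetical "name quantity price" line and returns a summary. */
    static String summarize(String line) {
        // Locale.ROOT keeps nextDouble() independent of the machine's default locale.
        try (Scanner in = new Scanner(line).useLocale(Locale.ROOT)) {
            String name = in.next();        // next whitespace-delimited token as text
            int quantity = in.nextInt();    // next token parsed as an int
            double price = in.nextDouble(); // next token parsed as a double
            return String.format(Locale.ROOT, "%s: %d x %.2f = %.2f",
                    name, quantity, price, quantity * price);
        }
    }

    public static void main(String[] args) {
        System.out.println(summarize("widget 3 2.50")); // prints widget: 3 x 2.50 = 7.50
    }
}
```

Doing the same with String.split() would mean splitting first and then calling Integer.parseInt and Double.parseDouble on the pieces yourself.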
You'll note from my timings that String.split() can still tokenize thousands of strings in a few milliseconds on a typical machine. In addition, it has the advantage over StringTokenizer that it gives you the output as a string array, which is usually what you want. Using an Enumeration, as provided by StringTokenizer, is too "syntactically fussy" most of the time. From this point of view, StringTokenizer is a bit of a waste of space nowadays, and you may as well just use String.split().
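The difference in ergonomics is easiest to see side by side. The following is a small sketch; the class and method names (`TokenizerVsSplit`, `withTokenizer`, `withSplit`) are hypothetical and exist only for this comparison.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerVsSplit {
    /** StringTokenizer: you drive the enumeration yourself and collect the tokens. */
    static List<String> withTokenizer(String input, String delimiterChars) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delimiterChars);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    /** String.split(): one call, and the result is already an array. */
    static List<String> withSplit(String input, String regex) {
        return Arrays.asList(input.split(regex));
    }

    public static void main(String[] args) {
        System.out.println(withTokenizer("a,b,c", ",")); // prints [a, b, c]
        System.out.println(withSplit("a,b,c", ","));     // prints [a, b, c]
    }
}
```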
StringTokenizer vs. String.split?
If you want split to split on the characters '(', ')', ',', and ' ', you need to pass a regex that matches any of those. The easiest is to use a character class:
String[] array = a.split("[(), ]");
Normally, parentheses in a regex are a grouping operator and would have to be escaped if you intended them to be used as literals. However, inside the character class delimiters, the parenthesis characters do not have to be escaped.
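As a quick illustration of the character class, the sketch below splits a made-up input, "f(a, b)". One thing to watch for: two adjacent delimiters (here ',' followed by ' ') produce an empty string in the result, so you may prefer the `+` quantifier to treat a run of delimiters as one. The class and method names are hypothetical.

```java
import java.util.Arrays;

class CharClassSplit {
    /** Splits on each single delimiter character; adjacent delimiters yield "". */
    static String[] splitEach(String a) {
        return a.split("[(), ]");
    }

    /** Splits on runs of delimiter characters, so no empty strings appear in between. */
    static String[] splitRuns(String a) {
        return a.split("[(), ]+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitEach("f(a, b)"))); // prints [f, a, , b]
        System.out.println(Arrays.toString(splitRuns("f(a, b)"))); // prints [f, a, b]
    }
}
```

In both cases the trailing empty string after the closing parenthesis is dropped, because split with no limit argument removes trailing empty strings.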
String.split vs StringTokenizer on efficiency level
Google's Guava library provides efficient and more feature-rich string-splitting methods.
Guava's split method
Ex:
Iterable<String> splitted = Splitter.on(',')
        .omitEmptyStrings()
        .trimResults()
        .split("one,two,, ,three");
for (String text : splitted) {
    System.out.println(text);
}
Output:
one
two
three
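If adding Guava is not an option, roughly the same result can be had with the JDK alone. This is only a sketch of an analogous approach, not Guava's implementation; `JdkSplitter` and `splitTrimmed` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class JdkSplitter {
    /** Splits on commas, trims each piece, and drops pieces that end up empty. */
    static List<String> splitTrimmed(String input) {
        return Arrays.stream(input.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(splitTrimmed("one,two,, ,three")); // prints [one, two, three]
    }
}
```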
Performance of StringTokenizer class vs. String.split method in Java
If your data is already in a database and you need to parse a string of words, I would suggest using indexOf repeatedly. In the timings below it is roughly two to three times faster than either alternative.
However, getting the data from a database is still likely to be much more expensive.
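import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

public class SplitBenchmark {
    public static void main(String[] args) {
        // Build a sample of 60 six-digit numbers, each followed by a space.
        StringBuilder sb = new StringBuilder();
        for (int i = 100000; i < 100000 + 60; i++)
            sb.append(i).append(' ');
        String sample = sb.toString();

        int runs = 100000;
        // Repeat the whole measurement five times so later rounds run on warmed-up code.
        for (int i = 0; i < 5; i++) {
            {
                // 1. StringTokenizer with the default (whitespace) delimiters.
                long start = System.nanoTime();
                for (int r = 0; r < runs; r++) {
                    StringTokenizer st = new StringTokenizer(sample);
                    List<String> list = new ArrayList<String>();
                    while (st.hasMoreTokens())
                        list.add(st.nextToken());
                }
                long time = System.nanoTime() - start;
                System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
            }
            {
                // 2. A precompiled Pattern, reused for every split.
                long start = System.nanoTime();
                Pattern spacePattern = Pattern.compile(" ");
                for (int r = 0; r < runs; r++) {
                    List<String> list = Arrays.asList(spacePattern.split(sample, 0));
                }
                long time = System.nanoTime() - start;
                System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
            }
            {
                // 3. A manual indexOf loop. It relies on the sample ending with a
                //    delimiter; text after the last space would not be added.
                long start = System.nanoTime();
                for (int r = 0; r < runs; r++) {
                    List<String> list = new ArrayList<String>();
                    int pos = 0, end;
                    while ((end = sample.indexOf(' ', pos)) >= 0) {
                        list.add(sample.substring(pos, end));
                        pos = end + 1;
                    }
                }
                long time = System.nanoTime() - start;
                System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
            }
        }
    }
}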
prints
StringTokenizer took an average of 5.8 us
Pattern.split took an average of 4.8 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 4.9 us
Pattern.split took an average of 3.7 us
indexOf loop took an average of 1.7 us
StringTokenizer took an average of 5.2 us
Pattern.split took an average of 3.9 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 5.1 us
Pattern.split took an average of 4.1 us
indexOf loop took an average of 1.6 us
StringTokenizer took an average of 5.0 us
Pattern.split took an average of 3.8 us
indexOf loop took an average of 1.6 us
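One caveat when reusing the indexOf loop outside this benchmark: as written it only emits a token when it finds a following delimiter, so any text after the last space is dropped (the sample here happens to end with a space). A minimal sketch of a reusable variant that also keeps the final token might look like this; `IndexOfSplit` and its `split` method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfSplit {
    /** Splits s on a single delimiter character using indexOf, keeping the last token. */
    static List<String> split(String s, char delimiter) {
        List<String> tokens = new ArrayList<>();
        int pos = 0, end;
        while ((end = s.indexOf(delimiter, pos)) >= 0) {
            tokens.add(s.substring(pos, end));
            pos = end + 1;
        }
        // Whatever remains after the last delimiter is the final token.
        if (pos < s.length()) {
            tokens.add(s.substring(pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("a b c", ' '));  // prints [a, b, c]
        System.out.println(split("a b c ", ' ')); // prints [a, b, c]
    }
}
```

Unlike StringTokenizer, this variant keeps the empty strings between adjacent delimiters, so filter them out if you do not want them.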
The cost of opening a file will be about 8 ms. As the files are so small, your cache may improve performance by a factor of 2-5x. Even so, it's going to spend ~10 hours opening files. The cost of using split vs. StringTokenizer is far less than 0.01 ms per file. Parsing 19 million files x 30 words x 8 letters per word should take about 10 seconds (at about 1 GB per 2 seconds).
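The back-of-the-envelope arithmetic can be reproduced as follows. This is only a sketch: the figures are the ones quoted above, one byte per letter is an assumption, and `CostEstimate` with its methods is a hypothetical helper.

```java
import java.util.Locale;

class CostEstimate {
    /** Hours spent opening files, given a per-open cost and a cache speed-up factor. */
    static double openHours(long files, double msPerOpen, double cacheSpeedup) {
        return files * msPerOpen / cacheSpeedup / 1000.0 / 3600.0;
    }

    /** Seconds spent parsing, assuming one byte per letter (an assumption). */
    static double parseSeconds(long files, int wordsPerFile, int bytesPerWord, double bytesPerSecond) {
        return (double) files * wordsPerFile * bytesPerWord / bytesPerSecond;
    }

    public static void main(String[] args) {
        long files = 19_000_000L;
        System.out.println(String.format(Locale.ROOT, "open, 5x cache: %.1f h", openHours(files, 8, 5)));
        System.out.println(String.format(Locale.ROOT, "open, 2x cache: %.1f h", openHours(files, 8, 2)));
        // 1 GB per 2 seconds is taken as 0.5e9 bytes per second.
        System.out.println(String.format(Locale.ROOT, "parse: %.1f s", parseSeconds(files, 30, 8, 0.5e9)));
    }
}
```

With these inputs, opening the files comes to roughly 8 to 21 hours depending on the cache factor, while parsing takes about 9 seconds, so the choice of tokenizer is not where the time goes.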
If you want to improve performance, I suggest you have far fewer files, e.g. by using a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/
Is StringTokenizer more efficient in splitting strings in JAVA?
String.split() is more flexible and easier to use than StringTokenizer. StringTokenizer predates Java's support for regular expressions, while String.split() supports them, which makes it a whole lot more powerful than StringTokenizer. Also, the result of String.split() is a string array, which is usually how we want our results. StringTokenizer is indeed faster than String.split(), but for most practical purposes String.split() is fast enough.
Check the answers on this question for more details: Scanner vs. StringTokenizer vs. String.Split
What's the Difference between StringTokenizer and java.util.Scanner Class
From the JavaDoc:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
What is StringTokenizer in Java
What does StringTokenizer do that the split method does not, for whitespace, in Java?
It provides backwards compatibility ... for old Java code that was developed before String.split was implemented; i.e. prior to Java 1.4.
You don't have to use it. In fact you are recommended not to use it. However, they can't / won't remove it because removing it would break old code.
(So far, the class is not officially deprecated. There is no definite harm in using it rather than the more modern ways of splitting strings, so (IMO) deprecation is unlikely.)
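For whitespace specifically, there is one behavioural difference worth knowing when migrating old code. StringTokenizer with its default delimiters silently skips leading whitespace, whereas split("\\s+") returns an empty first element when the input starts with whitespace. The sketch below shows this; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class WhitespaceSplit {
    /** Default StringTokenizer delimiters are space, tab, newline, carriage return, form feed. */
    static List<String> withTokenizer(String s) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(s);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    /** Plain split on runs of whitespace; leading whitespace yields a leading "". */
    static List<String> withSplit(String s) {
        return Arrays.asList(s.split("\\s+"));
    }

    /** Trimming first avoids the leading empty element for non-blank input. */
    static List<String> withTrimmedSplit(String s) {
        return Arrays.asList(s.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String input = "  a \t b\n";
        System.out.println(withTokenizer(input));    // prints [a, b]
        System.out.println(withSplit(input));        // prints [, a, b]
        System.out.println(withTrimmedSplit(input)); // prints [a, b]
    }
}
```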