Performance of StringTokenizer class vs. String.split method in Java

If your data is already in a database and you need to parse a string of words, I would suggest using indexOf repeatedly. In the benchmark below it is roughly two to three times faster than either alternative.

However, getting the data from a database is still likely to be much more expensive.

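import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

public class SplitBenchmark {
    public static void main(String[] args) {
        // Build a sample of 60 six-digit numbers, each followed by a space.
        StringBuilder sb = new StringBuilder();
        for (int i = 100000; i < 100000 + 60; i++)
            sb.append(i).append(' ');
        String sample = sb.toString();

        int runs = 100000;
        // Repeat the comparison five times so that later rounds run on warmed-up code.
        for (int i = 0; i < 5; i++) {
            {
                // 1. StringTokenizer with its default whitespace delimiters.
                long start = System.nanoTime();
                for (int r = 0; r < runs; r++) {
                    StringTokenizer st = new StringTokenizer(sample);
                    List<String> list = new ArrayList<String>();
                    while (st.hasMoreTokens())
                        list.add(st.nextToken());
                }
                long time = System.nanoTime() - start;
                System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
            }
            {
                // 2. A precompiled Pattern, so the regex is compiled only once.
                long start = System.nanoTime();
                Pattern spacePattern = Pattern.compile(" ");
                for (int r = 0; r < runs; r++) {
                    List<String> list = Arrays.asList(spacePattern.split(sample, 0));
                }
                long time = System.nanoTime() - start;
                System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
            }
            {
                // 3. Manual indexOf loop. It relies on the sample ending with a space:
                //    any text after the last delimiter would not be added to the list.
                long start = System.nanoTime();
                for (int r = 0; r < runs; r++) {
                    List<String> list = new ArrayList<String>();
                    int pos = 0, end;
                    while ((end = sample.indexOf(' ', pos)) >= 0) {
                        list.add(sample.substring(pos, end));
                        pos = end + 1;
                    }
                }
                long time = System.nanoTime() - start;
                System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
            }
        }
    }
}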

prints

StringTokenizer took an average of 5.8 us
Pattern.split took an average of 4.8 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 4.9 us
Pattern.split took an average of 3.7 us
indexOf loop took an average of 1.7 us
StringTokenizer took an average of 5.2 us
Pattern.split took an average of 3.9 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 5.1 us
Pattern.split took an average of 4.1 us
indexOf loop took an average of 1.6 us
StringTokenizer took an average of 5.0 us
Pattern.split took an average of 3.8 us
indexOf loop took an average of 1.6 us
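The indexOf loop in the benchmark only collects tokens that are followed by a space, which works there because the sample string ends with one. If you want to reuse the idea, something like the following sketch may help. The class and method names (IndexOfSplit, split) are made up for illustration, and skipping an empty trailing token is an assumption about what you want:

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfSplit {
    // Hypothetical helper: splits on a single delimiter character using indexOf.
    // Unlike the benchmark loop, it also keeps the text after the last delimiter.
    static List<String> split(String text, char delimiter) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        int end;
        while ((end = text.indexOf(delimiter, pos)) >= 0) {
            tokens.add(text.substring(pos, end));
            pos = end + 1;
        }
        // Add the trailing token, unless the text ended with the delimiter.
        if (pos < text.length()) {
            tokens.add(text.substring(pos));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("100000 100001 100002", ' '));
    }
}
```

Note that two delimiters in a row produce an empty token in this version, which is different from what StringTokenizer does.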

The cost of opening a file will be about 8 ms. As the files are so small, your cache may improve performance by a factor of 2-5x. Even so, it's going to spend ~10 hours opening files. The cost of using split vs. StringTokenizer is far less than 0.01 ms each. Parsing 19 million x 30 words x 8 letters per word should take about 10 seconds (at about 1 GB per 2 seconds).

If you want to improve performance, I suggest you have far fewer files, e.g. by using a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/
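To make the arithmetic behind those estimates explicit, here is a rough back-of-the-envelope sketch. It assumes that the 19 million figure is the number of files, that one letter is one byte, and it picks a cache speed-up of 4x from the 2-5x range; the class and method names are only for illustration:

```java
class FileCostEstimate {
    static final long FILES = 19_000_000L;   // assumed: 19 million files
    static final double OPEN_MS = 8.0;       // cost of opening one file, from the text

    // Hours spent just opening files, given an assumed cache speed-up factor.
    static double openHours(double cacheSpeedup) {
        double totalMs = FILES * OPEN_MS / cacheSpeedup;
        return totalMs / 1000.0 / 3600.0;
    }

    // Seconds spent parsing, at 1 GB per 2 seconds and one byte per letter (assumed).
    static double parseSeconds() {
        double bytes = FILES * 30.0 * 8.0;       // 30 words x 8 letters per file
        double bytesPerSecond = 1e9 / 2.0;
        return bytes / bytesPerSecond;
    }

    public static void main(String[] args) {
        System.out.printf("opening files: about %.1f hours%n", openHours(4.0));
        System.out.printf("parsing text:  about %.1f seconds%n", parseSeconds());
    }
}
```

With these assumptions, opening the files dominates by three orders of magnitude, which is why the choice of tokenizer hardly matters here.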

String.split vs StringTokenizer on efficiency level

Efficient and more feature-rich string splitting methods are available in the Google Guava library.

Guava's split method

Ex:

Iterable<String> splitted = Splitter.on(',')
        .omitEmptyStrings()
        .trimResults()
        .split("one,two,, ,three");

for (String text : splitted) {
    System.out.println(text);
}

Output:

one
two
three
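If you can't add Guava as a dependency, roughly the same result can be had with the JDK alone. This is not Guava's implementation, just a sketch using String.split plus a stream (Java 8 or later); the class and method names are made up for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class JdkSplitExample {
    // Splits on commas, trims each piece, and drops pieces that end up empty.
    static List<String> splitTrimmed(String input) {
        return Arrays.stream(input.split(","))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        for (String text : splitTrimmed("one,two,, ,three")) {
            System.out.println(text);
        }
    }
}
```

For the same input this prints one, two and three on separate lines, like the Guava example.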

Is StringTokenizer more efficient in splitting strings in Java?

String.split() is more flexible and easier to use than StringTokenizer. StringTokenizer predates Java's support for regular expressions, while String.split() supports them, which makes it a whole lot more powerful than StringTokenizer. Also, the result of String.split() is a string array, which is usually how we want our results. StringTokenizer is indeed faster than String.split(), but for most practical purposes String.split() is fast enough.

For more details, check the answers to this question: Scanner vs. StringTokenizer vs. String.Split
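To illustrate the flexibility point, here is a small sketch that handles the same input both ways. The input string and the class and method names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class RegexVsTokenizer {
    // String.split: the delimiter is a regex, here a comma or semicolon
    // together with any whitespace around it. The result is an array.
    static String[] withSplit(String input) {
        return input.split("\\s*[,;]\\s*");
    }

    // StringTokenizer: every character in ",; " is a delimiter on its own,
    // and the tokens have to be collected by hand.
    static List<String> withTokenizer(String input) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, ",; ");
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "alpha , beta;gamma ,delta";
        System.out.println(String.join("|", withSplit(input)));
        System.out.println(String.join("|", withTokenizer(input)));
    }
}
```

Both give the same four tokens for this input. The difference shows up when a value itself contains a space: the regex version keeps "new york" as one token, while the tokenizer version, with a space in its delimiter set, breaks it in two.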

Scanner vs. StringTokenizer vs. String.Split

They're essentially horses for courses.

  • Scanner is designed for cases where you need to parse a string, pulling out data of different types. It's very flexible, but arguably doesn't give you the simplest API for simply getting an array of strings delimited by a particular expression.
  • String.split() and Pattern.split() give you an easy syntax for doing the latter, but that's essentially all that they do. If you want to parse the resulting strings, or change the delimiter halfway through depending on a particular token, they won't help you with that.
  • StringTokenizer is even more restrictive than String.split(), and also a bit fiddlier to use. It is essentially designed for pulling out tokens delimited by fixed substrings. Because of this restriction, it's about twice as fast as String.split(). (See my comparison of String.split() and StringTokenizer.) It also predates the regular expressions API, of which String.split() is a part.

You'll note from my timings that String.split() can still tokenize thousands of strings in a few milliseconds on a typical machine. In addition, it has the advantage over StringTokenizer that it gives you the output as a string array, which is usually what you want. Using an Enumeration, as provided by StringTokenizer, is too "syntactically fussy" most of the time. From this point of view, StringTokenizer is a bit of a waste of space nowadays, and you may as well just use String.split().
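As a quick illustration of the Scanner case from the list above (pulling out data of different types), here is a minimal sketch. The input format and the names are made up, and the locale is fixed to Locale.US so that the decimal point is parsed the same way on every machine:

```java
import java.util.Locale;
import java.util.Scanner;

class ScannerExample {
    // Reads a word, an int and a double from one whitespace-separated string.
    static String describe(String input) {
        try (Scanner sc = new Scanner(input).useLocale(Locale.US)) {
            String name = sc.next();
            int count = sc.nextInt();
            double price = sc.nextDouble();
            return name + " x" + count + " @ " + price;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("widget 3 4.5"));
    }
}
```

With String.split() you would get three strings back and still have to call Integer.parseInt and Double.parseDouble yourself.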

Java split String performances

String.split(String) won't create a regexp if your pattern is only one character long and that character is not a regex metacharacter. When splitting by a single character, it uses specialized code which is pretty efficient. StringTokenizer is not much faster in this particular case.

This was introduced in OpenJDK7/OracleJDK7. Here's a bug report and a commit. I've made a simple benchmark here.


$ java -version
java version "1.8.0_20"
Java(TM) SE Runtime Environment (build 1.8.0_20-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.20-b23, mixed mode)

$ java Split
split_banthar: 1231
split_tskuzzy: 1464
split_tskuzzy2: 1742
string.split: 1291
StringTokenizer: 1517
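In practice this suggests calling String.split directly for a single-character delimiter and precompiling a Pattern only when the delimiter really is a regex. A small sketch under that assumption (the class, method and constant names are invented):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitPaths {
    // A multi-character regex: String.split would compile it on every call,
    // so it is compiled once here and reused.
    private static final Pattern COMMA_AND_SPACES = Pattern.compile(",\\s*");

    // Single plain character: eligible for the fast path described above.
    static String[] byComma(String line) {
        return line.split(",");
    }

    // A dot is a regex metacharacter, so it has to be escaped to split on it.
    static String[] byDot(String line) {
        return line.split("\\.");
    }

    static String[] byCommaAndSpaces(String line) {
        return COMMA_AND_SPACES.split(line);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(byComma("a,b,c")));
        System.out.println(Arrays.toString(byDot("a.b.c")));
        System.out.println(Arrays.toString(byCommaAndSpaces("a, b,  c")));
    }
}
```

All three calls give the tokens a, b and c. An unescaped "a.b.c".split(".") returns an empty array instead, because the dot matches every character.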

What is StringTokenizer in Java

What does StringTokenizer do that the split method doesn't, when splitting on white space in Java?

It provides backwards compatibility ... for old Java code that was developed before String.split was implemented; i.e. prior to Java 1.4.

You don't have to use it. In fact you are recommended not to use it. However, they can't / won't remove it because removing it would break old code.

(So far, the class is not officially deprecated. There is no definite harm in using it rather than the more modern ways of splitting strings, so (IMO) deprecation is unlikely.)
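For splitting on white space specifically, the two approaches can be put side by side as in the sketch below. The names are invented for the example; the trim() and the empty check in the split version are there because split("\\s+") otherwise returns an empty first element for input that starts with white space:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class WhitespaceTokens {
    // Old style: the default delimiters are space, tab, newline,
    // carriage return and form feed; empty tokens are never returned.
    static List<String> withTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Modern style: trim first so there is no empty leading element,
    // and handle blank input, for which split would return one empty string.
    static List<String> withSplit(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "  to be\tor not ";
        System.out.println(withTokenizer(text));
        System.out.println(withSplit(text));
    }
}
```

Both methods return the same four words for this input.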

stringtokenizer java - performance

So can someone confirm the following

Option 1 given

import java.text.ParseException;
import java.util.StringTokenizer;

public class stringtok
{
    public static void main(String[] argv)
        throws Exception
    {
        String data="ABC";
        final StringTokenizer stoken=new StringTokenizer(data.toString(),";");
        if (stoken.hasMoreTokens()) {
            final String test=stoken.nextToken();
        } else {
            throw new ParseException("Some msg",0);
        }
    }
}

produces in bytecode

Compiled from "stringtok.java"
public class stringtok {
public stringtok();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return

public static void main(java.lang.String[]) throws java.lang.Exception;
Code:
0: ldc #2 // String ABC
2: astore_1
3: new #3 // class java/util/StringTokenizer
6: dup
7: aload_1
8: invokevirtual #4 // Method java/lang/String.toString:()Ljava/lang/String;
11: ldc #5 // String ;
13: invokespecial #6 // Method java/util/StringTokenizer."<init>":(Ljava/lang/String;Ljava/lang/String;)V
16: astore_2
17: aload_2
18: invokevirtual #7 // Method java/util/StringTokenizer.hasMoreTokens:()Z
21: ifeq 32
24: aload_2
25: invokevirtual #8 // Method java/util/StringTokenizer.nextToken:()Ljava/lang/String;
28: astore_3
29: goto 43
32: new #9 // class java/text/ParseException
35: dup
36: ldc #10 // String Some msg
38: iconst_0
39: invokespecial #11 // Method java/text/ParseException."<init>":(Ljava/lang/String;I)V
42: athrow
43: return

}

Options 2 & 3 given (with or without curly braces, the bytecode is identical)

import java.text.ParseException;
import java.util.StringTokenizer;

public class stringtok2
{
    public static void main(String[] argv)
        throws Exception
    {
        String data="ABC";
        final StringTokenizer stoken=new StringTokenizer(data.toString(),";");
        if (!stoken.hasMoreTokens()) throw new ParseException("Some msg",0);
        final String test=stoken.nextToken();
    }

}

produces in bytecode

Compiled from "stringtok2.java"
public class stringtok2 {
public stringtok2();
Code:
0: aload_0
1: invokespecial #1 // Method java/lang/Object."<init>":()V
4: return

public static void main(java.lang.String[]) throws java.lang.Exception;
Code:
0: ldc #2 // String ABC
2: astore_1
3: new #3 // class java/util/StringTokenizer
6: dup
7: aload_1
8: invokevirtual #4 // Method java/lang/String.toString:()Ljava/lang/String;
11: ldc #5 // String ;
13: invokespecial #6 // Method java/util/StringTokenizer."<init>":(Ljava/lang/String;Ljava/lang/String;)V
16: astore_2
17: aload_2
18: invokevirtual #7 // Method java/util/StringTokenizer.hasMoreTokens:()Z
21: ifne 35
24: new #8 // class java/text/ParseException
27: dup
28: ldc #9 // String Some msg
30: iconst_0
31: invokespecial #10 // Method java/text/ParseException."<init>":(Ljava/lang/String;I)V
34: athrow
35: aload_2
36: invokevirtual #11 // Method java/util/StringTokenizer.nextToken:()Ljava/lang/String;
39: astore_3
40: return
}

So the answer is Options 2 & 3, as they compile to fewer bytecode instructions (there is no goto).

Can someone confirm?


