Is String Interning Really Useful

No, Java and .NET don't do it "automatically with all strings". They (well, Java and C#) do it with constant string expressions expressed in bytecode/IL, and on demand via the String.intern and String.Intern (.NET) methods. The exact situation in .NET is interesting, but basically the C# compiler will guarantee that every reference to an equal string constant within an assembly ends up referring to the same string object. That can be done efficiently at type initialization time, and can save a bunch of memory.

It doesn't happen every time a new string is created.

(On the string immutability front, I for one am extremely glad that strings are immutable. I don't want to have to take a copy every time I receive a parameter etc, thank you very much. I haven't seen it make string processing tasks harder, either...)

And as others have pointed out, looking up a string in a hash table isn't generally an O(n) operation, unless you're incredibly unlucky with hash collisions...

Personally I don't use string interning in user-land code; if I want some sort of cache of strings I'll create a HashSet<string> or something similar. That can be useful in various situations where you expect to come across the same strings several times (e.g. XML element names) but with a simple collection you don't pollute a system-wide cache.
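A minimal Java sketch of such a private cache follows. The class name LocalStringCache and its methods are hypothetical, chosen only for illustration. Because a Java HashSet cannot hand back the instance it already holds, the sketch assumes a HashMap that maps each string to itself.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: a private, scoped string cache (not a JDK class). */
class LocalStringCache {
    // Maps each distinct string to the first instance seen with that content.
    private final Map<String, String> cache = new HashMap<>();

    /** Returns the cached instance equal to s, remembering s if it is new. */
    String canonicalize(String s) {
        String existing = cache.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        LocalStringCache names = new LocalStringCache();
        // new String(...) forces distinct objects with equal contents.
        String a = names.canonicalize(new String("item"));
        String b = names.canonicalize(new String("item"));
        System.out.println(a == b);       // true: both refer to the cached instance
        System.out.println(names.size()); // 1
    }
}
```

When the cache object goes out of scope, every string it holds becomes collectable with it, which is the main difference from a system-wide pool.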

Is it good practice to use java.lang.String.intern()?

When would I use this function in favor of String.equals()?

When you need speed, since you can compare strings by reference (== is faster than equals).

Are there side effects not mentioned in the Javadoc?

The primary disadvantage is that you have to remember to make sure that you actually do intern() all of the strings that you're going to compare. It's easy to forget to intern() all strings and then you can get confusingly incorrect results. Also, for everyone's sake, please be sure to very clearly document that you're relying on the strings being internalized.

The second disadvantage if you decide to internalize strings is that the intern() method is relatively expensive. It has to manage the pool of unique strings, so it does a fair bit of work (even if the string has already been internalized). So design your code carefully, for example by calling intern() on all appropriate strings at input, so that you don't have to worry about it afterwards.
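As a rough illustration of both disadvantages, here is a small sketch that interns once at the input boundary and then compares by reference. The class InternOnInput and the method readTokens are made-up names, and the comma-separated input format is only an assumption for the example.

```java
import java.util.ArrayList;
import java.util.List;

class InternOnInput {
    /** Interns every token once, at the point where data enters the program. */
    static List<String> readTokens(String line) {
        List<String> tokens = new ArrayList<>();
        for (String raw : line.split(",")) {
            tokens.add(raw.trim().intern()); // downstream code relies on this
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = readTokens("GET, POST, GET");
        // Safe only because readTokens interned every element.
        System.out.println(tokens.get(0) == tokens.get(2)); // true

        // A string that skipped the input path is equal but not identical.
        String stray = new String("GET");
        System.out.println(tokens.get(0) == stray);      // false
        System.out.println(tokens.get(0).equals(stray)); // true
    }
}
```

The second comparison is the kind of confusing result described above: one string that bypassed intern() is enough to make == give the wrong answer.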

(from JGuru)

Third disadvantage (Java 6 or earlier only): interned Strings live in PermGen space, which is usually quite small; you may run into an OutOfMemoryError with plenty of free heap space.

(from Michael Borgwardt)

What is Java String interning?

http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#intern()

Basically, calling String.intern() on a series of strings ensures that all strings with the same contents share the same memory. So if you have a list of names where 'john' appears 1000 times, interning ensures that only one 'john' needs to be kept in memory.

This can be useful to reduce the memory requirements of your program. But be aware that in Java 6 and earlier the cache is maintained by the JVM in the permanent memory pool, which is usually limited in size compared to the heap, so you should not use intern unless you have many duplicate values.
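The 'john' scenario can be sketched as follows. The class InternDedup and the helper distinctInstances are hypothetical names; the helper counts objects by identity rather than by equals, which is what matters for memory.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class InternDedup {
    /** Counts distinct String objects (by identity, not equals) in the array. */
    static int distinctInstances(String[] strings) {
        Set<String> identities =
                Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        Collections.addAll(identities, strings);
        return identities.size();
    }

    public static void main(String[] args) {
        String[] plain = new String[1000];
        String[] interned = new String[1000];
        for (int i = 0; i < plain.length; i++) {
            plain[i] = new String("john");   // a separate object each time
            interned[i] = plain[i].intern(); // the single pooled instance
        }
        System.out.println(distinctInstances(plain));    // 1000
        System.out.println(distinctInstances(interned)); // 1
    }
}
```

Once the plain array is dropped, the 999 extra copies become garbage and only the pooled instance remains referenced.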


More on memory constraints of using intern()

On one hand, it is true that you can remove String duplicates by
internalizing them. The problem is that the internalized strings go to
the Permanent Generation, which is an area of the JVM that is reserved
for non-user objects, like Classes, Methods and other internal JVM
objects. The size of this area is limited, and is usually much smaller
than the heap. Calling intern() on a String has the effect of moving
it out from the heap into the permanent generation, and you risk
running out of PermGen space.

--
From: http://www.codeinstructions.com/2009/01/busting-javalangstringintern-myths.html


From JDK 7 (I mean in HotSpot), something has changed.

In JDK 7, interned strings are no longer allocated in the permanent generation of the Java heap, but are instead allocated in the main part of the Java heap (known as the young and old generations), along with the other objects created by the application. This change will result in more data residing in the main Java heap, and less data in the permanent generation, and thus may require heap sizes to be adjusted. Most applications will see only relatively small differences in heap usage due to this change, but larger applications that load many classes or make heavy use of the String.intern() method will see more significant differences.

-- From Java SE 7 Features and Enhancements

Update: Interned strings are stored in main heap from Java 7 onwards. http://www.oracle.com/technetwork/java/javase/jdk7-relnotes-418459.html#jdk7changes

Performance penalty of String.intern()

I did a little bit of benchmarking myself. For the search cost part, I decided to compare String.intern() with ConcurrentHashMap.putIfAbsent(s,s). Basically, those two methods do the same thing, except that String.intern() is a native method that stores to and reads from a table managed directly in the JVM (HotSpot's StringTable), and ConcurrentHashMap.putIfAbsent() is just a normal instance method.

You can find the benchmark code in a GitHub gist (for lack of a better place to put it). You can also find the options I used when launching the JVM (to verify that the benchmark is not skewed) in the comments at the top of the source file.

Anyway here are the results:

Search cost (single threaded)

Legend

  • count: the number of distinct strings that we are trying to pool
  • initial intern: the time in ms it took to insert all the strings in the string pool
  • lookup same string: the time in ms it took to lookup each of the strings again from the pool, using exactly the same instance as was previously entered in the pool
  • lookup equal string: the time in ms it took to lookup each of the strings again from the pool, but using a different instance

String.intern()

count       initial intern (ms)   lookup same string (ms)   lookup equal string (ms)
1'000'000   40206                 34698                     35000
400'000     5198                  4481                      4477
200'000     955                   828                       803
100'000     234                   215                       220
80'000      110                   94                        99
40'000      52                    30                        32
20'000      20                    10                        13
10'000      7                     5                         7

ConcurrentHashMap.putIfAbsent()

count       initial intern (ms)   lookup same string (ms)   lookup equal string (ms)
1'000'000   411                   246                       309
800'000     352                   194                       229
400'000     162                   95                        114
200'000     78                    50                        55
100'000     41                    28                        28
80'000      31                    23                        22
40'000      20                    14                        16
20'000      12                    6                         7
10'000      9                     5                         3

The conclusion for the search cost: String.intern() is surprisingly expensive to call. It scales extremely badly: the cost of a single call grows roughly as O(n), where n is the number of strings in the pool. As the pool grows, the time to look up one string grows with it (about 0.7 microseconds per lookup with 10'000 strings, about 40 microseconds per lookup with 1'000'000 strings).

ConcurrentHashMap scales as expected: the number of strings in the pool has no real impact on the speed of a lookup.

Based on this experiment, I'd strongly suggest avoiding String.intern() if you are going to intern more than a few strings.
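For readers who want to repeat a comparison of this kind, here is a deliberately crude timing sketch. It is not the benchmark from the gist: the class and method names are invented, there is no warm-up phase, and a proper harness would be more trustworthy. Results can differ a lot between JVM versions and settings, so no expected timings are given.

```java
import java.util.concurrent.ConcurrentHashMap;

class InternTimingSketch {
    /** Milliseconds spent calling String.intern() on every element. */
    static long timeIntern(String[] strings) {
        long start = System.nanoTime();
        for (String s : strings) {
            s.intern();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Milliseconds spent pooling every element with putIfAbsent(s, s). */
    static long timeMap(String[] strings, ConcurrentHashMap<String, String> pool) {
        long start = System.nanoTime();
        for (String s : strings) {
            pool.putIfAbsent(s, s);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Builds count distinct strings that are created at run time. */
    static String[] distinctStrings(int count) {
        String[] out = new String[count];
        for (int i = 0; i < count; i++) {
            out[i] = "key-" + i;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] strings = distinctStrings(100_000);
        ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();
        System.out.println("intern:      " + timeIntern(strings) + " ms");
        System.out.println("putIfAbsent: " + timeMap(strings, pool) + " ms");
    }
}
```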

Downside to interning strings?

There could be two downsides:

  • The CPU cost of using the sys.intern() call. Calling a function requires the current frame to be pushed on the stack and popped again when the function returns. If you do this for a lot of strings the cost adds up. It's a tradeoff of CPU cycles vs. memory you need to take into account.

  • You may end up using more memory if your strings are mostly used singly. Interning also looks up the string object in a hash table, which by necessity needs to allocate more memory slots than the number of strings stored. Using a hashtable with N + overhead percentage slots could outstrip the memory needed for N strings, each used infrequently and thus not duplicated.

That said, we've used interning successfully and to significant effect in a multi-gigabyte in-memory cache, where strings by necessity appear in multiple locations in a tree structure.

When should we use intern method of String on String literals

Java automatically interns String literals. This means that in many cases, the == operator appears to work for Strings in the same way that it does for ints or other primitive values.

Since interning is automatic for String literals, the intern() method is only needed for Strings created at run time, such as those constructed with new String().

Using your example:

String s1 = "Rakesh";
String s2 = "Rakesh";
String s3 = "Rakesh".intern();
String s4 = new String("Rakesh");
String s5 = new String("Rakesh").intern();

if (s1 == s2) {
    System.out.println("s1 and s2 are same"); // 1.
}

if (s1 == s3) {
    System.out.println("s1 and s3 are same"); // 2.
}

if (s1 == s4) {
    System.out.println("s1 and s4 are same"); // 3.
}

if (s1 == s5) {
    System.out.println("s1 and s5 are same"); // 4.
}

will print:

s1 and s2 are same
s1 and s3 are same
s1 and s5 are same

In every case except s4, whose value was explicitly created with the new operator and never passed through intern(), the variable refers to the single immutable instance held in the JVM's string constant pool.

Refer to JavaTechniques "String Equality and Interning" for more information.

What is the purpose of Java's String.intern()?

There are essentially two ways that our String objects can enter the pool:

  • Using a literal in source code like "bbb".
  • Using intern.

intern is for when you have a String that's not otherwise from the pool. For example:

String bb = "bbb".substring(1); // substring creates a new object

System.out.println(bb == "bb"); // false
System.out.println(bb.intern() == "bb"); // true

Or slightly different:

System.out.println(new String("bbb").intern() == "bbb"); // true

new String("bbb") does create two objects...

String fromLiteral = "bbb";                     // in pool
String fromNewString = new String(fromLiteral); // not in pool

...but it's more like a special case. It creates two objects because "bbb" refers to an object:

A string literal is a reference to an instance of class String [...].

Moreover, a string literal always refers to the same instance of class String.

And new String(...) creates a copy of it.

However, there are many ways String objects are created without using a literal, such as:

  • All the String methods that return a modified copy. (substring, split, replace, etc.)
  • Reading a String from some kind of input such as a Scanner or Reader.
  • Concatenation when at least one operand is not a compile-time constant.
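The concatenation case is easy to get wrong, so here is a small sketch of it. The class ConstantConcat and the helper sameObject are names invented for this example; the behavior follows from the language rule that a concatenation creates a new String unless it is a constant expression.

```java
class ConstantConcat {
    /** Compares by reference on purpose, to show which strings are pooled. */
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String part = "b";            // not final, so not a constant variable
        final String constPart = "b"; // final with a constant initializer

        String folded = "b" + "b";           // constant expression, folded at compile time
        String alsoFolded = constPart + "b"; // still a constant expression
        String runtime = part + "b";         // computed at run time, a new object

        System.out.println(sameObject(folded, "bb"));           // true
        System.out.println(sameObject(alsoFolded, "bb"));       // true
        System.out.println(sameObject(runtime, "bb"));          // false
        System.out.println(sameObject(runtime.intern(), "bb")); // true
    }
}
```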

intern lets you add them to the pool or retrieve an existing object if there was one. Under most circumstances interning Strings is unnecessary but it can be used as an optimization because:

  • It lets you compare with ==.
  • It can save memory because duplicates can be garbage collected.

Will Interning strings help performance in a parser?

I couldn't really say exactly whether this would help your performance or not. It would depend on how many strings you use, and how frequently you create instances of those strings. String literals are interned automatically, so explicitly checking whether a string is interned may actually increase your overhead and reduce your performance. When it comes to memory usage, interned strings can definitely use less memory.

If you do wish to use string interning, there are some better ways to achieve it. First and foremost, I would stick your element names in a static class full of public string constants. Any string literal found in your program source code is definitely and automatically interned. Such strings are loaded into the intern pool when your application is loaded. If your strings can not be defined as constants for compile-time intern preparation, then I would simply call String.Intern(...) rather than doing the full ternary expression String.IsInterned(...) ? ... : String.Intern(...). The Intern method will automatically check if the string is interned, return the interned version if it is, and will otherwise add the string to the intern pool and return that if it is not. No need to manually check IsInterned yourself.

Again, I can not say whether manually interning strings will improve performance. If you use constants, they will be automatically interned for you, in the most optimal way, and that is the best approach to improving performance and memory usage of regularly reused strings. I would honestly recommend you stay away from manual interning, and let the compiler and runtime handle optimization for you.

Can we avoid interning of strings in java?

Permgen size/usage isn't an issue with modern JVMs. Interned strings are made available to the garbage collector when there are no remaining references to them.

(And no, you can't "turn off" interning of string literals).


