How Does Java Store Strings and How Does Substring Work Internally

How does Java store Strings and how does substring work internally?

See the comments:

    String str = "abcd";  // String literal, which is interned in the pool
    String str1 = new String("abcd"); // new String, not interned: str1 != str
    String str2 = str.substring(0,2); // new String (before Java 7u6, a view on str)
    String str3 = str.substring(0,2); // same: str3 != str2
    String str7 = str1.substring(0,str1.length()); // special case: str1 is returned

Notes:

  • Since Java 7u6, substring returns a new string instead of a view on the original string (but that makes no difference for this example)
  • Special case when you call str1.substring(0,str1.length()); - see code:

    public String substring(int beginIndex, int endIndex) {
        // some exception checking, then
        return ((beginIndex == 0) && (endIndex == value.length)) ? this
                : new String(value, beginIndex, subLen);
    }
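The special case can be observed directly with reference comparisons. This is a minimal sketch; the class and method names are made up for illustration, and the "distinct objects" result reflects current OpenJDK behaviour rather than anything the specification promises.

```java
class SubstringIdentityDemo {

    // True when the full-range substring is the very same object as the input.
    static boolean fullRangeReturnsSameObject(String s) {
        return s.substring(0, s.length()) == s;
    }

    // True when two equal partial substrings are nevertheless distinct objects.
    static boolean partialSubstringsAreDistinct(String s, int endIndex) {
        String first = s.substring(0, endIndex);
        String second = s.substring(0, endIndex);
        return first != second && first.equals(second);
    }

    public static void main(String[] args) {
        String str1 = new String("abcd");
        System.out.println(fullRangeReturnsSameObject(str1));
        System.out.println(partialSubstringsAreDistinct(str1, 2));
    }
}
```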

EDIT

What is a view?

Until Java 7u6, a String is basically a char[] that contains the characters of the string with an offset and a count (i.e. the string is composed of count characters starting from the offset position in the char[]).

When calling substring, a new string is created with the same char[] but a different offset / count, to effectively create a view on the original string. (Except when count = length and offset = 0 as explained above).

Since Java 7u6, a new char[] is created every time, because the String class no longer has a count or offset field.
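To make the idea of a view concrete, here is a toy model of the old layout. SharedString is a hypothetical class written for this illustration, not the JDK source; it only mimics the value / offset / count fields described above.

```java
// Toy model of the pre-7u6 String layout (hypothetical, for illustration only).
class SharedString {
    private final char[] value; // backing array, possibly shared between instances
    private final int offset;   // index of the first character in value
    private final int count;    // number of characters that belong to this string

    SharedString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    SharedString substring(int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > count || beginIndex > endIndex) {
            throw new StringIndexOutOfBoundsException();
        }
        if (beginIndex == 0 && endIndex == count) {
            return this; // special case: the whole string
        }
        // No characters are copied: the new object is a view on the same array.
        return new SharedString(value, offset + beginIndex, endIndex - beginIndex);
    }

    boolean sharesArrayWith(SharedString other) {
        return value == other.value;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        SharedString original = new SharedString("abcd");
        SharedString view = original.substring(1, 3);
        System.out.println(view + " shares array: " + view.sharesArrayWith(original));
    }
}
```

The drawback of this design is that a tiny view keeps the whole original array reachable, which is one commonly cited reason for the change in 7u6.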

Where is the common pool stored exactly?

This is implementation specific, and the location of the pool has moved over time. In HotSpot, it lived in the permanent generation up to Java 6 and has been stored on the main heap since Java 7.

How is the pool managed?

Main characteristics:

  • String literals are stored in the pool
  • Interned strings are stored in the pool (new String("abc").intern();)
  • When a string S is interned (because it is a literal or because intern() is called), the JVM will return a reference to a string in the pool if there is one that is equal to S (hence "abc" == "abc" should always return true).
  • Strings in the pool can be garbage collected (meaning that an interned string might be removed from the pool at some stage once nothing references it any more)
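The characteristics above can be checked with a few reference comparisons. This is a small sketch with made-up class and method names; it relies only on literals, new String(...) and intern().

```java
class StringPoolDemo {

    // Two literals with the same content resolve to one pooled object.
    static boolean literalsShareOneObject() {
        String a = "abc";
        String b = "abc";
        return a == b;
    }

    // new String(...) always creates a separate object outside the pool.
    static boolean newStringIsDistinctFromLiteral() {
        String fresh = new String("abc");
        return fresh != "abc" && fresh.equals("abc");
    }

    // intern() returns the pooled object that equals the receiver.
    static boolean internReturnsPooledObject() {
        return new String("abc").intern() == "abc";
    }

    public static void main(String[] args) {
        System.out.println(literalsShareOneObject());
        System.out.println(newStringIsDistinctFromLiteral());
        System.out.println(internReturnsPooledObject());
    }
}
```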

How the substring() function of the String class works

I know that line 2 will still point to "Monday" and have a new String object with the offset and count set to 0,3.

That is currently true of the Sun JRE implementation. I seem to recall that it was not true of the Sun implementation in the past, and it is not true of other implementations of the JVM. Do not rely on behaviour which is not specified. GNU Classpath might copy the array (I can't remember offhand what ratio it uses to decide when to copy, but it does copy if the copy is a small enough fraction of the original, which turned one nice O(N) algorithm into O(N^2)).

The line 4 will create a new String "Mon" in string pool and point to it.

No, it creates a new string object on the heap, subject to the same garbage collection rules as any other object. Whether or not it shares the same underlying character array is implementation dependent. Do not rely on behaviour which is not specified.

The String(String) constructor says:

Initializes a newly created String object so that it represents the same sequence of characters as the argument; in other words, the newly created string is a copy of the argument string.

The String(char[]) constructor says:

Allocates a new String so that it represents the sequence of characters currently contained in the character array argument. The contents of the character array are copied; subsequent modification of the character array does not affect the newly created string.
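The copying behaviour promised by these two constructors is easy to verify. The sketch below uses hypothetical helper names; it checks only what the quoted documentation states, not how the characters are stored internally.

```java
class ConstructorCopyDemo {

    // Builds a String from the array, then overwrites the array's first element.
    // The returned String must still hold the original characters.
    static String buildThenMutate(char[] chars) {
        String s = new String(chars);
        if (chars.length > 0) {
            chars[0] = 'X'; // does not affect s, because the contents were copied
        }
        return s;
    }

    // new String(String) yields an equal but distinct object.
    static boolean copyIsEqualButDistinct(String original) {
        String copy = new String(original);
        return copy != original && copy.equals(original);
    }

    public static void main(String[] args) {
        System.out.println(buildThenMutate(new char[] {'M', 'o', 'n'}));
        System.out.println(copyIsEqualButDistinct("Monday"));
    }
}
```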

Following good OO principles, no method of String actually requires that it is implemented using a character array, so no part of the specification of String requires operations on a character array. Those operations which take an array as input specify that the contents of the array are copied to whatever internal storage is used in the String. A string could use UTF-8 or LZ compression internally and conform to the API.

However, if your JVM doesn't make the small-ratio sub-string optimisation, then there's a chance that it does copy only the relevant portion when you use new String(String), so it's a case of trying it and seeing if it improves the memory use. Not everything which affects Java runtimes is defined by Java.

To obtain a string in the string pool which is equal to a string, use the intern() method. This will either retrieve a string from the pool if one with the value already has been interned, or create a new string and put it in the pool. Note that pooled strings have different (again implementation dependent) garbage collection behaviour.

How is a string represented in Java internally?

Nope, C strings are an array of chars and thus there is no length associated with them. The side effect of this decision is that to determine a string's length, one must iterate through it to find the \0, which isn't as efficient as carrying the length around.

Java strings (in the implementation before Java 7u6) have a char array for their chars and carry an offset and a count. This means determining a string's length is rather efficient.


Java - Strings and the substring method

Strings in Java are immutable. Basically this means that, once you create a string object, you won't be able to modify/change the content of a string. As a result, if you perform any manipulation on a string object which "appears to" change the content of the string, Java creates a new string object, and performs the manipulation on the newly created one.

Based on this, your code above appears to create five string objects - two are created by the declaration, two are created by calls to substring, and the last one is created after you concatenate the two pieces.

Immutability, however, leads to another interesting consequence. The JVM internally maintains a string pool for string literals. To save memory, the JVM will try to reuse string objects from this pool. Whenever a string literal is loaded, the JVM will look into the pool to see if an equal string already exists. If there is one, the JVM will simply return it.

So, technically, before Java 7u6, only one character array is created for your whole code. Your substring calls don't create new strings in the pool and don't copy any characters: each result shares the array of the existing "Hello World" string, but only uses the characters from position 0 to 3 for your first call to substring, for example. Starting from Java 7u6, substring will not share the characters, but will copy them into a new array. So, total object count will be 4 - the last one will be created with the concatenation of the two substrings.

Edit
To answer your question in the comment, take a look at Java Language Specification -

In the Java programming language, unlike C, an array of char is not a
String, and neither a String nor an array of char is terminated by
'\u0000' (the NUL character).

A String object is immutable, that is, its contents never change,
while an array of char has mutable elements.

The method toCharArray in class String returns an array of characters
containing the same character sequence as a String. The class
StringBuffer implements useful methods on mutable arrays of
characters.

So, no, char arrays are not immutable in Java, they are mutable.
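The difference is easy to see in code. In this small sketch (class and method names are hypothetical), the array returned by toCharArray is modified freely while the String it came from stays unchanged.

```java
class CharArrayMutabilityDemo {

    // Replaces the first character in a copy of the string's characters.
    // Assumes s is non-empty; s itself is never modified.
    static String withFirstCharReplaced(String s, char replacement) {
        char[] chars = s.toCharArray(); // a fresh, mutable copy
        chars[0] = replacement;         // arrays of char can be changed in place
        return new String(chars);
    }

    public static void main(String[] args) {
        String hello = "Hello";
        String jello = withFirstCharReplaced(hello, 'J');
        System.out.println(hello + " / " + jello);
    }
}
```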

How does String.intern() work?

why str3.intern() == str3 is true

Because, as you said:

Otherwise, this String object is added to the pool and a reference to this String object is returned.

You're in that case. The pool doesn't contain str3 (i.e. "mathanalyze") yet. So str3 is added to the pool and returned.

For str5, you're in the other case:

if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned

So, the pool already contains the string "java" when your code is executed, which is not surprising since java is, for example, the name of the top-level package of all the standard classes, and also the name of the executable used to launch the JVM. There is a huge chance that the literal string "java" is used in the code that bootstraps the application and loads classes before executing your main method.
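Both cases can be reproduced without depending on what the JVM happens to have interned during startup. The sketch below is an illustration with made-up names: it builds a string from a random UUID so that it is practically certain not to be in the pool yet, and it assumes Java 7 or later, where the interned object itself is placed in the pool.

```java
import java.util.UUID;

class InternCasesDemo {

    // A string built at run time that is practically certain to be absent from the pool.
    static String freshString() {
        return "fresh-" + UUID.randomUUID();
    }

    // Case 1: the pool has no equal string, so this object is added and returned.
    static boolean firstInternReturnsSameObject() {
        String fresh = freshString();
        return fresh.intern() == fresh;
    }

    // Case 2: the pool already has an equal string, so that one is returned.
    static boolean laterInternReturnsPooledObject() {
        String fresh = freshString();
        String pooled = fresh.intern();
        String copy = new String(fresh);
        return copy.intern() == pooled && copy.intern() != copy;
    }

    public static void main(String[] args) {
        System.out.println(firstInternReturnsSameObject());
        System.out.println(laterInternReturnsPooledObject());
    }
}
```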

Does java keep the String in a form of char array?

  1. Until Java 8, Strings were internally represented as an array of characters (char[]) encoded in UTF-16, so that every character uses two bytes of memory.
  2. When we create a String via the new operator, the JVM creates a new object and stores it on the heap. For example:

    String str1 = new String("Hello");

    When we assign a string literal to a String variable, the JVM searches the pool for a String of equal value. If one is found, a reference to it is returned without allocating additional memory. If not, the String is added to the pool and its reference is returned.

    String str2 = "Hello";

  3. toCharArray() creates a new char[] by copying the characters of the string's internal array into it.

  4. charAt(int index) returns the value at the specified index of the internal (original) array.

With Java 9 a new representation was introduced, called Compact Strings. A String now stores its characters in a byte[] together with an encoding flag, and chooses between Latin-1 (one byte per character) and UTF-16 (two bytes per character) depending on the stored content. Since UTF-16 is used only when necessary, the amount of heap memory used by strings is significantly lower, which in turn reduces garbage collector overhead on the JVM.

Source:
http://www.baeldung.com/java-string-pool

Why does a String not point to the same object in the String pool area?

If you look at the implementation of the substring method, you will see that it creates a String with the new operator, so the returned string is not in the string pool.

    public String substring(int beginIndex, int endIndex) {
        if (beginIndex < 0) {
            throw new StringIndexOutOfBoundsException(beginIndex);
        }
        if (endIndex > value.length) {
            throw new StringIndexOutOfBoundsException(endIndex);
        }
        int subLen = endIndex - beginIndex;
        if (subLen < 0) {
            throw new StringIndexOutOfBoundsException(subLen);
        }
        return ((beginIndex == 0) && (endIndex == value.length)) ? this
                : new String(value, beginIndex, subLen);
    }

Replace this line in your code

    String s2 = "Hello friends".substring(0, 5);

with

    String s2 = "Hello friends".substring(0, 5).intern();

and you will see that the comparison returns true.
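Put together, the comparison might look like the sketch below. The original question's code is not shown here, so the literal "Hello" on the other side of the comparison is an assumption, as are the class and method names. The first result reflects OpenJDK behaviour; only the second is guaranteed by the documentation of intern().

```java
class SubstringInternDemo {

    // Compares a plain substring result with the pooled literal by reference.
    static boolean plainSubstringIsPooledLiteral() {
        String s2 = "Hello friends".substring(0, 5);
        return s2 == "Hello";
    }

    // Same comparison after interning the substring result.
    static boolean internedSubstringIsPooledLiteral() {
        String s2 = "Hello friends".substring(0, 5).intern();
        return s2 == "Hello";
    }

    public static void main(String[] args) {
        System.out.println(plainSubstringIsPooledLiteral());
        System.out.println(internedSubstringIsPooledLiteral());
    }
}
```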

While using substring to print a character pattern, why has i been initialized to 1 instead of 0?

In the String.substring (int begIndex, int endIndex) method, endIndex doesn't index the last character of the substring. It indexes the character after the last character of the substring.

https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#substring(int,%20int)

Thus, the length of the substring is endIndex-beginIndex
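A pattern-printing loop of the kind the question describes might look like the sketch below. The original loop is not shown, so this is a reconstruction under that assumption, with hypothetical names. Starting i at 1 means the first row is substring(0, 1), which has length 1 - 0 = 1; starting at 0 would produce an empty first row.

```java
import java.util.ArrayList;
import java.util.List;

class SubstringPatternDemo {

    // Returns the growing prefixes of s: substring(0, 1), substring(0, 2), ...
    static List<String> prefixes(String s) {
        List<String> rows = new ArrayList<>();
        for (int i = 1; i <= s.length(); i++) {
            rows.add(s.substring(0, i)); // endIndex i is exclusive, so the row has i characters
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : prefixes("JAVA")) {
            System.out.println(row);
        }
    }
}
```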

How Java String pool works when String concatenation?

"When the string is created by concatenation does java make something
different or simple == comparator have another behaviour?"

No it does not change its behavior, what happens is that:

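When concatenating two string literals, "a" + "b", the expression is a compile-time constant: the compiler joins the two values and stores the result in the class file's constant pool as a single literal. At run time that literal resolves to the same pooled String as any other literal with the same content, so the same reference is assigned to the variable. Now in more detail: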

Look at the compiled bytecode below of this simple program:

    public class Test {
        public static void main(String... args) {
            String a = "hello world!";
            String b = "hello" + " world!";
            boolean compare = (a == b);
        }
    }

Simple program

First the JVM loads the string "hello world!" from the constant pool, which places it in the string pool (in this case), and pushes it onto the stack (ldc = load constant) [see point 1 in Image]

Then it assigns the reference created in the pool to the local variable (astore_1) [see point 2 in Image]

Notice that the constant pool entry for this literal is #2 [see point 3 in Image]

The next operation is about the same: because the compiler has already concatenated the two literals, the bytecode simply loads the same constant pool entry (#2) again and assigns it to a local variable (astore_2).

Thus (a == b) is true, because both variables reference the pooled string for entry #2, which is "hello world!".

Your example C is somewhat different, though, because you're using the += operator on a variable. That is not a constant expression, so the concatenation happens at run time (compiled to StringBuilder calls before Java 9) and produces a new String object on the heap, which is a different reference from the pooled literal (string pool vs. new object).
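The two situations can be compared side by side. Example C from the question is not shown here, so the += case below is an assumed reconstruction, and the class and method names are made up for the sketch.

```java
class ConcatPoolDemo {

    // Literal + literal is a constant expression, so it resolves to the pooled literal.
    static boolean constantConcatIsPooled() {
        String a = "hello world!";
        String b = "hello" + " world!";
        return a == b;
    }

    // += on a non-final variable is evaluated at run time and creates a new String.
    static boolean runtimeConcatIsPooled() {
        String a = "hello world!";
        String c = "hello";
        c += " world!";
        return a == c;
    }

    // Interning the run-time result yields the pooled literal again.
    static boolean internedRuntimeConcatIsPooled() {
        String a = "hello world!";
        String c = "hello";
        c += " world!";
        return a == c.intern();
    }

    public static void main(String[] args) {
        System.out.println(constantConcatIsPooled());
        System.out.println(runtimeConcatIsPooled());
        System.out.println(internedRuntimeConcatIsPooled());
    }
}
```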


