What Is Java's Internal Representation for String? Modified UTF-8? UTF-16?

What is Java's internal representation for String? Modified UTF-8? UTF-16?

Java uses UTF-16 for the internal text representation

The representation for String, StringBuilder, etc. in Java is UTF-16.

https://docs.oracle.com/javase/8/docs/technotes/guides/intl/overview.html

How is text represented in the Java platform?

The Java programming language is based on the Unicode character set, and several libraries implement the Unicode standard. The primitive data type char in the Java programming language is an unsigned 16-bit integer that can represent a Unicode code point in the range U+0000 to U+FFFF, or the code units of UTF-16. The various types and classes in the Java platform that represent character sequences - char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator - are UTF-16 sequences.
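To see this in action, here is a small sketch (not from the quoted documentation, just an illustration against a standard JDK; the class name is made up): String.length() counts UTF-16 code units, not code points.

class Utf16Lengths {
    public static void main(String[] args) {
        String bmp  = "\u6700";          // 最, a Basic Multilingual Plane character
        String clef = "\uD834\uDD1E";    // 𝄞 (U+1D11E), needs a surrogate pair

        System.out.println(bmp.length());                           // 1 code unit
        System.out.println(clef.length());                          // 2 code units
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
    }
}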

At the JVM level, if you are using -XX:+UseCompressedStrings (which was the default for some updates of Java 6), the actual in-memory representation can be 8-bit ISO-8859-1, but only for strings which do not need UTF-16 encoding.

http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

Java also supports a non-standard modification of UTF-8 for string serialization.

Serialized Strings use this modified UTF-8 by default.

And how many bytes does Java use for a char in memory?

A char is always two bytes, if you ignore the need for padding in an Object.

Note: a code point (which allows characters above 65535) can use one or two chars, i.e. 2 or 4 bytes.
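As a quick sketch (assuming a standard JDK; values shown in comments are illustrative), Character.charCount tells you whether a code point needs one char or two. Inside some main method:

int bmpCodePoint  = 0x6700;   // 最 - fits in one char (2 bytes)
int suppCodePoint = 0x1F600;  // 😀 - needs a surrogate pair (4 bytes)

System.out.println(Character.charCount(bmpCodePoint));   // 1
System.out.println(Character.charCount(suppCodePoint));  // 2

char[] units = Character.toChars(suppCodePoint);
System.out.printf("%04X %04X%n", (int) units[0], (int) units[1]);  // D83D DE00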

Which encoding does Java use, UTF-8 or UTF-16?

Characters are graphical entities which are part of human culture. When a computer needs to handle text, it uses a representation of those characters in bytes. The exact representation used is called an encoding.

There are many encodings that can represent the same character - either through the Unicode character set, or through other character sets like the various ISO-8859 encodings, or the JIS X 0208.

Internally, Java uses UTF-16. This means that each character is represented by one or two 16-bit code units, i.e. two or four bytes. The character you were using, 最, has the code point U+6700, which is represented in UTF-16 as the byte 0x67 followed by the byte 0x00.
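As a sketch (not part of the original answer; the class name is made up), you can reproduce that byte sequence by requesting an explicit big-endian UTF-16 encoding. Note that this is a conversion performed on request, not a peek into the JVM's memory:

import java.nio.charset.StandardCharsets;

class Utf16Bytes {
    public static void main(String[] args) {
        byte[] units = "最".getBytes(StandardCharsets.UTF_16BE);  // big-endian, no BOM
        for (byte b : units) {
            System.out.printf("%02X ", b);   // prints: 67 00
        }
    }
}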

That's the internal encoding. You can't see it unless you dump your memory and look at the bytes in the dumped image.

But the method getBytes() does not return this internal representation. Its documentation says:

public byte[] getBytes()

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The "platform's default charset" is what your locale variables say it is. That is, UTF-8. So it takes the UTF-16 internal representation, and converts that into a different representation - UTF-8.

Note that

new String(bytes, StandardCharsets.UTF_16);

does not "convert it to UTF-16 explicitly" as you assumed it does. This string constructor takes a sequence of bytes, which is supposed to be in the encoding that you have given in the second argument, and converts it to the UTF-16 representation of whatever characters those bytes represent in that encoding.

But you have given it a sequence of bytes encoded in UTF-8, and told it to interpret that as UTF-16. This is wrong, and you do not get the character - or the bytes - that you expect.
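A sketch of the mismatch (illustrative only; the exact garbage characters depend on the bytes involved, and the class name is made up):

import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {
    public static void main(String[] args) {
        byte[] utf8 = "最".getBytes(StandardCharsets.UTF_8);       // E6 9C 80

        String right = new String(utf8, StandardCharsets.UTF_8);   // correct round trip
        String wrong = new String(utf8, StandardCharsets.UTF_16);  // lies about the encoding

        System.out.println(right);  // 最
        System.out.println(wrong);  // garbage: the three UTF-8 bytes were read as UTF-16 units
    }
}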

You can't tell Java how to internally store strings. It always stores them as UTF-16. The constructor String(byte[],Charset) tells Java to create a UTF-16 string from an array of bytes that is supposed to be in the given character set. The method getBytes(Charset) tells Java to give you a sequence of bytes that represent the string in the given encoding (charset). And the method getBytes() without an argument does the same - but uses your platform's default character set for the conversion.

So you misunderstood what getBytes() gives you. It's not the internal representation. You can't get that directly; only getBytes(StandardCharsets.UTF_16) will give you that, and only because you know that UTF-16 is the internal representation in Java. If a future version of Java decided to represent the characters in a different encoding, then getBytes(StandardCharsets.UTF_16) would not show you the internal representation.

Edit: in fact, Java 9 introduced just such a change in internal representation of strings, where, by default, strings whose characters all fall in the ISO-8859-1 range are internally represented in ISO-8859-1, whereas strings with at least one character outside that range are internally represented in UTF-16 as before. So indeed, getBytes(StandardCharsets.UTF_16) no longer returns the internal representation.

Why does Java use UTF-16 for the internal text representation?

Java was designed and first implemented back in the days when Unicode was specified to be a set of 16-bit code points. That is why char is a 16-bit type, and why String is modeled as a sequence of char.

Now, if the Java designers had been able to foresee that Unicode would add extra "code planes", they might¹ have opted for a 32-bit char type.

Java 1.0 came out in January 1996. Unicode 2.0 (which introduced the higher code planes and the surrogate mechanism) was released in July 1996.


Internally, I believe that some versions of Java have used UTF-8 as the representation for strings, at least at some level. However, it is still necessary to map this to the methods specified in the String API because that is what Java applications require. Doing that if the primary internal representation is UTF-8 rather than UTF-16 is going to be inefficient.

And before you suggest that they should "just change the String APIs" ... consider how many trillions of lines of Java code already exist that depend on the current String APIs.


For what it is worth, most if not all programming languages that support Unicode do it via a 16-bit char or wchar type.


¹ - ... and possibly not, bearing in mind that memory was a lot more expensive back then, and programmers worried much more about such things in those days.

Clarification on how character encodings work

A Java String is always encoded in UTF-16; input and output are converted as necessary.

This, however:

    if (char < 'a') { char += 32; }

can be better written as:

    if (ch >= 'A' && ch <= 'Z') {
        ch += ('a' - 'A');
    }

Reason:

  1. Checking for the expected range is just more cautious

  2. You do not need to 'know' that the distance between lower-case alphabetics and upper-case alphabetics is 32.

Also, 'char' is a keyword in Java.

This of course only works for letters in the unaccented USA/UK alphabet.

However, I would suggest you use (as you yourself stated) 'toLowerCase()' since that's what it's there for - to relieve you of details.
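A sketch of what that looks like (plain library calls, not specific to the original question), inside some method body:

char ch = 'G';
char lower = Character.toLowerCase(ch);   // 'g' - handles far more than A-Z

// For whole strings, String.toLowerCase with an explicit Locale avoids surprises
// (e.g. the Turkish dotless i):
String s = "HELLO".toLowerCase(java.util.Locale.ROOT);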

Understanding encoding in character streams

The read() method has returned a Java char value, which is an unsigned 2-byte binary number (0-65535).

The actual return type is int (signed 4-byte binary number) to allow for a special -1 value meaning end-of-stream.

A Java char is a UTF-16 encoded Unicode character. This means that all characters from the Basic Multilingual Plane will appear unencoded, i.e. the char value is the Unicode value.
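A minimal sketch of that contract (the class name and sample text are made up for the example):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ReadLoop {
    public static void main(String[] args) throws IOException {
        byte[] utf8 = "最高".getBytes(StandardCharsets.UTF_8);
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int r;                                    // int, so -1 can signal end-of-stream
            while ((r = in.read()) != -1) {
                char c = (char) r;                    // a UTF-16 code unit
                System.out.printf("U+%04X%n", (int) c);  // U+6700, then U+9AD8
            }
        }
    }
}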

Java String internal representation

I took a gcore dump of a minimal Java process with this code:

class Hi {
    public static void main(String[] args) {
        String hello = "Hello";
        try {
            Thread.sleep(60_000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

And did a gcore memory dump on Ubuntu (using jps to get the pid and passing that to gcore).

I found 48 65 6C 6C 6F in the dump using a hex editor, so the string is somewhere in memory as ASCII.

But also 48 00 65 00 6C 00 6C, which is part of the UTF-16 representation of the String.

What does it mean to say Java Modified UTF-8 Encoding?

This is described in detail in the javadoc of DataInput:

Modified UTF-8


Implementations of the DataInput and DataOutput interfaces represent Unicode strings in a format that is a slight modification of UTF-8. (For information regarding the standard UTF-8 format, see section 3.9 Unicode Encoding Forms of The Unicode Standard, Version 4.0). Note that in the following tables, the most significant bit appears in the far left-hand column.

... (some tables, please click the javadoc link to see yourself) ...

The differences between this format and the standard UTF-8 format are the following:

  • The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls.
  • Only the 1-byte, 2-byte, and 3-byte formats are used.
  • Supplementary characters are represented in the form of surrogate pairs.
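A sketch of the first difference, using DataOutputStream (which implements DataOutput); the class name is made up for the example. Encoding a string containing U+0000 shows the 2-byte C0 80 form instead of a raw zero byte:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeUTF("\u0000A");                 // NUL followed by 'A'
        }
        for (byte b : bos.toByteArray()) {
            System.out.printf("%02X ", b);
        }
        // Expected: 00 03 C0 80 41
        //           ^^^^^ 2-byte length prefix, then C0 80 for NUL, then 41 for 'A'
    }
}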

How to read it is described in detail in the javadoc of DataInput#readUTF():

readUTF

String readUTF()
throws IOException

Reads in a string that has been encoded using a modified UTF-8 format. The general contract of readUTF is that it reads a representation of a Unicode character string encoded in modified UTF-8 format; this string of characters is then returned as a String.

First, two bytes are read and used to construct an unsigned 16-bit integer in exactly the manner of the readUnsignedShort method. This integer value is called the UTF length and specifies the number of additional bytes to be read. These bytes are then converted to characters by considering them in groups. The length of each group is computed from the value of the first byte of the group. The byte following a group, if any, is the first byte of the next group.

If the first byte of a group matches the bit pattern 0xxxxxxx (where x means "may be 0 or 1"), then the group consists of just that byte. The byte is zero-extended to form a character.

If the first byte of a group matches the bit pattern 110xxxxx, then the group consists of that byte a and a second byte b. If there is no byte b (because byte a was the last of the bytes to be read), or if byte b does not match the bit pattern 10xxxxxx, then a UTFDataFormatException is thrown. Otherwise, the group is converted to the character:

(char)(((a & 0x1F) << 6) | (b & 0x3F))

If the first byte of a group matches the bit pattern 1110xxxx, then the group consists of that byte a and two more bytes b and c. If there is no byte c (because byte a was one of the last two of the bytes to be read), or either byte b or byte c does not match the bit pattern 10xxxxxx, then a UTFDataFormatException is thrown. Otherwise, the group is converted to the character:

(char)(((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F))

If the first byte of a group matches the pattern 1111xxxx or the pattern 10xxxxxx, then a UTFDataFormatException is thrown.

If end of file is encountered at any time during this entire process, then an EOFException is thrown.

After every group has been converted to a character by this process, the characters are gathered, in the same order in which their corresponding groups were read from the input stream, to form a String, which is returned.

The writeUTF method of interface DataOutput may be used to write data that is suitable for reading by this method.
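A round-trip sketch (a plausible usage, not taken from the javadoc; the class name is made up): writeUTF produces the modified UTF-8 bytes, and readUTF reconstructs the same String, including supplementary characters carried as surrogate pairs.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class ReadUtfRoundTrip {
    public static void main(String[] args) throws IOException {
        String original = "最高\u0000\uD83D\uDE00";   // BMP chars, a NUL, and 😀

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeUTF(original);
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            String decoded = in.readUTF();
            System.out.println(decoded.equals(original));   // true
        }
    }
}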


