What Is a "Surrogate Pair" in Java

What is a surrogate pair in Java?

The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.

In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.

Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.

The surrogate code units are in two ranges known as "high surrogates" and "low surrogates", depending on whether they are allowed at the start or end of the two-code-unit sequence.
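As a small illustration of the paragraphs above, the following sketch shows that a character above 0xFFFF occupies two char code units in a Java String while still counting as a single code point. The class and method names are made up for this example, and the sample character (U+1F60A) is an arbitrary choice.

```java
class SurrogateLengthDemo {

    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "A";                       // U+0041, fits in one code unit
        String supplementary = "\uD83D\uDE0A";  // U+1F60A, needs a surrogate pair

        System.out.println(codeUnits(bmp) + " code unit(s), " + codePoints(bmp) + " code point(s)");
        System.out.println(codeUnits(supplementary) + " code unit(s), "
                + codePoints(supplementary) + " code point(s)");

        // The first unit of the pair is a high surrogate, the second a low surrogate.
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(supplementary.charAt(1)));  // true
    }
}
```

This is why String.length() can be larger than the number of characters a reader sees.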

What is a surrogate range and surrogate code in Java?

A surrogate pair is a pair of 16-bit values used in UTF-16 to encode a Unicode code-point outside of the BMP / plane 0; i.e. any Unicode code-point that is greater than 65535.

The surrogate range is the range of 16-bit values that the two values of a pair come from:

  • The high value of a surrogate pair comes from the range D800 through DBFF
  • The low value of a surrogate pair comes from the range DC00 through DFFF.

For example: the Unicode code point U+10437 is represented in UTF-16 as the surrogate pair D801 DC37.
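The example above can be checked with the standard Character API. This is a minimal sketch; the class name and the `utf16Hex` helper are invented for illustration.

```java
class EncodeDemo {

    // Formats each UTF-16 code unit of a code point as four hex digits.
    static String utf16Hex(int codePoint) {
        char[] units = Character.toChars(codePoint); // one or two chars
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex(0x10437)); // D801 DC37

        // Going the other way: combine the two surrogates into a code point.
        int cp = Character.toCodePoint('\uD801', '\uDC37');
        System.out.println(Integer.toHexString(cp)); // 10437
    }
}
```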

For more information, read the Wikipedia article on UTF-16.


What is a surrogate range and surrogate code in Java?

The two surrogate ranges are described above.

A surrogate code is a code (see note 1 below) in one of the two surrogate ranges.


What is the use of surrogate methods like isSurrogate(), isSurrogatePair(), isLowSurrogate(), isHighSurrogate()?

  • isSurrogate() tests if a char is either a low or high surrogate
  • isSurrogatePair() tests if a pair of char values is a valid surrogate pair
  • isLowSurrogate() tests if a char is a low surrogate value
  • isHighSurrogate() tests if a char is a high surrogate value

The use of these methods is self-evident. They are used to test char values when interpreting UTF-16 code units as Unicode code points.
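A short sketch of the four methods in use follows. The class name, the `classify` helper and the sample text are assumptions made for this example; the Character methods themselves are the standard ones listed above.

```java
class SurrogateChecks {

    // Describes which surrogate range, if any, a single char falls into.
    static String classify(char c) {
        if (Character.isHighSurrogate(c)) {
            return "high surrogate";
        }
        if (Character.isLowSurrogate(c)) {
            return "low surrogate";
        }
        return "not a surrogate";
    }

    public static void main(String[] args) {
        String text = "x\uD83D\uDE0A"; // 'x' followed by U+1F60A
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            System.out.printf("%04X: %s (isSurrogate=%b)%n",
                    (int) c, classify(c), Character.isSurrogate(c));
        }

        // Order matters: the high surrogate must come first.
        System.out.println(Character.isSurrogatePair('\uD83D', '\uDE0A')); // true
        System.out.println(Character.isSurrogatePair('\uDE0A', '\uD83D')); // false
    }
}
```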


Note 1: This could be either a code unit or a code point, depending on the context. If you have a sequence of 16-bit code-units that constitute a UTF-16 string, then these are code-units. On the other hand, if you have a sequence of Unicode code-points, then if you were to encounter high and low surrogates in that sequence they would be code points. However, the surrogate code-points are not meaningful as text in that context.

How to find the surrogate pair of a symbol in java

Finding the surrogate pair for a symbol in Java is straightforward, with a couple of caveats:

  • Not all characters are represented by a surrogate pair, including your example ("⋀"), so always check for this before attempting to get the surrogate pair.
  • You need a font that can display the symbols, both in your source code, and in any output produced by your code. I used Monospaced in NetBeans for the code and output shown below.

Here is code to display some basic Unicode information for arbitrary symbols. It gets the code point for a symbol, determines whether it is represented by a surrogate pair, and if so gets the high and low surrogates. The code processes two symbols, one with a surrogate pair (the emoji "😊"), and one without (your "⋀" example).

package surrogates;

public class Surrogates {

    public static void main(String[] args) {
        Surrogates.displaySymbolDetails("😊");
        Surrogates.displaySymbolDetails("⋀");
    }

    // Prints the code point and Unicode name of the first symbol in the string,
    // plus its high and low surrogates if it lies outside the BMP.
    static void displaySymbolDetails(String symbol) {
        int cp = symbol.codePointAt(0);
        String name = Character.getName(cp);
        System.out.println(symbol + " has code point " + cp + " (hex " + Integer.toHexString(cp) + ").");
        System.out.println(symbol + " has Unicode name " + name + ".");
        boolean isSupplementary = Character.isSupplementaryCodePoint(cp);
        if (isSupplementary) {
            System.out.println(symbol + " is a supplementary character.");
            char high = Character.highSurrogate(cp);
            char low = Character.lowSurrogate(cp);
            // Cast to int so the surrogates print as decimal numbers.
            System.out.println(symbol + " has high surrogate: " + (int) high + ".");
            System.out.println(symbol + " has low surrogate: " + (int) low + ".");
        } else {
            System.out.println(symbol + " is in the BMP and therefore is not represented by a surrogate pair.");
        }
    }
}

Here is the output:

😊 has code point 128522 (hex 1f60a).
😊 has Unicode name SMILING FACE WITH SMILING EYES.
😊 is a supplementary character.
😊 has high surrogate: 55357.
😊 has low surrogate: 56842.
⋀ has code point 8896 (hex 22c0).
⋀ has Unicode name N-ARY LOGICAL AND.
⋀ is in the BMP and therefore is not represented by a surrogate pair.

Notes:

  • "symbol" can mean multiple things, but I am assuming that in your question you are simply referring to some Unicode character.
  • Symbols (i.e. Unicode characters) in the basic multilingual plane (BMP) are not represented by a surrogate pair. All other symbols are in one of the supplementary planes, and are represented by a surrogate pair.

Java - what are characters, code points and surrogates? What difference is there between them?

To represent text in computers, you have to solve two things: first, you have to map symbols to numbers, then, you have to represent a sequence of those numbers with bytes.

A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode defined 109384 symbols at the time of writing, which is far more than 2^16 (65536).

Furthermore, ASCII specifies that number sequences are represented one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.

When you try to use an encoding which uses fewer bits per character than are needed to represent all possible values (such as UTF-16, which uses 16 bits), you need some workaround.

Thus, surrogates are 16-bit values that indicate symbols that do not fit into a single two-byte value.
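To make the difference between the encodings concrete, here is a minimal sketch that counts how many bytes the same symbol takes in UTF-8 and in UTF-16. The class and method names are invented for this example, and the three sample symbols are arbitrary choices.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizes {

    // Bytes needed to store the string in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed to store the string in UTF-16 (big-endian, no byte order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String[] samples = { "A", "\u22C0", "\uD83D\uDE0A" }; // U+0041, U+22C0, U+1F60A
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0)).toUpperCase()
                    + ": UTF-8 = " + utf8Bytes(s) + " byte(s), UTF-16 = " + utf16Bytes(s) + " byte(s)");
        }
    }
}
```

Only the last sample needs four bytes in UTF-16, because it is stored as a surrogate pair of two 16-bit values.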

Java uses UTF-16 internally to represent text.

In particular, a char (character) is an unsigned two-byte value that contains a single UTF-16 code unit.

If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2

How are surrogate pairs calculated?

Unicode code points are scalar values which range from 0x000000 to 0x10FFFF. Thus they are 21-bit integers, not 17-bit.

Surrogate pairs are a mechanism of the UTF-16 form. This represents the 21-bit scalar values as one or two 16-bit code units.

  • Scalar values from 0x000000 to 0x00FFFF are represented as a single 16-bit code unit, from 0x0000 to 0xFFFF.
  • Code points from 0x00D800 to 0x00DFFF are reserved for surrogates. They are not Unicode scalar values and are not assigned to characters, so they will never occur in a well-formed Unicode character string.
  • Scalar values from 0x010000 to 0x10FFFF are represented as two 16-bit code units. The first code unit, ranging from 0xD800-0xDBFF, encodes the upper 11 bits of the scalar value. The trick is that 0x10000 is subtracted first, which maps the plane numbers 0x01-0x10 to 0x0-0xF so that they fit in four bits; the upper 10 bits of the resulting 20-bit value go into the first code unit. The second code unit, ranging from 0xDC00-0xDFFF, encodes the lower 10 bits of the scalar value.

This is explained in detail, with sample code, in the Unicode consortium's FAQ, UTF-8, UTF-16, UTF-32 & BOM. That FAQ refers to the section of the Unicode Standard which has even more detail.
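The arithmetic described in the list above can be sketched in Java as follows. This is an illustrative sketch written for this page, not the sample code from the FAQ; the class and method names are invented. In practice the built-in Character.highSurrogate() and Character.lowSurrogate() do the same job.

```java
class SurrogateMath {

    // Splits a supplementary code point (0x10000..0x10FFFF) into {high, low}.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a supplementary code point: " + codePoint);
        }
        int offset = codePoint - 0x10000;                // now a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // lower 10 bits
        return new char[] { high, low };
    }

    // Reverses the calculation: combines a surrogate pair into a code point.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F60A);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);           // D83D DE0A
        System.out.println(Integer.toHexString(fromSurrogates(pair[0], pair[1]))); // 1f60a
    }
}
```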

XML marshalling of surrogate pairs

The unwanted "codepoint" in your output is the artifact of a bug in your output code.

Java strings have an interface with a bias towards UTF-16. All offsets and all methods working with the char data type pretend that the string is an array of UTF-16 code units.

The same goes for string escaping like "\uD83D\uDCB3". It does not contain two Unicode code points. Rather it contains two UTF-16 code units that together form a single code point, namely the code point for the credit card symbol.

Your output code mixes code points and UTF-16 code units by accessing code points using codePointAt() but incrementing the offset (variable i) by code units. Thus, the credit card code point is accessed twice: once correctly and the second time incorrectly (with the offset pointing into the middle of the surrogate pair).

The correct code looks like so:

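int offset = 0;
while (offset < xmlString.length()) {
    // Read the full code point at this offset (one or two chars).
    int codePoint = xmlString.codePointAt(offset);
    System.out.print(codePoint);
    System.out.print("|");
    // Advance by 1 for a BMP character and by 2 for a surrogate pair.
    offset += Character.charCount(codePoint);
}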
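To see the effect of the bug, the following self-contained sketch runs both loop styles over the same input. The original xmlString and output code are not shown here, so the sample string, the class name and the `buggy` method are assumptions that only mimic the described mistake.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointLoops {

    // Mimics the described bug: reads code points but advances one char at a time.
    static List<Integer> buggy(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            result.add(s.codePointAt(i));
        }
        return result;
    }

    // Advances by the number of chars each code point actually occupies.
    static List<Integer> correct(String s) {
        List<Integer> result = new ArrayList<>();
        int offset = 0;
        while (offset < s.length()) {
            int codePoint = s.codePointAt(offset);
            result.add(codePoint);
            offset += Character.charCount(codePoint);
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "a\uD83D\uDCB3b"; // 'a', credit card symbol (U+1F4B3), 'b'
        System.out.println(buggy(sample));   // [97, 128179, 56499, 98]
        System.out.println(correct(sample)); // [97, 128179, 98]
    }
}
```

The extra value 56499 in the first result is the lone low surrogate 0xDCB3, read from the middle of the pair.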


