What is a surrogate pair in Java?
The term "surrogate pair" refers to a means of encoding Unicode characters with high code-points in the UTF-16 encoding scheme.
In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.
Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only contain the range of characters from 0x0 to 0xFFFF, some additional complexity is used to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.
The surrogate code units are in two ranges known as "high surrogates" and "low surrogates", depending on whether they are allowed at the start or end of the two-code-unit sequence.
What is a surrogate range and surrogate code in Java?
A surrogate pair is a pair of 16-bit values used in UTF-16 to encode a Unicode code point outside of the BMP (plane 0), i.e. any Unicode code point that is greater than 65535 (0xFFFF).
The surrogate range is the range of 16-bit values that the two values of a pair come from:
- The high value of a surrogate pair comes from the range D800 through DBFF
- The low value of a surrogate pair comes from the range DC00 through DFFF.
For example: the Unicode code point U+10437 is represented in UTF-16 as the surrogate pair D801 DC37.
For more information, read the Wikipedia article on UTF-16.
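As a quick check of the U+10437 example, here is a minimal sketch that asks the JDK for the UTF-16 code units of a code point. The class name SurrogatePairExample and the helper toUtf16Hex are made up for this illustration; Character.toChars is the standard library call doing the work.

```java
class SurrogatePairExample {
    // Hypothetical helper: returns the UTF-16 code units of a code point
    // as space-separated, four-digit hex values.
    static String toUtf16Hex(int codePoint) {
        // One char for a BMP code point, two (a surrogate pair) otherwise.
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUtf16Hex(0x10437)); // D801 DC37
        System.out.println(toUtf16Hex(0x22C0));  // 22C0
    }
}
```

The first call yields a high surrogate followed by a low surrogate; the second yields a single code unit because U+22C0 is in the BMP.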
The two surrogate ranges are described above.
A surrogate code is a code (see note 1 below) in one of the two surrogate ranges.
What is the use of surrogate methods like isSurrogate(), isSurrogatePair(), isLowSurrogate(), isHighSurrogate()?
- isSurrogate() tests if a char is either a low or high surrogate
- isSurrogatePair() tests if a pair of char values is a valid surrogate pair
- isLowSurrogate() tests if a char is a low surrogate value
- isHighSurrogate() tests if a char is a high surrogate value
The use of these methods is self-evident. They are used to test char values when interpreting UTF-16 code units as Unicode code points.
1 - This could be either a code unit or a code point, depending on the context. If you have a sequence of 16-bit code-units that constitute a UTF-16 string, then these are code-units. On the other hand, if you have a sequence of Unicode code-points, then if you were to encounter high and low surrogates in that sequence they would be code points. However the surrogate code-points are not meaningful as text in that context.
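To see the four Character methods in action, here is a small sketch that walks over the code units of a string. The class name SurrogateChecks, the helper classify and the sample string are assumptions chosen for the illustration; the Character methods themselves are standard.

```java
class SurrogateChecks {
    // Hypothetical helper: describes what kind of UTF-16 code unit a char is.
    static String classify(char c) {
        if (Character.isHighSurrogate(c)) {
            return "high surrogate";
        }
        if (Character.isLowSurrogate(c)) {
            return "low surrogate";
        }
        return "not a surrogate";
    }

    public static void main(String[] args) {
        // 'A' followed by U+1F60A written as its surrogate pair.
        String text = "A\uD83D\uDE0A";
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            System.out.printf("index %d: %04X is %s%n", i, (int) c, classify(c));
        }
        // A valid pair is a high surrogate immediately followed by a low one.
        System.out.println(Character.isSurrogatePair(text.charAt(1), text.charAt(2))); // true
        System.out.println(Character.isSurrogatePair(text.charAt(0), text.charAt(1))); // false
        System.out.println(Character.isSurrogate(text.charAt(2)));                     // true
    }
}
```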
How to find the surrogate pair of a symbol in java
Finding the surrogate pair for a symbol in Java is straightforward, with a couple of caveats:
- Not all characters are represented by a surrogate pair, including your example ("⋀"), so always check for this before attempting to get the surrogate pair.
- You need a font that can display the symbols, both in your source code, and in any output produced by your code. I used Monospaced in NetBeans for the code and output shown below.
Here is code to display some basic Unicode information for arbitrary symbols. It gets the code point for a symbol, determines whether it is represented by a surrogate pair, and if so gets the high and low surrogates. The code processes two symbols, one with a surrogate pair (the emoji "😊"), and one without (your "⋀" example).
package surrogates;

public class Surrogates {

    public static void main(String[] args) {
        Surrogates.displaySymbolDetails("😊"); // U+1F60A, a supplementary character
        Surrogates.displaySymbolDetails("⋀");  // U+22C0, a BMP character
    }

    static void displaySymbolDetails(String symbol) {
        // Read the full code point at index 0, even if it spans two chars.
        int cp = symbol.codePointAt(0);
        String name = Character.getName(cp);
        System.out.println(symbol + " has code point " + cp + " (hex " + Integer.toHexString(cp) + ").");
        System.out.println(symbol + " has Unicode name " + name + ".");
        boolean isSupplementary = Character.isSupplementaryCodePoint(cp);
        if (isSupplementary) {
            System.out.println(symbol + " is a supplementary character.");
            char high = Character.highSurrogate(cp);
            char low = Character.lowSurrogate(cp);
            System.out.println(symbol + " has high surrogate: " + (int) high + ".");
            System.out.println(symbol + " has low surrogate: " + (int) low + ".");
        } else {
            System.out.println(symbol + " is in the BMP and therefore is not represented by a surrogate pair.");
        }
    }
}
Here is the output:
😊 has code point 128522 (hex 1f60a).
😊 has Unicode name SMILING FACE WITH SMILING EYES.
😊 is a supplementary character.
😊 has high surrogate: 55357.
😊 has low surrogate: 56842.
⋀ has code point 8896 (hex 22c0).
⋀ has Unicode name N-ARY LOGICAL AND.
⋀ is in the BMP and therefore is not represented by a surrogate pair.
Notes:
- "symbol" can mean multiple things, but I am assuming that in your question you are simply referring to some Unicode character.
- Symbols (i.e. Unicode characters) in the Basic Multilingual Plane (BMP) are not represented by a surrogate pair. All other symbols are in one of the supplementary planes, and are represented by a surrogate pair.
Java - what are characters, code points and surrogates? What difference is there between them?
To represent text in computers, you have to solve two things: first, you have to map symbols to numbers, then, you have to represent a sequence of those numbers with bytes.
A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. At the time of writing, Unicode defined 109384 symbols, which is far more than 2^16 (65536).
Furthermore, ASCII specifies that number sequences are represented one byte per number, while Unicode specifies several possibilities, such as UTF-8, UTF-16, and UTF-32.
When you try to use an encoding which uses fewer bits per character than are needed to represent all possible values (such as UTF-16, which uses 16 bits), you need some workaround.
Thus, surrogates are 16-bit values that, used in pairs, represent symbols that do not fit into a single two-byte value.
Java uses UTF-16 internally to represent text.
In particular, a char (character) is an unsigned two-byte value that contains a single UTF-16 code unit.
If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2
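To make the difference between char values and code points concrete, here is a minimal sketch comparing the two ways of counting a string. The class name CharVersusCodePoint, the two helper methods and the sample string are assumptions for the illustration; String.length, String.codePointCount and String.codePoints are standard (the last one requires Java 8 or later).

```java
class CharVersusCodePoint {
    // Number of UTF-16 code units (char values) in the string.
    static int codeUnitCount(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', U+22C0 (BMP) and U+1F60A (written as a surrogate pair).
        String text = "a\u22C0\uD83D\uDE0A";
        System.out.println(codeUnitCount(text));  // 4
        System.out.println(codePointCount(text)); // 3
        // Prints 61, 22c0 and 1f60a, one per line.
        text.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
    }
}
```

The string holds three symbols but four char values, because the last symbol needs two code units.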
How are surrogate pairs calculated?
Unicode code points are scalar values which range from 0x000000 to 0x10FFFF. Thus they are 21-bit integers, not 17-bit.
Surrogate pairs are a mechanism of the UTF-16 form. This represents the 21-bit scalar values as one or two 16-bit code units.
- Scalar values from 0x000000 to 0x00FFFF are represented as a single 16-bit code unit, from 0x0000 to 0xFFFF.
- Scalar values from 0x00D800 to 0x00DFFF are not characters in Unicode, and so they will never occur in a Unicode character string.
- Scalar values from 0x010000 to 0x10FFFF are represented as two 16-bit code units. The first code unit encodes the upper 11 bits of the scalar value, as a code unit ranging from 0xD800-0xDBFF. There's a bit of trickiness here: the top five of those bits hold a value from 0x01-0x10, which is reduced by one (equivalent to subtracting 0x10000 from the scalar value) so that it fits in four bits, leaving 10 bits to store. The second code unit encodes the lower 10 bits of the scalar value, as a code unit ranging from 0xDC00-0xDFFF.
This is explained in detail, with sample code, in the Unicode consortium's FAQ, UTF-8, UTF-16, UTF-32 & BOM. That FAQ refers to the section of the Unicode Standard which has even more detail.
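The calculation can be sketched by hand as follows. This is one common way to write it, not code from the FAQ: subtract 0x10000 to get a 20-bit value, then split it into two 10-bit halves. The class name SurrogateMath and the methods encode and decode are made up for the illustration.

```java
class SurrogateMath {
    // Splits a supplementary code point into {high surrogate, low surrogate}.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point: " + codePoint);
        }
        int offset = codePoint - 0x10000;              // now a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));  // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF)); // lower 10 bits
        return new char[] {high, low};
    }

    // Reverses encode(): rebuilds the code point from a surrogate pair.
    static int decode(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F60A);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DE0A
        System.out.printf("%X%n", decode(pair[0], pair[1]));            // 1F60A
        // The JDK's own methods give the same result.
        System.out.println(pair[0] == Character.highSurrogate(0x1F60A)
                && pair[1] == Character.lowSurrogate(0x1F60A));         // true
    }
}
```

In real code, Character.highSurrogate, Character.lowSurrogate and Character.toCodePoint already do this arithmetic, so the hand-written version is only useful for understanding it.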
XML marshalling of surrogate pairs
The unwanted "codepoint" in your output is the artifact of a bug in your output code.
Java strings have an interface with a bias towards UTF-16. All offsets and all methods working with the char data type pretend that the string is an array of UTF-16 code units.
The same goes for string escaping like "\uD83D\uDCB3". It does not contain two Unicode code points. Rather it contains two UTF-16 code units that together form a single code point, namely the code point for the credit card symbol.
Your output code mixes code points and UTF-16 code units by accessing code points using codePointAt() but incrementing the offset (variable i) by code units. Thus, the credit card code point is accessed twice: once correctly and the second time incorrectly (with the offset pointing into the middle of the surrogate pair).
The correct code looks like so:
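int offset = 0;
while (offset < xmlString.length()) {
    // Read the whole code point starting at this offset.
    int codePoint = xmlString.codePointAt(offset);
    System.out.print(codePoint);
    System.out.print("|");
    // Advance by 1 or 2 code units, depending on the code point.
    offset += Character.charCount(codePoint);
}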
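To show the effect of the bug, here is a sketch that runs both loops on the credit card string. The buggy loop is a reconstruction based on the description above, not the original code from the question, and the class and method names are made up for the illustration.

```java
class CodePointIteration {
    // Assumed shape of the buggy loop: reads code points, but advances
    // one code unit at a time.
    static String buggy(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            out.append(s.codePointAt(i)).append('|');
        }
        return out.toString();
    }

    // Corrected loop: advances by the number of chars the code point uses.
    static String correct(String s) {
        StringBuilder out = new StringBuilder();
        int offset = 0;
        while (offset < s.length()) {
            int codePoint = s.codePointAt(offset);
            out.append(codePoint).append('|');
            offset += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String creditCard = "\uD83D\uDCB3"; // U+1F4B3 as a surrogate pair
        System.out.println(buggy(creditCard));   // 128179|56499|
        System.out.println(correct(creditCard)); // 128179|
    }
}
```

The extra value 56499 (0xDCB3) in the buggy output is the low surrogate, returned on its own because the offset points into the middle of the pair.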