JavaScript Strings Outside of the Bmp

JavaScript strings outside of the BMP

Depends what you mean by ‘support’. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can.

But, each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: all the standard String members (length, split, slice etc) all deal with code units not characters, so will quite happily split surrogate pairs or hold invalid surrogate sequences.

If you want surrogate-aware methods, I'm afraid you're going to have to start writing them yourself! For example:

String.prototype.getCodePointLength= function() {
return this.length-this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length+1;
};

String.fromCodePoint= function() {
var chars= Array.prototype.slice.call(arguments);
for (var i= chars.length; i-->0;) {
var n = chars[i]-0x10000;
if (n>=0)
chars.splice(i, 1, 0xD800+(n>>10), 0xDC00+(n&0x3FF));
}
return String.fromCharCode.apply(null, chars);
};

How to escape a character out of Basic Multilingual Plane?

You can use a pair of escaped surrogate code points, as described in @duskwuff’s answer. You can use my Full Unicode input utility to get the notations (button “Show \u”), or use the Fileformat.info character search to find them out (item “C/C++/Java source code”, because JavaScript uses the same notation here).

Alternatively, you can enter the characters directly: “You can enter non-BMP characters as such into string literals in your JavaScript code,whether in a separate file or as embedded in HTML. Naturally, you need suitable Unicode support in the editor you use. But JavaScript implementations need not support non-BMP characters in program source. They may, and modern browser implementations generally do.” (Going Global with JavaScript and Globalize.js, p. 177) There are some caveats like properly declaring the character encoding.

Font support is a different issue, but when working with characters, you generally want to see them at some point anyway, at least in testing. So you more or less need some font(s) that cover the characters. The Fileformat.info pages also contain links to browser support info, such as (U+20000) Font Support – a good starting point, though not quite complete. For example, U+20000 '' is also supported in SimSun-ExtB

Split JavaScript string into array of codepoints? (taking into account surrogate pairs but not grapheme clusters)

@bobince's answer has (luckily) become a bit dated; you can now simply use

var chars = Array.from( text )

to obtain a list of single-codepoint strings which does respect astral / 32bit / surrogate Unicode characters.

Get last character of string in current modern Javascript, allowing for Astral characters such as Emoji that use surrogate pairs (two code units)

Spreading will dissect a string into its code points

[...'a'].pop()

How can I tell if a string contains multibyte characters in Javascript?

JavaScript strings are UCS-2 encoded but can represent Unicode code points outside the Basic Multilingual Pane (U+0000 - U+D7FF and U+E000 - U+FFFF) using two 16 bit numbers (a UTF-16 surrogate pair), the first of which must be in the range U+D800 - U+DFFF.

Based on this, it's easy to detect whether a string contains any characters that lie outside the Basic Multilingual Plane (which is what I think you're asking: you want to be able to identify whether a string contains any characters that lie outside the range of code points that JavaScript represents as a single character):

function containsSurrogatePair(str) {
return /[\uD800-\uDFFF]/.test(str);
}

alert( containsSurrogatePair("foo") ); // false
alert( containsSurrogatePair("f) ); // true

Working out precisely which code points are contained in your string is a little harder and requires a UTF-16 decoder. The following will convert a string into an array of Unicode code points:

var getStringCodePoints = (function() {
function surrogatePairToCodePoint(charCode1, charCode2) {
return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
}

// Read string in character by character and create an array of code points
return function(str) {
var codePoints = [], i = 0, charCode;
while (i < str.length) {
charCode = str.charCodeAt(i);
if ((charCode & 0xF800) == 0xD800) {
codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));
} else {
codePoints.push(charCode);
}
++i;
}
return codePoints;
}
})();

alert( getStringCodePoints("f).join(",") ); // 102,119558

getting a string length that contains unicode character exceeding 0xffff

To sumarize my comments:

That's just the lenght of that string.

Some chars involve other chars as well, even if it looks like a single character. "̉mủt̉ả̉̉̉t̉ẻd̉W̉ỏ̉r̉̉d̉̉".length == 24

From this (great) blog post, they have a function that will return correct length:

function fancyCount(str){  const joiner = "\u{200D}";  const split = str.split(joiner);  let count = 0;      for(const s of split){    //removing the variation selectors    const num = Array.from(s.split(/[\ufe00-\ufe0f]/).join("")).length;    count += num;  }      //assuming the joiners are used appropriately  return count / split.length;}
console.log(fancyCount("F) == 2) // true

How to use five digit long Unicode characters in JavaScript

In the MDN documentation for fromCharCode, they note that javascript will only naturally handle characters up to 0xFFFF. However, they also have an implementation of a fixed method for fromCharCode that may do what you want (reproduced below):

function fixedFromCharCode (codePt) {
if (codePt > 0xFFFF) {
codePt -= 0x10000;
return String.fromCharCode(0xD800 + (codePt >> 10), 0xDC00 + (codePt & 0x3FF));
}
else {
return String.fromCharCode(codePt);
}
}

foo = fixedFromCharCode(0x1D15D);

How can I process each letter of text using Javascript?

If the order of alerts matters, use this:

for (var i = 0; i < str.length; i++) {
alert(str.charAt(i));
}

Or this: (see also this answer)

 for (var i = 0; i < str.length; i++) {
alert(str[i]);
}

If the order of alerts doesn't matter, use this:

var i = str.length;
while (i--) {
alert(str.charAt(i));
}

Or this: (see also this answer)

 var i = str.length;
while (i--) {
alert(str[i]);
}