JavaScript strings outside of the BMP
Depends what you mean by ‘support’. You can certainly put non-UCS-2 characters in a JS string using surrogates, and browsers will display them if they can.
But each item in a JS string is a separate UTF-16 code unit. There is no language-level support for handling full characters: the standard String members (length, split, slice, etc.) all deal with code units, not characters, so they will quite happily split surrogate pairs or hold invalid surrogate sequences.
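To make the code-unit behaviour concrete, here is a small sketch. The sample string and variable names are my own and not part of the answer; U+1D306 is used because it needs a surrogate pair.

```javascript
// "a" + U+1D306 (written as its surrogate pair) + "b"
const sample = "a\uD834\uDF06b";

// length counts UTF-16 code units, so the astral character counts twice
const unitCount = sample.length; // 4

// split("") cuts between code units and separates the two surrogates
const units = sample.split(""); // ["a", "\uD834", "\uDF06", "b"]

// slice can leave a lone high surrogate at the end of the result
const broken = sample.slice(0, 2); // "a\uD834"

console.log(unitCount, units.length, broken.length); // 4 4 2
```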
If you want surrogate-aware methods, I'm afraid you're going to have to start writing them yourself! For example:
String.prototype.getCodePointLength= function() {
return this.length-this.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length+1;
};
String.fromCodePoint= function() {
var chars= Array.prototype.slice.call(arguments);
for (var i= chars.length; i-->0;) {
var n = chars[i]-0x10000;
if (n>=0)
chars.splice(i, 1, 0xD800+(n>>10), 0xDC00+(n&0x3FF));
}
return String.fromCharCode.apply(null, chars);
};
How to escape a character out of Basic Multilingual Plane?
You can use a pair of escaped surrogate code points, as described in @duskwuff’s answer. You can use my Full Unicode input utility to get the notations (button “Show \u”), or use the Fileformat.info character search to find them out (item “C/C++/Java source code”, because JavaScript uses the same notation here).
Alternatively, you can enter the characters directly: “You can enter non-BMP characters as such into string literals in your JavaScript code, whether in a separate file or as embedded in HTML. Naturally, you need suitable Unicode support in the editor you use. But JavaScript implementations need not support non-BMP characters in program source. They may, and modern browser implementations generally do.” (Going Global with JavaScript and Globalize.js, p. 177) There are some caveats like properly declaring the character encoding.
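As a rough illustration of the escape notations, the sketch below writes U+1D306 three ways and checks that they produce the same string. The variable names are mine, and the second and third forms assume an ES2015 or later engine, which the answer above does not mention.

```javascript
// U+1D306 written as an escaped surrogate pair (works in older engines too)
const viaSurrogates = "\uD834\uDF06";

// ES2015 code point escape (assumption: ES2015+ engine)
const viaCodePointEscape = "\u{1D306}";

// Built from the numeric code point at run time (also ES2015+)
const viaFromCodePoint = String.fromCodePoint(0x1D306);

console.log(viaSurrogates === viaCodePointEscape); // true
console.log(viaSurrogates === viaFromCodePoint);   // true
```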
Font support is a different issue, but when working with characters, you generally want to see them at some point anyway, at least in testing. So you more or less need some font(s) that cover the characters. The Fileformat.info pages also contain links to browser support info, such as (U+20000) Font Support – a good starting point, though not quite complete. For example, U+20000 is also supported in SimSun-ExtB.
Split JavaScript string into array of codepoints? (taking into account surrogate pairs but not grapheme clusters)
@bobince's answer has (luckily) become a bit dated; you can now simply use
var chars = Array.from( text )
to obtain a list of single-code-point strings that respects astral (non-BMP, surrogate-pair) Unicode characters.
Get last character of string in current modern Javascript, allowing for Astral characters such as Emoji that use surrogate pairs (two code units)
Spreading will dissect a string into its code points
[...'a'].pop()
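A slightly fuller sketch of the same idea follows. The helper name lastCodePoint and the sample string are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: returns the last code point of a string as a string
function lastCodePoint(str) {
  // Spreading uses the string iterator, which walks code points, not code units
  return [...str].pop();
}

const text = "hi\u{1F600}"; // "hi" followed by U+1F600

console.log(lastCodePoint(text) === "\u{1F600}"); // true
console.log(text[text.length - 1] === "\uDE00");  // true: indexing yields only the low surrogate
```

For an empty string the helper returns undefined, because pop() on an empty array does.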
How can I tell if a string contains multibyte characters in Javascript?
JavaScript strings are sequences of 16-bit code units (often loosely described as UCS-2), but they can represent Unicode code points outside the Basic Multilingual Plane (U+0000 - U+D7FF and U+E000 - U+FFFF) using two 16-bit numbers (a UTF-16 surrogate pair), the first of which must be in the range U+D800 - U+DBFF and the second in the range U+DC00 - U+DFFF.
Based on this, it's easy to detect whether a string contains any characters that lie outside the Basic Multilingual Plane (which is what I think you're asking: you want to be able to identify whether a string contains any characters that lie outside the range of code points that JavaScript represents as a single character):
function containsSurrogatePair(str) {
return /[\uD800-\uDFFF]/.test(str);
}
alert( containsSurrogatePair("foo") ); // false
alert( containsSurrogatePair("f\uD834\uDF06") ); // true ("f" followed by U+1D306)
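Note that the pattern above matches any surrogate code unit, so it also returns true for an unpaired surrogate. If only complete astral code points should count, one possible variant uses the u flag. This is a sketch that assumes an ES2015 or later engine, and the function name is hypothetical.

```javascript
// Hypothetical variant: true only when the string holds a complete astral code point
function containsAstralCodePoint(str) {
  // With the u flag the pattern is matched per code point, so a lone
  // surrogate is treated as a code point in D800-DFFF and does not match
  return /[\u{10000}-\u{10FFFF}]/u.test(str);
}

console.log(containsAstralCodePoint("foo"));           // false
console.log(containsAstralCodePoint("f\uD834\uDF06")); // true
console.log(containsAstralCodePoint("f\uD834"));       // false: unpaired surrogate
```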
Working out precisely which code points are contained in your string is a little harder and requires a UTF-16 decoder. The following will convert a string into an array of Unicode code points:
var getStringCodePoints = (function() {
function surrogatePairToCodePoint(charCode1, charCode2) {
return ((charCode1 & 0x3FF) << 10) + (charCode2 & 0x3FF) + 0x10000;
}
// Read string in character by character and create an array of code points
return function(str) {
var codePoints = [], i = 0, charCode;
while (i < str.length) {
charCode = str.charCodeAt(i);
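// Decode only a high surrogate that is followed by a low surrogate;
// unpaired surrogates fall through and are pushed as-is
if ((charCode & 0xFC00) == 0xD800 && (str.charCodeAt(i + 1) & 0xFC00) == 0xDC00) {
codePoints.push(surrogatePairToCodePoint(charCode, str.charCodeAt(++i)));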
} else {
codePoints.push(charCode);
}
++i;
}
return codePoints;
}
})();
alert( getStringCodePoints("f\uD834\uDF06").join(",") ); // 102,119558
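In engines that support ES2015, the same array can probably be obtained without a hand-written decoder. The following is a sketch under that assumption; toCodePoints is a hypothetical name and not part of the answer.

```javascript
// Hypothetical modern counterpart of getStringCodePoints
function toCodePoints(str) {
  // Array.from iterates by code point; codePointAt(0) decodes each piece
  return Array.from(str, (ch) => ch.codePointAt(0));
}

console.log(toCodePoints("f\uD834\uDF06").join(",")); // 102,119558
```

An unpaired surrogate comes back as its own code unit value rather than being combined with a neighbour.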
getting a string length that contains unicode character exceeding 0xffff
To summarize my comments:
That's just the length of that string.
Some chars involve other chars as well, even if it looks like a single character. "̉mủt̉ả̉̉̉t̉ẻd̉W̉ỏ̉r̉̉d̉̉".length == 24
From this (great) blog post, they have a function that will return correct length:
function fancyCount(str){
  const joiner = "\u{200D}";
  const split = str.split(joiner);
  let count = 0;
  for(const s of split){
    //removing the variation selectors
    const num = Array.from(s.split(/[\ufe00-\ufe0f]/).join("")).length;
    count += num;
  }
  //assuming the joiners are used appropriately
  return count / split.length;
}
console.log(fancyCount("F\u{1F600}") == 2) // true ("F" followed by one astral code point)
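A different technique from the blog post's function is to count grapheme clusters with Intl.Segmenter. This is only a sketch: it assumes the runtime provides Intl.Segmenter (recent browsers and Node.js versions do), and graphemeCount is a hypothetical name.

```javascript
// Hypothetical helper: counts user-perceived characters (grapheme clusters).
// Assumes the runtime provides Intl.Segmenter.
function graphemeCount(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  // segment() returns an iterable with one entry per grapheme cluster
  return [...segmenter.segment(str)].length;
}

console.log(graphemeCount("F\u{1F600}")); // 2
console.log(graphemeCount("e\u0301"));    // 1: base letter plus combining accent
```

Unlike fancyCount, this does not rely on joiners being "used appropriately", but the exact segmentation depends on the Unicode data shipped with the engine.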
How to use five digit long Unicode characters in JavaScript
The MDN documentation for String.fromCharCode notes that the method only handles values up to 0xFFFF directly. However, it also gives an implementation of a fixed version of fromCharCode that may do what you want (reproduced below):
function fixedFromCharCode (codePt) {
if (codePt > 0xFFFF) {
codePt -= 0x10000;
return String.fromCharCode(0xD800 + (codePt >> 10), 0xDC00 + (codePt & 0x3FF));
}
else {
return String.fromCharCode(codePt);
}
}
var foo = fixedFromCharCode(0x1D15D); // "\uD834\uDD5D"
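In ES2015 and later engines, the built-in String.fromCodePoint should give the same result without a helper. A short sketch, assuming such an engine (the variable name is mine):

```javascript
// Native ES2015 method; accepts code points up to 0x10FFFF
const note = String.fromCodePoint(0x1D15D);

console.log(note.length);                     // 2: stored as a surrogate pair
console.log(note === "\uD834\uDD5D");         // true
console.log(note.codePointAt(0) === 0x1D15D); // true
```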
How can I process each letter of text using Javascript?
If the order of alerts matters, use this:
for (var i = 0; i < str.length; i++) {
alert(str.charAt(i));
}
Or this: (see also this answer)
for (var i = 0; i < str.length; i++) {
alert(str[i]);
}
If the order of alerts doesn't matter, use this:
var i = str.length;
while (i--) {
alert(str.charAt(i));
}
Or this: (see also this answer)
var i = str.length;
while (i--) {
alert(str[i]);
}
var str = 'This is my string';
function matters() {
for (var i = 0; i < str.length; i++) {
alert(str.charAt(i));
}
}
function dontmatter() {
var i = str.length;
while (i--) {
alert(str.charAt(i));
}
}
<p>If the order of alerts matters, use <a href="#" onclick="matters()">this</a>.</p>
<p>If the order of alerts doesn't matter, use <a href="#" onclick="dontmatter()">this</a>.</p>
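The loops above step through code units, so a character outside the BMP is delivered as two separate halves. If that matters, a for...of loop is one option, since the string iterator yields whole code points. A minimal sketch, assuming an ES2015 or later engine; eachCodePoint is a hypothetical helper name.

```javascript
// Hypothetical helper: calls visit once per code point, in order
function eachCodePoint(str, visit) {
  // for...of uses the string iterator, so surrogate pairs stay together
  for (const ch of str) {
    visit(ch);
  }
}

const seen = [];
eachCodePoint("a\u{1F600}b", (ch) => seen.push(ch));
console.log(seen.length); // 3, whereas "a\u{1F600}b".length is 4
```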