How Many Bytes in a JavaScript String

How many bytes in a JavaScript string?

You can use a Blob to get the size of a string in bytes. A Blob stores strings as UTF-8, so its size is the UTF-8 byte count.

Examples:

console.info(
  new Blob(['😂']).size,                             // 4
  new Blob(['👍']).size,                             // 4
  new Blob(['😂👍']).size,                           // 8
  new Blob(['👍😂']).size,                           // 8
  new Blob(['I\'m a string']).size,                  // 12

  // from Premasagar correction of Lauri's answer for
  // strings containing lone characters in the surrogate pair range:
  // https://stackoverflow.com/a/39488643/6225838
  new Blob([String.fromCharCode(55555)]).size,       // 3
  new Blob([String.fromCharCode(55555, 57000)]).size // 4 (not 6)
);

String length in bytes in JavaScript

There is no way to do it in JavaScript natively. (See Riccardo Galli's answer for a modern approach.)
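For reference, a modern approach typically relies on the TextEncoder API, which always encodes to UTF-8. The following is only a minimal sketch under that assumption; the function name utf8ByteLength is made up for illustration and is not taken from the answer mentioned above.

```javascript
// Hypothetical helper: counts the bytes a string occupies when encoded as UTF-8.
// Assumes a global TextEncoder (modern browsers and current Node.js versions).
function utf8ByteLength(str) {
    return new TextEncoder().encode(str).length;
}

console.log(utf8ByteLength("I'm a string")); // 12
console.log(utf8ByteLength("\u20AC"));       // 3 (the euro sign)
```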


What follows is kept for historical reference, and for environments where the TextEncoder API is still unavailable.

If you know the character encoding, you can calculate it yourself though.

encodeURIComponent always uses UTF-8 as the character encoding, so if that is the encoding you need, you can do:

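function lengthInUtf8Bytes(str) {
  // Count code points rather than UTF-16 code units, so that a surrogate
  // pair (one character above U+FFFF) is counted once, not twice.
  var codePoints = str.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, '_').length;
  // Matches only the 10.. bytes that are non-initial characters in a multi-byte sequence.
  // Note: encodeURIComponent throws a URIError on lone surrogates.
  var m = encodeURIComponent(str).match(/%[89ABab]/g);
  return codePoints + (m ? m.length : 0);
}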

This should work because of the way UTF-8 encodes multi-byte sequences. The first encoded byte always starts with either a high bit of zero for a single byte sequence, or a byte whose first hex digit is C, D, E, or F. The second and subsequent bytes are the ones whose first two bits are 10. Those are the extra bytes you want to count in UTF-8.

The table on Wikipedia makes this clearer:

Bits  Last code point  Byte 1    Byte 2    Byte 3
7     U+007F           0xxxxxxx
11    U+07FF           110xxxxx  10xxxxxx
16    U+FFFF           1110xxxx  10xxxxxx  10xxxxxx
...
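The ranges in the table can also be applied directly, one code point at a time. The sketch below is an illustration only: the name utf8LengthFromCodePoints is hypothetical, and the 4-byte case for code points above U+FFFF comes from the rows the table elides. A lone surrogate falls in the 3-byte range here, which matches what UTF-8 encoders emit when they substitute U+FFFD.

```javascript
// Hypothetical helper: sums the UTF-8 length of each code point using the
// code point ranges of the UTF-8 encoding table.
function utf8LengthFromCodePoints(str) {
    let bytes = 0;
    // for...of iterates by code point, so a surrogate pair is visited once.
    for (const ch of str) {
        const cp = ch.codePointAt(0);
        if (cp <= 0x7F) bytes += 1;        // 0xxxxxxx
        else if (cp <= 0x7FF) bytes += 2;  // 110xxxxx 10xxxxxx
        else if (cp <= 0xFFFF) bytes += 3; // 1110xxxx 10xxxxxx 10xxxxxx
        else bytes += 4;                   // code points above U+FFFF
    }
    return bytes;
}

console.log(utf8LengthFromCodePoints("a"));         // 1
console.log(utf8LengthFromCodePoints("\u00E9"));    // 2 (e with acute accent)
console.log(utf8LengthFromCodePoints("\u20AC"));    // 3 (euro sign)
console.log(utf8LengthFromCodePoints("\u{1F602}")); // 4 (an emoji)
```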

If you instead need the length in the page's own encoding, you can use this trick:

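function lengthInPageEncoding(s) {
  var a = document.createElement('A');
  a.href = '#' + s;
  var sEncoded = a.href;
  sEncoded = sEncoded.substring(sEncoded.indexOf('#') + 1);
  // Percent-escapes are normally written with uppercase hex digits,
  // so match case-insensitively.
  var m = sEncoded.match(/%[0-9a-f]{2}/gi);
  // Each %XX escape is three characters but stands for a single byte.
  return sEncoded.length - (m ? m.length * 2 : 0);
}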

How can I measure a size of javascript string?

In Node.js, the easiest way is to use Buffer's helper function:

Buffer.byteLength(str)
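As a short illustration, Buffer.byteLength also accepts an optional encoding as its second argument, which defaults to 'utf8'. The sample strings below are arbitrary and chosen only to show how the count varies with the encoding.

```javascript
// Runs in Node.js, where Buffer is a global.
const samples = ["abc", "\u00E9", "\u20AC", "\u{1F602}"];

for (const s of samples) {
    // First number: UTF-8 byte length. Second number: UTF-16LE byte length.
    console.log(JSON.stringify(s), Buffer.byteLength(s), Buffer.byteLength(s, "utf16le"));
}
```

In UTF-16LE every UTF-16 code unit takes two bytes, so that count is always twice the string's length property.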

How to split a string into chunks of a particular byte size?

Using Buffer does indeed seem to be the right direction. Given that:

  • Buffer prototype has indexOf and lastIndexOf methods, and
  • 32 is the ASCII code of a space, and
  • 32 can never occur as part of a multi-byte character since all the bytes that make up a multi-byte sequence always have the most significant bit set.

... you can proceed as follows:

function chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    const result = [];
    while (buf.length) {
        // If the remainder fits, take it whole; otherwise find the last space
        // that keeps the chunk within maxBytes (a space at index maxBytes
        // yields a chunk of exactly maxBytes bytes)
        let i = buf.length <= maxBytes ? buf.length : buf.lastIndexOf(32, maxBytes);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take the whole string
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        result.push(buf.slice(0, i).toString());
        buf = buf.slice(i+1); // Skip space (if any)
    }
    return result;
}

console.log(chunk("Hey there! € 100 to pay", 12));
// -> [ 'Hey there!', '€ 100 to', 'pay' ]
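The approach relies on the byte 32 never appearing inside a multi-byte UTF-8 sequence. A quick way to convince yourself is to encode a few non-ASCII characters and check the most significant bit of every byte. This is only a spot check on hand-picked characters, not a proof, and the helper name allBytesHaveHighBit is hypothetical.

```javascript
const encoder = new TextEncoder();

// Hypothetical helper: true if every UTF-8 byte of `ch` has its top bit set,
// which means none of those bytes can equal 32 (0x20, the ASCII space).
function allBytesHaveHighBit(ch) {
    return encoder.encode(ch).every((byte) => (byte & 0x80) !== 0);
}

console.log(allBytesHaveHighBit("\u00E9"));    // true  (2-byte sequence)
console.log(allBytesHaveHighBit("\u20AC"));    // true  (3-byte sequence)
console.log(allBytesHaveHighBit("\u{1F602}")); // true  (4-byte sequence)
console.log(allBytesHaveHighBit(" "));         // false (plain ASCII space)
```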

You can consider extending this to also look for TAB, LF, or CR as split-characters. If so, and your input text can have CRLF sequences, you would need to detect those as well to avoid getting orphaned CR or LF characters in the chunks.
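One possible shape for that extension is sketched here. The names BREAK_BYTES, isBreak and findBreak are hypothetical, and treating a whole run of whitespace bytes as a single break is an assumption of this sketch; it is what keeps a CRLF pair from leaving an orphaned CR or LF behind.

```javascript
// ASCII codes for space, TAB, LF and CR.
const BREAK_BYTES = [32, 9, 10, 13];
const isBreak = (byte) => BREAK_BYTES.includes(byte);

// Looks backwards from index `from` for a break byte. Returns null if there
// is none, otherwise { cut, resume }: the chunk ends at `cut` (exclusive) and
// the next chunk starts at `resume`, dropping the whitespace run in between.
function findBreak(buf, from) {
    let i = Math.min(from, buf.length - 1);
    while (i >= 0 && !isBreak(buf[i])) i--;
    if (i < 0) return null;
    let cut = i;
    while (cut > 0 && isBreak(buf[cut - 1])) cut--;                   // back to start of run
    let resume = i + 1;
    while (resume < buf.length && isBreak(buf[resume])) resume++;     // forward to end of run
    return { cut, resume };
}

const sample = Buffer.from("ab\r\ncd\te");
console.log(findBreak(sample, 5)); // { cut: 2, resume: 4 }
console.log(findBreak(sample, 1)); // null
```

Inside the splitting loop, cut would take the place of the index returned by lastIndexOf, and resume would replace the i+1 used to skip the space.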

You can turn the above function into a generator, so that you control when you want to start the processing for getting the next chunk:

function * chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    while (buf.length) {
        // If the remainder fits, take it whole; otherwise find the last space
        // that keeps the chunk within maxBytes
        let i = buf.length <= maxBytes ? buf.length : buf.lastIndexOf(32, maxBytes);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield buf.slice(0, i).toString();
        buf = buf.slice(i+1); // Skip space (if any)
    }
}

for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);

Browsers

Buffer is specific to Node. Browsers however implement TextEncoder and TextDecoder, which leads to similar code:

function * chunk(s, maxBytes) {
    const decoder = new TextDecoder("utf-8");
    let buf = new TextEncoder().encode(s); // TextEncoder always produces UTF-8
    while (buf.length) {
        // If the remainder fits, take it whole; otherwise find the last space
        // that keeps the chunk within maxBytes
        let i = buf.length <= maxBytes ? buf.length : buf.lastIndexOf(32, maxBytes);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield decoder.decode(buf.slice(0, i));
        buf = buf.slice(i+1); // Skip space (if any)
    }
}

for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);

Why does a string of length 3 have 3 as its byte length?

Node's Buffer class uses 'utf8' as its default encoding. Each ASCII character takes 1 byte in UTF-8, so a string of three ASCII characters has a byte length of 3. Note: UTF-8 uses between 1 and 4 bytes per character.


