Using JavaScript's Atob to Decode Base64 Doesn't Properly Decode Utf-8 Strings

Using Javascript's atob to decode base64 doesn't properly decode utf-8 strings

The Unicode Problem

Though JavaScript (ECMAScript) has matured, the fragility of Base64, ASCII, and Unicode encoding has caused a lot of headache (much of it is in this question's history).

Consider the following example:

const ok = "a";
console.log(ok.codePointAt(0).toString(16)); // 61: occupies < 1 byte

const notOK = "✓"
console.log(notOK.codePointAt(0).toString(16)); // 2713: occupies > 1 byte

console.log(btoa(ok)); // YQ==
console.log(btoa(notOK)); // error

Why do we encounter this?

Base64, by design, expects binary data as its input. In terms of JavaScript strings, this means strings in which each character occupies only one byte. So if you pass a string into btoa() containing characters that occupy more than one byte, you will get an error, because this is not considered binary data.

Source: MDN (2021)

The original MDN article also covered the broken nature of window.btoa and .atob, which have since been mended in modern ECMAScript. The original, now-dead MDN article explained:

The "Unicode Problem"
Since DOMStrings are 16-bit-encoded strings, in most browsers calling window.btoa on a Unicode string will cause a Character Out Of Range exception if a character exceeds the range of a 8-bit byte (0x00~0xFF).



Solution with binary interoperability

(Keep scrolling for the ASCII base64 solution)

Source: MDN (2021)

The solution recommended by MDN is to actually encode to and from a binary string representation:

Encoding UTF8 ⇢ binary

// convert a Unicode string to a string in which
// each 16-bit unit occupies only one byte
function toBinary(string) {
const codeUnits = new Uint16Array(string.length);
for (let i = 0; i < codeUnits.length; i++) {
codeUnits[i] = string.charCodeAt(i);
}
return btoa(String.fromCharCode(...new Uint8Array(codeUnits.buffer)));
}

// a string that contains characters occupying > 1 byte
let encoded = toBinary("✓ à la mode") // "EycgAOAAIABsAGEAIABtAG8AZABlAA=="

Decoding binary ⇢ UTF-8

function fromBinary(encoded) {
const binary = atob(encoded);
const bytes = new Uint8Array(binary.length);
for (let i = 0; i < bytes.length; i++) {
bytes[i] = binary.charCodeAt(i);
}
return String.fromCharCode(...new Uint16Array(bytes.buffer));
}

// our previous Base64-encoded string
let decoded = fromBinary(encoded) // "✓ à la mode"

Where this fails a little, is that you'll notice the encoded string EycgAOAAIABsAGEAIABtAG8AZABlAA== no longer matches the previous solution's string 4pyTIMOgIGxhIG1vZGU=. This is because it is a binary encoded string, not a UTF-8 encoded string. If this doesn't matter to you (i.e., you aren't converting strings represented in UTF-8 from another system), then you're good to go. If, however, you want to preserve the UTF-8 functionality, you're better off using the solution described below.



Solution with ASCII base64 interoperability

The entire history of this question shows just how many different ways we've had to work around broken encoding systems over the years. Though the original MDN article no longer exists, this solution is still arguably a better one, and does a great job of solving "The Unicode Problem" while maintaining plain text base64 strings that you can decode on, say, base64decode.org.

There are two possible methods to solve this problem:

  • the first one is to escape the whole string (with UTF-8, see encodeURIComponent) and then encode it;
  • the second one is to convert the UTF-16 DOMString to an UTF-8 array of characters and then encode it.

A note on previous solutions: the MDN article originally suggested using unescape and escape to solve the Character Out Of Range exception problem, but they have since been deprecated. Some other answers here have suggested working around this with decodeURIComponent and encodeURIComponent, this has proven to be unreliable and unpredictable. The most recent update to this answer uses modern JavaScript functions to improve speed and modernize code.

If you're trying to save yourself some time, you could also consider using a library:

  • js-base64 (NPM, great for Node.js)
  • base64-js

Encoding UTF8 ⇢ base64

    function b64EncodeUnicode(str) {
// first we use encodeURIComponent to get percent-encoded UTF-8,
// then we convert the percent encodings into raw bytes which
// can be fed into btoa.
return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g,
function toSolidBytes(match, p1) {
return String.fromCharCode('0x' + p1);
}));
}

b64EncodeUnicode('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64EncodeUnicode('\n'); // "Cg=="

Decoding base64 ⇢ UTF8

    function b64DecodeUnicode(str) {
// Going backwards: from bytestream, to percent-encoding, to original string.
return decodeURIComponent(atob(str).split('').map(function(c) {
return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2);
}).join(''));
}

b64DecodeUnicode('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"
b64DecodeUnicode('Cg=='); // "\n"

(Why do we need to do this? ('00' + c.charCodeAt(0).toString(16)).slice(-2) prepends a 0 to single character strings, for example when c == \n, the c.charCodeAt(0).toString(16) returns a, forcing a to be represented as 0a).



TypeScript support

Here's same solution with some additional TypeScript compatibility (via @MA-Maddin):

// Encoding UTF8 ⇢ base64

function b64EncodeUnicode(str) {
return btoa(encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function(match, p1) {
return String.fromCharCode(parseInt(p1, 16))
}))
}

// Decoding base64 ⇢ UTF8

function b64DecodeUnicode(str) {
return decodeURIComponent(Array.prototype.map.call(atob(str), function(c) {
return '%' + ('00' + c.charCodeAt(0).toString(16)).slice(-2)
}).join(''))
}


The first solution (deprecated)

This used escape and unescape (which are now deprecated, though this still works in all modern browsers):

function utf8_to_b64( str ) {
return window.btoa(unescape(encodeURIComponent( str )));
}

function b64_to_utf8( str ) {
return decodeURIComponent(escape(window.atob( str )));
}

// Usage:
utf8_to_b64('✓ à la mode'); // "4pyTIMOgIGxhIG1vZGU="
b64_to_utf8('4pyTIMOgIGxhIG1vZGU='); // "✓ à la mode"

And one last thing: I first encountered this problem when calling the GitHub API. To get this to work on (Mobile) Safari properly, I actually had to strip all white space from the base64 source before I could even decode the source. Whether or not this is still relevant in 2021, I don't know:

function b64_to_utf8( str ) {
str = str.replace(/\s/g, '');
return decodeURIComponent(escape(window.atob( str )));
}

JavaScript atob is different than Notepad++ Base64 Decode

The window.atob is only good for decoding data which fits in a UTF-8 string. Anything which cannot be represented in UTF-8 string will not be equal to its binary form when decoded. Javascript, at most, will try encoding the resultant bytes in to UTF-8 character sequence. This is the reason why your zip archive is rendered invalid eventually.

The moment you do the following:

var data = window.atob(encoded_data)

... you are having a different representation of your data in a UTF-8 string which is referenced by the variable data.

You should decode your binary data directly to an ArrayBuffer. And window.atob is not a good fit for this.

Here is a function which can convert base64 encoded data directly in to an ArrayBuffer.

UTF-8 to readable characters

Try

let decodedString = decodeURIComponent(escape(window.atob(yourString)))

In Javascript, how do I decode a string in which decoded string contains binary (e.g. non UTF-8) data?

Last edit: More publicity should made of this sentence from the 2015 related SO answer

The original solution (deprecated)

When mentioning atob, btoa, base64js and other libraries functions: save your time with a library and deprecate atob and btoa

Try base64.js.toByteArray with your data. Follow the link to the library Github Readme: it says it does exactly what you want, well, only that from and to.

Tested @1.3.0 succesfully with your data base64js.min.js on Chrome 66.0.3359.181

Tested @1.3.0 succesfully with your data in Nodejs @4.4.7

quote from the README:

base64js has three exposed functions, byteLength, toByteArray and fromByteArray, which both take a single argument.

  1. byteLength - Takes a base64 string and returns length of byte array
  2. toByteArray - Takes a base64 string and returns a byte array
  3. fromByteArray - Takes a byte array and returns a base64 string

I would also test original binary data against whatever file type/structure it is supposed to be/have, if you can access it.

Did you test the base64 string against base64, at first, is it a proper base64 string?

To be fair a code Example link and more information in the Github repo, when included as script in the client, use:

var byteArray = base64js.toByteArray(myBase64Str)

In Node, you would do:

var fs = require('fs')
var base64js = require('base64-js')
fs.readFile('pathToMyBase64File', 'utf8', (err, data) => { // here 'utf8' is mandatory
if (err) throw err;
var byteArray = base64js.toByteArray(data)
console.log(base64js.byteLength(data) == byteArray.length)
})

Important, myBase64File should be encoded with utf-8 w/BOM or it won´t work.

P.S. : I am not any contributor to this library, just a good tool for everybody to know, LICENSE MIT.

Error using 'atob' command - Failed to execute 'atob' on 'Window': The string to be decoded is not correctly encoded

I found the solution. The Api was returning the base64 string with a the character '\'. So I removed all of then, and it works just fine.

Losing line break - Base64 to UTF-8

Issue was that after decoding base64 to UTF-8, Line Feed chars (ASCII 10) are transformed into Carriage Return chars (ASCII 13).

I don't know why, maybe I am missing something in base64 decode.

How can you encode a string to Base64 in JavaScript?

You can use btoa() and atob() to convert to and from base64 encoding.

There appears to be some confusion in the comments regarding what these functions accept/return, so…

  • btoa() accepts a “string” where each character represents an 8-bit byte – if you pass a string containing characters that can’t be represented in 8 bits, it will probably break. This isn’t a problem if you’re actually treating the string as a byte array, but if you’re trying to do something else then you’ll have to encode it first.

  • atob() returns a “string” where each character represents an 8-bit byte – that is, its value will be between 0 and 0xff. This does not mean it’s ASCII – presumably if you’re using this function at all, you expect to be working with binary data and not text.

See also:

  • How do I load binary image data using Javascript and XMLHttpRequest?

Most comments here are outdated. You can probably use both btoa() and atob(), unless you support really outdated browsers.

Check here:

  • https://caniuse.com/?search=atob
  • https://caniuse.com/?search=btoa

Javascript - code and decode from base64 using atob() and btoa() functions

Just do it in the proper order ;)

    const stringA = 'I am Enrico in Asci';    const stringB = btoa(stringA);    const stringA1 = atob(stringB);    console.log('stringA', stringA);    console.log('stringA1', stringA1);    console.log('equals', stringA === stringA1);


Related Topics



Leave a reply



Submit