What's the Difference Between UTF-8/UTF-16 and Base64 in Terms of Encoding

What's the difference between UTF-8/UTF-16 and Base64 in terms of encoding?

UTF-8 and UTF-16 are methods to encode Unicode strings to byte sequences.

See: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Base64 is a method to encode a byte sequence to a string.

So, these are very different concepts and should not be confused.

Things to keep in mind:

  • Not every byte sequence represents a Unicode string encoded in UTF-8 or UTF-16.

  • Not every Unicode string represents a byte sequence encoded in Base64.
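A minimal Python sketch of both directions, and of the two caveats above (the sample string and byte values are arbitrary):

import base64
import binascii

# UTF-8/UTF-16: Unicode string -> byte sequence (and back).
text = "héllo"
utf8_bytes = text.encode("utf-8")        # b'h\xc3\xa9llo'
utf16_bytes = text.encode("utf-16-le")   # b'h\x00\xe9\x00l\x00l\x00o\x00'

# Base64: byte sequence -> ASCII string (and back).
b64_text = base64.b64encode(utf8_bytes).decode("ascii")   # 'aMOpbGxv'
assert base64.b64decode(b64_text) == utf8_bytes

# Not every byte sequence is a valid UTF-8 encoding of a string ...
try:
    bytes([0xC3, 0x28]).decode("utf-8")
except UnicodeDecodeError:
    print("not a UTF-8 encoded string")

# ... and not every Unicode string is a valid Base64 encoding of bytes.
try:
    base64.b64decode("not base64!", validate=True)
except binascii.Error:
    print("not a Base64 encoded byte sequence")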

UTF-8 Encoding vs Base64 Encoding

UTF-8 is a text encoding - a way of encoding text as binary data.

Base64 is in some ways the opposite - it's a way of encoding arbitrary binary data as ASCII text.

If you need to encode arbitrary binary data as text, Base64 is the way to go - you mustn't try to treat arbitrary binary data as if it's UTF-8 encoded text data.
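For example, a rough Python sketch of preparing a file for a text-only channel (the file name is just a placeholder; the actual transport is whatever your protocol provides):

import base64

with open("photo.jpg", "rb") as f:   # read the raw bytes of an arbitrary binary file
    blob = f.read()

# Wrong: JPEG bytes are not UTF-8 text, so this would normally raise
# UnicodeDecodeError (or silently mangle the data if errors were suppressed):
#     text = blob.decode("utf-8")

# Right: Base64 turns the bytes into plain ASCII that any text channel can carry,
# and the receiver gets the original bytes back exactly.
payload = base64.b64encode(blob).decode("ascii")
assert base64.b64decode(payload) == blob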

However, you may well be able to transfer the file to the server just as binary data in the first place - it depends on what transport you're using.

Base64 UTF-16 encoding between Java, Python and JavaScript applications

The problem is that there are 4 variants of UTF-16.

This character encoding uses two bytes per code unit. Which of the two bytes should come first? This creates two variants:

  • UTF-16BE stores the most significant byte first.
  • UTF-16LE stores the least significant byte first.

To allow distinguishing between these two, there is an optional "byte order mark" (BOM) character, U+FEFF, at the start of the text. So UTF-16BE with BOM starts with the bytes fe ff, while UTF-16LE with BOM starts with ff fe. Since the BOM is optional, its presence or absence doubles the number of variants, giving four possible encodings.

It looks like you are using 3 of the 4 possible encodings:

  • Python used UTF-16LE with BOM
  • Java used UTF-16BE with BOM
  • JavaScript used UTF-16LE without BOM
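A Python sketch of what these variants look like on the wire for a short sample string (Python codec names are used here just to produce each variant explicitly):

text = "Hi"

print(text.encode("utf-16-be").hex(" "))               # 00 48 00 69          UTF-16BE, no BOM
print(text.encode("utf-16-le").hex(" "))               # 48 00 69 00          UTF-16LE, no BOM
print(("\ufeff" + text).encode("utf-16-be").hex(" "))  # fe ff 00 48 00 69    UTF-16BE with BOM
print(("\ufeff" + text).encode("utf-16-le").hex(" "))  # ff fe 48 00 69 00    UTF-16LE with BOM

# Python's plain "utf-16" codec writes a BOM in the machine's native byte order,
# which on most machines today means UTF-16LE with BOM.
print(text.encode("utf-16").hex(" "))                  # typically ff fe 48 00 69 00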

One of the reasons why people prefer UTF-8 to UTF-16 is to avoid this confusion.

What charset to use for JSON with Base64-encoded binary data?

Base64 is ASCII, so if the bulk of your JSON is Base64-encoded data, the most space-efficient encoding will be UTF-8. UTF-8 encodes ASCII characters (code points U+0000–U+007F) as one byte each, whereas UTF-16 and UTF-32 encode them as two and four bytes, respectively.

Furthermore, it's just a good idea to use UTF-8, because it's the default encoding for JSON and not all tools support other encodings. From RFC 7159:

8.1 Character Encoding


JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8, and JSON texts that are encoded in UTF-8 are interoperable in the sense that they will be read successfully by the maximum number of implementations; there are many implementations that cannot successfully read texts in other encodings (such as UTF-16 and UTF-32).
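As a rough illustration of the size difference, here is a small Python sketch (the payload size is arbitrary):

import base64, json, os

payload = base64.b64encode(os.urandom(3000)).decode("ascii")   # 4000 Base64 characters
doc = json.dumps({"data": payload})

print(len(doc.encode("utf-8")))      # 4012 bytes: one byte per ASCII character
print(len(doc.encode("utf-16-le")))  # 8024 bytes: two bytes per character
print(len(doc.encode("utf-32-le")))  # 16048 bytes: four bytes per character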

Base64 encoding for UTF-8 strings

As @RRUZ said, EncodeString() expects you to specify a byte encoding that the input String will be converted to; those octets are then encoded to Base64.

You are passing a UTF8String to EncodeString(), which takes a UnicodeString as input in XE5, so the RTL converts the UTF8String data back to UTF-16, undoing your UTF8Encode() (which is deprecated, by the way). Since you are not specifying a byte encoding, Indy falls back to its default text encoding, which is ASCII (configurable via the GIdDefaultTextEncoding variable in the IdGlobal unit).

That is why orange works (no data loss) but سلام fails (data loss).

You need to get rid of your UTF8String altogether, and let Indy handle the UTF-8 for you:

procedure TForm5.Button2Click(Sender: TObject);
begin
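  // Indy converts the UnicodeString to UTF-8 octets, then Base64-encodes those octets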
  m2.Text := TIdEncoderMIME.EncodeString(m1.Text, IndyTextEncoding_UTF8);
end;

DecodeString() has a similar parameter for specifying the byte encoding of the octets that have been Base64-encoded. The input is first decoded to bytes, and then the bytes are converted to a UnicodeString using the specified byte encoding, e.g.:

procedure TForm5.Button3Click(Sender: TObject);
begin
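  // Base64-decode to octets, then convert those octets back to a UnicodeString as UTF-8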
  m1.Text := TIdDecoderMIME.DecodeString(m2.Text, IndyTextEncoding_UTF8);
end;

Why does UTF8 encoding change/corrupt bytes, as opposed to Base64 and ASCII, when writing to a file?

Because one shouldn't try to interpret raw bytes as symbols in some encoding unless one actually knows, or can deduce, the encoding used.

If you receive some nonspecific raw bytes, then process them as raw bytes.

But why does it work/not work?

Because:

  1. Encoding.ASCII seems to ignore values greater than 127 and return them as they are, so no matter what encoding/decoding is done, the raw bytes stay the same.
  2. Base64 is a straightforward encoding that won't change the original data in any way.
  3. UTF8 - theoretically, since those bytes are not a proper UTF-8 string, we may get some conversion data loss (though it would more likely result in an exception). But the most probable reason is a BOM being added during the Encoding.UTF8.GetString call that would remain there after Encoding.UTF8.GetBytes.

In any case, I repeat - do not encode/decode anything unless it is actually string data or the format requires it.
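The answer above is about the .NET Encoding classes; the same principle can be sketched in Python, where a lossy text-decoding round trip corrupts raw bytes while Base64 does not:

import base64

raw = bytes([0xC3, 0x28, 0xFF, 0x00, 0x9F])   # arbitrary bytes, not a valid UTF-8 sequence

# Base64 round trip: lossless by design.
text = base64.b64encode(raw).decode("ascii")
assert base64.b64decode(text) == raw

# Treating the same bytes as UTF-8 text either raises UnicodeDecodeError or,
# with errors="replace", substitutes U+FFFD for the invalid sequences, so the
# bytes that come back are no longer the bytes that went in.
mangled = raw.decode("utf-8", errors="replace").encode("utf-8")
assert mangled != raw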

How does UTF-16 encoding work?

Unicode at the core is indeed a character set, i.e. it assigns numbers to what most people think of as characters. These numbers are called codepoints.

The codepoint for 字 is U+5B57. This is the format in which codepoints are usually specified; "5B57" is a hexadecimal number.

In binary, 5B57 is 101101101010111, or 0101101101010111 if it is extended to 16 bits. But it is very unusual to specify codepoints in binary.

UTF-16 is one of several encodings for Unicode, i.e. a representation in memory or in files. UTF-16 uses 16-bit code units. Since a 16-bit code unit is 2 bytes, two variants exist for splitting it into bytes:

  • little-endian (lower 8 bits first)
  • big-endian (higher 8 bits first)

Often they are called UTF-16LE and UTF-16BE. Since most computers today use a little-endian architecture, UTF-16LE is more common.

A single codepoint can result in 1 or 2 UTF-16 code units. In this particular case, it's a single code unit, and it is the same as the value for the codepoint: 5B57. It is saved as two bytes, either as:

5B 57 (or 01011011 01010111 in binary, big endian)

57 5B (or 01010111 01011011 in binary, little endian)

The latter one is the one you have shown. So it is UTF-16LE encoding.

For codepoints resulting in 2 UTF-16 code units, the encoding is somewhat more involved. It is explained in the UTF-16 Wikipedia article.
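The byte sequences above can be reproduced with a short Python sketch (the emoji is just an example of a codepoint that needs two code units):

text = "字"   # U+5B57: a single 16-bit code unit in UTF-16

print(text.encode("utf-16-be").hex(" "))           # 5b 57         big endian
print(text.encode("utf-16-le").hex(" "))           # 57 5b         little endian, as shown above

# A codepoint outside the Basic Multilingual Plane needs two code units
# (a surrogate pair): U+1F600 becomes D83D DE00.
print("\U0001F600".encode("utf-16-be").hex(" "))   # d8 3d de 00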

Is sending UTF-8 encoded characters network-safe?

The reason for encoding with the standard Base64 format is to make sure it won't contain any characters which may be treated as control characters over the network.

The above statement is incorrect. Base64 is used specifically to encode binary data using 64 of the printable ASCII characters. It is only necessary in specific situations where you are embedding binary data in a protocol which was designed to transfer text (such as embedding attachments in email); it is not required in general for transmitting data over a network. HTTP, for instance, manages perfectly well without it.

In this scenario, does UTF-8 character encoding provide the same guarantee as Base64 by not producing any control characters in the output, so that we can send it over the network?

No. UTF-8 is a Unicode string format. It cannot be used to encode arbitrary binary data.
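A short Python sketch of the difference (the sample bytes are arbitrary): Base64 output stays within a small set of printable ASCII characters, while arbitrary bytes are usually not even decodable as UTF-8:

import base64
import string

blob = bytes([0x00, 0x80, 0xFF, 0x1B, 0xC0])   # arbitrary binary data, including control bytes

# Base64 output is drawn from 64 printable ASCII characters, plus '=' for padding.
encoded = base64.b64encode(blob).decode("ascii")
assert set(encoded) <= set(string.ascii_letters + string.digits + "+/=")

# UTF-8 offers no such guarantee: it is a way of writing Unicode text as bytes,
# not a wrapper for arbitrary bytes, and these particular bytes are not valid UTF-8.
try:
    blob.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")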


