ISO-8859-1 vs. UTF-8

meta charset="UTF-8" vs. charset="ISO-8859-1"

UTF-8

UTF-8 (UCS Transformation Format 8) is the World Wide Web's most common character encoding. UTF-8 is backward-compatible with ASCII and can represent any standard Unicode character. The first 128 UTF-8 characters precisely match the first 128 ASCII characters (numbered 0-127), meaning that existing ASCII text is already valid UTF-8. After the first 128 code points, it utilizes a multibyte approach for additional characters.
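A minimal sketch in Java (the sample characters are just illustrative) showing the ASCII overlap and the growing byte counts:

import java.nio.charset.StandardCharsets;

System.out.println("A".getBytes(StandardCharsets.UTF_8).length);  // 1: the same single byte as ASCII
System.out.println("ÿ".getBytes(StandardCharsets.UTF_8).length);  // 2
System.out.println("€".getBytes(StandardCharsets.UTF_8).length);  // 3
System.out.println("𝄞".getBytes(StandardCharsets.UTF_8).length);  // 4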

ISO-8859-1

By contrast, ISO-8859-1 is a single-byte encoding scheme: each character is exactly one byte, so at most 256 distinct characters can be represented. The major downfall of this type of encoding is its inability to accommodate languages that need characters beyond that small repertoire.

Source: MDN entry on UTF-8
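To make the limitation concrete, here is a small Java check using the standard CharsetEncoder API (the choice of characters is just illustrative):

import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
System.out.println(latin1.canEncode('ÿ'));  // true: 0xFF exists in ISO-8859-1
System.out.println(latin1.canEncode('€'));  // false: the euro sign only arrived in ISO-8859-15
System.out.println(latin1.canEncode('漢')); // false: no CJK coverage at all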

ISO-8859-1 vs. UTF-8?

Unicode is taking over and has already surpassed all other encodings in usage on the web. I suggest you hop on the train right now.

Note that there are several flavors of Unicode. Joel Spolsky gives an overview.

Unicode is winning
[Graph: character encoding usage on the web over time, current as of Feb. 2012.]

How do I convert between ISO-8859-1 and UTF-8 in Java?

In general, you can't round-trip between them losslessly. UTF-8 is capable of encoding any Unicode code point; ISO-8859-1 can handle only a tiny fraction of them. So, transcoding from ISO-8859-1 to UTF-8 is no problem. Going backwards from UTF-8 to ISO-8859-1 will cause "replacement characters" (�) to appear in your text when unsupported characters are found.

To transcode text:

import java.nio.charset.StandardCharsets;

byte[] latin1 = ...
byte[] utf8 = new String(latin1, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8);

or

byte[] utf8 = ...
byte[] latin1 = new String(utf8, StandardCharsets.UTF_8).getBytes(StandardCharsets.ISO_8859_1);

You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.
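For example, here is a sketch using the java.nio.charset.CharsetEncoder API, configured to throw instead of silently substituting (the sample strings are made up):

import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT); // throw rather than replace

encoder.encode(CharBuffer.wrap("café"));  // fine: every character exists in ISO-8859-1
encoder.encode(CharBuffer.wrap("漢字")); // throws UnmappableCharacterException

To substitute your own replacement byte instead, use CodingErrorAction.REPLACE together with encoder.replaceWith(...).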

ASCII, ISO 8859-1, Unicode in C: how does it work?

Character encodings can be confusing for many reasons. Here are some explanations:

In the ISO 8859-1 encoding, the character y with a diaeresis ÿ (originally a ligature of i and j) is encoded as the byte value 0xFF (255). The first 256 code points in Unicode do correspond to the same characters as the ones from ISO 8859-1, but the popular UTF-8 encoding for Unicode uses 2 bytes for code points from 128 through 2047 (and 3 or 4 bytes for higher ones), so ÿ is encoded in UTF-8 as 0xC3 0xBF.
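A quick way to verify those byte values, sketched in Java to match the other examples in this collection:

import java.nio.charset.StandardCharsets;

byte[] latin1 = "ÿ".getBytes(StandardCharsets.ISO_8859_1);
byte[] utf8 = "ÿ".getBytes(StandardCharsets.UTF_8);
System.out.printf("%02x%n", latin1[0] & 0xFF);                    // ff
System.out.printf("%02x %02x%n", utf8[0] & 0xFF, utf8[1] & 0xFF); // c3 bf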

When you read the file file.txt, your program reads one byte at a time and outputs it to the console unchanged (except for line endings on legacy systems). If the file is encoded as UTF-8, the ÿ is read as 2 separate bytes, which are output one after the other, and the terminal displays ÿ because the locale selected for the terminal also uses the UTF-8 encoding.

Adding to the confusion: if the source file uses UTF-8 encoding, "ÿ" is a string of length 2 and 'ÿ' is parsed as a multibyte character constant. Multibyte character constants are very confusing and non-portable (the value can be 0xC3BF or 0xBFC3 depending on the system); using them is strongly discouraged, and the compiler should be configured to issue a warning when it sees one (gcc -Wall -Wextra).

Even more confusing is this: on many systems, the type char is signed by default. In this case, the character constant 'ÿ' (a single byte in ISO 8859-1) has the value -1 and type int, no matter how you write it in the source code: '\377' and '\xff' will also have the value -1. The reason for this is consistency with the value of "ÿ"[0], a char with the value -1. This is also the most common value of the macro EOF.

On all systems, getchar() and similar functions like getc() and fgetc() return values between 0 and UCHAR_MAX or the special negative value EOF, so the byte 0xFF from a file where the character ÿ is encoded as ISO 8859-1 is returned as the value 0xFF (255), which compares unequal to 'ÿ' if char is signed, and also unequal to 'ÿ' if the source code is in UTF-8.
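For what it's worth, Java has the same split between a signed narrow type and an unsigned read value; a minimal sketch of the analogous pitfall:

import java.io.ByteArrayInputStream;

// The single byte 0xFF, i.e. ÿ encoded as ISO 8859-1.
ByteArrayInputStream in = new ByteArrayInputStream(new byte[] { (byte) 0xFF });
int c = in.read();          // 255, like getchar() returning the raw byte value
byte b = (byte) 0xFF;       // -1, like a signed char holding '\377'
System.out.println(c == b); // false: 255 != -1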

As a rule of thumb, do not use non-ASCII characters in character constants, do not make assumptions about the character encoding used for strings and file contents and configure the compiler to make char unsigned by default (-funsigned-char).

If you deal with foreign languages, using UTF-8 is highly recommended for all textual content, including source code. Be aware that non-ASCII characters are encoded as multiple bytes with this encoding. Study the UTF-8 encoding (it is quite simple and elegant) and use libraries to handle textual transformations such as uppercasing.

Fixing a UTF-8 string incorrectly decoded as ISO-8859-1 in Java

Since ISO-8859-1 is a one-byte-per-character encoding that maps every possible byte value to a character, decoding with it will always succeed. The UTF-8 bytes are converted to the wrong characters, but luckily no information is lost.

Changing the characters back to bytes using ISO-8859-1 encoding gives you the original byte array, containing characters encoded in UTF-8, so you can then safely reinterpret it with the correct encoding.
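As a concrete sketch (the garbled literal is what "漢字" becomes after the wrong decode; Unicode escapes stand in for the two invisible characters):

import java.nio.charset.StandardCharsets;

String garbled = "æ¼¢å\u00AD\u0097"; // "漢字" mis-decoded as ISO-8859-1
String fixed = new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
System.out.println(fixed); // 漢字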

The opposite of this is not (always¹) true, as UTF-8 is a multibyte encoding. The decoding process may encounter invalid byte sequences and replace them with the replacement character � (U+FFFD). At that point you've lost information and can't get the original bytes back anymore.

¹ If you stick to characters in the 0-127 range it will work, as they're encoded in UTF-8 using a single byte.

Are there examples of ISO 8859-1 text files which are valid, but different in UTF-8?

Latin-1 is a single-byte encoding (meaning 1 character = 1 byte) which uses all 256 possible byte values, so any byte maps to something in Latin-1. Literally any file is therefore "valid" Latin-1: you can interpret any file as Latin-1 and you'll get… something… as a result.

So yes, you can interpret any valid UTF-8 file as Latin-1; it is valid in both UTF-8 and Latin-1. The first 128 characters are the same in both encodings, as both are based on ASCII; but if your UTF-8 file uses any non-ASCII characters, those will be interpreted as gibberish (yet valid) Latin-1. For example:

bytes             encoding   text
e6bc a2e5 ad97    UTF-8      漢字
e6bc a2e5 ad97    Latin-1    æ¼¢å­ (valid but nonsensical)
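The same demonstration in Java, decoding one byte array both ways (a minimal sketch):

import java.nio.charset.StandardCharsets;

byte[] bytes = { (byte) 0xE6, (byte) 0xBC, (byte) 0xA2, (byte) 0xE5, (byte) 0xAD, (byte) 0x97 };
System.out.println(new String(bytes, StandardCharsets.UTF_8));      // 漢字
System.out.println(new String(bytes, StandardCharsets.ISO_8859_1)); // æ¼¢å­ plus two invisible control characters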

Is ISO-8859-1 a Unicode charset?

No, ISO 8859-1 is not a Unicode charset, simply because ISO 8859-1 does not provide encoding for all Unicode characters, only a small subset thereof. The word “charset” is sometimes used loosely (and therefore often best avoided), but as a technical term, it means a character encoding.

Loosening the definition so that “Unicode charset” would mean an encoding that covers part of Unicode would be pointless. Then every encoding would be a “Unicode charset”.

Is it significantly better to use ISO-8859-1 rather than UTF-8 wherever possible?

Most of the 128 single-byte UTF-8 characters (the ASCII range) are exactly the characters used most often in ISO-8859-1 text as well. If you use UTF-8, you will need 1 extra byte only when you use one of the characters in the 128–255 range (not so common, I bet).
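A quick sketch of the actual cost (the sample word is arbitrary):

import java.nio.charset.StandardCharsets;

String s = "café"; // é sits in the 128–255 range
System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 4
System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 5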

My opinion? Use UTF-8 if you can and if you have no problem handling it. The time you'll save the day you need some extra characters (or the day you have to translate your content) is really worth a few extra bytes here and there in the DB...


