Detect Encoding and Make Everything UTF-8

If you apply utf8_encode() to a string that is already UTF-8, it will double-encode it and return garbled output.

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
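
For example, a string in which one word is Latin1 and another is already UTF-8 comes back as consistent UTF-8. (A minimal sketch; the sample bytes below are made up for illustration.)

require_once('Encoding.php');
use \ForceUTF8\Encoding;

$mixed = "Caf\xE9 ma\xC3\xB1ana";   // "Café" in Latin1 (0xE9) mixed with "mañana" already in UTF-8 (0xC3 0xB1)
echo Encoding::toUTF8($mixed);       // "Café mañana", now entirely valid UTF-8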

Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÂ©dÃƒÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂédÃÂération Camerounaise de Football");
echo Encoding::fixUTF8("Fédération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

I've since transformed the original forceUTF8() function into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

Find reason for automatic encoding detection (UTF-8 vs Windows-1252)

This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is more readily available to me, I decided to simply use the following script to force-convert anything that is not UTF-8 into UTF-8, using the very handy neitanod/forceutf8 library.

$before = file_get_contents('export.csv');
$after = \ForceUTF8\Encoding::toUTF8($before);
file_put_contents('export.fixed.csv', $after);

Then I used a file comparison tool like Beyond Compare to compare the two resulting CSVs, in order to see more easily which characters were not originally encoded in UTF-8.

This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:

$value = preg_replace('/([^\pL0-9 -])+/', '', $value);

Using \p in the regular expression without the u modifier had an unexpected side effect: the pattern is applied byte by byte, so the multi-byte special characters in that column came out mangled, as if converted to another encoding. A quick solution is to add the u flag to the regex (see the PCRE pattern modifiers reference), which makes preg_replace treat both the pattern and the subject as UTF-8, as shown below.
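
A minimal sketch of the corrected call (the sample $value is made up): with the u modifier, both the pattern and the subject are treated as UTF-8, so accented characters survive intact.

$value = "Fédération: 100% compatible";
$value = preg_replace('/([^\pL0-9 -])+/u', '', $value);   // "Fédération 100 compatible", no byte-level mangling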

Detect UTF-8 encoding (How does MS IDE do it)?

If the encoding is UTF-8, the first byte you see above 0x7F must be the start of a multi-byte UTF-8 sequence, so test it for that. Here is the code we use for that:

typedef unsigned char unc;   // shorthand used below

int IsUTF8(const unc *cpt)
{
    if (!cpt)
        return 0;

    if ((*cpt & 0xF8) == 0xF0) {           // start of 4-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
            && ((*(cpt + 2) & 0xC0) == 0x80)
            && ((*(cpt + 3) & 0xC0) == 0x80))
            return 4;
    }
    else if ((*cpt & 0xF0) == 0xE0) {      // start of 3-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
            && ((*(cpt + 2) & 0xC0) == 0x80))
            return 3;
    }
    else if ((*cpt & 0xE0) == 0xC0) {      // start of 2-byte sequence
        if ((*(cpt + 1) & 0xC0) == 0x80)
            return 2;
    }
    return 0;
}

If you get a return of 0, it is not valid UTF-8. Otherwise skip the number of bytes returned and continue checking at the next byte over 0x7F.
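
For PHP users, roughly the same byte-mask check can be transcribed as follows (a sketch that only returns a true/false verdict; the function name looks_like_utf8 is my own):

function looks_like_utf8($s) {
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($s[$i]);
        if ($b < 0x80) continue;                    // plain ASCII, nothing to check
        if (($b & 0xE0) == 0xC0)     $need = 1;     // start of 2-byte sequence
        elseif (($b & 0xF0) == 0xE0) $need = 2;     // start of 3-byte sequence
        elseif (($b & 0xF8) == 0xF0) $need = 3;     // start of 4-byte sequence
        else return false;                          // stray continuation or invalid lead byte
        while ($need-- > 0) {
            $i++;
            if ($i >= $len || (ord($s[$i]) & 0xC0) != 0x80)
                return false;                       // expected continuation byte missing
        }
    }
    return true;
}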

Detect actual charset encoding in UTF

You need to unwrap the UTF-8 encoding and then pass it to a character-encoding detection library.

If random 8-bit data is encoded into UTF-8 (assuming an identity mapping, i.e. a C4 byte is assumed to represent U+00C4, as is the case with ISO-8859-1 and its superset Windows 1252), you end up with something like

Source:  8F    0A 20 FE    65
Result: C2 8F 0A 20 C3 BE 65

(because the UTF-8 encoding of U+008F is C2 8F, and U+00FE is C3 BE). You need to revert this encoding in order to obtain the source string, so that you can then identify its character encoding.
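
In PHP, for instance, that unwrap step might look like this (a sketch using the bytes from the example above; detection itself would still need a separate library):

$wrapped = "\xC2\x8F\x0A\x20\xC3\xBE\x65";                        // UTF-8 wrapping of the 8-bit source
$source  = mb_convert_encoding($wrapped, 'ISO-8859-1', 'UTF-8');  // back to the original bytes 8F 0A 20 FE 65
// $source now holds the raw single-byte data, ready to be fed to a character-encoding detector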

In Python, something like

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import chardet

mystery = u'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì'
print(chardet.detect(mystery.encode('cp1252')))

Result:

{'confidence': 0.99, 'encoding': 'ISO-8859-5'}

On the Unix command line,

vnix$ echo 'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì' |
> iconv -t cp1252 | chardet
<stdin>: ISO-8859-5 (confidence: 0.99)

or iconv -t cp1252 file | chardet to decode a file and pass it to chardet.

(For this to work successfully at the command line, you need to have your environment properly set up for transparent Unicode handling. I am assuming that your shell, your terminal, and your locale are adequately configured. Try a recent Ubuntu Live CD or something if your regular environment is stuck in the 20th century.)

In the general case, you cannot know that the incorrectly applied encoding is CP 1252 but in practice, I guess it's going to be correct (as in, yield correct results for this scenario) most of the time. In the worst case, you would have to loop over all available legacy 8-bit encodings and try them all, then look at the one(s) with the highest confidence rating from chardet. Then, the example above will be more complex, too -- the mapping from legacy 8-bit data to UTF-8 will no longer be a simple identity mapping, but rather involve a translation table as well (for example, a byte F5 might correspond arbitrarily to U+0092 or whatever).

(Incidentally, iconv -l spits out a long list of aliases, so you will get a lot of fundamentally identical results if you use that as your input. But here is a quick ad-hoc attempt at fixing your slightly weird Perl script.

#!/bin/sh
iconv -l |
grep -F -v -e UTF -e EUC -e 2022 -e ISO646 -e GB2312 -e 5601 |
while read enc; do
    echo 'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì' |
    iconv -f utf-8 -t "${enc%//}" 2>/dev/null |
    chardet | sed "s%^[^:]*%${enc%//}%"
done |
grep -Fwive ascii -e utf -e euc -e 2022 -e None |
sort -k4rn

The output still contains a lot of chaff, but once you remove that, the verdict is straightforward.

It makes no sense to try any multi-byte encodings such as UTF-16, ISO-2022, GB2312, EUC-KR, etc. in this scenario. If you convert a string into one of these successfully, then the result will most definitely be in that encoding. This is outside the scope of the problem outlined above: a string converted from an 8-bit encoding into UTF-8 using the wrong translation table.

The ones which returned ascii definitely did something wrong; most of them will have received an empty input, because iconv failed with an error. In a Python script, error handling would be more straightforward.)

How am I supposed to fix this utf-8 encoding error?

When trying to untangle a string that contains doubly encoded sequences that were meant to be escape sequences (i.e. \\ instead of \), the special text-encoding codec unicode_escape may be used to turn them back into the expected characters for further processing. However, since the input is already of type str, it first needs to be turned into bytes. If the entire string consists of plain ASCII code points, the ascii codec can be used for that initial str-to-bytes conversion; the utf8 codec should be used if ordinary non-ASCII Unicode code points are also present in the str, since the literal unicode_escape sequences are not affected by that choice. Examples:

>>> broken_string = 'La funci\\xc3\\xb3n est\\xc3\\xa1ndar datetime.'
>>> broken_string2 = 'La funci\\xc3\\xb3n estándar datetime.'
>>> broken_string.encode('ascii').decode('unicode_escape')
'La funciÃ³n estÃ¡ndar datetime.'
>>> broken_string2.encode('utf8').decode('unicode_escape')
'La funciÃ³n estÃ¡ndar datetime.'

Since the unicode_escape codec effectively decodes the raw bytes as latin1, this intermediate string may simply be encoded back to bytes with the latin1 codec after the decoding, before turning it back into a str of the intended text through the utf8 (or whatever the appropriate target) codec:

>>> broken_string.encode('ascii').decode('unicode_escape').encode('latin1').decode('utf8')
'La función estándar datetime.'
>>> broken_string2.encode('utf8').decode('unicode_escape').encode('latin1').decode('utf8')
'La función estándar datetime.'

As requested, an addendum to clarify the partially messed-up string: note that attempting to encode broken_string2 with the ascii codec will not work, due to the presence of the unescaped á character.

>>> broken_string2.encode('ascii').decode('unicode_escape').encode('latin1').decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 21: ordinal not in range(128)

Encoding Problem while saving a txt file in utf-8

Try the ASCII-only Unicode escape "\u00FC" instead of the literal "ü". If that suddenly works, it means the editor uses a different encoding (UTF-8) than the javac compiler (Cp1252). By the way: …, StandardCharsets.UTF_8 is default.

The Java source was saved by the editor as UTF-8, so "ü" became two bytes with the high bit set.
The javac compiler compiled with encoding Cp1252 (probably) and turned those two bytes into two separate chars, which, written back out as UTF-8, added up to 4 bytes.

So the compiler encoding had to be set explicitly (for example with javac -encoding UTF-8),
in this case also for the test sources.


