Detect encoding and make everything UTF-8
If you apply utf8_encode() to an already-UTF-8 string, it will return garbled UTF-8 output.
I made a function that addresses all these issues. It's called Encoding::toUTF8().
You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.
I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.
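The core idea can be sketched in Python (an illustrative analogue, not the library's actual implementation; the helper name to_utf8 is made up here): walk the bytes, accept every valid UTF-8 sequence as-is, and reinterpret any byte that is not valid UTF-8 as Latin-1.

```python
def to_utf8(raw: bytes) -> str:
    """Decode bytes that may mix UTF-8 and Latin-1 in the same string."""
    out = []
    i = 0
    while i < len(raw):
        # Try successively longer chunks as a valid UTF-8 sequence.
        for length in (1, 2, 3, 4):
            chunk = raw[i:i + length]
            try:
                out.append(chunk.decode("utf-8"))
                i += length
                break
            except UnicodeDecodeError:
                continue
        else:
            # Not valid UTF-8 at this position: treat this byte as Latin-1.
            out.append(raw[i:i + 1].decode("latin-1"))
            i += 1
    return "".join(out)

# A UTF-8 "é" (two bytes) followed by a Latin-1 "é" (one byte) in one string:
mixed = "é".encode("utf-8") + "é".encode("latin-1")
print(to_utf8(mixed))  # éé
```

This is roughly what makes the mixed-feed case tractable: each byte run is judged on its own, so one bad legacy byte does not poison the rest of the string.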
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);
$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);
Download:
https://github.com/neitanod/forceutf8
I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.
Usage:
require_once('Encoding.php');
use \ForceUTF8\Encoding; // It's namespaced now.
$utf8_string = Encoding::fixUTF8($garbled_utf8_string);
Examples:
echo Encoding::fixUTF8("Fédération Camerounaise de Football");
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÂ©dÃƒÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃƒÆ’Ã‚Â©dÃƒÆ’Ã‚Â©ration Camerounaise de Football");
will output:
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
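What fixUTF8() does can likewise be sketched in Python (a hypothetical helper, not the library's code): keep undoing one layer of "UTF-8 bytes read as Windows-1252" garbling until the string stops changing or stops decoding.

```python
def fix_double_utf8(s: str) -> str:
    """Undo repeated UTF-8-read-as-CP1252 garbling, one layer at a time."""
    while True:
        try:
            # If s is really UTF-8 bytes that were misread as CP1252,
            # re-encoding and decoding peels off one layer of garbling.
            fixed = s.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return s  # no further layer to undo
        if fixed == s:
            return s
        s = fixed

print(fix_double_utf8("FÃ©dÃ©ration Camerounaise de Football"))
# Fédération Camerounaise de Football
```

Looping makes it handle doubly and triply garbled input the same way as singly garbled input.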
I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().
find reason for automatic encoding detection (UTF-8 vs Windows-1252)
This is what I did in the end to find out why the file was not detected as UTF-8, i.e. to find the characters that were not encoded in UTF-8. Since PHP is more readily available to me, I decided to simply use the following script, to force convert anything that is not UTF-8 to UTF-8, using the very handy neitanod/forceutf8 library.
$before = file_get_contents('export.csv');
$after = \ForceUTF8\Encoding::toUTF8($before);
file_put_contents('export.fixed.csv', $after);
Then I used a file comparison tool like Beyond Compare to compare the two resulting CSVs, in order to see more easily which characters were not originally encoded in UTF-8.
This in turn showed me that only one particular column of the export was affected. Upon further investigation I found out that the contents of that column were processed in PHP with the following preg_replace:
$value = preg_replace('/([^\pL0-9 -])+/', '', $value);
Using \p in the regular expression had an unexpected side effect: all the special characters were converted to another encoding. A quick solution is to add the u flag to the regex (see the regex pattern-modifiers reference), which forces the resulting encoding of this preg_replace to be UTF-8. See also this answer.
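The byte-level damage can be reproduced in Python as an analogue (Python's re standing in for PCRE): matching an ASCII-only class against the raw UTF-8 bytes strips the individual bytes of each multibyte character, while a Unicode-aware match over the decoded string leaves them alone.

```python
import re

text = "Fédération Camerounaise"
raw = text.encode("utf-8")

# Byte-oriented match (like preg_replace without the /u modifier):
# the two bytes of "é" (0xC3 0xA9) are not in [A-Za-z0-9 -],
# so they get stripped, corrupting the string.
broken = re.sub(rb"[^A-Za-z0-9 -]+", b"", raw)
print(broken)  # b'Fdration Camerounaise'

# Unicode-aware match (like adding the /u modifier): é survives.
ok = re.sub(r"[^\w -]+", "", text)
print(ok)  # Fédération Camerounaise
```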
Detect UTF-8 encoding (How does MS IDE do it)?
If the encoding is UTF-8, the first byte you see over 0x7F must be the start of a UTF-8 sequence. So test it for that. Here is the code we use for that:
int IsUTF8(const unsigned char *cpt)
{
    if (!cpt)
        return 0;

    if ((*cpt & 0xF8) == 0xF0) { // start of 4-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
            && ((*(cpt + 2) & 0xC0) == 0x80)
            && ((*(cpt + 3) & 0xC0) == 0x80))
            return 4;
    }
    else if ((*cpt & 0xF0) == 0xE0) { // start of 3-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
            && ((*(cpt + 2) & 0xC0) == 0x80))
            return 3;
    }
    else if ((*cpt & 0xE0) == 0xC0) { // start of 2-byte sequence
        if ((*(cpt + 1) & 0xC0) == 0x80)
            return 2;
    }
    return 0;
}
If you get a return of 0, it is not valid UTF-8. Otherwise skip the number of bytes returned and continue checking at the next byte over 0x7F.
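The scanning loop described above can be sketched in Python under the same assumptions (ASCII bytes are skipped; every byte over 0x7F must start a well-formed multibyte sequence):

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if every byte above 0x7F starts a well-formed UTF-8 sequence."""
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:                  # plain ASCII: skip
            i += 1
            continue
        if (b & 0xE0) == 0xC0:         # start of 2-byte sequence
            need = 1
        elif (b & 0xF0) == 0xE0:       # start of 3-byte sequence
            need = 2
        elif (b & 0xF8) == 0xF0:       # start of 4-byte sequence
            need = 3
        else:
            return False               # stray continuation or invalid byte
        tail = data[i + 1:i + 1 + need]
        if len(tail) != need or any((t & 0xC0) != 0x80 for t in tail):
            return False               # continuation bytes missing or wrong
        i += 1 + need
    return True

print(looks_like_utf8("Fédération".encode("utf-8")))   # True
print(looks_like_utf8("Fédération".encode("latin-1"))) # False
```

Note this is a plausibility check, not full validation: like the C code, it does not reject overlong encodings or surrogate code points.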
Detect actual charset encoding in UTF
You need to unwrap the UTF-8 encoding and then pass it to a character-encoding detection library.
If random 8-bit data is encoded into UTF-8 (assuming an identity mapping, i.e. a C4 byte is assumed to represent U+00C4, as is the case with ISO-8859-1 and its superset Windows 1252), you end up with something like
Source: 8F 0A 20 FE 65
Result: C2 8F 0A 20 C3 BE 65
(because the UTF-8 encoding of U+008F is C2 8F, and U+00FE is C3 BE). You need to revert this encoding in order to obtain the source string, so that you can then identify its character encoding.
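The round trip is easy to verify in Python: encoding the Latin-1 (identity) interpretation of those source bytes as UTF-8 yields exactly the quoted result, and decoding reverses it.

```python
source = bytes([0x8F, 0x0A, 0x20, 0xFE, 0x65])

# Identity mapping: byte 0x8F is taken to mean U+008F, as in ISO-8859-1.
wrapped = source.decode("latin-1").encode("utf-8")
print(wrapped.hex(" ").upper())  # C2 8F 0A 20 C3 BE 65

# Reverting the encoding recovers the original bytes for charset detection.
unwrapped = wrapped.decode("utf-8").encode("latin-1")
assert unwrapped == source
```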
In Python, something like
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import chardet
mystery = u'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì'
print(chardet.detect(mystery.encode('cp1252')))
Result:
{'confidence': 0.99, 'encoding': 'ISO-8859-5'}
On the Unix command line,
vnix$ echo 'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì' |
> iconv -t cp1252 | chardet
<stdin>: ISO-8859-5 (confidence: 0.99)
or iconv -t cp1252 file | chardet to decode a file and pass it to chardet.
(For this to work successfully at the command line, you need to have your environment properly set up for transparent Unicode handling. I am assuming that your shell, your terminal, and your locale are adequately configured. Try a recent Ubuntu Live CD or something if your regular environment is stuck in the 20th century.)
In the general case, you cannot know that the incorrectly applied encoding is CP 1252, but in practice, I guess it's going to be correct (as in, yield correct results for this scenario) most of the time. In the worst case, you would have to loop over all available legacy 8-bit encodings and try them all, then look at the one(s) with the highest confidence rating from chardet. Then, the example above will be more complex, too -- the mapping from legacy 8-bit data to UTF-8 will no longer be a simple identity mapping, but rather involve a translation table as well (for example, a byte F5 might correspond arbitrarily to U+0092 or whatever).
(Incidentally, iconv -l spits out a long list of aliases, so you will get a lot of fundamentally identical results if you use that as your input. But here is a quick ad-hoc attempt at fixing your slightly weird Perl script.
#!/bin/sh
iconv -l |
grep -F -v -e UTF -e EUC -e 2022 -e ISO646 -e GB2312 -e 5601 |
while read enc; do
echo 'áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì' |
iconv -f utf-8 -t "${enc%//}" 2>/dev/null |
chardet | sed "s%^[^:]*%${enc%//}%"
done |
grep -Fwive ascii -e utf -e euc -e 2022 -e None |
sort -k4rn
The output still contains a lot of chaff, but once you remove that, the verdict is straightforward.
It makes no sense to try any multi-byte encodings such as UTF-16, ISO-2022, GB2312, EUC_KR etc in this scenario. If you convert a string into one of these successfully, then the result will most definitely be in that encoding. This is outside the scope of the problem outlined above: a string converted from an 8-bit encoding into UTF-8 using the wrong translation table.
The ones which returned ascii definitely did something wrong; most of them will have received an empty input, because iconv failed with an error. In a Python script, error handling would be more straightforward.)
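For example, a stdlib-only Python sketch of that loop (the candidate list is illustrative, and a crude "fraction of Cyrillic letters" score stands in for chardet's confidence) might look like:

```python
mystery = ('áÞåàÐÝØÒ ÜÝÞÓÞ ßàØÛÞÖÕÝØÙ Java, '
           'ÜÞÖÝÞ ×ÐÝïâì Òáî ÔÞáâãßÝãî ßÐÜïâì')

# Undo the identity wrapping first (assumed to be CP1252, as above).
raw = mystery.encode('cp1252')

# Hypothetical candidate list; a real script would enumerate far more codecs.
candidates = ['cp1251', 'iso8859_5', 'koi8_r', 'cp866', 'iso8859_1']

for enc in candidates:
    try:
        decoded = raw.decode(enc)
    except UnicodeDecodeError:
        continue  # error handling really is this simple here
    # Crude stand-in for chardet's confidence: fraction of Cyrillic letters.
    cyr = sum('\u0400' <= ch <= '\u04ff' for ch in decoded)
    print(f'{enc}: {cyr / len(decoded):.2f}  {decoded[:30]}')
```

As with the shell version, several Cyrillic codepages will score well on a letter-class heuristic alone; a language model like chardet's is what separates ISO-8859-5 from CP1251.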
How am I supposed to fix this utf-8 encoding error?
When trying to detangle a string containing doubly encoded sequences that were intended to be escape sequences (i.e. \\ instead of \), the special text-encoding codec unicode_escape may be used to rectify them back into the expected characters for further processing. However, given that the input is already of type str, it first needs to be turned into bytes. If the entire string consists of valid ascii code points, that may be the codec for the initial str-to-bytes conversion; the utf8 codec may be used should there be standard Unicode code points inside the str, as the unicode_escape sequences won't affect those code points. Examples:
>>> broken_string = 'La funci\\xc3\\xb3n est\\xc3\\xa1ndar datetime.'
>>> broken_string2 = 'La funci\\xc3\\xb3n estándar datetime.'
>>> broken_string.encode('ascii').decode('unicode_escape')
'La función estándar datetime.'
>>> broken_string2.encode('utf8').decode('unicode_escape')
'La función estándar datetime.'
Given that the unicode_escape codec decodes to latin1, the intermediate string may simply be re-encoded to bytes with the latin1 codec after decoding, before turning that back into a unicode str through the utf8 (or whatever the appropriate target is) codec:
>>> broken_string.encode('ascii').decode('unicode_escape').encode('latin1').decode('utf8')
'La función estándar datetime.'
>>> broken_string2.encode('utf8').decode('unicode_escape').encode('latin1').decode('utf8')
'La función estándar datetime.'
As requested, an addendum to clarify the partially messed-up string: note that attempting to encode broken_string2 with the ascii codec will not work, due to the presence of the unescaped á character.
>>> broken_string2.encode('ascii').decode('unicode_escape').encode('latin1').decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 21: ordinal not in range(128)
Encoding Problem while saving a txt file in utf-8
Try the ASCII escape "\u00FC" instead of the literal "ü". If that suddenly works, it means the editor uses a different encoding (UTF-8) than the javac compiler does (probably Cp1252). By the way: passing StandardCharsets.UTF_8 explicitly avoids relying on the platform default.
That is what happened here: the Java source was saved in the editor as UTF-8, so "ü" became two bytes with the high bit set. javac compiled with the Cp1252 encoding (probably) and turned those two bytes into two separate chars, which, re-encoded as UTF-8, summed up to four bytes.
So the compiler encoding had to be set, in this case also for the test sources.
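The mix-up can be reproduced in Python (illustrating the byte-level mechanism only; javac itself is obviously not involved):

```python
source = "ü"

# The editor saves the file as UTF-8: two bytes with the high bit set.
on_disk = source.encode("utf-8")
print(on_disk)  # b'\xc3\xbc'

# The compiler reads those bytes as Cp1252: two separate characters.
misread = on_disk.decode("cp1252")
print(misread)  # Ã¼

# Re-encoded as UTF-8, the two characters sum up to four bytes.
print(misread.encode("utf-8"))  # b'\xc3\x83\xc2\xbc'
```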