Fixing Broken UTF-8 Encoding

I've had to try to 'fix' a number of broken UTF-8 situations in the past, and unfortunately it's never easy and often impossible.

Unless you can determine exactly how it was broken, and it was always broken in that exact same way, then it's going to be hard to 'undo' the damage.

If you want to try to undo the damage, your best bet would be to start writing some sample code, where you attempt numerous variations on calls to mb_convert_encoding() to see if you can find a combination of 'from' and 'to' that fixes your data. In the end, it's often best to not even bother worrying about fixing the old data because of the pain levels involved, but instead to just fix things going forward.
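
The trial-and-error search described above can be sketched in a few lines. This is only an illustration, written in Python rather than with mb_convert_encoding(); the function name find_repairs and the CANDIDATES list are made up for the example, and it assumes you have at least one broken sample whose correct text you already know.

```python
from itertools import permutations

# Hypothetical list of candidate encodings; extend it for your own data.
CANDIDATES = ["utf-8", "latin-1", "cp1252", "iso8859_15"]

def find_repairs(broken, expected):
    """Return (wrong_decoding, real_encoding) pairs that turn `broken` into `expected`."""
    hits = []
    for wrong, real in permutations(CANDIDATES, 2):
        try:
            # Undo the bad decode, then decode with the encoding the bytes really used.
            repaired = broken.encode(wrong).decode(real)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this combination cannot even represent the data
        if repaired == expected:
            hits.append((wrong, real))
    return hits

# Sample: "Bát Nhã" whose UTF-8 bytes were misread as a single-byte encoding.
hits = find_repairs("BÃ¡t NhÃ£", "Bát Nhã")
```

Several pairs may match a short sample, so it is worth checking any hit against more than one record before applying it to the whole data set.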

However, before doing this, you need to make sure that you fix everything that is causing this issue in the first place. You've already mentioned that your DB table collation and editors are set properly. But there are more places where you need to check to make sure that everything is properly UTF-8:

  • Make sure that you are serving your HTML as UTF-8:
    • header("Content-Type: text/html; charset=utf-8");
  • Change your PHP default charset to utf-8:
    • ini_set("default_charset", 'utf-8');
  • If your database doesn't ALWAYS talk in utf-8, then you may need to tell it on a per-connection basis to ensure it's in utf-8 mode. In MySQL you do that by issuing:
    • SET NAMES utf8 (or charset utf8 in the mysql command-line client)
  • You may need to tell your webserver to always try to talk in UTF-8. In Apache the directive is:
    • AddDefaultCharset UTF-8
  • Finally, you need to ALWAYS make sure that you are using PHP functions that are properly UTF-8 compliant. This means always using the mb_* styled 'multibyte aware' string functions. It also means that when calling functions such as htmlspecialchars(), you include the appropriate 'utf-8' charset parameter at the end to make sure that it doesn't encode the text incorrectly.
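
Why 'multibyte aware' functions matter comes down to bytes versus characters, the same distinction that separates strlen() from mb_strlen() in PHP. The sketch below shows it in Python, purely as an illustration; the sample text and variable names are arbitrary.

```python
text = "Bát Nhã"
raw = text.encode("utf-8")

# Character count versus byte count: each accented letter takes two bytes.
char_count = len(text)   # 7
byte_count = len(raw)    # 9

# Cutting the byte string after two bytes splits "á" in half ...
try:
    raw[:2].decode("utf-8")
    clean_cut = True
except UnicodeDecodeError:
    clean_cut = False

# ... while slicing the decoded text always cuts between characters.
first_two = text[:2]
```

A byte-oriented function that truncates or pads at an arbitrary offset can therefore leave half a character behind, which is one common source of broken data.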

If you slip up on any one step in the whole process, the encoding can be mangled and problems arise. Once you get in the 'groove' of doing utf-8 though, this all becomes second nature. And of course, PHP6 is supposed to be fully Unicode compliant from the get-go, which will make lots of this easier (hopefully).

How to fix broken utf-8 encoding in Python?

I'm not sure what you can do with this kind of data, but for the example in your original post, this works:

>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
>>> s
u'09. B\xe1t Nh\xe3 T\xe2m Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
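
The session above is Python 2, where a plain string literal is a byte string and has to be decoded first. In Python 3 a str is already Unicode, so the leading decode drops out. A minimal sketch, assuming the damage is UTF-8 bytes that were read as Latin-1 (the helper name fix_mojibake is made up for the example):

```python
def fix_mojibake(text):
    """Reverse a UTF-8-read-as-Latin-1 mix-up on an already decoded str."""
    # Latin-1 maps code points 0-255 straight back to the original bytes,
    # which can then be decoded as the UTF-8 they really were.
    return text.encode("latin-1").decode("utf-8")

fixed = fix_mojibake("09. BÃ¡t NhÃ£ TÃ¢m Kinh")
```

If the text was damaged in some other way, either step raises UnicodeEncodeError or UnicodeDecodeError instead of returning a wrong result silently.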

How am I supposed to fix this utf-8 encoding error?

When trying to untangle a string in which sequences intended as escape sequences have been doubly escaped (i.e. \\ instead of \), the special unicode_escape codec can be used to turn them back into the intended characters for further processing. However, since the input is already of type str, it first has to be converted to bytes. If the entire string consists of valid ASCII code points, ascii can serve as the codec for that initial conversion. The utf8 codec can be used instead if the str also contains ordinary non-ASCII code points, since the escape sequences do not affect those code points. Examples:

>>> broken_string = 'La funci\\xc3\\xb3n est\\xc3\\xa1ndar datetime.'
>>> broken_string2 = 'La funci\\xc3\\xb3n estándar datetime.'
>>> broken_string.encode('ascii').decode('unicode_escape')
'La funciÃ³n estÃ¡ndar datetime.'
>>> broken_string2.encode('utf8').decode('unicode_escape')
'La funciÃ³n estÃ¡ndar datetime.'

Since the unicode_escape codec decodes to Latin-1 code points, this intermediate string can simply be encoded back to bytes with the latin1 codec and then decoded into a proper str with the utf8 (or other appropriate target) codec:

>>> broken_string.encode('ascii').decode('unicode_escape').encode('latin1').decode('utf8')
'La función estándar datetime.'
>>> broken_string2.encode('utf8').decode('unicode_escape').encode('latin1').decode('utf8')
'La función estándar datetime.'

As requested, an addendum to clarify the partially messed up string. Note that attempting to decode broken_string2 using the ascii codec will not work, due to the presence of the unescaped á character.

>>> broken_string2.encode('ascii').decode('unicode_escape').encode('latin1').decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 21: ordinal not in range(128)
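
The three steps can be wrapped in one helper. This is a sketch under the same assumptions as the examples above (Python 3, escapes of the \xNN form that spell UTF-8 bytes); the name unescape_utf8 is made up. It always uses utf8 for the first step, since ASCII-only input encodes identically under both codecs.

```python
def unescape_utf8(text):
    r"""Turn literal \xNN escapes in `text` into the UTF-8 characters they spell."""
    raw = text.encode("utf-8")                   # str -> bytes; ASCII input is unchanged
    latin1_view = raw.decode("unicode_escape")   # resolve \xNN; other bytes read as Latin-1
    return latin1_view.encode("latin-1").decode("utf-8")

fully_escaped = unescape_utf8('La funci\\xc3\\xb3n est\\xc3\\xa1ndar datetime.')
partly_escaped = unescape_utf8('La funci\\xc3\\xb3n estándar datetime.')
```

Both calls return the same repaired sentence. Input that contains \uNNNN escapes above U+00FF would fail at the latin-1 step, so this helper only fits the byte-style escapes shown here.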

How to recover string with broken charset to unicode?

You'll have to reverse the process. In Python, you can encode Unicode values to Latin-1 to get the bytes back one-to-one, so the process would be:

  • Decode from UTF-8 to Unicode
  • Encode from Unicode to Latin-1
  • Decode from UTF-8 to Unicode again
  • Encode to ISO-8859-5

Your mangled text is missing characters that were not printable. If I ignore the broken characters, I get:

>>> 'Ð½Ð¾ÑÑÐ°Ð¶Ð½Ð°Ñ.'.decode('utf8').encode('latin1').decode('utf8', 'ignore').encode('iso8859_5')
'\xdd\xde\xd0\xd6\xdd\xd0.'

Printing the result before encoding to ISO-8859-5, but replacing broken characters with a placeholder:

>>> print 'Ð½Ð¾ÑÑÐ°Ð¶Ð½Ð°Ñ.'.decode('utf8').encode('latin1').decode('utf8', 'replace')
но��ажна�.
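
For Python 3 the same recovery looks slightly different, because str literals are already Unicode. The sketch below assumes the mangled input is UTF-8 read as Latin-1 with the non-printable characters already lost; the variable names are arbitrary.

```python
# UTF-8 Cyrillic misread as Latin-1, with the non-printable characters missing.
mangled = "Ð½Ð¾ÑÑÐ°Ð¶Ð½Ð°Ñ."

raw = mangled.encode("latin-1")           # back to the bytes that were misread
skipped = raw.decode("utf-8", "ignore")   # silently drop incomplete sequences
marked = raw.decode("utf-8", "replace")   # mark them with U+FFFD instead
as_iso = skipped.encode("iso8859_5")      # final step: target encoding
```

Here as_iso holds the same bytes as the Python 2 session above, b'\xdd\xde\xd0\xd6\xdd\xd0.', and marked shows where the three lost characters were.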

Fixing Broken UTF8 characters MYSQL

I might be reading this wrong, but... if your columns hold utf8 strings that were stored in a latin1 column (typically with a Swedish collation, at that), and you altered your table so that the column uses a new charset and new collation rules, then either:

  1. You have not altered the table in production yet...

    In this case hop straight to the WP documentation on how to do it.

  2. The altered table is already in production...

    In this case you're going to convert only part of your table -- the part before the alter occurred. Failing to do so will likely get you error messages like the one you're getting when converting non-ascii characters back and forth. (You might be able to detect strings with broken utf8 using a couple of regular expressions.)
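
One way to spot candidate rows is a pattern for the typical footprint of UTF-8 that was read as Latin-1: a character in the UTF-8 lead-byte range followed by one in the continuation-byte range. The sketch below is a heuristic written in Python for illustration, not a MySQL query; the names MOJIBAKE and looks_double_encoded are made up.

```python
import re

# UTF-8 lead bytes (0xC2-0xF4) and continuation bytes (0x80-0xBF) as they
# appear after being misread as Latin-1 characters.
MOJIBAKE = re.compile("[\u00c2-\u00f4][\u0080-\u00bf]")

def looks_double_encoded(text):
    """Heuristic: True if `text` contains a Latin-1 rendering of a UTF-8 sequence."""
    return MOJIBAKE.search(text) is not None

flag_broken = looks_double_encoded("BÃ¡t NhÃ£")   # mangled text
flag_clean = looks_double_encoded("Bát Nhã")      # correct text
```

Legitimate text can occasionally contain such a pair, and data misread as cp1252 rather than Latin-1 shows different characters for bytes 0x80-0x9F, so treat a match as a hint for manual review rather than proof.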

fix broken utf8 encoding in haskell

OK, I'll just copy my comment down here:

  1. Haskell Strings are Unicode strings. They're not UTF-8 or UTF-anything -- they're just lists of Unicode codepoints.

  2. You're just looking at the result of show for a string. That's how the Show instance works -- you're not going to be able to do anything about that. If you print the string -- e.g. with putStrLn -- you'll see that it prints fine. The string is correct, it's just that the way you're looking at it escapes some characters.

UTF-8 Encoding still wrong output

This solves the problem:

mysqli_query($link, "SET NAMES 'utf8'"); 
mysqli_query($link, "SET CHARACTER SET 'utf8'");

