Fixing broken UTF-8 encoding
I've had to try to 'fix' a number of broken UTF-8 situations in the past, and unfortunately it's never easy and often impossible.
Unless you can determine exactly how the data was broken, and it was always broken in that exact same way, it's going to be hard to 'undo' the damage.
If you want to try to undo the damage, your best bet is to write some sample code that attempts numerous variations of calls to mb_convert_encoding() to see if you can find a combination of 'from' and 'to' that fixes your data. In the end, it's often best not to bother fixing the old data because of the pain involved, and instead to just fix things going forward.
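The same trial-and-error idea can be sketched in Python rather than PHP. This is only an illustration: the function name find_repairs, the candidate list, and the sample string are hypothetical, and it assumes you know a fragment of what the repaired text should contain.

```python
from itertools import permutations

# Hypothetical candidate list; extend it with the encodings you suspect.
CANDIDATES = ["utf-8", "latin-1", "cp1252", "iso8859-15"]

def find_repairs(broken, expected_fragment, codecs=CANDIDATES):
    """Try every (wrong decoding, real encoding) pair and keep the ones
    whose result contains the text we expect to see."""
    hits = []
    for wrong, real in permutations(codecs, 2):
        try:
            # Undo the wrong decoding, then decode the bytes properly.
            repaired = broken.encode(wrong).decode(real)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the damage
        if expected_fragment in repaired:
            hits.append((wrong, real, repaired))
    return hits

hits = find_repairs("BÃ¡t NhÃ£", "Bát")
for wrong, real, repaired in hits:
    print(wrong, "->", real, ":", repaired)
```

Several pairs may report a hit for the same string, so it is worth checking any winning pair against a larger sample before running it over real data.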
However, before doing this, you need to make sure that you fix everything that is causing this issue in the first place. You've already mentioned that your DB table collation and editors are set properly. But there are more places where you need to check to make sure that everything is properly UTF-8:
- Make sure that you are serving your HTML as UTF-8:
header("Content-Type: text/html; charset=utf-8");
- Change your PHP default charset to utf-8:
ini_set("default_charset", 'utf-8');
- If your database doesn't ALWAYS talk in utf-8, then you may need to tell it on a per-connection basis to ensure it's in utf-8 mode. In MySQL you do that by issuing:
charset utf8
- You may need to tell your webserver to always try to talk in UTF-8. In Apache the directive is:
AddDefaultCharset UTF-8
- Finally, you need to ALWAYS make sure that you are using PHP functions that are properly UTF-8 compliant. This means always using the mb_* 'multibyte aware' string functions. It also means that when calling functions such as htmlspecialchars(), you include the appropriate 'utf-8' charset parameter at the end so that it doesn't encode characters incorrectly.
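To see why byte-oriented string functions go wrong on UTF-8 data, here is a minimal Python sketch of the difference between counting bytes and counting characters. Python is used purely for illustration and the sample text is made up; in PHP the same split exists between strlen() and mb_strlen().

```python
text = "Bát Nhã"
raw = text.encode("utf-8")

char_count = len(text)  # counts characters
byte_count = len(raw)   # counts bytes; each accented letter takes two

# Cutting by characters is safe.
safe_cut = text[:2]

# Cutting by bytes can split a multibyte character in half.
byte_cut = raw[:2]
try:
    byte_cut.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

print(char_count, byte_count, safe_cut, cut_is_valid)
```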
If you slip up on any one step in the whole process, the encoding can be mangled and problems arise. Once you get in the 'groove' of doing utf-8, though, this all becomes second nature. PHP 6 was supposed to be fully Unicode compliant from the get-go, which would have made a lot of this easier, but it was never released.
How to fix broken utf-8 encoding in Python?
I'm not sure what you can do with this kind of data, but for the example in your original post, this works in Python 2:
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
>>> s
u'09. B\xe1t Nh\xe3 T\xe2m Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
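The transcript above relies on Python 2, where a plain str holds bytes. A rough Python 3 equivalent, assuming the broken text arrives as a str in which UTF-8 bytes were misread as Latin-1, drops the first decode step because the value is already text:

```python
# UTF-8 bytes that were wrongly decoded as Latin-1 (mojibake).
mojibake = "09. BÃ¡t NhÃ£ TÃ¢m Kinh"

# Re-encode with the wrong codec to get the original bytes back,
# then decode them with the codec they were really written in.
fixed = mojibake.encode("latin-1").decode("utf-8")
print(fixed)
```

If the text was misread as Windows-1252 rather than Latin-1, "cp1252" may be the codec to use in the encode step instead.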
How am I supposed to fix this utf-8 encoding error?
When trying to untangle a string whose escape sequences were doubly encoded (i.e. \\ instead of \), the special text-encoding codec unicode_escape may be used to turn them back into the expected characters for further processing. However, given that the input is already of type str, it first needs to be turned into bytes. Assuming that the entire string consists of valid ascii code points, ascii may be the codec for that initial conversion of the str input into bytes. The utf8 codec may be used instead should there be standard unicode code points inside the str, as the unicode_escape sequences wouldn't affect those code points. Examples:
>>> broken_string = 'La funci\\xc3\\xb3n est\\xc3\\xa1ndar datetime.'
>>> broken_string2 = 'La funci\\xc3\\xb3n estándar datetime.'
>>> broken_string.encode('ascii').decode('unicode_escape')
'La funciÃ³n estÃ¡ndar datetime.'
>>> broken_string2.encode('utf8').decode('unicode_escape')
'La funciÃ³n estÃ¡ndar datetime.'
Given that the unicode_escape codec assumes latin1 when decoding, this intermediate string may simply be encoded back to bytes using the latin1 codec, before turning that back into a unicode str through the utf8 (or whatever appropriate target) codec:
>>> broken_string.encode('ascii').decode('unicode_escape').encode('latin1').decode('utf8')
'La función estándar datetime.'
>>> broken_string2.encode('utf8').decode('unicode_escape').encode('latin1').decode('utf8')
'La función estándar datetime.'
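The chain above can be wrapped in a small helper. This is a sketch only; the name unescape_utf8 is hypothetical, and it assumes the escapes are all of the \xNN form and spell out UTF-8 bytes.

```python
def unescape_utf8(text):
    """Turn literal \\xNN escape sequences holding UTF-8 bytes back into text."""
    # str -> bytes; utf-8 is used so unescaped non-ASCII characters survive.
    raw = text.encode("utf-8")
    # Each \xNN escape and each raw byte becomes one code point below 256.
    latin1_view = raw.decode("unicode_escape")
    # Map those code points back to bytes and decode them as real UTF-8.
    # A \uXXXX escape above U+00FF in the input would make this encode fail.
    return latin1_view.encode("latin-1").decode("utf-8")

broken_string = 'La funci\\xc3\\xb3n est\\xc3\\xa1ndar datetime.'
broken_string2 = 'La funci\\xc3\\xb3n estándar datetime.'
print(unescape_utf8(broken_string))
print(unescape_utf8(broken_string2))
```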
As requested, an addendum to clarify the partially messed-up string. Note that attempting to encode broken_string2 using the ascii codec will not work, due to the presence of the unescaped á character.
>>> broken_string2.encode('ascii').decode('unicode_escape').encode('latin1').decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 21: ordinal not in range(128)
How to recover string with broken charset to unicode?
You'll have to reverse the process. In Python, you can encode Unicode values to Latin-1 to get one-to-one bytes again, so the process would be:
- Decode from UTF-8 to Unicode
- Encode from Unicode to Latin-1
- Decode from UTF-8 to Unicode again
- Encode to ISO-8859-5
Your mangled text is missing characters that were not printable. If I ignore the broken characters, I get:
>>> 'Ð½Ð¾ÑÑÐ°Ð¶Ð½Ð°Ñ.'.decode('utf8').encode('latin1').decode('utf8', 'ignore').encode('iso8859_5')
'\xdd\xde\xd0\xd6\xdd\xd0.'
Printing the result before encoding to ISO-8859-5, but replacing broken characters with a placeholder:
>>> print 'Ð½Ð¾ÑÑÐ°Ð¶Ð½Ð°Ñ.'.decode('utf8').encode('latin1').decode('utf8', 'replace')
но��ажна�.
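The transcripts above are Python 2. A Python 3 sketch of the same recovery follows, assuming the mangled text is a str in which UTF-8 bytes were misread as Latin-1 and the non-printable characters were already lost; the variable names are illustrative.

```python
# Mangled text: UTF-8 bytes misread as Latin-1, with some bytes missing.
mangled = "Ð½Ð¾ÑÑÐ°Ð¶Ð½Ð°Ñ."

# Latin-1 maps every code point below 256 to the byte of the same value.
raw = mangled.encode("latin-1")

# Decode as UTF-8, either dropping or marking the incomplete sequences.
recovered = raw.decode("utf-8", errors="ignore")
marked = raw.decode("utf-8", errors="replace")

# Final step from the list above: encode to ISO-8859-5.
iso = recovered.encode("iso8859_5")

print(recovered)
print(marked)
print(iso)
```

The characters dropped by errors="ignore" cannot be recovered from this input, so the result is still incomplete text.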
Fixing Broken UTF8 characters MYSQL
I might be reading this wrong, but... if your columns hold utf8 strings that were stored in a latin1 column (typically with the Swedish collation, at that), and you altered your table so that the column uses a new charset and new collation rules, then either:
You have not altered the table in production yet...
In this case hop straight to the WP documentation on how to do it.
The altered table is already in production...
In this case you're going to need to convert only part of your table -- the rows written before the alter occurred. Converting rows that are already correct will likely get you error messages like the one you're getting when converting non-ascii characters back and forth. (You might be able to detect strings with broken utf8 using a couple of regular expressions.)
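As a rough illustration of the regular-expression idea, here is a heuristic sketch in Python that runs on strings already fetched from the table. The pattern and the name looks_double_encoded are assumptions, not something from the WP documentation. It flags a character in the range of a UTF-8 lead byte followed by one in the range of a continuation byte, which is how UTF-8 read as Latin-1 tends to look.

```python
import re

# A Latin-1 character in the UTF-8 lead-byte range (0xC2-0xF4) followed by
# one in the continuation-byte range (0x80-0xBF), e.g. "Ã¡" for "á".
MOJIBAKE = re.compile("[\u00c2-\u00f4][\u0080-\u00bf]")

def looks_double_encoded(text):
    """Heuristic: True if the text contains a typical mojibake pair."""
    return MOJIBAKE.search(text) is not None

print(looks_double_encoded("BÃ¡t NhÃ£"))  # broken row
print(looks_double_encoded("Bát Nhã"))    # already correct row
```

This can give false positives on legitimate text, and it misses data that was misread as Windows-1252, where some continuation bytes turn into characters outside this range, so treat a match as a hint rather than proof.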
fix broken utf8 encoding in haskell
OK, I'll just copy my comment down here:
Haskell Strings are Unicode strings. They're not UTF-8 or UTF-anything -- they're just lists of Unicode codepoints.
You're just looking at the result of show for a string. That's how the Show instance works -- you're not going to be able to do anything about that. If you print the string -- e.g. with putStrLn -- you'll see that it prints fine. The string is correct; it's just that the way you're looking at it escapes some characters.
UTF-8 Encoding still wrong output
This solves the problem:
mysqli_query($link, "SET NAMES 'utf8'");
mysqli_query($link, "SET CHARACTER SET 'utf8'");