How to Convert These Strange Characters? (ë, Ã, ì, ù, Ã)

Converting special characters such as ü and à back to their original Latin alphabet counterparts in C#

Well, first of all, as the data has been decoded using the wrong encoding, it's likely that some of the characters are impossible to recover. It looks like UTF-8 data that was incorrectly decoded using an 8-bit encoding.

There is no built-in method to recover data like this, because it's not something that you normally do. There is no fully reliable way to decode the data, because it's already broken.

What you can try is to reverse the mistake: encode the string with the wrong 8-bit encoding to get the original bytes back, then decode those bytes as UTF-8:

byte[] data = Encoding.Default.GetBytes(input);
string output = Encoding.UTF8.GetString(data);

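Encoding.Default uses the current ANSI code page for your system on .NET Framework; on .NET Core and later it always returns UTF-8, so name the encoding explicitly there, for example Encoding.GetEncoding("iso-8859-1"). You can try some different encodings and see which one gives the best result.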
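
The "try some different encodings" step can be sketched in Python as well. This is a minimal illustration, assuming the garbled text came from UTF-8 bytes that were decoded with a single-byte encoding; the function name try_repair and the candidate list are hypothetical, not part of any library.

```python
def try_repair(garbled, candidates=("cp1252", "latin-1", "cp1250")):
    """Return {encoding_name: repaired_text} for every candidate that round-trips."""
    results = {}
    for name in candidates:
        try:
            # Undo the wrong decode: turn the characters back into raw bytes,
            # then decode those bytes as UTF-8.
            results[name] = garbled.encode(name).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # This candidate cannot have produced the garbled text; skip it.
            continue
    return results


print(try_repair("Ã¼ber"))
```

Candidates that cannot represent the garbled characters, or whose bytes are not valid UTF-8, are dropped, so whatever remains is a short list to inspect by eye.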

How do I convert special UTF-8 chars to their iso-8859-1 equivalent using javascript?

Actually, everything is typically stored as Unicode of some kind internally, but let's not go into that. I'm assuming you're getting the iconic "Ã¥Ã¤Ã¶" type strings because you're using an ISO-8859 encoding as your character encoding. There's a trick you can use to convert those characters. The escape and unescape functions used for encoding and decoding query strings are defined for ISO characters, whereas the newer encodeURIComponent and decodeURIComponent, which do the same thing, are defined for UTF-8 characters.

escape encodes extended ISO-8859-1 characters (Unicode code points U+0080 to U+00FF) as %xx (two-digit hex), whereas it encodes code points U+0100 and above as %uxxxx (%u followed by four-digit hex). For example, escape("å") == "%E5" and escape("あ") == "%u3042".

encodeURIComponent percent-encodes extended characters as a UTF-8 byte sequence. For example, encodeURIComponent("å") == "%C3%A5" and encodeURIComponent("あ") == "%E3%81%82".
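
The difference between the two percent-encodings can be reproduced outside the browser. The following Python sketch uses urllib.parse.quote and unquote as stand-ins; quote with encoding="latin-1" only approximates escape for U+0080 to U+00FF and never produces the %uxxxx form. The variable names are illustrative.

```python
from urllib.parse import quote, unquote

# One byte per character, roughly what escape() does for U+0080 to U+00FF.
latin1_form = quote("å", encoding="latin-1")

# UTF-8 byte sequences, as encodeURIComponent() produces.
utf8_form = quote("å", encoding="utf-8")
utf8_wide = quote("あ", encoding="utf-8")

# Reading UTF-8 bytes back as Latin-1 is exactly how the garbled text arises.
garbled = unquote(utf8_form, encoding="latin-1")

print(latin1_form, utf8_form, utf8_wide, garbled)
```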

So you can do:

fixedstring = decodeURIComponent(escape(utfstring));

For example, an incorrectly decoded character "å" shows up as "Ã¥". Here escape("Ã¥") == "%C3%A5", which is the two incorrect ISO characters encoded as single bytes. Then decodeURIComponent("%C3%A5") == "å", because the two percent-encoded bytes are interpreted as a UTF-8 sequence.

If you need to do the reverse for some reason, that works too:

utfstring = unescape(encodeURIComponent(originalstring));

Is there a way to differentiate between bad UTF-8 strings and ISO strings? It turns out there is. The decodeURIComponent function used above will throw an error if given a malformed encoded sequence. We can use this to detect, with high probability, whether our string is UTF-8 or ISO.

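var fixedstring;
try {
    // If the string is UTF-8, this will work and not throw an error.
    fixedstring = decodeURIComponent(escape(utfstring));
} catch (e) {
    // If it isn't, an error will be thrown, and we can assume that we have an ISO string.
    fixedstring = utfstring;
}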
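
For comparison, the same detect-by-failure idea can be sketched in Python. This assumes the suspect text is UTF-8 that was misread as ISO-8859-1; the function name fix_if_mojibake is hypothetical.

```python
def fix_if_mojibake(text):
    """Repair text that looks like UTF-8 misread as ISO-8859-1, else return it as-is."""
    try:
        # Succeeds only if every character fits in one byte AND the resulting
        # bytes form valid UTF-8, which is unlikely for genuine Latin-1 text.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not misread UTF-8 (or not repairable this way): leave it alone.
        return text


print(fix_if_mojibake("Ã¥"), fix_if_mojibake("å"))
```

As with the JavaScript version, this is a heuristic: a genuine ISO string that happens to form a valid UTF-8 sequence would be converted by mistake.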

Detecting utf8 broken characters in MySQL

How about a different approach, namely converting the column back and forth to get the correct character set? You can convert it to binary (CAST ... AS BINARY), then to UTF-8 and then to ISO-8859-1 or whatever else you're using (CONVERT ... USING). See the manual for the details.

How to convert String with “ (ISO-8859-1) characters to normal (UTF-8) characters?

$final = '<li>Jain R.K. and Iyengar S.R.K., “Advanced Engineering Mathematicsâ€, Narosa Publications,</li>';

$final = str_replace("Â", "", $final);
$final = str_replace("’", "'", $final);
$final = str_replace("“", '"', $final);
$final = str_replace('–', '-', $final);
$final = str_replace('â€', '"', $final);

For existing data, I replaced the weird characters with their proper UTF-8 equivalents.

For future data, I set the charset to UTF-8 in PHP, in the HTML, and on the database connections.
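
The replace-by-table approach can be sketched in Python too. This is only an illustration, assuming the damaged text is UTF-8 punctuation that was misread as Windows-1252; the table REPLACEMENTS and the function clean are hypothetical names, and the keys are written as escapes so they survive copy and paste.

```python
# Hypothetical lookup table: mis-decoded sequence -> intended replacement.
# Keys are the Windows-1252 readings of the UTF-8 bytes for each punctuation mark.
REPLACEMENTS = {
    "\u00e2\u20ac\u2122": "'",  # right single quote, bytes E2 80 99
    "\u00e2\u20ac\u0153": '"',  # left double quote, bytes E2 80 9C
    "\u00e2\u20ac\u201c": "-",  # en dash, bytes E2 80 93
    "\u00e2\u20ac": '"',        # right double quote whose third byte was lost
    "\u00c2": "",               # stray leading byte of a two-byte sequence
}


def clean(text):
    # Apply longer keys first so the two-character rule cannot clip a
    # three-character sequence that starts with the same characters.
    for bad in sorted(REPLACEMENTS, key=len, reverse=True):
        text = text.replace(bad, REPLACEMENTS[bad])
    return text


sample = "\u00e2\u20ac\u0153Advanced Engineering Mathematics\u00e2\u20ac"
print(clean(sample))
```

A table like this only covers the sequences listed in it, which is why fixing the charset at every layer matters for new data.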

Python: Converting from ISO-8859-1/latin1 to UTF-8

Try decoding it first, then encoding:
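
A minimal Python 3 sketch of that two-step conversion, assuming the input is a bytes object holding ISO-8859-1 data (the sample value is made up for illustration):

```python
# "café" as ISO-8859-1 bytes (assumed sample input).
raw = b"caf\xe9"

# Step 1: decode the bytes using the encoding they are actually in.
text = raw.decode("iso-8859-1")

# Step 2: encode the resulting text as UTF-8.
utf8_bytes = text.encode("utf-8")

print(text, utf8_bytes)
```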


MySQL - find and fix incorrect characters

I get many results that are 'as' which are the same letters but without the accents.

That would be an issue of the collation used - those are rule sets for character comparison, and they define which characters are to be treated as equal in different languages.

But you can use the BINARY operator to change that directly within the query.

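What is this character ( Â ) and how do I remove it with PHP?

"Latin 1" is your problem here. A Latin-1 code page holds only 256 characters, while Unicode defines far more than that (the Basic Multilingual Plane alone has 65,536 code points), so most of the characters available to a web page in UTF-8 cannot be stored in it.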

For your immediate problem you should be able to use:

$clean = str_replace(chr(194), " ", $dirty);

However, I would switch your database to use UTF-8 ASAP, as the problem will almost certainly recur.
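
To see where a stray Â typically comes from, here is a small Python sketch. It assumes the culprit is a non-breaking space (U+00A0) stored as UTF-8 and displayed as Latin-1, which is a common case but not the only one. The helper strip_nbsp is a hypothetical variation that replaces the whole two-byte sequence rather than only byte 194.

```python
nbsp = "\u00a0"

# A non-breaking space is two bytes in UTF-8: 0xC2 0xA0.
utf8_bytes = nbsp.encode("utf-8")

# Read as Latin-1, byte 0xC2 (decimal 194) shows up as "Â".
misread = utf8_bytes.decode("latin-1")


def strip_nbsp(raw):
    # Replace the complete UTF-8 sequence so no orphaned 0xA0 byte is left behind.
    return raw.replace(b"\xc2\xa0", b" ")


print(utf8_bytes, repr(misread), strip_nbsp("a\u00a0b".encode("utf-8")))
```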

