How to Detect If You Have to Apply UTF-8 Decode or Encode on a String

How do I detect if I have to apply UTF-8 decode or encode on a string?

I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back.

The most universal way I found to work well in every case was:

if (preg_match('!!u', $string))
{
   // This is UTF-8
}
else
{
   // Definitely not UTF-8
}
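For comparison, the same idea (let a strict decoder decide) can be sketched in Python. The helper name is_valid_utf8 is my own and not part of any library. Like the preg_match trick, it only tells you that the bytes are valid UTF-8, not that they were meant to be UTF-8.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("Fédération".encode("utf-8")))    # True
print(is_valid_utf8("Fédération".encode("latin-1")))  # False
```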

Detect encoding and make everything UTF-8

If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8
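To give a feel for the general idea of handling mixed input, here is a rough Python sketch. It is not how Encoding::toUTF8() is implemented; it is just one way to get a similar effect, assuming the input is bytes in which valid UTF-8 sequences should be kept and every other byte should be read as Windows-1252. The handler name cp1252_fallback and the function to_utf8_text are made up for this example.

```python
import codecs


def _cp1252_fallback(err):
    """Error handler: reinterpret bytes that are not valid UTF-8 as Windows-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    try:
        text = bad.decode("cp1252")
    except UnicodeDecodeError:
        # A few bytes (0x81, 0x8D, ...) are undefined in cp1252; use Latin-1 for those.
        text = bad.decode("latin-1")
    return text, err.end  # resume decoding right after the bad bytes


# The name "cp1252_fallback" is an arbitrary label chosen for this sketch.
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def to_utf8_text(raw: bytes) -> str:
    """Decode bytes that may mix UTF-8 and Windows-1252/Latin-1 (hypothetical helper)."""
    return raw.decode("utf-8", errors="cp1252_fallback")


mixed = "Fédération".encode("utf-8") + " café".encode("latin-1")
print(to_utf8_text(mixed))  # Fédération café
```

A Latin-1 byte pair that happens to form a valid UTF-8 sequence will be read as UTF-8, so this stays a heuristic.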

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÃÂ©dÃÂÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
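The repeated-garbling case can be illustrated with a short Python sketch. This is not the library's algorithm, only a simplified stand-in that assumes the text was garbled by reading UTF-8 bytes as Windows-1252 one or more times. The function name fix_mojibake and the max_rounds parameter are invented for the example, and a partially garbled string (like the last example above) is not handled.

```python
def fix_mojibake(text: str, max_rounds: int = 5) -> str:
    """Undo repeated 'UTF-8 read as cp1252' garbling (hypothetical helper)."""
    for _ in range(max_rounds):
        try:
            candidate = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # the round trip failed, so the text is probably not garbled (any more)
        if candidate == text:
            break  # pure ASCII or otherwise unchanged: nothing left to fix
        text = candidate
    return text


# Build a doubly garbled sample instead of pasting one in.
garbled = "Fédération Camerounaise de Football"
for _ in range(2):
    garbled = garbled.encode("utf-8").decode("cp1252")

print(fix_mojibake(garbled))  # Fédération Camerounaise de Football
```

Text that is legitimately made of characters such as "Ã©" would be "fixed" as well, so only run something like this on data you already suspect is broken.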

I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

Decoding string to UTF-8 if string is already defined as r'string' instead of b'string'

Try this:

import re
re.sub(rb'\\([0-7]{3})', lambda match: bytes([int(match[1], 8)]), stringa.encode('ascii')).decode('utf-8')

Explanation:

We use the regular expression rb'\\([0-7]{3})' (which matches a literal backslash \ followed by exactly 3 octal digits) and replace each occurrence by taking the three digit code (match[1]), interpreting that as a number written in octal (int(_, 8)), and then replacing the original escape sequence with a single byte (bytes([_])).

We need to operate on bytes because the escape codes stand for raw bytes, not Unicode characters. Only after we have unescaped those sequences can we decode the UTF-8 to a string.
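A small worked example of the steps just described, wrapped in a function. The name unescape_octal and the sample input are made up for illustration; the input is assumed to be ASCII text containing only three-digit octal escapes for bytes (values up to \377).

```python
import re


def unescape_octal(text: str) -> str:
    """Turn literal \\ooo octal escapes into bytes, then decode as UTF-8."""
    raw = text.encode("ascii")
    unescaped = re.sub(
        rb"\\([0-7]{3})",                     # backslash + exactly three octal digits
        lambda m: bytes([int(m[1], 8)]),      # e.g. b"303" -> 0xC3 -> b"\xc3"
        raw,
    )
    return unescaped.decode("utf-8")


stringa = r"caf\303\251"          # hypothetical input: \303\251 are the UTF-8 bytes of "é"
print(unescape_octal(stringa))    # café
```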

How can I detect a malformed UTF-8 string in PHP?

First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.

You can make use of the UTF-8 validity check that is available in preg_match (see the PHP manual) since PHP 4.3.5. If an invalid string is given, the match fails with no additional information: it returns 0, or false in newer PHP versions, so the result is falsy either way:

$isUTF8 = preg_match('//u', $string);

Another possibility is mb_check_encoding (see the PHP manual):

$validUTF8 = mb_check_encoding($string, 'UTF-8');

Another function you can use is mb_detect_encoding (see the PHP manual):

$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));

It's important to set the strict parameter to true.

Additionally, iconv (see the PHP manual) allows you to change or drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it raises a notice; this behavior cannot be changed.)

echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE   : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;

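You can suppress the notice with @ and compare the lengths of the input and the returned string. If nothing was dropped, the input was valid UTF-8:

$validUTF8 = strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));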

Check the examples on the iconv manual page as well.
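For readers more at home in Python, roughly the same drop-or-replace behavior is available through the errors argument of bytes.decode. This is only an analogy to the iconv flags above, not a statement about how iconv works internally; the sample bytes are made up.

```python
raw = b"caf\xc3\xa9 \xff ok"   # valid UTF-8 except for the stray 0xFF byte

dropped = raw.decode("utf-8", errors="ignore")    # invalid bytes silently removed
replaced = raw.decode("utf-8", errors="replace")  # invalid bytes become U+FFFD

# Same length trick as with iconv: if re-encoding gives fewer bytes,
# something invalid was dropped.
was_valid = len(dropped.encode("utf-8")) == len(raw)

print(repr(dropped))
print(repr(replaced))
print(was_valid)  # False
```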

Convert string of unknown encoding to UTF-8

"TrÃ¤ume groÃŸ" is a hint that you got something originally encoded as utf-8, but your process read it as cp1252.

A possible way is to encode your string back to cp1252 and then correctly decode it as utf-8:

print('"TrÃ¤ume groÃŸ"'.encode('cp1252').decode('utf8'))

gives as expected:

"Träume groß"

But this is only a workaround. The correct solution is to understand where you have read the original bytes as cp1252 and directly use the utf8 conversion there.
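As an illustration of fixing it at the source, here is a sketch that assumes the text comes from a file. It writes a temporary UTF-8 file, reads it once with the wrong codec and once with the right one. The file and variable names are arbitrary.

```python
import os
import tempfile

# Create a small UTF-8 encoded file to read back.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as f:
    f.write("Träume groß".encode("utf-8"))
    path = f.name

try:
    with open(path, encoding="cp1252") as fh:   # wrong: produces the garbled text
        wrong = fh.read()
    with open(path, encoding="utf-8") as fh:    # right: decode as UTF-8 at the source
        right = fh.read()
finally:
    os.remove(path)

print(wrong)
print(right)
```

If the data arrives some other way (HTTP response, database driver, subprocess), the equivalent fix is wherever that layer lets you state the encoding.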

Detect if a string was double-encoded in UTF-8

In principle you can't, especially allowing for cat-garbage.

You don't say what the original character encoding of the data was before it was UTF-8 encoded once or twice. I'll assume CP1251 (or at least that CP1251 is one of the possibilities), because it's quite a tricky case.

Take a non-ASCII character. UTF-8 encode it. You get some bytes, and all those bytes are valid characters in CP1251 unless one of them happens to be 0x98, the only hole in CP1251.

So, if you convert those bytes from CP1251 to UTF-8, the result is exactly the same as if you'd correctly UTF-8 encoded a CP1251 string consisting of those Russian characters. There's no way to tell whether the result is from incorrectly double-encoding one character, or correctly single-encoding 2 characters.
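The ambiguity is easy to reproduce in Python. The characters chosen here are just examples: "я" shows the collision, and "И" is one whose UTF-8 encoding contains the 0x98 byte.

```python
original = "я"                                        # one character, UTF-8 bytes D1 8F
misread = original.encode("utf-8").decode("cp1251")   # bytes wrongly read as CP1251
double_encoded = misread.encode("utf-8")

# Someone who really meant the two-character text "СЏ" produces identical bytes.
single_encoded = "СЏ".encode("utf-8")

print(misread)                           # СЏ
print(double_encoded == single_encoded)  # True

# The one exception: UTF-8 sequences containing 0x98 cannot be misread as CP1251.
try:
    "И".encode("utf-8").decode("cp1251")  # UTF-8 bytes D0 98
    hit_hole = False
except UnicodeDecodeError:
    hit_hole = True
print(hit_hole)                          # True
```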

If you have some control over the original data, you could put a BOM at the start of it. Then when it comes back to you, inspect the initial bytes to see whether you have a UTF-8 BOM, or the result of incorrectly double-encoding a BOM. But I guess you probably don't have that kind of control over the original text.

In practice you can guess - UTF-8 decode it and then:

(a) look at the character frequencies, character pair frequencies, numbers of non-printable characters. This might allow you to tentatively declare it nonsense, and hence possibly double-encoded. With enough non-printable characters it may be so nonsensical that you couldn't realistically type it even by mashing at the keyboard, unless maybe your ALT key was stuck.

(b) attempt the second decode. That is, starting from the Unicode code points that you got by decoding your UTF-8 data, first encode it to CP1251 (or whatever) and then decode the result from UTF-8. If either step fails (due to invalid sequences of bytes), then it definitely wasn't double-encoded, at least not using CP1251 as the faulty interpretation.
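Step (b) might look like this in Python. The function name undo_double_encoding and the default of CP1251 as the wrong intermediate encoding are assumptions for the sketch; a None result means the data cannot have been double-encoded via that encoding.

```python
from typing import Optional


def undo_double_encoding(data: bytes, wrong: str = "cp1251") -> Optional[str]:
    """Try to reverse 'UTF-8 -> misread as `wrong` -> UTF-8 again' (hypothetical helper)."""
    try:
        once = data.decode("utf-8")                  # first decode: always needed
        return once.encode(wrong).decode("utf-8")    # second decode attempt
    except (UnicodeDecodeError, UnicodeEncodeError):
        return None                                  # definitely not double-encoded this way


text = "привет"
single = text.encode("utf-8")
double = single.decode("cp1251").encode("utf-8")

print(undo_double_encoding(double))  # привет
print(undo_double_encoding(single))  # None
```

Pure ASCII input round-trips unchanged, so a non-None result for ASCII says nothing either way.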

This is more or less what you do if you have some bytes that might be UTF-8 or might be CP1251, and you don't know which.

You'll get some false positives for single-encoded cat-garbage indistinguishable from double-encoded data, and maybe a very few false negatives for data that was double-encoded but that after the first encode by fluke still looked like Russian.

If your original encoding has more holes in it than CP1251 then you'll have fewer false negatives.

Character encodings are hard.

How do I check if a string is unicode or ascii?

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which using code something like this:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist of purely characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
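In Python 3, a rough equivalent of the check above could look like the following. The function name describe is made up, and isascii() needs Python 3.7 or later.

```python
def describe(value) -> str:
    """Report the Python 3 type and whether the content is ASCII-only (hypothetical helper)."""
    if isinstance(value, str):
        return "text, ASCII only" if value.isascii() else "text, non-ASCII"
    if isinstance(value, (bytes, bytearray)):
        return "bytes, ASCII only" if value.isascii() else "bytes, non-ASCII"
    return "not a string"


print(describe("hello"))                # text, ASCII only
print(describe("héllo"))                # text, non-ASCII
print(describe("héllo".encode("utf-8")))  # bytes, non-ASCII
print(describe(42))                     # not a string
```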

Decode a utf8 string in python

Pretty unclear question. However, the following code snippet could help (inline comments show the intermediate result after each step):

receive_string = "b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'"
vietnamese_txt = (receive_string
  .encode()                      # b"b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'"
  .decode('unicode_escape')      #  "b'vÃ´ Ä\x91á»\x8bch thiÃªn háº¡'"
  .encode('latin1').decode()     #  "b'vô địch thiên hạ'" 
  .lstrip('b').strip("'"))       #    'vô địch thiên hạ'

print(vietnamese_txt)            #     vô địch thiên hạ

vô địch thiên hạ
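If the received text really is the printed form of a bytes object, as it appears to be here, a shorter alternative is to let ast.literal_eval parse it back into bytes and then decode. This is an assumption about the input; literal_eval raises an exception if the text is not a valid Python literal.

```python
import ast

receive_string = "b'v\\xc3\\xb4 \\xc4\\x91\\xe1\\xbb\\x8bch thi\\xc3\\xaan h\\xe1\\xba\\xa1'"

raw = ast.literal_eval(receive_string)   # parse the text back into a bytes object
vietnamese_txt = raw.decode("utf-8")     # then decode those bytes as UTF-8

print(vietnamese_txt)                    # vô địch thiên hạ
```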