How to Detect a Malformed Utf-8 String in PHP

How can I detect a malformed UTF-8 string in PHP?

First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.

You can make use of the UTF-8 validity check that is available in preg_match [PHP Manual] since PHP 4.3.5. It will return 0 (with no additional information) if an invalid string is given:

$isUTF8 = preg_match('//u', $string);

Another possibility is mb_check_encoding [PHP Manual]:

$validUTF8 = mb_check_encoding($string, 'UTF-8');

Another function you can use is mb_detect_encoding [PHP Manual]:

$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));

It's important to set the strict parameter to true.

Additionally, iconv [PHP Manual] allows you to change/drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it generates a notification; this behavior cannot be changed.)

echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;

You can use @ and check the length of the return string:

strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));

Check the examples on the iconv manual page as well.

How do I detect if have to apply UTF-8 decode or encode on a string?

I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back.

The most universal way I found to work well in every case was:

if (preg_match('!!u', $string))
{
// This is UTF-8
}
else
{
// Definitely not UTF-8
}

PHP JSON_encode() is getting Malformed UTF-8 characters, possibly incorrectly encoded (error)

SOLVED!

The issue was in the function mb_detect_order(), this function just don't work as I was expecting. I was thinking this was a list of full supporting encoding order by mostly used in order to speed up the detection process.

But I just found that this function return just 2 encoding:

//print_r(mb_detect_order());
Array
(
[0] => ASCII
[1] => UTF-8
)

Which is almost completly useless in my case.
MB functions can detect much more charset.
You can check them out by run mb_list_encodings() and get the full list:

//print_r(mb_list_encodings());
Array
(
[0] => pass
[1] => auto
[2] => wchar
[3] => byte2be
[4] => byte2le
[5] => byte4be
[6] => byte4le
[7] => BASE64
[8] => UUENCODE
[9] => HTML-ENTITIES
[10] => Quoted-Printable
[11] => 7bit
[12] => 8bit
[13] => UCS-4
[14] => UCS-4BE
[15] => UCS-4LE
[16] => UCS-2
[17] => UCS-2BE
[18] => UCS-2LE
[19] => UTF-32
[20] => UTF-32BE
[21] => UTF-32LE
[22] => UTF-16
[23] => UTF-16BE
[24] => UTF-16LE
[25] => UTF-8
[26] => UTF-7
[27] => UTF7-IMAP
[28] => ASCII
[29] => EUC-JP
[30] => SJIS
[31] => eucJP-win
[32] => EUC-JP-2004
[33] => SJIS-win
[34] => SJIS-Mobile#DOCOMO
[35] => SJIS-Mobile#KDDI
[36] => SJIS-Mobile#SOFTBANK
[37] => SJIS-mac
[38] => SJIS-2004
[39] => UTF-8-Mobile#DOCOMO
[40] => UTF-8-Mobile#KDDI-A
[41] => UTF-8-Mobile#KDDI-B
[42] => UTF-8-Mobile#SOFTBANK
[43] => CP932
[44] => CP51932
[45] => JIS
[46] => ISO-2022-JP
[47] => ISO-2022-JP-MS
[48] => GB18030
[49] => Windows-1252
[50] => Windows-1254
[51] => ISO-8859-1
[52] => ISO-8859-2
[53] => ISO-8859-3
[54] => ISO-8859-4
[55] => ISO-8859-5
[56] => ISO-8859-6
[57] => ISO-8859-7
[58] => ISO-8859-8
[59] => ISO-8859-9
[60] => ISO-8859-10
[61] => ISO-8859-13
[62] => ISO-8859-14
[63] => ISO-8859-15
[64] => ISO-8859-16
[65] => EUC-CN
[66] => CP936
[67] => HZ
[68] => EUC-TW
[69] => BIG-5
[70] => CP950
[71] => EUC-KR
[72] => UHC
[73] => ISO-2022-KR
[74] => Windows-1251
[75] => CP866
[76] => KOI8-R
[77] => KOI8-U
[78] => ArmSCII-8
[79] => CP850
[80] => JIS-ms
[81] => ISO-2022-JP-2004
[82] => ISO-2022-JP-MOBILE#KDDI
[83] => CP50220
[84] => CP50220raw
[85] => CP50221
[86] => CP50222
)

I was in wrong, thinking that mb_detect_order was just an ordered version of this list. The mb_detect_order is just.... useless. In order to encode in UTF8 in the right way use the following code:

$my_encoding_list = [
"UTF-8",
"UTF-7",
"UTF-16",
"UTF-32",
"ISO-8859-16",
"ISO-8859-15",
"ISO-8859-10",
"ISO-8859-1",
"Windows-1254",
"Windows-1252",
"Windows-1251",
"ASCII",
//add yours preferred
];

//remove unsupported encodings
$encoding_list = array_intersect($my_encoding_list, mb_list_encodings());

//detect 'finally' the encoding
$this->encoding = mb_detect_encoding($source,$encoding_list,true);

This worked and solved my issue with bad data saved in the database.

Regex to detect invalid UTF-8 string

You can use this PCRE regular expression to check for a valid UTF-8 in a string. If the regex matches, the string contains invalid byte sequences. It's 100% portable because it doesn't rely on PCRE_UTF8 to be compiled in.

$regex = '/(
[\xC0-\xC1] # Invalid UTF-8 Bytes
| [\xF5-\xFF] # Invalid UTF-8 Bytes
| \xE0[\x80-\x9F] # Overlong encoding of prior code point
| \xF0[\x80-\x8F] # Overlong encoding of prior code point
| [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
| [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
| [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
| (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
| (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
| (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
| (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
| (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';

We can test it by creating a few variations of text:

// Overlong encoding of code point 0
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Overlong encoding of 5 byte encoding
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Overlong encoding of 6 byte encoding
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// High code-point without trailing characters
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)

etc...

In fact, since this matches invalid bytes, you could then use it in preg_replace to replace them away:

preg_replace($regex, '', $text); // Remove all invalid UTF-8 code-points

SYMFONY Serializer - Malformed UTF-8 characters, possibly incorrectly encoded

As the OP pointed out the problem was happening when using the strtolower function.

Quoting from the documentation:

strtolower ( string $string ) : string

Returns string with all alphabetic characters converted to lowercase.

Note that 'alphabetic' is determined by the current locale. This means that e.g. in the default "C" locale, characters such as umlaut-A (Ä) will not be converted.


To fix this, you could:

  • Option 1: Set the "correct" locale for your use case.
  • Option 2: Use mb_strtolower

About mb_strtolower:

By contrast to strtolower(), 'alphabetic' is determined by the Unicode character properties. Thus the behaviour of this function is not affected by locale settings and it can convert any characters that have 'alphabetic' property, such as A-umlaut (Ä).

Depending on your usecase/setup, the second way might not be the best way for you to use, since:

  1. You need to install/enable Multibyte String
  2. It's MUCH slower, than the strtolower function


Related Topics



Leave a reply



Submit