How can I detect a malformed UTF-8 string in PHP?
First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.
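You can use the UTF-8 validity check built into preg_match [PHP Manual] (available since PHP 4.3.5) by adding the u modifier. If the subject is not valid UTF-8, the call does not match: it returns 0 or false rather than 1, depending on the PHP version, with no additional information unless you call preg_last_error():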
$isUTF8 = preg_match('//u', $string);
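As a minimal sketch of the check above, the result can be compared strictly against 1 so that both 0 and false count as "not valid". The helper name is_valid_utf8 is hypothetical, not part of PHP; preg_last_error() and PREG_BAD_UTF8_ERROR are standard PCRE facilities.

```php
<?php
// Hypothetical helper: true only when $string is well-formed UTF-8.
function is_valid_utf8(string $string): bool
{
    // The empty pattern matches any valid subject, so valid input yields 1.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // valid two-byte sequence
var_dump(is_valid_utf8("\xFF"));        // 0xFF never occurs in UTF-8

// After a failed call, preg_last_error() tells a bad subject apart
// from other PCRE failures.
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);
```

Comparing with === 1 avoids relying on whether a given PHP version reports failure as 0 or false.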
Another possibility is mb_check_encoding [PHP Manual]:
$validUTF8 = mb_check_encoding($string, 'UTF-8');
Another function you can use is mb_detect_encoding [PHP Manual]:
$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));
It's important to set the strict parameter to true.
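The two mbstring checks can be run side by side, as in the rough sketch below. It assumes the mbstring extension is loaded; the function name mb_utf8_checks is made up for illustration.

```php
<?php
// Hypothetical helper: runs both mbstring-based checks on one string.
// Assumes the mbstring extension is enabled.
function mb_utf8_checks(string $string): array
{
    return [
        // Direct validity check.
        'check'  => mb_check_encoding($string, 'UTF-8'),
        // Strict detection returns false when UTF-8 does not fit.
        'detect' => mb_detect_encoding($string, 'UTF-8', true) !== false,
    ];
}

var_dump(mb_utf8_checks("na\xC3\xAFve")); // well-formed input
var_dump(mb_utf8_checks("\xC3\x28"));     // lead byte followed by a non-continuation byte
```

Both checks should agree for these inputs; mb_check_encoding is the more direct way to ask the question.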
Additionally, iconv [PHP Manual] allows you to change or drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it raises a notice; this behavior cannot be changed.)
echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;
You can suppress the notice with @ and compare the length of the returned string with the original:
strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));
Check the examples on the iconv manual page as well.
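The length comparison above can be wrapped in a small function. The sketch below compares the whole round-tripped string instead of only its length, which also covers iconv implementations that return false for invalid input instead of dropping bytes. It assumes the iconv extension is available; the name iconv_is_valid_utf8 is hypothetical.

```php
<?php
// Hypothetical helper built on the UTF-8 to UTF-8 round trip.
// Assumes the iconv extension is available.
function iconv_is_valid_utf8(string $string): bool
{
    // Depending on the iconv implementation, invalid input either loses
    // bytes or makes iconv() return false; both fail the comparison.
    $roundTrip = @iconv('UTF-8', 'UTF-8//IGNORE', $string);

    return $roundTrip === $string;
}

var_dump(iconv_is_valid_utf8("na\xC3\xAFve")); // well-formed input
var_dump(iconv_is_valid_utf8("\xC3\x28"));     // broken two-byte sequence
```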
How do I detect if I have to apply UTF-8 decode or encode on a string?
I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back.
The most universal way I found to work well in every case was:
if (preg_match('!!u', $string))
{
    // This is UTF-8
}
else
{
    // Definitely not UTF-8
}
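To turn that check into the encode-or-not decision the question asks about, one possible sketch is shown below. It assumes that anything failing the UTF-8 check is ISO-8859-1, which is only a guess about the data and should be replaced with whatever legacy encoding you actually expect. The function name ensure_utf8 and its $fallback parameter are hypothetical, and the conversion needs the mbstring extension.

```php
<?php
// Hypothetical helper: returns $string as UTF-8.
// ASSUMPTION: input that is not valid UTF-8 is encoded in $fallback.
function ensure_utf8(string $string, string $fallback = 'ISO-8859-1'): string
{
    if (preg_match('//u', $string) === 1) {
        return $string; // already valid UTF-8, leave untouched
    }

    // Requires mbstring; converts from the assumed legacy encoding.
    return mb_convert_encoding($string, 'UTF-8', $fallback);
}

var_dump(bin2hex(ensure_utf8("caf\xE9")));     // Latin-1 input gets converted
var_dump(bin2hex(ensure_utf8("caf\xC3\xA9"))); // UTF-8 input passes through
```

Checking first avoids double-encoding strings that are already UTF-8.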
PHP JSON_encode() is getting Malformed UTF-8 characters, possibly incorrectly encoded (error)
SOLVED!
The issue was with mb_detect_order(): this function just doesn't work the way I expected. I thought it returned the full list of supported encodings, ordered by how commonly they are used, to speed up the detection process.
But I found that this function returns just two encodings:
//print_r(mb_detect_order());
Array
(
[0] => ASCII
[1] => UTF-8
)
Which is almost completely useless in my case.
The mb functions can detect many more character sets.
You can get the full list by running mb_list_encodings():
//print_r(mb_list_encodings());
Array
(
[0] => pass
[1] => auto
[2] => wchar
[3] => byte2be
[4] => byte2le
[5] => byte4be
[6] => byte4le
[7] => BASE64
[8] => UUENCODE
[9] => HTML-ENTITIES
[10] => Quoted-Printable
[11] => 7bit
[12] => 8bit
[13] => UCS-4
[14] => UCS-4BE
[15] => UCS-4LE
[16] => UCS-2
[17] => UCS-2BE
[18] => UCS-2LE
[19] => UTF-32
[20] => UTF-32BE
[21] => UTF-32LE
[22] => UTF-16
[23] => UTF-16BE
[24] => UTF-16LE
[25] => UTF-8
[26] => UTF-7
[27] => UTF7-IMAP
[28] => ASCII
[29] => EUC-JP
[30] => SJIS
[31] => eucJP-win
[32] => EUC-JP-2004
[33] => SJIS-win
[34] => SJIS-Mobile#DOCOMO
[35] => SJIS-Mobile#KDDI
[36] => SJIS-Mobile#SOFTBANK
[37] => SJIS-mac
[38] => SJIS-2004
[39] => UTF-8-Mobile#DOCOMO
[40] => UTF-8-Mobile#KDDI-A
[41] => UTF-8-Mobile#KDDI-B
[42] => UTF-8-Mobile#SOFTBANK
[43] => CP932
[44] => CP51932
[45] => JIS
[46] => ISO-2022-JP
[47] => ISO-2022-JP-MS
[48] => GB18030
[49] => Windows-1252
[50] => Windows-1254
[51] => ISO-8859-1
[52] => ISO-8859-2
[53] => ISO-8859-3
[54] => ISO-8859-4
[55] => ISO-8859-5
[56] => ISO-8859-6
[57] => ISO-8859-7
[58] => ISO-8859-8
[59] => ISO-8859-9
[60] => ISO-8859-10
[61] => ISO-8859-13
[62] => ISO-8859-14
[63] => ISO-8859-15
[64] => ISO-8859-16
[65] => EUC-CN
[66] => CP936
[67] => HZ
[68] => EUC-TW
[69] => BIG-5
[70] => CP950
[71] => EUC-KR
[72] => UHC
[73] => ISO-2022-KR
[74] => Windows-1251
[75] => CP866
[76] => KOI8-R
[77] => KOI8-U
[78] => ArmSCII-8
[79] => CP850
[80] => JIS-ms
[81] => ISO-2022-JP-2004
[82] => ISO-2022-JP-MOBILE#KDDI
[83] => CP50220
[84] => CP50220raw
[85] => CP50221
[86] => CP50222
)
I was wrong to think that mb_detect_order was just an ordered version of this list. The default mb_detect_order is just... useless here. To detect the source encoding correctly before converting to UTF-8, use the following code:
$my_encoding_list = [
    "UTF-8",
    "UTF-7",
    "UTF-16",
    "UTF-32",
    "ISO-8859-16",
    "ISO-8859-15",
    "ISO-8859-10",
    "ISO-8859-1",
    "Windows-1254",
    "Windows-1252",
    "Windows-1251",
    "ASCII",
    // add your preferred encodings here
];
// remove encodings that this mbstring build does not support
$encoding_list = array_intersect($my_encoding_list, mb_list_encodings());
// finally, detect the encoding (strict mode)
$this->encoding = mb_detect_encoding($source, $encoding_list, true);
This worked and solved my issue with bad data saved in the database.
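The detection step above only identifies the encoding; the string still has to be converted before json_encode() will accept it. A minimal sketch of the full path follows. The function name to_utf8 and the two-entry candidate list are illustrative assumptions, not the original poster's code, and mbstring is required.

```php
<?php
// Hypothetical helper: detect the encoding from a candidate list, then convert.
// Assumes the mbstring extension is enabled.
function to_utf8(string $source, array $candidates): string
{
    // Keep only the encodings this mbstring build knows about.
    $supported = array_values(array_intersect($candidates, mb_list_encodings()));

    $encoding = mb_detect_encoding($source, $supported, true);
    if ($encoding === false) {
        throw new InvalidArgumentException('No candidate encoding fits the input.');
    }

    return $encoding === 'UTF-8'
        ? $source
        : mb_convert_encoding($source, 'UTF-8', $encoding);
}

// "caf\xE9" is "café" in ISO-8859-1 and is not valid UTF-8.
$row = ['name' => to_utf8("caf\xE9", ['UTF-8', 'ISO-8859-1'])];
echo json_encode($row), PHP_EOL; // {"name":"caf\u00e9"}
```

The order of the candidate list matters: single-byte encodings such as ISO-8859-1 accept almost any byte string, so UTF-8 should come first.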
Regex to detect invalid UTF-8 string
You can use this PCRE regular expression to check a string for invalid UTF-8. If the regex matches, the string contains invalid byte sequences. It is portable because it works on raw bytes and does not rely on PCRE having been compiled with UTF-8 support (PCRE_UTF8).
$regex = '/(
[\xC0-\xC1] # Invalid UTF-8 Bytes
| [\xF5-\xFF] # Invalid UTF-8 Bytes
| \xE0[\x80-\x9F] # Overlong encoding of prior code point
| \xF0[\x80-\x8F] # Overlong encoding of prior code point
| [\xC2-\xDF](?![\x80-\xBF]) # Invalid UTF-8 Sequence Start
| [\xE0-\xEF](?![\x80-\xBF]{2}) # Invalid UTF-8 Sequence Start
| [\xF0-\xF4](?![\x80-\xBF]{3}) # Invalid UTF-8 Sequence Start
| (?<=[\x00-\x7F\xF5-\xFF])[\x80-\xBF] # Invalid UTF-8 Sequence Middle
| (?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF] # Overlong Sequence
| (?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF]) # Short 3 byte sequence
| (?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2}) # Short 4 byte sequence
| (?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF]) # Short 4 byte sequence (2)
)/x';
We can test it by creating a few variations of text:
// Overlong encoding of code point 0
$text = chr(0xC0) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// 5-byte sequence (lead byte 0xF8 is not allowed in UTF-8)
$text = chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// 6-byte sequence (lead byte 0xFC is not allowed in UTF-8)
$text = chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80);
var_dump(preg_match($regex, $text)); // int(1)
// Lead byte without a valid continuation byte
$text = chr(0xD0) . chr(0x01);
var_dump(preg_match($regex, $text)); // int(1)
etc...
In fact, since this matches invalid bytes, you can pass it to preg_replace to strip them:
$text = preg_replace($regex, '', $text); // Remove all invalid UTF-8 bytes
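As a cross-check for a byte-level pattern like the one above, the opposite approach can be sketched: a pattern that matches only when the entire string is well-formed. This is a different technique from the answer's regex. The pattern below follows a widely circulated one commonly attributed to a W3C internationalization FAQ; the function name regex_is_valid_utf8 is hypothetical. Unlike the pattern above, it also rejects UTF-16 surrogates (0xED 0xA0..0xBF) and code points above U+10FFFF.

```php
<?php
// Hypothetical helper: true when every byte of $string belongs to a
// well-formed UTF-8 sequence. Works on raw bytes, no u modifier needed.
function regex_is_valid_utf8(string $string): bool
{
    $pattern = '/\A(?:
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, no overlongs
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, no overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes, straight
        | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, planes 1 to 3
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes, planes 4 to 15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, plane 16
    )*+\z/x';

    return preg_match($pattern, $string) === 1;
}

var_dump(regex_is_valid_utf8("h\xC3\xA9llo"));  // well-formed
var_dump(regex_is_valid_utf8("\xC0\x80"));      // overlong encoding of code point 0
var_dump(regex_is_valid_utf8("\xED\xA0\x80"));  // encoded surrogate
```

On very long inputs a repeated group like this may run into PCRE backtracking or stack limits, in which case preg_match() returns false and the helper reports the string as invalid, so treat it as a sketch for short strings.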
SYMFONY Serializer - Malformed UTF-8 characters, possibly incorrectly encoded
As the OP pointed out, the problem occurred when using the strtolower function.
Quoting from the documentation:
strtolower ( string $string ) : string
Returns string with all alphabetic characters converted to lowercase.
Note that 'alphabetic' is determined by the current locale. This means that e.g. in the default "C" locale, characters such as umlaut-A (Ä) will not be converted.
To fix this, you could:
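- Option 1: Set the "correct" locale for your use case. (This only applies before PHP 8.2; from 8.2 on, strtolower ignores the locale and converts ASCII characters only.)
- Option 2: Use mb_strtolower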
About mb_strtolower:
By contrast to strtolower(), 'alphabetic' is determined by the Unicode character properties. Thus the behaviour of this function is not affected by locale settings and it can convert any characters that have 'alphabetic' property, such as A-umlaut (Ä).
Depending on your use case and setup, the second option might not be the best choice, since:
- You need to install/enable the Multibyte String (mbstring) extension
- It's MUCH slower than the strtolower function
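A small sketch of the second option, assuming the mbstring extension is enabled and the input is UTF-8. The sample string is made up for illustration.

```php
<?php
// "ÄPFEL" written as explicit UTF-8 bytes so the file encoding does not matter.
$input = "\xC3\x84PFEL";

// Passing the encoding explicitly avoids depending on mb_internal_encoding().
$lower = mb_strtolower($input, 'UTF-8');

var_dump($lower === "\xC3\xA4pfel");        // "äpfel"
var_dump(mb_check_encoding($lower, 'UTF-8')); // result is still valid UTF-8
var_dump(json_encode($lower) !== false);      // so json_encode() accepts it
```

Because mb_strtolower works on code points rather than single bytes, it does not split a multibyte character the way a byte-oriented, locale-dependent conversion can.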