Ensuring valid UTF-8 in PHP

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid. It is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
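
To make the difference concrete, here is a minimal sketch that decodes the same byte, 0x93, under both encodings. It assumes the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
// The same single byte, interpreted under two different legacy encodings.
$byte = "\x93";

// Windows-1252: 0x93 is U+201C (left double quotation mark).
$fromCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

// ISO-8859-1: 0x93 is U+0093, an invisible C1 control character.
$fromLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($fromCp1252), PHP_EOL; // e2809c
echo bin2hex($fromLatin1), PHP_EOL; // c293
```

Both results are valid UTF-8, so picking the wrong source encoding does not produce an error, only the wrong character.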

Would this code ensure that a string is safe to insert into a UTF-8 encoded document?

You would certainly want to set the optional ‘strict’ parameter to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding would guess UTF-8 incorrectly before, though I don't know if that can still happen in strict mode.

If you want to be sure, do it yourself using the W3C-recommended regex:

// Hypothetical wrapper name: the return statements below need an enclosing function.
function ensure_utf8($string)
{
    if (preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs', $string))
        return $string; // already valid UTF-8
    else
        return iconv('CP1252', 'UTF-8', $string); // treat as Windows-1252
}

How do I detect if I have to apply UTF-8 decode or encode on a string?

I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back.

The most universal way I found to work well in every case was:

if (preg_match('!!u', $string))
{
   // This is UTF-8
}
else
{
   // Definitely not UTF-8
}
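
As a minimal sketch, the same check can be wrapped in a small helper and tried on known inputs. The function name looks_like_utf8 is hypothetical, not part of PHP.

```php
<?php
// Hypothetical helper name; wraps the preg_match() check shown above.
function looks_like_utf8(string $string): bool
{
    // With the /u modifier, PCRE validates the subject as UTF-8 first.
    // The empty pattern matches any valid string, so 1 means "valid";
    // invalid input makes preg_match() fail instead of returning 1.
    return preg_match('!!u', $string) === 1;
}

var_dump(looks_like_utf8("\xC2\xA2")); // valid 2-byte sequence (cent sign)
var_dump(looks_like_utf8("\xC2\x00")); // lead byte without a continuation byte
```

Comparing against 1 keeps the two failure values (0 and false) from being confused with a successful match.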

Sanitise UTF-8 in PHP

I think this is the best solution.

$raw_str = hex2bin('496e76616c6964206d61726b2096');
$sane_str = mb_convert_encoding($raw_str, 'UTF-8', 'UTF-8');
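
What the conversion does with the stray byte depends on the mbstring substitute character. The sketch below is one possible variation, assuming you want invalid bytes replaced with U+FFFD (the Unicode replacement character) rather than whatever the current setting is; that choice is an assumption, not part of the answer above.

```php
<?php
// "Invalid mark " followed by a stray 0x96 byte, which is not valid UTF-8.
$raw_str = hex2bin('496e76616c6964206d61726b2096');

$previous = mb_substitute_character();  // remember the current setting
mb_substitute_character(0xFFFD);        // substitute U+FFFD for invalid bytes
$sane_str = mb_convert_encoding($raw_str, 'UTF-8', 'UTF-8');
mb_substitute_character($previous);     // restore the earlier setting

var_dump(mb_check_encoding($raw_str, 'UTF-8'));  // the raw input is not valid UTF-8
var_dump(mb_check_encoding($sane_str, 'UTF-8')); // the converted string is
```

Note that this repairs the string by substitution; it does not recover the character the stray byte was meant to be.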

Ensure a string is UTF-8 encoded

Intro: another inconvenient truth

It is impossible to detect the encoding of unknown text with 100% accuracy and/or confidence.

In practice there will be cases all over the spectrum of possible outcomes: you can be pretty sure that multilingual text in UTF-8 will be correctly detected as such, while it is flat out impossible to detect which of the family of ISO-8859 encodings corresponds to some text -- and unless you are willing to do statistical analysis, it is not even possible to make an educated guess!

What do we have to work with?

With that out of the way, let's see what you can do. First of all, unless you are bringing custom tools into the fight you are limited by what mb_detect_encoding can do for you. Unfortunately, that's not a whole lot. The documentation of the sister function mb_detect_order states:

mbstring currently implements the following encoding detection
filters. If there is an invalid byte sequence for the following
encodings, encoding detection will fail.

UTF-8, UTF-7, ASCII, EUC-JP,SJIS, eucJP-win, SJIS-win, JIS,
ISO-2022-JP.

For ISO-8859-X, mbstring always detects as ISO-8859-X.

For UTF-16, UTF-32, UCS2 and UCS4, encoding detection will fail
always.

So, discounting the Japanese encodings, you basically have the capability to distinguish between UTF-8, UTF-7 and ASCII. You cannot detect ISO-8859-X because any text will be "recognized" as any of those encodings if you put it into consideration (i.e. you will have a 100% false positive rate -- not good), and the group which includes UTF-16 is simply not supported.

Unfortunately, the bad news doesn't end there. The order of the encodings matters too! Since text encoded in UTF-7 or ASCII is also valid UTF-8, placing UTF-8 at the front of the candidate list will ensure that's the only result you are ever going to get -- so it has to be avoided at all costs.

Since the default detection order is dependent on a php.ini setting, you should definitely not rely on that and move into a known state by setting your own detection order:

mb_detect_order('ASCII, UTF-8'); // I left UTF-7 out, but who cares?

So you can at least tell if your text is ASCII or UTF-8, right? Well, no. Not unless you specifically request that when you say "UTF-8", you really mean it:

$valid_utf8 = "\xC2\xA2";
$invalid_utf8 = "\xC2\x00";

mb_detect_order('UTF-8');
echo mb_detect_encoding($valid_utf8);   // "UTF-8": correct
echo mb_detect_encoding($invalid_utf8); // "UTF-8": WTF?!?!?!

The problem above is that unless you pass true for the $strict parameter, detection of UTF-8 is... a little over-optimistic.

Well, what can you actually do with this thing?

This is as good as it gets -- the correct way to detect encodings (just barely managing to keep using plural here):

$valid_utf8 = "\xC2\xA2";
$invalid_utf8 = "\xC2\x00";
$ascii = "hello world";

mb_detect_order('ASCII, UTF-8');
// var_dump() instead of echo, so that a false result is actually visible
var_dump(mb_detect_encoding($valid_utf8, mb_detect_order(), true));   // OK: "UTF-8"
var_dump(mb_detect_encoding($invalid_utf8, mb_detect_order(), true)); // OK: false
var_dump(mb_detect_encoding($ascii, mb_detect_order(), true));        // OK: "ASCII"

What can be done with text that isn't valid UTF-8?

Unless you have out-of-band information about that text, unfortunately nothing.

OK, that's not entirely true. There are a few things that you can do in practice:

See if there's a BOM at the beginning of the text. Probably there won't be, and even if there is, mathematically you might mistake a single-byte encoding for Unicode, but it's worth a shot.
See if it's a flavor of UTF-16. If a big majority of the even-numbered bytes have the same value, then you're likely looking at UTF-16 LE. If this happens for a majority of the odd-numbered bytes, you're likely looking at UTF-16 BE. Unfortunately, in both cases you can never be sure.
Assume that the text is in ISO-8859-X and do statistical analysis based on known properties of the script that corresponds to this encoding to see if the result is close to what you would expect. If it's close enough for some encodings in this class and way off for the others you can make an educated guess.
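
The first two checks can be sketched roughly as follows. Both function names and the 0.8 threshold are hypothetical choices for illustration, and the results are guesses, not proof.

```php
<?php
// Hypothetical helper: return an encoding name if a known BOM is present, else null.
function guess_encoding_from_bom(string $bytes): ?string
{
    // Longer signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    $boms = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];
    foreach ($boms as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

// Hypothetical helper: guess the UTF-16 byte order from how often one value
// dominates the first or the second byte of each 2-byte unit.
function guess_utf16_byte_order(string $bytes, float $threshold = 0.8): ?string
{
    $pairs = intdiv(strlen($bytes), 2);
    if ($pairs === 0) {
        return null;
    }
    $first = [];
    $second = [];
    for ($i = 0; $i < $pairs; $i++) {
        $a = ord($bytes[2 * $i]);     // 1st, 3rd, 5th ... byte (odd-numbered)
        $b = ord($bytes[2 * $i + 1]); // 2nd, 4th, 6th ... byte (even-numbered)
        $first[$a] = ($first[$a] ?? 0) + 1;
        $second[$b] = ($second[$b] ?? 0) + 1;
    }
    $firstDominant = max($first) / $pairs >= $threshold;
    $secondDominant = max($second) / $pairs >= $threshold;

    if ($secondDominant && !$firstDominant) {
        return 'UTF-16LE'; // high bytes come second
    }
    if ($firstDominant && !$secondDominant) {
        return 'UTF-16BE'; // high bytes come first
    }
    return null; // no clear signal either way
}

var_dump(guess_encoding_from_bom("\xEF\xBB\xBFhello"));
var_dump(guess_utf16_byte_order(mb_convert_encoding('hello world', 'UTF-16LE', 'UTF-8')));
```

The byte-order heuristic works best on text that stays within one small block of code points (Latin text, for instance); mixed-script text may give no clear signal.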

Do I need to make sure output data is valid UTF-8?

First of all I would never just blindly encode it as UTF-8 (possibly) a second time because this would lead to invalid chars as you say. I would certainly try to detect if the charset of the content is not UTF-8 before attempting such a thing.

Secondly, if the content in question comes from a source which you control and whose charset you control, such as a UTF-8 file or a database with UTF-8 in use in the tables and on the connection, I would trust that source unless something hints that I can't and there is something funky going on. If the content is coming from more or less random places outside your control, well, all the more reason to inspect it and possibly try to re-encode or transform it from other charsets if you can detect them. So the bottom line is: it depends.

As to whether this is a security issue or not, I wouldn't think so (at least I can't think of any scenarios where this could be exploitable), but I'll leave it to others to be definitive about that.

UTF-8 validation in PHP without using preg_match()

You can always use the Multibyte String functions:

If you want to use it a lot and possibly change it at some point:

1) First set the encoding you want to use in your config file

/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");

2) Check the String

if(mb_check_encoding($string))
{
    // do something
}

Or, if you don't plan on changing it, you can always just put the encoding straight into the function:

if(mb_check_encoding($string, 'UTF-8'))
{
    // do something
}

How to validate a utf sequence in PHP?

mb_check_encoding() is designed for this purpose:

mb_check_encoding($string, 'UTF-8');
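
A minimal sketch of how this check might be combined with a conversion step. The helper name to_utf8 and the Windows-1252 fallback are assumptions; use whatever legacy encoding you actually expect from your input.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, otherwise convert from an
// assumed fallback encoding (Windows-1252 here, which is only a guess).
function to_utf8(string $string, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string; // already valid, do not encode a second time
    }
    return mb_convert_encoding($string, 'UTF-8', $fallback);
}

echo bin2hex(to_utf8("caf\xE9")), PHP_EOL;     // 636166c3a9 (converted)
echo bin2hex(to_utf8("caf\xC3\xA9")), PHP_EOL; // 636166c3a9 (left unchanged)
```

Checking first is what prevents double encoding: a string that is already valid UTF-8 is returned untouched.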

How can I detect a malformed UTF-8 string in PHP?

First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.

You can make use of the UTF-8 validity check that is available in preg_match since PHP 4.3.5. It will not return 1 if an invalid string is given: current PHP versions return false, and preg_last_error() then reports PREG_BAD_UTF8_ERROR:

$isUTF8 = preg_match('//u', $string);

Another possibility is mb_check_encoding:

$validUTF8 = mb_check_encoding($string, 'UTF-8');

Another function you can use is mb_detect_encoding:

$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));

It's important to set the strict parameter to true.
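
As a rough sketch, the three checks can be run side by side on a few known inputs to confirm they agree. The sample strings and the $results array are illustrative only.

```php
<?php
$samples = [
    'cent sign'       => "\xC2\xA2",           // valid 2-byte sequence
    'plain ASCII'     => 'hello world',        // ASCII is always valid UTF-8
    'truncated pair'  => "\xC2\x00",           // lead byte, no continuation byte
    'stray 0x96 byte' => "Invalid mark \x96",  // lone continuation-range byte
];

$results = [];
foreach ($samples as $label => $string) {
    $results[$label] = [
        'preg_match'         => preg_match('//u', $string) === 1,
        'mb_check_encoding'  => mb_check_encoding($string, 'UTF-8'),
        'mb_detect_encoding' => mb_detect_encoding($string, 'UTF-8', true) !== false,
    ];
    printf(
        "%-16s preg=%s check=%s detect=%s\n",
        $label,
        var_export($results[$label]['preg_match'], true),
        var_export($results[$label]['mb_check_encoding'], true),
        var_export($results[$label]['mb_detect_encoding'], true)
    );
}
```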

Additionally, iconv allows you to change/drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it generates a notice; this behavior cannot be changed.)

echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE   : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;

You can use @ and check the length of the return string:

strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));

Check the examples on the iconv manual page as well.
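
If you would rather avoid the iconv notice altogether, one alternative is to drop invalid bytes with mbstring instead. This is a different technique from the iconv approach above, sketched under the assumption that the mbstring extension is available; the helper name strip_invalid_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: remove bytes that are not part of valid UTF-8,
// using mbstring rather than iconv.
function strip_invalid_utf8(string $string): string
{
    $previous = mb_substitute_character(); // remember the current setting
    mb_substitute_character('none');       // drop invalid bytes, no placeholder
    $clean = mb_convert_encoding($string, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);    // restore the earlier setting
    return $clean;
}

$string = "Invalid mark \x96";
$clean = strip_invalid_utf8($string);

// Same idea as the length comparison above: a shorter result means
// something invalid was removed.
var_dump(strlen($clean) === strlen($string));
```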
