How to handle user input of invalid UTF-8 characters
The accept-charset="UTF-8" attribute is only a guideline for browsers to follow; they are not forced to submit data that way. Poorly written form submission bots are a good example.
I usually ignore bad characters, either via iconv() or with the less reliable utf8_encode() / utf8_decode() functions (both deprecated as of PHP 8.2). If you use iconv(), you also have the option to transliterate bad characters.
Here is an example using iconv():
$str_ignore = iconv('UTF-8', 'UTF-8//IGNORE', $str);
$str_translit = iconv('UTF-8', 'UTF-8//TRANSLIT', $str);
If you want to display an error message to your users, I'd probably do it globally rather than once per received value. Something like this would probably do just fine:
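// Drop any byte sequences that are not valid UTF-8.
// Note: assumes every $_GET value is a string (no nested arrays such as ?a[]=1).
function utf8_clean($str)
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str);
}

$clean_GET = array_map('utf8_clean', $_GET);

// If cleaning changed anything, the input contained invalid UTF-8.
if (serialize($_GET) != serialize($clean_GET))
{
    $_GET = $clean_GET;
    $error_msg = 'Your data is not valid UTF-8 and has been stripped.';
}

// $_GET is clean!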
You may also want to normalize newlines and strip control characters (visible or not), like this:
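function Clean($string, $control = true)
{
    $string = iconv('UTF-8', 'UTF-8//IGNORE', $string);

    if ($control === true)
    {
        // Strip every control character, including tabs and newlines.
        return preg_replace('~\p{C}+~u', '', $string);
    }

    // Normalize \r\n and \r to \n, then strip control characters except \t and \n.
    return preg_replace(array('~\r\n?~', '~[^\P{C}\t\n]+~u'), array("\n", ''), $string);
}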
Code to convert from UTF-8 to Unicode code points:
function Codepoint($char)
{
    $result = null;
    // Convert to big-endian UCS-4 and read it as one unsigned 32-bit integer.
    $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

    if (is_array($codepoint) && array_key_exists(1, $codepoint))
    {
        $result = sprintf('U+%04X', $codepoint[1]);
    }

    return $result;
}

echo Codepoint('à'); // U+00E0
echo Codepoint('ひ'); // U+3072
It is probably faster than any other alternative, but I haven't tested it extensively.
Example:
// The string starts with U+FFFE and ends with U+FFFD ("\u{...}" escapes need PHP 7.0+).
$string = "\u{FFFE}hello world\u{FFFD}";

// U+FFFEhello worldU+FFFD
echo preg_replace_callback('/[\p{So}\p{Cf}\p{Co}\p{Cs}\p{Cn}]/u', 'Bad_Codepoint', $string);

// Callback: receives the match array and returns each match as U+XXXX.
function Bad_Codepoint($string)
{
    $result = array();

    foreach ((array) $string as $char)
    {
        $codepoint = unpack('N', iconv('UTF-8', 'UCS-4BE', $char));

        if (is_array($codepoint) && array_key_exists(1, $codepoint))
        {
            $result[] = sprintf('U+%04X', $codepoint[1]);
        }
    }

    return implode('', $result);
}
This may be what you were looking for.
Using iconv() to check for invalid UTF-8 characters: Detected an illegal character in input string
Another answer gives a better explanation of why iconv() throws an error:
The output character set (the second parameter) should be different
from the input character set (first param). If they are the same, then
if there are illegal UTF-8 characters in the string, iconv will reject
them as being illegal according to the input character set.
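Taken from a comment in the PHP manual, you can detect whether a string is encoded in UTF-8 with this function:
$valid = mb_detect_encoding($str, 'UTF-8', true); // returns 'UTF-8' on success, false otherwise
If you need a true boolean, mb_check_encoding($str, 'UTF-8') returns one.
More info: see mb_detect_encoding() in the PHP manual.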
R string, UTF-8 coding Swedish character treatment
Thanks to @Wiktor Stribiżew, this solution works best:
df$m <- gsub(" ", "", `Encoding<-`(as.character(df$m), "latin1"),fixed = TRUE)
How to remove invalid UTF-8 characters from a JavaScript string?
I use this simple and sturdy approach:
function cleanString(input) {
    var output = "";
    for (var i = 0; i < input.length; i++) {
        // Keep only code units in the ASCII range (0-127).
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}
Basically, all you really want are the ASCII chars 0-127, so just rebuild the string char by char. If it's a good char, keep it; if not, ditch it. This is pretty robust, and if sanitization is your goal, it's fast enough (in fact, it's really fast). Note that it also drops every valid non-ASCII character, such as accented letters.
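If the input is available as raw bytes (for example a Uint8Array read from a file or a request body), an alternative that preserves valid non-ASCII text might look like the sketch below. It relies on the standard TextDecoder API; the helper names stripInvalidUtf8 and isValidUtf8 are my own, not from the original answer, and the sketch assumes it is acceptable to lose any U+FFFD characters that were already in the input.

```javascript
// Hypothetical helper: decode bytes as UTF-8 and drop whatever could not be decoded.
function stripInvalidUtf8(bytes) {
    // Non-fatal decoding substitutes U+FFFD for every malformed sequence.
    const decoded = new TextDecoder("utf-8").decode(bytes);
    // Remove the substitutes (this also removes genuine U+FFFD characters).
    return decoded.replace(/\uFFFD/g, "");
}

// Hypothetical helper: report whether the bytes are well-formed UTF-8.
function isValidUtf8(bytes) {
    try {
        // With fatal: true, decode() throws a TypeError on malformed input.
        new TextDecoder("utf-8", { fatal: true }).decode(bytes);
        return true;
    } catch (err) {
        return false;
    }
}

// "h", "é" (0xC3 0xA9), a stray continuation byte (0x80), "!"
const sample = new Uint8Array([0x68, 0xc3, 0xa9, 0x80, 0x21]);
console.log(isValidUtf8(sample));      // false
console.log(stripInvalidUtf8(sample)); // hé!
```

Unlike the ASCII-only loop, this keeps the "é" and removes only the stray byte.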
Example invalid utf8 string?
Take a look at Markus Kuhn's UTF-8 decoder capability and stress test file.
You'll find examples of many UTF-8 irregularities, including lonely start bytes, continuation bytes missing, overlong sequences, etc.
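For a quick experiment without the full file, a few short byte sequences of those kinds can be written by hand. The ones below are my own picks, not copied from the test file, and the check uses the standard TextDecoder API in strict mode; decodesAsUtf8 is a hypothetical helper name.

```javascript
// Hand-picked malformed sequences, one per kind of irregularity (illustrative only).
const samples = {
    "lonely start byte": [0xc3],
    "missing continuation byte": [0xe2, 0x82], // U+20AC is E2 82 AC; last byte cut off
    "stray continuation byte": [0x80],
    "overlong encoding of '/'": [0xc0, 0xaf],  // '/' must be encoded as 0x2F
    "encoded surrogate U+D800": [0xed, 0xa0, 0x80],
};

// Hypothetical helper: true if the byte values form well-formed UTF-8.
function decodesAsUtf8(byteValues) {
    try {
        new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array(byteValues));
        return true;
    } catch (err) {
        return false;
    }
}

for (const [label, byteValues] of Object.entries(samples)) {
    // Every sample is rejected, so this prints "<label>: false" for each one.
    console.log(label + ": " + decodesAsUtf8(byteValues));
}
```

Sequences like these make handy fixtures when testing any of the cleaning functions above.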