Replacing invalid UTF-8 characters by question marks, mbstring.substitute_character seems ignored

You can use mb_convert_encoding(), or htmlspecialchars() with the ENT_SUBSTITUTE option (available since PHP 5.4). Of course, you can use preg_match() too. If you use the intl extension, you can use UConverter (available since PHP 5.5).

The recommended substitute character for an invalid byte sequence is U+FFFD. See "3.1.2 Substituting for Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations for the details.

When using mb_convert_encoding(), you can specify the substitute character by passing a Unicode code point to mb_substitute_character() or by setting the mbstring.substitute_character directive. The default substitute character is ? (QUESTION MARK, U+003F).

// REPLACEMENT CHARACTER (U+FFFD)
mb_substitute_character(0xFFFD);

function replace_invalid_byte_sequence($str)
{
    return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence2($str)
{
    return htmlspecialchars_decode(htmlspecialchars($str, ENT_SUBSTITUTE, 'UTF-8'));
}
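Note that mb_substitute_character() changes a global setting. If that matters in your code base, one option is to save the current value and restore it afterwards. The following is only a sketch: the helper name scrub_with_replacement_character() is made up for this example, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not part of the original answer): substitute U+FFFD
// for invalid sequences without leaving the global setting changed.
function scrub_with_replacement_character(string $str): string
{
    // With no argument, mb_substitute_character() returns the current setting
    // (an integer code point, or a string such as "none" or "long").
    $previous = mb_substitute_character();
    mb_substitute_character(0xFFFD);

    try {
        return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    } finally {
        // Restore whatever was configured before the call.
        mb_substitute_character($previous);
    }
}
```

The try/finally block is there so the previous setting is restored even if the conversion throws.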

UConverter offers both a procedural and an object-oriented API.

function replace_invalid_byte_sequence3($str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

function replace_invalid_byte_sequence4($str)
{
    return (new UConverter('UTF-8', 'UTF-8'))->convert($str);
}

When using preg_match(), you need to pay attention to the byte ranges in order to avoid the UTF-8 non-shortest form vulnerability. The range of the trail bytes changes depending on the lead byte.

lead byte: 0x00 - 0x7F, 0xC2 - 0xF4
trail byte: 0x80 (or 0x90 or 0xA0) - 0xBF (or 0x8F or 0x9F)

You can refer to the following resources to check the byte ranges.

  1. "Syntax of UTF-8 Byte Sequences" in RFC 3629
  2. "Table 3-7. Well-Formed UTF-8 Byte Sequences" in the Unicode Standard 6.1
  3. "Multilingual form encoding" in W3C Internationalization

The byte range table is as follows.

Code Points          First Byte  Second Byte  Third Byte  Fourth Byte
U+0000 - U+007F      00 - 7F
U+0080 - U+07FF      C2 - DF     80 - BF
U+0800 - U+0FFF      E0          A0 - BF      80 - BF
U+1000 - U+CFFF      E1 - EC     80 - BF      80 - BF
U+D000 - U+D7FF      ED          80 - 9F      80 - BF
U+E000 - U+FFFF      EE - EF     80 - BF      80 - BF
U+10000 - U+3FFFF    F0          90 - BF      80 - BF     80 - BF
U+40000 - U+FFFFF    F1 - F3     80 - BF      80 - BF     80 - BF
U+100000 - U+10FFFF  F4          80 - 8F      80 - BF     80 - BF
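The only column that varies with the lead byte is the second byte, so the table can be condensed into a small lookup. This is just an illustrative sketch; the function name utf8_second_byte_range() is an assumption of this example and is not used elsewhere in this answer.

```php
<?php
// Hypothetical helper: returns [min, max] for the second byte that may follow
// the given lead byte, or null if the byte cannot start a multi-byte sequence
// (ASCII, a stray trail byte, C0/C1, or F5 and above).
function utf8_second_byte_range(int $lead): ?array
{
    // Lead bytes with a restricted second-byte range come first.
    switch ($lead) {
        case 0xE0: return [0xA0, 0xBF]; // excludes overlong 3-byte forms
        case 0xED: return [0x80, 0x9F]; // excludes surrogates U+D800 - U+DFFF
        case 0xF0: return [0x90, 0xBF]; // excludes overlong 4-byte forms
        case 0xF4: return [0x80, 0x8F]; // excludes code points above U+10FFFF
    }

    // All remaining valid lead bytes accept the full trail-byte range.
    if ($lead >= 0xC2 && $lead <= 0xF3) {
        return [0x80, 0xBF];
    }

    return null;
}
```

For example, utf8_second_byte_range(0xE0) gives [0xA0, 0xBF], while 0xC0 and 0xC1 give null because they could only start non-shortest forms.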

How to replace an invalid byte sequence without breaking valid characters is described in "3.1.1 Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations and in "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in the Unicode Standard.

The Unicode Standard shows an example:

before: <61   F1 80 80  E1 80  C2    62    80    63    80    BF    64>
after:  <0061 FFFD      FFFD   FFFD  0062  FFFD  0063  FFFD  FFFD  0064>
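To make the example concrete, the two rows can be written as PHP strings. This sketch only builds the input and the expected output and checks them with the PCRE u modifier. It does not perform the replacement itself, and the variable names are made up for this example.

```php
<?php
// Input from the Unicode Standard example: 13 bytes, several of them ill-formed.
$before = "\x61" . "\xF1\x80\x80" . "\xE1\x80" . "\xC2" . "\x62"
        . "\x80" . "\x63" . "\x80" . "\xBF" . "\x64";

// Expected output: each ill-formed subsequence becomes one U+FFFD (EF BF BD).
$fffd  = "\xEF\xBF\xBD";
$after = 'a' . $fffd . $fffd . $fffd . 'b' . $fffd . 'c' . $fffd . $fffd . 'd';

// preg_match() with the u modifier fails (returns false) on invalid UTF-8.
$beforeIsValid = preg_match('//u', $before) === 1;
$afterIsValid  = preg_match('//u', $after) === 1;

// Number of code points in the expected output.
$codePointCount = preg_match_all('/./su', $after);
```

Note that "F1 80 80" and "E1 80" each collapse into a single U+FFFD, whereas the stray trail bytes 80 and BF get one U+FFFD each.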

Here is an implementation using preg_replace_callback() that follows the above rule.

function replace_invalid_byte_sequence5($str)
{
    // REPLACEMENT CHARACTER (U+FFFD)
    $substitute = "\xEF\xBF\xBD";
    $regex = '/
        ([\x00-\x7F]                          # U+0000 - U+007F
        |[\xC2-\xDF][\x80-\xBF]               # U+0080 - U+07FF
        | \xE0[\xA0-\xBF][\x80-\xBF]          # U+0800 - U+0FFF
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # U+1000 - U+CFFF, U+E000 - U+FFFF
        | \xED[\x80-\x9F][\x80-\xBF]          # U+D000 - U+D7FF
        | \xF0[\x90-\xBF][\x80-\xBF]{2}       # U+10000 - U+3FFFF
        |[\xF1-\xF3][\x80-\xBF]{3}            # U+40000 - U+FFFFF
        | \xF4[\x80-\x8F][\x80-\xBF]{2})      # U+100000 - U+10FFFF
        |(\xE0[\xA0-\xBF]                     # U+0800 - U+0FFF (invalid)
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]       # U+1000 - U+CFFF, U+E000 - U+FFFF (invalid)
        | \xED[\x80-\x9F]                     # U+D000 - U+D7FF (invalid)
        | \xF0[\x90-\xBF][\x80-\xBF]?         # U+10000 - U+3FFFF (invalid)
        |[\xF1-\xF3][\x80-\xBF]{1,2}          # U+40000 - U+FFFFF (invalid)
        | \xF4[\x80-\x8F][\x80-\xBF]?)        # U+100000 - U+10FFFF (invalid)
        |(.)                                  # invalid 1-byte
    /xs';

    // $matches[1]: valid character
    // $matches[2]: invalid 3-byte or 4-byte character
    // $matches[3]: invalid 1-byte

    $ret = preg_replace_callback($regex, function ($matches) use ($substitute) {
        if (isset($matches[2]) || isset($matches[3])) {
            return $substitute;
        }

        return $matches[1];
    }, $str);

    return $ret;
}

Alternatively, you can compare the bytes directly, which avoids preg_match()'s restriction on input size.

function replace_invalid_byte_sequence6($str)
{
    $size = strlen($str);
    $substitute = "\xEF\xBF\xBD";
    $ret = '';

    // These are filled in by reference by utf8_get_next_char().
    $pos = 0;
    $char = '';
    $char_size = 0;
    $valid = false;

    while (utf8_get_next_char($str, $size, $pos, $char, $char_size, $valid)) {
        $ret .= $valid ? $char : $substitute;
    }

    return $ret;
}

function utf8_get_next_char($str, $str_size, &$pos, &$char, &$char_size, &$valid)
{
    $valid = false;

    if ($str_size <= $pos) {
        return false;
    }

    if ($str[$pos] < "\x80") {
        // ASCII
        $valid = true;
        $char_size = 1;
    } else if ($str[$pos] < "\xC2") {
        // stray trail byte (80 - BF) or non-shortest form lead byte (C0, C1)
        $char_size = 1;
    } else if ($str[$pos] < "\xE0") {
        // 2-byte sequence
        if (!isset($str[$pos+1]) || $str[$pos+1] < "\x80" || "\xBF" < $str[$pos+1]) {
            $char_size = 1;
        } else {
            $valid = true;
            $char_size = 2;
        }
    } else if ($str[$pos] < "\xF0") {
        // 3-byte sequence; the second byte range depends on the lead byte
        $left = "\xE0" === $str[$pos] ? "\xA0" : "\x80";
        $right = "\xED" === $str[$pos] ? "\x9F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
            $char_size = 1;
        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
            $char_size = 2;
        } else {
            $valid = true;
            $char_size = 3;
        }
    } else if ($str[$pos] < "\xF5") {
        // 4-byte sequence; the second byte range depends on the lead byte
        $left = "\xF0" === $str[$pos] ? "\x90" : "\x80";
        $right = "\xF4" === $str[$pos] ? "\x8F" : "\xBF";

        if (!isset($str[$pos+1]) || $str[$pos+1] < $left || $right < $str[$pos+1]) {
            $char_size = 1;
        } else if (!isset($str[$pos+2]) || $str[$pos+2] < "\x80" || "\xBF" < $str[$pos+2]) {
            $char_size = 2;
        } else if (!isset($str[$pos+3]) || $str[$pos+3] < "\x80" || "\xBF" < $str[$pos+3]) {
            $char_size = 3;
        } else {
            $valid = true;
            $char_size = 4;
        }
    } else {
        // F5 - FF can never appear in UTF-8
        $char_size = 1;
    }

    $char = substr($str, $pos, $char_size);
    $pos += $char_size;

    return true;
}

Here is the test case.

function run(array $callables, array $arguments)
{
    return array_map(function ($callable) use ($arguments) {
        return array_map($callable, $arguments);
    }, $callables);
}

$data = [
    // Table 3-8. Use of U+FFFD in UTF-8 Conversion
    // http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
    "\x61"."\xF1\x80\x80"."\xE1\x80"."\xC2"."\x62"."\x80"."\x63"
    ."\x80"."\xBF"."\x64",

    // 'FULL MOON SYMBOL' (U+1F315) and invalid byte sequence
    "\xF0\x9F\x8C\x95"."\xF0\x9F\x8C"."\xF0\x9F\x8C"
];

var_dump(run([
    'replace_invalid_byte_sequence',
    'replace_invalid_byte_sequence2',
    'replace_invalid_byte_sequence3',
    'replace_invalid_byte_sequence4',
    'replace_invalid_byte_sequence5',
    'replace_invalid_byte_sequence6'
], $data));

As a note, mb_convert_encoding() has a bug: it can break a valid character that comes just after an invalid byte sequence, or remove an invalid byte sequence that follows valid characters without adding U+FFFD.

$data = [
    // U+20AC
    "\xE2\x82\xAC"."\xE2\x82\xAC"."\xE2\x82\xAC",
    "\xE2\x82"    ."\xE2\x82\xAC"."\xE2\x82\xAC",

    // U+24B62
    "\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xF0\xA4\xAD"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",
    "\xA4\xAD\xA2"    ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2",

    // 'FULL MOON SYMBOL' (U+1F315)
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C",
    "\xF0\x9F\x8C\x95" . "\xF0\x9F\x8C" . "\xF0\x9F\x8C"
];

Although preg_match() can be used instead of preg_replace_callback(), it has a limitation on the size of the input in bytes. See bug report #36463 for details. You can confirm it with the following test case.

str_repeat('a', 10000)

Finally, here are the results of my benchmark (elapsed time in seconds, as measured by the benchmark code).

mb_convert_encoding()    0.19628190994263
htmlspecialchars()       0.082863092422485
UConverter::transcode()  0.15999984741211
UConverter::convert()    0.29843020439148
preg_replace_callback()  0.63967490196228
direct comparison        0.71933102607727

Here is the benchmark code.

function timer(array $callables, array $arguments, $repeat = 10000)
{
    $ret = [];
    $save = $repeat;

    foreach ($callables as $key => $callable) {
        $start = microtime(true);

        do {
            array_map($callable, $arguments);
        } while ($repeat -= 1);

        $stop = microtime(true);
        $ret[$key] = $stop - $start;
        $repeat = $save;
    }

    return $ret;
}

$functions = [
    'mb_convert_encoding()' => 'replace_invalid_byte_sequence',
    'htmlspecialchars()' => 'replace_invalid_byte_sequence2',
    'UConverter::transcode()' => 'replace_invalid_byte_sequence3',
    'UConverter::convert()' => 'replace_invalid_byte_sequence4',
    'preg_replace_callback()' => 'replace_invalid_byte_sequence5',
    'direct comparison' => 'replace_invalid_byte_sequence6'
];

foreach (timer($functions, $data) as $description => $time) {
    echo $description, PHP_EOL,
         $time, PHP_EOL;
}

PHP: replace invalid characters in utf-8 string in

Use iconv:

$text = iconv("UTF-8", "UTF-8//IGNORE", $text);

See the manual for details.

replace non utf8 character

Try this. Just use the u modifier for Unicode:

// Remove every run of characters other than U+0020 - U+007F and newline.
$re = "/[^\\x20-\\x7F\\n]+/u";
$str = "Punjab me 1Train k niche 100 Sardar aa gaye..\n\n99 Mar gaye...\n\n1 Bach gaya";
$subst = "";

$result = preg_replace($re, $subst, $str);
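One caveat worth checking: with the u modifier, PCRE refuses subjects that are not valid UTF-8, so preg_replace() returns null instead of a cleaned string. A minimal sketch to illustrate this follows; the sample strings and variable names are made up for this example.

```php
<?php
$re = '/[^\x20-\x7F\n]+/u';

// Valid UTF-8: "caf" + U+00E9 + " ok". The non-ASCII character is removed.
$fromValid = preg_replace($re, '', "caf\xC3\xA9 ok");

// Invalid UTF-8: a lone E9 byte (Latin-1 style). The call fails as a whole.
$fromInvalid = preg_replace($re, '', "caf\xE9 ok");

// preg_last_error() reports why the last PCRE call failed.
$failedBecauseOfBadUtf8 = preg_last_error() === PREG_BAD_UTF8_ERROR;
```

So this approach strips non-ASCII characters from text that is already valid UTF-8, but it cannot by itself repair a string that contains invalid bytes.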

Remove non-utf8 characters from string

Using a regex approach:

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                      # ...one or more times
  )
| .                               # anything else
/x
END;
$text = preg_replace($regex, '$1', $text);

It searches for UTF-8 sequences, and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. Replacement is whatever was captured into group 1. This effectively removes all invalid bytes.
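Be aware that this pattern only checks the general lead/trail byte shape. As far as I can tell it does not reject non-shortest forms (lead bytes C0 and C1), surrogates, or lead bytes F5 to F7, so its output is not guaranteed to be well-formed UTF-8. A small sketch to illustrate, using a compact single-line copy of the pattern above (the variable names are made up for this example):

```php
<?php
// Same alternatives as the pattern above, written on one line.
$regex = '/((?:[\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3}){1,100})|./';

// "C0 AF" is a non-shortest form of "/", and "80" is a stray trail byte.
$input  = "a\xC0\xAFb\x80c";
$output = preg_replace($regex, '$1', $input);

// The stray 80 is removed, but C0 AF passes through unchanged.
$outputIsValidUtf8 = preg_match('//u', $output) === 1;
```

If strict validity is required, the lead-byte-specific ranges from RFC 3629 have to be spelled out in the pattern.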

It is possible to repair the string, by encoding the invalid bytes as UTF-8 characters. But if the errors are random, this could leave some strange symbols.

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                      # ...one or more times
  )
| ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
    if ($captures[1] != "") {
        // Valid byte sequence. Return unmodified.
        return $captures[1];
    }
    elseif ($captures[2] != "") {
        // Invalid byte of the form 10xxxxxx.
        // Encode as 11000010 10xxxxxx.
        return "\xC2".$captures[2];
    }
    else {
        // Invalid byte of the form 11xxxxxx.
        // Encode as 11000011 10xxxxxx.
        return "\xC3".chr(ord($captures[3])-64);
    }
}
preg_replace_callback($regex, "utf8replacer", $text);
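The two "encode as" branches amount to treating each invalid byte as an ISO-8859-1 character and converting it to UTF-8. Here is the per-byte mapping in isolation, as a sketch; the function name invalid_byte_to_utf8() is hypothetical and only used for this illustration.

```php
<?php
// Hypothetical helper: maps one byte to the UTF-8 encoding of the code point
// with the same number (U+0000 - U+00FF).
function invalid_byte_to_utf8(string $byte): string
{
    $code = ord($byte);

    if ($code < 0x80) {
        // ASCII is already valid UTF-8.
        return $byte;
    }

    if ($code < 0xC0) {
        // 10xxxxxx becomes 11000010 10xxxxxx.
        return "\xC2" . $byte;
    }

    // 11xxxxxx becomes 11000011 10xxxxxx.
    return "\xC3" . chr($code - 64);
}
```

For example, the lone byte E9 (é in ISO-8859-1) becomes C3 A9, which is é in UTF-8. That works well when the stray bytes really are Latin-1 text, and gives the strange symbols mentioned above otherwise.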

EDIT:

  • !empty(x) will match non-empty values ("0" is considered empty).
  • x != "" will match non-empty values, including "0".
  • x !== "" will match anything except "".

x != "" seems to be the best one to use in this case.
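The difference between the three checks is easy to see on a few sample values. A quick sketch (the helper name compare_checks() is made up for this example):

```php
<?php
// Hypothetical helper: evaluates the three checks discussed above for one value.
function compare_checks($x): array
{
    return [
        'not_empty' => !empty($x),  // !empty(x)
        'loose'     => $x != "",    // x != ""
        'strict'    => $x !== "",   // x !== ""
    ];
}

$forZero  = compare_checks("0");   // a captured "0" character
$forEmpty = compare_checks("");    // a group that did not participate
$forNull  = compare_checks(null);  // e.g. a missing value
```

Only the loose comparison accepts "0" while still rejecting both "" and null, which is why it fits the callback.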

I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.

Web scraper replacing some characters with question marks

The problem is probably character encoding. My guess is that the web page you're scraping is encoded in UTF-8, but somewhere along the line you're converting to ASCII.

Check out the excellent article called "What every developer should know about character encoding" for more details.

Update

You could try this, although the StreamReader should default to UTF-8 anyway:

var encoding = System.Text.Encoding.GetEncoding("utf-8");
StreamReader sr = new StreamReader(resp.GetResponseStream(), encoding);

iconv UTF-8//IGNORE still produces illegal character error

The output character set (the second parameter) should be different from the input character set (first param). If they are the same, then if there are illegal UTF-8 characters in the string, iconv will reject them as being illegal according to the input character set.
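Since the behaviour of iconv() with //IGNORE can apparently differ between iconv implementations, a defensive approach is to check the return value and fall back to something else when it fails. The sketch below is one possible way to do that, not a definitive fix. The function name utf8_drop_invalid() is made up, and the fallback uses htmlspecialchars() with ENT_IGNORE, which discards invalid sequences.

```php
<?php
// Hypothetical helper: returns $text with invalid UTF-8 sequences dropped.
function utf8_drop_invalid(string $text): string
{
    // Try iconv first; suppress the notice it may raise on illegal input.
    $converted = function_exists('iconv')
        ? @iconv('UTF-8', 'UTF-8//IGNORE', $text)
        : false;

    // Accept the result only if PCRE agrees that it is valid UTF-8.
    if ($converted !== false && preg_match('//u', $converted) === 1) {
        return $converted;
    }

    // Fallback: escape while discarding invalid sequences, then undo the escaping.
    $escaped = htmlspecialchars($text, ENT_IGNORE | ENT_NOQUOTES, 'UTF-8');

    return htmlspecialchars_decode($escaped, ENT_NOQUOTES);
}
```

Keep in mind that silently dropping bytes can have security implications, which is why UTR #36 recommends substituting U+FFFD instead.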

Some SQL data in Russian is outputted with symbol � at the end

It sounds like a simple truncation problem. Varchar(150) says the field holds at most 150 bytes. UTF-8 can use more than one byte per symbol: for example, each Cyrillic letter uses 2 bytes, while a space or a comma uses 1 byte. So if the string is longer than 150 bytes, a Cyrillic letter may be cut in the middle. For example, in your second sentence, the small и has the UTF-8 encoding d0 b8, but it was truncated to d0, which is not a valid character on its own, resulting in the � you see. There is nothing you can do about it; the data is already lost. You can only prettify the display by removing a standalone byte in the range C2..DF from the end of the string.
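In PHP, that clean-up could look roughly like the following. It is a sketch under the same assumption as above (only 2-byte characters are affected), and the function name strip_truncated_tail() is made up for this example.

```php
<?php
// Hypothetical helper: removes a dangling 2-byte lead byte (C2 - DF) from the
// very end of the string. No u modifier, so the pattern works on raw bytes.
function strip_truncated_tail(string $s): string
{
    // \z anchors at the absolute end of the subject.
    return preg_replace('/[\xC2-\xDF]\z/', '', $s) ?? $s;
}

// "и" (D0 B8) followed by a second "и" that lost its trail byte.
$truncated = "\xD0\xB8\xD0";
$cleaned   = strip_truncated_tail($truncated);
```

A complete character never ends in a byte from C2 to DF, so valid strings pass through unchanged.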

As for the unidentified letters you see, there are a lot of factors. The database encoding, the connection encoding, the table collation and, if you're using a web interface, the display encoding all contribute to the mess, and the string may be re-encoded several times before you see it. It's also possible that the data is recoded in some way on insert and decoded before display (not the worst thing I've seen in legacy code, really). You'll have to experiment and find the proper combination yourself.

PHP: Split multibyte string (word) into separate characters

Try a regular expression with the 'u' option, for example:

$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
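For instance, with a three-letter Cyrillic word written as escaped bytes (a small sketch; the sample word is arbitrary):

```php
<?php
// "при": three Cyrillic letters, two bytes each in UTF-8.
$string = "\xD0\xBF\xD1\x80\xD0\xB8";

// Split between code points; the empty pattern matches at every character boundary.
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

// For comparison, str_split() works on bytes and cuts the letters in half.
$bytes = str_split($string);
```

Here $chars has 3 elements, while $bytes has 6. Note that preg_split() returns false if $string is not valid UTF-8.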

why does str_replace replaces my character with a semicolon?

Your text is not "Veronica Mars". It's probably this:

&quot;Veronica Mars&quot;

If you strip &quot, only the ; remains.

What you see in the browser screen is the result of rendering some HTML code.
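Assuming the source really contains the &quot; entity, the effect is easy to reproduce, and decoding the entities before stripping avoids it. A small sketch (the variable names are made up for this example):

```php
<?php
// What the HTML source is assumed to contain.
$html = '&quot;Veronica Mars&quot;';

// Stripping "&quot" without the semicolon leaves the semicolons behind.
$stripped = str_replace('&quot', '', $html);

// Decoding first turns the entities into real quote characters...
$decoded = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

// ...which can then be removed cleanly.
$clean = str_replace('"', '', $decoded);
```

$stripped ends up as ;Veronica Mars; while $clean is the plain title.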


