Str_Replace() on Multibyte Strings Dangerous

Is str_replace() on multibyte strings dangerous?

You're right: using a single-byte string function on a multibyte string can produce unexpected results. Use the multibyte string functions instead, for example mb_ereg_replace() or mb_split():

$string = mb_ereg_replace('"', '\\"', $string);
$string = implode('\\"', mb_split('"', $string));

Here's an mb_replace() implementation using the split-join approach:

function mb_replace($search, $replace, $subject, &$count = 0) {
    if (!is_array($search) && is_array($replace)) {
        return false;
    }
    if (is_array($subject)) {
        // call mb_replace for each single string in $subject
        foreach ($subject as &$string) {
            $string = mb_replace($search, $replace, $string, $c);
            $count += $c;
        }
    } elseif (is_array($search)) {
        if (!is_array($replace)) {
            foreach ($search as &$string) {
                $subject = mb_replace($string, $replace, $subject, $c);
                $count += $c;
            }
        } else {
            $n = max(count($search), count($replace));
            while ($n--) {
                $subject = mb_replace(current($search), current($replace), $subject, $c);
                $count += $c;
                next($search);
                next($replace);
            }
        }
    } else {
        $parts = mb_split(preg_quote($search), $subject);
        $count = count($parts) - 1;
        $subject = implode($replace, $parts);
    }
    return $subject;
}

As far as the combination of parameters is concerned, this function should behave like the single-byte str_replace().
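For example (a small usage sketch, not part of the original answer; it assumes the mb_replace() above with the regex encoding set to UTF-8):

mb_regex_encoding('UTF-8');

$subject = 'Grüße, "Welt"';
$escaped = mb_replace('"', '\\"', $subject, $count);

var_dump($escaped); // string(17) "Grüße, \"Welt\""
var_dump($count);   // int(2)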

Can str_replace be safely used on a UTF-8 encoded string if it's only given valid UTF-8 encoded strings as arguments?

Yes. UTF-8 is deliberately designed to allow this and other similar non-Unicode-aware processing.

In UTF-8, any non-ASCII byte sequence representing a valid character always begins with a byte in the range \xC0-\xFF, and a byte in that range never appears anywhere else in a sequence, so a valid UTF-8 search string can never match just part of a character.

This is not the case for older multibyte encodings, where different parts of a byte sequence are indistinguishable. This caused a lot of problems, for example trying to replace an ASCII backslash in a Shift-JIS string (where byte \x5C might be the second byte of a character sequence representing something else).
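A quick sanity check (my own sketch, not from the answer above), assuming both haystack and needle are valid UTF-8:

$s = "Füße & \"quotes\"";

// The needle '"' can only ever match real quote characters: the UTF-8 bytes
// of 'ü' and 'ß' are all outside the ASCII range, so nothing gets split apart.
var_dump(str_replace('"', '\\"', $s)); // string(19) "Füße & \"quotes\""

// A multibyte needle is likewise matched as a whole byte sequence.
var_dump(str_replace('ü', 'ue', $s));  // string(17) "Fueße & "quotes""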

Check if str_replace Should Execute First to Avoid Duplicate Strings

You should first check whether the string already contains the replacement markup, and if it does, just don't replace:

function superscript_R( $value, $post_id, $field ) {
    // Only wrap the ® sign if it has not been wrapped in <sup> already,
    // otherwise repeated runs would keep nesting the markup.
    if ( is_string($value) && strpos($value, '<sup>') === false ) {
        $value = str_replace(['®', '®'], '<sup>®</sup>', $value);
    }
    return $value;
}
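Judging by the ($value, $post_id, $field) signature this looks like an ACF value filter. If so (an assumption on my part, the answer doesn't say), hooking it up might look like this:

// Hypothetical registration: the 'acf/format_value/type=text' hook name is
// assumed from the callback signature, not stated in the original answer.
add_filter('acf/format_value/type=text', 'superscript_R', 10, 3);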

str_replace PHP script can't handle foreign characters such as umlauts robustly

Did you set the charset in PHP?

Try this:

header('Content-Type: text/html; charset=utf-8');

If that doesn't work, check whether your data is already UTF-8 encoded before running str_replace(); if it isn't, convert it:

$data = utf8_encode($data);

In the opposite case, use:

$data = utf8_decode($data);
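Note that utf8_encode()/utf8_decode() only convert between ISO-8859-1 and UTF-8 and are deprecated as of PHP 8.2. A hedged alternative sketch using the mbstring functions, assuming the non-UTF-8 input is ISO-8859-1:

// Normalise $data to UTF-8 before the replacement. The ISO-8859-1 fallback
// is an assumption; adjust it to whatever encoding your source actually uses.
if (!mb_check_encoding($data, 'UTF-8')) {
    $data = mb_convert_encoding($data, 'UTF-8', 'ISO-8859-1');
}
$data = str_replace('ä', 'ae', $data);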

Hope it helps!

Detecting and removing multibyte strings in R

This is probably an encoding issue, so try changing the encoding while loading the file. Try something like this:

df <- read.csv(file_path,
               encoding = "iso-8859-1",  # try different encodings as needed
               header = TRUE,
               stringsAsFactors = FALSE)

is PHP str_word_count() multibyte safe?

I'd say you guessed right. And indeed there are space characters in UTF-8 which are not part of US-ASCII. To give you an example of such spaces:

  • Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 Bytes in UTF-8: 0xC2 0xA0 (c2a0)

And perhaps as well:

  • Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 Bytes in UTF-8: 0xC2 0x85 (c285)
  • Unicode Character 'LINE SEPARATOR' (U+2028): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
  • Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 Bytes in UTF-8: 0xE2 0x80 0xA9 (e280a9)

Anyway, the first one, the 'NO-BREAK SPACE' (U+00A0), is a good example as it is also part of the Latin-X charsets. And the PHP manual already hints that str_word_count() is locale dependent.

If we want to put this to the test, we can set the locale to UTF-8 and pass in an invalid UTF-8 string containing a raw \xA0 byte. If that byte still counts as a word-breaking character, the function is clearly not UTF-8 safe and hence not multibyte safe (in the same loosely defined sense as in the question):

<?php
/**
 * is PHP str_word_count() multibyte safe?
 * @link https://stackoverflow.com/q/8290537/367456
 */

echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n\n";

$test = "aword\xA0bword aword";
$result = str_word_count($test, 2);

var_dump($result);

Output:

New Locale: en_US.utf8

array(3) {
  [0]=>
  string(5) "aword"
  [6]=>
  string(5) "bword"
  [12]=>
  string(5) "aword"
}

As this demo shows, the function completely fails to honour the locale promise it makes on its manual page. (I neither wonder nor complain about this; in PHP, whenever you read that a function is locale-specific, run for your life and find one that is not.) I exploit that here to demonstrate that the function does nothing at all with regard to the UTF-8 character encoding.

For UTF-8, you should instead take a look at the PCRE extension:

  • Matching Unicode letter characters in PCRE/PHP

PCRE has a good understanding of Unicode, and of UTF-8 in PHP specifically. It can also be quite fast if you craft the regular expression pattern carefully.
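For illustration, here is a minimal sketch (mine, not from the original answer) of a UTF-8 aware word count with PCRE; the /u modifier switches the pattern into UTF-8 mode and \p{L} matches any Unicode letter:

function utf8_word_count(string $text): int
{
    // Runs of Unicode letters (optionally joined by ' or -) count as words;
    // anything else, including the NO-BREAK SPACE (U+00A0), separates them.
    return (int) preg_match_all('/\p{L}+(?:[\'-]\p{L}+)*/u', $text);
}

$text = "naïve\u{00A0}café bord";
echo utf8_word_count($text), "\n"; // 3
echo str_word_count($text), "\n";  // typically 4: the non-ASCII bytes split the words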

Iterating backwards Multibyte String - C

If you need to deal with any theoretically-possible multibyte encoding, then it is not possible to iterate backwards. There is no requirement that a multibyte encoding have the property that no proper suffix of a valid multibyte sequence is a valid multibyte sequence. (As it happens, your algorithm requires an even stronger property, because you might recognize a multibyte sequence starting in the middle of one valid sequence and continuing into the next sequence.)

Also, you cannot predict (again, in general) the multibyte state if the multibyte encoding has shift states. If you back up over a multibyte sequence which changes the state, you have no idea what the previous state was.

UTF-8 was designed with this in mind. It does not have shift states, and it clearly marks the octets (bytes) which can start a sequence. So if you know that the multibyte encoding is UTF-8, you can easily iterate backwards: just scan backwards for a byte outside the range 0x80-0xBF. (UTF-16 and UTF-32 are also easily iterated in either direction, but you need to read them as two-/four-byte code units, respectively, because a misaligned read is quite likely to still look like a valid code point.)

If you don't know that the multibyte encoding is UTF-8, then there is simply no robust algorithm to iterate backwards. All you can do is iterate forwards and remember the starting position and mbstate of each character.

Fortunately, these days there is really little reason to support multibyte encodings other than Unicode encodings.
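A minimal sketch of that byte-scanning idea (in PHP rather than C, for illustration), assuming the input is already known to be valid UTF-8:

// Step backwards until the current byte is not a continuation byte
// (0x80-0xBF); that byte starts the previous character.
function utf8_prev_char_start(string $s, int $pos): int
{
    do {
        $pos--;
    } while ($pos > 0 && (ord($s[$pos]) & 0xC0) === 0x80);
    return $pos;
}

// Walk a string backwards one character at a time.
$s = "h\u{00E9}llo \u{20AC}";
for ($pos = strlen($s); $pos > 0; $pos = $start) {
    $start = utf8_prev_char_start($s, $pos);
    echo substr($s, $start, $pos - $start), "\n"; // €, space, o, l, l, é, h
}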


