PHP Multi Byte Str_Replace

str_replace() on multibyte strings dangerous?

No, you’re right: Using a singlebyte string function on a multibyte string can cause an unexpected result. Use the multibyte string functions instead, for example mb_ereg_replace or mb_split:

$string = mb_ereg_replace('"', '\\"', $string);
$string = implode('\\"', mb_split('"', $string));

Edit: Here’s an mb_replace implementation using the split-join variant:

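function mb_replace($search, $replace, $subject, &$count = 0) {
    // As with str_replace(), an array of replacements needs an array of searches.
    if (!is_array($search) && is_array($replace)) {
        return false;
    }
    $count = 0;
    if (is_array($subject)) {
        // Call mb_replace() for each single string in $subject.
        foreach ($subject as &$string) {
            // No "&" before the call: a function result cannot be assigned by reference.
            $string = mb_replace($search, $replace, $string, $c);
            $count += $c;
        }
        unset($string);
    } elseif (is_array($search)) {
        if (!is_array($replace)) {
            foreach ($search as $string) {
                $subject = mb_replace($string, $replace, $subject, $c);
                $count += $c;
            }
        } else {
            // Walk $search only; a missing replacement becomes '', as in str_replace().
            $replace = array_values($replace);
            foreach (array_values($search) as $i => $string) {
                $with = isset($replace[$i]) ? $replace[$i] : '';
                $subject = mb_replace($string, $with, $subject, $c);
                $count += $c;
            }
        }
    } elseif ((string) $search !== '') {
        // preg_quote() targets PCRE while mb_split() uses mbstring's own regex
        // engine, so treat this escaping as an approximation.
        $parts = mb_split(preg_quote($search), $subject);
        $count = count($parts) - 1;
        $subject = implode($replace, $parts);
    }
    return $subject;
}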

As regards the combination of parameters, this function should behave like the singlebyte str_replace.

PHP Multi Byte str_replace?

It looks like the string was not replaced because your input encoding and the encoding of your source file do not match.
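A minimal sketch of such a mismatch, assuming the input arrives as ISO-8859-1 while the search string sits in a UTF-8 source file. Both encodings are assumptions chosen for illustration, since the encodings from the original question are not given here. The bytes are written as escapes so the example does not depend on how the file itself is saved.

```php
<?php
// Input as it might arrive from a legacy source: "café" in ISO-8859-1 (assumed).
$input = "caf\xE9";
// The search string "é" as UTF-8 bytes, as it would appear in a UTF-8 source file.
$search = "\xC3\xA9";

// The byte sequence C3 A9 does not occur in $input, so nothing is replaced.
$unchanged = str_replace($search, 'e', $input);

// Convert the input to UTF-8 first, then replace.
$utf8  = mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
$fixed = str_replace($search, 'e', $utf8);

var_dump($unchanged === $input); // bool(true)
var_dump($fixed);                // string(4) "cafe"
```

Once both sides use the same encoding, the plain str_replace() call finds its match.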

str_replace doesn't work as expected - multi byte character set?

Use array_map. The code will look like this:

$originalArray = json_decode($jsonText, true);

$data = array_map(function($value){
return str_replace(" ", '', $value);
}, $originalArray);

var_dump($data);

Later Edit:
Looks like the requirements of the problem have changed, and so has the input data.
This changes the answer as well.
You can see how array_map works at http://php.net/array_map; it's simpler and cleaner.

So, having the array with this data (let's take only the first key-value)

// this is the actual data from the array
$a = "20200920202020202020202020202020202020313535266e6273703b3830382e30302020200920202020202020202020202020202020";

// make it readable
$b = hex2bin($a);

// see what is inside
var_dump($b);

var_dump will return something like:

string(54) "                    155&nbsp;808.00                     "

So, you have &nbsp; in the string, which is six characters as written but is displayed as only one.

The solution I see in this case is to use trim() to remove the whitespace from the beginning and the end of the string, and then preg_replace() to remove every character that is not a digit or a dot.

$b = trim($b);
$b = preg_replace("/([^0-9\.]+)/", '', $b);

The result will be then:

string(9) "155808.00"

So, the end result will look like this:

$data = array_map(function($value){
$value = trim($value);

return preg_replace("/([^0-9\.]+)/", '', $value);
}, $originalArray);

mb_str_replace()... is slow. any alternatives?

As said there, str_replace is safe to use in UTF-8 contexts, as long as all parameters are valid UTF-8, because there can't be any ambiguous match between two valid UTF-8 strings. If you check the validity of your input, then you have no need to look for a different function.
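One way to apply that advice is to validate first and then call str_replace() directly. The sketch below assumes the mbstring extension is loaded and that the file is saved as UTF-8; utf8_str_replace() is a hypothetical helper name, not a PHP built-in.

```php
<?php
// Hypothetical helper (not part of PHP): reject invalid UTF-8, then let the
// byte-oriented str_replace() do the work.
function utf8_str_replace(string $search, string $replace, string $subject): string
{
    foreach ([$search, $replace, $subject] as $value) {
        if (!mb_check_encoding($value, 'UTF-8')) {
            throw new InvalidArgumentException('All arguments must be valid UTF-8.');
        }
    }
    return str_replace($search, $replace, $subject);
}

echo utf8_str_replace('ć', 'c', 'ćevapi'), "\n"; // cevapi
```

The check costs one pass over each argument, which is usually cheaper than a character-by-character replacement written in PHP.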

str_ireplace works as str_replace

Most of the PHP string functions handle the strings as sequences of bytes, i.e. single-byte characters (ASCII).

You want to replace characters in a string that contains multi-byte characters.
str_replace() (kind of) works because it doesn't try to interpret the strings as characters. It replaces one sequence of bytes with another sequence of bytes and that's all. Most of the time it will not break anything while working with ASCII or even UTF-8 encoded strings (because of the way UTF-8 was designed). However, it can produce unexpected results with other encodings.

When asked to handle characters outside the ASCII range, [str_ireplace()](http://php.net/manual/en/function.str-ireplace.php) works the same as str_replace(). Its "case-insensitive" functionality requires splitting the strings into characters and recognizing the lowercase-uppercase pairs. But since it doesn't handle multi-byte characters, it cannot recognize any character whose code is greater than 127.
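The difference can be seen with a single accented letter. This is a small sketch, assuming the file is saved as UTF-8; the preg_replace() call with the /iu modifiers is shown only as one Unicode-aware point of comparison, not as the approach recommended in this answer.

```php
<?php
// Assumes this file is saved as UTF-8.
$subject = 'čaj';

// Byte-oriented: 'Č' (C4 8C) and 'č' (C4 8D) are different byte sequences,
// and bytes above 127 are not case-folded, so nothing matches.
$bytewise = str_ireplace('Č', 'c', $subject);

// Unicode-aware: /u reads pattern and subject as UTF-8, /i folds case.
$unicode = preg_replace('/Č/iu', 'c', $subject);

var_dump($bytewise === $subject); // bool(true)
var_dump($unicode);               // string(3) "caj"
```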

For multi-byte character strings you should use the functions provided by the Multibyte String PHP extension.

The only function it provides for string replacement is mb_ereg_replace() (with the case-insensitive version mb_eregi_replace()), but they don't help you very much here because they don't work with arrays.

If the list of characters you want to replace is fixed, my suggestion is to use str_replace() with a list of characters that includes both cases:

$str = "č-ć-đ-š-ž-Č-Ć-Đ-Š-Ž";
echo str_replace(
array('č', 'ć', 'đ', 'š', 'ž', 'Č', 'Ć', 'Đ', 'Š', 'Ž'),
array('c', 'c', 'd', 's', 'z', 'c', 'c', 'd', 's', 'z'),
$str
);

PHP multibyte safe preg_replace Vs. str_replace

Since you have Unicode input, you must pass the /u flag to the regex to deal with the input correctly:

$v = "line1\nline2\r\nмы хотели бы поблагодарить";
echo preg_replace('/\R/u', "", $v);
// => line1line2мы хотели бы поблагодарить


The /u flag is required whenever the pattern or the input can contain Unicode (non-ASCII) characters.
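To see what the flag changes, here is a small sketch that replaces every match of "." in a two-character Cyrillic string. The bytes are written as escapes so the result does not depend on the encoding of the source file.

```php
<?php
$text = "\xD0\xBC\xD1\x8B"; // "мы" in UTF-8: 2 characters, 4 bytes

$withoutU = preg_replace('/./', 'x', $text);  // "." matches one byte
$withU    = preg_replace('/./u', 'x', $text); // "." matches one character

var_dump($withoutU); // string(4) "xxxx"
var_dump($withU);    // string(2) "xx"
```

Without /u the engine works byte by byte, so any pattern that counts, classifies, or splits characters gives wrong results on multibyte input.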

Str_replace for multiple items

str_replace() can take an array, so you could do:

$new_str = str_replace(str_split('\\/:*?"<>|'), ' ', $string);

Alternatively you could use preg_replace():

$new_str = preg_replace('~[\\\\/:*?"<>|]~', ' ', $string);
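A quick check that the two variants agree, using a made-up file name as sample input. Note that str_split() cuts the list into single bytes, which is fine here only because every character in the list is ASCII.

```php
<?php
// Sample input is hypothetical, chosen to contain several of the listed characters.
$name = 'report: draft/final?.txt';

$viaStrReplace  = str_replace(str_split('\\/:*?"<>|'), ' ', $name);
$viaPregReplace = preg_replace('~[\\\\/:*?"<>|]~', ' ', $name);

var_dump($viaStrReplace === $viaPregReplace); // bool(true)
echo $viaStrReplace, "\n";                    // report  draft final .txt
```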

PHP str_replace removing unintentionally removing Chinese characters

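The string you're using as the list of things to replace doesn't work well with the mixed encoding. str_split() cuts it into single bytes, so the multibyte ’ becomes the UTF-8 bytes E2 80 99, and 0x99 also occurs inside 景 (E6 99 AF), which then gets damaged. What I've done is to convert this string to UTF16 and then split it.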

function removeSpecialCharactersFromString($inputString){
$inputString = str_replace(str_split(
mb_convert_encoding('#/\\:*?\"<>|[]\'_+(),{}’! &', 'UTF16')), "", $inputString);
return $inputString;
}
$test = '#赵景然 赵景然';
print(removeSpecialCharactersFromString($test));

Which gives...

赵景然赵景然

BTW, str_replace is MB safe, sort of, as a commenter on the manual page points out: http://php.net/manual/en/ref.mbstring.php#109937
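An alternative sketch is to split the list into whole characters instead of bytes, so that each search string is a complete UTF-8 sequence. It assumes UTF-8 input and a UTF-8 source file; remove_characters() is a hypothetical name used only for this illustration.

```php
<?php
// Hypothetical helper: removes every character listed in $unwanted from $input.
function remove_characters(string $input, string $unwanted): string
{
    // '//u' splits between UTF-8 characters rather than between bytes.
    $chars = preg_split('//u', $unwanted, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('The character list must be valid UTF-8.');
    }
    return str_replace($chars, '', $input);
}

echo remove_characters('#赵景然 赵景然', '#/\\:*?"<>|[]\'_+(),{}’! &'), "\n";
// 赵景然赵景然
```

On PHP 7.4 or later, mb_str_split() can replace the preg_split() call.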

Why use multibyte string functions in PHP?

The standard PHP string functions do not handle multibyte strings, regardless of your operating system's locale. That is why you need to use the multibyte string functions.

From the Multibyte String Introduction:

When you manipulate (trim, split, splice, etc.) strings encoded in a
multibyte encoding, you need to use special functions since two or
more consecutive bytes may represent a single character in such
encoding schemes. Otherwise, if you apply a non-multibyte-aware string
function to the string, it probably fails to detect the beginning or
ending of the multibyte character and ends up with a corrupted garbage
string that most likely loses its original meaning.
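The corruption described in the quote is easy to reproduce. This is a short sketch, assuming the mbstring extension is loaded and the file is saved as UTF-8:

```php
<?php
$s = '赵景然'; // UTF-8: 3 characters, 9 bytes

var_dump(strlen($s));             // int(9), counts bytes
var_dump(mb_strlen($s, 'UTF-8')); // int(3), counts characters

// substr() cuts after 4 bytes, in the middle of the second character.
$cut = substr($s, 0, 4);
var_dump(mb_check_encoding($cut, 'UTF-8')); // bool(false)

// mb_substr() cuts on character boundaries.
var_dump(mb_substr($s, 0, 1, 'UTF-8'));     // string(3) "赵"
```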


