str_replace() on multibyte strings dangerous?
No, you’re right: Using a singlebyte string function on a multibyte string can cause an unexpected result. Use the multibyte string functions instead, for example mb_ereg_replace
or mb_split
:
$string = mb_ereg_replace('"', '\\"', $string);
$string = implode('\\"', mb_split('"', $string));
Edit Here’s a mb_replace
implementation using the split-join variant:
function mb_replace($search, $replace, $subject, &$count=0) {
if (!is_array($search) && is_array($replace)) {
return false;
}
if (is_array($subject)) {
// call mb_replace for each single string in $subject
foreach ($subject as &$string) {
$string = &mb_replace($search, $replace, $string, $c);
$count += $c;
}
} elseif (is_array($search)) {
if (!is_array($replace)) {
foreach ($search as &$string) {
$subject = mb_replace($string, $replace, $subject, $c);
$count += $c;
}
} else {
$n = max(count($search), count($replace));
while ($n--) {
$subject = mb_replace(current($search), current($replace), $subject, $c);
$count += $c;
next($search);
next($replace);
}
}
} else {
$parts = mb_split(preg_quote($search), $subject);
$count = count($parts)-1;
$subject = implode($replace, $parts);
}
return $subject;
}
As regards the combination of parameters, this function should behave like the singlebyte str_replace
.
Can str_replace be safely used on a UTF-8 encoded string if it's only given valid UTF-8 encoded strings as arguments?
Yes. UTF-8 is deliberately designed to allow this and other similar non-Unicode-aware processing.
In UTF-8, any non-ASCII byte sequence representing a valid character always begins with a byte in the range \xC0-\xFF
. This byte may not appear anywhere else in the sequence, so you can't make a valid UTF-8 sequence that matches part of a character.
This is not the case for older multibyte encodings, where different parts of a byte sequence are indistinguishable. This caused a lot of problems, for example trying to replace an ASCII backslash in a Shift-JIS string (where byte \x5C
might be the second byte of a character sequence representing something else).
Check if str_replace Should Execute First to Avoid Duplicate Strings
You should check first if string contains and if it does just don't replace.
function superscript_R( $value, $post_id, $field ) {
if( is_string($value) && strpos($value, '<sup>') === false ) {
$value = str_replace(['®', '®'],'<sup>®</sup>', $value );
}
return $value;
}
str_replace PHP script can't handle foreign characters such as umlauts robustly
Did you include the charset in php?
try this:
header('Content-Type: text/html; charset=utf-8');
If not working check if your file is already saved in utf8 before str replace:
utf8_encode ( string $data );
In the opposite case use:
utf8_decode( string $data );
Hope it helps!
Detecting and removing multibyte strings in R
This is probably an encoding issue, so try change the encoding during load! Try something like this,
df<- read.csv(file_path,
encoding = "iso-8859-1", "use different encodings/langs"
header = TRUE,
stringsAsFactors = FALSE)
is PHP str_word_count() multibyte safe?
I'd say you guess right. And indeed there are space characters in UTF-8 which are not part of US-ASCII. To give you an example of such spaces:
- Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 Bytes in UTF-8: 0xC2 0xA0 (c2a0)
And perhaps as well:
- Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 Bytes in UTF-8: 0xC2 0x85 (c285)
- Unicode Character 'LINE SEPARATOR' (U+2028): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
- Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
Anyway, the first one - the 'NO-BREAK SPACE' (U+00A0) - is a good example as it is also part of Latin-X charsets. And the PHP manual already provides a hint that str_word_count
would be locale dependent.
If we want to put this to a test, we can set the locale to UTF-8, pass in an invalid string containing a \xA0
sequence and if this still counts as word-breaking character, that function is clearly not UTF-8 safe, hence not multibyte safe (as same non-defined as per the question):
<?php
/**
* is PHP str_word_count() multibyte safe?
* @link https://stackoverflow.com/q/8290537/367456
*/
echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n\n";
$test = "aword\xA0bword aword";
$result = str_word_count($test, 2);
var_dump($result);
Output:
New Locale: en_US.utf8
array(3) {
[0]=>
string(5) "aword"
[6]=>
string(5) "bword"
[12]=>
string(5) "aword"
}
As this demo shows, that function totally fails on the locale promise it gives on the manual page (I do not wonder nor moan about this, most often if you read that a function is locale specific in PHP, run for your life and find one that is not) which I exploit here to demonstrate that it by no means does anything regarding the UTF-8 character encoding.
Instead for UTF-8 you should take a look into the PCRE extension:
- Matching Unicode letter characters in PCRE/PHP
PCRE has a good understanding of Unicode and UTF-8 in PHP in specific. It can also be quite fast if you craft the regular expression pattern carefully.
Iterating backwards Multibyte String - C
If you need to deal with any theoretically-possible multibyte encoding, then it is not possible to iterate backwards. There is no requirement that a multibyte encoding have the property that no proper suffix of a valid multibyte sequence is a valid multibyte sequence. (As it happens, your algorithm requires an even stronger property, because you might recognize a multibyte sequence starting in the middle of one valid sequence and continuing into the next sequence.)
Also, you cannot predict (again, in general) the multibyte state if the multibyte encoding has shift states. If you back-up over a multibyte sequence which changes the state, you have no idea what the previous state was.
UTF-8 was designed with this in mind. It does not have shift states, and it clearly marks the octets (bytes) which can start a sequence. So if you know that the multibyte encoding is UTF-8, you can easily iterate backwards. Just scan backwards for a character not in the range 0x80-0xBF. (UTF-16 and UTF-32 are also easily iterated in either direction, but you need to read them as two-/four-byte code units, respectively, because a misaligned read is quite likely to be a correct codepoint.)
If you don't know that the multibyte encoding is UTF-8, then there is simply no robust algorithm to iterate backwards. All you can do is iterate forwards and remember the starting position and mbstate of each character.
Fortunately, these days there is really little reason to support multibyte encodings other than Unicode encodings.
Related Topics
I Cannot Use MySQL_* Functions After Upgrading PHP
Best Way to Allow Plugins for a PHP Application
What Are the Differences Between Psr-0 and Psr-4
Understanding Ioc Containers and Dependency Injection
Uploading a File in Chunks Using HTML5
Verify Imagemagick Installation
What Does Double Question Mark () Operator Mean in PHP
How to Catch a "Catchable Fatal Error" on PHP Type Hinting
Laravel 4 - Logging SQL Queries
How to Re-Save the Entity as Another Row in Doctrine 2
Setting PHP Tmp Dir - PHP Upload Not Working
Laravel 5 Not Finding CSS Files
Get Next and Previous Day with PHP
Mysqli Prepared Statements with in Operator
Order Multidimensional Array Recursively at Each Level in PHP
Can PHP and ASP.NET Run Together Within the Same Web Site in Iis 7.5