Is PHP Str_Word_Count() Multibyte Safe

is PHP str_word_count() multibyte safe?

I'd say you guessed right. There are indeed space characters in UTF-8 that are not part of US-ASCII. To give you an example of such spaces:

  • Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 Bytes in UTF-8: 0xC2 0xA0 (c2a0)

And perhaps as well:

  • Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 Bytes in UTF-8: 0xC2 0x85 (c285)
  • Unicode Character 'LINE SEPARATOR' (U+2028): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
  • Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 Bytes in UTF-8: 0xE2 0x80 0xA9 (e280a9)

Anyway, the first one - the 'NO-BREAK SPACE' (U+00A0) - is a good example, as it is also part of the Latin-X charsets. And the PHP manual already hints that str_word_count is locale dependent.
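The byte sequences of the characters listed above can be double-checked in PHP itself. This is only a small sketch: it assumes PHP 7.0 or later for the "\u{...}" escape syntax, and the variable names are made up for illustration.

```php
<?php
// Map each code point (written with PHP 7+ Unicode escapes) to its name.
$spaces = [
    "\u{00A0}" => 'NO-BREAK SPACE',
    "\u{0085}" => 'NEXT LINE (NEL)',
    "\u{2028}" => 'LINE SEPARATOR',
    "\u{2029}" => 'PARAGRAPH SEPARATOR',
];

// bin2hex() shows the raw UTF-8 bytes of each character.
$hex = [];
foreach ($spaces as $char => $name) {
    $hex[$name] = bin2hex($char);
    echo $name, ': ', $hex[$name], "\n";
}
// NO-BREAK SPACE: c2a0
// NEXT LINE (NEL): c285
// LINE SEPARATOR: e280a8
// PARAGRAPH SEPARATOR: e280a9
```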

If we want to put this to the test, we can set the locale to UTF-8 and pass in a string containing a lone \xA0 byte, which is invalid in UTF-8 but is the no-break space in Latin-1. If that byte still counts as a word-breaking character, the function is clearly not UTF-8 safe, hence not multibyte safe (taking "multibyte safe" in the loose sense of the question, which does not define it):

<?php
/**
* is PHP str_word_count() multibyte safe?
* @link https://stackoverflow.com/q/8290537/367456
*/

echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n\n";

$test = "aword\xA0bword aword";
$result = str_word_count($test, 2);

var_dump($result);

Output:

New Locale: en_US.utf8

array(3) {
[0]=>
string(5) "aword"
[6]=>
string(5) "bword"
[12]=>
string(5) "aword"
}

As this demo shows, the function fails on the locale promise the manual page makes, which I exploit here to demonstrate that it does nothing at all regarding the UTF-8 character encoding. (I neither wonder nor moan about this: most often, if you read that a function is locale specific in PHP, run for your life and find one that is not.)

Instead, for UTF-8 you should take a look at the PCRE extension:

  • Matching Unicode letter characters in PCRE/PHP

PCRE has a good understanding of Unicode, and of UTF-8 in PHP specifically. It can also be quite fast if you craft the regular expression pattern carefully.
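As a rough sketch of what a PCRE-based word count could look like: utf8_word_count() below is a hypothetical helper, not a PHP built-in, and its definition of a "word" (letters and combining marks, optionally joined by an apostrophe or hyphen) is an assumption you may want to adjust.

```php
<?php
// Hypothetical helper, not part of PHP: counts words in a UTF-8 string.
function utf8_word_count(string $text): int
{
    // A word is a run of letters (\p{L}) and combining marks (\p{M}),
    // optionally joined by single apostrophes or hyphens (as in "don't").
    $pattern = '/[\p{L}\p{M}]+(?:[\'\-][\p{L}\p{M}]+)*/u';

    $count = preg_match_all($pattern, $text);
    if ($count === false) {
        // preg_match_all() returns false on error, e.g. for invalid UTF-8 input.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $count;
}

// U+00A0 (NO-BREAK SPACE) separates words here because it is not a letter.
echo utf8_word_count("aword\u{00A0}bword aword"), "\n"; // prints 3
echo utf8_word_count("Søren spiser rødgrød"), "\n";     // prints 3
```

Unlike the str_word_count demo, a lone "\xA0" byte is rejected here as invalid UTF-8 instead of being silently treated as a separator.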

str_word_count does not properly handle non-latin characters

  1. Assuming you are asking how to keep using str_word_count: you could try preg_replace('/[^a-zA-Z0-9\s]/','',$string) after you have already replaced any punctuation. Note that this strips every character other than ASCII letters, digits and whitespace, so non-Latin letters are removed rather than counted. Not having a "test string" that is known to fail, I had no way to try this out, but at least it is something you can try yourself.

  2. One improvement would be to actually trim the text. The first comment mentions trimming, but that first line only removes HTML tags. Add a trim($string), and then you can remove the last part:

Change the first two lines to:

//trim it & remove tags
$text = trim(strip_tags(html_entity_decode($text,ENT_QUOTES)));

Remove:

// check if the last key of the array is empty and decrease the count by one
$last_key = end($text_array);
if (empty($last_key)) {
    $count--;
}
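To illustrate why trimming makes that last check unnecessary, here is a small sketch. It assumes the original function splits the text on single spaces with explode(), which is not shown in this answer; the variable names are made up.

```php
<?php
$raw = "one two three ";

// Without trim(), the trailing space produces an empty last element.
$untrimmed = explode(' ', $raw);

// With trim(), the trailing space is gone before splitting.
$trimmed = explode(' ', trim($raw));

var_dump(end($untrimmed) === ''); // bool(true)
echo count($untrimmed), "\n";     // 4
echo count($trimmed), "\n";       // 3
```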

str_replace() on multibyte strings dangerous?

No, you’re right: Using a singlebyte string function on a multibyte string can cause an unexpected result. Use the multibyte string functions instead, for example mb_ereg_replace or mb_split:

$string = mb_ereg_replace('"', '\\"', $string);
$string = implode('\\"', mb_split('"', $string));

Edit: Here’s an mb_replace implementation using the split-join variant:

function mb_replace($search, $replace, $subject, &$count = 0) {
    // like str_replace, an array $replace requires an array $search
    if (!is_array($search) && is_array($replace)) {
        return false;
    }
    $count = 0;
    if (is_array($subject)) {
        // call mb_replace for each single string in $subject
        foreach ($subject as &$string) {
            $string = mb_replace($search, $replace, $string, $c);
            $count += $c;
        }
        unset($string); // break the reference left over from the loop
    } elseif (is_array($search)) {
        if (!is_array($replace)) {
            // one replacement string for every search string
            foreach ($search as $string) {
                $subject = mb_replace($string, $replace, $subject, $c);
                $count += $c;
            }
        } else {
            // pair search and replacement strings by position;
            // missing replacements default to '' as in str_replace
            $replace = array_values($replace);
            foreach (array_values($search) as $i => $string) {
                $subject = mb_replace($string, $replace[$i] ?? '', $subject, $c);
                $count += $c;
            }
        }
    } else {
        // split on the (quoted) search string and join with the replacement
        $parts = mb_split(preg_quote($search), $subject);
        $count = count($parts) - 1;
        $subject = implode($replace, $parts);
    }
    return $subject;
}

As regards the combination of parameters, this function should behave like the single-byte str_replace.

How can I get the correct position of a word in a UTF-8 text?

I tried the solution by @Mario Johnathan, but it didn't work properly for me.

In the end I came up with my own solution: I use the non-multibyte functions such as substr together with the position given by str_word_count, and adjust the first substring if its first character is a Danish character.

$first_part_aux = str_split(trim($first_part));

if (!ctype_alpha($first_part_aux[0])) {
    for ($i = 1; $i < count($first_part_aux); $i++) {
        if (ctype_alpha($first_part_aux[$i])) {
            $start = $start + $i;
            $length = $length - $i;

            $first_part = substr($text, $start, $length);

            break;
        }
    }
}
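An alternative that avoids the manual adjustment is to let PCRE report the positions. The sketch below assumes the text is valid UTF-8; utf8_word_positions() is a hypothetical helper name, and "word" is taken to mean a run of letters and combining marks. With PREG_OFFSET_CAPTURE the offsets are byte offsets even in /u mode, so they can be passed straight to substr().

```php
<?php
// Hypothetical helper: returns [byteOffset => word] for a UTF-8 string.
function utf8_word_positions(string $text): array
{
    $positions = [];
    // /u makes PCRE treat pattern and subject as UTF-8;
    // PREG_OFFSET_CAPTURE adds the byte offset of every match.
    if (preg_match_all('/[\p{L}\p{M}]+/u', $text, $matches, PREG_OFFSET_CAPTURE)) {
        foreach ($matches[0] as [$word, $offset]) {
            $positions[$offset] = $word;
        }
    }
    return $positions;
}

$text = "Jeg spiser æbler og rødgrød";
$positions = utf8_word_positions($text);

// Byte offsets work with the single-byte substr()/strlen() functions.
foreach ($positions as $offset => $word) {
    echo $offset, ': ', substr($text, $offset, strlen($word)), "\n";
}
```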

Validate that input string does not exceed word limit

Maybe str_word_count could help

http://php.net/manual/en/function.str-word-count.php

$Tag  = 'My Name is Gaurav'; 
$word = str_word_count($Tag);
echo $word;
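Building on that, the actual limit check could be sketched like this. within_word_limit() is a made-up helper name, and since it relies on str_word_count it is only suitable for plain ASCII/Latin input.

```php
<?php
// Hypothetical helper: true if $input has at most $maxWords words.
function within_word_limit(string $input, int $maxWords): bool
{
    return str_word_count($input) <= $maxWords;
}

$Tag = 'My Name is Gaurav';

var_dump(within_word_limit($Tag, 4)); // bool(true), the string has 4 words
var_dump(within_word_limit($Tag, 3)); // bool(false)
```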

