Multibyte Trim in PHP

Multibyte trim in PHP?

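The standard trim function strips a handful of space and space-like characters: by default the space (0x20), tab (0x09), line feed (0x0A), carriage return (0x0D), NUL (0x00) and vertical tab (0x0B). These are all ASCII characters, which means specific bytes in the range 0000 0000 to 0111 1111.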

Proper UTF-8 input will never contain a multi-byte character that includes a byte of the form 0xxx xxxx. Every byte of a proper UTF-8 multi-byte character has the form 1xxx xxxx.

This means that in a proper UTF-8 sequence, the bytes 0xxx xxxx can only refer to single-byte characters. PHP's trim function, with its default character list, will therefore never trim away "half a character", assuming you have a proper UTF-8 sequence. (Be very careful about improper UTF-8 sequences.)
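As a rough illustration of the point above, the following sketch checks that every byte of a few multi-byte characters has its high bit set, and that trim() with its default character list leaves a UTF-8 string valid. The sample strings and variable names are made up for this example, and the file is assumed to be saved as UTF-8.

```php
<?php
// Sample text made only of multi-byte characters (assumed UTF-8 source file).
$multibyteOnly = 'ホé€';

// Every byte of a multi-byte UTF-8 character is >= 0x80 (1xxx xxxx).
$allHighBit = true;
foreach (str_split($multibyteOnly) as $byte) {
    if (ord($byte) < 0x80) {
        $allHighBit = false;
    }
}

// Default trim() only strips ASCII bytes, so it cannot cut into a character.
$trimmed = trim(" \t ホ日本語 \n");
$stillValid = mb_check_encoding($trimmed, 'UTF-8');

var_dump($allHighBit, $stillValid);
```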


In a regular expression without the /u modifier, \s matches mostly the same characters as trim.

The preg functions with the /u modifier only work on UTF-8 encoded strings and patterns, and /\s/u also matches the UTF-8 encoded NBSP (non-breaking space, U+00A0). This behaviour with non-breaking spaces is the only advantage of using it.

If you want to replace space characters in other, non ASCII-compatible encodings, neither method will work.

In other words, if you're trying to trim the usual spaces from an ASCII-compatible string, just use trim. When using /\s/u, be careful about what an NBSP means in your text.
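If you do want Unicode-aware trimming, one possible sketch is a small wrapper around preg_replace with the /u modifier. The function name utf8_trim_sketch is hypothetical (it is not part of PHP), and the choice of \s plus \p{Z} as the set of characters to strip is an assumption you may want to narrow for your own text.

```php
<?php
// Hypothetical helper (not part of PHP): trims ASCII and Unicode white space
// from both ends of a UTF-8 string.
function utf8_trim_sketch(string $str): string
{
    // \s covers the usual ASCII white space; \p{Z} covers Unicode separators
    // such as NBSP (U+00A0) and IDEOGRAPHIC SPACE (U+3000).
    $result = preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $str);

    // preg_replace() returns null on failure, e.g. for invalid UTF-8 input;
    // in that case this sketch returns the input unchanged.
    return $result === null ? $str : $result;
}

$clean = utf8_trim_sketch("\u{3000} ひらがな ひらがな\u{00A0} ");
echo '[' . $clean . ']';
```

Spaces inside the string are left alone, because both alternatives of the pattern are anchored to the start or the end.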


Take care:

$s1 = html_entity_decode(" Hello &nbsp; "); // the NBSP
$s2 = " exotic test ホ ";

echo "\nCORRECT trim: [". trim($s1) ."], [". trim($s2) ."]";
echo "\nSAME: [". trim($s1) ."] == [". preg_replace('/^\s+|\s+$/','',$s1) ."]";
echo "\nBUT: [". trim($s1) ."] != [". preg_replace('/^\s+|\s+$/u','',$s1) ."]";

echo "\n!INCORRECT trim: [". trim($s2,' ') ."]"; // DANGER! not UTF8 safe!
echo "\nSAFE ONLY WITH preg: [".
preg_replace('/^[\s]+|[\s]+$/u', '', $s2) ."]";

Why does the trim function not work correctly for Japanese input?

This is hard to troubleshoot; it is described in more detail here:

  1. PHP output showing little black diamonds with a question mark

  2. Trouble with UTF-8 characters; what I see is not what I stored

To work around this you can use str_replace to replace every space in the string with nothing. Because this removes all spaces, including those between words, it is not recommended for sentences, but it is fine for single words.

$text = "  ひらがな  ";
$new_str = str_replace(' ', '', $text);
echo $new_str; // returns ひらがな

If you only want to remove spaces at the beginning and end, use a regex with preg_replace:

// \s needs its backslash; the u modifier makes it match multi-byte spaces too
print preg_replace( '/^\s+|\s+$/u', '', "    ひらがな ひらがな" ); // prints ひらがな ひらがな

trim is about nine times faster than the regex, according to the comparison linked below, but preg_replace is still usable.

Check speed comparison here.

https://stackoverflow.com/a/4787238/10915534

PHP trim special character destroys other special character

trim is not encoding aware and just looks at individual bytes. If you tell it to trim '«»', and that's encoded in UTF-8, it will look for the bytes C2 AB C2 BB (C2 appears twice, so the actual set of bytes searched for is AB, BB and C2). Any of those bytes found at either end of the string is stripped. "ë" in UTF-8 is C3 AB, so its second byte gets removed and the character is thereby broken.

You'll need to use an encoding-aware function to safely remove multibyte characters, e.g.:

preg_replace('/^[«»]+|[«»]+$/u', '', $str)
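To make the difference concrete, here is a small sketch that runs both approaches on the same made-up input and inspects the bytes that are left. The sample string '«ë»' and the variable names are assumptions for this example, and the file is assumed to be saved as UTF-8.

```php
<?php
$str = '«ë»'; // bytes: C2 AB  C3 AB  C2 BB

// Byte-wise: trim() strips any of the bytes C2, AB and BB from both ends,
// which also removes the AB byte that belongs to "ë" (C3 AB).
$broken = trim($str, '«»');

// Character-wise: with /u the class [«»] contains two code points.
$safe = preg_replace('/^[«»]+|[«»]+$/u', '', $str);

echo bin2hex($broken), "\n"; // only a lone lead byte is left
echo $safe, "\n";
```

Checking the result with mb_check_encoding($broken, 'UTF-8') is one way to detect this kind of damage after the fact.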

Strip out multi-byte white space from a string in PHP

Add the u flag to your regex. This makes the RegEx engine treat the input string as UTF-8.

$keywords = preg_replace("@[  ]@u", ' ',urldecode($keywords));
// outputs :'ラメ単色'


The reason it mangles the string is that, without the u flag, the RegEx engine does not treat your replacement characters, 20 (space) and e3 80 80 (IDEOGRAPHIC SPACE), as two characters, but as the separate bytes 20, e3 and 80.

When we look at the byte sequence of the string to scan, we get e3 80 80 e3 83 a9 e3 83 a1 e5 8d 98 e8 89 b2. We know the first character is an IDEOGRAPHIC SPACE, but because PHP is treating the string as a sequence of bytes, it replaces each of the first four bytes individually, because each of them matches one of the bytes the regex engine is scanning for.

As for the mangling which results in the � (REPLACEMENT CHARACTER), we can see this happens because the byte e3 is present further along in the string. The e3 byte is the start byte of a three-byte Japanese character, such as e3 83 a9 (KATAKANA LETTER RA). When that leading e3 is replaced with a 20 (space), the remaining bytes no longer form a valid UTF-8 sequence.

When you enable the u flag, the RegEx engine treats the string as UTF-8, and won't treat your characters in your character class on a per-byte basis.
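The byte-level behaviour described above can be reproduced with a short sketch. The input here is built by hand from an IDEOGRAPHIC SPACE followed by ラメ単色 rather than taken from a URL, which is an assumption made to keep the example self-contained.

```php
<?php
$keywords = "\u{3000}ラメ単色"; // e3 80 80 followed by four three-byte characters

// Without u: the class is the byte set {20, e3, 80}, matched byte by byte.
$mangled = preg_replace("@[ \u{3000}]@", ' ', $keywords);

// With u: the class is the two code points U+0020 and U+3000.
$fixed = preg_replace("@[ \u{3000}]@u", ' ', $keywords);

echo bin2hex($mangled), "\n"; // every e3 and 80 byte has become 20
echo bin2hex($fixed), "\n";   // only the leading e3 80 80 has become 20
```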

PHP - how to count the number of leading spaces in a multi-byte / UTF-8 string correctly

I would go the regular expression route:

<?php
function count_leading_spaces($str) {
    // \p{Zs} will match a whitespace character that is invisible,
    // but does take up space
    if (mb_ereg('^\p{Zs}+', $str, $regs) === false) {
        return 0;
    }
    return mb_strlen($regs[0]);
}

$samples = [
'            21st century ',
'      Other languages ',
'         General collections ',
'         Ancient languages ',
'         Medieval languages ',
'            Several authors (Two or more languages) ',
];

foreach ($samples as $i => $sample) {
    printf("(%d) %d\n", $i + 1, count_leading_spaces($sample));
}

Output:


(1) 12
(2) 6
(3) 9
(4) 9
(5) 9
(6) 12

Why does trim() affect the other characters?

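This is not really caused by the language being right-to-left. trim() works on bytes, and '؟' (U+061F) is encoded in UTF-8 as D8 9F, so the byte D8 is also stripped from the start of the string, where it is the first byte of 'س' (D8 B3). Since the character you want to remove is only at the end, you can use rtrim():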


echo rtrim('سلام؟', '؟');
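A quick way to see what happens at the byte level is to compare trim(), rtrim() and an encoding-aware regex on the same string. This is only a sketch: the variable names are made up, and the question mark is written as an escape to keep the source unambiguous.

```php
<?php
$mark = "\u{061F}";      // ARABIC QUESTION MARK, bytes D8 9F
$text = 'سلام' . $mark;  // starts with "س", bytes D8 B3

// trim() strips the bytes D8 and 9F from both ends, so the leading D8 of "س" is lost.
$broken = trim($text, $mark);

// rtrim() only inspects the end, where the bytes happen to be safe to remove.
$rtrimmed = rtrim($text, $mark);

// Encoding-aware alternative: with /u the pattern matches whole code points.
$quoted = preg_quote($mark, '/');
$safe = preg_replace('/^' . $quoted . '+|' . $quoted . '+$/u', '', $text);

var_dump(mb_check_encoding($broken, 'UTF-8'), $rtrimmed === $safe);
```

rtrim() works here only because the byte before the trailing D8 9F is not itself D8 or 9F; the regex variant does not depend on that kind of coincidence.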

Truncate a multibyte String to n chars

Try this:

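function truncate($string, $chars = 50, $terminator = ' …') {
    // Strings that already fit are returned unchanged.
    if (mb_strlen($string) <= $chars) {
        return $string;
    }
    $cutPos = $chars - mb_strlen($terminator);
    // Prefer to cut at the last space at or before $cutPos, so no word is split.
    $boundaryPos = mb_strrpos(mb_substr($string, 0, $cutPos + 1), ' ');
    return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}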

But you need to make sure that your internal encoding is properly set.
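As a minimal sketch of what "properly set" can look like, the snippet below sets the internal encoding once and compares character and byte counts. It assumes your strings are UTF-8; UTF-8 is normally already the default in current PHP versions, so the explicit call mainly documents the intent.

```php
<?php
// Make the mbstring functions default to UTF-8.
mb_internal_encoding('UTF-8');

$text = 'ひらがな';

$chars    = mb_strlen($text);          // counted with the internal encoding
$explicit = mb_strlen($text, 'UTF-8'); // encoding passed explicitly
$bytes    = strlen($text);             // plain byte count

printf("%d characters, %d bytes\n", $chars, $bytes);
```

Passing the encoding explicitly to each mb_* call is an alternative if you would rather not rely on global state.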
