Using str_split on a UTF-8 Encoded String

Using str_split on a UTF-8 encoded string

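str_split does not work with multi-byte characters: it splits the string into single bytes, so any character encoded as more than one byte is cut apart and invalidated. Use mb_str_split (PHP 7.4+) instead, or preg_split with the u modifier on older versions. Note that mb_split splits on a regular-expression pattern, so it is not a drop-in replacement.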
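
A minimal sketch of the difference, assuming the mbstring extension is loaded and PHP 7.4 or later. The sample word is hypothetical and not taken from the question:

```php
<?php
// Hypothetical sample string; "ï" (U+00EF) is encoded as two bytes in UTF-8.
$word = "na\u{00EF}ve";

$bytes = str_split($word);                 // one array item per byte
$chars = mb_str_split($word, 1, 'UTF-8');  // one array item per character

echo count($bytes), "\n"; // 6
echo count($chars), "\n"; // 5

// A single byte taken from the middle of "ï" is not valid UTF-8 on its own.
var_dump(mb_check_encoding($bytes[2], 'UTF-8')); // bool(false)
```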

PHP str_split and UTF8 polish characters

str_split works at the byte level, not the character level (despite its name). So you are in fact splitting mała along its bytes rather than its characters, which is why you get an array of five items instead of four: indexes 2 and 3 together form the two-byte UTF-8 encoding of ł.

You need to use either the mbstring or the iconv extension to split your string manually.

$str = 'mała';
// Length in characters, not bytes.
$len = mb_strlen($str, 'UTF-8');
$result = [];
for ($i = 0; $i < $len; $i++) {
    // Take exactly one character starting at character offset $i.
    $result[] = mb_substr($str, $i, 1, 'UTF-8');
}
var_dump($result);
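
The same loop can be written with the iconv extension mentioned above. This is only a sketch, assuming iconv is available; the string is written with an escape sequence so that the example does not depend on the encoding of the source file:

```php
<?php
$str = "ma\u{0142}a"; // "mała", with ł written as U+0142

// iconv_strlen() counts characters in the given encoding, not bytes.
$len = iconv_strlen($str, 'UTF-8');

$result = [];
for ($i = 0; $i < $len; $i++) {
    // One character starting at character offset $i.
    $result[] = iconv_substr($str, $i, 1, 'UTF-8');
}

var_dump($result); // four items: m, a, ł, a
```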

Split a UTF-8 encoded string on blank characters without knowing about UTF-8 encoding

Yes, you can.

Multibyte sequences always consist of one lead byte (two MSBs equal to 11) followed by one or more continuation bytes (two MSBs equal to 10). The total length of the sequence (lead byte plus continuation bytes) equals the number of leading 1 bits in the lead byte before the first 0 bit appears: for example, a lead byte of 110xxxxx is followed by exactly one continuation byte, and 11110xxx by exactly three. Because every byte of a multibyte sequence has its MSB set, an ASCII blank (a byte below 0x80) can never appear inside one, so splitting on those bytes never cuts a character in half.
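
The lead-byte rule can be sketched as a small function. The name utf8_sequence_length is hypothetical, and the function assumes a non-empty input and inspects only its first byte:

```php
<?php
// Hypothetical helper: total byte length of the UTF-8 sequence that starts
// with the first byte of $bytes, or 0 if that byte cannot start a sequence.
function utf8_sequence_length(string $bytes): int
{
    $value = ord($bytes[0]);

    if ($value < 0x80) {
        return 1; // 0xxxxxxx: plain ASCII
    }
    if (($value & 0xE0) === 0xC0) {
        return 2; // 110xxxxx: one continuation byte follows
    }
    if (($value & 0xF0) === 0xE0) {
        return 3; // 1110xxxx: two continuation bytes follow
    }
    if (($value & 0xF8) === 0xF0) {
        return 4; // 11110xxx: three continuation bytes follow
    }
    return 0; // 10xxxxxx (continuation byte) or an invalid lead byte
}

echo utf8_sequence_length("a"), "\n";          // 1
echo utf8_sequence_length("\u{0142}"), "\n";   // 2 (ł)
echo utf8_sequence_length("\u{20AC}"), "\n";   // 3 (euro sign)
echo utf8_sequence_length("\x82"), "\n";       // 0 (stray continuation byte)
```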

So, if you find short multibyte sequences or stray continuation bytes without a lead byte, your string was probably invalid to begin with, and your split procedure is unlikely to damage it any further.

But there is something you might want to note: Unicode introduces other “blank” symbols in the upper, non-ASCII compatible ranges. You might want to treat them accordingly.
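
As an illustration, the sketch below compares a split on the ASCII space with a split that also treats Unicode separators as blanks. The sample text is made up, and \p{Z} (the Unicode separator category) is just one possible definition of "blank":

```php
<?php
// Made-up sample: U+00A0 is a no-break space, U+3000 an ideographic space.
$text = "one\u{00A0}two\u{3000}three four";

// Byte-level split on the ASCII space only: the other blanks are ignored,
// but every piece is still valid UTF-8.
$asciiOnly = explode(' ', $text);

// UTF-8 aware split on ASCII whitespace plus any Unicode separator.
$unicodeAware = preg_split('/[\s\p{Z}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

echo count($asciiOnly), "\n";    // 2
echo count($unicodeAware), "\n"; // 4
```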

I have some trouble with str_split, it doesn't work correctly with my language

You must use the Multibyte String functions for manipulating Persian strings.
You can use preg_split for your purpose.

print_r(preg_split('//u', "رستوران ها", -1, PREG_SPLIT_NO_EMPTY));
Output:
Array
(
[0] => ر
[1] => س
[2] => ت
[3] => و
[4] => ر
[5] => ا
[6] => ن
[7] =>
[8] => ه
[9] => ا
)

my function doesn't return some letters even though included in utf-8 (vscode, php)

str_split operates on bytes, and characters such as æ take up more than 1 byte in UTF-8.

So if you str_split these characters, each one is cut in two, leaving bytes that are not valid characters on their own. Just run count() on $letterarr to see that there are 9 items in the array, instead of the expected 7.

The solution is to use PHP's string functions that are UTF-8 aware. Simply changing str_split into mb_str_split (available since PHP 7.4) will fix your code sample.

Split utf8 string into array of chars

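I found out the é was not the character I expected. Apparently there is a difference between né and ńe: an accented letter can be stored either as a single precomposed code point or as a base letter followed by a combining accent, and the two look the same but split differently. I got it working by normalizing the string first.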
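
A sketch of what normalizing first can look like, assuming the intl extension (which provides the Normalizer class) is installed. The strings below are illustrative, not the ones from the question:

```php
<?php
// Two spellings that render as the same text "né".
$precomposed = "n\u{00E9}";   // n + é as one code point (U+00E9)
$decomposed  = "ne\u{0301}";  // n + e + combining acute accent (U+0301)

// Character count of the decomposed spelling before normalization.
$before = mb_strlen($decomposed, 'UTF-8'); // 3

// NFC composes base letter + combining mark into one code point where possible.
// If intl is not installed, the string is left unchanged.
$normalized = class_exists('Normalizer')
    ? Normalizer::normalize($decomposed, Normalizer::FORM_C)
    : $decomposed;

var_dump($precomposed === $decomposed);      // bool(false)
var_dump(mb_strlen($normalized, 'UTF-8'));
```

After normalization to NFC, splitting the string into characters should give one array item per visible letter for text like this.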

PHP str_split on string with decoded html_entity

This should do well:

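// mb_str_split() is built in since PHP 7.4; define the fallback only on older versions.
if (!function_exists('mb_str_split')) {
    function mb_str_split($string) {
        // Split at every position between two characters, in UTF-8 mode.
        return preg_split('/(?<!^)(?!$)/u', $string);
    }
}
$string = 'My string ‘to parse’';
// utf8_encode() assumes ISO-8859-1 input and is deprecated since PHP 8.2;
// skip this step if $string is already UTF-8.
$string = utf8_encode($string);
$string_decoded = html_entity_decode($string, ENT_QUOTES, 'utf-8');
$string_array = mb_str_split($string_decoded);
var_dump($string_array);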

As mentioned in the comments: you need to split the string with a multibyte-aware function such as mb_str_split, or with a regex.

Proof: https://3v4l.org/3FRmG
