How to Split a String into an Array of Unicode Characters in PHP

What is the best way to split a string into an array of Unicode characters in PHP?

You could use the 'u' modifier with a PCRE regex; see Pattern Modifiers (quoting):

u (PCRE_UTF8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

For instance, consider this code:

header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
$str = "abc 文字化け, efg";

$results = array();
preg_match_all('/./', $str, $results);
var_dump($results[0]);

You'll get an unusable result:

array
0 => string 'a' (length=1)
1 => string 'b' (length=1)
2 => string 'c' (length=1)
3 => string ' ' (length=1)
4 => string '�' (length=1)
5 => string '�' (length=1)
6 => string '�' (length=1)
7 => string '�' (length=1)
8 => string '�' (length=1)
9 => string '�' (length=1)
10 => string '�' (length=1)
11 => string '�' (length=1)
12 => string '�' (length=1)
13 => string '�' (length=1)
14 => string '�' (length=1)
15 => string '�' (length=1)
16 => string ',' (length=1)
17 => string ' ' (length=1)
18 => string 'e' (length=1)
19 => string 'f' (length=1)
20 => string 'g' (length=1)

But with this code:

header('Content-type: text/html; charset=UTF-8');  // So the browser doesn't make our lives harder
$str = "abc 文字化け, efg";

$results = array();
preg_match_all('/./u', $str, $results);
var_dump($results[0]);

(Notice the 'u' at the end of the regex)

You get what you want:

array
0 => string 'a' (length=1)
1 => string 'b' (length=1)
2 => string 'c' (length=1)
3 => string ' ' (length=1)
4 => string '文' (length=3)
5 => string '字' (length=3)
6 => string '化' (length=3)
7 => string 'け' (length=3)
8 => string ',' (length=1)
9 => string ' ' (length=1)
10 => string 'e' (length=1)
11 => string 'f' (length=1)
12 => string 'g' (length=1)

Hope this helps :-)
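
One caveat with the pattern above: without the 's' modifier the dot does not match line breaks, so newlines are silently dropped from the result. Below is a minimal sketch of a wrapper that keeps them and that reports malformed UTF-8 instead of quietly returning nothing. The function name splitCodePoints() is hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper name; a sketch, not the original author's code.
function splitCodePoints(string $str): array
{
    // 'u' treats pattern and subject as UTF-8; 's' lets the dot match "\n" too.
    $count = preg_match_all('/./us', $str, $matches);
    if ($count === false) {
        // preg_match_all() returns false on failure, e.g. malformed UTF-8 input.
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $matches[0];
}

var_dump(splitCodePoints("文字\n化け"));
```

With '/./u' alone, the same input would yield four elements and the newline would be missing.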

Split string into array based on a unicode character range in PHP

You also have to check, with a lookahead, that the next character is a Cyrillic one. This code will do the job:

$t = preg_split('/(?<=[^а-я])(?=[а-я]+)/ius', $text, -1, PREG_SPLIT_NO_EMPTY);

It gives this output:

Array
(
[0] => «
[1] => Добрый
[2] => день!» -
[3] => сказал
[4] => он,
[5] => потянувшись…
)

Here you can try it.
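
The range [а-я] does not include every Cyrillic letter (ё, for example, lies outside it). A possible variant uses the Unicode script property \p{Cyrillic} instead. This is a sketch under assumptions: the input sentence is reconstructed from the output shown above, since the original $text is not given.

```php
<?php
// Assumed input, reconstructed from the output listed in the answer.
$text = '«Добрый день!» - сказал он, потянувшись…';

// Split at every position preceded by a non-Cyrillic character
// and followed by a Cyrillic one. \p{...} requires the 'u' modifier.
$parts = preg_split('/(?<=\P{Cyrillic})(?=\p{Cyrillic})/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

Script properties match both upper and lower case letters, so the 'i' modifier is not needed here.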

Convert a String into an Array of Characters - multi-byte

Just pass an empty pattern to preg_split() with the PREG_SPLIT_NO_EMPTY flag.
Otherwise, you can write a pattern with \X (the Unicode version of the dot) and \K (which resets the start of the reported match). I'll include an mb_split() call and a preg_match_all() call for completeness.

Code:

$string='先秦兩漢';
var_export(preg_split('~~u', $string, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_split('~\X\K~u', $string, 0, PREG_SPLIT_NO_EMPTY));
echo "\n---\n";
var_export(preg_split('~\X\K(?!$)~u', $string));
echo "\n---\n";
var_export(mb_split('\X\K(?!$)', $string));
echo "\n---\n";
var_export(preg_match_all('~\X~u', $string, $out) ? $out[0] : []);

All produce:

array (
0 => '先',
1 => '秦',
2 => '兩',
3 => '漢',
)

From https://www.regular-expressions.info/unicode.html:

How to Match a Single Unicode Grapheme

Matching a single grapheme, whether it's encoded as a single code point, or as multiple code points using combining marks, is easy in Perl, PCRE, PHP, Boost, Ruby 2.0, Java 9, and the Just Great Software applications: simply use \X.

You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
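
The empty pattern and \X agree on the sample string above, but they can differ once combining marks are involved: the empty pattern splits between code points, while \X keeps a base character and its marks together. A small sketch to illustrate; the sample string is my own and not from the original answer.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a".
$string = "e\u{0301}a";

// Empty pattern with /u: one element per code point.
$codePoints = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

// \X: one element per grapheme (base character plus combining marks).
preg_match_all('/\X/u', $string, $m);
$graphemes = $m[0];

var_dump(count($codePoints)); // int(3)
var_dump(count($graphemes));  // int(2)
```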


UPDATE: Dharman has brought to my attention that mb_str_split() is available as of PHP 7.4.

The default length parameter of the new function is 1, so the length parameter can be omitted for this case.

https://wiki.php.net/rfc/mb_str_split

Dharman's demo: https://3v4l.org/M85Fi/rfc#output
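
For reference, a minimal sketch of the mb_str_split() call on the same sample string, assuming PHP 7.4 or later with the mbstring extension loaded. The encoding is passed explicitly here so the result does not depend on the configured internal encoding.

```php
<?php
// Requires PHP 7.4+ and the mbstring extension.
$string = '先秦兩漢';

// Length 1 is the default; it is spelled out only to reach the encoding argument.
$chars = mb_str_split($string, 1, 'UTF-8');

var_export($chars);
```

Note that mb_str_split() splits by code point, so a base letter and a combining mark end up in separate elements, unlike with \X.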

Convert a String into an Array of Characters

You will want to use str_split().

$result = str_split('abcdef');

http://us2.php.net/manual/en/function.str-split.php
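
A short sketch of what str_split() returns, including its optional second argument, the chunk length. Keep in mind that it counts bytes, so it is only safe for single-byte text such as plain ASCII.

```php
<?php
$chars = str_split('abcdef');    // one byte per element
$pairs = str_split('abcdef', 2); // two bytes per element

print_r($chars);
print_r($pairs);
```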

PHP - Split String Into Arrays for every N characters

You could do something like this:

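$text = 'VERY LONG STRING';
$result = [];   // finished groups of words
$partial = [];  // words in the group being built
$len = 0;       // combined length of the words in $partial (spaces not counted)

foreach (explode(' ', $text) as $chunk) {
    $chunkLen = strlen($chunk);
    // Close the current group once this word would push it past the limit.
    // The $partial check avoids storing an empty group for an oversized first word.
    if ($partial && $len + $chunkLen > 5000) {
        $result[] = $partial;
        $partial = [];
        $len = 0;
    }
    $len += $chunkLen;
    $partial[] = $chunk;
}

// Keep whatever is left over after the loop.
if ($partial) {
    $result[] = $partial;
}

You can test it more easily if you do it with a lower max length.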
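
To try a lower limit, the loop can be wrapped in a function that takes the limit as a parameter. This is a sketch only: the name chunkWords() and the sample sentence are hypothetical, and like the loop above it counts word lengths without the separating spaces.

```php
<?php
// Hypothetical helper: groups words so that the summed word lengths
// in each group stay at or below $maxLen.
function chunkWords(string $text, int $maxLen): array
{
    $result = [];
    $partial = [];
    $len = 0;
    foreach (explode(' ', $text) as $word) {
        $wordLen = strlen($word);
        // Start a new group once adding this word would exceed the limit.
        if ($partial && $len + $wordLen > $maxLen) {
            $result[] = $partial;
            $partial = [];
            $len = 0;
        }
        $len += $wordLen;
        $partial[] = $word;
    }
    if ($partial) {
        $result[] = $partial;
    }
    return $result;
}

print_r(chunkWords('one two three four five', 8));
```

With a limit of 8, this groups the sample into [one, two], [three] and [four, five].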

PHP : Split a comma delimited & special symbol string into an array

Try this:

$str = "your string"; // records separated by '@', fields separated by ','
$arr = explode('@', $str);
$newArray = array();
foreach ($arr as $val) {
    $temp = explode(',', $val);
    $newTemp = array(); // start each record from a clean array
    $newTemp['treatment'] = $temp[0];
    $newTemp['quantity'] = $temp[1];
    $newTemp['cost'] = $temp[2];
    $newTemp['discount'] = $temp[3];
    $newTemp['discount_type'] = "INR";
    $newTemp['total'] = $temp[4];
    $newTemp['note'] = $temp[5];
    $newArray[] = $newTemp;
}
var_dump($newArray);
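
A more defensive variant, sketched with array_combine(). The sample input is invented for illustration, since the original string is not shown; it assumes six comma-separated fields per record and skips records that do not match.

```php
<?php
// Hypothetical sample input: records separated by '@', fields by ','.
$str = 'Cleaning,1,500,50,450,first visit@Filling,2,800,0,1600,follow-up';

$keys = ['treatment', 'quantity', 'cost', 'discount', 'total', 'note'];
$rows = [];

foreach (explode('@', $str) as $record) {
    $fields = explode(',', $record);
    if (count($fields) !== count($keys)) {
        continue; // skip malformed records instead of reading undefined offsets
    }
    $row = array_combine($keys, $fields);
    $row['discount_type'] = 'INR';
    $rows[] = $row;
}

print_r($rows);
```

This approach breaks if a field, such as the note, can itself contain a comma; in that case a real CSV parser like str_getcsv() would be a better fit.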

How to split string with special character (�) using PHP

Take a look at mb_split:

array mb_split ( string $pattern , string $string [, int $limit = -1 ] )

Split a multibyte string using regular expression pattern and returns
the result as an array.

Like this:

$string = "a�b�k�e";
$chunks = mb_split("�", $string);
print_r($chunks);

Outputs:

Array
(
[0] => a
[1] => b
[2] => k
[3] => e
)
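
Since mb_split() treats its first argument as a regular expression, a literal delimiter can also be handled by plain explode(), which is safe on valid UTF-8 input. The sketch below assumes the string is UTF-8 and that the delimiter is U+FFFD (the replacement character); the original question may have involved a different byte sequence.

```php
<?php
mb_regex_encoding('UTF-8'); // mb_split() uses the mbregex encoding

$delimiter = "\u{FFFD}";    // assumption: the delimiter is U+FFFD
$string = "a{$delimiter}b{$delimiter}k{$delimiter}e";

$viaMbSplit = mb_split($delimiter, $string);
$viaExplode = explode($delimiter, $string);

print_r($viaMbSplit);
print_r($viaExplode);
```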

How to split a string character by character, paying attention to special characters

str_split() works on bytes, so it breaks multibyte characters apart in UTF-8 strings.

You can use the u modifier in preg_split() instead.

For instance:

$input = "Comment ça va?";
$letters1 = str_split($input);
$letters2 = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);

print_r($letters1);
print_r($letters2);

This will output:

Array ( [0] => C [1] => o [2] => m [3] => m [4] => e 
[5] => n [6] => t [7] => [8] => � [9] => �
[10] => a [11] => [12] => v [13] => a [14] => ? )

Array ( [0] => C [1] => o [2] => m [3] => m [4] => e
[5] => n [6] => t [7] => [8] => ç [9] => a
[10] => [11] => v [12] => a [13] => ? )
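
The difference in element counts follows from bytes versus characters. A quick sketch to check it, with "ç" written as a Unicode escape (PHP 7.0+) so the result does not depend on the encoding of the source file:

```php
<?php
$input = "Comment \u{00E7}a va?"; // "ç" is two bytes in UTF-8

$bytes = str_split($input);
$chars = preg_split('//u', $input, -1, PREG_SPLIT_NO_EMPTY);

var_dump(count($bytes) === strlen($input));             // bool(true)
var_dump(count($chars) === mb_strlen($input, 'UTF-8')); // bool(true)
var_dump(count($bytes), count($chars));                 // int(15) int(14)
```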

