Multi-byte safe wordwrap() function for UTF-8

This one seems to work well...

function mb_wordwrap($str, $width = 75, $break = "\n", $cut = false, $charset = null) {
    if ($charset === null) $charset = mb_internal_encoding();

    // Split on the existing break sequence, then hard-cut any piece
    // that exceeds $width characters (only when $cut is true).
    $pieces = explode($break, $str);
    $result = array();
    foreach ($pieces as $piece) {
        $current = $piece;
        while ($cut && mb_strlen($current, $charset) > $width) {
            $result[] = mb_substr($current, 0, $width, $charset);
            // A null length takes the rest of the string; the original
            // hard-coded 2048 would silently truncate longer lines.
            $current = mb_substr($current, $width, null, $charset);
        }
        $result[] = $current;
    }
    return implode($break, $result);
}
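A quick sanity check (the sample string and width are illustrative; assumes UTF-8 input):

echo mb_wordwrap('先秦兩漢 and some latin text', 6, "\n", true, 'UTF-8');
// 先秦兩漢 a
// nd som
// e lati
// n text
// Note: with $cut = true this is a hard cut every 6 characters,
// not a space-aware wrap, so words can be split mid-word.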

Please define the term "multi-byte safe"

When you are dealing with Unicode characters, it is not safe to assume that every character occupies a single byte, or a single char in Java. So when reading or parsing a string, you need to take this into account.

Here is an excellent article which explains the complexities of dealing with Unicode in Java.

  1. Stored characters can take up an inconsistent number of bytes. A UTF-8
    encoded character might take between one (LATIN_CAPITAL_LETTER_A) and
    four (MATHEMATICAL_FRAKTUR_CAPITAL_G) bytes. Variable-width encoding
    has implications for reading into and decoding from byte arrays.

  2. Not all code points can be stored in a char. The
    MATHEMATICAL_FRAKTUR_CAPITAL_G example lies in the supplementary range
    of characters and cannot be stored in 16 bits. It must be represented
    by two sequential char values, neither of which is meaningful by
    itself. The Character class provides methods for working with 32-bit
    code points.

    // Unicode code point to char array
    char[] math_fraktur_cap_g = Character.toChars(0x1D50A);
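The same pitfall exists in PHP, where strlen() counts bytes while mb_strlen() counts characters. A small illustration (the sample string is arbitrary):

$s = "naïve 𝔊"; // 'ï' is 2 bytes and '𝔊' (U+1D50A) is 4 bytes in UTF-8
var_dump(strlen($s));                    // int(11) -- counts bytes
var_dump(mb_strlen($s, 'UTF-8'));        // int(7)  -- counts characters
var_dump(substr($s, 0, 3));              // "na" plus half of 'ï': corrupted
var_dump(mb_substr($s, 0, 3, 'UTF-8'));  // "naï": multi-byte safe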

How to cut off a multi-byte string (English words and Chinese characters) in PHP?

Working code based on @Jonny's comment; thanks again:

function neat_trim($str, $n, $delim = '...')
{
    // Count characters, not bytes, when the string is UTF-8.
    $len = mb_detect_encoding($str) == "UTF-8" ? mb_strlen($str, "UTF-8") : strlen($str);
    if ($len > $n)
    {
        // Match at least $n characters, then extend lazily to the next word
        // boundary so a word is not cut in half; /u makes the dot match
        // whole UTF-8 characters instead of single bytes.
        if (preg_match('/(.{' . $n . '}.*?)\b/us', $str, $matches))
        {
            return rtrim($matches[1]) . $delim;
        }
    }
    return $str;
}
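For example (an illustrative string; output assumes UTF-8 input and PCRE's default ASCII-only \b, which treats CJK characters as non-word characters):

echo neat_trim("Hello 世界 and some more words", 8);
// Hello 世界...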

Multi-byte string cutter - performance tuning

The current implementation is overcomplicated.

If I understood correctly, a better strategy would be (see the sketch after this list):

  1. Cut the string to the target length
  2. Iterate over characters from the end back to the first space
  3. Break there and return the result

This should improve performance significantly.
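A minimal PHP sketch of that strategy (the function name and parameters are my own, not from the original question; assumes UTF-8 input):

function cut_at_word($str, $len, $charset = 'UTF-8')
{
    if (mb_strlen($str, $charset) <= $len) {
        return $str;
    }
    // 1. Cut the string to the target length.
    $cut = mb_substr($str, 0, $len, $charset);
    // 2. Walk back from the end to the last space in the cut chunk.
    $lastSpace = mb_strrpos($cut, ' ', 0, $charset);
    // 3. Break there and return; fall back to the hard cut when the chunk
    //    contains no space at all (e.g. unbroken CJK text).
    return $lastSpace === false ? $cut : mb_substr($cut, 0, $lastSpace, $charset);
}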

Split MB string based on length

This will split your string after every 10th "extended grapheme cluster" (as suggested by Wiktor in the comments).

var_export(preg_split('~\X{10}\K~u', $string));

preg_split('~.{10}\K~u', $string) will work on your sample string, but for cases beyond yours, \X is more robust when dealing with Unicode.

From https://www.regular-expressions.info/unicode.html:

You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the "dot matches newline" matching mode.

Here is a related SO page.

The \K resets the start of the reported match, so no characters are lost in the split.

Here is a demo where $len = 10: https://regex101.com/r/uO6ur9/2

Code: (Demo)

$string = '先秦兩漢先秦兩漢先秦兩漢漢先秦兩漢漢先秦兩漢( 243071)';
var_export(preg_split('~\X{10}\K~u', $string));

Output:

array (
0 => '先秦兩漢先秦兩漢先秦',
1 => '兩漢漢先秦兩漢漢先秦',
2 => '兩漢( 243071',
3 => ')',
)

Implementation:

function word_chunk($str, $len) {
    return preg_split('~\X{' . $len . '}\K~u', $str);
}
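Called with the sample string from the demo above:

var_export(word_chunk($string, 10)); // same four chunks as the output above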

While preg_split() might be slightly slower than preg_match_all(), one advantage is that preg_split() returns the desired one-dimensional array directly. preg_match_all() generates a multi-dimensional array, from which you would need to access the [0] subarray's elements.
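For comparison, a preg_match_all() version would look like this (a sketch; note the extra [0] lookup):

preg_match_all('~\X{1,10}~u', $string, $matches);
var_export($matches[0]); // the same chunks, nested one level deeper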


