Multi-byte safe wordwrap() function for UTF-8

This one seems to work well...

function mb_wordwrap($str, $width = 75, $break = "\n", $cut = false, $charset = null) {
    if ($charset === null) $charset = mb_internal_encoding();

    // Split on the existing break sequence, then hard-cut any piece
    // that exceeds $width characters (only when $cut is true).
    $pieces = explode($break, $str);
    $result = array();
    foreach ($pieces as $piece) {
        $current = $piece;
        while ($cut && mb_strlen($current, $charset) > $width) {
            $result[] = mb_substr($current, 0, $width, $charset);
            // A null length takes the rest of the string; the original
            // hard-coded 2048 would silently truncate longer lines.
            $current = mb_substr($current, $width, null, $charset);
        }
        $result[] = $current;
    }
    return implode($break, $result);
}
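A quick sanity check (the sample string and width are illustrative; assumes UTF-8 input):

echo mb_wordwrap('先秦兩漢 and some latin text', 6, "\n", true, 'UTF-8');
// 先秦兩漢 a
// nd som
// e lati
// n text
// Note: with $cut = true this is a hard cut every 6 characters,
// not a space-aware wrap, so words can be split mid-word.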

Please define the term "multi-byte safe"

When you are dealing with Unicode characters, it is not safe to assume that every character occupies a single byte, or a single char in Java. So when reading or parsing a string, you need to take this into account.

Here is an excellent article which explains the complexities of dealing with Unicode in Java.

  1. Stored characters can take up an inconsistent number of bytes. A UTF-8
    encoded character might take between one (LATIN_CAPITAL_LETTER_A) and
    four (MATHEMATICAL_FRAKTUR_CAPITAL_G) bytes. Variable-width encoding
    has implications for reading into and decoding from byte arrays.

  2. Not all code points can be stored in a char. The
    MATHEMATICAL_FRAKTUR_CAPITAL_G example lies in the supplementary range
    of characters and cannot be stored in 16 bits. It must be represented
    by two sequential char values, neither of which is meaningful by
    itself. The Character class provides methods for working with 32-bit
    code points.

    // Unicode code point to char array
    char[] math_fraktur_cap_g = Character.toChars(0x1D50A);
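The same pitfall exists in PHP, where strlen() counts bytes while mb_strlen() counts characters. A small illustration (the sample string is arbitrary):

$s = "naïve 𝔊"; // 'ï' is 2 bytes and '𝔊' (U+1D50A) is 4 bytes in UTF-8
var_dump(strlen($s));                    // int(11) -- counts bytes
var_dump(mb_strlen($s, 'UTF-8'));        // int(7)  -- counts characters
var_dump(substr($s, 0, 3));              // "na" plus half of 'ï': corrupted
var_dump(mb_substr($s, 0, 3, 'UTF-8'));  // "naï": multi-byte safe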

How to cut off a multi-byte string (English words and Chinese characters) in PHP?

Working code based on @Jonny's comment; thanks again:

function neat_trim($str, $n, $delim = '...')
{
    // Count characters, not bytes, when the string is UTF-8.
    $len = mb_detect_encoding($str) == "UTF-8" ? mb_strlen($str, "UTF-8") : strlen($str);
    if ($len > $n)
    {
        // Match at least $n characters, then extend lazily to the next word
        // boundary so a word is not cut in half; /u makes the dot match
        // whole UTF-8 characters instead of single bytes.
        if (preg_match('/(.{' . $n . '}.*?)\b/us', $str, $matches))
        {
            return rtrim($matches[1]) . $delim;
        }
    }
    return $str;
}
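For example (an illustrative string; output assumes UTF-8 input and PCRE's default ASCII-only \b, which treats CJK characters as non-word characters):

echo neat_trim("Hello 世界 and some more words", 8);
// Hello 世界...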

Multi-byte string cutter - performance tuning

The current implementation is overcomplicated.

If I understood correctly, a better strategy would be (see the sketch after this list):

  1. Cut the string to the target length
  2. Iterate over characters from the end back to the first space
  3. Break there and return the result

This should improve performance significantly.
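A minimal PHP sketch of that strategy (the function name and parameters are my own, not from the original question; assumes UTF-8 input):

function cut_at_word($str, $len, $charset = 'UTF-8')
{
    if (mb_strlen($str, $charset) <= $len) {
        return $str;
    }
    // 1. Cut the string to the target length.
    $cut = mb_substr($str, 0, $len, $charset);
    // 2. Walk back from the end to the last space in the cut chunk.
    $lastSpace = mb_strrpos($cut, ' ', 0, $charset);
    // 3. Break there and return; fall back to the hard cut when the chunk
    //    contains no space at all (e.g. unbroken CJK text).
    return $lastSpace === false ? $cut : mb_substr($cut, 0, $lastSpace, $charset);
}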

Split MB string based on length

This will split your string after every 10th "extended grapheme cluster" (as suggested by Wiktor in the comments).

var_export(preg_split('~\X{10}\K~u', $string));

preg_split('~.{10}\K~u', $string) will work on your sample string, but for cases beyond yours, \X is more robust when dealing with Unicode.

From https://www.regular-expressions.info/unicode.html:

You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the "dot matches newline" matching mode.

Here is a related SO page.

The \K resets the start of the reported match, so no characters are lost in the split.

Here is a demo where $len = 10: https://regex101.com/r/uO6ur9/2

Code: (Demo)

$string = '先秦兩漢先秦兩漢先秦兩漢漢先秦兩漢漢先秦兩漢( 243071)';
var_export(preg_split('~\X{10}\K~u', $string));

Output:

array (
0 => '先秦兩漢先秦兩漢先秦',
1 => '兩漢漢先秦兩漢漢先秦',
2 => '兩漢( 243071',
3 => ')',
)

Implementation:

function word_chunk($str, $len) {
    return preg_split('~\X{' . $len . '}\K~u', $str);
}
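Called with the sample string from the demo above:

var_export(word_chunk($string, 10)); // same four chunks as the output above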

While preg_split() might be slightly slower than preg_match_all(), one advantage is that preg_split() returns the desired one-dimensional array directly. preg_match_all() generates a multi-dimensional array, from which you would need to access the [0] subarray's elements.
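For comparison, a preg_match_all() version would look like this (a sketch; note the extra [0] lookup):

preg_match_all('~\X{1,10}~u', $string, $matches);
var_export($matches[0]); // the same chunks, nested one level deeper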


