Multi-byte safe wordwrap() function for UTF-8
This one seems to work well...
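function mb_wordwrap($str, $width = 75, $break = "\n", $cut = false, $charset = null) {
    // Fall back to the internal encoding when none is given.
    if ($charset === null) $charset = mb_internal_encoding();
    // Existing breaks in the input are kept as line boundaries.
    $pieces = explode($break, $str);
    $result = array();
    foreach ($pieces as $piece) {
        $current = $piece;
        // Lines are only split when $cut is true, and always as a hard cut
        // after $width characters; word boundaries are not considered.
        while ($cut && mb_strlen($current, $charset) > $width) {
            $result[] = mb_substr($current, 0, $width, $charset);
            // A null length reads to the end of the string (PHP 5.4.8+).
            $current = mb_substr($current, $width, null, $charset);
        }
        $result[] = $current;
    }
    return implode($break, $result);
}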
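To see why a multi-byte aware version is needed at all, here is a small sketch comparing the core wordwrap() with a character-based cut. The sample string and variable names are only for illustration, and mb_str_split() assumes PHP 7.4 or later.

```php
<?php
// Illustration only: the sample string and variable names are assumptions.
$text = '先秦兩漢'; // 4 characters, 12 bytes in UTF-8

// Core wordwrap() counts bytes, so a forced cut at width 4 splits characters apart.
$byteWrapped = wordwrap($text, 4, "\n", true);

// Cutting by characters keeps every piece valid (mb_str_split() needs PHP 7.4+).
$charWrapped = implode("\n", mb_str_split($text, 2, 'UTF-8'));

var_dump(mb_check_encoding($byteWrapped, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charWrapped, 'UTF-8')); // bool(true)
```

The byte-based result is no longer valid UTF-8 because each cut lands in the middle of a three-byte character.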
Please define the term Multi-byte safe
When you are dealing with Unicode characters, it is not safe to assume that every character takes a single byte or a single char (Java). So when reading or parsing a string, you need to take this into account.
Here is an excellent article which explains the complexities of dealing with Unicode in Java.
Stored characters can take up an inconsistent number of bytes. A UTF-8 encoded character might take between one (LATIN_CAPITAL_LETTER_A) and four (MATHEMATICAL_FRAKTUR_CAPITAL_G) bytes. Variable width encoding has implications for reading into and decoding from byte arrays.
Not all code points can be stored in a char. The MATHEMATICAL_FRAKTUR_CAPITAL_G example lies in the supplementary range of characters and cannot be stored in 16 bits. It must be represented by two sequential char values, neither of which is meaningful by itself. The Character class provides methods for working with 32-bit code points.
// Unicode code point to char array
char[] math_fraktur_cap_g = Character.toChars(0x1D50A);
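The same two characters can be checked from the PHP side. This is a minimal sketch, assuming UTF-8 strings and PHP 7.0+ (for the \u{...} escape); the variable names are made up for the example.

```php
<?php
// Illustration only: variable names are assumptions.
$latinA   = 'A';          // U+0041 LATIN CAPITAL LETTER A
$frakturG = "\u{1D50A}";  // U+1D50A MATHEMATICAL FRAKTUR CAPITAL G

// strlen() counts bytes, mb_strlen() counts characters.
$bytesA = strlen($latinA);              // 1 byte
$bytesG = strlen($frakturG);            // 4 bytes
$charsG = mb_strlen($frakturG, 'UTF-8'); // 1 character

// A byte-based cut leaves a broken sequence; a character-based cut does not.
$brokenCut = substr($frakturG, 0, 1);
$safeCut   = mb_substr($frakturG, 0, 1, 'UTF-8');

var_dump($bytesA, $bytesG, $charsG);
var_dump(mb_check_encoding($brokenCut, 'UTF-8'), $safeCut === $frakturG);
```

A function is "multi-byte safe" in this sense when it behaves like the mb_* calls here and never cuts inside a character.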
How to cut off multi-byte string (English word and Chinese character) in PHP?
Working code based on @Jonny's comment, thanks again
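function neat_trim($str, $n, $delim = '...')
{
    // Count characters rather than bytes when the string is UTF-8.
    $len = mb_detect_encoding($str) == "UTF-8" ? mb_strlen($str, "UTF-8") : strlen($str);
    if ($len > $n)
    {
        // Take the first $n characters, then extend lazily to the next word boundary.
        // The (int) cast keeps non-numeric input out of the quantifier.
        if (preg_match('/(.{' . (int) $n . '}.*?)\b/us', $str, $matches))
        {
            return rtrim($matches[1]) . $delim;
        }
    }
    // Short strings, and strings the pattern could not match, are returned unchanged.
    return $str;
}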
Multi-byte string cutter - performance tuning
The current implementation is overcomplicated.
If I understood correctly, a better strategy would be:
- Cut the string by the length
- Iterate over characters from the end up to the first space
- Break and return the result
It should improve the performance significantly.
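The three steps could look roughly like this. It is only a sketch: the function name cut_at_last_space() is hypothetical, it assumes UTF-8 input with plain spaces as word separators, and it uses mb_strrpos() for the backwards search instead of a hand-written loop.

```php
<?php
// Hypothetical helper for illustration; not taken from the original post.
function cut_at_last_space(string $str, int $max, string $encoding = 'UTF-8'): string
{
    // Nothing to do when the string already fits.
    if (mb_strlen($str, $encoding) <= $max) {
        return $str;
    }

    // Step 1: cut the string by length, counted in characters.
    $cut = mb_substr($str, 0, $max, $encoding);

    // Step 2: search backwards from the end for the last space.
    $pos = mb_strrpos($cut, ' ', 0, $encoding);

    // Step 3: return everything before that space,
    // or the hard cut when the piece contains no space at all.
    return $pos === false ? $cut : mb_substr($cut, 0, $pos, $encoding);
}

echo cut_at_last_space('hello wide world', 12), "\n"; // hello wide
echo cut_at_last_space('先秦兩漢 先秦兩漢', 6), "\n";   // 先秦兩漢
```

One simplification to be aware of: when the cut happens to land exactly at the end of a word, this version still backs up to the previous space.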
Split MB string based on length
This will split your string after every 10th "extended grapheme cluster" (suggested by Wiktor up in the comments).
var_export(preg_split('~\X{10}\K~u', $string));
preg_split('~.{10}\K~u', $string) will work on your sample string, but for cases beyond yours, \X is more robust when dealing with Unicode.
From https://www.regular-expressions.info/unicode.html:
You can consider \X the Unicode version of the dot. There is one difference, though: \X always matches line break characters, whereas the dot does not match line break characters unless you enable the dot matches newline matching mode.
Here is a related SO page.
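A quick way to see both differences is the following sketch, assuming UTF-8 input and PHP 7.0+ (for the \u{...} escape). The variable names are only for illustration.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one grapheme cluster.
$accented = "e\u{0301}";

// With the u modifier, the dot matches one code point at a time.
$dotCount = preg_match_all('~.~u', $accented);

// \X matches a whole extended grapheme cluster.
$clusterCount = preg_match_all('~\X~u', $accented);

// The dot skips a line feed unless the s modifier is set; \X matches it.
$dotMatchesNewline     = preg_match('~.~u', "\n");
$clusterMatchesNewline = preg_match('~\X~u', "\n");

var_dump($dotCount, $clusterCount, $dotMatchesNewline, $clusterMatchesNewline);
```

The dot finds two matches in the accented letter while \X finds one, so splitting on the dot could separate a base letter from its accent.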
The \K restarts the fullstring match, so there are no characters lost in the split.
Here is a demo where $len=10
https://regex101.com/r/uO6ur9/2
Code: (Demo)
$string='先秦兩漢先秦兩漢先秦兩漢漢先秦兩漢漢先秦兩漢( 243071)';
var_export(preg_split('~\X{10}\K~u', $string));
Output:
array (
0 => '先秦兩漢先秦兩漢先秦',
1 => '兩漢漢先秦兩漢漢先秦',
2 => '兩漢( 243071',
3 => ')',
)
Implementation:
function word_chunk($str, $len) {
    return preg_split('~\X{' . $len . '}\K~u', $str);
}
While preg_split() might be slightly slower than preg_match_all(), one advantage is that preg_split() provides the desired 1-dimensional array. preg_match_all() generates a multi-dimensional array, of which you would only need the [0] subarray's elements.
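For comparison, here are both approaches side by side as a rough sketch. The function names chunk_with_split() and chunk_with_match_all() are hypothetical, $len is assumed to be at least 1, and the PREG_SPLIT_NO_EMPTY flag is my own addition to guard against an empty piece at the end of the result.

```php
<?php
// Hypothetical helpers for illustration; not from the original answer.

// preg_split(): returns the flat array directly.
function chunk_with_split(string $str, int $len): array
{
    // PREG_SPLIT_NO_EMPTY drops any empty piece from the result.
    $parts = preg_split('~\X{' . $len . '}\K~u', $str, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

// preg_match_all(): the chunks live in the [0] subarray.
function chunk_with_match_all(string $str, int $len): array
{
    // {1,$len} lets the final, shorter chunk match as well.
    preg_match_all('~\X{1,' . $len . '}~u', $str, $matches);
    return $matches[0];
}

$string = '先秦兩漢先秦兩漢先秦兩漢漢先秦兩漢漢先秦兩漢( 243071)';
var_export(chunk_with_split($string, 10));
var_export(chunk_with_match_all($string, 10));
```

Both calls should give the same four chunks as the output shown earlier; which one is faster is best measured on your own data.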