Truncate a Multibyte String to N Chars

Truncate a multibyte String to n chars

Try this:

function truncate($string, $chars = 50, $terminator = ' …') {
$cutPos = $chars - mb_strlen($terminator);
$boundaryPos = mb_strrpos(mb_substr($string, 0, mb_strpos($string, ' ', $cutPos)), ' ');
return mb_substr($string, 0, $boundaryPos === false ? $cutPos : $boundaryPos) . $terminator;
}

But you need to make sure that your internal encoding is properly set.

Truncate a string to first n characters of a string and add three dots if any characters are removed

//The simple version for 10 Characters from the beginning of the string
$string = substr($string,0,10).'...';

Update:

Based on suggestion for checking length (and also ensuring similar lengths on trimmed and untrimmed strings):

$string = (strlen($string) > 13) ? substr($string,0,10).'...' : $string;

So you will get a string of max 13 characters; either 13 (or less) normal characters or 10 characters followed by '...'

Update 2:

Or as function:

function truncate($string, $length, $dots = "...") {
return (strlen($string) > $length) ? substr($string, 0, $length - strlen($dots)) . $dots : $string;
}

Update 3:

It's been a while since I wrote this answer and I don't actually use this code any more. I prefer this function which prevents breaking the string in the middle of a word using the wordwrap function:

function truncate($string,$length=100,$append="…") {
$string = trim($string);

if(strlen($string) > $length) {
$string = wordwrap($string, $length);
$string = explode("\n", $string, 2);
$string = $string[0] . $append;
}

return $string;
}

How do I truncate a java string to fit in a given number of bytes, once UTF-8 encoded?

Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when it is exceeded:

public static String truncateWhenUTF8(String s, int maxBytes) {
int b = 0;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);

// ranges from http://en.wikipedia.org/wiki/UTF-8
int skip = 0;
int more;
if (c <= 0x007f) {
more = 1;
}
else if (c <= 0x07FF) {
more = 2;
} else if (c <= 0xd7ff) {
more = 3;
} else if (c <= 0xDFFF) {
// surrogate area, consume next char as well
more = 4;
skip = 1;
} else {
more = 3;
}

if (b + more > maxBytes) {
return s.substring(0, i);
}
b += more;
i += skip;
}
return s;
}

This does handle surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs surrogate pairs as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() will return the longest truncated string it can. If you ignore surrogate pairs in the implementation then the truncated strings may be shorted than they needed to be.

I haven't done a lot of testing on that code, but here are some preliminary tests:

private static void test(String s, int maxBytes, int expectedBytes) {
String result = truncateWhenUTF8(s, maxBytes);
byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
if (utf8.length > maxBytes) {
System.out.println("BAD: our truncation of " + s + " was too big");
}
if (utf8.length != expectedBytes) {
System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
}
System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
test("abcd", 0, 0);
test("abcd", 1, 1);
test("abcd", 2, 2);
test("abcd", 3, 3);
test("abcd", 4, 4);
test("abcd", 5, 4);

test("a\u0080b", 0, 0);
test("a\u0080b", 1, 1);
test("a\u0080b", 2, 1);
test("a\u0080b", 3, 3);
test("a\u0080b", 4, 4);
test("a\u0080b", 5, 4);

test("a\u0800b", 0, 0);
test("a\u0800b", 1, 1);
test("a\u0800b", 2, 1);
test("a\u0800b", 3, 1);
test("a\u0800b", 4, 4);
test("a\u0800b", 5, 5);
test("a\u0800b", 6, 5);

// surrogate pairs
test("\uD834\uDD1E", 0, 0);
test("\uD834\uDD1E", 1, 0);
test("\uD834\uDD1E", 2, 0);
test("\uD834\uDD1E", 3, 0);
test("\uD834\uDD1E", 4, 4);
test("\uD834\uDD1E", 5, 4);

}

Updated Modified code example, it now handles surrogate pairs.

Truncating Strings by Bytes

Why not convert to bytes and walk forward--obeying UTF8 character boundaries as you do it--until you've got the max number, then convert those bytes back into a string?

Or you could just cut the original string if you keep track of where the cut should occur:

// Assuming that Java will always produce valid UTF8 from a string, so no error checking!
// (Is this always true, I wonder?)
public class UTF8Cutter {
public static String cut(String s, int n) {
byte[] utf8 = s.getBytes();
if (utf8.length < n) n = utf8.length;
int n16 = 0;
int advance = 1;
int i = 0;
while (i < n) {
advance = 1;
if ((utf8[i] & 0x80) == 0) i += 1;
else if ((utf8[i] & 0xE0) == 0xC0) i += 2;
else if ((utf8[i] & 0xF0) == 0xE0) i += 3;
else { i += 4; advance = 2; }
if (i <= n) n16 += advance;
}
return s.substring(0,n16);
}
}

Note: edited to fix bugs on 2014-08-25

Python truncating international string

You need to cut to bytes length, so you need first to .encode('utf-8') your string, and then cut it at a code point boundary.

In UTF-8, ASCII (<= 127) are 1-byte. Bytes with two or more most significant bits set (>= 192) are character-starting bytes; the number of bytes that follow is determined by the number of most significant bits set. Anything else is continuation bytes.

A problem may arise if you cut the multi-byte sequence in the middle; if a character did not fit, it should be cut completely, up to the starting byte.

Here's some working code:

LENGTH_BY_PREFIX = [
(0xC0, 2), # first byte mask, total codepoint length
(0xE0, 3),
(0xF0, 4),
(0xF8, 5),
(0xFC, 6),
]

def codepoint_length(first_byte):
if first_byte < 128:
return 1 # ASCII
for mask, length in LENGTH_BY_PREFIX:
if first_byte & mask == mask:
return length
assert False, 'Invalid byte %r' % first_byte

def cut_to_bytes_length(unicode_text, byte_limit):
utf8_bytes = unicode_text.encode('UTF-8')
cut_index = 0
while cut_index < len(utf8_bytes):
step = codepoint_length(ord(utf8_bytes[cut_index]))
if cut_index + step > byte_limit:
# can't go a whole codepoint further, time to cut
return utf8_bytes[:cut_index]
else:
cut_index += step
# length limit is longer than our bytes strung, so no cutting
return utf8_bytes

Now test. If .decode() succeeds, we have made a correct cut.

unicode_text = u"هيك بنكون" # note that the literal here is Unicode

print cut_to_bytes_length(unicode_text, 100).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 10).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 5).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 4).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 3).decode('UTF-8')
print cut_to_bytes_length(unicode_text, 2).decode('UTF-8')

# This returns empty strings, because an Arabic letter
# requires at least 2 bytes to represent in UTF-8.
print cut_to_bytes_length(unicode_text, 1).decode('UTF-8')

You can test that the code works with ASCII as well.

Smart way to truncate long strings

Essentially, you check the length of the given string. If it's longer than a given length n, clip it to length n (substr or slice) and add html entity (…) to the clipped string.

Such a method looks like

function truncate(str, n){
return (str.length > n) ? str.slice(0, n-1) + '…' : str;
};

If by 'more sophisticated' you mean truncating at the last word boundary of a string then you need an extra check.
First you clip the string to the desired length, next you clip the result of that to its last word boundary

function truncate( str, n, useWordBoundary ){
if (str.length <= n) { return str; }
const subString = str.slice(0, n-1); // the original check
return (useWordBoundary
? subString.slice(0, subString.lastIndexOf(" "))
: subString) + "…";
};

You can extend the native String prototype with your function. In that case the str parameter should be removed and str within the function should be replaced with this:

String.prototype.truncate = String.prototype.truncate || 
function ( n, useWordBoundary ){
if (this.length <= n) { return this; }
const subString = this.slice(0, n-1); // the original check
return (useWordBoundary
? subString.slice(0, subString.lastIndexOf(" "))
: subString) + "…";
};

More dogmatic developers may chide you strongly for that ("Don't modify objects you don't own". I wouldn't mind though).

An approach without extending the String prototype is to create
your own helper object, containing the (long) string you provide
and the beforementioned method to truncate it. That's what the snippet
below does.

const LongstringHelper = str => {
const sliceBoundary = str => str.substr(0, str.lastIndexOf(" "));
const truncate = (n, useWordBoundary) =>
str.length <= n ? str : `${ useWordBoundary
? sliceBoundary(str.slice(0, n - 1))
: str.slice(0, n - 1)}…`;
return { full: str, truncate };
};
const longStr = LongstringHelper(`Lorem ipsum dolor sit amet, consectetur
adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore
magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation
ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute
irure dolor in reprehenderit in voluptate velit esse cillum dolore
eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum`);

const plain = document.querySelector("#resultTruncatedPlain");
const lastWord = document.querySelector("#resultTruncatedBoundary");
plain.innerHTML =
longStr.truncate(+plain.dataset.truncateat, !!+plain.dataset.onword);
lastWord.innerHTML =
longStr.truncate(+lastWord.dataset.truncateat, !!+lastWord.dataset.onword);
document.querySelector("#resultFull").innerHTML = longStr.full;
body {
font: normal 12px/15px verdana, arial;
}

p {
width: 450px;
}

#resultTruncatedPlain:before {
content: 'Truncated (plain) n='attr(data-truncateat)': ';
color: green;
}

#resultTruncatedBoundary:before {
content: 'Truncated (last whole word) n='attr(data-truncateat)': ';
color: green;
}

#resultFull:before {
content: 'Full: ';
color: green;
}
<p id="resultTruncatedPlain" data-truncateat="120" data-onword="0"></p>
<p id="resultTruncatedBoundary" data-truncateat="120" data-onword="1"></p>
<p id="resultFull"></p>

How do I cut a string and add suspension marks?

http://php.net/wordwrap

$str = 'A very long string here';
$str = wordwrap($str, 100);
$str = explode("\n", $str);
$str = $str[0] . '...';

Using JavaScript to truncate text to a certain size (8 KB)

If you are using a single-byte encoding, yes, 8192 characters=8192 bytes. If you are using UTF-16, 8192 characters(*)=4096 bytes.

(Actually 8192 code-points, which is a slightly different thing in the face of surrogates, but let's not worry about that because JavaScript doesn't.)

If you are using UTF-8, there's a quick trick you can use to implement a UTF-8 encoder/decoder in JS with minimal code:

function toBytesUTF8(chars) {
return unescape(encodeURIComponent(chars));
}
function fromBytesUTF8(bytes) {
return decodeURIComponent(escape(bytes));
}

Now you can truncate with:

function truncateByBytesUTF8(chars, n) {
var bytes= toBytesUTF8(chars).substring(0, n);
while (true) {
try {
return fromBytesUTF8(bytes);
} catch(e) {};
bytes= bytes.substring(0, bytes.length-1);
}
}

(The reason for the try-catch there is that if you truncate the bytes in the middle of a multibyte character sequence you'll get an invalid UTF-8 stream and decodeURIComponent will complain.)

If it's another multibyte encoding such as Shift-JIS or Big5, you're on your own.



Related Topics



Leave a reply



Submit