How to Reverse a Unicode String

Grapheme functions handle UTF-8 strings more correctly than the mbstring and PCRE functions, which may split a character from its combining marks. You can see the difference between them by running the following code.

// Split into grapheme clusters (intl extension).
function str_to_array($string)
{
    $length = grapheme_strlen($string);
    $ret = [];

    for ($i = 0; $i < $length; $i += 1) {
        $ret[] = grapheme_substr($string, $i, 1);
    }

    return $ret;
}

// Split into code points (mbstring extension).
function str_to_array2($string)
{
    $length = mb_strlen($string, "UTF-8");
    $ret = [];

    for ($i = 0; $i < $length; $i += 1) {
        $ret[] = mb_substr($string, $i, 1, "UTF-8");
    }

    return $ret;
}

// Split into code points (PCRE with the /u modifier).
function str_to_array3($string)
{
    return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}

function utf8_strrev($string)
{
    return implode(array_reverse(str_to_array($string)));
}

function utf8_strrev2($string)
{
    return implode(array_reverse(str_to_array2($string)));
}

function utf8_strrev3($string)
{
    return implode(array_reverse(str_to_array3($string)));
}

// http://www.php.net/manual/en/function.grapheme-strlen.php
// Decomposed forms: each letter is followed by a combining mark.
$string = "a\xCC\x8A"  // 'a' + COMBINING RING ABOVE (U+030A), equivalent to U+00E5
        . "o\xCC\x88"; // 'o' + COMBINING DIAERESIS (U+0308), equivalent to U+00F6

var_dump(array_map(function($elem) { return strtoupper(bin2hex($elem)); },
[
    'should be' => "o\xCC\x88"."a\xCC\x8A",
    'grapheme' => utf8_strrev($string),
    'mbstring' => utf8_strrev2($string),
    'pcre' => utf8_strrev3($string)
]));

The result is as follows:

array(4) {
["should be"]=>
string(12) "6FCC8861CC8A"
["grapheme"]=>
string(12) "6FCC8861CC8A"
["mbstring"]=>
string(12) "CC886FCC8A61"
["pcre"]=>
string(12) "CC886FCC8A61"
}
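
To see why the last two results are wrong, it can help to decode the hex dumps back into code points. The following is a small Python 3 sketch, used here only for illustration; the helper name code_points is hypothetical. It suggests that in the mbstring and PCRE output each combining mark now precedes a base letter instead of following its own:

```python
def code_points(hex_dump):
    """Return the code points of a UTF-8 hex dump as 'U+XXXX' strings."""
    text = bytes.fromhex(hex_dump).decode("utf-8")
    return ["U+%04X" % ord(ch) for ch in text]

# Hex strings copied from the var_dump output above.
grapheme_result = code_points("6FCC8861CC8A")
mbstring_result = code_points("CC886FCC8A61")

print(grapheme_result)  # base letter first, then its combining mark
print(mbstring_result)  # combining mark first, detached from its base letter
```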

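IntlBreakIterator can be used since PHP 5.5 (intl 3.0). Note that createCharacterInstance() breaks on grapheme cluster boundaries, whereas createCodePointInstance() would split combining marks from their base characters, just as mbstring does:

function utf8_strrev($str)
{
    // Iterate over grapheme cluster boundaries (byte offsets into $str).
    $it = IntlBreakIterator::createCharacterInstance();
    $it->setText($str);

    $ret = '';
    $pos = 0;
    $prev = 0;

    foreach ($it as $pos) {
        // Prepend each cluster so the clusters end up in reverse order.
        $ret = substr($str, $prev, $pos - $prev) . $ret;
        $prev = $pos;
    }

    return $ret;
}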

Python: reversing a UTF-8 string

Python 2 strings are byte strings, and UTF-8 encoded text can use multiple bytes per character. Just because your terminal manages to interpret the UTF-8 bytes as characters doesn't mean that Python knows which bytes form one UTF-8 character.

Your bytestring consists of 6 bytes; every two bytes form one character:

>>> a = "čšž"
>>> a
'\xc4\x8d\xc5\xa1\xc5\xbe'

However, how many bytes UTF-8 uses depends on where in the Unicode standard the character is defined; ASCII characters (the first 128 characters in the Unicode standard) only need 1 byte each, and many emoji need 4 bytes!
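
As a quick check of those byte counts, here is a minimal Python 3 sketch (Python 3 strings are Unicode, so they must be encoded explicitly). The sample characters are arbitrary choices, not taken from the question:

```python
# One sample character for each UTF-8 encoded length.
samples = ["a", "\u010d", "\u20ac", "\U0001F600"]  # a, č, €, 😀

# Map each character to the number of bytes its UTF-8 encoding needs.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(repr(ch), n)
```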

In UTF-8, byte order is everything; reversing the above bytestring reverses the bytes, resulting in gibberish as far as the UTF-8 standard is concerned. The middle 4 bytes just happen to form valid UTF-8 sequences (for š and ō):

>>> a[::-1]
'\xbe\xc5\xa1\xc5\x8d\xc4'
-----~~~~~~~~^^^^^^^^####
  |     š       ō     |
  \                   \
   invalid UTF8 byte   opening UTF-8 byte missing a second byte
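
The same effect can be reproduced in Python 3, where the bytes have to be produced explicitly. This is only an illustrative sketch; errors="replace" is used so that the invalid bytes show up as U+FFFD replacement characters instead of raising an exception:

```python
data = "\u010d\u0161\u017e".encode("utf-8")  # "čšž" as 6 UTF-8 bytes

# Reversing the bytes breaks the first and last sequences.
reversed_bytes = data[::-1]

# Decode leniently: each undecodable byte becomes U+FFFD.
decoded = reversed_bytes.decode("utf-8", errors="replace")

print(reversed_bytes)
print(decoded)
```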

You'd have to decode the byte string to a unicode object, which consists of single characters. Reversing that object gives you the right results:

b = a.decode('utf8')[::-1]
print b

You can always encode the object back to UTF-8 again:

b = a.decode('utf8')[::-1].encode('utf8')
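
For readers on Python 3, where str is already Unicode and bytes is a separate type, the same round trip might look like the sketch below. The function name reverse_utf8 is a hypothetical helper, not part of any library:

```python
def reverse_utf8(data: bytes) -> bytes:
    """Decode UTF-8 bytes, reverse the code points, and re-encode."""
    return data.decode("utf-8")[::-1].encode("utf-8")

original = "\u010d\u0161\u017e".encode("utf-8")  # "čšž"
print(reverse_utf8(original).decode("utf-8"))
```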

Note that even with Unicode objects you can still run into issues when the text contains combining characters. Reversing such text places each combining character in front of, rather than after, the character it combines with, so it combines with the wrong character instead:

>>> print u'e\u0301a'
éa
>>> print u'e\u0301a'[::-1]
áe

You can mostly avoid this by converting the Unicode data to its normalised composed form, NFC (which replaces combinations with single-code-point forms where they exist), but there are plenty of other exotic Unicode characters that don't play well with string reversals.
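
Both ideas can be sketched in Python 3 with the standard unicodedata module. The function names are hypothetical, and the second function is only an approximation: it keeps combining marks (general category M*) attached to the preceding character, which is not full grapheme cluster segmentation:

```python
import unicodedata

def reverse_nfc(text):
    """Compose to NFC first, then reverse code points."""
    return unicodedata.normalize("NFC", text)[::-1]

def reverse_keeping_marks(text):
    """Reverse while keeping combining marks after their base character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return "".join(reversed(clusters))

print(reverse_nfc("e\u0301a"))
print(reverse_keeping_marks("e\u0301a"))
```

The second variant also covers combinations that have no precomposed form, at the cost of a little more code.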

How to get a reversed String (unicode safe)

Your issue could also be resolved by converting the string into the canonical composition form, NFC. The java.text.Normalizer class can be used to combine accents and other combining characters with their base characters, so the string can then be reversed properly.

All these other ideas (StringBuilder.reverse(), StringBuffer.reverse()) will correctly reverse the characters in your buffer, but if you start with decomposed characters, you might not get what you expect.

In the decomposed normalization forms, accents are stored separately from their base characters (as separate code points); in the composed forms they are not. So "áe" is stored as three characters in one form and as two in the other.

However, such normalization isn't sufficient for handling other kinds of character combination, nor does it account for characters in the Unicode astral planes, which Java stores as two chars (a surrogate pair).
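
The surrogate-pair point can be illustrated without Java by looking at the UTF-16 code units directly. The following Python 3 sketch uses a hypothetical helper, utf16_code_units, to list the 16-bit units that a Java String would hold for a given text:

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as upper-case hex strings."""
    raw = text.encode("utf-16-be")
    return [raw[i:i + 2].hex().upper() for i in range(0, len(raw), 2)]

print(utf16_code_units("\u00e9"))      # one code unit (BMP character)
print(utf16_code_units("\U0001F600"))  # two code units (astral character)
```

Reversing code units one by one would swap the two halves of such a pair and produce an invalid sequence, which is why a reversal routine has to treat the pair as a single unit.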

Thanks to tchrist for pointing out the ICU support for text segmentation including extended grapheme clusters such as the one identified in the comments below (see virama). This resource seems to be the authoritative source of information on this kind of stuff.

In Corona SDK how to reverse a unicode string?

Corona SDK appears to use UTF-8 as its string encoding.

If you want to reverse all Unicode code points in a string, instead of all bytes, you can use this code:

function utf8reverse(str)
    return str:gsub("([\194-\244][\128-\191]+)", string.reverse):reverse()
end

print(utf8reverse("أحمد"))

The trick is as follows: a multibyte UTF-8 sequence always starts with a byte of the form 11xx xxxx, followed by one or more bytes of the form 10xx xxxx. The first step reverses the bytes within each multibyte sequence; the second step reverses all the bytes of the string.
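
For comparison, the same two-step trick might be written in Python 3 on a bytes object as follows. This is a sketch that assumes valid UTF-8 input; the names _MULTIBYTE and utf8_reverse are hypothetical:

```python
import re

# A lead byte (0xC2-0xF4) followed by one or more continuation bytes.
_MULTIBYTE = re.compile(rb"[\xc2-\xf4][\x80-\xbf]+")

def utf8_reverse(data: bytes) -> bytes:
    """Reverse the code points of valid UTF-8 data without decoding it."""
    # Step 1: reverse the bytes inside each multibyte sequence.
    flipped = _MULTIBYTE.sub(lambda m: m.group()[::-1], data)
    # Step 2: reverse all bytes, which restores each sequence's byte order.
    return flipped[::-1]

name = "\u0623\u062d\u0645\u062f".encode("utf-8")  # the Arabic name from above
print(utf8_reverse(name).decode("utf-8"))
```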

Note: this simple trick does not work when a character is composed of several code points. Full support would require a large Unicode database.

How do I reverse a UTF-8 string in place?

I'd make one pass reversing the bytes, then a second pass that reverses the bytes of each multibyte character (which are easily detected in UTF-8) back into their correct order.

You can definitely handle this inline in a single pass, but I wouldn't bother unless the routine became a bottleneck.
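
The two-pass approach could be sketched as follows, using a Python 3 bytearray as a stand-in for a mutable byte buffer. It assumes the buffer holds valid UTF-8, and both function names are hypothetical:

```python
def _reverse_range(buf, lo, hi):
    """Reverse buf[lo..hi] (inclusive) by swapping bytes in place."""
    while lo < hi:
        buf[lo], buf[hi] = buf[hi], buf[lo]
        lo += 1
        hi -= 1

def reverse_utf8_in_place(buf: bytearray) -> None:
    n = len(buf)
    # Pass 1: reverse every byte in the buffer.
    _reverse_range(buf, 0, n - 1)
    # Pass 2: each multibyte character now appears as its continuation
    # bytes (10xxxxxx) followed by its lead byte; flip each such run back.
    i = 0
    while i < n:
        j = i
        while j < n and (buf[j] & 0xC0) == 0x80:
            j += 1
        if i < j < n:
            _reverse_range(buf, i, j)
        i = j + 1

buf = bytearray("a\u00e9\U0001F600z".encode("utf-8"))
reverse_utf8_in_place(buf)
print(buf.decode("utf-8"))
```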

How to reverse a string that contains complicated emojis?

If you're able to, use the _.split() function provided by lodash. From version 4.0 onwards, _.split() is capable of splitting Unicode emoji.

Using the native .reverse().join('') on the resulting array to reverse the 'characters' should work just fine with emoji containing zero-width joiners:

function reverse(txt) { return _.split(txt, '').reverse().join(''); }

const text = 'Hello world‍‍‍‍';
console.log(reverse(text));
<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.20/lodash.min.js" integrity="sha512-90vH1Z83AJY9DmlWa8WkjkV79yfS2n2Oxhsi2dZbIv0nC4E6m5AbH8Nh156kkM7JePmqD6tcZsfad1ueoaovww==" crossorigin="anonymous"></script>
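
To illustrate what joiner-aware splitting involves, here is a rough Python 3 sketch. It is not how lodash is implemented, and the names are hypothetical. It keeps a zero-width joiner (U+200D), the code point after it, and any combining marks or variation selectors in the same cluster as the preceding code point. Skin-tone modifiers and flag sequences are not handled:

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

def clusters(text):
    """Split text into approximate user-perceived characters."""
    out = []
    join_next = False  # True when the previous code point was a ZWJ
    for ch in text:
        attaches = ch == ZWJ or unicodedata.category(ch).startswith("M")
        if out and (attaches or join_next):
            out[-1] += ch
        else:
            out.append(ch)
        join_next = ch == ZWJ
    return out

def reverse_zwj_aware(text):
    return "".join(reversed(clusters(text)))

# WOMAN + ZWJ + WOMAN + ZWJ + BOY forms a single family emoji.
family = "\U0001F469" + ZWJ + "\U0001F469" + ZWJ + "\U0001F466"
print(reverse_zwj_aware("hi" + family))
```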

How to reverse a unicode string using AutoHotKey and paste it?

This has been solved without saving the data back to the clipboard:

^,::
text = %Clipboard%
newText := Reverse(text) ; Reverse() is assumed to be defined elsewhere in the script
send, %newText%
return

