How to reverse a Unicode string
Grapheme functions handle UTF-8 strings more correctly than the mbstring and PCRE functions. Mbstring and PCRE work on code points, so they may break a character that is made of a base letter plus combining marks. You can see the difference between them by executing the following code.
function str_to_array($string)
{
    $length = grapheme_strlen($string);
    $ret = [];
    for ($i = 0; $i < $length; $i += 1) {
        $ret[] = grapheme_substr($string, $i, 1);
    }
    return $ret;
}

function str_to_array2($string)
{
    $length = mb_strlen($string, "UTF-8");
    $ret = [];
    for ($i = 0; $i < $length; $i += 1) {
        $ret[] = mb_substr($string, $i, 1, "UTF-8");
    }
    return $ret;
}

function str_to_array3($string)
{
    return preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
}

function utf8_strrev($string)
{
    return implode(array_reverse(str_to_array($string)));
}

function utf8_strrev2($string)
{
    return implode(array_reverse(str_to_array2($string)));
}

function utf8_strrev3($string)
{
    return implode(array_reverse(str_to_array3($string)));
}
// http://www.php.net/manual/en/function.grapheme-strlen.php
$string = "a\xCC\x8A"  // 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5)
        . "o\xCC\x88"; // 'LATIN SMALL LETTER O WITH DIAERESIS' (U+00F6)

var_dump(array_map(function($elem) { return strtoupper(bin2hex($elem)); },
    [
        'should be' => "o\xCC\x88"."a\xCC\x8A",
        'grapheme' => utf8_strrev($string),
        'mbstring' => utf8_strrev2($string),
        'pcre' => utf8_strrev3($string)
    ]));
The result is as follows.
array(4) {
["should be"]=>
string(12) "6FCC8861CC8A"
["grapheme"]=>
string(12) "6FCC8861CC8A"
["mbstring"]=>
string(12) "CC886FCC8A61"
["pcre"]=>
string(12) "CC886FCC8A61"
}
IntlBreakIterator can be used since PHP 5.5 (intl 3.0):
function utf8_strrev($str)
{
    $it = IntlBreakIterator::createCodePointInstance();
    $it->setText($str);
    $ret = '';
    $pos = 0;
    $prev = 0;
    // Each boundary closes one code point; prepend it to build the reversed string.
    foreach ($it as $pos) {
        $ret = substr($str, $prev, $pos - $prev) . $ret;
        $prev = $pos;
    }
    return $ret;
}
Reversing a UTF-8 string in Python
Python 2 strings are byte strings, and UTF-8 encoded text uses multiple bytes per character. Just because your terminal manages to interpret the UTF-8 bytes as characters doesn't mean that Python knows which bytes form one UTF-8 character.
Your bytestring consists of 6 bytes, every two bytes form one character:
>>> a = "čšž"
>>> a
'\xc4\x8d\xc5\xa1\xc5\xbe'
However, how many bytes UTF-8 uses depends on where in the Unicode standard the character is defined; ASCII characters (the first 128 characters in the Unicode standard) only need 1 byte each, and many emoji need 4 bytes!
In UTF-8 order is everything; reversing the above bytestring reverses the bytes, resulting in some gibberish as far as the UTF-8 standard is concerned, but the middle 4 bytes just happen to be valid UTF-8 sequences (for š and ō):
>>> a[::-1]
'\xbe\xc5\xa1\xc5\x8d\xc4'
-----~~~~~~~~^^^^^^^^####
  |      š       ō     |
  |                    \
invalid UTF-8 byte      opening UTF-8 byte missing a second byte
You'd have to decode the byte string to a unicode object, which consists of single characters. Reversing that object gives you the right results:
b = a.decode('utf8')[::-1]
print b
You can always encode the object back to UTF-8 again:
b = a.decode('utf8')[::-1].encode('utf8')
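For readers on Python 3, where str is already a sequence of code points and bytes is a separate type, the same decode, reverse, encode round trip could be sketched as follows. The helper name reverse_utf8_bytes is made up for this illustration and is not part of the original answer.

```python
def reverse_utf8_bytes(data: bytes) -> bytes:
    """Reverse UTF-8 encoded text by code point (hypothetical helper)."""
    # Decode to str (a sequence of code points), reverse, then re-encode.
    return data.decode("utf-8")[::-1].encode("utf-8")


a = "čšž".encode("utf-8")  # 6 bytes, two per character

# Reversing the raw bytes does not produce valid UTF-8.
try:
    a[::-1].decode("utf-8")
    bytewise_ok = True
except UnicodeDecodeError:
    bytewise_ok = False

print(bytewise_ok)
print(reverse_utf8_bytes(a).decode("utf-8"))
```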
Note that in Unicode, you can still run into issues when reversing text in which combining characters are used. Reversing such text places the combining characters in front of, rather than after, the character they combine with, so they'll combine with the wrong character instead:
>>> print u'e\u0301a'
éa
>>> print u'e\u0301a'[::-1]
áe
You can mostly avoid this by converting the Unicode data to its normalised composed form, NFC (which replaces combinations with 1-codepoint forms where such forms exist), but there are plenty of other exotic Unicode characters that don't play well with string reversals.
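A minimal Python 3 sketch of that normalisation step, using the standard unicodedata module. The helper name reverse_nfc is an assumption made for this example. The second case shows why the fix is only partial: a base letter plus mark with no precomposed code point stays as two code points after NFC.

```python
import unicodedata


def reverse_nfc(text: str) -> str:
    """Compose to NFC first, then reverse by code point (hypothetical helper)."""
    return unicodedata.normalize("NFC", text)[::-1]


# 'e' + COMBINING ACUTE ACCENT + 'a': NFC folds the first two into U+00E9.
decomposed = "e\u0301a"
print(reverse_nfc(decomposed))

# 'q' + COMBINING TILDE has no precomposed form, so NFC cannot merge it
# and the reversal still moves the mark away from its base letter.
stubborn = "q\u0303a"
print(len(unicodedata.normalize("NFC", stubborn)))
```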
How to get a reversed String (unicode safe)
Your issue could also be resolved by converting the string into the canonical composition form, NFC. Basically, the java.text.Normalizer class can be used to combine accents and other combining characters with their base characters, so you will be able to reverse the string properly.
All these other ideas (StringBuilder.reverse(), StringBuffer.reverse()) will correctly reverse the characters in your buffer, but if you start with decomposed characters, you might not get what you expect :).
In some "decomposition forms", accent characters are stored separate from their base forms (as separate characters), but in "combined" form they are not. So in one form "áe" is stored as three characters, and in the other, combined form, as two.
However, such normalization isn't sufficient for handling other kinds of character combination, nor does it help with characters in the Unicode astral planes, which Java stores as two char values (a surrogate pair).
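The counts mentioned above can be checked with a short Python 3 sketch. It is used here only as a neutral way to inspect the data and is not the Java API. The UTF-16 code unit count corresponds to what Java's String.length() would report, since a Java char is one UTF-16 code unit.

```python
import unicodedata

text = "\u00e1e"  # "áe" with a precomposed á

nfd = unicodedata.normalize("NFD", text)  # decomposed: a + combining acute + e
nfc = unicodedata.normalize("NFC", nfd)   # composed again: á + e
print(len(nfd), len(nfc))

# An astral-plane character is one code point but two UTF-16 code units.
astral = "\U0001F600"
utf16_units = len(astral.encode("utf-16-le")) // 2
print(len(astral), utf16_units)
```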
Thanks to tchrist for pointing out the ICU support for text segmentation including extended grapheme clusters such as the one identified in the comments below (see virama). This resource seems to be the authoritative source of information on this kind of stuff.
In Corona SDK how to reverse a unicode string?
Corona SDK seems to be using UTF-8 as encoding.
If you want to reverse all Unicode code points in a string, instead of all bytes, you can use this code:
function utf8reverse(str)
return str:gsub("([\194-\244][\128-\191]+)", string.reverse):reverse()
end
print(utf8reverse("أحمد"))
The trick is as follows: a multibyte Unicode code point always starts with a byte 11xx xxxx, followed by one or several bytes 10xx xxxx. The first step is to reverse the bytes within each multibyte code point; the second step is to reverse all bytes of the string, which puts each code point's bytes back in their original order.
Note: when a Unicode character is composed of several code points, that simple trick will not work. A full support would require a big Unicode database to deal with.
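The same two-step trick can be sketched in Python 3 on a bytes object, using the same byte ranges as the Lua pattern. The function name utf8_reverse is chosen for this illustration only, and the sketch assumes the input is valid UTF-8.

```python
import re

# Lead byte 0xC2-0xF4 followed by one or more continuation bytes 0x80-0xBF.
_MULTIBYTE = re.compile(rb"[\xc2-\xf4][\x80-\xbf]+")


def utf8_reverse(data: bytes) -> bytes:
    """Reverse valid UTF-8 bytes by code point (hypothetical helper)."""
    # Step 1: reverse the bytes inside each multibyte sequence.
    flipped = _MULTIBYTE.sub(lambda m: m.group(0)[::-1], data)
    # Step 2: reverse everything; each sequence ends up in its original byte order.
    return flipped[::-1]


print(utf8_reverse("أحمد".encode("utf-8")).decode("utf-8"))
```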
How do I reverse a UTF-8 string in place?
I'd make one pass reversing the bytes, then a second pass that reverses the bytes in any multibyte characters (which are easily detected in UTF-8) back to their correct order.
You can definitely handle this in line in a single pass, but I wouldn't bother unless the routine became a bottleneck.
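A minimal sketch of the two-pass approach, written in Python 3 against a mutable bytearray. The name reverse_utf8_in_place is hypothetical, and the code assumes the buffer holds valid UTF-8. After the first pass each multibyte character appears as a run of continuation bytes (10xx xxxx) followed by its lead byte, which is what the second pass looks for.

```python
def reverse_utf8_in_place(buf: bytearray) -> None:
    """Reverse valid UTF-8 in place, by code point (hypothetical helper)."""
    # Pass 1: reverse every byte in the buffer.
    buf.reverse()

    # Pass 2: restore the byte order inside each multibyte character.
    i, n = 0, len(buf)
    while i < n:
        if buf[i] & 0xC0 == 0x80:  # continuation byte: a reversed sequence starts here
            j = i
            while j < n and buf[j] & 0xC0 == 0x80:
                j += 1
            # For valid input, buf[j] is now the lead byte of this character.
            lo, hi = i, min(j, n - 1)
            while lo < hi:
                buf[lo], buf[hi] = buf[hi], buf[lo]
                lo += 1
                hi -= 1
            i = j + 1
        else:
            i += 1  # single-byte (ASCII) character, nothing to fix


data = bytearray("ač\U0001F600b".encode("utf-8"))
reverse_utf8_in_place(data)
print(data.decode("utf-8"))
```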
How to reverse a string that contains complicated emojis?
If you're able to, use the _.split() function provided by lodash. From version 4.0 onwards, _.split() is capable of splitting Unicode emojis.
Using the native .reverse().join('') to reverse the 'characters' should work just fine with emojis containing zero-width joiners:
function reverse(txt) { return _.split(txt, '').reverse().join(''); }
const text = 'Hello world';
console.log(reverse(text));
<script src="https://cdnjs.cloudflare.com/ajax/libs/lodash.js/4.17.20/lodash.min.js" integrity="sha512-90vH1Z83AJY9DmlWa8WkjkV79yfS2n2Oxhsi2dZbIv0nC4E6m5AbH8Nh156kkM7JePmqD6tcZsfad1ueoaovww==" crossorigin="anonymous"></script>
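To illustrate why zero-width joiners matter, here is a rough Python 3 sketch that keeps joined emoji together before reversing. It is not how lodash works internally and it is not a full implementation of Unicode grapheme segmentation: it only attaches combining marks, variation selectors, skin-tone modifiers and ZWJ-linked code points to the preceding cluster, and it ignores cases such as flag sequences. All names in it are made up for this example.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def _extends_cluster(ch: str) -> bool:
    # Combining marks and variation selectors have a general category
    # starting with "M"; U+1F3FB..U+1F3FF are the emoji skin-tone modifiers.
    return unicodedata.category(ch).startswith("M") or 0x1F3FB <= ord(ch) <= 0x1F3FF


def split_clusters(text: str) -> list:
    """Split text into approximate user-perceived characters."""
    clusters = []
    for ch in text:
        if clusters and (_extends_cluster(ch) or ch == ZWJ or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch  # stays attached to the previous cluster
        else:
            clusters.append(ch)
    return clusters


def reverse_clusters(text: str) -> str:
    return "".join(reversed(split_clusters(text)))


# MAN + ZWJ + WOMAN + ZWJ + GIRL renders as a single family emoji.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
print(reverse_clusters("a" + family + "b") == "b" + family + "a")
```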
How to reverse a unicode string using AutoHotKey and paste it?
This has been solved without saving the data back to the clipboard:
^,::
text= %Clipboard%
newText := Reverse(text)
send, %newText%