How to Sort an Array of UTF-8 Strings

How to sort an array of UTF-8 strings?

In the end, this problem cannot be solved in a simple way without recoding the strings (UTF-8 → Windows-1252 or ISO-8859-1), as suggested by ΤΖΩΤΖΙΟΥ, because of an apparent PHP bug discovered by Huppie.
To summarize the problem, I wrote the following code snippet, which shows that the culprit is the strcoll() function when it is used with the Windows UTF-8 code page 65001.

function traceStrColl($a, $b) {
$outValue=strcoll($a, $b);
echo "$a $b $outValue\r\n";
return $outValue;
}

$locale=(defined('PHP_OS') && stristr(PHP_OS, 'win')) ? 'German_Germany.65001' : 'de_DE.utf8';

$string="ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
$array=array();
for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
$array[]=mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, $locale));
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);
var_dump($array);

The result is:

string(20) "German_Germany.65001"
a B 2147483647
[...]
array(59) {
[0]=>
string(1) "c"
[1]=>
string(1) "B"
[2]=>
string(1) "s"
[3]=>
string(1) "C"
[4]=>
string(1) "k"
[5]=>
string(1) "D"
[6]=>
string(2) "ä"
[7]=>
string(1) "E"
[8]=>
string(1) "g"
[...]

The same snippet works on a Linux machine without any problems producing the following output:

string(10) "de_DE.utf8"
a B -1
[...]
array(59) {
[0]=>
string(1) "a"
[1]=>
string(1) "A"
[2]=>
string(2) "ä"
[3]=>
string(2) "Ä"
[4]=>
string(1) "b"
[5]=>
string(1) "B"
[6]=>
string(1) "c"
[7]=>
string(1) "C"
[...]

The snippet also works with Windows-1252 or ISO-8859-1 encoded strings (the mb_* encodings and the locale must of course be changed accordingly).

I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows. If you experience the same problem, you can give your feedback to the PHP team on the bug-report page (two other, probably related, bugs have been classified as bogus - I don't think that this bug is bogus ;-).

Thanks to all of you.

Utf8 Sort Array

You can call setlocale with LC_COLLATE as the first parameter and the locale en_US.utf8 as the second, and then sort with usort, using strcoll as the comparison function:

setlocale(LC_COLLATE, 'en_US.utf8');
$array = array('Australien','Belgien','Botswana','Brasilien','Bulgarien','Burma','China','Costa Rica','Ägypten');
usort($array, 'strcoll');
print_r($array);
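
The approach above only works if the requested locale is actually installed, which is why the first snippet on this page dumps the return value of setlocale. As a rough sketch of the same pattern in Python (used here only for illustration; the helper name sort_with_locale is made up for this example), the locale is checked first and restored afterwards:

```python
import locale

def sort_with_locale(words, locale_name):
    """Return words sorted under locale_name, or None if that locale is unavailable.

    Hypothetical helper for illustration; it relies only on the standard
    locale module (setlocale, strxfrm).
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        # strxfrm turns each string into a key that compares like strcoll
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # always restore

countries = ['Australien', 'Belgien', 'Botswana', 'Brasilien', 'Bulgarien',
             'Burma', 'China', 'Costa Rica', 'Ägypten']
result = sort_with_locale(countries, 'en_US.utf8')
if result is None:
    print('Locale en_US.utf8 is not available here')
else:
    print(result)
```

The resulting order depends on the collation tables shipped with the operating system, so it can differ between platforms.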


Java array sort UTF-8

You should use the Collator class.

For example:

Locale lithuanian = new Locale("lt", "LT"); // language and country are separate arguments
Collator lithuanianCollator = Collator.getInstance(lithuanian);

And then sort the collection using this collator

Collections.sort(theList, lithuanianCollator);

Order preserving mapping from utf8 to an array of bytes

Happily, it turns out that UTF-8 encoded strings can be sorted lexicographically as-is.

Sorting order: The chosen values of the leading bytes and the fact that the continuation bytes have the high-order bits first means that a list of UTF-8 strings can be sorted in code point order by sorting the corresponding byte sequences.

By truncating the Strings' byte sequences to a fixed-length prefix, you can achieve what was described in the question above.
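
A small sketch of both claims, assuming Python 3 (where str comparison is by code point). The sample words, PREFIX_LEN and prefix_key are made up for the example and do not come from the original question:

```python
words = ["zebra", "Ägypten", "apple", "éclair", "日本", "😀 smile", "Zoo"]

# Sorting by the raw UTF-8 bytes ...
by_bytes = sorted(words, key=lambda s: s.encode("utf-8"))
# ... gives the same order as sorting by code points
# (Python compares str objects code point by code point).
by_code_points = sorted(words)
print(by_bytes == by_code_points)

PREFIX_LEN = 4  # hypothetical fixed key length, in bytes

def prefix_key(s, n=PREFIX_LEN):
    """Fixed-length byte key: the first n UTF-8 bytes, zero-padded on the right."""
    return s.encode("utf-8")[:n].ljust(n, b"\x00")

# The prefix keys never contradict the full order; strings that share
# the same first n bytes simply compare as equal.
by_prefix = sorted(words, key=prefix_key)
print(by_prefix)
```

Note that a truncated key may cut a multi-byte character in half. That is harmless for ordering, but the key is then no longer valid UTF-8 on its own.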

How to sort a collection of UTF-8 strings containing non-Latin chars in Laravel 5.3?

Here's a solid way to do it:

$blank = array();
$collection = collect([
["name"=>"maroon"],
["name"=>"zoo"],
["name"=>"ábel"],
["name"=>"élof"]
])->toArray();

$count = count($collection);

for ($x=0; $x < $count; $x++) {
$blank[$x] = $collection[$x]['name'];
}

$collator = collator_create('en_US');
var_export($blank);
collator_sort( $collator, $blank );
var_export( $blank );

dd($blank);

Outputs:

array (
0 => 'maroon',
1 => 'zoo',
2 => 'ábel',
3 => 'élof',
)array (
0 => 'ábel',
1 => 'élof',
2 => 'maroon',
3 => 'zoo',
)

Laravel Pretty Output:

array:4 [
0 => "ábel"
1 => "élof"
2 => "maroon"
3 => "zoo"
]

For further reading and reference:
http://php.net/manual/en/class.collator.php

Hope this answer helps, sorry for late response =)

Using sort for utf8 strings in Perl

The Unicode::Collate module should help with this.

A simple example that sorts your last list:

use warnings;
use strict;
use feature 'say';

use Unicode::Collate;

use open ":std", ":encoding(UTF-8)";

open my $fh, '<', "country_list.txt" or die "Can't open country_list.txt: $!";
my @list = <$fh>;
chomp @list;

my $uc = Unicode::Collate->new();
my @sorted = $uc->sort(@list);

say for @sorted;

However, in some languages non-ASCII characters have a very particular accepted placement, and the question doesn't provide any details. In that case Unicode::Collate::Locale may help.

See (study) this perl.com article and this post (T. Christiansen), and this Effective Perler article.


If the data to be sorted is in a complex data structure, the cmp method can be used for the individual comparisons:

my @sorted = sort { $uc->cmp($a, $b) } @list;

where for $a and $b you'd extract what needs to be compared from the complex data structure.

Does Java String.getBytes(UTF-8) preserve lexicographical order?

Yes. According to RFC 3629:

The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.

As Ian Roberts pointed out, this applies for "true UTF-8 (such as String.getBytes will give you)", but beware of DataInputStream's fake UTF-8, which will sort [U+000000] after [U+000001] and [U+00F000] after [U+10FFFF].
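
To make the difference concrete, here is a rough Python sketch of the "modified UTF-8" layout that DataInputStream and DataOutputStream use: U+0000 is written as the two bytes C0 80, and characters outside the BMP are written as two separately encoded surrogates. The function modified_utf8 is written for this illustration only and omits the two-byte length prefix that writeUTF adds:

```python
def modified_utf8(s):
    """Encode s roughly the way Java's modified UTF-8 lays out characters.

    Illustration only: no length prefix, no size limit.
    """
    out = bytearray()
    units = s.encode("utf-16-be")  # two bytes per UTF-16 code unit
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        if unit == 0:
            out += b"\xc0\x80"  # NUL becomes an overlong two-byte form
        else:
            # surrogatepass lets each surrogate half be encoded on its own
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)

# True UTF-8 keeps code point order:
print("\uf000".encode("utf-8") < "\U0010ffff".encode("utf-8"))   # True
# Modified UTF-8 does not:
print(modified_utf8("\uf000") < modified_utf8("\U0010ffff"))     # False
print(modified_utf8("\x00") < modified_utf8("\x01"))             # False
```

Sorting the byte arrays therefore only matches code point order when they come from a real UTF-8 encoder.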

Sort a table with UTF-8 encoded values alphabetically

After some time searching to no avail, I found this article by Joseph Wright. Although it touched on my issue, it didn't provide a clear solution to follow. I asked him, and it turned out that there's currently no direct way to do what I want. He pointed out, however, that slnunicode comes built in with LuaTeX (although it will be replaced in the future).

I developed a 'crude' solution using the facilities provided in the LuaTeX environment. It isn't elegant, but it works, and it doesn't pull in any external dependencies. As for efficiency, I have not noticed any difference in the document build time.

-- Make the facilities available
unicode = require( 'unicode' )
utf8 = unicode.utf8

--[[
Each character's position in this array-like table determines its 'priority'.
Several characters in the same slot have the same 'priority'.
]]
local alphabet =
{
-- The space is here because of other requirements of my project
{ ' ' },
{ 'a', 'á', 'à', 'ä' },
{ 'b' },
{ 'c' },
{ 'd' },
{ 'e', 'é', 'è', 'ë' },
{ 'f' },
{ 'g' },
{ 'h' },
{ 'i', 'í', 'ì', 'ï' },
{ 'j' },
{ 'k' },
{ 'l' },
{ 'm' },
{ 'n' },
{ 'ñ' },
{ 'o', 'ó', 'ò', 'ö' },
{ 'p' },
{ 'q' },
{ 'r' },
{ 's' },
{ 't' },
{ 'u', 'ú', 'ù', 'ü' },
{ 'v' },
{ 'w' },
{ 'x' },
{ 'y' },
{ 'z' }
}

-- Looks up the character `character´ in the alphabet and returns its 'priority'
local function get_pos_in_alphabet( character )
for i, alphabet_entry in ipairs(alphabet) do
for _, alphabet_char in ipairs(alphabet_entry) do
if character == alphabet_char then
return i
end
end
end

--[[
If it isn't in the alphabet, abort: it's better than silently outputting some
random garbage, and, thanks to the message, allows to add the character to
the table.
]]
assert( false , "'" .. character .. "' was not in alphabet" )
end

-- Returns the characters in the UTF-8-encoded string `s´ in an array-like table
local function get_utf8_string_characters( s )
--[[
I saw this variable being used in several code snippets around the Web, but
it isn't provided in my LuaTeX environment; I use this form of initialization
to be safe if it's defined in the future.
]]
utf8.charpattern = utf8.charpattern or "([%z\1-\127\194-\244][\128-\191]*)"

local characters = {}

for character in s:gmatch(utf8.charpattern) do
table.insert( characters , character )
end

return characters
end

local function compare_utf8_strings( _o1 , _o2 )
--[[
`o1_chars´ and `o2_chars´ are array-like tables containing all of the
characters of each string, which are all made lower-case using the
slnunicode facilities that come built-in with LuaTeX.
]]
local o1_chars = get_utf8_string_characters( utf8.lower(_o1) )
local o2_chars = get_utf8_string_characters( utf8.lower(_o2) )

-- Count the characters actually extracted (the original `o1´/`o2´ were undefined)
local o1_len = #o1_chars
local o2_len = #o2_chars

for i = 1, math.min( o1_len , o2_len ) do
local o1_pos = get_pos_in_alphabet( o1_chars[i] )
local o2_pos = get_pos_in_alphabet( o2_chars[i] )

if o1_pos > o2_pos then
return false
elseif o1_pos < o2_pos then
return true
end
end

return o1_len < o2_len
end
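
For readers without a LuaTeX environment, the same idea (a priority table, compared position by position, with the shorter string first on a tie) can be sketched in Python as a sort key. This is an illustration rather than a port of the code above: ALPHABET, PRIORITY, alphabet_key and the sample names are made up for the example, and the input is assumed to use precomposed (NFC) characters after normalization:

```python
import unicodedata

# One string per priority slot; characters in the same slot rank equally.
ALPHABET = [" ", "aáàä", "b", "c", "d", "eéèë", "f", "g", "h", "iíìï",
            "j", "k", "l", "m", "n", "ñ", "oóòö", "p", "q", "r", "s",
            "t", "uúùü", "v", "w", "x", "y", "z"]
PRIORITY = {ch: slot for slot, group in enumerate(ALPHABET) for ch in group}

def alphabet_key(s):
    """Map a string to its list of slot numbers, failing loudly on unknown characters."""
    normalized = unicodedata.normalize("NFC", s).lower()
    try:
        return [PRIORITY[ch] for ch in normalized]
    except KeyError as err:
        raise ValueError(f"{err.args[0]!r} was not in alphabet") from None

names = ["zorro", "Ñandú", "nube", "Árbol", "arena", "ancla"]
print(sorted(names, key=alphabet_key))
# ['ancla', 'Árbol', 'arena', 'nube', 'Ñandú', 'zorro']
```

Python compares the key lists element by element and puts a shorter list first when it is a prefix of the longer one, which mirrors the final length comparison in compare_utf8_strings.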

I cannot integrate this solution in the question's framework because my test environment, the ZeroBrane Studio Lua IDE, doesn't come with slnunicode and I don't know how to add it.

That was it. If anyone has any doubt or would like further explanations, please, use the comments. I hope it's useful to someone else.


