How to Sort an Array of Utf-8 Strings

How to sort an array of UTF-8 strings?

In the end, this problem cannot be solved in a simple way without recoding the strings (UTF-8 → Windows-1252 or ISO-8859-1), as suggested by ΤΖΩΤΖΙΟΥ, because of an apparent PHP bug discovered by Huppie.
To summarize the problem, I wrote the following code snippet, which demonstrates that the culprit is the strcoll() function when it is used with the Windows UTF-8 code page 65001.

function traceStrColl($a, $b) {
    $outValue=strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}

// Match Windows only; a plain stristr(PHP_OS, 'win') would also match "Darwin" (macOS).
$locale=(strtoupper(substr(PHP_OS, 0, 3)) === 'WIN') ? 'German_Germany.65001' : 'de_DE.utf8';

$string="ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
$array=array();
for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
    $array[]=mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, $locale));
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);
var_dump($array);

The result is:

string(20) "German_Germany.65001"
a B 2147483647
[...]
array(59) {
  [0]=>
  string(1) "c"
  [1]=>
  string(1) "B"
  [2]=>
  string(1) "s"
  [3]=>
  string(1) "C"
  [4]=>
  string(1) "k"
  [5]=>
  string(1) "D"
  [6]=>
  string(2) "ä"
  [7]=>
  string(1) "E"
  [8]=>
  string(1) "g"
  [...]

The same snippet works without any problems on a Linux machine, producing the following output:

string(10) "de_DE.utf8"
a B -1
[...]
array(59) {
  [0]=>
  string(1) "a"
  [1]=>
  string(1) "A"
  [2]=>
  string(2) "ä"
  [3]=>
  string(2) "Ä"
  [4]=>
  string(1) "b"
  [5]=>
  string(1) "B"
  [6]=>
  string(1) "c"
  [7]=>
  string(1) "C"
  [...]

The snippet also works with Windows-1252 (or ISO-8859-1) encoded strings; of course, the mb_* encodings and the locale must then be changed accordingly.

I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows. If you experience the same problem, you can give your feedback to the PHP team on the bug-report page (two other, probably related, bugs have been classified as bogus - I don't think that this bug is bogus ;-).

Thanks to all of you.

Utf8 Sort Array

You can call setlocale with LC_COLLATE as the first parameter and a UTF-8 locale such as en_US.utf8 as the second, then sort with usort, using strcoll as the comparison function:

setlocale(LC_COLLATE, 'en_US.utf8');
$array = array('Australien','Belgien','Botswana','Brasilien','Bulgarien','Burma','China','Costa Rica','Ägypten');
usort($array, 'strcoll'); 
print_r($array);

Java array sort UTF-8

You should use the Collator class.

For example

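// Language and country are separate arguments; new Locale("lt_LT") would not denote Lithuanian.
Locale lithuanian = new Locale("lt", "LT");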
Collator lithuanianCollator = Collator.getInstance(lithuanian);

And then sort the collection using this collator

Collections.sort(theList, lithuanianCollator);
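
Here is a more complete sketch of the same idea. It assumes German collation and a made-up word list instead of the Lithuanian data from the question; the class name CollatorSortDemo and the method sortForLocale are hypothetical names chosen for illustration. Non-ASCII letters are written as \u escapes so the result does not depend on the source file encoding.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class CollatorSortDemo {
    // Returns a new list sorted by the collation rules of the given locale.
    static List<String> sortForLocale(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator); // Collator is a Comparator, so it can be passed directly
        return sorted;
    }

    public static void main(String[] args) {
        // "\u00C4pfel" is "Äpfel"
        List<String> words = Arrays.asList("Zebra", "\u00C4pfel", "Apfel", "Banane");

        // Locale-aware order: Apfel, Äpfel, Banane, Zebra
        System.out.println(sortForLocale(words, Locale.GERMAN));

        // Natural String order compares UTF-16 code units, so Äpfel lands after Zebra
        List<String> naive = new ArrayList<>(words);
        naive.sort(null);
        System.out.println(naive);
    }
}
```

Passing the collator straight to sort works because Collator implements Comparator; the natural String order is only shown for contrast.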

Order preserving mapping from utf8 to an array of bytes

Happily, it turns out that UTF-8 encoded strings can be sorted lexicographically as-is.

Sorting order: The chosen values of the leading bytes and the fact that the continuation bytes have the high-order bits first means that a list of UTF-8 strings can be sorted in code point order by sorting the corresponding byte sequences.

By truncating the Strings' byte sequences to a fixed-length prefix, you can achieve what was described in the question above.
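
A minimal sketch of this in Java, assuming Java 9 or later for Arrays.compareUnsigned. The class Utf8ByteOrder and its methods are hypothetical helpers, not an existing API. Bytes must be compared as unsigned values, because Java's byte type is signed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8ByteOrder {
    // Compares two strings by their UTF-8 bytes, treating every byte as unsigned.
    static int compareUtf8(String a, String b) {
        return Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8));
    }

    // Compares two strings by Unicode code point, used here as the reference order.
    static int compareCodePoints(String a, String b) {
        return Arrays.compare(a.codePoints().toArray(), b.codePoints().toArray());
    }

    // Fixed-length key: the first `length` UTF-8 bytes, zero-padded if the string is shorter.
    static byte[] prefixKey(String s, int length) {
        return Arrays.copyOf(s.getBytes(StandardCharsets.UTF_8), length);
    }

    public static void main(String[] args) {
        // "\u00E9lof" is "élof", "\u00E1bel" is "ábel"
        String[] words = {"zoo", "\u00E9lof", "maroon", "\u00E1bel"};
        Arrays.sort(words, Utf8ByteOrder::compareUtf8);
        System.out.println(Arrays.toString(words));
        System.out.println(Arrays.toString(prefixKey("\u00E1bel", 4)));
    }
}
```

Note that a truncated key can end in the middle of a multi-byte character, so it is only useful for comparison, not for decoding. Strings that share the whole prefix get equal keys, so the mapping preserves the order only up to ties.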

How to sort a collection of UTF-8 strings containing non-Latin chars in Laravel 5.3?

Here's a solid way to do it:

$blank = array();
$collection = collect([
    ["name"=>"maroon"],
    ["name"=>"zoo"],
    ["name"=>"ábel"],
    ["name"=>"élof"]
])->toArray();

$count = count($collection);

for ($x=0; $x < $count; $x++) { 
    $blank[$x] = $collection[$x]['name'];
}

$collator = collator_create('en_US');
var_export($blank);
collator_sort( $collator, $blank );
var_export( $blank );

dd($blank);

Outputs:

array (
  0 => 'maroon',
  1 => 'zoo',
  2 => 'ábel',
  3 => 'élof',
)array (
  0 => 'ábel',
  1 => 'élof',
  2 => 'maroon',
  3 => 'zoo',
)

Laravel Pretty Output:

array:4 [
  0 => "ábel"
  1 => "élof"
  2 => "maroon"
  3 => "zoo"
]

For personal Reading and reference:
http://php.net/manual/en/class.collator.php

Hope this answer helps, sorry for late response =)

Using sort for utf8 strings in Perl

The Unicode::Collate module should help with this.

A simple example that sorts your last list

use warnings;
use strict;
use feature 'say';

use Unicode::Collate;

use open ":std", ":encoding(UTF-8)";

open my $fh, '<', "country_list.txt" or die "Can't open country_list.txt: $!";
my @list = <$fh>;
chomp @list;

my $uc  = Unicode::Collate->new();
my @sorted = $uc->sort(@list);

say for @sorted;

However, in some languages non-ASCII characters have a very particular accepted placement, and the question doesn't provide any details. In that case, Unicode::Collate::Locale may help.

See (study) this perl.com article and this post (T. Christiansen), and this Effective Perler article.

If the data to be sorted is in a complex data structure, the cmp method can be used for the individual comparisons inside sort:

my @sorted = sort { $uc->cmp($a, $b) } @list;

where, in place of $a and $b, you'd extract what needs to be compared from the complex data structure.
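
For comparison, the same pattern (pull the field out of each record, then hand it to a collator) can be sketched in Java with java.text.Collator. The rows, the "name" and "code" keys, and the class SortByFieldDemo are made up for illustration; this is not a translation of Unicode::Collate, whose ordering may differ in details.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SortByFieldDemo {
    // Returns a copy of rows ordered by their "name" entry under the locale's collation rules.
    static List<Map<String, String>> sortByName(List<Map<String, String>> rows, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<Map<String, String>> sorted = new ArrayList<>(rows);
        // Extract the field to compare from each row, then delegate to the collator.
        sorted.sort((a, b) -> collator.compare(a.get("name"), b.get("name")));
        return sorted;
    }

    public static void main(String[] args) {
        // "\u00C4gypten" is "Ägypten"
        List<Map<String, String>> rows = List.of(
                Map.of("name", "China", "code", "CN"),
                Map.of("name", "\u00C4gypten", "code", "EG"),
                Map.of("name", "Belgien", "code", "BE"));
        for (Map<String, String> row : sortByName(rows, Locale.GERMAN)) {
            System.out.println(row.get("name") + " " + row.get("code"));
        }
    }
}
```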

Does Java String.getBytes(UTF-8) preserve lexicographical order?

Yes. According to RFC 3629:

The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.

As Ian Roberts pointed out, this applies to "true UTF-8 (such as String.getBytes will give you)", but beware of DataInputStream's fake UTF-8 (Java's "modified UTF-8"), which will sort [U+000000] after [U+000001] and [U+00F000] after [U+10FFFF].
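
A small sketch that makes the difference visible, assuming Java 9 or later. The class ModifiedUtf8Order and its helpers are hypothetical names. DataOutputStream.writeUTF is used to produce the modified encoding, and its 2-byte length prefix is stripped before comparing.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ModifiedUtf8Order {
    // Encodes s the way DataOutputStream.writeUTF does, minus the 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                out.writeUTF(s);
            }
            byte[] withLength = buffer.toByteArray();
            return Arrays.copyOfRange(withLength, 2, withLength.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Unsigned byte order of the standard UTF-8 encodings.
    static int compareStandard(String a, String b) {
        return Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));
    }

    // Unsigned byte order of the modified UTF-8 encodings.
    static int compareModified(String a, String b) {
        return Arrays.compareUnsigned(modifiedUtf8(a), modifiedUtf8(b));
    }

    public static void main(String[] args) {
        String nul = "\u0000", one = "\u0001";
        String privateUse = "\uF000";      // U+F000
        String last = "\uDBFF\uDFFF";      // U+10FFFF as a surrogate pair

        // Standard UTF-8 keeps code point order (negative results)...
        System.out.println(compareStandard(nul, one) + " " + compareStandard(privateUse, last));
        // ...while modified UTF-8 reverses both pairs (positive results).
        System.out.println(compareModified(nul, one) + " " + compareModified(privateUse, last));
    }
}
```

The reversal comes from two properties of the modified encoding: U+0000 is written as the two bytes C0 80, and supplementary characters are written as two separately encoded surrogates, each starting with the byte ED.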

Sort a table with UTF-8 encoded values alphabetically

After some time searching to no avail, I found this article by Joseph Wright. Although it touched my issue, it didn't provide a clear solution to follow. I asked him, and it turned out that there's currently no direct way to do what I want. He pointed out, however, that slnunicode comes built-in with LuaTeX (albeit it will be replaced in the future).

I developed a 'crude' solution using the facilities provided in the LuaTeX environment. It isn't elegant, but it works, and it doesn't pull any external dependencies. About its efficiency, I have not perceived any difference in the document build time.

-- Make the facilities available
unicode = require( 'unicode' )
utf8 = unicode.utf8

--[[
    Each character's position in this array-like table determines its 'priority'.
    Several characters in the same slot have the same 'priority'.
]]
local alphabet =
{
    -- The space is here because of other requirements of my project
    { ' ' },
    { 'a', 'á', 'à', 'ä' },
    { 'b' },
    { 'c' },
    { 'd' },
    { 'e', 'é', 'è', 'ë' },
    { 'f' },
    { 'g' },
    { 'h' },
    { 'i', 'í', 'ì', 'ï' },
    { 'j' },
    { 'k' },
    { 'l' },
    { 'm' },
    { 'n' },
    { 'ñ' },
    { 'o', 'ó', 'ò', 'ö' },
    { 'p' },
    { 'q' },
    { 'r' },
    { 's' },
    { 't' },
    { 'u', 'ú', 'ù', 'ü' },
    { 'v' },
    { 'w' },
    { 'x' },
    { 'y' },
    { 'z' }
}

-- Looks up the character `character´ in the alphabet and returns its 'priority'
local function get_pos_in_alphabet( character )
    for i, alphabet_entry in ipairs(alphabet) do
        for _, alphabet_char in ipairs(alphabet_entry) do
            if character == alphabet_char then
                return i
            end
        end
    end

    --[[
        If it isn't in the alphabet, abort: it's better than silently outputting some
        random garbage, and, thanks to the message, allows to add the character to
        the table.
    ]]
    assert( false , "'" .. character .. "' was not in alphabet" )
end

-- Returns the characters in the UTF-8-encoded string `s´ in an array-like table
local function get_utf8_string_characters( s )
    --[[
        I saw this variable being used in several code snippets around the Web, but
        it isn't provided in my LuaTeX environment; I use this form of initialization
        to be safe if it's defined in the future.
    ]]
    utf8.charpattern = utf8.charpattern or "([%z\1-\127\194-\244][\128-\191]*)"

    local characters = {}

    for character in s:gmatch(utf8.charpattern) do
        table.insert( characters , character )
    end

    return characters
end

local function compare_utf8_strings( _o1 , _o2 )
    --[[
        `o1_chars´ and `o2_chars´ are array-like tables containing all of the
        characters of each string, which are all made lower-case using the
        slnunicode facilities that come built-in with LuaTeX.
    ]]
    local o1_chars = get_utf8_string_characters( utf8.lower(_o1) )
    local o2_chars = get_utf8_string_characters( utf8.lower(_o2) )

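    -- Use the sizes of the character tables; `o1´ and `o2´ are not defined in this scope
    local o1_len = #o1_chars
    local o2_len = #o2_chars

    for i = 1, math.min( o1_len , o2_len ) do
        -- Keep the positions local so they don't leak into the global environment
        local o1_pos = get_pos_in_alphabet( o1_chars[i] )
        local o2_pos = get_pos_in_alphabet( o2_chars[i] )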

        if o1_pos > o2_pos then
            return false
        elseif o1_pos < o2_pos then
            return true
        end
    end

    return o1_len < o2_len
end
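
For readers who want to try the priority-table idea outside LuaTeX, here is a rough Java sketch of the same comparison. The class AlphabetOrder and its methods are hypothetical names, and the port is my own reading of the Lua code above, not a tested drop-in replacement. It returns a negative, zero, or positive number, as Java comparators do, instead of the boolean that Lua's table.sort expects.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class AlphabetOrder {
    // One string per priority slot; all characters in a slot rank equally.
    // Escapes: \u00E1 á, \u00E0 à, \u00E4 ä, \u00E9 é, \u00E8 è, \u00EB ë, \u00ED í, \u00EC ì,
    // \u00EF ï, \u00F1 ñ, \u00F3 ó, \u00F2 ò, \u00F6 ö, \u00FA ú, \u00F9 ù, \u00FC ü
    private static final String[] SLOTS = {
        " ", "a\u00E1\u00E0\u00E4", "b", "c", "d", "e\u00E9\u00E8\u00EB", "f", "g", "h",
        "i\u00ED\u00EC\u00EF", "j", "k", "l", "m", "n", "\u00F1", "o\u00F3\u00F2\u00F6",
        "p", "q", "r", "s", "t", "u\u00FA\u00F9\u00FC", "v", "w", "x", "y", "z"
    };

    private static final Map<Integer, Integer> PRIORITY = new HashMap<>();

    static {
        for (int slot = 0; slot < SLOTS.length; slot++) {
            final int rank = slot;
            SLOTS[slot].codePoints().forEach(cp -> PRIORITY.put(cp, rank));
        }
    }

    // Looks up a code point's priority; fails loudly instead of guessing a position.
    static int priority(int codePoint) {
        Integer rank = PRIORITY.get(codePoint);
        if (rank == null) {
            throw new IllegalArgumentException(
                    "'" + new String(Character.toChars(codePoint)) + "' was not in alphabet");
        }
        return rank;
    }

    // Compares character by character on priority; on a shared prefix the shorter string wins.
    static int compare(String a, String b) {
        int[] x = a.toLowerCase(Locale.ROOT).codePoints().toArray();
        int[] y = b.toLowerCase(Locale.ROOT).codePoints().toArray();
        for (int i = 0; i < Math.min(x.length, y.length); i++) {
            int diff = Integer.compare(priority(x[i]), priority(y[i]));
            if (diff != 0) {
                return diff;
            }
        }
        return Integer.compare(x.length, y.length);
    }
}
```

It can be passed to a sort as a method reference, for example list.sort(AlphabetOrder::compare).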

I cannot integrate this solution in the question's framework because my test environment, the ZeroBrane Studio Lua IDE, doesn't come with slnunicode and I don't know how to add it.

That was it. If anyone has any doubt or would like further explanations, please, use the comments. I hope it's useful to someone else.
