How to sort an array of UTF-8 strings?
Ultimately, this problem cannot be solved simply without recoding the strings (UTF-8 → Windows-1252 or ISO-8859-1), as suggested by ΤΖΩΤΖΙΟΥ, because of an apparent PHP bug discovered by Huppie.
To summarize the problem, I created the following code snippet, which demonstrates that the culprit is the strcoll() function when the Windows UTF-8 code page 65001 is used.
function traceStrColl($a, $b) {
    $outValue=strcoll($a, $b);
    echo "$a $b $outValue\r\n";
    return $outValue;
}
// Compare only the prefix: stristr(PHP_OS, 'win') would also match 'Darwin'
$locale=(strtoupper(substr(PHP_OS, 0, 3)) === 'WIN') ? 'German_Germany.65001' : 'de_DE.utf8';
$string="ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
$array=array();
for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
    $array[]=mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, $locale));
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);
var_dump($array);
The result is:
string(20) "German_Germany.65001"
a B 2147483647
[...]
array(59) {
[0]=>
string(1) "c"
[1]=>
string(1) "B"
[2]=>
string(1) "s"
[3]=>
string(1) "C"
[4]=>
string(1) "k"
[5]=>
string(1) "D"
[6]=>
string(2) "ä"
[7]=>
string(1) "E"
[8]=>
string(1) "g"
[...]
The same snippet works on a Linux machine without any problems producing the following output:
string(10) "de_DE.utf8"
a B -1
[...]
array(59) {
[0]=>
string(1) "a"
[1]=>
string(1) "A"
[2]=>
string(2) "ä"
[3]=>
string(2) "Ä"
[4]=>
string(1) "b"
[5]=>
string(1) "B"
[6]=>
string(1) "c"
[7]=>
string(1) "C"
[...]
The snippet also works when using Windows-1252 (or ISO-8859-1) encoded strings, provided the mb_* encodings and the locale are changed accordingly.
I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows. If you experience the same problem, you can give your feedback to the PHP team on the bug-report page (two other, probably related, bugs have been classified as bogus - I don't think that this bug is bogus ;-).
Thanks to all of you.
Utf8 Sort Array
You can call setlocale with LC_COLLATE as the first parameter and the en_US.utf8 locale as the second, then simply sort using usort with strcoll as the comparison function. Try this:
setlocale(LC_COLLATE, 'en_US.utf8');
$array = array('Australien','Belgien','Botswana','Brasilien','Bulgarien','Burma','China','Costa Rica','Ägypten');
usort($array, 'strcoll');
print_r($array);
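For comparison, the same set-locale, sort, restore-locale pattern can be sketched in Python. This is only an illustration, not part of the original answer: the helper name `sort_with_collation` is made up, and whether a locale such as `en_US.utf8` is available depends on the system. The locale setting is process-wide, so the sketch restores the previous value even when the requested locale cannot be set.

```python
import locale
from functools import cmp_to_key


def sort_with_collation(words, locale_name):
    """Sort words using the collation rules of locale_name.

    Raises locale.Error if the locale is not installed on this system.
    """
    # Calling setlocale with only a category queries the current setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # locale.strcoll plays the role of PHP's strcoll() in usort().
        return sorted(words, key=cmp_to_key(locale.strcoll))
    finally:
        # Always restore the process-wide collation setting.
        locale.setlocale(locale.LC_COLLATE, previous)


# The "C" locale is always available; it orders plain ASCII words bytewise.
print(sort_with_collation(["pear", "apple", "fig"], "C"))
```

Passing `'en_US.utf8'` instead of `'C'` would mirror the PHP call above, provided that locale is installed.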
Java array sort UTF-8
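You should use the Collator class.
For example (note that the Locale constructor takes the language and the country as separate arguments):
import java.text.Collator;
import java.util.Collections;
import java.util.Locale;
Locale lithuanian = new Locale("lt", "LT");
Collator lithuanianCollator = Collator.getInstance(lithuanian);
And then sort the collection using this collator:
Collections.sort(theList, lithuanianCollator);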
Order preserving mapping from utf8 to an array of bytes
Happily, it turns out that UTF-8 encoded strings can be sorted lexicographically as-is.
Sorting order: The chosen values of the leading bytes and the fact that the continuation bytes have the high-order bits first means that a list of UTF-8 strings can be sorted in code point order by sorting the corresponding byte sequences.
By truncating the Strings' byte sequences to a fixed-length prefix, you can achieve what was described in the question above.
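A minimal sketch in Python (not from the original answer) of the property described above: sorting by raw UTF-8 bytes gives the same order as sorting by code point. The helper name `utf8_sort_key` and the sample words are assumptions made for illustration. Truncating the key to a fixed-length prefix may split a multi-byte character, but because the comparison runs on bytes, the prefixes of a correctly sorted list are still in non-decreasing order; strings that share a prefix simply tie.

```python
def utf8_sort_key(text, prefix_len=None):
    """Return the UTF-8 bytes of text, optionally cut to a fixed-length prefix."""
    encoded = text.encode("utf-8")
    return encoded if prefix_len is None else encoded[:prefix_len]


# Hypothetical sample data: ASCII, two-byte, three-byte and four-byte characters.
words = ["zoo", "ábel", "maroon", "élof", "Zebra", "\U0001F600", "\uFFFD"]

# Python compares str values code point by code point
# and bytes values byte by byte (as unsigned integers).
by_code_point = sorted(words)
by_utf8_bytes = sorted(words, key=utf8_sort_key)

# Two-byte prefixes of the fully sorted list, in list order.
prefix_keys = [utf8_sort_key(w, 2) for w in by_utf8_bytes]

print(by_utf8_bytes == by_code_point)
print(prefix_keys == sorted(prefix_keys))
```

Note that this is code point order only; it is not a culturally correct alphabetical order.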
How to sort a collection of UTF-8 strings containing non-Latin chars in Laravel 5.3?
Here's a solid way to do it:
$blank = array();
$collection = collect([
    ["name"=>"maroon"],
    ["name"=>"zoo"],
    ["name"=>"ábel"],
    ["name"=>"élof"]
])->toArray();
$count = count($collection);
for ($x=0; $x < $count; $x++) {
    $blank[$x] = $collection[$x]['name'];
}
$collator = collator_create('en_US');
var_export($blank);
collator_sort( $collator, $blank );
var_export( $blank );
dd($blank);
Outputs:
array (
0 => 'maroon',
1 => 'zoo',
2 => 'ábel',
3 => 'élof',
)array (
0 => 'ábel',
1 => 'élof',
2 => 'maroon',
3 => 'zoo',
)
Laravel Pretty Output:
array:4 [
0 => "ábel"
1 => "élof"
2 => "maroon"
3 => "zoo"
]
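For comparison, a rough Python sketch (not part of the original answer) that reproduces the order above without ICU. It is not the Unicode Collation Algorithm that Collator implements: it only decomposes each string (NFD), drops combining marks and case-folds, which happens to be enough for this sample. The function name `rough_collation_key` is made up for illustration.

```python
import unicodedata


def rough_collation_key(text):
    """Approximate a collation key: base letters first, original text as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (accents) so that 'á' compares like 'a'.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), text)


names = ["maroon", "zoo", "ábel", "élof"]
sorted_names = sorted(names, key=rough_collation_key)
print(sorted_names)
```

For languages where accented letters have their own place in the alphabet, this shortcut gives the wrong order and a real collator is needed.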
For further reading and reference:
http://php.net/manual/en/class.collator.php
Hope this answer helps, sorry for late response =)
Using sort for utf8 strings in Perl
The Unicode::Collate module should help with this.
A simple example that sorts your last list:
use warnings;
use strict;
use feature 'say';
use Unicode::Collate;
use open ":std", ":encoding(UTF-8)";
open my $fh, '<', "country_list.txt" or die "Cannot open country_list.txt: $!";
my @list = <$fh>;
chomp @list;
my $uc = Unicode::Collate->new();
my @sorted = $uc->sort(@list);
say for @sorted;
However, in some languages non-ascii characters may have a very particular accepted placement, and the question doesn't provide any details. Then perhaps Unicode::Collate::Locale can help.
See (study) this perl.com article and this post (T. Christiansen), and this Effective Perler article.
If the data to be sorted is in a complex data structure, the cmp method performs an individual comparison and can be used inside a sort block:
my @sorted = sort { $uc->cmp($a, $b) } @list;
where for $a and $b you'd extract what needs to be compared from the complex data structure.
Does Java String.getBytes(UTF-8) preserve lexicographical order?
Yes. According to RFC 3629:
The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.
As Ian Roberts pointed out, this applies to "true UTF-8 (such as String.getBytes will give you)", but beware of DataInputStream's fake UTF-8 (Java's "modified UTF-8"), which will sort [U+000000] after [U+000001] and [U+00F000] after [U+10FFFF].
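The difference can be sketched in Python (an illustration, not Java's actual implementation). The hypothetical helper `modified_utf8` mimics the two deviations of the encoding used by DataInputStream and DataOutputStream: U+0000 is written as the two bytes C0 80, and characters above U+FFFF are written as a surrogate pair with each surrogate encoded in three bytes.

```python
def modified_utf8(text):
    """Encode text the way Java's modified UTF-8 does (illustrative sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # U+0000 becomes an overlong two-byte sequence.
            out += b"\xc0\x80"
        elif cp > 0xFFFF:
            # Supplementary characters are split into a UTF-16 surrogate pair,
            # and each surrogate is encoded as a three-byte sequence.
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)
            low = 0xDC00 + (offset & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


# True UTF-8 keeps code point order ...
print("\x00".encode("utf-8") < "\x01".encode("utf-8"))
print("\uf000".encode("utf-8") < "\U0010FFFF".encode("utf-8"))

# ... while the modified encoding reverses both pairs.
print(modified_utf8("\x00") > modified_utf8("\x01"))
print(modified_utf8("\uf000") > modified_utf8("\U0010FFFF"))
```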
Sort a table with UTF-8 encoded values alphabetically
After some time searching to no avail, I found this article by Joseph Wright. Although it touched my issue, it didn't provide a clear solution to follow. I asked him, and it turned out that there's currently no direct way to do what I want. He pointed out, however, that slnunicode comes built-in with LuaTeX (albeit it will be replaced in the future).
I developed a 'crude' solution using the facilities provided in the LuaTeX environment. It isn't elegant, but it works, and it doesn't pull any external dependencies. About its efficiency, I have not perceived any difference in the document build time.
-- Make the facilities available
unicode = require( 'unicode' )
utf8 = unicode.utf8
--[[
Each character's position in this array-like table determines its 'priority'.
Several characters in the same slot have the same 'priority'.
]]
local alphabet =
{
    -- The space is here because of other requirements of my project
    { ' ' },
    { 'a', 'á', 'à', 'ä' },
    { 'b' },
    { 'c' },
    { 'd' },
    { 'e', 'é', 'è', 'ë' },
    { 'f' },
    { 'g' },
    { 'h' },
    { 'i', 'í', 'ì', 'ï' },
    { 'j' },
    { 'k' },
    { 'l' },
    { 'm' },
    { 'n' },
    { 'ñ' },
    { 'o', 'ó', 'ò', 'ö' },
    { 'p' },
    { 'q' },
    { 'r' },
    { 's' },
    { 't' },
    { 'u', 'ú', 'ù', 'ü' },
    { 'v' },
    { 'w' },
    { 'x' },
    { 'y' },
    { 'z' }
}
-- Looks up the character `character´ in the alphabet and returns its 'priority'
local function get_pos_in_alphabet( character )
    for i, alphabet_entry in ipairs(alphabet) do
        for _, alphabet_char in ipairs(alphabet_entry) do
            if character == alphabet_char then
                return i
            end
        end
    end
    --[[
        If it isn't in the alphabet, abort: it's better than silently outputting some
        random garbage, and, thanks to the message, allows to add the character to
        the table.
    ]]
    assert( false , "'" .. character .. "' was not in alphabet" )
end
-- Returns the characters in the UTF-8-encoded string `s´ in an array-like table
local function get_utf8_string_characters( s )
    --[[
        I saw this variable being used in several code snippets around the Web, but
        it isn't provided in my LuaTeX environment; I use this form of initialization
        to be safe if it's defined in the future.
    ]]
    utf8.charpattern = utf8.charpattern or "([%z\1-\127\194-\244][\128-\191]*)"
    local characters = {}
    for character in s:gmatch(utf8.charpattern) do
        table.insert( characters , character )
    end
    return characters
end
local function compare_utf8_strings( _o1 , _o2 )
    --[[
        `o1_chars´ and `o2_chars´ are array-like tables containing all of the
        characters of each string, which are all made lower-case using the
        slnunicode facilities that come built-in with LuaTeX.
    ]]
    local o1_chars = get_utf8_string_characters( utf8.lower(_o1) )
    local o2_chars = get_utf8_string_characters( utf8.lower(_o2) )
    -- Take the lengths from the character tables built above
    local o1_len = #o1_chars
    local o2_len = #o2_chars
    for i = 1, math.min( o1_len , o2_len ) do
        local o1_pos = get_pos_in_alphabet( o1_chars[i] )
        local o2_pos = get_pos_in_alphabet( o2_chars[i] )
        if o1_pos > o2_pos then
            return false
        elseif o1_pos < o2_pos then
            return true
        end
    end
    return o1_len < o2_len
end
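To show what order the comparison function produces, here is the same priority-table idea sketched in Python rather than Lua. This is an illustration only: the names `ALPHABET`, `PRIORITY` and `alphabet_key` and the sample words are made up, and the sketch assumes the input uses precomposed characters (NFC), as the table does. Instead of a pairwise comparator it maps every string to a list of slot numbers; Python compares such lists element by element and treats a shorter list as smaller when it is a prefix of the other, which matches the final length check in the Lua function.

```python
# Each entry lists the characters that share one priority, as in the Lua table.
ALPHABET = [
    " ", "aáàä", "b", "c", "d", "eéèë", "f", "g", "h", "iíìï", "j", "k", "l",
    "m", "n", "ñ", "oóòö", "p", "q", "r", "s", "t", "uúùü", "v", "w", "x", "y", "z",
]

# Character -> slot number lookup, built once.
PRIORITY = {ch: slot for slot, chars in enumerate(ALPHABET) for ch in chars}


def alphabet_key(text):
    """Map text to its slot numbers; raises KeyError for characters not in the table."""
    return [PRIORITY[ch] for ch in text.lower()]


words = ["ñu", "nube", "Árbol", "zorro", "arte", "oso"]
print(sorted(words, key=alphabet_key))
```

As in the Lua version, a character missing from the table stops the sort with an error instead of being placed arbitrarily.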
I cannot integrate this solution in the question's framework because my test environment, the ZeroBrane Studio Lua IDE, doesn't come with slnunicode and I don't know how to add it.
That was it. If anyone has any doubt or would like further explanations, please, use the comments. I hope it's useful to someone else.