Natural Sorting Algorithm in PHP with Support for Unicode

Natural sorting algorithm in PHP with support for Unicode?

Nailed it!

$array = array('Ägile', 'Ãgile', 'Test', 'カタカナ', 'かたかな', 'Ágile', 'Àgile', 'Âgile', 'Agile');

function Sortify($string)
{
return preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|tilde|uml);~i', '$1' . chr(255) . '$2', htmlentities($string, ENT_QUOTES, 'UTF-8'));
}

array_multisort(array_map('Sortify', $array), $array);

Output:

Array
(
[0] => Agile
[1] => Ágile
[2] => Âgile
[3] => Àgile
[4] => Ãgile
[5] => Ägile
[6] => Test
[7] => かたかな
[8] => カタカナ
)

Even better:

if (extension_loaded('intl') === true)
{
collator_asort(collator_create('root'), $array);
}

Thanks to @tchrist!

UCA + Natural Sorting

After digging a little more in the documentation I've found the solution:

if (extension_loaded('intl') === true)
{
if (is_object($collator = collator_create('root')) === true)
{
$collator->setAttribute(Collator::NUMERIC_COLLATION, Collator::ON);
$collator->asort($array);
}
}

Output:

Array
(
[0] => al
[3] => Alpha
[5] => Álpha
[6] => Àlpha
[7] => Älpha
[1] => be
[4] => Beta
[10] => img1.png
[11] => img2.png
[8] => img10.png
[9] => img12.png
[2] => かたかな
)

Cakephp2 - natural sorting by strong with numbers

Not sure if this will fully help your issue, but you can attempt to apply natural sorting of a MySQL database by sorting by length first:-

'order' => array(
'LENGTH(Photo.name)' => 'ASC',
'Photo.name' => 'ASC'
)

Python analog of PHP's natsort function (sort a list using a natural order algorithm)

From my answer to Natural Sorting algorithm:

import re
def natural_key(string_):
"""See https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/"""
return [int(s) if s.isdigit() else s for s in re.split(r'(\d+)', string_)]

Example:

>>> L = ['image1.jpg', 'image15.jpg', 'image12.jpg', 'image3.jpg']
>>> sorted(L)
['image1.jpg', 'image12.jpg', 'image15.jpg', 'image3.jpg']
>>> sorted(L, key=natural_key)
['image1.jpg', 'image3.jpg', 'image12.jpg', 'image15.jpg']

To support Unicode strings, .isdecimal() should be used instead of .isdigit(). See example in @phihag's comment. Related: How to reveal Unicodes numeric value property.

.isdigit() may also fail (return value that is not accepted by int()) for a bytestring on Python 2 in some locales e.g., '\xb2' ('²') in cp1252 locale on Windows.

How to sort an array considering localization?

You need to set the locale appropriately, probably like this:

setlocale(LC_ALL, 'hr_HR');

And then tell sort to honor the locale:

sort($names,SORT_LOCALE_STRING);

How to sort an array of UTF-8 strings?

Eventually this problem cannot be solved in a simple way without using recoded strings (UTF-8 → Windows-1252 or ISO-8859-1) as suggested by ΤΖΩΤΖΙΟΥ due to an obvious PHP bug as discovered by Huppie.
To summarize the problem, I created the following code snippet which clearly demonstrates that the problem is the strcoll() function when using the 65001 Windows-UTF-8-codepage.

function traceStrColl($a, $b) {
$outValue=strcoll($a, $b);
echo "$a $b $outValue\r\n";
return $outValue;
}

$locale=(defined('PHP_OS') && stristr(PHP_OS, 'win')) ? 'German_Germany.65001' : 'de_DE.utf8';

$string="ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
$array=array();
for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
$array[]=mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, $locale));
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);
var_dump($array);

The result is:

string(20) "German_Germany.65001"
a B 2147483647
[...]
array(59) {
[0]=>
string(1) "c"
[1]=>
string(1) "B"
[2]=>
string(1) "s"
[3]=>
string(1) "C"
[4]=>
string(1) "k"
[5]=>
string(1) "D"
[6]=>
string(2) "ä"
[7]=>
string(1) "E"
[8]=>
string(1) "g"
[...]

The same snippet works on a Linux machine without any problems producing the following output:

string(10) "de_DE.utf8"
a B -1
[...]
array(59) {
[0]=>
string(1) "a"
[1]=>
string(1) "A"
[2]=>
string(2) "ä"
[3]=>
string(2) "Ä"
[4]=>
string(1) "b"
[5]=>
string(1) "B"
[6]=>
string(1) "c"
[7]=>
string(1) "C"
[...]

The snippet also works when using Windows-1252 (ISO-8859-1) encoded strings (of course the mb_* encodings and the locale must be changed then).

I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows. If you experience the same problem, you can give your feedback to the PHP team on the bug-report page (two other, probably related, bugs have been classified as bogus - I don't think that this bug is bogus ;-).

Thanks to all of you.

Is there any way to know the number of steps occuring in natsort() php?

For counting the comparison steps:

natsort uses a specific compare function during the sorting, which can be called separately as well: strnatcmp.

This means that this:

natsort($data);

... will produce the same result as:

usort($data, 'strnatcmp');

Or, more verbose:

usort($data, function ($a, $b) {
return strnatcmp($a, $b);
});

That opens the door to counting the number of times this comparison function is called:

$count = 0;
usort($data, function ($a, $b) use (&$count) {
$count++;
return strnatcmp($a, $b);
});

For example:

$data = ['test', 'abc10', 'abc2', 'OK', 9, 13];

$count = 0;
usort($data, function ($a, $b) use (&$count){
$count++;
return strnatcmp($a, $b);
});

echo "$count comparisons: " . implode(", ", $data);

Outputs:

8 comparisons: 9, 13, OK, abc2, abc10, test

Note that this sorting algorithm is case sensitive: "OK" is sorted before "abc".

For a case insensitive natural sort you can use natcasesort and strnatcasecmp

For counting the swaps

It is not possible to use above method to know how many swaps are made during the sort. PHP uses a version of QuickSort internally, so you could mimic the same kind of algorithm and count the swaps yourself. This will obviously be an estimation, for the following reasons:

  • There are different flavors of Quick Sort, in terms of how they chose the pivot at each recursive step, so this influences the number of swaps they do for a particular input;
  • Some flavors do not swap at all, but create a new array all together and then at the end overwrite the original array with it -- see for instance this algorithm in PHP;
  • Still other flavors do perform swaps, but pick the pivot (in the algorithm) randomly, so the number of swaps will be different on the same inputs;
  • Many implementations will switch to another sorting algorithm when the array has only a few elements, which again influences the number of swaps.

I will provide here the code for what I believe is the most standard algorithm: it chooses one pivot index, right at the middle of the given partition:

class QuickSort {
private static $swaps = 0;
private static $cmp = 'test';

private static function swap(&$array, $i, $j) {
if ($i == $j) return;
self::$swaps++;
list($array[$i], $array[$j]) = [$array[$j], $array[$i]];
}

private static function partition(&$array, $begin, $end) {
$cmp = self::$cmp;
$index = floor(($begin + $end) / 2);
$pivot = $array[$index];
self::swap($array, $index, $end);
for ($i = $index = $begin; $i < $end; $i++) {
if ($cmp($array[$i], $pivot) >= 0) continue;
self::swap($array, $index++, $i);
}
self::swap($array, $index, $end);
return $index;
}

private static function qsort(&$array, $begin, $end) {
if ($end <= $begin) return;
$index = self::partition($array, $begin, $end);
self::qsort($array, $begin, $index - 1);
self::qsort($array, $index + 1, $end);
}

public static function sort(&$array, $cmp = 'strcmp') {
self::$swaps = 0;
self::$cmp = $cmp;
self::qsort($array, 0, count($array) - 1);
return self::$swaps;
}
}

// Sample input
$data = [1,3,2,8,2,5,4,6];
// Call it: this sort function returns the number of swaps made
$swaps = QuickSort::sort($data, 'strnatcmp');
// Display the result:
echo "$swaps swaps: " . implode(", ", $data);

Outputs:

9 swaps: 1, 2, 2, 3, 4, 5, 6, 8

Again, the number of swaps might be more than you expect in some cases, as Quick Sort is not specifically designed to minimise the number of swaps.

Universal Sorting Function for PHP without the Locale Hassle

Short answer is: you do need to be aware of the locale.

Don't confuse the charset with the locale sorting rules. UTF-8 is just a way to encode Unicode characters: it doesn't imply anything about how you handle sorting, capitalization, etc.

I'll put a simple example. The Spanish language has two collations: traditional (where "ch" is considered a letter) and modern (where "ch" are two letters). In traditional collation you sort this way:

  1. Barro
  2. Cuenco
  3. China
  4. Dado

In modern collation you'd sort this way:

  1. Barro
  2. China
  3. Cuenco
  4. Dado

This is the same in UTF-8, Latin1, Latin9, cp850 or whatever: the encoding is not relevant.

Sorting Issue in PHP

You should use The Collator class for such cases.

Requirement: (PHP 5 >= 5.3.0, PHP 7, PECL intl >= 1.0.0)

Description: Provides string comparison capability with support for appropriate locale-sensitive sort orderings.

$collator = new Collator('en_US');
$collator->sort($country);


Related Topics



Leave a reply



Submit