Natural Sort Supporting Big Numbers

Natural sort supporting big numbers

It works like @clemens suggested. Use numeric (= decimal) in the composite type:

CREATE TYPE ai AS (a text, i numeric);

db<>fiddle here

The reason I used int in the referenced answer is performance.

Natural sorting when characters and numbers mixed

You need to normalise the format of the numeric portion of the text. You can do that by splitting the string into the AB prefix and the numeric part, then left-padding the numeric part to a consistent length with zeroes.

For example: AB11a becomes AB00011a.

Apply this to all the items you've listed and they'll sort in the order you want.

You can do this with

    ... ORDER BY concat(substring(`code`,1,2),lpad(substr(`code`,3),6,'0')) ...

where `code` is the name of the column that contains the data you want to sort.

Note - this assumes that the prefix is always 2 characters.

Natural Sort on data mixing text and numbers (and then more text *sometimes*)

Through pure SQL? Not easy!

We did this in code on my project, and basically, you apply a regular expression that matches runs of digits or non-digits (something like ([0-9]+)|([^0-9]+)), turns a string into a tuple of such runs, converts digit runs into integers, then sorts the tuples with a pairwise comparison. For example:

"Lot 1"   -> ("Lot ", 1)
"Lot 2" -> ("Lot ", 2)
"Lot 3" -> ("Lot ", 3)
"Lot 3a" -> ("Lot ", 3, "a")
"Lot 100" -> ("Lot ", 100)

The pairwise comparison then (a) sorts integers before strings, (b) sorts integers in natural order, and (c) sorts strings in natural order.

If you were able to apply regular expressions to columns for sorting, you could use one to 'normalise' the numbers into digit strings of a fixed length, padded with zeroes. Like this:

"Lot 1"   -> "Lot 0001"
"Lot 2" -> "Lot 0002"
"Lot 3" -> "Lot 0003"
"Lot 3a" -> "Lot 0003a"
"Lot 100" -> "Lot 0100"

Which should sort correctly. However, whilst you can do this in Oracle and some other databases, in MySQL, you need to use a user-defined function.

However, you could do it outside the database. Add a column to the lot table to store the normalised lot name, and whenever you insert a row from your code, generate the normalised form and store it. You can then sort using that column.

F# - Natural Sort Of Strings Containing Numbers

Based on @matekus' comment, possibly the most correct solution is to port the AlphaNum sort to F#, so:

let len = String.length

let isnum (s: string) i =
let c = s.Chars i
c >= '0' && c <= '9'

let chunk s f t = (f < len s) && (t < len s) && (isnum s f) = (isnum s t)

let chunkto s f =
let rec to_ s f e = if chunk s f e then to_ s f (e + 1) else e in to_ s f f

let int_of_string str =
let v = ref 0
if System.Int32.TryParse(str, v) then !v else 0

let alphanumcmp a b =
let rec chunkcmp a ai b bi =
let al, bl = len a, len b
if ai >= al || bi >= bl then compare al bl else
let ae, be = chunkto a ai, chunkto b bi
let sa, sb = a.Substring(ai, (ae-ai)), b.Substring(bi, (be-bi))
let cmp = if isnum a ai && isnum b bi then compare (int_of_string sa) (int_of_string sb) else compare sa sb
if cmp = 0 then chunkcmp a ae b be else cmp
in chunkcmp a 0 b 0

type AlphanumComparer() =
interface System.Collections.IComparer with
member this.Compare(x, y) =
alphanumcmp (x.ToString()) (y.ToString())

Test:

let names = [ "1000X Radonius Maximus"; "10X Radonius"; "200X Radonius"; "20X Radonius"; "20X Radonius Prime"; "30X Radonius"; "40X Radonius"; "Allegia 50 Clasteron"; "Allegia 500 Clasteron"; "Allegia 51 Clasteron"; "Allegia 51B Clasteron"; "Allegia 52 Clasteron"; "Allegia 60 Clasteron"; "Alpha 100"; "Alpha 2"; "Alpha 200"; "Alpha 2A";  "Alpha 2A-8000"; "Alpha 2A-900"; "Callisto Morphamax"; "Callisto Morphamax 500"; "Callisto Morphamax 5000"; "Callisto Morphamax 600"; "Callisto Morphamax 700"; "Callisto Morphamax 7000"; "Callisto Morphamax 7000 SE";"Callisto Morphamax 7000 SE2"; "QRS-60 Intrinsia Machine"; "QRS-60F Intrinsia Machine"; "QRS-62 Intrinsia Machine"; "QRS-62F Intrinsia Machine"; "Xiph Xlater 10000"; "Xiph Xlater 2000"; "Xiph Xlater 300"; "Xiph Xlater 40"; "Xiph Xlater 5"; "Xiph Xlater 50"; "Xiph Xlater 500"; "Xiph Xlater 5000"; "Xiph Xlater 58" ];;

names |> List.sortWith alphanumcmp |> printf "%A"

Results:

 ["10X Radonius"; "20X Radonius"; "20X Radonius Prime"; "30X Radonius";
"40X Radonius"; "200X Radonius"; "1000X Radonius Maximus";
"Allegia 50 Clasteron"; "Allegia 51 Clasteron"; "Allegia 51B Clasteron";
"Allegia 52 Clasteron"; "Allegia 60 Clasteron"; "Allegia 500 Clasteron";
"Alpha 2"; "Alpha 2A"; "Alpha 2A-900"; "Alpha 2A-8000"; "Alpha 100";
"Alpha 200"; "Callisto Morphamax"; "Callisto Morphamax 500";
"Callisto Morphamax 600"; "Callisto Morphamax 700"; "Callisto Morphamax 5000";
"Callisto Morphamax 7000"; "Callisto Morphamax 7000 SE";
"Callisto Morphamax 7000 SE2"; "QRS-60 Intrinsia Machine";
"QRS-60F Intrinsia Machine"; "QRS-62 Intrinsia Machine";
"QRS-62F Intrinsia Machine"; "Xiph Xlater 5"; "Xiph Xlater 40";
"Xiph Xlater 50"; "Xiph Xlater 58"; "Xiph Xlater 300"; "Xiph Xlater 500";
"Xiph Xlater 2000"; "Xiph Xlater 5000"; "Xiph Xlater 10000"]val it : unit = ()

Humanized or natural number sorting of mixed word-and-number strings

Building on your test data, but this works with arbitrary data. This works with any number of elements in the string.

Register a composite type made up of one text and one integer value once per database. I call it ai:

CREATE TYPE ai AS (a text, i int);

The trick is to form an array of ai from each value in the column.

regexp_matches() with the pattern (\D*)(\d*) and the g option returns one row for every combination of letters and numbers. Plus one irrelevant dangling row with two empty strings '{"",""}' Filtering or suppressing it would just add cost. Aggregate this into an array, after replacing empty strings ('') with 0 in the integer component (as '' cannot be cast to integer).

NULL values sort first - or you have to special case them - or use the whole shebang in a STRICT function like @Craig proposes.

Postgres 9.4 or later

SELECT data
FROM alnum
ORDER BY ARRAY(SELECT ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai
FROM regexp_matches(data, '(\D*)(\d*)', 'g') x)
, data;

db<>fiddle here

Postgres 9.1 (original answer)

Tested with PostgreSQL 9.1.5, where regexp_replace() had a slightly different behavior.

SELECT data
FROM (
SELECT ctid, data, regexp_matches(data, '(\D*)(\d*)', 'g') AS x
FROM alnum
) x
GROUP BY ctid, data -- ctid as stand-in for a missing pk
ORDER BY regexp_replace (left(data, 1), '[0-9]', '0')
, array_agg(ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai)
, data -- for special case of trailing 0

Add regexp_replace (left(data, 1), '[1-9]', '0') as first ORDER BY item to take care of leading digits and empty strings.

If special characters like {}()"', can occur, you'd have to escape those accordingly.
@Craig's suggestion to use a ROW expression takes care of that.


BTW, this won't execute in sqlfiddle, but it does in my db cluster. JDBC is not up to it. sqlfiddle complains:

Method org.postgresql.jdbc3.Jdbc3Array.getArrayImpl(long,int,Map) is
not yet implemented.

This has since been fixed: http://sqlfiddle.com/#!17/fad6e/1

Natural Sort in MySQL

I think this is why a lot of things are sorted by release date.

A solution could be to create another column in your table for the "SortKey". This could be a sanitized version of the title which conforms to a pattern you create for easy sorting or a counter.

Natural sorting in pandas

Use str.extract, sort_values, then use the index to reindex df.

idx = (df.assign(ID2=df.ID.str.extract(r'(\d+)$').astype(int))
.sort_values(['ID2', 'Time'])
.index)

df.iloc[idx]

ID Time oneMissing singleValue empty oneEmpty
0 CS1-1 1 10000.0 NaN None 0.0
2 CS1-1 2 30000.0 NaN None 0.0
3 CS1-2 1 10000.0 NaN None NaN
1 CS1-2 2 20000.0 0.0 None 0.0
5 CS1-2 3 30000.0 NaN None NaN
4 CS1-11 1 NaN 0.0 None NaN

This is under the assumption that your ID column follows the pattern "XXX-NUMBER".


A fool-proof solution will involve the use of the natsort module, which excels at fast natural sorting. With a little elbow-grease, we can argsort your data.

from natsort import natsorted
idx, *_ = zip(*natsorted(
zip(df.index, df.ID, df.Time), key=lambda x: (x[1], x[2])))

df.iloc[list(idx)]

ID Time oneMissing singleValue empty oneEmpty
0 CS1-1 1 10000.0 NaN None 0.0
2 CS1-1 2 30000.0 NaN None 0.0
3 CS1-2 1 10000.0 NaN None NaN
1 CS1-2 2 20000.0 0.0 None 0.0
5 CS1-2 3 30000.0 NaN None NaN
4 CS1-11 1 NaN 0.0 None NaN

Use PyPi to install: pip install natsort.

Is there a built in function for string natural sort?

There is a third party library for this on PyPI called natsort (full disclosure, I am the package's author). For your case, you can do either of the following:

>>> from natsort import natsorted, ns
>>> x = ['Elm11', 'Elm12', 'Elm2', 'elm0', 'elm1', 'elm10', 'elm13', 'elm9']
>>> natsorted(x, key=lambda y: y.lower())
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> natsorted(x, alg=ns.IGNORECASE) # or alg=ns.IC
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

You should note that natsort uses a general algorithm so it should work for just about any input that you throw at it. If you want more details on why you might choose a library to do this rather than rolling your own function, check out the natsort documentation's How It Works page, in particular the Special Cases Everywhere! section.


If you need a sorting key instead of a sorting function, use either of the below formulas.

>>> from natsort import natsort_keygen, ns
>>> l1 = ['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> l2 = l1[:]
>>> natsort_key1 = natsort_keygen(key=lambda y: y.lower())
>>> l1.sort(key=natsort_key1)
>>> l1
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']
>>> natsort_key2 = natsort_keygen(alg=ns.IGNORECASE)
>>> l2.sort(key=natsort_key2)
>>> l2
['elm0', 'elm1', 'Elm2', 'elm9', 'elm10', 'Elm11', 'Elm12', 'elm13']

Update November 2020

Given that a popular request/question is "how to sort like Windows Explorer?" (or whatever is your operating system's file system browser), as of natsort version 7.1.0 there is a function called os_sorted to do exactly this. On Windows, it will sort in the same order as Windows Explorer, and on other operating systems it should sort like whatever is the local file system browser.

>>> from natsort import os_sorted
>>> os_sorted(list_of_paths)
# your paths sorted like your file system browser

For those needing a sort key, you can use os_sort_keygen (or os_sort_key if you just need the defaults).

Caveat - Please read the API documentation for this function before you use to understand the limitations and how to get best results.



Related Topics



Leave a reply



Submit