PostgreSQL ORDER BY issue - natural sort

The reason is that em_code is a string, so it sorts alphabetically, character by character, instead of numerically like you would want, and '1' sorts before '9'.
You could solve it like this:

SELECT * FROM employees
ORDER BY substring(em_code, 3)::int DESC;

It would be more efficient to drop the redundant 'EM' from your em_code - if you can - and save an integer number to begin with.
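
To see the difference outside the database, here is a minimal Python sketch of the same idea. The sample codes are made up for illustration, and slicing off the first two characters plays the role of substring(em_code, 3):

```python
# Hypothetical sample values; the real em_code column may look different.
codes = ["EM9", "EM10", "EM1", "EM100"]

# Plain string comparison, descending: '9' > '1', so EM9 comes first.
alphabetical = sorted(codes, reverse=True)

# Drop the 2-character prefix and compare the rest as an integer,
# roughly what substring(em_code, 3)::int DESC does.
numeric = sorted(codes, key=lambda code: int(code[2:]), reverse=True)

print(alphabetical)  # ['EM9', 'EM100', 'EM10', 'EM1']
print(numeric)       # ['EM100', 'EM10', 'EM9', 'EM1']
```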

Answer to question in comment

To strip any and all non-digits from a string:

SELECT regexp_replace(em_code, E'\\D','','g')
FROM employees;

\D is the regular expression class-shorthand for "non-digits".

'g' as 4th parameter is the "globally" switch to apply the replacement to every occurrence in the string, not just the first.

After replacing every non-digit with the empty string, only digits remain.
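
If you want to try the pattern without a database at hand, the same regular expression behaves the same way in Python. This is only a sketch with a made-up input string; note that re.sub() replaces every occurrence by default, so count=1 is what mimics leaving out 'g':

```python
import re

def digits_only(text):
    # Replace every non-digit with the empty string (the 'g' behavior).
    return re.sub(r"\D", "", text)

def strip_first_non_digit(text):
    # Replace only the first non-digit, as happens without 'g'.
    return re.sub(r"\D", "", text, count=1)

print(digits_only("EM-12a3"))            # 123
print(strip_first_non_digit("EM-12a3"))  # M-12a3
```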

Postgres natural order by

Postgres allows you to sort by arrays -- which is essentially what the version number represents. Hence, you can use this syntax:

order by string_to_array(version, '.')::int[] desc

Here is a full example:

select *
from (values ('1'), ('2.1'), ('1.2.3'), ('1.10.6'), ('1.9.4')) v(version)
order by string_to_array(version, '.')::int[] desc;
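
The array comparison can be mimicked in Python, where lists of integers also compare element by element and a shorter list sorts before a longer one with the same prefix. A small sketch using the same sample values (version_key is just a name picked for illustration):

```python
versions = ["1", "2.1", "1.2.3", "1.10.6", "1.9.4"]

def version_key(version):
    # '1.10.6' -> [1, 10, 6], similar to string_to_array(version, '.')::int[]
    return [int(part) for part in version.split(".")]

ordered = sorted(versions, key=version_key, reverse=True)
print(ordered)  # ['2.1', '1.10.6', '1.9.4', '1.2.3', '1']
```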

Alphanumeric Sorting in PostgreSQL

When sorting character data types, collation rules apply - unless you work with locale "C", which sorts characters by their byte values. Applying collation rules may or may not be desirable. It makes sorting more expensive in any case. If you want to sort without collation rules, don't cast to bytea, use COLLATE "C" instead (tbl and col are placeholders, since table and column are reserved words):

SELECT * FROM tbl ORDER BY col COLLATE "C";

However, this does not yet solve the problem with numbers in the string you mention. Split the string and sort the numeric part as number.

SELECT *
FROM tbl
ORDER BY split_part(col, '-', 2)::numeric;

Or, if all your numbers fit into bigint or even integer, use that instead (cheaper).
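
For illustration, here is the split-and-cast idea as a Python sketch. The sample rows and the helper name second_field are assumptions, and Decimal stands in for numeric:

```python
from decimal import Decimal

# Hypothetical values in the shape "<anything>-<number>".
rows = ["a-10", "b-9.5", "c-2"]

def second_field(value, delimiter="-"):
    # Counterpart of split_part(col, '-', 2). Unlike split_part(), this
    # raises IndexError instead of returning '' when the field is missing.
    return value.split(delimiter)[1]

by_text = sorted(rows, key=second_field)
by_number = sorted(rows, key=lambda value: Decimal(second_field(value)))

print(by_text)    # ['a-10', 'c-2', 'b-9.5']
print(by_number)  # ['c-2', 'b-9.5', 'a-10']
```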

I ignored the leading part because you write:

... the basis for ordering is the last whole number of the string, regardless of what the character before that number is.

Related:

  • Alphanumeric sorting with PostgreSQL
  • Split comma separated column data into additional columns
  • What is the impact of LC_CTYPE on a PostgreSQL database?

Typically, it's best to save distinct parts of a string in separate columns as proper respective data types to avoid any such confusion.

And if the leading string is identical for all columns, consider just dropping the redundant noise. You can always use a VIEW to prepend a string for display, or do it on-the-fly, cheaply.

Natural sort supporting big numbers

It works like @clemens suggested. Use numeric (= decimal) in the composite type:

CREATE TYPE ai AS (a text, i numeric);

The reason I used int in the referenced answer is performance.

Humanized or natural number sorting of mixed word-and-number strings

This builds on your test data, but it works with arbitrary data and with any number of elements in the string.

Register a composite type made up of one text and one integer value once per database. I call it ai:

CREATE TYPE ai AS (a text, i int);

The trick is to form an array of ai from each value in the column.

regexp_matches() with the pattern (\D*)(\d*) and the g option returns one row for every combination of letters and numbers. Plus one irrelevant dangling row with two empty strings '{"",""}'. Filtering or suppressing it would just add cost. Aggregate this into an array, after replacing empty strings ('') with 0 in the integer component (as '' cannot be cast to integer).

NULL values sort first - or you have to special case them - or use the whole shebang in a STRICT function like @Craig proposes.

Postgres 9.4 or later

SELECT data
FROM   alnum
ORDER  BY ARRAY(SELECT ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai
                FROM   regexp_matches(data, '(\D*)(\d*)', 'g') x)
        , data;
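
To make the mechanics easier to follow, here is a rough Python sketch of the same key construction. It is not the Postgres implementation: the sample strings are made up, the tuples stand in for the ai composite type, and Python compares the text parts by code point rather than by the database collation.

```python
import re

# Same pattern as in the query: a run of non-digits, then a run of digits.
PAIR = re.compile(r"(\D*)(\d*)")

def natural_key(value):
    # One (text, int) pair per match. Empty digit runs become 0, and the
    # dangling ('', '') match at the end is kept, as in the SQL version.
    return [(text, int(digits) if digits else 0)
            for text, digits in PAIR.findall(value)]

data = ["a10", "b1", "a9b10", "a9", "a9b2"]

# The original string is the tie-breaker, like the trailing ", data".
ordered = sorted(data, key=lambda value: (natural_key(value), value))
print(ordered)  # ['a9', 'a9b2', 'a9b10', 'a10', 'b1']
```

Python integers have no fixed size, so this sketch does not run into the int overflow that motivates the numeric variant of the type.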

Postgres 9.1 (original answer)

Tested with PostgreSQL 9.1.5, where regexp_replace() had a slightly different behavior.

SELECT data
FROM  (
   SELECT ctid, data, regexp_matches(data, '(\D*)(\d*)', 'g') AS x
   FROM   alnum
   ) x
GROUP  BY ctid, data   -- ctid as stand-in for a missing pk
ORDER  BY regexp_replace (left(data, 1), '[0-9]', '0')
        , array_agg(ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai)
        , data;        -- for special case of trailing 0

Add regexp_replace (left(data, 1), '[0-9]', '0') as first ORDER BY item to take care of leading digits and empty strings.

If special characters like {}()"', can occur, you'd have to escape those accordingly.

@Craig's suggestion to use a ROW expression takes care of that.


BTW, this won't execute in sqlfiddle, but it does in my db cluster. JDBC is not up to it. sqlfiddle complains:

Method org.postgresql.jdbc3.Jdbc3Array.getArrayImpl(long,int,Map) is
not yet implemented.

This has since been fixed: http://sqlfiddle.com/#!17/fad6e/1

What is the best way to replicate PostgreSQL sorting results in JavaScript?

Sorting strings is always done using a certain collation. If you are not conscious of using a collation in your programming language, you are probably using the POSIX collation, which compares strings character by character according to their code point (the numeric value in the encoding).

In PostgreSQL, that would look like this:

ORDER BY name COLLATE "POSIX";

So to solve your problem, you'd have to find out the collation of the column.

If there is no special collation specified in the column definition, it will use the database's collation, which can be found with

SELECT datcollate FROM pg_database WHERE datname = 'my_database';

That will be an operating system collation from the C library.

So all you have to do is to use that collation in your program.

If your program is written in C, you can directly use the C library. Otherwise, refer to the documentation of your programming language.
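
As an illustration of what "character by character according to their code point" means in practice, here is a small Python sketch with made-up names. Python's default string comparison uses code points, which in a UTF-8 database should correspond to COLLATE "C" or "POSIX"; sorting by the encoded bytes gives the same order:

```python
# Hypothetical sample data; "\u00c9clair" is "Eclair" with an accented E.
names = ["banana", "Apple", "cherry", "apple", "\u00c9clair"]

# Default comparison: by code point, so uppercase ASCII sorts before
# lowercase ASCII, and accented letters sort after both.
by_code_point = sorted(names)

# Comparing the UTF-8 bytes yields the same order as comparing code points.
by_utf8_bytes = sorted(names, key=lambda name: name.encode("utf-8"))

print(by_code_point)
print(by_code_point == by_utf8_bytes)  # True
```

For a linguistic collation, Python's standard library offers locale.strxfrm() as a sort key, but that only matches the database if the same locale is installed and selected on the client, which is an assumption you would have to verify.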

PostgreSQL ORDER BY clause on numerical portion of text column

Your id column is obviously of some text data type so the ordering is alphabetical, not by the number. To get it to work, strip the 'G' from the id column when ordering:

SELECT * FROM mytable
ORDER BY right(id, -1)::integer;

Natural sorting when characters and numbers mixed

You need to normalise the format of the numeric portion of the text. You can do that by splitting the string into the AB prefix and the numeric part, then left-padding the numeric part to a consistent length with zeroes.

For example: AB11a becomes AB00011a.

Apply this to all the items you've listed and they'll sort in the order you want.

You can do this with

    ... ORDER BY concat(substring(`code`,1,2),lpad(substr(`code`,3),6,'0')) ...

where `code` is the name of the column that contains the data you want to sort. The backticks are MySQL identifier quoting; in PostgreSQL, drop them or use double quotes instead.

Note - this assumes that the prefix is always 2 characters.
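
A quick way to check the padding logic is to replay it in Python. This is only a sketch: padded_key and the sample codes are made up, and like the expression above it assumes a 2-character prefix. It also assumes the suffix after the number has the same length in every value, because the suffix counts toward the padded width:

```python
def padded_key(code, prefix_len=2, width=6):
    # Split off the fixed-length prefix and left-pad the rest with zeroes.
    # Note: rjust() never truncates, whereas lpad() cuts strings longer
    # than the target width.
    prefix, rest = code[:prefix_len], code[prefix_len:]
    return prefix + rest.rjust(width, "0")

codes = ["AB11a", "AB2a", "AB100a"]
ordered = sorted(codes, key=padded_key)

print(padded_key("AB11a"))  # AB00011a
print(ordered)              # ['AB2a', 'AB11a', 'AB100a']
```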

How to order by the alphabetical sort order of a title (ignoring The, An, etc.) and use an index

You could add an index on an expression:

create index on yourtable (natural_sort(title));

Postgres will then use the index when appropriate, and won't actually calculate natural_sort(title) when it does -- unless you select that too.

That being said (and much like with tsvector fields), you'll get better performance if you actually store the pre-calculated result. If, in the above case, Postgres decides not to use that index for any reason, having to calculate it for each and every row considered will be a big drag on your query.

In either case, don't forget numbers:

http://www.codinghorror.com/blog/2007/12/sorting-for-humans-natural-sort-order.html


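Here are two functions to get you started on natural sorting. The first one calls unaccent(), which is provided by the unaccent extension, so that needs to be installed first: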

/**
* @param text _str The input string.
* @return text The output string for consumption in natural sorting.
*/
CREATE OR REPLACE FUNCTION natsort(text)
  RETURNS text
AS $$
DECLARE
  _str text := $1;
  _pad int := 15; -- Maximum precision for PostgreSQL floats
BEGIN
  -- Bail if the string is empty
  IF trim(_str) = ''
  THEN
    RETURN '';
  END IF;

  -- Strip accents and lower the case
  _str := lower(unaccent(_str));

  -- Replace nonsensical characters
  _str := regexp_replace(_str, E'[^a-z0-9$¢£¥₤€@&%\\(\\)\\[\\]\\{\\}_:;,\\.\\?!\\+\\-]+', ' ', 'g');

  -- Trim the result
  _str := trim(_str);

  -- @todo we'd ideally want to strip leading articles/prepositions ('a', 'the') at this stage,
  -- but to_tsvector()'s default dictionary also strips stop words (e.g. 'all').

  -- We're done if the string contains no numbers
  IF _str !~ '[0-9]'
  THEN
    RETURN _str;
  END IF;

  -- Force spaces between numbers, so we can use regexp_split_to_table()
  _str := regexp_replace(_str, E'((?:[0-9]+|[0-9]*\\.[0-9]+)(?:e[+-]?[0-9]+\\M)?)', E' \\1 ', 'g');

  -- Pad zeros to obtain a reasonably natural looking sort order
  RETURN array_to_string(ARRAY(
    SELECT CASE
           WHEN val !~ E'^\\.?[0-9]'
           -- Not a number; return as is
           THEN val
           -- Do our best after expanding the number...
           ELSE COALESCE(lpad(substring(val::numeric::text from '^[0-9]+'), _pad, '0'), '') ||
                COALESCE(rpad(substring(val::numeric::text from E'\\.[0-9]+'), _pad, '0'), '')
           END
    FROM regexp_split_to_table(_str, E'\\s+') as val
    WHERE val <> ''
  ), ' ');
END;
$$ IMMUTABLE STRICT LANGUAGE plpgsql COST 1;

COMMENT ON FUNCTION natsort(text) IS
'Rewrites a string so it can be used in natural sorting.

It''s by no means bullet proof, but it works properly for positive integers,
reasonably well for positive floats, and it''s fast enough to be used in a
trigger that populates an indexed column, or in an index directly.';

/**
* @param text[] _values The potential values to use.
* @return text The output string for consumption in natural sorting.
*/
CREATE OR REPLACE FUNCTION sort(text[])
  RETURNS text
AS $$
DECLARE
  _values alias for $1;
  _sort text;
BEGIN
  SELECT natsort(value)
  INTO   _sort
  FROM   unnest(_values) as value
  WHERE  value IS NOT NULL
  AND    value <> ''
  AND    natsort(value) <> ''
  LIMIT  1;

  RETURN COALESCE(_sort, '');
END;
$$ IMMUTABLE STRICT LANGUAGE plpgsql COST 1;

COMMENT ON FUNCTION sort(text[]) IS
'Returns natsort() of the first significant input argument.';
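
The core trick in natsort() is zero-padding every number to a fixed width so that plain string comparison orders them numerically. Here is a deliberately simplified Python sketch of just that step. It handles positive integers only and skips the accent stripping, character filtering, spacing and float handling of the function above; natsort_key and the sample titles are made up for illustration:

```python
import re

PAD = 15  # same width as _pad in natsort()

def natsort_key(text):
    # Lowercase the string and left-pad each run of digits with zeroes.
    return re.sub(r"\d+", lambda match: match.group().zfill(PAD), text.lower())

titles = ["file10", "file9", "File2"]
ordered = sorted(titles, key=natsort_key)

print(natsort_key("12345 12345"))  # 000000000012345 000000000012345
print(ordered)                     # ['File2', 'file9', 'file10']
```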

Sample output from the first function's unit tests:

public function testNatsort()
{
$this->checkInOut('natsort', array(
'<NULL>' => null,
'' => '',
'ABCde' => 'abcde',
'12345 12345' => '000000000012345 000000000012345',
'12345.12345' => '000000000012345.123450000000000',
'12345e5' => '000001234500000',
'.12345e5' => '000000000012345',
'1e10' => '000010000000000',
'1.2e20' => '120000000000000',
'-12345e5' => '- 000001234500000',
'-.12345e5' => '- 000000000012345',
'-1e10' => '- 000010000000000',
'-1.2e20' => '- 120000000000000',
'+-$¢£¥₤€@&%' => '+-$¢£¥₤€@&%',
'ÀÁÂÃÄÅĀĄĂÆ' => 'aaaaaeaaaaaae',
'ÈÉÊËĒĘĚĔĖÐ' => 'eeeeeeeeee',
'ÌÍÎÏĪĨĬĮİIJ' => 'iiiiiiiiiij',
'ÒÓÔÕÖØŌŐŎŒ' => 'oooooeoooooe',
'ÙÚÛÜŪŮŰŬŨŲ' => 'uuuueuuuuuu',
'ÝŶŸ' => 'yyy',
'àáâãäåāąăæ' => 'aaaaaeaaaaaae',
'èéêëēęěĕėð' => 'eeeeeeeeee',
'ìíîïīĩĭįıij' => 'iiiiiiiiiij',
'òóôõöøōőŏœ' => 'oooooeoooooe',
'ùúûüūůűŭũų' => 'uuuueuuuuuu',
'ýÿŷ' => 'yyy',
'ÇĆČĈĊ' => 'ccccc',
'ĎĐ' => 'dd',
'Ƒ' => 'f',
'ĜĞĠĢ' => 'gggg',
'ĤĦ' => 'hh',
'Ĵ' => 'j',
'Ķ' => 'k',
'ŁĽĹĻĿ' => 'lllll',
'ÑŃŇŅŊ' => 'nnnnn',
'ŔŘŖ' => 'rrr',
'ŚŠŞŜȘſ' => 'sssssss',
'ŤŢŦȚÞ' => 'ttttt',
'Ŵ' => 'w',
'ŹŽŻ' => 'zzz',
'çćčĉċ' => 'ccccc',
'ďđ' => 'dd',
'ƒ' => 'f',
'ĝğġģ' => 'gggg',
'ĥħ' => 'hh',
'ĵ' => 'j',
'ĸķ' => 'kk',
'łľĺļŀ' => 'lllll',
'ñńňņʼnŋ' => 'nnnnnn',
'ŕřŗ' => 'rrr',
'śšşŝșß' => 'sssssss',
'ťţŧțþ' => 'ttttt',
'ŵ' => 'w',
'žżź' => 'zzz',
'-_aaa--zzz--' => '-_aaa--zzz--',
'-:àáâ;-žżź--' => '-:aaa;-zzz--',
'-.à$â,-ž%ź--' => '-.a$a,-z%z--',
'--à$â--ž%ź--' => '--a$a--z%z--',
'-$à(â--ž)ź%-' => '-$a(a--z)z%-',
'#-à$â--ž?!ź-' => '-a$a--z?!z-',
));
}

