PostgreSQL Sorting Language Specific Characters (Collation)

You can check which locales are installed with:

locale -a

If et_EE is not listed, run these commands:

sudo locale-gen et_EE
sudo locale-gen et_EE.UTF-8
sudo update-locale

and try your query again.

PostgreSQL improperly sorts unicode chars with Czech collation

That ordering is correct. The accents on á, ď, é, ě, í, ň, ó, ť, ú, ů, ý should be ignored when sorting (see the article).

Czech sort rules are a little bit complex :)

Postgres ordering of UTF-8 characters

I'd be okay with all of the words starting with special characters
being at the end...

Use collate "C":

SELECT w."translated"
FROM "words" AS w
ORDER BY w."translated" COLLATE "C" DESC
LIMIT 10;

See also Different behaviour in “order by” clause: Oracle vs. PostgreSQL

The query can be problematic when you use an ORM. One solution is to recreate the database with the LC_COLLATE = C option, as the OP suggested in a comment. Another option is to change the collation of a single column:

ALTER TABLE "words" ALTER COLUMN "translated" TYPE text COLLATE "C";

What exactly does the collation 'de-DE-u-kn-true' mean

See ICU Collations in the PostgreSQL documentation. This links to the ICU documentation, which - with some indirection - leads to Unicode Locale Identifier, which makes clear that the -u introduces the Unicode Locale Extensions, and kn is one of those extensions. When you look at Collation Settings, you'll find kn configures numeric ordering. The true is the configuration of that option (meaning, numeric ordering is on):

If set to on, any sequence of Decimal Digits (General_Category =
Nd in the [UAX44]) is sorted at a primary level with its numeric
value. For example, "A-21" < "A-123". The computed primary weights are
all at the start of the digit reordering group. Thus with an
untailored UCA table, "a$" < "a0" < "a2" < "a12" < "a⓪" < "aa".

This is sometimes called “natural sort order”.

In other words, de-DE-u-kn-true is:

  • de: language German
  • DE: region Germany
  • u: what follows are Unicode Locale Extensions
  • kn: Unicode Locale Extension numeric ordering
  • true: value of kn, meaning numeric ordering is on
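
The effect of kn-true can be approximated outside the database. The following is a minimal Python sketch, not ICU's actual algorithm: it only treats runs of decimal digits as numbers and compares everything else by code point. The name natural_key is made up for this illustration.

```python
import re

def natural_key(s):
    # re.split with a capturing group alternates text and digit runs:
    # even indexes hold text, odd indexes hold the captured digits.
    parts = re.split(r"(\d+)", s)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

labels = ["A-123", "A-21"]
print(sorted(labels))                   # plain code point order
print(sorted(labels, key=natural_key))  # digit runs compared by value
```

With the plain sort, "A-123" comes first because the character "1" sorts before "2". With the key, 21 is compared to 123 as a number, so "A-21" comes first, as in the quoted example.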

Incorrect sort/collation/order with spaces in Postgresql 9.4

On Unix/Linux SE, a friendly expert explained that what you see is the proper way to sort Unicode. Basically, the standard is trying to sort:

di Silva Fred                  di Silva Fred
di Silva John                  diSilva Fred
diSilva Fred                   disílva Fred
diSilva John         ->        di Silva John
disílva Fred                   diSilva John
disílva John                   disílva John

Now if spaces were as important as letters, the sort could not separate the various identical spellings of Fred and John. So what happens is that it first sorts without spaces. Then in a second pass, strings that are the same without whitespace are sorted. (This is a simplification, the real algorithm looks fairly complex, assigning whitespace, accents and non-printable characters various levels of precedence.)
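
The two-pass idea can be sketched in Python. This is a rough illustration of the simplification described above, not the real Unicode Collation Algorithm: the first key drops spaces, accents and case, and ties are broken by the original string. The function names are invented for the example.

```python
import unicodedata

def primary_key(s):
    # Decompose accented letters, then drop combining marks and spaces.
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) and not ch.isspace()
    )
    return base.casefold()

def two_pass_sorted(names):
    # First compare without spaces, accents or case; break ties with the
    # original string (plain code point comparison).
    return sorted(names, key=lambda s: (primary_key(s), s))

names = [
    "di Silva Fred", "di Silva John",
    "diSilva Fred", "diSilva John",
    "dis\u00edlva Fred", "dis\u00edlva John",
]
for name in two_pass_sorted(names):
    print(name)
```

For these six names the sketch groups all the Freds before all the Johns, matching the right-hand column above, while a plain sorted(names) keeps the left-hand order.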

You can bypass the Unicode collation by setting:

export LC_ALL=C

Or in Postgres by casting to byte array for sorting:

order by name::bytea

Or (from Kiln's answer) by specifying the C collation:

order by name collate "C"

Or by altering the default collation for the column:

alter table products alter column name type text collate "C";

Unicode character default collation table

PostgreSQL uses the locales provided by the operating system. In your setup, locales are provided by glibc. Glibc uses a heavily modified version of an "ancient" version of ISO 14651 (see glibc Bug 14095 - Review / update collation data from Unicode / ISO 14651 for information on current pains in trying to update glibc locale data).

As of glibc 2.28, released on 2018-08-01, glibc uses data from ISO 14651:2016 (which is synchronized to Unicode 9) and gives the order the OP expects for en_US.

ISO 14651 is titled "Method for comparing character strings and description of the common template tailorable ordering". It is similar to the UCA, with some differences. The CTT (Common Template Table) is the ISO 14651 equivalent of the DUCET, and the two are aligned.

The first time CYRILLIC SMALL LETTER SCHWA appeared in a collation table in glibc was for the az_AZ locale (Azerbaijani), where it is ordered after CYRILLIC SMALL LETTER IE. This corresponds to:

commit fcababc4e18fee81940dab20f7c40b1e1fb67209
Author: Ulrich Drepper <drepper@redhat.com>
Date: Fri Aug 3 08:42:28 2001 +0000

Update.

2001-08-03 Ulrich Drepper <drepper@redhat.com>

* locale/iso-639.def: Add Tigrinya.

From there, that ordering was eventually moved to the file iso14651_t1 as per Bug 672 - Include iso14651_t1 in collation rules, which was an effort to simplify glibc locale data. This corresponds to:

commit 5d2489928c0040d2a71dd0e63c801f2cf98e7efc
Author: Ulrich Drepper <drepper@redhat.com>
Date: Sun Feb 18 04:34:28 2007 +0000

[BZ #672]

2005-01-16 Denis Barbier <barbier@linuxfr.org>
[BZ #672]
* locales/ca_ES: Replace current collation rules by including
iso14651_t1 and adding extra rules if needed. There should be
no noticeable changes in sorted text. only ligatures and
ignoreable characters have modified weights.
* locales/da_DK: Likewise.
* locales/en_CA: Likewise.
* locales/es_US: Likewise.
* locales/fi_FI: Likewise.
* locales/nb_NO: Likewise.

[BZ #672]
* locales/iso14651_t1: Simplified. Extended.

Most locales in glibc start from iso14651_t1, and tailor it, which is what you are seeing with en_US.

While glibc based its default ordering on Azerbaijani, the DUCET instead bases it on the ordering for Kazakh and Tatar, which is where the difference comes from.

Set Order By to ignore punctuation on a per-column basis

If you want this ordering in one particular query, you can use:

ORDER BY regexp_replace(title, '[^a-zA-Z]', '', 'g')

This deletes every character other than A-Z and a-z from the string and orders by the result.
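
The same idea can be tried outside the database. The Python sketch below applies the same pattern as a sort key. The sample titles and the name letters_only are invented for the illustration, and the keys are compared by code point here, whereas in PostgreSQL the stripped text is still compared with the column's collation.

```python
import re

def letters_only(title):
    # Same pattern as the SQL above: drop everything except ASCII letters.
    return re.sub(r"[^a-zA-Z]", "", title)

titles = ["'Tis the Season", "A Tale", "...And Justice"]
print(sorted(titles, key=letters_only))
```

Note that the pattern also removes digits and non-ASCII letters such as "é", so titles that differ only in those characters get the same key.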

What is the best way to replicate PostgreSQL sorting results in JavaScript?

Sorting strings is always done using a certain collation. If you are not conscious of using a collation in your programming language, you are probably using the POSIX collation, which compares strings character by character according to their code point (the numeric value in the encoding).

In PostgreSQL, that would look like this:

ORDER BY name COLLATE "POSIX";
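
As an illustration of code point ordering in a programming language, here is a small Python sketch. Python's default string comparison works on code points, so it should give the same order as the POSIX collation for this sample. The sample data is made up.

```python
names = ["b", "a", "B", "\u00e4"]  # "\u00e4" is "ä"

# Default string comparison goes code point by code point.
print(sorted(names))
print([ord(ch) for ch in sorted(names)])
```

The uppercase "B" (66) sorts before every lowercase letter, and "ä" (228) sorts after "b" (98), which is rarely what a linguistic collation would produce.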

So to solve your problem, you'd have to find out the collation of the column.

If there is no special collation specified in the column definition, it will use the database's collation, which can be found with

SELECT datcollate FROM pg_database WHERE datname = 'my_database';

That will be an operating system collation from the C library.

So all you have to do is to use that collation in your program.

If your program is written in C, you can directly use the C library. Otherwise, refer to the documentation of your programming language.
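
In Python, for example, the standard locale module wraps the C library's collation functions. The sketch below is one way to use it. The helper name sort_with_locale is invented, and the locale name you pass is an assumption that should match the datcollate value and be installed on the machine running the program.

```python
import locale

def sort_with_locale(names, locale_name):
    # Switch LC_COLLATE for the sort, then restore the previous setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    locale.setlocale(locale.LC_COLLATE, locale_name)
    try:
        # strxfrm turns each string into a key that compares like strcoll.
        return sorted(names, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)

print(sort_with_locale(["b", "a", "B"], "C"))
```

setlocale raises locale.Error when the requested locale is not installed, and it changes process-wide state, so this approach is not safe to call from several threads at once. The result only matches PostgreSQL if both machines use the same C library locale data.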


