Postgresql sorting language specific characters (collation)
You can list the installed locales with:
locale -a
If et_EE is not listed, run these commands:
sudo locale-gen et_EE
sudo locale-gen et_EE.UTF-8
sudo update-locale
Then try your query again.
PostgreSQL improperly sorts unicode chars with Czech collation
The result is correct: the accents on á, ď, é, ě, í, ň, ó, ť, ú, ů, ý should be ignored when sorting (see article).
Czech sort rules are a little bit complex :)
Postgres ordering of UTF-8 characters
I'd be okay with all of the words starting with special characters
being at the end...
Use collate "C":
SELECT w."translated"
FROM "words" AS w
ORDER BY w."translated" COLLATE "C" DESC
LIMIT 10;
See also Different behaviour in “order by” clause: Oracle vs. PostgreSQL
The query can be problematic when using an ORM. One solution is to recreate the database with the LC_COLLATE = C option, as suggested by the OP in the comment. Another option is to change the collation for a single column:
ALTER TABLE "words" ALTER COLUMN "translated" TYPE text COLLATE "C";
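To see why the "C" collation pushes words with special characters to the end, it can help to reproduce the comparison outside the database. The following Python sketch assumes a UTF-8 database, where "C" ordering amounts to comparing the encoded bytes; the word list and the function name are made up for illustration.

```python
# Made-up sample values standing in for the "translated" column.
words = ["zebra", "ábaco", "apple", "Éclair", "Banana"]

def c_collation_sort(values):
    """Sort by UTF-8 bytes, which is how COLLATE "C" is assumed to compare here."""
    return sorted(values, key=lambda s: s.encode("utf-8"))

print(c_collation_sort(words))
# ['Banana', 'apple', 'zebra', 'Éclair', 'ábaco']
```

Note that uppercase ASCII letters also sort before lowercase ones under this ordering, which may or may not be what you want.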
What exactly means the collation 'de-DE-u-kn-true'
See ICU Collations in the PostgreSQL documentation. This links to the ICU documentation, which, with some indirection, leads to Unicode Locale Identifier, which makes clear that the -u introduces the Unicode Locale Extensions, and kn is one of those extensions. When you look at Collation Settings, you'll find that kn configures numeric ordering. The true is the value of that option (meaning numeric ordering is on):
If set to on, any sequence of Decimal Digits (General_Category =
Nd in the [UAX44]) is sorted at a primary level with its numeric
value. For example, "A-21" < "A-123". The computed primary weights are
all at the start of the digit reordering group. Thus with an
untailored UCA table, "a$" < "a0" < "a2" < "a12" < "a⓪" < "aa".
This is sometimes called “natural sort order”.
In other words, de-DE-u-kn-true is:
de: language German
DE: region Germany
u: what follows are Unicode Locale Extensions
kn: Unicode Locale Extension numeric ordering
true: value of kn, meaning numeric ordering is on
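The effect of numeric ordering can be approximated outside the database. The Python sketch below is not the ICU algorithm; it only illustrates the idea of comparing digit runs by numeric value, and the helper name natural_key is hypothetical.

```python
import re

def natural_key(s):
    """Split into alternating text and digit runs; compare digit runs as integers."""
    parts = re.split(r"(\d+)", s)
    # re.split with a capturing group always yields text at even indexes
    # and digit runs at odd indexes, so keys compare like with like.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

labels = ["A-123", "A-21", "A-3"]
print(sorted(labels))                   # ['A-123', 'A-21', 'A-3']
print(sorted(labels, key=natural_key))  # ['A-3', 'A-21', 'A-123']
```

The plain sort compares "1" with "2" character by character, while the keyed sort compares 21 with 123 as numbers, matching the "A-21" < "A-123" example quoted above.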
Incorrect sort/collation/order with spaces in Postgresql 9.4
On Unix/Linux SE, a friendly expert explained that what you see is the proper way to sort Unicode. Basically, the standard is trying to sort:
di Silva Fred          di Silva Fred
di Silva John          diSilva Fred
diSilva Fred           disílva Fred
diSilva John     ->    di Silva John
disílva Fred           diSilva John
disílva John           disílva John
Now if spaces were as important as letters, the sort could not separate the various identical spellings of Fred and John. So what happens is that it first sorts without spaces. Then in a second pass, strings that are the same without whitespace are sorted. (This is a simplification, the real algorithm looks fairly complex, assigning whitespace, accents and non-printable characters various levels of precedence.)
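The two-pass idea can be sketched in Python. This is a deliberately simplified model, not the Unicode Collation Algorithm: the first pass compares only base letters (spaces, accents, and case removed), and the tie-break on the raw string is an assumption chosen for illustration. The function names are hypothetical.

```python
import unicodedata

def primary_form(s):
    """Drop spaces and accents and fold case, leaving only the base letters."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed
                   if not unicodedata.combining(c) and not c.isspace())
    return base.casefold()

def two_pass_key(s):
    # First compare base letters only; fall back to the raw string on ties.
    return (primary_form(s), s)

names = ["di Silva Fred", "di Silva John", "diSilva Fred",
         "diSilva John", "disílva Fred", "disílva John"]
for name in sorted(names, key=two_pass_key):
    print(name)
```

With this key, all the Fred entries come before all the John entries, because every spelling of the surname reduces to the same base letters in the first pass.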
You can bypass the Unicode collation by setting:
export LC_ALL=C
Or in Postgres by casting to byte array for sorting:
order by name::bytea
Or (from Kiln's answer) by specifying the C collation:
order by name collate "C"
Or by altering the default collation for the column:
alter table products alter column name type text collate "C";
Unicode character default collation table
PostgreSQL uses the locales provided by the operating system. In your setup, locales are provided by glibc. Glibc uses a heavily modified version of an "ancient" version of ISO 14651 (see glibc Bug 14095 - Review / update collation data from Unicode / ISO 14651 for information on current pains in trying to update glibc locale data).
As of glibc 2.28, released on 2018-08-01, glibc uses data from ISO 14651:2016 (which is synchronized to Unicode 9) and gives the order the OP expects for en_US.
ISO 14651 is titled "Method for comparing character strings and description of the common template tailorable ordering". It is similar to the UCA, with some differences. The CTT (Common Template Table) is the ISO 14651 equivalent of the DUCET, and they are aligned.
The first time CYRILLIC SMALL LETTER SCHWA appeared in a collation table in glibc was for the az_AZ locale (Azerbaijani), where it is ordered after CYRILLIC SMALL LETTER IE. This corresponds to:
commit fcababc4e18fee81940dab20f7c40b1e1fb67209
Author: Ulrich Drepper <drepper@redhat.com>
Date: Fri Aug 3 08:42:28 2001 +0000
Update.
2001-08-03 Ulrich Drepper <drepper@redhat.com>
* locale/iso-639.def: Add Tigrinya.
From there, that ordering was eventually moved to the file iso14651_t1 as per Bug 672 - Include iso14651_t1 in collation rules, which was an effort to simplify glibc locale data. This corresponds to:
commit 5d2489928c0040d2a71dd0e63c801f2cf98e7efc
Author: Ulrich Drepper <drepper@redhat.com>
Date: Sun Feb 18 04:34:28 2007 +0000
[BZ #672]
2005-01-16 Denis Barbier <barbier@linuxfr.org>
[BZ #672]
* locales/ca_ES: Replace current collation rules by including
iso14651_t1 and adding extra rules if needed. There should be
no noticeable changes in sorted text. only ligatures and
ignoreable characters have modified weights.
* locales/da_DK: Likewise.
* locales/en_CA: Likewise.
* locales/es_US: Likewise.
* locales/fi_FI: Likewise.
* locales/nb_NO: Likewise.
[BZ #672]
* locales/iso14651_t1: Simplified. Extended.
Most locales in glibc start from iso14651_t1 and tailor it, which is what you are seeing with en_US.
While glibc based its default ordering on Azerbaijani, the DUCET instead bases it on the ordering for Kazakh and Tatar, which is where the difference comes from.
Set Order By to ignore punctuation on a per-column basis
If you want this ordering in one particular query, you can use:
ORDER BY regexp_replace(title, '[^a-zA-Z]', '', 'g')
It deletes every character other than a-z and A-Z from the string and orders by the result.
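For readers who want to try the pattern before running it against a table, here is a rough Python equivalent. The titles are invented, the helper name letters_only is hypothetical, and Python compares the cleaned strings by code point, whereas PostgreSQL would apply the column's collation to them.

```python
import re

def letters_only(title):
    """Mirror regexp_replace(title, '[^a-zA-Z]', '', 'g')."""
    return re.sub(r"[^a-zA-Z]", "", title)

# Invented sample titles with leading punctuation.
titles = ["Core", '"Black" Album', "...And Justice"]
print(sorted(titles, key=letters_only))
# ['...And Justice', '"Black" Album', 'Core']
```

Be aware that the pattern also strips digits and non-ASCII letters such as é, so it may need widening for titles that contain them.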
What is the best way to replicate PostgreSQL sorting results in JavaScript?
Sorting strings is always done using a certain collation. If you are not conscious of using a collation in your programming language, you are probably using the POSIX collation, which compares strings character by character according to their code point (the numeric value in the encoding).
In PostgreSQL, that would look like this:
ORDER BY name COLLATE "POSIX";
So to solve your problem, you'd have to find out the collation of the column.
If there is no special collation specified in the column definition, the column uses the database's collation, which can be found with:
SELECT datcollate FROM pg_database WHERE datname = 'my_database';
That will be an operating system collation from the C library.
So all you have to do is to use that collation in your program.
If your program is written in C, you can directly use the C library. Otherwise, refer to the documentation of your programming language.
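As one illustration of using the C library's collation from another language, Python exposes it through the standard locale module. The sketch below assumes the collation name you pass is installed on the machine running the program; the function name is hypothetical, and the example call uses "C" only because that locale is always available.

```python
import locale

def sort_with_collation(values, collation):
    """Sort values using the C library collation named by `collation`."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        # Raises locale.Error if the locale is not installed on this machine.
        locale.setlocale(locale.LC_COLLATE, collation)
        return sorted(values, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore global state

print(sort_with_collation(["b", "a", "B"], "C"))  # ['B', 'a', 'b']
```

In practice you would pass the value returned by the datcollate query instead of "C". Results should only be expected to match the database when both machines use the same C library and locale data.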