Sort Values Using a Specific Collation in Ruby/Rails

Sort values using a specific collation in Ruby/Rails

I found the ffi-locale gem on GitHub, and it solves my problem as far as I can see.

It allows the following code:

FFILocale::setlocale FFILocale::LC_COLLATE, 'da_DK.UTF-8'
%w(Aarhus Aalborg Assens).sort { |a,b| FFILocale::strcoll(a, b) }

Which returns the correct result:

=> ["Assens", "Aalborg", "Aarhus"]

I haven't investigated performance yet, but it calls out to native code, so it ought to be faster than Ruby character replacement code.

Update
It is not perfect :( It does not work properly on Snow Leopard; it seems that the strcoll function is broken on OS X and has been for some time. That is annoying, but my main deployment platform is Linux, where it works, so it is currently my preferred solution.

Sorting in Ruby using the Unicode collation algorithm

Does your use case allow for simply delegating the sorting to Postgres, rather than trying to recreate it in Ruby?

Part of the difficulty here is that there isn't a single correct sorting method, and any variable elements can result in fairly large discrepancies in the final sort order, e.g. see the section on variable weighting.

For example, a gem like twitter-cldr-rb has a fairly robust implementation of the UCA and is backed by a comprehensive test suite, but it is tested against the non-ignorable test cases, which differ from the Postgres implementation (Postgres appears to use the shift-trimmed variant).

The sheer number of test cases means that you have no guarantee that one working solution will match the Postgres sort order in all cases. E.g. will it handle en/em dash correctly, or even emojis? You could fork and modify the twitter-cldr-rb gem, but I suspect this won't be a small undertaking!

If you need to handle values that don't exist in the database, you can ask Postgres to sort them in a lightweight fashion by using a VALUES list:

sql = "SELECT * FROM (VALUES ('de luge'),('de Luge'),('de-luge'),('de-Luge'),('de-luge'),('de-Luge'),('death'),('deluge'),('deLuge'),('demark')) AS t(term) ORDER BY term ASC"
ActiveRecord::Base.connection.execute(sql).values.flatten

It will obviously result in a round-trip to Postgres, but should be very speedy nonetheless.
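If the terms come from user input, the VALUES list is better built with proper quoting than by hand. A minimal sketch, assuming a hypothetical helper `values_sort_sql` and a stand-in quoter that doubles single quotes; in a Rails app you would more likely pass `ActiveRecord::Base.connection.method(:quote)` as the quoter:

```ruby
# Stand-in quoter for illustration only: wraps a string in single quotes and
# doubles any embedded single quote (standard SQL string-literal escaping).
SIMPLE_QUOTER = ->(value) { "'#{value.to_s.gsub("'", "''")}'" }

# Hypothetical helper: builds a query that asks the database to sort
# arbitrary in-memory strings via a VALUES list.
def values_sort_sql(terms, quoter = SIMPLE_QUOTER)
  raise ArgumentError, "terms must not be empty" if terms.empty?

  rows = terms.map { |term| "(#{quoter.call(term)})" }.join(",")
  "SELECT term FROM (VALUES #{rows}) AS t(term) ORDER BY term ASC"
end

sql = values_sort_sql(["de luge", "de-Luge", "O'Brien"])
puts sql
```

The resulting string can then be passed to `ActiveRecord::Base.connection.execute` as in the snippet above.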

How to sort text in sqlite3 with specified locale?

SQLite supports integration with ICU. According to the README file (sqlite/ext/icu/README.txt), the sqlite/ext/icu/ directory contains source code for the SQLite "ICU" extension, an integration of the "International Components for Unicode" library with SQLite.

1. Features

1.1 SQL Scalars upper() and lower()
1.2 Unicode Aware LIKE Operator
1.3 ICU Collation Sequences
1.4 SQL REGEXP Operator

Sorting UTF-8 strings in RoR

http://github.com/grosser/sort_alphabetical

This gem should help. It adds sort_alphabetical and sort_alphabetical_by methods to Enumerable.

Rails find and sort using natural sort order collation

Look at "How do I replace accented Latin characters in Ruby?".

You should be able to sort the countries by their normalized names.

Something like:

@countries.sort{|x,y| x.name.chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s <=> y.name.chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s}
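The `chars.normalize` call above comes from older versions of ActiveSupport. On current Ruby, a roughly equivalent sketch can use the built-in `String#unicode_normalize`. The `Country` struct and the `ascii_sort_key` helper are stand-ins for illustration, not part of the original answer:

```ruby
# Stand-in for the model used in the answer; only the name matters here.
Country = Struct.new(:name)

# Decompose accented characters (NFKD), drop everything outside ASCII
# (which removes the combining marks), then fold case.
def ascii_sort_key(str)
  str.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]/, "").downcase
end

countries = ["Österreich", "Zambia", "Åland", "Brazil"].map { |n| Country.new(n) }

# sort_by computes each key once instead of once per comparison.
sorted = countries.sort_by { |country| ascii_sort_key(country.name) }
puts sorted.map(&:name)
```

Note that this treats "Å" as a plain "A", which is convenient for a generic list but does not follow language-specific rules such as Danish, where "Å" sorts at the end of the alphabet.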

Collation so ordering doesn't work properly with PostgreSQL

UTF-8 locales on OS X are broken. You can try to use a non-UTF-8 locale (tr_TR.ISO8859-9), but it appears that it is also broken in this respect. So it's not going to work.

Why is the alphabetical order breaking in my Rails app?

This is a "feature" of how PostgreSQL handles sorting - and to make matters worse, it varies from database to database, and even from platform to platform.

When you write Project.all.order(:title), Rails generates SQL (as you correctly figured out in a comment above) like this:

SELECT "projects".* FROM "projects" ORDER BY "projects"."title" ASC

This leaves PostgreSQL, or whatever other database you're using, to determine the order. PostgreSQL uses collations to determine order, which are locale-dependent. You can see what collations your databases are using by executing the \l command in psql. On my machines, for example, my databases default to en_US.UTF-8.

Here's where it gets tricky. I created a table in postgres as follows:

CREATE TABLE sorttest (name text);
INSERT INTO sorttest VALUES ('Spring House');
INSERT INTO sorttest VALUES ('Springg House');
INSERT INTO sorttest VALUES ('Springgg House');
SELECT * FROM sorttest ORDER BY name ASC;

On my Mac (Mac OS 10.13.3), it returns

name      
----------------
Spring House
Springg House
Springgg House

However, on my Debian machine, it returns

name      
----------------
Springgg House
Springg House
Spring House

As best as I can tell, the Mac is actually "doing it wrong", although it's the result you want. My Debian box, and your Heroku dyno, are sorting according to the locale's collation rules (not anything in UTF-8 itself, which is only an encoding): ignoring whitespace and capitalization, "springgghouse" should come before "springhouse".
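The two orderings can be imitated in plain Ruby to see where they come from. This is only a rough approximation, assuming the locale-aware sort behaves as if spaces and case were ignored on the first pass; it is not the actual collation algorithm used by the operating system:

```ruby
names = ["Spring House", "Springg House", "Springgg House"]

# Byte-wise comparison: the space (0x20) sorts before "g" (0x67),
# so the shortest prefix comes first, as in the Mac output.
byte_order = names.sort

# Approximation of a locale-aware sort: compare with spaces removed and
# case folded, so "springgghouse" < "springghouse" < "springhouse".
locale_like_order = names.sort_by { |name| name.downcase.delete(" ") }

p byte_order
p locale_like_order
```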

If you want to sort using a different collation (say, the "C" or the "POSIX" collation), you need to use a SQL command like this:

SELECT "projects".* FROM "projects" ORDER BY "projects"."title" COLLATE "C" ASC

Fortunately, you can get there with ActiveRecord:

Project.all.order('title COLLATE "C"')

However, please note that this will make capitalization matter in your sort order - the "C" collation compares byte values, so capital letters will sort before lowercase ones, e.g.:

SELECT * FROM sorttest ORDER BY name COLLATE "C" ASC;

name
----------------
Spring House
SpringGg House
Springg House

Why ruby sort array of strings differently than sql order (postgres)?

Ruby sorts strings by comparing their bytes, which for ASCII text means ordering by ASCII code.

Sort behaviour for text in Postgres depends on the collation of the current locale. From the PostgreSQL wiki, "Why do my strings sort incorrectly?":
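For instance, a small sketch with made-up values shows that uppercase letters sort before lowercase ones by default, and that a case-insensitive order needs an explicit key:

```ruby
fruits = ["banana", "Cherry", "apple"]

# Default sort compares bytes, and "C" (0x43) comes before "a" (0x61).
default_order = fruits.sort

# Sorting on a downcased key ignores case, but is still not locale-aware.
case_insensitive = fruits.sort_by(&:downcase)

p default_order
p case_insensitive
```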

It is not in ASCII/byte order. No, it's not, it's not supposed to be.
ASCII is an encoding, not a sort order. If you want this, you can use
the C locale, but then you lose the ability to sort non-ASCII characters.

So in plain SQL, to sort by ASCII value rather than by a properly localized sort following your local language rules, you can use the COLLATE clause in the query:

order by name COLLATE "C" ASC

You can check your collate settings in psql with SHOW lc_collate;.

PostgreSQL uses OS collation support, so it's possible for results to vary slightly from host OS to host OS. Some versions of Mac OS X and BSD-family operating systems have problems with locale definitions.


