Sorting UTF-8 Strings in RoR

Sorting UTF-8 strings in RoR

http://github.com/grosser/sort_alphabetical

This gem should help. It adds sort_alphabetical and sort_alphabetical_by methods to Enumerable.

How should one add UTF-8 support to sorting in Ruby (including the ł character) without affecting portability?

A good solution is to use the twitter-cldr-rb gem: https://github.com/twitter/twitter-cldr-rb

require 'twitter_cldr'
collator = TwitterCldr::Collation::Collator.new
collator.sort(['m', 'ł', 'l'])
=> ["l", "ł", "m"]

Sorting in Ruby using the Unicode collation algorithm

Does your use case allow for simply delegating the sorting to Postgres, rather than trying to recreate it in Ruby?

Part of the difficulty here is that there isn't a single correct sorting method, and any variable elements can result in fairly large discrepancies in the final sort order (see, for example, the section on variable weighting).

For example, a gem like twitter-cldr-rb has a fairly robust implementation of the UCA and is backed by a comprehensive test suite, but that suite runs against the non-ignorable test cases, which differs from the Postgres implementation (Postgres appears to use the shift-trimmed variant).

The sheer number of test cases means that you have no guarantee that one working solution will match the Postgres sort order in all cases. E.g. will it handle en/em dash correctly, or even emojis? You could fork and modify the twitter-cldr-rb gem, but I suspect this won't be a small undertaking!

If you need to handle values that don't exist in the database, you can ask Postgres to sort them in a lightweight fashion by using a VALUES list:

sql = "SELECT * FROM (VALUES ('de luge'),('de Luge'),('de-luge'),('de-Luge'),('de-luge'),('de-Luge'),('death'),('deluge'),('deLuge'),('demark')) AS t(term) ORDER BY term ASC"
ActiveRecord::Base.connection.execute(sql).values.flatten

It will obviously result in a round-trip to Postgres, but should be very speedy nonetheless.
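If the terms live in a Ruby array rather than in a hand-written string, the VALUES list can be assembled programmatically. The following is only a minimal sketch: values_sort_sql and naive_quote are hypothetical names, not part of any library, and naive_quote is a stand-in used purely so the example runs without a database. In a Rails app you would presumably pass the connection's own quote method instead.

```ruby
# Hypothetical helper: builds a VALUES-list query for an array of terms.
# `quoter` is any callable that turns a Ruby string into a safely quoted
# SQL string literal.
def values_sort_sql(terms, quoter)
  rows = terms.map { |term| "(#{quoter.call(term)})" }.join(",")
  "SELECT term FROM (VALUES #{rows}) AS t(term) ORDER BY term ASC"
end

# Stand-in quoter for demonstration only: wraps the value in single quotes
# and doubles any embedded single quote. Assumption: real code would pass
# ActiveRecord::Base.connection.method(:quote) here instead.
naive_quote = ->(s) { "'" + s.gsub("'", "''") + "'" }

sql = values_sort_sql(["de luge", "de-Luge", "O'Neil"], naive_quote)
puts sql
```

With ActiveRecord loaded, the resulting string could then be run the same way as above, via ActiveRecord::Base.connection.execute(sql).values.flatten.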

Force strings to UTF-8 from any encoding

Ruby 1.9

"Forcing" an encoding is easy. However, it won't convert the characters; it just changes the encoding label attached to the existing bytes:

str = str.force_encoding('UTF-8')

str.encoding.name # => 'UTF-8'

If you want to perform a conversion, use encode:

begin
  str.encode("UTF-8")
rescue Encoding::UndefinedConversionError
  # ...
end
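The difference between the two calls is easiest to see on concrete bytes. Here is a small sketch, assuming the input is "café" encoded as ISO-8859-1; the variable names are illustrative only.

```ruby
# "café" as ISO-8859-1 bytes; 0xE9 is "é" in that encoding.
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding("ISO-8859-1")

# force_encoding only relabels: same bytes, which are not valid UTF-8.
relabeled = latin1.dup.force_encoding("UTF-8")
puts relabeled.valid_encoding?        # false
puts relabeled.bytes == latin1.bytes  # true

# encode transcodes: "é" becomes the two-byte UTF-8 sequence C3 A9.
converted = latin1.encode("UTF-8")
puts converted.valid_encoding?        # true
p converted.bytes                     # [99, 97, 102, 195, 169]

# Bytes with no known mapping raise unless a replacement is requested.
binary = [0xFF].pack("C*")            # tagged ASCII-8BIT (binary)
safe = binary.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
puts safe                             # ?
```

Passing the replacement options is an alternative to the begin/rescue form when dropping or substituting unconvertible bytes is acceptable.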

I would definitely read the following post for more information:

http://graysoftinc.com/character-encodings/ruby-19s-string

Rails collection ordering doesn't work as expected with UTF-8 strings

This is most likely a collation settings issue with PostgreSQL:

The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation. This alleviates the restriction that the LC_COLLATE and LC_CTYPE settings of a database cannot be changed after its creation.

You can try and fix the collation of a column with a Rails migration. Something like this:

class FixCollationForAbbr < ActiveRecord::Migration
  def up
    execute 'ALTER TABLE universities ALTER COLUMN abbr TYPE varchar COLLATE "ru_RU";'
  end
end

You should probably also add collation information to your database.yml:

defaults: &defaults
  adapter: postgresql
  encoding: utf8
  collation: ru_RU.utf8
  ctype: ru_RU.utf8

Here is how the database.yml settings affect table creation with PostgreSQL:

def create_database(name, options = {})
  options = { encoding: 'utf8' }.merge!(options.symbolize_keys)

  option_string = options.inject("") do |memo, (key, value)|
    memo += case key
    ...snip...
    when :encoding
      " ENCODING = '#{value}'"
    when :collation
      " LC_COLLATE = '#{value}'"
    when :ctype
      " LC_CTYPE = '#{value}'"
    ...snip...
    end
  end

  execute "CREATE DATABASE #{quote_table_name(name)}#{option_string}"
end

Sort array of strings with special characters

The approach I used when I ran into the same issue (depends on iconv gem):

require 'iconv'

def sort_alphabetical(words)
  # Cache each word's ASCII transliteration so it is computed only once.
  transliterations = {}

  transliterate = lambda do |w|
    # Iconv.iconv returns an array of converted strings; take the single result.
    transliterations[w] ||= Iconv.iconv('ascii//ignore//translit', 'utf-8', w).first
  end

  words.sort do |w1, w2|
    transliterate.call(w1) <=> transliterate.call(w2)
  end
end

sorted = sort_alphabetical(...)

An alternative would be to use the sort_alphabetical gem.
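On Ruby 2.2 or later, a rough approximation of the same idea is possible without iconv, by decomposing each string and stripping the combining marks before comparing. This is only a sketch under that assumption: sort_ignoring_accents is a hypothetical name, and the approach only helps for characters that decompose into a base letter plus combining marks. Letters such as ł have no such decomposition, so this is not a substitute for locale-aware collation.

```ruby
# Hypothetical helper: sorts by an accent-stripped, lowercased key.
def sort_ignoring_accents(words)
  words.sort_by do |word|
    # NFD splits "é" into "e" + U+0301; \p{Mn} matches nonspacing marks,
    # so removing them leaves the bare base letters.
    word.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
  end
end

words = %w[zebra Éclair apple école]
p words.sort                     # ["apple", "zebra", "Éclair", "école"]
p sort_ignoring_accents(words)   # ["apple", "Éclair", "école", "zebra"]
```

The plain sort compares bytes, which pushes every accented initial letter past "z"; the key-based sort places those words where a reader would expect them.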

Ruby, problems comparing strings with UTF-8 characters

This is an issue with Unicode equivalence.

The a version of your string consists of the character ư (U+01B0: LATIN SMALL LETTER U WITH HORN), followed by U+0303 COMBINING TILDE. This second character, as the name suggests, is a combining character: when rendered, it is combined with the previous character to produce the final glyph.

The b version of the string uses the character ữ (U+1EEF: LATIN SMALL LETTER U WITH HORN AND TILDE), which is a single character and is equivalent to the previous combination, but uses a different byte sequence to represent it.

In order to compare these strings you need to normalize them, so that they both use the same byte sequences for these types of characters. Ruby 2.2 and later have this built in (in earlier versions you needed to use a third-party library).

So currently you have

a == b

which is false, but if you do

a.unicode_normalize == b.unicode_normalize

you should get true.
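To make this concrete, here is a small sketch that builds the two strings from the code points described above (the escapes stand in for the strings in the question) and prints what each normalization form produces.

```ruby
a = "\u01B0\u0303"  # ư followed by U+0303 COMBINING TILDE
b = "\u1EEF"        # precomposed ữ

puts a == b                                      # false
puts a.unicode_normalize == b.unicode_normalize  # true

# NFC (the default) composes a into the single code point used by b.
p a.unicode_normalize(:nfc).codepoints.map { |c| format("U+%04X", c) }
# => ["U+1EEF"]

# NFD fully decomposes b into base letter, horn, and tilde.
p b.unicode_normalize(:nfd).codepoints.map { |c| format("U+%04X", c) }
# => ["U+0075", "U+031B", "U+0303"]
```

Either form works for comparison as long as both sides are normalized the same way.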

If you are on an older version of Ruby, there are a couple of options. Rails has a normalize method as part of its multibyte support, so if you are using Rails you can do:

a.mb_chars.normalize == b.mb_chars.normalize

or perhaps something like:

ActiveSupport::Multibyte::Unicode.normalize(a) == ActiveSupport::Multibyte::Unicode.normalize(b)

If you’re not using Rails, then you could look at the unicode_utils gem, and do something like this:

UnicodeUtils.nfkc(a) == UnicodeUtils.nfkc(b)

(nfkc refers to the normalisation form. It is the default for the Rails methods above, whereas String#unicode_normalize defaults to NFC.)

There are several ways to normalise Unicode strings (for example, whether you use the decomposed or composed versions), and these examples just use each method's default. I’ll leave researching the differences to you.

Sorting UTF-8 strings in Win32 program

The CompareStringEx Function probably does what you need.

But note that this function (and the Windows API in general) does not use the UTF-8 encoding to represent Unicode strings. Instead, it uses the UTF-16 encoding (aka "wide character strings"). You might just be confusing the UTF-8 encoding with Unicode in general. But if you really are dealing with UTF-8 encoded strings, you can convert them to wide character strings with the MultiByteToWideChar Function.


