Sorting UTF-8 strings in RoR
http://github.com/grosser/sort_alphabetical
This gem should help. It adds sort_alphabetical and sort_alphabetical_by methods to Enumerable.
How should one add UTF-8 support to sorting in Ruby (including the ł character, without affecting portability)?
A good solution is to use the twitter-cldr-rb gem: https://github.com/twitter/twitter-cldr-rb
require 'twitter_cldr'
collator = TwitterCldr::Collation::Collator.new
collator.sort(['m', 'ł', 'l'])
=> ["l", "ł", "m"]
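For contrast, here is a minimal sketch of what happens without a collator. It uses only core Ruby (no gem assumed) and shows that the built-in sort compares strings byte by byte, so ł ends up after every ASCII letter:

```ruby
# String#<=> compares the underlying bytes, so a non-ASCII letter sorts
# after every ASCII letter, whatever its position in the alphabet.
letters = ['m', 'ł', 'l']

# 'ł' is encoded in UTF-8 as the bytes 0xC5 0x82, which are both greater
# than the single byte of 'm' (0x6D).
byte_order = letters.sort
byte_order # => ["l", "m", "ł"]
```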
Sorting in Ruby using the Unicode collation algorithm
Does your use case allow for simply delegating the sorting to Postgres, rather than trying to recreate it in Ruby?
Part of the difficulty here is that there isn't a single correct sorting method, and differences in any of the variable elements can produce fairly large discrepancies in the final sort order (e.g. see the section on variable weighting).
For example, a gem like twitter-cldr-rb has a fairly robust implementation of the UCA and is backed by a comprehensive test suite, but it is tested against the non-ignorable test cases, which differs from the Postgres implementation (Postgres appears to use the shift-trimmed variant).
The sheer number of test cases means that you have no guarantee that one working solution will match the Postgres sort order in all cases. E.g. will it handle en/em dashes correctly, or even emojis? You could fork and modify the twitter-cldr-rb gem, but I suspect this won't be a small undertaking!
If you need to handle values that don't exist in the database, you can ask Postgres to sort them in a lightweight fashion by using a VALUES list:
sql = "SELECT * FROM (VALUES ('de luge'),('de Luge'),('de-luge'),('de-Luge'),('de-luge'),('de-Luge'),('death'),('deluge'),('deLuge'),('demark')) AS t(term) ORDER BY term ASC"
ActiveRecord::Base.connection.execute(sql).values.flatten
It will obviously result in a round-trip to Postgres, but should be very speedy nonetheless.
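If the terms come from a Ruby array rather than a hard-coded string, the query could be assembled along these lines. This is only a sketch: values_sort_sql and demo_quoter are hypothetical names, and the stand-in quoter exists just so the snippet runs without a database. In a Rails app you would pass the adapter's own quoting instead, e.g. ActiveRecord::Base.connection.method(:quote).

```ruby
# Builds the VALUES-list query for an arbitrary array of terms.
# `quoter` must turn a Ruby string into a safely quoted SQL literal; in a
# Rails app that would be ActiveRecord::Base.connection.method(:quote).
def values_sort_sql(terms, quoter)
  raise ArgumentError, 'terms must not be empty' if terms.empty?

  rows = terms.map { |term| "(#{quoter.call(term)})" }.join(',')
  "SELECT term FROM (VALUES #{rows}) AS t(term) ORDER BY term ASC"
end

# Stand-in quoter for this demo only (it doubles single quotes); use the
# database adapter's quoting in real code.
demo_quoter = ->(s) { "'#{s.gsub("'", "''")}'" }

sql = values_sort_sql(['deluge', 'de Luge', "d'eau"], demo_quoter)
```

The resulting string can then be handed to ActiveRecord::Base.connection.execute in the same way as the hard-coded query.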
Force strings to UTF-8 from any encoding
Ruby 1.9
"Forcing" an encoding is easy; however, it won't convert the characters, it just changes the encoding label:
str = str.force_encoding('UTF-8')
str.encoding.name # => 'UTF-8'
If you want to perform a conversion, use encode:
begin
  str.encode("UTF-8")
rescue Encoding::UndefinedConversionError
  # ...
end
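To make the difference between the two calls concrete, here is a small sketch using a made-up ISO-8859-1 sample string (the variable names are just for illustration). It shows that force_encoding leaves the bytes alone while encode rewrites them:

```ruby
# "café" in ISO-8859-1: the "é" is the single byte 0xE9.
latin1 = "caf\xE9".b.force_encoding('ISO-8859-1')

# force_encoding relabels the same bytes; a lone 0xE9 is not valid UTF-8.
relabeled = latin1.dup.force_encoding('UTF-8')
relabeled.bytes == latin1.bytes # => true
relabeled.valid_encoding?       # => false

# encode transcodes: 0xE9 becomes the two-byte UTF-8 sequence 0xC3 0xA9.
converted = latin1.encode('UTF-8')
converted.bytes.last(2)         # => [195, 169]
converted.valid_encoding?       # => true

# Characters with no counterpart in the target encoding raise
# Encoding::UndefinedConversionError unless a fallback is given.
ascii = converted.encode('US-ASCII', undef: :replace, replace: '?')
ascii                           # => "caf?"
```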
I would definitely read the following post for more information:
http://graysoftinc.com/character-encodings/ruby-19s-string
Rails collection ordering doesn't work as expected with UTF-8 strings
This is most likely a collation settings issue with PostgreSQL:
The collation feature allows specifying the sort order and character classification behavior of data per-column, or even per-operation. This alleviates the restriction that the LC_COLLATE and LC_CTYPE settings of a database cannot be changed after its creation.
You can try and fix the collation of a column with a Rails migration. Something like this:
class FixCollationForAbbr < ActiveRecord::Migration
  def up
    execute 'ALTER TABLE universities ALTER COLUMN abbr TYPE varchar COLLATE "ru_RU";'
  end
end
You should probably also add collation information to your database.yml:
defaults: &defaults
  adapter: postgresql
  encoding: utf8
  collation: ru_RU.utf8
  ctype: ru_RU.utf8
Here is how the database.yml settings affect database creation with PostgreSQL:
def create_database(name, options = {})
  options = { encoding: 'utf8' }.merge!(options.symbolize_keys)
  option_string = options.inject("") do |memo, (key, value)|
    memo += case key
    ...snip...
    when :encoding
      " ENCODING = '#{value}'"
    when :collation
      " LC_COLLATE = '#{value}'"
    when :ctype
      " LC_CTYPE = '#{value}'"
    ...snip...
    end
  end
  execute "CREATE DATABASE #{quote_table_name(name)}#{option_string}"
end
Sort array of strings with special characters
The approach I used when I ran into the same issue (depends on iconv gem):
require 'iconv'
def sort_alphabetical(words)
  # caching and api-wrapper
  transliterations = {}
  transliterate = lambda do |w|
    transliterations[w] ||= Iconv.iconv('ascii//ignore//translit', 'utf-8', w).to_s
  end
  words.sort do |w1, w2|
    transliterate.call(w1) <=> transliterate.call(w2)
  end
end
sorted = sort_alphabetical(...)
An alternative would be to use the sort_alphabetical gem.
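If you would rather avoid the iconv dependency, a rough approximation is possible with core Ruby alone. Note this swaps in a different technique from the one above: it decomposes each word with String#unicode_normalize and drops the combining marks instead of transliterating with Iconv. The name sort_ignoring_accents is made up for this sketch, and letters that have no decomposition (ł, for example) are left untouched, so it is not a full collation.

```ruby
# Sort key: decompose accented letters (NFD), drop the combining marks,
# and fold case. The original word is the tie-breaker so the result is
# deterministic. sort_by computes each key once, so no manual cache is needed.
def sort_ignoring_accents(words)
  words.sort_by do |word|
    [word.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').downcase, word]
  end
end

sorted = sort_ignoring_accents(%w[zebra Émile apple éclair banana])
sorted # => ["apple", "banana", "éclair", "Émile", "zebra"]
```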
Ruby, problems comparing strings with UTF-8 characters
This is an issue with Unicode equivalence.
The a version of your string consists of the character ư (U+01B0: LATIN SMALL LETTER U WITH HORN), followed by U+0303 COMBINING TILDE. This second character, as the name suggests, is a combining character, which when rendered is combined with the previous character to produce the final glyph.
The b version of the string uses the character ữ (U+1EEF: LATIN SMALL LETTER U WITH HORN AND TILDE), which is a single character. It is equivalent to the previous combination, but uses a different byte sequence to represent it.
In order to compare these strings you need to normalize them, so that they both use the same byte sequences for these types of characters. Current versions of Ruby have this built in (in earlier versions you needed to use a third party library).
So currently you have a == b, which is false, but if you do a.unicode_normalize == b.unicode_normalize you should get true.
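Here is a small self-contained sketch of that comparison, with the two strings written as escape sequences so the code points are unambiguous (the variable names follow the question):

```ruby
a = "\u01B0\u0303" # ư followed by COMBINING TILDE (two code points)
b = "\u1EEF"       # ữ as a single precomposed code point

a == b                                     # => false
a.length                                   # => 2
b.length                                   # => 1

# With no argument, unicode_normalize uses the composed form (:nfc).
a.unicode_normalize == b.unicode_normalize # => true

# The decomposed form (:nfd) works as well, as long as both sides use it.
a.unicode_normalize(:nfd) == b.unicode_normalize(:nfd) # => true
```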
If you are on an older version of Ruby, there are a couple of options. Rails has a normalize method as part of its multibyte support, so if you are using Rails you can do:
a.mb_chars.normalize == b.mb_chars.normalize
or perhaps something like:
ActiveSupport::Multibyte::Unicode.normalize(a) == ActiveSupport::Multibyte::Unicode.normalize(b)
If you’re not using Rails, then you could look at the unicode_utils gem, and do something like this:
UnicodeUtils.nfkc(a) == UnicodeUtils.nfkc(b)
(nfkc refers to the normalisation form; it is the default form for the Rails methods above, whereas String#unicode_normalize defaults to :nfc.)
There are various ways to normalise Unicode strings (e.g. whether you use the decomposed or composed versions), and these examples just use each method's default. I’ll leave researching the differences to you.
Sorting UTF-8 strings in Win32 program
The CompareStringEx function probably does what you need.
But note that this function (and the Windows API in general) does not use the UTF-8 encoding to represent Unicode strings. Instead, it uses the UTF-16 encoding (aka "wide character strings"). You might just be confusing the UTF-8 encoding with Unicode in general. But if you really are dealing with UTF-8 encoded strings, you can convert them to wide character strings with the MultiByteToWideChar function.
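To illustrate the difference between the two encodings (in Ruby, to stay consistent with the rest of this page; this is not the Win32 call itself), the sketch below transcodes one character from UTF-8 to little-endian UTF-16, which is the kind of conversion MultiByteToWideChar performs:

```ruby
utf8 = 'ł'                     # U+0142
utf8.bytes                     # => [197, 130] (0xC5 0x82 in UTF-8)

# The same code point as a little-endian UTF-16 code unit.
wide = utf8.encode('UTF-16LE')
wide.bytes                     # => [66, 1] (0x42 0x01, i.e. 0x0142)

# The conversion is lossless in both directions.
wide.encode('UTF-8') == utf8   # => true
```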