Removing accents/diacritics from string while preserving other special chars (tried mb_chars.normalize and iconv)
The following call also removes spaces, dots, dashes, and who knows what else, which it shouldn't:
string.mb_chars.normalize(:kd).gsub(/[^x00-\x7F]/n, '').to_s
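You've mistyped: there should be a backslash before the x00 so that \x00 refers to the NUL character. Without it, the class only protects "x", "0", and the range from "0" to \x7F, so spaces, dots, and dashes (all of which sort below "0") get stripped as well.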
/[^\-x00-\x7F]/n # So it would leave the dash alone
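You've put the ‘-’ between the ‘\’ and the ‘x’, which breaks the reference to the NUL character and thus breaks the range. The dash needs no special handling: it is ASCII, so the corrected range \x00-\x7F already leaves it alone.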
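To see what the missing backslash does, here is a minimal plain-Ruby sketch. It assumes Ruby's built-in String#unicode_normalize(:nfkd) as a stand-in for Rails' mb_chars.normalize(:kd) and omits the /n flag; the sample string is made up for illustration.

```ruby
# Hypothetical sample input, decomposed so each accent becomes a separate combining mark.
sample = "Crème brûlée, s'il vous plaît - 2.50".unicode_normalize(:nfkd)

# Mistyped class: it only protects "x", "0", and the range "0".."\x7F",
# so everything below "0" (space, comma, apostrophe, dash, dot) is removed too.
broken = sample.gsub(/[^x00-\x7F]/, '')

# Corrected class: it protects the whole ASCII range, so only non-ASCII characters go.
fixed = sample.gsub(/[^\x00-\x7F]/, '')

puts broken  # Cremebruleesilvousplait250
puts fixed   # Creme brulee, s'il vous plait - 2.50
```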
How do I replace accented Latin characters in Ruby?
Rails already has a built-in for normalizing. You just have to normalize your string to form KD and then remove the remaining non-ASCII characters (i.e. the accent marks), like this:
>> "àáâãäå".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s
=> "aaaaaa"
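One limit of this approach is that characters with no decomposition (such as ø) are deleted outright by the ASCII-only filter. A possible alternative, sketched here in plain Ruby, is to strip only the combining marks (Unicode category Mn) after decomposition. The sketch assumes String#unicode_normalize in place of mb_chars.normalize, and both helper names are hypothetical.

```ruby
# Hypothetical helper: the ASCII-only filter from above, in plain Ruby.
def strip_non_ascii(str)
  str.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]/, '')
end

# Hypothetical helper: remove only combining marks and keep everything else.
def strip_marks(str)
  str.unicode_normalize(:nfkd).gsub(/\p{Mn}/, '')
end

name = "Søren Ångström"
puts strip_non_ascii(name)  # "ø" has no decomposition, so it is dropped
puts strip_marks(name)      # "ø" survives; "Å" and "ö" lose their marks
```

Which variant fits depends on whether the result must be pure ASCII or merely accent-free.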
Accent Insensitive ordering in Sphinx
Sphinx handles sorting on string fields by storing all the values in a list, sorting the list and then storing the index of each string as an int attribute. According to the docs the sorting of this list is done at a byte level and currently isn't configurable.
Ideally the strings should be sorted differently, depending on the encoding and locale. For instance, if the strings are known to be Russian text in KOI8R encoding, sorting the bytes 0xE0, 0xE1, and 0xE2 should produce 0xE1, 0xE2 and 0xE0, because in KOI8R value 0xE0 encodes a character that is (noticeably) after characters encoded by 0xE1 and 0xE2. Unfortunately, Sphinx does not support that at the moment and will simply sort the strings bytewise.
-- from http://www.sphinxsearch.com/docs/current.html
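Ruby's own String comparison is bytewise as well, so a small sketch can illustrate what such a sort does to accented titles (the sample titles are made up):

```ruby
titles = ["Zebra", "Éclair", "Apple"]

# String#<=> compares bytes. "É" is encoded in UTF-8 as 0xC3 0x89,
# which is greater than any ASCII byte, so it sorts after "Zebra".
bytewise = titles.sort
p bytewise
```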
So there is no easy way to achieve this within Sphinx. A modification of your REPLACE()-based idea would be to add a separate column and populate it using a callback in your model. This lets you handle the replacement in Ruby instead of MySQL, which is arguably more maintainable.
# save an unaccented copy of your title. Normalise method borrowed from
# http://stackoverflow.com/questions/522715/removing-accents-diacritics-from-string-while-preserving-other-special-chars-tri
class MyModel < ActiveRecord::Base
  before_validation :update_sort_col

  private

  # Assign through self. so the attribute is set rather than a local variable.
  def update_sort_col
    self.sort_col = title.to_s.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n, '').to_s
  end
end
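A detail worth keeping in mind with callbacks like this: inside an instance method, a bare `sort_col = ...` creates a local variable, so the setter has to be called with an explicit `self.` receiver. The sketch below shows this in plain Ruby, with a hypothetical Struct standing in for the ActiveRecord model and String#unicode_normalize standing in for mb_chars.normalize.

```ruby
# Hypothetical stand-in for the model: just a title and a sort column.
Item = Struct.new(:title, :sort_col) do
  def update_sort_col_wrong
    # Assigns a local variable; the attribute itself stays nil.
    sort_col = title.to_s.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]/, '')
    sort_col
  end

  def update_sort_col
    # The explicit receiver calls the attribute setter.
    self.sort_col = title.to_s.unicode_normalize(:nfkd).gsub(/[^\x00-\x7F]/, '')
  end
end

item = Item.new("Éclair")
item.update_sort_col_wrong
p item.sort_col  # still nil

items = ["Zebra", "Éclair", "Apple"].map { |t| Item.new(t).tap(&:update_sort_col) }
sorted_titles = items.sort_by(&:sort_col).map(&:title)
p sorted_titles
```

Sorting on the unaccented copy places "Éclair" between "Apple" and "Zebra", which is the ordering the extra column is meant to give Sphinx.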
Postgres accent insensitive LIKE search in Rails 3.1 on Heroku
Proper solution
Since PostgreSQL 9.1 you can just:
CREATE EXTENSION unaccent;
This provides a function unaccent(), doing what you need (except for lower(), just use that additionally if needed). Read the manual about this extension.
More about unaccent and indexes:
- Does PostgreSQL support "accent insensitive" collations?
Poor man's solution
If you can't install unaccent but are able to create a function, use the one below. I compiled the list starting here and added to it over time. It is comprehensive, but hardly complete:
CREATE OR REPLACE FUNCTION lower_unaccent(text)
RETURNS text
LANGUAGE sql IMMUTABLE STRICT AS
$func$
SELECT lower(translate($1
, '¹²³áàâãäåāăąÀÁÂÃÄÅĀĂĄÆćčç©ĆČÇĐÐèéêёëēĕėęěÈÊËЁĒĔĖĘĚ€ğĞıìíîïìĩīĭÌÍÎÏЇÌĨĪĬłŁńňñŃŇÑòóôõöōŏőøÒÓÔÕÖŌŎŐØŒř®ŘšşșߊŞȘùúûüũūŭůÙÚÛÜŨŪŬŮýÿÝŸžżźŽŻŹ'
, '123aaaaaaaaaaaaaaaaaaacccccccddeeeeeeeeeeeeeeeeeeeeggiiiiiiiiiiiiiiiiiillnnnnnnooooooooooooooooooorrrsssssssuuuuuuuuuuuuuuuuyyyyzzzzzz'
));
$func$;
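The mapping is purely positional, which is easy to get wrong when extending the list. As a rough illustration in Ruby, String#tr behaves much like translate(); the sketch below uses a small made-up subset of the mapping, and the Ruby method name merely mirrors the SQL function.

```ruby
# Illustrative subset of the mapping (not the full list from the SQL function).
FROM = "áàâäÁÀÂÄçÇéèêëÉÈÊËñÑöÖüÜ"
TO   = "aaaaaaaacceeeeeeeennoouu"

# The n-th character of FROM is replaced by the n-th character of TO,
# so both strings must contain the same number of characters.
raise ArgumentError, "mapping strings differ in length" unless FROM.length == TO.length

# Hypothetical Ruby counterpart of lower_unaccent(text).
def lower_unaccent(str)
  str.tr(FROM, TO).downcase
end

puts lower_unaccent("Señor MÜLLER, Café")  # senor muller, cafe
```

One difference to be aware of: when the target string is shorter, PostgreSQL's translate() deletes the unmatched characters, whereas String#tr pads the target with its last character, hence the explicit length check.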
Your query should then work like this:
find(:all, :conditions => ["lower_unaccent(name) LIKE ?", "%#{search.downcase}%"])
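Note that this only downcases the search term, so a term typed with accents ("José") would not match the unaccented column value. A hedged sketch of a helper that folds the term on the Ruby side before building the pattern follows. The helper name is hypothetical, the folding only approximates the SQL mapping (it uses String#unicode_normalize and strips combining marks), and it also escapes LIKE wildcards, assuming PostgreSQL's default backslash escape character.

```ruby
# Hypothetical helper: build a LIKE pattern from user input.
def like_pattern(term)
  # Decompose, drop combining marks, and lowercase.
  folded = term.unicode_normalize(:nfkd).gsub(/\p{Mn}/, '').downcase
  # Escape backslash, % and _ so they are matched literally.
  escaped = folded.gsub(/[\\%_]/) { |m| "\\#{m}" }
  "%#{escaped}%"
end

puts like_pattern("José")     # %jose%
puts like_pattern("50%_Off")  # %50\%\_off%
```

The result would be passed as the bind value in place of "%#{search.downcase}%".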
For left-anchored searches, you can use an index on the function for very fast results:
CREATE INDEX tbl_name_lower_unaccent_idx
ON tbl (lower_unaccent(name) text_pattern_ops);
For queries like:
SELECT * FROM tbl WHERE (lower_unaccent(name)) LIKE 'bob%';
Or use COLLATE "C". See:
- PostgreSQL LIKE query performance variations
- Is there a difference between text_pattern_ops and COLLATE "C"?