Removing Accents/Diacritics from String While Preserving Other Special Chars (Tried Mb_Chars.Normalize and Iconv)

Removing accents/diacritics from string while preserving other special chars (tried mb_chars.normalize and iconv)

it also removes spaces, dots, dashes, and who knows what else.

It shouldn't.

string.mb_chars.normalize(:kd).gsub(/[^x00-\x7F]/n, '').to_s

You've mistyped, there should be a backslash before the x00, to refer to the NUL character.

/[^\-x00-\x7F]/n # So it would leave the dash alone

You've put the ‘-’ between the ‘\’ and the ‘x’, which will break the reference to the null character, and thus break the range.

How do I replace accented Latin characters in Ruby?

Rails has already a builtin for normalizing, you just have to use this to normalize your string to form KD and then remove the other chars (i.e. accent marks) like this:

>> "àáâãäå".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s
=> "aaaaaa"

Accent Insensitive ordering in Sphinx

Sphinx handles sorting on string fields by storing all the values in a list, sorting the list and then storing the index of each string as an int attribute. According to the docs the sorting of this list is done at a byte level and currently isn't configurable.

Ideally the strings should be sorted differently, depending on the encoding and locale. For instance, if the strings are known to be Russian text in KOI8R encoding, sorting the bytes 0xE0, 0xE1, and 0xE2 should produce 0xE1, 0xE2 and 0xE0, because in KOI8R value 0xE0 encodes a character that is (noticeably) after characters encoded by 0xE1 and 0xE2. Unfortunately, Sphinx does not support that at the moment and will simply sort the strings bytewise.

-- from http://www.sphinxsearch.com/docs/current.html

So, no easy way to achieve this within Sphinx. A modification to your REPLACE() based idea would be to have a separate column and populate it using a callback in your model. This would let you handle the replace in Ruby instead of MySQL, an arguably more maintainable solution.

# save an unaccented copy of your title. Normalise method borrowed from
# http://stackoverflow.com/questions/522715/removing-accents-diacritics-from-string-while-preserving-other-special-chars-tri
class MyModel < ActiveRecord::Base
  before_validation :update_sort_col

  private

  def update_sort_col
    sort_col = self.title.to_s.mb_chars.normalize(:kd).gsub(/[^-x00-\x7F]/n, '').to_s
  end
end

Postgres accent insensitive LIKE search in Rails 3.1 on Heroku

Proper solution

Since PostgreSQL 9.1 you can just:

CREATE EXTENSION unaccent;

Provides a function unaccent(), doing what you need (except for lower(), just use that additionally if needed). Read the manual about this extension.

More about unaccent and indexes:

Does PostgreSQL support "accent insensitive" collations?

Poor man's solution

If you can't install unacccent, but are able to create a function. I compiled the list starting here and added to it over time. It is comprehensive, but hardly complete:

CREATE OR REPLACE FUNCTION lower_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE STRICT AS
$func$
SELECT lower(translate($1
     , '¹²³áàâãäåāăąÀÁÂÃÄÅĀĂĄÆćčç©ĆČÇĐÐèéêёëēĕėęěÈÊËЁĒĔĖĘĚ€ğĞıìíîïìĩīĭÌÍÎÏЇÌĨĪĬłŁńňñŃŇÑòóôõöōŏőøÒÓÔÕÖŌŎŐØŒř®ŘšşșßŠŞȘùúûüũūŭůÙÚÛÜŨŪŬŮýÿÝŸžżźŽŻŹ'
     , '123Removing Accents/Diacritics from String While Preserving Other Special Chars (Tried Mb_Chars.Normalize and Iconv)Removing Accents/Diacritics from String While Preserving Other Special Chars (Tried Mb_Chars.Normalize and Iconv)aaacccccccddRemoving accents/diacritics from string while preserving other special chars (tried mb_chars.normalize and iconv) How do I replace accented Latin characters in Ruby? AcRemoving accents/diacritics from string while preserving other special chars (tried mb_chars.normalize and iconv) How do I replace accented Latin characters in Ruby? Aceeeeggiiiiiiiiiiiiiiiiiillnnnnnnooooooooooooooooooorrrsssssssuuuuuuuuuuuuuuuuyyyyzzzzzz'
     ));
$func$;

Your query should work like that:

find(:all, :conditions => ["lower_unaccent(name) LIKE ?", "%#{search.downcase}%"])

For left-anchored searches, you can use an index on the function for very fast results:

CREATE INDEX tbl_name_lower_unaccent_idx
  ON fest (lower_unaccent(name) text_pattern_ops);

For queries like:

SELECT * FROM tbl WHERE (lower_unaccent(name)) LIKE 'bob%';

Or use COLLATE "C". See:

PostgreSQL LIKE query performance variations
Is there a difference between text_pattern_ops and COLLATE "C"?

Removing Accents/Diacritics from String While Preserving Other Special Chars (Tried Mb_Chars.Normalize and Iconv)