Does PostgreSQL support accent insensitive collations?
Use the unaccent module for that - which is completely different from what you are linking to.
unaccent is a text search dictionary that removes accents (diacritic
signs) from lexemes.
Install once per database with:
CREATE EXTENSION unaccent;
If you get an error like:
ERROR: could not open extension control file
"/usr/share/postgresql/<version>/extension/unaccent.control": No such file or directory
Install the contrib package on your database server as instructed in this related answer:
- Error when creating unaccent extension on PostgreSQL
Among other things, it provides the function unaccent() you can use with your example (where LIKE seems not needed).
SELECT *
FROM users
WHERE unaccent(name) = unaccent('João');
Index
To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE
functions for indexes. If a function can return a different result for the same input, the index could silently break.
unaccent() only STABLE, not IMMUTABLE
Unfortunately, unaccent() is only STABLE, not IMMUTABLE. According to this thread on pgsql-bugs, this is due to three reasons:
- It depends on the behavior of a dictionary.
- There is no hard-wired connection to this dictionary.
- It therefore also depends on the current search_path, which can change easily.
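To see how a given installation declares the function, the volatility can be read from the system catalog. A minimal sketch, assuming the extension is already installed in the current database:

```sql
-- provolatile: 'i' = IMMUTABLE, 's' = STABLE, 'v' = VOLATILE
SELECT p.oid::regprocedure AS function_signature
     , p.provolatile
FROM   pg_proc p
WHERE  p.proname = 'unaccent';
```

Both the one-parameter and the two-parameter form should show up in the result.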
Some tutorials on the web instruct you to just alter the function volatility to IMMUTABLE. This brute-force method can break under certain conditions.
Others suggest a simple IMMUTABLE wrapper function (like I did myself in the past).
There is an ongoing debate whether to make the variant with two parameters, which declares the used dictionary explicitly, IMMUTABLE. Read here or here.
Another alternative would be this module with an IMMUTABLE unaccent()
function by Musicbrainz, provided on Github. Haven't tested it myself. I think I have come up with a better idea:
Best for now
This approach is more efficient than other solutions floating around, and safer.
Create an IMMUTABLE SQL wrapper function executing the two-parameter form with hard-wired, schema-qualified function and dictionary.
Since nesting a non-immutable function would disable function inlining, base it on a copy of the C function, (fake) declared IMMUTABLE as well. Its only purpose is to be used in the SQL function wrapper; it is not meant to be used on its own.
The sophistication is needed as there is no way to hard-wire the dictionary in the declaration of the C function. (Would require to hack the C code itself.) The SQL wrapper function does that and allows both function inlining and expression indexes.
CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)
RETURNS text LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS
'$libdir/unaccent', 'unaccent_dict';
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
RETURNS text LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT public.immutable_unaccent(regdictionary 'public.unaccent', $1)
$func$;
Drop PARALLEL SAFE from both functions for Postgres 9.5 or older.
public being the schema where you installed the extension (public is the default).
The explicit type declaration (regdictionary) defends against hypothetical attacks with overloaded variants of the function by malicious users.
Previously, I advocated a wrapper function based on the STABLE function unaccent() shipped with the unaccent module. That disabled function inlining. This version executes ten times faster than the simple wrapper function I had here earlier.
And that was already twice as fast as the first version, which added SET search_path = public, pg_temp to the function - until I discovered that the dictionary can be schema-qualified, too. This is still (Postgres 12) not too obvious from the documentation.
If you lack the necessary privileges to create C functions, you are back to the second best implementation: an IMMUTABLE function wrapper around the STABLE unaccent() function provided by the module:
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
RETURNS text AS
$func$
SELECT public.unaccent('public.unaccent', $1) -- schema-qualify function and dictionary
$func$ LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT;
Finally, the expression index to make queries fast:
CREATE INDEX users_unaccent_name_idx ON users(public.f_unaccent(name));
Remember to recreate indexes involving this function after any change to function or dictionary, like an in-place major release upgrade that would not recreate indexes. Recent major releases all had updates for the unaccent module.
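After such an upgrade, rebuilding the index could look like the following. This is only a sketch using the index name from above. A plain REINDEX blocks writes to the table while it runs; Postgres 12 or later also offers REINDEX INDEX CONCURRENTLY.

```sql
-- Rebuild the expression index so it reflects the current unaccent rules
REINDEX INDEX users_unaccent_name_idx;
```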
Adapt queries to match the index (so the query planner will use it):
SELECT * FROM users
WHERE f_unaccent(name) = f_unaccent('João');
You don't need the function in the right expression. There you can also supply unaccented strings like 'Joao' directly.
The faster function does not translate to much faster queries using the expression index. That operates on pre-computed values and is very fast already. But index maintenance and queries not using the index benefit.
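To confirm that the planner actually picks the expression index, the query plan can be inspected. A small sketch, assuming the users table and the index from above; on very small tables the planner may still prefer a sequential scan:

```sql
-- Look for an Index Scan or Bitmap Index Scan on users_unaccent_name_idx in the output
EXPLAIN
SELECT * FROM users
WHERE  public.f_unaccent(name) = 'Joao';
```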
Security for client programs has been tightened with Postgres 10.3 / 9.6.8 etc. You need to schema-qualify function and dictionary name as demonstrated when used in any indexes. See:
- 'text search dictionary “unaccent” does not exist' entries in postgres log, supposedly during automatic analyze
Ligatures
In Postgres 9.5 or older, ligatures like 'Œ' or 'ß' have to be expanded manually (if you need that), since unaccent() always substitutes a single letter:
SELECT unaccent('Œ Æ œ æ ß');
unaccent
----------
E A e a S
You will love this update to unaccent in Postgres 9.6:
Extend contrib/unaccent's standard unaccent.rules file to handle all diacritics known to Unicode, and expand ligatures correctly (Thomas Munro, Léonard Benedetti)
Bold emphasis mine. Now we get:
SELECT unaccent('Œ Æ œ æ ß');
unaccent
----------
OE AE oe ae ss
Pattern matching
For LIKE or ILIKE with arbitrary patterns, combine this with the module pg_trgm in PostgreSQL 9.1 or later. Create a trigram GIN (typically preferable) or GiST expression index. Example for GIN:
CREATE INDEX users_unaccent_name_trgm_idx ON users
USING gin (f_unaccent(name) gin_trgm_ops);
Can be used for queries like:
SELECT * FROM users
WHERE f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');
GIN and GIST indexes are more expensive to maintain than plain btree:
- Difference between GiST and GIN index
There are simpler solutions for just left-anchored patterns. More about pattern matching and performance:
- Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL
pg_trgm also provides useful operators for "similarity" (%) and "distance" (<->).
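As a rough illustration of these operators, here is a sketch that assumes pg_trgm is installed and f_unaccent() is defined as above. The GIN index shown earlier can support the % filter, while index support for ordering by <-> would need a GiST index instead:

```sql
SELECT name
     , similarity(f_unaccent(name), 'Joao') AS sml
FROM   users
WHERE  f_unaccent(name) % 'Joao'        -- true if similarity exceeds the current threshold
ORDER  BY f_unaccent(name) <-> 'Joao'   -- distance = 1 - similarity, closest matches first
LIMIT  10;
```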
Trigram indexes also support simple regular expressions with ~ et al. and case insensitive pattern matching with ILIKE:
- PostgreSQL accent + case insensitive search
PostgreSQL accent + case insensitive search
If you need to "combine with case insensitive", there are a number of options, depending on your exact requirements.
Maybe simplest, make the expression index case insensitive.
Building on the function f_unaccent() laid out in the referenced answer:
- Does PostgreSQL support "accent insensitive" collations?
CREATE INDEX users_lower_unaccent_name_idx ON users(lower(f_unaccent(name)));
Then:
SELECT *
FROM users
WHERE lower(f_unaccent(name)) = lower(f_unaccent('João'));
Or you could build the lower() into the function f_unaccent(), to derive something like f_lower_unaccent().
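One possible shape for such a combined function is sketched below. It assumes public.f_unaccent() exists as defined in the referenced answer; the index name is made up for illustration:

```sql
CREATE OR REPLACE FUNCTION public.f_lower_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT AS
$func$
SELECT lower(public.f_unaccent($1))   -- strip accents first, then fold case
$func$;

-- hypothetical index name
CREATE INDEX users_f_lower_unaccent_name_idx ON users (public.f_lower_unaccent(name));
```

Queries then have to use the same expression, f_lower_unaccent(name), to match the index.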
Or (especially if you need to do fuzzy pattern matching anyway) you can use a trigram index provided by the additional module pg_trgm, building on the above function, which also supports ILIKE. Details:
- LOWER LIKE vs iLIKE
I added a note to the referenced answer.
Or you could use the additional module citext (but I would rather avoid it):
- Deferrable, case-insensitive unique constraint
PostgreSQL case-insensitive and accent-insensitive search
Creating case and accent insensitive ICU collations is pretty simple:
CREATE COLLATION english_ci_ai (
PROVIDER = icu,
DETERMINISTIC = FALSE,
LOCALE = "en-US-u-ks-level1"
);
Or, equivalently (that syntax also works with old ICU versions):
CREATE COLLATION english_ci_ai (
PROVIDER = icu,
DETERMINISTIC = FALSE,
LOCALE = "en-US@colStrength=primary"
);
See the ICU documentation for details and my article for a detailed discussion.
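A quick way to try such a collation is a direct comparison. This is a sketch that assumes Postgres 12 or later built with ICU support and the english_ci_ai collation created as above; the table name is hypothetical:

```sql
-- Expected to compare as equal, since only primary (base letter) differences count
SELECT text 'João' = text 'joao' COLLATE english_ci_ai AS is_equal;

-- hypothetical table: the collation can also be attached to a column
CREATE TABLE people (name text COLLATE english_ci_ai);
```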
But your problem is that you want substring search. So you should create a trigram index:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS unaccent;
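-- unaccent() is only STABLE, so it cannot be used in an index expression directly;
-- use an IMMUTABLE wrapper such as f_unaccent() shown earlier
CREATE INDEX ON tab USING gin (public.f_unaccent(doc) gin_trgm_ops);
Then you can search like this:
SELECT * FROM tab
WHERE public.f_unaccent(doc) ILIKE public.f_unaccent('%joh%');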
Note that you have to force a minimal length of 4 or so on the search string if you want that to be efficient.
Postgres accent insensitive LIKE search in Rails 3.1 on Heroku
Proper solution
Since PostgreSQL 9.1 you can just:
CREATE EXTENSION unaccent;
This provides a function unaccent(), doing what you need (except for lower(), just use that additionally if needed). Read the manual about this extension.
More about unaccent and indexes:
- Does PostgreSQL support "accent insensitive" collations?
Poor man's solution
If you can't install unaccent, but are able to create a function, use this. I compiled the list starting here and added to it over time. It is comprehensive, but hardly complete:
CREATE OR REPLACE FUNCTION lower_unaccent(text)
RETURNS text
LANGUAGE sql IMMUTABLE STRICT AS
$func$
SELECT lower(translate($1
, '¹²³áàâãäåāăąÀÁÂÃÄÅĀĂĄÆćčç©ĆČÇĐÐèéêёëēĕėęěÈÊËЁĒĔĖĘĚ€ğĞıìíîïìĩīĭÌÍÎÏЇÌĨĪĬłŁńňñŃŇÑòóôõöōŏőøÒÓÔÕÖŌŎŐØŒř®ŘšşșߊŞȘùúûüũūŭůÙÚÛÜŨŪŬŮýÿÝŸžżźŽŻŹ'
, '123aaaaaaaaaaaaaaaaaaacccccccddeeeeeeeeeeeeeeeeeeeeggiiiiiiiiiiiiiiiiiillnnnnnnooooooooooooooooooorrrsssssssuuuuuuuuuuuuuuuuyyyyzzzzzz'
));
$func$;
Your query should work like that:
find(:all, :conditions => ["lower_unaccent(name) LIKE ?", "%#{search.downcase}%"])
For left-anchored searches, you can use an index on the function for very fast results:
CREATE INDEX tbl_name_lower_unaccent_idx
ON tbl (lower_unaccent(name) text_pattern_ops);
For queries like:
SELECT * FROM tbl WHERE (lower_unaccent(name)) LIKE 'bob%';
Or use COLLATE "C". See:
- PostgreSQL LIKE query performance variations
- Is there a difference between text_pattern_ops and COLLATE "C"?
Yii2: accent insensitive filter
I resolved it using the lower_unaccent PostgreSQL function and the ILIKE PostgreSQL operator:
$query->andWhere(new Expression(
'lower_unaccent(name) ILIKE \'%\' || lower_unaccent(:name) || \'%\'',
[':name' => $this->name] // bind the value instead of concatenating it into the SQL
));
In Django with Postgresql 9.6 how to sort case and accent insensitive?
This isn't related to Django itself; PostgreSQL's lc_collate setting determines the sort order. I'd suggest reviewing its value:
SHOW lc_collate;
The right thing to do is to fix this configuration. Don't forget to take a look at related settings too (lc_ctype, etc.).
But if you cannot create another database with the right setting, try an explicit COLLATE in ORDER BY, as in the following test case:
CREATE TEMPORARY TABLE table1 (column1 TEXT);
INSERT INTO table1 VALUES('Barn'),
('beef'),
('bémol'),
('Bœuf'),
('boulette'),
('Bubble');
SELECT * FROM table1 ORDER BY column1 COLLATE "en_US"; --Gives the expected order
SELECT * FROM table1 ORDER BY column1 COLLATE "C"; --Gives "wrong" order (in your case)
It's important to remember that PostgreSQL relies on operating system locales. This test case was executed on CentOS 7. More info here and here.
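Since the available collations depend on the operating system, it can help to check which ones the server actually knows before picking one. A minimal sketch; the 'en%' filter is just an example:

```sql
-- List collations known to this database whose name starts with "en"
SELECT collname, collcollate
FROM   pg_collation
WHERE  collname LIKE 'en%'
ORDER  BY collname;
```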