Normalizing Accented Characters in MySQL Queries

How to remove accents in MySQL?

If you set an appropriate collation for the column then the value within the field will compare equal to its unaccented equivalent naturally.

mysql> SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT 'é' = 'e';
+------------+
| 'é' = 'e' |
+------------+
| 1 |
+------------+
1 row in set (0.05 sec)

mysql replace accented characters

Well, as a French computer scientist (with an accent in his name), I would strongly advocate against such a conversion.

Why, well, because this strategy is just replacing a wrong assumption with another wrong assumption. Let me explain myself:

We could see the fact that only us-ascii are allowed in login id as an incorrect assumption that "all user names can be written in ascii".

Trying to remove accents from users name is just replacing the assumption by "all user names can be converted to ascii".

That is also incorrect: It will work for me (as a European - Latin 1 proof named person), but how will you compute the name of 三島 ? or the one of Сахаров ?

I know my answer is non technical, but you should do whatever is necessary to allow non ascii chars in logins. Or, you should allow your users to choose this login.

Accent insensitive search query in MySQL

You can change the collation at runtime in the sql query,

...where title like '%torun%' collate utf8_unicode_ci 

but beware that changing the collation on the fly at runtime forgoes the possibility of mysql using an index, so performance on large tables may be terrible.

Or, you can copy the column to another column, such as searchable_title, but change the collation on it. It's actually common to do this type of stuff, where you copy data but have it in some slightly different form that's optimized for some specific workload/purpose. You can use triggers as a nice way to keep the duplicated columns in sync. This method has the potential to perform well, if indexed.

Note - Make sure that your db really has those characters and not html entities.
Also, the character set of your connection matters. The above assumes it's set to utf8, for example, via set names like set names utf8

If not, you need an introducer for the literal value

...where title like _utf8'%torun%' collate utf8_unicode_ci 

and of course, the value in the single quotes must actually be utf8 encoded, even if the rest of the sql query isn't.

BigQuery: Convert accented characters to their plain ascii equivalents

Try below as quick and simple option for you:


#standardSQL
WITH lookups AS (
SELECT
'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,Ø,Å,Á,À,Â,Ä,È,É,Ê,Ë,Í,Î,Ï,Ì,Ò,Ó,Ô,Ö,Ú,Ù,Û,Ü,Ÿ,Ç,Æ,Œ,ñ' AS accents,
'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,O,A,A,A,A,A,E,E,E,E,I,I,I,I,O,O,O,O,U,U,U,U,Y,C,AE,OE,n' AS latins
),
pairs AS (
SELECT accent, latin FROM lookups,
UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1,
UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
WHERE p1 = p2
),
yourTableWithWords AS (
SELECT word FROM UNNEST(
SPLIT('brasília,ångström,aperçu,barège, beau idéal, belle époque, béguin, bête noire, bêtise, Bichon Frisé, blasé, blessèd, bobèche, boîte, bombé, Bön, Boötes, boutonnière, bric-à-brac, Brontë Beyoncé,El Niño')
) AS word
)
SELECT
word AS word_with_accent,
(SELECT STRING_AGG(IFNULL(latin, char), '')
FROM UNNEST(SPLIT(word, '')) char
LEFT JOIN pairs
ON char = accent) AS word_without_accent
FROM yourTableWithWords

Output is

word_with_accent word_without_accent     
blessèd blessed
El Niño El Nino
belle époque belle epoque
boîte boite
Boötes Bootes
blasé blase
ångström angstrom
bobèche bobeche
barège barege
bric-à-brac bric-a-brac
bête noire bete noire
Bichon Frisé Bichon Frise
Brontë Beyoncé Bronte Beyonce
bêtise betise
beau idéal beau ideal
bombé bombe
brasília brasilia
boutonnière boutonniere
aperçu apercu
béguin beguin
Bön Bon

UPDATE

Below is how to pack this logic into SQL UDF - so accent2latin(word) can be called to make a "magic"

#standardSQL
CREATE TEMP FUNCTION accent2latin(word STRING) AS
((
WITH lookups AS (
SELECT
'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,Ø,Å,Á,À,Â,Ä,È,É,Ê,Ë,Í,Î,Ï,Ì,Ò,Ó,Ô,Ö,Ú,Ù,Û,Ü,Ÿ,Ç,Æ,Œ,ñ' AS accents,
'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,O,A,A,A,A,A,E,E,E,E,I,I,I,I,O,O,O,O,U,U,U,U,Y,C,AE,OE,n' AS latins
),
pairs AS (
SELECT accent, latin FROM lookups,
UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1,
UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
WHERE p1 = p2
)
SELECT STRING_AGG(IFNULL(latin, char), '')
FROM UNNEST(SPLIT(word, '')) char
LEFT JOIN pairs
ON char = accent
));

WITH yourTableWithWords AS (
SELECT word FROM UNNEST(
SPLIT('brasília,ångström,aperçu,barège, beau idéal, belle époque, béguin, bête noire, bêtise, Bichon Frisé, blasé, blessèd, bobèche, boîte, bombé, Bön, Boötes, boutonnière, bric-à-brac, Brontë Beyoncé,El Niño')
) AS word
)
SELECT
word AS word_with_accent,
accent2latin(word) AS word_without_accent
FROM yourTableWithWords

MySql normalization of string

A brute force method would be to use strtr to do in-place replacements:

strtr($string, "ÀÁÂÃÄÅàáâãäåā ", "Normalizing Accented Characters in MySQL Queriesaaaaa_")

Normalizing a field using a single UPDATE in MySQL

You can chain the replace calls, so it can be done in a single query:

UPDATE person SET name = REPLACE(REPLACE(REPLACE(name, 'á', 'a'), 'é', 'e'), 'í', 'i')

But this quickly becomes an unmaintainable mess. If you're simply trying to replace accented characters with their unaccented equivalents, maybe a character set change would be of more use.



Related Topics



Leave a reply



Submit