Remove Accents/Diacritics in a String in JavaScript

Remove accents/diacritics in a string in JavaScript

With ES2015/ES6 String.prototype.normalize(),

const str = "Crème Brulée"
str.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
> "Creme Brulee"

Note: use NFKD if you want things like \uFB01() normalized (to fi).

Two things are happening here:

  1. normalize()ing to NFD Unicode normal form decomposes combined graphemes into the combination of simple ones. The è of Crème ends up expressed as e + ̀.
  2. Using a regex character class to match the U+0300 → U+036F range, it is now trivial to globally get rid of the diacritics, which the Unicode standard conveniently groups as the Combining Diacritical Marks Unicode block.

As of 2021, one can also use Unicode property escapes:

str.normalize("NFD").replace(/\p{Diacritic}/gu, "")

See comment for performance testing.

Alternatively, if you just want sorting

Intl.Collator has sufficient support ~95% right now, a polyfill is also available here but I haven't tested it.

const c = new Intl.Collator();
["creme brulee", "crème brulée", "crame brulai", "crome brouillé",
"creme brulay", "creme brulfé", "creme bruléa"].sort(c.compare)
["crame brulai", "creme brulay", "creme bruléa", "creme brulee",
"crème brulée", "creme brulfé", "crome brouillé"]

["creme brulee", "crème brulée", "crame brulai", "crome brouillé"].sort((a,b) => a>b)
["crame brulai", "creme brulee", "crome brouillé", "crème brulée"]

Remove accents/diacritics in a string in Angular Probleme with IE 11

You got the error in IE 11 because IE doesn't support normalize.

You need to use polyfill to support normalize in IE 11. You can use module unorm to provide support for normalize in IE 11.

  1. Run npm install unorm.
  2. Use import 'unorm'; at the file where you write the normalize function.

how to replace all accented characters with English equivalents

function Convert(string){
return string.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}
console.log(Convert("Ë À Ì Â Í Ã Î Ä Ï Ç Ò È Ó É Ô Ê Õ Ö ê Ù ë Ú î Û ï Ü ô Ý õ â "))

Output:

"E A I A I A I A I C O E O E O E O O e U e U i U i U o Y o a "

Remove accents/diacritics in a string in JavaScript

With ES2015/ES6 String.prototype.normalize(),

const str = "Crème Brulée"
str.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
> "Creme Brulee"

Note: use NFKD if you want things like \uFB01() normalized (to fi).

Two things are happening here:

  1. normalize()ing to NFD Unicode normal form decomposes combined graphemes into the combination of simple ones. The è of Crème ends up expressed as e + ̀.
  2. Using a regex character class to match the U+0300 → U+036F range, it is now trivial to globally get rid of the diacritics, which the Unicode standard conveniently groups as the Combining Diacritical Marks Unicode block.

As of 2021, one can also use Unicode property escapes:

str.normalize("NFD").replace(/\p{Diacritic}/gu, "")

See comment for performance testing.

Alternatively, if you just want sorting

Intl.Collator has sufficient support ~95% right now, a polyfill is also available here but I haven't tested it.

const c = new Intl.Collator();
["creme brulee", "crème brulée", "crame brulai", "crome brouillé",
"creme brulay", "creme brulfé", "creme bruléa"].sort(c.compare)
["crame brulai", "creme brulay", "creme bruléa", "creme brulee",
"crème brulée", "creme brulfé", "crome brouillé"]

["creme brulee", "crème brulée", "crame brulai", "crome brouillé"].sort((a,b) => a>b)
["crame brulai", "creme brulee", "crome brouillé", "crème brulée"]

Efficiently replace all accented characters in a string?

I can't speak to what you are trying to do specifically with the function itself, but if you don't like the regex being built every time, here are two solutions and some caveats about each.

Here is one way to do this:

function makeSortString(s) {
if(!makeSortString.translate_re) makeSortString.translate_re = /[öäüÖÄÜ]/g;
var translate = {
"ä": "a", "ö": "o", "ü": "u",
"Ä": "A", "Ö": "O", "Ü": "U" // probably more to come
};
return ( s.replace(makeSortString.translate_re, function(match) {
return translate[match];
}) );
}

This will obviously make the regex a property of the function itself. The only thing you may not like about this (or you may, I guess it depends) is that the regex can now be modified outside of the function's body. So, someone could do this to modify the interally-used regex:

makeSortString.translate_re = /[a-z]/g;

So, there is that option.

One way to get a closure, and thus prevent someone from modifying the regex, would be to define this as an anonymous function assignment like this:

var makeSortString = (function() {
var translate_re = /[öäüÖÄÜ]/g;
return function(s) {
var translate = {
"ä": "a", "ö": "o", "ü": "u",
"Ä": "A", "Ö": "O", "Ü": "U" // probably more to come
};
return ( s.replace(translate_re, function(match) {
return translate[match];
}) );
}
})();

Hopefully this is useful to you.


UPDATE: It's early and I don't know why I didn't see the obvious before, but it might also be useful to put you translate object in a closure as well:

var makeSortString = (function() {
var translate_re = /[öäüÖÄÜ]/g;
var translate = {
"ä": "a", "ö": "o", "ü": "u",
"Ä": "A", "Ö": "O", "Ü": "U" // probably more to come
};
return function(s) {
return ( s.replace(translate_re, function(match) {
return translate[match];
}) );
}
})();

BigQuery UDF to remove accents/diacritics in a string

Consider below approach

select originalText, 
regexp_replace(normalize(originalText, NFD), r"\pM", '') output

if applied to sample data in your question - output is

Sample Image

You can easily wrap it with SQL UDF if you wish

Replacing diacritics in Javascript

If you want to do it entirely on the client side, I think your only option is with some kind of lookup table. Here's a starting point, written by a chap called Olavi Ivask on his blog...

function replaceDiacritics(s)
{
var s;

var diacritics =[
/[\300-\306]/g, /[\340-\346]/g, // A, a
/[\310-\313]/g, /[\350-\353]/g, // E, e
/[\314-\317]/g, /[\354-\357]/g, // I, i
/[\322-\330]/g, /[\362-\370]/g, // O, o
/[\331-\334]/g, /[\371-\374]/g, // U, u
/[\321]/g, /[\361]/g, // N, n
/[\307]/g, /[\347]/g, // C, c
];

var chars = ['A','a','E','e','I','i','O','o','U','u','N','n','C','c'];

for (var i = 0; i < diacritics.length; i++)
{
s = s.replace(diacritics[i],chars[i]);
}

document.write(s);
}

You can see this is simply an array of regexes for known diacritic chars, mapping them back onto a "plain" character.

How do I remove diacritics (accents) from a string in .NET?

I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka
On the meaning of meaningless, aka All
Mn characters are non-spacing, but
some are more non-spacing than
others)

static string RemoveDiacritics(string text) 
{
var normalizedString = text.Normalize(NormalizationForm.FormD);
var stringBuilder = new StringBuilder(capacity: normalizedString.Length);

for (int i = 0; i < normalizedString.Length; i++)
{
char c = normalizedString[i];
var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
if (unicodeCategory != UnicodeCategory.NonSpacingMark)
{
stringBuilder.Append(c);
}
}

return stringBuilder
.ToString()
.Normalize(NormalizationForm.FormC);
}

Note that this is a followup to his earlier post: Stripping diacritics....

The approach uses String.Normalize to split the input string into constituent glyphs (basically separating the "base" characters from the diacritics) and then scans the result and retains only the base characters. It's just a little complicated, but really you're looking at a complicated problem.

Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.



Related Topics



Leave a reply



Submit