How to Remove Accents from a Unicode String

What is the best way to remove accents (normalize) in a Python unicode string?

How about this:

import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

This works on Greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
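For comparison, here is a minimal sketch of the unicodedata.combining variant mentioned above. The function name strip_accents_combining is made up for illustration. unicodedata.combining returns the canonical combining class of a character (0 for non-combining characters), so this filter is close to the "Mn" check but does not necessarily select exactly the same set of characters.

```python
import unicodedata

def strip_accents_combining(s):
    # Decompose first so that each accent becomes a separate code point.
    decomposed = unicodedata.normalize('NFD', s)
    # Keep only characters whose canonical combining class is 0.
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents_combining("A \u00c0 \u0394 \u038e"))  # A A Δ Υ
```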

Keep in mind that these manipulations may significantly alter the meaning of the text. Accents, umlauts, etc. are not "decoration".

Easy way to remove accents from a Unicode string?

Finally, I've solved it by using the Normalizer class.

import java.text.Normalizer;

public static String stripAccents(String s)
{
    s = Normalizer.normalize(s, Normalizer.Form.NFD);
    s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    return s;
}

Remove accents/diacritics in a string in JavaScript

With ES2015/ES6 String.prototype.normalize(),

const str = "Crème Brulée"
str.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
> "Creme Brulee"

Note: use NFKD if you want things like \uFB01 (ﬁ) normalized (to fi).

Two things are happening here:

  1. normalize()ing to NFD Unicode normal form decomposes combined graphemes into the combination of simple ones. The è of Crème ends up expressed as e + ̀.
  2. Using a regex character class to match the U+0300 → U+036F range, it is now trivial to globally get rid of the diacritics, which the Unicode standard conveniently groups as the Combining Diacritical Marks Unicode block.
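The same two steps can be sketched in Python, assuming the standard unicodedata and re modules. The names strip_diacritics and COMBINING_DIACRITICAL_MARKS are made up for this illustration, and the optional form parameter is only there to show the NFD versus NFKD difference from the note above.

```python
import re
import unicodedata

# The Combining Diacritical Marks block, U+0300 through U+036F.
COMBINING_DIACRITICAL_MARKS = re.compile('[\u0300-\u036f]')

def strip_diacritics(text, form='NFD'):
    # Step 1: decompose, so "è" becomes "e" followed by U+0300.
    decomposed = unicodedata.normalize(form, text)
    # Step 2: delete every code point in the combining marks block.
    return COMBINING_DIACRITICAL_MARKS.sub('', decomposed)

print(strip_diacritics("Crème Brulée"))          # Creme Brulee
print(strip_diacritics("\ufb01", form='NFKD'))   # fi
```

Note that this only covers marks inside that one block; combining marks from other Unicode blocks would pass through untouched.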

As of 2021, one can also use Unicode property escapes:

str.normalize("NFD").replace(/\p{Diacritic}/gu, "")

See comment for performance testing.

Alternatively, if you just want sorting

Intl.Collator has sufficient browser support (~95% right now). A polyfill is also available, but I haven't tested it.

const c = new Intl.Collator();
["creme brulee", "crème brulée", "crame brulai", "crome brouillé",
"creme brulay", "creme brulfé", "creme bruléa"].sort(c.compare)
["crame brulai", "creme brulay", "creme bruléa", "creme brulee",
"crème brulée", "creme brulfé", "crome brouillé"]

["creme brulee", "crème brulée", "crame brulai", "crome brouillé"].sort((a, b) => (a < b ? -1 : a > b ? 1 : 0))
["crame brulai", "creme brulee", "crome brouillé", "crème brulée"]
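If you want a similar accent-insensitive ordering in Python without a real collator, one rough approach is to sort on the accent-stripped form and use the original string as a tie-breaker. This is only a sketch, not a substitute for locale-aware collation, and the helper names strip_accents and accent_insensitive_key are made up for the example.

```python
import unicodedata

def strip_accents(s):
    # Decompose, then drop combining marks.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(c))

def accent_insensitive_key(s):
    # Primary key: the unaccented spelling. Secondary key: the original
    # string, so that accented and unaccented variants stay adjacent.
    return (strip_accents(s), s)

words = ["creme brulee", "crème brulée", "crame brulai", "crome brouillé"]
print(sorted(words, key=accent_insensitive_key))
# ['crame brulai', 'creme brulee', 'crème brulée', 'crome brouillé']
```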

How to remove accent in Python 3.5 and get a string with unicodedata or other solutions?

With the third-party package unidecode:

>>> import unidecode
>>> unidecode.unidecode("32 rue d'Athènes Paris France")
"32 rue d'Athenes Paris France"

How to remove accents from Unicode characters?

You need a library like UnidecodeSharpFork to do that.

How do I remove diacritics (accents) from a string in .NET?

I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)

using System.Globalization;
using System.Text;

static string RemoveDiacritics(string text)
{
    // Decompose so that each accent becomes a separate combining character.
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder(capacity: normalizedString.Length);

    for (int i = 0; i < normalizedString.Length; i++)
    {
        char c = normalizedString[i];
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        // Keep everything except non-spacing marks (category Mn).
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    // Recompose whatever is left.
    return stringBuilder
        .ToString()
        .Normalize(NormalizationForm.FormC);
}

Note that this is a followup to his earlier post: Stripping diacritics....

The approach uses String.Normalize to split the input string into constituent glyphs (basically separating the "base" characters from the diacritics) and then scans the result and retains only the base characters. It's just a little complicated, but really you're looking at a complicated problem.
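The decompose, filter, recompose sequence described above can be sketched in Python as follows. This is an illustration under the assumption that unicodedata behaves like the .NET calls here; it is not a port of Kaplan's code, and the name remove_diacritics is made up. The final NFC step matters for text that NFD splits apart for reasons other than accents, such as Hangul syllables.

```python
import unicodedata

def remove_diacritics(text):
    # 1. Decompose: base characters and marks become separate code points.
    decomposed = unicodedata.normalize('NFD', text)
    # 2. Filter: keep everything that is not a non-spacing mark (Mn).
    kept = ''.join(c for c in decomposed
                   if unicodedata.category(c) != 'Mn')
    # 3. Recompose: put back together whatever is still composable.
    return unicodedata.normalize('NFC', kept)

print(remove_diacritics("caf\u00e9 \ud55c\uae00"))  # cafe 한글
```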

Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.

How to remove accents from a string in Python

The translate function is usually the fastest for this, given that it's a one-to-one character mapping:

normalMap = {'À': 'A', 'Á': 'A', 'Â': 'A', 'Ã': 'A', 'Ä': 'A',
'à': 'a', 'á': 'a', 'â': 'a', 'ã': 'a', 'ä': 'a', 'ª': 'A',
'È': 'E', 'É': 'E', 'Ê': 'E', 'Ë': 'E',
'è': 'e', 'é': 'e', 'ê': 'e', 'ë': 'e',
'Í': 'I', 'Ì': 'I', 'Î': 'I', 'Ï': 'I',
'í': 'i', 'ì': 'i', 'î': 'i', 'ï': 'i',
'Ò': 'O', 'Ó': 'O', 'Ô': 'O', 'Õ': 'O', 'Ö': 'O',
'ò': 'o', 'ó': 'o', 'ô': 'o', 'õ': 'o', 'ö': 'o', 'º': 'O',
'Ù': 'U', 'Ú': 'U', 'Û': 'U', 'Ü': 'U',
'ù': 'u', 'ú': 'u', 'û': 'u', 'ü': 'u',
'Ñ': 'N', 'ñ': 'n',
'Ç': 'C', 'ç': 'c',
'§': 'S', '³': '3', '²': '2', '¹': '1'}
normalize = str.maketrans(normalMap)

output:

print("José Magalhães ".translate(normalize))
# Jose Magalhaes

If you're not allowed to use methods of str, you can still use the dictionary:

S = "José Magalhães "
print(*(normalMap.get(c,c) for c in S),sep="")
# Jose Magalhaes

If you're not allowed to use a dictionary, you can use two strings to map the characters and loop through the string:

fromChar = "ÀÁÂÃÄàáâãäªÈÉÊËèéêëÍÌÎÏíìîïÒÓÔÕÖòóôõöºÙÚÛÜùúûüÑñÇ秳²¹"
toChar = "AAAAAaaaaaAEEEEeeeeIIIIiiiiOOOOOoooooOUUUUuuuuNnCcS321"

S = "José Magalhães "
R = "" # start result with an empty string
for c in S: # go through characters
    p = fromChar.find(c) # find position of accented character
    R += toChar[p] if p>=0 else c # use replacement or original character

print(R)
# Jose Magalhaes
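If str methods are allowed after all, two equal-length strings can also be passed straight to str.maketrans, which pairs them up position by position. The sketch below uses a shortened, made-up pair of strings (from_chars and to_chars) rather than the full lists above.

```python
# Shortened example mapping; both strings must have the same length.
from_chars = "áàâãäéèêëç"
to_chars = "aaaaaeeeec"

# maketrans(x, y) maps the character at each position of x
# to the character at the same position of y.
table = str.maketrans(from_chars, to_chars)

print("José Magalhães".translate(table))  # Jose Magalhaes
```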

Is there a way to get rid of accents and convert a whole string to regular letters?

Start with java.text.Normalizer.

string = Normalizer.normalize(string, Normalizer.Form.NFD);
// or Normalizer.Form.NFKD for a more "compatible" deconstruction

This will separate all of the accent marks from most characters. Then you just need to check each character and throw out the ones that are not ASCII.

string = string.replaceAll("[^\\p{ASCII}]", "");

If your text contains non-ASCII characters that you want to keep (other scripts, for example), you should remove only the combining marks instead:

string = string.replaceAll("\\p{M}", "");

For Unicode, \\P{M} matches the base glyph and \\p{M} (lowercase) matches each accent.
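To make the difference between the two filters concrete, here is a rough Python sketch. Python's built-in re module has no \p{M}, so this checks the general category through unicodedata instead; the function names strip_non_ascii and strip_marks are made up for the example.

```python
import unicodedata

def strip_non_ascii(text):
    # Rough equivalent of removing [^\p{ASCII}] after NFD:
    # everything outside ASCII is dropped, including whole scripts.
    decomposed = unicodedata.normalize('NFD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

def strip_marks(text):
    # Rough equivalent of removing \p{M} after NFD:
    # only characters in the mark categories (Mn, Mc, Me) are dropped.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed
                   if not unicodedata.category(c).startswith('M'))

print(repr(strip_non_ascii("caf\u00e9 \u0394")))  # 'cafe '
print(repr(strip_marks("caf\u00e9 \u0394")))      # 'cafe Δ'
```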

Thanks to GarretWilson for the pointer and regular-expressions.info for the great Unicode guide.


It is important to note that Normalizer by itself is insufficient to remove diacritics. For example, the following will not replace the accented é with the unaccented e:

import static java.text.Normalizer.normalize;
import static java.text.Normalizer.Form.*;

public class T {
    public static void main( final String[] args ) {
        final var text = "Brévis";

        System.out.println(
            normalize( text, NFD ) + " " +
            normalize( text, NFC ) + " " +
            normalize( text, NFKD ) + " " +
            normalize( text, NFKC )
        );
    }
}
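The same point can be checked in Python, assuming unicodedata.normalize behaves like the Java Normalizer here: none of the four normalization forms produces the unaccented spelling on its own, and the accent only disappears once the combining marks are filtered out.

```python
import unicodedata

text = "Br\u00e9vis"  # "Brévis" with a precomposed é

for form in ("NFD", "NFC", "NFKD", "NFKC"):
    normalized = unicodedata.normalize(form, text)
    # The last column is False for every form.
    print(form, len(normalized), normalized == "Brevis")

# Only after the combining marks are dropped does the accent go away.
stripped = "".join(
    c for c in unicodedata.normalize("NFD", text)
    if not unicodedata.combining(c)
)
print(stripped)  # Brevis
```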

