What is the best way to remove accents (normalize) in a Python unicode string?
How about this:
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
This works on greek letters, too:
>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>
The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
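For reference, the unicodedata.combining variant mentioned above could look roughly like this. It is a minimal sketch: the function name strip_accents_combining is made up for illustration, and it drops every character whose canonical combining class is non-zero, which is close to, but not identical to, filtering on category "Mn".

```python
import unicodedata

def strip_accents_combining(s):
    # Hypothetical helper name, not from the original answer.
    # Decompose first, then keep only characters whose combining class is 0.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents_combining("A \u00c0 \u0394 \u038e"))
```

On Python 3 the u"" prefix from the example above is no longer needed.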
And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".
Easy way to remove accents from a Unicode string?
Finally, I've solved it by using the Normalizer class.
import java.text.Normalizer;

public static String stripAccents(String s)
{
    s = Normalizer.normalize(s, Normalizer.Form.NFD);
    s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    return s;
}
Remove accents/diacritics in a string in JavaScript
With ES2015/ES6 String.prototype.normalize(),
const str = "Crème Brulée"
str.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
> "Creme Brulee"
Note: use NFKD if you want things like \uFB01 (ﬁ) normalized (to fi).
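The NFD versus NFKD difference is not specific to JavaScript. As a small illustration in Python (the variable names are only for this sketch), the fi ligature survives canonical decomposition but is split by compatibility decomposition:

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

# Canonical decomposition leaves the ligature alone.
nfd = unicodedata.normalize("NFD", ligature)
# Compatibility decomposition splits it into the two letters "f" and "i".
nfkd = unicodedata.normalize("NFKD", ligature)

print(len(nfd), len(nfkd))
```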
Two things are happening here:
- normalize()ing to NFD Unicode normal form decomposes combined graphemes into the combination of simple ones. The è of Crème ends up expressed as e + ̀ (U+0300).
- Using a regex character class to match the U+0300 → U+036F range, it is now trivial to globally get rid of the diacritics, which the Unicode standard conveniently groups as the Combining Diacritical Marks Unicode block.
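The same two steps can be sketched in Python to make the decomposition visible. This is an illustrative analogue, not the JavaScript answer's code; strip_combining_block and COMBINING_DIACRITICS are names invented here.

```python
import re
import unicodedata

# Character class covering the Combining Diacritical Marks block.
COMBINING_DIACRITICS = re.compile("[\u0300-\u036f]")

def strip_combining_block(text):
    # Step 1: decompose precomposed letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: delete every code point in U+0300..U+036F.
    return COMBINING_DIACRITICS.sub("", decomposed)

# Show what NFD does to a single precomposed "e with grave".
decomposed_e = unicodedata.normalize("NFD", "\u00e8")
print([f"U+{ord(c):04X}" for c in decomposed_e])

print(strip_combining_block("Crème Brulée"))
```

Marks outside that block (for example combining marks used by other scripts) are left in place by this range-based approach.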
As of 2021, one can also use Unicode property escapes:
str.normalize("NFD").replace(/\p{Diacritic}/gu, "")
Alternatively, if you just want sorting
Intl.Collator has sufficient support (~95% right now); a polyfill is also available, but I haven't tested it.
const c = new Intl.Collator();
["creme brulee", "crème brulée", "crame brulai", "crome brouillé",
 "creme brulay", "creme brulfé", "creme bruléa"].sort(c.compare)
> ["crame brulai", "creme brulay", "creme bruléa", "creme brulee",
   "crème brulée", "creme brulfé", "crome brouillé"]

// For comparison, the default sort orders by UTF-16 code units, so "è" lands after "o":
["creme brulee", "crème brulée", "crame brulai", "crome brouillé"].sort()
> ["crame brulai", "creme brulee", "crome brouillé", "crème brulée"]
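If only the sort order matters, a rough Python counterpart is to sort on an accent-stripped key while leaving the strings themselves untouched. This is a sketch under a simplifying assumption: it is not real locale-aware collation like Intl.Collator, and accent_insensitive_key is a name made up for this example.

```python
import unicodedata

def accent_insensitive_key(word):
    # Hypothetical helper: base letters first, original string as tie-breaker.
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), word)

words = ["creme brulee", "crème brulée", "crame brulai", "crome brouillé"]
print(sorted(words, key=accent_insensitive_key))
```

The accented and unaccented spellings end up next to each other, and the original strings keep their accents.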
How to remove accent in Python 3.5 and get a string with unicodedata or other solutions?
With a third-party package, unidecode:
>>> import unidecode
>>> unidecode.unidecode("32 rue d'Athènes Paris France")
"32 rue d'Athenes Paris France"
How to remove accents from Unicode characters?
You need a library like UnidecodeSharpFork to do that.
How do I remove diacritics (accents) from a string in .NET?
I've not used this method, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)
static string RemoveDiacritics(string text)
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder(capacity: normalizedString.Length);

    for (int i = 0; i < normalizedString.Length; i++)
    {
        char c = normalizedString[i];
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder
        .ToString()
        .Normalize(NormalizationForm.FormC);
}
Note that this is a followup to his earlier post: Stripping diacritics....
The approach uses String.Normalize to split the input string into constituent glyphs (basically separating the "base" characters from the diacritics) and then scans the result and retains only the base characters. It's just a little complicated, but really you're looking at a complicated problem.
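The decompose, filter, recompose sequence described above is not tied to .NET. A minimal Python sketch of the same idea follows; it is an analogue written for illustration, not a port taken from the blog post. The final NFC step matters for characters that NFD splits apart even though no diacritic is involved, such as Hangul syllables.

```python
import unicodedata

def remove_diacritics(text):
    # Split precomposed characters into base characters and marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except nonspacing marks (category "Mn").
    kept = [c for c in decomposed if unicodedata.category(c) != "Mn"]
    # Recompose whatever was decomposed but not stripped.
    return unicodedata.normalize("NFC", "".join(kept))

print(remove_diacritics("crème brûlée"))
# A Hangul syllable decomposes into three jamo under NFD; NFC puts it back together.
print(len(unicodedata.normalize("NFD", "\ud55c")), len(remove_diacritics("\ud55c")))
```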
Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.
How to remove accents from a string in Python
The translate function is usually the fastest for this given that it's a one to one character mapping:
normalMap = {'À': 'A', 'Á': 'A', 'Â': 'A', 'Ã': 'A', 'Ä': 'A',
'à': 'a', 'á': 'a', 'â': 'a', 'ã': 'a', 'ä': 'a', 'ª': 'A',
'È': 'E', 'É': 'E', 'Ê': 'E', 'Ë': 'E',
'è': 'e', 'é': 'e', 'ê': 'e', 'ë': 'e',
'Í': 'I', 'Ì': 'I', 'Î': 'I', 'Ï': 'I',
'í': 'i', 'ì': 'i', 'î': 'i', 'ï': 'i',
'Ò': 'O', 'Ó': 'O', 'Ô': 'O', 'Õ': 'O', 'Ö': 'O',
'ò': 'o', 'ó': 'o', 'ô': 'o', 'õ': 'o', 'ö': 'o', 'º': 'O',
'Ù': 'U', 'Ú': 'U', 'Û': 'U', 'Ü': 'U',
'ù': 'u', 'ú': 'u', 'û': 'u', 'ü': 'u',
'Ñ': 'N', 'ñ': 'n',
'Ç': 'C', 'ç': 'c',
'§': 'S', '³': '3', '²': '2', '¹': '1'}
normalize = str.maketrans(normalMap)
output:
print("José Magalhães ".translate(normalize))
# Jose Magalhaes
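If maintaining the table by hand is a concern, one option is to generate most of it from unicodedata and still use translate for the lookup. The following is only a sketch: build_accent_table and its default code point range are assumptions made for this example, and symbols such as ª, º or § are not covered because they have no canonical decomposition.

```python
import unicodedata

def build_accent_table(start=0x00C0, stop=0x0250):
    # Hypothetical helper: map each precomposed letter in the range
    # to its base letter(s), skipping characters that do not change.
    table = {}
    for codepoint in range(start, stop):
        char = chr(codepoint)
        decomposed = unicodedata.normalize("NFD", char)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        if stripped and stripped != char:
            table[codepoint] = stripped
    return table

ACCENT_TABLE = build_accent_table()
print("José Magalhães".translate(ACCENT_TABLE))
```

The table is built once, so the per-string cost stays that of a single translate call.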
If you're not allowed to use methods of str, you can still use the dictionary:
S = "José Magalhães "
print(*(normalMap.get(c,c) for c in S),sep="")
# Jose Magalhaes
If you're not allowed to use a dictionary, you can use two strings to map the characters and loop through the string:
fromChar = "ÀÁÂÃÄàáâãäªÈÉÊËèéêëÍÌÎÏíìîïÒÓÔÕÖòóôõöºÙÚÛÜùúûüÑñÇ秳²¹"
toChar = "AAAAAaaaaaAEEEEeeeeIIIIiiiiOOOOOoooooOUUUUuuuuNnCcS321"
S = "José Magalhães "
R = "" # start result with an empty string
for c in S: # go through characters
    p = fromChar.find(c) # find position of accented character
    R += toChar[p] if p>=0 else c # use replacement or original character
print(R)
# Jose Magalhaes
Is there a way to get rid of accents and convert a whole string to regular letters?
Start with java.text.Normalizer.
string = Normalizer.normalize(string, Normalizer.Form.NFD);
// or Normalizer.Form.NFKD for a more "compatible" deconstruction
This will separate all of the accent marks from most characters. Then you just need to check each character and throw out the ones that aren't plain letters.
string = string.replaceAll("[^\\p{ASCII}]", "");
If your text is in Unicode, you should use this instead:
string = string.replaceAll("\\p{M}", "");
For Unicode, \\P{M} matches the base glyph and \\p{M} (lowercase) matches each accent.
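The difference between the two replacements shows up as soon as the text contains non-Latin letters. A small Python sketch of both strategies (ascii_only and without_marks are names invented for this illustration, standing in for the [^\p{ASCII}] and \p{M} patterns respectively):

```python
import unicodedata

def ascii_only(text):
    # Counterpart of removing [^\p{ASCII}]: everything non-ASCII is dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def without_marks(text):
    # Counterpart of removing \p{M}: only mark characters (Mn, Mc, Me) are dropped.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed
                   if not unicodedata.category(c).startswith("M"))

sample = "Br\u00e9vis \u0394"  # "Brévis" followed by a Greek capital delta
print(repr(ascii_only(sample)))
print(repr(without_marks(sample)))
```

The ASCII filter silently loses the Greek letter, while the mark filter keeps it.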
Thanks to GarretWilson for the pointer and regular-expressions.info for the great Unicode guide.
It is important to note that Normalizer by itself is insufficient to remove diacritics. For example, the following will not replace the accented é with the unaccented e:
import static java.text.Normalizer.normalize;
import static java.text.Normalizer.Form.*;

public class T {
    public static void main( final String[] args ) {
        final var text = "Brévis";
        System.out.println(
            normalize( text, NFD ) + " " +
            normalize( text, NFC ) + " " +
            normalize( text, NFKD ) + " " +
            normalize( text, NFKC )
        );
    }
}
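Printing the four forms tends to look identical on screen, so it can help to inspect the code points instead. A short Python sketch of the same check (variable names are illustrative): the decomposed forms are one code point longer because the accent is still present as a separate combining mark.

```python
import unicodedata

text = "Br\u00e9vis"  # "Brévis" with a precomposed e-acute

# Length of the string in each normalization form.
lengths = {form: len(unicodedata.normalize(form, text))
           for form in ("NFD", "NFC", "NFKD", "NFKC")}
print(lengths)

# The combining acute accent (U+0301) is still there after NFD.
still_accented = "\u0301" in unicodedata.normalize("NFD", text)
print(still_accented)
```

Normalization only changes how the accent is encoded; removing it takes a separate filtering step.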