How to Remove Accents (Normalize) in a Python Unicode String

What is the best way to remove accents (normalize) in a Python unicode string?

How about this:

import unicodedata
def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

This works on Greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'
>>>

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
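
For completeness, here is a minimal sketch of that unicodedata.combining variant (the function name is just for illustration); a nonzero combining class marks a combining character such as an accent:

import unicodedata

def strip_accents_combining(s):
    # Decompose, then drop every code point whose combining class is nonzero
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if not unicodedata.combining(c))

# Example: strip_accents_combining("À Δ Ύ") -> 'A Δ Υ'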

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".

How to remove accents from a string in Python

The translate function is usually the fastest option for this, given that it's a one-to-one character mapping:

normalMap = {'À': 'A', 'Á': 'A', 'Â': 'A', 'Ã': 'A', 'Ä': 'A',
             'à': 'a', 'á': 'a', 'â': 'a', 'ã': 'a', 'ä': 'a', 'ª': 'A',
             'È': 'E', 'É': 'E', 'Ê': 'E', 'Ë': 'E',
             'è': 'e', 'é': 'e', 'ê': 'e', 'ë': 'e',
             'Í': 'I', 'Ì': 'I', 'Î': 'I', 'Ï': 'I',
             'í': 'i', 'ì': 'i', 'î': 'i', 'ï': 'i',
             'Ò': 'O', 'Ó': 'O', 'Ô': 'O', 'Õ': 'O', 'Ö': 'O',
             'ò': 'o', 'ó': 'o', 'ô': 'o', 'õ': 'o', 'ö': 'o', 'º': 'O',
             'Ù': 'U', 'Ú': 'U', 'Û': 'U', 'Ü': 'U',
             'ù': 'u', 'ú': 'u', 'û': 'u', 'ü': 'u',
             'Ñ': 'N', 'ñ': 'n',
             'Ç': 'C', 'ç': 'c',
             '§': 'S', '³': '3', '²': '2', '¹': '1'}
normalize = str.maketrans(normalMap)

output:

print("José Magalhães ".translate(normalize))
# Jose Magalhaes

If you're not allowed to use methods of str, you can still use the dictionary:

S = "José Magalhães "
print(*(normalMap.get(c,c) for c in S),sep="")
# Jose Magalhaes

If you're not allowed to use a dictionary, you can use two strings to map the characters and loop through the string:

fromChar = "ÀÁÂÃÄàáâãäªÈÉÊËèéêëÍÌÎÏíìîïÒÓÔÕÖòóôõöºÙÚÛÜùúûüÑñÇ秳²¹"
toChar = "AAAAAaaaaaAEEEEeeeeIIIIiiiiOOOOOoooooOUUUUuuuuNnCcS321"

S = "José Magalhães "
R = "" # start result with an empty string
for c in S:                           # go through characters
    p = fromChar.find(c)              # find position of accented character
    R += toChar[p] if p >= 0 else c   # use replacement or original character

print(R)
# Jose Magalhaes
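
Incidentally, str.maketrans also accepts two equal-length strings directly, so the same fromChar/toChar pair can feed the translate approach shown earlier; a small sketch (the table name is just for illustration):

# str.maketrans(x, y) maps each character of x to the character at the same
# position in y; fromChar and toChar above have matching lengths
accentTable = str.maketrans(fromChar, toChar)

print("José Magalhães ".translate(accentTable))
# Jose Magalhaes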

How to remove accents in Python 3.5 and get a string with unicodedata or other solutions?

With the third-party package unidecode:

>>> import unidecode
>>> unidecode.unidecode("32 rue d'Athènes Paris France")
"32 rue d'Athenes Paris France"

Removing accents and special characters

A possible solution would be:

import string, unicodedata

def remove_accents(data):
    return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.printable).lower()

Using NFKD is, as far as I know, the standard way to normalize Unicode and map characters to their compatibility equivalents. As for removing the special characters, digits and leftover combining marks that the normalization produces, you can simply compare each character against string.ascii_letters and drop anything not in that set (the code above uses string.printable, which also keeps digits and punctuation).
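
If you do want to keep only the letters, a minimal sketch of that string.ascii_letters variant could look like this (the function name is just for illustration):

import string
import unicodedata

def remove_accents_and_specials(data):
    # Keep only plain ASCII letters after NFKD decomposition
    return ''.join(x for x in unicodedata.normalize('NFKD', data)
                   if x in string.ascii_letters).lower()

# remove_accents_and_specials("Málaga 2024!") -> 'malaga'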

Remove accents from Beautifulsoup strings

It would appear the crux of the issue is that you're relying on Python's default encoding when opening the file, rather than the encoding of the file in question.

I simplified your code a bit to debug it; hopefully this demonstrates the core issue:

import unicodedata
from bs4 import BeautifulSoup

def strip_accents(text):
    # Just a preference change
    text = unicodedata.normalize('NFD', text)
    text = text.encode('ascii', 'ignore')
    return text.decode("utf-8")

# A simplified version of your code
def html_imp_old(h):
    soup = BeautifulSoup(open(h), features="lxml")
    tags = []
    for tag in soup.find_all('td'):
        tags.append(strip_accents(tag.text))
    print(tags)

# Same as _old, just specifying an encoding when reading the file
def html_imp_new(h):
    soup = BeautifulSoup(open(h, encoding="utf-8"), features="lxml")
    tags = []
    for tag in soup.find_all('td'):
        tags.append(strip_accents(tag.text))
    print(tags)

# Make a self-contained snippet, so write out the HTML to disk
with open("temp.html", "wt", encoding="utf-8") as f:
f.write("<TD WIDTH=\"80\" ALIGN=\"center\">Jes\u00fas</TD>\n")
# This works correctly, outputs "Jesus"
print(strip_accents('Jes\u00fas'))
# This doesn't work, outputs "JesAs" for me, though I assume this will be OS dependent behavior
html_imp_old("temp.html")
# This works correctly, outputs "Jesus"
html_imp_new("temp.html")

Remove accents and keep under dots in Python

I would use Unicode normalization for this.

Characters with accents and dots like that are precomposed Unicode characters. If you decompose them, you can get the base character plus the combining characters for the accents and dots and whatnot. Then you can remove the ones you don't want and re-compose the string back into precomposed characters.

You can do this in Python using unicodedata.normalize. Specifically, you want the "NFD" (Normalization Form Canonical Decomposition) normalization form. This will give you the canonical decomposition of the characters. Then to re-compose the characters, you want "NFC" (Normalization Form Canonical Composition).

I'll show you what I mean. First, let's look at the individual code points in the example text you provided above:

>>> from pprint import pprint
>>> import unicodedata
>>> text = 'ọmọàbúròẹlẹ́wà'
>>> pprint([unicodedata.name(c) for c in text])
['LATIN SMALL LETTER O WITH DOT BELOW',
'LATIN SMALL LETTER M',
'LATIN SMALL LETTER O WITH DOT BELOW',
'LATIN SMALL LETTER A WITH GRAVE',
'LATIN SMALL LETTER B',
'LATIN SMALL LETTER U WITH ACUTE',
'LATIN SMALL LETTER R',
'LATIN SMALL LETTER O WITH GRAVE',
'LATIN SMALL LETTER E WITH DOT BELOW',
'LATIN SMALL LETTER L',
'LATIN SMALL LETTER E WITH ACUTE',
'COMBINING DOT BELOW',
'LATIN SMALL LETTER W',
'LATIN SMALL LETTER A WITH GRAVE']

As you can see, one of the characters is already partially decomposed (the one with the separate "COMBINING DOT BELOW"). Now let's look at it fully decomposed:

>>> text = unicodedata.normalize('NFD', text)
>>> pprint([unicodedata.name(c) for c in text])
['LATIN SMALL LETTER O',
'COMBINING DOT BELOW',
'LATIN SMALL LETTER M',
'LATIN SMALL LETTER O',
'COMBINING DOT BELOW',
'LATIN SMALL LETTER A',
'COMBINING GRAVE ACCENT',
'LATIN SMALL LETTER B',
'LATIN SMALL LETTER U',
'COMBINING ACUTE ACCENT',
'LATIN SMALL LETTER R',
'LATIN SMALL LETTER O',
'COMBINING GRAVE ACCENT',
'LATIN SMALL LETTER E',
'COMBINING DOT BELOW',
'LATIN SMALL LETTER L',
'LATIN SMALL LETTER E',
'COMBINING DOT BELOW',
'COMBINING ACUTE ACCENT',
'LATIN SMALL LETTER W',
'LATIN SMALL LETTER A',
'COMBINING GRAVE ACCENT']

Now according to your requirements, it sounds like you want to keep all Latin letters (and probably the rest of ASCII too, I'm guessing) plus the "COMBINING DOT BELOW" code point, which we can refer to using the literal '\N{COMBINING DOT BELOW}' for easier readability of your code.

Here's an example function that I think will do what you want:

import unicodedata

def remove_accents_but_not_dots(input_text):
    # Step 1: Decompose input_text into base letters and combining characters
    decomposed_text = unicodedata.normalize('NFD', input_text)

    # Step 2: Filter out the combining characters we don't want
    filtered_text = ''
    for c in decomposed_text:
        if ord(c) <= 0x7f or c == '\N{COMBINING DOT BELOW}':
            # Only keep ASCII or "COMBINING DOT BELOW"
            filtered_text += c

    # Step 3: Re-compose the string into precomposed characters
    return unicodedata.normalize('NFC', filtered_text)

(Of course, string concatenation in Python is slow, but I'll leave the optimizations to you. This example was written for readability.)
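
If performance ever matters, the usual idiom is to build the result with ''.join over a generator instead of repeated concatenation; here is a sketch of the same filter (the function name is just for illustration):

def remove_accents_but_not_dots_joined(input_text):
    decomposed_text = unicodedata.normalize('NFD', input_text)
    # Same filter as above, but assembled with ''.join instead of +=
    filtered_text = ''.join(
        c for c in decomposed_text
        if ord(c) <= 0x7f or c == '\N{COMBINING DOT BELOW}'
    )
    return unicodedata.normalize('NFC', filtered_text)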

And here's what the result looks like:

>>> remove_accents_but_not_dots('ọmọàbúròẹlẹ́wà')
'ọmọaburoẹlẹwa'

