Normalizing Unicode

The unicodedata module offers a .normalize() function; you want to normalize to the NFC form. An example using the same U+0061 LATIN SMALL LETTER A - U+0301 COMBINING ACUTE ACCENT combination and the U+00E1 LATIN SMALL LETTER A WITH ACUTE codepoint you used:

>>> import unicodedata
>>> print(ascii(unicodedata.normalize('NFC', '\u0061\u0301')))
'\xe1'
>>> print(ascii(unicodedata.normalize('NFD', '\u00e1')))
'a\u0301'

(I used the ascii() function here to ensure non-ASCII codepoints are printed using escape syntax, making the differences clear).

NFC, or 'Normal Form Composed', returns composed characters; NFD, 'Normal Form Decomposed', gives you decomposed characters, i.e. base characters followed by separate combining marks.

The additional NFKC and NFKD forms deal with compatibility codepoints; e.g. U+2160 ROMAN NUMERAL ONE is really just the same thing as U+0049 LATIN CAPITAL LETTER I but present in the Unicode standard to remain compatible with encodings that treat them separately. Using either NFKC or NFKD form, in addition to composing or decomposing characters, will also replace all 'compatibility' characters with their canonical form.
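
You can check the U+2160 case mentioned above directly in the interpreter (with unicodedata imported as before):

>>> unicodedata.normalize('NFKC', '\u2160')
'I'
>>> unicodedata.normalize('NFKC', '\u2160') == '\u0049'
True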

Here is an example using the U+2167 ROMAN NUMERAL EIGHT codepoint; using the NFKC form replaces this with a sequence of ASCII V and I characters:

>>> unicodedata.normalize('NFC', '\u2167')
'Ⅷ'
>>> unicodedata.normalize('NFKC', '\u2167')
'VIII'

Note that there is no guarantee that normalization round-trips between the composed and decomposed forms; normalizing a character to NFD form and then converting the result back to NFC form does not always give back the original character. The Unicode standard maintains a list of exceptions; characters on this list are decomposable, but are never re-composed back to their precomposed form, for various reasons. Also see the documentation on the Composition Exclusion Table.
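
One concrete case is U+0958 DEVANAGARI LETTER QA, which appears in the exclusion table: it decomposes into KA followed by NUKTA under NFD, but NFC will not put it back together:

>>> print(ascii(unicodedata.normalize('NFD', '\u0958')))
'\u0915\u093c'
>>> print(ascii(unicodedata.normalize('NFC', '\u0915\u093c')))
'\u0915\u093c'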

Will normalizing a string give the same result as normalizing the individual grapheme clusters?

No, that generally is not true. The Unicode Standard warns against the assumption that concatenating normalised strings produces another normalised string. From UAX #15:

In using normalization functions, it is important to realize that none of the Normalization Forms are closed under string concatenation. That is, even if two strings X and Y are normalized, their string concatenation X+Y is not guaranteed to be normalized.
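
As a quick illustration (x and y here are just throwaway names), a base letter and a lone combining mark are each already in NFC form, yet their concatenation is not:

>>> x, y = 'a', '\u0301'
>>> unicodedata.normalize('NFC', x) == x, unicodedata.normalize('NFC', y) == y
(True, True)
>>> unicodedata.normalize('NFC', x + y) == x + y
False
>>> print(ascii(unicodedata.normalize('NFC', x + y)))
'\xe1'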

Many aspects of the Unicode text segmentation algorithm are tailorable; the standard mostly just provides default values that are useful in most contexts, but can be overridden when necessary for a certain purpose. Therefore, there is no guarantee that two Unicode-compliant applications are even going to agree on where grapheme boundaries are situated. A concrete example is the difference between legacy grapheme clusters and extended grapheme clusters.

In the former, characters with the Grapheme_Cluster_Break property values Spacing_Mark or Prepend do not act as grapheme extenders, while in the latter they do. As of Unicode 12.1, there are twelve such characters with a non-zero canonical combining class. These characters would break your method if you used the legacy grapheme cluster definition, such as in the following sequence:

<U+1D158, U+1D16D, U+1D166>

which is

  • MUSICAL SYMBOL NOTEHEAD BLACK (ccc=0)
  • MUSICAL SYMBOL COMBINING AUGMENTATION DOT (ccc=226)
  • MUSICAL SYMBOL COMBINING SPRECHGESANG STEM (ccc=216)

Because both the combining augmentation dot and the combining sprechgesang stem are Spacing_Mark, this sequence is actually divided into three legacy grapheme clusters, each only one character in length and thus automatically normalised. The real normalisation of the entire string would switch the positions of the dot and stem, however, because of their CCC values.
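
You can watch whole-string normalization perform that reordering in the interpreter; canonical ordering sorts the combining marks by ascending CCC, so the stem (ccc=216) ends up before the dot (ccc=226):

>>> s = '\U0001D158\U0001D16D\U0001D166'
>>> print(ascii(unicodedata.normalize('NFC', s)))
'\U0001d158\U0001d166\U0001d16d'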

If we ignore the possibility of tailoring the algorithm and focus only on extended grapheme clusters strictly as defined in the standard, then normalising each grapheme cluster individually should produce the same result as normalising the whole string at once to the best of my knowledge, but there is no formal guarantee that future revisions of the standard won’t change that.

Normalizing unicode text to filenames, etc. in Python

What you want to do is also known as "slugifying" a string. Here's a possible solution:

import re
from unicodedata import normalize

_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')

def slugify(text, delim=u'-'):
    """Generates a slightly worse ASCII-only slug."""
    result = []
    for word in _punct_re.split(text.lower()):
        word = normalize('NFKD', word).encode('ascii', 'ignore')
        if word:
            result.append(word)
    return unicode(delim.join(result))

Usage:

>>> slugify(u'My International Text: åäö')
u'my-international-text-aao'

You can also change the delimiter:

>>> slugify(u'My International Text: åäö', delim='_')
u'my_international_text_aao'

Source: Generating Slugs

For Python 3: pastebin.com/ft7Yb3KS (thanks @MrPoxipol).
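
If you just need something that runs on Python 3 and don't want to follow that link, here is a minimal sketch of the same approach (not the exact code behind the pastebin link):

import re
import unicodedata

_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')

def slugify(text, delim='-'):
    """Generates a slightly worse ASCII-only slug."""
    result = []
    for word in _punct_re.split(text.lower()):
        # Decompose, then drop anything that does not survive ASCII encoding.
        word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('ascii')
        if word:
            result.append(word)
    return delim.join(result)

>>> slugify('My International Text: åäö')
'my-international-text-aao'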

Convert weird strings to normal python strings

Using unicodedata you can normalize Unicode strings; the NFKC form maps compatibility characters, such as fullwidth or styled "mathematical" letters, to their plain equivalents. For example, with a string of fullwidth characters:

>>> from unicodedata import normalize
>>> test_str = 'ＢＵＩＬＤＩＮＧ Ｓｐｅｅｄｙ ＴＵＥＳＤＡＹ ｓｐａｇｈｅｔｔｉ'
>>> print(normalize('NFKC', test_str))
BUILDING Speedy TUESDAY spaghetti

