What Is Normalized UTF-8 All About

What is normalized UTF-8 all about?

Everything You Never Wanted to Know about Unicode Normalization

Canonical Normalization

Unicode includes multiple ways to encode some characters, most notably accented characters. Canonical normalization changes the code points into a canonical encoding form. The resulting code points should appear identical to the original ones barring any bugs in the fonts or rendering engine.

When To Use

Because the results appear identical, it is always safe to apply canonical normalization to a string before storing or displaying it, as long as you can tolerate the result not being bit for bit identical to the input.

Canonical normalization comes in 2 forms: NFD and NFC. The two are equivalent in the sense that one can convert between these two forms without loss. Comparing two strings under NFC will always give the same result as comparing them under NFD.

NFD

NFD has the characters fully expanded out. This is the faster normalization form to calculate, but it results in more code points (i.e. uses more space).

If you just want to compare two strings that are not already normalized, this is the preferred normalization form unless you know you need compatibility normalization.

NFC

NFC recombines code points when possible after running the NFD algorithm. This takes a little longer, but results in shorter strings.
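
To make the difference concrete, here is a small Python sketch using the standard unicodedata module (Python is only used for illustration; every major platform exposes an equivalent API):

import unicodedata

# "é" written two ways: precomposed (U+00E9) and decomposed (e + combining acute U+0301).
composed = "\u00e9"
decomposed = "e\u0301"

print(composed == decomposed)                         # False: different code point sequences
print(len(unicodedata.normalize("NFD", composed)))    # 2 -- NFD expands to e + U+0301
print(len(unicodedata.normalize("NFC", decomposed)))  # 1 -- NFC recombines to U+00E9

# Comparing under NFC gives the same answer as comparing under NFD.
print(unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed))  # True
print(unicodedata.normalize("NFD", composed) == unicodedata.normalize("NFD", decomposed))  # True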

Compatibility Normalization

Unicode also includes many characters that really do not belong, but were used in legacy character sets. Unicode added these to allow text in those character sets to be processed as Unicode, and then be converted back without loss.

Compatibility normalization converts these to the corresponding sequence of "real" characters, and also performs canonical normalization. The results of compatibility normalization may not appear identical to the originals.

Characters that include formatting information are replaced with ones that do not. For example the superscript nine character ⁹ gets converted to a plain 9. Others don't involve formatting differences. For example the Roman numeral character Ⅸ is converted to the regular letters IX.
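
For illustration, a quick check with Python's unicodedata module (the characters above are U+2079 SUPERSCRIPT NINE and U+2168 ROMAN NUMERAL NINE):

import unicodedata

# Compatibility normalization rewrites "formatting" variants into plain characters.
print(unicodedata.normalize("NFKC", "\u2079"))  # SUPERSCRIPT NINE   -> "9"
print(unicodedata.normalize("NFKC", "\u2168"))  # ROMAN NUMERAL NINE -> "IX"

# Canonical normalization leaves them untouched, since they are distinct characters.
print(unicodedata.normalize("NFC", "\u2079") == "\u2079")  # True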

Obviously, once this transformation has been performed, it is no longer possible to losslessly convert back to the original character set.

When to use

The Unicode Consortium suggests thinking of compatibility normalization like a ToUpperCase transform. It is something that may be useful in some circumstances, but you should not just apply it willy-nilly.

An excellent use case would be a search engine, since you would probably want a search for 9 to match ⁹.
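
A minimal Python sketch of that idea; the helper name fold_for_search is made up for this example, and a real search engine would do considerably more (stemming, stop words, etc.):

import unicodedata

def fold_for_search(text):
    # Hypothetical search-side folding: compatibility-normalize, then case-fold.
    # Index and compare only the folded copy; keep the original text for display.
    return unicodedata.normalize("NFKC", text).casefold()

print(fold_for_search("\u2079") == fold_for_search("9"))  # True: a query for 9 matches ⁹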

One thing you should probably not do is display the result of applying compatibility normalization to the user.

NFKC/NFKD

Compatibility normalization comes in two forms, NFKD and NFKC. They have the same relationship to each other as NFD and NFC do.

Any string in NFKC is inherently also in NFC, and the same holds for NFKD and NFD. Thus NFKD(x)=NFD(NFKC(x)), and NFKC(x)=NFC(NFKD(x)), etc.
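
These identities are easy to spot-check with Python's unicodedata module (the test string below is arbitrary):

import unicodedata

def nf(form, s):
    return unicodedata.normalize(form, s)

# An arbitrary string mixing compatibility characters and combining marks.
x = "\u2460 e\u0301 \ufb01"  # CIRCLED DIGIT ONE, decomposed é, "fi" ligature

print(nf("NFKD", x) == nf("NFD", nf("NFKC", x)))  # True
print(nf("NFKC", x) == nf("NFC", nf("NFKD", x)))  # True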

Conclusion

If in doubt, go with canonical normalization. Choose NFC or NFD based on the space/speed trade-off applicable, or based on what is required by something you are inter-operating with.

What determines the normalized form of a Unicode string in C++?

The interpretation of character strings outside of the basic source character set is implementation-dependent. (Standard quote below.) So there is no definitive answer; an implementation is not even obliged to accept source characters outside of the basic set.

Normalisation involves a mapping of possibly multiple source codepoints to possibly multiple internal codepoints, including the possibility of reordering the source character sequence (if, for example, diacritics are not in the canonical order). Such transformations are more complex than the source→internal transformation anticipated by the standard, and I suspect that a compiler which attempted them would not be completely conformant. In any event, I know of no compiler which does so.

So, in general, you should ensure that the source code you provide to the compiler is normalized as per your desired normalization form, if that matters to you.

In the particular case of GCC, the compiler interprets the source according to the default locale's encoding, unless told otherwise (with the -finput-charset command-line option). It will recode if necessary to Unicode codepoints. But it does not alter the sequence of codepoints. So if you give it a normalized UTF-8 string, that's what you get. And if you give it an unnormalized string, that's also what you get.

In this example on coliru, the first string is composed and the second one decomposed (although they are both in some normalization form). (The rendering of the second example string in coliru seems to be browser-dependent. On my machine, chrome renders them correctly, while firefox shifts the diacritics one position to the left. YMMV.)


The C++ standard defines the basic source character set (in §2.3/1) to be letters, digits, five whitespace characters (space, newline, tab, vertical tab and formfeed) and the symbols:

_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

It gives the compiler a lot of latitude as to how it interprets the input, and how it handles characters outside of the basic source character set. §2.2 paragraph 1 (from C++14 draft n4527):

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)

It's worth adding that diacritics are characters, from the perspective of the C++ standard. So the composed ñ (\u00f1) is one character and the decomposed ñ (\u006e \u0303) is two characters, regardless of how it looks to you.
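
The same point is easy to see from Python, which counts code points rather than rendered glyphs:

print(len("\u00f1"))               # 1 -- composed ñ
print(len("\u006e\u0303"))         # 2 -- n + combining tilde
print("\u00f1" == "\u006e\u0303")  # False, even though both render as ñ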

A close reading of the above paragraph from the standard suggests that normalization or other transformations which are not strictly 1-1 are not permitted, although the compiler may be able to reject an input which contains characters outside the basic source character set.

Does mbstring normalize utf-8 strings?

Reading the source, mb_convert_encoding does not appear to normalize. It appears to convert between encodings and then substitute illegal characters, and that is all.

What's the point of String.normalize()?

It depends on what you will do with the strings: often you do not need normalization (for example, if you just take input from the user and echo it back to the user). But if you check, search, or use such strings as keys, you may want a unique way to identify strings that are semantically the same.

The main problem is that you may have two strings which are semantically the same, but with two different representations: e.g. one with an accented character [one code point], and one with a base character combined with an accent [one code point for the character, one for the combining accent]. The user is not in control of how the input text is sent, so you may end up with two different user names, or two different passwords. And if you mangle the data, you may get different results depending on the initial string. Users do not like that.

Another problem is the order of combining characters. A character may carry both an accent and a lower tail (e.g. a cedilla), and the same result can be expressed by several combinations: "base char, tail, accent", "base char, accent, tail", "char+tail, accent", "char+accent, tail".
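
Canonical normalization solves this by imposing a unique order on combining marks. A small Python illustration (combining acute is U+0301, combining cedilla is U+0327):

import unicodedata

a = "e\u0301\u0327"  # e, acute, cedilla
b = "e\u0327\u0301"  # e, cedilla, acute -- same marks, different input order

print(a == b)  # False as raw code point sequences
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True
print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))  # True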

And there are degenerate cases (especially if the text is typed from a keyboard): you may get code points which should be removed (an arbitrarily long string could be equivalent to just a few bytes).

In any case, sorting strings requires a normalized form from you (or your library): if you already provide the right one, the library will not need to transform it again.

So: you want the same string (semantically speaking) to always have the same sequence of Unicode code points.

Note: if you are working directly on UTF-8, you should also care about UTF-8's own special cases: the same code point could be written in different ways [using more bytes than necessary]. This can also be a security problem.
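
For example, 0xC0 0xAF is an over-long (and therefore invalid) UTF-8 encoding of "/"; a strict decoder such as Python's must reject it rather than silently treat it as a slash:

try:
    b"\xc0\xaf".decode("utf-8")   # over-long encoding of "/"
except UnicodeDecodeError as e:
    print("rejected:", e.reason)  # e.g. "invalid start byte"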

The K forms (NFKC/NFKD) are often used for "searches" and similar tasks: CO2 and CO₂ will be interpreted in the same manner. But since this can change the meaning of the text, it should often be used only internally, for temporary tasks, while keeping the original text.

What does unicodedata.normalize do in python?

You need to specify the encoding type.

Then you need to pass unicode strings, rather than byte strings, as the argument to normalize().

# -*- coding: utf-8 -*-

import unicodedata

my_var = u"this is a string"
my_var2 = u" Esta es una oración que está en español "

# NFKD splits each accented letter into base letter + combining mark;
# encoding to ASCII with 'ignore' then drops the combining marks.
my_var3 = unicodedata.normalize(u'NFKD', my_var2).encode('ascii', 'ignore').decode('ascii')

output = my_var + my_var3
print(output)
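
For reference, a roughly equivalent Python 3 version of the same idea (no u prefixes or coding declaration needed, since source files and str are Unicode by default):

import unicodedata

my_var2 = " Esta es una oración que está en español "

# NFKD splits accented letters into base letter + combining mark;
# encoding to ASCII with 'ignore' then drops the marks.
ascii_only = unicodedata.normalize("NFKD", my_var2).encode("ascii", "ignore").decode("ascii")

print(ascii_only)  # " Esta es una oracion que esta en espanol "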

Unicode Normalization in Windows

From the MSDN article Using Unicode Normalization to Represent Strings.

Windows, Microsoft applications, and the .NET Framework generally generate characters in form C using normal input methods. For most purposes on Windows, form C is the preferred form. For example, characters in form C are produced by Windows keyboard input. However, characters imported from the Web and other platforms can introduce other normalization forms into the data stream.

Update: I've included some specific details relating to Question #2.

In regards to the file system, normalization is not required - based on the article Naming Files, Paths, and Namespaces.

There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs. Any normalization that your application requires should be performed with this in mind, external of any calls to related Windows file I/O API functions.

In regards to SQL Server, no normalization is required - nor is data normalized when saved in the database. That said, when comparing strings, SQL Server 2000 uses its own string normalization mechanism inside of indexes; but I cannot find specific details on what that is. A SQL Server 2005 article states the same.

One important change in SQL Server 7.0 was the provision of an operating system–independent model for string comparison, so that the collations between all operating systems from Windows 95 through Windows 2000 would be consistent. This string comparison code was based on the same code that Windows 2000 uses for its own string normalization, and is encapsulated to be the same on all computers and in all versions of SQL Server.

Python unicode normalization issue

Consider that your file is perhaps not using UTF-8 as the encoding.

You are reading the file with the UTF-8 codec but decoding fails. Check that your file encoding is really UTF-8.

Note that UTF-8 is an encoding out of many, it doesn't mean 'decode magically to Unicode'.
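
A minimal way to check (the file name here is hypothetical): try to read the file strictly as UTF-8 and see where decoding fails; a failure usually means the file is actually in some other encoding such as Latin-1 or Windows-1252.

try:
    with open("input.txt", encoding="utf-8") as f:  # hypothetical file name
        text = f.read()
except UnicodeDecodeError as e:
    print("not valid UTF-8 at byte offset", e.start, "-", e.reason)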

If you don't yet understand what encodings are (as opposed to what Unicode is, a related but separate concept), you need to do some reading:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • The Python Unicode HOWTO

  • Pragmatic Unicode by Ned Batchelder

How to normalize text content to UTF 8 in java

You can use ICU4J's CharsetDetector

Normalize filenames to NFC or not (Unicode)

The accuracy of the advice you link to depends on the filesystem in question.

The 'standard' Linux file systems do not prescribe an encoding for filenames (they are treated as raw bytes), so assuming they are UTF-8 and normalising them is an error and may cause problems.

On the other hand, the default filesystem on Mac OS X (HFS+) enforces that all filenames are valid UTF-16 in a variant of NFD. If you need to compare file paths, you should do so in a similar format – ideally using the APIs provided by the system, as its NFD form is tied to an older version of Unicode.
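
As a sketch (the helper name is made up for this example), here is one way to compare names under NFC, but only once you are sure the names really are UTF-8:

import os
import unicodedata

def find_file(directory, wanted):
    # Hypothetical helper: compare file names under NFC so that a decomposed
    # name returned by the filesystem (as HFS+ does) still matches a composed
    # name typed by the user. On Linux, names are raw bytes, so only do this
    # if you know they are genuinely UTF-8.
    wanted_nfc = unicodedata.normalize("NFC", wanted)
    for name in os.listdir(directory):
        if unicodedata.normalize("NFC", name) == wanted_nfc:
            return os.path.join(directory, name)
    return None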


