How to Compare Unicode Characters That "Look Alike"

How to compare Unicode characters that look alike?

In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider if it's appropriate for your use case — see Jukka K. Korpela's comment.

For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:

Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)

This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.

So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:

using System;
using System.Text;

class Program
{
static void Main(string[] args)
{
char first = 'μ';
char second = 'µ';

// Technically you only need to normalize U+00B5 to obtain U+03BC, but
// if you're unsure which character is which, you can safely normalize both
string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);

Console.WriteLine(first.Equals(second)); // False
Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
}
}

For details on Unicode normalization and the different normalization forms refer to System.Text.NormalizationForm and the Unicode spec.

Find characters that are similar glyphically in Unicode?

This won't work for all conditions, but one way to get rid of most accents is to convert the characters to their decomposed form, then throw away the combining accents:

# coding: utf8
import unicodedata as ud
s=u'U, Ù, Ú, Û, Ü, Ũ, Ū, Ŭ, Ů, Ű, Ų, Ư, Ǔ, Ǖ, Ǘ, Ǚ, Ǜ, Ụ, Ủ, Ứ, Ừ, Ử, Ữ, Ự'
print ud.normalize('NFD',s).encode('ascii','ignore')

Output

U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U

To find accent characters, use something like:

import unicodedata as ud
import string

def asc(unichr):
return ud.normalize('NFD',unichr).encode('ascii','ignore')

U = u''.join(unichr(i) for i in xrange(65536))
for c in string.letters:
print u''.join(u for u in U if asc(u) == c)

Output

aàáâãäåāăąǎǟǡǻȁȃȧḁạảấầẩẫậắằẳẵặ
bḃḅḇ
cçćĉċčḉ
dďḋḍḏḑḓ
eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ
fḟ
:
etc.

Comparing two letters returns false, but their unicode character is the same

The second one has an invisble character (65030) at position 1.

 console.log("i︆".charCodeAt(0), "i︆".charCodeAt(1))

Javascript string comparison fails when comparing unicode characters

Unlike what some other people here have said, this has nothing to do with encodings. Rather, your two strings use different code points to render the same visual characters.

To solve this correctly, you need to perform Unicode normalization on the two strings before comparing them. Unforunately, JavaScript doesn't have this functionality built in. Here is a JavaScript library that can perform the normalization for you: https://github.com/walling/unorm

How to compare Unicode strings so that strings that LOOK the same compare equally?

The process is called normalization, supported by any decent Unicode library. Backgrounder is here.

Compare strings with unicode characters in python

Based on the comments it sounds like this is what you're looking for:

import unicodedata

foo = 'Rit\u0113'
bar = 'Rite\u0304'

print(foo, bar)

print(unicodedata.normalize('NFD', foo))
print(unicodedata.normalize('NFD', bar))

assert unicodedata.normalize('NFD', foo) == unicodedata.normalize('NFD', bar)

I selected NFD as the form, but you may prefer NFC.



Related Topics



Leave a reply



Submit