Unicode Normalization (Form C) in R: Convert All Characters with Accents into Their One-Unicode-Character Form

Unicode normalization (form C) in R: convert all characters with accents into their one-unicode-character form?

Ok, it appears that a package has been developed to enhance and simplify the string manipulation toolbox in R (finally!). It is called stringi and looks very promising. Its documentation is very well written, and in particular I find the pages about encodings and locales much more enlightening than some of the standard R documentation on the subject.

It has Unicode normalization functions, as I was looking for (here form C):

> stri_trans_nfc('\u00e9') == stri_trans_nfc('\u0065\u0301')
[1] TRUE

It also contains a smart comparison function which integrates these normalization questions and lessens the pain of having to think about them:

> stri_compare('\u00e9', '\u0065\u0301')
[1] 0
# i.e. equal ;
# otherwise it returns 1 or -1, i.e. greater or lesser, in the alphabetic order.
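
For readers outside R, the same check can be sketched with Python's standard unicodedata module. This is only an illustration of the idea, not part of stringi, and the helper name canonically_equal is made up for this example.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00e9"          # é as a single code point
decomposed = "\u0065\u0301"  # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                   # False: different code points
print(canonically_equal(composed, decomposed))  # True: identical NFC form
```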

Thanks to the developers, Marek Gągolewski and Bartek Tartanus, and to Kurt Hornik for the info!

How to convert one unicode char to second looking exactly the same?

What you're describing is Unicode canonical equivalence.

I figured that using the utf8mb4_unicode_ci collation would solve this for you. However, the documentation implies that it would not:

A combined character will be considered different from the same character written with a single unicode character in string comparisons, and the two characters are considered to have a different length (for example, as returned by the CHAR_LENGTH() function or in result set metadata).

However, a quick test seems to indicate that this is incorrect:

mysql -u root -e "SELECT 'a<0301>' = 'á' COLLATE utf8mb4_unicode_ci;"
+-----------------------------------------+
| 'á' = 'á' COLLATE utf8mb4_unicode_ci    |
+-----------------------------------------+
|                                       1 |
+-----------------------------------------+

Confusing... though I wonder if that sentence only applies in the context of the two sentences that precede it in the documentation:

Also, combining marks are not fully supported. This affects primarily Vietnamese, Yoruba, and some smaller languages such as Navajo.

So anyway, that may work for you. It is worth noting that utf8mb4_unicode_ci results in relatively loose matching, e.g. á and a will be treated as equivalent:

mysql -u root -e "SELECT 'á' = 'a' COLLATE utf8mb4_unicode_ci;"
+---------------------------------------+
| 'á' = 'a' COLLATE utf8mb4_unicode_ci  |
+---------------------------------------+
|                                     1 |
+---------------------------------------+

Another option, should you wish to have finer control over this, is to normalize text before inserting it into your database (intl extension required). Whether or not you'll want to do this depends on how interested you are in keeping the text in its absolute original form. The normalization process guarantees visual equivalence, so it should be safe to apply. For example, if you were to normalize to the composed form (which would be the most storage efficient, should you care):

<?php

$a = "\u{00E1}";  // á, composed: 0xC3 0xA1 (the \u{...} escape needs PHP 7+)
$b = "a\u{0301}"; // á, decomposed: 0x61 0xCC 0x81

$ca = \Normalizer::normalize($a, \Normalizer::FORM_C);
$cb = \Normalizer::normalize($b, \Normalizer::FORM_C);

$da = \Normalizer::normalize($a, \Normalizer::FORM_D);
$db = \Normalizer::normalize($b, \Normalizer::FORM_D);

var_dump($a === $b); // FALSE
var_dump($a === $cb); // TRUE, $a is already composed
var_dump($ca === $cb); // TRUE, $a is unchanged by normalizer
var_dump($b === $da); // TRUE, $b is already decomposed
var_dump($db === $da); // TRUE, $b is unchanged by normalizer
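
The storage remark can be checked with a short Python sketch. It assumes Python 3.8 or later for unicodedata.is_normalized, and the helper name to_nfc is hypothetical; it is meant as a rough counterpart to the PHP above, not as a replacement for it.

```python
import unicodedata

def to_nfc(text: str) -> str:
    """Return text in NFC, passing already-normalized input through unchanged."""
    # unicodedata.is_normalized requires Python 3.8 or later.
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)

a = "\u00e1"   # composed á
b = "a\u0301"  # decomposed á

print(a.encode("utf-8").hex(), b.encode("utf-8").hex())  # c3a1 61cc81
print(len(to_nfc(b).encode("utf-8")))                    # 2 bytes after NFC
print(to_nfc(a) == to_nfc(b))                            # True
```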

Umlaut matching in R regex

"K<c3><a4>se" encodes the "ä" as unicode character U+00E4 (LATIN SMALL LETTER A WITH DIAERESIS).

"Ka<cc><88>se" encodes the "ä" as unicode characters U+0061 (LATIN SMALL LETTER A) and U+0308 (COMBINING DIAERESIS).

Both are technically correct, but distinct. To compare them, you will need to normalize the strings. You could use the package stringi:

library(stringi)
stri_trans_nfc("Ka\u0308se") == "K\u00E4se"
# [1] TRUE

More information:

Unicode normalization (form C) in R : convert all characters with accents into their one-unicode-character form?
Wikipedia: Unicode Equivalence
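
The effect on pattern matching can be sketched in Python as well (the question is about R, so this only illustrates the principle; the helper name nfc is invented here). The pattern and the text are both brought to NFC before matching.

```python
import re
import unicodedata

def nfc(s: str) -> str:
    """Shorthand for NFC normalization."""
    return unicodedata.normalize("NFC", s)

decomposed = "Ka\u0308se"                # a followed by COMBINING DIAERESIS
pattern = re.compile(nfc("K\u00e4se"))   # pattern uses the single code point ä

print(bool(pattern.search(decomposed)))       # False: raw text does not match
print(bool(pattern.search(nfc(decomposed))))  # True: matches after normalization
```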

R - Special characters reading with read_fwf

You can fix this by changing the encoding via locale to LATIN1:

library(readr)
fw <- fwf_widths(c(2,13,2), col_names = c('A','B','C'))
x <- read_fwf('00StackOvérflow00\n',
                      col_positions = fw, locale = locale(encoding = 'LATIN1'))

Returning:

# A tibble: 1 x 3
  A     B             C    
  <chr> <chr>         <chr>
1 00    StackOvérflow 00
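
Why the encoding matters can be sketched in Python. The byte string below is an assumption: the answer does not show the raw file contents, so this supposes the é was stored as the single Latin-1 byte 0xE9.

```python
raw = b"00StackOv\xe9rflow00\n"  # assumed file bytes, é stored as 0xE9

# As UTF-8 this is invalid: 0xE9 announces a multi-byte sequence that never comes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

line = raw.decode("latin-1").rstrip("\n")

# Slice the decoded line into fixed-width columns of 2, 13 and 2 characters.
widths = [2, 13, 2]
fields, start = [], 0
for w in widths:
    fields.append(line[start:start + w])
    start += w

print(utf8_ok)  # False
print(fields)   # ['00', 'StackOvérflow', '00']
```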

Convert special letters to english letters in R

You can use chartr

x <- "ØxxÅxx"
chartr("ØÅ", "OA", x)
[1] "OxxAxx"

And/or gsub

y <- "Æabc"
gsub("Æ", "AE", y)
[1] "AEabc"
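
A rough Python counterpart uses str.maketrans with a dictionary, which handles one-to-one and one-to-many replacements in a single table. The mapping entries are just the ones from the examples above. An explicit table is needed for letters such as Ø, which has no canonical decomposition, so normalization alone would leave it untouched.

```python
import unicodedata

# Explicit replacement table; extend it with whatever letters the data contains.
table = str.maketrans({"Ø": "O", "Å": "A", "Æ": "AE"})

print("ØxxÅxx".translate(table))  # OxxAxx
print("Æabc".translate(table))    # AEabc

# Ø has an empty decomposition, so NFD cannot split it into O plus a mark.
print(repr(unicodedata.decomposition("Ø")))
```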

What is the best way to remove accents (normalize) in a Python unicode string?

How about this:

import unicodedata
def strip_accents(s):
   return ''.join(c for c in unicodedata.normalize('NFD', s)
                  if unicodedata.category(c) != 'Mn')

This works on Greek letters, too:

>>> strip_accents("A \u00c0 \u0394 \u038E")
'A A Δ Υ'

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
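
A minimal sketch of the unicodedata.combining variant, written for Python 3. It is not copied from the other answer. The two tests are close but not identical, since some Mn characters have a combining class of 0.

```python
import unicodedata

def strip_accents(s: str) -> str:
    """Decompose to NFD, then drop characters with a non-zero combining class."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("A \u00c0 \u0394 \u038e"))  # A A Δ Υ
```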

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".

Is it possible to convert between Unicode normal forms in PHP?

Unicode normalization is provided by the intl extension and its Normalizer class.

http://docs.php.net/manual/en/class.normalizer.php

Greek vowels with accents shown as two characters instead of a single one

The root cause: sometimes there are many different ways to represent the same glyph in Unicode. Usually we convert to a canonical form, but there are two canonical normalization forms (decomposed: NFD, and composed: NFC). Apple prefers the first (which was also the originally preferred form in Unicode); most other operating systems prefer the second. Each font also has its own preference (but the shaping library will handle that).

You can transform your text into the canonical composed form (NFC), but not every glyph can be turned into a single character: some combinations of accent and base character require two code points (or more if you have multiple accents).
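
Both cases can be seen in a small Python sketch. The sample strings are chosen for illustration and the helper name nfc_len is made up: alpha with an acute accent has a precomposed code point, while q with a tilde does not.

```python
import unicodedata

def nfc_len(s: str) -> int:
    """Number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", s))

print(nfc_len("\u03b1\u0301"))  # 1: alpha + combining acute composes to U+03AC
print(nfc_len("q\u0303"))       # 2: no precomposed q with tilde exists
```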
