Alphabetize Arabic and Japanese Text That Is in Unicode

Alphabetize Arabic and Japanese text that is in Unicode?

Unicode code points are not in alphabetical order (Z < a, for example), although within a single script they roughly approximate it. There is a canonical default order, defined by the Unicode Collation Algorithm, and there are also language-specific orderings (French order is not exactly the same as German or Czech order, even though they share an alphabet), which can be specified through locale information. I think the ICU library contains the language-specific algorithms you are looking for.
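To see why raw code point order is not alphabetical order, here is a small Python sketch. The word list is made up for illustration, and the case-folding key is only a crude workaround, not an implementation of the Unicode Collation Algorithm; a library such as ICU is still the right tool for that.

```python
# Made-up sample words; any strings would do.
words = ["zebra", "Zulu", "apple", "\u00e9clair"]

# Plain sorting compares code points:
# 'Z' (U+005A) < 'a' (U+0061) < 'z' (U+007A) < 'é' (U+00E9).
by_code_point = sorted(words)
print(by_code_point)   # ['Zulu', 'apple', 'zebra', 'éclair']

# Case folding repairs the upper/lower split, but 'é' still lands after 'z'.
by_casefold = sorted(words, key=str.casefold)
print(by_casefold)     # ['apple', 'zebra', 'Zulu', 'éclair']
```

Neither order is what a French or English reader would expect for "éclair", which is exactly the gap that collation tables fill.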

JavaScript sort with Unicode

If the locale in your system is set correctly, you can use the localeCompare method instead of the greater-than operator to compare the strings; this method is locale-aware.

    function sortComparer(a, b) {
        return a.title.localeCompare(b.title);
    }
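For comparison, the same idea can be sketched in Python, whose standard `locale` module offers a locale-aware sort key (`locale.strxfrm`). The record list is invented for illustration, and the resulting order depends entirely on which locales are installed on the system, so no expected output is shown.

```python
import locale

# Use the user's default collation locale; keep the current one if that fails.
try:
    locale.setlocale(locale.LC_COLLATE, "")
except locale.Error:
    pass

# Hypothetical records, mirroring objects with a `title` field.
items = [{"title": "Zebra"}, {"title": "apple"}, {"title": "\u00c9clair"}]

# strxfrm turns a string into a key whose ordinary comparison
# follows the collation rules of the active locale.
items_sorted = sorted(items, key=lambda item: locale.strxfrm(item["title"]))
for item in items_sorted:
    print(item["title"])
```

In the plain "C" locale this degenerates to code point order, which is why the locale has to be set correctly first.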

How to sort Latin after local language in Python 3?

Interesting question. Here’s some sample code that classifies strings
according to the writing system of the first character.

    import unicodedata

    words = ["Japanese",        # English
             "Nihongo",         # Japanese, rōmaji
             "にほんご",         # Japanese, hiragana
             "ニホンゴ",         # Japanese, katakana
             "日本語",           # Japanese, kanji
             "Японский язык",   # Russian
             "जापानी भाषा"      # Hindi (Devanagari)
             ]

    def wskey(s):
        """Return a sort key that is a tuple (n, s), where n is an int based
        on the writing system of the first character, and s is the passed
        string. Writing systems not addressed (Devanagari, in this example)
        go at the end."""

        sort_order = {
            # We leave gaps to make later insertions easy
            'CJK'      : 100,
            'HIRAGANA' : 200,
            'KATAKANA' : 200,  # hiragana and katakana at same level
            'CYRILLIC' : 300,
            'LATIN'    : 400
        }

        name = unicodedata.name(s[0], "UNKNOWN")
        first = name.split()[0]
        n = sort_order.get(first, 999999)
        return (n, s)

    words.sort(key=wskey)
    for s in words:
        print(s)

In this example, I am sorting hiragana and katakana (the two Japanese
syllabaries) at the same level, which means pure-katakana strings will
always come after pure-hiragana strings. If we wanted to sort them such
that the same syllable (e.g., に and ニ) sorted together, that would be
trickier.
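One possible approach to that trickier case, as a rough sketch: fold katakana to hiragana before comparing, relying on the fact that the katakana letters U+30A1 to U+30F6 sit exactly 0x60 above their hiragana counterparts. The helper names `kata_to_hira` and `kana_key` are made up here, and the sketch ignores halfwidth katakana, the prolonged sound mark ー, and proper dictionary ordering of voiced and small kana.

```python
def kata_to_hira(s):
    """Map katakana U+30A1..U+30F6 to hiragana, which sits 0x60 lower."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in s
    )

def kana_key(s):
    # Primary key: the hiragana-folded string. Tie-breaker: the original
    # string, so the hiragana spelling comes before the katakana one.
    return (kata_to_hira(s), s)

kana_words = ["ニホンゴ", "あい", "にほんご", "アイ"]
print(sorted(kana_words, key=kana_key))
# ['あい', 'アイ', 'にほんご', 'ニホンゴ']
```

This key could be used as the second element of the tuple returned by a writing-system key, so that the two syllabaries interleave within their shared level.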

Unicode range for Japanese

CJK (Chinese, Japanese and Korean) ideographs, Hiragana, and Katakana (including Halfwidth Katakana). The exact ranges are listed in the Unicode code charts:

http://www.unicode.org/charts/
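As a minimal sketch of how those blocks can be used, the following Python function classifies a character by code point range. The range table is an assumption of this example: it covers only the main CJK Unified Ideographs block (extension blocks are left out), and kanji in that block are shared with Chinese and Korean, so a hit there does not prove the text is Japanese.

```python
# (name, first code point, last code point), taken from the Unicode charts.
JAPANESE_RANGES = [
    ("hiragana", 0x3040, 0x309F),
    ("katakana", 0x30A0, 0x30FF),
    ("cjk", 0x4E00, 0x9FFF),                # CJK Unified Ideographs, main block only
    ("halfwidth katakana", 0xFF65, 0xFF9F),
]

def classify(ch):
    """Return the name of the range containing ch, or 'other'."""
    cp = ord(ch)
    for name, lo, hi in JAPANESE_RANGES:
        if lo <= cp <= hi:
            return name
    return "other"

# \uff86 is HALFWIDTH KATAKANA LETTER NI.
for ch in "にニ日\uff86A":
    print(ch, classify(ch))
```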

Unicode Characters that can be used to trick a string sorter?

Zero-width space (U+200B) should probably do what you want. From the Unicode spec:

Zero Width Space. The U+200B ZERO WIDTH SPACE indicates a line break opportunity, except that it has no width. Zero-width space characters are intended to be used in languages that have no visible word spacing to represent line break opportunities, such as Thai, Khmer, and Japanese.

It should be present in most fonts you run into, but YMMV.
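A small Python sketch of the effect, assuming the sorter compares raw code points (the list of names is made up). A collation-aware sorter may treat U+200B as ignorable, in which case the trick would not work.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE

names = ["banana", ZWSP + "apple", "cherry"]
ordered = sorted(names)

# The second entry looks like 'apple' on screen, yet it sorts last
# because U+200B is greater than any ASCII letter.
print([n.replace(ZWSP, "<ZWSP>") for n in ordered])
# ['banana', 'cherry', '<ZWSP>apple']
```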

Database of readings for Japanese words

Assuming that what you actually want is a computer-readable offline Japanese dictionary, look at JMdict (or the older EDICT). These are Japanese dictionaries whose entries (in kanji/kana) each have an associated kana reading element. JMdict is in XML, so it is pretty simple to use in most projects.
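A minimal sketch of looking up readings with Python's standard XML parser. The XML fragment is hand-written in the JMdict layout (`keb` holds a kanji form, `reb` a kana reading); the real file is far larger, and the helper name `readings` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the JMdict layout, not an excerpt of the real file.
SAMPLE = """<JMdict>
  <entry>
    <k_ele><keb>日本語</keb></k_ele>
    <r_ele><reb>にほんご</reb></r_ele>
    <sense><gloss>Japanese (language)</gloss></sense>
  </entry>
</JMdict>"""

def readings(root, word):
    """Return the kana readings of every entry whose kanji form equals word."""
    found = []
    for entry in root.iter("entry"):
        kebs = [k.text for k in entry.findall("k_ele/keb")]
        if word in kebs:
            found.extend(r.text for r in entry.findall("r_ele/reb"))
    return found

root = ET.fromstring(SAMPLE)
print(readings(root, "日本語"))   # ['にほんご']
```

For the full dictionary a streaming parse (for example `ET.iterparse`) would likely be kinder to memory than loading the whole tree.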

Multilingual text sorting in Perl, on Windows, using locale

Assuming that your starting point is Unicode, because you have been very careful to decode all incoming data no matter what its native encoding might be, then it is easy to use the Unicode::Collate module as a starting point.

If you want locale tailoring, then you probably want to start with Unicode::Collate::Locale instead.

Decoding into Unicode

If you run in an all-UTF8 environment, this is easy, but if you are subject to the vicissitudes of random so-called “locales” (or even worse, the ugly things Microsoft calls “code pages”), then you might want to get the CPAN Encode::Locale module to help you out. For example:

    use Encode;
    use Encode::Locale;

    # use "locale" as an arg to encode/decode
    @ARGV = map { decode(locale => $_) } @ARGV;

    # or as a stream for binmode or open
    binmode $some_fh, ":encoding(locale)";

    binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
    binmode STDOUT, ":encoding(console_out)" if -t STDOUT;
    binmode STDERR, ":encoding(console_out)" if -t STDERR;

(If it were me, I would just use ":utf8" for the output.)
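The same decode-at-the-boundary step can be sketched in Python. The helper `decode_input` is hypothetical; when no encoding is given it falls back to whatever the current locale prefers, which is the rough counterpart of the "locale" pseudo-encoding above.

```python
import locale

def decode_input(raw, encoding=None):
    """Decode bytes to str, defaulting to the locale's preferred encoding."""
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding)

# Explicit encodings keep the demonstration independent of the machine.
from_cp1252 = decode_input(b"caf\xe9", "cp1252")
from_utf8 = decode_input(b"caf\xc3\xa9", "utf-8")
print(from_cp1252, from_utf8)   # café café
```

Once both inputs are ordinary strings they compare equal, whatever bytes they arrived as.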


Standard Collation, plus locales and tailoring

The point is, once you have everything decoded into internal Perl format, you can use Unicode::Collate and Unicode::Collate::Locale on it. These can be really easy:

    use v5.14;
    use utf8;
    use Unicode::Collate;
    my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
    @exes = Unicode::Collate->new->sort(@exes);
    say "@exes";

    # prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹

Or they can be pretty fancy. Here is one that tries to deal with book titles: it strips leading articles and zero-pads numbers.

    my $collator = Unicode::Collate->new(
        upper_before_lower => 1,
        preprocess => sub {
            local $_ = shift;
            s/^ (?: The | An? ) \h+ //x;            # strip leading articles
            s/ ( \d+ ) / sprintf "%020d", $1 /xeg;  # zero-pad numbers to 20 digits
            return $_;
        },
    );

Now just use that object’s sort method to sort with.
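To make the preprocessing step concrete, here is the same transformation sketched in Python. The function name `title_key` and the sample titles are made up, and the final sort compares plain code points rather than going through a collator, so it only illustrates what the preprocessing contributes.

```python
import re

def title_key(title):
    """Strip a leading article, then zero-pad every run of digits to 20 places."""
    title = re.sub(r"^(?:The|An?)[ \t]+", "", title)
    return re.sub(r"\d+", lambda m: m.group().zfill(20), title)

titles = ["The 39 Steps", "A Tale of Two Cities", "1984",
          "Catch-22", "An Ideal Husband"]
print(sorted(titles, key=title_key))
# ['The 39 Steps', '1984', 'Catch-22', 'An Ideal Husband', 'A Tale of Two Cities']
```

Zero-padding is what makes 39 sort before 1984: compared as padded strings, the numbers line up digit for digit.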

Sometimes you need to turn the sort inside out. For example:

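    my $collator = Unicode::Collate->new();
    # Compute each record's binary sort key once, up front.
    for my $rec (@recs) {
        $rec->{NAME_key} =
            $collator->getSortKey( $rec->{NAME} );
    }
    # Oldest first; ties broken by collated name.
    my @srecs = sort {
        $b->{AGE}      <=> $a->{AGE}
                       ||
        $a->{NAME_key} cmp $b->{NAME_key}
    } @recs;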

You have to do that because you are sorting records with several fields. The binary sort key lets you use the cmp operator on data that has been through your chosen or custom collator object.
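The precomputed-key pattern carries over to other languages. Below is a Python sketch with hypothetical records; note that `name_key` is only a crude stand-in (compatibility decomposition plus case folding), not a real collation key such as the one getSortKey produces.

```python
import unicodedata

def name_key(name):
    # Crude stand-in for a collator's sort key, NOT the UCA.
    return unicodedata.normalize("NFKD", name).casefold()

# Hypothetical records with NAME and AGE fields.
recs = [
    {"NAME": "\u00e9mile", "AGE": 30},
    {"NAME": "Zoe", "AGE": 41},
    {"NAME": "Emma", "AGE": 30},
]

# Compute each key once, then sort by age descending, name key ascending.
for rec in recs:
    rec["NAME_key"] = name_key(rec["NAME"])
srecs = sorted(recs, key=lambda r: (-r["AGE"], r["NAME_key"]))
print([r["NAME"] for r in srecs])   # ['Zoe', 'Emma', 'émile']
```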

The full constructor for the collator object has all this for a formal syntax:

    $Collator = Unicode::Collate->new(
        UCA_Version => $UCA_Version,
        alternate => $alternate, # alias for 'variable'
        backwards => $levelNumber, # or \@levelNumbers
        entry => $element,
        hangul_terminator => $term_primary_weight,
        highestFFFF => $bool,
        identical => $bool,
        ignoreName => qr/$ignoreName/,
        ignoreChar => qr/$ignoreChar/,
        ignore_level2 => $bool,
        katakana_before_hiragana => $bool,
        level => $collationLevel,
        minimalFFFE => $bool,
        normalization => $normalization_form,
        overrideCJK => \&overrideCJK,
        overrideHangul => \&overrideHangul,
        preprocess => \&preprocess,
        rearrange => \@charList,
        rewrite => \&rewrite,
        suppress => \@charList,
        table => $filename,
        undefName => qr/$undefName/,
        undefChar => qr/$undefChar/,
        upper_before_lower => $bool,
        variable => $variable,
    );

But you rarely have to worry about most of those. In fact, if you want country-specific locale tailoring using the CLDR data, you should just use Unicode::Collate::Locale, which adds exactly one more parameter to the constructor: locale => $country_code.

    use Unicode::Collate::Locale;
    $coll = Unicode::Collate::Locale->
                new(locale => "fr");
    @french_text = $coll->sort(@french_text);

See how easy that is?

But you can do other cool things, too.

    use Unicode::Collate::Locale;
    my $Collator = Unicode::Collate::Locale->new(
        locale        => "de__phonebook",
        level         => 1,
        normalization => undef,
    );

    my $full = "Ich müß Perl studieren.";
    my $sub  = "MUESS";
    if (my ($pos, $len) = $Collator->index($full, $sub)) {
        my $match = substr($full, $pos, $len);
        say "Found match of literal ‹$sub› in ‹$full› as ‹$match›";
    }

When run, that says:

    Found match of literal ‹MUESS› in ‹Ich müß Perl studieren.› as ‹müß›

Here are the available locales as of v0.96 of the Unicode::Collate::Locale module, taken from its manpage:

    locale name       description
    --------------------------------------------------------------
    af                Afrikaans
    ar                Arabic
    as                Assamese
    az                Azerbaijani (Azeri)
    be                Belarusian
    bg                Bulgarian
    bn                Bengali
    bs                Bosnian
    bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
    ca                Catalan
    cs                Czech
    cy                Welsh
    da                Danish
    de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
    ee                Ewe
    eo                Esperanto
    es                Spanish
    es__traditional   Spanish ('ch' and 'll' as a grapheme)
    et                Estonian
    fa                Persian
    fi                Finnish (v and w are primary equal)
    fi__phonebook     Finnish (v and w as separate characters)
    fil               Filipino
    fo                Faroese
    fr                French
    gu                Gujarati
    ha                Hausa
    haw               Hawaiian
    hi                Hindi
    hr                Croatian
    hu                Hungarian
    hy                Armenian
    ig                Igbo
    is                Icelandic
    ja                Japanese [1]
    kk                Kazakh
    kl                Kalaallisut
    kn                Kannada
    ko                Korean [2]
    kok               Konkani
    ln                Lingala
    lt                Lithuanian
    lv                Latvian
    mk                Macedonian
    ml                Malayalam
    mr                Marathi
    mt                Maltese
    nb                Norwegian Bokmal
    nn                Norwegian Nynorsk
    nso               Northern Sotho
    om                Oromo
    or                Oriya
    pa                Punjabi
    pl                Polish
    ro                Romanian
    ru                Russian
    sa                Sanskrit
    se                Northern Sami
    si                Sinhala
    si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
    sk                Slovak
    sl                Slovenian
    sq                Albanian
    sr                Serbian
    sr_Latn           Serbian in Latin (tailored as Croatian)
    sv                Swedish (v and w are primary equal)
    sv__reformed      Swedish (v and w as separate characters)
    ta                Tamil
    te                Telugu
    th                Thai
    tn                Tswana
    to                Tonga
    tr                Turkish
    uk                Ukrainian
    ur                Urdu
    vi                Vietnamese
    wae               Walser
    wo                Wolof
    yo                Yoruba
    zh                Chinese
    zh__big5han       Chinese (ideographs: big5 order)
    zh__gb2312han     Chinese (ideographs: GB-2312 order)
    zh__pinyin        Chinese (ideographs: pinyin order) [3]
    zh__stroke        Chinese (ideographs: stroke order) [3]
    zh__zhuyin        Chinese (ideographs: zhuyin order) [3]

Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian),
it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu
(Zulu).

Note

[1] ja: Ideographs are sorted in JIS X 0208 order. Fullwidth and halfwidth forms are identical to their regular form. The
difference between hiragana and katakana is at the 4th level, the comparison also requires "(variable => 'Non-ignorable')",
and then "katakana_before_hiragana" has no effect.

[2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary
(level 2) greater than, the corresponding hangul syllable.

[3] zh__pinyin, zh__stroke and zh__zhuyin: implemented alt='short', where a smaller number of ideographs are tailored.

Note: 'pinyin' is in latin, 'zhuyin' is in bopomofo.

So in summary, the main trick is to get your local data decoded into a uniform Unicode representation, then use deterministic sorting, possibly tailored, that doesn’t rely on random settings of the user’s console window for correct behavior.


Note: All these examples, apart from the manpage citation, are lovingly lifted from the 4th edition of Programming Perl, by kind permission of its author. :)


