Alphabetize Arabic and Japanese Text That Is in Unicode

Alphabetize Arabic and Japanese text that is in Unicode?

Unicode code points are not listed in alphabetic order (Z < a, for example), but they try to be approximately in that order anyway. There is a canonical unicode order, defined by the Unicode Collation Algorithm and they are also language-specific ordering (french order is not exacly the same as german or czech order, even with the same alphabet), which can be specified in locale information. I think the ICU library contains the language specific algorithms you are looking for.

javascript sort with unicode

If the locale in your system is set correctly then you can use localeCompare method instead of greater-than operator to compare the strings - this method is locale aware.

function sortComparer(a,b){
    return a.title.localeCompare(b.title)
};

How to sort latin after local language in python 3?

Interesting question. Here’s some sample code that classifies strings
according to the writing system of the first character.

import unicodedata

words = ["Japanese",         # English
         "Nihongo",          # Japanese, rōmaji
         "にほんご",          # Japanese, hiragana
         "ニホンゴ",          # Japanese, katakana
         "日本語",            # Japanese, kanji
         "Японский язык",    # Russian
         "जापानी भाषा"        # Hindi (Devanagari)
]

def wskey(s):
    """Return a sort key that is a tuple (n, s), where n is an int based
    on the writing system of the first character, and s is the passed
    string. Writing systems not addressed (Devanagari, in this example)
    go at the end."""

    sort_order = {
        # We leave gaps to make later insertions easy
        'CJK' : 100,
        'HIRAGANA' : 200,
        'KATAKANA' : 200,  # hiragana and katakana at same level
        'CYRILLIC' : 300,
        'LATIN' : 400
    }

    name = unicodedata.name(s[0], "UNKNOWN")
    first = name.split()[0]
    n = sort_order.get(first, 999999);
    return (n, s)

words.sort(key=wskey)
for s in words:
    print(s)

In this example, I am sorting hiragana and katakana (the two Japanese
syllabaries) at the same level, which means pure-katakana strings will
always come after pure-hiragana strings. If we wanted to sort them such
that the same syllable (e.g., に and ニ) sorted together, that would be
trickier.

Unicode range for Japanese

CJK(Chinese Japanese and Korean), Hiragana and Katakana(include Halfwidth Katakana)

http://www.unicode.org/charts/

Unicode Characters that can be used to trick a string sorter?

Zero-width space (U+200B) should probably do what you want. From the Unicode spec:

Zero Width Space. The U+200B ZERO WIDTH SPACE indicates a line break opportunity, except that it has no width. Zero-width space characters are intended to be used in languages that have no visible word spacing to represent line break opportunities, such as Thai, Khmer, and Japanese.

Should be in most fonts you run into, but YMMV.

Database of readings for Japanese words

Assuming what you actually mean is you want a computer readable offline Japanese dictionary then look at JMDict (or the older edict) are Japanese dictionaries which have reading entries (in Kanji/Kana) with an associated kana reading element. The JMDict is in XML so it is pretty simple to use with most projects.

Multilingual text sorting in Perl, on Windows, using locale

Assuming that your starting point is Unicode, because you have been very careful to decode all incoming data no matter what its native encoding might be, then it is easy to use to the Unicode::Collate module as a starting point.

If you want locale tailoring, then you probably want to start with Unicode::Collate::Locale instead.

Decoding into Unicode

If you run in an all-UTF8 environment, this is easy, but if you are subject to the vicissitudes of random so-called “locales” (or even worse, the ugly things Microsoft calls “code pages”), then you might want to get the CPAN Encode::Locale module to help you out. For example:

 use Encode;
 use Encode::Locale;

 # use "locale" as an arg to encode/decode
 @ARGV = map { decode(locale =>  $_) } @ARGV;

 # or as a stream for binmode or open
 binmode $some_fh, ":encoding(locale)";

 binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
 binmode STDOUT, ":encoding(console_out)"  if -t STDOUT;
 binmode STDERR, ":encoding(console_out)"  if -t STDERR;

(If it were me, I would just use ":utf8" for the output.)

Standard Collation, plus locales and tailoring

The point is, once you have everything decoded into internal Perl format, you can use Unicode::Collate and Unicode::Collate::Locale on it. These can be really easy:

   use v5.14;
   use utf8;
   use Unicode::Collate;
   my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
   @exes = Unicode::Collate->new->sort(@exes);
   say "@exes";

   # prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹

Or they can be pretty fancy. Here is one that tries to deal with book titles: it strips leading articles and zero-pads numbers.

my $collator = Unicode::Collate->new(
    --upper_before_lower => 1,
    --preprocess => {
        local $_ = shift;
        s/^ (?: The | An? ) \h+ //x;  # strip articles
        s/ ( \d+ ) / sprintf "%020d", $1 /xeg;
        return $_;
    };
);

Now just use that object’s sort method to sort with.

Sometimes you need to turn the sort inside out. For example:

 my $collator = Unicode::Collate->new();
 for my $rec (@recs) {
     $rec->{NAME_key} = 
        $collator->getSortKey( $rec->{NAME} );
 }
 @srecs = sort {
     $b->{AGE}       <=>  $a->{AGE}
                     ||
     $a->{NAME_key}  cmp  $b->{NAME_key}
 } @recs;

The reason you have to do that is because you are sorting on a record with various fields. The binary sort key allows you to use the cmp operator on data that has been through your chosen/custom collator object.

The full constructor for the collator object has all this for a formal syntax:

      $Collator = Unicode::Collate->new(
         UCA_Version => $UCA_Version,
         alternate => $alternate, # alias for 'variable'
         backwards => $levelNumber, # or \@levelNumbers
         entry => $element,
         hangul_terminator => $term_primary_weight,
         highestFFFF => $bool,
         identical => $bool,
         ignoreName => qr/$ignoreName/,
         ignoreChar => qr/$ignoreChar/,
         ignore_level2 => $bool,
         katakana_before_hiragana => $bool,
         level => $collationLevel,
         minimalFFFE => $bool,
         normalization  => $normalization_form,
         overrideCJK => \&overrideCJK,
         overrideHangul => \&overrideHangul,
         preprocess => \&preprocess,
         rearrange => \@charList,
         rewrite => \&rewrite,
         suppress => \@charList,
         table => $filename,
         undefName => qr/$undefName/,
         undefChar => qr/$undefChar/,
         upper_before_lower => $bool,
         variable => $variable,
      );

But you usually don’t have to worry about almost any of those. In fact, if you want country-specific locale tailoring using the CLDR data, you should just use Unicode::Collate::Locale, which adds exactly one more parameter to the constructor: locale => $country_code.

 use Unicode::Collate::Locale;
 $coll = Unicode::Collate::Locale->
           new(locale => "fr");
 @french_text = $coll->sort(@french_text);

See how easy that is?

But you can do other cool things, too.

 use Unicode::Collate::Locale;
 my $Collator = new Unicode::Collate::Locale::
                 locale => "de__phonebook",
                 level  => 1,
                 normalization => undef,
                ;

 my $full = "Ich müß Perl studieren.";
 my $sub = "MUESS";
 if (my ($pos,$len) = $Collator->index($full, $sub)) {
     my $match = substr($full, $pos, $len);
     say "Found match of literal ‹$sub› in ‹$full› as ‹$match›";

 }

When run, that says:

 Found match of literal ‹MUESS› in ‹Ich müß Perl studieren.› as ‹müß›

Here are the available locales as of v0.96 of the Unicode::Collate::Locale module, taken from its manpage:

 locale name       description
--------------------------------------------------------------
 af                Afrikaans
 ar                Arabic
 as                Assamese
 az                Azerbaijani (Azeri)
 be                Belarusian
 bg                Bulgarian
 bn                Bengali
 bs                Bosnian
 bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
 ca                Catalan
 cs                Czech
 cy                Welsh
 da                Danish
 de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
 ee                Ewe
 eo                Esperanto
 es                Spanish
 es__traditional   Spanish ('ch' and 'll' as a grapheme)
 et                Estonian
 fa                Persian
 fi                Finnish (v and w are primary equal)
 fi__phonebook     Finnish (v and w as separate characters)
 fil               Filipino
 fo                Faroese
 fr                French
 gu                Gujarati
 ha                Hausa
 haw               Hawaiian
 hi                Hindi
 hr                Croatian
 hu                Hungarian
 hy                Armenian
 ig                Igbo
 is                Icelandic
 ja                Japanese [1]
 kk                Kazakh
 kl                Kalaallisut
 kn                Kannada
 ko                Korean [2]
 kok               Konkani
 ln                Lingala
 lt                Lithuanian
 lv                Latvian
 mk                Macedonian
 ml                Malayalam
 mr                Marathi
 mt                Maltese
 nb                Norwegian Bokmal
 nn                Norwegian Nynorsk
 nso               Northern Sotho
 om                Oromo
 or                Oriya
 pa                Punjabi
 pl                Polish
 ro                Romanian
 ru                Russian
 sa                Sanskrit
 se                Northern Sami
 si                Sinhala
 si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
 sk                Slovak
 sl                Slovenian
 sq                Albanian
 sr                Serbian
 sr_Latn           Serbian in Latin (tailored as Croatian)
 sv                Swedish (v and w are primary equal)
 sv__reformed      Swedish (v and w as separate characters)
 ta                Tamil
 te                Telugu
 th                Thai
 tn                Tswana
 to                Tonga
 tr                Turkish
 uk                Ukrainian
 ur                Urdu
 vi                Vietnamese
 wae               Walser
 wo                Wolof
 yo                Yoruba
 zh                Chinese
 zh__big5han       Chinese (ideographs: big5 order)
 zh__gb2312han     Chinese (ideographs: GB-2312 order)
 zh__pinyin        Chinese (ideographs: pinyin order) [3]
 zh__stroke        Chinese (ideographs: stroke order) [3]
 zh__zhuyin        Chinese (ideographs: zhuyin order) [3]

   Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian),
   it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu
   (Zulu).

   Note

   [1] ja: Ideographs are sorted in JIS X 0208 order.  Fullwidth and halfwidth forms are identical to their regular form.  The
   difference between hiragana and katakana is at the 4th level, the comparison also requires "(variable => 'Non-ignorable')",
   and then "katakana_before_hiragana" has no effect.

   [2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary
   (level 2) greater than, the corresponding hangul syllable.

   [3] zh__pinyin, zh__stroke and zh__zhuyin: implemented alt='short', where a smaller number of ideographs are tailored.

   Note: 'pinyin' is in latin, 'zhuyin' is in bopomofo.

So in summary, the main trick is to get your local data decoded into a uniform Unicode representation, then use deterministic sorting, possibly tailored, that doesn’t rely on random settings of the user’s console window for correct behavior.

_{Note: All these examples, apart from the manpage citation, are lovingly lifted from the 4^th edition of Programming Perl, by kind permission of its author. :)}

Alphabetize Arabic and Japanese Text That Is in Unicode