PHP Transliteration

Cyrillic transliteration in PHP

Try the following code:

$textcyr = "Тествам с кирилица";
$textlat = "I pone dotuk raboti!";
// Parallel arrays: $cyr[$i] is transliterated as $lat[$i].
$cyr = [
    'Љ', 'Њ', 'Џ', 'џ', 'ш', 'ђ', 'ч', 'ћ', 'ж', 'љ', 'њ', 'Ш', 'Ђ', 'Ч', 'Ћ', 'Ж','Ц','ц',
    'а','б','в','г','д','е','ё','ж','з','и','й','к','л','м','н','о','п', 'р','с','т','у','ф','х','ц','ч','ш','щ','ъ','ы','ь','э','ю','я',
    'А','Б','В','Г','Д','Е','Ё','Ж','З','И','Й','К','Л','М','Н','О','П', 'Р','С','Т','У','Ф','Х','Ц','Ч','Ш','Щ','Ъ','Ы','Ь','Э','Ю','Я'
];
$lat = [
    'Lj', 'Nj', 'Dž', 'dž', 'š', 'đ', 'č', 'ć', 'ž', 'lj', 'nj', 'Š', 'Đ', 'Č', 'Ć', 'Ž','C','c',
    'a','b','v','g','d','e','io','zh','z','i','y','k','l','m','n','o','p', 'r','s','t','u','f','h','ts','ch','sh','sht','a','i','y','e','yu','ya',
    'A','B','V','G','D','E','Io','Zh','Z','I','Y','K','L','M','N','O','P', 'R','S','T','U','F','H','Ts','Ch','Sh','Sht','A','I','Y','E','Yu','Ya'
];
// str_replace() applies the pairs one after another, in array order.
$textcyr = str_replace($cyr, $lat, $textcyr);
$textlat = str_replace($lat, $cyr, $textlat);
echo("$textcyr $textlat");
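
Because str_replace() works through the pairs in order, a short Latin key such as 't' or 's' is consumed before a longer one such as 'ts' or 'sht' gets a chance to match in the reverse direction. A possible alternative, sketched here under the assumption that a single associative map is acceptable, is strtr(), which tries the longest key first and never re-scans text it has already replaced. The map below is a deliberately tiny, hypothetical subset for illustration, not a complete alphabet.

```php
<?php
// Hypothetical, incomplete map used only to illustrate the technique.
$cyrToLat = [
    'ж' => 'zh', 'з' => 'z', 'с' => 's', 'т' => 't', 'х' => 'h',
    'ц' => 'ts', 'ч' => 'ch', 'ш' => 'sh', 'щ' => 'sht',
    'а' => 'a', 'и' => 'i', 'е' => 'e',
];
// All values are unique, so the map can simply be flipped.
$latToCyr = array_flip($cyrToLat);

$forward = strtr('щастие', $cyrToLat);
$back    = strtr($forward, $latToCyr);

echo $forward, "\n"; // shtastie
echo $back, "\n";    // щастие ('sht' wins over 's', 't', 'h')
```

The reverse direction remains ambiguous in general (for example 'шт' and 'щ' both become 'sht'), so a round trip is not guaranteed for every input.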

PHP Transliteration

You can use iconv, which has a special transliteration encoding.

When the string "//TRANSLIT" is appended to tocode, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several characters that look similar to the original character.

-- http://www.gnu.org/software/libiconv/documentation/libiconv/iconv_open.3.html

See here for a complete example that matches your use case.
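
As a minimal sketch of the //TRANSLIT suffix in PHP: the sample string and the locale name are assumptions, and the exact output depends on the iconv implementation (glibc or libiconv) and on the current locale, so no expected output is given.

```php
<?php
// Assumption: the input is valid UTF-8 and plain ASCII is the target.
// With glibc the approximation depends on LC_CTYPE; the locale named
// here is only an example and must be installed on the system.
setlocale(LC_CTYPE, 'en_US.UTF-8');

$text  = 'Crème brûlée';
$ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

if ($ascii === false) {
    // iconv() returns false when the conversion fails,
    // e.g. when the implementation does not support //TRANSLIT.
    $ascii = $text;
}

echo $ascii, PHP_EOL;
```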

PHP transliterate specify locale

Yes, Han-Latin means pinyin. ICU transliterators come from CLDR (I'll update the userguide to make this clear). ICU can already convert kana (hiragana/katakana) to Latin, but kanji have more than one reading, so you won't find what you are looking for with a simple table-based conversion.

Edit: to summarize, ICU will not do what you want without custom rules, and writing your own rules is unlikely to be simple because of how the Japanese language works.
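
To make the difference concrete, here is a small sketch using PHP's intl extension. It assumes the installed ICU data ships the Hiragana-Latin and Han-Latin transliterators; the sample words are my own.

```php
<?php
// Requires the intl extension.
$kanaToLatin = Transliterator::create('Hiragana-Latin');
$hanToLatin  = Transliterator::create('Han-Latin');

if ($kanaToLatin === null || $hanToLatin === null) {
    exit("These transliterators are not available in this ICU build\n");
}

// Kana are syllabic, so a table-based conversion works.
$kana = $kanaToLatin->transliterate('ひらがな');
echo $kana, "\n"; // hiragana

// Han-Latin produces a pinyin (Chinese) reading, not a Japanese one.
$han = $hanToLatin->transliterate('日本');
echo $han, "\n";
```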

PHP convert cyrillic

You can take the Drupal Transliteration module (http://drupal.org/project/transliteration) and adapt it to your project. It is one of the best implementations of transliteration.

Also you can transliterate using iconv:

echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text), PHP_EOL;

Exclude specific characters from Transliterator conversion

Given your example with input and output:

$transliterator = Transliterator::create("Any-Latin; Latin-ASCII");
$str = "AŠAàèìòù Chén Hǎi yáo München Faißt Финиш 国内 - 镜像";
echo $transliterator->transliterate($str), "\n";

ASAaeiou Chen Hai yao Munchen Faisst Finis guo nei - jing xiang

Applying the transliteration only to the segments that do not match the characters you want to keep (the Italian accented characters [àèìòù]) should give the desired result.

One option is to use preg_replace_callback for that.

It requires a callback that applies the transliteration:

$transliterate = static function (array $match) use ($transliterator) {
    return $transliterator->transliterate($match[0]);
};

It also requires a pattern that matches everything except the characters to keep. The pattern must be properly defined and compatible with Unicode:

([^\xE0\xE8\xEC\xF2\xF9]+)ui


(...)                : delimiters: the regular expression is inside
u                    : modifier: u - Unicode mode (UTF-8 encoding in
                       PHP, PCRE_UTF8)
i                    : modifier: i - letters in the pattern match
                       both upper and lower case letters
                       (PCRE_CASELESS)

[^...]               : character class: not matching any of the
                       characters (`^`); negated character class
\xE0\xE8\xEC\xF2\xF9 : the Italian accented characters àèìòù written
                       in a stable notation (easy to copy and paste,
                       for example)

Last but not least, the subject to operate on must be compatible with the characters to keep. As there can be many ways to write the same character in Unicode, the input is normalized to be compatible with the PCRE pattern:

echo preg_replace_callback(
    '([^\xE0\xE8\xEC\xF2\xF9]+)ui', 
    $transliterate, 
    Normalizer::normalize($str, Normalizer::NFC)
), "\n";

The output:

ASAàèìòù Chen Hai yao Munchen Faisst Finis guo nei - jing xiang

Example across PHP versions.
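
The pieces above can be bundled into one helper. This is only a sketch: the function name transliterateExcept() and its parameters are hypothetical, and it builds the character class from a literal string instead of \x escapes. It matches case-sensitively (no i modifier), so an upper-case À would still be transliterated unless it is added to the keep list.

```php
<?php
// Requires the intl extension. Hypothetical helper, not part of any library.
function transliterateExcept(string $text, string $keep, Transliterator $t): string
{
    // Escape anything that is special inside a character class.
    $class = preg_quote($keep, '~');

    // Normalize to NFC so precomposed characters in $keep can match.
    $normalized = Normalizer::normalize($text, Normalizer::FORM_C);
    if ($normalized === false) {
        return $text; // not valid UTF-8, leave the input untouched
    }

    // Transliterate every run of characters that is NOT in the keep list.
    return preg_replace_callback(
        '~[^' . $class . ']+~u',
        static function (array $match) use ($t) {
            return $t->transliterate($match[0]);
        },
        $normalized
    );
}

$t = Transliterator::create('Any-Latin; Latin-ASCII');
$result = transliterateExcept('AŠAàèìòù München', 'àèìòù', $t);
echo $result, "\n"; // ASAàèìòù Munchen
```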

Addendum:

\xE0\xE1\xE8\xE9\xEC\xED\xF2\xF3\xF9\xFA lower-case list of Italian accented characters (can be used with the i-modifier)
\xC0\xC1\xC8\xC9\xCC\xCD\xD2\xD3\xD9\xDA\xE0\xE1\xE8\xE9\xEC\xED\xF2\xF3\xF9\xFA lower- and upper-case list of Italian accented characters (can be used without the i-modifier)

PCRE Syntax CHARACTERS (excerpt):

   \xhh       character with hex code hh
   \x{hhh..}  character with hex code hhh..

Link to the full PCRE syntax: https://www.pcre.org/original/doc/html/pcresyntax.html

Transliterate any convertible utf8 char into ascii equivalent

The toAscii() function of Patchwork\Utf8 does exactly this, see:

https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/src/Patchwork/Utf8.php

It leverages iconv and intl's Normalizer to remove accents, split ligatures and do many other generic transliterations.
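
The accent-removal part of that idea can be sketched with intl's Normalizer alone. This is not the Patchwork\Utf8 implementation, only an illustration of the general technique; the helper name stripAccents() is hypothetical, and the sketch does not split ligatures or handle characters such as ß.

```php
<?php
// Requires the intl extension. Hypothetical helper name.
function stripAccents(string $text): string
{
    // NFD splits "é" into "e" followed by U+0301 (combining acute accent).
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $text; // not valid UTF-8, leave the input untouched
    }

    // \p{Mn} matches non-spacing marks, i.e. the detached accents.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

$plain = stripAccents('Crème brûlée');
echo $plain, "\n"; // Creme brulee
```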

Intelligent transliteration in PHP

I know with Japanese at least, you have a set number of letter combinations.

So, you could do something like create a matching array like this

array(
  'oo' => 'おう',
  'oh' => 'おう',
  'ou' => 'おう'
)

You would continue the list from there, making sure you don't match 'su' when it should be 'tsu'.

This would only be a starting point, of course.
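
One way to avoid the 'su' versus 'tsu' problem, assuming a plain lookup table is good enough, is strtr(): when given an array it always tries the longest matching key first. The table below is a tiny hypothetical subset, not a full romaji chart.

```php
<?php
// Hypothetical, incomplete romaji-to-hiragana table for illustration.
$romajiToKana = [
    'tsu' => 'つ',
    'shi' => 'し',
    'su'  => 'す',
    'na'  => 'な',
    'mi'  => 'み',
];

// strtr() prefers the longest key, so 'tsu' is matched before 'su'.
$tsunami = strtr('tsunami', $romajiToKana);
$sushi   = strtr('sushi', $romajiToKana);

echo $tsunami, "\n"; // つなみ
echo $sushi, "\n";   // すし
```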

Machine learning is probably most practical with Chinese...but here's a rough start to hiragana: https://gist.github.com/1154969

Where can I find a list of IDs or rules for the PHP transliterator (Intl)?

The IDs that Transliterator::listIDs() returns are the "basic IDs". The example you gave is a "compound ID". See the ICU docs on this.

You can also create your own rules with Transliterator::createFromRules().
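
A short sketch of the three options side by side. The German-style rule set is an illustrative assumption of mine, not something shipped with ICU, and the sample words are arbitrary.

```php
<?php
// Requires the intl extension.

// 1. Basic IDs known to the installed ICU version.
$ids = Transliterator::listIDs();
echo count($ids), " basic IDs available\n";

// 2. Compound ID: several basic IDs chained with ';'.
$compound = Transliterator::create('Any-Latin; Latin-ASCII');
echo $compound->transliterate('Финиш'), "\n";

// 3. Custom rules in ICU rule syntax: "source > target;".
$german = Transliterator::createFromRules('ä > ae; ö > oe; ü > ue; ß > ss;');
$city = $german->transliterate('München');
echo $city, "\n"; // Muenchen
```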

You can take a look at the predefined rules:

<?php
$a = new ResourceBundle(NULL, sprintf('icudt%dl-translit', INTL_ICU_VERSION), true);

foreach ($a['RuleBasedTransliteratorIDs'] as $name => $v) {
    $file = @$v['file'];
    if (!$file) {
        $file = $v['internal'];
        echo $name, " (direction $file[direction]; internal)\n";
    } else { 
        echo $name, " (direction: $file[direction])\n";
        echo $file['resource'];
    }
    echo "\n--------------\n";
}

After formatting, the result looks like this.
