PHP: Replace Umlauts With Closest 7-Bit ASCII Equivalent in a UTF-8 String

iconv("utf-8","ascii//TRANSLIT",$input);

Extended example

How to remove accents and turn letters into plain ASCII characters?

If you have iconv installed, try this (the example assumes your input string is in UTF-8):

echo iconv('UTF-8', 'ASCII//TRANSLIT', $string);

iconv is a library that converts between all kinds of encodings; it's efficient and included with many PHP distributions by default. Most of all, it's easier and less error-prone than trying to roll your own solution. (Did you know that there's a "Latin letter N with a curl"? Me neither.)
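
As a rough sketch of how the call above might be wrapped for reuse: `to_ascii` is a hypothetical helper name, and the byte-stripping fallback is an assumption, not part of the answer. What `//TRANSLIT` produces depends on the iconv implementation and the active locale, so the replacement chosen for a character such as ü can differ between systems.

```php
<?php
// Hypothetical helper: transliterate a UTF-8 string to 7-bit ASCII with iconv.
function to_ascii(string $utf8): string
{
    // //TRANSLIT asks iconv to approximate characters the target charset lacks.
    // @ silences the notice or warning iconv raises when it cannot convert.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);

    if ($ascii === false) {
        // Assumed fallback: drop every byte outside the 7-bit range.
        $ascii = (string) preg_replace('/[^\x00-\x7F]/', '', $utf8);
    }

    return $ascii;
}
```

Calling `to_ascii("M\xC3\xBCller")` may give, for example, `Mueller` or `M?ller` depending on the platform, so it is worth checking the output on the system you deploy to.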

Convert special character (i.e. Umlaut) to most likely representation in ascii

I find iconv completely unreliable, and I dislike preg_match solutions and big arrays ... so my favorite way is ...

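    function toASCII( $str )
    {
        // strtr() maps byte for byte, so convert to a single-byte encoding first.
        // Windows-1252 (via mbstring) is used because utf8_decode() (deprecated since PHP 8.2)
        // targets ISO-8859-1, which lacks Š, Œ, Ž, š, œ, ž and Ÿ and would turn them into "?".
        // This source file must be saved as UTF-8 for the literal below to be correct.
        return strtr(mb_convert_encoding($str, 'Windows-1252', 'UTF-8'),
            mb_convert_encoding('ŠŒŽšœžŸ¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ', 'Windows-1252', 'UTF-8'),
            'SOZsozYYuAAAAAAACEEEEIIIIDNOOOOOOUUUUYsaaaaaaaceeeeiiiionoooooouuuuyy');
    }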

Convert 2 similarly-looking German characters of different kinds to same ASCII string in PHP

You could first convert your input to UTF-8 using iconv and then apply your conversion to ASCII. To detect the current encoding, you can use mb_detect_encoding.

$aUTF8 = iconv(mb_detect_encoding($a, 'UTF-8, ISO-8859-1', true), 'UTF-8', $a);
$bUTF8 = iconv(mb_detect_encoding($b, 'UTF-8, ISO-8859-1', true), 'UTF-8', $b);

$aASCII = iconv("utf-8", "ascii//TRANSLIT", $aUTF8);
$bASCII = iconv("utf-8", "ascii//TRANSLIT", $bUTF8);

Please note that you might have to add additional encodings to the encoding list of mb_detect_encoding.
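
One thing the snippet above does not handle is mb_detect_encoding returning false when none of the listed encodings match. A minimal sketch of a more defensive variant follows; `normalize_to_utf8` and its `$fallbacks` parameter are hypothetical names, and checking for valid UTF-8 first is a design choice of this sketch, not something the answer prescribes.

```php
<?php
// Hypothetical helper: return $s as UTF-8, guessing among $fallbacks if it is not UTF-8 already.
function normalize_to_utf8(string $s, array $fallbacks = ['ISO-8859-1']): string
{
    // Valid UTF-8 (which includes plain ASCII) is passed through untouched.
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s;
    }

    // Strict detection among the remaining candidates; this is still only a guess.
    $from = mb_detect_encoding($s, $fallbacks, true);
    if ($from === false) {
        throw new InvalidArgumentException('Could not detect the input encoding');
    }

    $converted = iconv($from, 'UTF-8', $s);
    if ($converted === false) {
        throw new RuntimeException('Conversion from ' . $from . ' failed');
    }

    return $converted;
}
```

With the default fallback list, `normalize_to_utf8("M\xFCller")` (Latin-1 bytes) and `normalize_to_utf8("M\xC3\xBCller")` (UTF-8 bytes) both yield the same UTF-8 string, which can then be passed to the ASCII conversion.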

Replace worldwide diacritics characters

You can achieve this by using iconv, available in PHP, and requesting an encoding conversion with transliteration. (This actually works for many different scripts!) If you only want basic European characters, make the target Latin-1, or even ASCII.

From the manual page:

iconv("UTF-8", "ISO-8859-1//TRANSLIT", $text)

What changes my UTF-8 string to ASCII?

In PHP, strings have no associated encoding; they're merely byte arrays. mb_detect_encoding doesn't tell you what encoding the string has; it merely tries to guess it. That means it tries the candidates you give it (your second argument) and tells you the first that is valid.

Your original string probably contains some non-ASCII characters, so ASCII isn't a valid encoding for it, but UTF-8 is. When you're later testing a substring of the original, that substring probably contains only characters which are valid in ASCII, and since ASCII is the first encoding that's tested, that's the guessed result. Any ASCII string is also valid UTF-8, so there's no actual problem or "conversion" which happened.
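
The effect described above can be reproduced with a short sketch. The sample string and the variable names are made up for illustration, and the candidate list `['ASCII', 'UTF-8']` is an assumption about what the asker passed as the second argument.

```php
<?php
// "Grüße aus Berlin", written as raw UTF-8 bytes so this file's own encoding is irrelevant.
$full = "Gr\xC3\xBC\xC3\x9Fe aus Berlin";
$tail = substr($full, -6); // "Berlin": plain ASCII bytes only

$candidates = ['ASCII', 'UTF-8'];

// The full string contains bytes >= 0x80, so ASCII is ruled out.
$fullGuess = mb_detect_encoding($full, $candidates, true); // "UTF-8"

// Both candidates are valid for the substring, so the result is only a guess.
$tailGuess = mb_detect_encoding($tail, $candidates, true);

// The same bytes pass a UTF-8 validity check either way.
$tailIsUtf8 = mb_check_encoding($tail, 'UTF-8'); // true

var_dump($fullGuess, $tailGuess, $tailIsUtf8);
```

Whatever the second guess turns out to be, the bytes of `$tail` were never changed; only the label that mb_detect_encoding attaches to them differs.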

Find specific UTF8 chars independent of php code charset?

  1. You should be in control of what your source code is encoded as; it'd be very weird to suddenly have its encoding change out from under you.
  2. If that is actually a legitimate concern you want to counteract, then you can't even rely on your source code being either Latin-1 or UTF-8; it could be any number of other encodings (though admittedly in practice Latin-1 is a pretty common guess). utf8_encode only converts from Latin-1 (ISO-8859-1), so it is not guaranteed to fix your problem at all.
  3. To be 100% agnostic of your source code file's encoding, denote your characters as raw bytes:

    $search = "\xC3\xA4,\xC3\xB6,\xC3\xBC"; // ä, ö and ü in UTF-8
  4. Note that this still won't guarantee what encoding $string will be in; you'll need to know and/or control its encoding separately from the issue at hand. At some point you just have to nail down the encodings you use; you can't be agnostic of them all the way through.
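
The raw-byte notation from point 3 also works for a replacement table. The following is a minimal sketch, assuming the input is already UTF-8; `replace_umlauts` is a hypothetical name, and the ae/oe/ue/ss spellings follow the common German convention and are not taken from the answer.

```php
<?php
// Hypothetical helper: replace German umlauts and ß, keyed by their UTF-8 byte sequences.
// Because the keys are byte escapes, the encoding of this source file does not matter.
function replace_umlauts(string $utf8): string
{
    static $map = [
        "\xC3\xA4" => 'ae', // ä
        "\xC3\xB6" => 'oe', // ö
        "\xC3\xBC" => 'ue', // ü
        "\xC3\x84" => 'Ae', // Ä
        "\xC3\x96" => 'Oe', // Ö
        "\xC3\x9C" => 'Ue', // Ü
        "\xC3\x9F" => 'ss', // ß
    ];

    // The array form of strtr() replaces whole byte sequences, longest match first.
    return strtr($utf8, $map);
}

echo replace_umlauts("K\xC3\xA4se und Stra\xC3\x9Fe"), "\n"; // Kaese und Strasse
```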

strtr() partially not working

As others have noted, the most likely cause is a character encoding mismatch. Since the titles you're trying to convert are apparently in UTF-8, the problem is most likely that your PHP source code isn't. Try re-saving the file as UTF-8 text, and see if that fixes the problem.

BTW, a simple way to debug this would be to print out both your data rows and your transliteration array into the same output file using e.g. print_r() or var_dump(), and look at the output to see if the non-ASCII characters in it look correct. If the characters look right in the data but wrong in the transliteration table (or vice versa), that's a sign that the encodings don't match.
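
When the terminal or browser itself might be misinterpreting the output, comparing raw bytes can be more telling than comparing glyphs. A small sketch of that idea follows; `describe_bytes` is a hypothetical helper, not something from the question or the answer.

```php
<?php
// Hypothetical debugging helper: show a string's raw bytes and whether they form valid UTF-8.
function describe_bytes(string $s): string
{
    $valid = mb_check_encoding($s, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';
    return bin2hex($s) . ' (' . $valid . ')';
}

echo describe_bytes("\xC3\xA4"), "\n"; // ä as UTF-8:   c3a4 (valid UTF-8)
echo describe_bytes("\xE4"), "\n";     // ä as Latin-1: e4 (not valid UTF-8)
```

Running the same helper over a key from the transliteration array and over the matching character in a data row shows at a glance whether both sides use the same byte sequence.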

P.S. If you have the PHP iconv extension installed (and you probably do), consider using it to automatically convert your titles to ASCII.

Downgrade non-ASCII symbols to closest 7-bit ASCII equivalent (preferably Java)

Have a look at java.text.Normalizer. It can help you with transforming equivalent characters: http://en.wikipedia.org/wiki/Unicode_equivalence


