How to Get List of Supported Encodings by Iconv Library in PHP

how to get list of supported encodings by iconv library in php?

In PHP the iconv extension does not have a function to list all available encodings.

The encodings which are available depends on which library iconv internally uses. For example there is libiconv. That website also contains a list of charsets you can use.

You can also connect to your server via SSH and execute the following command:

$ iconv -l

This will give you the available list on that system if PHP is compiled against the same library as the command-line utility.

If you're uncertain if a specific charset is available, you need to check for errors by probing with iconv.


List of iconv encoding and aliases (Feb 2013; gist):

ANSI_X3.4-1968 ANSI_X3.4-1986 ASCII CP367 IBM367 ISO-IR-6 ISO646-US ISO_646.IRV:1991 US US-ASCII CSASCII
UTF-8
ISO-10646-UCS-2 UCS-2 CSUNICODE
UCS-2BE UNICODE-1-1 UNICODEBIG CSUNICODE11
UCS-2LE UNICODELITTLE
ISO-10646-UCS-4 UCS-4 CSUCS4
UCS-4BE
UCS-4LE
UTF-16
UTF-16BE
UTF-16LE
UTF-32
UTF-32BE
UTF-32LE
UNICODE-1-1-UTF-7 UTF-7 CSUNICODE11UTF7
UCS-2-INTERNAL
UCS-2-SWAPPED
UCS-4-INTERNAL
UCS-4-SWAPPED
C99
JAVA
CP819 IBM819 ISO-8859-1 ISO-IR-100 ISO8859-1 ISO_8859-1 ISO_8859-1:1987 L1 LATIN1 CSISOLATIN1
ISO-8859-2 ISO-IR-101 ISO8859-2 ISO_8859-2 ISO_8859-2:1987 L2 LATIN2 CSISOLATIN2
ISO-8859-3 ISO-IR-109 ISO8859-3 ISO_8859-3 ISO_8859-3:1988 L3 LATIN3 CSISOLATIN3
ISO-8859-4 ISO-IR-110 ISO8859-4 ISO_8859-4 ISO_8859-4:1988 L4 LATIN4 CSISOLATIN4
CYRILLIC ISO-8859-5 ISO-IR-144 ISO8859-5 ISO_8859-5 ISO_8859-5:1988 CSISOLATINCYRILLIC
ARABIC ASMO-708 ECMA-114 ISO-8859-6 ISO-IR-127 ISO8859-6 ISO_8859-6 ISO_8859-6:1987 CSISOLATINARABIC
ECMA-118 ELOT_928 GREEK GREEK8 ISO-8859-7 ISO-IR-126 ISO8859-7 ISO_8859-7 ISO_8859-7:1987 ISO_8859-7:2003 CSISOLATINGREEK
HEBREW ISO-8859-8 ISO-IR-138 ISO8859-8 ISO_8859-8 ISO_8859-8:1988 CSISOLATINHEBREW
ISO-8859-9 ISO-IR-148 ISO8859-9 ISO_8859-9 ISO_8859-9:1989 L5 LATIN5 CSISOLATIN5
ISO-8859-10 ISO-IR-157 ISO8859-10 ISO_8859-10 ISO_8859-10:1992 L6 LATIN6 CSISOLATIN6
ISO-8859-11 ISO8859-11 ISO_8859-11
ISO-8859-13 ISO-IR-179 ISO8859-13 ISO_8859-13 L7 LATIN7
ISO-8859-14 ISO-CELTIC ISO-IR-199 ISO8859-14 ISO_8859-14 ISO_8859-14:1998 L8 LATIN8
ISO-8859-15 ISO-IR-203 ISO8859-15 ISO_8859-15 ISO_8859-15:1998 LATIN-9
ISO-8859-16 ISO-IR-226 ISO8859-16 ISO_8859-16 ISO_8859-16:2001 L10 LATIN10
KOI8-R CSKOI8R
KOI8-U
KOI8-RU
CP1250 MS-EE WINDOWS-1250
CP1251 MS-CYRL WINDOWS-1251
CP1252 MS-ANSI WINDOWS-1252
CP1253 MS-GREEK WINDOWS-1253
CP1254 MS-TURK WINDOWS-1254
CP1255 MS-HEBR WINDOWS-1255
CP1256 MS-ARAB WINDOWS-1256
CP1257 WINBALTRIM WINDOWS-1257
CP1258 WINDOWS-1258
850 CP850 IBM850 CSPC850MULTILINGUAL
862 CP862 IBM862 CSPC862LATINHEBREW
866 CP866 IBM866 CSIBM866
MAC MACINTOSH MACROMAN CSMACINTOSH
MACCENTRALEUROPE
MACICELAND
MACCROATIAN
MACROMANIA
MACCYRILLIC
MACUKRAINE
MACGREEK
MACTURKISH
MACHEBREW
MACARABIC
MACTHAI
HP-ROMAN8 R8 ROMAN8 CSHPROMAN8
NEXTSTEP
ARMSCII-8
GEORGIAN-ACADEMY
GEORGIAN-PS
KOI8-T
CP154 CYRILLIC-ASIAN PT154 PTCP154 CSPTCP154
KZ-1048 RK1048 STRK1048-2002 CSKZ1048
MULELAO-1
CP1133 IBM-CP1133
ISO-IR-166 TIS-620 TIS620 TIS620-0 TIS620.2529-1 TIS620.2533-0 TIS620.2533-1
CP874 WINDOWS-874
VISCII VISCII1.1-1 CSVISCII
TCVN TCVN-5712 TCVN5712-1 TCVN5712-1:1993
ISO-IR-14 ISO646-JP JIS_C6220-1969-RO JP CSISO14JISC6220RO
JISX0201-1976 JIS_X0201 X0201 CSHALFWIDTHKATAKANA
ISO-IR-87 JIS0208 JIS_C6226-1983 JIS_X0208 JIS_X0208-1983 JIS_X0208-1990 X0208 CSISO87JISX0208
ISO-IR-159 JIS_X0212 JIS_X0212-1990 JIS_X0212.1990-0 X0212 CSISO159JISX02121990
CN GB_1988-80 ISO-IR-57 ISO646-CN CSISO57GB1988
CHINESE GB_2312-80 ISO-IR-58 CSISO58GB231280
CN-GB-ISOIR165 ISO-IR-165
ISO-IR-149 KOREAN KSC_5601 KS_C_5601-1987 KS_C_5601-1989 CSKSC56011987
EUC-JP EUCJP EXTENDED_UNIX_CODE_PACKED_FORMAT_FOR_JAPANESE CSEUCPKDFMTJAPANESE
MS_KANJI SHIFT-JIS SHIFT_JIS SJIS CSSHIFTJIS
CP932
ISO-2022-JP CSISO2022JP
ISO-2022-JP-1
ISO-2022-JP-2 CSISO2022JP2
CN-GB EUC-CN EUCCN GB2312 CSGB2312
GBK
CP936 MS936 WINDOWS-936
GB18030
ISO-2022-CN CSISO2022CN
ISO-2022-CN-EXT
HZ HZ-GB-2312
EUC-TW EUCTW CSEUCTW
BIG-5 BIG-FIVE BIG5 BIGFIVE CN-BIG5 CSBIG5
CP950
BIG5-HKSCS:1999
BIG5-HKSCS:2001
BIG5-HKSCS BIG5-HKSCS:2004 BIG5HKSCS
EUC-KR EUCKR CSEUCKR
CP949 UHC
CP1361 JOHAB
ISO-2022-KR CSISO2022KR
CP856
CP922
CP943
CP1046
CP1124
CP1129
CP1161 IBM-1161 IBM1161 CSIBM1161
CP1162 IBM-1162 IBM1162 CSIBM1162
CP1163 IBM-1163 IBM1163 CSIBM1163
DEC-KANJI
DEC-HANYU
437 CP437 IBM437 CS
PC8CODEPAGE437
CP737
CP775 IBM775 CSPC775BALTIC
852 CP852 IBM852 CSPCP852
CP853
855 CP855 IBM855 CSIBM855
857 CP857 IBM857 CSIBM857
CP858
860 CP860 IBM860 CSIBM860
861 CP-IS CP861 IBM861 CSIBM861
863 CP863 IBM863 CSIBM863
CP864 IBM864 CSIBM864
865 CP865 IBM865 CSIBM865
869 CP-GR CP869 IBM869 CSIBM869
CP1125
EUC-JISX0213
SHIFT_JISX0213
ISO-2022-JP-3
BIG5-2003
ISO-IR-230 TDS565
ATARI ATARIST
RISCOS-LATIN1

How can I detect a malformed UTF-8 string in PHP?

First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.

You can make use of the UTF-8 validity check that is available in preg_match [PHP Manual] since PHP 4.3.5. It will return 0 (with no additional information) if an invalid string is given:

$isUTF8 = preg_match('//u', $string);

Another possibility is mb_check_encoding [PHP Manual]:

$validUTF8 = mb_check_encoding($string, 'UTF-8');

Another function you can use is mb_detect_encoding [PHP Manual]:

$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));

It's important to set the strict parameter to true.

Additionally, iconv [PHP Manual] allows you to change/drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it generates a notification; this behavior cannot be changed.)

echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;

You can use @ and check the length of the return string:

strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));

Check the examples on the iconv manual page as well.

How to detect which type of chinese encoding has text file?

This may work for your needs (but I really cant tell). Setting the locale and utf8_decode, and using mb_check_encoding instead of mt_detect_encoding seems to give some useful output..

// some text from http://chinesenotes.com/chinese_text_l10n.php
// have tried both as string and content loaded from a file
$chinese = '譧躆 礛簼繰 剆坲姏 潧 騔鯬 跠 瘱瘵瘲 忁曨曣 蛃袚觙';
$chinese=utf8_decode($chinese);

$chinese_encodings ='EUC-CN,HZ,GBK,CP936,GB18030,EUC-TW,BIG5,CP950,BIG5-HKSCS,BIG5-HKSCS:2004,BIG5-HKSCS:2001,BIG5-HKSCS:1999,ISO-2022-CN,ISO-2022-CN-EXT';

$encodings = explode(',',$chinese_encodings);

//set chinese locale
setlocale(LC_CTYPE, 'Chinese');

foreach($encodings as $encoding) {
if (@mb_check_encoding($chinese, $encoding)) {
echo 'The string seems to be compatible with '.$encoding.'<br>';
} else {
echo 'Not compatible with '.$encoding.'<br>';
}
}

outputs

The string seems to be compatible with EUC-CN
The string seems to be compatible with HZ
The string seems to be compatible with GBK
The string seems to be compatible with CP936
Not compatible with GB18030
The string seems to be compatible with EUC-TW
The string seems to be compatible with BIG5
The string seems to be compatible with CP950
Not compatible with BIG5-HKSCS
Not compatible with BIG5-HKSCS:2004
Not compatible with BIG5-HKSCS:2001
Not compatible with BIG5-HKSCS:1999
Not compatible with ISO-2022-CN
Not compatible with ISO-2022-CN-EXT

It is total guess. Now it at least seems to recognise some of the chinese encodings. Delete if it is total junk.

How to cope with different encodings of xls files in PHP?

The _defaultEncoding property of the workbook object can be set to contain the charset used by the Excel file, and this is then used to handle conversion to UTF-16LE by the reader, but it makes no effort to identify the internal charset itself.

If you define

define('SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE',  0x0042);

among the other SPREADSHEET_EXCEL_READER_TYPE definitions, and then modify the switch statement starting at line 464 to include a case for SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE. The logic for this case needs to be something like:

$length = $this->_GetInt2d($this->_data, $pos + 2);
$recordData = substr($this->_data, $pos + 4, $length);

// move stream pointer to next record
$pos += 4 + $length;

// offset: 0; size: 2; code page identifier
$codepage = $this->_GetInt2d($recordData, 0);
$codepage = $this->_CodePageNumberToName($codepage)

Recreate the _GetInt2d method (that seems to have been stripped from the code at some point) as

function _GetInt2d($data, $pos)
{
return ord($data[$pos]) | (ord($data[$pos + 1]) << 8);
}

and create a _CodePageNumberToName method to return the codepage name from its numeric value:

function _CodePageNumberToName($codePage = '1252')
{
switch ($codePage) {
case 367: return 'ASCII'; break; // ASCII
case 437: return 'CP437'; break; // OEM US
case 720: throw new Exception('Code page 720 not supported.');
break; // OEM Arabic
case 737: return 'CP737'; break; // OEM Greek
case 775: return 'CP775'; break; // OEM Baltic
case 850: return 'CP850'; break; // OEM Latin I
case 852: return 'CP852'; break; // OEM Latin II (Central European)
case 855: return 'CP855'; break; // OEM Cyrillic
case 857: return 'CP857'; break; // OEM Turkish
case 858: return 'CP858'; break; // OEM Multilingual Latin I with Euro
case 860: return 'CP860'; break; // OEM Portugese
case 861: return 'CP861'; break; // OEM Icelandic
case 862: return 'CP862'; break; // OEM Hebrew
case 863: return 'CP863'; break; // OEM Canadian (French)
case 864: return 'CP864'; break; // OEM Arabic
case 865: return 'CP865'; break; // OEM Nordic
case 866: return 'CP866'; break; // OEM Cyrillic (Russian)
case 869: return 'CP869'; break; // OEM Greek (Modern)
case 874: return 'CP874'; break; // ANSI Thai
case 932: return 'CP932'; break; // ANSI Japanese Shift-JIS
case 936: return 'CP936'; break; // ANSI Chinese Simplified GBK
case 949: return 'CP949'; break; // ANSI Korean (Wansung)
case 950: return 'CP950'; break; // ANSI Chinese Traditional BIG5
case 1200: return 'UTF-16LE'; break; // UTF-16 (BIFF8)
case 1250: return 'CP1250'; break; // ANSI Latin II (Central European)
case 1251: return 'CP1251'; break; // ANSI Cyrillic
case 0: // CodePage is not always correctly set when the xls file was saved by Apple's Numbers program
case 1252: return 'CP1252'; break; // ANSI Latin I (BIFF4-BIFF7)
case 1253: return 'CP1253'; break; // ANSI Greek
case 1254: return 'CP1254'; break; // ANSI Turkish
case 1255: return 'CP1255'; break; // ANSI Hebrew
case 1256: return 'CP1256'; break; // ANSI Arabic
case 1257: return 'CP1257'; break; // ANSI Baltic
case 1258: return 'CP1258'; break; // ANSI Vietnamese
case 1361: return 'CP1361'; break; // ANSI Korean (Johab)
case 10000: return 'MAC'; break; // Apple Roman
case 32768: return 'MAC'; break; // Apple Roman
case 32769: throw new Exception('Code page 32769 not supported.');
break; // ANSI Latin I (BIFF2-BIFF3)
case 65001: return 'UTF-8'; break; // Unicode (UTF-8)
}
}

And store the returned value in $_defaultEncoding

Alternatively, switch to an Excel reader that can handle the codepage correctly in the first place

Debug iconv_strlen error - PHP 5.5

Problem solved. Thanks BrianS.

This was solved by re-installing mbstring.

sudo yum --disablerepo="*" --enablerepo="remi*"
install php-mbstring*
sudo httpd -k restart

Change iconv() replacement character

AFAIK there's no translit function that lets you specify your own replacement character, but you can work around it by implementing your own simple escaping.

function my_translit($text, $replace='!', $in_charset='UTF-8', $out_charset='ASCII//TRANSLIT') {
// escape existing ?
$res = str_replace(['\\', '?'], ['\\\\', '\\?'], $text);
// translit
$res = iconv($in_charset, $out_charset, $res);
// replace unescaped ?
$res = preg_replace('/(?<!\\\\)\\?/', $replace, $res);
// unescape
return str_replace(['\\\\', '\\?'], ['\\', '?'], $res);
}

$text = '? € é î ü π ∑ ñ \\? \\\\? \\\\\\?';
var_dump(
$text,
my_translit($text)
);

Result:

string(36) "? € é î ü π ∑ ñ \? \\? \\\?"
string(29) "? EUR ! ! ! p ! ! \? \\? \\\?"

I'm not certain why the transliteration output is different on my system, but the character replacement works.

Should I use mb_* or iconv_* functions for multibyte strings?

Have a look at this Powerpoint presentation:

http://www.nyphp.org/content/presentations/smallworld/April2006-nyphp-Presentation.ppt

In a nutshell:
Iconv supports more encodings, but is less portable.

From the presentation:

PHP supports multi byte in two
extensions: iconv and mbstring

  • iconv uses an external library (supports more encodings but less
    portable)
  • mbstring has the library bundled with PHP (less encodings but more
    portable)


Related Topics



Leave a reply



Submit