What Factors Make PHP Unicode-Incompatible

What factors make PHP Unicode-incompatible?

When PHP was started, UTF-8 was not yet widely supported. We are talking about a time when non-Unicode operating systems like Windows 98/Me were still current and when other big languages like Delphi were also non-Unicode. Not all languages were designed with Unicode in mind from day one, and completely changing a language to Unicode without breaking a lot of existing code is hard. Delphi, for example, only became Unicode-compatible a year or two ago, while other languages like Java or C# were designed around Unicode from day one.

So when PHP grew and became PHP 3, PHP 4 and now PHP 5, simply no one decided to add Unicode. Why? Presumably to stay compatible with existing scripts, or because utf8_decode()/utf8_encode() and mbstring already existed and worked. I do not know for sure, but I strongly believe it has something to do with organic growth. Features do not simply exist by default; they have to be written by someone, and that has not happened for PHP yet.

Edit: OK, I read the question wrong. The question is: how are strings stored internally? If I type in "Währung" or "Écriture", which encoding is used to create the bytes? In the case of PHP, it is effectively ASCII with a code page: a string is just a sequence of bytes, and nothing records which encoding produced them. That means that if I encode the string using ISO-8859-15 and you decode it with some Chinese code page, you will get weird results.

The alternative is found in languages like C# or Java, where everything is stored as Unicode, which means there is no code page anymore and, theoretically, you cannot mess up.

I recommend Joel's article about Unicode and character sets, but essentially it boils down to how strings are stored internally. The answer with PHP is "not in Unicode", which means you have to be very careful and explicit when processing strings, so that the string stays in the proper encoding during input, storage (database) and output. That is very error-prone.
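To make the code page problem concrete, here is a minimal sketch, assuming the mbstring extension is loaded. ISO-8859-7 (Greek) stands in for "some other code page"; that choice and the variable names are illustrative only.

```php
<?php
// "Währung" as written by an ISO-8859-15 editor: the ä is the single byte 0xE4.
$bytes = "W\xE4hrung";

// Decoding with the encoding the bytes were written in gives the intended text.
$right = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-15');

// Decoding the very same bytes with another code page gives another letter.
$wrong = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-7');

echo $right, "\n"; // Währung
echo $wrong, "\n"; // Wδhrung
```

The bytes never changed; only the assumption about their encoding did, which is exactly the information a PHP string does not carry.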

Declaration to make PHP script completely Unicode-friendly

That full-Unicode thing was precisely the idea of PHP 6 -- which was canceled more than a year ago.

So, no, there is no way of getting all that -- except by using the right functions, and remembering that characters are not the same as bytes.


One thing that might help with your fourth point, though, is the function overloading feature of the mbstring extension (quoting):

mbstring supports a 'function overloading' feature which enables you to add multibyte awareness to such an application without code modification by overloading multibyte counterparts on the standard string functions.

For example, mb_substr() is called instead of substr() if function overloading is enabled.

Note that function overloading was deprecated in PHP 7.2 and removed in PHP 8.0, so this only applies to older versions.
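What overloading swaps in can be seen by calling both functions directly. This is only a sketch, assuming the mbstring extension and a source file saved as UTF-8; the variable names are illustrative.

```php
<?php
// Assumes this file is saved as UTF-8, so "É" occupies two bytes (0xC3 0x89).
$text = 'Écriture';

$byteCut = substr($text, 0, 1);             // first byte only: half of "É"
$charCut = mb_substr($text, 0, 1, 'UTF-8'); // first character: "É"

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump($charCut === 'É');                     // bool(true)
```

Calling the mb_* function explicitly, with the encoding spelled out, works the same way with or without overloading.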

What does native Unicode support in PHP actually mean, and why does PHP not support Unicode natively even in the PHP 7 releases?

  1. PHP's core string functions treat one byte as one character, so natively they can distinguish only 256 values, roughly the first 256 of the 1 114 112 possible Unicode code points (17 x 65 536).
  2. "Native letters" here means national letters like ą, č, etc.
  3. Those letters have code points above 255 (for example, ą is U+0105 and č is U+010D), so they do not fit in a single byte. That is why PHP does not support them natively.
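The numbers in the list above can be checked with a short sketch. It assumes PHP 7.2 or later (for mb_ord()), the mbstring extension, and a UTF-8 source file.

```php
<?php
// Print the code point and UTF-8 byte length of an ASCII and two national letters.
foreach (['A', 'ą', 'č'] as $letter) {
    printf(
        "%s  U+%04X  %d byte(s)\n",
        $letter,
        mb_ord($letter, 'UTF-8'), // Unicode code point
        strlen($letter)           // bytes used in UTF-8
    );
}
// A  U+0041  1 byte(s)
// ą  U+0105  2 byte(s)
// č  U+010D  2 byte(s)

// The size of the Unicode code space: 17 planes of 65 536 code points each.
$codeSpace = 17 * 65536; // 1114112
```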

Does a reliable way to capitalize Unicode text exist?

You can try the PortableUTF8 library, written as an alternative to mbstring and iconv.

http://pageconfig.com/post/portable-utf8

Another interesting library is Stringy. It uses mbstring by default, but if the module is not available it falls back to a polyfill package.

https://github.com/danielstjules/Stringy

For more background on the problem, it is worth reading:

What factors make PHP Unicode-incompatible?

I hope it will be useful for you.
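For simple cases, mbstring on its own may already be enough. The sketch below uses plain mbstring calls, not the PortableUTF8 or Stringy APIs, and assumes a UTF-8 source file; the variable names are illustrative.

```php
<?php
// Assumes the mbstring extension and a UTF-8 source file.
$word = 'écriture';

// Uppercase every character, including the two-byte "é".
$upper = mb_strtoupper($word, 'UTF-8');

// Capitalize the first letter of each word.
$title = mb_convert_case($word, MB_CASE_TITLE, 'UTF-8');

echo $upper, "\n"; // ÉCRITURE
echo $title, "\n"; // Écriture
```

Language-specific casing rules are beyond this sketch.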

Danish Æ being recognized as 2 letters instead of one

strlen is one of the naïve PHP core functions that understand strings as byte arrays and assume one byte == one character. Use mb_strlen with the correct encoding parameter to actually count characters according to the encoding of your string.
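A minimal sketch of the difference, assuming the mbstring extension and a UTF-8 source file (the sample word is illustrative):

```php
<?php
// In UTF-8, "Æ" is stored as the two bytes 0xC3 0x86, and "ø" as 0xC3 0xB8.
$name = 'Ærø';

echo strlen($name), "\n";             // 5 (bytes: Æ and ø take two each)
echo mb_strlen($name, 'UTF-8'), "\n"; // 3 (characters)
```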

Convert Java source code characters in JSON string using PHP

You need to tell the web browser what encoding you are giving it.

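<?php
// Declare the response encoding before any output is sent.
header('Content-Type: text/plain; charset=utf-8');
// $jsonStr holds the JSON text from the question.
var_dump(json_decode($jsonStr));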
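To see what json_decode() does with Java-style \uXXXX escapes, here is a small sketch. The sample JSON and the "currency" key are hypothetical, standing in for whatever $jsonStr actually contains.

```php
<?php
// Hypothetical input: JSON text containing a Java-style \uXXXX escape.
// Single quotes keep the backslash sequence literal in PHP.
$jsonStr = '{"currency":"W\u00e4hrung"}';

// json_decode() turns the escape into the UTF-8 bytes for "ä".
$data = json_decode($jsonStr, true);
echo $data['currency'], "\n"; // Währung

// json_encode() escapes non-ASCII characters again by default ...
echo json_encode($data), "\n"; // {"currency":"W\u00e4hrung"}

// ... unless JSON_UNESCAPED_UNICODE is passed.
echo json_encode($data, JSON_UNESCAPED_UNICODE), "\n"; // {"currency":"Währung"}
```

The decoded string is UTF-8, which is why the charset in the Content-Type header has to say so.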

