Preparing PHP Application to Use with Utf-8

Some useful options to have in .htaccess:

########################################
# Locale settings
########################################

# See: http://php.net/manual/en/timezones.php
php_value date.timezone "Europe/Amsterdam"

SetEnv   LC_ALL  nl_NL.UTF-8

########################################
# Set up UTF-8 encoding
########################################

AddDefaultCharset UTF-8
AddCharset UTF-8 .php

php_value default_charset "UTF-8"

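# The iconv.* settings, mbstring.internal_encoding and mbstring.http_output
# are deprecated since PHP 5.6, where default_charset above takes over.
# Keep them only for older PHP versions.
php_value iconv.input_encoding "UTF-8"
php_value iconv.internal_encoding "UTF-8"
php_value iconv.output_encoding "UTF-8"

php_value mbstring.internal_encoding UTF-8
php_value mbstring.http_output UTF-8
# Boolean directives are set with php_flag, not php_value.
php_flag mbstring.encoding_translation On
# mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0.
php_value mbstring.func_overload 6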

# See also the PHP functions (the mysql_* equivalents were removed in PHP 7.0):
# mysqli_set_charset
# mysqli_character_set_name

# Database settings (utf8mb4 covers all of Unicode; MySQL's utf8 stops at 3 bytes)
#CREATE DATABASE db_name
#   DEFAULT CHARACTER SET utf8mb4
#   DEFAULT COLLATE utf8mb4_unicode_ci
#   ;
#
#ALTER DATABASE db_name
#   DEFAULT CHARACTER SET utf8mb4
#   DEFAULT COLLATE utf8mb4_unicode_ci
#   ;
#
# The statement below changes only the default for new columns; existing columns need
# ALTER TABLE tbl_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
#ALTER TABLE tbl_name
#   DEFAULT CHARACTER SET utf8mb4
#   COLLATE utf8mb4_unicode_ci
#   ;

Migrating a php application to handle UTF-8

There's a little more to it than just replacing those functions.

Regular expressions

You should add the u (UTF-8) modifier to every PCRE regular expression whose pattern or subject can contain non-ASCII characters, so that both are interpreted as actual characters rather than bytes.

$subject = "Helló";
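$pattern = '/(l|ó){2,3}/u'; // The u modifier makes PCRE treat pattern and subject as UTF-8
// mb_substr() cuts on character boundaries; the offsets reported by
// PREG_OFFSET_CAPTURE are still byte offsets.
preg_match($pattern, mb_substr($subject, 3, null, 'UTF-8'), $matches, PREG_OFFSET_CAPTURE);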

Also, you should use the Unicode character properties rather than the standard Perl classes if you want your regular expressions to be correct for non-Latin alphabets:

\p{L} instead of \w for any 'letter' character.
\p{Z} instead of \s for any 'space' character.
\p{N} instead of \d for any 'digit' character, e.g. Arabic-Indic digits.

There are a lot of different Unicode character properties, some of which are quite unusual to someone used to reading and writing in a Latin alphabet. For example, some characters combine with the previous character to make a new glyph. The "Unicode character properties" page of the PHP manual explains them in more detail.

Although there are regular expression functions in the mbstring extension, they are not recommended for use. The standard PCRE functions work fine with the u modifier.
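
As a rough illustration of the points above, the following sketch compares byte mode with the u modifier and uses \p{L} and \p{N} on a mixed string. The sample strings and variable names are made up for the example, and it assumes the script file itself is saved as UTF-8.

```php
<?php
// Sample input; assumes this source file is stored as UTF-8.
$subject = 'Helló wörld 123';

// Without the u modifier "." matches a single byte, so the two-byte
// character "ó" is not matched as one character.
$byteMode = preg_match('/^.$/', 'ó');   // 0
$utf8Mode = preg_match('/^.$/u', 'ó');  // 1

// \p{L}+ with the u modifier collects runs of letters from any script.
preg_match_all('/\p{L}+/u', $subject, $words);
// $words[0] is ['Helló', 'wörld']

// \p{N}+ collects runs of numeric characters.
preg_match_all('/\p{N}+/u', $subject, $numbers);
// $numbers[0] is ['123']
```

If the subject is not valid UTF-8, a pattern with the u modifier fails to match and preg_last_error() reports the problem, which is worth checking in real code.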

Function replacements

Although your list is a start, the list of functions I have found so far that need to be replaced with multibyte versions is longer. This is the list of functions with their replacement functions, some of which are not defined in PHP, but are available on GitHub as mb_extra.

$unsafeFunctions = array(
    'mail'      => 'mb_send_mail',
    'split'     => null, //'mb_split', deprecated function - just don't use it
    'stripos'   => 'mb_stripos',
    'stristr'   => 'mb_stristr',
    'strlen'    => 'mb_strlen',
    'strpos'    => 'mb_strpos',
    'strrpos'   => 'mb_strrpos',
    'strrchr'   => 'mb_strrchr',
    'strripos'  => 'mb_strripos',
    'strstr'    => 'mb_strstr',
    'strtolower'    => 'mb_strtolower',
    'strtoupper'    => 'mb_strtoupper',
    'substr_count'  => 'mb_substr_count',
    'substr'        => 'mb_substr',
    'str_ireplace'  => null,
    'str_split'     => 'mb_str_split', //TODO - check this works
    'strcasecmp'    => 'mb_strcasecmp', //TODO - check this works
    'strcspn'       => null, //TODO - implement alternative
    'strrev'        => 'mb_strrev', //TODO - check this works
    'strspn'        => null, //TODO - implement alternative
    'substr_replace'=> 'mb_substr_replace',
    'lcfirst'       => null,
    'ucfirst'       => 'mb_ucfirst',
    'ucwords'       => 'mb_ucwords',
    'wordwrap'      => null,
);
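
To show why the byte-oriented functions are unsafe, here is a small sketch comparing a few of them with their mb_ counterparts. The helper utf8_ucfirst() is a hypothetical name written for this example; it is not the mb_extra implementation mentioned above.

```php
<?php
// The encoding is passed explicitly so the results do not depend on
// mbstring.internal_encoding.
$text = "hell\u{00F3} w\u{00F6}rld"; // "helló wörld" as UTF-8 bytes

$upper   = mb_strtoupper($text, 'UTF-8');      // "HELLÓ WÖRLD"
$bytePos = strpos($text, 'w');                 // 7, a byte offset
$charPos = mb_strpos($text, 'w', 0, 'UTF-8');  // 6, a character offset
$fifth   = mb_substr($text, 4, 1, 'UTF-8');    // "ó"

// Hypothetical stand-in for a multibyte ucfirst(): upper-case the first
// character and append the rest unchanged.
function utf8_ucfirst(string $s): string
{
    $first = mb_substr($s, 0, 1, 'UTF-8');
    $rest  = mb_substr($s, 1, null, 'UTF-8');
    return mb_strtoupper($first, 'UTF-8') . $rest;
}
```

The byte and character offsets differ as soon as a multibyte character precedes the match, which is exactly the kind of bug the replacement list is meant to prevent.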

MySQL

Although you would have thought that setting the character set to utf8 would give you UTF-8 support in MySQL, it does not.

It only gives you support for UTF-8 characters that are encoded in up to 3 bytes, i.e. the Basic Multilingual Plane. However, people are actively using characters that require 4 bytes to encode, including most of the Emoji characters, which live in the Supplementary Multilingual Plane.

To support these you should in general use:

utf8mb4 - for your character encoding.
utf8mb4_unicode_ci - for your character collation.

For specific scenarios there are alternative collation sets that may be appropriate for you, but in general stick to the collation set that is most correct.
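
A quick way to check whether existing data would be affected is to look for code points above U+FFFF. The sketch below does that with a hypothetical helper, needs_utf8mb4(), which is an illustrative name and not part of PHP or MySQL.

```php
<?php
// Hypothetical helper: true if $text contains a code point above U+FFFF,
// i.e. one that takes 4 bytes in UTF-8 and so does not fit MySQL's utf8.
function needs_utf8mb4(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

$emoji = "\u{1F600}";        // GRINNING FACE
$emojiBytes = strlen($emoji); // 4

$plain     = needs_utf8mb4("Hell\u{00F3}");  // false, "ó" takes 2 bytes
$withEmoji = needs_utf8mb4("Hi " . $emoji);  // true
```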

The places where you should set the character set and collation in your MySQL config file are:

[mysql]
default-character-set=utf8mb4

[client]
default-character-set=utf8mb4

[mysqld]
init-connect='SET NAMES utf8mb4'
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci

The SET NAMES may not be required in all circumstances, but it is safer to leave it on, at only a small speed penalty.

PHP INI File

Although you said you have set mb_internal_encoding in your bootstrap script, it is much better to do this in the PHP ini file, and also set all the recommended parameters:

mbstring.language   = Neutral   ; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding  = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On  ;  HTTP input encoding translation is enabled
mbstring.http_input     = auto  ; Set HTTP input character set detection to auto
mbstring.http_output    = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order   = auto  ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset      = UTF-8 ; Default character set for auto content type header
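
On hosts where the ini file cannot be edited, roughly the same effect can be approximated at runtime. The following is a minimal sketch under that assumption; the $effective array is only there to report what ended up in force. Since PHP 5.6 the mbstring functions fall back to default_charset when no internal encoding is set.

```php
<?php
// Runtime fallback, to be run before any output is produced.
ini_set('default_charset', 'UTF-8');
mb_internal_encoding('UTF-8');
mb_substitute_character('none'); // do not print invalid characters

// Report what is actually in effect.
$effective = [
    'default_charset'      => ini_get('default_charset'),
    'mb_internal_encoding' => mb_internal_encoding(),
];
```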

Helping browser to choose UTF8 for forms

You need to set accept-charset on your forms to be UTF-8 to tell browsers to submit them as UTF8.
Add a UTF8 character to your form in a hidden field, to stop Internet Explorer (5, 6, 7 and 8) from submitting a form as something other than UTF8.

Misc

If you're using Apache set "AddDefaultCharset utf-8"
As you said you're doing, but just to remind anyone reading the answer, set the meta content-type as well in the header.

That should be about it. Although it's worth reading the "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" page, I think it is preferable to use UTF-8 everywhere and so not have to spend any mental effort on handling different character sets.

Am I correctly supporting UTF-8 in my PHP apps?

Do I need to convert everything that I receive from the user agent (HTML forms & URI) to UTF-8 when the page loads?

No. The user agent should be submitting data in UTF-8 format; if not you are losing the benefit of Unicode.

The way to ensure a user-agent submits in UTF-8 format is to serve the page containing the form it's submitting in UTF-8 encoding. Use the Content-Type header (and meta http-equiv too if you intend the form to be saved and work standalone).

I have heard that you should mark your forms as UTF-8 also (accept-charset="UTF-8")

Don't. It was a nice idea in the HTML standard, but IE never got it right. It was supposed to state an exclusive list of allowable charsets, but IE treats it as a list of additional charsets to try, on a per-field basis. So if you have an ISO-8859-1 page and an “accept-charset="UTF-8"” form, IE will first try to encode a field as ISO-8859-1, and if there's a non-8859-1 character in there, then it'll resort to UTF-8.

But since IE does not tell you whether it has used ISO-8859-1 or UTF-8, that's of absolutely no use to you. You would have to guess, for each field separately, which encoding was in use! Not useful. Omit the attribute and serve your pages as UTF-8; that's the best you can do at the moment.

If a UTF string is improperly encoded will something go wrong?

If you let such a sequence get through to the browser you could be in trouble. There are ‘overlong sequences’ which encode a low-numbered codepoint in a longer sequence of bytes than is necessary. This means if you are filtering ‘<’ by looking for that ASCII character in a sequence of bytes, you could miss one, and let a script element into what you thought was safe text.

Overlong sequences were banned back in the early days of Unicode, but it took Microsoft a very long time to get their shit together: IE would interpret the byte sequence ‘\xC0\xBC’ as a ‘<’ up until IE6 Service Pack 1. Opera also got it wrong up to (about, I think) version 7. Luckily these older browsers are dying out, but it's still worth filtering overlong sequences in case those browsers are still about now (or new idiot browsers make the same mistake in future). You can do this, and fix other bad sequences, with a regex that allows only proper UTF-8 through, such as the one published by the W3C.

If you are using mb_ functions in PHP, you might be insulated from these issues. I can't say for sure as mb_* was unusably fragile when I was still writing PHP.

In any case, this is also a good time to remove control characters, which are a large and generally unappreciated source of bugs. I would remove chars 9 and 13 from submitted strings in addition to the others the W3 regex takes out; it is also worth removing plain newlines for strings you know aren't supposed to be multiline textboxes.
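
The two steps above can be sketched in PHP as follows. The pattern follows the widely circulated W3C form-validation expression, which is presumably the one meant above; the function names is_strict_utf8() and strip_control_chars() are hypothetical, and the possessive quantifier is an addition to limit backtracking on long input.

```php
<?php
// Hypothetical helper: true only for well-formed UTF-8 without overlong
// sequences, surrogates, or control characters other than TAB, LF and CR.
function is_strict_utf8(string $bytes): bool
{
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*+\z/x';
    return preg_match($pattern, $bytes) === 1;
}

// Hypothetical helper: drop C0 control characters and DEL. Single-line
// fields lose all of them; multi-line fields keep LF (0x0A) only.
function strip_control_chars(string $text, bool $multiline = false): string
{
    $class = $multiline ? '[\x00-\x09\x0B-\x1F\x7F]' : '[\x00-\x1F\x7F]';
    return preg_replace('/' . $class . '/', '', $text) ?? '';
}
```

Stripping bytes below 0x20 in byte mode is safe for UTF-8 because every byte of a multibyte character is 0x80 or above. For very large inputs, mb_check_encoding($bytes, 'UTF-8') is a cheaper well-formedness check, though it does not reject control characters.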

Was UTF-16 written to address a limit in UTF-8?

No. UTF-16 uses two-byte code units and was designed to make indexing Unicode strings in memory easier, back in the days when all of Unicode would fit in two bytes; systems like Windows and Java still do it that way, with characters outside the Basic Multilingual Plane taking two units (a surrogate pair). Unlike UTF-8 it is not compatible with ASCII, and is of little-to-no use on the Web. But you occasionally meet it in saved files, usually ones saved by Windows users who have been misled by Windows's description of UTF-16LE as “Unicode” in Save-As menus.

seems_utf8

This is very inefficient compared to the regex!

Also, make sure to use utf8_unicode_ci on all of your tables.

You can actually sort of get away without this, treating MySQL as a store for nothing but bytes and only interpreting them as UTF-8 in your script. The advantage of using utf8_unicode_ci is that it will collate (sort and do case-insensitive compares) with knowledge about non-ASCII characters, so eg. ‘ŕ’ and ‘Ŕ’ are the same character. If you use a non-UTF8 collation you should stick to binary (case-sensitive) matching.

Whichever you choose, do it consistently: use the same character set for your tables as you do for your connection. What you want to avoid is a lossy character set conversion between your scripts and the database.

How to set UTF-8 encoding for a PHP file

header('Content-type: text/plain; charset=utf-8');

Getting UTF-8 strings from MySQL using PHP

Thanks Deceze, the culprit ended up being an htmlentities call that needed to be replaced with:

htmlspecialchars($row['col'], ENT_QUOTES, "UTF-8");
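
For illustration, this sketch shows what the corrected call produces and how an entity conversion with the wrong charset mangles UTF-8 input. The sample values are invented; older PHP versions assumed ISO-8859-1 when no charset was passed, which is the situation simulated in the second call.

```php
<?php
// "Helló" wrapped in markup, as UTF-8 bytes.
$value = "<b>Hell\u{00F3}</b>";

// Only the HTML-special characters are escaped; "ó" passes through intact.
$safe = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
// $safe is "&lt;b&gt;Helló&lt;/b&gt;"

// Treating the same bytes as ISO-8859-1 turns each byte of "ó" into
// its own entity.
$mangled = htmlentities("\u{00F3}", ENT_QUOTES, 'ISO-8859-1');
// $mangled is "&Atilde;&sup3;"
```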

In the end I just misread my own code. After all this time it was something so trivial. Frustrating, but glad to have found the solution.

Thanks for all your help.

PHP source code in UTF-8 files; how to interpret properly?

TL;DR

ASCII

Until PHP 5.4, the PHP interpreter didn't at all care about the charset of PHP files, as evidenced by the fact that the zend.script_encoding ini directive only appeared in that version. It always treated it as ASCII basically.

When PHP needs to identify, for example, a function name, that happens to contain characters beyond ASCII-7bit (well, any labeled entity with any label really, but you get my point...), it merely looks for a function in the symbol table with the same byte sequence - an umlaut (or whatever...) written in one way would be treated differently than an umlaut written in another way. Try it. For backwards compatibility, if zend.script_encoding is not set, this is still the default behavior. Also take note of the regex showing what is a valid identifier, which you can see is charset neutral (well... except latin letters, which are in the ASCII-7bit range), but shows you bytes instead.
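
The "umlaut written in one way or another" point can be seen with plain strings. This sketch uses array keys as a stand-in for the engine's symbol table, which is an assumption made for illustration only: lookups compare bytes, so the two spellings never meet.

```php
<?php
$precomposed = "\u{00FC}";   // "ü" as a single code point
$decomposed  = "u\u{0308}";  // "u" followed by COMBINING DIAERESIS

$hexA = bin2hex($precomposed); // "c3bc"
$hexB = bin2hex($decomposed);  // "75cc88"

// Both render as "ü", yet they are distinct keys because the lookup is
// byte-by-byte.
$table = [
    $precomposed => 'first spelling',
    $decomposed  => 'second spelling',
];
$entries = count($table); // 2
```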

This leads us also to the declare(encoding) construct. If you see THAT in a file, that's the definitive charset to honor for that particular file (ONLY). Use something else until you encounter one, and if you see more than one - honor the second one after its declare statement.

If there's none...

In a static context (i.e. when you don't know the effective ini settings), you'd need to fall back to something else (something that's user defined, ideally) when the charset is important, or otherwise just treat characters beyond ASCII-7bit as pure binary, and display them in some uniform code-point-like fashion.

In a dynamic context (e.g. if you could for example rename the file for a moment, create a temporary file at that place, with that name; have it echo the value of zend.script_encoding; restore back the normal file), you should use the zend.script_encoding value if available, and fall back to something else (just as in a static context) otherwise.

The same treatment applies to strings, HTML fragments and any other contents of a PHP file - it's just read as a binary string, except certain ASCII characters (i.e. bytes) that are important to the PHP lexer, such as the sequence "<?php" (notice that all are ASCII characters...); an apostrophe within a single quoted string; etc. - The interpreter itself doesn't care about a string's charset, and if you must display a string's contents on screen, you should use the above means to figure out the best way to do so.

Edge cases (requested in comments):

Is there a restriction on what encoding are allowed?

There doesn't seem to be any list of allowed encodings anywhere, or at least I can't find one. Given that this is the successor of the --enable-zend-multibyte compile setting, UTF encodings of all flavors are sure to be in that list. Even if other (ANSI) encodings don't have an effect on PHP itself, that shouldn't deter you from using that value as a hint.

How does "declare(encoding)" work if the source file is UTF-16 (null 8 bit bytes between 8 bit ascii chars for the declaration)?

zend.script_encoding is used until a declare(encoding) is encountered. If it's not set, ASCII is assumed. While this may be a problem for PHP files encoded as UTF-16, I think it's fair to say the vast majority of developers just don't encode their scripts in UTF-16. Their data, sure, if the application's case calls for it. But not the script itself. Most PHP files in the wild are encoded either with an ANSI encoding or UTF-8.

If the .ini or the file setting is UTF-8 or otherwise, then identifiers are presumably taken only from code points in range x41-xFF, but not from code points x100 up?

I haven't tried supplying invalid UTF-8 bytes to tell you the answer to that one, nor does the manual ever state anything on the question. I would assume that PHP execution will fail with a parse error on that. Or at least it should. As far as your tool is concerned, it should report the invalid UTF-8 sequence anyway, since even if PHP allows it, that's still a QA problem.

For UTF encodings, are characters in strings represented as their UTF code point (that makes no sense since PHP strings seem to only have 8 bit characters)?

No. Characters in strings and non-PHP content are still treated as just a sequence of bytes, which you can confirm by looking at the output of strlen(), and seeing how it differs from mb_strlen(), which is the one that respects encoding (well... it respects the mbstring.internal_encoding setting to be exact, but still).
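
A short check along those lines, with an invented sample string: the string is just bytes, and only the mb_ function counts characters.

```php
<?php
$s = "Hell\u{00F3}"; // "Helló" as UTF-8 bytes

$byteLength = strlen($s);             // 6
$charLength = mb_strlen($s, 'UTF-8'); // 5
$asBytes    = mb_strlen($s, '8bit');  // 6, mbstring told to count bytes

// Indexing works on bytes too: offset 4 is the first byte of "ó".
$fifthByte = bin2hex($s[4]);          // "c3"
$allBytes  = bin2hex($s);             // "48656c6cc3b3"
```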

If not, what does it mean to set the encoding to UTF something?

AFAIK, it affects lookups in the symbol table. With UTF set, umlauts written in different ways, or in different UTF flavors that end up with the same UTF code points... they would all converge on the same symbol, as opposed to without declare(encoding), where byte-by-byte comparison is done instead. And I say "AFAIK" here, because frankly, I've never run such experiments myself... I'm a "do-goody 'everything-as-valid-UTF-8'-er".

MySQL prepared statements can't insert UTF-8 letters

Use a UTF8 encoding/collation on the tables and columns you want to add UTF8 data to.

Adding UTF-8 support to JS/PHP script

Ø§Ù„Ù…Ø±Ø§ÙƒØ² is Mojibake, or possibly "double encoding", for المراكز. Please do SELECT col, hex(col) ... to see which of these the stored bytes look like:

Mojibake: D8A7D984D985D8B1D8A7D983D8B2
double encoding: C398C2A7C399E2809EC399E280A6C398C2B1C398C2A7C399C692C398C2B2

If Mojibake:

The bytes to be stored need to be UTF-8-encoded. Fix this.
The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
HTML should start with <meta charset=UTF-8>.

If double-encoding: This is caused by converting from latin1 (or whatever) to utf8, then treating those bytes as if they were latin1 and repeating the conversion.
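
The double-encoding round trip can be reproduced in PHP. This is a sketch that assumes MySQL's latin1 behaves like Windows-1252 for these bytes; it starts from the correct hex shown above and arrives at the double-encoded hex.

```php
<?php
// Correct UTF-8 bytes for the Arabic word, taken from the hex above.
$utf8 = hex2bin('d8a7d984d985d8b1d8a7d983d8b2');

// Misread those bytes as Windows-1252 and convert them to UTF-8 again:
// this is the double encoding.
$double = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
$doubleHex = strtoupper(bin2hex($double));
// C398C2A7C399E2809EC399E280A6C398C2B1C398C2A7C399C692C398C2B2

// Undoing the extra step recovers the original bytes.
$repaired = mb_convert_encoding($double, 'Windows-1252', 'UTF-8');
```

Printed as UTF-8 text, $double is the garbled string quoted at the top of this answer. A repair like this is only safe when every stored value went through the same extra conversion.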

More discussion:

Trouble with UTF-8 characters; what I see is not what I stored

Do not use the mysql_* interface in PHP; switch to mysqli_* or PDO interfaces. mysql_* was removed in PHP 7.0.

How do I make MySQL return UTF-8?

You have to define the connection to your database as UTF-8:

// Set up your connection (mysqli, since mysql_* no longer exists in PHP 7+)
$connection = mysqli_connect('localhost', 'user', 'pw', 'yourdb');
// Declare the connection character set; this also informs the escaping
// functions, which a raw "SET NAMES" query does not
mysqli_set_charset($connection, 'utf8mb4');

// Now you get UTF-8 encoded stuff
$result = mysqli_query($connection, 'SELECT name FROM place WHERE id = 1');
$row = mysqli_fetch_assoc($result);
