Is "Set Character Set Utf8" Necessary

Using SET CHARACTER SET utf8 after using SET NAMES utf8 will actually reset the character_set_connection and collation_connection to @@character_set_database and @@collation_database respectively.

The manual states that

  • SET NAMES x is equivalent to

    SET character_set_client = x;
    SET character_set_results = x;
    SET character_set_connection = x;
  • and SET CHARACTER SET x is equivalent to

    SET character_set_client = x;
    SET character_set_results = x;
    SET collation_connection = @@collation_database;

whereas SET collation_connection = x also internally executes SET character_set_connection = <<character_set_of_collation_x>> and SET character_set_connection = x internally also executes SET collation_connection = <<default_collation_of_character_set_x>>.
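The equivalences quoted above can be played through with a toy model. This is only a sketch: the session is a plain Python dict, set_names and set_character_set are hypothetical helper names rather than any MySQL client API, and the two-entry collation table is an illustrative subset (utf8_general_ci and latin1_swedish_ci are the default collations of utf8 and latin1). It assumes a database whose default character set is latin1.

```python
# Toy model of the session variables; this is not a MySQL client.
DEFAULT_COLLATION = {"utf8": "utf8_general_ci", "latin1": "latin1_swedish_ci"}
CHARSET_OF_COLLATION = {coll: cs for cs, coll in DEFAULT_COLLATION.items()}


def set_names(session, charset):
    """Model of SET NAMES x: client, results and connection all become x."""
    session["character_set_client"] = charset
    session["character_set_results"] = charset
    session["character_set_connection"] = charset
    # Setting character_set_connection also sets its default collation.
    session["collation_connection"] = DEFAULT_COLLATION[charset]


def set_character_set(session, charset):
    """Model of SET CHARACTER SET x: connection falls back to the database."""
    session["character_set_client"] = charset
    session["character_set_results"] = charset
    session["collation_connection"] = session["collation_database"]
    # Setting collation_connection also sets the matching character set.
    session["character_set_connection"] = CHARSET_OF_COLLATION[
        session["collation_database"]
    ]


# Assumed database defaults for this illustration.
session = {
    "character_set_database": "latin1",
    "collation_database": "latin1_swedish_ci",
}

set_names(session, "utf8")
after_set_names = dict(session)

set_character_set(session, "utf8")
after_set_character_set = dict(session)

print(after_set_names["character_set_connection"])          # utf8
print(after_set_character_set["character_set_connection"])  # latin1
```

In this model the second statement leaves character_set_client and character_set_results at utf8 but moves the connection character set back to latin1.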

So essentially you're resetting character_set_connection to @@character_set_database and collation_connection to @@collation_database. The manual explains the usage of these variables:

What character set should the server
translate a statement to after
receiving it?

For this, the server uses the
character_set_connection and
collation_connection system variables.
It converts statements sent by the
client from character_set_client to
character_set_connection (except for
string literals that have an
introducer such as _latin1 or _utf8).
collation_connection is important for
comparisons of literal strings. For
comparisons of strings with column
values, collation_connection does not
matter because columns have their own
collation, which has a higher
collation precedence.

To sum this up, the encoding/transcoding procedure MySQL uses to process the query and its results is a multi-step process:

  1. MySQL treats the incoming query as being encoded in character_set_client.
  2. MySQL transcodes the statement from character_set_client into character_set_connection
  3. when comparing string values to column values MySQL transcodes the string value from character_set_connection into the character set of the given database column and uses the column collation to do sorting and comparison.
  4. MySQL builds up the result set encoded in character_set_results (this includes result data as well as result metadata such as column names and so on)

So it could be the case that a SET CHARACTER SET utf8 would not be sufficient to provide full UTF-8 support. Think of a default database character set of latin1 and columns defined with utf8-charset and go through the steps described above. As latin1 cannot cover all the characters that UTF-8 can cover, you may lose character information in step 2, before the comparison in step 3 ever sees the data.

  • Step 2: Given that your query is encoded in UTF-8 and contains characters that cannot be represented with latin1, these characters will be lost on transcoding from utf8 to latin1 (the default database character set, which SET CHARACTER SET installs as character_set_connection), making your query fail.
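The lossy transcoding described above can be reproduced outside the database. A minimal sketch, assuming Python's latin-1 codec as a stand-in for MySQL's latin1 (which is really closer to cp1252) and a hypothetical transcode helper; the server's own replacement behaviour may differ in detail, but the characters are gone either way.

```python
def transcode(statement_bytes, client_charset, connection_charset):
    """Decode the statement as the client charset, re-encode for the connection.

    Characters the connection charset cannot represent become '?'.
    """
    text = statement_bytes.decode(client_charset)
    return text.encode(connection_charset, errors="replace")


# The client sends a UTF-8 encoded statement containing Cyrillic text.
query = "SELECT 'Привет'".encode("utf-8")

via_latin1 = transcode(query, "utf-8", "latin-1")  # connection charset latin1
via_utf8 = transcode(query, "utf-8", "utf-8")      # connection charset utf8

print(via_latin1)                  # b"SELECT '??????'"
print(via_utf8.decode("utf-8"))    # SELECT 'Привет'
```

Once the literal has been reduced to question marks, no later conversion to a utf8 column can bring the original characters back.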

So I think it's safe to say that SET NAMES ... is the correct way to handle character set issues. I might add that setting up your MySQL server variables correctly (all the required variables can be set statically in your my.cnf) frees you from the performance overhead of the extra query required on every connect.

To use utf8 or not - MySQL and PHP character encoding issue

Your problem is that your SET NAMES 'utf8_persian_ci' command was invalid (utf8_persian_ci is a collation, not an encoding). If you run it in a terminal you will see the error Unknown character set: 'utf8_persian_ci'. Thus your application, when it stored the data, was using the latin1 character set. MySQL interpreted your input as latin1 characters, which it then stored encoded as UTF-8. Likewise, when the data was pulled back out, MySQL converted it from UTF-8 back to latin1 and (hopefully, most of the time) returned the original bytes you gave it.

In other words, all your data in the database is completely messed up, but it just so happened to work.

To fix this, you need to undo what you did. The most straightforward way is using PHP:

  1. SET NAMES latin1;
  2. Select every single text field from every table.
  3. SET NAMES utf8;
  4. Update the same rows using the same string unaltered.
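Why these four steps work can be seen at the byte level. A minimal sketch, assuming Python's latin-1 codec as a stand-in for MySQL's latin1 (it maps every byte to a character, so the round trip is lossless) and two hypothetical helpers that imitate what the server did and what the repair does.

```python
def store_through_latin1_connection(utf8_bytes):
    """What happened on INSERT: UTF-8 bytes were read as latin1 characters
    and then stored encoded as UTF-8."""
    return utf8_bytes.decode("latin-1").encode("utf-8")


def repair(stored_bytes):
    """The four steps: read back over a latin1 connection (recovering the
    original bytes), then hand the same bytes over as UTF-8."""
    original_bytes = stored_bytes.decode("utf-8").encode("latin-1")
    return original_bytes.decode("utf-8")


original = "Señor"
sent = original.encode("utf-8")                  # 6 bytes on the wire
stored = store_through_latin1_connection(sent)   # 7 bytes in the column

print(sent.hex().upper())    # 5365C3B16F72
print(stored.hex().upper())  # 5365C383C2B16F72
print(repair(stored))        # Señor
```

The stored value is longer than what was sent because each non-ASCII byte was encoded a second time; reading it back as latin1 undoes exactly that second encoding.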

Alternatively you can perform these steps inside MySQL, but it's tricky because MySQL understands the data to be in a certain character set. You need to modify your text columns to a BLOB type, then modify them back to text types with a utf8 character set. See the section at the bottom of the ALTER TABLE MySQL documentation labeled "Warning" in red.

After you do either one of these things, the bytes stored in your database columns will actually be encoded in the character set the columns claim to use. Then, make sure you always use mysql_set_charset('utf8') on any database access from PHP that you may do in the future! Otherwise you will mess things up again. (Note: do not use a simple mysql_query('SET NAMES utf8')! There are corner cases, such as a reset connection, where this can be reset to latin1 without your knowledge. mysql_set_charset() will set the charset whenever necessary.)

It would be best if you switched away from mysql_* functions and used PDO instead with the charset=utf8 parameter in your PDO dsn.

Suggested character set for non utf8 columns in mysql

GUID/UUID/MD5/SHA1 are all hex and dash. For them

CHAR(..) CHARACTER SET ascii COLLATE ascii_general_ci

That will allow for A=a when comparing hex strings.

For Base64 things, use either of

CHAR(..) CHARACTER SET ascii COLLATE ascii_bin
BINARY(..)

since A is not semantically the same as a.
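The difference between the two kinds of value is easy to check. A small sketch using only Python's standard library: hex digits decode to the same bytes in either case, while Base64 uses case as part of the alphabet.

```python
import base64

# Hex: upper and lower case spell the same bytes, so A=a is harmless.
hex_upper = bytes.fromhex("DEADBEEF")
hex_lower = bytes.fromhex("deadbeef")

# Base64: changing the case changes the decoded bytes.
b64_original = base64.b64decode("QUJD")
b64_lowered = base64.b64decode("qujd")

print(hex_upper == hex_lower)        # True
print(b64_original)                  # b'ABC'
print(b64_original == b64_lowered)   # False
```

A case-insensitive collation on a Base64 column would therefore treat two different payloads as equal.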

Further notes...

  • utf8 spits at you if you give it an invalid 8-bit value.
  • ascii spits at you for any 8-bit value.
  • latin1 accepts anything -- thereby your problems down the road
  • It is quite OK to have different columns in a table having different charsets and/or collations.
  • The charset/collation on the table is just a default, ripe for overriding at the column definition.
  • BINARY may be a tiny bit faster than any _bin collation, but not enough to notice.
  • Use CHAR for columns that are truly fixed length; don't mislead the user by using it for other cases.
  • %_bin is faster than %_general_ci, which is faster than other collations. Again, you would be hard-pressed to measure a difference.
  • Never use TINYTEXT or TINYBLOB.
  • For proper encoding, use the appropriate charset.
  • For "proper sorting", use the appropriate collation. See example below.
  • For "proper sorting" where multiple languages are represented, and you are using utf8mb4, use utf8mb4_unicode_520_ci (or utf8mb4_0900_ai_ci if using version 8.0). The 520 and 900 refer to Unicode standards; new collations are likely to come in the future.
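The first three notes in the list above (utf8 rejects invalid bytes, ascii rejects any 8-bit value, latin1 accepts anything) have a direct analogue in Python's codecs. This is an analogy, not MySQL itself: the accepts helper is hypothetical, and MySQL may warn or truncate rather than raise.

```python
def accepts(raw, codec):
    """Return True if the codec can decode the bytes without complaint."""
    try:
        raw.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# 'ñ' sent as the single byte F1 (a latin1 encoding), not as UTF-8.
stray = b"Se\xf1or"

print(accepts(stray, "utf-8"))    # False: F1 is not a valid sequence here
print(accepts(stray, "ascii"))    # False: any byte above 7F is rejected
print(accepts(stray, "latin-1"))  # True: every byte maps to some character
```

The permissive case is the dangerous one: wrongly encoded data goes in silently and only shows up later as mojibake.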

If you are entirely in Czech, then consider these charsets and collations. I list them in preferred order:

mysql> show collation like '%czech%';
+------------------+---------+-----+---------+----------+---------+
| Collation        | Charset | Id  | Default | Compiled | Sortlen |
+------------------+---------+-----+---------+----------+---------+
| utf8mb4_czech_ci | utf8mb4 | 234 |         | Yes      |       8 | -- opens up the world
| utf8_czech_ci    | utf8    | 202 |         | Yes      |       8 | -- opens up most of the world
| latin2_czech_cs  | latin2  |   2 |         | Yes      |       4 | -- kinda like latin1

The rest are "useless":

| cp1250_czech_cs  | cp1250  |  34 |         | Yes      |       2 |
| ucs2_czech_ci    | ucs2    | 138 |         | Yes      |       8 |
| utf16_czech_ci   | utf16   | 111 |         | Yes      |       8 |
| utf32_czech_ci   | utf32   | 170 |         | Yes      |       8 |
+------------------+---------+-----+---------+----------+---------+
7 rows in set (0.00 sec)

More

  • The reason for using smaller datatypes (where appropriate) is to shrink the dataset, which leads to less I/O, which leads to things being more cacheable, which makes the program run faster. This is especially important for huge datasets; it is less important for small- or medium-sized datasets.
  • ENUM is 1 byte, yet acts like a string. So you get the "best of both worlds". (There are drawbacks, and there is a 'religious war' among advocates for ENUM vs TINYINT vs VARCHAR.)
  • Usually columns that are "short" are always the same length. A country_code is always 2 letters, always ascii, always could benefit from case insensitive collation. So CHAR(2) CHARACTER SET ascii COLLATE ascii_general_ci is optimal. If you have something that is sometimes 1-char, sometimes 2, then flip a coin; whatever you do won't make much difference.
  • VARCHAR (up to 255) has an extra 1-byte length attached to it. So, if your strings vary in length at all, VARCHAR is at least as good as CHAR. So simplify your brain processing: "variable length --> VARCHAR".
  • BIT, depending on version, may be implemented as a 1-byte TINYINT UNSIGNED. If you have only a few bits in your table, it is not worth worrying about.
  • One of my Rules of Thumb says that if you aren't likely to get a 10% improvement, move on to some other optimization. Much of what we are discussing here is under 10% (space in this case). Still, get in the habit of thinking about it when writing CREATE TABLE. I often see tables with BIGINT and DOUBLE (each 8 bytes) that could easily use smaller columns. Sometimes saving more than 50% (space).
  • How does "space" translate into "speed"? Tiny tables -> a tiny percentage. Huge tables -> In some cases 10x. (That's 10-fold, not 10%.) (UUIDs are one way to get really bad performance on huge tables.)

ENUM

  • Acts and feels like a string, yet takes only one byte. (One byte translates, indirectly, into a slight speed improvement.)
  • Practical when fewer than, say, 10 different values.
  • Impractical if frequently adding a new value -- requires ALTER TABLE, though it can be "inplace".
  • Suggest starting the list with 'unknown' (or something like that) and making the column NOT NULL (versus NULL).
  • The character set for the enum needs to be whatever is otherwise being used for the connection. The choice does not matter much unless you have options that collate equal (eg, A versus a).

Is UTF-8 an encoding or a character set?

UTF-8 is an encoding and that term is used in the RFC that defines it which is quoted below.



I often see the terms "encoding" and "charset" used interchangeably

Prior to Unicode, if you wanted to use an alphabet† like Cyrillic or Greek, you needed to use an encoding that covered only the characters in that alphabet. Thus, the terms encoding and charset were often conflated, but they mean different things.

Now though, Unicode is usually the only character set you need to worry about since it contains characters for most written languages you'll have to deal with, except Klingon.

† - Alphabet, a kind of *character set* where characters correspond directly to sounds in a spoken language.


A character set is a mapping from integers to characters, symbols, glyphs, or other marks in a written language. Unicode is a character set that assigns each character a code point, an integer that fits in 21 bits. The Unicode Consortium's glossary describes it thus:

Unicode

  1. The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium: http://www.unicode.org.
  2. A label applied to software internationalization and localization standards developed and maintained by the Unicode Consortium.

An encoding is a mapping from strings to strings. UTF-8 is an encoding that maps strings of bytes (8-bit integers) to strings of code points (21-bit integers). The Unicode Consortium calls it a "character encoding scheme" and it is defined in RFC 3629.

The originally proposed encodings of the UCS, however, were
not compatible with many current applications and protocols, and this
has led to the development of UTF-8
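The distinction between the character set and an encoding of it can be made concrete with one character. A short sketch using Python's built-in str and bytes types: the code point is a single integer, and each encoding turns that integer into a different byte string.

```python
ch = "\u00e9"  # é, LATIN SMALL LETTER E WITH ACUTE

# Character set: Unicode assigns this character one code point.
code_point = ord(ch)

# Encodings: different byte strings for the same code point.
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")
as_latin1 = ch.encode("latin-1")

print(hex(code_point))        # 0xe9
print(as_utf8.hex().upper())  # C3A9
print(as_utf16.hex().upper()) # 00E9
print(as_latin1.hex().upper())# E9

# Decoding with the matching encoding gives the same code point back.
print(as_utf8.decode("utf-8") == as_utf16.decode("utf-16-be") == ch)  # True
```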

Migrating a php application to handle UTF-8

There's a little more to it than just replacing those functions.

Regular expressions

You should add the u (UTF-8) modifier to all of your PCRE regular expressions that may be applied to strings containing non-ASCII characters, so that the pattern and subject are interpreted as characters rather than bytes.

$subject = "Helló";
$pattern = '/(l|ó){2,3}/u'; //The u flag indicates the pattern is UTF8
preg_match($pattern, substr($subject,3), $matches, PREG_OFFSET_CAPTURE);

Also, you should use the Unicode character classes rather than the standard Perl ones if you want your regular expressions to be correct for non-Latin alphabets.

  • \p{L} instead of \w for any 'letter' character.
  • \p{Z} instead of \s for any 'space' character.
  • \p{N} instead of \d for any 'digit' character e.g. Arabic numbers

There are a lot of different Unicode character classes, some of which are quite unusual to someone used to reading and writing in a Latin alphabet. For example some characters combine with the previous character to make a new glyph. More explanation of them can be read here.

Although there are regular expression functions in the mbstring extension, they are not recommended for use. The standard PCRE functions work fine with the UTF8 flag.
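The effect of matching bytes instead of characters can be shown with an analogy in Python, whose re module works on either bytes or str. This is not PHP's PCRE and not the u modifier itself, only a sketch of the same failure mode: without character awareness, "." consumes a single byte, and ó is two bytes in UTF-8.

```python
import re

subject = "Hell\u00f3"         # Helló, five characters
raw = subject.encode("utf-8")  # six bytes, because ó takes two

# Byte-oriented matching: '.' covers only the first byte of ó.
byte_match = re.fullmatch(rb"Hell.", raw)

# Character-oriented matching: '.' covers the whole ó.
char_match = re.fullmatch(r"Hell.", subject)

print(byte_match)              # None
print(char_match is not None)  # True
```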

Function replacements

Although your list is a start, the list of functions I have found so far that need to be replaced with multibyte versions is longer. This is the list of functions with their replacement functions, some of which are not defined in PHP, but are available from here on Github as mb_extra.

$unsafeFunctions = array(
'mail' => 'mb_send_mail',
'split' => null, //'mb_split', deprecated function - just don't use it
'stripos' => 'mb_stripos',
'stristr' => 'mb_stristr',
'strlen' => 'mb_strlen',
'strpos' => 'mb_strpos',
'strrpos' => 'mb_strrpos',
'strrchr' => 'mb_strrchr',
'strripos' => 'mb_strripos',
'strstr' => 'mb_strstr',
'strtolower' => 'mb_strtolower',
'strtoupper' => 'mb_strtoupper',
'substr_count' => 'mb_substr_count',
'substr' => 'mb_substr',
'str_ireplace' => null,
'str_split' => 'mb_str_split', //TODO - check this works
'strcasecmp' => 'mb_strcasecmp', //TODO - check this works
'strcspn' => null, //TODO - implement alternative
'strrev' => 'mb_strrev', //TODO - check this works
'strspn' => null, //TODO - implement alternative
'substr_replace'=> 'mb_substr_replace',
'lcfirst' => null,
'ucfirst' => 'mb_ucfirst',
'ucwords' => 'mb_ucwords',
'wordwrap' => null,
);
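Why the byte-oriented functions in this list are unsafe comes down to counting and cutting bytes instead of characters. A sketch of the same effect in Python, where bytes plays the role of a PHP string handled by strlen/substr and str plays the role of the mb_* variants; the variable names are illustrative only.

```python
text = "Hell\u00f3"          # Helló
raw = text.encode("utf-8")

byte_length = len(raw)       # what a byte-oriented strlen reports
char_length = len(text)      # what a multibyte-aware length reports

# Cutting after 5 bytes splits ó in half and leaves invalid UTF-8.
cut_bytes = raw[:5]
cut_chars = text[:5]

print(byte_length, char_length)                      # 6 5
print(cut_bytes)                                     # b'Hell\xc3'
print(cut_bytes.decode("utf-8", errors="replace"))   # Hell followed by U+FFFD
print(cut_chars)                                     # Helló
```

The U+FFFD produced by the broken cut is the same replacement character that browsers render as a black diamond with a question mark.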

MySQL

Although you would have thought that setting the character set to utf8 would give you UTF-8 support in MySQL, it does not.

It only gives you support for characters that are encoded in up to 3 bytes, i.e. the Basic Multilingual Plane. However, people are actively using characters that require 4 bytes to encode, including most of the Emoji characters, which live in the Supplementary Multilingual Plane.

To support these you should in general use:

  • utf8mb4 - for your character encoding.
  • utf8mb4_unicode_ci - for your character collation.

For specific scenarios there are alternative collation sets that may be appropriate for you, but in general stick to the collation set that is most correct.
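Whether a given string would survive in a 3-byte utf8 column can be checked from its code points. A minimal sketch with a hypothetical needs_utf8mb4 helper: anything above U+FFFF lies outside the Basic Multilingual Plane and takes 4 bytes in UTF-8.

```python
def needs_utf8mb4(text):
    """True if any character lies outside the Basic Multilingual Plane."""
    return any(ord(ch) > 0xFFFF for ch in text)


euro = "\u20ac"        # €, inside the BMP
alien = "\U0001F47D"   # an emoji, outside the BMP

print(len(euro.encode("utf-8")))    # 3
print(len(alien.encode("utf-8")))   # 4
print(needs_utf8mb4("Se\u00f1or"))  # False
print(needs_utf8mb4("hi " + alien)) # True
```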

The places where you should set the character set and collation in your MySQL config file are:

[mysql]
default-character-set=utf8mb4

[client]
default-character-set=utf8mb4

[mysqld]
init-connect='SET NAMES utf8mb4'
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci

The SET NAMES may not be required in all circumstances, but it is safer to leave it on, at only a small speed penalty.

PHP INI File

Although you said you have set mb_internal_encoding in your bootstrap script, it is much better to do this in the PHP ini file, and also set all the recommended parameters:

mbstring.language   = Neutral   ; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On ; HTTP input encoding translation is enabled
mbstring.http_input = auto ; Set HTTP input character set detection to auto
mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order = auto ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset = UTF-8 ; Default character set for auto content type header

Helping browser to choose UTF8 for forms

  • You need to set accept-charset on your forms to be UTF-8 to tell browsers to submit them as UTF8.

  • Add a UTF8 character to your form in a hidden field, to stop Internet Explorer (5, 6, 7 and 8) from submitting a form as something other than UTF8.

Misc

  • If you're using Apache set "AddDefaultCharset utf-8"

  • As you said you're doing, but just to remind anyone reading the answer, set the meta content-type as well in the header.

That should be about it. Although it's worth reading the "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" page, I think it is preferable to use UTF-8 everywhere and so not have to spend any mental effort on handling different character sets.

Is the charset attribute required with HTML5?

It is not necessary to include <meta charset="blah">. As the specification says, the character set may also be specified by the server using the HTTP Content-Type header or by including a Unicode BOM at the beginning of the downloaded file.

Most web servers today will send back a character set in the Content-Type header for HTML text data if none is specified. If the web server doesn't send back a character set with the Content-Type header and the file does not include a BOM and the page does not include a <meta charset="blah"> declaration, the browser will have a default encoding that is usually based on the language settings of the host computer. If this does not match the actual character encoding of the file, then some characters will be displayed improperly.

Will browsers use the proper encoding 99% of the time? If your page is UTF-8, probably. If not, probably not.

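The W3C provides a document outlining the precedence rules for the three methods. Under HTML5 a BOM takes precedence over the HTTP header, which in turn takes precedence over the in-document specification (meta tag); older browsers gave the HTTP header priority over the BOM.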

Trouble with UTF-8 characters; what I see is not what I stored

This problem plagues the participants of this site, and many others.

You have listed the five main cases of CHARACTER SET troubles.

Best Practice

Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)

utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.

Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.

I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.

Overview of what you should do

  • Have your editor, etc. set to UTF-8.
  • HTML forms should start like <form accept-charset="UTF-8">.
  • Have your bytes encoded as UTF-8.
  • Establish UTF-8 as the encoding being used in the client.
  • Have the column/table declared CHARACTER SET utf8mb4 (Check with SHOW CREATE TABLE.)
  • <meta charset=UTF-8> at the beginning of HTML
  • Stored Routines acquire the current charset/collation. They may need rebuilding.

UTF-8 all the way through

More details for computer languages (and its following sections)

Test the data

Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do

SELECT col, HEX(col) FROM tbl WHERE ...

The HEX for correctly stored UTF-8 will be

  • For a blank space (in any language): 20
  • For English: 4x, 5x, 6x, or 7x
  • For most of Western Europe, accented letters should be Cxyy
  • Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
  • Most of Asia: Exyyzz
  • Emoji and some of Chinese: F0yyzzww
  • More details
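The patterns in this list follow from the UTF-8 lead byte, which announces how many bytes the character occupies. A small sketch with a hypothetical split_utf8_hex helper that takes the output of HEX(col) and groups it per character; it looks only at lead bytes and does not validate the continuation bytes.

```python
def split_utf8_hex(hex_string):
    """Split HEX(col) output into one hex group per UTF-8 character."""
    raw = bytes.fromhex(hex_string)
    groups = []
    i = 0
    while i < len(raw):
        lead = raw[i]
        if lead < 0x80:
            size = 1   # ASCII: 20, 4x, 5x, 6x, 7x
        elif 0xC2 <= lead <= 0xDF:
            size = 2   # Cxyy / Dxyy
        elif 0xE0 <= lead <= 0xEF:
            size = 3   # Exyyzz
        elif 0xF0 <= lead <= 0xF4:
            size = 4   # F0yyzzww and up
        else:
            raise ValueError(f"byte {lead:02X} cannot start a UTF-8 character")
        groups.append(raw[i:i + size].hex().upper())
        i += size
    return groups


print(split_utf8_hex("5365C3B16F72"))  # ['53', '65', 'C3B1', '6F', '72']
print(split_utf8_hex("F09F91BD"))      # ['F09F91BD']
```

A latin1-encoded ñ (the single byte F1 followed by 6F) would raise here, which is the kind of mismatch the HEX check is meant to expose.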

Specific causes and fixes of the problems seen

Truncated text (Se for Señor):

  • The bytes to be stored are not encoded as utf8mb4. Fix this.
  • Also, check that the connection during reading is UTF-8.

Black Diamonds with question marks (Se�or for Señor);
one of these cases exists:

Case 1 (original bytes were not UTF-8):

  • The bytes to be stored are not encoded as utf8. Fix this.
  • The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
  • Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Case 2 (original bytes were UTF-8):

  • The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
  • Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).

Black diamonds occur only when the browser is set to <meta charset=UTF-8>.

Question Marks (regular ones, not black diamonds) (Se?or for Señor):

  • The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
  • The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
  • Also, check that the connection during reading is UTF-8.

Mojibake (SeÃ±or for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)

  • The bytes to be stored need to be UTF-8-encoded. Fix this.
  • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
  • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
  • HTML should start with <meta charset=UTF-8>.

If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.

Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.

é should come back C3A9, but instead shows C383C2A9
The Emoji should come back F09F91BD, but comes back C3B0C5B8E28098C2BD

That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example,
sorting as if the string were SeÃ±or.
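The hex values quoted above can be reproduced by performing the double conversion by hand. A minimal sketch, assuming Python's cp1252 codec as a stand-in for MySQL's latin1 and a hypothetical double_encode helper; cp1252 leaves a few bytes undefined, so this works for the examples here but not for every possible input.

```python
def double_encode(text):
    """Encode as UTF-8, misread those bytes as latin1, encode as UTF-8 again."""
    return text.encode("utf-8").decode("cp1252").encode("utf-8")


e_acute = "\u00e9"     # é
alien = "\U0001F47D"   # the emoji whose correct hex is F09F91BD

print(e_acute.encode("utf-8").hex().upper())   # C3A9
print(double_encode(e_acute).hex().upper())    # C383C2A9
print(alien.encode("utf-8").hex().upper())     # F09F91BD
print(double_encode(alien).hex().upper())      # C3B0C5B8E28098C2BD

# What the double-encoded column effectively contains, and sorts as:
print("Se\u00f1or".encode("utf-8").decode("cp1252"))  # SeÃ±or
```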

Fixing the Data, where possible

For Truncation and Question Marks, the data is lost.

For Mojibake / Double Encoding, ...

For Black Diamonds, ...

The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

How do i Setup utf-8 as standard character set for a mysql server?

Solution:

add this into my.cnf:

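[mysqld]
character-set-server=utf8
character-sets-dir=/usr/share/mysql/charsets
# default-character-set is no longer accepted by mysqld as of MySQL 5.5;
# uncomment the next line only on older servers.
# default-character-set=utf8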

[mysql]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8

[mysqladmin]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8

[mysqlcheck]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8

[mysqldump]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8

[mysqlimport]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8

[mysqlshow]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
#end

