Is SET CHARACTER SET utf8 necessary?
Using SET CHARACTER SET utf8 after using SET NAMES utf8 will actually reset character_set_connection and collation_connection to @@character_set_database and @@collation_database respectively.
The manual states that SET NAMES x is equivalent to
SET character_set_client = x;
SET character_set_results = x;
SET character_set_connection = x;
and that SET CHARACTER SET x is equivalent to
SET character_set_client = x;
SET character_set_results = x;
SET collation_connection = @@collation_database;
whereas SET collation_connection = x also internally executes SET character_set_connection = <<character_set_of_collation_x>>, and SET character_set_connection = x also internally executes SET collation_connection = <<default_collation_of_character_set_x>>.
So essentially you're resetting character_set_connection to @@character_set_database and collation_connection to @@collation_database. The manual explains the usage of these variables:
What character set should the server translate a statement to after receiving it?
For this, the server uses the
character_set_connection and
collation_connection system variables.
It converts statements sent by the
client from character_set_client to
character_set_connection (except for
string literals that have an
introducer such as _latin1 or _utf8).
collation_connection is important for
comparisons of literal strings. For
comparisons of strings with column
values, collation_connection does not
matter because columns have their own
collation, which has a higher
collation precedence.
To sum this up, the encoding/transcoding procedure MySQL uses to process the query and its results is a multi-step process:
1. MySQL treats the incoming query as being encoded in character_set_client.
2. MySQL transcodes the statement from character_set_client into character_set_connection.
3. When comparing string values to column values, MySQL transcodes the string value from character_set_connection into the character set of the given database column and uses the column collation to do sorting and comparison.
4. MySQL builds up the result set encoded in character_set_results (this includes result data as well as result metadata such as column names and so on).
So it could be the case that a SET CHARACTER SET utf8 would not be sufficient to provide full UTF-8 support. Think of a default database character set of latin1 and columns defined with the utf8 charset, and go through the steps described above. As latin1 cannot cover all the characters that UTF-8 can cover, you may lose character information in step 3.
- Step 3: Given that your query is encoded in UTF-8 and contains characters that cannot be represented with latin1, these characters will be lost on transcoding from utf8 to latin1 (the default database character set), making your query fail.
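The loss described above can be reproduced outside MySQL. The following is a minimal Python sketch, not MySQL's actual conversion code: it assumes that an unrepresentable character is replaced with a question mark during transcoding, and the helper name `transcode` is hypothetical.

```python
def transcode(text: str, target_charset: str) -> str:
    """Simulate converting text into target_charset and reading it back.

    Characters the target charset cannot represent become '?', which is
    an assumption about how a lossy conversion behaves.
    """
    return text.encode(target_charset, errors="replace").decode(target_charset)


query_literal = "Señor Ωmega"  # 'ñ' exists in latin1, 'Ω' does not

via_latin1 = transcode(query_literal, "latin-1")  # connection charset latin1
via_utf8 = transcode(query_literal, "utf-8")      # connection charset utf8

print(via_latin1)  # the Greek letter has been replaced
print(via_utf8)    # unchanged
```

Once the character has been replaced, no later conversion back to a utf8 column can restore it, which is why the connection character set matters even when every column is declared utf8.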
So I think it's safe to say that SET NAMES ... is the correct way to handle character set issues. I might add, though, that setting up your MySQL server variables correctly (all the required variables can be set statically in your my.cnf) frees you from the performance overhead of the extra query required on every connect.
To use utf8 or not - MySQL and PHP character encoding issue
Your problem is that your SET NAMES 'utf8_persian_ci' command was invalid (utf8_persian_ci is a collation, not an encoding). If you run it in a terminal you will see the error Unknown character set: 'utf8_persian_ci'. Thus your application, when it stored the data, was using the latin1 character set. MySQL interpreted your input as latin1 characters, which it then stored encoded as UTF-8. Likewise, when the data was pulled back out, MySQL converted it from UTF-8 back to latin1 and (hopefully, most of the time) to the original bytes you gave it.
In other words, all your data in the database is completely messed up, but it just so happened to work.
To fix this, you need to undo what you did. The most straightforward way is using PHP:
1. SET NAMES latin1;
2. Select every single text field from every table.
3. SET NAMES utf8;
4. Update the same rows using the same string unaltered.
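Why reading over a latin1 connection returns the original bytes can be traced at the byte level. The sketch below is a simulation in Python rather than real database access; the two function names are hypothetical, and it assumes that MySQL's latin1 behaves like Windows-1252 (`cp1252`).

```python
MYSQL_LATIN1 = "cp1252"  # assumption: MySQL's latin1 behaves like Windows-1252


def store_with_wrong_connection(utf8_bytes: bytes) -> bytes:
    """The client sent UTF-8 bytes, but the connection claimed latin1.

    The server reads each byte as one latin1 character and stores those
    characters encoded as UTF-8, so the text ends up encoded twice.
    """
    return utf8_bytes.decode(MYSQL_LATIN1).encode("utf-8")


def read_with_latin1_connection(stored: bytes) -> bytes:
    """Reading over a latin1 connection converts the column back to latin1."""
    return stored.decode("utf-8").encode(MYSQL_LATIN1)


original = "Señor".encode("utf-8")
stored = store_with_wrong_connection(original)
recovered = read_with_latin1_connection(stored)

print(original.hex())         # bytes the application sent
print(stored.hex())           # bytes sitting in the column (longer)
print(recovered == original)  # the latin1 read gives the application bytes back
```

Writing `recovered` back over a utf8 connection then stores those bytes without a second conversion, which is what the last two steps of the procedure rely on.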
Alternatively you can perform these steps inside MySQL, but it's tricky because MySQL understands the data to be in a certain character set. You need to modify your text columns to a BLOB type, then modify them back to text types with a utf8 character set. See the section at the bottom of the ALTER TABLE MySQL documentation labeled "Warning" in red.
After you do either one of these things, the bytes stored in your database columns will be in the character set they claim to be. Then, make sure you always use mysql_set_charset('utf8') on any database access from PHP that you may do in the future! Otherwise you will mess things up again. (Note: do not use a simple mysql_query('SET NAMES utf8')! There are corner cases (such as a reset connection) where this can be reset to latin1 without your knowledge. mysql_set_charset() will set the charset whenever necessary.)
It would be best if you switched away from the mysql_* functions and used PDO instead, with the charset=utf8 parameter in your PDO DSN.
Suggested character set for non utf8 columns in mysql
GUID/UUID/MD5/SHA1 are all hex and dash. For them, use
CHAR(..) CHARACTER SET ascii COLLATE ascii_general_ci
That will allow for A=a when comparing hex strings.
For Base64 things, use either of
CHAR(..) CHARACTER SET ascii COLLATE ascii_bin
BINARY(..)
since A is not semantically the same as a.
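The difference between the two cases can be checked with a short Python sketch. It is only an illustration of why case matters for Base64 and not for hex; the sample values are made up.

```python
import base64

# Hex digits: case carries no information, so both spellings decode identically.
upper_hex = "DEADBEEF"
lower_hex = "deadbeef"
same_hex_value = bytes.fromhex(upper_hex) == bytes.fromhex(lower_hex)

# Base64: 'A' and 'a' are different digits, so changing case changes the data.
encoded = base64.b64encode(b"ABC").decode("ascii")
lowered = encoded.lower()
same_base64_value = base64.b64decode(encoded) == base64.b64decode(lowered)

print(same_hex_value)     # a case-insensitive collation is safe for hex
print(same_base64_value)  # a case-insensitive collation would conflate distinct values
```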
Further notes...
- utf8 spits at you if you give it an invalid 8-bit value.
- ascii spits at you for any 8-bit value.
- latin1 accepts anything -- thereby your problems down the road
- It is quite OK to have different columns in a table having different charsets and/or collations.
- The charset/collation on the table is just a default, ripe for overriding at the column definition.
- BINARY may be a tiny bit faster than any _bin collation, but not enough to notice.
- Use CHAR for columns that are truly fixed length; don't mislead the user by using it for other cases.
- %_bin is faster than %_general_ci, which is faster than other collations. Again, you would be hard-pressed to measure a difference.
- Never use TINYTEXT or TINYBLOB.
- For proper encoding, use the appropriate charset.
- For "proper sorting", use the appropriate collation. See example below.
- For "proper sorting" where multiple languages are represented, and you are using utf8mb4, use utf8mb4_unicode_520_ci (or utf8mb4_0900_ai_ci if using version 8.0). The 520 and 900 refer to Unicode standards; new collations are likely to come in the future.
If you are entirely in Czech, then consider these charsets and collations. I list them in preferred order:
mysql> show collation like '%czech%';
+------------------+---------+-----+---------+----------+---------+
| Collation        | Charset | Id  | Default | Compiled | Sortlen |
+------------------+---------+-----+---------+----------+---------+
| utf8mb4_czech_ci | utf8mb4 | 234 |         | Yes      |       8 | -- opens up the world
| utf8_czech_ci    | utf8    | 202 |         | Yes      |       8 | -- opens up most of the world
| latin2_czech_cs  | latin2  |   2 |         | Yes      |       4 | -- kinda like latin1
The rest are "useless":
| cp1250_czech_cs  | cp1250  |  34 |         | Yes      |       2 |
| ucs2_czech_ci    | ucs2    | 138 |         | Yes      |       8 |
| utf16_czech_ci   | utf16   | 111 |         | Yes      |       8 |
| utf32_czech_ci   | utf32   | 170 |         | Yes      |       8 |
+------------------+---------+-----+---------+----------+---------+
7 rows in set (0.00 sec)
More
- The reason for using smaller datatypes (where appropriate) is to shrink the dataset, which leads to less I/O, which leads to things being more cacheable, which makes the program run faster. This is especially important for huge datasets; it is less important for small- or medium-sized datasets.
- ENUM is 1 byte, yet acts like a string. So you get the "best of both worlds". (There are drawbacks, and there is a 'religious war' among advocates for ENUM vs TINYINT vs VARCHAR.)
- Usually columns that are "short" are always the same length. A country_code is always 2 letters, always ascii, and always could benefit from a case-insensitive collation. So CHAR(2) CHARACTER SET ascii COLLATE ascii_general_ci is optimal. If you have something that is sometimes 1-char, sometimes 2, then flip a coin; whatever you do won't make much difference.
- VARCHAR (up to 255) has an extra 1-byte length attached to it. So, if your strings vary in length at all, VARCHAR is at least as good as CHAR. So simplify your brain processing: "variable length --> VARCHAR".
- BIT, depending on version, may be implemented as a 1-byte TINYINT UNSIGNED. If you have only a few bits in your table, it is not worth worrying about.
- One of my Rules of Thumb says that if you aren't likely to get a 10% improvement, move on to some other optimization. Much of what we are discussing here is under 10% (space in this case). Still, get in the habit of thinking about it when writing CREATE TABLE. I often see tables with BIGINT and DOUBLE (each 8 bytes) that could easily use smaller columns, sometimes saving more than 50% (space).
- How does "space" translate into "speed"? Tiny tables -> a tiny percentage. Huge tables -> in some cases 10x. (That's 10-fold, not 10%.) (UUIDs are one way to get really bad performance on huge tables.)
ENUM
- Acts and feels like a string, yet takes only one byte. (One byte translates, indirectly, into a slight speed improvement.)
- Practical when fewer than, say, 10 different values.
- Impractical if frequently adding a new value -- requires ALTER TABLE, though it can be "inplace".
- Suggest starting the list with 'unknown' (or something like that) and making the column NOT NULL (versus NULL).
- The character set for the enum needs to be whatever is otherwise being used for the connection. The choice does not matter much unless you have options that collate equal (eg, A versus a).
Is UTF-8 an encoding or a character set?
UTF-8 is an encoding and that term is used in the RFC that defines it which is quoted below.
I often see the terms "encoding" and "charset" used interchangeably
Prior to Unicode, if you wanted to use an alphabet† like Cyrillic or Greek, you needed to use an encoding that only encoded the characters in that alphabet. Thus, the terms encoding and charset were often conflated, but they mean different things.
Now though, Unicode is usually the only character set you need to worry about, since it contains characters for most written languages you'll have to deal with, except Klingon.
† - Alphabet, a kind of *character set* where characters correspond directly to sounds in a spoken language.
A character set is a mapping from code-units (integers) to characters, symbols, glyphs, or other marks in a written language. Unicode is a character set that maps 21-bit integers to Unicode code points. The Unicode Consortium's glossary describes it thus:
Unicode
- The standard for digital representation of the characters used in writing all of the world's languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by all modern computers and is the foundation for processing text on the Internet. Unicode is developed and maintained by the Unicode Consortium: http://www.unicode.org.
- A label applied to software internationalization and localization standards developed and maintained by the Unicode Consortium.
An encoding is a mapping from strings to strings. UTF-8 is an encoding that maps strings of bytes (8-bit integers) to strings of code points (21-bit integers). The Unicode Consortium calls it a "character encoding scheme", and it is defined in RFC 3629.
The originally proposed encodings of the UCS, however, were
not compatible with many current applications and protocols, and this
has led to the development of UTF-8
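The distinction between the character set and an encoding of it can be seen on a single character. This is a small Python illustration, with the sample character chosen arbitrarily: the code point is the character's position in Unicode, while UTF-8 and UTF-16 are two different byte representations of that same code point.

```python
char = "é"

code_point = ord(char)                   # position in the Unicode character set
utf8_bytes = char.encode("utf-8")        # one encoding of that code point
utf16_bytes = char.encode("utf-16-be")   # a different encoding, same code point

print(f"U+{code_point:04X}")
print(utf8_bytes.hex())
print(utf16_bytes.hex())

# Decoding with the matching encoding recovers the same character either way.
round_trips = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-be") == char
print(round_trips)
```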
Migrating a php application to handle UTF-8
There's a little more to it than just replacing those functions.
Regular expressions
You should add the UTF-8 flag (u) to all of your PCRE regular expressions that may be applied to strings containing non-ASCII characters, so that the patterns are interpreted as actual characters rather than bytes.
$subject = "Helló";
$pattern = '/(l|ó){2,3}/u'; //The u flag indicates the pattern is UTF8
preg_match($pattern, substr($subject,3), $matches, PREG_OFFSET_CAPTURE);
You should also use the Unicode character classes rather than the standard Perl ones if you want your regular expressions to be correct for non-Latin alphabets:
- \p{L} instead of \w for any 'letter' character.
- \p{Z} instead of \s for any 'space' character.
- \p{N} instead of \d for any 'digit' character e.g. Arabic numbers
There are a lot of different Unicode character classes, some of which are quite unusual to someone used to reading and writing in a Latin alphabet. For example some characters combine with the previous character to make a new glyph. More explanation of them can be read here.
Although there are regular expression functions in the mbstring extension, they are not recommended for use. The standard PCRE functions work fine with the UTF8 flag.
Function replacements
Although your list is a start, the list of functions I have found so far that need to be replaced with multibyte versions is longer. This is the list of functions with their replacements, some of which are not defined in PHP but are available on GitHub as mb_extra.
$unsafeFunctions = array(
'mail' => 'mb_send_mail',
'split' => null, //'mb_split', deprecated function - just don't use it
'stripos' => 'mb_stripos',
'stristr' => 'mb_stristr',
'strlen' => 'mb_strlen',
'strpos' => 'mb_strpos',
'strrpos' => 'mb_strrpos',
'strrchr' => 'mb_strrchr',
'strripos' => 'mb_strripos',
'strstr' => 'mb_strstr',
'strtolower' => 'mb_strtolower',
'strtoupper' => 'mb_strtoupper',
'substr_count' => 'mb_substr_count',
'substr' => 'mb_substr',
'str_ireplace' => null,
'str_split' => 'mb_str_split', //TODO - check this works
'strcasecmp' => 'mb_strcasecmp', //TODO - check this works
'strcspn' => null, //TODO - implement alternative
'strrev' => 'mb_strrev', //TODO - check this works
'strspn' => null, //TODO - implement alternative
'substr_replace'=> 'mb_substr_replace',
'lcfirst' => null,
'ucfirst' => 'mb_ucfirst',
'ucwords' => 'mb_ucwords',
'wordwrap' => null,
);
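The reason byte-oriented functions are unsafe can be shown without PHP. The following Python sketch works on the raw UTF-8 bytes to mimic what a byte-oriented function such as strlen() or substr() sees, and on the decoded string to mimic the mb_* versions; the helper `is_valid_utf8` is a hypothetical name introduced for the illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


subject = "Helló"
raw = subject.encode("utf-8")   # the bytes a byte-oriented function operates on

char_length = len(subject)      # character count, comparable to mb_strlen()
byte_length = len(raw)          # byte count, comparable to strlen()

# Cutting by bytes can split a multibyte character in half.
byte_cut = raw[:5]                        # ends in the middle of 'ó'
char_cut = subject[:5].encode("utf-8")    # cut on a character boundary

print(char_length, byte_length)
print(is_valid_utf8(byte_cut), is_valid_utf8(char_cut))
```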
MySQL
Although you would have thought that setting the character set to utf8 would give you full UTF-8 support in MySQL, it does not.
It only gives you support for UTF-8 characters that are encoded in up to 3 bytes, i.e. the Basic Multilingual Plane. However, people are actively using characters that require 4 bytes to encode, including most of the Emoji characters, which live in the Supplementary Multilingual Plane.
To support these you should in general use:
- utf8mb4 - for your character encoding.
- utf8mb4_unicode_ci - for your character collation.
For specific scenarios there are alternative collation sets that may be appropriate for you, but in general stick to the collation set that is most correct.
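Whether a given string needs utf8mb4 can be checked by looking at its code points. The sketch below is a Python illustration with hypothetical helper names; it relies only on the fact that code points above U+FFFF need four bytes in UTF-8.

```python
def utf8_byte_lengths(text: str) -> dict:
    """Map each character to the number of bytes its UTF-8 encoding needs."""
    return {ch: len(ch.encode("utf-8")) for ch in text}


def fits_in_three_bytes(text: str) -> bool:
    """True if every character lies in the Basic Multilingual Plane (<= U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


print(utf8_byte_lengths("aé中👽"))
print(fits_in_three_bytes("Señor 中文"))  # storable in a 3-byte utf8 column
print(fits_in_three_bytes("alien 👽"))    # needs utf8mb4
```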
The list of places where you should set the character set and collation in your MySQL config file are:
[mysql]
default-character-set=utf8mb4
[client]
default-character-set=utf8mb4
[mysqld]
init-connect='SET NAMES utf8mb4'
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
The SET NAMES may not be required in all circumstances, but it is safer to leave it on, at only a small speed penalty.
PHP INI File
Although you said you have set mb_internal_encoding in your bootstrap script, it is much better to do this in the PHP ini file, and also set all the recommended parameters:
mbstring.language = Neutral ; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On ; HTTP input encoding translation is enabled
mbstring.http_input = auto ; Set HTTP input character set detection to auto
mbstring.http_output = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order = auto ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset = UTF-8 ; Default character set for auto content type header
Helping browser to choose UTF8 for forms
You need to set accept-charset on your forms to be UTF-8 to tell browsers to submit them as UTF8.
Add a UTF8 character to your form in a hidden field, to stop Internet Explorer (5, 6, 7 and 8) from submitting a form as something other than UTF8.
Misc
If you're using Apache, set "AddDefaultCharset utf-8".
As you said you're doing, but just to remind anyone reading the answer, set the meta content-type as well in the header.
That should be about it. Although it's worth reading the "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" page, I think it is preferable to use UTF-8 everywhere and so not have to spend any mental effort on handling different character sets.
Is the charset attribute required with HTML5?
It is not necessary to include <meta charset="blah">. As the specification says, the character set may also be specified by the server using the HTTP Content-Type header or by including a Unicode BOM at the beginning of the downloaded file.
Most web servers today will send back a character set in the Content-Type header for HTML text data if none is specified. If the web server doesn't send back a character set with the Content-Type header, the file does not include a BOM, and the page does not include a <meta charset="blah"> declaration, the browser will fall back to a default encoding that is usually based on the language settings of the host computer. If this does not match the actual character encoding of the file, then some characters will be displayed improperly.
Will browsers use the proper encoding 99% of the time? If your page is UTF-8, probably. If not, probably not.
The W3C provides a document outlining the precedence rules for the three methods; it says the order is HTTP header, BOM, followed by in-document specification (meta tag).
Trouble with UTF-8 characters; what I see is not what I stored
This problem plagues the participants of this site, and many others.
You have listed the five main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the Unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8.
I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
Overview of what you should do
- Have your editor, etc. set to UTF-8.
- HTML forms should start like <form accept-charset="UTF-8">.
- Have your bytes encoded as UTF-8.
- Establish UTF-8 as the encoding being used in the client.
- Have the column/table declared CHARACTER SET utf8mb4. (Check with SHOW CREATE TABLE.)
- <meta charset=UTF-8> at the beginning of HTML.
- Stored Routines acquire the current charset/collation. They may need rebuilding.
UTF-8 all the way through
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT cannot be trusted.
Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled.
So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
- For a blank space (in any language): 20
- For English: 4x, 5x, 6x, or 7x
- For most of Western Europe, accented letters should be Cxyy
- Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
- Most of Asia: Exyyzz
- Emoji and some of Chinese: F0yyzzww
- More details
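The patterns in the list above can be reproduced for sample characters. This is a small Python sketch; the helper name `utf8_hex` and the sample characters are chosen for illustration, and the output is what HEX(col) would be expected to show for correctly stored UTF-8.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as uppercase hex, like MySQL's HEX()."""
    return text.encode("utf-8").hex().upper()


samples = {
    "space": " ",
    "English": "A",
    "Western Europe": "é",
    "Cyrillic": "Д",
    "Asia": "中",
    "Emoji": "👽",
}

for label, ch in samples.items():
    print(f"{label:15} {utf8_hex(ch)}")
```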
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
- The bytes to be stored are not encoded as utf8mb4. Fix this.
- Also, check that the connection during reading is UTF-8.
Black Diamonds with question marks (Se�or for Señor); one of these cases exists:
Case 1 (original bytes were not UTF-8):
- The bytes to be stored are not encoded as utf8. Fix this.
- The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
- Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
- The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
- Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
- The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
- The column in the database is not CHARACTER SET utf8 (or utf8mb4). Fix this. (Use SHOW CREATE TABLE.)
- Also, check that the connection during reading is UTF-8.
Mojibake (SeÃ±or for Señor):
(This discussion also applies to Double Encoding, which is not necessarily visible.)
- The bytes to be stored need to be UTF-8-encoded. Fix this.
- The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
- The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
- HTML should start with <meta charset=UTF-8>.
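Each of the four symptoms corresponds to a specific mismatch between the bytes and the encoding used to interpret them. The Python sketch below reproduces them for Señor; it is a simulation rather than MySQL's own logic, and it assumes that MySQL's latin1 behaves like Windows-1252 (`cp1252`).

```python
MYSQL_LATIN1 = "cp1252"  # assumption: MySQL's latin1 behaves like Windows-1252
text = "Señor"

# Mojibake: UTF-8 bytes interpreted as latin1.
mojibake = text.encode("utf-8").decode(MYSQL_LATIN1)

# Black diamond: latin1 bytes interpreted as UTF-8, invalid byte shown as U+FFFD.
latin1_bytes = text.encode(MYSQL_LATIN1)
black_diamond = latin1_bytes.decode("utf-8", errors="replace")

# Question mark: the character cannot be represented in the target charset.
question_mark = text.encode("ascii", errors="replace").decode("ascii")

# Truncation: latin1 bytes declared as UTF-8, cut at the first invalid byte.
try:
    truncated = latin1_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    truncated = latin1_bytes[:err.start].decode("utf-8")

for name, value in [("mojibake", mojibake), ("black diamond", black_diamond),
                    ("question mark", question_mark), ("truncated", truncated)]:
    print(f"{name:14} {value}")
```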
If the data looks correct, but won't sort correctly, then
either you have picked the wrong collation,
or there is no collation that suits your need,
or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji 👽 should come back F09F91BD, but comes back C3B0C5B8E28098C2BD
That is, the hex is about twice as long as it should be.
This is caused by converting from latin1 (or whatever) to utf8, then treating those
bytes as if they were latin1 and repeating the conversion.
The sorting (and comparing) does not work correctly because it is, for example, sorting as if the string were SeÃ±or.
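The hex values quoted above can be derived by applying the conversion twice. This Python sketch simulates the process; the function name `double_encode` is hypothetical, and `cp1252` stands in for MySQL's latin1 as an assumption.

```python
MYSQL_LATIN1 = "cp1252"  # assumption: MySQL's latin1 behaves like Windows-1252


def double_encode(text: str) -> bytes:
    """Encode as UTF-8, misread those bytes as latin1, then encode as UTF-8 again."""
    return text.encode("utf-8").decode(MYSQL_LATIN1).encode("utf-8")


for ch in ("é", "👽"):
    correct = ch.encode("utf-8").hex().upper()
    doubled = double_encode(ch).hex().upper()
    print(f"correct {correct:10} double-encoded {doubled}")
```

Comparing the HEX() output of a suspect column against the first value for a known character is a quick way to tell the two states apart.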
Fixing the Data, where possible
For Truncation and Question Marks, the data is lost.
For Mojibake / Double Encoding, ...
For Black Diamonds, ...
The Fixes are listed here. (5 different fixes for 5 different situations; pick carefully): http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases
How do i Setup utf-8 as standard character set for a mysql server?
Solution:
add this into my.cnf:
[mysqld]
character-set-server=utf8
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
[mysql]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
[mysqladmin]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
[mysqlcheck]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
[mysqldump]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
[mysqlimport]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
[mysqlshow]
character-sets-dir=/usr/share/mysql/charsets
default-character-set=utf8
#end