Is a Blob Converted Using the Current/Default Charset in MySQL

Converting BLOBs to UTF-8 in MySQL

BLOBs have no character set or collation (the BLOB manual page points to VARBINARY, where this is explained). The equivalent of a BLOB, but with a character set and collation, is a TEXT column (covered on the same manual page).
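
You can see the difference directly; a minimal sketch (the table and column names are made up for illustration):

CREATE TABLE demo (
  raw_data  BLOB,                      -- binary: no character set or collation
  text_data TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);

SHOW FULL COLUMNS FROM demo;           -- Collation is NULL for the BLOB column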

phpMyAdmin is, to say the least, unclear in how it handles this. I have not been able to find a related bug report.

How to convert an entire MySQL database characterset and collation to UTF-8?

Use the ALTER DATABASE and ALTER TABLE commands.

ALTER DATABASE databasename CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Or, if you're still on MySQL 5.5.2 or older, which does not support 4-byte UTF-8, use utf8 instead of utf8mb4:

ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
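
ALTER TABLE works per table, so for a whole database one common approach is to generate one statement per table from information_schema and then run the output; a sketch, with databasename as a placeholder:

SELECT CONCAT('ALTER TABLE `', table_name,
              '` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;')
FROM information_schema.tables
WHERE table_schema = 'databasename'
  AND table_type = 'BASE TABLE';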

SQLAlchemy/MySQL binary blob is being utf-8 encoded?

It turns out that this was a driver issue: apparently the default MySQL driver stumbles over Python 3 and utf8 support. Installing cymysql into the virtual Python environment resolved the problem, and the warnings disappeared.

The fix: find out whether MySQL connects through a socket or a port (see here), and then modify the connection string accordingly. In my case, using a socket connection:

mysql+cymysql://user:pwd@localhost/database?unix_socket=/var/run/mysqld/mysqld.sock

Use the port argument otherwise.
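
For a TCP connection, the URL takes the host and port instead of the socket path; something like this, with the host and port as placeholders:

mysql+cymysql://user:pwd@127.0.0.1:3306/database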

Edit: While the above fixed the encoding issue, it gave rise to another one: blob size. Due to a bug in CyMySQL, blobs larger than 8M fail to commit. Switching to PyMySQL fixed that problem, although it seems to have a similar issue with large blobs.

MySQL character encoding change. Is data integrity preserved?

Every (character string-type) column has its own character set and collation metadata.

If, when the column's data type was specified (i.e. when it was last created or altered), no character set/collation was explicitly given, then the table's default character set and collation would be used for the column.

If, when the table was specified, no default character set/collation was explicitly given, then the database's default character set and collation would be used for the table's default.

The commands that you quote in your question merely alter such default character sets/collations for the database and table respectively. In other words, they will only affect tables and columns that are created thereafter—they will not affect existing columns (or data).
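
To check which character set and collation each existing column actually uses (as opposed to the database or table defaults), you can query information_schema; a sketch, with databasename as a placeholder:

SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'databasename'
  AND character_set_name IS NOT NULL;  -- binary and numeric columns show NULL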

To update existing data, you should first read the Changing the Character Set section of the manual page on ALTER TABLE:

Changing the Character Set


To change the table default character set and all character columns (CHAR, VARCHAR, TEXT) to a new character set, use a statement like this:

ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name;

The statement also changes the collation of all character columns. If you specify no COLLATE clause to indicate which collation to use, the statement uses the default collation for the character set. If this collation is inappropriate for the intended table use (for example, if it would change from a case-sensitive collation to a case-insensitive collation), specify a collation explicitly.

For a column that has a data type of VARCHAR or one of the TEXT types, CONVERT TO CHARACTER SET changes the data type as necessary to ensure that the new column is long enough to store as many characters as the original column. For example, a TEXT column has two length bytes, which store the byte-length of values in the column, up to a maximum of 65,535. For a latin1 TEXT column, each character requires a single byte, so the column can store up to 65,535 characters. If the column is converted to utf8, each character might require up to three bytes, for a maximum possible length of 3 × 65,535 = 196,605 bytes. That length does not fit in a TEXT column's length bytes, so MySQL converts the data type to MEDIUMTEXT, which is the smallest string type for which the length bytes can record a value of 196,605. Similarly, a VARCHAR column might be converted to MEDIUMTEXT.

To avoid data type changes of the type just described, do not use CONVERT TO CHARACTER SET. Instead, use MODIFY to change individual columns. For example:

ALTER TABLE t MODIFY latin1_text_col TEXT CHARACTER SET utf8;
ALTER TABLE t MODIFY latin1_varchar_col VARCHAR(M) CHARACTER SET utf8;

If you specify CONVERT TO CHARACTER SET binary, the CHAR, VARCHAR, and TEXT columns are converted to their corresponding binary string types (BINARY, VARBINARY, BLOB). This means that the columns no longer will have a character set attribute and a subsequent CONVERT TO operation will not apply to them.

If charset_name is DEFAULT in a CONVERT TO CHARACTER SET operation, the character set named by the character_set_database system variable is used.

Warning

The CONVERT TO operation converts column values between the original and named character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:

ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;

The reason this works is that there is no conversion when you convert to or from BLOB columns.


To change only the default character set for a table, use this statement:

ALTER TABLE tbl_name DEFAULT CHARACTER SET charset_name;

The word DEFAULT is optional. The default character set is the character set that is used if you do not specify the character set for columns that you add to a table later (for example, with ALTER TABLE ... ADD column).

When the foreign_key_checks system variable is enabled, which is the default setting, character set conversion is not permitted on tables that include a character string column used in a foreign key constraint. The workaround is to disable foreign_key_checks before performing the character set conversion. You must perform the conversion on both tables involved in the foreign key constraint before re-enabling foreign_key_checks. If you re-enable foreign_key_checks after converting only one of the tables, an ON DELETE CASCADE or ON UPDATE CASCADE operation could corrupt data in the referencing table due to implicit conversion that occurs during these operations (Bug #45290, Bug #74816).
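
Put together, the workaround described in that last paragraph looks roughly like this (a sketch; t1 and t2 stand for the two tables linked by the foreign key constraint):

SET foreign_key_checks = 0;
ALTER TABLE t1 CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE t2 CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
SET foreign_key_checks = 1;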

Reading č, ć and đ from MySQL using .NET Core service

So, the problem was that the stored procedure retrieving the data referenced a column in another table that had the latin1_swedish_ci collation.

The utf8 character set and utf8_general_ci collation are enough.
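
If you run into a similar mismatch, changing the collation of the offending column brings it in line with the rest of the data; a sketch with placeholder table and column names (adjust the type and length to match the schema):

ALTER TABLE other_table
  MODIFY name_col VARCHAR(255) CHARACTER SET utf8 COLLATE utf8_general_ci;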


