Inserting UTF-8 Encoded String into UTF-8 Encoded MySQL Table Fails with "Incorrect String Value"

Incorrect string value when trying to insert UTF-8 into MySQL via JDBC?

MySQL's utf8 permits only the Unicode characters that can be represented with 3 bytes in UTF-8. Here you have a character that needs 4 bytes: \xF0\x90\x8D\x83 (U+10343 GOTHIC LETTER SAUIL).
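To check whether a given string contains characters that MySQL's utf8 would reject, you can test for code points above U+FFFF. This is only a rough sketch in Python; the helper name needs_utf8mb4 is made up for illustration and is not part of any library.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character in text needs 4 bytes in UTF-8.

    Such characters lie above U+FFFF (outside the Basic Multilingual Plane)
    and do not fit in MySQL's 3-byte utf8 character set.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


sauil = "\U00010343"  # GOTHIC LETTER SAUIL, the character from the error
print(sauil.encode("utf-8"))     # b'\xf0\x90\x8d\x83'
print(needs_utf8mb4(sauil))      # True
print(needs_utf8mb4("héllo €"))  # False: é needs 2 bytes, € needs 3
```

Running a check like this over the input before the insert tells you whether the failure comes from 4-byte characters or from something else.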

If you have MySQL 5.5 or later you can change the column encoding from utf8 to utf8mb4. This encoding allows storage of characters that occupy 4 bytes in UTF-8.

You may also have to set the server property character_set_server to utf8mb4 in the MySQL configuration file. It seems that Connector/J defaults to 3-byte Unicode otherwise:

For example, to use 4-byte UTF-8 character sets with Connector/J, configure the MySQL server with character_set_server=utf8mb4, and leave characterEncoding out of the Connector/J connection string. Connector/J will then autodetect the UTF-8 setting.

Inserting UTF-8 encoded string into UTF-8 encoded mysql table fails with Incorrect string value

The character U+1D10E lies outside the BMP (Basic Multilingual Plane), that is, above U+FFFF, so it can't be represented in 3 bytes of UTF-8. The MySQL charset utf8 only accepts characters that fit in 3 bytes of UTF-8. If you need to store this character in MySQL, you'll need the MySQL charset utf8mb4, which requires MySQL 5.5.3 or later. You can use ALTER TABLE to change the character set without much trouble, but because utf8mb4 reserves more space per character, a couple of issues can show up that may require you to reduce string sizes. See http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-upgrading.html .
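One of the space issues is index length. The arithmetic below is a sketch that assumes the 767-byte InnoDB index prefix limit that applies to the COMPACT and REDUNDANT row formats; if your tables use a different row format or storage engine, the limit differs. The constant and function names are made up for illustration.

```python
# Assumption: InnoDB index prefix limit for COMPACT/REDUNDANT row formats.
INDEX_PREFIX_LIMIT_BYTES = 767


def max_indexable_chars(bytes_per_char: int,
                        limit: int = INDEX_PREFIX_LIMIT_BYTES) -> int:
    """Largest number of characters whose worst-case size fits the limit."""
    return limit // bytes_per_char


print(max_indexable_chars(3))  # 255 characters with utf8 (3 bytes each)
print(max_indexable_chars(4))  # 191 characters with utf8mb4 (4 bytes each)
print(255 * 4)                 # 1020 bytes: an indexed VARCHAR(255) no longer fits
```

Under that assumption, an indexed VARCHAR(255) column has to shrink to VARCHAR(191), or use a prefix index, before it can be converted.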

Incorrect string value error for unconventional characters

The characters you show require the column use the utf8mb4 encoding. Currently it seems your column is defined with the utf8mb3 encoding.

The way MySQL uses the name "utf8" is complicated, as described in https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html:

Note

Historically, MySQL has used utf8 as an alias for utf8mb3;
beginning with MySQL 8.0.28, utf8mb3 is used exclusively in the output
of SHOW statements and in Information Schema tables when this
character set is meant.

At some point in the future utf8 is expected to become a reference to
utf8mb4. To avoid ambiguity about the meaning of utf8, consider
specifying utf8mb4 explicitly for character set references instead of
utf8.

You should also be aware that the utf8mb3 character set is deprecated
and you should expect it to be removed in a future MySQL release.
Please use utf8mb4 instead.

You may have tried to change your table in the following way:

ALTER TABLE test_table CHARSET=utf8mb4;

But that only changes the default character set, to be used if you add new columns to the table subsequently. It does not change any of the current columns. To do that:

ALTER TABLE test_table MODIFY COLUMN dummy VARCHAR(255) CHARACTER SET utf8mb4;

Or to convert all string or TEXT columns in a table in one statement:

ALTER TABLE test_table CONVERT TO CHARACTER SET utf8mb4;
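If you have many tables to convert, you could generate the statements instead of typing them. A minimal sketch in Python: the function name and the table names are hypothetical, and the utf8mb4_unicode_ci collation is only an assumed default, so pick whichever collation suits your data. Review the output before running it against a real database.

```python
def convert_statements(tables, charset="utf8mb4",
                       collation="utf8mb4_unicode_ci"):
    """Build one ALTER TABLE ... CONVERT TO statement per table name."""
    statements = []
    for table in tables:
        # Quote the identifier and double any embedded backticks.
        quoted = "`" + table.replace("`", "``") + "`"
        statements.append(
            f"ALTER TABLE {quoted} CONVERT TO CHARACTER SET {charset} "
            f"COLLATE {collation};"
        )
    return statements


for statement in convert_statements(["test_table", "other_table"]):
    print(statement)
```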

How to fix Incorrect string value errors?

"\xE4\xC5\xCC\xC9\xD3\xD8" isn't valid UTF-8. Tested using Python 2:

>>> "\xE4\xC5\xCC\xC9\xD3\xD8".decode("utf-8")
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid data
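In Python 3 only bytes objects have a decode method, so the same check needs a bytes literal. A small sketch; the helper name is_valid_utf8 is made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


suspect = b"\xE4\xC5\xCC\xC9\xD3\xD8"
print(is_valid_utf8(suspect))              # False
print(is_valid_utf8("š".encode("utf-8")))  # True
```

If this returns False for the bytes you are about to insert, the data is in some other encoding and changing the column charset alone will not help.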

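If you're looking for a way to avoid decoding errors within the database, MySQL's latin1 character set (its variant of cp1252, aka "Windows-1252" aka "Windows Western European") is the most permissive choice: every byte value maps to a code point. Strict cp1252 leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, but MySQL's latin1 assigns those as well.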

Of course it's not going to understand genuine UTF-8 any more, nor any other non-cp1252 encoding, but it sounds like you're not too concerned about that?

Error 1366: Incorrect string value when inserting strings into MariaDB

which makes me believe the cutting of the string somehow makes the string indigestible for the database?

Cutting strings by sub-slicing, as in value[:10] (and measuring length with len, for that matter), is a mistake whenever your program has any chance of dealing with multi-byte characters. That's because indexing a string in Go operates on its bytes, and a given byte may or may not be part of a multi-byte sequence.

As you found out, the character š is encoded in UTF-8 as \xc5\xa1. If these two bytes sit at indexes 9 and 10 of your value string, the expression value[:10] keeps \xc5 but drops \xa1, which corrupts the data.

The character sets utf8mb3 and utf8mb4 only restrict the admitted UTF-8 to characters of at most 3 and 4 bytes respectively, but a lone \xc5 (a lead byte without its continuation byte) is not valid UTF-8 to begin with, so it gets rejected either way.
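The effect is easy to reproduce outside Go. Here is a rough Python illustration in which the sample string is made up so that š lands on bytes 9 and 10; slicing the encoded bytes mimics what value[:10] does to a Go string.

```python
value = "123456789š and more"   # hypothetical input; š occupies bytes 9 and 10
raw = value.encode("utf-8")

cut = raw[:10]                  # byte-wise cut, like value[:10] in Go
print(cut)                      # b'123456789\xc5'

try:
    cut.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err.reason)

safe = value[:10]               # a Python str slices by code point instead
print(safe)                     # 123456789š
print(len(safe.encode("utf-8")))  # 11 bytes for 10 characters
```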

In MariaDB a column with data type VARCHAR(N) counts characters (as specified by the collation). You want to cut your value string at the tenth character, instead of at the tenth byte.

I would like to avoid handling utf8/unicode in my code

You are already admitting UTF-8 by declaring the MariaDB character set as utf8mb3. It's only logical that you properly handle input data in your code as UTF-8. To cut at the n-th character (or rune, which in Go represents a Unicode code point) you can use something like:

// requires: import "unicode/utf8"
// count the runes
if utf8.RuneCountInString(value) > 10 {
    // convert string to rune slice
    chars := []rune(value)
    // index the rune slice and convert back to string
    value = string(chars[:10])
}

This won't corrupt the UTF-8 encoding. Keep in mind, however, that it allocates more and doesn't account for characters composed of several code points, e.g. sequences joined with the zero-width joiner U+200D.
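To make the joiner caveat concrete, here is a small Python sketch (Python strings index by code point, much like a Go rune slice). The emoji sequence is just an example input.

```python
# One visible "family" glyph built from three emoji joined by U+200D.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man, ZWJ, woman, ZWJ, girl

print(len(family))                  # 5 code points
print(len(family.encode("utf-8")))  # 18 bytes (4 + 3 + 4 + 3 + 4)

truncated = family[:2]              # cut after the second code point
print(truncated.endswith("\u200D")) # True: the cut leaves a dangling joiner
```

The truncated string is still valid UTF-8, so the database accepts it, but what the user saw as one character has been split.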

Mysql UTF8 Encoding Issue - Incorrect string value

I solved it by making the following changes.

  • Used ";CharSet=utf8mb4;" in the connection string. I had missed this one earlier and was using "utf8".
  • Set the database's default charset, each table's default charset, and every column's charset to 'utf8mb4'.
  • Set the database's default collation, each table's default collation, and every column's collation to 'utf8mb4_unicode_ci'.

As @eggyal mentions, only the column charset and collation matter. I set all the defaults as well so that I don't have to update new columns in the future.

go-mysql-driver insert string into table gives error 1366 even using utf8mb4

The value should be \xC2\xA7test. A lone \xA7 byte (the Latin-1 encoding of §) is not a valid UTF-8 sequence, so it is rejected by utf8 and utf8mb4 alike.

select hex('§test')

C2A774657374

ref: fiddle
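You can confirm the expected bytes from the client side too. A short Python sketch comparing the UTF-8 and Latin-1 encodings of the same text:

```python
text = "§test"

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

print(utf8_bytes.hex().upper())  # C2A774657374, matching the HEX() output
print(latin1_bytes)              # b'\xa7test', the bytes that were being sent

try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```

If your program produces the second form, the string was encoded as Latin-1 somewhere before it reached the driver.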

“Incorrect string value” when trying to insert String into MySQL via Python and Text file

The string you're trying to insert into the database has an unusual, invisible character at its beginning. I just copied your string:

In [1]: a = '<'

In [2]: a
Out[2]: '\xef\xbb\xbf<'

You need to get rid of those characters. The three bytes \xef\xbb\xbf are the UTF-8 byte order mark (BOM), which some editors write at the start of a text file.
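A minimal sketch of two ways to drop the BOM in Python 3, assuming the data comes from a UTF-8 text file. The sample bytes are constructed here just for the demonstration.

```python
import codecs

raw = codecs.BOM_UTF8 + b"<"       # same bytes as in the question
print(raw)                         # b'\xef\xbb\xbf<'

# Option 1: the utf-8-sig codec strips a leading BOM while decoding.
print(raw.decode("utf-8-sig"))     # <

# Option 2: decode normally, then strip the BOM character U+FEFF.
decoded = raw.decode("utf-8")
cleaned = decoded.lstrip("\ufeff")
print(cleaned)                     # <
```

When reading the file directly, passing encoding="utf-8-sig" to open() has the same effect as option 1.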


