Convert Latin1 Characters on a Utf8 Table into Utf8

From what you describe, it seems you have UTF-8 data that was originally stored as Latin-1 and then not converted correctly to UTF-8. The data is recoverable; you'll need a MySQL expression like

CONVERT(CAST(CONVERT(name USING latin1) AS BINARY) USING utf8)

It's possible that you may need to omit the inner conversion, depending on how the data was altered during the encoding conversion.
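To see what that nested expression does at the byte level, here is a minimal Python sketch. It is only an illustration of the idea, not MySQL's implementation: `repair_mojibake` is a hypothetical helper name, and Python's `latin-1` codec only approximates MySQL's `latin1` (which is closer to cp1252), so bytes in the 0x80-0x9F range may behave differently.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes read as latin1' damage.

    Hypothetical helper for illustration; not part of MySQL or any library.
    """
    # Step 1: map each character back to the single byte it came from
    # (the role of CONVERT(name USING latin1) plus CAST(... AS BINARY)).
    raw = text.encode("latin-1")
    # Step 2: read those bytes as the UTF-8 they originally were
    # (the role of the outer CONVERT(... USING utf8)).
    return raw.decode("utf-8")


mangled = "Larrasoa\u00c3\u00b1a"   # displays as "LarrasoaÃ±a"
print(repair_mojibake(mangled))      # Larrasoaña
```

Text that was never mangled (for example a correct "Larrasoaña") makes step 2 raise `UnicodeDecodeError`, which is a handy signal that the value should be left alone.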

converting latin1 data into utf8 inside of an existing database

Following the answer to "MySQL - Convert latin1 characters on a UTF8 table into UTF8", you can use this expression:

CONVERT(CAST(CONVERT(name USING latin1) AS BINARY) USING utf8)

and apply it to each affected column.

How to convert mysql latin1 to utf8

I managed to solve it by running updates on text fields like this (tbl stands for your table name; the bare word table is reserved in MySQL and would need backticks):

UPDATE tbl SET title = CONVERT(CONVERT(CONVERT(title USING latin1) USING binary) USING utf8);

MySQL: data being mangled while changing column to UTF8

F1 and FA are the latin1 encodings of ñ and ú. You need to tell MySQL that the data is latin1. One way is via SET NAMES latin1.

But note... That is independent of the setting for the column you are trying to store the data into. And, these days, utf8mb4 is the preferred setting for text. MySQL will convert between the column's encoding and the client's encoding. But you must tell it the client's encoding via connection parameters (or SET NAMES).

The pair of ALTER TABLEs works for certain situations, not all situations! You probably wanted the first entry in http://mysql.rjweb.org/doc.php/charcoll#fixes_for_various_cases

Table is CHARACTER SET latin1 and correctly encoded in latin1; want utf8mb4:

ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4;

I don't happen to know if your data is irreparably hosed. Please provide one of the lines, together with HEX.

Hex

"Larrasoaña" is encoded as 4C61727261736F61F161, and "Jesús y María" as 4A6573FA732079204D6172ED6120
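As a quick sanity check on those hex dumps, the following Python sketch tests whether the bytes are valid UTF-8 and shows how they read as latin1. The helper name `classify` is made up for this example, and a "not utf-8" result only suggests some single-byte charset; it cannot tell latin1 apart from latin5 or dec8.

```python
def classify(hex_dump: str) -> str:
    """Report whether the bytes form valid UTF-8 (hypothetical helper)."""
    raw = bytes.fromhex(hex_dump)
    try:
        raw.decode("utf-8")
        return "valid utf-8"
    except UnicodeDecodeError:
        return "not utf-8"


# The two hex dumps quoted above.
samples = ["4C61727261736F61F161", "4A6573FA732079204D6172ED6120"]

for hex_dump in samples:
    raw = bytes.fromhex(hex_dump)
    # repr() makes the trailing space (0x20) in the second sample visible.
    print(classify(hex_dump), repr(raw.decode("latin-1")))
```

Both samples come back as "not utf-8" and decode cleanly as latin1, which matches the diagnosis that the stored bytes are single-byte encoded rather than UTF-8.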

Those are latin1-encoded (or latin5 or dec8). If the table definition (SHOW CREATE TABLE) says latin1, then you could leave it alone. (latin1 handles Western European languages, but not Asian.)

If you want to convert all the text columns to utf8 or utf8mb4, do an ALTER like the one I presented above. Your 3-Alter approach will not work correctly; it assumes the bytes in the latin1 column are really UTF-8 bytes (which they aren't).

But... You must specify the client's encoding based on what the client wants. And it does not matter whether the client and the table agree since conversion will be provided.

Why the 3-step Alter fails

ALTER TABLE clientes CHARACTER SET utf8; -- This sets the default charset for new columns. It has no effect on the existing column definitions and any data in those columns.

ALTER TABLE clientes change nombre nombre varbinary(255); -- This says "forget about any text encoding". That is, F1 is now just a bunch of bits, not the latin1 representation for ñ.

ALTER TABLE clientes change nombre nombre varchar(255) character set utf8; -- This takes those varbinary bits and says "let's treat them as utf8". And that gives the error message because F1 is not a valid encoding for utf8.

That procedure is appropriate if the bytes are already utf8 bytes. That is, if it were already the 2-byte C3B1 for ñ. (By the way, this usually manifests itself as 'Mojibake', displaying as Ã± when interpreted as latin1.)
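The difference between the two cases can be reproduced outside MySQL. This is a rough Python analogy, assuming that relabelling a varbinary column as utf8 behaves like decoding the unchanged bytes as UTF-8; `reinterpret_as_utf8` is a hypothetical name used only here.

```python
def reinterpret_as_utf8(raw: bytes) -> str:
    # Mimics the varbinary -> varchar ... utf8 step: the bytes are kept
    # as they are and only the label changes. Raises UnicodeDecodeError
    # when the bytes are not valid UTF-8.
    return raw.decode("utf-8")


latin1_bytes = bytes.fromhex("F1")    # ñ as stored in a latin1 column
utf8_bytes = bytes.fromhex("C3B1")    # ñ as stored in a utf8 column

try:
    reinterpret_as_utf8(latin1_bytes)
except UnicodeDecodeError:
    print("F1 rejected: not valid UTF-8")

print(reinterpret_as_utf8(utf8_bytes))   # ñ
print(utf8_bytes.decode("latin-1"))      # Ã±  (the Mojibake form)
```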

The 1-Alter procedure...

ALTER TABLE clientes CONVERT TO CHARACTER SET utf8; (to convert the entire table) or ALTER TABLE clientes MODIFY nombre varchar(255) character set utf8; (to convert just one column). Either statement does the following:

For each text (char/varchar/text) column affected, it reads the data according to its current encoding (latin1, F1), converts it to utf8 (or utf8mb4) (C3B1), and writes it back into the row. Meanwhile, it changes the declaration to CHARACTER SET utf8.

That is, it is the 'right' process for changing the CHARACTER SET without changing the "text". True, the encoding changed (F1 -> C3B1), but that is in keeping with the change to the CHARACTER SET.
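In byte terms, that conversion is a decode with the old charset followed by an encode with the new one. Here is a small Python sketch of the same idea; `convert_charset` is an illustrative name, and Python's codecs stand in for MySQL's character sets only approximately.

```python
def convert_charset(raw: bytes, source: str = "latin-1", target: str = "utf-8") -> bytes:
    """Re-encode bytes so the text stays the same while the encoding changes."""
    text = raw.decode(source)      # read using the column's current charset
    return text.encode(target)     # write back using the new charset


old = bytes.fromhex("4C61727261736F61F161")   # "Larrasoaña" in latin1
new = convert_charset(old)

print(new.hex().upper())        # 4C61727261736F61C3B161
print(new.decode("utf-8"))      # Larrasoaña
```

Only the byte for ñ changes (F1 becomes C3B1); the ASCII bytes are identical in both encodings, which is why plain English text survives even a botched conversion.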

Recovery

Your first 2 ALTERs worked, correct? Did the 3rd one succeed, fail, or leave a messed up table?

If it aborted, leaving varbinary in place, then do 2 more alters: First go back to latin1; then go straight to utf8.

If it left you with a messed up column, especially if rows are truncated, then you need to go back to a backup, or otherwise reload the data.


