How to Convert Cyrillic Stored as LATIN1 (SQL) to True UTF-8 Cyrillic with iconv

How can I convert Cyrillic stored as LATIN1 (SQL) to true UTF-8 Cyrillic with iconv?

iconv -f utf-8 -t latin1 < in.sql | iconv -f cp1251 -t utf-8 > out.sql
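
The first iconv undoes the bogus latin1-to-UTF-8 conversion, recovering the raw cp1251 bytes; the second re-encodes those bytes as genuine UTF-8. Here is a minimal round-trip sketch to check the idea on a throwaway sample (the test string and the file names good.txt, damaged.txt, repaired.txt are placeholders):

printf 'привет\n' > good.txt                                                       # known-good UTF-8 Cyrillic
iconv -f utf-8 -t cp1251 < good.txt | iconv -f latin1 -t utf-8 > damaged.txt       # simulate the damage: cp1251 bytes reinterpreted as latin1
iconv -f utf-8 -t latin1 < damaged.txt | iconv -f cp1251 -t utf-8 > repaired.txt   # the fix from above
diff good.txt repaired.txt && echo "round trip OK"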

How to convert a Windows-1251 (Russian Cyrillic) MySQL database to UTF-8

Dump the database to a .sql file and use iconv (a Linux program).

iconv -f utf-8 -t latin1 < in.sql | iconv -f cp1251 -t utf-8 > out.sql

I did this earlier this year; see How can I convert Cyrillic stored as LATIN1 (SQL) to true UTF-8 Cyrillic with iconv? above.

If you don't know how to get iconv, and don't have any sensitive information stored in the SQL dump, I can do it for you and send it back to you.
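
For the whole round trip, here is a rough sketch, assuming the cp1251 text sits in latin1 columns as described above and that mysqldump uses its default utf8 client character set; the database names, file names and the exact sed pattern are placeholders to adapt to your dump:

mysqldump -u root -p mydb > in.sql                                        # dump; the latin1 columns come out as UTF-8 mojibake
iconv -f utf-8 -t latin1 < in.sql | iconv -f cp1251 -t utf-8 > out.sql    # fix the text itself
sed -i 's/CHARSET=latin1/CHARSET=utf8/g' out.sql                          # fix the table declarations
mysql -u root -p --default-character-set=utf8 mydb_utf8 < out.sql         # reload into a fresh utf8 database (create it first)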

What is my best option for converting my phpBB2 latin1 DB to a phpBB3 utf8 DB?

  1. Export the phpBB2 database to a plain .sql file.
  2. Change the encoding of that file from latin1 to UTF-8 with iconv.
  3. Change all occurrences of DEFAULT CHARACTER SET, SET NAMES etc. from latin1 to utf8.
  4. Change all occurrences of COLLATION / COLLATE from latin1_*_ci to utf8_unicode_ci (steps 2 to 4 are sketched below).
  5. Run the phpBB2 to phpBB3 converter.
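
A rough sketch of steps 2 to 4 (the dump file name is a placeholder, and the exact sed patterns depend on how your dump was written):

iconv -f latin1 -t utf-8 < phpbb2_dump.sql > phpbb2_dump_utf8.sql         # step 2
sed -i -e 's/CHARACTER SET latin1/CHARACTER SET utf8/g' \
       -e 's/CHARSET=latin1/CHARSET=utf8/g' \
       -e 's/SET NAMES latin1/SET NAMES utf8/g' \
       -e 's/latin1_[a-z]*_ci/utf8_unicode_ci/g' \
       phpbb2_dump_utf8.sql                                               # steps 3 and 4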

iconv does not completely convert to UTF-8

The original text was in UTF-8. It was mistakenly interpreted as text in Windows-1252 and converted from Windows-1252 to UTF-8. This should never have been done. To undo the damage, we need to convert the file from UTF-8 back to Windows-1252 and then just treat it as a UTF-8 file.

There's a problem, however. The letter ف is encoded in UTF-8 as 0xd9 0x81, and the byte 0x81 is not part of Windows-1252.

Luckily, when the first erroneous conversion was made, the character was not lost or replaced with a question mark. The byte 0x81 was mapped to the control character U+0081, which UTF-8 encodes as 0xc2 0x81.

The byte 0xd9, on the other hand, is in Windows-1252: it's the letter Ù, which in UTF-8 is 0xc3 0x99. So the byte sequence for ف in the damaged file is 0xc3 0x99 0xc2 0x81.

We can replace that sequence with an ASCII-friendly placeholder using sed, run the inverse conversion, and then replace the placeholder with ف again.

LANG=C sed $'s/\xc3\x99\xc2\x81/===FE===/g' forum.txt  | \
iconv -f utf8 -t cp1252 | \
sed $'s/===FE===/\xd9\x81/g'

The result is the original file encoded in UTF-8.

(make sure that ===FE=== is not used in the text first!)
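
A quick way to do that check, and to see how many lines carry the damaged sequence (same file name as in the command above):

grep -c '===FE===' forum.txt                      # should print 0, i.e. the marker is free to use
LANG=C grep -c $'\xc3\x99\xc2\x81' forum.txt      # number of lines containing the damaged byte sequence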

Decoding Cyrillic string in R

These steps seem to do the trick

word <- "РѕР±РµР·РїРµС‡РµРЅ"  # the garbled form of the word, as received

xx <- iconv(word, from="UTF-8", to="cp1251")
Encoding(xx) <- "UTF-8"
xx
# [1] "обезпечен"

target <- "обезпечен"
xx == target
# [1] TRUE

So it seems that, at some point, the bytes that make up the UTF-8 target value were misinterpreted as cp1251-encoded text, and somewhere a process ran to convert those bytes to UTF-8 using the cp1251-to-UTF-8 mapping rules. However, when you run that conversion on data that isn't really cp1251-encoded, you get weird values.

iconv(target, from="cp1251", to="UTF-8")
# "обезпечен"

Force encode from US-ASCII to UTF-8 (iconv)

ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.

It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.
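
For example, on Linux (the file name is a placeholder, and file's charset guess is only a heuristic, so sanity-check the result):

file -i mystery.txt                                            # e.g. reports "text/plain; charset=iso-8859-1"
iconv -f iso-8859-1 -t utf-8 mystery.txt > mystery_utf8.txt    # transcode from the detected encoding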


