How can I convert Cyrillic stored as LATIN1 ( sql ) to true UTF8 Cyrillic with iconv?
iconv -f utf-8 -t latin1 < in.sql | iconv -f cp1251 -t utf-8 > out.sql
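To see why the two-step pipeline works, here is a minimal Python sketch of the same round trip. It assumes the dump holds cp1251 bytes that were read as latin1 and then written out as UTF-8. The function name `fix_latin1_cyrillic` and the sample word are made up for illustration.

```python
def fix_latin1_cyrillic(garbled: str) -> str:
    """Undo 'cp1251 bytes read as latin1' mojibake.

    Mirrors: iconv -f utf-8 -t latin1 | iconv -f cp1251 -t utf-8
    """
    # Step 1: encoding back to latin1 recovers the original cp1251 bytes,
    # because latin1 maps every code point below 256 to the same byte value.
    raw = garbled.encode("latin-1")
    # Step 2: interpret those bytes with the encoding they really had.
    return raw.decode("cp1251")


# Simulate the damage: Cyrillic stored as cp1251 but labelled latin1.
original = "Привет"
garbled = original.encode("cp1251").decode("latin-1")

print(garbled)                       # accented Latin letters instead of Cyrillic
print(fix_latin1_cyrillic(garbled))  # the original word again
```

If the first step raises `UnicodeEncodeError`, the text contains characters outside latin1 and was probably damaged in some other way.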
How to convert a Windows-1251 (Russian Cyrillic) MySQL database to UTF-8
Dump the .sql and use iconv (a Linux program):
iconv -f utf-8 -t latin1 < in.sql | iconv -f cp1251 -t utf-8 > out.sql
I did this earlier this year; see "How can I convert Cyrillic stored as LATIN1 ( sql ) to true UTF8 Cyrillic with iconv?"
If you don't know how to get iconv and don't have any sensitive information stored in the SQL, I can do it for you and send it back to you.
What is my best option for converting my phpbb2 latin1 DB to a phpbb3 utf8 DB?
- Export the phpBB2 database to a plain .sql file.
- Change the encoding of that file from latin1 to Unicode UTF-8 (iconv).
- Change all occurrences of DEFAULT CHARACTER SET, SET NAMES etc. from latin1 to utf8.
- Change all occurrences of COLLATION/COLLATE from latin1_*_ci to utf8_unicode_ci.
- Run the phpBB2 to phpBB3 converter.
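The two "change all occurrences" steps could be scripted roughly as follows. This is only a sketch: the function name `retarget_dump` is hypothetical, the patterns cover the declaration forms a typical MySQL dump uses, and a blind text replacement will also touch any forum post that happens to contain the same keywords, so review the result before importing it.

```python
import re


def retarget_dump(sql: str) -> str:
    """Rewrite latin1 charset and collation declarations to utf8."""
    # Collations first, so that e.g. latin1_swedish_ci is replaced as a whole.
    sql = re.sub(
        r"\blatin1_[a-z0-9]+_(?:ci|cs)\b|\blatin1_bin\b",
        "utf8_unicode_ci",
        sql,
    )
    # Then the charset itself after CHARACTER SET, CHARSET or SET NAMES,
    # keeping whatever separator (space or '=') the dump used.
    sql = re.sub(
        r"(?i)\b(CHARACTER SET|CHARSET|SET NAMES)(\s*=?\s*)latin1\b",
        r"\1\2utf8",
        sql,
    )
    return sql


dump = (
    "SET NAMES latin1;\n"
    "CREATE TABLE t (a VARCHAR(10) CHARACTER SET latin1 "
    "COLLATE latin1_swedish_ci) DEFAULT CHARSET=latin1;\n"
)
print(retarget_dump(dump))
```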
iconv does not completely convert to UTF-8
The original text was in UTF-8. It got mistakenly interpreted as a text in Windows-1252 and converted from Windows-1252 to UTF-8. This should have never been done. To undo the damage we need to convert the file from UTF-8 to Windows-1252, and then just treat it as a UTF-8 file.
There's a problem, however. The letter ف is encoded in UTF-8 as 0xd9 0x81, and the code 0x81 is not a part of Windows-1252.
Luckily, when the first erroneous conversion was made, the character was not lost or replaced with a question mark. It got converted to the control character U+0081, which in UTF-8 is 0xc2 0x81.
The 0xd9 code is in Windows-1252; it's the letter Ù, which in UTF-8 is 0xc3 0x99. So the final byte sequence for ف in the converted file is 0xc3 0x99 0xc2 0x81.
We can just replace this sequence with something ASCII-friendly using a sed script, do the inverse conversion, and then replace the placeholder with the UTF-8 bytes of ف.
LANG=C sed $'s/\xc3\x99\xc2\x81/===FE===/g' forum.txt | \
iconv -f utf8 -t cp1252 | \
sed $'s/===FE===/\xd9\x81/g'
The result is the original file encoded in UTF-8.
(Make sure that ===FE=== is not used in the text first!)
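The same repair can be sketched in Python without a placeholder string, by sending every character that Windows-1252 cannot encode back through latin1. This assumes the only such characters are the C1 control characters left over from undefined bytes like 0x81. The function name `undo_cp1252_mojibake` is made up for illustration.

```python
def undo_cp1252_mojibake(garbled: str) -> str:
    """Reverse 'UTF-8 bytes read as Windows-1252' damage."""
    raw = bytearray()
    for ch in garbled:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Bytes undefined in Windows-1252 (such as 0x81) survived the bad
            # conversion as control characters with the same code point, so
            # latin1 maps them straight back to the original byte.
            raw += ch.encode("latin-1")
    return raw.decode("utf-8")


# The damaged form of the letter discussed above: Ù followed by U+0081.
damaged = "\u00d9\u0081"
print(damaged.encode("utf-8").hex(" "))  # c3 99 c2 81
print(undo_cp1252_mojibake(damaged))     # ف
```

A character that neither codec can encode raises `UnicodeEncodeError`, which suggests the file was not damaged in this particular way.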
Decoding Cyrillic string in R
These steps seem to do the trick:
word <- "обезпечен"
xx <- iconv(word, from="UTF-8", to="cp1251")
Encoding(xx) <- "UTF-8"
xx
# [1] "обезпечен"
target <- "обезпечен"
xx == target
# [1] TRUE
So it seems that at some point the bytes that make up the UTF-8 target value were misinterpreted as being cp1251 encoded, and somewhere a process ran to convert the bytes to UTF-8 based on the cp1251->UTF-8 mapping rules. However, when you run this on data that isn't really cp1251 encoded, you get weird values.
iconv(target, from="cp1251", to="UTF-8")
# "обезпечен"
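For readers who want to see the bytes involved, here is a rough Python equivalent of the R steps. It is an illustration only and uses Python's built-in `cp1251` codec rather than R's `iconv`, so behaviour on unusual bytes may differ. In Python's codec the byte 0x98 is unassigned, so text containing a letter whose UTF-8 form includes that byte would fail at the decode step.

```python
target = "обезпечен"

# The correct UTF-8 bytes of the word: two bytes per Cyrillic letter.
utf8_bytes = target.encode("utf-8")

# The mistake: reading those bytes as if they were cp1251.
garbled = utf8_bytes.decode("cp1251")

# The repair: map the garbled characters back to bytes via cp1251,
# then decode those bytes as the UTF-8 they always were.
repaired = garbled.encode("cp1251").decode("utf-8")

print(utf8_bytes.hex(" "))
print(garbled)
print(repaired == target)  # True
```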
Force encode from US-ASCII to UTF-8 (iconv)
ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.
It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.
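A quick way to check both claims is sketched below: pure ASCII text produces identical bytes under either encoding, and a file that is "not actually ASCII" contains at least one byte of 0x80 or above. The helper name `first_non_ascii` is hypothetical.

```python
from typing import Optional


def first_non_ascii(data: bytes) -> Optional[int]:
    """Return the offset of the first byte >= 0x80, or None if all ASCII."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return None


text = "SELECT 1;"
# For ASCII-only text the two encodings give exactly the same bytes.
print(text.encode("ascii") == text.encode("utf-8"))  # True

print(first_non_ascii(b"plain text"))  # None
print(first_non_ascii(b"caf\xe9"))     # 3
```

The offset only shows where to look. Deciding which encoding the file really uses still needs knowledge of where the data came from.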