How to convert ISO8859-15 to UTF8?
Could it be that your file is not ISO-8859-15 encoded? You should be able to check with the file command:
file YourFile.txt
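Also, you can use iconv without providing the encoding of the original file. In that case iconv does not detect the encoding; it assumes the input is in your current locale's encoding: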
iconv -t UTF-8 YourFile.txt
How do I convert between ISO-8859-1 and UTF-8 in Java?
In general, you can't do this losslessly in both directions. UTF-8 is capable of encoding any Unicode code point, while ISO-8859-1 can handle only the first 256 of them. So transcoding from ISO-8859-1 to UTF-8 is no problem. Going backwards from UTF-8 to ISO-8859-1 loses every character that ISO-8859-1 does not support: with the code below, Java substitutes the charset's default replacement, a question mark (?), for each of them.
To transcode text:
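import java.nio.charset.StandardCharsets;

// ISO-8859-1 to UTF-8: always lossless
byte[] latin1 = ...
byte[] utf8 = new String(latin1, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8);
or
// UTF-8 to ISO-8859-1: characters outside ISO-8859-1 are replaced
byte[] utf8 = ...
byte[] latin1 = new String(utf8, StandardCharsets.UTF_8).getBytes(StandardCharsets.ISO_8859_1);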
You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.
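As a minimal sketch of that lower-level approach, the helper below configures a CharsetEncoder either to report unmappable characters as an exception or to substitute a byte of your choice. The class and method names (StrictLatin1, encodeStrict, encodeWithReplacement) are hypothetical and only serve the illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictLatin1 {
    // Encodes text as ISO-8859-1 and throws instead of silently substituting.
    static byte[] encodeStrict(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Encodes text as ISO-8859-1, writing the given byte for each unmappable character.
    static byte[] encodeWithReplacement(String text, byte replacement)
            throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] {replacement});
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Copies the readable part of the buffer into a fresh array.
    private static byte[] toArray(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(encodeStrict("caf\u00e9").length);
        System.out.println(new String(encodeWithReplacement("5 \u20ac", (byte) '_'),
                StandardCharsets.ISO_8859_1));
        try {
            encodeStrict("5 \u20ac"); // the euro sign is not in ISO-8859-1
        } catch (CharacterCodingException e) {
            System.out.println("unmappable: " + e);
        }
    }
}
```

Reporting is usually the safer choice when data loss must not go unnoticed; replacement suits best-effort exports.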
Convert files between UTF-8 and ISO-8859 on Linux
Each ISO-8859-x encoding (ISO-8859-1 is Latin-1) covers only a small set of characters, so you should prefer UTF-8 whenever you can to make life easier.
Unicode, which UTF-8 encodes, is a superset of every ISO 8859 character set, so it is not surprising that some UTF-8 text cannot be converted to ISO 8859.
It seems the file command gives only very limited information about a file's encoding.
You could try to guess the source encoding: ISO-8859-1, ISO-8859-15, or one of the others from 2 to 14, as suggested in the comment by @hobbs.
You can list the encodings that iconv supports with iconv -l.
If guessing the real file encoding proves difficult, this silly script might help you out :D
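As a rough sketch of the same guessing idea (an illustration in Java, not the script referred to above), the helper below decodes the same bytes with several candidate charsets so you can see which result reads correctly. The class name, method name, and candidate list are assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingGuesser {
    // Decodes the same bytes with every candidate charset this JVM supports.
    static Map<String, String> decodeWithCandidates(byte[] data, String... candidates) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : candidates) {
            if (Charset.isSupported(name)) { // skip charsets this JVM lacks
                results.put(name, new String(data, Charset.forName(name)));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // 0xA4 is one of the bytes that ISO-8859-1 and ISO-8859-15 decode differently.
        byte[] sample = {(byte) 0xA4, ' ', '5'};
        decodeWithCandidates(sample, "ISO-8859-1", "ISO-8859-15", "UTF-8")
                .forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

For a real file, the bytes could come from java.nio.file.Files.readAllBytes. Keep in mind that most ISO-8859-x decoders accept any byte sequence, so a human still has to judge which output looks right.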
Convert from UTF-8 to ISO8859-15 in C++
I like this code. It's surprisingly short. Most of the code just deals with decoding multi-byte sequences into codepoints. Once a codepoint has been decoded, the conversion to ISO-8859-1 is very simple:
- If it's less than or equal to 255, it's also a valid ISO-8859-1 character:
out.append(1, static_cast<char>(codepoint));
- If not, it cannot be represented in ISO-8859-1 and is replaced with a question mark:
out.append("?");
So to make it work for ISO-8859-15, more code is needed to handle the eight characters that were replaced when ISO-8859-15 was introduced (see Comparing ISO-8859-1 and ISO-8859-15). Unfortunately, this considerably increases the code size.
The code below is meant to be easy to understand. It can be optimized for better performance if that's a main concern.
#include <string>

std::string UTF8toISO8859_15(const char* in) {
    std::string out;
    if (in == NULL)
        return out;

    unsigned int codepoint = 0;
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;                              // single byte (ASCII)
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);  // continuation byte
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;                       // lead byte of a 2-byte sequence
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;                       // lead byte of a 3-byte sequence
        else
            codepoint = ch & 0x07;                       // lead byte of a 4-byte sequence
        ++in;
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            // a valid codepoint has been decoded; convert it to ISO-8859-15
            char outc;
            if (codepoint <= 255) {
                // codepoints up to 255 can be directly converted with a few exceptions
                if (codepoint != 0xa4 && codepoint != 0xa6 && codepoint != 0xa8
                    && codepoint != 0xb4 && codepoint != 0xb8 && codepoint != 0xbc
                    && codepoint != 0xbd && codepoint != 0xbe) {
                    outc = static_cast<char>(codepoint);
                }
                else {
                    outc = '?';
                }
            }
            else {
                // With a few exceptions, codepoints above 255 cannot be converted
                if (codepoint == 0x20AC) {          // euro sign
                    outc = static_cast<char>(0xa4);
                }
                else if (codepoint == 0x0160) {     // S with caron
                    outc = static_cast<char>(0xa6);
                }
                else if (codepoint == 0x0161) {     // s with caron
                    outc = static_cast<char>(0xa8);
                }
                else if (codepoint == 0x017d) {     // Z with caron
                    outc = static_cast<char>(0xb4);
                }
                else if (codepoint == 0x017e) {     // z with caron
                    outc = static_cast<char>(0xb8);
                }
                else if (codepoint == 0x0152) {     // ligature OE
                    outc = static_cast<char>(0xbc);
                }
                else if (codepoint == 0x0153) {     // ligature oe
                    outc = static_cast<char>(0xbd);
                }
                else if (codepoint == 0x0178) {     // Y with diaeresis
                    outc = static_cast<char>(0xbe);
                }
                else {
                    outc = '?';
                }
            }
            out.append(1, outc);
        }
    }
    return out;
}
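The eight exceptions handled in the code above can be cross-checked against another implementation. The sketch below, in Java, decodes every byte value with both charsets and collects the positions where the results differ. It assumes the JVM ships an ISO-8859-15 charset, which is common but not guaranteed; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class Latin9Diff {
    // Returns the byte values (0..255) that decode differently
    // in ISO-8859-1 and ISO-8859-15.
    static List<Integer> differingBytes() {
        // Assumption: this charset is available; forName throws otherwise.
        Charset latin9 = Charset.forName("ISO-8859-15");
        List<Integer> diff = new ArrayList<>();
        for (int b = 0; b < 256; b++) {
            byte[] one = {(byte) b};
            String asLatin1 = new String(one, StandardCharsets.ISO_8859_1);
            String asLatin9 = new String(one, latin9);
            if (!asLatin1.equals(asLatin9)) {
                diff.add(b);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        for (int b : differingBytes()) {
            System.out.println(String.format("0x%02X", b));
        }
    }
}
```

On a JVM with the usual charset tables, the reported positions should match the byte values special-cased in the C++ function.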
convert utf8 to ISO8859-1 using iconv command
Well, the ’ (U+2019, right single quotation mark) has no equivalent in ISO-8859-1, so you could replace it with something else (' below) before converting with iconv, like:
echo Frank’s ’ | sed "s/’/'/g" | iconv -f utf8 -t iso8859-1
Frank's '
To convert a file like that:
sed "s/’/'/g" input_file | iconv [your params here] > output_file
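When the conversion happens in code instead of a shell pipeline, the same substitute-then-encode idea can be sketched in Java. The class and method names are hypothetical, and only the one character from the example above is handled.

```java
import java.nio.charset.StandardCharsets;

class QuoteNormalizer {
    // Replaces the right single quotation mark (U+2019) with an ASCII
    // apostrophe, then encodes the result as ISO-8859-1.
    static byte[] toLatin1(String text) {
        String normalized = text.replace('\u2019', '\'');
        return normalized.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] out = toLatin1("Frank\u2019s \u2019");
        System.out.println(new String(out, StandardCharsets.ISO_8859_1));
    }
}
```

Any other character outside ISO-8859-1 would still be replaced by getBytes, so further substitutions may be needed depending on the input.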