How to Convert Iso8859-15 to Utf8

How to convert ISO8859-15 to UTF8?

Could it be that your file is not ISO-8859-15 encoded? You should be able to check with the file command:

file YourFile.txt

Also, you can use iconv without providing the encoding of the original file:

iconv -t UTF-8 YourFile.txt

How do I convert between ISO-8859-1 and UTF-8 in Java?

In general, you can't do this. UTF-8 is capable of encoding any Unicode code point. ISO-8859-1 can handle only a tiny fraction of them. So, transcoding from ISO-8859-1 to UTF-8 is no problem. Going backwards from UTF-8 to ISO-8859-1 will cause "replacement characters" (�) to appear in your text when unsupported characters are found.

To transcode text:

byte[] latin1 = ...
byte[] utf8 = new String(latin1, "ISO-8859-1").getBytes("UTF-8");

or

byte[] utf8 = ...
byte[] latin1 = new String(utf8, "UTF-8").getBytes("ISO-8859-1");

You can exercise more control by using the lower-level Charset APIs. For example, you can raise an exception when an un-encodable character is found, or use a different character for replacement text.

Convert files between UTF-8 and ISO-8859 on Linux

ISO-8859-x (Latin-1) encoding only contains very limited characters, you should always try to encode to UTF-8 to make life easier.

And utf-8 (Unicode) is a superset of ISO 8859 so it will be not surprised you could not convert UTF-8 to ISO 8859

It seems command file just give a very limited info of the file encoding

You could try to guess the from encoding either ISO-8859-1 or ISO-8859-15 or the other from 2~14 as suggested in the comment by @hobbs

And you could get a supported encoding of iconv by iconv -l

If life treats you not easy with guessing the real file encoding, this silly script might help you out :D

Convert from UTF-8 to ISO8859-15 in C++

I like this code. It's surprisingly short. Most of the code just deals with decoding multi-byte sequences into codepoints. Once a codepoint has been decoded, the conversion to ISO-8859-1 is very simple:

  • If it's less or equal 255, it's also a valid ISO-8859-1 character: out.append(1, static_cast<char>(codepoint));
  • If not, it cannot be represented in ISO-8859-1 and is replaced with a question mark: out.append("?");

So to make it work for ISO-8859-15, more code is needed to handle the characters that have been replaced when ISO-8859-15 was introduced (see Comparing ISO-8859-1 and ISO-8859-15). Unfortunately, it considerably increases the code size.

The below code is supposed to be easy to understand. It can be optimized for better performance if that's a main concern.

std::string UTF8toISO8859_1(const char * in) {
std::string out;
if (in == NULL)
return out;

unsigned int codepoint;
while (*in != 0) {
unsigned char ch = static_cast<unsigned char>(*in);
if (ch <= 0x7f)
codepoint = ch;
else if (ch <= 0xbf)
codepoint = (codepoint << 6) | (ch & 0x3f);
else if (ch <= 0xdf)
codepoint = ch & 0x1f;
else if (ch <= 0xef)
codepoint = ch & 0x0f;
else
codepoint = ch & 0x07;
++in;

if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
// a valid codepoint has been decoded; convert it to ISO-8859-15
char outc;
if (codepoint <= 255) {
// codepoints up to 255 can be directly converted wit a few exceptions
if (codepoint != 0xa4 && codepoint != 0xa6 && codepoint != 0xa8
&& codepoint != 0xb4 && codepoint != 0xb8 && codepoint != 0xbc
&& codepoint != 0xbd && codepoint != 0xbe) {
outc = static_cast<char>(codepoint);
}
else {
outc = '?';
}
}
else {
// With a few exceptions, codepoints above 255 cannot be converted
if (codepoint == 0x20AC) {
outc = 0xa4;
}
else if (codepoint == 0x0160) {
outc = 0xa6;
}
else if (codepoint == 0x0161) {
outc = 0xa8;
}
else if (codepoint == 0x017d) {
outc = 0xb4;
}
else if (codepoint == 0x017e) {
outc = 0xb8;
}
else if (codepoint == 0x0152) {
outc = 0xbc;
}
else if (codepoint == 0x0153) {
outc = 0xbd;
}
else if (codepoint == 0x0178) {
outc = 0xbe;
}
else {
outc = '?';
}
}
out.append(1, outc);
}
}
return out;
}

convert utf8 to ISO8859-1 using iconv command

Well, you could replace the with something else (' below) before converting with iconv, like:

echo Frank’s ’ | sed "s/’/'/g" | iconv -f utf8 -t iso8859-1
Frank's '

To convert a file like that:

sed "s/’/'/g" input_file | iconv [your params here] > output_file


Related Topics



Leave a reply



Submit