Convert ISO-8859-1 strings to UTF-8 in C/C++

If your source encoding will always be ISO-8859-1, this is trivial. Here's a loop:

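unsigned char *in, *out;  /* in: NUL-terminated ISO-8859-1 input, out: output buffer */
while (*in) {
    if (*in < 128)
        *out++ = *in++;  /* ASCII: copy unchanged */
    else
        *out++ = 0xc2 + (*in > 0xbf), *out++ = (*in++ & 0x3f) + 0x80;  /* two-byte sequence */
}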

For safety, make sure the output buffer is at least twice as large as the input buffer (plus one byte if you append a terminating NUL), or else add a size limit and check it in the loop condition.

Convert ISO-8859-1 to UTF-8 in C with libiconv

Please read the iconv_open(3) manual page carefully:

iconv_t iconv_open(const char *tocode, const char *fromcode);

If you're converting to UTF-8 from ISO 8859-1, then the arguments in this call are in the wrong order:

iconv_t iconvDesc = iconv_open ("ISO-8859-1", "UTF-8//TRANSLIT//IGNORE");

The target encoding comes first, so it should say

iconv_t iconvDesc = iconv_open ("UTF-8//TRANSLIT//IGNORE", "ISO-8859-1");
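For context, a complete conversion around that call might look like the sketch below. The wrapper name `latin1_to_utf8_iconv` is made up for this example, and the code assumes the POSIX/glibc prototype of `iconv()` with a `char **` input argument; some older platforms declare it with `const char **` and need the cast adjusted.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical wrapper (name is an assumption): ISO-8859-1 in, UTF-8 out.
std::string latin1_to_utf8_iconv(const std::string &in) {
    // Destination encoding first, source encoding second.
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    // Each ISO-8859-1 byte expands to at most two UTF-8 bytes.
    std::string out(in.size() * 2, '\0');
    char *src = const_cast<char *>(in.data());
    char *dst = &out[0];
    size_t src_left = in.size();
    size_t dst_left = out.size();

    const size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed");

    // Trim the unused tail of the output buffer.
    out.resize(out.size() - dst_left);
    return out;
}
```

On systems where iconv is not part of the C library, linking may additionally require `-liconv`.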

How to convert (char *) from ISO-8859-1 to UTF-8 in C++ multiplatformly?

ISO-8859-1 was incorporated as the first 256 code points of ISO/IEC 10646 and Unicode. So the conversion is pretty simple.

for each char:

uint8_t ch = code_point; /* assume that code points above 0xff are impossible since latin-1 is 8-bit */

if (ch < 0x80) {
    append(ch);
} else {
    append(0xc0 | ((ch & 0xc0) >> 6)); /* first byte, simplified since our range is only 8 bits */
    append(0x80 | (ch & 0x3f));        /* continuation byte */
}

See http://en.wikipedia.org/wiki/UTF-8#Description for more details.

EDIT: According to a comment by ninjalj, Latin-1 translates directly to the first 256 Unicode code points, so the above algorithm should work.
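The pseudocode above could be turned into portable C++ roughly as follows. This is only a sketch: the function name `latin1_string_to_utf8` is an assumption, and `append` is replaced by `std::string::push_back`.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (name is an assumption): every input byte is treated
// as one ISO-8859-1 character and therefore as one Unicode code point.
std::string latin1_string_to_utf8(const std::string &in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte becomes two
    for (char c : in) {
        const std::uint8_t ch = static_cast<std::uint8_t>(c);
        if (ch < 0x80) {
            out.push_back(static_cast<char>(ch));
        } else {
            // Lead byte carries the top two bits, continuation the low six.
            out.push_back(static_cast<char>(0xc0 | (ch >> 6)));
            out.push_back(static_cast<char>(0x80 | (ch & 0x3f)));
        }
    }
    return out;
}
```

It uses only the standard library, so it should behave the same on any platform with a C++11 compiler.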

ISO-8859 to UTF-8 Conversion C++

You are incrementing the out pointer in the loop, causing you to lose track of where the output starts. The pointer being passed to cout is the incremented one, so it obviously doesn't point at the start of the generated output any longer.

Further, the termination of out happens after printing it, which of course is the wrong way around.

Also, this relies on the encoding of the source file, which is fragile. To be on the safe side, express the input string as individual characters with hex values.
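The three points above can be sketched as follows. The question's original code is not shown here, so the function name `convert_and_collect` and the fixed 64-byte buffer are assumptions chosen purely for illustration.

```cpp
#include <string>

// Hypothetical demo (name and buffer size are assumptions).
std::string convert_and_collect(const unsigned char *in) {
    unsigned char buffer[64];
    unsigned char *out = buffer;          // moving write cursor
    const unsigned char *start = buffer;  // remembers where the output begins

    // Stop early enough to leave room for two bytes plus the terminator.
    while (*in && out < buffer + sizeof buffer - 2) {
        if (*in < 128) {
            *out++ = *in++;
        } else {
            *out++ = static_cast<unsigned char>(0xc2 + (*in > 0xbf));
            *out++ = static_cast<unsigned char>((*in++ & 0x3f) + 0x80);
        }
    }
    *out = '\0';  // terminate BEFORE the result is used

    // Read from `start`, not from the advanced `out` pointer.
    return std::string(reinterpret_cast<const char *>(start));
}
```

To keep the input independent of the source file's encoding, spell it with hex values, for example `const unsigned char input[] = { 'f', 0xfc, 'r', 0 };`, and print it with `std::cout << convert_and_collect(input)`.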

Convert from UTF-8 to ISO8859-15 in C++

I like this code. It's surprisingly short. Most of the code just deals with decoding multi-byte sequences into codepoints. Once a codepoint has been decoded, the conversion to ISO-8859-1 is very simple:

  • If it's less than or equal to 255, it's also a valid ISO-8859-1 character: out.append(1, static_cast<char>(codepoint));
  • If not, it cannot be represented in ISO-8859-1 and is replaced with a question mark: out.append("?");

So to make it work for ISO-8859-15, more code is needed to handle the characters that were replaced when ISO-8859-15 was introduced (see Comparing ISO-8859-1 and ISO-8859-15). Unfortunately, this considerably increases the code size.

The code below is meant to be easy to understand. It can be optimized for better performance if that's a main concern.

#include <string>

// Note: despite its name, this variant produces ISO-8859-15.
std::string UTF8toISO8859_1(const char * in) {
    std::string out;
    if (in == NULL)
        return out;

    // initialized so that a stray leading continuation byte is well-defined
    unsigned int codepoint = 0;
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;

        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            // a valid codepoint has been decoded; convert it to ISO-8859-15
            char outc;
            if (codepoint <= 255) {
                // codepoints up to 255 can be directly converted with a few exceptions
                if (codepoint != 0xa4 && codepoint != 0xa6 && codepoint != 0xa8
                    && codepoint != 0xb4 && codepoint != 0xb8 && codepoint != 0xbc
                    && codepoint != 0xbd && codepoint != 0xbe) {
                    outc = static_cast<char>(codepoint);
                } else {
                    outc = '?';
                }
            } else {
                // With a few exceptions, codepoints above 255 cannot be converted
                if (codepoint == 0x20AC) {
                    outc = static_cast<char>(0xa4);
                } else if (codepoint == 0x0160) {
                    outc = static_cast<char>(0xa6);
                } else if (codepoint == 0x0161) {
                    outc = static_cast<char>(0xa8);
                } else if (codepoint == 0x017d) {
                    outc = static_cast<char>(0xb4);
                } else if (codepoint == 0x017e) {
                    outc = static_cast<char>(0xb8);
                } else if (codepoint == 0x0152) {
                    outc = static_cast<char>(0xbc);
                } else if (codepoint == 0x0153) {
                    outc = static_cast<char>(0xbd);
                } else if (codepoint == 0x0178) {
                    outc = static_cast<char>(0xbe);
                } else {
                    outc = '?';
                }
            }
            out.append(1, outc);
        }
    }
    return out;
}
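If code size is a concern, the eight special cases could be moved into a small table. The following is only a sketch of that idea, not the answer's code. The helper name `codepoint_to_latin9` is an assumption, and it handles just the mapping step for one already-decoded code point.

```cpp
// Hypothetical helper (name is an assumption): maps one Unicode code point
// to its ISO-8859-15 byte, or '?' if it cannot be represented.
unsigned char codepoint_to_latin9(unsigned int codepoint) {
    struct Entry { unsigned int codepoint; unsigned char byte; };
    // The eight positions where ISO-8859-15 differs from ISO-8859-1,
    // taken from the if/else chain in the function above.
    static const Entry replaced[] = {
        {0x20AC, 0xa4}, {0x0160, 0xa6}, {0x0161, 0xa8}, {0x017d, 0xb4},
        {0x017e, 0xb8}, {0x0152, 0xbc}, {0x0153, 0xbd}, {0x0178, 0xbe},
    };
    for (const Entry &e : replaced) {
        if (codepoint == e.codepoint)
            return e.byte;  // character introduced by ISO-8859-15
        if (codepoint == e.byte)
            return '?';     // ISO-8859-1 character whose slot was reused
    }
    return codepoint <= 255 ? static_cast<unsigned char>(codepoint) : '?';
}
```

The decoding loop would then call this helper once per completed code point and append the result.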

How can I convert ISO-8859-7 strings to UTF-8 in C++?

OK, I decided to do this myself instead of looking for a compatible library. Here's how I did it.

The main problem was figuring out how to fill the two UTF-8 bytes from the single ISO byte, so I used the debugger to read the value for the same character, first written by the old machine and then written as a constant string (UTF-8 by default). I started with "O" and "Π" and saw that in UTF-8 the first byte was always 0xCE, while the second one was the ISO value plus an offset (-0x30). I built the following code to implement this and tested it with a string containing all Greek letters, both upper and lower case. Then I realised that starting from "π" (0xF0 in ISO) both the first byte and the offset for the second one changed, so I added a test to decide which of the two rules to apply.

The following function returns a bool to let the caller know whether the original string contained ISO characters (useful for other purposes) and overwrites the original string, passed as a pointer, with the new one. The caller's buffer must therefore be large enough for the converted text, which can be up to twice as long. I worked with char arrays instead of strings for consistency with the rest of the project, which is basically a C project written in C++.

#include <cstring>

bool iso_to_utf8(char* in){
    bool wasISO=false;

    if(in == NULL)
        return wasISO;

    // count chars
    int i=strlen(in);
    if(!i)
        return wasISO;

    // create and size new buffer: two bytes per char at most, plus the terminator
    char *out = new char[2*i+1];
    // fill with 0's, useful for watching the string as it gets built
    memset(out, 0, 2*i+1);

    // ready to start from head of old buffer
    i=0;
    // index for new buffer
    int j=0;
    // for each char in old buffer
    while(in[i]!='\0'){
        // test the high bit directly so the check works whether char is signed or not
        if((in[i] & 0x80) == 0){
            // it's already utf8-compliant, take it as it is
            out[j++] = in[i];
        }else{
            // it's ISO
            wasISO=true;
            // get plain value
            int val = in[i] & 0xFF;
            // first byte to CF or CE
            out[j++] = static_cast<char>(val > 0xEF ? 0xCF : 0xCE);
            // second char to plain value normalized
            out[j++] = static_cast<char>(val - (val > 0xEF ? 0x70 : 0x30));
        }
        i++;
    }
    // add string terminator
    out[j]='\0';
    // paste into old char array (the caller must have reserved enough room)
    strcpy(in, out);
    // release the temporary buffer
    delete[] out;

    return wasISO;
}
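If overwriting the caller's buffer is not wanted, the same two rules can be applied while building a new string. This is a sketch under the same assumption as the code above, namely that every non-ASCII byte is a Greek letter; the function name `greek_iso_to_utf8` is made up for this example.

```cpp
#include <string>

// Hypothetical variant (name is an assumption) that returns a new string
// instead of writing back into the input buffer. It uses the two rules
// found above: lead byte 0xCE with offset -0x30 for bytes up to 0xEF,
// lead byte 0xCF with offset -0x70 from 0xF0 onwards.
std::string greek_iso_to_utf8(const std::string &in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (char c : in) {
        const unsigned int val = static_cast<unsigned char>(c);
        if (val < 0x80) {
            out.push_back(c);  // plain ASCII, copy unchanged
        } else {
            const bool second_rule = val > 0xEF;
            out.push_back(static_cast<char>(second_rule ? 0xCF : 0xCE));
            out.push_back(static_cast<char>(val - (second_rule ? 0x70 : 0x30)));
        }
    }
    return out;
}
```

Because the result is a `std::string`, the caller does not need to reserve extra room in advance.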

Converting string from ISO-8859-1 to UTF8

I tried it in interactive mode. Your text seems to be encoded in UTF-8 already:

$ php -a
Interactive mode enabled

php > $text = "<p>Ayurveda ist die älteste Lebens- und Gesundheitslehre der Welt. Sie ist in einer Hochkultur auf dem Gebiet des heutigen Indien entstanden und ihre Prinzipien sind universell gültig.
php " </p>";
php > echo utf8_encode(html_entity_decode($text));
<p>Ayurveda ist die älteste Lebens- und Gesundheitslehre der Welt. Sie ist in einer Hochkultur auf dem Gebiet des heutigen Indien entstanden und ihre Prinzipien sind universell gültig.
</p>
php > echo utf8_decode(html_entity_decode($text));
<p>Ayurveda ist die älteste Lebens- und Gesundheitslehre der Welt. Sie ist in einer Hochkultur auf dem Gebiet des heutigen Indien entstanden und ihre Prinzipien sind universell gültig.
</p>
php >

You can try the same in your environment. If the problem persists when you load your page, you can try iconv() to fix it.


