Using iconv to Convert from UTF-16LE to UTF-8

Converting file using `iconv` from UTF-16LE to UTF-8 yields UTF-16LE file

This question has been online for a long time and has received almost no views and no answers. Here's how I finally solved the problem.

I wrote a Node.js script that performs the conversion:

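// Convert data/schema.graphql from UTF-16LE to UTF-8 in place.
const fs = require('fs');

const schemaFileName = 'data/schema.graphql';

// Decode the file contents as UTF-16LE.
const readContent = fs.readFileSync(schemaFileName, {
  encoding: 'utf16le',
});

// Drop the leading byte order mark (U+FEFF), if there is one.
const writeContent = (readContent.charAt(0) === '\ufeff')
  ? readContent.substring(1)
  : readContent;

// Write the text back to the same file, encoded as UTF-8.
fs.writeFileSync(schemaFileName, writeContent, 'utf8');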

Converting UTF-16 to UTF-8 using libiconv

The input data for iconv is always an opaque byte stream. When reading UTF-16, iconv expects the input data to consist of two-byte code units. Therefore, if you want to provide hard-coded input data, you need to use a two-byte wide integral type.

In C++11 and C11 this should be char16_t, but you can also use uint16_t:

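#include <stdint.h>

/* "hi" as UTF-16 code units in the machine's byte order, zero-terminated */
uint16_t data[] = { 0x68, 0x69, 0 };

/* view the same memory as bytes, which is what iconv consumes */
char const * p = (char const *)data;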

To be pedantic, nothing in the C or C++ standard says that uint16_t occupies two bytes; it is only guaranteed to be exactly 16 bits wide. However, iconv is a POSIX interface, and POSIX mandates that CHAR_BIT == 8, so on POSIX systems uint16_t is indeed two bytes.

(Also note that the way you spell a literal value has nothing to do with the width of the type which you initialize with that value, so there's no difference between 0x68, 0x0068, or 0x00068. What's much more interesting are the new Unicode character literals \u and \U, but that's a whole different story.)
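As a minimal sketch of the idea above, the following C code feeds 16-bit code units to iconv as a byte stream. The helper names `native_utf16_name` and `utf16_units_to_utf8` are hypothetical and not part of iconv. Because the array holds code units in the machine's own byte order, the sketch picks "UTF-16LE" or "UTF-16BE" at run time; that choice is an assumption about how you want to label the data, not something the original answer prescribes.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: name of the UTF-16 variant matching this machine's byte order. */
const char *native_utf16_name(void)
{
    const uint16_t probe = 1;
    /* On a little-endian machine the low-order byte is stored first. */
    return (*(const unsigned char *)&probe == 1) ? "UTF-16LE" : "UTF-16BE";
}

/*
 * Hypothetical helper: converts `units` UTF-16 code units (host byte order)
 * to UTF-8 in `out`. Returns the number of bytes written, or (size_t)-1 on
 * error. The output is not NUL-terminated.
 */
size_t utf16_units_to_utf8(const uint16_t *in, size_t units,
                           char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open("UTF-8", native_utf16_name());
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv counts bytes, not code units, so scale the length accordingly. */
    char *in_ptr = (char *)in;
    size_t in_left = units * sizeof *in;
    char *out_ptr = out;
    size_t out_left = out_size;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    return (rc == (size_t)-1) ? (size_t)-1 : out_size - out_left;
}
```

Called with the two code units 0x68 and 0x69 and a sufficiently large output buffer, this should write the two bytes of "hi". The terminating 0 is left out of the count here so that it is not converted as well.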

Convert UTF-16LE to UTF-8 with iconv()

Here's the problem:

size_t readBytes = sizeof(inBuf);
size_t writeBytes = sizeof(outBuf);

When you pass arrays to a function, they decay to pointers to their first element. Your call

fn2Utf8(inBuf, outBuf);

is equal to

fn2Utf8(&inBuf[0], &outBuf[0]);

That means that in the function the arguments are not arrays, but pointers. And when you do sizeof on a pointer you get the size of the pointer and not what it's pointing to.

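There are two solutions. The first is to pass the lengths of the arrays as arguments to the function and use those. The second, at least for the inBuf argument, is to rely on it being a null-terminated string and use strlen instead.

The strlen approach can only ever apply to inBuf, and only when the input contains no zero bytes before its end. UTF-16LE text usually does contain them (every ASCII character is followed by a 0x00 byte), so strlen would stop early on such input; in that case count 16-bit code units up to the 0x0000 terminator, or pass the length. For outBuf you always have to use the first way and pass in the size as an argument.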


It works in the program without the function because there you are applying sizeof to the array itself, not to a pointer. When you have an array and not a pointer, sizeof gives you the size of the array in bytes.
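The points above can be sketched in C as follows. All three function names are hypothetical: `size_seen_by_callee` only demonstrates what sizeof reports for a pointer parameter, `utf16_byte_length` is one possible stand-in for strlen on zero-terminated UTF-16 text (strlen itself stops at the first zero byte, which UTF-16LE text made of ASCII characters has after every character), and `fn2Utf8_sized` is an assumed variant of the question's function that takes both buffer sizes as arguments.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>

/* Inside a function the parameter is a pointer, so sizeof yields the pointer size. */
size_t size_seen_by_callee(const char *buf)
{
    return sizeof buf; /* same as sizeof(const char *) */
}

/* Length in bytes of UTF-16 text that ends in a 0x0000 code unit (terminator excluded). */
size_t utf16_byte_length(const uint16_t *text)
{
    assert(text != NULL);
    size_t units = 0;
    while (text[units] != 0)
        units++;
    return units * sizeof *text;
}

/*
 * The caller supplies both sizes explicitly. Returns the number of UTF-8
 * bytes written to `out`, or (size_t)-1 on error.
 */
size_t fn2Utf8_sized(const char *in, size_t in_bytes,
                     char *out, size_t out_size)
{
    assert(in != NULL && out != NULL);

    iconv_t cd = iconv_open("UTF-8", "UTF-16LE");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in_ptr = (char *)in;
    size_t in_left = in_bytes;
    char *out_ptr = out;
    size_t out_left = out_size;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    return (rc == (size_t)-1) ? (size_t)-1 : out_size - out_left;
}
```

The caller, which still sees the real arrays, can pass `sizeof outBuf` for the output size and the byte count of the actual text for the input size.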

Convert UTF-16LE to UTF-8 in PHP

iconv supports the UTF-16LE encoding.

You can use it to convert the encoding from UTF-16LE to UTF-8:

// iconv(from_encoding, to_encoding, string) returns false on failure.
$result = iconv('UTF-16LE', 'UTF-8', $str);
if (false === $result)
{
    throw new Exception('Input string could not be converted.');
}

See the iconv documentation.

UTF-16LE and UTF-8 can both encode every Unicode code point, so the conversion loses nothing for well-formed input; only unpaired surrogates in the input have no valid UTF-8 form. This should fit your case.


Edit: I was not able to reproduce the problem on a box of my own, but on another box I ran into this notice:

Notice: iconv() [function.iconv]: Wrong charset, conversion from `UTF-16LE' to `UTF-8' is not allowed in ...

It looks like not all iconv versions can actually convert UTF-16LE to UTF-8.

As a workaround, you can use mb_convert_encoding instead; at least it worked in this case:

$result = mb_convert_encoding($str , 'UTF-8' , 'UTF-16LE');

Convert UTF8 to UTF16 using iconv

UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.

UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.

If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.

I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16 strings, you should get a valid UTF-8 version of the original file.

Try running od -c on the files to see their actual contents.

UPDATE:

It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:

( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE

The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.
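If the locale dependence of printf is a concern, the same idea (write the bytes FF FE, then the BOM-less UTF-16LE output) can be done in a program. The following C code is only a sketch under that assumption; the function name `utf8_to_utf16le_with_bom` is hypothetical, and it relies on "UTF-16LE" as a target producing no BOM of its own, as described above.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Writes a little-endian BOM (FF FE) followed by the UTF-16LE encoding of
 * the UTF-8 input. Returns the total number of bytes written to `out`,
 * or (size_t)-1 on error.
 */
size_t utf8_to_utf16le_with_bom(const char *in, size_t in_bytes,
                                unsigned char *out, size_t out_size)
{
    static const unsigned char bom[2] = { 0xFF, 0xFE };

    assert(in != NULL && out != NULL);
    if (out_size < sizeof bom)
        return (size_t)-1;

    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* Emit the BOM by hand, since the UTF-16LE target does not add one. */
    memcpy(out, bom, sizeof bom);

    char *in_ptr = (char *)in;
    size_t in_left = in_bytes;
    char *out_ptr = (char *)out + sizeof bom;
    size_t out_left = out_size - sizeof bom;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    return (rc == (size_t)-1) ? (size_t)-1 : out_size - out_left;
}
```

For the input "hi" this should yield six bytes: FF FE 68 00 69 00.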

(Can anyone suggest a more elegant solution?)

Another workaround, if you know the endianness of the output produced by -t utf-16:

iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null

