Convert Utf8 to Utf16 Using Iconv

UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.

UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.

If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.

I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16 strings, you should get a valid UTF-8 version of the original file.

Try running od -c on the files to see their actual contents.
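To illustrate why a missing BOM makes detection hard, here is a minimal Node.js sketch (not iconv, and not how the file command actually works) that guesses an encoding from the first bytes only. The helper names sniffBom and hexDump are made up for this example, and UTF-32 BOMs are deliberately ignored.

```javascript
// Hypothetical helper: guess a Unicode encoding from a leading BOM, if any.
// Limitation: UTF-32 BOMs are not checked, so FF FE 00 00 reports UTF-16LE.
function sniffBom(buf) {
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) return 'UTF-8';
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) return 'UTF-16LE';
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) return 'UTF-16BE';
  return null; // no BOM: the encoding has to be known some other way
}

// Hypothetical helper: space-separated hex bytes, for eyeballing file contents.
function hexDump(buf) {
  return Array.from(buf, (b) => b.toString(16).padStart(2, '0')).join(' ');
}

const withoutBom = Buffer.from('hi', 'utf16le');                        // like -t UTF-16LE
const withBom = Buffer.concat([Buffer.from([0xff, 0xfe]), withoutBom]); // BOM prepended

console.log(hexDump(withoutBom), sniffBom(withoutBom));
console.log(hexDump(withBom), sniffBom(withBom));
```

Without the BOM there is nothing in the first bytes to distinguish UTF-16LE from any other byte stream, which is consistent with tools failing to recognize it.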

UPDATE:

It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:

( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE

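The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8. The \x escapes are also not required by POSIX printf, so a shell such as dash may print them literally; the octal form printf '\377\376' is the portable spelling.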

(Can anyone suggest a more elegant solution?)
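For comparison, here is the same byte layout (a BOM followed by little-endian code units) sketched in Node.js instead of the shell. It is an illustration of what the command above produces, not a replacement for iconv, and toUtf16leWithBom is a made-up name.

```javascript
// Hypothetical helper: encode a JS string as UTF-16LE with a leading BOM.
function toUtf16leWithBom(text) {
  const bom = Buffer.from([0xff, 0xfe]);     // U+FEFF in little-endian byte order
  const body = Buffer.from(text, 'utf16le'); // code units, low byte first, no BOM
  return Buffer.concat([bom, body]);
}

// Start from UTF-8 bytes, as `iconv -f utf-8` would read them from a file.
const utf8Input = Buffer.from('hi', 'utf8');
const utf16Output = toUtf16leWithBom(utf8Input.toString('utf8'));

console.log(utf16Output.toString('hex')); // fffe68006900
```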

Another workaround, if you know the endianness of the output produced by -t utf-16:

iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
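The byte swap works because it flips the BOM together with every code unit, so the result stays self-consistent. A rough Node.js equivalent, assuming even-length input (swapBytePairs is a hypothetical name; Buffer#swap16 is the built-in doing the work):

```javascript
// Hypothetical helper: swap the two bytes of every 16-bit unit.
// Unlike `dd conv=swab`, this sketch rejects odd-length input outright.
function swapBytePairs(buf) {
  if (buf.length % 2 !== 0) {
    throw new RangeError('UTF-16 data must have an even number of bytes');
  }
  const copy = Buffer.from(buf); // copy so the caller's buffer is left untouched
  return copy.swap16();          // swaps in place and returns the same buffer
}

// "hi" as UTF-16LE with a BOM: FF FE 68 00 69 00
const littleEndian = Buffer.from([0xff, 0xfe, 0x68, 0x00, 0x69, 0x00]);
const bigEndian = swapBytePairs(littleEndian);

console.log(bigEndian.toString('hex')); // feff00680069
```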

UTF-8 to UTF-16, different results using iconv vs mbstring

iconv adds a BOM at the beginning of the output string. So for converting strings, you probably want to use mb_convert_encoding. iconv can be more useful for files.

Convert UTF-8 to UTF-16 in iconv

The input file is not UTF-8 but UCS-2.

Try:

iconv -f UCS-2 -t UTF-16 < test

Converting UTF-16 to UTF-8 using libiconv

The input data for iconv is always an opaque byte stream. When reading UTF-16, iconv expects the input data to consist of two-byte code units. Therefore, if you want to provide hard-coded input data, you need to use a two-byte wide integral type.

In C++11 and C11 this should be char16_t, but you can also use uint16_t:

uint16_t data[] = { 0x68, 0x69, 0 };

char const * p = (char const *)data;

To be pedantic, nothing in the C or C++ standards says that uint16_t occupies two bytes. However, iconv is a POSIX library, and POSIX mandates that CHAR_BIT == 8, so on POSIX systems a uint16_t is exactly two bytes.

(Also note that the way you spell a literal value has nothing to do with the width of the type which you initialize with that value, so there's no difference between 0x68, 0x0068, or 0x00068. What's much more interesting are the new Unicode character literals \u and \U, but that's a whole different story.)
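The same idea, two-byte code units handed over as an opaque byte stream, can be sketched in Node.js without libiconv. This is only an illustration: it writes each unit with an explicit byte order, whereas a C uint16_t array has the host's byte order, and the variable names are made up.

```javascript
// Code units for "hi"; 0x68 and 0x0068 are the same value, only the
// storage width (two bytes per unit) matters for UTF-16.
const codeUnits = [0x0068, 0x69];

// Write every unit low byte first so the result does not depend on the host CPU.
const utf16Bytes = Buffer.alloc(codeUnits.length * 2);
codeUnits.forEach((unit, i) => utf16Bytes.writeUInt16LE(unit, i * 2));

const decoded = utf16Bytes.toString('utf16le'); // interpret the bytes as UTF-16LE
const utf8Bytes = Buffer.from(decoded, 'utf8'); // re-encode the text as UTF-8

console.log(utf16Bytes.toString('hex')); // 68006900
console.log(utf8Bytes.toString('hex'));  // 6869
```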

how to convert utf-8 to utf-16 with ndash?

I think the problem is that your code isn't outputting a UTF-16LE BOM (byte order mark) at the beginning of the file, so the programs reading it don't know what encoding it's in and are (apparently) guessing poorly.

A UTF-16LE BOM is the byte sequence 0xFF 0xFE (in that order) right at the beginning of the file. Make that the first thing you write to your output. More about BOMs in this Unicode FAQ.

To test my theory, I wrote the byte sequence for a UTF-16LE file containing only the characters 0–0:


FF FE 30 00 13 20 30 00

The FF FE is the BOM, the 30 00 is the digit zero, the 13 20 is the N-dash, and the final 30 00 is the final digit zero. (The zeros are just there so I can easily find the dash, though in such a short file it wouldn't really be difficult.)

I was able to open that with Office 365 on Windows just fine.

Then I wrote a file without the BOM:


30 00 13 20 30 00

Office 365 did indeed misinterpret the N-dash and show it as a character that looks like a pair of brackets.
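The two byte sequences above can be reproduced with a few lines of Node.js, which is a quick way to check what your own code should be writing. This is a sketch of the byte layout only; hex is a throwaway helper name.

```javascript
// Hypothetical helper: upper-case hex bytes separated by spaces.
const hex = (buf) =>
  Array.from(buf, (b) => b.toString(16).padStart(2, '0').toUpperCase()).join(' ');

const sample = '0\u20130'; // digit zero, EN DASH (U+2013), digit zero

const noBom = Buffer.from(sample, 'utf16le');                       // code units only
const bomFirst = Buffer.concat([Buffer.from([0xff, 0xfe]), noBom]); // BOM, then text

console.log(hex(bomFirst)); // FF FE 30 00 13 20 30 00
console.log(hex(noBom));    // 30 00 13 20 30 00
```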

Converting file using `iconv` from UTF-16LE to UTF-8 yields UTF-16LE file

This question has been online for a long time and has received no views and no answers. Here's how I finally solved the problem.

I made a script for nodejs which performs the conversion:

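const fs = require('fs');

const schemaFileName = 'data/schema.graphql';

// Read the file as UTF-16LE; a BOM, if present, decodes to U+FEFF.
const readContent = fs.readFileSync(schemaFileName, {
  encoding: 'utf16le',
});

// Drop the leading BOM so it does not end up in the UTF-8 output.
const writeContent = (readContent.charAt(0) === '\ufeff')
  ? readContent.substring(1)
  : readContent;

// Overwrite the same file with the UTF-8 encoding of the text.
fs.writeFileSync(schemaFileName, writeContent, 'utf8');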

