Converting file using `iconv` from UTF-16LE to UTF-8 yields UTF-16LE file
This question has been online for a long time without receiving any views or answers. Here's how I finally solved the problem.
I wrote a Node.js script that performs the conversion:
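const fs = require('fs');

const schemaFileName = 'data/schema.graphql';

// Decode the file's bytes as UTF-16LE.
const readContent = fs.readFileSync(schemaFileName, {
  encoding: 'utf16le',
});

// Strip the byte order mark (U+FEFF) if the file starts with one.
const writeContent = (readContent.charAt(0) === '\ufeff')
  ? readContent.substring(1)
  : readContent;

// Write the text back to the same file as UTF-8.
fs.writeFileSync(schemaFileName, writeContent, 'utf8');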
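The same idea can be written as a pure function on a Buffer, which makes it easy to try out without touching the file system. This is only a minimal sketch; the function name utf16leToUtf8 is hypothetical and not part of the original script.

```javascript
// Hypothetical helper: convert a Buffer holding UTF-16LE text to a UTF-8 Buffer.
function utf16leToUtf8(input) {
  // Decode the raw bytes as UTF-16LE into a JavaScript string.
  const text = input.toString('utf16le');
  // Drop a leading byte order mark (U+FEFF) if one is present.
  const withoutBom = text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
  // Re-encode the string as UTF-8.
  return Buffer.from(withoutBom, 'utf8');
}

// "hi" in UTF-16LE, preceded by the BOM bytes FF FE.
const sampleInput = Buffer.from([0xff, 0xfe, 0x68, 0x00, 0x69, 0x00]);
console.log(utf16leToUtf8(sampleInput)); // <Buffer 68 69>
```

Working on buffers rather than file names also keeps the read and write steps separate, so the output can go to a different file than the input.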
Converting UTF-16 to UTF-8 using libiconv
The input data for iconv is always an opaque byte stream. When reading UTF-16, iconv expects the input data to consist of two-byte code units. Therefore, if you want to provide hard-coded input data, you need to use a two-byte wide integral type.
In C++11 and C11 this should be char16_t, but you can also use uint16_t:
uint16_t data[] = { 0x68, 0x69, 0 };
char const * p = (char const *)data;
To be pedantic, nothing in general says that uint16_t has two bytes. However, iconv is a POSIX library, and POSIX mandates that CHAR_BIT == 8, so it is true on POSIX.
(Also note that the way you spell a literal value has nothing to do with the width of the type which you initialize with that value, so there's no difference between 0x68, 0x0068, or 0x00068. What's much more interesting are the new Unicode character literals \u and \U, but that's a whole different story.)
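The "two-byte code units seen as an opaque byte stream" idea is not specific to C. As a rough analogue (this is Node.js, not iconv, and all variable names are made up for illustration), a Uint16Array can be viewed as raw bytes in the same way the uint16_t array is cast to char const * above:

```javascript
// Two UTF-16 code units for "hi"; 0x68 and 0x0069 are just different spellings
// of 16-bit values, the element type decides the width.
const codeUnits = new Uint16Array([0x68, 0x0069]);

// View the same memory as an opaque byte stream (bytes are in host byte order).
const rawBytes = Buffer.from(codeUnits.buffer, codeUnits.byteOffset, codeUnits.byteLength);

// Detect the host byte order by looking at how the value 1 is stored.
const hostIsLittleEndian = new Uint8Array(new Uint16Array([1]).buffer)[0] === 1;

// Normalise to little-endian on a copy, so the typed array is left untouched.
const leBytes = hostIsLittleEndian ? rawBytes : Buffer.from(rawBytes).swap16();

console.log(rawBytes.length);             // 4: two bytes per code unit
console.log(leBytes.toString('utf16le')); // hi
```

The byte order of the raw view depends on the machine, which is the same reason the C version has to tell iconv whether the data is UTF-16LE or UTF-16BE.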
Convert UTF-16LE to UTF-8 with iconv()
Here's the problem:
size_t readBytes = sizeof(inBuf);
size_t writeBytes = sizeof(outBuf);
When you pass arrays to a function, they decay to pointers to their first element. Your call
fn2Utf8(inBuf, outBuf);
is equal to
fn2Utf8(&inBuf[0], &outBuf[0]);
That means that inside the function the arguments are not arrays, but pointers. And when you apply sizeof to a pointer you get the size of the pointer, not the size of what it points to.
There are two solutions: The first is to pass in the length of the arrays as arguments to the function, and use that. The second, at least for the inBuf argument, is to rely on the fact that it's a null-terminated string and use strlen instead. Be aware that strlen stops at the first zero byte, so it only gives the right length if the input contains no embedded zero bytes.
The second way, with strlen, works only on inBuf as I already said, but doesn't work on outBuf, where you have to use the first way and pass in the size as an argument.
It works in the program without the function because there you are applying sizeof to the array, and not to a pointer. When you have an array and not a pointer, sizeof gives you the size of the array in bytes.
Convert UTF-16LE to UTF-8 in PHP
iconv supports the UTF-16LE encoding.
You can use it to convert the encoding from UTF-16LE to UTF-8:
// iconv() takes the source charset, the target charset, and the string;
// it returns false if the conversion fails.
$result = iconv('UTF-16LE', 'UTF-8', $str);
if (false === $result)
{
    throw new Exception('Input string could not be converted.');
}
See the iconv docs.
Every code point that can be encoded in UTF-16LE can also be encoded in UTF-8, since both cover the full Unicode range, so this should fit in your case.
Edit: I was not able to reproduce the problem on a box of my own, but on another box I ran into this notice:
Notice: iconv() [function.iconv]: Wrong charset, conversion from `UTF-16LE' to `UTF-8' is not allowed in ...
It looks like not all iconv versions can actually convert UTF-16LE to UTF-8.
A possible workaround is to use mb_convert_encoding instead; at least it worked in this case:
$result = mb_convert_encoding($str , 'UTF-8' , 'UTF-16LE');
Convert UTF8 to UTF16 using iconv
UTF-16LE tells iconv to generate little-endian UTF-16 without a BOM (Byte Order Mark). Apparently it assumes that since you specified LE, the BOM isn't necessary.
UTF-16 tells it to generate UTF-16 text (in the local machine's byte order) with a BOM.
If you're on a little-endian machine, I don't see a way to tell iconv to generate big-endian UTF-16 with a BOM, but I might just be missing something.
I find that the file command doesn't recognize UTF-16 text without a BOM, and your editor might not either. But if you run iconv -f UTF-16LE -t UTF-8 b-16 strings, you should get a valid UTF-8 version of the original file.
Try running od -c on the files to see their actual contents.
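Checking the first bytes of a file, as od -c lets you do by eye, can also be scripted. The following is a minimal sketch in Node.js; detectBom is a hypothetical helper name, and it deliberately ignores UTF-32 (whose little-endian BOM also starts with FF FE).

```javascript
// Hypothetical helper: report which byte order mark, if any, a buffer starts with.
function detectBom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return 'utf-8';
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return 'utf-16le';
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) {
    return 'utf-16be';
  }
  // No BOM: the encoding has to be known some other way.
  return null;
}

console.log(detectBom(Buffer.from('\ufeffhi', 'utf16le'))); // utf-16le
console.log(detectBom(Buffer.from('hi', 'utf16le')));       // null
```

A null result for BOM-less UTF-16LE data matches the observation above that tools relying on the BOM cannot identify such files.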
UPDATE:
It looks like you're on a big-endian machine (x86 is little-endian), and you're trying to generate a little-endian UTF-16 file with a BOM. Is that correct? As far as I can tell, iconv won't do that directly. But this should work:
( printf "\xff\xfe" ; iconv -f utf-8 -t utf-16le UTF-8-FILE ) > UTF-16-FILE
The behavior of the printf might depend on your locale settings; I have LANG=en_US.UTF-8.
(Can anyone suggest a more elegant solution?)
Another workaround, if you know the endianness of the output produced by -t utf-16:
iconv -f utf-8 -t utf-16 UTF-8-FILE | dd conv=swab 2>/dev/null
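For comparison, both shell workarounds (prefixing a BOM by hand and swapping bytes pairwise) can be sketched in Node.js without iconv. The helper name encodeUtf16WithBom is an assumption made for this example; Buffer.from and swap16 are standard Buffer methods.

```javascript
// Hypothetical helper: encode a string as UTF-16 with a BOM in the requested byte order.
function encodeUtf16WithBom(text, littleEndian) {
  // U+FEFF encodes to FF FE in UTF-16LE, which is the little-endian BOM.
  const le = Buffer.from('\ufeff' + text, 'utf16le');
  // swap16() swaps each pair of bytes in place, much like `dd conv=swab`.
  return littleEndian ? le : le.swap16();
}

console.log(encodeUtf16WithBom('hi', true));  // <Buffer ff fe 68 00 69 00>
console.log(encodeUtf16WithBom('hi', false)); // <Buffer fe ff 00 68 00 69>
```

Because the BOM is encoded together with the text, swapping the whole buffer turns FF FE into FE FF automatically, so the mark always agrees with the byte order of the data.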