Writing utf16 to file in binary mode
I suspect that sizeof(wchar_t) is 4 in your environment - i.e. it's writing out UTF-32/UCS-4 instead of UTF-16. That's certainly what the hex dump looks like.
That's easy enough to test (just print out sizeof(wchar_t)) but I'm pretty sure it's what's going on.
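That check is a one-liner; a minimal sketch (the value printed depends on the platform and compiler):

#include <iostream>

int main() {
    // Typically 2 on Windows (MSVC) and 4 on Linux/macOS (GCC/Clang)
    std::cout << sizeof(wchar_t) << '\n';
}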
To go from a UTF-32 wstring to UTF-16 you'll need to apply a proper encoding, as surrogate pairs come into play.
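As a rough sketch of what that conversion involves (assuming wchar_t is 4 bytes and the wstring holds valid UTF-32 code points; no error handling), the surrogate-pair math looks something like this. In practice you would use a proper conversion facility such as std::codecvt or ICU rather than rolling it by hand:

#include <cstdio>
#include <string>
#include <vector>

// Convert UTF-32 code points (one per wchar_t here) to UTF-16 code units,
// emitting a surrogate pair for anything above U+FFFF.
std::vector<char16_t> utf32_to_utf16(const std::wstring& in) {
    std::vector<char16_t> out;
    for (wchar_t wc : in) {
        char32_t cp = static_cast<char32_t>(wc);
        if (cp <= 0xFFFF) {
            out.push_back(static_cast<char16_t>(cp));                     // fits in one code unit
        } else {
            cp -= 0x10000;                                                // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
        }
    }
    return out;
}

int main() {
    std::wstring s = L"A\U0001F600";                 // 'A' plus a code point outside the BMP
    std::printf("%zu\n", utf32_to_utf16(s).size());  // prints 3: one unit for 'A', two for the emoji
}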
How to read and write Unicode (UTF-16 little endian) text to and from a binary file in Python?
Your question is a bit unclear, but to edit an executable you simply need to replace the target bytes with another set of bytes of the same length. Here's an example:
test.c - Simple program with an embedded UTF-16LE string (in Windows, anyway):
#include <stdio.h>
int main() {
    wchar_t* s = L"Hello";
    printf("%S\n", s);
    return 0;
}
test.py - replace the string with another string
with open('test.exe', 'rb') as f:
    data = f.read()
target = 'Hello'.encode('utf-16le')
replacement = 'ABCDE'.encode('utf-16le')
if len(target) != len(replacement):
    raise RuntimeError('invalid replacement')
data = data.replace(target, replacement)
with open('new_test.exe', 'wb') as f:
    f.write(data)
Demo:
C:\>cl /W4 /nologo test.c
test.c
C:\>test.exe
Hello
C:\>test.py
C:\>new_test.exe
ABCDE
Why does ofstream not write UTF-16 on Linux in binary mode?
This actually wrote the content as UTF-16, but because I omitted the BOM, Windows didn't recognize the encoding when opening the file, so I assumed it had written the content as UTF-8.
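For reference, a minimal sketch of that approach (the file name and string are just examples; it assumes a little-endian host, so the char16_t code units are already in UTF-16LE byte order):

#include <fstream>
#include <string>

int main() {
    std::u16string text = u"Hello";

    std::ofstream out("utf16le.txt", std::ios::binary);

    // UTF-16LE BOM (0xFF 0xFE) so editors on Windows detect the encoding
    const unsigned char bom[] = { 0xFF, 0xFE };
    out.write(reinterpret_cast<const char*>(bom), sizeof bom);

    // Raw UTF-16LE payload
    out.write(reinterpret_cast<const char*>(text.data()),
              static_cast<std::streamsize>(text.size() * sizeof(char16_t)));
}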
grepping binary files and UTF16
The easiest way is to just convert the text file to utf-8 and pipe that to grep:
iconv -f utf-16 -t utf-8 file.txt | grep query
I tried to do the opposite (convert my query to utf-16) but it seems as though grep doesn't like that. I think it might have to do with endianness, but I'm not sure.
It seems as though grep will convert a query that is utf-16 to utf-8/ascii. Here is what I tried:
grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt
If test.txt is a utf-16 file this won't work, but it does work if test.txt is ascii. I can only conclude that grep is converting my query to ascii.
EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:
hexdump -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'`
How does it work? Well, it converts your file to hex (without the extra formatting that hexdump usually applies) and pipes that into grep. The grep query is built by echoing your search string (without a newline) into iconv, which converts it to utf-16. That is piped into sed to strip the BOM (the first two bytes of a utf-16 file, used to determine endianness), and then into hexdump so that the query is in the same form as the input.
Unfortunately, I think this will end up printing out the ENTIRE file if there is a single match. Also, this won't work if the utf-16 in your binary file is stored in a different endianness than your machine's.
EDIT2: Got it!!!!
grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt
This searches for the hex version of the string Test (in utf-16) in the file test.txt.
Write a unicode character to a file in a binary way
You are confusing Unicode with encodings. An encoding is a standard for representing text as a sequence of individual byte values in the range 0-255, while Unicode is a standard that describes codepoints representing textual glyphs. The two are related, but they are not the same thing.
The Unicode standard includes several encodings. UTF-16 is one such encoding; it uses 2 bytes per codepoint (4 bytes, via a surrogate pair, for codepoints outside the Basic Multilingual Plane). UTF-8 is another such encoding, and it uses a variable number of bytes per codepoint.
Your file, however, is written using ASCII, the default codec used by Python 2 when you do not specify an explicit encoding. If you expected to see 2 bytes per codepoint, encode to UTF-16 explicitly:
fin.write(u'\x40'.encode('utf-16-le'))
This writes UTF-16 in little-endian byte order; there is also a utf-16-be codec. Normally, for multi-byte encodings like UTF-16 or UTF-32, you'd also include a BOM (Byte Order Mark); it is added automatically when you encode to UTF-16 without picking an endianness:
fin.write(u'\x40'.encode('utf-16'))
I strongly urge you to study up on Unicode, codecs and Python before you continue:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
Problem writing unicode UTF-16 data to file in python
I believe the output from Google is UTF-8, not UTF-16. Try this fix:
ret = unicode(b.strip('"'), encoding='utf-8', errors='ignore')