Writing Utf16 to File in Binary Mode

Writing utf16 to file in binary mode

I suspect that sizeof(wchar_t) is 4 in your environment - i.e. it's writing out UTF-32/UCS-4 instead of UTF-16. That's certainly what the hex dump looks like.

That's easy enough to test (just print out sizeof(wchar_t)) but I'm pretty sure it's what's going on.
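If you'd rather check without compiling a C program, `ctypes` exposes the platform's `wchar_t` (a sketch; assumes CPython with `ctypes` available):

```python
# Check the platform's sizeof(wchar_t) from Python via ctypes.
import ctypes

# Typically 2 on Windows and 4 on Linux/macOS.
print(ctypes.sizeof(ctypes.c_wchar))
```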

To go from a UTF-32 wstring to UTF-16 you'll need to apply a proper encoding, as surrogate pairs come into play.
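For illustration, here is a minimal sketch of the surrogate-pair step of that conversion (the function name is mine, not a standard API):

```python
# Sketch of the UTF-32 -> UTF-16 transformation for a single codepoint.
def utf32_to_utf16_units(cp: int) -> list:
    """Return the UTF-16 code unit(s) for one codepoint."""
    if cp < 0x10000:
        return [cp]                 # BMP codepoints map directly
    cp -= 0x10000                   # surrogate-pair algorithm
    high = 0xD800 + (cp >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (cp & 0x3FF)     # bottom 10 bits -> low surrogate
    return [high, low]

print([hex(u) for u in utf32_to_utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
```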

How to read and write Unicode (UTF-16 little endian) text to and from a binary file in Python?

Your question is a bit unclear, but to edit an executable you simply need to replace the target bytes with another set of bytes of the same length. Here's an example:

test.c - Simple program with an embedded UTF-16LE string (in Windows, anyway):

#include <stdio.h>
#include <wchar.h>

int main(void) {
    wchar_t* s = L"Hello";
    printf("%S\n", s);
    return 0;
}

test.py - replace the string with another string

with open('test.exe', 'rb') as f:
    data = f.read()

target = 'Hello'.encode('utf-16le')
replacement = 'ABCDE'.encode('utf-16le')

if len(target) != len(replacement):
    raise RuntimeError('invalid replacement')

data = data.replace(target, replacement)

with open('new_test.exe', 'wb') as f:
    f.write(data)

Demo:

C:\>cl /W4 /nologo test.c
test.c

C:\>test.exe
Hello

C:\>test.py

C:\>new_test.exe
ABCDE
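One caveat with `bytes.replace`: it replaces every occurrence. A sketch of a sanity check you might add before patching (the byte strings here are stand-ins, not real executable contents):

```python
# Stand-in for the executable's bytes: padding around the target string.
data = b'\x00\x01' + 'Hello'.encode('utf-16le') + b'\x02\x03'
target = 'Hello'.encode('utf-16le')

# Refuse to patch unless the target appears exactly once.
if data.count(target) != 1:
    raise RuntimeError('expected exactly one occurrence of the target')
```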

Why ofstream does not write utf16 on linux in binary mode?

This actually wrote the content as UTF-16, but because I omitted the BOM, the file didn't get recognized when opened on Windows, so I assumed it had been written as UTF-8.
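The fix, sketched in Python for brevity: write the BOM yourself before the UTF-16LE payload (the filename is just an example):

```python
import codecs

# Write a UTF-16LE BOM (b'\xff\xfe') followed by UTF-16LE-encoded text,
# so Windows editors such as Notepad can detect the encoding.
with open('out.txt', 'wb') as f:
    f.write(codecs.BOM_UTF16_LE)
    f.write('Hello'.encode('utf-16-le'))
```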

grepping binary files and UTF16

The easiest way is to just convert the text file to utf-8 and pipe that to grep:

iconv -f utf-16 -t utf-8 file.txt | grep query

I tried to do the opposite (convert my query to utf-16) but it seems as though grep doesn't like that. I think it might have to do with endianness, but I'm not sure.

It seems as though grep will convert a query that is utf-16 to utf-8/ascii. Here is what I tried:

grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt

If test.txt is a utf-16 file this won't work, but it does work if test.txt is ascii. I can only conclude that grep is converting my query to ascii.

EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:

hexdump -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'`

How does it work? It converts your file to hex (without the extra formatting hexdump usually applies) and pipes that into grep. The grep query is constructed by echoing your query (without a newline) into iconv, which converts it to utf-16. That is piped into sed to remove the BOM (the first two bytes of a utf-16 file, used to determine endianness), and then into hexdump so that the query and the input are in the same form.

Unfortunately I think this will end up printing out the ENTIRE file if there is a single match. Also this won't work if the utf-16 in your binary file is stored in a different endianness than your machine.

EDIT2: Got it!!!!

grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt

This searches for the hex version of the string Test (in utf-16) in the file test.txt
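If the shell one-liners get too hairy, a plain-Python sketch of the same idea: encode the query to UTF-16LE and search the raw bytes directly (the filename and sample contents here are assumptions):

```python
# Create a sample UTF-16LE file to search (stand-in for the real test.txt).
with open('test.txt', 'wb') as f:
    f.write('some Test data'.encode('utf-16-le'))

# Encode the query the same way and search the raw bytes directly.
needle = 'Test'.encode('utf-16-le')
with open('test.txt', 'rb') as f:
    haystack = f.read()

print(needle in haystack)  # True
```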

Write a unicode character to a file in a binary way

You are confusing Unicode with encodings. An encoding is a standard for representing text within the confines of individual values in the range 0-255 (bytes), while Unicode is a standard that describes codepoints representing textual glyphs. The two are related but not the same thing.

The Unicode standard includes several encodings. UTF-16 is one such encoding; it uses 2 bytes per codepoint in the Basic Multilingual Plane and 4 bytes (a surrogate pair) for codepoints beyond it. UTF-8 is another such encoding, and it uses a variable number of bytes per codepoint.
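A quick sketch comparing byte lengths of the same characters under the two encodings (note that UTF-16 also needs 4 bytes for codepoints outside the BMP):

```python
# Byte lengths of the same characters under two Unicode encodings.
for ch in ('A', '\u00e9', '\U0001f600'):
    print(repr(ch),
          len(ch.encode('utf-8')),      # 1, 2, 4 bytes respectively
          len(ch.encode('utf-16-le')))  # 2, 2, 4 bytes respectively
```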

Your file, however, is written using ASCII, the default codec used by Python 2 when you do not specify an explicit encoding. If you expected to see 2 bytes per codepoint, encode to UTF-16 explicitly:

fin.write(u'\x40'.encode('utf-16-le'))

This writes UTF-16 in little-endian byte order; there is also a utf-16-be codec. Normally, for multi-byte encodings like UTF-16 or UTF-32, you'd also include a BOM, or Byte Order Mark; it is written automatically when you encode with the plain utf-16 codec, which doesn't pick an endianness:

fin.write(u'\x40'.encode('utf-16'))

I strongly urge you to study up on Unicode, codecs and Python before you continue:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • The Python Unicode HOWTO

  • Pragmatic Unicode by Ned Batchelder

Problem writing unicode UTF-16 data to file in python

I believe the output from Google is UTF-8, not UTF-16. Try this fix:

ret = unicode(b.strip('"'), encoding='utf-8', errors='ignore') 

