Writing utf16 to file in binary mode
I suspect that sizeof(wchar_t) is 4 in your environment - i.e. it's writing out UTF-32/UCS-4 instead of UTF-16. That's certainly what the hex dump looks like.
That's easy enough to test (just print out sizeof(wchar_t)) but I'm pretty sure it's what's going on.
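That check is a one-liner; a minimal sketch (the value printed depends on the platform and compiler):

#include <iostream>

int main() {
    // Typically 2 on Windows (MSVC) and 4 on Linux/macOS (GCC/Clang)
    std::cout << sizeof(wchar_t) << '\n';
}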
To go from a UTF-32 wstring to UTF-16 you'll need to apply a proper encoding, as surrogate pairs come into play.
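As a rough sketch of what that conversion involves (assuming wchar_t is 4 bytes and the wstring holds valid UTF-32 code points; no error handling), the surrogate-pair math looks something like this. In practice you would use a proper conversion facility such as std::codecvt or ICU rather than rolling it by hand:

#include <cstdio>
#include <string>
#include <vector>

// Convert UTF-32 code points (one per wchar_t here) to UTF-16 code units,
// emitting a surrogate pair for anything above U+FFFF.
std::vector<char16_t> utf32_to_utf16(const std::wstring& in) {
    std::vector<char16_t> out;
    for (wchar_t wc : in) {
        char32_t cp = static_cast<char32_t>(wc);
        if (cp <= 0xFFFF) {
            out.push_back(static_cast<char16_t>(cp));                     // fits in one code unit
        } else {
            cp -= 0x10000;                                                // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
        }
    }
    return out;
}

int main() {
    std::wstring s = L"A\U0001F600";                 // 'A' plus a code point outside the BMP
    std::printf("%zu\n", utf32_to_utf16(s).size());  // prints 3: one unit for 'A', two for the emoji
}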
How to read and write Unicode (UTF-16 little endian) text to and from a binary file in Python?
Your question is a bit unclear, but to edit an executable you simply need to replace the target bytes with another set of bytes of the same length. Here's an example:
test.c - Simple program with an embedded UTF-16LE string (in Windows, anyway):
#include <stdio.h>
int main() {
    wchar_t* s = L"Hello";
    printf("%S\n", s);
    return 0;
}
test.py - replace the string with another string
with open('test.exe', 'rb') as f:
    data = f.read()
target = 'Hello'.encode('utf-16le')
replacement = 'ABCDE'.encode('utf-16le')
if len(target) != len(replacement):
    raise RuntimeError('invalid replacement')
data = data.replace(target, replacement)
with open('new_test.exe', 'wb') as f:
    f.write(data)
Demo:
C:\>cl /W4 /nologo test.c
test.c
C:\>test.exe
Hello
C:\>test.py
C:\>new_test.exe
ABCDE
Why does ofstream not write UTF-16 on Linux in binary mode?
This actually wrote the content as UTF-16, but because I omitted the BOM, Windows didn't recognize the encoding when opening the file, so I assumed it had written the content as UTF-8.
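For reference, a minimal sketch of that approach (the file name and string are just examples; it assumes a little-endian host, so the char16_t code units are already in UTF-16LE byte order):

#include <fstream>
#include <string>

int main() {
    std::u16string text = u"Hello";

    std::ofstream out("utf16le.txt", std::ios::binary);

    // UTF-16LE BOM (0xFF 0xFE) so editors on Windows detect the encoding
    const unsigned char bom[] = { 0xFF, 0xFE };
    out.write(reinterpret_cast<const char*>(bom), sizeof bom);

    // Raw UTF-16LE payload
    out.write(reinterpret_cast<const char*>(text.data()),
              static_cast<std::streamsize>(text.size() * sizeof(char16_t)));
}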
grepping binary files and UTF16
The easiest way is to just convert the text file to utf-8 and pipe that to grep:
iconv -f utf-16 -t utf-8 file.txt | grep query
I tried to do the opposite (convert my query to utf-16) but it seems as though grep doesn't like that. I think it might have to do with endianness, but I'm not sure.
It seems as though grep will convert a query that is utf-16 to utf-8/ascii. Here is what I tried:
grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt
If test.txt is a utf-16 file this won't work, but it does work if test.txt is ascii. I can only conclude that grep is converting my query to ascii.
EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:
hexdump -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'`
How does it work? Well, it converts your file to hex (without the extra formatting that hexdump usually applies) and pipes that into grep. The grep query is built by echoing your search string (without a newline) into iconv, which converts it to utf-16. That is piped into sed to strip the BOM (the first two bytes of a utf-16 file, used to determine endianness), and then into hexdump so that the query is in the same form as the input.
Unfortunately, I think this will end up printing out the ENTIRE file if there is a single match. Also, this won't work if the utf-16 in your binary file is stored in a different endianness than your machine's.
EDIT2: Got it!!!!
grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt
This searches for the hex version of the string Test (in utf-16) in the file test.txt.
Write a unicode character to a file in a binary way
You are confusing Unicode with encodings. An encoding is a standard for representing text as a sequence of individual byte values in the range 0-255, while Unicode is a standard that describes codepoints representing textual glyphs. The two are related, but they are not the same thing.
The Unicode standard includes several encodings. UTF-16 is one such encoding; it uses 2 bytes per codepoint (4 bytes, via a surrogate pair, for codepoints outside the Basic Multilingual Plane). UTF-8 is another such encoding, and it uses a variable number of bytes per codepoint.
Your file, however, is written using ASCII, the default codec used by Python 2 when you do not specify an explicit encoding. If you expected to see 2 bytes per codepoint, encode to UTF-16 explicitly:
fin.write(u'\x40'.encode('utf-16-le'))
This writes UTF-16 in little-endian byte order; there is also a utf-16-be codec. Normally, for multi-byte encodings like UTF-16 or UTF-32, you'd also include a BOM (Byte Order Mark); it is added automatically when you encode to UTF-16 without picking an endianness:
fin.write(u'\x40'.encode('utf-16'))
I strongly urge you to study up on Unicode, codecs and Python before you continue:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
Problem writing unicode UTF-16 data to file in python
I believe the output from Google is UTF-8, not UTF-16. Try this fix:
ret = unicode(b.strip('"'), encoding='utf-8', errors='ignore')