Grepping binary files and UTF-16
The easiest way is to just convert the text file to utf-8 and pipe that to grep:
iconv -f utf-16 -t utf-8 file.txt | grep query
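As a minimal, self-contained sketch of this approach, the snippet below builds a small UTF-16 sample and searches it through iconv. The function name grep_utf16 and the sample text are made up for illustration, and iconv is assumed to pick the byte order from the BOM at the start of the file.

```shell
#!/bin/sh
# grep_utf16 PATTERN FILE (hypothetical helper): decode FILE from UTF-16
# to UTF-8 on the fly and let grep search the decoded stream.
grep_utf16() {
    iconv -f UTF-16 -t UTF-8 "$2" | grep -- "$1"
}

# Build a throwaway UTF-16 sample (iconv writes a BOM for plain "UTF-16").
sample=$(mktemp)
printf 'first line\nsecond line with query\n' | iconv -f UTF-8 -t UTF-16 > "$sample"

# Prints the decoded lines that contain the pattern.
grep_utf16 query "$sample"
rm -f "$sample"
```

Because grep only ever sees UTF-8 here, the usual options (-n, -i, -c and so on) can be added to the grep call inside the helper.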
I tried to do the opposite (convert my query to UTF-16), but grep never finds a match that way. I think it might have to do with endianness, but I'm not sure.
It seems as though a UTF-16 query ends up as UTF-8/ASCII by the time grep sees it. Here is what I tried:
grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt
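If test.txt is a UTF-16 file this won't work, but it does work if test.txt is ASCII. I first concluded that grep was converting my query to ASCII, but the more likely culprit is the shell: a command substitution cannot carry NUL bytes, so they are dropped and grep receives the plain ASCII query.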
EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:
hexdump -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'`
How does it work? Well it converts your file to hex (without any extra formatting that hexdump usually applies). It pipes that into grep. Grep is using a query that is constructed by echoing your query (without a newline) into iconv which converts it to utf-16. This is then piped into sed to remove the BOM (the first two bytes of a utf-16 file used to determine endianness). This is then piped into hexdump so that the query and the input are the same.
Unfortunately, this will end up printing out the ENTIRE file if there is a single match, because the hex dump is one long line with no newlines in it. It also won't work if the UTF-16 in your binary file is stored in a different endianness than your machine's.
EDIT2: Got it!!!!
grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt
This searches for the hex version of the string Test (in UTF-16) in the file test.txt.
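A slightly more defensive variant of the same idea is sketched below. It is not the command above: it asks iconv for UTF-16LE directly, so no BOM is written and no sed step is needed to strip it, and it uses od rather than hexdump to build the \xNN escapes. The helper name utf16le_pattern is hypothetical, the file is assumed to be little-endian, and grep is assumed to be built with -P support. LC_ALL=C and -a are added on the assumption that the file should be matched byte by byte and treated as text.

```shell
#!/bin/sh
# utf16le_pattern TEXT (hypothetical helper): print TEXT as a run of \xNN
# escapes for its UTF-16LE bytes. Only hex escapes are emitted, so regex
# metacharacters in TEXT end up being matched literally.
utf16le_pattern() {
    printf '%s' "$1" |
        iconv -f UTF-8 -t UTF-16LE |   # UTF-16LE output carries no BOM
        od -An -v -tx1 |               # plain hex dump, one byte per column
        tr -d ' \n' |                  # squeeze it into one run of hex digits
        sed 's/../\\x&/g'              # prefix every byte with \x
}

# Throwaway UTF-16LE sample standing in for test.txt.
sample=$(mktemp)
printf 'A Test line\nanother line\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"

# -a treats the NUL-laden file as text; -c prints the number of matching lines.
LC_ALL=C grep -a -c -P "$(utf16le_pattern Test)" "$sample"
rm -f "$sample"
```

Using "$(...)" with double quotes instead of backticks keeps the backslashes intact without the extra sed escaping pass. For a big-endian file, UTF-16BE would be the encoding to try instead.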
grep unicode 16 support
iconv -f UTF-16 -t UTF-8 yourfile | grep xxx
Why doesn't grep command work on text files with UTF-16 LE encoding?
grep is not encoding aware. It doesn't search for "characters", it searches for bytes. Your console is sending UTF-8/ASCII encoded text (same in this case for the string "^This") to grep to search for. If the file contains UTF-16 encoded text, that won't match, since the byte representations are different.
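To see the mismatch concretely, one can dump the bytes of the same string in both encodings. This is only an illustrative sketch: the helper name hex_bytes is made up, and the string This stands in for the search term from the question.

```shell
#!/bin/sh
# hex_bytes ENCODING TEXT (hypothetical helper): print the bytes of TEXT,
# re-encoded from UTF-8 into ENCODING, as one run of hex digits.
hex_bytes() {
    printf '%s' "$2" | iconv -f UTF-8 -t "$1" | od -An -v -tx1 | tr -d ' \n'
    echo
}

hex_bytes UTF-8 This      # what the console hands to grep
hex_bytes UTF-16LE This   # what a UTF-16LE file actually contains
```

The first call prints 54686973 and the second 5400680069007300: every ASCII letter is followed by a 00 byte in UTF-16LE, so the four-byte UTF-8 pattern never appears contiguously in the file.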
How to use Ruby's readlines.grep for UTF-16 files?
While the answer by Viktor is technically correct, recoding the whole file from UTF-16LE into UTF-8 is unnecessary and might hurt performance. All you actually need is to build the regexp in the same encoding:
puts File.open(
"utf-16.txt", mode: "rb:BOM|UTF-16LE"
).readlines.grep(
Regexp.new "foo".encode(Encoding::UTF_16LE)
)
#⇒ foo
How do I grep for all non-ASCII characters?
You can use the command:
grep --color='auto' -P -n "[\x80-\xFF]" file.xml
This will give you the line number, and will highlight non-ascii chars in red.
On some systems, depending on your settings, the above will not work, so you can grep by the inverse:
grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
Note also that the important bit is the -P flag, which equates to --perl-regexp: it makes grep interpret your pattern as a Perl regular expression. The man page also says that this is highly experimental and grep -P may warn of unimplemented features.
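One possible reason the first pattern fails on some systems (an assumption, since the answer does not say) is the locale: in a UTF-8 locale, grep -P may read \x80-\xFF as code points rather than raw bytes. The sketch below sidesteps that by forcing the C locale, so the inverse pattern is applied byte by byte. The helper name find_non_ascii and the sample file are hypothetical, and grep is assumed to be built with -P support.

```shell
#!/bin/sh
# find_non_ascii FILE (hypothetical helper): print, with line numbers, every
# line of FILE that contains at least one byte outside 0x00-0x7F.
find_non_ascii() {
    # LC_ALL=C makes grep compare raw bytes; -a prints matching lines even
    # if grep would otherwise classify the file as binary.
    LC_ALL=C grep -a -n -P '[^\x00-\x7F]' -- "$1"
}

sample=$(mktemp)
printf 'plain ascii\ncaf\303\251 au lait\n' > "$sample"   # \303\251 is UTF-8 for "é"
find_non_ascii "$sample"
rm -f "$sample"
```

The --color option from the commands above can be added back to the grep call if highlighting is wanted.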
Free program to grep unicode text files in Windows?
Just ran across grepWin, which works perfectly for what I want here. Wish I had found it earlier!