How to Use Ruby's readlines.grep for UTF-16 Files

How to use Ruby's readlines.grep for UTF-16 files?

While the answer by Viktor is technically correct, recoding the whole file from UTF-16LE into UTF-8 is unnecessary and may hurt performance. All you actually need is to build the regexp in the same encoding:

puts File.open(
  "utf-16.txt", mode: "rb:BOM|UTF-16LE"
).readlines.grep(
  Regexp.new("foo".encode(Encoding::UTF_16LE))
)
#⇒ foo
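To see why the encoding of the regexp matters, here is a minimal, file-free check (my own illustration, not part of the original answer): a plain UTF-8/US-ASCII regexp cannot be matched against UTF-16LE strings at all.

```ruby
# A regexp in the default source encoding raises when matched against a
# UTF-16LE string; a regexp built from a UTF-16LE string matches fine.
line = "foo".encode(Encoding::UTF_16LE)

begin
  /foo/.match?(line)
rescue Encoding::CompatibilityError => e
  puts e.class  # Encoding::CompatibilityError
end

re = Regexp.new("foo".encode(Encoding::UTF_16LE))
p re.match?(line)  # true
```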

grepping binary files and UTF16

The easiest way is to just convert the text file to UTF-8 and pipe that into grep:

iconv -f utf-16 -t utf-8 file.txt | grep query
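For completeness, the same pipeline can be sketched in Ruby, matching the approach from the first answer above (the file name and contents here are made up for the demo):

```ruby
require "tempfile"

# Write a sample UTF-16LE file (with BOM), then transcode on read and grep,
# i.e. the Ruby equivalent of `iconv -f utf-16 -t utf-8 file.txt | grep foo`.
Tempfile.create("utf16") do |tmp|
  tmp.binmode
  tmp.write "\uFEFFfoo\nbar\n".encode(Encoding::UTF_16LE)
  tmp.close

  matches = File.open(tmp.path, "rb:BOM|UTF-16LE:UTF-8") do |f|
    f.readlines.grep(/foo/)
  end
  puts matches  # foo
end
```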

I tried to do the opposite (convert my query to UTF-16), but grep doesn't seem to accept that. I think it might have to do with endianness, but I'm not sure.

It seems as though grep converts a UTF-16 query to UTF-8/ASCII. Here is what I tried:

grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt

If test.txt is a UTF-16 file this won't work, but it does work if test.txt is ASCII. I can only conclude that grep is converting my query to ASCII.

EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:

hexdump -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'`

How does it work? It converts your file to hex (without the extra formatting that hexdump usually applies) and pipes that into grep. The query is constructed by echoing your search string (without a trailing newline) into iconv, which converts it to UTF-16. That is piped into sed to remove the BOM (the first two bytes of a UTF-16 file, used to determine endianness), and then into hexdump so that the query and the input are in the same form.

Unfortunately, I think this will end up printing the ENTIRE file if there is a single match. Also, this won't work if the UTF-16 in your binary file is stored with a different endianness than your machine's.
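The same byte-level idea can be sketched in Ruby without the hexdump detour (names here are illustrative): encode the query to UTF-16LE and search the raw bytes directly. Note this shares the caveat above: it only matches files of the same endianness.

```ruby
# `data` stands in for File.binread("test.txt") — the raw bytes of a
# UTF-16LE file, BOM included. We search for the query's UTF-16LE bytes.
data  = "\uFEFFSome Test line\n".encode(Encoding::UTF_16LE).b
query = "Test".encode(Encoding::UTF_16LE).b
p data.include?(query)  # true
```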

EDIT2: Got it!!!!

grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt

This searches test.txt for the hex form of the string Test (in UTF-16).

grep unicode 16 support

iconv -f UTF-16 -t UTF-8 yourfile | grep xxx

Unicode null symbol in text parsed from file leading to failing equality checks

If I read your code correctly, your log file is encoded in UTF-16 rather than UTF-8, so you could open it accordingly and let Ruby do the conversion on the fly. Example:

>> f = File.open("iso-8859-1.txt", "r:iso-8859-1:utf-8")
=> #<File:iso-8859-1.txt>
>> f.external_encoding.name
=> "ISO-8859-1"
>> content = f.read
=> "This file contains umlauts: äöü"
>> content.encoding.name
=> "UTF-8"

http://nuclearsquid.com/writings/ruby-1-9-encodings/
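Assuming the log file is UTF-16LE with a BOM (the file name and contents below are placeholders for the demo), the same on-the-fly conversion would look like this; once the content is UTF-8, the equality checks from the question work again:

```ruby
# Write a sample UTF-16LE "log" with a BOM, then let Ruby strip the BOM
# and transcode to UTF-8 on read.
File.binwrite("log.txt", "\uFEFFstatus: ok\n".encode(Encoding::UTF_16LE))

content = File.open("log.txt", "rb:BOM|UTF-16LE:UTF-8", &:read)
p content.encoding.name       # "UTF-8"
p content == "status: ok\n"   # true — no stray NUL bytes left in the string
```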

Ruby Invalid Byte Sequence in UTF-8

The combination of using @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil) and the # encoding: UTF-8 magic comment solved the issue.
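A minimal sketch of that fix, using an in-memory byte string in place of IO.read(file):

```ruby
# "\xE9" is "é" in ISO-8859-1, but an invalid byte sequence in UTF-8.
# Tagging the bytes with their real encoding, then transcoding, repairs it.
raw  = "caf\xE9".b
utf8 = raw.force_encoding("ISO-8859-1").encode("UTF-8")
p utf8  # "café"
```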


