How to use Ruby's readlines.grep for UTF-16 files?
While Viktor's answer is technically correct, recoding the whole file from UTF-16LE into UTF-8 is unnecessary and might hurt performance. All you actually need is to build the regexp in the same encoding:
puts File.open("utf-16.txt", mode: "rb:BOM|UTF-16LE").readlines.grep(
  Regexp.new("foo".encode(Encoding::UTF_16LE))
)
#⇒ foo
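To see why the encodings must match, here is a self-contained sketch of the same approach (the file name and contents are invented for the example). A plain UTF-8 regexp like /foo/ against UTF-16LE lines would raise Encoding::CompatibilityError; an encoding-matched regexp greps fine:

```ruby
# Write a small UTF-16LE sample file, then grep it without transcoding
# the whole file — only the pattern is encoded to match.
File.write("utf-16.txt", "foo\nbar\n".encode(Encoding::UTF_16LE), mode: "wb")

pattern = Regexp.new("foo".encode(Encoding::UTF_16LE))
matches = File.open("utf-16.txt", mode: "rb:BOM|UTF-16LE") do |f|
  f.readlines.grep(pattern)
end

# The matched lines come back in UTF-16LE; transcode them for display.
p matches.map { |line| line.encode(Encoding::UTF_8).chomp }  # ["foo"]

File.delete("utf-16.txt")
```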
grepping binary files and UTF16
The easiest way is to just convert the text file to utf-8 and pipe that to grep:
iconv -f utf-16 -t utf-8 file.txt | grep query
I tried the opposite (converting my query to utf-16), but grep doesn't seem to like that. I think it might have to do with endianness, but I'm not sure.
It seems as though grep converts a utf-16 query to utf-8/ascii. Here is what I tried:
grep `echo -n query | iconv -f utf-8 -t utf-16 | sed 's/..//'` test.txt
If test.txt is a utf-16 file this won't work, but it does work if test.txt is ascii. I can only conclude that grep is converting my query to ascii.
EDIT: Here's a really really crazy one that kind of works but doesn't give you very much useful info:
hexdump -e '/1 "%02x"' test.txt | grep -P `echo -n Test | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "%02x"'`
How does it work? It converts your file to hex (without the extra formatting that hexdump usually applies) and pipes that into grep. The grep query is constructed by echoing your query (without a newline) into iconv, which converts it to utf-16. That is piped into sed to remove the BOM (the first two bytes of a utf-16 file, used to determine endianness), and then into hexdump so that the query and the input are in the same form.
Unfortunately I think this will end up printing out the ENTIRE file if there is a single match. Also this won't work if the utf-16 in your binary file is stored in a different endianness than your machine.
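For comparison, the same byte-level idea is straightforward in Ruby, where String#encode to UTF-16LE emits no BOM, so there is nothing to strip (the file name and contents are invented for this sketch):

```ruby
# Byte-level search in a UTF-16LE file, mirroring the hexdump trick above.
# String#encode to Encoding::UTF_16LE adds no BOM, so no sed-style stripping
# of leading bytes is needed.
File.binwrite("test.txt", "Some Test here\n".encode(Encoding::UTF_16LE))

query = "Test".encode(Encoding::UTF_16LE).b   # query as raw UTF-16LE bytes
found = File.binread("test.txt").include?(query)
p found  # true

File.delete("test.txt")
```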
EDIT2: Got it!!!!
grep -P `echo -n "Test" | iconv -f utf-8 -t utf-16 | sed 's/..//' | hexdump -e '/1 "x%02x"' | sed 's/x/\\\\x/g'` test.txt
This searches for the hex version of the string Test (in utf-16) in the file test.txt.
grep unicode 16 support
iconv -f UTF-16 -t UTF-8 yourfile | grep xxx
Unicode null symbol in text parsed from file leading to failing equality checks
If I read your code correctly, your log file is encoded in utf-16 rather than utf-8, so you could open it accordingly and let Ruby do the conversion on the fly. Example:
>> f = File.open("iso-8859-1.txt", "r:iso-8859-1:utf-8")
=> #<File:iso-8859-1.txt>
>> f.external_encoding.name
=> "ISO-8859-1"
>> content = f.read
=> "This file contains umlauts: äöü"
>> content.encoding.name
=> "UTF-8"
http://nuclearsquid.com/writings/ruby-1-9-encodings/
Ruby Invalid Byte Sequence in UTF-8
The combination of using @file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil) and the magic comment # encoding: UTF-8 solved the issue.
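As a minimal sketch of what that line does (the sample bytes are invented, and this assumes the input really is ISO-8859-1):

```ruby
# "café" as ISO-8859-1 bytes: \xE9 is not valid UTF-8, so treating these
# bytes as UTF-8 triggers "invalid byte sequence in UTF-8". Relabeling the
# bytes as ISO-8859-1 and then transcoding yields valid UTF-8.
bytes = "caf\xE9".b
utf8  = bytes.force_encoding("ISO-8859-1").encode("UTF-8")

p utf8                  # "café"
p utf8.valid_encoding?  # true
```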