Binary String Literals in Ruby 2.0

The solution is to change the definition of the string literal to enforce its encoding. There are a few possible options to do this:

Use Array#pack (all versions of Ruby):

expected = ["d19b86"].pack('H*')

Use String#b (Ruby >= 2.0 only):

expected = "\xD1\x9B\x86".b

Use String#force_encoding (Ruby >= 1.9 only):

expected = "\xD1\x9B\x86".force_encoding("ASCII-8BIT")
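As a quick check, the three options above can be compared side by side. This is only a minimal sketch: the variable names are illustrative, and .dup is used so that force_encoding does not modify the literal itself.

```ruby
# All three approaches should yield the same three bytes, tagged as ASCII-8BIT.
packed = ["d19b86"].pack('H*')
binary = "\xD1\x9B\x86".b
forced = "\xD1\x9B\x86".dup.force_encoding("ASCII-8BIT")

[packed, binary, forced].each do |s|
  puts "#{s.bytes.inspect} #{s.encoding}"
end

# Because the bytes and the encoding match, the strings compare equal.
puts packed == binary && binary == forced
```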

Using binary data (strings in utf-8) from external file

If your file contains the literal escaped string:

\u306b\u3064\u3044\u3066

Then you will need to unescape it after reading. Ruby does this for you with string literals, which is why the second case worked for you. Taken from the answer to "Is this the best way to unescape unicode escape sequences in Ruby?", you can use this:

file = "c:\\...\\vlmList_unicode.txt" #\u306b\u3064\u3044\u3066
data = File.open(file, 'rb') { |io|
  contents = io.read.gsub(/\\u([\da-fA-F]{4})/) { |m|
    [$1].pack("H*").unpack("n*").pack("U*")
  }
  contents.split(/\t/)
}

Alternatively, if you would like to make it more readable, extract the substitution into a new method and add it to the String class:

class String
  def unescape_unicode
    self.gsub(/\\u([\da-fA-F]{4})/) { |m|
      [$1].pack("H*").unpack("n*").pack("U*")
    }
  end
end

Then you can call:

file = "c:\\...\\vlmList_unicode.txt" #\u306b\u3064\u3044\u3066
data = File.open(file, 'rb') { |io|
  io.read.unescape_unicode.split(/\t/)
}
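To try the substitution without a file on disk, here is a small sketch that applies the same regexp to an in-memory string. It uses a standalone helper (the name unescape_unicode_in is hypothetical) and the shorter [$1.hex].pack("U") in place of the pack/unpack chain; for a four-digit escape both should produce the same character.

```ruby
# Replace each \uXXXX escape with the character it names.
def unescape_unicode_in(text)
  text.gsub(/\\u([\da-fA-F]{4})/) { [$1.hex].pack("U") }
end

# Single quotes keep the backslashes literal, as they would be in the file;
# the tab is added separately so that split has something to split on.
escaped = '\u306b\u3064' + "\t" + '\u3044\u3066'
fields  = unescape_unicode_in(escaped).split(/\t/)
p fields
```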

Why is a UTF-8 string not equal to the equivalent ASCII-8BIT string in Ruby 2.0?

(1) why does it return false?

When comparing strings, they either have to be in the same encoding or their characters must be encodable in US-ASCII.

Comparison works as expected if the string only contains byte values 0 to 127: (0b0xxxxxxx)

a = 'E'.encode('ISO8859-1')  #=> "E"
b = 'E'.encode('ISO8859-15') #=> "E"

a.bytes #=> [69]
b.bytes #=> [69]
a == b #=> true

And fails if it contains any byte values 128 to 255: (0b1xxxxxxx)

a = 'É'.encode('ISO8859-1')  #=> "\xC9"
b = 'É'.encode('ISO8859-15') #=> "\xC9"

a.bytes #=> [201]
b.bytes #=> [201]
a == b #=> false

Your string can't be represented in US-ASCII, because both of its bytes are outside the US-ASCII range (0 to 127):

"\xFF\xFE".bytes #=> [255, 254]

Attempting to convert it doesn't produce any meaningful result:

"\xFF\xFE".encode('US-ASCII', 'ASCII-8BIT', :undef => :replace)
#=> "??"

The string will therefore return false when being compared to a string in another encoding, regardless of its content.
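The rule can be checked directly. The sketch below forces the encodings explicitly so that the result does not depend on the source file's encoding; the variable names are illustrative.

```ruby
utf8   = "\xFF\xFE".dup.force_encoding(Encoding::UTF_8)
binary = "\xFF\xFE".b

puts utf8.bytes == binary.bytes   # the bytes are identical
puts utf8 == binary               # but the encodings differ and high bytes are present
puts utf8.ascii_only?

# With only 7-bit bytes, the differing encodings no longer matter for ==.
ascii_utf8   = "BOM".dup.force_encoding(Encoding::UTF_8)
ascii_binary = "BOM".b
puts ascii_utf8 == ascii_binary
```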

(2) what is the best way to go about achieving what i want?

You could compare your string to a string with the same encoding. binread returns a string in ASCII-8BIT encoding, so you could use b to create a compatible one:

IO.binread('your_file', 2) == "\xFF\xFE".b

or you could compare its bytes:

IO.binread('your_file', 2).bytes == [0xFF, 0xFE]
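Here is a self-contained sketch of the first variant, using a temporary file so it can be run anywhere. The helper name starts_with_ff_fe? and the file contents are made up for the example.

```ruby
require 'tempfile'

# Hypothetical helper: true if the file starts with the bytes FF FE.
def starts_with_ff_fe?(path)
  IO.binread(path, 2) == "\xFF\xFE".b
end

Tempfile.create('bom_demo') do |f|
  f.binmode
  f.write("\xFF\xFEh\x00i\x00".b)   # FF FE followed by some sample bytes
  f.flush
  puts starts_with_ff_fe?(f.path)
end
```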

Division operator considered a regexp delimiter

First of all, run irb with warnings enabled to get a better understanding of what is going on (irrelevant warnings are omitted):

$ irb -w
irb:001> x = 0
=> 0
irb:002> x /2
irb:003/ /
(irb):2: warning: `/' after local variable or literal is interpreted as binary operator
(irb):2: warning: even though it seems like regexp literal
SyntaxError: (irb):3: unterminated regexp meets end of file

On line 2, the Ruby lexer detects that x is a local variable, so it assumes that the following / is a binary operator, not the beginning of a regexp. It then raises an error on line 3 because a / by itself is an incomplete regexp.

This happens because IRB uses its own lexer to decide whether the expression you entered is complete, and can therefore be sent to Ruby for execution, or whether you need to provide more input. IRB's lexer can't detect what x is, so it assumes that x is a method and tries to interpret the rest of the line (/2) as an argument to x. Since that is an unterminated regexp, IRB asks you to complete it on line 3. The code that IRB then sends to the Ruby parser is invalid, as explained above.

For comparison consider what happens when x is actually a method:

$ irb -w
irb:001> def x; end
=> :x
irb:002> x /2
irb:003/ /
(irb):2: warning: ambiguous first argument; put parentheses or even spaces
ArgumentError: wrong number of arguments (1 for 0)
from (irb):1:in `x'
from (irb):2

In this case both Ruby and IRB agree on the way the expression has to be parsed, and you get an error because you are trying to pass an argument (namely /2\n/) to the x method, which expects none.

To the point: is it a bug or not? Maybe it is a bug, or maybe it is just a compromise to keep IRB's lexer simple. I can't really tell.
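Whatever the verdict, the ambiguity is easy to avoid in practice. A small sketch, assuming you control the spacing: put spaces on both sides of / (or on neither side) for division, and use parentheses when a regexp really is the argument. The method name takes_pattern is made up for illustration.

```ruby
x = 10

half_a = x / 2   # spaces on both sides: division
half_b = x/2     # no spaces at all: division

def takes_pattern(pattern)
  pattern
end

arg = takes_pattern(/2/)   # parentheses make the regexp argument explicit

p half_a, half_b, arg
```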

Ruby: Can I write multi-line string with no concatenation?

There are pieces to this answer that helped me get what I needed (easy multi-line concatenation WITHOUT extra whitespace), but since none of the actual answers had it, I'm compiling them here:

str = 'this is a multi-line string'\
  ' using implicit concatenation'\
  ' to prevent spare \n\'s'

=> "this is a multi-line string using implicit concatenation to prevent spare \\n's"

As a bonus, here's a version using funny HEREDOC syntax (via this link):

p <<END_SQL.gsub(/\s+/, " ").strip
SELECT * FROM users
ORDER BY users.id DESC
END_SQL
# >> "SELECT * FROM users ORDER BY users.id DESC"

The latter would mostly be for situations that require more flexibility in the processing. I personally don't like it, because it puts the processing in a weird place w.r.t. the string (i.e., in front of it, but using instance methods that usually come afterward), but it's there. Note that if you are indenting the last END_SQL identifier (which is common, since this is probably inside a function or module), you will need to use the hyphenated syntax (that is, p <<-END_SQL instead of p <<END_SQL). Otherwise, the indenting whitespace causes the identifier to be interpreted as a continuation of the string.

This doesn't save much typing, but it looks nicer than using + signs, to me.

Also (I say in an edit, several years later), if you're using Ruby 2.3+, the operator <<~ is also available, which removes extra indentation from the final string. You should be able to remove the .gsub invocation, in that case (although it might depend on both the starting indentation and your final needs).
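For reference, a short sketch of the <<~ form, assuming Ruby 2.3 or newer. The method name is only for illustration.

```ruby
def latest_users_sql
  <<~END_SQL
    SELECT * FROM users
    ORDER BY users.id DESC
  END_SQL
end

# The leading indentation is gone, but the line breaks remain.
puts latest_users_sql

# Collapse the whitespace only if you need a single line.
p latest_users_sql.gsub(/\s+/, " ").strip
```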

EDIT: Adding one more:

p %{
SELECT * FROM users
ORDER BY users.id DESC
}.gsub(/\s+/, " ").strip
# >> "SELECT * FROM users ORDER BY users.id DESC"

Is { 'symbol name': some value } valid Ruby 2 syntax for Hashes?

{ :my_key => "my value" } 
{ my_key: "my value" }
{ :'my_key' => "my value" }
{ :"my_key" => "my value" }

None of the above lines uses 2.x-only syntax. They are all also valid 1.9 syntax. (See demonstration.)

{ "my_key": "my value" }
{ 'my_key': "my value" }

That's feature request #4276, which landed in 2.2. That means it's invalid syntax in 2.1 and older versions. It also means that an implementation that claims to implement 2.2 has to support it.
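A quick way to convince yourself that the quoted form still produces Symbol keys, not String keys, is the sketch below. It assumes Ruby 2.2 or newer, since it will not parse on older versions; the variable names are illustrative.

```ruby
rocket        = { :my_key => "my value" }
shorthand     = { my_key: "my value" }
quoted_rocket = { :"my_key" => "my value" }
quoted_colon  = { "my_key": "my value" }   # needs Ruby >= 2.2

p quoted_colon.keys.first.class
p [rocket, shorthand, quoted_rocket].all? { |h| h == quoted_colon }

# Quoting is mainly useful for keys that are not valid bare identifiers.
spaced = { "my key": 1 }
p spaced.keys.first
```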

Ruby Gem randomly returns Encoding Error

In the Query class there is this line:

@key = Array(key).pack('N')

This creates a String with an associated encoding of ASCII-8BIT (i.e. it’s a binary string).

Later @key gets used in this line:

query = @sock.send("\xFE\xFD\x00\x01\x02\x03\x04" + @key, 0)

In Ruby 2.0 the default encoding of String literals is UTF-8, so this is combining a UTF-8 string with a binary one.

When Ruby tries to do this it first checks to see if the binary string only contains 7-bit values (i.e. all bytes are less than or equal to 127, with the top bit being 0), and if it does it considers it compatible with UTF-8 and so combines them without further issue. If it doesn’t (i.e. if it contains bytes greater than 127), then the two strings are not compatible and an Encoding::CompatibilityError is raised.

Whether an error is raised depends on the contents of @key, which is initialized from a response from the server. Sometimes this value happens to contain only 7-bit values, so no error is raised. At other times there is a byte with the high bit set, so it generates an error. This is why the errors appear to be “random”.

To fix it you can specify that the string literal in the line where the two strings are combined should be treated as binary. The simplest way would be to use force_encoding like this:

query = @sock.send("\xFE\xFD\x00\x01\x02\x03\x04".force_encoding(Encoding::ASCII_8BIT) + @key, 0)
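The behaviour can be reproduced without a socket. In this sketch the two key values are invented to stand in for server responses, and the prefix is forced to UTF-8 explicitly so that it mirrors a literal in a UTF-8 source file.

```ruby
# Force UTF-8 explicitly to mirror a string literal in a UTF-8 source file.
prefix = "\xFE\xFD\x00\x01\x02\x03\x04".dup.force_encoding(Encoding::UTF_8)

low_key  = [0x01020304].pack('N')   # every byte is <= 127
high_key = [0x81020304].pack('N')   # first byte has the high bit set

ok = prefix + low_key               # combines without complaint
puts ok.bytesize

begin
  prefix + high_key
rescue Encoding::CompatibilityError => e
  puts "#{e.class}: #{e.message}"
end

fixed = prefix.b + high_key         # both operands are binary now
p fixed.encoding
```

String#b (Ruby >= 2.0) is used here as a shorter equivalent of force_encoding on a copy.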

