How Does String.Unpack Work in Ruby

Ruby's pack and unpack explained

We were working on a similar problem this morning. If the array size is unknown, you can use:

ary = ["61", "62", "63"]
ary.pack('H2' * ary.size)
=> "abc"

You can reverse it using:

str = "abc"
str.unpack('H2' * str.bytesize)
=> ["61", "62", "63"]
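
If the goal is only to move between a string and its hex digits, a single 'H*' directive avoids building the format string by hand. This is a minimal sketch; the helper names to_hex and from_hex are made up for illustration and are not part of Ruby:

```ruby
# Hypothetical helpers; the names are illustrative only.
def to_hex(str)
  str.unpack('H*').first   # one hex string covering every byte
end

def from_hex(hex)
  [hex].pack('H*')         # pack expects an array, even for a single item
end

puts to_hex('abc')         # 616263
puts from_hex('616263')    # abc

# One two-digit string per byte, as in the answer above
p 'abc'.unpack('H2' * 'abc'.bytesize)   # ["61", "62", "63"]
```

Using bytesize rather than size keeps the count right for strings that contain multibyte characters.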

How does string.unpack work in Ruby?

Not 16: it is showing 1 and then 6. h gives the hex value of each nibble, low nibble first. The byte for 'a' is 0110 0001 in binary, so with h you get 0001 (1), then 0110 (6). Use H, which takes the high nibble first, and you get 61, which is hex for 97, the value of 'a'.
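
The nibble order is easy to check on a single byte. A small sketch using only the 'B', 'H' and 'h' directives:

```ruby
# 'a' is byte 97 = 0x61 = 0b0110_0001
p 'a'.unpack('B8')   # ["01100001"]  bits, most significant first
p 'a'.unpack('H2')   # ["61"]        high nibble first
p 'a'.unpack('h2')   # ["16"]        low nibble first
p '61'.hex           # 97
```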

How Ruby string unpack works for extracting 16, 32 and 64 signed/unsigned values?

It's basically the bytes of the string read as the digits of a base-256 number, converted to an Integer.

▶ '12'.unpack('S*') == [50 * 256 + 49]
#⇒ true

▶ '1234'.bytes
#⇒ [49, 50, 51, 52]
▶ '1234'.unpack('L*') ==
▷    [52 * 256 * 256 * 256 + 51 * 256 * 256 + 50 * 256 + 49]
#⇒ true

Please note that S, L and Q use the platform's native byte order, so the results above assume a little-endian machine. To ensure little-endian, one might use the < modifier (credits to @NeilSlater):

▶ '12'.unpack('S<*') == [50 * 256 + 49]
#⇒ true

For big-endian, one might use the > modifier:

▶ '12'.unpack('S>*') == [49 * 256 + 50]
#⇒ true
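
The base-256 reading can also be written out by hand and compared with unpack. This is only a sketch; big_endian_value and little_endian_value are hypothetical helper names:

```ruby
# Hypothetical helper: treat the bytes as base-256 digits, most significant first.
def big_endian_value(str)
  str.bytes.inject(0) { |acc, byte| acc * 256 + byte }
end

# Same computation with the byte order reversed, least significant byte first.
def little_endian_value(str)
  str.bytes.reverse.inject(0) { |acc, byte| acc * 256 + byte }
end

p little_endian_value('12')     == '12'.unpack('S<').first        # true
p little_endian_value('1234')   == '1234'.unpack('L<').first      # true
p big_endian_value('12')        == '12'.unpack('S>').first        # true
p big_endian_value('12345678')  == '12345678'.unpack('Q>').first  # true
```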

How does pack() and unpack() work in Ruby

You are asking a question about the fundamental principles of how computers store numbers in memory. For example you can look at these to learn more:

http://en.wikipedia.org/wiki/Computer_number_format#Binary_Number_Representation

http://en.wikipedia.org/wiki/Signed_number_representations

As an example take the difference between S and s; both are used for packing and unpacking 16-bit numbers, but one is for signed integers and the other for unsigned. This has significant meaning when you want to unpack the string back into the original integers.

S: 16-bit unsigned means numbers 0 - 65535 (0 to (2^16-1))

s: 16-bit signed integer numbers -32768 - 32767 (-(2^15) to (2^15-1)) (one bit used for sign)

The difference can be seen here:

# S = unsigned: you cannot pack/unpack negative numbers
> [-1, 65535, 32767, 32768].pack('SSSS').unpack('SSSS')
=> [65535, 65535, 32767, 32768]   

# s = signed: you cannot pack/unpack numbers outside range -32768 - 32767
> [-1, 65535, 32767, 32768].pack('ssss').unpack('ssss')
=> [-1, -1, 32767, -32768]

So you see you have to know how numbers are represented in computer memory in order to understand your question. Signed numbers use one bit to represent the sign, while unsigned numbers do not need this extra bit, but you cannot represent negative numbers then.

These are the very basics of how numbers are represented in binary in computer memory.
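
The same reinterpretation can be done by hand, which shows why 65535 and -1 share one bit pattern. A sketch, assuming the usual two's complement representation; to_signed16 is a made-up helper name:

```ruby
# Hypothetical helper: reinterpret an unsigned 16-bit value as signed
# (two's complement), which is what unpacking with 's' instead of 'S' does.
def to_signed16(unsigned)
  unsigned >= 0x8000 ? unsigned - 0x10000 : unsigned
end

[65535, 32767, 32768].each do |n|
  via_pack = [n].pack('S').unpack('s').first
  puts "#{n} -> #{to_signed16(n)} (pack/unpack: #{via_pack})"
end
```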

One reason you need packing is to send numbers as a byte stream from one computer to another (for example over a network connection). You have to pack your integers into bytes before they can be sent over a stream. The other option is to send the numbers as strings; then you encode and decode them as strings on both ends instead of packing and unpacking.
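
As a rough illustration of the byte-stream case, here is a sketch of a length-prefixed message. The wire format (a 4-byte big-endian length followed by the payload) and the helper names are assumptions made up for this example:

```ruby
# Hypothetical wire format: 4-byte big-endian length ('N'), then the raw payload.
def encode_message(payload)
  [payload.bytesize, payload].pack('Na*')
end

def decode_message(bytes)
  length, rest = bytes.unpack('Na*')
  rest[0, length]               # keep only the announced number of bytes
end

wire = encode_message('hello')
p wire.bytes.first(4)           # [0, 0, 0, 5]
p decode_message(wire)          # "hello"
```

'N' always uses network (big-endian) byte order, so both ends agree on the length regardless of their native byte order.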

Or let's say you need to call a C function in a system library from Ruby. System libraries written in C operate on basic integers (int, uint, long, short, etc.) and C structures (struct). You will need to convert your Ruby integers into system integers or C structures before calling such functions. In those cases pack and unpack can be used to interface with them.
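
A sketch of the struct case follows. The C layout is hypothetical (it is not taken from any real library), and the two padding bytes are an assumption about how a compiler would align the second field:

```ruby
# Hypothetical C layout, assumed for illustration only:
#   struct record { uint16_t id; /* 2 padding bytes */ int32_t value; };
RECORD_FORMAT = 'Sxxl'   # native-endian uint16, 2 zero pad bytes, native int32

def pack_record(id, value)
  [id, value].pack(RECORD_FORMAT)
end

def unpack_record(bytes)
  bytes.unpack(RECORD_FORMAT)   # 'x' skips the padding when unpacking
end

raw = pack_record(7, -2)
p raw.bytesize            # 8
p unpack_record(raw)      # [7, -2]
```

Real structures should be checked against the actual header and platform, since padding and integer sizes vary.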

The additional directives deal with endianness, that is, the order in which the bytes of the packed value are laid out. See here for what endianness means and how it works:

http://en.wikipedia.org/wiki/Endianness

In simplified terms it just tells the packing method in which order the integers should be converted into bytes:

# Big endian
> [34567].pack('S>').bytes.map(&:to_i)
=> [135, 7]   
# 34567 = (135 * 2^8) + 7

# Little endian
> [34567].pack('S<').bytes.map(&:to_i)
=> [7, 135]   
# 34567 = 7 + (135 * 2^8)
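
For 16-bit values the same two orders are also available as the fixed-order directives 'n' (big-endian) and 'v' (little-endian). The sketch below compares them and checks which order plain 'S' uses on the current machine; native_byte_order is a made-up helper name:

```ruby
# 'n' and 'v' are fixed-order 16-bit directives (big and little endian).
p [34567].pack('n').bytes   # [135, 7]
p [34567].pack('v').bytes   # [7, 135]

# Hypothetical helper: report the byte order that plain 'S' uses on this machine.
def native_byte_order
  [1].pack('S') == [1].pack('S<') ? :little : :big
end

p native_byte_order
```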

How the @,x,X directives work with Ruby pack()/unpack() method?

I will give you a few examples and will be learning together with you:

[1,2,3,4].pack("CCCC")
=> "\x01\x02\x03\x04"

So 'C' serializes each number as an unsigned char: every value goes into a new byte.

[1,2,3,4].pack("CCXCC")
=> "\x01\x03\x04"
[1,2,3,4].pack("CCXXC")
=> "\x03"

Think of 'X' as a backspace directive: it removes the previously packed byte.

[1,2,3,4].pack("CCxC")
=> "\x01\x02\x00\x03"
[1,2,3,4].pack("CCxxC")
=> "\x01\x02\x00\x00\x03"

'x' places a zero-valued byte.

[1,2,3,4].pack("CC@C")
=> "\x01\x03"
[1,2,3,4].pack("CC@@C")
=> "\x01\x03"
[1,2,3,4].pack("CC@@CC")
=> "\x01\x03\x04"
[1,2,3,4].pack("CC@CC")
=> "\x01\x03\x04"
[1,2,3,4].pack("CC@C@C")
=> "\x01\x04"
[1,2,3,4].pack("CC@C@@C")
=> "\x01\x04"

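At first '@' looks like a single backspace that cannot be stacked, which does not seem to match the text from the documentation:

@ Moves to absolute position

The outputs are consistent with it, though, if a bare '@' in pack is treated as '@1': the position is reset to offset 1, so everything after the first byte is dropped before the next value is written, and repeating '@' changes nothing.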
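
The "absolute position" wording from the documentation is easier to see when '@' is given an explicit count. A small sketch; results are shown as byte arrays to keep them readable:

```ruby
# With a count, '@' in pack moves to that absolute byte offset:
# it pads with zero bytes when moving forward and truncates when moving back.
p [1, 2, 3].pack('CC@4C').bytes      # [1, 2, 0, 0, 3]
p [1, 2, 3, 4].pack('CCC@1C').bytes  # [1, 4]
p [1, 2, 3].pack('CC@2C').bytes      # [1, 2, 3]  (already at offset 2)
```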

EDIT BTW @ seems a lot more logical when looked at in the context of unpack:

[1,2,3,4,5].pack("CCCCC").unpack("CCC@CCCCC@CC")
=> [1, 2, 3, 1, 2, 3, 4, 5, 1, 2]

Starts unpacking from the very beginning once more.

EDIT2 And here goes the explanation of the other two directives in the context of unpacking:

[1,2,3,4,5].pack("CCCCC").unpack("CCCXC")
=> [1, 2, 3, 3]
[1,2,3,4,5].pack("CCCCC").unpack("CCCXXC")
=> [1, 2, 3, 2]
[1,2,3,4,5].pack("CCCCC").unpack("CCCxC")
=> [1, 2, 3, 5]

So 'x' skips the next byte to be decoded, and 'X' steps back so that the previous byte is read once more. 'X' can stack.

Here goes my first attempt at summarizing the results:

pack:

'x' places a zero byte
'X' works like a backspace directive, meaning the previous byte is not actually going to be packed
'@' moves to an absolute position; a bare '@' behaves like '@1' in the examples above, so only the first byte is kept

unpack:

'x' skips the next byte that was due to be unpacked
'X' moves the reader backwards, meaning the last read byte will be read once more.
'@' with no count moves the reader back to the very beginning, so the bytes are unpacked once more from the start.

NOTE Reader is a word I made up for ease of explanation and is, by no means, formal.
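
The unpack side of the summary can be checked with explicit counts as well. A short sketch using the same five bytes as above:

```ruby
data = [1, 2, 3, 4, 5].pack('C*')

p data.unpack('C@3C')    # [1, 4]        jump to offset 3
p data.unpack('CC@0CC')  # [1, 2, 1, 2]  jump back to the start
p data.unpack('Cx2C')    # [1, 4]        skip two bytes
p data.unpack('CCCX2C')  # [1, 2, 3, 2]  step back two bytes
```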

EDIT3 Here goes the explanation of the "\x01" notation too:

a = [17, 22, 31]
=> [17, 22, 31]
a.pack("CCC")
=> "\x11\x16\x1F"

It seems that \x stands for hexadecimal representation: as can be seen, those are the hexadecimal representations of the given numbers (17 = 0x11, 22 = 0x16, 31 = 0x1F). The sites I have linked to apparently use decimal representation instead.
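
A quick way to confirm this is to convert in both directions. A minimal sketch:

```ruby
packed = [17, 22, 31].pack('CCC')
p packed.bytes           # [17, 22, 31]
p 17.to_s(16)            # "11"
p "\x11".ord             # 17

# Bytes that are printable ASCII are shown as characters instead of \x escapes.
p [65, 66, 67].pack('CCC')   # "ABC"
```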

Ruby - Unpack array with mixed types

You could read the file in small chunks of 19 bytes (7 bytes of text plus three 4-byte floats) and use 'A7fff' to pack and unpack. Do not use pointers to structure ('p' and 'P'), as they need more than 19 bytes to encode your information.
You could also use 'A6xfff' to ignore the 7th byte and get a string with 6 chars.

Here's an example, which is similar to the documentation of IO.read:

data = [["ABCDEF\t", 3.4, 5.6, 9.1], 
        ["FEDCBA\t", 2.5, 8.9, 3.1]]
binary_file = 'data.bin'
chunk_size = 19
pattern = 'A7fff'

File.open(binary_file, 'wb') do |o|
  data.each do |row|
    o.write row.pack(pattern)
  end
end

raise "Something went wrong. Please check data, pattern and chunk_size." unless File.size(binary_file) == data.length * chunk_size

File.open(binary_file, 'rb') do |f|
  while record = f.read(chunk_size)
    puts '%s %g %g %g' % record.unpack(pattern)
  end
end
# =>
#    ABCDEF   3.4 5.6 9.1
#    FEDCBA   2.5 8.9 3.1

You could use a multiple of 19 to speed up the process if your file is large.
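
One way the multiple-of-19 idea might look is sketched below. The batch size, the helper name each_record and the sample rows are assumptions for illustration, and StringIO stands in for the binary file so the example is self-contained:

```ruby
require 'stringio'

RECORD_SIZE = 19           # 7 bytes of text + three 4-byte floats
RECORD_PATTERN = 'A7fff'
RECORDS_PER_READ = 100     # arbitrary batch size chosen for illustration

# Hypothetical helper: read several records per call and split them afterwards.
def each_record(io)
  while (buffer = io.read(RECORD_SIZE * RECORDS_PER_READ))
    count = buffer.bytesize / RECORD_SIZE            # complete records in this batch
    fields = buffer.unpack(RECORD_PATTERN * count)
    fields.each_slice(4) { |row| yield row }
  end
end

rows = [['ABCDEF', 1.5, 2.25, 3.0], ['FEDCBA', 4.0, 0.5, 8.75]]
io = StringIO.new(rows.map { |row| row.pack(RECORD_PATTERN) }.join)
each_record(io) { |row| p row }
```

The sample floats are chosen to be exactly representable in single precision, so they print back unchanged; values such as 3.4 would come back slightly off unless formatted with something like %g.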

Ruby: Why does unpack('Q') give a different result than manual conversion?

Byte order matters:

 Integer('abcdefgh'.
           each_char.
           flat_map { |c| c.unpack('B*') }.
           reverse.
           join, 2)
 #⇒ 7523094288207667809
 'abcdefgh'.unpack('Q*').first
 #⇒ 7523094288207667809

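Your code produces the wrong result because unpack('Q') uses the native byte order, which is little-endian on most machines: after converting each byte to binary, the bytes must be reversed before they are joined.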
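
The two byte orders can be compared side by side using the explicit 'Q>' and 'Q<' directives, which do not depend on the machine. A sketch:

```ruby
str = 'abcdefgh'

# Bits of every byte, bytes kept in their original order: big-endian reading.
bits = str.unpack('B*').first
p Integer(bits, 2) == str.unpack('Q>').first            # true

# Bytes reversed before joining: little-endian reading.
reversed_bits = str.bytes.reverse.map { |b| format('%08b', b) }.join
p Integer(reversed_bits, 2) == str.unpack('Q<').first   # true
```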

For the last part of your question, the output of .unpack('Q') doesn't change with a longer input string because the format specifies a single 64-bit value, so any characters after the first 8 are ignored. If you specified a format of Q2 and a 16-character string, you'd decode 2 values:

> 'abcdefghihjklmno'.unpack('Q2')
=> [7523094288207667809, 8029475498074204265]

and again you'd find adding additional characters wouldn't change the result:

> 'abcdefghihjklmnofoofoo'.unpack('Q2')
=> [7523094288207667809, 8029475498074204265]

A format of Q* returns one value for every complete 64-bit (8-byte) group in the input; any leftover bytes are ignored:

> 'abcdefghihjklmnopqrstuvw'.unpack('Q*')
=> [7523094288207667809, 8029475498074204265, 8608196880778817904]
> 'abcdefghihjklmnopqrstuvwxyz'.unpack('Q*')
=> [7523094288207667809, 8029475498074204265, 8608196880778817904]
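
The number of values and the number of ignored bytes follow directly from the string length. A small sketch; q_star_summary is a made-up helper name:

```ruby
# Hypothetical helper: how many 64-bit values 'Q*' yields and how many
# trailing bytes are left over.
def q_star_summary(str)
  { values: str.unpack('Q*').size, ignored_bytes: str.bytesize % 8 }
end

a = q_star_summary('abcdefghihjklmnopqrstuvw')      # 24 bytes
b = q_star_summary('abcdefghihjklmnopqrstuvwxyz')   # 27 bytes
puts "#{a[:values]} values, #{a[:ignored_bytes]} bytes ignored"  # 3 values, 0 bytes ignored
puts "#{b[:values]} values, #{b[:ignored_bytes]} bytes ignored"  # 3 values, 3 bytes ignored
```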