Compressing Large String in Ruby

Are there native Ruby methods for compressing/encrypting strings?

From http://ruby-doc.org/stdlib/libdoc/zlib/rdoc/classes/Zlib.html

require 'zlib'

# aka compress
def deflate(string, level = Zlib::DEFAULT_COMPRESSION)
  z = Zlib::Deflate.new(level)
  dst = z.deflate(string, Zlib::FINISH)
  z.close
  dst
end

# aka decompress
def inflate(string)
  zstream = Zlib::Inflate.new
  buf = zstream.inflate(string)
  zstream.finish
  zstream.close
  buf
end

Encryption from http://snippets.dzone.com/posts/show/991

require 'openssl'
require 'digest'

c = OpenSSL::Cipher.new("aes-256-cbc")  # OpenSSL::Cipher::Cipher is deprecated
c.encrypt
# your pass is what is used to encrypt/decrypt;
# AES-256 needs an exact 32-byte key, which SHA-256 provides
c.key = key = Digest::SHA256.digest("yourpass")
c.iv = iv = c.random_iv
e = c.update("crypt this")
e << c.final
puts "encrypted: #{e}\n"

c = OpenSSL::Cipher.new("aes-256-cbc")
c.decrypt
c.key = key
c.iv = iv
d = c.update(e)
d << c.final
puts "decrypted: #{d}\n"
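Note that deriving the key from a bare password hash is weak by modern standards. A sketch using PBKDF2 via the stdlib's OpenSSL::PKCS5, with a random salt; the iteration count is illustrative:

```ruby
require 'openssl'

pass = 'yourpass'
salt = OpenSSL::Random.random_bytes(16)  # store alongside the ciphertext
key  = OpenSSL::PKCS5.pbkdf2_hmac(pass, salt, 20_000, 32, OpenSSL::Digest.new('SHA256'))

cipher = OpenSSL::Cipher.new('aes-256-cbc')
cipher.encrypt
cipher.key = key
iv = cipher.random_iv                    # store alongside the ciphertext
ciphertext = cipher.update('crypt this') + cipher.final

decipher = OpenSSL::Cipher.new('aes-256-cbc')
decipher.decrypt
decipher.key = key
decipher.iv = iv
plaintext = decipher.update(ciphertext) + decipher.final
```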

String compression with differencing

You can do this with zlib as well. Use the deflateSetDictionary() function to provide stringA as the dictionary when compressing stringB. On the other end you already have stringA when decompressing stringB, so use inflateSetDictionary() with stringA before decompressing stringB.

zlib will then find parts of stringB that match stringA and point to those parts in stringA.

You can do better still by providing stringA and stringB concatenated as the dictionary when compressing stringC. And so on. The dictionary can be up to 32K bytes.
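In Ruby this maps to Zlib::Deflate#set_dictionary and Zlib::Inflate#set_dictionary. A round-trip sketch (the strings are placeholders); the inflater raises Zlib::NeedDict when the stream requires its dictionary:

```ruby
require 'zlib'

dictionary = 'the quick brown fox jumps over the lazy dog'  # shared stringA
message    = 'the quick brown fox says hello'               # stringB

deflater = Zlib::Deflate.new
deflater.set_dictionary(dictionary)
compressed = deflater.deflate(message, Zlib::FINISH)
deflater.close

inflater = Zlib::Inflate.new
begin
  restored = inflater.inflate(compressed)
rescue Zlib::NeedDict
  # the stream asks for its dictionary; supply it and continue
  inflater.set_dictionary(dictionary)
  restored = inflater.inflate('')
end
inflater.close
```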

Compress a bitstring using Ruby

Try:

FORMAT = '%0.*b'

bitmask = "0001010010010010010001001"
bitmask.to_i(2) # => 2696329
hexval = bitmask.to_i(2).to_s(16) # => "292489"
FORMAT % [bitmask.size, hexval.to_i(16)] # => "0001010010010010010001001"

What it's doing is:

  • to_i(2) converts the string from binary to its integer value, just to show what's happening.
  • to_i(2).to_s(16) converts that integer to its hexadecimal representation as a string.
  • FORMAT is a printf format string saying to convert the passed-in value to its binary string representation (%b), zero-padded (the 0 flag) to a width (.*) taken from the first parameter passed (bitmask.size).

Here's another example using a longer bitmask:

bitmask = "11011110101011011011111011101111"

hexval = bitmask.to_i(2).to_s(16) # => "deadbeef"
FORMAT % [bitmask.size, hexval.to_i(16)] # => "11011110101011011011111011101111"

And longer still:

bitmask = "1101111010101101101111101110111111111110111011011010110111011010"

hexval = bitmask.to_i(2).to_s(16) # => "deadbeeffeedadda"
FORMAT % [bitmask.size, hexval.to_i(16)] # => "1101111010101101101111101110111111111110111011011010110111011010"
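To actually store the bitmask compactly rather than display it, Array#pack can turn the bit string straight into raw bytes:

```ruby
bitmask = "11011110101011011011111011101111"

packed   = [bitmask].pack('B*')                    # 32 bits -> 4 raw bytes
restored = packed.unpack1('B*')[0, bitmask.size]   # trim any padding bits
```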

Compress large file in ruby with Zlib for gzip

You can use IO#read to read a chunk of arbitrary length from the file.

require 'zlib'

Zlib::GzipWriter.open('compressed_file.gz') do |gz|
  File.open(large_data_file, 'rb') do |fp|
    while chunk = fp.read(16 * 1024)
      gz.write(chunk)
    end
  end
end

This reads the source file in 16 KB chunks and appends each compressed chunk to the output stream. (The block form of GzipWriter.open closes the stream for you, so no explicit gz.close is needed.) Adjust the block size to your preference based on your environment.
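Reading the archive back is symmetrical: Zlib::GzipReader#read also accepts a length. A round-trip sketch with placeholder file names:

```ruby
require 'zlib'

# Placeholder data standing in for a large source file.
File.binwrite('large_data_file.txt', "sample line\n" * 10_000)

Zlib::GzipWriter.open('round_trip.gz') do |gz|
  File.open('large_data_file.txt', 'rb') do |fp|
    while chunk = fp.read(16 * 1024)
      gz.write(chunk)
    end
  end
end

# Decompress in 16 KB chunks as well.
restored = +''
Zlib::GzipReader.open('round_trip.gz') do |gz|
  while chunk = gz.read(16 * 1024)
    restored << chunk
  end
end
```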

How can you reversibly compress a bit of text into fewer ASCII characters?

If you know that only ASCII characters will be used, each character needs only the 7 low-order bits of its byte. With bit manipulation you can pack every 8 characters into 7 bytes (a 12.5% saving). If you can restrict the input to a smaller range (only 64 valid characters), you can drop another byte per group of eight.

However, because you want the compressed form to ALSO contain only ASCII characters, you can only use 7 of the 8 bits in each output byte, which takes you back to square one, unless your input can be restricted to 64 characters (e.g. lossy compression substituting some characters with others, storing only lower case, etc.).
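A sketch of the 7-bit packing idea (pack7/unpack7 are hypothetical helper names):

```ruby
# Pack 7-bit ASCII: every 8 characters fit in 7 bytes.
def pack7(str)
  bits = str.bytes.map { |b| format('%07b', b) }.join
  [bits].pack('B*')
end

# The original length is needed to discard the padding bits.
def unpack7(data, length)
  data.unpack1('B*')[0, length * 7].scan(/.{7}/).map { |b| b.to_i(2).chr }.join
end

packed = pack7('hello world')  # 11 chars -> 10 bytes
```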

If your strings are small (under ~1 KB), there are minimal savings to be had from gzip/bzip2 etc. because of the size of their headers. If you have a predefined dictionary to use as a Huffman table you may get some compression, but in other cases you can get bloat relative to the original text.

Prior discussion on SO: An efficient compression algorithm for short text strings

Compress an Integer to a Base-64 upper/lower case character string in Ruby (to make an Encoded Short URL)

As others here have said, Ruby's Base64 encoding (for binary data) is not the same as converting an integer to a string using a base of 64. Ruby provides an elegant converter for the latter (Integer#to_s / String#to_i), but the maximum base is base-36 (see @jad's answer).

Below brings together everything into two methods for encoding/decoding as base-64.

def encode(int)
  chars = [*'A'..'Z', *'a'..'z', *'0'..'9', '_', '!']
  digits = int.digits(64).reverse
  digits.map { |i| chars[i] }.join
end

And to decode

def decode(str)
  chars = [*'A'..'Z', *'a'..'z', *'0'..'9', '_', '!']
  digits = str.chars.map { |char| chars.index(char) }.reverse
  digits.each_with_index.sum { |value, index| value * (64 ** index) }
end

Give them a try:

puts output = encode(123456) #=> "eJA"
puts decode(output) #=> 123456

The compression is pretty good: an integer around 99 million (99,999,999) encodes down to just 5 characters.

To gain the extra compression from including both upper and lower case characters, base-64 is inherently case-sensitive. If you want this to be case-insensitive, the built-in base-36 method per Jad's answer is the way to go.

Credit to @stefan for help with this.

Append string to an existing gzipfile in Ruby

It's not clear what you are looking for. If you are trying to join multiple files into one gzip archive, you can't get there. Per the gzip documentation:

Can gzip compress several files into a single archive?

Not directly. You can first create a tar file then compress it:
for GNU tar: gtar cvzf file.tar.gz filenames
for any tar: tar cvf - filenames | gzip > file.tar.gz

Alternatively, you can use zip, PowerArchiver 6.1, 7-zip or Winzip. The zip format allows random access to any file in the archive, but the tar.gz format usually gives a better compression ratio.

With the number of times you will be adding to the archive, it makes more sense to expand the source, append the string to the single uncompressed file, then recompress on demand or on a schedule.

You will have a large file but the compression time would be fast.
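A sketch of that approach, with hypothetical file names: keep the data uncompressed for cheap appends and gzip the whole file only when a compressed copy is needed:

```ruby
require 'zlib'

# Cheap append to the uncompressed file.
File.open('log.txt', 'a') { |f| f.puts 'new entry' }

# Compress the whole file on demand.
Zlib::GzipWriter.open('log.gz') do |gz|
  File.open('log.txt', 'rb') do |fp|
    while chunk = fp.read(16 * 1024)
      gz.write(chunk)
    end
  end
end
```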


If you want to accumulate data, not separate files, in a gzip file without expanding it all, it is possible from Ruby to append to an existing gzip file; however, you have to specify the "a" ("append") mode when opening your original .gz file. Failing to do that causes your original to be overwritten:

require 'zlib'

File.open('main.gz', 'a') do |main_gz_io|
  Zlib::GzipWriter.wrap(main_gz_io) do |main_gz|
    5.times do
      print '.'
      main_gz.puts Time.now.to_s
      sleep 1
    end
  end
end

puts 'done'
puts 'viewing output:'
puts '---------------'
puts `gunzip -c main.gz`

Which, when run, outputs:

.....done
viewing output:
---------------
2013-04-10 12:06:34 -0700
2013-04-10 12:06:35 -0700
2013-04-10 12:06:36 -0700
2013-04-10 12:06:37 -0700
2013-04-10 12:06:38 -0700

Run that several times and you'll see the output grow.

Whether this code is fast enough for your needs is hard to say. This example artificially drags its feet to write once a second.


