How to use multiple threads for zlib compression (same input source)

You cannot simply concatenate raw deflate data streams. Each deflate stream is self-terminating, so decompression would stop at the end of the first stream.

You need to look more carefully at the pigz code for how to merge deflate streams. You can use Z_SYNC_FLUSH to complete the last block and bring the output to a byte boundary without ending the deflate stream. Alternatively, you can complete the deflate stream and then strip off the final empty block that is marked as the last block. The last deflate stream in the series should terminate normally. Then you can concatenate the series of n-1 unterminated deflate streams followed by the one terminated deflate stream.
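
A minimal sketch of the per-chunk side of that idea (not the pigz implementation), assuming raw deflate, the whole chunk passed in a single deflate() call, and an output buffer large enough to hold the result:

#include <zlib.h>
#include <string.h>

/* Compress one chunk into its own raw deflate stream. Non-final chunks end
   with Z_SYNC_FLUSH, which stops on a byte boundary without writing a
   stream-end marker; only the final chunk ends with Z_FINISH. The resulting
   pieces can then be concatenated in order. */
static int compress_chunk(const unsigned char *in, size_t in_len,
                          unsigned char *out, size_t out_cap,
                          size_t *out_len, int is_last)
{
    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     -15, 8, Z_DEFAULT_STRATEGY) != Z_OK)  /* -15 = raw deflate */
        return -1;

    strm.next_in = (unsigned char *)in;
    strm.avail_in = (uInt)in_len;
    strm.next_out = out;
    strm.avail_out = (uInt)out_cap;

    int ret = deflate(&strm, is_last ? Z_FINISH : Z_SYNC_FLUSH);
    *out_len = out_cap - strm.avail_out;
    deflateEnd(&strm);
    return (ret == Z_OK || ret == Z_STREAM_END) ? 0 : -1;
}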

How to use multiple threads for zlib compression

It is possible to have multiple threads compressing data simultaneously, as long as each thread has its own separate z_stream object. Each z_stream object should have deflateInit() called on it, then as many calls to deflate() as necessary, and then deflateEnd() called after all of the uncompressed data has been passed to deflate(). Using this technique, it would be straightforward to e.g. compress two different files at once.

However, I suspect that what you are trying to do is speed up the compression of a single large file, no? In that case, you'll find that it is not possible, at least not in the obvious way. The reason is that the later bytes of a deflate stream depend on the earlier bytes of that stream for their meaning -- which means they cannot be generated until all of the earlier bytes have been generated, and that rules out generating the second half of the compressed file in parallel with the first half.

What you could do is generate two separate compressed files; one that is the compressed contents of the first half of the uncompressed file, and the other that is the compressed contents of the second half of the uncompressed file. That could be done in parallel since the two compressed streams would be fully independent of each other. Note that you would then need to write your own routine to uncompress those two files and concatenate the result back into a single uncompressed file again, since standard compression/decompression utilities would not be aware of this divide-and-conquer trick.
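
As a rough sketch of that approach (the helper names and the pthread split are mine, not from the question), each thread can run deflateInit() / deflate() / deflateEnd() on its own z_stream over its own half of the input, producing two fully independent zlib streams:

#include <pthread.h>
#include <string.h>
#include <zlib.h>

struct job {
    const unsigned char *in;
    uInt in_len;
    unsigned char *out;
    uInt out_cap;
    uInt out_len;
    int ok;
};

/* Per-thread routine: a private z_stream, all input supplied in one call. */
static void *compress_job(void *arg)
{
    struct job *j = arg;
    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    j->ok = 0;
    if (deflateInit(&strm, Z_DEFAULT_COMPRESSION) != Z_OK)
        return NULL;
    strm.next_in = (unsigned char *)j->in;
    strm.avail_in = j->in_len;
    strm.next_out = j->out;
    strm.avail_out = j->out_cap;
    if (deflate(&strm, Z_FINISH) == Z_STREAM_END) {
        j->out_len = j->out_cap - strm.avail_out;
        j->ok = 1;
    }
    deflateEnd(&strm);
    return NULL;
}

/* Usage sketch: split a buffer in half and compress both halves in parallel.
   The two outputs must be decompressed separately and concatenated by your
   own code; they are not one combined zlib stream. */
int compress_halves(const unsigned char *in, uInt len,
                    unsigned char *out1, uInt cap1,
                    unsigned char *out2, uInt cap2)
{
    struct job a = { in, len / 2, out1, cap1, 0, 0 };
    struct job b = { in + len / 2, len - len / 2, out2, cap2, 0, 0 };
    pthread_t t1, t2;
    pthread_create(&t1, NULL, compress_job, &a);
    pthread_create(&t2, NULL, compress_job, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return (a.ok && b.ok) ? 0 : -1;
}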

As pointed out by Mark Adler (one of the authors of zlib), it is possible to compress large chunks of data in parallel, as exemplified by pigz. Essentially you need to supply the 32K of uncompressed data preceding a particular chunk as a dictionary.

==Chunk 1===
-32K-====Chunk 2=======
--32K--====Chunk 3====

Then you can combine the compressed data.
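
A hedged fragment of that priming step (my own helper, not pigz's actual code): before compressing chunk N as its own raw deflate stream, hand the compressor the last 32K of the data that precedes the chunk via deflateSetDictionary(), so back-references into that window still resolve when the pieces are combined:

#include <stddef.h>
#include <zlib.h>

/* Prime a raw-deflate z_stream with up to 32 KiB of the input that
   immediately precedes the chunk about to be compressed. */
void prime_with_preceding_32k(z_stream *strm,
                              const unsigned char *preceding, size_t len)
{
    const size_t window = 32768;
    size_t dict = len < window ? len : window;
    deflateSetDictionary(strm, preceding + (len - dict), (uInt)dict);
}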

Ruby Zlib compression gives different outputs for the same input

Option 1 - fixed mtime:

Yes. The modification time (mtime) is stored in the gzip header. You can use the mtime method to set it to a fixed value, which will resolve your problem:

gz = Zlib::GzipWriter.new(output)
gz.mtime = 1
gz.write(data)
gz.close

Note that the Ruby documentation says that setting mtime to zero will disable the timestamp. I tried it, and it does not work. I also looked at the source code, and the functionality appears to be missing, which seems like a bug. So you have to set mtime to something other than 0 (but see the comments below: this will be fixed in future releases).

Option 2 - skip the header:

Another option is to just skip the header when comparing the compressed data. The gzip header is 10 bytes long, so to compare only the data that follows it:

data = compress_data(input).bytes[10..-1]

Note that you do not need to call to_a on bytes. It is already an Array:

String.bytes -> an_array

Returns an array of bytes in str. This is a shorthand for str.each_byte.to_a.

C++/C Multiple threads to read gz file simultaneously

tl;dr: zlib isn't designed for random access. It seems possible to implement, but it requires a complete read-through of the file to build an index, so it might not be helpful in your case.

Let's look into the zlib source. gzseek is a wrapper around gzseek64, which contains:

/* if within raw area while reading, just go there */
if (state->mode == GZ_READ && state->how == COPY &&
        state->x.pos + offset >= 0) {

"Within raw area" doesn't sound quite right if we're processing a gzipped file. Let's look up the meaning of state->how in gzguts.h:

int how; /* 0: get header, 1: copy, 2: decompress */

Right. At the end of gz_open, a call to gz_reset sets how to 0. Returning to gzseek64, we end up with this modification to the state:

state->seek = 1;
state->skip = offset;

gzread, when called, processes this with a call to gz_skip:

if (state->seek) {
    state->seek = 0;
    if (gz_skip(state, state->skip) == -1)
        return -1;
}

Following this rabbit hole just a bit further, we find that gz_skip calls gz_fetch until gz_fetch has processed enough input for the desired seek. gz_fetch, on its first loop iteration, calls gz_look which sets state->how = GZIP, which causes gz_fetch to decompress data from the input. In other words, your suspicion is right: zlib does decompress the entire file up to that point when you use gzseek.
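
A small illustration of that cost (the file name and the 100 MB offset are made up): the gzseek() call itself returns quickly, but the first gzread() after it has to inflate everything before the target offset.

#include <stdio.h>
#include <zlib.h>

int main(void)
{
    gzFile f = gzopen("example.gz", "rb");
    if (f == NULL)
        return 1;

    /* This call returns quickly: it only records the requested offset. */
    if (gzseek(f, 100L * 1024 * 1024, SEEK_SET) < 0) {
        gzclose(f);
        return 1;
    }

    /* The first read afterwards triggers decompression of everything
       before the 100 MB offset (the gz_skip/gz_fetch path above). */
    char buf[4096];
    int n = gzread(f, buf, sizeof(buf));
    printf("read %d bytes at uncompressed offset %ld\n",
           n, (long)gztell(f) - n);

    gzclose(f);
    return 0;
}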

How to do multithreaded compression / decompression with GZipStream without intermediate files on very large input

Sure, that will work fine. As it happens, a concatenation of valid gzip files is also a valid gzip file. Each distinct decompressible stream is called a gzip member. Your metadata just needs the offset in the file for the start of each stream.

The extra field of a gzip header is limited to 64K bytes, so if you keep your per-member metadata there, that may limit how small a chunk can be, e.g. on the order of tens to a hundred megabytes. For a different reason, I would recommend that the chunks of data you compress be at least several megabytes each anyway, in order to avoid a reduction in compression effectiveness.

A downside of concatenation is that you get no overall check on the integrity of the input. For example, if you mess up the order of the members somehow, this will not be detected on decompression, since each member's integrity check will pass regardless of the order. So you may want to include an overall check for the uncompressed data. An example would be the CRC of the entire uncompressed data, which can be computed from the CRCs of the members using zlib's crc32_combine().
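
For instance, if you record each member's CRC-32 and uncompressed length in your metadata, the whole-file CRC can be recovered like this (the values below are placeholders):

#include <stdio.h>
#include <zlib.h>

/* Combine per-member CRC-32 values into the CRC-32 of the whole uncompressed
   file. The CRCs and lengths here stand in for values you would record in
   your metadata when compressing each chunk. */
int main(void)
{
    unsigned long member_crc[] = { 0x12345678UL, 0x9abcdef0UL };
    long          member_len[] = { 1048576L,     524288L };
    int count = sizeof(member_crc) / sizeof(member_crc[0]);

    unsigned long total = member_crc[0];
    for (int i = 1; i < count; i++)
        total = crc32_combine(total, member_crc[i], member_len[i]);

    printf("combined CRC-32: %08lx\n", total);
    return 0;
}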

I would be interested to know whether you get a significant speedup from parallel decompression in your case. Decompression is usually fast enough that it is I/O bound on the mass storage device being read from.

zlib: compressed stream always the same?

It's not guaranteed at all. It's possible to generate infinitely many different compressed streams for the same input with the same zlib parameters. That's why there are things like gziphack:

http://groups.google.com/group/comp.compression/browse_thread/thread/82fafc72949ed46c/0115418726ed45e1

http://www.advsys.net/ken/util/kzip.exe

http://www.advsys.net/ken/util/pngout.exe

http://www.walbeehm.com/download/DeflOpt207.7z

etc.

Compression and decompression of data using zlib in Nodejs

Update: Didn't realize there was a new built-in 'zlib' module in node 0.5. My answer below is for the 3rd party node-zlib module. Will update answer for the built-in version momentarily.

Update 2: Looks like there may be an issue with the built-in 'zlib'. The sample code in the docs doesn't work for me. The resulting file isn't gunzip'able (fails with "unexpected end of file" for me). Also, the API of that module isn't particularly well-suited for what you're trying to do. It's more for working with streams rather than buffers, whereas the node-zlib module has a simpler API that's easier to use for Buffers.


An example of deflating and inflating, using 3rd party node-zlib module:

// Load zlib and create a buffer to compress
var zlib = require('zlib');
var input = new Buffer('lorem ipsum dolor sit amet', 'utf8')

// What's 'input'?
//input
//<Buffer 6c 6f 72 65 6d 20 69 70 73 75 6d 20 64 6f 6c 6f 72 20 73 69 74 20 61 6d 65 74>

// Compress it
zlib.deflate(input)
//<SlowBuffer 78 9c cb c9 2f 4a cd 55 c8 2c 28 2e cd 55 48 c9 cf c9 2f 52 28 ce 2c 51 48 cc 4d 2d 01 00 87 15 09 e5>

// Compress it and convert to utf8 string, just for the heck of it
zlib.deflate(input).toString('utf8')
//'x???/J?U?,(.?UH???/R(?,QH?M-\u0001\u0000?\u0015\t?'

// Compress, then uncompress (get back what we started with)
zlib.inflate(zlib.deflate(input))
//<SlowBuffer 6c 6f 72 65 6d 20 69 70 73 75 6d 20 64 6f 6c 6f 72 20 73 69 74 20 61 6d 65 74>

// Again, and convert back to our initial string
zlib.inflate(zlib.deflate(input)).toString('utf8')
//'lorem ipsum dolor sit amet'

File compression with zlib without saving to disk and send via socket

One way is to use a Boost Iostream compressor (they support zlib, gzip, bzip2 out of the box) and an ip::tcp::iostream socket from Boost Asio. Something like:

#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <boost/asio/ip/tcp.hpp>
#include <string>

int main() {
    boost::asio::ip::tcp::iostream connection;

    boost::iostreams::filtering_stream<boost::iostreams::input> connection_reader;
    connection_reader.push(boost::iostreams::zlib_decompressor());
    connection_reader.push(connection);

    boost::iostreams::filtering_stream<boost::iostreams::output> connection_writer;
    connection_writer.push(boost::iostreams::zlib_compressor());
    connection_writer.push(connection);

    auto const host = "127.0.0.1";
    connection.connect(host, "http");

    // Send.
    connection_writer << "hello there\n";

    // Receive.
    for (std::string line; getline(connection_reader, line);) {
        // Process line.
    }
}

