Gzip with All Cores

If you are on Linux, you can use GNU xargs to launch as many gzip processes as you have cores.

CORES=$(grep -c '^processor' /proc/cpuinfo)
find /source -type f -print0 | xargs -0 -n 1 -P $CORES gzip -9
  • find -print0 / xargs -0 protects you from whitespace and special characters in filenames
  • xargs -n 1 runs one gzip process per file
  • xargs -P sets the number of parallel processes
  • gzip -9 requests maximum compression
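The pipeline above can be exercised end to end in a scratch directory; this is a self-contained sketch (the sample files and the core-count fallback of 2 are only for the demo):

```shell
#!/bin/sh
# Sketch: run the find | xargs -P gzip pipeline on throwaway files.
set -e
CORES=$(grep -c '^processor' /proc/cpuinfo 2>/dev/null || echo 2)
DIR=$(mktemp -d)
for i in 1 2 3 4; do
  printf 'hello %s\n' "$i" > "$DIR/file $i.txt"   # note the space in the name
done
# One gzip process per file, up to $CORES at a time:
find "$DIR" -type f -print0 | xargs -0 -n 1 -P "$CORES" gzip -9
gzip -t "$DIR"/*.gz        # verify every archive decompresses cleanly
rm -r "$DIR"
```

Because of -print0 / -0, the file names with spaces come through intact.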

Gzip a big file using multiple CPU cores

Use pigz, a parallel gzip implementation.

Unlike splitting the work across many gzip processes with GNU parallel, pigz parallelizes the compression of a single file and produces a single, standard gzip stream.
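As a quick sketch (guarded, since pigz may not be installed; the 1 MiB sample file is generated just for the demo):

```shell
# Sketch: compress one file on all cores with pigz (if installed), then check
# that the single resulting gzip stream is readable by plain gzip.
if command -v pigz >/dev/null 2>&1; then
  f=$(mktemp)
  head -c 1048576 /dev/zero > "$f"   # 1 MiB of sample data
  pigz -9 "$f"                       # produces "$f.gz" and removes "$f"
  gzip -t "$f.gz"                    # ordinary gzip understands the output
  rm -f "$f.gz"
else
  echo "pigz is not installed; install it via your package manager" >&2
fi
```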

Utilizing multiple cores for tar+gzip/bzip2 compression/decompression

You can use pigz instead of gzip; it performs gzip compression on multiple cores. Instead of using tar's -z option, pipe the archive through pigz:

tar cf - paths-to-archive | pigz > archive.tar.gz

By default, pigz uses the number of available cores, or eight if it could not query that. You can ask for more with -p n, e.g. -p 32. pigz has the same options as gzip, so you can request better compression with -9. E.g.

tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz

Gzip on multiple cores with a pv progress bar

Use GNU Parallel, which offers a progress bar or an ETA:

find ... -print0 | parallel -0 --progress gzip -9 {}

Or

find ... -print0 | parallel -0 --eta ...

Or

find ... -print0 | parallel -0 --bar ...
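The question title also mentions pv. If all you want is a progress bar for one big file, you can pipe it through pv into pigz instead; a sketch, assuming both tools are installed ("bigfile" is a placeholder name):

```shell
# Sketch: pv draws the progress bar, pigz compresses on all cores.
# "bigfile" is a placeholder; pv and pigz must both be installed.
pv bigfile | pigz -9 > bigfile.gz
```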

Is it possible to unzip a compressed file with multiple threads?

In short: no. Decompressing a single gzip stream is an inherently sequential operation, so unzipping one file with multiple cores is not available. (Even pigz decompresses on a single thread; its extra threads during -d are only used for reading, writing, and check calculation.)

Decompression is normally much less CPU-intensive than compression (which is where multiple cores pay off).

You wouldn't gain much anyway, since read/write operations tend to be the bottleneck during decompression.

How do I get Java to use my multi-core processor with GZIPInputStream?

AFAIK the action of reading from this stream is single-threaded, so multiple CPUs won't help you if you're reading one file.

You could, however, have multiple threads, each unzipping a different file.

That being said, unzipping is not particularly calculation intensive these days, you're more likely to be blocked by the cost of IO (e.g., if you are reading two very large files in two different areas of the HD).

More generally (assuming this question comes from someone new to Java), Java doesn't parallelize work for you. You have to use threads to describe the units of work you want done and how to synchronize between them. Java (with the help of the OS) will generally use as many cores as are available to it, and will time-slice threads on the same core if there are more threads than cores (which is typically the case).
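The one-worker-per-file idea from the answer above is not Java-specific; here is a shell sketch of the same pattern, decompressing several files with one gzip process each (the sample files are generated just for the demo):

```shell
# Sketch: one decompressor per file, run in parallel -- the shell analog of
# spawning one Java thread per GZIPInputStream.
set -e
DIR=$(mktemp -d)
for i in 1 2 3; do
  printf 'payload %s\n' "$i" | gzip -9 > "$DIR/part$i.gz"
done
# Up to 3 gzip -d processes at once, each handling a different file:
find "$DIR" -name '*.gz' -print0 | xargs -0 -n 1 -P 3 gzip -d
cat "$DIR"/part?        # the decompressed files: part1, part2, part3
rm -r "$DIR"
```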

How to optimize CPU load up to 100% with GZip multithread archiver?

It's not possible to tell what is limiting your core usage without profiling, and also knowing how much data you are compressing in your test.

However, I can say that in order to get good efficiency, which includes both full core utilization and close to a factor-of-n speedup for n threads over one thread, in pigz I have to create pools of threads that are always there, either running or waiting for more work. Creating and destroying threads for every chunk of data to be processed has a huge cost. I also keep pools of pre-allocated memory blocks for the same reason.

The source code at the link, in C, may be of help.

C++/C Multiple threads to read gz file simultaneously

tl;dr: zlib isn't designed for random access. It seems possible to implement, though requiring a complete read-through to build an index, so it might not be helpful in your case.

Let's look into the zlib source. gzseek is a wrapper around gzseek64, which contains:

    /* if within raw area while reading, just go there */
    if (state->mode == GZ_READ && state->how == COPY &&
            state->x.pos + offset >= 0) {

"Within raw area" doesn't sound quite right if we're processing a gzipped file. Let's look up the meaning of state->how in gzguts.h:

int how; /* 0: get header, 1: copy, 2: decompress */

Right. At the end of gz_open, a call to gz_reset sets how to 0. Returning to gzseek64, we end up with this modification to the state:

state->seek = 1;
state->skip = offset;

gzread, when called, processes this with a call to gz_skip:

    if (state->seek) {
        state->seek = 0;
        if (gz_skip(state, state->skip) == -1)
            return -1;
    }

Following this rabbit hole just a bit further, we find that gz_skip calls gz_fetch until gz_fetch has processed enough input for the desired seek. gz_fetch, on its first loop iteration, calls gz_look which sets state->how = GZIP, which causes gz_fetch to decompress data from the input. In other words, your suspicion is right: zlib does decompress the entire file up to that point when you use gzseek.
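You can see the same cost from the shell: reaching an offset in the uncompressed data means decompressing and discarding everything before it. A sketch (the offsets here are arbitrary):

```shell
# Sketch: the shell equivalent of gzseek -- to reach uncompressed offset N you
# must decompress the whole stream up to N and throw that prefix away.
set -e
f=$(mktemp)
seq 1 100000 | gzip -9 > "$f.gz"
# "Seek" to uncompressed offset 500000, then read 20 bytes:
gzip -dc "$f.gz" | tail -c +500001 | head -c 20
rm -f "$f" "$f.gz"
```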


