Gzip with all cores
If you are on Linux, you can use GNU's xargs to launch as many processes as you have cores.
CORES=$(grep -c '^processor' /proc/cpuinfo)
find /source -type f -print0 | xargs -0 -n 1 -P $CORES gzip -9
- find -print0 / xargs -0 protects you from whitespace (and newlines) in filenames
- xargs -n 1 starts one gzip process per file
- xargs -P sets the number of processes to run in parallel
- gzip -9 requests maximum compression
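If GNU coreutils is available, nproc is a simpler way to count cores than parsing /proc/cpuinfo. A self-contained sketch of the same pipeline (the sample directory and files are made up for the demo):

```shell
# Demo: compress every file under a sample tree, one gzip process per core.
SRC=$(mktemp -d)                 # stand-in for /source
printf 'hello\n' > "$SRC/a.txt"
printf 'world\n' > "$SRC/b.txt"

CORES=$(nproc)                   # GNU coreutils; counts available cores
find "$SRC" -type f -print0 | xargs -0 -n 1 -P "$CORES" gzip -9

ls "$SRC"                        # a.txt.gz  b.txt.gz
```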
Multi-CPU-core gzip for a big file
Use pigz, a parallel gzip implementation. Unlike parallel with gzip, pigz produces a single gzip stream.
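A minimal sketch of pigz on a single large file, assuming pigz is installed (the sample file is generated for the demo; pigz, like gzip, replaces the input with a .gz file):

```shell
command -v pigz >/dev/null || exit 0   # skip gracefully if pigz is unavailable

BIG=$(mktemp)                          # stand-in for your big file
head -c 1048576 /dev/zero > "$BIG"     # 1 MiB of sample data

pigz -9 -p "$(nproc)" "$BIG"           # compresses across all cores, leaves "$BIG.gz"
```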
Utilizing multiple cores for tar+gzip/bzip2 compression/decompression
You can use pigz instead of gzip, which does gzip compression on multiple cores. Instead of using the -z option, you would pipe it through pigz:
tar cf - paths-to-archive | pigz > archive.tar.gz
By default, pigz uses the number of available cores, or eight if it cannot query that. You can ask for more with -p n, e.g. -p 32. pigz has the same options as gzip, so you can request better compression with -9, e.g.:
tar cf - paths-to-archive | pigz -9 -p 32 > archive.tar.gz
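A self-contained round trip under those assumptions (pigz and GNU tar; the directory and archive names are made up for the demo). GNU tar's -I/--use-compress-program is another way to hook pigz in:

```shell
command -v pigz >/dev/null || exit 0   # skip gracefully if pigz is unavailable

WORK=$(mktemp -d) && cd "$WORK"
mkdir paths-to-archive
echo data > paths-to-archive/file.txt

# compress through pigz
tar cf - paths-to-archive | pigz -9 > archive.tar.gz

# decompress: pipe pigz -d back into tar
rm -r paths-to-archive
pigz -dc archive.tar.gz | tar xf -     # or: tar -I pigz -xf archive.tar.gz

cat paths-to-archive/file.txt          # data
```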
gzip on multiple cores with pv progress bar
Use GNU Parallel, which has a progress bar or an ETA:
find ... -print0 | parallel -0 --progress gzip -9 {}
Or
find ... -print0 | parallel -0 --eta ...
Or
find ... -print0 | parallel -0 --bar ...
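If you want the literal pv progress bar from the question title, one common pattern (assuming pv and pigz are installed, and GNU du for -sb; paths here are demo placeholders) is to meter the tar stream before it reaches the compressor:

```shell
# skip gracefully if pv or pigz is unavailable
command -v pv >/dev/null && command -v pigz >/dev/null || exit 0

DIR=$(mktemp -d)                 # stand-in for the directory to archive
echo hello > "$DIR/f.txt"

# pv -s takes the expected byte count so it can draw a percentage bar
tar cf - -C "$DIR" . | pv -s "$(du -sb "$DIR" | cut -f1)" | pigz > "$DIR.tar.gz"
```

The size from du is only an estimate of the tar stream length, so the percentage is approximate, but the throughput and ETA readouts are still useful.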
Is it possible to unzip a compressed file with multiple threads?
In short: No, unzipping with multiple cores is not available.
Decompression is normally less CPU-intensive than compression (where multiple cores are often involved), and you wouldn't gain much anyway: read/write operations are more of a bottleneck during decompression.
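One caveat worth noting: pigz -d is a partial exception. Its inflate step still runs on a single core, but per its documentation it uses separate threads for reading, writing, and check calculation, which can give a modest speedup over plain gunzip:

```shell
command -v pigz >/dev/null || exit 0   # skip gracefully if pigz is unavailable

F=$(mktemp)
head -c 1048576 /dev/urandom > "$F"    # sample data for the demo
gzip "$F"                              # produces "$F.gz"

pigz -d "$F.gz"                        # single-core inflate, helper threads for I/O and check
```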
How do I get Java to use my multi-core processor with GZIPInputStream?
AFAIK the action of reading from this stream is single-threaded, so multiple CPUs won't help you if you're reading one file.
You could, however, have multiple threads, each unzipping a different file.
That being said, unzipping is not particularly computation-intensive these days; you're more likely to be blocked by the cost of IO (e.g., if you are reading two very large files from two different areas of the disk).
More generally (assuming this is a question of someone new to Java), Java doesn't do things in parallel for you. You have to use threads to tell it what are the units of work that you want to do and how to synchronize between them. Java (with the help of the OS) will generally take as many cores as is available to it, and will also swap threads on the same core if there are more threads than cores (which is typically the case).
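The same advice translated to the shell (a sketch; the file names are made up for the demo): parallelize across files rather than within one stream, with one decompressor process per file:

```shell
DIR=$(mktemp -d)
echo one | gzip > "$DIR/a.gz"
echo two | gzip > "$DIR/b.gz"

# one gunzip process per file, as many processes in flight as there are cores
find "$DIR" -name '*.gz' -print0 | xargs -0 -n 1 -P "$(nproc)" gunzip
```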
How to optimize CPU load up to 100% with GZip multithread archiver?
It's not possible to tell what is limiting your core usage without profiling, and also knowing how much data you are compressing in your test.
However, I can say that in order to get good efficiency, which includes both full core utilization and close to a factor of n speedup for n threads over one thread, in pigz I have to create pools of threads that are always there, either running or waiting for more work. Creating and destroying threads for every chunk of data to be processed has a huge cost. I also keep pools of pre-allocated blocks of memory for the same reason.
The source code at the link, in C, may be of help.
C++/C Multiple threads to read gz file simultaneously
tl;dr: zlib isn't designed for random access. It seems possible to implement, though requiring a complete read-through to build an index, so it might not be helpful in your case.
Let's look into the zlib source. gzseek is a wrapper around gzseek64, which contains:
    /* if within raw area while reading, just go there */
    if (state->mode == GZ_READ && state->how == COPY &&
            state->x.pos + offset >= 0) {
"Within raw area" doesn't sound quite right if we're processing a gzipped file. Let's look up the meaning of state->how in gzguts.h:
int how; /* 0: get header, 1: copy, 2: decompress */
Right. At the end of gz_open, a call to gz_reset sets how to 0. Returning to gzseek64, we end up with this modification to the state:
state->seek = 1;
state->skip = offset;
gzread, when called, processes this with a call to gz_skip:
    if (state->seek) {
        state->seek = 0;
        if (gz_skip(state, state->skip) == -1)
            return -1;
    }
Following this rabbit hole just a bit further, we find that gz_skip calls gz_fetch until gz_fetch has processed enough input for the desired seek. gz_fetch, on its first loop iteration, calls gz_look, which sets state->how = GZIP, which causes gz_fetch to decompress data from the input. In other words, your suspicion is right: zlib does decompress the entire file up to that point when you use gzseek.
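This is also why random-access schemes either build an index first (zlib ships examples/zran.c for exactly that) or write the file as independently decompressible gzip members, as bgzip does. A small sketch showing that gzip -dc transparently decodes concatenated members:

```shell
# A gzip file may be a series of independent members; gzip -dc decodes them
# back-to-back. Block-compressed variants exploit this structure for random access.
TMP=$(mktemp)
printf 'hello '  | gzip >  "$TMP"   # first member
printf 'world\n' | gzip >> "$TMP"   # second, independent member

gzip -dc "$TMP"                     # hello world
```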