Difference Between Archiving and Compression

Difference between archiving and compression

Archiving means combining several files into one file without reducing their total size. If you start with ten 100 KB files and archive them, the resulting single file is about 1000 KB (plus a small amount of header overhead).
Compressing those ten files instead gives you ten smaller files, each anywhere from a few kilobytes to nearly the original 100 KB, depending on the type of content.
(source)
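
To make the "depends on the file type" point concrete, here is a minimal sketch that compresses two 100 KB buffers with the JDK's built-in DEFLATE implementation (java.util.zip.Deflater). The sample data is hypothetical: repetitive log-like text, and random bytes standing in for content that is already compressed (JPEG, MP3 and so on).

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

class SizeByContent {
    /** Returns how many bytes DEFLATE (zlib-wrapped) produces for the input. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Hypothetical 100 KB "log file" made of repetitive text. */
    static byte[] textFile() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; sb.length() < 100 * 1024; i++) {
            sb.append("2024-01-01 INFO request ").append(i).append(" served\n");
        }
        return sb.substring(0, 100 * 1024).getBytes(StandardCharsets.US_ASCII);
    }

    /** Hypothetical 100 KB of random bytes (stand-in for already-compressed data). */
    static byte[] randomFile() {
        byte[] data = new byte[100 * 1024];
        new Random(42).nextBytes(data);
        return data;
    }

    public static void main(String[] args) {
        System.out.println("text:   102400 bytes -> " + deflatedSize(textFile()) + " bytes");
        System.out.println("random: 102400 bytes -> " + deflatedSize(randomFile()) + " bytes");
    }
}
```

The text buffer should shrink to a fraction of its size, while the random one stays at roughly 100 KB, because it has no redundancy for the compressor to remove.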

What is the best choice: compressing files individually or as one group?

If the files have some similarity then there can be a noticeable advantage to a "solid" archive, which is putting the files together in a sequence and compressing them as one big file, like a .tar.gz file, as opposed to compressing each file individually, like .zip.

The advantage is even greater if the files are small.

I just did a quick test on a small set of files, where the .tar.gz was 15% smaller than a .zip file with the same contents. Both were compressed with the same compression algorithm at the same compression level.
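
A rough way to reproduce that kind of comparison without any external tools is sketched below. It only measures the compression effect: the container overhead of real .zip and .tar files (per-entry headers, tar's 512-byte blocks) is left out, and the sample "files" are hypothetical, so the exact percentage will differ from the test above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

class SolidVsPerFile {
    /** Returns how many bytes DEFLATE produces for the input. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Hypothetical sample data: many small files with similar content. */
    static List<byte[]> sampleFiles(int count) {
        List<byte[]> files = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String text = "{\"id\": " + i + ", \"type\": \"marketData\", "
                    + "\"currency\": \"USD\", \"status\": \"ok\"}\n";
            files.add(text.getBytes(StandardCharsets.UTF_8));
        }
        return files;
    }

    /** zip-like: every file is compressed on its own. */
    static int perFileSize(List<byte[]> files) {
        int total = 0;
        for (byte[] file : files) {
            total += deflatedSize(file);
        }
        return total;
    }

    /** tar.gz-like ("solid"): concatenate first, then compress once. */
    static int solidSize(List<byte[]> files) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] file : files) {
            all.write(file, 0, file.length);
        }
        return deflatedSize(all.toByteArray());
    }

    public static void main(String[] args) {
        List<byte[]> files = sampleFiles(100);
        System.out.println("per-file: " + perFileSize(files) + " bytes");
        System.out.println("solid:    " + solidSize(files) + " bytes");
    }
}
```

With small, similar files the per-file total is dominated by content that each stream has to encode from scratch, which is why the solid variant comes out smaller.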

What compression/archive formats support inter-file compression?

Several formats do inter-file compression.

The oldest example is .tar.gz; a .tar has no compression but concatenates all the files together, with headers before each file, and a .gz can compress only one file. Both are applied in sequence, and it's a traditional format in the Unix world. .tar.bz2 is the same, only with bzip2 instead of gzip.

More recent examples are formats with optional "solid" compression (for instance, RAR and 7-Zip), which can internally concatenate all the files before compressing, if enabled by a command-line flag or GUI option.

Compressing and Archiving the files in the folder using Java Runtime

The problem lies in this part:

" marketData*"

You expect the names of the files to be compressed to be expanded from the * wildcard. Globbing is done by the shell, not by the tools themselves. Your choices are to either:

  • enumerate the files to be archived yourself
  • start the shell to perform the command ("/bin/sh -c")
  • start tar on the folder containing the files to be archived

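Edit:
For the shell option, your command would look like:

String[] command = {"sh", "-c", "tar --remove-files -cjvf " + archivedFile + " marketData*"};

(Pass the command to Runtime.exec as an array, so that the whole tar command line reaches the shell as a single argument. Runtime.exec(String) splits its argument on whitespace and does not understand quotes, so embedding \"...\" in one long string does not work. Don't put single quotes around the glob, or the shell won't expand it.)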
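
For the first option (enumerating the files yourself), a minimal sketch could look like the following. It assumes the files sit directly in one directory and that a GNU tar supporting --remove-files is on the PATH; the class name, the method names and the archive name archive.tar.bz2 are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class TarCommand {
    /** Does the globbing in Java instead of relying on a shell. */
    static List<String> listMatching(Path dir, String glob) {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        Collections.sort(names);
        return names;
    }

    /** Builds the argument list: one list element per argument, no quoting needed. */
    static List<String> tarCommand(String archivedFile, List<String> fileNames) {
        List<String> cmd = new ArrayList<>(
                List.of("tar", "--remove-files", "-cjvf", archivedFile));
        cmd.addAll(fileNames);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> files = listMatching(dir, "marketData*");
        if (files.isEmpty()) {
            System.out.println("nothing to archive");
            return;
        }
        // Careful: --remove-files deletes the originals after archiving.
        Process process = new ProcessBuilder(tarCommand("archive.tar.bz2", files))
                .directory(dir.toFile())   // run tar inside the data directory
                .inheritIO()
                .start();
        System.exit(process.waitFor());
    }
}
```

The archive name deliberately does not start with marketData, so a second run does not pick up the previous archive through the same glob.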

Huge Compression Difference in RAR archive and Gzip: Is there anything that I am missing?

It is possible that the file is highly redundant with a repeating pattern that is larger than 32K. gzip's deflate only looks 32K back for matches, whereas the others can capitalize on history much further back.

Update:

I just made a file that is a 64K block of random data, repeated 4096 times (256 MB). gzip (with 32K window) was blind to the redundancy and so unable to compress it. gzip expanded it to 256.04 MB. xz (LZMA with 8 MB window) compressed it to 102 KB.
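
A scaled-down version of that experiment can be run with the JDK alone. The JDK ships no LZMA/xz implementation, so instead of comparing against xz this sketch compares two inputs under the same DEFLATE compressor: a random block longer than the 32K window, and one shorter than it. The block sizes, repeat counts and seed are arbitrary choices for illustration.

```java
import java.util.Random;
import java.util.zip.Deflater;

class WindowLimit {
    /** Returns how many bytes DEFLATE produces for the input. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    /** Builds one random block and repeats it back to back. */
    static byte[] repeatRandomBlock(int blockSize, int repeats, long seed) {
        byte[] block = new byte[blockSize];
        new Random(seed).nextBytes(block);
        byte[] out = new byte[blockSize * repeats];
        for (int i = 0; i < repeats; i++) {
            System.arraycopy(block, 0, out, i * blockSize, blockSize);
        }
        return out;
    }

    public static void main(String[] args) {
        // 64K period: the previous copy is always outside the 32K window.
        byte[] far = repeatRandomBlock(64 * 1024, 32, 1L);
        // 16K period: the previous copy is well inside the window.
        byte[] near = repeatRandomBlock(16 * 1024, 128, 1L);
        System.out.println("64K block: " + far.length + " -> " + deflatedSize(far));
        System.out.println("16K block: " + near.length + " -> " + deflatedSize(near));
    }
}
```

Both inputs are 2 MiB and equally redundant, but only the one whose repeats fall inside the window should shrink noticeably.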

What is the difference between tar and zip?

tar in itself just bundles files together (the result is called a tarball), while zip applies compression as well.

Usually you use gzip along with tar to compress the resulting tarball, thus achieving similar results as with zip.

For reasonably large archives there are important differences though. A zip archive is a collection of compressed files. A gzipped tar is a compressed collection (of uncompressed files). Thus a zip archive is a randomly accessible list of concatenated compressed items, while a .tar.gz has no separate catalog and must be decompressed from the start before its contents can be listed or any single file reached.

  • The caveat of a zip is that you don't get compression across files (because each file is compressed independent of the others in the archive, the compression cannot take advantage of similarities among the contents of different files); the advantage is that you can access any of the files contained within by looking at only a specific (target file dependent) section of the archive (as the "catalog" of the collection is separate from the collection itself).
  • The caveat of a .tar.gz is that you must decompress the whole archive to access files contained therein (as the files are within the tarball); the advantage is that the compression can take advantage of similarities among the files (as it compresses the whole tarball).
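
The random-access side of this trade-off can be seen with java.util.zip.ZipFile, which reads the catalog (the central directory) and then opens just the requested entry. This is a small sketch using a temporary file and made-up entry names; the JDK has no tar reader, so the .tar.gz side is not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class ZipRandomAccess {
    /** Writes the given name -> content pairs into a temporary zip file. */
    static Path writeZip(Map<String, String> files) {
        try {
            Path zip = Files.createTempFile("demo", ".zip");
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
                for (Map.Entry<String, String> file : files.entrySet()) {
                    out.putNextEntry(new ZipEntry(file.getKey()));
                    out.write(file.getValue().getBytes(StandardCharsets.UTF_8));
                    out.closeEntry();
                }
            }
            return zip;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Looks one entry up by name and decompresses only that entry. */
    static String readOne(Path zip, String name) {
        try (ZipFile zipFile = new ZipFile(zip.toFile())) {
            ZipEntry entry = zipFile.getEntry(name);
            if (entry == null) {
                throw new IllegalArgumentException("no such entry: " + name);
            }
            try (InputStream in = zipFile.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        Path zip = writeZip(Map.of("a.txt", "alpha", "b.txt", "beta", "c.txt", "gamma"));
        System.out.println(readOne(zip, "b.txt"));
    }
}
```

Because each entry is its own compressed stream, readOne never has to touch the data of a.txt or c.txt.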

Compressing large, near-identical files

Sounds like what you need is a binary diff program (xdelta and bsdiff are two examples). Compute a binary diff between two of the files, then compress one of the files together with the resulting diff. You could get fancy and try diffing all combinations, keep the smallest diffs, and send only one original plus those diffs.


