Split Tar.Bz2 File and Extract Each Individually

I don't think this is easily possible. A .tar.bz2 file is a single compressed stream; unlike zip, it has no index that would allow skipping to the start of a particular file within the archive. You can split the file with the split utility, then cat the parts back together and extract them (you can pipe the result to tar via stdin to avoid re-creating the joined file on disk). The first fragment can be extracted on its own (except for the last file in it, which will probably be damaged), but later fragments are not usable without the ones that come before them.
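For anyone who would rather do the split-and-rejoin step in Python than with split and cat, here is a minimal sketch of the same idea. The helper names (split_file, ChainedReader, read_members_from_parts) and the ".partNNN" naming are made up for illustration. It assumes the parts are fed back in their original order and reads each member fully into memory, so treat it as a demonstration rather than something to run on huge archives.

```python
import io
import tarfile


def split_file(path, chunk_size):
    """Cut the file at `path` into numbered part files; return their paths in order."""
    parts = []
    with open(path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            # Hypothetical naming scheme: archive.tar.bz2.part000, .part001, ...
            part_path = f"{path}.part{len(parts):03d}"
            with open(part_path, "wb") as dst:
                dst.write(chunk)
            parts.append(part_path)
    return parts


class ChainedReader(io.RawIOBase):
    """Present several files as one byte stream, like `cat part1 part2 ...`."""

    def __init__(self, paths):
        super().__init__()
        self._paths = list(paths)
        self._current = None

    def readable(self):
        return True

    def readinto(self, buffer):
        while True:
            if self._current is None:
                if not self._paths:
                    return 0  # no parts left: end of stream
                self._current = open(self._paths.pop(0), "rb")
            count = self._current.readinto(buffer)
            if count:
                return count
            # Current part is exhausted, move on to the next one.
            self._current.close()
            self._current = None

    def close(self):
        if self._current is not None:
            self._current.close()
            self._current = None
        super().close()


def read_members_from_parts(parts):
    """Stream the joined parts through tarfile and return {member name: bytes}."""
    contents = {}
    # "r|bz2" treats the input as a forward-only stream, so the joined
    # archive never has to exist as a single file on disk.
    with ChainedReader(parts) as joined, \
            tarfile.open(fileobj=joined, mode="r|bz2") as tar:
        for member in tar:
            if member.isfile():
                contents[member.name] = tar.extractfile(member).read()
    return contents
```

Calling split_file on an archive and then passing the returned list to read_members_from_parts should give back the original file contents, as long as none of the parts is missing or out of order.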

Is it possible to split a huge text file (based on number of lines) while unpacking a .tar.gz archive if I cannot extract that file as a whole?

To extract a file from f.tar.gz and split it into files, each with no more than 1 million lines, use:

tar Oxzf f.tar.gz | split -l1000000

The above will name the output files by the default method. If you prefer the output files to be named prefix.nn where nn is a sequence number, then use:

tar Oxzf f.tar.gz | split -dl1000000 - prefix.

Under this approach:

  • The original file is never written to disk. tar reads from the .tar.gz file and pipes its contents to split which divides it up into pieces before writing the pieces to disk.

  • The .tar.gz file is read only once.

  • split, through its many options, has a great deal of flexibility.

Explanation

For the tar command:

  • O tells tar to send the output to stdout. This way we can pipe it to split without ever having to save the original file on disk.

  • x tells tar to extract the file (as opposed to, say, creating an archive).

  • z tells tar that the archive is in gzip format. On modern versions of tar, this is optional when extracting, because the compression format is detected automatically.

  • f tells tar to use, as input, the file name specified.

For the split command:

  • -l tells split to split files limited by number of lines (as opposed to, say, bytes).

  • -d tells split to use numeric suffixes for the output files.

  • - tells split to get its input from stdin.
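If tar and split are not available, roughly the same pipeline can be sketched in Python with the tarfile module. This is only an approximation of the commands above, under a few assumptions: the function name split_member_by_lines is made up, the member to split is selected by name (the shell pipeline sends every file in the archive to stdout), and the numbering grows past two digits rather than stopping at 99.

```python
import tarfile


def split_member_by_lines(archive_path, member_name, lines_per_file, prefix):
    """Stream one member of a .tar.gz into files of at most lines_per_file lines.

    Output files are named prefix00, prefix01, ... (similar to split -d).
    Returns the list of files written, in order.
    """
    written = []
    # "r|gz" reads the archive as a forward-only stream, so it is read
    # only once and the member is never written to disk as a whole.
    with tarfile.open(archive_path, "r|gz") as tar:
        for member in tar:
            if member.name != member_name or not member.isfile():
                continue
            source = tar.extractfile(member)  # file-like object over the member
            out = None
            try:
                for line_number, line in enumerate(source):
                    # Start a new output file every lines_per_file lines.
                    if line_number % lines_per_file == 0:
                        if out is not None:
                            out.close()
                        out_name = f"{prefix}{len(written):02d}"
                        out = open(out_name, "wb")
                        written.append(out_name)
                    out.write(line)
            finally:
                if out is not None:
                    out.close()
            return written
    raise KeyError(f"{member_name!r} not found in {archive_path!r}")
```

For example, split_member_by_lines('f.tar.gz', 'big.txt', 1000000, 'prefix.') would write prefix.00, prefix.01 and so on, where 'big.txt' stands for whatever the file inside the archive is actually called.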

Compress multiple files into a bz2 file in Python

This is what tarballs are for. The tar format packs the files together, then you compress the result. Python makes it easy to do both at once with the tarfile module, where passing a "mode" of 'w:bz2' opens a new tar file for write with seamless bz2 compression. Super-simple example:

import tarfile

# mylistoffiles is a list of paths to the files you want in the archive
with tarfile.open('mytar.tar.bz2', 'w:bz2') as tar:
    for file in mylistoffiles:
        tar.add(file)

If you don't need much control over the operation, shutil.make_archive is a possible alternative. It simplifies the code for compressing a whole directory tree to the following, which creates mytar.tar.bz2:

import shutil

shutil.make_archive('mytar', 'bztar', directory_to_compress)
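To check what ended up in the archive, you can open it again with tarfile. The sketch below wraps the shutil.make_archive call in a helper (archive_and_list is a made-up name) and returns the archive path together with the names of the regular files inside it. It assumes base_name points outside the directory being compressed, so the archive does not try to include itself.

```python
import shutil
import tarfile


def archive_and_list(directory_to_compress, base_name):
    """Compress a directory tree to base_name.tar.bz2; return (path, file names)."""
    # make_archive returns the full name of the archive it created.
    archive_path = shutil.make_archive(base_name, "bztar", directory_to_compress)
    with tarfile.open(archive_path, "r:bz2") as tar:
        names = sorted(m.name for m in tar.getmembers() if m.isfile())
    return archive_path, names
```

The member names are stored relative to the compressed directory, so expect entries such as ./a.txt rather than absolute paths.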

