Split Files Using Tar, Gz, Zip, or Bzip2

Can I split a compressed SQL file using the Linux `split` command? If not, is there another way to do it?

You might take a look at this question: split-files-using-tar-gz-zip-or-bzip2
I assume the reason you want to split it is to move it? And that you know you probably won't be able to import a small slice of the file into a database?
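Since the question also asks for another method: the chunk-and-reassemble idea behind `split` and `cat part.* > whole` can be sketched in plain Python, which also works where the shell tools are unavailable. The function names and part size below are illustrative, not from any library:

```python
import os

def split_file(path, part_size):
    """Split path into numbered chunks of at most part_size bytes
    (roughly what `split -b part_size` does)."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(part_size)
            if not chunk:
                break
            part_name = "%s.%03d" % (path, index)
            with open(part_name, "wb") as dst:
                dst.write(chunk)
            parts.append(part_name)
            index += 1
    return parts

def join_files(parts, out_path):
    """Concatenate the chunks back together (like `cat part.* > out`)."""
    with open(out_path, "wb") as dst:
        for part_name in parts:
            with open(part_name, "rb") as src:
                dst.write(src.read())
```

Because the split is a plain byte-level cut, it works on compressed files too; you just have to rejoin every part before decompressing or importing.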

compress multiple files into a bz2 file in python

This is what tarballs are for. The tar format packs the files together, then you compress the result. Python makes it easy to do both at once with the tarfile module, where passing a "mode" of 'w:bz2' opens a new tar file for write with seamless bz2 compression. Super-simple example:

import tarfile

with tarfile.open('mytar.tar.bz2', 'w:bz2') as tar:
    for file in mylistoffiles:
        tar.add(file)

If you don't need much control over the operation, shutil.make_archive is a simpler alternative; compressing a whole directory tree reduces to:

shutil.make_archive('mytar', 'bztar', directory_to_compress)
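A minimal end-to-end sketch of that one-liner (the directory and file names here are illustrative); `shutil.unpack_archive` reverses the operation:

```python
import os
import shutil
import tempfile

# build a small directory tree to compress (illustrative names)
workdir = tempfile.mkdtemp()
src_dir = os.path.join(workdir, "data")
os.makedirs(src_dir)
with open(os.path.join(src_dir, "hello.txt"), "w") as f:
    f.write("hello")

# compress the whole tree into <workdir>/data.tar.bz2
archive = shutil.make_archive(os.path.join(workdir, "data"), "bztar", src_dir)

# and unpack it again into a fresh directory
out_dir = os.path.join(workdir, "restored")
os.makedirs(out_dir, exist_ok=True)
shutil.unpack_archive(archive, out_dir)
```

Note that make_archive appends the format's extension itself and returns the final archive path, so you pass the base name without `.tar.bz2`.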

How to decompress BZIP (not BZIP2) with Apache Commons

The original Bzip reportedly used a patented algorithm, so Bzip2 was created using only algorithms and techniques that were not patented.

That is likely why Bzip is no longer in widespread use and why open-source libraries ignore it.

There's some C code for decompressing Bzip files shown here (gist.github.com mirror).

You might want to read and rewrite that in Java.

How to protect myself from a gzip or bzip2 bomb?

I guess the answer is: there is no easy, ready-made solution. Here is what I use now:

import bz2
import gzip


class SafeUncompressor(object):
    """Small proxy class that enables external file object
    support for uncompressed, bzip2 and gzip files. Works transparently, and
    supports a maximum size to avoid zipbombs.
    """
    blocksize = 16 * 1024

    class FileTooLarge(Exception):
        pass

    def __init__(self, fileobj, maxsize=10*1024*1024):
        self.fileobj = fileobj
        self.name = getattr(self.fileobj, "name", None)
        self.maxsize = maxsize
        self.init()

    def init(self):
        self.pos = 0
        self.fileobj.seek(0)
        self.buf = b""
        self.format = "plain"

        magic = self.fileobj.read(2)
        if magic == b'\037\213':
            self.format = "gzip"
            self.gzipobj = gzip.GzipFile(fileobj=self.fileobj, mode='r')
        elif magic == b'BZ':
            raise IOError("bzip2 support in SafeUncompressor disabled, "
                          "as self.bz2obj.decompress is not safe")
            self.format = "bz2"
            self.bz2obj = bz2.BZ2Decompressor()
            self.fileobj.seek(0)
        else:
            # plain file: rewind past the two magic bytes we just read
            self.fileobj.seek(0)

    def read(self, size):
        b = [self.buf]
        x = len(self.buf)
        while x < size:
            if self.format == 'gzip':
                data = self.gzipobj.read(self.blocksize)
                if not data:
                    break
            elif self.format == 'bz2':
                raw = self.fileobj.read(self.blocksize)
                if not raw:
                    break
                # this can already bomb here, to some extent,
                # so disable bzip support until resolved.
                # Also monitor http://stackoverflow.com/questions/13622706/how-to-protect-myself-from-a-gzip-or-bzip2-bomb for ideas
                data = self.bz2obj.decompress(raw)
            else:
                data = self.fileobj.read(self.blocksize)
                if not data:
                    break
            b.append(data)
            x += len(data)

            if self.pos + x > self.maxsize:
                self.buf = b""
                self.pos = 0
                raise SafeUncompressor.FileTooLarge("Compressed file too large")
        self.buf = b"".join(b)

        buf = self.buf[:size]
        self.buf = self.buf[size:]
        self.pos += len(buf)
        return buf

    def seek(self, pos, whence=0):
        if whence != 0:
            raise IOError("SafeUncompressor only supports whence=0")
        if pos < self.pos:
            self.init()
        self.read(pos - self.pos)

    def tell(self):
        return self.pos

It does not work well for bzip2, so that part of the code is disabled. The reason is that bz2.BZ2Decompressor.decompress can already produce an arbitrarily large chunk of data from a small compressed input.
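Newer Python versions (3.5+) address exactly this problem: `bz2.BZ2Decompressor.decompress` accepts a `max_length` argument, so each call returns at most that many bytes and buffers the rest internally, which lets you enforce a cap incrementally. A minimal sketch of a capped bz2 decompressor built on that API (the function name and limits are illustrative):

```python
import bz2

def safe_bz2_decompress(data, maxsize, blocksize=16 * 1024):
    """Decompress bz2 data, raising IOError if the decompressed
    output would exceed maxsize bytes."""
    decomp = bz2.BZ2Decompressor()
    out = []
    total = 0
    buf = data
    while not decomp.eof:
        # never receive more than blocksize bytes per call,
        # regardless of the compression ratio of the input
        chunk = decomp.decompress(buf, max_length=blocksize)
        buf = b""  # all input handed over; later calls just drain the buffer
        if not chunk and decomp.needs_input:
            break  # truncated stream, nothing more can be produced
        total += len(chunk)
        if total > maxsize:
            raise IOError("Compressed file too large")
        out.append(chunk)
    return b"".join(out)
```

The same `max_length` parameter exists on `zlib.Decompress.decompress` and `lzma.LZMADecompressor.decompress`, so the pattern carries over to gzip/zlib and xz streams as well.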


