LAMP: How to create .Zip of large files for the user on the fly, without disk/CPU thrashing

You can use popen() (docs) or proc_open() (docs) to execute a Unix command (e.g. zip or gzip) and get back its stdout as a PHP stream. flush() (docs) will do its very best to push the contents of PHP's output buffer to the browser.

Combining all of this will give you what you want (provided that nothing else gets in the way -- see esp. the caveats on the docs page for flush()).

(Note: don't use flush(). See the update below for details.)

Something like the following can do the trick:

<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/x-gzip');

// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to
// control the input of the pipeline too)
//
$fp = popen('tar cf - file1 file2 file3 | gzip -c', 'r');

// pick a bufsize that makes you happy (64k may be a bit too big).
$bufsize = 65535;
$buff = '';
while (!feof($fp)) {
    $buff = fread($fp, $bufsize);
    echo $buff;
}
pclose($fp);
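
If you also need to feed data into the pipeline's stdin (rather than only reading its stdout), proc_open() is the tool. Here's a minimal sketch of that variant; the command and the sample stdin data are illustrative only, and for large inputs you would want to interleave reads and writes (e.g. with stream_select()) to avoid pipe-buffer deadlocks:

<?php
// send appropriate headers first, as in the example above

// proc_open() gives us handles for both stdin and stdout of the child
$descriptors = array(
    0 => array('pipe', 'r'),               // child's stdin  (we write to it)
    1 => array('pipe', 'w'),               // child's stdout (we read from it)
    2 => array('file', '/dev/null', 'a'),  // discard stderr
);

$proc = proc_open('gzip -c', $descriptors, $pipes);
if (is_resource($proc)) {
    fwrite($pipes[0], "data to compress\n"); // illustrative input
    fclose($pipes[0]);                       // send EOF so gzip can finish

    while (!feof($pipes[1])) {
        echo fread($pipes[1], 8192);         // stream compressed output to the browser
    }
    fclose($pipes[1]);
    proc_close($proc);
}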

You asked about "other technologies", to which I'll say: "anything that supports non-blocking I/O for the entire lifecycle of the request". You could build such a component as a stand-alone server in Java or C/C++ (or any of many other available languages), if you were willing to get into the "down and dirty" of non-blocking file access and whatnot.

If you want a non-blocking implementation but you would rather avoid the "down and dirty", the easiest path (IMHO) would be to use nodeJS. There is plenty of support for all the features you need in the existing release of nodejs: use the http module (of course) for the HTTP server, and the child_process module to spawn the tar/zip/whatever pipeline.

Finally, if (and only if) you're running a multi-processor (or multi-core) server and you want the most from nodejs, you can use Spark2 to run multiple instances on the same port. Don't run more than one nodejs instance per processor core.


Update (from Benji's excellent feedback in the comments section on this answer)

1. The docs for fread() indicate that the function will read only up to 8192 bytes of data at a time from anything that is not a regular file. Therefore, 8192 may be a good choice of buffer size.

[editorial note] 8192 is almost certainly a platform-dependent value -- on most platforms, fread() will read data until the operating system's internal buffer is empty, at which point it will return, allowing the OS to fill the buffer again asynchronously. 8192 is the size of the default buffer on many popular operating systems.

There are other circumstances that can cause fread() to return even less than 8192 bytes -- for example, if the "remote" client (or process) is slow to fill the buffer, fread() will in most cases return the contents of the input buffer as-is without waiting for it to fill. This could mean anywhere from 0 to os_buffer_size bytes get returned.

The moral is: the value you pass to fread() as the buffer size should be considered a "maximum" size -- never assume that you've received the number of bytes you asked for (or any other number for that matter).

2. According to comments on the fread() docs, a few caveats: magic quotes may interfere and must be turned off.

3. Setting mb_http_output('pass') (docs) may be a good idea. Though 'pass' is already the default setting, you may need to specify it explicitly if your code or config has previously changed it to something else. (A short sketch of points 1-3 follows this list.)

4. If you're creating a zip (as opposed to gzip), you'd want to use the Content-Type header:

Content-type: application/zip

Alternatively, 'application/octet-stream' can be used instead (it's a generic content type used for binary downloads of all kinds):

Content-type: application/octet-stream

And if you want the user to be prompted to download and save the file to disk (rather than potentially having the browser try to display the file as text), then you'll need the Content-Disposition header, where filename indicates the name that should be suggested in the save dialog:

Content-disposition: attachment; filename="file.zip"

One should also send the Content-length header, but this is hard with this technique as you don’t know the zip’s exact size in advance. Is there a header that can be set to indicate that the content is "streaming" or is of unknown length? Does anybody know?
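
Here's a concrete sketch of points 1-3 above; the function_exists() guard and the fread_exact() helper are purely illustrative additions, not something the caveats themselves require:

<?php
// Points 2 and 3: make sure nothing mangles the binary stream.
// Magic quotes only exist on legacy PHP versions (the feature was removed
// in PHP 5.4), hence the guard; mb_http_output() requires the mbstring extension.
if (function_exists('set_magic_quotes_runtime') && get_magic_quotes_runtime()) {
    set_magic_quotes_runtime(false);
}
mb_http_output('pass');  // 'pass' is the default, but be explicit

// Point 1: never assume fread() returned as many bytes as you asked for.
// The streaming loop above doesn't care, but if you ever need an exact
// byte count, accumulate short reads yourself (hypothetical helper):
function fread_exact($fp, $len) {
    $data = '';
    while (strlen($data) < $len && !feof($fp)) {
        $chunk = fread($fp, $len - strlen($data));
        if ($chunk === false || $chunk === '') {
            break;  // read error or no more data available
        }
        $data .= $chunk;
    }
    return $data;  // may still be shorter than $len if EOF was reached
}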


Finally, here's a revised example that uses all of @Benji's suggestions (and that creates a ZIP file instead of a tar.gz file):

<?php
// make sure to send all headers first
// Content-Type is the most important one (probably)
//
header('Content-Type: application/octet-stream');
header('Content-disposition: attachment; filename="file.zip"');

// use popen to execute a unix command pipeline
// and grab the stdout as a php stream
// (you can use proc_open instead if you need to
// control the input of the pipeline too)
//
$fp = popen('zip -r - file1 file2 file3', 'r');

// pick a bufsize that makes you happy (8192 has been suggested).
$bufsize = 8192;
$buff = '';
while (!feof($fp)) {
    $buff = fread($fp, $bufsize);
    echo $buff;
}
pclose($fp);

Update: (2012-11-23) I have discovered that calling flush() within the read/echo loop can cause problems when working with very large files and/or very slow networks. At least, this is true when running PHP as cgi/fastcgi behind Apache, and it seems likely that the same problem would occur when running in other configurations too. The problem appears to arise when PHP flushes output to Apache faster than Apache can actually send it over the socket. For very large files (or slow connections), this eventually causes an overrun of Apache's internal output buffer. That causes Apache to kill the PHP process, which of course causes the download to hang, or complete prematurely, with only a partial transfer having taken place.

The solution is not to call flush() at all. I have updated the code examples above to reflect this, and I placed a note in the text at the top of the answer.

Create big zip archives with lots of small files in memory (on the fly) with Python

OK, I finally created that zip Frankenstein:

from io import BytesIO
from zipfile import ZipFile, ZIP_DEFLATED
import sys
import gc

files = (
    'input/i_1.docx',  # one file size is about ~580KB
    'input/i_2.docx',
    'input/i_3.docx',
    'input/i_4.docx',
    'input/i_5.docx',
    'input/i_6.docx',
    'input/i_7.docx',
    'input/i_8.docx',
    'input/i_9.docx',
    'input/i_10.docx',
    'input/i_11.docx',
    'input/i_12.docx',
    'input/i_13.docx',
    'input/i_14.docx',
    'input/i_21.docx'
)


# this function allows getting the size of an in-memory object;
# added only for debugging purposes
def _get_size(input_obj):
    memory_size = 0
    ids = set()
    objects = [input_obj]
    while objects:
        new = []
        for obj in objects:
            if id(obj) not in ids:
                ids.add(id(obj))
                memory_size += sys.getsizeof(obj)
                new.append(obj)
        objects = gc.get_referents(*new)
    return memory_size


class CustomizedZipFile(ZipFile):

    # customized BytesIO that is able to report a faked offset
    class _CustomizedBytesIO(BytesIO):

        def __init__(self, fake_offset: int):
            self.fake_offset = fake_offset
            self.temporary_switch_to_faked_offset = False
            super().__init__()

        def tell(self):
            if self.temporary_switch_to_faked_offset:
                # revert tell() to normal mode to minimize the faked behaviour
                self.temporary_switch_to_faked_offset = False
                return super().tell() + self.fake_offset
            else:
                return super().tell()

    def __init__(self, *args, **kwargs):
        # create an empty in-memory file to write to if a fake offset is set
        if 'fake_offset' in kwargs and kwargs['fake_offset'] is not None and kwargs['fake_offset'] > 0:
            self._fake_offset = kwargs['fake_offset']
            del kwargs['fake_offset']
            if 'file' in kwargs:
                kwargs['file'] = self._CustomizedBytesIO(self._fake_offset)
            else:
                args = list(args)
                args[0] = self._CustomizedBytesIO(self._fake_offset)
        else:
            self._fake_offset = 0
        super().__init__(*args, **kwargs)

    # finalize the zip (should be run only on the last chunk)
    def force_write_end_record(self):
        self._write_end_record(False)

    # don't write the end record by default, so that non-final chunks can be retrieved;
    # ZipFile writes the end-of-archive metadata on close() by default
    def _write_end_record(self, skip_write_end=True):
        if not skip_write_end:
            if self._fake_offset > 0:
                self.start_dir = self._fake_offset
                self.fp.temporary_switch_to_faked_offset = True
            super()._write_end_record()


def archive(files):

    compression_type = ZIP_DEFLATED
    CHUNK_SIZE = 1048576  # 1MB

    with open('tmp.zip', 'wb') as resulted_file:
        offset = 0
        filelist = []
        with BytesIO() as chunk:
            for f in files:
                with BytesIO() as tmp:
                    with CustomizedZipFile(tmp, 'w', compression=compression_type) as zf:
                        with open(f, 'rb') as b:
                            zf.writestr(
                                zinfo_or_arcname=f.replace('input/', 'output/'),
                                data=b.read()
                            )
                        zf.filelist[0].header_offset = offset
                    data = tmp.getvalue()
                    offset = offset + len(data)
                    filelist.append(zf.filelist[0])
                    chunk.write(data)
                    print('size of zipfile:', _get_size(zf))
                    print('size of chunk:', _get_size(chunk))
                if len(chunk.getvalue()) > CHUNK_SIZE:
                    resulted_file.write(chunk.getvalue())
                    chunk.seek(0)
                    chunk.truncate()
            # write the last chunk
            resulted_file.write(chunk.getvalue())
        # the file parameter may be skipped if we are using fake_offset,
        # because an empty _CustomizedBytesIO is created in the constructor
        with CustomizedZipFile(None, 'w', compression=compression_type, fake_offset=offset) as zf:
            zf.filelist = filelist
            zf.force_write_end_record()
            end_data = zf.fp.getvalue()
        resulted_file.write(end_data)


archive(files)

Output is:

size of zipfile: 2182955
size of chunk: 582336
size of zipfile: 2182979
size of chunk: 1164533
size of zipfile: 2182983
size of chunk: 582342
size of zipfile: 2182979
size of chunk: 1164562
size of zipfile: 2182983
size of chunk: 582343
size of zipfile: 2182979
size of chunk: 1164568
size of zipfile: 2182983
size of chunk: 582337
size of zipfile: 2182983
size of chunk: 1164556
size of zipfile: 2182983
size of chunk: 582329
size of zipfile: 2182984
size of chunk: 1164543
size of zipfile: 2182984
size of chunk: 582355
size of zipfile: 2182984
size of chunk: 1164586
size of zipfile: 2182984
size of chunk: 582338
size of zipfile: 2182984
size of chunk: 1164545
size of zipfile: 2182980
size of chunk: 582320

So we can see that the chunk is always dumped to storage and truncated when it reaches the maximum chunk size (1 MB in my case).

The resulting archive was tested with The Unarchiver v4.2.4 on macOS, and with the default Windows 10 unarchiver and 7-Zip.

Note!

The archive created in chunks is 16 bytes bigger than an archive created with the plain zipfile library. Probably some extra zero bytes are written somewhere; I didn't check why.

zipfile is the worst Python library I've ever seen. It looks like it is supposed to be used as a non-extendable, binary-like file.

How to make a zip file downloadable with headers in PHP without saving it on the server

ZIP is not a streamable format (the target needs to support seeking, i.e. a ZIP file cannot simply be written to a target where the previously written data cannot be re-read and modified), so what you are trying to do can't work. The best solution (if your zip files won't be huge) is to create the archive in a temporary location, readfile() it, and then delete the temporary file.
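
Here's a minimal sketch of that temporary-file approach, assuming PHP's ZipArchive extension is available; the input paths and the download name are placeholders:

<?php
// build the archive in a temporary file
$tmp = tempnam(sys_get_temp_dir(), 'zip');

$zip = new ZipArchive();
if ($zip->open($tmp, ZipArchive::CREATE | ZipArchive::OVERWRITE) === true) {
    $zip->addFile('/path/to/file1.txt', 'file1.txt');
    $zip->addFile('/path/to/file2.txt', 'file2.txt');
    $zip->close();
}

// send it to the browser, then clean up
header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="file.zip"');
header('Content-Length: ' . filesize($tmp));

readfile($tmp);
unlink($tmp);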

See the following Stack Overflow questions for further reference. They contain some workarounds that might be fine for you:

  • Zip Stream in PHP
  • LAMP: How to create .Zip of large files for the user on the fly, without disk/CPU thrashing

