Python Ungzipping Stream of Bytes

Unzip buffer with Python?

Is it possible to use zlib?

No, zlib is not designed to operate on ZIP files.

and how can I avoid using a temporary file?

Use io.BytesIO:

import zipfile
import io

buffer = b'PK\x03\x04\n\x00\x00\x00\x00\x00\n\\\x88Gzzo\xed\x03\x00\x00\x00\x03\x00\x00\x00\x07\x00\x1c\x00foo.txtUT\t\x00\x03$\x14gV(\x14gVux\x0b\x00\x01\x041\x04\x00\x00\x041\x04\x00\x00hi\nPK\x01\x02\x1e\x03\n\x00\x00\x00\x00\x00\n\\\x88Gzzo\xed\x03\x00\x00\x00\x03\x00\x00\x00\x07\x00\x18\x00\x00\x00\x00\x00\x01\x00\x00\x00\xb4\x81\x00\x00\x00\x00foo.txtUT\x05\x00\x03$\x14gVux\x0b\x00\x01\x041\x04\x00\x00\x041\x04\x00\x00PK\x05\x06\x00\x00\x00\x00\x01\x00\x01\x00M\x00\x00\x00D\x00\x00\x00\x00\x00'

z = zipfile.ZipFile(io.BytesIO(buffer))

# The following three lines are alternatives. Use one of them
# according to your need:
foo = z.read('foo.txt') # Reads the data from "foo.txt"
foo2 = z.read(z.infolist()[0]) # Reads the data from the first file
z.extractall() # Copies foo.txt to the filesystem

z.close()

print(foo)
print(foo2)

Python ungzipping stream of bytes?

Yes, you can use the zlib module to decompress byte streams:

import zlib

def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv

The offset of 32 signals to zlib that a gzip header is expected and should be skipped.

The S3 key object is an iterator, so you can do:

for data in stream_gzip_decompress(k):
    ...  # do something with the decompressed data
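Without an S3 connection, the generator can be exercised on any iterable of byte chunks. Here is a self-contained sketch (the chunk size of 64 is arbitrary) that compresses data in memory and feeds it back in small pieces, as an S3 body iterator would:

```python
import gzip
import zlib

def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv

# Build a gzip payload in memory and feed it in small chunks.
compressed = gzip.compress(b"hello world " * 1000)
chunks = (compressed[i:i + 64] for i in range(0, len(compressed), 64))
result = b"".join(stream_gzip_decompress(chunks))
print(result[:11])  # b'hello world'
```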

Stream unZIP archive

It is possible to do this from within Python, without calling to an external process, and it can handle all the files in the zip, not just the first.

This can be done by using stream-unzip [disclaimer: written by me].

from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, file_chunks in stream_unzip(zipped_chunks()):
    for chunk in file_chunks:
        print(chunk)

How to unpack from a binary file a byte array using Python?

If you want an 8-byte string, you need to put the number 8 in there:

struct.unpack('<8s', bytearray(fp.read(8)))

From the docs:

A format character may be preceded by an integral repeat count. For example, the format string '4h' means exactly the same as 'hhhh'.

For the 's' format character, the count is interpreted as the length of the bytes, not a repeat count like for the other format characters; for example, '10s' means a single 10-byte string, while '10c' means 10 characters. If a count is not given, it defaults to 1. For packing, the string is truncated or padded with null bytes as appropriate to make it fit. For unpacking, the resulting bytes object always has exactly the specified number of bytes. As a special case, '0s' means a single, empty string (while '0c' means 0 characters).


However, I'm not sure why you're doing this in the first place.

fp.read(8) gives you an 8-byte bytes object. You want an 8-byte bytes object. So, just do this:

Data4 = fp.read(8)

Converting the bytes to a bytearray has no effect except to make a mutable copy. Unpacking it just gives you back a copy of the same bytes you started with. So… why?


Well, actually, struct.unpack returns a tuple whose one value is a copy of the same bytes you started with, but you can do that with:

Data4 = (fp.read(8),)

Which raises the question of why you want four single-element tuples in the first place. You're going to be doing Data1[0], etc. all over the place for no good reason. Why not this?

Data1, Data2, Data3, Data4 = struct.unpack('<LHH8s', fp.read(16))

Of course if this is meant to read a UUID, it's always better to use the "batteries included" than to try to build your own batteries from nickel and cadmium ore. As icktoofay says, just use the uuid module:

data = uuid.UUID(bytes_le=fp.read(16))

But keep in mind that Python's uuid uses the 4-2-2-1-1-6 format, not the 4-2-2-8 format. If you really need exactly that format, you'll need to convert it, which means either struct or bit twiddling anyway. (Microsoft's GUID makes things even more fun by using a 4-2-2-2-6 format, which is not the same as either, and representing the first 3 in native-endian and the last two in big-endian, because they like to make things easier…)
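To make the layout difference concrete, here is a sketch (the 16 input bytes are arbitrary test data, not a real GUID) comparing the 4-2-2-8 struct unpacking with the uuid module's bytes_le interpretation:

```python
import io
import struct
import uuid

raw = bytes(range(16))  # stand-in for fp.read(16)
fp = io.BytesIO(raw)

# 4-2-2-8 layout, all little-endian
data1, data2, data3, data4 = struct.unpack('<LHH8s', fp.read(16))

# uuid's bytes_le reads the first three fields little-endian
# and the remaining 8 bytes as-is (4-2-2-1-1-6 presentation)
u = uuid.UUID(bytes_le=raw)
print(hex(data1), hex(data2), hex(data3), data4)
print(u)  # 03020100-0504-0706-0809-0a0b0c0d0e0f
```

The first three fields agree (`u.time_low == data1`, and so on); only the grouping of the last eight bytes differs between the two views.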

Downloading and unzipping a .zip file without writing to disk

My suggestion would be to use a StringIO object. It emulates a file, but resides in memory. So you could do something like this:

# get_zip_data() returns a zip archive containing 'foo.txt', whose content is 'hey, foo'

import zipfile
from StringIO import StringIO

zipdata = StringIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print foofile.read()

# output: "hey, foo"

Or more simply (apologies to Vishal):

myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
for name in myzipfile.namelist():
[ ... ]

In Python 3 use BytesIO instead of StringIO:

import zipfile
from io import BytesIO

filebytes = BytesIO(get_zip_data())
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
    [ ... ]

How to uncompress gzipped data in a byte array?

zlib.decompress(data, 15 + 32) should autodetect whether you have gzip data or zlib data.

zlib.decompress(data, 15 + 16) should work if gzip and barf if zlib.

Here it is with Python 2.7.1, creating a little gz file, reading it back, and decompressing it:

>>> import gzip, zlib
>>> f = gzip.open('foo.gz', 'wb')
>>> f.write(b"hello world")
11
>>> f.close()
>>> c = open('foo.gz', 'rb').read()
>>> c
'\x1f\x8b\x08\x08\x14\xf4\xdcM\x02\xfffoo\x00\xcbH\xcd\xc9\xc9W(\xcf/\xcaI\x01\x00\x85\x11J\r\x0b\x00\x00\x00'
>>> ba = bytearray(c)
>>> ba
bytearray(b'\x1f\x8b\x08\x08\x14\xf4\xdcM\x02\xfffoo\x00\xcbH\xcd\xc9\xc9W(\xcf/\xcaI\x01\x00\x85\x11J\r\x0b\x00\x00\x00')
>>> zlib.decompress(ba, 15+32)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: must be string or read-only buffer, not bytearray
>>> zlib.decompress(bytes(ba), 15+32)
'hello world'
>>>

Python 3.x usage would be very similar.
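In Python 3 the TypeError above goes away, since zlib accepts any bytes-like object (including bytearray). A minimal Python 3 equivalent of the session:

```python
import gzip
import zlib

data = gzip.compress(b"hello world")

# 15 + 32: autodetect gzip or zlib header
print(zlib.decompress(data, 15 + 32))  # b'hello world'
# 15 + 16: accept only a gzip header
print(zlib.decompress(data, 15 + 16))  # b'hello world'
# bytearray works directly; the buffer protocol is enough
print(zlib.decompress(bytearray(data), 15 + 32))  # b'hello world'
```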

Update based on comment that you are running Python 2.2.1.

Sigh. That's not even the last release of Python 2.2. Anyway, continuing with the foo.gz file created as above:

Python 2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> strobj = open('foo.gz', 'rb').read()
>>> strobj
'\x1f\x8b\x08\x08\x14\xf4\xdcM\x02\xfffoo\x00\xcbH\xcd\xc9\xc9W(\xcf/\xcaI\x01\x00\x85\x11J\r\x0b\x00\x00\x00'
>>> import zlib
>>> zlib.decompress(strobj, 15+32)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
zlib.error: Error -2 while preparing to decompress data
>>> zlib.decompress(strobj, 15+16)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
zlib.error: Error -2 while preparing to decompress data

# OK, we can't use the back door method. Plan B: use the
# documented approach i.e. gzip.GzipFile with a file-like object.

>>> import gzip, cStringIO
>>> fileobj = cStringIO.StringIO(strobj)
>>> gzf = gzip.GzipFile('dummy-name', 'rb', 9, fileobj)
>>> gzf.read()
'hello world'

# Success. Now let's assume you have an array.array object-- which requires
# premeditation; they aren't created accidentally!
# The following code assumes subtype 'B' but should work for any subtype.

>>> import array, sys
>>> aaB = array.array('B')
>>> aaB.fromfile(open('foo.gz', 'rb'), sys.maxint)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
EOFError: not enough items in file
#### Don't panic, just read the fine manual
>>> aaB
array('B', [31, 139, 8, 8, 20, 244, 220, 77, 2, 255, 102, 111, 111, 0, 203, 72, 205, 201, 201, 87, 40, 207, 47, 202, 73, 1, 0, 133, 17, 74, 13, 11, 0, 0, 0])
>>> strobj2 = aaB.tostring()
>>> strobj2 == strobj
1 #### means True
# You can make a str object and use that as above.

# ... or you can plug it directly into StringIO:
>>> gzip.GzipFile('dummy-name', 'rb', 9, cStringIO.StringIO(aaB)).read()
'hello world'

Detect end of Deflate stream in Python

The only way to do it is to decode the entire deflate stream. The deflate format is self-terminating, so it tells you when it ends. There is no magic sequence of bits or bytes that you can search for to find the end.

Indeed, in Python you would use decompressobj in the zlib module to do this, checking unused_data until it is non-empty. When it is, the deflate stream terminated at the byte before what was returned by unused_data.
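A sketch of that technique (the payload and trailing bytes are made up for illustration): decompress a raw deflate stream with decompressobj, then use unused_data to locate the exact byte where the stream ended.

```python
import zlib

# A raw deflate stream followed by unrelated trailing bytes
comp = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
stream = comp.compress(b"first") + comp.flush()
data = stream + b"EXTRA BYTES"

dec = zlib.decompressobj(-zlib.MAX_WBITS)
out = dec.decompress(data)
# unused_data holds everything after the self-terminating stream's end
print(out)                # b'first'
print(dec.unused_data)    # b'EXTRA BYTES'
end = len(data) - len(dec.unused_data)
print(end == len(stream))  # True
```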

Reading a single-entry ZIP file incrementally from an unseekable stream in Python

Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?

Yes: https://github.com/uktrade/stream-unzip can do it [full disclosure: essentially written by me].

We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS.

The example from the README shows how to do this, using stream-unzip and httpx:

from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)

If you just want the first file, you can break out of the outer loop after it:

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
    break

Also

Generally, the ZIP file format (shown below) needs to download in full to be able to see the "central directory" data to identify file entries

This isn't completely true.

Each file has a "local" header that contains its name, and it can be worked out where the compressed data for any member file ends (from information in the local header if it's there, or from the compressed data itself). While there is more information in the central directory at the end, if you just need the names and bytes of the files, then it is possible to start unzipping a ZIP file that contains multiple files while it's downloading.

I can't claim it's absolutely possible in all cases: technically, ZIP allows many different compression algorithms and I haven't investigated them all. However, for DEFLATE, which is the most commonly used, it is possible.
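To illustrate the local-header claim, here is a sketch that builds a small archive in memory and recovers the first member's name from the 30-byte local file header alone, without ever touching the central directory (the struct layout follows the ZIP spec's local file header):

```python
import io
import struct
import zipfile

# Build a small archive in memory for the demonstration
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as z:
    z.writestr('foo.txt', 'hi\n')
data = buf.getvalue()

# Local file header: signature, version needed, flags, method, mod time,
# mod date, crc32, compressed size, uncompressed size, name len, extra len
(sig, ver, flags, method, mtime, mdate, crc,
 csize, usize, name_len, extra_len) = struct.unpack('<4s5H3I2H', data[:30])
print(sig)  # b'PK\x03\x04'
name = data[30:30 + name_len].decode()
print(name)  # foo.txt
```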
