Unzip buffer with Python?
Is it possible to use zlib, and how can I avoid using a temporary file?
No, zlib is not designed to operate on ZIP files. Use the zipfile module together with io.BytesIO:
import zipfile
import io
buffer = b'PK\x03\x04\n\x00\x00\x00\x00\x00\n\\\x88Gzzo\xed\x03\x00\x00\x00\x03\x00\x00\x00\x07\x00\x1c\x00foo.txtUT\t\x00\x03$\x14gV(\x14gVux\x0b\x00\x01\x041\x04\x00\x00\x041\x04\x00\x00hi\nPK\x01\x02\x1e\x03\n\x00\x00\x00\x00\x00\n\\\x88Gzzo\xed\x03\x00\x00\x00\x03\x00\x00\x00\x07\x00\x18\x00\x00\x00\x00\x00\x01\x00\x00\x00\xb4\x81\x00\x00\x00\x00foo.txtUT\x05\x00\x03$\x14gVux\x0b\x00\x01\x041\x04\x00\x00\x041\x04\x00\x00PK\x05\x06\x00\x00\x00\x00\x01\x00\x01\x00M\x00\x00\x00D\x00\x00\x00\x00\x00'
z = zipfile.ZipFile(io.BytesIO(buffer))
# The following three lines are alternatives. Use one of them
# according to your need:
foo = z.read('foo.txt') # Reads the data from "foo.txt"
foo2 = z.read(z.infolist()[0]) # Reads the data from the first file
z.extractall() # Copies foo.txt to the filesystem
z.close()
print(foo)
print(foo2)
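For completeness, the opaque byte literal above can be generated in memory with zipfile itself; a minimal sketch of the same round trip without the hard-coded buffer:

```python
import io
import zipfile

# Build a small ZIP archive entirely in memory, then read it back.
out = io.BytesIO()
with zipfile.ZipFile(out, 'w', zipfile.ZIP_DEFLATED) as z:
    z.writestr('foo.txt', 'hi\n')

buffer = out.getvalue()  # a bytes object, like the literal above

with zipfile.ZipFile(io.BytesIO(buffer)) as z:
    print(z.read('foo.txt'))  # b'hi\n'
```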
Python ungzipping stream of bytes?
Yes, you can use the zlib module to decompress byte streams:
import zlib
def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # offset 32 to skip the header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
    rv = dec.flush()  # emit any output still buffered at end of stream
    if rv:
        yield rv
The offset of 32 tells zlib to auto-detect the header, so gzip (or zlib) input is accepted and the header is skipped for you.
The S3 key object is an iterator, so you can do:
for data in stream_gzip_decompress(k):
    ...  # do something with the decompressed data
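The generator can be exercised end-to-end with in-memory data; a self-contained sketch (repeating the function for completeness) that simulates a streamed download by feeding the compressed bytes in small chunks:

```python
import gzip
import zlib

def stream_gzip_decompress(stream):
    dec = zlib.decompressobj(32 + zlib.MAX_WBITS)  # auto-detect gzip header
    for chunk in stream:
        rv = dec.decompress(chunk)
        if rv:
            yield rv
    rv = dec.flush()
    if rv:
        yield rv

# Gzip some data, then replay it in 64-byte chunks, as a
# network iterator (such as an S3 key object) would.
compressed = gzip.compress(b'hello world' * 100)
chunks = [compressed[i:i + 64] for i in range(0, len(compressed), 64)]
result = b''.join(stream_gzip_decompress(chunks))
print(result[:11])  # b'hello world'
```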
Stream unZIP archive
It is possible to do this from within Python, without calling out to an external process, and it can handle all the files in the zip, not just the first. This can be done by using stream-unzip [disclaimer: written by me].
from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, file_chunks in stream_unzip(zipped_chunks()):
    for chunk in file_chunks:
        print(chunk)
How to unpack from a binary file a byte array using Python?
If you want an 8-byte string, you need to put the number 8 in there:
struct.unpack('<8s', bytearray(fp.read(8)))
From the docs:
A format character may be preceded by an integral repeat count. For example, the format string '4h' means exactly the same as 'hhhh'.
…
For the 's' format character, the count is interpreted as the length of the bytes, not a repeat count like for the other format characters; for example, '10s' means a single 10-byte string, while '10c' means 10 characters. If a count is not given, it defaults to 1. For packing, the string is truncated or padded with null bytes as appropriate to make it fit. For unpacking, the resulting bytes object always has exactly the specified number of bytes. As a special case, '0s' means a single, empty string (while '0c' means 0 characters).
However, I'm not sure why you're doing this in the first place. fp.read(8) gives you an 8-byte bytes object. You want an 8-byte bytes object. So, just do this:
Data4 = fp.read(8)
Converting the bytes to a bytearray has no effect except to make a mutable copy. Unpacking it just gives you back a copy of the same bytes you started with. So… why?
Well, actually, struct.unpack returns a tuple whose one value is a copy of the same bytes you started with, but you can do that with:
Data4 = (fp.read(8),)
Which raises the question of why you want four single-element tuples in the first place. You're going to be doing Data1[0], etc. all over the place for no good reason. Why not this?
Data1, Data2, Data3, Data4 = struct.unpack('<LHH8s', fp.read(16))
Of course if this is meant to read a UUID, it's always better to use the "batteries included" than to try to build your own batteries from nickel and cadmium ore. As icktoofay says, just use the uuid module:
data = uuid.UUID(bytes_le=fp.read(16))
But keep in mind that Python's uuid uses the 4-2-2-1-1-6 format, not the 4-2-2-8 format. If you really need exactly that format, you'll need to convert it, which means either struct or bit twiddling anyway. (Microsoft's GUID makes things even more fun by using a 4-2-2-2-6 format, which is not the same as either, and representing the first 3 in native-endian and the last two in big-endian, because they like to make things easier…)
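To make the two layouts concrete, here is a small sketch comparing the struct-based 4-2-2-8 split with the uuid module's bytes_le constructor; the 16-byte value is a hypothetical stand-in for fp.read(16):

```python
import struct
import uuid

# Hypothetical 16-byte record (stands in for fp.read(16)).
raw = bytes(range(16))

# 4-2-2-8 layout via struct, all little-endian:
d1, d2, d3, d4 = struct.unpack('<LHH8s', raw)

# bytes_le interprets the first three fields as little-endian,
# matching the on-disk byte order of a Microsoft GUID:
guid = uuid.UUID(bytes_le=raw)
print(hex(d1), guid)
```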
Downloading and unzipping a .zip file without writing to disk
My suggestion would be to use a StringIO object. They emulate files, but reside in memory. So you could do something like this:
# get_zip_data() gets a zip archive containing 'foo.txt', whose contents are 'hey, foo'
import zipfile
from StringIO import StringIO
zipdata = StringIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print foofile.read()
# output: "hey, foo"
Or more simply (apologies to Vishal):
myzipfile = zipfile.ZipFile(StringIO(get_zip_data()))
for name in myzipfile.namelist():
    [ ... ]
In Python 3 use BytesIO instead of StringIO:
import zipfile
from io import BytesIO
filebytes = BytesIO(get_zip_data())
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
    [ ... ]
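A complete, runnable Python 3 version of the pattern, with a hypothetical get_zip_data() stubbed out to build the archive in memory rather than download it:

```python
import zipfile
from io import BytesIO

def get_zip_data():
    # Hypothetical stand-in for the download: build a zip in memory.
    buf = BytesIO()
    with zipfile.ZipFile(buf, 'w') as z:
        z.writestr('foo.txt', 'hey, foo')
        z.writestr('bar.txt', 'hey, bar')
    return buf.getvalue()

with zipfile.ZipFile(BytesIO(get_zip_data())) as myzipfile:
    for name in myzipfile.namelist():
        print(name, myzipfile.read(name))
```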
How to uncompress gzipped data in a byte array?
zlib.decompress(data, 15 + 32) should auto-detect whether you have gzip data or zlib data. zlib.decompress(data, 15 + 16) should work for gzip data and barf on zlib data.
Here it is with Python 2.7.1, creating a little gz file, reading it back, and decompressing it:
>>> import gzip, zlib
>>> f = gzip.open('foo.gz', 'wb')
>>> f.write(b"hello world")
11
>>> f.close()
>>> c = open('foo.gz', 'rb').read()
>>> c
'\x1f\x8b\x08\x08\x14\xf4\xdcM\x02\xfffoo\x00\xcbH\xcd\xc9\xc9W(\xcf/\xcaI\x01\x00\x85\x11J\r\x0b\x00\x00\x00'
>>> ba = bytearray(c)
>>> ba
bytearray(b'\x1f\x8b\x08\x08\x14\xf4\xdcM\x02\xfffoo\x00\xcbH\xcd\xc9\xc9W(\xcf/\xcaI\x01\x00\x85\x11J\r\x0b\x00\x00\x00')
>>> zlib.decompress(ba, 15+32)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: must be string or read-only buffer, not bytearray
>>> zlib.decompress(bytes(ba), 15+32)
'hello world'
>>>
Python 3.x usage would be very similar.
Update based on comment that you are running Python 2.2.1. Sigh. That's not even the last release of Python 2.2. Anyway, continuing with the foo.gz file created as above:
Python 2.2.3 (#42, May 30 2003, 18:12:08) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> strobj = open('foo.gz', 'rb').read()
>>> strobj
'\x1f\x8b\x08\x08\x14\xf4\xdcM\x02\xfffoo\x00\xcbH\xcd\xc9\xc9W(\xcf/\xcaI\x01\x00\x85\x11J\r\x0b\x00\x00\x00'
>>> import zlib
>>> zlib.decompress(strobj, 15+32)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
zlib.error: Error -2 while preparing to decompress data
>>> zlib.decompress(strobj, 15+16)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
zlib.error: Error -2 while preparing to decompress data
# OK, we can't use the back door method. Plan B: use the
# documented approach i.e. gzip.GzipFile with a file-like object.
>>> import gzip, cStringIO
>>> fileobj = cStringIO.StringIO(strobj)
>>> gzf = gzip.GzipFile('dummy-name', 'rb', 9, fileobj)
>>> gzf.read()
'hello world'
# Success. Now let's assume you have an array.array object-- which requires
# premeditation; they aren't created accidentally!
# The following code assumes subtype 'B' but should work for any subtype.
>>> import array, sys
>>> aaB = array.array('B')
>>> aaB.fromfile(open('foo.gz', 'rb'), sys.maxint)
Traceback (most recent call last):
File "<stdin>", line 1, in ?
EOFError: not enough items in file
#### Don't panic, just read the fine manual
>>> aaB
array('B', [31, 139, 8, 8, 20, 244, 220, 77, 2, 255, 102, 111, 111, 0, 203, 72, 205, 201, 201, 87, 40, 207, 47, 202, 73, 1, 0, 133, 17, 74, 13, 11, 0, 0, 0])
>>> strobj2 = aaB.tostring()
>>> strobj2 == strobj
1 #### means True
# You can make a str object and use that as above.
# ... or you can plug it directly into StringIO:
>>> gzip.GzipFile('dummy-name', 'rb', 9, cStringIO.StringIO(aaB)).read()
'hello world'
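For comparison, on Python 3 the bytearray workaround in the transcript above is unnecessary: zlib.decompress accepts any bytes-like object directly. A short sketch:

```python
import gzip
import zlib

# Round-trip entirely in memory: gzip-compress, wrap in a bytearray,
# and decompress with header auto-detection (15 + 32).
data = gzip.compress(b'hello world')
ba = bytearray(data)
print(zlib.decompress(ba, 15 + 32))  # b'hello world'
```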
Detect end of Deflate stream in Python
The only way to do it is to decode the entire deflate stream. The deflate format is self-terminating, so it tells you when it ends. There is no magic sequence of bits or bytes that you can search for to find the end.
Indeed, in Python you would use decompressobj in the zlib module to do this, checking unused_data until it is non-empty. When it is, the deflate stream terminated just before the bytes returned by unused_data.
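A minimal sketch of that technique: decode a raw deflate stream (wbits=-15, no zlib/gzip wrapper) that has trailing bytes appended, and use unused_data to locate where the stream ended:

```python
import zlib

# Make a raw deflate stream followed by trailing garbage.
co = zlib.compressobj(9, zlib.DEFLATED, -15)
deflate = co.compress(b'hello world') + co.flush()
blob = deflate + b'TRAILING'

# Decode it; deflate is self-terminating, so everything past the
# end of the stream shows up in unused_data.
dec = zlib.decompressobj(-15)
out = dec.decompress(blob)
end = len(blob) - len(dec.unused_data)  # offset where the deflate stream ended
print(out, end, dec.unused_data)
```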
Reading a single-entry ZIP file incrementally from an unseekable stream in Python
Is there an alternate way to use zipfile or a third-party Python library to do this in a completely streaming way?
Yes: https://github.com/uktrade/stream-unzip can do it [full disclosure: essentially written by me].
We often need to unzip extremely large (unencrypted) ZIP files that are hosted by partners over HTTPS.
The example from the README shows how to do this, using stream-unzip and httpx:
from stream_unzip import stream_unzip
import httpx

def zipped_chunks():
    # Any iterable that yields a zip file
    with httpx.stream('GET', 'https://www.example.com/my.zip') as r:
        yield from r.iter_bytes()

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
If you do just want the first file, you can use break after the first file:

for file_name, file_size, unzipped_chunks in stream_unzip(zipped_chunks()):
    for chunk in unzipped_chunks:
        print(chunk)
    break
Also
Generally, the ZIP file format (shown below) needs to download in full to be able to see the "central directory" data to identify file entries
This isn't completely true.
Each file has a "local" header that contains its name, and it can be worked out where the compressed data for any member file ends (via information in the local header if it's there, or from the compressed data itself). While there is more information in the central file directory at the end, if you just need the names and bytes of the files, then it is possible to start unzipping a ZIP file that contains multiple files while it's downloading.
I can't claim it's absolutely possible in all cases: technically ZIP allows for many different compression algorithms and I haven't investigated them all. However, for DEFLATE, which is the one most commonly used, it is possible.
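To see why streaming is feasible, here is a small sketch parsing the fixed 30-byte local file header (layout per the ZIP APPNOTE specification) to recover a member name before any central directory has been seen; the archive is built in memory for the demonstration:

```python
import io
import struct
import zipfile

# Build a zip in memory; its very first bytes are a local file header.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('foo.txt', 'hi\n')
data = buf.getvalue()

# Local file header: signature, version, flags, method, mod time/date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
(sig, version, flags, method, mtime, mdate,
 crc, csize, usize, nlen, elen) = struct.unpack('<4sHHHHHIIIHH', data[:30])
name = data[30:30 + nlen].decode('cp437')
print(sig, name)
```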