Python: Inflate and Deflate Implementations

This is an add-on to MizardX's answer, giving some explanation and background.

See http://www.chiramattel.com/george/blog/2007/09/09/deflatestream-block-length-does-not-match.html

According to RFC 1950, a zlib stream constructed in the default manner is composed of:

a 2-byte header (e.g. 0x78 0x9C)
a deflate stream -- see RFC 1951
an Adler-32 checksum of the uncompressed data (4 bytes)
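The three parts can be checked from Python. This is only a sketch: split_zlib_stream is a made-up helper name, and it assumes the default layout with no preset dictionary, so the header is exactly 2 bytes.

```python
import zlib


def split_zlib_stream(stream: bytes):
    """Split a zlib stream into (header, deflate_body, adler32_trailer).

    Assumes the default layout: no preset dictionary, so a 2-byte header.
    """
    return stream[:2], stream[2:-4], stream[-4:]


data = b"hello world " * 20
header, body, trailer = split_zlib_stream(zlib.compress(data))

# The low nibble of the first byte is the compression method; 8 means deflate.
assert header[0] & 0x0F == 8
# Read as a big-endian 16-bit number, the header is a multiple of 31.
assert int.from_bytes(header, "big") % 31 == 0
# The trailer is the Adler-32 of the uncompressed data, stored big-endian.
assert int.from_bytes(trailer, "big") == zlib.adler32(data)
# The middle part is a raw deflate stream; negative wbits means "no wrapper".
assert zlib.decompress(body, -zlib.MAX_WBITS) == data
```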

The C# DeflateStream works on (you guessed it) a deflate stream. MizardX's code is telling the zlib module that the data is a raw deflate stream.

Observations: (1) One hopes that the C# "deflation" method produces a longer string only for short input. (2) Using the raw deflate stream without the Adler-32 checksum is a bit risky, unless the checksum is replaced with something better.

Updates

Error message: "Block length does not match with its complement"

If you are trying to inflate some compressed data with the C# DeflateStream and you get that message, then it is quite possible that you are giving it a zlib stream, not a deflate stream.
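The same mix-up can be handled on the Python side when you don't know which of the two formats you were handed. A rough sketch, with inflate_any as a hypothetical helper name: it tries the zlib wrapper first and falls back to raw deflate if that is rejected.

```python
import zlib


def inflate_any(blob: bytes) -> bytes:
    """Inflate `blob` whether it is zlib-wrapped or a raw deflate stream."""
    try:
        # Default wbits: expects the 2-byte header and the Adler-32 trailer.
        return zlib.decompress(blob)
    except zlib.error:
        # Negative wbits: treat the input as a raw deflate stream.
        return zlib.decompress(blob, -zlib.MAX_WBITS)
```

If the data is genuinely corrupt, the second call raises zlib.error as well, so a failure is still reported rather than hidden.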

See How do you use a DeflateStream on part of a file?

Also copy/paste the error message into a Google search and you will get numerous hits (including the one up the front of this answer) saying much the same thing.

The Java Deflater ... used by "the website" ... C# DeflateStream "is pretty straightforward and has been tested against the Java implementation". Which of the following possible Java Deflater constructors is the website using?

public Deflater(int level, boolean nowrap)

Creates a new compressor using the specified compression level. If 'nowrap' is true then the ZLIB header and checksum fields will not be used in order to support the compression format used in both GZIP and PKZIP.

public Deflater(int level)

Creates a new compressor using the specified compression level. Compressed data will be generated in ZLIB format.

public Deflater()

Creates a new compressor with the default compression level. Compressed data will be generated in ZLIB format.
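For comparison, Python's zlib module makes the same choice through the wbits argument instead of a nowrap flag. A minimal sketch; the deflate() helper and its wrapper names are invented for illustration, and the comments pointing at the Java constructors are my reading of the descriptions quoted above.

```python
import zlib


def deflate(data: bytes, wrapper: str = "zlib", level: int = -1) -> bytes:
    """Compress `data` with the chosen wrapper: 'zlib', 'raw' or 'gzip'."""
    wbits = {
        "zlib": zlib.MAX_WBITS,       # header + Adler-32, like Deflater(level)
        "raw": -zlib.MAX_WBITS,       # no wrapper, like Deflater(level, true)
        "gzip": zlib.MAX_WBITS | 16,  # gzip header and CRC-32 trailer
    }[wrapper]
    comp = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return comp.compress(data) + comp.flush()
```

The inflating side has to use the matching wbits value, e.g. zlib.decompress(blob, -zlib.MAX_WBITS) for the 'raw' case.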

A one-line deflater after throwing away the 2-byte zlib header and the 4-byte checksum:

uncompressed_string.encode('zlib')[2:-4]  # Python 2 only; the 'zlib' str codec does not exist in Python 3.x

zlib.compress(uncompressed_string)[2:-4]  # needs "import zlib"; in Python 3.x the input must be bytes

Guess configuration to inflate zlib compressed data

There are no "configurations" needed. zlib's inflate will inflate any valid compressed zlib stream losslessly to the original content.

Therefore, despite your attestation, your data is getting corrupted or deliberately modified somewhere along the way.
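One way to find out where the damage happens is to run the stream through a check at each hop. A small sketch, assuming the data is a complete zlib stream held in memory; check_zlib_stream is a made-up name. zlib verifies the Adler-32 trailer itself, so a clean return means the content is intact.

```python
import zlib


def check_zlib_stream(stream: bytes):
    """Return (True, data) if `stream` inflates cleanly, else (False, reason)."""
    try:
        return True, zlib.decompress(stream)
    except zlib.error as exc:
        # Raised for a bad header, damaged deflate data, a checksum
        # mismatch, or a stream that ends too early.
        return False, str(exc)
```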

Deflate string with gzip or zlib in Python - why am I missing the H4sIAAAAAAAA/ bit

You have exactly the same compressed data stream. The only difference is that your expected data stream has the MTIME field of the header set to 0 and the XFL flag set to 0, not 2:

>>> from base64 import b64decode
>>> expected = b64decode('H4sIAAAAAAAA/7Oxr8jNUShLLSrOzM+zVTfUM1BXSM1Lzk/JzEu3VQ8NcdO1ULe3AwBHQvxaJgAAAA==')
>>> actual = b64decode('H4sIAHDj6lsC/7Oxr8jNUShLLSrOzM+zVTfUM1BXSM1Lzk/JzEu3VQ8NcdO1ULe3AwBHQvxaJgAAAA==')
>>> expected[:4] == actual[:4]  # identification, compression method and flag match
True
>>> expected[4:8], actual[4:8]  # mtime bytes differ, zero vs. current time
(b'\x00\x00\x00\x00', b'p\xe3\xea[')
>>> from datetime import datetime
>>> print(datetime.fromtimestamp(int.from_bytes(actual[4:8], 'little')))
2018-11-13 14:45:04
>>> expected[8], actual[8]  # XFL is set to 2 in the actual output
(0, 2)
>>> expected[9], actual[9]  # OS set to *unknown* in both
(255, 255)
>>> expected[10:] == actual[10:]  # compressed data payload is the same
True

The gzip.compress() function just uses the gzip.GzipFile() class to do the actual compressing, and it'll use time.time() for the MTIME field whenever the mtime argument is left to the default None.

I'd not expect that to actually matter, both strings will result in the exact same decompressed data.

If you must have the same output, then the easiest method is to just replace the header:

import base64
import gzip
compressed = gzip.compress("<?xml version='1.0' encoding='UTF-8'?>".encode('utf-8'))
result = base64.b64encode(b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff' + compressed[10:])

The above replaces the existing 10-byte header with one that has the parts that matter set to the same values as your expected output: both MTIME and the XFL flag are set to 0. Note that when you use gzip.compress(), only the MTIME bytes would ever vary, and the XFL field is not actually used when decompressing.

While you could use the gzip.GzipFile() class to produce compressed output with MTIME set to 0 (pass in mtime=0), you can't change what the XFL field is set to; that is currently hard-coded to 2.
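Combining the two ideas could look like the sketch below: let GzipFile write MTIME as 0, then overwrite the XFL byte afterwards. The function name gzip_fixed_header is invented, and the code assumes the header is the plain 10-byte form (no file name stored), which is what you get when writing to an in-memory buffer.

```python
import gzip
import io


def gzip_fixed_header(data: bytes) -> bytes:
    """Gzip `data` with MTIME = 0 and the XFL byte forced to 0."""
    buf = io.BytesIO()
    # mtime=0 makes GzipFile write four zero bytes into the MTIME field.
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as gz:
        gz.write(data)
    out = bytearray(buf.getvalue())
    out[8] = 0  # XFL lives at offset 8; overwrite whatever Python put there
    return bytes(out)
```

Pass the result to base64.b64encode() to compare it with the expected text. Whether that text matches byte for byte still depends on the DEFLATE implementation producing the same payload.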

Note that even accounting for the MTIME and XFL differences, data compressed with different implementations of the DEFLATE compression algorithm could still result in a different compressed stream, even when using the same compression settings! That's because DEFLATE encodes data based on the frequency of snippets, and different implementations are free to make different choices when there are multiple snippets with the same frequency available when compressing. So the only correct way to test whether your data has been compressed correctly is to decompress it again and compare the result.
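Such a test can be as small as this sketch (same_content is a hypothetical helper). It ignores the header and the exact compressed bytes and only compares what comes out after decompression.

```python
import gzip


def same_content(a: bytes, b: bytes) -> bool:
    """True if two gzip blobs decompress to identical data."""
    return gzip.decompress(a) == gzip.decompress(b)


text = b"snippet " * 100
fast = gzip.compress(text, compresslevel=1)
best = gzip.compress(text, compresslevel=9)
# Different settings, possibly different bytes, but the same content.
assert same_content(fast, best)
```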

Is it always okay to Inflate zlib-wrapped data with the windowBits of 15?

Yes, that's okay. The windowBits needs to be greater than or equal to the window size that the data was compressed with. It is always ok to decompress with the maximum window size (15).
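In Python terms, windowBits is the wbits argument. The sketch below (compress_with_window is an illustrative name) compresses the same data with every window size the zlib module accepts and inflates each result with the maximum, 15.

```python
import zlib


def compress_with_window(data: bytes, wbits: int) -> bytes:
    """Produce a zlib stream using a window of 2**wbits bytes (9 to 15)."""
    comp = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, wbits)
    return comp.compress(data) + comp.flush()


data = bytes(range(256)) * 64
for wbits in range(9, 16):
    stream = compress_with_window(data, wbits)
    # Inflating with the maximum window works whatever the compressor used.
    assert zlib.decompress(stream, zlib.MAX_WBITS) == data
```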

How to inflate a partial zlib file

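Though I don't know Python, I managed to get this to work:

#!/usr/bin/python
import sys
import zlib

# Usage: script.py <partial zlib file> <output file>
with open(sys.argv[1], "rb") as f, open(sys.argv[2], "wb") as g:
    z = zlib.decompressobj()
    while True:
        # Use up any input left over from the previous call before reading more.
        buf = z.unconsumed_tail
        if not buf:
            buf = f.read(8192)
            if not buf:
                break  # end of the (truncated) input file
        got = z.decompress(buf)
        if not got:
            break  # nothing more can be extracted
        g.write(got)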

That should extract all that's available from your partial zlib file.
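If the partial data is already in memory, a shorter variant is possible. This is a sketch only, with inflate_partial as a made-up name. A decompression object does not raise an error when the input simply stops early, and its eof attribute shows whether the end of the stream was actually reached.

```python
import zlib


def inflate_partial(blob: bytes):
    """Return (recovered_bytes, complete) for a possibly truncated zlib stream."""
    z = zlib.decompressobj()
    recovered = z.decompress(blob)
    # z.eof only becomes True once the end-of-stream marker has been seen.
    return recovered, z.eof
```

The recovered bytes are a prefix of the original data; how long that prefix is depends on where the stream was cut.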
