Is There a Practical Way to Compress Nsdata

Is there a practical way to compress NSData?

Yes, compress the data with zlib.

@Brad Larson posted on this: see here and added the code as well.

There is a CocoaPod which uses Objective-Zip by flyingdolphinstudio.

How to compress data

To get a bit of confusion out of the way first: MemoryLayout gives you information about the size and structure of the layout of a type at compile time, but can't be used to determine the amount of storage an Array value needs at runtime because the size of the Array structure itself does not depend on how much data it contains.

Highly simplified, the layout of an Array value looks like this:

┌─────────────────────┐                        
│ Array │
├──────────┬──────────┤ ┌──────────────────┐
│ length │ buffer ─┼───▶│ storage │
└──────────┴──────────┘ └──────────────────┘
1 word / 1 word /
8 bytes 8 bytes
└─────────┬─────────┘
└─▶ MemoryLayout<Array<UInt8>>.size

An Array value stores its length, or count (mixed in with some flags, but we don't need to worry about that) and a pointer to the actual space where the items it contains are stored. Those items aren't stored as part of the Array value itself, but separately in allocated memory which the Array points to. Whether the Array "contains" 10 values or 100000 values, the size of the Array structure remains the same: 1 word (or 8 bytes on a 64-bit system) for the length, and 1 word for the pointer to the actual underlying storage. (The size of the storage buffer, however, is exactly determined by the number of elements it is able to contain, at runtime.)

In practice, Array is significantly more complicated than this for bridging and other reasons, but this is the basic gist; this is why you only ever see MemoryLayout.size(ofValue:) return the same number every time. [And incidentally, the size of String is the same as Array for similar reasons, which is why MemoryLayout<MyData>.size also reports 16.]

In order to know how many bytes an Array or a Data effectively take up, it's sufficient to ask them for their .count: Array<UInt8> and Data are both collections of UInt8 values (bytes), and their .count will reflect the amount of data effectively stored in their underlying storage.


As for the size increase between step (4) and (5), note that

  • Step 4 takes 100 copies of your MyData and joins them together before converting them to JSON, while
  • Step 5 takes 100 copies of individually compressed MyData instances, joins those together, and then re-coverts them to JSON

Step 5 has a few issues compared to step 4:

  1. Compression benefits heavily from repetition in data: a bit of data compressed and repeated 100 times won't be nearly as compact as a bit of data repeated 100 times, then compressed, because each round of compression can't benefit from knowing that there's another copy of the data that came before it. As a simple example:
    • Let's say we wanted to use a form of run-length encoding to compress the string Hello: there isn't a lot we can do, except maybe turn it into Hel{2}o (where {2} indicates a repetition of the last character 2 times)
    • If we compress Hello and join it 3 times, we get might get Hel{2}oHel{2}oHel{2}o,
    • But if we first joined Hello 3 times and then compressed, we could get {Hel{2}o}{3}, which is much more compact
  2. Compression also typically needs to insert some information about how the data was compressed in order to be able to recognize and decompress the data later. By compressing MyData 100 times and joining all of those instances, you're repeating that metadata 100 times
  3. Even after compressing your MyData instances, re-representing them as JSON decreases how compressed they are because it can't represent the binary data exactly. Instead, it has to convert each Data blob into a Base64-encoded string, which causes it to grow again

Between these issues, it's not terribly surprising that your data is growing. What you actually want is a modification to step 4, which is compressing the joined data:

guard let encodedArray = try? JSONEncoder().encode(myDataArray) else { fatalError() }
guard let compressedEncodedArray = try? (encodedArray as NSData).compressed(using: .lzma) else { fatalError() }
print(compressedEncodedArray.count) // => 520

This is significantly better than

guard let encodedCompressedArray = try? JSONEncoder().encode(compressedArray) else { fatalError() }
print(encodedCompressedArray.count) // => 66661

As an aside: it seems unlikely that you're actually using JSONEncoder in practice to join data in this way, and this was just for measurement here — but if you actually are, consider other mechanisms for doing this. Converting binary data to JSON in this way is very inefficient storage-wise, and with a bit more information about what you might actually need in practice, we might be able to recommend a more effective way to do this.

If what you're actually doing in practice is encoding an Encodable object tree and then compressing that the one time, that's totally fine.

Compressing a small amount of data

What are you hoping to accomplish by compressing 150 bits? Unless you aggregate several of this 19b messages, I'm not sure what you hope to gain. Is it a UI issue--wherein you want your users to send/receive "codes"?

How about base 64 encoding? This will take binary data and turn it into coded characters for easy transmission or entry.

What is the best file compression of random binary data that you can achieve?

If file sizes could be specified accurate to the bit, for any file size N, there would be precisely 2^(N+1)-1 possible files of N bits or smaller. In order for a file of size X to be mapped to some smaller size Y, some file of size Y or smaller must be mapped to a file of size X or larger. The only way lossless compression can work is if some possible files can be identified as being more probable than others; in that scenario, the likely files will be shrunk and the unlikely ones will grow.

As a simple example, suppose that one wishes to store losslessly a file in which the bits are random and independent, but instead of 50% of the bits being set, only 33% are. One could compress such a file by taking each pair of bits and writing "0" if both bits were clear, "10" if the first bit was set and the second one not, "110" if the second was set and the first not, or "111" if both bits were set. The effect would be that each pair of bits would become one bit 44% of the time, two bits 22% of the time, and three bits 33% of the time. While some strings of data would grow, others would shrink; the pairs that shrank would--if the probability distribution was as expected--outnumber those that grow (4/9 files would shrink by a bit, 2/9 would stay the same, and 3/9 would grow, so pairs would on average shrink by 1/9 bit, and files would on average shrink by 1/18 [since the 1/9 figure was bits per pair]).

Note that if the bits actually had a 50% distribution, then only 25% of pairs would become one bit, 25% would stay two bits, and 50% would become three bits. Consequently, 25% of bits would shrink and 50% would grow, so pairs on average would grow by 25%, and files would grow by 12.5%. The break-even point would be about 38.2% of bits being set (two minus the golden mean), which would yield 38.2% of bit pairs shrinking and the same percentage growing.

Achieving the best possible compression with data consisting of 0s and 1s?

I assume that you want to compress string into other strings even though your data really is binary. I don't know what the best compression algorithm is (and that will vary depending on your data) but you can convert the input text into bits, compress these and then convert the compressed bytes into a string again using base-64 encoding. This will allow you to go from string to string and still apply a compression algorithm of your choice.

The .NET framework provides the class DeflateStream that will allow you to compress a stream of bytes. The first step is to create a custom Stream that will allow you to read and write your text format. For lack of better name I have named it TextStream. Note that to simplify matters a bit I use \n as the line ending (instead of \r\n).

class TextStream : Stream {

readonly String text;

readonly Int32 bitsPerLine;

readonly StringBuilder buffer;

Int32 textPosition;

// Initialize a readable stream.
public TextStream(String text) {
if (text == null)
throw new ArgumentNullException("text");
this.text = text;
}

// Initialize a writeable stream.
public TextStream(Int32 bitsPerLine) {
if (bitsPerLine <= 0)
throw new ArgumentException();
this.bitsPerLine = bitsPerLine;
this.buffer = new StringBuilder();
}

public override Boolean CanRead { get { return this.text != null; } }

public override Boolean CanWrite { get { return this.buffer != null; } }

public override Boolean CanSeek { get { return false; } }

public override Int64 Length { get { throw new InvalidOperationException(); } }

public override Int64 Position {
get { throw new InvalidOperationException(); }
set { throw new InvalidOperationException(); }
}

public override void Flush() {
}

public override Int32 Read(Byte[] buffer, Int32 offset, Int32 count) {
// TODO: Validate buffer, offset and count.
if (!CanRead)
throw new InvalidOperationException();

var byteCount = 0;
Byte currentByte = 0;
var bitCount = 0;
for (; byteCount < count && this.textPosition < this.text.Length; this.textPosition += 1) {
if (text[this.textPosition] != '0' && text[this.textPosition] != '1')
continue;
currentByte = (Byte) ((currentByte << 1) | (this.text[this.textPosition] == '0' ? 0 : 1));
bitCount += 1;
if (bitCount == 8) {
buffer[offset + byteCount] = currentByte;
byteCount += 1;
currentByte = 0;
bitCount = 0;
}
}
if (bitCount > 0) {
buffer[offset + byteCount] = currentByte;
byteCount += 1;
}
return byteCount;
}

public override void Write(Byte[] buffer, Int32 offset, Int32 count) {
// TODO: Validate buffer, offset and count.
if (!CanWrite)
throw new InvalidOperationException();

for (var i = 0; i < count; ++i) {
var currentByte = buffer[offset + i];
for (var mask = 0x80; mask > 0; mask /= 2) {
if (this.buffer.Length > 0) {
if ((this.buffer.Length + 1)%(2*this.bitsPerLine) == 0)
this.buffer.Append('\n');
else
this.buffer.Append(',');
}
this.buffer.Append((currentByte & mask) == 0 ? '0' : '1');
}
}
}

public override String ToString() {
if (this.text != null)
return this.text;
else
return this.buffer.ToString();
}

public override Int64 Seek(Int64 offset, SeekOrigin origin) {
throw new InvalidOperationException();
}

public override void SetLength(Int64 length) {
throw new InvalidOperationException();
}

}

Then you can write methods for compressing and decompressing using DeflateStream. Note that the the uncompressed input is a string like the one you have provided in your question an the compressed output is a base-64 encoded string.

String Compress(String text) {
using (var inputStream = new TextStream(text))
using (var outputStream = new MemoryStream()) {
using (var compressedStream = new DeflateStream(outputStream, CompressionMode.Compress))
inputStream.CopyTo(compressedStream);
return Convert.ToBase64String(outputStream.ToArray());
}
}

String Decompress(String compressedText, Int32 bitsPerLine) {
var bytes = Convert.FromBase64String(compressedText);
using (var inputStream = new MemoryStream(bytes))
using (var outputStream = new TextStream(bitsPerLine)) {
using (var compressedStream = new DeflateStream(inputStream, CompressionMode.Decompress))
compressedStream.CopyTo(outputStream);
return outputStream.ToString();
}
}

To test it I used a method to create a random string (using a fixed seed to always create the same string):

String CreateRandomString(Int32 width, Int32 height) {
var random = new Random(0);
var stringBuilder = new StringBuilder();
for (var i = 0; i < width; ++i) {
for (var j = 0; j < height; ++j) {
if (i > 0 && j == 0)
stringBuilder.Append('\n');
else if (j > 0)
stringBuilder.Append(',');
stringBuilder.Append(random.Next(2) == 0 ? '0' : '1');
}
}
return stringBuilder.ToString();
}

Creating a random 4,096 x 4,096 string has an uncompressed size of 33,554,431 characters. This is compressed to 2,797,056 characters which is a reduction to about 8% of the original size.

Skipping the base-64 encoding would increase the compression ratio even more but the output would be binary and not a string. If you also consider the input as binary you actually get the following result for random data with equal probability of 0 and 1:


Input bytes: 4,096 x 4,096 / 8 = 2,097,152
Output bytes: 2,097,792
Size after compression: 100%

Simply converting to bytes is a better than doing that following by a deflate. However, using random input but with 25% 0 and 75% 1 you get this result:


Input bytes: 4,096 x 4,096 / 8 = 2,097,152
Output bytes: 1,757,846
Size after compression: 84%

How much deflate will compress your data really depends of the nature of the data. If it is completely random you wont be able to get much compression after converting from text to bytes.

Practical Compression of Random Data

My guess is that you're achieving compression on your random file because you're not using an optimal serialization technique, but without more details it's impossible to answer your question. Is the compressed file with n numbers in the range [0, k) less than n*log2(k) bits? (That is, n*log256(k) bytes). If so, does gzip manage to do that for all the random files you generate, or just occasionally?

Let me note one thing: suppose you say to me, "I've generated a file of random octets by using a uniform_int_distribution(0, 255) with the mt19937 prng [1]. What's the optimal compression of my file?" Now, my answer could reasonably be: "probably about 80 bits". All I need to reproduce your file is

  • the value you used to seed the prng, quite possibly a 32-bit integer [2]; and

  • the length of the file, which probably fits in 48 bits.

And if I can reproduce the file given 80 bits of data, that's the optimal compression. Unfortunately, that's not a general purpose compression strategy. It's highly unlikely that gzip will be able to figure out that you used a particular prng to generate the file, much less that it will be able to reverse-engineer the seed (although these things are, at least in theory, achievable; the Mersenne twister is not a cryptographically secure prng.)

For another example, it's generally recommended that text be compressed before being encrypted; the result will be quite a bit shorter than compressing after encryption. But the fact is that encryption adds very little entropy; at most, it adds the number of bits in the encryption key. Nonetheless, the resulting output is difficult to distinguish from random data, and gzip will struggle to compress it (although it often manages to squeeze a few bits out).


Note 1: Note: that's all c++11/boost lingo. mt19937 is an instance of the Mersenne twister pseudo-random number generator (prng), which has a period of 2^19937 - 1.

Note 2: The state of the Mersenne twister is actually 624 words (19968 bits), but most programs use somewhat fewer bits to seed it. Perhaps you used a 64-bit integer instead of a 32-bit integer, but it doesn't change the answer by much.

Best compression technique for binary data?

As external libraries were out fo the question, I created a custom solution for this. The system used run length encoding to compress the data, then the RLE encoded data was represented in base32 (32 characters for the zeroes, and the matching set for ones). This allowed us to represent files approximately 5MB in size with only around 30KB, without any loss.



Related Topics



Leave a reply



Submit