Efficient Way of Storing Huffman Tree

Since you already have to implement code to handle a bit-wise layer on top of your byte-organized stream/file, here's my proposal.

Do not store the actual frequencies, they're not needed for decoding. You do, however, need the actual tree.

So for each node, starting at root:

  1. If leaf-node: Output 1-bit + N-bit character/byte
  2. If not leaf-node, output 0-bit. Then encode both child nodes (left first then right) the same way

To read, do this:

  1. Read bit. If 1, then read N-bit character/byte, return new node around it with no children
  2. If bit was 0, decode left and right child-nodes the same way, and return new node around them with those children, but no value

A leaf-node is basically any node that doesn't have children.

With this approach, you can calculate the exact size of your output before writing it, to figure out if the gains are enough to justify the effort. This assumes you have a dictionary of key/value pairs that contains the frequency of each character, where frequency is the actual number of occurrences.

Pseudo-code for calculation:

Tree-size = NUMBER_OF_CHARACTERS * (1 + SIZE_OF_ONE_CHARACTER) + (NUMBER_OF_CHARACTERS - 1)
Encoded-size = Sum(for each char,freq in table: freq * len(PATH(char)))

The tree-size calculation takes both the leaf and the internal nodes into account: each leaf costs 1 bit plus the character itself, each internal node costs 1 bit, and there is one less internal node than there are characters.

SIZE_OF_ONE_CHARACTER is the number of bits per character (8 for plain bytes, which gives Tree-size = 10 * NUMBER_OF_CHARACTERS - 1), and those two formulas together give you the total number of bits that my approach for the tree + the encoded data will occupy.

PATH(c) is a function/table that yields the bit-path from the root down to that character in the tree.
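
As a sketch of this calculation (the names here are illustrative, not from any particular library), the encoded size can be computed directly from the frequency table, without materializing the tree or the paths, because each two-smallest merge in the Huffman construction adds one bit to the code of every symbol beneath the merged nodes:

```java
import java.util.Map;
import java.util.PriorityQueue;

public class SizeEstimate {
    // Tree-size: each of the N leaves costs 1 bit + 8 bits for its byte,
    // and each of the N - 1 internal nodes costs 1 bit: 10 * N - 1 total.
    static int treeSizeBits(int numberOfCharacters) {
        return 10 * numberOfCharacters - 1;
    }

    // Encoded-size: sum of freq * len(PATH(char)). Each merge of the two
    // smallest weights lengthens the codes of all symbols under the merged
    // nodes by one bit, i.e. contributes (weight a + weight b) bits.
    static int encodedSizeBits(Map<Character, Integer> frequencies) {
        PriorityQueue<Integer> queue = new PriorityQueue<>(frequencies.values());
        int totalBits = 0;
        while (queue.size() > 1) {
            int a = queue.poll(), b = queue.poll();
            totalBits += a + b;
            queue.add(a + b);
        }
        return totalBits;
    }
}
```

For the AAAAAABCCCCCCDDEEEEE example further down, this yields 49 bits of tree and 43 bits of encoded data.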

Here's a C#-looking pseudo-code to do it, which assumes one character is just a simple byte.

void EncodeNode(Node node, BitWriter writer)
{
    if (node.IsLeafNode)
    {
        writer.WriteBit(1);
        writer.WriteByte(node.Value);
    }
    else
    {
        writer.WriteBit(0);
        EncodeNode(node.LeftChild, writer);
        EncodeNode(node.RightChild, writer);
    }
}

To read it back in:

Node ReadNode(BitReader reader)
{
    if (reader.ReadBit() == 1)
    {
        return new Node(reader.ReadByte(), null, null);
    }
    else
    {
        Node leftChild = ReadNode(reader);
        Node rightChild = ReadNode(reader);
        return new Node(0, leftChild, rightChild);
    }
}

An example (simplified, use properties, etc.) Node implementation:

public class Node
{
    public Byte Value;
    public Node LeftChild;
    public Node RightChild;

    public Node(Byte value, Node leftChild, Node rightChild)
    {
        Value = value;
        LeftChild = leftChild;
        RightChild = rightChild;
    }

    public Boolean IsLeafNode
    {
        get
        {
            return LeftChild == null;
        }
    }
}

Here's a sample output from a specific example.

Input: AAAAAABCCCCCCDDEEEEE

Frequencies:

  • A: 6
  • B: 1
  • C: 6
  • D: 2
  • E: 5

Each character is just 8 bits, so the size of the tree will be 10 * 5 - 1 = 49 bits.

The tree could look like this:

      20
  ----------
  |        8
  |     -------
 12     |     3
-----   |   -----
A   C   E   B   D
6   6   5   1   2

So the paths to each character are as follows (0 is left, 1 is right):

  • A: 00
  • B: 110
  • C: 01
  • D: 111
  • E: 10

So to calculate the output size:

  • A: 6 occurrences * 2 bits = 12 bits
  • B: 1 occurrence * 3 bits = 3 bits
  • C: 6 occurrences * 2 bits = 12 bits
  • D: 2 occurrences * 3 bits = 6 bits
  • E: 5 occurrences * 2 bits = 10 bits

Sum of the encoded data is 12 + 3 + 12 + 6 + 10 = 43 bits

Add that to the 49 bits from the tree, and the output will be 92 bits, or 12 bytes. Compare that to the 20 * 8 = 160 bits (20 bytes) necessary to store the original 20 characters unencoded, and you'll save 8 bytes.

The final output, with the tree first, is as follows. Each character in the stream (A-E) is encoded as 8 bits, whereas each 0 and 1 is just a single bit. The space in the stream only separates the tree from the encoded data and does not take up any space in the final output.

001A1C01E01B1D 0000000000001100101010101011111111010101010
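
A small sketch that reproduces this output (the Node shape follows the earlier example; symbols are appended as plain letters for readability, where a real writer would emit their 8 bits):

```java
import java.util.HashMap;
import java.util.Map;

public class StreamDemo {
    static final class Node {
        final char value;
        final Node left, right;
        Node(char value) { this.value = value; left = right = null; }  // leaf
        Node(Node left, Node right) { value = 0; this.left = left; this.right = right; }
        boolean isLeaf() { return left == null; }
    }

    // The example tree: ((A C) (E (B D)))
    static Node exampleTree() {
        return new Node(new Node(new Node('A'), new Node('C')),
                        new Node(new Node('E'),
                                 new Node(new Node('B'), new Node('D'))));
    }

    // Pre-order: "1" + symbol for a leaf, "0" then both children otherwise.
    static String encodeTree(Node node) {
        if (node.isLeaf()) return "1" + node.value;
        return "0" + encodeTree(node.left) + encodeTree(node.right);
    }

    // Collect the root-to-leaf paths (0 = left, 1 = right), then encode the text.
    static String encodeData(Node root, String text) {
        Map<Character, String> paths = new HashMap<>();
        collectPaths(root, "", paths);
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) bits.append(paths.get(c));
        return bits.toString();
    }

    static void collectPaths(Node node, String prefix, Map<Character, String> out) {
        if (node.isLeaf()) { out.put(node.value, prefix); return; }
        collectPaths(node.left, prefix + "0", out);
        collectPaths(node.right, prefix + "1", out);
    }
}
```

Here encodeTree(exampleTree()) produces "001A1C01E01B1D", and encodeData(exampleTree(), "AAAAAABCCCCCCDDEEEEE") produces the 43-bit string shown above.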

For the concrete example you have in the comments, AABCDEF, you will get this:

Input: AABCDEF

Frequencies:

  • A: 2
  • B: 1
  • C: 1
  • D: 1
  • E: 1
  • F: 1

Tree:

        7
  -------------
  3           4
-----     ---------
A   B     2       2
2   1   -----   -----
        C   D   E   F
        1   1   1   1

Paths:

  • A: 00
  • B: 01
  • C: 100
  • D: 101
  • E: 110
  • F: 111

Tree: 001A1B001C1D01E1F = 59 bits

Data: 000001100101110111 = 18 bits

Sum: 59 + 18 = 77 bits = 10 bytes

Since the original was 7 characters of 8 bits = 56 bits, the encoded output is actually larger; the tree overhead is simply too great for such small pieces of data.

Storing and reconstruction of Huffman tree

You almost certainly do not need to store the tree itself. You could, and it wouldn't take as much space as you might think, but it's not generally necessary.

If your Huffman codes are canonical, you need only store the bit-lengths for each symbol, as this is all the information required to generate a canonical coding. This is a relatively small number of bits per symbol, so it should be fairly compact. You can also compress that information further (see the answer from Aki Suihkonen).

Naturally the bit-length of a code is essentially the same as the tree depth, so I think this is roughly what you're asking about. The important part is to know how to build a canonical code, given the lengths - it's not necessarily the same as the codes produced by traversing the tree. You could regenerate a tree from this, but it's not necessarily the tree you started with - however typically you don't need the tree other than to determine the code lengths in the first place.

The algorithm for generating canonical codes is fairly simple:

  1. Take all the symbols you want to generate codes for, sorted first by code-length (shortest first), and then by the symbol itself.
  2. Start with a zero-length code.
  3. If the next symbol requires more bits than are currently in the code, add zeros to the right (least significant bits) of your code until it's the right length.
  4. Associate the code with the current symbol, and increment the code.
  5. Loop back to (3) until you have generated all the symbols.
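
A sketch of those steps (the names are illustrative), taking a map of per-symbol code lengths and returning the canonical codes as bit-strings:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CanonicalCodes {
    static Map<Character, String> canonical(Map<Character, Integer> lengths) {
        // 1. Sort by code length (shortest first), then by the symbol itself.
        List<Character> symbols = new ArrayList<>(lengths.keySet());
        symbols.sort((x, y) -> {
            int byLength = lengths.get(x) - lengths.get(y);
            return byLength != 0 ? byLength : Character.compare(x, y);
        });
        Map<Character, String> codes = new LinkedHashMap<>();
        int code = 0, codeLength = 0;  // 2. Start with a zero-length code.
        for (char symbol : symbols) {
            // 3. Append zeros (least significant bits) until it's long enough.
            while (codeLength < lengths.get(symbol)) {
                code <<= 1;
                codeLength++;
            }
            // 4. Associate the code with the symbol, then increment the code.
            StringBuilder bits = new StringBuilder(Integer.toBinaryString(code));
            while (bits.length() < codeLength) bits.insert(0, '0');
            codes.put(symbol, bits.toString());
            code++;
        }
        return codes;
    }
}
```

For "banana" with lengths a=1, b=2, n=2, this yields a=0, b=10, n=11, matching the canonical code in the example below.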

Take the string "banana". Obviously there are 3 symbols used, 'b', 'a', and 'n', with counts of 1, 3, and 2, respectively.

So the tree might look like this:


    *
   / \
  *   a
 / \
b   n

Naively, that could give codes:


a = 1
b = 00
n = 01

However if instead you simply use the bit-lengths as input to canonical code generation, you would produce this:


a = 0
b = 10
n = 11

It's a different code, but obviously it would produce the same length of compressed output. Furthermore, you only need to store the code-lengths in order to reproduce the code.

So you only need to store a sequence:


0... 1 2 0... 2 0...

Where "..." represents easily compressible repetition, and the values will all be quite small (probably only 4 bits each - and note that the symbols aren't stored at all). This representation will be very compact.

If you really must store the tree itself, one technique is to traverse the tree and store a single bit to indicate whether a node is internal or a leaf, and then for leaf nodes, store the symbol. This is fairly compact for trees which do not contain every symbol, and not too bad even for fairly complete trees. The worst case size for this would be the total size of all your symbols, plus as many single bits as you could have nodes. For a standard 8-bit byte stream, that would be 320 bytes (256 bytes for the symbols, 511 bits for the tree structure itself).

The method is to start at the root node, and for each node:

  • If the node is a parent, output a 0 and then output the left then right children.
  • If the node is a leaf, output a 1 and then output the symbol.

To reconstruct, perform a similar recursive procedure, but obviously reading the data and choosing whether to recursively create children, or read in a symbol, as appropriate.

For the example above, the bit-stream for the tree would be something like:


0, 0, 1, 'b', 1, 'n', 1, 'a'

That's 5 bits for the tree, plus 3 bytes for the symbols, rounding up to 4 bytes of storage. However it will grow rapidly as you add more symbols, whereas storing the code-lengths does not.

Efficient Huffman tree search while remembering path taken

Create a dictionary of value -> bit-string, that would give you the fastest lookup.

If the values are a known size, you can probably get by with just an array of bit-strings and look up the values by their index.
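
For byte-valued symbols, that array variant might look like this (a sketch; the code table itself comes from wherever you built your codes):

```java
public class CodeTable {
    // Index an array of 256 bit-strings directly by the byte value.
    // Entries for symbols that never occur simply stay null.
    static String[] toArray(java.util.Map<Character, String> codes) {
        String[] table = new String[256];
        for (java.util.Map.Entry<Character, String> entry : codes.entrySet()) {
            table[entry.getKey()] = entry.getValue();
        }
        return table;
    }
}
```

A lookup like table['a'] is then a single array access, with no hashing at all.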

How to save Huffman tree in file?

Write the tree as a series of bits: 0 represents a leaf, 1 represents an internal node. The output for a binary tree (Huffman or otherwise) with N leaf nodes and N-1 internal nodes will be a sequence of 2N-1 bits. (You can actually save two bits, since you know that the first and last nodes in the tree will be leaf nodes, but it's probably not worth complicating the algorithm to save two bits.)

Perhaps easiest is to arrange the bits in pre-order:

function write_tree (top_node) {
    if is_leaf(top_node) {
        write "0"
        // optionally, write any data associated with the leaf node,
        // although in practice it's easier to write the leaf data
        // to a separate output stream. That lets this stream contain
        // actual bits rather than the characters "0"/"1".
    }
    else {
        write "1"
        write_tree (top_node.left)
        write_tree (top_node.right)
    }
}

function read_tree (bit_stream) -> returns tree {
    next_bit = bit_stream.read()
    if next_bit = "0" {
        root = new leaf
        // optionally read data associated with the leaf node
    }
    else {
        root = new internal node
        root.left = read_tree (bit_stream)
        root.right = read_tree (bit_stream)
    }
    return root
}

I didn't notice at first that you mentioned Java, so I wrote the above in pseudo-code, which I'm sure you'll have no trouble re-writing in Java.
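
Since Java was mentioned, here is one possible translation of the pseudo-code above. It keeps the same 0 = leaf, 1 = internal convention; for simplicity the bits are written as '0'/'1' characters to a StringBuilder and the leaf data is kept inline, where a real implementation would use a bit stream:

```java
public class TreeIO {
    static final class Node {
        char data;
        Node left, right;
        boolean isLeaf() { return left == null; }
    }

    static Node leaf(char data) { Node n = new Node(); n.data = data; return n; }
    static Node internal(Node left, Node right) { Node n = new Node(); n.left = left; n.right = right; return n; }

    // Pre-order: "0" + symbol for a leaf, "1" then both children otherwise.
    static void writeTree(Node topNode, StringBuilder out) {
        if (topNode.isLeaf()) {
            out.append('0').append(topNode.data);
        } else {
            out.append('1');
            writeTree(topNode.left, out);
            writeTree(topNode.right, out);
        }
    }

    // pos[0] is a mutable cursor into the encoded string.
    static Node readTree(String bits, int[] pos) {
        if (bits.charAt(pos[0]++) == '0') {
            return leaf(bits.charAt(pos[0]++));  // leaf: next char is its symbol
        }
        return internal(readTree(bits, pos), readTree(bits, pos));
    }
}
```

Writing the "banana" tree from the example gives "110b0n0a", and reading that string back yields a tree that writes out identically.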

How to store and subsequently decode Huffman table

All you need to send are the number of bits in the code for each symbol. Zero can indicate that the symbol is not present and so was not coded. The number of bits can itself be Huffman coded and run-length encoded, as done in the deflate format.

Given that set of code lengths, you just need to make sure that the bit sequences are assigned the same way on both ends. This is done using a canonical Huffman code, where the codes are assigned in symbol order within each bit length.

Will Serialization Help in Storing a Huffman Tree To A File

You don't need to transfer the tree. Once you have the code lengths for each symbol, discard the tree. You can then construct a canonical code from the lengths and an ordering of the symbols. You would then transmit only the lengths to the decoder, and the decoder would construct the same canonical code from just the lengths.

What is the most efficient(*) way of building a canonical huffman tree?

If the frequencies form a monotonic sequence, i.e. A[0] <= A[1] <= ... <= A[n-1] or A[0] >= A[1] >= ... >= A[n-1], then you can generate optimal code lengths in O(n) time and O(1) additional space. The algorithm requires only two simple passes over the array and is very fast. A full description is given in [1].

If your frequencies aren't sorted, you first need to sort them and then apply the above algorithm. In this case the time complexity is O(n log n), and an auxiliary array of n integers is needed to store the sorted order, giving O(n) space complexity.

[1] Alistair Moffat and Jyrki Katajainen, "In-Place Calculation of Minimum-Redundancy Codes", available online: http://www.diku.dk/~jyrki/Paper/WADS95.pdf
