
Encoding and Compression


Some common general-purpose algorithms used for data encoding and compression include:

  1. Huffman coding
  2. Run-length encoding
  3. Burrows-Wheeler transform
  4. Move-to-front transform
  5. Zstandard
  6. LZ77 and LZ78 (LZ1, LZ2)
  7. LZ4
  8. LZSS

Huffman coding

Huffman coding finds an optimal prefix code: given the frequency of each symbol, it encodes the message in the minimum possible number of bits for any per-symbol code. This doesn’t always make it the most efficient choice for compression, however.

Algorithm

  1. Start with an alphabet of N symbols.
  2. Count the number of occurrences of each symbol in the data (frequency).
  3. Sort the symbol table from minimum to maximum frequency.
  4. Assemble a Huffman binary tree by repeatedly combining the two nodes with
     the smallest frequencies into a new node whose frequency is their sum.
  5. Create the Huffman table by walking the tree from the root: one branch
     appends a 1 bit and the other appends a 0.

Example

Let’s encode MISSISSIPPI_RIVER. We have the following frequency table:

+--------+-----------+--------------+
| symbol | frequency | probability  |
+--------+-----------+--------------+
| M      | 1         | 1/17 ≈ 0.059 |
| I      | 5         | 5/17 ≈ 0.294 |
| S      | 4         | 4/17 ≈ 0.235 |
| P      | 2         | 2/17 ≈ 0.118 |
| R      | 2         | 2/17 ≈ 0.118 |
| E      | 1         | 1/17 ≈ 0.059 |
| V      | 1         | 1/17 ≈ 0.059 |
| _      | 1         | 1/17 ≈ 0.059 |
+--------+-----------+--------------+

Now let’s sort it in ascending frequency:

+--------+-----------+--------------+
| symbol | frequency | probability  |
+--------+-----------+--------------+
| M      | 1         | 1/17 ≈ 0.059 |
| E      | 1         | 1/17 ≈ 0.059 |
| V      | 1         | 1/17 ≈ 0.059 |
| _      | 1         | 1/17 ≈ 0.059 |
| P      | 2         | 2/17 ≈ 0.118 |
| R      | 2         | 2/17 ≈ 0.118 |
| S      | 4         | 4/17 ≈ 0.235 |
| I      | 5         | 5/17 ≈ 0.294 |
+--------+-----------+--------------+

Let’s start assembling the tree step by step:

M1,E1,V1,_1,P2,R2,S4,I5

          |
          V

2 -- M1
 `-- E1

          |
          V

2 -- M1
 `-- E1
2 -- V1
 `-- _1

          |
          V

2 -- M1
 `-- E1
2 -- V1
 `-- _1
4 -- P2
 `-- R2

          |
          V

4 -- 2 -- M1
|     `-- E1
 `-- 2 -- V1
      `-- _1
4 -- P2
 `-- R2

          |
          V

8 -- 4 -- 2 -- M1
|    |     `-- E1
|     `-- 2 -- V1
|          `-- _1
 `-- 4 -- P2
      `-- R2

          |
          V

9 -- I5
 `-- S4
8 -- 4 -- 2 -- M1
|    |     `-- E1
|     `-- 2 -- V1
|          `-- _1
 `-- 4 -- P2
      `-- R2

          |
          V

17 -- 9 -- I5
 |     `-- S4
  `-- 8 -- 4 -- 2 -- M1
      |    |     `-- E1
      |     `-- 2 -- V1
      |          `-- _1
       `-- 4 -- P2
            `-- R2

The following is the resulting Huffman tree including the binary path for each branch:

17 --*1*-- 9 --*1*-- I5
 |          `--*0*-- S4
  `--*0*-- 8 --*1*-- 4 --*1*-- 2 --*1*-- M1
           |         |          `--*0*-- E1
           |          `--*0*-- 2 --*1*-- V1
           |                    `--*0*-- _1
            `--*0*-- 4 --*1*-- P2
                      `--*0*-- R2

This tree yields the following encoding table:

+--------+-----------+--------------+------------+--------------+
| symbol | frequency | probability  | ASCII code | Huffman code |
+--------+-----------+--------------+------------+--------------+
| M      | 1         | 1/17 ≈ 0.059 | 1001101    | 0111         |
| E      | 1         | 1/17 ≈ 0.059 | 1000101    | 0110         |
| V      | 1         | 1/17 ≈ 0.059 | 1010110    | 0101         |
| _      | 1         | 1/17 ≈ 0.059 | 1011111    | 0100         |
| P      | 2         | 2/17 ≈ 0.118 | 1010000    | 001          |
| R      | 2         | 2/17 ≈ 0.118 | 1010010    | 000          |
| S      | 4         | 4/17 ≈ 0.235 | 1010011    | 10           |
| I      | 5         | 5/17 ≈ 0.294 | 1001001    | 11           |
+--------+-----------+--------------+------------+--------------+

Thus the message can be encoded as follows:

Before (119 bits):

M       I       S       S       I       S
1001101 1001001 1010011 1010011 1001001 1010011

S       I       P       P       I       _
1010011 1001001 1010000 1010000 1001001 1011111

R       I       V       E       R
1010010 1001001 1010110 1000101 1010010

After (46 bits):

M       I       S       S       I       S
0111    11      10      10      11      10

S       I       P       P       I       _
10      11      001     001     11      0100

R       I       V       E       R
000     11      0101    0110    000

Stats:

msg_entropy := H = 1 * 5/17 * math.log2(17/5) +
                   1 * 4/17 * math.log2(17/4) +
                   2 * 2/17 * math.log2(17/2) +
                   4 * 1/17 * math.log2(17) = 2.69866

avg_code_len := L = 2 * (5/17 + 4/17) +
                    3 * 2 * 2/17 +
                    4 * 4 * 1/17 = 46/17 = 2.70588

efficiency := H/L = 2.69866/2.70588 = 0.9973
compression_ratio := 119/46 = 2.59
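
As a sanity check, these statistics are easy to reproduce in code. The following small C program (a hypothetical sketch, not part of the original pseudo-code below) recomputes H, L, the efficiency, and the compression ratio from the frequency and code-length tables above:

#include <math.h>
#include <stdio.h>

int main(void) {
    // Frequencies and Huffman code lengths from the MISSISSIPPI_RIVER tables.
    const int freq[]     = {1, 1, 1, 1, 2, 2, 4, 5}; // M, E, V, _, P, R, S, I
    const int code_len[] = {4, 4, 4, 4, 3, 3, 2, 2};
    const int total = 17;

    double H = 0.0, L = 0.0;
    int bits = 0;
    for (int i = 0; i < 8; i++) {
        double p = (double)freq[i] / total;
        H += p * log2(1.0 / p);   // entropy contribution in bits/symbol
        L += p * code_len[i];     // average code length in bits/symbol
        bits += freq[i] * code_len[i];
    }
    printf("H = %.5f, L = %.5f, efficiency = %.4f\n", H, L, H / L);
    printf("encoded bits = %d, ratio = %.2f\n", bits, (7.0 * total) / bits);
    return 0;
}

Compiling with cc stats.c -lm and running it prints H = 2.69866, L = 2.70588, efficiency = 0.9973, and 46 encoded bits for a ratio of 2.59.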

Implementation

The algorithm can be implemented by using a hash table to count the frequency of each symbol and a priority queue to build the Huffman tree. The hash table is not necessary if we know all possible symbols in advance, as is the case with byte compression. The priority queue can be implemented with a min-heap. Here is some untested pseudo-code of how this could work in practice.

// We will byte-encode all symbols. ASCII symbols take only 1 byte.
msg = [M, I, S, S, I, S, S, I, P, P, I, _, R, I, V, E, R]
frequency[256] = {0}
for elem in msg {
    frequency[(u8)elem]++
}

struct Node {
    T symbol;
    u64 count;
    Node *left;
    Node *right;
}
// NOTE: For byte encoding we will need a maximum of 2 * 256 - 1 = 511 nodes
// (256 leaves plus 255 internal nodes), so we can preallocate this amount of
// memory to reduce the number of memory allocations. The maximum is reached
// whenever all 256 byte values appear in the input.
//
// Node nodes[512] = {0}

mh = min_heap(Node)
for i, freq in enumerate(frequency) {
    if freq != 0 {
        mh.insert(Node{i, freq, NULL, NULL})
    }
}

// Build the tree.
while mh.length() > 1 {
    node = Node{0}
    node.left = mh.pop_min()
    node.right = mh.pop_min()
    node.count = node.left.count + node.right.count
    mh.insert(node)
}
root = mh.pop_min() // Should have only 1 element left, the last one

// Create an encoding symbol table from the tree
// If we are encoding bytes, in the worst case scenario a code can take up to
// 256 bits worth of information, or 32 bytes per symbol (8x32bit, 4x64bit,
// 2x128bit, or 1x256bit values).
//
//     FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
//     FFFFFFFF FFFFFFFF FFFFFFFF FFFFFFFF
//
// Here we instantiate the table with those worst case constraints.
table = byte[256][32] // OR: 256 elements of type struct {u8[32] code; u8 n_bits;}

// Walk the tree to get the code for each symbol at the leaves
current_code = byte[32]
void fill_table(root, table, current_code) {
    if root.left == NULL && root.right == NULL {
        table[root.symbol] = current_code // NOTE: Full copy, we might also want to store the number of bits for this code
        return
    }
    fill_table(root.left, table, current_code << 1)
    fill_table(root.right, table, (current_code << 1) | 1)
}

// Use the table for encoding the message in the channel
channel = byte[CHANNEL_LENGTH] // NOTE: Make sure we don't overflow the channel
// NOTE: We could probably speed this up quite a bit if we used 256-bit SIMD
// registers instead of doing bit shifts per byte.
for symbol in msg {
    channel = (channel << table[symbol].n_bits) | table[symbol].code
}

The message could be encoded in several ways once we have the table. Here is how the message would look in a channel, separated into bytes. Note how some symbols are split at the byte boundary. Also note that this implementation breaks ties differently than the manual walkthrough above, so the codes below differ from the earlier table, but their lengths (and therefore the compressed size) are identical.

Code table:

symbols | M   I   S   S   I   S   S   I   P   P   I   _   R   I   V   E   R   |
hex     | 0xb 0x3 0x0 0x0 0x3 0x0 0x0 0x3 0x4 0x4 0x3 0x7 0x2 0x3 0xa 0x6 0x2 |
n_bits  | 4   2   2   2   2   2   2   2   3   3   2   4   3   2   4   4   3   |


Channel:

symbols | M   I S  | S I S S  | I P  P   | I _   R  |  I V   E |    R     |
binary  | 10111100 | 00110000 | 11100100 | 11011101 | 01110100 | 110010xx |
hex     | 0xbc     | 0x30     | 0xe4     | 0xdd     | 0x74     | 0xc8     |
                                                                       ^
                                                                      EOM
Stats:
    n_symbols = 17
    n_bits = 46
    min_code_length = 2
    max_code_length = 4
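
The byte-boundary handling above is the fiddly part of any implementation. Here is a minimal C sketch of MSB-first bit packing; the function name and buffer layout are hypothetical, and codes are assumed right-aligned in an integer:

#include <stddef.h>
#include <stdint.h>

// Append the n_bits lowest bits of code to buf, MSB-first. bit_pos tracks the
// total number of bits already written, so a code may straddle a byte
// boundary exactly like the R and E symbols in the channel diagram above.
// buf is assumed zero-initialized and large enough for the whole message.
void put_code(uint8_t *buf, size_t *bit_pos, uint32_t code, int n_bits) {
    for (int i = n_bits - 1; i >= 0; i--) {
        uint8_t bit = (code >> i) & 1;
        buf[*bit_pos / 8] |= (uint8_t)(bit << (7 - (*bit_pos % 8)));
        (*bit_pos)++;
    }
}

Packing M (0xb, 4 bits), I (0x3, 2 bits), and S (0x0, 2 bits) this way produces 10111100, the first channel byte 0xbc.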

Canonical codes

Once we have Huffman codes for all symbols in the data, we can obtain new canonical codes by following the procedure below. With canonical codes, we don’t need to retransmit the entire translation table; instead we only need to send the number of levels and the number of bits per level.

When compressing binary data, the worst case scenario if we were to transmit the entire table would be to store both the code (max 32 bytes) plus an extra byte for the number of bits (1-255) for all 256 entries, resulting in 33 * 256 = 8448 bytes (roughly 8KB) that we would need to send with every compressed block for table reconstruction (and that is without headers). If we instead use canonical codes, we may reconstruct the table with only B * 2^B bits of information sent on the channel: for B = 8 that is one 8-bit code length per byte value, or 256 bytes in total.

To help us understand canonical Huffman codes, note that we can select any rules we want, as long as they are applied consistently so the table can be recovered from the number of bits alone. For example, we can say that:

  1. The longest codes start at all zeroes and increment numerically within a
     level.
  2. Within a level, codes are assigned to symbols in alphabetical order.
  3. Each shorter level starts at the position after the last code of the
     previous level, right shifted by one bit.

Using the previously mentioned message, we obtain the following number of bits for each symbol.

+--------+--------+
| symbol | n_bits |
+--------+--------+
| M      | 4      |
| E      | 4      |
| V      | 4      |
| _      | 4      |
| P      | 3      |
| R      | 3      |
| S      | 2      |
| I      | 2      |
+--------+--------+

Canonical codes that follow the aforementioned rules can be generated by detecting the starting points for each code level. To do so, the following algorithm can be used:

1. Start with the longest codes being all zeroes and increment numerically. In
   our example there are 4 codes with 4 bits [0000-0011].
2. The start of each subsequent level can be found by taking the last code in
   the previous level, adding 1, and right shifting this value to remove the
   least significant bit.


starting code of the previous level := prev_start
number of codes in previous level := prev_m

    start = (prev_start + prev_m) >> 1
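
Applied to our example: the 4-bit level starts at 0000 and holds 4 codes, so the 3-bit level starts at (0000 + 4) >> 1 = 010; that level holds 2 codes, so the 2-bit level starts at (010 + 2) >> 1 = 10.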

If we now sort the symbols in alphabetical order within each bit level, we can generate the following canonical codes.

+--------+--------+------------+--------+--------------+------+
| symbol | n_bits | prev_start | prev_m | initial_code | code |
+--------+--------+------------+--------+--------------+------+
| E      | 4      | 0000       | 0      | 0000         | 0000 |
| M      | 4      | 0000       | 0      | 0000         | 0001 |
| V      | 4      | 0000       | 0      | 0000         | 0010 |
| _      | 4      | 0000       | 0      | 0000         | 0011 |
| P      | 3      | 0000       | 4      | 010          | 010  |
| R      | 3      | 0000       | 4      | 010          | 011  |
| I      | 2      | 010        | 2      | 10           | 10   |
| S      | 2      | 010        | 2      | 10           | 11   |
+--------+--------+------------+--------+--------------+------+
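
Here is a minimal C sketch of the code generation (untested, in the spirit of the pseudo-code above), assuming the (symbol, n_bits) pairs are already sorted by descending n_bits and alphabetically within each level, as in the table:

#include <stdio.h>

// Generate canonical codes from the code lengths alone, following the rules
// above: the deepest level starts at all zeroes, codes increment within a
// level, and each shorter level starts at (prev_start + prev_m) >> 1.
int main(void) {
    const char symbols[] = {'E', 'M', 'V', '_', 'P', 'R', 'I', 'S'};
    const int  n_bits[]  = { 4,   4,   4,   4,   3,   3,   2,   2 };

    unsigned code = 0; // the deepest level starts at all zeroes
    for (int i = 0; i < 8; i++) {
        if (i > 0 && n_bits[i] != n_bits[i - 1]) {
            // Moving up to a shorter level. After the increments below, code
            // already equals prev_start + prev_m, so we only need the right
            // shift (once per dropped bit).
            code >>= n_bits[i - 1] - n_bits[i];
        }
        printf("%c -> ", symbols[i]);
        for (int b = n_bits[i] - 1; b >= 0; b--) {
            putchar('0' + ((code >> b) & 1));
        }
        putchar('\n');
        code++; // next code within the same level
    }
    return 0;
}

Running this reproduces the code column of the table above, and a decoder can run the same loop, so only the (symbol, n_bits) pairs ever need to travel on the channel.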

Resources