2011-08-29 11:51:39 8 Comments

I've been trying to implement a fast Huffman decoder in order to encode/decode video. However, I'm barely able to decode a 1080p50 video using my decoder. On the other hand, there are lots of codecs in ffmpeg that entropy decode 4-8 times faster.

I have been trying to optimize and profile my code, but I don't think I can get it to run much faster. Does anyone have any suggestion as to how one can optimize Huffman decoding?

My profiler says my application is spending most of the time in the following code:

```
current = current->children + data_reader.next_bit();
*ptr = current->value;
ptr = ptr + current->step;
```

Here is the entire code:

```
void decode_huff(void* input, uint8_t* dest)
{
struct node
{
node* children; // 0 right, 1 left
uint8_t value;
bool step;
};
CACHE_ALIGN node nodes[512] = {};
node* nodes_end = nodes+1;
auto data = reinterpret_cast<unsigned long*>(input);
size_t table_size = *(data++); // Size is first 32 bits.
size_t num_comp = *(data++); // Data size is second 32 bits.
bit_reader table_reader(data);
unsigned char n_bits = ((table_reader.next_bit() << 2) | (table_reader.next_bit() << 1) | (table_reader.next_bit() << 0)) & 0x7; // First 3 bits are n_bits-1.
// Unpack huffman-tree
std::stack<node*> stack;
stack.push(nodes); // "nodes" is root
while(!stack.empty())
{
node* ptr = stack.top();
stack.pop();
if(table_reader.next_bit())
{
ptr->step = true;
ptr->children = nodes->children;
for(int n = n_bits; n >= 0; --n)
ptr->value |= table_reader.next_bit() << n;
}
else
{
ptr->children = nodes_end++;
nodes_end++;
stack.push(ptr->children+0);
stack.push(ptr->children+1);
}
}
// Decode huffman-data
// THIS IS THE SLOW PART
auto huffman_data = reinterpret_cast<long*>(input) + (table_size+32)/32;
size_t data_size = *(huffman_data++); // Size is first 32 bits.
uint8_t* ptr = dest;
auto current = nodes;
bit_reader data_reader(huffman_data);
size_t end = data_size - data_size % 4;
while(data_reader.index() < end)
{
current = current->children + data_reader.next_bit();
*ptr = current->value;
ptr = ptr + current->step;
current = current->children + data_reader.next_bit();
*ptr = current->value;
ptr = ptr + current->step;
current = current->children + data_reader.next_bit();
*ptr = current->value;
ptr = ptr + current->step;
current = current->children + data_reader.next_bit();
*ptr = current->value;
ptr = ptr + current->step;
}
while(data_reader.index() < data_size)
{
current = current->children + data_reader.next_bit();
*ptr = current->value;
ptr = ptr + current->step;
}
// If dest is not filled with num_comp, duplicate the last value.
std::fill_n(ptr, num_comp - (ptr - dest), ptr == dest ? nodes->value : *(ptr-1));
}
class bit_reader
{
public:
typedef long block_type;
static const size_t bits_per_block = sizeof(block_type)*8;
static const size_t high_bit = 1 << (bits_per_block-1);
bit_reader(void* data)
: data_(reinterpret_cast<block_type*>(data))
, index_(0){}
long next_bit()
{
const size_t block_index = index_ / bits_per_block;
const size_t bit_index = index_ % bits_per_block;
++index_;
return (data_[block_index] >> bit_index) & 1;
}
size_t index() const {return index_;}
private:
size_t index_;
block_type* data_;
};
```

### Related Questions

#### Sponsored Content

#### 2 Answered Questions

### [SOLVED] Testing Video Frames for Uniqueness

**2018-10-15 17:46:40****Trevor****103**View**3**Score**2**Answer- Tags: c++ edit-distance opencv video

#### 2 Answered Questions

### [SOLVED] Fast integer handling

**2013-07-14 05:37:28****Kevin Melkowski****373**View**2**Score**2**Answer- Tags: c++ algorithm performance binary-search hash-table

#### 2 Answered Questions

### [SOLVED] C++ huffman encoding - constructing encoded bytes efficiently

**2017-12-17 11:17:59****Tsaras****886**View**1**Score**2**Answer- Tags: c++ performance strings compression

#### 1 Answered Questions

### [SOLVED] C++ Huffman coder and decoder

**2017-05-18 19:52:14****Yksuh****1211**View**8**Score**1**Answer- Tags: c++ compression

#### 1 Answered Questions

### [SOLVED] Huffman encoding implementation for Unicode

**2016-12-23 21:57:33****MAG****181**View**6**Score**1**Answer- Tags: rust compression

#### 1 Answered Questions

### [SOLVED] Huffman tree decoding

**2016-12-21 05:16:53****CodeYogi****2814**View**3**Score**1**Answer- Tags: java performance algorithm programming-challenge

#### 2 Answered Questions

### [SOLVED] Huffman compressor in C++

**2016-12-02 06:55:07****coderodde****875**View**4**Score**2**Answer- Tags: c++ console compression

#### 0 Answered Questions

### Data model for complex tree (multiple children and multiple parents)

**2016-06-09 12:20:15****Elfayer****1608**View**3**Score**0**Answer- Tags: javascript performance tree

#### 1 Answered Questions

### [SOLVED] Python Octree Implementation

**2016-04-23 16:25:51****David de la Iglesia****2367**View**6**Score**1**Answer- Tags: python performance algorithm numpy numba

#### 1 Answered Questions

### [SOLVED] Optimizing Huffman Decoding

**2011-08-31 20:43:44****ronag****1680**View**6**Score**1**Answer- Tags: c++ optimization lookup compression video

## 3 comments

## @Glenn Davis 2013-12-28 23:30:41

The fastest way I've found is simple and very compact code-wise, especially if you're able to compress using canonical or semi-canonical Huffman codes. Start by reading about how to decode all the codeword bits simultaneously using Dan Hirschberg and Debra Lelewer's "Efficient Decoding of Prefix Codes" in Communications of the ACM, April 1990, Volume 33 Number 4 pages 449-459.

Over the years (decades, actually) I've developed further speed improvements based on that basic 'zero-extension' table idea, and I've done a lot of speed optimization in C.

Also, you might be able to set up the decoder to decode two consecutive codewords per table loop iteration. How much that helps depends on processor cache sizes, the translation lookaside buffer, bus speeds, chip architecture, and a lot of other things.

## @Konrad Rudolph 2011-08-29 12:45:13

Without profiling, your

`next_bit`

performs a division and a modulus at each call (though the operations will most likely be coalesced into a single`divmod`

).I would lose this abstraction and unroll the main loop eight times (instead of four): once for each separate bit in a byte in the input. That way, you could completely omit division and just pick off bytes one at a time.

But the first step should certainly be an algorithmic improvement, as proposed by Tobias.

## @Tobias Langner 2011-08-29 12:08:22

You seem to work on single bit basis. This is very slow. You have to combine several bits together, look up the appropriate pattern in the table and then continue from there (if necessary) in the tree.

You can see an outline to this solution presented here http://www.gzip.org/algorithm.txt. Of course, googling for "fast huffman decoding" reveals several papers on that subject as well. Reading them might be worthy as well.