By ronag


2011-08-29 11:51:39 8 Comments

I've been trying to implement a fast Huffman decoder in order to encode/decode video. However, I'm barely able to decode a 1080p50 video using my decoder. On the other hand, there are lots of codecs in ffmpeg that entropy decode 4-8 times faster.

I have been trying to optimize and profile my code, but I don't think I can get it to run much faster. Does anyone have any suggestion as to how one can optimize Huffman decoding?

My profiler says my application is spending most of the time in the following code:

current = current->children + data_reader.next_bit();                           
*ptr    = current->value;
ptr     = ptr + current->step;

Here is the entire code:

void decode_huff(void* input, uint8_t* dest)
{
    struct node
    {
        node*               children; // 0 right, 1 left
        uint8_t             value;
        bool                step;
    };

    CACHE_ALIGN node                    nodes[512] = {};
    node*                               nodes_end   = nodes+1;      

    auto data = reinterpret_cast<unsigned long*>(input); 
    size_t table_size   = *(data++); // Size is first 32 bits.
    size_t num_comp     = *(data++); // Data size is second 32 bits.

    bit_reader table_reader(data);  
    unsigned char n_bits = ((table_reader.next_bit() << 2) | (table_reader.next_bit() << 1) | (table_reader.next_bit() << 0)) & 0x7; // First 3 bits are n_bits-1.

    // Unpack huffman-tree
    std::stack<node*> stack;
    stack.push(nodes);      // "nodes" is root
    while(!stack.empty())
    {
        node* ptr = stack.top();
        stack.pop();
        if(table_reader.next_bit())
        {
            ptr->step = true;
            ptr->children = nodes->children;
            for(int n = n_bits; n >= 0; --n)
                ptr->value |= table_reader.next_bit() << n;
        }
        else
        {
            ptr->children = nodes_end++;
            nodes_end++;

            stack.push(ptr->children+0);
            stack.push(ptr->children+1);
        }
    }       

    // Decode huffman-data

    // THIS IS THE SLOW PART            

    auto huffman_data   = reinterpret_cast<long*>(input) + (table_size+32)/32; 

    size_t data_size = *(huffman_data++); // Size is first 32 bits.     

    uint8_t* ptr = dest;
    auto current = nodes;

    bit_reader data_reader(huffman_data);

    size_t end = data_size - data_size % 4;
    while(data_reader.index() < end)
    { 
        current = current->children + data_reader.next_bit();                           
        *ptr    = current->value;
        ptr     = ptr + current->step;

        current = current->children + data_reader.next_bit();                           
        *ptr    = current->value;
        ptr     = ptr + current->step;

        current = current->children + data_reader.next_bit();                           
        *ptr    = current->value;
        ptr     = ptr + current->step;

        current = current->children + data_reader.next_bit();                           
        *ptr    = current->value;
        ptr     = ptr + current->step;
    }

    while(data_reader.index() < data_size)
    { 
        current = current->children + data_reader.next_bit();                           
        *ptr    = current->value;
        ptr     = ptr + current->step;
    }

    // If dest is not filled with num_comp, duplicate the last value.
    std::fill_n(ptr, num_comp - (ptr - dest), ptr == dest ? nodes->value : *(ptr-1));   
}

class bit_reader
{
public:
    typedef long block_type;
    static const size_t bits_per_block = sizeof(block_type)*8;
    static const size_t high_bit = 1 << (bits_per_block-1);

    bit_reader(void* data) 
        : data_(reinterpret_cast<block_type*>(data))
        , index_(0){}

    long next_bit()
    {
        const size_t block_index = index_ / bits_per_block;
        const size_t bit_index = index_ % bits_per_block;

        ++index_;

        return (data_[block_index] >> bit_index) & 1;
    }

    size_t index() const {return index_;}
private:    
    size_t index_;
    block_type* data_;
};

3 comments

@Glenn Davis 2013-12-28 23:30:41

The fastest way I've found is simple and very compact code-wise, especially if you're able to compress using canonical or semi-canonical Huffman codes. Start by reading about how to decode all the codeword bits simultaneously using Dan Hirschberg and Debra Lelewer's "Efficient Decoding of Prefix Codes" in Communications of the ACM, April 1990, Volume 33 Number 4 pages 449-459.

Over the years (decades, actually) I've developed further speed improvements based on that basic 'zero-extension' table idea, and I've done a lot of speed optimization in C.

Also, you might be able to set up the decoder to decode two consecutive codewords per table loop iteration. How much that helps depends on processor cache sizes, the translation lookaside buffer, bus speeds, chip architecture, and a lot of other things.

@Konrad Rudolph 2011-08-29 12:45:13

Without profiling, your next_bit performs a division and a modulus at each call (though the operations will most likely be coalesced into a single divmod).

I would lose this abstraction and unroll the main loop eight times (instead of four): once for each separate bit in a byte in the input. That way, you could completely omit division and just pick off bytes one at a time.

But the first step should certainly be an algorithmic improvement, as proposed by Tobias.

@Tobias Langner 2011-08-29 12:08:22

You seem to work on single bit basis. This is very slow. You have to combine several bits together, look up the appropriate pattern in the table and then continue from there (if necessary) in the tree.

You can see an outline to this solution presented here http://www.gzip.org/algorithm.txt. Of course, googling for "fast huffman decoding" reveals several papers on that subject as well. Reading them might be worthy as well.

Related Questions

Sponsored Content

2 Answered Questions

[SOLVED] Testing Video Frames for Uniqueness

2 Answered Questions

[SOLVED] Fast integer handling

2 Answered Questions

1 Answered Questions

[SOLVED] C++ Huffman coder and decoder

  • 2017-05-18 19:52:14
  • Yksuh
  • 1211 View
  • 8 Score
  • 1 Answer
  • Tags:   c++ compression

1 Answered Questions

[SOLVED] Huffman encoding implementation for Unicode

  • 2016-12-23 21:57:33
  • MAG
  • 181 View
  • 6 Score
  • 1 Answer
  • Tags:   rust compression

1 Answered Questions

[SOLVED] Huffman tree decoding

2 Answered Questions

[SOLVED] Huffman compressor in C++

0 Answered Questions

Data model for complex tree (multiple children and multiple parents)

1 Answered Questions

[SOLVED] Python Octree Implementation

1 Answered Questions

[SOLVED] Optimizing Huffman Decoding

Sponsored Content