By coderodde


2016-11-27 16:58:05 8 Comments

This post improves upon my Java implementation of the Huffman algorithm for data compression.

I did the following improvements:

  1. The map mapping each byte value to its frequency does not rely on float values, but rather on integers representing absolute counts. I am not quite sure, but it seems like a more robust way of representing frequencies due to rounding features of floating-point numbers.

  2. The actual frequency map has moved from the HashMap to the sorted TreeMap, which allows me easily make this implementation compatible with the implementations written in other languages (I am working on C++ one).

  3. Decoding of the text uses directly the Huffman tree.

  4. Renamed InvalidFileFormatException to InvalidFormatException.

Code

App.java

package net.coderodde.app.huffman;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import net.coderodde.compression.huffman.BitString;
import net.coderodde.compression.huffman.ByteCountComputer;
import net.coderodde.compression.huffman.HuffmanDecoder;
import net.coderodde.compression.huffman.HuffmanDeserializer;
import net.coderodde.compression.huffman.HuffmanEncoder;
import net.coderodde.compression.huffman.HuffmanSerializer;
import net.coderodde.compression.huffman.HuffmanTree;

public final class App {

    private static final String ENCODE_OPTION_SHORT = "-e";
    private static final String ENCODE_OPTION_LONG  = "--encode";
    private static final String DECODE_OPTION_SHORT = "-d";
    private static final String DECODE_OPTION_LONG  = "--decode";
    private static final String HELP_OPTION_SHORT = "-h";
    private static final String HELP_OPTION_LONG  = "--help";
    private static final String VERSION_OPTION_SHORT = "-v";
    private static final String VERSION_OPTION_LONG  = "--version";
    private static final String ENCODED_FILE_EXTENSION = "het";

    public static void main(String[] args) {
        Set<String> commandLineArgumentSet = getCommandLineOptions(args);

        if (commandLineArgumentSet.isEmpty() 
                || commandLineArgumentSet.contains(HELP_OPTION_LONG)
                || commandLineArgumentSet.contains(HELP_OPTION_SHORT)) {
            printHelpMessage();
            System.exit(0);
        }

        if (commandLineArgumentSet.contains(VERSION_OPTION_LONG)
                || commandLineArgumentSet.contains(VERSION_OPTION_SHORT)) {
            printVersion();
            System.exit(0);
        }

        boolean decode = commandLineArgumentSet.contains(DECODE_OPTION_LONG) ||
                         commandLineArgumentSet.contains(DECODE_OPTION_SHORT);

        boolean encode = commandLineArgumentSet.contains(ENCODE_OPTION_LONG) ||
                         commandLineArgumentSet.contains(ENCODE_OPTION_SHORT);

        if (!decode && !encode) {
            printHelpMessage();
            System.exit(0);
        }

        if (decode && encode) {
            printHelpMessage();
            System.exit(0);
        }

        commandLineArgumentSet.removeAll(Arrays.asList(ENCODE_OPTION_SHORT,
                                                       ENCODE_OPTION_LONG,
                                                       DECODE_OPTION_SHORT,
                                                       DECODE_OPTION_LONG));
        if (commandLineArgumentSet.isEmpty()) {
            System.err.println("Bad command line format.");
            System.exit(1);
        }

        File file = new File(commandLineArgumentSet.iterator().next());

        try {
            if (decode) {
                doDecode(args);
            } else if (encode) {
                doEncode(file);
            } 
        } catch (Exception ex) {
            System.err.println(ex.getMessage());
            System.exit(1);
        }
    }

    private static void doEncode(File file) throws FileNotFoundException {
        byte[] fileBytes = readBytes(file);

        Map<Byte, Integer> weightMap =
                new ByteCountComputer()
                        .computeCharacterWeights(fileBytes);

        Map<Byte, BitString> encodeMap = 
                new HuffmanTree(weightMap).inferEncodingMap();

        BitString encodedText = new HuffmanEncoder().encode(encodeMap,
                                                            fileBytes);

        byte[] data = new HuffmanSerializer().serialize(weightMap,
                                                        encodedText);

        File outputFile = 
                new File(file.getName() + "." + ENCODED_FILE_EXTENSION);

        System.out.println(
            "Writing compressed text to \"" + outputFile.getName() + "\"...");

        writeBytes(data, outputFile);
    }

    private static void doDecode(String[] args) {
        String file1 = null;
        String file2 = null;

        try {
            int index = 0;

            for (int i = args.length - 1; i >= 0; --i) {
                if (args[i].equals("DECODE_OPTION_SHORT") 
                        || args[i].equals("DECODE_OPTION_LONG")) {
                    index = i;
                    break;
                }
            }

            file1 = args[index + 1];
            file2 = args[index + 2];
        } catch (Exception ex) {
            System.err.println("Not enough tokens on command line.");
            System.exit(1);
        }

        byte[] inputData = readBytes(new File(file1));
        HuffmanDeserializer.Result result = 
                new HuffmanDeserializer().deserialize(inputData);
        HuffmanTree decoderTree = new HuffmanTree(result.getEncoderMap());
        HuffmanDecoder decoder = new HuffmanDecoder();
        byte[] originalData = decoder.decode(decoderTree, 
                                             result.getEncodedText());
        writeBytes(originalData, new File(file2));
    }

    private static Set<String> getCommandLineOptions(String[] args) {
        Set<String> set = new HashSet<>();

        for (String arg : args) {
            set.add(arg);
        }

        return set;
    }

    private static void printHelpMessage() {
        String preamble = "usage: java -jar " + getThisJarName() + " ";
        int preambleLength = preamble.length();
        String indent = getIndent(preambleLength);

        StringBuilder sb = new StringBuilder();
        sb.append(preamble);

        sb.append("[")
          .append(HELP_OPTION_SHORT)
          .append(" | ")
          .append(HELP_OPTION_LONG)
          .append("]\n");

        sb.append(indent)
          .append("[")
          .append(VERSION_OPTION_SHORT)
          .append(" | ")
          .append(VERSION_OPTION_LONG)
          .append("]\n");

        sb.append(indent)
          .append("[")
          .append(ENCODE_OPTION_SHORT)
          .append(" | ")
          .append(ENCODE_OPTION_LONG)
          .append("] FILE\n");

        sb.append(indent)
          .append("[")
          .append(DECODE_OPTION_SHORT)
          .append(" | ")
          .append(DECODE_OPTION_LONG)
          .append("] FILE1 FILE2\n");

        sb.append("Where:\n");

        sb.append(HELP_OPTION_SHORT)
          .append(", ")
          .append(HELP_OPTION_LONG)
          .append("     Prints this message and exits.\n");

        sb.append(VERSION_OPTION_SHORT)
          .append(", ")
          .append(VERSION_OPTION_LONG)
          .append("  Prints the version info and exits.\n");

        sb.append(ENCODE_OPTION_SHORT)
          .append(", ")
          .append(ENCODE_OPTION_LONG)
          .append("   Encodes the text from standard input.\n");

        sb.append(DECODE_OPTION_SHORT)
          .append(", ")
          .append(DECODE_OPTION_LONG)
          .append("   Decodes the text from standard input.\n");

        System.out.println(sb.toString());
    }

    private static String getIndent(int preambleLength) {
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < preambleLength; ++i) {
            sb.append(' ');
        }

        return sb.toString();
    }

    private static void printVersion() {
        String msg = 
        "Huffman compressor tool, version 1.61 (Nov 19, 2016)\n" +
        "By Rodion \"rodde\" Efremov";

        System.out.println(msg);
    }

    private static String getThisJarName() {
        return new File(App.class.getProtectionDomain()
                        .getCodeSource()
                        .getLocation()
                        .getPath()).getName();
    }

    private static void writeBytes(byte[] data, File file) {
        try {
            FileOutputStream fos = new FileOutputStream(file);
            fos.write(data);
            fos.close();
        } catch (IOException ex) {
            throw new RuntimeException(
            "ERROR: File IO failed while writing encoded data.", ex);
        } 
    }

    private static byte[] readBytes(File file) {
        try {
            Path path = Paths.get(file.getAbsolutePath());
            return Files.readAllBytes(path);
        } catch (IOException ex) {
            throw new RuntimeException(
            "ERROR: File IO failed while reading a binary file.", ex);
        }
    }
}

BitString.java

package net.coderodde.compression.huffman;

import java.util.Arrays;

/**
 * This class implements a builder for creating bit strings.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.618 (Nov 14, 2016)
 */
public final class BitString {

    /**
     * The length of a {@code long} array storing the bits.
     */
    private static final int DEFAULT_NUMBER_OF_LONGS = 2;

    /**
     * Used for more efficient modulo arithmetics when indexing a particular
     * bit in a {@code long} value.
     */
    private static final int MODULO_MASK = 0b111111;

    /**
     * Number of bits per a {@code long} value;
     */
    private static final int BITS_PER_LONG = Long.BYTES * Byte.SIZE;

    /**
     * This field holds the actual array of long values for storing the bits.
     */
    private long[] storageLongs;

    /**
     * Current maximum of bits this builder can store.
     */
    private int storageCapacity;

    /**
     * Stores the number of bits this bit string builder actually represents.
     * We have an invariant {@code size <= storageCapacity}.
     */
    private int size;

    /**
     * Constructs an empty bit string builder.
     */
    public BitString() {
        this.storageLongs = new long[DEFAULT_NUMBER_OF_LONGS];
        this.storageCapacity = this.storageLongs.length * BITS_PER_LONG;
    }

    /**
     * Constructs a distinct bit string builder with the same content as in
     * {@code toCopy}. 
     * 
     * @param toCopy the bit string builder whose content to copy.
     */
    public BitString(BitString toCopy) {
        this.size = toCopy.size;
        this.storageCapacity = 
                Math.max(size, DEFAULT_NUMBER_OF_LONGS * BITS_PER_LONG);

        int numberOfLongs = storageCapacity / BITS_PER_LONG +
                         (((storageCapacity & MODULO_MASK) == 0) ? 0 : 1);

        this.storageLongs = Arrays.copyOf(toCopy.storageLongs, numberOfLongs);
    }

    public void appendBit(boolean bit) {
        checkBitArrayCapacity(size + 1);
        writeBitImpl(size, bit);
        ++size;
    }

    /**
     * Returns number of bits stored in this builder.
     * 
     * @return number of bits.
     */
    public int length() {
        return size;
    }

    /**
     * Appends all the bits in {@code bitStringBuilder} to the end of this 
     * builder.
     * 
     * @param bitStringBuilder the bit string builder whose bits to append.
     */
    public void appendBitsFrom(BitString bitStringBuilder) {
        checkBitArrayCapacity(size + bitStringBuilder.size);
        int otherSize = bitStringBuilder.size;

        for (int i = 0; i != otherSize; ++i) {
            appendBit(bitStringBuilder.readBitImpl(i));
        }
    }

    /**
     * Reads a specific bit.
     * 
     * @param index the index of the target bit.
     * @return {@code true} if the target bit is on, {@code false} otherwise.
     */
    public boolean readBit(int index) {
        checkAccessIndex(index);
        return readBitImpl(index);
    }

    /**
     * Removes the very last bit from this bit string builder.
     */
    public void removeLastBit() {
        if (size == 0) {
            throw new IllegalStateException(
            "Removing the last bit from an bit string builder.");
        }

        --size;
    }

    /**
     * Clears the entire bit string builder.
     */
    public void clear() {
        this.size = 0;
    }

    /**
     * Returns the number of bytes occupied by bits.
     * 
     * @return number of bytes occupied. 
     */
    public int getNumberOfBytesOccupied() {
        return size / 8 + ((size % 8 == 0) ? 0 : 1);
    }

    public byte[] toByteArray() {
        int numberOfBytes = (size / Byte.SIZE) +
                           ((size % Byte.SIZE == 0) ? 0 : 1);

        byte[] byteArray = new byte[numberOfBytes];

        for (int i = 0; i != numberOfBytes; ++i) {
            byteArray[i] = 
                    (byte)
                    ((storageLongs[i / Long.BYTES] 
                    >>> Byte.SIZE * (i % Long.BYTES)) & 0xff);
        }

        return byteArray;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(size);

        for (int i = 0; i != size; ++i) {
            sb.append(readBitImpl(i) ? '1': '0');
        }

        return sb.toString();
    }

    private void checkAccessIndex(int index) {
        if (size == 0) {
            throw new IllegalStateException("The bit string is empty.");
        }

        if (index < 0) {
            throw new IndexOutOfBoundsException(
            "The index is negative: " + index + ".");
        }

        if (index >= size) {
            throw new IndexOutOfBoundsException(
            "The bit index is too large (" + index + "). Must be at most " +
            (size - 1) + ".");
        }
    }

    private boolean readBitImpl(int index) {
        int longIndex = index / BITS_PER_LONG;
        int bitIndex  = index & MODULO_MASK;
        long mask = 1L << bitIndex;
        return (storageLongs[longIndex] & mask) != 0;
    }

    private void writeBitImpl(int index, boolean bit) {
        int longIndex = index / BITS_PER_LONG;
        int bitIndex  = index & MODULO_MASK;

        if (bit) {
            long mask = 1L << bitIndex;
            storageLongs[longIndex] |= mask;
        } else {
            long mask = ~(1L << bitIndex);
            storageLongs[longIndex] &= mask;
        }
    }

    private void checkBitArrayCapacity(int requestedCapacity) {
        if (requestedCapacity > storageCapacity) {
            int requestedWords1 =
                    requestedCapacity / BITS_PER_LONG + 
                 (((requestedCapacity & MODULO_MASK) == 0) ? 0 : 1);

            int requestedWords2 = (3 * storageCapacity / 2) / BITS_PER_LONG;

            int selectedRequestedWords = Math.max(requestedWords1,
                                                  requestedWords2);

            this.storageLongs = Arrays.copyOf(storageLongs, 
                                              selectedRequestedWords);

            this.storageCapacity = this.storageLongs.length * BITS_PER_LONG;
        }
    } 
}

ByteList.java

package net.coderodde.compression.huffman;

import java.util.Arrays;

/**
 * This class implements a simple, non-generic list of bytes.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.6 (Nov 17, 2016)
 */
final class ByteList {

    private byte[] data;
    private int size;

    public ByteList() {
        this.data = new byte[8];
    }

    public ByteList(int capacity) {
        this.data = new byte[capacity];
    }

    public void appendByte(byte b) {
        ensureCapacity(size + 1);
        data[size++] = b;
    }

    public byte[] toByteArray() {
        return Arrays.copyOfRange(data, 0, size);
    }

    private void ensureCapacity(int requestedCapacity) {
        if (requestedCapacity > data.length) {
            int selectedCapacity = Math.max(requestedCapacity,
                                            data.length * 2);

            data = Arrays.copyOf(data, selectedCapacity);
        }
    }
}

HuffmanDecoder.java

package net.coderodde.compression.huffman;

import net.coderodde.compression.huffman.HuffmanTree.IntHolder;

/**
 * This class is responsible for recovering the encoded text.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.61 (Nov 19, 2016)
 */
public final class HuffmanDecoder {

    /**
     * Recovers the text encoded by the bit string {@code bits} and the encoder
     * map {@code encoderMap}.
     * 
     * @param tree the Huffman tree used for decoding.
     * @param bits the actual encoded text bits.
     * @return the recovered text.
     */
    public byte[] decode(HuffmanTree tree, BitString bits) {
        IntHolder index = new IntHolder();
        int bitStringLength = bits.length();
        ByteList byteList = new ByteList();

        while (index.value < bitStringLength) {
            byte character = tree.decodeBitString(index, bits);
            byteList.appendByte(character);
        }

        return byteList.toByteArray();
    }
}

HuffmanDeserializer.java

package net.coderodde.compression.huffman;

import java.util.Map;
import java.util.TreeMap;

/**
 * This class is responsible for deserializing the text from a raw byte data.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.61 (Nov 19, 2016)
 */
public class HuffmanDeserializer {

    public static final class Result {

        private final BitString encodedText;
        private final Map<Byte, Integer> frequencyMap;

        Result(BitString encodedText, 
               Map<Byte, Integer> frequencyMap) {
            this.encodedText  = encodedText;
            this.frequencyMap = frequencyMap;
        }

        public BitString getEncodedText() {
            return encodedText;
        }

        public Map<Byte, Integer> getEncoderMap() {
            return frequencyMap;
        }
    }

    /**
     * Deserialises and returns the data structures need for decoding the text.
     * 
     * @param data the raw byte data previously serialised.
     * @return the data structures needed for decoding the text.
     */
    public Result deserialize(byte[] data) {
        checkSignature(data);
        int numberOfCodeWords = extractNumberOfCodeWords(data);
        int numberOfBits = extractNumberOfEncodedTextBits(data);

        Map<Byte, Integer> frequencyMap =
                extractFrequencyMap(data, 
                                    numberOfCodeWords);

        BitString encodedText = extractEncodedText(data, 
                                                   frequencyMap,
                                                   numberOfBits);
        return new Result(encodedText, frequencyMap);
    }

    private void checkSignature(byte[] data) {
        if (data.length < 4) {
            throw new InvalidFileFormatException(
            "No file type signature. The file is too short: " + data.length);
        }

        for (int i = 0; i != HuffmanSerializer.MAGIC.length; ++i) {
            if (data[i] != HuffmanSerializer.MAGIC[i]) {
                throw new InvalidFileFormatException(
                "Bad file type signature.");
            }
        }
    }

    private int extractNumberOfCodeWords(byte[] data) {
        if (data.length < 8) {
            throw new InvalidFileFormatException(
            "No number of code words. The file is too short: " + data.length);
        }

        int numberOfCodeWords = 0;

        numberOfCodeWords |= (Byte.toUnsignedInt(data[7]) << 24);
        numberOfCodeWords |= (Byte.toUnsignedInt(data[6]) << 16);
        numberOfCodeWords |= (Byte.toUnsignedInt(data[5]) << 8);
        numberOfCodeWords |= (Byte.toUnsignedInt(data[4]));

        return numberOfCodeWords;
    }

    private Map<Byte, Integer> extractFrequencyMap(byte[] data,
                                                 int numberOfCodeWords) {
        Map<Byte, Integer> frequencyMap = new TreeMap<>();

        try {
            int dataByteIndex =
                    HuffmanSerializer.MAGIC.length +
                    HuffmanSerializer.BYTES_PER_BIT_COUNT_ENTRY +
                    HuffmanSerializer.BYTES_PER_CODE_WORD_COUNT_ENTRY;

            for (int i = 0; i != numberOfCodeWords; ++i) {
                byte character = data[dataByteIndex++];
                byte frequencyByte1 = data[dataByteIndex++];
                byte frequencyByte2 = data[dataByteIndex++];
                byte frequencyByte3 = data[dataByteIndex++];
                byte frequencyByte4 = data[dataByteIndex++];

                int frequency = Byte.toUnsignedInt(frequencyByte1);
                frequency |= (Byte.toUnsignedInt(frequencyByte2) << 8);
                frequency |= (Byte.toUnsignedInt(frequencyByte3) << 16);
                frequency |= (Byte.toUnsignedInt(frequencyByte4) << 24);

                frequencyMap.put(character, frequency);
            }
        } catch (ArrayIndexOutOfBoundsException ex) {
            throw new InvalidFileFormatException("Invalid format.");
        }

        return frequencyMap;
    }

    private BitString extractEncodedText(byte[] data,
                                         Map<Byte, Integer> frequencyMap,
                                         int numberOfEncodedTextBits) {
        int omittedBytes = HuffmanSerializer.MAGIC.length +
                           HuffmanSerializer.BYTES_PER_BIT_COUNT_ENTRY +
                           HuffmanSerializer.BYTES_PER_CODE_WORD_COUNT_ENTRY;

        omittedBytes += frequencyMap.size() * 
                        HuffmanSerializer.BYTES_PER_WEIGHT_MAP_ENTRY;

        BitString encodedText = new BitString();
        int currentByteIndex = omittedBytes;
        int currentBitIndex = 0;

        try {
            for (int bitIndex = 0; 
                    bitIndex != numberOfEncodedTextBits; 
                    bitIndex++) {
                boolean bit = 
                        (data[currentByteIndex] & (1 << currentBitIndex)) != 0;

                encodedText.appendBit(bit);

                if (++currentBitIndex == Byte.SIZE) {
                    currentBitIndex = 0;
                    currentByteIndex++;
                }
            }
        } catch (ArrayIndexOutOfBoundsException ex) {
            throw new InvalidFileFormatException("Invalid file format.");
        }

        return encodedText;
    }

    private int extractNumberOfEncodedTextBits(byte[] data) {
        if (data.length < 12) {
            throw new InvalidFileFormatException(
            "No number of encoded text bits. The file is too short: " + 
                    data.length);
        }

        int numberOfEncodedTextBits = 0;

        numberOfEncodedTextBits |= (Byte.toUnsignedInt(data[11]) << 24);
        numberOfEncodedTextBits |= (Byte.toUnsignedInt(data[10]) << 16);
        numberOfEncodedTextBits |= (Byte.toUnsignedInt(data[9] ) << 8);
        numberOfEncodedTextBits |= (Byte.toUnsignedInt(data[8]));

        return numberOfEncodedTextBits;
    }
}

HuffmanEncoder.java

package net.coderodde.compression.huffman;

import java.util.Map;

/**
 * This class provides a method for encoding the given text using a particular
 * encoder map.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.61 (Nov 19, 2016)
 */
public final class HuffmanEncoder {

    /**
     * Encodes the input text {@code text} using the encoder map 
     * {@code encoderMap}.
     * 
     * @param map  the encoder map.
     * @param text the text to encode.
     * @return a bit string representing the encoded text.
     */
    public BitString encode(Map<Byte, BitString> map, byte[] text) {
        BitString outputBitString = new BitString();
        int textLength = text.length;

        for (int index = 0; index != textLength; ++index) {
            byte currentByte = text[index];
            BitString codeWord = map.get(currentByte);
            outputBitString.appendBitsFrom(codeWord);
        }

        return outputBitString;
    }
}

HuffmanSerializer.java

package net.coderodde.compression.huffman;

import java.util.Map;

/**
 * This class is responsible for converting the encoded text and the encoder map
 * into a raw byte array.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.61 (Nov 19, 2016)
 */
public final class HuffmanSerializer {

    /**
     * The magic file signature for recognizing the file type.
     */
    static final byte[] MAGIC = new byte[]{ (byte) 0xC0,
                                            (byte) 0xDE,
                                            (byte) 0x0D,
                                            (byte) 0xDE };

    /**
     * The number of bytes it takes to serialize one mapping from a character
     * to its code word.
     */
    static final int BYTES_PER_WEIGHT_MAP_ENTRY = 5;

    /**
     * The number of bytes it takes to serialize the number of code words.
     */
    static final int BYTES_PER_CODE_WORD_COUNT_ENTRY = 4;

    /**
     * The number of bytes it takes to serialize the number of bits in the 
     * actual encoded text.
     */
    static final int BYTES_PER_BIT_COUNT_ENTRY = 4;

    /**
     * Produces a byte array holding the compressed text along with its 
     * encoder map.
     * 
     * @param frequencyMap the encoder map used for encoding the text.
     * @param encodedText  the encoded text.
     * @return an array of byte.
     */
    public byte[] serialize(Map<Byte, Integer> frequencyMap,
                            BitString encodedText) {
        ByteList byteList = new ByteList(computeByteListSize(frequencyMap, 
                                                             encodedText));
        // Emit the magic number:
        for (byte b : MAGIC) {
            byteList.appendByte(b);
        }

        int numberOfCodeWords = frequencyMap.size();
        int numberOfBits = encodedText.length();

        // Emit the number of code words.
        byteList.appendByte((byte) (numberOfCodeWords & 0xff));
        byteList.appendByte((byte)((numberOfCodeWords >>= 8) & 0xff));
        byteList.appendByte((byte)((numberOfCodeWords >>= 8) & 0xff));
        byteList.appendByte((byte)((numberOfCodeWords >>= 8) & 0xff));

        // Emit the number of bits in the encoded text.
        byteList.appendByte((byte) (numberOfBits & 0xff));
        byteList.appendByte((byte)((numberOfBits >>= 8) & 0xff));
        byteList.appendByte((byte)((numberOfBits >>= 8) & 0xff));
        byteList.appendByte((byte)((numberOfBits >>= 8) & 0xff));

        // Emit the code words:
        for (Map.Entry<Byte, Integer> entry : frequencyMap.entrySet()) {
            byte character = entry.getKey();
            int frequency = entry.getValue();

            // Emit the character:
            byteList.appendByte(character);

            // Emit the bytes of the weight value:
            byteList.appendByte((byte) (frequency & 0xff));
            byteList.appendByte((byte)((frequency >>= 8) & 0xff));
            byteList.appendByte((byte)((frequency >>= 8) & 0xff));
            byteList.appendByte((byte)((frequency >>= 8) & 0xff));
        }

        byte[] encodedTextBytes = encodedText.toByteArray();

        // Emit the encoded text:
        for (byte b : encodedTextBytes) {
            byteList.appendByte(b);
        }

        return byteList.toByteArray();
    }

    private int computeByteListSize(Map<Byte, Integer> frequencyMap,
                                    BitString encodedText) {
        return MAGIC.length + BYTES_PER_CODE_WORD_COUNT_ENTRY
                            + BYTES_PER_BIT_COUNT_ENTRY
                            + frequencyMap.size() * BYTES_PER_WEIGHT_MAP_ENTRY 
                            + encodedText.getNumberOfBytesOccupied();
    }
}

HuffmanTree.java

package net.coderodde.compression.huffman;

import java.util.Map;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.TreeMap;

/**
 * This class implements a Huffman tree for building a prefix code.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.61 (Nov 19, 2016)
 */
public final class HuffmanTree {

    static final class IntHolder {
        int value;
    }

    private static final class HuffmanTreeNode 
            implements Comparable<HuffmanTreeNode> {

        byte character;
        int  frequency;
        boolean isLeaf;
        HuffmanTreeNode left;
        HuffmanTreeNode right;

        HuffmanTreeNode(byte character, int frequency, boolean isLeaf) {
            this.frequency = checkFrequency(frequency);
            this.isLeaf = isLeaf;

            if (isLeaf) {
                this.character = character;
            }
        }

        @Override
        public int compareTo(HuffmanTreeNode o) {
            int cmp = Integer.compare(frequency, o.frequency);

            if (cmp != 0) {
                return cmp;
            }

            // If reached here, equal weights so order by the character value:
            return Byte.compare(character, o.character);
        }

        static HuffmanTreeNode merge(HuffmanTreeNode node1, 
                                     HuffmanTreeNode node2) {
            HuffmanTreeNode newNode = 
                    new HuffmanTreeNode((byte) 0,
                                        node1.frequency + node2.frequency,
                                        false);

            if (node1.frequency < node2.frequency) {
                newNode.left  = node1;
                newNode.right = node2;
            } else {
                newNode.left  = node2;
                newNode.right = node1;
            }

            return newNode;
        }

        private int checkFrequency(int frequency) {
            if (frequency <= 0) {
                throw new IllegalArgumentException(
                "The input byte frequency must be positive. Received " +
                        frequency + ".");
            }

            return frequency;
        }
    }

    private HuffmanTreeNode root;

    /**
     * Constructs a Huffman tree from the character frequencies 
     * {@code weightMap}.
     * 
     * @param frequencyMap the map mapping each byte to its frequency.
     */
    public HuffmanTree(Map<Byte, Integer> frequencyMap) {
        if (frequencyMap.isEmpty()) {
            throw new IllegalArgumentException(
                    "Compressor requires a non-empty text.");
        }

        Queue<HuffmanTreeNode> queue = new PriorityQueue<>();

        for (Map.Entry<Byte, Integer> entry : frequencyMap.entrySet()) {
            queue.add(new HuffmanTreeNode(entry.getKey(),
                                          entry.getValue(),
                                          true));
        }

        while (queue.size() > 1) {
            HuffmanTreeNode node1 = queue.remove();
            HuffmanTreeNode node2 = queue.remove();
            queue.add(HuffmanTreeNode.merge(node1, node2));
        }

        root = queue.peek();
    }

    public byte decodeBitString(IntHolder index, BitString bitString) {
        if (root.isLeaf) {
            // Ugly special case: the encoded text contains only one distinct
            // byte value. Return it and increment the index holder. If we would
            // not handle this special case. The below while loop would become
            // infinite.
            index.value++;
            return root.character;
        }

        HuffmanTreeNode currentNode = root;

        while (currentNode.isLeaf == false) {
            boolean bit = bitString.readBit(index.value++);
            currentNode = (bit ? currentNode.right : currentNode.left);
        }

        return currentNode.character;
    }

    /**
     * Construct the encoder map from this tree.
     * 
     * @return the encoder map.
     */
    public Map<Byte, BitString> inferEncodingMap() {
        Map<Byte, BitString> map = new TreeMap<>();

        if (root.isLeaf) {
            // Corner case. Only one byte value in the text.
            root.isLeaf = false;
            root.left = new HuffmanTreeNode(root.character, 
                                            1, 
                                            true);
            BitString bs = new BitString();
            bs.appendBit(false);
            map.put(root.character, bs);
            return map;
        }

        BitString bitStringBuilder = new BitString();
        inferEncodingMapImpl(bitStringBuilder, root, map);
        return map;
    }

    private void inferEncodingMapImpl(BitString currentCodeWord,
                                      HuffmanTreeNode currentTreeNode,
                                      Map<Byte, BitString> map) {
        if (currentTreeNode.isLeaf) {
            map.put(currentTreeNode.character, 
                    new BitString(currentCodeWord));
            return;
        }

        currentCodeWord.appendBit(false);
        inferEncodingMapImpl(currentCodeWord,
                             currentTreeNode.left,
                             map);
        currentCodeWord.removeLastBit();

        currentCodeWord.appendBit(true);
        inferEncodingMapImpl(currentCodeWord, 
                             currentTreeNode.right,
                             map);
        currentCodeWord.removeLastBit();
    }
}

InvalidFormatException.java

package net.coderodde.compression.huffman;

public class InvalidFormatException extends RuntimeException {

    public InvalidFormatException(String errorMessage) {
        super(errorMessage);
    }
}

The entire (Maven) project along the unit tests is here.

Critique request

I would like to hear everything you have to say, yet more importantly

  1. If we confine us to Huffman trees, is there any room for performance improvement?

  2. How is my coding and naming styles?

  3. How is my modularity?

1 comments

@RobAu 2016-11-28 13:51:38

General

You have some comments in your code. While reading and tyring to understand what you are doing, I sometimes had to look twice. (For example, the constructor of the Huffman tree implements most of the algorithm by leveraging the PriorityQueue. This is nicely done, but could use a little extra explanation.

Streaming Decoding

You should make the decoding process streaming. Now you first read in the input to a byte[], then decode and collect it to a byte[], and then write that to a file. This consumes a lot of memory, while you only need to have the huffman-tree in memory and read the maximum bit-set-length of bytes. (you could use buffering of course, but buffering the entire input and output is not the best implementation.

HuffmanTreeNode

HuffmanTreenode uses a setting to determine/encode the 'isLeaf' : the boolean isLeaf. If isLeaf is set the character will be default and there is no way to tell the difference between a given byte 0 or the default value 0, so the if in the constructor does not make sense?

IntHolder

You only use the IntHolder to index the bits that you read from the BitString. Why not create an iterator on the BitString?

EncodeMap

I don't understand the EncodeMap abstraction to pass information around. Why not pass the HuffManTree and let the HuffManEncoder ask the HuffManTree for the BitString of the current byte?

Like this:

public BitString encode(HuffmanTree tree, byte[] text) {
    BitString outputBitString = new BitString();
    int textLength = text.length;

    for (int index = 0; index != textLength; ++index) {
        byte currentByte = text[index];
        BitString codeWord = tree.getBitStringFor(currentByte);
        outputBitString.appendBitsFrom(codeWord);
    }
    return outputBitString;
}

If you want to optimize you can still infer the map, but no need to externalize this datastructure. Allows for easier changing of implementation later as well.

Bugs?

In doDecode() I think you did not use the intended final Strings?

        for (int i = args.length - 1; i >= 0; --i) {
            if (args[i].equals("DECODE_OPTION_SHORT") 
                    || args[i].equals("DECODE_OPTION_LONG")) {

Reuse existing classes

Your BitString looks a bit (no pun intended) like the Java BitSet. Better to reuse if possible.

Your ByteList is similar to Java ByteBuffer

Related Questions

Sponsored Content

2 Answered Questions

[SOLVED] Huffman compressor in C++

2 Answered Questions

[SOLVED] Huffman compressor in Java

Sponsored Content