By coderodde


2016-11-19 12:24:09 8 Comments

(See the next iteration as well.)

I have this Java program that can en-/decode files, both text and binary. What comes to critique I want to hear anything regarding these points:

  • Performance,
  • Modularity,
  • Coding conventions,
  • Naming conventions.

Code

BitString.java

package net.coderodde.compression.huffman;

import java.util.Arrays;

/**
 * This class implements a builder for creating bit strings.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.618 (Nov 14, 2016)
 */
public final class BitString {

    /**
     * The length of a {@code long} array storing the bits.
     */
    private static final int DEFAULT_NUMBER_OF_LONGS = 2;

    /**
     * Used for more efficient modulo arithmetics when indexing a particular
     * bit in a {@code long} value.
     */
    private static final int MODULO_MASK = 0b111111;

    /**
     * Number of bits per a {@code long} value;
     */
    private static final int BITS_PER_LONG = Long.BYTES * Byte.SIZE;

    /**
     * This field holds the actual array of long values for storing the bits.
     */
    private long[] storageLongs;

    /**
     * Current maximum of bits this builder can store.
     */
    private int storageCapacity;

    /**
     * Stores the number of bits this bit string builder actually represents.
     * We have an invariant {@code size <= storageCapacity}.
     */
    private int size;

    /**
     * Constructs an empty bit string builder.
     */
    public BitString() {
        this.storageLongs = new long[DEFAULT_NUMBER_OF_LONGS];
        this.storageCapacity = this.storageLongs.length * BITS_PER_LONG;
    }

    /**
     * Constructs a distinct bit string builder with the same content as in
     * {@code toCopy}. 
     * 
     * @param toCopy the bit string builder whose content to copy.
     */
    public BitString(BitString toCopy) {
        this.size = toCopy.size;
        this.storageCapacity = 
                Math.max(size, DEFAULT_NUMBER_OF_LONGS * BITS_PER_LONG);

        int numberOfLongs = storageCapacity / BITS_PER_LONG +
                         (((storageCapacity & MODULO_MASK) == 0) ? 0 : 1);

        this.storageLongs = Arrays.copyOf(toCopy.storageLongs, numberOfLongs);
    }

    public void appendBit(boolean bit) {
        checkBitArrayCapacity(size + 1);
        writeBitImpl(size, bit);
        ++size;
    }

    /**
     * Returns number of bits stored in this builder.
     * 
     * @return number of bits.
     */
    public int length() {
        return size;
    }

    /**
     * Appends all the bits in {@code bitStringBuilder} to the end of this 
     * builder.
     * 
     * @param bitStringBuilder the bit string builder whose bits to append.
     */
    public void appendBitsFrom(BitString bitStringBuilder) {
        checkBitArrayCapacity(size + bitStringBuilder.size);
        int otherSize = bitStringBuilder.size;

        for (int i = 0; i != otherSize; ++i) {
            appendBit(bitStringBuilder.readBitImpl(i));
        }
    }

    /**
     * Reads a specific bit.
     * 
     * @param index the index of the target bit.
     * @return {@code true} if the target bit is on, {@code false} otherwise.
     */
    public boolean readBit(int index) {
        checkAccessIndex(index);
        return readBitImpl(index);
    }

    /**
     * Removes the very last bit from this bit string builder.
     */
    public void removeLastBit() {
        if (size == 0) {
            throw new IllegalStateException(
            "Removing the last bit from an bit string builder.");
        }

        --size;
    }

    /**
     * Clears the entire bit string builder.
     */
    public void clear() {
        this.size = 0;
    }

    /**
     * Returns the number of bytes occupied by bits.
     * 
     * @return number of bytes occupied. 
     */
    public int getNumberOfBytesOccupied() {
        return size / 8 + ((size % 8 == 0) ? 0 : 1);
    }

    public byte[] toByteArray() {
        int numberOfBytes = (size / Byte.SIZE) +
                           ((size % Byte.SIZE == 0) ? 0 : 1);

        byte[] byteArray = new byte[numberOfBytes];

        for (int i = 0; i != numberOfBytes; ++i) {
            byteArray[i] = 
                    (byte)
                    ((storageLongs[i / Long.BYTES] 
                    >>> Byte.SIZE * (i % Long.BYTES)) & 0xff);
        }

        return byteArray;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(size);

        for (int i = 0; i != size; ++i) {
            sb.append(readBitImpl(i) ? '1': '0');
        }

        return sb.toString();
    }

    @Override
    public int hashCode() {
        int hash = 0;
        int fullLongs = size / BITS_PER_LONG;

        for (int i = 0; i != fullLongs; ++i) {
            long word = storageLongs[i];
            hash += (int)((word & 0xffffffff00000000L) ^ 
                          (word & 0xffffffff));
        }

        int leftoverBits = size & MODULO_MASK;

        for (int i = 0; i != leftoverBits; ++i) {
            hash += readBitImpl(fullLongs * BITS_PER_LONG + i) ? 1 : 0;
        }

        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (o == null) {
            return false;
        }

        if (o == this) {
            return true;
        }

        if (!getClass().equals(o.getClass())) {
            return false;
        }

        BitString other = (BitString) o;

        if (size != other.size) {
            return false;
        }

        int fullLongs = size / BITS_PER_LONG;

        for (int i = 0; i != fullLongs; ++i) {
            if (storageLongs[i] != other.storageLongs[i]) {
                return false;
            }
        }

        int leftoverBits = size & MODULO_MASK;

        for (int i = 0; i != leftoverBits; ++i) {
            if (readBitImpl(fullLongs * BITS_PER_LONG + i) !=
                other.readBitImpl(fullLongs * BITS_PER_LONG + i)) {
                return false;
            }
        }

        return true;
    }

    private void checkAccessIndex(int index) {
        if (size == 0) {
            throw new IllegalStateException("The bit string is empty.");
        }

        if (index < 0) {
            throw new IndexOutOfBoundsException(
            "The index is negative: " + index + ".");
        }

        if (index >= size) {
            throw new IndexOutOfBoundsException(
            "The bit index is too large (" + index + "). Must be at most " +
            (size - 1) + ".");
        }
    }

    private boolean readBitImpl(int index) {
        int longIndex = index / BITS_PER_LONG;
        int bitIndex  = index & MODULO_MASK;
        long mask = 1L << bitIndex;
        return (storageLongs[longIndex] & mask) != 0;
    }

    private void writeBitImpl(int index, boolean bit) {
        int longIndex = index / BITS_PER_LONG;
        int bitIndex  = index & MODULO_MASK;

        if (bit) {
            long mask = 1L << bitIndex;
            storageLongs[longIndex] |= mask;
        } else {
            long mask = ~(1L << bitIndex);
            storageLongs[longIndex] &= mask;
        }
    }

    private void checkBitArrayCapacity(int requestedCapacity) {
        if (requestedCapacity > storageCapacity) {
            int requestedWords1 =
                    requestedCapacity / BITS_PER_LONG + 
                 (((requestedCapacity & MODULO_MASK) == 0) ? 0 : 1);

            int requestedWords2 = (3 * storageCapacity / 2) / BITS_PER_LONG;

            int selectedRequestedWords = Math.max(requestedWords1,
                                                  requestedWords2);

            this.storageLongs = Arrays.copyOf(storageLongs, 
                                              selectedRequestedWords);

            this.storageCapacity = this.storageLongs.length * BITS_PER_LONG;
        }
    } 
}

ByteList.java

package net.coderodde.compression.huffman;

import java.util.Arrays;

/**
 * This class implements a simple, non-generic list of bytes.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.6 (Nov 17, 2016)
 */
final class ByteList {

    private byte[] data;
    private int size;

    public ByteList() {
        this.data = new byte[8];
    }

    public ByteList(int capacity) {
        this.data = new byte[capacity];
    }

    public void appendByte(byte b) {
        ensureCapacity(size + 1);
        data[size++] = b;
    }

    public byte[] toByteArray() {
        return Arrays.copyOfRange(data, 0, size);
    }

    private void ensureCapacity(int requestedCapacity) {
        if (requestedCapacity > data.length) {
            int selectedCapacity = Math.max(requestedCapacity,
                                            data.length * 2);

            data = Arrays.copyOf(data, selectedCapacity);
        }
    }
}

ByteWeightComputer.java

package net.coderodde.compression.huffman;

import java.util.HashMap;
import java.util.Map;

/**
 * This class provides a method for counting relative frequencies of characters
 * in any given corpus of text.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.618 (Nov 19, 2016)
 */
public final class ByteWeightComputer {

    /**
     * Computes the map mapping each character in the text {@code text} to its
     * relative frequency.
     * 
     * @param text the text for which to compute the frequencies.
     * @return the map mapping each character to its respective frequency.
     */
    public Map<Byte, Float> computeCharacterWeights(byte[] text) {
        Map<Byte, Float> map = new HashMap<>();
        int textLength = text.length;

        for (int i = 0; i != textLength; ++i) {
            byte currentByte = text[i];

            if (map.containsKey(currentByte)) {
                map.put(currentByte, map.get(currentByte) + 1.0f);
            } else {
                map.put(currentByte, 1.0f);
            }
        }

        float textLengthFloat = textLength;

        for (Map.Entry<Byte, Float> entry : map.entrySet()) {
            entry.setValue(entry.getValue() / textLengthFloat);
        }

        return map;
    }
}

HuffmanDecoder.java

package net.coderodde.compression.huffman;

import java.util.HashMap;
import java.util.Map;

/**
 * This class is responsible for recovering the encoded text.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.61 (Nov 19, 2016)
 */
public final class HuffmanDecoder {

    /**
     * Recovers the text encoded by the bit string {@code bits} and the encoder
     * map {@code encoderMap}.
     * 
     * @param bits       the actual encoded text.
     * @param encoderMap the map mapping each character to its respective
     *                   code word.
     * @return the recovered text.
     */
    public byte[] decode(BitString bits,
                         Map<Byte, BitString> encoderMap) {
        Map<BitString, Byte> decoderMap = invertEncoderMap(encoderMap);
        ByteList byteList = new ByteList();
        BitString bitAccumulator = new BitString();
        int totalBits = bits.length();

        for (int bitIndex = 0; bitIndex != totalBits; ++bitIndex) {
            bitAccumulator.appendBit(bits.readBit(bitIndex));
            Byte currentByte = decoderMap.get(bitAccumulator);

            if (currentByte != null) {
                byteList.appendByte(currentByte);
                bitAccumulator.clear();
            }
        }

        return byteList.toByteArray();
    }

    private Map<BitString, Byte>
            invertEncoderMap(Map<Byte, BitString> encoderMap) {
        Map<BitString, Byte> map = new HashMap<>(encoderMap.size());

        for (Map.Entry<Byte, BitString> entry 
                : encoderMap.entrySet()) {
            map.put(entry.getValue(), entry.getKey());
        }

        return map;
    }
}

HuffmanDeserializer.java

package net.coderodde.compression.huffman;

import java.util.HashMap;
import java.util.Map;

/**
 * This class is responsible for deserializing the text from a raw byte data.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.61 (Nov 19, 2016)
 */
public class HuffmanDeserializer {

    public static final class Result {

        private final BitString encodedText;
        private final Map<Byte, BitString> encoderMap;

        Result(BitString encodedText, 
               Map<Byte, BitString> encoderMap) {
            this.encodedText = encodedText;
            this.encoderMap  = encoderMap;
        }

        public BitString getEncodedText() {
            return encodedText;
        }

        public Map<Byte, BitString> getEncoderMap() {
            return encoderMap;
        }
    }

    /**
     * Deserialises and returns the data structures need for decoding the text.
     * 
     * @param data the raw byte data previously serialised.
     * @return the data structures needed for decoding the text.
     */
    public Result deserialize(byte[] data) {
        checkSignature(data);
        int numberOfCodeWords = extractNumberOfCodeWords(data);
        int numberOfBits = extractNumberOfEncodedTextBits(data);

        Map<Byte, BitString> encoderMap = extractEncoderMap(data, 
                                                            numberOfCodeWords);
        BitString encodedText = extractEncodedText(data, 
                                                   encoderMap, 
                                                   numberOfBits);
        return new Result(encodedText, encoderMap);
    }

    private void checkSignature(byte[] data) {
        if (data.length < 4) {
            throw new InvalidFileFormatException(
            "No file type signature. The file is too short: " + data.length);
        }

        for (int i = 0; i != HuffmanSerializer.MAGIC.length; ++i) {
            if (data[i] != HuffmanSerializer.MAGIC[i]) {
                throw new InvalidFileFormatException(
                "Bad file type signature.");
            }
        }
    }

    private int extractNumberOfCodeWords(byte[] data) {
        if (data.length < 8) {
            throw new InvalidFileFormatException(
            "No number of code words. The file is too short: " + data.length);
        }

        int numberOfCodeWords = 0;

        numberOfCodeWords |= (Byte.toUnsignedInt(data[7]) << 24);
        numberOfCodeWords |= (Byte.toUnsignedInt(data[6]) << 16);
        numberOfCodeWords |= (Byte.toUnsignedInt(data[5]) << 8);
        numberOfCodeWords |= (Byte.toUnsignedInt(data[4]));

        return numberOfCodeWords;
    }

    private Map<Byte, BitString> extractEncoderMap(byte[] data,
                                                   int numberOfCodeWords) {
        Map<Byte, BitString> encoderMap = new HashMap<>();

        try {
            int dataByteIndex =
                    HuffmanSerializer.MAGIC.length +
                    HuffmanSerializer.BYTES_PER_BIT_COUNT_ENTRY +
                    HuffmanSerializer.BYTES_PER_CODE_WORD_COUNT_ENTRY;

            for (int i = 0; i != numberOfCodeWords; ++i) {
                byte character = data[dataByteIndex++];
                int codeWordLength = data[dataByteIndex++];
                int bitIndex = 0;
                BitString codeWordBits = new BitString();

                for (int codeWordBitIndex = 0;
                        codeWordBitIndex != codeWordLength;
                        codeWordBitIndex++) {
                    byte currentByte = data[dataByteIndex];
                    boolean bit = (currentByte & (1 << bitIndex)) != 0;
                    codeWordBits.appendBit(bit);

                    if (++bitIndex == Byte.SIZE) {
                        bitIndex = 0;
                        dataByteIndex++;
                    }
                }

                encoderMap.put(character, codeWordBits);

                if (bitIndex != 0) {
                    dataByteIndex++;
                }
            }
        } catch (ArrayIndexOutOfBoundsException ex) {
            throw new InvalidFileFormatException("Invalid file format.");
        }

        return encoderMap;
    }

    private BitString extractEncodedText(byte[] data,
                                         Map<Byte, BitString> encoderMap,
                                         int numberOfEncodedTextBits) {
        int omittedBytes = HuffmanSerializer.MAGIC.length +
                           HuffmanSerializer.BYTES_PER_BIT_COUNT_ENTRY +
                           HuffmanSerializer.BYTES_PER_CODE_WORD_COUNT_ENTRY;

        for (Map.Entry<Byte, BitString> entry : encoderMap.entrySet()) {
            omittedBytes += 2 + entry.getValue().getNumberOfBytesOccupied();
        }

        BitString encodedText = new BitString();
        int currentByteIndex = omittedBytes;
        int currentBitIndex = 0;

        try {
            for (int bitIndex = 0; 
                    bitIndex != numberOfEncodedTextBits; 
                    bitIndex++) {
                boolean bit = 
                        (data[currentByteIndex] & (1 << currentBitIndex)) != 0;

                encodedText.appendBit(bit);

                if (++currentBitIndex == Byte.SIZE) {
                    currentBitIndex = 0;
                    currentByteIndex++;
                }
            }
        } catch (ArrayIndexOutOfBoundsException ex) {
            throw new InvalidFileFormatException("Invalid file format.");
        }

        return encodedText;
    }

    private int extractNumberOfEncodedTextBits(byte[] data) {
        if (data.length < 12) {
            throw new InvalidFileFormatException(
            "No number of encoded text bits. The file is too short: " + 
                    data.length);
        }

        int numberOfEncodedTextBits = 0;

        numberOfEncodedTextBits |= (Byte.toUnsignedInt(data[11]) << 24);
        numberOfEncodedTextBits |= (Byte.toUnsignedInt(data[10]) << 16);
        numberOfEncodedTextBits |= (Byte.toUnsignedInt(data[9] ) << 8);
        numberOfEncodedTextBits |= (Byte.toUnsignedInt(data[8]));

        return numberOfEncodedTextBits;
    }
}

HuffmanEncoder.java

package net.coderodde.compression.huffman;

import java.util.Map;

/**
 * This class provides a method for encoding the given text using a particular
 * encoder map.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.61 (Nov 19, 2016)
 */
public final class HuffmanEncoder {

    /**
     * Encodes the input text {@code text} using the encoder map 
     * {@code encoderMap}.
     * 
     * @param map  the encoder map.
     * @param text the text to encode.
     * @return a bit string representing the encoded text.
     */
    public BitString encode(Map<Byte, BitString> map, byte[] text) {
        BitString outputBitString = new BitString();
        int textLength = text.length;

        for (int index = 0; index != textLength; ++index) {
            byte currentByte = text[index];
            BitString codeWord = map.get(currentByte);
            outputBitString.appendBitsFrom(codeWord);
        }

        return outputBitString;
    }
}

HuffmanSerializer.java

package net.coderodde.compression.huffman;

import java.util.Map;

/**
 * This class is responsible for converting the encoded text and the encoder map
 * into a raw byte array.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.61 (Nov 19, 2016)
 */
public final class HuffmanSerializer {

    /**
     * The magic file signature for recognizing the file type.
     */
    static final byte[] MAGIC = new byte[]{ (byte) 0xC0,
                                            (byte) 0xDE,
                                            (byte) 0x0D,
                                            (byte) 0xDE };

    /**
     * The number of bytes it takes to serialize one mapping from a character
     * to its code word.
     */
    static final int BYTES_PER_ENCODER_MAP_ENTRY = 4;

    /**
     * The number of bytes it takes to serialize the number of code words.
     */
    static final int BYTES_PER_CODE_WORD_COUNT_ENTRY = 4;

    /**
     * The number of bytes it takes to serialize the number of bits in the 
     * actual encoded text.
     */
    static final int BYTES_PER_BIT_COUNT_ENTRY = 4;

    /**
     * Produces a byte array holding the compressed text along with its 
     * encoder map.
     * 
     * @param encoderMap  the encoder map used for encoding the text.
     * @param encodedText the encoded text.
     * @return an array of byte.
     */
    public byte[] serialize(Map<Byte, BitString> encoderMap,
                            BitString encodedText) {
        ByteList byteList = new ByteList(computeByteListSize(encoderMap, 
                                                             encodedText));
        // Emit the magic number:
        for (byte b : MAGIC) {
            byteList.appendByte(b);
        }

        int numberOfCodeWords = encoderMap.size();
        int numberOfBits = encodedText.length();

        // Emit the number of code words.
        byteList.appendByte((byte) (numberOfCodeWords & 0xff));
        byteList.appendByte((byte)((numberOfCodeWords >>= 8) & 0xff));
        byteList.appendByte((byte)((numberOfCodeWords >>= 8) & 0xff));
        byteList.appendByte((byte)((numberOfCodeWords >>= 8) & 0xff));

        // Emit the number of bits in the encoded text.
        byteList.appendByte((byte) (numberOfBits & 0xff));
        byteList.appendByte((byte)((numberOfBits >>= 8) & 0xff));
        byteList.appendByte((byte)((numberOfBits >>= 8) & 0xff));
        byteList.appendByte((byte)((numberOfBits >>= 8) & 0xff));

        // Emit the code words:
        for (Map.Entry<Byte, BitString> entry 
                : encoderMap.entrySet()) {
            byte character = entry.getKey();
            BitString codeWord = entry.getValue();

            // Emit the character:
            byteList.appendByte(character);

            // Emit the codeword length:
            byte codeWordLength = (byte) codeWord.length();
            byteList.appendByte(codeWordLength);

            // Emit the code word bits:
            byte[] codewordBytes = codeWord.toByteArray();

            for (byte b : codewordBytes) {
                byteList.appendByte(b);
            }
        }

        byte[] encodedTextBytes = encodedText.toByteArray();

        for (byte b : encodedTextBytes) {
            byteList.appendByte(b);
        }

        return byteList.toByteArray();
    }

    private int computeByteListSize(Map<Byte, BitString> encoderMap,
                                    BitString encodedText) {
        return MAGIC.length + BYTES_PER_CODE_WORD_COUNT_ENTRY
                            + BYTES_PER_BIT_COUNT_ENTRY
                            + encoderMap.size() * BYTES_PER_ENCODER_MAP_ENTRY 
                            + encodedText.getNumberOfBytesOccupied();
    }
}

HuffmanTree.java

package net.coderodde.compression.huffman;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Set;

/**
 * This class implements a Huffman tree for building a prefix code.
 * 
 * @author Rodion "rodde" Efremov
 * @version 1.61 (Nov 19, 2016)
 */
public final class HuffmanTree {

    private static final class HuffmanTreeNode 
            implements Comparable<HuffmanTreeNode> {

        byte character;
        float weight;
        boolean isLeaf;
        HuffmanTreeNode left;
        HuffmanTreeNode right;

        HuffmanTreeNode(byte character, float weight, boolean isLeaf) {
            checkWeight(weight);
            this.weight = weight;
            this.isLeaf = isLeaf;

            if (isLeaf) {
                this.character = character;
            }
        }

        @Override
        public int compareTo(HuffmanTreeNode o) {
            return Float.compare(weight, o.weight);
        }

        static HuffmanTreeNode merge(HuffmanTreeNode node1, 
                                     HuffmanTreeNode node2) {
            HuffmanTreeNode newNode = 
                    new HuffmanTreeNode((byte) 0,
                                        node1.weight + node2.weight,
                                        false);

            if (node1.weight < node2.weight) {
                newNode.left  = node1;
                newNode.right = node2;
            } else {
                newNode.left  = node2;
                newNode.right = node1;
            }

            return newNode;
        }

        private float checkWeight(float weight) {
            if (Double.isNaN(weight)) {
                throw new IllegalArgumentException("The input weight is NaN.");
            }

            if (weight <= 0.0f) {
                throw new IllegalArgumentException(
                "The input weight is not strictly positive: " + weight);
            }

            return weight;
        }
    }

    private HuffmanTreeNode root;

    /**
     * Constructs a Huffman tree from the character frequencies 
     * {@code weightMap}.
     * 
     * @param weightMap the map mapping each byte to its relative frequency.
     */
    public HuffmanTree(Map<Byte, Float> weightMap) {
        if (weightMap.isEmpty()) {
            throw new IllegalArgumentException(
                    "Compressor requires a non-empty text.");
        }

        Queue<HuffmanTreeNode> queue = new PriorityQueue<>();

        for (Map.Entry<Byte, Float> entry : weightMap.entrySet()) {
            queue.add(new HuffmanTreeNode(entry.getKey(),
                                          entry.getValue(),
                                          true));
        }

        while (queue.size() > 1) {
            HuffmanTreeNode node1 = queue.remove();
            HuffmanTreeNode node2 = queue.remove();
            queue.add(HuffmanTreeNode.merge(node1, node2));
        }

        root = queue.peek();
    }

    /**
     * Construct the encoder map from this tree.
     * 
     * @return the encoder map.
     */
    public Map<Byte, BitString> inferEncodingMap() {
        Map<Byte, BitString> map = new HashMap<>();

        if (root.isLeaf) {
            // Corner case. Only one byte value in the text.
            BitString bs = new BitString();
            bs.appendBit(false);
            map.put(root.character, bs);
            return map;
        }

        BitString bitStringBuilder = new BitString();
        inferEncodingMapImpl(bitStringBuilder, root, map);
        return map;
    }

    private void inferEncodingMapImpl(BitString currentCodeWord,
                                      HuffmanTreeNode currentTreeNode,
                                      Map<Byte, BitString> map) {
        if (currentTreeNode.isLeaf) {
            map.put(currentTreeNode.character, 
                    new BitString(currentCodeWord));
            return;
        }

        currentCodeWord.appendBit(false);
        inferEncodingMapImpl(currentCodeWord,
                             currentTreeNode.left,
                             map);
        currentCodeWord.removeLastBit();

        currentCodeWord.appendBit(true);
        inferEncodingMapImpl(currentCodeWord, 
                             currentTreeNode.right,
                             map);
        currentCodeWord.removeLastBit();
    }
}

InvalidFileFormatException.java

package net.coderodde.compression.huffman;

public class InvalidFileFormatException extends RuntimeException {

    public InvalidFileFormatException(String errorMessage) {
        super(errorMessage);
    }
}

App.java

package net.coderodde.app.huffman;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import net.coderodde.compression.huffman.BitString;
import net.coderodde.compression.huffman.ByteWeightComputer;
import net.coderodde.compression.huffman.HuffmanDecoder;
import net.coderodde.compression.huffman.HuffmanDeserializer;
import net.coderodde.compression.huffman.HuffmanEncoder;
import net.coderodde.compression.huffman.HuffmanSerializer;
import net.coderodde.compression.huffman.HuffmanTree;

public final class App {

    private static final String ENCODE_OPTION_SHORT = "-e";
    private static final String ENCODE_OPTION_LONG  = "--encode";
    private static final String DECODE_OPTION_SHORT = "-d";
    private static final String DECODE_OPTION_LONG  = "--decode";
    private static final String HELP_OPTION_SHORT = "-h";
    private static final String HELP_OPTION_LONG  = "--help";
    private static final String VERSION_OPTION_SHORT = "-v";
    private static final String VERSION_OPTION_LONG  = "--version";
    private static final String ENCODED_FILE_EXTENSION = "het";

    public static void main(String[] args) {
        Set<String> commandLineArgumentSet = getCommandLineOptions(args);

        if (commandLineArgumentSet.isEmpty() 
                || commandLineArgumentSet.contains(HELP_OPTION_LONG)
                || commandLineArgumentSet.contains(HELP_OPTION_SHORT)) {
            printHelpMessage();
            System.exit(0);
        }

        if (commandLineArgumentSet.contains(VERSION_OPTION_LONG)
                || commandLineArgumentSet.contains(VERSION_OPTION_SHORT)) {
            printVersion();
            System.exit(0);
        }

        boolean decode = commandLineArgumentSet.contains(DECODE_OPTION_LONG) ||
                         commandLineArgumentSet.contains(DECODE_OPTION_SHORT);

        boolean encode = commandLineArgumentSet.contains(ENCODE_OPTION_LONG) ||
                         commandLineArgumentSet.contains(ENCODE_OPTION_SHORT);

        if (!decode && !encode) {
            printHelpMessage();
            System.exit(0);
        }

        if (decode && encode) {
            printHelpMessage();
            System.exit(0);
        }

        commandLineArgumentSet.removeAll(Arrays.asList(ENCODE_OPTION_SHORT,
                                                       ENCODE_OPTION_LONG,
                                                       DECODE_OPTION_SHORT,
                                                       DECODE_OPTION_LONG));
        if (commandLineArgumentSet.isEmpty()) {
            System.err.println("Bad command line format.");
            System.exit(1);
        }

        File file = new File(commandLineArgumentSet.iterator().next());

        try {
            if (decode) {
                doDecode(args);
            } else if (encode) {
                doEncode(file);
            } 
        } catch (Exception ex) {
            System.err.println(ex.getMessage());
            System.exit(1);
        }
    }

    private static void doEncode(File file) throws FileNotFoundException {
        byte[] fileBytes = readBytes(file);

        Map<Byte, Float> weightMap =
                new ByteWeightComputer()
                        .computeCharacterWeights(fileBytes);

        Map<Byte, BitString> encodeMap = 
                new HuffmanTree(weightMap).inferEncodingMap();

        BitString encodedText = new HuffmanEncoder().encode(encodeMap,
                                                            fileBytes);

        byte[] data = new HuffmanSerializer().serialize(encodeMap,
                                                        encodedText);

        File outputFile = 
                new File(file.getName() + "." + ENCODED_FILE_EXTENSION);

        System.out.println(
            "Writing compressed text to \"" + outputFile.getName() + "\"...");

        writeBytes(data, outputFile);
    }

    private static void doDecode(String[] args) {
        String file1 = null;
        String file2 = null;

        try {
            int index = 0;

            for (int i = 0; i < args.length; ++i) {
                if (args[i].equals("DECODE_OPTION_SHORT") 
                        || args[i].equals("DECODE_OPTION_LONG")) {
                    index = i;
                    break;
                }
            }

            file1 = args[index + 1];
            file2 = args[index + 2];
        } catch (Exception ex) {
            System.err.println("Not enough tokens on command line.");
            System.exit(1);
        }

        byte[] inputData = readBytes(new File(file1));
        HuffmanDeserializer.Result result = 
                new HuffmanDeserializer().deserialize(inputData);
        byte[] originalData = new HuffmanDecoder()
                .decode(result.getEncodedText(),
                        result.getEncoderMap());

        writeBytes(originalData, new File(file2));
    }

    private static Set<String> getCommandLineOptions(String[] args) {
        Set<String> set = new HashSet<>();

        for (String arg : args) {
            set.add(arg);
        }

        return set;
    }

    private static void printHelpMessage() {
        String preamble = "usage: java -jar " + getThisJarName() + " ";
        int preambleLength = preamble.length();
        String indent = getIndent(preambleLength);

        StringBuilder sb = new StringBuilder();
        sb.append(preamble);

        sb.append("[")
          .append(HELP_OPTION_SHORT)
          .append(" | ")
          .append(HELP_OPTION_LONG)
          .append("]\n");

        sb.append(indent)
          .append("[")
          .append(VERSION_OPTION_SHORT)
          .append(" | ")
          .append(VERSION_OPTION_LONG)
          .append("]\n");

        sb.append(indent)
          .append("[")
          .append(ENCODE_OPTION_SHORT)
          .append(" | ")
          .append(ENCODE_OPTION_LONG)
          .append("] FILE\n");

        sb.append(indent)
          .append("[")
          .append(DECODE_OPTION_SHORT)
          .append(" | ")
          .append(DECODE_OPTION_LONG)
          .append("] FILE1 FILE2\n");

        sb.append("Where:\n");

        sb.append(HELP_OPTION_SHORT)
          .append(", ")
          .append(HELP_OPTION_LONG)
          .append("     Prints this message and exits.\n");

        sb.append(VERSION_OPTION_SHORT)
          .append(", ")
          .append(VERSION_OPTION_LONG)
          .append("  Prints the version info and exits.\n");

        sb.append(ENCODE_OPTION_SHORT)
          .append(", ")
          .append(ENCODE_OPTION_LONG)
          .append("   Encodes the text from standard input.\n");

        sb.append(DECODE_OPTION_SHORT)
          .append(", ")
          .append(DECODE_OPTION_LONG)
          .append("   Decodes the text from standard input.\n");

        System.out.println(sb.toString());
    }

    private static String getIndent(int preambleLength) {
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < preambleLength; ++i) {
            sb.append(' ');
        }

        return sb.toString();
    }

    private static void printVersion() {
        String msg = 
        "Huffman compressor tool, version 1.61 (Nov 19, 2016)\n" +
        "By Rodion \"rodde\" Efremov";

        System.out.println(msg);
    }

    private static String getThisJarName() {
        return new File(App.class.getProtectionDomain()
                        .getCodeSource()
                        .getLocation()
                        .getPath()).getName();
    }

    private static void writeBytes(byte[] data, File file) {
        try {
            FileOutputStream fos = new FileOutputStream(file);
            fos.write(data);
            fos.close();
        } catch (IOException ex) {
            throw new RuntimeException(
            "ERROR: File IO failed while writing encoded data.", ex);
        } 
    }

    private static byte[] readBytes(File file) {
        try {
            Path path = Paths.get(file.getAbsolutePath());
            return Files.readAllBytes(path);
        } catch (IOException ex) {
            throw new RuntimeException(
            "ERROR: File IO failed while reading a binary file.", ex);
        }
    }
}

Note 1 On War and Peace achieves compression ratio of about 43 percent.

Note 2 Project available at GitHub.

2 comments

@oopexpert 2016-11-19 14:09:16

Very clean code. I only see some little issues:

You throw "InvalidFileFormatException" in HuffmanDeserializer. Should the class care about where the byte-array comes from? I suggest to rename it to "InvalidFormatException".

The MAGIC handle (checkSignature) can be omitted as you already identify format issues while you process the byte array. That is semantical redundancy. I currently do NOT say "delete the constant" as it is significant information to a developer what he/she has to expect if maintaining the code. But the relevant data is only the length of MAGIC.

On the other side I do not know if MAGIC bytes are used only in your code. If so then I really think it is completely obselete.

You several times mentioned:

HuffmanSerializer.MAGIC.length + HuffmanSerializer.BYTES_PER_BIT_COUNT_ENTRY + HuffmanSerializer.BYTES_PER_CODE_WORD_COUNT_ENTRY

I suggest to extract this code into a method "getOmittedByteCount" as you already mentioned it in the method "extractEncodedText" (omittedBytes).

@JS1 2016-11-19 13:14:28

Consider walking the tree to decode

For encoding, using a map is smart, as you can convert your input byte to an output bitstring in a single lookup.

However, for decoding, you don't really know when your bitstring is going to end. Your code does one (unsuccessful) lookup for each bit in the bitstring until the bitstring ends. So if 011011 mapped to 'A', you'd do six total lookups. If you instead walked the huffman tree starting at the root, then for each bit you would move to the left or right child and check whether that node was a leaf. I think that this would be faster because it would be a node traversal versus a map lookup.

Even faster using FSM

I think you can get even faster decode times using a finite state machine. There is an article here that describes one implementation.

Related Questions

Sponsored Content

1 Answered Questions

1 Answered Questions

[SOLVED] Simple Java Download Manager

1 Answered Questions

[SOLVED] Java web application for sharing temporary notes

2 Answered Questions

[SOLVED] Huffman compressor in C++

1 Answered Questions

[SOLVED] Huffman compressor in Java - follow-up

1 Answered Questions

[SOLVED] Bit string builder for Java

  • 2016-05-25 17:23:56
  • coderodde
  • 460 View
  • 6 Score
  • 1 Answer
  • Tags:   java bitwise bitset

1 Answered Questions

1 Answered Questions

Sponsored Content