By Turnipdabeets


2018-12-04 19:10:15 8 Comments

I’m learning Swift and decided the Huffman Coding algorithm would be a good exercise while learning this new language. Below is a working version that encodes a string and decodes a string.

Here’s an example of how the API is used:

let huffEncoded = Huffman("MISSISSIPPI_RIVER!")
let decode = huffEncoded.decode()
  1. I’d love to learn of anything I’m not doing that would be more swift-like.
  2. I’m wondering if my choice of using a class with a lot of static functions is a good approach or if a struct would be better. They seem like similar solutions but I’m not 100% sure the real difference past structs are value types and classes are reference types.
  3. I'm also not confident in how I used GCD, but I think that using a concurrent queue will be helpful if encoding or decoding a large string since this isn't updating a view.
  4. I’m happy to hear any suggestions on my approach to the algorithm itself
  5. My next step is to make this into a framework so it can be used by other code. Does my approach scale for this?

Thanks for any help!

class Huffman {
    private var key = [String: String]()
    private(set) var code = [String]()
    lazy private var queue: DispatchQueue = {
        return DispatchQueue(label: "huffman-queue", qos: .userInitiated, attributes: .concurrent)
    }()

    init(_ input: String) {
        self.key = Huffman.getKey(for: input)
        self.code = Huffman.encode(for: input, with: self.key)
    }

    func decode() -> String {
        var word = ""
        queue.sync { [unowned self] in
            var reverseKey = [String:String]()
            for (k, v) in self.key {
                reverseKey[v] = k
            }

            for prefix in self.code {
                if let letter = reverseKey[prefix] {
                   word += letter
                }
            }
        }
        return word
    }

    static private func getKey(for input: String) -> [String: String] {
        // sort letter frequency by decreasing count
        let sortedFrequency = Array(input)
            .reduce(into: [String: Int](), { freq, char in
                let letter = String(char)
                return freq[letter] = freq[letter, default: 0] + 1
            })
            .sorted(by: {$0.1 > $1.1})
        // create queue of initial Nodes
        let queue = sortedFrequency.map{ Node(name: $0.key, value: $0.value)}
        // generate key by traversing tree
        return Huffman.generateKey(for: Huffman.createTree(with: queue), prefix: "")
    }

    static private func encode(for input: String, with key: [String: String]) -> [String] {
        var code = [String]()
        let queue = DispatchQueue(label: "huffman-encode-queue", qos: .userInitiated, attributes: .concurrent)
        queue.sync {
            for char in input {
                if let prefix = key[String(char)] {
                    code.append(prefix)
                }
            }
        }
        return code
    }

    static private func generateKey(for node: Node, prefix: String) -> [String: String] {
        var key = [String: String]()
        if let left = node.left, let right = node.right {
            key.merge(generateKey(for: left, prefix: prefix + "0"), uniquingKeysWith: {current,_ in current})
            key.merge(generateKey(for: right, prefix: prefix + "1"), uniquingKeysWith: {current,_ in current})
        }else {
            key[node.name] = prefix
        }
        return key
    }

    static private func createTree(with queue: [Node]) -> Node {
        var queue = queue
        // until we have 1 root node, get subtree of least frequency
        while queue.count > 1 {
            let last = queue.count - 1
            let node1 = queue[last]
            let node2 = queue[last - 1]

            // if we have a third then compare frequency to second
            if let node3 = queue[safe: last - 2], node3.value + node2.value < node2.value + node1.value {
                queue.remove(at: last - 1)
                queue.remove(at: last - 2)
                queue.append(Huffman.createRoot(with: node2, and: node3))
            } else {
                queue.removeLast()
                queue.removeLast()
                queue.append(Huffman.createRoot(with: node1, and: node2))
            }
        }
        return queue[0]
    }

    static private func createRoot(with first: Node, and second: Node) -> Node {
        return Node(name: "\(first.name)\(second.name)", value: first.value + second.value, left: first, right: second)
    }

}

class Node {
    var name: String
    var value: Int
    var left: Node?
    var right: Node?

    init(name: String, value: Int, left: Node? = nil, right: Node? = nil) {
        self.name = name
        self.value = value
        self.left = left
        self.right = right
    }
}

extension Collection {
    subscript (safe index: Index) -> Element? {
        return indices.contains(index) ? self[index] : nil
    }
}

1 comments

@Martin R 2018-12-04 21:37:58

Concurrency

In the encode and decode method you use queue.sync() to dispatch code to a concurrent queue – which means that the current queue is blocked until the computation is finished. If that is the main queue in an iOS or macOS application, UI updates and event handling are blocked for that time.

Therefore using a dispatch queue does not really help here. Also the Huffman encoder cannot know in which context it is executed (such as “user initiated”).

Dispatching the code with GCD – if necessary – should be done by the caller of these methods, not in the Huffman class itself.

The API

In its current form, the class seems of limited use to me.

let huffEncoded = Huffman("MISSISSIPPI_RIVER!")

builds the Huffman tree and encodes the given string and stores the results in private member variables. The only thing that I can do with the result is to decode it again. What I am missing are methods to

  • retrieve the generated Huffman tree so that it can be stored (perhaps in some serialized form) for later decompression,
  • retrieve the compressed string as a sequence of zeros and ones,
  • an initializer which takes a (previously created) Huffman tree,
  • a decode methods which takes a previously compressed string and decodes it, using the given Huffman tree.

Simplifications and other remarks

All properties in class Node are never mutated, and can be declared as constants (with let).

In func decode() the reverse dictionary mapping can be created with

let reverseKey = Dictionary(uniqueKeysWithValues: zip(key.values, key.keys))

instead of a loop. The following loop can be more simply done with compactMap():

let word = code.compactMap({ reverseKey[$0] }).joined()

In func getKey() it is not necessary to create an array from the given string. The return in the closure of the reduce() call is misleading: An assignment does not return anything. += can be used for the increment. Using $n.value instead of $n.1$ in the sort predicate emphasizes that the dictionary is sorted according to its values:

let sortedFrequency = input.reduce(into: [String: Int](), { freq, char in
        freq[String(char), default: 0] += 1
    })
    .sorted(by: {$0.value > $1.value})

In func encode(), compactMap() can be used again instead of a for loop:

let code = input.compactMap( { key[String($0)] } )

In func createTree() a “safe subscript” method (defined as an extension of the Collection protocol) is used to access the third last queue element if it exists. Testing if queue.count >= 3 would seem more clear to me.

The Huffman tree algorithm

Your algorithm does not generate the optimal code. The reason is that at each step it only considers the last two or three nodes, not the two nodes with the minimal total weight. Here is an example: For the string "ABCDEFGH" (i.e. 8 distinct characters with equal freqency) your program generates the codes

"D": 01
"G": 11
"F": 001
"E": 101
"A": 0001
"C": 0000
"H": 1000
"B": 1001

with an average code length of 26/8 = 3.25. (Your output can be different because hash values are randomized since Swift 4.2.) The optimal tree in this case would be a balanced binary tree, where every code has length 3.

@Turnipdabeets 2018-12-04 21:47:37

Thanks! I need to spend some time thinking through all your suggestions. Wouldn't the last two or three nodes always be of minimal total weight if I've sorted in that order?

@Martin R 2018-12-04 22:22:55

@Turnipdabeets: Here is an example: You start with weights 1, 1, 1, 1, 1, 1. Then the last two are joined: 1, 1, 1, 1, 2. Then 1, 1, 2, 2. Now the first two should be joined, but your algorithm looks only at the last three. – Also for nodes with equal weight the one with lower height should be chosen.

@Turnipdabeets 2018-12-05 15:50:45

Thanks for that example. I’ll rework on the weights. Can you elaborate on the API improvements? An initializer that takes a previously created Huffman tree is a great idea. But is it better to retrieve and initialize with the tree or the key that I generate from the tree? My thought was that the key is O(n) look up so better than the tree.

@Turnipdabeets 2018-12-05 15:51:13

I was also thinking maybe the encrypted code should contain the tree/key rather than separate them… maybe as the first element in the array for code or maybe return a struct with these two values. What are your thoughts about keeping the values together like that? The instance var code is readable but its an array of zeros and ones. Is there a benefit of having a string of zeros and ones vs an array? I’ll have to google around for compressing the string or array of strings.

Related Questions

Sponsored Content

1 Answered Questions

[SOLVED] C++11 implementation of Huffman-encoding

Sponsored Content