Huffman coding in Java

I’m going through my list of interesting assignments that are worth writing about and now is Huffman coding‘s turn. Simply put, we had to create a completely functioning file format implementing a compression method based on Huffman’s coding. The implementation of the coding itself is well documented, but the header of the file is the part that interests us. How can we describe the dictionary for the coding the most efficiently?

Huffman’s coding

The coding itself is very simple. We count the number of occurrences of each byte, then we assign each character a bit string in a way that produces the shortest compressed value. Like I said, simple… Ok, maybe not so much. It’s easier to comprehend if we consider that he number of bits required is equal to log2(fileSize / characterCount). Still not clear on how to do it? Well then you’re just going to have to hope I got it right! ;)

The Wikipedia article explains how to implement this much better than I ever can. So I will concentrate on the file header for the rest of this article.

Clean the garbage

Another problem to take care of is the garbage at the end of the file. The Windows file system will always store the file on a integer number of bytes. This means if our encoded file has 17 bits, it will be written as 24 bits. When we read it, we will read 7 bits of garbage (zeros) at the end of the file! This can translate to a magic character appearing at the end of the file. To avoid that, we compress the content and then we write  a 3 bits unsigned integer (the garbage length), then a series of zeros the same length as the  garbage.

In our submission, we wrote this as a header, meaning we keep the whole compressed content in memory to measure the garbage lenght, then we add the garbage as a header. This requires additional resources to compress large files because we keep the whole compressed content in memory. We cannot put this garbage at the end of the file. It is possible to use the preprocessed character frequencies and the generated character mappings to calculate the total length of the file and modulo this by 8 to obtain the trash length without keeping the output in memory. We did not do this because of a lack of time.

Note that we already cache the input because we need to calculate the character frequencies so we already waste resources. Most compression algorithms split the input into chunks to mitigate this problem.

Plant the tree

Next, we need to serialize our tree into the file header so it can be decompressed, otherwise it is impossible to know what each series of bits means or even to know how to split each character! This part is really the heart of our optimizations.

To represent the tree, we came up with a code. First, we write the characters that we have in the tree in depth first order from one side to the other. Next, we write a binary string representing the tree structure using the following code:

  • 0 means we go deeper to the right
  • 1 means we place the next letter in the list, then we go up in the tree until we reach a parent from the right and then we go deeper to the left (if we go went from the left, then went left, we would be navigating the same tree branch twice).

Using this encoding scheme, it is possible to encode the structure of the complete tree with as many bits as there are branches in it.

Right now, we write the character count before listing the characters. This is a waste of one byte. It is possible to trim this byte by placing the tree before the character list. This is due to the fact that the last “1” in the tree structure definition will cause the root to return to its parent, marking the last character in the tree. Using this, we can count the number of ones in the tree structure definition and read the correct number of characters from the file.

Thanks and acknowledgements

We have used parts of the BitIO library which is LGPL licenced (and not owned by us). The rest of the code is licenced under the MIT Licence. This gives you the right to customize and distribute this as closed source, in which case you must replace the BitInputStream and BitOutputStream classes with your own to get rid of the LGPL licence on them. Simply, I just don’t want to be sued, but you can do whatever you want with this Huffman encoder other than that.

As for the Sudoku solver, thanks to my teammates: Mathieu Lacasse-Potvin and Michael Badeau.

The program was part of an assignment for the class LOG320 at École de Technologie Supérieure in Montréal during summer 2010.

The code

Here are the sources (zipped: 10KB).

The utility is called from command line. To use it, simply pass in the letter “e” for encode or “d” for decode, followed by the input and output files. These three parameters are all you need to encode and decode files using this utility.

7 Responses to Huffman coding in Java

  1. Trying to encode a 59 KB csv file containing integers:

    Exception in thread “main” java.lang.ArrayIndexOutOfBoundsException: 2
    at Main.main(Main.java:41)

  2. I just opened up the sources real quick and I saw that you’re right, there is a bug! However, it’s a problem with the validation of the input paramters, not the logic in the code.

    You probably only passed two parameters. You need to pass the following for the code to work. I’ll update a quick fix for the validation issue right away.

    1. The letter ‘e’ without quotes for the encode mode. For decode, use ‘d’.
    2. The input file containing the file you want to encode.
    3. The output file you want to encode to.

  3. Could you clarify what you mean with this? Can you make an image because I’m not so good in math?

    -0 means we go deeper to the right
    -1 means we place the next letter in the list, then we go up in the tree until we reach a parent from the right and then we go deeper to the left (if we go went from the left, then went left, we would be navigating the same tree branch twice).

  4. Huffman coding in Java for image compression with lossless technique

  5. i just ran the program, it is correctly encoding and decoding. however when i open the output file, instead of the huffman code i see strange characters, doesnt your program generate huffman code??

  6. @Stoneage : Let’s take a header with “ABC” as the letters and the bit string “00111”.

    It will produce the following tree :

    / \ 0
    1 ‘C’ / \ 0
    1 ‘B’ ‘A’ 1

    It add a new node to the right for each ‘0’ and it puts the next value and it goes to the next available empty node to the left for each ‘1’.

    This allows us to encode the tree’s structure very efficiently.

    @ruthvika : It encodes the file into a compressed string of bytes according to the Huffman coding rules. This string of bytes cannot be read using a text editor. It would only look like garbage, just like trying to open an executable file with notepad.

  7. I read a lot of interesting posts here. Probably you spend a lot of time
    writing, i know how to save you a lot of time, there is an online tool that creates readable, SEO friendly articles in minutes, just type in google –
    laranitas free content source

Leave a Reply