EECS 233 Programming Assignment
In this assignment, you will implement the Huffman encoding of English characters, using a combined list and binary tree data structure. The method should have the following signature:
public static String huffmanEncoder(String inputFileName, String encodingFileName String outputFileName);
which will read and compress an input text file inputFileName, based on the Huffman encoding produced using the input text file encodingFileName, and output the compressed file in outputFileName (note that these strings represent the file names, not content – you need to perform file I/O). The method will output the outcome of its execution, e.g., “OK”, “Input file error”, “Encoding file error” , etc. The output file can simply contain the characters “0” and “1”, instead of the binary bits. It means you are not required to use sophisticated binary bit operations to actually store the encoded characters in a compact way. In fact, because you will be translating each character into a sequence of zeros and ones, each being the full character itself, your output file size will actually be significantly larger than the input file. Decompressing a file is not required.
The program should also print out
(a) The Huffman encoding of characters in the form of a table of character/frequency/encoding triples – one triple per line, with “:” separating the elements, e.g. (the encoding is imaginary, just an example):
a: 315: 10010
b: 855: 1010
…
(b) The calculated amount of space savings (see below for details).
You will need to use a Java’s standard list facility but implement your own HuffmanNode class with the following fields:
0 inChar: the character denoted by the node. This is meaningful only for the leaf nodes of a Huffman tree so interior nodes can have garbage here. But storing garbage rarely represents a good programming style. So you need to distinguish between a meaningful character and “nothing”. To do so even when any value may potentially represent some valid character code you should use the Character wrapper class instead of the primitive “char” type. Then you can have “null” to represent “nothing”.
0 frequency: the frequency of occurrences of all characters in the subtree rooted at this node. For a leaf node, this frequency is the frequency of the character in the leaf node; for an interior node, the frequency is the sum of all frequency values in the leaves of the subtree.
0 left: left child of a node in the Huffman tree
0 right: right child of a node in the Huffman tree
(Note: the above class contains no “next” field since you are using Java’s native lists and the objects of this class are just elements in the list)
Explain your choice of the Java List implementation (ArrayList vs LinkedList) in the comments. (Do not look for a deep reason why one of the choices will be unsuitable – both are actually workable; but because you have to pick one implementation, share your thoughts why you picked one and not the other.)
You will implement several helper methods, specifically:
0 A method to scan the encoding text file to generate the initial list of HuffmanNodes (you will need to read the file and define a local table in this method to remember the frequency),
0 A method to merge two HuffmanNodes and return the combined HuffmanNode,
0 A method to run the Huffman encoding algorithm to produce the Huffman tree,
0 A method to traverse the Huffman tree to output the character encoding, and
0 A method to scan the input text file, produce the encoded output file, and compute the savings.
You can use whatever method of computing the character frequencies. However, once you have them, you do need to generate Huffman encoding, rather than assign variable-length codes to characters in some ad-hoc manner.
Testing your code
As a test of your program, you will need to test the compression efficiency of the included text “Gadsby.txt” using two generated encodings.
Gadsby is a 1939 novel written by Ernest Vincent Wright. It is a story about how the titular character, John Gadsby, fights to restore his hometown to its former glory. What makes this novel distinctive is that it was written entirely without the letter “E”. As it is written entirely without that letter the novel does not follow the typical letter frequency common in most english writing where “E” is the most common. You will analyse the relative efficiency of compressing the novel using a Huffman Encoding generated by statistical analysis of general english writing and a Huffman Encoding generated by analysis of the text of the novel itself.
The first encoding you will use is based on the relative character frequency obtained through the analysis of Gadsby. This can be obtained done by running the helper methods that are specified above to generate the encoding using “Gadsby.txt” as the encoding file. This will entail calling the method huffmanEncoder using “Gadsby.txt” as both the input file and the encoding file.
The other of these encodings is based on the relative character frequency in text written in english. This frequency will be obtained by using your helper methods to generate a Huffman encoding based on the unabridged dictionary included as an attached file as “Dictionary.txt”. While this is not a perfect huffman encoding relative to the frequencies of characters it is close enough for our purposes. This test entails calling huffmanEncoder using “Gadsby.txt” as the input file and the unabridged dictionary as the encoding file.
Calculate the relative efficiency of the encoded text generated by each encoding and compare them to each other. For this, you assume that in the original encoding, each character costs 8 bits in space, and then as you encode the text with Huffman encoding, you count how many bits each character will now occupy, and so you can maintain the running total to the end. Please write a short (~1 Paragraph) explanation of why the compressed encodings are the size they are and how they compare. Please include details about why basing your encoding on another text aside from the one you are compressing changes the compressed size.
Deliverables:
A zip file containing:
1. All source codes, including sufficient comments for the TAs to understand what you have done (code without comments will be penalized). This must include the main method. We will run your program using command:
“java HuffmanCompressor inputFile encodingFile outputFile”
so please name your classes accordingly.
3. The Huffman encoding tables you generated using the two encoding files.
4. The encoded text files, labeled to specify which file is the result of which encoding.
5. The calculated amount of space savings each encoding has achieved on your text file. (Again, this will not be reflected in the actual sizes of the original and encoded files because you are representing bits as entire characters in your output file.)
6. The explanation of the relative sizes of the compressed text with detail about how the encoding changes the
size.
Grading rubrics:
Scanning the encoding file and generating character frequencies: 20 pts
Producing the Huffman tree (including the HuffmanNode class specification): 25
Using the Huffman tree to produce the encoding table: 25
Producing the encoded output files, computing the space savings, paragraph on encoding sizes: 20
Style and completeness of submission, including detailed comments: 10
Remember that all work submitted must be your own. Asking for help is fine but copying is explicitly not.