Week 4 Lab - Compression
COSC244 & TELE2021 IntroductionThe idea behind data compression is to make a file as small as possible in order to save diskspace or to reduce transmission time over a network. Compression can save both time andmoney. In lectures, you have seen a number of different ways to perform data compression –Huffman codes, Run Length encoding (bit-level and character-level), and Lempel-Ziv encoding.In this lab, we will work with a variant of the Lempel-Ziv algorithm called LZW compression.This algorithm was published in 1984 by Terry Welch. The compression program and threesupport classes are provided for your use. You will implement the decompression programfrom a skeleton program we provide.1.1 Lempel-Ziv-Welch CompressionThe basic idea behind LZW compression is to construct a look-up table which maps charactersequences to numbers. Initially the table is created containing only a mapping of an eightbit character set (for example ISO-8859-1) to the numbers 0-255. As input gets processed,the look-up table is expanded by adding to it new combinations of characters as they areencountered. Whenever a combination appears again in the input, only a single number needsto be output in its place, thus saving space.During decompression, the table is constructed as the compressed file is read just as the tablewas constructed during the compression process. The table does not have to be sent with thecompressed file.1.2 Lab overviewThis lab has 3 steps:1. Understand the LZW algorithm in detail.2. Become familiar with the provided support classes.3. Write a decompression program.The first two steps above are the preparation for this lab and should be done before comingto the lab.2 AssessmentThis lab is worth 1%. The assessment in this lab consists of tracing the decompressionalgorithm, writing answers to some questions, and one programming exercise.1Tracing the decompression algorithm and completing the written questions constitute thepreparation for this lab. This will be marked by the demonstrators at the start of the lab. Thepreparation questions should be answered before coming to the lab. They should be storedelectronically in a plain text file with a .txt suffix in your ~/244/04 directory. We have provideda file lab04-answers.txt which you can copy from /home/cshome/coursework/244/start/04to put your answers in.We estimate it will take 2-3 hours to properly prepare for this lab.Make sure you get your work marked by a demonstrator before leaving the lab.3 Preparation3.1 LZW AlgorithmBefore we can implement the LZW algorithm (or any algorithm), we must understand it indetail. Wikipedia had a good explanation of the algorithm, along with two worked examplesand some pseudocode (Lempel-Ziv-Welch on Wikipedia - Feb 13, 2008). The current versiondoesn’t have the pseudocode anymore, so make sure you use the older version. You can accessit from the Resources page of the course webpages. We are more interested in the first, thanthe second example.During the tutorial in week 3, you worked through the compression algorithm. Now it is timeto work through the decompression algorithm. Here is the decompression algorithm slightlymodified from Wikipedia.initialise the dictionary with codes 0-255 and theircorresponding characters;read a code;entry = dictionary entry for the code;output entry;while (read a code) {prev = entryif (code exists in dictionary) {entry = dictionary entry for the code;} else { // code does not exist in dictionaryentry = prev + prev[0];}output entry;add (prev + entry[0]) to the dictionary;}// NOTE: x[0] means the first character in the string, xIn preparation for the lab, use this decompression algorithm to decompress the following string.The numbers represent the binary data being read from a compressed file.84 79 66 69 79 82 78 79 84 256 258 260 265 259 261 263The dictionary is initialised with the extended ASCII character set in positions 0-255. So thefirst available slot for insertion in the dictionary is 256. The following table gives the relevantportion of the initialised dictionary.2Code String66 B69 E78 N79 O82 R84 TQ1 Fill out the table shown below by tracing the execution of the decompression algorithmon the provided input. There is a copy of this table which you can complete in thelab04-answers.txt file. You will find it easiest to open up the file in Emacs becauseEmacs will handle the table in a special way to keep things tidy.The demonstrator will check that you have completed this table when checkingyour lab preparation.code prev entry prev + entry[0] output dictionaryASCII code in 0-2558479666979827879842562582602652592612633.2 Provided Support CodeA complete implementation of the LZW algorithm is not a trivial task. We have provided youwith 3 support classes. These classes allow you to concentrate on the algorithm itself. You donot need to worry about reading and writing bit stream data or how to construct and accessthe table – we are giving you classes and methods which do this. Open up the documentationfor the provided classes in a web browser using this command:$ firefox /home/cshome/coursework/244/start/04/docs/index.html &3.2.1 BitOutputStreamLet’s first look at BitOutputStream. This class is used to write the compressed output file asbinary data. It is used in the Compress.java program which you are provided. You will notneed this class to write the decompression program, but it is instructive to look at its methods(and you have to answer the following questions).3Click on BitOutputStream and look at the constructor:BitOutputStream(OutputStream out, int bitFieldSize)The first parameter specifies where we want the output to go. It is a java.io.OutputStreamwhich Compress.java assigns to System.out. When you run the program later, it will beredirected to a file.Q2 Briefly describe the second input parameter.Q3 What is the purpose of the method, setBitFieldSize()?Q4 What is the purpose of the method, write()?Q5 What is the purpose of the method, close()?3.2.2 BitInputStreamNow look at BitInputStream. You will use this class to read a compressed file. Click onBitInputStream and look at the constructor:BitInputStream(InputStream in, int bitFieldSize)The first parameter specifies the source of input data. It is a java.io.InputStream. The skeletonDecompress.java assigns it to System.in. When you run your decompression program, you willredirect to get input from a file. The second parameter specifies the number of bits to read.This is set to 9-bits in the skeleton program. More on why this is done later in this document.Q6 What is the purpose of the method, setBitFieldSize()?4Q7 What is the purpose of the method, read()?3.2.3 LookUpTableLook at LookUpTable. This class is used to create and manipulate the table containing stringsand their binary codes. Click on LookUpTable and look at the constructor:LookUpTable()Q8 Briefly describe what it does.Q9 What is the purpose of the method, add()?Q10 What is the purpose of the method, hasString()?Q11 What is the purpose of the method, getCode()?The methods, add(), hasString() and getCode(), are used in the compression algorithm.In addition to add() the decompression algorithm uses the two methods, hasCode() andgetString().Q12 What is the purpose of the method, hasCode()?Q13 What is the purpose of the method, getString()?53.2.4 A Bit More ExplanationA bit more explanation is needed to fully understand the classes, the compression program,and the decompression skeleton program.The lookup table is initialised with the codes and their corresponding characters for every8-bit character. We store the individual characters as a String so it is easy to continue on withmultiple character sequences. The table initially has entries for codes, 0 - 255.0 - 255 is the range of codes that can be stored in an 8-bit field. All character sequences addedto the initialised dictionary must have codes greater than 255. Thus once we want to read orwrite a character sequence rather than a single character, we need more than 8 bits. Thereforewe would need to begin writing 9-bit codes. The previous 8-bit codes are compatible with the9-bit codes since the most significant bit of the 8-bit codes will be a zero when expressed with9-bits. If later we got to code 512, we would have to expand to 10-bit codes. And so on.In a nutshell, the above scheme writes 8-bit codes until we get to 256, then it begins writing9-bit codes. Once we get to code 512, we expand to 10-bit codes. We continue adding a bitevery time the code doubles to another power of two.Intuitively one might be tempted to start writing 8-bit codes during compression. In fact,it works fine for compression. Do you see any problem for decompression? Think about it.What do you get when you attempt to read codes above 255 as an 8-bit code? How do youknow when, for example, 256 is received?If we try to decompress a file which was compressed beginning with 8-bit codes, we have aproblem when we get to the first 9-bit code. That is, we have only read 8 bits of the codeinstead of the required 9.The solution we adopt is to start off writing and reading 9-bit codes for 0 - 255. To makethings a bit simpler in the provided classes, compression program and decompression skeleton,we use the code, 256, to indicate it is time to expand the input or output bit stream to anadditional bit. The code, 256, is a constant defined in LookUpTable called GROW_CODE. Because256 serves a special purpose, the first multiple character sequence goes into slot 257 in theLookUpTable. The next goes into slot 258, and so on.Because we are writing 9-bit codes, compression works well until we need to write code 512.That code requires 10-bits. Before writing 512, Compress.java writes code 256 (our specialmarker code), increases the number of bits being written to 10, and then writes 512. Fromthis point on, all codes are written as 10-bit codes. A similar thing happens when we get tocode, 1024, 2048, and so on.When decompressing a file, we begin by reading 9-bit codes. This allows us to reads codes0 - 511. When code 256 in encountered, we increment the number of bits read and continuereading.This is why Compress.java and the skeleton Decompress.java have the line:int bitFieldSize = 9;We have provided the actual line of code which will increment the bit field size for you withinthe commented pseudocode in Decompress.java, so you don’t need to figure it out for yourself.It isin.setBitFieldSize(++bitFieldSize);64 Programming ExerciseAssuming you are in your 244 directory, you can start by copying the provided files into thedirectory you want to work in using this command:$ cp -r /home/cshome/coursework/244/start/04/src 04This copies all the files you will need for this lab. Change into the 04 directory to beginworking on them.$ cd 04Start by compiling the compression program using this command:$ javac Compress.javaYou should be able to compress smallfile like so:$ java Compress < smallfile > smallfile.cmpThe task which you need to complete for this lab is to implement the decompression algo-rithm. We have provided a file called Decompress.java which contains skeleton code, as wellas pseudocode comments which describe the algorithm. A printout of the file is attached asthe next to last page of this document.Your task is to write the code which implements the pseudocode comments. You can test yourprogram by decompressing smallfile.cmp using this command:$ java Decompress < smallfile.cmp > smallfile.dcpYou can either visually compare smallfile to smallfile.dcp or type the following command whichshould show no differences:$ diff smallfile smallfile.dcpOnce you have completed this program ask a demonstrator to check your work.Make sure you get your work signed off by a demonstrator before leaving the lab.5 Optional Extension ExerciseThis part is for those who want a bit more of a challenge. It is not required or marked.Add some code to Compress.java and Decompress.java to measure how long the compressionor decompression takes, and print this information to System.err. You might find the methodSystem.currentTimeMillis() helpful here.After you have that working, add some code to Compress.java to make it read from a fileinstead of System.in and write to a file instead of System.out. Now add some code to Com-press.java to calculate the size of the compressed file as a percentage of the original file.7Listing of Decompress.java/*** This class performs LZW decompression on data read from stdin. It* reads the compressed input from stdin using a BitInputStream which* allows a variable width bit field to be used.*/public class Decompress {/*** The entry point of the program. Performs LZW decompression on input* read from stdin, and outputs it to stdout.** @param args command line arguments are not used.* @exception Exception Throws all exceptions (no error handling done)*/public static void main(String[] args) throws Exception {int bitFieldSize = 9; // 8 bits for character set + 1 bit for growcodeBitInputStream in = new BitInputStream(System.in, bitFieldSize);LookUpTable table = new LookUpTable(); // initialise the dictionaryint code; // the code read from stdinString entry = null; // current entry from dictionaryString prev = null; // previous entry from dictionary/** process the first code value in the compressed file/stream */if ((code = in.read())!= -1) { // if there is input// let entry = table entry for code// output entry}/** process the remaining codes in the compressed file/stream */// while there is input// if code is the grow code// in.setBitFieldSize(++bitFieldSize);// otherwise// let prev = entry// let entry = table entry for code// if there wasn’t a table entry for code then// let entry = prev + first char of prev// output entry// add prev + first char of entry to the table}}