Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
COMP26120
Lab Exercise 3: Spellchecking (Better Trade-offs) Duration: (almost) 4 weeks You should do all your work in the lab3 directory of the COMP26120 2022 repository - see Blackboard for further details. You will need to make use of the existing code in the branch as a starting point. Important: You submit this lab via a quiz on Blackboard. This will: 1. Ask you some questions about your implementation, including the hash and tag of the commit you want us to mark (see below for details of this). 2. Ask you to upload a zip file of the python, c or java folder you have been working in. 3. Ask you to upload a PDF report of the experiments you will run in Part C of this lab. You can save your answers and submit later so we recommend filling in the questions for each part as you complete it, rather than entering everything at once at the end. Code Submission Details You have the choice to complete the lab in C, Java or Python. Program stubs for this exercise exist for each language. Only one language solution will be marked. Because people had a number of issues with GitLab last year we are going to take a multiple redundancy approach to submission of code. This involves both pushing and tagging a commit to GitLab and uploading a zip of your code to Blackboard. By preference we will mark the code you submitted to GitLab but if we can’t find it or it doesn’t check out properly then we will look in the zip file on Blackboard. Please do both to maximise the chance that one of them will work. When you submit the assignment through Blackboard you will be asked for the hash and tag of the commit you want marked. This is to make sure the TAs can identify exactly which GitLab commit you want marked. You tag a commit lab3 solution (we recommend you use this tag, but you do not have to) by typing the following at the command line: git tag lab3_solution git push git push origin lab3_solution You can find the hash of your most recent commit by looking in your repository on GitLab as shown in figure 1. You can also find the hash for a previous commit by clicking on the “commits” link and then iden- tifying the commit you are interested in. This is shown in figure 2. 1 Figure 1: Identifying the hash of your most recent commit in GitLab Figure 2: Identifying the hashes of previous commits in GitLab 2 For this assignment, you can only get full credit if you do not use code from outside sources. Built-in libraries in your chosen language are allowed but bear in mind that we ask you to implement hash sets - simply using the Java HashMap class (for instance) will not get you the marks for implementation. Note on extension activities: The marking scheme for this lab is organised so that you can get over 75% without attempting any of the extension activities. These activities are directed at those wanting a challenge and may take considerable extra time or effort. This lab is designed to take up 8 hours of student time spread over four weeks – if you have not achieved the extension activi- ties after you have put 8 hours of work into this lab then we recommend that you do not attempt them. Reminder: It is bad practice to include automatically generated files in source control (e.g. your git repositories). This applies to object files (C), class files (Java), and compiled bytecode files (Python). This is a general poor practice but for this course it can also cause issues with the online tests. Please also do not include files containing dictionaries and queries that you have generated – the size of these tends to cause issues with marking. While it is fine to discuss this coursework with your friends and compare notes, the work submitted should be your own. In particular this means you should not have copied any of the source code, or the report. We will be using the turnitin tool to compare reports for similarities. 3 Learning Objectives By the end of this lab you will be able to: • Describe the role of the hash function in the implementation of hash tables and describe and compare various hash collision strategies • Describe the basic structure of a binary search tree as well as the basic operations and their complexities • Write C, Java, or Python code to implement the above concepts • Evaluate experimentally the behaviour of hash sets and binary search trees. Introduction The aim of this exercise is to use the context of a simple spell-checking program to explore the binary search trees and hash tables data structures. This exercise builds on work in Labs 1 and 2. Even if you have not done these labs we recommend you take a look at them and their model solutions in order to familiarise yourself with the spell-checking program and our expectations for creating and writing up experiments. Getting the Support Code You should then be able to see a lab3 folder in your Gitlab that contains all the support files for this lab. Data Structures The spell-checking program stores a dictionary of words using a Set datatype. There are three data structures used to implement this Set datatype in this lab. In Lab 1 we introduced this datatype but we repeat that information here but you may also need to look online (e.g. the Wikipedia page) to complete your knowledge. Set. This is an abstract datatype that is used to store the dictionary of words. The operations required for this application are: • find whether a given value (i.e. string, alphabetic word) appears in the set. • insert a given value in the set. Note that there is no notion of multiplicity in sets - a value either appears or it does not. Therefore, if insert is called with a duplicate value (i.e. it is already in the set) it can be ignored. There would usually also be a remove function but this is not required for this application. We also include two further utility functions, which we ask you to implement, that are useful for printing what is happening : • print set to list the contents of a Set. • print stats to output statistical information to show how well your code is working. You should have already used a dynamic array to implement this dataype in Lab 1. In this exercise we use hash tables and binary search trees to implement the Set datatype. These have been introduced in lectures and we describe these briefly below. You may want to look at the recommended textbooks or online to complete your knowledge. 4 Hash Table. The underlying memory structure for a hash table is an array. A hash function is then used to map from a large ‘key’ domain to the (much smaller) set of indices into the array, where a value can be stored. When two values in the input domain map to the same index in the array this is called a collision and there are multiple ways to resolve these collisions. To use a hash table to represent a set we make the key and value the same - the result is usually called a hash set. Binary Search Tree. You can think of a tree as a linked list where every node has two ‘children’. This allows us to differentiate between what happens in each sub-tree. In a binary search tree the general idea is that the ‘left’ sub-tree holds values smaller than that found in the current node, and the ‘right’ sub-tree holds larger values. Lab 3: Better Storage In Lab 1 we achieved a faster find function by first sorting the input. In this part we explore two data structures that organise the data in a way that should make it faster to perform the find operation. Part a: Hash Table Lookup This part of the lab should take you between one and two weeks to complete (i.e., 3-4 hours). So far our solution to a fast find function has been sorting the input and in Part b we will look at storing it in a sorted order. In this part you will take a different approach that relies on a hash function to distribute the values to store into a fixed size array. We have only provided you with very basic program stubs. You need to decide what internal data structure you need etc. In this part you need to edit the program stub in your chosen language for hash sets. You should: 1. Complete the insert function for hash sets. What should you do if the value to be inserted already exists? Hint: this is representing a set. 2. Complete the find function. Importantly, it should follow the same logic as insertion. 3. Implement the print set function – This should print out a sensible representation of the hash set when called (for instance when running the program with the −vv flag) 4. Implement the print stats function – This should print out a sensible set of statistics, such as average collisions per insert, that can be used to understand how your program is performing. This will require some extra fields in the class (Java, Python) or node struct (C). You can find a discussion of how to test your code in the Section of these instructions on Testing (after Part b). The hash-value(s) needed for inserting a string into your hash-table should be derived from the string. For example, you can consider a simple summation key based on ASCII values of the individ- ual characters, or a more sophisticated polynomial hash code, in which different letter positions are associated with different weights. (Warning: if your algorithm is too simple, you may find that your program is very slow when using a realistic dictionary.) You should experiment with the various hash functions described in the lectures and textbook and implement at least two for comparison. One can be terrible, but one should be reasonable otherwise you are likely to have issues when performing an experimental comparison in Part c. These can be accessed by modes al- ready given in the program stubs and more details of these are contained in the language specific instructions, do not change these since they are used in our testing processes. Write your code so that you can use the −m parameter, which sets the mode. Initially, you should use an open addressing hashing strategy so that collisions are dealt with by using a collision resolution function. As a first step you should use linear probing, the most 5 straightforward strategy. To understand what is going on you should add code to count the num- ber of collisions so you can calculate the average per access in print stats. An issue with open addressing is what to do when it is not possible to insert a value. To make things simple, to begin with you can simply fail in such cases. Your code should keep the num entries field of the table up to date. Your insert function should check for a full table, and exit the program cleanly should this occur. Once this is working you should increase (double?) the hash table size when the table is getting full and then rehash into a larger table. Implement this resize and rehash functionality. When the program is called using the −vvv flag it should print out a message when it rehashes and then call the print stats function to show the new state of the hash set once rehashing completes. In C, beware of memory leaks!1 Mod of negative numbers: Most of you are likely to use the mod operator in your hash functions by doing something like h = x % size to try to get a value of h that is between 0 and size - 1. If x is negative then the behaviour of this operator varies by language: in par- ticular in C and Java it will return a negative number. See the discussion at https://torstencurdt.com/tech/posts/modulo-of-negative-numbers/ You might think that x cannot be negative. However, if you are adding or multiplying large numbers together (e.g., when hashing a string) then you are likely to get integer overflow. In hashing, we don’t care about such overflows as it is deterministic and just adds to the ‘chaos’ a bit but Java, for instance, may throw ArrayIndexOutOfBoundsException if a negative number is passed to the hash data structure, if you have not accounted for this possibility in your implementation. In C you could use an unsigned integer to make sure that you only have positive numbers. Once you have completed this implementation you should look at the Blackboard submission form where you will be asked the following questions about Part a. 1. What did you implement as your first hash function? How does it work? (No more than 100 words) 2. How do find and insert with linear probing work? Illustrate your explanation with a sample of your code for insert - this may be tidied up a bit for presentation but should clearly be the same as the code you have submitted. (no more than 200 words, not counting the code snippet). 3. What did you implement as your second hash function? How does it work? (No more than 100 words) 4. Did you implement rehashing? If so briefly explain your implementation including explaining when you choose to rehash. (No more than 200 words, you may include code snippets which do not count towards the word count) Part b: Binary Search Tree Lookup This part of the lab should take you a week to complete (i.e., 2 hours). 1One can also get memory leaks in memory managed languages but these are more semantic e.g. preserving a reference to something you will never use again. 6 In this part you need to edit the program stub in your chosen language for binary search trees. You should: 1. Complete the insert function by comparing the value to insert with the existing value. 2. Complete the find function. Importantly, it should follow the same logic as insertion. 3. Implement the print set function – This should print out a sensible representation of the hash set when called (for instance when running the program with the −vv flag) 4. Implement the print stats function – This should print out a sensible set of statistics, such as the height and average number of comparisons per insert and find. Once you have completed this implementation you should look at the Blackboard submission form where you will be asked the following question about Part b. 1. How do find and insert for binary trees work? Include your code for insert and relate your explanation to this code. (No more than 200 words, not including the code snippet) In this lab exercise we will stop there with trees. However, it is worth noting that this implementation is not optimal. What will happen if the input dictionary is already sorted? In the next lab we will be exploring self-balancing trees.