Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
COMP26120
Lab Exercise 3: Spellchecking (Better Trade-offs)
Duration: (almost) 4 weeks
You should do all your work in the lab3 directory of the COMP26120 2022 repository - see Blackboard
for further details. You will need to make use of the existing code in the branch as a starting point.
Important: You submit this lab via a quiz on Blackboard. This will:
1. Ask you some questions about your implementation, including the hash and tag of the commit
you want us to mark (see below for details of this).
2. Ask you to upload a zip file of the python, c or java folder you have been working in.
3. Ask you to upload a PDF report of the experiments you will run in Part C of this lab.
You can save your answers and submit later so we recommend filling in the questions for each part as
you complete it, rather than entering everything at once at the end.
Code Submission Details
You have the choice to complete the lab in C, Java or Python. Program stubs for this exercise exist
for each language. Only one language solution will be marked.
Because people had a number of issues with GitLab last year we are going to take a multiple
redundancy approach to submission of code. This involves both pushing and tagging a commit to
GitLab and uploading a zip of your code to Blackboard. By preference we will mark the code you
submitted to GitLab but if we can’t find it or it doesn’t check out properly then we will look in the
zip file on Blackboard. Please do both to maximise the chance that one of them will work.
When you submit the assignment through Blackboard you will be asked for the hash and tag of
the commit you want marked. This is to make sure the TAs can identify exactly which GitLab commit
you want marked. You tag a commit lab3 solution (we recommend you use this tag, but you do not
have to) by typing the following at the command line:
git tag lab3_solution
git push
git push origin lab3_solution
You can find the hash of your most recent commit by looking in your repository on GitLab as
shown in figure 1.
You can also find the hash for a previous commit by clicking on the “commits” link and then iden-
tifying the commit you are interested in. This is shown in figure 2.
1
Figure 1: Identifying the hash of your most recent commit in GitLab
Figure 2: Identifying the hashes of previous commits in GitLab
2
For this assignment, you can only get full credit if you do not use code from outside
sources. Built-in libraries in your chosen language are allowed but bear in mind that we
ask you to implement hash sets - simply using the Java HashMap class (for instance)
will not get you the marks for implementation.
Note on extension activities: The marking scheme for this lab is organised so that you can get
over 75% without attempting any of the extension activities. These activities are directed at those
wanting a challenge and may take considerable extra time or effort. This lab is designed to take
up 8 hours of student time spread over four weeks – if you have not achieved the extension activi-
ties after you have put 8 hours of work into this lab then we recommend that you do not attempt them.
Reminder: It is bad practice to include automatically generated files in source control (e.g. your git
repositories). This applies to object files (C), class files (Java), and compiled bytecode files (Python).
This is a general poor practice but for this course it can also cause issues with the online tests. Please
also do not include files containing dictionaries and queries that you have generated – the size of
these tends to cause issues with marking.
While it is fine to discuss this coursework with your friends and compare notes, the work submitted
should be your own. In particular this means you should not have copied any of the source code,
or the report. We will be using the turnitin tool to compare reports for similarities.
3
Learning Objectives
By the end of this lab you will be able to:
• Describe the role of the hash function in the implementation of hash tables and describe and
compare various hash collision strategies
• Describe the basic structure of a binary search tree as well as the basic operations and their
complexities
• Write C, Java, or Python code to implement the above concepts
• Evaluate experimentally the behaviour of hash sets and binary search trees.
Introduction
The aim of this exercise is to use the context of a simple spell-checking program to explore the binary
search trees and hash tables data structures.
This exercise builds on work in Labs 1 and 2. Even if you have not done these labs we recommend
you take a look at them and their model solutions in order to familiarise yourself with the spell-checking
program and our expectations for creating and writing up experiments.
Getting the Support Code
You should then be able to see a lab3 folder in your Gitlab that contains all the support files for this
lab.
Data Structures
The spell-checking program stores a dictionary of words using a Set datatype. There are three data
structures used to implement this Set datatype in this lab. In Lab 1 we introduced this datatype but
we repeat that information here but you may also need to look online (e.g. the Wikipedia page) to
complete your knowledge.
Set. This is an abstract datatype that is used to store the dictionary of words. The operations
required for this application are:
• find whether a given value (i.e. string, alphabetic word) appears in the set.
• insert a given value in the set. Note that there is no notion of multiplicity in sets - a value
either appears or it does not. Therefore, if insert is called with a duplicate value (i.e. it is
already in the set) it can be ignored.
There would usually also be a remove function but this is not required for this application. We also
include two further utility functions, which we ask you to implement, that are useful for printing what
is happening :
• print set to list the contents of a Set.
• print stats to output statistical information to show how well your code is working.
You should have already used a dynamic array to implement this dataype in Lab 1.
In this exercise we use hash tables and binary search trees to implement the Set datatype. These
have been introduced in lectures and we describe these briefly below. You may want to look at the
recommended textbooks or online to complete your knowledge.
4
Hash Table. The underlying memory structure for a hash table is an array. A hash function is
then used to map from a large ‘key’ domain to the (much smaller) set of indices into the array, where
a value can be stored. When two values in the input domain map to the same index in the array this
is called a collision and there are multiple ways to resolve these collisions. To use a hash table to
represent a set we make the key and value the same - the result is usually called a hash set.
Binary Search Tree. You can think of a tree as a linked list where every node has two ‘children’.
This allows us to differentiate between what happens in each sub-tree. In a binary search tree the
general idea is that the ‘left’ sub-tree holds values smaller than that found in the current node, and
the ‘right’ sub-tree holds larger values.
Lab 3: Better Storage
In Lab 1 we achieved a faster find function by first sorting the input. In this part we explore two data
structures that organise the data in a way that should make it faster to perform the find operation.
Part a: Hash Table Lookup
This part of the lab should take you between one and two weeks to complete (i.e., 3-4 hours).
So far our solution to a fast find function has been sorting the input and in Part b we will look
at storing it in a sorted order. In this part you will take a different approach that relies on a hash
function to distribute the values to store into a fixed size array. We have only provided you with very
basic program stubs. You need to decide what internal data structure you need etc.
In this part you need to edit the program stub in your chosen language for hash sets. You should:
1. Complete the insert function for hash sets. What should you do if the value to be inserted
already exists? Hint: this is representing a set.
2. Complete the find function. Importantly, it should follow the same logic as insertion.
3. Implement the print set function – This should print out a sensible representation of the hash
set when called (for instance when running the program with the −vv flag)
4. Implement the print stats function – This should print out a sensible set of statistics, such as
average collisions per insert, that can be used to understand how your program is performing.
This will require some extra fields in the class (Java, Python) or node struct (C).
You can find a discussion of how to test your code in the Section of these instructions on Testing
(after Part b).
The hash-value(s) needed for inserting a string into your hash-table should be derived from the
string. For example, you can consider a simple summation key based on ASCII values of the individ-
ual characters, or a more sophisticated polynomial hash code, in which different letter positions are
associated with different weights. (Warning: if your algorithm is too simple, you may find
that your program is very slow when using a realistic dictionary.) You should experiment
with the various hash functions described in the lectures and textbook and implement at least two
for comparison. One can be terrible, but one should be reasonable otherwise you are likely to have
issues when performing an experimental comparison in Part c. These can be accessed by modes al-
ready given in the program stubs and more details of these are contained in the language specific
instructions, do not change these since they are used in our testing processes. Write your code so that
you can use the −m parameter, which sets the mode.
Initially, you should use an open addressing hashing strategy so that collisions are dealt with
by using a collision resolution function. As a first step you should use linear probing, the most
5
straightforward strategy. To understand what is going on you should add code to count the num-
ber of collisions so you can calculate the average per access in print stats.
An issue with open addressing is what to do when it is not possible to insert a value. To make things
simple, to begin with you can simply fail in such cases. Your code should keep the num entries field
of the table up to date. Your insert function should check for a full table, and exit the program
cleanly should this occur. Once this is working you should increase (double?) the hash table size
when the table is getting full and then rehash into a larger table. Implement this resize and
rehash functionality. When the program is called using the −vvv flag it should print out
a message when it rehashes and then call the print stats function to show the new state
of the hash set once rehashing completes. In C, beware of memory leaks!1
Mod of negative numbers:
Most of you are likely to use the mod operator in your hash functions by doing something
like
h = x % size
to try to get a value of h that is between 0 and size - 1.
If x is negative then the behaviour of this operator varies by language: in par-
ticular in C and Java it will return a negative number. See the discussion at
https://torstencurdt.com/tech/posts/modulo-of-negative-numbers/
You might think that x cannot be negative. However, if you are adding or multiplying large
numbers together (e.g., when hashing a string) then you are likely to get integer overflow.
In hashing, we don’t care about such overflows as it is deterministic and just adds to the
‘chaos’ a bit but Java, for instance, may throw ArrayIndexOutOfBoundsException if a
negative number is passed to the hash data structure, if you have not accounted for this
possibility in your implementation.
In C you could use an unsigned integer to make sure that you only have positive numbers.
Once you have completed this implementation you should look at the Blackboard submission form
where you will be asked the following questions about Part a.
1. What did you implement as your first hash function? How does it work? (No more than 100
words)
2. How do find and insert with linear probing work? Illustrate your explanation with a sample of
your code for insert - this may be tidied up a bit for presentation but should clearly be the same
as the code you have submitted. (no more than 200 words, not counting the code snippet).
3. What did you implement as your second hash function? How does it work? (No more than 100
words)
4. Did you implement rehashing? If so briefly explain your implementation including explaining
when you choose to rehash. (No more than 200 words, you may include code snippets which do
not count towards the word count)
Part b: Binary Search Tree Lookup
This part of the lab should take you a week to complete (i.e., 2 hours).
1One can also get memory leaks in memory managed languages but these are more semantic e.g. preserving a
reference to something you will never use again.
6
In this part you need to edit the program stub in your chosen language for binary search trees.
You should:
1. Complete the insert function by comparing the value to insert with the existing value.
2. Complete the find function. Importantly, it should follow the same logic as insertion.
3. Implement the print set function – This should print out a sensible representation of the hash
set when called (for instance when running the program with the −vv flag)
4. Implement the print stats function – This should print out a sensible set of statistics, such as
the height and average number of comparisons per insert and find.
Once you have completed this implementation you should look at the Blackboard submission form
where you will be asked the following question about Part b.
1. How do find and insert for binary trees work? Include your code for insert and relate your
explanation to this code. (No more than 200 words, not including the code snippet)
In this lab exercise we will stop there with trees. However, it is worth noting that this implementation
is not optimal. What will happen if the input dictionary is already sorted? In the next lab we will be
exploring self-balancing trees.