COMP90025 Parallel and multicore computing
Project 1: Spell checker — identifying words
Project 1 requires developing and submitting a C/C++ OpenMP program and a written
report. Project 1 is worth 20% of the total assessment and must be done individually. The due
date for the submission is nominally 11.59pm on 11 September, the Wednesday of Week 8.
Submissions received after this date will be deemed late. A penalty of 10% of marks per working
day late (rounded up) will be applied unless an extension is approved prior to submission. Weekend
days do not count as working days. A submission at any time on Thursday will attract a 10% loss of
marks, and so on. Please let the subject coordinator know if anything about these aspects is
unclear or of concern.
1 Background
We have all come to rely on spelling checkers, in everything from word processors to web forms
and text messages. This year’s projects will combine to form a simple parallel spell checker.
In Project 1, you will write OpenMP (shared memory) code to collect a list of distinct words
from a document, ready to check the spelling of each. In Project 2, you will write MPI (message
passing) code to take such a list of words and check them against a dictionary of known words.
2 Assessment Tasks
1. Write an OpenMP (or pthreads) C/C++ function
void WordList(uint8_t *doc, size_t doc_len, int non_ASCII, uint16_t ***words, int *count)
to do the following:
(a) take a pointer doc to a UTF-8 string and its length, doc_len;
(b) parse the string into a list of words, where a word is defined as a maximal string of
characters that are Unicode letter type, as determined by iswalnum (or isalnum for
ASCII) in the current locale (not necessarily the standard C locale).
That is, it is a string of Unicode letters that
• is either preceded by a non-letter or starts at the start of the string; and
• is either followed by a non-letter or ends at the end of the string.
The function should malloc two regions of memory: one containing all of the (‘\0’-
terminated) words contiguously, and the other containing a list of pointers to the starts
of words within the first region. A pointer to the second should be returned in words.
The memory should be able to be freed by free (**words); free (*words);
(c) set *count to the number of distinct words.
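As an illustration of steps (a) to (c), the following is a minimal sequential sketch for the all-ASCII case only. The helper name split_ascii is hypothetical and not part of the project skeleton; it neither sorts nor removes duplicates, and it uses isalnum rather than iswalnum. It only shows the word definition and the two-region memory layout.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (an assumption, not the skeleton's API): splits an
 * all-ASCII document into maximal runs of isalnum characters.
 * Returns the number of words found (duplicates included, unsorted),
 * or -1 if allocation fails. */
int split_ascii(const uint8_t *doc, size_t doc_len, uint16_t ***words)
{
    /* Upper bounds: letters plus terminators never exceed doc_len + 1,
     * and there can be at most (doc_len + 1) / 2 words. */
    uint16_t *text = malloc((doc_len + 1) * sizeof *text);
    uint16_t **list = malloc((doc_len / 2 + 1) * sizeof *list);
    if (text == NULL || list == NULL) {
        free(text);
        free(list);
        return -1;
    }

    size_t pos = 0;   /* next free slot in the contiguous word region */
    int n = 0;        /* number of words stored so far */
    list[0] = text;   /* keeps free(**words) valid when no word is found */

    for (size_t i = 0; i < doc_len; ) {
        if (!isalnum(doc[i])) {      /* skip non-letters between words */
            i++;
            continue;
        }
        list[n++] = text + pos;      /* a word starts here */
        while (i < doc_len && isalnum(doc[i]))
            text[pos++] = doc[i++];  /* widen each ASCII byte to uint16_t */
        text[pos++] = 0;             /* '\0' terminator */
    }

    *words = list;
    return n;
}
```

Because the first region is released through the first pointer in the list, free (**words) only works if that first pointer still addresses the start of the first region once the list has been sorted. One way to guarantee this is to store the words in the first region in their final sorted order.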
The array should be sorted according to the current locale. The C function strcoll compares
strings according to the current locale. We will test with the locales LC_ALL=C and
LC_ALL=en_AU. In bash, the locale can be specified on the command line, as in
LC_ALL=en_AU run_find_words text1-ASCII.txt
LC_ALL=en_AU run_find_words text1-en.txt
or it can be specified using export
export LC_ALL=en_AU
run_find_words text1-ASCII.txt
run_find_words text1-en.txt
There should be no spaces around the “=”.
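The effect of the locale on comparison can be sketched as follows. This is a minimal illustration, not the driver's code: the function names use_locale and collate_cmp are hypothetical. Passing "" to setlocale adopts whatever LC_ALL specifies in the environment. Note that strcoll operates on char strings, so uint16_t words would need converting before it could be applied to them.

```c
#include <locale.h>
#include <string.h>

/* Hypothetical helper: selects a locale by name. Passing "" adopts the
 * locale named by the environment (for example LC_ALL=en_AU).
 * Returns 1 on success, 0 if the locale is not available. */
int use_locale(const char *name)
{
    return setlocale(LC_ALL, name) != NULL;
}

/* Hypothetical helper: compares two char strings under the current
 * locale. The result is negative, zero or positive, as for strcmp. */
int collate_cmp(const char *a, const char *b)
{
    return strcoll(a, b);
}
```

In the C locale, strcoll orders strings by byte value, so "Zebra" sorts before "apple". In a locale such as en_AU the order is typically closer to dictionary order, although the exact locale name and its availability depend on the system.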
If non_ASCII is zero, then you can assume that the UTF-8 input is all ASCII, and optimize
for that case. If non_ASCII is 1, then you can assume that non-ASCII characters are rare.
If non_ASCII is 2, then non-ASCII characters may be common. Optimizing for these cases is
not required, unless you are aiming to get the fastest possible code.
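To illustrate why the ASCII case is cheaper, here is a minimal sketch of decoding one UTF-8 sequence into a single uint16_t. The function name decode_utf8 is hypothetical, and the sketch assumes that every character fits in 16 bits (sequences of 1 to 3 bytes). Anything else is replaced by U+FFFD, and continuation bytes are not validated.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes the UTF-8 sequence starting at doc[i]
 * into *out and returns the number of bytes consumed.
 * Assumes i < doc_len and that code points fit in 16 bits. */
size_t decode_utf8(const uint8_t *doc, size_t doc_len, size_t i,
                   uint16_t *out)
{
    uint8_t b0 = doc[i];

    if (b0 < 0x80) {                  /* ASCII fast path: one byte */
        *out = b0;
        return 1;
    }
    if ((b0 & 0xE0) == 0xC0 && i + 1 < doc_len) {   /* 110xxxxx 10xxxxxx */
        *out = (uint16_t)(((b0 & 0x1F) << 6) | (doc[i + 1] & 0x3F));
        return 2;
    }
    if ((b0 & 0xF0) == 0xE0 && i + 2 < doc_len) {   /* 1110xxxx + 2 bytes */
        *out = (uint16_t)(((b0 & 0x0F) << 12) |
                          ((doc[i + 1] & 0x3F) << 6) |
                          (doc[i + 2] & 0x3F));
        return 3;
    }
    *out = 0xFFFD;                    /* unsupported or truncated input */
    return 1;
}
```

When non_ASCII is zero, only the first branch is ever taken, so the decoding step reduces to copying each byte.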
You must write your own sorting code. (Choose a simple algorithm first, and replace it if
you get time.)
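As one example of a simple starting point, the sketch below sorts the list of word pointers with insertion sort and then removes adjacent duplicates. All three function names are hypothetical. The comparison is by raw 16-bit value, which is an assumption made to keep the sketch short; a locale-aware comparison would need to be substituted to meet the sorting requirement.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical comparison by raw 16-bit value (not locale-aware). */
int u16cmp(const uint16_t *a, const uint16_t *b)
{
    while (*a != 0 && *a == *b) {
        a++;
        b++;
    }
    return (*a > *b) - (*a < *b);
}

/* Sorts the pointer list in place; the words themselves do not move. */
void insertion_sort_words(uint16_t **list, int n)
{
    for (int i = 1; i < n; i++) {
        uint16_t *key = list[i];
        int j = i - 1;
        while (j >= 0 && u16cmp(list[j], key) > 0) {
            list[j + 1] = list[j];   /* shift larger entries right */
            j--;
        }
        list[j + 1] = key;
    }
}

/* Removes adjacent duplicates from a sorted list; returns the new count. */
int unique_words(uint16_t **list, int n)
{
    if (n == 0)
        return 0;
    int kept = 1;
    for (int i = 1; i < n; i++) {
        if (u16cmp(list[kept - 1], list[i]) != 0)
            list[kept++] = list[i];
    }
    return kept;
}
```

Insertion sort takes quadratic time in the number of words, so it is best treated as a correctness baseline to be replaced later.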
Do not use any container class libraries.
You can assume that all UTF-16 characters fit into a single uint16_t.
The driver program to call your code, and a skeleton Makefile, are available at
https://canvas.lms.unimelb.edu.au/files/20401320/download?download_frd=1 and on
Spartan at /data/projects/punim0520/2024/Project1.
2. Write a minor report (3000 words (+/- 30%) not including figures, tables, diagrams,
pseudocode or references). The lengths given below are guidelines to give a balanced report; if
you have a good reason to write more or less, then you may. Use the following sections and
details:
(a) Introduction (400 words): define the problem as above in your own words and discuss
the parallel technique that you have implemented. Present the technique using parallel
pseudo-code. Cite any relevant literature that you have made use of, either as a basis
for your code or for comparison. This can include algorithms you chose not to use along
with why you didn’t use them. If you use an AI assistant like ChatGPT, then clearly
identify which text was based on the AI output, and state in an appendix what prompt
you used to generate that text.
(b) Methodology (500 words): discuss the experiments that you will use to measure the
performance of your program, with mathematical definitions of the performance measures
and/or explanations using diagrams, etc.
(c) Experiments (500 words): show the results of your experiments, using appropriate
charts, tables and diagrams that are captioned with numbers and referred to from the
text. The text should be only enough to explain the presented results so that it is clear
what is being presented, not to analyse the results.
(d) Discussion and Conclusion (1600 words): analyze your experimental results, and discuss
how they provide evidence either that your parallel techniques were successful or
otherwise how they were not successful or, as may be the case, how the results are
inconclusive. Provide and justify, using theoretical reasoning and/or experimental
evidence, a prediction of the performance you would expect using your parallel technique
if the number of threads were to increase to a much larger number, taking architectural
aspects and technology design trends into account as best you can. This may require
some speculation.
For each test case, there will be a (generous) time limit, and code that fails to complete
in that time will fail the test. The time limit will be much larger than the time taken
by the sequential skeleton, so it will only catch buggy implementations.
(e) References: list the literature that you have cited in your report.
Use, for example, the ACM Style guide at
https://authors.acm.org/proceedings/production-information/preparing-your-article-with-latex
for all aspects of formatting your report, i.e., for font size, layout, margins, title, authorship, etc.
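As an example of the kind of mathematical definition that the Methodology section (b) asks for, two commonly used performance measures are speedup and efficiency. The symbols below are assumptions chosen for illustration: T_1 is the run time of the sequential version, T_p is the run time using p threads. Other measures may suit your experiments better.

```latex
% Assumed symbols: T_1 = sequential run time, T_p = run time with p threads.
\[
  S(p) = \frac{T_1}{T_p},
  \qquad
  E(p) = \frac{S(p)}{p} = \frac{T_1}{p\,T_p}
\]
```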