Goal: The goal of this assignment is to give you familiarity and practical experience with the
MapReduce programming model and with text processing.
Programming Language: Python
Evaluation: Part of our evaluation is through testing. Our test cases will NOT be made
available to you before submission. It is your responsibility to test extensively. For all
parts of the assignment, do not concern yourself with stop words or changing the case of letters.
This assignment is complex, more so than A1; start early, check through the provided
documents/documentation, and make use of the forum. I will be slightly more hands-off to
allow people to work through the complexities of learning a new system and environment.
(Expect responses during reading week to be delayed by 24-72 hours.)
The Department of Computer Science has a cluster where Cloudera has been installed. Cloudera
is a company that provides a distribution of Apache Hadoop. There are multiple VMs, each of
which can handle several people using it simultaneously.
You can ssh in using your own ID (in place of bdavis56) and by changing vmX to any of vm1 through vm10.
Mac OS X and Linux terminals should have built-in SSH and SFTP clients. Windows users are
strongly recommended to use WSL2 / Ubuntu for Windows; you can also use PuTTY and WinSCP if
you wish.
The software you need to complete the assignment is on the VM. You do not need to install
anything.
We will be using a container for the VMs. This has been graciously provided by Gary
Molenkamp here at Western. If you do not want to use the VM and have the technical skill
required to run your own container, you may be able to do the assignment locally.
Given a set of documents, calculate the frequency of a term in each document. The output
should be the term, document and number of occurrences of the term in the document. This is
different from the example presented in the lectures in that the lecture example focused on one
document. For this part, the final output should consist of pairs in the following form:
((term, document identifier), count)
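The following is a minimal sketch of one way the mapper and reducer logic for this part could be organised, not a reference solution. It assumes Hadoop Streaming, where the key is the text before the first tab of each record, so the term and document identifier are joined with a space into a single key field. The function names are hypothetical, and reading the document name from the `mapreduce_map_input_file` environment variable is an assumption that should be checked on the provided VM. The last lines simulate the map, sort and reduce steps locally on two tiny documents.

```python
import os
from itertools import groupby


def current_doc_id():
    # Assumption: Hadoop Streaming exposes the input file path in this
    # environment variable; fall back to a placeholder when run locally.
    path = os.environ.get("mapreduce_map_input_file", "unknown")
    return os.path.basename(path)


def map_term_doc(lines, doc_id):
    """Mapper logic: emit 'term doc_id<TAB>1' for every token."""
    for line in lines:
        for term in line.split():
            # A term never contains whitespace, so a space safely joins
            # the two parts into a single key field.
            yield f"{term} {doc_id}\t1"


def reduce_term_doc(sorted_records):
    """Reducer logic: sum the counts of each key in key-sorted input."""
    def key_of(record):
        return record.rstrip("\n").split("\t")[0]

    for key, group in groupby(sorted_records, key=key_of):
        total = sum(int(r.rstrip("\n").split("\t")[1]) for r in group)
        term, doc_id = key.split(" ", 1)
        yield f"(({term}, {doc_id}), {total})"


# Local simulation of map -> sort -> reduce on two tiny documents.
records = list(map_term_doc(["a b a"], "d1.txt"))
records += list(map_term_doc(["a c"], "d2.txt"))
results = list(reduce_term_doc(sorted(records)))
```

In actual mapper.py and reducer.py scripts, the same functions would be fed `sys.stdin` and their output printed line by line; the sort between them is done by Hadoop.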
Submission: You should submit a zip file with the name Part1.zip. When unzipped there should
be two files: mapper.py and reducer.py.
Take the word count example and extend it to count bigrams, which are sequences of two
consecutive words.
You should make use of Hadoop for this part.
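As a rough illustration of the bigram idea, the sketch below pairs each word with its successor and counts the pairs. It assumes bigrams are formed within a single input line only (whether bigrams should span line breaks is not stated here), and the function names are hypothetical. The final lines simulate the sort that Hadoop performs between the map and reduce steps.

```python
from itertools import groupby


def map_bigrams(lines):
    """Mapper logic: emit 'word1 word2<TAB>1' for each adjacent word pair."""
    for line in lines:
        words = line.split()
        # zip pairs every word with the word that follows it on the line.
        for first, second in zip(words, words[1:]):
            yield f"{first} {second}\t1"


def reduce_counts(sorted_records):
    """Reducer logic: sum the counts of each bigram in key-sorted input."""
    def key_of(record):
        return record.rstrip("\n").split("\t")[0]

    for key, group in groupby(sorted_records, key=key_of):
        total = sum(int(r.rstrip("\n").split("\t")[1]) for r in group)
        yield f"{key}\t{total}"


# Local simulation of map -> sort -> reduce.
mapped = list(map_bigrams(["the quick brown", "the quick"]))
bigram_counts = list(reduce_counts(sorted(mapped)))
```

A line containing a single word produces no bigrams, because `zip` stops as soon as the shorter sequence is exhausted.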
Submission: You should submit a zip file with the name Part2.zip. When unzipped there should
be two files: mapper.py and reducer.py.
This is an extension of Part 2 in which you count the number of unique bigrams. One approach is to
use two MapReduce passes. The first is what you did for Part 2 and the second is something you
need to develop.
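One possible shape for the second pass is sketched below, under two assumptions: that "unique" means distinct bigrams, and that the first pass writes one `bigram<TAB>count` line per distinct bigram. The second mapper then emits a constant key for every input line, and the second reducer adds these up. The key name `unique_bigrams` and the function names are hypothetical.

```python
def map_pass2(pass1_lines):
    """Second mapper: emit one record per distinct bigram from pass 1."""
    for line in pass1_lines:
        if line.strip():
            # Every non-empty pass-1 line stands for one distinct bigram.
            yield "unique_bigrams\t1"


def reduce_pass2(records):
    """Second reducer: all records share one key, so just sum the values."""
    total = 0
    for record in records:
        _, value = record.rstrip("\n").split("\t")
        total += int(value)
    yield f"unique_bigrams\t{total}"


# Local simulation using the kind of output a first pass might produce.
pass1_output = ["quick brown\t1", "the quick\t2"]
unique_total = list(reduce_pass2(map_pass2(pass1_output)))
```

If "unique" is instead read as bigrams that occur exactly once, the second mapper would emit a record only when the pass-1 count equals 1.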
Submission: You should submit a zip file with the name Part3.zip. When unzipped there should
be two files for each MapReduce pass, i: mapperi.py and reduceri.py. For example, if i is 1 then
you should have mapper1.py and reducer1.py and if i is 2 then you should have mapper2.py and
reducer2.py.