Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
Lab 4: A Simple MapReduce-Style Wordcount Application
CMPSC 473
1 Purpose and Background
This project is designed to give you experience in writing multi-threaded programs by
implementing a simplified MapReduce-style wordcount application. By working on this project:
• You will learn to write multi-threaded code that correctly deals with race conditions.
• You will carry out a simple performance evaluation to examine the performance impact of
(i) the degree of parallelism in the mapper stage and (ii) the size of the shared buffer
which the two stages of your application will use to communicate.
Figure 1: Overview of our Mapreduce-style multi-threaded wordcount application.
As covered briefly in Lecture 22, the wordcount application takes as input a text file and
produces as output the counts for all uniquely occurring words in the input file ar- ranged in an
alphabetically increasing order. We will assume that the words within our input files will only
contain letters of the English alphabet and the digits 0-9 (i.e., no punc- tuation marks or other
special characters). Our wordcount will consist of two stages. The
1
first stage, called ”mapper,” reads individual words and produces key-value pairs of the
form (word, 1). The second stage, called the ”reducer” consumes these, groups them
by key, and sums up the counts in each group to produce the final output. Notice how the
first stage can be parallelized. That is, the mapper stage can consist of multiple mapper
threads, each working on its own separate portion of the input file. The reducer stage
only contains a single thread which runs concurrently with the mapper threads. The
communication between the mapper threads and the reducer thread occurs via a shared
in-memory FIFO buffer. Figure 1 summarizes these ideas.
2 Starter code: unsynchronized word-count
We are providing you starter code that implements a preliminary version of wordcount
that works correctly in the absence of multi-threading. To appreciate this, you should
confirm that this code is able to pass the ”serialize” test in which accesses to the buffer
occur in a strictly serial manner (i.e., an operation is issued only upon the completion
of the previous operation). However, our implementation is not thread-safe. That is, it
has race conditions. You will need to enhance this code to create a thread-safe version of
wordcount.
3 What you need to implement
Take the time to skim all the starter source code files. All your implementation will
be confined to the files helper.c and helper.h. Specifically, you need to consider
enhancing the following functions (recall that the provided versions of these functions
work correctly in the absence of multi-threding). Though we have implemented the
buffer_create function for you, you can enhance it to make any kind of initialization.
• state_t* buffer_create(int capacity): creates an in-memory buffer with
the specified capacity in bytes). Returns a pointer to state_t, see definition in
helper.h. This function will be called once at the beginning, you can do any kind
of initialization in it based on your implementation.
• enum buffer_status buffer_send(state_t *buffer, void* data):
writes data to the buffer. In case the buffer is full, the function waits till the buffer
has space to write the new data. Returns BUFFER_SUCCESS for successfully writing
data to the buffer, CLOSED_ERROR if the buffer is closed, and BUFFER_ERROR on
encountering any other error. The size of data can be found out using the function
get_msg_size(data).
• enum buffer_status buffer_receive(state_t* buffer, void** data):
Reads data from the given buffer and stores it in the function’s input parameter, data
(Note that it is a double pointer). This is a blocking call i.e., the function only returns
on a successful completion of receive In case the buffer is empty, the function waits
till the buffer has some data to read.
2
• enum buffer_status buffer_close(state_t* buffer): closes the buffer
and informs (you may think of giving signal) all the blocking send/receive calls to
return with CLOSED_ERROR. Once the buffer is closed, send/receive operations will
return CLOSED_ERROR. Returns BUFFER_SUCCESS if close is successful, CLOSED_ERROR
if the buffer is already closed, and BUFFER_ERROR for other errors.
• enum buffer_status buffer_destroy(state_t* buffer)
Frees all the memory allocated to the buffer , using own version of sem flags The
caller is responsible for calling buffer_close and waiting for all threads to finish
their tasks before calling buffer_destroy Returns BUFFER_SUCCESS if destroy
is successful, DESTROY_ERROR if buffer_destroy is called on an open buffer,
and BUFFER_ERROR in any other error case
4 Programming rules
You are not allowed to take any of the following approaches to complete the assignment:
• Spinning in a polling loop to implement blocking calls.
• Sleeping instead of using condition-wait.
• Trying to change the timing of your code to overcome race conditions.
• Using global variables.
You are only allowed to use the pthreads library, the standard C library functions (e.g.,
malloc/free), and the provided starter code. If you think you have a valid reason to use
some code outside of these, please contact the course staff to determine its eligibility.
5 Testing your code
Your code will be evaluated for correctness, properly handling synchronization, and en-
suring it does not violate any of the programming rules from the previous section. We are
providing several test cases which are located in test.c—you can find the list of tests at
the bottom of the file. Although we are providing you all the tests that are part of our de-
fault plan for grading your work, we reserve the right to employ additional tests during
the final grading if the need arises. Therefore, you are responsible for ensuring your code
is correct, where a large part of correctness is ensuring you don’t have race conditions,
deadlocks, or other synchronization bugs.
Running the make command in your project will create two executable files: wordcount
and wordcount sanitize.
• The default executable is wordcount. If you want to run a single test, run the fol-
lowing: ./wordcount [test_case_name] [iters], where [test_case_name]
is the test name and [iters] is the number of times to run the test. If you do not
provide a test name, all tests will be run. The default number of iterations is 1.