COMP5349: Cloud Computing Sem. 1/2022
Assignment 2: Data Preprocessing and Performance
1 Introduction
This assignment tests your ability to handle input data with complex structures using
Spark. It also tests your understanding of Spark execution and your ability to tune your
implementation and execution environment to improve performance.
The data set you will work on is the Contract Understanding Atticus Dataset (CUAD), from
the same Atticus Project. The contract understanding problem is modelled
as an extractive Question Answering task. Many natural language processing (NLP) tasks
involve a data pre-processing step to convert the raw data to proper model inputs. If the
pre-processing step is executed sequentially on a CPU, it may need a few hours to process
all training data. This is the case for the CUAD data set.
You are asked to develop a Spark program that can execute on multiple nodes to
pre-process the CUAD data in parallel. You also need to provide a performance report to
analyse your program’s resource usage and scalability.
2 Input Data Set Description
The CUAD data set is provided in a typical question answering format. The archive data.zip
can be downloaded from the project repository and contains the following three files:
• CUADv1.json: the complete data set
• train_separate_questions.json: the training set
• test.json: the test set
All files are of the same SQuAD-style JSON format. The complete data set contains data
of all 510 contracts. The training set contains data of 408 contracts. The test set contains
data of 102 contracts.
Each contract is represented as a nested JSON object of the following format:
    {"title": "contract title",
     "paragraphs": [
       {
         "context": "the full contract as a string",
         "qas": [
           {
             "id": "document name and category label",
             "question": "question text",
             "is_impossible": True/False,
             "answers": [
               {
                 "answer_start": answer start position,
                 "text": "the actual answer text"
               },
               ...
             ]
           },
           ...
         ]
       }
     ]
    }
There are two fields at the top level: “title” and “paragraphs”. The “title” field stores the
title of a contract. The “paragraphs” field stores the main data and it is of array type; each
element in the array represents a paragraph in the document. The CUAD dataset treats
the entire contract as a single paragraph. This array always contains a single paragraph
object.
Each paragraph object has two fields: “context” and “qas”. The “context” field stores the
entire contract text as a string. The “qas” field is of array type and stores the question and
answer pairs related to this contract. The complete CUAD data set contains 13,000+
clauses belonging to 41 categories. Each category is considered as a question. The “qas”
field of any contract contains 41 question answer pairs; each represents a category and is
stored as a nested JSON object.
The question answer pair object has four fields: “id”, “is_impossible”, “question” and
“answers”. The “id” field stores a unique id of the question. Its value is a string concatenat-
ing the document file name and the category label. The “question” field stores the question
text. For instance, the “question” field for category “Document Name” has the following
text: “Highlight the parts (if any) of this contract related to "Document Name" that should
be reviewed by a lawyer. Details: The name of the contract”. The “is_impossible” field is of
Boolean type. A True value indicates there is no answer for this question in this contract.
The “answers” field is of array type and stores potential answers. If the “is_impossible”
field is True, the “answers” field contains an empty array. Otherwise, it contains one or
more answer objects; each represents a valid answer of that question (an annotated clause
belonging to that category).
Each answer object has two fields: “answer_start” and “text”. The “answer_start” field
stores the starting position of the answer in the contract (the index of the answer’s first
character). The “text” field stores the actual answer text.
The actual JSON file also contains the version information. The root object of the file
has two fields: “version” and “data”. All contract data as described above are stored as an
array in the “data” field.
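
The nesting described above can be walked as in the following minimal sketch. It uses plain Python rather than Spark, purely to show the structure; the function name flatten_cuad and the tiny sample document (including its id strings) are made up for illustration and are not part of CUAD. In the real JSON files the Boolean values are written in lowercase and parsed into Python Booleans by the json module.

```python
import json

def flatten_cuad(root):
    """Yield one flat record per (contract, question) pair."""
    for contract in root["data"]:
        # CUAD stores each contract as a single paragraph.
        for paragraph in contract["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "title": contract["title"],
                    "context": paragraph["context"],
                    "id": qa["id"],
                    "question": qa["question"],
                    "is_impossible": qa["is_impossible"],
                    "answers": [(a["answer_start"], a["text"])
                                for a in qa["answers"]],
                }

# Hand-made document with the same shape as the CUAD files (not real data).
SAMPLE = """
{"version": "demo",
 "data": [
  {"title": "Demo Contract",
   "paragraphs": [
    {"context": "This Agreement is made by Acme.",
     "qas": [
      {"id": "Demo__Document Name",
       "question": "What is the document name?",
       "is_impossible": false,
       "answers": [{"answer_start": 5, "text": "Agreement"}]},
      {"id": "Demo__Parties",
       "question": "Who are the parties?",
       "is_impossible": true,
       "answers": []}
     ]}
   ]}
 ]}
"""

records = list(flatten_cuad(json.loads(SAMPLE)))
for record in records:
    print(record["id"], record["is_impossible"], record["answers"])
```

In a Spark program the same traversal would more likely be expressed with DataFrame or RDD operations; the loop form here is only meant to make the field layout concrete.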
3 Workload Description
A typical training sample of a question answering model is a four element tuple in the
following format: (source, question, answer start location, answer end location). The start
and end location values are set to 0 for a question with no answer in the context. If a
question has two answers in the same context, two samples will be created with different
answer start and end locations.
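
As a rough illustration of the sample format, the sketch below builds samples for one question using the whole contract as the source. The function name whole_contract_samples and the example text are hypothetical, and treating the end location as the start location plus the answer length (an exclusive end) is an assumption of this sketch.

```python
def whole_contract_samples(context, qa):
    """Return (source, question, answer_start, answer_end) tuples for one question."""
    question = qa["question"]
    if qa["is_impossible"]:
        # No answer in the context: start and end are both 0.
        return [(context, question, 0, 0)]
    # One sample per answer; end = start + length of the answer text (assumed exclusive).
    return [
        (context, question, a["answer_start"], a["answer_start"] + len(a["text"]))
        for a in qa["answers"]
    ]

# Made-up example text, not taken from CUAD.
context = "Alpha Corp and Beta Ltd agree to the following terms."
two_answers = {
    "question": "Who are the parties?",
    "is_impossible": False,
    "answers": [
        {"answer_start": 0, "text": "Alpha Corp"},
        {"answer_start": 15, "text": "Beta Ltd"},
    ],
}
no_answer = {"question": "What is the expiry date?", "is_impossible": True, "answers": []}

samples = whole_contract_samples(context, two_answers)
empty = whole_contract_samples(context, no_answer)
```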
The CUAD training data set has 408 contracts, each with 41 questions and their corre-
sponding answers or no answer indicator. If each question has at most one answer and the
entire contract is set as source, the data set would theoretically contain 41 × 408 = 16728
training samples.
However, most NLP models only accept input that has a specified maximum length,
which in turn requires maximum length to be set on the source text and the question
text. A contract is much longer than the maximum length of a source text any model
could accept. A typical pre-processing step involves segmenting long input into multiple
sequences of fixed size and using each sequence as the source text in a sample. If a
contract is segmented into 100 sequences, and all questions have at most one answer in
this contract, 100 × 41 = 4100 samples will be generated for this contract.
In this assignment, you are asked to segment each contract into overlapping sequences
of 4096 characters. This should be done using a sliding window of 4096 bytes (characters)
and a stride of 2048 bytes (characters). A stride is the distance the window should move
to generate the next sequence. This way, a contract containing 8204 characters would be
segmented into 5 sequences. The first one contains the first 4096 characters. The second
sequence contains characters between location 2048 (inclusive) and 6144 (exclusive). The
third and fourth sequences start from locations 4096 and 6144 respectively; they contain
4096 and 2060 characters respectively. The last sequence contains the remaining 12
characters from location 8192.
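
The sliding window can be sketched in a few lines of Python. The function name segment is hypothetical, and the rule that a sequence starts at every multiple of the stride below the text length is inferred from the 8204-character example rather than stated as a general rule.

```python
def segment(text, window=4096, stride=2048):
    """Return (start, sequence) pairs produced by a sliding window.

    A sequence starts at every multiple of `stride` that is smaller than
    len(text); sequences near the end may be shorter than `window`.
    """
    return [(start, text[start:start + window])
            for start in range(0, len(text), stride)]

# Reproduce the 8204-character example with dummy text.
sequences = segment("x" * 8204)
starts = [start for start, _ in sequences]
lengths = [len(seq) for _, seq in sequences]
print(starts)   # [0, 2048, 4096, 6144, 8192]
print(lengths)  # [4096, 4096, 4096, 2060, 12]
```

In a Spark job, a function like this could be applied to each contract inside a flatMap-style transformation so that every sequence becomes its own record.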
Not all sequences contain an answer. In fact, for any particular question, only one
or a few sequences may contain one or more answers, or part of one of the answers. A
sequence contains part of (or all of) a specific answer if and only if that answer’s location
range overlaps partially (or fully) with the location range of the sequence in the original
contract text. To determine the location range of an answer, one needs to know where
an answer begins and ends in the contract text. The start location is already given in the
corresponding answer object. The end location can be obtained by adding the length of
the answer to the start location.
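
The overlap condition can be checked as in the sketch below. The function names answer_end and overlaps are hypothetical, and both ranges are treated as half-open intervals (start inclusive, end exclusive), which is an assumption of this sketch.

```python
def answer_end(answer_start, answer_text):
    """End location of an answer: start location plus answer length."""
    return answer_start + len(answer_text)

def overlaps(seq_start, seq_len, ans_start, ans_end):
    """True if the answer range [ans_start, ans_end) intersects the
    sequence range [seq_start, seq_start + seq_len)."""
    seq_end = seq_start + seq_len
    return ans_start < seq_end and seq_start < ans_end

# (start, length) of the five sequences from the 8204-character example.
sequence_ranges = [(0, 4096), (2048, 4096), (4096, 4096), (6144, 2060), (8192, 12)]

# A made-up 200-character answer starting at location 4000.
start = 4000
end = answer_end(start, "a" * 200)
hits = [overlaps(s, n, start, end) for s, n in sequence_ranges]
print(hits)  # [True, True, True, False, False]
```

Here the first and third sequences contain only part of the answer, while the second contains all of it.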