CS5344 Lab 2
This lab has two tasks and requires you to work with a large set of documents and search
for relevant documents. You will need to understand the concepts of RDD, transformation
and action in Apache Spark, and implement an algorithm at the RDD level. This is an
individual lab assignment.
Task A: Write a Spark program to find the top-k frequently occurring word pairs in
a set of documents.
Algorithm for Task A:
Step 1. Preprocess the documents. (Refer to Data Pre-processing section)
Step 2. Compute the count of every word pair in the resulting documents. Note that
(word1, word2) and (word2, word1) are considered as the same word pair.
Step 3. Sort the list of word pairs in descending order and obtain the top-k frequently
occurring word pairs.
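A minimal PySpark sketch of these steps is shown below for orientation. The helper
preprocess() is a placeholder for the Data Pre-processing section, and the reading of
"word pair" as an unordered pair of distinct words co-occurring in a document (counted
once per document) is an assumption; adapt it to your own interpretation.

    from itertools import combinations
    from pyspark import SparkContext

    sc = SparkContext(appName="TaskA")

    stopwords = set(sc.textFile("stopwords.txt").collect())   # small file, safe to collect
    docs = sc.wholeTextFiles("datafiles")                     # one (path, full text) record per document

    def preprocess(text):
        # Placeholder for the rules in the Data Pre-processing section.
        return [w for w in text.lower().split() if w and w not in stopwords]

    # Step 2: count unordered pairs of distinct words that co-occur in a document.
    pairs = (docs
             .flatMap(lambda kv: combinations(sorted(set(preprocess(kv[1]))), 2))
             .map(lambda pair: (pair, 1))
             .reduceByKey(lambda a, b: a + b))

    # Step 3: sort by count in descending order and keep the top k = 5.
    top5 = pairs.sortBy(lambda kv: kv[1], ascending=False).take(5)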
Input: (a) Set of documents (in “datafiles” folder),
(b) stopwords to remove (in stopwords.txt).
Output: One line per word pair in the following format:
You should sort the word pairs in descending order of frequency of occurrence.
Use k = 5.
Task B: Write a Spark program that finds the top-k relevant documents given a
query comprising a set of keywords.
A document can be modelled as a vector of words (or terms). Each entry in the vector is
a TF-IDF value that reflects how important a word is to a document in a collection,
computed as TF-IDF = (1 + log (TF)) * log (N/DF) where N is total number of documents,
TF is the count of the word in a document, and DF is the count of documents having the
word. Figure 1 shows a simple example.
Figure 1. Example of representing documents as vectors.
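As a worked example with assumed numbers (not taken from Figure 1): if a word occurs
5 times in a document (TF = 5), appears in 10 out of N = 100 documents (DF = 10), and
logs are taken base 10, then TF-IDF = (1 + log 5) * log(100/10) ≈ (1 + 0.699) * 1 ≈ 1.699.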
A query can also be represented as a vector where each entry represents a word with a
value 1 if the word is in the query, and 0 otherwise. We can compute a relevance score
for each document d to a query q based on the cosine similarity of their
corresponding vectors V1 and V2, and rank the documents with respect to the query:
relevance(q, d) = cos(V1, V2) = (V1 · V2) / (||V1|| × ||V2||)
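For instance, with made-up vectors: if the query vector is V1 = (1, 1, 0) and a
document's normalized TF-IDF vector is V2 = (0.8, 0, 0.6), then V1 · V2 = 0.8,
||V1|| = √2 ≈ 1.414 and ||V2|| = 1, so relevance ≈ 0.8 / 1.414 ≈ 0.57.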
Algorithm for Task B.
Step 1. Preprocess the documents. (Refer to Data Pre-processing section)
Step 2. Compute term frequency (TF) of every word in a document.
Step 3. Compute TF-IDF of every word w.r.t a document.
You can use key-value pair RDD and the groupByKey() API. Use log base 10 for
TF-IDF calculation.
Step 4. Compute normalized TF-IDF of every word w.r.t. a document.
If the TF-IDF value of word1 in doc1 is t1 and the sum of squares of the TF-IDF
values of all the words in doc1 is S, then the normalized TF-IDF value of word1 is
t1 / √S.
Step 5. Compute the relevance of each document w.r.t a query.
Step 6. Sort and get top-k documents.
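For orientation, the sketch below shows one possible RDD pipeline for Steps 2 to 6.
The preprocess() helper, the variable names and the exact chaining of transformations
are illustrative assumptions, not a required structure.

    import math
    from pyspark import SparkContext

    sc = SparkContext(appName="TaskB")

    stopwords = set(sc.textFile("stopwords.txt").collect())
    query = set(sc.textFile("query.txt").flatMap(lambda line: line.lower().split()).collect())
    docs = sc.wholeTextFiles("datafiles")
    N = docs.count()

    def preprocess(text):
        # Placeholder for the rules in the Data Pre-processing section.
        return [w for w in text.lower().split() if w and w not in stopwords]

    # Step 2: term frequency of every (doc, word).
    tf = (docs.flatMap(lambda kv: [((kv[0], w), 1) for w in preprocess(kv[1])])
              .reduceByKey(lambda a, b: a + b))                        # ((doc, word), TF)

    # Document frequency of every word via groupByKey(), then TF-IDF (Step 3), log base 10.
    df = (tf.map(lambda kv: (kv[0][1], kv[0][0]))                      # (word, doc)
            .groupByKey()
            .mapValues(lambda doc_ids: len(set(doc_ids))))             # (word, DF)

    tfidf = (tf.map(lambda kv: (kv[0][1], (kv[0][0], kv[1])))          # (word, (doc, TF))
               .join(df)                                               # (word, ((doc, TF), DF))
               .map(lambda kv: (kv[1][0][0],                           # doc
                                (kv[0],                                # word
                                 (1 + math.log10(kv[1][0][1])) * math.log10(N / kv[1][1])))))

    # Step 4: normalize each document vector by the square root of its sum of squares.
    norms = tfidf.mapValues(lambda wt: wt[1] ** 2).reduceByKey(lambda a, b: a + b)   # (doc, S)
    normalized = (tfidf.join(norms)                                    # (doc, ((word, tfidf), S))
                       .map(lambda kv: (kv[0], (kv[1][0][0], kv[1][0][1] / math.sqrt(kv[1][1])))))

    # Steps 5-6: dot product with the 0/1 query vector, then take the top k = 5.
    # Dividing by the query vector's norm is the same constant for every document,
    # so it does not change the ranking.
    scores = (normalized.filter(lambda kv: kv[1][0] in query)
                        .mapValues(lambda wt: wt[1])
                        .reduceByKey(lambda a, b: a + b))
    top5 = scores.sortBy(lambda kv: kv[1], ascending=False).take(5)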
Input: (a) set of documents (in “datafiles” folder),
(b) set of keywords for a query (in query.txt),
(c) stopwords to remove (in stopwords.txt).
Output: One line per document in the following format:
The output should be sorted in descending order of the relevance of the documents to
the query. Use k = 5.
Deliverables: Zip your executable Spark programs for Task A and Task B, with
documentation in the code, together with the output files, and upload the archive to
the Lab2 folder. The zipped folder should be named <Student ID>_Lab2.
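For example, assuming a student ID of A0123456X and illustrative file names, the
archive could be built with:

    zip -r A0123456X_Lab2.zip task_a.py task_b.py output_a.txt output_b.txt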
Data Pre-processing:
- Convert all the words to lower case.
- Remove the stop words (provided in stopwords.txt) and then join words that are
  split by special symbols such as - and ' by dropping the symbol. Ex. "tea-table"
  becomes "teatable".
- Drop symbols such as "?", "!", ".", etc. occurring at the end of words. Ex. "home?"
  or "home ?" becomes "home".
- Drop independent numbers (not alphanumeric) in sentences. Ex. "flying 2 birds"
  becomes "flying birds".
- You can consider non-English and alphanumeric words as new words. Ex. "apple1"
  can be considered as a new word.
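One possible realisation of these rules in plain Python is sketched below. The regular
expression, the ordering of the rules and the function name are assumptions; adapt them
to your own reading of the rules.

    import re

    def preprocess(text, stopwords):
        words = []
        for w in text.lower().split():                 # lower case
            w = re.sub(r"[?!.,;:]+$", "", w)           # drop symbols at the end of words: "home?" -> "home"
            w = w.replace("-", "").replace("'", "")    # combine on special symbols: "tea-table" -> "teatable"
            if not w or w in stopwords:                # a bare "?" (as in "home ?") becomes empty and is dropped
                continue
            if w.isdigit():                            # drop independent numbers: "2" in "flying 2 birds"
                continue
            words.append(w)                            # alphanumeric tokens such as "apple1" count as new words
        return words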
Important Notes:
- Specify the Python version, the supporting packages used, and the command to run
  your scripts.
- Your code should be executable either on the virtual machine configuration given
  in Lab 1 or on a stand-alone Spark configuration.
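For example, the note could state that the scripts were tested with Python 3 and
PySpark and are run as follows (script names and argument order are illustrative):

    spark-submit task_a.py datafiles stopwords.txt
    spark-submit task_b.py datafiles query.txt stopwords.txt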