Assignment
1. Overview of the Assignment
In Assignment 3, you will complete two tasks. The goal is to familiarize you with Locality Sensitive Hashing
(LSH) and with different types of collaborative-filtering recommendation systems. The dataset you will
use is a subset of the Yelp dataset used in the previous assignments.
2. Assignment Requirements
2.1 Programming Language and Library Requirements
a. You must use Python to implement all tasks. You may use only standard Python libraries (i.e., external
libraries such as numpy or pandas are not allowed). There will be a 10% bonus for each task (or case) if you
also submit a Scala implementation and both your Python and Scala implementations are correct.
b. You must use only Spark RDDs, so that you become familiar with Spark operations. You will not receive any
points if you use Spark DataFrame or DataSet.
2.2 Programming Environment
Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2
2.3 Write your own code
Do not share your code with other students!!
We will combine all the code we can find from the Web (e.g., GitHub) as well as other students’ code from
this and other (previous) sections for plagiarism detection. We will report all the detected plagiarism.
3. Yelp Data
We generated the following two datasets from the original Yelp review dataset with some filters. We
randomly took 60% of the data as the training dataset, 20% of the data as the validation dataset, and 20%
of the data as the testing dataset.
a. yelp_train.csv: the training data, which includes only the columns user_id, business_id, and stars.
b. yelp_val.csv: the validation data, which is in the same format as the training data.
c. We are not sharing the test dataset.
d. other datasets: providing additional information (like the average star or location of a business)
4. Tasks
Note: This Assignment has been divided into 2 parts on Vocareum. This has been done to
provide more computational resources.
4.1 Task1: Jaccard based LSH (2 points)
In this task, you will implement the Locality Sensitive Hashing algorithm with Jaccard similarity using
yelp_train.csv.
In this task, we focus on the “0 or 1” ratings rather than the actual ratings/stars from the users. Specifically,
if a user has rated a business, the user’s contribution in the characteristic matrix is 1. If the user hasn’t
rated the business, the contribution is 0. You need to identify similar businesses whose similarity >= 0.5.
You can define any collection of hash functions that you think would result in a consistent permutation of
the row entries of the characteristic matrix. Some potential hash functions are:
f(x) = (ax + b) % m or f(x) = ((ax + b) % p) % m
where p is any prime number and m is the number of bins. Please carefully design your hash functions.
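As an illustration of the second form above, the following is a minimal sketch of generating a family of hash functions in plain Python. The function name make_hash_functions, the fixed seed, and the choice p = 2**31 - 1 (a known prime) are assumptions made for this example only; x is assumed to be the integer row index of a user in the characteristic matrix.

```python
import random

# Illustrative choice of prime (2**31 - 1); any prime works per the text above.
P = 2147483647

def make_hash_functions(n, m, p=P, seed=0):
    """Return n functions of the form f(x) = ((a*x + b) % p) % m."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    funcs = []
    for _ in range(n):
        a = rng.randrange(1, p)  # a must be non-zero
        b = rng.randrange(0, p)
        # Bind a and b as default arguments so each lambda keeps its own pair.
        funcs.append(lambda x, a=a, b=b: ((a * x + b) % p) % m)
    return funcs

hash_funcs = make_hash_functions(n=5, m=100)
values = [h(42) for h in hash_funcs]  # one bin per hash function for row 42
```

Drawing a and b at random is only one possible design. What matters is that the same n functions are applied to every business.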
After you have defined all the hashing functions, you will build the signature matrix. Then you will divide
the matrix into b bands with r rows each, where b x r = n (n is the number of hash functions). You should
carefully select a good combination of b and r in your implementation (b>1 and r>1). Remember that
two items are a candidate pair if their signatures are identical in at least one band.
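The signature and banding steps described above can be sketched in plain Python as follows. This is not the required Spark RDD implementation, only an illustration of the logic. The helper names minhash_signature and candidate_pairs, the toy data, and the hash parameters are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def minhash_signature(row_indices, hash_funcs):
    """Signature of one business: the minimum hash value over its rated rows."""
    return [min(h(x) for x in row_indices) for h in hash_funcs]

def candidate_pairs(signatures, b, r):
    """Pairs whose signatures are identical in at least one of b bands of r rows."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for item, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(item)
        for items in buckets.values():
            for pair in combinations(sorted(items), 2):
                candidates.add(pair)
    return candidates

# Toy example with n = 4 hash functions, so b = 2 bands of r = 2 rows.
P, M = 2147483647, 1000
params = [(3, 7), (11, 5), (17, 23), (29, 31)]  # illustrative (a, b) pairs
hash_funcs = [lambda x, a=a, c=c: ((a * x + c) % P) % M for a, c in params]
users = {"biz_a": {0, 1, 2}, "biz_b": {0, 1, 2}, "biz_c": {7, 8}}
signatures = {biz: minhash_signature(rows, hash_funcs)
              for biz, rows in users.items()}
pairs = candidate_pairs(signatures, b=2, r=2)
```

Candidate pairs are only candidates: each one still has to be checked against the original Jaccard similarity before it is reported.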
Your final results will be the candidate pairs whose original Jaccard similarity is >= 0.5. You need to write
the final results into a CSV file according to the output format below.
Example of Jaccard Similarity:
            user1  user2  user3  user4
business1     0      1      1      1
business2     0      1      0      0
Jaccard Similarity (business1, business2) = #intersection / #union = 1/3
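The example above can be reproduced with Python sets. This is a minimal sketch in which each business is represented by the set of users who rated it. The function name jaccard and the convention of returning 0.0 for two empty sets are assumptions of this example.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    if not union:
        return 0.0  # assumed convention when both sets are empty
    return len(a & b) / len(union)

# Users who rated each business in the example above.
business1 = {"user2", "user3", "user4"}
business2 = {"user2"}
sim = jaccard(business1, business2)  # intersection has 1 user, union has 3
```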
Input format: (we will use the following command to execute your code)
Python: spark-submit task1.py <input_file_name> <output_file_name>
Scala: spark-submit --class task1 hw3.jar <input_file_name> <output_file_name>
Param: input_file_name: the name of the input file (yelp_train.csv), including the file path.
Param: output_file_name: the name of the output CSV file, including the file path.
Output format:
IMPORTANT: Please strictly follow the output format since your code will be graded automatically. We
will not regrade because of formatting issues.
a. The output file is a CSV file containing all business pairs you have found. The header is “business_id_1,
business_id_2, similarity”. Within each pair, the two business IDs must be in alphabetical order. The rows
of the file must also be sorted in alphabetical order. There is no requirement for the number of decimals
for the similarity value. Please refer to the format in Figure 2.
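The ordering rules can be sketched as below. The function name write_pairs and the sample IDs are hypothetical, "alphabetical" is assumed to mean Python's default string ordering, and the header is written exactly as quoted in the text. The exact spacing in the header and in the data rows is an assumption and should be checked against Figure 2.

```python
import os
import tempfile

HEADER = "business_id_1, business_id_2, similarity"  # as quoted in the text

def write_pairs(pairs, output_path):
    """Write (id, id, similarity) triples, sorted within each pair and overall."""
    rows = sorted((min(b1, b2), max(b1, b2), sim) for b1, b2, sim in pairs)
    with open(output_path, "w") as f:
        f.write(HEADER + "\n")
        for b1, b2, sim in rows:
            f.write("{},{},{}\n".format(b1, b2, sim))

# Usage with made-up IDs, written to a temporary directory.
pairs = [("b2", "b1", 0.5), ("a9", "a1", 1.0)]
out_path = os.path.join(tempfile.mkdtemp(), "task1_output.csv")
write_pairs(pairs, out_path)
with open(out_path) as f:
    lines = f.read().splitlines()
```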