Set Similarity Join Using Spark on Google Dataproc
Problem Definition:
Given two collections of records R and S, a similarity function sim(., .), and a
threshold τ, the set similarity join between R and S finds all record pairs r (from
R) and s (from S) such that sim(r, s) >= τ.
In this project, you are required to use the Jaccard similarity function to compute
sim(r, s). For the following example with τ = 0.5,
the result pairs are (r1, s1) (similarity 0.75), (r2, s2) (similarity 0.5), (r3, s1) (similarity
0.5), and (r3, s2) (similarity 0.5).
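For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The following is a minimal Scala sketch of that definition. The object and function names are illustrative and not part of the project specification, and returning 0.0 for two empty sets is an assumption.

```scala
object JaccardSketch {
  // Jaccard similarity: |a ∩ b| / |a ∪ b|.
  // Assumption: two empty sets are given similarity 0.0.
  def jaccard(a: Set[Int], b: Set[Int]): Double = {
    val inter = a.intersect(b).size
    val union = a.size + b.size - inter // inclusion-exclusion, avoids building the union
    if (union == 0) 0.0 else inter.toDouble / union
  }

  def main(args: Array[String]): Unit = {
    // {1,4,5,6} and {4,5,6} share 3 of 4 distinct elements.
    println(jaccard(Set(1, 4, 5, 6), Set(4, 5, 6)))
  }
}
```

For instance, the sets {1, 4, 5, 6} and {4, 5, 6} share three of four distinct elements, giving a similarity of 0.75.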
Input files:
You are required to perform a “self-join”: a single input file is given, in which
each line has the format:
“RecordId list”,
and this file serves as both R and S.
An example input file is shown below (integers are separated by spaces):
0 1 4 5 6
1 2 3 6
2 4 5 6
3 1 4 6
4 2 5 6
5 3 5
This sample file “tiny-data.txt” can be downloaded at:
https://webcms3.cse.unsw.edu.au/COMP9313/21T3/resources/69118
Another sample input file “flickr_small.txt” can be downloaded at:
https://webcms3.cse.unsw.edu.au/COMP9313/21T3/resources/69119
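As a rough illustration of the “RecordId list” format, the sketch below splits one input line into a record ID and a set of elements. The names `RecordParser` and `parseLine` are hypothetical. The sketch assumes every line is non-empty and well formed, that IDs and elements fit in an `Int`, and that duplicate elements within a record can be collapsed into a set.

```scala
object RecordParser {
  // Splits "RecordId e1 e2 ..." into (RecordId, set of elements).
  // Assumption: the first integer is the record ID, the rest are elements.
  def parseLine(line: String): (Int, Set[Int]) = {
    val tokens = line.trim.split("\\s+").map(_.toInt)
    (tokens.head, tokens.tail.toSet)
  }

  def main(args: Array[String]): Unit = {
    // First line of the tiny example input.
    println(parseLine("0 1 4 5 6"))
  }
}
```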
Output:
The output file contains the similar pairs together with their similarities. Each line is
in the format “(RecordId1,RecordId2)\tSimilarity” (RecordId1 < RecordId2, and there are no duplicate pairs in the result). The similarities are of double precision. The
pairs are sorted in ascending order (by the first record and then the second).
Given the example input data and τ = 0.5, the output file looks like:
(0,2)\t0.75
(0,3)\t0.75
(1,4)\t0.5
(2,3)\t0.5
(2,4)\t0.5
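The expected output above can be checked with a brute-force sketch in plain Scala (no Spark). It compares every pair of records from the tiny example, keeps pairs with RecordId1 < RecordId2 and similarity at least τ, sorts them, and formats each line. This all-pairs approach is only meant to illustrate the output format on a tiny input; it is not a scalable solution, and the names `TinySelfJoin` and `selfJoin` are hypothetical.

```scala
object TinySelfJoin {
  // The six records of the tiny example input: RecordId -> elements.
  val records: Seq[(Int, Set[Int])] = Seq(
    0 -> Set(1, 4, 5, 6),
    1 -> Set(2, 3, 6),
    2 -> Set(4, 5, 6),
    3 -> Set(1, 4, 6),
    4 -> Set(2, 5, 6),
    5 -> Set(3, 5)
  )

  // Jaccard similarity: |a ∩ b| / |a ∪ b| (0.0 for two empty sets, by assumption).
  def jaccard(a: Set[Int], b: Set[Int]): Double = {
    val inter = a.intersect(b).size
    val union = a.size + b.size - inter
    if (union == 0) 0.0 else inter.toDouble / union
  }

  // Brute-force self-join: keeps pairs with id1 < id2 and similarity >= tau,
  // sorted by first then second ID, formatted as "(id1,id2)<TAB>similarity".
  def selfJoin(recs: Seq[(Int, Set[Int])], tau: Double): Seq[String] = {
    val pairs = for {
      (id1, s1) <- recs
      (id2, s2) <- recs
      if id1 < id2
      sim = jaccard(s1, s2)
      if sim >= tau
    } yield ((id1, id2), sim)
    pairs.sortBy(_._1).map { case ((a, b), sim) => s"($a,$b)\t$sim" }
  }

  def main(args: Array[String]): Unit =
    selfJoin(records, 0.5).foreach(println)
}
```

Requiring id1 < id2 both removes self-pairs and ensures each unordered pair appears once, which matches the no-duplicates requirement.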
Code format:
Name your Scala file “SetSimJoin.scala” and put it in the package
“comp9313.proj3”. Your program should take three parameters: the input file, the
output folder, and the similarity threshold τ (double precision).
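One possible way to read the three parameters is sketched below in plain Scala. The `SetSimJoinArgs` object, the `Params` case class, and the usage message are hypothetical helpers, not part of the specification. Rejecting thresholds outside [0, 1] is an assumption based on the range of Jaccard similarity. In the actual submission, the code would live in “SetSimJoin.scala” under the package “comp9313.proj3”.

```scala
object SetSimJoinArgs {
  // Hypothetical holder for the three command-line parameters.
  final case class Params(inputPath: String, outputDir: String, tau: Double)

  // Returns Right(params) on success, or Left(message) describing the problem.
  def parse(args: Array[String]): Either[String, Params] =
    args match {
      case Array(input, output, tauStr) =>
        scala.util.Try(tauStr.toDouble).toOption match {
          // Assumption: a Jaccard threshold is only meaningful within [0, 1].
          case Some(tau) if tau >= 0.0 && tau <= 1.0 =>
            Right(Params(input, output, tau))
          case _ =>
            Left(s"Invalid threshold: $tauStr")
        }
      case _ =>
        Left("Usage: SetSimJoin <input file> <output folder> <threshold>")
    }

  def main(args: Array[String]): Unit =
    println(parse(args))
}
```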
Cluster configuration:
Create a bucket with name “comp9313-” in Dataproc.
Create a folder “project3” in this bucket for holding the input files.
This project aims to let you see the power of distributed computation. Your code
should scale well with the number of nodes used in a cluster. You are required to
create three clusters in Dataproc to run the same job:
Cluster1 - 1 master node and 2 worker nodes;
Cluster2 - 1 master node and 4 worker nodes;
Cluster3 - 1 master node and 6 worker nodes.
For both master and worker nodes, select n1-standard-2 (2 vCPU, 7.5GB memory).
Unzip and upload the following data set to your bucket, and set τ to 0.85 to run your
program: https://webcms3.cse.unsw.edu.au/COMP9313/21T3/resources/69125.
Record the runtime on each cluster and draw a figure where the x-axis is the number
of nodes you used and the y-axis is the time taken to get the result, and store this figure
in a file “Runtime.jpg”. Please also take a screenshot of your program running on
Dataproc in each cluster as proof of the runtime. Compress the three screenshots
into a zip file “Screenshots.zip”. Briefly describe your optimization techniques in a
file “Optimization.pdf”.