Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
Assignment
1. Overview of the Assignment
In this assignment, you will explore the spark GraphFrames library as well as implement your own
Girvan-Newman algorithm using the Spark Framework to detect communities in graphs. You will use the
ub_sample_data.csv dataset to find users who have a similar business taste. The goal of this assignment
is to help you understand how to use the Girvan-Newman algorithm to detect communities in an
efficient way within a distributed environment.
2. Requirements
2.1 Programming Requirements
a. You must use Python and Spark to implement all tasks. There will be 10% bonus for each task if you
also submit a Scala implementation and both your Python and Scala implementations are correct.
b. For task1, you can use the Spark DataFrame and GraphFrames library. For task2 you can ONLY use
Spark RDD and standard Python or Scala libraries.
2.2 Programming Environment
Python 3.6, JDK 1.8, Scala 2.12, and Spark 3.1.2
We will use these library versions to compile and test your code. There will be no point if we cannot run
your code on Vocareum. On Vocareum, you can call `spark-submit` located at
/opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit`. (*Do not use the one at
`/home/local/spark/latest/bin/spark-submit (2.4.4))
2.3 Write your own code
Do not share code with other students!!
For this assignment to be an effective learning experience, you must write your own code! We
emphasize this point because you will be able to find Python implementations of some of the required
functions on the web. Please do not look for or at any such code!
TAs will combine all the code we can find from the web (e.g., Github) as well as other students’ code
from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism.
2.4 What you need to turn in
You need to submit the following files on Vocareum: (all lowercase)
a. [REQUIRED] two Python scripts, named: task1.py, task2.py
b1. [OPTIONAL, REQUIRED FOR SCALA] two Scala scripts, named: task1.scala, task2.scala
b2. [OPTIONAL, REQUIRED FOR SCALA] one jar package, named: hw4.jar
c. [OPTIONAL] You can include other scripts called by your main program
d. You don’t need to include your results. We will grade on your code with our testing data (data will be
in the same format).
3. Datasets
We have generated a sub-dataset, ub_sample_data.csv, from the Yelp review dataset containing user_id
and business_id. You can find the data on Vocareum under resource/asnlib/publicdata/.
4. Tasks
4.1 Graph Construction
To construct the social network graph, assume that each node is uniquely labeled, and that links are
undirected and unweighted.
Each node represents a user. There should be an edge between two nodes if the number of common
businesses reviewed by two users is greater than or equivalent to the filter threshold. For example,
suppose user1 reviewed set{business1, business2, business3} and user2 reviewed set{business2,
business3, business4, business5}. If the threshold is 2, there will be an edge between user1 and user2.
If the user node has no edge, we will not include that node in the graph.
In this assignment, we use filter threshold 7.
4.2 Task1: Community Detection Based on GraphFrames (2 pts)
In task1, you will explore the Spark GraphFrames library to detect communities in the network graph you
constructed in 4.1. In the library, it provides the implementation of the Label Propagation Algorithm
(LPA) which was proposed by Raghavan, Albert, and Kumara in 2007. It is an iterative community
detection solution whereby information “flows” through the graph based on underlying edge structure.
In this task, you do not need to implement the algorithm from scratch, you can call the method provided
by the library.
4.2.1 Execution Detail
The version of the GraphFrames should be 0.6.0.
(For your convenience, graphframes0.6.0 is already installed for python on Vocareum. The corresponding
jar package can also be found under $ASNLIB/public folder. )
For Python (in local machine):
● [Approach 1] Run “python3.6 -m pip install graphframes” in the terminal to install the package.
● [Approach 2] In PyCharm, you add the sentence below into your code to use the jar package
os.environ["PYSPARK_SUBMIT_ARGS"] = (
"--packages graphframes:graphframes:0.8.2-spark3.1-s_2.12")
● In the terminal, you need to assign the parameter “packages” of the spark-submit:
--packages graphframes:graphframes:0.8.2-spark3.1-s_2.12
For Scala (in local machine):
● In Intellij IDEA, you need to add library dependencies to your project
“graphframes” % “graphframes” % “0.8.2-spark3.1-s_2.12”
“org.apache.spark” %% “spark-graphx” % sparkVersion
● In the terminal, you need to assign the parameter “packages” of the spark-submit:
--packages graphframes:graphframes:0.8.2-spark3.1-s_2.12
For the parameter “maxIter” of LPA method, you should set it to 5.
4.2.2 Output Result
In this task, you need to save your result of communities in a txt file. Each line represents one
community and the format is:
‘user_id1’, ‘user_id2’, ‘user_id3’, ‘user_id4’, …
Your result should be firstly sorted by the size of communities in the ascending order and then the first
user_id in the community in lexicographical order (the user_id is type of string). The user_ids in each
community should also be in the lexicographical order.
If there is only one node in the community, we still regard it as a valid community.
Figure 1: community output file format
4.3 Task2: Community Detection Based on Girvan-Newman algorithm (5 pts)
In task2, you will implement your own Girvan-Newman algorithm to detect the communities in the
network graph. You can refer to the Chapter 10 from the Mining of Massive Datasets book for the
algorithm details.
Because your task1 and task2 code will be executed separately, you need to construct the graph again in
this task following the rules in section 4.1.