INF553 Foundations and Applications of Data Mining
Foundations and Applications of Data Mining
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
INF553 Foundations and Applications of Data Mining
1. Overview of the Assignment
In this assignment, you will explore the spark GraphFrames library as well as implement your
own Girvan-Newman algorithm using the Spark Framework to detect communities in graphs.
You will use the ub_sample_data.csv dataset to find users who have a similar business taste.
The goal of this assignment is to help you understand how to use the Girvan-Newman
algorithm to detect communities in an efficient way within a distributed environment.
2. Requirements
2.1 Programming Requirements
a. You must use Python and Spark to implement all tasks. There will be 10% bonus for each
task if you also submit a Scala implementation and both your Python and Scala
implementations are correct.
b. You can use the Spark DataFrame and GraphFrames library for task1, but for task2 you
can ONLY use Spark RDD and standard Python or Scala libraries. (ps. For Scala, you can try
GraphX, but for the assignment, you need to use GraphFrames.)
2.2 Programming Environment
Python 3.6, Scala 2.11, and Spark 2.3
We will use Vocareum to automatically run and grade your submission. You must test your
scripts on the local machine and the Vocareum terminal before submission.
2.3 Write your own code
Do not share code with other students!!
For this assignment to be an effective learning experience, you must write your own code! We
emphasize this point because you will be able to find Python implementations of some of the
required functions on the web. Please do not look for or at any such code!
TAs will combine all the code we can find from the web (e.g., Github) as well as other
students’ code from this and other (previous) sections for plagiarism detection. We will report
all detected plagiarism.
2.4 What you need to turn in
You need to submit the following files on Vocareum: (all lowercase)
a. [REQUIRED] two Python scripts, named: task1.py, task2.py
b1. [REQUIRED FOR SCALA] two Scala scripts, named: task1.scala, task2.scala
b2. [REQUIRED FOR SCALA] one jar package, named: hw4.jar
c. [OPTIONAL] You can include other scripts called by your main program
d. You don’t need to include your results. We will grade your code with our testing data (data
will be in the same format).
3. Datasets
You will continue to use the Yelp dataset. We have generated a sub-dataset,
ub_sample_data.csv, from the Yelp review dataset containing user_id and business_id. You
can download it from Vocareum.
4. Tasks
4.1 Graph Construction
To construct the social network graph, each node represents a user and there will be an edge
between two nodes if the number of times that two users review the same business is greater
than or equivalent to the filter threshold. For example, suppose user1 reviewed [business1,
business2, business3] and user2 reviewed [business2, business3, business4, business5]. If the
threshold is 2, there will be an edge between user1 and user2.
If a user node has no edge, we will not include that node in the graph.
In this assignment, we use a filter threshold of 7.
4.2 Task1: Community Detection Based on GraphFrames (2 pts)
In task1, you will explore the Spark GraphFrames library to detect communities in the
network graph that you constructed in 4.1. In the library, it provides the implementation of
the Label Propagation Algorithm (LPA) which was proposed by Raghavan, Albert, and Kumara
in 2007. It is an iterative community detection solution whereby information “flows” through
the graph based on the underlying edge structure. For the details of the algorithm, you can
refer to the paper in the link below. In this task, you do not need to implement the algorithm
from scratch, you can call the method provided by the library. The following websites may
4.2.1 Execution Detail
The version of the GraphFrames should be 0.6.0.
For Python:
● In PyCharm, you need to add the sentence below into your code
pip install graphframes
os.environ["PYSPARK_SUBMIT_ARGS"] = (
"--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11")
● In the terminal, you need to assign the parameter “packages” of the spark-submit:
--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
For Scala:
● In Intellij IDEA, you need to add library dependencies to your project
“graphframes” % “graphframes” % “0.6.0-spark2.3-s_2.11”
“org.apache.spark” %% “spark-graphx” % sparkVersion
● In the terminal, you need to assign the parameter “packages” of the spark-submit:
--packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
For the parameter “maxIter” of LPA method, you should set it to 5.
4.2.2 Output Result
In this task, you need to save your result of communities in a txt file. Each line represents
one community and the format is:
‘user_id1’, ‘user_id2’, ‘user_id3’, ‘user_id4’, …
Your result should be firstly sorted by the size of communities in ascending order and then the
first user_id in the community in the lexicographical order (the data type of user_id is a
string). The user_ids in each community should also be in the lexicographical order.
If there is only one node in the community, we still regard it as a valid community.