INF553 Foundations and Applications of Data Mining
1. Overview of the Assignment
In this assignment, you will implement the SON algorithm using the Apache Spark Framework. You will
develop a program to find frequent itemsets in two datasets: a simulated dataset and a real-world
dataset generated from the Yelp dataset. The goal of this assignment is to learn how to implement
the algorithms you have learned in class and apply them efficiently to real, large datasets in a
distributed environment.
2. Requirements
2.1 Programming Requirements
a. You must use Python to implement all tasks. There will be a 10% bonus for each task if you also
submit a Scala implementation and both your Python and Scala implementations are correct.
b. You are required to use only Spark RDDs, so that you understand Spark operations. You will not
receive any points if you use Spark DataFrames or Datasets.
2.2 Programming Environment
Python 3.6, Scala 2.11.8 and Spark 2.3.0
We will use Vocareum to automatically run and grade your submission. We highly recommend that
you first test/debug your scripts on your local machine and once they are ready, submit them to
Vocareum.
2.3 Write your own code
Do not share your code with other students!!
For this assignment to be an effective learning experience, you must write your own code. We
emphasize this point because you will be able to find Python implementations of some of the required
functions on the Web.
TAs will run plagiarism detection against all the code we can find on the Web (e.g., GitHub) as well as
other students’ code from this and other sections, including previous semesters. We will report all
detected plagiarism.
2.4 What you need to turn in
a. Four required Python scripts plus one optional, named (all lowercase): task1.py, task2.py, task3.py,
preprocess.py, and task3_fp.py (optional)
b1. [OPTIONAL] Four Scala scripts, named (all lowercase):
task1.scala, task2.scala, task3.scala, task3_fp.scala (No need to write preprocessing code in Scala)
b2. [OPTIONAL] one jar package, named: hw2.jar (all lowercase)
Note: You do not need to include your output files for either task. We will grade your code with our
own testing data (in the same format).
3. Datasets
In this assignment, you will use one simulated dataset and one real-world dataset. In task 1, you will
build and test your program with a small simulated CSV file that has been provided to you on Vocareum.
For task 2, you first need to generate a subset of the business.json and review.json data from the Yelp
dataset. You should generate the data subsets using the same structure as the simulated data
(Figure 1), where the first column is user_id and the second column is business_id. In task 2, you will
test your code with this real-world data.
Figure 1: Input Data Format
For task 1, we will provide a submission report on Vocareum only for small1.csv; no submission report
will be provided for task 2. You are encouraged to run your code on small2.csv, as well as on the task 2
data, from the command line to get a sense of the running time.
4. Tasks
In this assignment, you will implement the SON algorithm on top of the Apache Spark Framework to
solve Tasks 1 and 2. You need to find all possible combinations of frequent itemsets in any given
input file within the required time. You can refer to Chapter 6 of the Mining of Massive Datasets
textbook, especially Section 6.4 (Limited-Pass Algorithms), when implementing your algorithm(s). (Hint:
you can choose the A-Priori, MultiHash, or PCY algorithm to process each chunk of the data.) For Task
3, you need to use the FP-Growth algorithm in spark.mllib to solve the same problem as Task 2.
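As a rough illustration of the per-chunk pass in SON, here is one way the A-Priori option from the hint above could be sketched in plain Python. This is not a complete or required solution: the function name and structure are illustrative, and in a real SON implementation this function would run inside `mapPartitions`, with the support threshold scaled down to the chunk’s fraction of the data.

```python
from itertools import combinations

def apriori_chunk(baskets, support):
    """A-Priori on one chunk of baskets; returns locally frequent
    itemsets as sorted tuples. In SON, `support` would be the global
    threshold scaled to this chunk's share of the data."""
    baskets = [set(b) for b in baskets]
    # Count singletons first.
    counts = {}
    for basket in baskets:
        for item in basket:
            counts[item] = counts.get(item, 0) + 1
    frequent = {(item,) for item, c in counts.items() if c >= support}
    result = list(frequent)
    size = 2
    while frequent:
        # Candidate k-itemsets: combinations of items seen in frequent
        # (k-1)-itemsets, kept only if every (k-1)-subset is frequent
        # (the A-Priori monotonicity property).
        items = sorted({i for t in frequent for i in t})
        candidates = [c for c in combinations(items, size)
                      if all(s in frequent for s in combinations(c, size - 1))]
        counts = {}
        for basket in baskets:
            for c in candidates:
                if basket.issuperset(c):
                    counts[c] = counts.get(c, 0) + 1
        frequent = {c for c, n in counts.items() if n >= support}
        result.extend(frequent)
        size += 1
    return sorted(result)
```

In the second SON pass, the union of all chunks’ local candidates would be counted against the full dataset to keep only the truly frequent itemsets.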
4.1 Task 1: Simulated data (3 pts)
There are two CSV files ($ASNLIB/publicdata/small1.csv and $ASNLIB/publicdata/small2.csv) provided on
Vocareum in your workspace. The file small1.csv is a sample file that you can use to debug your
code. For task 1, we will grade your code on small2.csv.
In this task, you need to build two kinds of market-basket models.
Case 1 (1.5 pts):
You will calculate the combinations of businesses (as singletons, pairs, triples, etc.) that qualify as
frequent given a support threshold. You need to create a basket for each user, containing the business
ids reviewed by that user. If a business was reviewed more than once by the same user, count it only
once; that is, the business ids within each basket are unique, since each basket is a set of ids.
Examples of user baskets are:
user1: [business11, business12, business13, ...]
user2: [business21, business22, business23, ...]
user3: [business31, business32, business33, ...]
Case 2 (1.5 pts):
You will calculate the combinations of users (as singletons, pairs, triples, etc.) that qualify as frequent
given a support threshold. You need to create a basket for each business, containing the ids of the
users who reviewed that business. As in Case 1, the user ids in each basket are unique. Examples of
business baskets are:
business1: [user11, user12, user13, ...]
business2: [user21, user22, user23, ...]
business3: [user31, user32, user33, ...]
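Both cases reduce to grouping the input rows by one column and collecting the distinct values of the other. As a hedged, Spark-free sketch (the function name is illustrative; in your actual Spark code this would be done with RDD transformations such as `distinct` and `groupByKey`):

```python
def build_baskets(rows, case):
    """Group (user_id, business_id) rows into baskets.
    case 1: one basket per user holding the distinct businesses they reviewed;
    case 2: one basket per business holding the distinct users who reviewed it."""
    baskets = {}
    for user, business in rows:
        key, value = (user, business) if case == 1 else (business, user)
        # Using a set deduplicates repeat reviews by the same user.
        baskets.setdefault(key, set()).add(value)
    return baskets
```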
Input format:
1. Case number: An integer that specifies the case, 1 for Case 1 and 2 for Case 2.
2. Support: An integer that defines the minimum count to qualify as a frequent itemset.
3. Input file path: The path to the input file, including the file name and extension.
4. Output file path: The path to the output file, including the file name and extension.
Output format:
1. Console output - Runtime: the total execution time from loading the file until the output file has
been written.
You need to print the runtime to the console with the “Duration” tag: “Duration: ”,
e.g., “Duration: 100.00”
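A minimal driver skeleton along these lines might look as follows. The variable names are illustrative assumptions; only the argument order and the “Duration” output format come from the description above.

```python
import sys
import time

def format_duration(seconds):
    """Render the runtime in the required console format."""
    return "Duration: %.2f" % seconds

def main():
    # Argument order follows the input-format list above.
    case_number = int(sys.argv[1])
    support = int(sys.argv[2])
    input_file_path = sys.argv[3]
    output_file_path = sys.argv[4]

    start = time.time()
    # ... load the input, run the SON algorithm, write the output file ...
    print(format_duration(time.time() - start))
```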
2. Output file:
(1) Output-1
You should use “Candidates:” to mark the candidate section of your output file. On each line, output
the candidate frequent itemsets identified after the first pass of the SON algorithm, followed by an
empty line after each list of frequent X-itemsets (i.e., X can be single, pair, triple, and so on). The
printed itemsets must be sorted in lexicographical order. (Both user_id and business_id have the data
type “string”.)
(2) Output-2
You should use “Frequent Itemsets:” to mark the frequent-itemset section of your output file. On each
line, output the final frequent itemsets identified after the SON algorithm finishes. The output format
for the frequent X-itemsets is the same as in Output-1, and the printed itemsets must again be sorted
in lexicographical order.
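One possible way to build a section in this shape is sketched below. Note that the exact per-itemset syntax (parentheses, quotes, separators) is an assumption here, since the text above only specifies the headers, the size grouping, the empty lines, and the lexicographical sort; check the provided submission report for the precise format.

```python
def format_section(header, itemsets):
    """Build one output section ("Candidates:" or "Frequent Itemsets:"):
    itemsets grouped by size, one size per line, each line sorted
    lexicographically, with an empty line after each size group.
    The per-itemset syntax is an assumption, not taken from the spec."""
    by_size = {}
    for itemset in itemsets:
        by_size.setdefault(len(itemset), []).append(tuple(sorted(itemset)))
    lines = [header]
    for size in sorted(by_size):
        lines.append(",".join(
            "(" + ", ".join("'%s'" % item for item in t) + ")"
            for t in sorted(by_size[size])))
        lines.append("")  # empty line after each size group
    return "\n".join(lines) + "\n"
```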