COMP9313: Big Data Management
Revisit
MyExperience Survey
❖ “Please participate in the myExperience Survey and take the opportunity
to share your constructive thoughts on your learning experience. Your
contributions help your teachers and shape the future of education at
UNSW.”
❖ You can access the survey by logging into Moodle or accessing
myexperience.unsw.edu.au directly.
❖ The myExperience survey closes on 2023-11-23.
❖ As mentioned in the WebCMS3 notice, if the response rate from the
class is more than 50%, everybody gets 1 bonus mark added to their
final mark.
Final exam
❖ Final written exam (50 pts)
❖ Six questions in total on five topics
❖ Four hours (do not wait until the last minute to submit!)
❖ Online exam. Submit through Moodle
❖ If you are ill on the day of the exam, do not attempt it; no medical
special consideration claims will be accepted from anyone who has
already attempted the exam.
Chapters Required in Exam
❖ Hadoop MapReduce (Chapters 1, 2, and 3)
➢ HDFS
➢ MapReduce Concepts and Mechanism
➢ MapReduce algorithm design
❖ Spark (Chapters 4 and 5)
➢ RDD
➢ DataFrame
❖ Mining Data Streams (Chapter 6)
❖ Finding Similar Items (Chapter 7)
➢ Shingling, minhash, LSH
➢ Exact solution
❖ Graph Data Management (Chapter 8)
❖ NoSQL and Hive (Chapter 9)
Exam Questions
❖ Question 1: HDFS, MapReduce, and Spark concepts
❖ Question 2: MapReduce algorithm design (pseudo-code only)
❖ Question 3: Spark algorithm design
➢ RDD
➢ DataFrame
❖ Question 4: Finding Similar Items
❖ Question 5: Mining Data Streams
❖ Question 6: Graph Data Management
Question 0
❖ (a) (2 marks) Explain the data flow in MapReduce using the word
count problem as an example.
❖ (b) (2 marks) Explain the data flow in Spark using the word count
problem as an example.
Map and Reduce Functions
❖ Programmers specify two functions:
➢ map (k1, v1) → list [<k2, v2>]
Map transforms the input into key-value pairs to process (a
word-count sketch follows this list)
➢ reduce (k2, [v2]) → [<k3, v3>]
Reduce aggregates the list of values for each key
All values with the same key are sent to the same reducer
❖ Optionally, also:
➢ combine (k2, [v2]) → [<k2, v2>]
➢ partition (k2, number of partitions) → partition for k2
➢ Grouping comparator: controls which keys are grouped together
for a single call to the Reducer.reduce() function
❖ The execution framework handles everything else…
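To make these signatures concrete (and as a pattern for Question 0(a)), here is a minimal word-count job in mrjob-style Python; this is a sketch, and the use of mrjob is an assumption rather than part of any question:

from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        # map(k1, v1) -> [<k2, v2>]: emit (word, 1) for every word
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        # optional local aggregation before the shuffle
        yield word, sum(counts)

    def reducer(self, word, counts):
        # reduce(k2, [v2]) -> [<k3, v3>]: all counts for a word reach
        # the same reducer, where they are summed
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()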
MapReduce Data Flow
Sample Questions
❖ Assume that you are given a dataset crawled from a location-based
social network, in which each line is in the format (userID, list of
locations the user has visited). Your task is to compute, for each
location, the set of users who have visited it, with the users sorted
in ascending order of their IDs.
Sample Solution
class Question1
    method map(self, userID, locations)
        foreach loc in locations
            # composite key "loc,userID" for partitioning and secondary sort
            Emit(loc + "," + userID, "")

    method reduce_init(self)
        current_loc = ""
        current_list = []

    method reduce(self, key, value)
        loc, userID = key.split(",")
        if loc != current_loc
            # a new location starts; flush the previous one
            if current_loc != ""
                Emit(current_loc, current_list)
            current_list = []
            current_list.add(userID)
            current_loc = loc
        else
            current_list.add(userID)

    method reduce_final(self)
        # flush the last location
        Emit(current_loc, current_list)

In JOBCONF, configure:
    'mapreduce.map.output.key.field.separator': ',',
    'mapreduce.partition.keypartitioner.options': '-k1,1',
    'mapreduce.partition.keycomparator.options': '-k1,1 -k2,2'
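For reference, a job like this would typically be launched as follows (the runner and paths are assumptions, not part of the question); note that the keypartitioner/keycomparator options take effect only when Hadoop's KeyFieldBasedPartitioner and KeyFieldBasedComparator are used for the job:

python question1.py -r hadoop hdfs:///input/checkins -o hdfs:///output/q1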
Sample Questions
❖ Given the table shown below, find the person(s) with the maximum
salary in each department (employees may have the same salary).
❖ Solution:
➢ Mapper: for each record, Emit(department + "," + salary, name)
➢ Combiner: find all persons with the local maximum salary for
each department
➢ Reducer: receives the data of each department ordered by salary
in descending order, so the first record carries the department's
maximum salary. Keep reading while the salary is unchanged (ties),
then ignore all remaining records. Save all persons with this
maximum salary in the department
➢ JOBCONF: key partitioned by "-k1,1", sorted by "-k1,1 -k2,2nr"
(numeric, reversed, so the largest salary comes first; a code
sketch follows the table)
EmployeeID  Name   DepartmentID  Salary
001         Emma   1             100,000
002         Helen  2             85,000
003         Jack   3             85,000
004         James  1             110,000
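A minimal mrjob-style sketch of this design; one comma-separated record per line and plain integer salaries are assumptions:

from mrjob.job import MRJob

class MaxSalary(MRJob):
    # Composite key "departmentID,salary": partition on the department
    # only, sort by department, then by salary descending.
    JOBCONF = {
        'mapreduce.map.output.key.field.separator': ',',
        'mapreduce.partition.keypartitioner.options': '-k1,1',
        'mapreduce.partition.keycomparator.options': '-k1,1 -k2,2nr',
    }

    def mapper(self, _, line):
        emp_id, name, dept, salary = line.split(',')
        yield dept + ',' + salary, name

    def reducer_init(self):
        self.current_dept = None

    def reducer(self, key, names):
        dept, salary = key.split(',')
        # The maximum salary of each department arrives first; ties
        # share one composite key, so all their names are in this call.
        if dept != self.current_dept:
            self.current_dept = dept
            for name in names:
                yield dept, (name, salary)

if __name__ == '__main__':
    MaxSalary.run()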
Sample Questions
❖ Given a large text dataset, find the top-k frequent terms (assume
that you can utilize multiple reducers, and that the efficiency of
your method will be evaluated).
❖ Solution:
➢ Two rounds (a sketch follows):
Round 1: compute term frequencies using multiple reducers, where
each reducer keeps only its local top-k.
Round 2: collect the local top-k lists and compute the final
top-k using a single reducer.
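A two-round sketch in mrjob; the value of k and the heap-based local top-k are assumptions:

from mrjob.job import MRJob
from mrjob.step import MRStep
import heapq

K = 10  # assumed k

class TopKTerms(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_tf,
                   combiner=self.combiner_tf,
                   reducer_init=self.local_init,
                   reducer=self.local_topk,
                   reducer_final=self.local_flush),
            MRStep(reducer=self.global_topk),
        ]

    def mapper_tf(self, _, line):
        for term in line.split():
            yield term, 1

    def combiner_tf(self, term, counts):
        yield term, sum(counts)

    def local_init(self):
        self.heap = []  # each reducer keeps only its local top-k

    def local_topk(self, term, counts):
        heapq.heappush(self.heap, (sum(counts), term))
        if len(self.heap) > K:
            heapq.heappop(self.heap)

    def local_flush(self):
        for freq, term in self.heap:
            yield None, (freq, term)  # single key -> one reducer in round 2

    def global_topk(self, _, pairs):
        # one reducer merges all local top-k lists into the final top-k
        for freq, term in heapq.nlargest(K, pairs):
            yield term, freq

if __name__ == '__main__':
    TopKTerms.run()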
What is RDD
❖ Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-
Memory Cluster Computing. Matei Zaharia, et al. NSDI’12
➢ RDD is a distributed memory abstraction that lets programmers
perform in-memory computations on large clusters in a fault-
tolerant manner.
❖ Resilient
➢ Fault-tolerant, is able to recompute missing or damaged partitions
due to node failures.
❖ Distributed
➢ Data residing on multiple nodes in a cluster.
❖ Dataset
➢ A collection of partitioned elements, e.g. tuples or other objects
(that represent records of the data you work with).
❖ RDD is the primary data abstraction in Apache Spark and the core of
Spark. It enables operations on collections of elements in parallel.
RDD Operations
❖ Transformation: returns a new RDD.
➢ Nothing gets evaluated when you call a Transformation function;
it just takes an RDD and returns a new RDD.
➢ Transformation functions include map, filter, flatMap, groupByKey,
reduceByKey, aggregateByKey, join, etc.
❖ Action: evaluates and returns a new value.
➢ When an Action function is called on an RDD object, all the data
processing queries are computed at that time and the result value
is returned (see the sketch below).
➢ Action operations include reduce, collect, count, first, take,
countByKey, foreach, saveAsTextFile, etc.
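To see the transformation/action split in code, a minimal PySpark word count; the input path and app name are assumptions. Nothing is computed until the action on the last line runs:

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")

# Transformations only record the lineage; no job runs yet.
counts = (sc.textFile("hdfs:///input/docs")
            .flatMap(lambda line: line.split())   # transformation
            .map(lambda w: (w.lower(), 1))        # transformation
            .reduceByKey(lambda a, b: a + b))     # transformation

# The action triggers evaluation of the whole lineage.
print(counts.take(10))                            # action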
DataFrame
❖ A DataFrame is more like a table in a traditional two-dimensional
database: besides the data itself, it also captures the structural
information of the data, i.e., the schema
➢ Although RDD[Person] takes Person as its type parameter, the
Spark framework itself does not understand the internal structure
of the Person class
➢ A DataFrame provides detailed structural information, so Spark
SQL knows exactly which columns the dataset contains, and the name
and type of each column. The Spark SQL query optimizer can
therefore apply targeted optimizations (see the sketch below)
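A small PySpark sketch of the difference; the sample rows are assumptions. Because the DataFrame carries a schema, Spark SQL knows every column's name and type and can optimize accordingly:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameDemo").getOrCreate()

# The schema (column names and types) is visible to Spark SQL.
df = spark.createDataFrame(
    [("Emma", 1, 100000), ("James", 1, 110000), ("Helen", 2, 85000)],
    schema=["name", "departmentID", "salary"])

df.printSchema()
# The optimizer can prune columns and push down filters because it
# understands the structure, unlike with an opaque RDD[Person].
df.groupBy("departmentID").max("salary").show()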