DATA3404: Scalable Data Management
Assignment 2: Apache Spark Programming and Optimisation
Group Assignment (15%)
Introduction
This is the second part of the practical assignment of DATA3404, in which you have to translate some
SQL queries (respectively, data analysis tasks) into Apache Spark Python programs and optimise
them in Databricks.
You can find links to online documentation, data, and hints on the tools and schema needed for this
assignment in the ’Assignment’ section of the Canvas modules.
Data Set Description
Conversion: rewrite as Apache Spark Python code (pySpark)
Optimisation: Optimise this query. Compare the optimisations on small/medium/large
datasets, both in terms of the query plan and in terms of runtime (you can use the sparkmeasure
package for this, cf. Tutorial Week 11).
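For example, one generic way to compare the query plans of the original and optimised versions is via DataFrame.explain(); this is only a sketch, and df_before and df_after are placeholder names for the two versions of the query:

# Physical plans of the original vs. the optimised query (placeholder DataFrame names).
df_before.explain(mode="formatted")
df_after.explain(mode="formatted")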
Task 2: PySpark Code to determine “Airports with largest total delay, and their
top-5 delayed airlines”
In this second task, write an Apache Spark program that determines for a given year and country
(both user-specified) the top-10 airports with the largest total annual departure delay (in minutes)
and lists them in descending order of the total delay. For each of those airports, further list the
five airlines contributing most to this overall departure delay in the given year, together with the
percentage of how much each contributed to this delay. List those airlines as a comma-separated
string enclosed in ’[’ and ’]’ (indicating a list), with each airline listed with both its airline name and
the percentage of its delay contribution, in descending order of the percentage.
The output should be in the following tab-delimited (\t) format:
airport name \t total departure delay \t [(airline1, percentage), (airline2, percentage), ...,
(airline5, percentage)]
Deliverable: write as Apache Spark Python code (pySpark)
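As a rough illustration only, a minimal, un-optimised PySpark sketch of one way such a program could be structured is shown below. All DataFrame names (flights, airports, airlines) and column names (year, departure_delay, origin, airport_code, airport_name, country, carrier_code, airline_name) are assumptions made for this sketch and will need to be adapted to the actual assignment schema; this is not the required solution.

from pyspark.sql import functions as F
from pyspark.sql import Window

# User-specified parameters (example values only).
year = 2015
country = "USA"

# 'flights', 'airports' and 'airlines' are assumed to be DataFrames already loaded from
# the assignment datasets; all column names below are assumptions about the schema.
f = flights.filter((F.col("year") == year) & (F.col("departure_delay") > 0))
delays = (
    f.join(airports, f["origin"] == airports["airport_code"], "inner")
     .filter(airports["country"] == country)
     .join(airlines, f["carrier_code"] == airlines["carrier_code"], "inner")
)

# Total departure delay per (airport, airline), plus the airport-wide total and the
# percentage each airline contributes to that total.
per_airline = (
    delays.groupBy("airport_name", "airline_name")
          .agg(F.sum("departure_delay").alias("airline_delay"))
)
w_airport = Window.partitionBy("airport_name")
per_airline = (
    per_airline
    .withColumn("total_delay", F.sum("airline_delay").over(w_airport))
    .withColumn("percentage",
                F.round(100.0 * F.col("airline_delay") / F.col("total_delay"), 2))
)

# Keep the five airlines contributing most to each airport's delay ...
w_rank = Window.partitionBy("airport_name").orderBy(F.desc("airline_delay"))
top5 = (per_airline
        .withColumn("rnk", F.row_number().over(w_rank))
        .filter(F.col("rnk") <= 5))

# ... and the ten airports with the largest total annual departure delay.
top_airports = (per_airline
                .select("airport_name", "total_delay")
                .distinct()
                .orderBy(F.desc("total_delay"))
                .limit(10))

final = (top5
         .join(top_airports, ["airport_name", "total_delay"], "inner")
         .orderBy(F.desc("total_delay"), F.desc("percentage")))

# At most 10 x 5 rows remain, so the tab-delimited output is formatted on the driver;
# dict insertion order preserves the descending total-delay / percentage ordering.
grouped = {}
for r in final.collect():
    key = (r["airport_name"], r["total_delay"])
    grouped.setdefault(key, []).append(f"({r['airline_name']}, {r['percentage']})")
for (airport, total), airline_list in grouped.items():
    print(f"{airport}\t{total}\t[{', '.join(airline_list)}]")

The final formatting is deliberately done on the driver because only a handful of rows remain at that point; the filtering, aggregation and ranking stay in Spark.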
Optimisation: Optimise your program. Compare the optimisations on small/medium/large datasets,
both in terms of the query plan and in terms of runtime performance (e.g. using StageMetrics from
the sparkmeasure package).
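As a rough sketch (assuming sparkmeasure is installed on the cluster, as in the Week 11 tutorial), the runtime of a single run of the query can be measured like this; 'result' is a placeholder name for the final Task 2 DataFrame:

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)   # 'spark' is the active SparkSession
stagemetrics.begin()
result.collect()                     # trigger execution of the (un-)optimised query
stagemetrics.end()
stagemetrics.print_report()          # elapsed time, shuffle bytes, and other stage-level metrics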
Optimisation Recommendations and Report contents
You may use the following approaches as guidelines for your analysis and report:
• Analyse the operation of your queries using the techniques covered for SQL and Spark
• Identify something that can be improved, and change the query accordingly so that it becomes
more efficient
• Ensure the output is unchanged – and that you anticipate it would be unchanged for any valid
inputs, i.e. do not rely on the specific data you have available for testing.
• If you are unable to identify an improvement, you should write something that is equivalent
in output but involves a different execution.
• Observe an improvement from the SQL execution plan, the Spark DAG, and the sparkmeasure
analysis.
• You may wish to use the larger flights datasets to check this, particularly the large dataset –
this will make it clear whether there is a speed improvement. Note that these versions have
not been cleaned as the flights_small_cleaned dataset has. To achieve similar results, dropna
should be applied to the dataframe when it is loaded, the column names should be corrected,
and inner joins should be used when connecting with airline/airport/aircraft data (see the
sketch after this list).
• In the event that any bugs in the provided queries are discovered, they should be treated as
working – that is, your modification should retain any bugged behaviour rather than trying to
correct it to match the stated question.
• The cache clearing in the Spark code should be retained – do not use caching that requires
the queries to be run multiple times to achieve efficiency. The efficiency improvement should
come from the query being run for the first time. Caching within the query itself is fine.
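As referenced above, a minimal sketch of loading one of the larger (uncleaned) flights datasets might look as follows; the file path, the renamed columns, and the 'airlines'/'carrier_code' join are placeholders and assumptions, not the actual dataset layout:

# Placeholder path and column names; adapt to the actual larger flights dataset.
flights_large = (
    spark.read.option("header", "true").option("inferSchema", "true")
         .csv("/path/to/flights_large.csv")
         .dropna()                                                  # drop rows with missing values on load
         .withColumnRenamed("old_column_name", "new_column_name")   # correct column names as needed
)

# Use inner joins when connecting with the airline/airport/aircraft data.
joined = flights_large.join(airlines, on="carrier_code", how="inner")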
Note that each task should be approached by the group as a whole. Each task answer should have
the authors who contributed to that version indicated in a comment.
Note also that there is quite a lot that you might identify and change – you do not need
to optimise these queries to be perfectly efficient! All you need to do is identify one aspect of
inefficiency in each question and correct it. However, there are a couple of things that are the same
for both tasks, particularly with Spark – for your improvement to count, you will need to use a
different approach for the two tasks rather than cutting and pasting an improvement that applies to both.
Deliverables and Submission Details
There are three deliverables per group:
1. a brief report/documentation outlining your outputs, optimisations and observations;
2. the source code: a Jupyter notebook, as a single .DBC archive or notebook source file, that
answers the two given tasks; and
3. a demo in week 12/13 where you will be quizzed on the contribution of each member.
Here are the specifics associated with each deliverable item.
Report
Filename recommendation: data3404_y23s1_assignment2_tutgroupname_assignmentgroupnum.pdf
The report should contain:
• Your group name, including tutorial code and group number
• The answers (output) you receive for each question when executing against all three
(small/medium/large) datasets.
• A short explanation of ’what is going on’ in the general sense in the Spark code of your
solutions for each Task.
• A group contribution statement with the following headings:
– Your group members’ names and SIDs
– Whether they contributed meaningfully to tasks and the report (yes or no)
• This does not have a strict page limit, but you should keep it relevant to ensure feedback can
be useful. In particular:
– Do not include large code dumps. You are already submitting all of your code. Use 1
or 2 line snippets if you absolutely must. Remember that including your code is not
explaining your code.
– Do not keep writing more in the hope that some of what you include will be correct. You
are more likely to run into issues by including incorrect information than you are to gain
by including correct information.
Jupyter notebook or DBC Archive
Filename recommendation: data3404_y23s1_assignment2_tutgroupname_assignmentgroupnum.ipynb
A single Jupyter notebook file (or .DBC archive) that contains all of your completed tasks. This file
must run when attached to a Databricks Community cluster that has had the Assignment
Bootstrap notebook run on it, with no other configuration changes made.
Due Date: All deliverables are due in Week 13, no later than Sunday 28th May. Late submission
penalty: -5% of the marks per day late. The marking rubric is in Canvas.
Students must retain electronic copies of their submitted assignment files and databases, as the
unit coordinator may request to inspect these files before marking of an assignment is completed. If
these assignment files are not made available to the unit coordinator when requested, the marking
of this assignment may not proceed.