DATA3404: Data Science Platforms
Tutorial Week 11: Apache Spark
In this tutorial, we will continue to work with Apache Spark. Your tutors will also
discuss the progress of the assignment.
Exercise 1. Query Execution Plans in Apache Spark
We will again be using the pySpark interface in Databricks for this week’s tutorial.
In this week’s tutorial, you will complete a few query tasks and translations from
SQL using Apache pySpark which are related to the upcoming Assignment 2.
To start:
1. We have provided a new sample notebook (Tutorial Week11-PySparkQuerying.ipynb)
with a few sample tasks that demonstrate pySpark, its Dataframe API and how
to evaluate and measure its performance and internal query plans. Down-
load this notebook from Canvas Week 11 module too and upload it to your
Databricks account.
2. Next, start a new compute cluster in Databricks (e.g. via the ’Compute’ item in
the sidebar and then the ’Create Compute’ button). While the cluster starts, move
on to Exercise 2.
Exercise 2. Topic of the Week Revision / Video
Starting the new cluster will take some time. While we are waiting for it to start,
let’s do some revision of this week’s lecture content with your tutor - or watch
this week’s video presentation.
Exercise 3. Example PySpark Querying
Go through the different tasks of the provided pySpark notebook (Tutorial Week11-
PySparkQuerying.ipynb) and see how you can translate a given SQL query into a
pySpark program. It also shows how you can inspect the internal query plan in
Spark, and how to use the sparkmeasure package to measure query performance
in Databricks.
Exercise 4. Assignment 2
Assignment 2 has been released in Canvas. Please discuss any questions with your
tutor and feel free to start in the remaining time of the tutorial.