Large Scale Data Engineering
Coursework - EMATM0051 Large Scale Data Engineering
[Data Science]
Version: 12.11.2021 v2.0
Changes:
12.11.2021 v2.0 – Initial version for 2021-22 unit
Summary
This coursework is divided into two parts:
Part 1: A written task (only) related to the knowledge gained in the AWS Academy Cloud
Foundations course (weeks 1-7).
Part 2: A combined practical and written activity architecting a scaling application on the Cloud.
You will use the knowledge gained, plus a little further research, to implement the scaling
infrastructure, and then write a report focusing on your experience of the practical activity
together with knowledge gained across the entire LSDE course.
Weighting: This assessment is worth 100% of the mark for this 20-credit unit.
Set: 13:00, Monday 15th Nov 2021.
Due: 13:00, Wednesday 12th Jan 2022.
Pre-requisites:
• You must have completed the AWS Academy Cloud Foundations course set in weeks 1-7
• You will require an AWS Academy Learner Lab account for the practical activity. You should
receive an invite when this document is released. Please contact the LSDE Unit Director if
you have not received the email or have any issues with registration.
• A Secure Shell (SSH) client, such as Terminal on macOS or PuTTY on Windows, for server administration.
Submission:
Via the LSDE BlackBoard coursework assessment page, submit one zip file, named using your UOB
username (‘username.zip’), containing:
• a Report (‘report.pdf’) in PDF format containing:
o Part 1
o Part 2
• a Text File (‘credentials.txt’) containing your AWS Academy account credentials (username,
password), to enable us to access and review your Learner Lab account as required.
In this document we provide a detailed explanation of the tasks, and the approach to marking.
Unit Director: Alan Forsyth
Task 1: (25%)
Write a maximum of 1000 words (minimum: 600) debating the statement:
"The Public Cloud is ideal for data processing"
Include your own descriptions of the following:
• At least 5 AWS features or services introduced in the Cloud Foundations course that make
data processing in the public cloud advantageous.
• At least 3 scenarios where the public cloud is not optimal or should be avoided for data
processing.
Task 2: Scaling the WordFreq Application (75%)
Overview
WordFreq is a complete, working application built using the Go programming language.
[NOTE: you are NOT expected to understand the source code, and you are NOT permitted to modify it in any way]
The basic functionality of the application is to count the words in a text file, returning the top ten
most frequent words found in the document.
The application uses a number of AWS services (a short sketch illustrating how they fit together follows this list):
• S3: Text files are uploaded to an S3 bucket. The bucket has upload notifications enabled,
so that when a file is uploaded, a notification message is automatically added to a
wordfreq SQS queue.
• SQS: There are two queues used for the application.
o One is used to hold notification messages for newly uploaded text files from the S3
bucket. These messages are known as ‘jobs’, i.e. tasks to be performed by the
application, and they specify the location of the text file in the S3 bucket.
o A second queue is used to hold messages containing the ‘top 10’ results of the
processed jobs.
• DynamoDB: A NoSQL database table is created to store the results of the processed jobs.
• EC2: The application runs on an Ubuntu Linux EC2 instance, which you will need to set up
initially by following the instructions given. This includes creating the S3, SQS and
DynamoDB resources and identifying them to the application.
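For orientation only, the sketch below (Python with boto3) illustrates this flow: a text file is uploaded to the bucket and a ‘job’ notification appears on the jobs queue. It is not part of the coursework code and you are not required to write it; the bucket name and queue URL are hypothetical placeholders for the resources you create when installing the application (Task A).

# Minimal illustrative sketch (hypothetical names) of the S3 -> SQS notification flow.
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

BUCKET = "wordfreq-uploads-example"  # placeholder for your upload bucket
JOBS_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/wordfreq-jobs"  # placeholder

# 1. Upload a local text file (in practice, run_upload.sh performs this step for you).
s3.upload_file("sample.txt", BUCKET, "sample.txt")

# 2. Because upload notifications are enabled on the bucket, S3 adds a message to the
#    jobs queue. Checking the queue's attributes (rather than consuming the message,
#    which would interfere with the worker) shows the job waiting to be processed.
attrs = sqs.get_queue_attributes(
    QueueUrl=JOBS_QUEUE_URL,
    AttributeNames=["ApproximateNumberOfMessages"],
)
print("Jobs waiting in queue:", attrs["Attributes"]["ApproximateNumberOfMessages"])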
You will be required to set up and test the application initially, using the instructions provided with
the zip download. You will then need to implement auto-scaling for the application and improve its
architecture based on principles learned in the CF course. Finally, you will write a report covering
this process, along with some extra material.
Figure 1 - WordFreq standard architecture
Task A – Install the Application
Ensure you have accepted access to your AWS Academy Learner Lab account and have at least $20
credit (you are provided with $300 to start with). If you are running short of credit, please inform
your instructor.
Refer to the WordFreq installation instructions (‘README.txt’) in the coursework zip download on
the BlackBoard site to install and configure the application in your AWS Academy Learner Lab
account. These instructions do not cover every step – you are assumed to be confident in certain
tasks, such as the use of IAM permissions and launching and connecting to an EC2 instance via SSH.
You will set up the database, storage buckets, queues and worker EC2 instance.
Finally, ensure that you can upload a file using the ‘run_upload.sh’ script and can see the results
logged from the running worker service, before moving on to the next task.
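As an illustration only, the sketch below shows one way to confirm end-to-end processing after running ‘run_upload.sh’: reading back the stored results from DynamoDB. This is a read-only check and does not involve changing the application; the table name is a hypothetical placeholder for the table you create while following README.txt, and the worker’s own log output remains the primary check described above.

# Minimal illustrative sketch (hypothetical table name): read back processed results.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("wordfreq-results-example")  # placeholder for your results table

# A small scan is enough to confirm that the worker has written a 'top ten' result.
resp = table.scan(Limit=5)
for item in resp.get("Items", []):
    print(item)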
[NOTE: The application code is in the Go language. You are NOT expected to understand or modify it.
Any code changes will be ignored and may lose marks.]
Task B – Design and Implement Auto-scaling
Review the architecture of the existing application. Each job takes at least 10 seconds to process
(the delay is artificially induced, but DO NOT modify the application source code!). To be able to
process multiple uploaded files concurrently, we need to add scaling to the application.
This should initially function as follows (one possible way to express these rules is sketched after this list):
• When a given maximum performance metric threshold is exceeded, an identical worker
instance is launched and also begins to process messages from the queues.
• When the performance metric falls below a given minimum threshold, the most recently
launched worker instance is removed (terminated).
• There must always be at least one worker instance available to process messages when the
application architecture is 'live'.
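For illustration only, the sketch below shows how these requirements might map onto an EC2 Auto Scaling group built from a launch template of your configured worker instance. The group, template and subnet identifiers are hypothetical, and the same configuration can equally be created through the console, as in the Cloud Foundations labs.

# Minimal illustrative sketch (hypothetical names and IDs): an Auto Scaling group
# whose MinSize guarantees that at least one worker is always running while the
# architecture is live, and whose MaxSize caps how far scale-out can go.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="wordfreq-workers",          # hypothetical group name
    LaunchTemplate={
        "LaunchTemplateId": "lt-0123456789abcdef0",   # hypothetical template built from the worker instance
        "Version": "$Latest",
    },
    MinSize=1,          # always keep at least one worker
    MaxSize=4,          # example upper bound for scale-out
    DesiredCapacity=1,  # start with a single worker
    VPCZoneIdentifier="subnet-0123456789abcdef0",     # hypothetical subnet
)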
Using the knowledge gained from the Cloud Foundations course, architect and implement auto-
scaling functionality for the WordFreq application. Note that this will not be exactly the same as Lab
6 in Module 10, which is for a web application. You will not need a load balancer, and you will need
to identify a different CloudWatch performance metric to use for the ‘scale out’ and ‘scale in’ rules.
The 'Average CPU Utilization' metric used in Lab 6 is not necessarily the best choice for this
application.
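Purely to illustrate the mechanics of wiring a CloudWatch alarm to a scaling policy, the sketch below uses SQS queue depth as an example metric; identifying and justifying a metric that suits WordFreq is part of your task, and all names, thresholds and periods below are hypothetical examples.

# Minimal illustrative sketch (hypothetical names; example metric only): a simple
# scale-out policy on the Auto Scaling group, triggered by a CloudWatch alarm.
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# A simple scaling policy that adds one worker when its alarm fires.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="wordfreq-workers",   # hypothetical group from the previous sketch
    PolicyName="wordfreq-scale-out",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,                       # launch one additional worker
    Cooldown=120,
)

# An alarm on the chosen metric that invokes the scale-out policy when the
# maximum threshold is breached.
cloudwatch.put_metric_alarm(
    AlarmName="wordfreq-scale-out-alarm",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "wordfreq-jobs"}],  # hypothetical queue name
    Statistic="Average",
    Period=60,
    EvaluationPeriods=2,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)

# A matching 'scale in' policy and alarm (ScalingAdjustment=-1, LessThanThreshold)
# would remove a worker when load drops; the group's termination policy controls
# which instance (for example, the newest) is terminated.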