COMP9321 Data Services Engineering
Assignment 1: Data Ingestion, Manipulation, and Visualisation (15 marks)

Changelog
Clarified that CSV files should be saved after pre-processing the DataFrame.

Introduction
As budding data scientists, we're no doubt interested in the job market that awaits us. In this assignment we'll be building a data pipeline to help inform our future decisions about entering the workforce.
Our primary dataset is a CSV file containing data science jobs from the past few years. It's been scraped from ai-jobs.net and, to some degree, pre-processed.
We'll also be using a number of secondary datasets, which we'll need to scrape ourselves. These will help inform our analysis:
- Cost of Living Index by Country 2023 [CSE Mirror]
- Foreign currency exchange rates for the year ending 31 December 2023 [CSE Mirror]
- Two-letter country codes [CSE Mirror]
Important Note: We will not be scraping data from the source, but rather from the CSE Mirror pages. This is to prevent placing undue burden on third-party servers, and to ensure the pages are static throughout the duration of the assignment.
You're encouraged to review these links and explore these datasets prior to attempting the assignment.
Please note that this is publicly available data, and we are not responsible for political correctness or other data inaccuracies, as this is purely a learning exercise in data services engineering.
Accompanying each question below you'll find:
- Output: This includes the expected shape and column names of the DataFrame you create.
- Marking Considerations: These, along with the expected output, are provided to help you stay on track. They are not a marking rubric, but merely some points for you to be mindful of while completing each question.
A code template is provided. In completing this assignment:
You must read and follow these instructions carefully.
You must use the code template.
You must rename the code template with your zID.
You must not modify the code template, except where indicated. For example, you must not:
- Import additional third-party libraries.
- Modify the function signature for each question_X function.
- Modify the main function.
- Modify the log function.
- Disable or modify the calls to the log and plt.savefig functions within each question_X function.
- Add any global variables.
You must set up a virtual environment as per the instructions below.
You must use either Python 3.11 (which is installed on CSE) or Python 3.12 (which is the
current release).
You must use pandas features to solve each question.
You must not iterate over the rows of any DataFrame.
You must not convert any DataFrame to a native data type (e.g. a list or dict) in order to process the data.
You must not hard-code any file paths or URLs within the code you write.
These are specified in the main function, which you must not modify, and are passed as parameters to the question_X functions that you will complete. You must use these local variables instead of hard-coded values.
You must not display any plots (e.g. using plt.show).
The code template is configured to save your plot to disk.
You must not manually edit any datasets.
During marking a clean copy of the CSV file will be used, and web data will be freshly
scraped.
You may import and use any of the Python 3.11 or Python 3.12 standard libraries.
You may write helper functions and include them where indicated in the code template.
You may iterate over DataFrame.columns, if you feel it necessary.
For your reference, the third-party libraries you may use, and their versions, are listed below. Of
these, the only library you are required to use is pandas. In setting up your virtual environment,
these packages, along with their dependencies, will be installed automatically.
Package     Version
matplotlib  3.8.2
numpy       1.26.0
pandas      2.2.0
thefuzz     0.22.1
Part 0: Setting up your Virtual Environment

We'll be setting up a virtual environment in which to work on the assignment. This is not only good practice, but critical to ensure your code can be run during marking.
For this, we'll be using Python's built-in venv tool to create and activate the environment, and then pip to install the permitted libraries within the virtual environment.
Instructions below are based on a typical macOS/Linux or Windows environment. If your environment differs, please see the documentation for venv and pip.
1. Decide on a directory in which you'd like to work on the assignment, for example:
   macOS: ~/9321/ass1
   Windows: C:\Users\username\9321\ass1
2. Create the virtual environment using:
   python3 -m venv /path/to/my/assignment    # macOS
   python -m venv C:\path\to\my\assignment   # Windows
   For the example directory it would be:
   python3 -m venv ~/9321/ass1                  # macOS
   python -m venv C:\Users\username\9321\ass1   # Windows
   Running this command creates the target directory, if it doesn't already exist, along with any necessary parent directories.
3. Activate the virtual environment using:
   source /path/to/my/assignment/bin/activate      # macOS
   C:\path\to\my\assignment\Scripts\activate.bat   # Windows
   For the example directory it would be:
   source ~/9321/ass1/bin/activate                    # macOS
   C:\Users\username\9321\ass1\Scripts\activate.bat   # Windows
4. Download the provided requirements.txt file to your assignment directory (you may need to right-click → save as...), and then install the listed packages within the virtual environment:
   pip3 install -r /path/to/my/assignment/requirements.txt    # macOS
   pip install -r C:\path\to\my\assignment\requirements.txt   # Windows
   For the example directory it would be:
   pip3 install -r ~/9321/ass1/requirements.txt                  # macOS
   pip install -r C:\Users\username\9321\ass1\requirements.txt   # Windows
5. You're now ready to begin working on the assignment. For example, you could open the
assignment directory as a workspace in Visual Studio Code:
code /path/to/my/assignment # macOS
code C:\path\to\my\assignment # Windows
For the example directory it would be:
code ~/9321/ass1 # macOS
code C:\Users\username\9321\ass1 # Windows
Tip: if you are using VS Code, make sure you have the Python extension installed. You may
also like to review the documentation on Python environments in VS Code.
6. Once you've finished working on the assignment, you can deactivate the virtual environment
by simply executing:
deactivate
Important Note: You will need to repeat steps 3 and 6, to activate and deactivate the virtual
environment, each time you work on the assignment.
Part 1: Data Ingestion and Cleaning (3.5 marks)

Question 1 (0.5 marks)
First things first, we need to load our primary dataset.
- Download the code template to your assignment directory.
- Rename the file according to your zID.
- Download the data science jobs CSV file to the same directory as your Python script.
- Read it into a DataFrame.
Hint: You may need to right-click → save as... to download each of the files.

Output
Shape:
(3755, 11)
Columns:
['work_year', 'experience_level', 'employment_type', 'job_title',
'salary', 'salary_currency', 'salary_in_usd', 'employee_residence',
'remote_ratio', 'company_location', 'company_size']

Marking Considerations
- The Python script has been correctly renamed.
- The CSV file is not hard coded within the function, but rather uses the argument passed to the function.
- The CSV file is read correctly into a DataFrame and returned by the function.
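A minimal sketch of what Question 1 asks for is shown below. The function and parameter names here are illustrative; use the exact signature from the provided code template, and remember the path comes in as an argument rather than being hard-coded.

```python
import pandas as pd


def question_1(jobs_csv: str) -> pd.DataFrame:
    """Read the data science jobs CSV into a DataFrame.

    `jobs_csv` is the path supplied by main() -- it must not be
    hard-coded inside this function.
    """
    df = pd.read_csv(jobs_csv)
    return df
```

With the real dataset, the returned DataFrame should have shape (3755, 11) and the column names listed above.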
Question 2 (1 mark)
In assessing the job market, it's great to know the salaries that are out there, but for it to be meaningful, we also need to consider the cost of living in each market.
- Scrape the table from the cost of living page into a DataFrame.
- Convert all the column names to lower case and replace spaces with underscores.
- Because we're ethically minded data scientists, once you've completed the above pre-processing, save the DataFrame as a CSV file (without the index), so that on future executions of your script the data will be read from disk instead of scraped.
Hint: use pandas.read_html().
Hint: don't write the code to save the CSV file until you're confident that you're scraping the website correctly; otherwise you'll need to delete the file before each execution.
Note: you may find the rank column is empty. This is fine. We can always calculate it ourselves, if we need to.

Output
Shape:
(140, 8)
Columns:
['rank', 'country', 'cost_of_living_index', 'rent_index',
'cost_of_living_plus_rent_index', 'groceries_index',
'restaurant_price_index', 'local_purchasing_power_index']

Marking Considerations
- The CSV file and URL are not hard coded within the function, but rather the arguments passed to the function are used.
- The website is only scraped if the CSV file doesn't exist, and:
  - The column names are correctly sanitised.
  - The CSV file is correctly saved.
- The CSV file is read if it does exist.
- The dataset is loaded correctly into a DataFrame and returned by the function.
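One way to structure the scrape-or-read-from-disk logic for Question 2 is sketched below. The function and parameter names are illustrative (the template passes the actual path and URL), and pd.read_html may return several tables, so you should check which index holds the one you want.

```python
import os

import pandas as pd


def question_2(cost_csv: str, cost_url: str) -> pd.DataFrame:
    """Load the cost-of-living table, caching it as a cleaned CSV."""
    if os.path.exists(cost_csv):
        # On later runs, read the already-cleaned data from disk.
        df = pd.read_csv(cost_csv)
    else:
        # read_html returns a list of DataFrames, one per <table>
        # on the page; here we assume the first is the one we want.
        df = pd.read_html(cost_url)[0]
        # Sanitise the headers: lower case, spaces -> underscores.
        df.columns = df.columns.str.lower().str.replace(" ", "_")
        # Save *after* pre-processing, and without the index.
        df.to_csv(cost_csv, index=False)
    return df
```

Note that the cleaning happens before to_csv, so the cached file already has sanitised column names and can be read back with a plain pd.read_csv.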
You'll notice the jobs dataset includes the salary for each job in its local currency as well as its US
dollar equivalent. For our purposes, it would be better to have the salaries in Australian dollars.
We'll be focussing mainly on the most recent job listings, in 2023, and fortunately the Australian
Taxation Office provides average currency exchange rates for each calendar year.
- Scrape the table from the currency exchange rates page into a DataFrame.
- You'll see this table actually has two header rows:
  - Remove all columns under Nearest actual exchange rate, along with the headers.
  - Remove the top-level header row.
- The remaining header row and the data contain some non-breaking spaces, which can be problematic. Replace all such spaces with regular spaces.
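The header clean-up above can be sketched as follows, assuming the table was scraped with pd.read_html(rates_url, header=[0, 1]) so that both header rows become a two-level column index. The helper name and the sample top-level label "Average rate" are illustrative, not from the assignment.

```python
import pandas as pd


def tidy_rate_table(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: tidy a two-header-row table scraped via
    pd.read_html(url, header=[0, 1])."""
    # Drop every column grouped under "Nearest actual exchange rate".
    keep = df.columns.get_level_values(0) != "Nearest actual exchange rate"
    df = df.loc[:, keep]
    # Remove the top header level, keeping the second-row names.
    df.columns = df.columns.droplevel(0)
    # Replace non-breaking spaces (\xa0) in headers and string data.
    df.columns = df.columns.str.replace("\xa0", " ")
    df = df.replace("\xa0", " ", regex=True)
    return df
```

Before relying on the string comparison, print df.columns from the real page: the scraped headers may themselves contain non-breaking spaces, in which case the match string must include them.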