CMT309 Computational Data Science
Module Code: CMT309
Module Title: Computational Data Science
This assignment is worth 70% of the total marks available for this module. If coursework is
submitted late (and where there are no extenuating circumstances):
1 - If the assessment is submitted no later than 24 hours after the deadline, the
mark for the assessment will be capped at the minimum pass mark;
2 - If the assessment is submitted more than 24 hours after the deadline, a mark
of 0 will be given for the assessment.
Submission Instructions
Your coursework should be submitted via Learning Central by the above deadline. You have
to upload the following files:
Description                        | Requirement | Type                               | Name
Cover sheet                        | Compulsory  | One PDF (.pdf) file                | Student_number.pdf
Your solution to Part 1 and Part 2 | Compulsory  | One Jupyter Notebook (.ipynb) file | Part1_2.ipynb
HTML version of Part1_2.ipynb      | Compulsory  | One HTML (.html) file              | Part1_2.html
Your solution to Part 3            | Compulsory  | One Jupyter Notebook (.ipynb) file | Part3.ipynb
HTML version of Part3.ipynb        | Compulsory  | One HTML (.html) file              | Part3.html
For the filename of the Cover Sheet, replace ‘Student_number’ by your student number, e.g.
“C1234567890.pdf”. Make sure to include your student number as a comment in all of the
Python files! Any deviation from the submission instructions (including the number and types
of files submitted) may result in a reduction of marks for the assessment or question part.
You can submit multiple times on Learning Central. ONLY files contained in the last attempt
will be marked, so make sure that you upload all files in the last attempt.
Staff reserve the right to invite students to a meeting to discuss the Coursework submissions.
Assignment
Start by downloading the following files from the Assessment tab in Learning Central:
• Part1_2.ipynb (Jupyter Notebook file for Part 1 and Part 2)
• Part3.ipynb (Jupyter Notebook file for Part 3)
• listings.csv
• reviews.csv
Then answer the following questions. You can use any Python expression or package that
was used in the lectures and practical sessions. Additional packages are not allowed unless
instructed in the question. You answer the questions by filling in the appropriate sections in
the Jupyter Notebook. Export your final Jupyter Notebooks as HTML to produce the
corresponding HTML files. Before submitting your Jupyter Notebooks and HTML files,
make sure to restart the kernel and execute each cell such that all outputs and figures
are visible.
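As a rough illustration of the export step, the sketch below builds the `jupyter nbconvert` command that re-executes a notebook in a fresh kernel and writes an HTML version. The helper names `build_export_command` and `export_notebook` are hypothetical and not part of the provided notebooks; using the notebook interface's own "Restart & Run All" and HTML export achieves the same result.

```python
import subprocess


def build_export_command(notebook):
    """Return the nbconvert command that re-runs a notebook and exports HTML.

    Hypothetical helper for illustration only.
    """
    # --execute runs every cell in a fresh kernel before converting,
    # which approximates "restart the kernel and execute each cell".
    return ["jupyter", "nbconvert", "--to", "html", "--execute", str(notebook)]


def export_notebook(notebook):
    """Run the export; raises CalledProcessError if any cell fails."""
    subprocess.run(build_export_command(notebook), check=True)


if __name__ == "__main__":
    # Only print the commands here; call export_notebook() to actually run them.
    for name in ("Part1_2.ipynb", "Part3.ipynb"):
        print(" ".join(build_export_command(name)))
```

Whichever route is used, it is worth opening the resulting HTML files to confirm that all outputs and figures appear.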
Scenario
In this assignment you take on the role of a Data Scientist who has been hired by Airbnb.
Airbnb is an online marketplace for vacation and short-term rentals of rooms or flats that
operates in many countries around the world (see https://en.wikipedia.org/wiki/Airbnb). Airbnb
collects data on its listings and on the users interacting with the platform. Let us define these
terms first:
- User: A user is someone using the Airbnb platform (guest or host).
- Guest: A guest is a user who uses the Airbnb platform to book a room, flat or house.
- Host: A host is a user who offers a room, flat, or house for rent.
- Listing: A listing is a room, flat, or house offered for rent. In the dataset, each row
corresponds to a listing.
Since we do not have access to Airbnb’s internal database, we will instead use data published
by the website Inside Airbnb. The data has been acquired by web-scraping publicly available
data from Airbnb. It is available online (http://insideairbnb.com/about.html) but you do not
need to download any data from this website since we have downloaded and renamed all
datasets you need and made them available on Learning Central.
In our scenario, you are responsible for the Amsterdam branch of Airbnb operations. Your
main task is to provide insights into the data collected in Amsterdam and to write
algorithms that improve the experience of Airbnb users. The assignment is split into three
parts. In the first two parts, you will focus on the numerical data. In the last part,
you will focus on the text data.
• Part 1: Pre-processing and exploratory analysis
You start by reading the CSV file into a Pandas DataFrame, cleaning the data,
removing unwanted columns, performing conversions and calculating new columns.
Then, you will perform exploratory analysis to look at the properties and distribution of
the data, and answer a couple of questions posed by your manager.
• Part 2: Statistical analysis and recommender systems
Starting from the pre-processed DataFrame, you will perform statistical analysis using
t-tests and linear regression to identify variables that significantly affect the rental
price. Then, you will design a series of recommender systems that have been requested by
users: a function that helps in setting the price for someone offering a new property, and a
function that helps in selecting a city to visit given a particular budget.
• Part 3: Text analysis and ethics
You will mostly work with unstructured text data in a Pandas DataFrame representing
user reviews.
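To give a flavour of the kind of work Parts 1 and 2 describe, here is a minimal sketch on a tiny made-up DataFrame. Everything in it is an assumption for illustration: the column names (`room_type`, `price`, `accommodates`, `scrape_id`), the values, the choice of Welch's variant of the t-test, and the use of `scipy.stats` are not taken from the provided notebooks or from listings.csv, and the approach expected in the actual questions may differ.

```python
import pandas as pd
from scipy import stats

# Synthetic stand-in for listings.csv (hypothetical columns and values).
listings = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Private room", "Entire home/apt", "Private room"],
    "price": ["$150.00", "$80.00", "$1,200.00", "$95.00", "$210.00", "$70.00"],
    "accommodates": [4, 2, 8, 2, 5, 1],
    "scrape_id": [0, 0, 0, 0, 0, 0],  # example of an unwanted column
})

# Part 1 style: remove an unwanted column and convert price strings to floats.
clean = listings.drop(columns=["scrape_id"])
clean["price"] = (
    clean["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Part 2 style: t-test comparing prices between two groups of listings.
# equal_var=False selects Welch's t-test (an assumption made for this sketch).
entire = clean.loc[clean["room_type"] == "Entire home/apt", "price"]
private = clean.loc[clean["room_type"] == "Private room", "price"]
t_stat, p_value = stats.ttest_ind(entire, private, equal_var=False)

# Part 2 style: simple linear regression of price on guest capacity.
fit = stats.linregress(clean["accommodates"], clean["price"])

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
print(f"slope = {fit.slope:.2f} price units per additional guest")
```

With only six made-up rows the statistics carry no meaning; the point is the sequence of steps: clean, convert, then test.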