QBUS6180
Statistical Learning and Data Mining
Regression Project: Airbnb Pricing Analytics
1. Overview
In this project your team will analyse data from Airbnb rentals in Sydney to provide market
advice to hosts, real estate investors, and other stakeholders. Your team will have two tasks:
the first will be to build a predictive model for vacation rental prices and the second will be to
uncover interesting facts from the data that can help your clients make better decisions.
2. Problem description
Airbnb is a global platform that runs an online marketplace for short-term
travel rentals.
As a team of data scientists and business analysts working at a market intelligence and
consulting company targeting the Airbnb market, you are tasked with developing an advice
service for hosts, property managers, and real estate investors. [1]
To achieve your project’s goals, you are provided with a dataset containing detailed
information on a number of existing Airbnb listings in Sydney. Your team has two tasks: [2]
1. To develop a predictive model for the daily prices of Airbnb rentals based on
state-of-the-art machine learning techniques. This model will allow the company to advise
hosts on pricing and help owners and investors predict the potential revenue of
Airbnb rentals (which also depends on the occupancy rate).
2. To obtain at least three insights that can help hosts make better decisions. What
are the best hosts doing?
We will refer to these tasks as statistical learning and data mining respectively.
As part of the contract, you are asked to write a report according to the instructions given
below.
[1] A real example is Airdna. Airbnb itself has a large data science and analytics team.
[2] This is similar to Airdna: https://www.airdna.co/airbnb-hosts.
3. Understanding the data
3.1 Training, validation, and test sets
The data are split into two files, a training dataset and a second dataset for validation and
evaluation. The second omits the price values.
We will run a Kaggle competition as part of the assignment. Kaggle randomly splits the
observations in the second file into validation (50%) and test (50%) cases, but you will not
know which ones are which. Each time you submit to the competition, you receive a score
equal to the root mean squared logarithmic error (RMSLE) computed on the validation cases;
lower is better. These scores are displayed on the Public Leaderboard and provide an
ongoing ranking of teams. You can use the scores of your submissions to help you select
the best predictive model.
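To make the scoring metric concrete, the following is a minimal sketch of how an RMSLE can be computed. It assumes the common definition based on log(1 + x), which is not spelled out in this brief, so check the competition page for the exact formula. The function name `rmsle` and the example prices are illustrative only.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, assuming the log(1 + x) form."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if min(y_true) < 0 or min(y_pred) < 0:
        raise ValueError("prices and predictions must be non-negative")
    # Squared differences between log-transformed predictions and actual values.
    squared_errors = [
        (math.log1p(pred) - math.log1p(actual)) ** 2
        for actual, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices, for illustration only.
actual_prices = [100.0, 250.0, 80.0]
predicted_prices = [110.0, 200.0, 80.0]
print(rmsle(actual_prices, predicted_prices))
```

Because the error is measured on the log scale, the metric penalises relative rather than absolute deviations: under this definition, predicting 199 for a listing priced at 99 costs the same as predicting 99 for a listing priced at 199.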
You will select one of your submissions to be used as the final model at the end of the
competition. Once the competition is over, Kaggle will rank the teams’ final submissions based
on the test cases only. Those will be displayed on the Private Leaderboard. Your goal is to
achieve the best possible score on the Private Leaderboard at the end of the
competition.
Be careful not to overfit the validation cases in an attempt to improve your public ranking.
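One way to reduce reliance on the Public Leaderboard is to estimate the RMSLE locally from the training file, for example with k-fold cross-validation. The sketch below is only an illustration under stated assumptions: the prices are synthetic, the "model" is a constant baseline, and the helper names (`kfold_indices`, `fit_baseline`, `cross_validated_rmsle`) are hypothetical. In practice you would substitute your own data and models.

```python
import math
import random


def rmsle(y_true, y_pred):
    """RMSLE, assuming the log(1 + x) definition."""
    squared_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def kfold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_baseline(prices):
    """Constant prediction that minimises squared error on the log(1 + x) scale."""
    mean_log = sum(math.log1p(p) for p in prices) / len(prices)
    return math.expm1(mean_log)


def cross_validated_rmsle(prices, k=5, seed=0):
    """Return the mean RMSLE over k held-out folds, plus the per-fold scores."""
    scores = []
    for held_out in kfold_indices(len(prices), k, seed):
        held_out_set = set(held_out)
        train = [p for i, p in enumerate(prices) if i not in held_out_set]
        valid = [prices[i] for i in held_out]
        constant = fit_baseline(train)  # fit on the training folds only
        scores.append(rmsle(valid, [constant] * len(valid)))
    return sum(scores) / len(scores), scores


# Synthetic log-normal prices, NOT the assignment data.
rng = random.Random(42)
prices = [round(math.exp(rng.gauss(5.0, 0.6)), 2) for _ in range(200)]

mean_score, fold_scores = cross_validated_rmsle(prices, k=5)
print("mean CV RMSLE:", mean_score)
print("per-fold RMSLE:", fold_scores)
```

If a change improves the public score but not the cross-validated score, it may be fitting noise in the validation cases rather than capturing a real improvement.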