FIT5196 assessment
This is an individual assessment worth 35% of your total mark for
FIT5196.
Data Cleansing (60%)
For this assessment, you are required to write Python (Python 2/3) code to analyze your dataset
and to find and fix the problems in the data. The input and output of this task are shown below:
Table 1. The input and output of the task
Input                            Output                                   Output Notebook
<student_id>_dirty_data.csv      <student_id>_dirty_data_solution.csv     <student_id>_ass2.ipynb
<student_id>_outlier_data.csv    <student_id>_outlier_data_solution.csv   <student_id>_ass2.pdf
<student_id>_missing_data.csv    <student_id>_missing_data_solution.csv
Note 1: You should submit a zip file and a pdf file; the pdf file will be used for the plagiarism check.
1. The csv files and the ipynb file must be zipped into a file named
<student_id>_ass2.zip.
2. The pdf file should be exported from the <student_id>_ass2.ipynb without
any cell output (please only keep the markdown notes and scripts in the
pdf file). The pdf file is named <student_id>_ass2.pdf.
Note 2: <student_id> is to be replaced with your student id.
Note 3: Students can find their three input files here based on their student_id.
Note 4: An interview is required for this assessment. You will need to explain your
solution and answer questions from a teaching team member.
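The packaging step described in the first note can be scripted rather than done by hand. The following is only a minimal sketch, assuming Python 3 and assuming that "the csv files" means the three solution files from Table 1; the function name build_submission_zip and its folder parameter are hypothetical and not part of the assessment specification.

```python
import zipfile
from pathlib import Path


def build_submission_zip(student_id, folder="."):
    """Zip the three solution csv files and the notebook into <student_id>_ass2.zip.

    Assumption: the zip should contain the solution csv files, not the inputs.
    """
    folder = Path(folder)
    names = [
        f"{student_id}_{kind}_data_solution.csv"
        for kind in ("dirty", "outlier", "missing")
    ]
    names.append(f"{student_id}_ass2.ipynb")

    # Fail early if any expected file is absent, instead of building a partial zip.
    absent = [name for name in names if not (folder / name).is_file()]
    if absent:
        raise FileNotFoundError("Missing files: " + ", ".join(absent))

    zip_path = folder / f"{student_id}_ass2.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            # arcname keeps the files at the top level of the archive.
            archive.write(folder / name, arcname=name)
    return zip_path
```

Calling build_submission_zip("<student_id>") with your own id in the folder holding your files would produce the zip; the pdf is submitted separately and is therefore left out of the archive.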
Exploring and understanding the data is one of the most important parts of the data wrangling
process. You are required to apply graphical and/or non-graphical EDA methods to understand
the data first and then find and fix the data problems. You are required to:
● Detect and fix errors in <student_id>_dirty_data.csv
● Detect and remove outlier rows in <student_id>_outlier_data.csv
(outliers are to be found w.r.t. the delivery_charges attribute only)
● Impute the missing values in <student_id>_missing_data.csv
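As an illustration of the outlier task in the list above, one common non-graphical check is the 1.5 x IQR rule applied to a single column. This is only a sketch under stated assumptions: the assessment does not say which outlier definition is expected, so the IQR rule, the helper names iqr_bounds and split_outliers, and the toy rows below are all made up for illustration. Your own EDA should decide whether this rule suits delivery_charges.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(rows, column="delivery_charges", k=1.5):
    """Split a list of row dicts into (kept, outliers) using the fences on one column."""
    values = [float(row[column]) for row in rows]
    lower, upper = iqr_bounds(values, k)
    kept, outliers = [], []
    for row in rows:
        inside = lower <= float(row[column]) <= upper
        (kept if inside else outliers).append(row)
    return kept, outliers


# Made-up rows for illustration only; real rows would be read from
# <student_id>_outlier_data.csv (for example with csv.DictReader).
toy_rows = [
    {"order_id": "ORD%03d" % i, "delivery_charges": charge}
    for i, charge in enumerate([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
]
kept, outliers = split_outliers(toy_rows)
print(len(kept), "rows kept;", [row["order_id"] for row in outliers], "flagged")
```

The rule is applied to one column only, matching the requirement that outliers are judged on delivery_charges alone. If delivery charges depend on other attributes, a rule based on the raw values may be too crude, which is something the EDA step should reveal.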
As a starting point, here is what we know about the dataset at hand:
The dataset contains transactional retail data from an online electronics store (DigiCO) located in
Melbourne, Australia. The store operates exclusively online and has three warehouses
around Melbourne from which goods are delivered to customers.
Each instance of the data represents a single order from said store. The description of each data
column is shown in Table 2.
Table 2. Description of the columns
COLUMN DESCRIPTION
order_id A unique id for each order
customer_id A unique id for each customer
date The date the order was made, given in YYYY-MM-DD format
nearest_warehouse A string denoting the name of the nearest warehouse to the
customer
shopping_cart A list of tuples representing the order items: the first element of
each tuple is the item ordered, and the second element is the
quantity ordered of that item.
order_price A float denoting the order price in AUD. The order price is the
price of items before any discounts and/or delivery charges
are applied.