DATA9001
Problem1 – Theoretical [4 marks]
A two-class model was trained and then tested with a dataset of 100 instances. The test set
contained 60 instances in positive class P, and 40 instances in negative class N. As a result of
testing, the following counts were obtained:
50 instances of P were classified correctly,
10 instances of P were classified into N,
20 instances of N were classified correctly,
20 instances of N were classified into P.
i) [1 mark] Construct the contingency table (also called the confusion matrix).
ii) [1.5 marks] Calculate the following macro metrics: (show your calculations)
a. Precision
b. Recall
c. F1
iii) [1.5 marks] Calculate the following micro metrics:
a. Precision,
b. Recall,
c. F1
Note: For this problem (Problem1 – Theoretical), you can write your solution on paper and scan
it.
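As a reminder of how macro and micro averages differ, the sketch below computes both from a confusion matrix. It uses illustrative counts, not the counts from this problem. The function names are hypothetical, and macro F1 is taken here as the mean of the per-class F1 scores (the scikit-learn convention); check the lecture notes if your course defines it differently.

```python
def per_class_counts(confusion):
    """Return (tp, fp, fn) per class for a square confusion matrix.

    confusion[i][j] = number of instances whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    counts = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, truly other
        fn = sum(confusion[c]) - tp                        # truly c, predicted other
        counts.append((tp, fp, fn))
    return counts


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_micro(confusion):
    counts = per_class_counts(confusion)
    per_class = [prf(*c) for c in counts]
    n = len(per_class)
    # Macro: average the per-class metrics, each class weighted equally.
    macro = tuple(sum(m[i] for m in per_class) / n for i in range(3))
    # Micro: pool the counts over all classes, then compute the metrics once.
    micro = prf(*(sum(c[i] for c in counts) for i in range(3)))
    return macro, micro


# Illustrative counts only (not the counts from the problem statement).
example = [[8, 2],
           [1, 9]]
macro, micro = macro_micro(example)
print("macro P/R/F1:", [round(v, 3) for v in macro])
print("micro P/R/F1:", [round(v, 3) for v in micro])
```

When every instance receives exactly one label, the pooled false positives equal the pooled false negatives, so micro precision, recall and F1 all coincide with accuracy.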
Problem2 – Practical [11 marks]
This assignment is inspired by a real-life scenario. Imagine you have been hired as a Data
Scientist by a major e-commerce retailer. Your responsibility is to analyse customer reviews
to determine the rating of new products.
For this assignment you will be given a dataset of 5000 customer reviews from Amazon.
Each review consists of a short text, and one of five ratings: a number from 1 to 5. You are
required to evaluate various supervised machine learning methods using a variety of
features and settings.
You are required to apply three machine learning methods discussed in the lecture: Naïve
Bayes (NB), Decision Tree (DT) and k-Nearest Neighbour (KNN). These methods will be
trained on the first 4000 reviews and tested on the remaining 1000. An incomplete solution is
provided to minimize the required coding effort so you can focus on the result analysis.
For all models, consider a review to be a collection of words, where a word is a string of at
least two letters, numbers or the symbols / (slash), - (hyphen), $ or %, delimited by a space,
after replacing three dots (…) by a space. The (…) replacement has already been done, but you
need to handle the other characters.
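One possible reading of this rule can be sketched as follows. The helper name `tokenize` is hypothetical, "letters" is assumed to mean ASCII letters, and characters outside the allowed set are assumed to be deleted rather than replaced by a space. The provided incomplete solution may intend something different, so treat this as a starting point.

```python
import re

# Characters kept inside a word: letters, digits, '/', '-', '$', '%'.
# Assumption: "letters" means ASCII letters; widen the class if needed.
_DISALLOWED = re.compile(r"[^A-Za-z0-9/\-$%\s]")


def tokenize(text):
    """Split a review into words under one reading of the rule above."""
    cleaned = _DISALLOWED.sub("", text)  # drop every other character
    # Words are delimited by whitespace and must have at least two characters.
    return [tok for tok in cleaned.split() if len(tok) >= 2]


print(tokenize("I paid $20 for this, 50% off a-ok!"))
```

A callable like this can be passed to CountVectorizer through its `tokenizer` argument, in which case the vectorizer's own `token_pattern` is not used.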
The default parameters in your code should be as follows:
• All models: no stemming, no conversion of text to lowercase (lowercase=False in
CountVectorizer), stop words removed (stop_words="english" in CountVectorizer),
first 4000 reviews used for training, last 1000 used for testing
• Decision Tree: max_depth=None, criterion="entropy", random_state=0. Setting max_depth
to None means the full Decision Tree is grown.
• K-Nearest Neighbour: n_neighbors = 5
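The defaults listed above might be set up in scikit-learn roughly as follows. This is a sketch, not the provided solution: the text does not name the Naïve Bayes variant, so `MultinomialNB` is an assumption, and the `models` dictionary is a hypothetical convenience.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# No lowercasing, English stop words removed (stemming is simply not applied).
vectorizer = CountVectorizer(lowercase=False, stop_words="english")

models = {
    # Assumption: the NB variant is not specified in the assignment text.
    "NB": MultinomialNB(),
    "DT": DecisionTreeClassifier(max_depth=None, criterion="entropy", random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
```

Keeping the models in one dictionary makes it easy to loop over all three methods when a question asks for metrics for each of them.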
Provided for this assignment:
1. Dataset.tsv: 5000 customer reviews in the tsv format (ins_number, text, rating)
2. Incomplete_solution.py: an incomplete solution that you can use as a basis for your
code.
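Reading the file, splitting it into the first 4000 and last 1000 reviews, and counting instances per rating could look like the sketch below. It assumes the file has no header row and exactly three tab-separated columns, which should be checked against the actual Dataset.tsv. The function names and the in-memory sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def load_reviews(handle):
    """Read (ins_number, text, rating) rows from a tab-separated stream.

    Assumption: no header row and exactly three columns per line.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1], int(row[2])) for row in reader if len(row) == 3]


def split_and_count(rows, n_train=4000):
    """Split by position and count how many reviews carry each rating."""
    train, test = rows[:n_train], rows[n_train:]
    distribution = Counter(rating for _, _, rating in rows)
    return train, test, distribution


# Tiny in-memory stand-in for Dataset.tsv (made-up rows, for illustration).
sample = io.StringIO("1\tGreat product\t5\n2\tBroke in a week\t1\n3\tOkay\t3\n")
rows = load_reviews(sample)
train, test, dist = split_and_count(rows, n_train=2)
print(len(train), len(test), sorted(dist.items()))
```

With the real file, the handle would come from something like `open("Dataset.tsv", newline="", encoding="utf-8")`; the encoding is an assumption.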
Questions to be answered for Problem2:
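In questions 2-6, "metrics" means the following set of metrics obtained on the test part of the
data (values to 3 decimal places):

|                    | Precision | Recall | F1 |
|--------------------|-----------|--------|----|
| Micro: all ratings |           |        |    |
| Macro: all ratings |           |        |    |
| 1                  |           |        |    |
| 2                  |           |        |    |
| 3                  |           |        |    |
| 4                  |           |        |    |
| 5                  |           |        |    |

This table can be used to present the results.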
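The rows of such a table can be produced without screenshots by computing the values directly. The sketch below uses scikit-learn's `precision_recall_fscore_support`; the helper name `metrics_table` and the sample labels are made up for illustration, and undefined values are reported as 0 via `zero_division=0`, which is a choice rather than a requirement.

```python
from sklearn.metrics import precision_recall_fscore_support


def metrics_table(y_true, y_pred, labels=(1, 2, 3, 4, 5)):
    """Return rows of (name, precision, recall, f1), rounded to 3 places."""
    labels = list(labels)
    rows = []
    # Averaged rows: micro pools all decisions, macro averages per-class scores.
    for avg in ("micro", "macro"):
        p, r, f, _ = precision_recall_fscore_support(
            y_true, y_pred, labels=labels, average=avg, zero_division=0
        )
        rows.append((f"{avg.capitalize()}: all ratings", p, r, f))
    # Per-rating rows: average=None returns one value per label.
    p, r, f, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=labels, average=None, zero_division=0
    )
    for i, label in enumerate(labels):
        rows.append((str(label), p[i], r[i], f[i]))
    return [(name, round(float(a), 3), round(float(b), 3), round(float(c), 3))
            for name, a, b, c in rows]


# Made-up labels and predictions, for illustration only.
y_true = [1, 2, 3, 4, 5, 5, 4, 3]
y_pred = [1, 2, 3, 5, 5, 5, 4, 2]
for name, p, r, f in metrics_table(y_true, y_pred):
    print(f"{name:<20}{p:>10.3f}{r:>10.3f}{f:>10.3f}")
```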
1. [0.5 mark] Show the distribution of all instances over the ratings in tabular or chart
form. How might this distribution affect training/testing, and what metrics would
be suitable to measure model performance? Briefly justify your answer.
2. [0.5 mark] Show metrics for all three methods with default parameters. Do not take
screenshots from the terminal.
3. [1 mark] Change the default parameters in the code by not removing stop words.
Show metrics for all three methods, compare to the default setting, and briefly
explain the differences.
4. [1 mark] Change the default parameters by enabling stemming and lowercasing in
the code. Show metrics for all three methods, compare to the default setting and
briefly explain the differences.
5. [1 mark] Vary k in KNN from 1 to 10 and choose the best k as measured by
macro F1. Compare this metric to the default setting, and briefly explain the
difference. You can use a Python loop to run KNN for different values of k.
6. [2 marks] Limiting the depth of a Decision Tree is one way to prune it and
balance bias against variance. Vary max_depth in DT from 5 to 15 and choose the
best model as measured by macro F1. Compare to the default setting, and briefly
explain the difference. You can use a Python loop to run DT for different values of max_depth.
7. [1 mark] One of the first steps of the provided incomplete_solution.py is to create a
new column. Please refer to the function "stentiment_converter" and explain what
the data in the new column represents.
Instead of training the standard models on the rating, train them on the new
column. Show metrics for the three methods with default parameters on the new
column. Compare these results with the results on the "rating" data. Briefly explain
the difference and what you believe is the reason for it.
8. [4 marks] For this section, you are required to describe your chosen "best" method
for rating prediction. You are free to choose from existing methods and tune
parameters, or you can introduce a new method that you found to be useful in this
context.
You need to give new experimental results for your method, trained on the training
set of 4000 reviews and tested on the test set of the last 1000 reviews. Explain how
this experimental evaluation justifies your choice of model, including settings and
parameters, against a range of alternatives. Provide new experiments and
justifications: do not just refer to previous answers.
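Questions 5 and 6 both ask for a parameter sweep scored by macro F1. A minimal sketch of such a loop is given below. The helper `sweep` is hypothetical, and small numeric toy features stand in for the vectorised reviews so that the example runs on its own. With real data, `X_train` and `X_test` would be the CountVectorizer outputs.

```python
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def sweep(make_model, values, X_train, y_train, X_test, y_test):
    """Fit one model per value; return (best_value, macro F1 by value)."""
    scores = {}
    for v in values:
        model = make_model(v).fit(X_train, y_train)
        scores[v] = f1_score(
            y_test, model.predict(X_test), average="macro", zero_division=0
        )
    # On ties, max() keeps the first (smallest) value encountered.
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (made up): two well-separated groups labelled 1 and 5.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 2], [2, 0],
           [5, 5], [5, 6], [6, 5], [6, 6], [5, 7], [7, 5]]
y_train = [1] * 6 + [5] * 6
X_test = [[0.5, 0.5], [5.5, 5.5]]
y_test = [1, 5]

best_k, knn_scores = sweep(
    lambda k: KNeighborsClassifier(n_neighbors=k),
    range(1, 11), X_train, y_train, X_test, y_test,
)
best_depth, dt_scores = sweep(
    lambda d: DecisionTreeClassifier(max_depth=d, criterion="entropy", random_state=0),
    range(5, 16), X_train, y_train, X_test, y_test,
)
print("best k:", best_k, "best max_depth:", best_depth)
```

Returning the full score dictionary as well as the best value makes it straightforward to report how macro F1 changes across the range, not just the winner.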
Deliverables:
1. ZIP file (A3-z1234567-FirstName-LastName.zip) containing:
a. Solution to Problem1, either handwritten or computer generated, in PDF
format (A3-z1234567-FirstName-Surname-Problem1.pdf).
b. Report in PDF format (A3-z1234567-FirstName-Surname-Problem2.pdf)
answering the questions listed under "Questions to be answered for
Problem2". The report should contain answers to each question, clearly
numbered with tables showing results, if required. Do not use screenshots of
classification_report. Questions may require studying additional
material in the textbooks listed in the Outline of this course and
experimenting with the code.
c. All code developed for this assignment in one file (A3-z1234567-FirstName-Surname.py)
d. Signed Plagiarism Declaration form (PlagirismDeclaration_A3-z1234567FirstName-Surname.py)
Project help:
This assignment is strictly individual work. However, consulting external resources is
allowed provided that proper acknowledgement is included in the report. This can be done
by providing a list of references at the end of the report and placing the reference number
in the report where the resource is used. Instead of copying text from these resources,
paraphrase it in your own words.
To successfully run the provided Python code for Problem2, you may need to install some
external libraries and resources. We deliberately didn't provide detailed instructions on this
matter. The intention is to replicate real-world challenges and help you prepare for them.
Installing the required libraries and resources shouldn't be overly complicated. If you
encounter any difficulties, we encourage you to seek assistance from your tutor, lecturer, or
even your classmates. Helping each other with the installation process is not only allowed but
also encouraged in this assignment.
By working through the installation process yourself, you'll become more confident
in handling similar challenges in the future.