COMP90049 Introduction to Machine Learning
Assignment
Released: Friday 8th September (end of Week 7)
Due: Stage I (main assignment deadline): Friday 6th October at 5pm (end of Week 10)
Stage II (peer review deadline): Wednesday 11th October at 5pm
Marks: This project will be marked out of 30 and constitutes 30% of your total mark for
the subject
1 Introduction
In this assignment you will develop, evaluate, and critically assess machine learning models for identify-
ing patient sentiment towards clinicians. You will do this using reviews derived from the ratemds.com
United States clinician review website. You will be provided with a dataset of clinician reviews that
have been labeled with a sentiment classification (i.e. whether a patient was satisfied or dissatisfied
regarding an interaction with a clinician). In addition, each review is labelled with the gender of the
clinician (male, female, or unknown). You may use this information to determine whether your model
works equally well in predicting sentiment for male and female genders.2 This assessment provides
you with an opportunity to reflect on concepts in machine learning in the context of an open-ended
research problem, and to strengthen your skills in data analysis and problem solving.
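For instance, model performance can be disaggregated by clinician gender by computing a metric separately over each gender group. A minimal sketch (the labels and predictions below are toy stand-ins for real validation data; the gender codes follow Table 1):

```python
# Sketch: gender-disaggregated accuracy. Gender codes follow the dataset
# (0: female, 1: male, 2: unknown); the toy arrays below stand in for real
# validation labels and model predictions.
genders = [0, 0, 1, 1, 1, 2]          # clinician gender per review
gold    = [1, -1, 1, 1, -1, 1]        # true sentiment labels
pred    = [1, -1, 1, -1, -1, 1]       # model predictions

accuracy_by_gender = {}
for code, name in [(0, "female"), (1, "male"), (2, "unknown")]:
    pairs = [(y, p) for y, p, g in zip(gold, pred, genders) if g == code]
    accuracy_by_gender[name] = sum(y == p for y, p in pairs) / len(pairs)

print(accuracy_by_gender)
```

A large gap between the per-group figures would be worth discussing in your report.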
Online clinician review sites (i.e. websites like ratemds.com) that publish patient reviews are a useful
resource for individuals seeking to make informed healthcare choices, for health systems seeking to
improve patient safety and healthcare quality, and for researchers seeking to identify systematic and
emerging problems in healthcare provision. Given that commercial websites like ratemds.com and
non-commercial entities like the UK National Health Service generate thousands of user feedback
comments per week, in the past decade considerable effort has been expended on designing and
evaluating appropriate machine learning/natural language processing tools to identify trends in patient
comments. Examples of this kind of work include: Greaves et al. (2013), Doing-Harris et al. (2016),
Cammel et al. (2020), Zhang et al. (2018), and Chekijian et al. (2021). Data provided for this project
are derived from the following two papers: Wallace et al. (2014) and López et al. (2012).
The goal of the assignment is to critically assess and evaluate the effectiveness and appropri-
ateness of various machine learning approaches applied to the problem of determining the sentiment
(positive or negative) of patient-generated clinician reviews, and to articulate the knowledge that
you have gained in a technical report. The technical aspect of this project will involve
applying appropriate machine learning algorithms to the data to address the task. There will be a
Kaggle in-class competition where you can compare the performance of your algorithm against that of
your classmates.
The primary output of this project will be the report, which will be formatted as a short research
1Note that this assignment is largely based on a format developed by Dr Lea Frermann
2Note that there was insufficient data and evidence to identify non-binary gender classes.
paper. In the report, you will demonstrate the knowledge that you have gained over the duration of
the subject in a manner that is accessible to an informed reader.
Note that you do not need to implement algorithms “from scratch” for this assignment. It is
expected that you will use algorithms implemented in existing libraries (e.g. scikit-learn). Assessment
will be based on the quality of your report.
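For example, a simple baseline might combine bag-of-words features with a linear classifier from scikit-learn. The snippet below is a sketch only: the training reviews are invented stand-ins for rows of TRAIN.csv, and CSV loading is omitted.

```python
# Baseline sketch: TF-IDF features + logistic regression via scikit-learn.
# The training examples here are toy stand-ins for rows of TRAIN.csv.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Very professional and concerned with your health",
    "worst doctor ever. Awful!",
    "Kind, attentive, and thorough",
    "Rude staff and a dismissive doctor",
]
train_labels = [1, -1, 1, -1]   # 1: positive sentiment, -1: negative

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# On the real data you would predict on VALIDATION.csv and report results there.
preds = model.predict(["Awful! Rude and dismissive."])
print(preds)
```

Stronger models and feature sets are of course possible; the point of the pipeline is that the vectoriser and classifier are fit together, which avoids leaking validation vocabulary into training.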
2 Deliverables
Stage I: Model development and testing and report writing (due Friday 6th October at 5pm — end
of Week 10)
1. One or more programs (where “programs” can include stand-alone scripts or Jupyter Notebooks)
written in Python 3, including all the necessary code to reproduce the results of your report
(including model implementation, label prediction, and evaluation). You should also include
a README file that briefly details your implementation. This component of the assessment
should be submitted through the Canvas LMS. All your code files (and README) should be
contained in a single zip file.
2. An anonymous report of approximately 2,000 words (±10%) excluding bibliographic references
and the ethics statement (described below), but including material in tables and captions. Your
name and student ID should not appear anywhere in the report, including the metadata (e.g.
filename). This component of the evaluation should be submitted through Canvas. You must
upload the report as a separate PDF file. Do not upload it as part of a compressed archive file
(e.g. zip, tar) or in a different format (e.g. Microsoft Word). Anonymity is required in order to
enhance the fairness of the peer review process (i.e. your reviewer should not know your name
and you should not know your reviewer’s name).
3. Sentiment predictions for the test set of clinician reviews, submitted to Kaggle3 as described
in Section 7.
Stage II: Peer reviews (due Wednesday 11th October at 5pm)
1. Reviews of two reports written by your classmates. Each review will be approximately 200–400
words. This component of the assessment should be submitted through Canvas.
3 Data sets
The data in its entirety (i.e. before being divided into training, validation, and test sets) consists of:
• 5 columns (see Table 1 for a detailed description of the column names)
• 54,107 rows (i.e. individual comments on clinicians by patients with associated binary sentiment
labels of “-1” [negative sentiment] and “1” [positive sentiment])
• 19,097 distinct clinicians are reviewed in the dataset (i.e. many clinicians are the target of more
than one patient review). For this project, the primary task is comment-based classification
(i.e. classifying individual comments for sentiment), but you may experiment with clinician-
level classification, too.
• 72% of the comments are labelled as positive sentiment (38,847) and 28% are labelled as negative
sentiment (15,170)
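Given this class skew, a useful sanity check is the majority-class (Zero-R) baseline, which always predicts positive sentiment. Using the counts quoted above:

```python
# Majority-class (Zero-R) baseline accuracy implied by the label counts
# in the brief: always predict the positive class.
pos, neg = 38847, 15170
baseline_accuracy = pos / (pos + neg)
print(round(baseline_accuracy, 3))
```

Any model you report should comfortably beat this figure on the validation set, and accuracy alone may be misleading on data this imbalanced.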
You will be provided with:
1. A training set of 43,003 clinician reviews [TRAIN.csv]
2. A development (or validation) set of 5,500 clinician reviews [VALIDATION.csv]
Col #  Name                 Description
1      index                Row index value [0 to 54,106]
2      dr-id-adjusted       Clinician ID. Note that there are 19,097 distinct
                            clinicians in the dataset.
3      dr-id-gender         Clinician-ID-level gender information. This should be
                            the gender demographic label you use in your project
                            [0: female; 1: male; 2: unknown].
4      review-text-cleaned  Main comment field (e.g. "Very professional and
                            concerned with your health", "worst doctor ever.
                            Awful!")
5      rating               Target sentiment label. This is a binary label
                            [-1: negative sentiment; 1: positive sentiment].

Table 1: Data description
3. A test set of 5,514 clinician reviews with no target (sentiment) labels, which will be used for
the final evaluation in the Kaggle in-class competition [TEST_NO_LABELS.csv]
Note that your classified test corpus will be submitted to Kaggle for evaluation. In
your report for this assignment, you should present the results of your classifier on the
provided validation corpus.
The format of the data is described in Table 1.
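Test-set predictions are typically uploaded to Kaggle as a small CSV pairing each test row with a predicted label. The header names below are assumptions based on Table 1; verify them against the actual submission format described in Section 7 and on the Kaggle page.

```python
# Sketch: writing test-set predictions to a CSV for Kaggle upload.
# The header names "index" and "rating" are assumptions based on Table 1;
# check the actual required format before submitting.
import csv
import io

test_indices = [48593, 48594, 48595]   # would come from TEST_NO_LABELS.csv
predictions  = [1, -1, 1]              # would come from model.predict(...)

buf = io.StringIO()                    # use open("submission.csv", "w", newline="") for real use
writer = csv.writer(buf)
writer.writerow(["index", "rating"])   # assumed header row
for idx, label in zip(test_indices, predictions):
    writer.writerow([idx, label])

submission = buf.getvalue()
print(submission)
```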