Assignment 2 – ML
In this assignment, you will build several text classification models on a provided movie review
dataset. The dataset consists of 50,000 reviews of movies written on IMDb, each labelled
with the sentiment of the review – either positive or negative. In addition, there are another 50,000
reviews without labels.
In this assignment you will:
1. Become familiar with the PyTorch framework for implementing neural network-based
machine learning models.
2. Develop a better understanding of how machine learning models are trained in practice,
including partitioning of datasets and evaluation.
3. Study how changing the values of various model hyper-parameters impacts model
performance.
Throughout this assignment you will make changes to the provided code to improve the models.
In addition, you will produce an answers file with your responses to each question. Your answers
file must be a .pdf file named u1234567.pdf where u1234567 is your Uni ID. You should submit a
.zip file containing all of the code files and your answers pdf file, BUT NO DATA.
Your answers to coding questions will be marked based on the quality of your code (is it efficient,
is it readable, is it extendable, is it correct).
Your answers to discussion questions will be marked based on how convincing your
explanations are (are they sufficiently detailed, are they well-reasoned, are they backed by
appropriate evidence, are they clear).
Question 1 – A simple linear classifier (15%)
You have been provided with various code files to create, train and evaluate models on the given
dataset. To begin with, read through and then run train_vector_classifier.py. For the
vector classifier, the documents are represented as tf-idf vectors and a simple linear classifier
(LogisticRegressor) is trained to predict classes.
Your first task is to change the various settings in this file to try and improve the validation accuracy
of the model. The options you can change are:
1. The pre-processor in preprocessor.py. You can re-use your pre-processor from assignment 1,
or develop a new one.
2. The get_vector_representation function in data_loader.py. This function specifies how lists
of tokens will be transformed into tf-idf vectors. You can find documentation for the
CountVectorizer parameters here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
3. The settings of the LogisticRegressor model in train_vector_classifier.py. You can find
documentation for the available parameters here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
Pick the three changes you made that you think caused the largest increase in performance, and for each
one briefly describe what effect the change has on your model and why you believe it led to the
increased performance.
Leave your code file with the best performing settings.
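As a rough illustration of the kind of settings involved, the sketch below builds a tf-idf plus logistic regression pipeline directly in scikit-learn. It is not the provided code: the toy reviews, the identity helper and the specific parameter values are assumptions made up for this example, and the provided get_vector_representation function and LogisticRegressor model may be organised differently.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def identity(tokens):
    """Pass pre-tokenised documents straight through to the vectorizer."""
    return tokens


# Made-up token lists standing in for the output of the pre-processor.
train_docs = [
    ["great", "fun", "film"],
    ["great", "acting"],
    ["loved", "this", "film"],
    ["terrible", "boring", "film"],
    ["boring", "plot"],
    ["terrible", "acting"],
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative
val_docs = [["great", "fun"], ["terrible", "boring"]]
val_labels = [1, 0]

model = make_pipeline(
    # min_df / max_df prune very rare or very common tokens (illustrative values).
    CountVectorizer(analyzer=identity, min_df=1, max_df=1.0),
    # sublinear_tf replaces raw counts tf with 1 + log(tf).
    TfidfTransformer(sublinear_tf=True),
    # C is the inverse regularisation strength; larger C means weaker regularisation.
    LogisticRegression(C=10.0, max_iter=1000),
)
model.fit(train_docs, train_labels)

val_predictions = model.predict(val_docs)
val_accuracy = model.score(val_docs, val_labels)
print(f"validation accuracy: {val_accuracy:.2f}")
```

Changing one setting at a time and re-checking the validation accuracy makes it easier to attribute an improvement to a specific change.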
Question 2 – Embedding based classifier (15%)
Next you will implement an embedding-based classifier in PyTorch. The model is a simple variation
on FastText (if you are interested, you can read more about it here:
https://arxiv.org/pdf/1607.01759.pdf). In models.py there is a class FastText; your task is to
implement the forward function of this class. Once you have implemented it, you can run
train_embedding_classifier.py to train and evaluate your model.
For each input sentence, the model computes its output in 3 steps:
1. Embed the input IDs to create vector representations of each word in the sentence.
2. Aggregate the vectors together to create one vector that represents the entire sentence.
You can use any aggregation method you want, e.g. sum, max, average (note that average is
significantly more difficult to implement due to padding), etc.
3. Run the aggregated vector through a linear layer to compute predicted logits for each class.
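The three steps above can be sketched as a small PyTorch module. This is a minimal sketch under stated assumptions, not the provided FastText class: the class name BagOfEmbeddings, the constructor arguments and the choice of 0 as the padding ID are all hypothetical. It uses a masked average to show one way of handling padding.

```python
import torch
import torch.nn as nn


class BagOfEmbeddings(nn.Module):
    """Hypothetical embed -> aggregate -> linear classifier (names are assumptions)."""

    def __init__(self, vocab_size, embed_dim, num_classes, pad_id=0):
        super().__init__()
        self.pad_id = pad_id  # assumed padding ID; the provided code may differ
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.linear = nn.Linear(embed_dim, num_classes)

    def forward(self, input_ids):
        # Step 1: (batch, seq_len) -> (batch, seq_len, embed_dim)
        embedded = self.embedding(input_ids)
        # Step 2: average over real tokens only, ignoring padding positions.
        mask = (input_ids != self.pad_id).unsqueeze(-1).to(embedded.dtype)
        summed = (embedded * mask).sum(dim=1)
        lengths = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
        pooled = summed / lengths
        # Step 3: (batch, embed_dim) -> (batch, num_classes) logits
        return self.linear(pooled)


torch.manual_seed(0)
model = BagOfEmbeddings(vocab_size=10, embed_dim=4, num_classes=2)
batch = torch.tensor([[3, 5, 0, 0], [1, 2, 4, 6]])  # first row is padded
logits = model(batch)
unpadded_logits = model(torch.tensor([[3, 5]]))
print(logits.shape)
```

Because padding positions are masked out, the padded sentence [3, 5, 0, 0] and the unpadded sentence [3, 5] produce the same logits. A plain mean over the sequence dimension would instead dilute short sentences with padding vectors.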
Question 3 – Tuning a PyTorch model (15%)
As in Question 1, change the various settings of your FastText model to try to improve its
validation accuracy. Here are some things you may want to change:
1. The aggregation method used by FastText.
2. The dimension of the embedding vectors.
3. The optimizer, including learning rate, momentum or even algorithm.
4. The number of epochs trained for.
5. The batch size.
6. The preprocessor.
In your answers, report five different settings you tried, along with their accuracies. You may
optionally include some plots of losses over time. For each setting, briefly describe why you believe
it led to increased or decreased performance.
Leave your code file with the best performing settings.
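One way to keep such experiments organised is to wrap a training run in a function that takes the settings as arguments and returns the per-epoch losses. The sketch below is an assumption-laden illustration, not the provided training script: make_optimizer, run_setting, the synthetic data and the tiny linear model are all hypothetical stand-ins for train_embedding_classifier.py and the FastText model.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def make_optimizer(name, params, lr, momentum=0.0):
    """Build an optimizer from a short name (hypothetical helper)."""
    if name == "sgd":
        return torch.optim.SGD(params, lr=lr, momentum=momentum)
    if name == "adam":
        return torch.optim.Adam(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")


def run_setting(optimizer_name, lr, batch_size, epochs, momentum=0.0, seed=0):
    """Train a toy model with the given settings; return the mean loss per epoch."""
    torch.manual_seed(seed)
    # Synthetic, linearly separable data standing in for the real dataset.
    features = torch.randn(64, 8)
    labels = (features[:, 0] > 0).long()
    loader = DataLoader(TensorDataset(features, labels),
                        batch_size=batch_size, shuffle=True)
    model = nn.Linear(8, 2)
    optimizer = make_optimizer(optimizer_name, model.parameters(), lr, momentum)
    loss_fn = nn.CrossEntropyLoss()

    epoch_losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            total += loss.item() * x.size(0)
        epoch_losses.append(total / len(features))
    return epoch_losses


# Illustrative settings only; the values are not recommendations.
settings = {
    "sgd_lr0.1": dict(optimizer_name="sgd", lr=0.1, batch_size=16, epochs=5, momentum=0.9),
    "adam_lr0.01": dict(optimizer_name="adam", lr=0.01, batch_size=32, epochs=5),
}
results = {name: run_setting(**kwargs) for name, kwargs in settings.items()}
for name, losses in results.items():
    print(name, [round(loss, 3) for loss in losses])
```

Fixing the random seed for each run keeps the comparison between settings fair, and the returned loss lists can be passed straight to a plotting library for the optional loss plots.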