COMP4650/6490 Document Analysis
Assignment 2
Overview
In this assignment you will:
1. Develop a better understanding of how machine learning models are trained in practice, including
partitioning of datasets and evaluation.
2. Become familiar with the scikit-learn package for machine learning with text.
3. Become familiar with the PyTorch framework for implementing neural network-based machine
learning models.
Throughout this assignment you will make changes to the provided code to improve or complete existing
models. In some cases, you will write your own code from scratch after reviewing an example.
Submission
• The answers to this assignment (including your code files) must be submitted online via Wattle.
• You will produce an answers file with your responses to each question. Your answers file must be
a PDF file named u1234567.pdf where u1234567 should be replaced with your Uni ID.
• You should submit a ZIP file containing all of the code files and your answers PDF file, BUT NO
DATA.
Marking
This assignment will be marked out of 15, and it will contribute 15% of your final course mark.
Your answers to coding questions (or coding parts of each question) will be marked based on the quality
of your code (is it efficient, is it readable, is it extendable, is it correct).
Your answers to discussion questions (or discussion parts of each question) will be marked based on how
convincing your explanations are (are they sufficiently detailed, are they well-reasoned, are they backed
by appropriate evidence, are they clear, do they use appropriate visual aids such as tables, charts, or
diagrams).
Question 1: Movie Review Sentiment Classification (4 marks)
For this question you have been provided with a movie review dataset. The dataset consists of 50,000
review articles written for movies on IMDb, each labelled with the sentiment of the review – either
positive or negative. Your task is to apply logistic regression with dense word vectors to the movie review
dataset to predict the sentiment label from the review text.
A simple approach to building a sentiment classifier is to train a logistic regression model that uses
aggregated pre-trained word embeddings. While this approach, with simple aggregation, normally works
best with short sequences, you will try it out on the movie reviews.
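As a minimal sketch of the aggregation idea, the following toy example mean-pools pre-trained word vectors into a single document vector. The `embeddings` dictionary and its contents are hypothetical stand-ins for the provided embedding dictionary (en_core_web_md vectors are 300-dimensional); how you handle out-of-vocabulary words is up to you — skipping them, as done here, is only one option.

```python
import numpy as np

# Hypothetical embedding dictionary: token -> 300-d vector.
# (Stand-in for the filtered en_core_web_md embeddings you are given.)
embeddings = {
    "great": np.full(300, 0.5),
    "movie": np.full(300, -0.2),
}

def aggregate(tokens, embeddings):
    """Mean-pool the embeddings of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:  # document had no known tokens: fall back to a zero vector
        return np.zeros(300)
    return np.mean(vectors, axis=0)

doc_vector = aggregate(["a", "great", "movie"], embeddings)
print(doc_vector.shape)  # (300,)
```

You could swap `np.mean` for `np.max` (or another aggregator) to experiment with alternatives, as suggested below.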
You have been provided with a Python file dense_linear_classifier.py which reads in the dataset and
splits it into training, testing, and validation sets; and then loads the pre-trained word embeddings. These
embeddings were extracted from the spacy-2.3.5 Python package’s en_core_web_md model and, to save
disk space, were filtered to only include words that occur in the movie reviews.
Your task is to use a logistic regression classifier with aggregated word embedding features to determine
the sentiment labels of documents from their text. First implement the document_to_vector function
which converts a document into a vector by first tokenising it (the TreebankWordTokenizer in the nltk
package would be an excellent choice) and then aggregating the word embeddings of those words that exist
in the dense word embedding dictionary. You will have to work out how to handle words that are missing
from the dictionary. For aggregation, the mean is recommended but you could also try other functions
such as max. Next, implement the fit_model and test_model functions using your document_to_vector
function and LogisticRegression from the scikit-learn package. Using fit_model, test_model, and
your training and validation sets you should then try several values for the regularisation parameter C
and select the best based on accuracy. To try regularisation parameters, you should use an automatic
hyperparameter search method. Next, re-train your classifier using the training set concatenated with the
validation set and your best C value. Evaluate the performance of your model on the test set.
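One possible automatic search over C that respects a fixed train/validation split (rather than cross-validation) is scikit-learn's GridSearchCV combined with PredefinedSplit. The sketch below uses random toy features in place of your aggregated document vectors; the grid of C values shown is an assumption, not a prescribed range.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Toy feature matrices standing in for aggregated document vectors
# (in the assignment these would come from your document_to_vector function).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(80, 10)), rng.integers(0, 2, 80)
X_val, y_val = rng.normal(size=(20, 10)), rng.integers(0, 2, 20)

# PredefinedSplit makes GridSearchCV score each C on the fixed validation
# set: -1 marks training rows, 0 marks validation rows.
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])
split = PredefinedSplit(test_fold=[-1] * len(X_train) + [0] * len(X_val))

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": np.logspace(-3, 3, 7)},  # assumed log-spaced grid
    cv=split,
    scoring="accuracy",
)
search.fit(X_all, y_all)
print(search.best_params_["C"], search.best_score_)
```

Note that with the default `refit=True`, GridSearchCV refits the best model on all the data it was given (here, train plus validation), which matches the re-training step described above.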
Answer the following questions in your answer PDF:
1. What range of values for C did you try? Explain why this range is reasonable. Also explain what
search technique you used and why it is appropriate here.
2. What was the best performing C value?
3. What was your final accuracy?
Also make sure you submit your code.
Hint: If you do the practical exercise in Lab 3, this question will be much easier.
Tip: If you use TreebankWordTokenizer then for efficiency you should instantiate the class as a global
variable. The TreebankWordTokenizer compiles many regular expressions when it is initialised; doing
this every time you want to tokenise a sentence is very inefficient. For more details see the documentation
for TreebankWordTokenizer.
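The tip above can be sketched as follows — a module-level tokenizer instance that every call reuses, so the regular expressions are compiled only once (the `tokenise` helper name is illustrative, not part of the provided code):

```python
from nltk.tokenize import TreebankWordTokenizer

# Instantiate once at module level: __init__ compiles many regular
# expressions, so constructing a fresh tokenizer per call is wasteful.
TOKENIZER = TreebankWordTokenizer()

def tokenise(text):
    """Tokenise a string using the shared module-level tokenizer."""
    return TOKENIZER.tokenize(text)

print(tokenise("A great movie!"))  # e.g. ['A', 'great', 'movie', '!']
```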