COMP4650/6490 Document Analysis
Assignment
Overview
In this assignment you will:
1. Develop a better understanding of how machine learning models are trained in practice, including
partitioning of datasets, evaluation, and tuning hyper-parameters.
2. Become familiar with the scikit-learn1 and gensim2 packages for machine learning with text.
3. Become familiar with the PyTorch3 framework for implementing neural network-based machine
learning models.
Throughout this assignment you will make changes to the provided code to improve or complete existing
functions. In some cases, you may write your own code to provide functionality similar to the provided
code but using different types of features or datasets.
Submission
• You will produce an answers file with your responses to each question. Your answers file must be
a PDF file named u1234567.pdf where u1234567 should be replaced with your Uni ID.
• The answers to this assignment (including your code files) have to be submitted online in Wattle.
• You should submit a ZIP file containing all of the code files and your answers PDF file, BUT NO
DATA. A simple Python script create_submission_zip.py that can generate a ZIP archive with
all the required files for your submission has been provided.
• No late submission will be permitted without a pre-arranged extension. A mark of 0 will be
awarded if not submitted by the due date.
Marking
This assignment will be marked out of 100, and it will contribute 10% of your final course mark.
Your answers to coding questions (or coding parts of each question) will be marked based on the quality
of your code (is it efficient, is it readable, is it extendable, is it correct) and the solution in general (is it
appropriate, is it reliable, does it demonstrate a suitable level of understanding).
Your answers to discussion questions (or discussion parts of each question) will be marked based on how
convincing your explanations are (are they sufficiently detailed, are they well-reasoned, are they backed
by appropriate evidence, are they clear, do they use appropriate visual aids such as tables, charts, or
diagrams where necessary).
This is an individual assignment. Group work is not permitted. Assignments will be checked for
similarities.
1https://scikit-learn.org/
2https://radimrehurek.com/gensim/intro.html
3https://pytorch.org/
Question 1: Movie Review Sentiment Classification (40%)
For this question you have been provided with a labelled movie review dataset – the same dataset you
explored in Lab 3. The dataset consists of 50,000 review articles written for movies on IMDb, each
labelled with the sentiment of the review – either positive or negative. The overall distribution of labels
is balanced (25,000 pos and 25,000 neg). Your task is to apply logistic regression with both sparse and
dense representations to predict the sentiment label from the review text.
Part A
One simple approach to classifying the sentiment of documents from their text is to train a logistic
regression classifier using TF-IDF features. This approach is relatively straightforward to implement and
can be very hard to beat in practice.
To do this you should first implement the get_features_tfidf function (in features.py) that takes a set
of training sentences as input and computes the TF-IDF (sparse) document vectors. You may want to use
the TfidfVectorizer in the scikit-learn package; read its documentation before using it.
For text preprocessing, you could set the analyzer argument of TfidfVectorizer to the tokenise_text
function provided in features.py. Alternatively, you may set appropriate values for the arguments of
TfidfVectorizer or write your own text preprocessing code.
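As a rough sketch of this step (the function name matches the one described above, but the exact signature in features.py and the preprocessing choices shown are assumptions), get_features_tfidf could be implemented as:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def get_features_tfidf(train_texts, test_texts=None):
    # Fit a TF-IDF vectoriser on the training texts only, then apply
    # the fitted vocabulary to any held-out texts.
    vectoriser = TfidfVectorizer(
        lowercase=True,
        stop_words='english',   # one possible preprocessing choice
        max_features=50000,     # cap the vocabulary size
    )
    x_train = vectoriser.fit_transform(train_texts)  # sparse matrix
    if test_texts is not None:
        return x_train, vectoriser.transform(test_texts)
    return x_train
```

Note that the vectoriser is fitted on the training texts only, so no vocabulary or document-frequency information leaks from the validation or test sets.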
Next, implement the search_C function (in classifier.py) to try several values of the regularisation
parameter C and select the best one based on accuracy on the validation data. The train_model and
eval_model functions provided in the same Python file might be useful for this task. To explore regularisation
parameters, you should use an automatic hyper-parameter search method presented in the lectures.
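One simple automatic search is a grid search over log-spaced candidates. The sketch below assumes a direct scikit-learn implementation rather than the provided train_model/eval_model helpers, whose signatures are not shown here:

```python
from sklearn.linear_model import LogisticRegression

def search_C(x_train, y_train, x_val, y_val):
    # Grid search over log-spaced values of C, keeping the value
    # that achieves the highest validation accuracy.
    best_c, best_acc = None, -1.0
    for c in [0.01, 0.1, 1.0, 10.0, 100.0]:
        model = LogisticRegression(C=c, max_iter=1000)
        model.fit(x_train, y_train)
        acc = model.score(x_val, y_val)   # validation accuracy
        if acc > best_acc:
            best_c, best_acc = c, acc
    return best_c
```

A log scale is the usual choice here because C controls the strength of regularisation multiplicatively, so equally spaced linear values would waste most of the search budget on near-identical models.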
You should then run sentiment_analysis.py, which first reads in the dataset and splits it into training,
validation and test sets; it then trains a logistic regression sentiment classifier and evaluates its performance
on the test set. Make sure you first uncomment the line with the analyse_sentiment_tfidf function
(which uses your get_features_tfidf function to generate TF-IDF features, and your search_C function
to find the best value of C) in the top-level code block of sentiment_analysis.py (i.e., the block after
the line "if __name__ == '__main__':") and then run sentiment_analysis.py.
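For reference, a typical three-way partition of the kind described above can be built from two calls to scikit-learn's train_test_split. This is only a sketch with assumed 80/10/10 proportions; the provided script may partition the data differently:

```python
from sklearn.model_selection import train_test_split

def split_dataset(texts, labels, seed=42):
    # Hold out 20% first, then split that portion half-and-half into
    # validation and test sets, stratifying to keep the labels balanced.
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.2, random_state=seed, stratify=labels)
    x_val, x_test, y_val, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, random_state=seed, stratify=y_rest)
    return x_train, x_val, x_test, y_train, y_val, y_test
```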
Answer the following questions in your answers PDF:
1. What range of values for C did you try? Explain why this range is reasonable. Also explain what
search technique you used and why it is appropriate here.
2. What was the best performing C value?
3. What was your accuracy on the test set?
Part B
Another simple approach to building a sentiment classifier is to train a logistic regression model that uses
aggregated pre-trained word embeddings. While this approach, with simple aggregation, normally works
best with short sequences, you will try it out on the movie reviews.
Your task is to use Word2Vec in the gensim package to learn embeddings of words and predict the
sentiment labels of review text using a logistic regression classifier with the aggregated word embedding
features. Read the Word2Vec documentation before using it. You should tune the
hyper-parameters of Word2Vec (e.g. vector_size, window, negative, alpha, epochs, etc.) as well as the
regularisation parameter C of your logistic regression classifier. You should use an automatic hyper-
parameter search method presented in the lectures. (Hint: the search_C function in classifier.py and
the get_features_w2v function in features.py might be useful.)