COMP4650/6490 Document Analysis
Overview
For this assignment, you will implement several common NLP algorithms in Python: probabilistic context-free grammar learning, dependency parsing, and relation extraction.
Throughout this assignment you will make changes to the provided code to improve or complete existing
models.
Submission
• The answers to this assignment (including your code files) have to be submitted online in Wattle.
• You will produce an answers file with your responses to all questions. Your answers file must be a
PDF file named u1234567.pdf where u1234567 should be replaced with your Uni ID.
• Questions 1 and 3 also require you to submit an output file in a format specified by the question.
• You should submit a ZIP file containing all the code files, your answers PDF file, and these requested
output files BUT NO OTHER DATA.
Marking
This assignment will be marked out of 15, and it will contribute 15% of your final course mark.
Your answers to coding questions (or coding parts of each question) will be marked based on the quality
of your code (is it efficient, is it readable, is it extendable, is it correct). The output files requested will be
used as part of the marking for the coding questions.
Your answers to discussion questions (or discussion parts of each question) will be marked based on how
convincing your explanations are (are they sufficiently detailed, are they well-reasoned, are they backed
by appropriate evidence, are they clear).
This is an individual assignment. Group work is not permitted. Assignments will be checked for
similarities.
Question 1: Probabilistic Context Free Grammars (3 marks)
For this question you will be modifying the starter code in syntax_parser.py. You have been provided with a training dataset containing 73,638 English-language sentences, which have been parsed into constituency trees (in parsed_sents_list.json).
For example, given the sentence “A large elephant scratching”, the parse tree is:
[parse tree figure omitted]
For this structure the parsed sentence in parsed_sents_list.json is
["S", ["NP", ["DT", "a"], ["JJ", "large"], ["NN", "elephant"]], ["VP", ["VBG", "scratching"]]].
Note that parsed_sents_list.json contains a list of parses, each for a different sentence.
Your task is to estimate a probabilistic context free grammar (PCFG) given this dataset. You should
estimate the conditional probabilities of the rules in this grammar (including all rules that express facts
about the lexicon) using the maximum-likelihood approach given in lectures. Some examples of rules in
this grammar are:
S → NP
VBN → parked
ADVP → IN NP
NP → NP , NP
Note that some terminals and non-terminals are punctuation characters. There are many more rules
than those listed here and you will first have to identify all the rules. All terminals are lower case or
punctuation.
When your syntax_parser.py code is run, it should write a list of all the grammar rules and their conditional probabilities to a file called “q1.json”. There is a class called RuleWriter in syntax_parser.py which will write the rules in the correct format for the automated judging system (do not modify this class). The expected format is (an example of the format only, not actual answers):
[["NP", ["PP", "NP"], 0.8731919455731331], ["NNS", ["locations"], 0.062349802638746935], ...]
Hint: if you create a test input file containing only
[["A", ["B", ["C", "blue"]], ["B", "cat"]]]
then your program should output (note that the order in which the rules are listed does not matter):
[["A", ["B", "B"], 1.0], ["B", ["C"], 0.5], ["B", ["cat"], 0.5], ["C", ["blue"], 1.0]]
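As a starting point, here is a minimal sketch of the maximum-likelihood estimation, assuming the file layout described above; the names count_rules and lhs_totals are illustrative and not part of the starter code, and the final rules list would be written out with the provided RuleWriter rather than by hand:

    import json
    from collections import Counter, defaultdict

    def count_rules(tree, counts):
        # tree is [symbol, child_1, ..., child_k]; a bare string child is a terminal
        lhs, rhs = tree[0], []
        for child in tree[1:]:
            if isinstance(child, list):
                rhs.append(child[0])          # non-terminal on the right-hand side
                count_rules(child, counts)    # recurse into the sub-tree
            else:
                rhs.append(child)             # terminal symbol
        counts[(lhs, tuple(rhs))] += 1

    counts = Counter()
    with open("parsed_sents_list.json") as f:
        for tree in json.load(f):
            count_rules(tree, counts)

    # Maximum likelihood: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs -> anything)
    lhs_totals = defaultdict(int)
    for (lhs, rhs), c in counts.items():
        lhs_totals[lhs] += c
    rules = [[lhs, list(rhs), c / lhs_totals[lhs]] for (lhs, rhs), c in counts.items()]

Running this on the test input above reproduces the four rules and probabilities shown.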
What to submit
Make sure to submit “q1.json” for marking. Also make sure to submit your syntax_parser.py.
Question 2: Dependency Parsing (6 marks)
Complete both Part A and Part B of this question.
Part A
The file dependency_parser.py contains a partial implementation of Nivre’s arc-eager dependency parsing algorithm. Your first task is to implement the missing left_arc method, reduce method, right_arc method, and parse function. Your implementation should follow the transitions as given in the lecture slides (some additional instructions are provided as comments in dependency_parser.py).
Make sure that your class methods and the parse function return True if the operation was successfully applied, and False otherwise. The parse function should return True if it generates a dependency forest. Your implementation must make use of the DepContext class, which stores the current state of the parser. Do not modify the DepContext class, or the names and parameters of the methods and functions you have been asked to implement, because Part A of this question will be automatically marked.
The auto-marker will inspect the internal state of DepContext to confirm that operations have been
applied correctly. You should not use any additional imports for this question (except from Python
standard libraries). The judging system will use Python 3.9.13 in a sandboxed Linux environment.
Hint: As mentioned in lectures, we need to make sure that words are uniquely represented (in case there are multiple copies of the same word); to do this we use the position of the word in the sequence, with the [ROOT] token prepended to the sequence. This is handled by DepContext.
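For reference, a minimal sketch of the four arc-eager transitions and their preconditions is given below. The state object used here (with stack, buffer, and arcs attributes, and ROOT at position 0) is a stand-in for illustration only; your implementation must work through the provided DepContext interface instead:

    def has_head(state, w):
        # True if some arc already assigns a head to position w
        return any(d == w for (_, d) in state.arcs)

    def left_arc(state):
        # stack top becomes a dependent of the buffer front, then is popped
        if not state.stack or not state.buffer:
            return False
        s = state.stack[-1]
        if s == 0 or has_head(state, s):      # ROOT may not receive a head
            return False
        state.arcs.add((state.buffer[0], s))
        state.stack.pop()
        return True

    def right_arc(state):
        # buffer front becomes a dependent of the stack top, then is shifted
        if not state.stack or not state.buffer:
            return False
        b = state.buffer.pop(0)
        state.arcs.add((state.stack[-1], b))
        state.stack.append(b)
        return True

    def reduce_(state):
        # pop the stack top; only allowed once it has been assigned a head
        if not state.stack or not has_head(state, state.stack[-1]):
            return False
        state.stack.pop()
        return True

    def shift(state):
        # move the buffer front onto the stack
        if not state.buffer:
            return False
        state.stack.append(state.buffer.pop(0))
        return True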
What to submit
In your answers PDF file include the output of dependency_parser.py (it runs a small number of test cases and prints them to the terminal). Make sure to submit your dependency_parser.py.
Part B
Your second task is to implement the standard oracle for arc-eager dependency parsing in oracle_parser.py (specifically, you should implement the parse_gold function). Make use of the parsing class you developed in Part A of this question; do not modify your parsing class from Part A. All code for this second part should be put in oracle_parser.py.
The parsing class from Part A has been imported. Additional imports are not permitted (except from the Python standard library). You may add helper functions in oracle_parser.py.
The oracle you need to implement determines which action should be performed in each state to reconstruct the ground-truth dependency parse tree. It returns the list of actions and the states in which they occur, though in this case you should return only features of each state (using the provided extract_features function; do not modify the feature extraction code). The details of this algorithm were not covered in lectures; instead, you should implement Algorithm 1 from this paper: https://aclanthology.org/C12-1059.pdf (while the paper covers several oracle methods, including a dynamic oracle, you should only implement Algorithm 1 on page 963). You should implement unlabelled dependency parsing, so you will need to modify this algorithm to work without labels. To implement Algorithm 1 you will need to make use of the parsing class you developed in Part A of this question.
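As an unlabelled sketch of Algorithm 1, the choice of action in a given state can be written as follows, where s is the stack-top position, b is the buffer-front position, and gold_arcs is a set of (head, dependent) position pairs; the function name oracle_action is illustrative only:

    def oracle_action(s, b, gold_arcs):
        # Standard oracle (Goldberg & Nivre 2012, Algorithm 1), labels dropped
        if (b, s) in gold_arcs:
            return "left_arc"     # s is a gold dependent of b
        if (s, b) in gold_arcs:
            return "right_arc"    # b is a gold dependent of s
        # if b has a gold head or dependent strictly to the left of s, then
        # s blocks that arc and must be reduced first
        if any((k, b) in gold_arcs or (b, k) in gold_arcs for k in range(s)):
            return "reduce"
        return "shift"

Your parse_gold function would repeatedly query this decision, record the extracted features and the chosen action, and apply the corresponding transition using your Part A parser.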
Once you have implemented the parse_gold function, run oracle_parser.py and it will determine the actions that could generate the ground-truth dependency parse trees in gold_parse.json under the rule set in rule_set.json. It will print the first 3 cases where the oracle could reproduce the ground truth, the first 3 cases where it could not, and, after a short wait, the total numbers of failure and success cases.
What to submit
In your answers PDF file include the output of oracle_parser.py and explain why it failed in the three printed cases. Make sure to submit your oracle_parser.py.
Question 3: Relation Extraction (6 marks)
Your task is to design and implement a relation extraction model along with its training and testing pipeline.
Specifically, your model should extract the relation ‘nationality’ (denoted /people/person/nationality
in our dataset) which holds between a person and a geopolitical entity (GPE). For example,
/people/person/nationality(Yann LeCun, France),
/people/person/nationality(Julie Bishop, Australia).
The training data is provided in sents_parsed_train.json; the test data is in sents_parsed_test.json. A file containing a list of country names (country_names.txt) is also provided for you to use. The test data does not include the ground-truth set of relations; your program must extract them. Some starter code has been provided in rel_extract.py which shows you how to read in the data, access its various fields, and write a valid output file.
It is very important that your relation extraction algorithm work for entities not seen during training. To clarify, this means that your model should be designed so that it can identify relations such as /people/person/nationality(Julie Bishop, Australia) even if the “Julie Bishop” and “Australia” entities are not in the training data. None of the (PERSON, GPE) pairs that constitute a nationality relation occur in both the training dataset and the test dataset.
The focus of this question is to develop a good training/validation pipeline and to explore different lexical, syntactic, or semantic features. You should use the “feature-based supervised relation classifiers” technique introduced in the Speech and Language Processing textbook (3rd ed., Jan 12, 2022 draft), Section 17.2.2 and Figure 17.6. You do not have to implement all the suggested features, but an attempt to implement a variety of them is expected. You will likely need to do substantial pre-processing of the input data, as well as some post-processing of the output list, to get a good result. You should use LogisticRegression from scikit-learn as the classifier. Neural networks and vector features generated by neural networks are not to be used for this question. You may use spaCy (https://spacy.io/) to extract additional features from your text; however, several features are already provided for you (see Description of the data below). It is possible to get full marks using only the provided features; however, they will likely need some processing before they can be used appropriately as part of a classification pipeline. You are permitted to use the Python standard library, numpy, pandas, scikit-learn, nltk, and spaCy. You should not use other libraries, datasets, or ontological resources.
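A minimal sketch of such a pipeline is shown below. The featurise helper and the way candidate (PERSON, GPE) pairs and labels are generated are assumptions you would replace with your own pre-processing; it illustrates the technique rather than the expected solution:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def featurise(example, person, gpe):
        # Build a feature dict for one candidate (PERSON, GPE) pair,
        # in the style of SLP3 Figure 17.6
        toks = example["tokens"]
        if person["end"] <= gpe["start"]:          # person mention comes first
            between = toks[person["end"]:gpe["start"]]
        else:                                      # GPE mention comes first
            between = toks[gpe["end"]:person["start"]]
        feats = {"num_between": float(len(between)),
                 "person_first": person["start"] < gpe["start"]}
        for w in between:
            feats["bow_between=" + w.lower()] = 1.0
        return feats

    # feature_dicts: one dict per candidate pair; labels: 1 iff the pair is a
    # /people/person/nationality relation (negatives come from other relations)
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    # model.fit(feature_dicts, labels)
    # predictions = model.predict(test_feature_dicts)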
The final goal is to extract all relations of the type nationality that exist in the test data, not simply to identify whether a relation type occurs in each sentence. Your output should be a list of all the relations in the test data, in CSV format (you should name it “q3.csv”) – the starter code provides a function to write this file in the correct format for the automated judging system. The order of lines in this CSV file does not matter, and duplicate lines will be removed before scoring. An example judging script eval_output.py is provided for you. Since you have not been given the ground truth for the test data, you can only run this script using the training data. To do a sanity check before submitting:
(1) first run your rel_extract.py using the training data;
(2) evaluate your model performance on the training set:
python eval_output.py --trainset
(3) finally, to make sure “q3.csv” is the output from the test data, run rel_extract.py again using the test data (this overwrites “q3.csv” with the relations extracted from the test set).
Note that part of the mark for this question will be based on the F1 score of your “q3.csv” file on the
held-out ground truth for the test data (evaluated by the examiner using the automated judging system).
The rest will be based on an examination of your code and answers to the questions below.
What to submit
Answer the following questions in your answers PDF file (please answer in plain English, dot points are
preferred, do not include code snippets or screenshots of code as these will not be marked):
(a) What pre-processing did you do?
(b) How did you split the data into training and validation sets?
(c) How did you choose hyper-parameters?
(d) What features did you choose and why did you choose these features?
(e) What post-processing did you do?
Submit your code. Submit your output file named “q3.csv” (make sure this is the output from the test
data). Submit your answers to the questions.
Description of the data
The data you have been given is described in detail below. Briefly, it is a set of sentences, each of which contains a relation between two entities. You may use the relations that are not the target relation /people/person/nationality to generate negative examples on which to train your model.
sents_parsed_train.json and sents_parsed_test.json each contain a list of dictionaries. Each dictionary corresponds to a different example. Each example dictionary has the following fields:
• tokens: a tokenised sentence that contains a relation
• entities: a list of the named entities in the sentence
• relation: a description of the relation in the sentence
Each entity is a dictionary with the following fields:
• start: the token index where the entity starts
• end: the token index where the entity ends
• label: the entity type (PERSON = a person, GPE = a geopolitical entity; there are several other types)
Each relation is a dictionary with the following fields (note that the relation field has been removed from sents_parsed_test.json):
• a: the first entity that is part of the relationship (tokenised into a list)
• b: the second entity that is part of the relationship (tokenised into a list)
• a_start: the token index where the first entity starts
• b_start: the token index where the second entity starts
• relation: the type of relation; note we are interested in the /people/person/nationality relation for this assignment.
There are also some additional features that may or may not be helpful (each of the features below is a
list, the array indices match to those of the tokenised sentence):
• lemma: the tokenised text that has been lemmatised.
• pos: part of speech tags for each word
• dep: the dependency relationships
• dep_head: the head of each dependency relationship (allowing you to reconstruct the full dependency tree)
• shape: shape features for each word. This preserves things such as word length and capitalisation
while removing much of the content of the word.
• isalpha: whether each word consists only of alphabetic characters
• isstop: whether each word is in the spaCy stop-word list (and is thus likely to be common)
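For illustration, reading and inspecting one example might look like the sketch below (the starter code in rel_extract.py demonstrates this as well); note that whether end is an exclusive index, as in spaCy spans, should be verified against the data:

    import json

    with open("sents_parsed_train.json") as f:
        examples = json.load(f)

    ex = examples[0]
    print(ex["tokens"])                      # the tokenised sentence
    for ent in ex["entities"]:
        # assumes end is an exclusive index, as in spaCy spans
        print(ent["label"], ex["tokens"][ent["start"]:ent["end"]])
    print(ex["relation"]["relation"])        # e.g. /people/person/nationality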
All features were extracted using the spaCy 2.3.5 en_core_web_md model. The data you have been provided was constructed automatically by matching ground-truth relations to text. For the training data, any sentence that contained both of the named entities participating in a ground-truth relation was marked as containing that relation. No manual checks were done to ensure accuracy. You might like to consider how this will affect the data quality.