*This sample exam paper contains one or two exam questions for each weekly topic and is
provided to show the structure and style of the final exam, including which questions may be
worth 2 marks and which 5 marks. You are not expected to finish it in 1 hour, unlike the final
exam.
*The final exam will be an open-book, unsupervised exam.
Week 1. Count-based Word Representation
Q1. Calculate the TFIDF scores for the terms listed below in documents 1 to 4. There are
10,000 documents in the collection. The number of times each term occurs in documents 1
to 4, as well as the number of documents in the collection containing each term, is given in
the following tables. Use this information to calculate the TFIDF scores. (2 marks)
Number of documents containing terms:
● machine: 3
● university: 50
● headphone: 10
● perfume: 3
Term frequency (the number of times each of the 4 terms occurs in documents 1 to 4):

             Doc1   Doc2   Doc3   Doc4
machine         8     10      0      0
university      3      1      2      2
headphone       0      0      8      7
perfume         2      2      2      9
Solution
IDF calculation (IDF = ln(N/df), using the natural logarithm)
● machine: df = 3, IDF = ln(10000/3) ≈ 8.11
● university: df = 50, IDF = ln(10000/50) ≈ 5.30
● headphone: df = 10, IDF = ln(10000/10) ≈ 6.91
● perfume: df = 3, IDF = ln(10000/3) ≈ 8.11
TFIDF calculation (TFIDF = TF * IDF)

             Doc1               Doc2                Doc3               Doc4
machine      8.11 * 8 = 64.88   8.11 * 10 = 81.10   0                  0
university   5.30 * 3 = 15.90   5.30 * 1 = 5.30     5.30 * 2 = 10.60   5.30 * 2 = 10.60
headphone    0                  0                   6.91 * 8 = 55.28   6.91 * 7 = 48.37
perfume      8.11 * 2 = 16.22   8.11 * 2 = 16.22    8.11 * 2 = 16.22   8.11 * 9 = 72.99
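For reference, a minimal Python sketch of the same calculation (raw term counts as TF, natural-log IDF), reproducing the table above:

```python
import math

# Minimal sketch of the TFIDF computation above: TF = raw term count,
# IDF = ln(N / df), TFIDF = TF * IDF.
N = 10_000                                                  # documents in the collection
df = {"machine": 3, "university": 50, "headphone": 10, "perfume": 3}
tf = {                                                      # term counts in Doc1..Doc4
    "machine":    [8, 10, 0, 0],
    "university": [3, 1, 2, 2],
    "headphone":  [0, 0, 8, 7],
    "perfume":    [2, 2, 2, 9],
}

for term, counts in tf.items():
    idf = math.log(N / df[term])
    print(f"{term:10s}", [round(c * idf, 2) for c in counts])
```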
Week 2. Word Embeddings and Representation
Q2. Illustrate an advantage of FastText over Word2Vec. Give application examples to
support your argument. (2 marks)
Solution
The main advantage of FastText embeddings over Word2Vec is that they take into account
the internal structure of words while learning word representations, which can be very useful
for morphologically rich languages and for words that occur rarely. For example, for a word
that appears in various forms such as “teaching”, “teacher” and “teached”, the internal
structure shared by these forms can be learned and represented using the character n-gram
based FastText embeddings, and a vector can still be built for rare or unseen forms.
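To illustrate the subword idea, here is a minimal Python sketch of the character n-gram decomposition FastText is based on (the n-gram lengths are illustrative; real FastText uses its own defaults and hashes the n-grams):

```python
# Minimal sketch of FastText's subword idea: represent a word by its character
# n-grams (with boundary markers), so "teaching", "teacher" and even a rare or
# misspelled form like "teached" share most of their subword units.
def char_ngrams(word, n_min=3, n_max=5):
    marked = f"<{word}>"
    return {marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)}

a, b = char_ngrams("teaching"), char_ngrams("teached")
print(sorted(a & b))   # shared n-grams such as '<te', 'each', 'teach' drive similar vectors
```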
Q3. Illustrate 2 examples of how we can evaluate word vectors. For each example,
please indicate whether it is intrinsic or extrinsic. (2 marks)
Solution
● Intrinsic: word vector analogies; word vector distances and their correlation with
human judgments; word clustering and categorization.
● Extrinsic: named entity recognition (finding a person, location, organization and so
on); various text classification tasks such as the sentiment analysis we did in Assignment
1 (identifying and extracting subjective information in the source text or document).
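For illustration, a small gensim sketch of the intrinsic analogy evaluation (this assumes gensim is installed and can download the pretrained "glove-wiki-gigaword-50" vectors):

```python
import gensim.downloader as api

# Minimal sketch of an intrinsic evaluation: word vector analogies
# ("man" is to "king" as "woman" is to ?). Assumes gensim and an internet
# connection to fetch the pretrained GloVe vectors.
wv = api.load("glove-wiki-gigaword-50")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# a good embedding space should rank "queen" near the top
```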
Week 3 and 4. Word Classification with Machine Learning
Q4. In class, we learned that the family of recurrent neural networks has many important
advantages and can be used in a variety of NLP tasks. For each of the following tasks and
inputs, state how you would run an RNN to do that task. (5 marks)
1. How many outputs ŷ(t) there are, i.e. the number of times the softmax is called from
your RNN. If the number of outputs is not fixed, state it as arbitrary.
2. What each ŷ(t) is a probability distribution over.
3. Which inputs are fed at each time step to produce each output.
Task A: Named-Entity Recognition: For each word in a sentence, classify that word as
either a person, organization, location, or none. (Inputs: A sentence containing n words)
Task B: Sentiment Analysis: Classify the sentiment of a sentence ranging from negative to
positive (integer values from 0 to 4). (Inputs: A sentence containing n words.)
Solution
Task A: Named Entity Recognition
1. Number of Outputs: n outputs
2. Each ŷ(t) is a probability distribution over the 4 NER categories.
3. Each word in the sentence is fed into the RNN and one output is produced at every
time step corresponding to the predicted tag/category for each word.
Task B: Sentiment Analysis
1. Number of Outputs: 1 output. (n outputs is also acceptable if the model averages all
outputs.)
2. Each ŷ(t) is a probability distribution over the 5 sentiment values.
3. Each word in the sentence is fed into the RNN and one output is produced from the
hidden states (by either taking only the final, max, or mean across all states)
corresponding to the sentiment value of the sentence.
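For illustration, a minimal PyTorch sketch contrasting the two set-ups (the library choice and all sizes here are assumptions, not part of the question):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: vocab of 1000 words, 50-dim embeddings, 64-dim hidden state.
vocab_size, emb_dim, hidden_dim = 1000, 50, 64
embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)

sentence = torch.randint(0, vocab_size, (1, 7))   # a batch with one 7-word sentence
states, last_hidden = rnn(embed(sentence))        # states: (1, 7, 64), last_hidden: (1, 1, 64)

# Task A (NER): one softmax per time step over the 4 tags -> n outputs.
ner_head = nn.Linear(hidden_dim, 4)
ner_probs = torch.softmax(ner_head(states), dim=-1)             # shape (1, 7, 4)

# Task B (sentiment): a single softmax over 5 classes from the final hidden state -> 1 output.
sent_head = nn.Linear(hidden_dim, 5)
sent_probs = torch.softmax(sent_head(last_hidden[-1]), dim=-1)  # shape (1, 5)
```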
Q5. Assume that you build a sentiment analysis system that feeds a sentence into a
RNN, and then computes the sentiment class between 0 (very negative) and 4 (very
positive), based only on the final hidden state of the RNN. Illustrate one advantage that
an RNN would have over a neural window-based model for this task. (2 marks)
Solution
There are multiple acceptable answers: an RNN can process arbitrary-length inputs; it can
encode temporal information (“takes ordering into consideration” is only partially correct,
because theoretically window-based models can too, although with difficulty); it shares
weights and therefore has fewer parameters. The number of parameters in a neural
window-based network grows in proportion to the input (window) size, whereas it stays
constant for an RNN since the weights are shared at every time step.
Week 5. Language Fundamental
Q6. Describe the difference between lemmatization and stemming. Give application
examples to support your argument. (2 marks)
Solution
Stemming is a procedure to reduce all words with the same stem to a common form whereas
lemmatization removes inflectional endings and returns the base or dictionary form of a
word. For example, words “trouble”, “troubling” and “troubled” may be stemmed to be
“troubl” (not a valid English word) but will be lemmatized to be “trouble” for comparison.
Also, another good example for lemmatization would be words “was”, “is” to be mapped to
“be”.
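For illustration, a small NLTK sketch of the same examples (assuming nltk is installed and the WordNet data has been downloaded, e.g. via nltk.download('wordnet')):

```python
# Minimal sketch comparing a stemmer and a lemmatizer on the examples above.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

for w in ["trouble", "troubling", "troubled"]:
    print(w, "->", stemmer.stem(w), "/", lemmatizer.lemmatize(w, pos="v"))
# stemming maps all three to the non-word "troubl"; lemmatization returns "trouble"

print(lemmatizer.lemmatize("was", pos="v"), lemmatizer.lemmatize("is", pos="v"))
# both "was" and "is" lemmatize to the dictionary form "be"
```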
Week 6. Part of Speech Tagging
Q7. Find one tagging error in each of the following sentences that are tagged with
the Penn Treebank tagset. Briefly explain why. (2 marks)
1. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN
2. Can/MD I/PRP serve/VB the/DT dinner/NNS
Solution
1. Atlanta should be tagged NNP. Atlanta is the capital city of the U.S. state of Georgia,
so it should be a proper noun, singular: a name used for an individual person, place, or
organization, spelled with an initial capital letter.
2. Dinner should be tagged NN. NN represents a noun, singular, whereas NNS is a noun,
plural. The word ‘dinner’ here is a singular noun.
Q8-a. A hidden Markov model includes states, observations, transition probabilities, and
observation likelihoods. Describe what each of these would correspond to when using an
HMM for POS tagging. (2 marks)
Solution
● States: the POS tags at specific points in the sentence.
● Observations: the words that are observed as the sentence is read in.
● Transition probabilities: the probability of a POS tag given the previous POS tag,
P(t_i | t_{i-1}).
● Observation likelihoods: the probability of seeing a particular word given a particular
tag, P(w_i | t_i).
Q8-b. Given the sentence “I promise to back the bill.”, show how you would compute
the probability of “back” as a verb versus the probability of “back” as a noun using the
probabilities in Tables a and b and the Viterbi algorithm. You are given the values for the
third column of the Viterbi table, which correspond to observation 3, “to”:
VB: 0, TO: .00000018, NN: 0, PRP: 0. Thus, you will show two computations, both of
which will use these values. You do not need to do the arithmetic; just show the formula
that would be computed. (5 marks)
(*Assume all verb tags are VB.)
Table a. Observation likelihoods P(word | tag)

        I      promise   to     back
VB      0      .0093     0      .00008
TO      0      0         .99    0
NN      0      .0085     0      .00068
PRP     .37    0         0      0
Table b. Tag transition probabilities P(tag_i | tag_{i-1}) (rows: previous tag; columns: current tag)

        VB      TO        NN      PRP
<s>     .019    .0043     .041    .067
VB      .0038   .035      .047    .0070
TO      .83     0         .00047  0
NN      .0040   .016      .087    .0045
PRP     .23     .00079    .0012   .00014
Solution
● back as a verb:
.00000018 * P(VB | TO) * P(back | VB) = .00000018 * .83 * .00008
● back as a noun:
.00000018 * P(NN | TO) * P(back | NN) = .00000018 * .00047 * .00068
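For reference, the same two computations as a small Python sketch, plugging in the values from Tables a and b:

```python
# Minimal sketch of the two Viterbi updates for "back" (values from Tables a and b).
viterbi_to = 0.00000018            # given Viterbi value for TO at observation "to"

# transition probabilities P(tag | TO) from Table b
p_vb_given_to = 0.83
p_nn_given_to = 0.00047

# observation likelihoods P("back" | tag) from Table a
p_back_given_vb = 0.00008
p_back_given_nn = 0.00068

viterbi_back_vb = viterbi_to * p_vb_given_to * p_back_given_vb   # "back" as a verb
viterbi_back_nn = viterbi_to * p_nn_given_to * p_back_given_nn   # "back" as a noun
print(viterbi_back_vb, viterbi_back_nn)
```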
Week 7. Dependency Parsing
Q9. State a sequence of transitions that makes a transition-based dependency parser
produce the following dependency tree. Explain how to get the sequence of transitions.
(5 marks)
Solution
Suppose SH = Shift, RA = Right Arc, LA = Left Arc.
SH SH SH SH RA SH SH LA RA RA RA
In order to get this dependency tree using the arc-standard algorithm, we perform the
following steps based on the three possible transitions (SH, RA, LA):
Step 1. SH ROOT 0 onto the stack, while all the other words 1 to 5 remain in the buffer as
the initial state.
Step 2. SH 1 from the buffer onto the stack.
Step 3. SH 2 from the buffer onto the stack.
Step 4. SH 3 from the buffer onto the stack.
Step 5. RA from 2 to 3 and remove 3 from the stack.
Step 6. SH 4 from the buffer onto the stack.
Step 7. SH 5 from the buffer onto the stack.
Step 8. LA from 5 to 4 and remove 4 from the stack.
Step 9. RA from 2 to 5 and remove 5 from the stack.
Step 10. RA from 1 to 2 and remove 2 from the stack.
Step 11. RA from 0 to 1 and remove 1 from the stack.
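For illustration, a minimal Python sketch that replays this transition sequence under the same convention as the steps above (ROOT 0 starts in the buffer and the first SH pushes it) and prints the (head, modifier) arcs it creates:

```python
# Minimal sketch of the arc-standard transitions above (word indices only; ROOT = 0).
# LA adds an arc from the top of the stack to the second item; RA adds the reverse.
def parse(transitions, n_words=5):
    stack, buffer, arcs = [], list(range(n_words + 1)), []   # buffer holds ROOT 0 and words 1..5
    for t in transitions:
        if t == "SH":                        # Shift: move the next buffer item onto the stack
            stack.append(buffer.pop(0))
        elif t == "LA":                      # Left Arc: top is head of second-from-top; pop the dependent
            head, dep = stack[-1], stack.pop(-2)
            arcs.append((head, dep))
        elif t == "RA":                      # Right Arc: second-from-top is head of top; pop the dependent
            dep, head = stack.pop(), stack[-1]
            arcs.append((head, dep))
    return arcs

print(parse("SH SH SH SH RA SH SH LA RA RA RA".split()))
# -> [(2, 3), (5, 4), (2, 5), (1, 2), (0, 1)], matching Steps 5 and 8-11
```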
Head and modifier refer to the two words in a dependency relation, where the head is the
governor (parent) and the modifier is the dependent (daughter). Using arrows for the
dependencies, each arrow points from the head to the modifier. For example, considering
the dependency between the words ‘red hat’, ‘red’ is the modifier while ‘hat’ is the head,
and the arrow points from “hat” to “red” in this case.
A full answer should also include a detailed explanation of how the dependency tree is
obtained using the SH, LA and RA definitions, and of what the head and modifier are.
Week 8. Language Model and Natural Language Generation
Q10. When training a neural language model, we normally apply teacher forcing.
Describe what the teacher forcing technique is. Give application examples to support
your argument. (2 marks)
Solution
Teacher forcing is the technique where the target word is passed as the next input to the
decoder. Assume we want to train a sentence generation model, and the ground-truth
sentence is “Two people reading a book”. Our model makes a mistake in predicting the
2nd word, so we have “Two” and “birds” as the 1st and 2nd predictions respectively. If we
use teacher forcing, we feed “people” to our RNN for the 3rd prediction, after computing
and recording the loss for the 2nd prediction.
(Optional: without teacher forcing, we would feed “birds” back to our RNN to predict the
3rd word. Say the 3rd prediction is “flying”. Even though it makes sense for our model to
predict “flying” given the input “birds”, it is different from the ground truth.)
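For illustration, a minimal PyTorch sketch of a training step with teacher forcing (the GRU decoder, toy vocabulary and sizes are made up for this example):

```python
import torch
import torch.nn as nn

# Toy vocabulary and a tiny GRU decoder (all sizes are illustrative).
vocab = {"<s>": 0, "Two": 1, "people": 2, "reading": 3, "a": 4, "book": 5, "birds": 6}
emb = nn.Embedding(len(vocab), 16)
decoder = nn.GRU(16, 32, batch_first=True)
out_head = nn.Linear(32, len(vocab))
loss_fn = nn.CrossEntropyLoss()

target = torch.tensor([[1, 2, 3, 4, 5]])           # "Two people reading a book"
hidden = torch.zeros(1, 1, 32)
inp = torch.tensor([[0]])                          # start token
loss = 0.0
for t in range(target.size(1)):
    out, hidden = decoder(emb(inp), hidden)
    logits = out_head(out[:, -1])                  # prediction for step t
    loss = loss + loss_fn(logits, target[:, t])    # loss against the ground-truth word
    inp = target[:, t:t+1]                         # teacher forcing: feed the ground-truth word,
                                                   # not the model's own (possibly wrong) prediction
print(loss.item())
```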
Week 9. Named Entity Recognition and Coreference Resolution
Q11. The IOB format categorizes tagged tokens as I, O and B. Why are three tags
necessary? What problem would be caused if we used I and O tags exclusively? Give
application examples to support your argument. (2 marks)
Solution
The IOB format (short for inside, outside, beginning) is a common tagging format for tagging
tokens in a chunking task. If two chunks follow each other, with only I and O tags it would
not be possible to make clear that they are two chunks instead of one chunk consisting of two
words, nor where the first ends and the second begins. For example, consider the NER tags
using only the IO format for the sentence ‘Tom/I-PER Amy/I-PER and/O Tony/I-PER went/O
to/O Usyd/I-LOC’: the two words Tom and Amy cannot be distinguished as two separate
chunks as expected. However, this is solved by using the IOB format: ‘Tom/B-PER
Amy/B-PER and/O Tony/B-PER went/O to/O Usyd/B-LOC’.
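For illustration, a small Python sketch that recovers chunks from the tag sequences in this example, showing that the IO tagging merges Tom and Amy while the IOB tagging keeps them separate:

```python
# Minimal sketch: recovering entity chunks from tags. With only I/O tags, adjacent
# entities of the same type merge into one chunk; B- tags keep them separate.
def chunks(tokens, tags):
    spans, current, prev = [], [], "O"
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            if current:
                spans.append(current)
            current, prev = [], "O"
            continue
        prefix, etype = tag.split("-")
        # start a new chunk on B-, or on I- when we were outside or the entity type changes
        if prefix == "B" or prev == "O" or prev != etype:
            if current:
                spans.append(current)
            current = []
        current.append(tok)
        prev = etype
    if current:
        spans.append(current)
    return spans

tokens = ["Tom", "Amy", "and", "Tony", "went", "to", "Usyd"]
print(chunks(tokens, ["I-PER", "I-PER", "O", "I-PER", "O", "O", "I-LOC"]))
# IO only -> [['Tom', 'Amy'], ['Tony'], ['Usyd']]  (Tom and Amy merge into one chunk)
print(chunks(tokens, ["B-PER", "B-PER", "O", "B-PER", "O", "O", "B-LOC"]))
# IOB     -> [['Tom'], ['Amy'], ['Tony'], ['Usyd']]  (two separate PER chunks)
```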
Week 10. Attention and Reading Comprehension
Q12. Describe the main intuition behind attention in a neural network model. (2 marks)
Solution
Attention gives us a fixed-sized summary (as a weighted sum) of an arbitrary set of
representations (the values), dependent on some other representation (the query). The
weighted sum is a selective summary of the information contained in the values, where the
query determines which values to focus on.
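For illustration, a minimal PyTorch sketch of dot-product attention (sizes are made up; the values here also serve as the keys):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of dot-product attention: a fixed-size summary of the values,
# weighted by how well each value matches the query.
values = torch.randn(6, 32)            # e.g. 6 encoder states, 32-dim each
query = torch.randn(32)                # e.g. the current decoder state

scores = values @ query                # one score per value
weights = F.softmax(scores, dim=0)     # attention distribution over the values
summary = weights @ values             # weighted sum: a single 32-dim vector
print(weights.shape, summary.shape)    # torch.Size([6]) torch.Size([32])
```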