1 (6 marks) Consider standard social media such as Twitter or Weibo.
(a) (3 marks) Describe some of the aspects of these platforms that make them ideal as a domain for
the study of semi-structured data.
(b) (3 marks) Text in a Tweet is challenging for standard natural language processing methods to
analyse and interpret. Give three different problems caused by this sort of text, and for each
suggest how you could address the problem.
2 (6 marks) Consider the following extract from a news article on July 31st 2020 in Spaceflight
Now:
NASA’s Perseverance rover will depart Cape Canaveral Thursday on a $2.7 billion mission to Mars,
carrying with it the first interplanetary aircraft, sophisticated instruments to search for signs of
ancient life, and drill to core samples for eventual return to Earth.
Building on past discoveries at the Red Planet, the nuclear-powered robot will aim to become
NASA’s ninth mission to land on Mars, and the first since the Viking landers of the 1970s charged
with seeking evidence of life.
(a) (2 marks) What is a “coreference” in natural language, and what are all the coreferences in this
extract?
(b) (2 marks) Explain why coreference resolution is hard.
(c) (2 marks) Give examples of two different kinds of phrases, and name the kind of phrase (for
instance, the part of speech) each represents.
Embeddings [13 marks]
3 (4 marks) For representing words in a machine learning algorithm, for instance for text
classification, what is the difference between one-hot coding and the use of embeddings?
Explain the pros and cons of each.
4 (3 marks) Consider the word “bank”. Could you use embeddings to help uncover synonyms
for this word? What potential errors could this induce?
5 (6 marks) Describe how the CBOW model of word embeddings works.
HINT: best marks will be given for combining the formula with an intuitive explanation.