0. Warming up with WordNet and NLTK
WordNet is a lexical database; like a dictionary, WordNet provides definitions and example
usages for different senses of word lemmas. But WordNet does the job of a thesaurus as well: it
provides synonyms for the senses, grouping synonymous senses together into a set called a synset.
But wait; there’s more! WordNet also provides information about semantic relationships beyond synonymy, such as antonymy, hyperonymy/hyponymy, and meronymy/holonymy. Throughout this assignment, you will be making use of WordNet via the NLTK package, so the first step
is to get acquainted with doing so. Consult sections 4.1 and 5 of chapter 2 as well as section
3.1 of chapter 3 of the NLTK book for an introduction along with examples that you will likely
find useful for this assignment. You may also find section 3.6 useful for its discussion of
lemmatization, although you will not be doing any lemmatization for this assignment.
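To give a concrete flavour of the interface before you start, the snippet below is purely illustrative (the lemmas queried and the exact relations printed are examples, not part of the assignment code):

    from nltk.corpus import wordnet as wn

    # List every sense (synset) of "bank" with its dictionary-style definition.
    for synset in wn.synsets('bank'):
        print(synset.name(), '-', synset.definition())

    # Relations beyond synonymy, using the first noun sense of "dog".
    dog = wn.synset('dog.n.01')
    print(dog.hypernyms())         # more general synsets (hyperonyms)
    print(dog.hyponyms())          # more specific synsets
    print(dog.member_holonyms())   # wholes that a dog belongs to, e.g. a pack
    print(wn.lemma('good.a.01.good').antonyms())  # antonymy is defined on lemmas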
(a) A root hyperonym is a synset with no hyperonyms. A synset s is said to have depth d if there
are d hyperonym links between s and a root hyperonym. Keep in mind that, because synsets
can have multiple hyperonyms, they can have multiple paths to root hyperonyms.
Implement the deepest function in q0.py to find the synset in WordNet with the
largest maximum depth, and report both the synset and its depth on each of its paths to a
root hyperonym. (Hint: you may find the wn.all_synsets and synset.max_depth
methods helpful.)
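One possible shape for this function, assuming the usual wordnet import in q0.py (the variable names here are illustrative, not prescribed), is:

    from nltk.corpus import wordnet as wn

    def deepest():
        # The synset whose longest hyperonym path is longest overall.
        best = max(wn.all_synsets(), key=lambda s: s.max_depth())
        print(best, 'max depth:', best.max_depth())
        # hypernym_paths() returns one list per path between the synset and a
        # root hyperonym; the depth along a path is its length minus one.
        for path in best.hypernym_paths():
            print(len(path) - 1, [s.name() for s in path])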
(b) Implement the superdefn function in q0.py that takes a synset s and returns a list consisting of all of the tokens in the definitions of s, its hyperonyms, and its hyponyms. Use
word_tokenize as shown in chapter 3 of the NLTK book.
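A rough sketch of one reading of this specification follows; it assumes s is already a Synset object (if your scaffold passes a synset name string instead, convert it with wn.synset(s) first):

    from nltk import word_tokenize

    def superdefn(s):
        # Concatenate the tokenized definitions of s, its hyperonyms, and
        # its hyponyms into a single flat list of tokens.
        tokens = []
        for syn in [s] + s.hypernyms() + s.hyponyms():
            tokens.extend(word_tokenize(syn.definition()))
        return tokens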
(c) NLTK’s word_tokenize only tokenizes text; it doesn’t filter out any of the tokens.
You will be calculating overlaps between sets of strings, so it will be important to remove
stop words and any tokens that consist entirely of punctuation symbols.
Implement the stop_tokenize function in q0.py that takes a string, tokenizes it using
word_tokenize, removes any tokens that occur in NLTK’s list of English stop words
(which has already been imported for you), and also removes any tokens that consist entirely
of punctuation characters. For a list of punctuation symbols, use Python’s punctuation
characters from the string module (this has also already been imported for you). Keep
in mind that NLTK’s list contains only lower-case tokens, but the input string to stop_tokenize may contain upper-case symbols. Maintain the original case in what you return.
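A minimal sketch along these lines is given below; the imports are spelled out here for completeness, and the name STOPWORDS is a placeholder for whatever name the q0.py scaffold actually uses for the imported stop-word list:

    from string import punctuation
    from nltk import word_tokenize
    from nltk.corpus import stopwords

    STOPWORDS = set(stopwords.words('english'))

    def stop_tokenize(text):
        # Keep a token only if its lower-cased form is not a stop word and it
        # contains at least one non-punctuation character; the token itself
        # is returned with its original case intact.
        return [tok for tok in word_tokenize(text)
                if tok.lower() not in STOPWORDS
                and not all(ch in punctuation for ch in tok)]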
1. The Lesk algorithm & word2vec
Recall the problem of word sense disambiguation (WSD): given a semantically ambiguous
word in context, determine the correct sense. A simple but surprisingly hard-to-beat baseline
method for WSD is Most Frequent Sense (MFS): just select the most frequent sense for each
ambiguous word, where sense frequencies are provided by some corpus.
(a) Implement the mfs function that returns the most frequent sense for a given word in a sentence. Note that wordnet.synsets() orders its synsets by decreasing frequency.
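A minimal sketch of MFS under that observation (the exact signature expected by the starter code may differ, so treat this as illustrative):

    from nltk.corpus import wordnet as wn

    def mfs(sentence, word):
        # wn.synsets(word) lists senses in decreasing order of frequency,
        # so the first synset is the most frequent sense. The sentence is
        # unused by this baseline but kept for a uniform WSD interface.
        return wn.synsets(word)[0]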
As discussed in class, the Lesk algorithm is a venerable method for WSD. The Lesk algorithm
variant that we will be using for this assignment selects the sense with the largest number of
words in common with the ambiguous word’s sentence. This version is called the simplified Lesk
algorithm.
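To make the overlap idea concrete, here is a hedged sketch of simplified Lesk built on stop_tokenize from part 0(c); whether example sentences belong in a sense's signature, and how ties are broken, are assumptions you should check against the assignment's actual specification:

    from nltk.corpus import wordnet as wn

    def simplified_lesk(sentence, word):
        # sentence is the raw sentence string containing the ambiguous word;
        # assumes stop_tokenize from part 0(c) is in scope.
        context = set(stop_tokenize(sentence))
        best_sense, best_overlap = wn.synsets(word)[0], -1  # MFS as the fallback
        for sense in wn.synsets(word):
            signature = set(stop_tokenize(sense.definition()))
            for example in sense.examples():       # optionally include examples
                signature |= set(stop_tokenize(example))
            overlap = len(signature & context)     # count shared word types
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense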