Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
CMPT 456 Assignment
Every student has to complete the assignment independently. While you are encouraged to learn
through discussion with the instructor, the TAs and the peer students, any plagiarisms are serious
violation of the university’s academic integrity policy. We have absolutely zero tolerance of such
behavior.
This assignment covers the materials in the sections of queries and interfaces, and information
retrieval models. The questions use the data sets D1 and D2 in Q1 of Assignment 2. Please also
merge D1 and D2 to form data set D. Do NOT stem the data sets.
Question 1 (50 points)
This question asks you to build a simple spelling correction function.
Your program should take a word w as input. It searches all words x in D that have edit distance
edit-dist(w, x) <= 2, where each of the editing operations, insertion, deletion, and replacement,
takes weight 1. Then, your program should output the top-5 words x that have the highest termfrequency
in D as the suggested corrections as well as their term-frequencies in D.
1. Describe your implementation of the above spelling correction function and use 3
examples to show the correction results. Particularly, explain how you can find those
candidate corrections and the top-5 candidates efficiently. Please submit your codes. (20
points)
2. The above design has a problem. If you input a correctly spelt word, like “heart”, it still
may suggest some similar words as possible corrections, such as “heard”. Can you
propose a fix to the problem so that your program can avoid suggesting corrections of
words frequently used in D? Describe your idea use examples, implement it, and test the
results. Does your program need any parameters? If so, discuss how those parameters
may contribute to or affect the performance of the program, that is, how those
parameters are tuned. Please submit your codes. (10 points)
3. A data set of tweets may often contain misspelt words. Your spelling correction program
trained from D may be used on D to improve the data quality. Apply your program to D
itself to catch misspelt words and correct them. Show the corrections you program makes.
Does your program make some mistakes in correction? Use the first 100 corrections your
program makes (or as many as possible if your program makes less than 100 corrections)
to calculate the precision. Explain (1) how your program picks the correction of a word
among all possible candidates; and (2) how your program avoids changing every word in
D, particularly those words correctly spelt but infrequent in D. Submit your codes. (20
points)
Question 2 (25 points)
This question asks you to use language modelsto understand the differences between the usages
of words in D1 and D2. Please remove the stop words in D1 and D2.