Practical Assignment
(18% of Total Marks)
Your task is to write an information retrieval engine, which will be able to index a collection of documents and, in response to a keyword query, retrieve matching documents. The information retrieval model your program will use is the vector-space model.
You must follow all of the instructions below:
SEPARATE SUBMISSIONS ARE REQUIRED FOR THE CREDIT LEVEL ASSIGNMENT
AND THE HIGH-DISTINCTION LEVEL ASSIGNMENT (IF ATTEMPTING THE HIGH-DISTINCTION LEVEL).
I. INSTRUCTIONS FOR THE CREDIT LEVEL ASSIGNMENT
(MAXIMUM MARK 69%)
1. Your program can be written in Java, Python or any other programming language of your choice. Note that since programming skills are a prerequisite of this unit, your tutor is not to help you with the coding part of the assignment.
2. All your programming source files must be submitted as specified in Section III, and must all follow the standard convention of having a file extension depending on the programming language you use (e.g. .java, .py). Do not use package statements in your code.
3. The name of your program must be MySearchEngine (i.e. at a minimum your source code directory must contain a file called MySearchEngine.java which contains the main() method). You may split your code into multiple source files, as long as they compile to produce the final MySearchEngine.class file by issuing the command in instruction #4.
4. It must be possible to compile your program on the server by issuing the relevant runtime command from within the source code directory, e.g.
javac *.java
5. Your program should be able to run from the command line and send its output to standard output (except for the index, which is to be stored as a file).
6. Your program must be able to be invoked from the command line with the following usage/parameters:
java MySearchEngine [command]
where [command] is one of:
a. index collection_dir index_dir stopwords.txt
index all the documents stored in collection_dir. The index so constructed should be stored in index_dir. The index file should be named index.txt. See instructions #8 and #9 for the prescribed tokenization/stemming rules and index format. Stopwords are contained in the file stopwords.txt, a plain text file with one stopword per line. Do not consider the stopwords in the file stopwords.txt for stemming into index terms.
for example:
java MySearchEngine index ~/mydocs ~/myindex ~/stopwords.txt
b. search index_dir num_docs keyword_list
return a ranked list of the top num_docs documents that match the query specified in keyword_list. The most relevant document must appear first in the list. Note that keywords in the query are separated by white space on the command line. Refer to instruction #9 for a more detailed description of what should be returned by this command.
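
The command-line usage in instruction #6 could be handled along the following lines. This is only a minimal sketch, not a prescribed design: the helper names checkArgs and loadStopwords, the usage messages, and the exit code are assumptions made for illustration, and the indexing and ranking steps are left as placeholders because they depend on instructions #8 and #9. The sketch assumes that the index command takes exactly three arguments, that search takes at least one keyword, and that num_docs is a positive integer.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MySearchEngine {

    // Hypothetical helper: reads one stopword per line, skipping blank lines.
    // No case folding is applied here; that is governed by the tokenization rules.
    static Set<String> loadStopwords(Path file) throws IOException {
        Set<String> stopwords = new HashSet<>();
        for (String line : Files.readAllLines(file)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    // Hypothetical helper: returns null if the arguments match one of the two
    // commands, otherwise a short error message.
    static String checkArgs(String[] args) {
        if (args.length == 0) {
            return "missing command";
        }
        switch (args[0]) {
            case "index":
                return args.length == 4 ? null : "index expects 3 arguments";
            case "search":
                if (args.length < 4) {
                    return "search expects index_dir, num_docs and at least one keyword";
                }
                try {
                    if (Integer.parseInt(args[2]) <= 0) {
                        return "num_docs must be positive";
                    }
                } catch (NumberFormatException e) {
                    return "num_docs must be an integer";
                }
                return null;
            default:
                return "unknown command: " + args[0];
        }
    }

    public static void main(String[] args) throws IOException {
        String error = checkArgs(args);
        if (error != null) {
            System.err.println(error);
            System.err.println("usage: java MySearchEngine index collection_dir index_dir stopwords.txt");
            System.err.println("       java MySearchEngine search index_dir num_docs keyword_list");
            System.exit(1);
        }
        if (args[0].equals("index")) {
            Set<String> stopwords = loadStopwords(Paths.get(args[3]));
            // Placeholder: build the index from args[1] and write index.txt into args[2].
            System.out.println("Loaded " + stopwords.size() + " stopwords; indexing is not part of this sketch.");
        } else {
            int numDocs = Integer.parseInt(args[2]);
            // Keywords arrive as separate command-line arguments (white-space separated).
            List<String> keywords = Arrays.asList(args).subList(3, args.length);
            // Placeholder: load index.txt from args[1], rank documents, print the top numDocs.
            System.out.println("Would rank the top " + numDocs + " documents for " + keywords);
        }
    }
}
```

Keeping the argument check separate from main() makes it possible to test the usage rules without touching the file system. The class contains no package statement, in line with instruction #2.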