●
Build the Docker image from the source code (make sure to include "." (i.e., the current
location) at the end of the command):
docker build -t cmpt456-lucene-solr:6.6.7 .
NOTE: Since Docker is not available for free on Windows, we recommend using VirtualBox with
Ubuntu or the Windows Subsystem for Linux (WSL).
●
Run the Docker image we just built to start the Docker container:
docker run -it cmpt456-lucene-solr:6.6.7
Demo
In this section, we help you get familiar with Lucene's basic components by running two simple
programs:
●
Index Files: this program uses the standard analyzer to create tokens from input text files,
convert them to lowercase, then filter out a predefined list of stop-words.
The source code is stored in this file within the
●
Search Files: this program uses a query parser to parse the input query text, then passes the
parsed query to the index searcher to look for matching results.
The source code is stored in this file within the
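The analysis steps described for Index Files (tokenize, lowercase, remove stop-words) can be sketched in plain Java as follows. This is only an illustration of the idea and not Lucene's own analyzer code: the class name AnalysisSketch, the method analyze, and the short stop-word list are assumptions made for this example, and Lucene's standard analyzer uses more sophisticated tokenization rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; this is NOT Lucene's StandardAnalyzer.
class AnalysisSketch {
    // Hypothetical stop-word list chosen for this example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "is"));

    // Split on runs of non-letter, non-digit characters, lowercase each
    // token, and drop any token found in the stop-word list.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields an empty first element if text starts with a delimiter
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints: [quick, brown, fox, lazy, dog]
        System.out.println(analyze("The Quick brown Fox and the lazy dog"));
    }
}
```

The same chain has to be applied to the query text at search time, otherwise a query term such as "Fox" would not match the indexed token "fox".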
You are expected to run these examples and understand the Lucene components used in the indexing
and querying process so that you can extend them in the programming tasks below.
Text Parsing (30 pts)
In the first part of the assignment, you will learn how to use Lucene to build search capabilities
for documents in various formats, such as HTML, XML, PDF, and Word. Lucene itself does not handle
the parsing of these or other document formats; it is the responsibility of the application
using Lucene to use an appropriate parser to convert the original format into plain text before
passing that plain text to Lucene.
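As a rough sketch of what "converting the original format into plain text" can mean for HTML, the plain-Java snippet below strips markup with regular expressions before the text would be handed to an indexer. The class name HtmlTextSketch and the method toPlainText are hypothetical names for this example, only a handful of entities are decoded, and regex-based stripping is fragile on malformed pages; a dedicated HTML parser library would be the more robust choice in practice.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: regex-based tag stripping, not a full HTML parser.
class HtmlTextSketch {
    // Matches <script>...</script> and <style>...</style> blocks, whose
    // contents are code rather than readable text.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Matches any remaining tag such as <body>, </p>, or <table border="1">.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    static String toPlainText(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; "&amp;" goes last to avoid double decoding.
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
        // Collapse the whitespace left behind by removed tags.
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><h1>Hello</h1><p>Lucene &amp; Solr</p></body></html>";
        // Prints: Demo Hello Lucene & Solr
        System.out.println(toPlainText(html));
    }
}
```

The resulting string contains only the visible text, so tag names such as "body" or "table" would no longer end up as indexed terms.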
The class IndexFiles.java from the Demo section indexes the full content of HTML files,
including all HTML tags (e.g., <body>, <head>, <table>). In this section, we want you to
create a new class called HtmlIndexFiles.java to: