COMP9024 Data Structures and Algorithms
Assignment: The Missing Pages
Change Log
We may make minor changes to the spec to address/clarify outstanding issues. These should require minimal changes to your design/code, if any.
Students are strongly encouraged to check the online forum discussion and the change log regularly.
Version 1.0
(2024-03-15 17:00)
Initial release.
Background
As we have mentioned in lectures, the Internet can be thought of as a graph (a very large graph). Web pages represent vertices and hyperlinks represent
directed edges.
With almost 1.1 billion unique websites (as of February 2024), and each website having multiple webpages, and each webpage having multiple hyperlinks, it
can understandably be a very difficult job to remember the URL of every website you want to visit.
In order to make life easier, from the very early days of the internet, there have been search engines that can be used to find websites.
But the job of a search engine is very difficult: first, it must index (create a representation of) the entire World Wide Web, or as much of it as possible; next, it must rank the webpages it finds.
In this assignment we will be implementing algorithms to solve each of these problems, and figure out the fastest way to navigate from one page to another.
1. To index the internet we will be creating a web crawler.
2. To rank webpages we will implement the PageRank algorithm.
3. To find the shortest path between two pages we will implement Dijkstra's algorithm.
The Assignment
Starter Files
Download this zip file.
Unzipping the file will create a directory called 'assn' with all the assignment start-up files.
Alternatively, once assn.zip has been downloaded into your current working directory, you can achieve the same thing from a terminal with a command such as:
prompt$ unzip assn.zip -d assn
This command will extract the archive into a sub-directory assn.
You can also make note of the following URLs:
Once you read the assignment specification, hopefully it will be clear to you how these URLs might be useful. You may also find it useful to construct a similar
visual representation for the mini-web.
Overall File Structure
Below is a reference for each file and their purpose.
Note: You cannot modify ANY of the header (.h) files.
Provided File   Description                                                Implemented In
crawler.c       A driver program to crawl the web                          —
dijkstra.h      Interface for the Shortest Path functions (Subset 4)       graph.c
graph.h         Interface for the Graph ADT (Subset 1b)                    graph.c
list.h          Interface for the List ADT (Subset 1a)                     list.c
Makefile        A build script to compile the crawler into an executable   —
pagerank.h      Interface for the PageRank functions (Subset 3)            graph.c
Your task will be to provide the necessary implementations to complete this project.
Subset 1 - Dependencies
Before we can start crawling we need to be able to store our crawled data. Since the internet is a graph, this means we need a Graph ADT. We will also need a Set ADT, and either a Queue ADT or a Stack ADT, in order to crawl the web (with a BFS or DFS).
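To illustrate why a queue (or stack) and a set are needed, here is a minimal, self-contained BFS sketch over a toy four-page web stored as an adjacency matrix. This is illustrative only: your crawler will use your own list and graph ADTs, with string URLs rather than page numbers.

// Toy BFS crawl (assumption: a hard-coded 4-page web; not the assignment's
// representation).
#include <stdio.h>
#include <stdbool.h>

#define N 4

int main(void) {
    // adj[u][v] == true means page u links to page v
    bool adj[N][N] = {
        {false, true,  true,  false},  // page 0 links to pages 1 and 2
        {false, false, false, true },  // page 1 links to page 3
        {false, false, false, true },  // page 2 links to page 3
        {false, false, false, false}   // page 3 has no outgoing links
    };
    int queue[N];                      // FIFO frontier of pages to visit
    int head = 0, tail = 0;
    bool visited[N] = {false};         // the "set" of pages already seen

    queue[tail++] = 0;                 // start the crawl at page 0
    visited[0] = true;
    while (head < tail) {
        int u = queue[head++];         // dequeue the next page
        printf("visiting page %d\n", u);
        for (int v = 0; v < N; v++) {
            if (adj[u][v] && !visited[v]) {
                visited[v] = true;     // mark before enqueueing: no repeats
                queue[tail++] = v;
            }
        }
    }
    return 0;
}

Replacing the queue with a stack (popping from the same end you push) turns this traversal into a DFS.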
Subset 1a - Implement the List (Queue, Stack, Set) ADT
You have been provided with a file list.h. Examine the file carefully. It provides the interface for an ADT that will provide Queue, Stack, and Set functionality.
Your task is to implement the functions prototyped in the list.h header file within list.c.
You must create the file list.c to implement this ADT.
You must store string (char *) data within the ADT.
You must allocate memory dynamically.
You must not modify the list.h file.
You must not modify the function prototypes declared in the list.h file.
You may add utility functions to the list.c file.
You may use the string.h library, and other standard libraries from the weekly exercises.
You may reuse code previously submitted for weekly assessments and provided in the lectures.
You may use whatever internal representation you like for your list ADT, provided it does not contradict any of the above.
You may assume that any instance of your list ADT will only be used as a queue or a stack or a set.
You should write programs that use your ADT to test and debug your code.
You should use valgrind to verify that your ADT does not leak memory.
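For example, assuming you have written a simple test driver test_list.c (the name is hypothetical), you could check for leaks with:
prompt$ gcc -Wall -Werror -g -o test_list test_list.c list.c
prompt$ valgrind ./test_list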
As a reminder:
Queue - First In, First Out
Stack - Last In, First Out
Set - Only stores unique values.
See list.h for more information about each function that you are required to implement.
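To give you a feel for what "whatever internal representation you like" might look like, below is one possible sketch of the internals of list.c. The struct and helper names are illustrative assumptions only; the functions you actually must implement are the ones prototyped in list.h.

#include <stdlib.h>
#include <string.h>

// One possible internal representation (a sketch, not a requirement).
struct node {
    char *data;            // heap-allocated copy of the stored string
    struct node *next;
};

struct list {              // one struct can back a queue, a stack, or a set
    struct node *head;     // stack: push and pop here; queue: dequeue here
    struct node *tail;     // queue: enqueue here
    size_t size;
};

// Utility: copy a string onto the heap (equivalent to POSIX strdup).
static char *copy_string(const char *s) {
    char *copy = malloc(strlen(s) + 1);
    if (copy != NULL) {
        strcpy(copy, s);
    }
    return copy;
}

Storing a copy of each string (rather than the caller's pointer) keeps ownership inside the ADT, which makes it much easier to free everything correctly.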
Testing
We have created a script to automatically test your list ADT. It expects to find list.c in the current working directory. Limited test cases are provided, so you
should always do your own, more thorough, testing.
prompt$ 9024 dryrun assn_list
Subset 1b - Implement the Graph ADT
You have been provided with a file graph.h. Examine the file carefully. It provides the interface for an ADT that will provide Graph functionality. The graph is
both weighted and directed.
Your task is to implement the functions prototyped in the graph.h header file within graph.c.
You must create the file graph.c to implement this ADT.
You must use an adjacency list representation, but the exact representation is up to you.
You must use string (char *) data to label the vertices.
You must allocate memory dynamically.
You must not modify the graph.h file.
You must not modify the function prototypes declared in the graph.h file.
You may add utility functions to the graph.c file.
You may use the string.h library, and other standard libraries from the weekly exercises.
You may reuse code previously submitted for weekly assessments and provided in the lectures.
You should write programs that use your ADT to test and debug your code.
You should use valgrind to verify that your ADT does not leak memory.
See graph.h for more information about each function that you are required to implement.
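As with the list ADT, the exact structs are up to you, provided you use an adjacency list. One possible sketch (again, illustrative names only; the real interface is in graph.h) is:

#include <stdlib.h>

// One possible adjacency-list representation (a sketch, not a requirement).
struct edge {               // one outgoing, weighted hyperlink
    size_t to;              // index of the destination vertex
    double weight;
    struct edge *next;
};

struct graph {
    char **labels;          // labels[i] is the URL labelling vertex i
    struct edge **adj;      // adj[i] is the list of edges leaving vertex i
    size_t nV;              // number of vertices currently stored
    size_t capacity;        // number of slots allocated in labels and adj
};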
Subset 2 - Web Crawler
We are now going to use the list and graph ADTs you have created to implement a web crawler.
Assuming your ADTs are implemented correctly, you should be able to compile the crawler using the provided build script:
prompt$ make crawler
Note: crawler.c requires external dependencies (libcurl and libxml2). The provided Makefile will work on CSE servers (ssh and vlab), but may not
work on your home computer.
After running the executable, check that the output aligns with the navigation of the sample website.
Carefully examine the code in crawler.c. Uncomment the block of code that uses scanf to take user input for the ignore_list.
The ignore list represents the URLs that we would like to completely ignore when we are calculating PageRanks, as if they did not exist in the graph. This
means that any incoming and outgoing links from these URLs are treated as non-existent. You are required to implement this functionality locally - within the
graph_show function - and NOT change the representation of the actual graph structure within the ADT. For further details see the graph.h file, and the illustrative sketch below.
If you have correctly implemented the ADTs from the previous tasks, this part should be mostly free.
crawler.c is a complete implementation of a web crawler; you do not need to modify the utility functions, only the bottom part of the main function. However,
you should look at the program carefully and understand it well so that you can use it (i.e., modify it appropriately) for later tasks.
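Returning to the ignore list: using the struct names from the Subset 1b sketch above, the filtering inside graph_show might look like the fragment below. Note that in_ignore_list is a hypothetical helper; use whichever membership operation your own list.h provides.

// Hypothetical fragment: skip ignored pages while printing, without
// modifying the underlying graph structure.
for (size_t v = 0; v < g->nV; v++) {
    if (in_ignore_list(ignore_list, g->labels[v]))
        continue;                       // pretend this page does not exist
    for (struct edge *e = g->adj[v]; e != NULL; e = e->next) {
        if (in_ignore_list(ignore_list, g->labels[e->to]))
            continue;                   // also skip edges into ignored pages
        // ... print the edge from g->labels[v] to g->labels[e->to] ...
    }
}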
Sample Output
prompt$
All traces of index.html have been removed. This means that only the remaining vertices are displayed as there are no longer any edges. Note that the order of
the output matters. It should follow the BFS that is performed by the crawler. If your result does not follow this order, you will be marked as incorrect, even if your
graph is valid.
Testing
We have created a script to automatically test your list and graph ADTs. It expects to find list.c and graph.c in the current working directory. Limited test
cases are provided, so you should always do your own, more thorough, testing.
prompt$ 9024 dryrun assn_crawler
Subset 3 - PageRank
Background
Now that we can crawl a web and build a graph, we need a way to determine which pages (i.e. vertices) in our web are important.
We haven't kept page content, so the only metric we can use to determine the importance of a page is how much other pages rely on its existence. That is, how easy it is to follow a sequence of one or more links (edges) and end up on the page.
In 1998, Larry Page and Sergey Brin (the founders of Google) created the PageRank algorithm to evaluate this metric.
Google still uses the PageRank algorithm to score every page it indexes on the internet to help order its search results.
Task
In graph.c implement the two new functions graph_pagerank and graph_show_pagerank.
First, graph_pagerank should calculate and store the PageRank of each vertex (i.e. page) in the graph.
The algorithm must exclude the URLs that are provided in an 'ignore list' to the function. Do not remove the pages from the graph, only skip (i.e., ignore) them
from calculations. This means that you will need to understand which parts of the PageRank algorithm need to be modified.
Using the ignore list, you will be able to see what happens to the PageRanks as certain pages are removed. What should happen to the PageRank of a
particular page if you remove all pages linking to it?
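For reference, the classic PageRank update is:

PR(v) = (1 - d) / N + d * sum over all pages u that link to v of [ PR(u) / out_degree(u) ]

where N is the number of pages and d is the damping factor. (The exact variant, parameter values, and convergence condition you must use are specified in pagerank.h and the rest of this spec.) When applying the ignore list, N should count only non-ignored pages, the sum should skip ignored pages u, and out_degree(u) should count only links from u to non-ignored pages. In particular, under this formula a page whose incoming links are all ignored drops to the base rank (1 - d) / N.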
Second, graph_show_pagerank should print the PageRank of every vertex (i.e. page) in the graph that is NOT in the ignore list.
Pages (vertices) should be printed from highest to lowest rank, based on their rounded (to 3 d.p.) rank. You should use the round function from the math.h
library. If two pages have the same rounded rank then they should be printed lexicographically.
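A minimal sketch of this ordering rule, assuming each page's rank and URL are at hand (how you collect and sort the pages is up to you):

#include <math.h>
#include <string.h>

// Compare two pages for graph_show_pagerank: higher rounded (3 d.p.) rank
// first; ties broken lexicographically by URL. Multiplying by 1000 before
// rounding compares ranks at exactly 3 decimal places.
static int compare_pages(double rank_a, const char *url_a,
                         double rank_b, const char *url_b) {
    double ra = round(rank_a * 1000.0);
    double rb = round(rank_b * 1000.0);
    if (ra > rb) return -1;             // higher rank prints first
    if (ra < rb) return 1;
    return strcmp(url_a, url_b);        // same rounded rank: alphabetical
}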
You may add more utility functions to graph.c.
You may (and most likely will need to) modify your struct definitions in graph.c.
You must not modify the file graph.h.
You must not modify the file pagerank.h.
You must not modify the function prototypes for graph_pagerank and graph_show_pagerank.