INFS4205/7205
Specification Updates
0.2 → 0.3
− More clarification about dealing with sites at the same distance from the testing point:
Outputs of each k-NN query should be ordered from the nearest to the furthest. If multiple sites have
the same distance, order them by their ID in ascending order. Each query should be cut at exactly k
outputs, regardless of whether the (k+1)th output has the same distance as the kth output.
− More clarification about documentation (.txt) and your environment:
o The code documentation (.txt) should specify the libraries you used with their corresponding
versions. For example, pandas==1.4.1
o Please also include the language version. For example, python==3.6
o Python 3.6 is recommended if you use Python because of its high compatibility.
o It is your responsibility to include sufficient details about how to set up your environment
to ensure your program can run.
− More clarification about output file:
o For the output file, you need to process all testing points in sequence and write the results to
the output file (i.e., only one output file for one testing set containing multiple testing
points).
o Therefore, within one test file, a wrong result for an earlier testing point may cause the
results of all subsequent testing points to be wrong as well.
o We will have multiple test files for each task. Only the hard test files will include a large number
of testing points, while the simple test files will include a small number of testing points.
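The single-output-file rule above can be sketched as follows. This is a minimal illustration, not the required implementation: the function name write_results and its parameters are hypothetical, and it assumes that the results of consecutive testing points are written back to back with no separator line. The sample result files remain the authority on the exact format.

```python
from typing import Iterable, List


def write_results(path: str, results_per_query: Iterable[List[int]]) -> None:
    """Write the site IDs of every query, in query order, one ID per line.

    Assumption: no blank line or header separates the results of two
    testing points; check the sample result files before relying on this.
    """
    with open(path, "w", encoding="utf-8") as out:
        for ids in results_per_query:
            for site_id in ids:
                out.write(f"{site_id}\n")
```

Because all queries share one file, an extra or missing line for one testing point shifts every later line, which is why an early mistake can affect the results of subsequent testing points.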
0.1 → 0.2
− For task 1 and task 2, it is clarified that for those sites at the same distance to the current testing
point, return IDs in ascending order.
− For task 4, it is clarified that the time windows are inclusive.
− For task 4, the format of time windows is updated to align with the sample inputs.
− For task 5, the distance metric used in the project is clarified: it is assumed that 0.01 units of
longitude/latitude equal 1 km. The distance of two geo-locations is represented by the Euclidean
distance in a 2D plane.
− Sample files of input (test set) and output (lines of site IDs) are added:
o You must follow the format of sample files to read input files and write output files.
o You can test your methods with sample outputs.
o Correctness on these sample tests does not guarantee correctness on the final tests used
to evaluate your implementation. Various test sets will be used for final evaluation.
− Some typos are fixed.
Overview
The course project consists of two assignments, which are Project Plan and Project Implementation. In
this assignment, you are asked to implement a set of query scenarios utilising spatial data techniques as
well as computational geometry algorithms wherever suitable.
This assignment is designed to assess your ability to apply advanced techniques for spatio-temporal
data manipulation to solve real-world problems. This is an individual assignment. The completion of
the assignment should be based on your own design.
Language requirement: Python or Java. You are allowed to use existing libraries (citation required).
You are NOT allowed to use any DBMS such as Oracle, MySQL, and PostGIS.
Note that if you decide to use any other programming language, you need to obtain pre-approval from
the teaching team by 22 Apr.
Dataset
In this project, you are given a data set in the following format.
Figure 1: Example Dataset
Each tuple represents a COVID-19 exposure site, described by a site ID (i.e., a random integer),
a geo-location (i.e., longitude and latitude), and the date and time of visit.
INFS4205/7205 Semester 1, 2022
Query Tasks
Task 1: Find the K nearest sites (Point) of a given location (Point)
Input (command line arguments):
argument 1: the filename of the dataset to be queried (e.g., “dataset.csv”. See an example of the file in
Figure 1)
argument 2: an integer indicating the specific task (i.e., 1 in this case)
argument 3: the filename of the test set (i.e., a txt file with a list of testing points and required K, refer
to “task1_sample.txt” for the format)
Expected Output:
A txt file named “task1_results.txt” containing K nearest sites to each testing point (refer to
“task1_sample_results.txt” for the format)
Query criteria:
− Each testing point is a 2D location (i.e., longitude and latitude) and leads to an independent query
search.
− For each testing point, you are expected to output a list of IDs, which refer to the K nearest sites
to the current testing point in ascending order of distance (i.e., closest to farthest).
− For those sites with the same distance to the current testing point, return IDs in ascending order.
− In the output file, one line is for one site ID.
− You need to process all testing points in sequence and write the kNN results to the output file (i.e.,
only one output file for one testing set containing multiple testing points).
− K may differ between testing points.
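The ordering and tie-breaking criteria above can be illustrated with a short sketch. It uses a plain linear scan rather than a spatial index, so it shows only the expected ordering, not an efficient design. The function name knn and the (site_id, longitude, latitude) tuple layout are assumptions made for illustration.

```python
import heapq
import math
from typing import List, Tuple

# Hypothetical record layout: (site_id, longitude, latitude).
Site = Tuple[int, float, float]


def knn(sites: List[Site], lon: float, lat: float, k: int) -> List[int]:
    """Return the IDs of the k nearest sites, nearest first.

    Each site is keyed by (distance, site_id), so sites at the same
    distance are ordered by ascending ID, and the result is cut at
    exactly k entries even if the (k+1)th site ties with the kth.
    """
    keyed = ((math.hypot(s[1] - lon, s[2] - lat), s[0]) for s in sites)
    return [site_id for _, site_id in heapq.nsmallest(k, keyed)]
```

For example, with three sites at distance 1 from the query and one site at the query location itself, K = 2 returns the co-located site followed by the tied site with the smallest ID. Ties are detected by exact floating-point equality here; comparing squared distances is one way to reduce rounding effects.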
Task 2: Find the K nearest sites (Point) of a given location (Point) within a certain time
window
Input (command line arguments):
argument 1: the filename of the dataset to be queried (e.g., “dataset.csv”. See an example of the file in
Figure 1)
argument 2: an integer indicating the specific task (i.e., 2 in this case)
argument 3: the filename of the test set (i.e., a txt file with a list of testing points, required K and the
time window, refer to “task2_sample.txt” for the format)
Expected Output:
A txt file named “task2_results.txt” containing K nearest sites to the testing points within a required
time window (refer to “task2_sample_results.txt” for the format)
Query criteria:
− Each testing point is a 2D location (i.e., longitude and latitude) and leads to an independent query
search.
− For each testing point, you are expected to output a list of IDs, which refer to the K nearest sites
to the current testing point in ascending order of distance (i.e., closest to farthest).
− For those sites with the same distance to the current testing point, return IDs in ascending order.
− In the output file, one line is for one site ID.
− You need to process all testing points in sequence and write the kNN results to the output file (i.e.,
only one output file for one testing set containing multiple testing points).
− K and the time window may differ between testing points.
− Different from Task 1, the time window also needs to be considered when returning the kNN results
of each testing point. Your code is expected to return the K nearest sites that emerged within the
provided time window (e.g., 2020-12-27 3pm – 2020-12-27 9pm).
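One way to combine the time window with the kNN search is to filter by time first and rank the remaining sites by distance. The sketch below is illustrative only: the name knn_in_window and the (site_id, longitude, latitude, visit_time) layout are hypothetical, the scan is linear, and the window is treated as inclusive at both ends, which is an assumption for Task 2 (the specification states inclusiveness explicitly only for Task 4).

```python
import heapq
import math
from datetime import datetime
from typing import List, Tuple

# Hypothetical record layout: (site_id, longitude, latitude, visit_time).
Visit = Tuple[int, float, float, datetime]


def knn_in_window(
    visits: List[Visit],
    lon: float,
    lat: float,
    k: int,
    start: datetime,
    end: datetime,
) -> List[int]:
    """Return the IDs of the k nearest sites visited within [start, end]."""
    keyed = (
        (math.hypot(x - lon, y - lat), site_id)
        for site_id, x, y, when in visits
        if start <= when <= end  # assumed inclusive on both ends
    )
    # (distance, site_id) keys give ascending-ID order among equal distances.
    return [site_id for _, site_id in heapq.nsmallest(k, keyed)]
```

Filtering before ranking matters: taking the K nearest sites first and then discarding those outside the window could return fewer than K sites even when enough qualifying sites exist.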
Task 3: Find all exposure sites in a given rectangular area
Input (command line arguments):
argument 1: the filename of the dataset to be queried (e.g., “dataset.csv”. See an example of the file in
Figure 1)
argument 2: an integer indicating the specific task (i.e., 3 in this case)
argument 3: the filename of the test set (i.e., a txt file with a list of testing points, each of which is a rectangle
represented by its top-left and bottom-right geo-locations; refer to “task3_sample.txt” for the format)
Expected Output:
A txt file named “task3_results.txt” containing all sites within the required areas (refer to
“task3_sample_results.txt” for the format)
Query criteria:
− Each testing point is a rectangular area represented by its top-left and bottom-right geo-locations.
− For each testing point, you are expected to output a list of IDs in ascending order, which refer
to all the sites within the testing area. Sites on the boundary of the testing area count.
− In the output file, one line is for one site ID.
− You need to process all testing points in sequence and write the results to the output file (i.e., only one
output file for one testing set containing multiple testing points).
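The rectangle criteria above amount to an inclusive range check on both coordinates. The following sketch assumes that the top-left corner carries the smaller longitude and the larger latitude, and that each corner is given as (longitude, latitude). The function name sites_in_rectangle and the tuple layout are hypothetical, and a linear scan stands in for whatever spatial index you choose.

```python
from typing import List, Tuple

# Hypothetical record layout: (site_id, longitude, latitude).
Site = Tuple[int, float, float]
Point = Tuple[float, float]  # (longitude, latitude)


def sites_in_rectangle(
    sites: List[Site], top_left: Point, bottom_right: Point
) -> List[int]:
    """Return ascending IDs of all sites inside or on the rectangle."""
    left, top = top_left
    right, bottom = bottom_right
    return sorted(
        site_id
        for site_id, lon, lat in sites
        # <= on every side, so boundary and corner sites are included.
        if left <= lon <= right and bottom <= lat <= top
    )
```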
Task 4: Find all exposure sites (Point) in a given rectangular area and within a certain
time window
Input (command line arguments):
argument 1: the filename of the dataset to be queried (e.g., “dataset.csv”. See an example of the file in
Figure 1)
argument 2: an integer indicating the specific task (i.e., 4 in this case)
argument 3: the filename of the test set (i.e., a txt file with a list of testing points, each of which is a rectangle
represented by its top-left and bottom-right geo-locations, together with a time window)
Query criteria:
− Each testing point is a rectangular area represented by its top-left and bottom-right geo-locations, with a time window.
− For each testing point, you are expected to output a list of IDs in ascending order, which refer
to all the sites within the testing area. Sites on the boundary of the testing area count.
− Various time windows will be set for different testing points. Time windows are inclusive.
− In the output file, one line is for one site ID.
− You need to process all testing points in sequence and write the results to the output file (i.e., only one
output file for one testing set containing multiple testing points).
− Different from Task 3, the time window also needs to be considered. The returned sites should have
emerged within the provided time window (e.g., 2020-12-27 12:00:00 – 2020-12-27 14:00:00).
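The inclusive time window can be checked alongside the rectangle test. In the sketch below, the timestamp format string is inferred from the example window above and may not match the actual input files; the names parse_time and sites_in_rectangle_and_window, the tuple layout, and the corner convention (top-left has the smaller longitude and larger latitude) are all assumptions made for illustration.

```python
from datetime import datetime
from typing import List, Tuple

# Assumed from the example "2020-12-27 12:00:00"; verify against the samples.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Hypothetical record layout: (site_id, longitude, latitude, visit_time).
Visit = Tuple[int, float, float, datetime]
Point = Tuple[float, float]  # (longitude, latitude)


def parse_time(text: str) -> datetime:
    """Parse one timestamp in the assumed format."""
    return datetime.strptime(text.strip(), TIME_FORMAT)


def sites_in_rectangle_and_window(
    visits: List[Visit],
    top_left: Point,
    bottom_right: Point,
    start: datetime,
    end: datetime,
) -> List[int]:
    """Return ascending IDs of sites inside the rectangle and time window."""
    left, top = top_left
    right, bottom = bottom_right
    return sorted(
        site_id
        for site_id, lon, lat, when in visits
        if left <= lon <= right
        and bottom <= lat <= top
        and start <= when <= end  # inclusive on both ends
    )
```

A visit stamped exactly at the start or end of the window is returned, consistent with the statement that time windows are inclusive.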
Task 5: Find all exposure sites within a certain distance (d km) of a trajectory emerging
on the same day
Input (command line arguments):
argument 1: the filename of the dataset to be queried (e.g., “dataset.csv”. See an example of the file in
Figure 1)
argument 2: an integer indicating the specific task (i.e., 5 in this case)
argument 3: the filename of the test set (i.e., a txt file with a list of testing points, each of which is
associated with a date when the trajectory occurred and a distance threshold d km (float), refer to
“task5_sample.txt” for the format)
Expected Output:
A txt file named “task5_results.txt” containing all sites that (1) emerged on the same day as the testing
trajectory, and (2) are within d km of any key geo-location of the trajectory (refer to
“task5_sample_results.txt” for the format)
Query criteria:
− Each testing point is a trajectory represented by a number of key geo-locations (e.g., <longitude 1, latitude 1,
longitude 2, latitude 2, longitude 3, latitude 3, …>) with a date when the trajectory occurred.
− For each testing point, you are expected to output a list of IDs in ascending order, which refer to all
the sites that emerged on the same day as the testing point and are within d km of any key geo-location of
the testing point.
− It is assumed that 0.01 units of longitude/latitude equal 1 km. The distance of two geo-locations is
represented by the Euclidean distance in a 2D plane.
− In the output file, one line is for one site ID.
− You need to process all testing points in sequence and write the results to the output file (i.e., only one
output file for one testing set containing multiple testing points).
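The distance rule for Task 5 can be sketched as follows: a threshold of d km corresponds to d × 0.01 coordinate units, and a site qualifies if it lies within that Euclidean radius of at least one key geo-location on the same day. The function name sites_near_trajectory and the (site_id, longitude, latitude, visit_day) layout are hypothetical, sites exactly at distance d are treated as included (an assumption), and a linear scan is used for brevity.

```python
import math
from datetime import date
from typing import List, Tuple

DEGREES_PER_KM = 0.01  # from the specification: 0.01 units equal 1 km

# Hypothetical record layout: (site_id, longitude, latitude, visit_day).
Visit = Tuple[int, float, float, date]
Point = Tuple[float, float]  # (longitude, latitude)


def sites_near_trajectory(
    visits: List[Visit], trajectory: List[Point], day: date, d_km: float
) -> List[int]:
    """Return ascending IDs of same-day sites within d_km of any key point."""
    radius = d_km * DEGREES_PER_KM  # threshold in coordinate units
    return sorted(
        site_id
        for site_id, lon, lat, visit_day in visits
        if visit_day == day
        and any(math.hypot(lon - x, lat - y) <= radius for x, y in trajectory)
    )
```

With d = 1.0, for instance, the radius is 0.01 coordinate units, so a site 0.005 units from a key geo-location qualifies while one 0.08 units away does not. Only the key geo-locations are tested here, not the segments between them, which matches the wording "any key geo-location of the trajectory".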