Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
[Assignment 1]
Taxi & For-Hire Vehicle Trip Dataset
Information Retrieval using Binary Search Trees
Handed out: Friday, 23 of August
Due: 8:00 AM, Monday, 9 of September
Purpose
The purpose of this assignment is for you to:
? Increase your proficiency in C programming, your dexterity with dynamic memory allocation
and your understanding of linked data structures, through programming a dictionary.
? Increase your understanding of how computational complexity can affect the performance of an
algorithm by conducting orderly experiments with your program and comparing the results of
your experimentation with theory.
? Increase your proficiency in using UNIX utilities.
Background
A dictionary is an abstract data type that stores and supports lookup of key, value pairs. For example,
in a telephone directory, the (string) key is a person or company name, and the value is the phone
number. In a student record lookup, the key would be a student ID number and the value would be a
complex structure containing all the other information about the student.
A dictionary can be implemented in C using a number of underlying data structures. Any implementation
must support the operations: makedict a new dictionary; insert a new item (key, value
pair) into a dictionary; search for a key in the dictionary, and return the associated value. Most
dictionaries will also support the operation delete an item.
Your task
In this assignment, you will create a simple instance of a dictionary, and we’ll use it to look up information
about for-hire vehicle trips in New York City.
There are two stages in this project. In the first stage you will code a dictionary in the C programming
language, using a binary search tree as the underlying data structure. You will insert records into this
dictionary from a file, and look up records by key. In the second stage, you will code additional functions
to retrieve information from this dictionary. You will use a Makefile to direct the compilation
of two separate executable programs, one for Stage 1 and one for Stage 2.
In both stages of the assignment, you will report on the number of key comparisons used for search
and analyse what would have been expected theoretically. The report should cover each file used to
initialize the dictionary.
You are not required to implement the delete functionality.
1
Stage 1 (7 marks)
In Stage 1 of this assignment, your Makefile will direct the compilation to produce an executable
program called dict1. The program dict1 takes two command line arguments: the first argument is
the name of the data file used to build the dictionary; the second argument is the name of the output
file, containing the data located in the searches. The data file consists of an unspecified number of
records, one per line, with the following information:
VendorID - Code to indicate the vendor which produced the record
passenger count - Number of passengers
trip distance - Length of the trip in miles
RatecodeID - Code to represent the fare rate for the trip
store and fwd flag - Indicates whether trip records were stored locally
PULocationID - TLC Taxi Zone where passengers were picked up
DOLocationID - TLC Taxi Zone where passengers were dropped off
payment type - Code to indicate payment type (e.g., cash)
fare amount - Fare for the trip in USD
extra - Extra charges in USD
mta tax - MTA tax in USD
tip amount - Tip in USD
tolls amount - Tolls in USD
improvement surcharge - Improvement surcharge in USD
total amount - Total cost of the trip in USD
PUdatetime - Date/time passengers were picked up
DOdatetime - Date/time passengers were dropped off
trip duration - Duration of the trip in minutes
This data comes from a publicly-available dataset released by the New York City Taxi & Limousine
Commission. More information about the dataset can be found at:
https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
The field <PUdatetime> is an alphabetic string representing the date and time of the taxi trip in
the format YYYY-MM-DD HH:mm:ss (year-month-day hour:minute:second). The other columns can be
treated simply as the associated <data> field. Build a data structure of strings to save the associated
data collected about each taxi trip. The maximum size that any string can be is 128 characters. Each
string is separated by a comma “,”. This is a standard csv format where the delimiter used is a comma.
The <PUdatetime> field will serve as the dictionary key, so the records will be sorted in temporal
order. Note that because the datetime information is stored in lexicographical order, the values can be
compared as strings (e.g., with strcmp()) to determine which trip is earlier/later. The <data> is the
information sought during lookup.
In this assignment the search keys are not guaranteed to be unique – there are instances where multiple
taxis pick up passengers at exactly the same day and time. You should handle duplicates by
implementing a linked list for items with same key.
For the purposes of this assignment, you may assume that the input data is well-formatted, that the
input file is not empty, and that the maximum length of an input record (a single full line of the csv
file) is 256 characters. This number could help you to determine the buffer size to use when reading
the file.
In this first stage of the assignment, you will:
? Construct a binary search tree to store the information contained in the file specified in the
2
command line argument. Each record should be stored in a separate Node.
? Search the binary search tree for records, based on their keys. The keys are read in from stdin,
i.e. from the screen.