An End-to-End Data Science Project
Project Weight: 30%
Project Overview
This project aims to make a quantitative analysis of the New York City Taxi and Limousine Commission (TLC) Trip Record Data. The dataset covers trips taken in various types of taxi and for-hire vehicle services in the New York City area. The data in parquet format is directly downloadable from here, with the corresponding usage guide linked here. You must choose at least 6 months of data if working with PySpark, or at least 3 months if working with pandas, from 2016 or later (ensure your data includes Zones, not coordinates). PySpark is the expectation here; you must obtain permission from your tutor before using pandas.
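As a rough illustration of the selection rule above, the following sketch checks a chosen time frame against the minimum month counts and the 2016 cut-off. The helper names `select_months` and `parquet_names` are hypothetical. The sketch also assumes a consecutive run of months and a `<service>_tripdata_<YYYY-MM>.parquet` file naming pattern, neither of which is stated in this brief, so check the usage guide before relying on them.

```python
# Minimum number of months required by the brief for each tool.
MIN_MONTHS = {"pyspark": 6, "pandas": 3}
EARLIEST_YEAR = 2016


def select_months(start_year, start_month, n_months, tool="pyspark"):
    """Return consecutive 'YYYY-MM' labels, validating the brief's constraints.

    Assumption: months are consecutive. The brief only sets a minimum count.
    """
    tool = tool.lower()
    if tool not in MIN_MONTHS:
        raise ValueError(f"unknown tool: {tool!r}")
    if not 1 <= start_month <= 12:
        raise ValueError("start_month must be between 1 and 12")
    if start_year < EARLIEST_YEAR:
        raise ValueError(f"data must be from {EARLIEST_YEAR} or later")
    if n_months < MIN_MONTHS[tool]:
        raise ValueError(
            f"{tool} requires at least {MIN_MONTHS[tool]} months, got {n_months}"
        )

    months = []
    year, month = start_year, start_month
    for _ in range(n_months):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return months


def parquet_names(months, service="yellow"):
    """Build file names using an ASSUMED naming pattern (not from the brief)."""
    return [f"{service}_tripdata_{m}.parquet" for m in months]


if __name__ == "__main__":
    chosen = select_months(2018, 10, 6, tool="pyspark")
    for name in parquet_names(chosen):
        print(name)
```

Validating the time frame before downloading anything makes it harder to end up with too few months or with pre-2016 data that records coordinates rather than Zones.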
Students will be required to prepare a self-contained report, written using LaTeX, which must be 6-8 pages including figures and excluding references. Please do not submit any other format, such as a document written in Word or Google Docs; we are expecting a compiled PDF written in LaTeX. There are no exceptions.
Project Expectations
Please refer to the Canvas Subject Overview for expectations and further information.
We understand that the page limit is strict and quite short. This project aims to teach students to summarise information concisely and professionally. This matters because the results of Project 1 will be used to decide which project each student group is allocated for Project 2 (Industry Project).
Lastly, we know that the best way to learn new tools is to use and apply them in a project; consider this to be “the project”. Please try your best; the tutor team will be here to support you where possible. Sample solutions will be released at the start of Week 2.
Note: Students in prior years have often found themselves underestimating the time commitment required for this project. Be sure to start it ASAP; you should aim to have all of the results by the start of Week 4, so you can spend the remaining time writing.
Project Assumptions
Students are free to choose any software, language, or package that is deemed useful to complete this project, although it is strongly recommended that Python and PySpark be used.
A LaTeX report template will be provided, and students are not allowed to change the margins or font size. Students who prepare their own document templates will be required to add margin commands to adhere to the requirements; otherwise, there will be penalties. We have been very clear here, so do not submit any document that was not written in LaTeX.
Students must maintain a GitHub repository with an appropriate and documented README.md file. A template repository has been provided for your benefit under Canvas → Modules → Project 1 Links → Templates via GitHub Classrooms.
Students are free to choose the timeline they analyze, the type of Licensed Taxi they wish to focus on (e.g. Yellow vs Green Taxi, Taxi vs For-Hire Vehicles), and the attributes for their area of study. Once again, make sure the time frame chosen is 2016 onward.
Students should use any external datasets which are deemed sufficiently relevant to support the analysis and attributes of the study.
The timeline and dataset must be sufficiently “large” and justified to support your research goal. Students may subsample the data when visualizing or fitting a model, provided this is stated and justified in the report (otherwise you will be penalised). However, the full dataset must be used when analyzing distributions, aggregating attributes, or performing outlier analysis.
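The distinction in the last assumption can be illustrated with a minimal, standard-library sketch. It is an illustration under stated assumptions, not a prescribed method: the `fares` list is synthetic toy data, the helper names `iqr_bounds` and `subsample` are hypothetical, and the 1.5 × IQR rule is just one common choice of outlier criterion. Outlier bounds are computed from every record, and a seeded subsample is drawn only for plotting or model fitting.

```python
import random
import statistics


def iqr_bounds(values, k=1.5):
    """Tukey-style fences computed from ALL values (no subsampling)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def subsample(rows, fraction, seed=42):
    """Reproducible random subsample, intended only for plots or model fitting."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


# Synthetic stand-in for a fare column (NOT real trip data).
fares = [5.0 + (i % 40) * 0.5 for i in range(10_000)] + [950.0, -20.0]

# Outlier analysis uses the full data.
low, high = iqr_bounds(fares)
outliers = [f for f in fares if f < low or f > high]

# Visualisation may use a subsample, which must be stated and justified.
plot_sample = subsample(fares, fraction=0.05)

if __name__ == "__main__":
    print(f"bounds: ({low}, {high}), outliers found: {len(outliers)}")
    print(f"plotting {len(plot_sample)} of {len(fares)} records")
```

In PySpark the same split would typically rely on `DataFrame.approxQuantile` for the full-data quantiles and `DataFrame.sample` with a fixed `seed` for the plotting subset. Record the sampling fraction and seed in the report so the justification is verifiable.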