DA5020 – Practicum III
In this practicum you will use the k-nearest neighbor algorithm to predict a continuous variable. Each
question in the practicum follows the CRISP-DM framework. The practicum was designed in this manner
to help you to practice and conceptualize each phase, based on the requirements of an actual project.
This is a group practicum, which means that you may choose to work in groups of up to three students.
You may fully collaborate and submit the same work. However, you must include all students' names
on all submitted work. If a group member is not adequately contributing, the remaining team members
may "vote to eject" the student from the team by emailing me the reason. In such an event, the team
member who was "fired" must still complete the project individually by the due date.
If you are working in groups, you can self-signup in Canvas or notify me via email by April and
I will create the group for you. Ensure that you include your name and the name(s) of your group
member(s) in the email and cc them.
Practicum Tasks
CRISP-DM: Business Understanding
The NYC Taxi and Limousine Commission (TLC) publishes a dataset on yellow and green taxi trip
records which include: pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances,
itemized fares, rate types, payment types, and driver-reported passenger counts. For more information on
the dataset, visit the following website and view the accompanying data dictionary.
Description of the Problem
You are hired as a Machine Learning Engineer, on the Data Insights and Analytics Team, for the NYC
Taxi and Limousine Commission (TLC). Your first assignment is to analyze the trip data from the Green
Taxis; more specifically, you need to evaluate where passengers use these cabs and how frequently.
However, your main objective is to evaluate the factors that contribute toward cab drivers being
incentivized (i.e., what determines whether or not they receive a tip). This will enable you to build a model
that can be used to predict the tip amount for future trips.
In this use-case, you will conduct your analysis using the NYC Green Taxi Trip Records for February
2020 and build a k-nn regression model to predict the tip amount. You are free to use any libraries to
support your analysis.
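For orientation, k-nn regression predicts a continuous value by averaging the target values of the k closest training rows. The following is a minimal sketch in Python using only the standard library. The function name `knn_regress`, the two toy features, and all values are made up for illustration and are not taken from the TLC data. It assumes Euclidean distance, an unweighted mean, and features that are already on a comparable scale.

```python
import math

def knn_regress(train_X, train_y, query, k=3):
    """Predict a continuous target as the mean target of the k nearest training rows."""
    # Euclidean distance from the query point to every training row
    distances = [
        (math.dist(row, query), target)
        for row, target in zip(train_X, train_y)
    ]
    # Keep the k closest rows; sorted() is stable, so ties keep their original order
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Toy, made-up rows with two hypothetical features on a comparable scale
train_X = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (10.0, 10.0)]
train_y = [1.0, 2.0, 3.0, 10.0]  # hypothetical tip amounts

prediction = knn_regress(train_X, train_y, query=(1.0, 1.0), k=3)
print(prediction)
```

The distant fourth row does not influence the prediction when k is 3, which is why the choice of k and the scaling of the features both matter.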
Question 1 — (20 points) +10 optional points
CRISP-DM: Data Understanding
• Load the NYC Green Taxi Trip Records data directly from the URL into a data frame or tibble.
• Data exploration: explore the data to identify any patterns and analyze the relationships between the
features and the target variable, i.e., the tip amount. At a minimum, you should analyze: 1) the distribution,
2) the correlations, 3) missing values, and 4) outliers. Provide supporting visualizations and explain
all your steps.
Tip: remember that you have worked with this dataset in your previous assignments. You are free to
reuse any code that supports your analysis.
• Feature selection: identify the features/variables that are good indicators and should be used to predict
the tip amount. Note: this step involves selecting a subset of the features that will be used to build the
predictive model. If you decide to omit any features/variables ensure that you briefly state the reason.
• Feature engineering (+10 bonus points): create a new feature and analyze its effect on the target
variable (i.e., the tip amount). Ensure that you calculate the correlation coefficient and also use
visualizations to support your analysis. Summarize your findings and determine if the new feature is a
good indicator to predict the tip amount. If it is, ensure that you include it in your model. If it is not a
good indicator, explain the reason.
NOTE: If you attempt this bonus question, ensure that you create a meaningful feature (and nothing
arbitrary). If you are unable to think of something meaningful, do not become fixated on this. There
is another bonus question that you can attempt later in the practicum.
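The correlation and outlier checks named above can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not a required approach. The helper names `pearson_r` and `iqr_outliers`, the example feature (trip duration in minutes), the timestamps, and the tip values are all hypothetical. The outlier rule assumed here is the common 1.5 × IQR fence.

```python
import math
import statistics
from datetime import datetime

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined (division by zero) if either sequence is constant
    return cov / math.sqrt(var_x * var_y)

def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical engineered feature: trip duration in minutes from made-up timestamps
fmt = "%Y-%m-%d %H:%M:%S"
pickups = ["2020-02-01 08:00:00", "2020-02-01 09:00:00",
           "2020-02-01 10:00:00", "2020-02-01 11:00:00"]
dropoffs = ["2020-02-01 08:10:00", "2020-02-01 09:20:00",
            "2020-02-01 10:30:00", "2020-02-01 11:40:00"]
durations = [
    (datetime.strptime(d, fmt) - datetime.strptime(p, fmt)).total_seconds() / 60
    for p, d in zip(pickups, dropoffs)
]
tips = [1.0, 2.5, 2.0, 4.0]  # made-up tip amounts

r = pearson_r(durations, tips)
print(f"correlation between duration and tip: {r:.3f}")
print("outliers:", iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```

A coefficient near 0 would suggest that the new feature carries little linear information about the tip amount, although a visualization is still needed to rule out a non-linear relationship.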
Question 2 — (20 points)
CRISP-DM: Data Preparation
• Prepare the data for the modeling phase and handle any issues that were identified during the
exploratory data analysis. At a minimum, ensure that you:
• Preprocess the data: handle missing data and outliers, perform any suitable data transformation
steps, etc. Also, ensure that you filter the data. The goal is to predict the tip amount; therefore, you
need to ensure that you extract the data that contains this information. Hint: read the data dictionary.
• Normalize the data: perform either min-max normalization or z-score standardization on the
continuous variables/features.
• Encode the data: determine if there are any categorical variables that need to be encoded and
perform the encoding.
• Prepare the data for modeling: shuffle the data and split it into training and test sets. The percent
split between the training and test sets is your decision. However, clearly state the reason for your choice.
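The preparation steps above can be sketched as follows in Python, using only the standard library. This is a minimal illustration on made-up values. The helper names (`min_max`, `z_score`, `one_hot`, `shuffle_split`), the 80/20 split, and the seed are assumptions chosen for the example, not requirements of the practicum.

```python
import random
import statistics

def min_max(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    # Undefined (division by zero) if all values are identical
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def one_hot(values):
    """Encode a categorical column as 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def shuffle_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

# Made-up example values
scaled = min_max([2.0, 4.0, 6.0])
standardized = z_score([1.0, 2.0, 3.0])
encoded = one_hot(["cash", "card", "cash"])  # columns ordered as: card, cash
train, test = shuffle_split(range(10), test_fraction=0.2)

print(scaled, encoded, len(train), len(test))
```

Fixing the seed keeps the shuffle reproducible, so the same training and test sets are produced each time the analysis is rerun.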