IFT 6758 Project: Milestone 1
Due date: Oct. 15, 2021

The goal of this milestone is to give you experience with the data wrangling and exploratory data analysis phases of a data science project, which are often where you will spend most of your time. You will gain experience with some of the common tools used to retrieve and manipulate data, and build confidence in creating tools and visualizations that help you understand the data before jumping into more advanced modeling.

Broadly, the outline for this milestone is to use the NHL stats API to retrieve both aggregate data (player stats for a given season) and "play-by-play" data for a specific time period, and to generate plots. You will begin by creating simple visualizations from the aggregate data that do not require much preprocessing, and then move on to creating interactive visualizations from the play-by-play data, which will require more preparation. A small number of simple qualitative questions related to these tasks must be answered along the way. Finally, you will present your work in the form of a simple static web page, created using Jekyll.

Note that the work you do in this milestone will be useful for future milestones, so make sure your code is clean and reusable - your future self will thank you for it!

Contents
● A note on Plagiarism
● NHL data
● Motivation
● Learning Objectives
● Deliverables
● Submission Details
● Tasks and Questions
  ○ 1. Warm-up (10%)
  ○ 2. Data Acquisition (25%)
  ○ 3. Interactive Debugging Tool (bonus 5%)
  ○ 4. Tidy Data (10%)
  ○ 5. Simple Visualizations (25%)
  ○ 6. Advanced Visualizations: Shot Maps (30%)
  ○ 7. Blog Post (up to 30% penalty)
● Group Evaluations
● Useful References

A note on Plagiarism

Using code or templates from online resources is acceptable and common in data science, but be clear about citing exactly where you took code from when necessary. A simple one-line snippet covering basic syntax from a StackOverflow post or package documentation probably doesn't warrant a citation, but copying a function that implements a bunch of logic to create your figure does. We trust that you can use your best judgement in these cases, but if you have any doubts you can always cite something to be safe. We will run plagiarism detection on your code and deliverables, and it is better to be safe than sorry by citing your references. Integrity is an important expectation of this course project, and any suspected cases will be pursued in accordance with Université de Montréal's very strict policy. The full text of the university-wide regulations can be found here. It is the responsibility of the team to ensure that this is followed rigorously, and action can be taken against individuals or the entire team depending on the case.

NHL data

The subject matter for this project is hockey data, specifically the NHL stats API. This data is very rich; it contains information from many years into the past, ranging from metadata about the season itself (e.g. how many games were played), to season standings, to player stats per season, to fine-grained event data for every game played, known as play-by-play data. If you're unfamiliar with play-by-play data, the NHL uses this exact data to generate their play-by-play visualizations, an example of which is shown below.
For a single game, roughly 200-300 events are tracked, typically limited in scope to faceoffs, shots, goals, saves, and hits (no passes or individual player locations). Note that there is a logical way in which games are assigned a unique ID, which is described here (take care to note the difference between regular season and playoff games!).

Figure 1: Sample play-by-play data for game 2017020001 at the beginning of the 1st period. Each event contains an identifier (e.g. "FACEOFF", "SHOT", etc.), a description of the event, the players involved, as well as the location of the event on the ice (drawn on the ice rink to the far right). The raw event data contains more information than this. You can explore this game's play-by-play here.

The time of the event, the event type, the location, the players involved, and other information is recorded for each event, and the raw data is accessible through the play-by-play API. For example, the raw data for the above play-by-play can be found here:

https://statsapi.web.nhl.com/api/v1/game/2017020001/feed/live/

A snippet of the raw event data can be seen in Figure 2. You will need to explore the data and read the API docs to figure out exactly what you will need (a minimal fetch sketch is shown at the end of this section).

Figure 2: Raw JSON data obtained from the NHL stats API for the same events as in Figure 1. Note that there are other events between the desired ones - it will be up to you to explore those!

Although technically undocumented, there is a very detailed unofficial API document maintained by the community, which should be the first place you look for information about the API.
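To make the endpoint above concrete, here is a minimal fetch sketch, assuming the requests package is installed. The helper name fetch_game is our own, and the JSON keys printed at the end are what we expect from the live feed; double-check them against the unofficial API docs.

```python
import requests

def fetch_game(game_id: str) -> dict:
    """Download the raw play-by-play JSON for a single game."""
    url = f"https://statsapi.web.nhl.com/api/v1/game/{game_id}/feed/live/"
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on a bad GAME_ID or network error
    return response.json()

# Game 2017020001 is the one shown in Figures 1 and 2.
data = fetch_game("2017020001")
print(len(data["liveData"]["plays"]["allPlays"]))  # number of tracked events
```

Running something like this once gives you a feel for the nesting of the JSON before you design your tidy format.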
Motivation

While we understand some people may not be sports fans, we think this is an exciting dataset to work with for a number of reasons:

1. It is a real-world dataset that is used by professional data scientists, some of whom are employed by the NHL teams themselves, while others run their own analytics businesses.
2. During the hockey season, the data is updated live as games are in progress! This gives you the opportunity to interact with new data frequently, and some insight into why "pipelining" and writing clean, reusable code are critical to a successful data science workflow.
3. It is very rich, as discussed above.
4. It is "clean" in the sense that the API is consistent and you will not have to deal with parsing or cleaning nonsensical data.
5. It is "messy" in the sense that all of the raw data is in JSON and is not immediately suitable for use in a data science workflow. You will need to "tidy" the data into a usable format, which is a significant portion of many data science projects. Because the data already comes in a consistent format, we think this strikes a good balance between giving you some real cleaning work to do while not being unreasonable.
6. Hockey is often a great conversation facilitator here in Canada (and especially in Montreal). If you're new to Canada, this is a great way to learn a little bit about our culture :)

Even if you are not a hockey fan, we hope that you will find this project experience interesting and educational. We think that playing with real-world data is more rewarding and far more representative of the data science workflow than working with prepared datasets such as those available on Kaggle. If you are particularly proud of your project, some of the deliverables will teach you how to host your content in a publicly accessible way (via GitHub Pages), which can help you in your future internship or job hunts!

Learning Objectives

● Data acquisition and cleaning
  ○ Understand what a REST API is
  ○ Programmatically download data from the internet using Python
  ○ Format the raw data into useful tables of data
  ○ Get familiar with the idea of "pipelining" your work, i.e. creating logically separated components such as:
    ■ Download and save data
    ■ Load raw data
    ■ Process raw data into some format
● Data exploration
  ○ Explore the raw data and understand what it looks like
  ○ Build simple interactive tools to help you work with the data more efficiently
● Visualization and Exploratory Data Science
  ○ Gain some intuition and answer some simple questions about the data by looking at visualizations
  ○ Use Matplotlib and Seaborn to create nice figures
  ○ Create interactive figures to communicate your results more effectively

Deliverables

You must submit BOTH:
1. A blog post style report
2. Your team's codebase, which must be reproducible

Instead of a traditional report written in LaTeX, you will be asked to submit a blog post which will contain discussion points and (interactive!) figures. We will provide a template and instructions by Sep. 22, 2021, so don't worry about having to figure it all out by yourself. At a high level, you will use Jekyll to create a static web page from Markdown. This is a very simple way to create nice looking pages, and could be very useful if you want to write blog posts in the future or wish to buff up your resume during a job hunt. Although we will not be deploying these pages to the public [1], it is very simple to use GitHub Pages to publish your content. You're more than welcome to do this at the end of the course!

[1] A caveat about GitHub Pages is that even if your repo is private, any published pages are public. You may not even be able to publish a page from a private repo if you didn't get your free student GitHub Pro account. Because we don't want groups to have their pages visible to one another while they are working on the project, you should not publish the pages you create; instead, render them locally.

Submission Details

To submit your project, you must:
● Publish your final milestone submission to the master or main branch (you must do this first, before downloading the ZIPs!)
● Submit a ZIP of your blog post to Gradescope
● Submit a ZIP of your codebase to Gradescope
● Add the IFT6758 TA GitHub account to your git repo as a viewer

To submit a ZIP of your repository, you can download it via the GitHub UI. Remember that this method does not download the whole git repo, just the master or main branch. Make sure all of your code is committed to the master or main branch before downloading the ZIPs.

Tasks and Questions

The tasks required for Milestone 1 are outlined here. An overall description of what is required is given at the beginning of each task, and the Questions section of each task outlines the content required in the blog post. This may be interpretation questions, or figures and images that you must produce and include in the blog post. We try to write most of the required items in bold, but make sure you answer everything asked of you in each question. We do not expect long responses to the questions; in most cases a few sentences will suffice.

1. Warm-up (10%)

Let's start with some very simple plots for visualizing player statistics to get your feet wet.
This part will be a bit different from the next sections, because it is rather tedious to get player stats from the NHL API. While it's certainly possible, it's much easier to just scrape a webpage that already tabulates the exact data that we want. Because this is a bit disconnected from the next tasks, we provide a function that scrapes the data and formats it into a DataFrame for you to work with. You will still need to be mindful of NaNs, however!

Questions

We'll explore goaltenders and consider who was the best goaltender of the 2017-2018 season. Use the provided function to download the goalie stats for the 2017-2018 season.

1. Sort the goalies by their save percentage ('SV%'), which is the ratio of shots saved to the total number of shots faced. What issues do you notice when using this metric to rank goalies? What could be done to deal with this? Add this discussion to your blog post (no need for the dataframe or a plot yet). Note: you don't need to create a fancy new metric here. If you'd like, you can do a sanity check against the official NHL stats webpage. You also don't need to reproduce any particular ranking on the NHL page; if your approach is reasonable you will get full marks.
2. Filter the goalies using your proposed approach above, and produce a bar plot with player names on the y-axis and save percentage ('SV%') on the x-axis. You can keep the top 20 goalies. Include this figure in your blog post; ensure all of the axes are labeled and the title is appropriate. (A minimal plotting sketch follows these questions.)
3. Save percentage is obviously not a very comprehensive feature. Discuss what other features could potentially be useful in determining a goalie's performance. You do not need to implement anything unless you really want to; all that's required is a short paragraph of discussion.
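As a rough starting point for question 2 (not a required solution), here is a minimal sketch. It assumes the provided scraper returned a pandas DataFrame df containing 'Player', 'SV%', and shots-against ('SA') columns; those column names and the 50-shot cutoff are our own assumptions, so adapt them to the actual data.

```python
import matplotlib.pyplot as plt

# df = get_goalie_stats(2017)  # hypothetical provided scraper

# Hypothetical columns: 'Player', 'SV%', 'SA' (shots against).
# Dropping NaNs and requiring a minimum workload (an arbitrary cutoff)
# removes goalies whose SV% is inflated by a tiny sample of shots.
goalies = df.dropna(subset=["SV%"])
goalies = goalies[goalies["SA"] >= 50]
top20 = goalies.sort_values("SV%", ascending=False).head(20)

fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(top20["Player"], top20["SV%"])
ax.invert_yaxis()  # put the best goalie at the top
ax.set_xlabel("Save percentage (SV%)")
ax.set_ylabel("Player")
ax.set_title("Top 20 goalies by SV%, 2017-2018 season")
fig.tight_layout()
plt.show()
```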
2. Data Acquisition (25%)

Create a function or class to download NHL play-by-play data for both the regular season and playoffs. You will need to read the unofficial API doc to understand how GAME_ID is formed. You can open the endpoint in your browser to explore the raw JSON a little (Firefox has a nice built-in JSON viewer). Use your tool to download data from the 2016-17 season all the way up to the 2020-21 season. You can implement this however you wish, but if you need guidance, here are some tips:

1. This is a public API, so be mindful that someone else is paying for the requests. Download the raw data once and save it locally, then use the local copy to derive tidy/usable datasets from it.
2. Do not commit the data (or large binary blobs) to your GitHub repo. This is bad practice: git is for code, not file storage. While it may be possible for the dataset you will be working with here, it likely won't be when you work on larger-scale projects in industry or academia, and large git repos become slow to clone and work with. Note that because of the way git works, once you commit and push a file, simply removing it and committing the deletion won't actually delete it; you would need to rewrite the git history, which is risky. A good way to avoid accidentally committing files is to use a .gitignore file and add whatever file patterns you want to exclude (such as *.npy or *.pkl).
3. A nice pattern is to define a function that accepts the target year and a filepath as arguments, and checks the specified filepath for a file corresponding to the dataset you are about to download. If the file exists, the function can immediately open it and return the saved contents; if not, it can download the contents from the REST API and save them to the file before returning the data. The first time you run the function it will automatically download and cache the data locally, and subsequent runs will load the local copy instead. Consider using environment variables so that each teammate can specify a different location, with your function automatically reading the location from the environment variable; that way you don't have to fight about paths in your git repository. (A sketch of this pattern follows this list.)
4. If you want to get even fancier, you could push this logic into a class that implements the behaviour suggested in (3). This lends itself nicely to how the data is separated by hockey seasons, and would allow you to add logic that generalizes to any other season you may wish to analyze, in a clean and scalable way.
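Here is a hedged sketch of the caching pattern in tip (3): the function name, the NHL_DATA_DIR environment variable, and the one-file-per-game layout below are our own assumptions, not requirements of the assignment.

```python
import json
import os
from pathlib import Path

import requests

def get_game(game_id: str, data_dir=None) -> dict:
    """Return play-by-play JSON for one game, downloading and caching it on first use."""
    # Each teammate can point NHL_DATA_DIR somewhere different on their machine.
    data_dir = Path(data_dir or os.environ.get("NHL_DATA_DIR", "./data"))
    data_dir.mkdir(parents=True, exist_ok=True)
    cache_file = data_dir / f"{game_id}.json"

    if cache_file.exists():  # cache hit: load the local copy, no API request made
        return json.loads(cache_file.read_text())

    url = f"https://statsapi.web.nhl.com/api/v1/game/{game_id}/feed/live/"
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()
    cache_file.write_text(json.dumps(data))  # cache miss: save for next time
    return data
```

Looping a helper like this over all regular-season and playoff GAME_IDs from 2016-17 through 2020-21 (see the unofficial docs for how the IDs are formed) builds up your full local dataset, and the class version in tip (4) would simply wrap this logic per season.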