Principles of Data Science MATH575A
Where to find data?
Grading
Examples for inspiration
Important Dates
About project proposal
About exploratory analysis
About blog posts
About the analysis document
A bunch of interesting data sets available online
Final Project
Math 488P/575A: Principles of Data Science
Your final project is to do a novel data analysis to answer a question and write about it. This can be
interpreted broadly and the requirements are discussed below.
The rough outline of the project is:
Start with a question.
Find data that might get at that question.
Play around with the data.
Attempt to answer the question.
Iterate.
Communicate.
Your project should have one significant aspect to it. Examples might include,
put together a novel data set (e.g. scrape something from the web)
answer an interesting question
a “sophisticated” statistical/machine learning model
a really compelling visualization
Summer 2021 Principles of Data Science (MATH-575… HZ
You can work solo or work in groups of up to 3 people. I can generate an initial non-binding group
assignment. You could take my recommendation or totally ignore it and find your own
teammates. See below for grading details and the group work policies.
Final deliverable
There are two final deliverables: a blog post and the analysis document. The final project is due
Tuesday July 6th at 11:59pm.
Blog post
Write a blog post in R Markdown aimed at a general audience (think 538).
should be 1000-1500 words
have at least two figures
See the section “About blog posts” below.
Analysis document
All analysis documents should be posted and well documented. The main technical results (plots,
regressions, etc) should be written up in a well documented, supporting technical document
(using R Markdown). You might also include R scripts for cleaning data or helper functions.
See the section “About the analysis document” below.
Where to find data?
You can find a seriously large amount of data online. I encourage you to “gather your own data
online” by doing something like scraping Twitter (http://varianceexplained.org/r/trump-tweets/)
though this is not expected.
There are some obvious places to look, like data.gov. I’ve put together a collection of
interesting data sets you can find online at the bottom of this page.
If you are already doing research with a data set you are welcome to use it, but you have to do
something new.
Grading
Your team’s grade will be 50% blog post and 50% analysis document. Your individual grade will be
weighted by your team members’ reviews.
The project will be graded on
Communication
Clear writing (both in the blog post and in the supporting technical document)
Document code
Accuracy
Did you use reasonable statistics?
Does your final code run?
How well do your findings support your conclusions? Note that “The evidence is
inconclusive” is a very possible, and completely acceptable answer.
Ambition
The project should take some creativity and effort, i.e. it should be more than a matter
of copy/pasting code.
Groupwork
You will anonymously rate your team members and yourself on team citizenship
(e.g. attends meetings, does what they promise, etc), not on ability.
Final grades will be adjusted based on peer ratings.
Individual grades are based on the project grade and a multiplier computed
from the peer ratings. This multiplier will range from 1.05 (for people who go
above and beyond) down to 0 (for people who don’t participate).
As a last resort you may fire a team member who refuses to participate. Please
contact the instructor well before it comes to this.
If you are fired you must start a new project and your peer rating multiplier will
take a hit.
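The grading arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming scores on a 0-100 scale. The function names are hypothetical, and the rule that turns peer ratings into a multiplier is not specified here, so the multiplier is simply taken as an input.

```python
def team_grade(blog_score: float, analysis_score: float) -> float:
    """Team grade: 50% blog post, 50% analysis document."""
    return 0.5 * blog_score + 0.5 * analysis_score


def individual_grade(team_score: float, multiplier: float) -> float:
    """Scale the team grade by a peer-rating multiplier.

    The multiplier must lie between 0 (no participation) and 1.05
    (above and beyond). How it is derived from the peer ratings is
    an open detail, so it is passed in directly (assumption).
    """
    if not 0.0 <= multiplier <= 1.05:
        raise ValueError("multiplier must be between 0 and 1.05")
    return team_score * multiplier


if __name__ == "__main__":
    team = team_grade(blog_score=90, analysis_score=80)
    print(team)                          # team score before peer adjustment
    print(individual_grade(team, 1.0))   # an unadjusted team member
```

A multiplier of 1.0 leaves the team grade unchanged; values below 1.0 lower it and values up to 1.05 raise it.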
Examples for inspiration
These are some examples of interesting analyses. Many of these examples would take longer than
you have for the final project. These are meant to be inspirations but not expectations.
Blog posts from polygraph
David Robinson’s text analysis of Trump tweets (http://varianceexplained.org/r/trump-tweets/)
Important Dates
Initial project proposal: due 6/23 at 11:59pm
Describe your proposed project
Who is on your team?
What question(s) will you try to answer?
What data sets will you use? You should have already found and taken a first
look at the data sets.
How will you use the data to try to answer the question?
Project proposals should be submitted as Piazza questions for all other students to
see. I will comment on these proposals. Note that these comments are meant
to help you refine your goals. You are not obligated to complete all tasks that you
promised in the proposal.
Exploratory analysis: due 6/30 at 11:59pm
Write up your initial findings in an R Markdown document.
You should have at least N plots (N is still to be decided, but it will be greater
than 3).
Analysis document: due 7/6 at 11:59pm
Write up your technical results in an R Markdown document.
Provide detailed comments so that it is clear to me what you have done.
Put all code, data, etc together.
Blog post: due 7/6 at 11:59pm
should be 1000-1500 words
have at least two figures
target general audience
About project proposal
Write a project proposal with your team.
You should brainstorm a long list of ideas, then narrow it down to a couple that are feasible given
your knowledge of R, the time constraints, and the available data. Write the proposal for one of
these ideas, but you should keep a couple backups in case the original project doesn’t work out
for some reason.
The point of this exercise is to think through a reasonable project (and get feedback from the
instructor). You will not be held to doing exactly what you say you will do in this proposal; expect
to adapt your project as you continue to work on it (just ask Robert Burns).
Deliverable
Write a one page proposal posted on Piazza which discusses:
What questions will you try to answer? List 5-10 possible questions.
What data sets will you use? You should have already found and taken a first look at the
data sets. Make sure the data is clean enough to reasonably use and actually has the
information content to answer your questions.
What are some things you will do with the data to get at your questions? For example what
Include a list of 3 backup ideas you brainstormed, with a couple bullet points of detail. Just in
case.
Advice
Meet once very early for an initial brainstorm. Have everyone go off and explore some ideas. Meet
again for a final brainstorm. Then write the proposal.
Look at the data sets you plan on using to make sure they aren’t awful. If you plan on creating a
data set (e.g. by scraping a website) convince me this will be feasible (you don’t have to have the
scraper working perfectly).
About exploratory analysis
By this point you should have done an exploratory analysis and have initial results. What this
means will vary from project to project so there aren’t many formal requirements. The point of
this is to: take stock of where you are, show me that you have made good progress and convince
me you will be able to finish the project.
Basically we expect to see that you
have the data
asked/answered a bunch of questions by making lots of plots and computing statistics
narrowed down the scope of the project to something coherent and manageable
have some initial results
What “initial results” means will also vary from project to project. For example, if the project is to
build a model to predict Y based on X then you should have looked at a few simple models.
Deliverables
Gather everything into one folder called n_eda (where n = your group number, which I will assign
to your group). This folder should have four subfolders: /summary, /initial_results, /everything, and
/data. Please zip the n_eda folder and submit it to the Google Form that I will set up.
1. Write a summary of what you have tried, what you found and what you have left to do. This
document should be about a page and can be mostly bullet points. Put this document into
a folder called summary.
2. Have some form of initial results. This could be a .Rmd document with a couple plots. The
initial results should be short and to the point. Put the initial results into a folder called
initial_results.
3. Include the rest of the work you have done. Simply gather all the scripts/.Rmd files you
have so far from each team member and put them into one folder (called everything). This
is just so I can see all the work you have done.
4. Include the data.
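The folder layout described above can be set up and zipped with a short script. The sketch below uses Python's standard library. The helper names and the example group number are hypothetical, and the second subfolder is assumed to be named initial_results, as in item 2.

```python
import shutil
from pathlib import Path

# Subfolder names taken from the deliverables list above.
SUBFOLDERS = ("summary", "initial_results", "everything", "data")


def build_eda_folder(parent, group_number) -> Path:
    """Create <group_number>_eda with its four subfolders under `parent`."""
    root = Path(parent) / f"{group_number}_eda"
    for name in SUBFOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def zip_eda_folder(root) -> Path:
    """Zip the folder into <group_number>_eda.zip next to it."""
    root = Path(root)
    archive = shutil.make_archive(
        str(root),                   # archive path without the .zip suffix
        "zip",
        root_dir=str(root.parent),   # paths in the archive are relative to this
        base_dir=root.name,          # keep the n_eda folder as the top level
    )
    return Path(archive)


if __name__ == "__main__":
    # Group number 7 is a made-up example.
    folder = build_eda_folder(".", 7)
    print(zip_eda_folder(folder))
```

Copy your summary, results, scripts and data into the subfolders before zipping. Empty folders are only a skeleton.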
About blog posts
Write a blog post explaining what you found. It should answer:
1. What is the question(s) you tried to answer? Why should someone care?
2. What is the data/how did you get it?
3. How did you answer the questions (e.g. what statistical techniques, etc)?
4. What are your findings?
Points 1 and 4 are the most important for the blog post (your analysis document focuses on 2 and
3). This blog post should be aimed at a general audience who is not afraid of graphs/a little data
(think 538). The vast majority of the technical details should be in
the analysis document.
Requirements for the blog post
The post should be 1000-1500 words.
Include a title and your names.
Don’t display R code unless it is used to convey a point.
There should be at least 2 visualizations.
Make sure to describe the figures somewhere in the text.
These plots should be communicatory plots, not exploratory plots.
The post should be submitted as an .html file (probably written in R Markdown).
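Before submitting, the word-count and figure requirements above can be checked roughly on the rendered .html file. The sketch below is one possible check, not an official grading tool. It assumes every figure appears as an `<img>` tag and that all visible text counts toward the length. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class PostStats(HTMLParser):
    """Count visible words and <img> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.words = 0
        self.images = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            self.images += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore JavaScript and CSS, which are not part of the prose.
        if self._skip_depth == 0:
            self.words += len(data.split())


def check_post(html_text, min_words=1000, max_words=1500, min_figures=2):
    """Return the counts and whether they meet the stated requirements."""
    stats = PostStats()
    stats.feed(html_text)
    stats.close()
    ok = min_words <= stats.words <= max_words and stats.images >= min_figures
    return {"words": stats.words, "figures": stats.images, "ok": ok}
```

To use it, read the rendered file as text (for example `Path("post.html").read_text(encoding="utf-8")`, where the file name is a placeholder) and pass it to `check_post`. The count includes titles and captions, so treat it as an estimate.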
Submission
Include everything that went into creating this blog post in a folder called n_blog (where n = your
group number). You can name the blog post whatever you want, just make sure it is a .html
document. Please compress n_blog and submit it to the Google Form to be set up.
I plan on posting these blog posts and your analyses on the internet. If you do not want your
name associated with the post (or if you don’t want even an anonymous version of the post
displayed to the outside world on the internet), please let me know as soon as possible.
Grading
Communication (80%)
Does your main point come through?
Is the document written well and clearly? Yes, spelling and grammar matter.
Quality of the figures.
Effective communication? Ask your parents or friends to read your post and
give you feedback.
Accuracy (10%)
Do you accurately convey a rigorous argument?
Ambition (10%)