Data Management and Exploratory Data Analysis CSC8631
CSC8631 Coursework
Outline
This coursework assignment is designed to give you experience
building a data analytics pipeline to process, query and gain insights
into a dataset provided to you. You will build a real-world data
analysis pipeline making use of the technologies introduced during the
lecture series and practicals as part of the CSC8631 module.
This coursework assignment provides you with an opportunity
to work on an entire data analysis pipeline, from data sanitisation
to querying and report generation. Consequently, you will have the
opportunity to make appropriate use of a wide range of technologies,
each of which we have seen during lectures and practical exercises.
You will submit this assignment individually, but
collaboration with colleagues to discuss the problem and possible
solutions is strongly encouraged.
https://xkcd.com/882/
Scenario
Learning Analytics, a rapidly-growing application area for Data
Science, is defined as “the measurement, collection, analysis and reporting
of data about learners and their contexts, for purposes of understanding and
optimising learning and the environment in which it occurs” 1.
1 George Siemens and Phil Long. Penetrating the Fog: Analytics in Learning and Education. EDUCAUSE Review, 46(5):30, 2011.
Existing mechanisms to record student engagement (e.g.
attendance monitoring) fail to capture the extent and quality of
engagement outside of the classroom environment. Further complementary
sources of data are routinely collected about our learners (e.g. use of
on-campus facilities, Virtual Learning Environment (VLE) and
ReCap access, and student wellbeing referrals); however, these currently
reside in a number of silos.
Learning Analytics seeks to aggregate these sources of data to
derive shared insights, and provide effective measures of engagement.
Insights may inform learning design, inform intervention processes
for at-risk students, and improve student attainment.
The most complete introduction is available in the government
policy report “From Bricks To Clicks” 2. The report is quite extensive,
but there are some nice case studies from Nottingham Trent and the
OU to give you a flavour of the types of projects in this area.
2 http://www.policyconnect.org.uk/hec/sites/site_hec/files/report/419/fieldreportdownload/frombrickstoclicks-hecreportforweb.pdf
Challenge
In this project we will emulate a very familiar process undertaken
by data analysts. We will take a dataset provided to us, and develop
a suite of tools which allow us to extract interesting insights from
this data in a quick, reliable and repeatable manner. The datasets you
are expected to interpret as a data analyst are usually previously
unseen, so the process of building a pipeline is an exploratory one.
Consequently, you will be expected to review and interrogate the
data to gain an understanding of its structure and composition.
In this coursework you will develop a data analysis pipeline to
explore a given dataset. There are no formal requirements for the
functionality or focus of your analysis. Your data analysis should
follow routes of enquiry which are of greatest interest to you.
Therefore, there exists scope for a great deal of flexibility, so we anticipate
solutions to this challenge will vary.
We encourage you to pursue ambitious analysis, but just as
importantly we are looking for good programming practice (see the
‘Best-practice development’ section of this document). When developing
large systems such as these, it is important that you write your
code incrementally, and test it carefully before continuing to add
additional functionality.
Best-practice development
Throughout this coursework we are not simply interested in a
solution which achieves some desired functionality. You will also be
assessed on the following:
1. You should make use of Git for version control. As a rule of
thumb: “if it isn’t visible in the version control logs, it didn’t happen”.
You will be expected to push to your remote (on GitHub) regularly
throughout the project, and you will be assessed on doing so.
2. All source code and programs as part of your solution should be
well documented.
3. You should consider the reproducibility of your analysis, making
use of ProjectTemplate for R.
4. You should produce much of your written documentation using
the ‘literate programming’ framework RMarkdown for R.
Design and Implementation Reports
You should submit a short report via NESS (no more than two
pages) summarising the work carried out, with a critical reflection 3
on your experience using the tools and techniques introduced on this
module in completing the coursework assignment. We are particularly
interested in any assumptions you made about the data, and
how they motivated your design decisions.
3 You should consider the relative merits and limitations of the approaches. No one methodology is perfectly suited to each project, so it is completely reasonable for you to identify limitations of CRISP-DM as a methodology.
You should also produce additional documentation detailing the
findings from your exploratory analysis. You should include
documentation of all analyses undertaken, whether or not they
produced ‘successful’ findings. You should follow the principles of the
CRISP-DM 4 methodology and use the literate programming framework
RMarkdown 5 to align analytic code with narrative text.
You should submit the source file(s) for the notebook(s) as well as
output saved in PDF format. Your report should be a maximum of
20 pages and should be structured in a way that guides the reader
through the steps of your analysis.
4 Chapman, et al. (2000). CRISP-DM 1.0: Step-by-step data mining guide. SPSS.
5 http://rmarkdown.rstudio.com/
• Technical
– Documentation, including code comments and a README
document.
– Effective use of Version Control
– Use of ProjectTemplate to achieve reproducibility in your
project.
• Methodology
– Documentation, including code comments and a README
document.
– Effective use of Version Control
– Use of ProjectTemplate to achieve reproducibility in your
project.
• Project reporting
– Critical Reflection
– Quality of the communication of your work and the
accompanying rationale/significance
Presentation (20% of the marks for the module)
You will pre-record a short presentation focusing on your key
findings. We recommend you record a Zoom meeting where you share
your screen; audio only is perfectly acceptable if you prefer not to use
video for your recording.
Your presentation will be five minutes in duration. We will allow
leeway of ± 30 seconds. Your marker will not watch any
content beyond 5:30 and any content appearing after 5:30 will not
be marked. Marks for your presentation will be divided equally
between Content and Delivery.
• Content (50% of the Presentation marks)
– The motivation for your analysis is clear.
– Clear statement of the data used to support your analysis.
– A clear description of the analysis work you undertook.
– A clear description of key findings from your analysis.
– Concluding remarks, relating the findings of your analysis back
to their implications for the business context.
• Delivery (50% of the Presentation marks)
– The slides and speech are clearly understandable 6.
– The presentation is well structured and has a natural flow.
– Time (5 minutes ± 30 seconds)
– The slides are well presented with thought given to formatting
and aesthetics
– Effort is taken to talk around the slides
6 If you have concerns about being able to record audio, please get in touch with Matt and Joe at your earliest convenience. In our experience, microphones included with your laptop, mobile device or headphones are perfectly suited to this task.
Clarification of requirements
It is often necessary to clarify client requirements throughout the
course of a project. Matt and Joe will be happy to help clarify
any questions you have surrounding the deliverables you are
supposed to produce, and resolve any ambiguities which may arise as
you explore the provided dataset.
Deliverables and Online Submission
You will submit your assignment electronically via NESS 7 by 16:30
on Friday 19th November 2021. You are required to submit several
‘deliverables’ to NESS.
7 http://ness.ncl.ac.uk
Source code You are expected to submit all source code developed in
the coursework. You should also provide a README.txt document
clearly stating which files relate to which part of the coursework
solution. Your README.txt file should also provide instructions on
running your analyses. These instructions should be sufficient to
run the analysis. This should be automated as much as possible,
and any non-automated configuration or installation steps should
be clearly documented.
Written report Written reports should be submitted in PDF format,
and should clearly indicate your name and student number within
the document, and also in the file name.
Presentation Slides in PDF format, and in source code format e.g.
RMarkdown, Keynote, PowerPoint. Presentation video, e.g. mp4.
Zip files You will often be submitting a number of files at once. You
will likely find it most convenient to zip these files up prior to
submitting them to NESS. Please ensure any zip files contain your
student number and module code in the filename.
Version Control log In this coursework assignment we emphasise the
importance of correctly using version control. You should include
with your coursework submission a copy of your Git log. This may
be easily obtained using the following command:
git log > [yourstudentnumber]GitLogFile.txt
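The naming rules for the Git log and zip files can be wrapped in a small shell helper, which may reduce the chance of a mistyped filename at submission time. The following is a minimal sketch only, not a required part of the submission. The student number `123456789`, the function names, and the `<studentnumber>_<modulecode>.zip` pattern are all hypothetical; the coursework only requires that the zip filename contains both the student number and the module code.

```shell
#!/bin/sh
# Hypothetical values: substitute your own student number.
STUDENT_NUMBER="123456789"
MODULE_CODE="CSC8631"

# Build the Git log filename in the form [yourstudentnumber]GitLogFile.txt.
git_log_filename() {
    printf '%sGitLogFile.txt\n' "$1"
}

# Build a zip filename containing the student number and module code.
# The underscore-separated pattern is an assumption, not a stated rule.
zip_filename() {
    printf '%s_%s.zip\n' "$1" "$2"
}

# Write the Git log to the named file, but only when the current
# directory is inside a Git working tree.
export_git_log() {
    if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
        git log > "$(git_log_filename "$1")"
    else
        echo "not inside a Git repository" >&2
        return 1
    fi
}
```

To use it, source the script from the root of your project repository and run `export_git_log "$STUDENT_NUMBER"`; `zip_filename "$STUDENT_NUMBER" "$MODULE_CODE"` prints a name you could pass to your zip tool.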