ST404 Applied Statistical Modelling
1 Assignment 1
Assignment 1 counts for 25% of the module mark and consists of two deliverables:
1. An exploratory data analysis report in pdf format (20 marks).
Deadline for report submission: Monday, 8 February 12:00.
2. A slide presentation in Week 6 (5 marks).
Deadline for presentation submission: Monday, 8 February 12:00.
Assignment submission is via Moodle. Remember the advice given on report writing, proofreading
and avoiding plagiarism! All sources used, whether online or paper, need to be acknowledged and
appropriately referenced. The data analysis report will be submitted to TurnItIn UK for a plagiarism
check.
1.1 The data
The data to be used for assignments 1 and 2 is the USA Crime data. A full description of the data
set can be found in a separate file on Moodle.
The data concerns crime rates in a sample of USA counties. A county is an administrative and
political sub-division of a US state. Counties vary widely in both geographical area and population
size. Each row of the data gives summary statistics for one county.
This kind of aggregated data is sometimes called “ecological”. This terminology may be confusing
as it has nothing to do with the science of ecology. The relationships between aggregated variables
may be very different from the relationships at an individual level. You should be careful not to
draw conclusions about individuals from your analysis of aggregated data. This is called the
“ecological fallacy”.
One consequence of the ecological fallacy is that we cannot draw conclusions about what causes
crime from these data. However, if we can identify factors associated with lower or higher rates of
crime, we can understand how state resources are allocated to deal with the consequences of crime
and identify priority areas for crime prevention programmes.
1.2 The task
The aim of this analysis is to investigate the factors that are associated with both violent and non-
violent crime in different counties. Before building a formal statistical model we need to use
exploratory data analysis (EDA) techniques to understand the distribution of the variables and the
relationships between them. This assignment is all about EDA. The task of building a statistical
model is deferred until assignment 2 and should not be done in this assignment.
Some key questions to ask are:
• Which variables show a strong relationship with the outcome variables?
◦ Can the relationship be characterized as linear?
◦ Does the relationship appear to be homoscedastic?
◦ What transformations, if any, might be applied to resolve any issues?
ST404 Assignment 1 2021
◦ Are there any other approaches that could be taken to tackle these issues?
• Which variables, if any, have a highly skewed distribution? What transformations might be
applied to reduce skewness and stabilize the spread of the observations?
• Do any of the variables have outlying values? How should outliers be treated?
• Which variables are highly correlated with each other? Are there variables that represent
different ways of measuring the same thing?
• Given all of the above, what recommendations would you make for preparing these data
in order to fit a linear model?
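As an illustration of the skewness question above, the following is a minimal sketch of how a log transformation can reduce right skew. It is written in Python only as a language-neutral illustration (the appendix of the report must still be R-code). The `skewness` helper and the `population` values are made up for this sketch and are not taken from the USA Crime data; the moment-based skewness used here is one of several possible definitions.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation
    return mean(((v - m) / s) ** 3 for v in values)


# Made-up, right-skewed toy values (NOT taken from the USA Crime data).
population = [1200, 3500, 4100, 8000, 15000, 22000, 64000, 250000, 980000]

raw_skew = skewness(population)
log_skew = skewness([math.log(v) for v in population])

print(f"skewness before log transform: {raw_skew:.2f}")
print(f"skewness after log transform:  {log_skew:.2f}")
```

A skewness close to zero after transformation suggests a more symmetric distribution, but a numerical summary like this should be backed up with a histogram or boxplot before any recommendation is made.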
Some of the data values are missing. You should also investigate possible missing data mechanisms:
• Can you suggest a mechanism for the missing data (MCAR/MAR/MNAR)?
• What should be done with missing values when you come to build your statistical model?
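One possible starting point for the missing data questions is sketched below: count the missing values in each variable, then compare an observed variable between the rows where another variable is missing and the rows where it is present. The sketch is in Python purely for illustration (the appendix must still be R-code). The records, the column names `population` and `median_income`, and both helper functions are hypothetical and are not taken from the data set.

```python
from statistics import mean

# Made-up toy records; None marks a missing value.
# Column names are hypothetical, not from the USA Crime data.
rows = [
    {"population": 12000, "median_income": 41000},
    {"population": 3500, "median_income": None},
    {"population": 250000, "median_income": 58000},
    {"population": 1800, "median_income": None},
    {"population": 64000, "median_income": 47000},
    {"population": 9000, "median_income": 39000},
]


def missing_counts(records):
    """Number of missing (None) values in each variable."""
    counts = {}
    for record in records:
        for name, value in record.items():
            counts[name] = counts.get(name, 0) + int(value is None)
    return counts


def split_by_missingness(records, target, other):
    """Observed values of `other`, split by whether `target` is missing."""
    when_missing = [r[other] for r in records
                    if r[target] is None and r[other] is not None]
    when_observed = [r[other] for r in records
                     if r[target] is not None and r[other] is not None]
    return when_missing, when_observed


print(missing_counts(rows))

pop_when_missing, pop_when_observed = split_by_missingness(
    rows, target="median_income", other="population")
print("mean population, income missing: ", mean(pop_when_missing))
print("mean population, income observed:", mean(pop_when_observed))
```

A clear difference between the two groups is evidence against MCAR, since missingness then appears to depend on an observed variable. Note that the observed data alone cannot distinguish MAR from MNAR, so any suggested mechanism also needs a subject-matter argument.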
The assignment is deliberately open-ended. You should not assume that there is a single “correct”
answer. It is expected that different teams will come to different conclusions. You will be judged by
your ability to use sound statistical methodology to extract meaningful conclusions from real-world
data, and to communicate your findings clearly to the target audience. Details of the marking
scheme are given below.
Note that the emphasis of this part of the assignment is on exploration of the data before developing
a more rigorous statistical model. Explore the various EDA tools from the lectures and
motivate/support your conclusions with appropriate numerical and graphical evidence.
1.3 The report
The report should be written to a professional standard, use an 11pt font or larger, and contain the
following sections:
1. Executive summary (maximum 3 bullet points): The aim of the executive summary is to
give a succinct overview of the key messages of your report that is accessible to a lay
audience.
2. Findings (maximum 2 pages): This section should contain a brief description of your main
findings.
3. Statistical methodology (maximum 5 pages): This section should give a description of the
EDA with appropriate numerical and graphical information. (It should also include an
appropriately referenced bibliography). The section may include, but should not necessarily
be restricted to, a discussion of the following questions:
a) What issues did you identify in the initial exploratory analysis? How will these impact
on modelling decisions when you come to build your statistical model?
b) Did you transform any of the variables? Why or why not?
c) Can you exclude any of the variables from the model building process? Why or why
not?
4. A paragraph or table on "authors' contributions": a very brief description of what each
team member contributed to the project (this is common practice in journals that publish
multidisciplinary research) and the proposed mark weighting for each student (see below).
5. References: a bibliography containing the references cited in the text.
6. Appendix: comprehensive and annotated R-code that allows the initial exploratory analysis
and model development to be reproduced.
1.4 The presentation
The oral presentation must be no more than 12 minutes long, and all students in a group should
spend a roughly equal amount of time speaking.
The slides should contain a brief description of your methodology and your findings, and be as
visually appealing as possible.
1.5 Marking criteria (Total 25 points)
Executive Summary [3 points]:
1. relevance of presented information;
2. appropriateness, clarity and correctness of language.
Findings [5 points]:
1. clarity and accuracy of overview;
2. quality and relevance of numerical and graphical output;
3. quality of conclusions presented;
4. appropriateness, clarity and correctness of language.
Statistical methodology [10 points]:
1. quality of exploratory analysis;
2. relevance and quality of numerical and graphical evidence;
3. structure and clarity, appropriate use of terminology, correctness of English.
Appendix [2 points]: appropriately annotated and complete.
Presentation [5 points]:
Slides
1. Layout, structure and visual appeal;
2. Accuracy and relevance of content.
Oral Presentation
1. Fluidity;
2. Persuasiveness;
3. Appropriate use of language;
4. Response to targeted questions, where appropriate.
1.6 Further Guidelines and Instructions
1.6.1 Layout
The report should be written in a font size of 11pt or larger, with 1.5 line spacing. All
figures and tables should be numbered, have captions and be of appropriate size. Text included in
figures (such as titles and axis labels) should be readable under the same conditions as the rest of
the text. Margins should be sensible.
1.6.2 Penalties
Late submission (-5% per working day), over page limit (-5%), not using the prescribed
layout (-5%).
1.6.3 Marks
For the delivery of the oral presentation, students will receive an individual mark. The other
deliverables (the report and the slides themselves) receive a group mark, which will be distributed
across team members using the weighting algorithm described below.
1.6.4 Peer Review
Each team should decide how to distribute the group mark by allocating to each team member a
share of a total of n × 100%, where n is the number of students in the team. Each share acts as a
weighting factor that converts the group mark into an individual mark. For example, suppose the
group mark is 70%. If a team of 5 students allocates 100% to each team member, then each member
receives a mark of 70%. If instead the team allocates 108% to one team member and 98% to each of
the other four (108% + 4 × 98% = 500%), then the former receives a mark of 70% × 1.08 = 75.6%
and each of the latter four receives 70% × 0.98 = 68.6%. The maximum weighting factor that can be
awarded is 110% and the minimum is 90%. The module leader reserves the right to moderate the
weighting factors, impose equal weighting factors and/or request further evidence.
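The weighting arithmetic can be checked with a short sketch such as the one below. It is an illustration only, not the official marking procedure: the function name `individual_marks` is hypothetical, and both the requirement that the weights sum to exactly n × 100% and the rounding to one decimal place are assumptions read from the worked example.

```python
def individual_marks(group_mark, weights):
    """Convert a group mark (%) into individual marks using peer weights (%).

    Assumptions of this sketch: every weight lies between 90 and 110,
    the weights sum to n x 100, and marks are rounded to one decimal place.
    """
    n = len(weights)
    if any(w < 90 or w > 110 for w in weights):
        raise ValueError("each weighting factor must be between 90% and 110%")
    if abs(sum(weights) - 100 * n) > 1e-9:
        raise ValueError("weighting factors must sum to n x 100%")
    return [round(group_mark * w / 100, 1) for w in weights]


# Equal weights: every member receives the group mark.
print(individual_marks(70, [100, 100, 100, 100, 100]))

# One member at 108%, the other four at 98%.
print(individual_marks(70, [108, 98, 98, 98, 98]))
```

A team can use a check like this to confirm that a proposed allocation respects the 90% to 110% limits before stating it in the authors' contributions section.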