ENVS3023/6034 – Advanced Quantitative Methods
Long answer question practice – some tips
Last edited 5 May 2021
Quail.xls
The quail dataset provides records of presence or
absence (coded as 1 or 0 in variable Presence).
You have been asked by a hunting group who like
to shoot quail (a small game bird) whether
topographic variation (variables coded topov1 to
topov10 for fine to large scale variations) affects
quail presence. Their plan is to make the ground
more variable so that they have places to hide
when shooting the bird but they don’t want to do
this if the quail will avoid the area. What is your
advice to them?
What the question says
…records of presence or absence
…variables coded topov1 to topov10 for
fine to large scale variations
[want] …places to hide when shooting
the bird
What is your advice to them?
Creating a research question
Is the presence of quail influenced by
topographic variation when everything else
is taken into account?
We are just asked for advice, not a model
or equation
Interpretation is important, but it is mainly
done by the researcher, not the hunters
Possible approaches?
GLM – if assumptions can be met
GAM – good but possibly slow
MARS – binomial response is possible (read the
earth package pdf)
CART – not suitable as single tree
RF and SGB – possible because n = 2154, but you
need to know how to specify a binary response in RF
NN – not really interpretable unless you add
partial dependence plots (PDPs)
Results from GLM model
Coefficients:
               Estimate  Std. Error  z value   Pr(>|z|)
(Intercept)   5.7954362   0.9979657    5.807   6.35e-09 ***
meanNDVI     -0.0789547   0.0096221   -8.206   2.30e-16 ***
spriNDVI      0.0564619   0.0052579   10.739    < 2e-16 ***
wintNDVI     -0.0281757   0.0070006   -4.025   5.70e-05 ***
topov1        0.2092300   0.3059940    0.684   0.494120
topov2        1.0601105   0.7800388    1.359   0.174131
topov5       -1.0383771   1.4579828   -0.712   0.476340
topov10      -2.5709852   1.2681614   -2.027   0.042628 *
alt           0.0018815   0.0001809   10.401    < 2e-16 ***
townden      -2.7884717   0.7536330   -3.700   0.000216 ***
rivden       -1.1856096   0.4275608   -2.773   0.005555 **
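To help turn the table above into advice, here is a minimal sketch in base R that recomputes the z values and p-values for the four topographic terms and converts the estimates to odds ratios with Wald 95% intervals. It assumes the model was a binomial GLM with the default logit link, which the slides do not state; the object names (est, se, topo_summary) are made up for illustration.

```r
# Estimates and standard errors copied from the GLM output above
est <- c(topov1 = 0.2092300, topov2 = 1.0601105,
         topov5 = -1.0383771, topov10 = -2.5709852)
se  <- c(topov1 = 0.3059940, topov2 = 0.7800388,
         topov5 = 1.4579828, topov10 = 1.2681614)

# Wald z statistic and two-sided p-value for each coefficient
z <- est / se
p <- 2 * pnorm(-abs(z))

# ASSUMPTION: logit link, so exp(coefficient) is an odds ratio
odds_ratio <- exp(est)
ci_low  <- exp(est - qnorm(0.975) * se)
ci_high <- exp(est + qnorm(0.975) * se)

topo_summary <- data.frame(est, se, z = round(z, 3), p = signif(p, 3),
                           odds_ratio, ci_low, ci_high)
print(topo_summary)
```

An odds ratio below 1 means quail presence becomes less likely as that variable increases. An interval that includes 1 means the direction of the effect is uncertain.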
Running randomForest on binary response
randomForest(as.factor(Presence) ~ meanNDVI...
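The point of the call above is that randomForest fits a classification forest only when the response is a factor; a numeric 0/1 response would be treated as regression. Below is a minimal sketch on simulated data, since Quail.xls is not reproduced here. The data frame sim, its coefficients and the two predictors used are assumptions for illustration only, and the forest is fitted only if the randomForest package is installed.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the quail data (ASSUMPTION: values are invented)
sim <- data.frame(meanNDVI = runif(n, 0, 100), topov10 = runif(n))
sim$Presence <- rbinom(n, 1, plogis(-2 + 0.04 * sim$meanNDVI))

# Convert the 0/1 response to a factor so a classification forest is grown
sim$Presence_f <- as.factor(sim$Presence)

if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(Presence_f ~ meanNDVI + topov10,
                                   data = sim, importance = TRUE)
  print(rf$type)                       # "classification" for a factor response
  print(randomForest::importance(rf))  # variable importance table
}
```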
To answer the question
It’s easy to forget this!
Go back and read it again to check
What would your advice be to the hunters?
Body fat in men.csv
Excess body fat is bad for health but is not simple to
measure directly because the worst place to carry it
is around the internal organs, hidden from view.
Your task is to explore the data provided to see
whether it is possible to predict body fat (variable
Percent body fat) from various characteristics and
body measurements (variables Age to Wrist
circumference). The aim is to develop a simple way
(perhaps a rule of thumb?) for medical practitioners
and affected individuals to assess body fat.
What the question says
… explore the data provided to see whether
it is possible to predict body fat
The aim is to develop a simple way for
medical practitioners and affected
individuals to assess body fat
Creating a research question
Can we predict body fat from the variables
provided?
We need to do this simply enough to be able
to communicate the results
What two techniques stand out as
appropriate?
Possible approaches?
(G)LM – if assumptions can be met. With an identity
link this is ordinary linear regression. n = 252
GAM – too complex
MARS – possible but complex
CART – good for visualization but low predictive
power?
RF and SGB – too complex
NN – too complex
Run as LM
% body fat = 0.996*abdomen + 0.473*forearm –
1.506*wrist – 0.136*weight – 34.854
Why not round coefficients to simplify?
% body fat = abdomen + ½*forearm – 1½*wrist –
1½*(weight/10) – 35
Correlation with measured body fat is r = 0.857
in both cases.
My simplified formula is much easier for
practitioners to use.
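To see how little the rounding changes a prediction, the two formulas can be written as R functions and applied to the same person. This is only a sketch: the function names fat_full and fat_simple and the example measurements are hypothetical, and the units are assumed to be whatever the dataset uses.

```r
# Fitted linear model, coefficients as reported above
fat_full <- function(abdomen, forearm, wrist, weight) {
  0.996 * abdomen + 0.473 * forearm - 1.506 * wrist - 0.136 * weight - 34.854
}

# Rule-of-thumb version with rounded coefficients
fat_simple <- function(abdomen, forearm, wrist, weight) {
  abdomen + 0.5 * forearm - 1.5 * wrist - 1.5 * (weight / 10) - 35
}

# Hypothetical measurements, not taken from the dataset
full_pred   <- fat_full(abdomen = 90, forearm = 28, wrist = 18, weight = 180)
simple_pred <- fat_simple(abdomen = 90, forearm = 28, wrist = 18, weight = 180)

print(c(full = full_pred, simple = simple_pred,
        difference = full_pred - simple_pred))
```

Applying both functions to every row of the dataset and plotting one set of predictions against the other would show whether the simplification matters anywhere in the observed range.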
Speeding violations.csv
These are data from the US and the focus is
on the police allegedly stopping people on the
basis of race. Your task is to discover
whether there is evidence for a difference in
tendency to drive too fast (variables Speed
and Overlimit) according to race (variable
Race).
What the question says
…discover whether there is evidence for a
difference in tendency to drive too fast
according to race...
In other words, is race a significant
predictor of speed (the response)?
There are four categorical predictors plus
date and time. Speed could also be used as a
continuous predictor, but it must be closely
related to Overlimit, so it is probably best to omit it.
Data cleaning
A possible problem here is that missing
values have been coded as **
Search and replace is awkward in Excel because * acts
as a wildcard, so open the .csv directly and do it in Notepad
Or in R use Speedy_clean <- Speeders[(Speeders$Speed!="**"),]
Otherwise leave it and rely on listwise
deletion to take care of it.
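Another option in R is to declare ** as a missing-value code when the file is read, so that Speed stays numeric and the rows can be dropped with is.na(). The sketch below uses a tiny inline CSV instead of Speeding violations.csv; the three rows and the race labels A, B and C are invented for illustration.

```r
# Tiny inline stand-in for the real file (ASSUMPTION: values are invented)
csv_text <- "Speed,Overlimit,Race
72,7,A
**,**,B
80,15,C
"

# na.strings turns every "**" into NA at read time
Speeders <- read.csv(text = csv_text, na.strings = "**")
str(Speeders)

# Keep only rows with a recorded speed
Speedy_clean <- Speeders[!is.na(Speeders$Speed), ]
print(Speedy_clean)
```

With the real file, replace the text = csv_text argument with the file path. If ** is not declared as missing, Speed is read as text and has to be converted with as.numeric() after filtering.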
Possible approaches?
GLM – possible using dummy coding (automatic
in R)
GAM – possible using dummy coding (automatic in R)
MARS – also possible if no missing values
CART – not suitable as single tree
RF and SGB – possible if we specify categorical
predictors as factors (n = 6929)
NN – possible if predictors standardized.
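The "automatic" dummy coding mentioned for GLM and GAM can be inspected with model.matrix(). The sketch below uses simulated data with no race effect built in; the race labels A, B and C, the sample size and the Overlimit values are assumptions for illustration, not the real data.

```r
set.seed(7)

# Simulated stand-in (ASSUMPTION: invented labels and values, no race effect)
sim <- data.frame(
  Race      = factor(sample(c("A", "B", "C"), 300, replace = TRUE)),
  Overlimit = rnorm(300, mean = 12, sd = 5)
)

# R expands the factor into 0/1 indicator columns; the first level is the baseline
X <- model.matrix(Overlimit ~ Race, data = sim)
print(head(X))

# The same coding is applied silently inside lm() and glm()
fit <- lm(Overlimit ~ Race, data = sim)
print(anova(fit))  # overall F test for Race
```

The F test from anova() assesses race as a whole, which matches the research question better than reading the individual dummy coefficients one at a time.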
Example prediction from lm in R
Conclusion
Linear models and even neural nets have low
predictive power for this problem
There is little evidence that race has an effect on
speeding over the limit
This might suggest that the observed
disproportionate stopping of certain groups
is due to racism rather than to differences in driving.
Mammal sleep.xlsx
How much do different species of mammal
sleep and why are there differences? Explore
the data provided to determine which
variables seem to influence sleep duration
(three possible variables) and come up with
hypotheses to explain your findings.
What the question says
…explore the data provided to determine
which variables seem to influence sleep
duration
…come up with hypotheses to explain your
findings
Creating a research question
How well can we predict sleep duration
from the variables provided?
Which variables are most important as
predictors?
So we need both predictive power and
interpretability to some extent (enough to
create hypotheses).
Possible approaches?
(G)LM – best if assumptions can be met
GAM – slow and at the limit as n = 62
MARS – possible even with n = 62
CART – not suitable (unreliable) as single tree
RF and SGB – too few points
NN – too few points
We should try a linear model because it is the
most interpretable and n is small.
Issues
Information on two datasheets so awkward
and needs tidying
Excel is good for sorting this out and
getting a feel for the data
Remember: we need to check for linearity if
we are to build a simple linear model, so
make some plots.
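One way to check linearity is to plot the response against each predictor, raw and transformed, and compare the straight-line fits. The sketch below uses simulated data with n = 62; the variable names body_kg and sleep_h and the log-linear relationship are assumptions made up for illustration, not properties of the real dataset.

```r
set.seed(3)
n <- 62

# Simulated stand-in (ASSUMPTION: invented, strongly skewed predictor)
body_kg <- exp(rnorm(n, mean = 1, sd = 3))
sleep_h <- 12 - 1 * log10(body_kg) + rnorm(n, sd = 1)

# Straight-line fits on the raw and on the log10 scale
fit_raw <- lm(sleep_h ~ body_kg)
fit_log <- lm(sleep_h ~ log10(body_kg))
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
print(c(raw = r2_raw, log10 = r2_log))

# Scatterplots with fitted lines (drawn only in an interactive session)
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  plot(body_kg, sleep_h, main = "Raw scale")
  abline(fit_raw)
  plot(log10(body_kg), sleep_h, main = "log10 scale")
  abline(fit_log)
  par(op)
}
```

If the points only fall along a line after a transformation, the transformed variable is the one to use in the linear model.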
General tips
Make sure you read and then answer the question!
Answers need to include text, R code, plots etc.
Use resampling techniques (e.g. cross-validation)
wherever appropriate
Look at model fits and residuals when appropriate
Interpret any output as fully as you can
Write a conclusion.
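As an illustration of the resampling tip, here is a minimal sketch of 5-fold cross-validation for a linear model in base R. The data are simulated; the names dat, x1, x2 and y, and the choice of k = 5, are assumptions for illustration.

```r
set.seed(42)
n <- 62
k <- 5

# Simulated stand-in data (ASSUMPTION: invented; only x1 affects y)
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 10 - 2 * dat$x1 + rnorm(n)

# Randomly assign each row to one of k folds of near-equal size
fold <- sample(rep(1:k, length.out = n))

rmse <- numeric(k)
for (i in 1:k) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit   <- lm(y ~ x1 + x2, data = train)   # fit on k - 1 folds
  pred  <- predict(fit, newdata = test)    # predict the held-out fold
  rmse[i] <- sqrt(mean((test$y - pred)^2))
}

cv_rmse <- mean(rmse)
print(round(rmse, 3))
print(cv_rmse)
```

The cross-validated RMSE estimates error on data the model has not seen, so it is a fairer basis for comparing candidate models than the fit to the training data.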