ENVS3023/6034 – Advanced Quantitative Methods
Long answer question practice – some tips
Last edited 5 May 2021

Quail.xls
The quail dataset provides records of presence or absence (coded as 1 or 0 in variable Presence). You have been asked by a hunting group who like to shoot quail (a small game bird) whether topographic variation (variables coded topov1 to topov10 for fine to large scale variations) affects quail presence. Their plan is to make the ground more variable so that they have places to hide when shooting the bird, but they don't want to do this if the quail will avoid the area. What is your advice to them?

What the question says
…records of presence or absence
…variables coded topov1 to topov10 for fine to large scale variations
[want] …places to hide when shooting the bird
What is your advice to them?

Creating a research question
Is the presence of quail influenced by topographic variation when everything else is taken into account?
We are just asked for advice, not a model or equation
Interpretation is important but mainly by the researcher, not the hunter

Possible approaches?
GLM – if assumptions can be met
GAM – good but possibly slow
MARS – binomial response is possible (read the earth pdf)
CART – not suitable as single tree
RF and SGB – possible because n = 2154, but you need to know how to specify the binary response in RF
NN – not really interpretable unless you add PDPs

Results from GLM model
Coefficients:
               Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   5.7954362   0.9979657    5.807  6.35e-09 ***
meanNDVI     -0.0789547   0.0096221   -8.206  2.30e-16 ***
spriNDVI      0.0564619   0.0052579   10.739   < 2e-16 ***
wintNDVI     -0.0281757   0.0070006   -4.025  5.70e-05 ***
topov1        0.2092300   0.3059940    0.684  0.494120
topov2        1.0601105   0.7800388    1.359  0.174131
topov5       -1.0383771   1.4579828   -0.712  0.476340
topov10      -2.5709852   1.2681614   -2.027  0.042628 *
alt           0.0018815   0.0001809   10.401   < 2e-16 ***
townden      -2.7884717   0.7536330   -3.700  0.000216 ***
rivden       -1.1856096   0.4275608   -2.773  0.005555 **

Running randomForest on binary response
randomForest(as.factor(Presence) ~ meanNDVI...

To answer the question
It's easy to forget this!
Go back and read it again to check
What would your advice be to the hunters?

Body fat in men.csv
Excess body fat is bad for health but is not simple to measure directly because the worst place to carry it is around the internal organs, hidden from view. Your task is to explore the data provided to see whether it is possible to predict body fat (variable Percent body fat) from various characteristics and body measurements (variables Age to Wrist circumference). The aim is to develop a simple way (perhaps a rule of thumb?) for medical practitioners and affected individuals to assess body fat.

What the question says
…explore the data provided to see whether it is possible to predict body fat
The aim is to develop a simple way for medical practitioners and affected individuals to assess body fat

Creating a research question
Can we predict body fat from the variables provided?
We need to do this simply enough to be able to communicate the results
What two techniques stand out as appropriate?

Possible approaches?
(G)LM – if assumptions can be met. Link is identity function so = linear regression. n = 252
GAM – too complex
MARS – possible but complex
CART – good for visualization but low predictive power?
RF and SGB – too complex
NN – too complex

Run as LM
% body fat = 0.996*abdomen + 0.473*forearm - 1.506*wrist - 0.136*weight - 34.854
Why not round coefficients to simplify?
% body fat = abdomen + ½ forearm - 1½ wrist - 1½ (weight/10) - 35
Correlations in both cases are r = 0.857
My simplified formula is much easier for practitioners to use.

Speeding violations.csv
These are data from the US and the focus is on the police allegedly stopping people on the basis of race. Your task is to discover whether there is evidence for a difference in tendency to drive too fast (variables Speed and Overlimit) according to race (variable Race).

What the question says
…discover whether there is evidence for a difference in tendency to drive too fast according to race...
In other words, is race a significant predictor of speed (the response)?
There are four categorical predictors plus date and time. Speed could also be a continuous predictor but it must be related to "overlimit", so it is probably best to omit it.

Data cleaning
A possible problem here is that missing values have been coded as **
You cannot search and replace ** in Excel, so use the .csv directly and do it in Notepad
Or in R use
    Speedy_clean <- Speeders[(Speeders$Speed!="**"),]
Otherwise leave it and rely on listwise deletion to take care of it.

Possible approaches?
GLM – possible using dummy coding (automatic in R)
GAM – possible using dummy coding (automatic in R)
MARS – also possible if no missing values
CART – not suitable as single tree
RF and SGB – possible if we specify categorical predictors as factors (n = 6929)
NN – possible if predictors standardized.

Example prediction from lm in R

Conclusion
Linear models and even neural nets have low predictive power for this problem
Little evidence that race has an effect on speeding over the limit
This might suggest that the observed disproportionate stopping of certain groups is racism.
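
The cleaning and modelling steps for the speeding example can be sketched in base R. This is only an illustration under stated assumptions: the data frame below is invented toy data (not the real Speeding violations.csv), the "**" code is assumed to appear in Speed as in the subsetting line on the slide, and Overlimit is assumed to be the response with Race as the only predictor.

```r
# Toy data standing in for the speeding file: all values are made up.
set.seed(1)
Speeders <- data.frame(
  Race      = rep(c("A", "B", "C"), each = 20),   # hypothetical group labels
  Speed     = as.character(round(rnorm(60, mean = 75, sd = 8))),
  Overlimit = round(rnorm(60, mean = 15, sd = 5)),
  stringsAsFactors = FALSE
)
# Mimic the "**" missing-value code in a few rows.
Speeders$Speed[c(3, 27, 48)] <- "**"

# Drop the "**" rows (the subsetting line from the slide), then restore types.
Speedy_clean <- Speeders[(Speeders$Speed != "**"), ]
Speedy_clean$Speed <- as.numeric(Speedy_clean$Speed)
Speedy_clean$Race  <- as.factor(Speedy_clean$Race)

# Linear model with Race as the only predictor; R dummy-codes the factor.
fit <- lm(Overlimit ~ Race, data = Speedy_clean)
anova(fit)               # F test for any difference between groups
summary(fit)$r.squared   # share of variance in Overlimit explained by Race
```

With real data, a small R-squared alongside a non-significant F test would be in line with the conclusion above that race has little effect on speeding over the limit.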

Mammal sleep.xlsx
How much do different species of mammal sleep and why are there differences? Explore the data provided to determine which variables seem to influence sleep duration (three possible variables) and come up with hypotheses to explain your findings.

What the question says
…explore the data provided to determine which variables seem to influence sleep duration
…come up with hypotheses to explain your findings

Creating a research question
How well can we predict sleep duration from the variables provided?
Which variables are most important as predictors?
So we need both predictive power and interpretability to some extent (enough to create hypotheses).

Possible approaches?
(G)LM – best if assumptions can be met
GAM – slow and at the limit as n = 62
MARS – possible even with n = 62
CART – not suitable (unreliable) as single tree
RF and SGB – too few points
NN – too few points
We should try a linear model because it is the most interpretable and n is small.

Issues
The information is spread over two datasheets, so it is awkward and needs tidying
Excel is good for sorting this out and getting a feel for the data
Remember: we need to check for linearity if we are to build a simple linear model, so make some plots.

General tips
Make sure you read and then answer the question!
Answers need to include text, R code, plots etc.
Use resampling techniques (e.g. cross-validation) wherever appropriate
Look at model fits and residuals when appropriate
Interpret any output as fully as you can
Write a conclusion.
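
The cross-validation tip can be illustrated with a minimal base R sketch of k-fold cross-validation for a linear model. The data are simulated purely for illustration (n = 62 is borrowed from the mammal sleep example, but the variables x1, x2 and y are hypothetical), and 5 folds is an arbitrary choice.

```r
# Simulated stand-in for a small dataset; variable names are hypothetical.
set.seed(42)
n <- 62
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 10 - 2 * dat$x1 + rnorm(n)        # y depends on x1 only; x2 is noise

k <- 5
fold <- sample(rep(1:k, length.out = n))   # random fold label for each row

cv_rmse <- numeric(k)
for (i in 1:k) {
  train <- dat[fold != i, ]                # fit on the other k - 1 folds
  test  <- dat[fold == i, ]                # hold out fold i
  fit   <- lm(y ~ x1 + x2, data = train)
  pred  <- predict(fit, newdata = test)
  cv_rmse[i] <- sqrt(mean((test$y - pred)^2))
}

cv_rmse         # out-of-sample RMSE for each fold
mean(cv_rmse)   # cross-validated estimate of prediction error
```

Comparing the cross-validated error with the error on the training data gives a rough check on overfitting, which matters most when n is small.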