DATA3888 Assignment
Instructions
1. Your assignment submission needs to be an HTML document that you have compiled using R Markdown
or Quarto. Name your file “SIDXXX_Assignment.html”, where XXX is your Student ID.
2. Under author, put your Student ID at the top of the Rmd file (NOT your name).
3. For your assignment, please use set.seed(3888) at the start of each chunk (where required).
4. Do not upload the code file (i.e. the Rmd or qmd file).
5. You must use code folding so that the marker can inspect your code where required.
6. Your assignment should make sense and provide all the relevant information in the text when the code
is hidden. Don’t rely on the marker to understand your code.
7. Any output that you include needs to be explained in the text of the document. If your code chunk
generates unnecessary output, please suppress it by specifying chunk options like message = FALSE.
8. Start each of the 3 questions in a separate section. The parts of each question should be in the same
section.
9. You may be penalised for excessive or poorly formatted output.
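As an illustration of instruction 3, the following minimal base-R sketch shows why resetting the seed at the start of a chunk makes random steps (sampling, cross-validation splits) repeat exactly on every knit. The helper name `draw_split` is hypothetical and only stands in for the body of a chunk.

```r
# Hypothetical helper standing in for the body of one code chunk.
draw_split <- function(n = 10) {
  set.seed(3888)       # reset the seed at the start of the chunk
  sample(seq_len(n))   # any random step after this point is now reproducible
}

first_knit  <- draw_split()
second_knit <- draw_split()

# Both runs give the same permutation because the seed was reset each time.
identical(first_knit, second_knit)
```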
Question 1 - Case Study 1 (Reef): Visualising data
Sully and colleagues have curated a public dataset containing characteristics linked to coral bleaching over the
last two decades. The data is in the file Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv,
and the authors curated coral bleaching events at 3351 locations in 81 countries from 1998 to 2017. The full
description of the variables can be found in the supplementary table of the study.
a. In the paper, the authors claim “the highest probability of coral bleaching occurred at tropical mid-
latitude sites (15–20 degrees north and south of the Equator)”. Create an informative map visualisation
to explore this claim and comment on what you can learn from your visualisation.
b. A researcher wants to investigate coral bleaching events around the world as they occurred
from 1998 to 2017. Create an interactive map visualisation, representing the information you think
would be important. Justify your choice of visualisation, and comment on what you can learn from
your visualisation.
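Alongside a map, a numerical summary by latitude band can help when commenting on the 15–20 degree claim in part a. The sketch below is only an illustration on simulated data: the data frame `reef` and its columns `latitude` and `bleached` are hypothetical stand-ins, not the column names of the actual file.

```r
set.seed(3888)

# Simulated stand-in for the survey table (hypothetical column names).
reef <- data.frame(
  latitude = runif(500, min = -35, max = 35),
  bleached = rbinom(500, size = 1, prob = 0.3)
)

# Group sites into 5-degree bands of absolute latitude (north and south pooled).
reef$lat_band <- cut(abs(reef$latitude),
                     breaks = seq(0, 35, by = 5),
                     include.lowest = TRUE)

# Proportion of surveys with bleaching recorded in each band.
band_summary <- aggregate(bleached ~ lat_band, data = reef, FUN = mean)
print(band_summary)
```

With the real data, the same grouping could be used to colour points on the map, so that the visual impression and the band-level proportions can be compared.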
Question 2 - Case Study 2 (Kidney): Blood vs Biopsy Biomarker for classification
Using the GSE46474 dataset, we estimated the accuracy of our predictive model for graft rejection from
peripheral blood gene expression data. However, rejection is a very active process that occurs in the kidney
itself. Here we will look at a similar kidney microarray dataset: instead of genes being isolated and
sequenced from blood, we examine another dataset, GSE138043, where the samples have been sequenced from
a kidney biopsy.
a. In each of the GSE46474 and GSE138043 datasets, use the topTable function in the limma package
to output the most differentially expressed genes between patients that experience graft rejection and
stable patients. Which genes overlap between the top 300 differentially expressed genes for each
dataset? In other words, which genes can be found in the top 300 differentially expressed genes for
BOTH datasets?
Hint. In the GSE46474 dataset, the outcome is found in the title column of the phenoData and the gene
symbols are found in the Gene Symbol column of the featureData. In the GSE138043 dataset, the outcome
is found in the characteristics_ch1 column of the phenoData and the gene symbols are found in the
gene_assignment column of the featureData, between the first and second // symbols.
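The string handling described in the hint can be sketched in base R as follows. The annotation strings and gene names here are invented placeholders in a " // "-separated style, and `extract_symbol` is a hypothetical helper; in the assignment the two gene lists would come from `topTable` rather than being typed in.

```r
# Invented annotation strings; "---" mimics a probe with no gene assigned.
gene_assignment <- c(
  "NM_000001 // GENEA // example gene A // 1p36 // 1001",
  "---",
  "NM_000002 // GENEB // example gene B // 2q11 // 1002 /// NM_000003 // GENEB"
)

# Keep the field between the first and second "//", or NA if there is none.
extract_symbol <- function(x) {
  parts <- strsplit(x, "//", fixed = TRUE)
  second <- vapply(parts,
                   function(p) if (length(p) >= 2) p[2] else NA_character_,
                   character(1))
  trimws(second)
}

symbols <- extract_symbol(gene_assignment)

# Hypothetical (shortened) top gene lists for the two datasets.
top_blood  <- c("GENEA", "GENEB", "GENEC")
top_biopsy <- c("GENEB", "GENED", "GENEA")

# Genes that appear in BOTH lists.
shared_genes <- intersect(top_blood, top_biopsy)
```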
b. Consider the following framework for cross-validation for a support vector machine (SVM) classifier.
Framework 1. Identify the 50 most differentially expressed genes from the entire dataset. Subset the entire
dataset to the 50 most differentially expressed genes. Randomly split the data into training and testing sets
(80:20 split). Build an SVM classifier on the training set. Calculate the accuracy of the classifier when applied
to the testing set.
For each of the GSE46474 and GSE138043 datasets, use repeated 5-fold cross-validation (with 50 repeats),
following the framework above, to estimate the accuracy of graft survival prediction (rejection or stable).
Show your results in a visualisation and comment on the result.
c. Consider the following framework for cross-validation for a support vector machine (SVM) classifier.
Framework 2. Randomly split the entire dataset into training and testing sets (80:20 split). Identify the 50
most differentially expressed genes from the training data. Subset both the training and testing data to the
50 most differentially expressed genes. Build an SVM classifier on the training set. Calculate the accuracy of
the classifier when applied to the testing set.
For each of the GSE46474 and GSE138043 datasets, use repeated 5-fold cross-validation (with 50 repeats),
following the framework above, to estimate the accuracy of graft survival prediction (rejection or stable).
Show your results in a visualisation and comment on the result.
d. Compare all the results from b and c using an appropriate graphic. Which of Framework 1 and Framework
2 is more valid? Is using blood or biopsy more accurate? Justify your answers.
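The structural difference between Framework 1 and Framework 2 is where gene selection happens relative to the train/test split. The following base-R sketch illustrates only that control flow, under loudly stated assumptions: the data are simulated noise, a Welch t-statistic ranking stands in for the limma ranking, and a nearest-centroid rule stands in for the SVM. All function names are hypothetical, and 10 repeats are used instead of 50 to keep it quick.

```r
set.seed(3888)

# Simulated pure-noise data (assumption): 40 samples x 500 genes, with labels
# unrelated to expression, so an honest accuracy estimate should sit near 0.5.
n <- 40; p <- 500
X <- matrix(rnorm(n * p), nrow = n)
y <- factor(rep(c("stable", "rejection"), each = n / 2))

# Stand-in for a limma ranking: order genes by absolute Welch t-statistic.
top_genes <- function(X, y, k = 50) {
  a <- X[y == levels(y)[1], , drop = FALSE]
  b <- X[y == levels(y)[2], , drop = FALSE]
  t_stat <- (colMeans(a) - colMeans(b)) /
    sqrt(apply(a, 2, var) / nrow(a) + apply(b, 2, var) / nrow(b))
  order(abs(t_stat), decreasing = TRUE)[seq_len(k)]
}

# Stand-in for an SVM: assign each test sample to the nearer class centroid.
nearest_centroid <- function(X_train, y_train, X_test) {
  lv <- levels(y_train)
  c1 <- colMeans(X_train[y_train == lv[1], , drop = FALSE])
  c2 <- colMeans(X_train[y_train == lv[2], , drop = FALSE])
  d1 <- rowSums(sweep(X_test, 2, c1)^2)
  d2 <- rowSums(sweep(X_test, 2, c2)^2)
  factor(ifelse(d1 <= d2, lv[1], lv[2]), levels = lv)
}

# Repeated k-fold CV. select_inside = FALSE mimics Framework 1 (genes chosen
# from all samples); select_inside = TRUE mimics Framework 2 (training only).
cv_accuracy <- function(X, y, select_inside, repeats = 10, folds = 5, k = 50) {
  global_keep <- top_genes(X, y, k)  # sees every sample, including test ones
  replicate(repeats, {
    fold_id <- sample(rep(seq_len(folds), length.out = nrow(X)))
    correct <- 0
    for (f in seq_len(folds)) {
      test <- fold_id == f
      keep <- if (select_inside) {
        top_genes(X[!test, , drop = FALSE], y[!test], k)
      } else {
        global_keep
      }
      pred <- nearest_centroid(X[!test, keep, drop = FALSE], y[!test],
                               X[test, keep, drop = FALSE])
      correct <- correct + sum(pred == y[test])
    }
    correct / nrow(X)
  })
}

acc_framework1 <- cv_accuracy(X, y, select_inside = FALSE)
acc_framework2 <- cv_accuracy(X, y, select_inside = TRUE)
print(c(framework1 = mean(acc_framework1), framework2 = mean(acc_framework2)))
```

Because the simulated labels carry no signal, any accuracy well above 0.5 in this sketch can only come from the selection step having seen the test samples, which is one way to reason about validity in part d.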
Question 3 - Case Study 3 (Brain): Streaming classifier for Brain-box
A physics instructor, Zoe, has created a dataset stored in zoe_spiker.zip that contains brain signal
series (each series is a file), which correspond to sequences of eye movements of varying lengths. The file
name corresponds to the true eye movement. For example, the file LRL_z.wav corresponds to left-right-left
eye movements; the file LLRLRLRL_z.wav corresponds to left-left-right-left-right-left-right-left eye movements.
There are a total of 31 files.
a. Build a classification rule for detecting a series of {L, R} under a streaming condition where the function
will take a sequence of signals as an input. Explain how your classification rule works.
Note. Your function should take the entire .wav file as an input, but should run through the .wav file under
streaming conditions (e.g., by considering overlapping/rolling windows in the signal).
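One possible shape for such a function is sketched below in base R. It is not a reference solution: the signal is simulated, the assumption that a left movement appears as a peak followed by a trough (and a right movement as the reverse) is made up for illustration, and the window, step and threshold values are arbitrary. Reading a real .wav file (for example with the tuneR package) is left out.

```r
set.seed(3888)

# Simulated event shape (assumption): "L" = peak then trough, "R" = the reverse.
make_event <- function(type, len = 200) {
  wave <- sin(seq(0, 2 * pi, length.out = len))
  if (type == "L") wave else -wave
}

# Build a noisy signal for a string of moves such as "LRL".
simulate_signal <- function(moves, gap = 600, noise_sd = 0.05) {
  parts <- lapply(strsplit(moves, "")[[1]],
                  function(m) c(rep(0, gap), make_event(m)))
  clean <- c(unlist(parts), rep(0, gap))
  clean + rnorm(length(clean), sd = noise_sd)
}

# Label one detected event by whether its maximum comes before its minimum.
classify_event <- function(segment) {
  if (which.max(segment) < which.min(segment)) "L" else "R"
}

# Slide an overlapping window over the signal; a window counts as "active"
# when its standard deviation exceeds the threshold. An event is labelled as
# soon as the first inactive window after it is seen.
streaming_classifier <- function(signal, window = 200, step = 50,
                                 threshold = 0.25) {
  predicted <- character(0)
  in_event <- FALSE
  event_start <- NA_integer_
  for (start in seq(1, length(signal) - window + 1, by = step)) {
    stop_at <- start + window - 1
    active <- sd(signal[start:stop_at]) > threshold
    if (active && !in_event) {
      in_event <- TRUE
      event_start <- start
    } else if (!active && in_event) {
      predicted <- c(predicted, classify_event(signal[event_start:stop_at]))
      in_event <- FALSE
    }
  }
  # Flush an event that is still open when the signal ends.
  if (in_event) {
    predicted <- c(predicted,
                   classify_event(signal[event_start:length(signal)]))
  }
  paste(predicted, collapse = "")
}

prediction <- streaming_classifier(simulate_signal("LRL"))
```

The loop only ever looks at samples up to the end of the current window, which is what makes it a streaming rule even though the whole signal is passed in.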
b. Create a metric to estimate the accuracy of your classifier on the length 3 wave files, justifying your
choice. Comment on the performance of your classifier (i.e., is it reasonable for this context?).
c. Compare at least four different classification rules on the length 3 wave files, using the metric you
created. (This may include changing the parameters, different rules to identify events from non-events,
or different rules to identify left-movement from right-movement). What is your best model? Justify
your answer with appropriate visualisations.
d. For the best model that you found in part c, evaluate its performance on sequences of varying lengths.
Does the length of the sequence have an impact on the classification accuracy? Justify your answer
with appropriate visualisations.
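For parts b and d, one candidate metric (an assumption here, not the required choice) is an accuracy based on edit distance, which still gives partial credit when the predicted sequence has a missed or extra movement. The sketch uses base R's `adist`; the predictions in the table are invented for illustration and `sequence_accuracy` is a hypothetical helper.

```r
# 1 minus the edit distance between predicted and true strings, scaled by the
# longer of the two lengths, so that 1 means a perfect match.
sequence_accuracy <- function(predicted, truth) {
  distance <- mapply(function(p, t) drop(adist(p, t)), predicted, truth)
  unname(1 - distance / pmax(nchar(predicted), nchar(truth), 1))
}

# Invented predictions for four recordings (truth taken from the file names).
results <- data.frame(
  truth     = c("LRL", "RLR", "LLRLRLRL", "RRLLR"),
  predicted = c("LRL", "RLL", "LLRLRRL",  "RRLLR"),
  stringsAsFactors = FALSE
)
results$accuracy <- sequence_accuracy(results$predicted, results$truth)
results$length   <- nchar(results$truth)

# Mean accuracy per sequence length, ready to plot against length.
by_length <- aggregate(accuracy ~ length, data = results, FUN = mean)
print(by_length)
```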