CS2035
Assignment
Marking Scheme
In any case, make sure you submit all the necessary files such that the grader can run your
programs. This includes the csv data files (the ones provided with this assignment).
Your grade will be based on what the grader can run. A program that does
not run (i.e., stops because of any error) will be graded zero.
One way to check this is to copy what you plan to submit into a new folder and run your
scripts from this new folder (and/or send it to a friend and ask them to run it).
While you can work with other students, your submission must reflect your own work.
Programs that are suspiciously similar will be graded zero.
Exercise 1 – Ovarian Cancer Detection
Serum proteomic pattern diagnostics use protein mass spectrometry to differentiate
biological samples from patients with and without cancer. In this exercise, our goal is to
build a classifier that can distinguish between cancerous and normal biopsies from the
mass spectrometry data. The same biological sample is analyzed by two different
spectrometric instruments (called A and B) for each patient.
The data from instrument A are in the file ovarian A.csv and the data from instrument B
are in ovarian B.csv. A given row corresponds to the same patient in both files, and the
columns represent the spectrometric measurements (their meaning is not relevant here).
The file ovarian diagnostic.csv indicates, for each of the 216 patients, if the sampled
tumour was cancerous (Cancer) or not (Normal).
We want to determine which instrument (between A and B) is the best at detecting
ovarian cancer.
a) Using a multivariate logistic regression (on all the measurement variables for each
instrument) and ROC analysis, determine the best instrument to detect ovarian
cancer in this group of patients. Provide explicitly the criteria you used to make
your decision. Generate a ROC figure that illustrates your findings.
b) We want a true positive rate (TPR) of 90% for the best method. What is the false
positive rate (FPR) in that case? What would be the FPR with the worst method,
for the same TPR (90%)?
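As an illustration of the ROC analysis asked for in a) and b), the following Python sketch builds a ROC curve from classifier scores, computes the area under the curve (AUC) and reads off the FPR at a target TPR. It assumes the logistic regression scores have already been computed for each instrument. The labels and scores are made-up toy values, not the assignment data, and the function names (roc_points, auc, fpr_at_tpr) are hypothetical. Comparing instruments by AUC is one common criterion, not the only possible one.

```python
def roc_points(labels, scores):
    """Return the ROC curve as a list of (fpr, tpr) points from (0, 0) to (1, 1).

    labels: 1 for the positive class (e.g. Cancer), 0 for the negative class.
    scores: classifier outputs, higher meaning "more likely positive".
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sweep the decision threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all samples sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def fpr_at_tpr(points, target_tpr):
    """Smallest FPR among the ROC points whose TPR reaches the target.

    No interpolation between points is done (an assumption of this sketch).
    """
    return min(fpr for fpr, tpr in points if tpr >= target_tpr)


# Toy values only: 4 positives and 4 negatives scored by two instruments.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
scores_b = [0.9, 0.6, 0.4, 0.2, 0.8, 0.7, 0.3, 0.1]

roc_a = roc_points(labels, scores_a)
roc_b = roc_points(labels, scores_b)
auc_a, auc_b = auc(roc_a), auc(roc_b)
fpr_a = fpr_at_tpr(roc_a, 0.9)
fpr_b = fpr_at_tpr(roc_b, 0.9)
print("AUC A:", auc_a, "AUC B:", auc_b)
print("FPR at TPR >= 90%, A:", fpr_a, "B:", fpr_b)
```

Plotting the two lists of (fpr, tpr) points on the same axes gives the kind of ROC figure requested in a).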
Exercise 2 – Pulsars
Pulsars are a rare type of neutron star that produces radio emission. As pulsars rotate,
their emission beam sweeps across the sky. When this beam crosses the line of sight of large
radio telescopes, it produces a detectable pattern of broadband radio emission. Each pulsar
produces a slightly different emission pattern. In practice, many detections are caused by
radio frequency interference and noise, making legitimate signals hard to find.
The dataset pulsars.csv contains eight measurements from 2,000 stars obtained with
radio telescopes. The description of those eight variables is provided, for information
only, in the file pulsars-info.txt. The ninth variable indicates if the star is a pulsar
(1) or not (0). This dataset has been checked by human annotators.
a) Using multi-dimensional scaling (MDS) with the Sammon minimization criterion, show
with a single figure that pulsars congregate together when projected onto a
two-dimensional space. Write a short comment that helps locate the pulsars in the
figure.
b) Evaluate the pulsar detection ability of a multivariate logistic regression (using all
eight variables) by calculating its AUC. Is it an effective method? Why?
c) (This question is not directly related to b)) Can you find a clustering method
(among the ones seen in class) that would be successful in identifying a cluster
mostly composed of pulsars in the first dataset (pulsars.csv)? You must show
your work for this question, that is: if you find one, just write the code that
identifies the pulsar cluster and produce one single figure. If you don’t find any
satisfactory clustering method, show what your failed attempts produced in a single
multi-panel figure.
d) The file pulsar2.csv contains new measurements from 500 other stars but their
pulsar nature has not been checked by a human annotator. Determine which stars,
out of those 500, are pulsars. Aim for a 90% true positive rate. How many new
pulsars have you found in this new dataset?
(Hint: use the logistic regression fitted on the first dataset to infer the probability that
a star in the second dataset is a pulsar. Then, find the threshold that gave a TPR of
90% on the first dataset and use this threshold to classify the candidate pulsars of the
second dataset.)
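The threshold transfer described in the hint can be sketched in Python as follows. This is a minimal illustration, assuming the pulsar probabilities have already been produced by a fitted logistic regression. All numbers are made-up toy values, and the helper name threshold_for_tpr is hypothetical.

```python
import math


def threshold_for_tpr(labels, probs, target_tpr):
    """Highest threshold t such that calling prob >= t "positive"
    recovers at least target_tpr of the known positives."""
    positives = sorted((p for p, y in zip(probs, labels) if y == 1),
                       reverse=True)
    # Number of known positives that must lie at or above the threshold.
    # The small epsilon guards against floating-point round-up in the product.
    needed = max(1, math.ceil(target_tpr * len(positives) - 1e-9))
    return positives[needed - 1]


# Toy labelled set: 10 known pulsars (1) and 5 non-pulsars (0).
labels = [1] * 10 + [0] * 5
probs = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.20,
         0.50, 0.40, 0.30, 0.10, 0.05]

threshold = threshold_for_tpr(labels, probs, 0.90)

# Check the TPR actually achieved on the labelled set.
n_pos = sum(labels)
tpr = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold) / n_pos

# Apply the same threshold to probabilities predicted for unlabelled stars.
new_probs = [0.97, 0.56, 0.55, 0.54, 0.10, 0.80]
candidates = [i for i, p in enumerate(new_probs) if p >= threshold]
print("threshold:", threshold, "TPR on labelled set:", tpr)
print("candidate pulsar indices:", candidates)
```

When several known positives share the threshold value, the achieved TPR can exceed the target, so it is worth checking it explicitly as done here.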
Exercise 3 – Wheat Seeds
The file seeds.csv is a dataset that records seven measurements of kernels belonging to
three different varieties of wheat: Kama, Rosa and Canadian, coded 1, 2 and 3
respectively in the eighth column. The description of the seven measurements is
provided, for information only, in the file seeds-info.txt.
We want to find a clustering method that identifies the Canadian wheat variety
reasonably well based on the seven measurements.
a) Perform a classical multi-dimensional scaling (MDS) that projects this dataset onto
a 2-dimensional space.
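For reference, classical MDS double-centres the matrix of squared distances and keeps the two leading eigenvectors, each scaled by the square root of its eigenvalue. The Python sketch below illustrates this on four toy points, not on the seeds data. To stay dependency-free it swaps in plain power iteration with deflation for the eigendecomposition that a linear-algebra library would normally provide. It assumes Euclidean input distances, so that the two leading eigenvalues are non-negative. The function name classical_mds_2d is hypothetical.

```python
import math
import random


def classical_mds_2d(dist, n_iter=500, seed=0):
    """Classical (Torgerson) MDS: embed n points in 2-D from an n x n
    symmetric matrix of pairwise distances."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        # Power iteration for the leading eigenvector of the current B.
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(n_iter):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the unit vector v.
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(x * y for x, y in zip(v, bv))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][axis] = v[i] * scale
        # Deflate: remove the component just found before the next axis.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Toy example: the corners of a 4 x 2 rectangle.
points = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
embedding = classical_mds_2d(dist)
for row in embedding:
    print(row)
```

The embedding is only defined up to rotation and reflection, so it is the pairwise distances, not the coordinates themselves, that should match the input. Pure-Python power iteration is slow for larger matrices; it is used here only to keep the sketch self-contained.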