Introduction to Statistics for Data Science
MATH42715: Introduction to Statistics for Data Science
Assignment 2
1 Submission Information
Key information:
• Submission deadline: Midday on Friday 9th December 2022.
• Submission format: via Gradescope as a single electronic file in PDF format.
• Help on the statistical content will only be given until 6pm on Wednesday 7th December 2022. After
this time, only help with online submission will be available. For queries about online submission after
6pm on Wednesday 7th December 2022, please email Dr Tahani Coolen-Maturi.
• The report should not exceed 8 pages. You are advised to include an Appendix, which does not count
towards the page limit, detailing enough R code to allow the reader to reproduce your analysis. You
may also like to use the Appendix to include supplementary tabular and graphical output.
2 Assignment Brief
The assignment is worth 60% of the overall mark for the module. Your work should be presented as a coherent
report, giving consideration to the tasks and marking scheme detailed in Sections 2.2.1 to 2.2.4 below. You
do not need to comprehensively describe everything you have done to explore and model the data. However,
you should provide a narrative which details and justifies the salient features of your approach, in addition to
reporting and interpreting your results in the context of the scientific problem presented in Section 2.1. There
will also be marks for the academic writing, structuring and presentation of this report; see Section 2.2.5
below.
2.1 Data
In this assignment, you will analyse the BreastCancer data set which concerns characteristics of breast tissue
samples collected from 699 women in Wisconsin using fine needle aspiration cytology (FNAC). This is a
type of biopsy procedure in which a thin needle is inserted into an area of abnormal-appearing breast tissue.
Nine easily assessed cytological characteristics, such as uniformity of cell size and shape, were measured for
each tissue sample on a one to ten scale. Smaller numbers indicate cells that looked healthier in terms of
that characteristic. Further histological examination established whether each of the samples was benign or
malignant. The objective of the clinical experiment was to determine the extent to which a tissue sample
could be classified as benign or malignant using only the nine cytological characteristics.
For the purposes of this assignment, you may assume that the patients can be regarded as a random sample
from the population of women experiencing symptoms of breast cancer.
The data set is part of the mlbench package. The package can be installed by typing into the console
install.packages("mlbench")
It can then be loaded into R and inspected as follows:
## Load mlbench package
library(mlbench)
## Load the data
data(BreastCancer)
## Check size
dim(BreastCancer)
## [1] 699 11
## Print first few rows
head(BreastCancer)
##        Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 1 1000025            5         1          1             1            2
## 2 1002945            5         4          4             5            7
## 3 1015425            3         1          1             1            2
## 4 1016277            6         8          8             1            3
## 5 1017023            4         1          1             3            2
## 6 1017122            8        10         10             8            7
##   Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           1           3               1       1    benign
## 2          10           3               2       1    benign
## 3           2           3               1       1    benign
## 4           4           3               7       1    benign
## 5           1           3               1       1    benign
## 6          10           9               7       1 malignant
More information on the variables can be found by typing ?BreastCancer in the console.
2.2 Tasks and Marking Scheme
Your ultimate goal is to build a classifier for the Class – benign or malignant – of a tissue sample based on
(at least some of) the nine cytological characteristics. It should be stressed that this is a real data set and
there is no “correct” answer. The sections below indicate the components your report should include and the
number of marks attributed to each.
2.2.1 Cleaning the Data (10 marks)
Before starting any analysis, you should clean the data:
• Technically, the nine cytological characteristics are ordinal variables on a 1 – 10 scale. In the
BreastCancer data, they are encoded as factors. For the purposes of this assignment, we will treat
them as quantitative variables. You should carefully convert the factors to quantitative variables.
• This data set contains some missing observations on predictors, encoded as NA. For the purposes of this
assignment, you should remove all of the rows where there are missing values before carrying out any
further analysis. To do this, you may find the is.na function helpful. For instance
## Print 24th row of BreastCancer data and note there is an NA in the
## Bare.nuclei column:
BreastCancer[24,]
##         Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 24 1057013            8         4          5             1            2
##    Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
## 24        <NA>           7               3       1 malignant
## Test whether each element on the 24th row is an NA:
is.na(BreastCancer[24,])
##       Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 24 FALSE        FALSE     FALSE      FALSE         FALSE        FALSE
##    Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses Class
## 24        TRUE       FALSE           FALSE   FALSE FALSE
• Remember to provide a concise summary of how you cleaned the data.
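The two cleaning steps above can be sketched on a small toy data frame. This is only an illustration under stated assumptions: the data frame toy and its values are hypothetical and merely mimic the structure of BreastCancer, and the approach shown is one of several reasonable ones. The sketch shows why care is needed when converting a factor: as.numeric applied directly to a factor returns the internal level codes, which differ from the labels whenever a level is absent.

```r
## Toy data frame (hypothetical values) mimicking the structure of BreastCancer
toy <- data.frame(
  Id = c("a", "b", "c", "d"),
  Mitoses = factor(c("1", "10", "2", "10"), levels = c("1", "2", "10")),
  Bare.nuclei = factor(c("1", NA, "10", "4"), levels = as.character(1:10)),
  Class = factor(c("benign", "malignant", "benign", "malignant"))
)

## as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(toy$Mitoses)                  # 1 3 2 3
## Converting via character first recovers the labels as numbers
values <- as.numeric(as.character(toy$Mitoses))   # 1 10 2 10

## Convert every predictor column in the same way
predictors <- c("Mitoses", "Bare.nuclei")
toy[predictors] <- lapply(toy[predictors],
                          function(f) as.numeric(as.character(f)))

## Keep only the rows that contain no NA in any column
clean <- toy[rowSums(is.na(toy)) == 0, ]
```

For the real data, the same pattern would presumably be applied to the nine cytological columns of BreastCancer, leaving Id and Class unconverted.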
2.2.2 Exploratory Data Analysis (20 marks)
Consider some exploratory data analysis. For example, how might you summarise the data graphically and
numerically? What does this tell you about the relationships between the response variable and predictor
variables and about the relationships between predictor variables? Remember to set your discussion in the
context of the scientific problem presented in Section 2.1.
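As a starting point, numerical summaries of the kind described above might look like the following sketch. It uses simulated data, not the real BreastCancer measurements: the data frame sim, the size of the shift between classes and the noise levels are all assumptions made purely for illustration, and only the column names are borrowed from the data set.

```r
## Simulated stand-in for the cleaned data (NOT the real measurements)
set.seed(42)
n <- 100
Class <- factor(rep(c("benign", "malignant"), each = n / 2))
shift <- ifelse(Class == "malignant", 4, 0)   # assumed class effect
base <- rnorm(n)                              # shared noise, induces correlation
clamp <- function(x) pmin(10, pmax(1, round(x)))
sim <- data.frame(
  Class = Class,
  Cell.size = clamp(2 + shift + base + rnorm(n, sd = 0.5)),
  Cell.shape = clamp(2 + shift + base + rnorm(n, sd = 0.5))
)

## Class balance
class_counts <- table(sim$Class)

## Response versus predictors: mean of each predictor within each class
class_means <- aggregate(cbind(Cell.size, Cell.shape) ~ Class,
                         data = sim, FUN = mean)

## Predictors versus predictors: sample correlation matrix
cor_mat <- cor(sim[, c("Cell.size", "Cell.shape")])
```

Graphical counterparts in base R include boxplot(Cell.size ~ Class, data = sim) for a predictor against the response and pairs() for a scatterplot matrix of the predictors.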
2.2.3 Modelling (35 marks)
You should build classifiers using:
• Logistic regression, with best subset selection;
• The Bayes classifier for linear discriminant analysis (LDA) or quadratic discriminant analysis (QDA) or
both.
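The two kinds of classifier can be sketched as follows on simulated data. Several assumptions are made here: the data frame sim and its coefficients are invented for illustration, best subset selection is implemented by fitting every non-empty subset of predictors with glm and comparing BIC (other criteria, such as AIC or cross-validation error, are equally valid), and LDA is fitted with lda from the MASS package. This is one possible approach, not a prescribed one.

```r
library(MASS)   # provides lda() and qda()

## Simulated data (illustrative only): x3 is unrelated to the response
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 2 * x1 - 1.5 * x2))
sim <- data.frame(y = factor(y, labels = c("benign", "malignant")),
                  x1, x2, x3)

## Enumerate every non-empty subset of the predictors
predictors <- c("x1", "x2", "x3")
subsets <- unlist(
  lapply(seq_along(predictors),
         function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

## Fit a logistic regression for each subset and record its BIC
fits <- lapply(subsets, function(vars) {
  glm(reformulate(vars, response = "y"), data = sim, family = binomial)
})
bic <- sapply(fits, BIC)

## The "best" subset under this criterion minimises BIC
best_fit <- fits[[which.min(bic)]]
best_vars <- subsets[[which.min(bic)]]

## Linear discriminant analysis using all predictors
lda_fit <- lda(y ~ x1 + x2 + x3, data = sim)
lda_class <- predict(lda_fit, sim)$class

## Training error of the LDA classifier (optimistic; prefer
## cross-validation or a held-out test set when comparing models)
lda_train_error <- mean(lda_class != sim$y)
```

With nine predictors there are 2^9 - 1 = 511 non-empty subsets, so exhaustive enumeration of this kind remains feasible.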