AMA546
Statistical Data Mining
This question paper has 3 pages (cover included).
Instructions to Candidates:
This paper has TWO (2) questions.
Please attempt ALL of them, and show your steps.
Please write all your answers on a new answer sheet in a PDF file. The question paper will
not be marked.
Python is the suggested language.
Submit only the PDF file to the midterm submission channel.
Attachments: N.A.
Subject Examiner: Dr. Tang Wenlu
The full marks are 100.
1. Consider the linear model
$$Y_j = \beta_0 + \beta_1 z_{j,1} + \cdots + \beta_5 z_{j,5} + \varepsilon_j, \quad j = 1, \ldots, n.$$
The Prostate Cancer data provided in the attached '.txt' file come
from a study by Stamey et al. (1989). This study examined the correlation between
the level of prostate specific antigen (PSA) and a number of clinical measures, in 97
men who were about to receive a radical prostatectomy. The goal is to predict the log
of PSA (lpsa) from a number of measurements including lcavol, lweight, age, lbph, svi,
lcp, gleason and pgg45.
(a) (10 marks) Define a function named 'ridgesolution' that computes the
ridge regression estimator for an arbitrary tuning parameter λ. Then input
$$y = (1/2,\ 1/5,\ 1/4)^\top, \quad \lambda = 1, \quad X = \begin{pmatrix} 0 & 1 \\ 2 & 3 \\ 4 & 2 \end{pmatrix},$$
and print the output.
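The closed-form ridge estimator asked for in part (a) can be sketched as below. This is only one possible reading: it assumes the estimator is $(X^\top X + \lambda I)^{-1} X^\top y$ with no intercept and no standardization, and it assumes X is the 3×2 matrix with rows (0, 1), (2, 3), (4, 2). The lecture notes may define the penalty or the scaling differently.

```python
import numpy as np


def ridgesolution(X, y, lam):
    """Ridge estimate (X'X + lam*I)^{-1} X'y; no intercept (an assumption)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]
    # Solve the penalized normal equations rather than forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


# Inputs from part (a); the layout of X is assumed to be 3 rows by 2 columns.
X = np.array([[0, 1], [2, 3], [4, 2]])
y = np.array([1 / 2, 1 / 5, 1 / 4])
beta = ridgesolution(X, y, lam=1)
print(beta)
```

Using `np.linalg.solve` avoids computing the matrix inverse directly, which is usually the more numerically stable choice.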
(b) (10 marks) Set the random seed to 2022. Split the data set into a training set and
a testing set in the proportions 80% and 20%. (Use 'np.ceil' when calculating the training
sample size, because 97 multiplied by 0.8 is not an integer.)
Then fill in the following table.
Term            Least squares    Ridge    Lasso
Intercept
lcavol
lweight
age
lbph
svi
lcp
gleason
pgg45
Testing Error
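The sample-size calculation in part (b) can be sketched as follows. The paper only fixes the seed and the use of 'np.ceil'. Seeding through `np.random.seed` and shuffling indices with `np.random.permutation` are assumptions, and other splitting routines may give a different partition for the same seed.

```python
import numpy as np

n = 97                               # number of observations in the study
n_train = int(np.ceil(0.8 * n))      # 0.8 * 97 = 77.6, rounded up to 78

np.random.seed(2022)                 # seed required by the question
perm = np.random.permutation(n)      # shuffled row indices (assumed splitting scheme)
train_idx = perm[:n_train]
test_idx = perm[n_train:]
print(n_train, len(test_idx))
```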
(c) (30 marks) Please describe the procedures in your analysis and state your
conclusion.
2. (a) (10 marks) Given a group of values, the entropy of the group is defined as
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i), \qquad \sum_{i=1}^{n} P(x_i) = 1,$$
where P(x) is the probability of appearance of the value x. Define a function
in Python to calculate the entropy of a group.
Moreover, the Gini index of the group is defined as
$$G(X) = 1 - \sum_{i=1}^{n} \{P(x_i)\}^2, \qquad \sum_{i=1}^{n} P(x_i) = 1,$$
where P(x) is the probability of appearance of the value x. Then define a
function in Python to calculate the Gini index of a group.
Suppose the input group is [1, 1, 2, 2, 2]. Please print the output of each
function.
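The two impurity measures defined above can be sketched as below, assuming P(x) is estimated by the relative frequency of each distinct value in the group. The function and helper names are illustrative choices, not ones fixed by the paper.

```python
import math
from collections import Counter


def probabilities(group):
    """Relative frequency of each distinct value in the group."""
    counts = Counter(group)
    n = len(group)
    return [c / n for c in counts.values()]


def entropy(group):
    """Entropy in bits: minus the sum of p * log2(p) over distinct values."""
    return -sum(p * math.log2(p) for p in probabilities(group))


def gini(group):
    """Gini index: one minus the sum of squared probabilities."""
    return 1 - sum(p ** 2 for p in probabilities(group))


group = [1, 1, 2, 2, 2]
print(entropy(group))
print(gini(group))
```

For this input the frequencies are 2/5 and 3/5, so the entropy is about 0.971 bits and the Gini index is 0.48.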
(b) (10 marks) Given two lists of true and predicted labels, the TPR and FPR are
defined as
$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{AP}}, \qquad \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{AN}},$$
where TP, AP, FP and AN are defined in the Chapter 5 lecture notes. Define
a function in Python to calculate the TPR and FPR of a group.
Suppose the input group is
y_predict = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
y_truth = [1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1].
Please print the output of the function.
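A minimal sketch of the TPR and FPR calculation follows. It assumes the usual reading of the lecture-note symbols: label 1 is the positive class, AP is the number of actual positives, and AN is the number of actual negatives. The function name `tpr_fpr` is a hypothetical choice.

```python
def tpr_fpr(y_truth, y_predict):
    """Return (TPR, FPR), treating label 1 as positive and 0 as negative."""
    pairs = list(zip(y_truth, y_predict))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    ap = sum(1 for t in y_truth if t == 1)               # actual positives (assumed meaning of AP)
    an = sum(1 for t in y_truth if t == 0)               # actual negatives (assumed meaning of AN)
    return tp / ap, fp / an


y_predict = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0]
y_truth = [1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1]
tpr, fpr = tpr_fpr(y_truth, y_predict)
print(tpr, fpr)
```

Under these assumptions the input has TP = 3, AP = 6, FP = 1 and AN = 5, giving TPR = 0.5 and FPR = 0.2.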
(c) (30 marks) The Audit Data data set in the attached '.csv' file is from
https://archive.ics.uci.edu/ml/datasets/Audit+Data. The goal of the research is to
help auditors by building a classification model that can predict fraudulent
firms on the basis of present and historical risk factors. Please show your analysis
and conclusion in detail. Note that the classification should be conducted using
at least two methods, and an evaluation of the methods is required.