SDSC 8009: Data Mining and Knowledge Discovery
Individual Assignment I: Data Mining Essentials and Graph Essentials
Deadline: Feb 18, 2022, 23:59 HKT
Remember to explain each of your answers in detail. Please submit your assignment in a
single .zip file containing both your writing answers in a .pdf file and your code in a single
iPython/Jupyter notebook file that comes with proper documentation. Please make your
code well organized so the code for each question can be found in a single block or several
neighboring blocks. Make sure we can run your code on Google Colab.
Question 1. Linear Regression, Logistic Regression, Maximum Likelihood Estimation (MLE):
10pts
1) Explain why mean squared error is used as the loss function of linear regression models. Is your
explanation also valid for nonlinear regression models?
2) Given data $\{x_i, y_i\}_{i=1}^N$ and logistic regression model $P(y_i = 1 \mid x_i, \theta) = \sigma(\theta^T x_i) = \frac{1}{1 + e^{-\theta^T x_i}}$.
Explain why minimizing binary cross entropy loss is equivalent to MLE (or MCLE) for logistic
regression and derive its gradient.
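For reference, the binary cross entropy loss referred to in part 2, written in the notation above and averaged over the $N$ samples, is

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^{N}
\Big[ y_i \log \sigma(\theta^T x_i)
    + (1 - y_i) \log\big(1 - \sigma(\theta^T x_i)\big) \Big]
```

where $\sigma$ denotes the sigmoid function. Your task is to connect this quantity to the (conditional) log-likelihood of the data and then differentiate it with respect to $\theta$.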
Question 2. KNN: 20pts
Develop a KNN classifier class in Python using Numpy (you cannot directly use the KNN Classifier from
sklearn or similar packages). It will be trained and tested on the Iris dataset (please refer to our tutorial 1, https://colab.research.google.com/drive/18yeREKD2Wf5pdUBguNdv1s9ItdjWeAWy?authuser=3, for how to obtain the training and test data). Answer the following questions:
1) What are the hyperparameters needed as the arguments for the init function of the KNN classifier
class?
2) Do we need to train KNN classifier on the training data if we only need to do prediction for the test
data? Point out the distances among which samples we need to compute. Implement the predict
function following your answer to these questions.
3) Plot a scatter or line plot with the x-axis as k = 1, 2, ..., 10 and the y-axis as the accuracy of your model on
the test data; the accuracy should be computed as the average over ten repetitions.
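A minimal skeleton of the kind of class the question asks for is sketched below. The class and argument names are illustrative only; your own design and choice of hyperparameters are part of what is graded.

```python
import numpy as np

class KNNClassifier:
    """Illustrative KNN sketch: memorize training data, predict by majority vote."""

    def __init__(self, k=3):
        self.k = k  # number of neighbors, the main hyperparameter

    def fit(self, X, y):
        # KNN is a lazy learner: "training" just stores the data.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X, dtype=float):
            # Euclidean distance from this test point to every training point
            d = np.linalg.norm(self.X_train - x, axis=1)
            nearest = np.argsort(d)[:self.k]
            labels, counts = np.unique(self.y_train[nearest], return_counts=True)
            preds.append(labels[np.argmax(counts)])
        return np.array(preds)
```

Note which distances this computes: only test-point-to-training-point distances, which is the crux of part 2.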
Question 3. Adjacency Matrix: 10pts
Given adjacency matrix A of a simple undirected graph, answer the following questions:
1) How can we compute the number of common neighbors between node i and node j?
2) How can we compute the number of triangles?
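The question asks for closed-form expressions in terms of A. A brute-force enumeration like the sketch below (on an arbitrary toy graph) can serve as a sanity check for whatever expressions you derive:

```python
import numpy as np
from itertools import combinations

# Arbitrary toy graph for checking: edges 0-1, 0-2, 1-2, 1-3
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]])

def common_neighbors(A, i, j):
    # node k is a common neighbor of i and j iff A[i, k] == A[j, k] == 1
    return int(np.sum(A[i] & A[j]))

def count_triangles(A):
    # enumerate all 3-node subsets and test whether all three edges exist
    return sum(1 for i, j, k in combinations(range(A.shape[0]), 3)
               if A[i, j] and A[j, k] and A[i, k])
```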
Question 4. BFS and Connected Components: 30pts
For the following network: https://snap.stanford.edu/data/bigdata/communities/com-lj.ungraph.txt.gz, and for p from 1 to 100,
• remove p% of edges randomly and
• compute the size of the largest connected component using BFS. Denote this size as s(p).
Develop a Python Program to finish this task and plot a figure with x-axis as p and y-axis as s(p). You
can use networkx or other libraries to work with the graph. Submit your code and the plot.
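The overall loop might be sketched as follows; the toy path graph here stands in for the SNAP com-LiveJournal edge list, and the function names are illustrative:

```python
import random
from collections import deque

def largest_cc_size(nodes, edges):
    """Largest connected component size, found with an explicit BFS."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, best = set(), 0
    for s in nodes:
        if s in seen:
            continue
        # BFS from s to collect its component
        q, comp = deque([s]), 0
        seen.add(s)
        while q:
            u = q.popleft()
            comp += 1
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    q.append(w)
        best = max(best, comp)
    return best

def s_of_p(nodes, edges, p, seed=0):
    """Remove p% of edges at random, then return the largest component size."""
    rng = random.Random(seed)
    kept = rng.sample(edges, len(edges) - int(len(edges) * p / 100))
    return largest_cc_size(nodes, kept)

# toy example: a 6-node path graph
nodes = list(range(6))
edges = [(i, i + 1) for i in range(5)]
print(s_of_p(nodes, edges, 0))   # no removal: the whole path, size 6
```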
Question 5. Eigenvector Centrality: 10pts
1) Come up with an example of a directed connected graph where eigenvector centrality becomes zero
for some nodes. Detail when this happens.
2) Show an example where the eigenvector centrality of all nodes in the graph is the same while
betweenness centrality gives different values for different nodes.
Question 6. PageRank: 20 pts
Figure 1. A simple directed graph.
1) Calculate PageRank values for the graph in Fig. 1 given (α = 1, β = 0), (α = 0.3, β = 0.5), and (α = 0, β = 1).
Discuss the effects of different values of α and β for this particular problem.
2) Does β have any influence on the order of nodes ranked by PageRank? In other words, if for one
value of β the centrality value of node vi is greater than that of vj, is it possible to change β
such that the centrality of vj becomes larger than that of vi?
3) In PageRank, what α values can we select in order to guarantee that centrality values are calculated
correctly, i.e., values do not diverge?
4) What is the largest eigenvalue of matrix AD−1 given the graph in Fig. 1? Does this value change
when the graph is different? Prove it.
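Since Fig. 1 is not reproduced here, the sketch below runs the power iteration for the PageRank formulation used in this question, c = α AD⁻¹c + β·1, on an arbitrary 3-node cycle. It assumes the convention that A[i, j] = 1 means an edge from j to i, so the columns of A sum to out-degrees; check this against the convention used in lectures.

```python
import numpy as np

def pagerank(A, alpha, beta, iters=200):
    """Iterate c <- alpha * (A D^-1) c + beta * 1 from a uniform start."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    out_deg = A.sum(axis=0)        # column sums = out-degrees (convention above)
    out_deg[out_deg == 0] = 1.0    # guard against division by zero for sinks
    P = A / out_deg                # P = A D^{-1}
    c = np.ones(n) / n
    for _ in range(iters):
        c = alpha * P @ c + beta * np.ones(n)
    return c

# arbitrary stand-in graph (NOT Fig. 1): the directed 3-cycle 0 -> 1 -> 2 -> 0
A = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])
print(pagerank(A, 0.85, 0.05))   # uniform by symmetry: each value is 1/3
```

Experimenting with different (α, β) pairs on small graphs like this is a quick way to check your hand calculations for parts 1–3.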