COMPSCI 589 Homework
1 Instructions
• This homework assignment consists of a programming portion. While you may discuss problems
with your peers, you must answer the questions on your own and implement all solutions
independently. In your submission, do explicitly list all students with whom you discussed this
assignment.
• We strongly recommend that you use LaTeX to prepare your submission. The assignment should
be submitted on Gradescope as a PDF with marked answers via the Gradescope interface. The
source code should be submitted via the Gradescope programming assignment as a .zip file.
Include with your source code instructions for how to run your code.
• We strongly encourage you to use Python 3 for your homework code. You may use other
languages. In either case, you must provide us with clear instructions on how to run your code
and reproduce your experiments.
• You may not use any machine learning-specific libraries in your code, e.g., TensorFlow, PyTorch,
or any machine learning algorithms implemented in scikit-learn (though you may use other
functions provided by this library, such as one that splits a dataset into training and testing
sets). You may use libraries like numpy and matplotlib. If you are not certain whether a specific
library is allowed, do ask us.
• All submissions will be checked for plagiarism using two independent plagiarism-detection tools.
Renaming variable or function names, moving code within a file, etc., are all strategies that do
not fool the plagiarism-detection tools we use. If you get caught, all penalties mentioned in the
syllabus will be applied—which may include directly failing the course with a letter grade of “F”.
→ Before starting this homework, please review this course’s policies on plagiarism by
reading Section 14 of the syllabus.
• The .tex file for this homework (which you can use if you decide to write your solution in LaTeX)
can be found here.
• The automated system will not accept assignments after 11:55pm on March 01.
Programming Section (100 Points Total)
In this section of the homework, you will implement two classification algorithms: k-NN and Decision
Trees. Notice that you may not use existing machine learning code for this problem: you
must implement the learning algorithms entirely on your own and from scratch.
1. Evaluating the k-NN Algorithm (50 Points Total)
In this question, you will implement the k-NN algorithm and evaluate it on a standard benchmark
dataset: the Iris dataset. Each instance in this dataset contains (as attributes) four properties
of a particular plant/flower. The goal is to train a classifier capable of predicting the flower’s
species based on its four properties. You can download the dataset here.
The Iris dataset contains 150 instances. Each instance is stored in a row of the CSV file and is
composed of 4 attributes of a flower, as well as the species of that flower (its label/class). The
goal is to predict a flower’s species based on its 4 attributes. More concretely, each training
instance contains information about the length and width (in centimeters) of the sepal of a flower,
as well as the length and width (in centimeters) of the flower’s petal. The label associated with
each instance indicates the species of that flower: Iris Versicolor, Iris Setosa, or Iris Virginica.
See Figure 1 for an example of what these three species of the Iris flower look like. In the
CSV file, the attributes of each instance are stored in the first 4 columns of each row, and the
corresponding class/label is stored in the last column of that row.
Figure 1: Pictures of three species of the Iris flower (Source: Machine Learning in R for beginners).
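Since the k-NN algorithm must be implemented from scratch, the sketch below illustrates one possible way to classify a single query instance, assuming Euclidean distance (the metric specified in the experiment below) and a simple majority vote among the k nearest neighbors. The function and variable names are purely illustrative and are not part of any required interface.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k):
    """Predict the label of one query instance with k-NN,
    using Euclidean distance and a majority vote."""
    # Euclidean distance from the query point to every training instance.
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k closest training instances.
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those k neighbors.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

Note that ties in the vote are possible (e.g., with an even k); this section does not prescribe a tie-breaking rule, so breaking ties arbitrarily or by preferring the closest neighbor's class are both reasonable choices.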
The goal of this experiment is to evaluate the impact of the parameter k on the algorithm’s
performance when used to classify instances in the training data, and also when used to classify
new instances. For each experiment described below, you should use Euclidean distance as the
distance metric and then follow these steps:
(a) shuffle the dataset to make sure that the order in which examples appear in the dataset file
does not affect the learning process;1
(b) randomly partition the dataset into two disjoint subsets: a training set, containing 80%
of the instances selected at random; and a testing set, containing the other 20% of the
instances. Notice that these sets should be disjoint: if an instance is in the training set,
1If you are writing Python code, you can shuffle the dataset by using, e.g., the sklearn.utils.shuffle function.
it should not be in the testing set, and vice-versa.2 The goal of splitting the dataset in
this way is to allow the model to be trained based on just part of the data, and then to
“pretend” that the rest of the data (i.e., instances in the testing set, which were not used
during training) correspond to new examples on which the algorithm will be evaluated.
If the algorithm performs well when used to classify examples in the testing set, this is
evidence that it has successfully generalized the knowledge it acquired from the training
examples (a brief code sketch of steps (a) and (b) is shown below);
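To make steps (a) and (b) concrete, one possible (illustrative) way to load, shuffle, and split the data in Python is sketched below. The file name iris.csv and the assumption that the CSV contains no header row are placeholders; adapt them to the file you actually download.

import numpy as np
from sklearn.utils import shuffle  # permitted for shuffling; see footnote 1

# Columns 0-3 hold the four numeric attributes; the last column is the label.
data = np.genfromtxt("iris.csv", delimiter=",", dtype=str)
X = data[:, :4].astype(float)  # sepal/petal lengths and widths (cm)
y = data[:, 4]                 # species label

# Step (a): shuffle so the original file order does not influence learning.
X, y = shuffle(X, y)

# Step (b): disjoint split -- 80% for training, the remaining 20% for testing.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]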