2DI70/2MMS80 - Statistical Learning Theory
Nearest neighbor classification and handwritten digit classification
1 Introduction
Sometimes simple ideas can be surprisingly good. This is the case with one of the oldest, but still rather popular, learning rules, known as the k-nearest neighbor rule (abbreviated k-NN in this document). Consider the setting of supervised learning. Suppose you have a training data set $\{X_i, Y_i\}_{i=1}^n$, where $X_i \in \mathcal{X}$ and $Y_i \in \mathcal{Y}$, and $\mathcal{X}$ is a metric space (that is, a space endowed with a way to measure distances). As usual, our goal is to learn a prediction rule $f : \mathcal{X} \to \mathcal{Y}$ that is able to make “good” predictions on unseen data.
The idea of k-NN is remarkably simple. Given a point x ∈ X for which we want a
prediction, we simply look for the k “closest” points in the training set and make a prediction
based on a majority vote (classification) or an average (regression) of the neighbor labels. It is as simple as that. Computationally this might seem cumbersome, particularly for large datasets, but one can use clever computational tricks to ensure it can be done quickly.
In this assignment, which is divided into two parts, you will: (i) get first-hand experience with this method by implementing it and choosing a good set of tunable parameters in a sound way; (ii) analyze the performance of this method in some generality and get a better understanding of why it is sensible.
To make the explanation more concrete let us consider the problem of handwritten digit
classification (which is the topic of part I): given a low resolution image of a handwritten
digit, we would like to classify it as one of the digits in {0, 1, . . . , 9}. More specifically, our images have 28×28 pixels, each pixel taking values in {0, 1, 2, . . . , 255}. Therefore $\mathcal{X} = \{0, 1, \ldots, 255\}^{28 \times 28}$ and $\mathcal{Y} = \{0, 1, \ldots, 9\}$.
2 The k-NN rule
Let $d : \mathcal{X} \times \mathcal{X} \to [0, +\infty)$ be a metric¹ in $\mathcal{X}$. Let $x \in \mathcal{X}$ be an arbitrary point and consider the re-ordering of the training data pairs as
$$\bigl(X_{(1)}(x), Y_{(1)}(x)\bigr), \bigl(X_{(2)}(x), Y_{(2)}(x)\bigr), \ldots, \bigl(X_{(n)}(x), Y_{(n)}(x)\bigr),$$
so that
$$d\bigl(x, X_{(1)}(x)\bigr) \leq d\bigl(x, X_{(2)}(x)\bigr) \leq \cdots \leq d\bigl(x, X_{(n)}(x)\bigr).$$
¹A metric or distance is a function that must satisfy the following properties: (i) ∀x ∈ X, d(x, x) = 0; (ii) ∀x, y ∈ X, d(x, y) = d(y, x) (symmetry); (iii) ∀x, y, z ∈ X, d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality).
Note that the ordering depends on the specific point x (hence the cumbersome notation)
and might not be unique. In that case we can break ties in some pre-defined way (e.g., if
two points are at equal distance from x the point that appears first in the original dataset
will also appear first in the ordered set). The k-NN rule (for classification) is defined as
$$\hat{f}_n(x) = \arg\max_{y \in \mathcal{Y}} \left\{ \sum_{i=1}^{k} \mathbf{1}\bigl\{ Y_{(i)}(x) = y \bigr\} \right\}. \qquad (1)$$
In other words, just look among the k nearest neighbors and choose the class that is represented most often. Obviously, there might be situations where two (or more) classes appear an equal number of times. In such situations one can break these ties according to a pre-specified rule.
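To make the rule in Equation (1) concrete, here is a minimal Python sketch of the vote for a single query point, assuming the labels of its k nearest neighbors have already been collected in order of increasing distance. The function name and the specific tie-breaking rule (prefer the class of the closer neighbor) are our own illustrative choices, not something prescribed above.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most frequent label among the k nearest neighbors.

    `neighbor_labels` is assumed to be ordered from nearest to farthest,
    so ties between classes are broken in favour of the class whose
    first occurrence is closest to the query point.
    """
    counts = Counter(neighbor_labels)
    best = max(counts.values())
    # Walk through the neighbors in order of increasing distance and
    # return the first label that attains the maximum count.
    for label in neighbor_labels:
        if counts[label] == best:
            return label

# Example: labels of the k = 5 nearest neighbors, nearest first.
print(majority_vote([3, 8, 3, 8, 1]))  # prints 3 (tie 3 vs 8, 3 is nearer)
```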
The performance of the method described above hinges crucially on the choice of two parameters: k, the number of neighbors used for prediction, and $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, the distance metric used to define proximity of two points in the feature space. There are many possible choices for d, and a naïve but sensible starting point is to consider the usual Euclidean distance: if $x, y \in \mathbb{R}^l$ then the Euclidean distance is simply given by $\sqrt{\sum_{i=1}^{l} (x_i - y_i)^2}$.
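As an illustration of how the pieces fit together, the following sketch (assuming NumPy is available; all names are ours) computes Euclidean distances from a query point to every training point, selects the k nearest ones with a sorting subroutine, and applies the same kind of vote as in the previous sketch.

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k):
    """Predict the label of a single query point x with the k-NN rule and
    the Euclidean distance. X_train is an (n, l) array of training features,
    y_train the corresponding length-n array of labels."""
    # Squared Euclidean distances (monotone in the true distance, so the
    # induced ordering of the neighbors is the same).
    dists = np.sum((X_train - x) ** 2, axis=1)
    # Stable sort: points at equal distance keep their original order,
    # which is one pre-defined way of breaking distance ties.
    order = np.argsort(dists, kind="stable")
    neighbor_labels = y_train[order[:k]]
    # Majority vote as in Equation (1), ties going to the nearer class.
    counts = Counter(neighbor_labels)
    best = max(counts.values())
    return next(lab for lab in neighbor_labels if counts[lab] == best)
```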
3 The MNIST dataset
The MNIST dataset² is a classical dataset frequently used to demonstrate machine learning methods, and is still often used as a benchmark. This
dataset is provided as comma-separated value (csv) files in CANVAS. The training set
MNIST train.csv consists of 60000 images of handwritten digits and the corresponding labels (provided by a human expert). The test set MNIST test.csv consists of 10000 images
of handwritten digits and the corresponding labels (this file will be provided only closer to
the assignment deadline). In addition, in CANVAS you will also find two smaller training
and test sets, MNIST train small.csv (3000 examples) and MNIST test small.csv (1000
examples). These will be used for a large part of the assignment, to avoid the struggles
associated with large datasets and to test out your implementations.
The format of the data is as follows: each row in the .csv file has 785 entries and
corresponds to a single example. The first entry in the row is the “true” label, in Y =
{0, 1, . . . , 9} and the 784 subsequent entries encode the image of the digit – each entry
corresponding to a pixel intensity, read in a lexicographical order (left-to-right then top-to-
bottom). Pixel intensities take values in {0, 1, . . . , 255}. The Matlab function showdigit.m
in CANVAS will take as input a row of this data and display the corresponding digit image.
Figure 1 shows several examples from MNIST train.csv.
²See http://yann.lecun.com/exdb/mnist/ for more details and information.
Figure 1: First five examples from MNIST train.csv and the corresponding labels provided by a human expert (true labels: 5, 0, 4, 1, 9).
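For concreteness, one way to read one of these files into memory is sketched below, assuming NumPy is available; the file name is an assumption about how you stored the small training set locally, so adjust it to your setup.

```python
import numpy as np

# Each row: the first entry is the true label, the remaining 784 entries are
# pixel intensities of a 28x28 image in lexicographical (row-major) order.
# The file name is an assumption; add skiprows=1 if your file has a header row.
data = np.loadtxt("MNIST_train_small.csv", delimiter=",")
labels = data[:, 0].astype(int)   # values in {0, ..., 9}
images = data[:, 1:]              # shape (n, 784), values in {0, ..., 255}

# Reshape one example back into a 28x28 image (e.g., for display).
first_image = images[0].reshape(28, 28)
print("label of the first example:", labels[0])
```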
Ultimately the goal is to minimize the probability of making errors. For the purposes of this assignment we will simply use the 0/1 loss. This means that the empirical risk is simply the fraction of errors we make. If $\{X'_i, Y'_i\}_{i=1}^m$ denotes the pairs of features/labels in a test set and $\{\hat{Y}'_i\}_{i=1}^m$ denotes the corresponding labels inferred by the k-NN rule, then the empirical risk on the test set is given by
$$\frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\bigl\{\hat{Y}'_i \neq Y'_i\bigr\}.$$
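A minimal sketch of this computation follows; the function and variable names are our own.

```python
import numpy as np

def empirical_risk(y_true, y_pred):
    """Average 0/1 loss: fraction of test points whose predicted label
    differs from the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

# Example with m = 4 test points: one mistake out of four gives risk 0.25.
print(empirical_risk([5, 0, 4, 1], [5, 0, 9, 1]))
```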
PART I - Computational Assignment
The goal of the first part of the assignment is to implement “from scratch” a nearest neighbor
classifier. This means you will not be allowed to use existing libraries and implementations of
nearest neighbors, and should only make use of standard data structures and mathematical
operations³. The only “exception” is that you are allowed to use a sorting subroutine (i.e., a function that, given a vector of numerical values, can sort them in ascending order and return the corresponding reordered indexes). The rationale for the above restrictions is for you to experience what the critical aspects of your implementation are, and to understand whether it is scalable to big datasets. For this assignment you are allowed to use any language or
command interpreter (preferably a high-level language, but not necessarily so). You will
not be judged on your code, but rather on your choices and corresponding justification.
You should prepare a report (in English) and upload it via CANVAS. The report should
be self-contained, and you should pay attention to the following points:
• The report should feature an introduction, explaining the problem and methodology.
• Use complete sentences: there should be a coherent story and narrative - not simply
numerical answers to the questions without any criticism or explanation.
• Pay attention to proper typesetting. Use a spelling checker. Make sure figures and
tables are properly typeset.
³This means that libraries encoding useful data structures are allowed, as long as these are not specifically targeting nearest neighbors.
• It is very important for you to have a critical attitude, and to comment on your choices and results.
The report for part I should be submitted as a single .pdf file. In addition, submit a
separate .pdf file with the code/script you used (you will not be graded on this, but if
needed we might look at it to better understand the results in your report). In your report
you should do the following experiments and answer the questions below.
a) Write down your implementation of the k-NN rule (using as training data
MNIST train small.csv) and report on its accuracy to predict the labels
in both the training and test sets (respectively MNIST train small.csv and
MNIST test small.csv). For this question use the simple Euclidean distance. Make
a table of results for k ∈ {1, . . . , 20}, plot the empirical training and test loss
as a function of k, and comment on your results. Explain how ties are broken in
Equation 1.
b) Obviously the choice of the number of neighbors k is crucial to obtain good performance. This choice must be made WITHOUT LOOKING at the test dataset. Although one can use rules-of-thumb, a possibility is to use cross-validation. Leave-One-Out Cross-Validation (LOOCV) is extremely simple in our context. Implement LOOCV to estimate the risk of the k-NN rule for k ∈ {1, . . . , 20}. Report these LOOCV risk estimates⁴ in a table and plot them as well as the empirical loss on the test dataset (that you obtained in (a)). Given your results, what would be a good choice for k? Comment on your results. (A sketch of one possible way to organize the LOOCV computation is given after this list of questions.)
c) Obviously, the choice of distance metric also plays an important role. Consider a simple generalization of the Euclidean distance, namely the $\ell_p$ distances (also known as Minkowski distances). For $x, y \in \mathbb{R}^l$ define
$$d_p(x, y) = \left( \sum_{i=1}^{l} |x_i - y_i|^p \right)^{1/p},$$
where p ≥ 1. Use leave-one-out cross-validation to simultaneously choose a good value for k ∈ {1, . . . , 20} and p ∈ [1, 15].
d) (this question is more open) Building on your work for the previous questions, suggest a different distance metric or some pre-processing of the data that you consider appropriate to improve the performance of the k-NN method. Note that any choices you make should be based solely on the training data (that is, do not clairvoyantly optimize the performance of your method on the test data). Clearly justify ALL the choices made and describe the exact steps you took. Someone reading your report should be able to replicate your results.
⁴Recall that these estimates use only the information in the training dataset.
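The following sketch illustrates one possible way to organize the LOOCV computation referred to in questions (b) and (c): for every held-out training point the distances to all other points are computed and sorted once, after which the error for every value of k can be read off from the same ordering. The use of NumPy, the helper names, and the tie-breaking choices are our own; this is not the only valid implementation.

```python
import numpy as np

def loocv_risk(X, y, k_values, p=2.0):
    """Leave-one-out cross-validation estimate of the k-NN risk for every
    k in k_values, using the l_p (Minkowski) distance with exponent p."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    errors = {k: 0 for k in k_values}
    for i in range(n):
        # l_p distances from the held-out point i to all training points.
        dists = (np.abs(X - X[i]) ** p).sum(axis=1) ** (1.0 / p)
        dists[i] = np.inf                       # exclude the point itself
        order = np.argsort(dists, kind="stable")
        for k in k_values:
            neighbor_labels = y[order[:k]]
            # Majority vote; ties between classes go to the nearest neighbor.
            values, counts = np.unique(neighbor_labels, return_counts=True)
            tied = set(values[counts == counts.max()])
            pred = next(lab for lab in neighbor_labels if lab in tied)
            if pred != y[i]:
                errors[k] += 1
    return {k: errors[k] / n for k in k_values}

# Example of a joint grid over k and p as in question (c); on the full
# dataset this is expensive, so smarter bookkeeping may be needed.
# risks = {p: loocv_risk(images, labels, range(1, 21), p) for p in (1, 2, 3)}
```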
Now that you have implemented and tested your methodologies on a smaller scale, let us see
how these methods scale to the full datasets. For the remaining questions you will use the
full MNIST training and test sets.
e) Make use of either the Euclidean distance or dp with your choice of p in part (c)
(use only one or the other). Determine a good value for k using leave-one-out cross
validation when considering the full training set (60000 examples). Was your implementation able to cope with this large amount of data? Did you have to modify it
in any way? If so, explain what you did. What is the risk estimate you obtain via
cross-validation?
f) (it is only possible to answer this question after I provide you with the file MNIST test.csv) Using the choice of k in part (e), compute the loss of your method
on the test set provided. How does this compare with the cross-validation estimate
you computed in (e)? Would you choose a different value for k had you been allowed
to look at the test dataset earlier?
g) Bonus question: each training example is currently a high-dimensional vector. A
very successful idea in machine learning is that of dimensionality reduction. This is
typically done in an unsupervised way - feature vectors are transformed so that most
information is preserved, while significantly lowering their dimension. A possibility in
our setting is to use Principal Component Analysis (PCA) to map each digit image
to a lower dimensional vector. There is an enormous computational advantage (as
computing distances will be easier), but there might also be an advantage in terms
of statistical generalization. Use this idea in our setting, and choose a good number
of principal components to keep in order to have good accuracy (again, this choice
should be solely based on the training data). Document clearly all the steps of your
procedure. In this question you are allowed to use an existing implementation of PCA
or related methods.
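For the bonus question, one possible approach is sketched below; question (g) explicitly allows existing PCA implementations, and here plain NumPy and a singular value decomposition are used, with all names and the example value of r being our own. The idea is to center the training images, compute the principal directions from the training data alone, and project both training and test images onto the first r components before running k-NN.

```python
import numpy as np

def pca_fit(X_train, r):
    """Compute the mean and the first r principal directions of X_train."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are the principal directions (right singular vectors).
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:r]

def pca_transform(X, mean, components):
    """Project data on the principal directions learned from the training set."""
    return (X - mean) @ components.T

# Example: reduce the 784-dimensional images to r = 40 components.  The
# value of r is only an illustration; it should be chosen using the training
# data alone, e.g. by cross-validating the k-NN risk for several r.
# mean, comps = pca_fit(train_images, 40)
# train_low = pca_transform(train_images, mean, comps)
# test_low  = pca_transform(test_images, mean, comps)
```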
IMPORTANT: if for some reason you are unable to make things work for the large datasets, use instead the first 30000 rows of MNIST train.csv as training data and the first 5000 rows of MNIST test.csv for testing.