1 Assignment Weight
The assignment is worth 15% of the total points.
Read everything below carefully, as this assignment changes from term to term.
2 Objective
The purpose of this project is to explore techniques in supervised learning. It is important to realize that
understanding an algorithm or technique requires understanding how it behaves empirically under a variety of
circumstances. As such, rather than implement each of the algorithms, you will be asked to experiment with
them and compare their performance. This is quite involved and also possibly quite different from what you
are used to; however, it is central and in many ways the essence of supervised learning.
3 Procedure
First, you should design two interesting classification problems. For the purposes of this assignment, a classification
problem is just a set of training examples and a set of test examples. You can download data, draw from
your own research, or make up your own. Be careful about the datasets you choose. You will need to explain
why the datasets are interesting, use the datasets in later assignments, and be able to discuss context with a
deeper understanding of the datasets. Please take note of the Fall 2024 updates below concerning datasets.
After selecting two interesting classification problems, you will go through the process of exploring the data,
tuning the algorithms you’ve learned about, and writing a thorough analysis of your findings. You need not
implement any learning algorithm yourself; however, you must participate in the journey of exploring, tuning,
and analyzing. Concretely, this means:
• You may program in any language you wish and are allowed to use any library, as long as it was not
written specifically to solve this assignment.
• TAs must be able to recreate your experiments on a standard linux machine if necessary.
• The analysis you provide in the report is paramount.
You should experiment with these three learning algorithms on each dataset. They are:
• Neural Networks. You may use networks of nodes with as many layers as you like and any activation
function you see fit.
• Support Vector Machines. You must try at least two different kernel functions.
• k-Nearest Neighbors. You must try significant values of k for comparison.
Each algorithm is described in detail in your textbook, the assigned readings on Canvas, and on the internet.
Instead of implementing the algorithms yourself, you should use libraries that do this for you and make sure to
provide proper attribution. Also, note that you’ll need to do some tinkering to obtain good results and graphs,
and this might require you to modify these libraries in various ways.
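As one illustrative starting point (not a prescription), the three required learners can be instantiated in scikit-learn roughly as follows. The synthetic dataset, layer sizes, kernel choices, and k values here are placeholders; you would substitute your own datasets and tuned settings.

```python
# Hypothetical sketch: instantiating the three required learners in
# scikit-learn. All dataset parameters and hyperparameters below are
# illustrative placeholders, not assignment requirements.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset; replace with one of your two chosen problems.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

learners = {
    "Neural Network": MLPClassifier(hidden_layer_sizes=(32, 16),
                                    max_iter=1000, random_state=0),
    "SVM (RBF)": SVC(kernel="rbf"),
    "SVM (linear)": SVC(kernel="linear"),  # at least two kernels required
    "3-NN": KNeighborsClassifier(n_neighbors=3),
    "15-NN": KNeighborsClassifier(n_neighbors=15),
}

for name, clf in learners.items():
    clf.fit(X_train, y_train)
    print(f"{name}: train={clf.score(X_train, y_train):.2f} "
          f"test={clf.score(X_test, y_test):.2f}")
```

Reporting both train and test accuracy side by side, as above, is the raw material for the error-rate discussion the report asks for.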
Extra Credit Opportunity:
There is an opportunity to earn 5 points of extra credit on your report. In addition to the algorithms above,
you may implement Boosting for Decision Trees. Be sure to use some form of pruning. You will need to
explain and demonstrate how weak learners affect bias and variance. This is not mandatory and may require
more time for proper analysis.
3.1 Experiments and Analysis
Your report should contain:
• A description of your classification problems, and why you feel they are interesting. Think hard about
this. To be interesting, the problems should be non-trivial on the one hand, but capable of admitting
comparisons and analysis of the various algorithms on the other. Avoid the mistake of working on the
largest, most complicated, and messiest dataset you can find. The key is to be interesting and clear; no
points for hairy and complex.
• The training and testing error rates you obtained running the various learning algorithms on your problems.
At the very least you should include graphs that show performance on both training and test data as a
function of training size (note that this implies that you need to design a classification problem that has
more than a trivial amount of data) and – for the algorithms that are iterative – training times/iterations.
Both of these kinds of graphs are referred to as learning curves.
• Your report must contain a hypothesis about your datasets. This is open-ended, as each of you will have
a variety of different features and attributes in your data that may or may not perform a certain way given
the required algorithms. Whatever hypothesis you choose, you will need to back it up with experimentation
and thorough discussion. It is not enough to just show results.
• Graphs for each algorithm showing training and testing error rates as a function of selected hyperparameter
ranges. This type of graph is referred to as a model complexity graph (also sometimes validation curve).
Please experiment with more than one hyperparameter and make sure the results and subsequent analysis
you provide are meaningful.
• Analyses of your results. Why did you get the results you did? Compare and contrast the different
algorithms. What sort of changes might you make to each of those algorithms to improve performance?
How fast were they in terms of wall clock time? Iterations? Would cross validation help? How much
performance was due to the problems you chose? Which algorithm performed best? How do you define
best? Be creative and think of as many questions, and as many answers, as you can.
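The learning curves described above can be generated with scikit-learn's `learning_curve` helper; this is a minimal sketch, assuming matplotlib for plotting, with the k-NN estimator, dataset, and output filename chosen purely for illustration.

```python
# Minimal learning-curve sketch: train/test error vs. training set size.
# Estimator, dataset, and filename are illustrative assumptions.
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# Plot mean error (1 - accuracy) for both curves.
plt.plot(sizes, 1 - train_scores.mean(axis=1), "o-", label="train error")
plt.plot(sizes, 1 - test_scores.mean(axis=1), "o-", label="test error")
plt.xlabel("training set size")
plt.ylabel("error rate")
plt.legend()
plt.savefig("learning_curve_knn.png")
```

Plotting error rather than accuracy matches the report requirement, and the `cv=5` averaging keeps the curve from reflecting a single lucky split.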
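Similarly, the model complexity (validation) curves can be produced with scikit-learn's `validation_curve`, sweeping one hyperparameter while cross-validating; here the swept parameter is k for k-NN, and the dataset and range are illustrative placeholders.

```python
# Minimal model-complexity (validation) curve sketch: error vs. one
# hyperparameter (k for k-NN). Dataset and k range are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, random_state=0)
k_values = np.arange(1, 30, 2)  # odd k from 1 to 29
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name="n_neighbors", param_range=k_values, cv=5)

for k, tr, te in zip(k_values,
                     train_scores.mean(axis=1),
                     test_scores.mean(axis=1)):
    print(f"k={k:2d}: train error {1 - tr:.2f}, test error {1 - te:.2f}")
```

The assignment asks for more than one hyperparameter per algorithm, so the same pattern would be repeated (e.g. for an SVM's kernel parameter or a network's layer width).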
The analysis writeup is limited to 8 pages. The page limit does include your citations. Anything past 8 pages
will not be read. Please keep your analysis as concise as possible while still covering the requirements of the
assignment. As a final check during your submission process, download the submission to double-check that
everything looks correct on Canvas. Try not to wait until the last minute to submit, as you will only be tempting
Murphy’s Law.
In addition, your report must be written in LaTeX on Overleaf. You can create an account with your
Georgia Tech email (e.g. [email protected]). When submitting your report, you are required to include
a ’READ ONLY’ link to the Overleaf Project. If a link is not provided in the report or Canvas submission
comment, 5 points will be deducted from your score. Do not share the project directly with the Instructor or
TAs via email. For a starting template, please use the IEEE Conference template.
Update for Fall 2024
The following datasets will not be allowed on assignments during the term. This is due to a variety of reasons
concerning simplicity and overuse. If these datasets are used in your assignments, we will not grade the reports,
and you will receive a zero for the assignment. Please double-check that the dataset used is not on this list. That
being said, we want you to choose datasets that will be interesting for the assigned tasks in the assignments.
• Iris Dataset: UCI Machine Learning Repository: Iris Dataset
• Wine Quality Dataset: UCI Machine Learning Repository: Wine Quality Dataset
• Adult (Census Income) Dataset: UCI Machine Learning Repository: Adult Dataset
• Breast Cancer Wisconsin Dataset: UCI Machine Learning Repository: Breast Cancer Wisconsin
Dataset
• MNIST Handwritten Digits Dataset: Kaggle: MNIST Dataset
• Digits Dataset: Scikit-learn: Digits Dataset
• Fashion MNIST: Kaggle: Fashion MNIST
• Credit Card Fraud Detection Dataset: Kaggle: Credit Card Fraud Detection
• Housing Prices Dataset (Boston Housing): Kaggle: Boston Housing Dataset
• Mall Customer Segmentation Data: Kaggle: Mall Customer Segmentation Data
• Letter Recognition Dataset: Kaggle: Letter Recognition Dataset
• Chest X-ray Images Dataset: Kaggle: Chest X-ray Images Dataset
• Stanford Large Network Dataset: Stanford Network Analysis Project (SNAP): Stanford Large Network
Dataset
• Vision Benchmark Suite: Caltech: Autonomous Car Dataset
• Anything Found on Scikit-Learn: Link to List: Scikit-learn Datasets
3.2 Acceptable Libraries
Here are a few examples of acceptable libraries. You can use other libraries as long as they fulfill the conditions
mentioned above.
Machine learning algorithms:
• scikit-learn (python)
• Weka (java)
• e1071/nnet/randomForest (R)
• ML toolbox (matlab)
• tensorflow/pytorch (python)
Plotting:
• matplotlib (python)
• seaborn (python)
• yellowbrick (python)
• ggplot2 (R)
4 Submission Details
The due date is indicated on the Canvas page for this assignment. Make sure you have set your
timezone in Canvas to ensure the deadline is accurate. We are in the Eastern Time Zone for the course.
Due Date: Indicated as “Due” on Canvas
Late Due Date [20 point penalty per day]: Indicated as “Until” on Canvas.
You must submit:
• A file named README.txt containing instructions for running your code (see note below)
• A file named yourgtaccount-analysis.pdf containing your writeup (GT account is what you log in with,
not your all-digits ID)
• Your source code in your personal repository on Georgia Tech’s private GitHub.
You may submit the assignment as many times as you wish up to the due date, but we will only consider your
last submission for grading purposes.
Note: we need to be able to get to your code and your data. Providing entire libraries isn’t necessary when a
URL would suffice; however, you should at least provide any files you found necessary to change and enough
support and explanation so we can reproduce your results on a standard linux machine.