Coursework specification for
“Machine Learning” CSC8111
Introduction
The purpose of the exercise is to provide you with opportunities to:
1. Gain practical experience with machine learning and pattern recognition methods in a practical application scenario.
2. Practice how to report your findings of a systematic, targeted experimental evaluation of methods for applied machine learning.
When marking the coursework, we will be looking for:
1. Understanding of key concepts of machine learning as they are being discussed during the lectures and how they can be applied to a concrete machine learning problem.
2. Adequate scientific writing style conforming to common standards (such as objectivity, precision, evidence-based argumentation, and proper citation of references).
Please note that there is a word limit combined with layout constraints for the written work:
● The report must not exceed 2,000 words (not including references).
● Each figure counts as the equivalent of 250 words.
● A maximum of four figures may be included in the document.
● Font sizes allowed are 11pt and 12pt (exceptions: the bibliography, which may be set in 9pt, and footnotes, which should be set in 10pt but used only sparingly).
● Single column, single space
● Margin sizes (all four!) between 2 and 2.54 cm (strict!)
Further specific instructions are given below.
Have fun and good luck!
For this assignment you are going to work on a Kaggle competition. “Kaggle is a platform for data prediction competitions. Companies, organizations and researchers post their data and have it scrutinized by the world's best statisticians and machine learning experts”, i.e., you! You will be working on the competition “Titanic: Machine Learning from Disaster”. This challenge provides a great starting point for those of you without any experience in applied machine learning. The data is highly structured and there are tutorials provided on the Kaggle site to guide you through several different approaches.
The following is Kaggle’s background to the challenge:
“The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.”
For this assignment you will analyze the Titanic dataset with regard to predicting what categories of passengers were likely to survive the sinking of the ocean liner. You will use machine learning methods to predict, as accurately as possible, which passengers survived the tragedy from general information about them.
The data consists of approx. 1,300 records (divided into training and test subsets) for Titanic passengers. Each record consists of the following 11 attributes:
| Attribute | Description | Possible values |
| Survival | Survival | (0 = No; 1 = Yes) |
| Pclass | Passenger Class | (1 = 1st; 2 = 2nd; 3 = 3rd) |
| Name | Name | [string] |
| Sex | Sex | (male; female) |
| Age | Age | [integer] |
| Sibsp | Number of Siblings/Spouses Aboard | [integer] |
| Parch | Number of Parents/Children Aboard | [integer] |
| Ticket | Ticket Number | [string] |
| Fare | Passenger Fare | [float] |
| Cabin | Cabin | [string] |
| Embarked | Port of Embarkation | (C = Cherbourg; Q = Queenstown; S = Southampton) |
Note that the dataset contains missing values.
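As a starting point for exploring the missing values, the following minimal sketch counts empty cells per attribute using only the Python standard library. The three sample rows are made up for illustration and are not real records; the column names follow the attribute table above and are an assumption about the headers of the actual CSV files, and `count_missing` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical excerpt in the shape of the training file.
# The values are invented for illustration and are NOT real records.
SAMPLE_CSV = """Survival,Pclass,Sex,Age,Cabin,Embarked
0,3,male,22,,S
1,1,female,,C85,C
1,3,female,26,,
"""


def count_missing(csv_text):
    """Return {column name: number of empty cells} for a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # Treat absent or whitespace-only cells as missing.
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts


missing = count_missing(SAMPLE_CSV)
print(missing)
```

For the real data, the same function could be applied to the contents of the downloaded training file; attributes with many missing entries may need imputation or may be better left out of the model.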
Experiments
Develop a complete analysis pipeline in a programming language / environment of your choice. Note that Kaggle provides substantial support for this and you could even run the experiments on the Kaggle site.
Demonstrators will provide support for Python (or other languages such as MATLAB) implementations. However, it is strongly recommended that you attempt the challenge using Python. Relevant Python resources are listed at the end of this document.
The objective of the analysis is to predict survivors of the Titanic disaster from the given data as accurately as possible.
There are no limitations with regard to the modelling approach; that is, you are free to explore (and report) as many methods, and their results, as you wish. The minimum requirement, though, is that you analyze the dataset with a random forest classifier.
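The random forest baseline could be set up roughly as follows. This is a minimal sketch that assumes Python with scikit-learn and NumPy installed; it is one possible setup, not a prescribed one. The feature matrix and labels are synthetic stand-ins generated by a toy rule, not the Titanic data, and the hyperparameters (`n_estimators=100`, 5-fold cross-validation) are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for encoded passenger features (NOT the real data):
# columns are [passenger class, sex (0 = male, 1 = female), age].
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(1, 4, size=n),   # class in {1, 2, 3}
    rng.integers(0, 2, size=n),   # sex in {0, 1}
    rng.uniform(1, 80, size=n),   # age in years
])

# Toy labelling rule, for illustration only: label 1 if female or under 10.
y = ((X[:, 1] == 1) | (X[:, 2] < 10)).astype(int)

# Random forest with a fixed seed so that runs are repeatable.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated accuracy, computed on the training data only.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("mean CV accuracy:", scores.mean())

# Fit on all training data before predicting on unseen records.
model.fit(X, y)
```

With the real dataset, `X` would be built from the encoded attributes (for example, `Sex` mapped to 0/1 and missing `Age` values imputed) and `y` from the survival column.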
You will report the results of your experiments as prediction performance, that is, classification accuracy: the proportion of passengers whose survival is predicted correctly.
Note that you are NOT required to submit your solution to the official Kaggle competition.
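Classification accuracy is the number of correct predictions divided by the total number of predictions. A minimal sketch in plain Python follows; the label lists are made up for illustration and `accuracy` is a hypothetical helper name.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty label lists")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 1 = survived, 0 = did not survive.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

acc = accuracy(y_true, y_pred)  # 4 of 5 predictions are correct
print(acc)
```

Accuracy alone can be misleading when the classes are imbalanced, so it may be worth reporting a confusion matrix or per-class measures alongside it.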
Your report will need to provide documentation about your analysis experiments.
In a brief introduction you will set the scene by describing the problem area. You will need to provide an overview of the dataset that you are analysing through an appropriate visualisation of the data. This will also help you explore the dataset during your experiments.
Following the introduction, you will describe the methods you have used. At the very least you will need to provide an explanation of the random forest classifier that you will implement as a benchmark. You are encouraged to explore different classifiers. Summarise every method you have used for your experiments in the methods section. Results should be reported using appropriate evaluation measures, on the test set.
Remember to only use the provided training dataset for model estimation.
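This restriction also applies to preprocessing: any statistic used to fill in missing values should be estimated from the training data and then reused unchanged on the test data. The sketch below illustrates this with median imputation of ages in plain Python. The age lists are made up, `None` marks a missing value, and `fit_age_imputer` and `impute` are hypothetical helper names.

```python
from statistics import median


def fit_age_imputer(train_ages):
    """Estimate the fill value (median age) from training data only."""
    known = [a for a in train_ages if a is not None]
    if not known:
        raise ValueError("no known ages to estimate the median from")
    return median(known)


def impute(ages, fill_value):
    """Replace missing ages (None) with the given fill value."""
    return [fill_value if a is None else a for a in ages]


# Made-up ages for illustration; None marks a missing value.
train_ages = [22, None, 26, 35, None, 54]
test_ages = [None, 40, 2]

fill = fit_age_imputer(train_ages)       # median of 22, 26, 35, 54
train_filled = impute(train_ages, fill)
test_filled = impute(test_ages, fill)    # reuses the training median
print(fill, test_filled)
```

The test ages never enter the estimate of `fill`, so no information from the test set leaks into the model.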
In the discussion section you will reflect upon your findings and contextualise them with the original task, thereby linking back to the real-world problem (e.g., would you have embarked on the Titanic if you were part of passenger category X?).
Submission
The following needs to be submitted by 16:00 on 16th December 2017 (through NESS):
● A PDF document containing your report (see format and structure specifications above).
● A zipped folder containing the code for your analysis in a runnable format. This should allow us to run your code and verify your results. The code itself will not be marked.