Assignment 2: Topic Classification
Value: 25%
This assignment is inspired by a typical real-life scenario. Imagine you have been hired as a Data
Scientist by a major news organization. Your job is to analyse the news feed to determine the
topic of incoming news articles so they can be organized and distributed to your readers.
For this assignment, you will be given a collection of BBC news articles and also summaries
of the same articles. The articles have been manually labelled as one of five topics: business,
entertainment, politics, sport and tech. Important: Do not distribute these news articles
on the Internet, as this breaches BBC copyright.
You are expected to assess various supervised machine learning methods using a variety of fea-
tures and settings to determine what methods work best for topic classification in this domain.
The assignment has two components: programming to produce a collection of models for topic
classification, and a report to evaluate the effectiveness of the models. The programming part
involves developing Python code for data preprocessing of articles and experimenting with
methods using NLP and machine learning toolkits. The report involves evaluating and comparing
the models using various metrics.
You will use the NLTK toolkit for basic language preprocessing, and scikit-learn for feature con-
struction and evaluating the machine learning models. You will be given an example of how to use
NLTK and scikit-learn to define the machine learning methods (example.py), and an example of
how to plot metrics in a graph (plot.py).
Data and Methods
A training dataset is a .tsv (tab separated values) file containing a number of articles, with one
article per line, and linebreaks within articles removed. Each line of the .tsv file has three fields:
instance number, text and topic (business, entertainment, politics, sport, tech).
A test dataset is a .tsv file in the same format as the training dataset except that your code should
ignore the topic field. Training and test datasets can be drawn from supplied files articles.tsv
or summaries.tsv (see below).
For all models, consider an article to be a collection of words, where a word is a string of at
least two letters, numbers or the symbols #, @, _, $ or %, delimited by a space, after removing
all other characters (two characters is the default minimum word length for CountVectorizer in
scikit-learn). URLs should be treated as a space, and so also delimit words. Note that deleting
“junk” characters may join shorter strings that were previously separated by those characters
into longer words.
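The word definition above might be implemented with a small preprocessing function and a matching CountVectorizer token pattern. This is only one possible reading: the URL regex and the exact character class are our assumptions, not part of the specification.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer

def preprocess(text):
    """One reading of the word definition above (the URL pattern is an
    assumption): URLs become a space, all other disallowed characters
    are deleted."""
    text = re.sub(r"https?://\S+", " ", text)        # URLs act as delimiters
    return re.sub(r"[^A-Za-z0-9#@_$% ]", "", text)   # delete "junk" characters

# Tokens are runs of at least two allowed characters; the text is not
# lower-cased.
vectorizer = CountVectorizer(preprocessor=preprocess, lowercase=False,
                             token_pattern=r"[A-Za-z0-9#@_$%]{2,}")
```

Note that supplying a custom preprocessor replaces scikit-learn's default preprocessing step, which is what prevents lower-casing here.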
Use the supervised learning methods discussed in the lectures: Decision Trees (DT), Bernoulli
Naive Bayes (BNB) and Multinomial Naive Bayes (MNB). Do not code these methods: instead
use the implementations from scikit-learn. Read the scikit-learn documentation on Decision Trees1
and Naive Bayes,2 and the linked pages describing the parameters of the methods.
Look at example.py to see how to use CountVectorizer and train and test the machine learning
algorithms, including how to generate metrics for the models developed, and plot.py to see how
to plot these metrics on a graph for inclusion in your report.
The programming part of the assignment is to produce DT, BNB and MNB models and your own
model for topic classification in Python programs that can be called from the command line to train
and classify articles read from correctly formatted .tsv files. The report part of the assignment
is to analyse these models using a variety of parameters, preprocessing tools and scenarios.
Programming
You will submit four Python programs: (i) DT_classifier.py, (ii) BNB_classifier.py, (iii)
MNB_classifier.py and (iv) my_classifier.py. The first three of these are standard models as
defined below. The last is a model that you develop following experimentation with the data. Use
the given datasets (articles.tsv and summaries.tsv) containing 1000 labelled articles and their
summaries to develop and test the models, as described below.
Each program takes two file names as command-line arguments: the first a training dataset and
the second a test dataset (i.e. the file names are not hard-coded as training.tsv and test.tsv).
It should print, to standard output (not to a hard-coded file output.txt), the instance number
and the topic assigned by the classifier for each article in the test set when trained on the
training set, one article per line with a space between the instance number and the topic. Each
topic is one of the strings “business”, “entertainment”, “politics”, “sport” or “tech”. For example:
python3 DT_classifier.py training.tsv test.tsv > output.txt
should write to the file output.txt the instance number and topic of each article in test.tsv, as
determined by the Decision Tree classifier trained on training.tsv.
When reading in training and test datasets, make sure your code reads all the instances: by
default, Python’s csv readers use the “excel” dialect, which treats double quotes as quote
characters and so can merge or drop fields inside articles.
Standard Models
You will develop three standard models. For all models, make sure that scikit-learn does not
convert the text to lower case. For Decision Trees, use scikit-learn’s Decision Tree method with
criterion set to ’entropy’ and with random_state=0. Scikit-learn’s Decision Tree method does
not implement pruning; instead, ensure that Decision Tree construction stops when a node
covers fewer than 1% of the training set. Decision Trees are prone to fragmentation, so to avoid
overfitting and reduce computation time, for the Decision Tree models use as features only the
1000 most frequent words from the vocabulary, after preprocessing to remove “junk” characters
as described above. Write code to train and test a Decision Tree model in DT_classifier.py.
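Putting the Decision Tree settings together might look like the sketch below, on invented toy data. Mapping the 1% stopping rule to min_samples_leaf=0.01 is our interpretation (min_samples_split is another plausible reading of "a node covers fewer than 1% of the training set").

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Invented toy data standing in for the real articles.tsv contents.
train_texts = ["shares profits market", "match goal striker",
               "shares market bank", "goal match win",
               "profits bank shares", "win striker goal"]
train_topics = ["business", "sport", "business", "sport", "business", "sport"]

# The 1000 most frequent words as features; no lower-casing.
vectorizer = CountVectorizer(max_features=1000, lowercase=False)
X_train = vectorizer.fit_transform(train_texts)

# min_samples_leaf given as a float is a fraction of the training set,
# one way to stop construction at nodes covering fewer than 1% of it.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0,
                             min_samples_leaf=0.01)
clf.fit(X_train, train_topics)

print(clf.predict(vectorizer.transform(["market shares profits"]))[0])
```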
For both BNB and MNB, use scikit-learn’s implementations, but use all of the words in the
vocabulary as features. Write two Python programs for training and testing Naive Bayes models,
one a BNB model and one an MNB model, in BNB_classifier.py and MNB_classifier.py.
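A corresponding sketch for the Naive Bayes models, again on invented toy data: no max_features cap is set, so the whole vocabulary is used, and BernoulliNB binarizes the counts to word presence/absence by default while MultinomialNB uses the counts themselves.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Invented toy data standing in for the real articles.tsv contents.
train_texts = ["shares profits market", "match goal striker",
               "shares market bank", "goal match win"]
train_topics = ["business", "sport", "business", "sport"]

# All words in the vocabulary as features: no max_features limit.
vectorizer = CountVectorizer(lowercase=False)
X_train = vectorizer.fit_transform(train_texts)

bnb = BernoulliNB().fit(X_train, train_topics)    # word presence/absence
mnb = MultinomialNB().fit(X_train, train_topics)  # word counts
```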