Assignment 2: Sentiment Analysis
Value: 25%
This assignment is inspired by a typical real-life scenario. Imagine you have been hired as a
Data Scientist by a major airline company. Your job is to analyse the Twitter feed to determine
customer sentiment towards your company and its competitors.
In this assignment, you will be given a collection of tweets about US airlines. The tweets have been
manually labelled for sentiment. Sentiment is categorized as either positive, negative or neutral.
Important: Do not distribute these tweets on the Internet, as this breaches Twitter’s
Terms of Service.
You are expected to assess various supervised machine learning methods using a variety of features
and settings to determine what methods work best for sentiment classification in this domain. The
assignment has two components: programming to produce a collection of models for sentiment
analysis, and a report to evaluate the effectiveness of the models. The programming part involves
developing Python code for data preprocessing of tweets and experimenting with methods using
NLP and machine learning toolkits. The report involves evaluating and comparing the models
using various metrics, and comparing the machine learning models to a baseline method.
You will use the NLTK toolkit for basic language preprocessing, and scikit-learn for feature construction
and evaluating the machine learning models. You will be given an example of how to
use NLTK and scikit-learn for this assignment (example.py). For the sentiment analysis baseline,
NLTK includes a hand-crafted (crowdsourced) sentiment analyser, VADER, which may perform
well in this domain because of the way it uses emojis and other features of social media text to
intensify sentiment. However, the accuracy of VADER is difficult to anticipate because: (i) crowdsourcing
is in general highly unreliable, and (ii) this dataset might not include much use of emojis
and other markers of sentiment.
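As a sketch of how such a baseline might be set up: VADER returns a compound score in [-1, 1], and the +/-0.05 thresholds below follow the convention documented by VADER's authors. The helper names here (label_from_compound, vader_label) are illustrative, not part of NLTK.

```python
# Sketch of a VADER baseline: map VADER's compound score onto the three
# sentiment labels used in this assignment. The +/-0.05 thresholds follow
# the convention documented by VADER's authors; the helpers are hypothetical.

def label_from_compound(compound, pos=0.05, neg=-0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= pos:
        return 'positive'
    if compound <= neg:
        return 'negative'
    return 'neutral'

def vader_label(text):
    # Imported lazily so label_from_compound can be used without NLTK installed.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    scores = SentimentIntensityAnalyzer().polarity_scores(text)
    return label_from_compound(scores['compound'])
```

Running vader_label over the test tweets gives baseline predictions to compare against the trained models; note that the vader_lexicon resource must be downloaded first (nltk.download('vader_lexicon')).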
Data and Methods
A training dataset is a tsv (tab separated values) file containing a number of tweets, with one
tweet per line, and linebreaks within tweets removed. Each line of the tsv file has three fields:
instance number, tweet text and sentiment (positive, negative or neutral). A test dataset is a tsv
file in the same format as the training dataset except that your code should ignore the sentiment
field. Training and test datasets can be drawn from a supplied file dataset.tsv (see below).
For all models except VADER, consider a tweet to be a collection of words, where a word is a string
of at least two letters, numbers or the symbols #, @, _, $ or %, delimited by a space, after removing
all other characters (two characters is the default minimum word length for CountVectorizer in
scikit-learn). URLs should be treated as a space, so they delimit words. Note that deleting “junk”
characters may create longer words that were previously separated by those characters.
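One way to realise these rules is with regular expressions, as in the sketch below. The patterns and the tokenize helper are our own illustration, not prescribed by the assignment; note how deleting junk characters (rather than replacing them with spaces) joins fragments such as "don't" into "dont".

```python
import re

# URLs become a space, so they delimit words.
URL_RE = re.compile(r'https?://\S+|www\.\S+')
# Any character other than letters, digits, # @ _ $ % or space is "junk"
# and is deleted, which can merge adjacent fragments into one word.
JUNK_RE = re.compile(r'[^A-Za-z0-9#@_$% ]')
# A word is at least two of the allowed characters.
TOKEN_RE = re.compile(r'[A-Za-z0-9#@_$%]{2,}')

def tokenize(tweet):
    tweet = URL_RE.sub(' ', tweet)  # URLs delimit words
    tweet = JUNK_RE.sub('', tweet)  # delete junk characters
    return TOKEN_RE.findall(tweet)
```

The same token pattern can also be supplied to scikit-learn's CountVectorizer via its token_pattern argument, with the URL and junk removal passed as its preprocessor.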