1 Overview
The goal of this project is to build and critically analyse supervised Machine Learning methods to predict the sentiment of tweets. That is, given a tweet, your model(s) will produce a prediction of the sentiment expressed in the tweet. You will be provided with a data set of tweets that have been annotated with positive, negative, and neutral sentiments. The assessment provides you with an opportunity to reflect on concepts in machine learning in the context of an open-ended research problem, and to strengthen your skills in data analysis and problem-solving.
The goal of this assignment is to critically assess the effectiveness of various Machine Learning classification algorithms on the problem of determining a tweet’s sentiment and to express the knowledge that you have gained in a technical report. The technical side of this project will involve applying appropriate machine learning algorithms to the data to solve the task.
The focus of the project will be the report, formatted as a short research paper. In the report, you will demonstrate the knowledge that you have gained, in a manner that is accessible to a reasonably informed reader.
2 Deliverables
Stage I:
Stage II:
3 Data Set
You are provided with a labelled training set of tweets, and an unlabelled test set which will be used for final evaluation in the Kaggle in-class competition. In the training set, each row in the data file contains a tweet ID, the tweet text, and the sentiment for that tweet. For example:
Tweet_ID, "if i didnt have you i'd never see the sun. #mtvstars lady gaga", positive
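As a minimal sketch, the training file could be loaded with pandas as follows; the file name and column names used here are assumptions and should be adjusted to the files you are actually given:

import pandas as pd

# Assumed file and column names; change these to match the provided data set.
train = pd.read_csv("train.csv")    # columns assumed: tweet_id, text, sentiment

print(train.shape)
print(train["sentiment"].value_counts())   # distribution of positive / negative / neutral labels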
The test dataset has a similar format, except the rows do not include a sentiment (label). You are expected to treat each row of the dataset as an instance. To process these instances, you need to convert them into feature vectors. There are many methods for vectorizing textual instances; we have provided you with two examples.
In the given feature_analysis.ipynb file, you are provided with a basic piece of code that uses the CountVectorizer to transform the training tweets into vectors of Term_IDs and their counts. For example, with the use of CountVectorizer, the above tweet will be transformed into the following vector:
[(51027, 1), (44650, 1), (40410, 1), (43384, 1), (22275, 1), (13438, 1), (20604, 1), …]
where 51027 is the Term_ID for the word 'you', 44650 is the Term_ID for the word 'the', and so on. You can use and edit this basic code to vectorise your training and test datasets. There are many modifications you can apply to experiment with different hypotheses you may have, for example, how removing very frequent and/or very infrequent words affects the behaviour of your Machine Learning models. There are many more possibilities; a minimal sketch of this particular modification is given below.
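The following sketch uses CountVectorizer with the min_df and max_df parameters to drop very infrequent and very frequent terms; the toy tweets and the chosen thresholds are illustrative assumptions only, not recommended settings:

from sklearn.feature_extraction.text import CountVectorizer

# Toy training texts for illustration; in practice, use the text column of the training file.
train_texts = [
    "if i didnt have you i'd never see the sun. #mtvstars lady gaga",
    "lady gaga was amazing tonight",
    "i never see the point of this",
]

# min_df=2 drops terms that occur in fewer than two tweets;
# max_df=0.9 drops terms that occur in more than 90% of tweets.
vectorizer = CountVectorizer(min_df=2, max_df=0.9)
X_train = vectorizer.fit_transform(train_texts)   # sparse matrix of term counts

# The test tweets must be transformed with the same fitted vocabulary:
# X_test = vectorizer.transform(test_texts)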
You are also provided with a basic piece of code that uses TfidfVectorizer to transform the tweets into vectors of values that measure each term's importance using the following formula:
tf-idf_{t,d} = tf_{t,d} × log(N / df_t)
where tf_{t,d} is the frequency of term t in document d, df_t is the number of documents containing t, and N is the total number of documents in the collection. You can learn more about TF-IDF in (Schutze, 2008).
Using TF-IDF, the above example tweet will be transformed into the following vector:
[(51027, 0.17), (44650, 0.09), (40410, 0.23), (43384, 0.29), (22275, 0.22), (13438, 0.46), …]
As with the Bag of Words method, you can use and edit this basic code to vectorise your training and test datasets. Likewise, there are many modifications you can make to experiment with different hypotheses about how changing these features affects the behaviour of your Machine Learning models; a minimal sketch follows.
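A corresponding sketch with TfidfVectorizer, under the same assumptions about the toy texts; note that scikit-learn's implementation uses a smoothed variant of the formula above by default:

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training texts for illustration; in practice, use the text column of the training file.
train_texts = [
    "if i didnt have you i'd never see the sun. #mtvstars lady gaga",
    "lady gaga was amazing tonight",
    "i never see the point of this",
]

tfidf = TfidfVectorizer()                     # default settings; min_df, max_df, etc. can be tuned
X_train = tfidf.fit_transform(train_texts)    # rows are tweets, columns are TF-IDF weighted terms

# Look up the weight of one term in the first tweet (illustrative only).
print(X_train[0, tfidf.vocabulary_["gaga"]])

# As with CountVectorizer, the fitted object is reused for the test tweets:
# X_test = tfidf.transform(test_texts)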
There are many other text vectorization methods that you can use (e.g. word2vec, BERT, etc.). You are welcome and encouraged to use as many vectorization methods as you choose.
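As one illustration of such an alternative, the sketch below averages word2vec embeddings per tweet using gensim; the training parameters and the simple averaging scheme are illustrative choices only, not a recommended setup:

import numpy as np
from gensim.models import Word2Vec

# Toy tokenised tweets for illustration; in practice, tokenise the text column of the training file.
tokenised_tweets = [
    "if i didnt have you i d never see the sun #mtvstars lady gaga".split(),
    "lady gaga was amazing tonight".split(),
]

# Train a small word2vec model on the tweet tokens (a pretrained model may work better in practice).
w2v = Word2Vec(sentences=tokenised_tweets, vector_size=50, window=5, min_count=1, epochs=20)

def tweet_vector(tokens, model):
    # Average the embeddings of the tokens the model knows; zero vector if none are known.
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([tweet_vector(t, w2v) for t in tokenised_tweets])
print(X.shape)   # (number of tweets, 50)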