Machine Learning with Applications in Python
Classification challenge
INF 2179
Introduction: The goal of this classification challenge is to build a classifier that can accurately
predict the star rating of recipe reviews (sentiment analysis).
General information:
• Maximum number of points: 30 (accounts for 30% of the semester grade)
• Deadline: March 18th (to be submitted to Quercus before 23:59)
• Individual challenge (group work not permitted)
Dataset:
Your objective is to develop a machine learning model that can predict the star rating of
recipe reviews on a 1-to-5 star scale, with a score of 0 denoting the absence of a rating.
Two CSV datasets, named train.csv and test.csv, are given to you (adapted from
https://www.kaggle.com/datasets/joebeachcapital/recipe-reviews-and-user-feedback-dataset).
It is important to note that the model should only be trained using the "train.csv" dataset and
evaluated using the "test.csv" dataset.
Hints:
Below are the suggested steps to take:
1) Extract some features from the reviews using:
a. CountVectorizer from scikit-learn (play with the various parameters!)
b. Extract some basic text information using pandas (e.g., length of the text)
c. Feel free to use any other feature extraction techniques (using external Python
libraries is allowed). You do not have to use all the features in the dataset.
2) Ensure that you inspect the data and apply any data cleaning needed.
3) Try various classifiers and various parameters. Here are some classifiers you can try:
a. Decision trees and random forests (with various depths and features)
b. Naïve Bayes
c. Feel free to try any other classifiers not seen in class (e.g., other classifiers in scikit-learn).
Note that it is possible to get the maximum number of points for the accuracy
using approach a or b (if the feature extraction is done well).
4) Try a grid search over the hyperparameters of your classifier.
5) Once you have selected the best classifier, use the training set to build a final classifier.
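The suggested steps above can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the toy DataFrame stands in for train.csv, the column names "text" and "stars" are assumptions about the dataset, and the parameter grid is deliberately tiny. Naïve Bayes is used here only because it is one of the classifiers listed in step 3.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for train.csv; the column names "text" and "stars" are assumptions.
train = pd.DataFrame({
    "text": ["Loved it, will make again", "Too salty and dry", "Great recipe",
             "Terrible, waste of food", "Delicious and easy", "Bland and boring",
             "Best cake ever", "Awful texture"],
    "stars": [5, 1, 5, 1, 5, 1, 5, 1],
})

# Step 2: basic cleaning, here replacing missing reviews with empty strings.
train["text"] = train["text"].fillna("")

# Step 1b: a simple pandas feature (shown for illustration, not fed to the model below).
train["text_length"] = train["text"].str.len()

# Steps 1a and 3: bag-of-words features followed by a Naive Bayes classifier.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# Step 4: grid search over a few vectorizer and classifier hyperparameters.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(train["text"], train["stars"])

# Step 5: with the default refit=True, the best configuration is retrained
# on the whole training set and exposed as best_estimator_.
best_params = search.best_params_
final_classifier = search.best_estimator_
```

Wrapping the vectorizer and the classifier in one Pipeline means the vocabulary is learned only from the training folds during cross-validation, which keeps the grid search scores honest.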
Instructions for submission:
A valid submission should be a zip containing exactly 3 files:
1) pred.csv: a file containing 3637 lines, one line per test prediction (do not include the index
or the header; use index=False, header=False when saving the column). The file should
look like the following:
pred.csv:
1
2
5
.
.
.
2) accuracy.csv: a file containing the accuracy on the test data (do not include the index
or the header), see below:
accuracy.csv:
0.76
3) Notebook file: the file used to generate the model and all the above files, plus the
answer to the questions (see below).
After preparing these three files, you should upload them to Quercus as a ZIP file named with
your student ID (e.g., 1007111023.zip).
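The code below shows how the predictions can be exported in the right format; the accuracy can be saved the same way.
import pandas as pd
pred = classifier.predict(test_set)
# to_csv returns None when given a path, so its result is not assigned back to pred
pd.Series(pred).to_csv('pred.csv', index=False, header=False)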
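Exporting the accuracy could look like the sketch below. The names y_test and pred are hypothetical stand-ins (the true ratings from test.csv and the classifier's predictions for the same rows), and the short lists are made-up values used only so the example runs on its own.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical stand-ins: y_test holds the true ratings from test.csv,
# pred holds the classifier's predictions for the same rows.
y_test = [5, 4, 5, 1, 0]
pred = [5, 5, 5, 1, 0]

# Accuracy is the fraction of predictions that exactly match the true rating.
accuracy = accuracy_score(y_test, pred)

# Write a single value with neither index nor header, as required for accuracy.csv.
pd.Series([accuracy]).to_csv("accuracy.csv", index=False, header=False)
```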
Evaluation:
The submission will be evaluated using three criteria:
1) Accuracy on test data (10 points)
Formula: points = min(max((accuracy − 0.6) / (0.85 − 0.6), 0), 1) * 10
2) Code (10 points)
a. The code will be manually inspected and evaluated. To get the maximum number
of points, the Jupyter notebook must:
i. Be compact
ii. Contain some comments (only where necessary)
iii. Be efficient (run in a reasonable amount of time, unless an advanced
analytics technique is being used)
iv. Be easy to run (it should run in a matter of a few minutes; if you do a grid
search, make sure to comment out those sections and include only the
results)
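As a worked illustration of the accuracy formula in criterion 1, the sketch below maps a test accuracy to points. The function name accuracy_points is hypothetical. Accuracies at or below 0.6 earn 0 points, accuracies at or above 0.85 earn the full 10, and values in between scale linearly, so an accuracy of 0.76 gives about 6.4 points.

```python
def accuracy_points(accuracy: float) -> float:
    """Points for criterion 1, following the formula in the handout."""
    # Rescale so that 0.6 maps to 0 and 0.85 maps to 1.
    scaled = (accuracy - 0.6) / (0.85 - 0.6)
    # Clip to [0, 1], then convert to a score out of 10.
    return min(max(scaled, 0.0), 1.0) * 10


points_low = accuracy_points(0.50)   # below the 0.6 floor
points_mid = accuracy_points(0.76)   # inside the linear range
points_high = accuracy_points(0.90)  # above the 0.85 cap
```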