COMP90051 Statistical Machine Learning
Project 1 Specification
1 Overview
Authorship attribution is the task of identifying the author of a given document. It has been studied extensively in machine learning, natural language processing, linguistics, and privacy research, and it applies to real-world settings. Establishing the provenance of historical texts is helpful in placing important documents in their historical contexts. A whistleblower may assume that their identity is private when posting online anonymously or pseudonymously without attaching their real name. Since authorship attribution can be used to reveal their identity, acting as a kind of privacy attack, it is important to study the possibility of such attacks so that more robust defences can be designed. Finally, plagiarism detection makes use of authorship attribution, with applications ranging from educational assessment to intellectual property law in business.
Your task:
Your task is to come up with test predictions for an authorship attribution problem, given a training set and test inputs. You will participate, in a group of students, in a Kaggle competition for this task, where you will upload your test predictions. Your mark (detailed below) will be based on your test prediction performance and a short report documenting your solution.
The training data is a set of papers published between 2000 and 2019. For each paper you are given the year it was published, the keywords it contains, the venue it was published in, and its authors. All the information in the data has been replaced with randomly assigned IDs, except the years.
The test data is a list of 2,000 papers, and your task is to predict, for each test paper, whether the given target is an author/co-author of the paper. 1,000 of these are positive instances, where the target is an author/co-author of the paper, while the other 1,000 are negative, where the target is not an author/co-author of the paper.
2 Dataset
train.json – contains 26.1k papers authored by 2,302 authors. This file is in JSON format and each paper has a key
denoting the paper ID, and values including:
• author: a set of the IDs of the authors of the paper;
• year: the year the paper was published;
• venue: the venue the paper was published in (denoted venue_X where X ∈ {0, 1, . . . , 469}), or “” if there is no specific venue; and
• keywords: presence of specific keywords in the titles and abstracts of the publications (denoted keyword_X where X ∈ {0, 1, . . . , 499}). Note that this attribute only records the presence of certain keywords, not counts of those words, and is not necessarily an exhaustive list of all words.
test.json – contains 2,000 papers, each with a key denoting the test paper’s ID, along with a year, venue and keywords as in training, plus two more items:
• target: The ID of the target author. This person may or may not be a true co-author of the paper. Your task in
this project is to determine whether the target is a true co-author of the paper; and
• coauthor: The IDs of other true co-authors of the paper. This field may be empty if there are no co-authors.
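As a starting point, the sketch below shows one way these two files could be loaded and inspected in Python. It assumes the files parse as dictionaries keyed by paper ID with the field names listed above (author, year, venue, keywords, target, coauthor); adjust it to whatever structure you find in the actual files.

import json
from collections import Counter

# Load the supplied files (assumed to sit in the working directory).
with open("train.json") as f:
    train = json.load(f)   # assumed: dict mapping paper ID -> paper record
with open("test.json") as f:
    test = json.load(f)    # assumed: dict mapping test paper ID -> paper record

print(len(train), "training papers;", len(test), "test papers")

# Count how many training papers each author appears on.
papers_per_author = Counter(a for paper in train.values() for a in paper["author"])
print("Most prolific authors:", papers_per_author.most_common(5))

# Peek at one test record: it carries a 'target' author plus the true co-authors.
paper_id, record = next(iter(test.items()))
print(paper_id, record["target"], record["coauthor"], record["year"])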
2.1 Kaggle Submission Format
You will need to submit your predictions on the 2,000 test papers to Kaggle at least once during the project (but ideally
several times). To accomplish this, you will place your 2,000 predictions in a file of a certain format (described next)
and upload this to Kaggle.
If your prediction probabilities are 0.1 for the first test paper and 0.99 for the second test paper, then your output file should be as follows, in CSV format:
Id,Predicted
1,0.1
2,0.99
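For example, such a file could be written with a few lines of Python (a sketch only; probs stands in for your own list of 2,000 predicted probabilities, in test-paper order):

import csv

probs = [0.1, 0.99]  # placeholder: replace with your 2,000 predicted probabilities

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Predicted"])
    for i, p in enumerate(probs, start=1):   # Ids are assumed to run 1, 2, ..., 2000
        writer.writerow([i, p])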
The test set will be used to generate an AUC for your performance1; you may submit test predictions multiple times per day (if you wish). Section 6 describes rules for who may submit; in short, you may only submit to Kaggle as a team, not individually. During the competition, the AUC on a 34% subset of the test set will be used to rank you on the public leaderboard. We will use the remainder of the test set to determine your final AUC and ranking. This split of the test set during/after the competition is used to discourage you from constructing algorithms that overfit the leaderboard. The training data “train.json”, the test set “test.json”, and a sample submission file “sample.csv” will be available within the Kaggle competition website. In addition, so as to prevent overfitting to the competition test data, we encourage you to generate your own validation sets from the training data and to test your algorithms on that validation data as well.
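One possible sketch of such a validation set, assuming the field names described in Section 2: hold out some training papers, pair each with one of its true authors (a positive example) and with a randomly drawn non-author (a negative example), and score your model’s probabilities with AUC. The function predict_proba_for_pair is a hypothetical placeholder for whatever scorer you build.

import json
import random
from sklearn.metrics import roc_auc_score

random.seed(0)

with open("train.json") as f:
    train = json.load(f)

all_authors = {a for paper in train.values() for a in paper["author"]}

# Hold out roughly 10% of the training papers for validation.
paper_ids = list(train)
random.shuffle(paper_ids)
held_out = paper_ids[: len(paper_ids) // 10]

pairs, labels = [], []
for pid in held_out:
    authors = set(train[pid]["author"])
    if not authors:
        continue
    pairs.append((pid, random.choice(sorted(authors))))                 # positive: a true author
    labels.append(1)
    pairs.append((pid, random.choice(sorted(all_authors - authors))))   # negative: a non-author
    labels.append(0)

# predict_proba_for_pair(paper_id, author_id) is a placeholder for your model's
# estimated probability that the author is on the paper.
scores = [predict_proba_for_pair(pid, author) for pid, author in pairs]
print("Validation AUC:", roc_auc_score(labels, scores))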
3 Report
A report describing your approach should be submitted through the Canvas LMS by 5pm Tuesday 26th April 2022.
It should include the following content:
1. A brief description of the problem and introduction of any notation that you adopt in the report;
2. Description of your final approach(es) to authorship attribution, the motivation and reasoning behind it, and why you think it performed well or poorly in the competition; and
3. Any other alternatives you considered and why you chose your final approach over these (this may be in the
form of empirical evaluation, but it must be to support your reasoning—examples like “method A, got AUC 0.6
and method B, got AUC 0.7, hence I use method B”, with no further explanation, will be marked down).
Your description of the algorithm should be clear and concise. You should write it at a level that a postgraduate student can read and understand without difficulty. If you use any existing algorithms, please do not rewrite the complete description, but provide a summary that shows your understanding, with references to the relevant literature. In the report, we will be interested in seeing evidence of your thought processes and your reasoning for choosing one algorithm over another.
1AUC, or area under the ROC curve, is a performance metric similar to classification accuracy. A random binary predictor attains 0.5, while 1.0 corresponds to perfect classification. The AUC is an unbiased estimator of one minus the risk under a certain “ranking loss”: if you randomly sample a truly positive and a truly negative example, you incur a loss of 1 if you score the positive below the negative, and a loss of 0 if you score the positive example above the negative. AUC is a popular alternative to accuracy when you care about scoring and ranking, not just classification. You should not need to delve deeply into AUC for this project; you may treat it similarly to accuracy.
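To make the pairwise reading of AUC above concrete, here is a small illustrative check with made-up scores: it counts the fraction of correctly ordered positive-negative pairs and compares the result with sklearn’s roc_auc_score.

from itertools import product
from sklearn.metrics import roc_auc_score

labels = [1, 1, 1, 0, 0]            # made-up ground truth
scores = [0.9, 0.7, 0.4, 0.6, 0.2]  # made-up model scores

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# Fraction of (positive, negative) pairs ranked correctly, ties counted as half.
correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
print(correct / (len(pos) * len(neg)))   # 0.8333...
print(roc_auc_score(labels, scores))     # same value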
Dedicate space to describing the features you used and tried, any interesting details about software setup or your
experimental pipeline, and any problems you encountered and what you learned. In many cases these issues are at
least as important as the learning algorithm, if not more important.
Report format rules. The report should be submitted as a PDF, and be no more than three pages, single column.
The font size should be 11pt or above. If a report is longer than three pages, we will only read and assess the report up to page 3 and ignore further pages. (Don’t waste space on cover pages. References and appendices are included in the page limit; you don’t get extra pages for these. Double-sided pages don’t give you extra pages; we mean the equivalent of three single-sided pages. Three pages means three pages total. Learning how to concisely communicate ideas in short reports is an incredibly important communication skill for industry, government, and academia alike.)
4 Submission
The final submission will consist of three parts:
• By 12pm (noon) Tuesday 25th April 2022, submitted to the Kaggle competition website: A valid submission to
the Kaggle competition. This submission must be of the expected format as described above, and produce a
place somewhere on the leaderboard. Invalid submissions do not attract marks for the competition portion of
grading (see Section 5).
• By 5pm Tuesday 25th April 2022, submitted to the Canvas LMS:
– A written research report in PDF format (see Section 3).
– A zip archive2 of the source code3 for your authorship attribution algorithm, including any scripts for automation, and a README.txt describing in just a few lines what the files are for. Again, do not submit data. You may include Slack/Github logs. (We are unlikely to run your code, but we may in order to verify the work is your own, or to settle rare group disputes.)
– Your Kaggle team name. Without your exact Kaggle team name, we may not be able to credit your Kaggle submission marks, which account for almost half of the project assessment.
The submission link will be visible in the Canvas LMS prior to the deadline.
Note that about a week into Project 1 you will also need to submit a Group Agreement. While not a formal legal contract, completing the agreement together is a helpful way to open up communication within your team and to align each other’s expectations.
5 Assessment
The project will be marked out of 25. Note that there is a hurdle requirement of 25/50 on your combined continuous assessment mark for the subject, to which Project 1 contributes 25 marks. Late report submissions will incur a deduction of 2 marks per day; it is not possible to make late competition entries.
The assessment in this project will be broken down into two components. The following criteria will be considered
when allocating marks.
Based on our experimentation with the project task, we expect that all reasonable efforts at the project will achieve
a passing grade or higher.
2Not rar, not 7z, not lzh, etc. Zip! Substantial penalties will be applied if you don’t follow this simple instruction.
3We would encourage you to use Python, but we will also accept submissions in Matlab, R, or otherwise. You are welcome to use standard
machine learning libraries, such as sklearn, pytorch, etc, but the code submitted should be your own. Do not submit data as Canvas has limited
storage space.
Kaggle Competition (12/25):
Your final mark for the Kaggle competition is based on your rank in that competition. Assuming N teams of enrolled students compete, there are no ties, and your team comes in at place R (e.g. first place is 1, last is N) with an AUC of A ∈ [0, 1], then your mark is calculated as4
9 × (max{min{A, 0.9} − 0.55, 0} / 0.35) + 3 × ((N − R) / (N − 1)).
Ties are handled so that you are not penalised by the tie: tied teams receive the rank of the highest team in the tie (as
if no entries were tied). This expression can result in marks from 0 to 12. For example, if teams A, B, C, D, E came 1st,
4th, 2nd, 2nd, 5th, then the rank-based mark terms (out of 3) for the five teams would be 3, 0.75, 2.25, 2.25, 0.
We weight your absolute AUC more heavily than your ranking. The component out of 9 for AUC gives a score of 0/9 for an AUC of 0.55 or lower, 9/9 for an AUC of 0.9 or higher, and scales linearly over the interval of AUCs [0.55, 0.9]. We believe that a mid-way AUC is achievable with minimal work, while results around 0.9 are good but will require more effort. For example, an AUC of 0.9 for a team coming last would yield 9/12, or 10.5/12 if coming mid-way in the class.
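As a quick sanity check on this formula (a sketch only; A, R and N are as defined above), the worked examples can be reproduced directly:

def kaggle_mark(A, R, N):
    # Mark out of 12 for AUC A and rank R among N teams, per the formula above.
    auc_term = 9 * max(min(A, 0.9) - 0.55, 0) / 0.35
    rank_term = 3 * (N - R) / (N - 1)
    return auc_term + rank_term

print(kaggle_mark(A=0.9, R=5, N=5))  # last place with AUC 0.9 -> 9.0
print(kaggle_mark(A=0.9, R=3, N=5))  # mid-way in a class of 5 -> 10.5
# Rank-based terms for teams placed 1st, 4th, 2nd, 2nd, 5th (N = 5):
print([3 * (5 - r) / 4 for r in (1, 4, 2, 2, 5)])  # [3.0, 0.75, 2.25, 2.25, 0.0]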
External unregistered students may participate, but their entries will be removed before computing the final rankings and the above expression, and will not affect registered students’ grades. We do not, however, actively invite such participation.
The rank-based term encourages healthy competition and discourages collusion. The other, AUC-based term rewards students who do not place near the top but nonetheless achieve good absolute results. Therefore you can achieve a high H1 grade overall irrespective of your placing in the ranking.
Note that invalid submissions will come last and will attract a mark of 0 for this part, so please ensure your output
conforms to the specified requirements.
Report (13/25):
The marking rubric in Appendix A outlines the criteria that will be used to mark your report.
6 Additional Competition Rules
Teams: You are required to form a team of three students including yourself. The Piazza proj1_groups folder is available to help with this; e.g. you can post there if you are looking for teammates. Teams are not required to be in the same workshop.
Group account submissions to Kaggle: Only submissions from your group account are permitted on Kaggle. You
should not attempt to submit from your individual account or create additional accounts to circumvent the daily
submission limit. We have set the submission limit high enough that no advantage is gained from circumvention. Moreover, submitting too many times is likely to risk overfitting to the leaderboard portion of the test data.
Auxiliary data prohibited: The use of any additional datasets to help with your solution is prohibited. You should
only use the data that we have supplied for you. You should not search for original data sources, and indeed we have
deployed a number of obfuscation measures to prevent this.
Plagiarism policy: You are reminded that all submitted project work in this subject is to be your own individual work. Automated similarity checking software will be used to compare submissions. It is University policy that academic integrity be enforced. For more details, please see the University’s academic integrity policy.
4Note that Kaggle is set to take two “scored private submissions” per team. This means that, by default, your top two submissions based on public leaderboard score are chosen; after the competition closes, these two submissions will be scored on the entire test set, and the better of the two scores will make up your AUC A. If you prefer non-default submissions for your two “scored private submissions”, Kaggle permits you to select others.
A Marking scheme for the Report
Critical Analysis (Maximum = 8 marks)
8 marks: Final approach is well motivated and its advantages/disadvantages clearly discussed; thorough and insightful analysis of why the final approach does or does not work for the provided training data; insightful discussion and analysis of other approaches and why they were not used.
Report Clarity and Structure (Maximum = 5 marks)
5 marks: Very clear and accessible description of all that has been done; a postgraduate student can pick up the report and read it with no difficulty.