COMP90042
Automated Fact Checking For Climate Science Claims
Project type: Individual
Report and code submission due date: 9pm Mon, 15th May 2023
Codalab submission due date: 1pm Mon, 15th May 2023 (no extensions possible for this component)
The impact of climate change on humanity is a significant concern. However, the increase in unverified statements regarding climate science has led to a distortion of public opinion, underscoring the importance of conducting fact-checks on claims related to climate science. Consider the following claim and related evidence:
Claim: The Earth’s climate sensitivity is so low that a doubling of atmospheric CO2 will result in a
surface temperature change on the order of 1°C or less.
Evidence:
1. In his first paper on the matter, he estimated that global temperature would rise by around 5 to 6 °C (9.0 to 10.8 °F) if the quantity of CO2 was doubled.
2. The 1990 IPCC First Assessment Report estimated that equilibrium climate sensitivity to a doubling of CO2 lay between 1.5 and 4.5 °C (2.7 and 8.1 °F), with a "best guess in the light of current knowledge" of 2.5 °C (4.5 °F).
It should not be difficult to see that the claim is not supported by the evidence passages, and assuming the source
of the evidence is reliable, such a claim is misleading. The challenge of the project is to develop an automated
fact-checking system, where given a claim, the goal is to find related evidence passages from a knowledge source,
and classify whether the claim is supported by the evidence.
More concretely, you will be provided with a list of claims and a corpus containing a large number of evidence passages (the “knowledge source”), and your system must: (1) search for the most related evidence passages from the knowledge source given the claim; and (2) classify the status of the claim given the evidence into one of the following 4 classes: {SUPPORTS, REFUTES, NOT_ENOUGH_INFO, DISPUTED}. A successful system must both retrieve the correct set of evidence passages and classify the claim correctly.
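For orientation only, the sketch below outlines one way such a two-stage system could be organised (a retriever followed by a claim classifier). The function names, bodies and the cut-off k are placeholders, not part of the specification.

# A minimal sketch of a two-stage fact-checking system: a retriever proposes
# candidate evidence passages, then a classifier assigns one of the four labels.
from typing import Dict, List

LABELS = ["SUPPORTS", "REFUTES", "NOT_ENOUGH_INFO", "DISPUTED"]

def retrieve_evidence(claim_text: str, evidence: Dict[str, str], k: int = 5) -> List[str]:
    """Return the IDs of the k passages judged most relevant to the claim."""
    raise NotImplementedError  # e.g. TF-IDF, BM25 or a dense retriever

def classify_claim(claim_text: str, evidence_texts: List[str]) -> str:
    """Return one of LABELS given the claim and its retrieved evidence."""
    raise NotImplementedError  # e.g. a supervised classifier over claim-evidence pairs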
Besides system implementation, you must also write a report that describes your fact-checking system, e.g. how the retrieval and classification components work, the reasons behind the choices you made, and the system’s performance. We hope that you will enjoy the project. To make it more engaging, we will run the task as a
Codalab competition. You will be competing with other students in the class. The following sections give more
details on the data format, system evaluation, grading scheme and use of Codalab. Your assessment will be
graded based on your report, your performance in the competition and your code.
Submission materials: Please submit the following:
Note that there are two separate submission links/shells for the project report and the code. The reason for separating these two submissions is that we will be running peer review of the reports shortly after the project has completed. Please note that you should upload a single PDF document for your report and a single zip file for your code; no other formats (e.g. docx, 7z, rar) are allowed. Submissions in other formats will not be marked and will receive a score of 0. You do not need to include your data files (e.g. the evidence passages) in the zip. You’ll find more information in the submission shells on how to submit the report and code.
If multiple code files are included, please make it clear in the header of each file what it does. If pre-trained models or embeddings are used, you do not need to include them as part of the submission, but make sure your code or script downloads them if necessary. We may review your code if needed; however, code is secondary: the primary focus of marking will be your report and your system’s performance on Codalab.
You must submit at least one entry to the Codalab competition.
Late submissions: -10% per day
Marks: 35% of subject
Materials: See the Using Jupyter Notebook and Python page on Canvas (under Modules>Resources) for information on the basic setup required for COMP90042, including an iPython notebook viewer and the Python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. For this project, you are encouraged to use the NLP tools accessible from NLTK, such as the Stanford parser, NER tagger, etc., or you may elect to use the Spacy or AllenNLP toolkits, which bundle many excellent NLP tools. You may also use Python-based deep learning libraries: TensorFlow, Keras or PyTorch. You should use Python 3.8.
For this project, you’re allowed to use pretrained language models or word embeddings. Note, however, that you are not allowed to: (1) use closed-source models, e.g. OpenAI GPT-3; (2) use models that can’t be run on Colab (e.g. very large models that don’t fit on the Colab GPU); or (3) look for more training data in any form (whether labelled or unlabelled, for a related or unrelated task) to train your system. In other words, you should only train your system using the provided data, which includes a training, a development and a test set; see the instructions in “Datasets” below for information on their format and usage. If there’s something you want to use and you are not sure whether it’s allowed, please ask in the discussion forum (without giving away too much about your ideas to the class).
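As an illustration of what falls within these rules, the snippet below loads a compact open-source pretrained encoder via the Hugging Face transformers library; this particular library and model are merely one possible choice, not a requirement of the project.

# Example (one possible choice, not required): loading a small open-source
# pretrained encoder that comfortably fits on a Colab GPU.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")   # ~110M parameters

inputs = tokenizer("The Earth's climate sensitivity is low.", return_tensors="pt")
outputs = model(**inputs)   # outputs.last_hidden_state has shape (1, seq_len, 768)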
Grading: You will be graded on several criteria: the clarity of expression in your report, the soundness and novelty of your methods, the substance of your work, your interpretation of your results, and the performance of your system (the section “Grading” below provides more details).
Updates: Any major changes to the project will be announced via Canvas. Minor changes and clarifications will
be announced in the discussion forum on Canvas; we recommend you check it regularly.
Academic Misconduct: While you’re free to discuss the project with other students, reuse of code between
students, copying large chunks of code from online sources, or other instances of clear influence will
be considered cheating. Do remember to cite your sources properly, both for research ideas and algo-
rithmic solutions and code snippets. We will be checking submissions for originality and will invoke
the University’s Academic Misconduct policy where inappropriate levels of collusion or plagiarism are
deemed to have taken place.
Datasets
You are provided with several files for the project:
[train-claims,dev-claims].json: JSON files for the labelled training and development set;
[test-claims-unlabelled].json: JSON file for the unlabelled test set;
evidence.json: JSON file containing a large number of evidence passages (i.e. the “knowledge source”);
dev-claims-baseline.json: JSON file containing predictions of a baseline system on the development set;
eval.py: Python script to evaluate system performance (see “Evaluation” below for more details).
For the labelled claim files (train-claims.json, dev-claims.json), each instance contains the claim ID, claim text, claim label (one of the four classes: {SUPPORTS, REFUTES, NOT_ENOUGH_INFO, DISPUTED}), and a list of evidence IDs. The unlabelled claim file (test-claims-unlabelled.json) has a similar structure, except that it only contains the claim ID and claim text. More concretely, the labelled claim files have the following format:
{
  "claim-2967": {
    "claim_text": "[South Australia] has the most expensive electricity in the world.",
    "claim_label": "SUPPORTS",
    "evidences": ["evidence-67732", "evidence-572512"]
  },
  "claim-375":
  ...
}
The evidence IDs (e.g. evidence-67732, evidence-572512) are drawn from the evidence passages in evidence.json:
{
"evidence-0": "John Bennet Lawes, English entrepreneur and agricultural scientist",
"evidence-1": "Lindberg began his professional career at the age of 16, eventually ...",
...
}
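To make the formats above concrete, here is a minimal sketch of loading the provided files with Python’s standard json module (the file paths, and the assumption that the example claim above sits in the training split, are for illustration only):

import json

# Load the labelled claims and the knowledge source (paths assume the files
# are in the current working directory).
with open("train-claims.json") as f:
    train_claims = json.load(f)   # {claim_id: {"claim_text", "claim_label", "evidences"}}
with open("evidence.json") as f:
    evidence = json.load(f)       # {evidence_id: passage_text}

# Look up an example claim and its gold evidence passages.
example = train_claims["claim-2967"]
print(example["claim_text"], example["claim_label"])
print([evidence[eid] for eid in example["evidences"]])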
Given a claim (e.g. claim-2967), your system needs to search and retrieve a list of the most relevant evidence
passages from evidence.json, and classify the claim (1 out of the 4 classes mentioned above). You should retrieve
at least one evidence passage.
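Many designs are possible for this retrieval step. As one simple illustration using the permitted Scikit-Learn package, a TF-IDF retriever could look roughly like the sketch below (the function names and the cut-off k are illustrative, not prescribed):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_retriever(evidence):
    """Index the evidence passages with TF-IDF."""
    ids = list(evidence.keys())
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    matrix = vectorizer.fit_transform([evidence[i] for i in ids])
    return ids, vectorizer, matrix

def retrieve(claim_text, ids, vectorizer, matrix, k=5):
    """Return the IDs of the k evidence passages most similar to the claim."""
    scores = cosine_similarity(vectorizer.transform([claim_text]), matrix).ravel()
    top = scores.argsort()[::-1][:k]
    return [ids[i] for i in top]   # returns at least one passage for k >= 1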
The training set (train-claims.json) should be used for building your models, e.g. for use in development of
features, rules and heuristics, and for supervised/unsupervised learning. You are encouraged to inspect this
data closely to fully understand the task.
The development set (dev-claims.json) is formatted like the training set. This will help you make major implementation decisions (e.g. choosing optimal hyper-parameter configurations), and should also be used for detailed analysis of your system, both for measuring performance and for error analysis, in the report.
You will use the test set (test-claims-unlabelled.json) to participate in the Codalab competition. For this reason no labels (i.e. the evidence passages and claim labels) are provided for this partition. You are allowed (and encouraged) to train your final system on both the training and development sets so as to maximise performance on the test set, but you should not at any time manually inspect the test dataset; any sign that you have done so will result in a loss of marks. In terms of the format of the system output, dev-claims-baseline.json serves as an example. Note that it has the same format as the labelled claim files (train-claims.json or dev-claims.json), although the claim_text field is optional (i.e. we do not use this field during evaluation) and you’re free to omit it.
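As a rough sketch of producing output in this format, the snippet below writes one entry per test claim; the predict() helper and the output filename are hypothetical placeholders for your own system and for whatever filename the Codalab instructions require.

import json

with open("test-claims-unlabelled.json") as f:
    test_claims = json.load(f)

output = {}
for claim_id, claim in test_claims.items():
    # predict() is a hypothetical stand-in for your retrieval + classification system.
    label, evidence_ids = predict(claim["claim_text"])
    output[claim_id] = {
        "claim_text": claim["claim_text"],   # optional: not used during evaluation
        "claim_label": label,
        "evidences": evidence_ids,
    }

# The output filename here is illustrative; follow the Codalab instructions for naming.
with open("test-claims-predictions.json", "w") as f:
    json.dump(output, f)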
Evaluation
We provide a script (eval.py) for evaluating your system. This script takes two input files, the ground truth and your predictions, and computes three metrics: (1) F-score for evidence retrieval; (2) accuracy for claim classification; and (3) the harmonic mean of the evidence retrieval F-score and claim classification accuracy. Shown below is the output from running the predictions of a baseline system on the development set:
$ python eval.py --predictions dev-claims-baseline.json --groundtruth dev-claims.json
Evidence Retrieval F-score (F) = 0.3377705627705628
Claim Classification Accuracy (A) = 0.35064935064935066
Harmonic Mean of F and A = 0.3440894901357093
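For reference, the third number is the standard harmonic mean of the first two, H = 2FA / (F + A), which can be checked directly:

# Sanity check of the harmonic mean printed above.
F = 0.3377705627705628
A = 0.35064935064935066
print(2 * F * A / (F + A))   # ≈ 0.3440894901..., matching the value reported above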
The three metrics are computed as follows:
1. Evidence Retrieval F-score (F): computes how well the evidence passages retrieved by the system match the ground truth evidence passages. For each claim, our evaluation considers all the retrieved evidence passages, computes the precision, recall and F-score by comparing them against the ground truth