COMP20008 Elements of Data Processing
Assignment 1
Marks and due date
The assignment is worth 20 marks (20% of the subject grade). The estimated time commitment is 20–30 hours.
Submission instructions
Please see Ed.
Background
It should come as no surprise that while news sources provide much of the information we learn every day, it is questionable whether the news we read is credible and trustworthy. Differentiating legitimate and correct news articles from ‘fake’ or misleading ones is an important task, attracting the interest of journalists, policy makers and researchers in many fields such as social science and data science. Specifically in the context of health, the consequences of delivering illegitimate news can be profound and irreversible.

In this assignment, you will be working on a dataset comprising more than 1,000 health-related news articles, the reviews of these articles, and tweets about the news articles.

Using this dataset, you will have an opportunity to learn how to identify ‘good’ and ‘bad’ news by characterising some linguistic patterns of news, as well as by comparing different news outlets based on the correctness of their reporting.
Learning outcomes
• Be able to read and write data in different formats, and combine different
data sources.
• Practise text processing techniques using Python.
• Practise data processing, exploration and visualisation techniques with Python.
• Practise visual analysis and written communication skills.
Dataset
The dataset consists of more than 1,000 health-related news articles, the reviews of these articles, and tweets about the news articles. They are located in the folder /course/data/a1/.
News articles
Each of the news articles is stored in a file in the /course/data/a1/content folder. The name of each file corresponds to the ID of the article and is in the format story_reviews_xxxxx.json, where each x is a digit; for example, story_reviews_00001.json contains the news article with ID story_reviews_00001.
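As a rough illustration, the snippet below lists the article files and loads one of them. It is only a sketch: it assumes each file holds a single JSON object, and the fields you will find inside should be checked against the actual data.

import json
import os

CONTENT_DIR = "/course/data/a1/content"

# Collect article IDs from the file names, e.g. "story_reviews_00001".
article_ids = sorted(
    name[:-len(".json")]
    for name in os.listdir(CONTENT_DIR)
    if name.endswith(".json")
)

# Load one article; each file is assumed to hold a single JSON object.
with open(os.path.join(CONTENT_DIR, article_ids[0] + ".json")) as f:
    article = json.load(f)
print(article.keys())  # inspect the available fields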
News reviews
Each article was independently reviewed by at least one expert, based on a list of criteria. Against each criterion, the article receives a rating of satisfactory, not satisfactory or not applicable, together with an explanation for the rating. Each article is also assigned an overall rating from 0 (least accurate) to 5 (most accurate).

All reviews are stored in a single file: /course/data/a1/reviews/HealthStory.json. If you read it using Python’s json module, you should get a list of reviews.
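For example, a minimal sketch of reading the reviews file; apart from news_id (which the tasks below rely on), treat any field names you discover as things to verify rather than assume.

import json

REVIEWS_PATH = "/course/data/a1/reviews/HealthStory.json"

with open(REVIEWS_PATH) as f:
    reviews = json.load(f)  # a list of review dictionaries

print(len(reviews))       # total number of reviews
print(reviews[0].keys())  # inspect fields such as news_id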
Tweets
News articles are shared on Twitter. The IDs of tweets about the news articles have been recorded in /course/data/a1/engagements/HealthStory.json. You should get a dictionary whose keys are article IDs and whose values contain the tweets. Each tweet is represented by a unique integer ID. Note that a tweet may appear multiple times, so when counting tweets you need to avoid counting duplicates.
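A sketch of counting unique tweets, assuming each dictionary value is a flat list of integer tweet IDs; if the values turn out to be nested structures, adapt the loop accordingly.

import json

ENGAGEMENTS_PATH = "/course/data/a1/engagements/HealthStory.json"

with open(ENGAGEMENTS_PATH) as f:
    engagements = json.load(f)  # maps article ID -> tweet IDs

# A tweet ID may appear under several articles (or several times under
# one article), so a set ensures each tweet is counted exactly once.
unique_tweets = set()
for tweet_ids in engagements.values():
    unique_tweets.update(tweet_ids)
print(len(unique_tweets))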
Your tasks (Total 20 marks)
Task 1. Loading data (1 mark)
Implement the function task1() in task1.py that outputs a json file called
task1.json in the following format:
{
"Total number of articles": X,
"Total number of reviews": Y,
"Total number of tweets": Z
}
where X, Y and Z are the number of news articles, the number of reviews, and the number of unique tweets, respectively.
NOTE: You are free to write any helper functions (even in another file) if
you like, but do not change the function definition of task1(). For example, do
not add any argument to task1(). This applies to all tasks.
To run your solution, open the terminal and run python main.py task1
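A minimal sketch of the expected shape of task1(); the zero counts are placeholders for the loading logic sketched in the Dataset section above.

import json

def task1():
    # Placeholder counts; replace with real loading logic as sketched
    # in the Dataset section above.
    num_articles = num_reviews = num_tweets = 0

    counts = {
        "Total number of articles": num_articles,
        "Total number of reviews": num_reviews,
        "Total number of tweets": num_tweets,
    }
    with open("task1.json", "w") as f:
        json.dump(counts, f, indent=4)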
Task 2. Data aggregation (2 marks)
In each review, you will see a news_id field containing the ID of the article.
This information allows you to match the review with the news article.
Implement the function task2() in task2.py which combines the articles with their reviews to work out how many “satisfactory” ratings each article receives, out of 10 criteria in total. Your function should save its output to
a csv file called task2.csv, which contains the following headings: news_id,
news_title, review_title, rating, num_satisfactory. Each row in the file
should contain the details of one article, with
• news_id being the ID of the news article in the format story_reviews_xxxxx;
• news_title being the title of the news article;
• review_title being the title of the review article;
• rating being the overall rating of the article (between 0 and 5); and
• num_satisfactory being the total number of criteria (between 0 and 10)
that are satisfactory.
The rows in task2.csv should be in ascending order of news_id.
To run your solution, open the terminal and run python main.py task2
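One possible shape for this task, using the standard csv module. The field names criteria, answer, title and rating are assumptions about the JSON layout (the text above only guarantees news_id), so verify them against the actual files.

import csv
import json

def count_satisfactory(review):
    # "criteria" and "answer" are assumed field names; check the data.
    return sum(
        1 for c in review.get("criteria", [])
        if c.get("answer", "").lower() == "satisfactory"
    )

def task2():
    with open("/course/data/a1/reviews/HealthStory.json") as f:
        reviews = json.load(f)

    rows = []
    for review in reviews:
        news_id = review["news_id"]
        # The article's own title lives in its content file.
        with open(f"/course/data/a1/content/{news_id}.json") as f:
            article = json.load(f)
        rows.append({
            "news_id": news_id,
            "news_title": article.get("title"),
            "review_title": review.get("title"),
            "rating": review.get("rating"),
            "num_satisfactory": count_satisfactory(review),
        })

    rows.sort(key=lambda r: r["news_id"])  # ascending order of news_id
    with open("task2.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[
            "news_id", "news_title", "review_title",
            "rating", "num_satisfactory",
        ])
        writer.writeheader()
        writer.writerows(rows)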
Task 3. Extracting meta-data (2 marks)
Each news article comes with a publish_date field, which specifies when the article was published. The field is a floating point number with millisecond precision. Convert it into a more readable format using the datetime.datetime.fromtimestamp() function. This returns an object of type datetime, and you will need to consult the documentation to extract the year, month and day of the article.
Implement the function task3() in task3.py that performs the following
two sub tasks:
• Extract the year, month, and day components from the publish_date of
each article, and output a csv file called task3a.csv, which contains the
following headings: news_id, year, month, day. Each row should contain
the ID of an article in the format story_reviews_xxxxx, and the year,
month, day on which it was published. The formats for year, month,
and day are 4-digit year, 2-digit month, and 2-digit day of the month. For
example, for the date 17 April 2022, the year, month, and day components
are 2022, 04, and 17 respectively.
The rows in task3a.csv should be in ascending order of news_id.
• Count the number of articles in each calendar year in the dataset, and
output a file called task3b.png, describing the yearly article counts using
an appropriately chosen graph.
There might be articles for which the publish date is unspecified; exclude those articles from your output. Also note that you will likely write helper functions for this task; however, as always, do not change the definition of task3().
To run your solution, open the terminal and run python main.py task3
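A brief sketch of the timestamp conversion and one way to plot yearly counts; the timestamp and year values below are made-up stand-ins for illustration.

import collections
import datetime
import matplotlib.pyplot as plt

# fromtimestamp() takes seconds, so a fractional (millisecond-precision)
# float can be passed to it directly.
ts = 1650153600.0  # a hypothetical publish_date value
dt = datetime.datetime.fromtimestamp(ts)
print(dt.strftime("%Y"), dt.strftime("%m"), dt.strftime("%d"))
# e.g. 2022 04 17 (the exact date depends on the local timezone)

# Yearly article counts as a bar chart (made-up years for illustration).
counts = collections.Counter(["2017", "2018", "2018", "2019"])
years = sorted(counts)
plt.bar(years, [counts[y] for y in years])
plt.xlabel("Year")
plt.ylabel("Number of articles")
plt.title("Articles published per year")
plt.savefig("task3b.png")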
Task 4. Assessing the credibility of news agencies (2 marks)
Are news articles published by The New York Times more credible than those by Fox News, according to the judgement of experts? Recall that each article is given an overall rating, which is a score between 0 and 5 (inclusive) indicating the credibility of the information presented in the article. In this task, you will compare the average rating of articles published by each news source.

In each review, you will see a news_source field, which tells you where the article was published. Some reviews leave this field blank (an empty string); exclude those reviews.
Implement the function task4() in task4.py that performs two sub tasks:
• Output a csv file called task4a.csv, which contains the following headings: news_source, num_articles, avg_rating. Each row should contain details about one news source: its name, the number of its articles, and the average rating of those articles.
The rows in task4a.csv should be in ascending order of news_source.
• Output a file called task4b.png comparing the average ratings of all news sources that have at least 5 articles. Choose an appropriate plot for this task, and order the axis so that the most and least credible news sources are easily identified; see the sketch below for one possible approach.
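As one possible approach, a pandas sketch over hypothetical data; the real records come from the reviews file (with blank news_source values excluded), and a horizontal bar chart sorted by average rating is just one sensible plot choice.

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-article records standing in for the real reviews.
df = pd.DataFrame({
    "news_source": ["A", "A", "B", "B", "B", "B", "B", "C"],
    "rating": [3, 4, 1, 2, 3, 4, 5, 5],
})

summary = (df.groupby("news_source")
             .agg(num_articles=("rating", "size"),
                  avg_rating=("rating", "mean"))
             .reset_index()
             .sort_values("news_source"))
summary.to_csv("task4a.csv", index=False)

# Keep sources with at least 5 articles, ordered by average rating so
# the most and least credible sources sit at the ends of the axis.
frequent = summary[summary["num_articles"] >= 5].sort_values("avg_rating")
plt.barh(frequent["news_source"], frequent["avg_rating"])
plt.xlabel("Average rating (0-5)")
plt.ylabel("News source")
plt.title("Average expert rating by news source")
plt.tight_layout()
plt.savefig("task4b.png")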