Algorithms and Programming Foundations
FIT9136 Algorithms and
Programming Foundations in
Python
Assignment 3
OCT 2022
Table of Contents
1. Key Information
2. The Assignment
2.1. The Dataset: HardwareRecs
2.2. Task 1: Handling with File Contents and Preprocessing
2.3. Task 2: Building a Class for Data Analysis
2.4. Task 3: Analyzing the File for Data Visualization
2.5. User Manual
3. Do and Do NOT
3.1. Important NOTES
4. Submission Requirements
5. Academic Integrity
6. Marking Guide
7. Getting help
7.1. English language skills
7.2. Study skills
7.3. Things are tough right now
7.4. Things in the unit don’t make sense
7.5. I don’t know what I need
1. Key Information
Purpose This assignment will develop your skills in designing, constructing, testing,
and documenting a Python program according to specific programming
standards. This assessment is related to the following learning outcomes
(LOs):
● LO2 - Restructure a computational program into manageable units
of modules and classes using the object-oriented methodology
● LO3 - Demonstrate Input/Output strategies in a Python application
and apply appropriate testing and exception handling techniques
● LO4 - Investigate useful Python packages for scientific computing
and data analysis
● LO5 - Experiment with data manipulation, analysis, and
visualisation techniques to formulate business insight
Your task This assignment is an individual task in which you will write Python
code for a simple analyser application, as per the specifications.
Value 35% of your total marks for the unit.
Due Date Friday, 4 November 2022, 4:30 PM (AEST)
Submission ● Via Moodle Assignment Submission.
● FIT GitLab check-ins will be used to assess the history of
development
● MOSS will be used for similarity checking of all submissions.
Assessment Criteria The following aspects will be assessed:
1. Program functionality in accordance with the requirements
2. Code Architecture and Adherence to Python coding standards
3. The comprehensiveness of documented code
Late Penalties ● 10% deduction per calendar day or part thereof for up to one
week
● Submissions more than 7 calendar days after the due date will
receive a mark of zero (0) and no assessment feedback will be
provided.
Support Resources See the Moodle Assessment page and Section 7 in this document.
Feedback Feedback will be provided in one format:
● specific student feedback ten working days post submission
2. The Assignment
In this assignment, you will implement a basic parser to investigate the natural-language
posts from a Q&A (question-and-answer) site. The parser should be able to perform basic
data extraction and statistical analysis on a number of linguistic features, and to present
the analysis results using some form of visualisation.
2.1. The Dataset: HardwareRecs
Before you get started with any of the programming tasks, you should read through the
description of the dataset that we will be using for the purpose of this assignment.
The dataset is known as HardwareRecs [https://hardwarerecs.stackexchange.com], which is a
Q&A site for people seeking specific hardware recommendations. A Q&A site, such as Quora,
Zhihu, or Stack Overflow, is a platform where users exchange knowledge by asking and
answering questions. Within HardwareRecs, users can ask questions about hardware
recommendations, and other users can answer those questions with corresponding
suggestions.
The data is written in XML (Extensible Markup Language) format. Apart from the first two
lines and the last line, which are XML-specific formatting, each line in the dataset
represents the record of one post on the Q&A site, i.e., each row beginning with “”
is one data record in this assignment.
As seen in the image above, each post contains four attributes:
● Id: the unique identifier to represent each post
● PostTypeId: the type of the post:
○ 1 = Question
○ 2 = Answer
○ 3 to 8 = Others
● CreationDate: the creation date and time of the post (format as
yyyy-mm-ddThh:mm:ss)
● Body: the content of the post
Note that many different “PostTypeId” values are recorded in the dataset. However, for the
purpose of this assignment, the only data required for processing and analysis are the
questions and answers on the site, i.e., the rows whose “PostTypeId” is 1 or 2.
Note: You should download the dataset from the FIT9136 S2 2022 Moodle site before
attempting the following tasks. The dataset is named data.xml.
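As a rough illustration of the record structure described above, the sketch below reads one row with the standard-library xml.etree.ElementTree module and keeps it only if it is a question or an answer. The element name "row", the function name read_post, and all sample values are assumptions made for illustration; only the four attribute names come from the description above. Note also that ElementTree decodes character references automatically, whereas Task 1 asks you to convert them yourself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample record: the values are invented for illustration.
SAMPLE_ROW = (
    '<row Id="1" PostTypeId="1" '
    'CreationDate="2016-04-07T18:11:33.793" '
    'Body="&lt;p&gt;Which card should I buy?&lt;/p&gt;" />'
)

# Only these two post types are needed for the assignment.
POST_TYPES = {"1": "question", "2": "answer"}


def read_post(line):
    """Return the four attributes of one record as a dict, or None if the
    record is neither a question nor an answer."""
    element = ET.fromstring(line.strip())
    post_type = POST_TYPES.get(element.get("PostTypeId"))
    if post_type is None:
        return None  # PostTypeId 3 to 8 is ignored
    return {
        "id": element.get("Id"),
        "type": post_type,
        "created": element.get("CreationDate"),
        "body": element.get("Body"),  # character references already decoded
    }


post = read_post(SAMPLE_ROW)
print(post["id"], post["type"], post["created"])
```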
2.2. Task 1: Handling with File Contents and Preprocessing
In the first task, you will begin by reading in all the posts of the given dataset. You will then
conduct a number of pre-processing tasks to clean the post content (Body) needed for
analysis in the subsequent tasks (Task 2 and 3) in this assignment. Upon completing the
pre-processing tasks, the content of questions and answers should be saved as two
individual output files. This would be a more efficient approach whenever we need to
manipulate the cleaned dataset without having to repeat the pre-processing task, especially
for large-scale data analysis.
For each post, you should first extract its content/body, i.e., the string embedded
within “Body:"..."” in each row of the XML file. Then you need to apply the following
preprocessing steps to it:
A. In HTML and XML documents, the logical constructs known as character data and
attribute values consist of sequences of characters, in which each character can
appear directly (representing itself) or be represented by a series of characters
called a character reference. We need to convert those special character references
back to their original forms by following the rules in Table 1.
Example:
Before filtering: &lt;p&gt;In $200 price range, should I be looking at cards from AMD or
Nvidia?&lt;/p&gt;
After filtering: <p>In $200 price range, should I be looking at cards from AMD or
Nvidia?</p>
B. Replace special characters, including “ ” and “ ”, with a single space.
C. Remove all HTML tags. All data within the Body attribute is the content of posts on
the HardwareRecs site, and it is rendered in HTML (Hypertext Markup Language) format.
Within HTML, there are many tags that annotate the content. Tags consist of start tags
and end tags. All of these tags are written in the format “<*>”, and some tags even have
detailed attributes like ‘href=”www.google.com”>’. You should remove all these HTML
tags (including the attributes inside them) accordingly.
Note that we assume that the content in the body contains complete tags, i.e., all start
tags are accompanied by their related end tags.
Example:
Before filtering: <p>In $200 price range, should I be looking at cards from AMD or
Nvidia?</p>
After filtering: In $200 price range, should I be looking at cards from AMD or Nvidia?
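The three cleaning steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the required implementation: Table 1 is assumed to cover the five predefined XML character references, the special characters in step B are assumed to be the line-feed and carriage-return references, and the names REPLACEMENTS and clean_body are hypothetical. Check these assumptions against Table 1 and the dataset before relying on them.

```python
import re

# Assumption: Table 1 lists the five predefined XML character references,
# and step B targets the line-feed and carriage-return references.
REPLACEMENTS = {
    "&lt;": "<",
    "&gt;": ">",
    "&quot;": '"',
    "&apos;": "'",
    "&amp;": "&",
    "&#xA;": " ",
    "&#xD;": " ",
}

# One alternation over all references, so each is replaced in a single pass.
REFERENCE_PATTERN = re.compile("|".join(re.escape(ref) for ref in REPLACEMENTS))

# Step C: anything between "<" and the next ">" is treated as a tag.
TAG_PATTERN = re.compile(r"<[^>]*>")


def clean_body(raw_body):
    """Apply steps A and B in one pass, then step C."""
    # A single pass means text that is already decoded is never decoded
    # again (for example, "&amp;lt;" becomes "&lt;", not "<").
    text = REFERENCE_PATTERN.sub(lambda m: REPLACEMENTS[m.group(0)], raw_body)
    return TAG_PATTERN.sub("", text)


raw = ("&lt;p&gt;In $200 price range, should I be looking at cards "
       "from AMD or Nvidia?&lt;/p&gt;")
print(clean_body(raw))
```

One limitation of the tag pattern: a literal "<" that is not part of a tag would also start a match, so a body containing a comparison such as "a < b" could lose text up to the next ">".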
Finally, once you have completed the filtering process, you should identify whether the
post is a question or an answer. You should then save the data into two different files,
“question.txt” and “answer.txt”, according to the post type shown in the data. The cleaned
body/content of each post needs to be saved on one line in the output file. Examples can
be seen in Figure 2.
Note: You should write your code within the given template file
“preprocessData_studentID.py”, and name the file with your own ID. There are two
functions in the file: preprocessLine(inputLine) for dealing with each valid data row from
the file, and splitFile(inputFile, outputFile_question, outputFile_answer) for reading the
input file, calling the preprocessLine function to process each line, and saving the cleaned
questions and answers into the output files. All files should be saved in the same folder as
the source code file, i.e., without using an absolute path.
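The overall flow of splitFile might look like the sketch below. It is only an outline under assumptions: the function names follow the template described in the note above, but preprocessLine here is a placeholder that returns the raw Body attribute without any cleaning, and the two regular expressions are hypothetical helpers that assume attribute values never contain a literal double quote.

```python
import re

# Hypothetical helper patterns; they assume no literal '"' inside a value.
POST_TYPE_PATTERN = re.compile(r'PostTypeId="(\d+)"')
BODY_PATTERN = re.compile(r'Body="([^"]*)"')


def preprocessLine(inputLine):
    """Placeholder: returns the raw Body attribute. The real function
    would also apply cleaning steps A to C."""
    match = BODY_PATTERN.search(inputLine)
    return match.group(1) if match else ""


def splitFile(inputFile, outputFile_question, outputFile_answer):
    """Write one cleaned post per line to the question or answer file."""
    with open(inputFile, encoding="utf-8") as source, \
            open(outputFile_question, "w", encoding="utf-8") as questions, \
            open(outputFile_answer, "w", encoding="utf-8") as answers:
        for line in source:
            match = POST_TYPE_PATTERN.search(line)
            if match is None:
                continue  # XML header and closing lines carry no post
            if match.group(1) == "1":
                questions.write(preprocessLine(line) + "\n")
            elif match.group(1) == "2":
                answers.write(preprocessLine(line) + "\n")
            # any other PostTypeId is skipped
```

Calling splitFile("data.xml", "question.txt", "answer.txt") with bare file names keeps every path relative to the folder the script is run from.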
2.3. Task 2: Building a Class for Data Analysis
The second task is about collating the required data for analysis. Apart from extracting the
clean body as achieved in Task 1, the main task here is to further parse the given row of the
data in XML format with object-oriented programming.
Your class “Parser” should contain the following methods:
● __init__(self, inputString):
This is the constructor required for creating instances of this class. The inputString
will be the row of data from the XML file.
● __str__(self):
Re-define this method to present your data (the instance variables) in a readable
format. You should return a formatted string in this method. The order of output
should be “ID, post type, creation date quarter, the cleaned content”.
● getID(self):
Get the Id of the post (indicated by the “Id” attribute).
● getPostType(self):
Get the post type of the post (indicated by the “PostTypeId” attribute), with 1 as the
question, 2 as the answer, and 3-8 as others.
● getDateQuarter(self):
Get the date quarter of the creation date (indicated by the “CreationDate” attribute).
One year has four quarters: Q1 (Jan to Mar), Q2 (Apr to Jun), Q3 (Jul to Sep)
and Q4 (Oct to Dec). For example, given “2016-04-07T18:11:33.793” as the
CreationDate, your program should return the string “2016Q2”.
● getCleanedBody(self):
Get the cleaned body of the post (indicated by the “Body” attribute), which is the
same cleaned body as extracted in Task 1. You can import the function
preprocessLine() from the Task 1 template to reuse its pre-processing functionality.
Unlike Task 1, however, this method does not need to split questions from answers
or save anything to a file.
● getVocabularySize(self):
Get the number of unique words in the cleaned body after converting it to lower case.
Note that spaces and punctuation are not counted as words. For example, the
sentence “Although I use Mac, I do not like Mac.” has 7 unique words:
{“although”, “i”, “use”, “mac”, “do”, “not”, “like”}. The counting process may involve
splitting the words from the cleaned body returned by the getCleanedBody()
method. Note that just using str.split(" ") is not enough, as it may mistakenly recognize
“mac,” as a word instead of “mac”.
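The quarter calculation and the vocabulary count can be sketched as two small helpers. The function names date_quarter and vocabulary_size are hypothetical, and the definition of a word used here (a run of letters or digits, optionally joined by apostrophes) is an assumption; the specification only says that spaces and punctuation are not words.

```python
import re

# Assumption: a word is a run of lower-case letters or digits, optionally
# joined by apostrophes (so "don't" counts as one word).
WORD_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def date_quarter(creation_date):
    """Turn a date such as '2016-04-07T18:11:33.793' into 'yyyyQn'."""
    year = creation_date[:4]
    month = int(creation_date[5:7])
    # Months 1-3 map to quarter 1, 4-6 to 2, 7-9 to 3 and 10-12 to 4.
    return f"{year}Q{(month - 1) // 3 + 1}"


def vocabulary_size(cleaned_body):
    """Count the distinct lower-cased words in a cleaned body."""
    return len(set(WORD_PATTERN.findall(cleaned_body.lower())))


print(date_quarter("2016-04-07T18:11:33.793"))
print(vocabulary_size("Although I use Mac, I do not like Mac."))
```

With the two examples given in the method descriptions above, these helpers produce "2016Q2" and 7 respectively.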
When instantiating this class with the data row from the XML file "data.xml" as input, e.g.,
Body="<p>In $200 price range, should I be looking at cards from AMD or
Nvidia?</p> ” /> , your class can extract the ID, post type, creation date quarter,
cleaned body, vocabulary size, and nicely print the input data.
Note: You should write your code within the given template “parser_studentID.py”, and
name the file with your own ID.
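One possible shape for the class is sketched below, purely as an outline. Several choices here are assumptions rather than requirements: attributes are pulled out with a simple regular expression, getPostType returns the strings "question", "answer" and "others", and getCleanedBody is a placeholder that returns the raw body instead of reusing preprocessLine() from Task 1. The sample row is invented for illustration.

```python
import re

# Assumption: every attribute is written as name="value" with no literal
# double quote inside the value.
ATTRIBUTE_PATTERN = re.compile(r'(\w+)="([^"]*)"')


class Parser:
    def __init__(self, inputString):
        attributes = dict(ATTRIBUTE_PATTERN.findall(inputString))
        self.ID = attributes.get("Id")
        self.type = attributes.get("PostTypeId")
        self.date = attributes.get("CreationDate")
        self.body = attributes.get("Body", "")

    def __str__(self):
        # Order: ID, post type, creation date quarter, cleaned content.
        return (f"{self.getID()}, {self.getPostType()}, "
                f"{self.getDateQuarter()}, {self.getCleanedBody()}")

    def getID(self):
        return self.ID

    def getPostType(self):
        # Assumption: the post type is reported as a descriptive string.
        return {"1": "question", "2": "answer"}.get(self.type, "others")

    def getDateQuarter(self):
        month = int(self.date[5:7])
        return f"{self.date[:4]}Q{(month - 1) // 3 + 1}"

    def getCleanedBody(self):
        # Placeholder: the real method would call preprocessLine().
        return self.body


# Hypothetical sample row.
row = ('<row Id="7" PostTypeId="2" '
       'CreationDate="2016-04-07T18:11:33.793" Body="Try AMD." />')
print(Parser(row))
```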
2.4. Task 3: Analyzing the File for Data Visualization
In this task, based on the class defined in Section 2.3 (Task 2), you will implement two
functions to visualise the statistics as graphs. The implementation of these two
functions should make use of external Python packages, including NumPy, SciPy, Pandas,
and/or Matplotlib, in order to create suitable graphs for comparing the statistics
collected for posts.
The two functions should meet the following requirements:
● visualizeVocabularySizeDistribution(inputFile, outputImage):
Given the input file “data.xml”, you should count the vocabulary size of each post.
Then you should draw a bar chart in Python to visualize the distribution of the
vocabulary size over all posts. The x-axis is the vocabulary size, and the y-axis
represents the number of posts with a certain vocabulary size. Note that for the
x-axis, the vocabulary size interval is 10, and any post whose vocabulary size is larger
than or equal to 100 should be put into “others”, i.e., the bins are 0-10, 10-20, 20-30,
30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90-100, others (left inclusive). You should save
your visualization figure into a png file named “vocabularySizeDistribution.png”.
● visualizePostNumberTrend(inputFile, outputImage):
This function displays the trend of the post number in the Q&A site. Given the input
file “data.xml”, you should first get the number of questions and answers in each
quarter. Then following the time order, you should draw a line chart to annotate the
number of posts in each quarter. Note that you should draw two lines for question
number and answer number respectively, and add a legend in the figure to tell which
line is for which type of posts. You should save your visualization figure into a png file
named “postNumberTrend.png”.
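The left-inclusive binning for the vocabulary size chart can be sketched as below. The helper names bin_label, count_bins and draw_bar_chart are hypothetical, and the sketch assumes the vocabulary sizes have already been collected into a list. Matplotlib is imported inside the drawing function so that the counting helpers can be tried without it.

```python
from collections import Counter

# Bin labels in x-axis order: "0-10", "10-20", ..., "90-100", "others".
BIN_LABELS = [f"{low}-{low + 10}" for low in range(0, 100, 10)] + ["others"]


def bin_label(size):
    """Map a vocabulary size to its left-inclusive bin label."""
    return "others" if size >= 100 else BIN_LABELS[size // 10]


def count_bins(sizes):
    """Return the number of posts in each bin, in BIN_LABELS order."""
    counts = Counter(bin_label(size) for size in sizes)
    return [counts[label] for label in BIN_LABELS]


def draw_bar_chart(sizes, outputImage):
    """Save a bar chart of the binned counts to outputImage."""
    import matplotlib.pyplot as plt  # needed only for drawing

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(BIN_LABELS, count_bins(sizes))
    ax.set_xlabel("Vocabulary size")
    ax.set_ylabel("Number of posts")
    fig.savefig(outputImage)
    plt.close(fig)


print(count_bins([5, 10, 19, 100, 250]))
```

Because the bins are left inclusive, a size of 10 falls into "10-20" rather than "0-10", and a size of exactly 100 falls into "others".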
Note: Please import the class defined in Section 2.3 (Task 2). Apart from defining these
two functions, you should also call them to produce the png files. You should
put your code for this final task into the template file “dataVisualization_studentID.py”,
and name the file with your own ID.
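For the post number trend, the per-quarter counting could be sketched as follows. The helper names quarterly_counts and draw_line_chart are hypothetical, and the input is assumed to be a list of (quarter, post type) pairs such as ("2016Q2", "1"), which could be collected from the Task 2 class. As above, Matplotlib is imported only inside the drawing function.

```python
from collections import Counter


def quarterly_counts(posts):
    """posts: iterable of (quarter, post_type) pairs, e.g. ("2016Q2", "1").
    Returns the sorted quarters plus question and answer counts for each."""
    questions, answers = Counter(), Counter()
    for quarter, post_type in posts:
        if post_type == "1":
            questions[quarter] += 1
        elif post_type == "2":
            answers[quarter] += 1
    # "yyyyQn" strings sort chronologically, e.g. "2015Q4" < "2016Q1".
    quarters = sorted(set(questions) | set(answers))
    return (quarters,
            [questions[q] for q in quarters],
            [answers[q] for q in quarters])


def draw_line_chart(posts, outputImage):
    """Save a two-line chart (questions, answers) to outputImage."""
    import matplotlib.pyplot as plt  # needed only for drawing

    quarters, question_counts, answer_counts = quarterly_counts(posts)
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(quarters, question_counts, marker="o", label="Questions")
    ax.plot(quarters, answer_counts, marker="o", label="Answers")
    ax.set_xlabel("Quarter")
    ax.set_ylabel("Number of posts")
    ax.tick_params(axis="x", labelrotation=45)
    ax.legend()
    fig.savefig(outputImage)
    plt.close(fig)


sample = [("2016Q2", "1"), ("2016Q1", "2"), ("2016Q2", "2"),
          ("2016Q2", "1"), ("2015Q4", "3")]
print(quarterly_counts(sample))
```

A quarter that has answers but no questions (or the reverse) still appears on the x-axis with a count of zero for the missing type, so both lines share the same set of quarters.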
2.5. User Manual
Apart from the comments in your code, you should also prepare user documentation
(at least 4 pages but not more than 8 pages) in PDF format, with clear
and complete instructions on how to run your programs and a description of the analytical
outputs (tables and graphs). For each graph in Task 3, you are required to add an
explanation that describes the graph.