Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
FIT3155 : Assignment 2
[Weight: 20 = 3 + 8 + 3 + 6 marks.]
Your assignment will be marked on the performance/efficiency of your program. You must
write all the code yourself, and should not use any external library routines, except those
that are considered standard. The usual input/output and other unavoidable routines are
exempted.
Follow these procedures while submitting this assignment:
The assignment should be submitted online via moodle strictly as follows:
• All your scripts MUST contain your name and student ID.
• Use gzip or Winzip to bundle your work into an archive which uses your student ID
as the file name. (STRICTLY AVOID UPLOADING .rar ARCHIVES!)
– Your archive should extract to a directory which is your student ID.
– This directory should contain a subdirectory for each of the four questions, named
as: q1/, q2/, q3/ and q4/.
– Your corresponding scripts and work should be tucked within those subdirectories.
Note: If you have scripts that are common to multiple questions, such as your
suffix tree construction, you may include the script in the parent directory, or
copy it into the relevant subdirectories.
• Submit your zipped file electronically via Moodle.
Academic integrity, plagiarism and collusion
Monash University is committed to upholding high standards of honesty and academic
integrity. As a Monash student your responsibilities include developing the knowl-
edge and skills to avoid plagiarism and collusion. Read carefully the material available
at https://www.monash.edu/students/academic/policies/academic-integrity to understand
your responsibilities. As per FIT policy, all submissions will be scanned via MOSS.
1
Assignment Questions
1. (3 marks) Recall that a spanning tree of a graph G(V,E) is a subgraph T consisting of
all the vertices in V and some subset of edges E ′ ⊂ E such that T is a tree. That is, a
spanning tree is a collection of edges that form a tree and connect all the vertices in V .
A spanning tree in a weighted graph is called a weighted spanning tree and a weighted
spanning tree of minimum total possible weight is called a minimum spanning tree.
Given a weighted undirected graph G(V,E), your task is to write a program that uses
Kruskal’s algorithm (see notes from FIT2004) to find a minimum weight spanning tree in
G. You may assume that G is connected.
Strictly follow the following specification to address this task:
Program name: kruskals.py
Arguments to your program: The total number of vertices |V | (the size of G’s vertex
set) as a positive integer, and a plain text file. Every line in the text file contains
single edge with weight w connecting vertex u to vertex v in G represented by 3
integers in the following format:
u v w
The integers u and v are integers in the range [0 . . . |V | − 1]. Each of the integers in
the line are separated by a single space.
Command line usage of your script:
kruskals.py <|V|>
Do not hard-code the filename in your program.
Output file name: output kruskals.txt
Output format: The output file should contain:
• The total weight (as an integer) of a minimum spanning tree in G on the first
line.
• The edges of the minimum weight spanning found by your program. There
should be one edge per line, (starting from the second line) in the same format
as the input edge file.
Example: In the example below the minimum weight spanning tree (shown in red) has
weight 1 + 4 + 6 + 2 + 1 + 2 = 16. In this case the usage of your program would be:
python kruskals.py 7 G.txt
2
Where G.txt would contain:
0 1 5
1 2 4
2 3 2
3 4 10
4 5 6
5 0 1
0 6 8
1 6 2
2 6 1
3 6 3
4 6 7
5 6 4
Finally, the output of your program in this case would be a file output kruskals.txt
containing the weight of the minimum spanning tree (16) on the first line, followed
by the edges contained in the tree:
16
0 5 1
2 6 1
2 3 2
1 6 2
5 6 4
4 5 6
3
Questions 2 and 3 must be addressed using a suffix tree. There are 6 marks assigned
to suffix tree construction and these are included in the 8 marks available for question
2. That is for question 2, 6/8 marks are dedicated to suffix tree construction while the
remaining 2 are allocated to your use of the suffix tree and the output of your program.
As always, marking will be based on the efficiency of your suffix tree construction, and
those of the tasks below.
2. (8 marks) Given a string str[0 . . . n− 1], write a program that constructs its suffix tree,
and using that suffix tree, outputs the suffix array of the string. Note suffix arrays were
covered in FIT2004.
Strictly follow the following specification to address this task:
Program name: suffix array.py
Argument to your program: An input file containing str[0 . . . n − 1]. It is safe to
assume that:
• str consists of lower case English characters, i.e. ASCII characters with values
in the range [97 . . . 122].
• There are no line breaks in the input file, and that all characters str[0 . . . n− 1]
are lexicographically larger than the terminal character, $, which you will need
to append to str[0 . . . n− 1] after reading it from the file.
Command line usage of your script:
suffix array.py
Do not hard-code the filename in your program.
Output file name: output suffix array.txt
• Output format: The output file should contain the start indices of all the suffixes
in str[0 . . . n−1]$ in the order that they appear in the corresponding suffix array.
Each line of the file should contain a single suffix index as shown in the example
below.
Example: If str[0 . . . 10]$ = mississippi$ then the output would be:
11
10
7
4
1
0
9
8
6
3
5
2
4
3. (3 marks) For strings S1[0 . . . n− 1] and S2[0 . . .m− 1], define the values L(i, j) for any
0 ≤ i ≤ n − 1 and 0 ≤ j ≤ m − 1, as the longest prefix common to the suffixes starting
at position i in S1 and position j in S2.
For a given pair of input strings S1[0 . . . n− 1] and S2[0 . . .m− 1] and a list of (i, j) pairs
your task is to use a suffix tree to compute the L(i, j) value for each (i, j) pair in the list.
Strictly follow the following specification to address this task:
Program name: lcp.py
Arguments to your program:
(a) An input file containing S1[0 . . . n− 1].
(b) An input file containing S2[0 . . .m− 1].
(c) An input file containing a list of (i, j) pairs, one per line in the format:
i j
where 0 ≤ i ≤ n− 1 and 0 ≤ j ≤ m− 1.
It is safe to assume that:
• S1 and S2 consist of lower case English characters, i.e. ASCII characters with
values in the range [97 . . . 122].
• There are no line breaks in the files containing S1 and S2.
• Each line of the (i, j) pairs file contains a single pair where each index is sepa-
rated by a single whitespace (see example below).
Command line usage of your script:
lcp.py
Do not hard-code the filename in your program.
Output file name: output lcp.txt
• Each line should contain of a single pair of i and j indices followed by the corre-
sponding L(i, j) value, with each value being separated by a single whitespace
e.g. in the following format:
i j L(i,j)
Example: If S1[0 . . . 9] = abcdacbdab, S2[0 . . . 7] = dacbdabc and the input (i, j) pairs
file contained:
3 0
4 2
0 5
then the output should be:
3 0 7
4 2 0
0 5 3
5
4. (6 marks)
To give your fingers a break from coding this question will focus on the amortised analysis
of a Fibonacci heap.
Strictly follow the following specification to address this task:
File name: fibonacci.pdf
Submission instrcutions:
Your pdf file should contain your solutions to parts a-j below. Clearly indicate
which question each section in your solution refers to and be careful to include all
the necessary working. You may type your solutions or write them by hand and scan
them into a pdf file. Please note that if you choose to submit hand written solutions
it is your responsibility to ensure that your hand writing is legible. If your marker
cannot read your work you will not receive marks for it!
Once you have collected your solutions into a pdf file (named appropriately) simply
place it inside of your q4 subdirectory.
6
Preliminaries
Recall that the goal of any amortised analysis is to provide an upper bound on the cost
of any sequence of operations that might be performed on a given data structure. That
is, if we denote the true, or actual cost of some sequence of operations {op1, op2, . . . , opn}
as TC({op1, op2, . . . , opn}) then we desire some costing scheme that always ensures that
the amortised cost of this sequence (AC({op1, op2, . . . , opn})) is such that,
AC({op1, op2, . . . , opn}) ≥ TC({op1, op2, . . . , opn}). (1)
It is important to stress that this must be true for any sequence of operations not just
one specific sequence!
At this point you are likely already familiar with one costing scheme known as the ag-
gregate method where we simply assign AC({op1, op2, . . . , opn}) to be the cost (or some
upper bound on it) of the worst case sequence. We then distribute this work evenly
amongst the constituent operations to define the amortised cost of each operation oi -
denoted aci - as,
aci =
AC({op1, op2, . . . , opn})
n
. (2)
However, the amortised cost of the sequence need not be distributed evenly amongst the
operations in the sequence.
In the potential method we instead define the amortised cost of an operation oi to be,
aci = tci + Φ(Di)− Φ(Di−1), (3)
where tci is the true, or the actual cost of operation oi, Di is the state of the data
structure after operation oi has been completed, and Φ is known as a potential function.
The potential function simply maps the state of the data structure Di to a real number
Φ(Di), which is referred to as the potential associated with Di.
Using our newly defined notation and noting that the amortised cost of a sequence of
operations is simply the sum of the amortised cost of each individual operation in the
sequence we obtain,
AC({op1, op2, . . . , opn}) =
n∑
i=1
aci =
n∑
i=1
tci + Φ(Dn)− Φ(D0) (4)
(a) (0.5 marks) Use equation 3 to derive the right-hand-side of equation 4.
Since TC({op1, op2, . . . , opn}) =
∑n
i=1 tci we require Φ(Dn) ≥ Φ(D0) in equation 4 so
that the amortised cost of the sequence always upper bounds its true cost. Moreover, it is
common to define a potential function such that Φ(D0) = 0 and since in practice we don’t
always know how many operations will be in our sequence we also require Φ(Di) ≥ Φ(D0)
for all i. Thus, if we can show that Φ(Di) ≥ 0 for all i, we guarantee that the amortised
cost will always upper bound the true cost of any sequence.
7
Fibonacci heaps
To this point everything has been a little abstract, so let’s try to intuit what we have so
far and begin applying it to a Fibonacci heap. Examining equation 3,
aci = tci + Φ(Di)− Φ(Di−1),
we can see that the amortised cost for operation oi can be more or less than the true
cost. This means that the potential function Φ acts as a kind of credit and debit account
for our data structure. If the change in potential ∆Φ = Φ(Di)− Φ(Di−1) is positive this
amounts to overestimating the cost of the operation, which increases the potential and
corresponds to making a credit to our account. Conversely if ∆Φ < 0 then our potential
decreases as we have underestimated the cost of the operation, which corresponds to
making a debit to our account. Since initially Φ(D0) = 0 and we want Φ(Di) ≥ 0 for
all i, we essentially start with an empty account that we want to keep out of debit by
ensuring that we overestimate at least as much as we underestimate. This ensures that
our amortised cost always upper bounds the true cost of the sequence. With this you
should now have the necessary background knowledge required to analyse a Fibonacci
heap which will be the focus of the remainder of the assignment.
Let’s begin by considering a restricted version of a Fibonacci heap that only supports the
operations:
• Merge1
• Insert
• Find min
• Extract min
The reason for the omission of decrease key and delete from this list will become clear
later. To analyse the amortised complexity of these operations let’s define the potential
of a restricted Fibonacci heap H to be,
Φ(H) = T (H), (5)
where T (H) is the number of nodes in the root list of H. If we were to use this potential
to derive the amortised cost of find min we’d proceed as follows.
First we need to find the true cost of the operation, which is clearly O(1) since the
minimum element in the heap can be accessed directly via Hmin. In addition, since this
operation does not alter the number of nodes in the root list of H we have,
Φ(Di)− Φ(Di−1) = T (H)− T (H) = 0,
and so by equation 3 the amortised cost of find min is given by,
aci = tci + Φ(Di)− Φ(Di−1)
= O(1) + 0
= O(1).
In this case we see that the amortised cost for find min is actually equal to the true cost
of the operation.
1To analyse merge using our potential method we’d need to consider the potential of two separate heaps
H1 and H2 initially. To do so we simply define the potential of a collection of heaps to be the sum of their
individual potentials.
8
(b) (0.5 marks) Use the potential function given in equation 5 to derive the amortised
cost of insert.
Now that we’ve seen some examples of the potential method let’s quickly revisit equation
3,
aci = tci + Φ(Di)− Φ(Di−1).
If we were to overestimate the cost of operation oi by 3 units of potential then our
current interpretation would imply that those 3 units of potential are worth 3 units of
work. We can then use this potential to compensate for 3 units of work the next time
we underestimate the cost of an operation. However, we are free to scale the units of
the potential so that a single unit corresponds to any constant amount of work. In other
words, equation 3 should really be written as,
aci = tci +m [Φ(Di)− Φ(Di−1)] , (6)
where m is a positive real constant that we’ve implicitly assumed to be equal to 1 up
until this point.
(c) (1 mark) Show that the amortised cost of extract min is O(Dmax(N)), where N is
the number of nodes in H and Dmax(N) denotes the maximum degree of a node in
H.
Hint: Start by showing the true cost is O(T (H) +Dmax(N)) and then show that you
can scale the units of potential to leave O(Dmax(N)) for the amortised cost.
(d) (0.5 marks) Argue that for our restricted heap Dmax(N) ≤ blog2(N)c.
Now that we’ve analysed the operations of our restricted heap (you can convince yourself
that the amortised cost of merge is O(1)) let’s reintroduce decrease key.
The reason we initially omitted this operation from our list was that it complicates the
analysis slightly. For instance, the bound derived in part (f) is no longer valid, but we can
still show that the amortised cost of extract min is O(log(N)) even when decrease keys
are allowed.
To do so we need to show that Dmax(N) is still O(log(N)). This amounts to showing
that in the worst case the size of a given tree in H is still given by a function that is
exponential in the degree of the tree, so let’s determine how small our trees can get. Let’s
define a maximally decreased tree of degree k to be a tree - of degree k - in a Fibonacci
heap that has lost the maximum number of nodes from its subtrees due to decrease key
operations. Put another way, a maximally decreased tree is one from which we can’t
remove any more nodes via decrease keys without also decreasing the degree of the root.
For instance, the maximally decreased trees of degree 0, 1, 2 and 3 contain 1, 2, 3 and 5
nodes respectively, as shown in figure 1.
Figure 1: Maximally decreased trees of degree 0, 1, 2 and 3. The marked - denoted by the
shading - nodes have lost one child and the numbers represent the degree of each node rather
than the key of the node.
9
(e) (0.5 marks) How many nodes are in a maximally decreased tree of degree k?
Hint: Begin by simply drawing out maximally decreased trees of increasing degree
(you might find the representation used in figure 1 helpful). Now count the number
of nodes in each of these trees, can you spot a pattern? It may help to recall that the
Fibonacci sequence is defined as,
F0 = 0, F1 = 1 and Fn = Fn−1 + Fn−2 for n > 1,
and the start of the sequence is shown below.
F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 F10
0 1 1 2 3 5 8 13 21 34 55
(f) (1.25 mark) Prove - using the principle of mathematical induction - that the expres-
sion you derived for the size of a maximally decreased tree in the previous part is
correct.
Hint: Think about how you can construct a maximally decreased tree of degree k
from maximally decreased trees of smaller degree.
At this point you should have an expression for the number of nodes in a maximally
decreased tree of degree k. By definition we know that this tree contains the fewest nodes
of any tree of degree k in our heap. Therefore, in general a tree of degree k rooted at r
in our heap must contain at least this many nodes i.e.,
size(r) ≥ g(k), (7)
where size(r) is the number of nodes in the tree and g(k) is the number nodes in a
maximally decreased tree of degree k that you found in part (e).
(g) (0.5 marks) Using equation 7 and the fact that,
Fk+2 ≥ φk, (8)
for all integers k ≥ 0, where φ = 1+
√
5
2
is the golden ratio, show that Dmax(N) =
O(log(N)) even when decrease key operations are allowed.
The result in part (g) proves that even when decrease key operations are allowed the
amortised complexity of extract min is still O(Dmax(N)) = O(log(N)).
Finally let’s conclude by deriving the amortised cost of decrease key.
(h) (0.5 marks) Using the potential given in equation 5, derive the amortised cost of a
decrease key operation. Is this what you’d expect?
Hint: Assume that you perform c ≥ 0 cascading cuts during the operation and
consider the true and amortised cost in this case.
Currently our potential is simply storing credits during decrease key that are used during
the extract min to offset the fact that we are adding more nodes to our root list. However,
we aren’t using any of our potential to compensate for the fact that marking nodes also
produces extra work in the form of cascading cuts in future decrease keys. To resolve this
let’s ‘back charge’ this work to the decrease keys that marked the nodes in the first place.
10
That is, we want to modify our potential so that we account for this work (make a credit)
every time we mark a node.
Let’s consider the potential,
Φ = T (H) + αM(H), (9)
where α is a positive constant and M(H) denotes the number of marked nodes in H. Now
every time we mark a node our potential increases by α units, but what value should we
choose for α?
(i) (0.5 marks) Determine the smallest integral value of α for which the potential in
equation 9 yields an O(1) amortised cost for decrease key.
(j) (0.25 marks) Explain why this value of α intuitively makes sense.