CSE 584A Algorithms for Biosequence Comparison
Algorithms for Biosequence Comparison
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
CSE 584A Algorithms for Biosequence Comparison
Homework 3
• create a separate PDF file for each problem;
• include any figures (typeset or hand-drawn) inline or as floats;
• upload and submit your PDFs to Canvas before class time on the due date.
Always show your work.
1. (20%) Suppose we have the BWT of a text T but not the original text, and that we’ve augmented the
BWT to support constant-time rprev computations. Given a constant k, show how to further augment
the BWT to answer queries of the form “what is substring T [i..j]?” in worst-case time Θ(j − i + k),
using additional space at most (n log n)/k bits.
Note that you cannot just keep a copy of T around, because (n log n)/k < n log |Σ| for large enough k.
2. (30%) Consider the algorithm for enumerating the internal nodes of a suffix tree using the 2BWT. We
gave a recursive formulation for simplicity of exposition, but this problem will consider how to code
the algorithm with an explicit stack instead.
(a) Consider this (simplified) version of the EnumTree code from the class notes:
enumTree(I, d)
if I is right-diverse
for c ∈ Σ do . possible inverse suffix links in τ ′
extend I backwards by char c to obtain maximal bi-interval I ′
enumTree(I ′, d+ 1)
Write an iterative version of enumTree that uses an explicit stack to replace the recursive calls.
Your code should start from the root of the tree. To process a right-diverse maximal bi-interval
I, it should first find all the maximal bi-intervals I ′ obtained by extending I, then push them all
onto the stack.
(b) A problem with the code we’ve written so far (both recursive and iterative) is that the stack
depth can become very large – as large as the depth of the tree, which is bounded only by the
length of the text T . This means that we’d have to spend Θ(|T |) space in the worst case to store
the stack in memory, which is a Bad Thing if we are trying to save space by using the 2BWT in
the first place!
Consider the following heuristic improvement to the iterative algorithm. When enumerating
extensions I ′ of a given I, push the largest bi-interval I ′ onto the stack first. (The remaining
bi-intervals can be pushed in any order.) Prove that this heuristic guarantees that the stack
depth never exceeds Θ(log |T |).
3. (50%) One way to evaluate the complexity of a DNA sequence T is to count the number of distinct
k-mers that occur as substrings of T , for some fixed length k. Very repetitive sequences have only a
small number of distinct k-mers, while (for long enough k) non-repetitive sequences have a number of
distinct k-mers close to their length.
Suppose we have a suffix tree τ for a string T . Let L be the number of leaves in τ , and let Vk be the
set of internal nodes whose labels have length at least k. For any internal node v ∈ τ , let c(v) > 1 be
the number of v’s immediate children. Then the number Nk of distinct k-mers in τ is given by the
following formula (see Belazzougui and Cunial, Proc. 2015 Conf. on Combinatorial Pattern Matching
for an explanation):
Nk = L− k + 1 +
∑
v∈Vk
(1− c(v)).
The justification for this formula is roughly as follows. The number of distinct k-mers is equal to the
number of tree nodes at depth ≥ k whose parent is at depth < k. We begin our estimate of Nk with
the number of leaves of depth ≥ k in the tree; then, for each internal node at depth ≥ k, we add one
for this node and subtract one for each of its children. In this way, the sum includes a net contribution
of 0 (1− 1) for each node at depth ≥ k whose parent is also at depth ≥ k, and a contribution of 1 for
each node at depth ≥ k whose parent is at depth < k. Note that we can enumerate the nodes of Vk in
any order and still get the right answer.
Your job in this assignment is to implement the computation of Nk for a large DNA sequence. Since we
don’t want to spend the space required to build the suffix tree for such a sequence, you should instead
convert the sequence to its 2BWT, using your BWT construction from Homework 2, and then use
the 2BWT-based algorithm for enumerating all internal nodes of a sequence’s suffix tree to implement
the formula for computing Nk. Use the iterative, bounded-stack version of the enumeration algorithm
from the previous problem!
After validating your implementation on some small strings for which you can check the answer by
hand, apply it to the test sequence from Homework 2. Report the number of distinct (case-insensitive)
k-mers in this sequence for each 20 ≤ k ≤ 40. For each k in increasing order, print a line “k n”, where
k is the k-mer length and n is the number of distinct k-mers in the sequence. You need consider only k-
mers that appear on the forward strand, not the reverse-complement. Since your BWT construction
appends a “$” character to the sequence, be sure to correct your output so as not to count the k-mer
that ends with this extra character.
Test Case: Use the test case file for Homework 2, which is still present on the Unit 2 assignments
page.
Where to Turn In Your Solution: each student will have a GitHub Classroom repository for this
assignment. To create your repository, please accept the invitation to the assignment by following this
link:
To turn in your solution, please commit the following items to your assignment repo before the due
date and time:
• your code;
• the output of k-mer counting on your test case as described above;
• a README.md file describing any interesting properties of your implementation (including any
bugs, if known).
Please do not check in copies of the test case files or your BWTs.
Implementation Advice: to implement the enumeration of suffix tree nodes, you need to compute
C and occ for each direction of the 2BWT. You do not need to compute or store a compressed suffix
array for your sequence. Moreover, you do not have to print the nodes of the tree (unless you are using
these yourself for debugging). Finally, you’re working with DNA, so don’t bother with a wavelet tree
or similar fancy thing – a simple 4 × |T | table of occ values, decimated by a decent factor (I suggest
between 32 and 128), will suffice.
You need not implement fancy bit operations, such as the popcount we discussed in class, for your
decimated occ, but if you want to do so, feel free. It only really makes sense to do this if you store
your BWT using only two bits per base (and adjust for not explicitly storing the $ character).
The right-diversity check that forms part of the traversal algorithm will also give you c(v) for suffix
tree nodes v.
If your BWT constructor from Homework 2 is broken, let me know, and I’ll help you get the necessary
BWTs a different way.