Academia.eduAcademia.edu

Dynamic Fully-Compressed Suffix Trees

Lecture Notes in Computer Science

https://doi.org/10.1007/978-3-540-69068-9_19
Dynamic Fully-Compressed Suffix Trees Luı́s M. S. Russo⋆ ⋆ ⋆1,3 , Gonzalo Navarro†2 , and Arlindo L. Oliveira1 1 INESC-ID, R. Alves Redol 9, 1000 Lisboa, Portugal. [email protected] 2 Dept. of Computer Science, University of Chile. [email protected] 3 Dept. of Computer Science, University of Lisbon, Portugal. [email protected] Abstract. Suffix trees are by far the most important data structure in stringology, with myriads of applications in fields like bioinformatics, data compression and information retrieval. Classical representations of suffix trees require O(n log n) bits of space, for a string of size n. This is considerably more than the n log2 σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. A recent so-called fully-compressed suffix tree (FCST) requires asymptotically only the space of the text entropy. FCSTs, however, have the disadvantage of being static, not supporting updates to the text. In this paper we show how to support dynamic FCSTs within the same optimal space of the static version and executing all the operations in polylogarithmic time. In particular, we are able to build the suffix tree within optimal space. 1 Introduction and Related Work Suffix trees are extremely important for a large number of string processing prob- lems. Their many virtues have been described by Apostolico [1] and Gusfield [2]. The combinatorial properties of suffix trees have a profound impact in the bioin- formatics field, which needs to analyze large strings of DNA and proteins with no predefined boundaries. This partnership has produced several important re- sults, but it has also exposed the main shortcoming of suffix trees. Their large space requirements, together with their need to operate in main memory to be useful in practice, renders them inapplicable in the cases where they would be most useful, that is, on large texts. The space problem is so important that it originated a plethora of research results, ranging from space-engineered implementations [3] to novel data struc- tures that simulate suffix trees, most notably suffix arrays [4]. Some of those space-reduced variants give away some functionality in exchange. For example suffix arrays miss the important suffix link navigational operation. Yet, all these classical approaches require O(n log n) bits, while the indexed string requires only n log σ bits (we write log for log2 ), n being the size of the string and σ the ⋆⋆⋆ Supported by the Portuguese Science and Technology Foundation by grant SFRH/BPD/34373/2006 and project ARN, PTDC/EIA/67722/2006. † Partially funded by Millennium Institute for Cell Dynamics and Biotechnology, Grant ICM P05-001-F, Mideplan, Chile. size of the alphabet. For example the human genome requires 700 Megabytes, while even a space-efficient suffix tree on it requires at least 40 Gigabytes [5], and the reduced-functionality suffix array requires more than 10 Gigabytes. This is particularly evident in DNA because log σ = 2 is much smaller than log n. These representations are also much larger than the size of the compressed string. Recent approaches [6] combining data compression and succinct data structures have achieved spectacular results for the pattern search problem. For example Ferragina et al. [7] presented an index that requires nHk + o(n log σ) bits and counts the occurrences of a pattern of length m in time O(m(1 + (logσ log n)−1 )). Here nHk denotes the k-th order empirical entropy of the string [8], a lower bound on the space achieved by any compressor using k-th order modeling. As that index is also able of reproducing any text substring, its space is asymptotically optimal in the sense that no k-th order compressor can achieve asymptotically less space to represent the text. It turns out that it is possible to use this kind of data structures, that we will call compressed suffix arrays (CSAs)4 , and, by adding a few extra structures, support all the operations provided by suffix trees. Sadakane presented the first compressed suffix tree (CST) [5], adding 6n bits on top of the CSA. Recently Russo et al. [9] achieved a fully-compressed suffix tree (FCST), which works over the smallest existing CSA [7], adding only o(n log σ) bits to it. Hence the FCST breaks the Θ(n) extra-bits space barrier and retains asymptotic space optimality. Albeit very interesting as a first step, the FCST has the limitation of being static, and moreover of being built from the uncompressed suffix tree. CSAs have recently overcome this limitation, starting with the structure by Chan et al. [10]. In its journal version this work included the first dynamic CST, which builds on Sadakane’s (static) CST [5] and retains its Θ(n) extra space penalty. On the other hand, the smallest existing CSA [7] was made dynamic within the same space by Mäkinen et al. [11], which was recently improved by González et al. [12] so as to achieve logarithmic time slowdown. In this paper we make the FCST dynamic by building on this latter dynamic CSA. We retain the optimal space complexity and polylogarithmic time for all the operations. A comparison between Chan et al.’s CST and our FCST is shown in Table 1. Our FCST is not significantly slower, yet it requires much less space (e.g. one can realistically predict 25% of Chan et al.’s CST space on DNA). For the table we chose the smallest existing dynamic CSA, so that we show the time complexities that can be obtained within the smallest possible space for both CSTs. All these dynamic structures, as well as ours, indeed handle a collection of texts, where whole texts are added/deleted to/from the collection. Construction in compressed space is achieved by inserting a text into an empty collection. 2 Basic Concepts Fig. 1 illustrates the concepts in this section. We denote by T a string; by Σ the alphabet of size σ; by T [i] the symbol at position (i mod n) (so the first symbol 4 These are also called compact suffix arrays, FM-indexes, etc., see [6]. Table 1. Comparing compressed suffix tree representations. The operations are defined along Section 2. Time complexities, but not space, are big-O expressions. We give the generalized performance (assuming Ψ, t, Φ ≥ log n) and an instantiation using δ = (logσ log n) log n. For the instantiation we also assume σ = O(polylog(n)), and use the dynamic FM-Index variant of González et al. [12] as the compressed suffix array (CSA), for which the space holds for any k ≤ α logσ (n) − 1 and any constant 0 < α < 1. Chan et al. [10] Ours Space in bits |CSA| + O(n) + o(n) |CSA| + O((n/δ) log n) = nHk + O(n) + o(n log σ) = nHk + o(n log σ) SDep Φ = (logσ log n) log2 n Ψδ = (logσ log n) log2 n Count/ Ancestor log n = log n 1 =1 Parent log n = log n (Ψ + t)δ = (logσ log n) log2 n SLink Ψ = log n (Ψ + t)δ = (logσ log n) log2 n SLinki Φ = (logσ log n) log2 n Φ + (Ψ + t)δ = (logσ log n) log2 n Letter / Locate Φ = (logσ log n) log2 n Φ = (logσ log n) log2 n LCA log n = log n (Ψ + t)δ = (logσ log n) log2 n FChild/ NSib log n = log n (Ψ + t)δ + Φ log δ + (log n) log(n/δ) = ((logσ log n) log2 n) log log n 2 Child Φ log σ = (log log n) log n (Ψ + t)δ + Φ log δ + (log n) log(n/δ) = ((logσ log n) log2 n) log log n WeinerLink t = log n t = log n Insert(T ) / |T |(Ψ + t)δ |T |(Ψ + t)δ = |T |(log σ log n) log2 n Delete(T ) = |T |(logσ log n) log2 n is T [0]); by T.T ′ concatenation; by T = T [..i − 1].T [i..j].T [j + 1..] respectively a prefix, a susbtring and a suffix; by Parent(v) the parent node of node v; by TDep(v) its tree-depth; by Ancestor(v, v ′ ) whether v is an ancestor of v ′ ; by LCA(v, v ′ ) the lowest common ancestor. The path-label of a node v in a labeled tree is the concatenation of the edge-labels from the root down to v. We refer indifferently to nodes and to their path-labels, also denoted by v. The i-th letter of the path-label is denoted as Letter(v, i) = v[i]. The string-depth of a node v, denoted by SDep(v), is the length of its path-label. Child(v, X) is the node that results of descending from v by the edge whose label starts with symbol X, if it exists. The suffix tree of T is the deterministic compact labeled tree for which the path-labels of the leaves are the suffixes of T . We assume that T ends in a terminator symbol $ that does not belong to Σ. The generalized suffix tree of a collection C of texts is the tree that results from merging the respective suffix trees. Moreover each text is assumed to have a distinct terminator. For a detailed explanation see Gusfield’s book [2]. The suffix-link of a node v 6= Root of a suffix tree, denoted SLink(v), is a pointer to node v[1..]. Note that SDep(v) of a leaf v identifies the suffix of T starting at position n−SDep(v) = Locate(v). For example T [Locate(ab$)..] = T [7 − 3..] = T [4..] = ab$. The suffix array A[0, n − 1] stores the Locate values of the leaves in lexicographical order. Note that in a generalized suffix 1 2 i: 01 234 56 7890 12 345 67 8901 a ((0)((1)(2))((3)(4)((5)(6)))) b b B: 1 0 0 0 101101 0 0 1 b b $ b 0 ( 0 1 2 (3)(4) 5 6 ) i: 0 1 23 4 5 b b b a b 3 a a a a B: 1 0 0 0 1101101 0 0 11 b b b b a ( 0 1 2 ((3)(4) 5 6 )) 1 $ $ $ $ $ $ $ i: 0 12 34 5 67 0 1 2 3 4 5 6 b 4 A: 6 4 0 5 3 2 1 b Fig. 3. Parentheses representa- 5 tions of trees. The parentheses on Fig. 1. Suffix tree T of string b top represent the suffix tree, those 6 abbbab, with the leaves numbered. a in the middle the sampled tree, and The arrow shows the SLink be- 2 those on the bottom the sampled tween node ab and b. Below it we tree when b is also sampled along show the suffix array. The portion Fig. 2. with the B bitmap. The numbers of the tree corresponding to node Reverse are not part of the representation; b and respective leaves interval is tree T R . they are shown for clarity. The within a dashed box. The sampled rows labeled i: give the index of nodes have bold outlines. the parentheses. tree Locate must also identify the text to which the suffix corresponds. When we use arithmetic expressions involving A and A−1 they are computed within a given text, i.e. they do not jump to another text. Moreover for simplicity we use only one text in our example and hence omit the text identifier. The suffix tree nodes can be identified with suffix array intervals: each node corresponds to the range of leaves that descend from v. The node b corresponds to the interval [3, 6]. Hence the node v will be represented by the interval [vl , vr ]. Leaves are also represented by their left-to-right index (starting at 0). For example by vl − 1 we refer to the leaf immediately before vl , i.e. [vl −1, vl −1]. With this representation we can Count in constant time the number of leaves that descend from v. The number of leaves below b is 4 = 6 − 3 + 1. This is precisely the number of times that the string b occurs in the indexed string T . We can also compute Ancestor in O(1) time: Ancestor(v, v ′ ) ⇔ vl ≤ vl′ ≤ vr′ ≤ vr . 3 Static Fully-Compressed Suffix Trees and Our Plan In this section we briefly explain the static FCST we build on [9]. The FCST consists of a compressed suffix array, a δ-sampled tree S, and mappings between these structures. We also give the road map of our plan to dynamize the FCST. Compressed suffix arrays (CSAs) are compact and functional represen- tations of suffix arrays [6]. Apart from the basic functionality of retrieving A[i] = Locate(i) (within a time complexity that we will call Φ = Ω(log n)), state-of-the-art CSAs support operation SLink(v) for leaves v. This is called ψ(v) in the literature: A[ψ(v)] = A[v] + 1, and thus SLink(v) = ψ(v), let its time complexity be Ψ = Ω(log n). The iterated version of ψ, denoted ψ i , can usually be computed faster than O(iΨ ) with CSAs. This is achieved as ψ i (v) = A−1 [A[v] + i], let us assume that the CSA can also compute A−1 within O(Φ) time. CSAs might also support the WeinerLink(v, a) operation [13]: for a node v the WeinerLink(v, X) gives the suffix tree node with path-label X.v[0..]. This is called the LF mapping in CSAs, and is a kind of inverse of ψ, let its time complexity be t = Ω(log n). Consider the interval [3, 6] that represents the leaves whose path-labels start by b. In this case we have that LF(a, [3, 6]) = [1, 2], i.e. by using the LF mapping with a we obtain the interval of leaves whose path- labels start by ab. We extend of LF to strings, LF(X.Y, v) = LF(X, LF(Y, v)). CSAs also implement Letter(v, i) for leaves v. The easiest case is the first letter of a given suffix, Letter(v, 0) = T [A[v]]. This corresponds to v[0], the first letter of the path-label of leaf v. Dynamic CSAs implement v[0] in time O(log n). In general, Letter(v, i) = Letter(SLinki (v), 0) is implemented in O(Φ) time. CSAs are usually self-indexes, meaning that they replace the text: they can extract any substring, of size ℓ, of the indexed text in O(Φ + ℓΨ ) time. In this paper we will use a dynamic CSA for this part [12], which implements these operations with logarithmic slowdown to its static version [7]. The dynamic CSA actually handles a collection of texts, where insertions and deletions of whole texts T are carried out in time O(|T |(Ψ + t)). The δ-sampled tree exploits the property that suffix trees are self-similar, SLink(LCA(v, v ′ )) = LCA(SLink(v), SLink(v ′ )) whenever the expressions are well defined. This means, roughly, that the tree structure below SLink(v) con- tains the tree structure below v. Because of this regularity it is possible to store only a few sampled nodes instead of the whole suffix tree. A δ-sampled tree S, from a suffix tree T of Θ(n) nodes, chooses O(n/δ) nodes such that, for each node v, node SLinki (v) is sampled for some i < δ. Such a sampling can be obtained by choosing nodes with SDep(v) ≡δ/2 0 such that there is another node v ′ for which v = SLinkδ/2 (v ′ ). For such a sampling Lemma 1 holds, where LCSA(v, v ′ ) is the lowest common sampled ancestor of v and v ′ : Lemma 1. Let v, v ′ be nodes such that SLinkr (LCA(v, v ′ )) = Root, and let d = min(δ, r + 1). Then SDep(LCA(v, v ′ )) = max0≤i<d {i + SDep(LCSA(SLinki (v), SLinki (v ′ )))}. By itself however this property leads to an entangled loop of operations, because LCA depends on SLink and SLink(v) = LCA(ψ(vl ), ψ(vr )) depends on LCA. Using CSAs and observing that LCA(v, v ′ ) = LCA(min{vl , vl′ }, max{vr , vr′ }) we can simplify this equation to SDep(LCA(v, v ′ )) = max0≤i<d {i + SDep(LCSA( ψ i (min{vl , vl′ }), ψ i (max{vr , vr′ })))}. Therefore the kernel operations can be computed as: SDep(v) = SDep(LCA(v, v)) = max0≤i<d {i+SDep(LCSA(ψ i (vl ), ψ i (vr )))} LCA(v, v ′ ) = LF(v[0..i − 1], LCSA(ψ i (min{vl , vl′ }), ψ i (max{vr , vr′ }))) from which SLink is obtained as well. The i in the last equation is the one that maximizes SDep(LCA(v, v ′ )). Operation Parent(v) is easily computed on top of LCA. These operations take time O((Ψ +t)δ), except that SDep takes O(Ψ δ). Note that we have to solve LCSA. This requires to solve LCAS , that is, LCA queries on the sampled tree S, and also to map nodes to sampled nodes using operation LSA (see later). The sampled tree also needs to solve ParentS and store SDepT . The rest is handled by the CSA. For the dynamic version, we first show how the suffix tree T changes upon insertion and deletion of texts T to the collection. Then we show how to maintain the sampling properties of S under those updates of the (virtual) T . This will require some more data to be stored in the sampled nodes. Finally, we will make use of a dynamic parentheses representation for the sampled tree, which will already give us LCAS and ParentS , as well as a way to associate data to nodes and insert/delete nodes. Note that we just have to show how to provide this basic tree functionality, as the remaining operations are obtained as in the static version. To support TDep, however, they add other O(n/δ) nodes to the sampling, such that for any node v the node Parentj (v) is sampled, for some 0 ≤ j < δ. We have not found a way to efficiently maintain this second sampling in a dynamic scenario. As a consequence, our dynamic FCST does not support operation TDep nor those that require it [9]: LAQt and LAQs. The basic navigation operations FChild and NSib also require TDep, but we will present a different idea that solves them together with Child and a generalization of it, using just the CSA. Mapping between the CSA and the sampled tree. For every node v of the sampled tree we need to obtain the corresponding interval [vl , vr ]. On the other hand, given a CSA interval [vl , vr ] representing node v of T , the lowest sampled ancestor LSA(v) gives the lowest sampled tree node containing v. With LSA we can compute LCSA(v, v ′ ) = LCAS (LSA(v), LSA(v ′ )). In this paper we introduce a new method to implement these mappings that is efficient and simpler than the one presented in the static version [9]. 4 Updating the Suffix Tree and Its Sampling In this section we explain how to modify a suffix tree to reflect changes caused by inserting and removing a text T to/from the suffix tree. The CSA of Mäkinen et al. [11], on which we build, inserts T in right-to-left order. It first determines the position of the new terminator5 and then uses LF to find the consecutive positions of longer and longer suffixes, until the whole T is inserted. This right-to-left method perfectly matches with Weiner’s algo- rithm [13] to build the suffix tree of T : it first inserts suffix T [i + 1..] and then suffix T [i..], finding the points in the tree where the node associated to the new suffix is to be created if it does not already exist. The node is found by using Parent until the WeinerLink operation returns a non-empty interval. This 5 This insertion point is arbitrary in that CSA, thus there is no order among the texts. Moreover, all the terminators are the same in the CSA, yet it can be easily modified to handle different terminators. requires one Parent and one WeinerLink amortized operation per symbol of T . This algorithm has the important invariant that the intermediate data structure is a suffix tree. Hence, by carrying it out in synchronization with the CSA insertion algorithm, we can use the current CSA to implement Parent and WeinerLink. To maintain the property that the intermediate structure is a suffix tree, deletion of a text T must proceed by first locating the node of T that corresponds to T , and then using SLinks to remove all the nodes corresponding to its suffixes in T . We must simultaneously remove the leaves in the CSA (Mäkinen et al.’s CSA deletes a text right-to-left as well, but it is easy to adapt to use Ψ instead of LF to do it left-to-right). We now explain how to update the sampled tree S whenever nodes are in- serted or deleted from the (virtual) suffix tree T . The sampled tree must main- tain, at all times, the property that for any node v there is an i < δ such that SLinki (v) is sampled. The following concept from Russo et al. [14] is fundamen- tal to explain how to obtain this result. Definition 1. The reverse tree T R of a suffix tree T is the minimal labeled tree that, for every node v of T , contains a node v R denoting the reverse string of the path-label of v. We note we are not maintaining nor sampling T R , we just use it as a conceptual device. Fig. 2 shows a reverse tree. Observe that since there is a node with path- label ab in T there is a node with path-label ba in T R . We can therefore define a mapping R that maps every node v to v R . Observe that for any node v of T , except for the Root, we have that SLink(v) = R−1 (Parent(R(v))). This mapping is partially shown in Figs. 1 and 2 by the numbers. Hence the reverse tree stores the information of the suffix links. By Height(v R ) we refer to the distance between v and its farthest descendant leaf. For a regular sampling we choose the nodes for which TDep(v R ) ≡δ/2 0 and Height(v R ) ≥ δ/2. This is equivalent to our sampling rules on T (Section 3): Since the reverse suffixes form a prefix-closed set, T R is a non-compact trie, i.e. each edge is labeled by a single letter. Thus, SDep(v) = TDep(v R ). The rule for Height(v R ) is obviously related to that on SLink(v) by R. See Fig. 2 for an example of this sampling. Likewise, stating that there is an i < δ for which SLinki (v) is sampled is the same as stating that there is an i < δ for which TDep(Parenti (v R )) ≡δ/2 0 and Height(Parenti (v R )) ≥ δ/2 . Since TDep(Parenti (v R )) = TDep(v R ) − i, the first condition holds for exactly two i’s in [0, δ[. Since Height is strictly increasing the second condition holds for sure for the largest i. Notice that since every sampled node has at least δ/2 descendants that are not sampled, this means that we sample at most ⌊4n/δ⌋ nodes from a suffix tree with ≤ 2n nodes. Notice that whenever a node is inserted or removed from a suffix tree it never changes the SDep of the other nodes in the tree, hence it does not change any TDep in T R . This means that whenever the suffix tree is modified the only nodes that can be inserted or deleted from the reverse tree are the leaves. In T this means that when a node is inserted it does not break a chain of suffix links; it is always added at the beginning of such a chain. Weiner’s algorithm works precisely by appending a new leaf to a node of T R . Assume that we are using Weiner’s algorithm and decide that the node X.v should be added and we know the representation of node v. All we need to do to update the structure of the sampled tree is to verify that if by adding (X.v)R as a child of v R in T R we increase the Height of some of ancestor, in T R , that will now become sampled. Hence we must scan upwards in T R to verify if this is the case. Notice that we already carry out this scanning as a side effect of computing SLink(v), which also gives us the required SDep information. Also, we do not need to maintain Height values. Instead, if the distance from (X.v)R to the closest sampled node (v ′ )R is exactly δ/2 and TDep((v ′ )R ) ≡δ/2 0, then we know that v ′ meets the sampling condition and we sample it. Deleting a node (i.e. a leaf in T R ) is slightly more complex and involves some reference counting. This time assume we are deleting node X.v, again we need to scan upwards, this time to decide whether to make a node non-sampled. However SDep(v) − SDep(v ′ ) < δ/2 is not enough, as it may be that Height(v ′R ) ≥ δ/2 because of some other descendant. Therefore every sampled node v ′ counts how many descendants it has at distance δ/2. A node becomes non-sampled only when this counter reaches zero. Insertions and deletions of nodes in T must update these counters, by increasing/decreasing them whenever inserting/deleting a leaf at distance exactly δ/2 from nodes. Hence to Insert or Delete a node requires O((Ψ + t)δ) time, plus the time to manipulate the structure that holds the topology of S: we need to carry out insertions/deletions of nodes, while maintaining information associated to them (SDep, reference counts). Section 5.2 shows that those operations do not dominate the time O((Ψ + t)δ) needed to maintain the sampling conditions. 5 Dynamic Fully-Compressed Suffix Trees In this section we present the compact data structures we use and create to handle our dynamic structures: the CSA, the sampled tree, and the mappings. 5.1 Dynamic Compressed Suffix Arrays To maintain a dynamic CSA we use the following result by González et al. [12], which is an improvement upon those of Mäkinen et al. [11]: Theorem 1. A dynamic CSA over a collection C of texts can be stored within nHk (C) + o(n log σ) bits, for any k ≤ α logσ (n) − 1 and any constant 0 < α < 1, supporting all the operations with times t = Ψ = O(((logσ log n)−1 + 1) log n), Φ = O((logσ log n) log2 n), and inserting/deleting texts T in time O(|T |(t + Ψ )). Note that for a collection with p texts it is necessary to store the positions of the texts in A. This requires O(p log n) bits but it is not an issue unless the texts are very short [11]. Therefore, the problem of maintaining a dynamic CSA is already solved, except that we promised to support operation ChildT (and some derivatives) directly on the CSA. Indeed, ChildT (v, X) can be easily computed in O(Φ log n) time by binary searching for the interval of v = [vl , vr ] formed by those v ′ where Letter(v ′ , SDep(v) + 1) = X. Similarly, FChild(v) can be determined by computing X = Letter(vl , SDep(v) + 1) and then ChildT (v, X). To compute NSib(v) the process is similar: If Parent(v) = [vl′ , vr′ ] and vr′ > vr , then we compute X = Letter(vr′ + 1, SDep(v) + 1) and do ChildT (v, X). All the time complexities are thus dominated by that of ChildT . Now we show how ChildT can be computed in a more general and efficient way. The generalized branching for nodes v1 and v2 consists in determining the node with path-label v1 .v2 if it exists. A simple solution is to binary search the interval of v1 for the sub-interval of the v ′ ’s such that ψ m (v ′ ) ∈ v2 , where m = SDep(v1 ). This approach requires O(Φ log n) time and it was first considered using CSA’s by Huynh et al. [15]. Thus we are able to generalize ChildT (v, X), which uses v2 as the sub-interval of A of the suffixes starting with X. This general solution can be improved by noticing that we are using SLinki at arbitrary positions of the CSA for the binary search. Recall that SLinki is solved via A and A−1 . Thus, we could sample A and A−1 regularly so as to store their values explicitly. That is, we explicitly store the values A[jδ] and A−1 [jδ] for all j. To solve a generalized branching, we start by building a table of ranges D[0] = v2 and D[i] = LF(v1 [m − i..m − 1], v2 ), for 1 ≤ i < δ. If m < δ the answer is D[m]. Otherwise, we binary search the interval of v1 , accessing only the sampled elements of A. To determine the branching we should compute ψ m (jδ) = A−1 [A[jδ] + m] for some jδ values in v1 . To use the cheaper sampled A−1 as well, we need that A[jδ] + m be divisible by δ, thus we instead ′ compute ψ m for m′ = ⌊(A[jδ] + m)/δ⌋δ − A[jδ]. Hence instead of verifying that ′ ψ m (jδ) ∈ v2 , we verify that ψ m ∈ D[m − m′ ]. After this process we still have to binary search an interval of size O(δ), which is carried out naively. The overall process requires time O(Φ + (Ψ + t)δ) to access the last letters of v1 and build D, plus O((log n) log(n/δ)) for binary searching the samples; plus O(Φ log δ) for the final binary searches. We have assumed O(log n) time to access the sampled A and A−1 values in a dynamic scenario, whereas in a static scenario6 it would be O(1). In fact in a dynamic scenario we do not store exactly the A[jδ] values; instead we guarantee that for any k there is a k ′ such that k − δ < k ′ ≤ k and A[k ′ ] is sampled, and the same for A−1 . Still the sampled elements of A and the m′ to use can be easily obtained in O(log n) time. Those sampled sequences are not hard to maintain. For example, Mäkinen et al. [11, Sec. 7.1 of journal version] describes how to maintain A−1 (called SC in there), and essentially how to maintain A (called SA in there; the only missing point is to maintain approximately spaced samples in A, which can be done exactly as for A−1 ). 6 This speedup immediatly improves the results of Huynh et al. [15]. 5.2 Dynamic Sampled Trees The sampled tree contains only O(n/δ) nodes. As such it could be stored with pointers using only O((n/δ) log n) bits. Instead we use a dynamic parentheses data structure given by Chan et al. [10], which already supports LCA. Theorem 2. A list of O(n/δ) balanced parentheses can be maintained in O(n/δ) bits supporting the following operations in O(log n) time: – FindMatch(u), finds the matching parenthesis of u; – Enclose(u), finds the nearest pair of matching parentheses that encloses u; – DoubleEnclose(u, u′ ), finds the nearest pair of parentheses that encloses both u and u′ ; – Insert(u, u′ ), Delete(u, u′ ), inserts or deletes the matching parentheses located at u, u′ . The Enclose primitive computes ParentS in the sampled tree. Likewise the DoubleEnclose primitive computes the LCAS operation. In Section 5.3 we explain how to update the parentheses sequence when a node becomes sampled or non-sampled (i.e. , how to maintain the mapping with the CSA). Operations Rank and Select on the sequence of parentheses S can also be used to store in- formation on the nodes, by mapping between the parentheses sequence and their preorder values and vice versa: Rank′ (′ (S, i) gives the preorder number of the node identified by the opening parenthesis at S[i], while Select′ (′ (S, j) identi- fies the j-th node (in preorder) in S. Rank and Select over the parentheses bitmap can be handled using the following theorem. Theorem 3 ([11]). A bitmap of n bits supporting Rank, Select, √ Insert and Delete in O(log n) time can be maintained in nH0 + O(n/ log n) bits. Each node of S must also store its SDep. This is not complicated because the SDep of the nodes of T does not change, at least using Weiner’s algorithm. Thus we maintain a balanced tree where the SDep values can be read, inserted, and deleted, at the positions given by Rank′ (′ (S, i). When a node becomes sampled/non-sampled we insert/delete in this sequence. A similar mechanism is used to store the reference counts used for the sampling; in this case the stored values can be modified as well. Thus O(log n) time suffices for simulating the tree operations on S. 5.3 Mapping from CSA to the Sampled Tree and Back The lowest sampled ancestor LSA is the way to map from the CSA to S. LSA is computed by using an operation Reduce(v), that receives the numeric rep- resentation of leaf v and returns the position, in the parentheses representation of the sampled tree, where that leaf should be. Consider for example the leaf numbered 5 in Fig. 3. This leaf is not sampled, but in the original tree it appears somewhere between leaf 4 and the end of the tree, more specifically between parenthesis ’)’ of 4 and parenthesis ’)’ of the Root. We assume Reduce returns the first parenthesis, i.e. Reduce(5) = 4. In this case since the parenthesis we obtain is a ’)’ we know that LSA should be the parent of that node. Hence we compute LSA as follows: Reduce(v) , if S[Reduce(v)] = ′ (′ LSA(v) = Parent(Reduce(v)) , otherwise We present a new way to compute Reduce in O(log n) time and o(n) bits (cf. [9]). We use a bitmap B initiated with n bits all equal to 0. Now for every node v = [vl , vr ] we insert a 1 at Select0 (B, vl ) and after Select0 (B, vr ), which yields a bitmap with n + O(n/δ) bits. In our example it is 1000101101001, see Fig. 3. Hence we have the following relation Reduce(v) = Rank1 (B, Select0 ( B, v + 1)) − 1. We do not store B uncompressed, but rather using Theorem 3, which requires only O((n/δ) log n) bits as there are few 1’s in B. When a node [vl , vr ] becomes sampled we insert matching parentheses at S[Reduce(vl )] and after S[Reduce(vr )]. Also, it is necessary to insert the new 1’s in B as before. Fig. 3 illustrates the effect of sampling b = [3, 6]. Updating S when a sampled node v becomes non-sampled is easy, as we can obtain the parentheses u, u′ to delete. We must also delete the corresponding 1’s in B; note that the relative position of a 1 in a run of 1’s is irrelevant. Therefore Reduce can be computed in O(log n) time. According to our previous explanation, so can LSA and LCSA, for leaves. To map in the other direction, each node in the sampled tree must know its corresponding interval [vl , vr ]. This is also easy to obtain from B. Let u be the position in S of the opening parenthesis that identifies sampled node v. The corresponding closing parenthesis is u′ = FindMatch(u). Now vl = Rank0 (B, Select1 (B, u + 1)) and vr = Rank0 (B, Select1 (B, u′ + 1)) − 1. 6 Putting All Together The following theorem summarizes our result. Theorem 4. It is possible to represent the suffix tree of a dynamic text collection within the space and time bounds given in Table 1. The space and the variables Ψ , Φ, t, can be instantiated to the values of Theorem 1 for δ = ω(logσ n), or to another dynamic CSA supporting ψ, A, A−1 , LF, and T [A[v]], in times O(Ψ ), O(Φ), O(Φ), O(t), and O(log n), respectively, provided texts are inserted in right- to-left order and deleted in left-to-right order within the given time bounds. We note that Theorem 4 assumes that ⌈log n⌉ is fixed, and so is δ. This assumption is not uncommon in dynamic data structures, even if it affects as- sertions like that of pointers taking O(log n) bits. The CSA used in Theorem 1 can handle varying ⌈log n⌉ within the same worst-case space and complexities, and the same happens with Theorem 3, which is used for the mapping bitmap B. The only remaining part is the sampled tree. We discuss now how to cope with it while retaining the same space and worst-case time complexities. We use δ = ⌈log n⌉·⌈logσ ⌈log n⌉⌉, which will change whenever ⌈log n⌉ changes (sometimes will change by more than 1). Let us write δ = ∆(ℓ) = ℓ⌈logσ ℓ⌉. We maintain ℓ = ⌈log n⌉. As S is small enough, we can afford to maintain three copies of it: S sampled with δ, S − with δ − = ∆(ℓ − 1), and S + sampled with δ + = ∆(ℓ + 1). When ⌈log n⌉ increases (i.e. n doubles), S − is discarded, the current S becomes S − , the current S + becomes S, we build a new S + sampled with ∆(ℓ + 2), and ℓ is increased. A symmetric operation is done when ⌈log n⌉ decreases (i.e. n halves due to deletions), so let us focus on increases from now on. Note this can occur in the middle of the insertion of a text, which must be suspended, and then resumed over the new set of sampled trees. The construction of the new S + can be done by retraversing all the suffix tree T deciding which nodes to sample according to the new δ + . An initially empty parentheses sequence and a bitmap B initialized with zeros would give the correct insertion points from the chosen intervals, as both structures are populated. To ensure that we consider each node of T once, we process the leaves in order (i.e. v = [0, 0] to v = [n − 1, n − 1]), and for each leaf v we also consider all its ancestors [vl , vr ] (using ParentT ) as long as vr = v. For each node [vl , vr ] we consider, we apply SLink at most δ + times until either we find the first node v ′ = SLinki ([vl , vr ]) which either is sampled in S + , or SDep(v ′ ) ≡δ+ /2 0 and i ≥ δ + /2. If v ′ was not sampled we insert it into S + , and in both cases we increase its reference count if i = δ + /2 (recall Section 4). All the δ + suffix links in T are computed in O(δ + (Ψ + t)) time, as they form a single chain. Therefore the solution maintains the current complexities, yet only in an amortized sense. Deamortization can be achieved by the classical method of interleaving the normal operations of the data structure with the construction of the new S + . By performing a constant number of operations on the new S + for each inser- tion/deletion operation over C, we can ensure that the new S + will be ready in time. The challenge is to maintain the consistency of the traversal of T while texts are inserted/deleted. As we insert a text, the operations that update T consist of insertion of leaves, and possibly creation of a new parent for them. Assume we are currently at node [vl , vr ] in our traversal of T to update S + . If a new node [vl′ , vr′ ] we are inserted is behind the current node in our traversal order (that is, vr′ < vr , or vr′ = vr and vl′ > vl ), then we consider [vl′ , vr′ ] immediately; otherwise we leave this for the moment when we will reach [vl′ , vr′ ] in our traversal. Recall from Section 4 that those new insertions do not affect the existing SDeps nor suffix link paths, and hence can be considered independently of the current traversal process. Similarly, deleted nodes that fall behind the current node are processed immediately, and the others left for the traversal to handle it. If ℓ decreases while we are still building S + , we can discard it even before having completed its construction. Note that in general discarding a tree when ℓ changes involves freeing several data structures. This can also be done progres- sively, interleaved with the other operations. 7 Conclusions We presented the first dynamic fully-compressed representation of suffix trees (FCSTs). Static FCSTs broke the Θ(n) bits barrier of previous representations at a reasonable (and in some cases no) time complexity penalty, while retaining a surprisingly powerful set of operations. Dynamic FCSTs permit not only man- aging dynamic collections, but also building static FCSTs within optimal space. Hence the way is open to practical implementations of this structure, which can run in main memory for very large texts. We also gave some relevant results for the static case, as we improved or simplified the operations Reduce and Child. A challenge for future work is to obtain operations TDep, LAQt, and LAQs, which we were not able to maintain in a dynamic scenario. Acknowledgments. We are grateful to Veli Mäkinen and Johannes Fisher for pointing out the generalized branching problem to us. References 1. Apostolico, A. Combinatorial Algorithms on Words. NATO ISI Series. In: The myriad virtues of subword trees. Springer-Verlag (1985) 85–96 2. Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University Press (1997) 3. Giegerich, R., Kurtz, S., Stoye, J.: Efficient implementation of lazy suffix trees. Softw., Pract. Exper. 33(11) (2003) 1035–1049 4. Manber, U., Myers, E.W.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22(5) (1993) 935–948 5. Sadakane, K.: Compressed Suffix Trees with Full Functionality. Theo. Comp. Sys. (2007) 6. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comp. Surv. 39(1) (2007) article 2 7. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algor. 3(2) (2007) article 20 8. Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3) (2001) 407–430 9. Russo, L., Navarro, G., Oliveira, A.: Fully-compressed suffix trees. In: LATIN. LNCS (2008) 10. Chan, H.L., Hon, W.K., Lam, T.W., Sadakane, K.: Compressed indexes for dy- namic text collections. ACM Trans. Algorithms 3(2) (2007) 11. Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. In: CPM. LNCS 4009 (2006) 307–318 To appear in ACM TALG. 12. González, R., Navarro, G.: Improved dynamic rank-select entropy-bound struc- tures. In: Proc. LATIN. LNCS (2008) 13. Weiner, P.: Linear pattern matching algorithms. In: IEEE Symp. on Switching and Automata Theory. (1973) 1–11 14. Russo, L., Oliveira, A.: A compressed self-index using a Ziv-Lempel dictionary. In: Proc. SPIRE. LNCS 4209 (2006) 163–180 15. Huynh, T.N.D., Hon, W.K., Lam, T.W., Sung, W.K.: Approximate string matching using compressed suffix arrays. Theor. Comput. Sci. 352(1-3) (2006) 240–249

References (15)

  1. Apostolico, A. Combinatorial Algorithms on Words. NATO ISI Series. In: The myriad virtues of subword trees. Springer-Verlag (1985) 85-96
  2. Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University Press (1997)
  3. Giegerich, R., Kurtz, S., Stoye, J.: Efficient implementation of lazy suffix trees. Softw., Pract. Exper. 33(11) (2003) 1035-1049
  4. Manber, U., Myers, E.W.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22(5) (1993) 935-948
  5. Sadakane, K.: Compressed Suffix Trees with Full Functionality. Theo. Comp. Sys. (2007)
  6. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comp. Surv. 39(1) (2007) article 2
  7. Ferragina, P., Manzini, G., Mäkinen, V., Navarro, G.: Compressed representations of sequences and full-text indexes. ACM Trans. Algor. 3(2) (2007) article 20
  8. Manzini, G.: An analysis of the Burrows-Wheeler transform. J. ACM 48(3) (2001) 407-430
  9. Russo, L., Navarro, G., Oliveira, A.: Fully-compressed suffix trees. In: LATIN. LNCS (2008)
  10. Chan, H.L., Hon, W.K., Lam, T.W., Sadakane, K.: Compressed indexes for dy- namic text collections. ACM Trans. Algorithms 3(2) (2007)
  11. Mäkinen, V., Navarro, G.: Dynamic entropy-compressed sequences and full-text indexes. In: CPM. LNCS 4009 (2006) 307-318 To appear in ACM TALG.
  12. González, R., Navarro, G.: Improved dynamic rank-select entropy-bound struc- tures. In: Proc. LATIN. LNCS (2008)
  13. Weiner, P.: Linear pattern matching algorithms. In: IEEE Symp. on Switching and Automata Theory. (1973) 1-11
  14. Russo, L., Oliveira, A.: A compressed self-index using a Ziv-Lempel dictionary. In: Proc. SPIRE. LNCS 4209 (2006) 163-180
  15. Huynh, T.N.D., Hon, W.K., Lam, T.W., Sung, W.K.: Approximate string matching using compressed suffix arrays. Theor. Comput. Sci. 352(1-3) (2006) 240-249
About the author
Papers
28
Followers
2
View all papers from Luis Russoarrow_forward