throbber
Voi. 18 no. 6 2002
`
`Pages 873—879
`
` a
`
`831': an algorithm for finding near-exact
`
`sequence matches in time proportional to the
`
`logarithm of the database size
`
`Eider Giiadi 1”“. Michael G. Walker 7. James Z. Wangz and Wayne
`Vorkmurh t
`
`'incytc Pharmaceuticals, 3174 Porter Drive. Paio Alto. CA 94304. USA and
`2Department of Computer Science. Pennsylvania State University University Park.
`PA 16802. USA
`
`Received on January 1 1. 2001; revised on January ?. 2002; accepted on January 29. 2002
`
`the tree index). SST 27 times faster than BLAST.
`Availability: Request from the authors.
`Contact: egiladl®incyte.com; mwalker@incyte.com
`
`1
`
`INTRODUCTION
`
`In the current efforts to generate and interpret the complete
`genome sequences of humans and model organisms,
`large scale searches for near-exact matches are frequently
`performed. Examples include programs that assemble
`DNA from shotgun sequencing projects which initially
`search for overlapping fragments. large-scale searches of
`EST databases against genomic databases to determine the
`location of genes. and cross species genomic comparisons
`between very closely related genomes. Faster algorithms
`are needed because the time and cost of performing these
`large-scale sequence-similarity searches using even the
`fastest of the extant algorithms is prohibitive.
`
`1.1 Previous related research
`
`We new review previous results related to the Sequence
`Search Tree (SST) algorithm for sequence alignment, tree-
`struetured indexes. and ic-tuple encoding and filtration. In
`this discussion we shall refer to the length of a query
`sequence by the letter ‘m'. The size of the database refers
`to the sum of the lengths of all
`the sequences in the
`database. and is represented by the letter ‘n‘.
`
`ABSTRACT
`
`Motivation: Searches tor near exact sequence matches
`are perlormed frequently in
`large-scale sequencing
`projects and in comparative genomics. The time and
`cost oi performing these large—scale sequence-similarity
`Searches is prohibitive using even the fastest oi the extant
`algorithms. Faster algorithms are desired.
`Results: We have developed an algorithm. called SST
`(Sequence Search Tree).
`that searches a database of
`DNA sequences for near-exact matches.
`in time propor-
`tional to the logarithm oi the database size n. In SST. we
`partition each sequence into fragments of fixed length
`called ‘windows‘ using multiple offsets. Each window is
`mapped into a vector of dimension 4" which contains the
`frequency oi occurrence of its component k-tuples. with k
`a parameter typically in the range 4—6. Then we create a
`tree-struetured index of the windows in vector space. with
`tree-structured vector quantization (TSVQ). We identify
`the nearest neighbors oi a query sequence by partitioning
`the query into windows and searching the tree-structured
`index for nearest—neighbor windows in the database. When
`the tree ls balanced this yields an 0(Iogr1) complexity for
`the search. This complexity was observed in our compu-
`tations. SST is most effective for applications in which the
`target sequences show a high degree of similarity to the
`query sequence. such as assembling shotgun sequences
`or matching ESTs to genomic sequence. The algorithm is
`also an effective filtration method. Specifically.
`it can be
`used as a preprocessing step for other search methods
`to reduce the complexity of searching one large database
`against another. For the problem oi identifying overlapping
`fragments in the assembly of 120000 iragments from
`a 1.5 megabase genomic sequence. SST is 15 times
`faster than BLAST when we consider both building and
`searching the tree. For searching alone (i.e. after building
`
`used
`widely
`alignment. Extent
`Sequence
`LI.)
`sequence-similarity~finding programs include Needleman—
`Wunsch
`(Needleman
`and Wunsch,
`1970),
`Smith—
`Waterman (Smith and Waterman. 19%|), FASTA (Pearson
`and Lipman, 1988; Pearson, 1996) and BLAST (Altschul
`er al.. 1990. 199?).
`The Needleman—Wunsch and Smith—Waterman algo—
`rithms perform. global and local sequence alignment using
`a dynamic programming algorithm. Their computational
`complexity is 0(m * rt).
`
`
`‘To whom correspondence should be addressed.
`
`@ Oxlnrd Unlversity Press 2002
`
`B73
`
`SEQUENOM EXHIBIT 1092
`SEQUENOM EXHIBIT 1092
`Sequenom v. Stanford
`Sequenom V. Stanford
`IPR2013-00390
`
`IPR2013—00390
`
`

`

`E.Giladi et at.
`
`The FASTA algorithm identifies regions of local se(cid:173)
`quence similarity by first identifying candidate similar
`sequences based on shared k-tuples and performs local
`alignment with
`the Smith-Waterman algorithm.
`Its
`computational complexity is O(m * n), with a constant
`smaller than the Needleman-Wunsch or Smith-Waterman
`algorithms.
`identifies regions of local
`The BLAST algorithm
`sequence similarity by first identifying candidate similar
`sequences that have k-tuples in common with the query
`sequence, and then extending the regions of similarity. Its
`computational complexity is O(n).
`Myers implemented a sub-linear algorithm that finds
`long matches with less than a specified fraction of errors
`(Myers, 1994). Chang and Lawler also implemented a sub(cid:173)
`linear expected-time search algorithm (Chang and Lawler,
`1990). These algorithms index the occurrences of k-tuples
`in the database, giving sub-linear complexity. Because
`each tuple of the query may occur in many sequences
`in the database these algorithms still require examination
`of a substantial portion of the sequence database, and
`the search time still grows linearly with the size of the
`database. The developers of BLAST evaluated an indexing
`scheme similar to that described by Myers, but chose not
`to include it in the final versions of the program (Altschul
`et al., 1990, 1997).
`
`clustering. Several
`1.1.2 Sequence
`re(cid:173)
`previous
`searchers have created clusters of similar sequences
`(including tree-structured indexes) or have performed
`other complete pairwise comparisons of sequence
`databases (Gannett et al., 1992; Harris et al., 1992;
`Watanabe and Otsuka, 1995; Barker et al., 1996; Tatusov
`et al., 1997; Yona, 1999). With the exceptions discussed
`below, these earlier algorithms have used conventional
`sequence alignment methods such as BLAST or FASTA
`(rather than the k-tuple method used in SST) to determine
`pairwise distances between sequences, and thus are
`limited by the speed of those alignment methods. In
`most cases, these previous researchers were concerned
`with creating a classification of proteins, rather than with
`enabling fast searches.
`To perform an exhaustive pairwise search of all se(cid:173)
`quences in the protein sequence database, Gannett et
`al. created a tree-structured index of all the protein
`sequences and all their subsequences (a Patricia tree), and
`used Needleman-Wunsch as their measure of similarity
`between sequences (Gannett et al., 1992). The size of
`the resulting tree, and the use of Needleman-Wunsch
`as the similarity measure, resulted in a computationally
`intensive algorithm.
`Yona and colleagues created hierarchical clusters of
`the known protein sequences by k-means clustering and
`metric space embedding (Yona, 1999). Their intent was to
`
`874
`
`build a global view of protein sequence space, rather than
`to enable fast searches for similar sequences. In particular,
`the metric space embedding scheme in their algorithm is
`computationally intensive, and would be impractical for
`fast sequence similarity searches.
`
`1.1.3 Tree-structured indexes. Tree-structured indexes
`have been used previously to provide fast access to
`databases, and to allow compression of data for rapid
`transmission. Tree-structured vector quantization (TSVQ)
`uses a tree-structured index to encode vectors that can be
`transmitted more quickly than the original data (Gersho
`and Gray, 1992), albeit with some loss of information
`(lossy compression). TSVQ can be viewed as a nearest(cid:173)
`neighbor search algorithm, and was the inspiration for
`the tree-structured index used in SST. Agrawal et al.
`describe a method to reduce a high-dimensional set of
`vectors into a lower-dimensional set that retains most
`of the information, which can then be used to create a
`tree-structured index into the database (Agrawal et al.,
`1997). Vector representations and tree-structured indexes
`have been used widely for database access and data
`transmission, but to our knowledge these methods have
`not been used for DNA and protein sequence databases,
`except for the work of Gannett and of Yona described
`above.
`
`1.1.4 K -tuple encodings and filtration. K -tuple en(cid:173)
`codings have been used frequently in sequence analysis
`and in conjunction with filtration methods. We have
`already noted their use in FASTA and in BLAST, and
`describe here their use in FLASH, RAPID, and QUASAR.
`The FLASH algorithm (Califano and Rigoutsos, 1995)
`uses a form of k-tuple encoding that provides multiple
`indexes into the database per position in the sequence.
`The tuples used in FLASH are very descriptive and have
`a low probability of matching randomly in the database.
`This yields a high efficiency in screening for candidate
`database matches.
`The RAPID algorithm (Miller et al., 1998) finds k(cid:173)
`tuple matches in each sequence compared to the query
`sequence (and hence is linear in complexity). Instead of
`performing a (computationally expensive) alignment as is
`done in BLAST, RAPID uses a table of word frequencies
`computed from the database of sequences and calculates
`the probability that a set of matching words occur by
`chance. This approach yields a smaller coefficient of
`complexity than occurs in BLAST, but also remains linear.
`The QUASAR algorithm (Burkhardt et al., 1999) parti(cid:173)
`tions the database into blocks of equal length, analogous to
`the 'windows' used in SST. It then searches each database
`block for all occurrences of each k-tuple from the query
`(q-gram, in their nomenclature), and increments a counter
`for each block whenever it contains a k-tuple from the
`query. A lower-bound on the minimum number of k-tuples
`
`

`

`W=6
`
`jAACCGG!fTAcGTk cGT I
`~AFCGGfrlACGTAC~T J
`jAAccpGnAgGTACGT I
`
`2~
`
`Fig. I. A database sequence is partitioned into overlapping windows
`of length W = 6. The overlap parameter I"> = 2.
`
`in a block provides a filter to identify blocks that are sim(cid:173)
`ilar to the query sequence. The number of blocks which
`need to be accessed for every q-gram can be linear in the
`database size.
`The use of k-tuple frequency for approximate matching
`of texts, and in filtration schemes has also been used and
`analysed (Pevzner and Waterman, 1995; Wu and Manber,
`1992; Ukkonen, 1992; Jokinen and Ukkonen, 1991). SST
`differs from the previous algorithms in that it combines an
`efficient embedding of sequence fragments (windows) into
`metric space, using k-tuple frequency encoding, with a
`tree-structured index for that space. SST's tree-structured
`index eliminates large portions of the database from
`consideration during searches, and requires us to search
`only a small fraction of the database to find sequences with
`high similarity to the query sequence. The construction of
`the index has an average computational complexity which
`is 0 (n log n) while the search of the index has an average
`complexity O(m log n).
`
`2 THE SST ALGORITHM
`We now describe the details of the SST algorithm.
`Sections 2.1-2.3 describe the steps in constructing the
`tree index. These steps need to be performed only once
`for each database. Section 2.4 describes how the database
`index is searched.
`
`2.1 Database partitioning with sliding windows
`In the first step each sequence in the database is partitioned
`into overlapping windows, which consist of subsequences
`of nucleotides of fixed length W. The measure of overlap
`between windows is determined by a parameter 6. which
`is typically in the range 5 :::: 6.
`:::: W j2. The windows
`consist of the nucleotides beginning at position j * 6., and
`ending at position )*6.+ W, with j = 0, 1, 2 ... , WI 6.-1
`(Figure 1). Typical values of Ware in the range 25-1000.
`Each query sequence is partitioned into non-overlapping
`windows or into windows which overlap by half their
`length. The search for a query sequence consists of finding
`database windows that are similar to the query windows,
`as will be described in the next subsections.
`
`SST: an algorithm for finding near-exact sequence matches
`
`2.2 Mapping windows into vector space
`In order to build the tree index used by SST, it is necessary
`to map each window into a finite dimensional vector
`space. Specifically, for each sequence, we create a vector
`consisting of counts of the number of occurrences of each
`k-tuple. Thus, fork of 2, we create a vector of length 16
`that consists of counts of the number of occurrences of the
`16 k-tuples AA, AC, AG, AT, CA, ... , TG, TT. We begin
`by choosing a tuple size k, and by associating an integer
`I with each of the 4k k-tuples of nucleotides a 1a2 • • · ak.
`a; E {A,C,G,T}. This is done in a standard fashion using
`the formula
`
`0
`M(at) = 1
`2
`3
`
`ifa1 =A,
`if at= C,
`if at= G,
`if at= T.
`
`Tuples containing the undetermined symbol 'N' are ig(cid:173)
`nored. We now map each database window to the vector
`in R 4k who's Ith entry is the number of occurrences of tu(cid:173)
`ple I in that window. For example, with k = 2 the window
`{AACCGG} in Figure 1 is mapped to the vector
`
`(1100011000100000)T.
`
`The tuple size k ranges between 2 and 10 but is typically
`4 or 5, chosen for reasons described in Section 4.
`The L 1 distance (Manhattan distance) between the
`image of two windows in vector space is the number of
`k-tuples which occur in one of the windows and not in the
`other. Since each window has W - k + 1 component k(cid:173)
`tuples, their distance can be used to determine the number
`of k-tuples shared by the two windows. This in tum
`provides a lower bound for the edit distance between the
`windows (Jokinen and Ukkonen, 1991; Ukkonen, 1992).
`The correspondence between the L 1 distance in vector
`space and the edit distance in sequence space is utilized
`in SST to find near exact matches to query windows, by
`seeking nearest neighbors in metric space.
`We note that the mapping procedure described in this
`section is easily modified to deal with other alphabets such
`as the 20 letters used for protein sequences or a reduced
`alphabet.
`
`2.3 Tree-structured index for database windows
`The mapping described in the previous section trans(cid:173)
`formed the problem of finding near exact matches to
`query windows in the database to that of finding nearest
`neighbors in vector space. This search can be done very
`
`875
`
`

`

`E. Giladi et at.
`
`efficiently by constructing a tree-structured index in
`vector space.
`To create the tree-structured index, we recursively
`search the data for clusters that provide binary (or higher(cid:173)
`order) partitions. A variety of methods are available for
`finding such clusters and building the tree. One suitable
`method is TSVQ (Gersho and Gray, 1992) using k-means
`clustering which we have used and describe here.
`
`(I) Select two centroids x A, x 8 and their corresponding
`partitions of the data into disjoint sets A and Busing
`the following iterative procedure:
`
`• Choose two initial values for XA and X8.
`
`• For each vector y in the database compute the
`L 1 distances from the vector to each of the
`centroids:
`
`Assign y to the set A if dA(Y) < d8(y), and to
`the set B otherwise.
`
`• Compute the new centroids
`
`X8
`
`• Compute values for the terminating criteria
`
`where I A I is the size of set A.
`• Repeat until the change in D A and D 8 is less
`than a small threshold, or no vectors change
`partition.
`
`(2) Recursively partition the sets A and B generated
`above using the same algorithm.
`
`(3) The recursion terminates when the number of vec(cid:173)
`tors in a set is smaller then a specified tolerance or
`when the algorithm fails to fragment a cluster into
`two substantial new clusters.
`
`The leaves of the tree partition the k-dimensional space,
`and each leaf contains the set of vectors that are nearest
`neighbors to the centroid for that node. We note that when
`the tree is balanced the depth of the tree is proportional to
`the logarithm of the number of windows and the number
`of windows is proportional to the size of the database.
`Hence the average complexity of the tree construction is
`O(n Iogn).
`
`876
`
`2.4 The search procedure.
`For each query, the SST algorithm finds similar database
`sequences by dividing the query sequence into windows
`and searching for database windows that are similar to
`at least one of the query windows. This search is done
`efficiently in vector space using the tree-structured index
`as follows.
`Begin at the root of the tree. At each non-terminal
`node there are two branches (for the case of a binary
`tree), which are represented by their respective centroid.
`Select the branch whose centroid is the lesser distance
`from the query vector. Proceed recursively until reaching a
`terminal node. The vectors in the terminal node represent
`the database windows which are the nearest neighbors to
`the query window.
`The SST algorithm finds most of the nearest neighbors
`but is not guaranteed to find all the nearest neighbors of a
`query sequence. In particular, sequences that lie near the
`boundary of a partition may be closer to another sequence
`immediately across the boundary line than they are to any
`other sequence within the partition; we indicate the false
`negative rate for one experiment below.
`Additional processing such as alignment of the query to
`the database sequence with one of the standard alignment
`tools is possible.
`
`3 COMPUTATIONAL IMPLEMENTATION
`We now describe computer implementation strategies for
`the SST algorithm which significantly improve the speed
`and the scalability of the algorithm. These strategies in(cid:173)
`clude computation on sub-trees, sampling in the construc(cid:173)
`tion of the k-means tree, the use of sparse vector represen(cid:173)
`tation, fixed precision arithmetic and a compressed format
`for the database windows.
`As the number of sequences increases, it becomes
`impossible to store the whole tree index in RAM. To
`minimize disk access during the computation we perform
`the tree construction and search on subtrees. For example,
`when searching one database against another, we first
`search all query windows against the top part of the
`tree, say the first nine levels. This generates 29 = 512
`groups of windows. Now, each group is searched against
`its respective subtree. Disk access occurs only once for
`each subtree because it is small enough to fit into RAM.
`The subtree approach can be generalized to an arbitrary
`number of levels. Moreover, it provides a foundation for
`parallelizing the algorithm because adjacent subtrees can
`be processed independently on different processors.
`The speed of the tree construction can be substantially
`enhanced by computing the centroids based on a sample
`of the windows in each cluster, rather than using all of
`them. We find that a sample of 1000 windows to estimate
`two centroids provides a substantial improvement in the
`
`

`

`speed at a relatively low cost in the error rate; where higher
`accuracy is required, larger samples are useful.
`The computation of the centroids requires the use of
`floating point arithmetic. However, floating point opera(cid:173)
`tions are more expensive than integer arithmetic. Hence, in
`our implementation we use a 16 bit fixed precision arith(cid:173)
`metic.
`All frequency vectors are very sparse. They have at most
`W - k + I non-zero entries while their dimension is 4k.
`Typical values are W = 50, k = 5, 45 = 1024 »
`W - k + I = 46. The use of sparse vector representation
`allows us to store only the non-zero entries, by storing
`pairs (index, value), instead of storing the whole vector.
`This representation saves a substantial amount of space in
`both RAM and disk. In addition, substantial computations
`are saved by using sparse vector arithmetic, for example
`in the computation of the distance of each vector from a
`centroid.
`Database windows overlap and therefore have large
`portions in common. We use a compressed format for
`the windows which takes advantage of this overlap. The
`storage required for all windows of a database in this
`format is independent of both the window size W and the
`offset parameter 1'3.. The storage required by this format is
`about 2 bytes per nucleotide.
`When using SST we build one tree for the sequences in
`the database and one tree for their complement. We then
`search both trees. These computations are independent and
`can be performed concurrently.
`
`4 COMPUTATIONAL RESULTS
`We illustrate the performance of SST by applying it
`to detecting overlapping fragments in shotgun sequence
`assembly. We fragment a 1.5 megabase sequence of
`genomic DNA several times using a Poisson process with
`A. = 300 nucleotides. From the pool of fragments we
`generate three sets of fragments of: 30 675, 61 350 and
`122 700 sequences thus simulating a 6-fold, 12-fold and
`24-fold coverage of the genomic stretch. We apply SST
`to search each database of fragments against itself for
`overlapping sequences, with overlap size ~ 50.
`In the first computation we randomly introduce errors
`into each fragment at a rate of 5% with a maximum of
`2.5% insertions/deletions. In this computation the window
`size W and the tolerance T (the distance threshold for
`deciding if a query sequence matches a database sequence)
`are varied in the range 30 ::::: W ::::: 50 and 0 ::::: T ::::: 30.
`The number of fragments is 61 350 and the tuple size used
`in the encoding is k = 5. Query windows do not overlap.
`Figure 2 indicates the true positive rate (TPR) and the
`log 10 of the false positive rate (FPR) as a function of
`W and T. The TPR increases as W decreases and as T
`increases, and so does the FPR. With optimal parameters
`W = 30, T = 15, the TPR and FPR are .95 and
`
`SST: an algorithm for finding near-exact sequence matches
`
`Window alze W
`
`30
`
`0
`
`Thraahold T
`
`-2
`
`30
`
`30
`
`Window alze W
`
`30
`
`0
`
`Thrashold T
`
`Fig, 2. Top: the TPR as a function of the window size W and the
`the distance threshold T. Bottom: the log base 10 of the FPR as a
`function of the window size W and the distance threshold T on the
`right. Data has an error rate of 5%.
`
`0.0007, respectively. In the remainder of this section all
`computations use a window size of 30 and a threshold
`distance of T = 15.
`We now study the effect of varying the tuple size k
`on the accuracy and the speed of SST. The number of
`fragments is 61 350 and the error rate is 5% with 2.5%
`insertions/deletions. The tuple size k varies in the range
`3 ::::: k ~ 6. Tables 1 and 2 indicate the results of this
`test when there is no overlap between query windows and
`when the overlap is half the window size, respectively.
`The computation time increases with the tuple size k. The
`accuracy of the results also improve as k increases until
`K = 5 and then decreases. The results are more accurate
`when the query windows overlap but the computation is
`much longer in this case. In the remainder of this section
`all computations are done with a tuple size of k = 5.
`
`877
`
`

`

`E.Gi/adi eta/.
`
`Table I. The total time (to build the tree and search), the time to search
`the tree, the true positive rate (TPR) and the false positive rate (FPR) as a
`function of the tuple size K used. There is no overlap between the query
`windows
`
`Table 3. The true positive rate (TPR) and the false positive rate (FPR) as
`a function of the error rate P. Top, query windows do not overlap; bottom,
`query windows overlap. The rate of insertions/deletions is 2,5%
`
`K
`
`3
`4
`5
`6
`
`Total time
`
`Search time
`
`TPR
`
`FPR
`
`01:00:36
`01:28:24
`01:28:21
`02:34:55
`
`00:16:15
`00:21:29
`00:33:50
`01:31:23
`
`0.823
`0.904
`0.941
`0.923
`
`l.l5le-03
`7.746e-04
`6.849e-04
`5.869e-04
`
`Table 2. The total time (to build the tree and search it), the time to search
`the tree, the true positive rate (TPR) and the false positive rate (FPR) as
`a function of the tuple size K used. Query windows overlap by half their
`length
`
`p
`
`5
`10
`15
`20
`
`p
`
`5
`10
`15
`20
`
`TPR
`
`0.941
`0.718
`0.436
`0.202
`
`TPR
`
`0.985
`0.857
`0.610
`0.326
`
`FPR
`
`6.849e-04
`3.277e-04
`1.565e-04
`6.547e-05
`
`FPR
`
`l.llOe-03
`5.540e-04
`2.746e-04
`l.l97e-04
`
`K
`
`4
`5
`6
`
`Total time
`
`Search time
`
`TPR
`
`FPR
`
`01:08:44
`01:39:45
`01:49:28
`03:47:16
`
`00:24:55
`00:32:44
`00:55:03
`02:43:37
`
`0.928
`0.969
`0.985
`0.977
`
`2.024e-03
`l.284e-03
`l.llOe-03
`9.479e-04
`
`Table 4. The total time per query (build index and search), the speed up
`compared to BLAST, the time per query for search alone, the speed up
`compared to BLAST for search, the BLAST time per query. Query windows
`do not overlap
`
`We now study the degradation of the performance of
`the algorithm as the error rate (nucleotide mismatches,
`insertions, and deletions) increases. Tables 3 indicate the
`TPR and FPR as a function of the error rate P. The
`degradation of the performance is apparent. The accuracy
`improves when query windows overlap.
`We expect the TPR and FPR for BLAST to be as good
`as or better than those for SST. For an application such
`as sequence assembly, imperfect sensitivity (failure to
`detect true hits) is similar in effect to sequencing with less
`coverage; the missing sequences do not contribute to the
`initial assembly. SST is best used as a filtration method,
`whereby the small number of sequences returned by SST
`are confirmed by a slower method such as BLAST. This
`post-processing provides a method to remove or limit false
`positives with little computational expense.
`We now compare the scaling of the computation time
`per query between SST and BLAST as we vary the
`database size. Tables 4 and 5 show the linear scaling of
`the time per query with BLAST versus the nearly constant
`time with SST. The speed up compared to BLAST is also
`indicated. We see that for search alone SST is 27 times
`faster than BLAST while for building the tree index and
`searching it, SST is about 15 times faster then BLAST for
`the database of 120 000 sequences when query windows
`do not overlap. When query windows overlap SST is
`about 9.3 times faster than BLAST for building the index
`and searching it and 13.2 times faster than BLAST for
`searching alone. A higher speed up is expected for larger
`databases.
`
`Coverage
`
`SST total Total time
`time
`speed up
`
`SST query
`time
`
`Search time
`speed up
`
`BLAST
`time
`
`6
`12
`24
`
`0.0503
`0.0538
`0.0561
`
`4.7
`8.1
`14.7
`
`0.0246
`0.0276
`0.0301
`
`9.7
`15.8310
`27.4
`
`0.2409
`0.4379
`0.8266
`
`5 DISCUSSION
`SST is most effective for applications in which the
`target sequences show a high degree of similarity to the
`query sequence, such as assembling shotgun sequences
`or matching ESTs to genomic sequence. The accuracy is
`greatly improved when query windows overlap but this
`comes at the cost of a substantial slowdown. For some
`applications, such as assembly of shotgun sequences, it is
`sufficient to identify the set of nearest-neighbor sequences.
`For other applications, it is desirable or necessary to align
`the nearest-neighbor sequences to the query sequence
`using an algorithm such as Needleman-Wunsch or Smith(cid:173)
`Waterman. Since SST filters most true negatives only a
`small number of sequences are returned for each query,
`and such alignments can be done relatively quickly.
`In the example presented here, for 120 000 sequences,
`SST is 15 to 27 times faster than BLAST2, when query
`windows do not overlap. As seen in our computations
`SST scales logarithmically with the number of sequences,
`while BLAST2 scales linearly. Hence we estimate that for
`a database of one million sequences SST will be 170 to
`270 times faster than BLAST2, with similar gains for yet
`larger sequence collections.
`
`878
`
`

`

`Table 5. The total time per query (build index and search), the speed up
`compared to BLAST, the time per query for search alone, the speed up
`compared to BLAST for search, the BLAST time per query. Query windows
`overlap by half their length
`
`Coverage
`
`SST total Total time
`time
`speed up
`
`SST query
`time
`
`Search time BLAST
`speed up
`time
`
`24
`12
`6
`
`0.0891
`0.0757
`0.0704
`
`9.27
`5.78
`3.24
`
`0.0626
`0.0494
`0.0444
`
`13.2
`8.86
`5.4
`
`0.8266
`0.4379
`0.2409
`
`ACKNOWLEDGEMENTS
`We wish to thank Junming Yang, Tod Klingler and Richard
`Goold for stimulating discussions on this work.
`
`REFERENCES
`(1997) Method
`for high(cid:173)
`Agrawal,R., Equitz,W.R. et al.
`dimensionality
`indexing
`in a multi-media database. US
`Patent, Appl. No. 607922.
`Altschul,S.F., Madden,T.L. et al. (1997) Gapped BLAST and PSI(cid:173)
`BLAST: a new generation of protein database search programs.
`Nucleic Acids Res., 25, 3389-3402.
`Altschul,S.F., Gish,W. et al. (1990) Basic local alignment search
`tool. J. Mol. Bioi., 215,403-410.
`Barker,W.C., Pfeiffer,F. et al. (1996) Superfamily classification in
`PIR-international protein sequence database. Methods Enzymol.,
`266, 59-71.
`Burkhardt,S., Crauser,A. et a/. (1999) Q-gram based database
`searching using a suffix array (QUASAR). In Third International
`Conference on Computational Molecular Biology (Recomb 99).
`ACM Press, Lyon, France.
`Califano,A. and Rigoutsos,l. (1995) Flash: Fast look-up algorithm
`Technical report. IBM T.J. Watson
`for string homology.
`Research Center, Yorktown Heights, NY.
`Chang,W. and Lawler,E. (1990) Approximate string matching
`in sublinear expected time.
`In 31st Annual Symposium on
`Foundations of Computer Science. IEEE Computer Society
`Press, St. Louis, Missouri.
`
`SST: an algorithm for finding near-exact sequence matches
`
`Gersho,A. and Gray,R. (1992) Vector Quantization and Signal
`Compression. Kluwer, Boston.
`Gonnett,G.H., Cohen,M.A. et al. (1992) Exhaustive matching of the
`entire protein sequence database. Science, 256, 1443-1445.
`Harris,N.L., Hunter,L. et al. (1992) Mega-classification: Discover(cid:173)
`ing motifs in massive datastreams. In Proceedings of the lOth
`National Conference on Artificial intelligence. AAAI Press.
`Jokinen,P. and Ukkonen,E. (1991) Two algorithms for approximate
`string-matching in static texts.
`In Proceedings of the 16th
`Symposium on Mathematical Foundations of Computer Science,
`Lecture Notes in Computer Science 520, pp. 240-248.
`Miller,C., Gurd,J. et al. (1998) RAPID: an extremely fast dna
`database search tool. Genes, proteins, and computers V. York
`University, England.
`Myers,E. (1994) A sublinear algorithm for approximate keyword
`searching. Algorithmica, 12, 345-374.
`Needleman,S.B. and Wunsch,C.D. (1970) A general method appli(cid:173)
`cable to the search for similarities in the amino acid sequence of
`two proteins. J. Mol. Bioi., 48,443-453.
`Pearson,W.R. ( 1996) Effective protein sequence comparison. Meth(cid:173)
`ods Enzymol., 266, 227-258.
`Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological
`sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444--
`2448.
`Pevzner,P.A. and Waterman,M.S. (1995) Multiple filtration and
`approximate pattern matching. Algorithmica, 13, 135-154.
`Smith,T.F. and Waterman,M.S. (1981) Identification of common
`molecular subsequences. J. Mol. Bioi., 147, 195-197.
`Tatusov,R.L., Koonin,E.V. et al. (1997) A genomic perspective on
`protein families. Science, 278, 631-637.
`Ukkonen,E. (1992) Approximate string-matching with q-grams and
`maximal matches. Theoretical Computer Science, 92, 191-211.
`Watanabe,H. and Otsuka,J. (1995) A comprehensive representation
`of extensive similarity linkage between large numbers of pro(cid:173)
`teins. Comput. Appl. Biosci., 11, 159-166.
`Wu,S. and Manber,U. (1992) Fast text searching allowing errors.
`Communications of the ACM, 10, 83-91.
`Yona,G. (May 1999) Methods for global organization of all known
`protein sequences, PhD thesis, Computer Science, Hebrew
`University.
`
`879
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket