An optimized protocol for analysis of EST sequences

F. Liang

doi:10.1093/NAR/28.18.3657

Outline

An optimized protocol for analysis of EST sequences

format_quote

citations

Geo Pertea

2000, Nucleic Acids Research

https://doi.org/10.1093/NAR/28.18.3657

visibility

…

description

9 pages

Abstract

The vast body of Expressed Sequence Tag (EST) data in the public databases provide an important resource for comparative and functional genomics studies and an invaluable tool for the annotation of genomic sequences. We have developed a rigorous protocol for reconstructing the sequences of transcribed genes from EST and gene sequence fragments.

© 2000 Oxford University Press Nucleic Acids Research, 2000, Vol. 28, No. 18 3657–3665 An optimized protocol for analysis of EST sequences Feng Liang, Ingeborg Holt, Geo Pertea, Svetlana Karamycheva, Steven L. Salzberg and John Quackenbush* The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA Received April 17, 2000; Revised June 26, 2000; Accepted July 16, 2000 ABSTRACT quality and utility of the assembled sequences relies on the The vast body of Expressed Sequence Tag (EST) ability of sequence assembly programs to effectively generate high fidelity consensus sequences from the available EST data. data in the public databases provide an important Among the assembly programs that have been developed for resource for comparative and functional genomics genomic sequencing projects, the most extensively used are studies and an invaluable tool for the annotation of Phrap (http://www.phrap.org/phrap.docs/phrap.html ), CAP3 genomic sequences. We have developed a rigorous (10) and TIGR Assembler (11). The version of TIGR Assembler protocol for reconstructing the sequences of tran- now in use for genomic sequence assembly has been modified scribed genes from EST and gene sequence fragments. significantly from the original, which was optimized for A key element in developing this protocol has been the assembly of EST sequences. In this manuscript we refer to the evaluation of a number of sequence assembly programs original version as TA-EST and the modified version, optimized to determine which most faithfully reproduce transcript for genomic assembly, as TIGR Assembler. While all of these sequences from EST data. The TIGR Gene Indices have proven their utility in assembling genomic shotgun constructed using this protocol for human, mouse, rat sequencing data, EST sequences present a number of distinct and a variety of other plant and animal models have computational problems for an assembler. In genomic shotgun demonstrated their utility in a variety of applications and sequencing, which typically uses a single clone for the source are freely available to the scientific research community. DNA, sequences sharing <98% identity can be assumed to come from different copies of a repetitive sequence element. In contrast, EST data are derived from a wide variety of sources INTRODUCTION representing the spectrum of polymorphisms in the original samples. This is compounded by sequencing errors inherent in Our efforts to catalog the collection of human genes are progressing rapidly. Although both public and private efforts single pass sequencing, including a relatively high rate of inser- have greatly accelerated the pace of human genome sequencing, tions and deletions, contamination by vector and linker sequences annotation of the genome, including identification of the gene and the non-random distribution of sequence start sites in sequences, remains a significant challenge. Expressed Sequence oligo(dT)-primed libraries. Therefore, the degree of identity in Tag (EST) sequences represent the most extensive available overlapping sequences from the same gene will often be lower survey of the transcribed portion of the genome. ESTs are single than in genomic projects. In addition, the patterns of overlap- pass, partial sequencing reads generated from either the 5′- or the ping sequences caused by alternative transcripts are different 3′-end of a cDNA clone (1). There are >4 000 000 ESTs in from that observed in a genomic shotgun project. Finally, gene GenBank, nearly two-thirds of which are human (http:// sequences in GenBank and the ESTs in dbEST lack the base www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html ). ESTs call quality values that most assembly programs now use as have proven to be an indispensable tool for the identification of part of the assembly process. Sequence chromatograms can be expressed genes (2) and for genomic mapping (3,4). obtained for approximately half of the nearly 2 000 000 human There have been a number of attempts to identify unique ESTs from the Washington University ftp site and quality genes represented by EST data (5). UniGene (6) uses pairwise values can be derived for these sequences. This information is sequence comparisons at various levels of stringency to group not available for the remaining ESTs and for all of the gene related sequences, placing closely related and alternatively sequences. Of the four programs we evaluated, only TIGR spliced transcripts into clusters. The TIGR Gene Indices Assembler is capable of assembling a mix of sequences with described here use assembly algorithms, rather than clustering, and without quality values. to produce tentative consensus (TC) sequences that represent the Using the >118 000 rat ESTs in dbEST as a model, we evaluated underlying mRNA transcripts (7). This has several advantages: it Phrap, CAP3, TA-EST and TIGR Assembler to determine separates closely related genes into distinct consensus which program most faithfully assembles ESTs to produce TC sequences; it separates splice variants; it produces longer repre- sequences, to compare the number of TCs and singletons sentations of the underlying gene sequences. The resulting TCs produced and to evaluate the relative performance of the can be used for eukaryotic genome sequence annotation (8,9), algorithms. In this comparison we have focused on a number integration of complex mapping data and identification of of known genes. As there are a number of potential difficulties orthologous genes (I.Holt, F.Liang, G.Pertea, S.Karamycheva in working with available EST data, including the presence of and J.Quackenbush, submitted for publication). However, the undetected gene families and variable error rates, we *To whom correspondence should be addressed. Tel: +1 301 838 3528; Fax: +1 301 838 0208; Email: [email protected] 3658 Nucleic Acids Research, 2000, Vol. 28, No. 18 augmented our studies using simulated sequences designed to model known sequencing errors (12) or ESTs transcribed from closely related genes. Finally, we validated our simulation results by assembling ESTs derived from 73 known, annotated genes in GenBank. In our analysis we have found that CAP3 consistently out-performs the other programs, producing the fewest high quality assemblies from single genes while being tolerant to random errors yet maintaining the ability to discrim- inate between related genes; we have adopted CAP3 as the assembler for the TIGR Gene Indices. We used CAP3 to construct the most recent release of the Human Gene Index (HGI) (9), which is based on 1 610 947 human ESTs, 47 283 human sequences derived from CDS features in GenBank (we refer to these as NP, for NucProt, sequences) and 7223 curated expressed transcript (ET) sequences from the TIGR EGAD database (http:// Figure 1. DNA sequencing base call error probability. Error probability distribution adapted from Ewing and Green (12) used to simulate systematic base call errors. www.tigr.org/tdb/egad/egad.html ). Using the 52 825 ESTs that have been mapped by the International Radiation Hybrid Mapping consortium (13), we were able to assign map locations to >40% of the tentative human consensus (THC) sequences. While adding significant value to the HGI, this mapping Total sequencing error rates ranging from 1 to 8% (5–40 errors/ information also serves to validate the assemblies, as THCs sequence) and sequence coverages ranging from 5- to 50-fold containing multiple, independently mapped markers almost were simulated. Model sequences were generated and each invariably map to consistent locations within the genome. assembly program was used to independently assemble the sequences. The numbers of contiguous assemblies and single- tons were recorded. The best consensus sequence produced by MATERIALS AND METHODS each program was compared with the original sequence and its Rat Gene Index assembly fidelity was assessed using an assembly score (A-score) defined by Rat EST sequences were downloaded from dbEST. These were trimmed to remove vector sequences, poly(A/T) tails, adaptor A-score = (2 × sequence length) – (15 × no. of insertions) – sequences and contaminating bacterial sequences. The cleaned (15 × no. of deletions) – (5 × no. of substitutions), ESTs were clustered by comparing all pairs using WU-BLAST where a perfect assembly would have an A-score of 1200 for (http://blast.wustl.edu ) (14) and collecting those with ≥95% our 600 base test sequence. identity over regions at least 40 bp in length with unmatched over- hangs <20 bp. The sequences comprising each cluster were assem- Assembly of gene families bled using Phrap (v.990315), CAP3, TIGR Assembler and TA-EST Gene families were modeled by taking an 1800 bp segment of the and the results from the independent assemblies were compared. ECA1 gene and introducing substitutions at random positions, generating eight sequences that were 99, 98, 97, 96, 95, 94 and Modeling error rates for EST sequence assembly 90% identical to the original. For each of the eight family Errors produced during automated DNA sequencing are non- members, the gene sequence was artificially shotgunned, uniformly distributed and tend to be concentrated at the beginning creating 5-fold coverage of EST fragments 450–550 bp in size. and end of the sequence read (12). To assess the effects of Two pools of EST sequences were created, one with all eight sequencing errors we used a 600 base segment of a reference sequences and a six sequence family containing only those sequence (ECA1, GenBank accession no. U96455) to model ≥95% identical to the reference sequence. Each family was the distribution of sequence start positions and errors in EST assembled using the four assembly programs; the consensus data. From the reference sequence a set of fragment sequences assemblies were evaluated by comparing them with each ranging from 450 to 550 bases in length was generated. member of the gene family. Six independent simulations were Sequencing errors with a pattern similar to that previously conducted. reported (12) were introduced as substitutions, insertions or Assembly and evaluation of representative human genes deletions with a 3:1:1 ratio at positions selected using the normalized probability density model of the form (see Fig. 1): A set of 73 representative human genes was selected based on x ( x – 500 ) their EST content. EST sequences representing each gene were – ------ ---------------------- assembled using Phrap, CAP3, TA-EST and TIGR Assembler; 1 1 25 40 P ( x ) dx = ---- ---- e +e dx, N N the parent gene sequences were not included in the assembly process. For each gene the longest consensus sequence where x is the position along the length of the sequence read produced by each assembler was compared to the original gene and N is a normalization constant equal to: sequence in order to assess consensus quality; the errors in the 500 x ( x – 500 ) consensus sequences were tabulated and classified and a – ------ ---------------------- normalized A-score was calculated for each program by ò 1 25 40 N = --- e +e dx = 52.4999. 2 summing the A-scores for each gene and dividing by the total 0 Nucleic Acids Research, 2000, Vol. 28, No. 18 3659 sequence length. For a perfect sequence reconstruction the Table 1. Summary of TC and singletons of rat EST clusters using Phrap, normalized A-score would be 2.0. CAP3 and TA-EST HGI assembly and analysis Phrap CAP3 TA-EST TIGR Human EST and coding gene sequences were downloaded Assembler from dbEST and GenBank records and cleaned using the same TCs 16 635 16 647 16 977 17 653 filters as were used for rat. Cleaning eliminated 82 228 (5.1%) Singletons 121 138 2 751 7 540 of the original 1 610 947 ESTs and trimmed an additional Total 16 756 16 785 19 728 25 193 8 350 769 bases of contaminating sequence. A total of 54 506 human gene sequences were included: 47 283 human tran- scripts (NP sequences) parsed through Entrez from the CDS and CDS-join features in GenBank records and 7223 curated ET programs because they contain many closely related sequences sequences from the TIGR EGAD database (http://www.tigr.org/ that must be correctly assembled, despite sequencing errors tdb/egad/egad.html ). Sequences were compared using FLAST, and polymorphisms inherent in the data. The results for four a rapid sequence comparison program based on DDS (15) in representative clusters are presented in Table 2. Again, we can which query sequences are first concatenated and then see clear differences between the programs, indicating that searched against a nucleotide database. Sequences were placed TA-EST and TIGR Assembler are far less error tolerant than in clusters using criteria identical to those used for rat EST Phrap and CAP3. This analysis at first glance suggests that clustering. Sequences in each cluster were assembled using CAP3. A THC sequence containing a known gene was Phrap may be better at assembling ESTs containing errors. assigned the function of that gene; THCs without assigned However, as described below, Phrap tends to misassemble functions were searched using DPS (15) against a non- sequences and to produce low fidelity consensuses containing redundant protein database; high scoring hits were assigned a many insertions and miscalled bases. putative function. The THC sequences were assigned map locations using the most recent data from the International Radia- Table 2. Contigs and singletons produced by Phrap, CAP3, TA-EST and tion Hybrid Mapping Consortium (13). EST mapping TIGR Assembler for four representative ‘large’ clusters of rat sequences information was downloaded via ftp from the NCBI (ftp:// ftp.ncbi.nlm.nih.gov/repository/genemap ) and map locations Cluster Contigs/singletons produced assigned using Greg Schuler’s e-PCR program (16). (no. of ESTs) Phrap CAP3 TA-EST TIGR Assembler 1. (135) 11/0 17/0 21/43 25/41 RESULTS 2. (270) 25/1 35/1 37/125 42/126 Incorporation of EST sequences into TC assemblies 3. (540) 1/0 1/0 2/2 3/12 Construction of the TIGR Gene Indices relies on faithfully 4. (1791) 15/0 18/7 28/62 71/229 clustering and assembling sequences, so ESTs from the same Total 52/1 61/8 88/232 141/408 transcript are properly assembled while ESTs from distinct but closely related transcripts are appropriately placed into separate assemblies. Sequences that do not fit into any assemblies are Consensus assessment called singletons. The number of singletons provides an estimate of the number of rare transcripts represented in the data; to While clustering alone can provide an estimate of the number avoid overestimates, assemblers must be fairly tolerant of of genes represented in an EST database, the construction of sequencing errors so as to not produce an excessive number of TCs has a number of advantages. For example, each TC sequence singletons. tends to be longer than its component ESTs, facilitating func- Following sequence cleaning, pairwise comparisons placed tional assignment, transcript mapping and genomic sequence 118 473 Rat ESTs in 16 183 clusters. The sequences annotation. The utility of TC sequences depends critically on comprising each cluster were assembled using Phrap, CAP3, the fidelity of the consensus produced. To evaluate the quality TA-EST and TIGR Assembler, respectively, using each of this consensus for each assembly program, we used ESTs program’s default parameters. Following assembly, the number from known, annotated genes and compared the consensus of consensus sequences and singletons was tabulated, as shown sequences produced by each program to the reference in Table 1. While each program produces approximately the sequence. An example, representing the single copy cyto- same number of assemblies, TA-EST gives nearly 20 times the chrome c oxidase subunit II gene of the rat mitochondrial number of singletons produced by CAP3 or Phrap, suggesting genome (GenBank accession no. M27315), is shown in Figure 2 that it is much less tolerant of sequence discrepancies. This (this corresponds to Cluster 3 in Table 2). Analysis of the observation is further supported by the slightly larger number alignment in Figure 2 shows that CAP3, TA-EST and TIGR of TCs generated from high coverage clusters, suggesting that Assembler were all able to accurately reproduce the reference it is also more likely to split sequence contigs when sequencing sequence (modulo one consistent difference that suggests an errors occur. error or polymorphism in the GenBank sequence). However, Much of the difference between Phrap and CAP3 can be while CAP3 was able to use all the sequence data to produce a attributed to large clusters containing tens or hundreds of single consensus, TA-EST and TIGR Assembler used some of sequences. These present a unique challenge to the assembly the lower quality sequences to produce a second consensus that 3660 Nucleic Acids Research, 2000, Vol. 28, No. 18 Figure 2. CLUSTAL W (17) alignment of consensus sequence assemblies for the rat cytochrome c oxidase gene produced by Phrap, CAP3, TA-EST and TIGR Assembler. spans only part of the reference sequence and that differs from and 3′-EST sequences start from the same position (or nearly it by 9.3% (data not shown). Phrap assembled all of the ESTs so) and errors, while independent, are positionally clustered. into a single consensus, but the resulting sequence contains a The existence of these correlated errors can have a significant large number of insertions and other errors, representing a 5% impact on EST assembly; assembly programs must effectively error rate. While Phrap has been shown to produce accurate handle this in order to generate high fidelity contigs and an consensus sequences for genomic sequencing projects, the lack accurate estimate of the number of transcripts represented of quality values for EST sequences appears to have a significant within the data. adverse effect on its output. Further analysis, described below, To systematically assess the relative performance of the suggests that Phrap also over-assembles sequences, combining various assembly programs we generated model EST data with ESTs from distinct transcribed genes into single consensus lengths of 450–550 bp, error rates ranging from 1 to 8% and sequences. However, in genomic sequence assembly using various levels of coverage spanning a 600 base segment of the quality values CAP3 has been demonstrated to produce fewer ECA1 gene (GenBank accession no. U96455). These were errors than Phrap (10). assembled using each of the programs: both the number of contigs and singletons and the quality of the consensus Effects of sequencing errors on EST assembly sequences were compared (Figs 3 and 4). In each instance, The errors generated in automated DNA sequencing are known Phrap and CAP3 produced a single consensus sequence. In to be concentrated at the start and end of the sequence read contrast, TA-EST and TIGR Assembler split sequences into (12). In genomic sequence assembly this is mitigated by the singletons or separate contigs as the error rate increased (data random distribution of sequence start points. The situation is not shown). We also assessed the quality of the consensus quite different for ESTs. cDNA clones are constructed from sequences. For each program we calculated an A-score (see polyadenylated mRNA using oligo(dT) to prime reverse tran- Materials and Methods) for the consensus sequences produced scription first-strand DNA synthesis. Consequently, clone ends by each program at 5- and 50-fold EST coverage. Figure 3 Nucleic Acids Research, 2000, Vol. 28, No. 18 3661 Figure 3. Consensus sequence errors. Plot of A-scores for the best consensus assemblies produced by Phrap, CAP3, TA-EST and TIGR Assembler (TA) using simulated data for various error rates at 5× and 50× sequence coverage. Figure 4. Error source distribution and normalized A-score for assemblies of 73 known genes. Consensus sequence error classification for Phrap, CAP3, TA-EST and TIGR Assembler using EST sequences containing 5% errors at various depths of coverage. shows the A-score for the best consensus sequence produced situations analyzed, the fidelity of the Phrap consensus by each program as a function of EST error rates. Although sequence was consistently worse than that generated by CAP3. both CAP3 and Phrap produced a single consensus in all of the The best consensus assemblies produced by TA-EST and 3662 Nucleic Acids Research, 2000, Vol. 28, No. 18 TIGR Assembler score similarly to CAP3, although this must Table 3. Performance of the four assemblers under evaluation for ESTs be considered in the light of the fact that the former programs representing 73 known genes with an average coverage of 196 ± 180 tend to generate additional consensus sequences and singletons sequences and fail to produce a consensus if the error rate is sufficiently high. Phrap CAP3 TA-EST TIGR Assembler The consensus sequence discrepancies generated by each program were classified as insertions, deletions, substitutions (A) or IUPAC codes (an IUPAC code represents an ambiguous No. of single assemblies 46 59 15 2 nucleotide, e.g. Y represents one of C or T) and tabulated. Mean no. of assemblies 1.56 1.26 2.85 17.26 Figure 4 shows the distribution of consensus errors for each Standard deviation 0.88 0.58 1.79 17.20 program at various depths of coverage. While the total error rates for CAP3, TA-EST and TIGR Assembler are relatively (B) constant and independent of depth of coverage, the errors Mean no. of singletons 0.07 0.10 8.05 38.55 produced by Phrap, particularly the insertions and substitutions, Standard deviation 0.35 0.45 10.09 47.14 increase as the depth of coverage increases. This is consistent with the data in Figure 3, where the A-score for Phrap at 50-fold For each of these genes, one would expect the assembler to produce a single coverage is lower than that for 5-fold coverage. Phrap has a consensus without singletons. (A) The number of single contigs produced by tendency to retain the errors in the EST sequences, introducing each assembler and the mean and standard deviation of the number of them as insertions in the consensus. assemblies. (B) The mean and standard deviation of the number of singletons produced by each of the assemblers. Assembly of ESTs from a family of genes While software used for EST assembly should be relatively tolerant of random errors, it should be capable of separating For each gene we assessed the fidelity of the longest ESTs from distinct but closely related transcripts. To assess the consensus sequence produced by each of the four programs. As performance of the assembly programs to handle data from for our simulation studies, the best assemblies produced by gene families we generated model ESTs from sequences CAP3, TIGR Assembler and TA-EST were all significantly sharing 90% or greater identity (see Materials and Methods) better than those produced by Phrap. Figure 5 shows the and measured the number of contigs generated by each of the number of errors, classified by type, generated by each programs. TA-EST was generally unable to separate the gene program. Phrap produced considerably more insertions, deletions family members, always grouping the six member family into and substitutions than did the other assemblers. As a measure a single consensus and the eight member family into an of the fidelity of the best assembly produced by each gene we average of 1.38. Phrap did only slightly better, generating an average of 2 and 4.33 consensus sequences from the six and normalized the total A-score for all assemblies by the total eight member families, respectively. CAP3 did a good job of length of the assembled sequence; in this case, perfect assemblies discriminating between closely related but distinct transcripts, would produce a value of 2. The normalized A-score for CAP3 however, it too failed with sequences that share >96% identity, was 1.59, while those for TA-EST, TIGR Assembler and Phrap producing an average of 4.5 and 6.67 for the two families. were 1.49, 1.25 and 0.55, respectively. TIGR Assembler provided the greatest discrimination, generating As expected, based on our previous results, CAP3 generated an average of 5.83 and 8.5 consensus sequences, respectively, the highest quality assemblies of the corresponding gene for the two families. sequences. Further, CAP3 exceeded our expectations, generating fewer consensus sequences than even Phrap, which had Evaluation of EST assembly algorithms using highly produced the fewest assemblies in both our simulations and represented human genes our analysis of the Rat Gene Index. While these results may To further validate the results from our simulation studies, we have been different if sequence quality values had been used in examined 73 human genes with EST sequences spanning their the assemblies, the gene sequence and EST data in GenBank lengths. The length of these genes was 1881 ± 1037 bp and the do not include these data. For the available data our results sequences had an average coverage of 203 ± 183 ESTs. clearly indicate that CAP3 has the best balance of error tolerance Ideally, the ESTs from each gene should assemble to a single and error resolution. contig without singletons. However, without the gene sequence to serve as a reference, regions of low coverage and Assembling the HGI and assessing THC fidelity by e-PCR errors in the ESTs may cause multiple contigs and singletons to The HGI was assembled using CAP3 from 1 524 335 ESTs, be formed. We examined the performance of each assembler, tallying the number of contigs and singletons produced for 47 283 NPs and 7223 ET sequences, producing 75 424 THCs each gene. As summarized in Table 3, CAP3 was able to and 338 999 singletons. Of the 52 825 EST-based markers produce an average of 1.26 ± 0.58 contigs with a single contig placed on radiation hybrid maps, we were able to assign 28 577 in 59 of 73 cases (81%). The performance of Phrap was nearly markers to one or more THCs. In all, 32 404 map assignments as good, with 46 (63%) of the genes producing a single were made, suggesting a redundancy in the THC data set of consensus and an average of 1.56 ± 0.88 assemblies. Neither 1.13-fold (32 404/28 577). Of 20 731 THCs assigned map CAP3 nor Phrap generated a significant number of singletons. locations, 7328 contained two or more independently mapped Based on these measures, both programs performed significantly markers; of these, 7104 THCs (97%) contained multiple better that either TA-EST or TIGR Assembler. markers that mapped to nearby chromosomal locations, Nucleic Acids Research, 2000, Vol. 28, No. 18 3663 Figure 5. DNA sequencing base call error probability. The total number of errors, classified by type, in the best assembly produced by the four assemblers and the normalized A-score for 73 known genes. suggesting that the assemblies properly reconstructed the gene transcripts, generating an over-representation of some genes. sequence. In contrast, Phrap is insufficiently sensitive to sequence differ- ences, causing it to over-assemble ESTs and sacrifice the fidelity of the consensus sequences it produces by generating a DISCUSSION significantly higher number of insertions and incorrect base EST data have proven to be an important resource for gene assignments. CAP3 incorporates the best features of these discovery and mapping and promise to be invaluable for the other programs, producing high fidelity consensus sequences annotation of the eukaryotic genomes soon to be completed. and maintaining a high level of sensitivity to gene family However, the large number of EST sequences have made members while effectively handling sequencing errors. Based working with this data a challenge. The TIGR Gene Indices are on our analysis we have selected CAP3 for assembly of the an attempt to reduce those data to a manageable, well-defined TIGR Gene Indices (http://www.tigr.org/tdb/tgi.org ). collection of high fidelity consensus sequences. Central to this process is the use of a sequence assembly program that provides Factors that influence consensus sequence fidelity an accurate representation of the gene sequences from which the Most EST sequences in the public repositories do not have ESTs were derived. We have conducted an extensive analysis of quality values assigned to each base. Quality values indicate the performance of four of the most widely known DNA how accurate the base call is; values >20 (99% confidence) sequence assembly programs—Phrap (http://www.phrap.org/ represent high confidence calls (12). Without these, Phrap phrap.docs/phrap.html ), CAP3 (10) and two versions of TIGR assigns a default quality of 15 to each base. The construction Assembler (11)—and used a variety of measures to assess the by Phrap of a consensus relies heavily on the quality value; fidelity of the consensus sequences produced by this process. when several input sequences disagree, it often resolves the problem by inserting two different bases in the final consensus, Evaluation of sequence assembly programs producing an insertion error. In contrast, CAP3, TA-EST and While none of the assembly programs performed perfectly, TIGR Assembler use a ‘majority rule’ scheme that tends to CAP3 consistently provided the highest fidelity assemblies, resolve disagreements correctly. [This result is consistent with accurately assembling ‘dirty’ EST data without introducing an that reported by Miller and Powell (18), although that study inordinate number of errors into the consensus or generating considered an earlier generation of assembly algorithms.] unnecessary singletons. TIGR Assembler and TA-EST proved CAP3 uses only the majority base for its consensus; TA-EST slightly more sensitive to subtle yet consistent differences in and TIGR Assembler use an IUPAC code to represent possible sequence, such as those present in closely related members of ambiguities. a gene family. However, this sensitivity, combined with the Although each of the assembly programs uses internal checks to naturally occurring errors inherent in ESTs, causes both to split discriminate between sequences, all include a user-definable 3664 Nucleic Acids Research, 2000, Vol. 28, No. 18 Table 4. Radiation hybrid mapping data for a representative sample of THCs from HGI5.0 that contain multiple, independently mapped markers THC ID Marker position RHDB ID Chromosome Location Panel Score THC403868 766–890 RH53683 14 281.46 GB4 P = 1.70 THC403868 766–890 RH14883 14 4460 G3 F THC403877 71–335 RH26593 6 105.5 GB4 P > 3.00 THC403877 74–212 RH46761 6 105.5 GB4 P > 3.00 THC403877 1224–1344 RH46705 6 105.5 GB4 P > 3.00 THC403877 1414–1661 RH26034 6 105.5 GB4 P > 3.00 THC403892 552–748 RH24987 9 404 GB4 P > 3.00 THC403892 594–798 RH11721 9 403.9 GB4 P > 3.00 THC403892 660–798 RH13861 9 4806 G3 F THC403910 830–989 RH44275 10 547.83 GB4 P = 2.44 THC403910 916–1036 RH16755 10 548.62 GB4 P = 1.28 THC403911 50–149 RH51822 11 247.67 GB4 P = 2.48 THC403911 86–185 RH27455 11 247.67 GB4 P = 2.48 THC403911 86–185 RH10295 11 242.54 GB4 P = 0.00 THC403912 2–101 RH51822 11 247.67 GB4 P = 2.48 THC403912 38–137 RH27455 11 247.67 GB4 P = 2.48 THC403912 38–137 RH10295 11 242.54 GB4 P = 0.00 THC403923 597–759 RH15921 19 216.04 GB4 P = 0.01 THC403923 1179–1280 RH16797 19 214.04 GB4 P = 1.26 THC403929 1637–1761 RH12769 2 574.71 GB4 P > 3.00 THC403929 1773–1903 RH56254 2 573.36 GB4 P = 1.03 THC403929 1908–2241 RH14021 2 8064 G3 F THC403929 1918–2049 RH70825 2 572.14 GB4 P = 0.84 THC403929 1929–2217 RH56929 2 557.16 GB4 P = 0.96 THC403934 755–883 RH39372 17 295.52 GB4 P = 0.76 THC403934 793–998 RH76470 17 293.11 GB4 P = 0.02 THC403947 22–122 RH49734 1 145.4 GB4 P = 1.33 THC403947 1431–1555 RH50152 1 145.91 GB4 P = 1.12 THC403950 1724–1872 RH78931 11 18.46 GB4 P = 0.36 THC403950 1745–1920 RH32214 11 17 G3 P = 1.62 THC403950 1766–1877 RH27310 11 4.63 GB4 P = 2.39 THC403956 2300–2430 RH91675 16 194.86 GB4 F THC403956 2300–2430 RH79175 16 193.96 GB4 P = 0.33 THC403987 25–174 RH55229 17 329.2 GB4 P = 1.52 THC403987 25–174 RH14431 17 2298 G3 P = 0.47 THC403987 353–473 RH70694 17 319.86 GB4 P = 2.12 THC404053 354–470 RH44831 12 401.81 GB4 P > 3.00 THC404053 557–686 RH52825 12 400.21 GB4 P > 3.00 THC404053 1146–1390 RH75162 14 140.79 GB4 P = 0.17 The first column is the THC ID, the second column represents the position of the marker within the THC sequence, the third contains the RHDB ID (http://corba.ebi.ac.uk/RHdb ) for the marker, the fourth contains the chromosome associated with the marker, the fifth is the location of the marker on the chromosome expressed in CentiRays (CR), the sixth is the radiation hybrid panel on which the marker was mapped and the final column is the score associated with the marker position (F for G3 means that this is a ‘framework’ marker). It should be noted that the G3 and GB4 panels were constructed using different radiation dosages. Consequently, the ‘size’ of the chromosome in CR is different. Single lines separate distinct THCs; of the THCs shown, only the last, THC404053, has a discrepancy in its map location. parameter that specifies how similar two sequences must be to parameter for the four assemblers are 95, 65, 94.5 and 97.5% initially be considered identical. The default values of this for Phrap, CAP3, TA-EST and TIGR Assembler, respectively. Nucleic Acids Research, 2000, Vol. 28, No. 18 3665 This explains in part why TA-EST and Phrap, and to a lesser ACKNOWLEDGEMENTS extent CAP3, could not separate ESTs from genes sharing We would like to acknowledge Phil Green for providing Phrap >95% DNA sequence identity. One could increase the discrim- and Xiaoqui Huang for CAP3. This work could not have been ination of these programs by selecting a higher stringency, but accomplished without the remaining members of the TIGR this has other unwanted effects, including increasing the Gene Index Team, Thomas Hansen and Jonathan Upton. The number of consensus sequences and singletons. authors are indebted to Anna Glodek for her database development Human sequence mapping and validation efforts. The authors also wish to thank Michael Heaney and Susan Lo for database support, Vadim Sapiro, Billy Lee, Sonja The most extensive collection of EST and genomic mapping Gregory, Rajeev Karamchedu, Corey Irwin, Lily Fu and Eddy and sequence data is available for humans. This provides a Arnold for computer system support and Cathy Ronning, unique opportunity to assess the fidelity of the consensus Robin Buell, Joseph White and Claire M. Fraser for thoughtful sequences contained within the TIGR Gene Indices. Radiation comments and suggestions. This work was supported by a hybrid mapping does not provide precise map locations, but grant from the US Department of Energy. S.L.S. was rather bins markers into approximate chromosomal locations. supported in part by NIH Grant R01 LM06845-01 and NSF The likelihood that two markers from independently mapped Grant IIS-9902923. S.L.S., J.Q. and S.K. were supported in ESTs fall into the same or adjacent bins is extremely small, part by NSF Grant KDI-9980088. F.L., I.E.H., G.P. and J.Q. unless the ESTs were derived from the same gene. The 97% were supported in part by grant DE-FG02-99ER62852 from (7104/7328) concordance between map locations for the THCs the US Department of Energy containing multiple, independent radiation hybrid markers suggests that the consensus sequences faithfully reconstruct the genes from which the ESTs were derived. In many cases REFERENCES the mapped markers fall into distinct, non-overlapping regions 1. Adams,M.D., Kelley,J.M., Gocayne,J.D., Dubnick,M., of the THCs. If there were a large number of chimeric or Polymeropoulos,M.H., Xiao,H., Merril,C.R., Wu,A., Olde,B., misassembled sequences in the THCs one would expect a Moreno,R.F. et al. (1991) Science, 252, 1651–1661. discordance rate significantly higher than the 3% observed; 2. Adams,M.D., Kerlavage,A.R., Fleischmann,R.D., Fuldner,R.A., Bult,C.J., Lee,N.H., Kirkness,E.F., Weinstock,K.G., Gocayne,J.D., this rate is not significantly different than that expected due to White,O. et al. (1995) Nature, 377, 3–174. mapping errors at the various radiation hybrid laboratories (13). 3. Hudson,T.J., Stein,L.D., Gerety,S.S., Ma,J., Castle,A.B., Silva,J., The fact that these discrete markers map to consistent locations Slonim,D.K., Baptista,R., Kruglyak,L., Xu,S.H. et al. (1995) Science, within the genome provides an independent, experimental vali- 270, 1945–1954. dation for the clustering and assembly process used to create 4. Schuler,G.D., Boguski,M.S., Stewart,E.A., Stein,L.D., Gyapay,G., Rice,K., White,R.E., Rodriguez-Tome,P., Aggarwal,A., Bajorek,E. et al. the TIGR Gene Indices. Representative data for 14 of the 7328 (1996) Science, 274, 540–546. THCS containing multiple mapped ESTs can be found in 5. Bouck,J., Yu,W., Gibbs,R. and Worley,K. (1999) Trends Genet., 15, 159–162. Table 4; radiation hybrid map locations for the HGI are available 6. Boguski,M.S. and Schuler,G.D. (1995) Nature Genet., 10, 369–371. at http://www.tigr.org/tdb/hgi/searching/rh_map.html 7. Quackenbush,J. Liang,F., Holt,I., Pertea,G. and Upton,J. (2000) Nucleic Acids Res., 28, 141–145. Conclusions 8. Lin,X., Kaul,S., Rounsley,S., Shea,T.P., Benito,M.I., Town,C.D., Fujii,C.Y., Mason,T., Bowman,C.L., Barnstead,M. et al. (1999) We have conducted a careful analysis of sequence assembly Nature, 402, 761–768. programs in order to determine their performance in assembling 9. Liang,F., Holt,I., Pertea,G., Karamycheva,S., Salzberg,S.L. and EST sequences and developed a refined process of EST Quackenbush,J. (2000) Nature Genet., 25, 239–240. 10. Huang,X. and Madan,A. (1999) Genome Res., 9, 868–877. sequence cleaning, clustering, assembly and annotation that 11. Sutton,G., White,O., Adams,M.D. and Kerlavage,A.R. (1995) provides a faithful representation of the gene sequences from Genome Sci. Technol., 1, 9–18. which the ESTs were derived. With the imminent completion 12. Ewing,B. and Green,P. (1998) Genome Res., 8, 186–194. of the sequence of the human and other genomes, our challenge 13. Deloukas,P., Schuler,G.D., Gyapay,G., Beasley,E.M., Soderlund,C., will be to use all our available resources to accurately catalog Rodriguez-Tome,P., Hui,L., Matise,T.C, McKusick,K.B., Beckmann,J.S. et al. (1998) Science, 282, 744–746. and characterize the encoded genes. Finding genes in a 14. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) genomic sequence is a significant challenge; the vast body of J. Mol. Biol., 215, 403–410. EST data represents a tremendous resource that can be applied 15. Huang,X., Adams,M.D., Zhou,H. and Kerlavage,A.R. (1997) Genomics, to this problem. The TIGR Gene Indices provide a reliable 46, 37–45. 16. Schuler,G.D. (1997) Genome Res., 7, 541–550. reduction of the EST data and can simplify annotation by 17. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., providing fewer, accurate sequences that can be searched 22, 4673–4680. against genomic sequences. 18. Miller,M.J. and Powell J.I. (1994) J. Comp. Biol., 1, 257–269.

References (18)

Adams,M.D., Kelley,J.M., Gocayne,J.D., Dubnick,M., Polymeropoulos,M.H., Xiao,H., Merril,C.R., Wu,A., Olde,B., Moreno,R.F. et al. (1991) Science, 252, 1651-1661.
Adams,M.D., Kerlavage,A.R., Fleischmann,R.D., Fuldner,R.A., Bult,C.J., Lee,N.H., Kirkness,E.F., Weinstock,K.G., Gocayne,J.D., White,O. et al. (1995) Nature, 377, 3-174.
Hudson,T.J., Stein,L.D., Gerety,S.S., Ma,J., Castle,A.B., Silva,J., Slonim,D.K., Baptista,R., Kruglyak,L., Xu,S.H. et al. (1995) Science, 270, 1945-1954.
Schuler,G.D., Boguski,M.S., Stewart,E.A., Stein,L.D., Gyapay,G., Rice,K., White,R.E., Rodriguez-Tome,P., Aggarwal,A., Bajorek,E. et al. (1996) Science, 274, 540-546.
Bouck,J., Yu,W., Gibbs,R. and Worley,K. (1999) Trends Genet., 15, 159-162.
Boguski,M.S. and Schuler,G.D. (1995) Nature Genet., 10, 369-371.
Quackenbush,J. Liang,F., Holt,I., Pertea,G. and Upton,J. (2000) Nucleic Acids Res., 28, 141-145.
Lin,X., Kaul,S., Rounsley,S., Shea,T.P., Benito,M.I., Town,C.D., Fujii,C.Y., Mason,T., Bowman,C.L., Barnstead,M. et al. (1999) Nature, 402, 761-768.
Liang,F., Holt,I., Pertea,G., Karamycheva,S., Salzberg,S.L. and Quackenbush,J. (2000) Nature Genet., 25, 239-240.
Huang,X. and Madan,A. (1999) Genome Res., 9, 868-877.
Sutton,G., White,O., Adams,M.D. and Kerlavage,A.R. (1995) Genome Sci. Technol., 1, 9-18.
Ewing,B. and Green,P. (1998) Genome Res., 8, 186-194.
Deloukas,P., Schuler,G.D., Gyapay,G., Beasley,E.M., Soderlund,C., Rodriguez-Tome,P., Hui,L., Matise,T.C, McKusick,K.B., Beckmann,J.S. et al. (1998) Science, 282, 744-746.
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) J. Mol. Biol., 215, 403-410.
Huang,X., Adams,M.D., Zhou,H. and Kerlavage,A.R. (1997) Genomics, 46, 37-45.
Schuler,G.D. (1997) Genome Res., 7, 541-550.
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673-4680.
Miller,M.J. and Powell J.I. (1994) J. Comp. Biol., 1, 257-269.

About the author

Geo Pertea

Johns Hopkins University, Department Member

Papers

Followers

View all papers from Geo Perteaarrow_forward

An optimized protocol for analysis of EST sequences

Sign up for access to the world's latest research

Abstract

Related papers

References (18)

Related papers

Related topics

Cited by