analysis
`
`Quality assessment of the human genome
`sequence
`
`Jeremy Schmutz, Jeremy Wheeler, Jane Grimwood, Mark Dickson, Joan Yang, Chenier Caoile, Eva Bajorek, Stacey Black, Yee Man Chan,
`Mirian Denys, Julio Escobar, Dave Flowers, Dea Fotopulos, Carmen Garcia, Maria Gomez, Eidelyn Gonzales, Lauren Haydu, Frederick Lopez,
`Lucia Ramirez, James Retterer, Alex Rodriguez, Stephanie Rogers, Angelica Salazar, Ming Tsai & Richard M. Myers
`
`Stanford Human Genome Center, Department of Genetics, Stanford University School of Medicine, 975 California Avenue, Palo Alto, California 94304, USA
`
`
`As the final sequencing of the human genome has now been completed, we present the results of the largest examination of the
`quality of the finished DNA sequence. The completed study covers the major contributing sequencing centres and is based on a
`rigorous combination of laboratory experiments and computational analysis.
`
`From the beginning, a primary objective of the Human
`Genome Project (HGP) was to generate a highly accurate
`reference sequence for the human genome. This sequence
`is now essentially complete and is available in its entirety
`as a reference for biomedical researchers. High-throughput
`genome sequencing has created a fundamental shift in the
`paradigm for biological research. Whereas gene discovery once
`drove DNA sequencing, now the sequencing of entire genomes
`drives gene discovery. As such, it is essential that the scientific
`community be informed about the accuracy of this reference
`sequence and of its fidelity to the biological templates from which
`it was derived.
`
`Box 1
`
`Large-scale sequencing terms for this study
`Accuracy The measure of how likely each base pair in a consensus
`is to be the correct base call. A 99.99% accurate DNA sequence
`contains no more than one incorrect base per 10,000 bp.
`Accuracy is also sometimes referred to as the ‘quality’ of a base
`pair, because estimated base-pair qualities are assigned by the
`assembly software when it creates the consensus.
`Consensus The final reconstructed DNA sequence built by
`assembling the sequence reads and generating a consensus base
`call for each position in the assembly. In the case of a finished clone,
`there is only one consensus.
`Contiguity The measure of how many pieces are contained within
`the assembly. A contiguous assembly would typically have multiple
`overlapping sequence reads along the entire length of the consensus.
`The finishing rules allow a consensus in more than one piece to be
`called contiguous (no gaps) in certain difficult situations if the break
`point is annotated in the database entry.
`Fidelity The fidelity of a consensus is how similar the consensus is
`to the underlying biological template from which the sequence
`reads were derived. Fidelity for a genome at the single base-pair
`level is difficult to measure without identifying and sequencing a
`different clone from the same position on the same chromosome
`and examining the difference between the sequences. In this study
`we evaluated the fidelity of a sequence in reference to the large-
`insert clone from which the sequence was derived, not the genomic
`template.
`Finishing The process of collecting data, performing
`computational manipulation to a data set to convert a shotgun
`assembly into a single high-quality contiguous DNA sequence, and
`verifying the fidelity of the consensus.
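
The accuracy definition in Box 1 can be made concrete with a short sketch. It assumes the Phred-style quality scale described in refs 1 and 2 (Q = -10 log10 of the error probability); the function names are illustrative and not taken from any of the tools used in the study.

```python
import math


def accuracy_to_errors_per_10kb(accuracy: float) -> float:
    """Expected number of incorrect bases per 10,000 bp at a given accuracy."""
    return (1.0 - accuracy) * 10_000


def error_probability_to_phred(p_error: float) -> float:
    """Phred-scaled quality value: Q = -10 * log10(P_error)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def phred_to_error_probability(q: float) -> float:
    """Inverse of the Phred scaling: P_error = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # The 99.99% standard corresponds to an error probability of 1e-4.
    print(accuracy_to_errors_per_10kb(0.9999))
    print(error_probability_to_phred(1e-4))
```

Under this scaling, the 99.99% standard corresponds to an average quality value of about 40, which is how per-base quality estimates relate to the one-error-per-10,000-bp target.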
`
`World standards for sequence fidelity (known as the Bermuda
`Standards) were established at the meeting of HGP principal
`investigators in 1997 (http://www.gene.ucl.ac.uk/hugo/bermuda2.htm).
`These standards stated that finished sequence should contain
`less than one error per 10,000 DNA bases (99.99% accuracy), and
`that the sequence should be contiguous (without gaps). Compliance
`with the base-pair (bp) accuracy standard was measured by error
`probability assessments generated by DNA base-calling software (refs 1–3)
`and by examining discrepancies between overlapping clone
`sequences. Compliance with the contiguity standard was an internal
`measurement based on each centre’s complex sequence-finishing
`methodology. Over the course of the project, additional standards
`were created to ensure sequence fidelity
`(http://www.genome.wustl.edu/Overview/g16stand.php).
`Although more than 2.8 billion base pairs of unique finished
`sequence has been generated by the sequencing centres comprising
`the International Human Genome Sequencing Consortium
`(IHGSC), until the present study was performed fewer than
`5,000,000 bp of this sequence had been verified independently for
`compliance with the finishing standard (ref. 4). Finished chromosome
`sequence papers have now been published for 9 of the 24 human
`chromosomes (refs 5–13), with most of these papers estimating that the
`chromosomal sequence exceeds the 99.99% accuracy measure. To
`provide a more uniform picture of the finished sequence quality
`of the human genome, the National Human Genome Research
`Institute (NHGRI) solicited us to perform a detailed evaluation of
`the DNA sequence data that was generated for the HGP by seven
`of the IHGSC centres. We examined more than 34 megabases (Mb)
`of sequence data for accuracy, contiguity and fidelity (see Box 1),
`and participated in a computational data exchange with the
`Wellcome Trust Sanger Institute. This paper contains the results
`of our analysis of the quality of finished sequence data deposited by
`these centres in the public human genome databases from February
`2001 through to July 2002.
`
`Overview and procedure
`Our quality assessment of finished human bacterial artificial
`chromosome (BAC) sequences was conducted in two rounds. For
`the first round of analysis, we evaluated the finished sequence
`produced by the three largest NHGRI-funded sequencing centres:
`the Baylor Human Genome Sequencing Center, the Washington
`University Genome Sequencing Center and the Whitehead Institute
`Center for Genome Research. We selected 120 BAC clones (about
`6.7 Mb from each centre) from sequence submissions spanning the
`six-month period from 15 February 2001 through to 15 August
`2001. The second round of analysis evaluated the sequence produced
`by the four smaller sequencing centres that individually
`contributed more than 30 Mb to the human genome: the Genome
`Therapeutics Corporation, the French National Sequencing Center
`Genoscope, the University of Washington Genome Center and the
`RIKEN Genomic Sciences Center. We selected 80 BAC clones (about
`3.4 Mb from each centre) from these sequencing centres, spanning
`the 17-month period from 15 February 2001 through to 30 June
`2002 (Supplementary Fig. S1 and Table S1).
`We sampled clones throughout the two time periods, and
`adjusted the number of clones that we selected to be a percentage
`of clones finished by each centre in each month. The sequencing
`centres provided us with sequencing read data and glycerol stocks of
`the large-insert clone. In contrast to previous quality assessments (ref. 4),
`we created a new subclone library for each clone and sequenced this
`library to 3–4 times coverage in high-quality base pairs. We
`generated these reads from both ends of sized plasmid subclones,
`which gave us the ability to evaluate independently the centre’s
`submission regardless of how the original data were generated. We
`combined our new reads with the original data and then finished the
`resulting assembly to a high degree of accuracy, performing directed
`sequencing reactions from the large-insert clone when necessary.
`These directed reads included reactions performed with alternative
`chemistries such as dGTP and Invitrogen sequencing enhancers. All
`finishing quality analysis was performed using the
`Phred/Phrap/Consed pipeline (ref. 3). We then compared our ‘gold standard’ consensus
`to the original submitted consensus and then verified and classified
`any discrepancies. For each discrepancy, we counted the number of
`error events and base-pair errors and as necessary classified the error
`as a significant error or misassembly (see Box 2). We counted an
`error only if original data generated by the submitting centre
`supported the correct consensus; in this way, we avoided classifying
`any large-insert clone growth variations as sequencing errors.
`
`Accuracy results and base-pair errors
`Our analysis indicates that all of the sequencing centres surveyed
`met the standards for 99.99% accuracy over the time period studied.
`Figure 1 shows the plot of the error events and the base-pair errors
`for each clone that we assessed. These are plotted as rates, normalized
`per 10 kilobases (kb) over the length of each clone, and include
`all incorrect base pairs. Most (184 out of 197) of the clones have less
`than 1 bp error per 10 kb, with 59 of the clones having no identified
`errors. Twelve of the thirteen remaining clones exceed the 1 bp per
`10 kb standard owing to significant errors. Disregarding the significant
`errors, only 1 of the 197 clones exceeded the target error rate
`because of base-pair errors alone. Cumulative error results for each
`of the rounds are shown in Table 1. The individual centre base-pair
`error rates ranged from 1 in 25,420 bp to 1 in 154,479 bp, and
`significant error rates ranged from none found to 1 in 1.2 Mb
`(Supplementary Table S2).
`The vast majority of error events found in the finished human
`BAC sequences affected a single base pair in the consensus sequence
`(411 out of 466, 88.2%). Roughly half (48%) of these errors were
`single base-pair substitutions, with the remainder (52%) being
`single base-pair insertions or deletions. The substitution errors
`were primarily miscalled bases in regions of low quality. However,
`there are many positions where a miscalled base was incorporated
`into the consensus sequence despite the presence of multiple
`high-quality reads with the proper base calls; these are obvious finishing
`errors. Most of these errors occurred where a single discrepant
`subclone at that position was given a high-quality score by the
`base-calling algorithm and the miscalled base was incorporated into the
`consensus sequence. Additionally, we identified 42 (9% of error
`events) multiple base-pair insertions, deletions and substitutions of
`less than 20 bp, most of which were clone mutations in a single
`subclone that were erroneously included in the consensus.
`
`NATURE | VOL 429 | 27 MAY 2004 | www.nature.com/nature
`© 2004 Nature Publishing Group
`Petitioner Microsoft Corporation - Ex. 1044, p. 365
`
`Box 2
`
`Analysis terms for this study
`Base-pair errors The number of base-pair changes between our
`‘gold standard’ consensus and the original submitted sequence.
`Error events A count of the number of positions of change in the
`consensus discovered in the quality assessment process; a
`contiguous insertion, deletion or erroneous run of multiple base
`pairs is counted as a single error event because the multiple base-
`pair errors probably arose from a single process error.
`Misassembly A rearrangement or deletion of the consensus
`caused by the incorrect joining of two similar pieces of sequence
`that are geographically separated in the true consensus.
`Significant error A single error that causes at least 50 contiguous
`base pairs to be incorrect in the submitted consensus versus our
`gold standard consensus.
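
The counting rules in Box 2 can be sketched in a few lines. This is only an illustration of the definitions, not the authors' software: the `Discrepancy` record, its fields and the example values are hypothetical. It assumes that significant errors still count as error events but are excluded from the base-pair error total (as the first footnote to Table 1 indicates).

```python
from dataclasses import dataclass

# Box 2: a significant error makes at least 50 contiguous base pairs incorrect.
SIGNIFICANT_MIN_BP = 50


@dataclass(frozen=True)
class Discrepancy:
    """One contiguous difference between submitted and gold-standard consensus."""
    position: int   # hypothetical field: start coordinate in the gold standard
    kind: str       # "substitution", "insertion" or "deletion"
    length_bp: int  # number of contiguous base pairs affected


def summarize(discrepancies):
    """Tally error events, base-pair errors and significant errors."""
    significant = [d for d in discrepancies if d.length_bp >= SIGNIFICANT_MIN_BP]
    ordinary = [d for d in discrepancies if d.length_bp < SIGNIFICANT_MIN_BP]
    return {
        # Each contiguous run is one event, however many bases it spans.
        "error_events": len(discrepancies),
        # Significant errors are left out of the base-pair error total.
        "base_pair_errors": sum(d.length_bp for d in ordinary),
        "significant_errors": len(significant),
    }


# Made-up example: one substitution, one 3-bp deletion, one 412-bp deletion.
example = [
    Discrepancy(1_200, "substitution", 1),
    Discrepancy(48_310, "deletion", 3),
    Discrepancy(97_005, "deletion", 412),
]
summary = summarize(example)
print(summary)
```

The 3-bp deletion illustrates why error events and base-pair errors diverge: it contributes one event but three base-pair errors.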
`
`Figure 1 A plot of the error events per 10 kb versus the base-pair errors per 10 kb for
`the clones surveyed. Each green circle represents a different surveyed clone. A
`detailed view of the boxed area (less than one error event per 10 kb and less than 1-bp
`error per 10 kb) shows the diagonal distribution of all of the clones containing only
`single base-pair errors. The red circle indicates 59 clones with no errors.
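
The normalization used for the axes of Figure 1 is a simple per-clone rate. The sketch below shows that arithmetic and a check against the one-error-per-10-kb target; the clone lengths and error counts are invented for illustration and are not data from the study.

```python
def rate_per_10kb(count: int, clone_length_bp: int) -> float:
    """Normalize a per-clone count to a rate per 10,000 bp."""
    if clone_length_bp <= 0:
        raise ValueError("clone_length_bp must be positive")
    return count * 10_000 / clone_length_bp


def meets_accuracy_standard(base_pair_errors: int, clone_length_bp: int) -> bool:
    """True if the clone has less than one base-pair error per 10 kb."""
    return rate_per_10kb(base_pair_errors, clone_length_bp) < 1.0


# Hypothetical clones: (length in bp, base-pair errors found).
hypothetical_clones = {
    "clone_a": (170_000, 3),
    "clone_b": (150_000, 20),
    "clone_c": (160_000, 0),
}

for name, (length_bp, errors) in hypothetical_clones.items():
    rate = rate_per_10kb(errors, length_bp)
    print(f"{name}: {rate:.3f} bp errors per 10 kb, "
          f"meets standard: {meets_accuracy_standard(errors, length_bp)}")
```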
`
`
`Table 1 Cumulative results of each of the quality assessment rounds
`
`Analysis results              Round 1        Round 2        Total
`.......................................................................
`Sequence analysed (kb)        20,303         13,887         34,190
`Clones analysed               117            80             197
`Error events                  183            283            466
`Substitution events           73             135            208
`Insertion/deletion events     110            148            258
`Error event rate (bp)         1/110,948      1/49,069       1/73,369
`Base-pair errors*             255            381            636
`Base-pair error rate          1/79,621       1/36,448       1/53,758
`Significant errors†           5              8              13
`Significant error rate (bp)   1/4,060,688    1/1,735,828    1/2,630,005
`.......................................................................
`* Does not include significant insertions, deletions or rearrangements of sequence.
`† Insertions or deletions of greater than 50 bp or significant rearrangements of sequence.
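
The rates in Table 1 follow directly from the counts: each rate is the sequence analysed divided by the number of errors. The sketch below recomputes them from the rounded kilobase totals, so the results differ slightly from the published figures, which were presumably derived from exact sequence lengths. The dictionary layout and function name are illustrative.

```python
# Counts transcribed from Table 1 (sequence lengths are rounded to the kb).
TABLE_1 = {
    "Round 1": {"kb": 20_303, "error_events": 183,
                "base_pair_errors": 255, "significant_errors": 5},
    "Round 2": {"kb": 13_887, "error_events": 283,
                "base_pair_errors": 381, "significant_errors": 8},
    "Total": {"kb": 34_190, "error_events": 466,
              "base_pair_errors": 636, "significant_errors": 13},
}


def bp_per_error(sequence_kb: int, count: int) -> float:
    """Average number of base pairs analysed per error (the '1/N' rates)."""
    if count == 0:
        raise ValueError("no errors found; rate is undefined")
    return sequence_kb * 1_000 / count


for name, row in TABLE_1.items():
    events = bp_per_error(row["kb"], row["error_events"])
    bases = bp_per_error(row["kb"], row["base_pair_errors"])
    significant = bp_per_error(row["kb"], row["significant_errors"])
    print(f"{name}: 1 event per {events:,.0f} bp, "
          f"1 bp error per {bases:,.0f} bp, "
          f"1 significant error per {significant:,.0f} bp")
```

The two rounds also sum to the totals for every count, which is a useful consistency check when rebuilding the table from the flattened text.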
`
`Significant errors
`We found a significant error in 12 out of the 197 (6.1%) BAC clones
`that we analysed. There were 13 total significant errors in these
`clones (2.8% of the total error events). Most of these were
`identifiable as potential problems from the initial assembly of only the
`contributing centre’s data set. We found large consensus deletions
`that were derived from deleted subclone templates or polymerase
`chain reaction (PCR) amplified products. Long stretches of
`sequence were also deleted as a result of incorrect joins made in
`repetitive regions, through which sequencing was difficult, and joins
`were based on minimal sequence overlap. The distribution of these
`sequence areas that were more difficult to sequence varies across the
`human genome (ref. 12), and consequently, we did not survey difficult
`clones from every centre.
`
`Potential error-prone finishing techniques
`In the course of this quality assessment we identified finishing
`techniques that in some cases directly contributed to consensus
`errors that were not corrected before submission by the centre. A
`large number of the single base-pair-deletion errors were the result
`of G + C compressions from dye-primer chemistry (now phased
`out of use in most centres) or dGTP chemistry (a chemistry for
`difficult-to-sequence regions), or from A or T base drop-out errors
`on the Megabace platform. Some of the larger deletions in simple
`sequence regions were from PCR-generated templates or from
`single subclones that had deleted a portion of the repeat copies.
`Clones consisting of mostly single-direction M13 reads had more
`serious assembly issues in repetitive areas. Higher assembly stringencies
`would have greatly reduced the number of incorrect joins
`and improved the overall accuracy for the identified misassembled
`repeat structures.
`
`Computational quality assessment of two contributors
`In addition to the quality assessments detailed in this paper, the
`Wellcome Trust Sanger Institute and the Joint Genome Institute/
`Stanford Human Genome Center exchanged 38 finished clones over
`the same time period as round one in this study. These two centres
`examined only trace data and built new assemblies to compare to
`the submitted assemblies, and they did not add additional sequencing
`data. Suspected errors were verified by the original submitting
`centre. This study found that, for these two centres, there was on
`average 1 bp error per 651,000 bp and one potential significant error
`in 11.1 Mb. Together these centres contributed about 39% of the
`human genome sequence. Although this analysis is not directly
`comparable with our more detailed study—because computational
`analysis alone is unable to detect all of the errors found with
`additional sequencing (see Supplementary text)—this provides a
`reviewed estimate of error rates for these two centres.
`
`Quality of the finished human genome
`We believe that the quality evaluation methodology outlined in this
`paper provides a uniform framework to evaluate sequence produced
`by the disparate finishing systems used by the IHGSC
`sequencing centres in relation to the standards for finished sequence
`quality. Of the 197 clones analysed, we found that 182 (92.4%)
`significantly exceed the 99.99% accuracy standard, on the basis of a
`calculation of base-pair errors per 10 kb (Fig. 1). If the sampled data
`set is applicable to the entire genome, we can conclude that the base-
`pair accuracy standards have been exceeded tenfold, as there is less
`than 1 bp error per 100,000 bp of finished sequence. If we normalize
`for the relative amounts of sequence contributed by each centre, we
`should expect to find on average seven error events with nine
`incorrect bases per 1 Mb and one significant error per 6 Mb.
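
One plausible reading of the normalization described above is a contribution-weighted average of the per-centre rates. The sketch below shows only that arithmetic. The centre names, rates and shares are invented placeholders, not values from Supplementary Table S2, and the paper does not state that this exact formula was used.

```python
def weighted_rate(rates_per_mb: dict, shares: dict) -> float:
    """Average per-Mb rate, weighting each centre by its share of the sequence."""
    total_share = sum(shares.values())
    if total_share <= 0:
        raise ValueError("shares must sum to a positive value")
    return sum(rates_per_mb[c] * shares[c] for c in shares) / total_share


# Placeholder inputs for illustration only (NOT data from the study).
placeholder_rates = {"centre_a": 5.0, "centre_b": 10.0, "centre_c": 20.0}
placeholder_shares = {"centre_a": 0.5, "centre_b": 0.3, "centre_c": 0.2}

normalized = weighted_rate(placeholder_rates, placeholder_shares)
unweighted = sum(placeholder_rates.values()) / len(placeholder_rates)
print(f"weighted: {normalized:.2f} per Mb, unweighted: {unweighted:.2f} per Mb")
```

With these placeholder inputs the weighted rate is lower than the unweighted mean, because the centre with the largest share also has the lowest rate. A shift of this kind could explain how a normalized estimate ends up below the pooled rate of a sample.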
`We believe that caution should be exercised in extrapolating our
`data beyond the specific regions of the genome that were surveyed in
`our study. This quality assessment is an evaluation of process, as it
`was based on a methodology of sampling sequence production over
`time, not sampling uniformly from the finished product. As such,
`our results are a reflection of the finishing methodologies used by
`the centres for the time period evaluated in our study, and these
`methodologies were subject to continuous improvement. For the
`centres investigated in round one, we sampled from a single
`production period, whereas clones submitted early and late in the
`HGP were not sampled; these clones are more likely to contain a
`higher error count. Along with improvements to knowledge and
`technology, the goals of the overall HGP changed over the course of
`the project, and the quality threshold used by the sequencing centres
`fluctuated in response to these production goals. In addition, our
`thorough quality evaluation methodology (which included
`additional shotgun sequencing) was applied only to sequencing
`centres contributing 55% of the total human sequence, with an
`additional 39% assessed by computational evaluation alone. No
`sequence was surveyed from the 6% of the genome finished by many
`smaller contributors.
`As a result of differences in the application of finishing methodologies,
`the most significant factor correlated with sequence quality is
`centre-to-centre variation (Supplementary Table S2). The sequencing
`centres had different thresholds at which they determined a given
`region to be ‘finished’, and some centres intentionally exceeded the
`required quality levels (Table S2). The stringency of the contiguity
`threshold (the cause of most of the significant errors identified)
`applied by each group was the result of an admixture of production
`pressures, ‘regional’ complexity of their genomic territory, personnel
`experience in addressing the variations in finishing difficulty,
`and degree of communication and standardization of the group
`with the larger HGP community. The nature of the HGP as a pilot
`project for large-scale genomic sequencing makes it difficult to
`describe the quality of the human genome sequence as a singular
`entity, although this evaluation has provided valuable insights into
`the process of producing a complete, complex, finished genome
`sequence.
`
`
`Applications to future projects
`Well-defined finishing standards specifying targets for accuracy,
`singular contiguity and fidelity—coupled with descriptions of
`processes that enable greater compliance with these standards—
`will enable future genome sequencing projects to generate a
`more uniform quality product. Continuous sampling of finished
`sequence for quality evaluation throughout the production
`process of future genome sequencing projects would better
`enable global quality statements to be made, and subsequent
`incorporation of quality assessment feedback into the process
`could further enhance the quality of the product. Standardizing
`or minimizing a number of variables—genome source, cloning
`and library construction platforms, hierarchical sequencing
`strategies, definitions of finished product—will help further.
`Such procedures will enhance not only verification ability on
`a clone or regional basis, but will also be tremendously helpful
`in solving recalcitrant problems such as the resolution of large
`duplicated genomic structures. As new genome-sequencing
`techniques emerge over the course of a sequencing project
`(for example, cloning vectors, sequence chemistries, detection
`platforms, finishing techniques), a centralized quality-control
`centre could serve as a resource for evaluating the technique’s
`relative ability to ensure fidelity with the genomic sequence,
`rather than each centre independently examining and evaluating
`all new technologies. In this capacity the quality-control centre
`would serve as a distributor of reviews and test performance
`reports for technological developments, which would allow
`all sequencing centres equal access to information about
`these techniques. A central trace data repository, such as the
`NCBI trace archive, is a positive step towards making all raw
`sequencing trace data available, but also storing the final
`assemblies would enable central coordination of gap-closing
`efforts and allow centres to concentrate on the finishing
`problems that they have developed pipelines to address, instead
`of expecting each centre to apply these complicated techniques
`to an equal standard.
`
`Received 24 October 2003; accepted 26 January 2004; doi:10.1038/nature02390.
`
`1. Ewing, B. & Green, P. Base-calling of automated sequencer traces using Phred. II. Error probabilities.
`Genome Res. 8, 186–194 (1998).
`2. Ewing, B., Hillier, L., Wendl, M. & Green, P. Base-calling of automated sequencer traces using Phred.
`I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
`3. Gordon, D., Abajian, C. & Green, P. Consed: A graphical tool for sequence finishing. Genome Res. 8,
`195–202 (1998).
`4. Felsenfeld, A., Peterson, J., Schloss, J. & Guyer, M. Assessing the quality of the DNA sequence from the
`Human Genome Project. Genome Res. 9, 1–4 (1999).
`5. Deloukas, P. et al. The DNA sequence and comparative analysis of human chromosome 20. Nature
`414, 865–871 (2001).
`6. Dunham, I. et al. The DNA sequence of human chromosome 22. Nature 402, 489–495 (1999).
`7. Hattori, M. et al. The DNA sequence of human chromosome 21. Nature 405, 311–319 (2000).
`8. Heilig, R. et al. The DNA sequence and analysis of human chromosome 14. Nature 421, 601–607
`(2003).
`9. Hillier, L. et al. The DNA sequence of human chromosome 7. Nature 424, 157–164 (2003).
`10. Mungall, A. J. et al. The DNA sequence and analysis of human chromosome 6. Nature 425, 805–811
`(2003).
`11. Skaletsky, H. et al. The male-specific region of the human Y chromosome is a mosaic of discrete
`sequence classes. Nature 423, 825–837 (2003).
`12. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human
`genome. Nature 409, 860–921 (2001).
`13. Dunham, A. et al. The DNA sequence and analysis of human chromosome 13. Nature 428, 493–521
`(2004).
`14. Grimwood, J. et al. The DNA sequence and biology of human chromosome 19. Nature 428, 529–535
`(2004).
`
`Supplementary Information accompanies the paper on www.nature.com/nature.
`
`Acknowledgements We thank the participating centres for providing clone stocks and sequence
`data sets, and for their feedback about our quality assessment process. We also thank C. Lloyd and
`C. Bagguley for their detailed computational assessment of our finished sequence.
`
`Competing interests statement The authors declare that they have no competing financial
`interests.
`
`Correspondence and requests for materials should be addressed to J.S.
`(jeremy@shgc.stanford.edu) or R.M.M. (myers@shgc.stanford.edu).
`