`
`http://science.sciencemag.org/
`
`
`
`on November 6, 2018
`
`T H E H U M A N G E N O M E
`
`The Sequence of the Human Genome
`J. Craig Venter,1* Mark D. Adams,1 Eugene W. Myers,1 Peter W. Li,1 Richard J. Mural,1
`Granger G. Sutton,1 Hamilton O. Smith,1 Mark Yandell,1 Cheryl A. Evans,1 Robert A. Holt,1
`Jeannine D. Gocayne,1 Peter Amanatides,1 Richard M. Ballew,1 Daniel H. Huson,1
`Jennifer Russo Wortman,1 Qing Zhang,1 Chinnappa D. Kodira,1 Xiangqun H. Zheng,1 Lin Chen,1
`Marian Skupski,1 Gangadharan Subramanian,1 Paul D. Thomas,1 Jinghui Zhang,1
`George L. Gabor Miklos,2 Catherine Nelson,3 Samuel Broder,1 Andrew G. Clark,4 Joe Nadeau,5
`Victor A. McKusick,6 Norton Zinder,7 Arnold J. Levine,7 Richard J. Roberts,8 Mel Simon,9
`Carolyn Slayman,10 Michael Hunkapiller,11 Randall Bolanos,1 Arthur Delcher,1 Ian Dew,1 Daniel Fasulo,1
`Michael Flanigan,1 Liliana Florea,1 Aaron Halpern,1 Sridhar Hannenhalli,1 Saul Kravitz,1 Samuel Levy,1
`Clark Mobarry,1 Knut Reinert,1 Karin Remington,1 Jane Abu-Threideh,1 Ellen Beasley,1 Kendra Biddick,1
`Vivien Bonazzi,1 Rhonda Brandon,1 Michele Cargill,1 Ishwar Chandramouliswaran,1 Rosane Charlab,1
`Kabir Chaturvedi,1 Zuoming Deng,1 Valentina Di Francesco,1 Patrick Dunn,1 Karen Eilbeck,1
`Carlos Evangelista,1 Andrei E. Gabrielian,1 Weiniu Gan,1 Wangmao Ge,1 Fangcheng Gong,1 Zhiping Gu,1
`Ping Guan,1 Thomas J. Heiman,1 Maureen E. Higgins,1 Rui-Ru Ji,1 Zhaoxi Ke,1 Karen A. Ketchum,1
`Zhongwu Lai,1 Yiding Lei,1 Zhenya Li,1 Jiayin Li,1 Yong Liang,1 Xiaoying Lin,1 Fu Lu,1
`Gennady V. Merkulov,1 Natalia Milshina,1 Helen M. Moore,1 Ashwinikumar K Naik,1
`Vaibhav A. Narayan,1 Beena Neelam,1 Deborah Nusskern,1 Douglas B. Rusch,1 Steven Salzberg,12
`Wei Shao,1 Bixiong Shue,1 Jingtao Sun,1 Zhen Yuan Wang,1 Aihui Wang,1 Xin Wang,1 Jian Wang,1
`Ming-Hui Wei,1 Ron Wides,13 Chunlin Xiao,1 Chunhua Yan,1 Alison Yao,1 Jane Ye,1 Ming Zhan,1
`Weiqing Zhang,1 Hongyu Zhang,1 Qi Zhao,1 Liansheng Zheng,1 Fei Zhong,1 Wenyan Zhong,1
`Shiaoping C. Zhu,1 Shaying Zhao,12 Dennis Gilbert,1 Suzanna Baumhueter,1 Gene Spier,1
`Christine Carter,1 Anibal Cravchik,1 Trevor Woodage,1 Feroze Ali,1 Huijin An,1 Aderonke Awe,1
`Danita Baldwin,1 Holly Baden,1 Mary Barnstead,1 Ian Barrow,1 Karen Beeson,1 Dana Busam,1
`Amy Carver,1 Angela Center,1 Ming Lai Cheng,1 Liz Curry,1 Steve Danaher,1 Lionel Davenport,1
`Raymond Desilets,1 Susanne Dietz,1 Kristina Dodson,1 Lisa Doup,1 Steven Ferriera,1 Neha Garg,1
`Andres Gluecksmann,1 Brit Hart,1 Jason Haynes,1 Charles Haynes,1 Cheryl Heiner,1 Suzanne Hladun,1
`Damon Hostin,1 Jarrett Houck,1 Timothy Howland,1 Chinyere Ibegwam,1 Jeffery Johnson,1
`Francis Kalush,1 Lesley Kline,1 Shashi Koduru,1 Amy Love,1 Felecia Mann,1 David May,1
`Steven McCawley,1 Tina McIntosh,1 Ivy McMullen,1 Mee Moy,1 Linda Moy,1 Brian Murphy,1
`Keith Nelson,1 Cynthia Pfannkoch,1 Eric Pratts,1 Vinita Puri,1 Hina Qureshi,1 Matthew Reardon,1
`Robert Rodriguez,1 Yu-Hui Rogers,1 Deanna Romblad,1 Bob Ruhfel,1 Richard Scott,1 Cynthia Sitter,1
`Michelle Smallwood,1 Erin Stewart,1 Renee Strong,1 Ellen Suh,1 Reginald Thomas,1 Ni Ni Tint,1
`Sukyee Tse,1 Claire Vech,1 Gary Wang,1 Jeremy Wetter,1 Sherita Williams,1 Monica Williams,1
`Sandra Windsor,1 Emily Winn-Deen,1 Keriellen Wolfe,1 Jayshree Zaveri,1 Karena Zaveri,1
Josep F. Abril,14 Roderic Guigó,14 Michael J. Campbell,1 Kimmen V. Sjolander,1 Brian Karlak,1
`Anish Kejariwal,1 Huaiyu Mi,1 Betty Lazareva,1 Thomas Hatton,1 Apurva Narechania,1 Karen Diemer,1
`Anushya Muruganujan,1 Nan Guo,1 Shinji Sato,1 Vineet Bafna,1 Sorin Istrail,1 Ross Lippert,1
`Russell Schwartz,1 Brian Walenz,1 Shibu Yooseph,1 David Allen,1 Anand Basu,1 James Baxendale,1
`Louis Blick,1 Marcelo Caminha,1 John Carnes-Stine,1 Parris Caulk,1 Yen-Hui Chiang,1 My Coyne,1
`Carl Dahlke,1 Anne Deslattes Mays,1 Maria Dombroski,1 Michael Donnelly,1 Dale Ely,1 Shiva Esparham,1
`Carl Fosler,1 Harold Gire,1 Stephen Glanowski,1 Kenneth Glasser,1 Anna Glodek,1 Mark Gorokhov,1
`Ken Graham,1 Barry Gropman,1 Michael Harris,1 Jeremy Heil,1 Scott Henderson,1 Jeffrey Hoover,1
`Donald Jennings,1 Catherine Jordan,1 James Jordan,1 John Kasha,1 Leonid Kagan,1 Cheryl Kraft,1
`Alexander Levitsky,1 Mark Lewis,1 Xiangjun Liu,1 John Lopez,1 Daniel Ma,1 William Majoros,1
`Joe McDaniel,1 Sean Murphy,1 Matthew Newman,1 Trung Nguyen,1 Ngoc Nguyen,1 Marc Nodell,1
`Sue Pan,1 Jim Peck,1 Marshall Peterson,1 William Rowe,1 Robert Sanders,1 John Scott,1
`Michael Simpson,1 Thomas Smith,1 Arlan Sprague,1 Timothy Stockwell,1 Russell Turner,1 Eli Venter,1
`Mei Wang,1 Meiyuan Wen,1 David Wu,1 Mitchell Wu,1 Ashley Xia,1 Ali Zandieh,1 Xiaohong Zhu1
`
`1304
`
`16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org
`
`Agilent Exhibit 1289
`Page 1 of 51
`
`A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of
`the human genome was generated by the whole-genome shotgun sequencing
`method. The 14.8-billion bp DNA sequence was generated over 9 months from
`27,271,853 high-quality sequence reads (5.11-fold coverage of the genome)
from both ends of plasmid clones made from the DNA of five individuals. Two
assembly strategies, a whole-genome assembly and a regional chromosome
assembly, were used, each combining sequence data from Celera and the
`publicly funded genome effort. The public data were shredded into 550-bp
`segments to create a 2.9-fold coverage of those genome regions that had been
`sequenced, without including biases inherent in the cloning and assembly
`procedure used by the publicly funded group. This brought the effective cov-
`erage in the assemblies to eightfold, reducing the number and size of gaps in
the final assembly over what would be obtained with 5.11-fold coverage. The
`two assembly strategies yielded very similar results that largely agree with
`independent mapping data. The assemblies effectively cover the euchromatic
`regions of the human chromosomes. More than 90% of the genome is in
`scaffold assemblies of 100,000 bp or more, and 25% of the genome is in
`scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed
`26,588 protein-encoding transcripts for which there was strong corroborating
evidence and an additional ~12,000 computationally derived genes with mouse
matches or other weak supporting evidence. Although gene-dense clusters are
obvious, almost half the genes are dispersed in low G+C sequence separated
`by large tracts of apparently noncoding sequence. Only 1.1% of the genome
`is spanned by exons, whereas 24% is in introns, with 75% of the genome being
`intergenic DNA. Duplications of segmental blocks, ranging in size up to chro-
`mosomal lengths, are abundant throughout the genome and reveal a complex
evolutionary history. Comparative genomic analysis indicates vertebrate
expansions of genes associated with neuronal function, with tissue-specific
developmental regulation, and with the hemostasis and immune systems. DNA
`sequence comparisons between the consensus sequence and publicly funded
`genome data provided locations of 2.1 million single-nucleotide polymorphisms
`(SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per
`1250 on average, but there was marked heterogeneity in the level of poly-
`morphism across the genome. Less than 1% of all SNPs resulted in variation in
`proteins, but the task of determining which SNPs have functional consequences
`remains an open challenge.
`
1Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA.
2GenetixXpress, 78 Pacific Road, Palm Beach, Sydney 2108, Australia.
3Berkeley Drosophila Genome Project, University of California, Berkeley, CA 94720, USA.
4Department of Biology, Penn State University, 208 Mueller Lab, University Park, PA 16802, USA.
5Department of Genetics, Case Western Reserve University School of Medicine, BRB-630, 10900 Euclid Avenue, Cleveland, OH 44106, USA.
6Johns Hopkins University School of Medicine, Johns Hopkins Hospital, 600 North Wolfe Street, Blalock 1007, Baltimore, MD 21287-4922, USA.
7Rockefeller University, 1230 York Avenue, New York, NY 10021-6399, USA.
8New England BioLabs, 32 Tozer Road, Beverly, MA 01915, USA.
9Division of Biology, 147-75, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA.
10Yale University School of Medicine, 333 Cedar Street, P.O. Box 208000, New Haven, CT 06520-8000, USA.
11Applied Biosystems, 850 Lincoln Centre Drive, Foster City, CA 94404, USA.
12The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA.
13Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, 52900 Israel.
14Grup de Recerca en Informàtica Mèdica, Institut Municipal d'Investigació Mèdica, Universitat Pompeu Fabra, 08003-Barcelona, Catalonia, Spain.

*To whom correspondence should be addressed. E-mail: humangenome@celera.com

Decoding of the DNA that constitutes the
human genome has been widely anticipated
for the contribution it will make toward
understanding human evolution, the causation
`of disease, and the interplay between the
`environment and heredity in defining the hu-
`man condition. A project with the goal of
`determining the complete nucleotide se-
`quence of the human genome was first for-
`mally proposed in 1985 (1). In subsequent
`years, the idea met with mixed reactions in
`the scientific community (2). However, in
`1990, the Human Genome Project (HGP) was
`officially initiated in the United States under
`the direction of the National Institutes of
`Health and the U.S. Department of Energy
`with a 15-year, $3 billion plan for completing
`the genome sequence. In 1998 we announced
our intention to build a unique genome-sequencing
facility, to determine the sequence
of the human genome over a 3-year
`period. Here we report the penultimate mile-
`stone along the path toward that goal, a nearly
`complete sequence of the euchromatic por-
`tion of the human genome. The sequencing
`was performed by a whole-genome random
`shotgun method with subsequent assembly of
`the sequenced segments.
`The modern history of DNA sequencing
`began in 1977, when Sanger reported his meth-
od for determining the order of nucleotides of
DNA using chain-terminating nucleotide ana-
`logs (3). In the same year, the first human gene
`was isolated and sequenced (4). In 1986, Hood
`and co-workers (5) described an improvement
`in the Sanger sequencing method that included
`attaching fluorescent dyes to the nucleotides,
`which permitted them to be sequentially read
`by a computer. The first automated DNA se-
`quencer, developed by Applied Biosystems in
`California in 1987, was shown to be successful
`when the sequences of two genes were obtained
`with this new technology (6). From early se-
`quencing of human genomic regions (7), it
`became clear that cDNA sequences (which are
`reverse-transcribed from RNA) would be es-
`sential to annotate and validate gene predictions
`in the human genome. These studies were the
`basis in part for the development of the ex-
`pressed sequence tag (EST) method of gene
`identification (8), which is a random selection,
`very high throughput sequencing approach to
`characterize cDNA libraries. The EST method
`led to the rapid discovery and mapping of hu-
`man genes (9). The increasing numbers of hu-
`man EST sequences necessitated the develop-
`ment of new computer algorithms to analyze
`large amounts of sequence data, and in 1993 at
`The Institute for Genomic Research (TIGR), an
`algorithm was developed that permitted assem-
`bly and analysis of hundreds of thousands of
`ESTs. This algorithm permitted characteriza-
`tion and annotation of human genes on the basis
`of 30,000 EST assemblies (10).
`The complete 49-kbp bacteriophage lamb-
`da genome sequence was determined by a
`shotgun restriction digest method in 1982
`(11). When considering methods for sequenc-
`ing the smallpox virus genome in 1991 (12),
`a whole-genome shotgun sequencing method
`was discussed and subsequently rejected ow-
`ing to the lack of appropriate software tools
for genome assembly. However, in 1994,
when a microbial genome-sequencing project
`was contemplated at TIGR, a whole-genome
`shotgun sequencing approach was considered
`possible with the TIGR EST assembly algo-
`rithm. In 1995, the 1.8-Mbp Haemophilus
`influenzae genome was completed by a
`whole-genome shotgun sequencing method
`(13). The experience with several subsequent
`genome-sequencing efforts established the
`broad applicability of this approach (14, 15).
`A key feature of the sequencing approach
`used for these megabase-size and larger ge-
`nomes was the use of paired-end sequences
`(also called mate pairs), derived from sub-
`clone libraries with distinct insert sizes and
`cloning characteristics. Paired-end sequences
`are sequences 500 to 600 bp in length from
`both ends of double-stranded DNA clones of
`prescribed lengths. The success of using end
`sequences from long segments (18 to 20 kbp)
`of DNA cloned into bacteriophage lambda in
`assembly of the microbial genomes led to the
suggestion (16) of an approach to simultaneously
map and sequence the human genome by means of
end sequences from 150-kbp bacterial artificial chromosomes (BACs)
`(17, 18). The end sequences spanned by
`known distances provide long-range continu-
`ity across the genome. A modification of the
`BAC end-sequencing (BES) method was ap-
`plied successfully to complete chromosome 2
`from the Arabidopsis thaliana genome (19).
`In 1997, Weber and Myers (20) proposed
`whole-genome shotgun sequencing of the
`human genome. Their proposal was not well
`received (21). However, by early 1998, as
`less than 5% of the genome had been se-
`quenced, it was clear that the rate of progress
`in human genome sequencing worldwide
`was very slow (22), and the prospects for
`finishing the genome by the 2005 goal were
`uncertain.
`In early 1998, PE Biosystems (now Applied
`Biosystems) developed an automated, high-
`throughput capillary DNA sequencer, subse-
`quently called the ABI PRISM 3700 DNA
`Analyzer. Discussions between PE Biosystems
`and TIGR scientists resulted in a plan to under-
`take the sequencing of the human genome with
`the 3700 DNA Analyzer and the whole-genome
`shotgun sequencing techniques developed at
`TIGR (23). Many of the principles of operation
`of a genome-sequencing facility were estab-
`lished in the TIGR facility (24). However, the
`facility envisioned for Celera would have a
`capacity roughly 50 times that of TIGR, and
`thus new developments were required for sam-
`ple preparation and tracking and for whole-
`genome assembly. Some argued that the re-
`quired 150-fold scale-up from the H. influenzae
`genome to the human genome with its complex
`repeat sequences was not feasible (25). The
`Drosophila melanogaster genome was thus
`chosen as a test case for whole-genome assem-
`bly on a large and complex eukaryotic genome.
`In collaboration with Gerald Rubin and the
`Berkeley Drosophila Genome Project, the nu-
`cleotide sequence of the 120-Mbp euchromatic
`portion of the Drosophila genome was deter-
`mined over a 1-year period (26–28). The Dro-
`sophila genome-sequencing effort resulted in
`two key findings: (i) that the assembly algo-
`rithms could generate chromosome assemblies
`with highly accurate order and orientation with
`substantially less than 10-fold coverage, and (ii)
`that undertaking multiple interim assemblies in
`place of one comprehensive final assembly was
`not of value.
`These findings, together with the dramatic
`changes in the public genome effort subsequent
`to the formation of Celera (29), led to a modi-
`fied whole-genome shotgun sequencing ap-
`proach to the human genome. We initially pro-
`posed to do 10-fold sequence coverage of the
`genome over a 3-year period and to make in-
`terim assembled sequence data available quar-
terly. The modifications included a plan to perform
random shotgun sequencing to ~5-fold coverage and
to use the unordered and unoriented BAC sequence
fragments and subassemblies published in GenBank by the publicly
`funded genome effort (30) to accelerate the
`project. We also abandoned the quarterly an-
`nouncements in the absence of interim assem-
`blies to report.
`Although this strategy provided a reason-
`able result very early that was consistent with a
`whole-genome shotgun assembly with eight-
`fold coverage, the human genome sequence is
`not as finished as the Drosophila genome was
`with an effective 13-fold coverage. However, it
`became clear that even with this reduced cov-
`erage strategy, Celera could generate an accu-
`rately ordered and oriented scaffold sequence of
`the human genome in less than 1 year. Human
`genome sequencing was initiated 8 September
`1999 and completed 17 June 2000. The first
`assembly was completed 25 June 2000, and the
`assembly reported here was completed 1 Octo-
`ber 2000. Here we describe the whole-genome
`random shotgun sequencing effort applied to
`the human genome. We developed two differ-
ent assembly approaches for assembling the ~3
`billion bp that make up the 23 pairs of chromo-
`somes of the Homo sapiens genome. Any Gen-
`Bank-derived data were shredded to remove
`potential bias to the final sequence from chi-
`meric clones, foreign DNA contamination, or
`misassembled contigs. Insofar as a correctly
`and accurately assembled genome sequence
`with faithful order and orientation of contigs
`is essential for an accurate analysis of the
`human genetic code, we have devoted a con-
`siderable portion of this manuscript to the
`documentation of the quality of our recon-
`struction of the genome. We also describe our
`preliminary analysis of the human genetic
`code on the basis of computational methods.
`Figure 1 (see fold-out chart associated with
`this issue; files for each chromosome can be
`found in Web fig. 1 on Science Online at
`www.sciencemag.org/cgi/content/full/291/
`5507/1304/DC1) provides a graphical over-
`view of the genome and the features encoded
`in it. The detailed manual curation and inter-
`pretation of the genome are just beginning.
`To aid the reader in locating specific an-
`alytical sections, we have divided the paper
`into seven broad sections. A summary of the
`major results appears at the beginning of each
`section.
`
1 Sources of DNA and Sequencing Methods
2 Genome Assembly Strategy and Characterization
3 Gene Prediction and Annotation
4 Genome Structure
5 Genome Evolution
6 A Genome-Wide Examination of Sequence Variations
7 An Overview of the Predicted Protein-Coding Genes in the Human Genome
8 Conclusions
`
`1 Sources of DNA and Sequencing
`Methods
`Summary. This section discusses the rationale
`and ethical rules governing donor selection to
`ensure ethnic and gender diversity along with
`the methodologies for DNA extraction and li-
`brary construction. The plasmid library con-
`struction is the first critical step in shotgun
`sequencing. If the DNA libraries are not uni-
`form in size, nonchimeric, and do not randomly
`represent the genome, then the subsequent steps
`cannot accurately reconstruct the genome se-
`quence. We used automated high-throughput
`DNA sequencing and the computational infra-
`structure to enable efficient tracking of enor-
`mous amounts of sequence information (27.3
`million sequence reads; 14.9 billion bp of se-
`quence). Sequencing and tracking from both
`ends of plasmid clones from 2-, 10-, and 50-kbp
`libraries were essential to the computational
`reconstruction of the genome. Our evidence
`indicates that the accurate pairing rate of end
`sequences was greater than 98%.
`
`Various policies of the United States and the
`World Medical Association, specifically the
`Declaration of Helsinki, offer recommenda-
`tions for conducting experiments with human
`subjects. We convened an Institutional Re-
`view Board (IRB) (31) that helped us estab-
`lish the protocol for obtaining and using hu-
`man DNA and the informed consent process
`used to enroll research volunteers for the
`DNA-sequencing studies reported here. We
`adopted several steps and procedures to pro-
`tect the privacy rights and confidentiality of
`the research subjects (donors). These includ-
`ed a two-stage consent process, a secure ran-
`dom alphanumeric coding system for speci-
`mens and records, circumscribed contact with
`the subjects by researchers, and options for
`off-site contact of donors. In addition, Celera
`applied for and received a Certificate of Con-
`fidentiality from the Department of Health
`and Human Services. This Certificate autho-
`rized Celera to protect the privacy of the
`individuals who volunteered to be donors as
`provided in Section 301(d) of the Public
`Health Service Act 42 U.S.C. 241(d).
`Celera and the IRB believed that the ini-
`tial version of a completed human genome
`should be a composite derived from multiple
donors of diverse ethnic backgrounds. Prospective
donors were asked, on a voluntary
`basis, to self-designate an ethnogeographic
`category (e.g., African-American, Chinese,
`Hispanic, Caucasian, etc.). We enrolled 21
`donors (32).
`Three basic items of information from
`each donor were recorded and linked by con-
`fidential code to the donated sample: age,
`sex, and self-designated ethnogeographic
group. From females, ~130 ml of whole,
heparinized blood was collected. From males,
~130 ml of whole, heparinized blood was
collected, as well as five specimens of semen,
collected over a 6-week period. Permanent
lymphoblastoid cell lines were created by
Epstein-Barr virus immortalization. DNA
`from five subjects was selected for genomic
`DNA sequencing: two males and three fe-
`males—one African-American, one Asian-
`Chinese, one Hispanic-Mexican, and two
`Caucasians (see Web fig. 2 on Science Online
`at www.sciencemag.org/cgi/content/291/5507/
`1304/DC1). The decision of whose DNA to
`sequence was based on a complex mix of fac-
`tors, including the goal of achieving diversity as
`well as technical issues such as the quality of
`the DNA libraries and availability of immortal-
`ized cell lines.
`
`1.1 Library construction and
`sequencing
`Central to the whole-genome shotgun sequenc-
`ing process is preparation of high-quality plas-
`mid libraries in a variety of insert sizes so that
`pairs of sequence reads (mates) are obtained,
`one read from both ends of each plasmid insert.
`High-quality libraries have an equal representa-
`tion of all parts of the genome, a small number
`of clones without inserts, and no contamination
`from such sources as the mitochondrial genome
`and Escherichia coli genomic DNA. DNA from
`each donor was used to construct plasmid librar-
`ies in one or more of three size classes: 2 kbp, 10
`kbp, and 50 kbp (Table 1) (33).
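
To illustrate how mated reads from libraries of known insert size can constrain an assembly, the following minimal Python sketch tests whether two mates placed on the same contig are consistent with their library. The mean insert sizes and standard deviations are those reported in Table 1; the `Placement` type, the function name, the three-SD tolerance, and the requirement that mates face each other are illustrative assumptions and are not taken from the assembly procedure described in this paper.

```python
from dataclasses import dataclass

# Mean insert size (bp) and SD as a fraction of the mean, as reported in Table 1.
LIBRARIES = {
    "2kbp": (1_951, 0.061),
    "10kbp": (10_800, 0.081),
    "50kbp": (50_715, 0.149),
}


@dataclass(frozen=True)
class Placement:
    """Hypothetical placement of one read on a contig."""
    start: int      # leftmost contig coordinate of the read
    end: int        # rightmost contig coordinate (exclusive)
    forward: bool   # True if the read aligns to the forward strand


def mates_consistent(a: Placement, b: Placement, library: str,
                     n_sd: float = 3.0) -> bool:
    """Return True if two mated reads agree with their library's insert size.

    Assumptions (not from the paper): mates must face each other, and the
    distance spanned must lie within n_sd standard deviations of the mean.
    """
    mean, sd_fraction = LIBRARIES[library]
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Mates are read from opposite ends of one insert, so the left read
    # should be on the forward strand and the right read on the reverse.
    if not (left.forward and not right.forward):
        return False
    span = right.end - left.start
    return abs(span - mean) <= n_sd * sd_fraction * mean
```

A pair spanning about 1.9 kbp would pass for the 2-kbp library and fail for the 10-kbp library, which is the kind of check that makes accurate mate tracking valuable.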
`In designing the DNA-sequencing pro-
`cess, we focused on developing a simple
`system that could be implemented in a robust
`and reproducible manner and monitored ef-
`fectively (Fig. 2) (34).

Table 1. Celera-generated data input into assembly.

Current sequencing protocols are based on
the dideoxy sequencing method (35), which
`typically yields only 500 to 750 bp of sequence
`per reaction. This limitation on read length has
`made monumental gains in throughput a pre-
`requisite for the analysis of large eukaryotic
`genomes. We accomplished this at the Celera
`facility, which occupies about 30,000 square
`feet of laboratory space and produces sequence
`data continuously at a rate of 175,000 total
`reads per day. The DNA-sequencing facility is
`supported by a high-performance computation-
`al facility (36).
`The process for DNA sequencing was mod-
`ular by design and automated. Intermodule
sample backlogs allowed four principal
modules to operate independently: (i) library
transformation, plating, and colony picking;
(ii) DNA template preparation; (iii) dideoxy
sequencing reaction set-up and purification;
and (iv) sequence determination with the ABI
PRISM 3700 DNA Analyzer. Because the inputs
and outputs of each module have been carefully
`matched and sample backlogs are continu-
`ously managed, sequencing has proceeded
`without a single day’s interruption since the
`initiation of the Drosophila project in May
`1999. The ABI 3700 is a fully automated
`capillary array sequencer and as such can
`be operated with a minimal amount of
`hands-on time, currently estimated at about
`15 min per day. The capillary system also
`facilitates correct associations of sequenc-
`ing traces with samples through the elimi-
`nation of manual sample loading and lane-
`tracking errors associated with slab gels.
`About 65 production staff were hired and
trained, and were rotated on a regular basis
through the four production modules. A
`central laboratory information management
`system (LIMS) tracked all sample plates by
`unique bar code identifiers. The facility was
`supported by a quality control team that per-
`formed raw material and in-process testing
`and a quality assurance group with responsi-
`bilities including document control, valida-
`tion, and auditing of the facility. Critical to
`the success of the scale-up was the validation
`of all software and instrumentation before
`implementation, and production-scale testing
`of any process changes.
`
`1.2 Trace processing
`An automated trace-processing pipeline has
`been developed to process each sequence file
`(37). After quality and vector trimming, the
`average trimmed sequence length was 543
`bp, and the sequencing accuracy was expo-
`nentially distributed with a mean of 99.5%
`and with less than 1 in 1000 reads being less
`than 98% accurate (26). Each trimmed se-
`quence was screened for matches to contam-
`inants including sequences of vector alone, E.
`coli genomic DNA, and human mitochondri-
`al DNA. The entire read for any sequence
`with a significant match to a contaminant was
`discarded. A total of 713 reads matched E.
`coli genomic DNA and 2114 reads matched
`the human mitochondrial genome.
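
The fold sequence coverage reported in Table 1 can be checked from the read counts and the average trimmed read length given above. The short Python sketch below does this arithmetic; the constant and function names are illustrative, and the 2.9-Gb genome size is the value stated in the Table 1 row heading.

```python
GENOME_SIZE_BP = 2.9e9        # genome size used for Table 1
MEAN_TRIMMED_READ_BP = 543    # average trimmed read length (Section 1.2)

# Total reads per insert library, from Table 1.
READS_PER_LIBRARY = {
    "2 kbp": 13_543_099,
    "10 kbp": 10_894_467,
    "50 kbp": 2_834_287,
}


def fold_coverage(n_reads: int,
                  read_len: int = MEAN_TRIMMED_READ_BP,
                  genome_size: float = GENOME_SIZE_BP) -> float:
    """Fold sequence coverage = total sequenced bases / genome size."""
    return n_reads * read_len / genome_size


total_reads = sum(READS_PER_LIBRARY.values())
total_bp = total_reads * MEAN_TRIMMED_READ_BP

for library, n_reads in READS_PER_LIBRARY.items():
    print(f"{library}: {fold_coverage(n_reads):.2f}-fold")
print(f"total: {total_reads:,} reads, {total_bp:,} bp, "
      f"{fold_coverage(total_reads):.2f}-fold")
```

With these inputs the per-library values come to 2.54-, 2.04-, and 0.53-fold, and the total to 27,271,853 reads, 14,808,616,179 bp, and 5.11-fold, in agreement with Table 1.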
`
`1.3 Quality assessment and control
`The importance of the base-pair level ac-
`curacy of the sequence data increases as the
`size and repetitive nature of the genome to
`be sequenced increases. Each sequence
read must be placed uniquely in the genome,
and even a modest error rate can
reduce the effectiveness of assembly. In
addition, maintaining the validity of mate-pair
information is absolutely critical for
the algorithms described below. Procedural
controls were established for maintaining
the validity of sequence mate-pairs as sequencing
reactions proceeded through the
process, including strict rules built into the
LIMS. The accuracy of sequence data produced
by the Celera process was validated
in the course of the Drosophila genome
project (26). By collecting data for the
entire human genome in a single facility,
we were able to ensure uniform quality
standards and the cost advantages associated
with automation, an economy of scale,
and process consistency.

Table 1.

                                     Number of reads for different insert libraries
                          Individual        2 kbp      10 kbp      50 kbp       Total  Total number of base pairs
No. of sequencing reads   A                     0           0   2,767,357   2,767,357               1,502,674,851
                          B            11,736,757   7,467,755      66,930  19,271,442              10,464,393,006
                          C               853,819     881,290           0   1,735,109                 942,164,187
                          D               952,523   1,046,815           0   1,999,338               1,085,640,534
                          F                     0   1,498,607           0   1,498,607                 813,743,601
                          Total        13,543,099  10,894,467   2,834,287  27,271,853              14,808,616,179
Fold sequence coverage    A                     0           0        0.52        0.52
(2.9-Gb genome)           B                  2.20        1.40        0.01        3.61
                          C                  0.16        0.17           0        0.32
                          D                  0.18        0.20           0        0.37
                          F                     0        0.28           0        0.28
                          Total              2.54        2.04        0.53        5.11
Fold clone coverage       A                     0           0       18.39       18.39
                          B                  2.96       11.26        0.44       14.67
                          C                  0.22        1.33           0        1.54
                          D                  0.24        1.58           0        1.82
                          F                     0        2.26           0        2.26
                          Total              3.42       16.43       18.84       38.68
Insert size* (mean)       Average        1,951 bp   10,800 bp   50,715 bp
Insert size* (SD)         Average           6.10%       8.10%      14.90%
% Mates†                  Average           74.50       80.80       75.60

*Insert size and SD are calculated from assembly of mates on contigs.
†% Mates is based on laboratory tracking of sequencing runs.
`
`2 Genome Assembly Strategy and
`Characterization
`Summary. We describe in this section the two
`approaches that we used to assemble the ge-
`nome. One method involves the computational
`combination of all sequence reads with shred-
`ded data from GenBank to generate an indepen-
`
`dent, nonbiased view of the genome. The sec-
`ond approach involves clustering all of the frag-
`ments to a region or chromosome on the basis
`of mapping information. The clustered data
`were then shredded and subjected to computa-
`tional assembly. Both approaches provided es-
`sentially the same reconstruction of assembled
`DNA sequence with proper order and orienta-
`tion. The second method provided slightly
`greater sequence coverage (fewer gaps) and
`was the principal sequence used for the analysis
`phase. In addition, we document the complete-
`ness and correctness of this assembly process
`
Fig. 2. Flow diagram for sequencing pipeline. Samples are received,
selected, and processed in compliance with standard operating procedures,
with a focus on quality within and across departments. Each
process has defined inputs and outputs with the capability to exchange
samples and data with both internal and external entities according to
defined quality guidelines. Manufacturing pipeline processes, products,
quality control measures, and responsible parties are indicated and are
described further in the text.
`
`sequences. In the past 2 years the PFP has
`focused on a product of lower quality and com-
`pleteness, but on a faster time-course, by con-
`centrating on the production of Phase 1 data
from a 3× to 4× light-shotgun of each BAC
`clone.
`We screened the bactig sequences for con-
`taminants by using the BLAST algorithm
`against three data sets: (i) vector sequences
`in Univec core (38), filtered for a 25-bp
`match at 98% sequence identity at the ends
`of the sequence and a 30-bp match internal
`to the sequence; (ii) the nonhuman portion
`of the High Throughput Genomic (HTG)
Sequences division of GenBank (39), fil-
`tered at 200 bp at 98%; and (iii) the non-
`redundant nucleotide sequences from Gen-
`Bank without primate and human virus en-
`tries, filtered at 200 bp at 98%. Whenever
`25 bp or more of vector was found within
`50 bp of the end of a contig, the tip up to
`the matching vector was excised. Under
`these criteria we removed 2.6 Mbp of pos-
sible contaminant and vector from the
Phase 3 data, 61.0 Mbp from the Phase 1
and 2 data, and 16.1 Mbp from the Phase 0
data (Table 2). This left us with a total of
4363.7 Mbp of PFP sequence data: 20%
finished, 75% rough-draft (Phase 1 and 2),
and 5% single sequencing reads (Phase 0).
`An additional 104,018 BAC end-sequence
`mate pairs were also downloaded and in-
`cluded in the data sets for both assembly
`processes (18).
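
The end-trimming rule stated above (25 bp or more of vector found within 50 bp of a contig end) can be sketched in Python as follows. This is an illustration under stated assumptions, not the screening code used for this work: the function name and the `vector_hits` input are hypothetical (in practice the hits would come from a BLAST search), "within 50 bp of the end" is read as the match starting or ending no more than 50 bp from that end, and the excised tip is assumed to include the vector match itself.

```python
MIN_VECTOR_MATCH_BP = 25   # minimum vector match length (from the text)
END_WINDOW_BP = 50         # match must lie within this distance of a contig end


def trim_vector_tips(sequence: str,
                     vector_hits: list[tuple[int, int]]) -> str:
    """Return `sequence` with vector-contaminated tips removed.

    vector_hits holds half-open (start, end) intervals of vector matches
    on the contig (hypothetical input, e.g. parsed from BLAST output).
    Assumption: the excised tip includes the vector match itself.
    """
    keep_start, keep_end = 0, len(sequence)
    for start, end in vector_hits:
        if end - start < MIN_VECTOR_MATCH_BP:
            continue  # match too short to count as vector
        if start <= END_WINDOW_BP:
            # Hit near the left end: drop everything through the match.
            keep_start = max(keep_start, end)
        elif len(sequence) - end <= END_WINDOW_BP:
            # Hit near the right end: drop everything from the match onward.
            keep_end = min(keep_end, start)
    return sequence[keep_start:keep_end] if keep_start < keep_end else ""
```

Matches shorter than 25 bp, or lying in the interior of a contig, leave the sequence unchanged in this sketch; the text applies separate filters (for example, a 30-bp internal match) to those cases.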
`
`2.2 Assembly strategies
`Two different approaches to assembly were
`pursued. The first was a whole-genome as-
`sembly process that used Celera data and the
`PFP data in the form of additional synthetic
`shotgun data, and the second was a compart-
`mentalized assembly process that first parti-
`tioned the Celera and PFP data into sets
`localized to large chromosomal segments and
`then performed ab initio shotgun assembly on
`each set. Figure 4 gives a schematic of the
`overall process flow.
`For the whole-genome assembly, the PFP
data were first disassembled or “shredded” into a
synthetic shotgun data set of 550-bp reads that
form a perfect 2× covering of the bactigs. This
resulted in 16.05 million “faux” reads that were
sufficient to cover the genome 2.96× because
of redundancy in the BAC data set, without
incorporating the biases inherent in the PFP
assembly process. The combined data set of
43.32 million reads (8×), and all associated
`mate-pair information, were then subjected to
`our whole-genome assembly algorithm to pro-
`duce a reconstruction of the genome. Neither
`the location of a BAC in the genome nor its
`assembly of bactigs was used in this process.
`Bactigs were shredded into reads because we
`found strong evidence that 2.13% of them were
`misassembled (40). Furthermore, BAC location
`
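One simple way to shred a bactig into 550-bp reads that cover it exactly twice is to lay down two tilings offset by half a read length. The Python sketch below illustrates this idea only; the function name and the handling of the ends (shorter pieces at the bactig boundaries) are assumptions, and the text does not specify how the actual shredding was implemented.

```python
def shred(bactig: str, read_len: int = 550) -> list[str]:
    """Cut `bactig` into pieces so that every base is covered exactly twice.

    Two tilings are used: one starting at position 0 and one shifted by
    half a read. Pieces at the ends may be shorter than `read_len`
    (an assumption of this sketch).
    """
    half = read_len // 2
    reads = []
    for offset in (0, half):
        if offset:
            # Short leading piece so the shifted tiling also covers the start.
            reads.append(bactig[:offset])
        for start in range(offset, len(bactig), read_len):
            reads.append(bactig[start:start + read_len])
    return [read for read in reads if read]


# Example: a 2,000-bp bactig yields 9 pieces totalling 4,000 bp (2× coverage).
example_reads = shred("ACGT" * 500)
print(len(example_reads), sum(len(read) for read in example_reads))
```

Because the pieces carry no record of which bactig or BAC they came from, an assembler given only these reads cannot inherit the order or orientation assigned in the source assembly, which is the purpose of shredding described above.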
`and provide a comparison to the public genome
`sequence, which was reconstructed largely by
`an independent BAC-by-BAC approach. Our
`assemblies effectively covered the euchromatic
`regions of the human chromosomes. More than
`90% of the genome was in scaffold assemblies
`of 100,000 bp or greater, and 25