Downloaded from http://science.sciencemag.org/ on November 6, 2018

THE HUMAN GENOME

The Sequence of the Human Genome
`J. Craig Venter,1* Mark D. Adams,1 Eugene W. Myers,1 Peter W. Li,1 Richard J. Mural,1
`Granger G. Sutton,1 Hamilton O. Smith,1 Mark Yandell,1 Cheryl A. Evans,1 Robert A. Holt,1
`Jeannine D. Gocayne,1 Peter Amanatides,1 Richard M. Ballew,1 Daniel H. Huson,1
`Jennifer Russo Wortman,1 Qing Zhang,1 Chinnappa D. Kodira,1 Xiangqun H. Zheng,1 Lin Chen,1
`Marian Skupski,1 Gangadharan Subramanian,1 Paul D. Thomas,1 Jinghui Zhang,1
`George L. Gabor Miklos,2 Catherine Nelson,3 Samuel Broder,1 Andrew G. Clark,4 Joe Nadeau,5
`Victor A. McKusick,6 Norton Zinder,7 Arnold J. Levine,7 Richard J. Roberts,8 Mel Simon,9
`Carolyn Slayman,10 Michael Hunkapiller,11 Randall Bolanos,1 Arthur Delcher,1 Ian Dew,1 Daniel Fasulo,1
`Michael Flanigan,1 Liliana Florea,1 Aaron Halpern,1 Sridhar Hannenhalli,1 Saul Kravitz,1 Samuel Levy,1
`Clark Mobarry,1 Knut Reinert,1 Karin Remington,1 Jane Abu-Threideh,1 Ellen Beasley,1 Kendra Biddick,1
`Vivien Bonazzi,1 Rhonda Brandon,1 Michele Cargill,1 Ishwar Chandramouliswaran,1 Rosane Charlab,1
`Kabir Chaturvedi,1 Zuoming Deng,1 Valentina Di Francesco,1 Patrick Dunn,1 Karen Eilbeck,1
`Carlos Evangelista,1 Andrei E. Gabrielian,1 Weiniu Gan,1 Wangmao Ge,1 Fangcheng Gong,1 Zhiping Gu,1
`Ping Guan,1 Thomas J. Heiman,1 Maureen E. Higgins,1 Rui-Ru Ji,1 Zhaoxi Ke,1 Karen A. Ketchum,1
`Zhongwu Lai,1 Yiding Lei,1 Zhenya Li,1 Jiayin Li,1 Yong Liang,1 Xiaoying Lin,1 Fu Lu,1
`Gennady V. Merkulov,1 Natalia Milshina,1 Helen M. Moore,1 Ashwinikumar K Naik,1
`Vaibhav A. Narayan,1 Beena Neelam,1 Deborah Nusskern,1 Douglas B. Rusch,1 Steven Salzberg,12
`Wei Shao,1 Bixiong Shue,1 Jingtao Sun,1 Zhen Yuan Wang,1 Aihui Wang,1 Xin Wang,1 Jian Wang,1
`Ming-Hui Wei,1 Ron Wides,13 Chunlin Xiao,1 Chunhua Yan,1 Alison Yao,1 Jane Ye,1 Ming Zhan,1
`Weiqing Zhang,1 Hongyu Zhang,1 Qi Zhao,1 Liansheng Zheng,1 Fei Zhong,1 Wenyan Zhong,1
`Shiaoping C. Zhu,1 Shaying Zhao,12 Dennis Gilbert,1 Suzanna Baumhueter,1 Gene Spier,1
`Christine Carter,1 Anibal Cravchik,1 Trevor Woodage,1 Feroze Ali,1 Huijin An,1 Aderonke Awe,1
`Danita Baldwin,1 Holly Baden,1 Mary Barnstead,1 Ian Barrow,1 Karen Beeson,1 Dana Busam,1
`Amy Carver,1 Angela Center,1 Ming Lai Cheng,1 Liz Curry,1 Steve Danaher,1 Lionel Davenport,1
`Raymond Desilets,1 Susanne Dietz,1 Kristina Dodson,1 Lisa Doup,1 Steven Ferriera,1 Neha Garg,1
`Andres Gluecksmann,1 Brit Hart,1 Jason Haynes,1 Charles Haynes,1 Cheryl Heiner,1 Suzanne Hladun,1
`Damon Hostin,1 Jarrett Houck,1 Timothy Howland,1 Chinyere Ibegwam,1 Jeffery Johnson,1
`Francis Kalush,1 Lesley Kline,1 Shashi Koduru,1 Amy Love,1 Felecia Mann,1 David May,1
`Steven McCawley,1 Tina McIntosh,1 Ivy McMullen,1 Mee Moy,1 Linda Moy,1 Brian Murphy,1
`Keith Nelson,1 Cynthia Pfannkoch,1 Eric Pratts,1 Vinita Puri,1 Hina Qureshi,1 Matthew Reardon,1
`Robert Rodriguez,1 Yu-Hui Rogers,1 Deanna Romblad,1 Bob Ruhfel,1 Richard Scott,1 Cynthia Sitter,1
`Michelle Smallwood,1 Erin Stewart,1 Renee Strong,1 Ellen Suh,1 Reginald Thomas,1 Ni Ni Tint,1
`Sukyee Tse,1 Claire Vech,1 Gary Wang,1 Jeremy Wetter,1 Sherita Williams,1 Monica Williams,1
`Sandra Windsor,1 Emily Winn-Deen,1 Keriellen Wolfe,1 Jayshree Zaveri,1 Karena Zaveri,1
Josep F. Abril,14 Roderic Guigó,14 Michael J. Campbell,1 Kimmen V. Sjolander,1 Brian Karlak,1
`Anish Kejariwal,1 Huaiyu Mi,1 Betty Lazareva,1 Thomas Hatton,1 Apurva Narechania,1 Karen Diemer,1
`Anushya Muruganujan,1 Nan Guo,1 Shinji Sato,1 Vineet Bafna,1 Sorin Istrail,1 Ross Lippert,1
`Russell Schwartz,1 Brian Walenz,1 Shibu Yooseph,1 David Allen,1 Anand Basu,1 James Baxendale,1
`Louis Blick,1 Marcelo Caminha,1 John Carnes-Stine,1 Parris Caulk,1 Yen-Hui Chiang,1 My Coyne,1
`Carl Dahlke,1 Anne Deslattes Mays,1 Maria Dombroski,1 Michael Donnelly,1 Dale Ely,1 Shiva Esparham,1
`Carl Fosler,1 Harold Gire,1 Stephen Glanowski,1 Kenneth Glasser,1 Anna Glodek,1 Mark Gorokhov,1
`Ken Graham,1 Barry Gropman,1 Michael Harris,1 Jeremy Heil,1 Scott Henderson,1 Jeffrey Hoover,1
`Donald Jennings,1 Catherine Jordan,1 James Jordan,1 John Kasha,1 Leonid Kagan,1 Cheryl Kraft,1
`Alexander Levitsky,1 Mark Lewis,1 Xiangjun Liu,1 John Lopez,1 Daniel Ma,1 William Majoros,1
`Joe McDaniel,1 Sean Murphy,1 Matthew Newman,1 Trung Nguyen,1 Ngoc Nguyen,1 Marc Nodell,1
`Sue Pan,1 Jim Peck,1 Marshall Peterson,1 William Rowe,1 Robert Sanders,1 John Scott,1
`Michael Simpson,1 Thomas Smith,1 Arlan Sprague,1 Timothy Stockwell,1 Russell Turner,1 Eli Venter,1
`Mei Wang,1 Meiyuan Wen,1 David Wu,1 Mitchell Wu,1 Ashley Xia,1 Ali Zandieh,1 Xiaohong Zhu1
`
`1304
`
`16 FEBRUARY 2001 VOL 291 SCIENCE www.sciencemag.org
`
`Agilent Exhibit 1289
`Page 1 of 51
`
`

A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies, a whole-genome assembly and a regional chromosome assembly, were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional ~12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.
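The headline figures above can be converted into absolute quantities with simple arithmetic. The following Python sketch is illustrative only: the constant and function names are assumptions introduced here, not part of the study, and the reported percentages are rounded (they sum to 100.1%), so the results are approximate.

```python
# Illustrative arithmetic on the abstract's rounded headline figures.
GENOME_BP = 2.91e9           # consensus sequence length (bp), from the abstract
SNP_SPACING_BP = 1250        # one difference per 1250 bp, on average

# Reported genome fractions; they are rounded and sum to 100.1%.
FRACTIONS = {"exons": 0.011, "introns": 0.24, "intergenic": 0.75}


def megabases(fraction, genome_bp=GENOME_BP):
    """Convert a fraction of the genome into megabase pairs."""
    return fraction * genome_bp / 1e6


def expected_pairwise_differences(genome_bp=GENOME_BP, spacing_bp=SNP_SPACING_BP):
    """Expected number of single-base differences between two haploid genomes."""
    return genome_bp / spacing_bp


if __name__ == "__main__":
    for name, fraction in FRACTIONS.items():
        print(f"{name:<11}{megabases(fraction):8.1f} Mbp")
    print(f"expected differences: {expected_pairwise_differences():,.0f}")
```

Under these inputs, exons account for roughly 32 Mbp of sequence, and two randomly chosen haploid genomes would be expected to differ at about 2.3 million positions.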
`
1Celera Genomics, 45 West Gude Drive, Rockville, MD 20850, USA. 2GenetixXpress, 78 Pacific Road, Palm Beach, Sydney 2108, Australia. 3Berkeley Drosophila Genome Project, University of California, Berkeley, CA 94720, USA. 4Department of Biology, Penn State University, 208 Mueller Lab, University Park, PA 16802, USA. 5Department of Genetics, Case Western Reserve University School of Medicine, BRB-630, 10900 Euclid Avenue, Cleveland, OH 44106, USA. 6Johns Hopkins University School of Medicine, Johns Hopkins Hospital, 600 North Wolfe Street, Blalock 1007, Baltimore, MD 21287-4922, USA. 7Rockefeller University, 1230 York Avenue, New York, NY 10021-6399, USA. 8New England BioLabs, 32 Tozer Road, Beverly, MA 01915, USA. 9Division of Biology, 147-75, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA. 10Yale University School of Medicine, 333 Cedar Street, P.O. Box 208000, New Haven, CT 06520-8000, USA. 11Applied Biosystems, 850 Lincoln Centre Drive, Foster City, CA 94404, USA. 12The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA. 13Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan, 52900 Israel. 14Grup de Recerca en Informàtica Mèdica, Institut Municipal d'Investigació Mèdica, Universitat Pompeu Fabra, 08003-Barcelona, Catalonia, Spain.

*To whom correspondence should be addressed. E-mail: humangenome@celera.com

Decoding of the DNA that constitutes the human genome has been widely anticipated for the contribution it will make toward understanding human evolution, the causation of disease, and the interplay between the environment and heredity in defining the human condition. A project with the goal of determining the complete nucleotide sequence of the human genome was first formally proposed in 1985 (1). In subsequent years, the idea met with mixed reactions in the scientific community (2). However, in 1990, the Human Genome Project (HGP) was officially initiated in the United States under the direction of the National Institutes of Health and the U.S. Department of Energy with a 15-year, $3 billion plan for completing the genome sequence. In 1998 we announced our intention to build a unique genome-sequencing facility, to determine the sequence of the human genome over a 3-year period. Here we report the penultimate milestone along the path toward that goal, a nearly complete sequence of the euchromatic portion of the human genome. The sequencing was performed by a whole-genome random shotgun method with subsequent assembly of the sequenced segments.
`The modern history of DNA sequencing
`began in 1977, when Sanger reported his meth-
`od for determining the order of nucleotides of
`
`DNA using chain-terminating nucleotide ana-
`logs (3). In the same year, the first human gene
`was isolated and sequenced (4). In 1986, Hood
`and co-workers (5) described an improvement
`in the Sanger sequencing method that included
`attaching fluorescent dyes to the nucleotides,
`which permitted them to be sequentially read
`by a computer. The first automated DNA se-
`quencer, developed by Applied Biosystems in
`California in 1987, was shown to be successful
`when the sequences of two genes were obtained
`with this new technology (6). From early se-
`quencing of human genomic regions (7), it
`became clear that cDNA sequences (which are
`reverse-transcribed from RNA) would be es-
`sential to annotate and validate gene predictions
`in the human genome. These studies were the
`basis in part for the development of the ex-
`pressed sequence tag (EST) method of gene
`identification (8), which is a random selection,
`very high throughput sequencing approach to
`characterize cDNA libraries. The EST method
`led to the rapid discovery and mapping of hu-
`man genes (9). The increasing numbers of hu-
`man EST sequences necessitated the develop-
`ment of new computer algorithms to analyze
`large amounts of sequence data, and in 1993 at
`The Institute for Genomic Research (TIGR), an
`algorithm was developed that permitted assem-
`bly and analysis of hundreds of thousands of
`ESTs. This algorithm permitted characteriza-
`tion and annotation of human genes on the basis
`of 30,000 EST assemblies (10).
`The complete 49-kbp bacteriophage lamb-
`da genome sequence was determined by a
`shotgun restriction digest method in 1982
`(11). When considering methods for sequenc-
`ing the smallpox virus genome in 1991 (12),
`a whole-genome shotgun sequencing method
`was discussed and subsequently rejected ow-
`ing to the lack of appropriate software tools
`for genome assembly. However,
`in 1994,
`when a microbial genome-sequencing project
`was contemplated at TIGR, a whole-genome
`shotgun sequencing approach was considered
`possible with the TIGR EST assembly algo-
`rithm. In 1995, the 1.8-Mbp Haemophilus
`influenzae genome was completed by a
`whole-genome shotgun sequencing method
`(13). The experience with several subsequent
`genome-sequencing efforts established the
`broad applicability of this approach (14, 15).
`A key feature of the sequencing approach
`used for these megabase-size and larger ge-
`nomes was the use of paired-end sequences
`(also called mate pairs), derived from sub-
`clone libraries with distinct insert sizes and
`cloning characteristics. Paired-end sequences
`are sequences 500 to 600 bp in length from
`both ends of double-stranded DNA clones of
`prescribed lengths. The success of using end
`sequences from long segments (18 to 20 kbp)
`of DNA cloned into bacteriophage lambda in
`assembly of the microbial genomes led to the
`suggestion (16) of an approach to simulta-
`neously map and sequence the human ge-
`nome by means of end sequences from 150-
`kbp bacterial artificial chromosomes (BACs)
`(17, 18). The end sequences spanned by
`known distances provide long-range continu-
`ity across the genome. A modification of the
`BAC end-sequencing (BES) method was ap-
`plied successfully to complete chromosome 2
`from the Arabidopsis thaliana genome (19).
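The way paired-end sequences constrain an assembly can be illustrated with a small consistency check. This is a minimal sketch, not the assembler used in the study: the `Placement` record, the function name, and the three-standard-deviation tolerance are assumptions made for illustration. It encodes two facts from the text: mates come from opposite ends of one clone, and the clone length is approximately known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical record of where one read landed in an assembly."""
    contig: str
    start: int      # 0-based leftmost coordinate of the read on the contig
    length: int     # read length in bp
    forward: bool   # True if the read aligns to the forward strand


def mate_pair_is_consistent(a: Placement, b: Placement,
                            insert_mean: float, insert_sd: float,
                            n_sd: float = 3.0) -> bool:
    """Check that two mates face each other at roughly the expected distance."""
    if a.contig != b.contig:
        return False
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Mates are read inward from the two ends of the same clone insert.
    if not (left.forward and not right.forward):
        return False
    span = right.start + right.length - left.start
    return abs(span - insert_mean) <= n_sd * insert_sd


if __name__ == "__main__":
    a = Placement("contig1", 1_000, 550, True)
    b = Placement("contig1", 10_450, 550, False)
    print(mate_pair_is_consistent(a, b, insert_mean=10_000, insert_sd=800))
```

A pair that fails such a check suggests either a misassembly or a mis-tracked mate, which is why the known insert sizes provide long-range continuity.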
`In 1997, Weber and Myers (20) proposed
`whole-genome shotgun sequencing of the
`human genome. Their proposal was not well
`received (21). However, by early 1998, as
`less than 5% of the genome had been se-
`quenced, it was clear that the rate of progress
`in human genome sequencing worldwide
`was very slow (22), and the prospects for
`finishing the genome by the 2005 goal were
`uncertain.
`In early 1998, PE Biosystems (now Applied
`Biosystems) developed an automated, high-
`throughput capillary DNA sequencer, subse-
`quently called the ABI PRISM 3700 DNA
`Analyzer. Discussions between PE Biosystems
`and TIGR scientists resulted in a plan to under-
`take the sequencing of the human genome with
`the 3700 DNA Analyzer and the whole-genome
`shotgun sequencing techniques developed at
`TIGR (23). Many of the principles of operation
`of a genome-sequencing facility were estab-
`lished in the TIGR facility (24). However, the
`facility envisioned for Celera would have a
`capacity roughly 50 times that of TIGR, and
`thus new developments were required for sam-
`ple preparation and tracking and for whole-
`genome assembly. Some argued that the re-
`quired 150-fold scale-up from the H. influenzae
`genome to the human genome with its complex
`repeat sequences was not feasible (25). The
`Drosophila melanogaster genome was thus
`chosen as a test case for whole-genome assem-
`bly on a large and complex eukaryotic genome.
`In collaboration with Gerald Rubin and the
`Berkeley Drosophila Genome Project, the nu-
`cleotide sequence of the 120-Mbp euchromatic
`portion of the Drosophila genome was deter-
`mined over a 1-year period (26–28). The Dro-
`sophila genome-sequencing effort resulted in
`two key findings: (i) that the assembly algo-
`rithms could generate chromosome assemblies
`with highly accurate order and orientation with
`substantially less than 10-fold coverage, and (ii)
`that undertaking multiple interim assemblies in
`place of one comprehensive final assembly was
`not of value.
`These findings, together with the dramatic
`changes in the public genome effort subsequent
`to the formation of Celera (29), led to a modi-
`fied whole-genome shotgun sequencing ap-
`proach to the human genome. We initially pro-
`posed to do 10-fold sequence coverage of the
genome over a 3-year period and to make interim assembled sequence data available quarterly.
The modifications included a plan to perform random shotgun sequencing to ~5-fold coverage
and to use the unordered and unoriented BAC sequence fragments and subassemblies
published in GenBank by the publicly
`funded genome effort (30) to accelerate the
`project. We also abandoned the quarterly an-
`nouncements in the absence of interim assem-
`blies to report.
`Although this strategy provided a reason-
`able result very early that was consistent with a
`whole-genome shotgun assembly with eight-
`fold coverage, the human genome sequence is
`not as finished as the Drosophila genome was
`with an effective 13-fold coverage. However, it
`became clear that even with this reduced cov-
`erage strategy, Celera could generate an accu-
`rately ordered and oriented scaffold sequence of
`the human genome in less than 1 year. Human
`genome sequencing was initiated 8 September
`1999 and completed 17 June 2000. The first
`assembly was completed 25 June 2000, and the
`assembly reported here was completed 1 Octo-
`ber 2000. Here we describe the whole-genome
`random shotgun sequencing effort applied to
the human genome. We developed two different assembly approaches for assembling the ~3
billion bp that make up the 23 pairs of chromosomes of the Homo sapiens genome. Any
GenBank-derived data were shredded to remove
`potential bias to the final sequence from chi-
`meric clones, foreign DNA contamination, or
`misassembled contigs. Insofar as a correctly
`and accurately assembled genome sequence
`with faithful order and orientation of contigs
`is essential for an accurate analysis of the
`human genetic code, we have devoted a con-
`siderable portion of this manuscript to the
`documentation of the quality of our recon-
`struction of the genome. We also describe our
`preliminary analysis of the human genetic
`code on the basis of computational methods.
`Figure 1 (see fold-out chart associated with
`this issue; files for each chromosome can be
`found in Web fig. 1 on Science Online at
`www.sciencemag.org/cgi/content/full/291/
`5507/1304/DC1) provides a graphical over-
`view of the genome and the features encoded
`in it. The detailed manual curation and inter-
`pretation of the genome are just beginning.
`To aid the reader in locating specific an-
`alytical sections, we have divided the paper
`into seven broad sections. A summary of the
`major results appears at the beginning of each
`section.
`
1 Sources of DNA and Sequencing Methods
2 Genome Assembly Strategy and Characterization
3 Gene Prediction and Annotation
4 Genome Structure
5 Genome Evolution
6 A Genome-Wide Examination of Sequence Variations
7 An Overview of the Predicted Protein-Coding Genes in the Human Genome
8 Conclusions

1 Sources of DNA and Sequencing Methods
Summary. This section discusses the rationale and ethical rules governing donor selection to ensure ethnic and gender diversity, along with the methodologies for DNA extraction and library construction. The plasmid library construction is the first critical step in shotgun sequencing. If the DNA libraries are not uniform in size and nonchimeric, or do not randomly represent the genome, then the subsequent steps cannot accurately reconstruct the genome sequence. We used automated high-throughput DNA sequencing and the computational infrastructure to enable efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.8 billion bp of sequence). Sequencing and tracking from both ends of plasmid clones from 2-, 10-, and 50-kbp libraries were essential to the computational reconstruction of the genome. Our evidence indicates that the accurate pairing rate of end sequences was greater than 98%.
`
`Various policies of the United States and the
`World Medical Association, specifically the
`Declaration of Helsinki, offer recommenda-
`tions for conducting experiments with human
`subjects. We convened an Institutional Re-
`view Board (IRB) (31) that helped us estab-
`lish the protocol for obtaining and using hu-
`man DNA and the informed consent process
`used to enroll research volunteers for the
`DNA-sequencing studies reported here. We
`adopted several steps and procedures to pro-
`tect the privacy rights and confidentiality of
`the research subjects (donors). These includ-
`ed a two-stage consent process, a secure ran-
`dom alphanumeric coding system for speci-
`mens and records, circumscribed contact with
`the subjects by researchers, and options for
`off-site contact of donors. In addition, Celera
`applied for and received a Certificate of Con-
`fidentiality from the Department of Health
`and Human Services. This Certificate autho-
`rized Celera to protect the privacy of the
`individuals who volunteered to be donors as
`provided in Section 301(d) of the Public
`Health Service Act 42 U.S.C. 241(d).
`Celera and the IRB believed that the ini-
`tial version of a completed human genome
`should be a composite derived from multiple
donors of diverse ethnic backgrounds. Prospective donors were asked, on a voluntary
`basis, to self-designate an ethnogeographic
`category (e.g., African-American, Chinese,
`Hispanic, Caucasian, etc.). We enrolled 21
`donors (32).
`Three basic items of information from
`each donor were recorded and linked by con-
`fidential code to the donated sample: age,
sex, and self-designated ethnogeographic group. From females, ~130 ml of whole,
heparinized blood was collected. From males, ~130 ml of whole, heparinized blood was
`collected, as well as five specimens of semen,
`collected over a 6-week period. Permanent
`lymphoblastoid cell
`lines were created by
`Epstein-Barr virus immortalization. DNA
`from five subjects was selected for genomic
`DNA sequencing: two males and three fe-
`males—one African-American, one Asian-
`Chinese, one Hispanic-Mexican, and two
`Caucasians (see Web fig. 2 on Science Online
`at www.sciencemag.org/cgi/content/291/5507/
`1304/DC1). The decision of whose DNA to
`sequence was based on a complex mix of fac-
`tors, including the goal of achieving diversity as
`well as technical issues such as the quality of
`the DNA libraries and availability of immortal-
`ized cell lines.
`
1.1 Library construction and sequencing
`Central to the whole-genome shotgun sequenc-
`ing process is preparation of high-quality plas-
`mid libraries in a variety of insert sizes so that
`pairs of sequence reads (mates) are obtained,
`one read from both ends of each plasmid insert.
`High-quality libraries have an equal representa-
`tion of all parts of the genome, a small number
`of clones without inserts, and no contamination
`from such sources as the mitochondrial genome
`and Escherichia coli genomic DNA. DNA from
`each donor was used to construct plasmid librar-
`ies in one or more of three size classes: 2 kbp, 10
`kbp, and 50 kbp (Table 1) (33).
`In designing the DNA-sequencing pro-
`cess, we focused on developing a simple
`system that could be implemented in a robust
`and reproducible manner and monitored ef-
`fectively (Fig. 2) (34).
Table 1. Celera-generated data input into assembly.

Current sequencing protocols are based on the dideoxy sequencing method (35), which typically yields only 500 to 750 bp of sequence per reaction. This limitation on read length has made monumental gains in throughput a prerequisite for the analysis of large eukaryotic genomes. We accomplished this at the Celera facility, which occupies about 30,000 square feet of laboratory space and produces sequence data continuously at a rate of 175,000 total reads per day. The DNA-sequencing facility is supported by a high-performance computational facility (36).
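To put the stated production rate in context, the sketch below compares it with the human sequencing period reported in this paper (initiated 8 September 1999, completed 17 June 2000) and the total of 27,271,853 reads. The variable and function names are assumptions introduced for illustration, and the comparison is only indicative, since the text does not say how the daily rate varied over that period.

```python
from datetime import date

TOTAL_READS = 27_271_853          # reads entering the assembly
STATED_READS_PER_DAY = 175_000    # production rate quoted for the facility
START = date(1999, 9, 8)          # human sequencing initiated
END = date(2000, 6, 17)           # human sequencing completed


def days_needed(total_reads: int, reads_per_day: int) -> float:
    """Days required to produce total_reads at a constant daily rate."""
    return total_reads / reads_per_day


elapsed_days = (END - START).days
average_reads_per_day = TOTAL_READS / elapsed_days

if __name__ == "__main__":
    print(f"elapsed days:              {elapsed_days}")
    print(f"days at the stated rate:   {days_needed(TOTAL_READS, STATED_READS_PER_DAY):.1f}")
    print(f"average reads per day:     {average_reads_per_day:,.0f}")
```

The elapsed interval is 283 days, so the average over the whole period works out to somewhat under 100,000 reads per day, below the quoted rate of 175,000.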
`The process for DNA sequencing was mod-
`ular by design and automated. Intermodule
`sample backlogs allowed four principal
`modules to operate independently: (i) li-
`brary transformation, plating, and colony
`picking;
`(ii) DNA template preparation;
`(iii) dideoxy sequencing reaction set-up
`and purification; and (iv) sequence deter-
`mination with the ABI PRISM 3700 DNA
`Analyzer. Because the inputs and outputs
`of
`each module have been carefully
`matched and sample backlogs are continu-
`ously managed, sequencing has proceeded
`without a single day’s interruption since the
`initiation of the Drosophila project in May
`1999. The ABI 3700 is a fully automated
`capillary array sequencer and as such can
`be operated with a minimal amount of
`hands-on time, currently estimated at about
`15 min per day. The capillary system also
`facilitates correct associations of sequenc-
`ing traces with samples through the elimi-
`nation of manual sample loading and lane-
`tracking errors associated with slab gels.
`About 65 production staff were hired and
trained, and were rotated on a regular basis through the four production modules. A
`central laboratory information management
`system (LIMS) tracked all sample plates by
`unique bar code identifiers. The facility was
`supported by a quality control team that per-
`formed raw material and in-process testing
`and a quality assurance group with responsi-
`bilities including document control, valida-
`tion, and auditing of the facility. Critical to
`the success of the scale-up was the validation
`of all software and instrumentation before
`implementation, and production-scale testing
`of any process changes.
`
`1.2 Trace processing
`An automated trace-processing pipeline has
`been developed to process each sequence file
`(37). After quality and vector trimming, the
`average trimmed sequence length was 543
`bp, and the sequencing accuracy was expo-
`nentially distributed with a mean of 99.5%
`and with less than 1 in 1000 reads being less
`than 98% accurate (26). Each trimmed se-
`quence was screened for matches to contam-
`inants including sequences of vector alone, E.
`coli genomic DNA, and human mitochondri-
`al DNA. The entire read for any sequence
`with a significant match to a contaminant was
`discarded. A total of 713 reads matched E.
`coli genomic DNA and 2114 reads matched
`the human mitochondrial genome.
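The discard-on-match rule described above can be sketched as follows. This is not the pipeline's actual matching method, which is not specified here: shared exact k-mers are used as a simple stand-in for a "significant match", and the function names, the k-mer length, the hit threshold, and the toy sequences are all assumptions for illustration.

```python
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """All length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_contaminant_index(contaminants: Dict[str, str], k: int = 12) -> Dict[str, str]:
    """Map each k-mer of each contaminant sequence to that contaminant's name."""
    index: Dict[str, str] = {}
    for name, seq in contaminants.items():
        for kmer in kmers(seq.upper(), k):
            index.setdefault(kmer, name)  # first contaminant seen wins on ties
    return index


def screen_reads(reads: Dict[str, str], index: Dict[str, str],
                 k: int = 12, min_hits: int = 3) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split reads into (kept, discarded); discarded maps read id to contaminant."""
    kept: Dict[str, str] = {}
    discarded: Dict[str, str] = {}
    for read_id, seq in reads.items():
        hits: Dict[str, int] = {}
        for kmer in kmers(seq.upper(), k):
            source = index.get(kmer)
            if source is not None:
                hits[source] = hits.get(source, 0) + 1
        best = max(hits, key=hits.get, default=None)
        if best is not None and hits[best] >= min_hits:
            discarded[read_id] = best   # the entire read is dropped, not trimmed
        else:
            kept[read_id] = seq
    return kept, discarded


# Toy data, invented for this example.
TOY_VECTOR = "GGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTGGCACTGGCC"
TOY_READS = {
    "clean": "AAAACCCCGGGGTTTT" * 3,
    "contaminated": "AAAACCCC" + TOY_VECTOR[5:35] + "GGGGTTTT",
}

if __name__ == "__main__":
    toy_index = build_contaminant_index({"vector": TOY_VECTOR})
    print(screen_reads(TOY_READS, toy_index))
```

Discarding the whole read, rather than trimming the matching part, mirrors the rule stated in the text.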
`
`1.3 Quality assessment and control
The importance of the base-pair level accuracy of the sequence data increases as the size and repetitive nature of the genome to be sequenced increases. Each sequence read must be placed uniquely in the genome, and even a modest error rate can reduce the effectiveness of assembly. In addition, maintaining the validity of mate-pair information is absolutely critical for the algorithms described below. Procedural controls were established for maintaining the validity of sequence mate-pairs as sequencing reactions proceeded through the process, including strict rules built into the LIMS. The accuracy of sequence data produced by the Celera process was validated in the course of the Drosophila genome project (26). By collecting data for the entire human genome in a single facility, we were able to ensure uniform quality standards and the cost advantages associated with automation, an economy of scale, and process consistency.

Number of reads for different insert libraries

Measure                   Individual         2 kbp      10 kbp      50 kbp       Total  Total number of base pairs
No. of sequencing reads   A                      0           0   2,767,357   2,767,357               1,502,674,851
                          B             11,736,757   7,467,755      66,930  19,271,442              10,464,393,006
                          C                853,819     881,290           0   1,735,109                 942,164,187
                          D                952,523   1,046,815           0   1,999,338               1,085,640,534
                          F                      0   1,498,607           0   1,498,607                 813,743,601
                          Total         13,543,099  10,894,467   2,834,287  27,271,853              14,808,616,179
Fold sequence coverage    A                      0           0        0.52        0.52
(2.9-Gb genome)           B                   2.20        1.40        0.01        3.61
                          C                   0.16        0.17           0        0.32
                          D                   0.18        0.20           0        0.37
                          F                      0        0.28           0        0.28
                          Total               2.54        2.04        0.53        5.11
Fold clone coverage       A                      0           0       18.39       18.39
                          B                   2.96       11.26        0.44       14.67
                          C                   0.22        1.33           0        1.54
                          D                   0.24        1.58           0        1.82
                          F                      0        2.26           0        2.26
                          Total               3.42       16.43       18.84       38.68
Insert size* (mean)       Average         1,951 bp   10,800 bp   50,715 bp
Insert size* (SD)         Average            6.10%       8.10%      14.90%
% Mates†                  Average            74.50       80.80       75.60

*Insert size and SD are calculated from assembly of mates on contigs.
†% Mates is based on laboratory tracking of sequencing runs.
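The coverage figures in Table 1 can be approximately reproduced from the read counts. The sketch below is a reading of the table, not the authors' stated method: it assumes that sequence coverage is reads times the average trimmed read length (543 bp) divided by a 2.9-Gb genome, and that clone coverage is the number of mated pairs times the mean insert size divided by the genome size. The names are introduced here for illustration.

```python
GENOME_BP = 2.9e9        # genome size used in Table 1
MEAN_READ_BP = 543       # average trimmed read length (section 1.2)

# Per-library totals from Table 1.
LIBRARIES = {
    "2 kbp":  {"reads": 13_543_099, "insert_bp": 1_951,  "pct_mates": 74.5},
    "10 kbp": {"reads": 10_894_467, "insert_bp": 10_800, "pct_mates": 80.8},
    "50 kbp": {"reads": 2_834_287,  "insert_bp": 50_715, "pct_mates": 75.6},
}


def sequence_coverage(reads: int) -> float:
    """Fold sequence coverage: total sequenced bases over genome size."""
    return reads * MEAN_READ_BP / GENOME_BP


def clone_coverage(reads: int, insert_bp: int, pct_mates: float) -> float:
    """Fold clone coverage: bases spanned by mated clone inserts over genome size."""
    mated_pairs = reads * pct_mates / 100 / 2
    return mated_pairs * insert_bp / GENOME_BP


if __name__ == "__main__":
    for name, lib in LIBRARIES.items():
        print(f"{name:>6}: sequence {sequence_coverage(lib['reads']):.2f}x, "
              f"clone {clone_coverage(**lib):.2f}x")
    total_reads = sum(lib["reads"] for lib in LIBRARIES.values())
    print(f" total: sequence {sequence_coverage(total_reads):.2f}x")
```

Under these assumptions the sequence-coverage values match the table to two decimal places, while the clone-coverage values agree only to within about 0.1-fold, so the exact clone-coverage calculation presumably differs in detail.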
`
2 Genome Assembly Strategy and Characterization
`Summary. We describe in this section the two
`approaches that we used to assemble the ge-
`nome. One method involves the computational
`combination of all sequence reads with shred-
`ded data from GenBank to generate an indepen-
`
`dent, nonbiased view of the genome. The sec-
`ond approach involves clustering all of the frag-
`ments to a region or chromosome on the basis
`of mapping information. The clustered data
`were then shredded and subjected to computa-
`tional assembly. Both approaches provided es-
`sentially the same reconstruction of assembled
`DNA sequence with proper order and orienta-
`tion. The second method provided slightly
`greater sequence coverage (fewer gaps) and
`was the principal sequence used for the analysis
`phase. In addition, we document the complete-
`ness and correctness of this assembly process
`
Fig. 2. Flow diagram for sequencing pipeline. Samples are received, selected, and processed in compliance with standard operating procedures, with a focus on quality within and across departments. Each process has defined inputs and outputs with the capability to exchange samples and data with both internal and external entities according to defined quality guidelines. Manufacturing pipeline processes, products, quality control measures, and responsible parties are indicated and are described further in the text.
`
sequences. In the past 2 years the PFP has focused on a product of lower quality and completeness, but on a faster time-course, by concentrating on the production of Phase 1 data from a 3× to 4× light-shotgun of each BAC clone.
We screened the bactig sequences for contaminants by using the BLAST algorithm against three data sets: (i) vector sequences in Univec core (38), filtered for a 25-bp match at 98% sequence identity at the ends of the sequence and a 30-bp match internal to the sequence; (ii) the nonhuman portion of the High Throughput Genomic (HTG) Sequences division of GenBank (39), filtered at 200 bp at 98%; and (iii) the nonredundant nucleotide sequences from GenBank without primate and human virus entries, filtered at 200 bp at 98%. Whenever 25 bp or more of vector was found within 50 bp of the end of a contig, the tip up to the matching vector was excised. Under these criteria we removed 2.6 Mbp of possible contaminant and vector from the Phase 3 data, 61.0 Mbp from the Phase 1 and 2 data, and 16.1 Mbp from the Phase 0 data (Table 2). This left us with a total of 4363.7 Mbp of PFP sequence data: 20% finished, 75% rough-draft (Phase 1 and 2), and 5% single sequencing reads (Phase 0). An additional 104,018 BAC end-sequence mate pairs were also downloaded and included in the data sets for both assembly processes (18).
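The tip-excision rule stated above (25 bp or more of vector within 50 bp of a contig end) can be sketched as a small function. This is one possible reading of that sentence, not the authors' code: it assumes vector matches arrive as half-open (start, end) intervals from some upstream search, and that the excised tip includes the matched vector itself. The function and parameter names are hypothetical.

```python
from typing import Iterable, Tuple


def excise_vector_tips(contig_length: int,
                       vector_matches: Iterable[Tuple[int, int]],
                       min_match: int = 25,
                       end_window: int = 50) -> Tuple[int, int]:
    """Return the half-open interval (keep_start, keep_end) left after trimming tips."""
    keep_start, keep_end = 0, contig_length
    for start, end in vector_matches:
        if end - start < min_match:
            continue                      # too short to count as vector
        if start < end_window:
            # Match begins near the left end: drop the tip through the match.
            keep_start = max(keep_start, end)
        elif end > contig_length - end_window:
            # Match finishes near the right end: drop from the match onward.
            keep_end = min(keep_end, start)
        # Internal matches are left alone by this particular rule.
    return keep_start, max(keep_start, keep_end)


if __name__ == "__main__":
    print(excise_vector_tips(1000, [(10, 40)]))    # left tip removed
    print(excise_vector_tips(1000, [(960, 990)]))  # right tip removed
```

Matches far from either end are deliberately ignored here, because the sentence in question covers only the tips; the internal 30-bp criterion is a separate filter.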
`
`2.2 Assembly strategies
`Two different approaches to assembly were
`pursued. The first was a whole-genome as-
`sembly process that used Celera data and the
`PFP data in the form of additional synthetic
`shotgun data, and the second was a compart-
`mentalized assembly process that first parti-
`tioned the Celera and PFP data into sets
`localized to large chromosomal segments and
`then performed ab initio shotgun assembly on
`each set. Figure 4 gives a schematic of the
`overall process flow.
For the whole-genome assembly, the PFP data were first disassembled or “shredded” into a synthetic shotgun data set of 550-bp reads that form a perfect 2× covering of the bactigs. This resulted in 16.05 million “faux” reads that were sufficient to cover the genome 2.96× because of redundancy in the BAC data set, without incorporating the biases inherent in the PFP assembly process. The combined data set of 43.32 million reads (8×), and all associated mate-pair information, were then subjected to our whole-genome assembly algorithm to produce a reconstruction of the genome. Neither the location of a BAC in the genome nor its assembly of bactigs was used in this process. Bactigs were shredded into reads because we found strong evidence that 2.13% of them were misassembled (40). Furthermore, BAC location
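The shredding step described in this section can be illustrated with a short sketch. The text specifies only the read length (550 bp) and the 2× covering, so the tiling scheme below (reads starting every half read length, plus one final read flush with the end) is an assumption, as are the function names.

```python
from typing import List, Tuple


def shred(bactig: str, read_length: int = 550, coverage: int = 2) -> List[Tuple[int, str]]:
    """Cut a sequence into overlapping synthetic reads; returns (start, read) pairs."""
    if not bactig:
        return []
    step = read_length // coverage            # 275 bp for a 2x tiling of 550-bp reads
    last_start = max(len(bactig) - read_length, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)             # make sure the tail is covered
    return [(s, bactig[s:s + read_length]) for s in starts]


def depth_at(reads: List[Tuple[int, str]], position: int) -> int:
    """Number of synthetic reads covering a given position."""
    return sum(1 for start, seq in reads if start <= position < start + len(seq))


if __name__ == "__main__":
    toy_bactig = "ACGT" * 550                 # 2200-bp toy sequence
    reads = shred(toy_bactig)
    print(len(reads), depth_at(reads, 1000))
```

With this tiling, interior positions are covered exactly twice and only the first and last half-read of each bactig are covered once. For scale, 4363.7 Mbp shredded this way corresponds to roughly 15.9 million reads, in line with the 16.05 million reported.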
`
`and provide a comparison to the public genome
`sequence, which was reconstructed largely by
`an independent BAC-by-BAC approach. Our
`assemblies effectively covered the euchromatic
`regions of the human chromosomes. More than
`90% of the genome was in scaffold assemblies
`of 100,000 bp or greater, and 25
