`
SEQUENOM EXHIBIT 1101
Sequenom v. Stanford
IPR2013-00390

© 2001 Macmillan Magazines Ltd

NATURE | VOL 409 | 15 FEBRUARY 2001 | www.nature.com
`
`
`
`
`articles
`
`Initial sequencing and analysis of the
`human genome
`
International Human Genome Sequencing Consortium*

* A partial list of authors appears on the opposite page. Affiliations are listed at the end of the paper.
`
`
`The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution.
`Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human
`genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.
`
The rediscovery of Mendel's laws of heredity in the opening weeks of
the 20th century1-3 sparked a scientific quest to understand the
`nature and content of genetic information that has propelled
`biology for the last hundred years. The scientific progress made
`falls naturally into four main phases, corresponding roughly to the
`four quarters of the century. The first established the cellular basis of
`heredity: the chromosomes. The second defined the molecular basis
of heredity: the DNA double helix. The third unlocked the informational
basis of heredity, with the discovery of the biological mechanism
by which cells read the information contained in genes and with
`the invention of the recombinant DNA technologies of cloning and
`sequencing by which scientists can do the same.
`The last quarter of a century has been marked by a relentless drive
`to decipher first genes and then entire genomes, spawning the field
`of genomics. The fruits of this work already include the genome
`sequences of 599 viruses and viroids, 205 naturally occurring
`plasmids, 185 organelles, 31 eubacteria, seven archaea, one
`fungus, two animals and one plant.
`Here we report the results of a collaboration involving 20 groups
`from the United States, the United Kingdom, Japan, France,
Germany and China to produce a draft sequence of the human
`genome. The draft genome sequence was generated from a physical
`map covering more than 96% of the euchromatic part of the human
`genome and, together with additional sequence in public databases,
`it covers about 94% of the human genome. The sequence was
`produced over a relatively short period, with coverage rising from
`about 10% to more than 90% over roughly fifteen months. The
`sequence data have been made available without restriction and
`updated daily throughout the project. The task ahead is to produce a
`finished sequence, by closing all gaps and resolving all ambiguities.
`Already about one billion bases are in final form and the task of
`bringing the vast majority of the sequence to this standard is now
`straightforward and should proceed rapidly.
`The sequence of the human genome is of interest in several
`respects. It is the largest genome to be extensively sequenced so far,
`being 25 times as large as any previously sequenced genome and
`eight times as large as the sum of all such genomes. It is the first
`vertebrate genome to be extensively sequenced. And, uniquely, it is
`the genome of our own species.
`Much work remains to be done to produce a complete finished
`sequence, but the vast trove of information that has become
`available through this collaborative effort allows a global perspective
`on the human genome. Although the details will change as the
`sequence is finished, many points are already clear.
• The genomic landscape shows marked variation in the distribution
of a number of features, including genes, transposable
elements, GC content, CpG islands and recombination rate. This
gives us important clues about function. For example, the developmentally
important HOX gene clusters are the most repeat-poor
regions of the human genome, probably reflecting the very complex
coordinate regulation of the genes in the clusters.
• There appear to be about 30,000-40,000 protein-coding genes in
the human genome, only about twice as many as in worm or fly.
`However, the genes are more complex, with more alternative
`splicing generating a larger number of protein products.
`• The full set of proteins (the 'proteome') encoded by the human
`genome is more complex than those of invertebrates. This is due in
`part to the presence of vertebrate-specific protein domains and
motifs (an estimated 7% of the total), but more to the fact that
`vertebrates appear to have arranged pre-existing components into a
`richer collection of domain architectures.
`• Hundreds of human genes appear likely to have resulted from
`horizontal transfer from bacteria at some point in the vertebrate
lineage. Dozens of genes appear to have been derived from transposable
elements.
• Although about half of the human genome derives from transposable
elements, there has been a marked decline in the overall
`activity of such elements in the hominid lineage. DNA transposons
`appear to have become completely inactive and long-terminal
`repeat (LTR) retroposons may also have done so.
`• The pericentromeric and subtelomeric regions of chromosomes
are filled with large recent segmental duplications of sequence from
`elsewhere in the genome. Segmental duplication is much more
`frequent in humans than in yeast, fly or worm.
• Analysis of the organization of Alu elements explains the long-standing
mystery of their surprising genomic distribution, and
`suggests that there may be strong selection in favour of preferential
`retention of Alu elements in GC-rich regions and that these 'selfish'
`elements may benefit their human hosts.
`• The mutation rate is about twice as high in male as in female
`meiosis, showing that most mutation occurs in males.
• Cytogenetic analysis of the sequenced clones confirms suggestions
that large GC-poor regions are strongly correlated with 'dark
`G-bands' in karyotypes.
`• Recombination rates tend to be much higher in distal regions
`(around 20 megabases (Mb)) of chromosomes and on shorter
chromosome arms in general, in a pattern that promotes the
`occurrence of at least one crossover per chromosome arm in each
`meiosis.
`• More than 1.4 million single nucleotide polymorphisms (SNPs)
`in the human genome have been identified. This collection should
allow the initiation of genome-wide linkage disequilibrium
`mapping of the genes in the human population.
`In this paper, we start by presenting background information on
`the project and describing the generation, assembly and evaluation
`of the draft genome sequence. We then focus on an initial analysis of
`the sequence itself: the broad chromosomal landscape; the repeat
`elements and the rich palaeontological record of evolutionary and
`biological processes that they provide; the human genes and
`proteins and their differences and similarities with those of other
`
`860
`
`~©2001 Macmillan Magazines Ltd
`
`NATURE I VO.J,. 409 115 FEBRUARY 200tlwww.nature.com
`
`SEQUENOM EXHIBIT 1101
`
`
`
Genome Sequencing Centres (Listed in order of total genomic
sequence contributed, with a partial list of personnel. A full list of
contributors at each centre is available as Supplementary
Information.)
`
Whitehead Institute for Biomedical Research, Center for Genome
Research: Eric S. Lander1*, Lauren M. Linton1, Bruce Birren1*,
Chad Nusbaum1*, Michael C. Zody1*, Jennifer Baldwin1,
Keri Devon1, Ken Dewar1, Michael Doyle1, William FitzHugh1*,
Roel Funke1, Diane Gage1, Katrina Harris1, Andrew Heaford1,
John Howland1, Lisa Kann1, Jessica Lehoczky1, Rosie LeVine1,
Paul McEwan1, Kevin McKernan1, James Meldrim1, Jill P. Mesirov1*,
Cher Miranda1, William Morris1, Jerome Naylor1,
Christina Raymond1, Mark Rosetti1, Ralph Santos1,
Andrew Sheridan1, Carrie Sougnez1, Nicole Stange-Thomann1,
Nikola Stojanovic1, Aravind Subramanian1 & Dudley Wyman1
`
The Sanger Centre: Jane Rogers2, John Sulston2*,
Rachael Ainscough2, Stephan Beck2, David Bentley2, John Burton2,
Christopher Clee2, Nigel Carter2, Alan Coulson2,
Rebecca Deadman2, Panos Deloukas2, Andrew Dunham2,
Ian Dunham2, Richard Durbin2*, Lisa French2, Darren Grafham2,
Simon Gregory2, Tim Hubbard2*, Sean Humphray2, Adrienne Hunt2,
Matthew Jones2, Christine Lloyd2, Amanda McMurray2,
Lucy Matthews2, Simon Mercer2, Sarah Milne2, James C. Mullikin2*,
Andrew Mungall2, Robert Plumb2, Mark Ross2, Ratna Shownkeen2
& Sarah Sims2
`
Washington University Genome Sequencing Center:
Robert H. Waterston3*, Richard K. Wilson3, LaDeana W. Hillier3*,
John D. McPherson3, Marco A. Marra3, Elaine R. Mardis3,
Lucinda A. Fulton3, Asif T. Chinwalla3*, Kymberlie H. Pepin3,
Warren R. Gish3, Stephanie L. Chissoe3, Michael C. Wendl3,
Kim D. Delehaunty3, Tracie L. Miner3, Andrew Delehaunty3,
Jason B. Kramer3, Lisa L. Cook3, Robert S. Fulton3,
Douglas L. Johnson3, Patrick J. Minx3 & Sandra W. Clifton3
`
US DOE Joint Genome Institute: Trevor Hawkins4,
Elbert Branscomb4, Paul Predki4, Paul Richardson4,
Sarah Wenning4, Tom Slezak4, Norman Doggett4, Jan-Fang Cheng4,
Anne Olsen4, Susan Lucas4, Christopher Elkin4,
Edward Uberbacher4 & Marvin Frazier4
`
Baylor College of Medicine Human Genome Sequencing Center:
Richard A. Gibbs5*, Donna M. Muzny5, Steven E. Scherer5,
John B. Bouck5*, Erica J. Sodergren5, Kim C. Worley5*,
Catherine M. Rives5, James H. Gorrell5, Michael L. Metzker5,
Susan L. Naylor6, Raju S. Kucherlapati7, David L. Nelson5
& George M. Weinstock8
`
RIKEN Genomic Sciences Center: Yoshiyuki Sakaki9,
Asao Fujiyama9, Masahira Hattori9, Tetsushi Yada9,
Atsushi Toyoda9, Takehiko Itoh9, Chiharu Kawagoe9,
Hidemi Watanabe9, Yasushi Totoki9 & Todd Taylor9
`
Genoscope and CNRS UMR-8030: Jean Weissenbach10,
Roland Heilig10, William Saurin10, Francois Artiguenave10,
Philippe Brottier10, Thomas Bruls10, Eric Pelletier10,
Catherine Robert10 & Patrick Wincker10
`
GTC Sequencing Center: Douglas R. Smith11,
Lynn Doucette-Stamm11, Marc Rubenfield11, Keith Weinstock11,
Hong Mei Lee11 & JoAnn Dubois11

Department of Genome Analysis, Institute of Molecular
Biotechnology: Andre Rosenthal12, Matthias Platzer12,
Gerald Nyakatura12, Stefan Taudien12 & Andreas Rump12
`
Beijing Genomics Institute/Human Genome Center:
Huanming Yang13, Jun Yu13, Jian Wang13, Guyang Huang14
& Jun Gu15
`
Multimegabase Sequencing Center, The Institute for Systems
Biology: Leroy Hood16, Lee Rowen16, Anup Madan16 & Shizen Qin16
`
Stanford Genome Technology Center: Ronald W. Davis17,
Nancy A. Federspiel17, A. Pia Abola17 & Michael J. Proctor17
`
Stanford Human Genome Center: Richard M. Myers18,
Jeremy Schmutz18, Mark Dickson18, Jane Grimwood18
& David R. Cox18

University of Washington Genome Center: Maynard V. Olson19,
Rajinder Kaul19 & Christopher Raymond19
`
Department of Molecular Biology, Keio University School of
Medicine: Nobuyoshi Shimizu20, Kazuhiko Kawasaki20
& Shinsei Minoshima20
`
`University of Texas Southwestern Medical Center at Dallas:
Glen A. Evans21†, Maria Athanasiou21 & Roger Schultz21
`
University of Oklahoma's Advanced Center for Genome
Technology: Bruce A. Roe22, Feng Chen22 & Huaqin Pan22
`
Max Planck Institute for Molecular Genetics: Juliana Ramsey23,
Hans Lehrach23 & Richard Reinhardt23
`
Cold Spring Harbor Laboratory, Lita Annenberg Hazen Genome
Center: W. Richard McCombie24, Melissa de la Bastide24
& Neilay Dedhia24
`
`GBF-German Research Centre for Biotechnology:
Helmut Blöcker25, Klaus Hornischer25 & Gabriele Nordsiek25
`
* Genome Analysis Group (listed in alphabetical order, also
includes individuals listed under other headings):
Richa Agarwala26, L. Aravind26, Jeffrey A. Bailey27, Alex Bateman2,
Serafim Batzoglou1, Ewan Birney28, Peer Bork29, Daniel G. Brown1,
Christopher B. Burge31, Lorenzo Cerutti28, Hsiu-Chuan Chen26,
Deanna Church26, Michele Clamp2, Richard R. Copley30,
Tobias Doerks29,30, Sean R. Eddy32, Evan E. Eichler27,
Terrence S. Furey33, James Galagan1, James G. R. Gilbert2,
Cyrus Harmon34, Yoshihide Hayashizaki35, David Haussler36,
Henning Hermjakob28, Karsten Hokamp37, Wonhee Jang26,
L. Steven Johnson32, Thomas A. Jones32, Simon Kasif38,
Arek Kaspryzk28, Scot Kennedy39, W. James Kent40, Paul Kitts26,
Eugene V. Koonin26, Ian Korf3, David Kulp34, Doron Lancet41,
Todd M. Lowe42, Aoife McLysaght37, Tarjei Mikkelsen38,
John V. Moran43, Nicola Mulder28, Victor J. Pollara1,
Chris P. Ponting44, Greg Schuler26, Jörg Schultz30, Guy Slater28,
Arian F. A. Smit45, Elia Stupka28, Joseph Szustakowki38,
Danielle Thierry-Mieg26, Jean Thierry-Mieg26, Lukas Wagner26,
John Wallis3, Raymond Wheeler34, Alan Williams34, Yuri I. Wolf26,
Kenneth H. Wolfe37, Shiaw-Pyng Yang3 & Ru-Fang Yeh31
`
Scientific management: National Human Genome Research
Institute, US National Institutes of Health: Francis Collins46*,
Mark S. Guyer46, Jane Peterson46, Adam Felsenfeld46*
& Kris A. Wetterstrand46; Office of Science, US Department of
Energy: Aristides Patrinos47; The Wellcome Trust: Michael J.
Morgan48
`
`
organisms; and the history of genomic segments. (Comparisons
`are drawn throughout with the genomes of the budding yeast
`Saccharomyces cerevisiae, the nematode worm Caenorhabditis
`elegans, the fruitfly Drosophila melanogaster and the mustard weed
`Arabidopsis thaliana; we refer to these for convenience simply as
`yeast, worm, fly and mustard weed.) Finally, we discuss applications
of the sequence to biology and medicine and describe next steps in
`the project. A full description of the methods is provided as
Supplementary Information on Nature's web site (http://www.nature.com).
`We recognize that it is impossible to provide a comprehensive
`analysis of this vast dataset, and thus our goal is to illustrate the
`range of insights that can be gleaned from the human genome and
`thereby to sketch a research agenda for the future.
`
`Background to the Human Genome Project
`
`The Human Genome Project arose from two key insights that
`emerged in the early 1980s: that the ability to take global views of
`genomes could greatly accelerate biomedical research, by allowing
`researchers to attack problems in a comprehensive and unbiased
`fashion; and that the creation of such global views would require a
communal effort in infrastructure building, unlike anything previously
attempted in biomedical research. Several key projects
`helped to crystallize these insights, including:
(1) The sequencing of the bacterial viruses ΦX174 (refs 4, 5) and lambda6, the
animal virus SV40 (ref. 7) and the human mitochondrion8 between 1977
`and 1982. These projects proved the feasibility of assembling small
`sequence fragments into complete genomes, and showed the value
`of complete catalogues of genes and other functional elements.
`(2) The programme to create a human genetic map to make it
`possible to locate disease genes of unknown function based solely on
`their inheritance patterns, launched by Botstein and colleagues in
1980 (ref. 9).
`(3) The programmes to create physical maps of clones covering the
`yeast10 and worm11 genomes to allow isolation of genes and regions
`based solely on their chromosomal position, launched by Olson and
`Sulston in the mid-1980s.
`
(4) The development of random shotgun sequencing of complementary
DNA fragments for high-throughput gene discovery by
Schimmel12 and Schimmel and Sutcliffe13, later dubbed expressed
sequence tags (ESTs) and pursued with automated sequencing by
Venter and others14-20.
`The idea of sequencing the entire human genome was first
`proposed in discussions at scientific meetings organized by the
US Department of Energy and others from 1984 to 1986 (refs 21,
22). A committee appointed by the US National Research Council
endorsed the concept in its 1988 report23, but recommended a
`broader programme, to include: the creation of genetic, physical
`and sequence maps of the human genome; parallel efforts in key
`model organisms such as bacteria, yeast, worms, flies and mice; the
`development of technology in support of these objectives; and
`research into the ethical, legal and social issues raised by human
`genome research. The programme was launched in the US as a joint
`effort of the Department of Energy and the National Institutes of
`Health. In other countries, the UK Medical Research Council and
`the Wellcome Trust supported genomic research in Britain; the
Centre d'Etude du Polymorphisme Humain and the French Muscular
Dystrophy Association launched mapping efforts in France;
government agencies, including the Science and Technology Agency
and the Ministry of Education, Science, Sports and Culture, supported
genomic research efforts in Japan; and the European Community
helped to launch several international efforts, notably the
`programme to sequence the yeast genome. By late 1990, the Human
`Genome Project had been launched, with the creation of genome
`centres in these countries. Additional participants subsequently
`joined the effort, notably in Germany and China. In addition, the
Human Genome Organization (HUGO) was founded to provide a
forum for international coordination of genomic research. Several
books24-26 provide a more comprehensive discussion of the genesis
of the Human Genome Project.

Figure 1 Timeline of large-scale genomic analyses. Shown are selected components of work on several non-vertebrate model organisms (red), the mouse (blue) and the human (green) from 1990; earlier projects are described in the text. SNPs, single nucleotide polymorphisms; ESTs, expressed sequence tags.
[Timeline axis: 1990 to 2001. Tracks: discussion and debate in the scientific community (from 1984) and NRC report; bacterial genome sequencing; S. cerevisiae, C. elegans, D. melanogaster and A. thaliana sequencing; for mouse and human, genetic maps (microsatellites, SNPs), physical maps, cDNA sequencing (ESTs, full length) and genomic sequencing (pilot sequencing, draft, finishing).]

Through 1995, work progressed rapidly on two fronts (Fig. 1).
The first was construction of genetic and physical maps of the
human and mouse genomes27-31, providing key tools for identification
of disease genes and anchoring points for genomic sequence.
The second was sequencing of the yeast32 and worm33 genomes, as
well as targeted regions of mammalian genomes34-37. These projects
`showed that large-scale sequencing was feasible and developed the
`two-phase paradigm for genome sequencing. In the first, 'shotgun',
`phase, the genome is divided into appropriately sized segments and
`each segment is covered to a high degree of redundancy (typically,
`eight- to tenfold) through the sequencing of randomly selected
`subfragments. The second is a 'finishing' phase, in which sequence
gaps are closed and remaining ambiguities are resolved through
`directed analysis. The results also showed that complete genomic
`sequence provided information about genes, regulatory regions and
chromosome structure that was not readily obtainable from cDNA
`studies alone.
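
The notion of redundancy used above can be made concrete with a little arithmetic. The sketch below assumes that redundancy simply means total sequenced bases divided by segment length; the segment size (150,000 bp) and read length (500 bp) are hypothetical values chosen for illustration and are not taken from the text.

```python
import math


def reads_needed(segment_length_bp, read_length_bp, redundancy):
    """Reads required so that total read bases = redundancy x segment length."""
    return math.ceil(redundancy * segment_length_bp / read_length_bp)


# Hypothetical inputs: a 150,000 bp segment sequenced with 500 bp reads.
for fold in (8, 10):
    print(f"{fold}-fold redundancy: {reads_needed(150_000, 500, fold)} reads")
```

Under these assumed values, eight- to tenfold redundancy corresponds to a few thousand reads per segment.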
`In 1995, genome scientists considered a proposal38 that would
`have involved producing a draft genome sequence of the human
`genome in a first phase and then returning to finish the sequence in
`a second phase. After vigorous debate, it was decided that such a
`plan was premature for several reasons. These included the need first
`to prove that high-quality, long-range finished sequence could be
`produced from most parts of the complex, repeat-rich human
`genome; the sense that many aspects of the sequencing process
`were still rapidly evolving; and the desirability of further decreasing
`costs.
Instead, pilot projects were launched to demonstrate the feasibility
of cost-effective, large-scale sequencing, with a target completion
date of March 1999. The projects successfully produced
finished sequence with 99.99% accuracy and no gaps39. They also
introduced bacterial artificial chromosomes (BACs)40, a new large-insert
cloning system that proved to be more stable than the cosmids
and yeast artificial chromosomes (YACs)41 that had been used
previously. The pilot projects drove the maturation and convergence
of sequencing strategies, while producing 15% of the human
`genome sequence. With successful completion of this phase, the
`human genome sequencing effort moved into full-scale production
`in March 1999.
`The idea of first producing a draft genome sequence was revived
`at this time, both because the ability to finish such a sequence was no
`longer in doubt and because there was great hunger in the scientific
`community for human sequence data. In addition, some scientists
`favoured prioritizing the production of a draft genome sequence
over regional finished sequence because of concerns about commercial
plans to generate proprietary databases of human sequence
that might be subject to undesirable restrictions on use42-44.
The consortium focused on an initial goal of producing, in a first
`production phase lasting until June 2000, a draft genome sequence
`covering most of the genome. Such a draft genome sequence,
`although not completely finished, would rapidly allow investigators
to begin to extract most of the information in the human sequence.
Experiments showed that sequencing clones covering about 90% of
the human genome to a redundancy of about four- to fivefold ('half-shotgun'
coverage; see Box 1) would accomplish this45,46. The draft
genome sequence goal has been achieved, as described below.
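
To give a rough feel for what four- to fivefold redundancy means within a clone, the following sketch uses an idealized Poisson sampling model, which is an assumption introduced here rather than a method described in the text. It treats read starts as uniformly random and ignores cloning bias, repeats and the minimum overlap needed to join reads, so real gap counts would be somewhat higher. The function names are hypothetical.

```python
import math


def uncovered_fraction(coverage):
    """Expected fraction of bases hit by no read, under Poisson sampling."""
    return math.exp(-coverage)


def expected_gaps(num_reads, coverage):
    """Expected number of reads not followed by an overlapping read (idealized)."""
    return num_reads * math.exp(-coverage)


for c in (4, 5, 8, 10):
    print(f"{c}-fold: about {uncovered_fraction(c):.4%} of bases unsampled")
```

In this simplified model roughly 1 to 2% of bases in a clone remain unsampled at four- to fivefold redundancy, whereas at eight- to tenfold the unsampled fraction falls below 0.05%, which is consistent with the draft being described as informative but not finished.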
`The second sequence production phase is now under way. Its
`aims are to achieve full-shotgun coverage of the existing clones
`during 2001, to obtain clones to fill the remaining gaps in the
`physical map, and to produce a finished sequence (apart from
`regions that cannot be cloned or sequenced with currently available
`techniques) no later than 2003.
`
`Strategic issues
`
`
libraries with more uniform representation. The practice of sequencing
from both ends of double-stranded clones ('double-barrelled'
`shotgun sequencing) was introduced by Ansorge and others37 in
`1990, allowing the use of 'linking information' between sequence
`fragments.
`The application of shotgun sequencing was also extended by
applying it to larger and larger DNA molecules, from plasmids
(~4 kilobases (kb)) to cosmid clones37 (40 kb), to artificial chromosomes
cloned in bacteria and yeast55 (100-500 kb) and bacterial
genomes56 (1-2 megabases (Mb)). In principle, a genome of arbitrary
size may be directly sequenced by the shotgun method,
provided that it contains no repeated sequence and can be uniformly
sampled at random. The genome can then be assembled
using the simple computer science technique of 'hashing' (in which
one detects overlaps by consulting an alphabetized look-up table of
all k-letter words in the data). Mathematical analysis of the
expected number of gaps as a function of coverage is similarly
straightforward57.
`Practical difficulties arise because of repeated sequences and
`cloning bias. Small amounts of repeated sequence pose little
`problem for shotgun sequencing. For example, one can readily
`assemble typical bacterial genomes (about 1.5% repeat) or the
euchromatic portion of the fly genome (about 3% repeat). By
`contrast, the human genome is filled (>50%) with repeated
`sequences, including interspersed repeats derived from transposable
`elements, and long genomic regions that have been duplicated in
`tandem, palindromic or dispersed fashion (see below). These
`include large duplicated segments (50-500 kb) with high sequence
`identity (98-99.9%), at which mispairing during recombination
`creates deletions responsible for genetic syndromes. Such features
`complicate the assembly of a correct and finished genome sequence.
There are two approaches for sequencing large repeat-rich
genomes. The first is a whole-genome shotgun sequencing
approach, as has been used for the repeat-poor genomes of viruses,
`bacteria and flies, using linking information and computational
`
[Figure: Hierarchical shotgun sequencing. Stages shown: genomic DNA; BAC library; organized, mapped large clone contigs; BAC to be sequenced; shotgun clones; shotgun sequence, illustrated by the overlapping reads ...ACCGTAAATGGGCTGATCATGCTTAAA and TGATCATGCTTAAACCCTGTGCATCCTACTG...]
`
Hierarchical shotgun sequencing
Soon after the invention of DNA sequencing methods47,48, the
shotgun sequencing strategy was introduced49-51; it has remained
the fundamental method for large-scale genome sequencing52-54 for
the past