`
`Research Article • DOI: 10.2478/mngs-2013-0001 • MNGS • 201 • 10–203
`
`An improved approach to mate-paired library preparation for
`Illumina sequencing
`
`Abstract
High quality data from mate-pair libraries provide long range sequence
linkage across the genome, which is crucial for de novo assembly and
structural-variant detection. Current commercial methods available
for the construction of such libraries have differing limitations and are
often linked to a single sequencing platform in a kit format, which may
not be cost effective. We present an alternative mate-paired protocol,
demonstrated using Illumina sequencing platforms, which combines the
specificity of hybridisation and ligation to circularise fragments with high
yield. An adapter sequence is incorporated at the junction site of
the mate pairs, the length of which is evenly controlled by nick translation.
We present a comparison of results from 3 Kb E. coli and Plasmodium
falciparum 3D7 mate-pairs made with our protocol, alongside commercial
mate-pair methods. Furthermore, we present the results of a set of 3 and
6 Kb mate-pair libraries from seven different mouse strains made with our
mate-pair protocol to demonstrate its reliability and robustness.
`
`Naomi Park*,
`Lesley Shirley,
`Yong Gu,
`
`Thomas M. Keane,
`Harold Swerdlow,
`Michael A. Quail
`
`The Wellcome Trust Sanger Institute,
`Wellcome Trust Genome Campus,
`Hinxton, Cambridge, UK
`
`Keywords
`Circularisation • Next Generation Sequencing • Mate-Pair • Long Insert • Illumina
`• Nextera • Pippin Prep
`
`Received 18 February 2013
`Accepted 26 June 2013
`
`© Versita Sp. z o.o.
`
`Introduction
`
Mate-pair libraries, also known as long-insert libraries, have
been used successfully to aid de novo sequencing, structural-variant
detection and genome finishing [1]. Distance information
from mate-pair reads has particular value for joining contigs
flanking repetitive sequences. The resolution of larger structural
rearrangements such as insertions, deletions and inversions is
aided by mapping mate-pair reads to a reference sequence [2].
However, the construction of mate-pair libraries is notoriously
difficult [3], particularly for degraded samples or for samples with
limited amounts of DNA. During mate-pair library preparation,
distal sequences are brought together in a circularisation reaction,
after which the majority of the large fragment is removed, leaving
the original ends as a juxtaposed mate-pair. Insert sizes are
typically between 2 and 40 Kb. Mate-pair libraries should map
onto the reference sequence as ‘outward-facing’ paired reads,
with a gap between the mapped reads that is approximately the
same as the size of the original fragments that were selected.
Low quality, damaged or degraded DNA samples often lead to
an increase in undesirable ‘inward-facing’ reads, which align to
the reference sequence pointing towards each other and tend
to map close together. The desired insert size is directed by the
specific project and this in turn directs selection of a suitable
methodology. Techniques such as Illumina’s (San Diego, CA,
USA) Mate Pair Library Preparation Kit v2 and Nextera Mate Pair
Sample Prep Kit, as well as both of the SOLiD 4 and SOLiD 5500
methods (Life Technologies; Staley Road, Grand Island, NY, USA),
utilise intramolecular circularisation to bring together the ends
of smaller fragments (i.e. 3-10 Kb). Cre-Lox recombination has
been used for both smaller and larger fragment libraries (i.e. 3-20
Kb) [3,4]. Even larger fragment sizes (i.e. 40 Kb) may be inserted
into a fosmid or BAC vector as in the Lucigen (Middleton, WI,
USA) NxSeq 40 Kb mate-pair cloning kit [5], and fragments of up to
300 Kb may be obtained from existing BAC libraries previously prepared for
Sanger capillary sequencing [6]. Each circularisation technique
is suitable for sequencing on Illumina NGS platforms, or may be
easily adapted to do so. As per the Illumina mate-pair methods,
linear mate-pair fragments generated by other techniques can be
captured on streptavidin beads for end-repair, A-tailing, adapter
ligation and PCR amplification. However, in our experience,
no single commercially available circularisation method for the
generation of 3 Kb mate-pair libraries is both reliable and optimal
for all sample types.

* E-mail: nh4@sanger.ac.uk
`
`Illumina mate-paired libraries
`Illumina mate-paired libraries utilise blunt-ended circularisation
`of 3-5 Kb fragments, followed by a secondary fragmentation
`step. During Mate Pair Library Preparation Kit v2 (Illumina v2)
`library construction, biotinylated nucleotides are incorporated
`at the ends of the sheared fragments. Degraded samples may
`contain nicks into which the biotin can also insert which, after the
secondary fragmentation step, are bound by streptavidin beads
alongside genuine mate-pairs. As ligation between blunt ends is
generally more difficult to achieve than ligation between cohesive
ends [7], the circularisation yield may be poor, leading to a lower
complexity final library. Ligation between two independent
fragments can also generate an undesirable proportion of
chimeric reads. Random secondary fragmentation of circularised
fragments causes uneven genomic sequence length either
side of the junction and, as the junction contains no adapter
sequence, sequencing reads that pass through the junction of
the two joined ends cannot be identified and pose problems
during mapping and de novo assembly [4].
The Nextera Mate Pair Sample Prep Kit (Nextera) circumvents a number of these
issues. Transposome mediated fragmentation and biotinylated
adapter tagging of genomic DNA generates an identifiable mate-pair
junction sequence. Whilst biotin will not incorporate into the
nicks of degraded DNA, tagmentation of poor quality samples
is likely to fragment such DNA to a size below the desired size
range (Nextera Mate Pair Sample Preparation Guide). As with
Illumina v2, circularisation proceeds via blunt-ended ligation and
is followed by random secondary fragmentation.
`
`SOLiD mate-paired libraries
`The Life Technologies 2x50 bp Mate-Paired Library kit for the
`SOLiD 4 system incorporates hybridisation and ligation in order
`to circularise fragments. As described in the manufacturer’s
`protocol, an adapter is ligated to both ends of the 3 Kb fragment;
`the adapter has a 2-base overhang. A biotinylated internal
`adapter, complementary to the two overhangs is added and the
`ligation reaction is held at 20°C to complete circularisation. The
`efficiency of the circularisation is often low, probably due to the
`short 2 bp hybridisation region which may limit the efficiency of
`the ligation [8]. The mate-pair protocol for the SOLiD 5500 series
`improves the yield of circularisation in comparison to the SOLiD
4 method (Life Technologies press release). Different left and
right adapters are ligated; these contain longer overhangs
(the exact length of which is undisclosed) that are blocked by short
oligonucleotides. The reaction is heated to 70°C and cooled,
`during which time the blocking groups denature, allowing the
`left and right complementary overhangs to anneal to each other,
`thus forming a circle. Because the left and right adapters are of
`different sequence composition, only ~50% of adapter-ligated
`fragments are amenable to circularisation. Additionally, the 70°C
`temperature of the circularisation reaction may be detrimental
to AT-rich genomes and/or to degraded samples. All SOLiD
circularised fragments contain two nicks on opposite strands
either side of the biotinylated adapter, which are nick-translated
into the inserted genomic sequence, and subsequently digested
with T7 exonuclease and S1 nuclease at the translated nick site to give
linear dsDNA for streptavidin-bead capture. This process of
`nick translation and digestion from an adapter sequence is
`favourable in comparison to random shearing, which causes
`uneven genomic sequence length either side of the junction.
`As the junction of the joined DNA ends is marked by a known
`adapter sequence, the reads can be trimmed or split easily.
`
Cre-Lox recombination
Roche (Penzberg, Germany) GS-FLX paired-end libraries
generate 3-20 Kb mate-pairs with cre-recombinase mediated
`recombination. Adaptations of this method to generate Illumina
`sequencing-ready libraries have previously been reported in the
`literature [3,4]. Circularisation adapters which contain LoxP sites
`are ligated to both ends of the fragment. This product undergoes
`cre recombination to generate circularised DNA containing
`biotinylated adapter sequence at the junction site. Although this
`method of circularisation is highly efficient, random secondary
`fragmentation of circularised fragments generates uneven
`sequence length either side of the LoxP adapter sequence,
`which may cause sequencing reads to pass through the adapter
`and into the other side, resulting in mapping issues.
`
`Improved (Sanger) mate-paired libraries
`In order to generate unbiased and diverse Illumina mate-paired
`libraries containing even genomic sequence either side of a
`common adapter sequence, we altered the Illumina mate-pair
`protocol to use a modified SOLiD 4 hybridisation and ligation
`circularisation approach. A single double stranded adapter
`(coloured green in Figure 1) is ligated to each end of the sheared
`fragment, leaving a 9 base overhang. This increased length of
`sticky end, as shown in Figure 2, generates a stable structure for
`subsequent ligation to a biotinylated internal adapter (coloured
`red in Figure 1). Due to the absence of a phosphate at the 5’
`end of the “Adapter Bottom” oligonucleotide, the circularised
`fragments contain one nick on each strand, which are used
`for nick translation into the genomic sequence, as is done with
both SOLiD methods. The nicked sites are extended outward
from the mate-pair region by T7 exonuclease (New England
Biolabs; Ipswich, MA, USA), generating a single stranded
region. This single stranded region is digested by S1 nuclease
(Life Technologies), releasing a linear biotinylated mate-paired
fragment from the rest of the circle. Subsequently only the
biotinylated mate-paired fragment is captured by streptavidin
beads for Illumina library preparation.
This new method (see Supplementary method) was
compared to the Illumina v2, Nextera, SOLiD 4, SOLiD 5500
and cre-lox methods of circularisation by making 3 Kb
E. coli and Plasmodium falciparum 3D7 (pf3D7) mate-pair
libraries. Each method was adapted for sequencing on Illumina
platforms and sequenced in a multiplexed pool. The robustness
and reproducibility of our mate-pair method were further
demonstrated by 3 Kb and 6 Kb mate-paired libraries made from
seven mouse strains. For each library, we present an analysis of
the post-circularisation yield and the total library yield. We also
show sequencing quality metrics: mapped read percentage,
total mapped reads and proper-paired reads, along with the number
of singletons, duplicates and chimeras.
`
`Results and Discussion
`
`An ideal mate-pair library will have the following features:
`a diverse population of reads which align to the reference
`
`
`
Adapter Top
5’ pCTGCTTGTGGACGTTGTACATCGTGGTGC 3’

Adapter Bottom
5’ TGTACAACGTCCACAAGCAG 3’

Internal Adapter Top
5’ pGGAGCCTAGTGCGCACCACGA 3’

Internal Adapter Bottom
5’ pGCACTAGGCTCCGCACCACGA 3’

Double stranded Adapter after annealing
5’ pCTGCTTGTGGACGTTGTACATCGTGGTGC 3’
3’  GACGAACACCTGCAACATGT 5’

Double stranded Internal Adapter after annealing
5’         pGGAGCCTAGTGCGCACCACGA 3’
3’ AGCACCACGCCTCGGATCACGp 5’

Figure 1. Adapter and Internal Adapter Oligonucleotides. The double stranded adapter (green) has a 9-base overhang which is complementary to each
9-base overhang of the internal adapter (red). The biotinylated thymidine (T) is highlighted in yellow.
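
The complementarity described in the Figure 1 legend can be checked directly from the oligonucleotide sequences. The following is a minimal sketch, not part of the published protocol: the function names are hypothetical, the sequences are copied from Figure 1 with the 5’ phosphate marks omitted, and the annealed duplex is assumed to start flush at the 5’ end of each strand.

```python
# Oligonucleotide sequences copied from Figure 1 (5' -> 3', phosphates omitted).
ADAPTER_TOP = "CTGCTTGTGGACGTTGTACATCGTGGTGC"
ADAPTER_BOTTOM = "TGTACAACGTCCACAAGCAG"
INTERNAL_TOP = "GGAGCCTAGTGCGCACCACGA"
INTERNAL_BOTTOM = "GCACTAGGCTCCGCACCACGA"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def three_prime_overhang(strand: str, partner: str) -> str:
    """Return the single-stranded 3' tail of `strand` once `partner` is annealed.

    Assumption (illustrative only): the duplex is formed by the 5' ends of both
    strands, so the longest k with strand[:k] == revcomp(partner[:k]) is the
    duplex length and everything after it on `strand` is overhang.
    """
    for k in range(min(len(strand), len(partner)), 0, -1):
        if strand[:k] == reverse_complement(partner[:k]):
            return strand[k:]
    raise ValueError("the two strands do not anneal at their 5' ends")


adapter_overhang = three_prime_overhang(ADAPTER_TOP, ADAPTER_BOTTOM)
internal_overhangs = (
    three_prime_overhang(INTERNAL_TOP, INTERNAL_BOTTOM),
    three_prime_overhang(INTERNAL_BOTTOM, INTERNAL_TOP),
)

# The adapter overhang should pair with both internal adapter overhangs.
for overhang in internal_overhangs:
    print(len(overhang), overhang == reverse_complement(adapter_overhang))
```

Under these assumptions each overhang is 9 bases long, and both internal adapter overhangs are the reverse complement of the adapter overhang, which is what allows one internal adapter to bridge the two adapter-ligated ends of a fragment.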
`
`Post Adapter ligation:
5’          TGTACAACGTCCACAAGCAG NNNNNNNN CTGCTTGTGGACGTTGTACATCGTGGTGC 3’
3’ CGTGGTGCTACATGTTGCAGGTGTTCGTC NNNNNNNN GACGAACACCTGCAACATGT 5’
`
`Post Circularisation with the Internal Adapter:
` 5’ NNNN CTGCTTGTGGACGTTGTACATCGTGGTGCGGAGCCTAGTGCGCACCACGATGTACAACGTCCACAAGCAG NNNN 3’
` 3’ NNNN GACGAACACCTGCAACATGTAGCACCACGCCTCGGATCACGCGTGGTGCTACATGTTGCAGGTGTTCGTC NNNN 5’
`
`
`Figure 2. Schematic of the library preparation steps. NNNN denotes 3 Kb DNA fragments ligated to the adapter. The internal adapter hybridises and
`ligates to the adapter leaving a nick (underlined) to enable translation into the genomic region. The biotinylated thymidine (T) is highlighted
`in yellow.
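
Because the junction in Figure 2 has a known sequence, a read that runs into it can be cut at that point. The sketch below is illustrative only and is not the authors' analysis pipeline: the function name, the exact-match search and the `min_overlap` default are assumptions. It relies on one property that follows from the Figure 2 sequences: the junction read on either strand begins with the 29-base Adapter Top sequence.

```python
# Sequences copied from Figures 1 and 2 (5' -> 3').
ADAPTER_TOP = "CTGCTTGTGGACGTTGTACATCGTGGTGC"
INTERNAL_TOP = "GGAGCCTAGTGCGCACCACGA"
ADAPTER_BOTTOM = "TGTACAACGTCCACAAGCAG"

# Top strand of the junction after circularisation, as drawn in Figure 2.
JUNCTION = ADAPTER_TOP + INTERNAL_TOP + ADAPTER_BOTTOM


def trim_at_junction(read: str, min_overlap: int = 10) -> str:
    """Return the genomic part of `read`, cut where the junction adapter starts.

    Exact matching only (no mismatches or indels); `min_overlap` is a
    hypothetical threshold for adapter fragments at the 3' end of the read.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    # Case 1: the whole Adapter Top sequence is present in the read.
    hit = read.find(ADAPTER_TOP)
    if hit != -1:
        return read[:hit]
    # Case 2: the read ends part-way into the adapter.
    longest = min(len(ADAPTER_TOP) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(ADAPTER_TOP[:k]):
            return read[:-k]
    # Case 3: no adapter detected, the read is returned unchanged.
    return read


genomic = "ACGTACGTTAGCCGATAGGCTA"          # made-up genomic sequence
print(trim_at_junction(genomic + JUNCTION[:40]))
```

A production trimmer would normally tolerate sequencing errors; exact matching is used here only to keep the example short.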
`
`genome in an outward-facing orientation, reads which do
`not reach adapter sequence, an average size between the
`outward-facing paired reads matching that desired, no chimeric
`reads, and no GC bias. Generation of fragments of the desired
`size is dependent upon the genomic starting material being
`sufficiently intact, the method employed for fragmentation
`being reliable and the method (if any) used to size select for
`the desired fragment range being accurate and selective. In the
`case of the hybridisation/ligation/circularisation approach used
`within this paper, inward-facing paired reads with a small insert
`size are caused by the non-specific capture of non-biotinylated
`library fragments. This is in contrast to the Illumina v2 mate-
`pair method, in which biotinylated bases can insert into
`nicked sites of damaged DNA; the resulting molecules can be
`captured on the streptavidin-coated beads, and yield inward-
`facing paired reads. The diversity of the final mate-pair library
`is dependent upon a number of factors, including the amount
`of adapter-ligated DNA going into the circularisation reaction,
`the circularisation yield (which can be determined after Plasmid
`Safe® digestion (Epicentre; Madison, WI, USA) of remaining
`linear DNA) and the number of PCR cycles.
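
The distinction between outward-facing and inward-facing pairs can be made concrete with a small classifier. This is a minimal sketch under stated assumptions, not the authors' analysis code: the `MappedRead` type and both function names are hypothetical, both reads are assumed to map to the same reference sequence, and coordinates are 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedRead:
    start: int      # leftmost reference coordinate of the alignment
    end: int        # rightmost reference coordinate (inclusive)
    reverse: bool   # True if the read aligned to the reverse strand


def classify_pair(a: MappedRead, b: MappedRead) -> str:
    """Classify a read pair as 'outward', 'inward' or 'same-strand'."""
    left, right = sorted((a, b), key=lambda r: r.start)
    if left.reverse == right.reverse:
        return "same-strand"
    # A reverse-strand read points towards lower coordinates. If the leftmost
    # read is the reverse-strand one, the two reads point away from each other.
    return "outward" if left.reverse else "inward"


def gap_between(a: MappedRead, b: MappedRead) -> int:
    """Number of reference bases between the two alignments (0 if they touch)."""
    left, right = sorted((a, b), key=lambda r: r.start)
    return max(0, right.start - left.end - 1)


mate_pair = (MappedRead(1000, 1099, True), MappedRead(4000, 4099, False))
contaminant = (MappedRead(1000, 1099, False), MappedRead(1250, 1349, True))
print(classify_pair(*mate_pair), gap_between(*mate_pair))
print(classify_pair(*contaminant), gap_between(*contaminant))
```

In this toy example the first pair behaves like a genuine 3 Kb mate-pair (outward-facing, widely separated), while the second resembles a short-insert fragment that escaped biotin selection (inward-facing, closely spaced).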
`
`Mate-Pair Method Comparison
We compared the method presented in this paper with the
Illumina v2, Nextera, SOLiD 4, SOLiD 5500 and cre-lox methods
of circularisation. Both Nextera protocols were carried out: one using a
1 µg input with an AMPure (Beckman Coulter; Brea, CA, USA) bead size
selection, and one using a 4 µg input with a gel size selection.
Each experiment was performed in duplicate.
Where possible, variables such as fragmentation, size selection,
circularisation input amount and Illumina sequencing preparation
were standardised in order to aid a fair comparison. Each
method was evaluated with the preparation and sequencing of
both E. coli (50.8% GC) and Plasmodium falciparum 3D7 (19.3%
GC) genomes.
`
`Circularisation
In theory, any improvement in the yield of circularisation
should directly increase the complexity of the final mate-pair
library. The yield of each circularisation reaction from a
normalised input of 400 ng material is presented in Table 1. For
both genomes, the Sanger method demonstrated a ~1.7-fold
improvement in yield above the SOLiD 5500 method and a ~4-fold
improvement in yield above the SOLiD 4 adapter protocol.
The estimated library size (defined as the total number of unique
fragments; Table 2) for the SOLiD 4 E. coli libraries reflects
this positive correlation between circularisation yield and final
library complexity. This correlation is also demonstrated with
the SOLiD 5500 pf3D7 library; however, the estimated library
size of the E. coli SOLiD 5500 library is twice that of the Sanger
library, indicating that the relationship between circularisation yield
and final library diversity may be influenced by other factors. In
the case of the SOLiD 4 method, extremely poor circularisation
of pf3D7 (Table 1) led to an inability to produce a successful
library. The Illumina v2 protocol demonstrated a ~1.3-fold
improvement in circularisation yield above the Sanger method.
However, sequencing metrics (Table 2) indicate that this is at
least partially due to the formation of chimeric circles by two
unrelated fragments (~15% of mapped reads) and not genuine
mate-pairs. Conversely, the Nextera protocol demonstrated an
improvement of 1.7-2.9-fold in circularisation yield, which is not
as markedly related to an increase in chimeric reads (although
this is still elevated at ~2.3%). Circularisation yields of the
cre-lox libraries were undetectable by the high sensitivity Qubit
assay, and these libraries had a ~5-fold lower estimated library size than the
Sanger libraries.
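
The yields and fold changes quoted above follow from simple arithmetic on the Table 1 values. As a minimal sketch (the function and variable names are hypothetical; the numbers are the mean E. coli outputs after Plasmid Safe digestion from Table 1, and the 400 ng input is the normalised input stated in the text):

```python
CIRCULARISATION_INPUT_NG = 400.0

# Mean output after Plasmid Safe digestion (ng), E. coli libraries, Table 1.
E_COLI_OUTPUT_NG = {
    "Sanger": 37.2,
    "Nextera 1 ug": 110.0,
    "Nextera 4 ug": 107.4,
    "SOLiD 5500": 22.5,
    "SOLiD 4": 9.6,
    "Illumina v2": 51.3,
}


def circularisation_yield_percent(output_ng: float,
                                  input_ng: float = CIRCULARISATION_INPUT_NG) -> float:
    """Percentage of the input DNA recovered as circles."""
    return 100.0 * output_ng / input_ng


def fold_change(method: str, baseline: str, outputs: dict) -> float:
    """Ratio of circularised output for `method` relative to `baseline`."""
    return outputs[method] / outputs[baseline]


for name, output in E_COLI_OUTPUT_NG.items():
    print(f"{name:13s} {circularisation_yield_percent(output):5.1f} %")

print(round(fold_change("Sanger", "SOLiD 5500", E_COLI_OUTPUT_NG), 2))
print(round(fold_change("Sanger", "SOLiD 4", E_COLI_OUTPUT_NG), 2))
```

The fold changes in the text are approximate figures covering both genomes, so a value computed from the E. coli means alone need not match them exactly.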
`
`Mate-Pair Size
Genomic DNA was mechanically sheared for each 3 Kb library
with the exception of the Nextera libraries, which
underwent transposome mediated fragmentation and adapter
tagging of genomic DNA. Transposome mediated fragmentation
is dependent upon high quality and accurately quantified starting
material. In our experience, the quantification of “real
life” samples is often difficult and inaccurate due to impurities,
even with the use of fluorometric methods specific for
duplex DNA such as the Qubit dsDNA BR kit. Additionally, the
GC content of the genome alters the fragmentation pattern and,
despite the proportional scaling up of all reaction components,
the 4 µg tagmentation of pf3D7 generated a ~2.7-fold larger
mean fragment size than that of the 1 µg tagmentation. With
the exception of the 1 µg Nextera libraries, all libraries were
size selected using the Blue Pippin (Sage Science; Beverly, MA,
USA) using conditions shown in Supplementary Tables 1 and 2,
to target the peak size maximum as determined by the Agilent
Bioanalyzer (Agilent; Santa Clara, CA, USA).
`
`Library quality
Library quality statistics of reads mapped to the E. coli and
pf3D7 genomes are given in Table 2. Despite multiple attempts,
pf3D7 SOLiD 4 libraries failed to yield a final library. All methods
yielded a high proportion (>75%) of mappable reads, with the
exception of one E. coli SOLiD 4 library (38%), of which only
34% were proper pairs, and the cre-lox libraries (59-67%), of
which only 36-55% were proper pairs. All other libraries were
70-87% proper pairs, the lower end being the Nextera libraries.
Singleton reads ranged from 2 to 5%, except for E. coli Nextera
libraries (~11%) and cre-lox libraries (8-24%). Inward-facing
reads were low for all libraries, the highest being the Sanger
(0.8-1.4%) and Illumina v2 (0.9-1.8%) libraries. Although
still low, the inward-facing reads are likely due to insufficient
removal of non-biotinylated material during the washing steps
of these particular libraries. Duplicate rates ranged between
0.2 and 12%, with Nextera libraries generating the lowest values and
cre-lox libraries the highest.
Intermolecular circularisation may occur if two different
fragments concatemerise or, in the case of the Sanger,
SOLiD 4, SOLiD 5500 and cre-lox libraries, if
two fragments incorrectly ligate to each other in addition
to ligating to the circularisation adapter. Either of these
scenarios will result in chimeric reads, an artefact that
resembles structural variation. The presence of chimeric reads poses a
major problem in the generation of mate-pair libraries and the
elimination of these is highly desirable. For both genomes, the
Sanger and cre-lox methods of circularisation generated the
lowest number of chimeric reads (0.06/0.1% respectively for
E. coli and 0.7/0.3% respectively for pf3D7). Due to the poor
performance of the cre-lox libraries and the high performance
Table 1. E. coli/pf3D7 circularisation and final library yields. 400 ng material was used as input into each circularisation reaction; all libraries went through
13 cycles of PCR. All values are mean ± standard deviation.

E. coli

Method          Output post Plasmid   Circularisation   Library yield     Library yield
                Safe digestion (ng)   yield (%)         (pmol)            (nmol/l)
Sanger          37.2 ± 3              9.3 ± 0.7         0.028 ± 0.001     1.4 ± 0.07
Nextera 1 µg    110 ± 12              27.6 ± 3          2.921 ± 0.2       146 ± 10.4
Nextera 4 µg    107.4 ± 14            26.9 ± 3          2.200 ± 0.4       110 ± 18.8
SOLiD 5500      22.5 ± 0.6            5.6 ± 0.2         0.052 ± 0.005     2.6 ± 0.2
SOLiD 4         9.6 ± 2               2.4 ± 0.4         0.008 ± 0.003     0.4 ± 0.2
Illumina v2     51.3 ± 17             12.8 ± 4          0.112 ± 0.001     5.6 ± 0.06
454             ND                    NA                0.010 ± 0.003     0.5 ± 0.2

P. falciparum 3D7

Method          Output post Plasmid   Circularisation   Library yield     Library yield
                Safe digestion (ng)   yield (%)         (pmol)            (nmol/l)
Sanger          41.7 ± 10             10.4 ± 2          0.0248 ± 0.01     1.2 ± 0.5
Nextera 1 µg    74.0 ± 6              18.5 ± 1          1.941 ± 1         97.1 ± 55.0
Nextera 4 µg    69.4 ± 1              17.3 ± 0.2        0.504 ± 0.008     25.2 ± 0.4
SOLiD 5500      23.0 ± 0.1            5.7 ± 0.02        0.012 ± 0.0002    0.59 ± 0.008
SOLiD 4         9.9 ± 3               2.5 ± 0.7         Fail              Fail
Illumina v2     48.0 ± 0              12.0 ± 0          0.202 ± 0.04      10.1 ± 2.1
454             ND                    NA                0.003 ± 0.00001   0.15 ± 0.0007
`
`
`2522,2913,3456
`
`2501,2906,3489
`
`1,666,050
`
`1,436,903
`
`2872,3191,3962
`
`106,831,975
`
`2874,3176,3647
`
`96,778,532
`
`4281(0.34)
`
`3455(0.35)
`
`-
`
`-
`
`-
`
`-
`
`-
`
`-
`
`3199,3589,4189
`
`3216,3592,4188
`
`3,181,087
`
`3,119,007
`
`11066(1.17)
`
`11413(1.09)
`
`6695,7976,10,109
`
`221,719,888
`
`6777,8032,10,223
`
`217,591,242
`
`2079,3010,4549
`
`474,453,473
`
`2073,3086,4689
`
`596,244,158
`
`2593,2867,3293
`
`2614,2907,3332
`
`6,662,099
`
`7,092,996
`
`3072,3406,3985
`
`3062,3415,4000
`
`2506,3017,3893
`
`2527,3050,3889
`
`3167,3409,3808
`
`3128,3386,3808
`
`3198,3531,4063
`
`3249,3582,4137
`
`1,578,265
`
`1,022,158
`
`45,559,230
`
`46,305,416
`
`748,054
`
`4,697,911
`
`17,608,752
`
`13,748,108
`
`4094,4740,5336
`
`408,363,590
`
`2570,3222,4323
`
`231,344,363
`
`1862,2521,3570
`
`637,651,262
`
`1418,2023,2925
`
`532,093,793
`
`3148,3407,3818
`
`2496,2884,3414
`
`8,446,293
`
`6,721,919
`
`Quartiles
`Insert Size
`
`Size
`
`Estimated Library
`
`21267(2.3)
`
`40831(2.3)
`
`6919(0.67)
`
`4242(0.67)
`
`1421(0.1)
`
`1695(0.12)
`
`8379(0.71)
`
`762(0.06)
`
`844(0.05)
`
`109748(10.92)
`
`76244(9.82)
`
`2226(0.14)
`
`1606(0.12)
`
`1477532(15.89)
`
`194408(2.25)
`
`167792(1.79)
`
`1920467(15.91)
`
`348911(3.1)
`
`217956(1.79)
`
`38810(2.51)
`
`25111(2.45)
`
`-
`
`-
`
`41250(5.21)
`
`54131(6.04)
`
`4414(0.32)
`
`2025(0.22)
`
`1624(0.2)
`
`5040(0.33)
`
`33446(3.48)
`
`12540(2.13)
`
`102108(9.45)
`
`124106(11.98)
`
`167799(15.39)
`
`11938(1.2)
`
`260410(15.56)
`
`16788(1.24)
`
`10824(0.76)
`
`13334(0.99)
`
`36445(2.25)
`
`34923(2.22)
`
`37013(2.03)
`
`41127(2.25)
`
`27408(1.79)
`
`40488(7.75)
`
`74499(5.83)
`
`19212(1.82)
`
`20201(2)
`
`29072(2.11)
`
`23894(1.83)
`
`34952(2.24)
`
`33331(2.28)
`
`43832(3.81)
`
`90133(5.68)
`
`-
`
`-
`
`3202(0.32)
`
`3722(0.35)
`
`1994(0.13)
`
`1286(0.12)
`
`1310(0.14)
`
`2386(0.13)
`
`16178(1.42)
`
`10290(1.51)
`
` falciparum 3D7
`
`P.
`
`4722(0.29)
`
`6486(0.37)
`
`9832(0.9)
`
`14522(0.86)
`
`1362(0.1)
`
`2686(0.19)
`
`2024(0.17)
`
`1242(0.09)
`
`5704(0.35)
`
`3478(0.22)
`
`8822(0.48)
`
`6060(0.33)
`
`10770(0.8)
`
`17220(0.99)
`
`E.Coli
`
`Chimeras (%)
`
` Duplicates (%)
`
`Inward reads (%)
`
`1546478
`
`1326766
`
`9395990
`
`11251194(92.32)
`
`12186512
`
`1545406(85.01)
`
`1080721(66.96)
`
`1035672(59.66)
`
`1534005(91.36)
`
`1277513(88.73)
`
`1058409(89.03)
`
`1008034(74.73)
`
`1379340(84.36)
`
`1305714(82.32)
`
`1558622(84.70)
`
`1462202(79.28)
`
`1149143(85.18)
`
`1585574(91.22)
`
`-
`
`-
`
`1006516
`
`1068608
`
`1576742
`
`1046822
`
`944780
`
`1817908
`
`1135602
`
`682164
`
`1613924
`
`1735912
`
`1635128
`
`1840202
`
`1093040
`
`1679048
`
`1374586
`
`1439736
`
`1188802
`
`1348984
`
`1349072
`
`1586134
`
`1844424
`
`1738250
`
`454_2
`
`454_1
`
`Illuminav2_2
`
`Illuminav2_1
`
`SOLiD4_2
`
`SOLiD4_1
`
`SOLID5500_2
`
`SOLiD5500_1
`
`Nextera 4µg_2
`
`Nextera 4µg_1
`
`Nextera 1µg_2
`
`Nextera 1µg_1
`
`Sanger_2
`
`Sanger_1
`
`454_2
`
`454_1
`
`Illuminav2_2
`
`Illuminav2_1
`
`SOLiD4_2
`
`SOLiD4_1
`
`SOLID5500_2
`
`SOLiD5500_1
`
`Nextera 4µg_2
`
`Nextera 4µg_1
`
`Nextera 1µg_2
`
`Nextera 1µg_1
`
`Sanger_2
`
`Sanger_1
`
`Method
`
`All reads
`
`
`
`
Table 2. Post sequencing analysis metrics for 3 Kb E. coli and pf3D7 mate-paired libraries prepared using commercial methods alongside the Sanger method. Libraries were indexed, multiplexed and sequenced on the
Illumina MiSeq. Singletons are defined as an individual read which does not have a corresponding mate-pair. Chimeras are defined as incorrect mate-pairs formed during circularisation when ligation occurs
between unrelated DNA molecules.
`
`137129(8.87)
`
`857836(55.47)
`
`1005179(65.00)
`
`111472(8.40)
`
`657592(49.56)
`
`776177(58.50)
`
`233605(2.49)
`
`8175824(87.01)
`
`8637064(91.92)
`
`304631(2.50)
`
`10652296(87.41)
`
`-
`
`-
`
`18314(1.82)
`
`18714(1.75)
`
`64545(4.09)
`
`43751(4.18)
`
`47012(4.98)
`
`89371(4.92)
`
`44706(3.94)
`
`31129(4.56)
`
`-
`
`-
`
`-
`
`-
`
`763222(75.83)
`
`791843(78.67)
`
`865328(80.98)
`
`895871(83.84)
`
`1310604(83.12)
`
`1389665(88.14)
`
`866396(82.76)
`
`920122(87.90)
`
`753156(79.72)
`
`811459(85.89)
`
`1434318(78.90)
`
`878902(77.40)
`
`960130(84.55)
`
`536014(78.58)
`
`588694(86.30)
`
`381225(23.62)
`
`686376(42.53)
`
`399508(23.01)
`
`621090(35.78)
`
`62561(5.72)
`
`97663(5.82)
`
`38687(2.81)
`
`41563(2.89)
`
`34991(2.94)
`
`45338(3.36)
`
`921338(84.29)
`
`998143(91.32)
`
`1416994(84.39)
`
`481644(35.04)
`
`522669(38.02)
`
`1225494(85.12)
`
`1015338(85.41)
`
`957798(71.00)
`
`186088(11.38)
`
`1169260(71.51)
`
`169252(10.67)
`
`1124720(70.91)
`
`205162(11.15)
`
`1318192(71.63)
`
`213900(11.60)
`
`1224990(66.42)
`
`75641(5.61)
`
`83148(4.78)
`
`1051838(77.97)
`
`1452332(83.55)
`
`(%)
`
`Singleton reads
`
`Proper pairs (%)
`
`(%)
`
`Mapped reads
`
`
`of the Sanger libraries in other quality metrics, the Sanger
`method of circularisation provides an attractive option in
`order to avoid chimeric reads.
The Nextera libraries were by far the most diverse
(highest estimated library size) for both genomes.
The low input DNA requirement of the Nextera method
(1 µg without, or 4 µg with, gel size selection), high circularisation
efficiency and good performance in other quality metrics make
it a promising option for the generation of mate-pair libraries.
However, the ~2% chimera rate, the requirement for high quality
and accurately quantified starting material, as well as the
sensitivity of transposome mediated fragmentation to GC
content, are all limitations. Illumina v2 libraries generated the
second highest level of diversity, but the method's historical sensitivity to
poor quality starting material, lack of internal adapter sequence
and high chimeric rate are limitations.
SOLiD 4 libraries generated good data if sufficient library
was generated, but the method is limited by its efficiency of
circularisation, leading to limited library diversity. E. coli SOLiD
5500 libraries performed very well, with library diversity second
to that of the Illumina methods, and achieved excellent scores
in other quality metrics aside from a 0.7% rate of chimeras
(compared to 0.06% for the Sanger libraries). However, the
SOLiD 5500 pf3D7 libraries did not perform quite as well as the
E. coli libraries, resulting in a higher duplicate rate and much
reduced estimated library size. This indicates that the 70°C heating
and snap cooling of circularisation may have a detrimental
effect for low GC genomes. We postulate that this method of
circularisation may also pose problems for partially degraded
samples.
`In our method, biotin is incorporated into the mate-pair
`junction via an adapter sequence and cannot insert into nicked
`sites of damaged DNA, as may be the case for Illumina v2
`libraries. Mechanical shearing is not affected by degraded
`material to the extent of Nextera fragmentation. We therefore
`consider our method to be preferable for partially degraded
`(<40kb but >5kb) starting material.
The GC profiles of the E. coli and pf3D7 mate-pair libraries
are shown in Figure 3. Each method of circularisation shows
minimal bias towards GC content, which may be attributed to the
standardised use of KAPA HiFi polymerase (Kapa Biosystems,
Woburn, MA, USA) for all library amplifications [11].
`
`Mus musculus Sanger Mate-Pair Libraries
In order to demonstrate the robustness and reproducibility
of our mate-pair method, we prepared 3 Kb and 6 Kb mate-paired
libraries from seven mouse strains. Genomic DNA
was sheared with a red miniTUBE for each 3 Kb library and a
g-TUBE (Covaris Inc.; Woburn, MA, USA) for each 6 Kb library. All libraries were
size selected using the Blue Pippin to target the peak size
maximum as determined by the Bioanalyzer. Fragment sizes
after size selection and post mapping of reversed reads are
listed in Table 3, and an example trace of each library size is
presented in Figure 4. We noted an average discrepancy of
1.5 Kb for the 3 Kb libraries and 1.2 Kb for the 6 Kb libraries
between the peak fragment size observed on the Agilent
Bioanalyzer and that observed after mapping.
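
The 1.5 Kb figure for the 3 Kb libraries can be reproduced from the per-strain values in Table 3. The sketch below is illustrative only: the function name is hypothetical, the values are the 3 Kb fragment lengths from Table 3 in strain order (AKR/J to FVB/NJ), and the unit is assumed to be Kb because the table header does not state one.

```python
from statistics import mean

# 3 Kb libraries, Table 3; unit assumed to be Kb.
POST_SIZE_SELECTION_KB = [6.1, 5.1, 5.9, 6.2, 3.9, 4.5, 5.3]
POST_MAPPING_KB = [4.6, 3.2, 4.5, 4.6, 3.0, 3.1, 3.2]


def size_discrepancies(selected, mapped):
    """Per-library difference between Bioanalyzer peak and mapped insert size."""
    if len(selected) != len(mapped):
        raise ValueError("both lists must describe the same libraries")
    # Round to one decimal place, the precision reported in Table 3.
    return [round(s - m, 1) for s, m in zip(selected, mapped)]


differences = size_discrepancies(POST_SIZE_SELECTION_KB, POST_MAPPING_KB)
print(differences)
print(round(mean(differences), 1))
```

The per-library differences agree with the "Difference" column of Table 3, and their mean rounds to the 1.5 Kb quoted in the text.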
`
`Library diversity
Library diversity of the mate-pair libraries is dependent on the
amount of adapter-ligated DNA going into the circularisation
reaction, the circularisation yield and the number of PCR cycles
used. Starting with 10 µg (3 Kb) or 20 µg (6 Kb) of Mus musculus genomic
DNA, we obtained 0.7-1.8 µg (3 Kb) or 1.7-4.4 µg (6 Kb) of adapter-ligated
material for input into the circularisation
reaction. This recovery could be increased or decreased by
widening or narrowing, respectively, the size
range collected by the Blue Pippin. The Blue Pippin size-range
parameters (Supplementary Tables 1 and 2) were selected to
achieve good recovery at the cost of an increased size
distribution. Table 4 shows that the circularisation yield for our
method ranged from 11.2-17.1% (3 Kb) and 7.3-21.3% (6 Kb),
which generated a post-PCR final library yield of between
0.088 and 1.5 pMol (3 Kb) and 0.091 and 0.64 pMol (6 Kb). In
addition to varying input amount for the circularisation reaction
`
Figure 3. GC profile analysis of E. coli and pf3D7 sequence data. The GC
content distribution for each method, a) E. coli and b) pf3D7, is
shown alongside the theoretical data for the reference genome
(red trace).
`
`
`Table 3. Library quality metrics for 3 Kb and 6 Kb Mus musculus mate-pair libraries, as determined after size selection (bioanalyzer 7500 kit) and mapping
`of reversed reads post sequencing. All libraries underwent 13 cycles of PCR.
`
3 Kb libraries

Sample        Post size   Post        Difference  Circularisation Output post     Circularisation Library     Library
              selection   mapping     (Kb)        input (ng)      Plasmid Safe    yield (%)       yield       yield
              (Kb)        (Kb)                                    digestion (ng)                  (pmol)      (nmol/l)
AKR/J         6.1         4.6         1.5         725             82              11.2            0.088       4.4
SPRET/EiJ     5.1         3.2         1.9         840             128             15.2            0.239       12
PWK/PhJ       5.9         4.5         1.4         870             112             12.9            0.091       4.5
C57BL/6NJ     6.2         4.6         1.6         1065            119             11.2            0.111       5.6
NOD/ShiLtJ    3.9         3           0.9         1770            302             17.1            1.504       75.2
WSB/EiJ       4.5         3.1         1.4         1578            188             11.9            0.937       46.9
FVB/NJ        5.3         3.2         2.1         1185            161             13.6            0.39        19.5
Mean ± stdev  5.3 ± 0.9   3.7 ± 0.8   1.5 ± 0.4   1147.6 ± 393.9  156 ± 72.9      13.3 ± 2.2      0.48 ± 0.5  24 ± 27.1

6 Kb libraries

Sample        Post size   Post        Difference  Circularisation Output post     Circularisation Library     Library
              selection   mapping     (Kb)        input (ng)      Plasmid Safe    yield (%)       yield       yield
              (Kb)        (Kb)                                    digestion (ng)                  (pmol)      (nmol/l)
AKR/J         6.2         6.2         1.1         2500            280             11.2            0.272       13.6
SPRET/EiJ     6.3         6.3         0.7         2200            413             18.8            0.236       11.8
PWK/PhJ       6.5         6.5         1.2         3700            790             21.3            0.638       43.1
C57BL/6NJ     6.5         6.5         0.9         1700            221             13              0.09        4.5
NOD/ShiLtJ    6.5         6.5         1.6         4400            497             11.3            0.177       8.8
WSB/EiJ       6.9         6.9         1.1         1700            124             7.3             0.088       4.4
FVB/NJ        7.2         7.2         1.7         2900            258             8.9             0.161       8.1
Mean ± stdev  6.6 ± 0.3   6.6 ± 0.3   1.2 ± 0.4   2728.6 ± 1017.7 369 ± 222.8     13.1 ± 5.1      0.24 ± 0.2  13.5 ± 13.5
Figure 4. Example Mus musculus mate-pair library post size-selection Bioanalyzer traces, alongside reverse-mapped library insert size measured after
sequencing: a) 3 Kb and b) 6 Kb. Each library maps approximately 1 Kb smaller than the Bioanalyzer peak maximum.
`
`
`and fluctuating circularisation yield, the post-PCR final library
`yield is subject to random variation in sample loss during each
`reaction clean-up.
`
`Library quality
Library quality statistics of reads mapped to the Mus musculus
genome are given in Table 4. A high proportion of reads map (86-96%),
of which the majority are proper pairs (70-91%). Chimeric
`reads were not observed to any significant level (below 1%)
`and may be a result of actual genomic alterations between
`the sequenced strains and the C57BL/6J reference genome
`(NCBIm37), and not due to true chimeras [9]. The majority of
`sequenced libraries have a high estimated library size and a
`low duplicate rate; those which have worse statist