Not All Sequence Tags Are Created Equal: Designing and Validating Sequence Identification Tags Robust to Indels

Brant C. Faircloth¹*, Travis C. Glenn²

¹ Department of Ecology and Evolutionary Biology, University of California Los Angeles, Los Angeles, California, United States of America, ² Department of Environmental Health Science, University of Georgia, Athens, Georgia, United States of America
`
`Abstract
`
`Ligating adapters with unique synthetic oligonucleotide sequences (sequence tags) onto individual DNA samples before
`massively parallel sequencing is a popular and efficient way to obtain sequence data from many individual samples. Tag
`sequences should be numerous and sufficiently different to ensure sequencing, replication, and oligonucleotide synthesis
`errors do not cause tags to be unrecoverable or confused. However, many design approaches only protect against
`substitution errors during sequencing and extant tag sets contain too few tag sequences. We developed an open-source
`software package to validate sequence tags for conformance to two distance metrics and design sequence tags robust to
`indel and substitution errors. We use this software package to evaluate several commercial and non-commercial sequence
`tag sets, design several large sets (maxcount = 7,198) of edit metric sequence tags having different lengths and degrees of
`error correction, and integrate a subset of these edit metric tags to polymerase chain reaction (PCR) primers and sequencing
`adapters. We validate a subset of these edit metric tagged PCR primers and sequencing adapters by sequencing on several
`platforms and subsequent comparison to commercially available alternatives. We find that several commonly used sets of
`sequence tags or design methodologies used to produce sequence tags do not meet the minimum expectations of their
`underlying distance metric, and we find that PCR primers and sequencing adapters incorporating edit metric sequence tags
`designed by our software package perform as well as their commercial counterparts. We suggest that researchers evaluate
`sequence tags prior to use or evaluate tags that they have been using. The sequence tag sets we design improve on extant
`sets because they are large, valid across the set, and robust to the suite of substitution, insertion, and deletion errors
`affecting massively parallel sequencing workflows on all currently used platforms.
`
`Citation: Faircloth BC, Glenn TC (2012) Not All Sequence Tags Are Created Equal: Designing and Validating Sequence Identification Tags Robust to Indels. PLoS
`ONE 7(8): e42543. doi:10.1371/journal.pone.0042543
`
`Editor: Shin-Han Shiu, Michigan State University, United States of America
`
`Received May 14, 2012; Accepted July 9, 2012; Published August 10, 2012
Copyright: © 2012 Faircloth, Glenn. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
`
`Funding: This work was supported by a Smithsonian Scholarly Studies Grant to Stephen P. Hubbell and BCF, National Science Foundation (NSF) grant DEB-
`1136626 to BCF and TCG, NSF grant DEB-0614208 to TCG, an Amazon Web Services Educational grant to BCF and TCG, and material (TruSeq-style adapters) and
`sequencing contributions (HiSeq lanes) from Integrated DNA Technologies. The funders had no role in study design, data collection and analysis, decision to
`publish, or preparation of the manuscript.
`
Competing Interests: This study was partly supported by an Amazon Web Services Educational grant. Indexed-adapter (TruSeq-style) and sequencing contributions (HiSeq lanes) from Integrated DNA Technologies supported this work. TruSeq-style Oligonucleotide sequences © 2007–2012 Illumina, Inc. All rights reserved. Derivative works created by Illumina customers are authorized for use with Illumina instruments and products only. All other uses are strictly prohibited. There are no further patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials, as detailed online in the guide for authors.
`
`* E-mail: brant@faircloth-lab.org
`
`Introduction
`
Synthetic, oligonucleotide sequence identification tags (sequence tags) can be attached to individual pieces of DNA allowing pooling and sample tracking during massively parallel sequencing (MPS) [1–3]. Sequence tags enable efficient distribution of the output from these platforms among many individually identifiable samples rather than extensive, deep sequencing of single individuals or mixed samples. Thus, the ability to tag and track sequenced DNA from many individuals in multiplex increases the efficiency of MPS when the genomes being sequenced are small [4] or when researchers want to apportion the output of MPS platforms among smaller genomic regions of many individuals [5–7].
Groundbreaking prior work introduced the idea of sequence tagging by incorporating tags to sequence reads using polymerase chain reaction (PCR) primers and DNA ligation [1–3]. Yet, early sequence tags were designed for specific platforms and platform-specific error patterns, and few tag sets were created to address the complement of errors (insertions, deletions, and substitutions) affecting the uniqueness of each tag sequence across the suite of current sequencing platforms. Errors can also be introduced to sequence tags during tag synthesis and strand replication (library preparation or template amplification), in addition to DNA sequencing.
Errors in sequence tag synthesis occur during the coupling reaction, when DNA bases are being joined to form the desired oligonucleotide strand [8]. Coupling errors produce n-1, n-2, and n-3 congeners containing deletion errors throughout the oligo [9,10]. Relatively expensive purification techniques remove most of these congeners, particularly the n-2 and n-3 varieties, but some n-1 congeners remain, even with increasingly sophisticated purification methods (e.g., HPLC) [11]. Thus, all synthetic oligonucleotides have the potential to contain deletion errors, and this potential increases significantly when expensive purification is not used. However, expensive purification techniques are increasingly cost prohibitive as the number of required sequence tags or adapters containing tags increases, and HPLC purification can introduce additional problems if sequence tagged adapters or sequence tagged primers are sequentially purified [12] without accounting for carryover.

PLOS ONE | www.plosone.org | August 2012 | Volume 7 | Issue 8 | e42543

Errors in strand replication often occur during the amplicon generation or library preparation process (cf. [13]), because researchers use thermostable DNA polymerases and PCR to generate amplicons, increase library concentration by ligation-mediated PCR, or add sequence tags to adapter-ligated fragments. Thermostable DNA polymerases predominantly incorporate substitution errors to DNA strands during replication [14,15], although most DNA polymerases can produce new DNA strands containing insertion or deletion errors at a lower frequency [15,16]. The error rate is template- and polymerase-dependent, and modern proof-reading DNA polymerases having exonuclease activity exhibit low rates of nucleotide incorporation error, suggesting that these types of enzymes should be used in all amplicon sequencing and library preparation procedures [17]. Similar synthesis errors accrue during downstream template amplification (i.e., emulsion PCR [emPCR] for 454, Ion Torrent and SOLiD platforms or cluster formation for Illumina), but this is generally less of a problem because sequences are determined from the consensus of many molecules on one particle or in one cluster.
Sequencing errors occur on all MPS platforms, but the type of errors and the error rates vary across MPS platforms [18–25]. Sequencing errors on platforms from Roche 454, Applied Biosystems (Ion Torrent), and Pacific Biosciences largely consist of insertion and deletion errors, whereas sequencing errors on platforms from Illumina and Applied Biosystems (SOLiD) are generally substitutions [26,27]. Single-read sequencing error rates vary from 0.5–5% [20,21,25,28] on Roche, Illumina, and Applied Biosystems platforms to 18% on the Pacific Biosciences platform [23]. Sequencing error rates are not uniformly distributed across sequence reads from platforms that amplify the templates (e.g., Illumina, Ion Torrent and Roche) with most errors occurring at the beginning and end of reads [18,22,29]. This biased distribution of sequencing errors along a read affects sequence tags immediately adjacent to or far from the start of the sequence read [30] to a greater degree than sequence tags offset from 5′ or 3′ ends.
Synthesis, replication, and sequencing errors negatively impact the utility of sequence tags because they change the basepair composition of individual tags by inserting bases to, substituting bases within, or deleting bases from the identifying sequence. All three types of error can cause one tag to appear identical to another (crossover) or sufficiently alter a sequence tag such that it is unrecognizable (loss) and untraceable to the source material. A uniformly distributed error rate of 1.0% during an MPS sequencing run producing 10^6 reads, each having an 8 bp sequence tag, results in approximately 77,000 reads (8%) having one or more errors within the sequence tag (Figure S1). Probability ensures that longer sequence tags, which allow multiplexing of more samples, are affected by sequencing error to a greater degree, and tags of longer length should have greater minimum distance from all tags in the set.
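The arithmetic behind the 8 bp example can be checked with a short calculation. This is a minimal sketch that assumes errors strike each base independently at a fixed rate; the function names are illustrative and are not part of EDITTAG.

```python
def prob_tag_error(per_base_error: float, tag_length: int) -> float:
    """Probability that a tag contains one or more errors.

    Assumes each base is hit independently with the same error rate.
    """
    return 1.0 - (1.0 - per_base_error) ** tag_length


def expected_affected_reads(n_reads: int, per_base_error: float, tag_length: int) -> float:
    """Expected number of reads whose tag carries at least one error."""
    return n_reads * prob_tag_error(per_base_error, tag_length)


# Compare tag lengths at a 1% per-base error rate and one million reads.
for length in (6, 8, 10):
    affected = expected_affected_reads(10**6, 0.01, length)
    print(f"{length} nt tag: about {affected:,.0f} of 1,000,000 reads affected")
```

Under these assumptions the affected fraction grows with tag length, which is why longer tags need a larger minimum distance.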
Using error-correction schemes, researchers can construct sequence tags that are more robust to synthesis, replication, and sequencing errors (i.e., minimizing crossover and loss) while also allowing the correction of certain types of errors. Hamady et al. [31] used Hamming codes [32] to develop a set of error-correcting sequence tags with which they successfully tracked a large number of reads in multiplex (see also [33]). However, Hamming codes assume that the errors occurring within each sequence tag are only substitutions [34,35]. Insertion and deletion errors violate the codeword scheme and reduce the utility of Hamming-based tags when commercial synthesis does not completely remove n-1 congeners, standard Taq polymerase is used during strand replication, or sequence data are generated on platforms incorporating insertion and deletion errors (Figure 1; [36]). Additionally, when Hamming-distance tags are constructed using a binary representation of each base (e.g., T = 00; G = 01; C = 10; A = 11), which we define as "binary encoding" (Figure S2), 33% of substitution errors, while detectable, are uncorrectable because sequencing errors occur among actual nucleotides (Figure 2; [37]). Thus, sequence tags appropriately designed using Hamming codes should use nucleotide representations of each base rather than their binary encoding [37].
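The 33% figure follows directly from the two-bit encoding given above. The sketch below enumerates every single-nucleotide substitution and counts those that flip both bits; the variable and function names are illustrative only.

```python
from itertools import permutations

# Two-bit encoding quoted in the text (T = 00; G = 01; C = 10; A = 11).
ENCODING = {"T": "00", "G": "01", "C": "10", "A": "11"}


def bit_differences(base_a: str, base_b: str) -> int:
    """Number of differing bits between the encodings of two bases."""
    return sum(x != y for x, y in zip(ENCODING[base_a], ENCODING[base_b]))


# All 12 ordered single-nucleotide substitutions (original base, observed base).
substitutions = list(permutations(ENCODING, 2))

# Substitutions that change both bits look like two binary errors.
double_bit = [(a, b) for a, b in substitutions if bit_differences(a, b) == 2]

fraction = len(double_bit) / len(substitutions)
print(double_bit)
print(f"{fraction:.0%} of single-base substitutions are double binary errors")
```

Only the T/A and G/C exchanges flip both bits, so one third of substitutions exceed what a single-bit-correcting code can repair.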
`Sequence tags based on the edit metric or Levenshtein distance
`[38,39] are superior to Hamming-distance tags, because edit
`metric sequence tags are robust to the types of errors introduced
`by oligonucleotide synthesis, replication, and DNA sequencing:
`insertions, deletions, and substitutions. Edit metric sequence tags
`allow for error correction according to the following formulas [38–
`40]:
`
\[ \text{Required Edit Distance} = 2 \times (\text{Errors}) + 1 \]

or

\[ \text{Correctable Errors} = (\text{Edit Distance} - 1) / 2 \]
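As a quick illustration, the two relationships can be written as small helper functions. This is a sketch rather than EDITTAG code; rounding down for even edit distances is an assumption added here, since the formulas are stated for the odd distances that guarantee correction.

```python
def required_edit_distance(errors: int) -> int:
    """Minimum edit distance needed to correct `errors` errors per tag."""
    return 2 * errors + 1


def correctable_errors(edit_distance: int) -> int:
    """Errors correctable at a given minimum edit distance.

    Integer division is an assumption for even distances, where the
    extra unit of distance helps detection but not correction.
    """
    return (edit_distance - 1) // 2


for distance in (3, 5, 7):
    print(f"edit distance {distance}: corrects {correctable_errors(distance)} error(s)")
```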
`
`Figure 1. Insertion and deletion errors violate the codeword
`scheme and reduce the utility of Hamming-based tags. Panel (A)
`shows two sequence tags that are different from one another by seven
`substitutions (Hamming distance = 7) – a distance more than sufficient
`to differentiate tags in the presence of substitution errors. However,
`these same two tags have an edit distance of two (B) – meaning that a
`total of two insertions, substitutions, or deletions can turn Tag 1 into
Tag 2 and confuse samples. Although it seems improbable that two indels or substitutions would occur in a sequence tag, consider the third case (C) in which a single deletion event at the 5′ end of a sequence tag adjoining DNA template beginning with 5′ guanine confuses Tag 1 with Tag 2. Edit metric sequence tags of distance three or greater would mitigate this mistake.
`doi:10.1371/journal.pone.0042543.g001
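The contrast the figure draws between the two metrics can be reproduced with a small pure-Python sketch. The two tags below are hypothetical examples chosen for illustration (they are not the tags shown in Figure 1), and the dynamic-programming routine is a textbook Levenshtein implementation, not the module EDITTAG uses.

```python
def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length tags")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x from a
                current[j - 1] + 1,          # insert y into a
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


# Hypothetical tags: tag_2 is tag_1 shifted by one base.
tag_1 = "ACGTACGT"
tag_2 = "CGTACGTA"

print("Hamming distance:", hamming(tag_1, tag_2))
print("Edit distance:   ", levenshtein(tag_1, tag_2))
```

A one-base shift makes every position differ under the Hamming metric, yet one deletion plus one insertion converts one tag into the other.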
`
`
(3) methods for prepending sequence tags to amplification primers and inserting tags into platform-specific sequencing adapters; and (4) multiprocessing support to speed tag generation when tag lengths are long (≥8 nt).
`We use components of EDITTAG to validate a number of
`existing sequence tag sets provided by commercial and non-
`commercial sources, design several sets of edit metric sequence
`tags of varying edit distance, and integrate a subset of edit metric
`sequence tags to Epicentre Nextera adapters, Illumina TruSeq
`adapters, and PCR primers. We then validate this subset of tags by
`sequencing across the indices of indexed adapters and sequence-
`tagged PCR primers on the Illumina (GAIIx and HiSeq 2000) and
`Roche 454 (FLX Titanium) platforms.
`
`Materials and Methods
`
`EDITTAG provides a suite of Python (http://www.python.org)
`programs for: validating sequence tags for conformance to the edit
`or Hamming distance metrics, designing edit metric sequence tags,
`and incorporating sequence tags to amplicons or platform-specific
`sequencing adapters. We describe implementation details for each
`of these EDITTAG processes, and we follow each description with
`the steps we followed to implement or validate each process.
`
`Sequence Tag Validation
The validate_edit_metric_tags.py program within EDITTAG checks existing tag sets, alone or incorporated into PCR primers or sequencing adapters, for conformance to the edit metric by performing pairwise, edit distance comparisons between each tag in the input set and all other tags in the set. In short, the program iterates through the set of tags input; computes the pairwise edit distance between all tags in the set using either a C-based Python module or a pure-Python method; and outputs either the minimum distance of the set, those tag pairs having an edit distance less than the minimum expected, or the edit distance between all members of a set, depending on the output options selected by the user. This program is also capable of computing the Hamming distance between sequence tag inputs based on selection of the Hamming algorithm in place of the edit distance algorithm by the user.
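The pairwise check described above can be sketched in a few lines. This is an illustration of the idea only, not the code of validate_edit_metric_tags.py; the function names, return shape, and example tags are assumptions.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Pure-Python edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (x != y)))
        previous = current
    return previous[-1]


def validate_tags(tags, minimum_distance):
    """Return (observed minimum distance, list of violating pairs).

    Each violating pair is reported as (tag_a, tag_b, distance).
    """
    if len(tags) < 2:
        raise ValueError("need at least two tags to compare")
    distances = [(a, b, levenshtein(a, b)) for a, b in combinations(tags, 2)]
    observed = min(d for _, _, d in distances)
    violations = [pair for pair in distances if pair[2] < minimum_distance]
    return observed, violations


# Hypothetical input set containing one close pair.
example_tags = ["AAAA", "CCCC", "AAAC"]
observed_minimum, violating_pairs = validate_tags(example_tags, minimum_distance=3)
print("observed minimum distance:", observed_minimum)
print("pairs below the expected distance:", violating_pairs)
```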
We used validate_edit_metric_tags.py to test the conformance of eight existing sequence tag sets available from commercial (Illumina, Inc. and Roche 454, Inc.) and non-commercial sources [29,31,40–42] to their respective distance metric (Hamming or edit) by appropriately formatting an input file for these tags (File S1) and inputting this file to the program. We used the tag-rescanning feature of design_edit_metric_tags.py (described below) to determine the number of tags in these sequence tag sets having minimum edit distances of three and five.
`
`Sequence Tag Design
Technically, designing error-correcting sequence tags is a matter of generating all n-length combinations of [A,C,G,T]; filtering tags based on subjective or platform-specific criteria including removal of: combinations containing homopolymer runs, combinations with undesirable base composition, or individual tags that are perfect self-complements; and iteratively comparing each tag in the remaining group against all other tags in the remaining group to create the largest set that maintains some minimum edit distance. Practically, the process is more complex because the design of sequence tag sets requires comparison of all tags in the candidate set to all other tags in the candidate set. Given sequence tags of sufficient length, this requirement rapidly approaches the limits of desktop computation.
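The generate-then-filter stage can be sketched as follows. The thresholds (maximum homopolymer run, GC bounds) are hypothetical placeholders, and treating a "perfect self-complement" as a tag equal to its own reverse complement is an interpretation made for this example, not a statement of EDITTAG's exact criteria.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(tag: str) -> str:
    """Reverse complement of a DNA tag."""
    return tag.translate(COMPLEMENT)[::-1]


def has_homopolymer(tag: str, max_run: int = 2) -> bool:
    """True if any base repeats more than `max_run` times in a row."""
    run = 1
    for previous, current in zip(tag, tag[1:]):
        run = run + 1 if previous == current else 1
        if run > max_run:
            return True
    return False


def gc_fraction(tag: str) -> float:
    """Fraction of G and C bases in the tag."""
    return sum(base in "GC" for base in tag) / len(tag)


def candidate_tags(length: int, max_run: int = 2, gc_bounds=(0.4, 0.6)):
    """Yield every tag of `length` that survives the illustrative filters."""
    low, high = gc_bounds
    for bases in product("ACGT", repeat=length):
        tag = "".join(bases)
        if has_homopolymer(tag, max_run):
            continue
        if not low <= gc_fraction(tag) <= high:
            continue
        if tag == reverse_complement(tag):
            continue
        yield tag


four_mers = list(candidate_tags(4))
print(f"{len(four_mers)} of {4 ** 4} four-nucleotide tags pass the filters")
```

The distance-based selection that follows filtering is the expensive step, since it compares every surviving candidate with every other.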
`
`Figure 2. Using Hamming codes to design binary encoded
`sequence tags when synthesis, replication, or sequencing
`errors mutate the nucleotide sequence reduces the number
`of single-base errors that are correctable during downstream
`demultiplexing. Here, we show two sequence tags (Tag 1 and Tag 2)
`and both their nucleotide and binary encodings. Tag 1 and Tag 2 have a
`Hamming distance of four between their binary representations and a
`Hamming distance of two between their nucleotide representations.
`Error 1 is correctable to Tag 2, because a single nucleotide substitution
`(in purple) results in a single, binary difference (11 versus 01) between
`Error 1 and Tag 2, and single binary errors are correctable when tags are
`at least three binary differences from each other. Error 2 and Error 3 tags
`also exhibit a single nucleotide substitution (in purple) but two binary
`differences from Tag 1 and two binary differences from Tag 2. Because
`there is more than a single binary difference, we cannot determine
`whether the source tag was originally Tag 1 or Tag 2, we cannot correct
`the error, and we must discard the read. More generally, because of the
`binary encoding and the Hamming distance between tags (Hamming
`distance four between binary representations, Hamming distance two
`between nucleotide representations), we can correct single binary
`errors seen in the substitutions around the perimeter of inset (B), but
`we cannot correct double binary errors across the diagonals of inset (B).
`Because these single nucleotide, double binary substitutions (i.e., across
`the diagonals) comprise two of six potential substitution mutations, we
`cannot correct 33% (2/6) of single nucleotide substitution errors.
`doi:10.1371/journal.pone.0042543.g002
`
Thus, we can correct up to two sequencing errors in sequence tags from a set having an edit distance of five. Although edit metric sequence tags are provided by several commercial (e.g., Roche 454, Inc.) and non-commercial sources [40,41], there are few available methods (cf. [29]) of generating sets of edit metric-based sequence tags. Furthermore, current methods may generate tags that do not correctly follow the edit metric (Table 1), and current methods are best suited to generating sequence tag sets comprising tags of shorter length (≤8 nt). The continually increasing output of MPS platforms suggests that large collections of edit metric sequence tags will be essential to distributing output across smaller genomes, select genomic regions, and populations of individuals.
`Here, we introduce EDITTAG, a collection of tools for testing
`sequence tags for conformance to the edit or Hamming distance
`metric, generating edit metric sequence tags, and programmati-
`cally applying sequence tags to PCR primers and platform-specific
`sequencing adapters. EDITTAG differs from similar programs by
`providing: (1) a method to check the conformity of previously
`designed tags, adapters, linkers, or primers to the edit metric; (2) a
`method to generate edit metric sequence tags of arbitrary length;
`
`
Table 1. Commercial and non-commercial sequence tag sets and the conformance of each to the stated or assumed distance metric (edit or Hamming).

| Class | Set Name | Length (nt) | N tags | Design Algorithm | Minimum Distance (exp) | Minimum Distance (obs) | Pair Violations | Tags ≥ D expected | Comments |
|---|---|---|---|---|---|---|---|---|---|
| Contain Violations | Illumina TruSeq sRNA | 6 | 48 | Hamming | 3 | 2 | 2 | 47 | Some tags violate expected Hamming distance |
| | Hamady et al. 2007 (1) | 8 | 1544 | Hamming | 3 | 4/2 (2) | - | 1544 | Only corrects 66% of errors |
| | Meyer et al. 2010 (3) | 6 | 75 | Edit | 3 | 2 | 40 | 49 | Some tags violate expected edit distance |
| | Meyer et al. 2010 (3) | 8 | 711 | Edit | 3 | 2 | 551 | 429 | Some tags violate expected edit distance |
| | Adey et al. 2010 (4) | 9 | 96 | Edit | 4 | 2 | 58 | 64 | Some tags violate expected edit distance |
| Correct Hamming distance | Illumina TruSeq RNA and DNA | 6 | 27 | Hamming | 3 | 3 | - | 27 | |
| | Meyer et al. 2008 (5) | 7 | 52 | Hamming | 3 | 3 | - | 52 | |
| | Meyer et al. 2008 (5) | 8 | 130 | Hamming | 3 | 3 | - | 130 | |
| Correct edit distance | Qiu et al. 2003 | 6 | 21 | Edit | 3 | 3 | - | 21 | |
| | Frank 2009 (6) | 6 | 81 | Other | 2 | 2 | - | 81 | Design algorithm similar to edit distance 2 |
| | Illumina Nextera DNA (7) | 8 | 8/12 | Edit | 3 | 3 | - | 8 or 12 | |
| | Frank 2009 (6) | 8 | 760 | Other | 2 | 2 | - | 760 | Design algorithm similar to edit distance 2 |
| | Roche 454 MID Extended | 10 | 151 | Edit | 4 | 4 | - | 151 | |
| | Roche 454 RL-MID Extended | 10 | 132 | Edit | 4 | 4 | - | 132 | |
| Designed for this publication | EDITTAG | 6 | 61 | Edit | 3 | 3 | - | 61 | |
| | EDITTAG | 7 | 211 | Edit | 3 | 3 | - | 211 | |
| | EDITTAG | 8 | 531 | Edit | 3 | 3 | - | 531 | |
| | EDITTAG | 9 | 1,936 | Edit | 3 | 3 | - | 1,936 | |
| | EDITTAG | 10 | 7,198 | Edit | 3 | 3 | - | 7,198 | |

(1) Hamady et al. [31] tags are from the nmeth.1184-S1.pdf supplementary file.
(2) Hamady et al. [31] tags are Hamming distance 4 from one another in binary encoding but Hamming distance 2 from one another in nucleotide encoding.
(3) We generated Meyer et al. [29] tags using: 'python create_index_sequences.py -l <length> -d 3'.
(4) Adey et al. [41] tags are from the gb-2010-11-12-r119-s3.pdf supplementary file.
(5) Meyer et al. [3] tags are from the nprot.2007.520-S1.doc supplementary file.
(6) We generated Frank [42] tags using: 'barcrawl -l <length> -m 3'. BARCRAWL uses a hybrid approach to create distance between tags while accounting for a single deletion. This is similar to an expected edit distance of two.
(7) Illumina Nextera tags are incorporated to either end of the template strand in combinatorial fashion to identify up to 96 samples.
doi:10.1371/journal.pone.0042543.t001
`
For example, the full set of 10 nucleotide tags contains 1,048,576 members, which requires 550 billion pairwise edit distance comparisons across all tags in the candidate set. If storage of each result requires 8 bits, then storing the entire array requires approximately 500 GB, a daunting object with which to work. Additionally, this considers only the first stage of processing and ignores the additional computational and storage overhead required to select and test subsets of edit metric sequence tags.
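These figures can be reproduced with a short calculation. The sketch assumes each unordered pair of tags is compared once and that each result occupies one byte (8 bits), as stated above.

```python
def pairwise_comparisons(n_tags: int) -> int:
    """Number of unordered tag pairs, n * (n - 1) / 2."""
    return n_tags * (n_tags - 1) // 2


n_tags = 4 ** 10                          # every 10 nt sequence over A, C, G, T
comparisons = pairwise_comparisons(n_tags)
bytes_needed = comparisons * 1            # one byte (8 bits) per stored distance

print(f"candidate tags: {n_tags:,}")
print(f"pairwise comparisons: {comparisons:,}")
print(f"storage: {bytes_needed / 1e9:,.0f} GB ({bytes_needed / 2**30:,.0f} GiB)")
```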
Thus, we modified the approach used by the lexicode algorithm [43] to speed up processing, reduce memory consumption, and enable parallelization of jobs across multiple processors. Briefly, our approach first generates all n-length combinations of [A,C,G,T]. Then, if the remaining group is sufficiently large, we apportion tags into discrete batches of 25,000 tags, and we distribute each batch among the available number of processing cores to (optionally) remove those tags having problematic composition (homopolymers, improper GC, perfect self-complements). After filtering, we rebuild the set of candidate tags returned from each processing core, and we create the following data structure, where the 0th position of each "row" below is a sequence tag "key" to which we pair a "value" comprising a list of all tags in the set:

(
  (tag0, [(tag0), (tag1), (tag2), (tag3)]),
  (tag1, [(tag0), (tag1), (tag2), (tag3)]),
  ...
)
`
If this data structure is sufficiently long (more than 500 "rows" as illustrated above), we apportion the structure into batches containing 500 "rows", and we distribute each batch among the available number of processors. Iterating over each row, we then compute the edit distance between the "key" and all sequence tags in the value list using either a C-based Python module (http://pylevenshtein.googlecode.com) or a pure-Python method. To reduce memory consumption when iterating over millions of tags, we produce a summary vector for each key giving the count of all other sequence tags having values that fall within edit distance categories (0, 1, 2, …, N), and we use the 0-indexed position of the count in the vector to denote the edit distance. Thus, the vector:

[1, 12, 124, 5]

`
`corresponds to a key having a single tag edit distance 0 from the
`key, 12 tags edit distance one from the key, 124 tags edit distance
`two from the key, and five tags edit distance three from the key.
`We then reduce the data by keeping only those keys having the
`maximum count of comparisons at the minimum desired edit
`distance, a technique that allows us to reduce the remaining
`number of pairwise comparisons over the entire data set by
`approximately 99% (estimated from the generation of eight
`nucleotide, edit distance three tags).
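A minimal sketch of the summary vector and the reduction step follows. It is an illustration under stated assumptions, not EDITTAG's implementation: the function names are hypothetical, a pure-Python edit distance stands in for the C-based module, and batching across processors is omitted.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Pure-Python edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (x != y)))
        previous = current
    return previous[-1]


def summary_vector(key, tags):
    """Counts of tags at each edit distance from `key`; index = distance."""
    counts = Counter(levenshtein(key, other) for other in tags)
    return [counts.get(d, 0) for d in range(max(counts) + 1)]


def keys_with_max_count(vectors, minimum_distance):
    """Keep keys with the largest count at the minimum desired distance."""
    def count_at(vector):
        return vector[minimum_distance] if minimum_distance < len(vector) else 0

    best = max(count_at(v) for v in vectors.values())
    return [key for key, v in vectors.items() if count_at(v) == best]


# Hypothetical three-tag candidate set; each key is compared with every tag,
# including itself, so position 0 of every vector is 1.
tags = ["AAAA", "AAAC", "CCCC"]
vectors = {key: summary_vector(key, tags) for key in tags}
print(vectors)
print(keys_with_max_count(vectors, minimum_distance=3))
```

Storing one short vector per key, rather than every pairwise distance, is what keeps memory use manageable before the reduction.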
After reducing the data, for each key we compute the edit distance between the key and all sequence tags in the value; we drop any tags in the value less than the desired edit distance; and we iterate over the remaining tags in the value, retaining only those tags that are also the desired edit distance from one another. Finally, we determine the count of remaining tags in the value list for each key, and we return the key (and its values) having the largest value list. Additionally, we include an option that quickly returns subsets of keys within this final set having edit distances from the key at values greater than the minimum desired edit distance.
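The per-key selection described above can be sketched as a greedy pass. This is one plausible reading of the procedure for illustration, not EDITTAG's code; the function names are hypothetical and the example uses 4 nt tags so that it runs instantly.

```python
from itertools import combinations, product


def levenshtein(a: str, b: str) -> int:
    """Pure-Python edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (x != y)))
        previous = current
    return previous[-1]


def build_tag_set(key, tags, minimum_distance):
    """Greedily grow a set around `key` with all pairs >= minimum_distance."""
    # Drop tags that are too close to the key itself.
    far_enough = [t for t in tags if levenshtein(key, t) >= minimum_distance]
    chosen = [key]
    for tag in far_enough:
        # Keep a tag only if it is far enough from everything kept so far.
        if all(levenshtein(tag, kept) >= minimum_distance for kept in chosen):
            chosen.append(tag)
    return chosen


def best_tag_set(keys, tags, minimum_distance):
    """Return the largest set found across the candidate keys."""
    return max((build_tag_set(k, tags, minimum_distance) for k in keys), key=len)


tags = ["".join(bases) for bases in product("ACGT", repeat=4)]
chosen = build_tag_set("ACGT", tags, minimum_distance=3)
best = best_tag_set(["ACGT", "AAAA"], tags, minimum_distance=3)
print(len(chosen), "tags around ACGT;", len(best), "tags in the best set")
```

Because the pass is greedy, the result depends on the key and on tag order, which is why comparing several keys and keeping the largest set matters.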
`We used this approach to design sets of edit metric sequence
`tags ranging from four to 10 nucleotides in length and having edit
`distances of three. We used the shortcut method described above
`to select subsets, within each of these sets, having edit distances
`from four to nine. After creating these edit distance tags, we
`validated each set of resulting tags for conformance to the edit
`metric using validate_edit_metric_tags.py, the program described
`in the previous subsection.
`
`Sequence Tag Application
EDITTAG provides two convenience programs for integrating sequence tags to platform-specific adapters and PCR primers. The first program (add_tags_to_primers.py) is meant primarily for integration of sequence tags to PCR amplicons when designing sequence-tagged PCR primers. In brief, this program adds sequence tags to the 5′ ends of both upper and lower PCR primers, optionally removes common bases between each sequence tag and primer sequence, optionally prepends both primers with a sequence (GTTT) promoting +A addition [44] to facilitate adapter ligation, uses Primer3 [45] to evaluate tagged primers for complementarity problems and the presence of hairpins, and outputs all tagged primers to an sqlite (http://www.sqlite.org) database or comma-separated file for subsequent evaluation and selection.
The second program (add_tags_to_adapters.py) simply integrates designed sequence tags to adapters and/or primers by inputting the list of desired sequence tags, the adapter/primer sequence 5′ of the sequence tag location, and the adapter/primer sequence 3′ of the sequence tag location. This program is largely meant to reduce mistakes when manually positioning sequence tags within large numbers of adapters or primers.
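The second program's core operation, placing each tag between a fixed 5′ and 3′ flank, can be sketched as below. The function name is hypothetical and the flanking sequences are made-up placeholders, not real adapter sequences; this is not the code of add_tags_to_adapters.py.

```python
VALID_BASES = set("ACGT")


def add_tags_to_adapter(tags, five_prime, three_prime):
    """Return {tag: full oligo} with each tag placed between the two flanks."""
    tagged = {}
    for tag in tags:
        if not set(tag) <= VALID_BASES:
            raise ValueError(f"tag {tag!r} contains characters other than A, C, G, T")
        tagged[tag] = f"{five_prime}{tag}{three_prime}"
    return tagged


# Placeholder flanks and tags for illustration only.
oligos = add_tags_to_adapter(["ACGTAC", "TGCATG"], five_prime="GGGG", three_prime="TTTT")
for tag, oligo in oligos.items():
    print(tag, oligo)
```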
`
`Testing Sequence Tag Integration to PCR Primers
To test the design and resulting utility of PCR primers sequence-tagged using the helper program, we integrated the entire set (n = 164) of 10 nucleotide, edit distance five sequence tags (File S2) to primers amplifying the rbcLa locus in land plants [46,47]. We used the resulting database to select 95 hairpin-free, sequence tagged primers (File S3) which we had commercially synthesized, adding a single 3′ phosphorothioate linkage to each oligo (Integrated DNA Technologies, Inc.). We used these primers to amplify the rbcLa locus in 190 tropical forest tree species (2 × 95 reactions) in a reaction mixture containing 5.0 µL CTAB-extracted [48], purified (AMPure) DNA, 0.3 mM KAPA dNTP mix, 0.2 µM each primer, 1× KAPA HiFi PCR Buffer, 0.5 U KAPA HiFi HotStart polymerase and the following touchdown PCR thermal profile: 95 °C for 30 s; 20 cycles of 95 °C for 30 s, 66 °C for 30 s minus 0.25 °C per cycle, 72 °C for 1.5 min; 20 cycles of 95 °C for 30 s, 60 °C for 30 s, 72 °C for 1.5 min; 72 °C for 15 min. Following PCR, we visualized amplicons by running 7 µL of PCR product on 1.5% agarose gels for 90 minutes at 100 V and staining with ethidium bromide.
We cleaned PCR amplicons and normalized amplicon concentrations across samples using SequalPrep normalization plates (Invitrogen, Inc.), combined sequence-tagged PCR amplicons from a 96-well plate into a single pool, and concentrated the pool using a SpeedVac. Prior to sequencing, we used T/A ligation to add standard 454 GS FLX Titanium sequencing adapters to the 5′ and 3′ ends of each amplicon pool [49]. We quantified the resulting adapter-ligated amplicon pools using qPCR (KAPA Biosystems), we combined amplicon pools at equimolar ratios, and we sequenced amplicon pools using a portion of one 1/8th plate of a 454 GS FLX Titanium sequencing run (UCLA Genotyping Core). We demultiplexed the resulting sequence data using demuxipy (https://github.com/faircloth-lab/demuxipy/); combined the read counts by sequence tag from each pool of rbcLa amplicons to minimize the variance introduced to counts by differences in template quality, template quantity, and PCR; averaged the count of reads per sequence tag across pools; and computed fold difference between the average number of reads per sequence-tag and the global average number of reads.
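The final summary step can be sketched as follows. The read counts are made-up numbers, the function name is hypothetical, and expressing fold difference as the ratio of each tag's mean to the global mean is an assumption about how the comparison was expressed.

```python
from statistics import mean


def fold_difference(counts_by_tag):
    """Map each tag to (its mean reads across pools) / (global mean reads).

    `counts_by_tag` is {tag: [reads in pool 1, reads in pool 2, ...]}.
    """
    per_tag_mean = {tag: mean(counts) for tag, counts in counts_by_tag.items()}
    global_mean = mean(per_tag_mean.values())
    return {tag: value / global_mean for tag, value in per_tag_mean.items()}


# Made-up read counts for three tags observed in two amplicon pools.
example_counts = {"tag_a": [10, 30], "tag_b": [40, 40], "tag_c": [60, 60]}
folds = fold_difference(example_counts)
for tag, fold in folds.items():
    print(f"{tag}: {fold:.2f}x the global average")
```

Values near 1 indicate a tag that received close to the average number of reads, which is the behavior expected of a tag set that performs evenly.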
`
`Validating Edit Metric Tag Addition to Nextera-style
`Adapters
To validate edit metric sequence tags incorporated to Nextera-style sequencing adapters, we first removed the Epicentre-provided IDX1 and IDX8 adapters from the Nextera barcoding kit to ensure the edit distance of the remaining set (n = 10 adapters) was three. We then created 14 new Nextera adapter sequences by incorporating six nucleotide, edit metric three sequence tags to each adapter, and we used validate_edit_metric_tags.py to ensure we maintained an overall edit distance of three among all members of the set (File S4). We commercially synthesized and HPLC-purified these new adapters (Integrated DNA Technologies, Inc.), and we incorporated each indexed adapter to target enriched [50] genomic DNA using PCR, according to the Nextera manual (Epicentre Biotechnologies). Following PCR, we quantified the indexed libraries using qPCR (KAPA Biosystems), pooled sequencing libraries at equimolar concentrations into groups of 12, and sequenced the pooled libraries using two lanes of an Illumina GAIIx DNA sequencer (LSU Genomics Facility). Because we were interested in validating our ability to sequence across these indices and because we wanted to fairly compare our ability to sequence across edit metric and "standard" (Hamming distance) sequence tags, we demultiplexed sequence data using the standard Illumina pipeline, counted, and compared the number of reads assigned to each sequence tag.
`
`Validating Edit Metric Tag Addition to TruSeq-style
`Adapters
`To validate edit metric sequence tags integrated into TruSeq-style
`sequencing adapters, we used the helper program
`(add_tags_to_adapters.py) to incorporate 10 nt sequence tags of edit
`distance five into a set of 135 TruSeq-style adapters (File S5). We commercially
`synthesized all 135 adapters (Integrated DNA Technologies, Inc.),
`with a replicate subset of 24 that were HPLC-purified using a
`randomization protocol to ensure adapters did not follow each
`other on the HPLC (eliminating relevant carry-over), and we
`conducted two experiments.
`In the first experiment, we focused on a subset of adapters
`where the first 6 nt of the 10 nt tag conforms to a minimum edit
`distance of 3 (BFIDT-000 to BFIDT-045). We made an equimolar
`pool of the 24 adapters, and we used this adapter pool to construct
`a library with a single genomic DNA sample using Illumina
`TruSeq reagents (leaving out the standard Illumina adapters). We
`then pooled this mixed library with a subset of Nextera-style
`adapters (total library mass = 1% TruSeq-style; 99% Nextera-style),
`and we sequenced libraries using a single lane of a GAIIx
`(see details above). We demultiplexed sequence data using the
`standard Illumina pipeline, counted, and compared the number of
`reads assigned to each sequence tag.
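The first experiment relies on the property that the leading 6 nt of each 10 nt tag still conform to a minimum edit distance of 3. A minimal Python sketch of such a prefix check follows. The function names and the example tags are hypothetical (they are not BFIDT sequences), and the recursive distance function is a compact textbook formulation chosen for brevity rather than the method used by the authors' software.

```python
from functools import lru_cache
from itertools import combinations


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Levenshtein distance, written recursively and memoized."""
    if not a or not b:
        return len(a) + len(b)
    return min(
        edit_distance(a[1:], b) + 1,                   # delete first base of a
        edit_distance(a, b[1:]) + 1,                   # insert first base of b
        edit_distance(a[1:], b[1:]) + (a[0] != b[0]),  # substitute or match
    )


def prefixes_conform(tags, prefix_length=6, minimum=3):
    """True if all tag prefixes are pairwise at least `minimum` edits apart."""
    prefixes = [tag[:prefix_length] for tag in tags]
    return all(
        edit_distance(a, b) >= minimum
        for a, b in combinations(prefixes, 2)
    )


# Hypothetical 10 nt tags whose 6 nt prefixes are well separated.
good_tags = ["AAAAAACCGG", "CCCCCCAATT", "GGGGGGTTAA"]
ok = prefixes_conform(good_tags)

# Adding a tag whose prefix is one substitution from an existing prefix fails.
not_ok = prefixes_conform(good_tags + ["AAAAACGGTT"])
```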
`In the second experiment we incorporated 12 EDITTAG
`indexed adapters and 12 Illumina TruSeq indexed adapters to
`DNA libraries using a modified version of an on-bead library
`preparation method [51] and reagents from New England Biolabs.
`Following preparation, we quantified libraries using qPCR (Kapa
`Biosystems, Inc.), normalized library concentration across samples,
`and enriched individual or pooled libraries for ultraconserved
`elements using 2560 probes [50,52]. Following PCR
`recovery and Qubit qu
