`DOI 10.1186/s13059-016-1039-4
`
`M E T H O D
`Open Access
`Streamlined analysis of duplex sequencing
`data with Du Novo
`Nicholas Stoler1†, Barbara Arbeithuber2†, Wilfried Guiblet1, Kateryna D. Makova3* and Anton Nekrutenko1,4*
`
`Abstract
`
`Duplex sequencing was originally developed to detect rare nucleotide polymorphisms normally obscured by
`the noise of high-throughput sequencing. Here we describe a new, streamlined, reference-free approach for
`the analysis of duplex sequencing data. We show the approach performs well on simulated data and precisely
`reproduces previously published results and apply it to a newly produced dataset, enabling us to type
`low-frequency variants in human mitochondrial DNA. Finally, we provide all necessary tools as stand-alone
`components as well as integrate them into the Galaxy platform. All analyses performed in this manuscript
`can be repeated exactly as described at http://usegalaxy.org/duplex.
`Keywords: Duplex sequencing, Low frequency polymorphism discovery, Next generation sequencing, Genomic
`data analysis
`
`Background
`The term “genetic variation” is often used to imply
`allelic combinatorics within a diploid organism such as
`humans or Drosophila. Yet the majority of organisms
`in the biosphere are not diploid (prokaryotes and
`viruses), and even those that are include non-diploid
`genomes such as mitochondria and chloroplasts. Iden-
`tification of genetic variants—e.g., single nucleotide
`polymorphisms (SNPs) and small indels—is especially
`challenging in non-diploid systems due to the lack of a
`simple “homozygote-or-heterozygote” expectation: a
`heterozygous site may have not just two but multiple
`allelic variants, with frequencies ranging anywhere from
`0 to 1 [1, 2]. Because high-throughput sequencing tech-
`nologies exhibit considerable amounts of noise [3], it
`becomes increasingly difficult to reliably call variants
`with frequencies below 1 % [4–9]. In these situations
`increased sequencing depth does not improve the pre-
`dictive power but instead introduces additional noise.
`This complicates the identification of
`low-frequency
`
`* Correspondence: kdm16@psu.edu; aun1@psu.edu
`†Equal contributors
`3Department of Biology, Penn State University, 310 Wartik Lab, University
`Park, PA 16802, USA
`1Graduate Program in Bioinformatics and Genomics, The Huck Institutes for
`the Life Sciences, Penn State University, 505 Wartik Lab, University Park, PA
`16802, USA
`Full list of author information is available at the end of the article
`
`variants that is becoming critically important in a var-
`iety of applications. For example, humans have numer-
`ous disease-causing mitochondrial variants where the
`disorder penetrance is proportional to the allele fre-
`quency [10]. Because dramatic shifts in allele frequency
`can occur during mitochondrial bottleneck during oo-
`genesis, a disease-causing variant present at a very low
`frequency in the mother may increase in frequency in
`the child to exhibit a disease phenotype. The lack of
`cures for diseases caused by mitochondrial DNA muta-
`tions and the recent regulatory approval of tri-parental
`in vitro fertilization by the UK House of Commons
`makes it critical to identify low-frequency variants in
`the human mitochondrial genome [11]. Other examples
`illustrating the importance of discovering low-frequency
`genome alterations include tracking mutational dynamics
`in viral genomes, malignant lesions, and somatic tissues
`[12, 13].
`Today the vast majority of strategies for the identi-
`fication of
`low-frequency sequence variants rely on
`next-generation sequencing technologies. Noise reduc-
`tion in these approaches ranges from simple base-
`quality filtering to complex statistical strategies in-
`corporating instrument and mapping errors [4, 7, 14].
`However, there is still considerable uncertainty about
`alternative alleles with frequencies below 1 %. For ex-
`ample, Fig. 1 illustrates
`the number of potential
`
`© 2016 The Author(s). Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
`International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
`reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
`the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
`(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
`
`GUARDANT - EXHIBIT 2013
`TwinStrand Biosciences, Inc. v. Guardant Health, Inc. IPR2022-01152
`
`
`
`Stoler et al. Genome Biology (2016) 17:180
`
`Page 2 of 10
`
`polymorphisms observed within the human mitochon-
`drial genome as a function of
`the allele frequency
`cutoff. At 1 % there is an average of three sites [7],
`while at 0.75 % the number surpasses 10, and, finally,
`around 0.1 % almost all sites appear polymorphic.
`Clearly, the majority of these sites are false-positives
`but how does one know for certain? Potentially,
`highly
`sensitive
`techniques with a high dynamic
`range, such as droplet digital PCR [15, 16], can be
`used to validate each site, but it would quickly be-
`come prohibitively expensive and laborious to perform
`this on hundreds or thousands of sites.
`An approach that offers a potential solution is du-
`plex sequencing [17]. This recently developed method
`was designed to increase sequencing accuracy by over
`four orders of magnitude. Duplex sequencing uses
`randomly generated barcodes to uniquely tag each
`molecule in a sample. The tagged fragments are then
`PCR amplified prior to the preparation of a sequen-
`cing library, creating fragment families characterized
`by unique combinations of barcodes at both 5′ and
`3′ ends (a conceptually similar primer ID approach
`[12] allows tagging of cDNA fragments at the 5′ end
`only). A family contains multiple reads, each originat-
`ing from a single input DNA fragment. A legitimate
`sequence variant will
`thus be present
`in all reads
`within a family. In contrast, sequencing and amplifi-
`themselves as “polymor-
`cation errors will manifest
`phisms” within a family and so can be identified and
`removed. A consensus can be called from these read
`families. The consensus of all the reads originating
`from the same strand reduces errors originating from
`sequencing and PCR amplification. Then, comparing
`consensus sequences from complementary strands can
`identify early PCR errors.
`
`that duplex sequencing promises
`Despite the fact
`great advances,
`the methods for both experimental
`and computational aspects of this technique are still
`evolving. In fact, the latter is lagging as it is based on
`alignment to a reference genome, which is disadvanta-
`geous for several reasons. The use of a reference gen-
`ome biases results toward that reference, affecting
`studies using de novo assembly or studies examining
`indels or other alleles that diverge far enough from
`the reference to cause alignment difficulties. The
`current analysis method also removes a large (and po-
`tentially useful) fraction of the original data due to
`stringent filters and uses suboptimal tools for variant
`identification. Here we describe an alternative analysis
`strategy which removes reliance on a reference se-
`quence, preserves a higher proportion of
`the input
`reads, and can be deployed as a standalone application or
`as a part of the Galaxy system. We demonstrate the
`application of this approach by validating rare variants in
`the human mitochondrial genome.
`
`Results and discussion
`A reference-free approach
`Our approach is outlined in Fig. 2. First, paired reads
`generated from a duplex sequencing experiment are
`merged into families. This is performed by sorting
`according to the barcode. Each fragment is expected to
`be represented by two single-stranded families corre-
`sponding to each strand. These two single-stranded
`families are expected to have the same unique tags but in
`the opposite order: the α tag from one single-stranded
`family will be the β tag in the other (also see Fig. 1
`in [17]). In order to group single-stranded families
`from the same fragment together, we normalize the
`order of the concatenation to produce a “canonical
`
`Fig. 1 The relationship between the minor allele frequency (maf) threshold (x-axis) and the total number of variable sites (y-axis) detected by [7].
`Lowering the MAF threshold leads to an exponential increase in the number of variable positions. The image was generated by applying variable
`MAF thresholds to data from 156 human samples and plotting the average number of variable sites at a given MAF threshold. The line thickness
`corresponds to the 95 % confidence interval around the mean value
`
`
`
`Stoler et al. Genome Biology (2016) 17:180
`
`Page 3 of 10
`
`Fig. 2 The Du Novo approach. First, reads tagged with identical barcodes are grouped into strand-specific families. Reads within each family are
`aligned and single-stranded consensus sequences (SSCSs) are generated. Finally, the SSCSs are reduced into duplex consensus sequences (DCSs)
`
`barcode” (a concatenated string consisting of α and β
`tags), which will be identical
`for both strands. The
`order of the canonical barcode is determined by a
`simple string comparison. Sorting the output groups
`the reads so that the two families constituting each
`duplex will be adjacent, with the read pairs separated
`by strand.
`Next, the reads in each single-stranded family are
`aligned to themselves and these alignments are used
`to call the single-stranded consensus sequence (SSCS).
`First, a threshold is applied, requiring a user-specified
`number of reads to produce a consensus (three by
`default). The consensus calling is conducted by deter-
`mining the majority base at each position. If no base
`is in the majority, “N” is used. Positions with gaps are
`considered in the same way as bases. Quality filtering
`is performed at this stage: bases with a Phred quality
`score [18] lower than a user-specified threshold are
`not counted (20 is used by default). For positions
`with gaps, a quality score is calculated by considering
`the qualities of eight neighboring bases. The calcu-
`lated score is a weighted average, with the weight de-
`creasing linearly with distance from the gap. Finally, a
`duplex consensus is called using the two SSCSs. The
`SSCSs are aligned using the Smith–Waterman algo-
`rithm [19] and then each pair of bases is compared.
`If the bases agree, that base is used in that position
`to generate duplex consensus. If
`they disagree,
`the
`International Union of Pure and Applied Chemistry
`(IUPAC) ambiguity code for the two bases is used.
`Gap and non-gap characters produce an “N”. In the
`end the above approach reduces the initial set of
`sequencing reads to a collection of duplex consensus
`sequences (DCSs; as the duplex sequencing experi-
`ments are performed with paired-end reads,
`the
`output of the procedure also consists of pairs corre-
`sponding to forward and reverse double-stranded
`consensuses). DCSs are then filtered (i.e., sequences
`with ambiguous nucleotides
`can be
`removed or
`trimmed), mapped against the reference genome, and
`realigned to normalize gap-containing regions and the
`
`resulting alignments are used to call variants. In this
`scenario variants
`are
`expected to have
`the
`full
`spectrum of allele frequencies between 0 and 1 and
`do not follow a diploid expectation. For that reason
`we use variant callers capable of dealing with this
`limitation such as the Naive Variant Caller (NVC)
`[20] or Freebayes [21]. Finally, variant calls are post-
`processed to compute the strand bias (using formulae
`from Guo et al. [22]). This approach is implemented
`in a pipeline relying exclusively on open-source soft-
`ware (https://github.com/galaxyproject/dunovo and ac-
`cessible through the Galaxy system). We termed this
`approach Du Novo—for duplex sequencing de novo
`assembly-based calling.
`
`Du Novo reliably identifies very low frequency variants
`First, we evaluated the performance of Du Novo by ap-
`plying it to a dataset generated from a simulated mixing
`experiment. The advantage of performing the simulation
`is that the “truth” is known explicitly. We randomly
`generated 21 “heteroplasmies” by modifying human
`mitochondrial sequence. This altered version of
`the
`mitochondrial genome was then “mixed” with unmodi-
`fied reference sequence at a ratio of 1:10,000 (thus, each
`“heteroplasmy” in this mix has the frequency of 0.0001)
`and a duplex experiment was simulated on the mixture.
`This was done by randomly generating 2500 fragments
`from the altered sequence and 25,000,000 fragments
`from unmodified reference, adding barcodes, and per-
`forming in silico PCR and sequencing (see “Methods”).
`The polymerase error rate in PCR and sequencing was
`set at 0.1 % per base. After applying Du Novo to the
`simulated reads and aligning the DCSs to the mitochon-
`drial reference, the median read depth was 166,574×.
`Next, we identified all variable sites and filtered them
`using a series of minor allele frequency (MAF) thresh-
`olds and requiring a minimum DCS coverage of 10,000.
`The relationship between MAF thresholds and the
`numbers of false positives and false negatives is shown
`in Additional
`file 1: Figure S1. Du Novo correctly
`identifies 20 of the 21 variants with no false positives.
`
`
`
`Stoler et al. Genome Biology (2016) 17:180
`
`Page 4 of 10
`
`The remaining variant was present at a frequency of
`0.00004 (likely a result of random fluctuation), along with
`46 false positives with an equal or higher MAF.
`
`Comparison with original approach: Du Novo replicates
`published estimates
`To assess the performance of our method on real-world
`data and to compare it head-to-head with the original
`approach of Kennedy et al. [23], we re-analyzed a re-
`cently published dataset by Schmitt and colleagues [13]
`using both methods. In [13]
`the authors employed
`duplex sequencing to identify a rare mutation at the
`ABL1 locus responsible for resistance to the chronic
`myeloid leukemia therapeutic compound imatinib. The
`resistance is conferred by the presence of G-to-A
`substitutions within the ABL1 coding region resulting
`in an E279K amino acid replacement. This substitution
`is present in a small sub-clonal subset of cells at an
`~1 % frequency. The dataset (Sequence Read Archive
`accession SRR1799908) contains 6,921,891 read pairs
`representing 1,468,089 unique tag combinations (po-
`tential families; Table 1).
`First, we analyzed this dataset with Du Novo. Requiring
`each family to contain at least three reads reduced this
`number to 120,365 SSCSs and reconciling these into
`DCSs further reduced this number to 20,746 DCSs con-
`structed from 2,083,140 read pairs (the remaining
`6,921,891 − 2,083,140 = 4,838,751 were represented by
`families with less than three reads and were omitted;
`see Additional file 2: Figure S2). Mapping DCSs to the
`reference human genome showed the G-to-A substitu-
`tion with frequency varying from 1.28 to 1.31 % de-
`pending on the variant caller (NVC [20] and FreeBayes
`[21], respectively) but irrespective of the mapper used
`(BWA-MEM [24] or BWA [25]).
`
`Table 1 Characteristics of ABL1 and SC8 duplex sequencing
`experiments
`Number of
`Read pairs
`
`SC8
`6,921,891 17,385,100
`
`ABL1
`
`Next, we repeated this experiment with the published
`duplex sequencing pipeline [23]. This produced 1.29 and
`1.31 % frequencies at the G-to-A substitution site for
`NVC and FreeBayes, respectively. Thus, the allele fre-
`quency estimates were essentially identical between the
`two approaches. Du Novo produced a higher depth at
`the variable site: 1099 for our method versus 618 for the
`published pipeline [23]. However, at such low allele fre-
`quencies even formidable coverage results in a relatively
`small proportion of reads supporting the minor allele.
`For example, in the case of this analysis the minor allele
`(“A”) is supported by 14 duplex consensuses from the
`total of 1099, resulting in a MAF of 1.28 %. Yet each of
`these 14 families is in turn derived from multiple start-
`ing reads ranging from a minimum of 5 to a maximum
`of 102 (Fig. 3a), providing additional support for the
`reliability of the minor allele calls.
`
`Using Du Novo to call low-frequency heteroplasmies at
`mitochondrial DNA
`After ensuring the adequate performance of Du Novo
`on the ABL1 data, we applied it to the identification of
`low-frequency variants in human mitochondrial DNA
`(mtDNA). Previously, we have reported 174 point
`heteroplasmies identified from the analysis of mtDNA
`in 39 mother–child pairs (a total of 156 samples = 39
`mothers × 2 tissues + 39 children × 2 tissues [7]). We
`chose family SC8 as it displays significant variability
`across samples and individuals. This family contains
`two heteroplasmic sites—at positions 7607 and 13,708.
`According to our published results [7], the MAFs at site
`7607 are 0.7, 1.1, 0.0, and 0.0 % in mother’s buccal
`tissue, mother’s blood, child’s buccal tissue, and child’s
`blood, respectively. The corresponding MAFs at site
`13,708 are 0.0, 0.0, 2.2, and 1.6. To verify these frequen-
`cies, we performed the duplex sequencing experiment
`using genomic DNA extracted from SC8 child buccal
`tissue in which mtDNA has been enriched via long-
`range PCR as previously described in [7]. We started
`with 17,385,100 read pairs that contained 2,100,704
`unique tags and were assembled into 82,230 DCSs. The
`estimated allele frequency at position 13,708 was
`0.53 %, a figure substantially lower than the 2.2 % esti-
`mated previously [7]. The coverage at this site was
`1138 with six reads representing the minor allele (“A”).
`To check the reliability of this call we estimated strand
`bias (SB; using formula 1 from [22]) for all sites with
`MAF ≥0.5 %. There were 20 sites (excluding 13,708)
`with MAF ranging from 0.51 to 21.2 % and with SB values
`ranging from 0.94 to 6.08 (the lower the value, the less SB
`there is at a site; 0 is an ideal value [22]). SB = 0.01 at site
`13,708, which is outside of the SB distribution for all other
`variable sites in our sample, strongly suggesting that this is
`the only true heteroplasmy in this sample. In addition,
`
`1,467,768 2,100,705
`
`748,411
`
`1,148,444
`
`677,069
`
`884,295
`
`60,333
`
`222,823
`
`743,669
`
`1,092,748
`
`672,946
`
`832,875
`
`Unique tags
`Unique αβ configurations
`Unique αβ configurations with 1 read pair
`Unique αβ configurations with ≥3 read pairs
`Unique βα configurations
`Unique βα configurations with 1 read pair
`Unique βα configurations with ≥3 read pairs
`Unique αββα
`24,313
`Unique αββα with ≥3 read pairs on both strands 20,746
`Reads within αββα families with ≥3 read pairs on
`2,156,105 8,636,692
`both strands
`
`60,032
`
`140,486
`
`140,485
`
`109,999
`
`
`
`Stoler et al. Genome Biology (2016) 17:180
`
`Page 5 of 10
`
`Fig. 3 Distribution of family sizes (number of reads per family) supporting A and G alleles on both strands (plus and minus) for a site 130,872,141
`in the ABL1 dataset and b site 13,708 in the SC8 dataset
`
`examining individual DCSs at this site indicates that each
`of them is generated from a large number of original reads
`(Fig. 3b) confirming this polymorphism, albeit at a signifi-
`cantly lower frequency.
`
`The utility of SSCSs
`In the SC8 experiment described above, we estimated the
`MAF at site 13,708 to be 0.53 %—a much lower value
`compared with the original one (2.2 %) obtained from re-
`sequencing [7]. The likely cause of this deviation lies in
`the design of the duplex experiment. In this study we
`performed duplex sequencing not directly on mtDNA but
`instead on products of a long-range PCR (see “Methods”).
`In this particular case this is unavoidable as the samples
`are obtained by a minimally invasive “cheek swab”, result-
`ing in a very low concentration of mtDNA. The core issue
`is that complementary strands of
`the resulting PCR
`products (the starting material for our duplex sequencing
`experiment) can randomly pair after amplification, form-
`ing heteroduplexes and leading to an underestimation of
`MAFs when using DCSs only (Additional file 3: Figure
`S3). To test whether this indeed is the cause of MAF
`underestimation, we performed variant calling using
`SSCSs instead of DCSs and obtained a MAF of 1.7 %
`(strand bias = 0.02 and depth = 4548), a value much closer
`to the 2.2 % reported in the original publication. Thus,
`although the background error frequency is higher for
`SSCSs in comparison with DCSs [17] in certain situations,
`such as experiments using ampliconic DNA, the use of
`SSCSs for polymorphism detection may be preferable to
`obtain more accurate allele frequencies.
`
`Loss of data as a result of sequencing errors in duplex tags
`One of the fundamental weaknesses of duplex sequen-
`cing is the fact that the majority of families in a duplex
`experiment contain only a single read pair (Additional
`
`file 4: Figure S4). This eliminates a substantial amount
`of otherwise useful data from the analysis, contributing
`to the inefficiency of the current protocol. To under-
`stand the potential sources of read loss, we examined
`individual stages of the duplex analysis process. This in-
`formation is compiled in Table 1 and is based on the re-
`analysis of both previously published data (ABL1 data)
`[13] and results generated in our laboratory (the SC8
`dataset described above). Both cases feature a large
`number of initial read pairs and unique tags. However,
`these numbers are rapidly reduced by requiring at least
`three reads within each single-stranded family. Combin-
`ing SSCSs into DCSs also greatly reduces the number of
`useful sequences since both strands must be present and
`meet the three-read threshold. One potential explan-
`ation for the large number of families with only one read
`pair is sequencing errors within duplex tags. Each bar-
`code with an error will almost certainly be unique, creat-
`ing an entirely new apparent
`family with only one
`member. The number of reads with an erroneous bar-
`code may be a minority but this can still result in the
`number of families with erroneous barcodes being very
`high (a majority). The fraction of erroneous barcodes (r)
`can be expressed in the following form:
`
`r ¼ 1– 1 –Eð
`Þl
`
`ð1Þ
`
`where E is the per-base error rate and l is the barcode
`length (in this case 24 as it is a combination of α and β
`tags, each of which is 12 nucleotides). Here, E is a
`cumulative error rate taking into account the chance of
`a mutation at every cycle of PCR plus the sequencing re-
`action. The cumulative error rate can be calculated from
`the error rate at each stage using the same equation
`(Eq. 1), this time using E as the error rate per base per
`stage, l as the number of stages (number of PCR cycles
`
`
`
`Stoler et al. Genome Biology (2016) 17:180
`
`Page 6 of 10
`
`plus 1 for the sequencing reaction), and r as the cumula-
`tive error rate. Even assuming a low per-stage error rate
`of 0.1 %, this gives a cumulative error rate of about 3 %.
`Using this in Eq. 1 again, we obtain the fraction of
`barcodes expected to contain an error to be 52.5 %:
`
`r ¼ 1– 1–0:03ð
`Þ24≈ 0:525
`
`ð2Þ
`
`Now, suppose in a hypothetical duplex experiment ten
`fragments of DNA were ligated with α and β
`initial
`adapters (a unique α and β for each of the ten frag-
`ments) and the subsequent PCR amplification and Illu-
`mina sequencing process produced 100 read pairs (10
`pairs per original fragment). If there are no errors, these
`100 read pairs should be recognized as members of ten
`duplex families during the analysis stage. If we now fac-
`tor in the erroneous barcode rate of ~52 % calculated
`above, one would observe 62 total families: ten real fam-
`ilies and 52 artifactual families consisting of a single read
`pair. This phenomenon increases the total number of
`families by reducing the read count within legitimate
`families—a trend apparent in real data (Additional file 2:
`Figure S2). Furthermore, the relationship between the
`number of single-read families and the total number of
`reads can serve as a proxy for the error rate. For example,
`in the SC8 experiment there were 1,717,170 single read
`families and 17,385,100 total read pairs. Assuming that all
`single read families are byproducts of sequencing errors
`within duplex tags, this gives 1,717,170/17,385,100 = 0.098
`as the fraction of erroneous barcodes (r). With l = 24 we
`can solve Eq. 1 for E obtaining an estimate of ~0.4 % for
`the cumulative error rate.
`To test this reasoning we simulated duplex experiments
`with different error rates. The starting distribution of
`family sizes was constant in each case, with 1.20 % of
`fragments producing a family with only one read. With an
`error rate of zero, the proportion of output families which
`were composed of a single read was, as expected, pre-
`cisely 1.20 % (Additional file 4: Figure S4), meaning no
`excess beyond those with a natural family size of one.
`When the error rate was raised to 0.1 % per base per
`cycle of PCR/sequencing reaction, 75.5 % of output
`families were composed of a single read. This meant
`that 74.3 % of families were artifacts consisting of a
`read originating from a fragment that produced mul-
`tiple reads. Instead of being grouped with its sibling
`reads, each of these instead was grouped by itself be-
`cause of an error in the barcode.
`While this was only a simulation and the above calcu-
`lations make a number of simplifying assumptions, they
`nevertheless highlight the significance of sequencing er-
`rors within tags as one of the main causes of data loss.
`We are currently developing a family reconstruction
`approach that would allow mismatches in tags and is
`
`expected to significantly reduce the number of single
`read families.
`
`Interactive analysis of duplex data
`The underlying components of the Du Novo process
`are distributed as an open source software and can be
`used from the command line (https://github.com/
`galaxyproject/dunovo). However, to increase the number
`of potential users we also make Du Novo accessible
`through the Galaxy system (http://usegalaxy.org). Figure 4
`illustrates all stages of the duplex analysis workflow. This
`example begins with fastq datasets generated by an Illu-
`mina machine that are used as inputs in the Du Novo
`pipeline. Initially, reads are processed to identify and
`count duplex tags (Make families). Reads having identical
`tags (families) are aligned (Align families) and alignments
`are reduced to DCSs (Make consensus reads). The DCSs
`are trimmed to remove ambiguous nucleotides (Sequence
`Content Trimmer), converted to fastq format (this is
`because DCSs are reported as fasta datasets; Combine
`FASTA and QUAL), and mapped to the reference ge-
`nomes (in this example with both BWA and BWA-
`MEM). BAM datasets produced by mappers are combined
`(MergeSamFiles) and realigned (BamLeftAlign) and vari-
`able sites are identified with the Naive Variant Caller
`(NVC). A Variable Call Format (VCF) dataset generated
`by NVC is processed by Variant Annotator, which tabu-
`lates allele frequencies and strand bias values. Finally, the
`data are filtered on MAF (≥0.5 %) and strand bias (<1).
`This workflow is available at https://usegalaxy.org/u/aun1/
`w/duplex-analysis-from-reads. The most computationally
`demanding portion of the workflow is the alignment of
`reads within each family (Align Families). For instance,
`processing of 6,921,891 read pairs comprising the ABL1
`dataset [9] took an average of 0.004 s per pair or approxi-
`mately 9 h of wall time on a 16-CPU cluster node. One of
`the advantages of using Galaxy at https://usegalaxy.org for
`the analysis of duplex sequencing data is that its under-
`lying infrastructure relies on high-performance resources
`provided by the Texas Advanced Computing Center
`(TACC) and the Extreme Science and Engineering
`Discovery Environment (XSEDE), making it possible to
`perform analyses of multiple duplex datasets by multiple
`users simultaneously.
`
`Conclusions
`The continuing drop in the price of massively parallel
`sequencing will expand the use of the duplex technique
`and will amplify the need for a scalable analysis solution
`such as Du Novo reported here. Our approach allows
`the use of both single- and double-stranded consensuses
`for variant discovery depending on the experimental
`design and is parallelized to take advantage of
`the
`advanced high performance compute infrastructure. By
`
`
`
`Stoler et al. Genome Biology (2016) 17:180
`
`Page 7 of 10
`
`Fig. 4 A complete workflow implementing the Du Novo approach to variant discovery from duplex sequence data. It is accessible
`from http://usegalaxy.org/duplex
`
`allowing our tools to be used both from the command
`line and through the Galaxy interface we hope to reach
`a wide audience of computational and experimental
`researchers.
`
`Methods
`Duplex sequencing protocol used for human
`mitochondrial amplicons
`Two overlapping mtDNA regions (each ~9 kb, represent-
`ing the entire mitochondrial genome) were amplified from
`sample SC8C1-k1169-A*B (DNA extracted from buccal
`swabs of the child of family SC8 collected under IRB
`30432EP), using the primer pairs L*2817 + H*11570 and
`L10796 + H3370 and mixed at equimolar quantities, as de-
`scribed previously [7, 26]. Amplicons (2 μg) were sheared
`to ~550 bp and purified using 1.6 volumes of Agencourt
`AMPure XP beads (Beckman Coulter). Duplex sequencing
`
`libraries were prepared as described in Kennedy et al.
`[23] with several minor modifications. Briefly, T-tailed
`adapters were prepared by hybridization of MWS51
`and MWS55, followed by extension, and a restriction
`digest with TaaI
`(HypCH4III) at 60 °C for 16 h.
`Adapters were purified by precipitation with two volumes
`of absolute ethanol and 0.5 volumes of 5 M NH4OAc. The
`hybridized PCR amplicon was end-repaired with the End-
`Repair Enzyme Mix provided in the Illumina TruSeq Kit
`according to the manufacturer’s protocol and A-tailed
`and the adapter was ligated with 1800 units of T4 ligase
`(NEB) with 20× molar excess at 16 °C for 30 min. Amp-
`lified tag families were generated from 15 attomoles of
`adapter-ligated amplicon by 23 cycles of PCR (the opti-
`mal cycle number was evaluated by real-time PCR).
`The library was quantified with the KAPA Library
`Quantification Kit (Kapa Biosystems) according to the
`
`
`
`Stoler et al. Genome Biology (2016) 17:180
`
`Page 8 of 10
`
`manufacturer’s instructions. Sequencing was performed
`on an Illumina MiSeq platform using 301-bp paired-
`end reads.
`
`Construction of read families
`Read pairs are grouped into families according to the
`random tags which constitute the first 12 bp of each read
`using the Du Novo pipeline either in Galaxy or through
`the command line. For each pair, we first construct a
`barcode which is the concatenation of the two tags from
`the two reads. Then the reads are sorted according to this
`compound barcode. Single-stranded families from the
`same fragment will have the same 12-bp tags but in the
`opposite order: the α tag from one family will be the β tag
`in the other. In order to group single-stranded families
`from the same fragment together, we normalize the order
`of the concatenation to produce a “canonical barcode”
`which will be identical for both strands. The order of the
`canonical barcode is determined by a simple string com-
`parison. Then the original order of the tags is recorded in
`a separate field. Sorting the output groups the reads so
`that the two families constituting each duplex will be adja-
`cent, with the read pairs separated by strand.
`
`Aligning families and consensus calling
`The reads in each single-stranded family are aligned to
`themselves using a script calling the MAFFT multiple
`sequence aligner [27]. These alignments were used to
`call the SSCSs. First, a threshold is applied, requiring a
`specified number (default = 3) of reads to produce a
`consensus. Then, the consensus calling is performed by
`determining the majority base at each position. If no
`base is in the majority, “N” is used. Positions with gaps
`are considered in the same way as bases. Quality filter-
`ing is done at this stage: bases with a PHRED quality
`score lower than a user-given threshold are not counted
`(default = 20). For positions with gaps, a quality score is
`calculated by considering the quality scores of the eight
`nearest bases. The calculated score is a weighted aver-
`age, with the weight decreasing linearly with distance
`from the gap. Finally, duplex consensus sequences are
`called using the two SSCSs. The two sequences are
`aligned using the Smith–Waterman algorithm (using an
`existing C implementation from https://code.google.-
`com/archive/p/swalign/) and then each pair of bases is
`compared. If the bases agree, that base is used in that
`position. If they disagree, the IUPAC ambiguity code
`for the two bases is used. Gap and non-gap characters
`produce an “N”. If a SSCS has no matching opposite
`strand consensus, the user may choose to include the
`single-stranded consensus in the output, direct it to a
`separate file, or discard it.
`
`In silico