throbber
SEQUENOM EXHIBIT 1095
`Sequenom v. Stanford
`SEQUENOM EXHIBIT 1095
`IPR2013-00390
`
`(cid:47)(cid:50)(cid:3)(cid:40)(cid:59)(cid:43)(cid:44)(cid:37)(cid:44)(cid:55)(cid:3)(cid:20)(cid:19)(cid:20)(cid:21)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)
`(cid:47)(cid:82)(cid:3)(cid:89)(cid:17)(cid:3)(cid:52)(cid:88)(cid:68)(cid:78)(cid:72)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)
`(cid:38)(cid:82)(cid:81)(cid:87)(cid:72)(cid:86)(cid:87)(cid:72)(cid:71)(cid:3)(cid:38)(cid:68)(cid:86)(cid:72)(cid:3)(cid:20)(cid:19)(cid:24)(cid:15)(cid:28)(cid:21)(cid:22)(cid:3)
`
`

`

`Li et al.
`
`neighborhood quality standard (NQS) [Altshuler et al. 2000),
`l’olyBayes develops an explicit statistical framework to lllodel
`variants.
`
`All new sequencing technologies are shotgun melllods tllat
`give sequences derived fronl a single molecule sampled from a
`larger population. (Current Inetllods amplify the starting teln—
`plate by some farm of PCR, but true single molecule methods are
`expected in the future.) This means the nletllods for calling vari—
`ants from new tecllnology data are most closely related to the
`second group described above, including ssahaSNP and Poly-
`Bayes. However, because of sampling and error rate, we need to
`combine data from multiple reads. In practice, errors at a par—
`ticular site are correlated, alld we must take this correlation irlto
`account. This is analogous to calling a consensus from a sequence
`assembly, and we propose a Bayesian approach to this issue that
`is related to that used in assembly software CAP3 (Huang and
`Madan 1999).
`In summary, this article presents methods and software for
`mapping short sequence reads to a reference genorne, calculating
`the probability of a read alignment being correct, and consensus
`genotype calling with a model that incorporates correlated errors
`and diploid sampling. The applicability and accuracy of the
`methods are evaluated based on both real data from the bacte—
`
`riunl Salmonella paratyphr' alld simulated data fronl the diploid
`human X chromosome.
`
`Results
`
`Overview of MAQ algorithms
`
`type sequence from the alignment. The consensus sequence is
`illferred from a Bayesian statistical model, arld each consensus
`genotype is associated with a paired quality that measures the
`probability that the consensus genotype is incorrect. Potential
`SN l’s are detected by comparing the consensus sequence to the
`reference alld can be further filtered by a set of predefined rules.
`These rules are designed to achieve the best performance on deep
`human resequencing data and aim to compensate for simplifica-
`tions and assumptions used ill the statistical model (e.g., treatirlg
`neighbor pesitions independently).
`
`Implementation
`
`We implemented the software MAQ to align short reads and call
`genotypes based on the algorithnl described ill the Methods sec—
`tion. MAQ consists of a set of related programs that are compiled
`into a single binary executable. [t is able to map reads, cal] con-
`sensus sequences including SNP arld inde] variants, simulate dip-
`loid genomes alld read sequences, and post—process the results ill
`various ways. MAQ also has an option to process Applied Biosys—
`lems SOLiD data that uses two base “color-space" encoding. f-ur-
`tller details are available from the documentation at the MAQ
`website.
`
`MAQ is easy to use. For bacterial genomes, alignments alld
`variant calling can be done with a single command line, taking a
`few Illillutes on a laptop. [ll addition, MAQ comes witll a com—
`pact arld fast OpenGL-based read alignment viewer, MAQview,
`which shows the read alignments, base qualities, arld mappirlg
`qualities in a graphical interface.
`lloth MAQ alld MAQview are designed with genome—wide
`human resequencing in mind. First, the read alignment, which is
`the slowest step ill the whole pipeline, can be divided irlto small
`tasks and parallelized 011 a modern computer cluster using less
`than 1 GB of memory for each processor core. The separate sub-
`parts of the alignment can then be merged together to give the
`final alignment. Second, the read alignments are stored in a bi—
`nary compressed file. Text—based information is only extracted
`when necessary. This strategy saves disk space by a factor of three
`to five. Third, a novel technique is implemented to index the
`compressed alignment file, which enables swift retrieval of reads
`ill ally region of the reference sequence. Viewing the alignments
`of a human—sized genome is as fast as viewing those of a single
`BAC sequence. As a whole, MAQ arld MAQview provide all effi-
`cierlt suite for managing data from tllurnirla sequencing.
`MAQ arld MAQview are implemented in C7C++ with auxil-
`iary tools in Perl. They have been extensively evaluated on large-
`scale simulated data and real data and have been tested by users
`front various research groups. MAQ software is freely distributed
`urlder the GNU General Public License (GPL). The project home
`page is at httpzh'maq.sourceforgehet.
`
`MAQ is a program that rapidly aliglls short reads to the reference
`genome and accurately infers variants, including SNPs alld short
`indels, from the alignment.
`At the alignment stage, MAQ first searches for the ungapped
`rnatcll with lowest mismatch score, defilled as the sum of quali—
`ties at mismatching bases. To speed up the alignment, MAQ only
`considers positions that have two or fewer mismatches in the
`first 28 bp (default parameters). Sequences that fail to reacll a
`mismatch score threshold but whose mate pair is mapped are
`searched with a gapped alignment algorilllrn in the regions de-
`fined by the lllate pair. To evaluate the reliability of alignments.
`MAQ assigns each individual alignment a pitted—scaled quality
`score (capped at 99), which measures the probability that the true
`alignment is rlot the one found by MAQ. MAQ always reports a
`single alignment, arld if a read can be aligned equally well to
`multiple positions, MAQ will ralldomly pick one position alld
`give it a mapping quality zero. Because their mapping score is set
`to zero, reads that are mapped equally well to multiple positions
`will not contribute to variant calling. However, they do give ill-
`formatiorl on copy number of repetitive sequences and 011 the
`fraction of reads that can be aligned to the genome, arld can
`easily be filtered out for downstream analysis if desired. Mapping
`Although it is always good to look at real data, it is impossible to
`quality scores and mapping all reads that match the genome
`assess read alignment accuracy on real data, because in a shotgun
`even if repetitive are where MAQ differs from lllost other align—
`ment programs.
`sample We cannot know where the reads come from.
`We simulated a diploid sequence (two haploid sequences)
`MAQ fully utilizes the mate-pair information of paired
`from the human reference chromosome X, as described irl the
`reads. [t is able to use this information to correct wrong aligll—
`ments, to add confidence to correct alignments, and to accurate—
`Methods: 136,012 substitutions. 7377 1—bp insertions, alld 7589
`l—bp deletions were added to the diploid genome, giving all over—
`ly map a read to repetitive sequences if its mate is confidently
`aligned. With paired-end reads, MAQ also firlds short insertions,Ir
`all polymorphisrn rate of 0.00]. From this mutated diploid ge—
`nome, we simulated 100 million pairs of 35-bp mate-pair reads
`deletions (indels) from the gapped alignment described above.
`with errors (~45.2-fold coverage orl chromosome X). The average
`At the SNP calling stage, MAQ produces a consensus geno-
`
`
`SNP calling for large-scale simulated data
`
`1852 Genome Research
`wwwgenolneorg
`
`

`

`insert size is l 70 hp with a standard deviation of 20 hp. Statistics
`on base qualities were estimated from real data where base quali-
`ties have been calibrated.
`
`With the default MAQ options, we aligned the simulated
`reads against the whole human reference genome excluding Y
`and unassembled contigs. It took 1100 CPU hours to do this
`alignment, and 914-496 of reads get mapped. Figure 1]} shows the
`distribution of mapping qualities (red curve) and the mapping
`error rate (blue curve] in each 10-based quality interval. It the
`mapping quality were estimated precisely, we would expect to
`see a straight blue line between ("O-9," 100) and (“290," 10—9).
`MAQ qualities appear to be overestimated; in other words, the
`true alignment error rate is higher than what mapping quality
`predicts to be. To investigate whether the overestimation is due
`to the fact that we did not consider mutations and indels in the
`
`model, we also simulated reads without introducing any muta-
`tions. For these data, the mapping quality could be estimated
`more accurately (data not shown), which confirms that mu ta-
`tions and indels may interfere with the calculation of mapping
`qualifies. We see in Figure 1 that this effect is greatest for map-
`ping quality ~70—8tt. However, even these reads have accuracy
`better than 10", which is sufficient for most mapping based
`applications, including structural variant calling and SNP calling.
`We called the consensus sequence from the MAQ align-
`ment. The pink curve in Figure 13 shows that most of the con-
`sensus bases have a quality over 60. About 5% of the consensus
`bases have a quality smaller than 10. They are in repetitive re~
`gions where read alignment is not reliable. We then compared
`the consensus to the diploid sequence from which reads were
`generated, and calculated the error rate of the consensus. The
`green curve indicates that the consensus quality also roughly
`agrees with the true error rate, We called indels using paired-end
`indel detection methods described in the Methods section, and
`required at least two reads to support the indel.
`After MAQ’s substitution calling, we further filtered the sub-
`stitutions based on five rules.- it) discard SNPs within the 3-bp
`flanking region around a potential indet; (2) discard SNPs cov-
`ered by three or fewer reads; (3) discard SNPs covered by no read
`with a mapping quality higher than 60,- t4) in any iti-bp window,
`if there are three or more SNPS, discard them all; and (S) discard
`SNPs with consensus quality smaller than 10. MAQ provides a
`Perl script maq.p1 to achieve all these filters.
`To see how well MAQ calls SNPs and indels at different cov—
`
`Mapping and assembly with qualities
`
`erage, we chose several subsets oi reads and called variants from
`those subsets. We compared the indels and filtered substitution
`calls to the true variants we added to the diploid genome in the
`simulation and measured the accuracy by false-positive rate (FF)
`and fatse-ne gative rate (PM (Fig. 23). MAQ consistently generates
`very few faise positives but does miss true substitutions. Most of
`these missing substitutions fall in ”filtered regions,” which tend
`to consist of repetitive sequences. In the human genome as rep-
`resented by the x chromosome, we can call variants on ~85% of
`the sequence using single end reads and 93% using paired-end
`reads.
`The difference between the blue and the red curves indicates
`
`the fraction of missing substitutions in the regions trusted by
`MAQ. This difference decreases from ~15% at 8 x down to 1% at
`30 X. Note that we apply more filters on SNPs than on filtered
`regions, which leads to the 1% difference between the two curves
`at high depth. Most of difference at low depth is accounted for by
`sampling variation. At, say, ltJX coverage there is 5 X coverage
`on average of each haplotype. However, the actual number of
`reads at a site will be distributed around the average at best ac-
`cording to a Poisson distribution. Given that we may need to see
`a variant several times to be confident enough to call it, there is
`a significant probability that not enough reads will be aligned
`and the variant will be missed.
`
`A simple model to this issue is to assume we require it reads
`to call an allele. We call this strategy the k-allele method. if we
`assume all read bases have an error rate 0.003, or phred quality 25,
`the theoretical FN and FF are shown in Figure 2C. If we require
`low FP rate. the FN rate of MAQ’s model largely agrees with that
`of the k-allele methods, allowing for the fact that some of the
`data have Qvalue lower than 25 or low mapping quality.
`A uniquely aligned read tends to be wrongly mapped if it
`has many good altemative hits. Mapping quality helps to down-
`weight such a read in SNP calling and to reduce the false SNPs
`caused by wrong alignments. To see the effect of mapping qual-
`ity, we altered our method to ignore the mapping quality and to
`use only uniquely mapped reads in calling SNPs. We filtered the
`resulting SNPs using the same five rules as previously, except the
`third one, as we assumed no mapping quality is available in this
`case. in comparison, without mapping quality, MAQ discovered
`21? false SNPs out of 12?,910 predictions. and with mapping
`quality, MAQ gave 186 out of 126.228, yielding a 14% reduction
`in FF. This reduction amounts to 31 false SNPS, which is small in
`
`A
`
`Sing.) Furl Aignuws:
`.
`.
`_
`limt‘lli'fliifil'nffi ---------
`...............
`mill.
`.H; WWW“?
`WLfl'rgflfftCE
`
`-
`
`.u-r
`[an
`'
`..-
`iii
`90
`to”
`33
`“3-:
`FD
`a
`.
`_
`30
`g as
`so“
`'5 50
`e
`'
`it!"
`E ‘0
`Ii'i'"
`'30
`rat
`rt:
`ifl'o
`to
`to
`1)
`sir-a we rue see not:
`M 1m- AM we 2&9
`(polity no Insert Intervals;
`
`
`
`.
`
`.
`
`B
`
`9r,
`
`'
`
`fished End Aitgimnk
`:
`percent .11:qu ---------
`E'
`'“flf-‘Fing'enwmrs
`.s....
`'
`
`terwncm, .
`
`.
`
`-
`..
`I
`
`..
`
`‘e‘éflkiég
`
`
`
`Errorrate
`
`Distribution of mapping qualities, consensus qualities, true alignment error rate, and true consensus error rate. The red line shows the
`Figure 1 .
`fraction of reads whose mapping qualities fail in each interval. (Pink line) The fraction of consensus genotypes whose consensus qualities fail in each
`interval; (blue line) the true alignment error rate of reads In each interval; (green line) the true consensus error rate of reads in each interval. (A) Reads
`are aligned without using mate-pair information. Single-end alignments do not contain enough information for MAQ to assign mapping quality larger
`than 90; therefore, the data in the top bin are missing. (3) Reads are aligned using mate-pair information.
`
`Genome Research
`wwwgel tome-org
`
`1853
`
`{Jimmy no tam fitter-rats;
`
`

`

`RIUGK)! 05%“th coiling fifiafigmmfi
`.
`g
`filttimtioxltlnr:
`substitutes: FP
`-'
`'eibstttmitm EN
`-
`-
`-
`-
`'
`
`-
`
`B
`
`.ikL'ul'cK} of “afield“ filming (.05 Women”
`.
`fitter“! {won -
`that‘JEE-‘ltltloflf'l’ ~
`subsist-41mm """"
`indelile
`-
`,mkviEN
`
`-
`
`c Morlacyufk— Nit-luv. :1 relbocl {item en: {5W3}
`.
`.
`sat-mile ~~~~~~~~
`Farsi-aorta -
`Fits-slide -
`l-‘ai stein-1e
`res/aka»
`F'fi S-aflele
`
`--
`
`A
`
`{It}
`no
`5.0
`
`§_
`ofe .
`F to.
`33.
`”At2!}
`1o
`
`
`
`'5
`
`id
`
`15
`
`20 H SCI
`Manet-icing depth
`
`3S
`
`$1.3
`
`£5
`
`5
`
`!c-
`
`=5
`
`3‘}
`25
`20
`fiequfilt‘iflg depth
`
`33
`
`30
`
`4-!-
`
`S
`
`tt'.‘
`
`:5
`
`'10
`35
`20
`bequeath-g depth
`
`3‘5
`
`£50
`
`43
`
`Figure 2. Accuracy oi variant calling. In the figure, "filtered regions" are regions covered by three or iewer reads or by no reads with mapping quality
`higher than 60. For substitutions, FP equals the number of positions called as different from homozygous reference that in fact should be identical to
`the leference according to the simulation, divided by the total number of MAQ substitution calls; FN equals the number of positions that are different
`from the reference according to the simulation but are missed by MAQ, divided by the total number of mutations added in the simulation. For indels,
`FP equals the number of indel calls within 5-bp flanking regions of a true indel, divided by the total number of MAQ indel calls; FN equals the number
`of true indel calls missed by MAQ, divided by the total number of indels in simulation. (A) variants are called based on singie-end alignment. {8) Variants
`are called based on paired-end alignment. (0 Theoretical accuracy of kallele method, where we call an allele as long as at least it reads are supporting
`the allele, assuming all roads are correctly aligned (see also Supplemental material).
`
`comparison to the 136,012 true substitutions in the simulation.
`However, in real data the FF is higher and in some applications,
`such as in the study of somatic mutations in cancer, the number
`of true SNPs will be much lower and the rate of false SNPs more
`critical.
`
`After these filters, MAQ predicted two homozygous differ-
`ences. Checking the capillary reads used in reference assembly
`confirms that the current reference is wrong at one of the homo-
`zygous sites. The other homozygous site is covered by 19 reads,
`with all of them identical to each other but different from the
`
`This simulation only gives a rough evaluation of MAQ’s per-
`formance. On one hand, in the simulation process, reads are
`evenly distributed along the genome, no contamination exists,
`base qualifies are accurate and sequencing errors are entirely in-
`dependent. Ai] these factors make SNP calling simpler. The true
`accuracy on real data will almost always be lower than the simu-
`lation. On the other hand, although errors are independent, we
`use a dependent mode] to infer the consensus. Using an inde-
`pendent model would achieve higher accuracy for simulated
`data. Moreover, we were using the same set of filters across all
`depths. Adjusting the threshold in filters might help to reach a
`better balance point between FN and FF at different depths.
`
`SNP calling for bacterial genomes
`
`To evaluate MAQ on real data, we obtained one lane of 2.9 mil‘
`lion 36-bp lllumina read sequences of S. paratflrhi A AKUJZt‘SOI
`strain collected by the pathogen group at the Sanger Institute.
`The short reads are purity filtered. To calibrate the quality values,
`we put PhiX sample on the fifth lane of the same run, calculate
`a quality calibration table from the alignment against the known
`PhiK genome, and then apply the table on reads from other lanes
`to infer base qualities. 5. pamtjipbi is a 4.8-Mbp bacterium, in-
`cluding plasmid (Holt et a]. 2007), and so we had ~20 >< coverage.
`An initial reference genome sequence of the same strain (AC:
`FMZOOOSS) was also produced by the pathogen group with cap-
`illary sequencing. Read sequences have been submitted to Euro-
`pean Read Archive (AC.- ERAi)l)0(l12)_
`After mapping and consensus base calling, we filtered the
`SNPs based on the same five rules as for the human K simulation,
`but in comparison to SNP calling on simulated human X chro-
`mosome, we did not filter SNPs around indels as we only had
`single-end reads; we decreased the threshold on mapping quality
`(rule 3) to 40 because single end reads usually have lower map-
`ping quality than using mate-pair reads, and we increased the
`threshold on consensus quality (rule 5) to 40 because for haploid
`genome where there are no true heterozygotes, it is easier to get
`higher consensus quality.
`
`1854 Genome Research
`wwgenomeorg
`
`reference. This site is possibly a true mutation between the ref-
`erence sequence and the Illumina-sequenced sample.
`As well as these two homozygous differences, MAQ also pre-
`dicted four beterozygotes. All four cases look confident from read
`alignment and show excessively high read depth in comparison
`to the average depth. Three are clustered together, and it appears
`likely that there is an additional copy of this region that was not
`identified in the reference. The fourth position may also be in a
`duplicated region (see below).
`Alignment against the same reference strain only evaluates
`the FP of MAQ SNP calling. To assess the FN, we aligned the reads
`to a previously published sequence from another reference
`strain ATCC9150 (McClelland et a]. 2004].
`We downloaded the sequence of strain ATCCQISO (AC:
`NC_006511) from NCBI and aligned it, using cross_match
`(P. Green, unpubl; http:iiwww.phrap.orgiphredphrapconsed.
`html), against the AKU12601 stain with tile two homozygous
`SNPs discovered previously masked as N. Cross_match gave
`seven alignments, spanning the compiete ATCCQlSi) and
`99.97% of AKU12601 genome. 211 substitutions and 39 indels
`(five of them longer than 20 bp} are contained in the alignment.
`MAQ did not give any false positives and predicted 173 true
`substitutions. Of the missing 38 substitutions, 35 were covered
`by no uniquely aligned reads, and one site was covered by only
`one unlquely aligned read. Discovering SNPs at these 36 sites is
`almost impossible with single end short reads. Of the remaining
`two (38 — 36) sites, one site was called as a SNP initially but was
`filtered out due to low read coverage l two reads), and the other
`was dropped because it was covered by no read with mapping
`quality higher than 24- and so did not pass the filter, either. In
`regions passing the SNP calling filters (96.9% of ATCC9150 ge-
`nome), no SNPs were missed. Interestingly, the four heterozy-
`goles in the AKU12601 read mapping were not called as SNPs any
`more. One site became a repeat in ATCC9150, and the other
`three were called confident monomorphic sites with about the
`average read depth. This observation possibly revealed that
`around these three sites, there are no copy nunlber changes be-
`
`

`

`tween A'l‘CC‘J'le‘J strain and the sample resequenced by [llu—
`mina.
`
`It is worth noting that AKU'lZétll and ATCC915tl are highly
`similar strains. Aligning short reads against a reference genome
`that is more distant to the sample being reseq uenced would be
`harder, especially when there are highly variable regions. In these
`regions, doing de novo assembly [Zerbino and Birney 2008} first
`and theil aligning the contigs may greatly help.
`
`Discussion
`
`MAQ is capable of human whole-genome alignments and sup-
`ports SNP calling on a diploid sample. lt has been used to map
`short sequencing reads for structural variant calling in cancer
`samples (Campbell et al. 2008) and for whole—genome methyla—
`tion analyses (Down et al. 2008). [t is able to accurately estimate
`the error probability of each alignment and of each consensus
`genotype as well. MAQ can also simulate reads from a diploid
`genome based on a haploid reference. Simulation suggests that
`20— to 30—fold coverage is needed for achieving FNs below 1% in
`the nonrepeat regions of a diploid sample.
`
`The reliability of short read alignments
`
`The reliability of read alignments can substantially affect the ac—
`curacy of the detection of variations. Knowing which alignment
`is reliable is key to the subsequent analyses. The most convenient
`way to measure the reliability is to define uniqueness: A read is
`said to be uniquely Inapped if its second best hit contains
`more mismatches than its best hit. Generally this simple crite-
`rion works well, but potential difficulties are illustrated by the
`following scenarios: (1) a read has two one-mismatch hits, one
`with a Q30 mismatch and the other with a Q3 mismatch; (2) a
`read has one perfect hit and 100 one—mismatch hits; and (3} a
`read has a perfect hit and a (lB—mismatch hit. In the first case,
`although the read is notunique, the hit with a (130 mismatch may
`still be reliable. In the remaining two cases, although the read can
`be uniquely aligned, the alignments are not reliable. For the hu-
`man genorne, these types of scenarios may happen at tirrles due
`to the large fraction of repetitive sequences.
`In our view,
`it is better to mgard the position a read is
`mapped to as a random variable, and the reliability of an align-
`ment can be naturally interpreted as the likelihood of the read
`being mapped to the correct position. At this point, mapping
`quality directly measures the reliability. It considers the repeat
`structure of the reference and the base quality of read sequencers,
`which is implied in Equation 1 (see Methods), and can easily
`handle the three cases shown above.
`
`Time complexity
`
`Mapping and assembly with qualities
`
`scan for a read once a perfect or one—mismatch hit was found.
`The perfect and one—mismatch hits, which exist for the majority
`of reads, are found in the first scan. However, stopping after the
`first scan for these reads would greatly reduce the resolution of
`mapping qualities. Reads that can be mapped confidently may
`not be effectively distinguished from those poorly aligned when
`the suboptimal hits were not available.
`
`Evaluating the accuracy
`
`Short reads tend to be wrongly aligned because one or two mu—
`tations or sequencing errors may make the best position wrong.
`When evaluating the accuracy of alignments, we have to look at
`the fraction of discarded reads (FD) and the fraction of wrongly
`aligned reads (PW) at the same time. Only counting one type of
`the errors might be misleading.
`While on simulated data it is possible to estimate both PD
`and FW of‘alignments, on real data we cannot calculate FW as we
`do not know what the correct alignment is. As a consequence, we
`cannot directly measure the accuracy of the alignment using real
`data. To see what alignment strategy works best, we must evalu—
`ate a measurable outcome from the alignment, such as the accu—
`racy of SNP calls, structural variations, or Lhe agreement between
`expression profiling and microarray results. The criteria may vary
`with different applications.
`In resequencing, accuracy can be measured by the SNI’ ac—
`curacy, which, again, should be measured by the fraction of mi ss—
`ing polymorphic sites (FN) and the fraction of wrong calls (PP) at
`the same time. We can always trade one type of error for the
`other and therefore once again counting one type of error is
`misleading.
`Unlike in an alignment, both FI’ and FN of SNl’s on real data
`can be estimated from other sources of data. F[’ can be evaluated
`
`by capillary resequencing or genotyping a small subset of SNP
`calls. FN can be estimated by comparing SNP calls to the whole-
`genome chip-genotyping results. The fraction of' chip-
`genotyping polymorphic sites that are not found is the FN. It
`should be noted that such a fraction is only the FN on the sites
`where probes can be designed for the genotyping microarray.
`These sites tend to be unique in the reference genome and are
`usually easier to find by short-read resequencing. The overall FN
`across the whole genome is higher.
`In resequencing, it is also a good idea to explicitly define the
`resequenceeble regions (or the regions where SNl’s can be confi—
`dently called). We want to distinguish low SNP-density regions
`from hard-to-resequence regions. Using MAQ, the fraction of the
`human genome that is resequenceable with 3S-bp reads is ~85%,
`and with read pairs separated by 170 bp it is ~93%. Achieving
`higher coverage would require a mixture of varying insert sizes
`and longer reads.
`
`Methods
`
`[f we map N reads to an L long reference and use k bits in index-
`ing,
`the time complexity of MAQ alignment algorithm is
`(JICINlogN + cal. + c32‘ka'J. The first term NlogN corresponds to
`the time spent on sorting the indexes; the second, on scanning
`Single end read mapping
`the Whole reference sequence; and the third term, on processing
`To map reads efficiently, MAQ first indexes read sequences artd
`the alignment when there is a seed lu't. In MAQalignment, k is 24
`scans the reference genome sequence to identify hits that are
`and N is typically 2 million and therefore 2"‘N= 0.1, but as
`extended and scored. With the Eland-like (A.]. Cox, unpubl.)
`constant «:3 is usually much larger than c2 and the human ge-
`hashing technique, MAQ, by default, guarantees to find align-
`nome has many repeats, the time spent on the last two terms is
`ments with up to two mismatches in the first 28 bp of the reads.
`approximately equal.
`MAQ maps a read to a position that minimiZes the sum of quality
`By default, MAQ scans the reference three times against six
`values of mismatched bases. If there are tnultiple equally best
`positions, then one of them is chosen at random.
`hash tables. It would be possible to save time by stopping the
`
`
`Genome Research
`wwwgenomeorg
`
`1355
`
`

`

`Li et al.
`
`In this article, we will call a potential read alignment posi—
`tion a hit. The algorithm MAQ uses to find the best hit is quite
`similar to the one used in Eland. It builds multiple hash tables to
`index the reads and scans the reference sequence against the
`hash tables to find the hits. By default, six hash tables are used,
`ensuring that a sequence with two mismatches or fewer will be
`hit. The six hash tables correspond to six noncontiguous seed
`templates (Buhler 2001; Ma et al. 2002). Given 8-bp reads, for
`example, the six templates are 11110000, 00001111, 11000011,
`00111100, 11001100, and 00110011, where nucleotides at 1 will
`be indexed while those at t) are not. By default, MAQ indexes the
`first 28 bp of the reads, which are typically the most accurate part
`of the read.
`
`In alignment, MAQ loads all reads into memory and then
`applies the first template as follows. For each read, MAQ takes the
`nucleotides at the 1 positions of the template, hashes them into
`a 24—bit integer, and puts the integer together with the read iden—
`tifier into a list. When all the reads are processed, MAQ orders the
`list based on the 24-bit integers, such that reads with the same
`hashing integer are grouped together in memory. [Each integer
`and its corresponding region are then recorded in a hash table
`with the integer as the key. We call this process indexing.
`At the same time that MAQ indexes the reads with the first
`template, it also indexes the reads with the second template that
`is complementary to the first one. Taking two templates at a time
`helps the mate—pair mapping, which will be explained in the
`section below.
`
`After the read indexing with the two templates, the refer-
`ence will be scanned base by base on both forward and reverse
`strands. I-Zach 28—bp subsequence of the reference will be hashed
`through the two templates used in indexing and will be looked
`up in the two hash tables, respectively. [fa hit is found to a read,
`MAQ will calculate the sum of qualities of mismatched bases q
`over the whole length of the read, extending out from the 28—bp
`seed without gaps (the current implementation has a read length
`limit of 63 bp). MAQ their hashes the coordinate of the hit and
`the read identifier into another 24-bit integer h and scores the hit-
`as $22“ + h. In this score, b can be considered as a pseudorandorn
`number, which differentiates hits with identical q: If there are
`multiple hits with the same :4, the flit with the smallest b will be
`identified as the best, effectively selecting randomly from the
`candidates. For each read, MAQ only holds in memory the posi-
`tion and score of its rivo best scored hits and the number of 0—, 1—,
`and 2—mismatch hits in the seed region.
`When the scan of the reference is complete, the next two
`templates are applied and the reference will be scanned once
`again until no more templates are left.
`Using six templates guarantees to find seed hits with no
`more than two mismatches, and it also finds 57% of hits with
`three mismatches. In addition, MAQ can use 20 templates to
`guarantee finding all seed hits with three mismatches at the cost
`of speed. In this configuration, 64% of seed hits with four mis-
`matches are also found, though our experience is that these hits
`are not useful in practice.
`
`Single and mapping qualities
`MAQ assigns each individual alignment a mapping quality. The
`mapping quality Q, is the phred-scaled probability (Ewing and
`Green 1998) that a read alignment may be wrong:
`
`reference and an ungapped exhaustive alignment is performed. A
`practical model for alignment with heuristic algorithms will be
`presented in the Supplemental ma terial.
`Suppose we have a reference sequence x and a read sequence
`2. On the assumption that sequencing errors are independent at-
`different sites of the read, the probability p(‘/.|X,rr) of 2 coming
`from the position rr equals the product of the error probabilities
`of the mismatched bases at the aligned position. For example,
`if read 2 mapped to position it has two mismatches: one with
`phred base quality 20 and the other with 10,
`then p(x|x,rr) :
`10—00 I 101310 = 0.001..
`To calculate the posterior probability p,(rr|x,z), we assume a
`uniform prior distribution p(u|x), and applying the Bayesian for-
`mula gives
`
`:(z|x,rr)
`It‘s-lHIXrZ} =,_,-{1— :
`2 Ptzna’i
`\'=l
`
`(1)
`
`where r. = |x| is the length of x and i = lzl. Scaling p, in the phred
`way, we get the [trapping quality of the alignment:
`
`Q5(rr|x,z) = —10|ogm[i w ps(rr|x,z)].
`
`The calculation of [Equation '1 requires summing over all
`positions on the reference. It is impractical to calculate the sum
`given a human-sized genome. In practice, we approximate Q, as:
`
`Q, = min[q2 — q, — 4.343 Irigr12,4 + (3 — k’)[§ — 14)
`— 4.343 logp1(3 — k’.28ll-
`
`Where 9, is the sum of quality values of mismatches of the best
`hit, qg is the corresponding sum for the second best flit, rig is the
`number of hits having the same number of mismatches as the
`sec

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket