`Sequenom v. Stanford
`SEQUENOM EXHIBIT 1095
`IPR2013-00390
`
`(cid:47)(cid:50)(cid:3)(cid:40)(cid:59)(cid:43)(cid:44)(cid:37)(cid:44)(cid:55)(cid:3)(cid:20)(cid:19)(cid:20)(cid:21)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)
`(cid:47)(cid:82)(cid:3)(cid:89)(cid:17)(cid:3)(cid:52)(cid:88)(cid:68)(cid:78)(cid:72)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)(cid:3)
`(cid:38)(cid:82)(cid:81)(cid:87)(cid:72)(cid:86)(cid:87)(cid:72)(cid:71)(cid:3)(cid:38)(cid:68)(cid:86)(cid:72)(cid:3)(cid:20)(cid:19)(cid:24)(cid:15)(cid:28)(cid:21)(cid:22)(cid:3)
`
`
`
`Li et al.
`
`neighborhood quality standard (NQS) [Altshuler et al. 2000),
`l’olyBayes develops an explicit statistical framework to lllodel
`variants.
`
`All new sequencing technologies are shotgun melllods tllat
`give sequences derived fronl a single molecule sampled from a
`larger population. (Current Inetllods amplify the starting teln—
`plate by some farm of PCR, but true single molecule methods are
`expected in the future.) This means the nletllods for calling vari—
`ants from new tecllnology data are most closely related to the
`second group described above, including ssahaSNP and Poly-
`Bayes. However, because of sampling and error rate, we need to
`combine data from multiple reads. In practice, errors at a par—
`ticular site are correlated, alld we must take this correlation irlto
`account. This is analogous to calling a consensus from a sequence
`assembly, and we propose a Bayesian approach to this issue that
`is related to that used in assembly software CAP3 (Huang and
`Madan 1999).
`In summary, this article presents methods and software for
`mapping short sequence reads to a reference genorne, calculating
`the probability of a read alignment being correct, and consensus
`genotype calling with a model that incorporates correlated errors
`and diploid sampling. The applicability and accuracy of the
`methods are evaluated based on both real data from the bacte—
`
`riunl Salmonella paratyphr' alld simulated data fronl the diploid
`human X chromosome.
`
`Results
`
`Overview of MAQ algorithms
`
`type sequence from the alignment. The consensus sequence is
`illferred from a Bayesian statistical model, arld each consensus
`genotype is associated with a paired quality that measures the
`probability that the consensus genotype is incorrect. Potential
`SN l’s are detected by comparing the consensus sequence to the
`reference alld can be further filtered by a set of predefined rules.
`These rules are designed to achieve the best performance on deep
`human resequencing data and aim to compensate for simplifica-
`tions and assumptions used ill the statistical model (e.g., treatirlg
`neighbor pesitions independently).
`
`Implementation
`
`We implemented the software MAQ to align short reads and call
`genotypes based on the algorithnl described ill the Methods sec—
`tion. MAQ consists of a set of related programs that are compiled
`into a single binary executable. [t is able to map reads, cal] con-
`sensus sequences including SNP arld inde] variants, simulate dip-
`loid genomes alld read sequences, and post—process the results ill
`various ways. MAQ also has an option to process Applied Biosys—
`lems SOLiD data that uses two base “color-space" encoding. f-ur-
`tller details are available from the documentation at the MAQ
`website.
`
`MAQ is easy to use. For bacterial genomes, alignments alld
`variant calling can be done with a single command line, taking a
`few Illillutes on a laptop. [ll addition, MAQ comes witll a com—
`pact arld fast OpenGL-based read alignment viewer, MAQview,
`which shows the read alignments, base qualities, arld mappirlg
`qualities in a graphical interface.
`lloth MAQ alld MAQview are designed with genome—wide
`human resequencing in mind. First, the read alignment, which is
`the slowest step ill the whole pipeline, can be divided irlto small
`tasks and parallelized 011 a modern computer cluster using less
`than 1 GB of memory for each processor core. The separate sub-
`parts of the alignment can then be merged together to give the
`final alignment. Second, the read alignments are stored in a bi—
`nary compressed file. Text—based information is only extracted
`when necessary. This strategy saves disk space by a factor of three
`to five. Third, a novel technique is implemented to index the
`compressed alignment file, which enables swift retrieval of reads
`ill ally region of the reference sequence. Viewing the alignments
`of a human—sized genome is as fast as viewing those of a single
`BAC sequence. As a whole, MAQ arld MAQview provide all effi-
`cierlt suite for managing data from tllurnirla sequencing.
`MAQ arld MAQview are implemented in C7C++ with auxil-
`iary tools in Perl. They have been extensively evaluated on large-
`scale simulated data and real data and have been tested by users
`front various research groups. MAQ software is freely distributed
`urlder the GNU General Public License (GPL). The project home
`page is at httpzh'maq.sourceforgehet.
`
`MAQ is a program that rapidly aliglls short reads to the reference
`genome and accurately infers variants, including SNPs alld short
`indels, from the alignment.
`At the alignment stage, MAQ first searches for the ungapped
`rnatcll with lowest mismatch score, defilled as the sum of quali—
`ties at mismatching bases. To speed up the alignment, MAQ only
`considers positions that have two or fewer mismatches in the
`first 28 bp (default parameters). Sequences that fail to reacll a
`mismatch score threshold but whose mate pair is mapped are
`searched with a gapped alignment algorilllrn in the regions de-
`fined by the lllate pair. To evaluate the reliability of alignments.
`MAQ assigns each individual alignment a pitted—scaled quality
`score (capped at 99), which measures the probability that the true
`alignment is rlot the one found by MAQ. MAQ always reports a
`single alignment, arld if a read can be aligned equally well to
`multiple positions, MAQ will ralldomly pick one position alld
`give it a mapping quality zero. Because their mapping score is set
`to zero, reads that are mapped equally well to multiple positions
`will not contribute to variant calling. However, they do give ill-
`formatiorl on copy number of repetitive sequences and 011 the
`fraction of reads that can be aligned to the genome, arld can
`easily be filtered out for downstream analysis if desired. Mapping
`Although it is always good to look at real data, it is impossible to
`quality scores and mapping all reads that match the genome
`assess read alignment accuracy on real data, because in a shotgun
`even if repetitive are where MAQ differs from lllost other align—
`ment programs.
`sample We cannot know where the reads come from.
`We simulated a diploid sequence (two haploid sequences)
`MAQ fully utilizes the mate-pair information of paired
`from the human reference chromosome X, as described irl the
`reads. [t is able to use this information to correct wrong aligll—
`ments, to add confidence to correct alignments, and to accurate—
`Methods: 136,012 substitutions. 7377 1—bp insertions, alld 7589
`l—bp deletions were added to the diploid genome, giving all over—
`ly map a read to repetitive sequences if its mate is confidently
`aligned. With paired-end reads, MAQ also firlds short insertions,Ir
`all polymorphisrn rate of 0.00]. From this mutated diploid ge—
`nome, we simulated 100 million pairs of 35-bp mate-pair reads
`deletions (indels) from the gapped alignment described above.
`with errors (~45.2-fold coverage orl chromosome X). The average
`At the SNP calling stage, MAQ produces a consensus geno-
`
`
`SNP calling for large-scale simulated data
`
`1852 Genome Research
`wwwgenolneorg
`
`
`
`insert size is l 70 hp with a standard deviation of 20 hp. Statistics
`on base qualities were estimated from real data where base quali-
`ties have been calibrated.
`
`With the default MAQ options, we aligned the simulated
`reads against the whole human reference genome excluding Y
`and unassembled contigs. It took 1100 CPU hours to do this
`alignment, and 914-496 of reads get mapped. Figure 1]} shows the
`distribution of mapping qualities (red curve) and the mapping
`error rate (blue curve] in each 10-based quality interval. It the
`mapping quality were estimated precisely, we would expect to
`see a straight blue line between ("O-9," 100) and (“290," 10—9).
`MAQ qualities appear to be overestimated; in other words, the
`true alignment error rate is higher than what mapping quality
`predicts to be. To investigate whether the overestimation is due
`to the fact that we did not consider mutations and indels in the
`
`model, we also simulated reads without introducing any muta-
`tions. For these data, the mapping quality could be estimated
`more accurately (data not shown), which confirms that mu ta-
`tions and indels may interfere with the calculation of mapping
`qualifies. We see in Figure 1 that this effect is greatest for map-
`ping quality ~70—8tt. However, even these reads have accuracy
`better than 10", which is sufficient for most mapping based
`applications, including structural variant calling and SNP calling.
`We called the consensus sequence from the MAQ align-
`ment. The pink curve in Figure 13 shows that most of the con-
`sensus bases have a quality over 60. About 5% of the consensus
`bases have a quality smaller than 10. They are in repetitive re~
`gions where read alignment is not reliable. We then compared
`the consensus to the diploid sequence from which reads were
`generated, and calculated the error rate of the consensus. The
`green curve indicates that the consensus quality also roughly
`agrees with the true error rate, We called indels using paired-end
`indel detection methods described in the Methods section, and
`required at least two reads to support the indel.
`After MAQ’s substitution calling, we further filtered the sub-
`stitutions based on five rules.- it) discard SNPs within the 3-bp
`flanking region around a potential indet; (2) discard SNPs cov-
`ered by three or fewer reads; (3) discard SNPs covered by no read
`with a mapping quality higher than 60,- t4) in any iti-bp window,
`if there are three or more SNPS, discard them all; and (S) discard
`SNPs with consensus quality smaller than 10. MAQ provides a
`Perl script maq.p1 to achieve all these filters.
`To see how well MAQ calls SNPs and indels at different cov—
`
`Mapping and assembly with qualities
`
`erage, we chose several subsets oi reads and called variants from
`those subsets. We compared the indels and filtered substitution
`calls to the true variants we added to the diploid genome in the
`simulation and measured the accuracy by false-positive rate (FF)
`and fatse-ne gative rate (PM (Fig. 23). MAQ consistently generates
`very few faise positives but does miss true substitutions. Most of
`these missing substitutions fall in ”filtered regions,” which tend
`to consist of repetitive sequences. In the human genome as rep-
`resented by the x chromosome, we can call variants on ~85% of
`the sequence using single end reads and 93% using paired-end
`reads.
`The difference between the blue and the red curves indicates
`
`the fraction of missing substitutions in the regions trusted by
`MAQ. This difference decreases from ~15% at 8 x down to 1% at
`30 X. Note that we apply more filters on SNPs than on filtered
`regions, which leads to the 1% difference between the two curves
`at high depth. Most of difference at low depth is accounted for by
`sampling variation. At, say, ltJX coverage there is 5 X coverage
`on average of each haplotype. However, the actual number of
`reads at a site will be distributed around the average at best ac-
`cording to a Poisson distribution. Given that we may need to see
`a variant several times to be confident enough to call it, there is
`a significant probability that not enough reads will be aligned
`and the variant will be missed.
`
`A simple model to this issue is to assume we require it reads
`to call an allele. We call this strategy the k-allele method. if we
`assume all read bases have an error rate 0.003, or phred quality 25,
`the theoretical FN and FF are shown in Figure 2C. If we require
`low FP rate. the FN rate of MAQ’s model largely agrees with that
`of the k-allele methods, allowing for the fact that some of the
`data have Qvalue lower than 25 or low mapping quality.
`A uniquely aligned read tends to be wrongly mapped if it
`has many good altemative hits. Mapping quality helps to down-
`weight such a read in SNP calling and to reduce the false SNPs
`caused by wrong alignments. To see the effect of mapping qual-
`ity, we altered our method to ignore the mapping quality and to
`use only uniquely mapped reads in calling SNPs. We filtered the
`resulting SNPs using the same five rules as previously, except the
`third one, as we assumed no mapping quality is available in this
`case. in comparison, without mapping quality, MAQ discovered
`21? false SNPs out of 12?,910 predictions. and with mapping
`quality, MAQ gave 186 out of 126.228, yielding a 14% reduction
`in FF. This reduction amounts to 31 false SNPS, which is small in
`
`A
`
`Sing.) Furl Aignuws:
`.
`.
`_
`limt‘lli'fliifil'nffi ---------
`...............
`mill.
`.H; WWW“?
`WLfl'rgflfftCE
`
`-
`
`.u-r
`[an
`'
`..-
`iii
`90
`to”
`33
`“3-:
`FD
`a
`.
`_
`30
`g as
`so“
`'5 50
`e
`'
`it!"
`E ‘0
`Ii'i'"
`'30
`rat
`rt:
`ifl'o
`to
`to
`1)
`sir-a we rue see not:
`M 1m- AM we 2&9
`(polity no Insert Intervals;
`
`
`
`.
`
`.
`
`B
`
`9r,
`
`'
`
`fished End Aitgimnk
`:
`percent .11:qu ---------
`E'
`'“flf-‘Fing'enwmrs
`.s....
`'
`
`terwncm, .
`
`.
`
`-
`..
`I
`
`..
`
`‘e‘éflkiég
`
`
`
`Errorrate
`
`Distribution of mapping qualities, consensus qualities, true alignment error rate, and true consensus error rate. The red line shows the
`Figure 1 .
`fraction of reads whose mapping qualities fail in each interval. (Pink line) The fraction of consensus genotypes whose consensus qualities fail in each
`interval; (blue line) the true alignment error rate of reads In each interval; (green line) the true consensus error rate of reads in each interval. (A) Reads
`are aligned without using mate-pair information. Single-end alignments do not contain enough information for MAQ to assign mapping quality larger
`than 90; therefore, the data in the top bin are missing. (3) Reads are aligned using mate-pair information.
`
`Genome Research
`wwwgel tome-org
`
`1853
`
`{Jimmy no tam fitter-rats;
`
`
`
`RIUGK)! 05%“th coiling fifiafigmmfi
`.
`g
`filttimtioxltlnr:
`substitutes: FP
`-'
`'eibstttmitm EN
`-
`-
`-
`-
`'
`
`-
`
`B
`
`.ikL'ul'cK} of “afield“ filming (.05 Women”
`.
`fitter“! {won -
`that‘JEE-‘ltltloflf'l’ ~
`subsist-41mm """"
`indelile
`-
`,mkviEN
`
`-
`
`c Morlacyufk— Nit-luv. :1 relbocl {item en: {5W3}
`.
`.
`sat-mile ~~~~~~~~
`Farsi-aorta -
`Fits-slide -
`l-‘ai stein-1e
`res/aka»
`F'fi S-aflele
`
`--
`
`A
`
`{It}
`no
`5.0
`
`§_
`ofe .
`F to.
`33.
`”At2!}
`1o
`
`
`
`'5
`
`id
`
`15
`
`20 H SCI
`Manet-icing depth
`
`3S
`
`$1.3
`
`£5
`
`5
`
`!c-
`
`=5
`
`3‘}
`25
`20
`fiequfilt‘iflg depth
`
`33
`
`30
`
`4-!-
`
`S
`
`tt'.‘
`
`:5
`
`'10
`35
`20
`bequeath-g depth
`
`3‘5
`
`£50
`
`43
`
`Figure 2. Accuracy oi variant calling. In the figure, "filtered regions" are regions covered by three or iewer reads or by no reads with mapping quality
`higher than 60. For substitutions, FP equals the number of positions called as different from homozygous reference that in fact should be identical to
`the leference according to the simulation, divided by the total number of MAQ substitution calls; FN equals the number of positions that are different
`from the reference according to the simulation but are missed by MAQ, divided by the total number of mutations added in the simulation. For indels,
`FP equals the number of indel calls within 5-bp flanking regions of a true indel, divided by the total number of MAQ indel calls; FN equals the number
`of true indel calls missed by MAQ, divided by the total number of indels in simulation. (A) variants are called based on singie-end alignment. {8) Variants
`are called based on paired-end alignment. (0 Theoretical accuracy of kallele method, where we call an allele as long as at least it reads are supporting
`the allele, assuming all roads are correctly aligned (see also Supplemental material).
`
`comparison to the 136,012 true substitutions in the simulation.
`However, in real data the FF is higher and in some applications,
`such as in the study of somatic mutations in cancer, the number
`of true SNPs will be much lower and the rate of false SNPs more
`critical.
`
`After these filters, MAQ predicted two homozygous differ-
`ences. Checking the capillary reads used in reference assembly
`confirms that the current reference is wrong at one of the homo-
`zygous sites. The other homozygous site is covered by 19 reads,
`with all of them identical to each other but different from the
`
`This simulation only gives a rough evaluation of MAQ’s per-
`formance. On one hand, in the simulation process, reads are
`evenly distributed along the genome, no contamination exists,
`base qualifies are accurate and sequencing errors are entirely in-
`dependent. Ai] these factors make SNP calling simpler. The true
`accuracy on real data will almost always be lower than the simu-
`lation. On the other hand, although errors are independent, we
`use a dependent mode] to infer the consensus. Using an inde-
`pendent model would achieve higher accuracy for simulated
`data. Moreover, we were using the same set of filters across all
`depths. Adjusting the threshold in filters might help to reach a
`better balance point between FN and FF at different depths.
`
`SNP calling for bacterial genomes
`
`To evaluate MAQ on real data, we obtained one lane of 2.9 mil‘
`lion 36-bp lllumina read sequences of S. paratflrhi A AKUJZt‘SOI
`strain collected by the pathogen group at the Sanger Institute.
`The short reads are purity filtered. To calibrate the quality values,
`we put PhiX sample on the fifth lane of the same run, calculate
`a quality calibration table from the alignment against the known
`PhiK genome, and then apply the table on reads from other lanes
`to infer base qualities. 5. pamtjipbi is a 4.8-Mbp bacterium, in-
`cluding plasmid (Holt et a]. 2007), and so we had ~20 >< coverage.
`An initial reference genome sequence of the same strain (AC:
`FMZOOOSS) was also produced by the pathogen group with cap-
`illary sequencing. Read sequences have been submitted to Euro-
`pean Read Archive (AC.- ERAi)l)0(l12)_
`After mapping and consensus base calling, we filtered the
`SNPs based on the same five rules as for the human K simulation,
`but in comparison to SNP calling on simulated human X chro-
`mosome, we did not filter SNPs around indels as we only had
`single-end reads; we decreased the threshold on mapping quality
`(rule 3) to 40 because single end reads usually have lower map-
`ping quality than using mate-pair reads, and we increased the
`threshold on consensus quality (rule 5) to 40 because for haploid
`genome where there are no true heterozygotes, it is easier to get
`higher consensus quality.
`
`1854 Genome Research
`wwgenomeorg
`
`reference. This site is possibly a true mutation between the ref-
`erence sequence and the Illumina-sequenced sample.
`As well as these two homozygous differences, MAQ also pre-
`dicted four beterozygotes. All four cases look confident from read
`alignment and show excessively high read depth in comparison
`to the average depth. Three are clustered together, and it appears
`likely that there is an additional copy of this region that was not
`identified in the reference. The fourth position may also be in a
`duplicated region (see below).
`Alignment against the same reference strain only evaluates
`the FP of MAQ SNP calling. To assess the FN, we aligned the reads
`to a previously published sequence from another reference
`strain ATCC9150 (McClelland et a]. 2004].
`We downloaded the sequence of strain ATCCQISO (AC:
`NC_006511) from NCBI and aligned it, using cross_match
`(P. Green, unpubl; http:iiwww.phrap.orgiphredphrapconsed.
`html), against the AKU12601 stain with tile two homozygous
`SNPs discovered previously masked as N. Cross_match gave
`seven alignments, spanning the compiete ATCCQlSi) and
`99.97% of AKU12601 genome. 211 substitutions and 39 indels
`(five of them longer than 20 bp} are contained in the alignment.
`MAQ did not give any false positives and predicted 173 true
`substitutions. Of the missing 38 substitutions, 35 were covered
`by no uniquely aligned reads, and one site was covered by only
`one unlquely aligned read. Discovering SNPs at these 36 sites is
`almost impossible with single end short reads. Of the remaining
`two (38 — 36) sites, one site was called as a SNP initially but was
`filtered out due to low read coverage l two reads), and the other
`was dropped because it was covered by no read with mapping
`quality higher than 24- and so did not pass the filter, either. In
`regions passing the SNP calling filters (96.9% of ATCC9150 ge-
`nome), no SNPs were missed. Interestingly, the four heterozy-
`goles in the AKU12601 read mapping were not called as SNPs any
`more. One site became a repeat in ATCC9150, and the other
`three were called confident monomorphic sites with about the
`average read depth. This observation possibly revealed that
`around these three sites, there are no copy nunlber changes be-
`
`
`
`tween A'l‘CC‘J'le‘J strain and the sample resequenced by [llu—
`mina.
`
`It is worth noting that AKU'lZétll and ATCC915tl are highly
`similar strains. Aligning short reads against a reference genome
`that is more distant to the sample being reseq uenced would be
`harder, especially when there are highly variable regions. In these
`regions, doing de novo assembly [Zerbino and Birney 2008} first
`and theil aligning the contigs may greatly help.
`
`Discussion
`
`MAQ is capable of human whole-genome alignments and sup-
`ports SNP calling on a diploid sample. lt has been used to map
`short sequencing reads for structural variant calling in cancer
`samples (Campbell et al. 2008) and for whole—genome methyla—
`tion analyses (Down et al. 2008). [t is able to accurately estimate
`the error probability of each alignment and of each consensus
`genotype as well. MAQ can also simulate reads from a diploid
`genome based on a haploid reference. Simulation suggests that
`20— to 30—fold coverage is needed for achieving FNs below 1% in
`the nonrepeat regions of a diploid sample.
`
`The reliability of short read alignments
`
`The reliability of read alignments can substantially affect the ac—
`curacy of the detection of variations. Knowing which alignment
`is reliable is key to the subsequent analyses. The most convenient
`way to measure the reliability is to define uniqueness: A read is
`said to be uniquely Inapped if its second best hit contains
`more mismatches than its best hit. Generally this simple crite-
`rion works well, but potential difficulties are illustrated by the
`following scenarios: (1) a read has two one-mismatch hits, one
`with a Q30 mismatch and the other with a Q3 mismatch; (2) a
`read has one perfect hit and 100 one—mismatch hits; and (3} a
`read has a perfect hit and a (lB—mismatch hit. In the first case,
`although the read is notunique, the hit with a (130 mismatch may
`still be reliable. In the remaining two cases, although the read can
`be uniquely aligned, the alignments are not reliable. For the hu-
`man genorne, these types of scenarios may happen at tirrles due
`to the large fraction of repetitive sequences.
`In our view,
`it is better to mgard the position a read is
`mapped to as a random variable, and the reliability of an align-
`ment can be naturally interpreted as the likelihood of the read
`being mapped to the correct position. At this point, mapping
`quality directly measures the reliability. It considers the repeat
`structure of the reference and the base quality of read sequencers,
`which is implied in Equation 1 (see Methods), and can easily
`handle the three cases shown above.
`
`Time complexity
`
`Mapping and assembly with qualities
`
`scan for a read once a perfect or one—mismatch hit was found.
`The perfect and one—mismatch hits, which exist for the majority
`of reads, are found in the first scan. However, stopping after the
`first scan for these reads would greatly reduce the resolution of
`mapping qualities. Reads that can be mapped confidently may
`not be effectively distinguished from those poorly aligned when
`the suboptimal hits were not available.
`
`Evaluating the accuracy
`
`Short reads tend to be wrongly aligned because one or two mu—
`tations or sequencing errors may make the best position wrong.
`When evaluating the accuracy of alignments, we have to look at
`the fraction of discarded reads (FD) and the fraction of wrongly
`aligned reads (PW) at the same time. Only counting one type of
`the errors might be misleading.
`While on simulated data it is possible to estimate both PD
`and FW of‘alignments, on real data we cannot calculate FW as we
`do not know what the correct alignment is. As a consequence, we
`cannot directly measure the accuracy of the alignment using real
`data. To see what alignment strategy works best, we must evalu—
`ate a measurable outcome from the alignment, such as the accu—
`racy of SNP calls, structural variations, or Lhe agreement between
`expression profiling and microarray results. The criteria may vary
`with different applications.
`In resequencing, accuracy can be measured by the SNI’ ac—
`curacy, which, again, should be measured by the fraction of mi ss—
`ing polymorphic sites (FN) and the fraction of wrong calls (PP) at
`the same time. We can always trade one type of error for the
`other and therefore once again counting one type of error is
`misleading.
`Unlike in an alignment, both FI’ and FN of SNl’s on real data
`can be estimated from other sources of data. F[’ can be evaluated
`
`by capillary resequencing or genotyping a small subset of SNP
`calls. FN can be estimated by comparing SNP calls to the whole-
`genome chip-genotyping results. The fraction of' chip-
`genotyping polymorphic sites that are not found is the FN. It
`should be noted that such a fraction is only the FN on the sites
`where probes can be designed for the genotyping microarray.
`These sites tend to be unique in the reference genome and are
`usually easier to find by short-read resequencing. The overall FN
`across the whole genome is higher.
`In resequencing, it is also a good idea to explicitly define the
`resequenceeble regions (or the regions where SNl’s can be confi—
`dently called). We want to distinguish low SNP-density regions
`from hard-to-resequence regions. Using MAQ, the fraction of the
`human genome that is resequenceable with 3S-bp reads is ~85%,
`and with read pairs separated by 170 bp it is ~93%. Achieving
`higher coverage would require a mixture of varying insert sizes
`and longer reads.
`
`Methods
`
`[f we map N reads to an L long reference and use k bits in index-
`ing,
`the time complexity of MAQ alignment algorithm is
`(JICINlogN + cal. + c32‘ka'J. The first term NlogN corresponds to
`the time spent on sorting the indexes; the second, on scanning
`Single end read mapping
`the Whole reference sequence; and the third term, on processing
`To map reads efficiently, MAQ first indexes read sequences artd
`the alignment when there is a seed lu't. In MAQalignment, k is 24
`scans the reference genome sequence to identify hits that are
`and N is typically 2 million and therefore 2"‘N= 0.1, but as
`extended and scored. With the Eland-like (A.]. Cox, unpubl.)
`constant «:3 is usually much larger than c2 and the human ge-
`hashing technique, MAQ, by default, guarantees to find align-
`nome has many repeats, the time spent on the last two terms is
`ments with up to two mismatches in the first 28 bp of the reads.
`approximately equal.
`MAQ maps a read to a position that minimiZes the sum of quality
`By default, MAQ scans the reference three times against six
`values of mismatched bases. If there are tnultiple equally best
`positions, then one of them is chosen at random.
`hash tables. It would be possible to save time by stopping the
`
`
`Genome Research
`wwwgenomeorg
`
`1355
`
`
`
`Li et al.
`
`In this article, we will call a potential read alignment posi—
`tion a hit. The algorithm MAQ uses to find the best hit is quite
`similar to the one used in Eland. It builds multiple hash tables to
`index the reads and scans the reference sequence against the
`hash tables to find the hits. By default, six hash tables are used,
`ensuring that a sequence with two mismatches or fewer will be
`hit. The six hash tables correspond to six noncontiguous seed
`templates (Buhler 2001; Ma et al. 2002). Given 8-bp reads, for
`example, the six templates are 11110000, 00001111, 11000011,
`00111100, 11001100, and 00110011, where nucleotides at 1 will
`be indexed while those at t) are not. By default, MAQ indexes the
`first 28 bp of the reads, which are typically the most accurate part
`of the read.
`
`In alignment, MAQ loads all reads into memory and then
`applies the first template as follows. For each read, MAQ takes the
`nucleotides at the 1 positions of the template, hashes them into
`a 24—bit integer, and puts the integer together with the read iden—
`tifier into a list. When all the reads are processed, MAQ orders the
`list based on the 24-bit integers, such that reads with the same
`hashing integer are grouped together in memory. [Each integer
`and its corresponding region are then recorded in a hash table
`with the integer as the key. We call this process indexing.
`At the same time that MAQ indexes the reads with the first
`template, it also indexes the reads with the second template that
`is complementary to the first one. Taking two templates at a time
`helps the mate—pair mapping, which will be explained in the
`section below.
`
`After the read indexing with the two templates, the refer-
`ence will be scanned base by base on both forward and reverse
`strands. I-Zach 28—bp subsequence of the reference will be hashed
`through the two templates used in indexing and will be looked
`up in the two hash tables, respectively. [fa hit is found to a read,
`MAQ will calculate the sum of qualities of mismatched bases q
`over the whole length of the read, extending out from the 28—bp
`seed without gaps (the current implementation has a read length
`limit of 63 bp). MAQ their hashes the coordinate of the hit and
`the read identifier into another 24-bit integer h and scores the hit-
`as $22“ + h. In this score, b can be considered as a pseudorandorn
`number, which differentiates hits with identical q: If there are
`multiple hits with the same :4, the flit with the smallest b will be
`identified as the best, effectively selecting randomly from the
`candidates. For each read, MAQ only holds in memory the posi-
`tion and score of its rivo best scored hits and the number of 0—, 1—,
`and 2—mismatch hits in the seed region.
`When the scan of the reference is complete, the next two
`templates are applied and the reference will be scanned once
`again until no more templates are left.
`Using six templates guarantees to find seed hits with no
`more than two mismatches, and it also finds 57% of hits with
`three mismatches. In addition, MAQ can use 20 templates to
`guarantee finding all seed hits with three mismatches at the cost
`of speed. In this configuration, 64% of seed hits with four mis-
`matches are also found, though our experience is that these hits
`are not useful in practice.
`
`Single and mapping qualities
`MAQ assigns each individual alignment a mapping quality. The
`mapping quality Q, is the phred-scaled probability (Ewing and
`Green 1998) that a read alignment may be wrong:
`
`reference and an ungapped exhaustive alignment is performed. A
`practical model for alignment with heuristic algorithms will be
`presented in the Supplemental ma terial.
`Suppose we have a reference sequence x and a read sequence
`2. On the assumption that sequencing errors are independent at-
`different sites of the read, the probability p(‘/.|X,rr) of 2 coming
`from the position rr equals the product of the error probabilities
`of the mismatched bases at the aligned position. For example,
`if read 2 mapped to position it has two mismatches: one with
`phred base quality 20 and the other with 10,
`then p(x|x,rr) :
`10—00 I 101310 = 0.001..
`To calculate the posterior probability p,(rr|x,z), we assume a
`uniform prior distribution p(u|x), and applying the Bayesian for-
`mula gives
`
`:(z|x,rr)
`It‘s-lHIXrZ} =,_,-{1— :
`2 Ptzna’i
`\'=l
`
`(1)
`
`where r. = |x| is the length of x and i = lzl. Scaling p, in the phred
`way, we get the [trapping quality of the alignment:
`
`Q5(rr|x,z) = —10|ogm[i w ps(rr|x,z)].
`
`The calculation of [Equation '1 requires summing over all
`positions on the reference. It is impractical to calculate the sum
`given a human-sized genome. In practice, we approximate Q, as:
`
`Q, = min[q2 — q, — 4.343 Irigr12,4 + (3 — k’)[§ — 14)
`— 4.343 logp1(3 — k’.28ll-
`
`Where 9, is the sum of quality values of mismatches of the best
`hit, qg is the corresponding sum for the second best flit, rig is the
`number of hits having the same number of mismatches as the
`sec