throbber
Chapter 16
`
`An Overview of the Analysis of Next Generation
`Sequencing Data
`
`Andreas Gogol-Do¨ ring and Wei Chen
`
`Abstract
`
`Next generation sequencing is a common and versatile tool for biological and medical research.
`We describe the basic steps for analyzing next generation sequencing data, including quality checking
`and mapping to a reference genome. We also explain the further data analysis for three common
`applications of next generation sequencing: variant detection, RNA-seq, and ChIP-seq.
`
`Key words: Next generation sequencing, Read mapping, Variant detection, RNA-seq, ChIP-seq
`
`1. Introduction
`
`In the last decade, a new generation of sequencing technologies
`revolutionized DNA sequencing (1). Compared to conventional
`Sanger sequencing using capillary electrophoresis, the massively
`parallel sequencing platforms provide orders of magnitude more
`data at much lower recurring cost. To date, several so-called next
`generation sequencing platforms are available, such as the 454-
`FLX (Roche), the Genome Analyzer (Illumina/Solexa), and
`SOLiD (Applied Biosystems); each having its own specifics.
`Based on these novel technologies, a broad range of applications
`has been developed (see Fig. 1).
`Next generation sequencing generates huge amounts of data,
`which poses a challenge both for data storage and analysis, and
`consequently often necessitates the use of powerful computing
`facilities and efficient algorithms. In this chapter, we describe the
`general procedures of next generation sequencing data analysis
`with a focus on sequencing applications that use a reference
`sequence to which the reads can be aligned. After describing
`how to check the sequencing quality, preprocess the sequenced
`reads, and map the sequenced reads to a reference, we briefly
`
`Junbai Wang et al. (eds.), Next Generation Microarray Bioinformatics: Methods and Protocols,
`Methods in Molecular Biology, vol. 802, DOI 10.1007/978-1-61779-400-1_16, # Springer Science+Business Media, LLC 2012
`249
`
`00001
`
`EX1037
`
`

`

`250
`
`A. Gogol-Do¨ ring and W. Chen
`
`Copy Number
`Variations
`
`Isoform
`Quantification
`
`Small
`RNA
`
`Novel
`Transcripts
`
`RNA-seq
`
`ChIP-seq
`
`Single
`Nucleotide
`Variations
`
`Variant
`Detection
`
`Structural Variations
`
`Meta-
`transcript
`omics
`
`-
`
`Meta-
`genomics
`
`RNA
`DNA
`
`Small
`Indels
`
`Long Inserts
`
`Re-sequencing
`Sequencing
`
`Genome
`Sequencing
`
`Using Reference
`
`e-Novo
`
`Quantitative
`
`Qualitative
`
`Fig. 1. Illustration of some common applications based on next generation sequencing. The decoding of new genomes is
`only one of various possibilities to use sequencing. Variant detection, ChIP-seq, and RNA-seq are discussed in this book.
`Metagenomics (16) is a method to study communities of microbial organisms by sequencing the whole genetic material
`gathered from environmental samples.
`
`discuss three of the most common applications for next generation
`sequencing.
`1. Variant detection (2) means to find genetic differences
`between the studied sample and the reference. These differ-
`ences range from single nucleotide variants (SNVs) to large
`genomic deletions, insertions, or rearrangements.
`2. RNA-seq (3) can be used to determine the expression level of
`annotated genes as well as to discover novel transcripts.
`3. ChIP-seq (4) is a method for genome-wide screening pro-
`tein–DNA interactions.
`
`2. Methods
`
`2.1. General Read
`Processing
`
`Current next generation sequencing technologies based on photo-
`chemical reactions recorded on digital images, which are further
`processed to get sequences (reads) of nucleotides or, for SOLiD,
`
`00002
`
`

`

`16
`
`An Overview of the Analysis of Next Generation Sequencing Data
`
`251
`
`dinucleotide “colors” (5) (base/color calling). The sequencing
`data analysis starts from files containing DNA sequences and qual-
`ity values for each base/color.
`
`1. Check the overall success of the sequencing process by count-
`ing the raw reads, i.e., spots (clusters/beads) on the images,
`and the fraction of reads accepted after base calling (filtered
`reads). These counts could be looked up in a results file gen-
`erated by the base calling software. A low number of filtered
`reads could be caused by various problems during the library
`preparation or sequencing procedure (see Note 1). Only the
`filtered reads should be used for further processing. For more
`ways to test the quality of the sequencing process see Notes
`2 and 3.
`
`2. Sequencing data are usually stored in proprietary file formats.
`Since some mapping software tools do not accept these for-
`mats as input, a script often has to be employed to convert the
`data into common file formats such as FASTA or FASTQ.
`
`3. The sequenced DNA fragments are sometimes called
`“inserts” because they are wrapped by sequencing adapters.
`The adapters are partially sequenced if the inserts are shorter
`than the read length, for example, in small RNAs sequencing
`(see Subheading 2.4, step 5). In these occasions, it is necessary
`to remove the sequenced parts of the adapter from the reads,
`which could be achieved by removing all read suffixes that are
`adapter prefixes (see Note 4).
`
`Many applications of next generation sequencing require a refer-
`ence sequence to which the sequenced reads could be aligned.
`Read mapping means to find the position in the reference where
`the read matches with a minimum number of differences. This
`position is hence most likely the origin of the sequenced DNA
`fragment (see Note 5).
`
`1. There are numerous tools available for read mapping (6).
`Select a tool that is appropriate for mapping reads of the
`given kind (see Note 6). Some applications may require spe-
`cial read mapping procedures that, for example, allow small
`insertions and deletions (indels) or account for splicing in
`RNA-seq.
`
`2. Select an appropriate maximum number of allowed errors
`(see Note 7).
`
`3. For most applications you only need uniquely mapped reads,
`i.e., reads matching to a single “best” genomic position.
`If nonuniquely mapped reads could also be useful, then con-
`sider to specify an upper bound for the number of reported
`mapping positions, because otherwise the result list is blown
`up by reads mapping to highly repetitive regions.
`
`2.2. Mapping to a
`Reference
`
`00003
`
`

`

`252
`
`A. Gogol-Do¨ ring and W. Chen
`
`4. Most mapping tools create output files in proprietary formats,
`so we advice to convert the mapping output into a common
`file format such as BED, GFF, or SAM (7, 8).
`
`5. Count the percentage of all reads which could be mapped to
`at least one position in the reference. A low amount of map-
`pable reads could indicate a low sequencing quality (see Note
`3) or a failed adapter removal (see Note 4).
`
`6. Some pieces of DNA could be overamplified during library
`preparation (PCR artifacts) resulting in a stack of redundant
`reads that are mapped to the same genomic position and same
`strand. If it is necessary to get rid of such redundancy, discard
`all but one read mapped to the same position and on the same
`strand.
`
`7. Transform SOLiD reads into nucleotide space after mapping.
`
`2.3. Application
`1: Variant Detection
`
`The detection of different variation types requires different
`sequencing formats and analysis strategies. Tools are available for
`the detection of most variant types (2) (see Note 8).
`
`1. For detecting SNVs, search the mapped reads for bases that
`are different from the reference sequence. Since there will
`probably be more sequencing errors than true SNVs, each
`SNV candidate must be supported by several independent
`reads. A sufficient coverage is therefore required (see Note
`9). Note that some SNVs might be heterozygous, which
`means that they occur only in some of the reads spanning
`them.
`
`2. Structural variants can be detected by sequencing both ends
`of DNA fragments (paired-end sequencing) (see Fig. 2) (9).
`After mapping the individual reads independently to the ref-
`erence, estimate the distribution of fragment lengths. Then
`search for read pairs which were mapped to different chromo-
`somes or have abnormal distance, ordering, or strand orienta-
`tion. Search for a most parsimonious set of structural variants
`explaining all discordant read pairs. The more read pairs can
`be explained by the same variant, the more reliable this variant
`is and the more precise the break point(s) could be deter-
`mined. If only one end of a DNA fragment could be mapped
`to the reference, the other end is possibly part of a (long)
`insertion. Given a suitable coverage, the sequence of the
`insertion can possibly be determined by assembling the
`unmapped reads.
`
`2.4. Application
`2: RNA-seq
`
`The experimental sequencing protocols and hence the data analy-
`sis procedures are usually different for longer RNA molecules such
`as mRNA (Subheading 2.4 steps 2 and 3) and small RNA such as
`miRNA (Subheading 2.4 steps 5 and 6).
`
`00004
`
`

`

`16
`
`An Overview of the Analysis of Next Generation Sequencing Data
`
`253
`
`Deletion
`
`Insertion
`
`Long Insertion
`
`too wide
`
`too close
`
`only one read
`mapped
`
`Inversion
`
`Duplication
`
`Translocation
`
`Sample
`
`Reference
`
`Sample
`
`Reference
`
`same strands
`
`divergent strands
`
`mapped on different
`chromosomes
`
`Fig. 2. Different variant types detected by paired-end sequencing (9). (1) Deletion: The reference contains a sequence
`that is not present in the sample. (2–3) Insertion and Long Insertion: The sample contains a sequence that does not exist
`in the reference. (4) Inversion: A part of the sample is reverse compared to the reference. (5) Duplication: A part of the
`reference occurs twice in the sample (tandem repeat). (6) Translocation: The sample is a combination of sequences
`coming from different chromosomes in the reference. Note that the pattern for concordant reads varies depending on the
`sequencing technologies and the library preparation protocol.
`
`1. Check the data quality. Classify the mapped reads on the basis
`of available genome annotation into different functional
`groups such as exons, introns, rRNA, intergenic, etc. For
`example, in the case of sequencing polyA-RNA, only a small
`fraction of reads should be mapped to rRNA.
`
`2. Determine the expression level of annotated genes by count-
`ing the reads mapped to the corresponding exons, and then
`divide these counts by the cumulated exon lengths (in kb) and
`the total number of mapped reads (in millions). The resulting
`RPKM (“reads per kilobase of transcript per million mapped
`reads”) can be used for comparing expression levels of genes
`in different data sets (10).
`
`3. To quantify different splicing isoforms, select reads belonging
`exclusively to certain isoforms, for example, reads mapping to
`exons or crossing splicing junctions present only in a single
`isoform. From the amounts of these reads infer a maximum
`likelihood estimation of the isoform expression levels.
`
`4. To discover novel transcripts or splicing junctions, use a
`spliced alignment procedure to map the RNA-seq reads to a
`reference genome. Then find a most parsimonious set of tran-
`scripts that explains the data. Alternatively, you could first
`assemble the sequencing reads and then align the assembled
`
`00005
`
`

`

`254
`
`A. Gogol-Do¨ ring and W. Chen
`
`contigs to the genome (11). In both cases, it is advisable to
`sequence long paired-end reads.
`
`5. Small RNA-seq reads are first preprocessed to remove adapter
`sequences (see Subheading 2.1, step 3). To profile known
`miRNA, the reads could then be mapped either to the
`genome or to the known miRNA precursor sequences (12).
`Do not remove redundant reads (see Subheading 2.2, step 6)
`when analyzing this kind of data. The expression level of a
`specific miRNA could be estimated given the number of
`redundant sequencing reads mapped to its mature sequence
`(see Note 10). Normalize the raw read counts by the total
`number of mapped reads in the data set (see Subheading 2.4,
`step 2 and Note 11).
`
`6. To discover novel miRNAs, use a tool such as miRDeep (13),
`which uses a probabilistic model of miRNA biogenesis to
`score compatibility of
`the position and frequency of
`sequenced RNA with the secondary structure of the miRNA
`precursor.
`
`In ChIP-seq, chromatin immunoprecipitation uses antibodies to
`specifically select the proteins of interest together with any piece of
`randomly fragmented DNA bound to them. Then the precipi-
`tated DNA fragments are sequenced. Genomic regions binding to
`the proteins consequently feature an increased number of mapped
`sequencing reads.
`
`1. Use a “peak calling” tool to search for enriched regions in the
`ChIP-seq data (10) (see Note 12). ChIP-seq data should be
`evaluated relative to a control data set obtained either by
`sequencing the input DNA without ChIP or by using an
`antibody with unspecific binding such as IgG (see Note 9).
`
`2. An alternative way to analyze the data that is especially suited
`for profiling histone modifications is to determine the nor-
`malized read density (RPKM) of certain genomic areas such as
`genes or promoter regions. This method is similar to the
`analysis of RNA-seq data (see Subheading 2.4, step 2).
`
`1.
`
`In some cases, the sequencing results could be improved by
`manually restarting the base calling using nondefault para-
`meters. For example, choosing a better control lane when
`starting the Illumina offline base caller could boost up the
`number of successfully called sequencing reads. Candidates
`for good control
`lanes
`feature a nearly uniform base
`
`2.5. Application
`3: ChIP-seq
`
`3. Notes
`
`00006
`
`

`

`16
`
`An Overview of the Analysis of Next Generation Sequencing Data
`
`255
`
`distribution (see Note 2). Note that for this reason a flow cell
`should never be filled completely with, e.g., small RNA
`libraries, since these are not expected to produce uniform
`base distributions.
`
`2. Check the base/color distribution over the whole read length.
`If the sequenced DNA fragments are randomly sampled from
`the genome – for example, sequencing genomic DNA, ChIP-
`seq, or (long) RNA-seq libraries – then the bases should be
`nearly uniformly distributed for all sequencing cycles. The
`software suite provided by the instrument vendors usually
`creates all relevant plots.
`
`3. The base caller annotates each base with a value reflecting its
`putative quality. These values could be used to determine the
`number of high/low quality bases for each cycle. The overall
`quality of sequenced bases normally declines slowly toward
`the end of the read. A drop of quality for a single cycle could
`be a hint for a temporary problem during the sequencing.
`
`4. Since the sequenced adapter could contain errors, it is reason-
`able to allow some mismatches during the adapter search.
`Note that there is a trade-off between the sensitivity and the
`specificity of this search.
`
`5.
`
`In order to avoid wrongly mapped reads, it is important to use
`a reference as accurate and complete as possible. All possible
`sources of reads should be present in the reference.
`
`6. Not all tools can handle SOLiD reads in dinucleotide color
`space; Roche 454 reads may contain typical indels in homo-
`polymer runs. When mapping the relatively short reads cre-
`ated by or Illumina Genome Analyzer or SOLiD, it is usually
`sufficient to consider only mismatches, unless it is planned to
`detect small indels.
`
`7. We recommend to choose a mapping strategy that guarantees
`accurate mappings rather than to maximize the mere number
`of mapped reads. Next generation sequencing usually gener-
`ates huge quantities of reads, so a negligible loss of reads is
`certainly affordable. Consequently, most mapping tools are
`optimized to allow only a small number of mismatches.
`Higher error numbers are only necessary if the reads are
`long or if we are especially interested in variations between
`reads and reference.
`
`8. Check the success of your experiment by comparing your
`results to already known variants deposited in public data
`bases such as dbSNP (14) and the Database of Genomic
`Variants (15).
`
`00007
`
`

`

`256
`
`A. Gogol-Do¨ ring and W. Chen
`
`9. Sequencing reads are never uniformly distributed throughout
`the genome, and any statistical analysis assuming this is inac-
`curate. Some parts of the genome usually are covered by much
`more reads than expected, whereas some other parts are not
`sequenced at all. The experimenter should be aware of this fact,
`for example, when planning the required read coverage for
`variant detection. Moreover, this effect certainly impacts quan-
`titative measurements such as ChIP-seq or RNA-seq. ChIP-
`seq assays, for example, should always include a control library
`(see Subheading 2.5, step 1), and in a RNA-seq experiment, it
`is easier to compare expression levels of the same gene in
`different circumstances rather than the expression level of
`different genes in the same sample.
`
`10. Note that the actual sequenced mature miRNA could be
`shifted by some nucleotides compared to the annotation in
`the miRNA databases.
`
`11. One problem of this normalization method is that sometimes
`few miRNAs get very high read counts, which means that any
`change of their expression level could affect the read counts of
`all other miRNAs. In some cases, a more elaborated normali-
`zation method could therefore be necessary.
`
`12. Most tools for analyzing ChIP-seq data focus on finding
`punctuate binding sites (peaks) typical for transcription fac-
`tors. For ChIP-seq experiments targeting broader binding
`proteins,
`like polymerases or histone marks
`such as
`H3K36me3, use a tool that can also find larger enriched
`regions. In order to precisely identify protein binding sites,
`it is often necessary to determine the average length of the
`sequenced fragments. Some ChIP-seq data analysis tools esti-
`mate the fragment length from the sequencing data. Keep in
`mind that this is not trivial, because ChIP-seq data usually
`consist of single-end sequencing reads. Therefore, always
`check whether the estimated length is plausible according to
`the experimental design.
`
`00008
`
`

`

`16
`
`An Overview of the Analysis of Next Generation Sequencing Data
`
`257
`
`References
`
`1. Shendure J, Ji H (2008) Next-generation
`DNA sequencing. Nature Biotechnology
`26:1135–1145
`2. Medvedev P, Stanciu M, Brudno M (2009)
`Computational methods
`for discovering
`structural
`variation with next-generation
`sequencing. Nature Methods 6:S13-S20
`3. Mortazavi A, Williams BA, McCue K et al
`(2008) Mapping and quantifying mammalian
`transcriptomes by RNA-Seq. Nature Methods
`5:621–628
`4. Johnson DS, Mortazavi A, Myers RM et al
`(2007) Genome-Wide Mapping of in Vivo
`Protein-DNA Interactions.
`Science 316
`(5830):1497–1502
`5. Fu Y, Peckham HE, McLaughlin SF et al.
`SOLiD Sequencing and 2-Base Encoding.
`http://appliedbiosystems.com
`6. Flicek P, Birney E (2009) Sense from
`sequence reads: methods for alignment and
`assembly. Nature Methods 6:S6-S12
`7. UCSC Genome Bioinformatics. Frequently
`Asked Questions: Data
`File
`Formats.
`http://genome.ucsc.edu/FAQ/FAQfor-
`mat.html
`8. Sequence Alignment/Map (SAM) Format.
`http://samtools.sourceforge.net/SAM1.pdf
`
`9. Korbel JO, Urban AE, Affourtit JP et al.
`(2007)
`Paired-End Mapping
`Reveals
`Extensive Structural Variation in the Human
`Genome. Science 318 (5849):420–426
`10. Pepke S, Wold B, Mortazavi A (2009) Com-
`putation for ChIP-seq and RNA-seq studies.
`Nature Methods 6:S22-S32
`11. Haas BJ, Zody MC (2010) Advancing RNA-seq
`analysis. Nature Biotechnology 28:421–423
`12. Griffiths-Jones S, Grocock RJ, van Dongen S
`et al (2006) miRBase: microRNA sequences,
`targets and gene nomenclature. Nucleic Acids
`Research 34:D140-D144. http://microrna.
`sanger.ac.uk
`13. Friedl€ander MR, Chen W, Adamidi C et al
`(2008) Discovering microRNAs from deep
`sequencing data using miRDeep. Nature Bio-
`technology 26:407–415
`14. dbSNP.
`http://www.ncbi.nlm.nih.gov/
`projects/SNP
`15. Database of Genomic Variants. http://
`projects.tcag.ca/variation
`16. Handelsman J, Rondon MR, Brady SF et al
`(1998) Molecular biological access to the
`chemistry of unknown soil microbes: a new
`frontier for natural products. Chemistry &
`Biology 5:245–249
`
`00009
`
`

`

`TEV ellieTCR)
`
`AikChoonTan
`
`Junbai Wang
`
`ite
`
`00010
`
`00010
`
`

`

`Next Generation Microarray
`Bioinformatics
`
`Methods and Protocols
`
`Edited by
`Junbai Wang
`
`Department of Pathology, Oslo University Hospital, Radium Hospital, Montebello, Oslo, Norway
`Aik Choon Tan
`
`Division of Medical Oncology, Department of Medicine, School of Medicine,
`University of Colorado Anschutz Medical Campus, Aurora, CO, USA
`Tianhai Tian
`
`School of Mathematical Sciences, Monash University, Melbourne, VIC, Australia
`
`00011
`
`

`

`Editors
`Junbai Wang, Ph.D.
`Department of Pathology
`Oslo University Hospital
`Radium Hospital
`Montebello, Oslo, Norway
`junbai.wang@rr-research.no
`
`Tianhai Tian, Ph.D.
`School of Mathematical Sciences
`Monash University
`Melbourne, VIC, Australia
`tianhai.tian@monash.edu
`
`Aik Choon Tan, Ph.D.
`Division of Medical Oncology
`Department of Medicine School of Medicine
`University of Colorado Anschutz Medical Campus
`Aurora, CO, USA
`aikchoon.tan@ucdenver.edu
`
`ISSN 1064-3745
`ISBN 978-1-61779-399-8
`DOI 10.1007/978-1-61779-400-1
`Springer New York Dordrecht Heidelberg London
`
`e-ISSN 1940-6029
`e-ISBN 978-1-61779-400-1
`
`Library of Congress Control Number: 2011943561
`ª Springer Science+Business Media, LLC 2012
`All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the
`publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013,
`USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of
`information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
`now known or hereafter developed is forbidden.
`The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified
`as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
`
`Printed on acid-free paper
`
`Humana Press is part of Springer Science+Business Media (www.springer.com)
`
`00012
`
`

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket