`Downloaded from on April 19, 2018 - Published by genome.cshlp.org
`
`
`
`Cold Spring Harbor Laboratory Press
`
`LETTER
`
`Predicting Gene Regulatory Elements in Silico
`on a Genomic Scale
`Alvis Brazma,1 Inge Jonassen,2 Jaak Vilo,3,4 and Esko Ukkonen3
`1European Molecular Biology Laboratory (EMBL) Outstation–Hinxton, European Bioinformatics Institute,
`Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; 2Department of Informatics,
`University of Bergen, Høyteknologisenteret, N5020 Bergen, Norway; 3Department of Computer Science,
`FIN-00014 University of Helsinki, Helsinki, Finland
`
`We performed a systematic analysis of gene upstream regions in the yeast genome for occurrences of regular
`expression-type patterns with the goal of identifying potential regulatory elements. To achieve this goal, we
`have developed a new sequence pattern discovery algorithm that searches exhaustively for a priori unknown
`regular expression-type patterns that are over-represented in a given set of sequences. We applied the algorithm
`in two cases, (1) discovery of patterns in the complete set of >6000 sequences taken upstream of the putative
`yeast genes and (2) discovery of patterns in the regions upstream of the genes with similar expression profiles.
`In the first case, we looked for patterns that occur more frequently in the gene upstream regions than in the
`genome overall. In the second case, first we clustered the upstream regions of all the genes by similarity of their
`expression profiles on the basis of publicly available gene expression data and then looked for sequence patterns
`that are over-represented in each cluster. In both cases we considered each pattern that occurred at least in
`some minimum number of sequences, and rated them on the basis of their over-representation. Among the
`highest rating patterns, most have matches to substrings in known yeast transcription factor-binding sites.
`Moreover, several of them are known to be relevant to the expression of the genes from the respective clusters.
`Experiments on simulated data show that the majority of the discovered patterns are not expected to occur by
`chance.
`
`Completely sequenced genomes, together with the
`emerging DNA microarray technologies enabling
`the measurement of the gene expression levels in
`cell cultures (Schena et al. 1995; for a survey, see
`Ramsay 1998), are opening new possibilities for
`studying gene regulation. The sequencing of the
`first eukaryotic genome (the yeast Saccharomyces cer-
`evisiae) was completed in 1996 (Goffeau et al. 1996;
`Mewes et al. 1997). During 1997, data about the
`expression levels of almost all of the ∼6000 yeast
`genes were obtained (DeRisi et al. 1997; Velculescu
`et al. 1997; Wodicka et al. 1997). In particular,
`DeRisi et al. (1997) measured the relative expression
`levels of the yeast genes at seven consecutive time
`points (in 2-hr intervals) during a shift from anaero-
`bic to aerobic metabolism (diauxic shift). They
`showed that some of the genes that are known to be
`involved in metabolic pathways related to the di-
`auxic shift underwent a very significant change in
`their expression level during the shift. By treating
`the expression measurements as a time series, it is
`
`4Corresponding author.
`E-MAIL vilo@cs.helsinki.fi; FAX 358 9 708 44441.
`
`possible to cluster genes according to similarities in
`their expression profiles. It may be hypothesized
`that at least some of the genes in a cluster are regu-
`lated by similar mechanisms.
`The transcription regulation mechanisms in eu-
`karyotic genomes are not well understood. Evi-
`dently, however, an essential role is played by tran-
`scription factors, which can bind to particular DNA
`sequences, called transcription factor-binding sites,
`believed to be about 5–25 bp long. In yeast, these
`sites are usually within several hundred base pairs
`upstream of the respective ORFs (Mellor 1993).
`Regular expression type patterns, as well as
`nucleotide distribution matrices, have both been
`used for describing transcription factor-binding
`sites (e.g., see Bucher 1990; Ghosh 1990; Chen et al.
`1995; Wingender et al. 1996). Inference of such de-
`scriptions from the sequences that are assumed to
`contain a site for a particular transcription factor is
`a difficult problem as the consensus of the different
`binding sites of the same transcription factor is of-
`ten rather weak. Algorithms have been proposed for
`inferring such descriptions from sets of relatively
`small number of sequences (about 20) in which all
`
`
`8:1202–1215 ©1998 by Cold Spring Harbor Laboratory Press ISSN 1054-9803/98 $5.00; www.genome.org
`or almost all of the sequences are known to contain
`the site for the respective transcription factor (e.g.,
`see Stormo and Hartzell 1989; Wolfertstetter et al.
`1996; van Helden et al. 1998). More recently, van
`Helden et al. (1998) and Yada et al. (1998) have
`proposed methods for the discovery of putative
`transcription factor-binding sites from larger data
`sets. Yada et al. (1998) applied their method to ana-
`lyze about 400 human promoter sequences.
`Apparently, an even more difficult problem is
`identifying potential binding sites or other regula-
`tory elements from sets of sequences only suspected
`to contain such elements. In this report, we con-
`sider the case when only a small portion of the se-
`quences in the given set may actually contain a
`common regulatory element, and the total number
`of sequences may be up to thousands. In this set-
`ting, it may not be possible to infer precise binding
`site descriptions; still, if the number of sequences
`containing a common regulatory element is larger
`than would be expected by chance, it may be pos-
`sible to obtain hints about sequence properties of
`such an element and in which particular sequences
`it may be present.
`An obvious difficulty in attacking this problem
`is the computational complexity of the algorithmic
`problem of discovering interesting sequence pat-
`terns in a large collection of sequences only some of
`which may contain a common pattern. Ultimately
`the results of such discoveries should be taken as
`predictions that must be verified by independent,
`that is, wet biology, means. Still, some validation
`can be obtained by comparing the discovered site
`descriptions to the transcription factor database en-
`tries, or by statistical means by comparing the dis-
`tribution of the discovered patterns to the distribu-
`tion in simulated data.
`Pattern discovery methods basically fall into
`two groups: sequence-driven and pattern-driven
`methods (for a survey, see Brazma et al. 1998a,b).
`Algorithms in the first group normally work by
`combining the results of pairwise sequence com-
`parisons to form patterns that match the subsets of
`the sequences. These algorithms are too slow to find
`patterns that occur in arbitrarily sized subsets of
`thousands of sequences. Pattern-driven algorithms
`work by enumerating or searching a predefined pat-
`tern class to find patterns and their occurrence fre-
`quencies. In these methods, one needs a very fast
`method for locating all matches of each pattern
`from the search space. Special data structures and
`pattern occurrence lists have been used for this pur-
`pose, but the methods have been limited to the
`analysis of smaller data sets.
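The pattern-driven idea can be illustrated with a deliberately naive sketch that enumerates plain substrings and counts how many sequences contain each one. This is not the algorithm developed in this work; the function names and the `max_len` and `min_support` parameters are assumptions made for illustration, and the approach shown would not scale to thousands of sequences without the specialized data structures mentioned above.

```python
from collections import defaultdict


def substring_support(sequences, max_len):
    """Count, for every substring of length 1..max_len, how many sequences contain it."""
    support = defaultdict(int)
    for seq in sequences:
        seen = set()  # each sequence contributes at most once per substring
        for start in range(len(seq)):
            for length in range(1, min(max_len, len(seq) - start) + 1):
                seen.add(seq[start:start + length])
        for word in seen:
            support[word] += 1
    return dict(support)


def frequent_substrings(sequences, max_len, min_support):
    """Keep only substrings present in at least min_support sequences."""
    counts = substring_support(sequences, max_len)
    return {word: n for word, n in counts.items() if n >= min_support}
```

Counting sequences rather than total occurrences matches the way pattern frequencies are reported in this paper (number of regions in which a pattern is present).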
`
`We have developed a new, more powerful, pat-
`tern discovery algorithm that is able to discover
`various subclasses of regular expression type pat-
`terns of unlimited length common to as few as ten
`sequences from thousands. We used this algorithm
`for predicting regulatory elements from gene up-
`stream regions in the yeast S. cerevisiae.
`We considered two cases. First, we looked for
`patterns that occur more frequently in the gene up-
`stream regions than in randomly chosen regions in
`the yeast genome. For each pattern present in at
`least 10 sequences (from >12,000), we calculated a
`score equal to the number of upstream
`regions that contain the pattern divided by the
`number of random regions (of the same length and
`number) that contain the pattern, and rated the pat-
`terns according to this ratio.
`In the second case, we used information from
`the yeast genome expression data (DeRisi et al.
`1997) to cluster the genes according to their expres-
`sion profiles. After clustering the upstream regions
`(treating the expression measurements as time se-
`ries) we selected characteristic clusters according to
`some rigorous criteria. We hypothesized that some
`of the genes in a cluster may contain binding sites
`for the same transcription factors or other common
`regulatory elements. We used our algorithm to look
`for patterns that are over-represented in each cluster
`as compared with other upstream regions.
`We systematically compared the high-scoring
`patterns that we discovered to the transcription fac-
`tor-binding site descriptions for the yeast in TRANS-
`FAC database (Wingender et al. 1996). We found
`that most of the discovered patterns (both from the
`total set of upstream regions and from the clusters)
`have matches to substrings of genome regions that
`contain transcription factor-binding sites. We also
`compared the distribution of patterns present in up-
`stream regions to the distribution of the patterns
`that can be discovered in random regions of the
`genome and showed that the distributions are
`rather different. The comparison with the TRANS-
`FAC database as well as the overall statistics of the
`discovered patterns suggest that many of the discov-
`ered patterns can be important for the expression
`profile of the particular clusters of genes or for the
`transcription or translation initiation in general.
`RESULTS
`First, we describe the pattern discovery in the com-
`plete set of yeast gene upstream regions, then the
`clustering of the yeast gene expression data, and
`finally, the results obtained by pattern discovery
`from within the subsets of upstream regions of
`genes sharing similar expression profiles.
`We considered three different types of patterns:
`(P1) substring patterns (i.e., words in the alphabet A,
`T, G, C); (P2) substring patterns with wild cards (of
`fixed length); and (P3) patterns with character
`groups [such patterns can be represented as words
`over IUPAC code (Cornish-Bowden 1984) characters;
`here we will use a more explicit notation].
`We denote wild-card positions by a dot in the
`pattern (e.g., TA.A), and group positions by listing
`all possible characters in square brackets (e.g.,
`T[AT]A). A wild-card position is equivalent to the
`group position [ATCG], that is, all characters are
`allowed. For instance, pattern A[TG].C matches all
`strings that contain a substring beginning with A,
`followed by either T or G, followed by any character, followed
`by C. In practice, for reasons of efficiency, we restrict
`ourselves to various subclasses of these pattern
`classes (e.g., limiting the number of possible wild
`cards or group symbols). The implementation of the
`algorithm, results, data, and additional images are
`available on the World Wide Web at
`http://www.cs.Helsinki.FI/~vilo/Yeast/.
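The dot and square-bracket notation described above happens to coincide with ordinary regular-expression syntax, so a pattern such as A[TG].C can be tested with a standard regex engine. The following is a minimal sketch under that assumption; the helper names `compile_pattern` and `count_matching_sequences` are hypothetical and are not part of the implementation referred to in the text. Note that in a regex the dot also matches characters outside A, T, G, C, which is harmless only if the sequences contain just those four letters.

```python
import re

# Accept only the notation used in the text: bases, '.', and [..] groups of bases.
_VALID = re.compile(r"(?:[ACGT.]|\[[ACGT]+\])+")


def compile_pattern(pattern):
    """Validate a pattern in the paper's notation and compile it as a regex."""
    if not _VALID.fullmatch(pattern):
        raise ValueError(f"not a valid pattern: {pattern!r}")
    return re.compile(pattern)


def count_matching_sequences(pattern, sequences):
    """Number of sequences that contain at least one match of the pattern."""
    regex = compile_pattern(pattern)
    return sum(1 for seq in sequences if regex.search(seq))
```

For example, `count_matching_sequences("A[TG].C", ["GGATCCA", "AACC", "AGAC"])` counts the first and third sequences (via ATCC and AGAC) but not the second.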
`
`Discovering Patterns from the Total Set
`of Upstream Regions
`We extracted upstream regions relative to all ORFs,
`as annotated in the MIPS Yeast genome database
`(Mewes et al. 1997). Concretely, we extracted seven
`sets of upstream regions of length 100 from the po-
`sitions −100 to 0, −150 to −50, −200 to −100,
`−250 to −150, −300 to −200, −350 to −250, and
`−400 to −300, a set of regions of length 300 from
`positions −300 to 0, and a set of regions of length
`600 from positions −600 to 0 (all positions are rela-
`tive to the start codon of the ORF; see Methods).
`Also we extracted two sets of sequences of the same
`number and length from randomly selected loca-
`tions of the same chromosome. These sets of ran-
`dom regions were used as random samples of the
`yeast genome sequences (the nucleotide and di-
`nucleotide distribution in the random regions re-
`flected that in the genome in general) (1) to com-
`pare the upstream regions to random regions for
`identifying patterns that are more frequent in up-
`stream regions than in the genome in general and
`(2) to compare the two random sets against each
`other for testing whether the pattern occurrence sta-
`tistics resulting from the comparison of upstream
`and random regions can be explained by chance.
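A minimal sketch of the two extraction steps, assuming the chromosome is held as a plain string and that the ORF lies on the strand as given (an ORF on the opposite strand would additionally need reverse complementing, which is omitted here). The names `WINDOWS_100`, `upstream_region`, and `random_regions` are hypothetical. The uniform sampling below does not by itself guarantee anything about nucleotide or dinucleotide composition.

```python
import random

# The seven 100-bp windows named in the text, as (from, to) offsets
# relative to the start codon.
WINDOWS_100 = [(-100, 0), (-150, -50), (-200, -100), (-250, -150),
               (-300, -200), (-350, -250), (-400, -300)]


def upstream_region(chromosome, start_codon, lo, hi):
    """Return the window [start_codon+lo, start_codon+hi) or None if out of range."""
    begin, end = start_codon + lo, start_codon + hi
    if begin < 0 or end > len(chromosome):
        return None  # window falls off the chromosome
    return chromosome[begin:end]


def random_regions(chromosome, count, length, rng=random):
    """Draw `count` regions of the given length from uniformly random positions."""
    last_start = len(chromosome) - length
    return [chromosome[s:s + length]
            for s in (rng.randrange(last_start + 1) for _ in range(count))]
```

Passing a seeded `random.Random` instance as `rng` makes the random sample reproducible, which is convenient when two independent random sets are to be compared against each other.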
`We analyzed these data sets for occurrences of
`
`patterns. We presented each pattern that occurred
`at least 10 times in upstream or random regions as a
`dot in a two-dimensional plot (see Fig. 1, left col-
`umn). The vertical axis shows the number of up-
`stream regions, and the horizontal axis the number
`of random regions, where the pattern is present.
`
`Figure 1 The distribution of all patterns (of unre-
`stricted length) with at most one wild-card symbol in
`the regions −250 to −150 (upstream from the ORFs)
`and randomly chosen genomic regions of length 100
`bp. Dots in graphs on the left correspond to patterns
`that occur in x sequences from the random regions
`(along horizontal axis) and y sequences from the up-
`stream regions (vertical axis). In graphs on the right,
`the upstream regions are replaced by another set of
`random regions; therefore, these plots show the ex-
`pected statistics if the regions are chosen at random.
`(Top row) All patterns with at least 10 occurrences.
`(Second row) Subset of top row with all patterns con-
`taining at least two characters C or G and not contain-
`ing any of the substrings AAAA, TTTT, ATAT, or TATA.
`(Bottom two rows) Same plots as in the first two rows,
`but only including patterns with at most 200 occur-
`rences in upstream or random regions (i.e., zoomed to
`the lower left corner).
`
`Hence a dot in plot location (x, y) indicates that
`there is a pattern that occurs in x random regions
`and y upstream regions. The patterns deviating from
`the diagonal, and particularly, being above the di-
`agonal, are the ones that can distinguish the up-
`stream regions from the random regions (and,
`therefore, are likely to distinguish the upstream re-
`gions from the genome in general), in contrast to
`the patterns that fall close to the diagonal and thus
`occur with the same frequency in upstream and ran-
`dom regions. The dots farthest above the diagonal
`correspond to the patterns that are potential candi-
`dates for regulatory elements. For each pattern we
`calculated a score as defined by equation (2) in
`Methods, which is essentially the number of occur-
`rences in the upstream regions divided by the sum
`of the number of occurrences in the random regions
`and a correcting constant.
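The verbal description of the score can be written out as a one-line function. Equation (2) itself is given in Methods and is not reproduced here; in particular, the value 4.66 used below for the correcting constant is an assumption on our part, chosen only because it reproduces the scores listed in Table 1 (for instance, 37 upstream and 1 random match give 6.54). The function name is hypothetical.

```python
def overrepresentation_score(n_upstream, n_random, correction=4.66):
    """Occurrences in upstream regions divided by (occurrences in random regions + constant).

    `correction` is an assumed value inferred from Table 1, not taken from equation (2).
    """
    return n_upstream / (n_random + correction)
```

The constant keeps the score finite for patterns that never occur in the random regions and damps the score of patterns supported by only a handful of sequences.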
`A control experiment (right column in Fig. 1)
`was done to estimate whether the difference in pat-
`tern frequencies observed for upstream versus ran-
`dom sequence segments could be explained by
`chance. In the control experiments, we compared
`two sets of random regions. The pattern occurrence
`statistics obtained when comparing the upstream
`regions to the random regions is rather different
`from the statistics obtained when comparing two
`sets of random regions. We also tested that this con-
`siderable difference can be explained neither by
`higher AT content in the upstream regions, nor by
`poly(A), poly(T), or poly(AT) patterns. To achieve
`this goal, we plotted the patterns containing at least
`two characters C or G and not containing any of the
`substrings AAAA, TTTT, ATAT, or TATA. The differ-
`ence between the plots remained essentially as
`strong (see Fig. 1). Therefore, we conclude that the
`distribution of patterns in the upstream regions dif-
`fers from the distribution in random regions. In particular,
`there are some specific patterns that occur consid-
`erably more often in upstream regions than in ran-
`dom regions.
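The composition filter used in this control is simple enough to state as code. The sketch below applies it to a pattern written in the notation introduced earlier; the function and constant names are hypothetical, and wild-card positions are simply not counted as C or G.

```python
# Substrings whose presence marks a pattern as poly(A), poly(T), or poly(AT)-like.
LOW_COMPLEXITY = ("AAAA", "TTTT", "ATAT", "TATA")


def passes_composition_filter(pattern):
    """True if the pattern has at least two C/G characters and no low-complexity substring."""
    gc_count = sum(pattern.count(base) for base in "CG")
    return gc_count >= 2 and not any(s in pattern for s in LOW_COMPLEXITY)
```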
`The best distinction (as judged by visual inspec-
`tion) between upstream and random regions by sub-
`string patterns was achieved for upstream regions of
`length 100 when counting matches only on the
`gene’s strand. [The use of only one strand can be
`justified because of the very distinct distribution of
`different bases in a region of 300 bp upstream from
`the start of the gene (see Fig. 3, below, in Methods).]
`Similar differences were observed for all considered
`lengths and region relative positions. We also ex-
`perimented with the three sets of sequences of
`length 600 and 300 bp, analyzing substring patterns
`on either strand; and the sequences of length 100,
`
`analyzing the patterns that contain up to one wild
`card. Some results for patterns with at most one
`wild-card symbol from regions of length 100 bp at
`upstream positions −250 to −150 are shown in Fig-
`ure 1.
`Many of the top-scoring patterns, particularly
`for the region −250 to −150, are effectively poly(T)
`sequences. Still, as mentioned above, these trivial
`poly(T) patterns cannot explain the differences in
`the pattern occurrence statistics compared with ran-
`dom genomic regions; therefore, overall, the pat-
`terns not containing poly(T) sequences are signifi-
`cant. We removed from the list of discovered pat-
`terns the ones that contain the substrings TTTT or
`AAAA, and additionally the patterns ending in a
`wild card; we call the remaining patterns nontrivial.
`The 20 highest scoring nontrivial patterns are given
`in Table 1 (the numbering of the patterns refers to
`the total list of patterns, including the trivial ones).
`We compared the groups of highest scoring
`nontrivial patterns from each of the seven regions
`of length 100 bp at various distances from the re-
`spective ORFs. We used the program Pratt (Jonassen
`1997) to try to find patterns that would be a con-
`sensus for a substantial number of patterns for each
`group. More concretely, we took the 20 highest
`scoring patterns and used Pratt to discover patterns
`matching at least 6 patterns. It turned out that the
`highest scoring pattern groups have a relatively good
`consensus for only three regions: the region −150 to
`−50 has the consensus pattern GATG.G.T; the region
`−200 to −100 has two consensus patterns, T.ACCCG
`and CGGGT.A, which are mutually symmetric (each is
`the reverse complement of the other); and the region
`−250 to −150 has the consensus ACCCG (note
`that it is a subpattern of T.ACCCG). No significant
`consensus patterns have been found for other re-
`gions.
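The symmetry between T.ACCCG and CGGGT.A can be checked mechanically: reversing a pattern and complementing each position maps one onto the other. The following is a small sketch, assuming patterns written in the dot and square-bracket notation introduced earlier; the function name is hypothetical.

```python
import re

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", ".": "."}
# One token is either a [..] group of bases or a single base / wild card.
TOKEN = re.compile(r"\[[ACGT]+\]|[ACGT.]")


def reverse_complement(pattern):
    """Reverse-complement a pattern, treating '.' and [..] groups as single positions."""
    tokens = TOKEN.findall(pattern)
    if "".join(tokens) != pattern:
        raise ValueError(f"not a valid pattern: {pattern!r}")
    out = []
    for tok in reversed(tokens):
        if tok.startswith("["):
            members = sorted(COMPLEMENT[c] for c in tok[1:-1])
            out.append("[" + "".join(members) + "]")
        else:
            out.append(COMPLEMENT[tok])
    return "".join(out)
```

With this definition, `reverse_complement("T.ACCCG")` yields `CGGGT.A`, and applying the function twice returns the original pattern.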
`We also matched the 50 highest scoring non-
`trivial patterns for each of the regions against all the
`transcription factor-binding site descriptions given
`in the TRANSFAC (Wingender et al. 1996) database
`for the yeast. The results of the exact matches are
`given in Table 2 (by an exact match, we mean
`that the discovered pattern exactly matched a sub-
`string in the binding site description). Note that al-
`though the highest scoring patterns from neighbor-
`ing regions are not necessarily similar themselves,
`the number of coinciding binding sites (from
`TRANSFAC) matched by patterns from two regions
`shows a considerable correlation with the distance
`between the positions of the regions.
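The substring-sense matching against binding site descriptions can be sketched as follows, using the (+) and (.) marks of Table 2 for a match to the site as given and a match only to its reverse complement, respectively. This is an illustration under simplifying assumptions: the site is taken to be a plain string over A, C, G, T (real TRANSFAC entries may contain other symbols), the function name is hypothetical, and the site strings in the usage example are invented, not taken from TRANSFAC.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def site_match_mark(pattern, site):
    """Return '+' for a match in the site, '.' for a match only in its reverse
    complement, and '' if the pattern matches neither."""
    if re.search(pattern, site):
        return "+"
    if re.search(pattern, site.translate(_COMPLEMENT)[::-1]):
        return "."
    return ""
```

For instance, with the invented site strings `TTACCCGCA` and `TGCGGGTAA`, the pattern `ACCCG` is marked `+` for the first and `.` for the second.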
`The complete list of the discovered patterns is
`available on the World Wide Web.
`
`
`Table 1. Highest Scoring Nontrivial Patterns with (at Most) One Wild-Card Symbol
`
`No.(a)  Pattern       Score(b)  N+(c)  N−(d)
`
`A. Regions −100..0
`  2     AAG.AAACAAA   6.54       37      1
`  6     A.TAAGAACA    5.79       27      0
`  8     A.AATAGGA     5.61       43      3
`  9     AAGAAA.CAAA   5.58       26      0
` 12     GTAACAA.C     5.36       25      0
` 13     AAA.AACTTA    5.36       25      0
` 20     ACAAC.TAA     5.09       39      3
` 21     AG.AAACAAA    5.06       64      8
` 23     ACAAACAA.A    4.97       48      5
` 26     AATAGTA.A     4.92       77     11
` 32     AATAGTATA     4.77       27      1
` 34     TCACTAC.T     4.72       22      0
` 35     CAAACA.ACA    4.72       22      0
` 37     ACA.ATAGA     4.72       55      7
` 42     AGAGA.ATA     4.63       54      7
` 47     AATAAACAA.A   4.59       26      1
` 50     AAAG.ACAAG    4.57       35      3
` 52     CTAAGAA.A     4.55       53      7
` 56     A.AAGGGAAG    4.51       21      0
` 57     CAAA.TAAC     4.50       48      6
`
`B. Regions −250..−150
` 14     TTACCCGC      6.22       29      0
` 58     GT.ACCCG      5.59       54      5
` 71     T.ACCCGC      5.48       42      3
`126     CGGGTA.T      5.06       64      8
`141     G.TACCCG      4.97       48      5
`165     CGGGTAA.A     4.87       47      5
`178     GTTACCCG      4.83       37      3
`305     TACAT.TATA    4.43       65     10
`353     TTTCTC.TTT    4.32       46      6
`372     TTACCCG       4.30      119     23
`379     TTTCCTGT.T    4.29       20      0
`405     CTCATCTC.T    4.24       24      1
`425     TCACGTGA      4.20       28      2
`427     T.ATATATTC    4.20       28      2
`454     CGGGTAA       4.12      114     23
`460     TGTGT.GAT     4.08       19      0
`465     ATTACCCG.A    4.08       19      0
`474     G.ACATATAT    4.06       23      1
`485     TA.GTAAAC     4.05       27      2
`500     TTTCTCT.TT    4.03       47      7
`
`Matches were only allowed on the W (gene) strand.
`(a) No. of the pattern, enumerating them by decreasing score
`(before trivial patterns were removed).
`(b) From equation 2.
`(c) No. of upstream regions matching the pattern.
`(d) No. of random sequences matching the pattern.
`
`Clustering the Gene Expression Data
`DeRisi et al. (1997) studied the relative expression
`rate changes of yeast genes during the diauxic shift.
`They inoculated yeast cells from an exponentially
`growing yeast culture into fresh medium and after
`some initial period, harvested samples at seven 2-hr
`intervals, isolated their mRNA, and prepared fluo-
`rescently labeled cDNA. Two different fluorescent
`moieties were used—one for cells harvested in each
`of the successive time points, the other for refer-
`ence, from cells harvested at the first time point.
`The cDNAs from each time point, together with the
`reference cDNA were hybridized to the microarray
`with ∼6400 DNA sequences representing ORFs of
`the yeast genome. Measurement of the relative fluo-
`rescence intensity for each of the ∼6400 × 7 ele-
`ments reflects the relative abundance of the corre-
`sponding mRNA in each cell population. The mea-
`surement data are available on the Internet.
`We used the data from these yeast gene expres-
`sion studies (DeRisi et al. 1997) and clustered all the
`genes by similarities in their expression profiles in
`several alternative ways. To achieve this goal, we
`developed and implemented a simple algorithm
`based on discretizing the time series of the measure-
`ment space into a simplified form and then cluster-
`ing these simple time series. Some rigorous selection
`criteria were used for defining good clusters (for de-
`tails, see Methods). This produced 32 different clus-
`ters containing from 10 to 77 ORFs each and 11
`clusters containing at least 25 ORFs (see Table 3).
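The general idea of discretizing expression time series and grouping genes by the resulting simplified profile can be sketched as below. The actual discretization, its parameters, and the selection criteria are defined in Methods and are not reproduced here; everything in this sketch (the use of log2 changes between consecutive time points, the three-level coding, the `threshold` and `min_size` parameters, and the function names) is an assumption made purely for illustration.

```python
import math
from collections import defaultdict


def discretize(profile, threshold=1.0):
    """Code each change between consecutive time points as +1, -1, or 0.

    `profile` holds positive relative expression levels; a change counts as
    up/down when its log2 ratio reaches the (assumed) threshold.
    """
    symbols = []
    for previous, current in zip(profile, profile[1:]):
        change = math.log2(current / previous)
        if change >= threshold:
            symbols.append(1)
        elif change <= -threshold:
            symbols.append(-1)
        else:
            symbols.append(0)
    return tuple(symbols)


def cluster_by_shape(profiles, min_size=10, threshold=1.0):
    """Group ORFs with identical discretized profiles; keep groups of at least min_size."""
    groups = defaultdict(list)
    for orf, profile in profiles.items():
        groups[discretize(profile, threshold)].append(orf)
    return {shape: orfs for shape, orfs in groups.items() if len(orfs) >= min_size}
```

Varying the threshold yields alternative, overlapping groupings of the same genes, which is one simple way to obtain clusterings "in several alternative ways".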
`The most significant changes in gene expres-
`sion rates during the diauxic shift occurred during
`the last two time points. This significance is re-
`flected in the clusters that we obtained (although
`some fluctuations at earlier time points occur for
`smaller groups of genes, which may be due to
`noise). Many of the constructed clusters strongly
`overlap. Of the 11 clusters of at least 25 ORFs
`each, 8 show an increasing expression level at
`time point 6, 2 a decreasing level, and 1 an
`unchanged level.
`
`Discovering Patterns from the Gene Clusters
`We studied whether clusters of genes with similar
`expression profiles can help to discover sequence
`patterns putatively describing transcription factor-
`binding sites. For each cluster, we compared the cor-
`responding upstream regions of length 300 bp
`against all other upstream regions. The algorithm
`was used to find the highest scoring patterns con-
`taining up to three wild cards. The patterns were
`
`
`Table 2. Matches to TRANSFAC Binding Sites for the 50 Best Patterns Found for Each 100-bp
`Upstream Region
`
`Columns (one per 100-bp upstream region): −100, −150, −200, −250, −300, −350, −400
`
`Rows (TRANSFAC binding site identifiers):
`Y$ARS1_03, Y$ARSH4_02, Y$CAR1_01, Y$CAR2_01, Y$CDC2_01, Y$CDC9_01,
`Y$CEN12_01, Y$CEN6_01, Y$CENIV_01, Y$CFES_01, Y$CHA1_04, Y$CSVIII_02,
`Y$CTA1_01, Y$CYC1_12, Y$CYC1_14, Y$DDR2_02, Y$G3PDH_01, Y$GAL1_03,
`Y$GAL1_04, Y$GAL1_06, Y$GAL1_14, Y$GAL2_03, Y$HO_06, Y$HO_07,
`Y$ICL1_01, Y$MAL61_02, Y$MES1_01, Y$PDC1_02, Y$PGK_01, Y$PHO8_02,
`Y$POX1_01, Y$RAP_01, Y$RP51A_01, Y$RPL16A_01, Y$RRNA_01, Y$RRNA_02,
`Y$STE6_02, Y$SUC2_02, Y$TEF2_01, Y$TOP2_01, Y$TRP1_01, Y$TRP5_01,
`Y$X40_01, Y$Y30_01
`
`[Table body: (+) and (.) marks per site and region; the cell positions are not recoverable here.]
`
`For each 100-bp region starting at the seven different positions upstream from ORFs, the 50 highest scoring nontrivial patterns were
`matched (in substring sense) against the yeast transcription factor binding sites as given in the TRANSFAC (Wingender et al. 1996)
`database. The first column gives the binding site identifier in TRANSFAC that is matched by one of the best patterns from any of these
`sets.
`(+) At least one of the respective patterns matches exactly the corresponding TRANSFAC site.
`(.) A pattern matches only the reverse-complement of the TRANSFAC site.
`
`Table 3. Summary Information about Pattern Scores in the Clusters and Random Sets
`
`Cluster name           No. of genes   Score range for the    Score for the best pattern
`                                      10 best patterns       in the resp. random set
`Cr(3,4)(000010)        77             3.80–2.80              1.70
`Cr(5,2,4)(000020)      55             3.99–2.96              1.94
`C(5,2,4)(0000021)      41             3.77–2.95              2.12
`C(5,2,4)(0000022)      38             7.15–4.11              2.09
`Cr(3,5)(000010)        38             3.50–3.17              2.73
`C(5,2,4)(0000012)      37             2.87–2.52              2.42
`Cr(5,3,5)(000020)      37             3.60–3.08              2.12
`Cr(5,2,3)(00002-1)     25             4.00–3.89              3.86
`Cr(3,3)(000001)        26             3.55–3.29              4.33
`C(5,2,6)(00000-1-2)    27             3.69–3.21              3.13
`C(5,3,6)(00000-1-2)    25             4.00–3.89              3.86
`
`For explanation of the cluster names, see Methods. The first eight clusters consist of genes whose expression level increases at
`time point 6; the last two consist of genes whose expression level decreases at time point 6. The statistics include all patterns (trivial
`variants were not removed).
`
`matched on either strand and ranked by the score
`given by equation 1 in Methods.
`To evaluate the overall significance of the re-
`sult, we picked for each cluster a random subset (of
`the same size) of upstream regions from the total set
`of genes, and analyzed this set in exactly
`the same way as the cluster. We found
`that, for 10 clusters out of 11 containing
`at least 25 sequences and for all clusters
`containing at least 30 sequences, the
`scores of the best patterns found from
`clusters are higher than those of the best pat-
`terns found from the randomly picked
`sets (see Table 3; Fig. 2).
`The largest clusters (>30 sequences)
`correspond to the expression profiles
`with an increase in the expression level
`at time point 6, and, for each of these
`clusters, high-scoring patterns con-
`taining the substring CCCC are found
`[CCCC has score 1.9 for cluster
`Cr(5,2,4)(000020)]. The cluster C(5,2,4)
`(0000022) with 38 sequences contains
`patterns that stand out as the
`highest scoring in comparison with the pattern
`scores for the random set of size 38.
`The highest scoring patterns are given in
`Table 4 (note that in Table 4 we have
`removed trivial variants of the patterns,
`e.g., patterns ending with wild-card char-
`acters). Pattern CCCCT (and its reverse
`complement AGGGG) is the highest
`scoring pattern for the cluster Cr(5,2,4)(000020) (containing
`55 sequences), matching 64% (35 out of 55) of se-
`quences in the cluster and 21% (1280 out of 5921)
`of remaining upstream regions, thus getting a score
`of 2.95. Other high-scoring patterns in this cluster
`
`
`Figure 2 The plots of the scores of the 30 best patterns found from
`the clusters of upstream sequences from genes with similar expres-
`sion profiles and of random sets of the upstream sequences of the
`same size. The dotted line is the average score of the 30 best patterns
`found from the random sets of the respective sizes. For sets of 30
`sequences or more, the pattern scores from the random sets of
`upstream sequences stabilize and are considerably lower than
`the 30 best pattern scores for the respective clusters.
`
`CCCCT..T
`A..AGGGG
`GGGGC
`GCCCC
`G..GGGG
`CCCC..C
`CCCC...T
`
`A...GGGG
`CCCCT
`AGGGG
`CCCT..TT
`AA..AGGG
`GGG.TG
`CA.CCC
`
`Table 4. Highest Scoring Patterns for the Cluster C(5,2,4)(0000022)
`N+a
`Total+b
`Scorec
`Pattern
`TRANSFAC (exact matches)
`A. Highest score in experiment allowing patterns to have at most 3 wild cards and
`no group characters
`22
`27
`7.09
`Y$DDR2_01, Y$DDR2_02, Y$TPI_02
`7.09
`22
`27
`–
`20
`27
`4.09
`Y$GAL2_02, Y$SUC2_02, Y$RRNA_01
`Y$ERG11_01
`20
`27
`4.09
`Y$CYB2_02
`Y$CYC1_04, Y$CYC1_05, Y$CYC1_06
`19
`28
`3.73
`19
`28
`3.73
`Y$GAL3_01, Y$MAL2R_01
`25
`42
`3.65
`Y$SUC2_01, Y$DDR2_01, Y$DDR2_02
`Y$TPI_02, Y$GAL3_01, Y$GAL4_01
`Y$MAL2R_01, Y$MAL63_01, Y$PDC1_02
`Y$HAP4_01
`Y$SUC2_02, Y$RRNA_01, Y$ERG11_01
`3.65
`42
`25
`Y$MEL1_02, Y$FPS1_01
`Y$DDR2_01, Y$DDR2_02, Y$TPI_02
`3.03
`38
`25
`Y$CAR1_02
`3.03
`38
`25
`Y$DDR2_01
`2.95
`22
`19
`–
`2.95
`22
`19
`–
`2.93
`21
`20
`Y$GAL1_04, Y$CYC1_12, Y$GAL1_14
`2.93
`21
`20
`Y$DDR2_02, Y$TPI_02
`B. Highest score in experiment allowing patterns having at most one group character
`e
`with two alternative letters (all pairs allowed)
`20
`28
`3.86
`Y$DDR2_01, Y$DDR2_02, Y$TPI_02
`24
`3.58
`Y$DDR2_02, Y$TPI_02
`20
`24
`47
`3.27
`Y$CYB2_02, Y$GAL2_02, Y$SUC2_02,
`Y$RRNA_01, Y$ERG11_01
`29
`58
`2.94
`Y$DDR2_01, Y$DDR2_02, Y$TPI_02,
`Y$SUC2_02, Y$CAR1_02, Y$ERG11_01
`29
`48
`2.90
`Y$CYB2_02, Y$DDR2_02, Y$TPI_02,
`Y$CYC1_04, Y$CYC1_05, Y$CYC1_06,
`Y$GAL2_02, Y$SUC2_02, Y$RRNA_01,
`Y$CAR1_02, Y$ERG11_01, Y$GAL1_15
`Trivial pattern variants were removed, e.g., patterns ending with a wild-card character.
`aNo. of upstream regions matching the pattern.
`bTotal number of matches in the upstream regions.
`cNormalized version of pattern score.
`dTRANSFAC entries matching the pattern.
`eBest patterns from experiment 2 not also found in experiment 1.
`
`CCCCT[GT]
`CCCCT[AT]
`[CG]CCCC
`CCCC[CT]
`[AG]CCCC
`
`include C..CCC.T (score 2.88), T.C..CCC (score 2.85), and T.AGGG (score 2.27).
`Furthermore, the pattern CCCCT was also among the 10 highest scoring patterns
`found for the clusters Cr(3,5)(000010), Cr(5,3,5)(000020), and C(5,2,4)(0000022).
`These four clusters strongly overlap (17 ORFs are in all four clusters). DeRisi
`et al. (1997) describe a set containing seven genes (see Fig. 5C in DeRisi et al.
`1997) out of which five are contained in our cluster Cr(5,2,4)(000020). They note
`the presence of the pattern CCCCT in the upstream regions of each gene in their
`set and that it is known to be a stress-responsive motif.
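The two counts reported for each pattern (the number of upstream regions matching it and the total number of matches) can be illustrated with a minimal Python sketch. This is not the discovery algorithm used in the paper. It only counts occurrences of an already given pattern, and it rests on assumptions that the text does not confirm: patterns are read directly as Python regular expressions (`.` as a wild card, `[GT]` as a group character), only the given strand is scanned, and overlapping occurrences are counted separately. The function name `count_pattern` and the toy sequences are hypothetical.

```python
import re


def count_pattern(pattern, sequences):
    """Return (sequences with at least one match, total number of matches).

    Assumption: overlapping occurrences count separately, which is why the
    pattern is wrapped in a zero-width lookahead before scanning.
    """
    regex = re.compile(f"(?=({pattern}))")
    per_sequence = [len(regex.findall(seq)) for seq in sequences]
    n_sequences = sum(1 for n in per_sequence if n > 0)
    return n_sequences, sum(per_sequence)


# Toy upstream regions invented for illustration only (not yeast data).
toy_upstream = [
    "AACCCCTGAT",  # contains CCCCT followed by G
    "TTCCCCCTTA",  # contains CCCCT followed by T
    "GATTACAGAT",  # contains no CCCCT
]

for pattern in ["CCCCT", "CCCCT[GT]", "C..CCC.T"]:
    print(pattern, count_pattern(pattern, toy_upstream))
```

Whether the original counts treat overlapping occurrences or the reverse strand in this way is not stated in this section, so the numbers produced by such a sketch should not be expected to reproduce the table.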
`We also analyzed the upstream regions of the genes in the clusters whose
`expression level decreases at time point 6, and found that they contain
`patterns with matches to binding sites for the RAP1 factor, which is known to
`be related to the stringent control of ribosomal protein gene transcription in
`S. cerevisiae (Moehle and Hinnebusch 1991). Some of
`the patterns found in the upstream regions of genes in the clusters
`C(5,2,6)(00000-1-2) (27 sequences) and C(5,3,6)(00000-1-2) (25 sequences) match
`substrings in the sites for the REB1 and BAF1 proteins, which are repressors
`(Diffley and Stillman 1988; Wang et al. 1990); this corresponds well with the
`fact that the clusters contain genes with decreasing expression level.
`The complete list of patterns discovered from the clusters is available on the
`World Wide Web.
`DISCUSSION
`The results of the analysis of the complete set of gene upstream regions show
`that, given a genome with annotated genes, some transcription factor-binding
`site (or other regulatory element) descriptions can be generated without any
`background knowledge about the transcription factors of the organism. As far as
`we know, these are the first results from an automatic method for the discovery
`of possible regulatory elements applied to a complete genome, when no
`background information about transcription factors is used. Additionally, we
`have used expression level data to find groups of genes with similar expression
`profiles and searched for patterns common in the upstream regions of genes
`in each c
