`
`Downloaded from on April 19, 2018 - Published by genome.cshlp.org
`
`
`
`Cold Spring Harbor Laboratory Press
`
`LETTER
`
`Predicting Gene Regulatory Elements in Silico
`on a Genomic Scale
`Alvis Brazma,1 Inge Jonassen,2 Jaak Vilo,3,4 and Esko Ukkonen3
`1European Molecular Biology Laboratory (EMBL) Outstation–Hinxton, European Bioinformatics Institute,
`Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; 2Department of Informatics,
`University of Bergen, Høyteknologisenteret, N5020 Bergen, Norway; 3Department of Computer Science,
`FIN-00014 University of Helsinki, Helsinki, Finland
`
`We performed a systematic analysis of gene upstream regions in the yeast genome for occurrences of regular
`expression-type patterns with the goal of identifying potential regulatory elements. To achieve this goal, we
`have developed a new sequence pattern discovery algorithm that searches exhaustively for a priori unknown
`regular expression-type patterns that are over-represented in a given set of sequences. We applied the algorithm
`in two cases, (1) discovery of patterns in the complete set of >6000 sequences taken upstream of the putative
`yeast genes and (2) discovery of patterns in the regions upstream of the genes with similar expression profiles.
`In the first case, we looked for patterns that occur more frequently in the gene upstream regions than in the
`genome overall. In the second case, first we clustered the upstream regions of all the genes by similarity of their
`expression profiles on the basis of publicly available gene expression data and then looked for sequence patterns
that are over-represented in each cluster. In both cases we considered each pattern that occurred in at least
some minimum number of sequences and rated the patterns by their over-representation. Among the
highest-rated patterns, most have matches to substrings in known yeast transcription factor-binding sites.
`Moreover, several of them are known to be relevant to the expression of the genes from the respective clusters.
`Experiments on simulated data show that the majority of the discovered patterns are not expected to occur by
`chance.
`
`Completely sequenced genomes, together with the
`emerging DNA microarray technologies enabling
`the measurement of the gene expression levels in
`cell cultures (Schena et al. 1995; for a survey, see
`Ramsay 1998), are opening new possibilities for
`studying gene regulation. The sequencing of the
`first eukaryotic genome (the yeast Saccharomyces cer-
`evisiae) was completed in 1996 (Goffeau et al. 1996;
Mewes et al. 1997). Expression-level data for almost
all of the ∼6000 yeast genes were obtained during 1997
(DeRisi et al. 1997; Velculescu et al. 1997; Wodicka et al.
1997). In particular, DeRisi et al. (1997)
measured the relative expression
`levels of the yeast genes at seven consecutive time
`points (in 2-hr intervals) during a shift from anaero-
`bic to aerobic metabolism (diauxic shift). They
`showed that some of the genes that are known to be
`involved in metabolic pathways related to the di-
`auxic shift underwent a very significant change in
`their expression level during the shift. By treating
`the expression measurements as a time series, it is
`
`4Corresponding author.
`E-MAIL vilo@cs.helsinki.fi; FAX 358 9 708 44441.
`
`possible to cluster genes according to similarities in
`their expression profiles. It may be hypothesized
`that at least some of the genes in a cluster are regu-
`lated by similar mechanisms.
`The transcription regulation mechanisms in eu-
`karyotic genomes are not well understood. Evi-
`dently, however, an essential role is played by tran-
`scription factors, which can bind to particular DNA
`sequences, called transcription factor-binding sites,
`believed to be about 5–25 bp long. In yeast, these
`sites are usually within several hundred base pairs
`upstream of the respective ORFs (Mellor 1993).
Both regular expression-type patterns and
nucleotide distribution matrices have been
used for describing transcription factor-binding
sites (e.g., see Bucher 1990; Ghosh 1990; Chen et al.
`1995; Wingender et al. 1996). Inference of such de-
`scriptions from the sequences that are assumed to
`contain a site for a particular transcription factor is
`a difficult problem as the consensus of the different
binding sites of the same transcription factor is often
rather weak. Algorithms have been proposed for
inferring such descriptions from relatively small
sets of sequences (about 20) in which all
`
`1202 GENOME RESEARCH
`
`8:1202–1215 ©1998 by Cold Spring Harbor Laboratory Press ISSN 1054-9803/98 $5.00; www.genome.org
`
`Petitioner Microsoft Corporation - Ex. 1036, p. 1202
`or almost all of the sequences are known to contain
`the site for the respective transcription factor (e.g.,
`see Stormo and Hartzell 1989; Wolfertstetter et al.
`1996; van Helden et al. 1998). More recently, van
`Helden et al. (1998) and Yada et al. (1998) have
`proposed methods for the discovery of putative
`transcription factor-binding sites from larger data
sets. Yada et al. (1998) applied their method to analyze
about 400 human promoter sequences.
`Apparently, an even more difficult problem is
`identifying potential binding sites or other regula-
`tory elements from sets of sequences only suspected
`to contain such elements. In this report, we con-
`sider the case when only a small portion of the se-
`quences in the given set may actually contain a
`common regulatory element, and the total number
`of sequences may be up to thousands. In this set-
`ting, it may not be possible to infer precise binding
`site descriptions; still, if the number of sequences
`containing a common regulatory element is larger
`than would be expected by chance, it may be pos-
`sible to obtain hints about sequence properties of
`such an element and in which particular sequences
`it may be present.
`An obvious difficulty in attacking this problem
`is the computational complexity of the algorithmic
`problem of discovering interesting sequence pat-
`terns in a large collection of sequences only some of
`which may contain a common pattern. Ultimately
`the results of such discoveries should be taken as
`predictions that must be verified by independent,
`that is, wet biology, means. Still, some validation
`can be obtained by comparing the discovered site
`descriptions to the transcription factor database en-
`tries, or by statistical means by comparing the dis-
`tribution of the discovered patterns to the distribu-
`tion in simulated data.
Pattern discovery methods basically fall into
two groups: sequence-driven and pattern-driven
`methods (for a survey, see Brazma et al. 1998a,b).
`Algorithms in the first group normally work by
`combining the results of pairwise sequence com-
`parisons to form patterns that match the subsets of
`the sequences. These algorithms are too slow to find
`patterns that occur in arbitrarily sized subsets of
`thousands of sequences. Pattern-driven algorithms
`work by enumerating or searching a predefined pat-
`tern class to find patterns and their occurrence fre-
`quencies. In these methods, one needs a very fast
`method for locating all matches of each pattern
`from the search space. Special data structures and
`pattern occurrence lists have been used for this pur-
`pose, but the methods have been limited to the
`analysis of smaller data sets.
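As a rough illustration of the pattern-driven idea (not the algorithm developed in this paper), the following Python sketch enumerates every substring of a fixed length and counts the number of sequences in which it occurs. The function name `count_supporting_sequences`, the fixed length `k`, and the toy sequences are assumptions made for this example only.

```python
from collections import Counter

def count_supporting_sequences(sequences, k, min_support):
    """Count, for every length-k substring, how many sequences contain it.

    Hypothetical helper for illustration; only patterns present in at
    least `min_support` sequences are returned.
    """
    support = Counter()
    for seq in sequences:
        # A set ensures each pattern is counted once per sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(seen)
    return {p: n for p, n in support.items() if n >= min_support}

# Toy input, not yeast data.
toy = ["ACCCGT", "TACCCG", "GGGTAC"]
frequent = count_supporting_sequences(toy, k=4, min_support=2)
```

Counting sequences rather than individual occurrences mirrors the way the text rates patterns by the number of sequences that contain them. A naive enumeration like this does not scale to patterns of unlimited length, which is the limitation the text attributes to earlier pattern-driven methods.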
`We have developed a new, more powerful, pat-
`tern discovery algorithm that is able to discover
`various subclasses of regular expression type pat-
`terns of unlimited length common to as few as ten
`sequences from thousands. We used this algorithm
`for predicting regulatory elements from gene up-
`stream regions in the yeast S. cerevisiae.
`We considered two cases. First, we looked for
`patterns that occur more frequently in the gene up-
`stream regions than in randomly chosen regions in
the yeast genome. For each pattern present in at
least 10 sequences (from >12,000), we calculated a
score equal to the number of upstream regions that
contain the pattern divided by the number of random
regions (of the same length and number) that contain
the pattern, and ranked the patterns by this ratio.
`In the second case, we used information from
`the yeast genome expression data (DeRisi et al.
1997) to cluster the genes according to their expression
profiles. After clustering the upstream regions
(treating the expression measurements as time series),
we selected characteristic clusters according to
strict criteria (see Methods). We hypothesized that some
`of the genes in a cluster may contain binding sites
`for the same transcription factors or other common
`regulatory elements. We used our algorithm to look
`for patterns that are over-represented in each cluster
`as compared with other upstream regions.
We systematically compared the high-scoring
patterns that we discovered to the transcription
factor-binding site descriptions for yeast in the TRANSFAC
database (Wingender et al. 1996). We found
`that most of the discovered patterns (both from the
`total set of upstream regions and from the clusters)
`have matches to substrings of genome regions that
`contain transcription factor-binding sites. We also
`compared the distribution of patterns present in up-
`stream regions to the distribution of the patterns
`that can be discovered in random regions of the
`genome and showed that the distributions are
`rather different. The comparison with the TRANS-
`FAC database as well as the overall statistics of the
`discovered patterns suggest that many of the discov-
`ered patterns can be important for the expression
`profile of the particular clusters of genes or for the
`transcription or translation initiation in general.
`RESULTS
`First, we describe the pattern discovery in the com-
`plete set of yeast gene upstream regions, then the
`clustering of the yeast gene expression data, and
`finally, the results obtained by pattern discovery
`from within the subsets of upstream regions of
`genes sharing similar expression profiles.
`We considered three different types of patterns:
`(P1) substring patterns (i.e., words in the alphabet A,
`T, G, C); (P2) substring patterns with wild cards (of
`fixed length); and (P3) patterns with character
groups [such patterns can be represented as words
over IUPAC code (Cornish-Bowden 1984) characters;
here we will use a more explicit notation].
We denote wild-card positions by a dot in the
pattern (e.g., TA.A), and the group positions by
listing all possible characters in square brackets (e.g.,
T[AT]A). A wild-card position is equivalent to the group
position [ATCG], that is, all characters are allowed. For
instance, pattern A[TG].C matches all strings that
contain a substring beginning with A, followed by
either T or G, followed by any character, followed by
`C. In practice, for reasons of efficiency, we restrict
`ourselves to various subclasses of these pattern
`classes (e.g., limiting the number of possible wild
`cards or group symbols). The implementation of the
algorithm, results, data, and additional images are
available on the World Wide Web at
http://www.cs.Helsinki.FI/~vilo/Yeast/.
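The notation described above happens to coincide with standard regular-expression syntax, so pattern matching can be sketched directly with Python's `re` module. This is a minimal illustration, not the implementation referred to in the text; the helper names `compile_pattern` and `contains` are assumptions.

```python
import re

def compile_pattern(pattern):
    """Compile a pattern such as 'A[TG].C' into a regular expression.

    The notation in the text ('.' for a wild card, '[..]' for a group)
    is already valid regular-expression syntax, so only validation is
    needed. Hypothetical helper for illustration.
    """
    if not re.fullmatch(r"(?:[ACGT.]|\[[ACGT]+\])+", pattern):
        raise ValueError(f"unsupported pattern: {pattern!r}")
    return re.compile(pattern)

def contains(pattern, sequence):
    """Return True if the sequence has a substring matching the pattern."""
    return compile_pattern(pattern).search(sequence) is not None
```

For example, `contains("A[TG].C", "GGATACGG")` is true because the substring ATAC begins with A, continues with T, then any character, then C.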
`
`Discovering Patterns from the Total Set
`of Upstream Regions
`We extracted upstream regions relative to all ORFs,
`as annotated in the MIPS Yeast genome database
`(Mewes et al. 1997). Concretely, we extracted seven
sets of upstream regions of length 100 from the positions
−100 to 0, −150 to −50, −200 to −100, −250 to −150,
−300 to −200, −350 to −250, and −400 to −300; a set of
regions of length 300 from positions −300 to 0; and a set
of regions of length 600 from positions −600 to 0 (all
positions are relative to the start codon of the ORF; see Methods).
We also extracted two sets of sequences of the same
number and length from randomly selected locations
of the same chromosome. These sets of random
regions were used as random samples of the
`yeast genome sequences (the nucleotide and di-
`nucleotide distribution in the random regions re-
`flected that in the genome in general) (1) to com-
`pare the upstream regions to random regions for
`identifying patterns that are more frequent in up-
`stream regions than in the genome in general and
`(2) to compare the two random sets against each
`other for testing whether the pattern occurrence sta-
`tistics resulting from the comparison of upstream
`and random regions can be explained by chance.
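The extraction of the seven overlapping 100-bp windows can be sketched as follows. This is a simplified illustration that assumes the chromosome is given as a string read in the direction of the gene and that `orf_start` is the 0-based index of the first base of the start codon; the function name and these conventions are assumptions, and genes on the opposite strand are not handled.

```python
def upstream_window(chromosome, orf_start, far, near):
    """Return the region from position -far to -near relative to the ORF start.

    `chromosome` is a string read in the direction of the gene and
    `orf_start` is the 0-based index of the first base of the start codon
    (both are assumptions of this sketch). Returns None if the window
    would extend past the beginning of the sequence.
    """
    begin, end = orf_start - far, orf_start - near
    if begin < 0:
        return None
    return chromosome[begin:end]

# The seven 100-bp windows listed in the text: (-100, 0), (-150, -50), ..., (-400, -300).
WINDOWS = [(far, far - 100) for far in range(100, 401, 50)]
```

Calling `upstream_window(chromosome, orf_start, far, near)` for each pair in `WINDOWS` yields one region per window for a given ORF.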
`We analyzed these data sets for occurrences of
`patterns. We presented each pattern that occurred
`at least 10 times in upstream or random regions as a
`dot in a two-dimensional plot (see Fig. 1, left col-
`umn). The vertical axis shows the number of up-
`stream regions, and the horizontal axis the number
`of random regions, where the pattern is present.
`
`Figure 1 The distribution of all patterns (of unre-
`stricted length) with at most one wild-card symbol in
the regions −250 to −150 (upstream from the ORFs)
and randomly chosen genomic regions of length 100
bp. Dots in the graphs on the left correspond to patterns
`that occur in x sequences from the random regions
`(along horizontal axis) and y sequences from the up-
`stream regions (vertical axis). In graphs on the right,
`the upstream regions are replaced by another set of
`random regions; therefore, these plots show the ex-
`pected statistics if the regions are chosen at random.
`(Top row) All patterns with at least 10 occurrences.
`(Second row) Subset of top row with all patterns con-
`taining at least two characters C or G and not contain-
`ing any of the substrings AAAA, TTTT, ATAT, or TATA.
`(Bottom two rows) Same plots as in the first two rows,
`but only including patterns with at most 200 occur-
`rences in upstream or random regions (i.e., zoomed to
`the lower left corner).
`
Hence a dot at plot location (x, y) indicates that
there is a pattern that occurs in x random regions
and y upstream regions. The patterns deviating from
`the diagonal, and particularly, being above the di-
`agonal, are the ones that can distinguish the up-
`stream regions from the random regions (and,
`therefore, are likely to distinguish the upstream re-
`gions from the genome in general), in contrast to
`the patterns that fall close to the diagonal and thus
`occur with the same frequency in upstream and ran-
`dom regions. The dots farthest above the diagonal
`correspond to the patterns that are potential candi-
`dates for regulatory elements. For each pattern we
`calculated a score as defined by equation (2) in
`Methods, which is essentially the number of occur-
`rences in the upstream regions divided by the sum
`of the number of occurrences in the random regions
`and a correcting constant.
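The verbal description of the score can be sketched as a small function. The exact form of equation (2) is given in Methods and may differ; in particular, the value 4.66 for the correcting constant is an assumption, chosen only because it is consistent with the scores listed in Table 1.

```python
def upstream_score(n_upstream, n_random, correction=4.66):
    """Score a pattern by upstream occurrences relative to random ones.

    `correction` is the correcting constant; 4.66 is an assumed value
    chosen because it is consistent with the rows of Table 1. It also
    keeps the score finite when a pattern never occurs in random regions.
    """
    return n_upstream / (n_random + correction)
```

With this assumed constant, a pattern found in 37 upstream regions and 1 random region scores about 6.54, and a pattern found in 29 upstream regions and no random region scores about 6.22.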
`A control experiment (right column in Fig. 1)
`was done to estimate whether the difference in pat-
`tern frequencies observed for upstream versus ran-
`dom sequence segments could be explained by
`chance. In the control experiments, we compared
`two sets of random regions. The pattern occurrence
`statistics obtained when comparing the upstream
`regions to the random regions is rather different
`from the statistics obtained when comparing two
sets of random regions. We also verified that this
considerable difference can be explained neither by
the higher AT content of the upstream regions nor by
poly(A), poly(T), or poly(AT) patterns. To do so,
we plotted only the patterns containing at least
`two characters C or G and not containing any of the
`substrings AAAA, TTTT, ATAT, or TATA. The differ-
`ence between the plots remained essentially as
strong (see Fig. 1). Therefore, we conclude that the
distribution of patterns in the upstream regions differs
from the distribution in random regions. In particular,
`there are some specific patterns that occur consid-
`erably more often in upstream regions than in ran-
`dom regions.
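The composition filter used for the second row of plots in Figure 1 can be written as a short predicate. This is a sketch of the rule as stated in the text; the function name is an assumption.

```python
EXCLUDED = ("AAAA", "TTTT", "ATAT", "TATA")

def passes_composition_filter(pattern):
    """Keep patterns with at least two C or G characters and no
    poly(A), poly(T), or poly(AT) substring, as described in the text."""
    gc_count = sum(pattern.count(base) for base in "CG")
    return gc_count >= 2 and not any(s in pattern for s in EXCLUDED)
```

For instance, TTACCCGC passes the filter, whereas TACAT.TATA does not, because it contains TATA and has only one C or G.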
`The best distinction (as judged by visual inspec-
`tion) between upstream and random regions by sub-
`string patterns was achieved for upstream regions of
`length 100 when counting matches only on the
`gene’s strand. [The use of only one strand can be
`justified because of the very distinct distribution of
`different bases in a region of 300 bp upstream from
`the start of the gene (see Fig. 3, below, in Methods).]
Similar differences were observed for all region
lengths and relative positions considered. We also
experimented with the three sets of sequences of
length 600 and 300 bp, analyzing substring patterns
on either strand, and with the sequences of length 100,
`analyzing the patterns that contain up to one wild
`card. Some results for patterns with at most one
`wild-card symbol from regions of length 100 bp at
upstream positions −250 to −150 are shown in
Figure 1.
Many of the top-scoring patterns, particularly
for the region −250 to −150, are effectively poly(T)
sequences. Still, as mentioned above, these trivial
poly(T) patterns cannot explain the differences in
the pattern occurrence statistics compared with random
genomic regions; therefore, overall, the patterns
not containing poly(T) sequences are significant.
We removed from the list of discovered patterns
those that contain the substrings TTTT or
AAAA, and additionally the patterns ending in a
wild card (we call the remaining patterns nontrivial).
The 20 highest scoring remaining patterns are listed in
Table 1 (the numbering of the patterns refers to the
total list of patterns, including the trivial ones).
We compared the groups of highest scoring
nontrivial patterns from each of the seven regions
of length 100 bp at various distances from the
respective ORFs. We used the program Pratt (Jonassen
1997) to try to find patterns that would be a consensus
for a substantial number of patterns in each
group. More concretely, we took the 20 highest
scoring patterns and used Pratt to discover patterns
matching at least 6 of them. It turned out that only
the region −150 to −50 has a relatively good consensus
pattern, GATG.G.T. The region −200 to −100 has two
consensus patterns, T.ACCCG and CGGGT.A,
which are mutually symmetric (reverse complements
of each other), and the region −250 to −150 has the
consensus ACCCG (note that it is a subpattern of
T.ACCCG). No significant consensus patterns were
found for the other regions.
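The symmetry between T.ACCCG and CGGGT.A can be checked mechanically by reverse-complementing a pattern position by position. The sketch below handles wild cards and bracketed groups; the function name is an assumption.

```python
import re

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement_pattern(pattern):
    """Reverse-complement a pattern that may contain '.' and '[..]' groups."""
    # Split into positions: bracketed groups, single bases, or wild cards.
    positions = re.findall(r"\[[ACGT]+\]|[ACGT.]", pattern)
    out = []
    for pos in reversed(positions):
        if pos == ".":
            out.append(".")  # a wild card stays a wild card
        elif pos.startswith("["):
            members = sorted(COMPLEMENT[c] for c in pos[1:-1])
            out.append("[" + "".join(members) + "]")
        else:
            out.append(COMPLEMENT[pos])
    return "".join(out)
```

Applying `reverse_complement_pattern` to T.ACCCG gives CGGGT.A, and applying it to CCCCT gives AGGGG.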
We also matched the 50 highest scoring nontrivial
patterns for each of the regions against all the
transcription factor-binding site descriptions for yeast
in the TRANSFAC database (Wingender et al. 1996).
The exact matches are given in Table 2 (by an exact
match, we mean that the discovered pattern exactly
matched a substring in the binding site description).
Note that although the highest scoring patterns from
neighboring regions are not necessarily similar themselves,
the number of coinciding binding sites (from
TRANSFAC) matched by patterns from two regions
shows a considerable correlation with the distance
between the positions of the regions.
`The complete list of the discovered patterns is
`available on the World Wide Web.
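The substring matching against binding site descriptions, including the distinction made in the Table 2 legend between a match on the site as given (+) and a match only on its reverse complement (.), can be sketched as follows. The site string used here is made up for the example and is not a TRANSFAC entry; the function names are assumptions.

```python
import re

def reverse_complement(dna):
    """Reverse-complement a plain DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def match_mark(pattern, site):
    """Return '+' for a match on the site as given, '.' for a match only on
    its reverse complement, and None otherwise (marks as in the Table 2 legend)."""
    regex = re.compile(pattern)
    if regex.search(site):
        return "+"
    if regex.search(reverse_complement(site)):
        return "."
    return None

toy_site = "TTACCCGCA"  # made-up sequence, not a TRANSFAC entry
```

On this toy site, T.ACCCG is marked "+", its reverse complement CGGGT.A is marked ".", and an unrelated pattern such as GATG gets no mark.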
`
Table 1. Highest Scoring Nontrivial Patterns with (at Most) One Wild-Card Symbol

No.(a)  Pattern        Score(b)  N+(c)  N−(d)

A. Regions −100..0
  2     AAG.AAACAAA    6.54       37      1
  6     A.TAAGAACA     5.79       27      0
  8     A.AATAGGA      5.61       43      3
  9     AAGAAA.CAAA    5.58       26      0
 12     GTAACAA.C      5.36       25      0
 13     AAA.AACTTA     5.36       25      0
 20     ACAAC.TAA      5.09       39      3
 21     AG.AAACAAA     5.06       64      8
 23     ACAAACAA.A     4.97       48      5
 26     AATAGTA.A      4.92       77     11
 32     AATAGTATA      4.77       27      1
 34     TCACTAC.T      4.72       22      0
 35     CAAACA.ACA     4.72       22      0
 37     ACA.ATAGA      4.72       55      7
 42     AGAGA.ATA      4.63       54      7
 47     AATAAACAA.A    4.59       26      1
 50     AAAG.ACAAG     4.57       35      3
 52     CTAAGAA.A      4.55       53      7
 56     A.AAGGGAAG     4.51       21      0
 57     CAAA.TAAC      4.50       48      6

B. Regions −250..−150
 14     TTACCCGC       6.22       29      0
 58     GT.ACCCG       5.59       54      5
 71     T.ACCCGC       5.48       42      3
126     CGGGTA.T       5.06       64      8
141     G.TACCCG       4.97       48      5
165     CGGGTAA.A      4.87       47      5
178     GTTACCCG       4.83       37      3
305     TACAT.TATA     4.43       65     10
353     TTTCTC.TTT     4.32       46      6
372     TTACCCG        4.30      119     23
379     TTTCCTGT.T     4.29       20      0
405     CTCATCTC.T     4.24       24      1
425     TCACGTGA       4.20       28      2
427     T.ATATATTC     4.20       28      2
454     CGGGTAA        4.12      114     23
460     TGTGT.GAT      4.08       19      0
465     ATTACCCG.A     4.08       19      0
474     G.ACATATAT     4.06       23      1
485     TA.GTAAAC      4.05       27      2
500     TTTCTCT.TT     4.03       47      7

Matches were only allowed on the W(gene) strand.
(a) No. of the pattern, enumerating them in decreasing order of score
(before trivial patterns were removed).
(b) From equation 2.
(c) No. of upstream regions matching the pattern.
(d) No. of random sequences matching the pattern.
`
`Clustering the Gene Expression Data
`DeRisi et al. (1997) studied the relative expression
`rate changes of yeast genes during the diauxic shift.
`They inoculated yeast cells from an exponentially
`growing yeast culture into fresh medium and after
`some initial period, harvested samples at seven 2-hr
intervals, isolated their mRNA, and prepared fluorescently
labeled cDNA. Two different fluorescent
moieties were used: one for cells harvested at each
of the successive time points, the other for the reference,
from cells harvested at the first time point.
The cDNAs from each time point, together with the
reference cDNA, were hybridized to a microarray
with ∼6400 DNA sequences representing ORFs of
the yeast genome. The relative fluorescence intensity
measured for each of the ∼6400 × 7 elements
reflects the relative abundance of the corresponding
mRNA in each cell population. The measurement
data are available on the Internet.
`We used the data from these yeast gene expres-
`sion studies (DeRisi et al. 1997) and clustered all the
`genes by similarities in their expression profiles in
`several alternative ways. To achieve this goal, we
`developed and implemented a simple algorithm
based on discretizing the time series of the measurement
space into a simplified form and then clustering
these simple time series. Strict selection
criteria were used for defining good clusters (for
details, see Methods). This produced 32 different clusters
containing from 10 to 77 ORFs each, of which 11
contain at least 25 ORFs (see Table 3).
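The general idea of discretizing a time series and grouping identical discretized profiles can be sketched as below. This is not the procedure from Methods: the log2 scale, the `step` and `cap` parameters, the gene names, and the expression values are all assumptions made for illustration, and any resemblance of the resulting profiles to the cluster names in Table 3 is only suggestive.

```python
import math
from collections import defaultdict

def discretize(ratios, step=1.0, cap=2):
    """Map expression ratios to small integers on a log2 scale.

    `step` and `cap` are hypothetical parameters: each value becomes the
    log2 ratio divided by `step`, truncated toward zero and clipped to
    the range [-cap, cap].
    """
    levels = []
    for r in ratios:
        level = int(math.log2(r) / step)  # int() truncates toward zero
        levels.append(max(-cap, min(cap, level)))
    return tuple(levels)

def cluster_by_profile(expression, min_size=1, **kwargs):
    """Group genes whose discretized profiles are identical."""
    clusters = defaultdict(list)
    for gene, ratios in expression.items():
        clusters[discretize(ratios, **kwargs)].append(gene)
    return {p: g for p, g in clusters.items() if len(g) >= min_size}

# Made-up ratios for three hypothetical genes at seven time points.
expression = {
    "g1": [1.0, 1.0, 1.0, 1.0, 1.0, 4.2, 5.0],
    "g2": [1.1, 0.9, 1.0, 1.0, 1.0, 4.0, 4.5],
    "g3": [1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.2],
}
clusters = cluster_by_profile(expression, min_size=2)
```

Here g1 and g2 share one discretized profile and form a cluster, while g3 falls below the minimum cluster size. Small fluctuations around a ratio of 1 are absorbed by the truncation, which is one way a discretization can reduce sensitivity to noise.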
`The most significant changes in gene expres-
`sion rates during the diauxic shift occurred during
`the last two time points. This significance is re-
`flected in the clusters that we obtained (although
`some fluctuations at earlier time points occur for
`smaller groups of genes, which may be due to
`noise). Many of the constructed clusters strongly
overlap. Of the 11 clusters of at least 25 ORFs
each, the expression level increases at time point 6
in 8 clusters, decreases in 2, and is unchanged
in 1.
`
`Discovering Patterns from the Gene Clusters
`We studied whether clusters of genes with similar
`expression profiles can help to discover sequence
`patterns putatively describing transcription factor-
`binding sites. For each cluster, we compared the cor-
`responding upstream regions of length 300 bp
`against all other upstream regions. The algorithm
`was used to find the highest scoring patterns con-
`taining up to three wild cards. The patterns were
`
Table 2. Matches to TRANSFAC Binding Sites for the 50 Best Patterns Found for Each 100-bp
Upstream Region

Upstream regions (columns): −100, −150, −200, −250, −300, −350, −400

TRANSFAC binding sites matched (rows):
Y$ARS1_03     Y$ARSH4_02    Y$CAR1_01     Y$CAR2_01
Y$CDC2_01     Y$CDC9_01     Y$CEN12_01    Y$CEN6_01
Y$CENIV_01    Y$CFES_01     Y$CHA1_04     Y$CSVIII_02
Y$CTA1_01     Y$CYC1_12     Y$CYC1_14     Y$DDR2_02
Y$G3PDH_01    Y$GAL1_03     Y$GAL1_04     Y$GAL1_06
Y$GAL1_14     Y$GAL2_03     Y$HO_06       Y$HO_07
Y$ICL1_01     Y$MAL61_02    Y$MES1_01     Y$PDC1_02
Y$PGK_01      Y$PHO8_02     Y$POX1_01     Y$RAP_01
Y$RP51A_01    Y$RPL16A_01   Y$RRNA_01     Y$RRNA_02
Y$STE6_02     Y$SUC2_02     Y$TEF2_01     Y$TOP2_01
Y$TRP1_01     Y$TRP5_01     Y$X40_01      Y$Y30_01

For each 100-bp region starting at the seven different positions upstream from ORFs, the 50 highest scoring nontrivial patterns were
matched (in substring sense) against the yeast transcription factor binding sites as given in the TRANSFAC (Wingender et al. 1996)
database. The first column gives the binding site identifier in TRANSFAC that is matched by one of the best patterns from any of these
sets.
(+) At least one of the respective patterns matches exactly the corresponding TRANSFAC site.
(.) A pattern matches only the reverse-complement of the TRANSFAC site.
`
Table 3. Summary Information about Pattern Scores in the Clusters and Random Sets

Cluster name           No. of genes   Score range for the    Score for the best pattern
                                      10 best patterns       in the resp. random set
Cr(3,4)(000010)        77             3.80–2.80              1.70
Cr(5,2,4)(000020)      55             3.99–2.96              1.94
C(5,2,4)(0000021)      41             3.77–2.95              2.12
C(5,2,4)(0000022)      38             7.15–4.11              2.09
Cr(3,5)(000010)        38             3.50–3.17              2.73
C(5,2,4)(0000012)      37             2.87–2.52              2.42
Cr(5,3,5)(000020)      37             3.60–3.08              2.12
Cr(5,2,3)(00002-1)     25             4.00–3.89              3.86
Cr(3,3)(000001)        26             3.55–3.29              4.33
C(5,2,6)(00000-1-2)    27             3.69–3.21              3.13
C(5,3,6)(00000-1-2)    25             4.00–3.89              3.86

For explanation of the cluster names, see Methods. The first eight clusters consist of genes whose expression level increases at
time point 6; the last two consist of genes whose expression level decreases at time point 6. The statistics include all patterns (trivial
variants were not removed).
`
matched on either strand and ranked by the score
given by equation 1 in Methods.
To evaluate the overall significance of the result,
we picked for each cluster a random subset (of
the same size) of upstream regions from the total set
of genes, and analyzed this set in exactly the same way
as the cluster. We found that, for 10 clusters out of 11
containing at least 25 sequences and for all clusters
containing at least 30 sequences, the scores of the best
patterns found from the clusters are higher than those of
the best patterns found from the randomly picked
sets (see Table 3; Fig. 2).
The largest clusters (>30 sequences) correspond to
expression profiles with an increase in the expression level
at time point 6, and, for each of these clusters, high-scoring
patterns containing the substring CCCC are found
[CCCC has score 1.9 for cluster Cr(5,2,4)(000020)].
The cluster C(5,2,4)(0000022), with 38 sequences, contains
patterns that stand out as the highest in comparison with
the pattern scores for the random set of size 38.
The highest scoring patterns are given in Table 4
(note that in Table 4 we have removed trivial variants
of the patterns, e.g., patterns ending with wild-card
characters). Pattern CCCCT (and its reverse
complement AGGGG) is the highest
scoring for the cluster Cr(5,2,4)(000020) (containing
55 sequences), matching 64% (35 out of 55) of the
sequences in the cluster and 21% (1280 out of 5921)
of the remaining upstream regions, thus getting a score
of 2.95. Other high-scoring patterns in this cluster
`
`Figure 2 The plots of the scores of the 30 best patterns found from
`the clusters of upstream sequences from genes with similar expres-
`sion profiles and of random sets of the upstream sequences of the
`same size. The dotted line is the average score of the 30 best patterns
found from the random sets of the respective sizes. For sets of 30
sequences or more, the pattern scores from the random sets of
upstream sequences stabilize and are considerably lower than
the scores of the 30 best patterns for the respective clusters.
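The random-set control behind this comparison can be sketched as follows. The scoring routine itself is not reproduced here: `best_score` stands for a caller-supplied function (hypothetical in this sketch) that runs pattern discovery on a list of regions and returns the score of the best pattern found.

```python
import random

def best_score_in_random_subset(all_regions, cluster_size, best_score, seed=None):
    """Score a random subset of upstream regions of the same size as a cluster.

    `best_score` is a caller-supplied function (hypothetical here) that
    runs pattern discovery on a list of regions and returns the score of
    the best pattern found.
    """
    rng = random.Random(seed)
    subset = rng.sample(all_regions, cluster_size)  # sampling without replacement
    return best_score(subset)
```

Comparing the value returned for a random subset with the best score obtained for the actual cluster of the same size gives the kind of contrast plotted in Figure 2.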
`
Table 4. Highest Scoring Patterns for the Cluster C(5,2,4)(0000022)

N+(a)  Total+(b)  Score(c)  Pattern      TRANSFAC (exact matches)(d)

A. Highest score in experiment allowing patterns to have at most 3 wild cards and
no group characters
22     27         7.09      CCCCT..T     Y$DDR2_01, Y$DDR2_02, Y$TPI_02
22     27         7.09      A..AGGGG     –
20     27         4.09      GGGGC        Y$GAL2_02, Y$SUC2_02, Y$RRNA_01, Y$ERG11_01
20     27         4.09      GCCCC        Y$CYB2_02
19     28         3.73      G..GGGG      Y$CYC1_04, Y$CYC1_05, Y$CYC1_06
19     28         3.73      CCCC..C      Y$GAL3_01, Y$MAL2R_01
25     42         3.65      CCCC...T     Y$SUC2_01, Y$DDR2_01, Y$DDR2_02, Y$TPI_02, Y$GAL3_01,
                                         Y$GAL4_01, Y$MAL2R_01, Y$MAL63_01, Y$PDC1_02, Y$HAP4_01
25     42         3.65      A...GGGG     Y$SUC2_02, Y$RRNA_01, Y$ERG11_01, Y$MEL1_02, Y$FPS1_01
25     38         3.03      CCCCT        Y$DDR2_01, Y$DDR2_02, Y$TPI_02, Y$CAR1_02
25     38         3.03      AGGGG        Y$DDR2_01
19     22         2.95      CCCT..TT     –
19     22         2.95      AA..AGGG     –
20     21         2.93      GGG.TG       Y$GAL1_04, Y$CYC1_12, Y$GAL1_14
20     21         2.93      CA.CCC       Y$DDR2_02, Y$TPI_02

B. Highest score in experiment allowing patterns having at most one group character
with two alternative letters (all pairs allowed)(e)
20     28         3.86      CCCCT[GT]    Y$DDR2_01, Y$DDR2_02, Y$TPI_02
20     24         3.58      CCCCT[AT]    Y$DDR2_02, Y$TPI_02
24     47         3.27      [CG]CCCC     Y$CYB2_02, Y$GAL2_02, Y$SUC2_02, Y$RRNA_01, Y$ERG11_01
29     58         2.94      CCCC[CT]     Y$DDR2_01, Y$DDR2_02, Y$TPI_02, Y$SUC2_02, Y$CAR1_02,
                                         Y$ERG11_01
29     48         2.90      [AG]CCCC     Y$CYB2_02, Y$DDR2_02, Y$TPI_02, Y$CYC1_04, Y$CYC1_05,
                                         Y$CYC1_06, Y$GAL2_02, Y$SUC2_02, Y$RRNA_01, Y$CAR1_02,
                                         Y$ERG11_01, Y$GAL1_15

Trivial pattern variants were removed, e.g., patterns ending with a wild-card character.
(a) No. of upstream regions matching the pattern.
(b) Total number of matches in the upstream regions.
(c) Normalized version of pattern score.
(d) TRANSFAC entries matching the pattern.
(e) Best patterns from experiment 2 not also found in experiment 1.

`
`include C..CCC.T (score 2.88), T.C..CCC (score
`2.85), and T.AGGG (score 2.27). Furthermore, the
`pattern CCCCT was also among the 10 highest scor-
`ing patterns found for the clusters Cr(3,5)(000010),
`Cr(5,3,5)(000020), and C(5,2,4)(0000022). These
`four clusters strongly overlap (17 ORFs are in all four
`clusters). DeRisi et al. (1997) describe a set contain-
`ing seven genes (see Fig. 5C in DeRisi et al. 1997) out
`of which five are contained in our cluster
`Cr(5,2,4)(000020). They note the presence of the
`
`pattern CCCCT in the upstream regions of each
`gene in their set and that it is known to be a stress-
`responsive motif.
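The score quoted for CCCCT in cluster Cr(5,2,4)(000020) can be approximated from the counts given in the text, under the assumption that the cluster score is the fraction of cluster sequences matching the pattern divided by the fraction of the remaining upstream regions matching it. Equation 1 in Methods is the authoritative definition and may include terms not modeled here; the function name is an assumption.

```python
def cluster_score(n_cluster_hits, cluster_size, n_other_hits, n_other):
    """Ratio of the fraction of cluster sequences matching a pattern to the
    fraction of the remaining upstream regions matching it (assumed form)."""
    return (n_cluster_hits / cluster_size) / (n_other_hits / n_other)

# Counts quoted in the text for CCCCT in cluster Cr(5,2,4)(000020).
ratio = cluster_score(35, 55, 1280, 5921)
```

The plain ratio comes out at about 2.94, close to but not exactly the reported 2.95, which suggests that the published score differs slightly from this simplified form.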
We also analyzed the upstream regions of the
genes in the clusters in which the expression level
decreases at time point 6, and found that they contain
patterns with matches to binding sites for the RAP1
factor, which is known to be related to the stringent
control of ribosomal protein gene transcription in S.
cerevisiae (Moehle and Hinnebusch 1991). Some of
`the patterns found in the upstream regions of genes
`in the clusters C(5,2,6)(00000-1-2) (27 sequences)
and C(5,3,6)(00000-1-2) (25 sequences) match substrings
in the sites for the REB1 and BAF1 proteins, which
are repressors (Diffley and Stillman 1988; Wang et
al. 1990). This corresponds well with the fact that
the clusters contain genes with decreasing expression
level.
The complete list of patterns discovered from
the clusters is available on the World Wide Web.
`DISCUSSION
`The results of the analysis of the complete set of
`gene upstream regions show that, given a genome
`with annotated genes, some transcription factor-
`binding site (or other regulatory element) descrip-
`tions can be generated without any background
`knowledge about the transcription factors of the or-
`ganism. As far as we know, these are the first results
`from an automatic method for the discovery of pos-
`sible regulatory elements applied to a complete ge-
`nome, when no background information about
`transcription factors is used. Additionally, we have
`used expression level data to find groups of genes
`with similar expression profiles and searched for
`patterns common in the upstream regions of genes
`in each c