Journal of Chemical Information and Computer Sciences
`
`Includes Chemical Computation and Molecular Modeling
`
`VOLUME 38, 1998
JCISD8 38(1-6) 1-1260 (1998)
`ISSN 0096-2338
`
`GEORGE W. A. MILNE, Editor
`
`Associate Editors
`
`A. J. Hopfinger
`
`Kenny Lipkowitz
`Reiner Luckenbach
`
`Wendy A. Warr
`
`Stephen R. Heller, Software Review Editor
`
`ADVISORY BOARD
`
`Alexandru T. Balaban
`Juergen Brickmann
`Johann Gasteiger
`
`James B. Hendrickson
`Barry K. Lavine
Milan Randić
`
`Harry P. Schultz
`Theodore Simos
`
`Charles L. Wilkins
`Peter Willett
`
`AMERICAN CHEMICAL SOCIETY, PUBLICATIONS DIVISION
`
`Robert D. Bovenschulte, Director
`
`Mary E. Scanlan, Director, Publishing Operations
`
`Anne C. O’Melia, Manager, Editorial Office
`
`Debora Ann Bittaker, Journals Editing Manager, Editorial Office
`
`Kathleen E. Duffy, Journals Editing Manager, Editorial Office
`
`Diane E. Needham, Journals Editing Manager, Editorial Office
`
Shana Sullivan, Associate Editor, Editorial Office
`
`CFAD v. Anacor, IPR2015-01776 ANACOR EX. 2125 - 1/13
`
`

`
This material may be protected by Copyright law (Title 17 U.S. Code)
`
J. Chem. Inf. Comput. Sci. 1998, 38, 1192-1203
`
`Bioactive Diversity and Screening Library Selection via Affinity Fingerprinting
`
Steven L. Dixon* and Hugo O. Villar

Telik, Inc., 750 Gateway Blvd., South San Francisco, California 94080
`
`Received June 6, 1998
`
The Similarity Principle provides the conceptual framework behind most modern approaches to library
`sampling and design. However, it is often the case that compounds which appear to be very similar structurally
`may in fact exhibit quite different activities toward a given target. Conversely, some targets recognize a
`wide variety of molecules and thus bind compounds that have markedly different structures. Affinity
`fingerprints largely overcome the difficulties associated with selecting compounds on the basis of structure
`alone. By describing each compound in terms of its binding affinity to a set of functionally dissimilar
`proteins, fundamental factors relevant to binding and biological activity are automatically encoded. We
`demonstrate how affinity fingerprints may be used in conjunction with simple algorithms to select active-
`enriched diverse training sets and to efficiently extract the most active compounds from a large library.
`
`INTRODUCTION
`
`High throughput screening (HTS) techniques are now used
`routinely in the pharmaceutical industry to assay the entire
`contents of large corporate libraries (> 100 000 compounds)
`against biological targets. While this brute-force approach
`to lead generation certainly has its place in the field of drug
`discovery, it is not practical to adapt to the HTS format every
`new target of potential biological importance. The steady
`stream of ‘‘low throughput” targets being produced by
`genomics research obligates the continued development of
library sampling techniques which can find low μM hits by
`screening relatively small numbers of compounds.
`
When there is no a priori knowledge regarding the structure of
active compounds, the generally accepted procedure is to screen
a diverse subset of the overall library,
`then examine compounds which are structurally similar to
`any promising leads. Within a given library, the success of
`this rational sampling approach is heavily dependent upon
`the makeup of the initial diverse subset, and, to some extent,
`on the way in which the similarity searching is done.
`
Once the algorithmic designs for diversity and similarity
have been specified, the remaining determinant of success
is the bioactive relevance of the descriptors used to characterize
each compound. Descriptors which are devoid of information that
is relevant to target activity cannot be expected to enhance the
success rate of rational sampling over that of random sampling.
While this point may seem obvious, it deserves special
consideration as compounds are represented in chemical spaces of
higher and higher dimensionality.1,2 Without proper selection of
descriptors, simply increasing the number of dimensions may tend
to obscure information provided by the bioactively relevant
subset. Descriptor selection itself is complicated by the fact
that it is difficult to know which elements of structure are
relevant without benefit of an existing QSAR model.
`
One way of addressing these issues is to redefine our
notions of compound space. The goal is to minimize the
number of dimensions and maximize the amount of bioactively
relevant information provided. We describe here the
selection of diverse and focused screening libraries based
on a bioactive profile of each compound. These profiles,
which we term “affinity fingerprints”,3 are based on the idea
that commonalities exist among the binding sites of certain
proteins and that these shared characteristics are manifested
by statistical correlations in binding affinity data. To the
extent that binding sites resemble one another, affinity
fingerprints encode information that is directly relevant to
bioactivity. We demonstrate how these bioactive profiles
of compounds compare to a typical set of structural
fingerprints known as the ISIS MOLSKEYS.4
`
`AFFINITY FINGERPRINTS
`
Details have been published elsewhere3 regarding the
experimental measurement of affinity fingerprints and the
general characteristics of the reference proteins, so we
provide only an outline here. Briefly, affinity fingerprints
are determined using high throughput competitive binding
assays, wherein each compound in our library is screened
against a panel of functionally dissimilar proteins. An IC50
value is determined in each assay, and the binding affinity
is defined as -log10(IC50) or pIC50. The set of binding
affinities measured for a single compound across the entire
panel of proteins is termed the affinity fingerprint.

Ideally, proteins are selected for the panel on the basis of
statistical criteria, the most important being orthogonality
to other panel members. At the same time, proteins must
also provide a minimum level of information about the
library, and we generally require that greater than 20% of
the compounds bind with an IC50 value below 100 μM. These
two criteria produce a panel that is fairly small (fewer than
20 proteins), yet highly informative.
`
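The definitions above can be illustrated with a minimal sketch. It assumes IC50 values are expressed in mol/L (the text does not state the unit used inside the logarithm), and the protein names and helper functions are hypothetical, chosen only for illustration.

```python
import math

def pic50(ic50_molar):
    """Return -log10(IC50); the IC50 is assumed to be given in mol/L."""
    if ic50_molar <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_molar)

def affinity_fingerprint(ic50_by_protein, panel):
    """List one compound's pIC50 values in the order of the reference panel."""
    return [pic50(ic50_by_protein[name]) for name in panel]

def fraction_binding(ic50_values, threshold_molar=100e-6):
    """Fraction of compounds whose IC50 falls below the threshold (100 uM)."""
    hits = sum(1 for value in ic50_values if value < threshold_molar)
    return hits / len(ic50_values)

# Hypothetical three-protein panel and one compound's measured IC50 values.
panel = ["protein_A", "protein_B", "protein_C"]
compound = {"protein_A": 1e-6, "protein_B": 5e-5, "protein_C": 2e-4}
fingerprint = affinity_fingerprint(compound, panel)
```

Under the information criterion quoted above, a candidate protein would be kept only if `fraction_binding` over the library exceeds 0.2.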
`
`

`
`ence between the measured and fitted binding values should
`certainly not approach the precision of the binding assay. In
`reproducibility experiments, we have found the pIC50 values
`from temporally separated HTS runs to correlate at usually
`no higher than r = 0.9. This serves as a strict upper bound
`for the R value in multilinear fits of the binding values for
`potential new panel members.
In practice, multicollinearities in any panel we have ever used
have ranged from about R = 0.35 to R = 0.75.
`
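As a rough sketch of the orthogonality check, the multiple correlation R of a candidate protein against the existing panel can be computed from an ordinary least-squares fit with an intercept. The function names are hypothetical, the fitting details are an assumption rather than the authors' stated procedure, and the small solver below does not handle exactly collinear panel columns.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def multiple_r(panel_columns, candidate):
    """Multiple correlation R of candidate pIC50s fitted on panel pIC50s."""
    n = len(candidate)
    cols = [[1.0] * n] + [list(c) for c in panel_columns]  # intercept first
    ata = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    aty = [sum(u * y for u, y in zip(ci, candidate)) for ci in cols]
    coef = solve(ata, aty)
    fitted = [sum(coef[k] * cols[k][i] for k in range(len(cols)))
              for i in range(n)]
    mean = sum(candidate) / n
    ss_tot = sum((y - mean) ** 2 for y in candidate)
    ss_res = sum((y - f) ** 2 for y, f in zip(candidate, fitted))
    return math.sqrt(max(0.0, 1.0 - ss_res / ss_tot))

def is_orthogonal_enough(panel_columns, candidate, r_max=0.9):
    """Apply the R < 0.9 criterion quoted in the text."""
    return multiple_r(panel_columns, candidate) < r_max
```

Each entry of `panel_columns` holds one panel protein's pIC50 values over the same compounds as `candidate`. The same fit, applied to a target protein instead of a candidate panel member, is one way to picture a linear surrogate model of the kind shown for human serum albumin.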
`It is important to note that when a reference panel is fairly
`small, say, fewer than 10 proteins, then it is relatively easy
`to find new proteins which satisfy the R < 0.9 criterion. As
`the panel becomes larger, however, it becomes increasingly
`difficult to find informative proteins that cannot be fit in
`terms of the existing panel. We believe this phenomenon
`to be a reflection of panel completeness. This is not meant
`to imply that such a panel will be able to accurately predict
`affinities for every other conceivable protein. Some proteins
`are so selective that perhaps only one compound in 10 000
will bind with an IC50 below 10 μM. These types of proteins
`would be considered highly orthogonal to just about any
`finite panel, but they provide so little chemical information
`about the vast majority of compounds, that their inclusion
`in the panel is not justified.
`Aside from statistical issues, there are of course practical
`considerations surrounding the composition of the panel.
`Protein availability, consistency, and stability, and whether
`or not a robust high throughput assay can be developed are
`factors that come into play.
`In some instances, a protein
`may cease to be available in sufficiently high quantity or
`quality, and it must be removed from the panel for all future
`fingerprinting. Thus, over time, the number of reference
`proteins we have used has fluctuated. The primary panel
`used in this investigation contained 16 proteins which were
`selected from a pool of several hundred according to the
`statistical and practical criteria just discussed. Some ex-
`amples are presented with proteins that are not members of
`this primary panel, but which are under consideration for
`future panels.
Affinity fingerprints, like conventional QSAR descriptors,
simply provide a means of characterizing or describing
compounds in multidimensional space. Unlike structural
descriptors, however, affinity fingerprints automatically tell
us whether a compound has some or all of the features that
are essential for favorable interaction with each of a wide
variety of binding sites. This is a very powerful tool for
rational sampling because most protein targets will share
some binding site characteristics with one or more proteins
in a sufficiently diverse panel. Figure 1 illustrates this
principle at work. For a set of 200 compounds with diverse
structures (average pairwise Tanimoto similarity = 0.309
based on ISIS MOLSKEYS), the measured binding affinities
against human serum albumin are accurately represented as
a linear combination of binding affinities from three other
proteins. Any notion of accuracy is always somewhat
subjective, but it must be remembered that this protein
surrogate model is not based upon a series of structural
analogues but rather on a set of compounds selected for
diversity across an entire library. Figure 2 contains a
representative sample (average pairwise similarity = 0.305)
of compounds from the model, and they are seen to contain
a wide variety of backbones and chemical functionalities.

Figure 1. Protein surrogate model for human serum albumin. For
a set of 200 structurally diverse compounds, the binding affinities
(pIC50 values) measured for human serum albumin are approximately
represented as a linear combination of binding affinities
from three other proteins. Compounds associated with solid points
are shown in Figure 2. (Plot of calculated affinity versus observed
affinity.)

The ability to construct such surrogate models may seem
at odds with our usual notions about proteins, i.e., that they
only recognize a small number of compounds which align
perfectly into a unique binding site. However, it is not
unusual for a high affinity ligand of one protein to bind
strongly to other proteins. Indeed, a lack of specificity is
the downfall of many promising lead compounds in the drug
discovery process. Also note that a great deal of the
information that underlies a protein surrogate model comes
from compounds which are not high affinity and thus have
only a subset of the features that are necessary to bind
strongly to the target protein. However, this is the exact
sort of information that is required in order to carry out a
search that starts with moderate affinity compounds and
ultimately locates high affinity compounds.

MOLECULAR DIVERSITY

The scientific literature is increasingly populated by books
and articles that address a wide range of issues surrounding
the field of molecular diversity.5,6 Topics include the choice
of which molecular descriptors to employ, the proper
means of selecting diversity, and reducing the dimensionality
of compound space. Since the way in which
each compound is represented ultimately limits the success
of all subsequent procedures, we begin the discussion with
this fundamental and critical issue.

A number of important investigations have focused on the
selection of molecular descriptors that are able
to distinguish compounds on the basis of biological activity.
The datasets analyzed have usually been comprised of
`
`

`
`
Figure 2. Sample compounds from human serum albumin surrogate model.
(Structures not reproduced; each compound is labeled with its observed
and calculated pIC50 values.)
`
`compounds that are selected either according to known
activity against one or more targets or from libraries
generated as a result of SAR studies around active hits. These
`datasets certainly encompass a great deal of bioactive
`diversity, and they are appropriate for demonstrating various
`properties of molecular descriptors. However, since the
`collections have tended to be biased toward the targets of
`
`tors which are able to distinguish actives from inactives in
`small, biased datasets do much less well when applied to
`HTS data from larger, unbiased libraries. Since we are
`ultimately concerned with the discovery of new drugs in a
practical setting, it is important to consider molecular
`diversity in the context of real libraries, where only a tiny
`fraction of the compounds will exhibit high activity.
`
`
`

`
Figure 3. A counterintuitive example of structural diversity and its relationship to binding affinity for a member of the reference panel.
Here, binding affinity is indicated by the IC50 value from a competitive binding assay. (Structures not reproduced.)
`
diversity can sometimes seem to be at odds. For one of our
reference proteins, IC50 values and structures for several high
affinity compounds are shown. This series of nM hits clearly
exhibits a considerable amount of structural diversity, with
`
an average pairwise Tanimoto similarity of only 0.421 based
on the ISIS MOLSKEYS. Though they all look different
to the eye of a chemist, they all in fact bind quite strongly,
implying a high degree of similarity from the protein's
perspective. Paradoxically, when small structural modifications
are made to one of these compounds (average similarity
to hit = 0.873), the affinity drops by more than five orders
of magnitude. From the perspective of the protein, then,
these compounds are all quite different from the 19 nM hit
they so closely resemble structurally.
Examples such as these are not difficult to find after a
library has been assayed against several proteins. Nevertheless,
it is very perplexing that high affinity can be preserved
while leaping across chemical families, yet it can be
destroyed altogether by one small structural change. The
generally accepted explanation is based on a pharmacophore
concept, i.e., that all of the high affinity compounds must in
fact possess the correct combination and orientation of groups
to interact favorably with the binding site on the protein.
One small change can remove one of these essential elements
and a great deal of the affinity.
Once a compound with these essential features is identified,
it is often possible to explore the chemical space around
it and develop a QSAR that satisfactorily explains the
variations in affinity. In general, though, the relationship is
only valid for compounds sufficiently similar to those used
in developing the model. This is because the descriptors
are frequently just measuring small differences among
compounds that all share some common backbone or scaffold
which provides an appropriate framework for attachment of
groups that can lead to high affinity. In effect, it is holding
constant a myriad of other factors that govern affinity to a
particular binding site. If this active backbone is replaced,
then the descriptors may not carry over to a
QSAR built around the new backbone.
This discussion is directly relevant to the issue of molecular
diversity, because many structural parameters, while
extremely effective in explaining differences in biological
activity among compounds that are referenced to some
restricted template, do not necessarily give meaningful
comparisons of activity among things that have gross
structural differences. It is very difficult to conceive of
structural descriptors that can reliably predict activity for a
given target across the range of compounds present in a
typical corporate library. If there were such descriptors, then
HTS would not be nearly so widespread as it is. The lack
of library-wide QSARs is a result of our inability to model
the range of thermodynamic processes involved in binding
between a protein and an essentially unrestricted collection
of compounds. Yet in order to select subsets of compounds
that exhibit significantly more bioactive diversity than
random sampling, one needs descriptors which correlate with
activity in this global sense. The implication is that unless
one has access to such parameters, then focusing on many
of the finer points of molecular diversity may have a limited
impact on the amount of bioactive diversity present in the
compounds selected.
`
`
And, of course, there are certain types of structural features
in small molecules that should generally be avoided because
they are associated with undesirable chemical and pharmacokinetic
properties. Overall design issues can also have
an impact, as there are extreme cases where a diversity
algorithm does not give reasonable coverage of the space it
samples. But in the absence of library-wide knowledge
of biological activity, there is little reason to believe that
one subset of compounds selected in an unbiased fashion
will be significantly more prolific than another when it comes
to generating leads from large libraries that have no particular
bias toward the target of interest.
`
`DIVERSITY ALGORITHMS
`
We now turn our attention to the issue of algorithms for
compound selection and focus on two simple but significantly
different diversity designs: one which selects compounds
that are distributed in an approximately uniform fashion
throughout space and one which samples compounds only
from the edges of space. These two subsetting approaches,
which we shall refer to as spread and edge, were chosen to
see whether or not drastic differences in design really make
any difference, and also whether deliberately selecting
outliers, i.e., edge compounds, would have a deleterious
effect on rational sampling.
In spread design, the goal is to select a subset of
compounds S that fills the chosen descriptor space with
minimal redundancy. The approach we adopt involves
picking subset members that are as far away as possible, on
average, from their nearest neighbors. Accordingly, the
objective function to maximize for spread diversity was
defined as

$$O_{\mathrm{spread}} = \sum_{i \in S} \min(d_{ij} : j \in S,\ j \neq i) \tag{1}$$

In the case of affinity fingerprints, $d_{ij}$ represents the Euclidean
distance between compounds i and j. For binary structure
keys, $d_{ij}$ is simply one minus the Tanimoto similarity, which,
for a string of n bits, is given by

$$d_{ij} = 1 - \frac{\sum_{k=1}^{n} \mathrm{bit}_{ik}\,\mathrm{bit}_{jk}}{\sum_{k=1}^{n} \left(\mathrm{bit}_{ik}^{2} + \mathrm{bit}_{jk}^{2} - \mathrm{bit}_{ik}\,\mathrm{bit}_{jk}\right)} \tag{2}$$
`
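The two distance measures just described can be written out directly. This is a minimal sketch with hypothetical function names; treating two all-zero keys as identical is an assumption, since the text does not cover that case.

```python
import math

def euclidean_distance(fp_i, fp_j):
    """Euclidean distance between two affinity fingerprints (pIC50 vectors)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp_i, fp_j)))

def tanimoto_distance(bits_i, bits_j):
    """One minus the Tanimoto similarity of two equal-length 0/1 keys."""
    if len(bits_i) != len(bits_j):
        raise ValueError("bit strings must have the same length")
    shared = sum(a * b for a, b in zip(bits_i, bits_j))
    denom = sum(a * a + b * b - a * b for a, b in zip(bits_i, bits_j))
    if denom == 0:
        return 0.0  # assumption: two all-zero keys are treated as identical
    return 1.0 - shared / denom

# Two 4-bit keys sharing one set bit out of three set overall.
example = tanimoto_distance([1, 1, 0, 0], [1, 0, 1, 0])
```

For 0/1 keys the denominator reduces to the number of bits set in either key, so `example` equals 1 - 1/3.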
A simple stochastic procedure is used to maximize the
objective function. Starting with a randomly chosen subset,
the two compounds with the smallest pairwise distance are
identified. Of those two, the one which is closer to some
other compound in S is flagged for ejection. This flagged
compound is exchanged for one that is outside of S if the
exchange will bring about an overall increase in the objective
function. A series of these pairwise exchanges is made until
no further increase in $O_{\mathrm{spread}}$ can be achieved. At this point,
a new random subset is selected, and the procedure is
repeated. After several random restarts, the collection of
compounds with the highest associated objective function
is retained.
`
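The exchange procedure for spread design might be sketched as follows. The text does not say how the outside compound is chosen, so this sketch assumes candidates are tried in random order and the first improving exchange is accepted. Euclidean distance is used, the function names are hypothetical, and the subset must hold at least three members.

```python
import math
import random

def spread_objective(points, subset):
    """Sum over subset members of the distance to the nearest other member."""
    return sum(min(math.dist(points[i], points[j]) for j in subset if j != i)
               for i in subset)

def pick_flagged(points, subset):
    """Find the closest pair in S; flag the member nearer to a third member."""
    pairs = [(math.dist(points[a], points[b]), a, b)
             for k, a in enumerate(subset) for b in subset[k + 1:]]
    _, a, b = min(pairs)

    def nearest_other(x, partner):
        return min(math.dist(points[x], points[j])
                   for j in subset if j not in (x, partner))

    return a if nearest_other(a, b) <= nearest_other(b, a) else b

def spread_design(points, size, restarts=5, seed=0):
    """Stochastic pairwise exchanges with random restarts; keep the best."""
    rng = random.Random(seed)
    best_subset, best_score = None, -math.inf
    for _ in range(restarts):
        subset = rng.sample(range(len(points)), size)
        score = spread_objective(points, subset)
        improved = True
        while improved:
            improved = False
            flagged = pick_flagged(points, subset)
            outside = [i for i in range(len(points)) if i not in subset]
            rng.shuffle(outside)
            for cand in outside:
                trial = [cand if i == flagged else i for i in subset]
                trial_score = spread_objective(points, trial)
                if trial_score > score:  # accept only strict improvements
                    subset, score, improved = trial, trial_score, True
                    break
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Synthetic 2-D Gaussian points stand in for fingerprints (illustration only).
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
subset, score = spread_design(points, size=10)
```

Because only strictly improving exchanges are accepted, each restart ends at a local optimum; the number of restarts controls how many starting configurations are sampled.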
`
`

`
Figure 4. Illustration of spread and edge diversity designs using a two-dimensional distribution of Gaussian random data points. Cluster
centroids from hierarchical agglomerative clustering are included for comparison. (Panels include Edge Design and Clustering; axes are
Variable 1 and Variable 2, each spanning about -1.0 to 1.0.)
`
developed, and one stochastic approach with a mechanism
for local optimization can be made to perform about as well
as another. The effectiveness of these “global” methods in
locating satisfactory optima lies not so much in the subtle
ways in which they go from totally arbitrary to locally
optimal, but more in the opportunity that they afford to
sample a wide range of randomly generated configurations.
In simulated annealing, for example, this aspect is controlled
by the cooling schedule. In our algorithm, it is controlled
by the number of random restarts.

While the spread design seeks to maximize the average
distance between each subset member and its nearest
neighbor, the edge design attempts to maximize the average
distance between each subset member and all the remaining
compounds in S,

$$O_{\mathrm{edge}} = \sum_{i,j \in S} \left[ d_{ij} - d_{av}\,(d_{av}/d_{ij}) \right] \tag{3}$$

This expression contains a penalty term, $-1/d_{ij}$, that prevents
any two highly similar compounds from being selected. The
average pairwise distance $d_{av}$ observed over the entire library
is used to construct a reasonable scaling factor for the penalty
term. It is of course expensive to compute $d_{av}$ for extremely
large libraries, but a randomly chosen subset of 1000
compounds is usually sufficient to give a reliable estimate
of this quantity.

Using the same type of stochastic approach described
earlier, a series of pairwise exchanges is made in order to
increase the objective function. The only difference is that
the compound flagged for ejection is the one which exhibits
the smallest average distance to the other subset members.

Both of these diversity algorithms are extremely simple
to implement, and their computational expenses scale only
linearly with the size of the overall library. They do exhibit
quadratic scaling with respect to the size of S, but this does
not become a serious drawback unless very large subsets
are desired. The intended use here is for testing in a low
throughput mode, so typically only 50-100 compounds
would be selected, and quadratic scaling is not an issue.

Figure 4 illustrates how these selection methods behave
when applied to a set of 2000 synthetically generated points
in 2-D space. Here, a subset of 50 points (enclosed by boxes)
was selected using each diversity algorithm. For comparison,
results from hierarchical agglomerative clustering with
complete linkage are also included. In this case, 50 clusters
were generated, and the point closest to each centroid was
selected for the subset. Clustering bears some resemblance
to spread in overall appearance, but the former is seen to be
affected somewhat more by variations in the density of
points. This is certainly not a criticism, but hierarchical
agglomerative clustering does become prohibitively expensive
for large libraries, regardless of the subset size. Note
that the spread and edge subsets are quite different, so they
should provide a good demonstration of the effect of diversity
design on rational sampling.
`
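A minimal sketch of the edge design follows, under stated assumptions: the sum runs over unordered pairs in S (an ordered sum would only double the value), the average library distance is estimated from a random sample of up to 1000 compounds, and outside candidates are tried in random order with the first improving exchange accepted. A single random start is shown, and all function names are hypothetical.

```python
import math
import random
from itertools import combinations

def estimate_d_av(points, sample_size=1000, seed=0):
    """Average pairwise distance over a random sample of the library."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(points)), min(sample_size, len(points)))
    dists = [math.dist(points[a], points[b]) for a, b in combinations(idx, 2)]
    return sum(dists) / len(dists)

def edge_objective(points, subset, d_av):
    """Sum over pairs in S of d_ij - d_av * (d_av / d_ij)."""
    total = 0.0
    for a, b in combinations(subset, 2):
        d = math.dist(points[a], points[b])
        if d == 0.0:
            return -math.inf  # coincident points: the penalty is unbounded
        total += d - d_av * (d_av / d)
    return total

def flag_for_ejection(points, subset):
    """Member with the smallest average distance to the other members."""
    def mean_distance(i):
        others = [j for j in subset if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)
    return min(subset, key=mean_distance)

def edge_design(points, size, d_av, seed=0):
    """One run of pairwise exchanges from a random start (no restarts here)."""
    rng = random.Random(seed)
    subset = rng.sample(range(len(points)), size)
    score = edge_objective(points, subset, d_av)
    improved = True
    while improved:
        improved = False
        flagged = flag_for_ejection(points, subset)
        outside = [i for i in range(len(points)) if i not in subset]
        rng.shuffle(outside)
        for cand in outside:
            trial = [cand if i == flagged else i for i in subset]
            trial_score = edge_objective(points, trial, d_av)
            if trial_score > score:  # accept only strict improvements
                subset, score, improved = trial, trial_score, True
                break
    return subset, score

# Synthetic 2-D Gaussian points stand in for fingerprints (illustration only).
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(300)]
d_av = estimate_d_av(points)
subset, score = edge_design(points, size=10, d_av=d_av)
```

Wrapping `edge_design` in a loop over several seeds and keeping the best score would mirror the random restarts described for the spread design.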
`

`
Figure 5. Affinity space representation of bioactively diverse (P1, P2) and structurally diverse (MOLSKEYS) subsets selected from a
library of 8000 compounds. P3 is a protein which is statistically related to the affinity fingerprint proteins P1 and P2.
(Panels: Spread (P1,P2); Edge (P1,P2); Spread (MOLSKEYS); Edge (MOLSKEYS). Axes: P1 Affinity and P2 Affinity. Markers: subset
member; P3 active.)

`BIOACTIVE DIVERSITY EXAMPLES
`
`As stated earlier, when one has access to molecular
`descriptors that correlate with target activity across an entire
`library, then the subtle issues surrounding diversity become
`more relevant. Distances in descriptor space then have a
`direct bearing on the distribution of activities, so it is possible
`to control, to some extent, the bioactive diversity of a subset.
`Figure 5 illustrates the results of a diversity exercise carried
`out on 8000 compounds from our library. Here, P1 and P2
`are proteins that comprise a 2-D affinity fingerprint, and P3
`is a third “target” protein that exhibits a multicollinearity of
`R = 0.75 with P1 and P2. This is a stronger relationship
`than would normally exist between most targets and the
`reference panel, but these proteins were selected to provide
`a clear demonstration of the ability of bioactively relevant
`descriptors to select a greater number of compounds with
`high activity against the target.
`For this exercise, active compounds on P3 were defined
to be those with IC50 values below 1 μM. Using both edge
and spread designs, subsets of 50 compounds were selected
in the 2-D affinity fingerprint space (P1,P2), and in the 166-bit
structural fingerprint space of the ISIS MOLSKEYS. All
Table 1. Summary of Average Properties of 1000 Randomly
Selected Library Compounds

property                    av value
MW                          278
log P                       2.59
no. of rings                2.18
diameter (in bonds)         9.90
no. of H-bond donors        1.52
no. of H-bond acceptors     5.19
no. of hydrogens            15.3
no. of carbons              13.9
no. of nitrogens            1.91
no. of oxygens              2.52
`
`The edge algorithm applied with P1 and P2 locates seven
`of ten P3 active compounds, while the spread algorithm finds
`five of ten. Note that the success of the spread algorithm is
`essentially a consequence of its tendency to sample some
`compounds from the edge. When the MOLSKEYS are used
`to select compounds, the edge technique generates a subset
`which appears to be slightly more diverse in P1,P2 space
`than when spread is used, and the MOLSKEYS edge design
`selects one P3 active compound.
`It should be apparent from Figure 5 that the majority of
`
`
`

`
Figure 6. Frequency distributions of binding affinities are used to illustrate the bioactive diversity, with respect to protein P3, of compound
subsets selected according to diversity in affinity fingerprint space (P1, P2) and structural space (MOLSKEYS).
(Histogram panels, including Spread (MOLSKEYS) and Edge (MOLSKEYS), overlay each diverse subset on the random subset.
Axes: P3 Binding Affinity and Frequency.)
`
but it illustrates an underlying reason why affinity fingerprints
can be so powerful. Although the compounds with high
affinity for a particular target may not always cluster in one
small region of affinity fingerprint space, they do tend to
bind strongly to at least one member of the reference panel
and are thus distinct from the concentrated mass of compounds
that have low affinity. This sort of separation is
difficult if not impossible to achieve with ordinary structural
descriptors, simply because the low affinity compounds are
so diverse structurally.

Continuing with this example, Figure 6 summarizes the
frequency distributions of P3 binding affinities for subsets
of compounds selected as before. In these cases, however,
the subset size was increased from 50 to 200 so as to obtain
smoother statistics. For comparison, a single set of 200
compounds was selected randomly, and its distribution is
overlaid with that of each diverse subset.

Compared to random selection, both of the P1,P2 subsets
show a significant shift in the distribution toward higher
affinities. The edge design actually exhibits a near uniform
distribution over about three orders of magnitude in concentration.
Intuitively speaking, this result is perhaps more
along the lines of what would be expected with spread
diversity. However, it must be remembered that the natural
distribution of affinities is essentially a skewed bell shape,
so the only way to achieve a uniformly distributed sample
is to significantly bias the selection toward higher affinities.

By contrast, structurally diverse subsets selected using the
MOLSKEYS appear to offer little or no advantage over
random sampling as far as P3 is concerned. The spread
design compounds are distributed very much like the random
subset, with the only difference being that random selection
appears to result in slightly higher average affinity. The edge
design differs more significantly from random than does
spread, but these differences are confined primarily to the
region of low affinity compounds.

RATIONAL SAMPLING

After a diverse subset of compounds has been screened
against a target, the activity data obtained from this training
set may be used to select focused libraries for subsequent
examination. Ideally, a series of small, focused blocks of
compounds is screened, with new information being incorporated
at the end of each block in order to fine tune the
search for active compounds. This rational sampling approach
to lead generation and optimization is summarized
in Figure 7.

Each focused block is nothing more than a collection of
compounds which are expected, or at least hoped, to be
active. Upon screening, most of these compounds will
usually turn out to be inactive, but with proper design of the
focused block, the compounds should show a higher level
of activity, on average, than the large library from which
`
`

`
Figure 7. Flowchart summary of rational sampling methodology.
(Flowchart boxes: Compound Library; Diversity Algorithm; Training Set; Determine Activity against Target; Refine Activity
Model; Select Focused Library.)

Figure 8. Illustration of the iterative search procedure used in
nearest neighbors rational sampling.
`
`they were selected. The choice of compounds may be based
`on an outright model of activity that can be applied to the
`remaining unscreened portion of the library, or it may be
`based on an implicit model, which assumes a neighborhood
`behavior of activity. In this paper, we focus on the latter
`approach and employ a straightforward nearest neighbors
`search around lead compounds in order to locate additional
`high-activity compounds.
`Figure 8 illustrates the iterative search procedure used in
`nearest neighbors rational sampling.
`Initial leads are simply
`the handful of most active compounds uncovered in the
`training set screen, and they may or may not be actives.
`However, each time an active compound is discovered in
`the focused screen, it is incorporated into the list of leads.
`A search that cycles repetitively through all the leads is used
`so that the focused library is not confined to one region of
`compound space, and so that a disproportionate amount of
`time is not spent screening analogues of a lead that has no
`actives nearby.
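The cycling search described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the use of Euclidean distance, and the toy two-dimensional descriptors are all assumptions made for the example. Each pass visits every lead once and takes that lead's nearest unscreened neighbor, so no single lead dominates the block.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_focused_block(descriptors, leads, screened, block_size):
    """Pick up to `block_size` unscreened compounds by cycling through the leads.

    descriptors: dict mapping compound id -> descriptor vector
    leads:       list of compound ids currently treated as leads
    screened:    set of compound ids already screened (never re-selected)
    """
    chosen = []
    taken = set(screened)
    while len(chosen) < block_size:
        picked_this_cycle = False
        for lead in leads:
            if len(chosen) == block_size:
                break
            candidates = [c for c in descriptors if c not in taken]
            if not candidates:
                return chosen  # library exhausted
            nearest = min(
                candidates,
                key=lambda c: euclidean(descriptors[c], descriptors[lead]),
            )
            chosen.append(nearest)
            taken.add(nearest)
            picked_this_cycle = True
        if not picked_this_cycle:
            break  # no leads to search around
    return chosen


# Hypothetical toy library: two leads ("A" and "D") in different regions.
descriptors = {
    "A": (0.0, 0.0), "B": (0.1, 0.0), "C": (0.2, 0.0),
    "D": (5.0, 5.0), "E": (5.1, 5.0), "F": (9.0, 9.0),
}
block = select_focused_block(descriptors, leads=["A", "D"],
                             screened={"A", "D"}, block_size=3)
print(block)
```

After a block is screened, any newly found actives would be appended to `leads` before the next call, which is how the search expands into new regions of compound space.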
`Screening in blocks not only allows the search to be
`expanded around new leads but also affords the opportunity
`to determine which descriptors are contributing relevant
`information and which are not. Distances in compound space
`should reflect, as much as possible, the relative positions of
`compounds in activity space. One way of accomplishing
`this is to use what we call activity-biased scaling. A given
`descriptor $x_k$ is weighted according to how strongly it has
`been observed to correlate with activity over the set of
`compounds that has already been screened,
`
`$x_k \rightarrow |r_k| \, x_k$    (4)
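As a rough illustration of eq 4, the sketch below rescales each descriptor by the absolute value of its correlation with measured activity. The definition of the weight is not visible in this excerpt, so the use of the Pearson correlation coefficient is an assumption, as are the function names and the toy data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def activity_biased_weights(screened_descriptors, activities):
    """One weight |r_k| per descriptor, computed over screened compounds only."""
    n_descriptors = len(screened_descriptors[0])
    return [
        abs(pearson_r([row[k] for row in screened_descriptors], activities))
        for k in range(n_descriptors)
    ]


def scale(vector, weights):
    """Apply x_k -> |r_k| * x_k to a single descriptor vector."""
    return [w * x for w, x in zip(weights, vector)]


# Hypothetical data: three screened compounds, two descriptors each.
screened = [(1.0, 3.0), (2.0, 1.0), (3.0, 2.0)]
activities = [2.0, 4.0, 6.0]

weights = activity_biased_weights(screened, activities)
scaled = scale([4.0, 4.0], weights)
print(weights, scaled)
```

In this toy case the first descriptor tracks activity perfectly and keeps its full scale, while the second correlates only weakly and is shrunk, so distances computed after scaling are dominated by the descriptor that has so far been informative.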
`
`S-transferase P1-1 (GST P1-1) and papain. The library
`contained 20 000 compounds, some properties of which are
`summarized in Table I. These compounds were obtained
`from various vendors, through collaborations with other
`companies, and from synthetic work for internal projects. A
`more detailed description of the types of compounds in our
`library is given in ref 3.
`For each target, the 1
