`Information and
`Computer Sciences
`
`Includes Chemical Computation and Molecular Modeling
`
`VOLUME 38, 1998
`JCISD8 38(1—6) 1-1260 (1998)
`ISSN 0096-2338
`
`GEORGE W. A. MILNE, Editor
`
`Associate Editors
`
`A. J. Hopfinger
`
`Kenny Lipkowitz
`Reiner Luckenbach
`
`Wendy A. Warr
`
`Stephen R. Heller, Software Review Editor
`
`ADVISORY BOARD
`
`Alexandru T. Balaban
`Juergen Brickmann
`Johann Gasteiger
`
`James B. Hendrickson
`Barry K. Lavine
Milan Randić
`
`Harry P. Schultz
`Theodore Simos
`
`Charles L. Wilkins
`Peter Willett
`
`AMERICAN CHEMICAL SOCIETY, PUBLICATIONS DIVISION
`
`Robert D. Bovenschulte, Director
`
`Mary E. Scanlan, Director, Publishing Operations
`
`Anne C. O’Melia, Manager, Editorial Office
`
`Debora Ann Bittaker, Journals Editing Manager, Editorial Office
`
`Kathleen E. Duffy, Journals Editing Manager, Editorial Office
`
`Diane E. Needham, Journals Editing Manager, Editorial Office
`
Shana Sullivan, Associate Editor, Editorial Office
`
`
`CFAD v. Anacor, IPR2015-01776 ANACOR EX. 2125 - 1/13
`
`
`
This material may be protected by Copyright law (Title 17 U.S. Code)
`
J. Chem. Inf. Comput. Sci. 1998, 38, 1192-1203
`
`Bioactive Diversity and Screening Library Selection via Affinity Fingerprinting
`
Steven L. Dixon* and Hugo O. Villar

Telik, Inc., 750 Gateway Blvd., South San Francisco, California 94080
`
`Received June 6, 1998
`
The Similarity Principle provides the conceptual framework behind most modern approaches to library
`sampling and design. However, it is often the case that compounds which appear to be very similar structurally
`may in fact exhibit quite different activities toward a given target. Conversely, some targets recognize a
`wide variety of molecules and thus bind compounds that have markedly different structures. Affinity
`fingerprints largely overcome the difficulties associated with selecting compounds on the basis of structure
`alone. By describing each compound in terms of its binding affinity to a set of functionally dissimilar
`proteins, fundamental factors relevant to binding and biological activity are automatically encoded. We
`demonstrate how affinity fingerprints may be used in conjunction with simple algorithms to select active-
`enriched diverse training sets and to efficiently extract the most active compounds from a large library.
`
`INTRODUCTION
`
`High throughput screening (HTS) techniques are now used
`routinely in the pharmaceutical industry to assay the entire
`contents of large corporate libraries (> 100 000 compounds)
`against biological targets. While this brute-force approach
`to lead generation certainly has its place in the field of drug
`discovery, it is not practical to adapt to the HTS format every
`new target of potential biological importance. The steady
stream of “low throughput” targets being produced by
genomics research obligates the continued development of
library sampling techniques which can find low μM hits by
screening relatively small numbers of compounds.
`
When there is no a priori knowledge regarding the structure of active compounds, the generally accepted procedure is to screen a diverse subset of the overall library, then examine compounds which are structurally similar to any promising leads. Within a given library, the success of this rational sampling approach is heavily dependent upon the makeup of the initial diverse subset, and, to some extent, on the way in which the similarity searching is done.
`
Once the algorithmic designs for diversity and similarity have been specified, the remaining determinant of success is the bioactive relevance of the descriptors used to characterize each compound. Descriptors which are devoid of information that is relevant to target activity cannot be expected to enhance the success rate of rational sampling over that of random sampling. While this point may seem obvious, it deserves special consideration as compounds are represented in chemical spaces of higher and higher dimensionality.1,2 Without proper selection of descriptors, simply increasing the number of dimensions may tend to obscure information provided by the bioactively relevant subset. Descriptor selection itself is complicated by the fact that it is difficult to know which elements of structure are relevant without benefit of an existing QSAR model.
`
One way of addressing these issues is to redefine our notions of compound space. The goal is to minimize the number of dimensions and maximize the amount of bioactively relevant information provided. We describe here the selection of diverse and focused screening libraries based on a bioactive profile of each compound. These profiles, which we term “affinity fingerprints”,3 are based on the idea that commonalities exist among the binding sites of certain proteins and that these shared characteristics are manifested by statistical correlations in binding affinity data. To the extent that binding sites resemble one another, affinity fingerprints encode information that is directly relevant to bioactivity. We demonstrate how these bioactive profiles of compounds compare to a typical set of structural fingerprints known as the ISIS MOLSKEYS.4
`
`AFFINITY FINGERPRINTS
`
Details have been published elsewhere3 regarding the experimental measurement of affinity fingerprints and the general characteristics of the reference proteins, so we provide only an outline here. Briefly, affinity fingerprints are determined using high throughput competitive binding assays, wherein each compound in our library is screened against a panel of functionally dissimilar proteins. An IC50 value is determined in each assay, and the binding affinity is defined as −log10(IC50), or pIC50. The set of binding affinities measured for a single compound across the entire panel of proteins is termed the affinity fingerprint.

Ideally, proteins are selected for the panel on the basis of statistical criteria, the most important being orthogonality to other panel members. At the same time, proteins must also provide a minimum level of information about the library, and we generally require that greater than 20% of the compounds bind with an IC50 value below 100 μM. These two criteria produce a panel that is fairly small (<20 proteins), yet highly informative.
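
As a minimal sketch of the definitions above, the following Python fragment converts IC50 values to pIC50, assembles a fingerprint vector, and checks the "greater than 20% below 100 μM" information criterion. The function names, the three-protein panel, and the numerical values are hypothetical, and IC50 is assumed to be expressed in mol/L.

```python
import math

def pic50(ic50_molar):
    """Binding affinity as -log10(IC50); IC50 is assumed to be in mol/L."""
    return -math.log10(ic50_molar)

def affinity_fingerprint(ic50_by_protein, panel):
    """Ordered vector of pIC50 values, one entry per reference protein."""
    return [pic50(ic50_by_protein[p]) for p in panel]

def fraction_binding(ic50_values, threshold_molar=100e-6):
    """Fraction of compounds whose IC50 falls below the threshold (100 uM)."""
    hits = sum(1 for v in ic50_values if v < threshold_molar)
    return hits / len(ic50_values)

def is_informative(ic50_values, min_fraction=0.20):
    """Information criterion from the text: more than 20% of compounds bind."""
    return fraction_binding(ic50_values) > min_fraction

# Hypothetical panel and made-up IC50 values (mol/L) for one compound.
panel = ["P1", "P2", "P3"]
compound = {"P1": 1e-6, "P2": 2.5e-4, "P3": 1e-5}
fingerprint = affinity_fingerprint(compound, panel)
```

Each compound then becomes a point in a space with one axis per reference protein, which is the representation the diversity algorithms below operate on.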
`
`
`
`
`ence between the measured and fitted binding values should
`certainly not approach the precision of the binding assay. In
`reproducibility experiments, we have found the pIC50 values
`from temporally separated HTS runs to correlate at usually
`no higher than r = 0.9. This serves as a strict upper bound
`for the R value in multilinear fits of the binding values for
`potential new panel members.
In practice, multicollinearities
in any panel we have ever used have ranged from about R
= 0.35 to R = 0.75.
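
The multicollinearity check can be illustrated with a short, dependency-free sketch. It assumes that R is the multiple correlation coefficient of an ordinary least-squares fit (with intercept) of a candidate protein's pIC50 values against those of the existing panel; the helper names and the use of the normal equations are illustrative choices, not a description of the authors' software.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def multiple_r(panel_affinities, candidate):
    """R of the least-squares fit: candidate ~ intercept + panel columns."""
    rows = [[1.0] + list(r) for r in panel_affinities]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, candidate)) for i in range(p)]
    coef = solve(xtx, xty)
    fitted = [sum(c * x for c, x in zip(coef, r)) for r in rows]
    mean_y = sum(candidate) / len(candidate)
    ss_res = sum((y - f) ** 2 for y, f in zip(candidate, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in candidate)
    return max(0.0, 1.0 - ss_res / ss_tot) ** 0.5

def passes_orthogonality(panel_affinities, candidate, r_max=0.9):
    """Apply the R < 0.9 criterion quoted in the text."""
    return multiple_r(panel_affinities, candidate) < r_max
```

Here `panel_affinities` holds one row per compound and one column per existing panel protein. A candidate whose affinities are an exact linear combination of the panel columns gives R close to 1 and would be rejected.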
`
`It is important to note that when a reference panel is fairly
`small, say, fewer than 10 proteins, then it is relatively easy
`to find new proteins which satisfy the R < 0.9 criterion. As
`the panel becomes larger, however, it becomes increasingly
`difficult to find informative proteins that cannot be fit in
`terms of the existing panel. We believe this phenomenon
`to be a reflection of panel completeness. This is not meant
`to imply that such a panel will be able to accurately predict
`affinities for every other conceivable protein. Some proteins
`are so selective that perhaps only one compound in 10 000
will bind with an IC50 below 10 μM. These types of proteins
`would be considered highly orthogonal to just about any
`finite panel, but they provide so little chemical information
`about the vast majority of compounds, that their inclusion
`in the panel is not justified.

Aside from statistical issues, there are of course practical considerations surrounding the composition of the panel. Protein availability, consistency, and stability, and whether or not a robust high throughput assay can be developed are factors that come into play. In some instances, a protein may cease to be available in sufficiently high quantity or quality, and it must be removed from the panel for all future fingerprinting. Thus, over time, the number of reference proteins we have used has fluctuated. The primary panel used in this investigation contained 16 proteins which were selected from a pool of several hundred according to the statistical and practical criteria just discussed. Some examples are presented with proteins that are not members of this primary panel, but which are under consideration for future panels.

Affinity fingerprints, like conventional QSAR descriptors, simply provide a means of characterizing or describing compounds in multidimensional space. Unlike structural descriptors, however, affinity fingerprints automatically tell us whether a compound has some or all of the features that are essential for favorable interaction with each of a wide variety of binding sites. This is a very powerful tool for rational sampling because most protein targets will share some binding site characteristics with one or more proteins in a sufficiently diverse panel. Figure 1 illustrates this principle at work. For a set of 200 compounds with diverse structures (average pairwise Tanimoto similarity = 0.309 based on ISIS MOLSKEYS), the measured binding affinities against human serum albumin are accurately represented as a linear combination of binding affinities from three other proteins. Any notion of accuracy is always somewhat subjective, but it must be remembered that this protein surrogate model is not based upon a series of structural analogues but rather on a set of compounds selected for diversity across an entire library. Figure 2 contains a representative sample (average pairwise similarity = 0.305) of compounds from the model, and they are seen to contain a wide variety of backbones and chemical functionalities.

The ability to construct such surrogate models may seem at odds with our usual notions about proteins, i.e., that they only recognize a small number of compounds which align perfectly into a unique binding site. However, it is not unusual for a high affinity ligand of one protein to bind strongly to other proteins. Indeed, a lack of specificity is the downfall of many promising lead compounds in the drug discovery process. Also note that a great deal of the information that underlies a protein surrogate model comes from compounds which are not high affinity and thus have only a subset of the features that are necessary to bind strongly to the target protein. However, this is the exact sort of information that is required in order to carry out a search that starts with moderate affinity compounds and ultimately locates high affinity compounds.

Figure 1. Protein surrogate model for human serum albumin. For a set of 200 structurally diverse compounds, the binding affinities (pIC50 values) measured for human serum albumin are approximately represented as a linear combination of binding affinities from three other proteins. Compounds associated with solid points are shown in Figure 2. [Scatter plot of Calculated Affinity versus Observed Affinity.]

Figure 2. Sample compounds from human serum albumin surrogate model. [Chemical structures, each labeled with observed (Obs.) and calculated (Calc.) pIC50 values.]

MOLECULAR DIVERSITY

The scientific literature is increasingly populated by books and articles that address a wide range of issues surrounding the field of molecular diversity.5,6 Topics include the choice of which molecular descriptors to employ, the proper means of selecting diversity, and reducing the dimensionality of compound space. Since the way in which each compound is represented ultimately limits the success of all subsequent procedures, we begin the discussion with this fundamental and critical issue.

A number of important investigations have focused on the selection of molecular descriptors that are able to distinguish compounds on the basis of biological activity. The datasets analyzed have usually been comprised of
`compounds that are selected either according to known
`activity against one or more targets or
`from libraries
`generated as a result of SAR studies around active hits. These
`datasets certainly encompass a great deal of bioactive
`diversity, and they are appropriate for demonstrating various
`properties of molecular descriptors. However, since the
`collections have tended to be biased toward the targets of
`
`tors which are able to distinguish actives from inactives in
`small, biased datasets do much less well when applied to
HTS data from larger, unbiased libraries. Since we are
ultimately concerned with the discovery of new drugs in a
practical setting, it is important to consider molecular
diversity in the context of real libraries, where only a tiny
fraction of the compounds will exhibit high activity.
`
Figure 3. A counterintuitive example of structural diversity and its relationship to binding affinity for a member of the reference panel. Here, binding affinity is indicated by the IC50 value from a competitive binding assay. [Chemical structures labeled with IC50 values; one legible label reads 16 nM.]
`
diversity can sometimes seem to be at odds. For one of our reference proteins, IC50 values and structures for several high affinity compounds are shown. This series of nM hits clearly exhibits a considerable amount of structural diversity, with an average pairwise Tanimoto similarity of only 0.421 based on the ISIS MOLSKEYS. Though they all look different to the eye of a chemist, they all in fact bind quite strongly, implying a high degree of similarity from the protein’s perspective. Paradoxically, when small structural modifications are made to one of these compounds (average similarity to hit = 0.873), the affinity drops by more than five orders of magnitude. From the perspective of the protein, then, these compounds are all quite different from the 19 nM hit they so closely resemble structurally.

Examples such as these are not difficult to find after a library has been assayed against several proteins. Nevertheless, it is very perplexing that high affinity can be preserved while leaping across chemical families, yet it can be destroyed altogether by one small structural change. The generally accepted explanation is based on a pharmacophore concept, i.e., that all of the high affinity compounds must in fact possess the correct combination and orientation of groups to interact favorably with the binding site on the protein. One small change can remove one of these essential elements and a great deal of the affinity.

Once a compound with these essential features is identified, it is often possible to explore the chemical space around it and develop a QSAR that satisfactorily explains the variations in affinity. In general, though, the relationship is only valid for compounds sufficiently similar to those used in developing the model. This is because the descriptors are frequently just measuring small differences among compounds that all share some common backbone or scaffold which provides an appropriate framework for attachment of groups that can lead to high affinity. In effect, it is holding constant a myriad of other factors that govern affinity to a particular binding site. If this active backbone is replaced, then the descriptors may not carry over to a QSAR built around the new backbone.

This discussion is directly relevant to the issue of molecular diversity, because many structural parameters, while extremely effective in explaining differences in biological activity among compounds that are referenced to some restricted template, do not necessarily give meaningful comparisons of activity among things that have gross structural differences. It is very difficult to conceive of structural descriptors that can reliably predict activity for a given target across the range of compounds present in a typical corporate library. If there were such descriptors, then HTS would not be nearly so widespread as it is. The lack of library-wide QSARs is a result of our inability to model the range of thermodynamic processes involved in binding between a protein and an essentially unrestricted collection of compounds. Yet in order to select subsets of compounds that exhibit significantly more bioactive diversity than random sampling, one needs descriptors which correlate with activity in this global sense. The implication is that unless one has access to such parameters, then focusing on many of the finer points of molecular diversity may have a limited impact on the amount of bioactive diversity present in the compounds selected.
`
And, of course, there are certain types of structural features in small molecules that should generally be avoided because they are associated with undesirable chemical and pharmacokinetic properties. Overall design issues can also have an impact, as there are extreme cases where a diversity algorithm does not give reasonable coverage of the space it samples. But in the absence of library-wide knowledge of biological activity, there is little reason to believe that one subset of compounds selected in an unbiased fashion will be significantly more prolific than another when it comes to generating leads from large libraries that have no particular bias toward the target of interest.
`
`DIVERSITY ALGORITHMS
`
We now turn our attention to the issue of algorithms for compound selection and focus on two simple but significantly different diversity designs: one which selects compounds that are distributed in an approximately uniform fashion throughout space and one which samples compounds only from the edges of space. These two subsetting approaches, which we shall refer to as spread and edge, were chosen to see whether or not drastic differences in design really make any difference, and also whether deliberately selecting outliers, i.e., edge compounds, would have a deleterious effect on rational sampling.

In spread design, the goal is to select a subset of compounds S that fills the chosen descriptor space with minimal redundancy. The approach we adopt involves picking subset members that are as far away as possible, on average, from their nearest neighbors. Accordingly, the objective function to maximize for spread diversity was defined as
`
$$O_{\mathrm{spread}} = \sum_{i \in S} \min\left( d_{ij} : j \in S,\ j \neq i \right) \qquad (1)$$
`
In the case of affinity fingerprints, $d_{ij}$ represents the Euclidean distance between compounds i and j. For binary structure keys, $d_{ij}$ is simply one minus the Tanimoto similarity, which, for a string of n bits, is given by

$$d_{ij} = 1 - \frac{\sum_{k=1}^{n} \mathrm{bit}_{ik}\,\mathrm{bit}_{jk}}{\sum_{k=1}^{n} \left( \mathrm{bit}_{ik}^{2} + \mathrm{bit}_{jk}^{2} - \mathrm{bit}_{ik}\,\mathrm{bit}_{jk} \right)} \qquad (2)$$
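
The two distance definitions and the spread objective of eq 1 can be written down directly. The sketch below is illustrative only: the function names are hypothetical, bit strings are assumed to be equal-length sequences of 0/1 integers, and the handling of two all-zero bit strings is an added assumption that the text does not address.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two affinity fingerprints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def tanimoto_distance(bits_i, bits_j):
    """One minus the Tanimoto similarity of two 0/1 sequences (eq 2)."""
    both = sum(x * y for x, y in zip(bits_i, bits_j))
    denom = sum(x * x + y * y - x * y for x, y in zip(bits_i, bits_j))
    if denom == 0:
        # Assumption: two all-zero bit strings are treated as identical.
        return 0.0
    return 1.0 - both / denom

def spread_objective(subset, dist):
    """Eq 1: sum over members of the distance to the nearest other member."""
    return sum(
        min(dist(a, b) for j, b in enumerate(subset) if j != i)
        for i, a in enumerate(subset)
    )
```

For example, `spread_objective(fingerprints, euclidean)` scores a candidate subset in affinity fingerprint space, while passing `tanimoto_distance` scores the same subset in structural key space.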
`
A simple stochastic procedure is used to maximize the objective function. Starting with a randomly chosen subset, the two compounds with the smallest pairwise distance are identified. Of those two, the one which is closer to some other compound in S is flagged for ejection. This flagged compound is exchanged for one that is outside of S if the exchange will bring about an overall increase in the objective function. A series of these pairwise exchanges is made until no further increase in $O_{\mathrm{spread}}$ can be achieved. At this point, a new random subset is selected, and the procedure is repeated. After several random restarts, the collection of compounds with the highest associated objective function is retained.
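
The exchange procedure just described can be sketched as follows. This is one reading of the text, not the authors' implementation: compounds are assumed to be coordinate tuples compared by Euclidean distance, the replacement is chosen as the outside compound giving the largest increase (the text does not say how the replacement is picked), and the subset size is assumed to be at least 3. All names are hypothetical.

```python
import math
import random

def spread_objective(points, subset):
    """Eq 1 evaluated on a list of indices into points."""
    return sum(
        min(math.dist(points[i], points[j]) for j in subset if j != i)
        for i in subset
    )

def flag_for_ejection(points, subset):
    """Find the closest pair in S; flag the member nearer to a third member."""
    pairs = [(math.dist(points[a], points[b]), a, b)
             for k, a in enumerate(subset) for b in subset[k + 1:]]
    _, a, b = min(pairs)

    def nearest_other(x, skip):
        return min(math.dist(points[x], points[j])
                   for j in subset if j not in (x, skip))

    return a if nearest_other(a, b) <= nearest_other(b, a) else b

def select_spread(points, size, restarts=5, seed=0):
    """Random restarts of pairwise exchanges; returns (indices, objective)."""
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    for _ in range(restarts):
        subset = rng.sample(range(len(points)), size)
        score = spread_objective(points, subset)
        improved = True
        while improved:
            improved = False
            out = flag_for_ejection(points, subset)
            kept = [i for i in subset if i != out]
            outside = [i for i in range(len(points)) if i not in subset]
            for cand in outside:
                trial = kept + [cand]
                trial_score = spread_objective(points, trial)
                if trial_score > score:  # keep the best improving exchange
                    subset, score, improved = trial, trial_score, True
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score
```

Because every accepted exchange strictly increases the objective, each restart terminates at a local optimum, and the restarts supply the sampling of different starting configurations.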
`
Figure 4. Illustration of spread and edge diversity designs using a two-dimensional distribution of Gaussian random data points. Cluster centroids from hierarchical agglomerative clustering are included for comparison. [Scatter plot panels, including Edge Design and Clustering; axes: Variable 1 and Variable 2, with ticks from -1.0 to 1.0.]
`
developed, and one stochastic approach with a mechanism for local optimization can be made to perform about as well as another. The effectiveness of these “global” methods in locating satisfactory optima lies not so much in the subtle ways in which they go from totally arbitrary to locally optimal, but more in the opportunity that they afford to sample a wide range of randomly generated configurations. In simulated annealing, for example, this aspect is controlled by the cooling schedule. In our algorithm, it is controlled by the number of random restarts.

While the spread design seeks to maximize the average distance between each subset member and its nearest neighbor, the edge design attempts to maximize the average distance between each subset member and all the remaining compounds in S,

$$O_{\mathrm{edge}} = \sum_{i,j \in S} \left[ d_{ij} - d_{\mathrm{av}} \left( d_{\mathrm{av}} / d_{ij} \right) \right] \qquad (3)$$

This expression contains a penalty term, $-1/d_{ij}$, that prevents any two highly similar compounds from being selected. The average pairwise distance $d_{\mathrm{av}}$ observed over the entire library is used to construct a reasonable scaling factor for the penalty term. It is of course expensive to compute $d_{\mathrm{av}}$ for extremely large libraries, but a randomly chosen subset of 1000 compounds is usually sufficient to give a reliable estimate of this quantity.

Using the same type of stochastic approach described earlier, a series of pairwise exchanges is made in order to increase the objective function. The only difference is that the compound flagged for ejection is the one which exhibits the smallest average distance to the other subset members.

Both of these diversity algorithms are extremely simple to implement, and their computational expenses scale only linearly with the size of the overall library. They do exhibit quadratic scaling with respect to the size of S, but this does not become a serious drawback unless very large subsets are desired. The intended use here is for testing in a low throughput mode, so typically only 50-100 compounds would be selected, and quadratic scaling is not an issue.

Figure 4 illustrates how these selection methods behave when applied to a set of 2000 synthetically generated points in 2-D space. Here, a subset of 50 points (enclosed by boxes) was selected using each diversity algorithm. For comparison, results from hierarchical agglomerative clustering with complete linkage are also included. In this case, 50 clusters were generated, and the point closest to each centroid was selected for the subset. Clustering bears some resemblance to spread in overall appearance, but the former is seen to be affected somewhat more by variations in the density of points. This is certainly not a criticism, but hierarchical agglomerative clustering does become prohibitively expensive for large libraries, regardless of the subset size. Note that the spread and edge subsets are quite different, so they should provide a good demonstration of the effect of diversity design on rational sampling.
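
For the edge design, the objective of eq 3, the sampled estimate of the average library distance, and the ejection rule might be sketched as below. The names are hypothetical, compounds are assumed to be distinct coordinate tuples (a zero distance would make the penalty term undefined), and the sum is taken over unordered pairs, which differs from a sum over ordered pairs only by a constant factor.

```python
import math
import random

def average_distance(points, sample_size=1000, seed=0):
    """Estimate d_av from a random sample of the library."""
    rng = random.Random(seed)
    if len(points) <= sample_size:
        sample = list(points)
    else:
        sample = rng.sample(points, sample_size)
    dists = [math.dist(a, b)
             for k, a in enumerate(sample) for b in sample[k + 1:]]
    return sum(dists) / len(dists)

def edge_objective(points, subset, d_av):
    """Eq 3 summed over unordered pairs i < j of subset indices."""
    total = 0.0
    for k, i in enumerate(subset):
        for j in subset[k + 1:]:
            d = math.dist(points[i], points[j])  # assumed nonzero
            total += d - d_av * (d_av / d)
    return total

def flag_for_ejection_edge(points, subset):
    """Member with the smallest average distance to the other members."""
    def mean_dist(i):
        others = [j for j in subset if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)
    return min(subset, key=mean_dist)
```

Since each pair contributes d minus d_av squared over d, which grows with d, widely separated pairs raise the objective and near-duplicates are penalized heavily.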
`
`
`
Figure 5. Affinity space representation of bioactively diverse (P1, P2) and structurally diverse (MOLSKEYS) subsets selected from a library of 8000 compounds. P3 is a protein which is statistically related to the affinity fingerprint proteins P1 and P2. [Panels: Spread (P1,P2), Edge (P1,P2), Spread (MOLSKEYS), Edge (MOLSKEYS); axes: P1 Affinity and P2 Affinity; markers: Subset Member, P3 Active.]

BIOACTIVE DIVERSITY EXAMPLES
`
`As stated earlier, when one has access to molecular
`descriptors that correlate with target activity across an entire
`library, then the subtle issues surrounding diversity become
`more relevant. Distances in descriptor space then have a
`direct bearing on the distribution of activities, so it is possible
`to control, to some extent, the bioactive diversity of a subset.
`Figure 5 illustrates the results of a diversity exercise carried
`out on 8000 compounds from our library. Here, P1 and P2
`are proteins that comprise a 2-D affinity fingerprint, and P3
`is a third “target” protein that exhibits a multicollinearity of
`R = 0.75 with P1 and P2. This is a stronger relationship
`than would normally exist between most targets and the
`reference panel, but these proteins were selected to provide
`a clear demonstration of the ability of bioactively relevant
`descriptors to select a greater number of compounds with
`high activity against the target.
For this exercise, active compounds on P3 were defined to be those with IC50 values below 1 μM. Using both edge and spread designs, subsets of 50 compounds were selected in the 2-D affinity fingerprint space (P1,P2), and in the 166-bit structural fingerprint space of the ISIS MOLSKEYS. All
`
Table 1. Summary of Average Properties of 1000 Randomly Selected Library Compounds

property                    av value
MW                          278
log P                       2.59
no. of rings                2.18
diameter (in bonds)         9.90
no. of H-bond donors        1.52
no. of H-bond acceptors     5.19
no. of hydrogens            15.3
no. of carbons              13.9
no. of nitrogens            1.91
no. of oxygens              2.52
`
`The edge algorithm applied with P1 and P2 locates seven
`of ten P3 active compounds, while the spread algorithm finds
`five of ten. Note that the success of the spread algorithm is
`essentially a consequence of its tendency to sample some
`compounds from the edge. When the MOLSKEYS are used
`to select compounds, the edge technique generates a subset
`which appears to be slightly more diverse in P1,P2 space
`than when spread is used, and the MOLSKEYS edge design
`selects one P3 active compound.
`It should be apparent from Figure 5 that the majority of
`
Figure 6. Frequency distributions of binding affinities are used to illustrate the bioactive diversity, with respect to protein P3, of compound subsets selected according to diversity in affinity fingerprint space (P1, P2) and structural space (MOLSKEYS). [Histogram panels, including Spread (MOLSKEYS) and Edge (MOLSKEYS); each plots Frequency against P3 Binding Affinity for the Diverse subset overlaid with the Random subset.]
`
but it illustrates an underlying reason why affinity fingerprints can be so powerful. Although the compounds with high affinity for a particular target may not always cluster in one small region of affinity fingerprint space, they do tend to bind strongly to at least one member of the reference panel and are thus distinct from the concentrated mass of compounds that have low affinity. This sort of separation is difficult if not impossible to achieve with ordinary structural descriptors, simply because the low affinity compounds are so diverse structurally.

Continuing with this example, Figure 6 summarizes the frequency distributions of P3 binding affinities for subsets of compounds selected as before. In these cases, however, the subset size was increased from 50 to 200 so as to obtain smoother statistics. For comparison, a single set of 200 compounds was selected randomly, and its distribution is overlaid with that of each diverse subset.

Compared to random selection, both of the P1,P2 subsets show a significant shift in the distribution toward higher affinities. The edge design actually exhibits a near uniform distribution over about three orders of magnitude in concentration. Intuitively speaking, this result is perhaps more along the lines of what would be expected with spread diversity. However, it must be remembered that the natural distribution of affinities is essentially a skewed bell shape, so the only way to achieve a uniformly distributed sample is to significantly bias the selection toward higher affinities.

By contrast, structurally diverse subsets selected using the MOLSKEYS appear to offer little or no advantage over random sampling as far as P3 is concerned. The spread design compounds are distributed very much like the random subset, with the only difference being that random selection appears to result in slightly higher average affinity. The edge design differs more significantly from random than does spread, but these differences are confined primarily to the region of low affinity compounds.

RATIONAL SAMPLING

After a diverse subset of compounds has been screened against a target, the activity data obtained from this training set may be used to select focused libraries for subsequent examination. Ideally, a series of small, focused blocks of compounds is screened, with new information being incorporated at the end of each block in order to fine tune the search for active compounds. This rational sampling approach to lead generation and optimization is summarized in Figure 7.

Figure 7. Flowchart summary of rational sampling methodology. [Flowchart boxes: Compound Library, Diversity Algorithm, Training Set, Determine Activity against Target, Refine Activity Model, Select Focused Library.]

Figure 8. Illustration of the iterative search procedure used in nearest neighbors rational sampling.

Each focused block is nothing more than a collection of compounds which are expected, or at least hoped, to be active. Upon screening, most of these compounds will usually turn out to be inactive, but with proper design of the focused block, the compounds should show a higher level of activity, on average, than the large library from which
they were selected. The choice of compounds may be based on an outright model of activity that can be applied to the remaining unscreened portion of the library, or it may be based on an implicit model, which assumes a neighborhood behavior of activity. In this paper, we focus on the latter approach and employ a straightforward nearest neighbors search around lead compounds in order to locate additional high activity compounds.

Figure 8 illustrates the iterative search procedure used in nearest neighbors rational sampling. Initial leads are simply the handful of most active compounds uncovered in the training set screen, and they may or may not be actives. However, each time an active compound is discovered in the focused screen, it is incorporated into the list of leads. A search that cycles repetitively through all the leads is used so that the focused library is not confined to one region of compound space, and so that a disproportionate amount of time is not spent screening analogues of a lead that has no actives nearby.
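
A rough sketch of this cyclic search is given below. It is an interpretation of the description rather than the authors' code: `assay` and `is_active` are hypothetical callbacks standing in for the screening experiment and the activity threshold, compounds are coordinate tuples, and each turn screens only the single nearest unscreened neighbor of the current lead.

```python
import math

def focused_search(points, leads, screened, assay, is_active, budget):
    """Cycle through the leads, screening each lead's nearest unscreened neighbor."""
    leads = list(leads)
    screened = set(screened)
    results = {}
    turn = 0
    while len(results) < budget and len(screened) < len(points):
        lead = leads[turn % len(leads)]  # round-robin over current leads
        turn += 1
        candidates = [i for i in range(len(points)) if i not in screened]
        pick = min(candidates, key=lambda i: math.dist(points[lead], points[i]))
        screened.add(pick)
        results[pick] = assay(pick)
        if is_active(results[pick]):
            leads.append(pick)  # newly found actives join the list of leads
    return results, leads
```

Cycling through the leads in turn keeps the screened block from collapsing onto the neighborhood of a single lead, which is the stated purpose of the repetitive cycle.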

Screening in blocks not only allows the search to be expanded around new leads but also affords the opportunity to determine which descriptors are contributing relevant information and which are not. Distances in compound space should reflect, as much as possible, the relative positions of compounds in activity space. One way of accomplishing this is to use what we call activity-biased scaling. A given descriptor $x_k$ is weighted according to how strongly it has been observed to correlate with activity over the set of compounds that has already been screened,

$$x_k \rightarrow |r_k|\, x_k \qquad (4)$$
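
Activity-biased scaling can be illustrated with a few lines of Python. The sketch assumes that the weight for descriptor k is the absolute Pearson correlation between that descriptor and the measured activity over the screened compounds; the surviving text does not define the weight explicitly, so this reading, the function names, and the zero weight assigned to constant columns are all assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant column carries no information
    return sxy / math.sqrt(sxx * syy)

def activity_weights(descriptors, activities):
    """One weight per descriptor column, computed over screened compounds."""
    columns = list(zip(*descriptors))
    return [abs(pearson(col, activities)) for col in columns]

def scale(descriptor_row, weights):
    """Eq 4: multiply each descriptor value by its weight."""
    return [w * x for w, x in zip(weights, descriptor_row)]
```

Distances computed on the scaled descriptors then emphasize the axes that have tracked activity so far, and the weights can be recomputed after each screened block.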
`
S-transferase P1-1 (GST P1-1) and papain. The library contained 20 000 compounds, some properties of which are summarized in Table 1. These compounds were obtained from various vendors, through collaborations with other companies, and from synthetic work for internal projects. A more detailed description of the types of compounds in our library is given in ref 3.
`For each target, the 1